Lab 12 - Chi-square and ANOVA

The class video is attached here so that you can watch the lecture again when you prepare for the exams.

  • If you have questions about my lecture, please use the comment section at the bottom of this document.

Chi-square test for more than two proportions

We can also use the chi-square test when we compare more than two categories.

The usage of Instagram

Let us assume we have the following data:

              Freshmen   Non-Freshmen   Total
Do not use          30             40      70
Often              180            180     360
Everyday           400            350     750
-----------   --------   ------------   -----
Total              610            570    1180

This table is about two variables: School year and Instagram usage.

The Instagram usage variable has three categories:

  • Do not use (D)
  • Often (O)
  • Everyday (E)

Our goal is to test whether the distribution of Instagram usage differs by school year. We can approach this by looking at the proportion of freshmen within each usage category:

              Proportion of freshmen
Do not use    30/70
Often         180/360
Everyday      400/750

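Let \(p\) denote the population proportion of freshmen. If the two variables are independent, \(p\) should be the same in every usage category. Thus, our null hypothesis looks like this:

\[ H_0: p_D = p_O = p_E \]

where \(p_D\), \(p_O\), and \(p_E\) are the proportions of freshmen among students who do not use Instagram, use it often, and use it every day. In other words, the proportion of freshmen does not change from one usage category to another. The alternative hypothesis is that at least one of these proportions differs from the others.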
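As a quick check before running the test, the sample proportions can be worked out by hand from the table above. The symbols \(\hat{p}_D\), \(\hat{p}_O\), \(\hat{p}_E\), and \(\hat{p}\) are notation introduced here for the sample proportions; they are not part of the original lab.

\[ \hat{p}_D = \frac{30}{70} \approx 0.429, \quad \hat{p}_O = \frac{180}{360} = 0.500, \quad \hat{p}_E = \frac{400}{750} \approx 0.533, \quad \hat{p} = \frac{610}{1180} \approx 0.517 \]

If \(H_0\) is true, each category proportion should be close to the pooled proportion \(\hat{p}\). The chi-square test measures whether the observed gaps are larger than sampling variation alone would explain.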

Data load in SAS

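/* schoolyear: F = freshmen, S = non-freshmen               */
/* frequency:  D = do not use, O = often, E = everyday      */
/* count:      number of students in each cell of the table */
data instagram ;
input schoolyear $ frequency $ count ;
datalines ;
F D 30
F O 180
F E 400
S D 40
S O 180
S E 350
;
run ;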

Chi-square test in SAS

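You can request the expected and chisq options in the same tables statement: expected prints the expected cell counts under independence, and chisq prints the chi-square test.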

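proc freq data = instagram ;
  weight count ;   /* each data line is a cell count, not a single student */
  /* rows = frequency, columns = schoolyear, matching the output shown in Result */
  tables frequency * schoolyear / expected chisq;
run ;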

Result

                                      The FREQ Procedure

                               Table of frequency by schoolyear

                              frequency     schoolyear

                              Frequency|
                              Expected |
                              Percent  |
                              Row Pct  |
                              Col Pct  |F       |S       |  Total
                              ---------+--------+--------+
                              D        |     30 |     40 |     70
                                       | 36.186 | 33.814 |
                                       |   2.54 |   3.39 |   5.93
                                       |  42.86 |  57.14 |
                                       |   4.92 |   7.02 |
                              ---------+--------+--------+
                              E        |    400 |    350 |    750
                                       | 387.71 | 362.29 |
                                       |  33.90 |  29.66 |  63.56
                                       |  53.33 |  46.67 |
                                       |  65.57 |  61.40 |
                              ---------+--------+--------+
                              O        |    180 |    180 |    360
                                       |  186.1 |  173.9 |
                                       |  15.25 |  15.25 |  30.51
                                       |  50.00 |  50.00 |
                                       |  29.51 |  31.58 |
                              ---------+--------+--------+
                              Total         610      570     1180
                                          51.69    48.31   100.00


                        Statistics for Table of frequency by schoolyear

                    Statistic                     DF       Value      Prob
                    ------------------------------------------------------
                    Chi-Square                     2      3.4099    0.1818
                    Likelihood Ratio Chi-Square    2      3.4131    0.1815
                    Mantel-Haenszel Chi-Square     1      0.0001    0.9929
                    Phi Coefficient                       0.0538
                    Contingency Coefficient               0.0537
                    Cramer's V                            0.0538

                                      Sample Size = 1180

Q. We want to test whether the two variables are associated with each other. What is the chi-square test statistic?

Ans: 3.4099
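
To see where this value comes from, here is a worked version using the observed and expected counts in the output above. The symbols \(O\) and \(E\) for observed and expected counts are notation added here. The six terms are listed in output order (rows D, E, O; column F then S) and are rounded, so the sum matches the SAS value only up to rounding.

\[ E = \frac{\text{row total} \times \text{column total}}{n}, \qquad \text{e.g. } E_{D,F} = \frac{70 \times 610}{1180} \approx 36.186 \]

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \approx 1.058 + 1.132 + 0.389 + 0.417 + 0.200 + 0.214 \approx 3.410 \]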

Q. What is d.f.?

Ans: 2. The table has 3 rows and 2 columns, so d.f. = \((3-1) \times (2-1) = 2\).

Q. If we use \(\alpha = 0.05\), what is the decision?

Ans: The p-value is 0.1818, which is larger than the significance level 0.05. Thus, we cannot reject the null hypothesis: the data do not provide sufficient evidence that Instagram usage differs by school year.
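
For a chi-square distribution with 2 degrees of freedom, the upper-tail probability has a simple closed form, so the p-value in the output can be verified by hand. This shortcut applies only when d.f. = 2; for other degrees of freedom, rely on the SAS output or a table.

\[ P(\chi^2_2 > x) = e^{-x/2}, \qquad P(\chi^2_2 > 3.4099) = e^{-1.70495} \approx 0.1818 \]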

ANOVA: Analysis of Variance

Let us compare the population means of three groups.

Hot dogs data

This time, let us load the data directly from Prof. Cowles' website by passing the URL of the data file to SAS (see Data load). The SAS data set is named corndog in the code.

The hotdogs dataset contains data on the sodium and calories contained in each of 54 major hot dog brands.

The variables are:

  • type: Beef, Meat, or Poultry
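  • calories: calories per hot dog
  • sodium: sodium per hot dog

There are many other brands of hot dogs on the market besides those included in this dataset, so we treat these brands as samples from the three types. We are interested in determining whether the mean calories per hot dog is the same for all three types. That is, we test \(H_0: \mu_{\text{Beef}} = \mu_{\text{Meat}} = \mu_{\text{Poultry}}\) against the alternative that at least one mean differs.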

Data load

filename mydata url "http://homepage.divms.uiowa.edu/~kcowles/Datasets/hotdogs.dat";
data corndog;
infile mydata;
input type $ calories sodium ;
run ;

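Check the assumptions for ANOVA

  • We have \(I\) independent simple random samples, one from each population (here \(I = 3\) types).
  • Each population \(i\) has a normal distribution with unknown mean \(\mu_i\).
  • All of the populations have the same (unknown) standard deviation \(\sigma\).

The plots and summary statistics below let us check normality and compare the standard deviations across types. The by statement requires the data to be sorted by type, so we sort first.

proc sort data = corndog ;
by type ;
run ;

proc univariate plot data = corndog ;
var calories ;
by type ;
run ;

proc means data = corndog ;
var calories ;
by type ;
run ;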

ANOVA test in SAS

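proc anova data = corndog ;
class type ;                     /* grouping variable */
model calories = type ;          /* one-way ANOVA of calories on type */
means type / bon alpha = .05 ;   /* Bonferroni pairwise comparisons */
run ;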

Result


                                      The ANOVA Procedure

                                Dependent Variable: calories

                                              Sum of
      Source                      DF         Squares     Mean Square    F Value    Pr > F

      Model                        2     17692.19510      8846.09755      16.07    <.0001

      Error                       51     28067.13824       550.33604

      Corrected Total             53     45759.33333



                                      The ANOVA Procedure

                            Bonferroni (Dunn) t Tests for calories

 NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher
                 Type II error rate than Tukey's for all pairwise comparisons.


                               Alpha                        0.05
                               Error Degrees of Freedom       51
                               Error Mean Square         550.336
                               Critical Value of t       2.47551


                Comparisons significant at the 0.05 level are indicated by ***.


                                        Difference
                         type              Between     Simultaneous 95%
                      Comparison             Means    Confidence Limits

                   Meat    - Beef            1.856     -17.302   21.013
                   Meat    - Poultry        39.941      20.022   59.860  ***
                   Beef    - Meat           -1.856     -21.013   17.302
                   Beef    - Poultry        38.085      18.928   57.243  ***
                   Poultry - Meat          -39.941     -59.860  -20.022  ***
                   Poultry - Beef          -38.085     -57.243  -18.928  ***
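
The numbers in the ANOVA table can be checked by hand, which also shows how to read the output. The abbreviations MSM and MSE (mean squares for Model and Error) and the symbol \(t^*\) are notation added here for illustration.

\[ \text{MSM} = \frac{17692.19510}{2} = 8846.09755, \qquad \text{MSE} = \frac{28067.13824}{51} \approx 550.336, \qquad F = \frac{\text{MSM}}{\text{MSE}} \approx 16.07 \]

Because Pr > F is below 0.0001, we reject the hypothesis that the three types have the same mean calories at \(\alpha = 0.05\). The Bonferroni procedure then splits \(\alpha\) across the 3 pairwise comparisons, which is where the critical value in the output comes from:

\[ t^* = t_{51,\; 1 - 0.05/(2 \cdot 3)} \approx 2.4755 \]

The intervals for Meat - Poultry and Beef - Poultry exclude 0 (marked ***), while the interval for Meat - Beef contains 0. This suggests that poultry hot dogs have lower mean calories than beef or meat hot dogs, and that beef and meat cannot be distinguished in these data.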