Lab 11 - Two sample proportion tests

The class video is attached here so that you can watch my lecture again when you prepare for the exams.

  • If you have questions about my lecture, please use the comment section at the bottom of this document.

Two sample proportions test

We can use a z-test to compare the population proportions of two different groups.

COVID 19 data

As of April 17, 2020, the USA unfortunately has the largest number of deaths from COVID 19 in the world, and Italy has the second largest. Here is the information from the Worldometers website:

              USA    Italy    Total
Death      36,922   22,745   59,667
Recovered  59,158   42,727  101,885
Total      96,080   65,472  161,552

Hypothesis

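Our goal is to determine whether the COVID 19 death rates in the two countries are the same, at significance level 0.001. Here the death rate is the proportion of deaths among cases with a known outcome (deaths plus recoveries), as in the table above. Thus, we can set our hypotheses as follows: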

\[ H_0: p_1 - p_2 = 0 \quad \text{vs.} \quad H_A: p_1 - p_2 \ne 0 \] where \(p_1\) represents the death rate of COVID 19 in the USA and \(p_2\) represents the death rate of COVID 19 in Italy.

Estimate of population proportion

The estimates of \(p_1\) and \(p_2\) are

\[ \begin{align} \hat{p}_{1} &= \frac{36,922}{96,080} = 0.3843 \\ \hat{p}_{2} &= \frac{22,745}{65,472} = 0.3474 \end{align} \]

Standard Error

We have learned that the standard deviation of the difference between the two sample proportions is as follows: \[ \sqrt{\frac{p_{1}\left(1-p_{1}\right)}{n_{1}}+\frac{p_{2}\left(1-p_{2}\right)}{n_{2}}} \]

However, under the null hypothesis, we assume that the death rates in both countries are the same, say \(p\). Then the standard deviation above becomes:

\[ \sqrt{p\left(1-p\right)\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)} \] Now, how can we estimate \(p\)? We should use all the available information about the death rate of COVID 19, that is, pool the data from both countries. Thus, the estimate of \(p\) can be obtained by

\[ \hat{p} = \frac{\text{# of deaths from both countries}}{\text{# of total observations}} = \frac{59,667}{161,552} = 0.3693 \]

Thus, the standard error will be as follows:

\[ \begin{align} SE & = \sqrt{\hat{p}\left(1-\hat{p}\right)\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)} \\ & = \sqrt{0.3693\left(0.6307\right)\left(\frac{1}{96,080}+\frac{1}{65,472}\right)} \\ & = 0.002446 \end{align} \]

Test statistic

Now, the test statistic \(z\) can be obtained as follows: \[ z=\frac{\hat{p}_{1}-\hat{p}_{2}}{SE}=\frac{0.3843-0.3474}{0.002446}=15.085 \]
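The hand calculation above can be cross-checked outside SAS. The following is a minimal Python sketch using only the standard library; the variable names are illustrative and not part of the lab. It carries full precision instead of the rounded intermediate values, so the last digits differ slightly from the hand calculation.

```python
import math

# Counts from the table above: deaths and total closed cases per country.
deaths_usa, n_usa = 36922, 96080
deaths_italy, n_italy = 22745, 65472

# Sample proportions for each country.
p1_hat = deaths_usa / n_usa
p2_hat = deaths_italy / n_italy

# Pooled estimate of the common proportion under H0.
p_pool = (deaths_usa + deaths_italy) / (n_usa + n_italy)

# Standard error under H0 and the resulting z statistic.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_usa + 1 / n_italy))
z = (p1_hat - p2_hat) / se

print(f"p1_hat = {p1_hat:.4f}, p2_hat = {p2_hat:.4f}, p_pool = {p_pool:.4f}")
print(f"SE = {se:.6f}, z = {z:.4f}")
```

Without intermediate rounding, z comes out near 15.08, which should agree in magnitude with the Z value in the SAS output later in this lab.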

p-value

We can determine the p-value based on the form of the test. See here if you don’t remember. Since our alternative hypothesis is two-sided, the p-value will be \[ \begin{align} \text{p-value} & = 2 P(Z \ge |z|) \\ & = 2 P(Z \ge 15.085) < 0.0001 \end{align} \] This is far below our significance level of 0.001.
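If no normal table is at hand, the two-sided p-value can be approximated numerically. The sketch below is an illustration in Python, not part of the SAS workflow; the helper name `two_sided_p_value` is hypothetical. It relies on the identity \(P(Z \ge x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})\) for a standard normal \(Z\).

```python
import math

def two_sided_p_value(z):
    """Two-sided standard normal p-value, 2 * P(Z >= |z|)."""
    # P(Z >= x) = 0.5 * erfc(x / sqrt(2)), so doubling it gives
    # erfc(|z| / sqrt(2)) for the two-sided test.
    return math.erfc(abs(z) / math.sqrt(2))

z = 15.085  # test statistic from the hand calculation above
p_value = two_sided_p_value(z)

print(f"two-sided p-value = {p_value:.3e}")
print("reject H0 at alpha = 0.001:", p_value < 0.001)
```

As a sanity check, `two_sided_p_value(1.96)` should come out close to the familiar 0.05.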

SAS code

We can load the data into the SAS system as follows:

data covid19 ;
input country $ status $ count ;
datalines ;
USA Death 36922
USA Recovered 59158
Italy Death 22745
Italy Recovered 42727
;
run ;

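To do the two sample proportion test, we can use the riskdiff(equal var = null) option. In SAS, binomial proportions are called “risks,” so a “risk difference” is a difference in proportions. Here, equal requests the test that the risk difference is zero, and var = null tells SAS to compute the standard error under the null hypothesis, as we did by hand. Thus, the following SAS code will do the test: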

proc freq data = covid19 ;
  weight count ;
  tables country * status / riskdiff(equal var = null);
run ;

Result

The FREQ Procedure

                                  Table of country by status

                              country     status

                              Frequency|
                              Percent  |
                              Row Pct  |
                              Col Pct  |Death   |Recovere|  Total
                              ---------+--------+--------+
                              Italy    |  22745 |  42727 |  65472
                                       |  14.08 |  26.45 |  40.53
                                       |  34.74 |  65.26 |
                                       |  38.12 |  41.94 |
                              ---------+--------+--------+
                              USA      |  36922 |  59158 |  96080
                                       |  22.85 |  36.62 |  59.47
                                       |  38.43 |  61.57 |
                                       |  61.88 |  58.06 |
                              ---------+--------+--------+
                              Total       59667   101885   161552
                                          36.93    63.07   100.00

Among the output tables, you will find the following result:

                                     Risk Difference Test
                                H0: P1 - P2 = 0    Wald Method

                                Risk Difference        -0.0369
                                ASE (H0)                0.0024
                                Z                     -15.0803
                                One-sided Pr <  Z       <.0001
                                Two-sided Pr > |Z|      <.0001

                                  Column 1 (status = Death)

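Since the two-sided p-value (Two-sided Pr > |Z|) is less than 0.0001, which is below our significance level of 0.001, we can reject the null hypothesis. Thus, statistically, the COVID 19 death rates differ between the two countries. Note that SAS lists Italy as the first row, so it computes the Italy minus USA difference; that is why its Z is negative while our hand-calculated z is positive. The two-sided test is unaffected by the sign. Why do you think the death rates differ?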

Two sample proportion test - The Chi-square test

Using the Chi-square test, we can also do the same test: comparing the population proportions of two different groups. Let us do this in SAS using the same data.

              USA    Italy    Total
Death      36,922   22,745   59,667
Recovered  59,158   42,727  101,885
Total      96,080   65,472  161,552

Here are the steps for the Chi-square test.

Step 1: Draw the 2-way table

The 2-way table looks like the one above. For some problems, however, you will need to construct this table yourself from the problem statement.

Step 2: Calculate expected counts for each cell

\[ \text{Expected count} = \frac{\text{row total}\times \text{column total}}{\text{table total}} \]

We can produce the 2-way table and the expected counts at the same time in SAS using the following code:

data covid19 ;
input country $ status $ count ;
datalines ;
USA Death 36922
USA Recovered 59158
Italy Death 22745
Italy Recovered 42727
;
run ;

proc freq data = covid19 ;
  weight count ;
  tables country * status / expected ;
run ;

Result


                                  Table of country by status

                              country     status

                              Frequency|
                              Expected |
                              Percent  |
                              Row Pct  |
                              Col Pct  |Death   |Recovere|  Total
                              ---------+--------+--------+
                              Italy    |  22745 |  42727 |  65472
                                       |  24181 |  41291 |
                                       |  14.08 |  26.45 |  40.53
                                       |  34.74 |  65.26 |
                                       |  38.12 |  41.94 |
                              ---------+--------+--------+
                              USA      |  36922 |  59158 |  96080
                                       |  35486 |  60594 |
                                       |  22.85 |  36.62 |  59.47
                                       |  38.43 |  61.57 |
                                       |  61.88 |  58.06 |
                              ---------+--------+--------+
                              Total       59667   101885   161552
                                          36.93    63.07   100.00

The first and second numbers in each cell are the observed count and the expected count. Note that the expected counts are displayed rounded to whole numbers, so they are approximate. Since the expected counts in all cells are larger than \(5\), we can use the Chi-square test.
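To see where the expected counts come from, the formula from Step 2 can be applied cell by cell. The following Python sketch is only an illustration alongside the SAS output (the dictionary layout and names are assumptions made for this example); it also checks the "all expected counts larger than 5" condition.

```python
# Observed counts keyed by (country, status), from the 2-way table above.
observed = {
    ("Italy", "Death"): 22745, ("Italy", "Recovered"): 42727,
    ("USA", "Death"): 36922, ("USA", "Recovered"): 59158,
}

# Totals per country, totals per status, and the table total.
country_totals, status_totals = {}, {}
for (country, status), count in observed.items():
    country_totals[country] = country_totals.get(country, 0) + count
    status_totals[status] = status_totals.get(status, 0) + count
table_total = sum(observed.values())

# Expected count = (one margin total * the other margin total) / table total.
expected = {
    (country, status): country_totals[country] * status_totals[status] / table_total
    for (country, status) in observed
}

for cell, value in expected.items():
    print(cell, round(value, 2))
print("all expected counts > 5:", all(v > 5 for v in expected.values()))
```

Rounded to whole numbers, these values should match the second number in each cell of the SAS table.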

Step 3: Calculate Chi-square statistic

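\[ \chi^2 = \sum^{rc}_{i=1} \frac{(obs_i - exp_i)^2}{exp_i} \] where \(r\) and \(c\) are the numbers of rows and columns in the table, so the sum runs over all \(rc\) cells.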

In our problem,

\[ \begin{align} \chi^2 \approx {} & \frac{(22745 - 24181)^2}{24181} + \frac{(42727 - 41291)^2}{41291} \\ & + \frac{(36922 - 35486)^2}{35486} + \frac{(59158 - 60594)^2}{60594} = 227.3596 \end{align} \]

The obtained Chi-square statistic follows a Chi-square distribution with degrees of freedom \[ d.f. = (\text{# of rows in the table} - 1) \times (\text{# of columns in the table} - 1) \] In our case, it follows a Chi-square distribution with df \((2-1)\times(2-1) = 1\).

Step 4: Calculate p-value

The p-value can be obtained from the Chi-square table or from SAS:

proc freq data = covid19 ;
  weight count ;
  tables country * status / chisq;
run ;

Result


                           Statistics for Table of country by status

                    Statistic                     DF       Value      Prob
                    ------------------------------------------------------
                    Chi-Square                     1    227.4160    <.0001
                    Likelihood Ratio Chi-Square    1    228.0910    <.0001
                    Continuity Adj. Chi-Square     1    227.2577    <.0001
                    Mantel-Haenszel Chi-Square     1    227.4146    <.0001
                    Phi Coefficient                      -0.0375
                    Contingency Coefficient               0.0375
                    Cramer's V                           -0.0375


                                     Fisher's Exact Test
                              ----------------------------------
                              Cell (1,1) Frequency (F)     22745
                              Left-sided Pr <= F          <.0001
                              Right-sided Pr >= F         1.0000

                              Table Probability (P)       <.0001
                              Two-sided Pr <= P           <.0001

                                     Sample Size = 161552

Note that SAS gives almost the same Chi-square statistic as the one we calculated: 227.4160. The small difference arises because our hand calculation used rounded expected counts. Verify this number using your calculator.
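As an alternative to a calculator, the statistic can be recomputed with unrounded expected counts. This is a minimal Python sketch, offered as an illustration and not as the lab's method; the list layout (rows Italy and USA, columns Death and Recovered) is an assumption made for this example.

```python
# Observed 2x2 table: rows are Italy and USA, columns are Death and Recovered.
observed = [[22745, 42727],
            [36922, 59158]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Sum (obs - exp)^2 / exp over all cells, using unrounded expected counts.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / table_total
        chi_square += (obs - exp) ** 2 / exp

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi_square:.4f}, df = {df}")
```

The result should be close to the 227.4160 that SAS reports. For a 2x2 table, this statistic is also the square of the z statistic from the two sample proportion test: \(15.0803^2 \approx 227.42\).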

Note that if any of the expected counts is less than 5, you should use Fisher’s exact test instead, which gives a more accurate p-value when the sample is small.

In our case, it is safe to use the Chi-square p-value, which is less than 0.0001. Thus, we can reject the null hypothesis again.
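For a Chi-square distribution with 1 degree of freedom, the p-value can also be approximated without a table, because such a variable is the square of a standard normal \(Z\). The sketch below is a Python illustration under that assumption (df = 1 only); the helper name `chi_square_p_value_df1` is hypothetical.

```python
import math

def chi_square_p_value_df1(chi_square):
    """Upper-tail p-value P(X >= chi_square) for a chi-square with 1 df."""
    # With 1 df, X = Z^2 for a standard normal Z, so
    # P(X >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2))

p_value = chi_square_p_value_df1(227.4160)  # statistic reported by SAS

print(f"p-value = {p_value:.3e}")
print("reject H0 at alpha = 0.001:", p_value < 0.001)
```

As a sanity check, the familiar critical value 3.841 should give a p-value close to 0.05.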
