# Lab 11 - Two sample proportion tests

The class video is attached here so that you can watch the lecture again when you prepare for the exams.

• If you have questions about my lecture, please use the comment section at the bottom of this document.

## Two sample proportions test

We can use a z-test to compare the population proportions of two different groups.

#### COVID 19 data

As of April 17, 2020, the USA unfortunately has the largest number of deaths from COVID 19 in the world, and Italy has the second largest. Here is the information from the Worldometers website:

| | USA | Italy | Total |
|---|---|---|---|
| Death | 36,922 | 22,745 | 59,667 |
| Recovered | 59,158 | 42,727 | 101,885 |
| Total | 96,080 | 65,472 | 161,552 |

#### Hypothesis

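Our goal is to determine whether the COVID 19 death rates of the two countries are the same, at significance level 0.001. Here the death rate is the proportion of deaths among closed cases (deaths plus recoveries) in the table above. Thus, we can set up our hypotheses as follows: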

$H_0: p_1 - p_2 = 0 \quad \text{vs.} \quad H_A: p_1 - p_2 \ne 0$ where $$p_1$$ represents the death rate of COVID 19 in the USA and $$p_2$$ represents the death rate of COVID 19 in Italy.

#### Estimate of population proportion

The estimates of $$p_1$$ and $$p_2$$ are

\begin{align} \hat{p}_1 & = \frac{36,922}{96,080} = 0.3843 \\ \hat{p}_2 & = \frac{22,745}{65,472} = 0.3474 \end{align}
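The two estimates can be cross-checked outside SAS. The following is a minimal Python sketch using only the counts from the table above; the helper name `sample_proportion` and the dictionary names are illustrative and not part of the lab.

```python
# Counts taken from the table above (closed cases = deaths + recoveries).
deaths = {"USA": 36922, "Italy": 22745}
closed_cases = {"USA": 96080, "Italy": 65472}


def sample_proportion(successes, n):
    """Return the sample proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return successes / n


p1_hat = sample_proportion(deaths["USA"], closed_cases["USA"])
p2_hat = sample_proportion(deaths["Italy"], closed_cases["Italy"])

print(f"p1_hat (USA)   = {p1_hat:.4f}")
print(f"p2_hat (Italy) = {p2_hat:.4f}")
```

Rounded to four decimals, the two values agree with the hand calculation.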

#### Standard Error

We have learned that the standard deviation of the difference of the proportions is as follows: $\sqrt{\frac{p_{1}\left(1-p_{1}\right)}{n_{1}}+\frac{p_{2}\left(1-p_{2}\right)}{n_{2}}}$

However, under the null hypothesis, we assume that the death rates in both countries are the same, say $$p$$. Then the standard deviation above becomes:

$\sqrt{p\left(1-p\right)\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)}$

Now, how can we estimate $$p$$? We should use all the information available for calculating the death rate of COVID 19, so we pool the two samples. Thus, the estimate of $$p$$ can be obtained by

$\hat{p} = \frac{\text{# of deaths from both countries}}{\text{# of total observations}} = \frac{59,667}{161,552} = 0.3693$

Thus, the standard error will be as follows:

\begin{align} SE & = \sqrt{\hat{p}\left(1-\hat{p}\right)\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)} \\ & = \sqrt{0.3693\left(0.6307\right)\left(\frac{1}{96,080}+\frac{1}{65,472}\right)} \\ & = 0.002446 \end{align}
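As a cross-check of the arithmetic outside SAS, the pooled estimate and the standard error under the null hypothesis can be sketched in Python. The function name `pooled_standard_error` is an illustrative choice, and only the standard library is assumed.

```python
import math


def pooled_standard_error(x1, n1, x2, n2):
    """Return (pooled proportion, SE of p1_hat - p2_hat under H0: p1 = p2)."""
    p_pool = (x1 + x2) / (n1 + n2)  # all deaths / all closed cases
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return p_pool, se


# USA: 36,922 deaths out of 96,080; Italy: 22,745 deaths out of 65,472.
p_pool, se = pooled_standard_error(36922, 96080, 22745, 65472)

print(f"pooled p_hat = {p_pool:.4f}")
print(f"SE           = {se:.6f}")
```

The function uses the unrounded pooled proportion, so the last printed digit can differ slightly from a calculation that starts from 0.3693.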

#### Test statistic

Now, the test statistic $$z$$ can be obtained as follows: $z=\frac{\hat{p}_{1}-\hat{p}_{2}}{SE}=\frac{0.3843-0.3474}{0.002446}=15.085$

#### p-value

We can determine the p-value based on the form of the test. See here if you don’t remember. Since our alternative hypothesis is two-sided, the p-value is

\begin{align} \text{p-value} & = 2 P(Z \ge |z|) \\ & = 2 P(Z \ge 15.085) < 0.0001 \end{align}
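The whole hand calculation, from the counts to the two-sided p-value, can be sketched in a few lines of Python as a check outside SAS. This is a minimal sketch: the function name `two_proportion_z_test` is hypothetical, and the p-value uses the identity $2P(Z \ge |z|) = \operatorname{erfc}(|z|/\sqrt{2})$ for a standard normal $Z$.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test for H0: p1 - p2 = 0 (two-sided).

    Returns the z statistic and the two-sided p-value.
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # 2 * P(Z >= |z|) written with the complementary error function,
    # which stays accurate far out in the tail.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(36922, 96080, 22745, 65472)
alpha = 0.001  # significance level used in this lab

print(f"z = {z:.3f}")
print(f"two-sided p-value = {p_value:.3e}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Because the function works with unrounded proportions, its z differs in the later decimals from the hand value, which starts from rounded inputs.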

### SAS code

We can load the data into the SAS system as follows:

data covid19 ;
input country $ status $ count ;
datalines ;
USA Death 36922
USA Recovered 59158
Italy Death 22745
Italy Recovered 42727
;
run ;

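To do the two sample proportion test, we can use the riskdiff(equal var = null) option of the tables statement. In SAS, binomial proportions are called “risks,” so a “risk difference” is a difference in proportions. Here, equal requests the test that the risk difference is zero, and var = null uses the variance computed under the null hypothesis, which corresponds to the pooled standard error above. Thus, the following SAS code will do the test: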

proc freq data = covid19 ;
weight count ;
tables country * status / riskdiff(equal var = null);
run ;

Result

The FREQ Procedure

Table of country by status

country     status

Frequency|
Percent  |
Row Pct  |
Col Pct  |Death   |Recovere|  Total
---------+--------+--------+
Italy    |  22745 |  42727 |  65472
         |  14.08 |  26.45 |  40.53
         |  34.74 |  65.26 |
         |  38.12 |  41.94 |
---------+--------+--------+
USA      |  36922 |  59158 |  96080
         |  22.85 |  36.62 |  59.47
         |  38.43 |  61.57 |
         |  61.88 |  58.06 |
---------+--------+--------+
Total       59667   101885   161552
            36.93    63.07   100.00

Among the tables you have, you can see the following result:

                                     Risk Difference Test
H0: P1 - P2 = 0    Wald Method

Risk Difference        -0.0369
ASE (H0)                0.0024
Z                     -15.0803
One-sided Pr <  Z       <.0001
Two-sided Pr > |Z|      <.0001

Column 1 (status = Death)


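Based on the two-sided p-value (Two-sided Pr > |Z| <.0001), which is below our significance level 0.001, we can reject the null hypothesis. Note that SAS computes the difference as row 1 minus row 2 (Italy minus USA), so its Z has the opposite sign of our hand calculation; the magnitude is the same up to rounding. Thus, statistically, the death rate of COVID 19 differs between the two countries. Why do you think that is?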

## Two sample proportion test - The Chi-square test

Using the Chi-square test, we can also do the same test: comparing the population proportions of two different groups. Let us do this in SAS using the same data.

| | USA | Italy | Total |
|---|---|---|---|
| Death | 36,922 | 22,745 | 59,667 |
| Recovered | 59,158 | 42,727 | 101,885 |
| Total | 96,080 | 65,472 | 161,552 |

Here are the steps for the Chi-square test.

Step 1: Draw the 2-way table

The 2-way table looks like the one above. For some problems, however, you need to build this table yourself from the information given in the problem.

Step 2: Calculate expected counts for each cell

$\text{Expected count} = \frac{\text{row total}\times \text{column total}}{\text{table total}}$
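To see the formula at work outside SAS, here is a minimal Python sketch that computes the expected count of every cell from the row and column totals. The nested-dictionary layout and the function name `expected_counts` are illustrative choices, not part of the lab.

```python
# Observed counts, laid out as country -> status -> count.
observed = {
    "Italy": {"Death": 22745, "Recovered": 42727},
    "USA": {"Death": 36922, "Recovered": 59158},
}


def expected_counts(table):
    """Expected count per cell: row total * column total / table total."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for col, count in cols.items():
            col_totals[col] = col_totals.get(col, 0) + count
    total = sum(row_totals.values())
    return {
        row: {col: row_totals[row] * col_totals[col] / total for col in col_totals}
        for row in table
    }


expected = expected_counts(observed)
for row, cols in expected.items():
    for col, value in cols.items():
        print(f"{row:6s} {col:10s} expected = {value:.2f}")

# Rule of thumb for the Chi-square test: every expected count is at least 5.
print(all(v >= 5 for cols in expected.values() for v in cols.values()))
```

The values are not whole numbers; rounding them to the nearest integer gives the expected counts used in the hand calculation.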

We can produce the 2-way table and the expected counts at the same time in SAS using the following code:

data covid19 ;
input country $ status $ count ;
datalines ;
USA Death 36922
USA Recovered 59158
Italy Death 22745
Italy Recovered 42727
;
run ;

proc freq data = covid19 ;
weight count ;
tables country * status / expected ;
run ;

Result


Table of country by status

country     status

Frequency|
Expected |
Percent  |
Row Pct  |
Col Pct  |Death   |Recovere|  Total
---------+--------+--------+
Italy    |  22745 |  42727 |  65472
         |  24181 |  41291 |
         |  14.08 |  26.45 |  40.53
         |  34.74 |  65.26 |
         |  38.12 |  41.94 |
---------+--------+--------+
USA      |  36922 |  59158 |  96080
         |  35486 |  60594 |
         |  22.85 |  36.62 |  59.47
         |  38.43 |  61.57 |
         |  61.88 |  58.06 |
---------+--------+--------+
Total       59667   101885   161552
            36.93    63.07   100.00


The first and second numbers in each cell are the observed count and the expected count. The expected counts are displayed rounded to whole numbers, so they are approximate. Since the expected counts in all cells are larger than $$5$$, we can use the Chi-square test.

Step 3: Calculate Chi-square statistic

$\chi^2 = \sum^{rc}_{i=1} \frac{(obs_i - exp_i)^2}{exp_i}$

In our problem,

\begin{align} \chi^2 \approx {} & \frac{(22745 - 24181)^2}{24181} + \frac{(42727 - 41291)^2}{41291} \\ & + \frac{(36922 - 35486)^2}{35486} + \frac{(59158 - 60594)^2}{60594} = 227.3596 \end{align}

Under the null hypothesis, this Chi-square statistic follows a Chi-square distribution with degrees of freedom $d.f. = (\text{# of rows in the table} - 1) \times (\text{# of columns in the table} - 1)$. In our case, it follows a Chi-square distribution with $$(2-1) \times (2-1) = 1$$ degree of freedom.
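As a check outside SAS, the statistic and its degrees of freedom can be computed from the observed counts alone. The Python sketch below uses the unrounded expected counts, so its result is slightly larger than the hand value obtained from rounded ones. The list-of-rows layout and the function name `chi_square_statistic` are illustrative assumptions.

```python
def chi_square_statistic(observed):
    """Pearson Chi-square statistic and degrees of freedom for an r x c table.

    `observed` is a list of rows, each a list of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total  # unrounded expected count
            chi2 += (obs - exp) ** 2 / exp

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Rows: Italy, USA. Columns: Death, Recovered.
observed = [
    [22745, 42727],
    [36922, 59158],
]

chi2, df = chi_square_statistic(observed)
print(f"chi-square = {chi2:.4f}, df = {df}")
```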

Step 4: Calculate p-value

The p-value can be obtained from the Chi-square table or from SAS:

proc freq data = covid19 ;
weight count ;
tables country * status / chisq;
run ;

Result


Statistics for Table of country by status

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1    227.4160    <.0001
Likelihood Ratio Chi-Square    1    228.0910    <.0001
Continuity Adj. Chi-Square     1    227.2577    <.0001
Mantel-Haenszel Chi-Square     1    227.4146    <.0001
Phi Coefficient                      -0.0375
Contingency Coefficient               0.0375
Cramer's V                           -0.0375

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)     22745
Left-sided Pr <= F          <.0001
Right-sided Pr >= F         1.0000

Table Probability (P)       <.0001
Two-sided Pr <= P           <.0001

Sample Size = 161552


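Note that SAS gives almost the same Chi-square statistic as the one we calculated by hand: 227.4160 versus 227.3596. The small gap comes from rounding the expected counts in the hand calculation. Verify this number using your calculator.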
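One way to see why the two tests agree: for a 2 by 2 table, the Pearson Chi-square statistic equals the square of the pooled z statistic, and a Chi-square variable with 1 degree of freedom is the square of a standard normal variable. The Python sketch below, written outside SAS with only the standard library, squares z and obtains the upper-tail probability as $P(\chi^2_1 \ge x) = \operatorname{erfc}(\sqrt{x/2})$. The variable names are illustrative.

```python
import math

# USA: 36,922 deaths out of 96,080; Italy: 22,745 deaths out of 65,472.
x1, n1 = 36922, 96080
x2, n2 = 22745, 65472

# Pooled z statistic from the first part of the lab.
p_pool = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (x1 / n1 - x2 / n2) / se

# Squaring z gives the Pearson Chi-square statistic of the 2 x 2 table.
chi2_from_z = z ** 2

# Upper-tail probability of a Chi-square distribution with df = 1.
p_value = math.erfc(math.sqrt(chi2_from_z / 2))

print(f"z^2 = {chi2_from_z:.4f}")
print(f"p-value (df = 1) = {p_value:.3e}")
```

The printed p-value is therefore the same as the two-sided p-value of the z-test.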

Note that if any of the expected counts is less than 5, you should use Fisher’s exact test instead, which gives a more accurate p-value when the sample is small.

In our case, it is safe to use the Chi-square p-value, which is less than 0.0001. Thus, we can reject the null hypothesis again.