# Lab 11 - Two sample proportion tests

The class video is attached here so that you can watch my lecture again when you prepare for the exams.

- If you have questions about my lecture, please use **the comment section** at the bottom of this document.

## Two sample proportions test

We can use a z-test to compare the population proportions of two different groups.

#### COVID 19 data

As of April 17, 2020, unfortunately, the USA has the largest number of deaths from COVID 19 in the world. The second largest number of deaths occurred in Italy. Here is the information from the Worldometers website:

| | USA | Italy | Total |
|---|---|---|---|
| Death | 36,922 | 22,745 | 59,667 |
| Recovered | 59,158 | 42,727 | 101,885 |
| Total | 96,080 | 65,472 | 161,552 |

#### Hypothesis

Our goal is to **determine whether the death rates of COVID 19 in the two countries are the same or not** at the 0.001 significance level. Thus, we can set our hypotheses as follows:

\[ H_0: p_1 - p_2 = 0 \quad \text{vs.} \quad H_A: p_1 - p_2 \ne 0 \]

where \(p_1\) represents the death rate of COVID 19 in the USA and \(p_2\) represents the death rate of COVID 19 in Italy. Here, "death rate" means the proportion of deaths among the cases with a known outcome (deaths plus recoveries), as in the table above.

#### Estimate of population proportion

The estimates of \(p_1\) and \(p_2\) are

\[ \begin{align} \hat{p}_1 & = \frac{36,922}{96,080} = 0.3843 \\ \hat{p}_2 & = \frac{22,745}{65,472} = 0.3474 \end{align} \]

#### Standard Error

We have learned that the standard deviation of the difference of the proportions is as follows: \[ \sqrt{\frac{p_{1}\left(1-p_{1}\right)}{n_{1}}+\frac{p_{2}\left(1-p_{2}\right)}{n_{2}}} \]

However, under the null hypothesis, we assume that the death rates in both countries are the same, say \(p\). Then the standard deviation above becomes:

\[ \sqrt{p\left(1-p\right)\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)} \]

How can we estimate \(p\)? We should use all the information available for calculating the death rate of COVID 19. Thus, the estimate of \(p\) is obtained by pooling both countries:

\[ \hat{p} = \frac{\text{# of deaths from both countries}}{\text{# of total observations}} = \frac{59,667}{161,552} = 0.3693 \]

Thus, the standard error will be as follows:

\[ \begin{align} SE & = \sqrt{\hat{p}\left(1-\hat{p}\right)\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)} \\ & = \sqrt{0.3693\left(0.6307\right)\left(\frac{1}{96,080}+\frac{1}{65,472}\right)} \\ & = 0.002446 \end{align} \]
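If you want to check this arithmetic without a calculator, a small DATA step can reproduce it. This is only a sketch: the variable names (`d1`, `n1`, `p_hat`, `se`, and so on) are made up for this illustration, and the counts are typed in by hand from the table above.

```
data _null_ ;
/* counts typed in from the table: deaths and totals per country */
d1 = 36922 ; n1 = 96080 ; /* USA */
d2 = 22745 ; n2 = 65472 ; /* Italy */
p1_hat = d1 / n1 ; /* sample proportion, USA */
p2_hat = d2 / n2 ; /* sample proportion, Italy */
p_hat = (d1 + d2) / (n1 + n2) ; /* pooled estimate under H0 */
se = sqrt(p_hat * (1 - p_hat) * (1/n1 + 1/n2)) ;
put p1_hat= p2_hat= p_hat= se= ; /* results are written to the log */
run ;
```

The values in the log should agree with the hand calculation up to rounding, since the DATA step carries full precision instead of four decimals.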

#### Test statistic

Now, the test statistic \(z\) can be obtained as follows:

\[ z=\frac{\hat{p}_{1}-\hat{p}_{2}}{SE}=\frac{0.3843-0.3474}{0.002446} \approx 15.086 \]

#### p-value

We can determine the p-value based on the form of the test. See here if you don’t remember. Since our alternative hypothesis is two-sided, the *p-value* will be
\[
\begin{align}
\text{p-value} & = 2 P(Z \ge |z|) \\
& = 2 P(Z \ge 15.086) < 0.0001
\end{align}
\]
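The same calculation can be sketched in a DATA step. The variable names are again hypothetical, and the counts are typed in by hand. The sketch uses the `sdf` function (the upper-tail probability) rather than `1 - probnorm(z)`, because for a z value this large the subtraction would simply round to 0.

```
data _null_ ;
d1 = 36922 ; n1 = 96080 ; /* USA: deaths, total */
d2 = 22745 ; n2 = 65472 ; /* Italy: deaths, total */
p_hat = (d1 + d2) / (n1 + n2) ; /* pooled estimate under H0 */
se = sqrt(p_hat * (1 - p_hat) * (1/n1 + 1/n2)) ;
z = (d1/n1 - d2/n2) / se ; /* USA minus Italy */
p_value = 2 * sdf('NORMAL', abs(z)) ; /* two-sided p-value */
put z= p_value= ;
run ;
```

The log should show a z value near 15.08 and a p-value far below 0.0001. The small gap from the hand-calculated z comes from rounding the proportions and the standard error to four significant digits.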

### SAS code

We can load the data into the SAS system as follows:

```
data covid19 ;
input country $ status $ count ;
datalines ;
USA Death 36922
USA Recovered 59158
Italy Death 22745
Italy Recovered 42727
;
run ;
```

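To do the two sample proportion test, we can use the `riskdiff(equal var = null)` option. In SAS, binomial proportions are called “risks,” so a “risk difference” is a difference in proportions. The `var = null` part asks SAS to compute the standard error under the null hypothesis, which is the pooled standard error we used above. Thus, the following SAS code will do the test: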

```
proc freq data = covid19 ;
weight count ;
tables country * status / riskdiff(equal var = null);
run ;
```

**Result**

```
The FREQ Procedure

Table of country by status

country     status

Frequency|
Percent  |
Row Pct  |
Col Pct  |Death   |Recovere|  Total
---------+--------+--------+
Italy    |  22745 |  42727 |  65472
         |  14.08 |  26.45 |  40.53
         |  34.74 |  65.26 |
         |  38.12 |  41.94 |
---------+--------+--------+
USA      |  36922 |  59158 |  96080
         |  22.85 |  36.62 |  59.47
         |  38.43 |  61.57 |
         |  61.88 |  58.06 |
---------+--------+--------+
Total       59667   101885   161552
            36.93    63.07   100.00
```

Among the output tables, you will find the following result:

```
      Risk Difference Test
H0: P1 - P2 = 0    Wald Method

Risk Difference          -0.0369
ASE (H0)                  0.0024
Z                       -15.0803
One-sided Pr <  Z         <.0001
Two-sided Pr > |Z|        <.0001

Column 1 (status = Death)
```

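Based on `Two-sided Pr > |Z| <.0001`, the p-value is below our significance level of 0.001, so we can reject the null hypothesis. Note that SAS reports a negative Z because it lists Italy first and therefore computes Italy minus USA; the small difference in magnitude from our hand calculation comes from rounding. Thus, statistically, the death rates of COVID 19 differ between the two countries. Why do you think that is?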

## Two sample proportion test - The Chi-square test

Using the Chi-square test, we can also do the same test: comparing the population proportions of two different groups. Let us do this in SAS using the same data.

| | USA | Italy | Total |
|---|---|---|---|
| Death | 36,922 | 22,745 | 59,667 |
| Recovered | 59,158 | 42,727 | 101,885 |
| Total | 96,080 | 65,472 | 161,552 |

Here are the steps for the Chi-square test.

Step 1: Draw the 2-way table

The 2-way table looks like the one above. For some problems, however, you need to build this table yourself from the description in the problem.

Step 2: Calculate expected counts for each cell

\[ \text{Expected count} = \frac{\text{row total}\times \text{column total}}{\text{table total}} \]
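As a worked example of this formula, take the cell for deaths in Italy. Its two margins are the Italy total (65,472) and the Death total (59,667), so

\[ \text{Expected count (Italy, Death)} = \frac{65,472 \times 59,667}{161,552} \approx 24,181.2 \]

The other three cells follow in the same way, or by subtracting from the margins: \(65,472 - 24,181.2 \approx 41,290.8\) for recoveries in Italy, \(59,667 - 24,181.2 \approx 35,485.8\) for deaths in the USA, and \(96,080 - 35,485.8 \approx 60,594.2\) for recoveries in the USA.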

We can produce the observed table and the expected counts at the same time in SAS using the following code:

```
data covid19 ;
input country $ status $ count ;
datalines ;
USA Death 36922
USA Recovered 59158
Italy Death 22745
Italy Recovered 42727
;
run ;
proc freq data = covid19 ;
weight count ;
tables country * status / expected ;
run ;
```

**Result**

```
Table of country by status

country     status

Frequency|
Expected |
Percent  |
Row Pct  |
Col Pct  |Death   |Recovere|  Total
---------+--------+--------+
Italy    |  22745 |  42727 |  65472
         |  24181 |  41291 |
         |  14.08 |  26.45 |  40.53
         |  34.74 |  65.26 |
         |  38.12 |  41.94 |
---------+--------+--------+
USA      |  36922 |  59158 |  96080
         |  35486 |  60594 |
         |  22.85 |  36.62 |  59.47
         |  38.43 |  61.57 |
         |  61.88 |  58.06 |
---------+--------+--------+
Total       59667   101885   161552
            36.93    63.07   100.00
```

The first and second numbers in each cell are the observed count and the expected count. However, `the expected counts are rounded to whole numbers` in this output, so a calculation based on them is approximate. Since the expected counts in all cells are larger than \(5\), we can use the Chi-square test.

Step 3: Calculate Chi-square statistic

\[ \chi^2 = \sum^{rc}_{i=1} \frac{(\text{obs}_i - \text{exp}_i)^2}{\text{exp}_i} \]

In our problem,

\[ \begin{align} \chi^2 & \approx \frac{(22745 - 24181)^2}{24181} + \frac{(42727 - 41291)^2}{41291} \\ & \quad + \frac{(36922 - 35486)^2}{35486} + \frac{(59158 - 60594)^2}{60594} \\ & = 227.3596 \end{align} \]

This Chi-square statistic follows a *Chi-square distribution* with the following `degrees of freedom`:

\[ d.f. = (\text{# of rows in the table} - 1) \times (\text{# of columns in the table} - 1) \]

In our case, the table has 2 rows and 2 columns, so the statistic follows a Chi-square distribution with \((2-1)\times(2-1) = 1\) degree of freedom.

Step 4: Calculate the p-value

The p-value can be obtained from the Chi-square table or from SAS:

```
proc freq data = covid19 ;
weight count ;
tables country * status / chisq;
run ;
```

**Result**

```
Statistics for Table of country by status

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1    227.4160    <.0001
Likelihood Ratio Chi-Square    1    228.0910    <.0001
Continuity Adj. Chi-Square     1    227.2577    <.0001
Mantel-Haenszel Chi-Square     1    227.4146    <.0001
Phi Coefficient                       -0.0375
Contingency Coefficient                0.0375
Cramer's V                            -0.0375

       Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)     22745
Left-sided Pr <= F          <.0001
Right-sided Pr >= F         1.0000

Table Probability (P)       <.0001
Two-sided Pr <= P           <.0001

Sample Size = 161552
```

Note that SAS gives almost the same Chi-square statistic as the one we calculated: 227.4160 versus our 227.3596. The small difference arises because we used the rounded expected counts. Verify this number using your calculator.
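If you prefer to verify it in SAS rather than on a calculator, the following DATA step is one possible sketch. It recomputes the expected counts without rounding and sums the four cell contributions. The array and variable names (`obs`, `rowtot`, `coltot`, `chisq`) are hypothetical, and the counts are typed in by hand from the table.

```
data _null_ ;
/* cell order: Italy death, Italy recovered, USA death, USA recovered */
array obs[4] _temporary_ (22745 42727 36922 59158) ;
array rowtot[4] _temporary_ (65472 65472 96080 96080) ; /* country totals */
array coltot[4] _temporary_ (59667 101885 59667 101885) ; /* status totals */
n = 161552 ; /* table total */
chisq = 0 ;
do i = 1 to 4 ;
expected = rowtot[i] * coltot[i] / n ; /* unrounded expected count */
chisq = chisq + (obs[i] - expected)**2 / expected ;
end ;
df = (2 - 1) * (2 - 1) ; /* (rows - 1) x (columns - 1) */
p_value = sdf('CHISQUARE', chisq, df) ; /* upper-tail probability */
put chisq= df= p_value= ;
run ;
```

With unrounded expected counts, `chisq` should come out close to the 227.4160 that PROC FREQ reports.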

Note that if any of the expected counts is less than 5, you should use Fisher’s exact test instead, which gives a more accurate p-value when the sample is small.

In our case, it is safe to use the Chi-square p-value, which is less than 0.0001 and therefore below our significance level of 0.001. Thus, we can reject the null hypothesis again.
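It is no coincidence that both tests lead to the same conclusion. For a 2-by-2 table, the square of the z statistic computed with the pooled standard error equals the Chi-square statistic. A quick sketch to see this numerically, with hypothetical variable names and the counts typed in by hand:

```
data _null_ ;
d1 = 36922 ; n1 = 96080 ; /* USA: deaths, total */
d2 = 22745 ; n2 = 65472 ; /* Italy: deaths, total */
p_hat = (d1 + d2) / (n1 + n2) ; /* pooled estimate under H0 */
se = sqrt(p_hat * (1 - p_hat) * (1/n1 + 1/n2)) ;
z = (d1/n1 - d2/n2) / se ;
z_squared = z**2 ; /* compare with the Chi-Square value from PROC FREQ */
put z= z_squared= ;
run ;
```

The value of `z_squared` in the log should be close to 227.416, the Chi-Square value in the PROC FREQ output.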