# Lab 13 - Inference in Regression

The class video is attached here so that you can watch the lecture again when you prepare for the exams.

- If you have questions about the lecture, please use **the comment section** at the bottom of this document.

### t-test revisit

We learned about the t-test for a population mean in week 8.

If \(X\) follows a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\), the sample mean \(\overline{X}\) has the following sampling distribution (the second argument is the standard deviation, not the variance):

\[ \overline{X} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]

To test a claim about the population mean, where the null hypothesis has the form

\[ H_0: \mu = \mu_0 \]

we calculate the \(t\) statistic, replacing the unknown \(\sigma\) with the sample standard deviation \(s\):

\[ t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}} \]
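In SAS, this one-sample test can be requested with `proc ttest`. The following is only a sketch: the dataset name `mydata`, the variable `x`, and the null value `10` are hypothetical placeholders, not part of this lab's data.

```
/* Sketch only: mydata, x, and the null value 10 are hypothetical */
proc ttest data = mydata h0 = 10 alpha = 0.05;
    var x;    /* variable whose population mean is tested */
run;
```

The `h0 =` option sets \(\mu_0\), and the output reports the \(t\) statistic with \(n-1\) degrees of freedom together with its two-sided p-value.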

### Distribution of the estimator of the slope and intercept

Let us assume that our regression line formula is:

\[ y = a + b x \]

where \(a\) represents the intercept and \(b\) represents the slope. In regression we estimate these two quantities as follows:

\[ \begin{align*} \hat{b} & =r\frac{s_{y}}{s_{x}} \\ \hat{a} & =\overline{y}-\hat{b}\overline{x} \end{align*} \]
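It can help to see why the quantity \(\sum(x_i-\bar{x})^2\) shows up in the formulas that follow. Using the usual definitions \(r=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y}\) and \(s_x^2=\frac{1}{n-1}\sum(x_i-\bar{x})^2\), the slope estimate can be rewritten as:

\[ \hat{b}=r\frac{s_{y}}{s_{x}}=\frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{(n-1)s_{x}^{2}}=\frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum\left(x_{i}-\bar{x}\right)^{2}} \]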

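Statisticians have shown that, when the errors around the line are independent and Normal with a common standard deviation \(\sigma\), the estimators have the following distributions (again, the second argument is the standard deviation):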

\[ \begin{align*} \hat{a} & \sim\mathcal{N}\left(a,\sigma\sqrt{\frac{1}{n}+\frac{\bar{x}^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}}\right)\\ \hat{b} & \sim\mathcal{N}\left(b,\frac{\sigma}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2}}}\right) \end{align*} \]

### Replacing \(\sigma\)

Note that we do not know \(\sigma\) here, so we replace it with the following estimate computed from the residuals:

\[ s_{y|x}=\sqrt{\frac{1}{n-2}\sum\left(y_{i}-\hat{y}_{i}\right)^{2}} \]

### Confidence intervals for the slope

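The confidence interval follows the same logic: we work out the distribution of the estimator and replace its \(\sigma\) with an estimate from the sample. The C.I. uses a critical value from the \(t\) distribution, now with \(n-2\) degrees of freedom:

\[ \hat{b}\pm t_{n-2,\alpha}^{*}S.E. \]

where \(t_{n-2,\alpha}^{*}\) is the critical value for a \(100(1-\alpha)\%\) confidence level and the standard error is

\[ S.E. = \frac{s_{y|x}}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2}}} \]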

### Test for the slope

The `slope` value represents the linear relationship between the two variables. Testing whether the slope is zero therefore amounts to checking whether there is a linear relationship between the independent variable and the dependent variable.

Let us assume that we want to test the following null hypothesis:

\[ H_0: b = 0 \quad vs. \quad H_A: b \ne 0 \]

Then we calculate the \(t\)-test statistic, which follows a \(t\) distribution with \(n-2\) degrees of freedom when \(H_0\) is true:

\[ t = \frac{\hat{b}}{S.E.} \sim t_{n-2} \]

If we set our significance level \(\alpha\) to 0.05, we reject the null hypothesis when the p-value is less than 0.05.

\[ p\text{-value}: 2 P(t_{n-2} > |t|) \]
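The confidence interval and the test can both be computed by hand in a SAS data step. The sketch below uses made-up values for \(n\), \(\hat{b}\), and the standard error (they are assumptions for illustration, not taken from any dataset) and relies on the SAS functions `quantile` and `probt` for the \(t\) distribution.

```
/* Hypothetical inputs for illustration only */
data _null_;
    n     = 20;      /* hypothetical sample size */
    b_hat = 1.5;     /* hypothetical slope estimate */
    se    = 0.6;     /* hypothetical standard error of the slope */
    alpha = 0.05;    /* significance level */

    df    = n - 2;                              /* degrees of freedom */
    t     = b_hat / se;                         /* test statistic for H0: b = 0 */
    tstar = quantile('T', 1 - alpha/2, df);     /* two-sided critical value */
    p     = 2 * (1 - probt(abs(t), df));        /* two-sided p-value */
    lower = b_hat - tstar * se;                 /* lower confidence limit */
    upper = b_hat + tstar * se;                 /* upper confidence limit */

    put t= tstar= p= lower= upper=;             /* write results to the log */
run;
```

The results appear in the SAS log. In practice `proc reg` reports the same quantities directly, so a data step like this is mainly useful for checking your understanding of the formulas.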

## SAS example

### OECD data revisit

The OECD dataset is collected from the Organization for Economic Cooperation and Development (OECD). It provides summary statistics for the 29 member nations. The variables are as follows:

- **name**: name of country
- **infmort**: infant mortality (1996), number of deaths of infants < 1 yr of age per 1000 live births
- **pcgdp**: per capita gross domestic product (1998), reported in US dollars converted using Purchasing Power Parities to adjust for differences in price levels between countries
- **pch**: per capita health care expenditures (1996), reported in US dollars converted using Purchasing Power Parities
- **beds**: in-patient hospital beds per 1000 population (1996)
- **los**: average length of stay in days for hospital patients (1996)
- **docs**: doctors per 1000 population (1996)
- **region**: region of the world

### Load data in SAS

```
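/* Point SAS at the raw data file on the web */
filename oec url "https://homepage.divms.uiowa.edu/~kcowles/Datasets/OECD.dat";
data OECD;
    infile oec;
    /* country is character ($) with width 13; the other variables are numeric */
    input country $ 13. pcgdp pch beds los docs infmort ;
run ;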
```

- Note that the `13.` in the `input` statement tells SAS the number of characters in the longest country name. Without this information **SAS would truncate the country names to 8 letters each**.
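Before moving on, it may be worth confirming that the data were read as intended. The following sketch (an optional check, not part of the original lab steps) prints the first few rows, lists the variable types and lengths, and counts missing values in the two variables used below.

```
/* Optional sanity checks on the OECD dataset created above */
proc print data = OECD (obs = 5);    /* first five observations */
run;

proc contents data = OECD;           /* variable names, types, and lengths */
run;

proc means data = OECD n nmiss;      /* non-missing and missing counts */
    var pch pcgdp;
run;
```

If the country names look cut off in the `proc print` output, the width in the `input` statement is the first thing to check.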

### Our goal

Suppose we want to predict the value of `pch` when we know `pcgdp`.

### Plotting

```
proc sgscatter data = OECD ;
title "Scatter plot of pch and pcgdp";
plot pch * pcgdp /
datalabel = country reg = (nogroup) grid;
run ;
```

### Regression

We can obtain the result of regression from the following code:

```
proc reg data = OECD ;
model pch = pcgdp / clb ;
run ;
```

```
The REG Procedure
Model: MODEL1
Dependent Variable: pch

Number of Observations Read                         30
Number of Observations Used                         29
Number of Observations with Missing Values           1

                     Analysis of Variance

                            Sum of          Mean
Source            DF       Squares        Square   F Value   Pr > F
Model              1      12390695      12390695     87.52   <.0001
Error             27       3822638        141579
Corrected Total   28      16213333

Root MSE          376.27009    R-Square   0.7642
Dependent Mean   1508.89655    Adj R-Sq   0.7555
Coeff Var          24.93677

                                Parameter Estimates

                 Parameter     Standard
Variable    DF    Estimate        Error   t Value   Pr > |t|    95% Confidence Limits
Intercept    1  -465.66368    222.33244     -2.09     0.0457   -921.85216     -9.47520
pcgdp        1     0.09682      0.01035      9.36     <.0001      0.07558      0.11805
```

Note that the `clb` option requests confidence limits (C.I.) for the slope and the intercept; these appear in the last two columns of the Parameter Estimates table.
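As a rough check, the numbers in the REG output can be tied back to the formulas from the first part of this lab. With 29 observations used, the degrees of freedom are \(n-2=27\). The critical value \(t^{*}\approx 2.052\) below is taken from a \(t\) table for 27 d.f.; it is not part of the SAS output. Small differences in the last digit come from rounding in the displayed values.

\[ \begin{align*} s_{y|x} & =\sqrt{\frac{3822638}{27}}\approx\sqrt{141579}\approx 376.27 \quad \text{(Root MSE)} \\ t & =\frac{\hat{b}}{S.E.}=\frac{0.09682}{0.01035}\approx 9.35 \quad \text{(reported as 9.36)} \\ \hat{b}\pm t^{*}S.E. & \approx 0.09682\pm 2.052\times 0.01035\approx\left(0.0756,\ 0.1181\right) \end{align*} \]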

### Interpretation of the result

After examining the results, make sure you can answer these questions:

Q1. Check the assumptions needed for linear regression by examining the scatter plot.

Q2. How well does the linear regression line fit the data?

Q3. The null hypothesis is that there is no linear relationship between `pch` and `pcgdp`. Write the null and alternative hypotheses as statements about a population parameter.

Q4. Give a point estimate and a 95% confidence interval for the parameter of interest in the hypotheses.

Q5. Based on your answer to the preceding question, would you reject the null hypothesis at significance level \(\alpha\) = .05? (yes/no)

Q6. What are the numeric values of the test statistic and the p-value for the two-sided test of no linear relationship between `pch` and `pcgdp`?

Q7. Based on your answer to the preceding question, would you reject the null hypothesis at significance level \(\alpha\) = .05?
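For the stated goal of predicting `pch` from `pcgdp`, and as a supplement to the scatter plot when checking assumptions, one option is to save the fitted values and residuals and plot them. This is a sketch under assumptions: the output dataset name `OECD_fit` and the variable names `pch_hat` and `resid` are hypothetical choices, not names used elsewhere in this lab.

```
/* Save predicted values and residuals from the regression */
proc reg data = OECD;
    model pch = pcgdp / clb;
    output out = OECD_fit      /* hypothetical output dataset name */
           p = pch_hat         /* predicted pch for each country */
           r = resid;          /* residual: observed minus predicted */
run;
quit;

/* Residuals against the predictor, with a reference line at zero */
proc sgplot data = OECD_fit;
    scatter x = pcgdp y = resid;
    refline 0 / axis = y;
run;
```

Residuals scattered evenly around zero, with no curve or funnel shape, are consistent with the linearity and constant-spread assumptions behind the tests above.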