# Lab 13 - Inference in Regression

The class video is attached here so that you can watch the lecture again when you prepare for the exams.

• If you have questions about my lecture, please use the comment section at the bottom of this document.

### t-test revisited

We learned about the t-test for a population mean in week 8.

If $$X$$ follows a Normal distribution with mean $$\mu$$ and standard deviation $$\sigma$$, the sampling distribution of the sample mean of $$n$$ observations is (writing the standard deviation, not the variance, as the second argument):

$\overline{X} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}})$

To test a hypothesis about the population mean, with a null hypothesis of the form

$H_0: \mu = \mu_0$

we calculate the $$t$$ statistic, in which the unknown $$\sigma$$ is replaced by the sample standard deviation $$s$$:

$t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}$
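As a quick worked illustration, take hypothetical numbers (not from any dataset in this lab): $$n = 25$$, $$\overline{x} = 52$$, $$s = 5$$, and $$\mu_0 = 50$$. Then

$t = \frac{52 - 50}{5/\sqrt{25}} = \frac{2}{1} = 2$

This value is compared with the $$t$$ distribution with $$n - 1 = 24$$ degrees of freedom. Its two-sided 5% critical value is about 2.06, so with these made-up numbers we would not reject $$H_0$$ at $$\alpha = 0.05$$.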

### Distribution of the estimator of the slope and intercept

Let us assume that the regression line is

$y = a + b x$

where $$a$$ is the intercept and $$b$$ is the slope. In regression we estimate these two quantities from the sample correlation $$r$$ and the sample standard deviations $$s_x$$ and $$s_y$$ as follows:

\begin{align*} \hat{b} & =r\frac{s_{y}}{s_{x}} \\ \hat{a} & =\overline{y}-\hat{b}\overline{x} \end{align*}

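Statisticians have shown that, when the errors around the regression line are Normal with standard deviation $$\sigma$$, these estimators have the following sampling distributions (the second argument of $$\mathcal{N}$$ is a standard deviation, as in the t-test section):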

\begin{align*} \hat{a} & \sim\mathcal{N}\left(a,\sigma\sqrt{\frac{1}{n}+\frac{\bar{x}^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}}\right)\\ \hat{b} & \sim\mathcal{N}\left(b,\frac{\sigma}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2}}}\right) \end{align*}

### Replacing $$\sigma$$

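Note that we do not know the value of $$\sigma$$ here, so we replace it with the residual standard deviation computed from the sample:

$s_{y|x}=\sqrt{\frac{1}{n-2}\sum\left(y_{i}-\hat{y}_{i}\right)^{2}}$

where $$\hat{y}_{i}=\hat{a}+\hat{b}x_{i}$$ is the fitted value for observation $$i$$. The divisor is $$n-2$$ because two parameters, $$a$$ and $$b$$, are estimated.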
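As a numeric check, the SAS output in the OECD example later in this lab reports an error sum of squares of 3822638 with $$n - 2 = 27$$ error degrees of freedom, so

$s_{y|x}=\sqrt{\frac{3822638}{27}}\approx\sqrt{141579.2}\approx 376.27$

which agrees with the value that output labels Root MSE. Reading Root MSE as $$s_{y|x}$$ is offered here as an aid for matching the formula to the printout.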

### Confidence intervals for the slope

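The confidence interval follows the same logic: we start from the sampling distribution of $$\hat{b}$$ and replace its $$\sigma$$ with $$s_{y|x}$$ computed from the sample. The interval uses a critical value from the $$t$$ distribution, but with $$n-2$$ degrees of freedom instead of $$n-1$$. A $$100(1-\alpha)\%$$ confidence interval for the slope is

$\hat{b}\pm t_{n-2,\alpha/2}^{*}\,S.E.$

where $$t_{n-2,\alpha/2}^{*}$$ is the value that leaves probability $$\alpha/2$$ in the upper tail of the $$t_{n-2}$$ distribution, and the standard error is

$S.E.=\frac{s_{y|x}}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2}}}$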
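The interval can also be computed by hand in a short data step. The sketch below is only an illustration: the values of `b_hat`, `se`, and `n` are hypothetical, and the dataset name `slope_ci` is made up for this example. It uses the SAS function `tinv` to look up the critical value.

```sas
/* Sketch: confidence interval for a slope from summary numbers.     */
/* b_hat, se, and n are hypothetical inputs, not taken from the lab. */
data slope_ci;
  b_hat  = 1.50;                     /* estimated slope               */
  se     = 0.40;                     /* standard error of the slope   */
  n      = 20;                       /* sample size                   */
  alpha  = 0.05;                     /* 1 - confidence level          */
  df     = n - 2;                    /* degrees of freedom            */
  t_star = tinv(1 - alpha/2, df);    /* upper alpha/2 critical value  */
  lower  = b_hat - t_star * se;      /* lower confidence limit        */
  upper  = b_hat + t_star * se;      /* upper confidence limit        */
run;

proc print data = slope_ci noobs;
  var b_hat se df t_star lower upper;
run;
```

With these made-up inputs the critical value for 18 degrees of freedom is about 2.10, so the interval is roughly (0.66, 2.34).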

### Test for the slope

The slope represents the linear relationship between the two variables. Testing whether the slope is zero therefore checks whether there is a linear relationship between the independent variable and the dependent variable.

Let us assume that we want to test the following null hypothesis:

$H_0: b = 0 \quad vs. \quad H_A: b \ne 0$

Then we calculate the $$t$$-test statistic, which follows a $$t$$ distribution with $$n-2$$ degrees of freedom when $$H_0$$ is true:

$t = \frac{\hat{b}}{S.E.} \sim t_{n-2}$

If we set our significance level $$\alpha$$ to 0.05, we reject the null hypothesis when the p-value is less than 0.05. For the two-sided alternative, the p-value is

$p\text{-value} = 2 P(t_{n-2} > |t|)$
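The test can be reproduced by hand in a data step. This is a minimal sketch under assumed inputs: `b_hat`, `se`, and `n` are hypothetical values, and `slope_test` is a made-up dataset name. It uses the SAS function `probt`, which returns the cumulative probability of the $$t$$ distribution.

```sas
/* Sketch: two-sided test of H0: b = 0 from summary numbers.         */
/* b_hat, se, and n are hypothetical values chosen for illustration. */
data slope_test;
  b_hat   = 1.50;                             /* estimated slope        */
  se      = 0.40;                             /* standard error         */
  n       = 20;                               /* sample size            */
  df      = n - 2;                            /* degrees of freedom     */
  t       = b_hat / se;                       /* test statistic         */
  p_value = 2 * (1 - probt(abs(t), df));      /* 2 * P(t_df > |t|)      */
  reject  = (p_value < 0.05);                 /* 1 = reject H0 at 5%    */
run;

proc print data = slope_test noobs;
run;
```

Here $$t = 1.50/0.40 = 3.75$$, which is larger than the 5% critical value of about 2.10 for 18 degrees of freedom, so the p-value falls below 0.05 and `reject` is 1.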

## SAS example

### OECD data revisited

The OECD dataset is collected from the Organization for Economic Cooperation and Development (OECD). It provides summary statistics for the 29 member nations. The variables are as follows:

• name: name of country
• infmort: infant mortality (1996) number of deaths of infants < 1 yr of age per 1000 live births
• pcgdp: per capita gross domestic product (1998) reported in US dollars converted using Purchasing Power Parities to adjust for differences in price levels between countries
• pch: per capita health care expenditures (1996) reported in US dollars converted using Purchasing Power Parities
• beds: in-patient hospital beds per 1000 population (1996)
• los: average length of stay in days for hospital patients (1996)
• docs: doctors per 1000 population (1996)
• region: region of the world

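```sas
/* Read the OECD data directly from the course web site */
filename oec url "https://homepage.divms.uiowa.edu/~kcowles/Datasets/OECD.dat";
data OECD;
  infile oec;
  /* country is a character variable up to 13 characters long */
  input country $ 13. pcgdp pch beds los docs infmort ;
run ;
```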
• Note that the 13. in the input statement tells SAS the number of characters in the longest country name. Without this information SAS would truncate the country names to 8 letters each.

### Our goal

Suppose we want to predict per capita health care expenditures (pch) from per capita GDP (pcgdp).

### Plotting

```sas
proc sgscatter data = OECD ;
  title "Scatter plot of pch and pcgdp";
  plot pch * pcgdp /
    datalabel = country reg = (nogroup) grid;
run ;
```

### Regression

We can obtain the result of regression from the following code:

```sas
proc reg data = OECD ;
  model pch = pcgdp / clb ;
run ;
```
```
                        The REG Procedure
                          Model: MODEL1
                     Dependent Variable: pch

Number of Observations Used                         29
Number of Observations with Missing Values           1

                      Analysis of Variance

                                   Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     1       12390695       12390695      87.52    <.0001
Error                    27        3822638         141579
Corrected Total          28       16213333

Root MSE            376.27009    R-Square     0.7642
Dependent Mean     1508.89655    Adj R-Sq     0.7555
Coeff Var            24.93677

                      Parameter Estimates

                     Parameter      Standard
Variable    DF      Estimate         Error   t Value   Pr > |t|     95% Confidence Limits

Intercept    1    -465.66368     222.33244     -2.09     0.0457    -921.85216      -9.47520
pcgdp        1       0.09682       0.01035      9.36     <.0001       0.07558       0.11805
```


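Note that the clb option requests confidence limits for the intercept and the slope. By default these are 95% limits, shown in the last two columns of the Parameter Estimates table.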
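Since the goal is to get predicted values of pch, one possible way to store them is the `output` statement of `proc reg`. The sketch below assumes the OECD dataset created earlier; the names `OECD_pred`, `pch_hat`, and `resid` are made up for this illustration.

```sas
/* Sketch: store fitted values and residuals from the lab's model.  */
/* OECD_pred, pch_hat, and resid are names made up for this sketch. */
proc reg data = OECD ;
  model pch = pcgdp ;
  output out = OECD_pred p = pch_hat r = resid ;   /* p = predicted, r = residual */
run ;
quit ;

/* List each country's observed and predicted pch side by side */
proc print data = OECD_pred ;
  var country pcgdp pch pch_hat resid ;
run ;
```

The predictions come from the fitted line in the output above. For example, for a hypothetical country with pcgdp = 20000, the predicted pch would be about $$-465.66 + 0.09682 \times 20000 \approx 1470.74$$.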

### Interpretation of the result

After examining the results, make sure you can answer these questions:

• Q1. Check the assumptions needed for linear regression by examining the scatter plot.

• Q2. How well does the linear regression line fit the data?

• Q3. The null hypothesis is that there is no linear relationship between pch and pcgdp. Write this null and alternative hypothesis as a statement about a population parameter.

• Q4. Give a point estimate and a 95% confidence interval for the parameter of interest in the hypotheses.

• Q5. Based on your answer to the preceding question, would you reject the null hypothesis at significance level $$\alpha$$ = .05? (yes/no)

• Q6. What are the numeric values of the test statistic and the p-value for the two-sided test of no linear relationship between pch and pcgdp?

• Q7. Based on your answer to the preceding question, would you reject the null hypothesis at significance level $$\alpha$$ = .05?
