Lab 13 - Inference in Regression

The class video is attached here so that you can watch the lecture again when you prepare for the exams.

  • If you have questions about my lecture, please use the comment section at the bottom of this document.

t-test revisited

We learned about the t-test for a population mean in week 8.

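If \(X\) follows a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\), we have seen that the mean of a sample of \(n\) observations has the following sampling distribution (the second argument of \(\mathcal{N}\) is the standard deviation):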

\[ \overline{X} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]

To test a hypothesis about the population mean, where the null hypothesis has the form

\[ H_0: \mu = \mu_0 \]

we replace the unknown \(\sigma\) with the sample standard deviation \(s\) and calculate the \(t\) statistic, which follows a \(t\) distribution with \(n-1\) degrees of freedom under \(H_0\):

\[ t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}} \]
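As a quick illustration with made-up numbers (they are not from any dataset in this lab), suppose \(n = 25\), \(\overline{x} = 52\), \(s = 5\), and \(\mu_0 = 50\). Then

\[ t = \frac{52 - 50}{5/\sqrt{25}} = \frac{2}{1} = 2 \]

with \(n - 1 = 24\) degrees of freedom. The two-sided 5% critical value for 24 degrees of freedom is about 2.064, so in this hypothetical case \(|t| = 2\) falls just short of it and we would not reject \(H_0\) at \(\alpha = 0.05\).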

Distribution of the estimator of the slope and intercept

Let us assume that our regression line formula is:

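\[ y = a + b x \] where \(a\) represents the intercept and \(b\) represents the slope. We estimate these two quantities from the sample as follows, where \(r\) is the sample correlation between \(x\) and \(y\), and \(s_x\) and \(s_y\) are the sample standard deviations: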

\[ \begin{align*} \hat{b} & =r\frac{s_{y}}{s_{x}} \\ \hat{a} & =\overline{y}-\hat{b}\overline{x} \end{align*} \]
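It may help to see that the slope estimate can be written purely in terms of sums of squares and cross-products. Writing out \(r\), \(s_x\), and \(s_y\), the factors of \(n-1\) cancel:

\[ \hat{b} = r\frac{s_{y}}{s_{x}} = \frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2}\sum\left(y_{i}-\bar{y}\right)^{2}}}\cdot\sqrt{\frac{\sum\left(y_{i}-\bar{y}\right)^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}} = \frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum\left(x_{i}-\bar{x}\right)^{2}} \]

The denominator \(\sum\left(x_{i}-\bar{x}\right)^{2}\) is the same quantity that appears in the standard deviations of \(\hat{a}\) and \(\hat{b}\).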

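Statisticians have shown that, when the errors around the line are independent and Normal with constant standard deviation \(\sigma\), the estimators have the following distributions (the second argument of \(\mathcal{N}\) is the standard deviation):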

\[ \begin{align*} \hat{a} & \sim\mathcal{N}\left(a,\sigma\sqrt{\frac{1}{n}+\frac{\bar{x}^{2}}{\sum\left(x_{i}-\bar{x}\right)^{2}}}\right)\\ \hat{b} & \sim\mathcal{N}\left(b,\frac{\sigma}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2}}}\right) \end{align*} \]

Replacing \(\sigma\)

Note that we do not know \(\sigma\) here, so we replace it with an estimate based on the residuals: \[ s_{y|x}=\sqrt{\frac{1}{n-2}\sum\left(y_{i}-\hat{y}_{i}\right)^{2}} \]
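As a check on this formula, we can use numbers from the Analysis of Variance table in the SAS example later in this lab. The Error row reports \(\sum\left(y_{i}-\hat{y}_{i}\right)^{2} = 3822638\) on \(n - 2 = 27\) degrees of freedom, so

\[ s_{y|x}=\sqrt{\frac{3822638}{27}}\approx\sqrt{141579}\approx 376.27 \]

which agrees with the Mean Square for Error (141579) and with the value SAS labels Root MSE (376.27009).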

Confidence intervals for the slope

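The confidence interval follows the same logic as for the mean: we work out the sampling distribution of \(\hat{b}\) and replace the unknown \(\sigma\) with its estimate from the sample, \(s_{y|x}\). The C.I. again uses a critical value from the \(t\) distribution, but now with \(n-2\) degrees of freedom instead of \(n-1\): \[ \hat{b}\pm t_{n-2,\alpha/2}^{*}\,S.E. \] where \(t_{n-2,\alpha/2}^{*}\) is the upper \(\alpha/2\) critical value of the \(t\) distribution with \(n-2\) degrees of freedom (for a 95% C.I., \(\alpha = 0.05\)) and the standard error is \[ S.E. = \frac{s_{y|x}}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2}}} \]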

Test for the slope

The slope represents the linear relationship between the two variables. Thus, testing whether the slope is zero amounts to checking whether there is a linear relationship between the independent variable and the dependent variable.

Let us assume that we want to test the following null hypothesis:

\[ H_0: b = 0 \quad vs. \quad H_A: b \ne 0 \]

Then we calculate the \(t\)-test statistic, which follows a \(t\) distribution with \(n-2\) degrees of freedom when \(H_0\) is true: \[ t = \frac{\hat{b}}{S.E.} \sim t_{n-2} \]

If we set the significance level \(\alpha\) to 0.05, we reject the null hypothesis when the two-sided p-value is less than 0.05:

\[ p\text{-value} = 2 P(t_{n-2} > |t|) \]
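The interval and the test can be reproduced by hand in a DATA step. The sketch below plugs in the slope estimate, its standard error, and the degrees of freedom reported by PROC REG in the SAS example later in this lab. The dataset name slope_check and the variable names are made up for illustration; tinv and probt are the SAS quantile and cumulative distribution functions for the \(t\) distribution.

```sas
/* Hand calculation of the slope C.I. and test (illustrative sketch). */
data slope_check;
  b_hat = 0.09682;   /* slope estimate from the Parameter Estimates table */
  se    = 0.01035;   /* standard error of the slope                       */
  df    = 27;        /* n - 2 = 29 - 2                                    */

  t       = b_hat / se;                    /* test statistic for H0: b = 0   */
  t_crit  = tinv(0.975, df);               /* upper 2.5% point of t(27)      */
  lower   = b_hat - t_crit * se;           /* 95% C.I. lower limit           */
  upper   = b_hat + t_crit * se;           /* 95% C.I. upper limit           */
  p_value = 2 * (1 - probt(abs(t), df));   /* two-sided p-value              */
run;

proc print data = slope_check noobs;
run;
```

Up to rounding of the inputs, the printed values should agree with the t Value, Pr > |t|, and 95% Confidence Limits columns for pcgdp in the PROC REG output.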

SAS example

OECD data revisited

The OECD dataset comes from the Organization for Economic Cooperation and Development (OECD). It provides summary statistics for the 29 member nations. The variables are as follows:

  • name: name of country (read into the variable country in the SAS code below)
  • infmort: infant mortality (1996), the number of deaths of infants under 1 year of age per 1000 live births
  • pcgdp: per capita gross domestic product (1998), reported in US dollars and converted using Purchasing Power Parities to adjust for differences in price levels between countries
  • pch: per capita health care expenditures (1996), reported in US dollars and converted using Purchasing Power Parities
  • beds: in-patient hospital beds per 1000 population (1996)
  • los: average length of stay in days for hospital patients (1996)
  • docs: doctors per 1000 population (1996)
  • region: region of the world

Load data in SAS

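/* Fileref pointing at the data file */
filename oec url "";
data OECD;
  infile oec;
  /* Country name first (character), then six numeric variables */
  input country $ 13. pcgdp pch beds los docs infmort ;
run ;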
  • Note that the 13. in the input statement tells SAS the number of characters in the longest country name. Without this information SAS would truncate the country names to 8 letters each.

Our goal

Suppose we want to predict pch from a known value of pcgdp. We start with a scatter plot of the two variables, with the fitted regression line added:


proc sgscatter data = OECD ;
title "Scatter plot of pch and pcgdp";
  plot pch * pcgdp /
    datalabel = country reg = (nogroup) grid;
run ;


We can obtain the result of regression from the following code:

proc reg data = OECD ;
model pch = pcgdp / clb ;
run ;

SAS produces the following output:

                                       The REG Procedure
                                         Model: MODEL1
                                   Dependent Variable: pch

                    Number of Observations Read                         30
                    Number of Observations Used                         29
                    Number of Observations with Missing Values           1

                                     Analysis of Variance

                                            Sum of           Mean
        Source                   DF        Squares         Square    F Value    Pr > F

        Model                     1       12390695       12390695      87.52    <.0001
        Error                    27        3822638         141579
        Corrected Total          28       16213333

                     Root MSE            376.27009    R-Square     0.7642
                     Dependent Mean     1508.89655    Adj R-Sq     0.7555
                     Coeff Var            24.93677

                                      Parameter Estimates

                     Parameter      Standard
  Variable    DF      Estimate         Error   t Value   Pr > |t|     95% Confidence Limits

  Intercept    1    -465.66368     222.33244     -2.09     0.0457    -921.85216      -9.47520
  pcgdp        1       0.09682       0.01035      9.36     <.0001       0.07558       0.11805

Note that the clb option asks SAS to print confidence limits (95% by default) for the intercept and the slope in the Parameter Estimates table.
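From the Parameter Estimates table, the fitted line is approximately \(\widehat{pch} = -465.66 + 0.09682 \times pcgdp\). To get the predicted value of pch for every country, which was our goal, one option is to add an output statement to PROC REG. This is a minimal sketch: the output dataset name OECD_pred and the variable names pch_hat and resid are chosen here for illustration.

```sas
/* Refit the model and save predicted values and residuals. */
proc reg data = OECD ;
  model pch = pcgdp / clb ;
  output out = OECD_pred      /* new dataset holding the original variables */
         p = pch_hat          /* predicted pch from the fitted line         */
         r = resid ;          /* residual: observed pch minus predicted pch */
run ;
quit ;

/* List observed and predicted values side by side. */
proc print data = OECD_pred ;
  var country pcgdp pch pch_hat resid ;
run ;
```

For a single hypothetical value such as pcgdp = 20000, the same prediction can be done by hand: \(-465.66368 + 0.09682 \times 20000 \approx 1470.74\).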

Interpretation of the result

After examining the results, make sure you can answer the following questions:

  • Q1. Check the assumptions needed for linear regression by examining the scatter plot.

  • Q2. How well does the linear regression line fit the data?

  • Q3. The null hypothesis is that there is no linear relationship between pch and pcgdp. Write the null and alternative hypotheses as statements about a population parameter.

  • Q4. Give a point estimate and a 95% confidence interval for the parameter of interest in the hypotheses.

  • Q5. Based on your answer to the preceding question, would you reject the null hypothesis at significance level \(\alpha\) = .05? (yes/no)

  • Q6. What are the numeric values of the test statistic and the p-value for the two-sided test of no linear relationship between pch and pcgdp?

  • Q7. Based on your answer to the preceding question, would you reject the null hypothesis at significance level \(\alpha\) = .05?