Lab 3 - Correlation and regression

Correlation

Let $$(x_i, y_i),\ i = 1, \ldots, n$$ be pairs of observations of $$(x, y)$$. Then the sample correlation between $$x$$ and $$y$$ is

$r=\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_{i}-\bar{x}}{s_{x}}\right)\left(\frac{y_{i}-\bar{y}}{s_{y}}\right)$

where $$\bar{x}, \bar{y}$$ denote the sample means of $$x$$ and $$y$$ respectively and $$s_x, s_y$$ denote the sample standard deviations of $$x$$ and $$y$$ respectively.

• Sample correlation measures the strength and direction of the linear relationship between two variables.
• $$r$$ is always between −1 and 1.
• $$r > 0$$ indicates a positive linear association, $$r < 0$$ indicates a negative linear association, and $$r = 0$$ indicates no linear association.
• $$r = 0$$ does not mean there is no relationship: two variables can be related (for example, through a quadratic relationship) and still have $$r = 0$$.
• Sample correlation is unit-free: changing the units of $$x$$ or $$y$$ does not change $$r$$.
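
As a small illustration of the point about quadratic relationships (the data are made up for this example), take the three pairs $$(-1, 1), (0, 0), (1, 1)$$, which satisfy $$y = x^2$$ exactly. Here $$\bar{x} = 0$$ and $$\bar{y} = 2/3$$, so the sum of cross products in the numerator of $$r$$ is

$\sum_{i=1}^{3}(x_i-\bar{x})(y_i-\bar{y}) = (-1)\left(\tfrac{1}{3}\right) + (0)\left(-\tfrac{2}{3}\right) + (1)\left(\tfrac{1}{3}\right) = 0$

and therefore $$r = 0$$, even though $$y$$ is completely determined by $$x$$.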

Correlation in SAS

Let us load the data directly from the web this time. We give SAS the URL of the data set on our professor’s data set site as follows:

filename oec url "http://homepage.divms.uiowa.edu/~kcowles/Datasets/OECD.dat";
data OECD;
infile oec;
input country $13. pcgdp pch beds los docs infmort ;
run ;

• Note that the 13. in the input statement tells SAS the number of characters in the longest country name. Without this information SAS would truncate the country names to 8 characters each.
• We tell SAS that there are 7 variables in the data set (country, pcgdp, pch, beds, los, docs, infmort) and that country is a character variable rather than a numeric one (the $ sign).
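
To check that the data were read as intended, one option is to print the first few rows and list the variable types. The following is a minimal sketch, assuming the OECD data set was created by the data step above; the choice of 5 rows is arbitrary.

```sas
* Print the first 5 observations to check that country names are not truncated ;
proc print data = OECD (obs = 5) ;
run ;

* List variable names, types and lengths (country should be character, length 13) ;
proc contents data = OECD ;
run ;
```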

Scatter plots

Let us draw a scatter plot in SAS. Here is the syntax for drawing a scatter plot:

PROC sgscatter  DATA=DATASET;
PLOT VARIABLE_1 * VARIABLE_2;
RUN;

We have many variables in our data set. Let us pick pch and pcgdp.

proc sgscatter data = OECD ;
title "Scatter plot of pch and pcgdp";
plot pch * pcgdp /
datalabel = country ;
run ;
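
In plot pch * pcgdp the first variable goes on the vertical axis and the second on the horizontal axis. As a hedged alternative that is not part of the lab, the same single plot can be sketched with proc sgplot, where the axis roles are named explicitly:

```sas
* Same scatter plot with explicit axis roles: pcgdp horizontal, pch vertical ;
proc sgplot data = OECD ;
title "Scatter plot of pch and pcgdp";
scatter x = pcgdp y = pch / datalabel = country ;
run ;
```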

Correlation in SAS

We can calculate the sample correlation in SAS using proc corr.

proc corr data = OECD ;
var pcgdp pch ;
run ;

The output is as follows:

The CORR Procedure

2  Variables:    pcgdp    pch

Simple Statistics

Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

pcgdp             30         20381          6752        611441          6720         34536
pch               29          1509     760.95177         43758     232.00000          3898

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

              pcgdp           pch

pcgdp       1.00000       0.87420
                           <.0001
                 30            29

pch         0.87420       1.00000
             <.0001
                 29            29

The sample correlation between pch and pcgdp is 0.87420, which indicates a strong positive linear association between the two variables.
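
The value <.0001 printed under the correlation is the p-value for testing $$H_0: \rho = 0$$. For the Pearson correlation this test is usually based on a $$t$$ statistic with $$n-2$$ degrees of freedom. As a worked check with the $$n = 29$$ complete pairs (intermediate values rounded):

$t = r\sqrt{\frac{n-2}{1-r^{2}}} = 0.87420\sqrt{\frac{27}{1-0.87420^{2}}} \approx 9.36$

A value this large on 27 degrees of freedom is consistent with the reported p-value below 0.0001.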

Regression in SAS

Regression is done with proc reg.

proc reg data = OECD ;
model pch = pcgdp ;           * model <resp vbl> = <explanatory vbl> ;
run ;

The model statement indicates that we want to fit a regression line of the following form:

$\hat{pch} = intercept + slope \times pcgdp$

The following is the output:

The REG Procedure
Model: MODEL1
Dependent Variable: pch

Number of Observations Read                         30
Number of Observations Used                         29
Number of Observations with Missing Values           1

Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     1       12390695       12390695      87.52    <.0001
Error                    27        3822638         141579
Corrected Total          28       16213333

Root MSE            376.27009    R-Square     0.7642
Dependent Mean     1508.89655    Adj R-Sq     0.7555
Coeff Var            24.93677

Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1     -465.66368      222.33244      -2.09      0.0457
pcgdp         1        0.09682        0.01035       9.36      <.0001

Based on this output, the fitted regression line is:

$\hat{pch} = -465.66368 + 0.09682 \times pcgdp$
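
A few consistency checks can be made using only numbers printed in the output above. In simple linear regression, R-Square equals the squared sample correlation, and it is also the ratio of the model sum of squares to the corrected total sum of squares:

$R^{2} = r^{2} = 0.87420^{2} \approx 0.7642$

$R^{2} = \frac{SS_{Model}}{SS_{Total}} = \frac{12390695}{16213333} \approx 0.7642$

Likewise, Root MSE is the square root of the error mean square, $$\sqrt{141579} \approx 376.27$$.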

Prediction using the formula

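How can we predict pch for a country whose pcgdp is $$2000$$ using the above formula? We simply plug $$2000$$ into the formula:

$\hat{pch} = -465.66368 + 0.09682 \times 2000 = -272.0237$

Note that $$2000$$ lies well below the smallest pcgdp in the data ($$6720$$), so this is an extrapolation: the fitted line need not hold that far outside the observed range, which is how a negative prediction can arise.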

SAS can also print the predicted value of pch for each observation in the data: add the p option to the model statement. The id statement labels each row of the output with the country name.

proc reg data = OECD ;
model pch = pcgdp / p ;
id country ;
run ;

Output:

The REG Procedure
Model: MODEL1
Dependent Variable: pch

Output Statistics

                        Dependent    Predicted
Obs    country           Variable        Value     Residual

1    Australia             1775         1731      43.9558
2    Austria               1748         1857    -108.5206
3    Belgium               1708         1867    -159.3642
4    Canada                2065         1903     161.7162
5    CzechRepub             904     806.2369      97.7631
6    Denmark               1802         2079    -276.7183
7    Finland               1380         1631    -251.3215
8    France                2002         1673     328.8531
9    Germany               2278         1745     532.8203
10    Greece                 888     934.6178     -46.6178
11    Hungary                602     553.2509      48.7491
12    Iceland               1893         2080    -187.2674
13    Ireland               1276         1714    -437.6169
14    Italy                 1584         1639     -55.0669
15    Japan                 1677         1869    -191.5260
16    Korea                  537     845.2546    -308.2546
17    Luxembourg            2139         2878    -739.0493
18    Mexico                 358     308.6882      49.3118
19    Netherlands           1766         1769      -3.0938
20    NewZealand            1270         1249      20.8199
21    Norway                1928         2197    -268.5461
22    Poland                 371     307.5264      63.4736
23    Portugal              1071         1012      58.6372
24    Spain                 1115         1155     -40.0728
25    Sweden                1675         1588      86.8594
26    Switzerland           2499         2108     390.6553
27    Turkey                 232     184.9546      47.0454
28    UnitedKingdom         1317         1584    -266.9774
29    UnitedStates          3898         2489         1409
30    predict                  .         1471            .

Sum of Residuals                           0
Sum of Squared Residuals             3822638
Predicted Residual SS (PRESS)        4770165
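
The last row of this output, labelled predict, has a missing pch, so SAS reports a predicted value but no residual. The lab does not show how that row was created. One common way to obtain such a row, sketched here under assumptions, is to append an observation with the desired pcgdp and a missing response before fitting; observations with a missing response are left out of the fit but still receive a predicted value. The data set names newobs and OECD_pred are hypothetical, and the value 20000 is an assumption, chosen only because the fitted line gives about 1471 there. If your OECD data set already contains such a row, the first two steps are unnecessary.

```sas
* Hypothetical extra observation: pcgdp known, pch missing ;
data newobs ;
length country $13 ;
country = "predict" ;
pcgdp = 20000 ;      * assumed value, not taken from the lab ;
pch = . ;
run ;

* Append the extra row to the original data ;
data OECD_pred ;
set OECD newobs ;
run ;

* Rows with missing pch are not used in the fit but still get a predicted value ;
proc reg data = OECD_pred ;
model pch = pcgdp / p ;
id country ;
run ;
```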

Residual plot in SAS

proc reg data = OECD ;
model pch = pcgdp / p ;
title "Residual vs. Predicted value";
plot r. * p. ;
run ;

Note that r. and p. denote the residuals and the predicted values of the model. The output of the above code will be: