# Lab 3 - Correlation and regression

## Correlation

Let $$(x_i, y_i), i = 1, \dots, n$$ be pairs of observations of $$(x, y)$$. Then the sample correlation between $$x$$ and $$y$$ is

$r=\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_{i}-\bar{x}}{s_{x}}\right)\left(\frac{y_{i}-\bar{y}}{s_{y}}\right)$

where $$\bar{x}, \bar{y}$$ denote the sample means of $$x$$ and $$y$$ respectively, and $$s_x, s_y$$ denote the sample standard deviations of $$x$$ and $$y$$ respectively.

• Sample correlation measures the strength of the linear relationship between two variables.
• $$r$$ is always between −1 and 1.
• $$r > 0$$ indicates a positive linear association, $$r < 0$$ indicates a negative linear association, and $$r = 0$$ indicates no linear association.
• $$r = 0$$ does not mean there is no relationship: two variables can be related in a nonlinear way (for example, quadratically) and still have $$r = 0$$.
• Sample correlation is unit free.
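
To see how an exact nonlinear relationship can still give $$r = 0$$, here is a small worked example. The five data points are made up for illustration and are not part of the lab's data. Take $$x = -2, -1, 0, 1, 2$$ and $$y = x^2$$, so that $$y = 4, 1, 0, 1, 4$$, $$\bar{x} = 0$$ and $$\bar{y} = 2$$. The sum of cross products in the formula for $$r$$ is

$\sum_{i=1}^{5}(x_i-\bar{x})(y_i-\bar{y}) = (-2)(2) + (-1)(-1) + (0)(-2) + (1)(-1) + (2)(2) = 0$

Dividing by $$(n-1)s_x s_y$$ leaves it at zero, so $$r = 0$$ even though $$y$$ is completely determined by $$x$$.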

## Correlation in SAS

This time, let us load the data directly from the web. We pass the URL of the file on our professor’s data set site to a filename statement as follows:

filename oec url "http://homepage.divms.uiowa.edu/~kcowles/Datasets/OECD.dat";
data OECD;
infile oec;
input country $13. pcgdp pch beds los docs infmort ;
run ;

• Note that the 13. in the input statement tells SAS the number of characters in the longest country name. Without this information SAS would truncate the country names to 8 letters each.
• The input statement tells SAS that there are 7 variables in the dataset (country, pcgdp, pch, beds, los, docs and infmort) and that country is a character variable rather than a numeric one (the $ sign).
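
Before going further, it can help to confirm that the data step read the file as intended. The following is a minimal sketch rather than part of the original lab; it only assumes that the OECD data set created above exists.

```sas
* Quick check that the data step read the file as intended ;
proc contents data = OECD ;          * lists variable names, types and lengths ;
run ;

proc print data = OECD (obs = 5) ;   * shows the first five rows only ;
run ;
```

In the proc contents listing, country should appear as a character variable of length 13 and the remaining six variables as numeric.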

### Scatter plots

Let us draw a scatter plot in SAS. The general syntax is:

PROC sgscatter  DATA=DATASET;
PLOT VARIABLE_1 * VARIABLE_2;
RUN;

We have many variables in our data set. Let us pick pch and pcgdp.

proc sgscatter data = OECD ;
title "Scatter plot of pch and pcgdp";
plot pch * pcgdp /
datalabel = country ;
run ;
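
As a preview of the regression section, the same scatter plot can be drawn with a least-squares line laid over the points. This sketch uses proc sgplot instead of the proc sgscatter call used in this lab, and the title text is an illustrative choice.

```sas
* Sketch: the same scatter plot with a least-squares line overlaid ;
proc sgplot data = OECD ;
   title "pch versus pcgdp with fitted line" ;
   scatter x = pcgdp y = pch / datalabel = country ;
   reg x = pcgdp y = pch / nomarkers ;   * draws the line only, not the points again ;
run ;
title ;   * clear the title so it does not carry over to later output ;
```

Here pch is on the vertical axis and pcgdp on the horizontal axis, matching plot pch * pcgdp above.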

### Correlation in SAS

We can calculate the sample correlation in SAS with proc corr.

proc corr data = OECD ;
var pcgdp pch ;
run ;

The output is as follows:

                                      The CORR Procedure

2  Variables:    pcgdp    pch

Simple Statistics

Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

pcgdp             30         20381          6752        611441          6720         34536
pch               29          1509     760.95177         43758     232.00000          3898

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

                pcgdp           pch

pcgdp         1.00000       0.87420
                             <.0001
                   30            29

pch           0.87420       1.00000
               <.0001
                   29            29

The sample correlation between pch and pcgdp is 0.87420, which indicates a strong positive linear association between the two variables.
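
The claim that sample correlation is unit free can be checked directly. The sketch below rescales pcgdp to thousands and correlates pch with both versions. The names OECD_scaled and pcgdp_k are hypothetical and introduced only for this illustration.

```sas
* Sketch: correlation does not change when a variable is rescaled ;
data OECD_scaled ;
   set OECD ;
   pcgdp_k = pcgdp / 1000 ;   * same variable, expressed in thousands ;
run ;

proc corr data = OECD_scaled ;
   var pcgdp pcgdp_k ;        * columns of the correlation table ;
   with pch ;                 * row of the correlation table ;
run ;
```

Because dividing by a positive constant rescales $$x_i - \bar{x}$$ and $$s_x$$ by the same factor, both correlations with pch should be identical.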

## Regression in SAS

A linear regression can be fit with proc reg.

proc reg data = OECD ;
model pch = pcgdp ;           * model <resp vbl> = <explanatory vbl> ;
run ;

The model statement specifies a regression line of the following form:

$\widehat{\text{pch}} = \text{intercept} + \text{slope} \times \text{pcgdp}$

The following is the output:

                                       The REG Procedure
Model: MODEL1
Dependent Variable: pch

Number of Observations Used                         29
Number of Observations with Missing Values           1

Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     1       12390695       12390695      87.52    <.0001
Error                    27        3822638         141579
Corrected Total          28       16213333

Root MSE            376.27009    R-Square     0.7642
Dependent Mean     1508.89655    Adj R-Sq     0.7555
Coeff Var            24.93677

Parameter Estimates

                      Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1     -465.66368      222.33244      -2.09      0.0457
pcgdp         1        0.09682        0.01035       9.36      <.0001

From the parameter estimates in the output, the fitted regression line is:

$\widehat{\text{pch}} = -465.66368 + 0.09682 \times \text{pcgdp}$
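
Several numbers in the output are tied together, and checking them by hand is a useful way to read the table. The following worked arithmetic uses only values printed in the outputs above, with results rounded.

$R^2 = \frac{\text{SS}_{\text{Model}}}{\text{SS}_{\text{Corrected Total}}} = \frac{12390695}{16213333} \approx 0.7642$

$r^2 = (0.87420)^2 \approx 0.7642$

$\text{MSE} = \frac{\text{SS}_{\text{Error}}}{\text{DF}_{\text{Error}}} = \frac{3822638}{27} \approx 141579, \qquad \text{Root MSE} = \sqrt{141579} \approx 376.27$

So with a single explanatory variable, R-Square is the square of the sample correlation reported by proc corr, and Root MSE is the square root of the error mean square.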

### Prediction using the formula

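How can we predict the pch of a country whose pcgdp is $$2000$$ using the above formula? We simply plug $$2000$$ into the formula:

$\widehat{\text{pch}} = -465.66368 + 0.09682 \times 2000 = -272.0237$

Note that $$2000$$ lies well below the smallest pcgdp in the data ($$6720$$), so this prediction is an extrapolation, and the negative value is not a meaningful estimate of pch.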

SAS can also report the predicted value of pch for each data point. To request this, add the p option to the model statement:

proc reg data = OECD ;
model pch = pcgdp / p ;
id country ;
run ;

Output:

                                       The REG Procedure
Model: MODEL1
Dependent Variable: pch

Output Statistics

                            Dependent    Predicted
Obs    country           Variable        Value     Residual

1    Australia             1775         1731      43.9558
2    Austria               1748         1857    -108.5206
3    Belgium               1708         1867    -159.3642
5    CzechRepub             904     806.2369      97.7631
6    Denmark               1802         2079    -276.7183
7    Finland               1380         1631    -251.3215
8    France                2002         1673     328.8531
9    Germany               2278         1745     532.8203
10    Greece                 888     934.6178     -46.6178
11    Hungary                602     553.2509      48.7491
12    Iceland               1893         2080    -187.2674
13    Ireland               1276         1714    -437.6169
14    Italy                 1584         1639     -55.0669
15    Japan                 1677         1869    -191.5260
16    Korea                  537     845.2546    -308.2546
17    Luxembourg            2139         2878    -739.0493
18    Mexico                 358     308.6882      49.3118
19    Netherlands           1766         1769      -3.0938
20    NewZealand            1270         1249      20.8199
21    Norway                1928         2197    -268.5461
22    Poland                 371     307.5264      63.4736
23    Portugal              1071         1012      58.6372
24    Spain                 1115         1155     -40.0728
25    Sweden                1675         1588      86.8594
26    Switzerland           2499         2108     390.6553
27    Turkey                 232     184.9546      47.0454
28    UnitedKingdom         1317         1584    -266.9774
29    UnitedStates          3898         2489         1409
30    predict                  .         1471            .

Sum of Residuals                           0
Sum of Squared Residuals             3822638
Predicted Residual SS (PRESS)        4770165
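
The last row of the output, labelled predict, has a missing pch but still receives a predicted value. One common way to obtain such a row is to append an observation that has the pcgdp value of interest and a missing response, since proc reg leaves it out of the fit but still computes its prediction. The sketch below is an assumption about how that row could be produced, not the lab's own code. The data set names newobs and OECD2 are hypothetical, and the value 20000 is a guess chosen because $$-465.66368 + 0.09682 \times 20000 \approx 1471$$ matches the predicted value shown.

```sas
* Hypothetical extra row: pcgdp value of interest, response left missing ;
data newobs ;
   length country $13 ;
   country = "predict" ;
   pcgdp = 20000 ;      * assumed value, not stated in the lab ;
   pch = . ;            * missing response, so the row is not used in the fit ;
run ;

* Stack the new row under the original data ;
data OECD2 ;
   set OECD newobs ;
run ;

proc reg data = OECD2 ;
   model pch = pcgdp / p ;
   id country ;
run ;
quit ;
```

The parameter estimates are unchanged by the extra row, because observations with a missing response are excluded from estimation.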

### Residual plot in SAS

proc reg data = OECD ;
model pch = pcgdp / p ;
title "Residual vs. Predicted value";
plot r. * p. ;
run ;

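Note that r. and p. denote the residuals and the predicted values of the model, so plot r. * p. puts the residuals on the vertical axis and the predicted values on the horizontal axis. The code above produces a residual plot titled "Residual vs. Predicted value".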
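
An alternative way to get the same picture is to save the residuals and predicted values to a data set and plot them separately. This is a sketch under stated assumptions, not the lab's method. The data set name fit and the variable names yhat and resid are hypothetical.

```sas
* Sketch: save predicted values and residuals, then plot them ;
proc reg data = OECD noprint ;
   model pch = pcgdp ;
   output out = fit p = yhat r = resid ;   * p = predicted, r = residual ;
run ;
quit ;

proc sgplot data = fit ;
   title "Residual vs. Predicted value" ;
   scatter x = yhat y = resid ;
   refline 0 / axis = y ;                  * horizontal reference line at zero ;
run ;
title ;
```

Saving the residuals this way also makes them available for other checks, such as listing the observations with the largest residuals.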