# Lab 3 - Correlation and regression

## Correlation

Let \((x_i, y_i),\ i = 1, \dots, n\) be pairs of observations of \((x, y)\). Then the sample correlation between \(x\) and \(y\) is

\[ r=\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_{i}-\bar{x}}{s_{x}}\right)\left(\frac{y_{i}-\bar{y}}{s_{y}}\right) \]

where \(\bar{x}, \bar{y}\) denote the sample means of \(x\) and \(y\) respectively and \(s_x, s_y\) denote the sample standard deviations of \(x\) and \(y\) respectively.

- Sample correlation is a measure of linear relationship between two variables.
- \(r\) is always between −1 and 1.
- \(r > 0\) indicates positive linear association, \(r < 0\) indicates negative linear association and \(r = 0\) indicates no linear association.
- \(r = 0\) doesn’t mean there is no relationship: two variables can be related in some other way (such as a quadratic relationship) and still have \(r = 0\).
- Sample correlation is unit free.
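To illustrate the point that \(r = 0\) does not rule out a relationship, here is a small worked example with made-up data (not taken from the lab's data set). Take the three pairs \((-1, 1), (0, 0), (1, 1)\), which satisfy \(y = x^2\) exactly. Here \(\bar{x} = 0\) and \(\bar{y} = 2/3\), so

\[ \sum_{i=1}^{3}(x_i-\bar{x})(y_i-\bar{y}) = (-1)\left(\tfrac{1}{3}\right) + (0)\left(-\tfrac{2}{3}\right) + (1)\left(\tfrac{1}{3}\right) = 0 \]

and therefore \(r = 0\), even though \(y\) is completely determined by \(x\).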

## Correlation in SAS

Let us load the data directly from the website this time. We pass the URL of the data set on our professor’s data set site to a `filename` statement as follows:

```
filename oec url "http://homepage.divms.uiowa.edu/~kcowles/Datasets/OECD.dat";
data OECD;
infile oec;
input country $ 13. pcgdp pch beds los docs infmort ;
run ;
```

Note that the `13.` in the `input` statement tells SAS the number of characters in the longest country name. Without this information, **SAS would truncate the country names to 8 letters each**. The `input` statement also tells SAS that there are 7 variables in the data set, `country pcgdp pch beds los docs infmort`, and that `country` is not a numerical variable (the `$` sign).
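Before going further, it can help to confirm that the data were read as intended. The following is a minimal sketch, not part of the original lab code; the choice of 5 rows is arbitrary.

```
* Print the first 5 rows to check that the country names are not truncated ;
proc print data = OECD (obs = 5) ;
run ;

* List the variables and their types (country should be character) ;
proc contents data = OECD ;
run ;
```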

### Scatter plots

Let us draw a scatter plot in SAS. The general syntax is:

```
PROC sgscatter DATA=DATASET;
PLOT VARIABLE_1 * VARIABLE_2;
RUN;
```

We have many variables in our data set. Let us pick `pch` and `pcgdp`.

```
proc sgscatter data = OECD ;
title "Scatter plot of pch and pcgdp";
plot pch * pcgdp /
datalabel = country ;
run ;
```

### Correlation in SAS

We can calculate the sample correlation in SAS by using `proc corr`.

```
proc corr data = OECD ;
var pcgdp pch ;
run ;
```

The output is as follows:

```
The CORR Procedure

2 Variables:    pcgdp    pch

                              Simple Statistics
Variable     N        Mean     Std Dev         Sum     Minimum     Maximum
pcgdp       30       20381        6752      611441        6720       34536
pch         29        1509   760.95177       43758   232.00000        3898

        Pearson Correlation Coefficients
          Prob > |r| under H0: Rho=0
            Number of Observations
              pcgdp          pch
pcgdp       1.00000      0.87420
                          <.0001
                 30           29
pch         0.87420      1.00000
             <.0001
                 29           29
```

The sample correlation between `pch` and `pcgdp` is 0.87420, which indicates a strong positive linear association between the two variables.

## Regression in SAS

Regression is done with `proc reg`.

```
proc reg data = OECD ;
model pch = pcgdp ; * model <resp vbl> = <explanatory vbl> ;
run ;
```

The `model` statement indicates that we want to fit a regression line of the following form:

\[ \hat{pch} = intercept + slope \times pcgdp \]

The following is the output:

```
The REG Procedure
Model: MODEL1
Dependent Variable: pch

Number of Observations Read                          30
Number of Observations Used                          29
Number of Observations with Missing Values            1

                        Analysis of Variance
                               Sum of         Mean
Source              DF        Squares       Square    F Value    Pr > F
Model                1       12390695     12390695      87.52    <.0001
Error               27        3822638       141579
Corrected Total     28       16213333

Root MSE          376.27009    R-Square    0.7642
Dependent Mean   1508.89655    Adj R-Sq    0.7555
Coeff Var          24.93677

                        Parameter Estimates
                     Parameter     Standard
Variable     DF       Estimate        Error    t Value    Pr > |t|
Intercept     1     -465.66368    222.33244      -2.09      0.0457
pcgdp         1        0.09682      0.01035       9.36      <.0001

From the parameter estimates in the output, the fitted regression line is:

\[ \hat{pch} = -465.66368 + 0.09682 \times pcgdp \]
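As a quick consistency check on the correlation and regression outputs (a side calculation, not part of the SAS output): in simple linear regression, R-Square equals the squared sample correlation computed from the same observations, and it also equals the model sum of squares divided by the corrected total sum of squares.

\[ r^2 = 0.87420^2 \approx 0.7642, \qquad \frac{12390695}{16213333} \approx 0.7642 \]

Both agree with the reported R-Square of 0.7642, so about 76% of the variation in `pch` is explained by its linear relationship with `pcgdp`.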

### Prediction using the formula

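How can we predict the `pch` of a country whose `pcgdp` is \(2000\) using the above formula? We can simply plug \(2000\) into the formula:

\[ \hat{pch} = -465.66368 + 0.09682 \times 2000 = -272.0237 \]

Note that \(2000\) lies well below the smallest `pcgdp` in the data (6720), so this is an extrapolation, and the negative predicted value is not meaningful.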

SAS can also print the predicted value of `pch` for each data point: add the `p` option to the `model` statement.

```
proc reg data = OECD ;
model pch = pcgdp / p ;
id country ;
run ;
```

Output:

```
The REG Procedure
Model: MODEL1
Dependent Variable: pch

                     Output Statistics
                        Dependent    Predicted
Obs    country           Variable        Value     Residual
  1    Australia             1775         1731      43.9558
  2    Austria               1748         1857    -108.5206
  3    Belgium               1708         1867    -159.3642
  4    Canada                2065         1903     161.7162
  5    CzechRepub             904     806.2369      97.7631
  6    Denmark               1802         2079    -276.7183
  7    Finland               1380         1631    -251.3215
  8    France                2002         1673     328.8531
  9    Germany               2278         1745     532.8203
 10    Greece                 888     934.6178     -46.6178
 11    Hungary                602     553.2509      48.7491
 12    Iceland               1893         2080    -187.2674
 13    Ireland               1276         1714    -437.6169
 14    Italy                 1584         1639     -55.0669
 15    Japan                 1677         1869    -191.5260
 16    Korea                  537     845.2546    -308.2546
 17    Luxembourg            2139         2878    -739.0493
 18    Mexico                 358     308.6882      49.3118
 19    Netherlands           1766         1769      -3.0938
 20    NewZealand            1270         1249      20.8199
 21    Norway                1928         2197    -268.5461
 22    Poland                 371     307.5264      63.4736
 23    Portugal              1071         1012      58.6372
 24    Spain                 1115         1155     -40.0728
 25    Sweden                1675         1588      86.8594
 26    Switzerland           2499         2108     390.6553
 27    Turkey                 232     184.9546      47.0454
 28    UnitedKingdom         1317         1584    -266.9774
 29    UnitedStates          3898         2489         1409
 30    predict                  .         1471            .

Sum of Residuals                           0
Sum of Squared Residuals             3822638
Predicted Residual SS (PRESS)        4770165
```
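The last row of the output, labelled `predict`, has a missing `pch` but a predicted value of 1471. A row like this can be obtained by appending an observation that has a `pcgdp` value and a missing response: `proc reg` leaves such a row out of the fit but still prints its predicted value. The sketch below shows one way to do this. The data set names `new_obs` and `OECD_pred` are hypothetical, and the value 20000 is an assumption, chosen because \(-465.66368 + 0.09682 \times 20000 \approx 1470.7\) matches the printed 1471; the lab text does not state which value was used.

```
* One extra row: a pcgdp value with a missing response ;
* (20000 is an assumed value, not given in the lab text) ;
data new_obs ;
length country $ 13 ;
country = "predict" ;
pcgdp = 20000 ;
pch = . ;
run ;

* Append the extra row to the original data ;
data OECD_pred ;
set OECD new_obs ;
run ;

* Rows with a missing pch are not used in the fit, ;
* but the p option still prints their predicted values ;
proc reg data = OECD_pred ;
model pch = pcgdp / p ;
id country ;
run ;
```

Because the appended row has no response, it should not change the estimated intercept and slope.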

### Residual plot in SAS

```
proc reg data = OECD ;
model pch = pcgdp / p ;
title "Residual vs. Predicted value";
plot r. * p. ;
run ;
```

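Note that `r.` and `p.` represent the residuals and the predicted values from the model, so the `plot` statement puts the residuals on the vertical axis and the predicted values on the horizontal axis. The output of the above code will be: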