Lab 3 - Correlation and regression

Correlation

Let \((x_i, y_i),\ i = 1, \dots, n\), be pairs of observations of \((x, y)\). Then the sample correlation between \(x\) and \(y\) is

\[ r=\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_{i}-\bar{x}}{s_{x}}\right)\left(\frac{y_{i}-\bar{y}}{s_{y}}\right) \] where \(\bar{x}, \bar{y}\) denote the sample means of \(x\) and \(y\) respectively and \(s_x, s_y\) denote the sample standard deviations of \(x\) and \(y\) respectively.

  • Sample correlation measures the strength and direction of the linear relationship between two variables.
  • \(r\) is always between −1 and 1.
  • \(r > 0\) indicates a positive linear association, \(r < 0\) indicates a negative linear association, and \(r = 0\) indicates no linear association.
  • \(r = 0\) does not mean there is no relationship: two variables can have some relationship (such as a quadratic one) while having \(r = 0\).
  • Sample correlation is unit free.
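
The point about \(r = 0\) can be checked directly in SAS. The following is a minimal sketch and not part of the original lab: the data set name quad and the variables x and y are made up for illustration. Because the x values are symmetric about 0 and y is the square of x, the products in the formula for \(r\) cancel, so proc corr should report a correlation of 0 even though y is completely determined by x.

```sas
/* Hypothetical toy data: y is an exact quadratic function of x */
data quad;
  do x = -3 to 3;
    y = x**2;
    output;
  end;
run;

/* The Pearson correlation of x and y should come out as 0 */
proc corr data = quad;
  var x y;
run;
```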

Correlation in SAS

This time, let us load the data directly from the web. We pass the URL of the data set on our professor’s data set site to a filename statement as follows:

filename oec url "http://homepage.divms.uiowa.edu/~kcowles/Datasets/OECD.dat";
data OECD;
infile oec;
input country $ 13. pcgdp pch beds los docs infmort ;
run ;
  • Note that the 13. in the input statement tells SAS the number of characters in the longest country name. Without this information SAS would truncate the country names to 8 letters each.

  • We tell SAS that there are seven variables in the data set (country, pcgdp, pch, beds, los, docs and infmort) and that country is a character variable rather than a numeric one (the $ sign).
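
Before going further it can help to confirm that the data were read as intended. As a small sketch (the choice of five rows is arbitrary), proc print with the obs= data set option lists the first few observations, so you can check that the country names are not truncated and that the numeric columns line up:

```sas
/* Print only the first 5 observations of OECD as a sanity check */
proc print data = OECD (obs = 5);
run;
```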

Scatter plots

Let us draw a scatter plot in SAS. The general syntax is:

PROC sgscatter  DATA=DATASET;
   PLOT VARIABLE_1 * VARIABLE_2;
RUN;

We have many variables in our data set. Let us pick pch and pcgdp.

proc sgscatter data = OECD ;
title "Scatter plot of pch and pcgdp";
  plot pch * pcgdp /
    datalabel = country ;
run ;

Correlation in SAS

We can calculate the sample correlation in SAS with proc corr.

proc corr data = OECD ;
var pcgdp pch ;
run ;

The output is as follows:

                                      The CORR Procedure

                               2  Variables:    pcgdp    pch


                                      Simple Statistics

  Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

  pcgdp             30         20381          6752        611441          6720         34536
  pch               29          1509     760.95177         43758     232.00000          3898


                               Pearson Correlation Coefficients
                                  Prob > |r| under H0: Rho=0
                                    Number of Observations

                                             pcgdp           pch

                               pcgdp       1.00000       0.87420
                                                          <.0001
                                                30            29

                               pch         0.87420       1.00000
                                            <.0001
                                                29            29

The sample correlation between pch and pcgdp is 0.87420, which indicates that the two variables have a strong positive linear association.
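
In the output above, N is 30 for pcgdp but 29 for pch, because one observation has a missing pch value. By default proc corr uses every observation that is available for each statistic. As a sketch, the nomiss option restricts all calculations to observations with no missing values among the var variables. The correlation itself should not change, since it was already based on the 29 complete pairs, but the simple statistics for pcgdp would then be computed from the same 29 observations.

```sas
/* nomiss: drop any observation with a missing value in pcgdp or pch */
proc corr data = OECD nomiss;
  var pcgdp pch;
run;
```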

Regression in SAS

Regression is done with proc reg.

proc reg data = OECD ;
model pch = pcgdp ;           * model <resp vbl> = <explanatory vbl> ;
run ;

The model statement indicates that we want to fit a regression line of the following form:

\[ \hat{pch} = intercept + slope \times pcgdp \]

The following is the output:

                                       The REG Procedure
                                         Model: MODEL1
                                   Dependent Variable: pch

                    Number of Observations Read                         30
                    Number of Observations Used                         29
                    Number of Observations with Missing Values           1


                                     Analysis of Variance

                                            Sum of           Mean
        Source                   DF        Squares         Square    F Value    Pr > F

        Model                     1       12390695       12390695      87.52    <.0001
        Error                    27        3822638         141579
        Corrected Total          28       16213333


                     Root MSE            376.27009    R-Square     0.7642
                     Dependent Mean     1508.89655    Adj R-Sq     0.7555
                     Coeff Var            24.93677


                                     Parameter Estimates

                                  Parameter       Standard
             Variable     DF       Estimate          Error    t Value    Pr > |t|

             Intercept     1     -465.66368      222.33244      -2.09      0.0457
             pcgdp         1        0.09682        0.01035       9.36      <.0001

Based on the parameter estimates above, the fitted regression line is:

\[ \hat{pch} = -465.66368 + 0.09682 \times pcgdp \]
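
The regression output also ties back to the correlation computed earlier. For a regression with a single explanatory variable, R-Square equals the square of the sample correlation between the two variables. As a quick check with the reported numbers:

\[ r^2 = (0.87420)^2 \approx 0.7642 = \text{R-Square} \]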

Prediction using the formula

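How can we predict the pch of a country whose pcgdp is \(2000\) using the above formula? We can simply plug \(2000\) into the formula:

\[ \hat{pch} = -465.66368 + 0.09682 \times 2000 = -272.0237 \]

Note that \(2000\) lies far below the smallest pcgdp in the data (6720, see the Simple Statistics table), so this is an extrapolation and the negative predicted value should not be taken literally.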
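
The same arithmetic can be done in a data step. This is only a sketch: the data set name pred and the variable name pch_hat are made up for illustration, and the coefficients are copied by hand from the Parameter Estimates table. It should reproduce the value computed above.

```sas
/* Plug a chosen pcgdp value into the fitted line by hand */
data pred;
  pcgdp   = 2000;
  pch_hat = -465.66368 + 0.09682 * pcgdp;   /* intercept + slope * pcgdp */
run;

proc print data = pred;
run;
```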

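SAS can also print the predicted value of pch for each data point. The p option on the model statement requests predicted values and residuals, and the id statement labels each row with the country name. An observation whose pch is missing is not used to fit the line but still receives a predicted value.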

proc reg data = OECD ;
model pch = pcgdp / p ;
id country ;
run ;

Output:

                                       The REG Procedure
                                         Model: MODEL1
                                   Dependent Variable: pch

                                       Output Statistics

                                          Dependent    Predicted
                  Obs    country           Variable        Value     Residual

                    1    Australia             1775         1731      43.9558
                    2    Austria               1748         1857    -108.5206
                    3    Belgium               1708         1867    -159.3642
                    4    Canada                2065         1903     161.7162
                    5    CzechRepub             904     806.2369      97.7631
                    6    Denmark               1802         2079    -276.7183
                    7    Finland               1380         1631    -251.3215
                    8    France                2002         1673     328.8531
                    9    Germany               2278         1745     532.8203
                   10    Greece                 888     934.6178     -46.6178
                   11    Hungary                602     553.2509      48.7491
                   12    Iceland               1893         2080    -187.2674
                   13    Ireland               1276         1714    -437.6169
                   14    Italy                 1584         1639     -55.0669
                   15    Japan                 1677         1869    -191.5260
                   16    Korea                  537     845.2546    -308.2546
                   17    Luxembourg            2139         2878    -739.0493
                   18    Mexico                 358     308.6882      49.3118
                   19    Netherlands           1766         1769      -3.0938
                   20    NewZealand            1270         1249      20.8199
                   21    Norway                1928         2197    -268.5461
                   22    Poland                 371     307.5264      63.4736
                   23    Portugal              1071         1012      58.6372
                   24    Spain                 1115         1155     -40.0728
                   25    Sweden                1675         1588      86.8594
                   26    Switzerland           2499         2108     390.6553
                   27    Turkey                 232     184.9546      47.0454
                   28    UnitedKingdom         1317         1584    -266.9774
                   29    UnitedStates          3898         2489         1409
                   30    predict                  .         1471            .


                         Sum of Residuals                           0
                         Sum of Squared Residuals             3822638
                         Predicted Residual SS (PRESS)        4770165

Residual plot in SAS

proc reg data = OECD ;
model pch = pcgdp / p ;
title "Residual vs. Predicted value";
plot r. * p. ;
run ;

Note that r. and p. represent the residuals and the predicted values of the model, so the code above plots the residuals against the predicted values.
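
An alternative way to draw the same kind of plot, sketched here under the assumption that proc sgplot is available in your SAS installation, is to save the predicted values and residuals to a data set with the output statement and plot them yourself. The names fit, pred and resid are made up for illustration.

```sas
/* Save predicted values and residuals without printing the usual tables */
proc reg data = OECD noprint;
  model pch = pcgdp;
  output out = fit p = pred r = resid;
run;
quit;

/* Residuals against predicted values, with a reference line at 0 */
proc sgplot data = fit;
  title "Residual vs. Predicted value";
  scatter x = pred y = resid;
  refline 0 / axis = y;
run;
```

A residual plot with no visible pattern around the zero line is consistent with the straight-line model being adequate.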
