The Coefficients of Linear Regression in R

gmp package, Cholesky and QR decomposition

This article is one of my portfolio articles. I should note that it is based on a wonderful course that I took at the University of Iowa: the Computer Intensive Statistics course.

The following are the R packages used in this post.

library(datasets)
library(magrittr)
library(microbenchmark)
library(gmp)

Normal equation for the beta vector

Today, I want to talk a little bit about the process of obtaining the coefficients of a linear model in R. Among the many available methods, our main focus today is the Cholesky decomposition and the QR decomposition.

The reason why we are discussing matrix decomposition methods for linear regression is that the regression coefficients are the solution of a matrix equation called the Normal equation.

Suppose we have the following \(n\) observations in a vector \(y\) and the data matrix \(X\). If we assume that the vector \(y\) follows a linear model with noise vector \(\mathbf{e}\), then the situation can be written as follows:

\[ \begin{eqnarray*} \mathbf{y}_{n\times1} & = & X_{n\times p}\boldsymbol{\beta}_{p\times1}+\mathbf{e}_{n\times1}\\ & = & \left[\begin{array}{ccccc} \mathbf{1} & \mathbf{x}_{1} & \mathbf{x}_{2} & ... & \mathbf{x}_{p-1}\end{array}\right]\left[\begin{array}{c} \beta_{0}\\ \beta_{1}\\ \vdots\\ \beta_{p-1} \end{array}\right]+\mathbf{e}\\ & = & \beta_{0}\mathbf{1}+\beta_{1}\mathbf{x}_{1}+\beta_{2}\mathbf{x}_{2}+...+\beta_{p-1}\mathbf{x}_{p-1}+\mathbf{e} \end{eqnarray*} \]

Note that \(\mathbf{1}\), \(\mathbf{e}\) and \(\mathbf{x}_{i}\), \(i=1,...,p-1\) are \(n \times 1\) vectors.

Under this assumption, the coefficients of the linear model, \(\beta\), which minimize the residual sum of squares (RSS), can be obtained by solving the following Normal equation:

\[ X^{T}X\mathbf{\beta}=X^{T}\mathbf{y} \]

The details about the Normal equation can be found on Wikipedia.
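
Before discussing decompositions, here is a minimal sketch of what solving the Normal equation directly looks like in R, using a small simulated data set (not the longley data used later); the decomposition methods below can be seen as more careful ways of doing this one call to solve().

# hypothetical toy data, only for illustration
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y_toy <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
X_toy <- cbind(constant = 1, x1, x2)

# solve the Normal equation (X'X) beta = X'y directly
solve(crossprod(X_toy), crossprod(X_toy, y_toy))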

longley data set

Figure: the Longley data set as it appears in the original paper.

The Longley data set appears in the appendix of the paper by J. W. Longley (1967): An appraisal of least-squares programs from the point of view of the user. It is a small data set with 16 yearly observations on 7 economic variables, and it is famous for the high correlation among the explanatory variables when we regress the Employed variable on the other six. The data set is contained in the datasets package, so let us load it.

library(datasets)
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling() and scroll_box()

longley
kable(longley, format = "html") %>% 
  kable_styling() %>% 
  scroll_box(width = "100%", height = "300px")
     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
1951         96.2 328.975      209.9        309.9    112.075 1951   63.221
1952         98.1 346.999      193.2        359.4    113.270 1952   63.639
1953         99.0 365.385      187.0        354.7    115.094 1953   64.989
1954        100.0 363.112      357.8        335.0    116.219 1954   63.761
1955        101.2 397.469      290.4        304.8    117.388 1955   66.019
1956        104.6 419.180      282.2        285.7    118.734 1956   67.857
1957        108.4 442.769      293.6        279.8    120.445 1957   68.169
1958        110.8 444.546      468.1        263.7    121.950 1958   66.513
1959        112.6 482.704      381.3        255.2    123.366 1959   68.655
1960        114.2 502.601      393.1        251.4    125.368 1960   69.564
1961        115.7 518.173      480.6        257.2    127.852 1961   69.331
1962        116.9 554.894      400.7        282.7    130.081 1962   70.551

We can see that the longley data set in R has been rescaled compared to the original data in the paper.

Calculating accurate coefficients using gmp package

The gmp package uses fractions (rational numbers) for its calculations instead of floating-point numbers. This allows us to obtain essentially exact estimates of the coefficients, since it avoids floating-point rounding error. Using the solve() function, let us calculate the coefficients of the linear regression of the Employed variable on the rest of the variables in the longley data set.

y <- as.vector(longley[,7])
X <- as.matrix(cbind(constant = 1, longley[,-7]))

# install.packages("gmp")
# library(gmp)

# X, y rationals
r_X <- as.bigq(round(1000 * X)) / as.bigq(1000)
r_y <- as.bigq(round(1000 * y)) / as.bigq(1000)
head(r_X)
## Big Rational ('bigq') 6 x 7 matrix:
##      [,1] [,2]   [,3]        [,4]    [,5]    [,6]        [,7]
## [1,] 1    83     234289/1000 1178/5  159     13451/125   1947
## [2,] 1    177/2  129713/500  465/2   728/5   13579/125   1948
## [3,] 1    441/5  129027/500  1841/5  808/5   109773/1000 1949
## [4,] 1    179/2  284599/1000 3351/10 165     110929/1000 1950
## [5,] 1    481/5  13159/40    2099/10 3099/10 4483/40     1951
## [6,] 1    981/10 346999/1000 966/5   1797/5  11327/100   1952
# coefficients
cef_exact <- as.double(solve(t(r_X) %*% r_X,
                             (t(r_X) %*% r_y)))
cef_exact
## [1] -3.482259e+03  1.506187e-02 -3.581918e-02 -2.020230e-02 -1.033227e-02
## [6] -5.110411e-02  1.829151e+00

Cholesky decomposition

If the matrix \(A\) satisfies \(x^T \mathbf{A} x > 0\) for every nonzero vector \(x\), then the matrix \(A\) is positive definite.

The Cholesky decomposition can be applied to symmetric positive definite matrices. If a matrix \(A\) is a symmetric positive definite matrix, then there exists a lower triangular matrix \(L\) such that

\[ \mathbf{A} = \mathbf{L} \mathbf{L}^T \]
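
As a quick sanity check (a minimal sketch with a small made-up positive definite matrix), note that R's chol() returns the upper triangular factor \(R = L^T\), so \(A\) is recovered as t(R) %*% R:

# a small symmetric positive definite matrix (hypothetical example)
A <- matrix(c(4, 2,
              2, 3), nrow = 2, byrow = TRUE)

R <- chol(A)              # upper triangular factor, so A = t(R) %*% R
all.equal(t(R) %*% R, A)  # should be TRUE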

Thus, if our data matrix has full column rank, in other words, if the matrix \(X^TX\) is positive definite, then the Normal equation can be written as follows:

\[ \begin{align*} \mathbf{X}^{T}\mathbf{X}\beta & =\mathbf{X}^{T}y\\ \Rightarrow\mathbf{L}\mathbf{L}^{T}\beta & =\mathbf{X}^{T}y\\ \Rightarrow\mathbf{L}\left(\mathbf{L}^{T}\beta\right) & =\mathbf{X}^{T}y\tag{1} \end{align*} \]

Note that the matrix \(\mathbf{X}^T \mathbf{X}\) is always symmetric. Thus, the \(\beta\) vector can be obtained by the following two steps:

  1. Solve \(\mathbf{L}\mathbf{z}=\mathbf{X}^{T}\mathbf{y}\) for \(\mathbf{z}\) (forward solve of a lower triangular system).
  2. Solve \(\mathbf{L}^{T}\boldsymbol{\beta}=\mathbf{z}\) for \(\boldsymbol{\beta}\) (backward solve of an upper triangular system).

Is there any benefit to using an upper or lower triangular matrix when solving the equation? Yes. R has fast functions for solving a system of linear equations when the matrix in the equation is an upper or lower triangular matrix: backsolve and forwardsolve. Also, look at the algorithm for finding the inverse of a triangular matrix of any size \(n \times n\). Here are link 1 and link 2. As the number of variables used in linear regressions has increased over the years, finding an efficient way of solving the Normal equation has become a very important topic.
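
As a minimal illustration (with a small made-up triangular system, not the regression problem itself), forwardsolve() and backsolve() carry out exactly the two substitution steps listed above:

# a lower triangular system L z = b (hypothetical 3 x 3 example)
L <- matrix(c(2, 0, 0,
              1, 3, 0,
              4, 5, 6), nrow = 3, byrow = TRUE)
b <- c(2, 5, 32)

z <- forwardsolve(L, b)   # forward substitution on the lower triangular L
backsolve(t(L), z)        # backward substitution on the upper triangular t(L)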

Implementation in R

Let us confirm that the Cholesky decomposition method works, using the chol() function in R.

library(matrixcalc)

crossprod(X) %>% is.positive.definite
## [1] TRUE
crossprod(X) %>% eigen(symmetric = TRUE) %$% values > 0
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

To check whether the matrix \(\mathbf{X}^T \mathbf{X}\) is positive definite, we can either use the is.positive.definite() function in the matrixcalc package or use base R functions to check that all of the eigenvalues are positive. Since the matrix is positive definite, we can apply the Cholesky decomposition to \(\mathbf{X}^T \mathbf{X}\). The following function calculates the regression coefficients using the Cholesky decomposition:

# X always includes a constant vector
regCef_chol <- function(X, y){

    a <- crossprod(X)
    b <- crossprod(X, y)
    
    # cholesky decomposition
    upper <- chol(a)
    
    # we need to do the following
    # z <- forwardsolve( t(upper), b )
    # but the transpose can be avoided by using
    # backsolve with transpose = TRUE
    z <- backsolve( upper, b, transpose = TRUE)
    cef <- as.vector(backsolve(upper, z))
    
    cef
}

Here are some comments about the regCef_chol() function:

  1. In R, you should use crossprod(X) rather than t(X) %*% X because it is more efficient.
  2. The chol() function returns an upper triangular matrix. Thus, the code was written to work with the upper triangular factor.
  3. We would need both forwardsolve and backsolve to implement equation (1) literally. However, the transpose option of backsolve allows us to use backsolve twice, so we can avoid forming the transpose explicitly. Let us take a look at the following code.
a <- crossprod(X)
b <- crossprod(X, y)

# cholesky decomposition
upper <- chol(a)
microbenchmark(
    forwardsolve( t(upper), b ),
    backsolve( upper, b, transpose = TRUE)
)
## Unit: microseconds
##                                   expr   min    lq    mean median    uq    max
##              forwardsolve(t(upper), b) 7.698 7.862 8.12732  7.950 8.069 18.667
##  backsolve(upper, b, transpose = TRUE) 4.627 4.770 5.20111  4.878 4.966 33.927
##  neval
##    100
##    100


z1 <- forwardsolve( t(upper), b )
z2 <- backsolve( upper, b, transpose = TRUE)

z1 == z2
##      [,1]
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE

Let us save the Cholesky decomposition-based coefficients of the regression to the cef_chol variable.

cef_chol <- regCef_chol(X, y)
cef_chol
## [1] -3.482259e+03  1.506187e-02 -3.581918e-02 -2.020230e-02 -1.033227e-02
## [6] -5.110411e-02  1.829151e+00

QR decomposition

For a real-valued \(n \times p\) matrix \(A\) with \(n \geq p\), we can decompose \(A\) into the following product of two matrices:

\[ A = QR \] where \(Q\) is an \(n \times p\) matrix with orthonormal columns, which satisfies \(Q^TQ=I_p\). If the matrix \(A\) is square, then \(Q\) is orthogonal. Also, \(R\) is a \(p \times p\) upper triangular matrix; if \(A\) has full column rank, the diagonal elements of \(R\) are nonzero, which implies that \(R\) is nonsingular.
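
A minimal sketch of these properties, using the qr() helper functions on the longley design matrix X defined above (the same decomposition is examined in more detail below):

qr_X <- qr(X)
Q <- qr.Q(qr_X)
R <- qr.R(qr_X)

all.equal(crossprod(Q), diag(ncol(X)), check.attributes = FALSE)  # Q'Q = I_p
all.equal(Q %*% R, X, check.attributes = FALSE)                   # X = QR
all(abs(diag(R)) > 0)                                             # R is nonsingular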

The QR decomposition is preferred to the Cholesky decomposition when it comes to the calculation of regression coefficients because

  1. The QR decomposition can be applied to matrices of any size, and it usually provides a more stable calculation.
  2. More importantly, when solving the Normal equation we do not need to form the cross-product \(\mathbf{X}^T \mathbf{X}\) at all, as can be seen as follows:

\[ \begin{eqnarray*} X^{T}X\mathbf{\beta} & = & X^{T}\mathbf{y}\\ \Rightarrow\left(QR\right)^{T}\left(QR\right)\mathbf{\beta} & = & \left(QR\right)^{T}\mathbf{y}\\ \Rightarrow R^{T}Q^{T}QR\mathbf{\beta} & = & R^{T}Q^{T}\mathbf{y}\\ \Rightarrow R^{T}R\mathbf{\beta} & = & R^{T}Q^{T}\mathbf{y} \\ \Rightarrow R\mathbf{\beta} & = & Q^{T}\mathbf{y} \end{eqnarray*} \]

Note that we used the properties of the matrices \(Q\) and \(R\) (namely \(Q^TQ=I_p\) and the nonsingularity of \(R\)) to reduce the equation above. From the last equation we can see that the beta vector is actually the least-squares solution of the linear system

\[ X \mathbf{\beta}=\mathbf{y} \] obtained via the QR factorization of the matrix \(X\). Thus, to calculate \(\beta\), we only need to call backsolve() once after applying the QR decomposition to the matrix \(X\).

QR decomposition in R

If you use the qr() function in R, you might be a little confused when you first see the result of the function:

QRmat <- qr(X)
QRmat
## $qr
##      constant  GNP.deflator           GNP    Unemployed  Armed.Forces
## 1947    -4.00 -4.067250e+02 -1.550794e+03 -1.277325e+03 -1.042675e+03
## 1948     0.25  4.179551e+01  3.817172e+02  2.246174e+02  1.252618e+02
## 1949     0.25  2.331590e-01  4.982290e+01 -3.118589e+01 -2.998495e+01
## 1950     0.25  2.020552e-01 -1.320204e-01 -2.820602e+02  1.644256e+02
## 1951     0.25  4.175090e-02  2.352068e-01 -4.032101e-01 -1.703533e+02
## 1952     0.25 -3.708533e-03  2.301639e-01 -4.987968e-01  2.597864e-01
## 1953     0.25 -2.524195e-02  3.010869e-02 -4.454059e-01  2.499492e-01
## 1954     0.25 -4.916796e-02  2.634768e-01  3.041786e-02  5.401806e-01
## 1955     0.25 -7.787919e-02 -2.008098e-01 -1.549424e-02  2.755704e-01
## 1956     0.25 -1.592276e-01  1.765048e-03 -2.087977e-01 -8.693505e-02
## 1957     0.25 -2.501465e-01  2.417450e-01 -3.582924e-01 -3.417886e-01
## 1958     0.25 -3.075689e-01  6.566704e-01  1.704505e-02 -1.486493e-01
## 1959     0.25 -3.506358e-01  2.287415e-01 -1.269142e-01 -3.868639e-01
## 1960     0.25 -3.889174e-01  1.297815e-01 -7.134241e-02 -4.035762e-01
## 1961     0.25 -4.248064e-01  9.885428e-02  2.227783e-01 -1.470166e-01
## 1962     0.25 -4.535177e-01 -4.128804e-01  1.547782e-01 -1.059541e-01
##         Population          Year
## 1947 -469.69600000 -7818.0000000
## 1948   26.37951035    18.2758880
## 1949    4.19691804     1.7752596
## 1950   -3.18974111    -1.4529877
## 1951    0.04624846    -0.4491762
## 1952    1.46320173    -0.2819006
## 1953   -0.25728181    -0.6693051
## 1954   -0.26118351     0.4588864
## 1955    0.43979880     0.2646956
## 1956    0.33806289     0.3701027
## 1957    0.13690629     0.1755959
## 1958   -0.05114355    -0.1027051
## 1959    0.47762105    -0.3582417
## 1960    0.13635911    -0.1388969
## 1961   -0.35055843    -0.3066874
## 1962   -0.40330323    -0.3490426
## 
## $rank
## [1] 7
## 
## $qraux
## [1] 1.250000 1.225981 1.156696 1.349325 1.102230 1.065271 1.421288
## 
## $pivot
## [1] 1 2 3 4 5 6 7
## 
## attr(,"class")
## [1] "qr"

The help page for the qr() function has the answer for this:

The upper triangle contains the \(\mathbf{R}\) matrix of the decomposition and the lower triangle contains information on the \(\mathbf{Q}\) matrix of the decomposition (stored in compact form).

To get the \(Q\) and \(R\) matrices separately, we need to use the qr.Q() and qr.R() functions, respectively.

Qmat <- qr.Q(QRmat)
Rmat <- qr.R(QRmat)
Qmat; Rmat;
##        [,1]        [,2]        [,3]         [,4]        [,5]        [,6]
##  [1,] -0.25 -0.44696790  0.34534075 -0.097267502  0.11348450  0.14346687
##  [2,] -0.25 -0.31537481 -0.15833007  0.074204845  0.54306575  0.27577175
##  [3,] -0.25 -0.32255262 -0.13087500 -0.415649678 -0.03377641  0.05658751
##  [4,] -0.25 -0.29144879  0.16361083 -0.306089180  0.02304976 -0.32175743
##  [5,] -0.25 -0.13114448 -0.17388270  0.302760179 -0.06259505 -0.13059025
##  [6,] -0.25 -0.08568505 -0.16040787  0.396678978 -0.23146214  0.03796899
##  [7,] -0.25 -0.06415163  0.04364140  0.413247504 -0.20796276  0.34643193
##  [8,] -0.25 -0.04022562 -0.18528890 -0.147931978 -0.57608466  0.12886308
##  [9,] -0.25 -0.01151440  0.28432317  0.061965649 -0.25775992 -0.48931159
## [10,] -0.25  0.06983406  0.09683706  0.176548287  0.05777222 -0.25842925
## [11,] -0.25  0.16075293 -0.12627901  0.233202984  0.25321471  0.02907702
## [12,] -0.25  0.21817537 -0.53055357 -0.295032653 -0.04874867  0.03998579
## [13,] -0.25  0.26124220 -0.09463650 -0.001197881  0.23969677 -0.38761993
## [14,] -0.25  0.29952382  0.01142409 -0.024274066  0.24921047 -0.06437349
## [15,] -0.25  0.33541285  0.04900808 -0.310066907 -0.04091063  0.26459047
## [16,] -0.25  0.36412407  0.56606822 -0.061098580 -0.02019395  0.32933852
##              [,7]
##  [1,] -0.00859901
##  [2,]  0.03836384
##  [3,] -0.03603622
##  [4,] -0.01634503
##  [5,]  0.62685034
##  [6,]  0.24825479
##  [7,] -0.29828585
##  [8,] -0.18932241
##  [9,] -0.06276184
## [10,] -0.39060004
## [11,] -0.36911636
## [12,] -0.02273730
## [13,]  0.16401119
## [14,] -0.09588745
## [15,]  0.16626351
## [16,]  0.24594785
##      constant GNP.deflator        GNP  Unemployed Armed.Forces    Population
## 1947       -4   -406.72500 -1550.7938 -1277.32500  -1042.67500 -469.69600000
## 1948        0     41.79551   381.7172   224.61743    125.26181   26.37951035
## 1949        0      0.00000    49.8229   -31.18589    -29.98495    4.19691804
## 1950        0      0.00000     0.0000  -282.06021    164.42555   -3.18974111
## 1951        0      0.00000     0.0000     0.00000   -170.35326    0.04624846
## 1952        0      0.00000     0.0000     0.00000      0.00000    1.46320173
## 1953        0      0.00000     0.0000     0.00000      0.00000    0.00000000
##               Year
## 1947 -7818.0000000
## 1948    18.2758880
## 1949     1.7752596
## 1950    -1.4529877
## 1951    -0.4491762
## 1952    -0.2819006
## 1953    -0.6693051

Thus, in R we can obtain the QR factorization-based coefficients as follows:

as.vector(backsolve(Rmat, crossprod(Qmat, y)))
## [1] -3.482259e+03  1.506187e-02 -3.581918e-02 -2.020230e-02 -1.033227e-02
## [6] -5.110411e-02  1.829151e+00

However, R also provides a compact solver based on the QR factorization.

cef_qr <- qr.solve(X, y) # or solve.qr(QRmat, y)
cef_qr
##      constant  GNP.deflator           GNP    Unemployed  Armed.Forces 
## -3.482259e+03  1.506187e-02 -3.581918e-02 -2.020230e-02 -1.033227e-02 
##    Population          Year 
## -5.110411e-02  1.829151e+00
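
As a side note, since microbenchmark is already loaded, a quick way to compare the cost of the different approaches on this small data set would be the following sketch; timings are not shown here and will depend on your machine.

microbenchmark(
    chol = regCef_chol(X, y),
    qr   = qr.solve(X, y),
    lm   = lm(y ~ 0 + X)$coefficients
)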

Result comparison

Let us compare the results between the Cholesky-based and the QR-based coefficients using the gmp package coefficients as a reference value.

sum((cef_exact - cef_chol)^2)
## [1] 1.219947e-09
sum((cef_exact - cef_qr)^2)
## [1] 3.494839e-23

We can see that the QR-based coefficients are closer to the gmp coefficients than those based on the Cholesky decomposition. In general, the QR approach is more numerically stable (though typically somewhat more expensive) than the Cholesky approach, which is usually said to be unstable when the positive definite matrix has one or more eigenvalues close to zero. As we can see from the following R code, the smallest eigenvalue of \(\mathbf{X}^T \mathbf{X}\) is close to zero, which makes the Cholesky-based coefficients less accurate.

crossprod(X) %>% is.positive.definite
## [1] TRUE
crossprod(X) %>% eigen(symmetric = TRUE) %$% values
## [1] 6.665301e+07 2.090730e+05 1.053550e+05 1.803976e+04 2.455730e+01
## [6] 2.015117e+00 1.172178e-07
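
One way to see why forming \(\mathbf{X}^T \mathbf{X}\) hurts is to compare condition numbers with the base kappa() function; forming the cross-product roughly squares the condition number of \(X\), and that squared condition number is exactly what the Cholesky approach has to work with. A minimal sketch:

kappa(X, exact = TRUE)             # condition number of X itself
kappa(crossprod(X), exact = TRUE)  # condition number of X'X (roughly squared)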

For these reasons, the lm function in R uses the QR factorization-based calculation for the regression coefficients.

cef_lm <- lm(y~0+X)$coefficients
cef_lm == cef_qr
##     Xconstant XGNP.deflator          XGNP   XUnemployed XArmed.Forces 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##   XPopulation         XYear 
##          TRUE          TRUE
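
We can also peek at the fitted lm object itself; by default it stores the QR decomposition it used internally in its qr component:

fit <- lm(y ~ 0 + X)
class(fit$qr)   # "qr" -- the decomposition lm used internally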

Reference

[1] Luke Tierney, STAT:7400 Computer Intensive Statistics Note: http://homepage.divms.uiowa.edu/~luke/classes/STAT7400/notes.pdf

[2] J. W. Longley (1967) An appraisal of least-squares programs from the point of view of the user. Journal of the American Statistical Association 62, 819–841.

[3] The QR Decomposition of a Matrix in R: http://astrostatistics.psu.edu/su07/R/html/base/html/qr.html

[4] Monahan, John F. A primer on linear models. CRC Press, 2008.

[5] Glyn A. Holton, Cholesky Factorization.
