Recursive Least Squares Estimation

from least square estimation to recursive least square estimation

Image credit: vuejsdevelopers

Mango again

In the previous post, I explained the weighted least squares estimation using the example of measuring the weight of a mango. In this post, I will use a similar setup, but a slightly different one.

I bought a mango from a grocery store and measured its weight using my scale. However, since my scale is cheap, it has its own bias, and a random noise whose expectation is zero and the standard deviation is 1g. Every time I check the weight of mango, I have checked the values of the scale before I put mango on it, I got these values.

## [1] -0.1035329  0.6387146  1.0422206 -0.6728489  0.7145623  0.7530279  0.2126300

Note that these values include the bias of my scale and the random noises. After measuring the weight of the mango seven times, I got these values again.

weight_mango <- c(536.5859, 539.5549, 541.1689, 534.3086, 539.8582, 540.0121, 537.8505)
## [1] 536.5859 539.5549 541.1689 534.3086 539.8582 540.0121 537.8505

Next, let us denote the two parameters the bias of my scale and weight of mango by \(\beta_0\) and \(\beta_1\).

\[ \beta=\left(\begin{array}{c} \beta_{0}\\ \beta_{1} \end{array}\right) \]

Since I have eight measurements, my observation vector \(y\) is as follow;

y <- c(bias, weight_mango)
##  [1]  -0.1035329   0.6387146   1.0422206  -0.6728489   0.7145623   0.7530279
##  [7]   0.2126300 536.5859000 539.5549000 541.1689000 534.3086000 539.8582000
## [13] 540.0121000 537.8505000

Then, we can model the situation using data matrix \(X\), observation \(y\), and a random noise vector \(w\) as follows:

\[ y=X\beta+w \] where the data matrix \(X\) is

X <- cbind(rep(1, 14), c(rep(0, 7), rep(1, 7)))
##       [,1] [,2]
##  [1,]    1    0
##  [2,]    1    0
##  [3,]    1    0
##  [4,]    1    0
##  [5,]    1    0
##  [6,]    1    0
##  [7,]    1    0
##  [8,]    1    1
##  [9,]    1    1
## [10,]    1    1
## [11,]    1    1
## [12,]    1    1
## [13,]    1    1
## [14,]    1    1

We know that ordinary least square estimation of the parameter vector \(\beta\) can be calculated as,

\[ \hat{\beta}=\left(X^{T}X\right)^{-1}X^{T}y \]

In R, we can obtain these using lm function.

beta_hat <- lm(y ~ X + 0)$coefficients
##          X1          X2 
##   0.3692534 538.1077609

Recursive least square estimation

What is the recursive least square estimation, and why do we need it? The problem of calculating beta estimates in the previous section that we have to calculate the inverse of the whole data matrix \(X^TX\). When we have more variables in the data set, this matrix will become larger and larger since its dimension is \(p \times p\) where \(p\) is the number of variables we have. In this big data era, we often have a large \(p\). So, what if we have one more observation \(y_{new}\) and we want to embrace this new information to our beta estimation. Should we update the data matrix \(X\) and redo the whole process of the least square estimation again? It is redundant when we think about our beta_hat has the information of the past data already. Can we update the little information to the information that we already have? This is when the recursive least square estimation comes into play.

When we say recursive, we are considering the situation that observes the sample again and again. Let us assume that I doubted my estimation, so I measured the weight of mango again.

new_y <- 538.7267
## [1] 538.7267

Now, when I think about the information I have so far, I have new_y and my previous estimation of beta beta_hat. Can I estimate the beta again combining this two information? With statistical notations, the situation can be expressed as follows;

\[ \begin{align*} y_{new} & =\left(\begin{array}{cc} 1 & 1\end{array}\right)\left(\begin{array}{c} \beta_{0}\\ \beta_{1} \end{array}\right)+w_{new}\\ \hat{\beta}_{new} & =\hat{\beta}_{old}+K\left(y_{new}-\hat{y}_{new}^{(old)}\right)\\ & =\hat{\beta}_{old}+K\left(y_{new}-\left(\begin{array}{cc} 1 & 1\end{array}\right)\hat{\beta}_{old}\right) \end{align*} \] where \(\hat{y}_{new}^{old}\) can be interpreted as the prediction value of our new observation, \(y_{new}\), with old information of beta. Thus, using the recursive least square method, we want to update the estimation of beta by adding the prediction error value for the new observation with the weight of \(K\). So, how do we determine \(K\)? Why the recursive least square follow this specific update form?


First, if we use this form of update equation, we can confirm that our update of the estimate of the beta will be unbiased. Note, our \(\hat{\beta_{old}}\) is the least square estimation of beta, which means we assume it is unbiased estimator as follows;

\[ \begin{align*} \mathbb{E}\left[\hat{\beta}_{old}\right] & =\mathbb{E}\left[\left(X^{T}X\right)^{-1}X^{T}y\right]\\ & =\left(X^{T}X\right)^{-1}X^{T}\mathbb{E}\left[y\right]\\ & =\left(X^{T}X\right)^{-1}X^{T}\left(X\beta+\mathbb{E}\left[w\right]\right)\\ & =\beta \end{align*} \]

Now, if our old estimator is unbiased, our new updated estimator \(\hat{\beta}_{new}\) will be unbiased regardless of \(K\) as follows;

\[ \begin{align*} \mathbb{E}\left[\hat{\beta}_{new}\right] & =\mathbb{E}\left[\hat{\beta}_{old}+K\left(y_{new}-\hat{y}_{new}^{(old)}\right)\right]\\ & =\beta+K\mathbb{E}\left[\left(y_{new}-\hat{y}_{new}^{(old)}\right)\right]\\ & =\beta+K\left(\left(\begin{array}{cc} 1 & 1\end{array}\right)\beta-\left(\begin{array}{cc} 1 & 1\end{array}\right)\mathbb{E}\left[\hat{\beta}_{old}\right]\right)\\ & =\beta \end{align*} \]

Minium variance

Thus, in the recursive least square method, we want to make our estimator have one more good property, which is the minimum variance.

\[ \begin{align*} var\left(\hat{\beta}_{new}\right) & =var\left(\hat{\beta}_{old}+K\left(y_{new}-\hat{y}_{new}^{(old)}\right)\right)\\ & =var\left(\left(I-Kx\right)\hat{\beta}_{old}+K\left(y_{new}\right)\right)\\ & =\left(I-Kx\right)var\left(\hat{\beta}_{old}\right)\left(I-Kx\right)^{T}+Kvar\left(w_{new}\right)K^{T} \end{align*} \] where \(x\) represents the data vector \((1, 1)^T\). If we denote the covariance matrix of the estimator and random noise vector as \(var(\hat{\beta}_{old}):=P_{old}\), \(var(\hat{\beta}_{new}):=P_{new}\), and \(var(w_{new}):=R_{new}\) respectively, the equation above can be re-written as follows;

\[ P_{new}=\left(I-Kx\right)P_{old}\left(I-Kx\right)^{T}+KR_{new}K^{T} \]

If we take a trace of the \(P_{new}\), that is the summation of the variance of each estimator of \(\beta_0\) and \(\beta_1\). Thus, we can use this as a criterion for finding an optimal \(K\). So the objective function \(L(K)\) will be

\[ L\left(K\right)=tr\left(P_{new}\right), \] and our estimation of \(K\) can be obtained after taking derivative of function \(L\) with respect of \(K\) and set it as zero, then solve it for \(K\).

\[ \begin{align*} \frac{\partial L\left(K\right)}{\partial K} & =\frac{\partial tr\left(P_{new}\right)}{\partial K}\\ & =\frac{\partial tr\left(\left(I-Kx\right)P_{old}\left(I-Kx\right)^{T}\right)}{\partial K}+\frac{\partial tr\left(KR_{new}K^{T}\right)}{\partial K}\\ & =2\left(I-Kx\right)P_{old}\left(-x^{T}\right)+2KR_{new}\overset{set}{=}0 \end{align*} \] Note that we use the derivative formula, \(\frac{\partial tr(ABA)}{\partial A}=2AB\) when \(B\) is symmetric. If we solve the above equation for \(K\), we have the following;

\[ K=P_{old}x^{T}\left(xP_{old}x^{T}+R_{new}\right)^{-1} \]

Moreover, if we assume our old estimator is the least square estimator, then the \(P_{old}\) can be calculated as follows;

\[ \begin{align*} P_{old} & =var\left(\hat{\beta}_{old}\right)\\ & =var\left(\left(X^{T}X\right)^{-1}X^{T}y\right)\\ & =\left(X^{T}X\right)^{-1}X^{T}var\left(y\right)X\left(X^{T}X\right)^{-1}\\ & =\left(X^{T}X\right)^{-1}X^{T}\left(\sigma^{2}I\right)X\left(X^{T}X\right)^{-1}\\ & =\sigma^{2}\left(X^{T}X\right)^{-1} \end{align*} \] where \(\sigma^2\) is the variance of the random noise \(w\). Thus, in our mango case, we have \(\sigma^2\) is 1,

P_old <- 1 * solve(t(X) %*% X)
##            [,1]       [,2]
## [1,]  0.1428571 -0.1428571
## [2,] -0.1428571  0.2857143

Note that we have already calculated this P_old for estimating beta before. Thus, the \(K\) can be calculated by the following;

x <- matrix(c(1, 1), nrow = 1)
matK <- P_old %*% t(x) %*% solve(x %*% P_old %*% t(x) + 1)
##       [,1]
## [1,] 0.000
## [2,] 0.125

The final update for our beta will be

\[ \begin{align*} y_{new} & =\left(\begin{array}{cc} 1 & 1\end{array}\right)\left(\begin{array}{c} \beta_{0}\\ \beta_{1} \end{array}\right)+w_{new}\\ \hat{\beta}_{new} & =\hat{\beta}_{old}+K\left(y_{new}-\hat{y}_{new}^{(old)}\right)\\ & =\hat{\beta}_{old}+K\left(y_{new}-\left(\begin{array}{cc} 1 & 1\end{array}\right)\hat{\beta}_{old}\right) \end{align*} \]

beta_hat_new = beta_hat + matK %*% (new_y - x %*% beta_hat)
##             [,1]
## [1,]   0.3692534
## [2,] 538.1389716

Is this the same result of using the least square method? Yes.

y <- c(y, new_y)
X <- rbind(X, c(1, 1))
solve(t(X) %*% X) %*% t(X) %*% y
##             [,1]
## [1,]   0.3692534
## [2,] 538.1389716

Recursive least square algorithm

So far, we use old and new notation, but now let us switch to general \(k\) time points. When we have an estimator at time \(k-1\), for time \(k\), the recursive least square estimator updates is as follows;

\[ \begin{align*} P_{k} & =\left(I-K_{k}x_{k}\right)P_{k-1}\left(I-K_{k}x_{k}\right)^{T}+K_{k}R_{k}K_{k}^{T}\\ K_{k} & =\left(x_{k}P_{k-1}x_{k}^{T}+R_{k}\right)^{-1}P_{k-1}x_{k}^{T}\\ \hat{\beta}_{k} & =\hat{\beta}_{k-1}+K_{k}\left(y_{k}-x_{k}\hat{\beta}_{k-1}\right) \end{align*} \]

Therefore, the recursive least square allows estimating the beta recursively. It will give us computational convenience, but we need to keep updating our P matrix and K matrix for each time \(k\).


  1. Simon, D. (2006). Optimal state estimation: Kalman, H infinity, and nonlinear approaches. John Wiley & Sons.
Issac Lee
PhD candidate, ABD

I believe anyone can learn anything with a proper education process.

comments powered by Disqus