Linear Algebra and the Geometry of Least Squares

Last time we took an in-depth look at solving the linear system of equations

\begin{equation}\label{1} \bm{A}\bm{x}=\bm{b} \end{equation}

from an optimization perspective. It is not too interesting when the system has a single unique solution; instead, we spent most of the post exploring the other two possibilities:

  1. the system is inconsistent and has no solutions;
  2. the system is under-determined and has infinite solutions.

In the first case, we chose the value $\bm{x}^\star$ that minimizes the least-squares error $\|\bm{A}\bm{x}-\bm{b}\|^2_2$. In the second case, we chose the solution $\bm{x}^\star$ satisfying \eqref{1} that is closest to some other desired point $\bm{y}$. Here we revisit both of these cases from a geometric perspective, which will reveal that both problems can be interpreted as projections onto a linear subspace. We also examine the addition of damping in both cases.

An interactive Python notebook that accompanies this post can be accessed directly in the browser here.

No Solutions

Let’s begin with a simple example. Consider a line in $\mathbb{R}^m$ that passes through the origin, defined as the set of points

\begin{equation}\label{2} \mathcal{L} = \{ \bm{a}x \mid x\in\mathbb{R} \}, \end{equation}

where $\bm{a}\in\mathbb{R}^m$ is a non-zero vector representing the line’s direction. In other words, $\mathcal{L}$ is the linear subspace spanned by $\bm{a}$. Since a line is a one-dimensional subspace, we need only one variable $x\in\mathbb{R}$ to parameterize it.

Suppose we want to find the point $\bm{a}x^\star$ on the line that is closest to some other point $\bm{b}\in\mathbb{R}^m$ in terms of Euclidean distance, as shown below in Figure 1. The resulting point is called the vector projection of $\bm{b}$ onto $\bm{a}$.

A line passing through the origin.

Figure 1: A line (in red) passing through the origin $\mathcal{O}$ in the direction of the vector $\bm{a}$. We want to find the closest point on the line to an arbitrary point $\bm{b}$, which is known as the vector projection of $\bm{b}$ onto $\bm{a}$.

Assuming that $\bm{b}$ is not actually on the line, the system $\bm{a}x=\bm{b}$ is inconsistent and has no solution. Instead, we want to find the point that solves the least-squares problem

\begin{equation*} \min_x \quad (1/2)\|\bm{e}(x)\|_2^2, \end{equation*}

where $\bm{e}(x)=\bm{a}x-\bm{b}$ is the error vector. However, we will turn to geometry rather than optimization theory to solve it.

Column space projection.

Figure 2: The vector representing the closest point $\bm{a}x^\star$ (in blue) on the line to $\bm{b}$ is just the projection of $\bm{b}$ onto the line. This is equivalent to the condition $\bm{a}^T\bm{e}(x^\star)=0$; that is, the difference between $\bm{a}x^\star$ and $\bm{b}$ is orthogonal to the line.

By looking at Figure 2, we see that $\bm{e}(x)$ has minimum length when it is orthogonal to the line, which means

\begin{equation}\label{3} \bm{a}^T\bm{e}(x^\star) = \bm{a}^T(\bm{a}x^\star-\bm{b}) = 0, \end{equation}

where $\bm{a}x^\star$ is the closest point. Rearranging \eqref{3}, we obtain the solution

\begin{equation}\label{4} x^\star=(\bm{a}^T\bm{a})^{-1}\bm{a}^T\bm{b}. \end{equation}
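
As a quick numerical check, here is a minimal NumPy sketch of \eqref{3} and \eqref{4}; the particular vectors are arbitrary illustrative values, not taken from the accompanying notebook.

```python
import numpy as np

# A line through the origin in R^3 spanned by the direction vector a,
# and an arbitrary point b that does not lie on the line.
a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 0.0, 1.0])

# Equation (4): x* = (a^T a)^{-1} a^T b, a scalar for a one-dimensional subspace.
x_star = (a @ b) / (a @ a)
closest_point = a * x_star

# Equation (3): the error vector is orthogonal to the line.
error = closest_point - b
print(x_star)       # optimal coefficient
print(a @ error)    # ~0, up to floating-point error
```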

Equation \eqref{4} looks very similar to the solution to the least-squares problem we saw last time. Indeed, if we generalize \eqref{2} from a one-dimensional subspace to an $n$-dimensional subspace, we get the column space

\begin{equation}\label{5} \mathcal{C}(\bm{A}) = \{ \bm{A}\bm{x} \mid \bm{x}\in\mathbb{R}^n \}, \end{equation}

of the matrix $\bm{A}=[\bm{a}_1,\dots,\bm{a}_n]\in\mathbb{R}^{m\times n}$, where $\bm{x}=[x_1,\dots,x_n]^T$ such that $\bm{A}\bm{x}=\sum_{i=1}^n\bm{a}_ix_i$. The least-squares problem becomes

\begin{equation}\label{6} \min_{\bm{x}} \quad (1/2)\|\bm{e}(\bm{x})\|_2^2 \end{equation}

with $\bm{e}(\bm{x})=\bm{A}\bm{x}-\bm{b}$, the equation \eqref{3} becomes $\bm{A}^T\bm{e}(\bm{x}^\star)=\bm{0}$, and the solution \eqref{4} becomes

\begin{equation*} \bm{x}^\star = (\bm{A}^T\bm{A})^{-1}\bm{A}^T\bm{b}, \end{equation*}

where $\bm{A}^+:=(\bm{A}^T\bm{A})^{-1}\bm{A}^T$ is the (left) Moore-Penrose pseudoinverse that we saw last time. The corresponding closest point in the subspace $\mathcal{C}(\bm{A})$ is

\begin{equation*} \bm{y}(\bm{x}^\star) = \bm{A}\bm{x}^\star = \bm{A}\bm{A}^+\bm{b}. \end{equation*}
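
In code this amounts to a couple of lines. Below is a minimal sketch for the general case, using a randomly generated problem (arbitrary illustrative data, not from the notebook) and np.linalg.lstsq as an independent check.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2
A = rng.standard_normal((m, n))   # tall matrix; full column rank almost surely
b = rng.standard_normal(m)

# Left pseudoinverse A^+ = (A^T A)^{-1} A^T and the least-squares solution x*.
A_pinv = np.linalg.inv(A.T @ A) @ A.T
x_star = A_pinv @ b

# Closest point to b in the column space C(A).
y_star = A @ x_star

# Cross-checks against NumPy's built-in solvers.
x_check, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_star, x_check))             # True
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
print(A.T @ (y_star - b))                       # ~0: error orthogonal to C(A)
```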

Projection

The matrix $\bm{A}\bm{A}^+$ is the projection matrix that maps arbitrary points to the closest point in $\mathcal{C}(\bm{A})$. In general, a projection matrix is a square matrix $\bm{U}$ that satisfies $\bm{U}\bm{U}=\bm{U}$ (that is, it is idempotent), which means that multiplying a vector by $\bm{U}$ more than once has no additional effect. We can easily verify that $\bm{A}\bm{A}^+$ satisfies this property:

\begin{equation*} \begin{aligned} (\bm{A}\bm{A}^+)(\bm{A}\bm{A}^+) &= \bm{A}(\bm{A}^T\bm{A})^{-1}\bm{A}^T\bm{A}(\bm{A}^T\bm{A})^{-1}\bm{A}^T \\ &= \bm{A}(\bm{A}^T\bm{A})^{-1}\bm{A}^T \\ &= \bm{A}\bm{A}^+. \end{aligned} \end{equation*}

Since $\bm{A}\bm{A}^+$ is also symmetric (this is easy to check and left to the reader), it is an orthogonal projection matrix. Indeed, all of the projections we describe in this post are orthogonal.

This is the geometric interpretation of the least-squares problem \eqref{6}: it is just an (orthogonal) projection of $\bm{b}$ onto the column space $\mathcal{C}(\bm{A})$.
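
Both properties are easy to confirm numerically; here is a small sketch with made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
b = rng.standard_normal(5)

# Orthogonal projector onto the column space C(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))             # idempotent
print(np.allclose(P, P.T))               # symmetric, hence an orthogonal projection
print(np.allclose(P @ (P @ b), P @ b))   # projecting a second time has no effect
```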

Damping

Recall that we introduced damping into the least-squares problem to bias the solution toward a smaller norm and avoid problems with ill-conditioning. Returning to our example with the line $\mathcal{L}$ defined in \eqref{2}, the damped least-squares problem is

\begin{equation*} \min_x \quad (1/2)\|\bm{a}x - \bm{b}\|_2^2 + (\alpha/2)x^2, \end{equation*}

where $\alpha\geq0$ is the damping factor.

How do we think about this problem geometrically? Consider again the projected error \eqref{3}: this equation says that the projections of $\bm{b}$ and its closest point $\bm{a}x^\star$ onto $\bm{a}$ must be equal. In the damped case, we are willing to sacrifice this equality to make the magnitude of $x^\star$ smaller, as shown in Figure 3 below.

Column space projection with damping.

Figure 3: Damping pulls $x^\star$ closer to the origin by padding the value $\bm{a}^T\bm{a}x^\star$ with the term $\alpha x^\star$; the two terms must sum to $\bm{a}^T\bm{b}$.

Instead of $\bm{a}^T\bm{b}=\bm{a}^T\bm{a}x^\star$, we add an extra term $\alpha x^\star$ to the right hand side so that $x^\star$ becomes smaller, yielding

\begin{equation*} x^\star=(\bm{a}^T\bm{a} + \alpha)^{-1}\bm{a}^T\bm{b}. \end{equation*}

This solution generalizes to

\begin{equation*} \bm{x}^\star=(\bm{A}^T\bm{A} + \alpha\bm{1}_n)^{-1}\bm{A}^T\bm{b} \end{equation*}

in the matrix case, where $\bm{A}^+(\alpha):=(\bm{A}^T\bm{A}+\alpha\bm{1}_n)^{-1}\bm{A}^T$ is the damped Moore-Penrose pseudoinverse we saw last time. The matrix $\bm{A}\bm{A}^+(\alpha)$ is not a projection matrix when $\alpha>0$, because repeated applications to a vector continue to shrink that vector’s magnitude.
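
Here is a minimal sketch of the damped case; the data and damping factor are arbitrary illustrative choices. It also checks that $\bm{A}\bm{A}^+(\alpha)$ keeps shrinking a vector when applied repeatedly, so it is not a projection.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
alpha = 0.1   # damping factor (illustrative value)

# Damped pseudoinverse A^+(alpha) = (A^T A + alpha*I)^{-1} A^T.
A_pinv_damped = np.linalg.inv(A.T @ A + alpha * np.eye(n)) @ A.T
x_damped = A_pinv_damped @ b
x_undamped = np.linalg.pinv(A) @ b

# Damping biases the solution toward a smaller norm.
print(np.linalg.norm(x_damped) < np.linalg.norm(x_undamped))   # True

# A A^+(alpha) is not a projection: repeated application keeps shrinking the vector.
M = A @ A_pinv_damped
print(np.linalg.norm(M @ (M @ b)) < np.linalg.norm(M @ b))     # True
```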

Infinite Solutions

Now let’s look at the case when there are infinite solutions to \eqref{1}, and we need to pick one.

In \eqref{2} we defined a line in $\mathbb{R}^m$ with a single variable $x$ representing its single dimension. In \eqref{5}, we generalized to an $n$-dimensional subspace parameterized by $n<m$ variables. Instead, we could parameterize it with a full $m$ variables subject to $p=m-n$ constraints. Using this approach, we can define a linear subspace of $\mathbb{R}^m$ as

\begin{equation}\label{7} \mathcal{S} = \{ \bm{x}\in\mathbb{R}^m \mid \bm{A}\bm{x}=\bm{0} \}, \end{equation}

where each row of $\bm{A}\in\mathbb{R}^{p\times m}$ defines one of the $p$ constraints. Each row of $\bm{A}$ is orthogonal to the subspace itself, and we assume the rows of $\bm{A}$ are all linearly independent. This means that the columns of $\bm{A}^T$ span the $p$-dimensional subspace of $\mathbb{R}^m$ that is orthogonal to $\mathcal{S}$. Indeed, $\mathcal{S}$ is just the nullspace $\mathcal{N}(\bm{A})$, and the span of $\bm{A}^T$ is $\mathcal{C}(\bm{A}^T)$, which is called the range space of $\bm{A}$; in general, the range space and nullspace are orthogonal to each other. (For more information, refer to this unit on the four fundamental subspaces of linear algebra.)
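
As a small numerical illustration of this orthogonality (with made-up values), we can build a basis for $\mathcal{N}(\bm{A})$ from the SVD and check that the rows of $\bm{A}$ are orthogonal to it.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 2, 5                       # p independent constraints on R^m
A = rng.standard_normal((p, m))   # full row rank almost surely

# The last m - p right singular vectors form an orthonormal basis for N(A).
_, _, Vt = np.linalg.svd(A)
N = Vt[p:].T                      # shape (m, m - p)

# Every row of A (i.e. every column of A^T) is orthogonal to the nullspace.
print(np.allclose(A @ N, 0))      # True
```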

Let’s generalize \eqref{7} so that it does not necessarily pass through the origin by defining the affine subspace

\begin{equation*} \mathcal{S}' = \{\bm{x}\in\mathbb{R}^m\mid\bm{A}\bm{x}=\bm{b}\}, \end{equation*}

which now satisfies the constraint \eqref{1}. We can interpret $\mathcal{S}'$ as the set of points in $\mathcal{S}$ offset by some vector $\bm{x}_0\in\mathbb{R}^m$, such that $\mathcal{S}'=\{\bm{x}+\bm{x}_0\mid\bm{x}\in\mathcal{S}\}$, where

\begin{equation}\label{8} \bm{b} = \bm{A}\bm{x}_0. \end{equation}

Figure 4 below shows what $\mathcal{S}'$ looks like when it is just a line in $\mathbb{R}^2$, which requires a single constraint defined by $\bm{A}=[\bm{a}^T]\in\mathbb{R}^{1\times2}$ and $b\in\mathbb{R}$.

A line defined by constraints.

Figure 4: The line $\mathcal{S}'$ (in red) is defined by the constraint $\bm{a}^T\bm{x}=b$, where $b=\bm{a}^T\bm{x}_0$.

Suppose we want to find the point $\bm{x}^\star\in\mathcal{S}'$ that has the smallest Euclidean norm; that is, we want to solve the problem

\begin{equation}\label{9} \begin{aligned} \min_{\bm{x}} &\quad (1/2)\|\bm{x}\|_2^2, \\ \text{subject to} &\quad \bm{A}\bm{x} = \bm{b}, \end{aligned} \end{equation}

which is equivalent to finding the point that is closest to the origin.

As can be seen in Figure 5, the optimal solution $\bm{x}^\star$ of \eqref{9} is orthogonal to $\mathcal{S}'$, which means that it lives in $\mathcal{C}(\bm{A}^T)$; that is, there must exist a vector $\bm{\lambda}\in\mathbb{R}^p$ such that $\bm{x}^{\star}=\bm{A}^T\bm{\lambda}$. We saw last time that $\bm{\lambda}$ can be interpreted as the vector of Lagrange multipliers for the constraint \eqref{1}; the optimal value of $\bm{\lambda}$ ensures that $\bm{x}^\star$ does indeed satisfy \eqref{1}.

Range space projection.

Figure 5: The closest point $\bm{x}^\star\in\mathcal{S}'$ (in blue) to the origin is just the projection of an arbitrary point $\bm{x}_0\in\mathcal{S}'$ onto the range space $\mathcal{C}(\bm{A}^T)$, where $\bm{A}=[\bm{a}^T]$ in this example.

Multiplying by $\bm{A}$ gives us $\bm{A}\bm{x}^\star=\bm{A}\bm{A}^T\bm{\lambda}$; since $\bm{A}\bm{x}^\star=\bm{b}$, we can rearrange to obtain $\bm{\lambda}=(\bm{A}\bm{A}^T)^{-1}\bm{b}$. Substituting back into the equation $\bm{x}^\star=\bm{A}^T\bm{\lambda}$ and making use of \eqref{8}, we arrive at the solution

\begin{equation*} \bm{x}^{\star} = \bm{A}^+\bm{b} = \bm{A}^+\bm{A}\bm{x}_0, \end{equation*}

where $\bm{A}^+=\bm{A}^T(\bm{A}\bm{A}^T)^{-1}$ is the (right) Moore-Penrose pseudoinverse. The matrix $\bm{A}^+\bm{A}$ is a projection matrix that projects $\bm{x}_0$ onto the range space $\mathcal{C}(\bm{A}^T)$ of $\bm{A}$.
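
A minimal sketch of the minimum-norm solution, again with made-up data; it checks the constraint, the minimum-norm property against another feasible point, and the projection interpretation of $\bm{A}^+\bm{A}$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 2, 5
A = rng.standard_normal((p, m))   # wide matrix; full row rank almost surely
x0 = rng.standard_normal(m)       # some particular feasible point
b = A @ x0                        # equation (8)

# Right pseudoinverse A^+ = A^T (A A^T)^{-1} and the minimum-norm solution.
A_pinv = A.T @ np.linalg.inv(A @ A.T)
x_star = A_pinv @ b

print(np.allclose(A @ x_star, b))                      # constraint satisfied
print(np.linalg.norm(x_star) <= np.linalg.norm(x0))    # smallest-norm feasible point
print(np.allclose(A_pinv @ (A @ x0), x_star))          # A^+ A projects x0 onto C(A^T)
```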

Nullspace Projection

Like last time, instead of finding the closest point to the origin, we could find the closest point to an arbitrary vector $\bm{y}\in\mathbb{R}^m$. That is, we want to solve the problem

\begin{equation}\label{10} \begin{aligned} \min_{\bm{x}} &\quad (1/2)\|\bm{e}(\bm{x})\|_2^2, \\ \text{subject to} &\quad \bm{A}\bm{x} = \bm{b}, \end{aligned} \end{equation}

where $\bm{e}(\bm{x})=\bm{y}-\bm{x}$ is the error vector we want to minimize. We can again derive the solution by recognizing that $\bm{e}(\bm{x}^\star)$ is orthogonal to $\mathcal{S}'$, and therefore $\bm{e}(\bm{x}^{\star})=\bm{A}^T\bm{\lambda}$ for some $\bm{\lambda}\in\mathbb{R}^p$. We ultimately arrive at the solution

\begin{equation}\label{11} \bm{x}^{\star} = \bm{A}^+\bm{b} + \bm{P}(\bm{A})\bm{y}, \end{equation}

where $\bm{P}(\bm{A})=\bm{1}_m-\bm{A}^+\bm{A}$ is the nullspace projector. As the name implies, $\bm{P}(\bm{A})$ is also a projection matrix, which we can easily verify:

\begin{equation*} \begin{aligned} \bm{P}(\bm{A})\bm{P}(\bm{A}) &= (\bm{1}_m-\bm{A}^+\bm{A})(\bm{1}_m-\bm{A}^+\bm{A}) \\ &= \bm{1}_m - 2\bm{A}^+\bm{A} + \bm{A}^+\bm{A}\bm{A}^+\bm{A} \\ &= \bm{1}_m - 2\bm{A}^+\bm{A} + \bm{A}^+\bm{A} \\ &= \bm{P}(\bm{A}). \end{aligned} \end{equation*}

If we set $\bm{b}=\bm{0}$, then \eqref{11} is just a projection of $\bm{y}$ onto the nullspace $\mathcal{N}(\bm{A})$, as shown in Figure 6.

Nullspace projection.

Figure 6: The closest point $\bm{x}^{\star}\in\mathcal{S}'$ to an arbitrary point $\bm{y}$ is just the projection of $\bm{y}$ onto $\mathcal{S}'$, where $\mathcal{S}'=\mathcal{N}(\bm{A})$ and $\bm{A}=[\bm{a}^T]$ in this example.
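
Here is a sketch of the solution \eqref{11} with made-up data, checking that it satisfies the constraint, that $\bm{P}(\bm{A})$ is a projection, and that with $\bm{b}=\bm{0}$ the result lies in the nullspace.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 2, 5
A = rng.standard_normal((p, m))
b = rng.standard_normal(p)
y = rng.standard_normal(m)        # arbitrary point we want to stay close to

A_pinv = A.T @ np.linalg.inv(A @ A.T)   # right pseudoinverse
P = np.eye(m) - A_pinv @ A              # nullspace projector P(A)

# Equation (11): x* = A^+ b + P(A) y.
x_star = A_pinv @ b + P @ y

print(np.allclose(A @ x_star, b))       # constraint A x* = b holds
print(np.allclose(P @ P, P))            # P(A) is idempotent
print(np.allclose(A @ (P @ y), 0))      # with b = 0, x* = P(A) y lies in N(A)
```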

Damping

The damped version of \eqref{9} is

\begin{equation*} \begin{aligned} \min_{\bm{x},\bm{s}} &\quad (\alpha/2)\|\bm{x}\|_2^2 + (1/2)\|\bm{s}\|_2^2, \\ \text{subject to} &\quad \bm{A}\bm{x} = \bm{b} + \bm{s}, \end{aligned} \end{equation*}

where we have introduced the slack variable $\bm{s}\in\mathbb{R}^p$ and damping factor $\alpha>0$. As we saw last time, the solution is

\begin{equation*} \bm{x}^\star = \bm{A}^T(\bm{A}\bm{A}^T + \alpha\bm{1}_p)^{-1}\bm{b}, \end{equation*}

where $\bm{A}^T(\bm{A}\bm{A}^T+\alpha\bm{1}_p)^{-1}$ is equivalent (when $\alpha>0$) to the damped Moore-Penrose pseudoinverse $\bm{A}^+(\alpha)$ that we saw earlier.

In the undamped case, we had $\bm{x}^\star=\bm{A}^T\bm{\lambda}$ for some $\bm{\lambda}\in\mathbb{R}^p$. In the damped case, we have $\bm{x}^\star=\bm{A}^T\bm{z}$ where $\bm{z}=\bm{\lambda}/\alpha$ is just a scaled version of $\bm{\lambda}$ that satisfies

\begin{equation*} \bm{b} = (\bm{A}\bm{A}^T + \alpha\bm{1}_p)\bm{z}. \end{equation*}
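
Below is a sketch of the damped case with made-up data (the damping factor is an arbitrary illustrative value), verifying that $\bm{x}^\star=\bm{A}^T\bm{z}$ with $\bm{z}$ solving the linear system above, and that damping shrinks the solution norm.

```python
import numpy as np

rng = np.random.default_rng(5)
p, m = 2, 5
A = rng.standard_normal((p, m))
b = rng.standard_normal(p)
alpha = 0.1   # damping factor (illustrative value)

# z solves b = (A A^T + alpha*I) z, and the damped solution is x* = A^T z.
z = np.linalg.solve(A @ A.T + alpha * np.eye(p), b)
x_damped = A.T @ z

# Undamped minimum-norm solution for comparison.
x_undamped = A.T @ np.linalg.solve(A @ A.T, b)

print(np.allclose(A @ x_damped + alpha * z, b))                # b splits into A A^T z + alpha*z
print(np.linalg.norm(x_damped) < np.linalg.norm(x_undamped))   # True: damping shrinks x*
```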

Figure 7 below shows what damping looks like for our running example of a line in $\mathbb{R}^2$ defined by a constraint of the form \eqref{1} with $\bm{A}=[\bm{a}^T]\in\mathbb{R}^{1\times2}$. The value of $\bm{b}$ is now split between the two terms $\bm{A}\bm{A}^T\bm{z}$ and $\alpha\bm{z}$, which serves to shrink the norm of $\bm{x}^{\star}$ because $\bm{z}$ itself becomes smaller. Moreover, if one goes through the math it turns out that $\bm{s}=\bm{\lambda}$, and we can see from the figure that $\bm{x}^{\star}$ has been reduced by $\alpha\bm{z}=\bm{\lambda}=\bm{s}$, which makes sense given that $\bm{s}$ is the amount of slack we added to the constraint. (Here it is really helpful to go through the optimization math by constructing the Lagrangian, taking derivatives, etc., to follow along with the diagram.)

Range space projection with damping.

Figure 7: Damping pulls $\bm{x}^\star=\bm{a}z$ closer to the origin by padding the value $\bm{a}^T\bm{a}z$ with the term $\alpha z$; the two terms must sum to $b$.

Summary

We have seen that various forms of the least-squares problem are just projections onto different subspaces:

  • Problem \eqref{6} projects $\bm{b}$ onto the column space $\mathcal{C}(\bm{A})$;
  • Problem \eqref{9} projects $\bm{x}_0$ onto the range space $\mathcal{C}(\bm{A}^T)$;
  • Problem \eqref{10} (with $\bm{b}=\bm{0}$) projects $\bm{y}$ onto the nullspace $\mathcal{N}(\bm{A})$.

Hopefully this provides some intuition about what least-squares optimization problems are actually doing.

Thanks to Abhishek Goudar for reading a draft of this.