Linear Algebra and the Geometry of Least Squares
Last time we took an in-depth look at solving the linear system of equations $Ax = b$ from an optimization perspective. The case where the system has a single unique solution is not too interesting; instead, we spent most of the post exploring the other two possibilities:
- the system is inconsistent and has no solutions;
- the system is under-determined and has infinitely many solutions.
In the first case, we chose the value of $x$ that minimizes the least-squares error $\|b - Ax\|^2$. In the second case, we chose the solution satisfying $Ax = b$ that is closest to some other desired point $x_d$. Here we revisit both of these cases from a geometric perspective, which will reveal that both problems can be interpreted as projections onto a linear subspace. We also examine the addition of damping in both cases.
An interactive Python notebook that accompanies this post can be accessed directly in the browser here.
No Solutions
Let’s begin with a simple example. Consider a line in $\mathbb{R}^2$ that passes through the origin, defined as the set of points
$$\mathcal{L} = \{ax \mid x \in \mathbb{R}\},$$
where $a \in \mathbb{R}^2$ is a non-zero vector representing the line’s direction. In other words, $\mathcal{L}$ is the linear subspace spanned by $a$. Since a line is a one-dimensional subspace, we need only one variable $x$ to parameterize it.
Suppose we want to find the point on the line that is closest to some other point $b \in \mathbb{R}^2$ in terms of Euclidean distance, as shown below in Figure 1. This problem is called the vector projection of $b$ onto $a$.
Figure 1: A line (in red) passing through the origin in the direction of the vector $a$. We want to find the closest point on the line to an arbitrary point $b$, which is known as the vector projection of $b$ onto $a$.
Assuming that $b$ is not actually on the line, the system $ax = b$ is inconsistent and has no solution. Instead, we want to find the point that solves the least-squares problem
$$\min_x\ \tfrac{1}{2}\|e\|^2,$$
where $e = b - ax$ is the error vector. However, we will turn to geometry rather than optimization theory to solve it.
Figure 2: The vector $ax^*$ representing the closest point (in blue) on the line to $b$ is just the projection of $b$ onto the line. This is equivalent to the condition $a^Te = 0$; that is, the difference between $b$ and $ax^*$ is orthogonal to the line.
By looking at Figure 2, we see that $e$ has minimum length when it is orthogonal to the line, which means
$$a^Te = a^T(b - ax^*) = 0,$$
where $ax^*$ is the closest point. Rearranging, we obtain the solution
$$x^* = (a^Ta)^{-1}a^Tb = \frac{a^Tb}{a^Ta}.$$
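Before generalizing, here is a minimal NumPy sketch of this one-dimensional projection. The particular vectors are arbitrary illustrative values, not the ones used in the figures or the accompanying notebook:

```python
import numpy as np

a = np.array([2.0, 1.0])  # direction of the line (illustrative values)
b = np.array([1.0, 3.0])  # arbitrary point to project onto the line

x_star = (a @ b) / (a @ a)  # optimal parameter x* = (a^T a)^{-1} a^T b
p = a * x_star              # closest point on the line to b

# The error e = b - a x* is orthogonal to the line direction a.
print(np.isclose(a @ (b - p), 0.0))  # True
```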
The solution $x^* = (a^Ta)^{-1}a^Tb$ looks very similar to the solution to the least-squares problem we saw last time. Indeed, if we generalize from a one-dimensional subspace to an $n$-dimensional subspace, we get the column space
$$\mathcal{C}(A) = \{Ax \mid x \in \mathbb{R}^n\}$$
of the matrix $A$, where $A \in \mathbb{R}^{m\times n}$ with $m \geq n$ such that $A$ has full column rank. The least-squares problem becomes
$$\min_x\ \tfrac{1}{2}\|e\|^2$$
with $e = b - Ax$, the condition $a^Te = 0$ becomes $A^Te = 0$, and the solution becomes
$$x^* = (A^TA)^{-1}A^Tb = A^\dagger b,$$
where $A^\dagger = (A^TA)^{-1}A^T$ is the (left) Moore-Penrose pseudoinverse that we saw last time. The corresponding closest point in the subspace is
$$p = Ax^* = AA^\dagger b.$$
Projection
The matrix $P = AA^\dagger = A(A^TA)^{-1}A^T$ is the projection matrix that maps arbitrary points $b \in \mathbb{R}^m$ to the closest point in $\mathcal{C}(A)$. A projection matrix on $\mathbb{R}^m$ is a matrix $P$ that satisfies $P^2 = P$ (that is, it is idempotent), which means that multiplying a vector by $P$ more than once has no additional effect. We can easily verify that $P = AA^\dagger$ satisfies this property:
$$P^2 = A(A^TA)^{-1}A^TA(A^TA)^{-1}A^T = A(A^TA)^{-1}A^T = P.$$
Since $P$ is also symmetric (this is easy to check and left to the reader), it is an orthogonal projection matrix. Indeed, all of the projections we describe in this post are orthogonal.
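These properties are easy to check numerically. The sketch below builds the projection matrix for a random tall matrix; the dimensions and random seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))  # tall matrix, full column rank (almost surely)
b = rng.standard_normal(5)

A_pinv = np.linalg.inv(A.T @ A) @ A.T  # left pseudoinverse (A^T A)^{-1} A^T
P = A @ A_pinv                         # projection onto the column space C(A)

print(np.allclose(P @ P, P))                # idempotent: P^2 = P
print(np.allclose(P, P.T))                  # symmetric: an orthogonal projection
print(np.allclose(A.T @ (b - P @ b), 0.0))  # residual is orthogonal to C(A)
```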
Projection onto the column space is the geometric interpretation of the least-squares problem $\min_x \tfrac{1}{2}\|b - Ax\|^2$: it is just an (orthogonal) projection of $b$ onto $\mathcal{C}(A)$.
Damping
Recall that we introduced damping into the least-squares problem to bias the solution toward a smaller norm and avoid problems with ill-conditioning. Returning to our example with the line $\mathcal{L}$ defined in $\mathbb{R}^2$, the damped least-squares problem is
$$\min_x\ \tfrac{1}{2}\|b - ax\|^2 + \tfrac{\delta}{2}x^2,$$
where $\delta \geq 0$ is the damping factor.
How do we think about this problem geometrically? Consider again the projected-error condition $a^T(b - ax^*) = 0$, or equivalently $a^Tax^* = a^Tb$: this equation says that the projections of $b$ and of its closest point $ax^*$ onto $a$ must be equal. In the damped case, we are willing to sacrifice this equality to make the magnitude of $x^*$ smaller, as shown in Figure 3 below.
Figure 3: Damping pulls $ax^*$ closer to the origin by padding the value $a^Tax^*$ with the term $\delta x^*$, which added together must equal $a^Tb$.
Instead of $a^T(b - ax^*) = 0$, we add an extra term to the right-hand side so that $x^*$ becomes smaller, yielding
$$a^T(b - ax^*) = \delta x^* \quad\Longleftrightarrow\quad x^* = \frac{a^Tb}{a^Ta + \delta}.$$
This solution generalizes to
$$x^* = (A^TA + \delta I)^{-1}A^Tb$$
in the matrix case, where $(A^TA + \delta I)^{-1}A^T$ is the damped Moore-Penrose pseudoinverse we saw last time. The matrix $A(A^TA + \delta I)^{-1}A^T$ is not a projection matrix when $\delta > 0$, because repeated applications to a vector continue to shrink that vector’s magnitude.
Infinite Solutions
Now let’s look at the case when there are infinitely many solutions to $Ax = b$, and we need to pick one.
Earlier we defined the line $\mathcal{L}$ with a single variable representing its single dimension, and then generalized to an $n$-dimensional subspace parameterized by $n$ variables. Instead, we could parameterize a subspace with one variable per ambient dimension, subject to a set of constraints. Using this approach, we can define a linear subspace of $\mathbb{R}^n$ as
$$\mathcal{V} = \{x \in \mathbb{R}^n \mid Ax = 0\},$$
where each row of $A \in \mathbb{R}^{m\times n}$ defines one of the $m$ constraints (in contrast to the previous section, $A$ is now a wide matrix with $m < n$). Each row of $A$ is orthogonal to the subspace itself, and we assume the rows of $A$ are all linearly independent. This means that the columns of $A^T$ span the $m$-dimensional subspace of $\mathbb{R}^n$ that is orthogonal to $\mathcal{V}$. Indeed, $\mathcal{V}$ is just the nullspace $\mathcal{N}(A)$, and the span of the columns of $A^T$ is $\mathcal{R}(A^T)$, which is called the range space of $A^T$; in general, the range space and nullspace are orthogonal to each other. (For more information, refer to this unit on the four fundamental subspaces of linear algebra.)
Let’s generalize $\mathcal{V}$ so that it does not necessarily pass through the origin by defining the affine subspace
$$\mathcal{V} = \{x \in \mathbb{R}^n \mid Ax = b\},$$
which now satisfies the constraint $Ax = b$. We can interpret $\mathcal{V}$ as the set of points in $\mathcal{N}(A)$ offset by some vector $v$, such that
$$\mathcal{V} = \{x + v \mid x \in \mathcal{N}(A)\},$$
where $v$ is any particular point satisfying $Av = b$.
Figure 4 below shows what $\mathcal{V}$ looks like when it is just a line in $\mathbb{R}^2$, which requires a single constraint defined by $a^T$ and $b$.
Figure 4: The line (in red) is defined by the constraint $a^Tx = b$.
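To make the offset interpretation concrete, the sketch below builds a random wide constraint matrix, finds a particular point $v$ with $Av = b$, and checks that adding any nullspace vector to $v$ stays inside $\mathcal{V}$. The dimensions and seed are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 4))  # two independent constraints in R^4
b = rng.standard_normal(2)

# A particular point v satisfying A v = b (least squares gives an exact solution here).
v, *_ = np.linalg.lstsq(A, b, rcond=None)

# Build a basis for N(A) from the right singular vectors beyond the rank of A.
_, _, Vt = np.linalg.svd(A)
N = Vt[2:].T  # 4 - 2 = 2 basis vectors for the nullspace

# Every point of the form v + N z satisfies the constraint, for any z.
z = rng.standard_normal(2)
print(np.allclose(A @ (v + N @ z), b))  # True
```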
Suppose we want to find the point $x \in \mathcal{V}$ that has the smallest Euclidean norm; that is, we want to solve the problem
$$\min_x\ \tfrac{1}{2}\|x\|^2 \quad\text{subject to}\quad Ax = b,$$
which is equivalent to finding the point in $\mathcal{V}$ that is closest to the origin.
As can be seen in Figure 5, the optimal solution $x^*$ of this problem is orthogonal to $\mathcal{N}(A)$, which means that it lives in $\mathcal{R}(A^T)$; that is, there must exist a vector $\lambda$ such that $x^* = A^T\lambda$. We saw last time that $\lambda$ can be interpreted as the vector of Lagrange multipliers for the constraint $Ax = b$; the optimal value of $\lambda$ ensures that $x^*$ does indeed satisfy $Ax^* = b$.
Figure 5: The closest point $x^*$ (in blue) to the origin is just the projection of an arbitrary point $x \in \mathcal{V}$ onto the range space $\mathcal{R}(A^T)$.
Multiplying $x^* = A^T\lambda$ by $A$ gives us $Ax^* = AA^T\lambda$, which we rearrange to obtain $\lambda = (AA^T)^{-1}Ax^*$. Substituting back into $x^* = A^T\lambda$ and making use of $Ax^* = b$, we arrive at the solution
$$x^* = A^T(AA^T)^{-1}b = A^\dagger b,$$
where $A^\dagger = A^T(AA^T)^{-1}$ is the (right) Moore-Penrose pseudoinverse. The matrix $A^\dagger A = A^T(AA^T)^{-1}A$ is a projection matrix that projects onto the range space of $A^T$.
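Here is a small NumPy sketch of the minimum-norm solution on a random under-determined system (dimensions and seed are again arbitrary), confirming that it satisfies the constraint and lies in $\mathcal{R}(A^T)$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 4))  # wide matrix: under-determined system
b = rng.standard_normal(2)

A_pinv = A.T @ np.linalg.inv(A @ A.T)  # right pseudoinverse A^T (A A^T)^{-1}
x_star = A_pinv @ b                    # minimum-norm solution

print(np.allclose(A @ x_star, b))      # satisfies the constraint
# x* lies in R(A^T): it equals its own projection onto the range space.
print(np.allclose(A_pinv @ A @ x_star, x_star))
```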
Nullspace Projection
Like last time, instead of finding the closest point to the origin, we could find the closest point to an arbitrary vector $x_d$. That is, we want to solve the problem
$$\min_x\ \tfrac{1}{2}\|x - x_d\|^2 \quad\text{subject to}\quad Ax = b,$$
where $e = x - x_d$ is the error vector we want to minimize. We can again derive the solution by recognizing that $e$ is orthogonal to $\mathcal{N}(A)$, and therefore $e = A^T\lambda$ for some $\lambda$. We ultimately arrive at the solution
$$x^* = A^\dagger b + (I - A^\dagger A)x_d = A^\dagger b + Px_d,$$
where $P = I - A^\dagger A$ is the nullspace projector. As the name implies, $P$ is also a projection matrix, which we can easily verify:
$$P^2 = (I - A^\dagger A)(I - A^\dagger A) = I - 2A^\dagger A + A^\dagger(AA^\dagger)A = I - A^\dagger A = P,$$
where we have used the fact that $AA^\dagger = AA^T(AA^T)^{-1} = I$.
If we set $b = 0$, then $x^* = Px_d$ is just a projection of $x_d$ onto the nullspace $\mathcal{N}(A)$, as shown in Figure 6.
Figure 6: The closest point $x^*$ to an arbitrary point $x_d$ is just the projection of $x_d$ onto $\mathcal{N}(A)$.
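The nullspace projector and the closest-point solution can also be checked numerically; as before, the particular matrix and points below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 4))
b = rng.standard_normal(2)
xd = rng.standard_normal(4)      # arbitrary desired point

A_pinv = A.T @ np.linalg.inv(A @ A.T)
P = np.eye(4) - A_pinv @ A       # nullspace projector

x_star = A_pinv @ b + P @ xd     # closest feasible point to xd

print(np.allclose(P @ P, P))              # P is idempotent
print(np.allclose(A @ P, 0))              # columns of P live in N(A)
print(np.allclose(A @ x_star, b))         # x* satisfies the constraint
print(np.allclose(P @ (x_star - xd), 0))  # the error is orthogonal to N(A)
```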
Damping
The damped version of this problem is
$$\min_{x,\epsilon}\ \tfrac{1}{2}\|x - x_d\|^2 + \tfrac{1}{2\delta}\|\epsilon\|^2 \quad\text{subject to}\quad Ax + \epsilon = b,$$
where we have introduced the slack variable $\epsilon$ and damping factor $\delta > 0$. As we saw last time, the solution is
$$x^* = x_d + A^T(AA^T + \delta I)^{-1}(b - Ax_d),$$
where $A^T(AA^T + \delta I)^{-1}$ is equivalent (when $\delta > 0$) to the damped Moore-Penrose pseudoinverse $(A^TA + \delta I)^{-1}A^T$ that we saw earlier.
In the undamped case, we had $x^* - x_d = A^T\lambda$ for some $\lambda$. In the damped case, we have $x^* - x_d = A^T\lambda_\delta$, where $\lambda_\delta = (AA^T + \delta I)^{-1}(b - Ax_d)$ is a shrunken version of $\lambda$; the corresponding $x^*$ satisfies the relaxed constraint $Ax^* + \epsilon^* = b$ rather than $Ax^* = b$.
Continuing our running example of a line in $\mathbb{R}^2$ defined by a single constraint of the form $a^Tx = b$, with $x_d = 0$, we can see what damping looks like in Figure 7. The value of $b$ is now split between the two terms $a^Tx^*$ and $\epsilon^*$, which serves to shrink the norm of $x^*$ because $a^Tx^*$ itself becomes smaller. Moreover, if one goes through the math it turns out that $\epsilon^* = \delta\lambda_\delta$, and we can see from the figure that $a^Tx^*$ has been reduced by $\epsilon^*$, which makes sense given that $\epsilon^*$ is the amount of slack we added to the constraint. (Here it is really helpful to go through the optimization math by constructing the Lagrangian, taking derivatives, etc., to follow along with the diagram.)
Figure 7: Damping pulls $x^*$ closer to the origin by padding the value $a^Tx^*$ with the slack term $\epsilon^*$, which added together must equal $b$.
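Finally, here is a NumPy sketch of the damped closest-point solution under the formulation above; the slack-penalty weighting $\tfrac{1}{2\delta}\|\epsilon\|^2$ is an assumption of this sketch, and the matrix, point, and damping factor are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((2, 4))  # two constraints in R^4 (illustrative values)
b = rng.standard_normal(2)
xd = rng.standard_normal(4)      # arbitrary desired point
delta = 0.5                      # damping factor (illustrative value)

# Damped solution: x* = xd + A^T (A A^T + delta I)^{-1} (b - A xd).
x_damped = xd + A.T @ np.linalg.solve(A @ A.T + delta * np.eye(2), b - A @ xd)

# Undamped solution for comparison: x* = A^+ b + (I - A^+ A) xd.
A_pinv = A.T @ np.linalg.inv(A @ A.T)
x_star = A_pinv @ b + (np.eye(4) - A_pinv @ A) @ xd

eps = b - A @ x_damped           # slack absorbed by the relaxed constraint
print(np.linalg.norm(x_damped - xd) < np.linalg.norm(x_star - xd))  # smaller error norm
print(np.linalg.norm(eps) > 0.0)  # the constraint is no longer satisfied exactly
```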
Summary
We have seen that various forms of the least-squares problem are just projections onto different subspaces:
- The least-squares problem $\min_x \tfrac{1}{2}\|b - Ax\|^2$ projects $b$ onto the column space $\mathcal{C}(A)$;
- The minimum-norm problem $\min_x \tfrac{1}{2}\|x\|^2$ subject to $Ax = b$ projects any point satisfying the constraint onto the range space $\mathcal{R}(A^T)$;
- The closest-point problem $\min_x \tfrac{1}{2}\|x - x_d\|^2$ subject to $Ax = b$ (with $b = 0$) projects $x_d$ onto the nullspace $\mathcal{N}(A)$.
Hopefully this provides some intuition about what least-squares optimization problems are actually doing.
Thanks to Abhishek Goudar for reading a draft of this.