Linear Algebra and the Geometry of Least Squares

Last time we took an in-depth look at solving the linear system of equations

\begin{equation}\label{1} \bm{A}\bm{x}=\bm{b} \end{equation}

from an optimization perspective. It is not too interesting when the system has a single unique solution; instead, we spent most of the post exploring the other two possibilities:

  1. the system is inconsistent and has no solutions;
  2. the system is under-determined and has infinite solutions.

In the first case, we chose the value $\bm{x}^\star$ that minimizes the least-squares error $\|\bm{A}\bm{x}-\bm{b}\|^2_2$. In the second case, we chose the solution $\bm{x}^\star$ satisfying \eqref{1} that is closest to some other desired point $\bm{y}$. Here we revisit both of these cases from a geometric perspective, which will reveal that both problems can be interpreted as projections onto a linear subspace. We also examine the addition of damping in both cases.

An interactive Python notebook that accompanies this post can be accessed directly in the browser here.

No Solutions

Let’s begin with a simple example. Consider a line in $\mathbb{R}^m$ that passes through the origin, defined as the set of points

\begin{equation}\label{2} \mathcal{L} = \{ \bm{a}x \mid x\in\mathbb{R} \}, \end{equation}

where $\bm{a}\in\mathbb{R}^m$ is a non-zero vector representing the line’s direction. In other words, $\mathcal{L}$ is the linear subspace spanned by $\bm{a}$. Since a line is a one-dimensional subspace, we need only one variable $x\in\mathbb{R}$ to parameterize it.

Suppose we want to find the point $\bm{a}x^\star$ on the line that is closest to some other point $\bm{b}\in\mathbb{R}^m$ in terms of Euclidean distance, as shown below in Figure 1. The resulting point is called the vector projection of $\bm{b}$ onto $\bm{a}$.

A line passing through the origin.

Figure 1: A line (in red) passing through the origin $\mathcal{O}$ in the direction of the vector $\bm{a}$. We want to find the closest point on the line to an arbitrary point $\bm{b}$, which is known as the vector projection of $\bm{b}$ onto $\bm{a}$.

Assuming that $\bm{b}$ is not actually on the line, the system $\bm{a}x=\bm{b}$ is inconsistent and has no solution. Instead, we want to find the point that solves the least-squares problem

\begin{equation*} \min_x \quad (1/2)\|\bm{e}(x)\|_2^2, \end{equation*}

where $\bm{e}(x)=\bm{a}x-\bm{b}$ is the error vector. However, we will turn to geometry rather than optimization theory to solve it.

Column space projection.

Figure 2: The vector representing the closest point $\bm{a}x^\star$ (in blue) on the line to $\bm{b}$ is just the projection of $\bm{b}$ onto the line. This is equivalent to the condition $\bm{a}^T\bm{e}(x^\star)=0$; that is, the difference between $\bm{a}x^\star$ and $\bm{b}$ is orthogonal to the line.

By looking at Figure 2, we see that $\bm{e}(x)$ has minimum length when it is orthogonal to the line, which means

\begin{equation}\label{3} \bm{a}^T\bm{e}(x^\star) = \bm{a}^T(\bm{a}x^\star-\bm{b}) = 0, \end{equation}

where $\bm{a}x^\star$ is the closest point. Rearranging \eqref{3}, we obtain the solution

\begin{equation}\label{4} x^\star=(\bm{a}^T\bm{a})^{-1}\bm{a}^T\bm{b}. \end{equation}
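
As a quick numerical check, here is a minimal NumPy sketch of \eqref{3} and \eqref{4}; the particular vectors are arbitrary illustrative values, not taken from the accompanying notebook.

```python
import numpy as np

# A line through the origin in R^3 spanned by the direction vector a,
# and an arbitrary point b that does not lie on the line.
a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 0.0, 1.0])

# Equation (4): x* = (a^T a)^{-1} a^T b, a scalar for a one-dimensional subspace.
x_star = (a @ b) / (a @ a)
closest_point = a * x_star

# Equation (3): the error vector is orthogonal to the line.
error = closest_point - b
print(x_star)       # optimal coefficient
print(a @ error)    # ~0, up to floating-point error
```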

Equation \eqref{4} looks very similar to the solution to the least-squares problem we saw last time. Indeed, if we generalize \eqref{2} from a one-dimensional subspace to an $n$-dimensional subspace, we get the column space

\begin{equation}\label{5} \mathcal{C}(\bm{A}) = \{ \bm{A}\bm{x} \mid \bm{x}\in\mathbb{R}^n \}, \end{equation}

of the matrix $\bm{A}=[\bm{a}_1,\dots,\bm{a}_n]\in\mathbb{R}^{m\times n}$, where $\bm{x}=[x_1,\dots,x_n]^T$ such that $\bm{A}\bm{x}=\sum_{i=1}^n\bm{a}_ix_i$. The least-squares problem becomes

\begin{equation}\label{6} \min_{\bm{x}} \quad (1/2)\|\bm{e}(\bm{x})\|_2^2 \end{equation}

with $\bm{e}(\bm{x})=\bm{A}\bm{x}-\bm{b}$, the equation \eqref{3} becomes $\bm{A}^T\bm{e}(\bm{x}^\star)=\bm{0}$, and the solution \eqref{4} becomes

\begin{equation*} \bm{x}^\star = (\bm{A}^T\bm{A})^{-1}\bm{A}^T\bm{b}, \end{equation*}

where $\bm{A}^+:=(\bm{A}^T\bm{A})^{-1}\bm{A}^T$ is the (left) Moore-Penrose pseudoinverse that we saw last time. The corresponding closest point in the subspace $\mathcal{C}(\bm{A})$ is

\begin{equation*} \bm{y}(\bm{x}^\star) = \bm{A}\bm{x}^\star = \bm{A}\bm{A}^+\bm{b}. \end{equation*}
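
In code this amounts to a couple of lines. Below is a minimal sketch for the general case, using a randomly generated problem (arbitrary illustrative data, not from the notebook) and np.linalg.lstsq as an independent check.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2
A = rng.standard_normal((m, n))   # tall matrix; full column rank almost surely
b = rng.standard_normal(m)

# Left pseudoinverse A^+ = (A^T A)^{-1} A^T and the least-squares solution x*.
A_pinv = np.linalg.inv(A.T @ A) @ A.T
x_star = A_pinv @ b

# Closest point to b in the column space C(A).
y_star = A @ x_star

# Cross-checks against NumPy's built-in solvers.
x_check, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_star, x_check))             # True
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
print(A.T @ (y_star - b))                       # ~0: error orthogonal to C(A)
```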

Projection

The matrix $\bm{A}\bm{A}^+$ is the projection matrix that maps arbitrary points to the closest point in $\mathcal{C}(\bm{A})$. In general, a projection matrix is a square matrix $\bm{U}$ that satisfies $\bm{U}\bm{U}=\bm{U}$ (that is, it is idempotent), which means that multiplying a vector by $\bm{U}$ more than once has no additional effect. We can easily verify that $\bm{A}\bm{A}^+$ satisfies this property:

\begin{equation*} \begin{aligned} (\bm{A}\bm{A}^+)(\bm{A}\bm{A}^+) &= \bm{A}(\bm{A}^T\bm{A})^{-1}\bm{A}^T\bm{A}(\bm{A}^T\bm{A})^{-1}\bm{A}^T \\ &= \bm{A}(\bm{A}^T\bm{A})^{-1}\bm{A}^T \\ &= \bm{A}\bm{A}^+. \end{aligned} \end{equation*}

Since $\bm{A}\bm{A}^+$ is also symmetric (this is easy to check and left to the reader), it is an orthogonal projection matrix. Indeed, all of the projections we describe in this post are orthogonal.

This is the geometric interpretation of the least-squares problem \eqref{6}: it is just an (orthogonal) projection of $\bm{b}$ onto the column space $\mathcal{C}(\bm{A})$.
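
Both properties are easy to confirm numerically; here is a small sketch with made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
b = rng.standard_normal(5)

# Orthogonal projector onto the column space C(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))             # idempotent
print(np.allclose(P, P.T))               # symmetric, hence an orthogonal projection
print(np.allclose(P @ (P @ b), P @ b))   # projecting a second time has no effect
```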

Damping

Recall that we introduced damping into the least-squares problem to bias the solution toward a smaller norm and avoid problems with ill-conditioning. Returning to our example with the line $\mathcal{L}$ defined in \eqref{2}, the damped least-squares problem is

\begin{equation*} \min_x \quad (1/2)\|\bm{a}x - \bm{b}\|_2^2 + (\alpha/2)x^2, \end{equation*}

where $\alpha\geq0$ is the damping factor.

How do we think about this problem geometrically? Consider again the projected error \eqref{3}: this equation says that the projections of $\bm{b}$ and its closest point $\bm{a}x^\star$ onto $\bm{a}$ must be equal. In the damped case, we are willing to sacrifice this equality to make the magnitude of $x^\star$ smaller, as shown in Figure 3 below.

Column space projection with damping.

Figure 3: Damping pulls $x^\star$ closer to the origin by padding the value $\bm{a}^T\bm{a}x^\star$ with the term $\alpha x^\star$; the two terms must sum to $\bm{a}^T\bm{b}$.

Instead of $\bm{a}^T\bm{b}=\bm{a}^T\bm{a}x^\star$, we add an extra term $\alpha x^\star$ to the right hand side so that $x^\star$ becomes smaller, yielding

\begin{equation*} x^\star=(\bm{a}^T\bm{a} + \alpha)^{-1}\bm{a}^T\bm{b}. \end{equation*}

This solution generalizes to

\begin{equation*} \bm{x}^\star=(\bm{A}^T\bm{A} + \alpha\bm{1}_n)^{-1}\bm{A}^T\bm{b} \end{equation*}

in the matrix case, where $\bm{A}^+(\alpha):=(\bm{A}^T\bm{A}+\alpha\bm{1}_n)^{-1}\bm{A}^T$ is the damped Moore-Penrose pseudoinverse we saw last time. The matrix $\bm{A}\bm{A}^+(\alpha)$ is not a projection matrix when $\alpha>0$, because repeated applications to a vector continue to shrink that vector’s magnitude.
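
Here is a minimal sketch of the damped case; the data and damping factor are arbitrary illustrative choices. It also checks that $\bm{A}\bm{A}^+(\alpha)$ keeps shrinking a vector when applied repeatedly, so it is not a projection.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
alpha = 0.1   # damping factor (illustrative value)

# Damped pseudoinverse A^+(alpha) = (A^T A + alpha*I)^{-1} A^T.
A_pinv_damped = np.linalg.inv(A.T @ A + alpha * np.eye(n)) @ A.T
x_damped = A_pinv_damped @ b
x_undamped = np.linalg.pinv(A) @ b

# Damping biases the solution toward a smaller norm.
print(np.linalg.norm(x_damped) < np.linalg.norm(x_undamped))   # True

# A A^+(alpha) is not a projection: repeated application keeps shrinking the vector.
M = A @ A_pinv_damped
print(np.linalg.norm(M @ (M @ b)) < np.linalg.norm(M @ b))     # True
```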

Infinite Solutions

Now let’s look at the case when there are infinite solutions to \eqref{1}, and we need to pick one.

In \eqref{2} we defined a line in $\mathbb{R}^m$ with a single variable $x$ representing its single dimension. In \eqref{5}, we generalized to an $n$-dimensional subspace parameterized by $n<m$ variables. Instead, we could parameterize it with a full $m$ variables subject to $p=m-n$ constraints. Using this approach, we can define a linear subspace of $\mathbb{R}^m$ as

\begin{equation}\label{7} \mathcal{S} = \{ \bm{x}\in\mathbb{R}^m \mid \bm{A}\bm{x}=\bm{0} \}, \end{equation}

where each row of $\bm{A}\in\mathbb{R}^{p\times m}$ defines one of the $p$ constraints. Each row of $\bm{A}$ is orthogonal to the subspace itself, and we assume the rows of $\bm{A}$ are all linearly independent. This means that the columns of $\bm{A}^T$ span the $p$-dimensional subspace of $\mathbb{R}^m$ that is orthogonal to $\mathcal{S}$. Indeed, $\mathcal{S}$ is just the nullspace $\mathcal{N}(\bm{A})$, and the span of $\bm{A}^T$ is $\mathcal{C}(\bm{A}^T)$, which is called the range space of $\bm{A}$; in general, the range space and nullspace are orthogonal to each other. (For more information, refer to this unit on the four fundamental subspaces of linear algebra.)
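
As a small numerical illustration of this orthogonality (with made-up values), we can build a basis for $\mathcal{N}(\bm{A})$ from the SVD and check that the rows of $\bm{A}$ are orthogonal to it.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 2, 5                       # p independent constraints on R^m
A = rng.standard_normal((p, m))   # full row rank almost surely

# The last m - p right singular vectors form an orthonormal basis for N(A).
_, _, Vt = np.linalg.svd(A)
N = Vt[p:].T                      # shape (m, m - p)

# Every row of A (i.e. every column of A^T) is orthogonal to the nullspace.
print(np.allclose(A @ N, 0))      # True
```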

Let’s generalize \eqref{7} so that it does not necessarily pass through the origin by defining the affine subspace

\begin{equation*} \mathcal{S}' = \{\bm{x}\in\mathbb{R}^m\mid\bm{A}\bm{x}=\bm{b}\}, \end{equation*}

which now satisfies the constraint \eqref{1}. We can interpret $\mathcal{S}'$ as the set of points in $\mathcal{S}$ offset by some vector $\bm{x}_0\in\mathbb{R}^m$, such that $\mathcal{S}'=\{\bm{x}+\bm{x}_0\mid\bm{x}\in\mathcal{S}\}$, where

\begin{equation}\label{8} \bm{b} = \bm{A}\bm{x}_0. \end{equation}

Figure 4 below shows what $\mathcal{S}'$ looks like when it is just a line in $\mathbb{R}^2$, which requires a single constraint defined by $\bm{A}=[\bm{a}^T]\in\mathbb{R}^{1\times2}$ and $b\in\mathbb{R}$.

A line defined by constraints.

Figure 4: The line $\mathcal{S}'$ (in red) is defined by the constraint $\bm{a}^T\bm{x}=b$, where $b=\bm{a}^T\bm{x}_0$.

Suppose we want to find the point $\bm{x}^\star\in\mathcal{S}'$ that has the smallest Euclidean norm; that is, we want to solve the problem

\begin{equation}\label{9} \begin{aligned} \min_{\bm{x}} &\quad (1/2)\|\bm{x}\|_2^2, \\ \text{subject to} &\quad \bm{A}\bm{x} = \bm{b}, \end{aligned} \end{equation}

which is equivalent to finding the point that is closest to the origin.

As can be seen in Figure 5, the optimal solution $\bm{x}^\star$ of \eqref{9} is orthogonal to $\mathcal{S}'$, which means that it lives in $\mathcal{C}(\bm{A}^T)$; that is, there must exist a vector $\bm{\lambda}\in\mathbb{R}^p$ such that $\bm{x}^{\star}=\bm{A}^T\bm{\lambda}$. We saw last time that $\bm{\lambda}$ can be interpreted as the vector of Lagrange multipliers for the constraint \eqref{1}; the optimal value of $\bm{\lambda}$ ensures that $\bm{x}^\star$ does indeed satisfy \eqref{1}.

Range space projection.

Figure 5: The closest point $\bm{x}^\star\in\mathcal{S}'$ (in blue) to the origin is just the projection of an arbitrary point $\bm{x}_0\in\mathcal{S}'$ onto the range space $\mathcal{C}(\bm{A}^T)$, where $\bm{A}=[\bm{a}^T]$ in this example.

Multiplying by $\bm{A}$ gives us $\bm{A}\bm{x}^\star=\bm{A}\bm{A}^T\bm{\lambda}$; since $\bm{A}\bm{x}^\star=\bm{b}$, we can rearrange to obtain $\bm{\lambda}=(\bm{A}\bm{A}^T)^{-1}\bm{b}$. Substituting back into the equation $\bm{x}^\star=\bm{A}^T\bm{\lambda}$ and making use of \eqref{8}, we arrive at the solution

\begin{equation*} \bm{x}^{\star} = \bm{A}^+\bm{b} = \bm{A}^+\bm{A}\bm{x}_0, \end{equation*}

where $\bm{A}^+=\bm{A}^T(\bm{A}\bm{A}^T)^{-1}$ is the (right) Moore-Penrose pseudoinverse. The matrix $\bm{A}^+\bm{A}$ is a projection matrix that projects $\bm{x}_0$ onto the range space $\mathcal{C}(\bm{A}^T)$ of $\bm{A}$.
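
A minimal sketch of the minimum-norm solution, again with made-up data; it checks the constraint, the minimum-norm property against another feasible point, and the projection interpretation of $\bm{A}^+\bm{A}$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 2, 5
A = rng.standard_normal((p, m))   # wide matrix; full row rank almost surely
x0 = rng.standard_normal(m)       # some particular feasible point
b = A @ x0                        # equation (8)

# Right pseudoinverse A^+ = A^T (A A^T)^{-1} and the minimum-norm solution.
A_pinv = A.T @ np.linalg.inv(A @ A.T)
x_star = A_pinv @ b

print(np.allclose(A @ x_star, b))                      # constraint satisfied
print(np.linalg.norm(x_star) <= np.linalg.norm(x0))    # smallest-norm feasible point
print(np.allclose(A_pinv @ (A @ x0), x_star))          # A^+ A projects x0 onto C(A^T)
```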

Nullspace Projection

Like last time, instead of finding the closest point to the origin, we could find the closest point to an arbitrary vector $\bm{y}\in\mathbb{R}^m$. That is, we want to solve the problem

\begin{equation}\label{10} \begin{aligned} \min_{\bm{x}} &\quad (1/2)\|\bm{e}(\bm{x})\|_2^2, \\ \text{subject to} &\quad \bm{A}\bm{x} = \bm{b}, \end{aligned} \end{equation}

where $\bm{e}(\bm{x})=\bm{y}-\bm{x}$ is the error vector we want to minimize. We can again derive the solution by recognizing that $\bm{e}(\bm{x}^\star)$ is orthogonal to $\mathcal{S}'$, and therefore $\bm{e}(\bm{x}^{\star})=\bm{A}^T\bm{\lambda}$ for some $\bm{\lambda}\in\mathbb{R}^p$. We ultimately arrive at the solution

\begin{equation}\label{11} \bm{x}^{\star} = \bm{A}^+\bm{b} + \bm{P}(\bm{A})\bm{y}, \end{equation}

where $\bm{P}(\bm{A})=\bm{1}_m-\bm{A}^+\bm{A}$ is the nullspace projector. As the name implies, $\bm{P}(\bm{A})$ is also a projection matrix, which we can easily verify:

\begin{equation*} \begin{aligned} \bm{P}(\bm{A})\bm{P}(\bm{A}) &= (\bm{1}_m-\bm{A}^+\bm{A})(\bm{1}_m-\bm{A}^+\bm{A}) \\ &= \bm{1}_m - 2\bm{A}^+\bm{A} + \bm{A}^+\bm{A}\bm{A}^+\bm{A} \\ &= \bm{1}_m - 2\bm{A}^+\bm{A} + \bm{A}^+\bm{A} \\ &= \bm{P}(\bm{A}). \end{aligned} \end{equation*}

If we set $\bm{b}=\bm{0}$, then \eqref{11} is just a projection of $\bm{y}$ onto the nullspace $\mathcal{N}(\bm{A})$, as shown in Figure 6.

Nullspace projection.

Figure 6: The closest point $\bm{x}^{\star}\in\mathcal{S}'$ to an arbitrary point $\bm{y}$ is just the projection of $\bm{y}$ onto $\mathcal{S}'$, where $\mathcal{S}'=\mathcal{N}(\bm{A})$ and $\bm{A}=[\bm{a}^T]$ in this example.
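
Here is a sketch of the solution \eqref{11} with made-up data, checking that it satisfies the constraint, that $\bm{P}(\bm{A})$ is a projection, and that with $\bm{b}=\bm{0}$ the result lies in the nullspace.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 2, 5
A = rng.standard_normal((p, m))
b = rng.standard_normal(p)
y = rng.standard_normal(m)        # arbitrary point we want to stay close to

A_pinv = A.T @ np.linalg.inv(A @ A.T)   # right pseudoinverse
P = np.eye(m) - A_pinv @ A              # nullspace projector P(A)

# Equation (11): x* = A^+ b + P(A) y.
x_star = A_pinv @ b + P @ y

print(np.allclose(A @ x_star, b))       # constraint A x* = b holds
print(np.allclose(P @ P, P))            # P(A) is idempotent
print(np.allclose(A @ (P @ y), 0))      # with b = 0, x* = P(A) y lies in N(A)
```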

Damping

The damped version of \eqref{9} is

\begin{equation*} \begin{aligned} \min_{\bm{x},\bm{s}} &\quad (\alpha/2)\|\bm{x}\|_2^2 + (1/2)\|\bm{s}\|_2^2, \\ \text{subject to} &\quad \bm{A}\bm{x} = \bm{b} + \bm{s}, \end{aligned} \end{equation*}

where we have introduced the slack variable $\bm{s}\in\mathbb{R}^p$ and damping factor $\alpha>0$. As we saw last time, the solution is

\begin{equation*} \bm{x}^\star = \bm{A}^T(\bm{A}\bm{A}^T + \alpha\bm{1}_p)^{-1}\bm{b}, \end{equation*}

where $\bm{A}^T(\bm{A}\bm{A}^T+\alpha\bm{1}_p)^{-1}$ is equivalent (when $\alpha>0$) to the damped Moore-Penrose pseudoinverse $\bm{A}^+(\alpha)$ that we saw earlier.

In the undamped case, we had $\bm{x}^\star=\bm{A}^T\bm{\lambda}$ for some $\bm{\lambda}\in\mathbb{R}^p$. In the damped case, we have $\bm{x}^\star=\bm{A}^T\bm{z}$ where $\bm{z}=\bm{\lambda}/\alpha$ is just a scaled version of $\bm{\lambda}$ that satisfies

\begin{equation*} \bm{b} = (\bm{A}\bm{A}^T + \alpha\bm{1}_p)\bm{z}. \end{equation*}
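
Below is a sketch of the damped case with made-up data (the damping factor is an arbitrary illustrative value), verifying that $\bm{x}^\star=\bm{A}^T\bm{z}$ with $\bm{z}$ solving the linear system above, and that damping shrinks the solution norm.

```python
import numpy as np

rng = np.random.default_rng(5)
p, m = 2, 5
A = rng.standard_normal((p, m))
b = rng.standard_normal(p)
alpha = 0.1   # damping factor (illustrative value)

# z solves b = (A A^T + alpha*I) z, and the damped solution is x* = A^T z.
z = np.linalg.solve(A @ A.T + alpha * np.eye(p), b)
x_damped = A.T @ z

# Undamped minimum-norm solution for comparison.
x_undamped = A.T @ np.linalg.solve(A @ A.T, b)

print(np.allclose(A @ x_damped + alpha * z, b))                # b splits into A A^T z + alpha*z
print(np.linalg.norm(x_damped) < np.linalg.norm(x_undamped))   # True: damping shrinks x*
```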

Figure 7 below shows what damping looks like for our running example of a line in $\mathbb{R}^2$ defined by a constraint of the form \eqref{1} with $\bm{A}=[\bm{a}^T]\in\mathbb{R}^{1\times2}$. The value of $\bm{b}$ is now split between the two terms $\bm{A}\bm{A}^T\bm{z}$ and $\alpha\bm{z}$, which serves to shrink the norm of $\bm{x}^{\star}$ because $\bm{z}$ itself becomes smaller. Moreover, if one goes through the math it turns out that $\bm{s}=\bm{\lambda}$, and we can see from the figure that $\bm{x}^{\star}$ has been reduced by $\alpha\bm{z}=\bm{\lambda}=\bm{s}$, which makes sense given that $\bm{s}$ is the amount of slack we added to the constraint. (Here it is really helpful to go through the optimization math by constructing the Lagrangian, taking derivatives, etc., to follow along with the diagram.)

Range space projection with damping.

Figure 7: Damping pulls $\bm{x}^\star=\bm{a}z$ closer to the origin by padding the value $\bm{a}^T\bm{a}z$ with the term $\alpha z$; the two terms must sum to $b$.

Summary

We have seen that various forms of the least-squares problem are just projections onto different subspaces:

  • Problem \eqref{6} projects $\bm{b}$ onto the column space $\mathcal{C}(\bm{A})$;
  • Problem \eqref{9} projects $\bm{x}_0$ onto the range space $\mathcal{C}(\bm{A}^T)$;
  • Problem \eqref{10} (with $\bm{b}=\bm{0}$) projects $\bm{y}$ onto the nullspace $\mathcal{N}(\bm{A})$.

Hopefully this provides some intuition about what least-squares optimization problems are actually doing.

Thanks to Abhishek Goudar for reading a draft of this.