In this post we take a look at some linear algebra through the lens of
optimization and then see how it can be applied to robot manipulator control.
In particular, we will spend a fair bit of time looking at pseudoinverses and
nullspace projection.
Linear Systems of Equations
As engineers we often encounter linear systems of equations of the form
Ax=b,
where we are given a matrix A∈Rm×n and a vector
b∈Rm, and we want to find a vector x∈Rn
satisfying the equation. In this post, we are interested in the solutions of
(1). We will begin with some background review to make this
post fairly self-contained.
Background
Column Space and Nullspace
The column space C(A) of a matrix A is the span of
its columns; that is, it is the set of all linear combinations of the columns
of A:
C(A)={Ax∣x∈Rn}.
The nullspace (also called the kernel) N(A) of A is
the set of all vectors x for which Ax=0:
N(A)={x∈Rn∣Ax=0}.
Singular Values
Any matrix A∈Rm×n can be expressed using the
singular value decomposition
(SVD) as
A = \sum_{i=1}^r \sigma_i u_i v_i^T,
where r=min(m,n), σi is the ith singular value of A, and
ui∈Rm and vi∈Rn are the ith left and right singular
vectors of A. The singular values are all non-negative and each set of
singular vectors forms an orthonormal basis. Expressed in matrix form, we have
A=UΣVT,
where U∈Rm×m and V∈Rn×n
are orthogonal matrices and
Σ∈Rm×n is a rectangular diagonal matrix with
the singular values on the diagonal.
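As a quick sanity check, here is a minimal NumPy sketch (the example matrix is arbitrary) that computes the SVD and rebuilds A from its rank-one terms:

```python
import numpy as np

# An arbitrary example matrix with (m, n) = (2, 3).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# full_matrices=False gives the compact SVD with r = min(m, n) singular values.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as the sum of rank-one terms sigma_i * u_i * v_i^T.
A_rebuilt = sum(S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(S)))
print(np.allclose(A, A_rebuilt))  # True
```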
Rank
The rank of A is equal to the number of non-zero singular values. We say
that a matrix A satisfying rank(A)=r=min(m,n) has full rank
(i.e., the maximum rank it can possibly have given its shape).
Equivalently, the rank of A is the number of linearly independent columns (or
rows) of A, which is also the dimension of the column space
C(A).
Inverse
When A has full rank and is also square, such that
rank(A)=m=n, then its inverse A−1 exists and we say
that A is invertible. Expressed in terms of the SVD of A, the
inverse is given by
A^{-1} = \sum_{i=1}^n \sigma_i^{-1} v_i u_i^T = V \Sigma^{-1} U^T.
(When A is symmetric positive semidefinite, the SVD coincides with its
eigendecomposition,
so in that case ui=vi are the eigenvectors of A and σi are its
eigenvalues.) Notice that if one (or more) of the singular values σi=0,
then we cannot compute the inverse because σi−1 does not exist, and
we say that A is singular. Note that we only use the term “singular”
in reference to square matrices.
Condition Number
The “closeness” to singularity for a square matrix A is quantified by
the matrix’s condition number κ(A), which describes how much the
solution x of (1) can change when the value of b
changes. A common definition is
\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},
which is the ratio between the largest and smallest singular values. When
κ(A) is low, small changes in b do not cause large changes in
x, and we say that A is well-conditioned. When κ(A)
is high, small changes in b can cause large changes in x, and we
say that A is ill-conditioned. A larger condition number means A
is closer to singularity, with κ(A)=∞ when A is
singular.
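Here is a small NumPy illustration (the matrices are arbitrary toy examples) of how an ill-conditioned matrix amplifies small changes in b:

```python
import numpy as np

A_good = np.eye(2)                    # singular values 1 and 1
A_bad = np.array([[1.0, 0.0],
                  [0.0, 1e-8]])       # singular values 1 and 1e-8

print(np.linalg.cond(A_good))  # 1.0  (well-conditioned)
print(np.linalg.cond(A_bad))   # 1e8  (ill-conditioned)

# For the ill-conditioned matrix, a tiny change in b causes a large change in x.
b = np.array([1.0, 1.0])
db = np.array([0.0, 1e-6])
x1 = np.linalg.solve(A_bad, b)
x2 = np.linalg.solve(A_bad, b + db)
print(np.linalg.norm(x2 - x1))  # about 100, despite ||db|| = 1e-6
```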
We are now ready to return to our examination of (1).
Solutions of Linear Systems
To determine the number of possible solutions of (1), we use the
augmented matrix
A~=[A b]∈Rm×(n+1),
which is simply A with b added as an additional column.
The Rouché-Capelli
theorem
tells us that:
- if rank(A~)>rank(A), then there is no
solution;
- if rank(A~)=rank(A)=n, then there is a
single unique solution;
- otherwise, there are infinitely many solutions.
When the system has a solution we say it is consistent; if it has no solution
then it is inconsistent. Consistent systems are those satisfying
b∈C(A). When the system has multiple (infinitely many)
solutions, we say that the system is under-determined.
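As an illustration, the three cases can be checked numerically with matrix ranks; the sketch below (classify_system is just an illustrative helper name) applies the Rouché-Capelli test in NumPy:

```python
import numpy as np

def classify_system(A, b):
    """Classify the solutions of Ax = b via the Rouché-Capelli theorem."""
    n = A.shape[1]
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
    if rank_aug > rank_A:
        return "no solution"
    elif rank_A == n:
        return "unique solution"
    return "infinitely many solutions"

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])                        # rank 1
print(classify_system(A, np.array([1.0, 3.0])))   # no solution (b not in C(A))
print(classify_system(A, np.array([1.0, 2.0])))   # infinitely many solutions
print(classify_system(np.eye(2), np.ones(2)))     # unique solution
```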
The case of a single unique solution is easy: when A is square, the inverse A−1
exists and the solution is simply x=A−1b. Let’s take a look
at the other (more interesting) cases.
No Solutions
When the system is inconsistent, we cannot solve it exactly. A reasonable
alternative is to choose the value of x that comes the closest to
satisfying (1) (in the sense of
least-squares). That is, we want
to find the x that solves the least-squares problem
\min_x \; \frac{1}{2}\|Ax - b\|_2^2,
which can be interpreted as finding the vector in C(A) that
has the smallest Euclidean distance to b (note that the coefficient
(1/2) does not change the solution, but avoids ugly coefficients in the
derivative below). To solve (3), we just take the derivative and
set it equal to zero, revealing that the solution satisfies
\frac{\partial}{\partial x}\, \frac{1}{2}\|Ax - b\|_2^2 = A^T A x - A^T b = 0,
which we rearrange to obtain the solution
x=(ATA)−1ATb.
The matrix
A+:=(ATA)−1AT
is known as the Moore-Penrose pseudoinverse of A. In this case, it is a
left inverse because it satisfies the identity A+A=1n
(i.e., it multiplies A from the left; we will see a right inverse
below), where 1n is the n×n identity matrix.
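A small NumPy sketch (with an arbitrary random example) comparing the normal-equations solution to NumPy's built-in least-squares and pseudoinverse routines:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # tall (m > n) with full column rank
b = rng.standard_normal(5)        # generally not in C(A), so no exact solution

# Least-squares solution from the normal equations.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's built-in least-squares and pseudoinverse give the same answer.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
x_pinv = np.linalg.pinv(A) @ b
print(np.allclose(x_normal, x_lstsq), np.allclose(x_normal, x_pinv))  # True True

# The left-inverse identity A^+ A = I_n holds since A has full column rank.
print(np.allclose(np.linalg.pinv(A) @ A, np.eye(3)))  # True
```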
Conditioning and Singularities
As the condition number of ATA increases, small changes in
b (due to noisy measurements, for example) can cause large changes in
the solution of (3). Indeed, if A is not full-rank,
then ATA is singular and cannot be inverted at all. To see this,
we can use the SVD of A to obtain
(A^T A)^{-1} = (V \Sigma^T U^T U \Sigma V^T)^{-1} = (V \Sigma^T \Sigma V^T)^{-1} = V (\Sigma^T \Sigma)^{-1} V^T = \sum_{i=1}^n \sigma_i^{-2} v_i v_i^T,
which shows that we cannot compute (ATA)−1 if any σi=0,
because then σi−2 does not exist.
To deal with poor conditioning and singularities, we can regularize the
solution by adding a damping term to (3) so that it becomes
\min_x \; \frac{1}{2}\|Ax - b\|_2^2 + \frac{\alpha}{2}\|x\|_2^2,
where α≥0 is a small constant damping factor that serves to bias
the solution toward a smaller norm (this regularization approach is commonly
known as ridge regression).
The solution of (4) is
x=(ATA+α1n)−1ATb,
where
A+(α):=(ATA+α1n)−1AT
is called the damped pseudoinverse of A. Again using the SVD of
A, we have
(A^T A + \alpha 1_n)^{-1} = (V \Sigma^T \Sigma V^T + \alpha 1_n)^{-1} = V (\Sigma^T \Sigma + \alpha 1_n)^{-1} V^T = \sum_{i=1}^n (\sigma_i^2 + \alpha)^{-1} v_i v_i^T,
which shows that the matrix inverse in (5) always exists for
α>0, since α>0 implies (σi2+α)−1 exists even if σi=0.
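A minimal NumPy sketch of this (damped_pinv is just an illustrative helper name); note that it remains well-defined even for the rank-deficient example below, where the undamped normal equations would fail:

```python
import numpy as np

def damped_pinv(A, alpha):
    """Damped pseudoinverse (A^T A + alpha * I)^{-1} A^T."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T)

# A rank-deficient matrix: A^T A is singular, so the undamped formula fails.
A = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

x = damped_pinv(A, alpha=1e-3) @ b
print(x)  # a finite, small-norm least-squares solution
```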
Infinite Solutions
When the system is under-determined, there are infinitely many values of x that
satisfy (1), so we need to pick one. If we imagine that x
represents some sort of effort or resource we want to save, a reasonable choice
is the solution with the smallest Euclidean norm; that is, the solution
satisfying
\min_x \; \frac{1}{2}\|x\|_2^2 \quad \text{subject to} \quad Ax = b.
To solve this optimization problem, we construct the Lagrangian
function
L(x,λ)=(1/2)∥x∥22+λT(Ax−b),
where λ∈Rm is the vector of Lagrange multipliers. At
the optimum we have
\frac{\partial L}{\partial x} = x + A^T \lambda = 0,
which we can left-multiply by A to obtain
AATλ=−Ax=−b (using the constraint Ax=b), and thus
λ=−(AAT)−1b. Substituting back into
(7) and rearranging yields the solution
x=AT(AAT)−1b.
The matrix A+=AT(AAT)−1 is again the Moore-Penrose
pseudoinverse of A, but this time it is a right inverse because it
satisfies AA+=1m.
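The following NumPy sketch (assuming SciPy is also available, used here only to build a second solution from a nullspace vector) checks that the right pseudoinverse gives an exact solution with the smallest norm:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 4))   # wide (m < n): infinitely many solutions
b = rng.standard_normal(2)

# Minimum-norm solution via the right pseudoinverse A^T (A A^T)^{-1}.
x_min = A.T @ np.linalg.solve(A @ A.T, b)
print(np.allclose(A @ x_min, b))                      # True: an exact solution
print(np.allclose(x_min, np.linalg.pinv(A) @ b))      # True: same as np.linalg.pinv

# Any other solution (x_min plus a nullspace vector) has a larger norm.
x_other = x_min + null_space(A)[:, 0]
print(np.allclose(A @ x_other, b))                      # True: still a solution
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))  # True
```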
The Moore-Penrose Pseudoinverse
Wait a minute — we said previously that the Moore-Penrose pseudoinverse
of A is (ATA)−1AT, but now we are saying it is
AT(AAT)−1. What’s going on?
The Moore-Penrose pseudoinverse always exists and is
unique,
with a general expression given in terms of the SVD of A:
A^+ = \sum_{i=1}^r \sigma_i^+ v_i u_i^T \in \mathbb{R}^{n \times m},
where
\sigma_i^+ = \begin{cases} \sigma_i^{-1} & \text{if } \sigma_i > 0, \\ 0 & \text{otherwise.} \end{cases}
(Notice the similarity to (2).)
However, if A is full-rank, then the pseudoinverse also has a nice
algebraic expression (which is equivalent to (9)):
- if m=n, A+=A−1;
- if m<n, A+=AT(AAT)−1;
- if m>n, A+=(ATA)−1AT.
The pseudoinverse A+ always has the same rank as A.
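Here is a minimal sketch of building the pseudoinverse directly from the SVD (pinv_from_svd and the tolerance are illustrative choices), which works even in the rank-deficient case:

```python
import numpy as np

def pinv_from_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse built directly from the SVD of A."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    S_plus = np.array([1.0 / s if s > tol else 0.0 for s in S])
    return Vt.T @ np.diag(S_plus) @ U.T

# Works even when A is rank-deficient, where the algebraic formulas break down.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])  # rank 1
print(np.allclose(pinv_from_svd(A), np.linalg.pinv(A)))  # True
```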
The Damped Pseudoinverse
Recall that we introduced damping in (4) to handle ill-conditioning when
ATA approached singularity, resulting in the damped pseudoinverse
A+(α). We can also obtain the damped pseudoinverse by modifying
(6) to defend ourselves against poor conditioning in
AAT, resulting in the problem
\min_{x,s} \; \frac{\alpha}{2}\|x\|_2^2 + \frac{1}{2}\|s\|_2^2 \quad \text{subject to} \quad Ax = b + s,
where we have introduced the slack variable s∈Rm and α>0 is
again the damping factor. This problem says that we are okay with a bit of
error (“slack”) in the solution to (1) if it means we can reduce the
norm of x in exchange, with α controlling the trade-off.
The optimal solution is
x=AT(AAT+α1m)−1b.
The matrix AT(AAT+α1m)−1 has a similar
structure to the damped pseudoinverse from (5) but does
not look exactly the same; however, they are indeed equal for any α>0.
To see this, first observe that
(ATA+α1n)AT=AT(AAT+α1m),
where we have just grouped the terms differently on each side. Left-multiplying
both sides by (ATA+α1n)−1 and right-multiplying
both sides by (AAT+α1m)−1, we obtain
AT(AAT+α1m)−1=(ATA+α1n)−1AT=A+(α),
which shows that both expressions for the damped pseudoinverse are equivalent.
Indeed, the Moore-Penrose pseudoinverse can be expressed as the following equivalent
limits:
A^+ = \lim_{\alpha \to 0} A^+(\alpha) = \lim_{\alpha \to 0} A^T (A A^T + \alpha 1_m)^{-1} = \lim_{\alpha \to 0} (A^T A + \alpha 1_n)^{-1} A^T,
which always exist but may not be full-rank. In contrast, the matrix inverted
in the damped pseudoinverse always has full rank for α>0, and its conditioning
improves as α increases. However, the trade-off is that increasing
α reduces the accuracy of x=A+(α)b as a solution
to (1).
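A quick numerical check (with an arbitrary random matrix) that the two algebraic forms of the damped pseudoinverse agree and approach the Moore-Penrose pseudoinverse as α shrinks:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
m, n = A.shape
alpha = 1e-2

# The two algebraic forms of the damped pseudoinverse agree.
left_form = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T)
right_form = A.T @ np.linalg.inv(A @ A.T + alpha * np.eye(m))
print(np.allclose(left_form, right_form))  # True

# As alpha shrinks, the damped pseudoinverse approaches the Moore-Penrose one.
nearly_undamped = A.T @ np.linalg.inv(A @ A.T + 1e-9 * np.eye(m))
print(np.linalg.norm(nearly_undamped - np.linalg.pinv(A)))  # tiny
```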
Secondary Objectives
Instead of solving (6) for the minimum-norm solution of (1), we
may want to find the x satisfying (1) that is closest to some
other vector y∈Rn. That is, we want to choose the x
satisfying
\min_x \; \frac{1}{2}\|x - y\|_2^2 \quad \text{subject to} \quad Ax = b.
Going through the same process with the Lagrangian as above, we obtain the
solution
x=A+b+(1n−A+A)y,
which of course reduces to (8) when y=0. So
what’s the deal with the matrix
P(A):=1n−A+A?
Nullspace Projection
The matrix P(A) is called a nullspace projector because it projects
vectors onto the nullspace N(A). To see this, let x=P(A)y.
Then we have
Ax=AP(A)y=A(1n−A+A)y=(A−A)y=0,
where we have used the fact that the pseudoinverse obeys the identity
AA+A=A, and so we see that P(A)y
belongs to N(A) for any vector y. This means that no
matter what value of y we choose, (12) will always
be a solution to (1).
It turns out that P(A)y is in fact the closest vector in
N(A) to y in the sense of least-squares. To see this,
consider the least-squares optimization problem
\min_x \; \frac{1}{2}\|x - y\|_2^2 \quad \text{subject to} \quad Ax = 0.
This is the same problem as (11) with b=0, so the solution is
simply x=P(A)y — the nullspace projector!
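A short NumPy sketch (with an arbitrary random example) verifying the key properties of the nullspace projector:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 4))
b = rng.standard_normal(2)
y = rng.standard_normal(4)        # an arbitrary "preferred" vector

A_pinv = np.linalg.pinv(A)
P = np.eye(4) - A_pinv @ A        # nullspace projector

print(np.allclose(A @ (P @ y), 0.0))   # True: P y lies in N(A)

x = A_pinv @ b + P @ y
print(np.allclose(A @ x, b))           # True: the constraint still holds

print(np.allclose(P @ P, P))           # True: projectors are idempotent
```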
Secondary Objectives with Damping
If we modify (11) with a slack variable and damping factor similar to
(10), we obtain the problem
\min_{x,s} \; \frac{\alpha}{2}\|x - y\|_2^2 + \frac{1}{2}\|s\|_2^2 \quad \text{subject to} \quad Ax = b + s,
which has the optimal solution
x=A+(α)b+P(A,α)y,
where
P(A,α):=1n−A+(α)A
is the damped (and therefore approximate) nullspace projector.
Robot Manipulator Control
Let’s see how we can apply some of these ideas to robotics.
Suppose we have a robot arm with n joints. The relationship between the
velocity of the joints (i.e., the arm’s motor speeds) and the velocity of the
robot’s hand in Cartesian space is given by the linear equation
Jv=ξ,
where J∈Rm×n is known as the Jacobian matrix (and in
general depends on the robot’s current configuration), v∈Rn
is the vector of joint velocities, and ξ∈Rm is the
Cartesian velocity vector.
Many modern robot arms are redundant, which means that m<n and (13)
has infinitely many solutions (typically m=6, representing the three-dimensional linear
and angular velocities). Thus we can choose from a set of joint velocities that
all achieve (approximately) the same Cartesian velocity:
v=J+(α)ξ+P(J,α)v0,
where we use a damping factor α>0 to avoid singularities and we can design
v0 to accomplish some secondary objective (see this
paper for more
details). Common choices for the secondary objective include steering toward a
given “home” configuration, avoiding configurations that cause the Jacobian to
lose rank, or avoiding collisions with obstacles.
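Here is a minimal sketch of this redundancy-resolution scheme in NumPy; the function name, gains, and the random Jacobian stand-in are all illustrative rather than taken from any particular robot:

```python
import numpy as np

def redundancy_resolution(J, xi, v0, alpha=1e-3):
    """Damped least-squares joint velocities plus a nullspace secondary task."""
    m, n = J.shape
    J_pinv_damped = J.T @ np.linalg.inv(J @ J.T + alpha * np.eye(m))
    P = np.eye(n) - J_pinv_damped @ J          # approximate nullspace projector
    return J_pinv_damped @ xi + P @ v0

# Toy example: a 7-joint arm tracking a 6-dimensional Cartesian velocity, with
# a secondary task pulling the joints toward a home configuration.
rng = np.random.default_rng(4)
J = rng.standard_normal((6, 7))                # stand-in for a manipulator Jacobian
xi = np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.0])  # desired Cartesian velocity
q = rng.standard_normal(7)                     # current joint configuration
q_home = np.zeros(7)
v0 = -(q - q_home)                             # proportional pull toward home

v = redundancy_resolution(J, xi, v0)
print(np.linalg.norm(J @ v - xi))  # small tracking error introduced by damping
```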
Quadratic Programming
These days it is common to solve an optimization problem online in the control
loop rather than use the analytic solution (14), so that we can include
inequality constraints like joint limits. That is, at each control step we
obtain v by solving a problem like
\min_v \; \frac{1}{2}\|v - v_0\|_2^2 \quad \text{subject to} \quad Jv = \xi, \quad v_{\min} \le v \le v_{\max},
which is a quadratic program because it has a quadratic objective and affine
constraints. (Convex) quadratic programs can be solved efficiently with a
variety of mature off-the-shelf solvers (see
qpsolvers for a list of solvers and
to easily try them out in Python).
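As a sketch of what this looks like in code, here is a toy version of (15) set up for qpsolvers (assuming the library and a backend such as OSQP are installed; the random Jacobian, velocity, and limits are illustrative):

```python
import numpy as np
from qpsolvers import solve_qp

rng = np.random.default_rng(5)
m, n = 6, 7
J = rng.standard_normal((m, n))          # stand-in for a manipulator Jacobian
xi = J @ (0.1 * rng.standard_normal(n))  # a reachable Cartesian velocity
v0 = 0.1 * rng.standard_normal(n)        # secondary objective (e.g. toward home)
v_max = np.ones(n)                       # symmetric joint velocity limits

# min (1/2)||v - v0||^2  s.t.  J v = xi,  -v_max <= v <= v_max,
# written in the standard form (1/2) v^T P v + q^T v used by qpsolvers.
P = np.eye(n)
q = -v0
v = solve_qp(P, q, A=J, b=xi, lb=-v_max, ub=v_max, solver="osqp")
print(v)
```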
We could of course modify (15) to add slack and damping similar to the
problems earlier in the post. A slightly different approach is to instead
modify the problem to
\min_{v,s} \; \frac{\alpha}{2}\|v - v_0\|_2^2 + s \quad \text{subject to} \quad Jv = (1-s)\xi, \quad 0 \le s \le 1, \quad v_{\min} \le v \le v_{\max},
which ensures the resulting Cartesian velocity is always in the desired
direction but can scale it down to make v closer to v0, where
(1−s) is the scaling factor and α>0 controls the trade-off. When s=0
we recover (15), but unlike (15), this problem is always
feasible, since (v,s)=(0,1) is a feasible point (assuming the joint velocity limits admit v=0).
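A sketch of how this scaled problem could be posed for qpsolvers by stacking the decision variable z=[v; s]; the numbers are illustrative, and a small epsilon is added only to keep the Hessian positive definite for solvers that require it:

```python
import numpy as np
from qpsolvers import solve_qp

rng = np.random.default_rng(6)
m, n = 6, 7
alpha = 1e-2
J = rng.standard_normal((m, n))
xi = J @ (0.1 * rng.standard_normal(n))
v0 = 0.1 * rng.standard_normal(n)
v_max = np.ones(n)

# Decision variable z = [v; s].  The objective (alpha/2)||v - v0||^2 + s becomes
# (1/2) z^T P z + q^T z up to a constant; a tiny epsilon on the s entry keeps P
# positive definite for solvers that require it.
P = np.diag(np.concatenate([alpha * np.ones(n), [1e-8]]))
q = np.concatenate([-alpha * v0, [1.0]])

# The equality J v = (1 - s) xi is rewritten as [J  xi] z = xi.
A_eq = np.hstack([J, xi.reshape(-1, 1)])
lb = np.concatenate([-v_max, [0.0]])
ub = np.concatenate([v_max, [1.0]])

z = solve_qp(P, q, A=A_eq, b=xi, lb=lb, ub=ub, solver="osqp")
v, s = z[:n], z[n]
print(s, np.linalg.norm(J @ v - (1 - s) * xi))  # scaling and constraint residual
```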
Thanks to Abhishek Goudar for reading an early draft of this.