In this post we take a look at some linear algebra through the lens of
optimization and then see how it can be applied to robot manipulator control.
In particular, we will spend a fair bit of time looking at pseudoinverses and
nullspace projection.
Linear Systems of Equations
As engineers we often encounter linear systems of equations of the form
Ax=b,
where we are given a matrix A∈Rm×n and a vector
b∈Rm, and we want to find a vector x∈Rn
satisfying the equation. In this post, we are interested in the solutions of
(1). We will begin with some background review to make this
post fairly self-contained.
Background
Column Space and Nullspace
The column space C(A) of a matrix A is the span of
its columns; that is, it is the set of all linear combinations of the columns
of A:
C(A)={Ax∣x∈Rn}.
The nullspace (also called the kernel) N(A) of A is
the set of all vectors x for which Ax=0:
N(A)={x∈Rn∣Ax=0}.
Singular Values
Any matrix A∈Rm×n can be expressed using the
singular value decomposition
(SVD) as
A = \sum_{i=1}^r \sigma_i u_i v_i^T,
where r=min(m,n), σi is the ith singular value of A, and
ui∈Rm and vi∈Rn are the ith left and right singular
vectors of A. The singular values are all non-negative and each set of
singular vectors forms an orthonormal basis. Expressed in matrix form, we have
A=UΣVT,
where U∈Rm×m and V∈Rn×n
are orthogonal matrices and
Σ∈Rm×n is a rectangular diagonal matrix with
the singular values on the diagonal.
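As a quick sanity check, here is a minimal NumPy sketch (the example matrix is arbitrary) that computes the SVD and rebuilds A from its rank-one terms:

```python
import numpy as np

# An arbitrary example matrix with (m, n) = (2, 3).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# full_matrices=False gives the compact SVD with r = min(m, n) singular values.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as the sum of rank-one terms sigma_i * u_i * v_i^T.
A_rebuilt = sum(S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(S)))
print(np.allclose(A, A_rebuilt))  # True
```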
Rank
The rank of A is equal to the number of non-zero singular values. We say
that a matrix A satisfying rank(A)=r=min(m,n) has full rank
(i.e., the maximum rank it can possibly have given its shape).
Equivalently, the rank of A is the number of linearly independent columns (or
rows) of A, which is also the dimension of the column space
C(A).
Inverse
When A has full rank and is also square, such that
rank(A)=m=n, then its inverse A−1 exists and we say
that A is invertible. Expressed in terms of the SVD of A, the
inverse is given by
A^{-1} = \sum_{i=1}^n \sigma_i^{-1} v_i u_i^T = V \Sigma^{-1} U^T.
(When A is symmetric positive semidefinite, the SVD coincides with its
eigendecomposition,
so in that case ui=vi are the eigenvectors of A and σi are its
eigenvalues.) Notice that if one (or more) of the singular values σi=0,
then we cannot compute the inverse because σi−1 does not exist, and
we say that A is singular. Note that we only use the term “singular”
in reference to square matrices.
Condition Number
The “closeness” to singularity for a square matrix A is quantified by
the matrix’s condition number κ(A), which describes how much the
solution x of (1) can change when the value of b
changes. A common definition is
\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},
which is the ratio between the largest and smallest singular values. When
κ(A) is low, small changes in b do not cause large changes in
x, and we say that A is well-conditioned. When κ(A)
is high, small changes in b can cause large changes in x, and we
say that A is ill-conditioned. A larger condition number means A
is closer to singularity, with κ(A)=∞ when A is
singular.
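Here is a small NumPy illustration (the matrices are arbitrary toy examples) of how an ill-conditioned matrix amplifies small changes in b:

```python
import numpy as np

A_good = np.eye(2)                    # singular values 1 and 1
A_bad = np.array([[1.0, 0.0],
                  [0.0, 1e-8]])       # singular values 1 and 1e-8

print(np.linalg.cond(A_good))  # 1.0  (well-conditioned)
print(np.linalg.cond(A_bad))   # 1e8  (ill-conditioned)

# For the ill-conditioned matrix, a tiny change in b causes a large change in x.
b = np.array([1.0, 1.0])
db = np.array([0.0, 1e-6])
x1 = np.linalg.solve(A_bad, b)
x2 = np.linalg.solve(A_bad, b + db)
print(np.linalg.norm(x2 - x1))  # about 100, despite ||db|| = 1e-6
```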
We are now ready to return to our examination of (1).
Solutions of Linear Systems
To determine the number of possible solutions of (1), we use the
augmented matrix
A~=[A b]∈Rm×(n+1),
which is simply A with b added as an additional column.
The Rouché-Capelli
theorem
tells us that:
- if rank(A~)>rank(A), then there is no
solution;
- if rank(A~)=rank(A)=n, then there is a
single unique solution;
- otherwise, there are infinitely many solutions.
When the system has a solution we say it is consistent; if it has no solution
then it is inconsistent. Consistent systems are those satisfying
b∈C(A). When the system has multiple (infinitely many)
solutions, we say that the system is under-determined.
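As an illustration, the three cases can be checked numerically with matrix ranks; the sketch below (classify_system is just an illustrative helper name) applies the Rouché-Capelli test in NumPy:

```python
import numpy as np

def classify_system(A, b):
    """Classify the solutions of Ax = b via the Rouché-Capelli theorem."""
    n = A.shape[1]
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
    if rank_aug > rank_A:
        return "no solution"
    elif rank_A == n:
        return "unique solution"
    return "infinitely many solutions"

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])                        # rank 1
print(classify_system(A, np.array([1.0, 3.0])))   # no solution (b not in C(A))
print(classify_system(A, np.array([1.0, 2.0])))   # infinitely many solutions
print(classify_system(np.eye(2), np.ones(2)))     # unique solution
```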
The case of a single unique solution is easy: when A is square, the inverse A−1
exists and the solution is simply x=A−1b. Let’s take a look
at the other (more interesting) cases.
No Solutions
When the system is inconsistent, we cannot solve it exactly. A reasonable
alternative is to choose the value of x that comes the closest to
satisfying (1) (in the sense of
least-squares). That is, we want
to find the x that solves the least-squares problem
\min_x \; \frac{1}{2}\|Ax - b\|_2^2,
which can be interpreted as finding the vector in C(A) that
has the smallest Euclidean distance to b (note that the coefficient
(1/2) does not change the solution, but avoids ugly coefficients in the
derivative below). To solve (3), we just take the derivative and
set it equal to zero, revealing that the solution satisfies
\frac{\partial}{\partial x}\, \frac{1}{2}\|Ax - b\|_2^2 = A^T A x - A^T b = 0,
which we rearrange to obtain the solution
x=(ATA)−1ATb.
The matrix
A+:=(ATA)−1AT
is known as the Moore-Penrose pseudoinverse of A. In this case, it is a
left inverse because it satisfies the identity A+A=1n
(i.e., it multiplies A from the left; we will see a right inverse
below), where 1n is the n×n identity matrix.
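A small NumPy sketch (with an arbitrary random example) comparing the normal-equations solution to NumPy's built-in least-squares and pseudoinverse routines:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # tall (m > n) with full column rank
b = rng.standard_normal(5)        # generally not in C(A), so no exact solution

# Least-squares solution from the normal equations.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's built-in least-squares and pseudoinverse give the same answer.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
x_pinv = np.linalg.pinv(A) @ b
print(np.allclose(x_normal, x_lstsq), np.allclose(x_normal, x_pinv))  # True True

# The left-inverse identity A^+ A = I_n holds since A has full column rank.
print(np.allclose(np.linalg.pinv(A) @ A, np.eye(3)))  # True
```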
Conditioning and Singularities
As the condition number of ATA increases, small changes in
b (due to noisy measurements, for example) can cause large changes in
the solution of (3). Indeed, if A is not full-rank,
then ATA is singular and cannot be inverted at all. To see this,
we can use the SVD of A to obtain
(A^T A)^{-1} = (V \Sigma^T U^T U \Sigma V^T)^{-1} = (V \Sigma^T \Sigma V^T)^{-1} = V (\Sigma^T \Sigma)^{-1} V^T = \sum_{i=1}^n \sigma_i^{-2} v_i v_i^T,
which shows that we cannot compute (ATA)−1 if any σi=0,
because then σi−2 does not exist.
To deal with poor conditioning and singularities, we can regularize the
solution by adding a damping term to (3) so that it becomes
\min_x \; \frac{1}{2}\|Ax - b\|_2^2 + \frac{\alpha}{2}\|x\|_2^2,
where α≥0 is a small constant damping factor that serves to bias
the solution toward a smaller norm (this regularization approach is commonly
known as ridge regression).
The solution of (4) is
x=(ATA+α1n)−1ATb,
where
A+(α):=(ATA+α1n)−1AT
is called the damped pseudoinverse of A. Again using the SVD of
A, we have
(A^T A + \alpha 1_n)^{-1} = (V \Sigma^T \Sigma V^T + \alpha 1_n)^{-1} = V (\Sigma^T \Sigma + \alpha 1_n)^{-1} V^T = \sum_{i=1}^n (\sigma_i^2 + \alpha)^{-1} v_i v_i^T,
which shows that the matrix inverse in (5) always exists for
α>0, since α>0 implies (σi2+α)−1 exists even if σi=0.
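A minimal NumPy sketch of this (damped_pinv is just an illustrative helper name); note that it remains well-defined even for the rank-deficient example below, where the undamped normal equations would fail:

```python
import numpy as np

def damped_pinv(A, alpha):
    """Damped pseudoinverse (A^T A + alpha * I)^{-1} A^T."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T)

# A rank-deficient matrix: A^T A is singular, so the undamped formula fails.
A = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

x = damped_pinv(A, alpha=1e-3) @ b
print(x)  # a finite, small-norm least-squares solution
```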
Infinite Solutions
When the system is under-determined, there are infinitely many values of x that
satisfy (1), so we need to pick one. If we imagine that x
represents some sort of effort or resource we want to save, a reasonable choice
is the solution with the smallest Euclidean norm; that is, the solution
satisfying
\min_x \; \frac{1}{2}\|x\|_2^2 \quad \text{subject to} \quad Ax = b.
To solve this optimization problem, we construct the Lagrangian
function
L(x,λ)=(1/2)∥x∥22+λT(Ax−b),
where λ∈Rm is the vector of Lagrange multipliers. At
the optimum we have
\frac{\partial L}{\partial x} = x + A^T \lambda = 0,
which we can left-multiply by A to obtain
AATλ=−Ax=−b (using the constraint Ax=b), and thus
λ=−(AAT)−1b. Substituting back into
(7) and rearranging yields the solution
x=AT(AAT)−1b.
The matrix A+=AT(AAT)−1 is again the Moore-Penrose
pseudoinverse of A, but this time it is a right inverse because it
satisfies AA+=1m.
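The following NumPy sketch (assuming SciPy is also available, used here only to build a second solution from a nullspace vector) checks that the right pseudoinverse gives an exact solution with the smallest norm:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 4))   # wide (m < n): infinitely many solutions
b = rng.standard_normal(2)

# Minimum-norm solution via the right pseudoinverse A^T (A A^T)^{-1}.
x_min = A.T @ np.linalg.solve(A @ A.T, b)
print(np.allclose(A @ x_min, b))                      # True: an exact solution
print(np.allclose(x_min, np.linalg.pinv(A) @ b))      # True: same as np.linalg.pinv

# Any other solution (x_min plus a nullspace vector) has a larger norm.
x_other = x_min + null_space(A)[:, 0]
print(np.allclose(A @ x_other, b))                      # True: still a solution
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))  # True
```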
The Moore-Penrose Pseudoinverse
Wait a minute — we said previously that the Moore-Penrose pseudoinverse
of A is (ATA)−1AT, but now we are saying it is
AT(AAT)−1. What’s going on?
The Moore-Penrose pseudoinverse always exists and is
unique,
with a general expression given in terms of the SVD of A:
A^+ = \sum_{i=1}^r \sigma_i^+ v_i u_i^T \in \mathbb{R}^{n \times m},
where
\sigma_i^+ = \begin{cases} \sigma_i^{-1} & \text{if } \sigma_i > 0, \\ 0 & \text{otherwise.} \end{cases}
(Notice the similarity to (2).)
However, if A is full-rank, then the pseudoinverse also has a nice
algebraic expression (which is equivalent to (9)):
- if m=n, A+=A−1;
- if m<n, A+=AT(AAT)−1;
- if m>n, A+=(ATA)−1AT.
The pseudoinverse A+ always has the same rank as A.
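Here is a minimal sketch of building the pseudoinverse directly from the SVD (pinv_from_svd and the tolerance are illustrative choices), which works even in the rank-deficient case:

```python
import numpy as np

def pinv_from_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse built directly from the SVD of A."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    S_plus = np.array([1.0 / s if s > tol else 0.0 for s in S])
    return Vt.T @ np.diag(S_plus) @ U.T

# Works even when A is rank-deficient, where the algebraic formulas break down.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])  # rank 1
print(np.allclose(pinv_from_svd(A), np.linalg.pinv(A)))  # True
```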
The Damped Pseudoinverse
Recall that we introduced damping in (4) to handle ill-conditioning when
ATA approached singularity, resulting in the damped pseudoinverse
A+(α). We can also obtain the damped pseudoinverse by modifying
(6) to defend ourselves against poor conditioning in
AAT, resulting in the problem
\min_{x,s} \; \frac{\alpha}{2}\|x\|_2^2 + \frac{1}{2}\|s\|_2^2 \quad \text{subject to} \quad Ax = b + s,
where we have introduced the slack variable s∈Rm and α>0 is
again the damping factor. This problem says that we are okay with a bit of
error (“slack”) in the solution to (1) if it means we can reduce the
norm of x in exchange, with α controlling the trade-off.
The optimal solution is
x=AT(AAT+α1m)−1b.
The matrix AT(AAT+α1m)−1 has a similar
structure to the damped pseudoinverse from (5) but does
not look exactly the same; however, they are indeed equal for any α>0.
To see this, first observe that
(ATA+α1n)AT=AT(AAT+α1m),
where we have just grouped the terms differently on each side. Left-multiplying
both sides by (ATA+α1n)−1 and right-multiplying
both sides by (AAT+α1m)−1, we obtain
AT(AAT+α1m)−1=(ATA+α1n)−1AT=A+(α),
which shows that both expressions for the damped pseudoinverse are equivalent.
Indeed, the Moore-Penrose pseudoinverse can be expressed as the following equivalent
limits:
A^+ = \lim_{\alpha \to 0} A^+(\alpha) = \lim_{\alpha \to 0} A^T (A A^T + \alpha 1_m)^{-1} = \lim_{\alpha \to 0} (A^T A + \alpha 1_n)^{-1} A^T,
which always exist but may not be full-rank. In contrast, the matrix inverted
in the damped pseudoinverse always has full rank for α>0, and its conditioning
improves as α increases. However, the trade-off is that increasing
α reduces the accuracy of x=A+(α)b as a solution
to (1).
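A quick numerical check (with an arbitrary random matrix) that the two algebraic forms of the damped pseudoinverse agree and approach the Moore-Penrose pseudoinverse as α shrinks:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
m, n = A.shape
alpha = 1e-2

# The two algebraic forms of the damped pseudoinverse agree.
left_form = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T)
right_form = A.T @ np.linalg.inv(A @ A.T + alpha * np.eye(m))
print(np.allclose(left_form, right_form))  # True

# As alpha shrinks, the damped pseudoinverse approaches the Moore-Penrose one.
nearly_undamped = A.T @ np.linalg.inv(A @ A.T + 1e-9 * np.eye(m))
print(np.linalg.norm(nearly_undamped - np.linalg.pinv(A)))  # tiny
```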
Secondary Objectives
Instead of solving (6) for the minimum-norm solution of (1), we
may want to find the x satisfying (1) that is closest to some
other vector y∈Rn. That is, we want to choose the x
satisfying
\min_x \; \frac{1}{2}\|x - y\|_2^2 \quad \text{subject to} \quad Ax = b.
Going through the same process with the Lagrangian as above, we obtain the
solution
x=A+b+(1n−A+A)y,
which of course reduces to (8) when y=0. So
what’s the deal with the matrix
P(A):=1n−A+A?
Nullspace Projection
The matrix P(A) is called a nullspace projector because it projects
vectors onto the nullspace N(A). To see this, let x=P(A)y.
Then we have
Ax=AP(A)y=A(1n−A+A)y=(A−A)y=0,
where we have used the fact that the pseudoinverse obeys the identity
AA+A=A, and so we see that P(A)y
belongs to N(A) for any vector y. This means that no
matter what value of y we choose, (12) will always
be a solution to (1).
It turns out that P(A)y is in fact the closest vector in
N(A) to y in the sense of least-squares. To see this,
consider the least-squares optimization problem
\min_x \; \frac{1}{2}\|x - y\|_2^2 \quad \text{subject to} \quad Ax = 0.
This is the same problem as (11) with b=0, so the solution is
simply x=P(A)y — the nullspace projector!
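A short NumPy sketch (with an arbitrary random example) verifying the key properties of the nullspace projector:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 4))
b = rng.standard_normal(2)
y = rng.standard_normal(4)        # an arbitrary "preferred" vector

A_pinv = np.linalg.pinv(A)
P = np.eye(4) - A_pinv @ A        # nullspace projector

print(np.allclose(A @ (P @ y), 0.0))   # True: P y lies in N(A)

x = A_pinv @ b + P @ y
print(np.allclose(A @ x, b))           # True: the constraint still holds

print(np.allclose(P @ P, P))           # True: projectors are idempotent
```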
Secondary Objectives with Damping
If we modify (11) with a slack variable and damping factor similar to
(10), we obtain the problem
\min_{x,s} \; \frac{\alpha}{2}\|x - y\|_2^2 + \frac{1}{2}\|s\|_2^2 \quad \text{subject to} \quad Ax = b + s,
which has the optimal solution
x=A+(α)b+P(A,α)y,
where
P(A,α):=1n−A+(α)A
is the damped (and therefore approximate) nullspace projector.
Robot Manipulator Control
Let’s see how we can apply some of these ideas to robotics.
Suppose we have a robot arm with n joints. The relationship between the
velocity of the joints (i.e., the arm’s motor speeds) and the velocity of the
robot’s hand in Cartesian space is given by the linear equation
Jv=ξ,
where J∈Rm×n is known as the Jacobian matrix (and in
general depends on the robot’s current configuration), v∈Rn
is the vector of joint velocities, and ξ∈Rm is the
Cartesian velocity vector.
Many modern robot arms are redundant, which means that m<n and (13)
has infinitely many solutions (typically m=6, representing the three-dimensional linear
and angular velocities). Thus we can choose from a set of joint velocities that
all achieve (approximately) the same Cartesian velocity:
v=J+(α)ξ+P(J,α)v0,
where we use a damping factor α>0 to avoid singularities and we can design
v0 to accomplish some secondary objective (see this
paper for more
details). Common choices for the secondary objective include steering toward a
given “home” configuration, avoiding configurations that cause the Jacobian to
lose rank, or avoiding collisions with obstacles.
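Here is a minimal sketch of this redundancy-resolution scheme in NumPy; the function name, gains, and the random Jacobian stand-in are all illustrative rather than taken from any particular robot:

```python
import numpy as np

def redundancy_resolution(J, xi, v0, alpha=1e-3):
    """Damped least-squares joint velocities plus a nullspace secondary task."""
    m, n = J.shape
    J_pinv_damped = J.T @ np.linalg.inv(J @ J.T + alpha * np.eye(m))
    P = np.eye(n) - J_pinv_damped @ J          # approximate nullspace projector
    return J_pinv_damped @ xi + P @ v0

# Toy example: a 7-joint arm tracking a 6-dimensional Cartesian velocity, with
# a secondary task pulling the joints toward a home configuration.
rng = np.random.default_rng(4)
J = rng.standard_normal((6, 7))                # stand-in for a manipulator Jacobian
xi = np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.0])  # desired Cartesian velocity
q = rng.standard_normal(7)                     # current joint configuration
q_home = np.zeros(7)
v0 = -(q - q_home)                             # proportional pull toward home

v = redundancy_resolution(J, xi, v0)
print(np.linalg.norm(J @ v - xi))  # small tracking error introduced by damping
```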
Quadratic Programming
These days it is common to solve an optimization problem online in the control
loop rather than use the analytic solution (14), so that we can include
inequality constraints like joint limits. That is, at each control step we
obtain v by solving a problem like
\min_v \; \frac{1}{2}\|v - v_0\|_2^2 \quad \text{subject to} \quad Jv = \xi, \quad v_{\min} \le v \le v_{\max},
which is a quadratic program because it has a quadratic objective and affine
constraints. (Convex) quadratic programs can be solved efficiently with a
variety of mature off-the-shelf solvers (see
qpsolvers for a list of solvers and
to easily try them out in Python).
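As a sketch of what this looks like in code, here is a toy version of (15) set up for qpsolvers (assuming the library and a backend such as OSQP are installed; the random Jacobian, velocity, and limits are illustrative):

```python
import numpy as np
from qpsolvers import solve_qp

rng = np.random.default_rng(5)
m, n = 6, 7
J = rng.standard_normal((m, n))          # stand-in for a manipulator Jacobian
xi = J @ (0.1 * rng.standard_normal(n))  # a reachable Cartesian velocity
v0 = 0.1 * rng.standard_normal(n)        # secondary objective (e.g. toward home)
v_max = np.ones(n)                       # symmetric joint velocity limits

# min (1/2)||v - v0||^2  s.t.  J v = xi,  -v_max <= v <= v_max,
# written in the standard form (1/2) v^T P v + q^T v used by qpsolvers.
P = np.eye(n)
q = -v0
v = solve_qp(P, q, A=J, b=xi, lb=-v_max, ub=v_max, solver="osqp")
print(v)
```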
We could of course modify (15) to add slack and damping similar to the
problems earlier in the post. A slightly different approach is to instead
modify the problem to
\min_{v,s} \; \frac{\alpha}{2}\|v - v_0\|_2^2 + s \quad \text{subject to} \quad Jv = (1-s)\xi, \quad 0 \le s \le 1, \quad v_{\min} \le v \le v_{\max},
which ensures the resulting Cartesian velocity is always in the desired
direction but can scale it down to make v closer to v0, where
(1−s) is the scaling factor and α>0 controls the trade-off. When s=0
we recover (15), but unlike (15), this problem is always
feasible, since (v,s)=(0,1) is a feasible point (assuming the joint velocity limits admit v=0).
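A sketch of how this scaled problem could be posed for qpsolvers by stacking the decision variable z=[v; s]; the numbers are illustrative, and a small epsilon is added only to keep the Hessian positive definite for solvers that require it:

```python
import numpy as np
from qpsolvers import solve_qp

rng = np.random.default_rng(6)
m, n = 6, 7
alpha = 1e-2
J = rng.standard_normal((m, n))
xi = J @ (0.1 * rng.standard_normal(n))
v0 = 0.1 * rng.standard_normal(n)
v_max = np.ones(n)

# Decision variable z = [v; s].  The objective (alpha/2)||v - v0||^2 + s becomes
# (1/2) z^T P z + q^T z up to a constant; a tiny epsilon on the s entry keeps P
# positive definite for solvers that require it.
P = np.diag(np.concatenate([alpha * np.ones(n), [1e-8]]))
q = np.concatenate([-alpha * v0, [1.0]])

# The equality J v = (1 - s) xi is rewritten as [J  xi] z = xi.
A_eq = np.hstack([J, xi.reshape(-1, 1)])
lb = np.concatenate([-v_max, [0.0]])
ub = np.concatenate([v_max, [1.0]])

z = solve_qp(P, q, A=A_eq, b=xi, lb=lb, ub=ub, solver="osqp")
v, s = z[:n], z[n]
print(s, np.linalg.norm(J @ v - (1 - s) * xi))  # scaling and constraint residual
```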
Thanks to Abhishek Goudar for reading an early draft of this.