Multiple View Geometry in Computer Vision Chapter 7 Solutions -- Computation of the Camera Matrix P

I. Given 5 world-to-image point correspondences, $\textbf{X}_i \leftrightarrow \textbf{x}_i$, show that there are in general four solutions for a camera matrix $P$ with zero skew that exactly maps the world to image points.

A general camera matrix has 11 degrees of freedom; imposing zero skew leaves 10. Each point correspondence gives 2 independent equations, so we need a minimum of 5 point correspondences (10 equations) to compute $P$. The linear system of equations that we need to solve to obtain $P$ has the form

$$A\textbf{p} = 0$$

where $\textbf{p}$ is the 12-vector containing the entries of the matrix $P$ and $A$ is a $10 \times 12$ matrix that in general has a two-dimensional nullspace ($12 - 10 = 2$).

Let the basis of the nullspace be the two 12-vectors $\textbf{p}_1$ and $\textbf{p}_2$. Any vector representing a camera matrix that satisfies the given mapping can be expressed as a linear combination of these basis vectors. Since $\textbf{p}$ is homogeneous (its overall scale does not matter), one parameter suffices and we can write the general vector as

$$\textbf{p}(\lambda) = \lambda \textbf{p}_1 + \textbf{p}_2$$

where $\lambda$ is a scalar. We further constrain the 12-vector $\textbf{p}$ by requiring that the skew be zero.

Note that if the skew is zero, the $x$ and $y$ axes of the image plane are perpendicular to each other. The directions of the $x$ and $y$ axes of the image plane in 3-space can be represented by the cross products $\hat{p}^2 \times \hat{p}^3$ and $\hat{p}^1 \times \hat{p}^3$ respectively, where $\hat{p}^i$ is the vector composed of the first three entries in the $i$th row of $P$, i.e. the normal vector to the plane $P^i$. Hence the constraint for the skew to be zero is¹

$$(\hat{p}^1 \times \hat{p}^3) \cdot (\hat{p}^2 \times \hat{p}^3) = 0$$

Each $\hat{p}^i$ is linear in $\lambda$, so each cross product is quadratic in $\lambda$ and their dot product is a quartic. Combining the two constraints therefore gives a quartic equation in $\lambda$, which in general has 4 solutions (real or complex). Hence, in general, there are 4 camera matrices with zero skew that satisfy a given mapping of 5 point correspondences exactly.
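The procedure above can be sketched numerically. This is a minimal illustration assuming NumPy, not a reference implementation: the function names (`dlt_rows`, `skew_constraint`, `zero_skew_cameras`) and the synthetic camera are made up for the example, and the quartic coefficients are obtained by interpolating the constraint at five values of $\lambda$ rather than by expanding it symbolically. Only real roots are returned.

```python
import numpy as np

def dlt_rows(X, x):
    """Two independent rows of x × (P X) = 0 for homogeneous X (4,) and x (3,)."""
    u, v, w = x
    zero = np.zeros(4)
    return np.array([
        np.concatenate([zero, -w * X, v * X]),
        np.concatenate([w * X, zero, -u * X]),
    ])

def skew_constraint(p):
    """(p1 × p3) · (p2 × p3) for the row-major 12-vector p of P."""
    m1, m2, m3 = p.reshape(3, 4)[:, :3]
    return np.dot(np.cross(m1, m3), np.cross(m2, m3))

def zero_skew_cameras(Xs, xs):
    """Real zero-skew cameras that map five world points exactly."""
    A = np.vstack([dlt_rows(X, x) for X, x in zip(Xs, xs)])  # 10 x 12
    _, _, Vt = np.linalg.svd(A)
    p1, p2 = Vt[-2], Vt[-1]  # basis of the two-dimensional nullspace
    # The constraint is a quartic in lambda, so five samples determine it.
    lam = np.linspace(-2.0, 2.0, 5)
    coeffs = np.polyfit(lam, [skew_constraint(l * p1 + p2) for l in lam], 4)
    roots = np.roots(coeffs)
    real_roots = roots[np.abs(roots.imag) < 1e-9].real
    return [(l * p1 + p2).reshape(3, 4) for l in real_roots]

# Synthetic zero-skew camera and five random world points (assumed values).
rng = np.random.default_rng(0)
K = np.array([[1.2, 0.0, 0.1], [0.0, 0.9, -0.05], [0.0, 0.0, 1.0]])
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = Q * np.sign(np.linalg.det(Q))  # make it a proper rotation
t = np.array([0.2, -0.1, 4.0])
P_true = K @ np.column_stack([R, t])
Xs = np.column_stack([rng.uniform(-1, 1, size=(5, 3)), np.ones(5)])
xs = Xs @ P_true.T

cameras = zero_skew_cameras(Xs, xs)
```

The camera used to generate the data should appear among `cameras` up to scale, alongside any other real solutions of the quartic.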

II. Given 3 world-to-image point correspondences, $\textbf{X}_i \leftrightarrow \textbf{x}_i$, show that there are in general four solutions for a camera matrix $P$ with known calibration $K$ that exactly maps the world to image points.

This is the famous three point perspective pose estimation (P3P) problem in computer vision. There are a variety of ways to arrive at the solution². They all boil down to reducing the equations to a quartic in one variable. Since a quartic has in general 4 solutions, we can conclude that there are in general four camera configurations (poses) with known calibration $K$ that perform the given mapping.
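As a sketch of where the quartic comes from, here is one classical starting point (Grunert's formulation, one of those covered in the review cited above). The symbols $s_i$, $\theta_{ij}$ and $d_{ij}$ are introduced here for illustration and are not part of the exercise. Let $s_i$ be the unknown distance from the camera center to $\textbf{X}_i$, let $\theta_{ij}$ be the angle between the rays back-projected from $\textbf{x}_i$ and $\textbf{x}_j$ (computable because $K$ is known), and let $d_{ij} = \|\textbf{X}_i - \textbf{X}_j\|$. The law of cosines applied to each pair of points gives

$$s_i^2 + s_j^2 - 2 s_i s_j \cos\theta_{ij} = d_{ij}^2, \qquad (i, j) \in \{(1, 2), (1, 3), (2, 3)\}$$

Substituting $s_2 = u s_1$ and $s_3 = v s_1$, then eliminating $s_1$ and one of $u, v$, leaves a quartic in the remaining variable.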

III. Find a linear algorithm for computing the camera matrix $P$ under each of the following conditions:
(a) The camera location (but not orientation) is known.
(b) The direction of the principal ray of the camera is known.
(c) The camera location and the principal ray of the camera are known.
(d) The camera location and complete orientation of the camera are known.
(e) The camera location and orientation are known, as well as some subset of the internal camera parameters ($\alpha_x, \alpha_y, s, x_0, y_0$).

The goal of this question is to come up with linear constraints on $\textbf{p}$ that encode the given information.

(a) If the center of the camera, $C = (C_1, C_2, C_3, C_4)^T$, is known then $PC = 0$, i.e. the following must hold $$P^{1T}C = 0$$ $$P^{2T}C = 0$$ $$P^{3T}C = 0$$

In other words, we can add these three constraints as extra rows to the system $A\textbf{p} = 0$

$$\begin{pmatrix} C_1 & C_2 & C_3 & C_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & C_1 & C_2 & C_3 & C_4 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & C_1 & C_2 & C_3 & C_4 \end{pmatrix}\textbf{p} = 0$$
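A minimal sketch of these three rows, assuming NumPy and a row-major ordering of $\textbf{p}$; the function name `center_constraints` and the sample camera are illustrative only.

```python
import numpy as np

def center_constraints(C):
    """Three rows enforcing P C = 0 on the row-major 12-vector p."""
    C = np.asarray(C, dtype=float).reshape(1, 4)
    return np.kron(np.eye(3), C)  # block-diagonal copies of C^T

# Example camera K [I | -C~] whose center is C (assumed values).
C = np.array([1.0, -2.0, 0.5, 1.0])
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P = K @ np.column_stack([np.eye(3), -C[:3]])

rows = center_constraints(C)
residual = rows @ P.reshape(-1)  # should vanish for a camera centered at C
```

These rows would be stacked under the DLT rows of $A$ before solving for $\textbf{p}$.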

(b) If the direction of the principal ray, $\textbf{d}$, is known then $\hat{p}^3$ must be parallel to it, so we can say

$$\textbf{d} \times \hat{p}^3 = 0$$

This gives us 3 linear constraints, only 2 of which are independent.

In other words, we can add two of these constraints (here the first two components of the cross product, which are independent when $d_3 \neq 0$) to the system $A\textbf{p} = 0$

$$\begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -d_3 & d_2 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & d_3 & 0 & -d_1 & 0 \end{pmatrix}\textbf{p} = 0$$
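A minimal sketch of the two rows, assuming NumPy and a row-major ordering of $\textbf{p}$. The rows are the first two components of $\textbf{d} \times \hat{p}^3$ written out explicitly, so they assume $d_3 \neq 0$. The function name and the sample camera are illustrative only.

```python
import numpy as np

def principal_ray_constraints(d):
    """Two rows of d × p3_hat = 0 on the row-major 12-vector p (needs d3 != 0)."""
    d1, d2, d3 = d
    rows = np.zeros((2, 12))
    rows[0, 9], rows[0, 10] = -d3, d2  # d2 * p33 - d3 * p32 = 0
    rows[1, 8], rows[1, 10] = d3, -d1  # d3 * p31 - d1 * p33 = 0
    return rows

# Example camera K [R | t]; its principal ray direction is the third row of R.
a, b = 0.3, -0.2
Rx = np.array([[1, 0, 0], [0, np.cos(a), -np.sin(a)], [0, np.sin(a), np.cos(a)]])
Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
R = Rx @ Ry
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P = K @ np.column_stack([R, [0.1, 0.2, 3.0]])

d = 2.5 * R[2]  # any non-zero multiple of the direction works
residual = principal_ray_constraints(d) @ P.reshape(-1)
```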

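(c) Both the camera location and the principal ray are known, so we add the three constraints from (a) and the two constraints from (b) to the system $A\textbf{p} = 0$, giving five additional linear constraints in total.

(d) If we know the pose of the camera, we just need to estimate the affine transformation represented by the calibration matrix. First we project the world points using the known pose values

$$\textbf{x}' = [R | t]\textbf{X}$$

Then we estimate the calibration matrix $K$ using a linear DLT algorithm that minimizes the residual $\textbf{x} \times K\textbf{x}'$, which is linear in the entries of $K$.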
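A minimal sketch of this step, assuming NumPy, noise-free data and an upper-triangular $K$ with six unknown entries defined up to scale. The function name `estimate_K` and the synthetic values are illustrative only.

```python
import numpy as np

def estimate_K(R, t, Xs, xs):
    """Linear estimate of upper-triangular K from x × K [R | t] X = 0.

    Xs: (n, 3) inhomogeneous world points, xs: (n, 3) homogeneous image points.
    """
    Xc = Xs @ R.T + t  # x' = [R | t] X
    rows = []
    for (a, b, c), (u, v, w) in zip(Xc, xs):
        # Unknowns ordered as (k11, k12, k13, k22, k23, k33).
        rows.append([0, 0, 0, -w * b, -w * c, v * c])
        rows.append([w * a, w * b, w * c, 0, 0, -u * c])
    _, _, Vt = np.linalg.svd(np.array(rows))
    k11, k12, k13, k22, k23, k33 = Vt[-1]  # nullspace direction
    K = np.array([[k11, k12, k13], [0, k22, k23], [0, 0, k33]])
    return K / K[2, 2]

# Synthetic check with a known pose (assumed values).
rng = np.random.default_rng(1)
K_true = np.array([[1.2, 0.05, 0.1], [0.0, 0.9, -0.05], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, -0.2, 4.0])
Xs = rng.uniform(-1, 1, size=(6, 3))
xs = (Xs @ R.T + t) @ K_true.T

K_est = estimate_K(R, t, Xs, xs)
```

With noisy measurements the same system would be solved in the least-squares sense, and the points would normally be normalized first.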

(e) As in (d), the problem reduces to the estimation of an affinity, which gives a linear system of equations in the intrinsic parameters. If some subset of the parameters is known, those terms become constants and the problem reduces from a total least squares problem to an ordinary least squares (or regression) problem, which is still linear.

IV. Conflation of focal length and position on principal axis. Compare the imaged position of a point of depth $d$ before and after an increase in camera focal length $\Delta f$, or a displacement $\Delta t_3$ of the camera backwards along the principal axis. Let $(x, y)^T$ and $(x', y')^T$ be the image coordinates of the point before and after the change. Following a similar derivation to that of (6.19-p169), show that $$\begin{pmatrix}x' \\ y' \end{pmatrix} = \begin{pmatrix}x \\ y \end{pmatrix} + k\begin{pmatrix}x - x_0 \\ y - y_0 \end{pmatrix}$$ where $k^f = \Delta f/f$ for a focal length change, or $k^{t_3} = -\Delta t_3/d$ for a displacement (here skew $s = 0$ and $\alpha_x = \alpha_y = f$).

Without loss of generality, let us assume that the camera coordinate system is aligned with the world coordinate system. Then a point at depth $d$ can be represented as $(x_1, y_1, d, 1)^T$. The camera matrix will be $K[I | 0]$ and the imaged point will be $\textbf{x} = K (x_1, y_1, d)^T$.

If the focal length is changed from $f$ to $f + \Delta f$, this is equivalent to scaling the image about the principal point by $(f + \Delta f)/f = 1 + \Delta f/f = 1 + k^f$. This means the image of the point after the focal length change will be

$$\textbf{x}' = K \, \operatorname{diag}(1 + k^f, 1 + k^f, 1) \, (x_1, y_1, d)^T$$ $$ = K \, ((1 + k^f)x_1, (1 + k^f)y_1, d)^T$$

Writing the calibration matrix as

$$K = \begin{pmatrix} K_{2 \times 2} & \tilde{\textbf{x}}_0 \\ \textbf{0}^T & 1 \end{pmatrix}$$

and $\tilde{\textbf{x}}_1 = (x_1, y_1)^T$ gives

$$\textbf{x} = \begin{pmatrix} K_{2 \times 2}\tilde{\textbf{x}}_1 + d\tilde{\textbf{x}}_0 \\ d \end{pmatrix}$$

$$\textbf{x}' = \begin{pmatrix} K_{2 \times 2}(1 + k^f)\tilde{\textbf{x}}_1 + d\tilde{\textbf{x}}_0 \\ d \end{pmatrix}$$

Dehomogenizing the two imaged points, we get

$$\tilde{\textbf{x}} = \frac{K_{2 \times 2}\tilde{\textbf{x}}_1 + d\tilde{\textbf{x}}_0}{d}$$

$$\tilde{\textbf{x}}' = \frac{K_{2 \times 2}(1 + k^f)\tilde{\textbf{x}}_1 + d\tilde{\textbf{x}}_0}{d}$$

Multiplying both sides by $d$ and subtracting $d\tilde{\textbf{x}}_0$ from both sides gives

$$d\tilde{\textbf{x}} - d\tilde{\textbf{x}}_0 = K_{2 \times 2}\tilde{\textbf{x}}_1$$

$$d\tilde{\textbf{x}}' - d\tilde{\textbf{x}}_0 = K_{2 \times 2}(1 + k^f)\tilde{\textbf{x}}_1 $$

From these two equations, we can see that $$(1 + k^f)(d\tilde{\textbf{x}} - d\tilde{\textbf{x}}_0) = d\tilde{\textbf{x}}' - d\tilde{\textbf{x}}_0$$ $$ \implies (1 + k^f)\tilde{\textbf{x}} - k^f\tilde{\textbf{x}}_0 = \tilde{\textbf{x}}' $$ $$\implies \tilde{\textbf{x}}' = \tilde{\textbf{x}} + k^f(\tilde{\textbf{x}} - \tilde{\textbf{x}}_0)$$
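The focal length result can be checked numerically. This is a minimal sketch assuming NumPy, zero skew and $\alpha_x = \alpha_y = f$; the specific numbers and the helper `project` are made up for illustration.

```python
import numpy as np

f, x0, y0 = 800.0, 320.0, 240.0   # assumed intrinsics
df = 40.0                         # increase in focal length
X = np.array([0.3, -0.2, 5.0])    # (x1, y1, d) in camera coordinates

def project(focal, X):
    """Image of X under K [I | 0] with zero skew and square pixels."""
    K = np.array([[focal, 0.0, x0], [0.0, focal, y0], [0.0, 0.0, 1.0]])
    x = K @ X
    return x[:2] / x[2]

x_before = project(f, X)
x_after = project(f + df, X)

k_f = df / f
x_predicted = x_before + k_f * (x_before - np.array([x0, y0]))
```

Here `x_predicted` and `x_after` should agree to floating-point precision.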

If there’s a displacement of the camera backwards along the principal axis by $\Delta t_3$ then the camera matrix will have the form $K[I | -\tilde{C}]$, where $\tilde{C} = (0, 0, -\Delta t_3)^T$.

This means the image of the point after the displacement will be $\textbf{x}' = K (x_1, y_1, d + \Delta t_3)^T$.

Writing the calibration matrix as

$$K = \begin{pmatrix} K_{2 \times 2} & \tilde{\textbf{x}}_0 \\ \textbf{0}^T & 1 \end{pmatrix}$$

and $\tilde{\textbf{x}}_1 = (x_1, y_1)^T$ gives

$$\textbf{x}' = \begin{pmatrix} K_{2 \times 2}\tilde{\textbf{x}}_1 + (d + \Delta t_3)\tilde{\textbf{x}}_0 \\ d + \Delta t_3\end{pmatrix}$$

Dehomogenizing the two imaged points gives $$\tilde{\textbf{x}} = \frac{K_{2 \times 2}\tilde{\textbf{x}}_1 + d\tilde{\textbf{x}}_0}{d}$$ $$\implies \tilde{\textbf{x}} - \tilde{\textbf{x}}_0 = \frac{K_{2 \times 2}\tilde{\textbf{x}}_1}{d}$$

$$\tilde{\textbf{x}}' = \frac{K_{2 \times 2}\tilde{\textbf{x}}_1 + (d + \Delta t_3)\tilde{\textbf{x}}_0}{d + \Delta t_3}$$ $$\implies \tilde{\textbf{x}}' - \tilde{\textbf{x}}_0 = \frac{K_{2 \times 2}\tilde{\textbf{x}}_1}{d + \Delta t_3}$$

Combining the two equations gives us $$(d + \Delta t_3)(\tilde{\textbf{x}}' - \tilde{\textbf{x}}_0) = d(\tilde{\textbf{x}} - \tilde{\textbf{x}}_0)$$ $$\implies \tilde{\textbf{x}}' - \tilde{\textbf{x}}_0 = \frac{d}{d + \Delta t_3}(\tilde{\textbf{x}} - \tilde{\textbf{x}}_0)$$ $$\implies \tilde{\textbf{x}}' = \tilde{\textbf{x}}_0 + \frac{d}{d + \Delta t_3}(\tilde{\textbf{x}} - \tilde{\textbf{x}}_0)$$ $$\implies \tilde{\textbf{x}}' = \frac{d}{d + \Delta t_3}\tilde{\textbf{x}} + \frac{\Delta t_3}{d + \Delta t_3}\tilde{\textbf{x}}_0$$ $$\implies \tilde{\textbf{x}}' = \tilde{\textbf{x}} - \frac{\Delta t_3}{d + \Delta t_3}(\tilde{\textbf{x}} - \tilde{\textbf{x}}_0)$$

So $k^{t_3} = -\Delta t_3/(d + \Delta t_3)$. This differs from the book’s $k^{t_3} = -\Delta t_3/d$, which I think is an error unless $d$ is taken to be the depth after the displacement; the two expressions agree to first order when $\Delta t_3 \ll d$.
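A quick numerical comparison of the two candidate values of $k^{t_3}$, as a minimal sketch assuming NumPy, zero skew and $\alpha_x = \alpha_y = f$; the numbers and the helper `project` are made up for illustration, with $d$ taken as the depth before the displacement.

```python
import numpy as np

f, x0, y0 = 800.0, 320.0, 240.0   # assumed intrinsics
x1, y1, d = 0.3, -0.2, 5.0        # point in camera coordinates, depth d
dt3 = 0.5                         # backwards displacement of the camera

K = np.array([[f, 0.0, x0], [0.0, f, y0], [0.0, 0.0, 1.0]])
principal_point = np.array([x0, y0])

def project(X):
    """Dehomogenized image of a camera-frame point X under K."""
    x = K @ np.asarray(X, dtype=float)
    return x[:2] / x[2]

x_before = project([x1, y1, d])
x_after = project([x1, y1, d + dt3])  # depth grows as the camera moves back

k_derived = -dt3 / (d + dt3)          # value derived above
k_book = -dt3 / d                     # value stated in the exercise

err_derived = np.linalg.norm(
    x_before + k_derived * (x_before - principal_point) - x_after)
err_book = np.linalg.norm(
    x_before + k_book * (x_before - principal_point) - x_after)
```

In this example `err_derived` is zero up to rounding, while `err_book` is not. The gap shrinks as `dt3 / d` gets smaller.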

References

1. Larsson, Viktor, Zuzana Kukelova, and Yinqiang Zheng. Camera pose estimation with unknown principal point. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
2. Haralick, Bert M., et al. Review and analysis of solutions of the three point perspective pose estimation problem. International Journal of Computer Vision 13.3 (1994): 331-356.