% DEFINE some information that will be populated throughout the course notes.
\def \coursename {Advanced Linear Algebra}
\def \coursecode {MATH 3221}
\def \courseterm {Fall 2020}
\def \instructorname {Nathan Johnston}
% END DEFINITIONS


% IMPORT the course note formatting and templates
\input{course_notes_template}
% END IMPORT


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\setcounter{chapter}{9} % Set to one less than the week number
\chapter{Applications of the\\ Singular Value Decomposition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

{\large This week we will learn about:

\begin{itemize}
	\item The pseudoinverse of a matrix,
	\item The operator norm of a matrix, and
	\item Low-rank approximation and image compression.
\end{itemize}\bigskip\bigskip

\noindent Extra reading and watching:
\begin{itemize}
	\item Section 2.3.3 and 2.C in the textbook
	\item Lecture videos \href{https://www.youtube.com/watch?v=Jqg7JgCwQrk&list=PLOAf1ViVP13jdhvy-wVS7aR02xnDxueuL&index=39}{38}, \href{https://www.youtube.com/watch?v=FDURdvi6WB4&list=PLOAf1ViVP13jdhvy-wVS7aR02xnDxueuL&index=40}{39}, \href{https://www.youtube.com/watch?v=G2RKg1pHApc&list=PLOAf1ViVP13jdhvy-wVS7aR02xnDxueuL&index=41}{40}, and \href{https://www.youtube.com/watch?v=9QkKcEQQ38g&list=PLOAf1ViVP13jdhvy-wVS7aR02xnDxueuL&index=42}{41} on YouTube
	\item \href{https://en.wikipedia.org/wiki/Moore\%E2\%80\%93Penrose_inverse}{Moore--Penrose inverse} (pseudoinverse) at Wikipedia
	\item \href{https://en.wikipedia.org/wiki/Operator_norm}{Operator norm} at Wikipedia
	\item \href{https://en.wikipedia.org/wiki/Low-rank_approximation}{Low-rank approximation} at Wikipedia
\end{itemize}\bigskip\bigskip

\noindent Extra textbook problems:
\begin{itemize}
	\item[$\star$] 2.3.2, 2.3.4(d,e,h), 2.C.4(a,b,d,e)
	\item[$\phantom{\star}\star\star$] 2.3.8--2.3.12, 2.C.1--2.C.3
	\item[$\star\star\star$] 2.3.15, 2.3.21, 2.C.5, 2.C.6, 2.C.9, 2.C.10
	\item[$\skull$] 2.3.17(a)
\end{itemize}}


\newpage


\section*{The Pseudoinverse}

We have been working with the inverse of a matrix since early-on in introductory linear algebra, and while we can do great things with it, it has some deficiencies as well. For example, we know that if a matrix $A \in \M_n$ is invertible, then the linear system $A\x = \b$...

\horlines{1}
% has unique solution $\x = A^{-1}\b$.

\noindent However, that linear system might have a solution even if $A$ is \emph{not} invertible. For example...

\exx[11]{Show that the linear system \[\begin{bmatrix}1 & 2 & 3 \\
			-1 & 0 & 1 \\
			3 & 2 & 1
		\end{bmatrix}\begin{bmatrix}
			x_1 \\ x_2 \\ x_3
		\end{bmatrix} = \begin{bmatrix}
			6 \\ 0 \\ 6
		\end{bmatrix}\]has a solution, but its coefficient matrix is not invertible.}
% Row reduced or just show via MATLAB? Rank 2 matrix, but (6,0,6) is in its range. Some solutions are $(x_1,x_2,x_3) = (1,1,1)$ and $(x_1,x_2,x_3) = (2,-1,2)$
% Remind students of exactly what the inverse theorem from last semester said: if Ax = 0 has a solution then A invertible. If Ax = b has solution for ALL b, then A invertible. If Ax = b has solution for a PARTICULAR b, can't say anything.
It thus seems natural to ask whether or not there exists a matrix $A^\dagger$ with the property that a solution to the linear system $A\x = \b$ (when it exists) is $\x = A^\dagger\b$. Well...


\newpage


\begin{definition}[Pseudoinverse of a Matrix]\label{defn:pseudoinverse}
	Suppose $\mathbb{F} = \R$ or $\mathbb{F} = \C$, and $A \in \M_{m,n}(\mathbb{F})$ has orthogonal rank-one sum decomposition\\[0.1cm]
	\[
		{}%A = \sum_{i=1}^r \sigma_i \u_i\v_i^*.
	\]
	Then the \textbf{pseudoinverse} of $A$, denoted by $A^\dagger \in \M_{n,m}(\mathbb{F})$, is the matrix\\[0.1cm]
	\[
		{}%A^\dagger \defeq \sum_{i=1}^r \frac{1}{\sigma_i} \v_i\u_i^*.
	\]
\end{definition}

There are several aspects of the pseudoinverse that we should clarify:

\begin{itemize}
	\item If $A$ is invertible, ...
	
	\vspace*{-1.2cm}\horlines{1} % pseudoinverse is inverse.
	
	\item If $A$ has SVD $A = U\Sigma V^*$, then...
	
	\vspace*{-1.2cm}\horlines{3} % A^\dagger = V\Sigma^\dagger U^* is the SVD of A^\dagger, where $\Sigma^\dagger \in \M_{n,m}$ is the diagonal matrix whose non-zero entries are the reciprocals of the non-zero entries of $\Sigma$ (and the zero entries are unchanged).
	
	\item The pseudoinverse is well-defined.
	
	\vspace*{-1.2cm}\horlines{2} % That is, even though the SVD of a matrix is not unique, all SVDs give the same pseudoinverse.
\end{itemize}

\exx[4]{Compute the pseudoinverse of the matrix\[A = \begin{bmatrix}1 & 1 & 1 & -1 \\ 0 & 1 & 1 & 0 \\ -1 & 1 & 1 & 1\end{bmatrix}.\]}
%	This is the same matrix from an earlier example, which has singular value decomposition $A = U\Sigma V^*$, where
%\begin{align*}
%U & = \frac{1}{\sqrt{6}}\begin{bmatrix}
%\sqrt{2} & -\sqrt{3} & -1 \\
%\sqrt{2} & 0 & 2 \\
%\sqrt{2} & \sqrt{3} & -1
%\end{bmatrix}, \quad \Sigma = \begin{bmatrix}
%\sqrt{6} & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0
%\end{bmatrix} \quad \text{and} \\
%V & = \frac{1}{\sqrt{2}}\begin{bmatrix}
%0 & -1 & 0 & 1 \\
%1 & 0 & 1 & 0 \\
%1 & 0 & -1 & 0 \\
%0 & 1 & 0 & 1
%\end{bmatrix}.
%\end{align*}
%Thus\marginnote{Again, try this matrix multiplication on your own. It builds character.}[1.5cm]
%\begin{align*}
%A^\dagger & = V\Sigma^\dagger U^* \\
%& = \frac{1}{\sqrt{12}}\begin{bmatrix}
%0 & -1 & 0 & 1 \\
%1 & 0 & 1 & 0 \\
%1 & 0 & -1 & 0 \\
%0 & 1 & 0 & 1
%\end{bmatrix}\begin{bmatrix}
%1/\sqrt{6} & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0
%\end{bmatrix}\begin{bmatrix}
%\sqrt{2} & \sqrt{2} & \sqrt{2} \\
%-\sqrt{3} & 0 & \sqrt{3} \\
%-1 & 2 & -1
%\end{bmatrix} \\
%& = \frac{1}{12}\begin{bmatrix}
%3 & 0 & -3 \\
%2 & 2 & 2 \\
%2 & 2 & 2 \\
%-3 & 0 & 3
%\end{bmatrix}.
%\end{align*}


\newpage


\horlines{8}

The nice thing about the pseudoinverse is that it always exists (even if $A$ is not invertible, or not even square), and it always finds a solution to the corresponding linear system (if a solution exists). Not only that, but if there are multiple different solutions, it finds the smallest one:

\begin{theorem}[Pseudoinverses Solve Linear Systems]\label{thm:pseudoinverse_finds_sol}
	Suppose $\mathbb{F} = \R$ or $\mathbb{F} = \C$, $A \in \M_{m,n}(\mathbb{F})$, and suppose that the system of linear equations $A\x = \b$ has at least one solution. Then\\[0.1cm]
	\[
		{}% \x = A^\dagger\b
	\]
	is a solution. Furthermore, if $\y$ is any other solution then $\|A^\dagger\b\| < \|\y\|$.
\end{theorem}

\begin{proof}
	We start by writing $A$ in its orthogonal rank-one sum decomposition...
	
	\horlines{5}
	
	\newpage
	
	\horlines{6}\vspace*{-1.3cm}
%	\[
%	A = \sum_{i=1}^r \sigma_i \v_i\w_i^*, so A^\dagger = (formula).
%	\]
%	Since the linear system $A\x = \b$ has a solution, $\b$ must be in the range of $A$. Since $\{\v_i\}_{i=1}^r$ is a basis of $\mathrm{range}(A)$, we can write $\b$ as a linear combination of those basis vectors:
%	\[
%	\b = \sum_{i=1}^r c_i\v_i.
%	\]
%	Then we can directly compute:
%	\begin{align*}
%	A(A^\dagger\b) & = \left(\sum_{i=1}^r \sigma_i \v_i\w_i^*\right)\left(\sum_{i=1}^r \frac{1}{\sigma_i} \w_i\v_i^*\right)\left(\sum_{i=1}^r c_i\v_i\right) \\
%	& = \left(\sum_{i,j=1}^r \frac{\sigma_i}{\sigma_j} \v_i(\w_i^*\w_j)\v_j^*\right)\left(\sum_{i=1}^r c_i\v_i\right) \\
%	& = \left(\sum_{i=1}^r \v_i\v_i^*\right)\left(\sum_{i=1}^r c_i\v_i\right) \\
%	& = \sum_{i,j=1}^r c_j \v_i(\v_i^*\v_j) \\
%	& = \sum_{i=1}^r c_i \v_i = \b,
%	\end{align*}
%	so $\x = A^\dagger\b$ is indeed a solution of the linear system.
%
% To see that $\|A^\dagger\b\| \leq \|\y\|$ for all solutions $\y$...
% Nah, really need orthogonal projection to do this. Didn't cover, so gloss over.
\end{proof}

To get a rough idea for why it's desirable to find the solution with smallest norm, let's return to the linear system
\begin{align*}
	\begin{bmatrix}1 & 2 & 3 \\
		-1 & 0 & 1 \\
		3 & 2 & 1
	\end{bmatrix}\begin{bmatrix}
	x_1 \\ x_2 \\ x_3
\end{bmatrix} = \begin{bmatrix}
6 \\ 0 \\ 6
\end{bmatrix}
\end{align*}
from earlier. The solution set of this linear system consists of the vectors of the form

\horlines{1}
% (0,3,0) + z(1,-2,1), where z is a free variable.

This solution set contains some vectors that are hideous, and some that are not so hideous:

\horlines{1}
% (e.g., choosing $z = 341$ gives the solution $(x,y,z) = (341,-679,341)$)
% (e.g., choosing $z = 1$ gives the solution $(x,y,z) = (1,1,1)$, which is the solution found by the pseudoinverse).

The guarantee that the pseudoinverse finds the smallest-norm solution means that we do not have to worry about it returning ``large and ugly'' solutions like the first one above. \\

Geometrically, it means that the pseudoinverse finds the solution closest to the origin:

\horlines{5}
% Insert week9_svd_smallest.png


\newpage


Not only does the pseudoinverse find the ``best'' solution when a solution exists, it even find the ``best'' non-solution when no solution exists!\\

This is strange to think about, but it makes sense if we again think in terms of norms and distances---if no solution to a linear system $A\x = \b$ exists, then it seems reasonable that the ``next best thing'' to a solution would be the vector that makes $A\x$ as close to $\b$ as possible. In other words, we want to find the vector $\x$ that...

\horlines{1}
% minimizes $\|A\x - \b\|$.

The following theorem shows that choosing $\x = A^\dagger\b$ also solves this problem:

\begin{theorem}[Linear Least Squares]\label{thm:linear_least_squares}
	Suppose $\mathbb{F} = \R$ or $\mathbb{F} = \C$, $A \in \M_{m,n}(\mathbb{F})$, and $\b \in \mathbb{F}^m$. If $\x = A^\dagger\b$ then\\[0.1cm]
	\[
		{}%\|A\x - \b\| \leq \|A\y - \b\| for all \y \in \mathbb{F}^n.
	\]
\end{theorem}

We won't prove this theorem (see the textbook if you're curious), but it comes up a lot in statistics, since it can be used to fit data to a model. For example, suppose we had $4$ data points:

\horlines{4}
% Ask students for 4 data points (e.g., (1,1), (2,3), (3,-1), (4,6))

\noindent and we want to find a line of best fit for those data points (i.e., a line with the property that the sum of squares of vertical distances between the points and the line is minimized). To find this line, we consider the ``ideal'' scenario---we try (and typically fail) to find a line that passes exactly through all $n$ data points by setting up the corresponding linear system:

\horlines{4}
% ON RIGHT: Draw rough picture of what we want to find (highlight vertical distances we are trying to minimize)
% ON LEFT: Linear system:
%\begin{align*}
%	y_1 & = mx_1 + b \\
%	y_2 & = mx_2 + b \\
%	& \ \ \vdots \\
%	y_n & = mx_n + b.
%\end{align*}


\newpage


Since this linear system has $4$ equations, but only $2$ variables ($m$ and $b$), we do not expect to find an exact solution, but we can find the closest thing to a solution by using the pseudoinverse:

\horlines{17}
% Write it in matrix form Ax = b
% Compute pseudoinverse of A via SVD (make sure you use 2x2 matrix A^*A to find pseudoinverse)
% Compute A^\dagger b
% m and b are entries in this vector
% Plot line after it's computed

This exact same method also works for finding the ``plane of best fit'' for data points $(x_1,y_1,z_1),(x_2,y_2,z_2),\ldots,(x_n,y_n,z_n)$, and so on for higher-dimensional data as well. You can even do things like find quadratics of best fit, exponentials of best fit, or other weird functions of best fit.


\newpage


By putting together all of the results of this section, we see that the pseudoinverse gives the ``best solution'' to a system of linear equations $A\x = \b$ in all cases:\medskip

\horlines{5}
% If the system has a unique solution, it is $\x = A^\dagger\b$.\smallskip
%	
% If the system has infinitely many solutions, then $\x = A^\dagger\b$ is the smallest solution---it minimizes $\|\x\|$.\smallskip
%	
% If the system has no solutions, then $\x = A^\dagger\b$ is the closest thing to a solution---it minimizes $\|A\x - \b\|$.\medskip


\section*{The Operator Norm}

We have seen one way of measuring the size of a matrix---the Frobenius norm. In practice, the Frobenius norm is actually not very useful (it's just used because it's easy to calculate), and the following norm is more commonly used instead:

\begin{definition}[Operator Norm]\label{defn:operator_norm}
	Suppose $\mathbb{F} = \R$ or $\mathbb{F} = \C$, and $A \in \M_{m,n}(\mathbb{F})$. Then the \textbf{operator norm} of $A$, denoted by $\|A\|$, is either of the following (equivalent) quantities:\\[1cm]
	\begin{align*}
		{}%\|A\| & \defeq \max_{\v \in \mathbb{F}^n} \left\{ \frac{\|A\v\|}{\|\v\|} : \v \neq \0 \right\} = \max_{\v \in \mathbb{F}^n} \big\{ \|A\v\| : \|\v\| \leq 1 \big\} = \max_{\v \in \mathbb{F}^n} \big\{ \|A\v\| : \|\v\| = 1 \big\}.
	\end{align*}
\end{definition}

The operator norm is the maximum amount by which a matrix can stretch a vector:

\horlines{5}
% Draw SVD ellipse picture
% Note that it looks like it's probably largest singular value (it is)


\newpage


Before showing that $\|A\|$ really is the largest singular value of $A$, let's establish some of its more basic properties.
\begin{theorem}[Submultiplicativity]\label{thm:submultiplicativity}
	Suppose $A \in \M_{m,n}$ and $B \in \M_{n,p}$. Then\\[0.1cm]
	\[
		{}%\|AB\| \leq \|A\|\|B\|.
	\]
\end{theorem}

\begin{proof}
	Notice that a matrix $A \in \M_{m,n}$ cannot stretch any vector by more than a factor of $\|A\|$:
	
	\horlines{5}\vspace*{-1.3cm}
	%That is, $\|A\v\| \leq \|A\|\|\v\|$ for all $\v \in \mathbb{F}^n$. Similarly, if $B \in \M_{n,p}$ then it cannot stretch any vector by more than a factor of $\|B\|$. Thus
	%\begin{align*}
	%	\|(AB)\v\| = \|A(B\v)\| \leq \|A\|\|B\v\| \leq \|A\|\|B\|\|\v\|.
	%\end{align*}
	%Dividing both sides by $\|\v\|$ shows that $\|(AB)\v\| / \|\v\| \leq \|A\|\|B\|$ for all $\v \neq \0$, so $\|AB\| \leq \|A\|\|B\|$.
	% Frobenius norm left as an exercise in the textbook.
\end{proof}

\begin{theorem}[Unitary Invariance]\label{thm:unitary_invariance}
	Let $A \in \M_{m,n}$ and suppose $U \in \M_m$ and $V \in \M_n$ are unitary matrices. Then\\[0.1cm]
	\[
		{}% \|UAV\| = \|A\|.
	\]
\end{theorem}

\begin{proof}
	We start by showing that every unitary matrix $U \in \M_m$ has $\|U\| = 1$:
	
	\horlines{6}\vspace*{-1.3cm}
	% Well, we know from Theorem~\ref{thm:unitary_characterize} that $\|U\v\| = \|\v\|$ for all $\v \in \mathbb{F}^m$. Thus $\|U\v\| / \|\v\| = 1$ for all $\v \neq \0$, so $\|U\| = 1$.
	% We then know from submultiplicativity that \[\|UAV\| \leq \|U\|\|AV\| \leq \|U\|\|A\|\|V\| = \|A\|.\]However, by cleverly using the fact that $U^*U = I$ and $VV^* = I$, we can also deduce the opposite inequality:
	% \[\|A\| = \|(U^*U)A(VV^*)\| = \|U^*(UAV)V^*\| \leq \|U^*\|\|UAV\|\|V^*\| = \|UAV\|.\]Thus $\|A\| = \|UAV\|$, as desired.
	% Frobenius norm: think back to the midterm (use trace definion of Frobenius norm)!
\end{proof}


\newpage


As a side note, the previous two theorems both hold for the Frobenius norm as well (try to prove these facts on your own). That is,

\horlines{2}
% \|AB\|_{\textup{F}} \leq \|A\|_{\textup{F}}\|B\|_{\textup{F}} and
% \|UAV\|_{F} = \|A\|_{F} for all A and all unitary U,V

By combining unitary invariance with the singular value decomposition, we almost immediately confirm our observation that the operator norm should equal the matrix's largest singular value, and we also get a new formula for the Frobenius norm:

\begin{theorem}[Matrix Norms in Terms of Singular Values]\label{thm:op_HS_norms_sing_values}
	Let $A \in \M_{m,n}$ have rank $r$ and singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$. Then\\[0.2cm]
	\[
		{}%\|A\| = \sigma_1 \qquad \text{and} \qquad \|A\|_{F} = \sqrt{\sum_{i=1}^r \sigma_i^2}.
	\]
\end{theorem}

\begin{proof}
	If we write $A$ in its singular value decomposition $A = U\Sigma V^*$, then unitary invariance tells us that $\|A\| = \|\Sigma\|$ and $\|A\|_{\textup{F}} = \|\Sigma\|_{\textup{F}}$. Well,
	
	\horlines{11}\vspace*{-1.3cm}
	
%	the fact that $\|\Sigma\|_{F} = \sqrt{\sum_{i=1}^r \sigma_i^2}$ follows straight from the fact that $\sigma_1,\sigma_2,\ldots,\sigma_r$ are exactly the non-zero entries in $\Sigma$.
%	
%	To see that $\|\Sigma\| = \sigma_1$, first note that direct matrix multiplication shows that \[\|\Sigma\e_1\| = \|(\sigma_1,0,\ldots,0)\| = \sigma_1,\]so $\|\Sigma\| \geq \sigma_1$. For the opposite inequality, suppose $\v = (v_1,v_2,\ldots,v_n) \in \mathbb{F}^n$. Then
%	\begin{center}
%		\begin{tikzpicture}%
%		\draw (0,0) node[anchor=south]{$\displaystyle \|\Sigma\v\| = 
%			\|(\sigma_1 v_1, \ldots, \sigma_r v_r, 0, \ldots, 0)\| = \sqrt{\sum_{i=1}^r \sigma_i^2 |v_i|^2} \leq \sqrt{\sigma_1^2 \sum_{i=1}^r|v_i|^2} \leq \sigma_1\|\v\|.$};
%		\draw (4.06,0.05) node[anchor=north]{\color{gray}$\uparrow$ since $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r$};
%		\end{tikzpicture}
%	\end{center}
%	By dividing both sides of this inequality by $\|\v\|$, we see that $\|\Sigma\v\| / \|\v\| \leq \sigma_1$ for all $\v \neq \0$, so $\|\Sigma\| \leq \sigma_1$. Since we already proved the opposite inequality, we conclude that $\|\Sigma\| = \sigma_1$, so we are done.
\end{proof}


\newpage


\exx[6]{Compute the operator and Frobenius norms of $\displaystyle A = \begin{bmatrix}1 & 2 & 3 \\ -1 & 0 & 1 \\ 3 & 2 & 1 \end{bmatrix}$.}
% Computed singular values earlier: 2sqrt(6), sqrt(6), 0, so operator norm is 2sqrt(6).
% Frobenius norm is Sqrt(30). Compute directly by adding up squares of entries of matrices, then again via singular values.


\section*{Low-Rank Approximation}

As one final application of the singular value decomposition, we consider the problem of approximating a matrix by another matrix with small rank. One of the primary reasons why we would do this is that it allows us to compress data that is represented by a matrix, since a full $n \times n$ matrix requires us to store...

\horlines{1}
% $n^2$ numbers,

However, a rank-$k$ matrix only requires us to store

\horlines{4}
% $2kn$ numbers. To see this, note that we can store a low-rank matrix via its orthogonal rank-one sum decomposition
%\[
%A = \sum_{i=1}^k \sigma_i \u_i\v_i^*,
%\]
%which consists of $2k$ vectors with $n$ entries each, as well as $k$ singular values (but those singular values can be absorbed into the u or v vectors).

Since $2kn$ is much smaller than $n^2$ when $k$ is small, it is much less resource-intensive to store low-rank matrices than general matrices. Thus to compress data, instead of storing the exact matrix $A$ that contains our data of interest, we can sometimes find a nearby matrix with small rank and store that instead.


\newpage


To actually find a nearby low-rank matrix, we use the following theorem:

\begin{theorem}[Eckart--Young--Mirsky]\label{thm:eckart_young_mirsky}
	Suppose $\mathbb{F} = \R$ or $\mathbb{F} = \C$, and $A \in \M_{m,n}(\mathbb{F})$ has orthogonal rank-one sum decomposition\\[0.1cm]
	\begin{align*}
		{}%A = \sum_{i=1}^{r}\sigma_i\u_i\v_i^*,
	\end{align*}
	with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$. Then the closest rank-$k$ matrix to $A$ (i.e., the rank-$k$ matrix $B$ that minimizes $\|A - B\|$) is\\[0.1cm]
	\begin{align*}
		{}%B = \sum_{i=1}^{k}\sigma_i\u_i\v_i^*.
	\end{align*}
\end{theorem}

In other words, the Eckart--Young--Mirsky theorem says that...

\horlines{6}
% the singular value decomposition organizes a matrix into its ``most important'' and ``least important'' pieces---the largest singular values (and their corresponding singular vectors) describe the broad strokes of the matrix, while the smallest singular values (and their corresponding singular vectors) just fill in the fine details of the matrix. Thus a reasonably good approximation of a matrix can be obtained by discarding the portion of it corresponding to the small singular values.

We skip the proof of the Eckart--Young--Mirsky theorem (see the textbook if you're curious), and instead jump right into a numerical example to illustrate its usage.

\exx[4]{Find the closest rank-1 matrix to $\displaystyle A = \begin{bmatrix}1 & 2 & 3 \\ -1 & 0 & 1 \\ 3 & 2 & 1 \end{bmatrix}$.}
% Already computed SVD. Just plug in. Easy.


\newpage


\horlines{5}

It is also worth noting that the Eckart--Young--Mirsky theorem works for many other matrix norms as well (like the Frobenius norm)---not just the operator norm.\\

One of the most interesting applications of this theorem is that it lets us do (lossy) image compression. We can represent an image by...

\horlines{5}
% using 3 matrices (red, green, blue), entries in the matrix say how much of the color in that pixel

Applying the Eckart--Mirsky--Young theorem to those matrices then lets us compress the image. For example, let's use the following image:

\horlines{6}
% Google image search. Resize it first to be roughly 500-by-700. Make a joke about making sure that safe search is on.


\newpage


Let's use MATLAB to compress the image by truncating its matrices' singular value decompositions:

% A = imread('file.jpg'); gives A(:,:,1) as red matrix, A(:,:,2) as blue matrix, etc. All entries from 0 to 255 (0 is none, 255 is full amount of that color)
% I wrote a script svd_compress.m that does this. Walk students through the basics of what the script does. Do svd_compress(A,1), svd_compress(A,10), etc. and copy results into the course notes.
% Do rank-1 approximation, rank-5, and a bunch of others. Note the banding effect in rank-1.

% Mention somewhere the idea that this works because images have correlations between rows and columns -- a white pixel is more likely to be next to other white (or light) pixels. Random junk/noise images wont' compress well.
\end{document}