No one man should have all that power

I apologize. The title is pure clickbait. This post is not about Kanye West (or Malcolm X). But I promise, it will be interesting! (And it will have something to do with the word 'power')

One reason I started this website for is to explain things to myself in a somewhat coherent form. My reasoning is that if I can take a concept and write it up in publishable form, I will have truly understood and digested it. So today's topic is the power method for approximating the dominant eigenvector.

"But why?" is the question that everyone should be asking by now and I wouldn't blame you. One of the answers is that it's cool, it's interesting, I want to understand it, and Google and Twitter use it every day for multiple things. It's always cool to learn things about the internals of Google and Twitter, right?

The power method is an iterative algorithm to determine the dominant eigenvector of a matrix \(A\). The way this is/was used by Google was to calculate the quality of a website as part of their PageRank algorithm. I will not go into too much detail on this, but on a high level, Google assembles a network of websites that link to each other in a graph, which can be represented with a adjacency matrix. The PageRank values are then contained in the dominant eigenvector of that matrix.

How do you get the dominant eigenvector? One way to iteratively calculate it is using the power method, which can be described with a difference equation like this:

\begin{equation} \textbf v_k = \textbf A \textbf v_{k-1} \end{equation}

Using words, this means that we start off with some random initial estimate for our dominant eigenvector \(v_0\). During each iteration of the algorithm we then multiply our previous estimate by our matrix \(A\) until we end up (after doing this \(k\) times) with a multiple of the real eigenvector \(s_1\).

Now let's try to prove this. But let me give a little disclaimer beforehand: I'm not necessarily the best mathematician out there, so I do not think that this post can be considered very rigorous. It's intended to make sense to me (and potentially to others) and accurately describe what's going on. In researching this, I mostly used this great book and freely googled information like this and this.

So let's dive in. We begin with a square matrix \(A\) and its eigenvalue decomposition:

\begin{equation} \textbf A = \textbf S \boldsymbol \Lambda \textbf S^{-1} \end{equation}

Here, each column of \(\textbf S\) is an eigenvector of \(\textbf A\) and \(\boldsymbol \Lambda\) is the diagonal matrix containing the eigenvalues \(\lambda_i\) of \(\textbf A\). Hence, each column of \(\textbf S\) (denoted as \(\textbf s_i\)) and each eigenvalue \(\lambda_i\) satisfy the following equation:

\begin{equation} \textbf A \textbf s_i = \lambda_i \textbf s_i \end{equation}

The algorithm for the power method states that if we keep multiplying a randomly chosen vector $ \textbf v\_o$ with our matrix \(\textbf A\), the result will converge to a scalar multiple of dominant eigenvector, so we do the following \(k\) times:

\begin{equation} \textbf v_1 = \textbf A \textbf v_o, \; \textbf v_2 = \textbf A \textbf v_1, \; \textbf v_3 = \textbf A \textbf v_2, ... \end{equation}

Then, supposedly, as \(k\) gets large,

\begin{equation} \textbf v_k \to c \textbf s_1 \end{equation}

So let's just do what we're told. The important part is that we can write the vector $ \textbf v\_0$ as a linear combination of $ \textbf A$'s eigenvectors $ \textbf s\_i$:

\begin{equation} \textbf v_0 =c_1 \textbf s_1 + ... + c_n \textbf s_n \end{equation}

We will find out later that for the power method to work, we need to assume \(c_1 \ne 0\), which means that $ \textbf v\_0$ needs to have some component pointing in the direction of $ \textbf s\_1$. But we'll get to that later.

So, if we write $ \textbf v\_0$ according to 1 and perform the first multiplication of the power method, we end up with something that looks like this:

\begin{equation} \textbf v_1 = \textbf A(c_1 \textbf s_1 + ... + c_n \textbf s_n) = c_1 \textbf A \textbf s_1 + ... + c_n \textbf A \textbf s_n \end{equation}

If we now look back at 1, we see that we can substitute the $ \textbf A \textbf s\_i$ terms with $ λ\_i \textbf s\_i$, which is exactly what we do to arrive here:

\begin{equation} \textbf v_1 = c_1 \lambda_1 \textbf s_1 + ... + c_n \lambda_n \textbf s_n \end{equation}

That was the first iteration of the algorithm. Let's do one more:

\begin{equation} \begin{aligned} \textbf v_2 \; &= \textbf A \textbf v_1 \\ &= \textbf A(c_1 \lambda_1 \textbf s_1 + ... + c_n \lambda_n \textbf s_n) \\ &=c_1 \lambda_1 \textbf A \textbf s_1 + ... + c_n \lambda_n \textbf A \textbf s_n \\ &= c_1 \lambda_1^2 \textbf s_1 + ... + c_n \lambda_n^2 \textbf s_n \end{aligned} \end{equation}

After writing out this step, it becomes clear that the expression for the resulting vector \(\textbf v\_k\) must be the following:

\begin{equation} \textbf v_k = c_1 \lambda_1^k \textbf s_1 + ... + c_n \lambda_n^k \textbf s_n \end{equation}

Which is good for us if we make an important assumption about the eigenvalue decomposition from 1. First, recall the form of \(\boldsymbol \Lambda\):

\begin{equation} \boldsymbol \Lambda = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \\ \end{bmatrix} \end{equation}

The $λ\_i$s on the diagonal are the eigenvalues of our matrix $ \textbf A$. Now let's suppose that $ λ\_1$ is the largest eigenvalue (it could be any \(\lambda_i\), but for the sake of simplicity, let's choose \(\lambda_1\)). By 'largest eigenvalue', I mean the following:

\begin{equation} |\lambda_1| > |\lambda_2| \geq |\lambda_3| \geq ... \geq |\lambda_n| \end{equation}

Now let's observe that if 1 holds and \(\lambda_1\) is in fact the largest eigenvalue, the first term of 1 will dominate the expression as \(k\) gets large. To document this even further, let's divide 1 by \(\lambda_1^k\):

\begin{equation} \frac{\textbf v_k}{\lambda_1^k} = c_1 \textbf s_1 + c_2 \left(\frac{\lambda_2}{\lambda_1}\right)^k \textbf s_2 + c_3 \left(\frac{\lambda_3}{\lambda_1}\right)^k \textbf s_3 + ... + c_n \left(\frac{\lambda_n}{\lambda_1}\right)^k \textbf s_n \end{equation}

If we now let \(k\) get large (meaning we multiply by $ \textbf A$ a lot of times), all of the \(\left(\frac{\lambda_i}{\lambda_1}\right)^k\) terms will go to zero because according to 1, \(\lambda_1\) is larger an any one of the other $λ_i$s:

\begin{equation} \lim_{k \to \infty} \;\; \frac{\textbf v_k}{\lambda_1^k} \; = \; c_1 \textbf s_1 \end{equation}

Remember that \(\lambda_1^k\) is just a scalar! We can let \(c_1\) swallow \(\lambda_1^k\) to arrive at the following:

\begin{equation} \lim_{k \to \infty} \;\; \textbf v_k \; = \; c \textbf s_1 \end{equation}

If we look at 1, that's what we wanted to prove, right? Remember our condition from earlier, that \(c_1 \ne 0\)? It becomes clear now. If \(c_1\) were zero, then this expression will converge to zero, and not the eigenvector corresponding to the top eigenvalue.

Now let's look at this problem a little closer and simulate it. This little piece of Matlab will do the trick:

A = randi(9,3);
v0 = randi(9,3,1);
[S, LAMBDA] = eig(A);
s1 = S(:,1)
vk = v0;
for k = 1:20
  vk = A*vk;
end
vk
printf('is vk a scalar multiple of s1?\n')
if (rank([vk s1]) == 1) printf('yes\n') else printf('no\n') end;

Let's examine the output of this:

s1 =

  -0.46568
  -0.68421
  -0.56125

vk =

   1.7449e+22
   2.5638e+22
   2.1030e+22

is vk a scalar multiple of s1?
yes

So, it looks like we did indeed get a scalar multiple of \(s_1\). But that scalar is huge! The elements of \(\textbf v_k\) are on the order of \(10^{22}\), while the elements of \(s_1\) are on the order of 1! It makes sense: We're doing 20 matrix-vector multiplications with a matrix and a vector filled with positive integers.

So what can we do about that? We can prohibit \(v_k\) from getting larger and larger (and from possibly causing a buffer overflow) by normalizing it after each iteration of the algorithm (Remember that normalizing a vector means making it a unit vector, but preserving the direction it is pointing at). This would change our expression for the power method from \(v_k = \textbf A \textbf v_{k-1}\) to \(v_k = \frac{\textbf A \textbf v_{k-1}}{\left||\textbf A \textbf v_{k-1}\right||}\).

The code will change only slightly: Here, I've added the normalization step, and I let the simulation run 10 steps longer for a little better precision:

A = randi(9,3);
v0 = randi(9,3,1);
[S, LAMBDA] = eig(A);
s1 = S(:,1)
vk = v0;
for k = 1:30
  vk = A*vk;
  vk = vk/norm(vk, 2);
end
vk
printf('is vk a scalar multiple of s1?\n')
if (rank([vk s1]) == 1) printf('yes\n') else printf('no\n') end;

Let's look at the output! I think this looks pretty good:

s1 =

   0.56542
   0.65589
   0.50011

vk =

   0.56542
   0.65589
   0.50011

is vk a scalar multiple of s1?
yes

We have successfully calculated \(s_1\), the eigenvector corresponding to the dominant eigenvalue \(\lambda_1\)! (If the simulation is run multiple times, it can happen that \(v_k\) converges to \((-1) \textbf s_1\), a proof that I will leave as an exercise to the student)

This concludes a very long post, filled with many equations. But you should feel great, you now know a lot more about the internals of Google and Twitter! Like I said earlier, I do not claim much mathematical rigor here, I'm just trying to explain an interesting concept to myself (and possibly anyone that reads this).

I'd appreciate any feedback and comments (send me email), especially if you spot a mistake!

Dennis

"
\n

RSS License: CC BY-SA 4.0

"