Application of Eigenvalues and Eigenvectors: Google Rank
Eigenvalues and Eigenvectors: intuitive idea
Eigenvectors are the vectors that keep their direction after being multiplied by a matrix. Although the direction is preserved, their length and orientation may change; the factor by which they are scaled is the associated eigenvalue.
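As a minimal numerical illustration (using a hypothetical 2x2 matrix B, not the matrix A studied below), we can check in Matlab that multiplying an eigenvector by the matrix only rescales it:
B = [2 1; 1 2];                 % hypothetical example matrix
[V, D] = eig(B);                % columns of V: eigenvectors; diagonal of D: eigenvalues
v = V(:,1); lambda = D(1,1);    % first eigenvector and its eigenvalue
B*v - lambda*v                  % practically the zero vector: B*v keeps the direction of v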
Example in 2D
Consider the matrix
To visualize the eigenvectors and eigenvalues, we plot in blue the original vector and in green its image under A. To see this, we can take unit vectors v (in blue) with the arrowhead on the circle of radius 1 and paint their transforms by the matrix (in green). As we will see, the original vectors move on a circle and the transformed ones move on an ellipse. Note that when the two directions are aligned we have found an eigendirection. What we can observe is that:
- If v and Av have the same orientation, the associated eigenvalue λ is positive.
- If v and Av have opposite orientations, the associated eigenvalue λ is negative.
- If ‖Av‖ > ‖v‖, the associated eigenvalue satisfies |λ| > 1.
- If ‖Av‖ < ‖v‖, the associated eigenvalue satisfies |λ| < 1.
h = 2*pi/500;
ang = 0:h:2*pi-h;              % we take 500 angles
v = [cos(ang); sin(ang)];      % all initial vectors of norm 1
vt = A*v;                      % all vectors transformed by A
% we paint each vector and its transform
for k = 1:2                    % we do two laps
  for i = 1:length(ang)
    clf; hold on; axis equal;
    plot(v(1,:), v(2,:), 'b');    % the circle
    plot(vt(1,:), vt(2,:), 'g');  % the ellipse
    plot([0, v(1,i)], [0, v(2,i)], 'b');
    plot([0, vt(1,i)], [0, vt(2,i)], 'g');
    plot(v(1,i), v(2,i), 'marker','o','markersize',10,'markerfacecolor','blue');
    plot(vt(1,i), vt(2,i), 'marker','o','markersize',10,'markerfacecolor','green');
    drawnow;
  end
end
The power method: the eigenvalue of maximum modulus
The simplest way to calculate an eigenvector is the power method, which consists in applying the iteration
x_{k+1} = A·x_k,
where A is a square matrix and x_k is normalized at every step. This method provides the eigenvector corresponding to the eigenvalue of maximum (dominant) modulus. Convergence is guaranteed if there exists an eigenvalue strictly larger in modulus than the rest, |λ1| > |λ2| ≥ … ≥ |λn|. The eigenvalue is then computed from the ratio between the coordinates of a vector and those of its transform by A, λ ≈ (A·x)_i / x_i.
To see how it works, we compare it on a simple example with what Matlab's eig gives us.
[veps, vaps]=eig(A)
veps =
   -0.7071   -0.6000
    0.7071   -0.8000
x = [1;1];              % initial vector
xant = x/norm(x);       % normalization
tol = 1.e-6;            % tolerance value
variation = 1000;       % big enough value
nIter = 0;              % iteration counter
while (variation > tol && nIter < 100)   % maximum 100 iterations
    xnou = A*xant;
    xnou = xnou/norm(xnou);               % normalize the new vector
    variation = max(abs(xnou-xant));      % sup norm (equivalent: norm(xnou-xant,inf))
    % we update to move on to the next iteration
    xant = xnou;
    nIter = nIter + 1;
end
eigMax = max((A*xnou)./xnou); % it should give practically the same value for all the components
fprintf('nIter = %d , vap max = %f, vep=[%.4e, %.4e] \n',nIter,eigMax,xnou); % note the change of sign of the vector with respect to eig
nIter = 15 , vap max = 1.250000, vep=[6.0000e-01, 8.0000e-01]
Application of the calculation of Eigenvalues and Eigenvectors: Google Rank
The PageRank algorithm was invented by Page and Brin (1998) and was used in the prototype of the Google search engine.
The objective is to estimate the popularity, or importance, of a page based on the interconnection of the web. The ideas behind it are:
- A page with more incoming links is more important than a page with fewer.
- A page that receives a link from a page known to be very important is itself more important.
The ranking that Google produces, showing the most relevant pages first when we make a search, is based on the study of the connectivity matrix between the different pages. This connectivity is quantified from the links between the pages. Thus, for example, we can have a set of pages related to each other by the graph of the figure (where each arrow means a link from one page to another).
We number the nodes of the graph as A = 1, B = 2, C = 3, D = 4. Each connection is represented as a coefficient of the connectivity matrix, in which each row records which nodes link to the node of that row: rows correspond to arrows that enter a node and columns to arrows that leave it.
C1 = [0,0,1,1;
      1,0,0,0;
      1,0,0,1;
      1,1,1,0]
C1 =
     0     0     1     1
     1     0     0     0
     1     0     0     1
     1     1     1     0
Standardization:
In this matrix each column indicates the arrows that 'leave' each node, so we normalize by columns to represent, in a normalized way, the weight of the connectivity of each node. Technically the resulting matrix is said to be column-stochastic: the sum of each of its columns is 1.
for i = 1:size(C1,2), A1(:,i) = C1(:,i)/sum(C1(:,i)); end   % normalize each column
A1
0 0 0.5000 0.5000
0.3333 0 0 0
0.3333 0 0 0.5000
0.3333 1.0000 0.5000 0
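A quick check of this property (the columns of a column-stochastic matrix must sum to 1):
sum(A1)    % each entry of the answer must be 1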
Iterative algorithm:
Essentially it consists of applying the matrix A successively to an initial vector, which is normally the vector of ones. In fact, this is the power method we have seen before. Here the eigenvector associated with the dominant eigenvalue has a fundamental interpretation at the component level, since the largest component corresponds to the most important node of the graph.
x = ones(4,1); x = x/norm(x);          % initial vector, normalized
xnou = A1*x; nIter = 0; tol = 1.e-6;
variation = max(abs(xnou-x));
X = [nIter, x'];                       % we store the iterations in the rows of X
while (nIter < 100 && variation > tol)
    x = xnou; xnou = A1*x; nIter = nIter + 1;
    variation = max(abs(xnou-x));
    X = [X; nIter, x'];
end
X
0 0.5000 0.5000 0.5000 0.5000
1.0000 0.5000 0.1667 0.4167 0.9167
2.0000 0.6667 0.1667 0.6250 0.5417
3.0000 0.5833 0.2222 0.4931 0.7014
4.0000 0.5972 0.1944 0.5451 0.6632
5.0000 0.6042 0.1991 0.5307 0.6661
6.0000 0.5984 0.2014 0.5344 0.6658
7.0000 0.6001 0.1995 0.5324 0.6681
8.0000 0.6002 0.2000 0.5341 0.6657
9.0000 0.5999 0.2001 0.5329 0.6671
[val,highestRank]=max(X(end,2:end)); %maximum value of the last iteration
fprintf('value =%f, number of page = %d \n',val, highestRank);
value =0.666484, number of page = 4
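As a cross-check, the same ranking can be obtained directly from the eigenvector of A1 associated with the eigenvalue 1; a minimal sketch (the variable names V, D, v are our own):
[V, D] = eig(A1);
[~, k] = max(abs(diag(D)));        % position of the eigenvalue of modulus 1
v = abs(V(:,k)); v = v/norm(v);    % associated eigenvector with non-negative entries
[val, highestRank] = max(v)        % node 4 comes out again as the most important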
A special case: one page does not cite any other (dangling node)
We now consider the case of the figure, in which node D has no outgoing arrow (it does not cite any page). This case does not allow the direct application of the algorithm, and we will have to modify the connectivity matrix.
For this case the connectivity matrix is:
C2 = [0,0,1,0;
      1,0,0,0;
      1,0,0,0;
      1,1,1,0]
% standardized by columns it would be:
for i = 1:size(C2,2), A2(:,i) = C2(:,i)/sum(C2(:,i)); end  % column 4 sums to 0, giving NaN
A2
0 0 0.5000 NaN
0.3333 0 0 NaN
0.3333 0 0 NaN
0.3333 1.0000 0.5000 NaN
This matrix cannot be used to iterate. Even if we replaced the NaN column by a column of zeros, with which the iterations could be carried out, the convergence would then be towards zero (check the calculation).
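To check the claim in parentheses, a minimal sketch (we replace the NaN column by zeros and iterate a fixed number of times):
A2z = A2; A2z(:,4) = 0;        % dangling column kept as zeros instead of NaN
x = ones(4,1)/2;               % same normalized initial vector as before
for k = 1:50, x = A2z*x; end
x                              % practically the zero vector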
To fix the problem, we substitute the column of NaN by a uniformly distributed value 1/N, where N is the number of nodes.
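A minimal sketch of this substitution (the expression isnan(sum(A2)) simply locates the columns of NaN):
N = size(A2,2);                      % number of nodes
A2(:, isnan(sum(A2))) = 1/N;         % replace every NaN column by the uniform value 1/N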
A2
0 0 0.5000 0.2500
0.3333 0 0 0.2500
0.3333 0 0 0.2500
0.3333 1.0000 0.5000 0.2500
Now we can do the iterations:
x = ones(4,1); x = x/norm(x);          % initial vector, normalized
xnou = A2*x; nIter = 0; tol = 1.e-6;
variation = max(abs(xnou-x));
X = [nIter, x'];                       % we store the iterations in the rows of X
while (nIter < 100 && variation > tol)
    x = xnou; xnou = A2*x; nIter = nIter + 1;
    variation = max(abs(xnou-x));
    X = [X; nIter, x'];
end
X
0 0.5000 0.5000 0.5000 0.5000
1.0000 0.3750 0.2917 0.2917 1.0417
2.0000 0.4062 0.3854 0.3854 0.8229
3.0000 0.3984 0.3411 0.3411 0.9193
4.0000 0.4004 0.3626 0.3626 0.8743
5.0000 0.3999 0.3521 0.3521 0.8960
6.0000 0.4000 0.3573 0.3573 0.8854
7.0000 0.4000 0.3547 0.3547 0.8906
8.0000 0.4000 0.3560 0.3560 0.8880
9.0000 0.4000 0.3553 0.3553 0.8893
[val,highestRank]=max(X(end,2:end)); %maximum value of the last iteration
fprintf('value =%f, number of page = %d \n',val, highestRank);
value =0.888672, number of page = 4
Separable networks
A different problem arises when the connections are defined so that there are blocks in the network from which it is not possible to leave (following only the links).
That is what happens with the block formed by E, F, G, H in the network of the figure.
% C3 is the connectivity matrix of the 8-node graph in the figure; we normalize it by columns
for i = 1:size(C3,2), A3(:,i) = C3(:,i)/sum(C3(:,i)); end
Irreducible graph: the magic number of Google
The previous case is that of a reducible graph. In fact, a graph is said to be irreducible if, starting from any node and following the arrows, we can reach any other node, and vice versa.
If we iterate with a reducible graph, the iterates concentrate on the absorbing block and the other components tend to zero. To fix the problem we modify the normalized matrix with the formula
M = d·A + ((1-d)/N)·ones(N,N),
where d is a constant (the damping factor), ones(N,N) is the N×N matrix of all ones, and N is the number of columns of A. Google used the damping value d = 0.85.
d = 0.85; N = size(A3,2);              % damping factor and number of nodes
M = d*A3 + ((1-d)/N)*ones(size(A3));
% Now we can do the iterations as in the previous cases:
x = ones(N,1); x = x/norm(x);          % initial vector, normalized
xnou = M*x; nIter = 0; tol = 1.e-6; X = [nIter, x'];
variation = max(abs(xnou-x));
while (nIter < 100 && variation > tol)
    x = xnou; xnou = M*x; nIter = nIter + 1;
    variation = max(abs(xnou-x)); X = [X; nIter, x'];
end
[val,highestRank]=max(X(end,2:end)); %maximum component of the last iteration
fprintf('value =%f, number of page = %d \n',val, highestRank);
value =0.802451, number of page = 6
Associated eigenvalues
The convergence of the iterations is linked to the eigenvalues of the iteration matrix. By construction, the matrices we have used are column-stochastic (the sum of each column is 1) and all their elements are non-negative. The Perron-Frobenius theorem guarantees that these matrices have an eigenvalue of modulus 1 and that, when the matrix is in addition positive (as M is), all the other eigenvalues have modulus strictly lower than 1.
We see this in our matrices:
% Modulus of eigenvalues for A1
abs(eig(A1)) % abs gives the modulus if the eigenvalue is complex
1.0000
0.4082
0.4082
0.5000
% Modulus of eigenvalues for A2
abs(eig(A2))
1.0000
0.5000
0.2500
0.0000
% Modulus of eigenvalues for M
abs(eig(M))
1.0000
0.7361
0.7361
0.4250
0.3470
0.4250
0.3470
0.0000
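As an illustration of how these moduli relate to the iterations (a standard power-method estimate, not stated explicitly above): the error is reduced at each step roughly by a factor equal to the second largest modulus.
vapsA1 = sort(abs(eig(A1)), 'descend');   % eigenvalue moduli of A1, in decreasing order
vapsM  = sort(abs(eig(M)),  'descend');   % idem for the damped matrix M
fprintf('approximate convergence factors: A1 -> %.4f, M -> %.4f \n', vapsA1(2), vapsM(2));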
N large: Generation of a random graph
We can generate graphs of larger size at random (if we do not have real data) from a matrix C filled with 0's and 1's.
N = 100;                % size of the random graph (an example value)
C4 = round(rand(N));    % we fill the matrix with 0's and 1's
for i = 1:size(C4,2), A4(:,i) = C4(:,i)/sum(C4(:,i)); end  % normalize each column
% we check whether there is any column of NaN (dangling node) and, if so, fix it as before
A4(:, isnan(sum(A4))) = 1/N;
% we add the damping term to avoid the problem of reducible graphs
M4=d*A4+((1-d)/N)*ones(size(A4));
% Now we can do the iterations as in the previous cases:
x = ones(N,1)/sqrt(N); xnou = M4*x;    % normalized initial vector and first iterate
nIter = 0; tol = 1.e-6; X = [nIter, x']; variation = max(abs(xnou-x));
while (nIter < 100 && variation > tol)
    x = xnou; xnou = M4*x; nIter = nIter + 1;
    variation = max(abs(xnou-x)); X = [X; nIter, x'];
end
[val,highestRank]=max(X(end,2:end)); %maximum value of the last iteration
fprintf('value =%f, number of page = %d \n',val, highestRank);
value =0.194661, number of page = 2
% Modulus of the eigenvalues of M4 (first 8):
modVaps = abs(eig(M4));   % modulus of all the eigenvalues
modVaps(1:8)              % we show only the first 8
1.0000
0.1450
0.1450
0.1214
0.1214
0.1165
0.1165
0.1193
Exercise 1:
The figure below shows the representation of a sphere (a soccer ball) or, equivalently, of a bidirectional graph with 60 nodes (carbon atoms) and 90 connections, known in chemistry as the fullerene C-60 (Bucky ball), together with its numbering within Matlab.
Its connectivity (adjacency) matrix can be loaded in Matlab by doing, for example, C = bucky;
(a) Apply the GoogleRank method (with a damping factor d and a tolerance tol of your choice) to classify the most important nodes.
(b) Enter a new connection from node 10 to node 50 and apply the same algorithm.
(c) Do the same, but starting from the original model and eliminating the two connections between nodes 7 and 30.
(A starter sketch for loading and modifying the matrix follows below.)
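A starter sketch for loading the matrix and preparing parts (b) and (c); the variable names C, Cb and Cc are our own choice (rows correspond to incoming links, as above):
C = bucky;                               % 60x60 adjacency matrix of the C-60 graph
Cb = C; Cb(50,10) = 1;                   % (b) new link from node 10 to node 50
Cc = C; Cc(30,7) = 0; Cc(7,30) = 0;      % (c) remove the two connections between nodes 7 and 30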