1. Playing With PageRank

You’ll verify for your self that PageRank works by performing calculations on a small universe of web pages. Let’s use the 6 page universe that we had in the course notes.

For this directed graph, perform the following calculations in R:

Form the A matrix.

The directed graph above can be represented in matrix form as follows:

\[\textbf{A}=\begin{bmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\ 0 & 0 & 0 & 0 & 1/2 & 1/2 \\ 0 & 0 & 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}\]

Each row i of the matrix \(\textbf{A}\) corresponds to associated outlinks (where nonzero), and each column j corresponds to inlinks.

Here is the matrix initialized in R:

# rows of matrix
at1 <- c(0, 1/2, 1/2, 0, 0, 0)
at2 <- rep(0,6)
at3 <- c(1/3, 1/3, 0, 0, 1/3, 0)
at4 <- c(0, 0, 0, 0, 1/2, 1/2)
at5 <- c(0, 0, 0, 1/2, 0, 1/2)
at6 <- c(0, 0, 0, 1, 0, 0)

# matrix A formed from individual rows
A <- matrix(rbind(at1,at2,at3,at4,at5,at6),6,6)

# print A
formatC(A, format="f", digits=3)

     [,1]    [,2]    [,3]    [,4]    [,5]    [,6]   
[1,] "0.000" "0.500" "0.500" "0.000" "0.000" "0.000"
[2,] "0.000" "0.000" "0.000" "0.000" "0.000" "0.000"
[3,] "0.333" "0.333" "0.000" "0.000" "0.333" "0.000"
[4,] "0.000" "0.000" "0.000" "0.000" "0.500" "0.500"
[5,] "0.000" "0.000" "0.000" "0.500" "0.000" "0.500"
[6,] "0.000" "0.000" "0.000" "1.000" "0.000" "0.000"

Then, introduce decay and form the B matrix as we did in the course notes.

Matrix \(\textbf{B}\) is formulated as follows:
\[\textbf{B}=d\textbf{A} + \frac{(1-d)}{n}\textbf{ee}^T \]
The scalar \(n\) represents the number of webpages in our universe, scalar \(d\) is our selected decay factor, and \(\textbf{ee}^T\) is a square matrix with the same dimensions as matrix \(\textbf{A}\) and with all elements set to 1.

Before we implement the formula above, we must modify the original matrix A:

Node 2 in our six webpage universe is referred to as a “dangling node.” If a web surfer reaches new webpages by clicking on hyperlinks from other webpages, then he inevitably get stuck after entering Node 2. That is, Node 2 has no outbound links.
The dangling node phenomenon is problematic, as our ultimate goal is to construct a transition matrix \(\textbf{B}\) so that repeated left multiplications ,N, of the transpose matrix, \(\textbf{B}^T\) to an arbitrary column vector, \(\textbf{r}_0\), converges to a unique positive vector, \(\textbf{r}_f\). That is, \({ { (\textbf{B} }^{ T }) }^{ N }{ \textbf{r} }_{ 0 }={ \textbf{r} }_{ f }\).
One key requirement for convergence is that our transition matrix is stochastic, which requires all non-negative entries and each row of the matrix to sum to one. While our original matrix \(\textbf{A}\) has all non-negative entries, row 2 sums to zero.

We apply the following adjustments to the original matrix \(\textbf{A}\) to ensure our matrix is stochastic:
\[\textbf{A}^*=\textbf{A} + \textbf{a}(\frac{1}{n}\textbf{e}^T)\]
Here, \(\textbf{a}\) is a column vector with element i set to 1 if row i of matrix \(\textbf{A}\) is a dangling node, and 0 otherwise. The scalar n refers to the total number of webpages in our universe, and \(\textbf{e}^T\) is a row vector of 1s.

Let’s implement the modified matrix, \(\textbf{A}\) in R:

# number of webpages in universe 
n <- nrow(A)

# A modified, a stochastic matrix with each row summing to 1
A_mod <- A + (apply(A,1,sum)!=1) * 1/n

# print A modified
formatC(A_mod, format="f", digits=3)

     [,1]    [,2]    [,3]    [,4]    [,5]    [,6]   
[1,] "0.000" "0.500" "0.500" "0.000" "0.000" "0.000"
[2,] "0.167" "0.167" "0.167" "0.167" "0.167" "0.167"
[3,] "0.333" "0.333" "0.000" "0.000" "0.333" "0.000"
[4,] "0.000" "0.000" "0.000" "0.000" "0.500" "0.500"
[5,] "0.000" "0.000" "0.000" "0.500" "0.000" "0.500"
[6,] "0.000" "0.000" "0.000" "1.000" "0.000" "0.000"

We are now ready to form matrix B, using the formula described earlier, but now substituting \(\textbf{A}^*\) for \(\textbf{A}\) in the earlier equation.

# set decay factor
d <- 0.85

# initialize matrix B  
B <- d*A_mod + (1-d)/n 

# print B
formatC(B, format="f", digits=3)

     [,1]    [,2]    [,3]    [,4]    [,5]    [,6]   
[1,] "0.025" "0.450" "0.450" "0.025" "0.025" "0.025"
[2,] "0.167" "0.167" "0.167" "0.167" "0.167" "0.167"
[3,] "0.308" "0.308" "0.025" "0.025" "0.308" "0.025"
[4,] "0.025" "0.025" "0.025" "0.025" "0.450" "0.450"
[5,] "0.025" "0.025" "0.025" "0.450" "0.025" "0.450"
[6,] "0.025" "0.025" "0.025" "0.875" "0.025" "0.025"

The additional adjustments applied to \(\textbf{A}^*\) in our equation for \(\textbf{B}\) ensure convergence of this Markov process:

The matrix is now irreducible, which means that each node is connected to every other node–refer to the positive probabilities for each element ij in matrix \(\textbf{B}\).
The matrix is aperiodic, as indicated by the positive probability of a web surfer staying within the same webpage with each successive iteration

Start with a uniform rank vector \(\textbf{r}_0\) and perform power iterations on \(B\) till convergence. That is, compute the solution \(\textbf{r}_f\). That is, \({ { (\textbf{B} }^{ T }) }^{ N }{ \textbf{r} }_{ 0 }={ \textbf{r} }_{ f }\). Attempt this for a sufficiently large \(n\) so that \(\textbf{r}_f\) actually converges.

Below, we calculate \(\textbf{r}_f\) for various values of \(n\). We see almost no change in the resulting vector between \(N=20\) and \(N=30\).

# install expm package, if not already installed.
# package allows us to take exponents of matrices
if (!require('expm')) {install.packages('expm');library(expm)}

# initial, uniform rank vector
r0 <- rep(1/6,6)


# show rank vector after 0, 10, 20, and 30 iterations
# rf = B^T * r0
for (i in seq(0,30, by=10)){
  print(paste0("rank vector at iteration ",i,":"))
  print((t(B) %^% i) %*% r0)
  print("-----------------------------")

}

[1] "rank vector at iteration 0:"
          [,1]
[1,] 0.1666667
[2,] 0.1666667
[3,] 0.1666667
[4,] 0.1666667
[5,] 0.1666667
[6,] 0.1666667
[1] "-----------------------------"
[1] "rank vector at iteration 10:"
           [,1]
[1,] 0.05205661
[2,] 0.07428990
[3,] 0.05782138
[4,] 0.34797267
[5,] 0.19975859
[6,] 0.26810085
[1] "-----------------------------"
[1] "rank vector at iteration 20:"
           [,1]
[1,] 0.05170616
[2,] 0.07368173
[3,] 0.05741406
[4,] 0.34870083
[5,] 0.19990313
[6,] 0.26859408
[1] "-----------------------------"
[1] "rank vector at iteration 30:"
           [,1]
[1,] 0.05170475
[2,] 0.07367927
[3,] 0.05741242
[4,] 0.34870367
[5,] 0.19990381
[6,] 0.26859607
[1] "-----------------------------"

Compute the eigen-decomposition of \(B\) and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue.

Using R’s ’eigen()` function, we see that the largest eigenvalue is 1:

# eigen-decomposition of matrix B^T
eigen(t(B))

$values
[1]  1.00000000+0i  0.57619235+0i -0.42500000+0i -0.42500000-0i
[5] -0.34991524+0i -0.08461044+0i

$vectors
             [,1]          [,2]                      [,3]
[1,] 0.1044385+0i  0.2931457+0i  2.486934e-15+0.0000e+00i
[2,] 0.1488249+0i  0.5093703+0i -8.528385e-16-6.9832e-23i
[3,] 0.1159674+0i  0.3414619+0i -1.930646e-15-0.0000e+00i
[4,] 0.7043472+0i -0.5890805+0i -7.071068e-01+0.0000e+00i
[5,] 0.4037861+0i -0.1413606+0i  7.071068e-01+0.0000e+00i
[6,] 0.5425377+0i -0.4135367+0i  0.000000e+00-1.7058e-08i
                          [,4]           [,5]            [,6]
[1,]  2.486934e-15-0.0000e+00i -0.06471710+0i -0.212296003+0i
[2,] -8.528385e-16+6.9832e-23i  0.01388698+0i  0.854071294+0i
[3,] -1.930646e-15+0.0000e+00i  0.07298180+0i -0.363638739+0i
[4,] -7.071068e-01+0.0000e+00i -0.66058664+0i  0.018399984+0i
[5,]  7.071068e-01-0.0000e+00i  0.73761812+0i -0.304719509+0i
[6,]  0.000000e+00+1.7058e-08i -0.09918316+0i  0.008182973+0i

Verify that the corresponding eigenvector is the same vector that you obtained in the previous power iteration method.

Let’s calculate \(r_f\) using \(N=30\) iterations and normalize to a unit vector. Then, compare to the eigenvector using R’s built-in eigen() function:

# rf after 30 iterations
rf_pwrmthd <- t(B) %^% 30 %*% r0

# normalize rf_pwrmthd to unit vector 
rf_pwrmthd_unit <- rf_pwrmthd / sqrt(sum(rf_pwrmthd^2))

# eigenvector associated with lambda = 1, from eigen() function
rf_eigen1 <-  eigen(t(B))$vectors[,1]

# verify that this is equal to eigenvector calculation using eigen(), given small error tolerance
abs(rf_pwrmthd_unit - rf_eigen1) < 0.0001

     [,1]
[1,] TRUE
[2,] TRUE
[3,] TRUE
[4,] TRUE
[5,] TRUE
[6,] TRUE

Further, verify this eigenvector has all positive entries and it sums to 1.

# verify all positive entries in eigenvector
Re(rf_pwrmthd) > 0

     [,1]
[1,] TRUE
[2,] TRUE
[3,] TRUE
[4,] TRUE
[5,] TRUE
[6,] TRUE

# verify entries sum to 1
sum(rf_pwrmthd)

[1] 1

Use the graph package in R and its page.rank method to compute the Page Rank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally.

# install and/or load igraph package 
if (!require('igraph')) {install.packages('igraph');library(igraph)}

# plot original matrix A in igraph
A_graph <- graph_from_adjacency_matrix(A, weighted=TRUE)
plot(A_graph)

# calculate PageRank given original matrix A 
rf_igraph <- page.rank(graph_from_adjacency_matrix(A, weighted=TRUE))
rf_igraph$vector

[1] 0.05170475 0.07367926 0.05741241 0.34870369 0.19990381 0.26859608

Verify that you do get the same PageRank vector as the two approaches above.

For consistency we’ll normalize the PageRank to be of length 1, and then compare to the other two approaches.

# normalize pagerank vector
rf_igraph_unit <- rf_igraph$vector / sqrt(sum(rf_igraph$vector^2))  
rf_igraph_unit

[1] 0.1044385 0.1488249 0.1159674 0.7043472 0.4037861 0.5425377

# verify that rf_igraph_unit is equal to eigenvector calculation using eigen(), given small error tolerance
abs(rf_igraph_unit - rf_eigen1) < 0.0001

[1] TRUE TRUE TRUE TRUE TRUE TRUE

# verify that rf_igraph_unit is equal to eigenvector from power method with N=30, given small error tolerance
abs(rf_igraph_unit - rf_eigen1) < 0.0001

[1] TRUE TRUE TRUE TRUE TRUE TRUE

1. Playing With PageRank

References