1. Background Information.

      Question a. Did you find Bryan and Leise's paper insightful and comprehensive
      regarding the PageRank algorithm? What additional information did you need to
      complete this laboratory?

Answer:

2. Begin your analysis by generating the system of equations that represents the "web" in Figure 2.1 of Bryan and Leise's paper.

      Question a. What is the system of equations that represents the web in Figure 2.1?

Answer:
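
One way to write the system down (a sketch read directly off the link matrix A constructed in Part 3 below, so it assumes the same link structure as Figure 2.1; x_k denotes the importance score of page k):

    x1 = x3 + (1/2)x4
    x2 = (1/3)x1
    x3 = (1/3)x1 + (1/2)x2 + (1/2)x4
    x4 = (1/3)x1 + (1/2)x2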

3. Next, create a link (or connectivity) matrix A in R/RStudio that represents this system of equations. Matrix A should be a column-stochastic matrix. (Note: a "column-stochastic" matrix is defined as having all non-negative entries, with each column summing to 1 (one).)

      Question a. What is the link (or connectivity) matrix A that represents this
      system of equations?

      Question b. Is the link matrix "column-stochastic"? Why?
## 3. Set up the link (connectivity) matrix A (columns must sum to 1)
A <- matrix(c(0,   0,   1,   1/2,
              1/3, 0,   0,   0,
              1/3, 1/2, 0,   1/2,
              1/3, 1/2, 0,   0), 4, 4, byrow = TRUE)
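
A quick check of the column-stochastic property described in Part 3 (this check is an addition, not part of the provided script):

# every column of A should sum to 1 (column-stochastic check)
colSums(A)
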
4. Recall from class that an eigenvector is a vector whose direction is unchanged by the matrix transformation; it can only be scaled. Bryan and Leise's paper specifies the eigenvector [12 4 9 6]^T as the "dominant" eigenvector associated with the eigenvalue 1, as is any other non-zero multiple of this eigenvector. Let's find this eigenvector.

Find the eigenvalues and eigenvectors for the matrix A using the eigen() function we have discussed in R/RStudio. (Note: in general the eigenvectors produced will have both real and imaginary parts. We are only interested in the eigenvector associated with the eigenvalue 1; it has only real entries.) To obtain the eigenvector comprised of whole numbers as shown in Bryan and Leise's paper, multiply the normalized eigenvector returned by eigen() by 31 (remember that eigenvectors are not unique). This returns the desired eigenvector [12 4 9 6]^T.

Last, scale this eigenvector to obtain the individual "importance score" values. You can use the sum of the whole-number eigenvector components (12 + 4 + 9 + 6 = 31) to do this. Answer:

## Find and normalize the dominant eigenvector to compute the pageRanks
ea <- eigen(A)
ea
## eigen() decomposition
## $values
## [1]  1.0000000+0.0000000i -0.3606233+0.4109755i -0.3606233-0.4109755i
## [4] -0.2787533+0.0000000i
## 
## $vectors
##              [,1]                  [,2]                  [,3]
## [1,] 0.7210101+0i -0.7552157+0.0000000i -0.7552157+0.0000000i
## [2,] 0.2403367+0i  0.3036721+0.3460725i  0.3036721-0.3460725i
## [3,] 0.5407576+0i  0.0931532-0.2746779i  0.0931532+0.2746779i
## [4,] 0.3605051+0i  0.3583904-0.0713946i  0.3583904+0.0713946i
##               [,4]
## [1,]  0.5064856+0i
## [2,] -0.6056557+0i
## [3,] -0.3815392+0i
## [4,]  0.4807092+0i
# normalize the dominant eigenvector (column 1) so its components sum to 1
ea_vectors <- ea$vectors[,1]
ea_vectors <- ea_vectors / sum(ea_vectors)
ea_vectors
## [1] 0.3870968+0i 0.1290323+0i 0.2903226+0i 0.1935484+0i
# drop the zero imaginary parts, then scale by 31 to recover the paper's whole-number vector
ea_vectors <- Re(ea_vectors)

ea_vectors * 31
## [1] 12  4  9  6
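
As a sanity check on the eigenvector definition recalled in Part 4, multiplying by A should reproduce the same vector, since the associated eigenvalue is 1 (this check is an addition, not part of the provided script):

# for eigenvalue 1, A %*% v returns v (up to floating-point error)
A %*% ea_vectors
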
## automate the process to determine pageRank
##
## first set-up a representative webgraph
##
library(igraph)
## Warning: package 'igraph' was built under R version 3.4.4
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
# directed edges as from,to pairs: 1->2, 1->3, 1->4, 2->3, 2->4, 3->1, 4->1, 4->3
g <- graph(c(1, 2, 1, 3, 1, 4, 2, 3, 2, 4, 3, 1, 4, 1, 4, 3), directed = TRUE)

# plot the webgraph
tkplot(g)
## [1] 1
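
Note that tkplot() opens an interactive Tk window (its return value, the plot id, is printed above). If a static figure is preferred, base igraph plotting works as well; this alternative is an addition, not part of the provided script:

# static alternative to the interactive Tk plot
plot(g)
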
## (additional solution to Part 4) Use the first method, solving (A - I)x = 0,
## to compute the pageRanks
library(matlib)
## Warning: package 'matlib' was built under R version 3.4.4
# 4x4 identity matrix
Imatrix4 <- diag(4)
# row-reduce (A - I) to solve the homogeneous system (A - I)x = 0
ea2 <- gaussianElimination(A - Imatrix4)
# fix the free variable x4, then scale to obtain whole numbers
ea2[4,4] = -1
ea2_wholeNumbers <- ea2*-6
ea2_wholeNumbers
##      [,1] [,2] [,3] [,4]
## [1,]   -6    0    0   12
## [2,]    0   -6    0    4
## [3,]    0    0   -6    9
## [4,]    0    0    0    6
ea2_scaled <- ea2_wholeNumbers/31
ea2_scaled[,4]
## [1] 0.3870968 0.1290323 0.2903226 0.1935484
####################################
5.    Question a. Briefly discuss what the "importance scores" or page ranks of the nodes
      in the web shown in Figure 2.1 are. Include in your discussion how the nodes are
      ordered by page rank and why they are ordered in this way.
6. As discussed in Bryan and Leise's paper, we can improve on our method for computing importance scores, e.g. by creating a matrix U where all entries are 1/n and applying a "weighting factor," m. For example, the web in Figure 2.1 has n = 4, so we can create the matrix

       U = [ 1/4 1/4 1/4 1/4
             1/4 1/4 1/4 1/4
             1/4 1/4 1/4 1/4
             1/4 1/4 1/4 1/4 ]

   and apply it in Equation 3.1, M = (1 - m)A + mU, which includes the weighting factor m. Use this matrix U, Equation 3.1, and m = 0.15 to compute the modified matrix M. (Note: the R code below works with beta = 1 - m = 0.85.)
## This finds the modified matrix M
# rebuild the link matrix from the graph: divide each row of the adjacency
# matrix by its out-degree, then transpose to make it column stochastic
B = get.adjacency(g, sparse = FALSE)
B = t(B / rowSums(B))
B
##           [,1] [,2] [,3] [,4]
## [1,] 0.0000000  0.0    1  0.5
## [2,] 0.3333333  0.0    0  0.0
## [3,] 0.3333333  0.5    0  0.5
## [4,] 0.3333333  0.5    0  0.0
n = nrow(B)
U = matrix(data = rep(1/n, n^2), nrow = n, ncol = n)
U
##      [,1] [,2] [,3] [,4]
## [1,] 0.25 0.25 0.25 0.25
## [2,] 0.25 0.25 0.25 0.25
## [3,] 0.25 0.25 0.25 0.25
## [4,] 0.25 0.25 0.25 0.25
beta = 0.85            # beta = 1 - m, with weighting factor m = 0.15
M = beta*B + (1-beta)*U
M
##           [,1]   [,2]   [,3]   [,4]
## [1,] 0.0375000 0.0375 0.8875 0.4625
## [2,] 0.3208333 0.0375 0.0375 0.0375
## [3,] 0.3208333 0.4625 0.0375 0.4625
## [4,] 0.3208333 0.4625 0.0375 0.0375
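
A quick check, analogous to Part 3, that the modified matrix M is still column stochastic; each column should sum to 1, since beta*1 + (1-beta)*1 = 1 (this check is an addition, not part of the provided script):

# every column of M should still sum to 1
colSums(M)
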
      Question a. Equation 3.1 has two terms, (1 - m)A and mU. What do each of these
      terms represent?


      Question b. Consistent with Bryan and Leise's paper, you were instructed to use a
      value of 0.15 for the weighting factor m. Why use this particular value?
7. Find the eigenvalues and eigenvectors for the matrix M using the eigen() function as we have discussed in R/RStudio. The values of the eigenvector associated with the dominant eigenvalue need to be scaled as before.

      Question a. Briefly discuss the page rank values obtained for the modified link
      matrix M as compared to those obtained from the original link matrix A. Include in
      your discussion any changes in the order of the page ranks and why or why not such
      changes occur.
## Calculate the eigenvalues/eigenvectors of M
e = eigen(M)
v <- e$vectors[,1]

# Scale the values of the dominant eigenvector of M to get the pageRanks
# (Re() extracts the real parts; the dominant eigenvector has zero imaginary parts)
v <- Re(v) / sum(Re(v))
v
## [1] 0.3681507 0.1418094 0.2879616 0.2020783
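
To read the ordering of the nodes by page rank directly (an addition, not part of the provided script):

# node indices ordered from highest to lowest page rank
order(v, decreasing = TRUE)
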
8. Now use the page_rank() function to find the page ranks for the web as shown in Figure 2.1.

      Question a. What are the salient points you need to know in order to run the
      page_rank() function, e.g. relevant arguments/parameters including required data?
## Now use the page_rank() function to find the pageRanks
page_rank(g)$vector
## [1] 0.3681507 0.1418094 0.2879616 0.2020783
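
For reference, igraph's page_rank() uses a damping factor of 0.85 by default, which matches the weighting used above (damping corresponds to 1 - m). The same call with the default made explicit:

# the same call with the default damping factor written out
page_rank(g, damping = 0.85)$vector
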
9. The last method discussed in Bryan and Leise's paper is the power method for computing an eigenvector. A reference for more information about the power method is given in their paper. The power method is important because of the enormous number of web pages on the Internet. Bryan and Leise quote a figure of eight billion. The URL http://www.worldwidewebsize.com/ quotes roughly 4.5 to 4.7 billion pages for Google, and other researchers quote much, much greater numbers. Regardless of which number is correct, this is still a much, much bigger matrix than we want to have if we are performing the kinds of computations we have been doing. So, let's look at the power method.

In general, the power method depends on recursion to find a single eigenvalue, e.g. the largest or smallest eigenvalue associated with a system. For more information see Page and Brin's original paper (1999), "The PageRank Citation Ranking: Bringing Order to the Web". For example, civil engineers may want to find the smallest eigenvalue for a bridge or support column: this smallest eigenvalue occurs when the system is under maximum load, and the corresponding eigenvector represents the shape of the system, e.g. the shape of the support column at the instant of failure under this maximum load. The single eigenvalue we are interested in is 1 (one). This is the largest, and dominant, eigenvalue, and it has an associated dominant eigenvector.

There are a couple of things you should understand in order to reliably use an algorithm embodying the power method. Be aware that there is a difference between using a simple loop to repeat an operation iteratively and a recursive procedure that actually calls itself multiple times. When loops are used in algorithms, they typically iterate a specified number of times. In general, a recursive algorithm will continue to run until some convergence criterion is met. So, it is important to carefully specify the convergence criterion, and to supply proper arguments, i.e. input values and matrices, that allow the recursion to reach that criterion.

Computations using numerical methods normally do not produce exact values. Convergence criteria represent the tolerance allowed for this inexact nature of numerical methods. A convergence criterion is chosen so that the difference in a computed value from iteration to iteration approaches zero; as this difference becomes smaller, assuming the algorithm has been coded correctly, the computed value approaches the desired exact value. Typical convergence criteria are small, e.g. the difference in computed values equals or is less than 0.001 or smaller.

However, numerical methods might not produce monotonic results: the differences in computed values can both decrease and increase. Therefore, care must be taken in implementing convergence criteria. Many different problems can occur; for example, an algorithm might never converge to a given criterion. If there are problems with either the convergence criterion or with the data values in the arguments, then an infinite loop can occur, and you might or might not get an error message. Alternatively, you can set an algorithm to run for only a fixed number of recursions, but then you may fail to converge.
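
As a concrete illustration of these ideas, here is a minimal power-method sketch in R. The function name powerMethod and its arguments are ours, not part of the provided script; it stops either when successive estimates agree to within a tolerance or when a maximum iteration count is reached, guarding against the infinite-loop problem described above.

# minimal power-method sketch: stop on tolerance OR maxiter, whichever comes first
powerMethod <- function(M, tol = 1e-5, maxiter = 1000) {
  n <- nrow(M)
  r <- rep(1/n, n)                  # start from the uniform vector
  for (i in 1:maxiter) {
    r_new <- as.vector(M %*% r)     # one power-method step
    if (max(abs(r_new - r)) < tol)  # converged: successive estimates agree
      return(list(rank = r_new, iterations = i))
    r <- r_new
  }
  warning("did not converge within maxiter iterations")
  list(rank = r, iterations = maxiter)
}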

To use the script you will need to install and attach 3 (three) additional packages in R/RStudio: igraph, expm, and arpack. For some reason the dependencies wouldn't install with igraph, so I had to use the following command rather than the pull-down menu:
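
Presumably a call of this form, with the dependencies requested explicitly (an assumption, since the exact command is not reproduced here):

# install igraph together with its package dependencies
install.packages("igraph", dependencies = TRUE)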

Before we implement the power method, the R script uses a slightly different syntax to recompute the modified matrix M (stored as C below).

##  Now use a slightly different method to compute the pageRanks of M
library(expm)
## Warning: package 'expm' was built under R version 3.4.4
## Loading required package: Matrix
## 
## Attaching package: 'expm'
## The following object is masked from 'package:Matrix':
## 
##     expm
n = nrow(M)
U = matrix(data=rep(1/n, n^2), nrow = n, ncol = n)
U
##      [,1] [,2] [,3] [,4]
## [1,] 0.25 0.25 0.25 0.25
## [2,] 0.25 0.25 0.25 0.25
## [3,] 0.25 0.25 0.25 0.25
## [4,] 0.25 0.25 0.25 0.25
beta = 0.85              # beta = 1 - m, with m = 0.15
C = beta*B + (1-beta)*U  # identical to the modified matrix M computed above
C
##           [,1]   [,2]   [,3]   [,4]
## [1,] 0.0375000 0.0375 0.8875 0.4625
## [2,] 0.3208333 0.0375 0.0375 0.0375
## [3,] 0.3208333 0.4625 0.0375 0.4625
## [4,] 0.3208333 0.4625 0.0375 0.0375
r = matrix(data = rep(1/n, n), nrow = n, ncol = 1)
r
##      [,1]
## [1,] 0.25
## [2,] 0.25
## [3,] 0.25
## [4,] 0.25
10. The R script provided then uses a fixed number of iterations rather than a convergence criterion to implement the power method. The number of iterations is set at 100. The script outputs the page ranks to the data object pageRanks.
## Use the power method to compute page ranks by setting the number of
## iterations to 100 (C %^% 100 is the 100th matrix power, from expm)
pageRanks <- t(C %^% 100 %*% r)
pageRanks
##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.3681507 0.1418094 0.2879616 0.2020783
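
Equivalently (a sketch that avoids the expm dependency), the same 100 steps can be applied one at a time in a loop:

# apply C one hundred times to the uniform start vector
pr <- r
for (i in 1:100) pr <- C %*% pr
t(pr)
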
      Question a. Briefly discuss how the power method is implemented in this part of the
      laboratory, e.g. what matrix operations are carried out, whether the original link
      matrix A is used in the computations, etc.
     
     
      Question b. Briefly discuss how the page ranks ("importance scores") computed after
      100 iterations compare to those calculated previously, including any changes in the
      ordering of the nodes. Also include in your discussion how you would determine the
      number of iterations to use if this were a required input to run the algorithm.
     
     
11. Last, the R script implements the power method to compute page ranks using a convergence criterion of 0.00001.
## Use the power method to compute page ranks using a convergence criterion
set.seed(1234)
n <- length(M[,1])
iter <- 1
maxiter <- 1000        # safety cap so the loop cannot run forever
tolerance <- 0.00001   # convergence criterion
ccrit <- 100           # initial difference; forces at least one pass through the loop
pwr <- M %*% M
ckCC <- matrix(nrow = n, ncol = maxiter)  # stores the estimate at every iteration
ckCC[1,1] <- 100
while((abs(ccrit) > tolerance) && (iter < maxiter)) {
  iter <- iter + 1
  pwr <- M %*% pwr                        # raise M to the next power
  PRout <- pwr %*% r                      # current page-rank estimate
  ckCC[,iter] <- PRout[,]
  ccrit <- ckCC[1,iter-1] - ckCC[1,iter]  # change in the first component
}
print(iter)
## [1] 12
print(t(PRout))
##           [,1]     [,2]      [,3]      [,4]
## [1,] 0.3681576 0.141809 0.2879588 0.2020746
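
One quick way to compare this result with the fixed-iteration result from Part 10 (an addition, not part of the provided script):

# element-wise difference between the convergence-based and 100-iteration page ranks
t(PRout) - pageRanks
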
      Question a. How many iterations are required for the algorithm to converge? Is it
      important to know how many iterations are required? Why? How does the number of
      iterations required for the algorithm to converge to a solution compare to the
      100 iterations specified in Part 10?
      
      
      Question b. If the number of iterations is significantly less than 100, is that
      important, and why or why not? For example, mention whether or not the convergence
      criterion used is sufficient. Also, if the number of iterations required is small,
      briefly explain how this can be true in the context of extremely large matrix
      operations.
      
      
      Question c. Briefly discuss how the page ranks obtained using a convergence
      criterion compare with those obtained using a specified number of iterations.
      Include in your discussion whether or not any differences in values are important
      and why or why not.