ANLY 530 - Lab 1

knitr::opts_chunk$set(echo=TRUE)

Detailed Instructions for the Laboratory Analysis

1. Carefully read Bryan and Leise’s paper. You may have to read the paper more than once to fully understand its content. Carefully reading and understanding the paper will save you a lot of time and effort on this laboratory.

Question a. Did you find Bryan and Leise’s paper insightful and comprehensive regarding the page rank algorithm? What additional information did you need to complete this laboratory?

Answer: Bryan and Leise’s paper regarding Google’s page rank algorithm was comprehensive but required a lot of supplementary reading to follow. Additional information used to understand the paper are as follows:

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
ANLY 530 Lecture 2 Slides
https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors
https://www.youtube.com/watch?v=G4N8vJpf7hM
http://math.mit.edu/~gs/linearalgebra/ila0601.pdf

2. Begin your analysis by generating the system of equations that represents the “web” in Figure 2.1 of Bryan and Leise’s paper.

Question a. What is the system of equations that represents the web in Figure 2.1?

Step 1: X1, X2, X3, X4 = 4 webpages Create back links to each; X1=2; X2=1; X3=3; X4=2

X1=2
X2=1
X3=3
X4=2

Step 2: Create scores of each page based on the importance of their corresponding backlinks x1 = X3/1 + X4/2 x2 = X1/3 x3 = X1/3 + X2/2 + X4/2 x4 = X1/3 + X2/2

x1 = X3/1 + X4/2
x2 = X1/3
x3 = X1/3 + X2/2 + X4/2
x4 = X1/3 + X2/2

Step 3: Combine linear equations as AX = X, where X = t(X1,x2,x3,x4) - (Result displayed in next answer)

3. Next, create a link (or connectivity) matrix ???? in R/RStudio that represents this system of equations. Matrix ???? should be a column stochastic matrix. (Note: a “column stochastic” matrix is defined as having all non-negative entries and where the sum of each column equals 1 (one).)

Question a. What is the link (or connectivity) matrix ???? that represents this system of equations?

A=matrix(c(0,1/3,1/3,1/3,0,0,1/2,1/2,1,0,0,0,1/2,0,1/2,0),ncol=4)
A

##           [,1] [,2] [,3] [,4]
## [1,] 0.0000000  0.0    1  0.5
## [2,] 0.3333333  0.0    0  0.0
## [3,] 0.3333333  0.5    0  0.5
## [4,] 0.3333333  0.5    0  0.0

Question b. Is the link matrix “column-stochastic”? Why?

Answer: The link matrix A is column stochastic. The reason is that each column contains n number of entries which are all non-negative and each equal to 1/n value making the sum of each column 1.

4. Recall from class that an eigenvector is a principal vector that does not change direction; however, it can be scaled. Bryan and Leise’s paper specify the eigenvector [12 4 9 6]T as the “dominant” eigenvector associated with the eigenvalue = 1, as well as any other non-zero multiple of the eigenvector for the eigenvalue = 1. Let’s find this eigenvector.

Last, scale this eigenvector to obtain the individual “importance score values”. You can use the sum of the whole number values of the eigenvector components to do this.

Find the eigenvalues and eigenvectors for the matrix ???? using the eigen() function we have discussed in R/RStudio. (Note: in general the eigenvectors produced will have both real and imaginary parts. We are only interested in the eigenvector associated with the eigenvalue = 1. It only has real numbers associated with it.) In order to compute the eigenvector comprised of all whole numbers as shown in Bryan and Leise’s paper multiply the eigenvector by function by 31 (remember that eigenvectors are not unique). This will return the desired eigenvector [12 4 9 6]T.

ea <- eigen(A)
ea

## eigen() decomposition
## $values
## [1]  1.0000000+0.0000000i -0.3606233+0.4109755i -0.3606233-0.4109755i
## [4] -0.2787533+0.0000000i
## 
## $vectors
##              [,1]                  [,2]                  [,3]
## [1,] 0.7210101+0i -0.7552157+0.0000000i -0.7552157+0.0000000i
## [2,] 0.2403367+0i  0.3036721+0.3460725i  0.3036721-0.3460725i
## [3,] 0.5407576+0i  0.0931532-0.2746779i  0.0931532+0.2746779i
## [4,] 0.3605051+0i  0.3583904-0.0713946i  0.3583904+0.0713946i
##               [,4]
## [1,]  0.5064856+0i
## [2,] -0.6056557+0i
## [3,] -0.3815392+0i
## [4,]  0.4807092+0i

# normalize the dominant eigenvector values
nr_scaled_values <- length(ea$vectors[,1])
ea_vectors <- ea$vectors[,1]
sca <- sum(ea$vectors)
for(i in 1:nr_scaled_values){
ea_vectors[i] <- ea$vectors[i]/sca
}
ea_vectors

## [1] 0.3870968+0i 0.1290323+0i 0.2903226+0i 0.1935484+0i

ea_vectors <- Re(ea_vectors)
ea_vectors * 31

## [1] 12  4  9  6

5. As discussed in Bryan and Leise’s paper, we can improve on our method for computing importance scores, e.g. by creating a matrix ???? where all entries are 1/???? and applying a “weighting factor,” ????. For example, the web in Figure 2.1 has ???? = 4. Thus, we can create a matrix ???? as: ????= [1/41/41/41/4] and we can apply it in Equation 3.1: ????=(1???????)????+???????? which includes the weighting factor ????. Use this matrix ????, Equation 3.1 and ???? = 0.15 to compute the modified matrix ?????????* Question a. Equation 3.1 has two terms, (1???m)A and mS. What do these terms represent?

Answer: S is an n*n matrix with all entries 1/n. Matrix S is column stochastic.‘m’ is a value greater than or equal to 0 but less than or equal to 1. (1-m) or ‘d’ represents a damping factor.This is based on a random surfer model which is the probability that a person will stop clicking. The two terms in combination gives us the weighted average unambiguous importance scores. The weighted average is between the probability that a user stops clicking links on the current page and the probability that a user doesn’t follow any links on the page.

Question b. Consistent with Bryan and Leise’s paper you were instructed to use a value of 0.15 for the weighting factor ????. Why use this particular value?

Answer: Let’s assume (1-m)=d; d is the damping factor that a random surfer will get bored and stop clicking and move on to a different URL. After testing different damping factors, it was determined that 0.85 was the perfect value. If it’s too high then it takes ages for the numbers to settle, if it’s too low then you get repeated over-shoot, both above and below the average - the numbers just swing about the average like a pendulum and never settle down.

6. Find the eigenvalues and eigenvectors for the matrix ???? using the eigen() function as we have discussed in R/RStudio. The values of the eigenvector associated with the dominant eigenvalue need to be scaled as before.

library(igraph)

## Warning: package 'igraph' was built under R version 3.4.4

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

g <- graph(c(1, 2, 1, 3, 1, 4, 2, 3, 2, 4, 3, 1, 4, 1, 4, 3), directed = TRUE)
# plot the webgraph
tkplot(g)

## [1] 1

B = get.adjacency(g, sparse = FALSE)
B = t(B / rowSums(B))
B

##           [,1] [,2] [,3] [,4]
## [1,] 0.0000000  0.0    1  0.5
## [2,] 0.3333333  0.0    0  0.0
## [3,] 0.3333333  0.5    0  0.5
## [4,] 0.3333333  0.5    0  0.0

n = nrow(B)
U = matrix(data = rep(1/n, n^2), nrow = n, ncol = n)
U

##      [,1] [,2] [,3] [,4]
## [1,] 0.25 0.25 0.25 0.25
## [2,] 0.25 0.25 0.25 0.25
## [3,] 0.25 0.25 0.25 0.25
## [4,] 0.25 0.25 0.25 0.25

beta = 0.85
M = beta*B + (1-beta)*U
M

##           [,1]   [,2]   [,3]   [,4]
## [1,] 0.0375000 0.0375 0.8875 0.4625
## [2,] 0.3208333 0.0375 0.0375 0.0375
## [3,] 0.3208333 0.4625 0.0375 0.4625
## [4,] 0.3208333 0.4625 0.0375 0.0375

e = eigen(M)
v <- e$vectors[,1]
#Scale the values of the dominant eigenvector of M to get the pageRanks
v <- as.numeric(v) / sum(as.numeric(v))
v

## [1] 0.3681507 0.1418094 0.2879616 0.2020783

Question a. Briefly discuss the page rank values obtained for the modified link matrix M???? as compared to those obtained from the original link matrix A????. Include in your discussion any changes in the order of the page ranks and why or why not such changes occur.

Answer: The modified link matrix M resolves the problem of having an eigenspace for eigenvalue 1 of a column stochastic matrix A is greater than 1.This means that the web is assumed to have no dangling nodes and multiple subwebs and the modification is used to generate unambiguous importance scores. The modified matrix M allows us to compare pages in different subwebs. The rankings of M with a damping factor of 0.85 or m of 0.15 is the same as A, but the importance scores are different due to the probability factor induced.

7. Now use the page_rank() function to find the page ranks for the web as shown in Figure 2.1.

page_rank(g)$vector

## [1] 0.3681507 0.1418094 0.2879616 0.2020783

Question a. What are the salient points you need to know in order to run the page_rank() function, e.g. relevant arguments/parameters including required data?

Answer: Required data - g <- graph(c(1, 2, 1, 3, 1, 4, 2, 3, 2, 4, 3, 1, 4, 1, 4, 3), directed = TRUE) Required packages - igraph, expm, arpack

8. The last method discussed in Bryan and Leise’s paper is the power method for computing an eigenvector. A reference for more information about the power method is given in this paper. The power method is important because of the enormous number of web pages on the Internet. Bryan and Leise quote a number of eight billion. The URL http://www.worldwidewebsize.com/ quotes a number of roughly 4.5 to 4.7 billion for Google; and, other researchers quote much, much greater numbers. Regardless of which number is correct this is still a much, much bigger matrix than we want to have if we are performing the kinds of computations we been doing. So, let’s look at the power method. In general, the power method depends on recursion to find a single eigenvalue, e.g. the largest or smallest eigenvalue associated with a system. For more information see Page and Brin’s original paper (1999), “The PageRank Citation Ranking: Bringing Order to the Web”. For example, civil engineers may want to find the smallest eigenvalue for a bridge or support column. This smallest eigenvalue occurs when the system is under maximum load. The corresponding eigenvector represents the shape of the system, e.g. shape of the support column at the instant of failure under this maximum load. The single eigenvalue we are interested in is 1 (one). This is the largest, and dominant, eigenvalue and has associated a dominant left dominant eigenvector. There are a couple things you should understand in order to reliably use an algorithm embodying the power method. You should be aware that there is a difference between using a simple loop in an iterative fashion in order to repeat an operation and a recursive procedure that actually calls itself (i.e. the recursive procedure) multiple times. When loops are used in algorithms they are typically used to iterate a specified number of times. In general, a recursive algorithm will continue to run until some convergence criteria is met. So, it is important to carefully specify the convergence criteria. And, it is important to properly use arguments, i.e. input values in matrices that allow recursion to achieve the convergence criteria. Page 5 of 9 Computations using numerical methods normally do not result in exact values. Convergence criteria represent the tolerance allowed due to this inexact nature of numerical methods. Convergence criteria are established based on the desire that the difference in a computed value from iteration to iteration approaches the exact amount desired. That is, as this difference becomes smaller and approaches zero, assuming the algorithm has been coded correctly, then the computed value approaches the desired exact value. Typical convergence criteria are small, e.g. the difference in computed value equals or is less than 0.001 or smaller. However, numerical methods might not produce monotonic results. The differences in computed values can both decrease and increase. Therefore, care must be taken in implementing convergence criteria. There are many different problems that can occur, e.g. an algorithm might never converge to a given criteria. If there are problems with either the convergence criteria or with data values in the arguments then an infinite loop can occur. In this case you might or might not get an error message. Alternatively, you can set an algorithm to only run for some number of recursions but then you may fail to converge. To use the script you will need to install and attach 3 (three) additional packages in R/RStudio; igraph, expm, and arpack. For some reason the dependencies wouldn’t install with igraph. So, I had to use the following command rather than the pull down menu: ?????????.

library(expm)

## Warning: package 'expm' was built under R version 3.4.4

## Loading required package: Matrix

## 
## Attaching package: 'expm'

## The following object is masked from 'package:Matrix':
## 
##     expm

n = nrow(M)
U = matrix(data=rep(1/n, n^2), nrow = n, ncol = n)
U

##      [,1] [,2] [,3] [,4]
## [1,] 0.25 0.25 0.25 0.25
## [2,] 0.25 0.25 0.25 0.25
## [3,] 0.25 0.25 0.25 0.25
## [4,] 0.25 0.25 0.25 0.25

beta = 0.85
C = beta*B + (1-beta)*U
C

##           [,1]   [,2]   [,3]   [,4]
## [1,] 0.0375000 0.0375 0.8875 0.4625
## [2,] 0.3208333 0.0375 0.0375 0.0375
## [3,] 0.3208333 0.4625 0.0375 0.4625
## [4,] 0.3208333 0.4625 0.0375 0.0375

r = matrix(data = rep(1/n, n), nrow = n, ncol = 1)
r

##      [,1]
## [1,] 0.25
## [2,] 0.25
## [3,] 0.25
## [4,] 0.25

9. The R Script provided then uses a number of iterations rather than convergence criteria to implement the power method. The number of iterations is set at 100. The script outputs the page ranks to the data object pageRanks.

pageRanks <- t(C%^%100 %*% r)
pageRanks

##           [,1]      [,2]      [,3]      [,4]
## [1,] 0.3681507 0.1418094 0.2879616 0.2020783

Question a. Briefly discuss how the power method is implemented in this part of the laboratory, e.g. what matrix operations are carried out, is the original link matrix A used in the computations, etc.

Answer: The power method is a simple iteration method that can be used to find ?? and for a given matrix M, where ?? is the largest eigenvalue and is the corresponding eigenvector.

We first assume that the matrix A has a dominant eigenvalue with the corresponding dominant eigenvectors. As stated before, the power method for approximating eigenvalues is iterative. Hence, we start with an initial approximation of the dominant eigenvector of A, which must be non-zero. When k is large, we can obtain a good approximation of the dominant eigenvector of A by properly scaling the sequence.

Question b. Briefly discuss how the page ranks (“importance scores”) computed after 100 iterations compare to those calculated previously including any changes in the ordering of the nodes. Also include in your discussion how you would determine the number of iterations to use if this were required input to run the algorithm.

Answer:

10. Last, the R Script implements the power method to compute page ranks using a convergence criteria of 0.00001.

Question a. How many iterations are required for the algorithm to converge? Is it important to know how many iterations are required? Why? How does the number of iterations required for the algorithm to converge to a solution compare to the specified 100 iterations in Part 10?

Answer Intuitively, imagine a billion surfers who all start at the same point. Think about the system a thousand steps later. After this point, there will have been a bunch of teleportations and a bunch of random decisions. I think it’s clear that it would take a “weird” circumstance for the mass of surfers at, say, this page on the web (the one you’re looking at) to depend sensitively on whether it is time period 1000 or 1001. The various sequences of events that might lead me to this page at time 1000 are about as likely as those that would bring me here one period later, because there is a lot of inherent randomness in the system. So the number of surfers there at the two times should be about the same.

set.seed(1234)
n <- length(M[,1])
iter <- 1
maxiter <- 1000
tolerance <- 0.00001
ccrit <- 100
pwr <- M %*% M
ckCC <- matrix(nrow=n, ncol=maxiter)
ckCC[1,1] <- 100
while((abs(ccrit) > tolerance) && (iter < maxiter)) {
iter <- iter + 1
pwr <- M %*% pwr
PRout <- pwr %*% r
ckCC[,iter] <- PRout[,]
ccrit <- ckCC[1,iter-1] - ckCC[1,iter]
}
print(iter)

## [1] 12

print(t(PRout))

##           [,1]     [,2]      [,3]      [,4]
## [1,] 0.3681576 0.141809 0.2879588 0.2020746

Question b. If the number of iterations is significantly less than 100 is that important and why or why not? For example, mention whether or not the convergence criteria used is sufficient. Also, if the number of iterations required is small briefly explain how this can be true in the context of extremely large matrix operations.

Answer Brin and Page report success using only 50 to 100 power iterations, implying that ?? could range from 10???3 to 10???7.This depends on the alpha that is chosen. The smaller the alpha the faster the convergence but also less the true hyperlink structure of the web is used to determine the web page structure. And slightly different values for ?? can produce very different PageRanks. Moreover, as ?? ??? 1, not only does convergence slow drastically, but sensitivity issues begin to surface as well.

Question c. Briefly discuss how the page ranks obtained using convergence criteria compare with those obtained using a specified number of iterations. Include in your discussion whether or not any differences in values are important and why or why not.

Answer Typically, PageRank is iterated until convergence, i.e., when the PageRank values of nodes no longer change (within some tolerance, to take into account, for example, floating point precision errors). Therefore, at the end of each iteration, the PageRank driver program must check to see if convergence has been reached. Alternative stopping criteria include running a fixed number of iterations (useful if one wishes to bound algorithm running time) or stopping when the ranks of PageRank values no longer change. The latter is useful for some applications that only care about comparing the PageRank of two arbitrary pages and do not need the actual PageRank values. Rank stability is obtained faster than the actual convergence of values.

In absolute terms, how many iterations are necessary for PageRank to converge? This is a difficult question to precisely answer since it depends on many factors, but generally, fewer than one might expect. In the original PageRank paper [117], convergence on a graph with 322 million edges was reached in 52 iterations (see also Bianchini et al. [22] for additional discussion). On today’s web, the answer is not very meaningful due to the adversarial nature of web search as previously discussed-the web is full of spam and populated with sites that are actively trying to “game” PageRank and related hyperlinkbased metrics. As a result, running PageRank in its unmodified form presented here would yield unexpected and undesirable results. Of course, strategies developed by web search companies to combat link spam are proprietary (and closely-guarded secrets, for obvious reasons)-but undoubtedly these algorithmic modifications impact convergence behavior

ANLY 530 - Lab 1

Bharat Kulkarni

8th April, 2018