Problem Set 1: Playing with PageRank

You’ll verify for yourself that PageRank works by performing calculations on a small universe of web pages.

Let’s use the 6 page universe that we had in the course notes. For this directed graph, perform the following calculations in R.

Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes.

\(A =\) \(\left[ \begin{array}{ccc} 0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0\\ \frac{1}{3} & \frac{1}{3} & 0 & 0 & \frac{1}{3} & 0\\ 0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2}\\ 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2}\\ 0 & 0 & 0 & 1 & 0 & 0\\ \end{array}\right]\)

\(B = 0.85 \times A + \frac{0.15}{n}\)

I assume that n = 6 for our example, but the notes are less than clear. This gives \(\frac{0.15}{6} = 0.025\). When I scalar multiply A by 0.85 to decay, then add 0.025 to each value I get an interesting B. We have 4 rows that add up to exactly 1, 1 row that almost adds up to 1, and a row 2 that is no where close.

Row 1 adds up to 1: 0.025 + 0.450 + 0.450 + 0.025 + 0.025 + 0.025 = 1

Row 2 does NOT add up: 0.025 + 0.025 + 0.025 + 0.025 + 0.025 + 0.025 = 0.15

Row 3 almost does add up: 0.3055 + 0.3055 + 0.025 + 0.025 + 0.3055 + 0.025 = 0.9915

Row 4 adds up to 1: 0.025 + 0.025 + 0.025 + 0.025 + 0.450 + 0.450 = 1

Row 5 adds up to 1: 0.025 + 0.025 + 0.025 + 0.450 + 0.025 + 0.450 = 1

Row 6 adds up to 1: 0.025 + 0.025 + 0.025 + 0.875 + 0.025 + 0.025 = 1

# Form the A matrix
A <- matrix(c(0, 0, .33, 0, 0, 0, .50, 0, .33, 0, 0, 0, .50, 0, 0, 0, 0, 0, 0, 0, 0, 0, .50, 1, 0, 0, .33, .50, 0, 0, 0, 0, 0, 0.5, 0.5, 0), nrow=6)
A
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.00 0.50  0.5  0.0 0.00  0.0
## [2,] 0.00 0.00  0.0  0.0 0.00  0.0
## [3,] 0.33 0.33  0.0  0.0 0.33  0.0
## [4,] 0.00 0.00  0.0  0.0 0.50  0.5
## [5,] 0.00 0.00  0.0  0.5 0.00  0.5
## [6,] 0.00 0.00  0.0  1.0 0.00  0.0
# Introduce decay and form the B matrix
B <- .85 * A + (.15/6)
B
##        [,1]   [,2]  [,3]  [,4]   [,5]  [,6]
## [1,] 0.0250 0.4500 0.450 0.025 0.0250 0.025
## [2,] 0.0250 0.0250 0.025 0.025 0.0250 0.025
## [3,] 0.3055 0.3055 0.025 0.025 0.3055 0.025
## [4,] 0.0250 0.0250 0.025 0.025 0.4500 0.450
## [5,] 0.0250 0.0250 0.025 0.450 0.0250 0.450
## [6,] 0.0250 0.0250 0.025 0.875 0.0250 0.025

Start with a uniform rank vector r and perform power iterations on B till convergence.

That is, compute the solution \(r = B^n \times r\). Attempt this for a sufficiently large n so that r actually converges.

\(r_f = B \times r_i\), \(r_i = \left[ \begin{array}{ccc} \frac{1}{6} \\ \frac{1}{6} \\ \frac{1}{6} \\ \frac{1}{6} \\ \frac{1}{6} \\ \frac{1}{6}\\ \end{array}\right]\), for i = 1

# A freakin worked-thru example would be great. I'm guessing that iterations are as follows. First a test.
ri1 <- matrix(c(.167, .167, .167, .167, .167, .167), nrow=6)
# calculate A X r_i to see if I get answer in handout ... I do
crossprod(A, ri1)
##         [,1]
## [1,] 0.05511
## [2,] 0.13861
## [3,] 0.08350
## [4,] 0.25050
## [5,] 0.13861
## [6,] 0.16700
# So, I guess we keep taking cross-products to see if things converge
rf1 <- crossprod(B, ri1)
rf1
##           [,1]
## [1,] 0.0718935
## [2,] 0.1428685
## [3,] 0.0960250
## [4,] 0.2379750
## [5,] 0.1428685
## [6,] 0.1670000
rf2 <- crossprod(B, rf1)
rf2
##            [,1]
## [1,] 0.04840078
## [2,] 0.07895551
## [3,] 0.05202050
## [4,] 0.22413488
## [5,] 0.14954015
## [6,] 0.18332425
rf3 <- crossprod(B, rf2)
rf3
##            [,1]
## [1,] 0.03300115
## [2,] 0.05357148
## [3,] 0.03897973
## [4,] 0.23778958
## [5,] 0.12825847
## [6,] 0.17722129
rf4 <- crossprod(B, rf3)
rf4
##            [,1]
## [1,] 0.02765436
## [2,] 0.04167985
## [3,] 0.03074603
## [4,] 0.22186849
## [5,] 0.12871493
## [6,] 0.17229096
rf5 <- crossprod(B, rf4)
rf5
##            [,1]
## [1,] 0.02419813
## [2,] 0.03595123
## [3,] 0.02732697
## [4,] 0.21672503
## [5,] 0.11849223
## [6,] 0.16457182
rf6 <- crossprod(B, rf5)
rf6
##            [,1]
## [1,] 0.02234685
## [2,] 0.03263105
## [3,] 0.02496584
## [4,] 0.20492688
## [5,] 0.11445499
## [6,] 0.15714897
rf7 <- crossprod(B, rf6)
rf7
##            [,1]
## [1,] 0.02091478
## [2,] 0.03041219
## [3,] 0.02340928
## [4,] 0.19613186
## [5,] 0.10800871
## [6,] 0.14964916
rf8 <- crossprod(B, rf7)
rf8
##            [,1]
## [1,] 0.01977945
## [2,] 0.02866823
## [3,] 0.02210193
## [4,] 0.18631863
## [5,] 0.10313549
## [6,] 0.14247289
rf9 <- crossprod(B, rf8)
rf9
##            [,1]
## [1,] 0.01876151
## [2,] 0.02716777
## [3,] 0.02096818
## [4,] 0.17749646
## [5,] 0.09794693
## [6,] 0.13557992
rf10 <- crossprod(B, rf9)
rf10
##            [,1]
## [1,] 0.01782959
## [2,] 0.02580324
## [3,] 0.01992166
## [4,] 0.16881839
## [5,] 0.09326559
## [6,] 0.12901146
# OK, I could do this all day, I'm not proud, but ...
# Let's use a For-Loop to save some space, allow for bigger n, and calc the %-difference between runs.

i <- 1
n <- 20
prank <- ri1
for (i in 1:n) {
  lrank <- prank
  prank <- crossprod(B, lrank)
  
  if (i == 1) {
    print(prank)
    last <- prank
  } else {
    percent_diff <- (last - prank)/last * 100
    print("This is run ") 
    print(i)
    print(prank)
    print("This is the percent difference from last run")
    print(percent_diff)
    last <- prank
  }
}
##           [,1]
## [1,] 0.0718935
## [2,] 0.1428685
## [3,] 0.0960250
## [4,] 0.2379750
## [5,] 0.1428685
## [6,] 0.1670000
## [1] "This is run "
## [1] 2
##            [,1]
## [1,] 0.04840078
## [2,] 0.07895551
## [3,] 0.05202050
## [4,] 0.22413488
## [5,] 0.14954015
## [6,] 0.18332425
## [1] "This is the percent difference from last run"
##           [,1]
## [1,] 32.677120
## [2,] 44.735535
## [3,] 45.826087
## [4,]  5.815789
## [5,] -4.669784
## [6,] -9.775000
## [1] "This is run "
## [1] 3
##            [,1]
## [1,] 0.03300115
## [2,] 0.05357148
## [3,] 0.03897973
## [4,] 0.23778958
## [5,] 0.12825847
## [6,] 0.17722129
## [1] "This is the percent difference from last run"
##           [,1]
## [1,] 31.816894
## [2,] 32.149790
## [3,] 25.068519
## [4,] -6.092181
## [5,] 14.231413
## [6,]  3.329054
## [1] "This is run "
## [1] 4
##            [,1]
## [1,] 0.02765436
## [2,] 0.04167985
## [3,] 0.03074603
## [4,] 0.22186849
## [5,] 0.12871493
## [6,] 0.17229096
## [1] "This is the percent difference from last run"
##           [,1]
## [1,] 16.201843
## [2,] 22.197696
## [3,] 21.123026
## [4,]  6.695453
## [5,] -0.355886
## [6,]  2.782015
## [1] "This is run "
## [1] 5
##            [,1]
## [1,] 0.02419813
## [2,] 0.03595123
## [3,] 0.02732697
## [4,] 0.21672503
## [5,] 0.11849223
## [6,] 0.16457182
## [1] "This is the percent difference from last run"
##           [,1]
## [1,] 12.497957
## [2,] 13.744334
## [3,] 11.120345
## [4,]  2.318247
## [5,]  7.942119
## [6,]  4.480297
## [1] "This is run "
## [1] 6
##            [,1]
## [1,] 0.02234685
## [2,] 0.03263105
## [3,] 0.02496584
## [4,] 0.20492688
## [5,] 0.11445499
## [6,] 0.15714897
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 7.650501
## [2,] 9.235221
## [3,] 8.640285
## [4,] 5.443834
## [5,] 3.407183
## [6,] 4.510398
## [1] "This is run "
## [1] 7
##            [,1]
## [1,] 0.02091478
## [2,] 0.03041219
## [3,] 0.02340928
## [4,] 0.19613186
## [5,] 0.10800871
## [6,] 0.14964916
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 6.408362
## [2,] 6.799842
## [3,] 6.234774
## [4,] 4.291784
## [5,] 5.632154
## [6,] 4.772424
## [1] "This is run "
## [1] 8
##            [,1]
## [1,] 0.01977945
## [2,] 0.02866823
## [3,] 0.02210193
## [4,] 0.18631863
## [5,] 0.10313549
## [6,] 0.14247289
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 5.428368
## [2,] 5.734410
## [3,] 5.584725
## [4,] 5.003382
## [5,] 4.511872
## [6,] 4.795394
## [1] "This is run "
## [1] 9
##            [,1]
## [1,] 0.01876151
## [2,] 0.02716777
## [3,] 0.02096818
## [4,] 0.17749646
## [5,] 0.09794693
## [6,] 0.13557992
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 5.146470
## [2,] 5.233874
## [3,] 5.129639
## [4,] 4.734994
## [5,] 5.030824
## [6,] 4.838093
## [1] "This is run "
## [1] 10
##            [,1]
## [1,] 0.01782959
## [2,] 0.02580324
## [3,] 0.01992166
## [4,] 0.16881839
## [5,] 0.09326559
## [6,] 0.12901146
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.967156
## [2,] 5.022639
## [3,] 4.991003
## [4,] 4.889147
## [5,] 4.779464
## [6,] 4.844716
## [1] "This is run "
## [1] 11
##            [,1]
## [1,] 0.01695427
## [2,] 0.02453185
## [3,] 0.01894383
## [4,] 0.16066386
## [5,] 0.08870209
## [6,] 0.12275194
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.909369
## [2,] 4.927226
## [3,] 4.908397
## [4,] 4.830358
## [5,] 4.893013
## [6,] 4.851907
## [1] "This is run "
## [1] 12
##            [,1]
## [1,] 0.01612744
## [2,] 0.02333301
## [3,] 0.01801926
## [4,] 0.15285123
## [5,] 0.08440958
## [6,] 0.11679423
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.876850
## [2,] 4.886895
## [3,] 4.880552
## [4,] 4.862716
## [5,] 4.839244
## [6,] 4.853459
## [1] "This is run "
## [1] 13
##            [,1]
## [1,] 0.01534277
## [2,] 0.02219693
## [3,] 0.01714253
## [4,] 0.14543753
## [5,] 0.08030455
## [6,] 0.11112422
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.865419
## [2,] 4.868949
## [3,] 4.865527
## [4,] 4.850273
## [5,] 4.863232
## [6,] 4.854701
## [1] "This is run "
## [1] 14
##            [,1]
## [1,] 0.01459719
## [2,] 0.02111787
## [3,] 0.01630939
## [4,] 0.13837373
## [5,] 0.07640814
## [6,] 0.10572910
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.859479
## [2,] 4.861313
## [3,] 4.860071
## [4,] 4.856934
## [5,] 4.852032
## [6,] 4.855034
## [1] "This is run "
## [1] 15
##            [,1]
## [1,] 0.01388817
## [2,] 0.02009198
## [3,] 0.01551719
## [4,] 0.13165658
## [5,] 0.07269700
## [6,] 0.10059568
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.857257
## [2,] 4.857943
## [3,] 4.857316
## [4,] 4.854353
## [5,] 4.856995
## [6,] 4.855253
## [1] "This is run "
## [1] 16
##            [,1]
## [1,] 0.01321374
## [2,] 0.01911621
## [3,] 0.01476364
## [4,] 0.12526372
## [5,] 0.06916778
## [6,] 0.09571144
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.856163
## [2,] 4.856501
## [3,] 4.856261
## [4,] 4.855707
## [5,] 4.854699
## [6,] 4.855321
## [1] "This is run "
## [1] 17
##            [,1]
## [1,] 0.01257211
## [2,] 0.01818795
## [3,] 0.01404675
## [4,] 0.11918194
## [5,] 0.06580920
## [6,] 0.09106430
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.855736
## [2,] 4.855868
## [3,] 4.855752
## [4,] 4.855179
## [5,] 4.855713
## [6,] 4.855360
## [1] "This is run "
## [1] 18
##            [,1]
## [1,] 0.01196167
## [2,] 0.01730482
## [3,] 0.01336470
## [4,] 0.11339512
## [5,] 0.06261400
## [6,] 0.08664279
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.855533
## [2,] 4.855596
## [3,] 4.855550
## [4,] 4.855452
## [5,] 4.855246
## [6,] 4.855374
## [1] "This is run "
## [1] 19
##            [,1]
## [1,] 0.01138088
## [2,] 0.01646459
## [3,] 0.01271579
## [4,] 0.10788940
## [5,] 0.05957380
## [6,] 0.08243595
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.855451
## [2,] 4.855477
## [3,] 4.855455
## [4,] 4.855345
## [5,] 4.855452
## [6,] 4.855381
## [1] "This is run "
## [1] 20
##            [,1]
## [1,] 0.01082829
## [2,] 0.01566516
## [3,] 0.01209838
## [4,] 0.10265094
## [5,] 0.05668128
## [6,] 0.07843337
## [1] "This is the percent difference from last run"
##          [,1]
## [1,] 4.855414
## [2,] 4.855425
## [3,] 4.855417
## [4,] 4.855399
## [5,] 4.855358
## [6,] 4.855384
# Let's try really BIG n's and and only print the end results.

i <- 1
n <- 100
prank <- ri1
for (i in 1:n) {
  lrank <- prank
  prank <- crossprod(B, lrank)
  
  if (i == 1) {
    print(prank)
    last <- prank
  } else {
    percent_diff <- (last - prank)/last * 100
    last <- prank
  }
}
##           [,1]
## [1,] 0.0718935
## [2,] 0.1428685
## [3,] 0.0960250
## [4,] 0.2379750
## [5,] 0.1428685
## [6,] 0.1670000
print("This is run ")
## [1] "This is run "
print(i)
## [1] 100
print(prank)
##              [,1]
## [1,] 0.0002019744
## [2,] 0.0002921941
## [3,] 0.0002256649
## [4,] 0.0019146952
## [5,] 0.0010572467
## [6,] 0.0014629773
print("This is the percent difference from last run")
## [1] "This is the percent difference from last run"
print(percent_diff)
##          [,1]
## [1,] 4.855386
## [2,] 4.855386
## [3,] 4.855386
## [4,] 4.855386
## [5,] 4.855386
## [6,] 4.855386

OK, I don’t know if I’m doing this correctly, but I did notice some interesting things in what I did.

When we look at my iterations, we see the percent difference among all the values converge. We see between the first 2 runs that the percent difference between values ranges from 45.8% to -9.8%. The next run 32% to -6%. By run 5 they are all positve changes ranging from 13.7% to 2.3%. By run 9 we seem to be approaching a convergence of 5% difference with a high of 5.2% and a low of 4.7%. By run 11 we have 4 values that round to 4.9% and 2 at 4.8%. So, now we need to decide what is significant. We have convergence to 1 place (5%) by run 9. We bounce around with 4 to 5 values being the same to the 1st decimal from run 11 to 16. Run 16 converges to (4.86%) for all values.

When I did a 100 runs all the values has the same percentage difference to 6 decimal places (4.855386), however this rounds to the 4.86% which we got at run 16 and 5% which we had at run 9. Maybe 10 interations is large enough.

Compute the eigen-decomposition of B

And, verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1.

# Run eigen and show the eigen values and vectors
x <- eigen(B)
x
## $values
## [1]  0.95144614+0i -0.42500000+0i -0.42500000-0i  0.41284482+0i
## [5] -0.34563163+0i -0.01865934+0i
## 
## $vectors
##               [,1]                        [,2]                        [,3]
## [1,] 0.21474933+0i -5.276118e-01-1.156583e-07i -5.276118e-01+1.156583e-07i
## [2,] 0.05719614+0i  0.000000e+00+1.169753e-09i  0.000000e+00-1.169753e-09i
## [3,] 0.29551769+0i  5.276118e-01+1.009896e-07i  5.276118e-01-1.009896e-07i
## [4,] 0.53643286+0i -2.718000e-01+0.000000e+00i -2.718000e-01-0.000000e+00i
## [5,] 0.53643286+0i -2.718000e-01-0.000000e+00i -2.718000e-01+0.000000e+00i
## [6,] 0.53643286+0i  5.436001e-01+0.000000e+00i  5.436001e-01+0.000000e+00i
##                [,4]            [,5]           [,6]
## [1,]  0.78010934+0i -0.776640715+0i  0.55007763+0i
## [2,]  0.07569051+0i  0.009353752+0i -0.61532089+0i
## [3,]  0.60858164+0i  0.629856947+0i  0.56415486+0i
## [4,] -0.07148134+0i  0.002703970+0i -0.01321747+0i
## [5,] -0.07148134+0i  0.002703970+0i -0.01321747+0i
## [6,] -0.07148134+0i  0.002703970+0i -0.01321747+0i

Nothing seems to look right here. We have imaginary numbers, which I didn’t expect. If we ignore the imaginary part, I guess we can say the first value is almost 1 at 0.95, but the vectors do not correspond to any of my values above. I still think row 2 being all zeros in A and then giving a row 2 in B that doesn’t add up to 1 is a problem. I tried transposing A to see if that would help, but it didn’t.

Use the graph package in R

Use its page.rank method to compute the Page Rank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally. Verify that you do get the same PageRank vector as the two approaches above.

Odd, the R package graph has no mention of page.rank or anything with page in the name. So, I searched online and found igraph, which does have a page.rank function.

# 
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
# Not sure how this works, but here's my adjacency matrix attempt
gadj <- graph_from_adjacency_matrix(A)
page.rank(gadj)
## $vector
## [1] 0.1459854 0.1459854 0.1459854 0.2700730 0.1459854 0.1459854
## 
## $value
## [1] 1
## 
## $options
## NULL
# The tutorials I found on igraph showed building the graph, and this looks like the example if we count the pages withlinks going to each other as two edges.
edges <- c(1,2, 1,3, 3,1, 3,2, 3,5, 4,5, 4,6, 5,4, 5,6, 6,4)
g <- graph(edges, n=max(edges), directed=TRUE)
vcount(g)
## [1] 6
ecount(g)
## [1] 10
page.rank(gadj)
## $vector
## [1] 0.1459854 0.1459854 0.1459854 0.2700730 0.1459854 0.1459854
## 
## $value
## [1] 1
## 
## $options
## NULL

Neither one of these methods gives a vector close to what I calcualted from interation.