data_605_final

Final Problem 1

You’ll verify for yourself that PageRank works by performing calculations on a small universe of web pages. Let’s use the 6 page universe that we had in the previous discussion For this directed graph, perform the following calculations in R.

The solution of this problem and my entire understanding of how the Page Rank algorithm works, comes from this excellent tutorial, The algorithm that started google.

(a)

Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes. (5 Points)

# get transition matrix A
url_link = c(0, 1, 1, 0, 0, 0, 
             0, 0, 0, 0, 0, 0, 
             1, 1, 0, 0, 1, 0, 
             0, 0, 0, 0, 1, 1, 
             0, 0, 0, 1, 0, 1, 
             0, 0, 0, 1, 0, 0)

tm <- matrix(data = url_link,
             byrow = TRUE, 
             nrow = 6)

rownames(tm) <- 1:6
colnames(tm) <- 1:6

score = tm %>% 
    as.data.frame %>%
    rowSums() %>%
    as.vector()
    
# so that we can get 0 instead of NaN in the final transition matrix
score = ifelse(score == 0, 1, score)

A = round(tm / score, 3)
print("Matrix A:")
## [1] "Matrix A:"
A 
##       1     2   3   4     5   6
## 1 0.000 0.500 0.5 0.0 0.000 0.0
## 2 0.000 0.000 0.0 0.0 0.000 0.0
## 3 0.333 0.333 0.0 0.0 0.333 0.0
## 4 0.000 0.000 0.0 0.0 0.500 0.5
## 5 0.000 0.000 0.0 0.5 0.000 0.5
## 6 0.000 0.000 0.0 1.0 0.000 0.0
# introduce decay and form matrix B
for(i in 1:nrow(A)){
    
    x <- sum(A[i, ])
    
    for(j in 1:ncol(A)){
        
        if(x >0){
            A[i, j] <- A[i, j] / x
        } else {
            A[i, j] <- 1/nrow(A)
        }
    }
    
}

decay = 0.85

B <- decay * A + (1 - decay)/nrow(A)
print("Matrix B:")
## [1] "Matrix B:"
B
##           1         2         3         4         5         6
## 1 0.0250000 0.4500000 0.4500000 0.0250000 0.0250000 0.0250000
## 2 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
## 3 0.3083333 0.3083333 0.0250000 0.0250000 0.3083333 0.0250000
## 4 0.0250000 0.0250000 0.0250000 0.0250000 0.4500000 0.4500000
## 5 0.0250000 0.0250000 0.0250000 0.4500000 0.0250000 0.4500000
## 6 0.0250000 0.0250000 0.0250000 0.8750000 0.0250000 0.0250000

(b)

Start with a uniform rank vector r and perform power iterations on B till convergence. That is, compute the solution r = Bn × r. Attempt this for a sufficiently large n so that r actually converges. (5 Points)

# convergence - method 1
n = 100
purrr::reduce(rep(list(B), n), `%*%`)
##            1          2          3         4         5         6
## 1 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
## 2 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
## 3 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
## 4 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
## 5 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
## 6 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
# convergence - method 2
i = 1
initial_vector <- matrix(data = 1/nrow(B), nrow = nrow(B))
init = t(B) %*% initial_vector

while(i <=n){
    convergence_matrix = t(B) %*% init
    init = convergence_matrix
    i = i+1
}

convergence_matrix
##         [,1]
## 1 0.05170475
## 2 0.07367926
## 3 0.05741241
## 4 0.34870369
## 5 0.19990381
## 6 0.26859608

I tried two above methods to calculate the convergence based on 100 iterations on B, both reached the same result. At the end, people are most likely going to end up in site 4 (35%) followed by site 6 (27%) after many numbers of random clicks.

(c)

Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1. (10 points)

# convergence - method 3 normalized left eigenvector
eig = eigen(t(B))
eig
## eigen() decomposition
## $values
## [1]  1.00000000+0i  0.57619235+0i -0.42500000+0i -0.42500000-0i -0.34991524+0i
## [6] -0.08461044+0i
## 
## $vectors
##              [,1]          [,2]                      [,3]
## [1,] 0.1044385+0i  0.2931457+0i  2.486934e-15+0.0000e+00i
## [2,] 0.1488249+0i  0.5093703+0i -8.528385e-16-6.9832e-23i
## [3,] 0.1159674+0i  0.3414619+0i -1.930646e-15-0.0000e+00i
## [4,] 0.7043472+0i -0.5890805+0i -7.071068e-01+0.0000e+00i
## [5,] 0.4037861+0i -0.1413606+0i  7.071068e-01+0.0000e+00i
## [6,] 0.5425377+0i -0.4135367+0i  0.000000e+00-1.7058e-08i
##                           [,4]           [,5]            [,6]
## [1,]  2.486934e-15-0.0000e+00i -0.06471710+0i -0.212296003+0i
## [2,] -8.528385e-16+6.9832e-23i  0.01388698+0i  0.854071294+0i
## [3,] -1.930646e-15+0.0000e+00i  0.07298180+0i -0.363638739+0i
## [4,] -7.071068e-01+0.0000e+00i -0.66058664+0i  0.018399984+0i
## [5,]  7.071068e-01-0.0000e+00i  0.73761812+0i -0.304719509+0i
## [6,]  0.000000e+00+1.7058e-08i -0.09918316+0i  0.008182973+0i
# solution
solution = as.numeric(eigen(t(B))$vectors[,1]/sum(eigen(t(B))$vectors[,1]))
solution
## [1] 0.05170475 0.07367926 0.05741241 0.34870369 0.19990381 0.26859608
print(paste0("The sum of this solution is equal to ", sum(solution)))
## [1] "The sum of this solution is equal to 1"

I presented the third method of calculating the convergence by computing the eigen-decomposition of B. Indeed, we get the eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that I have obtained in above methods. Furthermore, this eigenvector has all positive entries and it sums to 1.

(d)

Use the graph package in R and its page.rank method to compute the Page Rank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally. Verify that you do get the same PageRank vector as the two approaches above. (10 points)

g <- graph.adjacency(tm)
plot(g)

page.rank(g)
## $vector
##          1          2          3          4          5          6 
## 0.05170475 0.07367926 0.05741241 0.34870369 0.19990381 0.26859608 
## 
## $value
## [1] 1
## 
## $options
## NULL
data.frame(url = 1:6) %>%
    dplyr::mutate(method_1 = purrr::reduce(rep(list(B), n), `%*%`) %>% as.data.frame() %>% .[1, ] %>% unlist(),
                  method_2 = convergence_matrix %>% as.vector(),
                  method_3 = solution,
                  `method_4 (page.rank)` = page.rank(g)$vector)
##   url   method_1   method_2   method_3 method_4 (page.rank)
## 1   1 0.05170475 0.05170475 0.05170475           0.05170475
## 2   2 0.07367926 0.07367926 0.07367926           0.07367926
## 3   3 0.05741241 0.05741241 0.05741241           0.05741241
## 4   4 0.34870369 0.34870369 0.34870369           0.34870369
## 5   5 0.19990381 0.19990381 0.19990381           0.19990381
## 6   6 0.26859608 0.26859608 0.26859608           0.26859608

I tried 4 different methods (including the one from the graph package). I have achieved the consistent results for the PageRank vector, aka convergence for landing on the six urls.

Final Problem 2

(a)

Using the training.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.) (5 points)

trainDf <- readit::readit("data/digit-recognizer/train.csv")
testDf <- readit::readit("data/digit-recognizer/test.csv")
head(trainDf) %>% kable() %>% scroll_box()
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 pixel11 pixel12 pixel13 pixel14 pixel15 pixel16 pixel17 pixel18 pixel19 pixel20 pixel21 pixel22 pixel23 pixel24 pixel25 pixel26 pixel27 pixel28 pixel29 pixel30 pixel31 pixel32 pixel33 pixel34 pixel35 pixel36 pixel37 pixel38 pixel39 pixel40 pixel41 pixel42 pixel43 pixel44 pixel45 pixel46 pixel47 pixel48 pixel49 pixel50 pixel51 pixel52 pixel53 pixel54 pixel55 pixel56 pixel57 pixel58 pixel59 pixel60 pixel61 pixel62 pixel63 pixel64 pixel65 pixel66 pixel67 pixel68 pixel69 pixel70 pixel71 pixel72 pixel73 pixel74 pixel75 pixel76 pixel77 pixel78 pixel79 pixel80 pixel81 pixel82 pixel83 pixel84 pixel85 pixel86 pixel87 pixel88 pixel89 pixel90 pixel91 pixel92 pixel93 pixel94 pixel95 pixel96 pixel97 pixel98 pixel99 pixel100 pixel101 pixel102 pixel103 pixel104 pixel105 pixel106 pixel107 pixel108 pixel109 pixel110 pixel111 pixel112 pixel113 pixel114 pixel115 pixel116 pixel117 pixel118 pixel119 pixel120 pixel121 pixel122 pixel123 pixel124 pixel125 pixel126 pixel127 pixel128 pixel129 pixel130 pixel131 pixel132 pixel133 pixel134 pixel135 pixel136 pixel137 pixel138 pixel139 pixel140 pixel141 pixel142 pixel143 pixel144 pixel145 pixel146 pixel147 pixel148 pixel149 pixel150 pixel151 pixel152 pixel153 pixel154 pixel155 pixel156 pixel157 pixel158 pixel159 pixel160 pixel161 pixel162 pixel163 pixel164 pixel165 pixel166 pixel167 pixel168 pixel169 pixel170 pixel171 pixel172 pixel173 pixel174 pixel175 pixel176 pixel177 pixel178 pixel179 pixel180 pixel181 pixel182 pixel183 pixel184 pixel185 pixel186 pixel187 pixel188 pixel189 pixel190 pixel191 pixel192 pixel193 pixel194 pixel195 pixel196 pixel197 pixel198 pixel199 pixel200 pixel201 pixel202 pixel203 pixel204 pixel205 pixel206 pixel207 pixel208 pixel209 pixel210 pixel211 pixel212 pixel213 pixel214 pixel215 pixel216 pixel217 pixel218 pixel219 pixel220 pixel221 pixel222 pixel223 pixel224 pixel225 pixel226 pixel227 pixel228 pixel229 pixel230 pixel231 pixel232 pixel233 pixel234 pixel235 pixel236 pixel237 pixel238 pixel239 pixel240 pixel241 pixel242 pixel243 pixel244 pixel245 pixel246 pixel247 pixel248 pixel249 pixel250 pixel251 pixel252 pixel253 pixel254 pixel255 pixel256 pixel257 pixel258 pixel259 pixel260 pixel261 pixel262 pixel263 pixel264 pixel265 pixel266 pixel267 pixel268 pixel269 pixel270 pixel271 pixel272 pixel273 pixel274 pixel275 pixel276 pixel277 pixel278 pixel279 pixel280 pixel281 pixel282 pixel283 pixel284 pixel285 pixel286 pixel287 pixel288 pixel289 pixel290 pixel291 pixel292 pixel293 pixel294 pixel295 pixel296 pixel297 pixel298 pixel299 pixel300 pixel301 pixel302 pixel303 pixel304 pixel305 pixel306 pixel307 pixel308 pixel309 pixel310 pixel311 pixel312 pixel313 pixel314 pixel315 pixel316 pixel317 pixel318 pixel319 pixel320 pixel321 pixel322 pixel323 pixel324 pixel325 pixel326 pixel327 pixel328 pixel329 pixel330 pixel331 pixel332 pixel333 pixel334 pixel335 pixel336 pixel337 pixel338 pixel339 pixel340 pixel341 pixel342 pixel343 pixel344 pixel345 pixel346 pixel347 pixel348 pixel349 pixel350 pixel351 pixel352 pixel353 pixel354 pixel355 pixel356 pixel357 pixel358 pixel359 pixel360 pixel361 pixel362 pixel363 pixel364 pixel365 pixel366 pixel367 pixel368 pixel369 pixel370 pixel371 pixel372 pixel373 pixel374 pixel375 pixel376 pixel377 pixel378 pixel379 pixel380 pixel381 pixel382 pixel383 pixel384 pixel385 pixel386 pixel387 pixel388 pixel389 pixel390 pixel391 pixel392 pixel393 pixel394 pixel395 pixel396 pixel397 pixel398 pixel399 pixel400 pixel401 pixel402 pixel403 pixel404 pixel405 pixel406 pixel407 pixel408 pixel409 pixel410 pixel411 pixel412 pixel413 pixel414 pixel415 pixel416 pixel417 pixel418 pixel419 pixel420 pixel421 pixel422 pixel423 pixel424 pixel425 pixel426 pixel427 pixel428 pixel429 pixel430 pixel431 pixel432 pixel433 pixel434 pixel435 pixel436 pixel437 pixel438 pixel439 pixel440 pixel441 pixel442 pixel443 pixel444 pixel445 pixel446 pixel447 pixel448 pixel449 pixel450 pixel451 pixel452 pixel453 pixel454 pixel455 pixel456 pixel457 pixel458 pixel459 pixel460 pixel461 pixel462 pixel463 pixel464 pixel465 pixel466 pixel467 pixel468 pixel469 pixel470 pixel471 pixel472 pixel473 pixel474 pixel475 pixel476 pixel477 pixel478 pixel479 pixel480 pixel481 pixel482 pixel483 pixel484 pixel485 pixel486 pixel487 pixel488 pixel489 pixel490 pixel491 pixel492 pixel493 pixel494 pixel495 pixel496 pixel497 pixel498 pixel499 pixel500 pixel501 pixel502 pixel503 pixel504 pixel505 pixel506 pixel507 pixel508 pixel509 pixel510 pixel511 pixel512 pixel513 pixel514 pixel515 pixel516 pixel517 pixel518 pixel519 pixel520 pixel521 pixel522 pixel523 pixel524 pixel525 pixel526 pixel527 pixel528 pixel529 pixel530 pixel531 pixel532 pixel533 pixel534 pixel535 pixel536 pixel537 pixel538 pixel539 pixel540 pixel541 pixel542 pixel543 pixel544 pixel545 pixel546 pixel547 pixel548 pixel549 pixel550 pixel551 pixel552 pixel553 pixel554 pixel555 pixel556 pixel557 pixel558 pixel559 pixel560 pixel561 pixel562 pixel563 pixel564 pixel565 pixel566 pixel567 pixel568 pixel569 pixel570 pixel571 pixel572 pixel573 pixel574 pixel575 pixel576 pixel577 pixel578 pixel579 pixel580 pixel581 pixel582 pixel583 pixel584 pixel585 pixel586 pixel587 pixel588 pixel589 pixel590 pixel591 pixel592 pixel593 pixel594 pixel595 pixel596 pixel597 pixel598 pixel599 pixel600 pixel601 pixel602 pixel603 pixel604 pixel605 pixel606 pixel607 pixel608 pixel609 pixel610 pixel611 pixel612 pixel613 pixel614 pixel615 pixel616 pixel617 pixel618 pixel619 pixel620 pixel621 pixel622 pixel623 pixel624 pixel625 pixel626 pixel627 pixel628 pixel629 pixel630 pixel631 pixel632 pixel633 pixel634 pixel635 pixel636 pixel637 pixel638 pixel639 pixel640 pixel641 pixel642 pixel643 pixel644 pixel645 pixel646 pixel647 pixel648 pixel649 pixel650 pixel651 pixel652 pixel653 pixel654 pixel655 pixel656 pixel657 pixel658 pixel659 pixel660 pixel661 pixel662 pixel663 pixel664 pixel665 pixel666 pixel667 pixel668 pixel669 pixel670 pixel671 pixel672 pixel673 pixel674 pixel675 pixel676 pixel677 pixel678 pixel679 pixel680 pixel681 pixel682 pixel683 pixel684 pixel685 pixel686 pixel687 pixel688 pixel689 pixel690 pixel691 pixel692 pixel693 pixel694 pixel695 pixel696 pixel697 pixel698 pixel699 pixel700 pixel701 pixel702 pixel703 pixel704 pixel705 pixel706 pixel707 pixel708 pixel709 pixel710 pixel711 pixel712 pixel713 pixel714 pixel715 pixel716 pixel717 pixel718 pixel719 pixel720 pixel721 pixel722 pixel723 pixel724 pixel725 pixel726 pixel727 pixel728 pixel729 pixel730 pixel731 pixel732 pixel733 pixel734 pixel735 pixel736 pixel737 pixel738 pixel739 pixel740 pixel741 pixel742 pixel743 pixel744 pixel745 pixel746 pixel747 pixel748 pixel749 pixel750 pixel751 pixel752 pixel753 pixel754 pixel755 pixel756 pixel757 pixel758 pixel759 pixel760 pixel761 pixel762 pixel763 pixel764 pixel765 pixel766 pixel767 pixel768 pixel769 pixel770 pixel771 pixel772 pixel773 pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 188 255 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 191 250 253 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 123 248 253 167 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 80 247 253 208 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29 207 253 235 77 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 54 209 253 253 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 93 254 253 238 170 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 210 254 253 159 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 209 253 254 240 81 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 253 253 254 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 206 254 254 198 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 168 253 253 196 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 203 253 248 76 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 188 253 245 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 103 253 253 191 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 89 240 253 195 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 220 253 253 80 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 94 253 253 253 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 89 251 253 250 131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 214 218 95 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 30 137 137 192 86 72 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 86 250 254 254 254 254 217 246 151 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 179 254 254 254 254 254 254 254 254 254 231 54 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 72 254 254 254 254 254 254 254 254 254 254 254 254 104 0 0 0 0 0 0 0 0 0 0 0 0 0 61 191 254 254 254 254 254 109 83 199 254 254 254 254 243 85 0 0 0 0 0 0 0 0 0 0 0 0 172 254 254 254 202 147 147 45 0 11 29 200 254 254 254 171 0 0 0 0 0 0 0 0 0 0 0 1 174 254 254 89 67 0 0 0 0 0 0 128 252 254 254 212 76 0 0 0 0 0 0 0 0 0 0 47 254 254 254 29 0 0 0 0 0 0 0 0 83 254 254 254 153 0 0 0 0 0 0 0 0 0 0 80 254 254 240 24 0 0 0 0 0 0 0 0 25 240 254 254 153 0 0 0 0 0 0 0 0 0 0 64 254 254 186 7 0 0 0 0 0 0 0 0 0 166 254 254 224 12 0 0 0 0 0 0 0 0 14 232 254 254 254 29 0 0 0 0 0 0 0 0 0 75 254 254 254 17 0 0 0 0 0 0 0 0 18 254 254 254 254 29 0 0 0 0 0 0 0 0 0 48 254 254 254 17 0 0 0 0 0 0 0 0 2 163 254 254 254 29 0 0 0 0 0 0 0 0 0 48 254 254 254 17 0 0 0 0 0 0 0 0 0 94 254 254 254 200 12 0 0 0 0 0 0 0 16 209 254 254 150 1 0 0 0 0 0 0 0 0 0 15 206 254 254 254 202 66 0 0 0 0 0 21 161 254 254 245 31 0 0 0 0 0 0 0 0 0 0 0 60 212 254 254 254 194 48 48 34 41 48 209 254 254 254 171 0 0 0 0 0 0 0 0 0 0 0 0 0 86 243 254 254 254 254 254 233 243 254 254 254 254 254 86 0 0 0 0 0 0 0 0 0 0 0 0 0 0 114 254 254 254 254 254 254 254 254 254 254 239 86 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 182 254 254 254 254 254 254 254 254 243 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 76 146 254 255 254 255 146 19 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 141 139 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 106 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 185 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 89 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 146 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 156 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 185 255 255 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 185 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 185 254 254 184 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 63 254 254 62 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 220 179 6 0 0 0 0 0 0 0 0 9 77 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 247 17 0 0 0 0 0 0 0 0 27 202 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 242 155 0 0 0 0 0 0 0 0 27 254 63 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 160 207 6 0 0 0 0 0 0 0 27 254 65 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 127 254 21 0 0 0 0 0 0 0 20 239 65 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 77 254 21 0 0 0 0 0 0 0 0 195 65 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 70 254 21 0 0 0 0 0 0 0 0 195 142 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 56 251 21 0 0 0 0 0 0 0 0 195 227 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 222 153 5 0 0 0 0 0 0 0 120 240 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 67 251 40 0 0 0 0 0 0 0 94 255 69 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 234 184 0 0 0 0 0 0 0 19 245 69 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 234 169 0 0 0 0 0 0 0 3 199 182 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 154 205 4 0 0 26 72 128 203 208 254 254 131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 61 254 129 113 186 245 251 189 75 56 136 254 73 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 216 233 233 159 104 52 0 0 0 38 254 73 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 254 73 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 254 73 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 206 106 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 186 159 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 209 101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 25 130 155 254 254 254 157 30 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 103 253 253 253 253 253 253 253 253 114 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 208 253 253 253 253 253 253 253 253 253 253 107 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 253 253 253 253 253 253 253 253 253 253 253 215 101 3 0 0 0 0 0 0 0 0 0 0 0 0 23 210 253 253 253 248 161 222 222 246 253 253 253 253 253 39 0 0 0 0 0 0 0 0 0 0 0 0 136 253 253 253 229 77 0 0 0 70 218 253 253 253 253 215 91 0 0 0 0 0 0 0 0 0 0 5 214 253 253 253 195 0 0 0 0 0 104 224 253 253 253 253 215 29 0 0 0 0 0 0 0 0 0 116 253 253 253 247 75 0 0 0 0 0 0 26 200 253 253 253 253 216 4 0 0 0 0 0 0 0 0 254 253 253 253 195 0 0 0 0 0 0 0 0 26 200 253 253 253 253 5 0 0 0 0 0 0 0 0 254 253 253 253 99 0 0 0 0 0 0 0 0 0 25 231 253 253 253 36 0 0 0 0 0 0 0 0 254 253 253 253 99 0 0 0 0 0 0 0 0 0 0 223 253 253 253 129 0 0 0 0 0 0 0 0 254 253 253 253 99 0 0 0 0 0 0 0 0 0 0 127 253 253 253 129 0 0 0 0 0 0 0 0 254 253 253 253 99 0 0 0 0 0 0 0 0 0 0 139 253 253 253 90 0 0 0 0 0 0 0 0 254 253 253 253 99 0 0 0 0 0 0 0 0 0 78 248 253 253 253 5 0 0 0 0 0 0 0 0 254 253 253 253 216 34 0 0 0 0 0 0 0 33 152 253 253 253 107 1 0 0 0 0 0 0 0 0 206 253 253 253 253 140 0 0 0 0 0 30 139 234 253 253 253 154 2 0 0 0 0 0 0 0 0 0 16 205 253 253 253 250 208 106 106 106 200 237 253 253 253 253 209 22 0 0 0 0 0 0 0 0 0 0 0 82 253 253 253 253 253 253 253 253 253 253 253 253 253 209 22 0 0 0 0 0 0 0 0 0 0 0 0 1 91 253 253 253 253 253 253 253 253 253 253 213 90 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 18 129 208 253 253 253 253 159 129 90 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 141 202 254 193 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 165 254 179 163 249 244 72 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 135 254 150 0 0 189 254 243 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 82 248 209 5 0 0 164 236 254 115 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 211 254 58 0 0 0 0 33 230 212 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 119 254 156 3 0 0 0 0 18 230 254 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 212 254 35 0 0 0 0 0 33 254 254 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 116 254 154 3 0 0 0 0 0 33 254 254 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 124 254 115 0 0 0 0 0 0 160 254 239 23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 203 254 35 0 0 0 0 0 0 197 254 178 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 239 221 11 0 0 0 0 0 0 198 255 123 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 238 178 0 0 0 0 0 0 10 219 254 96 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 249 204 0 0 0 0 0 0 25 235 254 62 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 243 204 0 0 0 0 0 0 91 254 248 36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 254 204 0 0 0 0 0 67 241 254 133 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 254 214 7 0 0 0 50 242 254 194 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 193 254 78 0 0 19 128 254 195 36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 103 254 222 74 143 235 254 228 83 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 242 254 254 254 254 252 84 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 64 158 200 174 61 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
print(paste0("trainDf: ", "nrow = ", nrow(trainDf), ", ncol = ", ncol(trainDf)))
## [1] "trainDf: nrow = 42000, ncol = 785"
print(paste0("testDf: ", "nrow = ", nrow(testDf), ", ncol = ", ncol(testDf)))
## [1] "testDf: nrow = 28000, ncol = 784"
print(paste0("the train set has ", colSums(is.na(trainDf)) %>% unique(), " missing values in all columns"))
## [1] "the train set has 0 missing values in all columns"
print(paste0("the test set has ", colSums(is.na(testDf)) %>% unique(), " missing values in all columns"))
## [1] "the test set has 0 missing values in all columns"
trainDf2 <- trainDf[, 2:ncol(trainDf)] / 255
trainDf2$label <- trainDf$label
trainDf2 <- trainDf2 %>% dplyr::select(label, everything())

testDf2 <- lapply(testDf[, 1:ncol(testDf)], function(x) x/255) %>%
    dplyr::bind_cols() %>%
    as.data.frame()
plotDigit <- function(row, label){

    x <- rep(0:27)/27
    y <- rep(0:27)/27
    z <- matrix(as.numeric(row), ncol = 28, byrow = TRUE)
    rotate <- function(x) t(apply(x, 2, rev))
    z <- rotate(z)

    image(x, y, z,
          main = paste("Actual label:", label),
          #col = gray.colors(255, start = 1, end = 0),
          asp = 1,
          xlim = c(0, 1),
          ylim = c(-0.1, 1.1),
          useRaster = TRUE,
          axes = FALSE,
          xlab = "",
          ylab = "")

}

par(mfrow = c(2, 5))
for(i in 1:10){

    r = trainDf2[i, ] %>% dplyr::select(-label)

    lab = trainDf2[i, ] %>% dplyr::select(label)

    plotDigit(row = r, label = lab)

}

(b)

What is the frequency distribution of the numbers in the dataset? (5 points)

par(mfrow = c(1, 1))
trainDf2 %>%
    group_by(label) %>%
    count(count = n()) %>%
    ggplot(aes(x = factor(label),
               y = n)) +
    geom_bar(stat = "identity") +
    xlab(label = "") +
    ylab(label = "")

table(trainDf2$label)
## 
##    0    1    2    3    4    5    6    7    8    9 
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

(c)

For each number, provide the mean pixel intensity. What does this tell you? (5 points)

pixel_intensity = trainDf2 %>%
    rowwise() %>%
    dplyr::mutate(pixel_intensity = sum(c_across(starts_with("pixel")))) %>%
    ungroup()

mean_pixel_intensity = pixel_intensity %>%
    group_by(label) %>%
    summarise(mean_pixel_intensity = mean(pixel_intensity)) %>%
    ungroup()

mean_pixel_intensity
## # A tibble: 10 x 2
##    label mean_pixel_intensity
##    <dbl>                <dbl>
##  1     0                136. 
##  2     1                 59.6
##  3     2                117. 
##  4     3                111. 
##  5     4                 95.0
##  6     5                101. 
##  7     6                109. 
##  8     7                 89.9
##  9     8                118. 
## 10     9                 96.3
pixel_intensity %>%
    dplyr::select(label, pixel_intensity) %>%
    ggplot(aes(pixel_intensity, color = as.factor(label))) +
    geom_density(aes(fill = as.factor(label)), alpha = 0.5) +
    geom_vline(data = aggregate(pixel_intensity ~ label,
                                data = pixel_intensity %>%
                                    dplyr::select(label, pixel_intensity),
                                FUN = median),
               aes(xintercept = pixel_intensity,
                   color = "red"),
               linetype = "dashed",
               size = 1) +
    theme_classic() +
    facet_wrap(~ label,
               #scales = "free",
               nrow = 5) +
    theme(legend.position = "none")

Above is the mean pixel intensity after scaling (divided by 255 for each pixel column). We sum across all pixel columns and then calculate the mean by group (i.e. label column). We see that some labels appear to be very close (visually) to each other, such as “2” and “8”, “3” and “6”, “5” and “9”, and so forth. Letter “1” is possibly the most unique and easiest to be recognized.

(d)

Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why? (5 points)

# get the covariance matrix of the predictors
trainCov <- cov(trainDf2 %>% dplyr::select(-label))

# conduct pca
trainPC <- prcomp(trainCov)

# summary of pca
summary(trainPC)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     0.1660 0.1351 0.1158 0.10108 0.08829 0.08025 0.06049
## Proportion of Variance 0.2533 0.1678 0.1232 0.09395 0.07168 0.05922 0.03365
## Cumulative Proportion  0.2533 0.4211 0.5443 0.63827 0.70995 0.76918 0.80282
##                            PC8    PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.05458 0.0499 0.04430 0.03926 0.03805 0.03204 0.03194
## Proportion of Variance 0.02739 0.0229 0.01804 0.01417 0.01332 0.00944 0.00938
## Cumulative Proportion  0.83022 0.8531 0.87116 0.88533 0.89865 0.90808 0.91746
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.02975 0.02799 0.02485 0.02410 0.02230 0.02171 0.02024
## Proportion of Variance 0.00814 0.00720 0.00568 0.00534 0.00457 0.00433 0.00377
## Cumulative Proportion  0.92560 0.93280 0.93848 0.94382 0.94840 0.95273 0.95649
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.01916 0.01821 0.01722 0.01670 0.01579 0.01531 0.01467
## Proportion of Variance 0.00338 0.00305 0.00273 0.00256 0.00229 0.00216 0.00198
## Cumulative Proportion  0.95987 0.96292 0.96565 0.96821 0.97050 0.97266 0.97464
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.01396 0.01294 0.01240 0.01206 0.01131 0.01107 0.01065
## Proportion of Variance 0.00179 0.00154 0.00141 0.00134 0.00118 0.00113 0.00104
## Cumulative Proportion  0.97643 0.97797 0.97938 0.98072 0.98189 0.98302 0.98406
##                           PC36     PC37     PC38     PC39     PC40     PC41
## Standard deviation     0.01021 0.009607 0.009195 0.008975 0.008794 0.008539
## Proportion of Variance 0.00096 0.000850 0.000780 0.000740 0.000710 0.000670
## Cumulative Proportion  0.98502 0.985870 0.986650 0.987390 0.988100 0.988770
##                            PC42     PC43     PC44     PC45     PC46     PC47
## Standard deviation     0.008397 0.007889 0.007502 0.007246 0.007074 0.006812
## Proportion of Variance 0.000650 0.000570 0.000520 0.000480 0.000460 0.000430
## Cumulative Proportion  0.989420 0.989990 0.990510 0.990990 0.991450 0.991880
##                            PC48    PC49    PC50     PC51     PC52     PC53
## Standard deviation     0.006551 0.00635 0.00604 0.005924 0.005832 0.005538
## Proportion of Variance 0.000390 0.00037 0.00034 0.000320 0.000310 0.000280
## Cumulative Proportion  0.992270 0.99264 0.99298 0.993300 0.993610 0.993900
##                            PC54     PC55     PC56     PC57     PC58     PC59
## Standard deviation     0.005387 0.005288 0.005089 0.005014 0.004831 0.004781
## Proportion of Variance 0.000270 0.000260 0.000240 0.000230 0.000210 0.000210
## Cumulative Proportion  0.994160 0.994420 0.994660 0.994890 0.995100 0.995310
##                            PC60     PC61     PC62    PC63     PC64     PC65
## Standard deviation     0.004646 0.004521 0.004504 0.00429 0.004177 0.004036
## Proportion of Variance 0.000200 0.000190 0.000190 0.00017 0.000160 0.000150
## Cumulative Proportion  0.995510 0.995700 0.995890 0.99606 0.996220 0.996370
##                            PC66     PC67     PC68     PC69     PC70     PC71
## Standard deviation     0.003877 0.003827 0.003691 0.003653 0.003557 0.003515
## Proportion of Variance 0.000140 0.000130 0.000130 0.000120 0.000120 0.000110
## Cumulative Proportion  0.996500 0.996640 0.996760 0.996890 0.997000 0.997120
##                            PC72     PC73     PC74     PC75     PC76    PC77
## Standard deviation     0.003411 0.003334 0.003256 0.003132 0.003081 0.00302
## Proportion of Variance 0.000110 0.000100 0.000100 0.000090 0.000090 0.00008
## Cumulative Proportion  0.997220 0.997330 0.997420 0.997510 0.997600 0.99768
##                            PC78     PC79     PC80     PC81     PC82    PC83
## Standard deviation     0.002909 0.002757 0.002686 0.002662 0.002645 0.00262
## Proportion of Variance 0.000080 0.000070 0.000070 0.000070 0.000060 0.00006
## Cumulative Proportion  0.997760 0.997830 0.997900 0.997960 0.998030 0.99809
##                            PC84     PC85     PC86     PC87     PC88     PC89
## Standard deviation     0.002549 0.002497 0.002468 0.002445 0.002345 0.002306
## Proportion of Variance 0.000060 0.000060 0.000060 0.000050 0.000050 0.000050
## Cumulative Proportion  0.998150 0.998210 0.998260 0.998320 0.998370 0.998420
##                            PC90    PC91     PC92     PC93     PC94     PC95
## Standard deviation     0.002257 0.00218 0.002149 0.002119 0.002083 0.002038
## Proportion of Variance 0.000050 0.00004 0.000040 0.000040 0.000040 0.000040
## Cumulative Proportion  0.998470 0.99851 0.998550 0.998590 0.998630 0.998670
##                            PC96     PC97     PC98     PC99    PC100    PC101
## Standard deviation     0.002026 0.001958 0.001942 0.001916 0.001887 0.001839
## Proportion of Variance 0.000040 0.000040 0.000030 0.000030 0.000030 0.000030
## Cumulative Proportion  0.998710 0.998740 0.998780 0.998810 0.998850 0.998880
##                           PC102   PC103    PC104    PC105    PC106    PC107
## Standard deviation     0.001783 0.00177 0.001722 0.001712 0.001674 0.001629
## Proportion of Variance 0.000030 0.00003 0.000030 0.000030 0.000030 0.000020
## Cumulative Proportion  0.998910 0.99893 0.998960 0.998990 0.999010 0.999040
##                           PC108    PC109    PC110   PC111    PC112    PC113
## Standard deviation     0.001593 0.001576 0.001541 0.00148 0.001475 0.001465
## Proportion of Variance 0.000020 0.000020 0.000020 0.00002 0.000020 0.000020
## Cumulative Proportion  0.999060 0.999080 0.999110 0.99913 0.999150 0.999170
##                          PC114    PC115    PC116    PC117   PC118   PC119
## Standard deviation     0.00145 0.001429 0.001411 0.001385 0.00137 0.00135
## Proportion of Variance 0.00002 0.000020 0.000020 0.000020 0.00002 0.00002
## Cumulative Proportion  0.99919 0.999200 0.999220 0.999240 0.99926 0.99927
##                           PC120    PC121    PC122    PC123    PC124    PC125
## Standard deviation     0.001322 0.001307 0.001294 0.001282 0.001256 0.001237
## Proportion of Variance 0.000020 0.000020 0.000020 0.000020 0.000010 0.000010
## Cumulative Proportion  0.999290 0.999310 0.999320 0.999340 0.999350 0.999370
##                           PC126    PC127   PC128    PC129    PC130   PC131
## Standard deviation     0.001216 0.001198 0.00118 0.001159 0.001142 0.00114
## Proportion of Variance 0.000010 0.000010 0.00001 0.000010 0.000010 0.00001
## Cumulative Proportion  0.999380 0.999390 0.99940 0.999420 0.999430 0.99944
##                           PC132    PC133    PC134    PC135    PC136    PC137
## Standard deviation     0.001113 0.001106 0.001103 0.001086 0.001075 0.001065
## Proportion of Variance 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010
## Cumulative Proportion  0.999450 0.999460 0.999480 0.999490 0.999500 0.999510
##                           PC138    PC139    PC140     PC141     PC142     PC143
## Standard deviation     0.001044 0.001008 0.000991 0.0009851 0.0009645 0.0009518
## Proportion of Variance 0.000010 0.000010 0.000010 0.0000100 0.0000100 0.0000100
## Cumulative Proportion  0.999520 0.999530 0.999540 0.9995400 0.9995500 0.9995600
##                           PC144     PC145     PC146     PC147     PC148
## Standard deviation     0.000942 0.0009318 0.0009263 0.0009116 0.0008994
## Proportion of Variance 0.000010 0.0000100 0.0000100 0.0000100 0.0000100
## Cumulative Proportion  0.999570 0.9995800 0.9995900 0.9995900 0.9996000
##                            PC149     PC150     PC151     PC152     PC153
## Standard deviation     0.0008927 0.0008803 0.0008777 0.0008707 0.0008608
## Proportion of Variance 0.0000100 0.0000100 0.0000100 0.0000100 0.0000100
## Cumulative Proportion  0.9996100 0.9996100 0.9996200 0.9996300 0.9996400
##                            PC154     PC155    PC156    PC157     PC158
## Standard deviation     0.0008526 0.0008483 0.000832 0.000828 0.0007979
## Proportion of Variance 0.0000100 0.0000100 0.000010 0.000010 0.0000100
## Cumulative Proportion  0.9996400 0.9996500 0.999660 0.999660 0.9996700
##                            PC159     PC160     PC161     PC162     PC163
## Standard deviation     0.0007931 0.0007851 0.0007794 0.0007683 0.0007521
## Proportion of Variance 0.0000100 0.0000100 0.0000100 0.0000100 0.0000100
## Cumulative Proportion  0.9996700 0.9996800 0.9996800 0.9996900 0.9997000
##                            PC164     PC165     PC166   PC167     PC168
## Standard deviation     0.0007442 0.0007417 0.0007386 0.00072 0.0007156
## Proportion of Variance 0.0000100 0.0000100 0.0000100 0.00000 0.0000000
## Cumulative Proportion  0.9997000 0.9997100 0.9997100 0.99972 0.9997200
##                            PC169     PC170     PC171     PC172     PC173
## Standard deviation     0.0007081 0.0007054 0.0006905 0.0006852 0.0006777
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9997200 0.9997300 0.9997300 0.9997400 0.9997400
##                            PC174     PC175     PC176     PC177     PC178
## Standard deviation     0.0006736 0.0006662 0.0006657 0.0006509 0.0006474
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9997500 0.9997500 0.9997500 0.9997600 0.9997600
##                            PC179     PC180     PC181     PC182     PC183
## Standard deviation     0.0006441 0.0006393 0.0006305 0.0006221 0.0006137
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9997700 0.9997700 0.9997700 0.9997800 0.9997800
##                          PC184     PC185     PC186     PC187     PC188    PC189
## Standard deviation     0.00061 0.0006048 0.0006039 0.0005984 0.0005978 0.000586
## Proportion of Variance 0.00000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## Cumulative Proportion  0.99978 0.9997900 0.9997900 0.9997900 0.9998000 0.999800
##                            PC190     PC191     PC192     PC193    PC194
## Standard deviation     0.0005845 0.0005761 0.0005734 0.0005644 0.000563
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## Cumulative Proportion  0.9998000 0.9998100 0.9998100 0.9998100 0.999820
##                            PC195     PC196     PC197     PC198     PC199
## Standard deviation     0.0005549 0.0005529 0.0005517 0.0005459 0.0005378
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9998200 0.9998200 0.9998200 0.9998300 0.9998300
##                            PC200     PC201     PC202     PC203     PC204
## Standard deviation     0.0005354 0.0005299 0.0005213 0.0005173 0.0005126
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9998300 0.9998300 0.9998400 0.9998400 0.9998400
##                            PC205     PC206     PC207     PC208     PC209
## Standard deviation     0.0005038 0.0005031 0.0004961 0.0004945 0.0004935
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9998400 0.9998500 0.9998500 0.9998500 0.9998500
##                            PC210     PC211     PC212     PC213     PC214
## Standard deviation     0.0004873 0.0004842 0.0004816 0.0004796 0.0004763
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9998600 0.9998600 0.9998600 0.9998600 0.9998600
##                            PC215     PC216     PC217     PC218    PC219
## Standard deviation     0.0004733 0.0004681 0.0004673 0.0004613 0.000459
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
## Cumulative Proportion  0.9998700 0.9998700 0.9998700 0.9998700 0.999870
##                            PC220     PC221     PC222     PC223     PC224
## Standard deviation     0.0004543 0.0004535 0.0004517 0.0004502 0.0004463
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9998800 0.9998800 0.9998800 0.9998800 0.9998800
##                            PC225     PC226     PC227     PC228     PC229
## Standard deviation     0.0004379 0.0004359 0.0004321 0.0004286 0.0004261
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9998800 0.9998900 0.9998900 0.9998900 0.9998900
##                            PC230    PC231     PC232     PC233     PC234
## Standard deviation     0.0004239 0.000417 0.0004142 0.0004101 0.0004062
## Proportion of Variance 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9998900 0.999890 0.9999000 0.9999000 0.9999000
##                            PC235     PC236     PC237     PC238     PC239
## Standard deviation     0.0004044 0.0004021 0.0003991 0.0003978 0.0003925
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999000 0.9999000 0.9999000 0.9999100 0.9999100
##                            PC240     PC241     PC242     PC243     PC244
## Standard deviation     0.0003872 0.0003845 0.0003836 0.0003808 0.0003768
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999100 0.9999100 0.9999100 0.9999100 0.9999100
##                            PC245     PC246     PC247     PC248     PC249
## Standard deviation     0.0003727 0.0003714 0.0003673 0.0003656 0.0003633
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999100 0.9999200 0.9999200 0.9999200 0.9999200
##                            PC250    PC251     PC252     PC253    PC254
## Standard deviation     0.0003614 0.000359 0.0003554 0.0003523 0.000349
## Proportion of Variance 0.0000000 0.000000 0.0000000 0.0000000 0.000000
## Cumulative Proportion  0.9999200 0.999920 0.9999200 0.9999200 0.999930
##                            PC255     PC256    PC257     PC258     PC259
## Standard deviation     0.0003482 0.0003467 0.000345 0.0003434 0.0003423
## Proportion of Variance 0.0000000 0.0000000 0.000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999300 0.9999300 0.999930 0.9999300 0.9999300
##                            PC260     PC261     PC262   PC263     PC264
## Standard deviation     0.0003389 0.0003335 0.0003313 0.00033 0.0003273
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.00000 0.0000000
## Cumulative Proportion  0.9999300 0.9999300 0.9999300 0.99994 0.9999400
##                            PC265     PC266     PC267     PC268     PC269
## Standard deviation     0.0003256 0.0003236 0.0003226 0.0003179 0.0003173
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999400 0.9999400 0.9999400 0.9999400 0.9999400
##                            PC270     PC271     PC272     PC273     PC274
## Standard deviation     0.0003165 0.0003136 0.0003101 0.0003091 0.0003077
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999400 0.9999400 0.9999400 0.9999400 0.9999500
##                            PC275     PC276     PC277     PC278     PC279
## Standard deviation     0.0003056 0.0003028 0.0003022 0.0002997 0.0002989
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999500 0.9999500 0.9999500 0.9999500 0.9999500
##                            PC280     PC281     PC282     PC283     PC284
## Standard deviation     0.0002967 0.0002933 0.0002915 0.0002899 0.0002865
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999500 0.9999500 0.9999500 0.9999500 0.9999500
##                            PC285     PC286     PC287     PC288     PC289
## Standard deviation     0.0002847 0.0002838 0.0002811 0.0002799 0.0002765
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999500 0.9999500 0.9999600 0.9999600 0.9999600
##                            PC290     PC291     PC292     PC293     PC294
## Standard deviation     0.0002754 0.0002708 0.0002705 0.0002698 0.0002677
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999600 0.9999600 0.9999600 0.9999600 0.9999600
##                            PC295     PC296     PC297     PC298     PC299
## Standard deviation     0.0002669 0.0002648 0.0002611 0.0002594 0.0002574
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999600 0.9999600 0.9999600 0.9999600 0.9999600
##                            PC300     PC301     PC302     PC303     PC304
## Standard deviation     0.0002555 0.0002549 0.0002537 0.0002517 0.0002496
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999600 0.9999600 0.9999700 0.9999700 0.9999700
##                            PC305     PC306     PC307     PC308     PC309
## Standard deviation     0.0002468 0.0002448 0.0002425 0.0002412 0.0002394
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999700 0.9999700 0.9999700 0.9999700 0.9999700
##                            PC310     PC311     PC312     PC313     PC314
## Standard deviation     0.0002389 0.0002375 0.0002361 0.0002341 0.0002312
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999700 0.9999700 0.9999700 0.9999700 0.9999700
##                            PC315     PC316     PC317     PC318     PC319
## Standard deviation     0.0002306 0.0002287 0.0002281 0.0002263 0.0002246
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999700 0.9999700 0.9999700 0.9999700 0.9999700
##                            PC320     PC321     PC322     PC323     PC324
## Standard deviation     0.0002231 0.0002212 0.0002182 0.0002177 0.0002162
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999700 0.9999800 0.9999800 0.9999800 0.9999800
##                            PC325     PC326     PC327    PC328     PC329
## Standard deviation     0.0002144 0.0002136 0.0002109 0.000208 0.0002073
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.000000 0.0000000
## Cumulative Proportion  0.9999800 0.9999800 0.9999800 0.999980 0.9999800
##                            PC330    PC331    PC332     PC333     PC334
## Standard deviation     0.0002061 0.000206 0.000204 0.0002034 0.0002018
## Proportion of Variance 0.0000000 0.000000 0.000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999800 0.999980 0.999980 0.9999800 0.9999800
##                            PC335     PC336     PC337    PC338     PC339
## Standard deviation     0.0002014 0.0001997 0.0001971 0.000197 0.0001945
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.000000 0.0000000
## Cumulative Proportion  0.9999800 0.9999800 0.9999800 0.999980 0.9999800
##                            PC340     PC341     PC342     PC343     PC344
## Standard deviation     0.0001929 0.0001912 0.0001899 0.0001883 0.0001873
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999800 0.9999800 0.9999800 0.9999800 0.9999800
##                            PC345     PC346    PC347     PC348     PC349
## Standard deviation     0.0001861 0.0001831 0.000183 0.0001817 0.0001785
## Proportion of Variance 0.0000000 0.0000000 0.000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999800 0.9999800 0.999980 0.9999900 0.9999900
##                            PC350     PC351     PC352     PC353     PC354
## Standard deviation     0.0001776 0.0001756 0.0001746 0.0001728 0.0001719
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999900 0.9999900 0.9999900 0.9999900 0.9999900
##                            PC355     PC356     PC357     PC358     PC359
## Standard deviation     0.0001708 0.0001699 0.0001691 0.0001678 0.0001665
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999900 0.9999900 0.9999900 0.9999900 0.9999900
##                            PC360     PC361     PC362     PC363     PC364
## Standard deviation     0.0001651 0.0001644 0.0001621 0.0001605 0.0001604
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999900 0.9999900 0.9999900 0.9999900 0.9999900
##                            PC365     PC366     PC367     PC368     PC369
## Standard deviation     0.0001597 0.0001582 0.0001561 0.0001557 0.0001551
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999900 0.9999900 0.9999900 0.9999900 0.9999900
##                            PC370     PC371    PC372     PC373     PC374
## Standard deviation     0.0001526 0.0001504 0.000149 0.0001479 0.0001473
## Proportion of Variance 0.0000000 0.0000000 0.000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999900 0.9999900 0.999990 0.9999900 0.9999900
##                            PC375     PC376    PC377     PC378   PC379     PC380
## Standard deviation     0.0001462 0.0001446 0.000143 0.0001414 0.00014 0.0001381
## Proportion of Variance 0.0000000 0.0000000 0.000000 0.0000000 0.00000 0.0000000
## Cumulative Proportion  0.9999900 0.9999900 0.999990 0.9999900 0.99999 0.9999900
##                            PC381     PC382     PC383     PC384     PC385
## Standard deviation     0.0001378 0.0001369 0.0001363 0.0001354 0.0001341
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999900 0.9999900 0.9999900 0.9999900 0.9999900
##                           PC386     PC387     PC388    PC389     PC390
## Standard deviation     0.000131 0.0001305 0.0001294 0.000128 0.0001264
## Proportion of Variance 0.000000 0.0000000 0.0000000 0.000000 0.0000000
## Cumulative Proportion  0.999990 0.9999900 0.9999900 0.999990 0.9999900
##                            PC391    PC392     PC393     PC394     PC395
## Standard deviation     0.0001254 0.000124 0.0001227 0.0001206 0.0001187
## Proportion of Variance 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999900 0.999990 0.9999900 0.9999900 0.9999900
##                            PC396     PC397     PC398     PC399     PC400
## Standard deviation     0.0001183 0.0001153 0.0001146 0.0001137 0.0001122
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  0.9999900 1.0000000 1.0000000 1.0000000 1.0000000
##                            PC401     PC402    PC403    PC404     PC405
## Standard deviation     0.0001108 0.0001099 0.000109 0.000108 0.0001061
## Proportion of Variance 0.0000000 0.0000000 0.000000 0.000000 0.0000000
## Cumulative Proportion  1.0000000 1.0000000 1.000000 1.000000 1.0000000
##                            PC406     PC407     PC408     PC409     PC410
## Standard deviation     0.0001056 0.0001038 0.0001028 0.0001018 0.0001016
## Proportion of Variance 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## Cumulative Proportion  1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##                            PC411     PC412     PC413     PC414    PC415
## Standard deviation     0.0001002 9.943e-05 9.851e-05 9.694e-05 9.65e-05
## Proportion of Variance 0.0000000 0.000e+00 0.000e+00 0.000e+00 0.00e+00
## Cumulative Proportion  1.0000000 1.000e+00 1.000e+00 1.000e+00 1.00e+00
##                            PC416     PC417     PC418     PC419     PC420
## Standard deviation     9.489e-05 9.445e-05 9.314e-05 9.296e-05 9.098e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC421     PC422     PC423     PC424     PC425
## Standard deviation     8.898e-05 8.831e-05 8.703e-05 8.671e-05 8.601e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC426     PC427     PC428    PC429     PC430
## Standard deviation     8.372e-05 8.223e-05 8.083e-05 8.04e-05 8.008e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                            PC431     PC432     PC433     PC434     PC435
## Standard deviation     7.881e-05 7.756e-05 7.639e-05 7.622e-05 7.563e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC436     PC437    PC438     PC439     PC440
## Standard deviation     7.241e-05 7.154e-05 7.11e-05 7.067e-05 6.936e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
##                            PC441     PC442     PC443     PC444     PC445
## Standard deviation     6.884e-05 6.807e-05 6.664e-05 6.627e-05 6.559e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC446    PC447     PC448     PC449     PC450
## Standard deviation     6.493e-05 6.42e-05 6.361e-05 6.267e-05 6.251e-05
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC451     PC452     PC453     PC454     PC455
## Standard deviation     6.059e-05 5.947e-05 5.897e-05 5.817e-05 5.684e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC456     PC457     PC458     PC459     PC460
## Standard deviation     5.626e-05 5.557e-05 5.487e-05 5.447e-05 5.365e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC461     PC462     PC463     PC464     PC465
## Standard deviation     5.319e-05 5.265e-05 5.074e-05 5.029e-05 4.827e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC466     PC467     PC468     PC469     PC470
## Standard deviation     4.796e-05 4.786e-05 4.684e-05 4.665e-05 4.601e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC471     PC472     PC473     PC474     PC475
## Standard deviation     4.573e-05 4.536e-05 4.457e-05 4.352e-05 4.277e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC476     PC477     PC478 PC479     PC480     PC481
## Standard deviation     4.217e-05 4.108e-05 4.033e-05 4e-05 3.983e-05 3.921e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1e+00 1.000e+00 1.000e+00
##                            PC482     PC483    PC484     PC485     PC486
## Standard deviation     3.888e-05 3.832e-05 3.76e-05 3.733e-05 3.675e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
##                            PC487     PC488     PC489     PC490     PC491
## Standard deviation     3.568e-05 3.555e-05 3.501e-05 3.408e-05 3.375e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC492     PC493     PC494     PC495     PC496
## Standard deviation     3.329e-05 3.279e-05 3.248e-05 3.038e-05 2.983e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC497     PC498     PC499     PC500     PC501
## Standard deviation     2.957e-05 2.936e-05 2.886e-05 2.865e-05 2.832e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC502     PC503     PC504    PC505    PC506
## Standard deviation     2.781e-05 2.729e-05 2.696e-05 2.68e-05 2.62e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.00e+00
##                            PC507     PC508     PC509     PC510     PC511
## Standard deviation     2.615e-05 2.581e-05 2.522e-05 2.494e-05 2.483e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                           PC512     PC513     PC514     PC515     PC516   PC517
## Standard deviation     2.35e-05 2.325e-05 2.317e-05 2.305e-05 2.236e-05 2.2e-05
## Proportion of Variance 0.00e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.0e+00
## Cumulative Proportion  1.00e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.0e+00
##                            PC518     PC519     PC520     PC521     PC522
## Standard deviation     2.195e-05 2.186e-05 2.164e-05 2.089e-05 2.061e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC523     PC524     PC525     PC526     PC527
## Standard deviation     2.011e-05 1.969e-05 1.955e-05 1.939e-05 1.853e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC528     PC529     PC530    PC531     PC532
## Standard deviation     1.793e-05 1.752e-05 1.732e-05 1.71e-05 1.693e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                            PC533     PC534    PC535     PC536    PC537    PC538
## Standard deviation     1.682e-05 1.618e-05 1.59e-05 1.543e-05 1.51e-05 1.49e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.00e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.00e+00 1.00e+00
##                            PC539     PC540     PC541     PC542     PC543
## Standard deviation     1.456e-05 1.445e-05 1.401e-05 1.394e-05 1.373e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                           PC544     PC545     PC546    PC547    PC548    PC549
## Standard deviation     1.33e-05 1.308e-05 1.261e-05 1.25e-05 1.22e-05 1.21e-05
## Proportion of Variance 0.00e+00 0.000e+00 0.000e+00 0.00e+00 0.00e+00 0.00e+00
## Cumulative Proportion  1.00e+00 1.000e+00 1.000e+00 1.00e+00 1.00e+00 1.00e+00
##                            PC550     PC551     PC552     PC553     PC554
## Standard deviation     1.143e-05 1.133e-05 1.101e-05 1.068e-05 1.045e-05
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC555     PC556    PC557     PC558     PC559
## Standard deviation     1.024e-05 1.007e-05 9.81e-06 9.738e-06 9.631e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
##                           PC560     PC561     PC562     PC563     PC564
## Standard deviation     9.45e-06 9.293e-06 9.049e-06 8.955e-06 8.562e-06
## Proportion of Variance 0.00e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.00e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC565     PC566     PC567     PC568     PC569
## Standard deviation     8.226e-06 8.085e-06 7.919e-06 7.639e-06 7.543e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC570     PC571     PC572    PC573     PC574
## Standard deviation     7.487e-06 7.233e-06 7.191e-06 7.11e-06 6.706e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                            PC575     PC576    PC577     PC578    PC579
## Standard deviation     6.422e-06 6.365e-06 6.14e-06 5.888e-06 5.75e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.00e+00
##                            PC580     PC581     PC582     PC583     PC584
## Standard deviation     5.734e-06 5.633e-06 5.333e-06 5.109e-06 5.016e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC585     PC586     PC587     PC588     PC589
## Standard deviation     4.941e-06 4.906e-06 4.802e-06 4.749e-06 4.619e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC590     PC591     PC592     PC593     PC594
## Standard deviation     4.452e-06 4.362e-06 4.273e-06 4.203e-06 4.144e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC595    PC596     PC597     PC598     PC599
## Standard deviation     4.031e-06 3.88e-06 3.677e-06 3.519e-06 3.356e-06
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC600     PC601     PC602     PC603     PC604
## Standard deviation     3.325e-06 3.317e-06 3.118e-06 3.029e-06 2.998e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                           PC605     PC606     PC607     PC608     PC609
## Standard deviation     2.97e-06 2.777e-06 2.703e-06 2.616e-06 2.609e-06
## Proportion of Variance 0.00e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.00e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC610     PC611     PC612     PC613    PC614
## Standard deviation     2.415e-06 2.395e-06 2.338e-06 2.273e-06 2.23e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.00e+00
##                            PC615    PC616     PC617     PC618     PC619
## Standard deviation     2.143e-06 2.04e-06 2.008e-06 1.891e-06 1.783e-06
## Proportion of Variance 0.000e+00 0.00e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.00e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC620     PC621     PC622     PC623     PC624
## Standard deviation     1.721e-06 1.645e-06 1.626e-06 1.575e-06 1.562e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC625     PC626     PC627     PC628     PC629
## Standard deviation     1.509e-06 1.477e-06 1.433e-06 1.354e-06 1.317e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC630     PC631     PC632     PC633     PC634
## Standard deviation     1.312e-06 1.285e-06 1.232e-06 1.202e-06 1.169e-06
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                           PC635     PC636     PC637     PC638     PC639
## Standard deviation     1.13e-06 1.046e-06 1.027e-06 9.778e-07 9.598e-07
## Proportion of Variance 0.00e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.00e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC640     PC641     PC642     PC643    PC644
## Standard deviation     9.353e-07 9.217e-07 9.076e-07 7.152e-07 6.97e-07
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.00e+00
##                            PC645     PC646     PC647     PC648     PC649
## Standard deviation     6.805e-07 6.652e-07 6.109e-07 6.081e-07 5.981e-07
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC650     PC651     PC652     PC653     PC654
## Standard deviation     5.746e-07 5.661e-07 5.284e-07 4.682e-07 4.631e-07
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC655     PC656     PC657     PC658     PC659
## Standard deviation     4.402e-07 4.339e-07 4.125e-07 3.874e-07 3.726e-07
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC660     PC661     PC662     PC663     PC664
## Standard deviation     3.582e-07 3.574e-07 3.506e-07 3.309e-07 3.206e-07
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC665     PC666     PC667     PC668     PC669
## Standard deviation     3.095e-07 2.115e-07 2.065e-07 1.708e-07 1.635e-07
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC670     PC671   PC672     PC673     PC674
## Standard deviation     1.412e-07 1.404e-07 1.3e-07 1.279e-07 9.703e-08
## Proportion of Variance 0.000e+00 0.000e+00 0.0e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.0e+00 1.000e+00 1.000e+00
##                            PC675     PC676     PC677     PC678     PC679
## Standard deviation     9.057e-08 8.619e-08 8.013e-08 7.437e-08 7.172e-08
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC680     PC681     PC682     PC683     PC684
## Standard deviation     6.084e-08 4.417e-08 4.165e-08 3.901e-08 3.704e-08
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC685     PC686     PC687     PC688     PC689
## Standard deviation     3.624e-08 3.326e-08 3.208e-08 2.847e-08 2.462e-08
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC690     PC691     PC692     PC693     PC694
## Standard deviation     2.219e-08 1.839e-08 1.397e-08 1.327e-08 1.086e-08
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC695     PC696     PC697    PC698     PC699
## Standard deviation     7.174e-09 6.211e-09 3.776e-09 3.49e-09 1.637e-09
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                            PC700     PC701     PC702     PC703     PC704
## Standard deviation     3.063e-10 1.996e-10 1.671e-10 3.308e-11 3.014e-11
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC705     PC706     PC707     PC708     PC709
## Standard deviation     2.389e-16 1.869e-16 1.513e-16 1.446e-16 1.005e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC710     PC711     PC712     PC713     PC714
## Standard deviation     7.544e-17 5.035e-17 4.695e-17 4.302e-17 3.603e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC715     PC716     PC717    PC718     PC719
## Standard deviation     3.339e-17 3.058e-17 2.802e-17 2.62e-17 2.525e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                            PC720     PC721     PC722     PC723     PC724
## Standard deviation     2.368e-17 1.948e-17 1.615e-17 1.566e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC725     PC726     PC727     PC728     PC729
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC730     PC731     PC732     PC733     PC734
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC735     PC736     PC737     PC738     PC739
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC740     PC741     PC742     PC743     PC744
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC745     PC746     PC747     PC748     PC749
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC750     PC751     PC752     PC753     PC754
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC755     PC756     PC757     PC758     PC759
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC760     PC761     PC762     PC763     PC764
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC765     PC766     PC767     PC768     PC769
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC770     PC771     PC772     PC773     PC774
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.479e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC775     PC776     PC777     PC778     PC779
## Standard deviation     1.479e-17 1.479e-17 1.479e-17 1.479e-17 1.469e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                            PC780     PC781     PC782     PC783    PC784
## Standard deviation     1.183e-17 1.004e-17 8.345e-18 7.018e-18 3.15e-18
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.00e+00
# scree, cumulative variance plots
par(mfrow = c(1, 2))

# scree plot
screeplot(trainPC, type = "l", npcs = 30, main = "Screeplot of the first 30 PCs")

# cumulative variance plot
cumpro <- cumsum(trainPC$sdev^2 / sum(trainPC$sdev^2))
plot(cumpro[0:30], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 20, col = "red", lty = 2)

trainPC_percent_var <- trainPC$sdev^2/sum(trainPC$sdev^2)

options(digits = 4)
result.pc <- data.frame(PCs = 1:length(trainPC_percent_var),
                        Covariance = cumsum(trainPC_percent_var))

result.pc[seq(from = 10, to = 100, by = 5), ]
##     PCs Covariance
## 10   10     0.8712
## 15   15     0.9256
## 20   20     0.9527
## 25   25     0.9682
## 30   30     0.9780
## 35   35     0.9841
## 40   40     0.9881
## 45   45     0.9910
## 50   50     0.9930
## 55   55     0.9944
## 60   60     0.9955
## 65   65     0.9964
## 70   70     0.9970
## 75   75     0.9975
## 80   80     0.9979
## 85   85     0.9982
## 90   90     0.9985
## 95   95     0.9987
## 100 100     0.9988
train_score <- as.matrix(trainDf2[, 2:ncol(trainDf2)]) %*% trainPC$rotation[, 1:20]
label <- as.factor(trainDf2[[1]])
train_score_label <- cbind(label, as.data.frame(train_score))

colors <- rainbow(length(unique(train_score_label$label)))
names(colors) <- unique(train_score_label$label)
plot(train_score_label$PC1,
     train_score_label$PC2,
     type = "n",
     main = "First two principal components")
text(train_score_label$PC1,
     train_score_label$PC2,
     label = train_score_label$label,
     col = colors[train_score_label$label])

PCA uses orthogonal projection of highly correlated variables to a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables, i.e. min(n - 1, p). This linear transformation is defined in such a way that the first principal component has the largest possible variance. It accounts for as much of the variability in the data as possible by considering highly correlated features. Each succeeding component in turn has the highest variance using the features that are less correlated with the first principal component and that are orthogonal to the preceding component. Features that are strongly correlated with each other are more suitable for PCA than those loosely related.

In our train set, we run the principal components on the covariance matrix. The first 20 components account for 95% of the variance in our train set. The scree plot (“elbow” method) would indicate that the first 7 or 8 components are the most important ones. They capture over 80% of the variance in the data. Above plot indicates that it is good enough to classify the labels using just the first two components.

(e)

Plot the first 10 images generated by PCA. They will appear to be noise. Why? (5 points)

X_loading <- trainPC$rotation

X_train <- as.matrix(trainDf2 %>% dplyr::select(-label)) %*% X_loading
y_train <- trainDf2[['label']]

par(mfrow = c (2, 5))
for(i in 1:10){

    r = X_train[i, ]

    lab = y_train[i]

    plotDigit(row = r, label = lab)

}

I am guessing that it’s because the PCA is conducted based on the entire data set. It should have been built based on each of the 10 subsets in order to come up with meaningful components for each of the 10 labels.

(f)

Now, select only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance (100%). Plot the first 10 images. What do you see? (5 points)

# get the covariance matrix of the predictors
trainCov8 <- cov(trainDf2 %>% dplyr::filter(label == 8) %>% dplyr::select(-label))

# conduct pca
trainPC8 <- prcomp(trainCov8)

X_loading8 <- trainPC8$rotation

X_train8 <- as.matrix(trainDf2 %>% dplyr::filter(label == 8) %>% dplyr::select(-label)) %*% X_loading8
y_train8 <- trainDf2 %>% dplyr::filter(label == 8) %>% .$label

par(mfrow = c (2, 5))
for(i in 1:10){

    r = X_train8[i, ]

    lab = y_train8[i]

    plotDigit(row = r, label = lab)

}

I am still seeing noise here, and perhaps that’s because I have not properly configured my plotDigit function to account for the data transformation from the PCA.

(g)

An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be a 10 x 10, and correct classifications will be on the diagonal. (10 points)

# split the data in 80/20
set.seed(1234)
index <- sample(1:nrow(trainDf2), 0.8 * nrow(trainDf2))

# data prep
train_x <- trainDf2[index, ] %>% dplyr::select(-label)
train_y <- trainDf2[index, "label"]

test_x <- trainDf2[-index, ] %>% dplyr::select(-label)
test_y <- trainDf2[-index, "label"]

# convert every variable to numeric
train_x <- as.data.frame(lapply(train_x, as.numeric))
test_x <- as.data.frame(lapply(test_x, as.numeric))

# convert data to xgboost format
train_x_xgb <- xgb.DMatrix(data = data.matrix(train_x), label = train_y)
test_x_xgb <- xgb.DMatrix(data = data.matrix(test_x), label = test_y)

# set parameters
params <- list (
    eta = 0.3,
    max_depth = 5,
    min_child_weight = 1,
    subsample = 1,
    colsample_bytree = 1,
    objective = "multi:softmax",
    eval_metric = "merror",
    num_class = 11
)

# build model
xgb.model <- xgb.train(params, train_x_xgb, nrounds = 10)

# make prediction
xgb.predict <- predict(xgb.model, test_x_xgb)

# display 10x10 confusion matrix
xgb.cm = caret::confusionMatrix(as.factor(xgb.predict), as.factor(test_y))
xgb.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 777   1   6   6   2   5   9   3   1   4
##          1   0 902   4   8   5   3   1   5  18   5
##          2   1   1 790  17   5   4   3  13  10   1
##          3   2   6   5 798   0  22   0   6  13  12
##          4   2   2  12   7 754   5   8  11   5  23
##          5   1   2   0  19   4 705  16   1  10   6
##          6   2   2   8   3   6  19 784   0   7   0
##          7   2   1  12   6   1   5   1 819   6  11
##          8   9   4   8  23   4  15  13   6 705  11
##          9   2   0   4  14  33   6   0  30  17 734
## 
## Overall Statistics
##                                        
##                Accuracy : 0.925        
##                  95% CI : (0.919, 0.93)
##     No Information Rate : 0.11         
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.916        
##                                        
##  Mcnemar's Test P-Value : NA           
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9737    0.979    0.931    0.886   0.9263   0.8935
## Specificity            0.9951    0.993    0.993    0.991   0.9901   0.9922
## Pos Pred Value         0.9545    0.948    0.935    0.924   0.9095   0.9228
## Neg Pred Value         0.9972    0.997    0.992    0.986   0.9921   0.9890
## Prevalence             0.0950    0.110    0.101    0.107   0.0969   0.0939
## Detection Rate         0.0925    0.107    0.094    0.095   0.0898   0.0839
## Detection Prevalence   0.0969    0.113    0.101    0.103   0.0987   0.0910
## Balanced Accuracy      0.9844    0.986    0.962    0.938   0.9582   0.9429
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9389   0.9161   0.8902   0.9095
## Specificity            0.9938   0.9940   0.9878   0.9860
## Pos Pred Value         0.9434   0.9479   0.8835   0.8738
## Neg Pred Value         0.9933   0.9900   0.9886   0.9903
## Prevalence             0.0994   0.1064   0.0943   0.0961
## Detection Rate         0.0933   0.0975   0.0839   0.0874
## Detection Prevalence   0.0989   0.1029   0.0950   0.1000
## Balanced Accuracy      0.9664   0.9551   0.9390   0.9478
# report accuracy
xgb.model.accuracy = round(xgb.cm$overall[1]*100, 1)
names(xgb.model.accuracy) <- NULL
print(glue("The accuracy of the XGBoost model is ", {xgb.model.accuracy}, "%."))
## The accuracy of the XGBoost model is 92.5%.

Final Problem 3

You are to compete in the House Prices: Advanced Regression Techniques competition https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.

(a) Descriptive and Inferential Statistics

# read df
house_price_train <- readit::readit("data/house-prices-advanced-regression-techniques/train.csv")
house_price_test <- readit::readit("data/house-prices-advanced-regression-techniques/test.csv")
summary_df = sapply(house_price_train, class) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column() %>%
    dplyr::select(fields = rowname, dtypes = ".") %>%
    dplyr::mutate(dtypes = as.character(dtypes),
                  missing = colSums(is.na(house_price_train)) %>% unlist %>% as.vector(),
                  percent_missing_value = round(missing / nrow(house_price_train), 4),
                  unique_value_ct = lapply(house_price_train, function(x) length(unique(x))) %>% unlist %>% as.vector())

print(paste0("There are ", nrow(house_price_train), " rows and ", ncol(house_price_train), " columns in the train set."))
## [1] "There are 1460 rows and 81 columns in the train set."
summary_df %>% kable() %>% scroll_box()
fields dtypes missing percent_missing_value unique_value_ct
Id numeric 0 0.0000 1460
MSSubClass numeric 0 0.0000 15
MSZoning character 0 0.0000 5
LotFrontage numeric 259 0.1774 111
LotArea numeric 0 0.0000 1073
Street character 0 0.0000 2
Alley character 1369 0.9377 3
LotShape character 0 0.0000 4
LandContour character 0 0.0000 4
Utilities character 0 0.0000 2
LotConfig character 0 0.0000 5
LandSlope character 0 0.0000 3
Neighborhood character 0 0.0000 25
Condition1 character 0 0.0000 9
Condition2 character 0 0.0000 8
BldgType character 0 0.0000 5
HouseStyle character 0 0.0000 8
OverallQual numeric 0 0.0000 10
OverallCond numeric 0 0.0000 9
YearBuilt numeric 0 0.0000 112
YearRemodAdd numeric 0 0.0000 61
RoofStyle character 0 0.0000 6
RoofMatl character 0 0.0000 8
Exterior1st character 0 0.0000 15
Exterior2nd character 0 0.0000 16
MasVnrType character 8 0.0055 5
MasVnrArea numeric 8 0.0055 328
ExterQual character 0 0.0000 4
ExterCond character 0 0.0000 5
Foundation character 0 0.0000 6
BsmtQual character 37 0.0253 5
BsmtCond character 37 0.0253 5
BsmtExposure character 38 0.0260 5
BsmtFinType1 character 37 0.0253 7
BsmtFinSF1 numeric 0 0.0000 637
BsmtFinType2 character 38 0.0260 7
BsmtFinSF2 numeric 0 0.0000 144
BsmtUnfSF numeric 0 0.0000 780
TotalBsmtSF numeric 0 0.0000 721
Heating character 0 0.0000 6
HeatingQC character 0 0.0000 5
CentralAir character 0 0.0000 2
Electrical character 1 0.0007 6
1stFlrSF numeric 0 0.0000 753
2ndFlrSF numeric 0 0.0000 417
LowQualFinSF numeric 0 0.0000 24
GrLivArea numeric 0 0.0000 861
BsmtFullBath numeric 0 0.0000 4
BsmtHalfBath numeric 0 0.0000 3
FullBath numeric 0 0.0000 4
HalfBath numeric 0 0.0000 3
BedroomAbvGr numeric 0 0.0000 8
KitchenAbvGr numeric 0 0.0000 4
KitchenQual character 0 0.0000 4
TotRmsAbvGrd numeric 0 0.0000 12
Functional character 0 0.0000 7
Fireplaces numeric 0 0.0000 4
FireplaceQu character 690 0.4726 6
GarageType character 81 0.0555 7
GarageYrBlt numeric 81 0.0555 98
GarageFinish character 81 0.0555 4
GarageCars numeric 0 0.0000 5
GarageArea numeric 0 0.0000 441
GarageQual character 81 0.0555 6
GarageCond character 81 0.0555 6
PavedDrive character 0 0.0000 3
WoodDeckSF numeric 0 0.0000 274
OpenPorchSF numeric 0 0.0000 202
EnclosedPorch numeric 0 0.0000 120
3SsnPorch numeric 0 0.0000 20
ScreenPorch numeric 0 0.0000 76
PoolArea numeric 0 0.0000 8
PoolQC character 1453 0.9952 4
Fence character 1179 0.8075 5
MiscFeature character 1406 0.9630 5
MiscVal numeric 0 0.0000 21
MoSold numeric 0 0.0000 12
YrSold numeric 0 0.0000 5
SaleType character 0 0.0000 9
SaleCondition character 0 0.0000 6
SalePrice numeric 0 0.0000 663
summary_df %>%
    dplyr::mutate(missing_flag = dplyr::case_when(missing == 0 ~ 'no missing', TRUE ~'with missing')) %>%
    group_by(missing_flag) %>%
    summarise(num_of_variables = n())
## # A tibble: 2 x 2
##   missing_flag num_of_variables
##   <chr>                   <int>
## 1 no missing                 62
## 2 with missing               19
summary_df %>%
    dplyr::mutate(missing_flag = dplyr::case_when(missing == 0 ~ 'no missing', TRUE ~'with missing')) %>%
    group_by(missing_flag, dtypes) %>%
    summarise(num_of_variables = n()) %>%
    tidyr::spread(missing_flag, num_of_variables) %>%
    dplyr::inner_join(., summary_df %>% 
                          group_by(dtypes) %>%
                          summarise(num_of_variables = n()),
                      by = "dtypes")
## # A tibble: 2 x 4
##   dtypes    `no missing` `with missing` num_of_variables
##   <chr>            <int>          <int>            <int>
## 1 character           27             16               43
## 2 numeric             35              3               38
# let's just look at non-missing numeric data
numFields <- summary_df %>% 
    dplyr::filter(percent_missing_value == 0 & dtypes == "numeric") %>% 
    .$fields
numFields <- setdiff(numFields, "Id")
#numFields

# corrplot()
house_price_train %>%
    dplyr::select(all_of(numFields)) %>%
    #GGally::ggpairs()
    cor() %>%
    corrplot(type = "upper")

# scatterplot matrix, let's just look at c("SalePrice", "TotRmsAbvGrd", "GrLivArea")
house_price_train %>%
    dplyr::select("Id", "SalePrice", "TotRmsAbvGrd", "GrLivArea") %>%
    tidyr::gather(key, value, TotRmsAbvGrd:GrLivArea) %>%
    ggplot(aes(x = value, y = SalePrice)) +
    geom_jitter() +
    geom_smooth(method = lm) +
    scale_y_continuous(labels = scales::dollar_format()) +
    xlab("") +
    facet_wrap(~key, scales = "free")

tempDf <- house_price_train %>%
  dplyr::select(all_of(
    summary_df %>% 
      filter(fields != "Id" & 
               percent_missing_value == 0 & 
               dtypes == "numeric") %>% 
      arrange(desc(unique_value_ct)) %>% 
      head(3) %>% 
      .$fields
    ))

cor.mat = cor(tempDf)
cor.mat
##             LotArea GrLivArea BsmtUnfSF
## LotArea    1.000000    0.2631 -0.002618
## GrLivArea  0.263116    1.0000  0.240257
## BsmtUnfSF -0.002618    0.2403  1.000000
# just get the unique pair of the two, i.e. order of the pairs doesn't matter, such as A <-> B is the same as B <-> A
loops <- data.frame(Var1 = c(1, 2, 3),
                    Var2 = c(2, 3, 1))
tempList <- list()

for(i in 1:nrow(loops)){
    
    j = loops$Var1[i]
    k = loops$Var2[i]
    
    variable_1 = tempDf[[j]]
    variable_2 = tempDf[[k]]
    
    result = cor.test(variable_1, variable_2, method = "pearson", conf.level = 0.8) %>% 
        broom::tidy() %>%
        dplyr::mutate(Var1 = j, Var2 = k)
    
    tempList <- c(tempList, list(result))
    
}

tempList %>% 
  dplyr::bind_rows() %>%
  dplyr::inner_join(., data.frame(Var1 = 1:3,
                                  field_1 = names(tempDf))) %>%
  dplyr::inner_join(., data.frame(Var2 = 1:3,
                                  field_2 = names(tempDf))) %>%
  dplyr::mutate(flag = dplyr::case_when(`p.value` < .05~ 1, TRUE ~0)) %>%
  dplyr::select(-Var1, -Var2, -method, -alternative) %>%
  arrange(`p.value`) %>%
  kable() %>% 
  scroll_box()
estimate statistic p.value parameter conf.low conf.high field_1 field_2 flag
0.2631 10.414 0.0000 1458 0.2316 0.2941 LotArea GrLivArea 1
0.2403 9.451 0.0000 1458 0.2084 0.2716 GrLivArea BsmtUnfSF 1
-0.0026 -0.100 0.9204 1458 -0.0362 0.0309 BsmtUnfSF LotArea 0

There are many missing variables from the data set, e.g. 16 out of 43 “character” variables have missing data, 3 out of 38 “numeric” variables have missing data. We only look at non-missing numeric variables in above statistics. The corplot() displays correlation among these numeric variables, and subsequently we look at scatter plot for “SalePrice” vs “TotRmsAbvGrd”, and “SalePrice” vs “GrLivArea”. These are variables that display strongly positive correlation with “SalePrice” from the corplot().

We pick out three quantitative variables, i.e. “LotArea”, “GrLivArea” & “BsmtUnfSF”, and then we look at the correlation between each pair. There is no correlation between “BsmtUnfSF” and “LotArea” as the p-value is way above .05. The 80% confidence interval of the correlation is between -0.0362 and 0.0309, a sign of no relationship between the two. On the other hand, the other two pairs display statistically significant relationship. Their small p-value (<.05) rejects the null hypothesis that the true correlation between the two is zero. Given their small p-value, we shouldn’t worry too much about familywise/type I error.

(b) Linear Algebra and Correlation

First, let’s invert the above correlation matrix

inv.cor.mat <- solve(cor.mat)
inv.cor.mat
##            LotArea GrLivArea BsmtUnfSF
## LotArea    1.07972   -0.3022   0.07544
## GrLivArea -0.30221    1.1459  -0.27609
## BsmtUnfSF  0.07544   -0.2761   1.06653

We multiply the correlation matrix by the precision matrix, and then it results in the identity matrix.

cor.by.prec <- cor.mat %*% inv.cor.mat
cor.by.prec <- zapsmall(cor.by.prec)
cor.by.prec
##           LotArea GrLivArea BsmtUnfSF
## LotArea         1         0         0
## GrLivArea       0         1         0
## BsmtUnfSF       0         0         1

We multiply the precision matrix by the correlation matrix, and again it results in the identity matrix.

prec.by.cor <- inv.cor.mat %*% cor.mat
prec.by.cor <- zapsmall(prec.by.cor)
prec.by.cor
##           LotArea GrLivArea BsmtUnfSF
## LotArea         1         0         0
## GrLivArea       0         1         0
## BsmtUnfSF       0         0         1

Let’s conduct the LU decomposition on the matrix.

# output the lower triangular matrix
lu.decomp <- lu.decomposition(cor.mat)
lower <- lu.decomp$L
print("output the lower triangular matrix")
## [1] "output the lower triangular matrix"
lower
##           [,1]   [,2] [,3]
## [1,]  1.000000 0.0000    0
## [2,]  0.263116 1.0000    0
## [3,] -0.002618 0.2589    1
# output the upper triangular matrix
upper <- lu.decomp$U
print("output the upper triangular matrix")
## [1] "output the upper triangular matrix"
upper
##      [,1]   [,2]      [,3]
## [1,]    1 0.2631 -0.002618
## [2,]    0 0.9308  0.240946
## [3,]    0 0.0000  0.937620

We can verify the accuracy of LU decomposition by multiplying the lower and upper triangular matrices to return to initial correlation matrix. We would see identical matrices.

lower %*% upper
##           [,1]   [,2]      [,3]
## [1,]  1.000000 0.2631 -0.002618
## [2,]  0.263116 1.0000  0.240257
## [3,] -0.002618 0.2403  1.000000
cor.mat
##             LotArea GrLivArea BsmtUnfSF
## LotArea    1.000000    0.2631 -0.002618
## GrLivArea  0.263116    1.0000  0.240257
## BsmtUnfSF -0.002618    0.2403  1.000000

(c) Calculus-Based Probability & Statistics

Let’s begin by plotting some density curves by numeric variables with non-missing data and that has at least more than 100 unique data value (to avoid plotting ordinal or categorical variables)

# density curves by numeric variables with non-missing data and that has at least more than 100 unique data value
house_price_train %>%
    dplyr::select(all_of(
        summary_df %>% 
            dplyr::filter(percent_missing_value == 0 & dtypes == "numeric" & unique_value_ct >100) %>% 
            .$fields
    )) %>%
    tidyr::gather(key, value, -Id) %>%
    ggplot(aes(value, color = as.factor(key))) +
    geom_density(aes(fill = as.factor(key)), alpha = 0.5) +
    geom_vline(data = aggregate(value ~ key,
                                data = house_price_train %>%
                                    dplyr::select(all_of(
                                        summary_df %>% 
                                            dplyr::filter(percent_missing_value == 0 & 
                                                              dtypes == "numeric" & 
                                                              unique_value_ct >100) %>%
                                            .$fields)) %>%
                                    tidyr::gather(key, value, -Id),
                                FUN = median),
               aes(xintercept = value,
                   color = "red"),
               linetype = "dashed",
               size = 1) +
    theme_classic() +
    facet_wrap(~ key,
               scales = "free",
               ncol = 2) +
    xlab("") +
    theme(legend.position = "none")

Let’s pick “OpenPorchSF”, which is heavily skewed to the right. We would shift it by doing the “plus 1” so that we can ensure the minimal value is above zero.

# OpenPorchSF, i.e. heavily skewed to the right
open.por <- house_price_train$OpenPorchSF

# shift +1, i.e. make it above zero
open.por <- open.por + 1

# take a look
summary(open.por)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0    26.0    47.7    69.0   548.0

Fit an exponential probability distribution using MASS::fitdistr(), and then find the optimal value of lambda for the distribution.

# fit an exponential probability density function
exp.pdf <- MASS::fitdistr(open.por, densfun = "exponential")

# find the optimal value of lambda for this distribution
lambda <- exp.pdf$estimate
print(paste0("The optimal value of lambda is ", round(lambda, 4), "."))
## [1] "The optimal value of lambda is 0.021."

Let’s generate 1000 samples using the lambda value.

# generate random numbers using the rexp() and the lambda value
set.seed(605)
hist(rexp(1000, lambda), main = "1000 samples from the exponential distribution")

# find 5th and 95th percentiles of this newly fitted distribution
lo = qexp(.05, rate = lambda, lower.tail = TRUE)
up = qexp(.95, rate = lambda, lower.tail = TRUE)
print(paste0("The 5th percentile is approximately ", round(lo, 2), " whereas the 95th percentile is about ", round(up, 2), "."))
## [1] "The 5th percentile is approximately 2.44 whereas the 95th percentile is about 142.78."

And we compare the above with below empirical data.

# generate a 95% confidence interval from the empirical data, assuming normality
m <- mean(open.por)
s <- sd(open.por)
se <- qnorm(0.975) * (s/sqrt(length(open.por)-1))
left <- m - se
right <- m + se

set.seed(2358)
hist(sample(open.por, 1000), main = "1000 samples from the empirical data")

print(paste0("The 95% confidence interval is roughly between ", round(left, 2), " and ", round(right, 2)))
## [1] "The 95% confidence interval is roughly between 44.26 and 51.06"
# run a t test to confirm the CI
t.test(open.por) %>% broom::tidy()
## # A tibble: 1 x 8
##   estimate statistic   p.value parameter conf.low conf.high method   alternative
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <chr>    <chr>      
## 1     47.7      27.5 2.28e-134      1459     44.3      51.1 One Sam~ two.sided
# empirical 5th and 95th percentile of the data
q <- quantile(open.por, c(0.05, 0.95)) %>% unlist %>% as.vector()
print(paste0("The empirical 5th percentile is approximately ", round(q[1], 2), " whereas the empirical 95th percentile is about ", round(q[2], 2), "."))
## [1] "The empirical 5th percentile is approximately 1 whereas the empirical 95th percentile is about 176.05."

We pick the “OpenPorchSF” which is heavily skewed to the right for our use case. The values are shifted by +1 to ensure that the minimum value is above zero. We fit the exponential distribution using the transformed data. We find the optimal value of lambda for the distribution. Subsequently, we sample 1000 data points from the distribution and then plot the histogram. In contrast to the original data, the histogram shows a smaller range of values across the x-axis along with a less concentrated count along the y-axis. That is still rightly skewed, but as expected from the exponetial distribution.

In addition, the 5th and 95th percentiles from the newly transformed distribution show a smaller range [2.44, 142.78] in comparison to the empirical’s, i.e. [1, 176.05]. Moreover, the 95% CI from the empirical data suggests that the mean is between 44.26 and 51.06. We run a t.test in order to extract the CI interval (which is between 44.3 and 51.1), and that the result is extremely close to our calculation.

With the help of exponential distribution, the skewed data is now providing a more realistic outlook than the original observations.

(d) Modeling

Below is a correlation matrix that shows the correlation between a list of numeric variables that have no missing values.

# clean up df name, which I should have done in the beginning before running all the analyses
house_price_train2 <- janitor::clean_names(house_price_train)
summary_df2 = sapply(house_price_train2, class) %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column() %>%
    dplyr::select(fields = rowname, dtypes = ".") %>%
    dplyr::mutate(dtypes = as.character(dtypes),
                  missing = colSums(is.na(house_price_train2)) %>% unlist %>% as.vector(),
                  percent_missing_value = round(missing / nrow(house_price_train2), 4),
                  unique_value_ct = lapply(house_price_train2, function(x) length(unique(x))) %>% unlist %>% as.vector())

# correlation matrix
house_price_train %>%
    dplyr::select(all_of(
        summary_df %>% 
            dplyr::filter(percent_missing_value == 0 & 
                              dtypes == "numeric" & 
                              unique_value_ct >20) %>%
            .$fields)) %>%
    dplyr::select(-Id) %>%
    cor() %>%
    as.data.frame() %>%
    dplyr::mutate(flag = dplyr::case_when(abs(SalePrice) > .5 ~1, TRUE ~0)) %>%
    dplyr::arrange(desc(abs(SalePrice))) %>% 
    kable() %>% 
    scroll_box()
LotArea YearBuilt YearRemodAdd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea GarageArea WoodDeckSF OpenPorchSF EnclosedPorch ScreenPorch MiscVal SalePrice flag
SalePrice 0.2638 0.5229 0.5071 0.3864 -0.0114 0.2145 0.6136 0.6059 0.3193 -0.0256 0.7086 0.6234 0.3244 0.3159 -0.1286 0.1114 -0.0212 1.0000 1
GrLivArea 0.2631 0.1990 0.2874 0.2082 -0.0096 0.2403 0.4549 0.5660 0.6875 0.1347 1.0000 0.4690 0.2474 0.3302 0.0091 0.1015 -0.0024 0.7086 1
GarageArea 0.1804 0.4790 0.3716 0.2970 -0.0182 0.1833 0.4867 0.4898 0.1383 -0.0676 0.4690 1.0000 0.2247 0.2414 -0.1218 0.0514 -0.0274 0.6234 1
TotalBsmtSF 0.2608 0.3915 0.2911 0.5224 0.1048 0.4154 1.0000 0.8195 -0.1745 -0.0332 0.4549 0.4867 0.2320 0.2473 -0.0955 0.0845 -0.0185 0.6136 1
1stFlrSF 0.2995 0.2820 0.2404 0.4459 0.0971 0.3180 0.8195 1.0000 -0.2026 -0.0142 0.5660 0.4898 0.2355 0.2117 -0.0653 0.0888 -0.0211 0.6059 1
YearBuilt 0.0142 1.0000 0.5929 0.2495 -0.0491 0.1490 0.3915 0.2820 0.0103 -0.1838 0.1990 0.4790 0.2249 0.1887 -0.3873 -0.0504 -0.0344 0.5229 1
YearRemodAdd 0.0138 0.5929 1.0000 0.1285 -0.0678 0.1811 0.2911 0.2404 0.1400 -0.0624 0.2874 0.3716 0.2057 0.2263 -0.1939 -0.0387 -0.0103 0.5071 1
BsmtFinSF1 0.2141 0.2495 0.1285 1.0000 -0.0501 -0.4953 0.5224 0.4459 -0.1371 -0.0645 0.2082 0.2970 0.2043 0.1118 -0.1023 0.0620 0.0036 0.3864 0
WoodDeckSF 0.1717 0.2249 0.2057 0.2043 0.0679 -0.0053 0.2320 0.2355 0.0922 -0.0254 0.2474 0.2247 1.0000 0.0587 -0.1260 -0.0742 -0.0096 0.3244 0
2ndFlrSF 0.0510 0.0103 0.1400 -0.1371 -0.0993 0.0045 -0.1745 -0.2026 1.0000 0.0634 0.6875 0.1383 0.0922 0.2080 0.0620 0.0406 0.0162 0.3193 0
OpenPorchSF 0.0848 0.1887 0.2263 0.1118 0.0031 0.1290 0.2473 0.2117 0.2080 0.0183 0.3302 0.2414 0.0587 1.0000 -0.0931 0.0743 -0.0186 0.3159 0
LotArea 1.0000 0.0142 0.0138 0.2141 0.1112 -0.0026 0.2608 0.2995 0.0510 0.0048 0.2631 0.1804 0.1717 0.0848 -0.0183 0.0432 0.0381 0.2638 0
BsmtUnfSF -0.0026 0.1490 0.1811 -0.4953 -0.2093 1.0000 0.4154 0.3180 0.0045 0.0282 0.2403 0.1833 -0.0053 0.1290 -0.0025 -0.0126 -0.0238 0.2145 0
EnclosedPorch -0.0183 -0.3873 -0.1939 -0.1023 0.0365 -0.0025 -0.0955 -0.0653 0.0620 0.0611 0.0091 -0.1218 -0.1260 -0.0931 1.0000 -0.0829 0.0184 -0.1286 0
ScreenPorch 0.0432 -0.0504 -0.0387 0.0620 0.0889 -0.0126 0.0845 0.0888 0.0406 0.0268 0.1015 0.0514 -0.0742 0.0743 -0.0829 1.0000 0.0319 0.1114 0
LowQualFinSF 0.0048 -0.1838 -0.0624 -0.0645 0.0148 0.0282 -0.0332 -0.0142 0.0634 1.0000 0.1347 -0.0676 -0.0254 0.0183 0.0611 0.0268 -0.0038 -0.0256 0
MiscVal 0.0381 -0.0344 -0.0103 0.0036 0.0049 -0.0238 -0.0185 -0.0211 0.0162 -0.0038 -0.0024 -0.0274 -0.0096 -0.0186 0.0184 0.0319 1.0000 -0.0212 0
BsmtFinSF2 0.1112 -0.0491 -0.0678 -0.0501 1.0000 -0.2093 0.1048 0.0971 -0.0993 0.0148 -0.0096 -0.0182 0.0679 0.0031 0.0365 0.0889 0.0049 -0.0114 0

We would build a simple multi-linear regression model using some of the above numeric variables. These variables have no missing values, and they are strongly correlated with the dependent variable.

# let's just look at numeric variables only and that have no missing, also strongly correlate with the DV
IVs <- house_price_train2 %>%
    dplyr::select(all_of(
        summary_df2 %>% 
            dplyr::filter(percent_missing_value == 0 & 
                              dtypes == "numeric" & 
                              unique_value_ct >20) %>%
            .$fields)) %>%
    dplyr::select(-id) %>%
    cor() %>%
    as.data.frame() %>%
    dplyr::mutate(flag = dplyr::case_when(abs(sale_price) > .5 ~1, TRUE ~0)) %>%
    tibble::rownames_to_column() %>%
    dplyr::filter(flag == 1) %>%
    dplyr::select(fields = rowname) %>%
    .$fields

DV <- "sale_price"        
IVs <- setdiff(IVs, DV)
f <- as.formula(
    paste(DV,
          paste(IVs, collapse = " + "),
          sep = "~")
)
print(f)
## sale_price ~ year_built + year_remod_add + total_bsmt_sf + x1st_flr_sf + 
##     gr_liv_area + garage_area

Let’s split our train data (80/20), and then build a simple lm model with the help of stepAIC().

# split dataset 80/20
set.seed(1314)
index <- sample(1:nrow(trainDf2), 0.8 * nrow(trainDf2))
train <- house_price_train2[index, ]
test <- house_price_train2[-index, ]

# build model
house_price_model = eval(bquote(lm(f, data = train)))
house_price_model.aic <- stepAIC(house_price_model, direction = "both")
## Start:  AIC=25162
## sale_price ~ year_built + year_remod_add + total_bsmt_sf + x1st_flr_sf + 
##     gr_liv_area + garage_area
## 
##                  Df Sum of Sq      RSS   AIC
## - x1st_flr_sf     1  1.45e+07 1.82e+12 25160
## <none>                        1.82e+12 25162
## - year_remod_add  1  8.52e+10 1.91e+12 25214
## - year_built      1  1.11e+11 1.93e+12 25230
## - garage_area     1  1.14e+11 1.94e+12 25232
## - total_bsmt_sf   1  1.52e+11 1.97e+12 25255
## - gr_liv_area     1  1.07e+12 2.90e+12 25710
## 
## Step:  AIC=25160
## sale_price ~ year_built + year_remod_add + total_bsmt_sf + gr_liv_area + 
##     garage_area
## 
##                  Df Sum of Sq      RSS   AIC
## <none>                        1.82e+12 25160
## + x1st_flr_sf     1  1.45e+07 1.82e+12 25162
## - year_remod_add  1  8.53e+10 1.91e+12 25212
## - year_built      1  1.12e+11 1.93e+12 25229
## - garage_area     1  1.16e+11 1.94e+12 25231
## - total_bsmt_sf   1  3.26e+11 2.15e+12 25353
## - gr_liv_area     1  1.21e+12 3.03e+12 25764
summary(house_price_model.aic)
## 
## Call:
## lm(formula = sale_price ~ year_built + year_remod_add + total_bsmt_sf + 
##     gr_liv_area + garage_area, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -382295  -19396   -2636   15936  263534 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.91e+06   1.23e+05  -15.52  < 2e-16 ***
## year_built      4.38e+02   5.15e+01    8.51  < 2e-16 ***
## year_remod_add  5.20e+02   6.99e+01    7.44  1.9e-13 ***
## total_bsmt_sf   4.83e+01   3.32e+00   14.54  < 2e-16 ***
## gr_liv_area     7.37e+01   2.63e+00   28.05  < 2e-16 ***
## garage_area     5.94e+01   6.86e+00    8.66  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39300 on 1183 degrees of freedom
##   (32411 observations deleted due to missingness)
## Multiple R-squared:  0.76,   Adjusted R-squared:  0.759 
## F-statistic:  748 on 5 and 1183 DF,  p-value: <2e-16

The residual chart and QQ plot look acceptable. However, we can build a better model next time by removing some outliers from our data. The below “actual” versus “predicted” sale price chart indicated that the prediction is on par with actual sale price in majority of cases, i.e. no over- or under- estimation. However, there is one extreme case that is significantly over-predicted and that drives up the RMSE. Overall, this simple model is doing a decent job with adjusted r-squared above .75.

# plot
par(mfrow = c(2, 2))
plot(house_price_model.aic)

# prediction
test["pred"] = predict(house_price_model.aic, newdata = test)
par(mfrow = c(1, 1))
plot(test$sale_price/1000, test$pred/1000,
     main = "Actual vs. Predicted Sale Price",
     xlab = "actual sale price in '000",
     ylab = "predicted sale price in '000")
abline(coef = c(0, 1), col = "red", lty = "dashed") 

print(paste0("The root mean square error of the model is ", round(rmse(test$sale_price, test$pred), 1), "."))
## [1] "The root mean square error of the model is 52381.2."
# submission to kaggle
house_price_test2 <- janitor::clean_names(house_price_test)
sale_price_pred <- predict(house_price_model.aic, newdata = house_price_test2)
submission <- data.frame(Id = house_price_test2$id, 
                         SalePrice = sale_price_pred) %>%
    dplyr::mutate(SalePrice = dplyr::case_when(is.na(SalePrice) ~ median(SalePrice, na.rm = TRUE),
                                               TRUE ~SalePrice))

# write output
#write.table(submission, "clipboard-16384", sep = "\t", row.names = FALSE)

# report
print("My Kaggle username is 'JimmyNg' and my score is 0.23099")
## [1] "My Kaggle username is 'JimmyNg' and my score is 0.23099"