Problem 1 - PageRank Solution

Question 1

Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes.

# Form the A matrix. R fills matrices column-wise, so each column below holds
# one page's outgoing links, making A column-stochastic (except for the
# dangling page p2, fixed below).
A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
              0, 0, 0, 0, 0, 0,
              1/3, 1/3, 0, 0, 1/3, 0,
              0, 0, 0, 0, 1/2, 1/2,
              0, 0, 0, 1/2, 0, 1/2,
              0, 0, 0, 1, 0, 0),
            nrow = 6)

A
##      [,1] [,2]      [,3] [,4] [,5] [,6]
## [1,]  0.0    0 0.3333333  0.0  0.0    0
## [2,]  0.5    0 0.3333333  0.0  0.0    0
## [3,]  0.5    0 0.0000000  0.0  0.0    0
## [4,]  0.0    0 0.0000000  0.0  0.5    1
## [5,]  0.0    0 0.3333333  0.5  0.0    0
## [6,]  0.0    0 0.0000000  0.5  0.5    0
# Make A column-stochastic by replacing p2's all-zero column with a uniform
# vector. Left at 0 (no outgoing links from p2), the dangling node would keep
# PageRank from converging.
A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
              1/6, 1/6, 1/6, 1/6, 1/6, 1/6,
              1/3, 1/3, 0, 0, 1/3, 0,
              0, 0, 0, 0, 1/2, 1/2,
              0, 0, 0, 1/2, 0, 1/2,
              0, 0, 0, 1, 0, 0),
            nrow = 6)
print(A)
##      [,1]      [,2]      [,3] [,4] [,5] [,6]
## [1,]  0.0 0.1666667 0.3333333  0.0  0.0    0
## [2,]  0.5 0.1666667 0.3333333  0.0  0.0    0
## [3,]  0.5 0.1666667 0.0000000  0.0  0.0    0
## [4,]  0.0 0.1666667 0.0000000  0.0  0.5    1
## [5,]  0.0 0.1666667 0.3333333  0.5  0.0    0
## [6,]  0.0 0.1666667 0.0000000  0.5  0.5    0
# Next we form the decay matrix B with damping factor 0.85
n <- 6
B <- 0.85 * A + 0.15 / n
print(B)
##       [,1]      [,2]      [,3]  [,4]  [,5]  [,6]
## [1,] 0.025 0.1666667 0.3083333 0.025 0.025 0.025
## [2,] 0.450 0.1666667 0.3083333 0.025 0.025 0.025
## [3,] 0.450 0.1666667 0.0250000 0.025 0.025 0.025
## [4,] 0.025 0.1666667 0.0250000 0.025 0.450 0.875
## [5,] 0.025 0.1666667 0.3083333 0.450 0.025 0.025
## [6,] 0.025 0.1666667 0.0250000 0.450 0.450 0.025
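
A quick sanity check on B: since A is column-stochastic and each column picks up 6 × 0.15/6 = 0.15 from the decay term, every column of B should still sum to 1:

colSums(B)  # expect a vector of six 1's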

Question 2

Start with a uniform rank vector r and perform power iterations on B until convergence. That is, compute the solution r = B^n × r. Attempt this for a sufficiently large n so that r actually converges.

# Perform power iterations on B until convergence, starting from a uniform rank vector
r <- matrix(c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6), nrow=6, ncol=1)
r
##           [,1]
## [1,] 0.1666667
## [2,] 0.1666667
## [3,] 0.1666667
## [4,] 0.1666667
## [5,] 0.1666667
## [6,] 0.1666667
n <- 30
ns <- c(0)
rs <- c(r[4])  # track the 4th component of r across iterations
n_r <- r
Bn <- B

for (i in 1:n) {
  Bn <- Bn %*% B   # Bn is now B^(i+1)
  n_r <- Bn %*% r  # current rank estimate
  rs <- append(rs, n_r[4])
  ns <- append(ns, i)
}

n_r
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367927
## [3,] 0.05741242
## [4,] 0.34870368
## [5,] 0.19990381
## [6,] 0.26859608
plot(ns, rs)
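
With n = 30 the iterate has clearly settled. An alternative is an explicit stopping rule rather than a fixed n; a minimal sketch, where the 1e-10 tolerance is an assumption:

# stop once successive iterates agree to within a tolerance
r_old <- matrix(rep(1/6, 6), nrow = 6)
repeat {
  r_new <- B %*% r_old
  if (max(abs(r_new - r_old)) < 1e-10) break
  r_old <- r_new
}
r_new  # should match n_r above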

Question 3

Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method. Further, this eigenvector has all positive entries and it sums to 1.

# Verify that 1 is the largest eigenvalue of B

x <- eigen(B)
x$values
## [1]  1.00000000+0i  0.57619235+0i -0.42500000+0i -0.42500000-0i -0.34991524+0i
## [6] -0.08461044+0i
# Normalize the leading eigenvector so its entries sum to 1
x$vectors[, 1] / sum(x$vectors[, 1])
## [1] 0.05170475+0i 0.07367926+0i 0.05741241+0i 0.34870369+0i 0.19990381+0i
## [6] 0.26859608+0i
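
To make the verification explicit (a short check reusing x and n_r from above; the imaginary parts are all zero, so Re() is safe):

v <- Re(x$vectors[, 1] / sum(x$vectors[, 1]))  # drop the zero imaginary parts
all(v > 0)          # every entry is positive
sum(v)              # entries sum to 1
max(abs(v - n_r))   # agreement with the power-iteration vector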

Question 4

Use the igraph package in R and its page.rank method to compute the PageRank of the graph as given in A. Note that you don’t need to apply decay. The package starts with a connected graph and applies decay internally. Verify that you do get the same PageRank vector as the two approaches above.

# Rebuild A row-wise (byrow = TRUE) so that row i holds page i's outgoing
# links, the orientation graph_from_adjacency_matrix() expects
A <- matrix(c(0, 1/2, 1/2, 0, 0, 0,
              0, 0, 0, 0, 0, 0,
              1/3, 1/3, 0, 0, 1/3, 0,
              0, 0, 0, 0, 1/2, 1/2,
              0, 0, 0, 1/2, 0, 1/2,
              0, 0, 0, 1, 0, 0),
            nrow = 6, byrow = TRUE)
library('igraph')
g <- graph_from_adjacency_matrix(A, weighted=TRUE)
plot(g)
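
The PageRank vector itself comes from igraph's page_rank() (page.rank in older releases); with the default damping of 0.85 it should reproduce the vector from the two approaches above:

pr <- page_rank(g, damping = 0.85)
pr$vector  # expected to match the power-iteration / eigenvector result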

Problem 2 - MNIST Machine Learning Techniques

Question 1

Using the training.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1.

library(readr)  # read_csv()
testdigds <- read_csv("testdig.csv")
## Rows: 28000 Columns: 784
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (784): pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, p...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
testdigds
## # A tibble: 28,000 x 784
##    pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10
##     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1      0      0      0      0      0      0      0      0      0      0       0
##  2      0      0      0      0      0      0      0      0      0      0       0
##  3      0      0      0      0      0      0      0      0      0      0       0
##  4      0      0      0      0      0      0      0      0      0      0       0
##  5      0      0      0      0      0      0      0      0      0      0       0
##  6      0      0      0      0      0      0      0      0      0      0       0
##  7      0      0      0      0      0      0      0      0      0      0       0
##  8      0      0      0      0      0      0      0      0      0      0       0
##  9      0      0      0      0      0      0      0      0      0      0       0
## 10      0      0      0      0      0      0      0      0      0      0       0
## # ... with 27,990 more rows, and 773 more variables: pixel11 <dbl>,
## #   pixel12 <dbl>, pixel13 <dbl>, pixel14 <dbl>, pixel15 <dbl>, pixel16 <dbl>,
## #   pixel17 <dbl>, pixel18 <dbl>, pixel19 <dbl>, pixel20 <dbl>, pixel21 <dbl>,
## #   pixel22 <dbl>, pixel23 <dbl>, pixel24 <dbl>, pixel25 <dbl>, pixel26 <dbl>,
## #   pixel27 <dbl>, pixel28 <dbl>, pixel29 <dbl>, pixel30 <dbl>, pixel31 <dbl>,
## #   pixel32 <dbl>, pixel33 <dbl>, pixel34 <dbl>, pixel35 <dbl>, pixel36 <dbl>,
## #   pixel37 <dbl>, pixel38 <dbl>, pixel39 <dbl>, pixel40 <dbl>, ...
traindigds <- read_csv("traindig.csv")
## Rows: 42000 Columns: 785
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (785): label, pixel0, pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pi...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
traindigds
## # A tibble: 42,000 x 785
##    label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1      0      0      0      0      0      0      0      0      0      0
##  2     0      0      0      0      0      0      0      0      0      0      0
##  3     1      0      0      0      0      0      0      0      0      0      0
##  4     4      0      0      0      0      0      0      0      0      0      0
##  5     0      0      0      0      0      0      0      0      0      0      0
##  6     0      0      0      0      0      0      0      0      0      0      0
##  7     7      0      0      0      0      0      0      0      0      0      0
##  8     3      0      0      0      0      0      0      0      0      0      0
##  9     5      0      0      0      0      0      0      0      0      0      0
## 10     3      0      0      0      0      0      0      0      0      0      0
## # ... with 41,990 more rows, and 774 more variables: pixel10 <dbl>,
## #   pixel11 <dbl>, pixel12 <dbl>, pixel13 <dbl>, pixel14 <dbl>, pixel15 <dbl>,
## #   pixel16 <dbl>, pixel17 <dbl>, pixel18 <dbl>, pixel19 <dbl>, pixel20 <dbl>,
## #   pixel21 <dbl>, pixel22 <dbl>, pixel23 <dbl>, pixel24 <dbl>, pixel25 <dbl>,
## #   pixel26 <dbl>, pixel27 <dbl>, pixel28 <dbl>, pixel29 <dbl>, pixel30 <dbl>,
## #   pixel31 <dbl>, pixel32 <dbl>, pixel33 <dbl>, pixel34 <dbl>, pixel35 <dbl>,
## #   pixel36 <dbl>, pixel37 <dbl>, pixel38 <dbl>, pixel39 <dbl>, ...
plot_img <- function(row)
{
  # reshape a 784-pixel row into a 28 x 28 image and draw it
  img <- matrix(unlist(row), nrow = 28, ncol = 28)
  image(img, useRaster = TRUE, axes = FALSE)
}

for (i in 1:10) {
  # drop the label column so exactly 784 pixel values remain
  plot_img(traindigds[i, -1])
}

Go ahead and divide all pixels by 255 to produce values between 0 and 1.

trainingds_pix <- traindigds[,2:ncol(traindigds)]/255
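
A quick, optional check that the rescaling behaved as expected:

range(as.matrix(trainingds_pix))  # expect 0 and 1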

Question 2

What is the frequency distribution of the numbers in the dataset?

library(keras)  # array_reshape() comes from keras (via reticulate)
trainingds_test <- array_reshape(as.matrix(trainingds_pix), c(nrow(trainingds_pix), 28, 28, 1))
hist(trainingds_test)

hist(trainingds_test[trainingds_test > 0])
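
The histograms above summarize pixel intensities; if the frequency of the digit labels themselves is wanted, a one-line sketch:

barplot(table(traindigds$label),
        xlab = "digit", ylab = "count",
        main = "Frequency of each digit in the training set")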

Question 3

For each number, provide the mean pixel intensity. What does this tell you?

The mean intensities of the edge pixels are 0 and increase toward the middle of the image. This tells us that the digits are drawn near the center of the frame.

colMeans(traindigds)[1:60]
##        label       pixel0       pixel1       pixel2       pixel3       pixel4 
## 4.4566428571 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 
##       pixel5       pixel6       pixel7       pixel8       pixel9      pixel10 
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 
##      pixel11      pixel12      pixel13      pixel14      pixel15      pixel16 
## 0.0000000000 0.0030000000 0.0111904762 0.0051428571 0.0002142857 0.0000000000 
##      pixel17      pixel18      pixel19      pixel20      pixel21      pixel22 
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 
##      pixel23      pixel24      pixel25      pixel26      pixel27      pixel28 
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 
##      pixel29      pixel30      pixel31      pixel32      pixel33      pixel34 
## 0.0000000000 0.0000000000 0.0000000000 0.0003809524 0.0013095238 0.0105476190 
##      pixel35      pixel36      pixel37      pixel38      pixel39      pixel40 
## 0.0272619048 0.0509047619 0.0664047619 0.1295714286 0.1741190476 0.1913095238 
##      pixel41      pixel42      pixel43      pixel44      pixel45      pixel46 
## 0.1905952381 0.1960476190 0.1713571429 0.1644761905 0.1517142857 0.1053095238 
##      pixel47      pixel48      pixel49      pixel50      pixel51      pixel52 
## 0.0607857143 0.0450714286 0.0154047619 0.0105238095 0.0050476190 0.0000000000 
##      pixel53      pixel54      pixel55      pixel56      pixel57      pixel58 
## 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0000000000 0.0015238095
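
The table above gives the mean per pixel; if "for each number" is read as the mean intensity per digit class instead, a short dplyr sketch (assuming the raw 0-255 pixels, rescaled inline):

library(dplyr)
traindigds %>%
  mutate(mean_intensity = rowMeans(across(starts_with("pixel"))) / 255) %>%
  group_by(label) %>%
  summarise(mean_intensity = mean(mean_intensity))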

Question 4

Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why?

# Drop the label column so only the 784 pixel columns enter the PCA
traindsx <- traindigds[, -1]
# Scale the pixels
traindsxreduced <- traindsx / 255
traindsxcov <- cov(traindsxreduced)

# Reduce to principal components (PCA on the pixel covariance matrix)
pcatraindsx <- prcomp(traindsxcov)

(cumsum(pcatraindsx$sdev^2) / sum(pcatraindsx$sdev^2))[1:50]
##  [1] 0.2533214 0.4211306 0.5443493 0.6382875 0.7099679 0.7691895 0.8028369
##  [8] 0.8302285 0.8531289 0.8711703 0.8853394 0.8986566 0.9080934 0.9174708
## [15] 0.9256099 0.9328111 0.9384893 0.9438301 0.9484009 0.9527328 0.9564977
## [22] 0.9598726 0.9629213 0.9656491 0.9682131 0.9705047 0.9726596 0.9746378
## [29] 0.9764288 0.9779685 0.9793811 0.9807173 0.9818937 0.9830211 0.9840639
## [36] 0.9850215 0.9858701 0.9866476 0.9873883 0.9880994 0.9887699 0.9894182
## [43] 0.9899904 0.9905079 0.9909907 0.9914508 0.9918774 0.9922720 0.9926428
## [50] 0.9929782
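
To answer the counting question programmatically, a small check against the cumulative-variance vector (the value 20 is read off the printed output above):

cumvar <- cumsum(pcatraindsx$sdev^2) / sum(pcatraindsx$sdev^2)
which(cumvar >= 0.95)[1]  # first component to reach 95%; 20 per the output above

At most 784 components are possible, one per pixel dimension, since PCA cannot produce more components than the data has features.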

Question 5

Plot the first 10 images generated by PCA.

The components appear to be mostly noise: because each one captures variation shared across many different images, it looks like a blob rather than a digit.

for (i in 1:10) {
  # each column of the rotation matrix holds one 784-pixel principal component
  img <- matrix(pcatraindsx$rotation[, i], nrow = 28, ncol = 28)
  image(img, useRaster = TRUE, axes = FALSE)
}

Question 6

Now, select only those images that have labels that are 8’s.

Re-run PCA that accounts for all of the variance (100%).

# filter the training data down to 8’s 
library(dplyr)

traindigds_8 <- traindigds %>% filter(label == 8)
traindigds_8 <- dplyr::select(traindigds_8,c(2:785))
dim(traindigds_8)
## [1] 4063  784
knitr::opts_chunk$set(cache = TRUE)
X_8 <- traindigds_8
# Scale the pixels
X_8reduced <- X_8 / 255
X_8cov <- cov(X_8reduced)

# Reduce to principal components (same covariance-based approach as Question 4)
pcaX_8 <- prcomp(X_8cov)
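
The same cumulative-variance check from Question 4 can be repeated for the 8's (the resulting count is not shown in the original output, so no value is claimed):

cumvar_8 <- cumsum(pcaX_8$sdev^2) / sum(pcaX_8$sdev^2)
which(cumvar_8 >= 0.95)[1]  # components needed for 95% of the variance among the 8's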

Plotting 8’s

for (i in 1:10) {
  # each column of the rotation matrix holds one 784-pixel principal component
  img <- matrix(pcaX_8$rotation[, i], nrow = 28, ncol = 28)
  image(img, useRaster = TRUE, axes = FALSE)
}

All of the leading components now resemble an 8.

Question 7

Multinomial

An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be a 10 x 10, and correct classifications will be on the diagonal.

# Divide the training dataset into trainds_1 (20%, used for fitting) and trainds_2 (80%, held out)
trainds_size <- floor(.2 * nrow(traindigds))
set.seed(2200)
trainds_index  <- sample(seq_len(nrow(traindigds)), size = trainds_size )

trainds_1  <- traindigds[trainds_index,]
trainds_2 <-  traindigds[-trainds_index,]
trainds1_x <- trainds_1[,names(trainds_1) != "label"]
trainds1_x <- lapply(trainds1_x, as.numeric)
trainds1_y <- as.factor(trainds_1$label)
trainds2_x <- trainds_2[,names(trainds_2) != "label"]
trainds2_x <- lapply(trainds2_x, as.numeric)
trainds2_y <- as.factor(trainds_2$label)
knitr::opts_chunk$set(cache = TRUE)
library(nnet)
trainds_mn_model <- multinom(trainds1_y ~ ., trainds1_x, MaxNWts = 100000)
## # weights:  7860 (7065 variable)
## initial  value 19341.714781 
## iter  10 value 4865.900733
## iter  20 value 3912.737730
## iter  30 value 3581.493792
## iter  40 value 3453.794647
## iter  50 value 3395.592725
## iter  60 value 3361.886757
## iter  70 value 3324.984639
## iter  80 value 3292.613879
## iter  90 value 3267.937840
## iter 100 value 3248.331137
## final  value 3248.331137 
## stopped after 100 iterations
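
Note that the optimizer stopped at the default 100 iterations before declaring convergence. A possible variant, not run here, raises the iteration cap (multinom passes maxit through to nnet) at the cost of longer runtime:

# hypothetical variant: allow up to 300 iterations
trainds_mn_model_300 <- multinom(trainds1_y ~ ., trainds1_x,
                                 MaxNWts = 100000, maxit = 300)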
library("caret")
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following objects are masked from 'package:openintro':
## 
##     ethanol, lsegments
## 
## Attaching package: 'caret'
## The following object is masked from 'package:openintro':
## 
##     dotPlot
## The following object is masked from 'package:purrr':
## 
##     lift
trainds_prediction <- predict(trainds_mn_model, newdata = trainds2_x)
confusionMatrix(trainds_prediction, trainds2_y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 3061    2   64   41   13   55   70   42   84   48
##          1    2 3555   57   30   45   22   40   79  162   33
##          2   24   29 2785  130   47   16  106   78   65   19
##          3   21    9   60 2687   13  106    7   51   98   45
##          4    8    2   79   13 2794   31   34   58   16  159
##          5   99   73   35  359  101 2590  114   56  398  152
##          6   46    6   74   19   39   42 2886   18   25    7
##          7    8   13   64   47   41   16    9 2759   15   82
##          8   34   43  108   80   42  110   24   35 2281   19
##          9    7    5   37   69  154   39   14  355   62 2794
## 
## Overall Statistics
##                                          
##                Accuracy : 0.839          
##                  95% CI : (0.8351, 0.843)
##     No Information Rate : 0.1112         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8212         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.92477   0.9513  0.82813  0.77324  0.84950  0.85563
## Specificity           0.98617   0.9843  0.98300  0.98639  0.98680  0.95463
## Pos Pred Value        0.87960   0.8832  0.84420  0.86761  0.87477  0.65124
## Neg Pred Value        0.99173   0.9938  0.98092  0.97417  0.98372  0.98525
## Prevalence            0.09851   0.1112  0.10009  0.10342  0.09789  0.09009
## Detection Rate        0.09110   0.1058  0.08289  0.07997  0.08315  0.07708
## Detection Prevalence  0.10357   0.1198  0.09818  0.09217  0.09506  0.11836
## Balanced Accuracy     0.95547   0.9678  0.90557  0.87981  0.91815  0.90513
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.87349  0.78137  0.71148  0.83204
## Specificity           0.99089  0.99019  0.98371  0.97546
## Pos Pred Value        0.91271  0.90341  0.82169  0.79016
## Neg Pred Value        0.98627  0.97473  0.96999  0.98124
## Prevalence            0.09833  0.10509  0.09542  0.09994
## Detection Rate        0.08589  0.08211  0.06789  0.08315
## Detection Prevalence  0.09411  0.09089  0.08262  0.10524
## Balanced Accuracy     0.93219  0.88578  0.84760  0.90375
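
As a cross-check on the headline figure, the accuracy can be recomputed directly from the predictions:

mean(trainds_prediction == trainds2_y)  # ~0.839, matching the Accuracy line above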