Final Project (2/3) - DIGIT RECOGNIZER

Kaggle data set to learn computer vision fundamentals with the famous MNIST data. The goal is to take an image of a handwritten single digit, and determine what that digit is. For every in the test set, you should predict the correct label.

Required Libraries

library(tidyverse)
library(kableExtra)
library(webshot)
library(plotly)
library(caret)
library(devtools)
library(factoextra)
library(knitr)
library(nnet)
library(reactable)
options(warn=-1)

1. Load the datasets

The data files, Kaggle, train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

Dimensions of dataset: glimpse of how many instances (rows) and how attributes (columns) the data contains with the head function.

#library(reactable)
reactable(head(dr_train), highlight = TRUE, striped = TRUE, resizable = TRUE, bordered = TRUE)
## Training set has 42000 rows and 785 columns
## Test set has 28000 rows and 784 columns

2. Plot Representation of the first 10 images

Step 2a. We can quickly plot the pixel color values to obtain a picture of the digit.

# Create a 28*28 matrix with pixel color values
dr_m = matrix(unlist(dr_train[10,-1]),nrow = 28,byrow = T)
# Plot that matrix
image(dr_m,col=grey.colors(255))

Image of Digits This image needs to be rotated to the right. I will rotate the matrix and plot a bunch of images.

rotate <- function(x) t(apply(x, 2, rev)) # reverses (rotates the matrix)

# Plot a bunch of images
par(mfrow=c(2,5))
lapply(1:10, 
    function(x) image(
                    rotate(matrix(unlist(dr_train[x,-1]),nrow = 28,byrow = T)),
                    col=grey.colors(255),
                    xlab=dr_train[x,1]
                )
)

## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
## NULL
## 
## [[10]]
## NULL
par(mfrow=c(1,1)) # set plot options back to default

Step 2b. Normalization: Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.)

drtrain <- dr_train[, 2:ncol(dr_train)] / 255
drtrain$label <- dr_train$label
drtrain <- drtrain %>% dplyr::select(label, everything())

#drtest <- lapply(dr_test[, 1:ncol(dr_test)], function(x) x/255) %>%
    #dplyr::bind_cols() %>%
    #as.data.frame()

3. Frequency Distribution

We will look at the frequency of observations, showing the number of times the observation occurs in the data with the webshot package.

#convert digit label to factor for classification 
label_f <- as.factor(drtrain$label)
label_lv <- levels(label_f)
count_label <- summary(label_f)
table(drtrain$label)
## 
##    0    1    2    3    4    5    6    7    8    9 
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188
#plot of drtrain (train dataset) labels
x = label_lv
y = count_label

fig1 = plot_ly(
    x = x,
    y = y,
    name = 'Label dist.',
    text = paste(count_label),
    textposition = 'auto',
    type = 'bar',
    marker = list(color = ~x)    
)
fig1

4. Pixel Intensity (Quantifiable Data)

Pixel intensity value is the primary information stored within pixels, it is the most popular and important feature used for classification. The intensity value for each pixel is a single value for a gray-level, or three values for a color image.

For a grayscale images, the pixel value is a single number that represents the brightness of the pixel. The most common pixel format is the byte image, where this number is stored as an 8-bit integer giving a range of possible values from 0 to 255. Typically zero is taken to be black, and 255 is taken to be white. Values in between make up the different shades of gray.

pixel_intensity = drtrain %>%
    rowwise() %>%
    dplyr::mutate(pixel_intensity = sum(c_across(starts_with("pixel")))) %>%
    ungroup()

mean_pixel_intensity = pixel_intensity %>%
    group_by(label) %>%
    summarise(mean_pixel_intensity = mean(pixel_intensity)) %>%
    ungroup()

mean_pixel_intensity
## # A tibble: 10 x 2
##    label mean_pixel_intensity
##    <dbl>                <dbl>
##  1     0                136. 
##  2     1                 59.6
##  3     2                117. 
##  4     3                111. 
##  5     4                 95.0
##  6     5                101. 
##  7     6                109. 
##  8     7                 89.9
##  9     8                118. 
## 10     9                 96.3

The table shows each label mean pixel intensity, indicating their level of brightness. The highest pixel intensity, Label 0 - 135.81336 has a gray shade closer to white, and Label 1 - 59.56261 has the lowest grade shade to black. Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity.

5. Principal Components Analysis

Reduce or extract the important information from the data and express this information as a set of summary indices called principal components. Why? PCA will simply reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set, while preserving as much information as possible.

6.Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible?

# get the covariance matrix of the predictors
trainCov <- cov(drtrain %>% dplyr::select(-label))

# remove zero or near zero variance features (caret package)
nzv = nearZeroVar(trainCov, saveMetrics=T) 
trainCov = trainCov[,!nzv$zeroVar]

The PCA has generated components (100% of the variance) and the number of possible components for the dataset.

# conduct pca
pc_train <- prcomp(trainCov)
print(paste0("Columns with zero variance: ", sum(nzv$zeroVar)))
## [1] "Columns with zero variance: 76"
print(paste('PCs generated:', ncol(pc_train$x)))
## [1] "PCs generated: 708"
# Creating a datatable to store and plot the
# No of Principal Components vs Cumulative Variance Explained
vexplained <- as.data.frame(pc_train$sdev^2/sum(pc_train$sdev^2))
vexplained <- cbind(c(1:708),vexplained,cumsum(vexplained[,1]))
colnames(vexplained) <- c("No_of_Principal_Components","Individual_Variance_Explained","Cumulative_Variance_Explained")
library(graphics)
#Plotting the curve using the datatable obtained
plot(vexplained$No_of_Principal_Components,vexplained$Cumulative_Variance_Explained, xlim = c(0,100),type='b',pch=16,xlab = "Principal Componets",ylab = "Cumulative Variance Explained",main = "Variance by # of PCA's. 60 PC's capture 99.5 % variance")
#number of PC's components
abline(v=c(60), col = "blue", lty =c(1,2), lwd=c(1,2))
#percentage of variance for number of PC's components
abline(h=c(0.9955116), col = "red", lty=c(1,2), lwd=c(1,3))

#Datatable to store the summary of the datatable obtained
vexplainedsummary <- vexplained[seq(0,100,10),]
reactable(vexplainedsummary, striped = TRUE)

Scree Plot: This plot shows the percentage of variance explained by each of the first 10 principal components.

ei_p1 = fviz_eig(pc_train, ncp=10, addlabels=T, main='Non-Scaled version of Scree Plot')

plot(ei_p1)

6. Visualizing the first 10 images generated by PCA.

#par(mfrow=c(2,5))
plot_array <- function(array){
  arrayMat <- matrix(array, nrow=28, ncol=28)
  mode(arrayMat) = "numeric"
  image(arrayMat)
}
par(mfrow=c(2,5))

X_loading <- pc_train$rotation

for (i in 1:10) {
  plot_array(X_loading[,i])
}

7. PCA on Images that are 8’s

Selecting only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance(100%). Plot the first 10 images. What do you see? The re-run of the PCA reduced the number of variables of the selected data set. It naturally reduced accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity.

# get the covariance matrix of the predictors
trainCov8 <- cov(drtrain %>% dplyr::filter(label == 8) %>% dplyr::select(-label))

# conduct pca
trainPC8 <- prcomp(trainCov8)
par(mfrow = c (2, 5))
for(i in 1:10){
    image(1:28, 1:28, array(trainPC8$x[,i], dim = c(28,28)))
}

8. Modeling

An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be a 10 x 10, and correct classifications will be on the diagonal.

8a. Create a new dataset for predictive modelling.

train2 <- dr_train
test2 <- dr_test

View the dataset (train and test) dimensions.

dim(train2)
## [1] 42000   785
dim(test2)
## [1] 28000   784
#Separate the labels and normalize the dataset.
#labels = train2[,1]
#data <- train2[,-1]/255

8b. Multinomial Logistic Regression

#model <- multinom(train_labels~., train_df, MaxNWts = 100000)

Separate the labels and normalize the dataset.

train2$label <- as.factor(train2$label)
labels2 <- train2[,1]
data2 <- train2[,-1]/255
data2["label"] <- c(labels2)

Split data into train and test samples.

# Using sample_frac to create 70 - 30 split into test and train
train2s <- sample_frac(data2, 0.7)
sample_id2 <- as.numeric(rownames(train2s)) # rownames() returns character so as.numeric
test2s <- data2[-sample_id2,]

Train the data

# Training the multinomial model
multinom.fit <- multinom(label ~ ., data = train2s, MaxNWts = 100000)
## # weights:  7860 (7065 variable)
## initial  value 67696.001734 
## iter  10 value 17115.553188
## iter  20 value 11971.848813
## iter  30 value 9878.448773
## iter  40 value 8382.229390
## iter  50 value 7297.676602
## iter  60 value 6584.411971
## iter  70 value 6236.917336
## iter  80 value 6019.442192
## iter  90 value 5657.517533
## iter 100 value 5296.370318
## final  value 5296.370318 
## stopped after 100 iterations

Prediction and Model Accuracy on the train dataset

# Predicting the values for train dataset
train2s$precticed <- predict(multinom.fit, newdata = train2s, "class")

# Building classification table
ctable <- table(train2s$label, train2s$precticed)

# Calculating accuracy - sum of diagonal elements divided by total obs
paste0("The model prediction accuracy on training dataset: ", round((sum(diag(ctable))/sum(ctable))*100,2), " %")
## [1] "The model prediction accuracy on training dataset: 94.77 %"
# Predicting the values for train dataset
test2s$precticed <- predict(multinom.fit, newdata = test2s, "class")
 
# Building classification table
dtable <- table(test2s$label, test2s$precticed)
 
# Calculating accuracy - sum of diagonal elements divided by total obs
paste0("The model prediction accuracy on testing dataset: ", round((sum(diag(dtable))/sum(dtable))*100,2), " %")
## [1] "The model prediction accuracy on testing dataset: 93.6 %"

Confusion Matrix

confusionMatrix(test2s$precticed,test2s$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 1227    1    6    4    3   14    6    1    5    4
##          1    0 1383   13    8    5    3    4    2   24    2
##          2    7    2 1118   34    9   14    7   10    7    2
##          3    2    5   17 1202    2   29    0    4   28   19
##          4    3    1   12    3 1118   17    9    9    7   28
##          5    7    3    5   30    2 1009   12    1   27    6
##          6    8    1   14    3    7   13 1184    0    6    0
##          7    3    2    9   10    4    3    1 1271    4   37
##          8    5    9   30   13    7   21    4    2 1116    8
##          9    1    2    2   10   27   12    0   32   11 1166
## 
## Overall Statistics
##                                           
##                Accuracy : 0.936           
##                  95% CI : (0.9316, 0.9402)
##     No Information Rate : 0.1118          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9289          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.97150   0.9815  0.91191   0.9127  0.94426  0.88899
## Specificity           0.99612   0.9945  0.99191   0.9906  0.99220  0.99189
## Pos Pred Value        0.96538   0.9578  0.92397   0.9190  0.92626  0.91561
## Neg Pred Value        0.99682   0.9977  0.99052   0.9898  0.99421  0.98904
## Prevalence            0.10024   0.1118  0.09730   0.1045  0.09397  0.09008
## Detection Rate        0.09738   0.1098  0.08873   0.0954  0.08873  0.08008
## Detection Prevalence  0.10087   0.1146  0.09603   0.1038  0.09579  0.08746
## Balanced Accuracy     0.98381   0.9880  0.95191   0.9516  0.96823  0.94044
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.96496   0.9542  0.90364  0.91667
## Specificity           0.99543   0.9935  0.99129  0.99144
## Pos Pred Value        0.95793   0.9457  0.91852  0.92320
## Neg Pred Value        0.99622   0.9946  0.98955  0.99065
## Prevalence            0.09738   0.1057  0.09802  0.10095
## Detection Rate        0.09397   0.1009  0.08857  0.09254
## Detection Prevalence  0.09810   0.1067  0.09643  0.10024
## Balanced Accuracy     0.98019   0.9739  0.94747  0.95405

Additional Final Reports

RPubs - Final (1/3): Playing With PageRank

RPubs - Final (3/3): House Prices - Advanced Regression Techniques Competition