library(tidyverse)
library(keras)
library(nnet)
library(caret)
Go to Kaggle.com and create an account if you do not already have one. It is free.
Go to https://www.kaggle.com/c/digit-recognizer/overview, accept the rules of the competition, and download the data. You will not be required to submit work to Kaggle, but you do need the data.
"MNIST ('Modified National Institute of Standards and Technology') is the de facto 'hello world' dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike." (from the Kaggle competition description)
# define url for data
urlTrain <- "https://raw.githubusercontent.com/esteban-data-enthusiast/data605/main/shared-data/digit-recognizer/train.csv"
# load the training data
rawTrain <- read.csv(urlTrain, stringsAsFactors = TRUE)
#urlTest <- "https://raw.githubusercontent.com/esteban-data-enthusiast/data605/main/shared-data/digit-recognizer/test.csv"
# load the test data
#rawTest <- read.csv(urlTest, stringsAsFactors = TRUE)
Using the train.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.) (5 points)
Rescale the pixels in both datasets
# rescale the pixels
# divide all pixels by 255 to produce values between 0 and 1
rawTrain[-1] <- rawTrain[-1] / 255
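As a quick sanity check (a sketch, not part of the original assignment): MNIST pixels originally span 0 to 255, so dividing by 255 is exactly the min-max formula (x - min) / (max - min). The rescaled values should therefore lie in [0, 1].
# verify the rescaled pixel values lie in [0, 1]
range(as.matrix(rawTrain[-1]))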
# rescale the pixels
# divide all pixels by 255 to produce values between 0 and 1
#rawTest[-1] <- rawTest[-1] / 255
Plot the representations of the first 10 images from the training dataset
imgPlotter <- function(row) {
  # reshape one row of 784 pixel values into a 28 x 28 matrix and plot it
  img <- matrix(as.numeric(unlist(row)), nrow = 28, ncol = 28)
  image(img, useRaster = TRUE, axes = FALSE)
}
# plot the first 10 images from the training set (the label column is excluded,
# since only the 784 pixel columns form the image)
for (i in 1:10) {
  imgPlotter(rawTrain[i, -1])
}
What is the frequency distribution of the numbers in the dataset? (5 points)
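If "the numbers" is read as the digit labels, a one-line sketch gives their class frequencies directly (the labels sit in the first column of rawTrain):
# count how many images there are of each digit 0-9
table(rawTrain$label)
barplot(table(rawTrain$label), main = "Digit label frequencies")
The walkthrough below instead examines the distribution of the pixel values themselves.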
# exclude the "label" column and reshape the data
x_train <-
keras::array_reshape(as.matrix(rawTrain[-1]), c(nrow(rawTrain), 28, 28, 1))
dim(x_train)
#> [1] 42000 28 28 1
Generate the frequency distribution of the numbers in the data set
hist(x_train)
From the histogram we can see that most of the pixel values in the data set are zero, while the rest lie between 0 and 1 because we rescaled them (divided by 255). The high frequency of zeros is due to the background pixels surrounding each digit.
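A quick check (a sketch) quantifies the share of background pixels:
# fraction of pixel values that are exactly zero
mean(x_train == 0)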
Let’s represent the non-zero elements in the dataset.
hist(x_train[x_train > 0])
After excluding the zero values, we can see that the value 1 has the highest frequency, making it the mode of the non-zero pixel distribution.
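To verify this numerically (a sketch binning the non-zero values into tenths):
nz <- x_train[x_train > 0]
bins <- table(cut(nz, breaks = seq(0, 1, by = 0.1)))
bins[which.max(bins)]  # the most populated bin, expected to be (0.9, 1]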
For each number, provide the mean pixel intensity. What does this tell you? (5 points)
hist(colMeans(x_train))
From the histogram above, we can see that the per-pixel means are mostly close to 0, which again reflects the uniform background shared by the digit images in every row of data.
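Note that colMeans(x_train) averages each pixel position across all images. To answer the question literally, with one mean intensity per digit, a short base-R sketch (assuming rawTrain still holds the rescaled pixels and the original labels):
# average the per-image mean intensities within each digit class
round(tapply(rowMeans(rawTrain[-1]), rawTrain$label, mean), 4)
Digits that use more ink, such as 0 and 8, should show a higher mean intensity than thin digits like 1.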
Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why? (5 points)
Get the covariance of the training data
x_train_cov <- cov(rawTrain[-1])
Reduce to principal components
pca_x_train <- prcomp(x_train_cov)
Looking at the summary of the PCA object, we can see that by default 784 components have been generated, which together account for 100% of the variance. 784 components are possible because the number of principal components is bounded by the number of input variables, and each image has 28 x 28 = 784 pixel columns.
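To count how many components are needed for 95% of the variance, we can read the cumulative variance off the prcomp object (a sketch; note that prcomp() is usually applied to the data matrix itself, so applying it to a covariance matrix as above is unconventional, but the counting logic is the same):
# cumulative proportion of variance explained by the components
cum_var <- cumsum(pca_x_train$sdev^2) / sum(pca_x_train$sdev^2)
which(cum_var >= 0.95)[1]  # number of components needed for 95%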
Plot the first 10 images generated by PCA. They will appear to be noise. Why? (5 points)
# plot the first 10 principal components as 28 x 28 images
for (i in 1:10) {
  img <- matrix(pca_x_train$rotation[, i], nrow = 28, ncol = 28)
  image(img, useRaster = TRUE, axes = FALSE)
}
The images look blurry, with significant noise. This is because each principal component is a direction of maximal variance computed across all ten digit classes mixed together, so no single component resembles any one digit.
Now, select only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance (100%). Plot the first 10 images. What do you see? (5 points)
x_train_8 <- rawTrain %>%
  filter(label == 8) %>%
  select(2:785)
Re-run PCA that accounts for all of the variance (100%)
# the pixels were already rescaled to [0, 1] above, so no further division by 255 is needed
x_cov_8 <- cov(x_train_8)
Reduce to PCA components
pca_x_train_8 <- prcomp(x_cov_8)
Plot first 10 images
# plot the first 10 principal components of the 8s
for (i in 1:10) {
  img <- matrix(pca_x_train_8$rotation[, i], nrow = 28, ncol = 28)
  image(img, useRaster = TRUE, axes = FALSE)
}
These components do resemble the digit 8, although they still look blurry and noisy. Restricting PCA to a single class means the variance directions now describe variations of one shape rather than a mixture of all ten digits.
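For comparison, the pixel-wise mean of all the 8s (a sketch, reusing x_train_8 from above) shows the "average 8" directly:
# average image of the digit 8
image(matrix(colMeans(x_train_8), nrow = 28, ncol = 28), useRaster = TRUE, axes = FALSE)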
An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix, because the digit labels are nominal categories: treating them as numeric imposes an ordering and distances between digits that have no meaning for classification. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be 10 x 10, and correct classifications will be on the diagonal. (10 points)
Before fitting a model to the training set, convert the variable "label" to a factor
rawTrain$label <- as.factor(rawTrain$label)
X <- rawTrain[2:785]
X$label <- rawTrain$label
Fit a multinomial model to the training dataset
multinom_mod <- nnet::multinom(label ~ ., data = X, MaxNWts = 1000000)
#> # weights: 7860 (7065 variable)
#> initial value 96708.573906
#> iter 10 value 25322.714106
#> iter 20 value 20402.086316
#> iter 30 value 19312.872829
#> iter 40 value 18703.256586
#> iter 50 value 18197.815143
#> iter 60 value 17732.985798
#> iter 70 value 16739.962157
#> iter 80 value 14961.658448
#> iter 90 value 13446.085942
#> iter 100 value 12442.636014
#> final value 12442.636014
#> stopped after 100 iterations
Use the fitted model to predict the digit labels on the training dataset
pred <- predict(multinom_mod, rawTrain[2:785])
# store the predicted values in a data frame
pred <- data.frame(pred = pred)
Add two columns to our prediction data frame to compare the actual values against the predicted ones
pred$actual <- rawTrain$label
# assign a 1 if the prediction matched the actual, else a 0
pred$equal <- ifelse(pred$pred == pred$actual, 1, 0)
Compute the classification accuracy (percent correctly identified)
round(sum(pred$equal) / nrow(pred) * 100, 4)
#> [1] 92.4548
The classification accuracy is 92.45%, which is reasonable for a multinomial logistic model fit on raw pixel values.
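Note that this accuracy is measured on the same data the model was fit on, so it is optimistic. A fairer estimate would come from a held-out split; a minimal sketch using caret's createDataPartition (the 80/20 split and the seed are arbitrary choices, not part of the original assignment):
set.seed(605)  # arbitrary seed for reproducibility
idx <- caret::createDataPartition(X$label, p = 0.8, list = FALSE)
holdout_mod <- nnet::multinom(label ~ ., data = X[idx, ], MaxNWts = 1000000)
mean(predict(holdout_mod, X[-idx, ]) == X$label[-idx])  # held-out accuracy
The confusion matrix below is still computed on the full training set, as the assignment requests.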
Generate the confusion matrix
confusionMatrix(pred$pred, pred$actual)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1 2 3 4 5 6 7 8 9
#> 0 3994 0 19 11 4 35 17 5 19 12
#> 1 3 4588 59 32 20 39 28 37 134 18
#> 2 11 13 3753 88 17 19 17 37 21 10
#> 3 8 12 65 3879 9 91 1 9 91 53
#> 4 13 6 60 10 3852 55 35 44 38 136
#> 5 33 12 20 162 5 3386 45 10 132 33
#> 6 35 3 38 15 22 52 3973 2 19 3
#> 7 7 8 54 35 7 28 2 4076 18 87
#> 8 20 32 85 78 22 51 17 4 3519 25
#> 9 8 10 24 41 114 39 2 177 72 3811
#>
#> Overall Statistics
#>
#> Accuracy : 0.9245
#> 95% CI : (0.922, 0.9271)
#> No Information Rate : 0.1115
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9161
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Statistics by Class:
#>
#> Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
#> Sensitivity 0.96660 0.9795 0.89849 0.89152 0.94597 0.89223
#> Specificity 0.99678 0.9901 0.99384 0.99100 0.98953 0.98817
#> Pos Pred Value 0.97036 0.9254 0.94155 0.91963 0.90657 0.88223
#> Neg Pred Value 0.99636 0.9974 0.98885 0.98751 0.99417 0.98928
#> Prevalence 0.09838 0.1115 0.09945 0.10360 0.09695 0.09036
#> Detection Rate 0.09510 0.1092 0.08936 0.09236 0.09171 0.08062
#> Detection Prevalence 0.09800 0.1180 0.09490 0.10043 0.10117 0.09138
#> Balanced Accuracy 0.98169 0.9848 0.94617 0.94126 0.96775 0.94020
#> Class: 6 Class: 7 Class: 8 Class: 9
#> Sensitivity 0.9604 0.92615 0.86611 0.90998
#> Specificity 0.9950 0.99346 0.99120 0.98712
#> Pos Pred Value 0.9546 0.94308 0.91331 0.88669
#> Neg Pred Value 0.9957 0.99137 0.98574 0.99000
#> Prevalence 0.0985 0.10479 0.09674 0.09971
#> Detection Rate 0.0946 0.09705 0.08379 0.09074
#> Detection Prevalence 0.0991 0.10290 0.09174 0.10233
#> Balanced Accuracy 0.9777 0.95981 0.92865 0.94855