Kaggle data set to learn computer vision fundamentals with the famous MNIST data. The goal is to take an image of a handwritten single digit, and determine what that digit is. For every in the test set, you should predict the correct label.
Required Libraries
library(tidyverse)
library(kableExtra)
library(webshot)
library(plotly)
library(caret)
library(devtools)
library(factoextra)
library(knitr)
library(nnet)
library(reactable)
options(warn=-1)The data files, Kaggle, train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.
Dimensions of dataset: glimpse of how many instances (rows) and how attributes (columns) the data contains with the head function.
#library(reactable)
reactable(head(dr_train), highlight = TRUE, striped = TRUE, resizable = TRUE, bordered = TRUE)## Training set has 42000 rows and 785 columns
## Test set has 28000 rows and 784 columns
Step 2a. We can quickly plot the pixel color values to obtain a picture of the digit.
# Create a 28*28 matrix with pixel color values
dr_m = matrix(unlist(dr_train[10,-1]),nrow = 28,byrow = T)
# Plot that matrix
image(dr_m,col=grey.colors(255))Image of Digits This image needs to be rotated to the right. I will rotate the matrix and plot a bunch of images.
rotate <- function(x) t(apply(x, 2, rev)) # reverses (rotates the matrix)
# Plot a bunch of images
par(mfrow=c(2,5))
lapply(1:10,
function(x) image(
rotate(matrix(unlist(dr_train[x,-1]),nrow = 28,byrow = T)),
col=grey.colors(255),
xlab=dr_train[x,1]
)
)## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
par(mfrow=c(1,1)) # set plot options back to defaultStep 2b. Normalization: Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.)
drtrain <- dr_train[, 2:ncol(dr_train)] / 255
drtrain$label <- dr_train$label
drtrain <- drtrain %>% dplyr::select(label, everything())
#drtest <- lapply(dr_test[, 1:ncol(dr_test)], function(x) x/255) %>%
#dplyr::bind_cols() %>%
#as.data.frame()We will look at the frequency of observations, showing the number of times the observation occurs in the data with the webshot package.
#convert digit label to factor for classification
label_f <- as.factor(drtrain$label)
label_lv <- levels(label_f)
count_label <- summary(label_f)table(drtrain$label)##
## 0 1 2 3 4 5 6 7 8 9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188
#plot of drtrain (train dataset) labels
x = label_lv
y = count_label
fig1 = plot_ly(
x = x,
y = y,
name = 'Label dist.',
text = paste(count_label),
textposition = 'auto',
type = 'bar',
marker = list(color = ~x)
)
fig1Pixel intensity value is the primary information stored within pixels, it is the most popular and important feature used for classification. The intensity value for each pixel is a single value for a gray-level, or three values for a color image.
For a grayscale images, the pixel value is a single number that represents the brightness of the pixel. The most common pixel format is the byte image, where this number is stored as an 8-bit integer giving a range of possible values from 0 to 255. Typically zero is taken to be black, and 255 is taken to be white. Values in between make up the different shades of gray.
pixel_intensity = drtrain %>%
rowwise() %>%
dplyr::mutate(pixel_intensity = sum(c_across(starts_with("pixel")))) %>%
ungroup()
mean_pixel_intensity = pixel_intensity %>%
group_by(label) %>%
summarise(mean_pixel_intensity = mean(pixel_intensity)) %>%
ungroup()
mean_pixel_intensity## # A tibble: 10 x 2
## label mean_pixel_intensity
## <dbl> <dbl>
## 1 0 136.
## 2 1 59.6
## 3 2 117.
## 4 3 111.
## 5 4 95.0
## 6 5 101.
## 7 6 109.
## 8 7 89.9
## 9 8 118.
## 10 9 96.3
The table shows each label mean pixel intensity, indicating their level of brightness. The highest pixel intensity, Label 0 - 135.81336 has a gray shade closer to white, and Label 1 - 59.56261 has the lowest grade shade to black. Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity.
Reduce or extract the important information from the data and express this information as a set of summary indices called principal components. Why? PCA will simply reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set, while preserving as much information as possible.
6.Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible?
# get the covariance matrix of the predictors
trainCov <- cov(drtrain %>% dplyr::select(-label))
# remove zero or near zero variance features (caret package)
nzv = nearZeroVar(trainCov, saveMetrics=T)
trainCov = trainCov[,!nzv$zeroVar]The PCA has generated components (100% of the variance) and the number of possible components for the dataset.
# conduct pca
pc_train <- prcomp(trainCov)print(paste0("Columns with zero variance: ", sum(nzv$zeroVar)))## [1] "Columns with zero variance: 76"
print(paste('PCs generated:', ncol(pc_train$x)))## [1] "PCs generated: 708"
# Creating a datatable to store and plot the
# No of Principal Components vs Cumulative Variance Explained
vexplained <- as.data.frame(pc_train$sdev^2/sum(pc_train$sdev^2))
vexplained <- cbind(c(1:708),vexplained,cumsum(vexplained[,1]))
colnames(vexplained) <- c("No_of_Principal_Components","Individual_Variance_Explained","Cumulative_Variance_Explained")library(graphics)
#Plotting the curve using the datatable obtained
plot(vexplained$No_of_Principal_Components,vexplained$Cumulative_Variance_Explained, xlim = c(0,100),type='b',pch=16,xlab = "Principal Componets",ylab = "Cumulative Variance Explained",main = "Variance by # of PCA's. 60 PC's capture 99.5 % variance")
#number of PC's components
abline(v=c(60), col = "blue", lty =c(1,2), lwd=c(1,2))
#percentage of variance for number of PC's components
abline(h=c(0.9955116), col = "red", lty=c(1,2), lwd=c(1,3))#Datatable to store the summary of the datatable obtained
vexplainedsummary <- vexplained[seq(0,100,10),]
reactable(vexplainedsummary, striped = TRUE)Scree Plot: This plot shows the percentage of variance explained by each of the first 10 principal components.
ei_p1 = fviz_eig(pc_train, ncp=10, addlabels=T, main='Non-Scaled version of Scree Plot')
plot(ei_p1)#par(mfrow=c(2,5))
plot_array <- function(array){
arrayMat <- matrix(array, nrow=28, ncol=28)
mode(arrayMat) = "numeric"
image(arrayMat)
}par(mfrow=c(2,5))
X_loading <- pc_train$rotation
for (i in 1:10) {
plot_array(X_loading[,i])
}Selecting only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance(100%). Plot the first 10 images. What do you see? The re-run of the PCA reduced the number of variables of the selected data set. It naturally reduced accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity.
# get the covariance matrix of the predictors
trainCov8 <- cov(drtrain %>% dplyr::filter(label == 8) %>% dplyr::select(-label))
# conduct pca
trainPC8 <- prcomp(trainCov8)par(mfrow = c (2, 5))
for(i in 1:10){
image(1:28, 1:28, array(trainPC8$x[,i], dim = c(28,28)))
}An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be a 10 x 10, and correct classifications will be on the diagonal.
8a. Create a new dataset for predictive modelling.
train2 <- dr_train
test2 <- dr_testView the dataset (train and test) dimensions.
dim(train2)## [1] 42000 785
dim(test2)## [1] 28000 784
#Separate the labels and normalize the dataset.
#labels = train2[,1]
#data <- train2[,-1]/2558b. Multinomial Logistic Regression
#model <- multinom(train_labels~., train_df, MaxNWts = 100000)Separate the labels and normalize the dataset.
train2$label <- as.factor(train2$label)
labels2 <- train2[,1]
data2 <- train2[,-1]/255
data2["label"] <- c(labels2)Split data into train and test samples.
# Using sample_frac to create 70 - 30 split into test and train
train2s <- sample_frac(data2, 0.7)
sample_id2 <- as.numeric(rownames(train2s)) # rownames() returns character so as.numeric
test2s <- data2[-sample_id2,]Train the data
# Training the multinomial model
multinom.fit <- multinom(label ~ ., data = train2s, MaxNWts = 100000)## # weights: 7860 (7065 variable)
## initial value 67696.001734
## iter 10 value 17115.553188
## iter 20 value 11971.848813
## iter 30 value 9878.448773
## iter 40 value 8382.229390
## iter 50 value 7297.676602
## iter 60 value 6584.411971
## iter 70 value 6236.917336
## iter 80 value 6019.442192
## iter 90 value 5657.517533
## iter 100 value 5296.370318
## final value 5296.370318
## stopped after 100 iterations
Prediction and Model Accuracy on the train dataset
# Predicting the values for train dataset
train2s$precticed <- predict(multinom.fit, newdata = train2s, "class")
# Building classification table
ctable <- table(train2s$label, train2s$precticed)
# Calculating accuracy - sum of diagonal elements divided by total obs
paste0("The model prediction accuracy on training dataset: ", round((sum(diag(ctable))/sum(ctable))*100,2), " %")## [1] "The model prediction accuracy on training dataset: 94.77 %"
# Predicting the values for train dataset
test2s$precticed <- predict(multinom.fit, newdata = test2s, "class")
# Building classification table
dtable <- table(test2s$label, test2s$precticed)
# Calculating accuracy - sum of diagonal elements divided by total obs
paste0("The model prediction accuracy on testing dataset: ", round((sum(diag(dtable))/sum(dtable))*100,2), " %")## [1] "The model prediction accuracy on testing dataset: 93.6 %"
Confusion Matrix
confusionMatrix(test2s$precticed,test2s$label)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 1227 1 6 4 3 14 6 1 5 4
## 1 0 1383 13 8 5 3 4 2 24 2
## 2 7 2 1118 34 9 14 7 10 7 2
## 3 2 5 17 1202 2 29 0 4 28 19
## 4 3 1 12 3 1118 17 9 9 7 28
## 5 7 3 5 30 2 1009 12 1 27 6
## 6 8 1 14 3 7 13 1184 0 6 0
## 7 3 2 9 10 4 3 1 1271 4 37
## 8 5 9 30 13 7 21 4 2 1116 8
## 9 1 2 2 10 27 12 0 32 11 1166
##
## Overall Statistics
##
## Accuracy : 0.936
## 95% CI : (0.9316, 0.9402)
## No Information Rate : 0.1118
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9289
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.97150 0.9815 0.91191 0.9127 0.94426 0.88899
## Specificity 0.99612 0.9945 0.99191 0.9906 0.99220 0.99189
## Pos Pred Value 0.96538 0.9578 0.92397 0.9190 0.92626 0.91561
## Neg Pred Value 0.99682 0.9977 0.99052 0.9898 0.99421 0.98904
## Prevalence 0.10024 0.1118 0.09730 0.1045 0.09397 0.09008
## Detection Rate 0.09738 0.1098 0.08873 0.0954 0.08873 0.08008
## Detection Prevalence 0.10087 0.1146 0.09603 0.1038 0.09579 0.08746
## Balanced Accuracy 0.98381 0.9880 0.95191 0.9516 0.96823 0.94044
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.96496 0.9542 0.90364 0.91667
## Specificity 0.99543 0.9935 0.99129 0.99144
## Pos Pred Value 0.95793 0.9457 0.91852 0.92320
## Neg Pred Value 0.99622 0.9946 0.98955 0.99065
## Prevalence 0.09738 0.1057 0.09802 0.10095
## Detection Rate 0.09397 0.1009 0.08857 0.09254
## Detection Prevalence 0.09810 0.1067 0.09643 0.10024
## Balanced Accuracy 0.98019 0.9739 0.94747 0.95405
Additional Final Reports
RPubs - Final (1/3): Playing With PageRank
RPubs - Final (3/3): House Prices - Advanced Regression Techniques Competition