Problem Description:

The problem at hand involves building predictive models to recognize handwritten digits from the MNIST dataset. Handwritten digit recognition is a classic problem in the field of computer vision and machine learning, with applications ranging from postal automation to digital document processing. The primary goal is to accurately classify images of handwritten digits into their corresponding numerical values (0-9).

Dataset Description and Data Preparation:

The MNIST dataset consists of 28x28 pixel grayscale images of handwritten digits along with their corresponding labels. Each image is represented as a matrix of pixel values, where each pixel’s intensity ranges from 0 to 255. The dataset is split into a training set and a test set.

To prepare the data for analysis, several steps were taken. First, the training dataset (train.csv) was loaded into R with the read.csv() function. The data were then checked for missing values to ensure integrity, since missing values can adversely affect model training. Exploratory Data Analysis (EDA) followed, using bar plots and image plots to examine the frequency distribution of each digit and to visualize individual handwritten samples. After EDA, the pixel values were normalized to the range 0 to 1 to speed model convergence during training and potentially improve performance. Principal Component Analysis (PCA) was then applied to reduce the dimensionality of the data while retaining most of its variance, reducing computational complexity. Finally, two classification algorithms were trained on the PCA-transformed features: logistic regression with the glmnet package and XGBoost with the xgboost package.

Methodologies Used:

  1. Exploratory Data Analysis (EDA): EDA was conducted to understand the distribution of handwritten digits in the dataset and visualize individual images. This helps in identifying patterns and gaining insights into the data.

  2. Principal Component Analysis (PCA): PCA was used for dimensionality reduction. By projecting the high-dimensional image data onto a lower-dimensional subspace, PCA captures the most significant variation in the data while reducing computational complexity (see the short sketch after this list).

  3. Model Training: Two classification algorithms were trained on the PCA-transformed features. Logistic Regression: the glmnet package was used to fit a multinomial logistic regression, a linear classification model that predicts the probability of a categorical outcome. XGBoost: the xgboost package was used to fit XGBoost, an ensemble method built on a gradient boosting framework and known for its high performance across a wide range of machine learning tasks.
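
As a concrete illustration of the projection described in item 2, here is a minimal sketch on synthetic data (the objects X, pca, Z, and Z_manual are hypothetical and for illustration only; the full MNIST pipeline appears in the Code section):

# Sketch: project 10-dimensional synthetic data onto its first 2 principal components
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)        # 100 observations, 10 features
pca <- prcomp(X, center = TRUE, scale. = TRUE)  # fit PCA
Z <- pca$x[, 1:2]                               # scores: data projected onto PC1 and PC2

# The same projection done by hand: center/scale, then multiply by the loadings
Z_manual <- scale(X, center = pca$center, scale = pca$scale) %*% pca$rotation[, 1:2]
all.equal(Z, Z_manual)                          # should return TRUE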

Output and Model Accuracy:

For the logistic regression model, the overall accuracy achieved was 84.51%, with a 95% confidence interval ranging from 83.72% to 85.28%. The model’s Kappa statistic was 0.8278, indicating a substantial level of agreement between the predicted and actual classifications. The detailed statistics by class revealed varying sensitivity and specificity across different digits, with balanced accuracy ranging from 85.91% to 96.36% for individual digits. This indicates that while the model performed reasonably well, there were variations in its effectiveness at correctly identifying specific digits.

For the XGBoost model, the overall accuracy was significantly higher at 95.3%, with a 95% confidence interval between 94.82% and 95.74%. The Kappa statistic was 0.9477, showing an almost perfect agreement between predictions and actual values. The statistics by class demonstrated high sensitivity and specificity for all digits, with balanced accuracy consistently above 96%. This indicates that the XGBoost model was very effective at correctly classifying all digits, making it a more robust and reliable model for handwritten digit recognition compared to logistic regression.

These results highlight the superior performance of the XGBoost model in terms of accuracy, precision, and reliability for the task of handwritten digit recognition.

Conclusions and Business Impact:

From the analysis, it can be concluded that both the logistic regression and XGBoost models achieved reasonable performance in recognizing handwritten digits from the MNIST dataset. The confusion matrices obtained from both models provide insight into their classification capabilities, including accuracy, precision, recall, and F1-score for each class.

The business impact of accurate handwritten digit recognition can be significant across domains. In postal services, automated sorting of mail based on handwritten addresses can improve efficiency and reduce processing time, leading to cost savings and enhanced customer satisfaction. In finance and banking, digit recognition can support automated check processing systems, enabling faster and more accurate check clearance. In digital document processing, particularly in industries where handwritten forms and documents are prevalent, such as healthcare and the legal sector, automated digit recognition can streamline data entry, reduce errors, and improve document management efficiency. In education, it can be integrated into educational technology platforms to provide interactive learning experiences, especially in mathematics and handwriting practice applications. Overall, deploying accurate handwritten digit recognition models can improve productivity, reduce manual intervention, and enhance user experiences across these domains, ultimately contributing to business efficiency and effectiveness.

Code:

library(ggplot2)   # plotting
library(tidyr)     # data reshaping
library(reshape2)  # data reshaping (legacy)
library(rgl)       # 3D plotting
library(glmnet)    # penalized (multinomial) logistic regression
library(caret)     # confusionMatrix() and other evaluation utilities
library(xgboost)   # gradient-boosted trees

# Load the Kaggle MNIST training data (label column + 784 pixel columns)
train <- read.csv("train.csv")

# Display the dimensions of the data frame
dim(train)
## [1] 42000   785
# View the first few rows of the data
head(train[, 1:8], 8)
##   label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6
## 1     1      0      0      0      0      0      0      0
## 2     0      0      0      0      0      0      0      0
## 3     1      0      0      0      0      0      0      0
## 4     4      0      0      0      0      0      0      0
## 5     0      0      0      0      0      0      0      0
## 6     0      0      0      0      0      0      0      0
## 7     7      0      0      0      0      0      0      0
## 8     3      0      0      0      0      0      0      0

Missing Values

# Check for missing values in the dataframe
missing_values <- any(is.na(train))

# Print the result
if (missing_values) {
  print("There are missing values in the dataframe.")
} else {
  print("There are no missing values in the dataframe.")
}
## [1] "There are no missing values in the dataframe."

Frequency of the digits, to check whether some classes need to be weighted more than others

# Create a data frame with the counts of each digit
digit_counts <- as.data.frame(table(train[,1]))
names(digit_counts) <- c("Digit", "Count")


ggplot(digit_counts, aes(x = factor(Digit), y = Count)) +
  geom_bar(stat = "identity", fill = rainbow(10, 0.5)) +  # Set colors directly
  labs(title = "Digits in Train", x = "Digit", y = "Count") +
  theme_minimal()

digit <- matrix(as.numeric(train[300, -1]), nrow = 28)  # look at one digit
image(digit, col = grey.colors(255))

As plotted above, the images appear upside down and need to be rotated for clearer visualization. Normalizing the distribution of the data is also a common preprocessing step in image recognition tasks. Normalization scales the pixel values to a range that makes training more efficient and effective, typically either to between 0 and 1 by dividing by 255 (when values lie in the 0-255 range) or by standardizing to a mean of 0 and a standard deviation of 1. A normalized input distribution helps the model converge faster during training and can improve its stability and performance.
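A minimal sketch of the two scalings just described, assuming the train object loaded earlier (the PCA step further down also standardizes each retained pixel via scale. = TRUE):

# Min-max scaling: map pixel intensities from 0-255 into 0-1
pixels <- train[, -1]
pixels_01 <- pixels / 255

# Standardization: mean 0, sd 1 per pixel. Constant (e.g. all-zero) pixels
# would produce NaN here, which is one reason they are dropped before PCA below.
pixels_std <- scale(as.matrix(pixels))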

digit <- matrix(as.numeric(train[300, -1]), nrow = 28)  # look at one digit
m <- apply(digit, 1, rev)            # reverse each row (apply returns the results as columns)
image(t(m), col = grey.colors(255))  # transpose back; the net effect flips the image upright

# Extract label column and image data
train_labels <- train$label
train_images <- train[, -1]  # Exclude label column

par(mfrow = c(3, 3)) 
for (i in 1:9) {
  # Convert image data to a matrix and reshape it to 28x28
  image_matrix <- matrix(as.numeric(train_images[i, ]), nrow = 28, byrow = FALSE)
  
  # Plot the image with grayscale
  image(image_matrix[,28:1], col = gray.colors(256), main = paste("Label:", train_labels[i]))
}

Now that we can see from the visualizations that the images are normalized, we can proceed with a PCA.

# Set the seed for reproducibility
set.seed(123)

# Randomly select indices for the training set
train_indices <- sample(nrow(train), 0.8 * nrow(train))

# Create training set
train_images <- train[train_indices, -1]  # Exclude label column
train_labels <- as.integer(train$label[train_indices])

# Create testing set
test_images <- train[-train_indices, -1]  # Exclude label column
test_labels <- as.integer(train$label[-train_indices])
# Identify constant or zero columns
constant_cols <- sapply(train_images, function(x) length(unique(x)) == 1)
zero_cols <- sapply(train_images, function(x) all(x == 0))

# Combine the columns to remove
cols_to_remove <- constant_cols | zero_cols

# Remove constant or zero columns
train_images_filtered <- train_images[, !cols_to_remove]

# Perform PCA on the filtered training set (centering and scaling each pixel;
# scaling is possible now that the constant columns have been removed)
train_pca <- prcomp(train_images_filtered, center = TRUE, scale. = TRUE)

# Scree plot
screeplot(train_pca, type = "lines", main = "Scree Plot")
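
A complementary view of the scree plot, sketched with the same train_pca object, is the cumulative proportion of variance explained, which is the quantity used below to choose the number of components:

# Cumulative proportion of variance explained by the leading components
var_explained <- cumsum(train_pca$sdev^2) / sum(train_pca$sdev^2)
plot(var_explained, type = "l",
     xlab = "Number of principal components",
     ylab = "Cumulative proportion of variance explained")
abline(h = 0.95, lty = 2)  # the 95% threshold applied below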

# Sort labels and corresponding principal components
sorted_indices <- order(train_labels)
sorted_labels <- train_labels[sorted_indices]
sorted_pca <- train_pca$x[sorted_indices, ]

# Plot the first two principal components with unique train labels in numeric order
plot(sorted_pca[, 1], sorted_pca[, 2], 
     col = as.factor(sorted_labels),
     pch = 19,
     xlab = "PC1",
     ylab = "PC2",
     main = "First 2 Principal Components with Unique Train Labels (Numeric Order)")

legend("topright", legend = unique(sorted_labels), col = 1:length(unique(sorted_labels)), pch = 19, title = "Labels")

# Retain enough components to explain 95% of the variance
explained_variance <- cumsum(train_pca$sdev^2 / sum(train_pca$sdev^2))
n_components <- which(explained_variance >= 0.95)[1]

# Transform the training set using the PCA model
train_pca_transformed <- train_pca$x[, 1:n_components]


test_images_filtered <- test_images[, !cols_to_remove]  # apply the same column filtering
# predict() applies the training set's centering and scaling before projecting
test_pca_transformed <- predict(train_pca, test_images_filtered)[, 1:n_components]
# Train logistic regression model (multinomial; glmnet defaults to the lasso penalty)
classifier <- glmnet(train_pca_transformed, train_labels, family = "multinomial")

# Predict classes at a fixed penalty, lambda = 0.01
predictions_glmnet <- as.integer(predict(classifier, s = 0.01, newx = test_pca_transformed, type = "class"))
# Ensure predictions and test_labels have the same levels
class_levels_glmnet <- unique(c(predictions_glmnet, test_labels))
predictions_glmnet <- factor(predictions_glmnet, levels = class_levels_glmnet)
test_labels_glmnet <- factor(test_labels, levels = class_levels_glmnet)

(conf_matrix_glmnet <- confusionMatrix(predictions_glmnet, test_labels_glmnet))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   9   2   6   1   8   5   4   7   0   3
##          9 701   5   0   0  16  16  62  45   4  13
##          2   5 649   5  15  19   5   3  16   3  29
##          6   0  35 766   5   8   9  13   2   5   7
##          1  24  27  12 900  30  40  13  28   0  21
##          8  10  26   6  11 614  18   3   2   5  23
##          5  11   4  17  10  56 552   5   2  24  53
##          4  50   5  19   0  10  23 715   8   5   9
##          7  51  17   3   1   3   3   5 737   2  20
##          0  10  19  30   0  10  13   2   3 755   5
##          3  12  22   0   2  38  65   0   3   7 710
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8451          
##                  95% CI : (0.8372, 0.8528)
##     No Information Rate : 0.1124          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8278          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 9 Class: 2 Class: 6 Class: 1 Class: 8 Class: 5
## Sensitivity           0.80206  0.80222  0.89277   0.9534  0.76368  0.74194
## Specificity           0.97861  0.98683  0.98886   0.9738  0.98631  0.97623
## Pos Pred Value        0.81323  0.86649  0.90118   0.8219  0.85515  0.75204
## Neg Pred Value        0.97705  0.97909  0.98781   0.9940  0.97527  0.97495
## Prevalence            0.10405  0.09631  0.10214   0.1124  0.09571  0.08857
## Detection Rate        0.08345  0.07726  0.09119   0.1071  0.07310  0.06571
## Detection Prevalence  0.10262  0.08917  0.10119   0.1304  0.08548  0.08738
## Balanced Accuracy     0.89033  0.89453  0.94082   0.9636  0.87500  0.85908
##                      Class: 4 Class: 7 Class: 0 Class: 3
## Sensitivity           0.87089  0.87116  0.93210  0.79775
## Specificity           0.98298  0.98610  0.98788  0.98016
## Pos Pred Value        0.84716  0.87530  0.89138  0.82654
## Neg Pred Value        0.98597  0.98558  0.99272  0.97613
## Prevalence            0.09774  0.10071  0.09643  0.10595
## Detection Rate        0.08512  0.08774  0.08988  0.08452
## Detection Prevalence  0.10048  0.10024  0.10083  0.10226
## Balanced Accuracy     0.92693  0.92863  0.95999  0.88896
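The penalty s = 0.01 above was fixed by hand. A hedged alternative, sketched with the same train_pca_transformed and train_labels objects, is to let cv.glmnet() choose the penalty by cross-validation:

# Sketch: select lambda by 5-fold cross-validated misclassification error
set.seed(123)
cv_fit <- cv.glmnet(train_pca_transformed, train_labels,
                    family = "multinomial", type.measure = "class", nfolds = 5)
predictions_cv <- as.integer(predict(cv_fit, newx = test_pca_transformed,
                                     s = "lambda.min", type = "class"))
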
# Train XGBoost model (multi-class softmax over the PCA features)
xgb_classifier <- xgboost(data = train_pca_transformed, label = train_labels,
                          nrounds = 100, objective = "multi:softmax",
                          num_class = length(unique(train_labels)))

predictions_xgb <- as.integer(predict(xgb_classifier, test_pca_transformed))
# Ensure predictions and test_labels have the same levels
(class_levels_xgb <- unique(c(predictions_xgb, test_labels)))
##  [1] 9 2 6 1 4 5 8 7 0 3
predictions_xgb<- factor(predictions_xgb, levels = class_levels_xgb)
test_labels_xgb <- factor(test_labels, levels = class_levels_xgb)

conf_matrix_xgb <- confusionMatrix(predictions_xgb, test_labels_xgb)

print(conf_matrix_xgb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   9   2   6   1   4   5   8   7   0   3
##          9 814   2   0   0  16   6   3  16   0   5
##          2   2 771   3   4   6   5   6  10   1  16
##          6   0   4 825   3   4   4   5   0   6   2
##          1   1   1   2 924   3   0   0   2   0   1
##          4  17   1   6   1 785   5   1   6   1   1
##          5   6   0   6   4   0 694  12   0   3  13
##          8   7  11   5   4   3   8 754   0   2  13
##          7  11   8   1   2   4   1   3 807   0   5
##          0   6   4  10   0   0   7   5   2 797   0
##          3  10   7   0   2   0  14  15   3   0 834
## 
## Overall Statistics
##                                           
##                Accuracy : 0.953           
##                  95% CI : (0.9482, 0.9574)
##     No Information Rate : 0.1124          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9477          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 9 Class: 2 Class: 6 Class: 1 Class: 4 Class: 5
## Sensitivity            0.9314  0.95303  0.96154   0.9788  0.95615  0.93280
## Specificity            0.9936  0.99302  0.99629   0.9987  0.99485  0.99425
## Pos Pred Value         0.9443  0.93568  0.96717   0.9893  0.95267  0.94038
## Neg Pred Value         0.9920  0.99498  0.99563   0.9973  0.99525  0.99347
## Prevalence             0.1040  0.09631  0.10214   0.1124  0.09774  0.08857
## Detection Rate         0.0969  0.09179  0.09821   0.1100  0.09345  0.08262
## Detection Prevalence   0.1026  0.09810  0.10155   0.1112  0.09810  0.08786
## Balanced Accuracy      0.9625  0.97302  0.97891   0.9887  0.97550  0.96352
##                      Class: 8 Class: 7 Class: 0 Class: 3
## Sensitivity           0.93781  0.95390  0.98395  0.93708
## Specificity           0.99302  0.99537  0.99552  0.99321
## Pos Pred Value        0.93432  0.95843  0.95909  0.94237
## Neg Pred Value        0.99341  0.99484  0.99828  0.99255
## Prevalence            0.09571  0.10071  0.09643  0.10595
## Detection Rate        0.08976  0.09607  0.09488  0.09929
## Detection Prevalence  0.09607  0.10024  0.09893  0.10536
## Balanced Accuracy     0.96542  0.97463  0.98974  0.96514
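
Similarly, the number of boosting rounds (nrounds = 100) was fixed in advance. A sketch, under the same assumptions, of choosing it by cross-validation with early stopping:

# Sketch: pick nrounds by 5-fold CV, stopping when held-out merror stops improving
set.seed(123)
cv <- xgb.cv(data = train_pca_transformed, label = train_labels,
             nrounds = 200, nfold = 5,
             objective = "multi:softmax",
             num_class = length(unique(train_labels)),
             metrics = "merror", early_stopping_rounds = 10, verbose = 0)
cv$best_iteration  # rounds to use when refitting on the full training set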

References:

  1. Kaggle Digit Recognizer Competition. Retrieved from: https://www.kaggle.com/competitions/digit-recognizer/overview

  2. Arathee, A. (n.d.). Random Forest vs XGBoost vs Deep Neural Network. Retrieved from: https://www.kaggle.com/code/arathee2/random-forest-vs-xgboost-vs-deep-neural-network

  3. LeCun, Y., Cortes, C., & Burges, C. (n.d.). MNIST Database. Retrieved from: http://yann.lecun.com/exdb/mnist/

  4. Colah’s Blog. (2014). Visualizing MNIST. Retrieved from: https://colah.github.io/posts/2014-10-Visualizing-MNIST/

  5. R-bloggers. (2018). Exploring Handwritten Digit Classification: A Tidy Analysis of the MNIST Dataset. Retrieved from: https://www.r-bloggers.com/2018/01/exploring-handwritten-digit-classification-a-tidy-analysis-of-the-mnist-dataset/

  6. Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Retrieved from: https://repository.supsi.ch/5145/1/IDSIA-04-12.pdf

  7. Papiu, A. (2016). MNIST Handwritten Digit Classification in R using Deep Neural Networks. Retrieved from: https://apapiu.github.io/2016-01-02-minst/

  8. SrlMayor. (n.d.). Easy Neural Network in R for 99.4%. Retrieved from: https://www.kaggle.com/code/srlmayor/easy-neural-network-in-r-for-0-994