---
title: "DATA605 - Final Exam  - Problem 2"
author: "Esteban Aramayo"
date: "2022-05-22"
output: openintro::lab_report
---

```{r global-options, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE,
                      collapse = FALSE,
                      comment = "#>" )
```


```{r}
library(tidyverse)
library(keras)
library(nnet)
library(caret)
```



# Final Exam


<br>


## Final Problem 2. 40 points.

### 2.1

Go to Kaggle.com and build an account if you do not already have one. It is free.


### 2.2 

Go to https://www.kaggle.com/c/digit-recognizer/overview, accept the rules of the competition, and download the data. You will not be required to submit work to Kaggle, but you do need the data.

"MNIST ('Modified National Institute of Standards and Technology') is the de facto 'hello world' dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike." (from the Kaggle competition description)


```{r}

# define url for data
urlTrain <- "https://raw.githubusercontent.com/esteban-data-enthusiast/data605/main/shared-data/digit-recognizer/train.csv"

# load the training data
rawTrain <- read.csv(urlTrain, stringsAsFactors = TRUE)


#urlTest <- "https://raw.githubusercontent.com/esteban-data-enthusiast/data605/main/shared-data/digit-recognizer/test.csv"

# load the test data
#rawTest <- read.csv(urlTest, stringsAsFactors = TRUE)

```
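
A quick sanity check on the data format, assuming the download succeeded: each row should hold one `label` column followed by 784 pixel columns (28 × 28 = 784).

```{r}
# confirm the shape: 1 label column + 784 pixel columns per image
dim(rawTrain)

# peek at the label and the first few pixel columns
head(rawTrain[, 1:4])
```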


### 2.3 
Using the training.csv file, plot representations of the first 10 images to understand the data format. Go ahead and divide all pixels by 255 to produce values between 0 and 1. (This is equivalent to min-max scaling.) (5 points)


Rescale the pixels in the training dataset. (The test dataset is left commented out because it is not needed for this problem.)

```{r}

# divide all pixels by 255 to rescale them to the [0, 1] range
rawTrain[-1] <- rawTrain[-1] / 255

# same rescaling for the test data, if loaded
#rawTest[-1] <- rawTest[-1] / 255

```
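
A small check, assuming the rescaling above ran: all pixel values should now fall in [0, 1].

```{r}
# verify the rescaled pixel values lie between 0 and 1
range(as.matrix(rawTrain[-1]))
```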


Plot the representations of the first 10 images from the training dataset

```{r}

imgPlotter <- function(row) {
  # reshape the 784 pixel values into a 28 x 28 numeric matrix
  img <- matrix(unlist(row), nrow = 28, ncol = 28)
  # reverse the columns so the digit renders upright
  # (the CSV stores pixels row by row, matrix() fills column by column)
  image(img[, 28:1], useRaster = TRUE, axes = FALSE)
}

# plot the first 10 images from the training set,
# dropping the label column so only the 784 pixel values are passed in
for (i in 1:10) {
  imgPlotter(rawTrain[i, -1])
}

```
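
As a cross-check that the plots line up with their classes, we can print the labels of the ten images just plotted:

```{r}
# labels corresponding to the ten plotted images
rawTrain$label[1:10]
```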


<br>

### 2.4 
What is the frequency distribution of the numbers in the dataset? (5 points)

```{r}
# exclude the "label" column and reshape the data
x_train <- 
  keras::array_reshape(as.matrix(rawTrain[-1]), c(nrow(rawTrain), 28, 28, 1))
  
dim(x_train)
```
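
For the Kaggle training file this prints `42000 28 28 1`: 42,000 images, each 28 × 28 pixels with a single grayscale channel.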

Generate the frequency distribution of the pixel values in the dataset

```{r}
hist(x_train)
```

From the histogram we can see that most of the values in the dataset are zeros, while the remaining values fall between 0 and 1 because we rescaled them (divided by 255). The high frequency of zeros comes from the background pixels surrounding each digit.

Let's plot only the non-zero elements of the dataset.

```{r}
hist(x_train[x_train > 0])
```

After excluding the zeros, we can see that the value 1 has the highest frequency among the non-zero pixels, making it the mode of the non-zero portion of the distribution. (The overall mode of the dataset is still 0.)
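
The question can also be read as asking how often each digit occurs. A quick check with `table()` shows the label counts, which are roughly balanced across the ten classes:

```{r}
# frequency distribution of the digit labels 0-9
table(rawTrain$label)
```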


<br>

### 2.5 
For each number, provide the mean pixel intensity. What does this tell you? (5 points)

```{r}
# per-pixel mean intensity across all 42,000 images
hist(colMeans(x_train))
```

From the histogram above, we can see that most per-pixel means sit at or near 0, which again reflects the uniform background pixels that surround each digit.
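
The question can also be read per digit class. A minimal sketch, using the still-integer `label` column: average each image's mean intensity within its label group.

```{r}
# mean pixel intensity for each digit 0-9
digit_means <- tapply(rowMeans(rawTrain[-1]), rawTrain$label, mean)
round(digit_means, 4)
```

We would expect thin digits such as 1 to show the lowest mean intensity, and ink-heavy digits such as 0 or 8 the highest.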


<br>

### 2.6 
Reduce the data by using principal components that account for 95% of the variance. How many components did you generate? Use PCA to generate all possible components (100% of the variance). How many components are possible? Why? (5 points)


Get the covariance of the training data

```{r}
x_train_cov <- cov(rawTrain[-1])

```

Reduce to principal components

```{r}
pca_x_train <- prcomp(x_train_cov)

```
Looking at the summary of the PCA object, prcomp() generated 784 components, one for each of the 784 pixel variables (not one per image), which together account for 100% of the variance. 784 is also the maximum possible number of components, because the number of principal components is bounded by the number of input variables, here the 784 pixel columns.
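
To answer the 95% part of the question explicitly, we can count components from the cumulative variance explained. (This describes the `prcomp` object fitted above, which was run on the covariance matrix rather than directly on the pixel data.)

```{r}
# proportion of variance explained by each component
var_explained <- pca_x_train$sdev^2 / sum(pca_x_train$sdev^2)

# number of components needed to reach 95% cumulative variance
which(cumsum(var_explained) >= 0.95)[1]
```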


<br>

### 2.7
Plot the first 10 images generated by PCA. They will appear to be noise. Why? (5 points)

```{r}

# each column of the rotation matrix holds the 784 loadings of one
# principal component; reshape each back into a 28 x 28 image
for (i in 1:10) {
  img <- matrix(pca_x_train$rotation[, i], nrow = 28, ncol = 28)
  image(img, useRaster = TRUE, axes = FALSE)
}

```

The plots look like blurry noise. This is expected: each principal component is a direction of maximal variance computed across all ten digits at once, so its loadings mix strokes from many different digits rather than depicting any single one.

<br>

### 2.8 

Now, select only those images that have labels that are 8’s. Re-run PCA that accounts for all of the variance (100%). Plot the first 10 images. What do you see? (5 points)

```{r}
x_train_8 <- rawTrain %>%
  filter(label == 8) %>%
  select(c(2:785))
```
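
A quick look at how many images survive the filter:

```{r}
# number of training images labeled 8
nrow(x_train_8)
```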

Get the covariance matrix of the 8's

```{r}
# the pixels were already rescaled to [0, 1] in 2.3, so no further division
x_cov_8 <- cov(x_train_8)
```

Reduce to principal components

```{r}
pca_x_train_8 <- prcomp(x_cov_8)
```

Plot the first 10 images

```{r}
# as before, reshape the first 10 principal components into 28 x 28 images
for (i in 1:10) {
  img <- matrix(pca_x_train_8$rotation[, i], nrow = 28, ncol = 28)
  image(img, useRaster = TRUE, axes = FALSE)
}
```

This time the leading components clearly resemble the digit 8, although they are still blurry and noisy. Because every image in this subset is an 8, the directions of maximal variance now capture variation among the 8's themselves (slant, stroke width, loop size), so the components look like blurred versions of the digit.


<br>

### 2.9 

An incorrect approach to predicting the images would be to build a linear regression model with y as the digit values and X as the pixel matrix. Instead, we can build a multinomial model that classifies the digits. Build a multinomial model on the entirety of the training set. Then provide its classification accuracy (percent correctly identified) as well as a matrix of observed versus forecast values (confusion matrix). This matrix will be a 10 x 10, and correct classifications will be on the diagonal. (10 points)


Before fitting a model to the training set, convert the variable "label" to a factor

```{r}
# multinom() requires the response to be a factor
rawTrain$label <- as.factor(rawTrain$label)

# assemble the modeling data frame: 784 pixel predictors plus the factor label
X <- rawTrain[2:785]
X$label <- rawTrain$label

```

Fit a multinomial model to the training dataset

```{r}
# raise MaxNWts: with 784 predictors and 10 classes the model has far more
# weights than nnet's default cap allows
multinom_mod <- nnet::multinom(label ~ ., data = X, MaxNWts = 1000000)
```
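
`multinom()` stops after 100 iterations by default, which may be before full convergence. If runtime permits, a variant worth trying (a sketch, not run here) is to raise `maxit`, which `multinom()` passes through to `nnet`:

```{r, eval=FALSE}
# optional: allow up to 300 iterations instead of the default 100 (slower)
multinom_mod_long <- nnet::multinom(label ~ ., data = X,
                                    MaxNWts = 1000000, maxit = 300)
```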

Use the fitted model to predict the digit labels on the training dataset

```{r}
pred <- predict(multinom_mod, rawTrain[2:785])

# store the predicted values in a data frame column named "pred"
pred <- data.frame(pred = pred)

```

Add two columns to our prediction data frame to compare the predicted values against the actual values

```{r}
pred$actual <- rawTrain$label

# assign a 1 if the prediction matches the actual label, else a 0
pred$equal <- ifelse(pred$pred == pred$actual, 1, 0)
```

Compute the classification accuracy (percent correctly identified)

```{r}
# percent of training images classified correctly
round(mean(pred$equal) * 100, 4)
```

The classification accuracy on the training set is 92.45%, which is reasonable for a multinomial logistic model on raw pixels. Note that this is in-sample accuracy, measured on the same data the model was fit to, so it likely overstates performance on unseen digits.

<br>

Generate the confusion matrix

```{r}
confusionMatrix(pred$pred, pred$actual)
```
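
Correct classifications fall on the diagonal of the 10 × 10 matrix. The largest off-diagonal counts involve visually similar digit pairs, most notably 7 with 9, 3 with 5, and 4 with 9, which is the kind of confusion we would expect from a linear model on raw pixels.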



<br>

<br>


