Prompt: Use deep learning from H2o package to train your data set. Investigate prediction performance (limit to 2-3 layers) on multiple runs using H2o package (experiment by varying parameter such as numbers of layers, numbers of nodes, etc.,)

We are also provided with a few options for the data set to use for this exercise. One of these options is the MNIST digit classification data set. I have not done much work with images, but I have done quite a few classification exercises, so I am going to go with this digit data.

To begin, we need to load the required packages and read in the data. The first step is to take a look at the training set, which I have already extracted. Below we can see the first 4 digits.

library(jsonlite)
library(caret)
library(h2o)
library(ggplot2)
library(data.table)
library(e1071)
setwd('C:\\Users\\JP\\Downloads')
data = file("C:\\Users\\JP\\Downloads\\train-images.idx3-ubyte", "rb")
readBin(data, integer(), n=4, endian="big")
[1]  2051 60000    28    28
par(mfrow=c(2,2))
# Pixels are one unsigned byte each and start right after the 16-byte header,
# so no further 4-byte read is needed before plotting the first 4 digits.
for(i in 1:4){m = matrix(readBin(data,integer(), size=1, n=28*28, signed=FALSE),28,28);image(m[,28:1])}
close(data)

It looks like the data is loaded in properly, and we have what we need to begin. Now we start up h2o and get the data loaded in. I found a blog post with functions that get the data into the appropriate format for testing models. We will use those functions to get started. The blog post can be found here: https://charleshsliao.wordpress.com/2017/04/15/a-h2o-fnn-model-for-mnist/

h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         4 hours 31 minutes 
    H2O cluster version:        3.10.4.4 
    H2O cluster version age:    22 days  
    H2O cluster name:           H2O_started_from_R_JP_pzr504 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.34 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.3 (2017-03-06) 
setwd('C:\\Users\\JP\\Downloads')
load_image_file <- function(filename) {
  ret = list()
  f = file(filename,'rb')
  readBin(f,'integer',n=1,size=4,endian='big')
  ret$n = readBin(f,'integer',n=1,size=4,endian='big')
  nrow = readBin(f,'integer',n=1,size=4,endian='big')
  ncol = readBin(f,'integer',n=1,size=4,endian='big')
  x = readBin(f,'integer',n=ret$n*nrow*ncol,size=1,signed=F)
  ret$x = matrix(x, ncol=nrow*ncol, byrow=T)
  close(f)
  ret
}
load_label_file <- function(filename) { 
  f = file(filename,'rb')
  readBin(f,'integer',n=1,size=4,endian='big')
  n = readBin(f,'integer',n=1,size=4,endian='big')
  y = readBin(f,'integer',n=n,size=1,signed=F)
  close(f)
  y
}
imagetraining<-as.data.frame(load_image_file("C:\\Users\\JP\\Downloads\\train-images.idx3-ubyte"))
imagetest<-as.data.frame(load_image_file("C:\\Users\\JP\\Downloads\\t10k-images.idx3-ubyte"))
labeltraining<-as.factor(load_label_file("C:\\Users\\JP\\Downloads\\train-labels.idx1-ubyte"))
labeltest<-as.factor(load_label_file("C:\\Users\\JP\\Downloads\\t10k-labels.idx1-ubyte"))
imagetraining[,1]<-labeltraining
imagetest[,1]<-labeltest
Training<-imagetraining
Test<-imagetest 

Now that the data is loaded in and ready to go, we can start by looking at different models using h2o. We will convert to h2o objects and begin running some models and use the caret package to see how we do.

We did not do very well here. There appear to be quite a few 7’s in our result set, which does not match the actual labels. Let’s try a few things to change it up. We originally had only 2 nodes in each of the three layers and only 10 epochs. Let’s increase these numbers and see how we do. We will also experiment with adding 5-fold cross-validation for a third model, and in a fourth model we will reduce the input dropout ratio, which controls what fraction of the input features is dropped for each training row. I figure that if we are adding cross-validation, we can lower this number.

model2<-h2o.deeplearning(x=x,y="n",training_frame = TrainingH,validation_frame = TestH,distribution = "multinomial",activation="RectifierWithDropout",hidden = c(50,50,50),input_dropout_ratio = .2,sparse=T,epochs=100)

model3<-h2o.deeplearning(x=x,y="n",training_frame = TrainingH,validation_frame = TestH,distribution = "multinomial",activation="RectifierWithDropout",hidden = c(30,30,30),input_dropout_ratio = .2,sparse=T,epochs=50,nfolds=5)

model4<-h2o.deeplearning(x=x,y="n",training_frame = TrainingH,validation_frame = TestH,distribution = "multinomial",activation="RectifierWithDropout",hidden = c(30,30,30),input_dropout_ratio = .1,sparse=T,epochs=50,nfolds=5)

We have built a few models; now it is time to put them to the test. It would be hard not to beat our earlier predictions, but let’s see. These models take significantly longer to run.

Model 2
head(TestH[,1])

head(results2)
caret::confusionMatrix(unlist(results2),Test$n)$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull AccuracyPValue  McnemarPValue 
     0.9384000      0.9315299      0.9335086      0.9430341      0.1135000      0.0000000            NaN 

Model 2 reaches about 93.8% accuracy, as we would expect after increasing each hidden layer from 2 to 50 nodes and training for 100 epochs instead of 10.

Model 3
head(TestH[,1])

head(results3)
caret::confusionMatrix(unlist(results3),Test$n)$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull AccuracyPValue  McnemarPValue 
     0.8029000      0.7808491      0.7949654      0.8106578      0.1135000      0.0000000            NaN 

Not quite as accurate. The number of layers is still three, so this could simply be because we reduced the nodes per layer from 50 to 30 and the epochs from 100 to 50. Let’s see if our input dropout ratio change has any impact on the accuracy.

Model 4
head(TestH[,1])

head(results4)
caret::confusionMatrix(unlist(results4),Test$n)$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull AccuracyPValue  McnemarPValue 
     0.8790000      0.8654560      0.8724474      0.8853309      0.1135000      0.0000000            NaN 

Model 4 was an improvement on model 3, but the second model, with the most nodes per layer and the most epochs, has performed the best so far in terms of accuracy.

The options are nearly unlimited. If we were going for the highest accuracy and the best model, we could build many more models and do things such as examining variable importance in order to find more accurate ones. For the purposes of this exercise, though, I would feel comfortable using our second model to make classifications. While it is not 100% reliable, it did very well in classifying the digits. Its strength is easy to see when the first few predictions line up completely with the actual test labels.

---
title: "Week 8 Project Learning with h2o"
output: html_notebook
author: John Neville
---

Prompt:  Use deep learning from H2o package to train your data set. Investigate prediction performance (limit to 2-3 layers) on multiple runs using H2o package (experiment by varying parameter such as numbers of layers, numbers of nodes, etc.,)

We are also provided with a few options for the data set to use for this exercise.  One of these options is the MNIST digit classification data set.  I have not done much work with images, but I have done quite a few classification exercises, so I am going to go with this digit data.

To begin, we need to load the required packages and read in the data.  The first step is to take a look at the training set, which I have already extracted.  Below we can see the first 4 digits.

```{r}

library(jsonlite)
library(caret)
library(h2o)
library(ggplot2)
library(data.table)
library(e1071)

setwd('C:\\Users\\JP\\Downloads')
data = file("C:\\Users\\JP\\Downloads\\train-images.idx3-ubyte", "rb")

readBin(data, integer(), n=4, endian="big")

# Pixels are one unsigned byte each and start right after the 16-byte header,
# so no further 4-byte read is needed before plotting the first 4 digits.
par(mfrow=c(2,2))

for(i in 1:4){m = matrix(readBin(data,integer(), size=1, n=28*28, signed=FALSE),28,28);image(m[,28:1])}
close(data)
```
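
The four integers printed by the header read (2051, 60000, 28, 28) are the magic number, the image count, and the row and column sizes of the IDX file. As a minimal sketch of that layout, the block below writes a tiny synthetic file and reads it back. The file contents, the 2x2 image size, and the variable names are made up for illustration and are not part of the MNIST data.

```r
# Write a tiny synthetic IDX3-style file: 2 images of 2x2 pixels (hypothetical data).
path <- tempfile(fileext = ".idx3-ubyte")
con <- file(path, "wb")
writeBin(c(2051L, 2L, 2L, 2L), con, size = 4, endian = "big")  # magic, count, rows, cols
writeBin(as.raw(c(0, 255, 128, 7, 1, 2, 3, 4)), con)           # one unsigned byte per pixel
close(con)

# Read it back: four big-endian 4-byte integers, then the pixel bytes.
con <- file(path, "rb")
header <- readBin(con, integer(), n = 4, size = 4, endian = "big")
pixels <- readBin(con, integer(), n = prod(header[2:4]), size = 1, signed = FALSE)
close(con)

# One row per image, one column per pixel.
images <- matrix(pixels, nrow = header[2], byrow = TRUE)
print(header)
print(images)
```

Reading the pixels with `signed = FALSE` matters because values above 127 would otherwise come back negative.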

It looks like the data is loaded in properly, and we have what we need to begin. Now we start up h2o and get the data loaded in.  I found a blog post with functions that get the data into the appropriate format for testing models.  We will use those functions to get started.  The blog post can be found here:   [link](https://charleshsliao.wordpress.com/2017/04/15/a-h2o-fnn-model-for-mnist/)

```{r}
h2o.init()

setwd('C:\\Users\\JP\\Downloads')
load_image_file <- function(filename) {
  ret = list()
  f = file(filename,'rb')
  readBin(f,'integer',n=1,size=4,endian='big')
  ret$n = readBin(f,'integer',n=1,size=4,endian='big')
  nrow = readBin(f,'integer',n=1,size=4,endian='big')
  ncol = readBin(f,'integer',n=1,size=4,endian='big')
  x = readBin(f,'integer',n=ret$n*nrow*ncol,size=1,signed=F)
  ret$x = matrix(x, ncol=nrow*ncol, byrow=T)
  close(f)
  ret
}
load_label_file <- function(filename) { 
  f = file(filename,'rb')
  readBin(f,'integer',n=1,size=4,endian='big')
  n = readBin(f,'integer',n=1,size=4,endian='big')
  y = readBin(f,'integer',n=n,size=1,signed=F)
  close(f)
  y
}
imagetraining<-as.data.frame(load_image_file("C:\\Users\\JP\\Downloads\\train-images.idx3-ubyte"))
imagetest<-as.data.frame(load_image_file("C:\\Users\\JP\\Downloads\\t10k-images.idx3-ubyte"))
labeltraining<-as.factor(load_label_file("C:\\Users\\JP\\Downloads\\train-labels.idx1-ubyte"))
labeltest<-as.factor(load_label_file("C:\\Users\\JP\\Downloads\\t10k-labels.idx1-ubyte"))
imagetraining[,1]<-labeltraining
imagetest[,1]<-labeltest
Training<-imagetraining
Test<-imagetest 

```
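
It may not be obvious why the label column ends up being called `n`. The loader returns a list with a count `n` and a pixel matrix `x`, and `as.data.frame` turns that into a first column `n` followed by `x.1`, `x.2`, and so on. Overwriting column 1 with the labels then reuses the name `n`. A small sketch with a made-up 3-image, 4-pixel stand-in for the loader's return value:

```r
# Hypothetical stand-in for what load_image_file() returns: 3 "images" of 4 pixels.
ret <- list(n = 3L, x = matrix(1:12, ncol = 4, byrow = TRUE))

df <- as.data.frame(ret)
print(names(df))  # first column is the recycled count, the rest are pixels

# Overwrite the count column with (made-up) labels, as the notebook does.
labels <- factor(c(5, 0, 4))
df[, 1] <- labels
str(df)
```

This is why the models below use `y="n"` as the response and every other column as a predictor.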

Now that the data is loaded in and ready to go, we can start by looking at different models using h2o.  We will convert to h2o objects and begin running some models and use the caret package to see how we do.

```{r}
TrainingH<-as.h2o(Training,destination_frame="TrainingH")
TestH<-as.h2o(Test,destination_frame="TestH")



x<-colnames(TrainingH[,-1])

model<-h2o.deeplearning(x=x,y="n",training_frame = TrainingH,validation_frame = TestH,distribution = "multinomial",activation="RectifierWithDropout",hidden = c(2,2,2),input_dropout_ratio = .2,sparse=T,epochs=10)

summary(model)

ModelResult<-h2o.predict(model,TestH)
results<-as.data.frame(ModelResult[,1])
head(TestH[,1])
head(results)
caret::confusionMatrix(unlist(results),Test$n)$overall
```

We did not do very well here.  There appear to be quite a few 7's in our result set, which does not match the actual labels.  Let's try a few things to change it up.  We originally had only 2 nodes in each of the three layers and only 10 epochs.  Let's increase these numbers and see how we do.  We will also experiment with adding 5-fold cross-validation for a third model, and in a fourth model we will reduce the input dropout ratio, which controls what fraction of the input features is dropped for each training row.  I figure that if we are adding cross-validation, we can lower this number.
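
To get a rough feel for how little capacity a 2-node layer has, we can count the weights and biases of a fully connected network. This is only a back-of-the-envelope sketch: `count_params` is a helper made up here, and it assumes all 784 pixels are used as inputs, whereas H2O may drop constant columns and so use fewer.

```r
# Count weights and biases in a fully connected network (hypothetical helper).
count_params <- function(inputs, hidden, outputs) {
  sizes <- c(inputs, hidden, outputs)
  fan_in <- head(sizes, -1)   # units feeding each layer
  fan_out <- tail(sizes, -1)  # units in each layer
  sum(fan_in * fan_out + fan_out)  # weights plus one bias per unit
}

print(count_params(784, c(2, 2, 2), 10))     # first model
print(count_params(784, c(30, 30, 30), 10))  # models 3 and 4
print(count_params(784, c(50, 50, 50), 10))  # model 2
```

With only 2 nodes per layer, every image has to be squeezed through two numbers before the 10-way output, which fits with the poor first result.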

```{r}
model2<-h2o.deeplearning(x=x,y="n",training_frame = TrainingH,validation_frame = TestH,distribution = "multinomial",activation="RectifierWithDropout",hidden = c(50,50,50),input_dropout_ratio = .2,sparse=T,epochs=100)

model3<-h2o.deeplearning(x=x,y="n",training_frame = TrainingH,validation_frame = TestH,distribution = "multinomial",activation="RectifierWithDropout",hidden = c(30,30,30),input_dropout_ratio = .2,sparse=T,epochs=50,nfolds=5)

model4<-h2o.deeplearning(x=x,y="n",training_frame = TrainingH,validation_frame = TestH,distribution = "multinomial",activation="RectifierWithDropout",hidden = c(30,30,30),input_dropout_ratio = .1,sparse=T,epochs=50,nfolds=5)
```
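
The idea behind `input_dropout_ratio` can be illustrated in plain R. The sketch below zeroes each input feature of a row with the given probability. It illustrates the concept only and is not H2O's internal implementation; `apply_input_dropout`, the seed, and the all-ones row are assumptions made for the example.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Zero each feature independently with probability `ratio` (hypothetical helper).
apply_input_dropout <- function(row, ratio) {
  keep <- runif(length(row)) >= ratio
  row * keep
}

pixels <- rep(1, 10000)                      # made-up row of all-ones features
dropped <- apply_input_dropout(pixels, 0.2)  # same ratio as models 2 and 3
print(mean(dropped == 0))                    # share of features that were dropped
```

Lowering the ratio from .2 to .1, as model 4 does, means roughly half as many inputs are hidden from the network on each training row.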

We have built a few models; now it is time to put them to the test.  It would be hard not to beat our earlier predictions, but let's see.  These models take significantly longer to run.

##### Model 2
```{r}
ModelResult2<-h2o.predict(model2,TestH)
results2<-as.data.frame(ModelResult2[,1])
head(TestH[,1])
head(results2)
caret::confusionMatrix(unlist(results2),Test$n)$overall
```

Model 2 reaches about 93.8% accuracy, as we would expect after increasing each hidden layer from 2 to 50 nodes and training for 100 epochs instead of 10.
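
For readers unfamiliar with the `$overall` vector from `caret::confusionMatrix`, the sketch below recomputes three of its entries by hand on a made-up toy example. `summarise_predictions` is a helper written for this illustration, not a caret function. `AccuracyNull` is the accuracy of always guessing the most common actual class, which is why it is 0.1135 for all of our models (the digit 1 makes up 1,135 of the 10,000 test images).

```r
# Recompute Accuracy, Kappa and AccuracyNull from predictions (hypothetical helper).
summarise_predictions <- function(predicted, actual) {
  lv <- union(levels(factor(actual)), levels(factor(predicted)))
  # Rows are predictions, columns are the actual labels.
  tab <- table(factor(predicted, levels = lv), factor(actual, levels = lv))
  n <- sum(tab)
  accuracy <- sum(diag(tab)) / n
  expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  c(Accuracy = accuracy,
    Kappa = (accuracy - expected) / (1 - expected),
    AccuracyNull = max(colSums(tab)) / n)
}

# Toy labels, made up for the example.
actual <- c(0, 0, 1, 1, 1, 2)
predicted <- c(0, 1, 1, 1, 1, 2)
res <- summarise_predictions(predicted, actual)
print(res)
```

Kappa corrects accuracy for chance agreement, so it is the more informative number when one class dominates.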

##### Model 3
```{r}
ModelResult3<-h2o.predict(model3,TestH)
results3<-as.data.frame(ModelResult3[,1])
head(TestH[,1])
head(results3)
caret::confusionMatrix(unlist(results3),Test$n)$overall

```
Not quite as accurate.  The number of layers is still three, so this could simply be because we reduced the nodes per layer from 50 to 30 and the epochs from 100 to 50.  Let's see if our input dropout ratio change has any impact on the accuracy.

##### Model 4

```{r}
ModelResult4<-h2o.predict(model4,TestH)
results4<-as.data.frame(ModelResult4[,1])
head(TestH[,1])
head(results4)
caret::confusionMatrix(unlist(results4),Test$n)$overall
```
Model 4 was an improvement on model 3, but the second model, with the most nodes per layer and the most epochs, has performed the best so far in terms of accuracy.

The options are nearly unlimited.  If we were going for the highest accuracy and the best model, we could build many more models and do things such as examining variable importance in order to find more accurate ones.  For the purposes of this exercise, though, I would feel comfortable using our second model to make classifications.  While it is not 100% reliable, it did very well in classifying the digits.  Its strength is easy to see when the first few predictions line up completely with the actual test labels.