knitr::opts_chunk$set(cache=TRUE)
knitr::opts_chunk$set(echo=TRUE)
The website https://archive.ics.uci.edu/ml/datasets/Ionosphere contains data evaluating “good” and “bad” radar returns for evidence of structure in the ionosphere. There are 351 observations of 34 predictors and a binary response, g or b.
rm(list=ls())
# libraries needed for project 5
library(caretEnsemble)
library(caret)
library(ISLR)
library(elasticnet)
library(AppliedPredictiveModeling)
library(parallel)
library(doParallel)
library(randomForest)
library(gbm)
library(readxl)
library(neuralnet)
library(MASS)
# use for parallel processing
startParallel <- function() {
  cluster <- makeCluster(detectCores() - 1)  # leave one core free for the OS
  registerDoParallel(cluster)
  return(list("cluster" = cluster, "time" = proc.time()))
}
endParallel <- function(parallelData) {
  stopCluster(parallelData$cluster)
  registerDoSEQ()  # re-register the sequential backend
  return(proc.time() - parallelData$time)  # elapsed time since startParallel
}
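A minimal usage sketch of these helpers (the real calls appear in the Khan problem below); any train() call with allowParallel = TRUE in its trControl will then run on the cluster.
Par <- startParallel()
# ... model fitting with allowParallel = TRUE goes here ...
endParallel(Par)  # stops the cluster and returns the elapsed time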
Load the data into R, delete any columns with zero variance, and convert the first column to type num and the response to type factor. Set the seed to 12345 and partition the data using createDataPartition with \(p=0.7\).
sphereData <- read.csv('ionosphere.data', header = FALSE)
sphereZeroValues <- nearZeroVar(sphereData)  # flags V2, which is identically zero
sphereData <- sphereData[,-sphereZeroValues]
sphereData$V1 <- as.numeric(sphereData$V1)
sphereData$V35 <- as.factor(sphereData$V35)  # classification response must be a factor
set.seed(12345)
sphereIndex <- createDataPartition(sphereData$V35, p=0.7, list=FALSE)
sphereTraining <- sphereData[sphereIndex,]
sphereTesting <- sphereData[-sphereIndex,]
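As a quick check (not required by the assignment), createDataPartition performs a stratified split, so the class proportions in the training set should closely match those of the full data:
prop.table(table(sphereData$V35))      # class ratio in the full data
prop.table(table(sphereTraining$V35))  # class ratio in the training split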
The caretEnsemble package allows us to build and combine different models.
The code
models1 <- caretList(V35~., data=trainingData, trControl=trainControl(method="cv", number=10, savePredictions=TRUE, classProbs=TRUE), methodList=c('knn', 'lda', 'rpart'))
results <- resamples(models1)
builds kNN, LDA, and CART models for the data. Observe results either using summary(results) (for a table) or dotplot(results) (for a graph). Note that the accuracies of all three models are comparable.
models1 <- caretList(V35~., data=sphereTraining,
trControl=trainControl(method="cv", number=10, savePredictions=TRUE, classProbs=TRUE),
methodList=c('knn', 'lda', 'rpart'))
results <- resamples(models1)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: knn, lda, rpart
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## knn 0.7600000 0.8100000 0.8575000 0.8501667 0.87875 0.92 0
## lda 0.7600000 0.8350000 0.8750000 0.8703333 0.91000 1.00 0
## rpart 0.7083333 0.8891667 0.9183333 0.8941667 0.92000 0.96 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## knn 0.4230769 0.5467033 0.6525199 0.6455658 0.7283935 0.8175182 0
## lda 0.3902439 0.6111632 0.7019704 0.6890054 0.7929140 1.0000000 0
## rpart 0.3913043 0.7624470 0.8157359 0.7747473 0.8324250 0.9110320 0
dotplot(results)
Both the summary statistics and the dotplot visualization show the accuracy and kappa values of the kNN, LDA, and CART models. The accuracies of the three models are comparable at the 95 percent confidence level, but the CART model exhibits the highest accuracy and kappa values. The summary indicates that the CART model has a mean accuracy of 89.42% with a corresponding mean kappa of 0.7747473. The LDA model has a mean accuracy of 87.03% and a mean kappa of 0.6890054, while the kNN model has the lowest mean accuracy, 85.02%, with a mean kappa of 0.6455658.
Use modelCor(results) to examine the correlations between the predictions of the different models. Note that the LDA and kNN models are highly (>0.75) correlated. Re-run the previous code without 'lda'.
modelCor(results)
## knn lda rpart
## knn 1.0000000 0.7927249 0.5740615
## lda 0.7927249 1.0000000 0.2232201
## rpart 0.5740615 0.2232201 1.0000000
models1 <- caretList(V35~., data=sphereTraining,
trControl=trainControl(method="cv", number=10, savePredictions=TRUE, classProbs=TRUE),
methodList=c('knn', 'rpart'))
results <- resamples(models1)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: knn, rpart
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## knn 0.8000000 0.835 0.88 0.8666667 0.9075 0.92 0
## rpart 0.8333333 0.840 0.88 0.8993333 0.9500 1.00 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## knn 0.5318352 0.6111632 0.7191011 0.6886457 0.7797753 0.8263889 0
## rpart 0.6250000 0.6604694 0.7381913 0.7832126 0.8930379 1.0000000 0
dotplot(results)
The modelCor function indicates that the LDA and kNN models are highly (>0.75) correlated: the correlation between their predictions is 0.7927249. The correlation between the CART and kNN models is 0.5740615, and the correlation between the CART and LDA models is 0.2232201.
Both the summary statistics and the plot suggest that eliminating the LDA model reduces the correlations among the remaining models while slightly increasing their accuracies.
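A programmatic alternative is sketched below using caret's findCorrelation helper on the correlation matrix printed above. Note that it would flag 'knn' rather than 'lda', because findCorrelation drops the member of a correlated pair with the larger mean absolute correlation; removing either model breaks the pair.
corMat <- matrix(c(1, 0.7927249, 0.5740615,
                   0.7927249, 1, 0.2232201,
                   0.5740615, 0.2232201, 1),
                 nrow = 3, dimnames = list(c("knn", "lda", "rpart"), c("knn", "lda", "rpart")))
findCorrelation(corMat, cutoff = 0.75, names = TRUE)  # expected to return "knn"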
The caretStack function builds a combined or "stacked" model from the predictions of different models. Use
STACK1 <- caretStack(models1, method="glm", metric="Accuracy", trControl=trainControl(method="cv", number=10, savePredictions=TRUE, classProbs=TRUE))
to build a stacked model.
STACK1 <- caretStack(models1, method ="glm", metric ="Accuracy", trControl = trainControl(method="cv", number=10, savePredictions = TRUE, classProbs = TRUE))
print(STACK1)
## A glm ensemble of 2 base models: knn, rpart
##
## Ensemble results:
## Generalized Linear Model
##
## 247 samples
## 2 predictor
## 2 classes: 'b', 'g'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 222, 223, 223, 222, 222, 223, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9273333 0.8388854
predictIonosphereData <- predict(STACK1, newdata=sphereTraining)
confusionMatrix(predictIonosphereData, sphereTraining$V35)
## Confusion Matrix and Statistics
##
## Reference
## Prediction b g
## b 26 157
## g 63 1
##
## Accuracy : 0.1093
## 95% CI : (0.0733, 0.155)
## No Information Rate : 0.6397
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.5701
##
## Mcnemar's Test P-Value : 3.609e-10
##
## Sensitivity : 0.292135
## Specificity : 0.006329
## Pos Pred Value : 0.142077
## Neg Pred Value : 0.015625
## Prevalence : 0.360324
## Detection Rate : 0.105263
## Detection Prevalence : 0.740891
## Balanced Accuracy : 0.149232
##
## 'Positive' Class : b
##
How does the accuracy of the stacked model compare to the accuracies of the separate kNN and CART models on the training data?
The accuracy of the STACK1 model compares very unfavorably to the mean accuracies of the separate kNN and CART models on the ionosphere training data. Of the three models, STACK1 has the lowest accuracy at 10.93%, with a corresponding kappa of -0.5701. An accuracy this far below the no-information rate (63.97%), together with a strongly negative kappa, suggests the stacked model's predicted class labels are effectively inverted rather than merely uninformative. In contrast, both the kNN and CART models exhibited much better predictive capacity: the mean cross-validated accuracy for the kNN model from Part C is 86.67% (kappa 0.6886457) and for the CART model 89.33% (kappa 0.7832126). The separate CART model thus has the highest accuracy of the three on the training data.
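As a quick diagnostic (a sketch, not part of the assignment), one can flip the stacked model's predicted labels and recompute the accuracy; if label inversion is the problem, the flipped predictions should score far above the no-information rate.
flipped <- factor(ifelse(predictIonosphereData == "b", "g", "b"),
                  levels = levels(sphereTraining$V35))  # swap the predicted classes
confusionMatrix(flipped, sphereTraining$V35)$overall["Accuracy"]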
How does the accuracy of the stacked model compare to the accuracies of the separate kNN and CART models on the testing data?
predictIonosphereData <- predict(STACK1, newdata=sphereTesting)
confusionMatrix(predictIonosphereData, sphereTesting$V35)
## Confusion Matrix and Statistics
##
## Reference
## Prediction b g
## b 14 67
## g 23 0
##
## Accuracy : 0.1346
## 95% CI : (0.0756, 0.2155)
## No Information Rate : 0.6442
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.4909
##
## Mcnemar's Test P-Value : 5.826e-06
##
## Sensitivity : 0.3784
## Specificity : 0.0000
## Pos Pred Value : 0.1728
## Neg Pred Value : 0.0000
## Prevalence : 0.3558
## Detection Rate : 0.1346
## Detection Prevalence : 0.7788
## Balanced Accuracy : 0.1892
##
## 'Positive' Class : b
##
predictkNN <- predict(models1$knn, newdata=sphereTesting)
confusionMatrix(predictkNN, sphereTesting$V35)
## Confusion Matrix and Statistics
##
## Reference
## Prediction b g
## b 20 2
## g 17 65
##
## Accuracy : 0.8173
## 95% CI : (0.7295, 0.8863)
## No Information Rate : 0.6442
## P-Value [Acc > NIR] : 8.509e-05
##
## Kappa : 0.5617
##
## Mcnemar's Test P-Value : 0.001319
##
## Sensitivity : 0.5405
## Specificity : 0.9701
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.7927
## Prevalence : 0.3558
## Detection Rate : 0.1923
## Detection Prevalence : 0.2115
## Balanced Accuracy : 0.7553
##
## 'Positive' Class : b
##
predictCART <- predict(models1$rpart, newdata=sphereTesting)
confusionMatrix(predictCART, sphereTesting$V35)
## Confusion Matrix and Statistics
##
## Reference
## Prediction b g
## b 30 5
## g 7 62
##
## Accuracy : 0.8846
## 95% CI : (0.8071, 0.9389)
## No Information Rate : 0.6442
## P-Value [Acc > NIR] : 2.487e-08
##
## Kappa : 0.7452
##
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.8108
## Specificity : 0.9254
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.8986
## Prevalence : 0.3558
## Detection Rate : 0.2885
## Detection Prevalence : 0.3365
## Balanced Accuracy : 0.8681
##
## 'Positive' Class : b
##
In summarizing the effectiveness of the models on the ionosphere testing data, the accuracy of the STACK1 model again compares unfavorably to the accuracies of the kNN and CART models. The STACK1 model has an accuracy of only 13.46% and a kappa of -0.4909, consistent with the inverted predictions seen on the training data. The accuracies of the kNN and CART models are much stronger: the kNN model has an accuracy of 81.73% with a moderate kappa of 0.5617, and the CART model performed the strongest with an accuracy of 88.46% and a strong kappa of 0.7452.
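For a compact side-by-side view, the testing accuracies can be collected in one step (a sketch; it assumes STACK1 and models1 are still in scope):
testAcc <- sapply(list(STACK1 = STACK1, kNN = models1$knn, CART = models1$rpart),
                  function(m) confusionMatrix(predict(m, newdata = sphereTesting),
                                              sphereTesting$V35)$overall["Accuracy"])
round(testAcc, 4)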
The ISLR package contains a dataset called Khan that consists of gene expression measurements indicating one of four types of small round blue cell tumours of childhood (SRBCT).
The data are already split into training and testing sets. Make dataframes for each and set the seed to 12345.
data(Khan)
data_train = data.frame(R = as.factor(Khan$ytrain),Khan$xtrain)
data_test = data.frame(R = as.factor(Khan$ytest),Khan$xtest)
set.seed(12345)
Par = startParallel()
Build CART (method="rpart"), random forest (method="rf"), gradient boosting machine (method="gbm"), and SVM (method="svmLinear") models for the response and call them CART2, RF2, GBM2, and SVM2, respectively. Be sure to suppress computational details that your reader doesn't want to see.
CART2 = train(R~.,data = data_train, method = "rpart",
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
preProcess = c("center", "scale"))
RF2 = train(R~.,data = data_train, method = "rf",
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
preProcess = c("center", "scale"))
GBM2 = train(R~.,data = data_train, method = "gbm",
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
preProcess = c("center", "scale"),
verbose = FALSE)  # suppress gbm's per-iteration training trace
SVM2 = train(R~.,data = data_train, method = "svmLinear",
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
preProcess = c("center", "scale"))
Compare the accuracies of the four models on the training data.
confusionMatrix(predict(CART2, data_train), data_train$R)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 0 0 0 0
## 2 0 22 0 0
## 3 7 1 12 0
## 4 1 0 0 20
##
## Overall Statistics
##
## Accuracy : 0.8571
## 95% CI : (0.7461, 0.9325)
## No Information Rate : 0.3651
## P-Value [Acc > NIR] : 1.023e-15
##
## Kappa : 0.7977
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.000 0.9565 1.0000 1.0000
## Specificity 1.000 1.0000 0.8431 0.9767
## Pos Pred Value NaN 1.0000 0.6000 0.9524
## Neg Pred Value 0.873 0.9756 1.0000 1.0000
## Prevalence 0.127 0.3651 0.1905 0.3175
## Detection Rate 0.000 0.3492 0.1905 0.3175
## Detection Prevalence 0.000 0.3492 0.3175 0.3333
## Balanced Accuracy 0.500 0.9783 0.9216 0.9884
confusionMatrix(predict(RF2, data_train), data_train$R)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 8 0 0 0
## 2 0 23 0 0
## 3 0 0 12 0
## 4 0 0 0 20
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9431, 1)
## No Information Rate : 0.3651
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.000 1.0000 1.0000 1.0000
## Specificity 1.000 1.0000 1.0000 1.0000
## Pos Pred Value 1.000 1.0000 1.0000 1.0000
## Neg Pred Value 1.000 1.0000 1.0000 1.0000
## Prevalence 0.127 0.3651 0.1905 0.3175
## Detection Rate 0.127 0.3651 0.1905 0.3175
## Detection Prevalence 0.127 0.3651 0.1905 0.3175
## Balanced Accuracy 1.000 1.0000 1.0000 1.0000
confusionMatrix(predict(GBM2, data_train), data_train$R)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 8 0 0 0
## 2 0 23 0 0
## 3 0 0 12 0
## 4 0 0 0 20
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9431, 1)
## No Information Rate : 0.3651
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.000 1.0000 1.0000 1.0000
## Specificity 1.000 1.0000 1.0000 1.0000
## Pos Pred Value 1.000 1.0000 1.0000 1.0000
## Neg Pred Value 1.000 1.0000 1.0000 1.0000
## Prevalence 0.127 0.3651 0.1905 0.3175
## Detection Rate 0.127 0.3651 0.1905 0.3175
## Detection Prevalence 0.127 0.3651 0.1905 0.3175
## Balanced Accuracy 1.000 1.0000 1.0000 1.0000
confusionMatrix(predict(SVM2, data_train), data_train$R)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 8 0 0 0
## 2 0 23 0 0
## 3 0 0 12 0
## 4 0 0 0 20
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9431, 1)
## No Information Rate : 0.3651
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.000 1.0000 1.0000 1.0000
## Specificity 1.000 1.0000 1.0000 1.0000
## Pos Pred Value 1.000 1.0000 1.0000 1.0000
## Neg Pred Value 1.000 1.0000 1.0000 1.0000
## Prevalence 0.127 0.3651 0.1905 0.3175
## Detection Rate 0.127 0.3651 0.1905 0.3175
## Detection Prevalence 0.127 0.3651 0.1905 0.3175
## Balanced Accuracy 1.000 1.0000 1.0000 1.0000
There was little variation among the accuracies of the four models built on the Khan training dataset. The CART2 model exhibited a strong accuracy of 85.71% along with a strong kappa of 0.7977. The RF2, GBM2, and SVM2 models each exhibited an accuracy of 100% with kappa values of 1. Since each model was evaluated on the same data used to fit it, these resubstitution accuracies are optimistic; CART2 is the only model that does not achieve 100% accuracy on the training data. Perfect accuracy on the training dataset does not guarantee similar results on the testing dataset.
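A less optimistic comparison uses the cross-validation resamples rather than the training-set confusion matrices. A short sketch, assuming the four train objects above:
khanResults <- resamples(list(CART = CART2, RF = RF2, GBM = GBM2, SVM = SVM2))
summary(khanResults)$statistics$Accuracy  # CV accuracy distribution per model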
Compare the accuracies of the four models on the testing data.
confusionMatrix(predict(CART2, data_test), data_test$R)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 0 0 0 0
## 2 0 4 0 1
## 3 3 1 5 1
## 4 0 1 1 3
##
## Overall Statistics
##
## Accuracy : 0.6
## 95% CI : (0.3605, 0.8088)
## No Information Rate : 0.3
## P-Value [Acc > NIR] : 0.005138
##
## Kappa : 0.4386
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.00 0.6667 0.8333 0.6000
## Specificity 1.00 0.9286 0.6429 0.8667
## Pos Pred Value NaN 0.8000 0.5000 0.6000
## Neg Pred Value 0.85 0.8667 0.9000 0.8667
## Prevalence 0.15 0.3000 0.3000 0.2500
## Detection Rate 0.00 0.2000 0.2500 0.1500
## Detection Prevalence 0.00 0.2500 0.5000 0.2500
## Balanced Accuracy 0.50 0.7976 0.7381 0.7333
confusionMatrix(predict(RF2, data_test), data_test$R)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 3 0 0 0
## 2 0 6 1 0
## 3 0 0 5 0
## 4 0 0 0 5
##
## Overall Statistics
##
## Accuracy : 0.95
## 95% CI : (0.7513, 0.9987)
## No Information Rate : 0.3
## P-Value [Acc > NIR] : 1.662e-09
##
## Kappa : 0.932
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.00 1.0000 0.8333 1.00
## Specificity 1.00 0.9286 1.0000 1.00
## Pos Pred Value 1.00 0.8571 1.0000 1.00
## Neg Pred Value 1.00 1.0000 0.9333 1.00
## Prevalence 0.15 0.3000 0.3000 0.25
## Detection Rate 0.15 0.3000 0.2500 0.25
## Detection Prevalence 0.15 0.3500 0.2500 0.25
## Balanced Accuracy 1.00 0.9643 0.9167 1.00
confusionMatrix(predict(GBM2, data_test), data_test$R)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 3 1 0 0
## 2 0 5 0 0
## 3 0 0 6 0
## 4 0 0 0 5
##
## Overall Statistics
##
## Accuracy : 0.95
## 95% CI : (0.7513, 0.9987)
## No Information Rate : 0.3
## P-Value [Acc > NIR] : 1.662e-09
##
## Kappa : 0.9327
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.0000 0.8333 1.0 1.00
## Specificity 0.9412 1.0000 1.0 1.00
## Pos Pred Value 0.7500 1.0000 1.0 1.00
## Neg Pred Value 1.0000 0.9333 1.0 1.00
## Prevalence 0.1500 0.3000 0.3 0.25
## Detection Rate 0.1500 0.2500 0.3 0.25
## Detection Prevalence 0.2000 0.2500 0.3 0.25
## Balanced Accuracy 0.9706 0.9167 1.0 1.00
confusionMatrix(predict(SVM2, data_test), data_test$R)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4
## 1 3 0 0 0
## 2 0 6 2 0
## 3 0 0 4 0
## 4 0 0 0 5
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.683, 0.9877)
## No Information Rate : 0.3
## P-Value [Acc > NIR] : 3.773e-08
##
## Kappa : 0.8639
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 1.00 1.0000 0.6667 1.00
## Specificity 1.00 0.8571 1.0000 1.00
## Pos Pred Value 1.00 0.7500 1.0000 1.00
## Neg Pred Value 1.00 1.0000 0.8750 1.00
## Prevalence 0.15 0.3000 0.3000 0.25
## Detection Rate 0.15 0.3000 0.2000 0.25
## Detection Prevalence 0.15 0.4000 0.2000 0.25
## Balanced Accuracy 1.00 0.9286 0.8333 1.00
endParallel(Par)
## user system elapsed
## 18.748 0.774 102.513
In summary, the random forest (RF2) and gradient boosting machine (GBM2) models exhibited 95% accuracy on the testing data, followed by the SVM2 model at 90%. The CART2 model performed substantially lower, with only 60% accuracy. The kappa value of the CART2 model was low to moderate at 0.4386, indicating that some of its correct predictions could be due to chance alone. The kappa values for the RF2 and GBM2 models were high at 0.932 and 0.9327, respectively, and the kappa for SVM2 was also high at 0.8639, supporting the accuracy levels. Kappa essentially adjusts the accuracy by accounting for the possibility of a correct prediction by chance alone.
The website https://archive.ics.uci.edu/ml/datasets/Energy+efficiency contains data on building characteristics and energy efficiency. What is new here is that there are two response variables, heating load (Y1) and cooling load (Y2).
Load the data into R and normalize it so it is suitable for analysis by a neural network. Set the seed to 12345 and use createDataPartition (using Y1) with \(p=0.7\) to partition the data into training and testing sets.
data <- read_excel("ENB2012_data.xlsx")
# min-max normalization maps each column onto [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
data_norm <- as.data.frame(lapply(data, normalize))
set.seed(12345)
index <- createDataPartition(y=data_norm$Y1, p=0.7, list=FALSE)
data_train <- data_norm[index,]
data_test <- data_norm[-index,]
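A quick sanity check (a sketch): after min-max normalization, every column of data_norm should span exactly [0, 1].
range(sapply(data_norm, range))  # expected: 0 1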
Use train with method = "nnet" to build a neural network model for Y1 using X1, X2, …, X8 as predictors. (Don't use Y2 as a predictor.) Call the model NN3b. Also, note that there is no need to center and scale the predictors.
NN3b <- train(Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = data_train, method = "nnet", trace = FALSE)
Compute R2 for NN3b on the testing data.
NN3bR2 = R2(predict(NN3b, data_test),data_test$Y1)
NN3bR2
## [,1]
## [1,] 0.9927248
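As a complementary error measure (a sketch; note that the RMSE here is on the normalized scale of Y1):
RMSE(predict(NN3b, data_test), data_test$Y1)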
To predict both Y1 and Y2, use the neuralnet function in the neuralnet package. Using the training data, build a model called NN3d with one hidden unit in one hidden layer. Plot NN3d using plot(NN3d, rep="best") and use the testing data to compute R2 for Y1 and Y2.
NN3d = neuralnet(Y1 + Y2 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = data_train, hidden = 1)
plot(NN3d, rep="best")
Y1_R2_D = R2(compute(NN3d, data_test[,1:8])$net.result[,1], data_test$Y1)  # column 1 holds the Y1 predictions
Y2_R2_D = R2(compute(NN3d, data_test[,1:8])$net.result[,2], data_test$Y2)  # column 2 holds the Y2 predictions
Y1_R2_D
## [1] 0.9074511
Y2_R2_D
## [1] 0.871583
The \(R^2\) value of 0.9074511 for Y1 is high, indicating that the neural network model fits the first response well. The \(R^2\) value of 0.871583 for the second response, Y2, is also high, so the single-hidden-unit model has good predictive capacity for both responses.
Make a new model, NN3e, for Y1 and Y2 with two hidden layers, the first having 2 nodes and the second having 1 node. Plot NN3e using plot(NN3e, rep="best") and use the testing data to compute R2 for Y1 and Y2.
NN3e = neuralnet(Y1 + Y2 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = data_train, hidden = c(2,1), stepmax = 225000)
plot(NN3e, rep="best")
Y1_R2_E = R2(compute(NN3e, data_test[,1:8])$net.result[,1], data_test$Y1)
Y2_R2_E = R2(compute(NN3e, data_test[,1:8])$net.result[,2], data_test$Y2)
Y1_R2_E
## [1] 0.9762186
Y2_R2_E
## [1] 0.9432155
The \(R^2\) value of 0.9762186 for Y1 is high, indicating a good fit. The \(R^2\) value of 0.9432155 for Y2 is also high, so the deeper model improves on NN3d for both responses.
In this problem, we investigate the importance of normalizing the data before constructing a neural network model. Consider the Boston data set in the MASS package with lstat predicting medv.
Set the seed to 12345.
set.seed(12345)
Construct a neural network model called NN4b with one hidden layer containing one hidden variable. Plot the data and superimpose the model over the data. Comment on the quality of the fit.
NN4b <- neuralnet(medv ~ lstat, data = Boston, hidden = 1)
pred <- predict(NN4b, Boston)
plot(Boston$lstat, Boston$medv, type = "p", pch = 20, xlab="Lstat", ylab="Medv", main="Observed Boston Data")
lines(lowess(Boston$lstat, pred), lwd=2, col="red")
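To probe why the fit degenerates, one can inspect the pre-activation values feeding the single hidden unit (a diagnostic sketch, not part of the assignment); values far from zero sit in the flat tails of the logistic activation, meaning the hidden unit is saturated and the output varies little with lstat.
w <- NN4b$weights[[1]]  # [[1]]: input->hidden weights (bias in row 1), [[2]]: hidden->output
preact <- w[[1]][1, 1] + w[[1]][2, 1] * Boston$lstat
range(preact)  # a range far from 0 would indicate a saturated hidden unit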
The plot displays the Boston data points with the neural network model superimposed. The data points are represented in black and the model is represented by the red line. This neural network model clearly does not fit the Boston data: the points follow a curved, decreasing trend, while the model's predictions form essentially a straight line. The model does not effectively represent the relationship between lstat and medv, most likely because the unnormalized inputs prevent the network from training properly.
Construct a new dataframe containing a normalized version of the Boston data.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
bostonNormalize <- as.data.frame(lapply(Boston, normalize))
Using the normalized data, construct a neural network model called NN4d with one hidden layer containing one hidden variable. Plot the data and superimpose the model over the data and make the curve red. Comment on the quality of the fit.
NN4d <- neuralnet(medv~lstat, data=bostonNormalize, hidden=1)
plot(bostonNormalize$lstat, bostonNormalize$medv, pch=20, xlab="Lstat", ylab="Medv", main="Normalized Boston Data")
pred4d <- predict(NN4d, bostonNormalize)
lines(lowess(bostonNormalize$lstat, pred4d), lwd=2, col="red")
The graph exhibits the plotted Boston data with the neural network model superimposed. The normalized data are represented in black while the model is indicated by the red line. This neural network model fits the normalized Boston data well: both the model and the data follow the same curved trend. While the model does not perfectly represent the normalized data, it is much stronger than the model generated in Problem 4(b). Some data points are outliers, but this model is well suited to the Boston dataset.
Use plot(NN4d, rep="best") to visualize the model and write down the corresponding equation. Use S to indicate the activation function.
plot(NN4d, rep = "best")
The diagram visualizes the neural network model NN4d from Problem 4(d). This model was built using one hidden layer containing one hidden variable. The corresponding equation of model NN4d is: \[medv = 2.70587 - 2.53167S(0.99799 + 6.03666(lstat))\]
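The same coefficients can be read directly from the fitted object rather than off the plot. A minimal verification sketch (the specific numbers above come from this particular run and will differ if the model is refit):
S <- function(z) 1 / (1 + exp(-z))  # logistic activation
w <- NN4d$weights[[1]]              # [[1]]: input->hidden, [[2]]: hidden->output (bias in row 1)
manual <- w[[2]][1, 1] + w[[2]][2, 1] * S(w[[1]][1, 1] + w[[1]][2, 1] * bostonNormalize$lstat)
all.equal(as.numeric(manual), as.numeric(predict(NN4d, bostonNormalize)))  # should be TRUE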
Using the normalized data, construct a neural network model called NN4f with two hidden layers containing two hidden variables each. Plot the data and superimpose the model over the data and make the curve blue. Comment on the quality of the fit.
NN4f <- neuralnet(medv~lstat, data=bostonNormalize, hidden=c(2, 2))
plot(bostonNormalize$lstat, bostonNormalize$medv, pch=20, xlab="Lstat", ylab="Medv", main="Normalized Boston Data Two Hidden Variables")
pred4f <- predict(NN4f, bostonNormalize)
lines(lowess(bostonNormalize$lstat, pred4f), lwd=2, col="blue")
The graph exhibits the data points from the normalized Boston dataset with the neural network model NN4f, built with two hidden layers, superimposed over the data. The black points represent the normalized Boston data and the blue line represents the neural network model. This model fits the normalized data, following the same curved trend as the points. Like the model in Problem 4(d), this model represents a good fit for the data.