Due Date: Monday March 11, 2019 at 5:59 pm.

Introduction

We are going to use a simulated two-class data set with 200 observations for training and 100 observations for testing, which includes two features, and in which there is a visible but non-linear separation between the two classes. Use the code below for creating such a dataset.

rm(list = ls())
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# set a seed
set.seed(1)

# ---- Create a training set ---- #
# create a matrix with 200 rows and two colums with values sampled from a normal distribution.
x <- matrix(rnorm(200*2), ncol = 2)
# Introduce some non-linearity where we move points around
x[1:100,] <- x[1:100,] + 2
x[101:150,] <- x[101:150,] -2
# assign class labels
y <- c(rep(1, 150), rep(0, 50))
# this forms a training set
d.train <- data.frame(x = x, y = as.factor(y))
names(d.train) <- c("X1", "X2", "Y")


# ---- Create a test set ---- #
# create a matrix with 100 rows and two colums with values sampled from a normal distribution.
x <- matrix(rnorm(100*2), ncol = 2)
# Introduce some non-linearity where we move points around
x[1:25,] <- x[1:25,] + 2 # moves points to the top-right of a 2D space
x[26:75,] <- x[26:75,] -2 # moves points to the bottom-left of a 2D space
# assign class labels
y <- c(rep(1, 75), rep(0, 25)) 
# this forms a testing set
d.test <- data.frame(x = x, y = as.factor(y))
names(d.test) <- c("X1", "X2", "Y")

Question 1

Create a scatter-plot of all data points in the training set color-labeled by their class type. You will notice that one class is in the center of all points of the other class. In other words, the separation between the classes is a circle around the points with Y as -1. Repeat the same for the testing set.

# Insert your code below

#Scatterplot for train data
plot(x=d.train$X1, y=d.train$X2, type="p", col=d.train$Y, pch=19, cex=1, xlab="X1", ylab="X2", main="Scatterplot for Train Data")

#Scatterplot for test data
plot(x=d.train$X1, y=d.train$X2, type="p", col=d.test$Y, pch=19, cex=1, xlab="X1", ylab="X2", main="Scatterplot for Test Data")

Question 2

Buid a neural network with a variable hidden layer network size from 2 to 50. Feel free to explore different decay rates using “expand.grid” as shown in class. Perform testing on d.test and report the final AUC with 95% CI.

# Insert your code below
library(nnet) 
library(plyr) 
library(dplyr) 
library(pROC) 
library(caret)

#Changing format of Y so it's recognizable to caret package
d.train$Y_cat <- ifelse(d.train$Y == 1, "Yes", "No") 
#d.train <- d.train %>% select(-c(Y))

d.test$Y_cat <- ifelse(d.test$Y == 1, "Yes", "No") 
#d.test <- d.test %>% select(-c(Y))

#Set training parameters to build NN 
fit_control <- trainControl(method = "cv", number = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

#trainControl- controls computational nuances of train fn 
#method- resampling method
#cv- for repeated, training/test splits
#number- no. of folds/no. of resampling iterations 
#classProbs- class probabilities for classification models (+ pred values) in each resample
#summaryFunction- this fn computes metrics across resamples 
#twoClassSummary- metric that relies on class probabilities

#Set of parameters to train over
nnet_params <- expand.grid(size = seq(from = 2, to = 50, by = 1), decay = 5e-4)
head(nnet_params)
##   size decay
## 1    2 5e-04
## 2    3 5e-04
## 3    4 5e-04
## 4    5 5e-04
## 5    6 5e-04
## 6    7 5e-04
#expand.grid-creates a df from all combinations of factor vars 
#from=2 is first net layer; from=50 is second net layer; by=1 means there's one layer from 2-50
#decay 5e-4 is standard for nnets 

#create the model using training data
d.model.train <- train(Y_cat ~ X1+X2, data = d.train,
                 method = "nnet",
                 metric = "ROC",
                 trControl = fit_control,
                 tuneGrid = nnet_params,
               trace = FALSE)

print(d.model.train)
## Neural Network 
## 
## 200 samples
##   2 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 133, 133, 134 
## Resampling results across tuning parameters:
## 
##   size  ROC        Sens       Spec     
##    2    0.9450245  0.8786765  0.8866667
##    3    0.9428431  0.8002451  0.8866667
##    4    0.9267402  0.7806373  0.8866667
##    5    0.8890931  0.7218137  0.8600000
##    6    0.9343137  0.6813725  0.9133333
##    7    0.9354412  0.7598039  0.9133333
##    8    0.8857353  0.6801471  0.8466667
##    9    0.8596814  0.6580882  0.8800000
##   10    0.9066667  0.7377451  0.8733333
##   11    0.9117892  0.7193627  0.8866667
##   12    0.9137010  0.6997549  0.8800000
##   13    0.8692892  0.5600490  0.8666667
##   14    0.8822549  0.6580882  0.8400000
##   15    0.8480270  0.5404412  0.8933333
##   16    0.8792157  0.6997549  0.8600000
##   17    0.8839706  0.6776961  0.8733333
##   18    0.8710539  0.5796569  0.8666667
##   19    0.8693137  0.6813725  0.8533333
##   20    0.8963971  0.7389706  0.8733333
##   21    0.8648529  0.6372549  0.8533333
##   22    0.9007108  0.6752451  0.8933333
##   23    0.8845466  0.6384804  0.8666667
##   24    0.8380515  0.5808824  0.8666667
##   25    0.8901716  0.7769608  0.8466667
##   26    0.8497549  0.5980392  0.8666667
##   27    0.8738725  0.6997549  0.8800000
##   28    0.8768137  0.6605392  0.8733333
##   29    0.8459069  0.5588235  0.8933333
##   30    0.9008824  0.5992647  0.8666667
##   31    0.8649755  0.5992647  0.8733333
##   32    0.8795833  0.6397059  0.8733333
##   33    0.8670588  0.6789216  0.8533333
##   34    0.8653799  0.5968137  0.8466667
##   35    0.8837990  0.6372549  0.8666667
##   36    0.8757843  0.5968137  0.8733333
##   37    0.8613113  0.6188725  0.8866667
##   38    0.8778431  0.6397059  0.8600000
##   39    0.8925735  0.6985294  0.8400000
##   40    0.8829412  0.5980392  0.8800000
##   41    0.8782353  0.6776961  0.8800000
##   42    0.8927941  0.6789216  0.8600000
##   43    0.8796078  0.6397059  0.8466667
##   44    0.8825980  0.6789216  0.8666667
##   45    0.8742402  0.6801471  0.8600000
##   46    0.9046324  0.7181373  0.8666667
##   47    0.8656127  0.6985294  0.8466667
##   48    0.8955882  0.6397059  0.8733333
##   49    0.8556618  0.6985294  0.8666667
##   50    0.8682721  0.6593137  0.8600000
## 
## Tuning parameter 'decay' was held constant at a value of 5e-04
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 2 and decay = 5e-04.
d.model.test <- predict(d.model.train, newdata = d.test, type = "prob") 
head(d.model.test)
##             No       Yes
## 1 0.0003460934 0.9996539
## 2 0.0001199452 0.9998801
## 3 0.4058806647 0.5941193
## 4 0.0260269924 0.9739730
## 5 0.0017396988 0.9982603
## 6 0.0003169585 0.9996830
d.test$predict_Y <- d.model.test$Yes

#Creating prediction object 
pred_roc <- roc(response=d.test$Y_cat, predictor = d.test$predict_Y, direction= "<") 

#Get AUC performance 
auc_perf <- auc(pred_roc)
cat("AUC: ", auc_perf, "\n")
## AUC:  0.9493333
#Get 95%CI 
ci_auc_perf <- ci.auc(pred_roc)
cat("95% CI: ", ci_auc_perf, "\n")
## 95% CI:  0.9090687 0.9493333 0.989598

Question 3

  1. Build a logistic regression prediction model using d.train. Test on d.test, and report your test AUC.
# Insert your code below
#Logistic regression on dataset
set.seed(345)
d.train.log <- glm(Y ~ X1+X2, data=d.train, family="binomial")

#Perform prediction on test dataset
d.test$pred_Y <- predict.glm(d.train.log, newdata = d.test, type="response")
#we create a new variable, pred_Y1 to avoid above values being overwritten
d.test$pred_Y1 <- ifelse(d.test$pred_Y >= 0.5,1,0)

#For AUC and 95%CI
#Create a prediction object
pred <- roc(response = d.test$Y, predictor = d.test$pred_Y, direction = "<")

#Get AUC performance
auc_log <- auc(pred)
cat("AUC: ", auc_log, "\n")
## AUC:  0.3386667
#AUC 95% CI 
ci_auc_log <- ci.auc(pred)
cat("95% CI: ", ci_auc_log, "\n")
## 95% CI:  0.2358216 0.3386667 0.4415117
  1. Which of the two models leads to better performance? Explain in no more than 2 sentences why.

Ans. The first model, neural network, is better than the logistic regression model. The AUC of the neural network is 0.94 (AUC of 1 is a perfect predictor) while the AUC of the logistic regression is 0.5 (predictor is making random guesses).