Problem 1

Remove the factor variables from the Hitters data in the ISLR package and remove all records in which the Salary data is missing. To remove the heavy right skew, replace Salary with log(Salary) and call it logSalary. Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.

# Load packages
library(caret)
library(ISLR)
library(car)
library(leaps)
library(earth)
library(ridge)
library(parallel)
library(doParallel)

# Needed for #1
library(elasticnet)
library(lars)

# Needed for #2
library(AppliedPredictiveModeling)

# Needed for #3
library(MASS)
library(data.table)

# Needed for #4
library(plyr)

# Needed for #5
library(kernlab)
library(klaR)

# Parallel processing helpers: start a cluster on all but one core and note the
# start time; stop the cluster, return to sequential mode, and report elapsed time
startParallel <- function() {
  cluster <- makeCluster(detectCores() - 1)
  registerDoParallel(cluster)
  return(list("cluster" = cluster, "time" = proc.time()))
}
endParallel <- function(parallelData) {
  stopCluster(parallelData$cluster)
  registerDoSEQ()
  return(proc.time() - parallelData$time)
}


# Drop records with missing Salary, log-transform Salary, and remove the factor variables
df <- Hitters
df <- na.omit(df)
df$Salary <- log(df$Salary)
colnames(df)[colnames(df)=="Salary"] <- "logSalary"
df$League <- NULL
df$Division <- NULL
df$NewLeague <- NULL

set.seed(12345)

part <- createDataPartition(y=df$logSalary, p=0.7, list=FALSE)
training <- df[part,]
testing <- df[-part,]

(a)

Use the training data to build a linear model (LM1) for logSalary using all other variables as predictors. Evaluate the predictive capacity of LM1 by computing \(R^2\) for the testing data.

LM1 <- train(logSalary~., data=training, method="lm")
testing$predict1 = predict(LM1, testing)
cor(testing$predict1, testing$logSalary)^2
## [1] 0.4518025

(b)

Use the training data to build a ridge regression model (RIDGE1) for logSalary using all other variables as predictors. Use 10-fold cross validation and center and scale the predictors. Evaluate the predictive capacity of RIDGE1 by computing \(R^2\) for the testing data.

RIDGE1 <- train(logSalary~., data=training, method="ridge",
                trControl = trainControl(method = "cv", number = 10),
                preProcess = c("center","scale"))
testing$predict2 = predict(RIDGE1, testing)
cor(testing$predict2, testing$logSalary)^2
## [1] 0.4812084
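
The regularization weight selected by the 10-fold cross-validation can also be inspected from the fitted caret object (a minimal sketch; caret's "ridge" method tunes the elasticnet lambda parameter):

# Lambda value chosen by cross-validation (sketch)
RIDGE1$bestTune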

(c)

Use the training data to build a LASSO model (LASSO1) for logSalary using all other variables as predictors. Use 10-fold cross validation and center and scale the predictors. Evaluate the predictive capacity of LASSO1 by computing \(R^2\) for the testing data.

LASSO1 <- train(logSalary~., data=training, method="lasso",
                trControl = trainControl(method = "cv", number = 10),
                preProcess = c("center","scale"))
testing$predict3 = predict(LASSO1 , testing)
cor(testing$predict3, testing$logSalary)^2
## [1] 0.4686052

(d)

Use predict.enet(LASSO1$finalModel, type="coef", s=LASSO1$bestTune$fraction, mode="fraction") to determine which variables the LASSO removes from the model. Explicitly list the variables that are removed.

predict.enet(LASSO1$finalModel, type="coef", s=LASSO1$bestTune$fraction,mode="fraction")
## $s
## [1] 0.5
## 
## $fraction
##   0 
## 0.5 
## 
## $mode
## [1] "fraction"
## 
## $coefficients
##       AtBat        Hits       HmRun        Runs         RBI       Walks 
## -0.26718587  0.40050874  0.04954137  0.00000000  0.03072973  0.12541714 
##       Years      CAtBat       CHits      CHmRun       CRuns        CRBI 
##  0.13511795  0.00000000  0.45915555 -0.09788284  0.04494156  0.00000000 
##      CWalks     PutOuts     Assists      Errors 
## -0.07672909  0.11115853  0.09988299 -0.09130279
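
The dropped predictors can also be pulled out programmatically rather than read off the printout by eye; a minimal sketch reusing the same predict.enet call:

# Names of the predictors whose coefficients are shrunk to exactly zero (sketch)
lassoCoefs <- predict.enet(LASSO1$finalModel, type="coef",
                           s=LASSO1$bestTune$fraction, mode="fraction")$coefficients
names(lassoCoefs)[lassoCoefs == 0]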

As the coefficient listing shows, the LASSO removes Runs, CAtBat, and CRBI, since their coefficients are shrunk to exactly 0. After dropping these three variables, the refit model is:

# LASSO2 is the new model with the variables removed
LASSO2 <- train(logSalary~.-Runs-CRBI-CAtBat, data=training, method="lasso",
                trControl = trainControl(method = "cv", number = 10),
                preProcess = c("center","scale"))
testing$predict3 = predict(LASSO2 , testing)
cor(testing$predict3, testing$logSalary)^2
## [1] 0.4651678
predict.enet(LASSO2$finalModel, type="coef", s=LASSO2$bestTune$fraction,mode="fraction")
## $s
## [1] 0.9
## 
## $fraction
##   0 
## 0.9 
## 
## $mode
## [1] "fraction"
## 
## $coefficients
##       AtBat        Hits       HmRun         RBI       Walks       Years 
## -0.48406636  0.54029078  0.07539022  0.04288597  0.18244558  0.17512993 
##       CHits      CHmRun       CRuns      CWalks     PutOuts     Assists 
##  0.38641461 -0.13204280  0.23586551 -0.22167843  0.11956188  0.13870199 
##      Errors 
## -0.10320812

Problem 2

Predicting permeability of compounds can yield significant savings for a pharmaceutical company. In the AppliedPredictiveModeling package, the permeability data contains high dimensional data in the form of a matrix titled fingerprints with 165 observations on 1107 binary molecular predictors and a vector called permeability that contains the corresponding responses.

(a)

Combine the predictors and response into a single data frame and use nearZeroVar to remove all predictors that have small variance. Note that the data are still high dimensional.

data("permeability")
CompoundData <- cbind.data.frame(permeability, fingerprints)
cols <- nearZeroVar(CompoundData)
CompoundData <- CompoundData[, -cols]
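
A quick dimension check (a sketch) confirms that even after the near-zero-variance filter there are still far more predictors than the 165 observations:

dim(CompoundData)   # rows = compounds, columns = permeability plus the remaining fingerprints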

(b)

Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.

set.seed(12345)
trainingData = createDataPartition(CompoundData$permeability, p = 0.7, list=FALSE )
trainingSetCompound = CompoundData[trainingData, ]
testingSetCompound = CompoundData[-trainingData, ]

(c)

Build a linear model (LM2) with the training data using train with preProcess=c("center","scale") and compute \(R^2\) for LM2's performance on the testing data. R is going to complain about rank deficiency because there are more variables than observations. Just ignore those warnings for this problem.

LM2 <- train(permeability~., data=trainingSetCompound, method="lm",
             metric = "Rsquared", maximize = TRUE,
             preProcess = c("center", "scale"))
predicted_LM2 <- predict(LM2, testingSetCompound)
cor(testingSetCompound$permeability, predicted_LM2)^2
## [1] 0.2323023

(d)

Now incorporate principal components into your model. Build another linear model (LM2pca) with the training data using train with preProcess=c("center","scale","pca") and compute \(R^2\) for LM2pca's performance on the testing data. Did PCA increase \(R^2\)?

LM2pca = train(permeability~., data=trainingSetCompound, method="lm",
               preProcess=c("center","scale","pca"))
predicted2pca = predict(LM2pca, testingSetCompound)
cor(predicted2pca, testingSetCompound$permeability)^2
## [1] 0.3688938

The \(R^2\) value increased from 0.2323 to 0.3689, so PCA did improve predictive performance on the testing data, likely because projecting onto principal components removes the collinearity and rank deficiency among the many binary predictors.
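
One way to see how much PCA compressed the predictor space is to look at the number of components caret retained (enough to capture 95% of the variance by default). A minimal sketch, assuming caret stores this count as numComp inside the fitted model's preProcess element:

# Number of principal components kept by preProcess (sketch; element name assumed)
LM2pca$preProcess$numComp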

Problem 3

The website https://statweb.stanford.edu/~hastie/ElemStatLearn/datasets/ contains microarray data for 14 different types of cancer.

(a)

Use fread in the data.table package to read the training and test sets from the website directly into R. Make data frames for the training and testing data that combine the predictors and the response, and convert the response to a factor variable. Note that the data are high dimensional.

xtrain<-fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.xtrain")
ytrain<-fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.ytrain")

xtest<-fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.xtest")
ytest<-fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.ytest")

# The raw files store one gene per row and one sample per column, so transpose
# to put samples in rows, then attach the cancer type as a factor response
trainingSet = transpose(xtrain)
testingSet = transpose(xtest)
trainingSet$response = as.factor(t(ytrain))
testingSet$response = as.factor(t(ytest))
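
Before modeling, it is worth noting how few observations there are relative to the number of gene-expression predictors; a quick sketch:

dim(trainingSet)              # samples by (genes + response)
table(trainingSet$response)   # training observations available for each of the 14 cancer types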

(b)

Set the seed to 12345, and use the training data to build a linear discriminant analysis model (LDA3) that predicts the type of cancer and evaluate the accuracy of LDA3 with the testing data.

# Development-only run on a 10% stratified subset, kept here but disabled;
# the full training set is used for the final LDA3 fit below
if(FALSE) {
set.seed(12345)
partitionIndex = createDataPartition(trainingSet$response, p = .1, list=FALSE )
numTrain1 = trainingSet[partitionIndex, ] 

partitionIndex = createDataPartition(testingSet$response, p = .1, list=FALSE )
numTest1 = testingSet[partitionIndex, ] 

pData <- startParallel()
LDA3 <- train(response ~ ., 
              data = numTrain1, 
              method = "lda", 
              trControl = trainControl(allowParallel = TRUE))
endParallel(pData)
predictedLDA3 <- predict(LDA3, numTest1)
LDA3CM <- confusionMatrix(predictedLDA3, numTest1$response)
LDA3CM
}

set.seed(12345)

pData <- startParallel()
LDA3 <- train(response ~ ., 
              data = trainingSet, 
              method = "lda", 
              trControl = trainControl(allowParallel = TRUE))
endParallel(pData)
##     user   system  elapsed 
##   81.019    4.825 2627.274
predictedLDA3 <- predict(LDA3, testingSet)
LDA3CM <- confusionMatrix(predictedLDA3, testingSet$response)
LDA3CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##         1  1 0 0 1 0 2 0 0 0  0  2  1  1  0
##         2  0 2 1 0 0 0 0 0 0  0  0  0  0  0
##         3  1 0 2 0 1 0 1 0 0  1  0  0  0  0
##         4  0 0 0 3 0 0 0 0 0  0  0  0  0  0
##         5  0 1 0 0 4 0 0 0 2  0  0  0  0  0
##         6  1 0 0 0 0 1 0 0 0  0  0  2  0  0
##         7  0 1 0 0 0 0 1 0 0  0  0  0  0  0
##         8  1 0 1 0 0 0 0 2 0  2  0  0  0  0
##         9  0 0 0 0 0 0 0 0 4  0  0  0  0  0
##         10 0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         11 0 0 0 0 0 0 0 0 0  0  1  0  0  0
##         12 0 1 0 0 1 0 0 0 0  0  0  0  0  0
##         13 0 1 0 0 0 0 0 0 0  0  0  1  2  0
##         14 0 0 0 0 0 0 0 0 0  0  0  0  0  4
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.3608, 0.6392)
##     No Information Rate : 0.1111          
##     P-Value [Acc > NIR] : 1.581e-12       
##                                           
##                   Kappa : 0.4594          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.25000  0.33333  0.50000  0.75000  0.66667  0.33333
## Specificity           0.86000  0.97917  0.92000  1.00000  0.93750  0.94118
## Pos Pred Value        0.12500  0.66667  0.33333  1.00000  0.57143  0.25000
## Neg Pred Value        0.93478  0.92157  0.95833  0.98039  0.95745  0.96000
## Prevalence            0.07407  0.11111  0.07407  0.07407  0.11111  0.05556
## Detection Rate        0.01852  0.03704  0.03704  0.05556  0.07407  0.01852
## Detection Prevalence  0.14815  0.05556  0.11111  0.05556  0.12963  0.07407
## Balanced Accuracy     0.55500  0.65625  0.71000  0.87500  0.80208  0.63725
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.50000  1.00000  0.66667   0.00000   0.33333
## Specificity           0.98077  0.92308  1.00000   1.00000   1.00000
## Pos Pred Value        0.50000  0.33333  1.00000       NaN   1.00000
## Neg Pred Value        0.98077  1.00000  0.96000   0.94444   0.96226
## Prevalence            0.03704  0.03704  0.11111   0.05556   0.05556
## Detection Rate        0.01852  0.03704  0.07407   0.00000   0.01852
## Detection Prevalence  0.03704  0.11111  0.07407   0.00000   0.01852
## Balanced Accuracy     0.74038  0.96154  0.83333   0.50000   0.66667
##                      Class: 12 Class: 13 Class: 14
## Sensitivity            0.00000   0.66667   1.00000
## Specificity            0.96000   0.96078   1.00000
## Pos Pred Value         0.00000   0.50000   1.00000
## Neg Pred Value         0.92308   0.98000   1.00000
## Prevalence             0.07407   0.05556   0.07407
## Detection Rate         0.00000   0.03704   0.07407
## Detection Prevalence   0.03704   0.07407   0.07407
## Balanced Accuracy      0.48000   0.81373   1.00000

(c)

Use the training data to build a k-nearest neighbors model (KNN3) that predicts the type of cancer and evaluate the accuracy of KNN3 with the testing data.

set.seed(12345)

pData <- startParallel()
KNN3 <- train(response ~ ., 
              data = trainingSet, 
              method = "knn", 
              trControl = trainControl(allowParallel = TRUE))
endParallel(pData)
predictedKNN3 <- predict(KNN3, testingSet)
KNN3CM <- confusionMatrix(predictedKNN3, testingSet$response)
KNN3CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##         1  0 1 0 2 0 0 0 1 0  0  1  0  0  0
##         2  0 0 1 0 0 0 0 0 0  0  0  0  0  0
##         3  0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         4  0 1 0 1 0 0 0 1 0  1  0  2  0  0
##         5  0 2 0 0 4 0 0 0 1  0  0  0  0  0
##         6  1 0 1 0 0 1 1 0 0  0  1  0  0  0
##         7  0 1 0 0 0 1 0 0 0  0  0  0  0  0
##         8  1 1 0 1 0 0 1 0 0  2  1  1  1  0
##         9  0 0 0 0 0 0 0 0 5  0  0  0  0  0
##         10 0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         11 1 0 0 0 1 1 0 0 0  0  0  0  0  0
##         12 1 0 0 0 1 0 0 0 0  0  0  0  0  0
##         13 0 0 2 0 0 0 0 0 0  0  0  1  2  0
##         14 0 0 0 0 0 0 0 0 0  0  0  0  0  4
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3148          
##                  95% CI : (0.1952, 0.4555)
##     No Information Rate : 0.1111          
##     P-Value [Acc > NIR] : 4.831e-05       
##                                           
##                   Kappa : 0.2625          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.00000  0.00000  0.00000  0.25000  0.66667  0.33333
## Specificity           0.90000  0.97917  1.00000  0.90000  0.93750  0.92157
## Pos Pred Value        0.00000  0.00000      NaN  0.16667  0.57143  0.20000
## Neg Pred Value        0.91837  0.88679  0.92593  0.93750  0.95745  0.95918
## Prevalence            0.07407  0.11111  0.07407  0.07407  0.11111  0.05556
## Detection Rate        0.00000  0.00000  0.00000  0.01852  0.07407  0.01852
## Detection Prevalence  0.09259  0.01852  0.00000  0.11111  0.12963  0.09259
## Balanced Accuracy     0.45000  0.48958  0.50000  0.57500  0.80208  0.62745
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.00000  0.00000  0.83333   0.00000   0.00000
## Specificity           0.96154  0.82692  1.00000   1.00000   0.94118
## Pos Pred Value        0.00000  0.00000  1.00000       NaN   0.00000
## Neg Pred Value        0.96154  0.95556  0.97959   0.94444   0.94118
## Prevalence            0.03704  0.03704  0.11111   0.05556   0.05556
## Detection Rate        0.00000  0.00000  0.09259   0.00000   0.00000
## Detection Prevalence  0.03704  0.16667  0.09259   0.00000   0.05556
## Balanced Accuracy     0.48077  0.41346  0.91667   0.50000   0.47059
##                      Class: 12 Class: 13 Class: 14
## Sensitivity            0.00000   0.66667   1.00000
## Specificity            0.96000   0.94118   1.00000
## Pos Pred Value         0.00000   0.40000   1.00000
## Neg Pred Value         0.92308   0.97959   1.00000
## Prevalence             0.07407   0.05556   0.07407
## Detection Rate         0.00000   0.03704   0.07407
## Detection Prevalence   0.03704   0.09259   0.07407
## Balanced Accuracy      0.48000   0.80392   1.00000

(d)

Compare the accuracies of the different classifiers. Is either classifier very effective? Explain.

KNN3CM$overall["Accuracy"]
##  Accuracy 
## 0.3148148
LDA3CM$overall["Accuracy"]
## Accuracy 
##      0.5

Even though LDA3 has the better accuracy (0.50 versus about 0.31 for KNN3), neither classifier is very effective at predicting the type of cancer: with 14 classes, thousands of gene-expression predictors, and only a handful of training observations per class, an accuracy of roughly 0.8 or higher would be needed before either model could be considered useful.
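
As a further check (a sketch), each model's accuracy can be compared with the no-information rate reported by confusionMatrix; both models beat always guessing the most common class, but not by a margin that would make them clinically useful:

rbind(LDA3 = LDA3CM$overall[c("Accuracy", "AccuracyNull")],
      KNN3 = KNN3CM$overall[c("Accuracy", "AccuracyNull")])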

Problem 4

The data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/DAT315/Data/mnisttrain.csv and T:/Faculty & Staff Alphabetical/M/McDevittt/Public/DAT315/Data/mnisttest.csv are digitized handwritten numerals. (See https://en.wikipedia.org/wiki/MNIST_database for more information.) Each record consists of the correct numeral followed by 784 grayscale values ranging from 0 to 255. When the 784 grayscale values are arranged in a 28×28 array, they show the digitized digit. For example, the first record in the training data corresponds to the following (digitized) handwritten "5".

Handwritten 5

Likewise, the first 10 records give the following.

first 10 records output

The training set is very large so building the models takes a long time. To save time while you’re developing your code, you might choose to only use a subset of the training data until you are confident that your code is correct. Then apply it to all of the training data.
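
For example, once numTrain has been built in part (a) below, a stratified development subset can be carved out with createDataPartition and used in place of the full training set while debugging (a sketch; the 10% fraction is arbitrary and this subset is not used for the final models):

# Optional development-only subset for faster iteration (sketch)
set.seed(12345)
devIndex <- createDataPartition(numTrain$V1, p = 0.1, list = FALSE)
numTrainDev <- numTrain[devIndex, ]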

(a)

The response is a factor variable, but R does not allow class names like "0", "1", etc. Rename "0" as "c0" (for "character 0"), "1" as "c1", ..., and "9" as "c9" in both the training and testing data. Also, remove all of the predictors with small variance in the training data. Be sure to remove the same variables in the testing data.

# Training Data
numTrain  <- read.csv('mnist_train.csv', header = FALSE)
numTrain$V1 <- as.factor(numTrain$V1)
# revalue returns the relabeled factor, so assign it back to the column
numTrain$V1 <- revalue(numTrain$V1,
        c("0"="c0","1"="c1","2"="c2","3"="c3","4"="c4","5"="c5","6"="c6","7"="c7","8"="c8","9"="c9"))
zeroValues <- nearZeroVar(numTrain)
numTrain <- numTrain[,-zeroValues]

# Testing Data
numTest <- read.csv('mnist_test.csv', header = FALSE)
numTest$V1 <- as.factor(numTest$V1)
numTest$V1 <- revalue(numTest$V1,
        c("0"="c0","1"="c1","2"="c2","3"="c3","4"="c4","5"="c5","6"="c6","7"="c7","8"="c8","9"="c9"))
# Drop the same near-zero-variance columns identified in the training data
numTest <- numTest[,-zeroValues]

(b)

Use the training data to build a k-nearest neighbors model (KNN4) to identify the handwritten digits. Assess the accuracy of KNN4 by applying it to the testing data.

# Running Parallel Processor
pData <- startParallel()
KNN4 <- train(V1~., data=numTrain, method="knn",
              trControl=trainControl(method = "cv", number = 10, allowParallel=TRUE),
              preProcess=c("center", "scale"))
endParallel(pData)
##     user   system  elapsed 
##    7.664    3.646 1623.323
predictedKNN4 <- predict(KNN4, numTest)
KNN4CM <- confusionMatrix(predictedKNN4, numTest$V1)
KNN4CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0  972    0    7    0    0    4    6    0    3    5
##          1    1 1129    5    1    5    0    2   19    1    6
##          2    1    3  992    3    0    0    0    5    4    2
##          3    0    2    5  974    0   10    1    0   11    4
##          4    0    0    3    1  939    1    3    1    4    8
##          5    1    0    0   16    1  863    1    0   16    4
##          6    4    0    3    1    5    6  945    0    7    1
##          7    1    0   16   10    3    3    0  994   10   12
##          8    0    0    1    2    1    2    0    0  913    1
##          9    0    1    0    2   28    3    0    9    5  966
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9687         
##                  95% CI : (0.9651, 0.972)
##     No Information Rate : 0.1135         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9652         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9918   0.9947   0.9612   0.9644   0.9562   0.9675
## Specificity            0.9972   0.9955   0.9980   0.9963   0.9977   0.9957
## Pos Pred Value         0.9749   0.9658   0.9822   0.9672   0.9781   0.9568
## Neg Pred Value         0.9991   0.9993   0.9956   0.9960   0.9952   0.9968
## Prevalence             0.0980   0.1135   0.1032   0.1010   0.0982   0.0892
## Detection Rate         0.0972   0.1129   0.0992   0.0974   0.0939   0.0863
## Detection Prevalence   0.0997   0.1169   0.1010   0.1007   0.0960   0.0902
## Balanced Accuracy      0.9945   0.9951   0.9796   0.9803   0.9769   0.9816
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9864   0.9669   0.9374   0.9574
## Specificity            0.9970   0.9939   0.9992   0.9947
## Pos Pred Value         0.9722   0.9476   0.9924   0.9527
## Neg Pred Value         0.9986   0.9962   0.9933   0.9952
## Prevalence             0.0958   0.1028   0.0974   0.1009
## Detection Rate         0.0945   0.0994   0.0913   0.0966
## Detection Prevalence   0.0972   0.1049   0.0920   0.1014
## Balanced Accuracy      0.9917   0.9804   0.9683   0.9760

(c)

Use the training data to build a linear discriminant analysis model (LDA4) to identify the handwritten digits. Assess the accuracy of LDA4 by applying it to the testing data.

# Running Parallel Processors
pData <- startParallel()
LDA4 <- train(V1~., data=numTrain, method="lda",
              trControl=trainControl(method = "cv", number = 10, allowParallel=TRUE),
              preProcess=c("center", "scale"))
endParallel(pData)
##    user  system elapsed 
##  20.963   3.728 121.522
predictedLDA4 <- predict(LDA4, numTest)
LDA4CM <- confusionMatrix(predictedLDA4, numTest$V1)
LDA4CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0  914    0   13    7    1   11   11    4    5    7
##          1    1 1077   43    8   12   11    7   33   27    8
##          2    5    6  818   30    5    9    7   21   11    7
##          3    2    3   18  850    0   51    1    7   33    9
##          4    0    1   25    2  862   11   25   19   13   61
##          5   42    3   11   48    1  719   34    4   35   13
##          6   10    4   14    9   16   21  866    0   19    3
##          7    2    3   22   23    1   14    1  863   11   26
##          8    4   38   50   20    7   36    6    6  798   10
##          9    0    0   18   13   77    9    0   71   22  865
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8632          
##                  95% CI : (0.8563, 0.8699)
##     No Information Rate : 0.1135          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8479          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9327   0.9489   0.7926   0.8416   0.8778   0.8061
## Specificity            0.9935   0.9831   0.9887   0.9862   0.9826   0.9790
## Pos Pred Value         0.9394   0.8778   0.8901   0.8727   0.8459   0.7901
## Neg Pred Value         0.9927   0.9934   0.9764   0.9823   0.9866   0.9810
## Prevalence             0.0980   0.1135   0.1032   0.1010   0.0982   0.0892
## Detection Rate         0.0914   0.1077   0.0818   0.0850   0.0862   0.0719
## Detection Prevalence   0.0973   0.1227   0.0919   0.0974   0.1019   0.0910
## Balanced Accuracy      0.9631   0.9660   0.8907   0.9139   0.9302   0.8925
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.9040   0.8395   0.8193   0.8573
## Specificity            0.9894   0.9885   0.9804   0.9766
## Pos Pred Value         0.9002   0.8934   0.8185   0.8047
## Neg Pred Value         0.9898   0.9817   0.9805   0.9839
## Prevalence             0.0958   0.1028   0.0974   0.1009
## Detection Rate         0.0866   0.0863   0.0798   0.0865
## Detection Prevalence   0.0962   0.0966   0.0975   0.1075
## Balanced Accuracy      0.9467   0.9140   0.8998   0.9170

(d)

Compare and contrast the accuracies of the two classifiers on the testing data.

KNN4CM$overall["Accuracy"]
## Accuracy 
##   0.9687
LDA4CM$overall["Accuracy"]
## Accuracy 
##   0.8632

Comparing the two classifiers on the testing data, KNN4 is about 96.9% accurate, clearly better than LDA4 at about 86.3%.
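
The gap also shows up digit by digit; a sketch comparing per-class sensitivity from the two confusion matrices:

# Per-digit sensitivity for KNN4 versus LDA4 (sketch)
round(cbind(KNN4 = KNN4CM$byClass[, "Sensitivity"],
            LDA4 = LDA4CM$byClass[, "Sensitivity"]), 3)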

Problem 5

Access the spam data in the kernlab package.

(a)

Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.

data("spam")
set.seed(12345)
spamTrainingIndex = createDataPartition(spam$type,p = 0.7, list = FALSE, times = 1)
spamTrain = spam[spamTrainingIndex, ]
spamTest = spam[-spamTrainingIndex, ]

(b)

Use the training data to build a naïve Bayes classifier (NB5) for the type of message (spam or non-spam) without any preprocessing. Assess the predictive capacity of NB5 using the testing data.

NB5 <- train(type~., data=spamTrain, method="nb",
             trControl=trainControl(method = "cv", number = 10))
predictedNB5 <- predict(NB5, newdata=spamTest)
NB5CM <- confusionMatrix(predictedNB5, spamTest$type)
NB5CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     476   31
##    spam        360  512
##                                           
##                Accuracy : 0.7165          
##                  95% CI : (0.6919, 0.7401)
##     No Information Rate : 0.6062          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4631          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5694          
##             Specificity : 0.9429          
##          Pos Pred Value : 0.9389          
##          Neg Pred Value : 0.5872          
##              Prevalence : 0.6062          
##          Detection Rate : 0.3452          
##    Detection Prevalence : 0.3677          
##       Balanced Accuracy : 0.7561          
##                                           
##        'Positive' Class : nonspam         
## 

(c)

Use the training data to build a naïve Bayes classifier (NB5pca) using preProcess=c("center","scale","pca") as an optional argument to the train command. Assess the predictive capacity of NB5pca using the testing data.

NB5pca <- train(type~., data=spamTrain, method="nb",
                trControl=trainControl(method = "cv", number = 10),
                preProcess=c("center","scale", "pca"))
predictedNB5pca <- predict(NB5pca, newdata=spamTest)
NB5pcaCM <- confusionMatrix(predictedNB5pca,spamTest$type)
NB5pcaCM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     734   62
##    spam        102  481
##                                           
##                Accuracy : 0.8811          
##                  95% CI : (0.8628, 0.8977)
##     No Information Rate : 0.6062          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7541          
##                                           
##  Mcnemar's Test P-Value : 0.002324        
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.8858          
##          Pos Pred Value : 0.9221          
##          Neg Pred Value : 0.8250          
##              Prevalence : 0.6062          
##          Detection Rate : 0.5323          
##    Detection Prevalence : 0.5772          
##       Balanced Accuracy : 0.8819          
##                                           
##        'Positive' Class : nonspam         
## 

(d)

Compare and contrast the accuracy, specificity, and sensitivity of NB5 and NB5pca. Which classifier would be preferable in practice as part of an actual email filter?

# NB5
NB5Accuracy <- NB5CM$overall["Accuracy"]
NB5Specificity <- NB5CM$byClass['Specificity']
NB5Sensitivity <- NB5CM$byClass['Sensitivity']

# NB5pca
NB5pcaAccuracy <- NB5pcaCM$overall["Accuracy"]
NB5pcaSpecificity <- NB5pcaCM$byClass['Specificity']
NB5pcaSensitivity <- NB5pcaCM$byClass['Sensitivity']

# Create table for comparison
table <- matrix(c(NB5Accuracy,NB5pcaAccuracy,
                  NB5Specificity,NB5pcaSpecificity,
                  NB5Sensitivity,NB5pcaSensitivity),ncol=2,byrow=TRUE)
colnames(table) <- c("      NB5      ","        NB5pca   ")
rownames(table) <- c("Accuracy","Specificity","Sensitivity")
table <- as.table(table)
table
##                   NB5               NB5pca   
## Accuracy          0.7164612         0.8810732
## Specificity       0.9429098         0.8858195
## Sensitivity       0.5693780         0.8779904

The table shows that NB5pca has much higher accuracy than NB5 (0.881 versus 0.716). NB5pca's accuracy, specificity, and sensitivity are also all close to one another (roughly 88%), whereas NB5 pairs high specificity (0.94) with very poor sensitivity (0.57); because the positive class is nonspam, that low sensitivity means NB5 misclassifies a large share of legitimate messages as spam. NB5pca is therefore the classifier that would be preferable in practice as part of an actual email filter.
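
For an email filter, the cost that matters most is legitimate mail being sent to the spam folder. Since "nonspam" is the positive class here, that rate is 1 minus sensitivity, roughly 43% for NB5 versus about 12% for NB5pca; a quick sketch:

# Fraction of legitimate (nonspam) messages incorrectly flagged as spam (sketch)
c(NB5 = unname(1 - NB5Sensitivity), NB5pca = unname(1 - NB5pcaSensitivity))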