Problem 1

Remove the factor variables from the Hitters data in the ISLR package and remove all records in which the Salary data is missing. To remove the heavy right skew, replace Salary with log(Salary) and call it logSalary. Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.

library(ISLR)
library(caret)
library(ggplot2)

Hitters2 <- subset(Hitters, select = -c(League, Division, NewLeague))
Hitters3 <- Hitters2[complete.cases(Hitters2), ]
Hitters3$Salary <- log(Hitters3$Salary)
colnames(Hitters3)[colnames(Hitters3) == "Salary"] <- "logSalary"
set.seed(12345)
databreak <- createDataPartition(Hitters3$logSalary, times = 1, p = 0.7, list = FALSE)
traindata <- Hitters3[databreak, ]
testdata <- Hitters3[-databreak, ]
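
A quick sanity check on the partition (a sketch; exact counts depend on createDataPartition's stratified rounding, but they should be roughly 70% and 30% of the 263 complete cases):

nrow(traindata)
nrow(testdata)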

(a)

Use the training data to build a linear model (LM1) for logSalary using all other variables as predictors. Evaluate the predictive capacity of LM1 by computing R² for the testing data.

LM1 <- train(logSalary ~ ., data = traindata, method = "lm", metric = "Rsquared", maximize = TRUE)

predictedLM1 <- predict(LM1, newdata = testdata)
RsquaredLM1 <- cor(predictedLM1, testdata$logSalary)^2

RsquaredLM1
## [1] 0.4251033

The testing data R-squared is 0.4251033.
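
As a cross-check, caret's postResample computes the same correlation-based R², along with RMSE and MAE:

# Rsquared here is the squared correlation between predictions and observations
postResample(pred = predictedLM1, obs = testdata$logSalary)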

(b)

Use the training data to build a ridge regression model (RIDGE1) for logSalary using all other variables as predictors. Use 10-fold cross validation and center and scale the predictors. Evaluate the predictive capacity of RIDGE1 by computing R² for the testing data.

library(caret)
library(elasticnet)
library(lars)
Ridge1 <- train(logSalary ~ ., data = traindata, method = "ridge",
                trControl = trainControl(method = "cv", number = 10),
                preProcess = c("center", "scale"))

Ridge1predicted <- predict(Ridge1, newdata = testdata)

RsquaredRidge1 <- cor(Ridge1predicted, testdata$logSalary)^2

RsquaredRidge1
## [1] 0.4149291

The testing data R-squared is 0.4149291.
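
For reference, the shrinkage parameter that cross-validation selected can be inspected directly (the exact value depends on the CV folds):

Ridge1$bestTune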

(c)

Use the training data to build a LASSO model (LASSO1) for logSalary using all other variables as predictors. Use 10-fold cross validation and center and scale the predictors. Evaluate the predictive capacity of LASSO1 by computing R² for the testing data.

library(lars)
library(caret)

LASSO1 <- train(logSalary ~ ., data = traindata, method = "lasso",
                trControl = trainControl(method = "cv", number = 10),
                preProcess = c("center", "scale"))

PredictedLasso1 <- predict(LASSO1, newdata = testdata)

Rsquaredlasso <- cor(PredictedLasso1, testdata$logSalary)^2

Rsquaredlasso
## [1] 0.4219527

The testing data R-squared is 0.4219527.

(d)

Use predict.enet(LASSO1$finalModel, type="coef", s=LASSO1$bestTune$fraction, mode="fraction") to determine which variables the LASSO removes from the model. Explicitly list the variables that are removed.

predict.enet(LASSO1$finalModel, type = "coef", s = LASSO1$bestTune$fraction, mode = "fraction")
## $s
## [1] 0.5
## 
## $fraction
##   0 
## 0.5 
## 
## $mode
## [1] "fraction"
## 
## $coefficients
##        AtBat         Hits        HmRun         Runs          RBI        Walks 
## -0.063213535  0.156360951  0.036525025  0.000000000  0.000000000  0.073336423 
##        Years       CAtBat        CHits       CHmRun        CRuns         CRBI 
##  0.110615510  0.000000000  0.053250309 -0.054634947  0.128477425 -0.000817099 
##       CWalks      PutOuts      Assists       Errors 
## -0.048763061  0.025325418  0.001873267 -0.019479563

The LASSO removed Runs, RBI, and CAtBat: their coefficients are exactly zero at the tuned fraction. (CRBI is shrunken nearly to zero, but it is not removed.)
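
The same conclusion can be reached programmatically by extracting the coefficient vector at the tuned fraction and checking for exact zeros:

coefsLASSO1 <- predict.enet(LASSO1$finalModel, type = "coef",
                            s = LASSO1$bestTune$fraction,
                            mode = "fraction")$coefficients
# Names of the predictors the LASSO dropped (coefficients exactly zero)
names(coefsLASSO1)[coefsLASSO1 == 0]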

Problem 2

Predicting permeability of compounds can yield significant savings for a pharmaceutical company. In the AppliedPredictiveModeling package, the permeability data contains high dimensional data in the form of a matrix titled fingerprints with 165 observations on 1107 binary molecular predictors and a vector called permeability that contains the corresponding responses.

(a)

Combine the predictors and response into a single data frame and use nearZeroVar to remove all predictors that have small variance. Note that the data are still high dimensional.

library(AppliedPredictiveModeling)
library(caret)

data(permeability)

printsdata <- data.frame(cbind(fingerprints, permeability))

printsdata <- printsdata[, -nearZeroVar(printsdata)]
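
A dimension check confirms the data remain high dimensional after the filter: nearZeroVar drops the sparsest fingerprints, but several hundred predictors are left for only 165 observations.

dim(printsdata)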

(b)

Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.

set.seed(12345)
databreak2 <- createDataPartition(printsdata$permeability, times = 1, p = 0.7, list = FALSE)

traindata2 <- printsdata[databreak2, ]
testdata2 <- printsdata[-databreak2, ]

(c)

Build a linear model (LM2) with the training data using train with preProcess=c("center","scale") and compute R² for LM2's performance on the testing data. R is going to complain about rank deficiency because there are more variables than observations. Just ignore those warnings for this problem.

LM2 <- train(permeability ~ ., data = traindata2, method = "lm",
             preProcess = c("center", "scale"))

PredictedLM2 <- predict(LM2, newdata = testdata2)

RsquaredLM2 <- cor(PredictedLM2, testdata2$permeability)^2

RsquaredLM2
## [1] 0.01698163

The testing data R-squared is only 0.01698163.
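
This poor performance is expected: with far more predictors than training observations, ordinary least squares is rank deficient and badly overfits. A dimension check makes the imbalance concrete:

# More columns (predictors) than rows (observations) in the training set
dim(traindata2)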

(d)

Now incorporate principal components into your model. Build another linear model (LM2pca) with the training data using train with preProcess=c("center","scale","pca") and compute R² for LM2pca's performance on the testing data. Did PCA increase R²?

LM2pca <- train(permeability ~ ., data = traindata2, method = "lm",
                preProcess = c("center", "scale", "pca"))

PredictedLM2pca <- predict(LM2pca, newdata = testdata2)

RsquaredLM2pca <- cor(PredictedLM2pca, testdata2$permeability)^2

RsquaredLM2pca
## [1] 0.4644431

On the testing data, LM2pca produced an R-squared of 0.4644431, so PCA dramatically increased R-squared (from about 0.017 to 0.464) by replacing the correlated, high-dimensional predictors with a much smaller set of components.
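
To see how much compression PCA provided, print the stored preprocessing object; caret reports how many principal components were needed to capture its default 95% of the variance:

LM2pca$preProcess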

Problem 3

The website https://statweb.stanford.edu/~hastie/ElemStatLearn/datasets/ contains microarray data for 14 different types of cancer.

(a)

Use fread in the data.table package to read the training and test sets from the website directly into R. Make data frames for the training and testing data that combine the predictors and the response, and convert the response to a factor variable. Note that the data are high dimensional.

library(doParallel)
library(parallel)
library(data.table)
xtrain <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.xtrain")
ytrain <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.ytrain")
xtest <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.xtest")
ytest <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.ytest")

# The files store genes in rows and samples in columns, so transpose to get
# one row per sample, then attach the cancer type as a factor response.
training <- transpose(xtrain)
testing <- transpose(xtest)
training$response <- as.factor(t(ytrain))
testing$response <- as.factor(t(ytest))
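
A quick look at the dimensions shows how high dimensional these data are (per the ElemStatLearn documentation: 144 training and 54 testing samples on 16,063 gene-expression measurements):

dim(training)
dim(testing)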

(b)

Set the seed to 12345, and use the training data to build a linear discriminant analysis model (LDA3) that predicts the type of cancer and evaluate the accuracy of LDA3 with the testing data.

library(caret)
set.seed(12345)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
LDA3 <- train(response ~ .,
              data = training,
              method = "lda",
              trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
              preProcess = c("center", "scale"))
stopCluster(cluster)

predictLDA <- predict(LDA3, testing)
confusionMatrix(predictLDA, testing$response)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##         1  1 0 0 1 0 2 0 0 0  0  2  1  1  0
##         2  0 2 1 0 0 0 0 0 0  0  0  0  0  0
##         3  1 0 2 0 1 0 1 0 0  1  0  0  0  0
##         4  0 0 0 3 0 0 0 0 0  0  0  0  0  0
##         5  0 1 0 0 4 0 0 0 2  0  0  0  0  0
##         6  1 0 0 0 0 1 0 0 0  0  0  2  0  0
##         7  0 1 0 0 0 0 1 0 0  0  0  0  0  0
##         8  1 0 1 0 0 0 0 2 0  2  0  0  0  0
##         9  0 0 0 0 0 0 0 0 4  0  0  0  0  0
##         10 0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         11 0 0 0 0 0 0 0 0 0  0  1  0  0  0
##         12 0 1 0 0 1 0 0 0 0  0  0  0  0  0
##         13 0 1 0 0 0 0 0 0 0  0  0  1  2  0
##         14 0 0 0 0 0 0 0 0 0  0  0  0  0  4
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.3608, 0.6392)
##     No Information Rate : 0.1111          
##     P-Value [Acc > NIR] : 1.581e-12       
##                                           
##                   Kappa : 0.4594          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.25000  0.33333  0.50000  0.75000  0.66667  0.33333
## Specificity           0.86000  0.97917  0.92000  1.00000  0.93750  0.94118
## Pos Pred Value        0.12500  0.66667  0.33333  1.00000  0.57143  0.25000
## Neg Pred Value        0.93478  0.92157  0.95833  0.98039  0.95745  0.96000
## Prevalence            0.07407  0.11111  0.07407  0.07407  0.11111  0.05556
## Detection Rate        0.01852  0.03704  0.03704  0.05556  0.07407  0.01852
## Detection Prevalence  0.14815  0.05556  0.11111  0.05556  0.12963  0.07407
## Balanced Accuracy     0.55500  0.65625  0.71000  0.87500  0.80208  0.63725
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11 Class: 12
## Sensitivity           0.50000  1.00000  0.66667   0.00000   0.33333   0.00000
## Specificity           0.98077  0.92308  1.00000   1.00000   1.00000   0.96000
## Pos Pred Value        0.50000  0.33333  1.00000       NaN   1.00000   0.00000
## Neg Pred Value        0.98077  1.00000  0.96000   0.94444   0.96226   0.92308
## Prevalence            0.03704  0.03704  0.11111   0.05556   0.05556   0.07407
## Detection Rate        0.01852  0.03704  0.07407   0.00000   0.01852   0.00000
## Detection Prevalence  0.03704  0.11111  0.07407   0.00000   0.01852   0.03704
## Balanced Accuracy     0.74038  0.96154  0.83333   0.50000   0.66667   0.48000
##                      Class: 13 Class: 14
## Sensitivity            0.66667   1.00000
## Specificity            0.96078   1.00000
## Pos Pred Value         0.50000   1.00000
## Neg Pred Value         0.98000   1.00000
## Prevalence             0.05556   0.07407
## Detection Rate         0.03704   0.07407
## Detection Prevalence   0.07407   0.07407
## Balanced Accuracy      0.81373   1.00000

(c)

Use the training data to build a k-nearest neighbors model (KNN3) that predicts the type of cancer and evaluate the accuracy of KNN3 with the testing data.

cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
KNN3 <- train(response ~ .,
              data = training,
              method = "knn",
              trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
              preProcess = c("center", "scale"))
stopCluster(cluster)

predictKNN <- predict(KNN3, testing)
confusionMatrix(predictKNN, testing$response)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##         1  0 0 0 0 0 0 0 1 0  0  1  0  0  0
##         2  0 0 0 0 0 0 1 0 0  0  0  0  0  0
##         3  0 0 1 0 0 0 0 0 0  0  0  0  0  0
##         4  0 0 2 4 1 0 0 0 0  0  0  0  0  0
##         5  0 2 0 0 3 0 0 0 0  0  0  0  0  0
##         6  2 1 0 0 0 2 0 0 0  0  1  2  0  0
##         7  1 0 0 0 0 0 1 0 0  1  0  0  0  0
##         8  1 1 0 0 1 0 0 0 1  2  1  0  0  0
##         9  0 0 0 0 0 0 0 0 4  0  0  0  0  0
##         10 0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         11 0 0 0 0 0 0 0 1 0  0  0  1  1  0
##         12 0 0 0 0 0 1 0 0 0  0  0  0  0  0
##         13 0 2 1 0 1 0 0 0 1  0  0  1  2  1
##         14 0 0 0 0 0 0 0 0 0  0  0  0  0  3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3704          
##                  95% CI : (0.2429, 0.5126)
##     No Information Rate : 0.1111          
##     P-Value [Acc > NIR] : 6.015e-07       
##                                           
##                   Kappa : 0.325           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.00000  0.00000  0.25000  1.00000  0.50000  0.66667
## Specificity           0.96000  0.97917  1.00000  0.94000  0.95833  0.88235
## Pos Pred Value        0.00000  0.00000  1.00000  0.57143  0.60000  0.25000
## Neg Pred Value        0.92308  0.88679  0.94340  1.00000  0.93878  0.97826
## Prevalence            0.07407  0.11111  0.07407  0.07407  0.11111  0.05556
## Detection Rate        0.00000  0.00000  0.01852  0.07407  0.05556  0.03704
## Detection Prevalence  0.03704  0.01852  0.01852  0.12963  0.09259  0.14815
## Balanced Accuracy     0.48000  0.48958  0.62500  0.97000  0.72917  0.77451
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11 Class: 12
## Sensitivity           0.50000  0.00000  0.66667   0.00000   0.00000   0.00000
## Specificity           0.96154  0.86538  1.00000   1.00000   0.94118   0.98000
## Pos Pred Value        0.33333  0.00000  1.00000       NaN   0.00000   0.00000
## Neg Pred Value        0.98039  0.95745  0.96000   0.94444   0.94118   0.92453
## Prevalence            0.03704  0.03704  0.11111   0.05556   0.05556   0.07407
## Detection Rate        0.01852  0.00000  0.07407   0.00000   0.00000   0.00000
## Detection Prevalence  0.05556  0.12963  0.07407   0.00000   0.05556   0.01852
## Balanced Accuracy     0.73077  0.43269  0.83333   0.50000   0.47059   0.49000
##                      Class: 13 Class: 14
## Sensitivity            0.66667   0.75000
## Specificity            0.86275   1.00000
## Pos Pred Value         0.22222   1.00000
## Neg Pred Value         0.97778   0.98039
## Prevalence             0.05556   0.07407
## Detection Rate         0.03704   0.05556
## Detection Prevalence   0.16667   0.05556
## Balanced Accuracy      0.76471   0.87500

(d)

Compare the accuracies of the different classifiers. Is either classifier very effective? Explain.

The accuracy using LDA was 0.5 (kappa = 0.4594); the accuracy using KNN was 0.3704 (kappa = 0.325).

Neither classifier is very effective: both have kappa values below 0.6 and accuracies at or below 50%, which would be unacceptable in practice even though both beat the 11% no-information rate.

That said, LDA3 clearly outperforms KNN3, with higher accuracy and a higher kappa.

Problem 4

The data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/DAT315/Data/mnist train.csv and T:/Faculty & Staff Alphabetical/M/McDevittt/Public/DAT315/Data/mnist test.csv are digitized handwritten numerals. (See https://en.wikipedia.org/wiki/MNIST_database for more information.) Each record consists of the correct numeral followed by 784 grayscale values ranging from 0 to 255. When the 784 grayscale values are arranged in a 28 × 28 array, they show the digitized digit. For example, the first record in the training data corresponds to the following (digitized) handwritten “5”.

(a)

The response is a factor variable, but R does not allow variable names like "0", "1", etc. Rename "0" as "c0" (for "character 0"), "1" as "c1", ..., and "9" as "c9" in both the training and testing data. Also, remove all of the predictors with small variance in the training data. Be sure to remove the same variables in the testing data.

library(caret)
library(lattice)
library(ggplot2)

mnist_train <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 4\\mnist_train.csv", header=FALSE)

mnist_test <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 4\\mnist_test.csv", header=FALSE)



# Prefix every digit label with "c" (so "0" becomes "c0", ..., "9" becomes
# "c9") and convert the response to a factor in both data sets.
mnist_train$V1 <- factor(paste0("c", mnist_train$V1))
mnist_test$V1 <- factor(paste0("c", mnist_test$V1))

# Identify near-zero-variance pixels on the training data, then drop the
# same columns from both the training and testing sets.
predictors4 <- nearZeroVar(mnist_train)

mnist_train <- mnist_train[, -predictors4]
mnist_test <- mnist_test[, -predictors4]
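
It is worth checking how many pixel columns the filter removed; the near-zero-variance predictors are mostly the always-blank border pixels of the 28 × 28 images (exact counts depend on nearZeroVar's default thresholds):

length(predictors4)
dim(mnist_train)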

(b)

Use the training data to build a k-nearest neighbors model (KNN4) to identify the handwritten digits. Assess the accuracy of KNN4 by applying it to the testing data.

library(klaR)
library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)


clustermade<- makeCluster(detectCores()-1)
registerDoParallel(clustermade)

KNN4 <- train(V1 ~ ., data = mnist_train, method = "knn",
              trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
              preProcess = c("center", "scale", "pca"))

predictedKNN4 <- predict(KNN4, mnist_test)
confusionMatrix(predictedKNN4, mnist_test$V1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   c0   c1   c2   c3   c4   c5   c6   c7   c8   c9
##         c0  972    0    6    0    0    4    5    0    4    2
##         c1    1 1128    5    2    4    0    3   18    0    5
##         c2    1    4  996    3    0    0    0    5    5    2
##         c3    0    2    1  978    0   11    1    0    6    5
##         c4    0    0    3    0  950    0    2    1    2    5
##         c5    1    0    1   15    0  863    1    0   12    5
##         c6    4    0    2    1    5    6  945    0    7    1
##         c7    1    0   15    8    3    2    0  998    9    8
##         c8    0    0    3    1    1    2    1    0  922    2
##         c9    0    1    0    2   19    4    0    6    7  974
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9726          
##                  95% CI : (0.9692, 0.9757)
##     No Information Rate : 0.1135          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9695          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: c0 Class: c1 Class: c2 Class: c3 Class: c4
## Sensitivity             0.9918    0.9938    0.9651    0.9683    0.9674
## Specificity             0.9977    0.9957    0.9978    0.9971    0.9986
## Pos Pred Value          0.9789    0.9674    0.9803    0.9741    0.9865
## Neg Pred Value          0.9991    0.9992    0.9960    0.9964    0.9965
## Prevalence              0.0980    0.1135    0.1032    0.1010    0.0982
## Detection Rate          0.0972    0.1128    0.0996    0.0978    0.0950
## Detection Prevalence    0.0993    0.1166    0.1016    0.1004    0.0963
## Balanced Accuracy       0.9948    0.9948    0.9814    0.9827    0.9830
##                      Class: c5 Class: c6 Class: c7 Class: c8 Class: c9
## Sensitivity             0.9675    0.9864    0.9708    0.9466    0.9653
## Specificity             0.9962    0.9971    0.9949    0.9989    0.9957
## Pos Pred Value          0.9610    0.9732    0.9559    0.9893    0.9615
## Neg Pred Value          0.9968    0.9986    0.9967    0.9943    0.9961
## Prevalence              0.0892    0.0958    0.1028    0.0974    0.1009
## Detection Rate          0.0863    0.0945    0.0998    0.0922    0.0974
## Detection Prevalence    0.0898    0.0971    0.1044    0.0932    0.1013
## Balanced Accuracy       0.9818    0.9918    0.9828    0.9728    0.9805

Applied to the testing data, KNN4 achieves an accuracy of 0.9726.
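
The tuned neighborhood size and the PCA preprocessing caret applied can be inspected on the fitted object (the selected k depends on the cross-validation folds):

KNN4$bestTune
KNN4$preProcess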

(c)

Use the training data to build a linear discriminant analysis model (LDA4) to identify the handwritten digits. Assess the accuracy of LDA4 by applying it to the testing data.

library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)

LDA4 <- train(V1 ~ ., data = mnist_train, method = "lda",
              trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
              preProcess = c("center", "scale", "pca"))


predictedLDA4 <- predict(LDA4, mnist_test)
confusionMatrix(predictedLDA4, mnist_test$V1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   c0   c1   c2   c3   c4   c5   c6   c7   c8   c9
##         c0  914    0   12    7    1   13   12    4    5    8
##         c1    1 1075   45    6   11   10    8   39   28    7
##         c2    5    7  808   27    7    8    6   20    8    6
##         c3    1    2   22  848    1   49    1    5   33   11
##         c4    0    1   25    2  859   14   26   17   15   58
##         c5   41    3   10   50    2  721   34    3   34   15
##         c6   12    4   16   10   15   20  864    1   22    5
##         c7    3    2   22   25    1   14    0  866   10   22
##         c8    3   41   55   19    6   34    7    4  795    9
##         c9    0    0   17   16   79    9    0   69   24  868
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8618          
##                  95% CI : (0.8549, 0.8685)
##     No Information Rate : 0.1135          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8464          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: c0 Class: c1 Class: c2 Class: c3 Class: c4
## Sensitivity             0.9327    0.9471    0.7829    0.8396    0.8747
## Specificity             0.9931    0.9825    0.9895    0.9861    0.9825
## Pos Pred Value          0.9365    0.8740    0.8958    0.8715    0.8446
## Neg Pred Value          0.9927    0.9932    0.9754    0.9821    0.9863
## Prevalence              0.0980    0.1135    0.1032    0.1010    0.0982
## Detection Rate          0.0914    0.1075    0.0808    0.0848    0.0859
## Detection Prevalence    0.0976    0.1230    0.0902    0.0973    0.1017
## Balanced Accuracy       0.9629    0.9648    0.8862    0.9128    0.9286
##                      Class: c5 Class: c6 Class: c7 Class: c8 Class: c9
## Sensitivity             0.8083    0.9019    0.8424    0.8162    0.8603
## Specificity             0.9789    0.9884    0.9890    0.9803    0.9762
## Pos Pred Value          0.7897    0.8916    0.8974    0.8171    0.8022
## Neg Pred Value          0.9812    0.9896    0.9821    0.9802    0.9842
## Prevalence              0.0892    0.0958    0.1028    0.0974    0.1009
## Detection Rate          0.0721    0.0864    0.0866    0.0795    0.0868
## Detection Prevalence    0.0913    0.0969    0.0965    0.0973    0.1082
## Balanced Accuracy       0.8936    0.9451    0.9157    0.8983    0.9182

Applied to the testing data, LDA4 achieves an accuracy of 0.8618.

(d)

Compare and contrast the accuracies of the two classifiers on the testing data.

The KNN model had an accuracy of 0.9726, while the LDA model had an accuracy of 0.8618. Both far exceed the 11% no-information rate, but KNN is clearly the stronger classifier here: it misreads about 2.7% of the test digits versus roughly 14% for LDA.

Problem 5

Access the spam data in the kernlab package.

(a)

Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.

library(kernlab)
library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)

data(spam)
set.seed(12345)
databreak5 <- createDataPartition(spam$type, times = 1, p = 0.7, list = FALSE)
traindata5 <- spam[databreak5, ]
testdata5 <- spam[-databreak5, ]

(b)

Use the training data to build a naïve Bayes classifier (NB5) for the type of message (spam or nonspam) without any preprocessing. Assess the predictive capacity of NB5 using the testing data.

options(warn = -1)  # silence the numerous warnings klaR's naive Bayes emits here
library(klaR)
library(caret)
library(MASS)

NB5 <- train(type ~ ., data = traindata5, method = "nb",
             trControl = trainControl(method = "cv", number = 10))
predictedNB5<-predict(NB5, testdata5)
confusionMatrix(predictedNB5, testdata5$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     499   31
##    spam        337  512
##                                          
##                Accuracy : 0.7331         
##                  95% CI : (0.709, 0.7563)
##     No Information Rate : 0.6062         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4913         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.5969         
##             Specificity : 0.9429         
##          Pos Pred Value : 0.9415         
##          Neg Pred Value : 0.6031         
##              Prevalence : 0.6062         
##          Detection Rate : 0.3619         
##    Detection Prevalence : 0.3843         
##       Balanced Accuracy : 0.7699         
##                                          
##        'Positive' Class : nonspam        
## 

After running on the testing data, the model has an accuracy of 0.7331. Relatively speaking this is decent, but it still makes many errors: its sensitivity (for the positive class, nonspam) is only 0.5969, while its specificity is 0.9429.
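
For the comparison in part (d), these metrics can be pulled out of the confusionMatrix object programmatically rather than read off the printout:

cmNB5 <- confusionMatrix(predictedNB5, testdata5$type)
cmNB5$overall["Accuracy"]
cmNB5$byClass[c("Sensitivity", "Specificity")]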

(c)

Use the training data to build a naïve Bayes classifier (NB5pca) using preProcess=c("center","scale","pca") as an optional argument to the train command. Assess the predictive capacity of NB5pca using the testing data.

library(klaR)
library(caret)
library(kernlab)
library(MASS)

NB5pca <- train(type ~ ., data = traindata5, method = "nb",
                trControl = trainControl(method = "cv", number = 10),
                preProcess = c("center", "scale", "pca"))
predictedNB5pca <- predict(NB5pca, testdata5)
confusionMatrix(predictedNB5pca, testdata5$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     754  100
##    spam         82  443
##                                          
##                Accuracy : 0.868          
##                  95% CI : (0.849, 0.8854)
##     No Information Rate : 0.6062         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7219         
##                                          
##  Mcnemar's Test P-Value : 0.2076         
##                                          
##             Sensitivity : 0.9019         
##             Specificity : 0.8158         
##          Pos Pred Value : 0.8829         
##          Neg Pred Value : 0.8438         
##              Prevalence : 0.6062         
##          Detection Rate : 0.5468         
##    Detection Prevalence : 0.6193         
##       Balanced Accuracy : 0.8589         
##                                          
##        'Positive' Class : nonspam        
## 

After running on the testing data, the model has an accuracy of 0.868, a sensitivity of 0.9019, and a specificity of 0.8158.

(d)

Compare and contrast the accuracy, specificity, and sensitivity of NB5 and NB5pca. Which classifier would be preferable in practice as part of an actual email filter?

NB5pca has a higher accuracy (0.868) than NB5 (0.7331). NB5pca also has a much higher sensitivity, 0.9019 versus 0.5969; since the positive class here is nonspam, that means NB5pca wrongly flags only 1 − 0.9019 = 9.81% of legitimate messages as spam, while NB5 flags about 40% of them. NB5 does have the higher specificity (0.9429 versus 0.8158), so it catches a larger share of actual spam. For an actual email filter, though, NB5pca is preferable: quarantining legitimate email is usually the costlier mistake, and NB5pca makes far fewer of those errors.
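
Beyond the single test-set comparison, the two models' cross-validated accuracies can be compared directly. A minimal sketch using caret's resamples(), which applies here because both models were trained with 10-fold cross validation:

# Collect the held-out-fold accuracy and kappa estimates from each model
comparison <- resamples(list(NB5 = NB5, NB5pca = NB5pca))
summary(comparison)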