Remove the factor variables from the Hitters data in the ISLR package and remove all records in which the Salary data is missing. To remove the heavy right skew, replace Salary with log(Salary) and call it logSalary. Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.
library(ISLR)
library(caret)
library(ggplot2)
Hitters2 <- subset(Hitters, select = -c(League, Division, NewLeague))
Hitters3 <- Hitters2[complete.cases(Hitters2), ]
# Base-10 log is used here; since log10(x) differs from log(x) only by a constant factor,
# the correlation-based R-squared values computed below are unaffected by the choice of base.
logSalary <- log10(Hitters3$Salary)
Hitters3$Salary <- logSalary
colnames(Hitters3)[colnames(Hitters3) == "Salary"] <- "logSalary"
set.seed(12345)
databreak <- createDataPartition(Hitters3$logSalary, times = 1, p = .7, list = FALSE)
traindata <- Hitters3[databreak, ]
testdata <- Hitters3[-databreak, ]
Use the training data to build a linear model (LM1) for logSalary using all other variables as predictors. Evaluate the predictive capacity of LM1 by computing R2 for the testing data.
LM1<-train(logSalary~., data=traindata, method = "lm", maximize =TRUE, metric= "Rsquared")
predictedLM1<- predict(LM1, newdata= testdata)
RsquaredLM1<-cor(predictedLM1, testdata$logSalary)^2
RsquaredLM1
## [1] 0.4251033
The testing data R-squared is 0.4251033.
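As a quick cross-check (not required by the problem), caret's postResample function reports RMSE and R-squared for a set of predictions, with R-squared computed as the squared correlation, so it should agree with the value above.
postResample(predictedLM1, testdata$logSalary)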
Use the training data to build a ridge regression model (RIDGE1) for logSalary using all other variables as predictors. Use 10-fold cross validation and center and scale the predictors. Evaluate the predictive capacity of RIDGE1 by computing R2 for the testing data.
library(caret)
library(elasticnet)
library(lars)
Ridge1<-train(logSalary~., data= traindata, method = "ridge", trControl=trainControl(method ="cv", number=10), preProcess=c("center", "scale"))
Ridge1predicted <-predict(Ridge1, newdata= testdata)
RsquaredRidge1 <- cor(Ridge1predicted, testdata$logSalary)^2
RsquaredRidge1
## [1] 0.4149291
The testing data R-squared is 0.4149291.
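The ridge penalty chosen by the 10-fold cross validation can be inspected from the fitted object; a quick sketch (the bestTune element stores the selected lambda):
Ridge1$bestTune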
Use the training data to build a LASSO model (LASSO1) for logSalary using all other variables as predictors. Use 10-fold cross validation and center and scale the predictors. Evaluate the predictive capacity of LASSO1 by computing R2 for the testing data.
library(lars)
library(caret)
LASSO1<-train(logSalary~., data = traindata, method = "lasso" , trControl=trainControl(method ="cv", number=10), preProcess=c("center", "scale") )
PredictedLasso1<-predict(LASSO1, newdata=testdata)
Rsquaredlasso<-cor(PredictedLasso1, testdata$logSalary)^2
Rsquaredlasso
## [1] 0.4219527
The testing data R-squared is 0.4219527.
Use predict.enet(LASSO1$finalModel, type="coef", s=LASSO1$bestTune$fraction, mode="fraction") to determine which variables the LASSO removes from the model. Explicitly list the variables that are removed.
predict.enet(LASSO1$finalModel, type = "coef" , s= LASSO1$bestTune$fraction, mode = "fraction")
## $s
## [1] 0.5
##
## $fraction
## 0
## 0.5
##
## $mode
## [1] "fraction"
##
## $coefficients
## AtBat Hits HmRun Runs RBI Walks
## -0.063213535 0.156360951 0.036525025 0.000000000 0.000000000 0.073336423
## Years CAtBat CHits CHmRun CRuns CRBI
## 0.110615510 0.000000000 0.053250309 -0.054634947 0.128477425 -0.000817099
## CWalks PutOuts Assists Errors
## -0.048763061 0.025325418 0.001873267 -0.019479563
The variables that were removed from the model are Runs, RBI, and CAtBat (all three have coefficients of zero).
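The removed variables can also be identified programmatically rather than by reading the printout, using the $coefficients element of the predict.enet output shown above; a short sketch:
lassoCoef <- predict.enet(LASSO1$finalModel, type = "coef", s = LASSO1$bestTune$fraction, mode = "fraction")$coefficients
names(lassoCoef)[lassoCoef == 0]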
Predicting permeability of compounds can yield significant savings for a pharmaceutical company. In the AppliedPredictiveModeling package, the permeability data contains high dimensional data in the form of a matrix titled fingerprints with 165 observations on 1107 binary molecular predictors and a vector called permeability that contains the corresponding responses.
Combine the predictors and response into a single data frame and use nearZeroVar to remove all predictors that have small variance. Note that the data are still high dimensional.
library(AppliedPredictiveModeling)
library(caret)
data(permeability)
printsdata = cbind(fingerprints, permeability)
printsdata= data.frame(printsdata)
printsdata= printsdata[,-nearZeroVar(printsdata)]
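As a quick check that the data really are still high dimensional after nearZeroVar, the dimensions can be printed; there remain far more predictor columns than rows.
dim(printsdata)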
Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.
set.seed(12345)
databreak2 <- createDataPartition (printsdata$permeability, times=1, p=.7, list=FALSE)
traindata2 <- printsdata[databreak2,]
testdata2 <-printsdata[-databreak2,]
Build a linear model (LM2) with the training data using train with preProcess=c(“center”,“scale”) and compute R2 for the LM2’s performance on the testing data. R is going to complain about rank deficiency because there are more variables than observations. Just ignore those warnings for this problem.
LM2<-train(permeability~. , data= traindata2, method = "lm", preProcess=c("center", "scale") )
PredictedLM2<- predict(LM2, newdata= testdata2)
RsquaredLM2<-cor(PredictedLM2, testdata2$permeability)^2
RsquaredLM2
## [1] 0.01698163
The testing data R-squared is 0.01698163
Now incorporate principal components into your model. Build another linear model (LM2pca) with the training data using train with preProcess=c(“center”,“scale”,“pca”) and compute R2 for the LM2pca’s performance on the testing data. Did PCA increase R2?
LM2pca<-train(permeability~., data=traindata2, method="lm", preProcess=c("center", "scale", "pca"))
PredictedLM2pca <- predict(LM2pca, newdata= testdata2)
RsquaredLM2pca<- cor(PredictedLM2pca, testdata2$permeability)^2
RsquaredLM2pca
## [1] 0.4644431
After running the model on the testing data, it produced an R-squared of 0.4644431. PCA dramatically increased R-squared, from about 0.017 without PCA to about 0.464 with it.
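The number of principal components caret retained (by default, enough to capture 95 percent of the predictor variance) can be seen by printing the preprocessing object stored inside the model; a quick sketch:
LM2pca$preProcess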
The website https://statweb.stanford.edu/~hastie/ElemStatLearn/datasets/ contains microarray data for 14 different types of cancer.
Use fread in the data.table package to read the training and test sets from the website directly into R. Make data frames for the training and testing data that combine the predictors and the response, and convert the response to a factor variable. Note that the data are high dimensional.
library(doParallel)
library(parallel)
library(data.table)
xtrain <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.xtrain")
ytrain <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.ytrain")
xtest <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.xtest")
ytest <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.ytest")
# The files store genes in rows and samples in columns, so transpose so that each row is a sample.
training <- transpose(xtrain)
testing <- transpose(xtest)
training$response <- as.factor(t(ytrain))
testing$response <- as.factor(t(ytest))
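Before modeling, it is worth confirming the shape of the combined training data and the class balance of the response; a quick sketch (not required by the problem):
dim(training)
table(training$response)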
Set the seed to 12345, and use the training data to build a linear discriminant analysis model (LDA3) that predicts the type of cancer and evaluate the accuracy of LDA3 with the testing data.
library(caret)
cluster = makeCluster(detectCores() - 1)
registerDoParallel(cluster)
set.seed(12345)
LDA3 <- train(response~.,
data=training,
method="lda",
trControl=trainControl(method="cv", number=10, allowParallel = TRUE),
preProcess=c("center","scale"))
stopCluster(cluster)
predictLDA <- predict(LDA3, testing)
confusionMatrix(predictLDA, testing$response)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1 1 0 0 1 0 2 0 0 0 0 2 1 1 0
## 2 0 2 1 0 0 0 0 0 0 0 0 0 0 0
## 3 1 0 2 0 1 0 1 0 0 1 0 0 0 0
## 4 0 0 0 3 0 0 0 0 0 0 0 0 0 0
## 5 0 1 0 0 4 0 0 0 2 0 0 0 0 0
## 6 1 0 0 0 0 1 0 0 0 0 0 2 0 0
## 7 0 1 0 0 0 0 1 0 0 0 0 0 0 0
## 8 1 0 1 0 0 0 0 2 0 2 0 0 0 0
## 9 0 0 0 0 0 0 0 0 4 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## 12 0 1 0 0 1 0 0 0 0 0 0 0 0 0
## 13 0 1 0 0 0 0 0 0 0 0 0 1 2 0
## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 4
##
## Overall Statistics
##
## Accuracy : 0.5
## 95% CI : (0.3608, 0.6392)
## No Information Rate : 0.1111
## P-Value [Acc > NIR] : 1.581e-12
##
## Kappa : 0.4594
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.25000 0.33333 0.50000 0.75000 0.66667 0.33333
## Specificity 0.86000 0.97917 0.92000 1.00000 0.93750 0.94118
## Pos Pred Value 0.12500 0.66667 0.33333 1.00000 0.57143 0.25000
## Neg Pred Value 0.93478 0.92157 0.95833 0.98039 0.95745 0.96000
## Prevalence 0.07407 0.11111 0.07407 0.07407 0.11111 0.05556
## Detection Rate 0.01852 0.03704 0.03704 0.05556 0.07407 0.01852
## Detection Prevalence 0.14815 0.05556 0.11111 0.05556 0.12963 0.07407
## Balanced Accuracy 0.55500 0.65625 0.71000 0.87500 0.80208 0.63725
## Class: 7 Class: 8 Class: 9 Class: 10 Class: 11 Class: 12
## Sensitivity 0.50000 1.00000 0.66667 0.00000 0.33333 0.00000
## Specificity 0.98077 0.92308 1.00000 1.00000 1.00000 0.96000
## Pos Pred Value 0.50000 0.33333 1.00000 NaN 1.00000 0.00000
## Neg Pred Value 0.98077 1.00000 0.96000 0.94444 0.96226 0.92308
## Prevalence 0.03704 0.03704 0.11111 0.05556 0.05556 0.07407
## Detection Rate 0.01852 0.03704 0.07407 0.00000 0.01852 0.00000
## Detection Prevalence 0.03704 0.11111 0.07407 0.00000 0.01852 0.03704
## Balanced Accuracy 0.74038 0.96154 0.83333 0.50000 0.66667 0.48000
## Class: 13 Class: 14
## Sensitivity 0.66667 1.00000
## Specificity 0.96078 1.00000
## Pos Pred Value 0.50000 1.00000
## Neg Pred Value 0.98000 1.00000
## Prevalence 0.05556 0.07407
## Detection Rate 0.03704 0.07407
## Detection Prevalence 0.07407 0.07407
## Balanced Accuracy 0.81373 1.00000
Use the training data to build a k-nearest neighbors model (KNN3) that predicts the type of cancer and evaluate the accuracy of KNN3 with the testing data.
cluster = makeCluster(detectCores() - 1)
registerDoParallel(cluster)
KNN3 <- train(response~.,
data=training,
method="knn",
trControl=trainControl(method="cv", number=10, allowParallel = TRUE),
preProcess=c("center","scale"))
stopCluster(cluster)
predictKNN <- predict(KNN3, testing)
confusionMatrix(predictKNN, testing$response)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
## 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 2 4 1 0 0 0 0 0 0 0 0 0
## 5 0 2 0 0 3 0 0 0 0 0 0 0 0 0
## 6 2 1 0 0 0 2 0 0 0 0 1 2 0 0
## 7 1 0 0 0 0 0 1 0 0 1 0 0 0 0
## 8 1 1 0 0 1 0 0 0 1 2 1 0 0 0
## 9 0 0 0 0 0 0 0 0 4 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0 1 0 0 0 1 1 0
## 12 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 13 0 2 1 0 1 0 0 0 1 0 0 1 2 1
## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 3
##
## Overall Statistics
##
## Accuracy : 0.3704
## 95% CI : (0.2429, 0.5126)
## No Information Rate : 0.1111
## P-Value [Acc > NIR] : 6.015e-07
##
## Kappa : 0.325
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.00000 0.00000 0.25000 1.00000 0.50000 0.66667
## Specificity 0.96000 0.97917 1.00000 0.94000 0.95833 0.88235
## Pos Pred Value 0.00000 0.00000 1.00000 0.57143 0.60000 0.25000
## Neg Pred Value 0.92308 0.88679 0.94340 1.00000 0.93878 0.97826
## Prevalence 0.07407 0.11111 0.07407 0.07407 0.11111 0.05556
## Detection Rate 0.00000 0.00000 0.01852 0.07407 0.05556 0.03704
## Detection Prevalence 0.03704 0.01852 0.01852 0.12963 0.09259 0.14815
## Balanced Accuracy 0.48000 0.48958 0.62500 0.97000 0.72917 0.77451
## Class: 7 Class: 8 Class: 9 Class: 10 Class: 11 Class: 12
## Sensitivity 0.50000 0.00000 0.66667 0.00000 0.00000 0.00000
## Specificity 0.96154 0.86538 1.00000 1.00000 0.94118 0.98000
## Pos Pred Value 0.33333 0.00000 1.00000 NaN 0.00000 0.00000
## Neg Pred Value 0.98039 0.95745 0.96000 0.94444 0.94118 0.92453
## Prevalence 0.03704 0.03704 0.11111 0.05556 0.05556 0.07407
## Detection Rate 0.01852 0.00000 0.07407 0.00000 0.00000 0.00000
## Detection Prevalence 0.05556 0.12963 0.07407 0.00000 0.05556 0.01852
## Balanced Accuracy 0.73077 0.43269 0.83333 0.50000 0.47059 0.49000
## Class: 13 Class: 14
## Sensitivity 0.66667 0.75000
## Specificity 0.86275 1.00000
## Pos Pred Value 0.22222 1.00000
## Neg Pred Value 0.97778 0.98039
## Prevalence 0.05556 0.07407
## Detection Rate 0.03704 0.05556
## Detection Prevalence 0.16667 0.05556
## Balanced Accuracy 0.76471 0.87500
Compare the accuracies of the different classifiers. Is either classifier very effective? Explain.
The accuracy using LDA was 0.5 (kappa = 0.4594), while the accuracy using KNN was 0.3704 (kappa = 0.325).
Neither classifier is very effective: both have kappa values below 0.6 and accuracies at or below 50%.
However, LDA3 clearly outperforms KNN3, with both a higher accuracy and a higher kappa.
The data in T:/Faculty & Staff Alphabetical/M/McDevittt/Public/DAT315/Data/mnist train.csv and T:/Faculty & Staff Alphabetical/M/McDevittt/Public/DAT315/Data/mnist test.csv are digitized handwritten numerals. (See https://en.wikipedia.org/wiki/MNIST_database for more information.) Each record consists of the correct numeral followed by 784 grayscale values ranging from 0 to 255. When the 784 grayscale values are arranged in a 28 × 28 array, they show the digitized digit. For example, the first record in the training data corresponds to the following (digitized) handwritten “5”.
The response is a factor variable but R does not allow variables like “0”, “1”, etc… Rename “0” as “c0” (for “character 0”), “1” as “c1”, …, and “9” as “c9”. in both the training and testing data. Also, remove all of the predictors with small variance in the training data. Be sure to remove the same variables in the testing data.
library(caret)
library(lattice)
library(ggplot2)
mnist_train <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 4\\mnist_train.csv", header=FALSE)
mnist_test <- read.csv("C:\\Users\\Sarah\\OneDrive\\DAT315\\Project 4\\mnist_test.csv", header=FALSE)
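# Optional sketch (not part of the assignment): view the first training record as a
# 28 x 28 grayscale image to see the handwritten "5" described above. The label is in V1
# and the 784 pixel values are in the remaining columns; depending on the graphics device,
# the orientation may need flipping.
firstdigit <- matrix(as.numeric(unlist(mnist_train[1, -1])), nrow = 28, ncol = 28, byrow = TRUE)
image(t(apply(firstdigit, 2, rev)), col = grey.colors(256), axes = FALSE)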
mnist_train$V1 <- sub("0", "c0", mnist_train$V1)
mnist_train$V1 <- sub("1", "c1", mnist_train$V1)
mnist_train$V1 <- sub("2", "c2", mnist_train$V1)
mnist_train$V1 <- sub("3", "c3", mnist_train$V1)
mnist_train$V1 <- sub("4", "c4", mnist_train$V1)
mnist_train$V1 <- sub("5", "c5", mnist_train$V1)
mnist_train$V1 <- sub("6", "c6", mnist_train$V1)
mnist_train$V1 <- sub("7", "c7", mnist_train$V1)
mnist_train$V1 <- sub("8", "c8", mnist_train$V1)
mnist_train$V1 <- sub("9", "c9", mnist_train$V1)
mnist_test$V1 <- sub("0", "c0", mnist_test$V1)
mnist_test$V1 <- sub("1", "c1", mnist_test$V1)
mnist_test$V1 <- sub("2", "c2", mnist_test$V1)
mnist_test$V1 <- sub("3", "c3", mnist_test$V1)
mnist_test$V1 <- sub("4", "c4", mnist_test$V1)
mnist_test$V1 <- sub("5", "c5", mnist_test$V1)
mnist_test$V1 <- sub("6", "c6", mnist_test$V1)
mnist_test$V1 <- sub("7", "c7", mnist_test$V1)
mnist_test$V1 <- sub("8", "c8", mnist_test$V1)
mnist_test$V1 <- sub("9", "c9", mnist_test$V1)
predictors4<- nearZeroVar(mnist_train)
mnist_train<- mnist_train[, -predictors4]
mnist_test<- mnist_test[, -predictors4]
Use the training data to build a k-nearest neighbors model (KNN4) to identify the handwritten digits. Assess the accuracy of KNN4 by applying it to the testing data.
library(klaR)
library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)
clustermade<- makeCluster(detectCores()-1)
registerDoParallel(clustermade)
KNN4 <- train(V1~., data = mnist_train, method = "knn",
              trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
              preProcess = c("center", "scale", "pca"))
predictedKNN4 <- predict(KNN4,mnist_test)
mnist_test2<-as.factor(mnist_test$V1)
confusionMatrix(predictedKNN4, mnist_test2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
## c0 972 0 6 0 0 4 5 0 4 2
## c1 1 1128 5 2 4 0 3 18 0 5
## c2 1 4 996 3 0 0 0 5 5 2
## c3 0 2 1 978 0 11 1 0 6 5
## c4 0 0 3 0 950 0 2 1 2 5
## c5 1 0 1 15 0 863 1 0 12 5
## c6 4 0 2 1 5 6 945 0 7 1
## c7 1 0 15 8 3 2 0 998 9 8
## c8 0 0 3 1 1 2 1 0 922 2
## c9 0 1 0 2 19 4 0 6 7 974
##
## Overall Statistics
##
## Accuracy : 0.9726
## 95% CI : (0.9692, 0.9757)
## No Information Rate : 0.1135
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9695
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: c0 Class: c1 Class: c2 Class: c3 Class: c4
## Sensitivity 0.9918 0.9938 0.9651 0.9683 0.9674
## Specificity 0.9977 0.9957 0.9978 0.9971 0.9986
## Pos Pred Value 0.9789 0.9674 0.9803 0.9741 0.9865
## Neg Pred Value 0.9991 0.9992 0.9960 0.9964 0.9965
## Prevalence 0.0980 0.1135 0.1032 0.1010 0.0982
## Detection Rate 0.0972 0.1128 0.0996 0.0978 0.0950
## Detection Prevalence 0.0993 0.1166 0.1016 0.1004 0.0963
## Balanced Accuracy 0.9948 0.9948 0.9814 0.9827 0.9830
## Class: c5 Class: c6 Class: c7 Class: c8 Class: c9
## Sensitivity 0.9675 0.9864 0.9708 0.9466 0.9653
## Specificity 0.9962 0.9971 0.9949 0.9989 0.9957
## Pos Pred Value 0.9610 0.9732 0.9559 0.9893 0.9615
## Neg Pred Value 0.9968 0.9986 0.9967 0.9943 0.9961
## Prevalence 0.0892 0.0958 0.1028 0.0974 0.1009
## Detection Rate 0.0863 0.0945 0.0998 0.0922 0.0974
## Detection Prevalence 0.0898 0.0971 0.1044 0.0932 0.1013
## Balanced Accuracy 0.9818 0.9918 0.9828 0.9728 0.9805
After running KNN4 on the testing data, we find it has an accuracy of 0.9726.
Use the training data to build a linear discriminant analysis model (LDA4) to identify the handwritten digits. Assess the accuracy of LDA4 by applying it to the testing data.
library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)
LDA4 <- train(V1~., data = mnist_train, method = "lda",
              trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE),
              preProcess = c("center", "scale", "pca"))
stopCluster(clustermade)   # release the worker cluster started for KNN4 above
mnist_test2<-as.factor(mnist_test$V1)
predictedLDa4<- predict( LDA4, mnist_test)
confusionMatrix(predictedLDa4, mnist_test2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
## c0 914 0 12 7 1 13 12 4 5 8
## c1 1 1075 45 6 11 10 8 39 28 7
## c2 5 7 808 27 7 8 6 20 8 6
## c3 1 2 22 848 1 49 1 5 33 11
## c4 0 1 25 2 859 14 26 17 15 58
## c5 41 3 10 50 2 721 34 3 34 15
## c6 12 4 16 10 15 20 864 1 22 5
## c7 3 2 22 25 1 14 0 866 10 22
## c8 3 41 55 19 6 34 7 4 795 9
## c9 0 0 17 16 79 9 0 69 24 868
##
## Overall Statistics
##
## Accuracy : 0.8618
## 95% CI : (0.8549, 0.8685)
## No Information Rate : 0.1135
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8464
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: c0 Class: c1 Class: c2 Class: c3 Class: c4
## Sensitivity 0.9327 0.9471 0.7829 0.8396 0.8747
## Specificity 0.9931 0.9825 0.9895 0.9861 0.9825
## Pos Pred Value 0.9365 0.8740 0.8958 0.8715 0.8446
## Neg Pred Value 0.9927 0.9932 0.9754 0.9821 0.9863
## Prevalence 0.0980 0.1135 0.1032 0.1010 0.0982
## Detection Rate 0.0914 0.1075 0.0808 0.0848 0.0859
## Detection Prevalence 0.0976 0.1230 0.0902 0.0973 0.1017
## Balanced Accuracy 0.9629 0.9648 0.8862 0.9128 0.9286
## Class: c5 Class: c6 Class: c7 Class: c8 Class: c9
## Sensitivity 0.8083 0.9019 0.8424 0.8162 0.8603
## Specificity 0.9789 0.9884 0.9890 0.9803 0.9762
## Pos Pred Value 0.7897 0.8916 0.8974 0.8171 0.8022
## Neg Pred Value 0.9812 0.9896 0.9821 0.9802 0.9842
## Prevalence 0.0892 0.0958 0.1028 0.0974 0.1009
## Detection Rate 0.0721 0.0864 0.0866 0.0795 0.0868
## Detection Prevalence 0.0913 0.0969 0.0965 0.0973 0.1082
## Balanced Accuracy 0.8936 0.9451 0.9157 0.8983 0.9182
After running LDA4 on the testing data, we find it has an accuracy of 0.8618.
Compare and contrast the accuracies of the two classifiers on the testing data.
The KNN model had an accuracy of 0.9726, while the LDA model had an accuracy of 0.8618. Both models identify the handwritten digits far better than chance, but KNN4 is clearly the stronger classifier: it misclassifies roughly 3% of the test digits, compared with about 14% for LDA4.
Access the spam data in the kernlab package.
Set the seed to 12345 and use createDataPartition with p = 0.7 in the caret package to partition the data into training and testing sets.
library(kernlab)
library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)
data(spam)
set.seed(12345)
databreak5 <- createDataPartition(spam$type, times=1, p=.7, list=FALSE)
traindata5 <- spam[databreak5,]
testdata5 <-spam[-databreak5,]
Use the training data to build a naïve Bayes classifier (NB5) for the type of message (spam or nonspam) without any preprocessing. Assess the predictive capacity of NB5 using the testing data.
options(warn=-1)
library(klaR)
library(caret)
library(MASS)
NB5<-train( type ~., data=traindata5, method= "nb", trControl=trainControl(method ="cv", number=10) )
predictedNB5<-predict(NB5, testdata5)
confusionMatrix(predictedNB5, testdata5$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 499 31
## spam 337 512
##
## Accuracy : 0.7331
## 95% CI : (0.709, 0.7563)
## No Information Rate : 0.6062
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4913
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5969
## Specificity : 0.9429
## Pos Pred Value : 0.9415
## Neg Pred Value : 0.6031
## Prevalence : 0.6062
## Detection Rate : 0.3619
## Detection Prevalence : 0.3843
## Balanced Accuracy : 0.7699
##
## 'Positive' Class : nonspam
##
After running on the testing data, the model has an accuracy of 0.7331. This is well above the no-information rate, but it would still produce a substantial number of errors. It has a sensitivity of 0.5969 and a specificity of 0.9429.
Use the training data to build a naïve Bayes classifier (NB5pca) using preProcess=c(“center”,“scale”,“pca”) as an optional argument to the train command. Assess the predictive capacity of NB5pca using the testing data.
library(klaR)
library(caret)
library(kernlab)
library(MASS)
NB5pca<-train(type~., data=traindata5 , method = "nb" , trControl=trainControl(method ="cv", number=10) , preProcess=c("center", "scale", "pca") )
predictedNB5pca <- predict(NB5pca, testdata5)
confusionMatrix( predictedNB5pca, testdata5$type )
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 754 100
## spam 82 443
##
## Accuracy : 0.868
## 95% CI : (0.849, 0.8854)
## No Information Rate : 0.6062
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7219
##
## Mcnemar's Test P-Value : 0.2076
##
## Sensitivity : 0.9019
## Specificity : 0.8158
## Pos Pred Value : 0.8829
## Neg Pred Value : 0.8438
## Prevalence : 0.6062
## Detection Rate : 0.5468
## Detection Prevalence : 0.6193
## Balanced Accuracy : 0.8589
##
## 'Positive' Class : nonspam
##
After running on the testing data, the model has an accuracy of 0.868. It also has a sensitivity of 0.9019 and a specificity of 0.8158.
Compare and contrast the accuracy, specificity, and sensitivity of NB5 and NB5pca. Which classifier would be preferable in practice as part of an actual email filter?
NB5pca has a higher accuracy (0.868) than NB5 (0.7331) and a much higher sensitivity (0.9019 versus 0.5969). Since the positive class is nonspam, sensitivity is the proportion of legitimate messages classified correctly, so NB5 would flag roughly 40% of legitimate email as spam while NB5pca would flag only about 10% (1 - 0.9019 = 9.81%). NB5 does have the higher specificity (0.9429 versus 0.8158), meaning it catches a larger share of actual spam. In practice, though, quarantining legitimate email is far more costly than letting some spam through, so despite NB5's edge in specificity, NB5pca is the better overall model and would be preferable as part of an actual email filter.
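If the two confusion matrices are stored in objects (cmNB5 and cmNB5pca are hypothetical names), the statistics used in this comparison can be pulled out side by side; a quick sketch:
cmNB5 <- confusionMatrix(predictedNB5, testdata5$type)
cmNB5pca <- confusionMatrix(predictedNB5pca, testdata5$type)
rbind(NB5 = c(cmNB5$overall["Accuracy"], cmNB5$byClass[c("Sensitivity", "Specificity")]),
      NB5pca = c(cmNB5pca$overall["Accuracy"], cmNB5pca$byClass[c("Sensitivity", "Specificity")]))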