Problem 1

Part A

library('caret')
library('ISLR')
testHitters <- Hitters[c(1:13,16:19)]
testHitters <- na.omit(testHitters)
testHitters$Salary <- log(testHitters$Salary)
names(testHitters)[17] <- "LogSalary"
set.seed(12345)
trainingIndices <- createDataPartition(y=testHitters$LogSalary, p=0.7, list = FALSE)
training <- testHitters[trainingIndices,]
testing <- testHitters[-trainingIndices,]
LM1 <- train(LogSalary ~ ., data=training, method="lm")
predicted_LM1 <- predict(LM1, newdata=testing)
Rsquared_LM1 <- cor(predicted_LM1, testing$LogSalary)^2
Rsquared_LM1
## [1] 0.4518025

This part of the code loads the required libraries and creates a subset of the Hitters data without the factor variables. Next, the code removes the records with missing salaries, transforms the remaining salaries to the log of the salary, and renames that column LogSalary. The code then sets the seed for random number generation and creates training and testing data sets with 70% of the observations in the training data. Finally, the code fits a linear model predicting LogSalary from all of the remaining variables, predicts values of LogSalary on the testing data, and computes the R-squared for the testing data, which is 0.4518025.
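
As an optional cross-check, caret's postResample() computes the same held-out R-squared (along with RMSE and MAE) from the predictions and observed values:

# RMSE, Rsquared, and MAE on the testing data
postResample(pred = predicted_LM1, obs = testing$LogSalary)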

Part B

library('elasticnet')
RIDGE1 <- train(LogSalary~., data=training, method="ridge", maximize=TRUE, metric="Rsquared", trControl=trainControl(method="cv", number=10), preProcess=c("center", "scale"))
predicted_RIDGE1 <- predict(RIDGE1, newdata=testing)
Rsquared_RIDGE1 <- cor(predicted_RIDGE1, testing$LogSalary)^2
Rsquared_RIDGE1
## [1] 0.4812084

This part of the code fits a ridge regression model on the training data to predict LogSalary from all other predictors, with arguments for 10-fold cross-validation and for centering and scaling the predictors. Values of LogSalary are then predicted using the ridge regression model and the testing data, and the R-squared computed for the testing data is 0.4812084, higher than the R-squared for the linear model.
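
The penalty value chosen by cross-validation can be inspected on the fitted object; for caret's "ridge" method the tuning parameter is lambda:

# penalty selected by 10-fold cross-validation
RIDGE1$bestTune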

Part C

LASSO1 <- train(LogSalary~., data=training, method="lasso", maximize=TRUE, metric="Rsquared", trControl=trainControl(method="cv", number=10), preProcess=c("center", "scale"))
predicted_LASSO1 <- predict(LASSO1, newdata=testing)
Rsquared_LASSO1 <- cor(predicted_LASSO1, testing$LogSalary)^2
Rsquared_LASSO1
## [1] 0.4686052

This part of the code fits a LASSO model on the training data to predict LogSalary from all other predictors, with arguments for 10-fold cross-validation and for centering and scaling the predictors. Values of LogSalary are then predicted using the LASSO model and the testing data, and the R-squared computed for the testing data is 0.4686052, lower than the ridge regression model but higher than the linear model.
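
Putting the three test-set R-squared values side by side makes this comparison explicit (reusing the objects computed above):

# test-set R-squared for the linear, ridge, and LASSO models
c(lm = Rsquared_LM1, ridge = Rsquared_RIDGE1, lasso = Rsquared_LASSO1)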

Part D

Removed_coef <- predict.enet(LASSO1$finalModel, type="coef", s=LASSO1$bestTune$fraction, mode='fraction')
Removed_coef$coefficients
##       AtBat        Hits       HmRun        Runs         RBI       Walks 
## -0.26718587  0.40050874  0.04954137  0.00000000  0.03072973  0.12541714 
##       Years      CAtBat       CHits      CHmRun       CRuns        CRBI 
##  0.13511795  0.00000000  0.45915555 -0.09788284  0.04494156  0.00000000 
##      CWalks     PutOuts     Assists      Errors 
## -0.07672909  0.11115853  0.09988299 -0.09130279

This command extracts the coefficients of the LASSO model at the tuning fraction selected by cross-validation. The predictors not used in the LASSO model are those whose coefficient is exactly zero, so the removed variables are Runs, CAtBat, and CRBI.
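
The dropped predictors can also be pulled out programmatically rather than read off by eye:

# names of the predictors whose LASSO coefficient is exactly zero
names(which(Removed_coef$coefficients == 0))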

Problem 2

Part A

library(AppliedPredictiveModeling)
library(caret)

data(permeability)
Data <- as.data.frame(fingerprints)
Data_corrected <- Data[, -nearZeroVar(Data)]
Data_corrected$permeability <- permeability

This part of the code creates a data frame from the fingerprints matrix, then removes the near-zero-variance predictors from the set. The code then adds the response, permeability, to the data frame.
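
A quick sanity check on how aggressive the filter was (a sketch; no particular counts are assumed here):

# number of near-zero-variance fingerprints removed, and the resulting dimensions
length(nearZeroVar(Data))
dim(Data_corrected)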

Part B

set.seed(12345)
trainingIndices_permeability <- createDataPartition(y=Data_corrected$permeability, p=0.7, list = FALSE)
training_permeability <- Data_corrected[trainingIndices_permeability,]
testing_permeability <- Data_corrected[-trainingIndices_permeability,]

This part of the code sets the seed for random number generation, then creates training and testing data sets with 70% of the observations placed in the training data.

Part C

LM2 <- train(permeability~., data=training_permeability, method="lm", maximize=TRUE, metric="Rsquared", preProcess=c("center", "scale"))
predicted_LM2 <- predict(LM2, newdata=testing_permeability)
Rsquared_LM2 <- cor(predicted_LM2, testing_permeability$permeability)^2
Rsquared_LM2
##      permeability
## [1,]    0.2323023

This part of the code fits a linear model on the training data to predict permeability from all other predictors, while centering and scaling the predictors. Values of permeability are then predicted using the linear model and the testing data, and the R-squared computed for the testing data is 0.2323023.
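
The matrix-style printout above appears because permeability was attached to the data frame as a one-column matrix, so cor() returns a 1x1 matrix rather than a scalar. Wrapping the computation in as.numeric() (an optional tweak) gives a plain number:

# identical value, returned as a plain scalar instead of a 1x1 matrix
as.numeric(cor(predicted_LM2, testing_permeability$permeability)^2)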

Part D

LM2pca <- train(permeability~., data=training_permeability, method="lm", maximize=TRUE, metric="Rsquared", preProcess=c("center", "scale", "pca"))
predicted_LM2pca <- predict(LM2pca, newdata=testing_permeability)
Rsquared_LM2pca <- cor(predicted_LM2pca, testing_permeability$permeability)^2
Rsquared_LM2pca
##      permeability
## [1,]    0.3688938

This part of the code fits another linear model on the training data to predict permeability from all other predictors, while centering and scaling the predictors and also applying principal component analysis to them. Values of permeability are then predicted using the linear model and the testing data, and the R-squared computed for the testing data is 0.3688938, a clear improvement over the model without PCA.
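
By default, caret's "pca" pre-processing keeps enough principal components to capture 95% of the predictors' variance; printing the preProcess slot of the fitted model shows what was applied (the exact component count is not assumed here):

# describes the centering, scaling, and PCA applied inside train()
LM2pca$preProcess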

Problem 3

Part A

library(data.table)
library(curl)


# pull the 14-cancer gene-expression data from the ElemStatLearn site
TrainGeneex <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.xtrain")
TrainClassLab <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.ytrain")
TestGeneex <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.xtest")
TestClassLab <- fread("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/14cancer.ytest")

# genes are stored as rows in the raw files, so transpose to one sample per row
# and attach the class labels as the final column
Traindata <- cbind(transpose(TrainGeneex), transpose(TrainClassLab))
Traindata <- data.frame(Traindata)

# column 16064 = 16063 gene-expression columns + 1 class label
colnames(Traindata)[16064] <- "Response"
Traindata$Response <- as.factor(Traindata$Response)

# same preparation for the test set
Testdata <- cbind(transpose(TestGeneex), transpose(TestClassLab))
Testdata <- data.frame(Testdata)
colnames(Testdata)[16064] <- "Response"
Testdata$Response <- as.factor(Testdata$Response)

Part B

library(caret)
library(ggplot2)
library(lattice)
library(kernlab)
library(e1071)




LDA <- train(Response ~., data= Traindata, method="lda", trControl=trainControl(method ="cv", number=10, allowParallel = TRUE), preProcess=c("center", "scale", "pca"))

LDAPredict <- predict(LDA, newdata= Testdata)
confusionMatrix(LDAPredict, Testdata$Response)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##         1  2 2 1 0 1 2 1 1 0  2  3  1  0  0
##         2  0 1 1 0 0 0 0 0 0  0  0  0  0  0
##         3  0 0 2 0 0 0 0 0 0  0  0  0  0  0
##         4  0 0 0 4 1 0 0 0 0  0  0  0  0  0
##         5  0 0 0 0 4 0 0 0 0  0  0  0  0  0
##         6  2 0 0 0 0 1 0 0 0  0  0  2  0  0
##         7  0 1 0 0 0 0 1 0 0  0  0  0  0  0
##         8  0 0 0 0 0 0 0 1 0  1  0  0  0  0
##         9  0 0 0 0 0 0 0 0 4  0  0  0  0  0
##         10 0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         11 0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         12 0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         13 0 2 0 0 0 0 0 0 2  0  0  1  3  0
##         14 0 0 0 0 0 0 0 0 0  0  0  0  0  4
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.3608, 0.6392)
##     No Information Rate : 0.1111          
##     P-Value [Acc > NIR] : 1.581e-12       
##                                           
##                   Kappa : 0.4602          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.50000  0.16667  0.50000  1.00000  0.66667  0.33333
## Specificity           0.72000  0.97917  1.00000  0.98000  1.00000  0.92157
## Pos Pred Value        0.12500  0.50000  1.00000  0.80000  1.00000  0.20000
## Neg Pred Value        0.94737  0.90385  0.96154  1.00000  0.96000  0.95918
## Prevalence            0.07407  0.11111  0.07407  0.07407  0.11111  0.05556
## Detection Rate        0.03704  0.01852  0.03704  0.07407  0.07407  0.01852
## Detection Prevalence  0.29630  0.03704  0.03704  0.09259  0.07407  0.09259
## Balanced Accuracy     0.61000  0.57292  0.75000  0.99000  0.83333  0.62745
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.50000  0.50000  0.66667   0.00000   0.00000
## Specificity           0.98077  0.98077  1.00000   1.00000   1.00000
## Pos Pred Value        0.50000  0.50000  1.00000       NaN       NaN
## Neg Pred Value        0.98077  0.98077  0.96000   0.94444   0.94444
## Prevalence            0.03704  0.03704  0.11111   0.05556   0.05556
## Detection Rate        0.01852  0.01852  0.07407   0.00000   0.00000
## Detection Prevalence  0.03704  0.03704  0.07407   0.00000   0.00000
## Balanced Accuracy     0.74038  0.74038  0.83333   0.50000   0.50000
##                      Class: 12 Class: 13 Class: 14
## Sensitivity            0.00000   1.00000   1.00000
## Specificity            1.00000   0.90196   1.00000
## Pos Pred Value             NaN   0.37500   1.00000
## Neg Pred Value         0.92593   1.00000   1.00000
## Prevalence             0.07407   0.05556   0.07407
## Detection Rate         0.00000   0.05556   0.07407
## Detection Prevalence   0.00000   0.14815   0.07407
## Balanced Accuracy      0.50000   0.95098   1.00000

Part C

library(caret)
library(ggplot2)
library(lattice)

KNN3 <- train(Response~., data= Traindata, method= "knn", maximize= TRUE, trControl=trainControl(method ="cv", number=10), preProcess=c("center", "scale", "pca"))


KNNPred<- predict(KNN3, newdata= Testdata)

confusionMatrix(KNNPred, Testdata$Response)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##         1  1 0 0 0 0 0 0 1 0  1  0  0  0  0
##         2  0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         3  0 0 0 0 0 0 0 0 0  0  0  0  0  0
##         4  0 1 2 2 1 0 0 1 0  0  0  0  0  0
##         5  0 2 0 0 3 0 0 0 0  0  0  0  0  0
##         6  0 0 0 0 0 0 1 0 0  0  1  2  0  0
##         7  0 0 0 0 0 2 1 0 1  0  0  0  0  0
##         8  1 1 0 0 0 0 0 0 1  2  1  0  0  0
##         9  0 0 0 0 0 0 0 0 4  0  0  0  0  0
##         10 1 0 0 0 0 0 0 0 0  0  0  0  0  0
##         11 0 0 1 2 0 0 0 0 0  0  1  1  1  0
##         12 1 0 0 0 0 0 0 0 0  0  0  0  0  0
##         13 0 2 1 0 2 1 0 0 0  0  0  1  2  1
##         14 0 0 0 0 0 0 0 0 0  0  0  0  0  3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3148          
##                  95% CI : (0.1952, 0.4555)
##     No Information Rate : 0.1111          
##     P-Value [Acc > NIR] : 4.831e-05       
##                                           
##                   Kappa : 0.2663          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.25000   0.0000  0.00000  0.50000  0.50000  0.00000
## Specificity           0.96000   1.0000  1.00000  0.90000  0.95833  0.92157
## Pos Pred Value        0.33333      NaN      NaN  0.28571  0.60000  0.00000
## Neg Pred Value        0.94118   0.8889  0.92593  0.95745  0.93878  0.94000
## Prevalence            0.07407   0.1111  0.07407  0.07407  0.11111  0.05556
## Detection Rate        0.01852   0.0000  0.00000  0.03704  0.05556  0.00000
## Detection Prevalence  0.05556   0.0000  0.00000  0.12963  0.09259  0.07407
## Balanced Accuracy     0.60500   0.5000  0.50000  0.70000  0.72917  0.46078
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           0.50000  0.00000  0.66667   0.00000   0.33333
## Specificity           0.94231  0.88462  1.00000   0.98039   0.90196
## Pos Pred Value        0.25000  0.00000  1.00000   0.00000   0.16667
## Neg Pred Value        0.98000  0.95833  0.96000   0.94340   0.95833
## Prevalence            0.03704  0.03704  0.11111   0.05556   0.05556
## Detection Rate        0.01852  0.00000  0.07407   0.00000   0.01852
## Detection Prevalence  0.07407  0.11111  0.07407   0.01852   0.11111
## Balanced Accuracy     0.72115  0.44231  0.83333   0.49020   0.61765
##                      Class: 12 Class: 13 Class: 14
## Sensitivity            0.00000   0.66667   0.75000
## Specificity            0.98000   0.84314   1.00000
## Pos Pred Value         0.00000   0.20000   1.00000
## Neg Pred Value         0.92453   0.97727   0.98039
## Prevalence             0.07407   0.05556   0.07407
## Detection Rate         0.00000   0.03704   0.05556
## Detection Prevalence   0.01852   0.18519   0.05556
## Balanced Accuracy      0.49000   0.75490   0.87500

Part D

The KNN model was less accurate than the LDA model: the KNN model had an accuracy of 0.3148, while the LDA model had an accuracy of 0.5. Both accuracies are extremely low, and neither model would likely be good to use.
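
If the two confusion matrices are stored in objects (the names cmLDA and cmKNN below are illustrative), the accuracies quoted here can be compared directly:

# illustrative object names; confusionMatrix() returns a list whose
# $overall element holds the overall accuracy
cmLDA <- confusionMatrix(LDAPredict, Testdata$Response)
cmKNN <- confusionMatrix(KNNPred, Testdata$Response)
c(lda = cmLDA$overall["Accuracy"], knn = cmKNN$overall["Accuracy"])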

Problem 4

Part A

set.seed(12345)
library(caret)
library(lattice)
library(ggplot2)
library(readr)
library(utils)

mnist_test <- read_csv("C:/Users/shahi/Desktop/DAT 400/mnist_test2.csv" )
mnist_train <- read_csv("C:/Users/shahi/Desktop/DAT 400/mnist_train2.csv")



colnames(mnist_train)[1]<- "Response1"
colnames(mnist_test )[1] <- "Response1"

mnist_train$Response1 <- sub("0", "c0", mnist_train$Response1)
mnist_train$Response1 <- sub("1", "c1", mnist_train$Response1)
mnist_train$Response1 <- sub("2", "c2", mnist_train$Response1)
mnist_train$Response1 <- sub("3", "c3", mnist_train$Response1)
mnist_train$Response1 <- sub("4", "c4", mnist_train$Response1)
mnist_train$Response1 <- sub("5", "c5", mnist_train$Response1)
mnist_train$Response1 <- sub("6", "c6", mnist_train$Response1)
mnist_train$Response1 <- sub("7", "c7", mnist_train$Response1)
mnist_train$Response1 <- sub("8", "c8", mnist_train$Response1)
mnist_train$Response1 <- sub("9", "c9", mnist_train$Response1)

mnist_test$Response1 <- sub("0", "c0", mnist_test$Response1)
mnist_test$Response1 <- sub("1", "c1", mnist_test$Response1)
mnist_test$Response1 <- sub("2", "c2", mnist_test$Response1)
mnist_test$Response1 <- sub("3", "c3", mnist_test$Response1)
mnist_test$Response1 <- sub("4", "c4", mnist_test$Response1)
mnist_test$Response1 <- sub("5", "c5", mnist_test$Response1)
mnist_test$Response1 <- sub("6", "c6", mnist_test$Response1)
mnist_test$Response1 <- sub("7", "c7", mnist_test$Response1)
mnist_test$Response1 <- sub("8", "c8", mnist_test$Response1)
mnist_test$Response1 <- sub("9", "c9", mnist_test$Response1)

# identify near-zero-variance pixel columns (computed here on the test set)
# and drop them from both splits
First1 <- nearZeroVar(mnist_test)
Training4 <- mnist_train[, -c(First1)]
Testing4 <- mnist_test[, -c(First1)]

Part B

library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)
library(stats)

cluster<- makeCluster(detectCores()-1)
registerDoParallel(cluster)

KNN4<- train(Response1~., data= Training4, method = "knn", trControl=trainControl(method="cv", number =10, allowParallel = TRUE), preProcess=c("center", "scale", "pca"))

predictedKNN4 <- predict(KNN4, newdata= Testing4)

test2 <- as.factor(Testing4$Response1)

confusionMatrix(predictedKNN4, test2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   c0   c1   c2   c3   c4   c5   c6   c7   c8   c9
##         c0  972    0    7    1    0    4    5    0    3    3
##         c1    1 1129    4    1    4    0    2   18    0    5
##         c2    1    5  997    1    0    0    0    5    4    2
##         c3    0    0    2  980    0   12    1    0    8    6
##         c4    0    0    2    0  948    0    3    1    2    6
##         c5    1    0    0   13    0  863    2    0   13    6
##         c6    3    0    2    0    5    6  945    0    5    1
##         c7    2    0   16   10    2    3    0  994    8    6
##         c8    0    0    2    2    0    2    0    0  927    2
##         c9    0    1    0    2   23    2    0   10    4  972
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9727          
##                  95% CI : (0.9693, 0.9758)
##     No Information Rate : 0.1135          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9697          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: c0 Class: c1 Class: c2 Class: c3 Class: c4
## Sensitivity             0.9918    0.9947    0.9661    0.9703    0.9654
## Specificity             0.9975    0.9961    0.9980    0.9968    0.9984
## Pos Pred Value          0.9769    0.9699    0.9823    0.9713    0.9854
## Neg Pred Value          0.9991    0.9993    0.9961    0.9967    0.9962
## Prevalence              0.0980    0.1135    0.1032    0.1010    0.0982
## Detection Rate          0.0972    0.1129    0.0997    0.0980    0.0948
## Detection Prevalence    0.0995    0.1164    0.1015    0.1009    0.0962
## Balanced Accuracy       0.9946    0.9954    0.9820    0.9835    0.9819
##                      Class: c5 Class: c6 Class: c7 Class: c8 Class: c9
## Sensitivity             0.9675    0.9864    0.9669    0.9517    0.9633
## Specificity             0.9962    0.9976    0.9948    0.9991    0.9953
## Pos Pred Value          0.9610    0.9772    0.9549    0.9914    0.9586
## Neg Pred Value          0.9968    0.9986    0.9962    0.9948    0.9959
## Prevalence              0.0892    0.0958    0.1028    0.0974    0.1009
## Detection Rate          0.0863    0.0945    0.0994    0.0927    0.0972
## Detection Prevalence    0.0898    0.0967    0.1041    0.0935    0.1014
## Balanced Accuracy       0.9818    0.9920    0.9808    0.9754    0.9793
stopCluster(cluster)

Part C

library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)

cluster2<- makeCluster(detectCores()-1)
registerDoParallel(cluster2)

LDA4<- train(Response1~., data= Training4, method = "lda",trControl=trainControl(method="cv", number =10, allowParallel = TRUE), preProcess=c("center", "scale", "pca"))


PredLDA4 <- predict(LDA4, Testing4)

test2 <- as.factor(Testing4$Response1)

confusionMatrix(PredLDA4, test2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   c0   c1   c2   c3   c4   c5   c6   c7   c8   c9
##         c0  913    0   12    7    1   11   12    5    6    9
##         c1    1 1078   39    7   12   10    7   38   27    7
##         c2    4    7  815   28    4    9    8   20    8    7
##         c3    1    3   22  848    1   52    1    4   31   11
##         c4    0    1   21    2  860   13   25   17   13   66
##         c5   43    3   12   50    2  721   35    3   36   15
##         c6   12    3   16    9   16   20  864    2   20    4
##         c7    3    2   22   24    1   13    0  869   10   21
##         c8    3   38   55   20    7   33    6    4  797    9
##         c9    0    0   18   15   78   10    0   66   26  860
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8625          
##                  95% CI : (0.8556, 0.8692)
##     No Information Rate : 0.1135          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8472          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: c0 Class: c1 Class: c2 Class: c3 Class: c4
## Sensitivity             0.9316    0.9498    0.7897    0.8396    0.8758
## Specificity             0.9930    0.9833    0.9894    0.9860    0.9825
## Pos Pred Value          0.9355    0.8793    0.8956    0.8706    0.8448
## Neg Pred Value          0.9926    0.9935    0.9761    0.9821    0.9864
## Prevalence              0.0980    0.1135    0.1032    0.1010    0.0982
## Detection Rate          0.0913    0.1078    0.0815    0.0848    0.0860
## Detection Prevalence    0.0976    0.1226    0.0910    0.0974    0.1018
## Balanced Accuracy       0.9623    0.9665    0.8896    0.9128    0.9291
##                      Class: c5 Class: c6 Class: c7 Class: c8 Class: c9
## Sensitivity             0.8083    0.9019    0.8453    0.8183    0.8523
## Specificity             0.9782    0.9887    0.9893    0.9806    0.9763
## Pos Pred Value          0.7837    0.8944    0.9005    0.8200    0.8015
## Neg Pred Value          0.9812    0.9896    0.9824    0.9804    0.9833
## Prevalence              0.0892    0.0958    0.1028    0.0974    0.1009
## Detection Rate          0.0721    0.0864    0.0869    0.0797    0.0860
## Detection Prevalence    0.0920    0.0966    0.0965    0.0972    0.1073
## Balanced Accuracy       0.8932    0.9453    0.9173    0.8994    0.9143
stopCluster(cluster2)

Part D

The KNN model had an accuracy of 0.9727, while the LDA model had an accuracy of 0.8625. The KNN model was the better of the two, but either model would work because both accuracies are so high.

Problem 5

Part A

library(kernlab)
library(caret)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)


set.seed(12345)


data(spam)

Spam_trainingIndices <- createDataPartition(y=spam$type, p=0.7, list = FALSE)
Spam_training <- spam[Spam_trainingIndices,]
Spam_testing <- spam[-Spam_trainingIndices,]

This part of the code first sets the seed for random number generation, then creates training and testing data sets with 70% of the observations placed in the training data, partitioning on the response variable type.
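
Because createDataPartition() samples within each level of type, the spam/nonspam proportions should be nearly identical across the two splits; a quick check:

# class proportions in each split should mirror the full data set
prop.table(table(Spam_training$type))
prop.table(table(Spam_testing$type))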

Part B

library(klaR)
library(MASS)
library(caret)
library(doParallel)
library(parallel)
library(iterators)
library(lattice)
library(ggplot2)
library(foreach)

cluster3<- makeCluster(detectCores()-1)
registerDoParallel(cluster3)


NB5 <- train(type~., data=Spam_training, method="nb", trControl = trainControl(method="cv", number=10))

predicted_NB5 <- predict(NB5, Spam_testing)

confusionMatrix_NB5 <- confusionMatrix(predicted_NB5, Spam_testing$type) 

confusionMatrix_NB5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     476   31
##    spam        360  512
##                                           
##                Accuracy : 0.7165          
##                  95% CI : (0.6919, 0.7401)
##     No Information Rate : 0.6062          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4631          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5694          
##             Specificity : 0.9429          
##          Pos Pred Value : 0.9389          
##          Neg Pred Value : 0.5872          
##              Prevalence : 0.6062          
##          Detection Rate : 0.3452          
##    Detection Prevalence : 0.3677          
##       Balanced Accuracy : 0.7561          
##                                           
##        'Positive' Class : nonspam         
## 
stopCluster(cluster3)

This part of the code fits a naive Bayes classifier for the type of message (spam or nonspam) on the training data, using 10-fold cross-validation. Message types are then predicted on the testing data, and a confusion matrix is created, giving an accuracy of 0.7165, which is fairly good.
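
The resampling estimates that train() computed internally can be compared with this test-set accuracy; both are stored on the fitted object, so nothing new is computed here:

# cross-validated performance for each candidate model, and the chosen settings
NB5$results
NB5$bestTune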

Part C

cluster4 <- makeCluster(detectCores()-1)
registerDoParallel(cluster4)

NB5pca <- train(type~., data=Spam_training, method="nb", trControl = trainControl(method="cv", number=10), preProcess=c("center", "scale", "pca"))
predicted_NB5pca <- predict(NB5pca, Spam_testing)
confusionMatrix_NB5pca <- confusionMatrix(predicted_NB5pca, Spam_testing$type) 
confusionMatrix_NB5pca 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     734   62
##    spam        102  481
##                                           
##                Accuracy : 0.8811          
##                  95% CI : (0.8628, 0.8977)
##     No Information Rate : 0.6062          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7541          
##                                           
##  Mcnemar's Test P-Value : 0.002324        
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.8858          
##          Pos Pred Value : 0.9221          
##          Neg Pred Value : 0.8250          
##              Prevalence : 0.6062          
##          Detection Rate : 0.5323          
##    Detection Prevalence : 0.5772          
##       Balanced Accuracy : 0.8819          
##                                           
##        'Positive' Class : nonspam         
## 
stopCluster(cluster4)

This part of the code fits another naive Bayes classifier for the type of message on the training data, this time centering and scaling the predictors and applying principal component analysis to them. Message types are then predicted on the testing data, and a confusion matrix is created, giving an accuracy of 0.8811, which is very good and significantly better than the original model.

Part D

The original naive Bayes model has a higher specificity, but a lower accuracy and sensitivity. With nonspam as the positive class, sensitivity here measures the proportion of legitimate emails correctly delivered, and that is the most important statistic: it is worse to lose an important email to the spam filter than to receive a spam email. Therefore, the naive Bayes classifier incorporating principal components is the preferred model because of its higher sensitivity.
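
That sensitivity comparison can be read directly off the two confusion-matrix objects created above (a small sketch; for a two-class problem, byClass is a named vector of per-class statistics):

# sensitivity of each naive Bayes model on the testing data
c(original = confusionMatrix_NB5$byClass["Sensitivity"],
  pca = confusionMatrix_NB5pca$byClass["Sensitivity"])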