First, load the necessary libraries. The data set is spam from the kernlab library, a collection of 4601 emails each labeled as either spam or nonspam (ham). We use createDataPartition from the caret library to split it into a training set (75% of the emails) and a test set (the remaining 25%).
library(caret); library(kernlab); data(spam)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y = spam$type,
                               p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
dim(training)
## [1] 3451 58
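The training-set size can be checked by hand. The sketch below assumes the class counts of the kernlab spam data (2788 nonspam and 1813 spam, which are not printed in this document) and assumes that the partition takes about 75% of each class, rounding up. It illustrates stratified sampling arithmetic in base R and is not caret's actual implementation.

```r
# Class counts of the kernlab spam data (assumed; not printed in this document)
class_counts <- c(nonspam = 2788, spam = 1813)

# Take 75% of each class, rounding up (assumed rounding rule)
train_counts <- ceiling(0.75 * class_counts)
test_counts <- class_counts - train_counts

train_counts       # per-class training sizes
sum(train_counts)  # total training rows; compare with dim(training)
test_counts        # per-class test sizes
```

Under these assumptions the total comes to 3451, which matches dim(training), and the test set holds 697 nonspam and 453 spam emails, in agreement with the column totals of the confusion matrix at the end.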
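Next we fit a generalized linear model (a logistic regression, since the outcome has two classes). We set a seed so that the resampling can be reproduced; because the seed is set after createDataPartition, the train/test split itself will still differ between runs. The createFolds call shows how caret splits data into k = 10 folds for k-fold cross-validation. Here it is applied to the full spam data set, which is why the fold sizes sum to 4601 rather than 3451. Note that these folds are not passed to train, so the model is not evaluated by 10-fold cross-validation: train falls back on its default resampling scheme, the 25 bootstrap repetitions reported in its output.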
set.seed(32323)
folds <- createFolds(y = spam$type, k = 10,
                     list = TRUE, returnTrain = FALSE)
sapply(folds,length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
##    460    461    460    459    461    459    460    460    461    460
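To make the meaning of the folds concrete, here is a rough base R sketch of k-fold index assignment. It is a simplification: it shuffles row indices without balancing the classes, whereas createFolds stratifies on the outcome. The names fold_id, test_idx and train_idx are introduced only for this illustration.

```r
set.seed(32323)
n <- 4601   # number of rows in the spam data
k <- 10     # number of folds

# Assign every row index to one of k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))
folds <- split(seq_len(n), fold_id)

# Each list element holds the held-out rows of one fold
# (this is what returnTrain = FALSE means in createFolds)
sapply(folds, length)

# For fold 1: hold out its rows, train on everything else
test_idx <- folds[[1]]
train_idx <- setdiff(seq_len(n), test_idx)
length(train_idx) + length(test_idx)   # equals n
```

Every row lands in exactly one fold, so the fold sizes add up to the size of the data set that was folded.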
suppressWarnings(modelFit<-train(type~.,data=training,method="glm"))
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results
##
##   Accuracy   Kappa      Accuracy SD  Kappa SD
##   0.9229686  0.8370467  0.02050069   0.04779368
##
##
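The summary above reports bootstrap resampling because the folds object is never handed to train. If 10-fold cross-validation is what is wanted, one option is to request it through trainControl. The following is a minimal sketch: the names ctrl and cvFit are introduced here for illustration, and the resulting accuracy estimate will not match the bootstrap figures above.

```r
library(caret); library(kernlab); data(spam)

set.seed(32323)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Ask train() for 10-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 10)

cvFit <- suppressWarnings(
  train(type ~ ., data = training, method = "glm", trControl = ctrl)
)

cvFit$control$method   # resampling method actually used
cvFit$results          # cross-validated accuracy and kappa
```

Printing cvFit should then show a cross-validated resampling line in place of "Bootstrapped (25 reps)".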
modelFit$finalModel
##
## Call: NULL
##
## Coefficients:
##       (Intercept)               make            address
##         -1.761452          -0.209770          -0.150427
##               all              num3d                our
##          0.158502           2.034526           0.637986
##              over             remove           internet
##          1.436093           2.294772           0.615352
##             order               mail            receive
##          0.840590           0.096427          -0.297500
##              will             people             report
##         -0.176807          -0.034223           0.135098
##         addresses               free           business
##          1.067061           1.336390           1.417437
##             email                you             credit
##          0.121739           0.039245           0.743930
##              your               font             num000
##          0.247766           0.272250           1.890718
##             money                 hp                hpl
##          0.368510          -1.749100          -0.953504
##            george             num650                lab
##        -18.672686           0.449805          -2.022645
##              labs             telnet             num857
##         -0.321736          -0.094740           2.951202
##              data             num415              num85
##         -0.752802           0.202794          -2.294297
##        technology            num1999              parts
##          0.816050           0.117789          -0.691748
##                pm             direct                 cs
##         -1.185520          -0.453254         -49.379410
##           meeting           original            project
##         -2.279523          -1.475890          -1.247785
##                re                edu              table
##         -0.642471          -1.607532          -3.047459
##        conference      charSemicolon   charRoundbracket
##         -7.051173          -1.707439          -0.303264
## charSquarebracket    charExclamation         charDollar
##         -1.182268           0.619422           5.307442
##          charHash         capitalAve        capitalLong
##          2.614040           0.066823           0.005496
##      capitalTotal
##          0.001314
##
## Degrees of Freedom: 3450 Total (i.e. Null); 3393 Residual
## Null Deviance: 4628
## Residual Deviance: 1268 AIC: 1384
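The coefficients are on the log-odds scale, which makes them hard to read directly. The sketch below converts a few of them, copied from the output above, into odds ratios, and re-derives two summary figures from the printed deviances. It assumes the fit is a binomial glm that models the log-odds of the second factor level, spam, so that positive coefficients push a prediction towards spam.

```r
# A few coefficients copied from the printed finalModel output
coefs <- c(remove = 2.294772, free = 1.336390,
           hp = -1.749100, george = -18.672686)

# exp(coefficient) = multiplicative change in the odds of "spam"
# per one-unit increase in the predictor, other predictors held fixed
odds_ratios <- exp(coefs)
signif(odds_ratios, 3)

# Deviance-based checks using the printed summary values
null_dev <- 4628
resid_dev <- 1268
n_params <- 3450 - 3393 + 1   # 57 predictors + intercept = 58

resid_dev + 2 * n_params      # AIC for a binary-outcome glm; compare with 1384
1 - resid_dev / null_dev      # share of the null deviance explained
```

Words such as remove and free raise the odds of spam, while hp and george lower them sharply. The model removes roughly 73% of the null deviance on the training data, which says nothing yet about performance on unseen emails.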
We now use the fitted model to predict the class of each email in the test set, the 25% of the data that was held out from training.
predictions<-predict(modelFit,newdata=testing)
predictions
## [1] spam spam spam spam spam spam spam spam
## [9] spam spam spam spam spam spam spam spam
## [17] spam spam spam spam spam spam spam spam
## [25] spam spam spam spam spam spam spam spam
## [33] spam nonspam spam spam spam spam spam spam
## [41] spam spam spam spam spam spam spam spam
## [49] spam spam spam spam spam spam spam spam
## [57] spam spam spam spam spam spam spam spam
## [65] spam spam spam spam nonspam spam spam spam
## [73] spam spam nonspam spam spam nonspam spam spam
## [81] spam spam spam spam spam spam spam spam
## [89] spam spam spam spam spam spam nonspam spam
## [97] spam spam spam spam spam spam spam spam
## [105] spam spam spam spam spam spam spam spam
## [113] spam spam spam nonspam spam nonspam spam spam
## [121] spam spam spam nonspam spam nonspam spam spam
## [129] spam spam spam spam spam spam spam spam
## [137] spam spam spam spam spam spam nonspam spam
## [145] spam spam spam spam spam spam spam spam
## [153] spam spam spam spam spam spam spam spam
## [161] spam spam spam spam spam spam spam spam
## [169] spam spam spam spam spam spam spam spam
## [177] nonspam spam spam nonspam spam spam spam spam
## [185] spam spam nonspam spam spam spam nonspam spam
## [193] spam spam spam spam spam spam spam spam
## [201] spam spam spam spam spam spam spam spam
## [209] spam spam spam spam spam spam spam spam
## [217] spam spam spam spam spam spam spam spam
## [225] spam spam spam spam spam spam spam spam
## [233] nonspam spam spam spam spam nonspam spam spam
## [241] spam spam spam spam spam spam spam spam
## [249] spam spam nonspam spam nonspam spam spam spam
## [257] spam spam spam spam spam spam spam spam
## [265] spam spam spam spam spam spam spam spam
## [273] spam spam spam spam nonspam spam spam spam
## [281] spam spam spam spam spam spam spam spam
## [289] spam nonspam nonspam spam spam spam spam spam
## [297] spam spam spam spam spam spam spam spam
## [305] spam spam spam spam spam spam nonspam spam
## [313] spam spam nonspam spam spam spam nonspam spam
## [321] spam spam spam spam spam spam spam spam
## [329] spam spam spam spam spam spam spam spam
## [337] spam spam spam spam spam spam spam spam
## [345] spam spam spam spam spam spam spam spam
## [353] spam spam spam spam nonspam spam spam spam
## [361] spam spam spam spam spam spam spam spam
## [369] spam spam spam spam spam spam spam nonspam
## [377] nonspam spam spam spam nonspam spam spam spam
## [385] spam spam spam spam spam spam nonspam spam
## [393] nonspam spam spam spam spam spam nonspam spam
## [401] spam nonspam spam spam spam spam spam spam
## [409] spam spam spam nonspam nonspam spam spam spam
## [417] spam spam spam spam nonspam spam nonspam spam
## [425] nonspam spam spam spam spam spam spam nonspam
## [433] spam nonspam spam nonspam spam spam spam spam
## [441] spam spam spam spam spam spam spam spam
## [449] spam spam spam spam spam spam nonspam nonspam
## [457] nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [465] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [473] nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam
## [481] nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
## [489] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [497] nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam
## [505] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [513] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [521] nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam
## [529] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [537] nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [545] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [553] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [561] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam
## [569] nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam
## [577] nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam
## [585] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [593] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [601] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [609] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [617] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [625] spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [633] nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [641] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [649] nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam
## [657] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam
## [665] nonspam nonspam nonspam nonspam nonspam nonspam spam spam
## [673] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [681] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [689] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [697] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [705] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [713] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [721] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [729] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [737] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [745] spam nonspam nonspam nonspam nonspam nonspam spam nonspam
## [753] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam
## [761] nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam
## [769] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [777] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [785] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [793] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [801] nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [809] nonspam spam spam nonspam nonspam nonspam nonspam nonspam
## [817] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [825] nonspam nonspam spam spam nonspam nonspam nonspam nonspam
## [833] nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [841] nonspam spam nonspam spam nonspam nonspam nonspam nonspam
## [849] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [857] nonspam nonspam nonspam nonspam spam nonspam nonspam nonspam
## [865] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [873] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [881] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [889] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [897] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [905] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [913] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [921] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [929] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [937] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [945] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [953] nonspam spam nonspam spam nonspam nonspam nonspam nonspam
## [961] nonspam nonspam nonspam nonspam nonspam spam nonspam nonspam
## [969] nonspam nonspam nonspam spam nonspam nonspam nonspam nonspam
## [977] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [985] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [993] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1001] nonspam spam nonspam nonspam nonspam nonspam spam nonspam
## [1009] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1017] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1025] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1033] nonspam nonspam nonspam nonspam nonspam nonspam spam nonspam
## [1041] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1049] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1057] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1065] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam
## [1073] spam nonspam spam nonspam spam nonspam nonspam nonspam
## [1081] nonspam spam nonspam nonspam nonspam nonspam nonspam nonspam
## [1089] spam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1097] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam
## [1105] nonspam nonspam nonspam nonspam nonspam nonspam nonspam spam
## [1113] nonspam spam nonspam spam spam nonspam nonspam nonspam
## [1121] nonspam nonspam spam nonspam nonspam nonspam nonspam nonspam
## [1129] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1137] nonspam nonspam nonspam nonspam nonspam nonspam nonspam nonspam
## [1145] nonspam nonspam spam nonspam nonspam nonspam
## Levels: nonspam spam
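Printing all 1150 predictions is rarely informative. A compact alternative, sketched below, is to tabulate them. The block repeats the steps above so that it runs on its own; because the split is random, the exact counts depend on the seed and will not necessarily match the run shown in this document.

```r
library(caret); library(kernlab); data(spam)

set.seed(32323)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]

modelFit <- suppressWarnings(
  train(type ~ ., data = training, method = "glm")
)
predictions <- suppressWarnings(predict(modelFit, newdata = testing))

table(predictions)                         # predicted class counts
round(prop.table(table(predictions)), 3)   # the same as proportions
head(predictions)                          # first few predictions only
```

For the run shown in this document, the corresponding counts are the row totals of the confusion matrix: 684 emails predicted nonspam and 466 predicted spam.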
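To evaluate the performance of the model on the test data, we use caret's confusionMatrix function, which cross-tabulates the predicted classes against the true ones. Note that caret treats the first factor level, nonspam, as the positive class, so sensitivity here is the fraction of nonspam emails that were correctly identified, and specificity is the fraction of spam emails that were caught.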
confusionMatrix(predictions,testing$type)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction nonspam spam
##    nonspam     644   40
##    spam         53  413
##
## Accuracy : 0.9191
## 95% CI : (0.9018, 0.9342)
## No Information Rate : 0.6061
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8315
## Mcnemar's Test P-Value : 0.2134
##
## Sensitivity : 0.9240
## Specificity : 0.9117
## Pos Pred Value : 0.9415
## Neg Pred Value : 0.8863
## Prevalence : 0.6061
## Detection Rate : 0.5600
## Detection Prevalence : 0.5948
## Balanced Accuracy : 0.9178
##
## 'Positive' Class : nonspam
##
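The headline statistics can be recomputed from the four cell counts alone, which helps in seeing what each one measures. The sketch below uses base R only; the variable names are chosen for this illustration, and nonspam is taken as the positive class, as in the output above.

```r
# Cell counts copied from the confusion matrix (rows = prediction, cols = reference)
cm <- matrix(c(644, 53, 40, 413), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["nonspam", "nonspam"] / sum(cm[, "nonspam"])  # nonspam found
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])           # spam found
ppv         <- cm["nonspam", "nonspam"] / sum(cm["nonspam", ])
npv         <- cm["spam", "spam"] / sum(cm["spam", ])

# Cohen's kappa: agreement beyond what the marginal totals would give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, PPV = ppv, NPV = npv, Kappa = kappa), 4)
```

In practical terms, 53 of the 697 legitimate emails in the test set were flagged as spam, and 40 of the 453 spam emails got through.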