This project is based on a lecture from the course “Practical Machine Learning” in the Data Science Specialization offered by Johns Hopkins through Coursera.
It uses the caret package and the kernlab spam data set to fit a generalized linear model that predicts whether an email is spam or not. This is my first machine learning project.
require(kernlab)
## Loading required package: kernlab
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
require(e1071)
## Loading required package: e1071
data(spam)
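# Create training and testing data sets (75% / 25%, stratified on the outcome)
# Note: no seed is set before this call, so the exact split differs between runs
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]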
# Fit a model
set.seed(20019)
modelFit <- train(type ~ ., data = training, method = "glm")
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9238807 0.8390735
# Check out final model
modelFit$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) make address
## -1.674069 -0.227951 -0.137878
## all num3d our
## 0.135587 2.494640 0.860390
## over remove internet
## 0.686492 2.232081 0.517303
## order mail receive
## 0.685116 0.081565 -0.336805
## will people report
## -0.121083 0.013945 0.072960
## addresses free business
## 0.878389 1.261057 1.050306
## email you credit
## 0.040688 0.102546 0.896336
## your font num000
## 0.237178 0.121061 1.918649
## money hp hpl
## 0.326971 -1.917204 -0.607093
## george num650 lab
## -9.969115 0.609749 -2.011516
## labs telnet num857
## -0.628876 -0.112765 2.348573
## data num415 num85
## -1.511820 1.516409 -1.763868
## technology num1999 parts
## 0.801004 -0.543259 -0.838038
## pm direct cs
## -1.068791 -0.039872 -46.020788
## meeting original project
## -2.753518 -1.062861 -1.985800
## re edu table
## -0.816747 -1.364643 -0.969020
## conference charSemicolon charRoundbracket
## -4.275037 -1.338815 -0.876137
## charSquarebracket charExclamation charDollar
## -0.597227 0.308825 5.982574
## charHash capitalAve capitalLong
## 3.638499 0.025145 0.004597
## capitalTotal
## 0.001569
##
## Degrees of Freedom: 3450 Total (i.e. Null); 3393 Residual
## Null Deviance: 4628
## Residual Deviance: 1325 AIC: 1441
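The coefficients above are on the log-odds scale, which makes them hard to read directly. As a rough sketch, exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of spam for a one-unit increase in that predictor, holding the others fixed. This assumes the usual R convention for a binomial glm, where the second factor level (here "spam") is the modeled outcome. The values are copied by hand from the printout above, and the names `coefs` and `odds_ratios` are only illustrative.

```r
# Selected coefficients copied from the modelFit$finalModel printout above
coefs <- c(charDollar = 5.982574,
           remove     = 2.232081,
           free       = 1.261057,
           hp         = -1.917204,
           george     = -9.969115)

# exp() turns log-odds coefficients into odds ratios
odds_ratios <- exp(coefs)

# Ratios above 1 push a message toward "spam", ratios below 1 toward "nonspam"
print(signif(sort(odds_ratios, decreasing = TRUE), 3))
```

Read this way, terms such as "charDollar", "remove" and "free" raise the odds of spam, while "hp" and "george" lower them, which fits a data set collected at Hewlett-Packard.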
# Try to predict values in our test set:
predictions <- predict(modelFit, newdata = testing)
str(predictions)
## Factor w/ 2 levels "nonspam","spam": 2 2 2 1 2 1 2 2 2 2 ...
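By default `predict()` on a caret model returns the predicted class. The sketch below shows how one might ask for class probabilities instead, using `type = "prob"`, and then apply a stricter cut-off before calling a message spam. It refits the model from scratch so that it runs on its own; the seed is set before the split here, so the numbers will not match the output printed in this post. The 0.8 threshold and the names `probs` and `strict` are arbitrary choices for illustration, not part of the original analysis.

```r
library(kernlab)
library(caret)

data(spam)

# Same workflow as above, but seeded before the split so this sketch is repeatable
set.seed(20019)
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# glm may warn about fitted probabilities of 0 or 1 on this data set
modelFit <- train(type ~ ., data = training, method = "glm")

# One column of probabilities per class: "nonspam" and "spam"
probs <- predict(modelFit, newdata = testing, type = "prob")
print(head(probs))

# Hypothetical stricter rule: only flag a message when P(spam) exceeds 0.8
strict <- factor(ifelse(probs$spam > 0.8, "spam", "nonspam"),
                 levels = levels(testing$type))
print(table(Prediction = strict, Reference = testing$type))
```

Raising the threshold trades missed spam for fewer legitimate messages flagged by mistake, which is often the more costly error for a mail filter.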
# Check accuracy of results!
confusionMatrix(predictions, testing$type)
## Confusion Matrix and Statistics
##
##            Reference
## Prediction  nonspam spam
##    nonspam      656   63
##    spam          41  390
##
## Accuracy : 0.9096
## 95% CI : (0.8915, 0.9255)
## No Information Rate : 0.6061
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.809
## Mcnemar's Test P-Value : 0.03947
##
## Sensitivity : 0.9412
## Specificity : 0.8609
## Pos Pred Value : 0.9124
## Neg Pred Value : 0.9049
## Prevalence : 0.6061
## Detection Rate : 0.5704
## Detection Prevalence : 0.6252
## Balanced Accuracy : 0.9011
##
## 'Positive' Class : nonspam
##
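To see where the headline statistics come from, the sketch below recomputes a few of them by hand from the four counts in the confusion matrix. The counts are copied from the output above; the variable names are illustrative. Note that caret treats "nonspam" as the positive class here, so sensitivity is the share of legitimate mail that was correctly kept, and specificity is the share of spam that was caught.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(656, 41, 63, 390), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))

n <- sum(cm)

# Accuracy: correct predictions (the diagonal) over all test messages
accuracy <- sum(diag(cm)) / n

# Sensitivity and specificity, with "nonspam" as the positive class
sensitivity <- cm["nonspam", "nonspam"] / sum(cm[, "nonspam"])
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])

# Cohen's kappa: agreement beyond what the marginal totals would give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```

The 63 spam messages that slipped through and the 41 legitimate messages that were flagged account for all 104 errors among the 1150 test emails.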
We partitioned the data into training and test sets, created a predictive model using the training set, then applied that model to the test set.
Using a few simple commands in the caret package, we created a model that classifies email as spam or nonspam with about 91% accuracy on the held-out test set (about 92% in bootstrap resampling on the training set).
My thanks to Jeffrey Leek and the Johns Hopkins biostatistics team for their excellent class series in data science.