This project is based on a lecture from the “Practical Machine Learning” class in the Data Science Specialization offered by Johns Hopkins through Coursera.

It uses the caret package and the kernlab spam data set to fit a generalized linear model (a logistic regression, since the outcome has two classes) that predicts whether an email is spam. This is my first introduction to machine learning.

require(kernlab)
## Loading required package: kernlab
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
## 
##     alpha
require(e1071)
## Loading required package: e1071
data(spam)
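# Create training and testing data sets (75% / 25%, stratified on type).
# Note: the seed is set only after this split, so the partition itself
# differs from run to run; call set.seed() before this line to fix it.
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]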

# Fit a model
set.seed(20019)
modelFit <- train(type ~ ., data = training, method = "glm")
modelFit
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9238807  0.8390735
# Check out final model
modelFit$finalModel
## 
## Call:  NULL
## 
## Coefficients:
##       (Intercept)               make            address  
##         -1.674069          -0.227951          -0.137878  
##               all              num3d                our  
##          0.135587           2.494640           0.860390  
##              over             remove           internet  
##          0.686492           2.232081           0.517303  
##             order               mail            receive  
##          0.685116           0.081565          -0.336805  
##              will             people             report  
##         -0.121083           0.013945           0.072960  
##         addresses               free           business  
##          0.878389           1.261057           1.050306  
##             email                you             credit  
##          0.040688           0.102546           0.896336  
##              your               font             num000  
##          0.237178           0.121061           1.918649  
##             money                 hp                hpl  
##          0.326971          -1.917204          -0.607093  
##            george             num650                lab  
##         -9.969115           0.609749          -2.011516  
##              labs             telnet             num857  
##         -0.628876          -0.112765           2.348573  
##              data             num415              num85  
##         -1.511820           1.516409          -1.763868  
##        technology            num1999              parts  
##          0.801004          -0.543259          -0.838038  
##                pm             direct                 cs  
##         -1.068791          -0.039872         -46.020788  
##           meeting           original            project  
##         -2.753518          -1.062861          -1.985800  
##                re                edu              table  
##         -0.816747          -1.364643          -0.969020  
##        conference      charSemicolon   charRoundbracket  
##         -4.275037          -1.338815          -0.876137  
## charSquarebracket    charExclamation         charDollar  
##         -0.597227           0.308825           5.982574  
##          charHash         capitalAve        capitalLong  
##          3.638499           0.025145           0.004597  
##      capitalTotal  
##          0.001569  
## 
## Degrees of Freedom: 3450 Total (i.e. Null);  3393 Residual
## Null Deviance:       4628 
## Residual Deviance: 1325  AIC: 1441
# Predict the class of each email in the test set
predictions <- predict(modelFit, newdata = testing)
str(predictions)
##  Factor w/ 2 levels "nonspam","spam": 2 2 2 1 2 1 2 2 2 2 ...
# Check accuracy of results!
confusionMatrix(predictions,testing$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     656   63
##    spam         41  390
##                                           
##                Accuracy : 0.9096          
##                  95% CI : (0.8915, 0.9255)
##     No Information Rate : 0.6061          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.809           
##  Mcnemar's Test P-Value : 0.03947         
##                                           
##             Sensitivity : 0.9412          
##             Specificity : 0.8609          
##          Pos Pred Value : 0.9124          
##          Neg Pred Value : 0.9049          
##              Prevalence : 0.6061          
##          Detection Rate : 0.5704          
##    Detection Prevalence : 0.6252          
##       Balanced Accuracy : 0.9011          
##                                           
##        'Positive' Class : nonspam         
## 

We partitioned the data into training and test sets, created a predictive model using the training set, then applied that model to the test set.
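
To make "applying the model" concrete, here is a minimal base R sketch of how a fitted logistic regression turns features into a class label. It copies five coefficients from the printed final model; the feature values in `email` are hypothetical, and treating every other predictor as zero is a simplifying assumption for illustration only. The 0.5 cutoff is also an assumption about how the class is chosen.

```r
# Coefficients copied from the printed final model (a small subset).
coefs <- c("(Intercept)" = -1.674069, remove = 2.232081,
           charDollar = 5.982574, george = -9.969115, hp = -1.917204)

# Hypothetical feature values for one email (not taken from the data set).
# All predictors not listed here are treated as zero for this sketch.
email <- c(remove = 0.5, charDollar = 0.2, george = 0, hp = 0)

# Linear predictor: intercept plus the weighted sum of the features.
eta <- coefs[["(Intercept)"]] + sum(coefs[names(email)] * email)

# The inverse logit maps the linear predictor to an estimated P(spam).
p_spam <- plogis(eta)

# Assumed decision rule: label as spam when the probability reaches 0.5.
label <- if (p_spam >= 0.5) "spam" else "nonspam"

print(c(eta = eta, p_spam = p_spam))
print(label)
```

Positive coefficients such as `charDollar` push the estimate toward spam, while strongly negative ones such as `george` push it toward nonspam.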

Using a few simple commands in the caret package, we created a model that is about 91% accurate at separating spam from nonspam email on the held-out test set (the bootstrap estimate from the training data was about 92%).
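
The headline statistics can be recomputed by hand from the four counts in the confusion matrix printed above. The sketch below uses only base R; the variable names are my own, and "nonspam" is treated as the positive class, as in the caret output.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(656, 41, 63, 390), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))

n <- sum(cm)

# Accuracy: share of emails on the diagonal (correctly classified).
accuracy <- sum(diag(cm)) / n

# Sensitivity: share of true nonspam emails predicted as nonspam.
sensitivity <- cm["nonspam", "nonspam"] / sum(cm[, "nonspam"])

# Specificity: share of true spam emails predicted as spam.
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])

# Cohen's kappa: accuracy corrected for the agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa_stat), 4))
```

The rounded values agree with the Accuracy, Sensitivity, Specificity and Kappa lines reported by `confusionMatrix()`.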

My thanks to Jeffrey Leek and the Johns Hopkins biostatistics team for their excellent class series in data science.