First, load the packages we will use and attach the data sets.

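A minimal loading block, assuming the packages used throughout these notes (caret attaches lattice and ggplot2 on its own, and Hmisc attaches survival and Formula):

# Load packages and attach the spam data
library(kernlab)    # provides the spam data set
library(caret)
library(e1071)
library(Hmisc)
library(gridExtra)
data(spam)          # the Wage data comes later, via ISLR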

Now let’s start the same way we did before:

# Create training and testing data sets
inTrain<-createDataPartition(y=spam$type,p=.75,list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]

# Fit a model
set.seed(222222)
modelFit<-train(type~.,data=training,method="glm")
modelFit
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.916165  0.8227876
# Check out final model
modelFit$finalModel
## 
## Call:  NULL
## 
## Coefficients:
##       (Intercept)               make            address  
##         -1.598851          -0.565438          -0.142638  
##               all              num3d                our  
##          0.151660           2.231430           0.529871  
##              over             remove           internet  
##          0.885578           1.815519           0.477411  
##             order               mail            receive  
##          0.531510           0.071632          -0.207125  
##              will             people             report  
##         -0.120614          -0.049949           0.178791  
##         addresses               free           business  
##          1.478029           1.181369           0.801127  
##             email                you             credit  
##          0.165340           0.090567           0.847480  
##              your               font             num000  
##          0.240489           0.181845           2.741479  
##             money                 hp                hpl  
##          0.398701          -2.363763          -0.888291  
##            george             num650                lab  
##        -10.477271           0.733107          -2.142428  
##              labs             telnet             num857  
##         -0.301534          -0.100437           3.811690  
##              data             num415              num85  
##         -0.748791           0.413657          -2.950265  
##        technology            num1999              parts  
##          0.968962           0.096447          -0.664835  
##                pm             direct                 cs  
##         -1.011701          -0.286053         -49.197724  
##           meeting           original            project  
##         -3.563938          -1.063237          -1.387680  
##                re                edu              table  
##         -0.858448          -1.342118          -2.434637  
##        conference      charSemicolon   charRoundbracket  
##         -4.913811          -1.356230           0.037803  
## charSquarebracket    charExclamation         charDollar  
##         -1.430273           0.369592           4.610029  
##          charHash         capitalAve        capitalLong  
##          2.986782           0.017469           0.006249  
##      capitalTotal  
##          0.001251  
## 
## Degrees of Freedom: 3450 Total (i.e. Null);  3393 Residual
## Null Deviance:       4628 
## Residual Deviance: 1376  AIC: 1492
# Try to predict values in our test set:
predictions<- predict(modelFit,newdata=testing)
str(predictions)
##  Factor w/ 2 levels "nonspam","spam": 2 2 2 2 2 2 2 2 1 2 ...
# Check accuracy of results!
confusionMatrix(predictions,testing$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     673   46
##    spam         24  407
##                                           
##                Accuracy : 0.9391          
##                  95% CI : (0.9237, 0.9522)
##     No Information Rate : 0.6061          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8714          
##  Mcnemar's Test P-Value : 0.01207         
##                                           
##             Sensitivity : 0.9656          
##             Specificity : 0.8985          
##          Pos Pred Value : 0.9360          
##          Neg Pred Value : 0.9443          
##              Prevalence : 0.6061          
##          Detection Rate : 0.5852          
##    Detection Prevalence : 0.6252          
##       Balanced Accuracy : 0.9320          
##                                           
##        'Positive' Class : nonspam         
## 
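As a quick sanity check (my own aside, not from the lecture), the reported accuracy is just the diagonal of that table over the total:

# Accuracy by hand: correct predictions over all predictions
(673+407)/(673+46+24+407)  # 1080/1150 = 0.9391, matching the summary above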

As before (since we set the same seed), we get a pretty decent model with minimal effort. One caveat: set.seed() came after createDataPartition() here, so it fixes the bootstrap resampling inside train() but not the train/test split itself; a quick fix is sketched below. Let’s see what else we can do.
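A hedged aside (my own tweak, not from the lecture): seeding before the partition pins down the split as well.

# Seed first, then split, so the partition itself is reproducible
set.seed(222222)
inTrain<-createDataPartition(y=spam$type,p=.75,list=FALSE)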

DATA SLICING

K-Folds

# Create folds, return training set
folds<- createFolds(y=spam$type,k=10, list=TRUE, returnTrain = TRUE)
sapply(folds,length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 
##   4141   4141   4141   4141   4141   4141   4141   4141   4141   4140
folds[[1]][1:10]
##  [1]  1  2  3  4  5  6  7  8  9 10
# Alternatively, return the test set
folds2 <- createFolds(y=spam$type,k=10, list=TRUE, returnTrain = FALSE)
sapply(folds2,length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 
##    460    460    460    460    460    459    461    461    460    460
folds2[[1]][1:10]
##  [1]  1  3  9 17 45 48 51 52 72 79
# OR, using bootstrap resampling (sampling with replacement):
folds3<-createResample(y=spam$type,times=10,list=TRUE)
sapply(folds3,length)  # each resample has 4601 indices, the full data size
folds3[[1]][1:10]
##  [1]  1  2  3  6  7  7  7  7  8 11
# Time Slices
tme<-1:1000
folds4<-createTimeSlices(y=tme,initialWindow=20,horizon=10)
names(folds4)
## [1] "train" "test"
folds4$train[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
folds4$test[[1]]
##  [1] 21 22 23 24 25 26 27 28 29 30
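Just to verify how the slices advance (my own check, not from the lecture): with the default settings, the window slides forward one observation at a time while keeping a fixed width.

folds4$train[[2]]  # should be 2:21
folds4$test[[2]]   # should be 22:31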

Although for now we haven’t done anything with these folds beyond creating and inspecting them, the process was very simple. Now we can use these folds for cross-validation, which I assume comes later in the course.
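In the meantime, here’s a rough sketch (mine, not from the lecture) of using one fold by hand, plus caret’s built-in route via trainControl():

# Fit on the first fold's training indices; evaluate on the held-out rows
foldFit<-train(type~.,data=spam[folds[[1]],],method="glm")
foldPred<-predict(foldFit,newdata=spam[-folds[[1]],])
confusionMatrix(foldPred,spam[-folds[[1]],]$type)

# Or let caret run the whole k-fold loop internally:
cvFit<-train(type~.,data=spam,method="glm",
             trControl=trainControl(method="cv",number=10))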

Let’s load a new data set and look at some cool stuff: the Wage data set from the ISLR (Introduction to Statistical Learning) package.

# Plotting Predictors
require(ISLR)
## Loading required package: ISLR
data(Wage)
summary(Wage)
##       year           age               sex                    maritl    
##  Min.   :2003   Min.   :18.00   1. Male  :3000   1. Never Married: 648  
##  1st Qu.:2004   1st Qu.:33.75   2. Female:   0   2. Married      :2074  
##  Median :2006   Median :42.00                    3. Widowed      :  19  
##  Mean   :2006   Mean   :42.41                    4. Divorced     : 204  
##  3rd Qu.:2008   3rd Qu.:51.00                    5. Separated    :  55  
##  Max.   :2009   Max.   :80.00                                           
##                                                                         
##        race                   education                     region    
##  1. White:2480   1. < HS Grad      :268   2. Middle Atlantic   :3000  
##  2. Black: 293   2. HS Grad        :971   1. New England       :   0  
##  3. Asian: 190   3. Some College   :650   3. East North Central:   0  
##  4. Other:  37   4. College Grad   :685   4. West North Central:   0  
##                  5. Advanced Degree:426   5. South Atlantic    :   0  
##                                           6. East South Central:   0  
##                                           (Other)              :   0  
##            jobclass               health      health_ins      logwage     
##  1. Industrial :1544   1. <=Good     : 858   1. Yes:2083   Min.   :3.000  
##  2. Information:1456   2. >=Very Good:2142   2. No : 917   1st Qu.:4.447  
##                                                            Median :4.653  
##                                                            Mean   :4.654  
##                                                            3rd Qu.:4.857  
##                                                            Max.   :5.763  
##                                                                           
##       wage       
##  Min.   : 20.09  
##  1st Qu.: 85.38  
##  Median :104.92  
##  Mean   :111.70  
##  3rd Qu.:128.68  
##  Max.   :318.34  
## 
# Build training/test sets
trainvec<-createDataPartition(y=Wage$wage,p=.7,list=FALSE)
trainset<-Wage[trainvec,]
testset<-Wage[-trainvec,]
dim(trainset) 
## [1] 2102   12
dim(testset)
## [1] 898  12
# Check out "featurePlot"!! With plot="pairs" it plots the chosen
# predictors and the outcome against each other.
featurePlot(x=trainset[ ,c("age","education","jobclass")], 
            y=trainset$wage, plot="pairs")

# Or just quick plots (this is more my style; I don't go for the shotgun approach much):
qplot(age,wage,data=trainset)

qplot(education,wage,data=trainset)

# multivariate plots:
qplot(age,wage,col=jobclass,data=trainset)

qplot(age,wage,col=education,data=trainset) + 
    geom_smooth(method="lm", formula=y~x)

qplot(education,wage,col=jobclass,data=trainset)

qplot(jobclass,wage,col=education,data=trainset)

# etc... now cut continuous data into categories
cutWage<-cut2(trainset$wage,g=3) # cut2 (from Hmisc) cuts into 3 quantile groups
table(cutWage)
## cutWage
## [ 20.9, 91.7) [ 91.7,118.9) [118.9,318.3] 
##           707           731           664
p1<-qplot(cutWage,age,data=trainset,fill=cutWage,geom="boxplot")
p2<-qplot(cutWage,age,data=trainset,
          fill=cutWage,geom=c("boxplot","jitter"))
grid.arrange(p1,p2,ncol=2)

# can also look at tables once you've cut the continuous data into categories:
tab<-table(cutWage,trainset$jobclass)
tab
##                
## cutWage         1. Industrial 2. Information
##   [ 20.9, 91.7)           453            254
##   [ 91.7,118.9)           383            348
##   [118.9,318.3]           265            399
#also a proportions table
prop.table(tab,1)
##                
## cutWage         1. Industrial 2. Information
##   [ 20.9, 91.7)     0.6407355      0.3592645
##   [ 91.7,118.9)     0.5239398      0.4760602
##   [118.9,318.3]     0.3990964      0.6009036
# Now plot densities -- as an aside, I should do this more. It's pretty and is super quick.
qplot(wage,col=education, data=trainset,geom="density")

So we see from these many plots that knowing what your data looks like matters! Having a strong understanding of the data through visualization helps you, the programmer/statistician, home in on the variables that really matter.

Going back to the spam data set, let’s check out a few more of the functions and capabilities of the caret package. First, look at the preProcess function.

# PREPROCESSING VARIABLES

# Using preProcess function from caret --> Normalize data
preObj<-preProcess(training[,-58],method=c("center","scale"))
trainCapAves <- predict(preObj, training[,-58])$capitalAve
mean(trainCapAves)
## [1] -3.339632e-18
sd(trainCapAves)
## [1] 1
# Now apply preprocessing to test set using preObj:
testCapAves <- predict(preObj,testing[,-58])$capitalAve
mean(testCapAves)
## [1] 0.02384246
sd(testCapAves)
## [1] 1.459443
# As a shortcut, pass preprocessing args to train() function
modelFit2<-train(type~.,data=training,
                 preProcess=c("center","scale"),method="glm")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## (the first warning was repeated across the 25 bootstrap resamples)
modelFit2
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9163118  0.8240581
# Can do Box-Cox transforms, too:
preObj2<-preProcess(training[,-58],method="BoxCox")
trainCapAves2<-predict(preObj2,training[,-58])$capitalAve
par(mfrow=c(1,2)); hist(trainCapAves2); qqnorm(trainCapAves2)

So we normalized the variables (“centered and scaled” in caret’s terms), then made a new predictive model using those transformed variables. We also saw that caret has other built-in transformation methods; at the very least, it can do Box-Cox transforms!
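As a hedged sketch of what else is in there (my own extension, not from the lecture), other preProcess methods can be passed straight to train() the same way:

# "BoxCox" only transforms strictly positive variables; "pca" keeps enough
# components to explain 95% of the variance by default
modelFitBC<-train(type~.,data=training,preProcess="BoxCox",method="glm")
modelFitPCA<-train(type~.,data=training,preProcess="pca",method="glm")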

Next let’s look at imputation. I’ve heard very mixed things about imputation as a general concept. Some folks suggest that imputation is pretty bad and we shouldn’t do it, since it essentially makes up data that doesn’t exist. On the other hand, machine learning doesn’t do so well with missing data, and it’s a shame to throw away a perfectly good observation because of a missing data point or two.

# K-nearest neighbor imputation
# Copy capitalAve, then knock out ~5% of its values at random
training$capAve<-training$capitalAve
selectNA<-rbinom(dim(training)[1],size=1,prob=.05)==1
training$capAve[selectNA]<-NA
# Impute the missing values (knnImpute also centers and scales)
preObj3<-preProcess(training[,-58],method="knnImpute")
capAve<-predict(preObj3,training[,-58])$capAve
# Standardize the true values the same way for comparison
capAveTruth<-training$capitalAve
capAveTruth<-(capAveTruth-mean(capAveTruth))/sd(capAveTruth)

#now compare full set with imputed values:
quantile(capAve-capAveTruth)
##            0%           25%           50%           75%          100% 
## -0.9883280880 -0.0019723444 -0.0004778190  0.0002364905  1.2651588715
quantile((capAve-capAveTruth)[selectNA])
##            0%           25%           50%           75%          100% 
## -0.4409257937 -0.0166235136  0.0007530806  0.0169158382  1.2651588715
quantile((capAve-capAveTruth)[!selectNA])
##            0%           25%           50%           75%          100% 
## -0.9883280880 -0.0018790577 -0.0004875111  0.0001880298  0.0007530806

So we see that, in this case at least, the imputed values weren’t far off from the true values. Depending on the nature of the data and the purpose of the analysis, then, imputing data values is somewhere between “a necessary evil” and “not really that bad, probably.”
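One last hedged aside (mine, not from the lecture): caret also offers simpler imputation methods, and the throw-away-the-rows alternative mentioned above is a one-liner:

# Median imputation, a cruder but faster option than kNN
preObjMed<-preProcess(training[,-58],method="medianImpute")
# Or simply drop the incomplete rows
completeRows<-training[complete.cases(training),]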

==================================

These notes come from following along with about half of the lectures in the second week of Jeffrey Leek’s “Practical Machine Learning” course on Coursera.