First, load the packages we will use and attach the data sets.
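The setup chunk isn’t echoed here, but judging from the messages below it was presumably something along these lines (a sketch; the exact calls are my assumption):
require(kernlab)   # provides the spam data set
require(caret)     # also pulls in lattice and ggplot2
require(e1071)
require(Hmisc)     # also pulls in survival and Formula
require(gridExtra)
data(spam)         # attach the spam data from kernlab
data(Wage)         # produces the warning below, since ISLR isn't attached yet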
## Loading required package: kernlab
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
## Loading required package: e1071
## Loading required package: Hmisc
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:e1071':
##
## impute
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Hmisc':
##
## combine
## Warning in data(Wage): data set 'Wage' not found
Now let’s start the same way we did before:
# Create training and testing data sets
inTrain<-createDataPartition(y=spam$type,p=.75,list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
# Fit a model
set.seed(222222)
modelFit<-train(type~.,data=training,method="glm")
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
## Accuracy Kappa
## 0.916165 0.8227876
# Check out final model
modelFit$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) make address
## -1.598851 -0.565438 -0.142638
## all num3d our
## 0.151660 2.231430 0.529871
## over remove internet
## 0.885578 1.815519 0.477411
## order mail receive
## 0.531510 0.071632 -0.207125
## will people report
## -0.120614 -0.049949 0.178791
## addresses free business
## 1.478029 1.181369 0.801127
## email you credit
## 0.165340 0.090567 0.847480
## your font num000
## 0.240489 0.181845 2.741479
## money hp hpl
## 0.398701 -2.363763 -0.888291
## george num650 lab
## -10.477271 0.733107 -2.142428
## labs telnet num857
## -0.301534 -0.100437 3.811690
## data num415 num85
## -0.748791 0.413657 -2.950265
## technology num1999 parts
## 0.968962 0.096447 -0.664835
## pm direct cs
## -1.011701 -0.286053 -49.197724
## meeting original project
## -3.563938 -1.063237 -1.387680
## re edu table
## -0.858448 -1.342118 -2.434637
## conference charSemicolon charRoundbracket
## -4.913811 -1.356230 0.037803
## charSquarebracket charExclamation charDollar
## -1.430273 0.369592 4.610029
## charHash capitalAve capitalLong
## 2.986782 0.017469 0.006249
## capitalTotal
## 0.001251
##
## Degrees of Freedom: 3450 Total (i.e. Null); 3393 Residual
## Null Deviance: 4628
## Residual Deviance: 1376 AIC: 1492
# Try to predict values in our test set:
predictions<- predict(modelFit,newdata=testing)
str(predictions)
## Factor w/ 2 levels "nonspam","spam": 2 2 2 2 2 2 2 2 1 2 ...
# Check accuracy of results!
confusionMatrix(predictions,testing$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 673 46
## spam 24 407
##
## Accuracy : 0.9391
## 95% CI : (0.9237, 0.9522)
## No Information Rate : 0.6061
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8714
## Mcnemar's Test P-Value : 0.01207
##
## Sensitivity : 0.9656
## Specificity : 0.8985
## Pos Pred Value : 0.9360
## Neg Pred Value : 0.9443
## Prevalence : 0.6061
## Detection Rate : 0.5852
## Detection Prevalence : 0.6252
## Balanced Accuracy : 0.9320
##
## 'Positive' Class : nonspam
##
As before (since we used the same seed), we get a pretty decent model, about 94% accuracy on the held-out test set, with minimal effort. Let’s see what else we can do.
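As an aside of my own (not from the lecture), the confusionMatrix object is a list, so you can also pull out individual statistics programmatically:
cm <- confusionMatrix(predictions, testing$type)
cm$overall["Accuracy"]      # overall accuracy
cm$overall["Kappa"]         # Cohen's kappa
cm$byClass["Sensitivity"]   # sensitivity for the positive class (nonspam)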
# Create folds, return training set
folds<- createFolds(y=spam$type,k=10, list=TRUE, returnTrain = TRUE)
sapply(folds,length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 4141 4141 4141 4141 4141 4141 4141 4141 4141 4140
folds[[1]][1:10]
## [1] 1 2 3 4 5 6 7 8 9 10
# Alternatively, return the test set
folds2 <- createFolds(y=spam$type,k=10, list=TRUE, returnTrain = FALSE)
sapply(folds2,length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 460 460 460 460 460 459 461 461 460 460
folds2[[1]][1:10]
## [1] 1 3 9 17 45 48 51 52 72 79
# OR, using resampling:
folds3<-createResample(y=spam$type,times=10,list=TRUE)
sapply(folds3,length)
## Resample01 Resample02 Resample03 Resample04 Resample05 Resample06 Resample07
##       4601       4601       4601       4601       4601       4601       4601
## Resample08 Resample09 Resample10
##       4601       4601       4601
folds3[[1]][1:10]
## [1] 1 2 3 6 7 7 7 7 8 11
# Time Slices
tme<-1:1000
folds4<-createTimeSlices(y=tme,initialWindow=20,horizon=10)
names(folds4)
## [1] "train" "test"
folds4$train[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
folds4$test[[1]]
## [1] 21 22 23 24 25 26 27 28 29 30
Although for now we haven’t done anything with these folds (we merely created them), the process was very simple. Now we can use these folds for cross-validation, which I’m assuming comes later in the course.
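Getting ahead of the lectures a bit, so treat this as my own sketch rather than course material: the training-set folds can be handed to train() through trainControl’s index argument, or you can simply ask train() to do k-fold cross-validation itself:
# Option 1: let train() build the folds internally
ctrl <- trainControl(method="cv", number=10)
modelFitCV <- train(type~., data=spam, method="glm", trControl=ctrl)
# Option 2: reuse the folds we created above (lists of training-row indices)
ctrl2 <- trainControl(method="cv", index=folds)
modelFitCV2 <- train(type~., data=spam, method="glm", trControl=ctrl2)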
Let’s load a new data set and look at some cool stuff: the Wage data set from the ISLR (Introduction to Statistical Learning) package.
# Plotting Predictors
require(ISLR)
## Loading required package: ISLR
data(Wage)
summary(Wage)
## year age sex maritl
## Min. :2003 Min. :18.00 1. Male :3000 1. Never Married: 648
## 1st Qu.:2004 1st Qu.:33.75 2. Female: 0 2. Married :2074
## Median :2006 Median :42.00 3. Widowed : 19
## Mean :2006 Mean :42.41 4. Divorced : 204
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## race education region
## 1. White:2480 1. < HS Grad :268 2. Middle Atlantic :3000
## 2. Black: 293 2. HS Grad :971 1. New England : 0
## 3. Asian: 190 3. Some College :650 3. East North Central: 0
## 4. Other: 37 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## jobclass health health_ins logwage
## 1. Industrial :1544 1. <=Good : 858 1. Yes:2083 Min. :3.000
## 2. Information:1456 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447
## Median :4.653
## Mean :4.654
## 3rd Qu.:4.857
## Max. :5.763
##
## wage
## Min. : 20.09
## 1st Qu.: 85.38
## Median :104.92
## Mean :111.70
## 3rd Qu.:128.68
## Max. :318.34
##
# Build training/test sets
trainvec<-createDataPartition(y=Wage$wage,p=.7,list=FALSE)
trainset<-Wage[trainvec,]
testset<-Wage[-trainvec,]
dim(trainset)
## [1] 2102 12
dim(testset)
## [1] 898 12
# Check out "featurePlot"!! It plots every variable against each other.
featurePlot(x=trainset[ ,c("age","education","jobclass")],
y=trainset$wage, plot="pairs")
# Or just quick plots (this is more my style; I don't go for the shotgun approach much):
qplot(age,wage,data=trainset)
qplot(education,wage,data=trainset)
# multivariate plots:
qplot(age,wage,col=jobclass,data=trainset)
qplot(age,wage,col=education,data=trainset) +
geom_smooth(method="lm", formula=y~x)
qplot(education,wage,col=jobclass,data=trainset)
qplot(jobclass,wage,col=education,data=trainset)
# etc... now cut the interval data into categorical groups
cutWage<-cut2(trainset$wage,g=3) # g cuts into 3 quantile groups
table(cutWage)
## cutWage
## [ 20.9, 91.7) [ 91.7,118.9) [118.9,318.3]
## 707 731 664
p1<-qplot(cutWage,age,data=trainset,fill=cutWage,geom="boxplot")
p2<-qplot(cutWage,age,data=trainset,
fill=cutWage,geom=c("boxplot","jitter"))
grid.arrange(p1,p2,ncol=2)
# can also look at tables once you've cut the interval/ratio data into categories:
tab<-table(cutWage,trainset$jobclass)
tab
##
## cutWage 1. Industrial 2. Information
## [ 20.9, 91.7) 453 254
## [ 91.7,118.9) 383 348
## [118.9,318.3] 265 399
# also a proportions table (margin=1 gives row proportions)
prop.table(tab,1)
##
## cutWage 1. Industrial 2. Information
## [ 20.9, 91.7) 0.6407355 0.3592645
## [ 91.7,118.9) 0.5239398 0.4760602
## [118.9,318.3] 0.3990964 0.6009036
# Now plot densities -- as an aside, I should do this more. It's pretty and is super quick.
qplot(wage,col=education, data=trainset,geom="density")
So we see from these many plots that knowing what your data looks like matters! Having a strong understanding of the data through visualization helps you, the programmer/statistician, home in on the variables that really matter.
Going back to the spam data set, let’s check out a few more of the functions and capabilities of the caret package. First, look at the preProcess function.
# PREPROCESSING VARIABLES
# Using preProcess function from caret --> Normalize data
preObj<-preProcess(training[,-58],method=c("center","scale"))
trainCapAves <- predict(preObj, training[,-58])$capitalAve
mean(trainCapAves)
## [1] -3.339632e-18
sd(trainCapAves)
## [1] 1
# Now apply preprocessing to test set using preObj:
testCapAves <- predict(preObj,testing[,-58])$capitalAve
mean(testCapAves)
## [1] 0.02384246
sd(testCapAves)
## [1] 1.459443
# As a shortcut, pass preprocessing args to train() function
modelFit2<-train(type~.,data=training,
preProcess=c("center","scale"),method="glm")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## (the first warning repeats for most of the 25 bootstrap resamples; duplicates omitted)
modelFit2
## Generalized Linear Model
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9163118 0.8240581
# Can do Box-Cox transforms, too:
preObj2<-preProcess(training[,-58],method="BoxCox")
trainCapAves2<-predict(preObj2,training[,-58])$capitalAve
par(mfrow=c(1,2)); hist(trainCapAves2); qqnorm(trainCapAves2)
So we normalized the variables (“centered and scaled” in caret’s terms), then made a new predictive model using those transformed variables. We also saw that caret has other built-in transformation methods; at the very least, it can do Box-Cox transforms!
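As a quick aside of my own (these method names come from caret’s documentation, not the lecture), preProcess accepts several other transformations as well, for example Yeo-Johnson (which, unlike Box-Cox, tolerates zero and negative values) and PCA:
# Yeo-Johnson transform of the predictors
preObjYJ <- preProcess(training[,-58], method="YeoJohnson")
trainYJ <- predict(preObjYJ, training[,-58])
# principal components capturing 95% of the variance (thresh=0.95 is caret's default)
preObjPCA <- preProcess(training[,-58], method="pca", thresh=0.95)
trainPCA <- predict(preObjPCA, training[,-58])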
Next let’s look at imputation. I’ve heard very mixed things about imputation as a general concept. Some folks suggest that imputation is pretty bad and we shouldn’t do it, since it essentially means making up data that doesn’t exist. On the other hand, machine learning methods don’t handle missing data well, and it’s a shame to throw away a perfectly good observation because of a missing data point or two.
# K-nearest neighbor imputation
training$capAve<-training$capitalAve
selectNA<-rbinom(dim(training)[1],size=1,prob=.05)==1
training$capAve[selectNA]<-NA
#impute and standardize
preObj3<-preProcess(training[,-58],method="knnImpute")
capAve<-predict(preObj3,training[,-58])$capAve
#standardize true values
capAveTruth<-training$capitalAve
capAveTruth<-(capAveTruth-mean(capAveTruth))/sd(capAveTruth)
#now compare full set with imputed values:
quantile(capAve-capAveTruth)
## 0% 25% 50% 75% 100%
## -0.9883280880 -0.0019723444 -0.0004778190 0.0002364905 1.2651588715
quantile((capAve-capAveTruth)[selectNA])
## 0% 25% 50% 75% 100%
## -0.4409257937 -0.0166235136 0.0007530806 0.0169158382 1.2651588715
quantile((capAve-capAveTruth)[!selectNA])
## 0% 25% 50% 75% 100%
## -0.9883280880 -0.0018790577 -0.0004875111 0.0001880298 0.0007530806
So we see that, in this case at least, the imputed values weren’t far off from the true values. Depending on the nature of the data and the purpose of the analysis, then, imputing data values falls somewhere between “a necessary evil” and “not really that bad, probably.”
==================================
These notes come from following along with about half of the lectures in the second week of Jeffrey Leek’s “Practical Machine Learning” course on Coursera.