This content was created from the Coursera course on introduction to machine learning.
Related work can be found on my website.
This notebook covers data splitting, model training and evaluation with caret, exploratory plotting, preprocessing, and feature creation.
While there is some theory, most of the work here is about how to use the caret package to apply machine learning; more technical discussion will follow elsewhere.
The caret package has a lot of nice preprocessing functions built in.
The caret package is a great unified framework for applying all sorts of machine learning algorithms from different developers. Being a unified framework means that the following commands are the same no matter which method you use. If you didn't use caret, you would need to learn the nuances of each package individually.
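For example (a minimal sketch, assuming the training and testing data frames created in the next chunk), swapping one algorithm for another only changes the method string:
# the same train/predict interface regardless of the underlying model
fitGlm <- train(type ~ ., data=training, method='glm')
fitRf <- train(type ~ ., data=training, method='rf') # random forest
predGlm <- predict(fitGlm, newdata=testing)
predRf <- predict(fitRf, newdata=testing)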
Spam Example:
Segment the data:
library(caret); library(kernlab); data(spam)
## Loading required package: lattice
## Loading required package: ggplot2
# Partition data: 75% of the data to train the model
inTrain<- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
dim(training)
## [1] 3451 58
Run a glm
We will try to predict whether a message is spam using the frequencies of the words 'credit' and 'free'.
set.seed(32343)
modelFit<-train(type~credit+free, data=training, method='glm')
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictors
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.7 0.4 0.01 0.03
##
##
Inspect the model
modelFit$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) credit free
## -0.966 4.384 2.085
##
## Degrees of Freedom: 3450 Total (i.e. Null); 3448 Residual
## Null Deviance: 4630
## Residual Deviance: 3860 AIC: 3860
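Since this is a logistic regression, the coefficients above are on the log-odds scale. A quick sketch of turning them into a predicted probability (the 1-unit frequencies are hypothetical values chosen for illustration):
# P(spam) for a message where 'credit' and 'free' each make up 1% of the words
coefs <- coef(modelFit$finalModel)
plogis(coefs['(Intercept)'] + coefs['credit']*1 + coefs['free']*1)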
Prediction
predictions<- predict(modelFit, newdata=testing)
head(predictions)
## [1] nonspam spam nonspam spam nonspam spam
## Levels: nonspam spam
Confusion Matrix and Statistics
confusionMatrix(predictions, testing$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 661 246
## spam 36 207
##
## Accuracy : 0.755
## 95% CI : (0.729, 0.779)
## No Information Rate : 0.606
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.441
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.948
## Specificity : 0.457
## Pos Pred Value : 0.729
## Neg Pred Value : 0.852
## Prevalence : 0.606
## Detection Rate : 0.575
## Detection Prevalence : 0.789
## Balanced Accuracy : 0.703
##
## 'Positive' Class : nonspam
##
Let's look at the first example from above in more detail. y is the output we want to split on, which in this case is the two message types (spam and nonspam). p specifies the proportion of data that goes into the first chunk, so here we split the data into chunks of 75% and 25%. We then subset the data using the output of the createDataPartition function.
library(caret); library(kernlab); data(spam)
# Partition data: 75% of the data to train the model
inTrain<- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
dim(training)
## [1] 3451 58
K-folds cross-validation and other split methods
Split the data into k folds based on the outcome we want to split on (y). This function returns a list of index vectors, one per fold.
set.seed(32323)
folds <-createFolds(y=spam$type, k=10,
list=TRUE, returnTrain=TRUE) # or return test
sapply(folds, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 4141 4140 4141 4142 4140 4142 4141 4141 4140 4141
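Because returnTrain=TRUE, each element holds training indices, so a fold can be used directly to subset the data:
# use the first fold for one train/test split
trainFold <- spam[folds$Fold01, ]
testFold <- spam[-folds$Fold01, ]
dim(trainFold)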
Resample function
set.seed(32323)
folds<-createResample(y=spam$type, times=10, list=TRUE)
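Each resample is drawn with replacement, so every element has as many indices as there are rows in the data, and some rows appear more than once. A quick check:
sapply(folds, length) # each resample has nrow(spam) indices
any(duplicated(folds$Resample01)) # TRUE: sampling with replacement repeats rows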
Time slices
This is for slicing time series data. Here we create training windows of 20 consecutive samples each, with a horizon of 10 samples to forecast.
set.seed(32323)
tsData<-1:1000
folds<-createTimeSlices(y=tsData, initialWindow=20, horizon=10)
folds$train[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
folds$test[[1]]
## [1] 21 22 23 24 25 26 27 28 29 30
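The slices then roll forward one observation at a time; with the default fixedWindow=TRUE, each training window stays 20 samples long. A quick sketch:
folds$train[[2]] # 2 through 21
folds$test[[2]] # 22 through 31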
This is the basic setup for caret's train function…
library(caret); library(kernlab); data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
train<- spam[inTrain,]
test<- spam[-inTrain,]
#modelFit<-train(type~., data=train, method='glm', trControl=trainControl(allowParallel=TRUE))
However, there are many more arguments we can specify, such as preprocessing methods, weights, scoring metrics, and more…
args(train.default)
## function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL,
## metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric ==
## "RMSE", FALSE, TRUE), trControl = trainControl(), tuneGrid = NULL,
## tuneLength = 3)
## NULL
Metric options:
Continuous outcomes: RMSE, R² (the default is RMSE).
Categorical outcomes: Accuracy, Kappa (the default is Accuracy).
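For instance (a sketch using the spam split above), the metric used to select the best model can be overridden:
# optimize Kappa instead of the default Accuracy for this classifier
modelFit <- train(type ~ ., data=train, method='glm', metric='Kappa')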
Then there is the trainControl() function… This gives you much more control over how the model is fit; it is best to review the documentation to understand everything it can do.
args(trainControl)
## function (method = "boot", number = ifelse(grepl("cv", method),
## 10, 25), repeats = ifelse(grepl("cv", method), 1, number),
## p = 0.75, initialWindow = NULL, horizon = 1, fixedWindow = TRUE,
## verboseIter = FALSE, returnData = TRUE, returnResamp = "final",
## savePredictions = FALSE, classProbs = FALSE, summaryFunction = defaultSummary,
## selectionFunction = "best", preProcOptions = list(thresh = 0.95,
## ICAcomp = 3, k = 5), index = NULL, indexOut = NULL, timingSamps = 0,
## predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5,
## alpha = 0.05, method = "gls", complete = TRUE), allowParallel = TRUE)
## NULL
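For example, a minimal sketch of swapping the default bootstrap for repeated 10-fold cross-validation:
# 10-fold CV repeated 3 times; keep class probabilities for probability-based summaries
ctrl <- trainControl(method='repeatedcv', number=10, repeats=3, classProbs=TRUE)
# modelFit <- train(type ~ ., data=train, method='glm', trControl=ctrl)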
Further resources: caret tutorial
Usually it is best to do some exploratory plotting to get a sense of the basic structure of your data and any quirks or anomalies.
library(ggplot2); library(ISLR); library(caret)
data(Wage)
summary(Wage)
## year age sex maritl
## Min. :2003 Min. :18.0 1. Male :3000 1. Never Married: 648
## 1st Qu.:2004 1st Qu.:33.8 2. Female: 0 2. Married :2074
## Median :2006 Median :42.0 3. Widowed : 19
## Mean :2006 Mean :42.4 4. Divorced : 204
## 3rd Qu.:2008 3rd Qu.:51.0 5. Separated : 55
## Max. :2009 Max. :80.0
##
## race education region
## 1. White:2480 1. < HS Grad :268 2. Middle Atlantic :3000
## 2. Black: 293 2. HS Grad :971 1. New England : 0
## 3. Asian: 190 3. Some College :650 3. East North Central: 0
## 4. Other: 37 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## jobclass health health_ins logwage
## 1. Industrial :1544 1. <=Good : 858 1. Yes:2083 Min. :3.00
## 2. Information:1456 2. >=Very Good:2142 2. No : 917 1st Qu.:4.45
## Median :4.65
## Mean :4.65
## 3rd Qu.:4.86
## Max. :5.76
##
## wage
## Min. : 20.1
## 1st Qu.: 85.4
## Median :104.9
## Mean :111.7
## 3rd Qu.:128.7
## Max. :318.3
##
We can usually tell a lot about the data just by inspecting a summary report. For example, we now know this data contains only males, all from the Middle Atlantic region. The command str() is also excellent for seeing the data types in a data frame.
Again, we will split the data into training and testing sets. The testing data is reserved for evaluating the final model; all exploration and model building should be done on the training data only.
inTrain<- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
train<- Wage[inTrain,]
test<- Wage[-inTrain,]
dim(train); dim(test);
## [1] 2102 12
## [1] 898 12
Now let's try a few plots
featurePlot(x=train[,c('age', 'education', 'jobclass')],
y=train$wage,
plot='pairs')
In particular, the most important relationships to inspect are those between the response and each of the predictor variables; these can be found in the top row of graphs. You can see, for example, that there may be a positive relationship between job class and wage.
For simpler plots…
qplot(age, wage, data=train)
Note that there is clearly some outlier group which does not conform to the overall trend of the data. Ideally, we want to understand why. A simple first step is to color the points by another variable; doing so below suggests that job class explains much of this group, which will likely be important in a model.
qplot(age, wage, data=train, color=jobclass)
Similarly…
ggplot(data=train, aes(age, wage, color=education))+geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
The cut2 function from the Hmisc package will cut a continuous variable into categorical groups based on percentiles.
library(Hmisc); library(gridExtra)
cutWage<-cut2(train$wage, g=3)
table(cutWage)
## cutWage
## [ 20.1, 92.3) [ 92.3,119.0) [119.0,318.3]
## 701 728 673
Yet it might be even better to create just two groups. We can see from this histogram that the data separate into two roughly normal clusters.
qplot(data=train, wage, geom='histogram')
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Thus a manual specification may be more appropriate…
cutWage<-cut(train$wage,breaks=c(0, 250, max(train$wage)))
Then we use these categories in some plots
p1<-ggplot(data=train, aes(cutWage, age, fill=cutWage))+geom_boxplot()
p2<-ggplot(data=train, aes(cutWage, age, fill=cutWage))+geom_boxplot()+geom_jitter(alpha=0.2)
grid.arrange(p1,p2, ncol=2)
We can also use the cut variable to tabulate the data.
t1<-table(cutWage, train$jobclass)
prop.table(t1, 1) # 1 for proportion in each row, 2 for columns
##
## cutWage 1. Industrial 2. Information
## (0,250] 0.5333 0.4667
## (250,318] 0.2167 0.7833
Density plots can be more revealing than box plots. A density plot shows the estimated distribution of a variable along the x axis; in short, it depicts where most of your data lie.
qplot(wage, color=education, data=train, geom='density')
Here we see that most of the people without a high school degree hold lower wages.
Let's load in our data to begin with some examples of why we might want to preprocess certain data.
library(caret); library(kernlab); data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
train<- spam[inTrain,]
test<- spam[-inTrain,]
hist(train$capitalAve, xlab='avg capital run length')
Here we see that the variable is highly skewed. Severe skew like this is likely to trip up many machine learning techniques, since a few extreme values inflate the variance. One remedy is to standardize the variable: subtract the mean and divide by the standard deviation. This centers the data at 0 with a standard deviation of 1.
trainCapAvg<-train$capitalAve
trainCapAvgS<-(trainCapAvg - mean(trainCapAvg)) / sd(trainCapAvg)
mean(trainCapAvgS); sd(trainCapAvgS)
## [1] 8.137e-18
## [1] 1
Note: when we standardize the test set for our final model, we must subtract the mean and divide by the standard deviation of the training set. The resulting test-set mean and standard deviation will not be exactly 0 and 1, but should be close.
testCapAvg<-test$capitalAve
testCapAveS<-(testCapAvg - mean(trainCapAvg)) / sd(trainCapAvg)
mean(testCapAveS); sd(testCapAveS)
## [1] -0.03145
## [1] 0.5411
We can do all of this manually; however, the caret package will do it for you.
#58th variable is our response variable, so we remove it here.
pre<- preProcess(train[,-58], method=c('center', 'scale')) # creates preprocess
# object that you can use
trainCapAvgS<-predict(pre, train[,-58])$capitalAve
mean(trainCapAvgS); sd(trainCapAvgS)
## [1] 8.137e-18
## [1] 1
Generally it is easiest to do this within the train function, by passing the same methods to its preProcess argument.
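For example:
# center and scale all predictors as part of model training
modelFit <- train(type ~ ., data=train, method='glm', preProcess=c('center','scale'))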
There are other standardizing techniques as well. One example is the Box-Cox transformation, which estimates a power transformation to make the data more normal.
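A minimal sketch (Box-Cox requires positive values, so caret only transforms variables where it applies):
# Box-Cox transform the predictors, then compare the skewed variable before and after
preBC <- preProcess(train[,-58], method='BoxCox')
trainBC <- predict(preBC, train[,-58])
par(mfrow=c(1,2)); hist(train$capitalAve); hist(trainBC$capitalAve)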
We can use imputation to estimate missing values. There are many complex ways to do this; one simple method is k-nearest-neighbors (knn) imputation, which finds the k closest observations across the other variables and averages their values of the variable being imputed.
set.seed(12234)
#Artificially make some values NA
train$capAve<-train$capitalAve # Make a copy
selectNA<-rbinom(dim(train)[1], size=1, prob=0.05)==1
train$capAve[selectNA]<-NA
# Impute and standardize missing values
pre<-preProcess(train[,-58], method='knnImpute')
capAve<-predict(pre, train[,-58])$capAve
# Standardize true values
tmp<-train$capitalAve
capAveTruth<-(tmp - mean(tmp)) / sd(tmp)
# Inspect the results
quantile(capAve - capAveTruth)
## 0% 25% 50% 75% 100%
## -0.861967 -0.002469 -0.001261 -0.000685 0.986229
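The quantiles above mix imputed and untouched rows; to see the error on just the values we removed, subset by selectNA:
# imputation error on only the artificially missing values
quantile((capAve - capAveTruth)[selectNA])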
Feature creation involves transforming or manipulating raw data into a format that is useful for predicting.
For example, the SPAM dataset we have been working with went through this process so that we can work with it.
The original data may have been…
HI WHATS IS UP JOHN
ID LIKE TO TELL YOU ABOUT THIS AMAZING
CREDITCARD OFFERING WE HAVE TODAY SAVE $$$$$
For this email we could calculate features such as the frequency of the word 'credit', the proportion of capital letters, or the number of dollar signs (see the sketch below).
This step involves a lot of clever thinking about the data you have and the results that you want for your particular application. We have to balance the compression/quantification of the data, with information loss.
Good features capture a significant amount of the information behind the phenomenon in the data. The more knowledge you have of your system, the better you will be at creating features.
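A hypothetical sketch of computing such features from raw text (the variable names are made up for illustration):
email <- "HI WHATS IS UP JOHN ID LIKE TO TELL YOU ABOUT THIS AMAZING CREDITCARD OFFERING WE HAVE TODAY SAVE $$$$$"
chars <- strsplit(email, "")[[1]]
prop_caps <- mean(grepl("[A-Z]", chars)) # fraction of characters that are capital letters
n_dollar <- lengths(regmatches(email, gregexpr("\\$", email))) # count of '$' characters
has_credit <- grepl("credit", tolower(email)) # does 'credit' appear anywhere?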
Creating dummy variables
library(ISLR); library(caret); data(Wage);
inTrain<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
train<-Wage[inTrain,]
test<-Wage[-inTrain,]
table(train$jobclass)
##
## 1. Industrial 2. Information
## 1067 1035
Notice that this variable is qualitative; many models need it converted into numeric dummy (indicator) variables.
dummies<-dummyVars(wage ~ jobclass, data=train)
head(predict(dummies, newdata=train))
## jobclass.1. Industrial jobclass.2. Information
## 231655 1 0
## 86582 0 1
## 11443 0 1
## 305706 1 0
## 448410 1 0
## 377879 1 0
Features with little variance
These types of features are likely poor predictors and should be removed from our feature list. We can use the function nearZeroVar to identify these variables, then review and confirm the results.
nsv<-nearZeroVar(train, saveMetrics=TRUE)
nsv
## freqRatio percentUnique zeroVar nzv
## year 1.081 0.33302 FALSE FALSE
## age 1.027 2.90200 FALSE FALSE
## sex 0.000 0.04757 TRUE TRUE
## maritl 3.207 0.23787 FALSE FALSE
## race 8.559 0.19029 FALSE FALSE
## education 1.387 0.23787 FALSE FALSE
## region 0.000 0.04757 TRUE TRUE
## jobclass 1.031 0.09515 FALSE FALSE
## health 2.545 0.09515 FALSE FALSE
## health_ins 2.331 0.09515 FALSE FALSE
## logwage 1.024 17.50714 FALSE FALSE
## wage 1.036 18.26832 FALSE FALSE
We can see in this obvious example that sex and region have zero variance.
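The flagged columns can then be dropped before modeling:
# remove the near-zero-variance columns identified above
train_filtered <- train[, !nsv$nzv]
dim(train_filtered)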
Non-linear relationships
Once we have the variables, it may still be necessary to transform the data to better represent the features we are trying to capture. For example, age may be an important variable in our model, but age^2 or age^3 may also serve as useful predictors, depending on the application. Here we use a spline basis: roughly, the first column corresponds to a rescaled linear age term, the second to a quadratic term, and the third to a cubic term.
library(splines)
bsBasis<-bs(train$age, df=3) # cubic spline basis with 3 degrees of freedom
head(bsBasis)
## 1 2 3
## [1,] 0.00000 0.00000 0.0000000
## [2,] 0.23685 0.02538 0.0009063
## [3,] 0.36253 0.38669 0.1374912
## [4,] 0.42617 0.14823 0.0171864
## [5,] 0.01794 0.20449 0.7770510
## [6,] 0.41709 0.13311 0.0141612
If we fit a linear model using these parameters, you can see we capture the non-linear effects of age on wage.
lm1<- lm(wage~bsBasis, data=train)
ggplot(data=train, aes(age, wage))+geom_point()+
geom_line(aes(age, predict(lm1, newdata=train)), color='red', size=2)
Recall that when we eventually evaluate on the test set, we must build the test-set variables from the same basis that was fit on the training data, rather than refitting the basis:
# apply the training-set spline basis to the test ages
new_variable<-predict(bsBasis, test$age)
Identifying Correlated Predictors
Most models benefit from removing highly correlated predictors. Given a correlation matrix, the findCorrelation function can flag predictors for removal.
Cor<-cor(train[c('wage', 'age')])
highCor<-sum(abs(Cor[upper.tri(Cor)]) > 0.75)
highCor
## [1] 0
# Automated approach
highCorr<-findCorrelation(Cor, cutoff=0.1) # low cutoff used for demonstration
# findCorrelation returns column indices of Cor, so map them back to names
filtered_vars<-train[, !(names(train) %in% colnames(Cor)[highCorr])] # removes age from the dataset
The next section explores methods to deal with correlation between variables.
library(caret); library(kernlab); data(spam);
inTrain<-createDataPartition(y=spam$type, p=0.75, list=FALSE)
train<-spam[inTrain,]
test<-spam[-inTrain,]
M <- abs(cor(train[,-58]))
diag(M) <- 0 # Set correlation between variables and itself to zero
which(M > 0.85, arr.ind=T) # which variable pairs have corr > 0.85
## row col
## num415 34 32
## num857 32 34
We see that these two variables are highly correlated; they correspond to the number of times the numbers 415 and 857 occur in the messages. These are likely fragments of phone numbers that appear repeatedly in spam messages.
#Plot the correlation
qplot(spam$num415, spam$num857)
Including both of these predictors is not very useful. But what if we could combine them into a single variable with higher predictive power than either variable alone?
There are two related solutions:
Singular Value Decomposition (SVD): a matrix decomposition into orthogonal components.
Principal Component Analysis (PCA): finds the weighted combinations of variables that explain the most variance.
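The two are closely connected: prcomp performs PCA via an SVD of the centered data, so the component scores can be recovered from svd() directly (a quick sketch using the same two columns as below):
X <- scale(spam[, c(34, 32)], center=TRUE, scale=FALSE) # center the two columns
sv <- svd(X)
head(sv$u[, 1] * sv$d[1]) # first PC scores, identical to prcomp's up to sign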
PCA
prComp<-prcomp(spam[,c(34,32)])
plot(prComp$x[,1], prComp$x[,2]) # plot pc 1 and pc2.
prComp$rotation # See how the data were combined
## PC1 PC2
## num415 0.7081 0.7061
## num857 0.7061 -0.7081
From the rotation we can see that the first principal component is approximately the sum of the two variables, with nearly equal weights, and the second is approximately their difference.
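Equivalently, the scores in prComp$x are just the centered data multiplied by this rotation matrix (a quick check):
X <- as.matrix(spam[, c(34, 32)])
scores <- scale(X, center=prComp$center, scale=FALSE) %*% prComp$rotation
all.equal(as.vector(scores[, 1]), as.vector(prComp$x[, 1])) # TRUE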
Testing on the full Spam dataset
prComp<-prcomp(log10(spam[,-58]+1)) # log10(x+1) transform to tame skew before PCA
ggplot(data=spam, aes(prComp$x[,1], prComp$x[,2], color=type))+
geom_point()+
ylab('PC2')+
xlab('PC1')
PCA in the caret package: as the plot shows, even the first two principal components separate the classes quite well. We can also apply PCA as a preprocessing step directly within train:
modelFit<- train(type~., method='glm', preProcess=c('scale','center', 'pca'), data=train)
confusionMatrix(test$type, predict(modelFit, test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 663 34
## spam 46 407
##
## Accuracy : 0.93
## 95% CI : (0.914, 0.944)
## No Information Rate : 0.617
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.854
## Mcnemar's Test P-Value : 0.219
##
## Sensitivity : 0.935
## Specificity : 0.923
## Pos Pred Value : 0.951
## Neg Pred Value : 0.898
## Prevalence : 0.617
## Detection Rate : 0.577
## Detection Prevalence : 0.606
## Balanced Accuracy : 0.929
##
## 'Positive' Class : nonspam
##