Machine Learning 2

This content was created from the Coursera course on introduction to machine learning.

Related work can be found on my website.

This notebook contains notes on applying machine learning with the caret package in R. While there is some theory, most of the work here is about how to use caret; we will progress into more technical discussion elsewhere.

1.0 Introducing the Caret package

The caret package is a great unified framework for applying all sorts of machine learning algorithms from different developers, and it has a lot of nice preprocessing functions built in. Being a unified framework means that the following commands are the same no matter which method you use; without caret, you would need to learn the nuances of each package individually.
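
As a hedged sketch of that uniformity (assuming the training and testing data frames created in the spam example below), only the method string changes between algorithms:

# the call shape is identical for every algorithm; only 'method' changes
fitGlm<-train(type~., data=training, method='glm') # logistic regression
fitRf<-train(type~., data=training, method='rf')   # random forest
# prediction is the same call in either case
predict(fitGlm, newdata=testing); predict(fitRf, newdata=testing)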

Spam Example:

Segment the data:

library(caret); library(kernlab); data(spam)
## Loading required package: lattice
## Loading required package: ggplot2
# Partition data: 75% of the data to train the model
inTrain<- createDataPartition(y=spam$type, p=0.75, list=FALSE)

training<-spam[inTrain,]
testing<-spam[-inTrain,]
dim(training)
## [1] 3451   58

Run a glm

We will try to predict spam messages by using the frequency of the words 'credit' and 'free'.

set.seed(32343)
modelFit<-train(type~credit+free, data=training, method='glm')
modelFit
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictors
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.7       0.4    0.01         0.03    
## 
## 

Inspect the model

modelFit$finalModel
## 
## Call:  NULL
## 
## Coefficients:
## (Intercept)       credit         free  
##      -0.966        4.384        2.085  
## 
## Degrees of Freedom: 3450 Total (i.e. Null);  3448 Residual
## Null Deviance:       4630 
## Residual Deviance: 3860  AIC: 3860

Prediction

predictions<- predict(modelFit, newdata=testing)
head(predictions)
## [1] nonspam spam    nonspam spam    nonspam spam   
## Levels: nonspam spam

Confusion Matrix and Statistics

confusionMatrix(predictions, testing$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     661  246
##    spam         36  207
##                                         
##                Accuracy : 0.755         
##                  95% CI : (0.729, 0.779)
##     No Information Rate : 0.606         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.441         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.948         
##             Specificity : 0.457         
##          Pos Pred Value : 0.729         
##          Neg Pred Value : 0.852         
##              Prevalence : 0.606         
##          Detection Rate : 0.575         
##    Detection Prevalence : 0.789         
##       Balanced Accuracy : 0.703         
##                                         
##        'Positive' Class : nonspam       
## 

2.0 Data Slicing

Let's look at the first example from above in more detail. y specifies the outcome we want to split on, which in this case is the message type (spam or nonspam). p specifies the proportion of data that goes into the first chunk; here we split into chunks of 75% and 25%. We then subset the data using the index matrix returned by createDataPartition.

library(caret); library(kernlab); data(spam)
# Partition data: 75% of the data to train the model
inTrain<- createDataPartition(y=spam$type, p=0.75, list=FALSE)

training<-spam[inTrain,]
testing<-spam[-inTrain,]
dim(training)
## [1] 3451   58

K-folds cross-validation and other split methods

createFolds splits the data into k folds, stratified on the outcome we want to split on (y), and returns a list of index vectors. With returnTrain=TRUE each element holds the training indices for a fold; with returnTrain=FALSE it holds the held-out test indices.

set.seed(32323)
folds <-createFolds(y=spam$type, k=10, 
                    list=TRUE, returnTrain=TRUE) # or return test
sapply(folds, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 
##   4141   4140   4141   4142   4140   4142   4141   4141   4140   4141
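
Since returnTrain=TRUE, each element of folds holds the training-row indices for that fold, and the held-out rows are their complement. A minimal sketch of using fold 1:

fold1Train<-spam[folds[[1]],]  # roughly 9/10 of the data
fold1Test<-spam[-folds[[1]],]  # the held-out 1/10
dim(fold1Train); dim(fold1Test)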

Resample function

createResample generates bootstrap resamples instead of folds: each resample draws row indices with replacement, so an index can appear more than once.

set.seed(32323)
folds<-createResample(y=spam$type, times=10, list=TRUE)
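
A quick sketch for inspecting the result (each resample has as many indices as the original data):

sapply(folds, length)       # every resample is the size of the full dataset
sum(duplicated(folds[[1]])) # duplicates are expected when sampling with replacement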

Time slices

This is for slicing time series data. Here each training window contains 20 consecutive samples, and the following 10 samples form the forecast horizon.

set.seed(32323)
tsData<-1:1000
folds<-createTimeSlices(y=tsData, initialWindow=20, horizon=10)
folds$train[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
folds$test[[1]]
##  [1] 21 22 23 24 25 26 27 28 29 30
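
Because fixedWindow=TRUE by default, the 20-sample window slides forward one observation at a time, with the 10-sample horizon always immediately following it. A sketch of the second slice:

folds$train[[2]] # indices 2 through 21
folds$test[[2]]  # indices 22 through 31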

3.0 Training options in caret

This is the basic setup for caret's train function…

library(caret); library(kernlab); data(spam)

inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)

train<- spam[inTrain,]
test<- spam[-inTrain,]

#modelFit<-train(type~., data=train, method='glm', trControl=trainControl(allowParallel=TRUE))

However, there are many more arguments we can specify, such as preprocessing methods, observation weights, performance metrics, and more…

args(train.default)
## function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL, 
##     metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric == 
##         "RMSE", FALSE, TRUE), trControl = trainControl(), tuneGrid = NULL, 
##     tuneLength = 3) 
## NULL

Metric options that are built into the train function

Continuous outcomes: RMSE (the default) and R-squared.

Categorical outcomes: Accuracy (the default) and Kappa.

Then there is the trainControl() function… This gives you much more control over how the model is specified and fit; it is best to review the documentation to understand everything it can do.

args(trainControl)
## function (method = "boot", number = ifelse(grepl("cv", method), 
##     10, 25), repeats = ifelse(grepl("cv", method), 1, number), 
##     p = 0.75, initialWindow = NULL, horizon = 1, fixedWindow = TRUE, 
##     verboseIter = FALSE, returnData = TRUE, returnResamp = "final", 
##     savePredictions = FALSE, classProbs = FALSE, summaryFunction = defaultSummary, 
##     selectionFunction = "best", preProcOptions = list(thresh = 0.95, 
##         ICAcomp = 3, k = 5), index = NULL, indexOut = NULL, timingSamps = 0, 
##     predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5, 
##         alpha = 0.05, method = "gls", complete = TRUE), allowParallel = TRUE) 
## NULL
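
For example, a sketch of switching from the default bootstrap to 10-fold cross-validation and passing the result to train:

fitControl<-trainControl(method='cv', number=10)
modelFit<-train(type~., data=train, method='glm', trControl=fitControl)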

Further resources: caret tutorial

4.0 Basic data exploration with plotting

Usually it is best to do some exploratory plotting to get a sense of the basic structure of your data and of any quirks or anomalies.

library(ggplot2); library(ISLR); library(caret)

data(Wage)
summary(Wage)
##       year           age              sex                    maritl    
##  Min.   :2003   Min.   :18.0   1. Male  :3000   1. Never Married: 648  
##  1st Qu.:2004   1st Qu.:33.8   2. Female:   0   2. Married      :2074  
##  Median :2006   Median :42.0                    3. Widowed      :  19  
##  Mean   :2006   Mean   :42.4                    4. Divorced     : 204  
##  3rd Qu.:2008   3rd Qu.:51.0                    5. Separated    :  55  
##  Max.   :2009   Max.   :80.0                                           
##                                                                        
##        race                   education                     region    
##  1. White:2480   1. < HS Grad      :268   2. Middle Atlantic   :3000  
##  2. Black: 293   2. HS Grad        :971   1. New England       :   0  
##  3. Asian: 190   3. Some College   :650   3. East North Central:   0  
##  4. Other:  37   4. College Grad   :685   4. West North Central:   0  
##                  5. Advanced Degree:426   5. South Atlantic    :   0  
##                                           6. East South Central:   0  
##                                           (Other)              :   0  
##            jobclass               health      health_ins      logwage    
##  1. Industrial :1544   1. <=Good     : 858   1. Yes:2083   Min.   :3.00  
##  2. Information:1456   2. >=Very Good:2142   2. No : 917   1st Qu.:4.45  
##                                                            Median :4.65  
##                                                            Mean   :4.65  
##                                                            3rd Qu.:4.86  
##                                                            Max.   :5.76  
##                                                                          
##       wage      
##  Min.   : 20.1  
##  1st Qu.: 85.4  
##  Median :104.9  
##  Mean   :111.7  
##  3rd Qu.:128.7  
##  Max.   :318.3  
## 

We can usually tell a lot about the data just by investigating a summary report. For example, we now know this data contains only males, all from a single region. Further, the command str() is excellent for seeing the data type of each column.
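
For example, a quick look at the first few columns:

str(Wage[,1:4]) # compactly shows each column's class plus a few sample values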

Split into testing and training data

Again, we will split the data into training and testing sets. We will only use the testing data to evaluate the final model; all exploration and model building should be done on the training data only.

inTrain<- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)

train<- Wage[inTrain,]
test<- Wage[-inTrain,]
dim(train); dim(test);
## [1] 2102   12
## [1] 898  12

Now let's try a few plots

featurePlot(x=train[,c('age', 'education', 'jobclass')], 
            y=train$wage,
            plot='pairs')

[Plot: pairs plot of age, education, and jobclass against wage]

The most important relationships to look for are those between the response and each of the predictor variables; these appear in the top row of panels. You can see, for example, that there may be a relationship between job class and wage.

For more simple plots…

qplot(age, wage, data=train)

[Plot: scatterplot of wage against age]

Note that there is clearly some outlier group that does not conform to the overall trend of the data. Ideally, we want to understand why this could be the case, and we can begin simply by coloring the points. Doing so shows that job type is a plausible explanation for this group, and it will likely be important in a model.

qplot(age, wage, data=train, color=jobclass)

[Plot: wage against age, colored by jobclass]

Similarly…

ggplot(data=train, aes(age, wage, color=education))+geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

[Plot: loess-smoothed wage against age, colored by education]

Another useful method is to table the data

The cut2 function from the Hmisc package cuts a continuous variable into categorical groups based on quantiles.

library(Hmisc); library(gridExtra)
cutWage<-cut2(train$wage, g=3)
table(cutWage)
## cutWage
## [ 20.1, 92.3) [ 92.3,119.0) [119.0,318.3] 
##           701           728           673

Yet, it might be even better to simply create two groups. We can see from the histogram below that the data separate into two roughly normal clusters.

qplot(data=train, wage, geom='histogram')
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Plot: histogram of wage]

Thus a manual specification may be more appropriate…

cutWage<-cut(train$wage,breaks=c(0, 250, max(train$wage)))

Then we use these categories in some plots

p1<-ggplot(data=train, aes(cutWage, age, fill=cutWage))+geom_boxplot()
p2<-ggplot(data=train, aes(cutWage, age, fill=cutWage))+geom_boxplot()+geom_jitter(alpha=0.2)
grid.arrange(p1,p2, ncol=2)

[Plot: boxplots of age by wage category, without and with jittered points]

We can also use this cut data to table data.

t1<-table(cutWage, train$jobclass)
prop.table(t1, 1) # 1 for proportion in each row, 2 for columns
##            
## cutWage     1. Industrial 2. Information
##   (0,250]          0.5333         0.4667
##   (250,318]        0.2167         0.7833

Density plots

Density plots can be more revealing than box plots. A density plot shows a smoothed estimate of a variable's distribution: the proportion of observations falling near each value on the x-axis. In short, it depicts where most of your data is.

qplot(wage, color=education, data=train, geom='density')

[Plot: density of wage by education]

Here we see that most of the people without a high school degree hold lower wages.

5.0 Preprocessing data

Let's load in our data to begin with some examples of why we might want to preprocess certain data.

library(caret); library(kernlab); data(spam)

inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)

train<- spam[inTrain,]
test<- spam[-inTrain,]

hist(train$capitalAve, xlab='avg capital run length')

[Plot: histogram of capitalAve]

Here we see that the variable is highly skewed. Severe skew like this is likely to trip up many machine learning techniques, since a few extreme values inflate the variance. One remedy is to standardize the variable: subtract the mean and divide by the standard deviation. This centers the data at 0 with a standard deviation of 1.

trainCapAvg<-train$capitalAve
trainCapAvgS<-(trainCapAvg - mean(trainCapAvg)) / sd(trainCapAvg)

mean(trainCapAvgS); sd(trainCapAvgS)
## [1] 8.137e-18
## [1] 1

Note: when we standardize the test set for our final model, we must subtract the training-set mean and divide by the training-set standard deviation. The resulting test-set mean and standard deviation will not be exactly 0 and 1, but should hopefully be close.

testCapAvg<-test$capitalAve
testCapAveS<-(testCapAvg - mean(trainCapAvg)) / sd(trainCapAvg)

mean(testCapAveS); sd(testCapAveS)
## [1] -0.03145
## [1] 0.5411

We can do all of this manually; however, the caret package will do it for you.

# The 58th variable is our response, so we remove it here.
pre<-preProcess(train[,-58], method=c('center', 'scale')) # returns a preProcess object you can reuse
trainCapAvgS<-predict(pre, train[,-58])$capitalAve

mean(trainCapAvgS); sd(trainCapAvgS)
## [1] 8.137e-18
## [1] 1

Generally it is easier to do this within the train function by supplying the preProcess argument.

There are also other standardizing techniques. One example is the Box-Cox transformation, which estimates a power transformation to make the data look more normal.
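
As a sketch, caret can estimate the transformation from the training data via preProcess (Box-Cox requires positive values):

preBC<-preProcess(train[,-58], method=c('BoxCox'))
trainCapAveBC<-predict(preBC, train[,-58])$capitalAve
hist(trainCapAveBC, xlab='Box-Cox transformed capital run length')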

What if a dataset has missing data?

We can use imputation to estimate the missing values. There are many, often very complex, ways to do this. One simple way is k-nearest-neighbors imputation: the algorithm finds the k closest observations across all the variables and averages the values of the variable being imputed.

set.seed(12234)

# Artificially make some values NA
train$capAve<-train$capitalAve # Make a copy
selectNA<-rbinom(dim(train)[1], size=1, prob=0.05)==1
train$capAve[selectNA]<-NA

# Impute and standardize missing values
pre<-preProcess(train[,-58], method='knnImpute')
capAve<-predict(pre, train[,-58])$capAve

# Standardize true values
tmp<-train$capitalAve
capAveTruth<-(tmp - mean(tmp)) / sd(tmp)

# Inspect the results
quantile(capAve - capAveTruth)
##        0%       25%       50%       75%      100% 
## -0.861967 -0.002469 -0.001261 -0.000685  0.986229

6.0 Feature creation and engineering

Feature creation involves transforming or manipulating raw data into a format that is useful for predicting.

For example, the SPAM dataset we have been working with went through this process so that we can work with it.

The original data may have been…

HI WHATS IS UP JOHN ID LIKE TO TELL YOU ABOUT THIS AMAZING CREDITCARD OFFERING WE HAVE TODAY SAVE $$$$$

For this email we could calculate (a sketch of computing these follows the list):

  1. The fraction of letters that are capital letters
  2. How many times the email contains the word “you”
  3. The number of $ signs.
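
As a hedged sketch of computing these three features in R (the email string is just the toy example above, not data from the spam dataset):

email<-'HI WHATS IS UP JOHN ID LIKE TO TELL YOU ABOUT THIS AMAZING CREDITCARD OFFERING WE HAVE TODAY SAVE $$$$$'
chars<-strsplit(email, '')[[1]]
capFraction<-sum(grepl('[A-Z]', chars)) / sum(grepl('[A-Za-z]', chars)) # 1. fraction of capital letters
youCount<-lengths(regmatches(email, gregexpr('\\bYOU\\b', email)))      # 2. occurrences of the word 'YOU'
dollarCount<-lengths(regmatches(email, gregexpr('\\$', email)))         # 3. number of $ signs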

This step involves a lot of clever thinking about the data you have and the results you want for your particular application. We have to balance compressing/quantifying the data against losing information.

Good features explain a significant amount of the phenomenon behind the data. The more knowledge you have of your system, the better you will be at creating features.

Examples

Creating dummy variables

library(ISLR); library(caret); data(Wage);

inTrain<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)

train<-Wage[inTrain,]
test<-Wage[-inTrain,]

table(train$jobclass)
## 
##  1. Industrial 2. Information 
##           1067           1035

Notice that jobclass is qualitative; most models need such variables converted into numeric dummy variables.

dummies<-dummyVars(wage ~ jobclass, data=train)
head(predict(dummies, newdata=train))
##        jobclass.1. Industrial jobclass.2. Information
## 231655                      1                       0
## 86582                       0                       1
## 11443                       0                       1
## 305706                      1                       0
## 448410                      1                       0
## 377879                      1                       0

Features with little variance

These types of features are likely poor predictors and should be removed from our list of features. We can use the nearZeroVar function to identify these variables, then review/confirm the results.

nsv<-nearZeroVar(train, saveMetrics=TRUE)
nsv
##            freqRatio percentUnique zeroVar   nzv
## year           1.081       0.33302   FALSE FALSE
## age            1.027       2.90200   FALSE FALSE
## sex            0.000       0.04757    TRUE  TRUE
## maritl         3.207       0.23787   FALSE FALSE
## race           8.559       0.19029   FALSE FALSE
## education      1.387       0.23787   FALSE FALSE
## region         0.000       0.04757    TRUE  TRUE
## jobclass       1.031       0.09515   FALSE FALSE
## health         2.545       0.09515   FALSE FALSE
## health_ins     2.331       0.09515   FALSE FALSE
## logwage        1.024      17.50714   FALSE FALSE
## wage           1.036      18.26832   FALSE FALSE

We can see in this obvious example, that sex and region actually have zero variance.

Non-linear relationships

Once we have the variables, it may still be necessary to transform them to better represent the features we are trying to capture. For example, age may be an important variable in our model, but age^2 or age^3 may also serve as useful predictors, depending on the application. Here we utilize a spline basis: the first column represents age (scaled), the second a quadratic term, and the third a cubic term.

library(splines)
bsBasis<-bs(train$age, df=3) # cubic spline basis with 3 degrees of freedom
head(bsBasis)
##            1       2         3
## [1,] 0.00000 0.00000 0.0000000
## [2,] 0.23685 0.02538 0.0009063
## [3,] 0.36253 0.38669 0.1374912
## [4,] 0.42617 0.14823 0.0171864
## [5,] 0.01794 0.20449 0.7770510
## [6,] 0.41709 0.13311 0.0141612

If we fit a linear model using these parameters, you can see we capture the non-linear effects of age on wage.

lm1<- lm(wage~bsBasis, data=train)
ggplot(data=train, aes(age, wage))+geom_point()+
geom_line(aes(age, predict(lm1, newdata=train)), color='red', size=2)

[Plot: wage against age with the fitted spline curve overlaid]

Recall, when we actually evaluate on the test set, we must create the covariates from the test data using the basis fitted on the training data.

# generate spline covariates for the test data from the training basis
new_variable<-predict(bsBasis, newx=test$age)

Identifying Correlated Predictors

Most models benefit from eliminating correlation between predictors. Given a correlation matrix, the findCorrelation function can flag predictors for removal.

vars<-c('wage', 'age')
Cor<-cor(train[vars])
highCor<-sum(abs(Cor[upper.tri(Cor)]) > 0.75)
highCor
## [1] 0
# Automated approach
highCorr<-findCorrelation(Cor, cutoff=0.1) # low cutoff used for demonstration
filtered_vars<-train[vars][,-highCorr, drop=FALSE] # indices refer to columns of Cor, so subset those same columns

The next section explores methods to deal with correlation between variables.

7.0 Preprocessing Features with Principal Component Analysis

library(caret); library(kernlab); data(spam);

inTrain<-createDataPartition(y=spam$type, p=0.75, list=FALSE)

train<-spam[inTrain,]
test<-spam[-inTrain,]

M <- abs(cor(train[,-58]))
diag(M) <- 0 # set each variable's correlation with itself to zero
which(M > 0.85, arr.ind=T) # which variable pairs have correlation > 0.85
##        row col
## num415  34  32
## num857  32  34

We see here that these two variables are highly correlated; they correspond to the number of times the numbers 415 and 857 occur in a message. These are likely pieces of phone numbers distributed in spam messages.

#Plot the correlation
qplot(spam$num415, spam$num857)

[Plot: spam$num415 against spam$num857]

Including both of these predictors is not very useful. But what if we could combine them into one single variable with higher predictive power than either variable alone?

There are two closely related solutions: singular value decomposition (SVD) and principal components analysis (PCA).

PCA

prComp<-prcomp(spam[,c(34,32)])
plot(prComp$x[,1], prComp$x[,2]) # plot pc 1 and pc2.

[Plot: PC1 against PC2]

prComp$rotation # See how the data were combined
##           PC1     PC2
## num415 0.7081  0.7061
## num857 0.7061 -0.7081

The rotation shows that the first principal component is (approximately) the sum of the two variables, and the second-best combination is their difference.
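
A quick sketch verifying that reading: prcomp centers the variables by default, and the scores in prComp$x are the centered data multiplied by the rotation matrix:

centered<-scale(spam[,c(34,32)], center=TRUE, scale=FALSE)
manualPC1<-centered %*% prComp$rotation[,1]
all.equal(as.numeric(manualPC1), as.numeric(prComp$x[,1])) # should be TRUE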

Testing on the full Spam dataset

prComp<-prcomp(log10(spam[,-58]+1)) # log10 transform as basic preprocessing for non-normal data
ggplot(data=spam, aes(prComp$x[,1], prComp$x[,2], color=type))+
  geom_point()+
  ylab('PC2')+
  xlab('PC1')

[Plot: PC1 against PC2 for the full spam dataset, colored by type]

Putting it all together with the caret package:

As you can see, we can get quite good results using principal components as predictors.

modelFit<-train(type~., method='glm', preProcess=c('center', 'scale', 'pca'), data=train)
confusionMatrix(test$type, predict(modelFit, test))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     663   34
##    spam         46  407
##                                         
##                Accuracy : 0.93          
##                  95% CI : (0.914, 0.944)
##     No Information Rate : 0.617         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.854         
##  Mcnemar's Test P-Value : 0.219         
##                                         
##             Sensitivity : 0.935         
##             Specificity : 0.923         
##          Pos Pred Value : 0.951         
##          Neg Pred Value : 0.898         
##              Prevalence : 0.617         
##          Detection Rate : 0.577         
##    Detection Prevalence : 0.606         
##       Balanced Accuracy : 0.929         
##                                         
##        'Positive' Class : nonspam       
## 
