This section is about combining predictors, sometimes called ensembling in machine learning. The key idea is to combine classifiers by averaging or voting, even if the classifiers are very different. For example, we could combine a boosted classifier, a random forest, and a linear regression model. In general, combining classifiers improves accuracy but can also reduce interpretability, so we have to be careful that the gain in accuracy is worth the loss in interpretability. Boosting, bagging, and random forests are all variants on this theme of averaging classifiers together, but in those cases it is the same kind of classifier being averaged every time.
There are a couple of approaches to combining classifiers. One is to take similar classifiers and combine them using something like bagging, boosting, or random forests. Another is to combine different classifiers using model stacking or model ensembling, and model stacking is what we will look at in this section.
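To make the averaging-and-voting idea concrete, here is a minimal sketch using made-up predictions; the object names and values are purely illustrative, not part of the example that follows.
# Hypothetical class predictions from three classifiers on five test cases
p1<-c("yes","no","yes","yes","no")
p2<-c("yes","yes","yes","no","no")
p3<-c("no","no","yes","yes","no")
# Majority vote: the most frequent label wins for each case
apply(cbind(p1, p2, p3), 1, function(x) names(which.max(table(x))))
# For continuous outcomes, averaging the predictions plays the same role
q1<-c(10.2, 11.5, 9.8); q2<-c(10.6, 11.1, 10.3)
(q1+q2)/2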
Let’s see an example using the ‘Wage’ data from the ISLR package. We leave out the ‘logwage’ variable because we are predicting ‘wage’ and logwage would be an unrealistically good predictor. We split the data into three sets: ‘validation’, ‘training’, and ‘testing’. First we split the full data set into a building set (inBuild) and a validation set; then we split the building set further into the training and testing sets.
library(ISLR)
data(Wage)
library(ggplot2)
library(caret)
## Loading required package: lattice
Wage<-subset(Wage, select=-c(logwage))
#Create a building data set and validation set
inBuild<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
validation<-Wage[-inBuild,]
buildData<-Wage[inBuild,]
inTrain<-createDataPartition(y=buildData$wage, p=0.7, list=FALSE)
training<-buildData[inTrain, ]
testing<-buildData[-inTrain, ]
dim(validation)
## [1] 898 10
dim(training)
## [1] 1474 10
dim(testing)
## [1] 628 10
So, we have 898 samples in the validation set, 1474 in the training set, and 628 in the testing set.
On the training set, we build a glm model using all the variables to predict wage. We also build a separate model using the random forest method on the same data set. In other words, we fit two different prediction models to the same training data.
mod1<-train(wage~., method="glm", data=training)
mod2<-train(wage~., method="rf", data=training,
            trControl=trainControl(method="cv"), number=3)
# Note: as written, number=3 is received by randomForest via train's ... (see the
# Call in the final model below) rather than by trainControl, so cross-validation
# uses trainControl's default of 10 folds. The intent was most likely
# trainControl(method="cv", number=3).
mod1$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) year
## -1539.9992 0.8033
## age `maritl2. Married`
## 0.1182 18.0099
## `maritl3. Widowed` `maritl4. Divorced`
## -1.3480 5.4217
## `maritl5. Separated` `race2. Black`
## 6.9206 -6.9583
## `race3. Asian` `race4. Other`
## -1.3683 -9.8072
## `education2. HS Grad` `education3. Some College`
## 9.0522 19.5454
## `education4. College Grad` `education5. Advanced Degree`
## 30.8297 50.7402
## `region2. Middle Atlantic` `region3. East North Central`
## NA NA
## `region4. West North Central` `region5. South Atlantic`
## NA NA
## `region6. East South Central` `region7. West South Central`
## NA NA
## `region8. Mountain` `region9. Pacific`
## NA NA
## `jobclass2. Information` `health2. >=Very Good`
## 4.4427 7.6123
## `health_ins2. No`
## -19.0496
##
## Degrees of Freedom: 1473 Total (i.e. Null); 1457 Residual
## Null Deviance: 2590000
## Residual Deviance: 1773000 AIC: 14670
mod2$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, number = 3)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 1303.618
## % Var explained: 25.82
Let’s use the final models to predict on the testing set and plot the predictions together to assess how they do.
pred1<-predict(mod1, testing)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred2<-predict(mod2, testing)
qplot(pred1, pred2, colour=wage, data=testing)
We can see that the two sets of predictions are close to each other, but they do not agree exactly, and neither of them correlates perfectly with the wage variable, which is shown as the color of the dots in this plot.
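One quick way to quantify this is to look at the correlations directly; the exact values will depend on the random data split, so treat this as a rough check rather than a formal comparison.
cor(pred1, pred2)          # how closely the two models agree with each other
cor(pred1, testing$wage)   # how well each model tracks the observed wage
cor(pred2, testing$wage)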
So what we can do next is fit a model that combines the predictors. We build a new data set consisting of the predictions from model one, the predictions from model two, and the wage variable from the test data set. We then fit a new model that relates wage to the two sets of predictions. So now, instead of fitting a single model that relates the covariates to the outcome, we have fit two separate prediction models for the outcome, plus a regression model relating the outcome to the predictions from those two models. We can then predict from the combined model on new samples.
predDF<-data.frame(pred1, pred2, wage=testing$wage)
combModFit<-train(wage~., method="gam", data=predDF)
combPred<-predict(combModFit, predDF)
sqrt(sum((pred1-testing$wage)^2))
## [1] 806.4326
sqrt(sum((pred2-testing$wage)^2))
## [1] 692.8847
sqrt(sum((combPred-testing$wage)^2))
## [1] 670.3854
We can look at how well we do on the test set. For each model we take the differences between the predictions and the observed wages in the testing set, square them, sum them, and take the square root. The first predictor has an error of 806.4326 and the second predictor has an error of 692.8847, whereas the combined predictor has a lower error of 670.3854.
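The same computation can be wrapped in a small helper; the function name rssError is just illustrative, not something used elsewhere in these notes.
# Root of the residual sum of squares between predictions and observations
rssError<-function(pred, obs) sqrt(sum((pred-obs)^2))
rssError(combPred, testing$wage)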
We also want to try it out on the validation set, because we used the test set to blend the two models together; an error computed on the test set is therefore not a good representation of the out-of-sample error. We create predictions from the first model and the second model on the validation set, put those two predictions into a data frame, and then predict with the individual models and with the combined model using those predictions on the validation data. So the covariates being passed into the combined model are the predictions from the two different models.
pred1V<-predict(mod1, validation)
pred2V<-predict(mod2, validation)
predVDF<-data.frame(pred1=pred1V,pred2=pred2V)
combPredV<-predict(combModFit, predVDF)
As in the earlier cases, let’s see the errors of each of these models:
sqrt(sum((pred1V-validation$wage)^2))
## [1] 1299.956
sqrt(sum((pred2V-validation$wage)^2))
## [1] 1364.741
sqrt(sum((combPredV-validation$wage)^2))
## [1] 1445.652
We can see that the error from model one used by itself would be 1299.956, and the error from the second model would be 1364.741. The combined model actually has a higher error on the validation data set, 1445.652, so in this particular split the blend does not beat the best single model; this is a reminder that the blending model can overfit the set it was trained on. Even so, stacking models in this way can sometimes improve accuracy by blending the strengths of different models together, and even simple blending like we have done here can be useful.
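An even simpler blend, mentioned only as a sketch, is to average the two sets of predictions with equal weights; this requires no extra model fitting, though how well it does will again depend on the data split.
avgPredV<-(pred1V+pred2V)/2                 # equal-weight average of the two models
sqrt(sum((avgPredV-validation$wage)^2))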
This section is about forecasting, which is a very specific kind of prediction problem, typically applied to time series data. For example, we might forecast the price of a stock at a certain time, or simply whether the price goes up or down.
Time series introduce some very specific kinds of dependence structure, and some additional challenges that must be taken into account when making predictions.
One thing to be aware of is spurious correlation: two time series can be highly correlated for reasons that do not make one a good predictor of the other.
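A quick simulated illustration of this point: two completely independent random walks will often show a sizeable sample correlation even though neither has any predictive value for the other. The seed and series length here are arbitrary.
set.seed(1234)              # arbitrary seed, for reproducibility only
walk1<-cumsum(rnorm(200))   # two completely independent random walks
walk2<-cumsum(rnorm(200))
cor(walk1, walk2)           # the sample correlation is often far from zero anyway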
So, we are going to see a quick example of some forecasting using the quantmod package and some Google data.
library(quantmod)
from.dat<-as.Date("01/01/12", format="%m/%d/%y")
to.dat<-as.Date("12/31/19", format="%m/%d/%y")
getSymbols("GOOG", src="yahoo", from=from.dat, to=to.dat)
## [1] "GOOG"
head(GOOG)
## GOOG.Open GOOG.High GOOG.Low GOOG.Close GOOG.Volume GOOG.Adjusted
## 2012-01-03 325.2509 332.8275 324.9669 331.4626 7380500 331.4626
## 2012-01-04 331.2733 333.8736 329.0765 332.8922 5749400 332.8922
## 2012-01-05 329.8287 330.7453 326.8897 328.2745 6590300 328.2745
## 2012-01-06 328.3443 328.7677 323.6818 323.7963 5405900 323.7963
## 2012-01-09 322.0429 322.2920 309.4551 310.0678 11688800 310.0678
## 2012-01-10 313.6992 315.7166 307.3032 310.4065 8824000 310.4065
We get the open, high, low, close, volume, and adjusted price information for the period from 01/01/2012 to 12/31/2019.
Now we can summarize this monthly and store it as a time series. The ‘to.monthly’ function converts the daily data to a monthly series, ‘Op’ extracts just the opening prices, and the ‘ts’ function in R creates a time series object. If we plot that, we can see the monthly opening prices for Google over this period.
Let’s do just that.
mGoog<-to.monthly(GOOG)
googOpen<-Op(mGoog)## just take the opening information
ts1<-ts(googOpen, frequency=12)
plot(ts1, xlab="years+1", ylab="GOOG")
A time series like this can be decomposed into a trend, a seasonal pattern, and an irregular (random) component. One way to do this is with the decompose function in R. If we decompose the series in an additive way:
plot(decompose(ts1), xlab="Years+1")
we can see a trend component showing the Google stock price generally moving upward, a seasonal pattern, and a more random, cyclical component. So the decomposition breaks the series down into these different types of patterns in the data.
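If we want the individual components rather than the plot, decompose returns them as elements of a list:
decomp<-decompose(ts1)
head(decomp$trend)      # estimated trend (NA at the ends, where the moving average is undefined)
head(decomp$seasonal)   # estimated monthly seasonal effects
head(decomp$random)     # the irregular component left after removing trend and seasonality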
Now, let’s build the training and test sets and explore further. For time series we have to build training and test sets from consecutive time points. Our training set starts at time point 1 and ends at time point 5, and the test set is the next consecutive stretch of points after that. That way the training set and the test set consist of consecutive time points and show the same sorts of trends that we observe in our data.
ts1Train<-window(ts1, start=1, end=5)
ts1Test<-window(ts1, start=5, end=(7-0.01))
ts1Train
## Jan Feb Mar Apr May Jun Jul Aug
## 1 325.2509 291.3778 309.9682 319.1886 300.7676 284.8274 289.8237 317.4601
## 2 358.3668 377.6844 397.4104 396.0206 410.0929 434.8700 441.5699 445.8289
## 3 555.6473 587.3983 601.1218 557.1802 525.6668 559.1648 576.7366 568.8383
## 4 527.5616 530.2741 558.9953 547.0980 538.4300 536.7900 524.7300 625.3400
## 5 743.0000
## Sep Oct Nov Dec
## 1 340.9969 378.1078 338.4813 349.8088
## 2 425.5848 438.4815 513.9685 529.7693
## 3 570.2843 574.4329 553.9791 537.4245
## 4 602.3600 608.3700 711.0600 747.1100
## 5
There are a couple of different approaches to forecasting. One is a simple moving average, which replaces each value with the average of the observations in a window of nearby time points (order=3 in the code below uses a three-month window).
library(forecast)
plot(ts1Train)
lines(ma(ts1Train, order=3), col="red")
We can also do exponential smoothing, in which nearby time points are weighted more heavily than time points that are farther away. There is a large family of smoothing models to choose from, depending on how each component of the model is treated: N (none), A (additive), Ad (additive damped), M (multiplicative), or Md (multiplicative damped). The "MMM" model used below takes the multiplicative form for every component.
ets1<-ets(ts1Train, model="MMM")
fcast<-forecast(ets1)
plot(fcast)
lines(ts1Test, col="red")
For exponential smoothing, we fit a model in which we have different choices for the types of trend and seasonal components we want to fit. When we forecast, we get point predictions from the forecasting model, along with prediction bounds describing the range of values we might plausibly observe.
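The forecast object contains these pieces explicitly, so we can inspect them directly:
fcast$mean    # point forecasts
fcast$lower   # lower prediction bounds (80% and 95% intervals by default)
fcast$upper   # upper prediction bounds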
We can measure performance with the accuracy function: given the forecast and the test set, it reports the root mean squared error (RMSE) along with other metrics that are more appropriate for forecasting.
accuracy(fcast, ts1Test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.8006171 27.14188 20.97464 -0.02727681 4.509309 0.2015169
## Test set 70.4486297 108.82676 81.82616 7.61002906 9.133420 0.7861570
## ACF1 Theil's U
## Training set 0.04944982 NA
## Test set 0.84156480 2.544983
In the examples we have talked about so far in this class, we usually know the labels; in other words, we are doing supervised classification, predicting an outcome we know how to measure. But sometimes we do not know the labels, and we have to discover them from the data. One way to do that is to create clusters from the data we have observed, give names to those clusters, and then build a predictor for those clusters; on a new data set, we predict the cluster and apply the name we came up with from the previous data set. This adds several levels of difficulty to the prediction problem.
First of all, creating the clusters is not a perfectly noiseless process. Second, coming up with the right names, in other words interpreting the clusters well, is an incredibly challenging problem. Building a predictor for the clusters then uses the algorithms we have learned throughout this class, as does predicting those clusters on new data. Let’s see an example:
data(iris)
library(ggplot2)
inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]
testing<-iris[-inTrain, ]
dim(training)
## [1] 105 5
dim(testing)
## [1] 45 5
We can see that there are 105 observations in the training set and 45 in the testing set.
One thing we could do is perform k-means clustering; the basic idea here is to create three different clusters.
kMeans1<-kmeans(subset(training, select=-c(Species)), centers=3)
training$clusters<-as.factor(kMeans1$cluster)
qplot(Petal.Width, Petal.Length, colour=clusters, data=training)
We can see three clusters, and they are actually relatively close to the clusters we would get if we used the species labels themselves. That is not a typical outcome; often it is very hard to label the clusters we get.
So in this case, we can make a table of the cluster assignments versus the species, shown just below. We can also fit a model that relates the cluster variable we have just created to all the predictor variables in the training set; here we do that with a classification tree (rpart) and then make predictions on the training set.
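The cluster-versus-species table mentioned above can be produced like this (the exact counts will depend on the random data partition and on the k-means starting values). The classification tree fit and its predictions on the training set follow.
table(training$clusters, training$Species)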
modFit<-train(clusters~., data=subset(training, select=-c(Species)), method="rpart")
table(predict(modFit, training), training$Species)
##
## setosa versicolor virginica
## 1 9 3 0
## 2 26 0 0
## 3 0 32 35
We can see that the predictions line up with the species only loosely: setosa is split between predicted clusters 1 and 2, a few versicolor observations land in cluster 1, and cluster 3 contains both versicolor and virginica. That is because we have error and variation both in building the clusters and in building the prediction model, so unsupervised prediction in this way ends up being quite a challenging problem to deal with.
We can then apply this on the test data set. If we predict on the test set, in general we would not know what the true labels are, but here we do, so we can check. We predict on the new data set and make a table of the predicted clusters versus the actual known species.
testClusterPred<-predict(modFit, testing)
table(testClusterPred, testing$Species)
##
## testClusterPred setosa versicolor virginica
## 1 8 0 0
## 2 7 0 0
## 3 0 15 15
We can see that this does quite a reasonable job of separating the species into different cluster labels, although setosa is again split across two predicted clusters and versicolor and virginica share one. In general this is a hard problem, so care must be taken in labeling the clusters, in performing the unsupervised analysis, and in understanding how well it works.