This section covers one of the most direct and simple ways to perform machine learning: regression modeling. Prior knowledge of regression modeling helps a lot with this material; here we are simply using it in the service of prediction.
The key idea is that we fit a simple regression model: a line described by a set of coefficients estimated from the data. To predict for a new observation, we multiply its covariates by those estimated coefficients and obtain a predicted value. This is useful when the linear model is nearly correct, in other words when the relationship between the variables really is roughly linear. It is very easy to implement and, compared to many machine learning algorithms, quite easy to interpret: we are fitting lines to a data set, and lines are straightforward to read off. The drawback is poor performance in nonlinear settings, so on complicated problems linear regression is usually combined with other machine learning algorithms.
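To make the mechanics concrete, here is a minimal toy sketch with made-up data (the real example follows below): estimate an intercept and slope with lm, then predict for a new predictor value by multiplying it by the slope and adding the intercept.
x <- c(1, 2, 3, 4, 5)                  # a single made-up predictor
y <- 2 + 3 * x + rnorm(5, sd = 0.1)    # an outcome that is roughly linear in x
fit <- lm(y ~ x)                       # estimate the intercept and slope
coef(fit)[1] + coef(fit)[2] * 6        # prediction for a new value x = 6: b0 + b1 * x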
So, we’re going to use data on geyser eruptions. Geysers have a waiting time between eruptions and a duration for which they actually erupt, and there’s a data set we can load very easily that contains this information for a particular geyser, Old Faithful in the United States. Let’s begin:
Note that we will build the model using only the training set and test it using the test set.
library(caret)
data("faithful")
set.seed(333)
inTrain<-createDataPartition(y=faithful$waiting,
p=0.5, list=FALSE)
trainFaith<-faithful[inTrain,]
testFaith<-faithful[-inTrain,]
head(trainFaith)
## eruptions waiting
## 3 3.333 74
## 6 2.883 55
## 7 4.700 88
## 8 3.600 85
## 9 1.950 51
## 11 1.833 54
Here we can see the eruption time and the waiting period, i.e., the time between two eruptions. Let’s plot the waiting time against the eruption duration.
plot(trainFaith$waiting, trainFaith$eruptions, pch=20, col="brown", xlab="Waiting: between two Eruptions", ylab="Eruption Duration")
We have waiting time on the x axis and eruption duration on the y axis. We can see that the points fall roughly along a straight line, which means the waiting time between two eruptions is a pretty good predictor of eruption duration.
Now, we can fit a linear model using the following syntax:
lm1<-lm(eruptions~waiting, data=trainFaith)
summary(lm1)
##
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.13375 -0.36778 0.06064 0.36578 0.96057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.648629 0.226603 -7.275 2.55e-11 ***
## waiting 0.072211 0.003136 23.026 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4941 on 135 degrees of freedom
## Multiple R-squared: 0.7971, Adjusted R-squared: 0.7956
## F-statistic: 530.2 on 1 and 135 DF, p-value: < 2.2e-16
Eruptions is the outcome variable, and the model is fit to the training data set. Looking at the summary of the output, the estimated intercept is -1.65 with a standard error of 0.23, and waiting time is a statistically significant predictor of eruption duration (beta = 0.072, SE = 0.003).
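If we want interval estimates for these coefficients rather than just standard errors, the confint function returns them directly from the fitted model (a quick aside, not required for prediction):
confint(lm1)   # 95% confidence intervals for the intercept and the waiting-time slope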
Now we are going to plot the waiting and eruptions variables from the trainFaith set again and overlay the fitted line using lm1$fitted. Remember, lm1 comes from the previous syntax.
plot(trainFaith$waiting, trainFaith$eruptions, pch=20, col="brown", xlab="Waiting: between two Eruptions", ylab="Eruption Duration")
lines(trainFaith$waiting, lm1$fitted, lwd=3)
The fitted line is a reasonably good representation of the relationship between these two variables. Now let’s try predicting for a new value.
We can use the coef function to extract the estimated coefficients and compute the prediction by hand, as in **coef(lm1)[1]+coef(lm1)[2]*80**. There is no error term here because we don’t know what it is for a new observation. The index [1] gives us the intercept, while [2] gives us the slope for waiting time. So, suppose we have a new waiting time of 80; then
coef(lm1)[1]+coef(lm1)[2]*80
## (Intercept)
## 4.128276
the predicted eruption duration is approximately 4.13 minutes.
Another thing we can do is predict directly from the lm object. We don’t have to extract the coefficients and multiply them by hand every time; we can simply create a new data frame.
newdata<-data.frame(waiting=80)
predict(lm1,newdata)
## 1
## 4.128276
We create a data frame holding the single new value we want to predict for, a waiting time of 80. Then we call predict, passing it the fitted model from the training set and the new data frame, and it returns the prediction for the new value. It matches the value above: the predict command uses the same formula we would use if we calculated the prediction by hand.
Remember, we created this model using the training set, and we want to see how it performs on the test set. We do just that with the syntax below:
par(mfrow=c(1,2))
plot(trainFaith$waiting, trainFaith$eruptions, pch=20, col="brown", xlab="Waiting: between two Eruptions", ylab="Eruption Duration", main="Training Data")
lines(trainFaith$waiting, predict(lm1), lwd=3)
plot(testFaith$waiting, testFaith$eruptions, pch=20, col="brown", xlab="Waiting: between two Eruptions", ylab="Eruption Duration", main="Test Data")
lines(testFaith$waiting, predict(lm1, newdata = testFaith),lwd=3)
The training data shows a reasonable model fit, and the test data shows a slight shift but still looks reasonably good. The slight difference between these two plots makes perfect sense, because we created the model using the training set, which the test set will not replicate perfectly.
The next thing on our list is to assess the training and test set errors. For prediction with a linear model, a squared-error measure in the spirit of the Root Mean Squared Error (RMSE) is the natural choice; here we compute the square root of the sum of squared errors.
sqrt(sum((lm1$fitted-trainFaith$eruptions)^2))
## [1] 5.740844
sqrt(sum((predict(lm1, newdata=testFaith)-testFaith$eruptions)^2))
## [1] 5.853745
The first line computes this error for the training set, while the second does so for the test set. The training set error is 5.7408, while the test set error is 5.8537. The test set error is the more realistic estimate, because we did not use the test set when we built the model. It is important to understand that test set errors are almost always larger than training set errors.
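Strictly speaking, the two values above are square roots of the sums of squared errors; the RMSE proper averages the squared errors before taking the square root. A minimal sketch of that calculation:
sqrt(mean((lm1$fitted-trainFaith$eruptions)^2))                        # training set RMSE
sqrt(mean((predict(lm1, newdata=testFaith)-testFaith$eruptions)^2))    # test set RMSE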
Another thing we can do is calculate prediction intervals. We do so using the test set.
pred1<-predict(lm1, newdata=testFaith, interval="prediction")
ord<-order(testFaith$waiting)
plot(testFaith$waiting, testFaith$eruptions, pch=20, col="brown")
matlines(testFaith$waiting[ord],pred1[ord,], type="l", col=c(9,6,6), lty=c(1,1,1),lwd=3)
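pred1 is a matrix with one row per test observation and three columns: the point prediction and the lower and upper prediction-interval bounds. A quick look at the first few rows:
head(pred1)   # columns: fit (prediction), lwr and upr (prediction-interval bounds)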
We can do the same thing with the caret package; the method above is the manual approach. We use the train function in caret to build the model: eruptions is the outcome and waiting time is the predictor, separated by the tilde. Then we say which data set to build the model on and, for the method, we specify linear modeling ("lm").
modFit<-train(eruptions~waiting, data=trainFaith, method="lm")
summary(modFit$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.13375 -0.36778 0.06064 0.36578 0.96057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.648629 0.226603 -7.275 2.55e-11 ***
## waiting 0.072211 0.003136 23.026 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4941 on 135 degrees of freedom
## Multiple R-squared: 0.7971, Adjusted R-squared: 0.7956
## F-statistic: 530.2 on 1 and 135 DF, p-value: < 2.2e-16
finalModel is the part of the modFit object created by train that contains the exact final model being used for prediction. If we look at the summary of that final model, it is essentially identical to the model we fit by hand: -1.65 for the intercept and 0.07 for the waiting time.
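Prediction works the same way with the caret object as with the hand-fitted lm1; for example, the 80-minute waiting time from earlier gives the same answer through the caret interface:
predict(modFit, newdata=data.frame(waiting=80))   # caret dispatches to the underlying lm fit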
The rest of this section highlights two important concepts: fitting regression models with multiple covariates, and using exploratory plots and residual diagnostics to assess how well the model fits.
We are going to use the Wage data set from the ISLR package in this section. We are not going to use the whole data set: we drop the logwage variable, since it is just a transformation of wage, which will be our outcome (predicted) variable in this analysis. Here we go:
library(ISLR)
library(ggplot2)
library(caret)
data(Wage)
Wage<-subset(Wage, select=-c(logwage))
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins wage
## 1. <=Good : 858 1. Yes:2083 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.: 85.38
## Median :104.92
## Mean :111.70
## 3rd Qu.:128.68
## Max. :318.34
##
Looking at the summary, we can already see some patterns and important information. For example, the participants are all from the Middle Atlantic region, and there are two types of jobclass: Industrial and Information.
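We can confirm both observations with quick frequency tables:
table(Wage$region)     # all 3000 observations come from the Middle Atlantic region
table(Wage$jobclass)   # the two job classes: Industrial and Information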
Now, we are going to partition the data into training and testing sets, which will help us understand the data a little further.
inTrain<-createDataPartition(y=Wage$wage,
p=0.7, list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
dim(training)
## [1] 2102 10
dim(testing)
## [1] 898 10
Looking at the dimensions of the training and testing sets, we see that the training set has 2102 cases (rows) and 10 variables (columns), while the test set has 898 cases. We can go ahead and produce a feature plot for further analysis:
library(caret)
featurePlot(x=training[,c("age", "education",
"jobclass")],
y=training$wage,
plot="pairs")
Feature plots show a little bit about how the variables are related to each other. Sometimes they are useful and sometimes they are not; this particular plot is hard to read because the panels are squeezed together.
Anyway, in the case of jobclass, one group seems to outperform the other, and there appear to be some outlying values in the wage variable. We want to analyze the outliers further because they can pose a serious threat to our model. Let’s see what is going on with a plot:
library(ggpubr)
bw<-qplot(age,wage,data=training, main="Black and White Representation")
cl<-qplot(age,wage,colour=jobclass,data=training, main="Color Coded by jobclass for Clarity")
ggarrange(bw, cl, ncol=2, nrow=1)
There appears to be a kind of pattern in the data, but also a cluster of outliers. We colour coded the plot in the second panel and found that most of the outlying points are people who work in the Information job class. So it appears the Information level may explain more of this variability than the Industrial one.
Now let’s go ahead and include education as well and see if it explains anything.
qplot(age, wage, colour=education, data=training)
Looking at the plot we can clearly see that people with an Advanced Degree have an edge in terms of wages, followed by college graduates. Most importantly, the Advanced Degree group accounts for most of the outliers in the plot.
All in all, some combination of age and education level explains much of the variability in the data.
We fit the model on age, jobclass, and education, with wage as the outcome variable. By default, for each factor predictor R creates indicator (dummy) variables for every level except the first, which serves as the reference level; be mindful of this when interpreting the results.
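To see exactly which indicator columns R will create from these factor variables, we can peek at the design matrix before fitting (just a sanity check; each factor level beyond the reference level gets its own 0/1 column):
head(model.matrix(wage ~ age + jobclass + education, data=training))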
modFit<-train(wage~age+jobclass+education,
method="lm", data=training)
finMod<-modFit$finalModel
print(modFit)
## Linear Regression
##
## 2102 samples
## 3 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 35.56759 0.2589245 24.87554
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(finMod)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) age
## 61.7127 0.4793
## `jobclass2. Information` `education2. HS Grad`
## 4.2139 12.1096
## `education3. Some College` `education4. College Grad`
## 24.4166 39.4505
## `education5. Advanced Degree`
## 65.1802
The result shows that there are 3 predictors and a total of 2102 samples in the model (the training set). It also shows that no pre-processing was done and that caret resampled the data 25 times using the bootstrap, with 2102 data points drawn in each of the 25 resamples.
The root mean squared error is 35.57, with an R-squared of 0.26 and a mean absolute error (MAE) of 24.88. The large RMSE may be due to the cluster of outliers we saw in the earlier plots.
One thing we should do is look at diagnostic plots.
plot(finMod,1, pch=19, cex=0.5, col="lightblue")
We plotted the fitted values against the residuals, the amount of variation left over after fitting the model. We want to see this line centered at zero along the x axis, because each residual is the difference between the model prediction and the actual observed value. As we see in the plot, there are still some outliers, and they have been labeled for us; those data points are something we want to explore further.
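The same plotting interface gives the other standard lm diagnostics as well; for example, plot number 2 is the normal Q-Q plot of the residuals:
plot(finMod, 2, pch=19, cex=0.5, col="lightblue")   # normal Q-Q plot of the residuals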
For further clarity, we can colour code the residuals-versus-fitted plot by some variable not used in the model. Let’s do so by race.
qplot(finMod$fitted, finMod$residuals, colour=race, data=training)
It looks like some of the outliers could be explained by the race variable.
Another thing that is really useful is to plot the residuals against the index, i.e., the row number of each observation in the data set.
plot(finMod$residuals, pch=19)
Looking at the plot we can see that the outliers are spread across all of the indexes, but there is also a visible trend with respect to the index. When we see a pattern like this, or outliers clustered over a particular range of indexes, it suggests that the model is missing some other important variable related to the ordering of the rows, such as time.
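One way to probe this (a quick sketch, not part of the original analysis) is to colour the residual-versus-index plot by a candidate variable such as year:
qplot(seq_along(finMod$residuals), finMod$residuals, colour=training$year, xlab="Index", ylab="Residuals")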
Another thing we can do is plot the wage variable (the outcome) in the test set against the predicted values for the test set. Ideally these should be very close to each other, falling along a 45-degree line, and we can use this plot to spot trends we might have missed in earlier exploration. But we have to be careful: if we do this sort of exploration in the test set, we cannot go back and update the model on the training set, because that would amount to using the test set to rebuild our predictors. This is more of a post-mortem on the analysis, a way to determine whether it worked.
pred<-predict(modFit, testing)
qplot(wage, pred, colour=year, data=testing)
If we want to use all of the covariates in our model, we can do so as follows:
modFitAll<-train(wage ~ ., data=training, method="lm")
pred1<-predict(modFitAll, testing)
qplot(wage, pred1, data=testing)
This model does a little bit better than the one we fit earlier.
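To put numbers on that comparison, caret’s postResample function summarizes test-set performance (RMSE, R-squared, and MAE) for each set of predictions:
postResample(pred, testing$wage)    # model with age, jobclass, and education
postResample(pred1, testing$wage)   # model with all covariates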
So, linear regression is often useful in combination with other models. It is quite a simple model in the sense that it always fits lines through the data, and it can capture a lot of variability when the relationship between the predictors and the outcome is linear. If the relationship is not linear, it can miss things, and it is often better blended with other models. Exploratory data analysis is also very useful alongside regression models: plots like the residual plots we made, coloured by different features, help us identify patterns in the data set.