library(tidyverse)
library(glmnet)

The motivation of this tutorial is to give a practical introduction to Linear Regression. We begin with Simple Linear Regression and explain accuracy metrics such as the Mean Square Error and the Root Mean Square Error. We then move on to Multiple Linear Regression and explain the concept of Cross Validation, developing a Cross Validation routine from scratch.
The second part of the tutorial covers Penalized Regression Models, which penalize, or shrink, the values of the coefficients of the predictors.
To motivate our study of regression, let us start with an example. Suppose you are a consultant and you are given a dataset of cars with properties such as weight, number of cylinders, rear axle ratio, number of forward gears and number of carburetors, and you are asked to compute the miles per gallon of each car. Miles per gallon is a dependent variable whose value depends on a number of independent variables such as weight, number of cylinders, rear axle ratio and so on. With this setting, we can use regression techniques to predict the miles per gallon consumed by the car.
The plan to study Regression is through the following steps:
Simple Linear Regression
Multiple Linear Regression
Ridge Regression
Lasso Regression
While we study these various types of Regression, we will also explore two very important concepts: metrics and cross validation. Let's begin by exploring the cars dataset our client has given us.
We use the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
We observe the first six rows of the dataset below.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We wish to take on the simple task of finding the relationship between miles per gallon and the weight of the vehicle. To do this, we plot a scatter plot as depicted below. From the plot it is evident that as the weight decreases, the miles per gallon increases.
mtcars %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  stat_smooth(aes(x = wt, y = mpg), method = "lm", color = "red") +
  theme_bw() +
  theme(axis.title = element_text(size = 16), axis.text = element_text(size = 14)) +
  xlab("Weight") +
  ylab("Miles per gallon")

Simple Linear Regression tries to find the relationship of one variable with another variable. This is best represented by the simple equation y = (slope) * x + intercept.
In this example we use the lm function to model the relationship between miles per gallon and weight.
The lm function takes a formula of the form response ~ terms, where response is the numeric response vector, in this case miles per gallon, and terms are the independent variables (the predictors). There can be more than one term; in Simple Linear Regression there is exactly one. In this example the term is wt, which represents the weight of the car.
lmfit <- lm(mpg ~ wt, mtcars)
lmfit$coefficients
## (Intercept) wt
## 37.285126 -5.344472
lmfit$coefficients provides the intercept and the slope of the fitted line.
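With the fitted coefficients we can also predict the miles per gallon for a new weight. The sketch below uses a hypothetical weight of 3.5 (wt is measured in units of 1000 lbs).
# predict mpg for a hypothetical car weighing 3,500 lbs
new_car <- data.frame(wt = 3.5)
predict(lmfit, newdata = new_car)

# the same value from the equation: intercept + slope * wt
lmfit$coefficients[1] + lmfit$coefficients[2] * 3.5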
We wish to find out how well the simple linear regression model has predicted the actual values. In machine learning there are several ways to measure the accuracy of a prediction. These different measures are known as metrics.
The metric we discuss first is the Mean Square Error. The steps to calculate it are as follows:
Calculate the differences between the actual and predicted values. These are available as lmfit$residuals.
Square the differences.
Sum the squared differences.
Divide the sum by the number of observations N to obtain the Mean Square Error.
This can be mathematically represented as
\[\frac{1}{N}\sum_{i=1}^{N}\left({\hat{y_i} - y_i} \right)^{2}\]
where \(\hat{y_i}\) is the predicted value and \({y_i}\) is the actual value
SSE = sum( (lmfit$residuals)^2 )
N = nrow(mtcars)
MSE = SSE/N
MSE
## [1] 8.697561
Root Mean Square Error is the square root of the Mean Square Error.
This can be mathematically represented as
\[\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left({\hat{y_i} - y_i} \right)^{2}}\]
RMSE = sqrt(MSE)
RMSE
## [1] 2.949163
We now show how miles per gallon is related to the quarter mile time (qsec) of the vehicle. From the plot it is evident that as the quarter mile time increases, the miles per gallon increases.
mtcars %>%
  ggplot(aes(x = qsec, y = mpg)) +
  geom_point(color = "blue") +
  stat_smooth(aes(x = qsec, y = mpg), method = "lm", color = "red") +
  theme_bw() +
  theme(axis.title = element_text(size = 16), axis.text = element_text(size = 14)) +
  xlab("Quarter Mile Time") +
  ylab("Miles per gallon")

Multiple Linear Regression is used to explain the relationship between one dependent variable (in this case miles per gallon) and multiple independent variables (in this case weight and quarter mile time).
We do a multiple linear regression with the dependent variable being miles per gallon and the independent variables being weight and quarter mile time.
lmfit <- lm(mpg ~ wt + qsec, mtcars)
lmfit$coefficients
## (Intercept) wt qsec
## 19.746223 -5.047982 0.929198
The Mean Square Error is calculated below for the multiple linear regression model.
SSE = sum( (lmfit$residuals)^2 )
N = nrow(mtcars)
MSE = SSE/N
MSE
## [1] 6.108238
RMSE = sqrt(MSE)
RMSE
## [1] 2.471485
In the code below, we explain the relationship between miles per gallon and all the other variables used as independent variables. The coefficients for the linear model are provided below.
lmfit <- lm(mpg ~ ., mtcars)
lmfit$coefficients
## (Intercept) cyl disp hp drat wt
## 12.30337416 -0.11144048 0.01333524 -0.02148212 0.78711097 -3.71530393
## qsec vs am gear carb
## 0.82104075 0.31776281 2.52022689 0.65541302 -0.19941925
The mean square error for the multiple linear regression is provided below
SSE = sum( (lmfit$residuals)^2 )
N = nrow(mtcars)
MSE = SSE/N
MSE
## [1] 4.609201
The root mean square error for the multiple linear regression is provided below
RMSE = sqrt(MSE)
RMSE
## [1] 2.146905
So far, we have predicted on data the model has already seen. In practice, we need to predict on data we have not seen. It is therefore important to build the model on one set of data and test its accuracy on a completely different set of data. To do this we will use a technique known as Cross Validation.
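Before building the full cross-validation routine, here is a minimal sketch of the underlying idea: fit on one random subset and measure RMSE on the held-out rows. The 80/20 split proportion and the seed below are arbitrary choices made for illustration.
# a single random train/test split (80/20); proportion and seed are arbitrary
set.seed(123)
train_idx  <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train_data <- mtcars[train_idx, ]
test_data  <- mtcars[-train_idx, ]

split_model <- lm(mpg ~ wt + qsec, data = train_data)
split_preds <- predict(split_model, newdata = test_data)

# RMSE on the held-out rows only
sqrt(mean((test_data$mpg - split_preds)^2))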
The steps for performing Cross Validation are as follows:
Divide the data into K parts. In popular literature these parts are known as folds, so we divide the data into K folds. In our example we take K equal to 5.
We train the model on Folds 2, 3, 4 and 5, predict on Fold 1, and calculate the accuracy for this fold.
We train the model on Folds 1, 3, 4 and 5, predict on Fold 2, and calculate the accuracy for this fold.
We train the model on Folds 1, 2, 4 and 5, predict on Fold 3, and calculate the accuracy for this fold.
We treat Folds 4 and 5 the same way, holding each out in turn.
The final accuracy is the mean of all the fold accuracies.
Cross Validation RMSE is provided by the following function
CVFromScratch <- function(formula, noOfFolds) {
  data <- mtcars
  k <- noOfFolds  # number of folds

  # randomly assign each row to one of the k folds
  data$id <- sample(1:k, nrow(data), replace = TRUE)
  list <- 1:k
  RMSECopy <- data.frame()

  for (i in 1:k) {
    # rows whose fold id is not i form the training set,
    # rows whose fold id is i form the test set
    trainingset <- subset(data, id %in% list[-i])
    testset <- subset(data, id %in% c(i))

    # fit a linear model on the training folds,
    # dropping the fold id so it is not used as a predictor
    mymodel <- lm(formula, data = trainingset %>% select(-id))

    # drop the response column and predict on the held-out fold
    testset2 <- testset %>% select(-mpg)
    predictions <- predict(mymodel, testset2)

    # RMSE on the held-out fold
    SSE <- sum((testset$mpg - predictions)^2)
    N <- nrow(testset)
    MSE <- SSE / N
    RMSE <- sqrt(MSE)
    RMSECopy <- rbind(RMSECopy, as.data.frame(RMSE))
  }

  summary(RMSECopy)
}
formula = mpg ~ wt + qsec
noOfFolds = 5
CVFromScratch(formula, noOfFolds)
##      RMSE
## Min. :1.576
## 1st Qu.:1.729
## Median :2.509
## Mean :2.557
## 3rd Qu.:3.153
## Max. :3.820
formula = mpg ~ .
noOfFolds = 5
CVFromScratch(formula, noOfFolds)
##      RMSE
## Min. : 1.963
## 1st Qu.: 3.501
## Median : 3.686
## Mean : 5.076
## 3rd Qu.: 4.837
## Max. :11.395
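As an optional cross-check of our from-scratch routine, a similar 5-fold RMSE estimate for the mpg ~ wt + qsec model can be obtained with an off-the-shelf tool. The sketch below assumes the caret package is installed; it is not used anywhere else in this tutorial.
library(caret)

# 5-fold cross validation of the same linear model using caret
ctrl  <- trainControl(method = "cv", number = 5)
cv_lm <- train(mpg ~ wt + qsec, data = mtcars, method = "lm", trControl = ctrl)
cv_lm$results   # reports the RMSE averaged over the 5 folds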
This approach fits a model involving all the predictors but shrinks the unimportant coefficients towards zero. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.
In the previous models we were trying to minimize the Mean Square Error. In the Penalized Regression Models we minimize two components:
The Sum of Squared Errors
A tuning parameter \(\lambda\) multiplied by the sum of the squares of the coefficients, in the case of Ridge Regression
A tuning parameter \(\lambda\) multiplied by the sum of the absolute values of the coefficients, in the case of Lasso Regression
In this way, the values of the coefficients are penalized for being too large, and this improves accuracy on unseen data.
We minimize the following for the Ridge Regression
\(\sum_{i=1}^{N}\left({\hat{y_i} - y_i} \right)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2}\)
We minimize the following for the Lasso Regression
\(\sum_{i=1}^{N}\left({\hat{y_i} - y_i} \right)^{2} + \lambda\sum_{j=1}^{p}|\beta_j|\)
Here p is the number of predictors and \(\beta_j\) is the jth coefficient.
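To make the two objectives concrete, the small sketch below evaluates both of them for a made-up coefficient vector; the values of the coefficients, the intercept and \(\lambda\) are chosen purely for illustration.
# evaluate the ridge and lasso objectives for a made-up coefficient vector
beta      <- c(wt = -5.0, qsec = 0.9)   # hypothetical coefficients
intercept <- 19.7                       # hypothetical intercept
lambda    <- 1                          # hypothetical tuning parameter

fitted_vals <- intercept + beta["wt"] * mtcars$wt + beta["qsec"] * mtcars$qsec
sse <- sum((mtcars$mpg - fitted_vals)^2)

sse + lambda * sum(beta^2)     # ridge objective: SSE + lambda * sum of squared coefficients
sse + lambda * sum(abs(beta))  # lasso objective: SSE + lambda * sum of absolute coefficients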
We do a Ridge Regression using the glmnet library, setting \(\alpha\) equal to zero and \(\lambda\) to a grid of values. In the first step we create a collection of values of \(\lambda\) ranging from 1000 down to 0.01, so the values at the start of the grid are large while the values at the end are small.
lambdas <- 10^seq(3, -2, by = -.1)
lambdas
## [1] 1.000000e+03 7.943282e+02 6.309573e+02 5.011872e+02 3.981072e+02
## [6] 3.162278e+02 2.511886e+02 1.995262e+02 1.584893e+02 1.258925e+02
## [11] 1.000000e+02 7.943282e+01 6.309573e+01 5.011872e+01 3.981072e+01
## [16] 3.162278e+01 2.511886e+01 1.995262e+01 1.584893e+01 1.258925e+01
## [21] 1.000000e+01 7.943282e+00 6.309573e+00 5.011872e+00 3.981072e+00
## [26] 3.162278e+00 2.511886e+00 1.995262e+00 1.584893e+00 1.258925e+00
## [31] 1.000000e+00 7.943282e-01 6.309573e-01 5.011872e-01 3.981072e-01
## [36] 3.162278e-01 2.511886e-01 1.995262e-01 1.584893e-01 1.258925e-01
## [41] 1.000000e-01 7.943282e-02 6.309573e-02 5.011872e-02 3.981072e-02
## [46] 3.162278e-02 2.511886e-02 1.995262e-02 1.584893e-02 1.258925e-02
## [51] 1.000000e-02
y <- mtcars$mpg
x <- mtcars %>% select(-mpg) %>% data.matrix()
fit <- glmnet(x, y, alpha = 0, lambda = lambdas)

We inspect the values of the coefficients for small values of \(\lambda\). The value at index 50 is the second smallest value of \(\lambda\) in the grid.
fit$lambda[50]
## [1] 0.01258925
coef(fit)[, 50]
## (Intercept) cyl disp hp drat wt
## 12.69449511 -0.10189973 0.01153088 -0.02044633 0.81830687 -3.52290525
## qsec vs am gear carb
## 0.77719400 0.31834762 2.49705388 0.67141564 -0.26492009
We inspect the values of the coefficients for large values of \(\lambda\). The value at index 1 is the largest value of \(\lambda\) in the grid.
fit$lambda[1]
## [1] 1000
coef(fit)[, 1]
## (Intercept) cyl disp hp drat
## 20.0145856575 -0.0164800506 -0.0002362546 -0.0003914850 0.0440169682
## wt qsec vs am gear
## -0.0307273579 0.0080706480 0.0454252840 0.0416668844 0.0224881575
## carb
## -0.0118344043
We observe the following:
Large values of \(\lambda\) produce small coefficient values.
Small values of \(\lambda\) produce large coefficient values.
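One way to see this shrinkage directly is to plot the whole coefficient path; the call below uses glmnet's built-in plot method and is included here as an optional visualisation.
# each curve traces one predictor's coefficient as lambda varies
plot(fit, xvar = "lambda", label = TRUE)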
We do cross validation to find the optimum value of \(\lambda\), using the cv.glmnet function. lambda.min is the value of \(\lambda\) that gives the minimum mean cross-validated error. Once we have the optimum value of \(\lambda\), we refit the model using it.
cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)
opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 3.162278
fit <- glmnet(x, y, alpha = 0, lambda = opt_lambda)
fit$lambda
## [1] 3.162278
coef(fit)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 21.019269740
## cyl -0.375606668
## disp -0.005353719
## hp -0.011454746
## drat 1.056241727
## wt -1.189217320
## qsec 0.159330517
## vs 0.793508955
## am 1.573837349
## gear 0.540210275
## carb -0.527760976
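For a rough comparison with the earlier linear models, we can also compute the training-set RMSE of this ridge fit; the minimal sketch below uses glmnet's predict method.
# training-set predictions from the ridge fit at the chosen lambda
ridge_preds <- predict(fit, s = opt_lambda, newx = x)
sqrt(mean((y - ridge_preds)^2))   # RMSE of the ridge model on the training data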
Lasso Regression is similar to Ridge Regression. A Ridge Regression model keeps all the predictors: some coefficients end up close to zero but never exactly zero, so the less useful predictors are not completely eliminated. Lasso Regression, on the other hand, shrinks some of the coefficients exactly to zero, completely eliminating those predictors. Lasso Regression therefore helps in feature selection.
We do a Lasso Regression using the glmnet library setting \(\alpha\) equal to 1 and \(\lambda\) to a grid of numbers.
lambdas <- 10^seq(3, -2, by = -.1)
lambdas
## [1] 1.000000e+03 7.943282e+02 6.309573e+02 5.011872e+02 3.981072e+02
## [6] 3.162278e+02 2.511886e+02 1.995262e+02 1.584893e+02 1.258925e+02
## [11] 1.000000e+02 7.943282e+01 6.309573e+01 5.011872e+01 3.981072e+01
## [16] 3.162278e+01 2.511886e+01 1.995262e+01 1.584893e+01 1.258925e+01
## [21] 1.000000e+01 7.943282e+00 6.309573e+00 5.011872e+00 3.981072e+00
## [26] 3.162278e+00 2.511886e+00 1.995262e+00 1.584893e+00 1.258925e+00
## [31] 1.000000e+00 7.943282e-01 6.309573e-01 5.011872e-01 3.981072e-01
## [36] 3.162278e-01 2.511886e-01 1.995262e-01 1.584893e-01 1.258925e-01
## [41] 1.000000e-01 7.943282e-02 6.309573e-02 5.011872e-02 3.981072e-02
## [46] 3.162278e-02 2.511886e-02 1.995262e-02 1.584893e-02 1.258925e-02
## [51] 1.000000e-02
y <- mtcars$mpg
x <- mtcars %>% select(-mpg) %>% data.matrix()
fit <- glmnet(x, y, alpha = 1, lambda = lambdas)

We inspect the values of the coefficients for small values of \(\lambda\).
fit$lambda[50]
## [1] 0.01258925
coef(fit)[, 50]
## (Intercept) cyl disp hp drat
## 13.521893747 -0.096983096 0.008842666 -0.018453660 0.809605725
## wt qsec vs am gear
## -3.350667774 0.740203469 0.243526586 2.450240877 0.621248872
## carb
## -0.320297953
We inspect the values of the coefficients for large values of \(\lambda\).
fit$lambda[1]
## [1] 1000
coef(fit)[, 1]
## (Intercept) cyl disp hp drat wt
## 20.09062 0.00000 0.00000 0.00000 0.00000 0.00000
## qsec vs am gear carb
## 0.00000 0.00000 0.00000 0.00000 0.00000
We observe the following:
Large values of \(\lambda\) produce small coefficient values; at \(\lambda = 1000\) every coefficient is shrunk exactly to zero.
Small values of \(\lambda\) produce large coefficient values.
We use cross validation to find the optimum value of \(\lambda\). lambda.min is the value of \(\lambda\) that gives the minimum mean cross-validated error.
cv_fit <- cv.glmnet(x, y, alpha = 1, lambda = lambdas)
opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 0.7943282
fit <- glmnet(x, y, alpha = 1, lambda = opt_lambda)
fit$lambda
## [1] 0.7943282
coef(fit)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 20.448240715
## cyl -0.292879060
## disp -0.003259148
## hp -0.012623075
## drat 1.002638746
## wt -1.688225692
## qsec 0.250586723
## vs 0.546670536
## am 2.002014145
## gear 0.600856653
## carb -0.660350820
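Because the lasso can set coefficients exactly to zero, the selected predictors can be read off directly. The sketch below extracts the non-zero coefficients at the cross-validated lambda.min from cv_fit.
# lasso coefficients at the cross-validated lambda.min
lasso_coefs <- coef(cv_fit, s = "lambda.min")

# keep only the rows with non-zero coefficients (the selected predictors)
lasso_coefs[lasso_coefs[, 1] != 0, , drop = FALSE]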
We have covered a lot of ground, starting from Simple Linear Regression and moving on to Multiple Linear Regression. We have learned the basics and the usefulness of metrics and Cross Validation. We then went into the group of Penalized Regression methods, namely Ridge Regression and Lasso Regression.
The next step is to learn non-linear regression methods such as trees, bagging and boosting. I hope you had fun going through this tutorial and will take on the next steps.
Thank you.