library(tidyverse)
library(glmnet)

1 Motivation

The motivation of this tutorial is to provide a practical introduction to Linear Regression. We begin with Simple Linear Regression and explain accuracy metrics such as the Mean Square Error and the Root Mean Square Error. We then move on to Multiple Linear Regression, explain the concept of Cross Validation, and develop a Cross Validation routine from scratch.

The second part of the tutorial explains Penalized Regression Models, which penalize, or shrink, the values of the coefficients of the predictors.

2 Regression

In order to motivate our study of regression, let us start with an example. Suppose you are a consultant and you are given a dataset of cars with properties such as weight, number of cylinders, rear axle ratio, number of forward gears and number of carburetors, and you are asked to predict the miles per gallon of each car. Miles per gallon is a dependent variable whose value depends on a number of independent variables such as weight, number of cylinders, rear axle ratio and so on. With this setting, we can use regression techniques to predict the miles per gallon consumed by the car.

The plan to study Regression is through the following steps:

  1. Simple Linear Regression

  2. Multiple Linear Regression

  3. Ridge Regression

  4. Lasso Regression

While we study these various types of regression, we will also explore two very important concepts: metrics and cross validation. Let's begin by exploring the cars dataset our client has given us.

3 Explore the mtcars dataset

We have the mtcars dataset, which was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

We observe the first six rows of the dataset below.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

4 Weight and Miles per gallon

We start with the simple task of finding the relationship between miles per gallon and the weight of the vehicle. To do this, we plot a scatter plot as depicted below. From the plot it is evident that as the weight decreases, the miles per gallon increases.

mtcars %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  stat_smooth(method = "lm", color = "red") +
  theme_bw() +
  theme(axis.title = element_text(size = 16), axis.text = element_text(size = 14)) +
  xlab("Weight (1000 lbs)") +
  ylab("Miles per gallon")

5 Simple Linear Regression

Simple Linear Regression models the relationship between one variable and another. The relationship is represented by the simple equation y = (slope) * x + intercept.

In this example we use the lm function to model the relationship between miles per gallon and weight.
The lm function takes a formula of the form response ~ terms, where response is the numeric response vector, in this case miles per gallon, and the terms are the independent variables (predictors). There can be more than one term; in Simple Linear Regression there is exactly one. In this example the term is wt, which represents the weight of the car.

lmfit <- lm(mpg ~ wt, mtcars) 

lmfit$coefficients
## (Intercept)          wt 
##   37.285126   -5.344472

lmfit$coefficients provides the intercept and the slope of the fitted equation.
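
With the intercept and slope in hand, the fitted line can be used to predict the mileage of a car that is not in the dataset. A minimal sketch (the 3,000 lb weight is a made-up example value; in mtcars the wt column is measured in units of 1,000 lbs):

new_car <- data.frame(wt = 3.0)   # a hypothetical car weighing 3,000 lbs

predict(lmfit, newdata = new_car)   # intercept + slope * wt

lmfit$coefficients[1] + lmfit$coefficients[2] * 3.0   # the same value computed by hand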

5.1 Mean Square Error

We wish to find out how well the simple linear regression model has predicted the actual values. In machine learning there are several ways to measure the accuracy of a prediction; these measures are known as metrics.
The metric we discuss here is the Mean Square Error. The steps to calculate this metric are provided below.

  1. Calculate the difference between the actual and predicted values. These differences are provided by lmfit$residuals.

  2. Square the differences

  3. Sum the squared differences

  4. Divide by the number of observations: the Mean Square Error is the mean of the squared differences

This can be mathematically represented as

\[\frac{1}{n}\sum_{i=1}^{n}\left({\hat{y_i} - y_i} \right)^{2}\]

where \(\hat{y_i}\) is the predicted value, \({y_i}\) is the actual value and \(n\) is the number of observations.

SSE = sum( (lmfit$residuals)^2 )

N = nrow(mtcars)

MSE = SSE/N

MSE
## [1] 8.697561
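
Since dividing the sum of squared residuals by the number of observations is just taking their mean, the same value can be computed in one line; a quick sanity check:

mean(lmfit$residuals^2)   # identical to SSE/N above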

5.2 Root Mean Square Error

Root Mean Square Error is the square root of the Mean Square Error.

This can be mathematically represented as

\[\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left({\hat{y_i} - y_i} \right)^{2}}\]

RMSE = sqrt(MSE)

RMSE
## [1] 2.949163

6 Quarter Mile Time and Miles per gallon

We now look at how miles per gallon is related to the quarter mile time (qsec) of the vehicle. From the plot it is evident that as the quarter mile time increases, the miles per gallon increases.

mtcars %>%
  ggplot(aes(x = qsec, y = mpg)) +
  geom_point(color = "blue") +
  stat_smooth(method = "lm", color = "red") +
  theme_bw() +
  theme(axis.title = element_text(size = 16), axis.text = element_text(size = 14)) +
  xlab("Quarter mile time") +
  ylab("Miles per gallon")

7 Multiple Linear Regression using Weight and Quarter Mile Time

Multiple Linear Regression is used to explain the relationship between one dependent variable (in this case, miles per gallon) and multiple independent variables (in this case, weight and quarter mile time).

We do a multiple linear regression with the dependent variable being miles per gallon and the independent variables being weight and quarter mile time.

lmfit <- lm(mpg ~ wt + qsec, mtcars)

lmfit$coefficients
## (Intercept)          wt        qsec 
##   19.746223   -5.047982    0.929198

7.1 Mean Square Error

Mean Square Error is calculated below for the multiple linear regression model.

SSE = sum( (lmfit$residuals)^2 )

N = nrow(mtcars)

MSE = SSE/N

MSE
## [1] 6.108238

7.2 Root Mean Square Error

RMSE = sqrt(MSE)

RMSE
## [1] 2.471485

8 Multiple Linear Regression using all variables

In the code below, we explain the relationship between miles per gallon and all of the other variables as predictors. The coefficients of the fitted linear model are provided below.

lmfit <- lm(mpg ~ ., mtcars) 

lmfit$coefficients
## (Intercept)         cyl        disp          hp        drat          wt 
## 12.30337416 -0.11144048  0.01333524 -0.02148212  0.78711097 -3.71530393 
##        qsec          vs          am        gear        carb 
##  0.82104075  0.31776281  2.52022689  0.65541302 -0.19941925

8.1 Mean Square Error

The mean square error for the multiple linear regression is provided below

SSE = sum( (lmfit$residuals)^2 )

N = nrow(mtcars)

MSE = SSE/N

MSE
## [1] 4.609201

8.2 Root Mean Square Error

The root mean square error for the multiple linear regression is provided below

RMSE = sqrt(MSE)

RMSE
## [1] 2.146905
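
It is convenient to see the three training errors side by side. A minimal sketch that refits each model and collects its training RMSE (the helper train_rmse is our own, not part of any package):

# helper: fit a formula on mtcars and return the training RMSE
train_rmse <- function(formula) {
  fit <- lm(formula, data = mtcars)
  sqrt(mean(fit$residuals^2))
}

# the training RMSE decreases as more predictors are added
c(wt_only  = train_rmse(mpg ~ wt),
  wt_qsec  = train_rmse(mpg ~ wt + qsec),
  all_vars = train_rmse(mpg ~ .))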

9 Cross Validation

So far we have only predicted on data the model has already seen. In the real world we need to predict on data we have not seen, so it is important to fit the model on one set of data and test its accuracy on a completely different set of data. To do this we will use a technique known as Cross Validation.
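
Before building full cross validation, a single hold-out split already illustrates the idea: fit the model on one part of the data and measure the error on rows it has never seen. A minimal sketch (the 80/20 split and the seed are arbitrary choices):

set.seed(123)                                   # make the random split reproducible
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))

train <- mtcars[train_idx, ]                    # data used to fit the model
test  <- mtcars[-train_idx, ]                   # unseen data used to evaluate it

holdout_fit  <- lm(mpg ~ wt + qsec, data = train)
holdout_pred <- predict(holdout_fit, newdata = test)

sqrt(mean((test$mpg - holdout_pred)^2))         # hold-out RMSE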

The steps for performing Cross Validation are as follows:

  1. Divide the data into K parts. In popular literature these parts are known as folds, so we divide the data into K folds. In our example we take K equal to 5.

  2. We train the model on folds 2, 3, 4 and 5, predict on fold 1, and calculate the accuracy on this held-out fold.

  3. We train the model on folds 1, 3, 4 and 5, predict on fold 2, and calculate the accuracy on this held-out fold.

  4. We train the model on folds 1, 2, 4 and 5, predict on fold 3, and calculate the accuracy on this held-out fold.

  5. We repeat the same procedure with folds 4 and 5 as the held-out folds.

  6. The final accuracy is the mean of the accuracies over the folds.

9.1 Cross Validation of a Model using 2 variables

The cross-validated RMSE is computed by the following function, which implements the steps above from scratch and returns a summary of the per-fold RMSE values.

CVFromScratch = function(formula, noOfFolds)
{

data <- mtcars

k = noOfFolds # number of folds

# sample from 1 to k, nrow times (the number of observations in the data)
data$id <- sample(1:k, nrow(data), replace = TRUE)
list <- 1:k


RMSECopy <- data.frame()


for (i in 1:k){
  # rows whose id is not i form the training set,
  # rows whose id is i form the test set
  # (the helper column id is dropped so it is never used as a predictor)
  trainingset <- subset(data, id %in% list[-i], select = -id)
  testset <- subset(data, id %in% c(i), select = -id)
  
  # fit a linear model on the training folds
  mymodel <- lm(formula, data = trainingset)
  
  # predict on the held-out fold
  predictions <- predict(mymodel, testset)
  
  # RMSE on the held-out fold
  SSE = sum( (testset$mpg - predictions)^2 )

  N = nrow(testset)

  MSE = SSE/N

  RMSE = sqrt(MSE)

  RMSECopy <- rbind(RMSECopy, as.data.frame(RMSE))
  

}

 summary(RMSECopy)
  
}

formula = mpg ~ wt+qsec

noOfFolds = 5

CVFromScratch(formula,noOfFolds)
##       RMSE      
##  Min.   :1.576  
##  1st Qu.:1.729  
##  Median :2.509  
##  Mean   :2.557  
##  3rd Qu.:3.153  
##  Max.   :3.820
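
As a cross-check on the hand-rolled routine, the boot package provides cv.glm, which performs K-fold cross validation for models fitted with glm. Its first delta component is the cross-validated MSE, so its square root is comparable to the RMSE summarised above (the exact number will differ because the folds are drawn at random):

library(boot)

# glm with the default gaussian family fits the same model as lm
glmfit <- glm(mpg ~ wt + qsec, data = mtcars)

cv_mse <- cv.glm(mtcars, glmfit, K = 5)$delta[1]   # cross-validated MSE
sqrt(cv_mse)                                       # cross-validated RMSE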

9.2 Cross Validation of a Model using all variables

formula = mpg ~ .

noOfFolds = 5

CVFromScratch(formula,noOfFolds)
##       RMSE       
##  Min.   : 1.963  
##  1st Qu.: 3.501  
##  Median : 3.686  
##  Mean   : 5.076  
##  3rd Qu.: 4.837  
##  Max.   :11.395

10 Penalized Regression Models

This approach involves fitting a model with all the predictors, but shrinking the coefficient estimates towards zero. This often improves accuracy on unseen data. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

In the previous models we tried to minimize the Mean Square Error. In Penalized Regression Models we minimize two components: the residual sum of squares and a penalty term on the size of the coefficients.

In this way the coefficients are penalized for being too large.

We minimize the following for the Ridge Regression

\[\sum_{i=1}^{n}\left({\hat{y_i} - y_i} \right)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2}\]

We minimize the following for the Lasso Regression

\[\sum_{i=1}^{n}\left({\hat{y_i} - y_i} \right)^{2} + \lambda\sum_{j=1}^{p}|\beta_j|\]

Here \(p\) is the number of predictors, \(\beta_j\) is the jth coefficient, and \(\lambda \ge 0\) is a tuning parameter that controls the strength of the penalty.
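
To make the two objectives concrete, they can be written directly as R functions. This is only an illustrative sketch, glmnet performs the actual minimisation, and as is conventional the intercept is excluded from the penalty:

# residual sum of squares plus the ridge penalty (sum of squared coefficients)
ridge_objective <- function(beta, intercept, X, y, lambda) {
  rss <- sum((y - (intercept + X %*% beta))^2)
  rss + lambda * sum(beta^2)
}

# residual sum of squares plus the lasso penalty (sum of absolute coefficients)
lasso_objective <- function(beta, intercept, X, y, lambda) {
  rss <- sum((y - (intercept + X %*% beta))^2)
  rss + lambda * sum(abs(beta))
}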

11 Ridge Regression

We perform Ridge Regression using the glmnet library, setting \(\alpha\) equal to zero and \(\lambda\) to a grid of values. In the first step we create a collection of values of \(\lambda\) ranging from 1000 down to 0.01, so the first values of \(\lambda\) are large while the values at the end are small.

lambdas <- 10^seq(3, -2, by = -.1)

lambdas
##  [1] 1.000000e+03 7.943282e+02 6.309573e+02 5.011872e+02 3.981072e+02
##  [6] 3.162278e+02 2.511886e+02 1.995262e+02 1.584893e+02 1.258925e+02
## [11] 1.000000e+02 7.943282e+01 6.309573e+01 5.011872e+01 3.981072e+01
## [16] 3.162278e+01 2.511886e+01 1.995262e+01 1.584893e+01 1.258925e+01
## [21] 1.000000e+01 7.943282e+00 6.309573e+00 5.011872e+00 3.981072e+00
## [26] 3.162278e+00 2.511886e+00 1.995262e+00 1.584893e+00 1.258925e+00
## [31] 1.000000e+00 7.943282e-01 6.309573e-01 5.011872e-01 3.981072e-01
## [36] 3.162278e-01 2.511886e-01 1.995262e-01 1.584893e-01 1.258925e-01
## [41] 1.000000e-01 7.943282e-02 6.309573e-02 5.011872e-02 3.981072e-02
## [46] 3.162278e-02 2.511886e-02 1.995262e-02 1.584893e-02 1.258925e-02
## [51] 1.000000e-02
y <- mtcars$mpg
x <- mtcars %>% select(-mpg) %>% data.matrix()

fit <- glmnet(x, y, alpha = 0, lambda = lambdas)

11.1 Small values of \(\lambda\)

We inspect the values of the coefficients for small values of \(\lambda\). The value at index 50 corresponds to the second smallest value of \(\lambda\).

fit$lambda[50]
## [1] 0.01258925
coef(fit)[,50]
## (Intercept)         cyl        disp          hp        drat          wt 
## 12.69449511 -0.10189973  0.01153088 -0.02044633  0.81830687 -3.52290525 
##        qsec          vs          am        gear        carb 
##  0.77719400  0.31834762  2.49705388  0.67141564 -0.26492009

11.2 Large values of \(\lambda\)

We inspect the values of the coefficients for large values of \(\lambda\). The value at index 1 corresponds to the largest value of \(\lambda\).

fit$lambda[1]
## [1] 1000
coef(fit)[,1]
##   (Intercept)           cyl          disp            hp          drat 
## 20.0145856575 -0.0164800506 -0.0002362546 -0.0003914850  0.0440169682 
##            wt          qsec            vs            am          gear 
## -0.0307273579  0.0080706480  0.0454252840  0.0416668844  0.0224881575 
##          carb 
## -0.0118344043

We observe the following (a coefficient path plot illustrating these points is sketched after the list):

  • Large values of \(\lambda\) produce small coefficient values

  • Small values of \(\lambda\) produce large coefficient values

  • We use cross validation to find the optimum value of \(\lambda\)
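
The shrinking of the coefficients as \(\lambda\) grows can be seen at a glance with glmnet's built-in coefficient path plot; a minimal sketch using the fit from above:

# each curve traces one coefficient across the lambda grid;
# moving towards larger lambda shrinks all of them towards zero
plot(fit, xvar = "lambda", label = TRUE)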

11.3 Cross Validation for \(\lambda\)

We use cross validation to find the optimum value of \(\lambda\). The cv.glmnet function performs the cross validation, and lambda.min is the value of \(\lambda\) that gives the minimum mean cross-validated error. Once we have the optimum value of \(\lambda\), we refit the model using it.

cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)

opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 3.162278
fit <- glmnet(x, y, alpha = 0, lambda = opt_lambda)

fit$lambda
## [1] 3.162278
coef(fit)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 21.019269740
## cyl         -0.375606668
## disp        -0.005353719
## hp          -0.011454746
## drat         1.056241727
## wt          -1.189217320
## qsec         0.159330517
## vs           0.793508955
## am           1.573837349
## gear         0.540210275
## carb        -0.527760976
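
Once the ridge model is refitted with the optimum \(\lambda\), it can be used to predict just as the linear models were; a minimal sketch computing the in-sample RMSE (not a cross-validated error):

# predictions from the ridge fit on the training data
ridge_pred <- predict(fit, newx = x, s = opt_lambda)

# in-sample RMSE of the ridge model
sqrt(mean((y - ridge_pred)^2))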

12 Lasso Regression

Lasso Regression is similar to Ridge Regression. A Ridge Regression model keeps all the predictors: some coefficients get close to zero but never exactly zero, so the less useful predictors are never completely eliminated. Lasso Regression, on the other hand, shrinks some of the coefficients exactly to zero, completely eliminating those predictors. Lasso Regression therefore helps in feature selection.

We do a Lasso Regression using the glmnet library setting \(\alpha\) equal to 1 and \(\lambda\) to a grid of numbers.

lambdas <- 10^seq(3, -2, by = -.1)

lambdas
##  [1] 1.000000e+03 7.943282e+02 6.309573e+02 5.011872e+02 3.981072e+02
##  [6] 3.162278e+02 2.511886e+02 1.995262e+02 1.584893e+02 1.258925e+02
## [11] 1.000000e+02 7.943282e+01 6.309573e+01 5.011872e+01 3.981072e+01
## [16] 3.162278e+01 2.511886e+01 1.995262e+01 1.584893e+01 1.258925e+01
## [21] 1.000000e+01 7.943282e+00 6.309573e+00 5.011872e+00 3.981072e+00
## [26] 3.162278e+00 2.511886e+00 1.995262e+00 1.584893e+00 1.258925e+00
## [31] 1.000000e+00 7.943282e-01 6.309573e-01 5.011872e-01 3.981072e-01
## [36] 3.162278e-01 2.511886e-01 1.995262e-01 1.584893e-01 1.258925e-01
## [41] 1.000000e-01 7.943282e-02 6.309573e-02 5.011872e-02 3.981072e-02
## [46] 3.162278e-02 2.511886e-02 1.995262e-02 1.584893e-02 1.258925e-02
## [51] 1.000000e-02
y <- mtcars$mpg
x <- mtcars %>% select(-mpg) %>% data.matrix()

fit <- glmnet(x, y, alpha = 1, lambda = lambdas)

12.1 Small values of \(\lambda\)

We inspect the values of the coefficients for small values of \(\lambda\).

fit$lambda[50]
## [1] 0.01258925
coef(fit)[,50]
##  (Intercept)          cyl         disp           hp         drat 
## 13.521893747 -0.096983096  0.008842666 -0.018453660  0.809605725 
##           wt         qsec           vs           am         gear 
## -3.350667774  0.740203469  0.243526586  2.450240877  0.621248872 
##         carb 
## -0.320297953

12.2 Large values of \(\lambda\)

We inspect the values of the coefficients for large values of \(\lambda\).

fit$lambda[1]
## [1] 1000
coef(fit)[,1]
## (Intercept)         cyl        disp          hp        drat          wt 
##    20.09062     0.00000     0.00000     0.00000     0.00000     0.00000 
##        qsec          vs          am        gear        carb 
##     0.00000     0.00000     0.00000     0.00000     0.00000

We observe the following:

  • Large values of \(\lambda\) produce small coefficient values

  • Small values of \(\lambda\) produce large coefficient values

  • We use cross validation to find the optimum value of \(\lambda\)

12.3 Cross Validation for \(\lambda\)

We use cross validation to find the optimum value of \(\lambda\). As before, lambda.min is the value of \(\lambda\) that gives the minimum mean cross-validated error.

cv_fit <- cv.glmnet(x, y, alpha = 1, lambda = lambdas)

opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 0.7943282
fit <- glmnet(x, y, alpha = 1, lambda = opt_lambda)  # alpha = 1 for the lasso

fit$lambda
## [1] 0.7943282
coef(fit)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 20.448240715
## cyl         -0.292879060
## disp        -0.003259148
## hp          -0.012623075
## drat         1.002638746
## wt          -1.688225692
## qsec         0.250586723
## vs           0.546670536
## am           2.002014145
## gear         0.600856653
## carb        -0.660350820
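
The practical payoff of the lasso is feature selection: at lambda.min some coefficients are exactly zero and the corresponding predictors drop out of the model. A minimal sketch that lists the surviving predictors from the cross-validated lasso fit cv_fit above:

# coefficients of the lasso at the cross-validated lambda.min
lasso_coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))

# keep only the predictors whose coefficient is not exactly zero
selected <- rownames(lasso_coefs)[lasso_coefs[, 1] != 0]
setdiff(selected, "(Intercept)")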

13 Conclusion

We have covered a lot of ground, starting from Simple Linear Regression and moving on to Multiple Linear Regression. We have learned the basics and usefulness of metrics and cross validation, and we explored the group of Penalized Regression methods, Ridge Regression and Lasso Regression.
The next step is to learn non-linear regression methods such as trees, bagging and boosting. Hope you had a lot of fun going through this tutorial and will continue with these next steps.

Thank you.