Nowadays,I am learning about algorithm from ISLR with application in R. previously I was using Simple Linear Regression to predict the sales using single predictors. In this blog, we will use multiple predictors to predict the response. We will use Auto dataset from the MASS package and predict mp(Mile per Gallon) using multiple predictors in Dataset.
library(ISLR)
library(MASS)
library(corrplot) # We'll use corrplot later on in this example too.
library(dplyr)
library(ggplot2)
library(visreg) # This library will allow us to show multivariate graphs.
library(rgl)
library(knitr)
library(scatterplot3d)
data("Auto")
attach(Auto)
head(Auto)
str(Auto)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : num 3504 3693 3436 3433 3449 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : num 70 70 70 70 70 70 70 70 70 70 ...
$ origin : num 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(Auto)
mpg cylinders displacement horsepower weight
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
acceleration year origin name
Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
(Other) :365
The Auto dataset has 392 observation with 9 columns. The columns are like mpg, cylinders, displacement, horsepower etc which are related to the cars. Each row is the data about the cars and name column can be used to identify. In multiple regression, we will use more than one predictors. Before applying the lm() to the dataset we need to drop name column which is a qualitative variable.
auto_df <- select(Auto,-name)
Lets look the dataset after excluding the Name column.
summary(auto_df)
mpg cylinders displacement horsepower weight
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
acceleration year origin
Min. : 8.00 Min. :70.00 Min. :1.000
1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
Median :15.50 Median :76.00 Median :1.000
Mean :15.54 Mean :75.98 Mean :1.577
3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
Max. :24.80 Max. :82.00 Max. :3.000
The auto_df dataframe contains 8 columns lets look the plot of the dataset.
# Plot matrix of all variables.
plot(auto_df, col="navy", main="Matrix Scatterplot")
The above matrix plot helps to see the relationship between two columns and pattern in the datasets. For example, horsepower and weight seem to have positive relationships. From the plot, mpg seems to have the same pattern with displacement, weight, and horsepower.
In this blog, we want to find the best model for predicting the mpg using different predictors in data. We will look upon residuals error, p-value for the performance of the model.
Multiple Linear regression uses multiple predictors. The equation for multiple linear regression looks like: \[Y=a+b*X1 + c*X2 \] where:
\(Y\) is Response or dependent variable \(a\) is intercept \(X1\) and \(X2\) are predictors or independent variable \(b\) and \(c\) are coefficeints for the b and c respectively
Before fitting the data to a model lets create a test and train set of the Data. Train set will be used to train or fit the model and test set will be used to check the performance of the model. We will split the data to train, test set using an 80:20 ratio.
require(caTools)
sample = sample.split(auto_df,SplitRatio = 0.80) # splits the data in the ratio mentioned in SplitRatio. After splitting marks these rows as logical TRUE and the the remaining are marked as logical FALSE
train1 =subset(auto_df,sample ==TRUE) # creates a training dataset named train1 with rows which are marked as TRUE
test1=subset(auto_df, sample==FALSE)
Lets look into the train and test data.
summary(train1)
mpg cylinders displacement horsepower weight
Min. : 9.0 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.0 1st Qu.:4.000 1st Qu.: 98.0 1st Qu.: 75.0 1st Qu.:2204
Median :23.0 Median :4.000 Median :140.5 Median : 92.0 Median :2782
Mean :23.7 Mean :5.466 Mean :194.2 Mean :104.2 Mean :2959
3rd Qu.:30.0 3rd Qu.:8.000 3rd Qu.:302.0 3rd Qu.:130.0 3rd Qu.:3600
Max. :46.6 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
acceleration year origin
Min. : 8.00 Min. :70.00 Min. :1.000
1st Qu.:13.62 1st Qu.:73.00 1st Qu.:1.000
Median :15.50 Median :76.00 Median :1.000
Mean :15.51 Mean :75.98 Mean :1.582
3rd Qu.:17.00 3rd Qu.:79.00 3rd Qu.:2.000
Max. :24.60 Max. :82.00 Max. :3.000
summary(test1)
mpg cylinders displacement horsepower weight
Min. :10.00 Min. :3.00 Min. : 71.0 Min. : 52.00 Min. :1825
1st Qu.:17.50 1st Qu.:4.00 1st Qu.:115.2 1st Qu.: 80.25 1st Qu.:2302
Median :22.00 Median :5.50 Median :159.5 Median : 95.00 Median :2852
Mean :22.69 Mean :5.49 Mean :195.2 Mean :105.15 Mean :3035
3rd Qu.:27.15 3rd Qu.:6.00 3rd Qu.:250.0 3rd Qu.:114.50 3rd Qu.:3641
Max. :44.60 Max. :8.00 Max. :440.0 Max. :215.00 Max. :4955
acceleration year origin
Min. : 8.50 Min. :70.00 Min. :1.000
1st Qu.:14.00 1st Qu.:73.00 1st Qu.:1.000
Median :15.50 Median :76.00 Median :1.000
Mean :15.64 Mean :75.99 Mean :1.561
3rd Qu.:17.50 3rd Qu.:79.00 3rd Qu.:2.000
Max. :24.80 Max. :82.00 Max. :3.000
Its time to fit the model and We will use Least Square method to fit our regression model.
fit1<- lm(mpg~., data = train1)
summary(fit1)
Call:
lm(formula = mpg ~ ., data = train1)
Residuals:
Min 1Q Median 3Q Max
-8.6093 -2.3503 -0.1876 1.9308 12.6934
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.739e+01 5.457e+00 -3.186 0.0016 **
cylinders -3.549e-01 3.948e-01 -0.899 0.3694
displacement 1.731e-02 8.991e-03 1.925 0.0552 .
horsepower -2.754e-02 1.612e-02 -1.709 0.0886 .
weight -6.272e-03 7.873e-04 -7.967 3.88e-14 ***
acceleration 4.153e-02 1.166e-01 0.356 0.7221
year 7.664e-01 6.072e-02 12.622 < 2e-16 ***
origin 1.397e+00 3.293e-01 4.244 2.97e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.441 on 286 degrees of freedom
Multiple R-squared: 0.8189, Adjusted R-squared: 0.8145
F-statistic: 184.8 on 7 and 286 DF, p-value: < 2.2e-16
We have the results of the model . From the model output we found that:
The matrix scatterplot above shows that there is the high correlation between displacement,weight, and horsepower. When there are two or more variables strongly correlated it is called collinearity. Let us validate by using correlation plot.
cor = cor(test1[1:8])
corrplot(cor, method = "number")
From the correlation, we can see that there is a strong relation between cylinders, displacement,horsepower, and weight. The mpg is also strongly correlated to cylinders, and weight. This situation is called collinearity.The problem of collinearity in the response is that it is difficult to find the individual effect on response. We should drop use only one of the collinear variables.
As the cylinders don’t show any significant relation in the first model with mpg we will exclude cylinders. Out of displacement, horsepower, and weight, the output of the first model shows that weight and mpg have a highly significant relation. We will use weight out of the other collinear variables in the next model.
fit2 <- lm(mpg~weight+year+origin,data = train1)
summary(fit2)
Call:
lm(formula = mpg ~ weight + year + origin, data = train1)
Residuals:
Min 1Q Median 3Q Max
-8.4916 -2.3379 -0.1141 1.7380 12.9231
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.948e+01 4.742e+00 -4.109 5.18e-05 ***
weight -6.202e-03 3.026e-04 -20.495 < 2e-16 ***
year 7.866e-01 5.756e-02 13.668 < 2e-16 ***
origin 1.105e+00 3.064e-01 3.607 0.000364 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.457 on 290 degrees of freedom
Multiple R-squared: 0.8148, Adjusted R-squared: 0.8128
F-statistic: 425.2 on 3 and 290 DF, p-value: < 2.2e-16
In this model, we find all predictors p-value is highly significant. After excluding the collinear variable the F- statistic improved from 184.8 to 425.2 which is a great improvement. But there is no improvement on RSE and adjusted R squared value. Let’s plot the residuals:
plot(fit2, which =1)
This plot shows some of the outliers lying far away from the middle of the graph.
We will try a different type of transformation on both predictors and the response value and see how the performance of model changes. In the next model, we will use the natural log to the mpg using log() and see the change in performance of the model.
fit3 <- lm(log(mpg)~weight+year+origin,data = train1)
summary(fit3)
Call:
lm(formula = log(mpg) ~ weight + year + origin, data = train1)
Residuals:
Min 1Q Median 3Q Max
-0.42966 -0.07392 0.00231 0.06847 0.37197
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.486e+00 1.687e-01 8.806 < 2e-16 ***
weight -2.961e-04 1.077e-05 -27.491 < 2e-16 ***
year 3.220e-02 2.048e-03 15.722 < 2e-16 ***
origin 3.170e-02 1.090e-02 2.907 0.00392 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.123 on 290 degrees of freedom
Multiple R-squared: 0.8734, Adjusted R-squared: 0.8721
F-statistic: 666.8 on 3 and 290 DF, p-value: < 2.2e-16
plot(fit3, which =1)
After some changes in the model, the performance of the model is increased. The Adjusted R-squared raised to 0.8721 and F-statistic is increased to 668.8. The p-value of the predictors is significant.
Let’s make a certain change in this model. We will use the natural log to the weight and mpg.
fit4 <- lm(log(mpg)~log(weight)+year+origin,data = auto_df)
summary(fit4)
Call:
lm(formula = log(mpg) ~ log(weight) + year + origin, data = auto_df)
Residuals:
Min 1Q Median 3Q Max
-0.40429 -0.06527 0.00074 0.06386 0.39896
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.776372 0.283133 27.465 <2e-16 ***
log(weight) -0.904177 0.027317 -33.099 <2e-16 ***
year 0.032789 0.001682 19.493 <2e-16 ***
origin 0.017207 0.009289 1.852 0.0647 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1174 on 388 degrees of freedom
Multiple R-squared: 0.8818, Adjusted R-squared: 0.8809
F-statistic: 964.7 on 3 and 388 DF, p-value: < 2.2e-16
plot(fit4,which = 1)
In this model, the F-statistics and Adjusted R-squared are improved. Most of the predictor’s p-value is significant.The origin predictor seems to have weak p-value which means that origin doesn’t have strong relationships with the mpg while the presence of other predictors. So, we will exclude the origin and make a new model.
fit5 <- lm(log(mpg)~log(weight)+year,data = auto_df)
summary(fit5)
Call:
lm(formula = log(mpg) ~ log(weight) + year, data = auto_df)
Residuals:
Min 1Q Median 3Q Max
-0.40595 -0.06360 0.00169 0.06379 0.39270
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.039451 0.245703 32.72 <2e-16 ***
log(weight) -0.934085 0.022104 -42.26 <2e-16 ***
year 0.032817 0.001687 19.45 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1177 on 389 degrees of freedom
Multiple R-squared: 0.8807, Adjusted R-squared: 0.8801
F-statistic: 1436 on 2 and 389 DF, p-value: < 2.2e-16
plot(fit5,which = 1)
The output of this model shows that the F-statistics is increased to 1436 and the Adjusted R-squared is also increased.The predictor has highly significant p values. This model is better than previous models.
The final regression equation for our model is:
\[ log(mpg) = 8.0395 + log(weight)* -.9341 + year * .0328 \]
We will use the test data to check the accuracy of our final model.
predictions<-predict(fit5, test1)
mse <- mean((test1$mpg - predictions)^2)
print(mse)
[1] 435.6201
sigma(fit5)/mean(test1$mpg)
[1] 0.005167033
test1$predicted<- predict(fit5,test1)
actuals_preds <- data.frame(test1$mpg,test1$predicted)
names(actuals_preds)<- c("mpg","predicted")
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy
mpg predicted
mpg 1.0000000 0.9228438
predicted 0.9228438 1.0000000
The correlation of the models shows that the accuracy is around 92 %.Lets check the true mpg and predicted mpg for the test data.
head(actuals_preds)
Wow! what happened?We used log to the response value in our model which predicted the response with a log. If we use log to the original mpg we can relate this result.
actuals_preds$log_mpg <- log(actuals_preds$mpg)
head(actuals_preds)
In this blog, we used Multiple Linear Regression methods to find the mpg of the car. We had 8 columns out of which 4 were collinear to each other. we got the best model with only two predictors. \[ log(mpg) = 8.0395 + log(weight)* -.9341 + year * .0328 \]