In this project I am investigating the impact of a number of automobile engine factors on the vehicle’s mpg. The dataset auto-mpg.data contains infor- mation for 398 defferent automobile models. Information regarding the number of cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name as well as mpg are contained in the file.The files can be downloaded from the UCI Machine Learning Repository

  1. Using the ???rst 300 samples in the auto-mpg.data,we will run a simple linear regression to determine the relationship between mpg and a single, appropriate dependent variable of our choice. we will report all appropriate information regarding our regression and include explanatory graphs. In addition, we will include plots for the following:
  1. residuals vs. the predictor variable
  2. absolute value of the residuals vs. the predictor variable
  3. histogram of the residuals We are going to examine the above plots and discuss how we might interpret these plots to determine the appropriateness of our linear model. For the remaining 98 samples in the dataset, we are going to use our linear model to predict each automobile’s mpg and report how our predictions compare to the car’s actual reported mgp.
  1. Now we repeat the all process but at this time using more than one independent variable. We Try to construct a methodology for determining which independent factors should be included to obtain the “best” linear model for mpg.
setwd("~/myRprojects")
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(latticeExtra)# For graph
## Loading required package: lattice
## Loading required package: RColorBrewer
library(psych) # for pairs.panels()
library(caret) # for svm 
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## The following object is masked from 'package:latticeExtra':
## 
##     layer
library(magrittr)# for the pipe
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(leaps) # for step wise regression
library(gvlma)# for checking the assumtions
library(knitr)# for table formating
library(glmnet) # for LASSO 
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: foreach
## Loaded glmnet 2.0-16
auto <- read.table("auto-mpg.data")
names(auto) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model_year","origin","car_name")
str(auto)
## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100.0","102.0",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car_name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
auto$horsepower <- as.numeric(auto$horsepower)
auto$horsepower[auto$horsepower=="?"] <- NA
auto$cylinders <- as.numeric(auto$cylinders)
kable(head(auto), format = "pandoc",caption = "Data Preview")
Data Preview
mpg cylinders displacement horsepower weight acceleration model_year origin car_name
18 8 307 17 3504 12.0 70 1 chevrolet chevelle malibu
15 8 350 35 3693 11.5 70 1 buick skylark 320
18 8 318 29 3436 11.0 70 1 plymouth satellite
16 8 304 29 3433 12.0 70 1 amc rebel sst
17 8 302 24 3449 10.5 70 1 ford torino
15 8 429 42 4341 10.0 70 1 ford galaxie 500
auto_num <- select(auto,mpg,displacement,horsepower,weight,acceleration)
auto_num <- na.omit(auto_num)
kable(summary(auto_num),format="pandoc", caption = "Title of the table")
Title of the table
mpg displacement horsepower weight acceleration
Min. : 9.00 Min. : 68.0 Min. : 1.00 Min. :1613 Min. : 8.00
1st Qu.:17.50 1st Qu.:104.2 1st Qu.:26.00 1st Qu.:2224 1st Qu.:13.82
Median :23.00 Median :148.5 Median :60.50 Median :2804 Median :15.50
Mean :23.51 Mean :193.4 Mean :51.39 Mean :2970 Mean :15.57
3rd Qu.:29.00 3rd Qu.:262.0 3rd Qu.:79.00 3rd Qu.:3608 3rd Qu.:17.18
Max. :46.60 Max. :455.0 Max. :94.00 Max. :5140 Max. :24.80

This table does not show any major problems. The gap between the means and the median are not significant. This table is telling us that any data point greater than 1.5 times IQR might be an outlier and may cause an issue for our linear regression model.Please note that even though there outliers in our data, we choose to ignore them for now.

pairs.panels(auto_num,method = "pearson",hist.col ="#00AFBB" ,density = TRUE,ellipses = TRUE)

This output show three different things which are the correlation between variable, the scartter plot that tell us how the variables are associated and, the histograms the inform us how skewed our data are. We noticed that displacement and weight are strongly correlated and negatively correlated with the MPG. There is a multicollinearity among the independent variables .This scatter plot denote that the relationship that could be linear is the one between weight and displacement all the other are clearly not linear.Based on the above plots of the independent variables vs mpg, our initial assessment is to consider weight and displacement as possible candidate for our linear regression model.Let’s build our best model to find out.

par(mfrow=c(2,2))
for (i in names(auto_num)) {
  boxplot(auto_num[,i], names = "names(auto_num[,i])")
}

#Using the rest 300 samples in the auto-mpg.data, run a simple linear regres-
#sion to determine the relationship between mpg and a single
auto.train <- auto_num[1:300,]
auto.test <- auto_num[301:398,]

DISPLACEMENT MODEL

#performing the regression accordingly to the above partition
model.dis <- lm(mpg~displacement, data=auto.train)
summary(model.dis)
## 
## Call:
## lm(formula = mpg ~ displacement, data = auto.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8432 -1.9571 -0.4975  1.9047 16.2304 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.250836   0.429861   72.70   <2e-16 ***
## displacement -0.048680   0.001786  -27.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.368 on 298 degrees of freedom
## Multiple R-squared:  0.7138, Adjusted R-squared:  0.7129 
## F-statistic: 743.3 on 1 and 298 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
#plot the variable
plot(auto.train$mpg~auto.train$displacement,main="mpg vs displacement",xlab="displacement",ylab = "mpg")
abline(model.dis,col="red")

#residuals vs. the predictor variable
residual <- model.dis$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")


#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs Abs_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")

#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF DISPLACEMENT RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)

plot(model.dis)

#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
# Make predictions and compute the R2, RMSE and MAE
predi_dis <- model.dis %>% predict(auto.test)  
data.frame( R2 = R2(predi_dis, auto.test$mpg),
            RMSE = RMSE(predi_dis, auto.test$mpg),
            MAE = MAE(predi_dis, auto.test$mpg))
##          R2     RMSE      MAE
## 1 0.3853939 8.385433 7.056869

In this output all the estimated values are statistically significant with a p-value equal to 2.2e-16 . It is shown that the plot of MPG vs the Displacement is not linear and there is a sort of relationship between the variable and the residual. This model is definitely not a good one. The diagnostic plot reveal that these following data point 112,245,248 are outliers. The R square state that only 38% of displacement explain MPG.

HORSEPOWER MODEL

model.hors <- lm(mpg~horsepower, data=auto.train)
summary(model.hors)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = auto.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0794  -4.5344  -0.5125   3.4621  22.0299 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.01715    0.60206   26.60   <2e-16 ***
## horsepower   0.09908    0.01057    9.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.533 on 298 degrees of freedom
## Multiple R-squared:  0.2276, Adjusted R-squared:  0.225 
## F-statistic: 87.81 on 1 and 298 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
#plot the variable
plot(auto.train$mpg~auto.train$horsepower,main="mpg vs horsepower",xlab="horsepower",ylab = "mpg")
abline(model.hors,col="red")

#residuals vs. the predictor variable
residual <- model.hors$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs hors_residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")

#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs hors.Abs_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")

#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF HORSEPOWER RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)

plot(model.hors)

#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
# Make predictions and compute the R2, RMSE and MAE
predi_hors <- model.hors %>% predict(auto.test)  
data.frame( R2 = R2(predi_hors, auto.test$mpg),
            RMSE = RMSE(predi_hors, auto.test$mpg),
            MAE = MAE(predi_hors, auto.test$mpg))
##           R2     RMSE     MAE
## 1 0.01167267 11.59034 10.1305
predictions_error <- RMSE(predi_hors, auto.test$mpg)/mean(auto.test$mpg)
compare_hors <- as.data.frame(cbind(auto.test$mpg,predi_hors),row=FALSE)
names(compare_hors) <- c("observ","predi_dis")
kable(head(compare_hors),format="pandoc", caption = "Comparaison")
Comparaison
observ predi_dis
23.9 24.53788
34.2 22.65540
34.5 22.65540
31.8 22.16000
37.3 22.55632
28.4 24.53788

The R square adjusted is 0.225 . this state cleary that the model is not good since only 22.5% of horsepower explain mpg. Despite that we noticed that significant with a p-value: < 2.2e-16.The relationship between the mpg and the horsepower is not linear.

ACCELERATION MODEL

model.acce <- lm(mpg~acceleration, data=auto.train)
summary(model.acce)
## 
## Call:
## lm(formula = mpg ~ acceleration, data = auto.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2441  -4.1160  -0.9237   3.0894  16.2186 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.5588     1.8243   2.499    0.013 *  
## acceleration   1.0641     0.1177   9.044   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.577 on 298 degrees of freedom
## Multiple R-squared:  0.2154, Adjusted R-squared:  0.2127 
## F-statistic:  81.8 on 1 and 298 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
#plot the variable
plot(auto.train$mpg~auto.train$acceleration,main="mpg vs acceleration",xlab="acceleration",ylab = "mpg")
abline(model.acce,col="red")

#residuals vs. the predictor variable
residual <- model.acce$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")


#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs Abs_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")

#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF ACCELERATION RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)

plot(model.acce)

#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
# Make predictions and compute the R2, RMSE and MAE
predi_acce <- model.acce %>% predict(auto.test)  
data.frame( R2 = R2(predi_acce, auto.test$mpg),
            RMSE = RMSE(predi_acce, auto.test$mpg),
            MAE = MAE(predi_acce, auto.test$mpg))
##           R2     RMSE      MAE
## 1 0.01652835 11.52768 10.15283
predictions_error <- RMSE(predi_acce, auto.test$mpg)/mean(auto.test$mpg)
compare_acce <- as.data.frame(cbind(auto.test$mpg,predi_acce),row=FALSE)
names(compare_acce) <- c("observ","predi_dis")
kable(head(compare_acce),format="pandoc", caption = "Comparaison")
Comparaison
observ predi_dis
23.9 28.18118
34.2 18.60453
34.5 20.41345
31.8 24.98896
37.3 20.20064
28.4 21.58393

In this output all the estimated values are statistically significant with a p-value equal to 2.2e-16 . It is shown that there is no strong significant relation between this two variables.The plot residual vs the acceleration looks pretty good.(we will comment these kind of plot later)

WEIGHT MODEL

model.wei <- lm(mpg~weight, data=auto.train)
summary(model.wei)
## 
## Call:
## lm(formula = mpg ~ weight, data = auto.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1077 -1.8842 -0.0333  1.7275 15.1232 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.3879027  0.6368804   63.41   <2e-16 ***
## weight      -0.0062524  0.0001957  -31.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.992 on 298 degrees of freedom
## Multiple R-squared:  0.7741, Adjusted R-squared:  0.7733 
## F-statistic:  1021 on 1 and 298 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
#plot the variable
plot(auto.train$mpg~auto.train$acceleration,main="mpg vs weight",xlab="acceleration",ylab = "mpg")
abline(model.wei,col="red")

#residuals vs. the predictor variable
residual <- model.wei$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs wei.residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")


#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs wei.Abs_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")

#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF WEIGHT RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)

plot(model.wei)

#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
# Make predictions and compute the R2, RMSE and MAE
predi_wei <- model.wei %>% predict(auto.test)  
data.frame( R2 = R2(predi_wei, auto.test$mpg),
            RMSE = RMSE(predi_wei, auto.test$mpg),
            MAE = MAE(predi_wei, auto.test$mpg))
##          R2     RMSE      MAE
## 1 0.5337296 8.164671 6.993124
predictions_error <- RMSE(predi_wei, auto.test$mpg)/mean(auto.test$mpg)
compare_wei <- as.data.frame(cbind(auto.test$mpg,predi_wei),row=FALSE)
names(compare_wei) <- c("observ","predi_dis")
kable(head(compare_wei),format="pandoc", caption = "Comparaison")
Comparaison
observ predi_dis
23.9 19.00455
34.2 26.63253
34.5 26.94515
31.8 27.75797
37.3 27.07020
28.4 23.69388

A close look at the regression’s output show that we have a good model.We have P-value=2.2e-16 < 5%Thus coefficient of the model is statistically significant in explaining the mpg .This model has the best R squared(0.7733) compare to the others, we conclude that it is the best model to pick.This Rsquared is telling us that 77.33% of the mpg is explained by the weight. At this point to choose the best regression we compare the R squared adjusted (0.7733>0.7129>0.2127) and we picked the one that has the higher R squared adjusted.

MULTIPLE REGRESSION

Features Selection

Stepwise regression

# To find out which independent variable to use in our multiple regression we are going 
#to use the step wise regression
null=lm(mpg~1,data=auto.train)
full=lm(mpg~.,data=auto.train)
step(null,scope=list(upper=full),data=auto.train,direction="both")
## Start:  AIC=1103.93
## mpg ~ 1
## 
##                Df Sum of Sq     RSS     AIC
## + weight        1    9143.9  2668.4  659.63
## + displacement  1    8431.8  3380.5  730.59
## + horsepower    1    2688.4  9123.9 1028.46
## + acceleration  1    2544.1  9268.2 1033.17
## <none>                      11812.3 1103.93
## 
## Step:  AIC=659.63
## mpg ~ weight
## 
##                Df Sum of Sq     RSS     AIC
## + displacement  1      61.8  2606.6  654.61
## + acceleration  1      37.1  2631.3  657.43
## <none>                       2668.4  659.63
## + horsepower    1      11.0  2657.4  660.39
## - weight        1    9143.9 11812.3 1103.93
## 
## Step:  AIC=654.61
## mpg ~ weight + displacement
## 
##                Df Sum of Sq    RSS    AIC
## <none>                      2606.6 654.61
## + horsepower    1      7.49 2599.1 655.74
## + acceleration  1      6.03 2600.6 655.91
## - displacement  1     61.78 2668.4 659.63
## - weight        1    773.85 3380.5 730.59
## 
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train)
## 
## Coefficients:
##  (Intercept)        weight  displacement  
##    38.746685     -0.004951     -0.011343

As shown in the R results above the step function produced weight and displacement as the optimum variables to produce the desired linear regression model . This step function use AIC as criterion. It selects the combination of variables with which we can have the lower AIC statistic.Thus our final model.

Building the model

#our final model is 
final_model <- lm(mpg ~ weight + displacement, data = auto.train)
print(final_model)
## 
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train)
## 
## Coefficients:
##  (Intercept)        weight  displacement  
##    38.746685     -0.004951     -0.011343

Diagnostics

par(mfrow=c(2,2))
summary(final_model)
## 
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4362 -1.8464 -0.1621  1.6470 15.2024 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.7466846  0.8832868  43.866   <2e-16 ***
## weight       -0.0049513  0.0005273  -9.390   <2e-16 ***
## displacement -0.0113429  0.0042751  -2.653   0.0084 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.963 on 297 degrees of freedom
## Multiple R-squared:  0.7793, Adjusted R-squared:  0.7778 
## F-statistic: 524.5 on 2 and 297 DF,  p-value: < 2.2e-16
plot(final_model)

#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
#residuals vs. the predictor variable
residual <- final_model$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")

#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs final_model_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")

#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF WEIGHT RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)

From the model summary,the model p value and predictor’s p value are less than the significance level (5%).We can state that our model is statistically significant.The R-Squared and the Ajusted R-Squared are respictively 0.7793 and 0.7778 which mean that more than 70% of the model are explained by the independend variables. Furthermore looking at the residual plot and the Absolut residual plot above, we see the residuals, even though fairly normal are clustering around 6.9 which is some indication that the could be improved. From the abs residual Plot , we see clear that the predicted residuals are definitely not clustering around zero. We decided to plot test weight against test mpg and found that the linear regression line through this data confirm our suspicion that the linear regression model is not truly representing the data. From the plot (residual vs fitted), the red line is pretty flat, with no increasing or decreasing trend.We suspect a homocedasdicity phenomen that we are going to verify.The Normal Q-Q plot indicat that the data are normaly distributed.These plot are showing some outliers that we decided to keep in the previous model.

gvlma(final_model)
## 
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train)
## 
## Coefficients:
##  (Intercept)        weight  displacement  
##    38.746685     -0.004951     -0.011343  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = final_model) 
## 
##                     Value   p-value                   Decision
## Global Stat        183.26 0.000e+00 Assumptions NOT satisfied!
## Skewness            21.39 3.753e-06 Assumptions NOT satisfied!
## Kurtosis            95.33 0.000e+00 Assumptions NOT satisfied!
## Link Function       53.58 2.485e-13 Assumptions NOT satisfied!
## Heteroscedasticity  12.97 3.168e-04 Assumptions NOT satisfied!

Some assumptions are not satisfied. This could be due to the fact that we have some outliers in the data can impact the quality of the model.Let’s remove them from the data then see what happen.

par(mfrow=c(2,2))
remove_outlier <- lm(mpg ~ weight + displacement, data = auto.train[-c(112,245,248),])
summary(remove_outlier)
## 
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train[-c(112, 
##     245, 248), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3148 -1.8142 -0.0992  1.5608  7.7300 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.3192534  0.8082664  47.409  < 2e-16 ***
## weight       -0.0047936  0.0004809  -9.968  < 2e-16 ***
## displacement -0.0119332  0.0038996  -3.060  0.00242 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.7 on 294 degrees of freedom
## Multiple R-squared:  0.8044, Adjusted R-squared:  0.8031 
## F-statistic: 604.7 on 2 and 294 DF,  p-value: < 2.2e-16
plot(remove_outlier)

gvlma(remove_outlier)
## 
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train[-c(112, 
##     245, 248), ])
## 
## Coefficients:
##  (Intercept)        weight  displacement  
##    38.319253     -0.004794     -0.011933  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = remove_outlier) 
## 
##                      Value   p-value                   Decision
## Global Stat        69.0869 3.542e-14 Assumptions NOT satisfied!
## Skewness            2.3963 1.216e-01    Assumptions acceptable.
## Kurtosis            0.7437 3.885e-01    Assumptions acceptable.
## Link Function      58.7423 1.799e-14 Assumptions NOT satisfied!
## Heteroscedasticity  7.2046 7.272e-03 Assumptions NOT satisfied!

We notice that even if we remove the outlier the linear model assumptions are not satisfied.Nevertheless removing the outliers improve the model with a Ajusted R-squared varing from 77% to 80%. We suspect that a non linear model would perfom better on this car data.

Prediction

# Make predictions and compute the R2, RMSE and MAE
predi_final <- final_model %>% predict(auto.test)  
data.frame( R2 = R2(predi_final, auto.test$mpg),
            RMSE = RMSE(predi_final, auto.test$mpg),
            MAE = MAE(predi_final, auto.test$mpg))
##          R2     RMSE      MAE
## 1 0.5288592 8.092242 6.886737
predictions_error <- RMSE(predi_final, auto.test$mpg)/mean(auto.test$mpg)
compare_final <- as.data.frame(cbind(auto.test$mpg,predi_final),row=FALSE)
names(compare_final) <- c("observ","predi_dis")
kable(head(compare_final),format="pandoc", caption = "Comparaison")
Comparaison
observ predi_dis
23.9 18.86421
34.2 26.66290
34.5 26.91046
31.8 27.78099
37.3 27.16829
28.4 23.81403
cor(compare_final)
##             observ predi_dis
## observ    1.000000  0.727227
## predi_dis 0.727227  1.000000

The correlation betwenn The actual values and the predicted values is 72.72%

LASSO REGRESSION

x <- model.matrix(mpg~.,auto_num)[,-1]
y <- auto_num$mpg
lambda <- 10^seq(10,-2,length=100)
set.seed(123)
train.l <- sample(1:nrow(x),nrow(x)/2)
test.l <- (-train.l)
y.test.l <- y[test.l]

lasso.mpg <-glmnet(x[train.l,],y[train.l],alpha = 1,lambda = lambda) 
plot(lasso.mpg)

cv.out <- cv.glmnet(x[train.l,],y[train.l],alpha=1)
plot(cv.out)

bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mpg,s=bestlam,newx = x[test.l,])
mean((lasso.pred-y.test.l)^2)
## [1] 19.38963
out <- glmnet(x,y,alpha=1,lambda = lambda)
lasso.coef <- predict(out,type = "coefficients",s=bestlam)[1:4,]
lasso.coef
##  (Intercept) displacement   horsepower       weight 
##  40.96547040  -0.01088795   0.00425484  -0.00610106
lasso.coef[lasso.coef!=0]
##  (Intercept) displacement   horsepower       weight 
##  40.96547040  -0.01088795   0.00425484  -0.00610106