In this project I am investigating the impact of a number of automobile engine factors on the vehicle’s mpg. The dataset auto-mpg.data contains infor- mation for 398 defferent automobile models. Information regarding the number of cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name as well as mpg are contained in the file.The files can be downloaded from the UCI Machine Learning Repository
setwd("~/myRprojects")
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(latticeExtra)# For graph
## Loading required package: lattice
## Loading required package: RColorBrewer
library(psych) # for pairs.panels()
library(caret) # for svm
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## The following object is masked from 'package:latticeExtra':
##
## layer
library(magrittr)# for the pipe
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
library(leaps) # for step wise regression
library(gvlma)# for checking the assumtions
library(knitr)# for table formating
library(glmnet) # for LASSO
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
## Loading required package: foreach
## Loaded glmnet 2.0-16
auto <- read.table("auto-mpg.data")
names(auto) <- c("mpg","cylinders","displacement","horsepower","weight","acceleration","model_year","origin","car_name")
str(auto)
## 'data.frame': 398 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : Factor w/ 94 levels "?","100.0","102.0",..: 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model_year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ car_name : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
auto$horsepower <- as.numeric(auto$horsepower)
auto$horsepower[auto$horsepower=="?"] <- NA
auto$cylinders <- as.numeric(auto$cylinders)
kable(head(auto), format = "pandoc",caption = "Data Preview")
| mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|
| 18 | 8 | 307 | 17 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 15 | 8 | 350 | 35 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 18 | 8 | 318 | 29 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 16 | 8 | 304 | 29 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 17 | 8 | 302 | 24 | 3449 | 10.5 | 70 | 1 | ford torino |
| 15 | 8 | 429 | 42 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
auto_num <- select(auto,mpg,displacement,horsepower,weight,acceleration)
auto_num <- na.omit(auto_num)
kable(summary(auto_num),format="pandoc", caption = "Title of the table")
| mpg | displacement | horsepower | weight | acceleration | |
|---|---|---|---|---|---|
| Min. : 9.00 | Min. : 68.0 | Min. : 1.00 | Min. :1613 | Min. : 8.00 | |
| 1st Qu.:17.50 | 1st Qu.:104.2 | 1st Qu.:26.00 | 1st Qu.:2224 | 1st Qu.:13.82 | |
| Median :23.00 | Median :148.5 | Median :60.50 | Median :2804 | Median :15.50 | |
| Mean :23.51 | Mean :193.4 | Mean :51.39 | Mean :2970 | Mean :15.57 | |
| 3rd Qu.:29.00 | 3rd Qu.:262.0 | 3rd Qu.:79.00 | 3rd Qu.:3608 | 3rd Qu.:17.18 | |
| Max. :46.60 | Max. :455.0 | Max. :94.00 | Max. :5140 | Max. :24.80 |
This table does not show any major problems. The gap between the means and the median are not significant. This table is telling us that any data point greater than 1.5 times IQR might be an outlier and may cause an issue for our linear regression model.Please note that even though there outliers in our data, we choose to ignore them for now.
pairs.panels(auto_num,method = "pearson",hist.col ="#00AFBB" ,density = TRUE,ellipses = TRUE)
This output show three different things which are the correlation between variable, the scartter plot that tell us how the variables are associated and, the histograms the inform us how skewed our data are. We noticed that displacement and weight are strongly correlated and negatively correlated with the MPG. There is a multicollinearity among the independent variables .This scatter plot denote that the relationship that could be linear is the one between weight and displacement all the other are clearly not linear.Based on the above plots of the independent variables vs mpg, our initial assessment is to consider weight and displacement as possible candidate for our linear regression model.Let’s build our best model to find out.
par(mfrow=c(2,2))
for (i in names(auto_num)) {
boxplot(auto_num[,i], names = "names(auto_num[,i])")
}
#Using the rest 300 samples in the auto-mpg.data, run a simple linear regres-
#sion to determine the relationship between mpg and a single
auto.train <- auto_num[1:300,]
auto.test <- auto_num[301:398,]
#performing the regression accordingly to the above partition
model.dis <- lm(mpg~displacement, data=auto.train)
summary(model.dis)
##
## Call:
## lm(formula = mpg ~ displacement, data = auto.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8432 -1.9571 -0.4975 1.9047 16.2304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.250836 0.429861 72.70 <2e-16 ***
## displacement -0.048680 0.001786 -27.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.368 on 298 degrees of freedom
## Multiple R-squared: 0.7138, Adjusted R-squared: 0.7129
## F-statistic: 743.3 on 1 and 298 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plot the variable
plot(auto.train$mpg~auto.train$displacement,main="mpg vs displacement",xlab="displacement",ylab = "mpg")
abline(model.dis,col="red")
#residuals vs. the predictor variable
residual <- model.dis$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs Abs_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF DISPLACEMENT RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)
plot(model.dis)
#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
# Make predictions and compute the R2, RMSE and MAE
predi_dis <- model.dis %>% predict(auto.test)
data.frame( R2 = R2(predi_dis, auto.test$mpg),
RMSE = RMSE(predi_dis, auto.test$mpg),
MAE = MAE(predi_dis, auto.test$mpg))
## R2 RMSE MAE
## 1 0.3853939 8.385433 7.056869
In this output all the estimated values are statistically significant with a p-value equal to 2.2e-16 . It is shown that the plot of MPG vs the Displacement is not linear and there is a sort of relationship between the variable and the residual. This model is definitely not a good one. The diagnostic plot reveal that these following data point 112,245,248 are outliers. The R square state that only 38% of displacement explain MPG.
model.hors <- lm(mpg~horsepower, data=auto.train)
summary(model.hors)
##
## Call:
## lm(formula = mpg ~ horsepower, data = auto.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0794 -4.5344 -0.5125 3.4621 22.0299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.01715 0.60206 26.60 <2e-16 ***
## horsepower 0.09908 0.01057 9.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.533 on 298 degrees of freedom
## Multiple R-squared: 0.2276, Adjusted R-squared: 0.225
## F-statistic: 87.81 on 1 and 298 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plot the variable
plot(auto.train$mpg~auto.train$horsepower,main="mpg vs horsepower",xlab="horsepower",ylab = "mpg")
abline(model.hors,col="red")
#residuals vs. the predictor variable
residual <- model.hors$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs hors_residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs hors.Abs_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF HORSEPOWER RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)
plot(model.hors)
#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
# Make predictions and compute the R2, RMSE and MAE
predi_hors <- model.hors %>% predict(auto.test)
data.frame( R2 = R2(predi_hors, auto.test$mpg),
RMSE = RMSE(predi_hors, auto.test$mpg),
MAE = MAE(predi_hors, auto.test$mpg))
## R2 RMSE MAE
## 1 0.01167267 11.59034 10.1305
predictions_error <- RMSE(predi_hors, auto.test$mpg)/mean(auto.test$mpg)
compare_hors <- as.data.frame(cbind(auto.test$mpg,predi_hors),row=FALSE)
names(compare_hors) <- c("observ","predi_dis")
kable(head(compare_hors),format="pandoc", caption = "Comparaison")
| observ | predi_dis |
|---|---|
| 23.9 | 24.53788 |
| 34.2 | 22.65540 |
| 34.5 | 22.65540 |
| 31.8 | 22.16000 |
| 37.3 | 22.55632 |
| 28.4 | 24.53788 |
The R square adjusted is 0.225 . this state cleary that the model is not good since only 22.5% of horsepower explain mpg. Despite that we noticed that significant with a p-value: < 2.2e-16.The relationship between the mpg and the horsepower is not linear.
model.acce <- lm(mpg~acceleration, data=auto.train)
summary(model.acce)
##
## Call:
## lm(formula = mpg ~ acceleration, data = auto.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2441 -4.1160 -0.9237 3.0894 16.2186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5588 1.8243 2.499 0.013 *
## acceleration 1.0641 0.1177 9.044 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.577 on 298 degrees of freedom
## Multiple R-squared: 0.2154, Adjusted R-squared: 0.2127
## F-statistic: 81.8 on 1 and 298 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plot the variable
plot(auto.train$mpg~auto.train$acceleration,main="mpg vs acceleration",xlab="acceleration",ylab = "mpg")
abline(model.acce,col="red")
#residuals vs. the predictor variable
residual <- model.acce$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs Abs_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF ACCELERATION RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)
plot(model.acce)
#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
# Make predictions and compute the R2, RMSE and MAE
predi_acce <- model.acce %>% predict(auto.test)
data.frame( R2 = R2(predi_acce, auto.test$mpg),
RMSE = RMSE(predi_acce, auto.test$mpg),
MAE = MAE(predi_acce, auto.test$mpg))
## R2 RMSE MAE
## 1 0.01652835 11.52768 10.15283
predictions_error <- RMSE(predi_acce, auto.test$mpg)/mean(auto.test$mpg)
compare_acce <- as.data.frame(cbind(auto.test$mpg,predi_acce),row=FALSE)
names(compare_acce) <- c("observ","predi_dis")
kable(head(compare_acce),format="pandoc", caption = "Comparaison")
| observ | predi_dis |
|---|---|
| 23.9 | 28.18118 |
| 34.2 | 18.60453 |
| 34.5 | 20.41345 |
| 31.8 | 24.98896 |
| 37.3 | 20.20064 |
| 28.4 | 21.58393 |
In this output all the estimated values are statistically significant with a p-value equal to 2.2e-16 . It is shown that there is no strong significant relation between this two variables.The plot residual vs the acceleration looks pretty good.(we will comment these kind of plot later)
model.wei <- lm(mpg~weight, data=auto.train)
summary(model.wei)
##
## Call:
## lm(formula = mpg ~ weight, data = auto.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1077 -1.8842 -0.0333 1.7275 15.1232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.3879027 0.6368804 63.41 <2e-16 ***
## weight -0.0062524 0.0001957 -31.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.992 on 298 degrees of freedom
## Multiple R-squared: 0.7741, Adjusted R-squared: 0.7733
## F-statistic: 1021 on 1 and 298 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plot the variable
plot(auto.train$mpg~auto.train$acceleration,main="mpg vs weight",xlab="acceleration",ylab = "mpg")
abline(model.wei,col="red")
#residuals vs. the predictor variable
residual <- model.wei$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs wei.residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs wei.Abs_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF WEIGHT RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)
plot(model.wei)
#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
# Make predictions and compute the R2, RMSE and MAE
predi_wei <- model.wei %>% predict(auto.test)
data.frame( R2 = R2(predi_wei, auto.test$mpg),
RMSE = RMSE(predi_wei, auto.test$mpg),
MAE = MAE(predi_wei, auto.test$mpg))
## R2 RMSE MAE
## 1 0.5337296 8.164671 6.993124
predictions_error <- RMSE(predi_wei, auto.test$mpg)/mean(auto.test$mpg)
compare_wei <- as.data.frame(cbind(auto.test$mpg,predi_wei),row=FALSE)
names(compare_wei) <- c("observ","predi_dis")
kable(head(compare_wei),format="pandoc", caption = "Comparaison")
| observ | predi_dis |
|---|---|
| 23.9 | 19.00455 |
| 34.2 | 26.63253 |
| 34.5 | 26.94515 |
| 31.8 | 27.75797 |
| 37.3 | 27.07020 |
| 28.4 | 23.69388 |
A close look at the regression’s output show that we have a good model.We have P-value=2.2e-16 < 5%Thus coefficient of the model is statistically significant in explaining the mpg .This model has the best R squared(0.7733) compare to the others, we conclude that it is the best model to pick.This Rsquared is telling us that 77.33% of the mpg is explained by the weight. At this point to choose the best regression we compare the R squared adjusted (0.7733>0.7129>0.2127) and we picked the one that has the higher R squared adjusted.
# To find out which independent variable to use in our multiple regression we are going
#to use the step wise regression
null=lm(mpg~1,data=auto.train)
full=lm(mpg~.,data=auto.train)
step(null,scope=list(upper=full),data=auto.train,direction="both")
## Start: AIC=1103.93
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + weight 1 9143.9 2668.4 659.63
## + displacement 1 8431.8 3380.5 730.59
## + horsepower 1 2688.4 9123.9 1028.46
## + acceleration 1 2544.1 9268.2 1033.17
## <none> 11812.3 1103.93
##
## Step: AIC=659.63
## mpg ~ weight
##
## Df Sum of Sq RSS AIC
## + displacement 1 61.8 2606.6 654.61
## + acceleration 1 37.1 2631.3 657.43
## <none> 2668.4 659.63
## + horsepower 1 11.0 2657.4 660.39
## - weight 1 9143.9 11812.3 1103.93
##
## Step: AIC=654.61
## mpg ~ weight + displacement
##
## Df Sum of Sq RSS AIC
## <none> 2606.6 654.61
## + horsepower 1 7.49 2599.1 655.74
## + acceleration 1 6.03 2600.6 655.91
## - displacement 1 61.78 2668.4 659.63
## - weight 1 773.85 3380.5 730.59
##
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train)
##
## Coefficients:
## (Intercept) weight displacement
## 38.746685 -0.004951 -0.011343
As shown in the R results above the step function produced weight and displacement as the optimum variables to produce the desired linear regression model . This step function use AIC as criterion. It selects the combination of variables with which we can have the lower AIC statistic.Thus our final model.
#our final model is
final_model <- lm(mpg ~ weight + displacement, data = auto.train)
print(final_model)
##
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train)
##
## Coefficients:
## (Intercept) weight displacement
## 38.746685 -0.004951 -0.011343
par(mfrow=c(2,2))
summary(final_model)
##
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4362 -1.8464 -0.1621 1.6470 15.2024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7466846 0.8832868 43.866 <2e-16 ***
## weight -0.0049513 0.0005273 -9.390 <2e-16 ***
## displacement -0.0113429 0.0042751 -2.653 0.0084 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.963 on 297 degrees of freedom
## Multiple R-squared: 0.7793, Adjusted R-squared: 0.7778
## F-statistic: 524.5 on 2 and 297 DF, p-value: < 2.2e-16
plot(final_model)
#Examine the above plots and discuss how you might interpret these plots
#to determine the appropriateness of your linear model.
#residuals vs. the predictor variable
residual <- final_model$residuals
plot(auto.train$mpg~residual,lwd=3, col="blue",main="mpg vs residual", xlab="residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#absolute value of the residuals vs. the predictor variable
abs_residual <- abs(residual)
plot(auto.train$mpg~abs_residual,lwd=3, col="red",main="mpg vs final_model_residual", xlab="abs_residual",ylab = "mpg")
grid(NA, 5, lwd = 2,col = "darkgray")
#histogram of the residuals
hist(residual,prob=T,breaks=20,main="HISTOGRAM OF WEIGHT RESIDUALS",xlab="Residuals")
lines(density(residual),col="red",lwd=3)
From the model summary,the model p value and predictor’s p value are less than the significance level (5%).We can state that our model is statistically significant.The R-Squared and the Ajusted R-Squared are respictively 0.7793 and 0.7778 which mean that more than 70% of the model are explained by the independend variables. Furthermore looking at the residual plot and the Absolut residual plot above, we see the residuals, even though fairly normal are clustering around 6.9 which is some indication that the could be improved. From the abs residual Plot , we see clear that the predicted residuals are definitely not clustering around zero. We decided to plot test weight against test mpg and found that the linear regression line through this data confirm our suspicion that the linear regression model is not truly representing the data. From the plot (residual vs fitted), the red line is pretty flat, with no increasing or decreasing trend.We suspect a homocedasdicity phenomen that we are going to verify.The Normal Q-Q plot indicat that the data are normaly distributed.These plot are showing some outliers that we decided to keep in the previous model.
gvlma(final_model)
##
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train)
##
## Coefficients:
## (Intercept) weight displacement
## 38.746685 -0.004951 -0.011343
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = final_model)
##
## Value p-value Decision
## Global Stat 183.26 0.000e+00 Assumptions NOT satisfied!
## Skewness 21.39 3.753e-06 Assumptions NOT satisfied!
## Kurtosis 95.33 0.000e+00 Assumptions NOT satisfied!
## Link Function 53.58 2.485e-13 Assumptions NOT satisfied!
## Heteroscedasticity 12.97 3.168e-04 Assumptions NOT satisfied!
Some assumptions are not satisfied. This could be due to the fact that we have some outliers in the data can impact the quality of the model.Let’s remove them from the data then see what happen.
par(mfrow=c(2,2))
remove_outlier <- lm(mpg ~ weight + displacement, data = auto.train[-c(112,245,248),])
summary(remove_outlier)
##
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train[-c(112,
## 245, 248), ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3148 -1.8142 -0.0992 1.5608 7.7300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.3192534 0.8082664 47.409 < 2e-16 ***
## weight -0.0047936 0.0004809 -9.968 < 2e-16 ***
## displacement -0.0119332 0.0038996 -3.060 0.00242 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.7 on 294 degrees of freedom
## Multiple R-squared: 0.8044, Adjusted R-squared: 0.8031
## F-statistic: 604.7 on 2 and 294 DF, p-value: < 2.2e-16
plot(remove_outlier)
gvlma(remove_outlier)
##
## Call:
## lm(formula = mpg ~ weight + displacement, data = auto.train[-c(112,
## 245, 248), ])
##
## Coefficients:
## (Intercept) weight displacement
## 38.319253 -0.004794 -0.011933
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = remove_outlier)
##
## Value p-value Decision
## Global Stat 69.0869 3.542e-14 Assumptions NOT satisfied!
## Skewness 2.3963 1.216e-01 Assumptions acceptable.
## Kurtosis 0.7437 3.885e-01 Assumptions acceptable.
## Link Function 58.7423 1.799e-14 Assumptions NOT satisfied!
## Heteroscedasticity 7.2046 7.272e-03 Assumptions NOT satisfied!
We notice that even if we remove the outlier the linear model assumptions are not satisfied.Nevertheless removing the outliers improve the model with a Ajusted R-squared varing from 77% to 80%. We suspect that a non linear model would perfom better on this car data.
# Make predictions and compute the R2, RMSE and MAE
predi_final <- final_model %>% predict(auto.test)
data.frame( R2 = R2(predi_final, auto.test$mpg),
RMSE = RMSE(predi_final, auto.test$mpg),
MAE = MAE(predi_final, auto.test$mpg))
## R2 RMSE MAE
## 1 0.5288592 8.092242 6.886737
predictions_error <- RMSE(predi_final, auto.test$mpg)/mean(auto.test$mpg)
compare_final <- as.data.frame(cbind(auto.test$mpg,predi_final),row=FALSE)
names(compare_final) <- c("observ","predi_dis")
kable(head(compare_final),format="pandoc", caption = "Comparaison")
| observ | predi_dis |
|---|---|
| 23.9 | 18.86421 |
| 34.2 | 26.66290 |
| 34.5 | 26.91046 |
| 31.8 | 27.78099 |
| 37.3 | 27.16829 |
| 28.4 | 23.81403 |
cor(compare_final)
## observ predi_dis
## observ 1.000000 0.727227
## predi_dis 0.727227 1.000000
The correlation betwenn The actual values and the predicted values is 72.72%
x <- model.matrix(mpg~.,auto_num)[,-1]
y <- auto_num$mpg
lambda <- 10^seq(10,-2,length=100)
set.seed(123)
train.l <- sample(1:nrow(x),nrow(x)/2)
test.l <- (-train.l)
y.test.l <- y[test.l]
lasso.mpg <-glmnet(x[train.l,],y[train.l],alpha = 1,lambda = lambda)
plot(lasso.mpg)
cv.out <- cv.glmnet(x[train.l,],y[train.l],alpha=1)
plot(cv.out)
bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mpg,s=bestlam,newx = x[test.l,])
mean((lasso.pred-y.test.l)^2)
## [1] 19.38963
out <- glmnet(x,y,alpha=1,lambda = lambda)
lasso.coef <- predict(out,type = "coefficients",s=bestlam)[1:4,]
lasso.coef
## (Intercept) displacement horsepower weight
## 40.96547040 -0.01088795 0.00425484 -0.00610106
lasso.coef[lasso.coef!=0]
## (Intercept) displacement horsepower weight
## 40.96547040 -0.01088795 0.00425484 -0.00610106