Analyses on Automobile’s MPG Prediction

Technical Analyses

Data Preparation

We investigate how a vehicle's mpg value relates to features of the automobile such as acceleration and weight. The dataset contains 398 automobile models, each with its features and the corresponding mpg value.

Our objective is to predict the mpg value for a specific type of car from its cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name.

To build a model, we first prepare the data: we set the data types, handle outliers, and divide the data into a training set and a test set.

First, we import the data from the repository and examine its structure.

auto <- read.csv(url, sep = "", header = FALSE)
colnames(auto) <- auto_colnames

# investigating the structure of data
str(auto)

## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100.0","102.0",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...

We reformatted some variables according to the definitions given in the dataset repository. After examining the structure of the data, we decided to construct linear models using the four continuous variables in the dataset: displacement, horsepower, weight and acceleration.

# reformatting continuous variables
# note: horsepower is a factor at this point, so as.numeric() returns its level codes
auto$horsepower <- as.numeric(auto$horsepower)

# creating train and test data
auto_train <- auto[1:300, ]
auto_test <- auto[301:398, ]
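One detail worth checking in the conversion step: when a numeric column is read as a factor, as horsepower is in the structure output above, as.numeric() applied directly to it returns the internal level codes rather than the printed values. A minimal sketch with a made-up factor (the values are illustrative and not taken from the dataset) shows the difference and a value-preserving alternative:

```r
# Toy factor mimicking a numeric column read as text, with "?" marking a missing value
hp <- factor(c("130.0", "165.0", "?", "150.0"))

# Direct conversion returns the level codes (positions in levels(hp)), not the values
codes <- as.numeric(hp)

# Converting through character keeps the values; "?" becomes NA
values <- suppressWarnings(as.numeric(as.character(hp)))

print(codes)
print(values)
```

Rows that end up as NA then need to be dropped or imputed before fitting; which of the two is appropriate depends on the analysis.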

We examined the selected variables for outliers and corrected the outliers we found in the acceleration feature.

Boxplot for Outliers
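The report does not state how the outliers were corrected. As a hedged sketch, the boxplot rule can be applied numerically with boxplot.stats(), and one possible correction (an assumption here, not necessarily the one used) is to cap values at the whisker ends. The built-in mtcars dataset and its qsec column stand in for the report's data.

```r
# Stand-in data: quarter-mile time from the built-in mtcars dataset (an assumption)
x <- mtcars$qsec

# boxplot.stats() applies the same 1.5 * IQR whisker rule that boxplot() draws
stats <- boxplot.stats(x)
outliers <- stats$out

# One possible correction (assumed): cap values at the whisker ends
lower <- stats$stats[1]
upper <- stats$stats[5]
x_capped <- pmin(pmax(x, lower), upper)

print(outliers)
print(range(x_capped))
```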

Data Analyses

The relationships between the dependent variable and the independent variables, and among the independent variables themselves, matter for linear regression. The graph below therefore shows the correlation values and scatterplots of the independent and dependent variables against each other.
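A correlation overview of this kind can be sketched in base R as follows. The built-in mtcars dataset is used as a stand-in (an assumption), with disp, hp, wt and qsec playing the roles of displacement, horsepower, weight and acceleration.

```r
# Stand-in columns from mtcars: response first, then the four continuous predictors
vars <- c("mpg", "disp", "hp", "wt", "qsec")

# Pairwise Pearson correlations
cor_mat <- cor(mtcars[, vars])
print(round(cor_mat, 2))

# A scatterplot matrix of the same columns can be drawn with:
# pairs(mtcars[, vars])
```

High correlations between two predictors in such a matrix are an early hint of the multicollinearity discussed for the multivariate models.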

Univariate Models

We evaluated each feature individually against mpg in order to understand the behaviour of the relationship between the feature and the mpg values. By fitting each variable individually on the training dataset and testing it on the test data, we conclude that Mpg vs. Weight is the best-fitting univariate model. We chose this model on the following basis:

Adjusted R-squared values
Normally distributed residuals
Significance of the independent variables
Low standard errors
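These criteria can all be read from a fitted lm object. The following is a minimal sketch on the built-in mtcars dataset (a stand-in, not the report's data); the Shapiro-Wilk test is added here as one possible numeric check of residual normality and is not part of the report.

```r
# Univariate model on stand-in data
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

# Adjusted R-squared
adj_r2 <- s$adj.r.squared

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(s)
std_errors <- coef_table[, "Std. Error"]
p_values <- coef_table[, "Pr(>|t|)"]

# Numeric normality check on the residuals (assumed addition)
normality <- shapiro.test(residuals(fit))

print(adj_r2)
print(coef_table)
print(normality$p.value)
```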

We evaluated each variable against mpg and found that horsepower and acceleration give markedly lower results than displacement and weight, with values of 0.2249 and 0.2127, respectively.

Of the displacement and weight variables, weight is the most strongly correlated with mpg. Therefore, we present Mpg vs. Weight below as the univariate solution.

# weight ========================================
# training
auto_model_we <- lm(formula =  mpg ~ weight, data = auto_train)
summary(auto_model_we)

## 
## Call:
## lm(formula = mpg ~ weight, data = auto_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1077 -1.8842 -0.0333  1.7275 15.1232 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.3879027  0.6368804   63.41   <2e-16 ***
## weight      -0.0062524  0.0001957  -31.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.992 on 298 degrees of freedom
## Multiple R-squared:  0.7741, Adjusted R-squared:  0.7733 
## F-statistic:  1021 on 1 and 298 DF,  p-value: < 2.2e-16

Below is the residuals graph, in which the residuals display a normal distribution with mean zero.

Model’s Residuals

We tested our model on the test data we created earlier; the results are shown below. Although the model appears to be a good predictor, the residuals on the test data are high, and they do not follow a normal distribution.

# testing - weight
mpg_predict <- reg(auto_test$weight, auto_model_we$coefficients)
mpg_residuals <- abs(mpg_predict - auto_test$mpg)
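The helper reg() is not shown in the report. Assuming it evaluates the fitted line at new inputs, the built-in predict() does the same job. The sketch below uses mtcars as a stand-in with a hypothetical 22/10 row split, and adds a root mean squared error as one summary of test error.

```r
# Stand-in split: first rows for training, remaining rows for testing (assumption)
train <- mtcars[1:22, ]
test <- mtcars[23:32, ]

fit <- lm(mpg ~ wt, data = train)

# Predictions on held-out rows
pred <- predict(fit, newdata = test)

# The same values computed by hand from the coefficients
manual <- coef(fit)[1] + coef(fit)[2] * test$wt

# Absolute residuals, as in the report, plus RMSE as a single error summary
abs_resid <- abs(pred - test$mpg)
rmse <- sqrt(mean((pred - test$mpg)^2))

print(rmse)
```

Because absolute residuals are never negative, a check of normality around zero needs the signed residuals instead (pred - test$mpg).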

The red line in the graph below is drawn using the mean and standard deviation of the residuals.

Test Data’s Residuals

Multivariate Models

Predicting mpg values from a single variable may not be very reliable. Therefore, we created models using multiple variables and present our findings in this section.

Models’ Summary

We used displacement, horsepower, weight and acceleration to predict mpg values. We combined these variables in every possible way to create linear models, compared the results, and grouped the models by the number of independent variables they use. For simplicity, we share only a summary of our findings. The index below is used to refer to the models.

Model 5 = displacement + horsepower
Model 6 = displacement + weight
Model 7 = displacement + acceleration
Model 8 = horsepower + weight
Model 9 = horsepower + acceleration
Model 10 = weight + acceleration
Model 11 = displacement + horsepower + weight
Model 12 = displacement + horsepower + acceleration
Model 13 = displacement + weight + acceleration
Model 14 = horsepower + weight + acceleration
Model 15 = displacement + horsepower + weight + acceleration
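An index like this can be generated rather than typed by hand. The sketch below enumerates every non-empty subset of four predictors with combn() and fits one linear model per subset. It uses mtcars as a stand-in (disp, hp, wt, qsec in place of displacement, horsepower, weight, acceleration). If the four univariate models take numbers 1 to 4, which is an assumption consistent with weight being called Model 3 in the conclusion, the generated order matches the index above.

```r
predictors <- c("disp", "hp", "wt", "qsec")

# All subsets of size 1..4, flattened into one list of character vectors
combos <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# One linear model per subset, named by its right-hand side
fits <- lapply(combos, function(vars) {
  lm(reformulate(vars, response = "mpg"), data = mtcars)
})
names(fits) <- vapply(combos, paste, character(1), collapse = " + ")

# Adjusted R-squared for every model, for side-by-side comparison
adj_r2 <- vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1))
print(round(adj_r2, 4))
```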

Models’ Analyses and Evaluation

Bivariate models are generally promising in terms of R-squared values. However, the significance codes and the degree of independence among the predictors tell us more about each model. Our findings are below. In Model 9, the two predictors are largely independent of each other and the residuals are normally distributed with mean zero, but the model shows only a weak relationship with mpg, with an R-squared value of 0.3527.

Models 5 and 7 are statistically weaker predictors than models 6, 8 and 10, since their R-squared values are lower.

Although models 6, 8 and 10 are all good predictors of mpg, 6 and 10 are better than 8.

Lastly, of models 6 and 10, model 6 gives better results than model 10. However, when we examine the relationship between displacement and weight, we find that they are strongly correlated. Because of this multicollinearity, using both in the same linear model may distort the predicted mpg values. Therefore, we pick model 10 as the best of the bivariate models.

# multicollinearity relation
# a VIF close to 10, or sqrt(VIF) greater than 2, indicates multicollinearity

# model 6  - model we reject due to high multicollinearity
car::vif(auto_model_dp_we)

## displacement       weight 
##     7.409758     7.409758

# model 10 - Model we are choosing 
car::vif(auto_model_we_ac)

##       weight acceleration 
##     1.302825     1.302825
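For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors in the model. The sketch below computes it in base R without the car package; vif_manual is a hypothetical helper name and mtcars is a stand-in dataset. It also shows why a two-predictor model always reports the same VIF for both variables, as in the outputs above: each R² is just the squared correlation between the two.

```r
# Hypothetical helper: VIF of each predictor from an auxiliary regression
vif_manual <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

# Two predictors: both VIFs equal 1 / (1 - cor^2)
vif_two <- vif_manual(mtcars, c("disp", "wt"))
by_cor <- 1 / (1 - cor(mtcars$disp, mtcars$wt)^2)

# Three predictors: the values generally differ
vif_three <- vif_manual(mtcars, c("disp", "wt", "qsec"))

print(vif_two)
print(vif_three)
```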

Below we compare the residuals vs. fitted plots of two models: Model 7, which we judged to be a weaker model, and our Model 10.

# Comparing model 7 and 10 residual vs Fitted Plot
auto_model_two <- list(model.7 = auto_model_dp_we,
                       model.10 = auto_model_we_ac)

As the graph shows, the residuals of model 10 are more homoscedastic than those of model 7.

Comparing Model 7 and 10

The summary of model 10 is shown below.

# model 10 - Our pick of bivariate models
summary(auto_model_we_ac)

## 
## Call:
## lm(formula = mpg ~ weight + acceleration, data = auto_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6684 -1.8947 -0.0467  1.7243 14.4901 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6717192  1.6012785  23.526   <2e-16 ***
## weight       -0.0060543  0.0002224 -27.219   <2e-16 ***
## acceleration  0.1374886  0.0744233   1.847   0.0657 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.98 on 297 degrees of freedom
## Multiple R-squared:  0.7767, Adjusted R-squared:  0.7752 
## F-statistic: 516.4 on 2 and 297 DF,  p-value: < 2.2e-16

Below are the residuals of model 10, which behave as expected in linear regression.

Model 10 Residuals

We then tested the model on our test data in order to evaluate it. Below is a graphical representation of the actual residuals on the test data.

Model 10 - Test Data Residuals

As the graph shows, the actual residuals on the test data are consistent with the model.

Trivariate models are generally better predictors than bivariate models. Model 12 is not as good a predictor as the other trivariate models, so we set it aside. Among the remaining trivariate models, the multicollinearity caused by the strong correlation between displacement and weight can affect the predicted mpg values. Therefore, we present model 14, the combination of horsepower, weight and acceleration, as the best model we have examined so far.

# multicollinearity relation
# model 13  - model we reject due to high multicollinearity
car::vif(auto_model_dp_we_ac)

## displacement       weight acceleration 
##     9.904083     8.103146     1.741392

# model 14 - Model we are choosing 
car::vif(auto_model_hp_we_ac)

##   horsepower       weight acceleration 
##     1.356112     1.662916     1.302874

Below is the summary of the model we are presenting.

# model 14 - Best model to predict mpg values
summary(auto_model_hp_we_ac)

## 
## Call:
## lm(formula = mpg ~ horsepower + weight + acceleration, data = auto_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8163 -1.8627  0.0196  1.6602 14.6175 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.8916880  1.7441669  21.151   <2e-16 ***
## horsepower    0.0074610  0.0066291   1.125   0.2613    
## weight       -0.0059228  0.0002512 -23.580   <2e-16 ***
## acceleration  0.1379991  0.0743912   1.855   0.0646 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.979 on 296 degrees of freedom
## Multiple R-squared:  0.7776, Adjusted R-squared:  0.7754 
## F-statistic:   345 on 3 and 296 DF,  p-value: < 2.2e-16

The summary shows that the combination of the three independent variables has a strong relationship with mpg, with a high adjusted R-squared value of 0.7754. The residuals of this model also follow a normal distribution with mean 0, as expected in linear regression.

Model 14 Residuals

We then tested the model on our test data in order to evaluate it. Below is a graphical representation of the actual residuals on the test data.

Model 14 - Test Data Residuals

Lastly, using all continuous variables in the dataset gives statistically reliable results in terms of the adjusted R-squared value. The model's residuals are normally distributed with mean zero, as expected in linear regression. However, the weaker significance codes on the independent variables and the multicollinearity between displacement and weight make this model unsatisfactory.

Models’ Conclusion

So far we have evaluated all combinations of the independent variables and presented models 10, 14 and 15 as the best models for their respective numbers of variables.

(Model 14) Horsepower, Weight and Acceleration
(Model 10) Weight and Acceleration
(Model 3) Weight
(Model 15) Displacement, Horsepower, Weight and Acceleration

We strongly recommend using Model 14, the combination of horsepower, weight and acceleration, to predict mpg values, as this model gives the best statistical results overall.

¹: This report can be reached at the following link.


Shivani Mehta & Metin Senturk & Pooja Umathe

Executive Summary

Factors:

Best Model:

Other Models: