Executive Summary

Engines are the most sophisticated part of an automobile. Each engine configuration performs differently depending on specifications such as cylinder count, horsepower, and the weight of the car. In this report, we build an mpg predictor: a model that predicts an automobile’s expected mpg value from its engine factors.

We evaluated the models by category, grouped by the number of engine factors each model uses, and we present the best model from each of the four groups. Of these four, we select the one we found to be the most successful predictor of mpg. The other three remain available to decision makers who intend to use a different number of engine factors.

We strongly advise using a three-variable model: horsepower, weight and acceleration together give the closest approximation of mpg in our statistical analysis. This model explains 77.54% of the variation in mpg (adjusted R-squared). The two-variable and one-variable models are also usable. Among the one-variable models, weight accounts for the most variation in mpg. Among the two-variable models we tested, the combination of weight and acceleration is the best. Lastly, all the continuous variables in the dataset can be combined into a single model. We fitted every model by linear regression and checked it against several statistical criteria; the technical analyses below detail our methodology.

In short, all four models presented above are reasonable predictors of mpg. We advise using the three-variable model (horsepower, weight and acceleration), as it gives the best statistical results. The equations fitted for each model are listed below.

Factors:
  • y = predicted mpg
  • d = displacement
  • h = horsepower
  • w = weight
  • a = acceleration
Best Model:
  • y = 36.89 + 0.0075h - 0.0059w + 0.138a
Other Models:
  • y = 37.67 - 0.0061w + 0.137a
  • y = 40.39 - 0.0063w
  • y = 37.03 - 0.01d + 0.01h - 0.01w + 0.07a

 

Technical Analyses

Data Preparation

We investigate how a vehicle’s mpg relates to features of the automobile such as acceleration and weight. We are given a dataset of 398 automobile models, each listing the car’s features together with its mpg value.

Our objective is to predict the mpg value of a specific car. The dataset describes each car by cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car name.

To create a model, we first prepare the data: we fix the data types, handle outliers, and divide the data into a training set and a test set.

First, we import the data from the repository and examine its structure.

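# `url` (location of the data file) and `auto_colnames` (the nine column names)
# are assumed to be defined beforehand
auto <- read.csv(url, sep = "", header = FALSE)
colnames(auto) <- auto_colnames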

# investigating the structure of data
str(auto)
## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100.0","102.0",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...

We reformatted some variables according to the definitions given in the dataset repository. After examining the structure of the data, we decided to build linear models from the four continuous variables in the dataset: displacement, horsepower, weight and acceleration.
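The str() output above shows horsepower stored as a factor, with "?" among its levels. As a minimal sketch of why the conversion route matters, the toy factor below (its values are illustrative, not taken from the dataset) is converted in two ways: directly, which returns the internal level codes, and through character, which recovers the printed values.

```r
# Toy factor mimicking how horsepower was read; the values are illustrative
hp <- factor(c("130.0", "165.0", "?", "95.0"))

# Direct conversion returns the internal level codes, not the printed values
codes <- as.numeric(hp)

# Converting through character recovers the values; "?" becomes NA
values <- suppressWarnings(as.numeric(as.character(hp)))

print(codes)
print(values)
```

The exact level codes depend on how the locale sorts the level labels, so only the second result is stable across machines.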

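# reformatting continuous variables
# horsepower is read as a factor; as.numeric() on a factor returns level codes,
# so convert through character first (the "?" entries become NA)
auto$horsepower <- as.numeric(as.character(auto$horsepower))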

# creating train and test data
auto_train <- auto[1:300, ]
auto_test <- auto[301:398, ]
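The split above takes the first 300 rows for training and the remaining 98 for testing. If the rows of a file are ordered in some way (for example by model year), a random split is a common alternative. The sketch below is not the method used in this report; it uses R's built-in mtcars data as a stand-in for auto, and the seed and the 75% training share are arbitrary assumptions.

```r
# mtcars stands in for the report's `auto` data so the block runs on its own
set.seed(42)                                   # arbitrary seed, for reproducibility only
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

demo_train <- mtcars[train_idx, ]              # randomly chosen training rows
demo_test  <- mtcars[-train_idx, ]             # all remaining rows

cat("train rows:", nrow(demo_train), " test rows:", nrow(demo_test), "\n")
```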

We examined the selected variables for outliers and corrected the outliers we found in the acceleration feature.

Boxplot for Outliers
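The report does not state how the acceleration outliers were corrected. As one possible approach, the sketch below flags the points a boxplot would draw as outliers and clamps them to the whisker ends. Both the clamping rule and the use of mtcars$qsec as a stand-in for auto$acceleration are assumptions made for illustration.

```r
# mtcars$qsec stands in for auto$acceleration (assumption for illustration)
x <- mtcars$qsec

# boxplot.stats() applies the usual 1.5 * IQR whisker rule
bs <- boxplot.stats(x)
outliers <- bs$out                 # values beyond the whiskers
lower <- bs$stats[1]               # lower whisker end
upper <- bs$stats[5]               # upper whisker end

# One possible correction: clamp outliers to the whisker ends
x_clamped <- pmin(pmax(x, lower), upper)

print(outliers)
print(range(x_clamped))
```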

Data Analyses

Linear regression depends both on how each independent variable relates to the dependent variable and on how the independent variables relate to one another. The graph below therefore shows the correlation values and scatter plots of the independent and dependent variables against each other.
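A correlation table of this kind can be computed with cor(). The sketch below uses the built-in mtcars data as a stand-in, with disp, hp, wt and qsec playing the roles of displacement, horsepower, weight and acceleration; those column names are assumptions and differ from the report's dataset.

```r
# mtcars columns stand in for the report's variables (assumption)
vars <- c("mpg", "disp", "hp", "wt", "qsec")

# Pairwise Pearson correlations between all selected variables
cor_mat <- cor(mtcars[, vars])
print(round(cor_mat, 2))

# A scatterplot matrix of the same variables can be drawn with:
# pairs(mtcars[, vars])
```

The first row shows how strongly each predictor relates to mpg, and the remaining off-diagonal entries show how strongly the predictors relate to one another.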

Univariate Models

We first evaluated each feature individually against mpg in order to understand the relationship between that feature and mpg. After fitting each variable on the training data and testing it on the test data, we conclude that mpg vs. weight is the best-fitting univariate model. We chose this model on the following basis:

  1. Adjusted R-squared values
  2. Normally distributed residuals
  3. Significance of the independent variables
  4. Low standard errors
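Each of the four criteria above can be read from a fitted lm object. The sketch below shows one way to extract them, using mtcars as a stand-in dataset. The Shapiro-Wilk test is a swapped-in numeric check for residual normality; the report itself judges normality from plots.

```r
# mtcars stands in for the training data (assumption)
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

adj_r2     <- s$adj.r.squared                  # criterion 1
coef_table <- s$coefficients                   # Estimate, Std. Error, t value, Pr(>|t|)
p_values   <- coef_table[, "Pr(>|t|)"]         # criterion 3
std_errors <- coef_table[, "Std. Error"]       # criterion 4

# Criterion 2, checked numerically instead of visually (swapped-in technique)
normality <- shapiro.test(residuals(fit))

print(adj_r2)
print(p_values)
print(std_errors)
print(normality$p.value)
```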

Evaluating each variable against mpg, we find that horsepower and acceleration give much lower R-squared values (0.2249 and 0.2127, respectively) than displacement and weight.

Of displacement and weight, weight is the more strongly correlated with mpg. We therefore present mpg vs. weight below as the univariate solution.

# weight ========================================
# training
auto_model_we <- lm(formula =  mpg ~ weight, data = auto_train)
summary(auto_model_we)
## 
## Call:
## lm(formula = mpg ~ weight, data = auto_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1077 -1.8842 -0.0333  1.7275 15.1232 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.3879027  0.6368804   63.41   <2e-16 ***
## weight      -0.0062524  0.0001957  -31.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.992 on 298 degrees of freedom
## Multiple R-squared:  0.7741, Adjusted R-squared:  0.7733 
## F-statistic:  1021 on 1 and 298 DF,  p-value: < 2.2e-16

Below is the residuals graph, in which the residuals display a normal distribution with mean zero.

Model’s Residuals

We then tested the model on the test data we set aside earlier; the result is shown below. Although the model appears to be a good predictor on the training data, the residuals on the test data are large, and they do not follow a normal distribution.

# testing - weight
# predict mpg for the test set from the fitted weight model
mpg_predict <- predict(auto_model_we, newdata = auto_test)
# absolute prediction errors on the test set
mpg_residuals <- abs(mpg_predict - auto_test$mpg)

The red line in the graph below is drawn from the mean and standard deviation of the residuals.

Test Data’s Residuals
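The test-set check can be sketched end to end as follows. The block uses mtcars as a stand-in, with its first 24 rows as training data and the rest as test data (both assumptions), and it works with signed residuals so that a normal curve with the residuals' mean and standard deviation is a meaningful reference. The plotting calls are left as comments so the block runs without a graphics device.

```r
# mtcars stands in for the report's data; the 24/8 split is an assumption
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

fit  <- lm(mpg ~ wt, data = train)
pred <- predict(fit, newdata = test)

res  <- test$mpg - pred            # signed test residuals
rmse <- sqrt(mean(res^2))          # root mean squared error on the test set

cat("mean:", mean(res), " sd:", sd(res), " rmse:", rmse, "\n")

# Histogram with a normal reference curve built from mean(res) and sd(res):
# hist(res, freq = FALSE, main = "Test residuals", xlab = "mpg error")
# curve(dnorm(x, mean = mean(res), sd = sd(res)), add = TRUE, col = "red")
```

A residual mean far from zero on the test set suggests the model is biased for the kind of cars that appear only in the test rows.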

Multivariate Models

Predicting mpg from a single variable may not be reliable. We therefore built models that use multiple variables and present our findings in this section.

Models’ Summary

We used displacement, horsepower, weight and acceleration to predict mpg. We combined these variables in different ways to create linear models and compared the results. We grouped the models by the number of independent variables they use. For simplicity, we share only a summary of our findings. The index below lists the models by number.

  • Model 5 = displacement + horsepower
  • Model 6 = displacement + weight
  • Model 7 = displacement + acceleration
  • Model 8 = horsepower + weight
  • Model 9 = horsepower + acceleration
  • Model 10 = weight + acceleration
  • Model 11 = displacement + horsepower + weight
  • Model 12 = displacement + horsepower + acceleration
  • Model 13 = displacement + weight + acceleration
  • Model 14 = horsepower + weight + acceleration
  • Model 15 = displacement + horsepower + weight + acceleration
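Together with the four univariate models, the index above covers every non-empty subset of the four predictors, 15 models in total. The sketch below shows one way to generate and fit all subsets in a loop rather than one by one. It uses mtcars as a stand-in dataset, so the column names (disp, hp, wt, qsec) and the resulting numbers are not those of the report.

```r
# Stand-ins for displacement, horsepower, weight, acceleration (assumption)
predictors <- c("disp", "hp", "wt", "qsec")

# All non-empty subsets: 4 + 6 + 4 + 1 = 15 candidate models
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each model and record its adjusted R-squared
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  data.frame(model  = paste(vars, collapse = " + "),
             n_vars = length(vars),
             adj_r2 = summary(fit)$adj.r.squared)
}))

# Best model first within each group size
results <- results[order(results$n_vars, -results$adj_r2), ]
print(results, row.names = FALSE)
```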

Models’ Analyses and Evaluation

Bivariate models are generally promising in terms of R-squared. However, the significance codes of the independent variables, and the degree to which those variables are independent of one another, tell us more about each model. Our findings are below. In Model 9, horsepower and acceleration are reasonably independent of each other and the residuals are normally distributed with mean zero, but the model shows only a weak relation with mpg, with an R-squared of 0.3527.

Models 5 and 7 are statistically weaker predictors than Models 6, 8 and 10, with lower R-squared values in the linear model.

Although Models 6, 8 and 10 are all good predictors of mpg, 6 and 10 are better than 8.

Lastly, of Models 6 and 10, Model 6 gives better results than Model 10. However, when we examine the relationship between displacement and weight, we find that they are strongly correlated. Because of this multicollinearity, using both in the same linear model can distort the predicted mpg values. We therefore pick Model 10 as the best of the bivariate models.

# multicollinearity relation
# VIF values close to 10, or sqrt(VIF) greater than 2, indicate multicollinearity

# model 6 - rejected due to high multicollinearity
car::vif(auto_model_dp_we)
## displacement       weight 
##     7.409758     7.409758
# model 10 - the model we choose
car::vif(auto_model_we_ac)
##       weight acceleration 
##     1.302825     1.302825
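The variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors in the model. With only two predictors, both share the same R² (their squared correlation), which is why each pair of VIF values above is identical. A VIF of 7.41 therefore corresponds to a squared correlation of about 0.865 between displacement and weight, or a correlation of about 0.93. The sketch below computes VIF by hand; vif_manual is a hypothetical helper written for this illustration, and mtcars is a stand-in dataset.

```r
# Hypothetical helper: VIF of each predictor from its regression on the others.
# Requires at least two predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# mtcars stand-ins for displacement and weight (assumption)
v <- vif_manual(mtcars, c("disp", "wt"))
print(v)

# With two predictors, VIF equals 1 / (1 - r^2) for their correlation r
r <- cor(mtcars$disp, mtcars$wt)
print(1 / (1 - r^2))
```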

Below we compare the residuals vs. fitted plots of two models: Model 7, which we noted is not as good a model as Model 10, and Model 10 itself.

# Comparing model 7 and 10 residual vs Fitted Plot
auto_model_two <- list(model.7 = auto_model_dp_we,
                       model.10 = auto_model_we_ac)

As the graph shows, the residuals of Model 10 are more homoscedastic than those of Model 7.

Comparing Model 7 and 10

The summary of Model 10 is shown below.

# model 10 - Our pick of bivariate models
summary(auto_model_we_ac)
## 
## Call:
## lm(formula = mpg ~ weight + acceleration, data = auto_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6684 -1.8947 -0.0467  1.7243 14.4901 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.6717192  1.6012785  23.526   <2e-16 ***
## weight       -0.0060543  0.0002224 -27.219   <2e-16 ***
## acceleration  0.1374886  0.0744233   1.847   0.0657 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.98 on 297 degrees of freedom
## Multiple R-squared:  0.7767, Adjusted R-squared:  0.7752 
## F-statistic: 516.4 on 2 and 297 DF,  p-value: < 2.2e-16

Below are the Model 10 residuals, which behave as expected under linear regression.

Model 10 Residuals

Next, we tested the model on our test data in order to evaluate it. Below is a graphical representation of the actual residuals on the test data.

Model 10 - Test Data Residuals

As the graph shows, the model matches the actual residuals well.

Trivariate models are generally better predictors than bivariate models. Model 12 is not as good a predictor as the other trivariate models, so we set it aside. In the remaining models that include both displacement and weight, the strong correlation between the two introduces multicollinearity, which can affect the predicted mpg values. We therefore present Model 14, the combination of horsepower, weight and acceleration, as the best model we have examined so far.

# multicollinearity relation
# model 13 - rejected due to high multicollinearity
car::vif(auto_model_dp_we_ac)
## displacement       weight acceleration 
##     9.904083     8.103146     1.741392
# model 14 - the model we choose
car::vif(auto_model_hp_we_ac)
##   horsepower       weight acceleration 
##     1.356112     1.662916     1.302874

Below is the summary of the model we are presenting.

# model 14 - Best model to predict mpg values
summary(auto_model_hp_we_ac)
## 
## Call:
## lm(formula = mpg ~ horsepower + weight + acceleration, data = auto_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8163 -1.8627  0.0196  1.6602 14.6175 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.8916880  1.7441669  21.151   <2e-16 ***
## horsepower    0.0074610  0.0066291   1.125   0.2613    
## weight       -0.0059228  0.0002512 -23.580   <2e-16 ***
## acceleration  0.1379991  0.0743912   1.855   0.0646 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.979 on 296 degrees of freedom
## Multiple R-squared:  0.7776, Adjusted R-squared:  0.7754 
## F-statistic:   345 on 3 and 296 DF,  p-value: < 2.2e-16

The summary shows that the combination of the three independent variables has a strong relation with mpg, with a high adjusted R-squared of 0.7754. The residuals of this model are also normally distributed with mean zero, as expected in linear regression.

Model 14 Residuals
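To make the fitted equation concrete, the sketch below applies the coefficients printed in the Model 14 summary to a single car. The function name predict_mpg and the input values (100 horsepower, 3000 weight, 15 acceleration) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the Model 14 summary output
coefs <- c(intercept    = 36.8916880,
           horsepower   =  0.0074610,
           weight       = -0.0059228,
           acceleration =  0.1379991)

# Hypothetical helper: evaluate the linear equation for one car
predict_mpg <- function(horsepower, weight, acceleration, b = coefs) {
  unname(b["intercept"] +
         b["horsepower"]   * horsepower +
         b["weight"]       * weight +
         b["acceleration"] * acceleration)
}

# Hypothetical input values, for illustration only
example_mpg <- predict_mpg(horsepower = 100, weight = 3000, acceleration = 15)
print(example_mpg)
```

For these inputs the equation gives about 21.9 mpg. The weight term contributes -17.77, far more than the horsepower term (+0.75) or the acceleration term (+2.07), which is consistent with weight being the only clearly significant predictor in the summary.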

Next, we tested the model on our test data in order to evaluate it. Below is a graphical representation of the actual residuals on the test data.

Model 14 - Test Data Residuals

Lastly, the model that uses all continuous variables in the dataset gives statistically reliable results in terms of adjusted R-squared, and its residuals are normally distributed around zero, as expected in linear regression. However, the weaker significance codes of its independent variables and the multicollinearity between displacement and weight make this model unsatisfactory.

Models’ Conclusion

We have now evaluated every combination of the independent variables. We present Models 3, 10, 14 and 15 as the best models for one, two, three and four variables, respectively. They are listed below with our recommended model first.

  • (Model 14) Horsepower, Weight and Acceleration
  • (Model 10) Weight and Acceleration
  • (Model 3) Weight
  • (Model 15) Displacement, Horsepower, Weight and Acceleration

We strongly recommend using Model 14, the combination of horsepower, weight and acceleration, to predict mpg values, as this model gives the best statistical results overall.



