Mileage per Gallon
Before you start reading the analysis below, let me first thank you for taking the time to read my work; it means a lot to me.
Everything you will see in this analysis is the result of my study in the Regression Models class at Algoritma Academy. To see what I have learned in more detail, you can visit the Algoritma Academy learning syllabus.
Everything I have written is my personal opinion based on my experience and knowledge so far. If something is wrong or missing, please feel free to contact me; I would love to discuss it with you. Thank you.
About the dataset
This dataset is taken from the Kaggle website, from a post titled Predict the miles per gallon on an highway. The dataset was posted by Akhilesh and contains information about the MPG of vehicles along with 10 possible predictors.
A Glimpse of the dataset
# libraries used throughout this analysis
library(dplyr)      # glimpse(), %>%, filter(), summarise_all()
library(tidyr)      # pivot_longer()
library(GGally)     # ggcorr()
library(lmtest)     # bptest()
library(MLmetrics)  # MAPE()

mpg <- read.csv('dataset/mpg.csv')

# scale each column from 0 - 100 to visualize the distribution better
mpg <- data.frame(lapply(mpg, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/100)))

glimpse(mpg)

## Rows: 93
## Columns: 11
## $ MPG.Highway <dbl> 62, 50, 52, 52, 60, 62, 56, 50, 54, 50, 50, 72, 68, 56, 5~
## $ Passengers <dbl> 62.5, 62.5, 62.5, 75.0, 50.0, 75.0, 75.0, 75.0, 62.5, 75.~
## $ Length <dbl> 80.82192, 89.04110, 82.19178, 88.12785, 84.93151, 86.3013~
## $ Wheelbase <dbl> 85.71429, 96.63866, 85.71429, 89.07563, 91.59664, 88.2352~
## $ Width <dbl> 87.17949, 91.02564, 85.89744, 89.74359, 88.46154, 88.4615~
## $ U.Turn.Space <dbl> 82.22222, 84.44444, 82.22222, 82.22222, 86.66667, 91.1111~
## $ Rear.seat <dbl> 73.61111, 83.33333, 77.77778, 86.11111, 75.00000, 77.7777~
## $ Luggage <dbl> 50.00000, 68.18182, 63.63636, 77.27273, 59.09091, 72.7272~
## $ Weight <dbl> 65.89525, 86.72351, 82.21681, 82.94762, 88.67235, 70.1583~
## $ Horsepower <dbl> 46.66667, 66.66667, 57.33333, 57.33333, 69.33333, 36.6666~
## $ Fueltank <dbl> 48.88889, 66.66667, 62.59259, 78.14815, 78.14815, 60.7407~
This dataset consists of 93 rows and 11 columns. Below is an explanation of each column.
- MPG.Highway : highway mileage of the vehicle, in miles per gallon.
- Passengers : number of passengers.
- Length : length of the vehicle.
- Wheelbase : distance between the centres of the front and rear wheels.
- Width : width of the vehicle.
- U.Turn.Space : space the vehicle needs to make a U-turn.
- Rear.seat : size of the rear seat.
- Luggage : luggage capacity of the vehicle.
- Weight : weight of the vehicle.
- Horsepower : horsepower of the vehicle.
- Fueltank : size of the fuel tank.
Exploratory data analysis
To make our analysis process smoother, we have to make sure that the dataset has no missing values, has simple yet clear column names, and uses appropriate data types.
1. Checking for missing values
mpg %>%
  summarise_all(~sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = 'column_name', values_to = 'missing_values')

## # A tibble: 11 x 2
## column_name missing_values
## <chr> <int>
## 1 MPG.Highway 0
## 2 Passengers 0
## 3 Length 0
## 4 Wheelbase 0
## 5 Width 0
## 6 U.Turn.Space 0
## 7 Rear.seat 0
## 8 Luggage 0
## 9 Weight 0
## 10 Horsepower 0
## 11 Fueltank 0
Fortunately, this dataset contains no missing values and already comes with appropriate column names and data types, so we don't need to modify any of the columns.
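As an extra sanity check on the data types, here is a quick base R one-liner (redundant with the glimpse() output above, but it makes the point explicit):

sapply(mpg, class) # every column should be numeric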
2. Correlation between variables
To see whether the predictors have a positive or negative correlation with our target variable (MPG.Highway), we can draw a correlation map.
ggcorr(mpg, label = TRUE, hjust = 1, layout.exp = 2)

As we can see, almost all variables have a strong negative correlation with MPG.Highway, with Weight and Fueltank showing the strongest correlations.
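If you prefer numbers over colours, a minimal sketch computes the same correlations directly with base R (this check is my addition, not part of the original post):

# correlation of every column with MPG.Highway, sorted from
# most negative to most positive
sort(cor(mpg)[, "MPG.Highway"])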
3. Checking outliers
Sometimes outliers can affect our models negatively, for example by making the data deviate from a normal distribution. To prevent this, we check whether our dataset contains outliers.
Below is the distribution of our data.
boxplot(mpg)

Let's remove the rows that are outliers in our target variable.
mpg <- mpg %>% filter(MPG.Highway < 80)
boxplot(mpg)

nrow(mpg) # number of observations after outlier removal

## [1] 89
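The cutoff of 80 was read off the boxplot by eye. As a more systematic alternative, here is a minimal sketch of the common 1.5 x IQR rule for flagging outliers in the target variable, applied before the filter above (the rule is my assumption, not the method used in this post):

# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q   <- quantile(mpg$MPG.Highway, c(0.25, 0.75))
iqr <- q[2] - q[1]
outlier <- mpg$MPG.Highway < q[1] - 1.5 * iqr |
  mpg$MPG.Highway > q[2] + 1.5 * iqr
sum(outlier) # how many rows the rule would flag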
Making the models
We will make two linear regression models: one with a single predictor and one with multiple predictors.
1. Single predictor
model0 <- lm(MPG.Highway ~ Weight, mpg)
summary(model0)

##
## Call:
## lm(formula = MPG.Highway ~ Weight, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7788 -3.5376 -0.0439 3.1131 11.7634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.29733 2.90540 32.46 <0.0000000000000002 ***
## Weight -0.49459 0.03766 -13.13 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.826 on 87 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6608
## F-statistic: 172.4 on 1 and 87 DF, p-value: < 0.00000000000000022
plot(mpg$MPG.Highway, mpg$Weight)

As we can see from the model summary and the scatter plot, Weight and MPG.Highway have a strong negative relationship, and the model has an adjusted R-squared of 0.6608.
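To visualize the fitted line itself, here is a minimal sketch that puts Weight on the x-axis, matching the model formula, so abline() can overlay model0's coefficients:

# scatter plot oriented the same way as the model formula,
# with the fitted regression line drawn on top
plot(mpg$Weight, mpg$MPG.Highway,
     xlab = "Weight (scaled)", ylab = "MPG.Highway (scaled)")
abline(model0, col = "red")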
2. Multiple predictors
Next, we will try to select predictor variables automatically using step-wise regression with the backward elimination method.
model1 <- lm(MPG.Highway ~ ., mpg)
step(model1, direction = "backward", trace = 0)

##
## Call:
## lm(formula = MPG.Highway ~ Length + Rear.seat + Luggage + Weight +
## Fueltank, data = mpg)
##
## Coefficients:
## (Intercept) Length Rear.seat Luggage Weight Fueltank
## 77.17663 0.39395 -0.10549 0.09243 -0.49724 -0.21615
summary(lm(MPG.Highway ~ Passengers + Wheelbase + Width + Weight +
    Fueltank, data = mpg))

##
## Call:
## lm(formula = MPG.Highway ~ Passengers + Wheelbase + Width + Weight +
## Fueltank, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3009 -2.8087 0.0806 2.8250 11.3622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.80010 16.31459 2.562 0.01221 *
## Passengers -0.08893 0.05126 -1.735 0.08645 .
## Wheelbase 0.49819 0.20649 2.413 0.01804 *
## Width 0.39553 0.20280 1.950 0.05451 .
## Weight -0.56068 0.10914 -5.137 0.00000181 ***
## Fueltank -0.25107 0.08606 -2.918 0.00454 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.426 on 83 degrees of freedom
## Multiple R-squared: 0.731, Adjusted R-squared: 0.7148
## F-statistic: 45.11 on 5 and 83 DF, p-value: < 0.00000000000000022
This step-wise regression method produces an optimal formula based on the lowest AIC value; the lower the AIC, the less information about the data the model fails to capture.
Compared with the initial model that uses only the Weight variable, the second regression model with multiple predictors has an adjusted R-squared of 0.7148, higher than the previous model's 0.6608.
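To make that comparison explicit, here is a minimal sketch that pulls both metrics out of the fitted objects (note that model1 is the full model defined above, not the stepped-down formula, so its numbers may differ slightly):

# side-by-side comparison of the two models
data.frame(
  model  = c("model0", "model1"),
  adj_r2 = c(summary(model0)$adj.r.squared,
             summary(model1)$adj.r.squared),
  AIC    = c(AIC(model0), AIC(model1))
)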
Prediction and error
Now, we will try to predict the MPG.Highway using the created models.
1. Model0 - Single predictor
First, we will try to predict MPG.Highway value using the first model with Weight predictor only.
predicted_mpg_model0 <- predict(object = model0, newdata = mpg)
head(predicted_mpg_model0) # prediction results

## 1 2 3 4 5 6
## 61.70645 51.40509 53.63404 53.27259 50.44122 59.59799

head(mpg$MPG.Highway) # actual values

## [1] 62 50 52 52 60 62
2. Model1 - Multiple predictor
Second, we will try to predict MPG.Highway value using the second model with multiple predictors.
predicted_mpg_model1 <- predict(object = model1, newdata = mpg)
head(predicted_mpg_model1) # prediction results

## 1 2 3 4 5 6
## 63.18617 54.29451 52.06590 51.30440 47.97244 60.96240

head(mpg$MPG.Highway) # actual values

## [1] 62 50 52 52 60 62
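To see how close both sets of predictions sit to the actual values, here is a minimal sketch plotting them against each other, using only objects already created above:

# predicted vs actual; points near the diagonal are accurate
plot(mpg$MPG.Highway, predicted_mpg_model0,
     xlab = "Actual MPG.Highway", ylab = "Predicted MPG.Highway")
points(mpg$MPG.Highway, predicted_mpg_model1, col = "blue", pch = 2)
abline(0, 1, col = "red") # perfect-prediction line
legend("topleft", legend = c("model0", "model1"),
       col = c("black", "blue"), pch = c(1, 2))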
3. Calculate error
MAPE(predicted_mpg_model0, mpg$MPG.Highway) # model0 prediction error

## [1] 0.06869217

MAPE(predicted_mpg_model1, mpg$MPG.Highway) # model1 prediction error

## [1] 0.05405573
The MAPE calculation shows that model1 has the smaller MAPE, but the difference between the model0 and model1 errors is only about 0.015 (1.5 percentage points), so in my opinion model0 is better.
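For reference, MAPE is simply the mean of the absolute percentage errors; here is a minimal sketch reproducing MLmetrics::MAPE by hand for model0:

# MAPE = mean(|actual - predicted| / |actual|)
mean(abs(mpg$MPG.Highway - predicted_mpg_model0) /
  abs(mpg$MPG.Highway))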
Model evaluation
1. Normality
hist(model0$residuals, breaks = 20)
shapiro.test(model0$residuals)

##
## Shapiro-Wilk normality test
##
## data: model0$residuals
## W = 0.9951, p-value = 0.9864
The p-value for model0 is > 0.05, so the residuals of our model can be considered normally distributed. This also means that the model's errors are spread evenly around their mean.
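A Q-Q plot is another common way to eyeball the same assumption; here is a minimal base R sketch (this is my addition, not one of the original checks):

# residuals should follow the straight line if they are normal
qqnorm(model0$residuals)
qqline(model0$residuals, col = "red")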
2. Homoscedasticity
plot(model0$fitted.values, model0$residuals)
abline(h = 0, col = "red")

bptest(model0)

##
## studentized Breusch-Pagan test
##
## data: model0
## BP = 2.1982, df = 1, p-value = 0.1382
The p-value for model0 is > 0.05, so our model meets the homoscedasticity assumption. This also means the residuals show no systematic pattern (no heteroscedasticity): the patterns present in the data have been captured by the model.
Conclusion
The model0 we created has an adjusted R-squared of 0.6608 and a MAPE of 0.06869217, while model1 has an adjusted R-squared of 0.7148 and a MAPE of 0.05405573.
Even though model1 has a better adjusted R-squared and MAPE than model0, the values are not far apart. So in my opinion, model0, which uses only one predictor, is the better choice.