Mileage per Gallon
Before you start reading the analysis below, let me first thank you for taking the time to read my work; it means a lot to me.
Everything you will see in this analysis is the result of my study in the Regression Models class at Algoritma Academy. To see what I have learned in more detail, you can visit the Algoritma Academy learning syllabus.
Everything I have written is my personal opinion based on my experience and knowledge so far. If something is wrong or missing, please feel free to contact me; I would love to discuss it with you. Thank you.
About the dataset
This dataset is taken from the Kaggle website, from a post titled Predict the miles per gallon on an highway. The dataset was posted by Akhilesh and contains information about the MPG of vehicles along with 10 possible predictors.
A Glimpse of the dataset
# libraries used throughout this analysis
library(dplyr)      # glimpse(), %>%, filter(), summarise_all()
library(tidyr)      # pivot_longer()
library(GGally)     # ggcorr()
library(lmtest)     # bptest()
library(MLmetrics)  # MAPE()

mpg <- read.csv('dataset/mpg.csv')

# scale each column from 0 - 100 to visualize the distribution better
mpg <- data.frame(lapply(mpg, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/100)))

glimpse(mpg)

## Rows: 93
## Columns: 11
## $ MPG.Highway <dbl> 62, 50, 52, 52, 60, 62, 56, 50, 54, 50, 50, 72, 68, 56, 5~
## $ Passengers <dbl> 62.5, 62.5, 62.5, 75.0, 50.0, 75.0, 75.0, 75.0, 62.5, 75.~
## $ Length <dbl> 80.82192, 89.04110, 82.19178, 88.12785, 84.93151, 86.3013~
## $ Wheelbase <dbl> 85.71429, 96.63866, 85.71429, 89.07563, 91.59664, 88.2352~
## $ Width <dbl> 87.17949, 91.02564, 85.89744, 89.74359, 88.46154, 88.4615~
## $ U.Turn.Space <dbl> 82.22222, 84.44444, 82.22222, 82.22222, 86.66667, 91.1111~
## $ Rear.seat <dbl> 73.61111, 83.33333, 77.77778, 86.11111, 75.00000, 77.7777~
## $ Luggage <dbl> 50.00000, 68.18182, 63.63636, 77.27273, 59.09091, 72.7272~
## $ Weight <dbl> 65.89525, 86.72351, 82.21681, 82.94762, 88.67235, 70.1583~
## $ Horsepower <dbl> 46.66667, 66.66667, 57.33333, 57.33333, 69.33333, 36.6666~
## $ Fueltank <dbl> 48.88889, 66.66667, 62.59259, 78.14815, 78.14815, 60.7407~
This dataset consists of 93 rows and 11 columns. Below is an explanation of each column.
- MPG.Highway : highway mileage of the vehicle, in miles per gallon.
- Passengers : number of passengers.
- Length : length of the vehicle.
- Wheelbase : distance between the centres of the front and rear wheels.
- Width : width of the vehicle.
- U.Turn.Space : space the vehicle needs to make a U-turn.
- Rear.seat : size of the rear seat.
- Luggage : luggage capacity of the vehicle.
- Weight : weight of the vehicle.
- Horsepower : horsepower of the vehicle.
- Fueltank : size of the fuel tank.
Exploratory data analysis
To make our analysis process smoother, we have to make sure that the dataset has no missing values, has simple yet clear column names, and uses appropriate data types.
1. Checking for missing values
mpg %>%
  summarise_all(~sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = 'column_name', values_to = 'missing_values')

## # A tibble: 11 x 2
## column_name missing_values
## <chr> <int>
## 1 MPG.Highway 0
## 2 Passengers 0
## 3 Length 0
## 4 Wheelbase 0
## 5 Width 0
## 6 U.Turn.Space 0
## 7 Rear.seat 0
## 8 Luggage 0
## 9 Weight 0
## 10 Horsepower 0
## 11 Fueltank 0
Fortunately, this dataset contains no missing values and already comes with appropriate column names and data types, so we don't need to modify any of the columns.
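As an extra sanity check on the data types, here is a quick base R one-liner (redundant with the glimpse() output above, but it makes the point explicit):

sapply(mpg, class) # every column should be numeric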
2. Correlation between variables
To see whether the predictors have a positive or negative correlation with our target variable (MPG.Highway), we can draw a correlation map.
ggcorr(mpg, label = TRUE, hjust = 1, layout.exp = 2)

As we can see, almost all variables have a strong negative correlation with MPG.Highway, with Weight and Fueltank showing the strongest correlations.
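If you prefer numbers over colours, a minimal sketch computes the same correlations directly with base R (this check is my addition, not part of the original post):

# correlation of every column with MPG.Highway, sorted from
# most negative to most positive
sort(cor(mpg)[, "MPG.Highway"])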
3. Checking outliers
Sometimes outliers can affect our models negatively, for example by making the data deviate from a normal distribution. To prevent this, we check whether our dataset contains outliers.
Below is the distribution of our data.
boxplot(mpg)

Let's remove the rows that are outliers in our target variable.
mpg <- mpg %>% filter(MPG.Highway < 80)
boxplot(mpg)

nrow(mpg) # number of observations after outlier removal

## [1] 89
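The cutoff of 80 was read off the boxplot by eye. As a more systematic alternative, here is a minimal sketch of the common 1.5 x IQR rule for flagging outliers in the target variable, applied before the filter above (the rule is my assumption, not the method used in this post):

# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q   <- quantile(mpg$MPG.Highway, c(0.25, 0.75))
iqr <- q[2] - q[1]
outlier <- mpg$MPG.Highway < q[1] - 1.5 * iqr |
  mpg$MPG.Highway > q[2] + 1.5 * iqr
sum(outlier) # how many rows the rule would flag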
Making the models
We will make two linear regression models: one with a single predictor and one with multiple predictors.
1. Single predictor
model0 <- lm(MPG.Highway ~ Weight, mpg)
summary(model0)

##
## Call:
## lm(formula = MPG.Highway ~ Weight, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7788 -3.5376 -0.0439 3.1131 11.7634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.29733 2.90540 32.46 <0.0000000000000002 ***
## Weight -0.49459 0.03766 -13.13 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.826 on 87 degrees of freedom
## Multiple R-squared: 0.6647, Adjusted R-squared: 0.6608
## F-statistic: 172.4 on 1 and 87 DF, p-value: < 0.00000000000000022
plot(mpg$MPG.Highway, mpg$Weight)

As we can see from the model summary and the scatter plot, Weight and MPG.Highway have a strong negative relationship, and the model has an adjusted R-squared of 0.6608.
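To visualize the fitted line itself, here is a minimal sketch that puts Weight on the x-axis, matching the model formula, so abline() can overlay model0's coefficients:

# scatter plot oriented the same way as the model formula,
# with the fitted regression line drawn on top
plot(mpg$Weight, mpg$MPG.Highway,
     xlab = "Weight (scaled)", ylab = "MPG.Highway (scaled)")
abline(model0, col = "red")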
2. Multiple predictors
Next, we will try to select predictor variables automatically using step-wise regression with the backward elimination method.
model1 <- lm(MPG.Highway ~ ., mpg)
step(model1, direction = "backward", trace = 0)

##
## Call:
## lm(formula = MPG.Highway ~ Length + Rear.seat + Luggage + Weight +
## Fueltank, data = mpg)
##
## Coefficients:
## (Intercept) Length Rear.seat Luggage Weight Fueltank
## 77.17663 0.39395 -0.10549 0.09243 -0.49724 -0.21615
summary(lm(MPG.Highway ~ Passengers + Wheelbase + Width + Weight +
    Fueltank, data = mpg))

##
## Call:
## lm(formula = MPG.Highway ~ Passengers + Wheelbase + Width + Weight +
## Fueltank, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3009 -2.8087 0.0806 2.8250 11.3622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.80010 16.31459 2.562 0.01221 *
## Passengers -0.08893 0.05126 -1.735 0.08645 .
## Wheelbase 0.49819 0.20649 2.413 0.01804 *
## Width 0.39553 0.20280 1.950 0.05451 .
## Weight -0.56068 0.10914 -5.137 0.00000181 ***
## Fueltank -0.25107 0.08606 -2.918 0.00454 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.426 on 83 degrees of freedom
## Multiple R-squared: 0.731, Adjusted R-squared: 0.7148
## F-statistic: 45.11 on 5 and 83 DF, p-value: < 0.00000000000000022
This step-wise regression method produces an optimal formula based on the lowest AIC value; the lower the AIC, the less information about the data the model fails to capture.
Compared with the initial model that uses only the Weight variable, the second regression model with multiple predictors has an adjusted R-squared of 0.7148, higher than the previous model's 0.6608.
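To make that comparison explicit, here is a minimal sketch that pulls both metrics out of the fitted objects (note that model1 is the full model defined above, not the stepped-down formula, so its numbers may differ slightly):

# side-by-side comparison of the two models
data.frame(
  model  = c("model0", "model1"),
  adj_r2 = c(summary(model0)$adj.r.squared,
             summary(model1)$adj.r.squared),
  AIC    = c(AIC(model0), AIC(model1))
)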
Prediction and error
Now, we will try to predict the MPG.Highway using the created models.
1. Model0 - Single predictor
First, we will try to predict MPG.Highway value using the first model with Weight predictor only.
predicted_mpg_model0 <- predict(object = model0, newdata = mpg)
head(predicted_mpg_model0) # prediction results

## 1 2 3 4 5 6
## 61.70645 51.40509 53.63404 53.27259 50.44122 59.59799

head(mpg$MPG.Highway) # actual values

## [1] 62 50 52 52 60 62
2. Model1 - Multiple predictor
Second, we will try to predict MPG.Highway value using the second model with multiple predictors.
predicted_mpg_model1 <- predict(object = model1, newdata = mpg)
head(predicted_mpg_model1) # prediction results

## 1 2 3 4 5 6
## 63.18617 54.29451 52.06590 51.30440 47.97244 60.96240

head(mpg$MPG.Highway) # actual values

## [1] 62 50 52 52 60 62
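To see how close both sets of predictions sit to the actual values, here is a minimal sketch plotting them against each other, using only objects already created above:

# predicted vs actual; points near the diagonal are accurate
plot(mpg$MPG.Highway, predicted_mpg_model0,
     xlab = "Actual MPG.Highway", ylab = "Predicted MPG.Highway")
points(mpg$MPG.Highway, predicted_mpg_model1, col = "blue", pch = 2)
abline(0, 1, col = "red") # perfect-prediction line
legend("topleft", legend = c("model0", "model1"),
       col = c("black", "blue"), pch = c(1, 2))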
3. Calculate error
MAPE(predicted_mpg_model0, mpg$MPG.Highway) # model0 prediction error

## [1] 0.06869217

MAPE(predicted_mpg_model1, mpg$MPG.Highway) # model1 prediction error

## [1] 0.05405573
The MAPE calculation shows that model1 has the smaller MAPE, but the difference between the model0 and model1 errors is only about 0.015 (1.5 percentage points), so in my opinion model0 is better.
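For reference, MAPE is simply the mean of the absolute percentage errors; here is a minimal sketch reproducing MLmetrics::MAPE by hand for model0:

# MAPE = mean(|actual - predicted| / |actual|)
mean(abs(mpg$MPG.Highway - predicted_mpg_model0) /
  abs(mpg$MPG.Highway))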
Model evaluation
1. Normality
hist(model0$residuals, breaks = 20)
shapiro.test(model0$residuals)

##
## Shapiro-Wilk normality test
##
## data: model0$residuals
## W = 0.9951, p-value = 0.9864
The p-value for model0 is > 0.05, so the residuals of our model can be considered normally distributed. This also means that the model's errors are spread evenly around their mean.
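A Q-Q plot is another common way to eyeball the same assumption; here is a minimal base R sketch (this is my addition, not one of the original checks):

# residuals should follow the straight line if they are normal
qqnorm(model0$residuals)
qqline(model0$residuals, col = "red")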
2. Homoscedasticity
plot(model0$fitted.values, model0$residuals)
abline(h = 0, col = "red")

bptest(model0)

##
## studentized Breusch-Pagan test
##
## data: model0
## BP = 2.1982, df = 1, p-value = 0.1382
The p-value for model0 is > 0.05, so our model meets the homoscedasticity assumption. This also means the residuals show no systematic pattern (no heteroscedasticity): the patterns present in the data have been captured by the model.
Conclusion
The model0 we created has an adjusted R-squared of 0.6608 and a MAPE of 0.06869217, while model1 has an adjusted R-squared of 0.7148 and a MAPE of 0.05405573.
Even though model1 has a better adjusted R-squared and MAPE than model0, the values are not far apart. So in my opinion, model0, which uses only one predictor, is the better choice.