The first step before you start modeling is to explore your data. In this course we'll practice using tidyverse functions for exploratory data analysis. Start off this case study by examining your data set and visualizing the distribution of fuel efficiency. The ggplot2 package, with functions like ggplot() and geom_histogram(), is included in the tidyverse.
# Load the tidyverse (ggplot2, dplyr, tidyr, readr, ...), used throughout
library(tidyverse)
cars2018 <- read.csv("D:/VIT/DATA SCIENCE/cars2018.csv")
# Print the first rows of cars2018
head(cars2018)
## Model Model.Index Displacement Cylinders Gears Transmission
## 1 Acura NSX 57 3.5 6 9 Manual
## 2 ALFA ROMEO 4C 410 1.8 4 6 Manual
## 3 Audi R8 AWD 65 5.2 10 7 Manual
## 4 Audi R8 RWD 71 5.2 10 7 Manual
## 5 Audi R8 Spyder AWD 66 5.2 10 7 Manual
## 6 Audi R8 Spyder RWD 72 5.2 10 7 Manual
## MPG Aspiration Lockup.Torque.Converter
## 1 21 Turbocharged/Supercharged Y
## 2 28 Turbocharged/Supercharged Y
## 3 17 Naturally Aspirated Y
## 4 18 Naturally Aspirated Y
## 5 17 Naturally Aspirated Y
## 6 18 Naturally Aspirated Y
## Drive Max.Ethanol Recommended.Fuel
## 1 All Wheel Drive 10 Premium Unleaded Required
## 2 2-Wheel Drive, Rear 10 Premium Unleaded Required
## 3 All Wheel Drive 15 Premium Unleaded Recommended
## 4 2-Wheel Drive, Rear 15 Premium Unleaded Recommended
## 5 All Wheel Drive 15 Premium Unleaded Recommended
## 6 2-Wheel Drive, Rear 15 Premium Unleaded Recommended
## Intake.Valves.Per.Cyl Exhaust.Valves.Per.Cyl Fuel.injection
## 1 2 2 Direct ignition
## 2 2 2 Direct ignition
## 3 2 2 Direct ignition
## 4 2 2 Direct ignition
## 5 2 2 Direct ignition
## 6 2 2 Direct ignition
# Plot the histogram
ggplot(cars2018, aes(x = MPG)) +
  geom_histogram(bins = 25) +
  labs(y = "Number of cars",
       x = "Fuel efficiency (mpg)")
## Building a simple linear model
Before embarking on more complex machine learning models, it’s a good idea to build the simplest possible model to get an idea of what is going on. In this case, that means fitting a simple linear model using base R’s lm() function.
# Drop the two identifier columns, which are not predictors
cars_vars <- cars2018 %>%
  select(-Model, -Model.Index)
# Fit a linear model
fit_all <- lm(MPG ~ ., data = cars_vars)
# Print the summary of the model
summary(fit_all)
##
## Call:
## lm(formula = MPG ~ ., data = cars_vars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5261 -1.6473 -0.1096 1.3572 26.5045
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 44.539519 1.176283 37.865
## Displacement -3.786147 0.264845 -14.296
## Cylinders 0.520284 0.161802 3.216
## Gears 0.157674 0.069984 2.253
## TransmissionCVT 4.877637 0.404051 12.072
## TransmissionManual -1.074608 0.366075 -2.935
## AspirationTurbocharged/Supercharged -2.190248 0.267559 -8.186
## Lockup.Torque.ConverterY -2.624494 0.381252 -6.884
## Drive2-Wheel Drive, Rear -2.676716 0.291044 -9.197
## Drive4-Wheel Drive -3.397532 0.335147 -10.137
## DriveAll Wheel Drive -2.941084 0.257174 -11.436
## Max.Ethanol -0.007377 0.005898 -1.251
## Recommended.FuelPremium Unleaded Required -0.403935 0.262413 -1.539
## Recommended.FuelRegular Unleaded Recommended -0.996343 0.272495 -3.656
## Intake.Valves.Per.Cyl -1.446107 1.620575 -0.892
## Exhaust.Valves.Per.Cyl -2.469747 1.547748 -1.596
## Fuel.injectionMultipoint/sequential ignition -0.658428 0.243819 -2.700
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Displacement < 2e-16 ***
## Cylinders 0.001339 **
## Gears 0.024450 *
## TransmissionCVT < 2e-16 ***
## TransmissionManual 0.003398 **
## AspirationTurbocharged/Supercharged 7.24e-16 ***
## Lockup.Torque.ConverterY 9.65e-12 ***
## Drive2-Wheel Drive, Rear < 2e-16 ***
## Drive4-Wheel Drive < 2e-16 ***
## DriveAll Wheel Drive < 2e-16 ***
## Max.Ethanol 0.211265
## Recommended.FuelPremium Unleaded Required 0.124010
## Recommended.FuelRegular Unleaded Recommended 0.000268 ***
## Intake.Valves.Per.Cyl 0.372400
## Exhaust.Valves.Per.Cyl 0.110835
## Fuel.injectionMultipoint/sequential ignition 0.007028 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.916 on 1127 degrees of freedom
## Multiple R-squared: 0.7314, Adjusted R-squared: 0.7276
## F-statistic: 191.8 on 16 and 1127 DF, p-value: < 2.2e-16
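As an optional aside (not part of the original exercise), the broom package, which is loaded later alongside yardstick, can turn this coefficient table into a tidy data frame for easier filtering and plotting:
# Optional: inspect the coefficients as a data frame rather than printed output
library(broom)
tidy(fit_all)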
Training models based on all of your data at once is typically not the best choice. Instead, you can create subsets of your data for different purposes, such as training your model and then testing it. Creating training/testing splits helps you detect overfitting: when you evaluate your model on data that it was not trained on, you get a better estimate of how it will perform on new data.
# Load caret
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Split the data into training and test sets
set.seed(1234)
in_train <- createDataPartition(cars_vars$Transmission, p = 0.8, list = FALSE)
training <- cars_vars[in_train, ]
testing <- cars_vars[-in_train, ]
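As a quick sanity check (not part of the original exercise), you can confirm that the split came out roughly 80/20:
# Check the proportion of rows in each subset
nrow(training) / nrow(cars_vars)
nrow(testing) / nrow(cars_vars)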
Now that your training data is ready, you can fit a set of models with caret. The train() function from caret is flexible and powerful: it allows you to try out many different kinds of models and fitting procedures. To start off, train one linear regression model and one random forest model, without any resampling. (This is what trainControl(method = "none") does; it turns off all resampling.)
# Train a linear regression model
fit_lm <- train(log(MPG) ~ ., method = "lm", data = training,
                trControl = trainControl(method = "none"))
# Print the model object
fit_lm
## Linear Regression
##
## 916 samples
## 12 predictor
##
## No pre-processing
## Resampling: None
# Train a random forest model
fit_rf <- train(log(MPG) ~ ., method = "rf", data = training,
                trControl = trainControl(method = "none"))
# Print the model object
fit_rf
## Random Forest
##
## 916 samples
## 12 predictor
##
## No pre-processing
## Resampling: None
The fit_lm and fit_rf models you just trained are in your environment. It's time to evaluate them! For regression models, we will focus on the root mean squared error (RMSE). This metric is measured in the same units as the quantity being predicted, and lower values indicate a better fit to the data. It's not too hard to calculate RMSE manually, but the yardstick package, by Max Kuhn (also the developer of caret), offers convenient functions for this and other model performance metrics.
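For reference, here is the manual calculation (a minimal sketch; note that because the models were fit to log(MPG), their predictions are on the log scale):
# RMSE by hand: square root of the mean squared prediction error,
# computed on the log scale since the models predict log(MPG)
preds_lm <- predict(fit_lm, training)
sqrt(mean((log(training$MPG) - preds_lm)^2))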
# Load yardstick
library(yardstick)
## Loading required package: broom
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## mnLogLoss, precision, recall
## The following object is masked from 'package:readr':
##
## spec
# Create the new columns
results <- training %>%
  mutate(`Linear regression` = predict(fit_lm, training),
         `Random forest` = predict(fit_rf, training))
# Evaluate the performance
metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.9 0.702
metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.9 0.845
According to these metrics, the random forest model fits the training data more closely than the linear regression model (R² of 0.845 versus 0.702). One caveat: the models were trained on log(MPG), so their predictions are on the log scale while truth = MPG is on the raw scale. This scale mismatch inflates the RMSE values; the R² values, which are based on correlation, still support a relative comparison.
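To evaluate everything on a single scale in actual mpg units, one option (a sketch, not part of the original exercise) is to back-transform the predictions with exp() before computing the metrics:
# Hypothetical fix-up: exponentiate the log-scale predictions back to mpg
results_mpg <- training %>%
  mutate(`Linear regression` = exp(predict(fit_lm, training)),
         `Random forest` = exp(predict(fit_rf, training)))
metrics(results_mpg, truth = MPG, estimate = `Linear regression`)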
That is how these models perform on the training data, the data we used to build them in the first place. Let's evaluate how these simple models perform on the testing data.
# Create the new columns
results <- testing %>%
  mutate(`Linear regression` = predict(fit_lm, testing),
         `Random forest` = predict(fit_rf, testing))
# Evaluate the performance
metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.5 0.799
metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.5 0.880
The metrics are no worse on the testing data for either model, indicating that we have not overfitted in either case.
So far we have trained linear regression and random forest models without any resampling. Resampling can improve the accuracy of machine learning models and reduce overfitting.
Let's try bootstrap resampling, which means creating data sets the same size as the original one by randomly drawing with replacement from the original. In caret, the default behavior for bootstrapping is 25 resamples, but you can change this using trainControl() if desired.
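For example (an illustrative sketch; the models below keep the default of 25), the number argument of trainControl() sets the resample count:
# Hypothetical: use 10 bootstrap resamples instead of the default 25
ctrl_boot10 <- trainControl(method = "boot", number = 10)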
# Fit the models with bootstrap resampling
cars_lm_bt <- train(log(MPG) ~ ., method = "lm", data = training,
                    trControl = trainControl(method = "boot"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
cars_rf_bt <- train(log(MPG) ~ ., method = "rf", data = training,
                    trControl = trainControl(method = "boot"))
# Quick look at the models
cars_lm_bt
## Linear Regression
##
## 916 samples
## 12 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 916, 916, 916, 916, 916, 916, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1036278 0.7890514 0.07656104
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
cars_rf_bt
## Random Forest
##
## 916 samples
## 12 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 916, 916, 916, 916, 916, 916, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.10015480 0.8205322 0.07299305
## 9 0.08758544 0.8466598 0.06129895
## 16 0.09100659 0.8360034 0.06313542
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
We just trained models using bootstrap resampling: cars_lm_bt and cars_rf_bt are available in your environment, fit on the entire training set. Now let's evaluate how those models performed and compare them. We will again use metrics() from the yardstick package, and we will also plot the model predictions to inspect them visually.
Notice in this code how we use gather() from tidyr (another tidyverse package) to tidy the data frame and prepare it for plotting with ggplot2.
results <- testing %>%
  mutate(`Linear regression` = predict(cars_lm_bt, testing),
         `Random forest` = predict(cars_rf_bt, testing))
metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.5 0.799
metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.5 0.903
results %>%
  gather(Method, Result, `Linear regression`:`Random forest`) %>%
  ggplot(aes(log(MPG), Result, color = Method)) +
  geom_point(size = 1.5, alpha = 0.5) +
  facet_wrap(~Method) +
  geom_abline(lty = 2, color = "gray50") +
  geom_smooth(method = "lm")
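As a side note, in tidyr 1.0.0 and later, pivot_longer() supersedes gather(). An equivalent reshaping step (a sketch, assuming a recent tidyr) would be:
# Equivalent to the gather() call above, using the newer tidyr API
results %>%
  pivot_longer(`Linear regression`:`Random forest`,
               names_to = "Method", values_to = "Result")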
Both the model metrics and the plots show that the random forest model performs better: we can predict fuel efficiency more accurately with a random forest model.