The first step before you start modeling is to explore your data. In this course we'll practice using tidyverse functions for exploratory data analysis. Start off this case study by examining your data set and visualizing the distribution of fuel efficiency. The ggplot2 package, with functions like ggplot() and geom_histogram(), is included in the tidyverse.
# Load the tidyverse (ggplot2, dplyr, tidyr, readr, ...), used throughout
library(tidyverse)
cars2018 <- read.csv("D:/VIT/DATA SCIENCE/cars2018.csv")
# Print the first rows of cars2018
head(cars2018)
## Model Model.Index Displacement Cylinders Gears Transmission
## 1 Acura NSX 57 3.5 6 9 Manual
## 2 ALFA ROMEO 4C 410 1.8 4 6 Manual
## 3 Audi R8 AWD 65 5.2 10 7 Manual
## 4 Audi R8 RWD 71 5.2 10 7 Manual
## 5 Audi R8 Spyder AWD 66 5.2 10 7 Manual
## 6 Audi R8 Spyder RWD 72 5.2 10 7 Manual
## MPG Aspiration Lockup.Torque.Converter
## 1 21 Turbocharged/Supercharged Y
## 2 28 Turbocharged/Supercharged Y
## 3 17 Naturally Aspirated Y
## 4 18 Naturally Aspirated Y
## 5 17 Naturally Aspirated Y
## 6 18 Naturally Aspirated Y
## Drive Max.Ethanol Recommended.Fuel
## 1 All Wheel Drive 10 Premium Unleaded Required
## 2 2-Wheel Drive, Rear 10 Premium Unleaded Required
## 3 All Wheel Drive 15 Premium Unleaded Recommended
## 4 2-Wheel Drive, Rear 15 Premium Unleaded Recommended
## 5 All Wheel Drive 15 Premium Unleaded Recommended
## 6 2-Wheel Drive, Rear 15 Premium Unleaded Recommended
## Intake.Valves.Per.Cyl Exhaust.Valves.Per.Cyl Fuel.injection
## 1 2 2 Direct ignition
## 2 2 2 Direct ignition
## 3 2 2 Direct ignition
## 4 2 2 Direct ignition
## 5 2 2 Direct ignition
## 6 2 2 Direct ignition
# Plot the histogram
ggplot(cars2018, aes(x = MPG)) +
  geom_histogram(bins = 25) +
  labs(y = "Number of cars",
       x = "Fuel efficiency (mpg)")
## Building a simple linear model
Before embarking on more complex machine learning models, it’s a good idea to build the simplest possible model to get an idea of what is going on. In this case, that means fitting a simple linear model using base R’s lm() function.
# Drop the two identifier columns, which are not predictors
cars_vars <- cars2018 %>%
  select(-Model, -Model.Index)
# Fit a linear model
fit_all <- lm(MPG ~ ., data = cars_vars)
# Print the summary of the model
summary(fit_all)
##
## Call:
## lm(formula = MPG ~ ., data = cars_vars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5261 -1.6473 -0.1096 1.3572 26.5045
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 44.539519 1.176283 37.865
## Displacement -3.786147 0.264845 -14.296
## Cylinders 0.520284 0.161802 3.216
## Gears 0.157674 0.069984 2.253
## TransmissionCVT 4.877637 0.404051 12.072
## TransmissionManual -1.074608 0.366075 -2.935
## AspirationTurbocharged/Supercharged -2.190248 0.267559 -8.186
## Lockup.Torque.ConverterY -2.624494 0.381252 -6.884
## Drive2-Wheel Drive, Rear -2.676716 0.291044 -9.197
## Drive4-Wheel Drive -3.397532 0.335147 -10.137
## DriveAll Wheel Drive -2.941084 0.257174 -11.436
## Max.Ethanol -0.007377 0.005898 -1.251
## Recommended.FuelPremium Unleaded Required -0.403935 0.262413 -1.539
## Recommended.FuelRegular Unleaded Recommended -0.996343 0.272495 -3.656
## Intake.Valves.Per.Cyl -1.446107 1.620575 -0.892
## Exhaust.Valves.Per.Cyl -2.469747 1.547748 -1.596
## Fuel.injectionMultipoint/sequential ignition -0.658428 0.243819 -2.700
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Displacement < 2e-16 ***
## Cylinders 0.001339 **
## Gears 0.024450 *
## TransmissionCVT < 2e-16 ***
## TransmissionManual 0.003398 **
## AspirationTurbocharged/Supercharged 7.24e-16 ***
## Lockup.Torque.ConverterY 9.65e-12 ***
## Drive2-Wheel Drive, Rear < 2e-16 ***
## Drive4-Wheel Drive < 2e-16 ***
## DriveAll Wheel Drive < 2e-16 ***
## Max.Ethanol 0.211265
## Recommended.FuelPremium Unleaded Required 0.124010
## Recommended.FuelRegular Unleaded Recommended 0.000268 ***
## Intake.Valves.Per.Cyl 0.372400
## Exhaust.Valves.Per.Cyl 0.110835
## Fuel.injectionMultipoint/sequential ignition 0.007028 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.916 on 1127 degrees of freedom
## Multiple R-squared: 0.7314, Adjusted R-squared: 0.7276
## F-statistic: 191.8 on 16 and 1127 DF, p-value: < 2.2e-16
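As an optional aside (not part of the original exercise), the broom package, which is loaded later alongside yardstick, can turn this coefficient table into a tidy data frame for easier filtering and plotting:
# Optional: inspect the coefficients as a data frame rather than printed output
library(broom)
tidy(fit_all)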
Training models based on all of your data at once is typically not the best choice. Instead, you can create subsets of your data for different purposes, such as training your model and then testing it. Creating training/testing splits helps you detect overfitting: when you evaluate your model on data that it was not trained on, you get a better estimate of how it will perform on new data.
# Load caret
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Split the data into training and test sets
set.seed(1234)
in_train <- createDataPartition(cars_vars$Transmission, p = 0.8, list = FALSE)
training <- cars_vars[in_train, ]
testing <- cars_vars[-in_train, ]
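As a quick sanity check (not part of the original exercise), you can confirm that the split came out roughly 80/20:
# Check the proportion of rows in each subset
nrow(training) / nrow(cars_vars)
nrow(testing) / nrow(cars_vars)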
Now that your training data is ready, you can fit a set of models with caret. The train() function from caret is flexible and powerful: it allows you to try out many different kinds of models and fitting procedures. To start off, train one linear regression model and one random forest model, without any resampling. (This is what trainControl(method = "none") does; it turns off all resampling.)
# Train a linear regression model
fit_lm <- train(log(MPG) ~ ., method = "lm", data = training,
                trControl = trainControl(method = "none"))
# Print the model object
fit_lm
## Linear Regression
##
## 916 samples
## 12 predictor
##
## No pre-processing
## Resampling: None
# Train a random forest model
fit_rf <- train(log(MPG) ~ ., method = "rf", data = training,
                trControl = trainControl(method = "none"))
# Print the model object
fit_rf
## Random Forest
##
## 916 samples
## 12 predictor
##
## No pre-processing
## Resampling: None
The fit_lm and fit_rf models you just trained are in your environment. It's time to evaluate them! For regression models, we will focus on the root mean squared error (RMSE). This metric is measured in the same units as the quantity being predicted, and lower values indicate a better fit to the data. It's not too hard to calculate RMSE manually, but the yardstick package, by Max Kuhn (also the developer of caret), offers convenient functions for this and other model performance metrics.
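For reference, here is the manual calculation (a minimal sketch; note that because the models were fit to log(MPG), their predictions are on the log scale):
# RMSE by hand: square root of the mean squared prediction error,
# computed on the log scale since the models predict log(MPG)
preds_lm <- predict(fit_lm, training)
sqrt(mean((log(training$MPG) - preds_lm)^2))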
# Load yardstick
library(yardstick)
## Loading required package: broom
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## mnLogLoss, precision, recall
## The following object is masked from 'package:readr':
##
## spec
# Create the new columns
results <- training %>%
  mutate(`Linear regression` = predict(fit_lm, training),
         `Random forest` = predict(fit_rf, training))
# Evaluate the performance
metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.9 0.702
metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.9 0.845
According to these metrics, the random forest model fits the training data more closely than the linear regression model (R² of 0.845 versus 0.702). One caveat: the models were trained on log(MPG), so their predictions are on the log scale while truth = MPG is on the raw scale. This scale mismatch inflates the RMSE values; the R² values, which are based on correlation, still support a relative comparison.
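To evaluate everything on a single scale in actual mpg units, one option (a sketch, not part of the original exercise) is to back-transform the predictions with exp() before computing the metrics:
# Hypothetical fix-up: exponentiate the log-scale predictions back to mpg
results_mpg <- training %>%
  mutate(`Linear regression` = exp(predict(fit_lm, training)),
         `Random forest` = exp(predict(fit_rf, training)))
metrics(results_mpg, truth = MPG, estimate = `Linear regression`)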
That is how these models perform on the training data, the data we used to build them in the first place. Let's evaluate how these simple models perform on the testing data.
# Create the new columns
results <- testing %>%
  mutate(`Linear regression` = predict(fit_lm, testing),
         `Random forest` = predict(fit_rf, testing))
# Evaluate the performance
metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.5 0.799
metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.5 0.880
The metrics are no worse on the testing data for either model, indicating that we have not overfitted in either case.
So far we have trained linear regression and random forest models without any resampling. Resampling can improve the accuracy of machine learning models and reduce overfitting.
Let's try bootstrap resampling, which means creating data sets the same size as the original one by randomly drawing with replacement from the original. In caret, the default behavior for bootstrapping is 25 resamples, but you can change this using trainControl() if desired.
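For example (an illustrative sketch; the models below keep the default of 25), the number argument of trainControl() sets the resample count:
# Hypothetical: use 10 bootstrap resamples instead of the default 25
ctrl_boot10 <- trainControl(method = "boot", number = 10)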
# Fit the models with bootstrap resampling
cars_lm_bt <- train(log(MPG) ~ ., method = "lm", data = training,
                    trControl = trainControl(method = "boot"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
cars_rf_bt <- train(log(MPG) ~ ., method = "rf", data = training,
                    trControl = trainControl(method = "boot"))
# Quick look at the models
cars_lm_bt
## Linear Regression
##
## 916 samples
## 12 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 916, 916, 916, 916, 916, 916, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1036278 0.7890514 0.07656104
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
cars_rf_bt
## Random Forest
##
## 916 samples
## 12 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 916, 916, 916, 916, 916, 916, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.10015480 0.8205322 0.07299305
## 9 0.08758544 0.8466598 0.06129895
## 16 0.09100659 0.8360034 0.06313542
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.
We just trained models using bootstrap resampling: cars_lm_bt and cars_rf_bt are available in your environment, fit on the entire training set. Now let's evaluate how those models performed and compare them. We will again use metrics() from the yardstick package, and we will also plot the model predictions to inspect them visually.
Notice in this code how we use gather() from tidyr (another tidyverse package) to tidy the data frame and prepare it for plotting with ggplot2.
results <- testing %>%
  mutate(`Linear regression` = predict(cars_lm_bt, testing),
         `Random forest` = predict(cars_rf_bt, testing))
metrics(results, truth = MPG, estimate = `Linear regression`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.5 0.799
metrics(results, truth = MPG, estimate = `Random forest`)
## # A tibble: 1 x 2
## rmse rsq
## <dbl> <dbl>
## 1 20.5 0.903
results %>%
  gather(Method, Result, `Linear regression`:`Random forest`) %>%
  ggplot(aes(log(MPG), Result, color = Method)) +
  geom_point(size = 1.5, alpha = 0.5) +
  facet_wrap(~Method) +
  geom_abline(lty = 2, color = "gray50") +
  geom_smooth(method = "lm")
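As a side note, in tidyr 1.0.0 and later, pivot_longer() supersedes gather(). An equivalent reshaping step (a sketch, assuming a recent tidyr) would be:
# Equivalent to the gather() call above, using the newer tidyr API
results %>%
  pivot_longer(`Linear regression`:`Random forest`,
               names_to = "Method", values_to = "Result")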
Both the model metrics and the plots show that the random forest model performs better: we can predict fuel efficiency more accurately with a random forest model.