Regression Showdown: Finding the Best Predictive Model

Author

Saurabh C Srivastava

Published

March 5, 2025

Introduction

This exercise evaluates the performance of various regression models using the ‘mtcars’ dataset. We will compare models such as Linear Regression, Polynomial Regression, Support Vector Machines, Decision Trees, and Random Forests to determine which is most suitable for accurate predictions while avoiding issues like overfitting and underfitting.

The process addresses several key steps: identifying an appropriate dependent variable, handling any missing values, detecting and treating outliers if necessary, creating binary features for categorical variables, and determining which model provides the best balance of predictive accuracy and generalization for this dataset.

Data Preprocessing & Exploratory Data Analysis (EDA)

After loading the data, the very first step I took was to check for any missing values, blanks, or NA values. Upon confirming that there were no such discrepancies, I examined the data’s distribution and how the variables correlated with each other. For this, I used histograms, box plots, and a correlation matrix (refer to the plots below). A few key observations from the plots included the inverse relationship between the number of cylinders and miles per gallon (mpg): as the number of cylinders increases, miles per gallon tends to decrease.

Furthermore, horsepower (hp) exhibits a strong inverse relationship with miles per gallon, demonstrating that increased horsepower correlates with decreased fuel efficiency. Horsepower also strongly correlates with the number of cylinders (cyl), indicating that larger engines produce more power. Additionally, higher horsepower (hp) correlates with lower quarter-mile times (qsec), confirming that more powerful engines result in faster acceleration.

Since I did not observe any outliers, I proceeded to the next step: choosing the target variable. I selected horsepower (hp) as the dependent (target) variable for prediction. The rationale behind this choice is that horsepower can plausibly be predicted from other car characteristics such as weight (wt), engine displacement (disp), and the number of cylinders (cyl).

Next, during my analysis of the dataset, I realized that some categorical variables needed to be converted into binary format. I selected the number of cylinders (cyl), engine type (vs), and transmission (am) as categorical variables because they represent distinct, unordered groups rather than continuous numerical values. Converting these variables into binary indicators matters because some machine learning models, such as Support Vector Machines (SVMs) and Random Forests, can behave differently or achieve better performance with binary representations of categorical data. Treating them as categorical also ensures that the model does not incorrectly impose a numerical ordering on these groups. By contrast, while gear and carburetors (carb) are numerically represented, they reflect ordinal relationships: more gears generally imply better performance, and more carburetors often mean more fuel intake, so I left them as numeric.
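For illustration, here is a minimal sketch of the encoding approach used in the appendix code below: convert each variable to a factor, then let model.matrix() with the intercept dropped produce one 0/1 indicator column per level.

# Minimal sketch of one-hot encoding with base R (the same approach as the
# full preprocessing code in the appendix below).
df <- mtcars
df$cyl <- factor(df$cyl)                           # levels "4", "6", "8"
cyl_dummies <- model.matrix(~ cyl - 1, data = df)  # columns cyl4, cyl6, cyl8
head(cyl_dummies, 3)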

Model Selection and Data Splitting

In this stage, the ‘mtcars’ dataset will be split into training and testing sets. I plan to divide the data in a 70:30 ratio; 70% will be used to train the model, and the remaining 30% will be used for prediction. The ‘mtcars’ dataset is relatively small, with only 32 observations. A very large test set would leave too few samples for training, hindering the model’s ability to learn effectively. Conversely, a tiny test set might not be representative and could lead to unreliable performance estimates. A 70:30 split is a common starting point for moderately sized datasets.
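For a quick sense of the numbers: with 32 observations, a 70:30 split yields roughly 22 training and 10 test rows. A minimal sketch using caret's createDataPartition (the same helper used in the appendix, where it is applied to a combined stratification factor rather than directly to hp):

# Sketch: 70:30 split, seeded for reproducibility.
# Partitioning on a numeric target stratifies by its quantiles.
library(caret)
set.seed(123)
idx   <- createDataPartition(mtcars$hp, p = 0.7, list = FALSE)
train <- mtcars[idx, ]   # ~22 rows
test  <- mtcars[-idx, ]  # ~10 rows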

Once the data is divided into training and testing sets, the next step is to determine whether scaling is required. Given the relatively small size of the “mtcars” dataset and the absence of drastically different scales among its variables, scaling is probably unnecessary. I will train and test the data using Linear, Polynomial, Support Vector Machines (SVMs), Decision Tree, and Random Forest models. The reason for using these five models is to determine which provides the best predictions.
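As a quick check behind the scaling decision, one can compare the spread of each variable, as in this hedged sketch; note also that e1071's svm() scales numeric inputs by default (scale = TRUE), which softens the concern for the SVM in particular.

# Sketch: inspect per-variable spread to judge whether scaling is warranted
sapply(mtcars, sd)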

Overall, the training and validation process is an iterative cycle of training, evaluating, and refining the model. It provides crucial insights into model performance, generalization, and potential data issues, ultimately guiding the development of a better machine learning model.

Model Evaluation

To evaluate the models’ performance, I have calculated R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) for each of the five models. This tabular representation facilitates a clearer understanding and easier comparison among the models (refer to the table below).
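For reference, the error metrics follow their standard definitions, where $y_i$ is the actual HP, $\hat{y}_i$ the predicted HP, and $n$ the number of test observations:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert
$$

One caveat: the R-squared reported here is computed as $\mathrm{cor}(\hat{y}, y)^2$, the squared Pearson correlation between predictions and actuals (see the appendix code), which on a held-out test set can differ from the conventional $1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$.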

In addition to numerical evaluation metrics, I have incorporated visual analysis through actual vs. predicted HP graphs for each model. These visualizations serve as an intuitive means to assess the predictive accuracy and potential biases of the models. By examining how closely the predicted values align with the ideal diagonal reference line, one can infer how well each model generalizes to the given data (refer to the plots below).

Results Interpretation

From the performance evaluation table (refer to the table below), we can clearly see that the polynomial regression model has the highest R-squared (0.958) and the lowest Mean Squared Error (287.0), Root Mean Squared Error (16.9), and Mean Absolute Error (13.1), suggesting it is the best fit. However, a very high R-squared, especially in a small dataset like mtcars, can be a sign of overfitting. Therefore, we need to validate the polynomial model on a separate test set or use cross-validation to ensure its generalizability.
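A hedged sketch of that validation step, assuming caret's built-in k-fold cross-validation (with only 32 rows, leave-one-out CV via method = "LOOCV" would also be reasonable); the formula below is a reduced, illustrative version of the polynomial model, not the exact final specification:

# Sketch: 5-fold cross-validation of a reduced polynomial fit via caret
ctrl <- trainControl(method = "cv", number = 5)
set.seed(123)
cv_poly <- train(hp ~ poly(mpg, 2) + poly(wt, 2) + poly(qsec, 2),
                 data = mtcars, method = "lm", trControl = ctrl)
cv_poly$results   # cross-validated RMSE, R-squared, and MAE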

Given the existing performance evaluation results and the apparent risk of overfitting in the polynomial regression model, a simpler model like linear regression could be a better choice. While more data would also help, it is not possible in this case because the mtcars dataset is a standard dataset used for teaching and demonstration purposes and is fixed in size.

Recommendations

Based on the performance evaluation, polynomial regression initially appears to be the best model due to its high R-squared and low error metrics. However, the high R-squared value raises concerns about potential overfitting, especially given the small size of the mtcars dataset. Given the risk of overfitting and the comparable performance of simpler models, linear regression is likely the most practical and robust choice. It offers a good balance between predictive accuracy and model simplicity, reducing the risk of overfitting.

While the polynomial model might perform slightly better if properly validated and tuned, the added complexity may not be justified due to the small dataset and the risk of overfitting. The other models (SVR, Decision Tree, and Random Forest) show reasonable performance but do not clearly outperform the simpler linear regression. However, I will explore tuning the hyperparameters of the SVR and Random Forest models to assess if any improvement over the linear regression baseline is achievable.

Given the limited size of the mtcars dataset, the likelihood of significant gains from tuning is low, but it is worth investigating. Even with optimal hyperparameters, the complexity of these models may not be justified due to the small amount of data available. Therefore, linear regression remains the primary recommendation.
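A hedged sketch of what that tuning could look like, using e1071::tune.svm and randomForest::tuneRF with illustrative, not exhaustive, grids:

# Sketch: grid search over SVR cost and gamma (tune.svm uses 10-fold CV by default)
set.seed(123)
svm_tuned <- tune.svm(hp ~ ., data = train_data,
                      cost = 2^(0:4), gamma = 2^(-4:0))
svm_tuned$best.parameters

# Sketch: search over mtry for the random forest
rf_tuned <- tuneRF(x = train_data[, setdiff(names(train_data), "hp")],
                   y = train_data$hp, ntreeTry = 500,
                   stepFactor = 1.5, improve = 0.01, trace = FALSE)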

Conclusion

This study aimed to evaluate and compare the performance of various regression models using the “mtcars” dataset to determine the most suitable approach for accurate predictions while balancing model complexity and generalization.

To achieve this, we performed thorough data preprocessing and exploratory data analysis (EDA) to understand variable relationships, handle categorical features, and ensure data readiness for modeling. We then trained and tested five regression models (Linear Regression, Polynomial Regression, Support Vector Machines (SVM), Decision Trees, and Random Forests) on a 70:30 train-test split.

From the evaluation metrics, Polynomial Regression initially appeared to be the best-performing model, achieving the highest R-squared (0.958) and lowest error metrics. However, its high R-squared value raised concerns about overfitting, especially considering the small dataset size. Consequently, Linear Regression emerged as the most reliable choice, as it offers a strong balance between accuracy and simplicity, reducing the risk of overfitting.

For future work, hyperparameter tuning of complex models like SVM and Random Forest could be explored to assess potential improvements. Additionally, using a larger dataset with more diverse observations would allow for better model validation and generalizability. Cross-validation techniques could also be applied to enhance confidence in model selection.

In conclusion, while Polynomial Regression demonstrated the best numerical performance, Linear Regression remains the most practical recommendation due to its interpretability, robustness, and reduced likelihood of overfitting in a small dataset.

A more detailed explanation of the methodology, including code and visualizations, can be found below.

library(corrplot)
corrplot 0.95 loaded
library(caret)
Loading required package: ggplot2
Loading required package: lattice
library(e1071) # For SVM
library(rpart) # For Decision Tree
library(randomForest) # For Random Forest
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':

    margin
# 1. Load the mtcars dataset:
mydata <- mtcars


# 2. Explore the data
head(mydata)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mydata)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
str(mydata)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# 2.1 Checking the key statistics using summary
summary(mydata)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
# 2.2 Checking for any missing values in the dataset
missing_values <- sapply(mydata, function(x) sum(is.na(x)))
print(missing_values)
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
   0    0    0    0    0    0    0    0    0    0    0 
# 2.3 Visualizing distribution of variables
# Histogram of mpg
hist(mtcars$mpg)

# Interpretation:
# The distribution appears to be bimodal, with potential peaks around 15-20 and 30-35 mpg, suggesting that the dataset
# may contain at least two distinct groups of cars with different fuel efficiencies.

# Boxplot of mpg for each number of cylinders
boxplot(mpg ~ cyl, data = mtcars)

# Interpretation:
# It is evident from the plot that there is an inverse relationship between 'Number of Cylinders' and 'Miles Per Gallon'.
# As 'Number of Cylinders' increases, 'Miles Per Gallon' tends to decrease.

# Compute and visualize the correlation matrix
correlation_matrix <- cor(mydata[, 1:11])
corrplot(correlation_matrix, method = 'color')

# hp vs. mpg: There is a strong inverse correlation between horsepower and miles per gallon.  As horsepower increases, fuel efficiency (mpg) tends to decrease.
# This indicates that cars with more powerful engines generally consume more fuel.
# hp vs. cyl: Horsepower strongly increases with the number of cylinders, indicating a direct relationship between engine size and power.
# This suggests that cars with more cylinders tend to have significantly more powerful engines.
# hp vs. qsec: Horsepower is inversely correlated with quarter-mile time, meaning that cars with higher horsepower tend to have faster acceleration (lower quarter-mile times).
# This confirms that more powerful engines generally lead to quicker acceleration.


# 2.4 Choosing a suitable target variable for regression
# I am choosing horsepower (hp) as the target variable for regression. We can predict horsepower based on other car characteristics like weight (wt), engine displacement (disp), or number of cylinders (cyl).


# 3. Preprocess the data:
# Create binary features from categorical variables
# I have selected cyl, vs, and am as categorical variables because they represent distinct, unordered groups rather than continuous numerical values.
# Treating them as categorical ensures the model doesn't incorrectly interpret a numerical relationship between these groups
# Why I did not select gear and carb: while gear and carb are numerically represented, they reflect ordinal relationships,
# e.g., more gears generally imply better performance, and more carburetors often mean more fuel intake.

# Converting cyl, vs and am to categorical variable
mydata$cyl <- factor(mtcars$cyl)
mydata$vs <- factor(mtcars$vs)
mydata$am <- factor(mtcars$am)

str(mydata)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
 $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# One-hot encode the cyl, vs and am variables
cyl_dummies <- model.matrix(~ cyl - 1, data = mydata)
vs_dummies <- model.matrix(~ vs - 1, data = mydata)
am_dummies <- model.matrix(~ am - 1, data = mydata)
# Note: cbind() with the original mtcars keeps the numeric cyl, vs and am columns alongside the new dummy columns.
mydata <- cbind(mtcars, cyl_dummies, vs_dummies, am_dummies)


# 4. Split the data:
# Split the data into training and testing sets
# I am using a 70:30 split between the training and testing sets.
# Along with this, I am stratifying the sampling on cyl, am and vs (via their interaction) so both sets contain a similar mix of these groups.
mydata$combined_strat <- interaction(mydata$cyl, mydata$am, mydata$vs)

set.seed(123)
train_index <- createDataPartition(mydata$combined_strat, p = 0.7, list = FALSE)
Warning in createDataPartition(mydata$combined_strat, p = 0.7, list = FALSE):
Some classes have no records ( 4.0.0, 6.0.0, 8.0.1, 6.1.1, 8.1.1 ) and these
will be ignored
Warning in createDataPartition(mydata$combined_strat, p = 0.7, list = FALSE):
Some classes have a single record ( 4.1.0 ) and these will be selected for the
sample
train_data <- mydata[train_index, ]
test_data <- mydata[-train_index, ]


# 5. Scale the features:
# Given the relatively small size of the mtcars dataset and the absence of drastically different scales among its variables, scaling is probably not required.


# 6. Build and evaluate models:
# Linear Regression
# Note: 'hp ~ .' uses every remaining column in train_data as a predictor,
# including the dummy columns and the combined_strat factor created for the split.
lm_model <- lm(hp ~ ., data = train_data)
lm_pred <- predict(lm_model, newdata = test_data)
lm_mse <- mean((lm_pred - test_data$hp)^2)
lm_rmse <- sqrt(mean((lm_pred - test_data$hp)^2))
lm_mae <- mean(abs(lm_pred - test_data$hp))
lm_rsquared <- cor(lm_pred, test_data$hp)^2  # squared correlation, not 1 - SSE/SST

# Polynomial Regression
poly_model <- lm(hp ~ poly(mpg, degree = 2) +
                   poly(wt, degree = 2) +
                   poly(drat, degree = 2) +
                   poly(qsec, degree = 2) +
                   poly(gear, degree = 2) +
                   poly(carb, degree = 2) +
                   cyl4 + cyl6 +   # cylinder dummies (cyl8 as the reference)
                   am1 +           # transmission dummy (am0 as the reference)
                   vs1,            # engine-type dummy (vs0 as the reference)
                 data = train_data)
poly_pred <- predict(poly_model, newdata = test_data)
poly_mse <- mean((poly_pred - test_data$hp)^2)
poly_rmse <- sqrt(mean((poly_pred - test_data$hp)^2))
poly_mae <- mean(abs(poly_pred - test_data$hp))
poly_rsquared <- cor(poly_pred, test_data$hp)^2



# Support Vector Machine (SVM)
svm_model <- svm(hp ~ ., data = train_data)
svm_pred <- predict(svm_model, newdata = test_data)
svm_mse <- mean((svm_pred - test_data$hp)^2)
svm_rmse <- sqrt(mean((svm_pred - test_data$hp)^2))
svm_mae <- mean(abs(svm_pred - test_data$hp))
svm_rsquared <- cor(svm_pred, test_data$hp)^2


# Decision Tree
dt_model <- rpart(hp ~ ., data = train_data, method = "anova")
dt_pred <- predict(dt_model, newdata = test_data)
dt_mse <- mean((dt_pred - test_data$hp)^2)
dt_rmse <- sqrt(mean((dt_pred - test_data$hp)^2))
dt_mae <- mean(abs(dt_pred - test_data$hp))
dt_rsquared <- cor(dt_pred, test_data$hp)^2

# Random Forest
rf_model <- randomForest(hp ~ ., data = train_data)
rf_pred <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_pred - test_data$hp)^2)
rf_rmse <- sqrt(mean((rf_pred - test_data$hp)^2))
rf_mae <- mean(abs(rf_pred - test_data$hp))
rf_rsquared <- cor(rf_pred, test_data$hp)^2

# Creating a data frame to compare model performance
modelEvaluation_df <- data.frame(
  Model = c("Linear Regression", "Polynomial Regression", "Support Vector Regression",
            "Decision Tree Regression", "Random Forest Regression"),
  R_Squared = c(lm_rsquared, poly_rsquared, svm_rsquared, dt_rsquared, rf_rsquared),
  Mean_Square_Error = c(lm_mse, poly_mse, svm_mse, dt_mse, rf_mse),
  Root_Mean_Square_Error = c(lm_rmse, poly_rmse, svm_rmse, dt_rmse, rf_rmse),
  Mean_Absolute_Error = c(lm_mae, poly_mae, svm_mae, dt_mae, rf_mae))

print(modelEvaluation_df)
                      Model R_Squared Mean_Square_Error Root_Mean_Square_Error
1         Linear Regression 0.9264464          299.1991               17.29737
2     Polynomial Regression 0.9576881          287.0342               16.94208
3 Support Vector Regression 0.9353352          582.5020               24.13508
4  Decision Tree Regression 0.9430962          533.9669               23.10772
5  Random Forest Regression 0.9167403          620.7020               24.91389
  Mean_Absolute_Error
1            16.32160
2            13.08631
3            18.87951
4            18.27273
5            18.84389
# Function to plot Actual vs. Predicted HP for each model, annotated with R-squared
plot_predictions <- function(predictions, model_name) {
  plot(test_data$hp, predictions, 
       main = paste(model_name, ": Predicted vs. Actual"),
       xlab = "Actual HP", ylab = "Predicted HP",
       pch = 16,
       col = "blue")
  abline(0, 1, col = "red", lwd = 2)
  r2 <- cor(predictions, test_data$hp)^2
  text(x = min(test_data$hp), y = max(predictions),
       labels = paste("R-squared =", round(r2, 3)),
       adj = c(0, 1),
       col = "darkgreen")
}

# Create scatter plots for each model (R-Squared)
plot_predictions(lm_pred, "Linear Regression")

plot_predictions(poly_pred, "Polynomial Regression")

plot_predictions(svm_pred, "Support Vector Regression")

plot_predictions(dt_pred, "Decision Tree")

plot_predictions(rf_pred, "Random Forest")

# Function to plot residuals (actual minus predicted HP) against actual HP for each model
plot_residuals <- function(predictions, model_name) {
  residuals <- test_data$hp - predictions
  plot(test_data$hp, residuals,
       main = paste(model_name, ": Residual Plot"),
       xlab = "Actual HP", ylab = "Residuals",
       pch = 16,
       col = "blue")
  abline(h = 0, col = "red", lwd = 2)
}


# Create scatter plots for each model (Residual Plot)
plot_residuals(lm_pred, "Linear Regression")

plot_residuals(poly_pred, "Polynomial Regression")

plot_residuals(svm_pred, "Support Vector Regression")

plot_residuals(dt_pred, "Decision Tree")

plot_residuals(rf_pred, "Random Forest")