Regression Showdown: Finding the Best Predictive Model
Author
Saurabh C Srivastava
Published
March 5, 2025
Introduction
This exercise evaluates the performance of various regression models using the ‘mtcars’ dataset. We will compare models such as Linear Regression, Polynomial Regression, Support Vector Machines, Decision Trees, and Random Forests to determine which is most suitable for accurate predictions while avoiding issues like overfitting and underfitting.
This process will involve addressing several key aspects: identifying an appropriate dependent variable, handling any missing values, detecting and treating outliers if necessary, creating binary features for categorical variables, and determining which model provides the most accurate predictions while maintaining the best balance between accuracy and generalization for this dataset.
Data Preprocessing & Exploratory Data Analysis (EDA)
After loading the data, the very first step I took was to check for any missing values, blanks, or NA values. After confirming that there were no such discrepancies, I examined the data's distribution and how the variables correlate with each other, using histograms, box plots, and a correlation matrix (see the plots below). A key observation from the plots was the inverse relationship between the number of cylinders and miles per gallon (mpg): as the number of cylinders increases, miles per gallon tends to decrease.
Furthermore, horsepower (hp) exhibits a strong inverse relationship with miles per gallon, demonstrating that increased horsepower correlates with decreased fuel efficiency. Horsepower also strongly correlates with the number of cylinders (cyl), indicating that larger engines produce more power. Additionally, higher horsepower (hp) correlates with lower quarter-mile times (qsec), confirming that more powerful engines result in faster acceleration.
Since I did not observe any outliers, I proceeded to the next step of choosing the target variable. I selected horsepower (hp) as the dependent (target) variable for prediction. The rationale is that horsepower can plausibly be predicted from other car characteristics such as weight (wt), engine displacement (disp), and the number of cylinders (cyl).
Next, during my analysis of the dataset, I realized that some categorical variables needed to be converted into binary format. I selected Number of Cylinders (cyl), Engine (vs), and Transmission (am) as categorical variables because they represent distinct, unordered groups rather than continuous numerical values. Converting these variables to binary matters because some machine learning models, such as Support Vector Machines (SVMs) and Random Forests, can behave differently or achieve better performance with binary representations of categorical data, and treating them as categorical ensures that the model does not incorrectly infer a numerical relationship between the groups. Gear (gear) and carburetors (carb), while numerically represented, often reflect ordinal relationships: more gears generally imply better performance, and more carburetors often mean more fuel intake, so I left them as numeric. A minimal illustration of the encoding follows.
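As a minimal sketch of this encoding in base R (the full code later in this post uses the same model.matrix() approach; the ~ cyl - 1 formula drops the intercept so that each level gets its own 0/1 column):
# Minimal sketch: dummy-encoding a categorical variable with base R
mydata <- mtcars
mydata$cyl <- factor(mydata$cyl)                       # treat cylinder count as unordered categories
cyl_dummies <- model.matrix(~ cyl - 1, data = mydata)  # one 0/1 column per level, no intercept
head(cyl_dummies)                                      # columns cyl4, cyl6, cyl8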
Model Selection and Data Splitting
In this stage, the ‘mtcars’ dataset will be split into training and testing sets. I plan to divide the data in a 70:30 ratio; 70% will be used to train the model, and the remaining 30% will be used for prediction. The ‘mtcars’ dataset is relatively small, with only 32 observations. A very large test set would leave too few samples for training, hindering the model’s ability to learn effectively. Conversely, a tiny test set might not be representative and could lead to unreliable performance estimates. A 70:30 split is a common starting point for moderately sized datasets.
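As a minimal sketch of the split using caret's createDataPartition() (the full code below additionally stratifies on a combined cyl/am/vs factor):
# Minimal sketch: 70:30 train/test split with caret
set.seed(123)  # for reproducibility
idx <- createDataPartition(mtcars$hp, p = 0.7, list = FALSE)
train_data <- mtcars[idx, ]
test_data <- mtcars[-idx, ]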
Once the data is divided into training and testing sets, the next step is to determine whether scaling is required. Given the relatively small size of the mtcars dataset and the absence of drastically different scales among its variables, scaling is probably unnecessary; in addition, e1071's svm() scales its inputs by default, which covers the one model here most sensitive to feature scales. I will train and test the data using Linear, Polynomial, Support Vector Machine (SVM), Decision Tree, and Random Forest models, with the goal of determining which provides the best predictions.
Overall, the training and validation process is an iterative cycle of training, evaluating, and refining the model. It provides crucial insights into model performance, generalization, and potential data issues, ultimately guiding the development of a better machine learning model.
Model Evaluation
To evaluate model performance, I calculated R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) for each of the five models. The tabular representation facilitates a clearer understanding and easier comparison among the models (see the table below).
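For reference, a minimal sketch of these four metrics as a reusable helper; the per-model calculations in the full code below compute the same quantities inline (note that R-squared is taken as the squared correlation between predictions and actuals):
# Minimal sketch: evaluation metrics for a vector of predictions
eval_metrics <- function(actual, predicted) {
  err <- predicted - actual
  c(R_Squared = cor(predicted, actual)^2,  # squared correlation, as in the code below
    MSE = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    MAE = mean(abs(err)))
}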
In addition to the numerical evaluation metrics, I incorporated visual analysis through actual vs. predicted HP plots for each model. These visualizations offer an intuitive way to assess each model's predictive accuracy and potential biases: by examining how closely the predicted values align with the ideal diagonal reference line, one can judge how well each model generalizes to the given data (see the plots below).
Results Interpretation
From the performance evaluation table (see below), we can clearly see that the polynomial regression model has the highest R-squared (0.958) and the lowest Mean Squared Error (287.0), Root Mean Squared Error (16.9), and Mean Absolute Error (13.1), suggesting it is the best fit. However, a very high R-squared, especially on a small dataset like mtcars, can be a sign of overfitting. Therefore, we need to validate the polynomial model on a separate test set or use cross-validation to ensure its generalizability.
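One way to do this is k-fold cross-validation with caret. A minimal sketch, assuming mydata already contains the dummy columns (cyl4, cyl6, am1, vs1) created in the full code below:
# Minimal sketch: 5-fold cross-validation of the polynomial specification
set.seed(123)
cv_ctrl <- trainControl(method = "cv", number = 5)
poly_cv <- train(hp ~ poly(mpg, 2) + poly(wt, 2) + poly(drat, 2) + poly(qsec, 2) +
                   poly(gear, 2) + poly(carb, 2) + cyl4 + cyl6 + am1 + vs1,
                 data = mydata, method = "lm", trControl = cv_ctrl)
poly_cv$results  # cross-validated RMSE, R-squared, and MAE
With only 32 observations the folds are small, so the cross-validated estimates will be noisy, but a large gap between the resampled RMSE and the test-set RMSE would confirm the overfitting concern.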
Given the existing performance evaluation results and the apparent risk of overfitting in the polynomial regression model, a simpler model like linear regression could be a better choice. While more data would also help, it is not possible in this case because the mtcars dataset is a standard dataset used for teaching and demonstration purposes and is fixed in size.
Recommendations
Based on the performance evaluation, polynomial regression initially appears to be the best model due to its high R-squared and low error metrics. However, the high R-squared value raises concerns about potential overfitting, especially given the small size of the mtcars dataset. Given the risk of overfitting and the comparable performance of simpler models, linear regression is likely the most practical and robust choice. It offers a good balance between predictive accuracy and model simplicity, reducing the risk of overfitting.
While the polynomial model might perform slightly better if properly validated and tuned, the added complexity may not be justified due to the small dataset and the risk of overfitting. The other models (SVR, Decision Tree, and Random Forest) show reasonable performance but do not clearly outperform the simpler linear regression. However, I will explore tuning the hyperparameters of the SVR and Random Forest models to assess if any improvement over the linear regression baseline is achievable.
Given the limited size of the mtcars dataset, the likelihood of significant gains from tuning is low, but it is worth investigating. Even with optimal hyperparameters, the complexity of these models may not be justified due to the small amount of data available. Therefore, linear regression remains the primary recommendation.
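A minimal sketch of what that tuning could look like, using the packages already loaded and the same train_data built in the code section below (the grid values are illustrative assumptions, not final settings):
# Minimal sketch: hyperparameter tuning for SVR and Random Forest
set.seed(123)

# SVR: grid search over cost and gamma with e1071's tune() (10-fold CV by default)
svm_tuned <- tune(svm, hp ~ ., data = train_data,
                  ranges = list(cost = 2^(0:4), gamma = 2^(-4:0)))
svm_tuned$best.parameters

# Random Forest: tune mtry (variables tried per split) with caret
rf_tuned <- train(hp ~ ., data = train_data, method = "rf",
                  tuneGrid = expand.grid(mtry = 2:8),
                  trControl = trainControl(method = "cv", number = 5))
rf_tuned$bestTune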
Conclusion
This study aimed to evaluate and compare the performance of various regression models using the “mtcars” dataset to determine the most suitable approach for accurate predictions while balancing model complexity and generalization.
To achieve this, we performed thorough data preprocessing and exploratory data analysis (EDA) to understand variable relationships, handle categorical features, and ensure data readiness for modeling. We then trained and tested five regression models (Linear Regression, Polynomial Regression, Support Vector Machines (SVM), Decision Trees, and Random Forests) on a 70:30 train-test split.
From the evaluation metrics, Polynomial Regression initially appeared to be the best-performing model, achieving the highest R-squared (0.958) and lowest error metrics. However, its high R-squared value raised concerns about overfitting, especially considering the small dataset size. Consequently, Linear Regression emerged as the most reliable choice, as it offers a strong balance between accuracy and simplicity, reducing the risk of overfitting.
For future work, hyperparameter tuning of complex models like SVM and Random Forest could be explored to assess potential improvements. Additionally, using a larger dataset with more diverse observations would allow for better model validation and generalizability. Cross-validation techniques could also be applied to enhance confidence in model selection.
In conclusion, while Polynomial Regression demonstrated the best numerical performance, Linear Regression remains the most practical recommendation due to its interpretability, robustness, and reduced likelihood of overfitting in a small dataset.
A more detailed explanation of the methodology, including code and visualizations, can be found below.
library(corrplot)
corrplot 0.95 loaded
library(caret)
Loading required package: ggplot2
Loading required package: lattice
library(e1071)        # For SVM
library(rpart)        # For Decision Tree
library(randomForest) # For Random Forest
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':
margin
# 1. Load the mtcars dataset
mydata <- mtcars

# 2. Explore the data
head(mydata)
# 2.1 Checking the key statistics using summary
summary(mydata)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
# 2.2 Checking for any missing values in the dataset
missing_values <- sapply(mydata, function(x) sum(is.na(x)))
print(missing_values)
mpg cyl disp hp drat wt qsec vs am gear carb
0 0 0 0 0 0 0 0 0 0 0
# 2.3 Visualizing the distribution of variables
# Histogram of mpg
hist(mtcars$mpg)
# Interpretation:
# The distribution appears to be bimodal, with potential peaks around 15-20 and
# 30-35 mpg, suggesting that the dataset may contain at least two distinct groups
# of cars with different fuel efficiencies.

# Boxplot of mpg for each number of cylinders
boxplot(mpg ~ cyl, data = mtcars)
# Interpretation:
# The plot shows an inverse relationship between 'Number of Cylinders' and
# 'Miles Per Gallon': as the number of cylinders increases, miles per gallon
# tends to decrease.

# Compute and visualize the correlation matrix
correlation_matrix <- cor(mydata[, 1:11])
corrplot(correlation_matrix, method = 'color')
# Interpretation of the correlation matrix:
# hp vs. mpg: There is a strong inverse correlation between horsepower and miles
#   per gallon. As horsepower increases, fuel efficiency (mpg) tends to decrease,
#   indicating that cars with more powerful engines generally consume more fuel.
# hp vs. cyl: Horsepower strongly increases with the number of cylinders,
#   indicating a direct relationship between engine size and power.
# hp vs. qsec: Horsepower is inversely correlated with quarter-mile time, meaning
#   that cars with higher horsepower tend to have faster acceleration (lower
#   quarter-mile times).

# 2.4 Choosing a suitable target variable for regression
# I am choosing horsepower (hp) as the target variable. We can predict horsepower
# from other car characteristics such as weight (wt), engine displacement (disp),
# or number of cylinders (cyl).

# 3. Preprocess the data: create binary features from categorical variables
# I have selected cyl, vs, and am as categorical variables because they represent
# distinct, unordered groups rather than continuous numerical values. Treating
# them as categorical ensures the model doesn't incorrectly interpret a numerical
# relationship between these groups.
# Why I have not selected gear and carb: while both are numerically represented,
# they often reflect ordinal relationships, e.g. more gears generally imply better
# performance, and more carburetors often mean more fuel intake.

# Converting cyl, vs, and am to factors
mydata$cyl <- factor(mtcars$cyl)
mydata$vs <- factor(mtcars$vs)
mydata$am <- factor(mtcars$am)
str(mydata)
# Dummy-encode the cyl, vs, and am variables
cyl_dummies <- model.matrix(~ cyl - 1, data = mydata)
vs_dummies <- model.matrix(~ vs - 1, data = mydata)
am_dummies <- model.matrix(~ am - 1, data = mydata)
mydata <- cbind(mtcars, cyl_dummies, vs_dummies, am_dummies)

# 4. Split the data into training and testing sets
# I am using a 70/30 ratio between the training and testing sets, along with
# stratified sampling on the combination of cyl, am, and vs.
mydata$combined_strat <- interaction(mydata$cyl, mydata$am, mydata$vs)
set.seed(123)
train_index <- createDataPartition(mydata$combined_strat, p = 0.7, list = FALSE)
Warning in createDataPartition(mydata$combined_strat, p = 0.7, list = FALSE):
Some classes have no records ( 4.0.0, 6.0.0, 8.0.1, 6.1.1, 8.1.1 ) and these
will be ignored
Warning in createDataPartition(mydata$combined_strat, p = 0.7, list = FALSE):
Some classes have a single record ( 4.1.0 ) and these will be selected for the
sample
train_data <- mydata[train_index, ]
test_data <- mydata[-train_index, ]

# 5. Scale the features:
# Given the relatively small size of the mtcars dataset and the absence of
# drastically different scales among its variables, scaling is probably not required.

# 6. Build and evaluate models:

# Linear Regression
lm_model <- lm(hp ~ ., data = train_data)
lm_pred <- predict(lm_model, newdata = test_data)
lm_mse <- mean((lm_pred - test_data$hp)^2)
lm_rmse <- sqrt(mean((lm_pred - test_data$hp)^2))
lm_mae <- mean(abs(lm_pred - test_data$hp))
lm_rsquared <- cor(lm_pred, test_data$hp)^2

# Polynomial Regression
poly_model <- lm(hp ~ poly(mpg, degree = 2) + poly(wt, degree = 2) +
                   poly(drat, degree = 2) + poly(qsec, degree = 2) +
                   poly(gear, degree = 2) + poly(carb, degree = 2) +
                   cyl4 + cyl6 + am1 + vs1, data = train_data)
poly_pred <- predict(poly_model, newdata = test_data)
poly_mse <- mean((poly_pred - test_data$hp)^2)
poly_rmse <- sqrt(mean((poly_pred - test_data$hp)^2))
poly_mae <- mean(abs(poly_pred - test_data$hp))
poly_rsquared <- cor(poly_pred, test_data$hp)^2

# Support Vector Machine (SVM)
svm_model <- svm(hp ~ ., data = train_data)
svm_pred <- predict(svm_model, newdata = test_data)
svm_mse <- mean((svm_pred - test_data$hp)^2)
svm_rmse <- sqrt(mean((svm_pred - test_data$hp)^2))
svm_mae <- mean(abs(svm_pred - test_data$hp))
svm_rsquared <- cor(svm_pred, test_data$hp)^2

# Decision Tree
dt_model <- rpart(hp ~ ., data = train_data, method = "anova")
dt_pred <- predict(dt_model, newdata = test_data)
dt_mse <- mean((dt_pred - test_data$hp)^2)
dt_rmse <- sqrt(mean((dt_pred - test_data$hp)^2))
dt_mae <- mean(abs(dt_pred - test_data$hp))
dt_rsquared <- cor(dt_pred, test_data$hp)^2

# Random Forest
rf_model <- randomForest(hp ~ ., data = train_data)
rf_pred <- predict(rf_model, newdata = test_data)
rf_mse <- mean((rf_pred - test_data$hp)^2)
rf_rmse <- sqrt(mean((rf_pred - test_data$hp)^2))
rf_mae <- mean(abs(rf_pred - test_data$hp))
rf_rsquared <- cor(rf_pred, test_data$hp)^2

# Creating a data frame for comparing model performance
modelEvaluation_df <- data.frame(
  Model = c("Linear Regression", "Polynomial Regression",
            "Support Vector Regression", "Decision Tree Regression",
            "Random Forest Regression"),
  R_Squared = c(lm_rsquared, poly_rsquared, svm_rsquared, dt_rsquared, rf_rsquared),
  Mean_Square_Error = c(lm_mse, poly_mse, svm_mse, dt_mse, rf_mse),
  Root_Mean_Square_Error = c(lm_rmse, poly_rmse, svm_rmse, dt_rmse, rf_rmse),
  Mean_Absolute_Error = c(lm_mae, poly_mae, svm_mae, dt_mae, rf_mae)
)
print(modelEvaluation_df)
                      Model R_Squared Mean_Square_Error Root_Mean_Square_Error Mean_Absolute_Error
1         Linear Regression 0.9264464          299.1991               17.29737            16.32160
2     Polynomial Regression 0.9576881          287.0342               16.94208            13.08631
3 Support Vector Regression 0.9353352          582.5020               24.13508            18.87951
4  Decision Tree Regression 0.9430962          533.9669               23.10772            18.27273
5  Random Forest Regression 0.9167403          620.7020               24.91389            18.84389
# Function for visualizing Actual vs. Predicted values (with R-squared annotation)
plot_predictions <- function(predictions, model_name) {
  plot(test_data$hp, predictions,
       main = paste(model_name, ": Predicted vs. Actual"),
       xlab = "Actual HP", ylab = "Predicted HP",
       pch = 16, col = "blue")
  abline(0, 1, col = "red", lwd = 2)
  r2 <- cor(predictions, test_data$hp)^2
  text(x = min(test_data$hp), y = max(predictions),
       labels = paste("R-squared =", round(r2, 3)),
       adj = c(0, 1), col = "darkgreen")
}

# Create scatter plots for each model
plot_predictions(lm_pred, "Linear Regression")
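Presumably the same function is applied to the remaining four models; the corresponding calls would be:
plot_predictions(poly_pred, "Polynomial Regression")
plot_predictions(svm_pred, "Support Vector Regression")
plot_predictions(dt_pred, "Decision Tree Regression")
plot_predictions(rf_pred, "Random Forest Regression")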