🚗 Project Overview

This project explores the built-in mtcars dataset using:

📊 Descriptive statistics

📈 Visualizations

🧠 Machine Learning models

🧪 Model evaluation

We aim to predict Miles Per Gallon (mpg) using features such as weight, horsepower, and transmission type.

📂 Dataset Overview

library(dplyr)         # %>% and relocate()
library(ggplot2)       # plots
library(corrplot)      # correlation heatmap
library(caret)         # createDataPartition(), train(), RMSE()
library(randomForest)  # method = "rf" and varImpPlot()

# Load data and move the car names from rownames into a column
data(mtcars)
df <- mtcars
df$car <- rownames(df)
rownames(df) <- NULL
df <- df %>% relocate(car)

# Structure and summary
str(df)
## 'data.frame':    32 obs. of  12 variables:
##  $ car : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(df)
##      car                 mpg             cyl             disp      
##  Length:32          Min.   :10.40   Min.   :4.000   Min.   : 71.1  
##  Class :character   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
##  Mode  :character   Median :19.20   Median :6.000   Median :196.3  
##                     Mean   :20.09   Mean   :6.188   Mean   :230.7  
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0  
##        hp             drat             wt             qsec      
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50  
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89  
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71  
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85  
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90  
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90  
##        vs               am              gear            carb      
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

The mtcars dataset consists of 32 observations on 12 variables describing various car models. Each row represents a car, identified by the car column (a character string). The primary variable of interest, mpg (miles per gallon), ranges from 10.4 to 33.9 with a mean of approximately 20.1, indicating varied fuel efficiency across the models. Other numeric variables include cyl (number of cylinders), disp (engine displacement), hp (horsepower), drat (rear axle ratio), wt (weight in 1000 lbs), and qsec (1/4-mile time), all showing broad distributions that reflect differences in car performance and design. Variables such as vs (engine shape), am (transmission type), gear, and carb (number of carburetors) are encoded as numeric but represent discrete categories; for example, am indicates transmission type (0 = automatic, 1 = manual), with manual cars being slightly less common. Overall, the dataset provides a compact yet diverse view of early-1970s automobiles (the data come from the 1974 Motor Trend US magazine), making it well-suited for regression analysis and classification tasks.
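For instance, the transmission split mentioned above can be verified with a one-line count (a quick check; in mtcars, am = 0 is automatic and am = 1 is manual):

# Count automatic (0) vs manual (1) cars
table(df$am)

In mtcars this gives 19 automatic and 13 manual cars, confirming that manuals are slightly less common.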

📌 Exploratory Data Analysis

🔍 Variable Preparation

# Convert some numeric columns to factors for plotting
df$cyl <- as.factor(df$cyl)
df$gear <- as.factor(df$gear)
df$carb <- as.factor(df$carb)
df$am <- as.factor(df$am)

📈 Correlation Plot

# Correlation among the numeric columns (the factor columns created above are excluded)
cor_matrix <- cor(df[, sapply(df, is.numeric)])
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.8)

📊 Correlation Heatmap:

This heatmap reveals the pairwise correlation coefficients between selected numerical variables in the mtcars dataset. Dark red indicates strong negative correlations (closer to -1), while dark blue indicates strong positive correlations (closer to +1). Here’s a breakdown of key insights:

🔻 mpg (Miles Per Gallon) shows strong negative correlations with:

wt (weight): heavier cars tend to have lower fuel efficiency.

hp (horsepower) and disp (displacement): more powerful/larger engines are associated with lower MPG.

🔺 mpg has a moderate positive correlation with drat (rear axle ratio), suggesting cars with higher drat might be slightly more fuel-efficient.

✅ Strong positive correlation exists between disp and hp, indicating larger engines tend to produce more power.

❗ vs (engine shape: 0 = V-shaped, 1 = straight) is positively correlated with mpg and negatively with hp, disp, and wt, showing that straight engines tend to appear in lighter, less powerful, and more fuel-efficient cars.

qsec (quarter-mile time) has a slight positive correlation with mpg, meaning slower cars might be more fuel-efficient, but this is weaker.

✅ Summary: The heatmap confirms expected engineering principles: bigger, heavier, and more powerful cars consume more fuel. wt, disp, and hp are strong candidates for predicting mpg, and this insight is useful when building regression models.
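To put numbers on these observations, the mpg column of the correlation matrix computed above can be ranked by absolute value (a small sketch reusing the cor_matrix object; only the columns left numeric after the factor conversion are included):

# Rank the remaining numeric predictors by absolute correlation with mpg
mpg_cor <- cor_matrix["mpg", ]
sort(abs(mpg_cor[names(mpg_cor) != "mpg"]), decreasing = TRUE)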

📊 Visualizations

MPG Distribution

ggplot(df, aes(x = mpg)) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of MPG", x = "Miles per Gallon", y = "Count")

⛽ Histogram of MPG – Interpretation

This histogram shows how fuel efficiency (in MPG) is distributed across the 32 cars in the dataset:

🚗 Most cars achieve between 15 to 22 MPG, making this the most common range.

📉 Fewer cars fall below 15 MPG, indicating that very low fuel efficiency is relatively rare.

📈 Similarly, cars with above 30 MPG are uncommon but do exist.

The distribution is slightly right-skewed, with more vehicles clustered in the lower-to-mid MPG range, and a few high-efficiency cars creating a longer tail on the right.

✅ Summary: Overall, the dataset is dominated by cars with moderate fuel efficiency, while very high or very low MPG values are less frequent. This skewed distribution can inform transformations or binning strategies when preparing the data for modeling or classification tasks.
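If the right skew matters for a later model, one simple option is to inspect mpg on a log scale; this is a sketch only, and whether to transform depends on the model being fit:

# Histogram of log-transformed MPG to reduce the right skew
ggplot(df, aes(x = log(mpg))) +
  geom_histogram(bins = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of log(MPG)", x = "log(Miles per Gallon)", y = "Count")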

Boxplot: MPG by Cylinders

ggplot(df, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  labs(title = "MPG by Number of Cylinders", x = "Cylinders", y = "MPG")

📦 Boxplot of MPG by Number of Cylinders – Interpretation

This boxplot illustrates how fuel efficiency (miles per gallon, MPG) varies across cars with 4, 6, and 8 cylinders:

4-cylinder cars show the highest median MPG, around 27–28, with a relatively wide spread (IQR roughly 25–30 MPG), indicating some variability among these fuel-efficient models.

6-cylinder cars have a moderate median MPG of approximately 21–22, with a narrower IQR (20–23), suggesting more consistent fuel performance.

8-cylinder cars display the lowest fuel efficiency, with a median around 15–16 MPG, a broader IQR of roughly 14–18 MPG, and a notable low outlier near 10 MPG.

Overall, as the number of cylinders increases, fuel efficiency tends to decrease, reflecting the higher power output and fuel demands of larger engines.
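The medians and quartiles read off the boxplot can be verified with a grouped summary (a quick check using dplyr, which is loaded at the top of the report):

# Median and quartiles of MPG for each cylinder group
df %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    q1 = quantile(mpg, 0.25),
    median_mpg = median(mpg),
    q3 = quantile(mpg, 0.75)
  )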

Scatter Plot: MPG vs Weight

ggplot(df, aes(x = wt, y = mpg)) +
  geom_point(color = "darkgreen") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "MPG vs Weight", x = "Weight (1000 lbs)", y = "MPG")

⚖️ Scatter Plot of MPG vs Weight – Interpretation

The chart titled “MPG vs Weight” visualizes the relationship between a vehicle’s weight (in thousands of pounds) and its fuel efficiency (measured in miles per gallon, MPG). Each green point represents an individual car, and the red trend line indicates the overall pattern. The plot shows a clear negative relationship: as vehicle weight increases from about 2,000 lbs to over 5,000 lbs, MPG decreases from around 30 MPG for lighter cars to about 10 MPG for heavier ones. The downward-sloping linear trend line reinforces this correlation, confirming that heavier vehicles tend to be less fuel-efficient. This aligns with typical automotive engineering principles, where added weight increases fuel consumption.
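The strength and slope of this relationship can be quantified directly; the figures in the comments are approximate values for mtcars:

# Pearson correlation between weight and MPG (about -0.87)
cor(df$wt, df$mpg)

# Slope of the fitted line: MPG lost per additional 1,000 lbs (about -5.3)
coef(lm(mpg ~ wt, data = df))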

🛠️ Machine Learning: Data Preparation

set.seed(123)
# Partition data
train_index <- createDataPartition(df$mpg, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

set.seed(123)

This sets a random seed so that the results of any random operation (like data splitting) are reproducible. Using the same seed ensures you get the same train/test split every time you run the code.

createDataPartition(df$mpg, p = 0.8, list = FALSE)

This function (from the caret package) splits the data into training and testing sets.

df$mpg: The target variable used to stratify the data (ensuring similar distributions in both sets).

p = 0.8: Specifies that 80% of the data should go into the training set.

list = FALSE: Returns a vector of row indices rather than a list.

train_data <- df[train_index, ]

Subsets the original dataset to get the training set (80% of the data).

test_data <- df[-train_index, ]

Takes the remaining 20% of the data as the testing set.

✅ Why this is important: Splitting the dataset into training and testing sets is a critical step in machine learning. The training set is used to build the model, while the testing set is used to evaluate how well the model performs on unseen data — helping to assess its ability to generalize.
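A quick sanity check (not part of the original output) confirms the sizes of the two sets and that the mpg distributions are comparable, as the stratified split intends:

# Sizes of the training and testing sets (roughly 80% / 20% of 32 rows)
nrow(train_data)
nrow(test_data)

# Compare mpg distributions across the two sets
summary(train_data$mpg)
summary(test_data$mpg)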

🤖 Modeling

📉 Linear Regression

# Fit a linear regression on all predictors (the first column, car, is dropped)
lm_model <- train(mpg ~ ., data = train_data[, -1], method = "lm")
summary(lm_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4140 -1.2602 -0.0338  0.7805  4.7203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.78965   23.29577   0.506    0.623
## cyl6        -0.44073    3.54082  -0.124    0.903
## cyl8         1.92675    7.96916   0.242    0.813
## disp         0.04114    0.03132   1.313    0.216
## hp          -0.04760    0.04181  -1.139    0.279
## drat         3.10222    2.79016   1.112    0.290
## wt          -3.94678    2.83952  -1.390    0.192
## qsec         0.23091    0.98709   0.234    0.819
## vs           3.08488    2.96477   1.041    0.320
## am1          0.37047    3.24113   0.114    0.911
## gear4        3.86903    4.05389   0.954    0.360
## gear5        7.86395    5.02662   1.564    0.146
## carb2       -2.33108    2.59751  -0.897    0.389
## carb3        1.43776    4.37203   0.329    0.748
## carb4       -2.95581    4.83986  -0.611    0.554
## carb6       -1.39430    6.91625  -0.202    0.844
## carb8       -3.64935   10.18725  -0.358    0.727
## 
## Residual standard error: 2.69 on 11 degrees of freedom
## Multiple R-squared:  0.9097, Adjusted R-squared:  0.7784 
## F-statistic: 6.927 on 16 and 11 DF,  p-value: 0.001223

📊 Linear Regression Model – Interpretation

The linear regression model was trained to predict miles per gallon (MPG) using multiple predictors, including engine specs, transmission type, and more. Here’s what the key elements of the output tell us:

🔢 Model Fit: Residual Standard Error (RSE): 2.69 → On average, the predicted MPG values deviate from the actual values by about 2.69 units.

Multiple R-squared: 0.9097 → The model explains about 91% of the variance in MPG on the training set, which suggests a very good fit.

Adjusted R-squared: 0.7784 → After adjusting for the number of predictors, the model still explains about 77.8% of the variation, indicating that some predictors may not be contributing much.

F-statistic: 6.927, p-value: 0.001223 → The overall regression model is statistically significant, meaning at least one of the predictors contributes meaningfully to explaining MPG.

📉 Residuals: The residuals range from -3.41 to +4.72, indicating some variability in model accuracy, but not extreme.

🧪 Coefficients Table: Each row shows the effect of that variable on MPG, holding all else constant.

Most individual predictors have high p-values (> 0.05), meaning they are not statistically significant by themselves in this multivariate model.

For example:

hp (horsepower): Estimate = -0.0476 (more HP slightly reduces MPG), p = 0.279 → Not significant

wt (weight): Estimate = -3.95 (heavier cars reduce MPG), p = 0.192 → Strong negative effect, but still not statistically significant

gear5: Estimate = 7.86 (higher gear may increase MPG), p = 0.146 → Suggestive, but not significant

Intercept: the estimated baseline MPG is ~11.8 when all numeric predictors are zero and the factor variables are at their reference levels (not meaningful in practice).

✅ Summary: 🔹 The overall model fits the training data well, with high R² and a significant F-test.

🔸 However, individual predictors are not statistically significant, likely due to:

Multicollinearity (predictors are correlated with each other) — see the VIF check sketched after this list,

A small sample size (only 32 rows, and just 11 degrees of freedom left),

Potential overfitting with too many predictors.
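The multicollinearity point can be checked with variance inflation factors. This is a hedged sketch that assumes the car package is installed; it refits the same formula as a plain lm(), because vif() expects an lm object rather than a caret train object:

# Variance inflation factors for the full linear model
# (GVIF values are reported for factor terms; large values signal collinearity)
library(car)
full_lm <- lm(mpg ~ ., data = train_data[, -1])
vif(full_lm)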

🧠 Recommendation: Consider simplifying the model by:

Performing feature selection (e.g., stepwise regression — sketched after this list),

Using regularization (like LASSO),

Or applying PCA or domain knowledge to reduce variables.
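As a sketch of the first two suggestions (the tuning choices below are illustrative, not part of the original analysis), backward stepwise selection can be run on the refitted linear model, and a LASSO can be fit through caret with the glmnet method (assumes the glmnet package is installed):

# Backward stepwise selection on the full linear model
step_lm <- step(lm(mpg ~ ., data = train_data[, -1]),
                direction = "backward", trace = FALSE)
summary(step_lm)

# LASSO via caret/glmnet (alpha = 1); the lambda grid is illustrative
set.seed(123)
lasso_model <- train(
  mpg ~ ., data = train_data[, -1], method = "glmnet",
  tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(-3, 0, length.out = 20)),
  trControl = trainControl(method = "cv", number = 5)
)
lasso_model$bestTune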

🌳 Random Forest

# Fit a random forest with variable importance recorded (the car column is dropped)
rf_model <- train(mpg ~ ., data = train_data[, -1], method = "rf", importance = TRUE)
print(rf_model)
## Random Forest 
## 
## 28 samples
## 10 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 28, 28, 28, 28, 28, 28, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    2.848031  0.7741995  2.316185
##    9    2.578624  0.7985287  2.088388
##   16    2.644895  0.7821251  2.129781
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 9.

This Random Forest regression model was trained to predict MPG using 10 predictors and 28 samples, with performance evaluated via 25 bootstrap resampling repetitions.

🔧 Model Setup: Predictors used: 10

Resampling method: Bootstrapping (25 repetitions)

Preprocessing: None applied

Tuning parameter (mtry): Number of variables randomly sampled at each tree split

📈 Model Performance (by mtry value):

| mtry | RMSE | R²     | MAE  |
|------|------|--------|------|
| 2    | 2.85 | 0.7742 | 2.32 |
| 9    | 2.58 | 0.7985 | 2.09 |
| 16   | 2.64 | 0.7821 | 2.13 |

✅ The best performance was achieved with mtry = 9, as it had the lowest RMSE (2.58).

📉 R-squared = 0.7985 indicates that the model explains ~80% of the variance in MPG across the bootstrap resamples.

📏 Mean Absolute Error (MAE) = 2.09, suggesting that predictions are off by about 2 MPG, on average.

✅ Summary: The Random Forest model performs strongly. Its resampled RMSE (2.58) is lower than the linear model's training residual standard error (~2.69, adjusted R² ~0.78), though the two figures are not directly comparable; the test-set comparison below is the fairer benchmark.

Unlike linear regression, Random Forests can capture nonlinear relationships and interactions between variables without explicitly modeling them.

caret's tuning selected mtry = 9 — the number of predictors randomly sampled at each split — as the best value, based on the lowest resampled RMSE.
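If finer control over the tuning is wanted, the resampling scheme and mtry grid can be specified explicitly; this is a sketch with illustrative values rather than the settings used above:

set.seed(123)
rf_tuned <- train(
  mpg ~ ., data = train_data[, -1], method = "rf",
  trControl = trainControl(method = "cv", number = 5),  # 5-fold CV instead of bootstrapping
  tuneGrid = expand.grid(mtry = c(2, 4, 6, 8, 10)),     # candidate mtry values (illustrative)
  importance = TRUE
)
print(rf_tuned)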

📊 Model Evaluation

🔎 RMSE Comparison

# Predictions
lm_preds <- predict(lm_model, test_data)
rf_preds <- predict(rf_model, test_data)

# RMSE
lm_rmse <- RMSE(lm_preds, test_data$mpg)
rf_rmse <- RMSE(rf_preds, test_data$mpg)

data.frame(
  Model = c("Linear Regression", "Random Forest"),
  RMSE = c(lm_rmse, rf_rmse)
)
##               Model     RMSE
## 1 Linear Regression 5.183979
## 2     Random Forest 2.317479

This table compares the predictive accuracy of two models on the same test data using Root Mean Squared Error (RMSE):

| Model             | RMSE |
|-------------------|------|
| Linear Regression | 5.18 |
| Random Forest     | 2.32 |

✅ Random Forest outperforms Linear Regression by a substantial margin, with an RMSE of 2.32 vs. 5.18.

📉 Lower RMSE indicates better predictive accuracy — in this case, the Random Forest model predicts MPG more accurately, with an average error of about 2.3 MPG compared to 5.2 MPG for Linear Regression.

The improvement suggests that Random Forest better captures nonlinear relationships and variable interactions, which Linear Regression may miss.

📌 Conclusion: The Random Forest model is a more suitable choice for predicting MPG in this dataset, offering significantly improved accuracy over the Linear Regression model.
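RMSE is only one view of test-set performance. caret's postResample() reports RMSE, R², and MAE together, so a fuller comparison can be obtained from the prediction vectors already computed:

# Full test-set metrics (RMSE, Rsquared, MAE) for both models
postResample(pred = lm_preds, obs = test_data$mpg)
postResample(pred = rf_preds, obs = test_data$mpg)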

📋 Actual vs Predicted

comparison <- data.frame(
  Car = test_data$car,
  Actual = test_data$mpg,
  Pred_LM = lm_preds,
  Pred_RF = rf_preds
)
print(comparison)
##               Car Actual  Pred_LM  Pred_RF
## 2   Mazda RX4 Wag   21.0 18.66003 21.01231
## 7      Duster 360   14.3 13.43311 14.80743
## 20 Toyota Corolla   33.9 29.38881 29.50138
## 29 Ford Pantera L   15.8 24.79537 17.17016

🚗 Model Prediction Comparison – Selected Cars

This table compares the actual MPG values to the predicted MPG from both models for the four test-set vehicles:

| Car            | Actual MPG | Predicted (LM) | Predicted (RF) |
|----------------|------------|----------------|----------------|
| Mazda RX4 Wag  | 21.0       | 18.66          | 21.01          |
| Duster 360     | 14.3       | 13.43          | 14.81          |
| Toyota Corolla | 33.9       | 29.39          | 29.50          |
| Ford Pantera L | 15.8       | 24.80 ❌       | 17.17 ✅       |

🔍 Interpretation:

✅ Random Forest predictions are consistently closer to the actual MPG values than those from Linear Regression.

For Mazda RX4 Wag, the Random Forest prediction (21.01) nearly matches the actual MPG (21.0), while Linear Regression underestimates it.

In the case of the Toyota Corolla, both models underestimate the actual MPG, but RF does slightly better.

The Ford Pantera L shows the largest error in the Linear Regression model, overestimating MPG by almost 9 units, while Random Forest’s prediction is more reasonable.

✅ Conclusion: These sample predictions confirm the earlier RMSE comparison — Random Forest provides more accurate and stable predictions across a variety of car types, including both high-efficiency and low-efficiency models.
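The per-car errors behind these observations can be added to the comparison table directly:

# Absolute prediction error per car for each model
comparison$AbsErr_LM <- abs(comparison$Actual - comparison$Pred_LM)
comparison$AbsErr_RF <- abs(comparison$Actual - comparison$Pred_RF)
print(comparison)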

📊 Predicted MPG Plot

ggplot(comparison, aes(x = Actual)) +
  geom_point(aes(y = Pred_LM, color = "Linear Regression")) +
  geom_point(aes(y = Pred_RF, color = "Random Forest")) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Actual vs Predicted MPG", y = "Predicted MPG") +
  scale_color_manual(values = c("Linear Regression" = "blue", "Random Forest" = "green"))

The chart titled “Actual vs Predicted MPG” compares the actual and predicted miles per gallon (MPG) values from two models: Linear Regression (blue) and Random Forest (green). The dashed diagonal line represents perfect prediction, where actual and predicted values are equal. Linear Regression predictions range from roughly 13 to 29 MPG, while Random Forest predictions range from roughly 15 to 30 MPG. The Random Forest points generally sit closer to the diagonal, consistent with its lower test RMSE, whereas the Linear Regression prediction for the Ford Pantera L lies well above the line; both models underestimate the Toyota Corolla's 33.9 MPG.

🌟 Variable Importance (RF)

# Variable importance plot from the underlying randomForest object
varImpPlot(rf_model$finalModel)

The chart titled “rf_model$finalModel” presents the importance of various vehicle attributes in a random forest model, displayed in two panels: %IncMSE (left) and IncNodePurity (right). The %IncMSE panel shows the increase in mean squared error when a variable is permuted, with “wt” (weight) having the highest importance at around 14%, followed by “hp” (horsepower) and “disp” (displacement) at lower values, indicating their significant impact on model accuracy. The IncNodePurity panel measures the total decrease in node impurity, where “wt” again ranks highest with an importance of about 200, followed by “hp” and “disp,” reinforcing their influence. Other variables like “cyl” (cylinders), “drat” (rear axle ratio), “qsec” (quarter mile time), “vs” (engine shape), “am” (transmission), “gear” (number of gears), and “carb” (number of carburetors) show progressively lower importance in both metrics.
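The numeric values behind the plot can be printed directly, either from the underlying randomForest object or via caret's scaled view:

# Raw importance measures (%IncMSE and IncNodePurity) from the fitted forest
importance(rf_model$finalModel)

# caret's scaled importance for the same model
varImp(rf_model)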