#1. Data Exploration
The mtcars dataset ships with R. It contains data on 32 cars, including miles per gallon (mpg), weight, horsepower, and several other variables. In this section we load the dataset, review summary statistics, and create a few visualizations to understand how the variables relate to one another.
# Load packages and dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(performance)
## Warning: package 'performance' was built under R version 4.5.1
data(mtcars)
# View first few rows
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Summary statistics
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
# Structure of dataset
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
#1.2 Trends? Correlations? Patterns?
We will look for patterns and relationships between MPG and the other variables.
# Scatterplot: MPG vs Weight
plot(mtcars$wt, mtcars$mpg,
main = "MPG vs Car Weight",
xlab = "Weight (1000 lbs)",
ylab = "Miles Per Gallon",
pch = 19, col = "blue")
# Scatterplot: MPG vs Horsepower
plot(mtcars$hp, mtcars$mpg,
main = "MPG vs Horsepower",
xlab = "Horsepower",
ylab = "Miles Per Gallon",
pch = 19, col = "red")
From these plots, heavier cars and cars with higher horsepower tend to have lower MPG, suggesting a negative relationship between each of these variables and fuel efficiency.
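As an optional check (a minimal base-R sketch mirroring the plots above), a least-squares line can be overlaid on the weight plot to make the downward trend explicit:
# Optional: overlay a least-squares trend line on the MPG vs weight plot
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "blue",
main = "MPG vs Car Weight", xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon")
abline(lm(mpg ~ wt, data = mtcars), col = "darkgray", lwd = 2)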
#1.3 Correlation with MPG
Next we check which variables are most strongly correlated with mpg.
# Correlation matrix
cor_matrix <- cor(mtcars)
cor_matrix["mpg", ]
## mpg cyl disp hp drat wt qsec
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.4186840
## vs am gear carb
## 0.6640389 0.5998324 0.4802848 -0.5509251
Weight has the strongest negative correlation with MPG (r = -0.87), and horsepower is also strongly negative (r = -0.78). Heavier, more powerful cars tend to have lower fuel efficiency.
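To rank the predictors explicitly, the mpg row of the correlation matrix can be sorted by absolute value; a minimal sketch using the cor_matrix computed above:
# Rank predictors by absolute correlation with mpg (dropping mpg itself)
sort(abs(cor_matrix["mpg", -1]), decreasing = TRUE)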
#2 Data Processing
Before running the regression model, I want to check for any missing or invalid data that can skew our results.
# Check for missing values
colSums(is.na(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0
There are no missing values in the mtcars dataset, and the summary statistics above show no impossible values (e.g., negative weights or horsepower). The data appears clean!
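As an extra sanity check (optional, and not strictly needed here), we could also confirm there are no non-finite values or duplicated rows:
# Optional sanity checks: non-finite values and duplicated rows
sapply(mtcars, function(x) sum(!is.finite(x)))
sum(duplicated(mtcars))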
#3 Linear Regression using lm
We will create a linear regression model to predict mpg using weight and horsepower.
# Linear regression model
model1 <- lm(mpg ~ wt + hp, data = mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
- Intercept (37.23): the predicted mpg when weight and horsepower are both zero; not a realistic car, but it anchors the regression line.
- Weight: for each additional 1,000 lb of weight, mpg decreases by about 3.9, holding horsepower constant.
- Horsepower: each additional horsepower decreases mpg by about 0.03, holding weight constant.
- Both predictors are statistically significant (p < 0.01).
- Weight and horsepower together explain about 82.7% of the variation in mpg (Multiple R-squared = 0.827).
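To make the coefficients concrete, here is a worked prediction for a hypothetical car (the weight and horsepower values are chosen purely for illustration):
# Hypothetical example: predicted mpg for a 3,000 lb car with 150 hp
predict(model1, newdata = data.frame(wt = 3, hp = 150))
# Roughly 37.23 - 3.88 * 3 - 0.032 * 150, i.e. about 20.8 mpg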
#3.2 Regression Assumptions and Diagnostic Plots
We need to check for linearity, normality of residuals and constant variance within our data.
# Diagnostic plots
par(mfrow = c(2, 2))
plot(model1)
par(mfrow = c(1, 1))
Interpretation of the diagnostic plots:
- Linearity: the Residuals vs Fitted plot shows no strong curvature.
- Normality: the points in the Normal Q-Q plot fall close to the line, so the residuals are approximately normal.
- Homoscedasticity: the spread of residuals is roughly even across fitted values.
All major assumptions of linear regression appear to be met.
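As a numerical cross-check of these visual diagnostics, the performance package (already loaded) provides simple tests; a minimal sketch:
# Shapiro-Wilk test of residual normality
check_normality(model1)
# Breusch-Pagan test for non-constant error variance
check_heteroscedasticity(model1)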
#3.3 Mean Squared Error
# Compute Mean Squared Error
mse <- mean(model1$residuals^2)
mse
## [1] 6.095242
The MSE is 6.10, which corresponds to a root mean squared error of about 2.47 mpg, so the model's predictions typically differ from the actual mpg values by roughly 2.5 mpg.
Relative to the range of mpg in the data (10.4 to 33.9), this error is fairly small, so the model fits reasonably well.
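Taking the square root of the MSE (the RMSE) puts the error back on the mpg scale, which is where the 2.47 figure comes from:
# Root mean squared error, on the same scale as mpg
sqrt(mean(model1$residuals^2))
# roughly 2.47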
#3.4 Interaction Term
We are now going to include an interaction between weight and horsepower to see if their combined effect is significant.
# Model with interaction term
model2 <- lm(mpg ~ wt * hp, data = mtcars)
summary(model2)
##
## Call:
## lm(formula = mpg ~ wt * hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -1.6491 -0.7362 1.4211 4.5513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.80842 3.60516 13.816 5.01e-14 ***
## wt -8.21662 1.26971 -6.471 5.20e-07 ***
## hp -0.12010 0.02470 -4.863 4.04e-05 ***
## wt:hp 0.02785 0.00742 3.753 0.000811 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.153 on 28 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
## F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
Since the interaction term's p-value is well below 0.05, the interaction is significant. R-squared rises to 0.88, meaning about 88% of the variance in mpg is explained by the model. The positive wt:hp coefficient indicates that the effect of horsepower on mpg depends on weight: the mpg penalty per horsepower shrinks as weight increases. Including the interaction therefore improves the model fit and helps explain how these two variables jointly affect mpg.
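One way to read the interaction is to compute the effective horsepower slope implied by model2 at different weights; a short sketch (the 2,000 lb and 4,000 lb values are illustrative):
# Effective mpg-per-hp slope for a light (2,000 lb) versus heavy (4,000 lb) car
coefs <- coef(model2)
coefs["hp"] + coefs["wt:hp"] * c(light = 2, heavy = 4)
# roughly -0.064 mpg per hp at 2,000 lb versus about -0.009 at 4,000 lb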
#3.5 Outlier Direction
Check for outliers using a boxplot.
boxplot(mtcars$hp, main = "Boxplot of Horsepower", col = "lightblue")
Based on the boxplot, there is at least one outlier in horsepower, since a point falls beyond the upper whisker.
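To see which observations the boxplot flags, the 1.5 × IQR rule used by boxplot() can be queried directly; a small sketch:
# Values flagged as outliers by the boxplot's 1.5 * IQR rule
out_vals <- boxplot.stats(mtcars$hp)$out
out_vals
# Cars with those horsepower values (here, the single highest-hp car)
rownames(mtcars)[mtcars$hp %in% out_vals]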
#3.7 Winsorization
Winsorizing the hp variable at the 5th and 95th percentiles to reduce the influence of outliers:
# Winsorization
hp_winsor <- mtcars$hp
lower <- quantile(hp_winsor, 0.05)
upper <- quantile(hp_winsor, 0.95)
hp_winsor[hp_winsor < lower] <- lower
hp_winsor[hp_winsor > upper] <- upper
# New dataset with winsorized hp
mtcars$hp_winsor <- hp_winsor
# Fit new model
model3 <- lm(mpg ~ wt + hp_winsor, data = mtcars)
summary(model3)
##
## Call:
## lm(formula = mpg ~ wt + hp_winsor, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8825 -1.6545 -0.0968 0.8367 5.7259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.31722 1.56964 23.774 < 2e-16 ***
## wt -3.58279 0.66427 -5.394 8.5e-06 ***
## hp_winsor -0.03952 0.01059 -3.732 0.000824 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.546 on 29 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.8215
## F-statistic: 72.34 on 2 and 29 DF, p-value: 5.348e-12
The original model (model1) had an R-squared of 0.827; after winsorizing horsepower it is 0.833. The change is small, which suggests the high-horsepower outlier was not strongly distorting the original fit.
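For a direct comparison on the error scale, the MSE of the original and winsorized models can be placed side by side; a quick sketch (the winsorized MSE should be slightly smaller, consistent with its slightly higher R-squared):
# Compare mean squared error before and after winsorizing hp
c(original = mean(model1$residuals^2),
winsorized = mean(model3$residuals^2))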
#3.8 Multicollinearity
The variance inflation factor (VIF) checks for multicollinearity, i.e., whether the predictors are highly correlated with each other.
library(performance)
check_collinearity(model1)
## # Check for Multicollinearity
##
## Low Correlation
##
## Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
## wt 1.77 [1.29, 3.00] 1.33 0.57 [0.33, 0.77]
## hp 1.77 [1.29, 3.00] 1.33 0.57 [0.33, 0.77]
Both VIFs are low (1.77), so there are no multicollinearity concerns between weight and horsepower. As noted earlier, weight (r = -0.87) and horsepower (r = -0.78) are each strongly negatively correlated with mpg, which echoes our initial observation that heavier, higher-horsepower cars tend to have lower fuel efficiency.
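For a two-predictor model, the VIF follows directly from the correlation between the predictors themselves, which is a useful sanity check on the output above:
# VIF for two predictors is 1 / (1 - r^2), where r is the wt-hp correlation
r <- cor(mtcars$wt, mtcars$hp)
1 / (1 - r^2)
# roughly 1.77, matching check_collinearity(model1)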
#Conclusion
The regression analysis shows that both car weight and horsepower significantly reduce miles per gallon. Adding the interaction term between the two improved the model fit, indicating that the effect of horsepower on fuel efficiency depends on vehicle weight. Diagnostic checks confirmed that the model assumptions were met and outlier adjustment had minimal impact. Multicollinearity was low and the model explained about 88% of variance in mpg. Overall, the regression model is statistically significant and consistent with expectations.