This report explores the built-in mtcars dataset through EDA, preprocessing checks, and multiple linear regression using base R. It follows the assignment prompts exactly and includes narrative answers alongside reproducible code.
Dataset: mtcars (Motor Trend US magazine).
Goal: predict mpg using the other variables and interpret/diagnose the model.
data(mtcars) # dataset is built-in
str(mtcars) # structure of variables (serves as a compact codebook)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars) # summary statistics
## mpg cyl disp hp drat
## Min. :10.4 Min. :4.00 Min. : 71.1 Min. : 52.0 Min. :2.76
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.08
## Median :19.2 Median :6.00 Median :196.3 Median :123.0 Median :3.69
## Mean :20.1 Mean :6.19 Mean :230.7 Mean :146.7 Mean :3.60
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.92
## Max. :33.9 Max. :8.00 Max. :472.0 Max. :335.0 Max. :4.93
## wt qsec vs am gear
## Min. :1.51 Min. :14.5 Min. :0.000 Min. :0.000 Min. :3.00
## 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:3.00
## Median :3.33 Median :17.7 Median :0.000 Median :0.000 Median :4.00
## Mean :3.22 Mean :17.8 Mean :0.438 Mean :0.406 Mean :3.69
## 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:4.00
## Max. :5.42 Max. :22.9 Max. :1.000 Max. :1.000 Max. :5.00
## carb
## Min. :1.00
## 1st Qu.:2.00
## Median :2.00
## Mean :2.81
## 3rd Qu.:4.00
## Max. :8.00
Narrative. mpg (miles per gallon) is our target. Candidate predictors include engine-related measures (e.g., displacement disp, horsepower hp, cylinders cyl), weight wt, and drivetrain variables (am = transmission, gear, carb, vs). Intuitively, heavier, more powerful cars tend to achieve lower fuel economy; manual transmission (am = 1) can sometimes improve mpg.
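As a quick visual check on the transmission claim, a side-by-side boxplot compares mpg for automatic and manual cars (a small base-graphics sketch added here; it is not one of the assignment prompts).
# Sketch: compare mpg by transmission type (0 = automatic, 1 = manual)
boxplot(mpg ~ am, data = mtcars,
        names = c("Automatic (0)", "Manual (1)"),
        main = "MPG by Transmission", ylab = "MPG")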
par(mfrow = c(1,2))
plot(mtcars$wt, mtcars$mpg, pch = 19, main = "MPG vs Weight (wt)", xlab = "Weight (1000 lbs)", ylab = "MPG")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)
plot(mtcars$hp, mtcars$mpg, pch = 19, main = "MPG vs Horsepower (hp)", xlab = "Horsepower", ylab = "MPG")
abline(lm(mpg ~ hp, data = mtcars), col = "red", lwd = 2)
par(mfrow = c(1,1))
Interpretation. The scatterplots suggest negative relationships: as weight or horsepower increases, mpg tends to decrease. The fitted lines reinforce these trends and hint that wt might be among the strongest predictors of fuel economy.
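A pairwise scatterplot gives the same picture for several predictors at a glance (a minimal sketch; the variable subset shown is our choice, not specified in the prompt).
# Sketch: pairwise scatterplots for mpg and selected predictors
pairs(mtcars[, c("mpg", "wt", "hp", "disp", "cyl")], pch = 19,
      main = "Pairwise relationships with mpg")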
cor_matrix <- cor(mtcars) # Pearson correlations for numeric vars
round(cor_matrix["mpg", ], 3) # correlations with mpg
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1.000 -0.852 -0.848 -0.776 0.681 -0.868 0.419 0.664 0.600 0.480 -0.551
Answer. The largest-magnitude correlations with mpg are typically negative for wt, cyl, and hp (heavier, more cylinders, and more horsepower → lower mpg). Variables like drat (rear axle ratio) and qsec (1/4-mile time) can show positive association with mpg.
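To rank predictors by strength of linear association, the correlations can be sorted by absolute value (a small sketch building on the cor_matrix object computed above).
# Sketch: rank predictors by |correlation| with mpg (excluding mpg itself)
round(sort(abs(cor_matrix["mpg", colnames(cor_matrix) != "mpg"]),
           decreasing = TRUE), 3)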
total_na <- sum(is.na(mtcars))
na_by_col <- colSums(is.na(mtcars))
total_na
## [1] 0
na_by_col
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0
Evidence. The built-in mtcars dataset contains no missing values: the total count and the per-column counts above are all zero.
We check sanity constraints (non-negative values; reasonable integer levels for gears/carb).
summary(mtcars)
## mpg cyl disp hp drat
## Min. :10.4 Min. :4.00 Min. : 71.1 Min. : 52.0 Min. :2.76
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.08
## Median :19.2 Median :6.00 Median :196.3 Median :123.0 Median :3.69
## Mean :20.1 Mean :6.19 Mean :230.7 Mean :146.7 Mean :3.60
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.92
## Max. :33.9 Max. :8.00 Max. :472.0 Max. :335.0 Max. :4.93
## wt qsec vs am gear
## Min. :1.51 Min. :14.5 Min. :0.000 Min. :0.000 Min. :3.00
## 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:3.00
## Median :3.33 Median :17.7 Median :0.000 Median :0.000 Median :4.00
## Mean :3.22 Mean :17.8 Mean :0.438 Mean :0.406 Mean :3.69
## 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:4.00
## Max. :5.42 Max. :22.9 Max. :1.000 Max. :1.000 Max. :5.00
## carb
## Min. :1.00
## 1st Qu.:2.00
## Median :2.00
## Mean :2.81
## 3rd Qu.:4.00
## Max. :8.00
# Example checks:
any(mtcars$mpg <= 0)
## [1] FALSE
any(mtcars$wt <= 0)
## [1] FALSE
any(mtcars$hp < 0)
## [1] FALSE
unique(mtcars$am) # transmission (0 = automatic, 1 = manual)
## [1] 1 0
unique(mtcars$gear) # number of forward gears
## [1] 4 3 5
unique(mtcars$carb) # number of carburetors
## [1] 4 1 2 3 6 8
Evidence. No unrealistic negatives are observed for mpg, wt, or hp. The discrete variables (am, gear, carb) take the expected values for this dataset, so no inconsistencies were found.
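For completeness, the discrete-valued variables can also be tabulated to confirm their levels and counts (a minimal base-R sketch; not required by the prompt).
# Sketch: frequency tables for the discrete-valued variables
lapply(mtcars[c("cyl", "vs", "am", "gear", "carb")], table)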
We fit a full multiple linear regression of mpg on all other variables using lm().
model_full <- lm(mpg ~ ., data = mtcars)
summary(model_full)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.45 -1.60 -0.12 1.22 4.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.3034 18.7179 0.66 0.518
## cyl -0.1114 1.0450 -0.11 0.916
## disp 0.0133 0.0179 0.75 0.463
## hp -0.0215 0.0218 -0.99 0.335
## drat 0.7871 1.6354 0.48 0.635
## wt -3.7153 1.8944 -1.96 0.063 .
## qsec 0.8210 0.7308 1.12 0.274
## vs 0.3178 2.1045 0.15 0.881
## am 2.5202 2.0567 1.23 0.234
## gear 0.6554 1.4933 0.44 0.665
## carb -0.1994 0.8288 -0.24 0.812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.807
## F-statistic: 13.9 on 10 and 21 DF, p-value: 3.79e-07
Interpretation. Holding other variables constant, the coefficients indicate the direction and average change in mpg per unit change in each predictor. For example, a negative estimate for wt implies that heavier cars have lower mpg (controlling for the other variables), while the sign and significance of am reflect how manual vs. automatic transmission relates to mpg after adjusting for engine and weight differences. Because predictors are correlated (e.g., hp, cyl, and disp relate to each other), some individual p-values may be large due to multicollinearity, even if the variables are individually correlated with mpg.
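To quantify that multicollinearity, variance inflation factors can be computed with base R alone (a sketch; car::vif would give the same values, but we avoid extra packages here). For each predictor, regress it on the remaining predictors and take VIF = 1 / (1 - R²).
# Sketch: variance inflation factors using only base R
# VIF_j = 1 / (1 - R^2) from regressing predictor j on the other predictors
predictors <- setdiff(names(mtcars), "mpg")
vif_values <- sapply(predictors, function(v) {
  r2 <- summary(lm(reformulate(setdiff(predictors, v), response = v),
                   data = mtcars))$r.squared
  1 / (1 - r2)
})
round(sort(vif_values, decreasing = TRUE), 2)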
We check linearity, normality of residuals, homoscedasticity, and high-leverage points.
par(mfrow = c(2,2))
plot(model_full) # Residuals vs Fitted, QQ plot, Scale-Location, Residuals vs Leverage
par(mfrow = c(1,1))
Observations.
- Residuals vs Fitted: Look for random scatter (no
strong curvature). Mild patterns would suggest nonlinearity.
- Normal Q–Q: Points near the line indicate
approximately normal residuals; moderate deviations at tails are common
in small samples.
- Scale–Location: A roughly horizontal band suggests
constant variance; a funnel shape would indicate
heteroscedasticity.
- Residuals vs Leverage: Points with high leverage or
Cook’s distance circles may be influential; examine them if present.
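The visual checks can be supplemented with a few numeric diagnostics from base R (a hedged sketch; the 4/n cutoff for Cook's distance is a common rule of thumb, not a strict threshold).
# Sketch: numeric diagnostics to back up the plots
shapiro.test(resid(model_full))                       # normality of residuals
sort(hatvalues(model_full), decreasing = TRUE)[1:3]   # highest-leverage cars
which(cooks.distance(model_full) > 4 / nrow(mtcars))  # possibly influential points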
pred_full <- fitted(model_full)
res_full <- resid(model_full)
mse_full <- mean(res_full^2)
mse_full
## [1] 4.609
Answer. The MSE above summarizes the in-sample average squared residual. Lower MSE indicates tighter fit to the training data; however, it does not guarantee better out-of-sample performance.
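Note that this MSE divides the residual sum of squares by n = 32, whereas the residual standard error reported by summary() divides by the residual degrees of freedom (21). The small sketch below makes the link explicit.
# Sketch: relate in-sample MSE to the residual standard error
n <- nrow(mtcars)
res_df <- df.residual(model_full)     # 21 residual degrees of freedom
c(MSE = mse_full,
  sigma_hat = sigma(model_full),      # residual standard error from summary()
  sigma_from_mse = sqrt(mse_full * n / res_df))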
We test interactions that are mechanically plausible: for example, the impact of horsepower may depend on weight (wt:hp), and transmission (am) may moderate other relationships.
model_int <- lm(mpg ~ wt * hp + am + cyl + disp + drat + qsec + vs + gear + carb, data = mtcars)
summary(model_int)
##
## Call:
## lm(formula = mpg ~ wt * hp + am + cyl + disp + drat + qsec +
## vs + gear + carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.613 -1.448 0.257 1.118 4.091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.90397 16.39054 1.70 0.10417
## wt -9.61335 2.43983 -3.94 0.00081 ***
## hp -0.14099 0.04179 -3.37 0.00302 **
## am -0.72530 1.99904 -0.36 0.72054
## cyl 1.01137 0.94189 1.07 0.29571
## disp -0.00236 0.01572 -0.15 0.88201
## drat -0.80305 1.45506 -0.55 0.58713
## qsec 0.74433 0.61104 1.22 0.23735
## vs 0.13343 1.75911 0.08 0.94029
## gear 2.90761 1.43493 2.03 0.05628 .
## carb -0.51294 0.69936 -0.73 0.47180
## wt:hp 0.03622 0.01140 3.18 0.00475 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.21 on 20 degrees of freedom
## Multiple R-squared: 0.913, Adjusted R-squared: 0.865
## F-statistic: 19.1 on 11 and 20 DF, p-value: 3.05e-08
# Compare R^2 and adjusted R^2
c(R2_full = summary(model_full)$r.squared,
R2_int = summary(model_int)$r.squared,
AdjR2_full = summary(model_full)$adj.r.squared,
AdjR2_int = summary(model_int)$adj.r.squared)
## R2_full R2_int AdjR2_full AdjR2_int
## 0.8690 0.9129 0.8066 0.8650
Report. Inspect the p-values in the summary(model_int) output: if wt:hp is statistically significant, it indicates that the effect of horsepower on mpg depends on the vehicle's weight (and vice versa). The R²/adjusted-R² comparison shows whether model fit improves after accounting for the interaction, while adjusted R² penalizes the added complexity.
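Because model_full is nested in model_int (they differ only by the wt:hp term), a partial F-test gives a formal comparison (a minimal sketch using base R's anova()).
# Sketch: partial F-test for the wt:hp interaction (nested models)
anova(model_full, model_int)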
We detect outliers via Tukey’s rule, winsorize the variable with the most outliers at the 1st/99th percentiles, then refit and compare.
# Helper to count outliers using IQR rule
count_outliers <- function(x) {
qs <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- qs[2] - qs[1]
lower <- qs[1] - 1.5 * iqr
upper <- qs[2] + 1.5 * iqr
sum(x < lower | x > upper, na.rm = TRUE)
}
# Count outliers per column
outlier_counts <- sapply(mtcars, count_outliers)
outlier_counts
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 0 0 1 0 3 1 0 0 0 1
# Variable with the most outliers
var_most_out <- names(which.max(outlier_counts))
var_most_out
## [1] "wt"
# Winsorization at 1% and 99% for that variable
winsorize <- function(x, probs = c(0.01, 0.99)) {
qs <- quantile(x, probs = probs, na.rm = TRUE)
pmin(pmax(x, qs[1]), qs[2])
}
mtcars_w <- mtcars
mtcars_w[[var_most_out]] <- winsorize(mtcars_w[[var_most_out]])
# Refit models on winsorized data
model_full_w <- lm(mpg ~ ., data = mtcars_w)
model_int_w <- lm(mpg ~ wt * hp + am + cyl + disp + drat + qsec + vs + gear + carb, data = mtcars_w)
# Compare fits
cmp <- data.frame(
Model = c("Full", "Full (winsorized)", "Interact", "Interact (winsorized)"),
R2 = c(summary(model_full)$r.squared,
summary(model_full_w)$r.squared,
summary(model_int)$r.squared,
summary(model_int_w)$r.squared),
AdjR2 = c(summary(model_full)$adj.r.squared,
summary(model_full_w)$adj.r.squared,
summary(model_int)$adj.r.squared,
summary(model_int_w)$adj.r.squared),
MSE = c(mean(resid(model_full)^2),
mean(resid(model_full_w)^2),
mean(resid(model_int)^2),
mean(resid(model_int_w)^2))
)
cmp
## Model R2 AdjR2 MSE
## 1 Full 0.8690 0.8066 4.609
## 2 Full (winsorized) 0.8683 0.8056 4.633
## 3 Interact 0.9129 0.8650 3.064
## 4 Interact (winsorized) 0.9126 0.8646 3.075
Answer. The variable with the most outliers under Tukey's rule is wt (3 flagged points, printed above). Winsorizing wt at the 1st/99th percentiles changes the fit very little: R², adjusted R², and MSE are essentially unchanged for both the full and interaction models (e.g., full-model MSE moves from 4.609 to 4.633). Coefficient estimates should also be compared directly for any important shifts. Since performance does not improve here, winsorizing trades bias for reduced variance and may or may not help.
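To inspect the coefficient changes directly, the estimates from the original and winsorized fits can be placed side by side (a small sketch building on the models fit above).
# Sketch: compare coefficients before and after winsorizing wt
round(cbind(Full        = coef(model_full),
            Full_winsor = coef(model_full_w)), 3)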
Even though adding terms (including interactions) generally increases R², this reflects in-sample fit and can overstate true predictive ability. Adjusted R² partially corrects for complexity, but the best check is out-of-sample validation (e.g., cross‑validation).
set.seed(123)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(mtcars)))
cv_mse <- numeric(K)
for (k in 1:K) {
train <- mtcars[folds != k, ]
test <- mtcars[folds == k, ]
fit <- lm(mpg ~ ., data = train)
preds <- predict(fit, newdata = test)
cv_mse[k] <- mean((test$mpg - preds)^2)
}
cv_mse
mean(cv_mse)
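The same folds can be reused to cross-validate the interaction model, so the two specifications are compared on identical splits (a hedged sketch; with only 32 observations, the results will vary with the seed).
# Sketch: 5-fold CV MSE for the interaction model on the same folds
cv_mse_int <- numeric(K)
for (k in 1:K) {
  train <- mtcars[folds != k, ]
  test  <- mtcars[folds == k, ]
  fit   <- lm(mpg ~ wt * hp + am + cyl + disp + drat + qsec + vs + gear + carb,
              data = train)
  cv_mse_int[k] <- mean((test$mpg - predict(fit, newdata = test))^2)
}
c(CV_MSE_full = mean(cv_mse), CV_MSE_interaction = mean(cv_mse_int))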