1 Overview

This report explores the built-in mtcars dataset through EDA, preprocessing checks, and multiple linear regression using base R. It follows the assignment prompts exactly and includes narrative answers alongside reproducible code.

Dataset: mtcars (Motor Trend US magazine).
Goal: Predict mpg using other variables and interpret/diagnose the model.


2 1. Data Exploration

2.1 1.1 Load and Review Codebook & Summary

data(mtcars)          # dataset is built-in
str(mtcars)           # structure of variables (serves as a compact codebook)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)       # summary statistics
##       mpg            cyl            disp             hp             drat     
##  Min.   :10.4   Min.   :4.00   Min.   : 71.1   Min.   : 52.0   Min.   :2.76  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.08  
##  Median :19.2   Median :6.00   Median :196.3   Median :123.0   Median :3.69  
##  Mean   :20.1   Mean   :6.19   Mean   :230.7   Mean   :146.7   Mean   :3.60  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.92  
##  Max.   :33.9   Max.   :8.00   Max.   :472.0   Max.   :335.0   Max.   :4.93  
##        wt            qsec            vs              am             gear     
##  Min.   :1.51   Min.   :14.5   Min.   :0.000   Min.   :0.000   Min.   :3.00  
##  1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:3.00  
##  Median :3.33   Median :17.7   Median :0.000   Median :0.000   Median :4.00  
##  Mean   :3.22   Mean   :17.8   Mean   :0.438   Mean   :0.406   Mean   :3.69  
##  3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:4.00  
##  Max.   :5.42   Max.   :22.9   Max.   :1.000   Max.   :1.000   Max.   :5.00  
##       carb     
##  Min.   :1.00  
##  1st Qu.:2.00  
##  Median :2.00  
##  Mean   :2.81  
##  3rd Qu.:4.00  
##  Max.   :8.00

Narrative. mpg (miles per gallon) is our target. Candidate predictors include engine-related measures (e.g., displacement disp, horsepower hp, cylinders cyl), weight wt, and drivetrain variables (am = transmission, gear, carb, vs). Intuitively, heavier, more powerful cars tend to achieve lower fuel economy; manual transmission (am = 1) can sometimes improve mpg.

2.2 1.2 Visualizations

par(mfrow = c(1,2))
plot(mtcars$wt, mtcars$mpg, pch = 19, main = "MPG vs Weight (wt)", xlab = "Weight (1000 lbs)", ylab = "MPG")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)

plot(mtcars$hp, mtcars$mpg, pch = 19, main = "MPG vs Horsepower (hp)", xlab = "Horsepower", ylab = "MPG")
abline(lm(mpg ~ hp, data = mtcars), col = "red", lwd = 2)

par(mfrow = c(1,1))

Interpretation. The scatterplots suggest negative relationships: as weight or horsepower increase, mpg tends to decrease. The fitted lines reinforce these trends and hint that wt might be among the strongest predictors of fuel economy.
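To put numbers on the two fitted lines, the slopes and single-predictor R² values can be pulled from the same simple regressions that draw the red lines. This is a minimal sketch; the object names `fit_wt`, `fit_hp`, and `slopes` are illustrative and not part of the original analysis.

```r
# Simple one-predictor fits matching the red lines in the scatterplots
fit_wt <- lm(mpg ~ wt, data = mtcars)
fit_hp <- lm(mpg ~ hp, data = mtcars)

# Slope = estimated change in mpg per one-unit increase in the predictor
slopes <- c(wt = unname(coef(fit_wt)["wt"]),
            hp = unname(coef(fit_hp)["hp"]))
round(slopes, 3)

# Share of mpg variance explained by each predictor on its own
round(c(wt = summary(fit_wt)$r.squared,
        hp = summary(fit_hp)$r.squared), 3)
```

The two slopes are on different scales (per 1000 lbs versus per horsepower), so the R² values are the fairer way to compare how much each predictor explains by itself.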

2.3 1.3 Correlations

cor_matrix <- cor(mtcars)   # Pearson correlations for numeric vars
round(cor_matrix["mpg", ], 3)  # correlations with mpg
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
##  1.000 -0.852 -0.848 -0.776  0.681 -0.868  0.419  0.664  0.600  0.480 -0.551

Answer. The largest-magnitude correlations with mpg are all negative: wt (-0.868), cyl (-0.852), disp (-0.848), and hp (-0.776). Heavier cars with more cylinders, larger displacement, and more horsepower get lower mpg. The strongest positive associations are drat (rear axle ratio, 0.681), vs (0.664), and am (0.600); qsec (1/4 mile time) is weaker at 0.419.
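Reading the ranking off an unsorted row is error-prone, so it may help to sort the correlations by absolute value. A small sketch, with `cor_mpg` and `ranked` as illustrative names:

```r
# Correlations of every variable with mpg
cor_mpg <- cor(mtcars)["mpg", ]
cor_mpg <- cor_mpg[names(cor_mpg) != "mpg"]   # drop the trivial self-correlation

# Order by strength of association, regardless of sign
ranked <- cor_mpg[order(abs(cor_mpg), decreasing = TRUE)]
round(ranked, 3)
```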


3 2. Data Preprocessing

3.1 2.1 Missing Data Check

total_na <- sum(is.na(mtcars))
na_by_col <- colSums(is.na(mtcars))
total_na
## [1] 0
na_by_col
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0

Evidence. The built-in mtcars dataset contains no missing values: the overall NA count is 0, and the count for every column is 0.

3.2 2.2 Inconsistent/Invalid Data Check

We check sanity constraints (non-negative values; reasonable integer levels for gears/carb).

summary(mtcars)
##       mpg            cyl            disp             hp             drat     
##  Min.   :10.4   Min.   :4.00   Min.   : 71.1   Min.   : 52.0   Min.   :2.76  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.08  
##  Median :19.2   Median :6.00   Median :196.3   Median :123.0   Median :3.69  
##  Mean   :20.1   Mean   :6.19   Mean   :230.7   Mean   :146.7   Mean   :3.60  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.92  
##  Max.   :33.9   Max.   :8.00   Max.   :472.0   Max.   :335.0   Max.   :4.93  
##        wt            qsec            vs              am             gear     
##  Min.   :1.51   Min.   :14.5   Min.   :0.000   Min.   :0.000   Min.   :3.00  
##  1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:3.00  
##  Median :3.33   Median :17.7   Median :0.000   Median :0.000   Median :4.00  
##  Mean   :3.22   Mean   :17.8   Mean   :0.438   Mean   :0.406   Mean   :3.69  
##  3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:4.00  
##  Max.   :5.42   Max.   :22.9   Max.   :1.000   Max.   :1.000   Max.   :5.00  
##       carb     
##  Min.   :1.00  
##  1st Qu.:2.00  
##  Median :2.00  
##  Mean   :2.81  
##  3rd Qu.:4.00  
##  Max.   :8.00
# Example checks:
any(mtcars$mpg <= 0)
## [1] FALSE
any(mtcars$wt <= 0)
## [1] FALSE
any(mtcars$hp < 0)
## [1] FALSE
unique(mtcars$am)   # transmission (0 = automatic, 1 = manual)
## [1] 1 0
unique(mtcars$gear) # number of forward gears
## [1] 4 3 5
unique(mtcars$carb) # number of carburetors
## [1] 4 1 2 3 6 8

Evidence. No unrealistic negatives are observed for mpg, wt, or hp. Discrete variables (am, gear, carb) have expected ranges for this dataset, so no inconsistencies were found.
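The same sanity constraints can be checked across all columns at once rather than one variable at a time. The sketch below assumes a particular grouping of columns into counts/codes and continuous measurements; that grouping (`count_cols`, `cont_cols`) is my reading of the variables, not something defined in the original report.

```r
# Columns that should hold whole numbers (counts or 0/1 codes)
count_cols <- c("cyl", "vs", "am", "gear", "carb")
is_whole   <- sapply(mtcars[count_cols], function(x) all(x == round(x)))

# Binary indicators should only contain 0 and 1
is_binary <- sapply(mtcars[c("vs", "am")], function(x) all(x %in% c(0, 1)))

# Continuous measurements should be strictly positive
cont_cols   <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
is_positive <- sapply(mtcars[cont_cols], function(x) all(x > 0))

is_whole
is_binary
is_positive

# 0 means no car name appears twice
anyDuplicated(rownames(mtcars))
```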


4 3. Linear Regression using lm

4.1 3.1 Full Model: mpg ~ .

model_full <- lm(mpg ~ ., data = mtcars)
summary(model_full)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.45  -1.60  -0.12   1.22   4.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  12.3034    18.7179    0.66    0.518  
## cyl          -0.1114     1.0450   -0.11    0.916  
## disp          0.0133     0.0179    0.75    0.463  
## hp           -0.0215     0.0218   -0.99    0.335  
## drat          0.7871     1.6354    0.48    0.635  
## wt           -3.7153     1.8944   -1.96    0.063 .
## qsec          0.8210     0.7308    1.12    0.274  
## vs            0.3178     2.1045    0.15    0.881  
## am            2.5202     2.0567    1.23    0.234  
## gear          0.6554     1.4933    0.44    0.665  
## carb         -0.1994     0.8288   -0.24    0.812  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.807 
## F-statistic: 13.9 on 10 and 21 DF,  p-value: 3.79e-07

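Interpretation. Each coefficient estimates the average change in mpg per one-unit increase in that predictor, holding the others constant. The wt estimate of -3.72 means that, other things equal, an extra 1000 lbs is associated with about 3.7 fewer mpg; it is the only term close to conventional significance (p = 0.063). The am estimate of 2.52 suggests manual cars get about 2.5 more mpg after adjusting for engine and weight differences, but with p = 0.234 that difference is not distinguishable from zero. No individual coefficient is significant at the 0.05 level, even though the overall F-test is highly significant (p = 3.79e-07) and R² is 0.869. This pattern points to multicollinearity: predictors such as hp, cyl, and disp are strongly related to one another, which inflates their standard errors, so individual p-values can be large even for variables that are strongly correlated with mpg on their own.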
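One way to quantify the multicollinearity mentioned above is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The sketch below computes it in base R; `vif_manual` is an illustrative name, and `car::vif()` is a common packaged alternative that the original report does not use.

```r
# Predictor columns only (drop the response)
X <- mtcars[, setdiff(names(mtcars), "mpg")]

# For each predictor, regress it on the remaining predictors and convert R^2 to a VIF
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(sort(vif_manual, decreasing = TRUE), 2)
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, but that cutoff is a convention rather than a formal test.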

4.2 3.2 Model Assumptions & Diagnostics

We check linearity, normality of residuals, homoscedasticity, and high-leverage points.

par(mfrow = c(2,2))
plot(model_full)  # Residuals vs Fitted, QQ plot, Scale-Location, Residuals vs Leverage

par(mfrow = c(1,1))

Observations.
- Residuals vs Fitted: Look for random scatter (no strong curvature). Mild patterns would suggest nonlinearity.
- Normal Q–Q: Points near the line indicate approximately normal residuals; moderate deviations at tails are common in small samples.
- Scale–Location: A roughly horizontal band suggests constant variance; a funnel shape would indicate heteroscedasticity.
- Residuals vs Leverage: Points with high leverage, or points falling beyond the dashed Cook’s distance contours, may be influential; examine them if present.
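The visual checks above can be complemented with a few numbers. The sketch below runs a Shapiro-Wilk test on the residuals and lists cars that exceed two common rules of thumb (leverage above 2p/n, Cook's distance above 4/n). Those cutoffs are conventions I am assuming for illustration; the original report does not specify any thresholds.

```r
model_full <- lm(mpg ~ ., data = mtcars)

# Shapiro-Wilk test on residuals: a small p-value would cast doubt on normality
sw <- shapiro.test(resid(model_full))
sw$p.value

# Leverage (hat values) and Cook's distance for each car
lev   <- hatvalues(model_full)
cooks <- cooks.distance(model_full)

# Rule-of-thumb cutoffs (assumed conventions, not hard thresholds)
p <- length(coef(model_full))   # number of estimated coefficients
n <- nrow(mtcars)
diag_tbl <- data.frame(leverage = lev, cooks_d = cooks)
flagged  <- diag_tbl[lev > 2 * p / n | cooks > 4 / n, ]
flagged[order(-flagged$cooks_d), ]
```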

4.3 3.3 Evaluate with MSE

pred_full <- fitted(model_full)
res_full  <- resid(model_full)
mse_full  <- mean(res_full^2)
mse_full
## [1] 4.609

Answer. The in-sample MSE is 4.609, the average squared residual over the 32 observations used to fit the model (an RMSE of about 2.15 mpg). A lower MSE indicates a tighter fit to the training data, but it does not guarantee better out-of-sample performance.
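The MSE above divides the residual sum of squares by n, whereas the "Residual standard error" in the lm summary divides by the residual degrees of freedom (n - p). The sketch below shows how the two relate; the names `mse_n`, `sigma2`, and `rmse` are illustrative.

```r
model_full <- lm(mpg ~ ., data = mtcars)

n   <- nrow(mtcars)
p   <- length(coef(model_full))   # 11 coefficients, including the intercept
rss <- sum(resid(model_full)^2)   # residual sum of squares

mse_n  <- rss / n                 # in-sample MSE (divides by n)
sigma2 <- rss / (n - p)           # error-variance estimate (divides by n - p)
rmse   <- sqrt(mse_n)             # typical in-sample error, in mpg units

c(MSE = mse_n, sigma2 = sigma2, RMSE = rmse, RSE = sqrt(sigma2))
```

Because n - p is only 21 here, the degrees-of-freedom correction is large, which is why the residual standard error (2.65) is noticeably bigger than the RMSE.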

4.4 3.4 Add Interaction Term(s)

We test an interaction that is plausible mechanically: the impact of horsepower may depend on weight (wt:hp). Transmission (am) may also moderate some relationships, but only the wt:hp term is added to the model below.

model_int <- lm(mpg ~ wt * hp + am + cyl + disp + drat + qsec + vs + gear + carb, data = mtcars)
summary(model_int)
## 
## Call:
## lm(formula = mpg ~ wt * hp + am + cyl + disp + drat + qsec + 
##     vs + gear + carb, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.613 -1.448  0.257  1.118  4.091 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.90397   16.39054    1.70  0.10417    
## wt          -9.61335    2.43983   -3.94  0.00081 ***
## hp          -0.14099    0.04179   -3.37  0.00302 ** 
## am          -0.72530    1.99904   -0.36  0.72054    
## cyl          1.01137    0.94189    1.07  0.29571    
## disp        -0.00236    0.01572   -0.15  0.88201    
## drat        -0.80305    1.45506   -0.55  0.58713    
## qsec         0.74433    0.61104    1.22  0.23735    
## vs           0.13343    1.75911    0.08  0.94029    
## gear         2.90761    1.43493    2.03  0.05628 .  
## carb        -0.51294    0.69936   -0.73  0.47180    
## wt:hp        0.03622    0.01140    3.18  0.00475 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.21 on 20 degrees of freedom
## Multiple R-squared:  0.913,  Adjusted R-squared:  0.865 
## F-statistic: 19.1 on 11 and 20 DF,  p-value: 3.05e-08
# Compare R^2 and adjusted R^2
c(R2_full = summary(model_full)$r.squared,
  R2_int  = summary(model_int)$r.squared,
  AdjR2_full = summary(model_full)$adj.r.squared,
  AdjR2_int  = summary(model_int)$adj.r.squared)
##    R2_full     R2_int AdjR2_full  AdjR2_int 
##     0.8690     0.9129     0.8066     0.8650

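Report. The wt:hp term is statistically significant (estimate 0.0362, p = 0.00475), indicating that the effect of horsepower on mpg depends on the vehicle’s weight (and vice versa). The positive sign means the negative hp slope becomes less steep as weight increases. With the interaction included, the wt and hp terms are also significant (p = 0.00081 and p = 0.00302), although they now describe each slope when the other variable equals zero. R² rises from 0.869 to 0.913, and adjusted R², which penalizes added complexity, rises from 0.807 to 0.865, so the improvement is not just a by-product of adding a parameter.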
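Two follow-up checks can make the interaction easier to read: a nested-model F-test comparing the two fits, and the implied hp slope at a few weights, since with wt:hp in the model the slope is d(mpg)/d(hp) = b_hp + b_wt:hp × wt. This is a sketch; evaluating the slope at the weight quartiles is an arbitrary choice of mine for illustration.

```r
model_full <- lm(mpg ~ ., data = mtcars)
model_int  <- lm(mpg ~ wt * hp + am + cyl + disp + drat + qsec + vs + gear + carb,
                 data = mtcars)

# Nested-model F-test: does adding wt:hp reduce residual error beyond chance?
cmp_anova <- anova(model_full, model_int)
cmp_anova

# Implied hp slope at the 25th, 50th, and 75th percentiles of weight
b        <- coef(model_int)
wt_grid  <- quantile(mtcars$wt, c(0.25, 0.50, 0.75))
hp_slope <- unname(b["hp"]) + unname(b["wt:hp"]) * wt_grid
round(hp_slope, 4)
```

Since only one term is added, the F-test here is equivalent to the t-test on wt:hp in the summary output.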

4.5 3.5 Outliers & Winsorization

We detect outliers via Tukey’s rule, winsorize the variable with the most outliers at the 1st/99th percentiles, then refit and compare.

# Helper to count outliers using IQR rule
count_outliers <- function(x) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[2] - qs[1]
  lower <- qs[1] - 1.5 * iqr
  upper <- qs[2] + 1.5 * iqr
  sum(x < lower | x > upper, na.rm = TRUE)
}

# Count outliers per column
outlier_counts <- sapply(mtcars, count_outliers)
outlier_counts
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    1    0    0    1    0    3    1    0    0    0    1
# Variable with the most outliers
var_most_out <- names(which.max(outlier_counts))
var_most_out
## [1] "wt"
# Winsorization at 1% and 99% for that variable
winsorize <- function(x, probs = c(0.01, 0.99)) {
  qs <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, qs[1]), qs[2])
}

mtcars_w <- mtcars
mtcars_w[[var_most_out]] <- winsorize(mtcars_w[[var_most_out]])

# Refit models on winsorized data
model_full_w <- lm(mpg ~ ., data = mtcars_w)
model_int_w  <- lm(mpg ~ wt * hp + am + cyl + disp + drat + qsec + vs + gear + carb, data = mtcars_w)

# Compare fits
cmp <- data.frame(
  Model = c("Full", "Full (winsorized)", "Interact", "Interact (winsorized)"),
  R2    = c(summary(model_full)$r.squared,
            summary(model_full_w)$r.squared,
            summary(model_int)$r.squared,
            summary(model_int_w)$r.squared),
  AdjR2 = c(summary(model_full)$adj.r.squared,
            summary(model_full_w)$adj.r.squared,
            summary(model_int)$adj.r.squared,
            summary(model_int_w)$adj.r.squared),
  MSE   = c(mean(resid(model_full)^2),
            mean(resid(model_full_w)^2),
            mean(resid(model_int)^2),
            mean(resid(model_int_w)^2))
)
cmp
##                   Model     R2  AdjR2   MSE
## 1                  Full 0.8690 0.8066 4.609
## 2     Full (winsorized) 0.8683 0.8056 4.633
## 3              Interact 0.9129 0.8650 3.064
## 4 Interact (winsorized) 0.9126 0.8646 3.075

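Answer. wt had the most outliers under Tukey’s rule (3; mpg, hp, qsec, and carb had 1 each). Winsorizing wt at the 1st/99th percentiles left the fit essentially unchanged. For the full model, R² moved from 0.8690 to 0.8683, Adj‑R² from 0.8066 to 0.8056, and MSE from 4.609 to 4.633. For the interaction model, R² moved from 0.9129 to 0.9126, Adj‑R² from 0.8650 to 0.8646, and MSE from 3.064 to 3.075. Performance therefore did not improve. Winsorizing trades some bias for reduced variance, so it may or may not help. The coefficient tables for the winsorized models are not printed above; compare summary(model_full_w) and summary(model_int_w) with the originals to check for coefficient changes.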
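It is worth checking how many values the winsorization actually altered. With 32 observations and R's default quantile method, the 1st and 99th percentiles fall between the two most extreme values at each end, so only the minimum and maximum of wt can be clipped. The sketch below lists the modified cars and counts how many values still lie above the Tukey upper fence; `wt_w`, `changed`, and `upper_fence` are illustrative names.

```r
# Same 1%/99% winsorization as above, applied directly to wt
qs   <- quantile(mtcars$wt, probs = c(0.01, 0.99))
wt_w <- pmin(pmax(mtcars$wt, qs[1]), qs[2])

# Which cars were actually modified, and by how much?
changed <- which(wt_w != mtcars$wt)
data.frame(car           = rownames(mtcars)[changed],
           wt_original   = mtcars$wt[changed],
           wt_winsorized = round(wt_w[changed], 3))

# Tukey upper fence: values above it are the outliers flagged by the IQR rule
upper_fence <- unname(quantile(mtcars$wt, 0.75) + 1.5 * IQR(mtcars$wt))
sum(mtcars$wt > upper_fence)   # flagged before winsorization
sum(wt_w > upper_fence)        # still above the fence afterwards
```

If most flagged values survive the clipping, a near-identical refit is the expected outcome; a tighter cap (for example at the Tukey fences) would be a different treatment from the one specified here.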

4.6 3.6 Reflection: Does higher R² imply better predictability? (no credit)

Adding terms (including interactions) never decreases R², so a higher R² reflects in-sample fit and can overstate true predictive ability. Adjusted R² partially corrects for complexity, but the best check is out-of-sample validation (e.g., cross‑validation).
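For linear models, leave-one-out cross-validation can be computed without refitting, because the held-out residual for case i equals e_i / (1 - h_ii), where h_ii is the leverage. The sketch below uses that identity to set in-sample and leave-one-out MSE side by side for both models; `loocv_mse` is a hypothetical helper, not a function from the original report.

```r
# Hypothetical helper: leave-one-out MSE from residuals and hat values
loocv_mse <- function(fit) {
  mean((resid(fit) / (1 - hatvalues(fit)))^2)
}

model_full <- lm(mpg ~ ., data = mtcars)
model_int  <- lm(mpg ~ wt * hp + am + cyl + disp + drat + qsec + vs + gear + carb,
                 data = mtcars)

# In-sample error versus leave-one-out error for each model
rbind(
  full     = c(in_sample = mean(resid(model_full)^2), loocv = loocv_mse(model_full)),
  interact = c(in_sample = mean(resid(model_int)^2),  loocv = loocv_mse(model_int))
)
```

The leave-one-out figure is always at least as large as the in-sample MSE, and the gap gives a rough sense of how optimistic the in-sample fit is.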


5 Appendix (Optional)

5.1 A. Quick 5-fold Cross-Validation MSE (illustrative)

set.seed(123)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(mtcars)))

cv_mse <- numeric(K)
for (k in 1:K) {
  train <- mtcars[folds != k, ]
  test  <- mtcars[folds == k, ]
  fit   <- lm(mpg ~ ., data = train)
  preds <- predict(fit, newdata = test)
  cv_mse[k] <- mean((test$mpg - preds)^2)
}
cv_mse
mean(cv_mse)
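The loop above evaluates only the full model. To compare it with the interaction model on identical folds, the same procedure can be wrapped in a small function. This is a sketch: `kfold_mse` is a hypothetical helper, and with only 32 cars the estimates will vary noticeably with the seed.

```r
# Hypothetical helper: K-fold CV MSE for any lm formula, given fixed fold labels
kfold_mse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]   # name of the left-hand-side variable
  errs <- sapply(sort(unique(folds)), function(k) {
    fit   <- lm(formula, data = data[folds != k, ])
    preds <- predict(fit, newdata = data[folds == k, ])
    mean((data[[response]][folds == k] - preds)^2)
  })
  mean(errs)
}

set.seed(123)
folds <- sample(rep(1:5, length.out = nrow(mtcars)))

cv_cmp <- c(
  full     = kfold_mse(mpg ~ ., mtcars, folds),
  interact = kfold_mse(mpg ~ wt * hp + am + cyl + disp + drat + qsec + vs + gear + carb,
                       mtcars, folds)
)
cv_cmp
```

Reusing one set of fold labels for both formulas means any difference in the two numbers comes from the models rather than from different train/test splits.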