library(tidyverse)Comprehensive Statistical Analysis of the Auto MPG Dataset
This report was generated using AI under general human direction. At the time of generation, the contents have not been comprehensively reviewed by a human analyst.
url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
download.file(url, destfile = "mpg_dataset.csv", mode = "wb")
mpg_data <- read.csv("mpg_dataset.csv", stringsAsFactors = FALSE)1. Introduction
This report presents a comprehensive statistical analysis of the Auto MPG dataset, originally from the UCI Machine Learning Repository and widely used on Kaggle. The dataset contains fuel consumption data for 398 automobiles from model years 1970–1982, with variables describing engine characteristics, vehicle weight, and country of origin.
The analysis covers:
- Descriptive statistics and distributional properties
- Normality testing
- Correlation analysis
- Hypothesis testing (parametric and non-parametric)
- Simple, multiple, interaction, and origin-specific regression models
- Regression diagnostics
2. Data Loading & Cleaning
The dataset has 398 rows and 9 columns:
glimpse(mpg_data)Rows: 398
Columns: 9
$ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
$ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
$ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
$ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
$ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
$ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
$ model_year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
$ origin <chr> "usa", "usa", "usa", "usa", "usa", "usa", "usa", "usa", "…
$ name <chr> "chevrolet chevelle malibu", "buick skylark 320", "plymou…
Missing Values
mpg_data |> summarise(across(everything(), ~sum(is.na(.)))) |> glimpse()Rows: 1
Columns: 9
$ mpg <int> 0
$ cylinders <int> 0
$ displacement <int> 0
$ horsepower <int> 6
$ weight <int> 0
$ acceleration <int> 0
$ model_year <int> 0
$ origin <int> 0
$ name <int> 0
Six rows have missing horsepower values. Inspecting them:
mpg_data |> filter(is.na(horsepower)) mpg cylinders displacement horsepower weight acceleration model_year origin
1 25.0 4 98 NA 2046 19.0 71 usa
2 21.0 6 200 NA 2875 17.0 74 usa
3 40.9 4 85 NA 1835 17.3 80 europe
4 23.6 4 140 NA 2905 14.3 80 usa
5 34.5 4 100 NA 2320 15.8 81 europe
6 23.0 4 151 NA 3035 20.5 82 usa
name
1 ford pinto
2 ford maverick
3 renault lecar deluxe
4 ford mustang cobra
5 renault 18i
6 amc concord dl
We drop these rows and convert origin to a factor for downstream analysis.
mpg_clean <- mpg_data |>
drop_na(horsepower) |>
mutate(origin = factor(origin, levels = c("usa", "europe", "japan")))After cleaning: 392 rows remain.
3. Descriptive Statistics
Overall Summary
summary(mpg_clean |> select(-name)) mpg cylinders displacement horsepower weight
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
acceleration model_year origin
Min. : 8.00 Min. :70.00 usa :245
1st Qu.:13.78 1st Qu.:73.00 europe: 68
Median :15.50 Median :76.00 japan : 79
Mean :15.54 Mean :75.98
3rd Qu.:17.02 3rd Qu.:79.00
Max. :24.80 Max. :82.00
Detailed Numeric Statistics
numeric_vars <- c("mpg", "displacement", "horsepower", "weight", "acceleration")
desc_stats <- mpg_clean |>
summarise(across(all_of(numeric_vars), list(
n = ~n(),
mean = ~mean(.),
median = ~median(.),
sd = ~sd(.),
min = ~min(.),
max = ~max(.),
Q1 = ~quantile(., 0.25),
Q3 = ~quantile(., 0.75),
IQR = ~IQR(.),
skew = ~(mean(.) - median(.)) / sd(.),
cv = ~sd(.) / mean(.) * 100
), .names = "{.col}__{.fn}")) |>
pivot_longer(everything(),
names_to = c("variable", "stat"),
names_sep = "__") |>
pivot_wider(names_from = stat, values_from = value)
desc_stats# A tibble: 5 × 12
variable n mean median sd min max Q1 Q3 IQR skew
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mpg 392 23.4 22.8 7.81 9 46.6 17 29 1.2 e1 0.0892
2 displacem… 392 194. 151 105. 68 455 105 276. 1.71e2 0.415
3 horsepower 392 104. 93.5 38.5 46 230 75 126 5.1 e1 0.285
4 weight 392 2978. 2804. 849. 1613 5140 2225. 3615. 1.39e3 0.205
5 accelerat… 392 15.5 15.5 2.76 8 24.8 13.8 17.0 3.25e0 0.0150
# ℹ 1 more variable: cv <dbl>
Group Statistics by Origin
mpg_clean |>
group_by(origin) |>
summarise(
n = n(),
mean_mpg = mean(mpg),
sd_mpg = sd(mpg),
median_mpg = median(mpg),
min_mpg = min(mpg),
max_mpg = max(mpg)
)# A tibble: 3 × 7
origin n mean_mpg sd_mpg median_mpg min_mpg max_mpg
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 usa 245 20.0 6.44 18.5 9 39
2 europe 68 27.6 6.58 26 16.2 44.3
3 japan 79 30.5 6.09 31.6 18 46.6
Group Statistics by Cylinders
mpg_clean |>
group_by(cylinders) |>
summarise(
n = n(),
mean_mpg = mean(mpg),
sd_mpg = sd(mpg),
mean_hp = mean(horsepower),
mean_wt = mean(weight)
)# A tibble: 5 × 6
cylinders n mean_mpg sd_mpg mean_hp mean_wt
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 3 4 20.6 2.56 99.2 2398.
2 4 199 29.3 5.67 78.3 2305.
3 5 3 27.4 8.23 82.3 3103.
4 6 83 20.0 3.83 102. 3202.
5 8 103 15.0 2.84 158. 4115.
4. Normality Tests
Shapiro-Wilk Test
The null hypothesis is that the data are normally distributed.
normality_results <- map_dfr(numeric_vars, function(v) {
x <- mpg_clean[[v]]
sw <- shapiro.test(x)
tibble(
variable = v,
W_stat = sw$statistic,
p_value = sw$p.value,
normal = ifelse(sw$p.value > 0.05, "Yes", "No")
)
})
normality_results# A tibble: 5 × 4
variable W_stat p_value normal
<chr> <dbl> <dbl> <chr>
1 mpg 0.967 1.05e- 7 No
2 displacement 0.882 8.98e-17 No
3 horsepower 0.904 5.02e-15 No
4 weight 0.941 2.60e-11 No
5 acceleration 0.992 3.05e- 2 No
Kolmogorov-Smirnov Test
ks_results <- map_dfr(numeric_vars, function(v) {
x <- mpg_clean[[v]]
ks <- ks.test(x, "pnorm", mean(x), sd(x))
tibble(
variable = v,
D_stat = ks$statistic,
p_value = ks$p.value,
normal = ifelse(ks$p.value > 0.05, "Yes", "No")
)
})
ks_results# A tibble: 5 × 4
variable D_stat p_value normal
<chr> <dbl> <dbl> <chr>
1 mpg 0.0818 1.06e- 2 No
2 displacement 0.181 1.22e-11 No
3 horsepower 0.164 1.55e- 9 No
4 weight 0.0946 1.80e- 3 No
5 acceleration 0.0513 2.53e- 1 Yes
Most variables are non-normal by both tests. Only acceleration passes the KS test (p = 0.25), though it narrowly fails Shapiro-Wilk (p = 0.03).
5. Distribution Visualizations
Faceted Histograms
mpg_clean |>
select(mpg, horsepower, weight, acceleration) |>
pivot_longer(everything(), names_to = "variable", values_to = "value") |>
mutate(variable = factor(variable,
levels = c("mpg", "horsepower", "weight", "acceleration"))) |>
ggplot(aes(x = value, fill = variable)) +
geom_histogram(bins = 25, color = "white", show.legend = FALSE) +
facet_wrap(~variable, scales = "free", ncol = 2) +
labs(title = "Distributions of Key Numeric Variables")Violin Plot: MPG by Origin
ggplot(mpg_clean, aes(x = origin, y = mpg, fill = origin)) +
geom_violin(alpha = 0.4, draw_quantiles = c(0.25, 0.5, 0.75)) +
geom_jitter(width = 0.15, alpha = 0.3, size = 1.2) +
labs(title = "MPG Distribution by Country of Origin",
subtitle = "Lines show 25th, 50th, 75th percentiles",
x = "Origin", y = "Miles Per Gallon") +
theme(legend.position = "none")Density Plot: MPG by Cylinders
ggplot(mpg_clean |> filter(cylinders %in% c(4, 6, 8)),
aes(x = mpg, fill = factor(cylinders), color = factor(cylinders))) +
geom_density(alpha = 0.3, linewidth = 0.8) +
labs(title = "MPG Density by Cylinder Count",
subtitle = "4-cylinder cars show the widest and highest MPG range",
x = "Miles Per Gallon", y = "Density",
fill = "Cylinders", color = "Cylinders")Q-Q Plot for MPG
ggplot(mpg_clean, aes(sample = mpg)) +
stat_qq(color = "steelblue", alpha = 0.6) +
stat_qq_line(color = "red") +
labs(title = "Q-Q Plot for MPG",
subtitle = "Deviations from the red line indicate non-normality",
x = "Theoretical Quantiles", y = "Sample Quantiles")6. Correlation Analysis
Pearson Correlation Matrix
cor_pearson <- cor(mpg_clean[numeric_vars], method = "pearson")
round(cor_pearson, 3) mpg displacement horsepower weight acceleration
mpg 1.000 -0.805 -0.778 -0.832 0.423
displacement -0.805 1.000 0.897 0.933 -0.544
horsepower -0.778 0.897 1.000 0.865 -0.689
weight -0.832 0.933 0.865 1.000 -0.417
acceleration 0.423 -0.544 -0.689 -0.417 1.000
Spearman Correlation Matrix
cor_spearman <- cor(mpg_clean[numeric_vars], method = "spearman")
round(cor_spearman, 3) mpg displacement horsepower weight acceleration
mpg 1.000 -0.855 -0.854 -0.876 0.442
displacement -0.855 1.000 0.876 0.946 -0.499
horsepower -0.854 0.876 1.000 0.879 -0.658
weight -0.876 0.946 0.879 1.000 -0.405
acceleration 0.442 -0.499 -0.658 -0.405 1.000
Correlation Heatmap
cor_long <- as.data.frame(as.table(cor_pearson))
names(cor_long) <- c("Var1", "Var2", "Correlation")
ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
geom_tile() +
geom_text(aes(label = round(Correlation, 2)), size = 3.5) +
scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
labs(title = "Pearson Correlation Heatmap") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))Pairwise Correlation Tests
pairs_to_test <- list(
c("mpg", "weight"),
c("mpg", "horsepower"),
c("mpg", "displacement"),
c("horsepower", "weight"),
c("acceleration", "horsepower")
)
cor_tests <- map_dfr(pairs_to_test, function(pair) {
ct <- cor.test(mpg_clean[[pair[1]]], mpg_clean[[pair[2]]])
tibble(
var1 = pair[1],
var2 = pair[2],
pearson_r = ct$estimate,
t_stat = ct$statistic,
p_value = ct$p.value,
ci_lower = ct$conf.int[1],
ci_upper = ct$conf.int[2],
significant = ifelse(ct$p.value < 0.05, "***", "ns")
)
})
cor_tests# A tibble: 5 × 8
var1 var2 pearson_r t_stat p_value ci_lower ci_upper significant
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 mpg weight -0.832 -29.6 6.02e-102 -0.860 -0.799 ***
2 mpg horsepo… -0.778 -24.5 7.03e- 81 -0.815 -0.736 ***
3 mpg displac… -0.805 -26.8 1.66e- 90 -0.837 -0.767 ***
4 horsepower weight 0.865 34.0 1.36e-118 0.837 0.888 ***
5 acceleration horsepo… -0.689 -18.8 1.58e- 56 -0.738 -0.633 ***
All tested pairs are highly significant. Weight is the single strongest linear correlate of MPG (r = -0.83).
7. Hypothesis Tests
Two-Sample T-Tests
USA vs Japan
usa_mpg <- mpg_clean |> filter(origin == "usa") |> pull(mpg)
japan_mpg <- mpg_clean |> filter(origin == "japan") |> pull(mpg)
t.test(usa_mpg, japan_mpg)
Welch Two Sample t-test
data: usa_mpg and japan_mpg
t = -13.034, df = 138.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.997430 -8.836897
sample estimates:
mean of x mean of y
20.03347 30.45063
USA vs Europe
europe_mpg <- mpg_clean |> filter(origin == "europe") |> pull(mpg)
t.test(usa_mpg, europe_mpg)
Welch Two Sample t-test
data: usa_mpg and europe_mpg
t = -8.4311, df = 105.32, p-value = 1.93e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-9.349583 -5.789361
sample estimates:
mean of x mean of y
20.03347 27.60294
Both tests are highly significant (p < 2.2e-16 and p = 1.9e-13), confirming that US cars have substantially lower mean MPG than both Japanese and European cars.
One-Sample T-Test
Is the overall mean MPG equal to 25?
t.test(mpg_clean$mpg, mu = 25)
One Sample t-test
data: mpg_clean$mpg
t = -3.9422, df = 391, p-value = 9.564e-05
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
22.67088 24.22096
sample estimates:
mean of x
23.44592
The mean (23.45) is significantly less than 25 (p < 0.001).
One-Way ANOVA: MPG ~ Origin
aov_origin <- aov(mpg ~ origin, data = mpg_clean)
summary(aov_origin) Df Sum Sq Mean Sq F value Pr(>F)
origin 2 7904 3952 96.6 <2e-16 ***
Residuals 389 15915 41
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey HSD Post-Hoc
TukeyHSD(aov_origin) Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = mpg ~ origin, data = mpg_clean)
$origin
diff lwr upr p adj
europe-usa 7.569472 5.5068042 9.632139 0.0000000
japan-usa 10.417164 8.4701431 12.364184 0.0000000
japan-europe 2.847692 0.3583458 5.337038 0.0202502
All pairwise origin differences are significant. The largest gap is Japan–USA (~10.4 MPG).
Boxplot: MPG by Origin
ggplot(mpg_clean, aes(x = origin, y = mpg, fill = origin)) +
geom_boxplot() +
labs(title = "MPG by Country of Origin", x = "Origin", y = "MPG") +
theme(legend.position = "none")Chi-Square Test: Origin vs Cylinders
cont_table <- table(mpg_clean$origin, factor(mpg_clean$cylinders))
cont_table
3 4 5 6 8
usa 0 69 0 73 103
europe 0 61 3 4 0
japan 4 69 0 6 0
chi_sq <- chisq.test(cont_table)
chi_sq
Pearson's Chi-squared test
data: cont_table
X-squared = 180.72, df = 8, p-value < 2.2e-16
cramers_v <- sqrt(chi_sq$statistic / (sum(cont_table) * (min(dim(cont_table)) - 1)))Origin and cylinder count are strongly associated (χ² = 180.7, p < 2.2e-16, Cramér’s V = 0.48).
Stacked Bar: Cylinder Composition by Origin
ggplot(mpg_clean, aes(x = origin, fill = factor(cylinders))) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_brewer(palette = "Set2", name = "Cylinders") +
labs(title = "Cylinder Composition by Country of Origin",
subtitle = "USA dominated by 6- and 8-cylinder; Japan/Europe mostly 4-cylinder",
x = "Origin", y = "Proportion")Non-Parametric Tests
Mann-Whitney U: MPG (USA vs Japan)
wilcox.test(usa_mpg, japan_mpg)
Wilcoxon rank sum test with continuity correction
data: usa_mpg and japan_mpg
W = 2456.5, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
Kruskal-Wallis: MPG ~ Origin
kruskal.test(mpg ~ origin, data = mpg_clean)
Kruskal-Wallis rank sum test
data: mpg by origin
Kruskal-Wallis chi-squared = 132.09, df = 2, p-value < 2.2e-16
Both non-parametric tests confirm the parametric results (p < 2.2e-16).
F-Test for Equality of Variances: USA vs Japan
var.test(usa_mpg, japan_mpg)
F test to compare two variances
data: usa_mpg and japan_mpg
F = 1.1184, num df = 244, denom df = 78, p-value = 0.569
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.7650019 1.5798967
sample estimates:
ratio of variances
1.118361
The variances are not significantly different (p = 0.57).
Mean MPG by Cylinders
cyl_summary <- mpg_clean |>
group_by(cylinders) |>
summarise(
mean_mpg = mean(mpg),
se = sd(mpg) / sqrt(n()),
n = n()
)
ggplot(cyl_summary, aes(x = factor(cylinders), y = mean_mpg)) +
geom_col(fill = "steelblue", width = 0.6) +
geom_errorbar(aes(ymin = mean_mpg - 1.96 * se, ymax = mean_mpg + 1.96 * se),
width = 0.2) +
geom_text(aes(label = paste0("n=", n)), vjust = -0.5, size = 3.2) +
labs(title = "Mean MPG by Number of Cylinders",
subtitle = "Error bars show 95% confidence intervals",
x = "Cylinders", y = "Mean MPG") +
ylim(0, 35)8. Regression Analysis
8.1 Simple Linear Regression: MPG ~ Weight
lm_simple <- lm(mpg ~ weight, data = mpg_clean)
summary(lm_simple)
Call:
lm(formula = mpg ~ weight, data = mpg_clean)
Residuals:
Min 1Q Median 3Q Max
-11.9736 -2.7556 -0.3358 2.1379 16.5194
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.216524 0.798673 57.87 <2e-16 ***
weight -0.007647 0.000258 -29.64 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.333 on 390 degrees of freedom
Multiple R-squared: 0.6926, Adjusted R-squared: 0.6918
F-statistic: 878.8 on 1 and 390 DF, p-value: < 2.2e-16
Weight alone explains 69.3% of the variance in MPG.
8.2 Multiple Linear Regression (Full Additive Model)
lm_full <- lm(mpg ~ weight + horsepower + displacement + acceleration + origin,
data = mpg_clean)
summary(lm_full)
Call:
lm(formula = mpg ~ weight + horsepower + displacement + acceleration +
origin, data = mpg_clean)
Residuals:
Min 1Q Median 3Q Max
-12.3131 -2.9021 -0.3505 2.2157 14.8163
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.3905094 2.4526690 17.691 < 2e-16 ***
weight -0.0049546 0.0008035 -6.166 1.77e-09 ***
horsepower -0.0584607 0.0167637 -3.487 0.000544 ***
displacement 0.0026806 0.0072662 0.369 0.712395
acceleration -0.0234516 0.1232560 -0.190 0.849200
origineurope 1.0787637 0.7016362 1.537 0.124993
originjapan 2.8365644 0.6930489 4.093 5.19e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.166 on 385 degrees of freedom
Multiple R-squared: 0.7194, Adjusted R-squared: 0.7151
F-statistic: 164.5 on 6 and 385 DF, p-value: < 2.2e-16
Confidence Intervals
confint(lm_full) 2.5 % 97.5 %
(Intercept) 38.568206946 48.212811819
weight -0.006534396 -0.003374773
horsepower -0.091420494 -0.025500925
displacement -0.011605745 0.016966921
acceleration -0.265790776 0.218887542
origineurope -0.300754720 2.458282189
originjapan 1.473930033 4.199198862
Model Comparison: Simple vs Full
anova(lm_simple, lm_full)Analysis of Variance Table
Model 1: mpg ~ weight
Model 2: mpg ~ weight + horsepower + displacement + acceleration + origin
Res.Df RSS Df Sum of Sq F Pr(>F)
1 390 7321.2
2 385 6683.0 5 638.23 7.3535 1.34e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data.frame(
Model = c("Simple (weight only)", "Full additive model"),
AIC = c(AIC(lm_simple), AIC(lm_full)),
BIC = c(BIC(lm_simple), BIC(lm_full)),
Adj_R2 = c(summary(lm_simple)$adj.r.squared, summary(lm_full)$adj.r.squared)
) Model AIC BIC Adj_R2
1 Simple (weight only) 2265.939 2277.852 0.6918423
2 Full additive model 2240.184 2271.954 0.7150528
The full model is a significant improvement (p < 0.001), though the R² gain is modest.
8.3 Regression Diagnostics
Variance Inflation Factors
vif_vals <- sapply(c("weight", "horsepower", "displacement", "acceleration"), function(v) {
predictors <- setdiff(c("weight", "horsepower", "displacement", "acceleration"), v)
formula_str <- paste(v, "~", paste(predictors, collapse = " + "))
r2 <- summary(lm(as.formula(formula_str), data = mpg_clean))$r.squared
1 / (1 - r2)
})
data.frame(Variable = names(vif_vals), VIF = round(vif_vals, 2)) Variable VIF
weight weight 10.28
horsepower horsepower 8.82
displacement displacement 10.69
acceleration acceleration 2.60
Weight, displacement, and horsepower have VIF > 8, indicating substantial multicollinearity.
Heteroscedasticity (Breusch-Pagan)
resids <- residuals(lm_full)
resid_sq <- resids^2
bp_model <- lm(resid_sq ~ weight + horsepower + displacement + acceleration,
data = mpg_clean)
bp_r2 <- summary(bp_model)$r.squared
bp_stat <- nrow(mpg_clean) * bp_r2
bp_pval <- pchisq(bp_stat, df = 4, lower.tail = FALSE)
data.frame(
BP_statistic = round(bp_stat, 4),
p_value = format(bp_pval, scientific = TRUE),
Result = ifelse(bp_pval < 0.05, "Heteroscedasticity detected",
"Homoscedasticity assumed")
) BP_statistic p_value Result
1 34.4572 6.004344e-07 Heteroscedasticity detected
Diagnostic Plots
par(mfrow = c(2, 2))
plot(lm_full)
par(mfrow = c(1, 1))diag_data <- data.frame(
fitted = fitted(lm_full),
residuals = residuals(lm_full),
origin = mpg_clean$origin
)
ggplot(diag_data, aes(x = fitted, y = residuals, color = origin)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_smooth(aes(group = 1), method = "loess", color = "black",
se = FALSE, linewidth = 0.7) +
labs(title = "Residuals vs Fitted Values (Full Model)",
subtitle = "Ideal: random scatter around zero; loess line should be flat",
x = "Fitted Values", y = "Residuals")The loess curve shows a U-shape, consistent with the detected heteroscedasticity. Residual spread increases at higher fitted values.
8.4 Interaction Model: Weight × Origin
lm_interact <- lm(mpg ~ weight * origin + horsepower, data = mpg_clean)
summary(lm_interact)
Call:
lm(formula = mpg ~ weight * origin + horsepower, data = mpg_clean)
Residuals:
Min 1Q Median 3Q Max
-12.6953 -2.8235 -0.4188 2.1867 14.4955
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.9244363 1.1806839 35.509 < 2e-16 ***
weight -0.0045958 0.0005737 -8.010 1.37e-14 ***
origineurope 3.3987402 2.8374558 1.198 0.2317
originjapan 10.8694645 3.4754842 3.127 0.0019 **
horsepower -0.0536903 0.0111106 -4.832 1.95e-06 ***
weight:origineurope -0.0009087 0.0010962 -0.829 0.4076
weight:originjapan -0.0035335 0.0014999 -2.356 0.0190 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.135 on 385 degrees of freedom
Multiple R-squared: 0.7236, Adjusted R-squared: 0.7193
F-statistic: 168 on 6 and 385 DF, p-value: < 2.2e-16
Origin-Specific Weight Slopes
coefs <- coef(lm_interact)
slopes <- data.frame(
Origin = c("USA", "Europe", "Japan"),
Weight_Slope = c(
coefs["weight"],
coefs["weight"] + coefs["weight:origineurope"],
coefs["weight"] + coefs["weight:originjapan"]
),
MPG_per_1000lbs = c(
round(coefs["weight"] * 1000, 2),
round((coefs["weight"] + coefs["weight:origineurope"]) * 1000, 2),
round((coefs["weight"] + coefs["weight:originjapan"]) * 1000, 2)
)
)
slopes Origin Weight_Slope MPG_per_1000lbs
1 USA -0.004595772 -4.60
2 Europe -0.005504488 -5.50
3 Japan -0.008129234 -8.13
Model Comparison: Additive vs Interaction
lm_additive <- lm(mpg ~ weight + origin + horsepower, data = mpg_clean)
anova(lm_additive, lm_interact)Analysis of Variance Table
Model 1: mpg ~ weight + origin + horsepower
Model 2: mpg ~ weight * origin + horsepower
Res.Df RSS Df Sum of Sq F Pr(>F)
1 387 6686.5
2 385 6583.4 2 103.04 3.0129 0.05031 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction is borderline significant (p ≈ 0.05). The Japan × weight interaction is individually significant (p = 0.019): Japanese cars lose MPG roughly twice as fast per unit weight as US cars.
Interaction Visualization
ggplot(mpg_clean, aes(x = weight, y = mpg, color = origin)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = TRUE) +
annotate("text", x = 4200, y = 42,
label = paste0("USA: ", slopes$MPG_per_1000lbs[1], " MPG / 1000 lbs"),
color = "#F8766D", hjust = 0, size = 3.5) +
annotate("text", x = 4200, y = 39,
label = paste0("Europe: ", slopes$MPG_per_1000lbs[2], " MPG / 1000 lbs"),
color = "#00BA38", hjust = 0, size = 3.5) +
annotate("text", x = 4200, y = 36,
label = paste0("Japan: ", slopes$MPG_per_1000lbs[3], " MPG / 1000 lbs"),
color = "#619CFF", hjust = 0, size = 3.5) +
labs(title = "Weight × Origin Interaction Effect on MPG",
subtitle = "Japanese cars lose MPG nearly twice as fast per unit weight as US cars",
x = "Weight (lbs)", y = "MPG")8.5 Separate Regressions by Origin
origin_models <- mpg_clean |>
split(mpg_clean$origin) |>
map(~lm(mpg ~ weight + horsepower, data = .x))R² Comparison
origin_comparison <- map_dfr(names(origin_models), function(o) {
mod <- origin_models[[o]]
s <- summary(mod)
tibble(
Origin = o,
n = nrow(mod$model),
R2 = s$r.squared,
Adj_R2 = s$adj.r.squared,
RSE = s$sigma,
F_stat = s$fstatistic[1],
F_pvalue = pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
)
})
origin_comparison# A tibble: 3 × 7
Origin n R2 Adj_R2 RSE F_stat F_pvalue
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 usa 245 0.722 0.720 3.41 315. 4.62e-68
2 europe 68 0.477 0.461 4.83 29.6 7.25e-10
3 japan 79 0.455 0.440 4.56 31.7 9.85e-11
Coefficient Details by Origin
map_dfr(names(origin_models), function(o) {
mod <- origin_models[[o]]
ci <- confint(mod)
coefs <- summary(mod)$coefficients
tibble(
Origin = o,
Term = rownames(coefs),
Estimate = coefs[, 1],
Std_Err = coefs[, 2],
p_value = coefs[, 4],
CI_lower = ci[, 1],
CI_upper = ci[, 2]
)
})# A tibble: 9 × 7
Origin Term Estimate Std_Err p_value CI_lower CI_upper
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 usa (Intercept) 42.6 0.977 1.25e-116 40.7 44.6
2 usa weight -0.00590 0.000503 2.08e- 25 -0.00689 -0.00491
3 usa horsepower -0.0228 0.0100 2.41e- 2 -0.0425 -0.00300
4 europe (Intercept) 48.0 3.07 9.71e- 24 41.9 54.1
5 europe weight -0.00206 0.00152 1.79e- 1 -0.00509 0.000971
6 europe horsepower -0.191 0.0370 2.51e- 6 -0.265 -0.117
7 japan (Intercept) 47.3 3.95 3.32e- 19 39.5 55.2
8 japan weight 0.00153 0.00324 6.39e- 1 -0.00492 0.00797
9 japan horsepower -0.254 0.0582 4.02e- 5 -0.370 -0.138
Faceted Visualization with R² Annotations
r2_labels <- origin_comparison |>
mutate(label = paste0("R² = ", round(R2, 3), "\nn = ", n))
ggplot(mpg_clean, aes(x = weight, y = mpg)) +
geom_point(aes(color = horsepower), alpha = 0.6) +
geom_smooth(method = "lm", formula = y ~ x, color = "red", se = TRUE) +
geom_text(data = r2_labels,
aes(x = Inf, y = Inf, label = label),
hjust = 1.1, vjust = 1.3, size = 3.8, fontface = "bold") +
scale_color_viridis_c(name = "HP") +
facet_wrap(~origin, scales = "free_x") +
labs(title = "MPG ~ Weight + Horsepower (Separate Models by Origin)",
subtitle = "USA is far more predictable (R²=0.72) than Europe or Japan (R²≈0.46)",
x = "Weight (lbs)", y = "MPG")Key finding: USA cars are the most predictable from weight and horsepower (R² = 0.72), because they span a wide range of weights. For European and Japanese cars (R² ≈ 0.46–0.48), weight is not a significant predictor — horsepower dominates, and other unmeasured factors (engine technology, aerodynamics) likely matter more.
9. Supplementary Visualizations
MPG Trend Over Time by Origin
year_origin_summary <- mpg_clean |>
group_by(model_year, origin) |>
summarise(mean_mpg = mean(mpg), .groups = "drop")
ggplot(year_origin_summary, aes(x = model_year, y = mean_mpg, color = origin)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(title = "MPG Trend Over Time by Origin",
subtitle = "All origins improved, but Japan consistently led",
x = "Model Year", y = "Mean MPG", color = "Origin")Bubble Chart: Weight vs Horsepower vs MPG
ggplot(mpg_clean, aes(x = weight, y = horsepower, color = origin, size = mpg)) +
geom_point(alpha = 0.5) +
scale_size_continuous(range = c(1, 6), name = "MPG") +
labs(title = "Weight vs Horsepower (bubble size = MPG)",
subtitle = "Lighter, less powerful cars get better mileage",
x = "Weight (lbs)", y = "Horsepower")Mean MPG Heatmap: Origin × Model Year
heat_data <- mpg_clean |>
group_by(origin, model_year) |>
summarise(mean_mpg = mean(mpg), .groups = "drop")
ggplot(heat_data, aes(x = factor(model_year), y = origin, fill = mean_mpg)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = round(mean_mpg, 1)), size = 2.8) +
scale_fill_gradient(low = "#fee0d2", high = "#de2d26", name = "Mean MPG") +
labs(title = "Mean MPG Heatmap: Origin × Model Year",
x = "Model Year", y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))10. Key Findings & Summary
| Finding | Detail |
|---|---|
| MPG by origin | Japan (30.5 MPG) > Europe (27.6) > USA (20), all pairwise differences significant |
| Best single predictor | Weight (r = -0.83 with MPG) |
| Normality | Most variables are non-normal; non-parametric tests confirm parametric results |
| Full regression | Adj R² = 0.715; weight and horsepower are the key predictors |
| Interaction effect | Japan × weight is significant (p = 0.019); Japanese cars penalised ~2× more per unit weight |
| Origin-specific models | USA most predictable (R² = 0.722); Europe/Japan ≈ 0.46 due to narrow weight range |
| Multicollinearity | Weight, displacement, horsepower are highly collinear (VIF > 8) |
| Heteroscedasticity | Present in all models; consider log-transform or robust SEs for inference |
| Temporal trend | All origins improved MPG from 1970–1982, likely driven by oil crises and regulation |