Comprehensive Statistical Analysis of the Auto MPG Dataset

Author

Generated by Posit Assistant

Published

February 6, 2026

Note

This report was generated using AI under general human direction. At the time of generation, the contents have not been comprehensively reviewed by a human analyst.

library(tidyverse)

url <- "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
download.file(url, destfile = "mpg_dataset.csv", mode = "wb")
mpg_data <- read.csv("mpg_dataset.csv", stringsAsFactors = FALSE)

1. Introduction

This report presents a comprehensive statistical analysis of the Auto MPG dataset, originally from the UCI Machine Learning Repository and widely used on Kaggle. The dataset contains fuel consumption data for 398 automobiles from model years 1970–1982, with variables describing engine characteristics, vehicle weight, and country of origin.

The analysis covers:

Descriptive statistics and distributional properties
Normality testing
Correlation analysis
Hypothesis testing (parametric and non-parametric)
Simple, multiple, interaction, and origin-specific regression models
Regression diagnostics

2. Data Loading & Cleaning

The dataset has 398 rows and 9 columns:

glimpse(mpg_data)

Rows: 398
Columns: 9
$ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
$ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
$ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
$ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
$ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
$ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
$ model_year   <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
$ origin       <chr> "usa", "usa", "usa", "usa", "usa", "usa", "usa", "usa", "…
$ name         <chr> "chevrolet chevelle malibu", "buick skylark 320", "plymou…

Missing Values

mpg_data |> summarise(across(everything(), ~sum(is.na(.)))) |> glimpse()

Rows: 1
Columns: 9
$ mpg          <int> 0
$ cylinders    <int> 0
$ displacement <int> 0
$ horsepower   <int> 6
$ weight       <int> 0
$ acceleration <int> 0
$ model_year   <int> 0
$ origin       <int> 0
$ name         <int> 0

Six rows have missing horsepower values. Inspecting them:

mpg_data |> filter(is.na(horsepower))

   mpg cylinders displacement horsepower weight acceleration model_year origin
1 25.0         4           98         NA   2046         19.0         71    usa
2 21.0         6          200         NA   2875         17.0         74    usa
3 40.9         4           85         NA   1835         17.3         80 europe
4 23.6         4          140         NA   2905         14.3         80    usa
5 34.5         4          100         NA   2320         15.8         81 europe
6 23.0         4          151         NA   3035         20.5         82    usa
                  name
1           ford pinto
2        ford maverick
3 renault lecar deluxe
4   ford mustang cobra
5          renault 18i
6       amc concord dl

We drop these rows and convert origin to a factor for downstream analysis.

mpg_clean <- mpg_data |>
  drop_na(horsepower) |>
  mutate(origin = factor(origin, levels = c("usa", "europe", "japan")))

After cleaning: 392 rows remain.

3. Descriptive Statistics

Overall Summary

summary(mpg_clean |> select(-name))

      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
  acceleration     model_year       origin   
 Min.   : 8.00   Min.   :70.00   usa   :245  
 1st Qu.:13.78   1st Qu.:73.00   europe: 68  
 Median :15.50   Median :76.00   japan : 79  
 Mean   :15.54   Mean   :75.98               
 3rd Qu.:17.02   3rd Qu.:79.00               
 Max.   :24.80   Max.   :82.00

Detailed Numeric Statistics

numeric_vars <- c("mpg", "displacement", "horsepower", "weight", "acceleration")

desc_stats <- mpg_clean |>
  summarise(across(all_of(numeric_vars), list(
    n      = ~n(),
    mean   = ~mean(.),
    median = ~median(.),
    sd     = ~sd(.),
    min    = ~min(.),
    max    = ~max(.),
    Q1     = ~quantile(., 0.25),
    Q3     = ~quantile(., 0.75),
    IQR    = ~IQR(.),
    skew   = ~(mean(.) - median(.)) / sd(.),
    cv     = ~sd(.) / mean(.) * 100
  ), .names = "{.col}__{.fn}")) |>
  pivot_longer(everything(),
               names_to = c("variable", "stat"),
               names_sep = "__") |>
  pivot_wider(names_from = stat, values_from = value)

desc_stats

# A tibble: 5 × 12
  variable       n   mean median     sd   min    max     Q1     Q3    IQR   skew
  <chr>      <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 mpg          392   23.4   22.8   7.81     9   46.6   17     29   1.2 e1 0.0892
2 displacem…   392  194.   151   105.      68  455    105    276.  1.71e2 0.415 
3 horsepower   392  104.    93.5  38.5     46  230     75    126   5.1 e1 0.285 
4 weight       392 2978.  2804.  849.    1613 5140   2225.  3615.  1.39e3 0.205 
5 accelerat…   392   15.5   15.5   2.76     8   24.8   13.8   17.0 3.25e0 0.0150
# ℹ 1 more variable: cv <dbl>

Group Statistics by Origin

mpg_clean |>
  group_by(origin) |>
  summarise(
    n          = n(),
    mean_mpg   = mean(mpg),
    sd_mpg     = sd(mpg),
    median_mpg = median(mpg),
    min_mpg    = min(mpg),
    max_mpg    = max(mpg)
  )

# A tibble: 3 × 7
  origin     n mean_mpg sd_mpg median_mpg min_mpg max_mpg
  <fct>  <int>    <dbl>  <dbl>      <dbl>   <dbl>   <dbl>
1 usa      245     20.0   6.44       18.5     9      39  
2 europe    68     27.6   6.58       26      16.2    44.3
3 japan     79     30.5   6.09       31.6    18      46.6

Group Statistics by Cylinders

mpg_clean |>
  group_by(cylinders) |>
  summarise(
    n        = n(),
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),
    mean_hp  = mean(horsepower),
    mean_wt  = mean(weight)
  )

# A tibble: 5 × 6
  cylinders     n mean_mpg sd_mpg mean_hp mean_wt
      <int> <int>    <dbl>  <dbl>   <dbl>   <dbl>
1         3     4     20.6   2.56    99.2   2398.
2         4   199     29.3   5.67    78.3   2305.
3         5     3     27.4   8.23    82.3   3103.
4         6    83     20.0   3.83   102.    3202.
5         8   103     15.0   2.84   158.    4115.

4. Normality Tests

Shapiro-Wilk Test

The null hypothesis is that the data are normally distributed.

normality_results <- map_dfr(numeric_vars, function(v) {
 x <- mpg_clean[[v]]
 sw <- shapiro.test(x)
 tibble(
   variable = v,
   W_stat   = sw$statistic,
   p_value  = sw$p.value,
   normal   = ifelse(sw$p.value > 0.05, "Yes", "No")
 )
})
normality_results

# A tibble: 5 × 4
  variable     W_stat  p_value normal
  <chr>         <dbl>    <dbl> <chr> 
1 mpg           0.967 1.05e- 7 No    
2 displacement  0.882 8.98e-17 No    
3 horsepower    0.904 5.02e-15 No    
4 weight        0.941 2.60e-11 No    
5 acceleration  0.992 3.05e- 2 No

Kolmogorov-Smirnov Test

ks_results <- map_dfr(numeric_vars, function(v) {
  x <- mpg_clean[[v]]
  ks <- ks.test(x, "pnorm", mean(x), sd(x))
  tibble(
    variable = v,
    D_stat   = ks$statistic,
    p_value  = ks$p.value,
    normal   = ifelse(ks$p.value > 0.05, "Yes", "No")
  )
})
ks_results

# A tibble: 5 × 4
  variable     D_stat  p_value normal
  <chr>         <dbl>    <dbl> <chr> 
1 mpg          0.0818 1.06e- 2 No    
2 displacement 0.181  1.22e-11 No    
3 horsepower   0.164  1.55e- 9 No    
4 weight       0.0946 1.80e- 3 No    
5 acceleration 0.0513 2.53e- 1 Yes

Most variables are non-normal by both tests. Only acceleration passes the KS test (p = 0.25), though it narrowly fails Shapiro-Wilk (p = 0.03).

5. Distribution Visualizations

Faceted Histograms

mpg_clean |>
  select(mpg, horsepower, weight, acceleration) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  mutate(variable = factor(variable,
    levels = c("mpg", "horsepower", "weight", "acceleration"))) |>
  ggplot(aes(x = value, fill = variable)) +
  geom_histogram(bins = 25, color = "white", show.legend = FALSE) +
  facet_wrap(~variable, scales = "free", ncol = 2) +
  labs(title = "Distributions of Key Numeric Variables")

Violin Plot: MPG by Origin

ggplot(mpg_clean, aes(x = origin, y = mpg, fill = origin)) +
  geom_violin(alpha = 0.4, draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_jitter(width = 0.15, alpha = 0.3, size = 1.2) +
  labs(title = "MPG Distribution by Country of Origin",
       subtitle = "Lines show 25th, 50th, 75th percentiles",
       x = "Origin", y = "Miles Per Gallon") +
  theme(legend.position = "none")

Density Plot: MPG by Cylinders

ggplot(mpg_clean |> filter(cylinders %in% c(4, 6, 8)),
       aes(x = mpg, fill = factor(cylinders), color = factor(cylinders))) +
  geom_density(alpha = 0.3, linewidth = 0.8) +
  labs(title = "MPG Density by Cylinder Count",
       subtitle = "4-cylinder cars show the widest and highest MPG range",
       x = "Miles Per Gallon", y = "Density",
       fill = "Cylinders", color = "Cylinders")

Q-Q Plot for MPG

ggplot(mpg_clean, aes(sample = mpg)) +
  stat_qq(color = "steelblue", alpha = 0.6) +
  stat_qq_line(color = "red") +
  labs(title = "Q-Q Plot for MPG",
       subtitle = "Deviations from the red line indicate non-normality",
       x = "Theoretical Quantiles", y = "Sample Quantiles")

6. Correlation Analysis

Pearson Correlation Matrix

cor_pearson <- cor(mpg_clean[numeric_vars], method = "pearson")
round(cor_pearson, 3)

                mpg displacement horsepower weight acceleration
mpg           1.000       -0.805     -0.778 -0.832        0.423
displacement -0.805        1.000      0.897  0.933       -0.544
horsepower   -0.778        0.897      1.000  0.865       -0.689
weight       -0.832        0.933      0.865  1.000       -0.417
acceleration  0.423       -0.544     -0.689 -0.417        1.000

Spearman Correlation Matrix

cor_spearman <- cor(mpg_clean[numeric_vars], method = "spearman")
round(cor_spearman, 3)

                mpg displacement horsepower weight acceleration
mpg           1.000       -0.855     -0.854 -0.876        0.442
displacement -0.855        1.000      0.876  0.946       -0.499
horsepower   -0.854        0.876      1.000  0.879       -0.658
weight       -0.876        0.946      0.879  1.000       -0.405
acceleration  0.442       -0.499     -0.658 -0.405        1.000

Correlation Heatmap

cor_long <- as.data.frame(as.table(cor_pearson))
names(cor_long) <- c("Var1", "Var2", "Correlation")

ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile() +
  geom_text(aes(label = round(Correlation, 2)), size = 3.5) +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
  labs(title = "Pearson Correlation Heatmap") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Pairwise Correlation Tests

pairs_to_test <- list(
  c("mpg", "weight"),
  c("mpg", "horsepower"),
  c("mpg", "displacement"),
  c("horsepower", "weight"),
  c("acceleration", "horsepower")
)

cor_tests <- map_dfr(pairs_to_test, function(pair) {
  ct <- cor.test(mpg_clean[[pair[1]]], mpg_clean[[pair[2]]])
  tibble(
    var1        = pair[1],
    var2        = pair[2],
    pearson_r   = ct$estimate,
    t_stat      = ct$statistic,
    p_value     = ct$p.value,
    ci_lower    = ct$conf.int[1],
    ci_upper    = ct$conf.int[2],
    significant = ifelse(ct$p.value < 0.05, "***", "ns")
  )
})
cor_tests

# A tibble: 5 × 8
  var1         var2     pearson_r t_stat   p_value ci_lower ci_upper significant
  <chr>        <chr>        <dbl>  <dbl>     <dbl>    <dbl>    <dbl> <chr>      
1 mpg          weight      -0.832  -29.6 6.02e-102   -0.860   -0.799 ***        
2 mpg          horsepo…    -0.778  -24.5 7.03e- 81   -0.815   -0.736 ***        
3 mpg          displac…    -0.805  -26.8 1.66e- 90   -0.837   -0.767 ***        
4 horsepower   weight       0.865   34.0 1.36e-118    0.837    0.888 ***        
5 acceleration horsepo…    -0.689  -18.8 1.58e- 56   -0.738   -0.633 ***

All tested pairs are highly significant. Weight is the single strongest linear correlate of MPG (r = -0.83).

7. Hypothesis Tests

Two-Sample T-Tests

USA vs Japan

usa_mpg   <- mpg_clean |> filter(origin == "usa") |> pull(mpg)
japan_mpg <- mpg_clean |> filter(origin == "japan") |> pull(mpg)

t.test(usa_mpg, japan_mpg)


    Welch Two Sample t-test

data:  usa_mpg and japan_mpg
t = -13.034, df = 138.64, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.997430  -8.836897
sample estimates:
mean of x mean of y 
 20.03347  30.45063

USA vs Europe

europe_mpg <- mpg_clean |> filter(origin == "europe") |> pull(mpg)
t.test(usa_mpg, europe_mpg)


    Welch Two Sample t-test

data:  usa_mpg and europe_mpg
t = -8.4311, df = 105.32, p-value = 1.93e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.349583 -5.789361
sample estimates:
mean of x mean of y 
 20.03347  27.60294

Both tests are highly significant (p < 2.2e-16 and p = 1.9e-13), confirming that US cars have substantially lower mean MPG than both Japanese and European cars.

One-Sample T-Test

Is the overall mean MPG equal to 25?

t.test(mpg_clean$mpg, mu = 25)


    One Sample t-test

data:  mpg_clean$mpg
t = -3.9422, df = 391, p-value = 9.564e-05
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 22.67088 24.22096
sample estimates:
mean of x 
 23.44592

The mean (23.45) is significantly less than 25 (p < 0.001).

One-Way ANOVA: MPG ~ Origin

aov_origin <- aov(mpg ~ origin, data = mpg_clean)
summary(aov_origin)

             Df Sum Sq Mean Sq F value Pr(>F)    
origin        2   7904    3952    96.6 <2e-16 ***
Residuals   389  15915      41                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tukey HSD Post-Hoc

TukeyHSD(aov_origin)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = mpg ~ origin, data = mpg_clean)

$origin
                  diff       lwr       upr     p adj
europe-usa    7.569472 5.5068042  9.632139 0.0000000
japan-usa    10.417164 8.4701431 12.364184 0.0000000
japan-europe  2.847692 0.3583458  5.337038 0.0202502

All pairwise origin differences are significant. The largest gap is Japan–USA (~10.4 MPG).

Boxplot: MPG by Origin

ggplot(mpg_clean, aes(x = origin, y = mpg, fill = origin)) +
  geom_boxplot() +
  labs(title = "MPG by Country of Origin", x = "Origin", y = "MPG") +
  theme(legend.position = "none")

Chi-Square Test: Origin vs Cylinders

cont_table <- table(mpg_clean$origin, factor(mpg_clean$cylinders))
cont_table

        
           3   4   5   6   8
  usa      0  69   0  73 103
  europe   0  61   3   4   0
  japan    4  69   0   6   0

chi_sq <- chisq.test(cont_table)
chi_sq


    Pearson's Chi-squared test

data:  cont_table
X-squared = 180.72, df = 8, p-value < 2.2e-16

cramers_v <- sqrt(chi_sq$statistic / (sum(cont_table) * (min(dim(cont_table)) - 1)))

Origin and cylinder count are strongly associated (χ² = 180.7, p < 2.2e-16, Cramér’s V = 0.48).

Stacked Bar: Cylinder Composition by Origin

ggplot(mpg_clean, aes(x = origin, fill = factor(cylinders))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_brewer(palette = "Set2", name = "Cylinders") +
  labs(title = "Cylinder Composition by Country of Origin",
       subtitle = "USA dominated by 6- and 8-cylinder; Japan/Europe mostly 4-cylinder",
       x = "Origin", y = "Proportion")

Non-Parametric Tests

Mann-Whitney U: MPG (USA vs Japan)

wilcox.test(usa_mpg, japan_mpg)


    Wilcoxon rank sum test with continuity correction

data:  usa_mpg and japan_mpg
W = 2456.5, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Kruskal-Wallis: MPG ~ Origin

kruskal.test(mpg ~ origin, data = mpg_clean)


    Kruskal-Wallis rank sum test

data:  mpg by origin
Kruskal-Wallis chi-squared = 132.09, df = 2, p-value < 2.2e-16

Both non-parametric tests confirm the parametric results (p < 2.2e-16).

F-Test for Equality of Variances: USA vs Japan

var.test(usa_mpg, japan_mpg)


    F test to compare two variances

data:  usa_mpg and japan_mpg
F = 1.1184, num df = 244, denom df = 78, p-value = 0.569
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7650019 1.5798967
sample estimates:
ratio of variances 
          1.118361

The variances are not significantly different (p = 0.57).

Mean MPG by Cylinders

cyl_summary <- mpg_clean |>
  group_by(cylinders) |>
  summarise(
    mean_mpg = mean(mpg),
    se       = sd(mpg) / sqrt(n()),
    n        = n()
  )

ggplot(cyl_summary, aes(x = factor(cylinders), y = mean_mpg)) +
  geom_col(fill = "steelblue", width = 0.6) +
  geom_errorbar(aes(ymin = mean_mpg - 1.96 * se, ymax = mean_mpg + 1.96 * se),
                width = 0.2) +
  geom_text(aes(label = paste0("n=", n)), vjust = -0.5, size = 3.2) +
  labs(title = "Mean MPG by Number of Cylinders",
       subtitle = "Error bars show 95% confidence intervals",
       x = "Cylinders", y = "Mean MPG") +
  ylim(0, 35)

8. Regression Analysis

8.1 Simple Linear Regression: MPG ~ Weight

lm_simple <- lm(mpg ~ weight, data = mpg_clean)
summary(lm_simple)


Call:
lm(formula = mpg ~ weight, data = mpg_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.9736  -2.7556  -0.3358   2.1379  16.5194 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 46.216524   0.798673   57.87   <2e-16 ***
weight      -0.007647   0.000258  -29.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.333 on 390 degrees of freedom
Multiple R-squared:  0.6926,    Adjusted R-squared:  0.6918 
F-statistic: 878.8 on 1 and 390 DF,  p-value: < 2.2e-16

Weight alone explains 69.3% of the variance in MPG.

8.2 Multiple Linear Regression (Full Additive Model)

lm_full <- lm(mpg ~ weight + horsepower + displacement + acceleration + origin,
               data = mpg_clean)
summary(lm_full)


Call:
lm(formula = mpg ~ weight + horsepower + displacement + acceleration + 
    origin, data = mpg_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.3131  -2.9021  -0.3505   2.2157  14.8163 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  43.3905094  2.4526690  17.691  < 2e-16 ***
weight       -0.0049546  0.0008035  -6.166 1.77e-09 ***
horsepower   -0.0584607  0.0167637  -3.487 0.000544 ***
displacement  0.0026806  0.0072662   0.369 0.712395    
acceleration -0.0234516  0.1232560  -0.190 0.849200    
origineurope  1.0787637  0.7016362   1.537 0.124993    
originjapan   2.8365644  0.6930489   4.093 5.19e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.166 on 385 degrees of freedom
Multiple R-squared:  0.7194,    Adjusted R-squared:  0.7151 
F-statistic: 164.5 on 6 and 385 DF,  p-value: < 2.2e-16

Confidence Intervals

confint(lm_full)

                    2.5 %       97.5 %
(Intercept)  38.568206946 48.212811819
weight       -0.006534396 -0.003374773
horsepower   -0.091420494 -0.025500925
displacement -0.011605745  0.016966921
acceleration -0.265790776  0.218887542
origineurope -0.300754720  2.458282189
originjapan   1.473930033  4.199198862

Model Comparison: Simple vs Full

anova(lm_simple, lm_full)

Analysis of Variance Table

Model 1: mpg ~ weight
Model 2: mpg ~ weight + horsepower + displacement + acceleration + origin
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
1    390 7321.2                                 
2    385 6683.0  5    638.23 7.3535 1.34e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

data.frame(
  Model  = c("Simple (weight only)", "Full additive model"),
  AIC    = c(AIC(lm_simple), AIC(lm_full)),
  BIC    = c(BIC(lm_simple), BIC(lm_full)),
  Adj_R2 = c(summary(lm_simple)$adj.r.squared, summary(lm_full)$adj.r.squared)
)

                 Model      AIC      BIC    Adj_R2
1 Simple (weight only) 2265.939 2277.852 0.6918423
2  Full additive model 2240.184 2271.954 0.7150528

The full model is a significant improvement (p < 0.001), though the R² gain is modest.

8.3 Regression Diagnostics

Variance Inflation Factors

vif_vals <- sapply(c("weight", "horsepower", "displacement", "acceleration"), function(v) {
  predictors <- setdiff(c("weight", "horsepower", "displacement", "acceleration"), v)
  formula_str <- paste(v, "~", paste(predictors, collapse = " + "))
  r2 <- summary(lm(as.formula(formula_str), data = mpg_clean))$r.squared
  1 / (1 - r2)
})
data.frame(Variable = names(vif_vals), VIF = round(vif_vals, 2))

                 Variable   VIF
weight             weight 10.28
horsepower     horsepower  8.82
displacement displacement 10.69
acceleration acceleration  2.60

Weight, displacement, and horsepower have VIF > 8, indicating substantial multicollinearity.

Heteroscedasticity (Breusch-Pagan)

resids <- residuals(lm_full)
resid_sq <- resids^2
bp_model <- lm(resid_sq ~ weight + horsepower + displacement + acceleration,
                data = mpg_clean)
bp_r2 <- summary(bp_model)$r.squared
bp_stat <- nrow(mpg_clean) * bp_r2
bp_pval <- pchisq(bp_stat, df = 4, lower.tail = FALSE)

data.frame(
  BP_statistic = round(bp_stat, 4),
  p_value      = format(bp_pval, scientific = TRUE),
  Result       = ifelse(bp_pval < 0.05, "Heteroscedasticity detected",
                        "Homoscedasticity assumed")
)

  BP_statistic      p_value                      Result
1      34.4572 6.004344e-07 Heteroscedasticity detected

Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm_full)
par(mfrow = c(1, 1))

diag_data <- data.frame(
  fitted    = fitted(lm_full),
  residuals = residuals(lm_full),
  origin    = mpg_clean$origin
)

ggplot(diag_data, aes(x = fitted, y = residuals, color = origin)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(aes(group = 1), method = "loess", color = "black",
              se = FALSE, linewidth = 0.7) +
  labs(title = "Residuals vs Fitted Values (Full Model)",
       subtitle = "Ideal: random scatter around zero; loess line should be flat",
       x = "Fitted Values", y = "Residuals")

The loess curve shows a U-shape, consistent with the detected heteroscedasticity. Residual spread increases at higher fitted values.

8.4 Interaction Model: Weight × Origin

lm_interact <- lm(mpg ~ weight * origin + horsepower, data = mpg_clean)
summary(lm_interact)


Call:
lm(formula = mpg ~ weight * origin + horsepower, data = mpg_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.6953  -2.8235  -0.4188   2.1867  14.4955 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         41.9244363  1.1806839  35.509  < 2e-16 ***
weight              -0.0045958  0.0005737  -8.010 1.37e-14 ***
origineurope         3.3987402  2.8374558   1.198   0.2317    
originjapan         10.8694645  3.4754842   3.127   0.0019 ** 
horsepower          -0.0536903  0.0111106  -4.832 1.95e-06 ***
weight:origineurope -0.0009087  0.0010962  -0.829   0.4076    
weight:originjapan  -0.0035335  0.0014999  -2.356   0.0190 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.135 on 385 degrees of freedom
Multiple R-squared:  0.7236,    Adjusted R-squared:  0.7193 
F-statistic:   168 on 6 and 385 DF,  p-value: < 2.2e-16

Origin-Specific Weight Slopes

coefs <- coef(lm_interact)
slopes <- data.frame(
  Origin       = c("USA", "Europe", "Japan"),
  Weight_Slope = c(
    coefs["weight"],
    coefs["weight"] + coefs["weight:origineurope"],
    coefs["weight"] + coefs["weight:originjapan"]
  ),
  MPG_per_1000lbs = c(
    round(coefs["weight"] * 1000, 2),
    round((coefs["weight"] + coefs["weight:origineurope"]) * 1000, 2),
    round((coefs["weight"] + coefs["weight:originjapan"]) * 1000, 2)
  )
)
slopes

  Origin Weight_Slope MPG_per_1000lbs
1    USA -0.004595772           -4.60
2 Europe -0.005504488           -5.50
3  Japan -0.008129234           -8.13

Model Comparison: Additive vs Interaction

lm_additive <- lm(mpg ~ weight + origin + horsepower, data = mpg_clean)
anova(lm_additive, lm_interact)

Analysis of Variance Table

Model 1: mpg ~ weight + origin + horsepower
Model 2: mpg ~ weight * origin + horsepower
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    387 6686.5                              
2    385 6583.4  2    103.04 3.0129 0.05031 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction is borderline significant (p ≈ 0.05). The Japan × weight interaction is individually significant (p = 0.019): Japanese cars lose MPG roughly twice as fast per unit weight as US cars.

Interaction Visualization

ggplot(mpg_clean, aes(x = weight, y = mpg, color = origin)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE) +
  annotate("text", x = 4200, y = 42,
           label = paste0("USA: ", slopes$MPG_per_1000lbs[1], " MPG / 1000 lbs"),
           color = "#F8766D", hjust = 0, size = 3.5) +
  annotate("text", x = 4200, y = 39,
           label = paste0("Europe: ", slopes$MPG_per_1000lbs[2], " MPG / 1000 lbs"),
           color = "#00BA38", hjust = 0, size = 3.5) +
  annotate("text", x = 4200, y = 36,
           label = paste0("Japan: ", slopes$MPG_per_1000lbs[3], " MPG / 1000 lbs"),
           color = "#619CFF", hjust = 0, size = 3.5) +
  labs(title = "Weight × Origin Interaction Effect on MPG",
       subtitle = "Japanese cars lose MPG nearly twice as fast per unit weight as US cars",
       x = "Weight (lbs)", y = "MPG")

8.5 Separate Regressions by Origin

origin_models <- mpg_clean |>
  split(mpg_clean$origin) |>
  map(~lm(mpg ~ weight + horsepower, data = .x))

R² Comparison

origin_comparison <- map_dfr(names(origin_models), function(o) {
  mod <- origin_models[[o]]
  s <- summary(mod)
  tibble(
    Origin   = o,
    n        = nrow(mod$model),
    R2       = s$r.squared,
    Adj_R2   = s$adj.r.squared,
    RSE      = s$sigma,
    F_stat   = s$fstatistic[1],
    F_pvalue = pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
  )
})
origin_comparison

# A tibble: 3 × 7
  Origin     n    R2 Adj_R2   RSE F_stat F_pvalue
  <chr>  <int> <dbl>  <dbl> <dbl>  <dbl>    <dbl>
1 usa      245 0.722  0.720  3.41  315.  4.62e-68
2 europe    68 0.477  0.461  4.83   29.6 7.25e-10
3 japan     79 0.455  0.440  4.56   31.7 9.85e-11

Coefficient Details by Origin

map_dfr(names(origin_models), function(o) {
  mod <- origin_models[[o]]
  ci <- confint(mod)
  coefs <- summary(mod)$coefficients
  tibble(
    Origin   = o,
    Term     = rownames(coefs),
    Estimate = coefs[, 1],
    Std_Err  = coefs[, 2],
    p_value  = coefs[, 4],
    CI_lower = ci[, 1],
    CI_upper = ci[, 2]
  )
})

# A tibble: 9 × 7
  Origin Term        Estimate  Std_Err   p_value CI_lower  CI_upper
  <chr>  <chr>          <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1 usa    (Intercept) 42.6     0.977    1.25e-116 40.7     44.6     
2 usa    weight      -0.00590 0.000503 2.08e- 25 -0.00689 -0.00491 
3 usa    horsepower  -0.0228  0.0100   2.41e-  2 -0.0425  -0.00300 
4 europe (Intercept) 48.0     3.07     9.71e- 24 41.9     54.1     
5 europe weight      -0.00206 0.00152  1.79e-  1 -0.00509  0.000971
6 europe horsepower  -0.191   0.0370   2.51e-  6 -0.265   -0.117   
7 japan  (Intercept) 47.3     3.95     3.32e- 19 39.5     55.2     
8 japan  weight       0.00153 0.00324  6.39e-  1 -0.00492  0.00797 
9 japan  horsepower  -0.254   0.0582   4.02e-  5 -0.370   -0.138

Faceted Visualization with R² Annotations

r2_labels <- origin_comparison |>
  mutate(label = paste0("R² = ", round(R2, 3), "\nn = ", n))

ggplot(mpg_clean, aes(x = weight, y = mpg)) +
  geom_point(aes(color = horsepower), alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, color = "red", se = TRUE) +
  geom_text(data = r2_labels,
            aes(x = Inf, y = Inf, label = label),
            hjust = 1.1, vjust = 1.3, size = 3.8, fontface = "bold") +
  scale_color_viridis_c(name = "HP") +
  facet_wrap(~origin, scales = "free_x") +
  labs(title = "MPG ~ Weight + Horsepower (Separate Models by Origin)",
       subtitle = "USA is far more predictable (R²=0.72) than Europe or Japan (R²≈0.46)",
       x = "Weight (lbs)", y = "MPG")

Key finding: USA cars are the most predictable from weight and horsepower (R² = 0.72), because they span a wide range of weights. For European and Japanese cars (R² ≈ 0.46–0.48), weight is not a significant predictor — horsepower dominates, and other unmeasured factors (engine technology, aerodynamics) likely matter more.

9. Supplementary Visualizations

MPG Trend Over Time by Origin

year_origin_summary <- mpg_clean |>
  group_by(model_year, origin) |>
  summarise(mean_mpg = mean(mpg), .groups = "drop")

ggplot(year_origin_summary, aes(x = model_year, y = mean_mpg, color = origin)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(title = "MPG Trend Over Time by Origin",
       subtitle = "All origins improved, but Japan consistently led",
       x = "Model Year", y = "Mean MPG", color = "Origin")

Bubble Chart: Weight vs Horsepower vs MPG

ggplot(mpg_clean, aes(x = weight, y = horsepower, color = origin, size = mpg)) +
  geom_point(alpha = 0.5) +
  scale_size_continuous(range = c(1, 6), name = "MPG") +
  labs(title = "Weight vs Horsepower (bubble size = MPG)",
       subtitle = "Lighter, less powerful cars get better mileage",
       x = "Weight (lbs)", y = "Horsepower")

Mean MPG Heatmap: Origin × Model Year

heat_data <- mpg_clean |>
  group_by(origin, model_year) |>
  summarise(mean_mpg = mean(mpg), .groups = "drop")

ggplot(heat_data, aes(x = factor(model_year), y = origin, fill = mean_mpg)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(mean_mpg, 1)), size = 2.8) +
  scale_fill_gradient(low = "#fee0d2", high = "#de2d26", name = "Mean MPG") +
  labs(title = "Mean MPG Heatmap: Origin × Model Year",
       x = "Model Year", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

10. Key Findings & Summary

Finding	Detail
MPG by origin	Japan (30.5 MPG) > Europe (27.6) > USA (20), all pairwise differences significant
Best single predictor	Weight (r = -0.83 with MPG)
Normality	Most variables are non-normal; non-parametric tests confirm parametric results
Full regression	Adj R² = 0.715; weight and horsepower are the key predictors
Interaction effect	Japan × weight is significant (p = 0.019); Japanese cars penalised ~2× more per unit weight
Origin-specific models	USA most predictable (R² = 0.722); Europe/Japan ≈ 0.46 due to narrow weight range
Multicollinearity	Weight, displacement, horsepower are highly collinear (VIF > 8)
Heteroscedasticity	Present in all models; consider log-transform or robust SEs for inference
Temporal trend	All origins improved MPG from 1970–1982, likely driven by oil crises and regulation