Car MPG Analysis

Author

Jean Tcheby

Introduction

This project analyzes how two car characteristics, horsepower and weight, relate to fuel efficiency, measured in miles per gallon (mpg). The dataset contains 398 rows and nine variables, including mpg, horsepower, weight, and origin. The numeric variables support quantitative analysis, while origin (usa, europe, or japan) supports comparisons across categories.

The goal of this project is to explore the relationship between horsepower, weight, and fuel efficiency using linear regression and data visualization.

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv")
Rows: 398 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): origin, name
dbl (7): mpg, cylinders, displacement, horsepower, weight, acceleration, mod...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
# A tibble: 6 × 9
    mpg cylinders displacement horsepower weight acceleration model_year origin
  <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl>      <dbl> <chr> 
1    18         8          307        130   3504         12           70 usa   
2    15         8          350        165   3693         11.5         70 usa   
3    18         8          318        150   3436         11           70 usa   
4    16         8          304        150   3433         12           70 usa   
5    17         8          302        140   3449         10.5         70 usa   
6    15         8          429        198   4341         10           70 usa   
# ℹ 1 more variable: name <chr>
# Drop every row that has a missing value in any column (398 rows before, 392 after)
df <- df %>% drop_na()
summary(df)
      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
  acceleration     model_year       origin              name          
 Min.   : 8.00   Min.   :70.00   Length:392         Length:392        
 1st Qu.:13.78   1st Qu.:73.00   Class :character   Class :character  
 Median :15.50   Median :76.00   Mode  :character   Mode  :character  
 Mean   :15.54   Mean   :75.98                                        
 3rd Qu.:17.02   3rd Qu.:79.00                                        
 Max.   :24.80   Max.   :82.00                                        
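
The cleaning step drops every row that contains at least one missing value, which is why the summary reports 392 rows rather than the 398 originally read. The sketch below shows one way to count missing values per column before dropping them. It uses a made-up three-row data frame (the `toy` data and its values are hypothetical, not taken from the dataset) and base R's `complete.cases()` in place of `drop_na()`, so it runs without the tidyverse.

```r
# Hypothetical toy data; the values are made up for illustration only
toy <- data.frame(
  mpg        = c(18, 25, 30),
  horsepower = c(130, NA, 70),
  weight     = c(3504, 2100, NA)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing values (the same effect as tidyr::drop_na())
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

Running `colSums(is.na(df))` on the real data before the cleaning step would show which columns account for the dropped rows.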

A multiple linear regression model was fitted to examine how horsepower and weight relate to fuel efficiency (mpg). Both predictors are statistically significant: the p-value is 2.49e-05 for horsepower and below 2e-16 for weight, each far below the 0.05 threshold.

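Specifically, horsepower has a negative coefficient (-0.047): holding weight constant, each additional unit of horsepower is associated with a decrease of about 0.047 mpg. Weight also has a negative coefficient (-0.0058): holding horsepower constant, each additional unit of weight is associated with a decrease of about 0.0058 mpg, or about 5.8 mpg per 1,000 units of weight.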

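The adjusted R-squared value is approximately 0.7049, which means that about 70% of the variation in mpg is explained by horsepower and weight together. The residual standard error of 4.24 indicates that predictions typically miss the observed mpg by about 4 mpg, so the model fits the data reasonably well but leaves roughly 30% of the variation unexplained.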
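
The adjusted R-squared can be checked by hand from the multiple R-squared. The sketch below applies the standard adjustment formula to numbers taken from the model summary in this report (R-squared of 0.7064, 389 residual degrees of freedom). The variable names are illustrative only.

```r
r2 <- 0.7064  # Multiple R-squared reported in the model summary
n  <- 392     # observations: 389 residual df + 3 estimated coefficients
p  <- 2       # predictors: horsepower and weight

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
```

With only two predictors and 392 observations the penalty is small, which is why the adjusted value is so close to the unadjusted one.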

The regression equation is:

mpg = 45.64 − 0.047(horsepower) − 0.0058(weight)
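
As a rough illustration of how the equation is used, the sketch below plugs in the first car shown by `head(df)` (130 horsepower, weight 3504, observed 18 mpg). It uses the rounded coefficients from the equation, so the result can differ slightly from what `predict()` returns on the fitted model. The variable names are illustrative only.

```r
# Coefficients as rounded in the regression equation above
b0   <- 45.64    # intercept
b_hp <- -0.047   # horsepower slope
b_wt <- -0.0058  # weight slope

# First car shown by head(df)
horsepower   <- 130
weight       <- 3504
observed_mpg <- 18

predicted_mpg <- b0 + b_hp * horsepower + b_wt * weight
residual      <- observed_mpg - predicted_mpg

cat("Predicted mpg:", round(predicted_mpg, 2), "\n")
cat("Residual:", round(residual, 2), "\n")
```

The prediction comes out at about 19.2 mpg, so the model overestimates this car's fuel efficiency by a little over 1 mpg.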

# Fit a multiple linear regression of mpg on horsepower and weight
model <- lm(mpg ~ horsepower + weight, data = df)

summary(model)

Call:
lm(formula = mpg ~ horsepower + weight, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.0762  -2.7340  -0.3312   2.1752  16.2601 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 45.6402108  0.7931958  57.540  < 2e-16 ***
horsepower  -0.0473029  0.0110851  -4.267 2.49e-05 ***
weight      -0.0057942  0.0005023 -11.535  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.24 on 389 degrees of freedom
Multiple R-squared:  0.7064,    Adjusted R-squared:  0.7049 
F-statistic: 467.9 on 2 and 389 DF,  p-value: < 2.2e-16
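
The t values and p-values in the coefficient table follow directly from the estimates and standard errors. The sketch below recomputes them from the printed numbers and adds approximate 95% confidence intervals. It copies values from the table rather than reading them from `model`, so small rounding differences are expected, and the variable names are illustrative only.

```r
# Estimates and standard errors copied from the coefficient table above
est <- c(horsepower = -0.0473029, weight = -0.0057942)
se  <- c(horsepower =  0.0110851, weight =  0.0005023)
df_resid <- 389  # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_val <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_val), df = df_resid)

# 95% confidence intervals: estimate +/- critical value * standard error
crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - crit * se, upper = est + crit * se)

print(round(t_val, 3))
print(signif(p_val, 3))
print(round(ci, 4))
```

On the fitted model itself, `confint(model)` returns the same intervals without the rounding.
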
# Arrange the four standard lm diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model)

# Scatter plot of mpg against weight, with a separate linear trend line for each origin
ggplot(df, aes(x = weight, y = mpg, color = origin)) +
  geom_point(size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Relationship Between Car Weight and Fuel Efficiency",
    subtitle = "Heavier cars tend to have lower miles per gallon",
    x = "Weight",
    y = "Miles per Gallon (MPG)",
    color = "Car Origin",
    caption = "Source: Car MPG Dataset"
  ) +
  scale_color_manual(values = c(
    "usa" = "#E41A1C",
    "europe" = "#377EB8",
    "japan" = "#4DAF4A"
  )) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 12),
    legend.position = "right"
  )
`geom_smooth()` using formula = 'y ~ x'