2024-10-17

Exploring Factors Influencing Fuel Efficiency

Background

In this analysis, I will explore the factors that influence fuel efficiency in cars. Understanding what impacts a car’s fuel efficiency is crucial for sustainability, reducing emissions, and improving vehicle performance. I will be using the Fuel Economy dataset from the EPA.

Problem Focus

Which factors, including engine displacement, transmission type, and other vehicle characteristics, most strongly affect a car’s fuel efficiency?

This analysis aims to shed light on key characteristics that influence MPG, offering insights for vehicle manufacturers and consumers wanting to reduce their fuel usage.

Data Wrangling, Munging, and Cleaning

1. Handling Missing Values

# Load data
data_path <- "~/vehicles.csv"
fuel_data <- read.csv(data_path)
sapply(fuel_data, function(x) sum(is.na(x)))
##       barrels08      barrelsA08       charge120       charge240          city08 
##               0               0               0               0               0 
##         city08U         cityA08        cityA08U          cityCD           cityE 
##               0               0               0               0               0 
##          cityUF             co2            co2A co2TailpipeAGpm  co2TailpipeGpm 
##               0               0               0               0               0 
##          comb08         comb08U         combA08        combA08U           combE 
##               0               0               0               0               0 
##      combinedCD      combinedUF       cylinders           displ           drive 
##               0               0             965             963               0 
##           engId        eng_dscr         feScore      fuelCost08     fuelCostA08 
##               0               0               0               0               0 
##        fuelType       fuelType1        ghgScore       ghgScoreA       highway08 
##               0               0               0               0               0 
##      highway08U      highwayA08     highwayA08U       highwayCD        highwayE 
##               0               0               0               0               0 
##       highwayUF             hlv             hpv              id             lv2 
##               0               0               0               0               0 
##             lv4            make           model         mpgData     phevBlended 
##               0               0               0               0               0 
##             pv2             pv4           range       rangeCity      rangeCityA 
##               0               0               0               0               0 
##        rangeHwy       rangeHwyA           trany           UCity          UCityA 
##               0               0               0               0               0 
##        UHighway       UHighwayA          VClass            year    youSaveSpend 
##               0               0               0               0               0 
##       baseModel         guzzler      trans_dscr        tCharger        sCharger 
##               0               0               0           37560               0 
##         atvType       fuelType2          rangeA         evMotor         mfrCode 
##               0               0               0               0               0 
##        c240Dscr      charge240b       c240bDscr       createdOn      modifiedOn 
##               0               0               0               0               0 
##       startStop        phevCity         phevHwy        phevComb 
##               0               0               0               0
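# Drop every row that has an NA in any column; tCharger alone has 37,560 NAs,
# so this removes at least that many rows
fuel_data <- na.omit(fuel_data)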
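
Because `na.omit()` drops a row whenever any column is missing, sparsely populated columns such as tCharger end up deciding which vehicles survive. The sketch below contrasts that with checking completeness only on the columns being analyzed. It runs on a made-up toy data frame; `toy`, `analysis_cols`, and the values are hypothetical, and whether the narrower check suits the real EPA file depends on how its columns are coded.

```r
# Toy data frame (invented values, not EPA data)
toy <- data.frame(
  displ    = c(2.0, NA, 3.5, 1.6),
  city08   = c(22, 30, 16, 28),
  tCharger = c(NA, NA, TRUE, NA)
)

# Whole-frame na.omit(): a row is dropped if ANY column is NA
n_all_cols <- nrow(na.omit(toy))

# Alternative: require completeness only in the columns used for modeling
analysis_cols <- c("displ", "city08")
kept <- toy[complete.cases(toy[, analysis_cols]), ]

cat("Rows kept by na.omit() on all columns:", n_all_cols, "\n")
cat("Rows kept when checking analysis columns only:", nrow(kept), "\n")
```

Comparing the two row counts on the real data would show how much of the sample is lost to columns that never enter the models.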

2. Converting Variables to Factors

# Convert relevant variables to factors
fuel_data$trany <- as.factor(fuel_data$trany)
fuel_data$fuelType <- as.factor(fuel_data$fuelType)

3. Handling Outliers

# Boxplots to check for outliers
boxplot(fuel_data$displ, main = "Boxplot of Engine Displacement", ylab = "Displacement (L)")

boxplot(fuel_data$city08, main = "Boxplot of City MPG", ylab = "City MPG")

boxplot(fuel_data$highway08, main = "Boxplot of Highway MPG", ylab = "Highway MPG")

# Keep only rows strictly below the 99th percentile of displacement
fuel_data <- fuel_data[fuel_data$displ < quantile(fuel_data$displ, 0.99), ]
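
To illustrate what the 99th-percentile filter does, here is a small sketch on invented displacement values (the vector `displ` is hypothetical, not EPA data). With its default settings `quantile()` interpolates between neighboring sorted values, and the strict `<` removes everything at or above the resulting cutoff.

```r
# Invented displacement values (liters): 100 in total, one extreme value
displ <- c(rep(2.0, 50), rep(3.0, 45), 5.7, 6.2, 6.6, 6.6, 8.4)

# 99th percentile using R's default (type 7) interpolation
cutoff <- quantile(displ, 0.99)

# Same filter as in the analysis: keep values strictly below the cutoff
trimmed <- displ[displ < cutoff]

cat("Cutoff:", cutoff, "\n")
cat("Values removed:", length(displ) - length(trimmed), "\n")
cat("Maximum after trimming:", max(trimmed), "\n")
```

Note that this rule trims only the upper tail of displacement; the MPG variables shown in the other two boxplots are not filtered.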

4. Summary of Cleaned Data

After cleaning, I summarize the key variables to confirm the data is ready for analysis; 10,508 rows remain.

##      city08       highway08         displ                     trany     
##  Min.   : 7.0   Min.   :10.00   Min.   :0.900   Automatic (S8)   :2281  
##  1st Qu.:17.0   1st Qu.:23.00   1st Qu.:2.000   Manual 6-spd     :1122  
##  Median :19.0   Median :26.00   Median :2.300   Automatic (S6)   :1073  
##  Mean   :19.9   Mean   :26.89   Mean   :2.645   Manual 5-spd     : 910  
##  3rd Qu.:22.0   3rd Qu.:30.00   3rd Qu.:3.000   Automatic (AM-S7): 540  
##  Max.   :42.0   Max.   :52.00   Max.   :6.600   Automatic 9-spd  : 538  
##                                                 (Other)          :4044

Exploratory Data Analysis: Factors Influencing Fuel Efficiency

1. Summary

# Summary of data
summary(fuel_data[c("city08", "highway08", "displ", "trany")])
##      city08       highway08         displ                     trany     
##  Min.   : 7.0   Min.   :10.00   Min.   :0.900   Automatic (S8)   :2281  
##  1st Qu.:17.0   1st Qu.:23.00   1st Qu.:2.000   Manual 6-spd     :1122  
##  Median :19.0   Median :26.00   Median :2.300   Automatic (S6)   :1073  
##  Mean   :19.9   Mean   :26.89   Mean   :2.645   Manual 5-spd     : 910  
##  3rd Qu.:22.0   3rd Qu.:30.00   3rd Qu.:3.000   Automatic (AM-S7): 540  
##  Max.   :42.0   Max.   :52.00   Max.   :6.600   Automatic 9-spd  : 538  
##                                                 (Other)          :4044

2. Correlation

# Calculate correlation between displacement and MPG
cor(fuel_data$displ, fuel_data$city08)
## [1] -0.6980572
cor(fuel_data$displ, fuel_data$highway08)
## [1] -0.6315102

Visualizing Displacement vs. MPG

1. Base Plot for Displacement vs City MPG

2. Base Plot for Displacement vs Highway MPG
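
The code for the base-R scatter plots is not shown in this report. A minimal sketch of how such plots could be drawn follows; it uses simulated stand-in data (`toy` is hypothetical) with the same column names as `fuel_data`, and the colors, point style, and fitted line are assumptions rather than the original settings.

```r
# Simulated stand-in data; the real analysis would use fuel_data instead
set.seed(42)
toy <- data.frame(displ = runif(100, min = 1, max = 6))
toy$city08    <- 27.7 - 3.0 * toy$displ + rnorm(100, sd = 3)
toy$highway08 <- 35.4 - 3.2 * toy$displ + rnorm(100, sd = 4)

# Draw the two scatter plots side by side
old_par <- par(mfrow = c(1, 2))

plot(toy$displ, toy$city08,
     main = "Displacement vs City MPG",
     xlab = "Displacement (L)", ylab = "City MPG",
     pch = 19, col = "steelblue")
abline(lm(city08 ~ displ, data = toy), col = "red", lwd = 2)  # least-squares line

plot(toy$displ, toy$highway08,
     main = "Displacement vs Highway MPG",
     xlab = "Displacement (L)", ylab = "Highway MPG",
     pch = 19, col = "darkgreen")
abline(lm(highway08 ~ displ, data = toy), col = "red", lwd = 2)

par(old_par)  # restore the previous graphics settings
```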

3. ggplot for Displacement vs City MPG

## `geom_smooth()` using formula = 'y ~ x'

4. ggplot for Displacement vs Highway MPG

## `geom_smooth()` using formula = 'y ~ x'
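
The `geom_smooth()` messages above indicate that a smoother was added without an explicit formula. A hedged sketch of equivalent ggplot2 plots is given below. The data are simulated (`toy` is hypothetical), and `method = "lm"` is an assumption, since the report does not show which smoothing method was used.

```r
library(ggplot2)

# Simulated stand-in data; the real analysis would use fuel_data instead
set.seed(42)
toy <- data.frame(displ = runif(100, min = 1, max = 6))
toy$city08    <- 27.7 - 3.0 * toy$displ + rnorm(100, sd = 3)
toy$highway08 <- 35.4 - 3.2 * toy$displ + rnorm(100, sd = 4)

# Scatter plot with a linear trend line; passing formula explicitly
# avoids the "using formula = 'y ~ x'" message
p_city <- ggplot(toy, aes(x = displ, y = city08)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Displacement vs City MPG",
       x = "Displacement (L)", y = "City MPG")

p_highway <- ggplot(toy, aes(x = displ, y = highway08)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Displacement vs Highway MPG",
       x = "Displacement (L)", y = "Highway MPG")

print(p_city)
print(p_highway)
```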

5. plotly Interactive Plot for Displacement vs City MPG

6. plotly Interactive Plot for Displacement vs Highway MPG
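
The interactive plots do not survive in a static rendering. As a rough sketch of how they could be produced with plotly, again on simulated stand-in data (`toy` is hypothetical; marker settings and titles are assumptions):

```r
library(plotly)

# Simulated stand-in data; the real analysis would use fuel_data instead
set.seed(42)
toy <- data.frame(displ = runif(100, min = 1, max = 6))
toy$city08    <- 27.7 - 3.0 * toy$displ + rnorm(100, sd = 3)
toy$highway08 <- 35.4 - 3.2 * toy$displ + rnorm(100, sd = 4)

# Interactive scatter plot: hovering over a point shows its values
fig_city <- plot_ly(toy, x = ~displ, y = ~city08,
                    type = "scatter", mode = "markers")
fig_city <- plotly::layout(fig_city,
                           title = "Displacement vs City MPG",
                           xaxis = list(title = "Displacement (L)"),
                           yaxis = list(title = "City MPG"))

fig_highway <- plot_ly(toy, x = ~displ, y = ~highway08,
                       type = "scatter", mode = "markers")
fig_highway <- plotly::layout(fig_highway,
                              title = "Displacement vs Highway MPG",
                              xaxis = list(title = "Displacement (L)"),
                              yaxis = list(title = "Highway MPG"))

fig_city  # printing the object renders the widget in an interactive session
```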

Machine Learning Techniques

1. Simple Linear Regression

## 
## Call:
## lm(formula = city08 ~ displ, data = fuel_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2156  -1.8952  -0.1908   2.0110  19.0110 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.71818    0.08383  330.65   <2e-16 ***
## displ       -2.95574    0.02958  -99.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.085 on 10506 degrees of freedom
## Multiple R-squared:  0.4873, Adjusted R-squared:  0.4872 
## F-statistic:  9985 on 1 and 10506 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = highway08 ~ displ, data = fuel_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3639  -2.7468  -0.0108   2.7252  21.7252 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.44968    0.10980  322.87   <2e-16 ***
## displ       -3.23431    0.03874  -83.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.04 on 10506 degrees of freedom
## Multiple R-squared:  0.3988, Adjusted R-squared:  0.3987 
## F-statistic:  6969 on 1 and 10506 DF,  p-value: < 2.2e-16
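
As a worked reading of the two summaries above, the sketch below plugs the printed coefficients into the fitted lines and checks that, for a one-predictor regression, R-squared equals the squared correlation reported earlier. The function names `predict_city` and `predict_highway` are introduced here for illustration only.

```r
# Coefficients copied from the two lm() summaries above
predict_city    <- function(displ) 27.71818 - 2.95574 * displ
predict_highway <- function(displ) 35.44968 - 3.23431 * displ

# Predicted MPG for a few engine sizes (liters)
sizes <- c(1.5, 2.0, 3.5)
print(data.frame(displ   = sizes,
                 city    = predict_city(sizes),
                 highway = predict_highway(sizes)))

# In simple linear regression, R-squared is the squared Pearson correlation
r_city    <- -0.6980572
r_highway <- -0.6315102
cat("City r^2:   ", round(r_city^2, 4), "\n")     # compare with 0.4873
cat("Highway r^2:", round(r_highway^2, 4), "\n")  # compare with 0.3988
```

Each additional liter of displacement is therefore associated with roughly 3 fewer MPG in both models, and displacement alone accounts for about 49% of the variance in city MPG and 40% in highway MPG.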

2. a) Multiple Linear Regression for City MPG

## Coefficients (one row per model term, in model order; term names were not printed):
##  Row     Estimate  Std. Error      t value       Pr(>|t|)
##    1     31.66992   0.9080992     34.87496   3.34467e-252
##    2    -3.043678  0.02600404    -117.0464              0
##    3     6.549197   0.9210134     7.110859   1.228531e-12
##    4     3.384796   0.9045439     3.741991   0.0001835438
##    5     1.754035   0.8980661     1.953124      0.0508315
##    6     5.877768    1.308506     4.491969   7.132391e-06
##    7     2.120867     0.96664     2.194061     0.02825292
##    8     3.193033    0.914121     3.493009   0.0004795827
##    9     2.620506   0.9456748     2.771043     0.00559756
##   10     6.723796    1.107005     6.073862   1.292102e-09
##   11     7.354491   0.9736784     7.553307   4.597562e-14
##   12     3.516599   0.9234289     3.808196   0.0001407838
##   13     1.076286   0.9063093     1.187548      0.2350385
##   14   -0.2824825    1.026945   -0.2750708      0.7832673
##   15   -0.7707128   0.9117556   -0.8453064      0.3979591
##   16     1.366229   0.9022778     1.514199      0.1300054
##   17     2.415524   0.9296022     2.598449     0.00937772
##   18     2.777836   0.8999062     3.086806    0.002028484
##   19     1.303305   0.9240864     1.410371      0.1584598
##   20     5.854509   0.9269919     6.315599   2.800392e-10
##   21   -0.6332199   0.9277319   -0.6825463      0.4949087
##   22    -2.086685   0.9116447    -2.288924      0.0221036
##   23    -1.467024   0.9058154    -1.619562      0.1053564
##   24   0.06614792   0.9149872   0.07229382      0.9423695
##   25    0.5605643   0.9183044    0.6104341      0.5415875
##   26     3.588452    0.909207     3.946794   7.972261e-05
##   27    0.2746922   0.9066116    0.3029877      0.7619052
##   28     2.197534    0.904613     2.429253     0.01514669
##   29     -1.82511   0.9796478    -1.863026     0.06248653
##   30     1.428947    1.376724     1.037933      0.2993252
##   31   -0.5399596    0.902489   -0.5983005      0.5496524
##   32     1.897791   0.9017373     2.104594     0.03535042
##   33    0.8833072   0.9260575    0.9538363      0.3401886
##   34    -7.146861   0.5094327    -14.02906    2.60524e-44
##   35    -5.894815   0.1100957    -53.54266              0
##   36    -3.879419   0.2103771    -18.44031   9.461615e-75
##   37    -3.695722   0.4937135     -7.48556   7.697333e-14
##   38     -5.70508   0.3344059    -17.06035   2.163763e-64
##   39    -5.228983    0.114618    -45.62094              0
##   40   -0.9783135   0.5042636    -1.940084     0.05239635
## R-squared: 0.7090051
## Adjusted R-squared: 0.707921
## F-statistic: 653.9779

2. b) Multiple Linear Regression for Highway MPG

## Coefficients (one row per model term, in model order; term names were not printed):
##  Row     Estimate  Std. Error      t value       Pr(>|t|)
##    1     35.83063    1.259274      28.4534  1.345198e-171
##    2    -3.486486  0.03606017    -96.68523              0
##    3     9.649479    1.277182     7.555286   4.528557e-14
##    4     8.347652    1.254344     6.654994   2.974806e-11
##    5     6.389263    1.245361      5.13045    2.94229e-07
##    6     4.892086    1.814524     2.696071    0.007027484
##    7     6.984688    1.340453      5.21069     1.9172e-07
##    8     7.869454    1.267625     6.208032     5.5686e-10
##    9     7.610072    1.311381     5.803099   6.698992e-09
##   10      7.65151      1.5351     4.984373   6.315836e-07
##   11     10.95532    1.350214     8.113765   5.457295e-16
##   12     6.928893    1.280532     5.410948   6.407443e-08
##   13     4.993758    1.256792     3.973416   7.132421e-05
##   14      2.71506    1.424079     1.906537     0.05660789
##   15      4.60418    1.264344     3.641555   0.0002723039
##   16     6.750794    1.251202     5.395449   6.984288e-08
##   17     7.319253    1.289093     5.677833   1.400451e-08
##   18     8.253141    1.247913     6.613556   3.934983e-11
##   19     4.561794    1.281444     3.559886   0.0003726604
##   20     10.06245    1.285473     7.827822    5.44349e-15
##   21     2.814996    1.286499     2.188106     0.02868392
##   22    -0.122755    1.264191  -0.09710167      0.9226475
##   23     1.988795    1.256107       1.5833      0.1133833
##   24      5.40792    1.268826     4.262145   2.042474e-05
##   25     5.513579    1.273426     4.329721   1.506871e-05
##   26     8.741166     1.26081     6.932975   4.363746e-12
##   27     4.065157    1.257211     3.233472    0.001226746
##   28     7.000608     1.25444     5.580665   2.455636e-08
##   29     1.784276    1.358492     1.313424      0.1890687
##   30     4.282019    1.909123     2.242925     0.02492248
##   31     4.130092    1.251494     3.300129   0.0009696356
##   32     7.836204    1.250452     6.266697   3.832978e-10
##   33     5.623171    1.284177     4.378812   1.204872e-05
##   34    -6.653404   0.7064376    -9.418247   5.552438e-21
##   35    -6.520017   0.1526712    -42.70625              0
##   36    -8.193853    0.291733    -28.08682  2.067388e-167
##   37    -7.688676   0.6846396    -11.23025   4.247402e-29
##   38    -5.762107   0.4637254    -12.42569   3.356051e-35
##   39    -6.534273   0.1589425    -41.11093              0
##   40    -7.676536   0.6992695    -10.97793   6.923827e-28
## R-squared: 0.617521
## Adjusted R-squared: 0.616096
## F-statistic: 433.3544
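
The values 653.9779 and 433.3544 reported for the two multiple regressions are consistent with overall F-statistics rather than probabilities, since a p-value cannot exceed 1. The sketch below reproduces them from the printed R-squared values and then shows how the overall p-value can be obtained with `pf()`. It assumes n = 10508 rows (the 10506 residual degrees of freedom of the simple models plus their 2 coefficients) and k = 39 predictors (the 40 printed coefficients minus the intercept). The `mtcars` model at the end is only a stand-in to show the extraction on a fitted object.

```r
n <- 10508  # rows in the cleaned data (assumption stated above)
k <- 39     # number of non-intercept coefficients (assumption stated above)

# Overall F-statistic expressed through R-squared
f_from_r2 <- function(r2, n, k) (r2 / k) / ((1 - r2) / (n - k - 1))

f_city    <- f_from_r2(0.7090051, n, k)
f_highway <- f_from_r2(0.617521, n, k)
cat("City F:   ", f_city, "\n")
cat("Highway F:", f_highway, "\n")

# The p-value is the upper tail of the F distribution
p_city    <- pf(f_city, k, n - k - 1, lower.tail = FALSE)
p_highway <- pf(f_highway, k, n - k - 1, lower.tail = FALSE)
cat("City p-value:   ", p_city, "\n")
cat("Highway p-value:", p_highway, "\n")

# Same extraction from a fitted model (built-in mtcars used as a stand-in)
fit <- lm(mpg ~ disp + factor(am), data = mtcars)
f   <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
p_overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
cat("Stand-in model p-value:", p_overall, "\n")
```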

3. a) Decision Tree Regression for City MPG

## n= 10508 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 10508 194997.100 19.90008  
##    2) displ>=2.05 5807  46495.710 17.38557  
##      4) displ>=3.45 1878   8009.568 14.94622  
##        8) displ>=4.9 392   1180.120 12.59949 *
##        9) displ< 4.9 1486   4101.168 15.56528 *
##      5) displ< 3.45 3929  21969.810 18.55154 *
##    3) displ< 2.05 4701  66430.820 23.00617  
##      6) displ>=1.7 3518  32336.940 21.84821  
##       12) displ< 1.85 381   1833.864 20.15486 *
##       13) displ>=1.85 3137  29277.900 22.05387  
##         26) displ>=1.95 3053  22203.390 21.83852 *
##         27) displ< 1.95 84   1786.810 29.88095 *
##      7) displ< 1.7 1183  15348.760 26.44970 *

3. b) Decision Tree Regression for Highway MPG

## n= 10508 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 10508 285285.600 26.89475  
##    2) displ>=2.05 5807  77367.780 23.96728  
##      4) displ>=3.45 1878  14762.780 21.38072  
##        8) displ>=4.9 392   2512.712 18.85459 *
##        9) displ< 4.9 1486   9088.703 22.04711 *
##      5) displ< 3.45 3929  44035.110 25.20361 *
##    3) displ< 2.05 4701  96676.690 30.51096  
##      6) displ>=1.7 3518  62504.360 29.56111  
##       12) displ< 1.85 381   4558.730 27.60892 *
##       13) displ>=1.85 3137  56317.270 29.79821  
##         26) displ>=1.95 3053  45798.970 29.52735 *
##         27) displ< 1.95 84   2153.286 39.64286 *
##      7) displ< 1.7 1183  21559.770 33.33559 *
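
To make the printed trees easier to read, the sketch below hand-codes the city MPG tree from section 3. a) as nested conditions; the thresholds and leaf means are copied from that output, and the function name `predict_city_tree` is introduced here for illustration. This is a manual transcription, not a call to the fitted model.

```r
# Walk the printed city MPG tree for a single displacement value (liters)
predict_city_tree <- function(displ) {
  if (displ >= 2.05) {
    if (displ >= 3.45) {
      if (displ >= 4.9) 12.59949 else 15.56528  # nodes 8 and 9
    } else {
      18.55154                                  # node 5
    }
  } else if (displ >= 1.7) {
    if (displ < 1.85) {
      20.15486                                  # node 12
    } else if (displ >= 1.95) {
      21.83852                                  # node 26
    } else {
      29.88095                                  # node 27
    }
  } else {
    26.44970                                    # node 7
  }
}

sizes <- c(1.5, 1.9, 2.0, 3.0, 5.0)
print(data.frame(displ = sizes,
                 predicted_city_mpg = sapply(sizes, predict_city_tree)))
```

The leaf for engines between 1.85 and 1.95 L (84 vehicles) has a higher mean than the leaf for engines under 1.7 L, so the tree's predictions do not fall strictly as displacement grows.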

4. a) Random Forest Regression for City MPG

## 
## Call:
##  randomForest(formula = city08 ~ displ, data = fuel_data, ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 6.171726
##                     % Var explained: 66.74

4. b) Random Forest Regression for Highway MPG

## 
## Call:
##  randomForest(formula = highway08 ~ displ, data = fuel_data, ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 11.41853
##                     % Var explained: 57.94
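
The "% Var explained" figures can be cross-checked against numbers printed elsewhere in this report. The sketch below assumes the statistic is 100 × (1 − MSE / variance of the response), with the variance taken as the root-node deviance from the decision tree output divided by n. That formula is an assumption about how randomForest reports the value, although it reproduces both figures above.

```r
n <- 10508  # rows, from the decision tree output

# Root-node deviance (total sum of squares) from the tree output
root_dev_city    <- 194997.100
root_dev_highway <- 285285.600

# Mean of squared residuals from the random forest output
mse_city    <- 6.171726
mse_highway <- 11.41853

pct_var_explained <- function(mse, root_deviance, n) {
  100 * (1 - mse / (root_deviance / n))
}

pct_city    <- pct_var_explained(mse_city, root_dev_city, n)
pct_highway <- pct_var_explained(mse_highway, root_dev_highway, n)
cat("City % var explained:   ", round(pct_city, 2), "\n")
cat("Highway % var explained:", round(pct_highway, 2), "\n")
```

For comparison, the simple linear models on the same single predictor reached R-squared values of 0.4873 (city) and 0.3988 (highway).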

Conclusion

Key Findings on Engine Displacement and Fuel Efficiency:

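  1. EDA: Plots from base R, ggplot2, and plotly showed an inverse relationship between engine displacement and MPG (correlations of about -0.70 with city MPG and -0.63 with highway MPG).
  2. Linear Regression: Higher displacement significantly lowers MPG, by roughly 3 MPG per additional liter in the simple models (-2.96 city, -3.23 highway), even when controlling for transmission and fuel type.
  3. Decision Tree: Larger displacements generally led to lower predicted MPG, with step-like drops at split points such as 2.05 L and 3.45 L, highlighting non-linear influences.
  4. Random Forest: With displacement as the only predictor, the forest explained 66.74% of the variance in city MPG and 57.94% in highway MPG, more than the simple linear models (48.73% and 39.88%).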

In summary, engine displacement is strongly associated with fuel efficiency, suggesting that optimizing engine size could improve MPG.