Project-2(Official)

knitr::include_graphics("lancer_car.png")

Introduction

The data set comes directly from fueleconomy.gov (https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle). As loaded, the original data set has 84 variables and 49,846 observations. The quantitative variables I focused on are City MPG (city08), Highway MPG (highway08), and Engine Size (displ). The categorical variables I used are Fuel Type (fuelType1) and Vehicle Class (VClass). To avoid distractions, I created a clean version of the data set that keeps only the 5 variables I needed, using the select() function. This topic matters because it shows the MPG one can expect from a vehicle and whether that vehicle is up to standard. The project can also help future car buyers decide which vehicles are fuel efficient and save money.

Load Necessary Libraries

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.3

Warning: package 'ggplot2' was built under R version 4.5.3

Warning: package 'readr' was built under R version 4.5.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(dplyr)
library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

vehicles<-readr::read_csv("vehicles.csv")

Rows: 49846 Columns: 84
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (23): drive, eng_dscr, fuelType, fuelType1, make, model, mpgData, trany,...
dbl (59): barrels08, barrelsA08, charge120, charge240, city08, city08U, city...
lgl  (2): phevBlended, tCharger

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Load the Dataset

Clean the data set

vehicles_clean<-vehicles %>% 
  select(city08,highway08,displ,fuelType1,VClass)
#Isolate variables needed

Simple plots

vehicles_clean %>% 
  ggplot(aes(x=displ, y=city08))+
  labs(x="Engine Size (Liters)", y = "City MPG")+
  geom_point(alpha = 0.5)

Warning: Removed 1469 rows containing missing values or values outside the scale range
(`geom_point()`).
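
The warning above says 1,469 rows were dropped from the plot. Below is a minimal sketch of how the missing values could be located, using a small made-up data frame (`toy` is hypothetical and only stands in for `vehicles_clean`). I am assuming the dropped rows come from missing `displ` values, which this report does not confirm.

```r
# Hypothetical toy data standing in for vehicles_clean
toy <- data.frame(
  city08 = c(18, 25, 120, 30),
  displ  = c(3.5, 2.0, NA, 1.6)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows a displ vs. city08 scatterplot would have to drop
dropped <- sum(!complete.cases(toy[, c("displ", "city08")]))
print(dropped)
```

Running the same two calls on `vehicles_clean` should show which column accounts for the removed rows.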

#Create simple scatterplot to look for trends

vehicles_clean %>% 
  ggplot(aes(x=displ, y = city08, color = fuelType1))+
  geom_point(alpha=0.4)+
  labs(title = "Engine Size vs City MPG by Fuel Type", x="Engine Size (Liters)", y="City MPG", color = "Fuel Type")

Warning: Removed 1469 rows containing missing values or values outside the scale range
(`geom_point()`).

#Created a scatterplot and added color to visualize different fuel types

Further Cleaning

vehicles_clean<-vehicles_clean %>% 
  mutate(
    avg_mpg = (city08 + highway08) / 2
  )
#Used mutate to combine City MPG and Highway MPG to make new variable

vehicles_clean<-vehicles_clean %>% 
  filter(fuelType1 %in% c("Regular Gasoline", "Premium Gasoline"))
#Used filter to make data less clustered and add focus

vehicles_sample<-vehicles_clean %>%
  sample_n(800)
#Reduce to a random sample of 800 observations so the plots are less crowded
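
Because the sample is random, the plots change each time the document is knit. A minimal sketch of one way to make the sample repeatable is shown below. The seed value 110 and the `toy` data frame are my own assumptions for illustration, and `slice_sample()` is the newer dplyr function that takes the place of `sample_n()`.

```r
library(dplyr)

# Hypothetical stand-in for vehicles_clean
toy <- data.frame(id = 1:5000, displ = rep(c(1.6, 2.0, 3.5, 5.0), 1250))

set.seed(110)  # arbitrary seed (an assumption); any fixed value works
sample_a <- toy %>% slice_sample(n = 800)

set.seed(110)  # same seed again, so the same rows are drawn
sample_b <- toy %>% slice_sample(n = 800)

# TRUE when both draws picked the same rows in the same order
identical(sample_a$id, sample_b$id)
```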

Visualization

vehicles_sample %>% 
  ggplot(aes(x=displ, y=avg_mpg, color=fuelType1))+
  geom_point(alpha=0.6)+
  labs(title = "Engine Size vs. AVG MPG by Fuel Type", subtitle = "Larger engines consume more gas", x = "Engine Size (Liters)", y = "Average MPG (City & Highway)", color = "Fuel Type", caption = "Source: EPA Vehicles Dataset")+
  theme_dark()

#Created scatterplot to visualize relationship

Regression

model<-lm(city08~displ+highway08,data=vehicles_clean)
summary(model)


Call:
lm(formula = city08 ~ displ + highway08, data = vehicles_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2815 -1.2842 -0.2288  0.9005 18.4417 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.211926   0.085919   14.11   <2e-16 ***
displ       -0.431017   0.010664  -40.42   <2e-16 ***
highway08    0.756355   0.002336  323.83   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.147 on 46834 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.8438,    Adjusted R-squared:  0.8438 
F-statistic: 1.265e+05 on 2 and 46834 DF,  p-value: < 2.2e-16

#Fit a linear regression of City MPG on engine size and Highway MPG

A correlation analysis showed a positive relationship between highway MPG and city MPG, and a negative relationship between engine size and MPG. These variables were selected for that reason. Average MPG was also considered but removed, because it is computed from city and highway MPG and made the model fit too perfectly.
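
The correlation check itself is not shown in this report. A minimal sketch of how it could be done is below, using made-up numbers (`toy` is hypothetical; the real call would use the numeric columns of `vehicles_clean`).

```r
# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  city08    = c(15, 18, 22, 28, 31, NA),
  highway08 = c(21, 25, 30, 36, 40, 33),
  displ     = c(5.7, 4.0, 3.0, 2.0, 1.5, 2.4)
)

# Correlation matrix, using only rows with no missing values
cor_matrix <- cor(toy, use = "complete.obs")
round(cor_matrix, 2)
```

In this toy example the city08/highway08 entry is positive and the city08/displ entry is negative, the same directions described above.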

Regression Equation: city_mpg = 1.212 - 0.431 (engine_size) + 0.756 (highway_mpg)

For every 1-liter increase in engine size, city MPG decreases by about 0.43 MPG, holding highway MPG constant.
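
To make the interpretation concrete, here is a small worked example that plugs values into the fitted equation. The coefficients are copied from the summary(model) output above. The helper `predict_city()` and the example car (2.0 or 3.0 liters, 30 highway MPG) are hypothetical and only for illustration.

```r
# Coefficients copied from the summary(model) output
b0      <- 1.211926   # intercept
b_displ <- -0.431017  # engine size (liters)
b_hwy   <- 0.756355   # highway MPG

# Hypothetical helper, not part of the original analysis
predict_city <- function(displ, highway08) {
  b0 + b_displ * displ + b_hwy * highway08
}

small <- predict_city(displ = 2.0, highway08 = 30)
large <- predict_city(displ = 3.0, highway08 = 30)

small          # predicted city MPG for the 2.0-liter example
large - small  # change for one extra liter, equal to b_displ
```

With the fitted model object available, `predict(model, newdata = ...)` would give the same kind of prediction without retyping the coefficients.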

P-value: < 2e-16 for each predictor, which is far below 0.05

Both engine size and highway MPG are statistically significant predictors of city MPG. This provides strong evidence that these variables are associated with fuel efficiency.

Adjusted R^2 = 0.8438

The adjusted R^2 value indicates that about 84.38% of the variation in city MPG is explained by engine size and highway MPG.
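
The output reports the same value for multiple and adjusted R-squared. A short sketch of the standard adjustment formula shows why. The numbers come from the summary output: 46,834 residual degrees of freedom plus 3 estimated coefficients gives n = 46,837, with p = 2 predictors. The R-squared value is the rounded one printed above, so the result is approximate.

```r
r2 <- 0.8438       # Multiple R-squared from the output (rounded)
n  <- 46834 + 3    # residual df plus 3 estimated coefficients
p  <- 2            # predictors: displ and highway08

# Standard adjusted R-squared formula
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

With this many observations and only two predictors, the penalty is too small to change the fourth decimal place.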

Interactive Plot

p<-ggplot(vehicles_sample,
          aes(x=displ,
              y=city08,
              color=fuelType1,
              text=paste(
                "Engine Size:", displ,
                "<br>City MPG:", city08,
                "<br>Fuel Type:", fuelType1
              )))+
  geom_point(alpha=0.7)+
  labs(
    title="Engine Size vs City MPG by Fuel Type",
    x="Engine Size (Liters)",
    y="City MPG",
    color="Fuel Type"
  )+
  theme_dark()

ggplotly(p,tooltip="text")

Conclusion

An interesting pattern I found in the visualizations is that engine size strongly influences fuel consumption. Even when the fuel types differ, engine size still plays a big role in efficiency. I attempted to use average MPG (avg_mpg) as a predictor, but because it is computed from city and highway MPG the model was too perfect, so I had to update my model. Including car makes or brands in my plotly graph would have made a more user-friendly visualization because it would communicate which cars are actually fuel efficient. All in all, I think the results turned out as expected and the visualizations represent my question well.
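
The "too perfect" model can be reproduced on made-up data. This is only a sketch under my own assumptions (the `toy` data frame, the seed, and the simulated numbers are all hypothetical), but it shows the reason: avg_mpg is defined as (city08 + highway08) / 2, so city08 is exactly 2 * avg_mpg - highway08 and the regression has nothing left to estimate.

```r
set.seed(1)  # arbitrary seed for the made-up data
toy <- data.frame(highway08 = runif(50, 20, 40))
toy$city08  <- 0.75 * toy$highway08 + rnorm(50, sd = 2)
toy$avg_mpg <- (toy$city08 + toy$highway08) / 2

# Using avg_mpg as a predictor leaks the response into the model
leaky <- lm(city08 ~ avg_mpg + highway08, data = toy)

coef(leaky)             # roughly 0, 2, and -1
max(abs(resid(leaky)))  # essentially zero: a perfect fit
```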

Citations

https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle

DATA-110 Slides