About the data

The data comes from the R package fueleconomy, which sources it from the EPA (Environmental Protection Agency). Within the package, the data is stored in the vehicles dataset.

The fuel economy data covers all cars sold in the US from 1984 to 2015. The vehicles dataset has 33,442 rows and 12 variables.

Data Source

Further information about the data used in this study can be found at the links below:

Load the data

The data is loaded from the vehicles dataset in the fueleconomy package. The kable function from the knitr package is used to display data tables.

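# load the data and the packages used below
library(fueleconomy)
library(knitr)    # kable()
library(dplyr)    # %>%, rename(), mutate(), group_by(), summarise()
library(ggplot2)  # plots

kable(head(vehicles))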

The dataset has 33,442 rows and 12 columns.

# dimensions
dim(vehicles)
## [1] 33442    12

Tidying the data

Approach: My goal is to clean the data and create new columns, such as mpg, for further analysis.

# rename columns
vehicles <- vehicles %>%
  rename(highway = hwy, city = cty)

# show initial rows through head
kable(head(vehicles), align = "l")
| id | make | model | year | class | trans | drive | cyl | displ | fuel | highway | city |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 27550 | AM General | DJ Po Vehicle 2WD | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 4 | 2.5 | Regular | 17 | 18 |
| 28426 | AM General | DJ Po Vehicle 2WD | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 4 | 2.5 | Regular | 17 | 18 |
| 27549 | AM General | FJ8c Post Office | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 |
| 28425 | AM General | FJ8c Post Office | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 |
| 1032 | AM General | Post Office DJ5 2WD | 1985 | Special Purpose Vehicle 2WD | Automatic 3-spd | Rear-Wheel Drive | 4 | 2.5 | Regular | 17 | 16 |
| 1033 | AM General | Post Office DJ8 2WD | 1985 | Special Purpose Vehicle 2WD | Automatic 3-spd | Rear-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 |

summary(vehicles)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel              highway            city       
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57
any(is.na(vehicles))
## [1] TRUE
vehicles_subdf <- vehicles[rowSums(is.na(vehicles)) > 0,]
kable(head(vehicles_subdf))
| id | make | model | year | class | trans | drive | cyl | displ | fuel | highway | city |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 31893 | Azure Dynamics | Transit Connect Electric Van | 2012 | Special Purpose Vehicle 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 62 | 62 |
| 31894 | Azure Dynamics | Transit Connect Electric Wagon | 2012 | Special Purpose Vehicle 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 62 | 62 |
| 32276 | BMW | Active E | 2011 | Subcompact Cars | Automatic (A1) | Rear-Wheel Drive | NA | NA | Electricity | 96 | 107 |
| 33383 | BYD | e6 | 2012 | Sport Utility Vehicle - 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 64 | 60 |
| 34860 | BYD | e6 | 2013 | Small Sport Utility Vehicle 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 65 | 61 |
| 34859 | BYD | e6 | 2014 | Small Sport Utility Vehicle 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 65 | 61 |
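
The six incomplete rows shown above are all electric vehicles with no cylinder count or displacement. A quick way to see which columns hold the missing values is to count NAs per column. The sketch below uses a small toy data frame (the name toy and its three rows are illustrative, copied from the tables above) so that it runs without the fueleconomy package; the same two calls should work on vehicles itself.

```r
# Toy stand-in for `vehicles`: three rows copied from the tables above.
toy <- data.frame(
  make  = c("AM General", "BMW", "BYD"),
  cyl   = c(4L, NA, NA),
  displ = c(2.5, NA, NA),
  fuel  = c("Regular", "Electricity", "Electricity"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Fuel types of the rows that have at least one missing value.
incomplete <- toy[!complete.cases(toy), ]
print(table(incomplete$fuel))
```

Checking the fuel type of the incomplete rows matters because dropping them removes a whole category of vehicle from the later summaries, not a random sample.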

Per EPA guidelines, combined fuel economy is a weighted average of the city and highway MPG values, calculated by weighting the city value by 55% and the highway value by 45%.
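
As a worked illustration of that weighting, here is a minimal sketch. The helper name combined_mpg is hypothetical (it is not part of the fueleconomy package), and the inputs are the city and highway values of two AM General rows shown earlier.

```r
# Hypothetical helper (not part of fueleconomy): combined MPG as the
# 55/45 weighted average of city and highway MPG.
combined_mpg <- function(city, highway) {
  0.55 * city + 0.45 * highway
}

# DJ Po Vehicle 2WD: city = 18, highway = 17
# 0.55 * 18 + 0.45 * 17 = 9.90 + 7.65 = 17.55
print(combined_mpg(18, 17))

# Post Office DJ5 2WD: city = 16, highway = 17
# 0.55 * 16 + 0.45 * 17 = 8.80 + 7.65 = 16.45
print(combined_mpg(16, 17))
```

Because the arithmetic is vectorised, the same expression can be applied to whole columns at once.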

# remove rows with missing values (NA)
vehicles <- na.omit(vehicles)

# combined MPG: 55% city, 45% highway
vehicles <- vehicles %>% mutate(mpg = 0.55 * city + 0.45 * highway)
kable(head(vehicles))
| id | make | model | year | class | trans | drive | cyl | displ | fuel | highway | city | mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27550 | AM General | DJ Po Vehicle 2WD | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 4 | 2.5 | Regular | 17 | 18 | 17.55 |
| 28426 | AM General | DJ Po Vehicle 2WD | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 4 | 2.5 | Regular | 17 | 18 | 17.55 |
| 27549 | AM General | FJ8c Post Office | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 | 13.00 |
| 28425 | AM General | FJ8c Post Office | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 | 13.00 |
| 1032 | AM General | Post Office DJ5 2WD | 1985 | Special Purpose Vehicle 2WD | Automatic 3-spd | Rear-Wheel Drive | 4 | 2.5 | Regular | 17 | 16 | 16.45 |
| 1033 | AM General | Post Office DJ8 2WD | 1985 | Special Purpose Vehicle 2WD | Automatic 3-spd | Rear-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 | 13.00 |

any(is.na(vehicles$cyl))
## [1] FALSE
vehicles_cyl <- vehicles %>% filter(!is.na(cyl))

vehicles_cyl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
## # A tibble: 9 x 5
##     cyl     n  mean median    sd
##   <int> <int> <dbl>  <dbl> <dbl>
## 1     2    45  18.4   18.6 0.529
## 2     3   182  37.1   35.9 6.04 
## 3     4 12381  24.1   23.5 4.29 
## 4     5   718  20.9   20.6 2.73 
## 5     6 11885  18.9   19.0 2.63 
## 6     8  7550  15.5   15.2 2.62 
## 7    10   138  14.5   14.6 1.80 
## 8    12   478  13.4   13.7 1.76 
## 9    16     7  11.0   11.2 0.241
vehicles_displ <- vehicles %>% filter(!is.na(displ))

vehicles_displ %>%
  group_by(displ) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
## # A tibble: 64 x 5
##    displ     n  mean median    sd
##    <dbl> <int> <dbl>  <dbl> <dbl>
##  1   0       3  17.8   17.2  1.30
##  2   1     157  37.9   36.6  6.33
##  3   1.1     8  21.7   21.4  4.56
##  4   1.2    35  30.6   30.2  3.28
##  5   1.3   177  29.1   28.2  7.90
##  6   1.4    80  31.5   31.4  3.67
##  7   1.5   610  29.7   28.7  4.68
##  8   1.6  1264  26.8   26.7  3.19
##  9   1.7    50  29.9   29.7  4.35
## 10   1.8  1353  25.0   24.2  3.82
## # … with 54 more rows
vehicles %>%
  group_by(year) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
## # A tibble: 32 x 5
##     year     n  mean median    sd
##    <int> <int> <dbl>  <dbl> <dbl>
##  1  1984   784  17.2   17.2  4.18
##  2  1985  1699  20.2   19.6  5.32
##  3  1986  1209  19.9   19.6  5.26
##  4  1987  1247  19.6   19.4  5.14
##  5  1988  1130  19.7   19.2  5.04
##  6  1989  1153  19.5   19.2  5.18
##  7  1990  1078  19.4   19.0  4.96
##  8  1991  1132  19.3   18.7  4.92
##  9  1992  1121  19.3   19.0  4.89
## 10  1993  1093  19.6   19.0  4.87
## # … with 22 more rows
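
The per-year output above is truncated to ten rows, so the most recent years are not visible. One way to compare just the first and last year of a summary is sketched below in base R. The data frame by_year is a stand-in holding only the ten means printed above, so its last year is 1993, not 2015.

```r
# First ten rows of the per-year summary, copied from the output above.
by_year <- data.frame(
  year = 1984:1993,
  mean = c(17.2, 20.2, 19.9, 19.6, 19.7, 19.5, 19.4, 19.3, 19.3, 19.6)
)

# Keep only the earliest and the latest year present in the summary.
endpoints <- by_year[by_year$year %in% range(by_year$year), ]
print(endpoints)

# Change in mean combined MPG between those two years.
change <- diff(endpoints$mean)
print(change)
```

On the full summary tibble, print(n = Inf) displays every row instead of the first ten.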

Data Analysis

Approach: To analyze the data, I used the graphs below to gain insights from the vehicles dataset.

# to find make and model having max mpg
vehicles[which.max(vehicles$mpg),]
## # A tibble: 1 x 13
##      id make  model  year class trans drive   cyl displ fuel  highway  city
##   <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl> <chr>   <int> <int>
## 1 15606 Honda Insi…  2000 Two … Manu… Fron…     3     1 Regu…      61    49
## # … with 1 more variable: mpg <dbl>
ggplot(data = vehicles, aes(mpg)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(vehicles_cyl, aes(cyl, mpg)) + geom_boxplot(aes(fill = factor(cyl)))

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(cyl)))

qqnorm(vehicles$mpg)
qqline(vehicles$mpg)

Summary/Conclusion

The above analysis of the vehicles data from the fueleconomy package shows that average combined MPG increased from ~17 MPG in 1984 to ~23 MPG in 2015. The per-cylinder summaries and the boxplot indicate that combined MPG differs between engines with different numbers of cylinders. From 3 cylinders upward, vehicles with more cylinders get lower combined MPG; the 45 two-cylinder vehicles, at about 18 MPG, are the exception. The data also shows that fuel economy differs with engine displacement: larger displacement generally gives lower overall MPG. The data provides a number of other variables that could be analyzed further. We could fit a multiple regression to predict the gas mileage of a vehicle from these characteristics.
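
A multiple regression of this kind could be sketched as follows. This is only an illustration of the technique, not a result from this study: it uses mtcars, a small dataset that ships with base R, as a stand-in, and its columns (mpg, cyl, disp) are not the fueleconomy variables. The 4-cylinder, 120 cubic inch car is a hypothetical input.

```r
# mtcars ships with base R and is used here only as a stand-in for the
# vehicles data; disp is displacement in cubic inches.
fit <- lm(mpg ~ cyl + disp, data = mtcars)

# Estimated change in mpg per extra cylinder and per extra cubic inch.
print(coef(fit))

# Predicted mpg for a hypothetical 4-cylinder, 120 cu. in. engine.
new_car <- data.frame(cyl = 4, disp = 120)
prediction <- predict(fit, newdata = new_car)
print(prediction)
```

For the data in this report, the analogous call would presumably be lm(mpg ~ cyl + displ + year, data = vehicles), with categorical columns such as drive or fuel added as further predictors.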