The data come from the R package fueleconomy, which sources its data from the EPA (Environmental Protection Agency). Within the package, the data are stored in the vehicles dataset.
The fuel economy data cover all cars sold in the US from 1984 to 2015. The vehicles dataset has 33,442 rows and 12 variables.
Further information about the data used in this study can be found at the links below:
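The data are loaded from the vehicles dataset of the fueleconomy R package. The kable function from the knitr package is used to display data tables.
# load packages and data
library(fueleconomy) # provides the vehicles dataset
library(dplyr)       # %>%, rename, mutate, filter, group_by, summarise
library(ggplot2)     # plots
library(knitr)       # kable
kable(head(vehicles))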
The dimensions show 33,442 rows and 12 columns in the dataset.
# dimensions
dim(vehicles)
## [1] 33442 12
Approach: My goal is to clean the data and create new columns, such as mpg, for further analysis.
# rename columns
vehicles <- vehicles %>%
  rename(highway = hwy, city = cty)
# show initial rows with head()
kable(head(vehicles), align = "l")
id | make | model | year | class | trans | drive | cyl | displ | fuel | highway | city |
---|---|---|---|---|---|---|---|---|---|---|---|
27550 | AM General | DJ Po Vehicle 2WD | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 4 | 2.5 | Regular | 17 | 18 |
28426 | AM General | DJ Po Vehicle 2WD | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 4 | 2.5 | Regular | 17 | 18 |
27549 | AM General | FJ8c Post Office | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 |
28425 | AM General | FJ8c Post Office | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 |
1032 | AM General | Post Office DJ5 2WD | 1985 | Special Purpose Vehicle 2WD | Automatic 3-spd | Rear-Wheel Drive | 4 | 2.5 | Regular | 17 | 16 |
1033 | AM General | Post Office DJ8 2WD | 1985 | Special Purpose Vehicle 2WD | Automatic 3-spd | Rear-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 |
summary(vehicles)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.772
## 3rd Qu.: 6.000
## Max. :16.000
## NA's :58
## displ fuel highway city
## Min. :0.000 Length:33442 Min. : 9.00 Min. : 6.00
## 1st Qu.:2.300 Class :character 1st Qu.: 19.00 1st Qu.: 15.00
## Median :3.000 Mode :character Median : 23.00 Median : 17.00
## Mean :3.353 Mean : 23.55 Mean : 17.49
## 3rd Qu.:4.300 3rd Qu.: 27.00 3rd Qu.: 20.00
## Max. :8.400 Max. :109.00 Max. :138.00
## NA's :57
any(is.na(vehicles))
## [1] TRUE
vehicles_subdf <- vehicles[rowSums(is.na(vehicles)) > 0,]
kable(head(vehicles_subdf))
id | make | model | year | class | trans | drive | cyl | displ | fuel | highway | city |
---|---|---|---|---|---|---|---|---|---|---|---|
31893 | Azure Dynamics | Transit Connect Electric Van | 2012 | Special Purpose Vehicle 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 62 | 62 |
31894 | Azure Dynamics | Transit Connect Electric Wagon | 2012 | Special Purpose Vehicle 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 62 | 62 |
32276 | BMW | Active E | 2011 | Subcompact Cars | Automatic (A1) | Rear-Wheel Drive | NA | NA | Electricity | 96 | 107 |
33383 | BYD | e6 | 2012 | Sport Utility Vehicle - 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 64 | 60 |
34860 | BYD | e6 | 2013 | Small Sport Utility Vehicle 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 65 | 61 |
34859 | BYD | e6 | 2014 | Small Sport Utility Vehicle 2WD | Automatic (A1) | Front-Wheel Drive | NA | NA | Electricity | 65 | 61 |
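The missing values can also be counted per column before deciding how to handle them. The following is a minimal sketch on a toy data frame (the name `toy` is hypothetical) built from four of the rows displayed above, not on the full vehicles data. It counts the NA values in each column and tabulates the fuel type of the incomplete rows. In the rows shown above, every incomplete row is an electric vehicle; whether that holds for all incomplete rows is not shown here.

```r
# Toy data frame built from rows displayed above (an illustration,
# not the full vehicles dataset)
toy <- data.frame(
  model = c("DJ Po Vehicle 2WD", "FJ8c Post Office", "Active E", "e6"),
  cyl   = c(4L, 6L, NA, NA),
  displ = c(2.5, 4.2, NA, NA),
  fuel  = c("Regular", "Regular", "Electricity", "Electricity"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Fuel types of the rows that na.omit() would drop
dropped_fuel <- table(toy$fuel[!complete.cases(toy)])
print(dropped_fuel)
```

On the real data, the same two calls applied to `vehicles` would show which columns drive the row loss and which kinds of vehicles are removed.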
Per EPA (Environmental Protection Agency) guidelines, combined fuel economy is a weighted average of the city and highway MPG values: the city value is weighted by 55% and the highway value by 45%.
# remove rows with missing (NA) values
vehicles <- na.omit(vehicles)
# combined MPG: 55% city, 45% highway
vehicles <- vehicles %>% mutate(mpg = 0.55 * city + 0.45 * highway)
kable(head(vehicles))
id | make | model | year | class | trans | drive | cyl | displ | fuel | highway | city | mpg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
27550 | AM General | DJ Po Vehicle 2WD | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 4 | 2.5 | Regular | 17 | 18 | 17.55 |
28426 | AM General | DJ Po Vehicle 2WD | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 4 | 2.5 | Regular | 17 | 18 | 17.55 |
27549 | AM General | FJ8c Post Office | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 | 13.00 |
28425 | AM General | FJ8c Post Office | 1984 | Special Purpose Vehicle 2WD | Automatic 3-spd | 2-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 | 13.00 |
1032 | AM General | Post Office DJ5 2WD | 1985 | Special Purpose Vehicle 2WD | Automatic 3-spd | Rear-Wheel Drive | 4 | 2.5 | Regular | 17 | 16 | 16.45 |
1033 | AM General | Post Office DJ8 2WD | 1985 | Special Purpose Vehicle 2WD | Automatic 3-spd | Rear-Wheel Drive | 6 | 4.2 | Regular | 13 | 13 | 13.00 |
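As a quick check of the 55/45 weighting, the sketch below wraps the formula in a small helper (the name `combined_mpg` is hypothetical, not part of any package) and applies it to the city and highway values of three rows from the table above. The results should match the mpg column for those rows.

```r
# Combined MPG as a weighted average: 55% city, 45% highway
combined_mpg <- function(city, highway) {
  0.55 * city + 0.45 * highway
}

# City and highway values copied from rows 1, 3 and 5 of the table above
city    <- c(18, 13, 16)
highway <- c(17, 13, 17)

mpg <- combined_mpg(city, highway)
print(mpg)
```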
any(is.na(vehicles$cyl))
## [1] FALSE
vehicles_cyl <- vehicles %>% filter(!is.na(cyl))
vehicles_cyl %>%
group_by(cyl) %>%
summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
## # A tibble: 9 x 5
## cyl n mean median sd
## <int> <int> <dbl> <dbl> <dbl>
## 1 2 45 18.4 18.6 0.529
## 2 3 182 37.1 35.9 6.04
## 3 4 12381 24.1 23.5 4.29
## 4 5 718 20.9 20.6 2.73
## 5 6 11885 18.9 19.0 2.63
## 6 8 7550 15.5 15.2 2.62
## 7 10 138 14.5 14.6 1.80
## 8 12 478 13.4 13.7 1.76
## 9 16 7 11.0 11.2 0.241
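The grouped summary is descriptive. A formal check of whether mean mpg differs across cylinder counts could be sketched with a one-way test. Because the group standard deviations above differ considerably, this sketch uses Welch's version (`oneway.test`, which does not assume equal variances). To keep the snippet runnable without the fueleconomy package, it uses R's built-in `mtcars` data as a stand-in; that is an assumption of this example, and for this report the call would use `vehicles_cyl` instead.

```r
# Stand-in data: built-in mtcars (an assumption for illustration;
# the report itself would use vehicles_cyl here)
cars <- transform(mtcars, cyl = factor(cyl))

# Welch one-way test: does mean mpg differ between cylinder groups?
welch <- oneway.test(mpg ~ cyl, data = cars)
print(welch)

# Extract the p-value for reporting
p_value <- welch$p.value
```

A small p-value would indicate that at least one cylinder group has a different mean mpg. It would not say which groups differ.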
vehicles_displ <- vehicles %>% filter(!is.na(displ))
vehicles_displ %>%
group_by(displ) %>%
summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
## # A tibble: 64 x 5
## displ n mean median sd
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 0 3 17.8 17.2 1.30
## 2 1 157 37.9 36.6 6.33
## 3 1.1 8 21.7 21.4 4.56
## 4 1.2 35 30.6 30.2 3.28
## 5 1.3 177 29.1 28.2 7.90
## 6 1.4 80 31.5 31.4 3.67
## 7 1.5 610 29.7 28.7 4.68
## 8 1.6 1264 26.8 26.7 3.19
## 9 1.7 50 29.9 29.7 4.35
## 10 1.8 1353 25.0 24.2 3.82
## # … with 54 more rows
vehicles %>%
group_by(year) %>%
summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
## # A tibble: 32 x 5
## year n mean median sd
## <int> <int> <dbl> <dbl> <dbl>
## 1 1984 784 17.2 17.2 4.18
## 2 1985 1699 20.2 19.6 5.32
## 3 1986 1209 19.9 19.6 5.26
## 4 1987 1247 19.6 19.4 5.14
## 5 1988 1130 19.7 19.2 5.04
## 6 1989 1153 19.5 19.2 5.18
## 7 1990 1078 19.4 19.0 4.96
## 8 1991 1132 19.3 18.7 4.92
## 9 1992 1121 19.3 19.0 4.89
## 10 1993 1093 19.6 19.0 4.87
## # … with 22 more rows
Approach: To analyze the data, I used the graphs below to gain insights from the vehicles dataset.
# find the make and model with the highest mpg
vehicles[which.max(vehicles$mpg),]
## # A tibble: 1 x 13
## id make model year class trans drive cyl displ fuel highway city
## <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl> <chr> <int> <int>
## 1 15606 Honda Insi… 2000 Two … Manu… Fron… 3 1 Regu… 61 49
## # … with 1 more variable: mpg <dbl>
ggplot(data = vehicles, aes(mpg)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(vehicles_cyl, aes(cyl, mpg)) + geom_boxplot(aes(fill = factor(cyl)))
ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(cyl)))
qqnorm(vehicles$mpg)
qqline(vehicles$mpg)
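After the above analysis of the vehicles data from the fueleconomy package, the data show that the average combined MPG increased from ~17 MPG in 1984 to ~23 MPG in 2015. The grouped summaries and the box plot indicate that combined MPG differs between engines with different numbers of cylinders, although no formal significance test was run on these data. From three cylinders upward, vehicles with more cylinders generally have lower combined MPG (a mean of 37.1 for 3 cylinders versus 11.0 for 16 cylinders); the 2-cylinder group, with a mean of 18.4, is an exception. Fuel economy also differs with engine displacement: larger displacement generally goes with lower combined MPG. Because rows with missing values were dropped, vehicles without cylinder or displacement values (all electric in the rows displayed) are excluded from these results. The data provide several other variables that could be analyzed further. We could fit a multiple regression to predict the gas mileage of a vehicle from these characteristics.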
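The multiple regression suggested above could be sketched as follows. To keep the snippet self-contained, it fits the model to R's built-in `mtcars` data as a stand-in, which is an assumption of this example. Note that `mtcars` names its displacement column `disp` and measures it in cubic inches. For this report, the analogous call would be something like `lm(mpg ~ cyl + displ + year, data = vehicles)`, which is not run here.

```r
# Stand-in data: built-in mtcars (an assumption for illustration;
# the report itself would fit the model to vehicles)
fit <- lm(mpg ~ cyl + disp, data = mtcars)

# Coefficient table: estimates, standard errors, t values, p-values
print(summary(fit)$coefficients)

# Predict mpg for a hypothetical car with 4 cylinders and 120 cu. in. displacement
new_car   <- data.frame(cyl = 4, disp = 120)
predicted <- predict(fit, newdata = new_car)
print(predicted)
```

Cylinder count and displacement are usually strongly correlated, so their individual coefficients should be interpreted with care.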