library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
For this project I examined the mpg dataset. This dataset is based on the fuel economy data from 1999 to 2008 for 38 popular models of cars. It is a subset of the data made available by the United States Environmental Protection Agency (EPA) on https://fueleconomy.gov/.
I created several scatterplots and a histogram to explore relationships among the variables. Based on my research and data exploration, I hypothesized that the unusual points in the plot of highway mpg against engine displacement were hybrids. Further analysis suggested that this hypothesis was only partially correct: the unusual observations in the subcompact class could be hybrids, but those in the 2seater class were more likely to be sports cars. I then created another plot using drv, which records the type of drive train, and this led to an interesting finding. Sports cars are typically rear-wheel drive and subcompacts are typically front-wheel drive, and the plot matched this: the unusual points in the 2seater class were all rear-wheel drive, while those in the subcompact class were all front-wheel drive.
I also created a histogram of city miles per gallon, filled by type of drive, to determine which drive type is the most efficient. The plot suggests that front-wheel drive cars are more efficient than rear-wheel and 4-wheel drive cars in terms of city miles per gallon. I also learned that the data has potential problems, so my results from the cty vs drv histogram may not be entirely accurate: the dataset contains roughly one fourth as many rear-wheel drive cars as front-wheel or 4-wheel drive cars, which may skew the results.
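The group-size caveat can be checked directly by tabulating the number of cars and the typical cty for each drive type. The following is a minimal sketch, assuming only dplyr and the mpg data shipped with ggplot2; the names drv_summary, n, mean_cty, and median_cty are illustrative and not part of the original analysis.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Number of cars and typical city mpg for each drive type
drv_summary <- mpg %>%
  group_by(drv) %>%
  summarise(
    n = n(),
    mean_cty = mean(cty),
    median_cty = median(cty),
    .groups = "drop"
  )
drv_summary
```

Reading the n column next to the averages makes it easier to judge how much weight the rear-wheel drive group can carry in any comparison.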
The mpg data source is the fuel economy data provided by the EPA, under Datasets and Guides for Individual Model Years available on http://fueleconomy.gov. The most recent data was posted in 2021 while the oldest dates back to 1974. Thus, the mpg dataset was most likely derived from concatenating data between years 1999 and 2008.
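Which model years actually appear in mpg can be checked from the year column. This is a small sketch that assumes only the mpg data from ggplot2; model_years is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# Distinct model years present in the dataset
model_years <- sort(unique(mpg$year))
model_years

# Number of rows recorded for each model year
table(mpg$year)
```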
According to their website, “fuel economy data are the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA” (EPA 2021). A potential problem with the dataset is that it only contains models which had a new release every year between 1999 and 2008, which may make it unrepresentative in some cases. For instance, the number of rear-wheel drive cars is about one fourth the number of front-wheel or 4-wheel drive cars, which may distort summary statistics and visualizations. The data was, however, revised recently. According to an update posted on the EPA website on Tuesday, February 2, 2021, “in order to make estimates comparable across model years, the MPG estimates for all 1984-2007 model year vehicles and some 2011-2016 model year vehicles have been revised” (EPA 2021). This update does not, however, guarantee that the mpg dataset was revised as well.
# Scatterplot of highway mpg against engine displacement
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
Scatterplot of highway miles per gallon (hwy) as a function of engine displacement (displ). The two variables appear to have a strong negative correlation, and there are two groups of potential outliers.
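The strength of the association seen in the scatterplot can be quantified with a correlation coefficient. The sketch below uses the Pearson correlation from base R, which is an assumption on my part since the original analysis judged the relationship visually; r_displ_hwy is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between engine displacement and highway mpg
r_displ_hwy <- cor(mpg$displ, mpg$hwy)
r_displ_hwy

# Test of the null hypothesis that the true correlation is zero
cor.test(mpg$displ, mpg$hwy)
```

Pearson correlation measures linear association only, so the potential outliers at either end of the displ range can pull the coefficient toward zero.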
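# TRUE flags points that fall outside the main hwy vs displ cluster
Potential_Outliers <- !((mpg$displ >= 2 & mpg$displ <= 5) |
  (mpg$hwy < 40 & mpg$displ < 2) | (mpg$hwy <= 20 & mpg$displ > 5))
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = Potential_Outliers)) +
  scale_color_manual(values = c("FALSE" = "grey40", "TRUE" = "red"))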
Because the scatterplot appears to contain outliers, it raises the question of what other factors might influence the association between displ and hwy.
# Same scatterplot, with points colored by vehicle class
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
Mapping class to the color aesthetic reveals the class of each car. The plot shows that the outliers are 2seater or subcompact cars.
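The unusual points can also be listed rather than read off the colors. The sketch below assumes the same cutoffs used earlier to flag potential outliers (displ below 2 with hwy of at least 40, or displ above 5 with hwy above 20); unusual is an illustrative name, and fl is included only because fuel type may help when judging the hybrid hypothesis.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Assumed cutoffs: points above the main trend at either end of the displ range
unusual <- mpg %>%
  filter((displ < 2 & hwy >= 40) | (displ > 5 & hwy > 20))

# How many unusual points fall in each class
unusual %>% count(class)

# The individual cars, with fuel type for reference
unusual %>% select(manufacturer, model, displ, hwy, class, fl)
```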
# Same scatterplot, with points colored by drive train type
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))
Scatterplot of hwy vs displ colored by the drive train variable (drv), where f = front-wheel drive, r = rear-wheel drive, and 4 = 4-wheel drive.
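A cross-tabulation gives a quick check of how drive type is distributed within the two classes of interest. This is a minimal sketch assuming dplyr and the mpg data from ggplot2; class_drv is an illustrative name. It counts every car in each class, not only the unusual points.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Count of cars by class and drive type for the two classes of interest
class_drv <- mpg %>%
  filter(class %in% c("2seater", "subcompact")) %>%
  count(class, drv)
class_drv
```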
# Histogram of city mpg, with bars filled by drive train type
p2 <- mpg %>%
  ggplot(aes(x = cty, fill = drv)) +
  geom_histogram(bins = 20)
p2
Histogram of city miles per gallon (cty), filled by drv, to explore how drive train type relates to city fuel economy. The plot suggests that front-wheel drive cars are more efficient than rear-wheel and 4-wheel drive cars.
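Because the drive types have very different group sizes, a stacked histogram can be hard to compare across groups. One alternative is a boxplot of cty for each drive type, sketched below. This plot is not part of the original analysis, and p_box is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# One box per drive type, so medians and spreads can be compared directly
p_box <- ggplot(mpg, aes(x = drv, y = cty)) +
  geom_boxplot() +
  labs(x = "drive train (drv)", y = "city miles per gallon (cty)")
p_box
```

A boxplot compares the distributions regardless of how many cars are in each group, although it hides the group sizes themselves.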
I started by creating a scatterplot of hwy vs displ, which shows a strong negative correlation. This agrees with what I found in my research: cars with bigger engines tend to have lower highway mpg, and vice versa. After identifying several observations that looked unusual, I highlighted these points in red on a separate plot. Since these points deviated from the linear trend, I hypothesized that they represented hybrid cars. I then mapped class to the color aesthetic in another plot to see which group each outlier belongs to. This plot revealed that many of the unusual points were either 2seater or subcompact cars. My hypothesis was therefore only partially correct: the subcompact cars could be hybrids, which would give them better highway mpg, but the 2seaters were most likely sports cars rather than hybrids. Sports cars combine large engines with small bodies, which is likely why they showed better gas mileage than other cars with similar engine displacement.
Because sports cars are typically rear-wheel drive and subcompacts are typically front-wheel drive, I created a scatterplot of hwy vs displ colored by the drv variable, which records the type of drive train. This plot supported my expectation that the unusual 2seater and subcompact cars are rear-wheel and front-wheel drive, respectively. I also created a histogram of cty with the fill aesthetic set to drv to explore the relationship between city miles per gallon (cty) and the type of drive (drv). My reason for creating this plot was to examine other variables in the data. I was particularly interested in cty vs drv to understand which type of drive train is most efficient. As expected, the plot suggests that front-wheel drive cars outperform rear-wheel and 4-wheel drive cars in terms of city miles per gallon.
The dataset contains 11 variables. The following table summarizes each variable, its data type, and what it represents.
# Build a summary table of the variables in mpg
variables <- c("manufacturer", "model", "displ", "year", "cyl", "trans", "drv", "cty", "hwy", "fl", "class")
datatype <- c("chr", "chr", "num", "int", "int", "chr", "chr", "int", "int", "chr", "chr")
description <- c("manufacturer name", "model name", "engine displacement in liters", "year of manufacture", "number of cylinders", "type of transmission", "type of drive train", "city miles per gallon", "highway miles per gallon", "fuel type", "type of car")
# Named variable_summary to avoid masking base::table()
variable_summary <- data.frame(variables, datatype, description)
variable_summary
## variables datatype description
## 1 manufacturer chr manufacturer name
## 2 model chr model name
## 3 displ num engine displacement in liters
## 4 year int year of manufacture
## 5 cyl int number of cylinders
## 6 trans chr type of transmission
## 7 drv chr type of drive train
## 8 cty int city miles per gallon
## 9 hwy int highway miles per gallon
## 10 fl chr fuel type
## 11 class chr type of car
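The variable names and types in the table can also be read from the data itself, which avoids typing them by hand. The following is a sketch assuming only base R and the mpg data from ggplot2; variable_types is an illustrative name. Base R reports the full class names (character, numeric, integer) rather than the abbreviations chr, num, and int.

```r
library(ggplot2)  # provides the mpg dataset

# Derive each column's name and first class directly from the data
variable_types <- data.frame(
  variables = names(mpg),
  datatype = vapply(mpg, function(col) class(col)[1], character(1)),
  row.names = NULL
)
variable_types
```

The descriptions still have to be written manually, since they are not stored in the data frame.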
For this project I worked with the mpg dataset to explore different variables and their relationships with one another. Through my research and the plots I created, I saw how some of these variables are correlated and influence each other. For instance, there appeared to be a strong negative correlation between highway miles per gallon and engine displacement: as engine displacement increases, highway mpg decreases. The outliers in this data were mostly small cars and 2seaters. Further analysis showed that the smaller cars are front-wheel drive and the 2seaters are rear-wheel drive. Front-wheel drive cars also appeared to have better fuel efficiency than their larger, rear-wheel drive counterparts. While these findings are plausible, I would like to run statistical tests, preferably with more recent data, to verify whether these results hold.
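As a starting point for that follow-up, the two main findings could be tested with base R. The sketch below is one possible approach, not the analysis performed in this report: it assumes a simple linear model for hwy against displ and a Kruskal-Wallis test for cty across drive types, and the names fit and kw are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Linear model: does highway mpg fall as engine displacement rises?
fit <- lm(hwy ~ displ, data = mpg)
summary(fit)

# Kruskal-Wallis rank-sum test: does city mpg differ across drive types?
# A rank-based test is used here because the group sizes are unequal.
kw <- kruskal.test(cty ~ factor(drv), data = mpg)
kw
```

Neither test accounts for other variables such as class or cyl, so the results would describe associations only.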