Data Preparation
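# load packages
# plyr is attached before tidyverse so that it does not mask dplyr verbs
# (tidyverse already attaches dplyr, ggplot2 and tibble)
library(plyr)
library(tidyverse)
library(reshape2)
library(fueleconomy)
library(knitr)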
vehicles <- na.omit(vehicles)
# combine city and highway mpg using the EPA's 55% city / 45% highway weighting
vehicles <- vehicles %>%
  mutate(mpg = 0.55 * cty + 0.45 * hwy)
as_tibble(vehicles)
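The combined figure computed above is a 55/45 weighted arithmetic mean of city and highway MPG. As far as I know, the EPA's published combined rating applies the same weights as a harmonic mean (that is, to gallons per mile), so the two figures can differ slightly. The sketch below compares them on made-up values; the `toy` data frame and the columns `mpg_arith`, `mpg_harm` and `diff` are hypothetical names, not part of the vehicles dataset.

```r
# Toy values for illustration only (not taken from the vehicles data)
toy <- data.frame(cty = c(20, 15, 30), hwy = c(30, 25, 40))

# 55/45 weighted arithmetic mean, the form used in this report
toy$mpg_arith <- 0.55 * toy$cty + 0.45 * toy$hwy

# 55/45 weighted harmonic mean (weights applied to gallons per mile)
toy$mpg_harm <- 1 / (0.55 / toy$cty + 0.45 / toy$hwy)

# Gap between the two; the harmonic mean is never the larger one
toy$diff <- toy$mpg_arith - toy$mpg_harm
print(toy)
```

The gap widens as city and highway ratings move further apart, so the choice matters most for vehicles with very different city and highway figures.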
Research question
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
How do the number of cylinders in a vehicle and its displacement affect its combined MPG? How has combined MPG changed over the model years in the dataset (1984 to 2015)?
Cases
What are the cases, and how many are there?
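There are 33,442 cases of vehicle data within the fueleconomy package. After rows with missing values are dropped with na.omit(), 33,384 cases remain, which is the total of the group sizes in the summary tables below.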
Data collection
Describe the method of data collection.
The data come from the vehicles dataset in the R package fueleconomy. The package’s data was collected from the Environmental Protection Agency’s website.
Type of study
What type of study is this (observational/experiment)?
The study is an observational study.
Data Source
If you collected the data, state self-collected. If not, provide a citation/link.
Response
What is the response variable, and what type is it (numerical/categorical)?
The response variable is the combined MPG (mpg), which is numerical.
Explanatory
What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variables are the number of cylinders (cyl) and the engine displacement (displ) of a vehicle, both numerical. For the second research question, the explanatory variable is the model year (year).
Relevant summary statistics
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(vehicles$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    7.80   16.70   19.70   20.11   22.60   54.40
sd(vehicles$mpg)
## [1] 4.999651
qqnorm(vehicles$mpg)
qqline(vehicles$mpg)
ggplot(vehicles, aes(mpg)) +
  geom_histogram(bins = 50, aes(fill = factor(class))) +
  theme(legend.position = "none")
ggplot(vehicles, aes(mpg)) +
  geom_histogram(bins = 50, aes(fill = factor(cyl)))
ggplot(vehicles, aes(factor(cyl), mpg)) +
  geom_boxplot(aes(fill = factor(cyl))) +
  theme(legend.position = "none")
# refer to columns by bare name inside aes() rather than with vehicles$
ggplot(vehicles, aes(displ, mpg)) +
  geom_point(aes(color = factor(cyl), alpha = factor(cyl))) +
  theme_minimal() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
mpg_year <- vehicles %>%
  group_by(year) %>%
  dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))
kable(mpg_year)
year | n | avgmpg | median | sd |
---|---|---|---|---|
1984 | 784 | 17.15944 | 17.250 | 4.182516 |
1985 | 1699 | 20.19741 | 19.600 | 5.322081 |
1986 | 1209 | 19.93201 | 19.600 | 5.255899 |
1987 | 1247 | 19.62097 | 19.350 | 5.135072 |
1988 | 1130 | 19.74969 | 19.250 | 5.041844 |
1989 | 1153 | 19.53877 | 19.150 | 5.175750 |
1990 | 1078 | 19.42032 | 19.050 | 4.955587 |
1991 | 1132 | 19.28101 | 18.700 | 4.916046 |
1992 | 1121 | 19.34095 | 19.050 | 4.894614 |
1993 | 1093 | 19.60018 | 19.050 | 4.869317 |
1994 | 982 | 19.53147 | 19.050 | 4.619627 |
1995 | 967 | 19.31541 | 18.700 | 4.639678 |
1996 | 773 | 20.11552 | 19.700 | 4.648382 |
1997 | 762 | 19.97749 | 19.600 | 4.547487 |
1998 | 809 | 19.92145 | 19.700 | 4.494452 |
1999 | 845 | 19.85521 | 20.050 | 4.531822 |
2000 | 836 | 19.77727 | 19.600 | 4.461690 |
2001 | 906 | 19.76297 | 19.600 | 4.607778 |
2002 | 972 | 19.57629 | 19.600 | 4.558214 |
2003 | 1043 | 19.44871 | 19.600 | 4.620918 |
2004 | 1122 | 19.58623 | 19.150 | 4.470160 |
2005 | 1166 | 19.75232 | 19.600 | 4.487480 |
2006 | 1104 | 19.51676 | 19.250 | 4.205074 |
2007 | 1126 | 19.52069 | 19.375 | 3.986880 |
2008 | 1186 | 19.78419 | 19.600 | 4.255072 |
2009 | 1184 | 20.34742 | 20.150 | 4.410107 |
2010 | 1109 | 21.19932 | 21.050 | 4.768022 |
2011 | 1126 | 21.46381 | 21.150 | 5.077788 |
2012 | 1144 | 22.09025 | 21.600 | 5.524931 |
2013 | 1170 | 23.02944 | 22.150 | 5.864990 |
2014 | 1202 | 23.47138 | 22.600 | 6.106523 |
2015 | 204 | 23.87426 | 23.500 | 4.898900 |
ggplot(mpg_year, aes(year, avgmpg)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
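To put a number on the recent rise visible in the yearly table, the change in mean combined MPG can be computed directly from the summarised values. The sketch below copies four rows from the table above into a small data frame; `mpg_sub`, `change` and `total_change` are hypothetical names introduced for illustration.

```r
# Four rows copied from the yearly summary table (year, avgmpg)
mpg_sub <- data.frame(
  year   = c(2008, 2010, 2012, 2014),
  avgmpg = c(19.78419, 21.19932, 22.09025, 23.47138)
)

# Change in mean combined MPG between consecutive listed years
mpg_sub$change <- c(NA, diff(mpg_sub$avgmpg))

# Total change from the first to the last listed year
total_change <- mpg_sub$avgmpg[nrow(mpg_sub)] - mpg_sub$avgmpg[1]

print(mpg_sub)
print(total_change)
```

With the full data, the same calculation could be applied to `mpg_year` itself. Because these are differences of group means, they describe the mix of vehicles listed each year, not the change for any single model.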
mpg_cyl <- vehicles %>%
  group_by(cyl) %>%
  dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))
kable(mpg_cyl)
cyl | n | avgmpg | median | sd |
---|---|---|---|---|
2 | 45 | 18.37000 | 18.600 | 0.5289784 |
3 | 182 | 37.11456 | 35.925 | 6.0392450 |
4 | 12381 | 24.12813 | 23.500 | 4.2924181 |
5 | 718 | 20.85578 | 20.600 | 2.7252662 |
6 | 11885 | 18.90963 | 19.050 | 2.6275592 |
8 | 7550 | 15.48340 | 15.250 | 2.6219897 |
10 | 138 | 14.46920 | 14.600 | 1.8040640 |
12 | 478 | 13.36056 | 13.700 | 1.7624723 |
16 | 7 | 10.95714 | 11.150 | 0.2405351 |
ggplot(mpg_cyl, aes(cyl, avgmpg)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'