DATA 606 Data Project Proposal

Data Preparation

# load packages and data
# (reshape2 and plyr are not used below; leaving plyr out also avoids masking dplyr verbs)
library(tidyverse)
library(fueleconomy)
library(knitr)

# drop rows with missing values
vehicles <- na.omit(vehicles)

# combine hwy and cty mpg using the EPA's 55% city / 45% highway weighting

vehicles <- vehicles %>%
  mutate(mpg = 0.55 * cty + 0.45 * hwy)

as_tibble(vehicles)
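
The 55/45 weighting can be applied as an arithmetic mean, as in the chunk above, or as a harmonic mean of the two ratings. The sketch below compares both for a single hypothetical vehicle; the values of cty and hwy are made up for illustration, and which form matches the EPA's published combined figure is an assumption this proposal does not settle.

```r
# Hypothetical example ratings, not taken from the dataset
cty <- 20
hwy <- 30

# Arithmetic weighting, as used for the mpg column in this proposal
mpg_arithmetic <- 0.55 * cty + 0.45 * hwy

# Harmonic weighting (assumed alternative): weight gallons per mile, then invert
mpg_harmonic <- 1 / (0.55 / cty + 0.45 / hwy)

c(arithmetic = mpg_arithmetic, harmonic = mpg_harmonic)
```

The harmonic form is never larger than the arithmetic one, so the choice of weighting shifts every combined value in the same direction.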

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

How do the number of cylinders in a vehicle and its engine displacement affect its combined MPG? How has combined MPG changed over the model years covered by the dataset?

Cases

What are the cases, and how many are there?

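Each case is one vehicle record in the vehicles dataset of the fueleconomy package. The dataset contains 33,442 cases, of which 33,384 remain after na.omit() drops the rows with missing values.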

Data collection

Describe the method of data collection.

The data come from the vehicles dataset in the R package fueleconomy. The package's data were collected from the Environmental Protection Agency's website.

Type of study

What type of study is this (observational/experiment)?

This is an observational study: the data describe existing vehicles, and no variables were assigned or manipulated.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

https://blog.rstudio.com/2014/07/23/new-data-packages/

https://www.fueleconomy.gov/feg/download.shtml

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is the combined MPG, which is numerical.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variables are the number of cylinders and the engine displacement of a vehicle, both numerical. For the second research question, the explanatory variable is the model year.
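
One way to relate the response to these explanatory variables is a linear model. The sketch below is an illustration only: the choice of lm() and the names veh and fit are assumptions, not a method stated in this proposal, and it repeats the data preparation so that it runs on its own.

```r
library(fueleconomy)

# Same preparation as in the Data Preparation section
veh <- na.omit(vehicles)
veh$mpg <- 0.55 * veh$cty + 0.45 * veh$hwy

# Illustrative model (an assumption): combined MPG as a linear
# function of cylinder count and displacement
fit <- lm(mpg ~ cyl + displ, data = veh)

# Coefficient estimates, standard errors, and R-squared
summary(fit)
```

Cylinder count and displacement are likely to be strongly related to each other, so the individual coefficients from such a model should be read with care.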

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(vehicles$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.80   16.70   19.70   20.11   22.60   54.40

sd(vehicles$mpg)

## [1] 4.999651

qqnorm(vehicles$mpg)
qqline(vehicles$mpg)

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(class))) +
  theme(legend.position="none")

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(cyl)))

ggplot(vehicles, aes(factor(cyl), mpg)) +
  geom_boxplot(aes(fill=factor(cyl))) +
  theme(legend.position="none")

ggplot(vehicles, aes(displ, mpg)) +
  geom_point(aes(color=factor(cyl), alpha = factor(cyl))) +
  theme_minimal() +
  geom_smooth()

## `geom_smooth()` using method = 'gam'

mpg_year <- vehicles %>%
  group_by(year) %>%
  dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))

kable(mpg_year)

year	n	avgmpg	median	sd
1984	784	17.15944	17.250	4.182516
1985	1699	20.19741	19.600	5.322081
1986	1209	19.93201	19.600	5.255899
1987	1247	19.62097	19.350	5.135072
1988	1130	19.74969	19.250	5.041844
1989	1153	19.53877	19.150	5.175750
1990	1078	19.42032	19.050	4.955587
1991	1132	19.28101	18.700	4.916046
1992	1121	19.34095	19.050	4.894614
1993	1093	19.60018	19.050	4.869317
1994	982	19.53147	19.050	4.619627
1995	967	19.31541	18.700	4.639678
1996	773	20.11552	19.700	4.648382
1997	762	19.97749	19.600	4.547487
1998	809	19.92145	19.700	4.494452
1999	845	19.85521	20.050	4.531822
2000	836	19.77727	19.600	4.461690
2001	906	19.76297	19.600	4.607778
2002	972	19.57629	19.600	4.558214
2003	1043	19.44871	19.600	4.620918
2004	1122	19.58623	19.150	4.470160
2005	1166	19.75232	19.600	4.487480
2006	1104	19.51676	19.250	4.205074
2007	1126	19.52069	19.375	3.986880
2008	1186	19.78419	19.600	4.255072
2009	1184	20.34742	20.150	4.410107
2010	1109	21.19932	21.050	4.768022
2011	1126	21.46381	21.150	5.077788
2012	1144	22.09025	21.600	5.524931
2013	1170	23.02944	22.150	5.864990
2014	1202	23.47138	22.600	6.106523
2015	204	23.87426	23.500	4.898900

ggplot(mpg_year, aes(year, avgmpg)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess'

mpg_cyl <- vehicles %>%
  group_by(cyl) %>%
  dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))

kable(mpg_cyl)

cyl	n	avgmpg	median	sd
2	45	18.37000	18.600	0.5289784
3	182	37.11456	35.925	6.0392450
4	12381	24.12813	23.500	4.2924181
5	718	20.85578	20.600	2.7252662
6	11885	18.90963	19.050	2.6275592
8	7550	15.48340	15.250	2.6219897
10	138	14.46920	14.600	1.8040640
12	478	13.36056	13.700	1.7624723
16	7	10.95714	11.150	0.2405351
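
The group sizes in the table above range from 7 to 12,381, so the group means are not equally precise. As a rough sketch, the standard error of each mean can be computed as sd / sqrt(n). The data frame mpg_cyl_tbl is a hypothetical name and simply re-enters the values printed in the table.

```r
# Values copied from the table above (mpg_cyl_tbl is a hypothetical name)
mpg_cyl_tbl <- data.frame(
  cyl    = c(2, 3, 4, 5, 6, 8, 10, 12, 16),
  n      = c(45, 182, 12381, 718, 11885, 7550, 138, 478, 7),
  avgmpg = c(18.37000, 37.11456, 24.12813, 20.85578, 18.90963,
             15.48340, 14.46920, 13.36056, 10.95714),
  sd     = c(0.5289784, 6.0392450, 4.2924181, 2.7252662, 2.6275592,
             2.6219897, 1.8040640, 1.7624723, 0.2405351)
)

# Standard error of each group mean: sd divided by the square root of n
mpg_cyl_tbl$se <- mpg_cyl_tbl$sd / sqrt(mpg_cyl_tbl$n)

mpg_cyl_tbl
```

This treats the vehicles within each group as independent observations, which is an assumption rather than something the dataset guarantees.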

ggplot(mpg_cyl, aes(cyl, avgmpg)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess'