DATA 606 Data Project Proposal

Nicholas Schettini

April 2, 2018

Data Preparation

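# load packages and data
# plyr is attached before the tidyverse so that it does not mask dplyr verbs
# (tidyverse attaches dplyr, ggplot2, and tibble)
library(plyr)
library(tidyverse)
library(reshape2)
library(fueleconomy)
library(knitr)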

vehicles <- na.omit(vehicles)

# combine city and highway mpg using the EPA's 55% city / 45% highway weighting

vehicles <- vehicles %>%
  mutate(mpg = 0.55 * cty + 0.45 * hwy)

as_tibble(vehicles)
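
The weighting used for the combined mpg column can be checked by hand on a couple of made-up vehicles. The sketch below is illustrative only: combined_mpg() and the toy data frame are hypothetical names, not part of the fueleconomy package, and the code uses base R so that it runs without the dataset.

```r
# Hypothetical helper: weighted average of city and highway mpg.
# The 0.55 / 0.45 weights are the ones used in the data preparation step.
combined_mpg <- function(cty, hwy, w_cty = 0.55) {
  stopifnot(length(cty) == length(hwy), w_cty >= 0, w_cty <= 1)
  w_cty * cty + (1 - w_cty) * hwy
}

# Made-up values for illustration only (not taken from the dataset)
toy <- data.frame(cty = c(20, 30), hwy = c(30, 40))
toy$mpg <- combined_mpg(toy$cty, toy$hwy)
print(toy)
```

Because the two weights sum to 1, the combined value always falls between the city and highway figures.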

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

How do the number of cylinders in a vehicle and its engine displacement affect its combined MPG? How has combined MPG changed over the model years covered by the dataset?

Cases

What are the cases, and how many are there?

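Each case is one vehicle entry in the vehicles dataset of the fueleconomy package, which contains 33,442 cases. After rows with missing values are removed with na.omit(), 33,384 cases remain; this is the total of the n column in each of the summary tables below.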
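
Because the data preparation step calls na.omit(), the number of cases used in the analysis can differ from the number of rows shipped with the package. A minimal sketch of how the two counts could be compared, shown on a made-up data frame (toy is a hypothetical name) so that it runs without the package:

```r
# Made-up data with missing values (illustration only)
toy <- data.frame(
  cyl   = c(4, 6, NA, 8),
  displ = c(2.0, NA, 3.5, 5.0),
  mpg   = c(28, 21, 19, 15)
)

n_before  <- nrow(toy)            # all rows
n_after   <- nrow(na.omit(toy))   # rows with no missing values
n_dropped <- n_before - n_after

# Missing values per column show which variables drive the drop
na_per_column <- colSums(is.na(toy))

cat("before:", n_before, "after:", n_after, "dropped:", n_dropped, "\n")
print(na_per_column)
```

Applied to the real data, the same calls would be run on vehicles before the na.omit() step.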

Data collection

Describe the method of data collection.

The data come from the R package fueleconomy, whose contents were collected from the Environmental Protection Agency’s fuel economy website. This project uses the package’s vehicles dataset.

Type of study

What type of study is this (observational/experiment)?

The study is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

https://blog.rstudio.com/2014/07/23/new-data-packages/

https://www.fueleconomy.gov/feg/download.shtml

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is the combined MPG, which is numerical.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variables are the number of cylinders (numerical, discrete) and the engine displacement (numerical, continuous). For the question about change over time, the explanatory variable is the model year.
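
One possible way to quantify how cylinders and displacement relate to combined MPG is a linear model with both as predictors. This is a sketch of that idea, not the analysis carried out in this proposal. The toy data are made up and generated from a known linear rule so that the fitted coefficients can be checked against it.

```r
# Made-up data (illustration only): mpg follows an exact linear rule
toy <- data.frame(
  cyl   = c(4, 4, 6, 6, 8, 8),
  displ = c(1.8, 2.4, 3.0, 3.5, 4.6, 5.7)
)
toy$mpg <- 40 - 2 * toy$cyl - 1 * toy$displ

# Fit combined mpg on cylinders and displacement
fit <- lm(mpg ~ cyl + displ, data = toy)

# Coefficients: intercept, change in mpg per cylinder, change per unit of displacement
print(round(coef(fit), 3))
```

On the real data the corresponding call would be lm(mpg ~ cyl + displ, data = vehicles). Since cylinders and displacement tend to move together, the individual coefficients would need to be interpreted with care.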

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(vehicles$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.80   16.70   19.70   20.11   22.60   54.40
sd(vehicles$mpg)
## [1] 4.999651
qqnorm(vehicles$mpg)
qqline(vehicles$mpg)

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(class))) +
  theme(legend.position="none")

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(cyl))) 

ggplot(vehicles, aes(factor(cyl), mpg)) +
  geom_boxplot(aes(fill=factor(cyl))) +
  theme(legend.position="none")

ggplot(vehicles, aes(displ, mpg)) +
  geom_point(aes(color=factor(cyl), alpha = factor(cyl))) +
  theme_minimal() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'

mpg_year <- vehicles %>%
  group_by(year) %>%
  dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))

kable(mpg_year)
year     n    avgmpg  median        sd
1984   784  17.15944  17.250  4.182516
1985  1699  20.19741  19.600  5.322081
1986  1209  19.93201  19.600  5.255899
1987  1247  19.62097  19.350  5.135072
1988  1130  19.74969  19.250  5.041844
1989  1153  19.53877  19.150  5.175750
1990  1078  19.42032  19.050  4.955587
1991  1132  19.28101  18.700  4.916046
1992  1121  19.34095  19.050  4.894614
1993  1093  19.60018  19.050  4.869317
1994   982  19.53147  19.050  4.619627
1995   967  19.31541  18.700  4.639678
1996   773  20.11552  19.700  4.648382
1997   762  19.97749  19.600  4.547487
1998   809  19.92145  19.700  4.494452
1999   845  19.85521  20.050  4.531822
2000   836  19.77727  19.600  4.461690
2001   906  19.76297  19.600  4.607778
2002   972  19.57629  19.600  4.558214
2003  1043  19.44871  19.600  4.620918
2004  1122  19.58623  19.150  4.470160
2005  1166  19.75232  19.600  4.487480
2006  1104  19.51676  19.250  4.205074
2007  1126  19.52069  19.375  3.986880
2008  1186  19.78419  19.600  4.255072
2009  1184  20.34742  20.150  4.410107
2010  1109  21.19932  21.050  4.768022
2011  1126  21.46381  21.150  5.077788
2012  1144  22.09025  21.600  5.524931
2013  1170  23.02944  22.150  5.864990
2014  1202  23.47138  22.600  6.106523
2015   204  23.87426  23.500  4.898900
ggplot(mpg_year, aes(year, avgmpg)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

mpg_cyl <- vehicles %>%
  group_by(cyl) %>%
  dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))

kable(mpg_cyl)
cyl      n    avgmpg  median         sd
  2     45  18.37000  18.600  0.5289784
  3    182  37.11456  35.925  6.0392450
  4  12381  24.12813  23.500  4.2924181
  5    718  20.85578  20.600  2.7252662
  6  11885  18.90963  19.050  2.6275592
  8   7550  15.48340  15.250  2.6219897
 10    138  14.46920  14.600  1.8040640
 12    478  13.36056  13.700  1.7624723
 16      7  10.95714  11.150  0.2405351
ggplot(mpg_cyl, aes(cyl, avgmpg)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
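
The group sizes in the cylinder table range from 7 to 12,381, so each point in the plot of average MPG by cylinder count summarizes a very different amount of data. As a consistency check, weighting each group mean by its n should recover the overall mean of 20.11 reported by summary(vehicles$mpg). The sketch below re-enters the n and avgmpg columns from the table by hand (cyl_summary is a hypothetical name) so that it runs on its own:

```r
# Values copied from the kable(mpg_cyl) output
cyl_summary <- data.frame(
  cyl    = c(2, 3, 4, 5, 6, 8, 10, 12, 16),
  n      = c(45, 182, 12381, 718, 11885, 7550, 138, 478, 7),
  avgmpg = c(18.37000, 37.11456, 24.12813, 20.85578, 18.90963,
             15.48340, 14.46920, 13.36056, 10.95714)
)

total_n <- sum(cyl_summary$n)

# Mean of group means weighted by group size (should match the overall mean)
weighted_avg <- weighted.mean(cyl_summary$avgmpg, w = cyl_summary$n)

# Plain mean of group means treats a 7-vehicle group like a 12,381-vehicle group
unweighted_avg <- mean(cyl_summary$avgmpg)

cat("total n:", total_n, "\n")
cat("weighted mean:", round(weighted_avg, 2), "\n")
cat("unweighted mean of group means:", round(unweighted_avg, 2), "\n")
```

The gap between the two averages is a reminder that a smooth fitted through the group means gives the small 2-, 3-, and 16-cylinder groups the same influence as the large 4-, 6-, and 8-cylinder groups.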