Data Preparation

library(dplyr)
library(ggplot2)

# load data
library(fueleconomy)

vehicles

According to the EPA (Environmental Protection Agency), combined fuel economy is a weighted average of the City and Highway MPG values, with the City value weighted by 55% and the Highway value by 45%.

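# drop rows that contain missing values
vehicles <- na.omit(vehicles)
vehicles
# combined mpg: 55% city + 45% highway (columns are referenced directly inside mutate)
vehicles <- vehicles %>% mutate(mpg = 0.55 * cty + 0.45 * hwy)
vehicles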
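As a quick sanity check of the 55/45 weighting described above, the same formula can be applied to a couple of made-up vehicles. This is only a minimal sketch: the data frame `toy` and its values are hypothetical and are not taken from the vehicles dataset.

```r
# Hypothetical city/highway values, chosen only to illustrate the weighting
toy <- data.frame(cty = c(20, 15), hwy = c(30, 25))

# Combined mpg: 55% of the city value plus 45% of the highway value
toy$mpg <- 0.55 * toy$cty + 0.45 * toy$hwy

# 0.55 * 20 + 0.45 * 30 = 24.5 and 0.55 * 15 + 0.45 * 25 = 19.5
toy$mpg
```

Because city driving carries the larger weight, the combined value always sits slightly closer to the city figure than to the highway figure.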

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Cases

What are the cases, and how many are there?

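The fuel economy data covers all cars sold in the US from 1984 to 2015. Each case is one vehicle record (one row of the vehicles dataset). The vehicles dataset in the fueleconomy package has 33,442 rows and 12 variables before rows with missing values are removed.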

Data collection

Describe the method of data collection.

The data comes from the R package fueleconomy, whose data was sourced from the EPA (Environmental Protection Agency). Within the package, the data is stored in the vehicles dataset.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

https://blog.rstudio.com/2014/07/23/new-data-packages/
https://www.fueleconomy.gov/feg/download.shtml

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is combined mpg (the mpg column computed from cty and hwy during data preparation). It is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The two independent variables are number of cylinders (cyl), treated here as a qualitative variable, and engine displacement (displ), which is quantitative.

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g., scatter plots, boxplots). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(vehicles$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.80   16.70   19.70   20.11   22.60   54.40
# standard deviation
sd(vehicles$mpg)
## [1] 4.999651
qqnorm(vehicles$mpg)
qqline(vehicles$mpg)

hist(vehicles$mpg, breaks = 50)

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(class)))

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(cyl)))

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(displ)))

ggplot(vehicles, aes(factor(cyl), mpg)) + geom_boxplot(aes(fill = factor(cyl)))
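The statistics above describe only the response variable. A minimal sketch of how the two independent variables might be summarized in the same style follows. It assumes the fueleconomy, dplyr, and ggplot2 packages are installed; the object names `veh` and `by_cyl` are hypothetical, and the chunk rebuilds the prepared data so that it runs on its own.

```r
library(dplyr)
library(ggplot2)
library(fueleconomy)

# Rebuild the prepared data: drop missing values, add combined mpg
veh <- vehicles %>%
  na.omit() %>%
  mutate(mpg = 0.55 * cty + 0.45 * hwy)

# Quantitative independent variable: engine displacement
summary(veh$displ)
sd(veh$displ)

# Qualitative independent variable: count and mpg summary per cylinder group
by_cyl <- veh %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg), sd_mpg = sd(mpg))
by_cyl

# Scatter plot of the response against displacement
ggplot(veh, aes(displ, mpg)) + geom_point(alpha = 0.2)
```

The low `alpha` value is a design choice for a dataset of this size: with tens of thousands of points, semi-transparent markers make dense regions easier to distinguish from sparse ones.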