# load data
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(fueleconomy)
vehicles <- fueleconomy::vehicles

Research question

Is there a correlation between a vehicle's fuel efficiency and the number of cylinders it has?

Cases

There are 33,442 cases, each a vehicle from model years 1984 to 2015.
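
The case count and year range can be checked directly against the loaded data. This is a minimal sketch, assuming the `fueleconomy` package is installed and that `year` is the model-year column of `vehicles`; the names `n_cases` and `year_range` are illustrative only.

```r
library(fueleconomy)

vehicles <- fueleconomy::vehicles

# Number of cases (one row per vehicle record)
n_cases <- nrow(vehicles)
n_cases

# First and last model year covered by the data
year_range <- range(vehicles$year)
year_range
```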

Data collection

The data were collected by the EPA as a direct observational study.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data set was installed in R through the fueleconomy package (announced at https://blog.rstudio.org/2014/07/23/new-data-packages/), but it originally comes from the EPA fuel economy website: http://www.fueleconomy.gov/feg/download.shtml

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is highway fuel efficiency in MPG (city MPG is ignored); it is numerical and continuous.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variable is the number of cylinders the vehicle has; it is numerical and discrete. There are other variables, but they are ignored for this study.
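
Because the cylinder count is discrete, a frequency table is a quick way to see which values actually occur. This is a sketch, assuming the same `vehicles` data frame from the `fueleconomy` package; `useNA = "ifany"` is set so that missing values are counted as well, and `cyl_counts` is an illustrative name.

```r
library(fueleconomy)

vehicles <- fueleconomy::vehicles

# Count how many vehicles have each cylinder value, including missing values
cyl_counts <- table(vehicles$cyl, useNA = "ifany")
cyl_counts
```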

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

## Summary statistics for the number of cylinders in a vehicle
summary(vehicles$cyl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   4.000   6.000   5.772   6.000  16.000      58
sd(vehicles$cyl, na.rm=TRUE)
## [1] 1.740931
var(vehicles$cyl, na.rm=TRUE)
## [1] 3.03084
hist(vehicles$cyl, breaks=30, main = 'Histogram of vehicle cylinders')

## This means that the mean number of cylinders is 5.8, with an SD of 1.7 cylinders.

## Summary statistics for the highway mpg data
summary(vehicles$hwy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   19.00   23.00   23.55   27.00  109.00
sd(vehicles$hwy, na.rm=TRUE)
## [1] 6.211417
var(vehicles$hwy, na.rm=TRUE)
## [1] 38.5817
hist(vehicles$hwy, breaks=30, main = 'Histogram of vehicle highway mpg')

## This means that the mean highway mpg is 23.55, with an SD of 6.2 mpg.
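
Since the research question relates highway mpg to cylinder count, the per-group means, SDs, and sample sizes can be sketched with dplyr, alongside a correlation coefficient. This is one possible approach rather than the analysis used in this report; the names `hwy_by_cyl` and `r_cyl_hwy` are illustrative, and rows with a missing `cyl` value are dropped.

```r
library(dplyr)
library(fueleconomy)

vehicles <- fueleconomy::vehicles

# Sample size, mean, and SD of highway mpg for each cylinder count
hwy_by_cyl <- vehicles %>%
  filter(!is.na(cyl)) %>%
  group_by(cyl) %>%
  summarise(
    n = n(),
    mean_hwy = mean(hwy),
    sd_hwy = sd(hwy),
    .groups = "drop"
  )
hwy_by_cyl

# Pearson correlation, using only rows where both values are present
r_cyl_hwy <- cor(vehicles$cyl, vehicles$hwy, use = "complete.obs")
r_cyl_hwy
```

Any group containing a single vehicle would show `NA` for its SD, because a standard deviation needs at least two observations.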

The next question is whether vehicle highway mpg is normally distributed.

qqnorm(vehicles$hwy, main= 'Normal QQ Plot for Highway MPG')
qqline(vehicles$hwy)

## The QQ plot shows that the data are right-skewed. This can be addressed in further analysis by drawing samples from the data.
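
One way to work around the skew is to look at the means of repeated random samples, which tend toward a normal distribution even when the underlying data are skewed. This is a minimal sketch under that assumption; the sample size (50), the number of repetitions (1000), and the seed are arbitrary illustrative choices, not values from this study.

```r
library(fueleconomy)

vehicles <- fueleconomy::vehicles

set.seed(123)  # arbitrary seed, used only to make the sketch reproducible

# Means of 1000 random samples of 50 highway mpg values each
sample_means <- replicate(1000, mean(sample(vehicles$hwy, size = 50)))

# Compare the distribution of sample means against a normal distribution
qqnorm(sample_means, main = "Normal QQ Plot for Sample Means of Highway MPG")
qqline(sample_means)
```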

In analyzing the data set, other variables may be introduced to determine whether they influence the response variable.