CUNY MSDS DATA 606 - Final Project - Statistical Analysis

Nicholas Schettini

May 13, 2018

Libraries

# load data
library(tidyverse)
library(fueleconomy)
library(knitr)
library(ggthemes)
library(kableExtra)
library(ggiraph)

1.) Introduction

In today’s money-conscious and eco-friendly society, consumers are usually concerned with how many MPG their new vehicle will bring them. Vehicles that are more eco-friendly help our environment, and also help consumers save money on ever-rising gas prices. This study will focus on answering the following questions:

  • Has the average combined highway and city MPG changed over time. If so, how?

  • How does the number of cylinders a vehicle has impact it’s combined MPG?

  • How does a vehicles displacment impact it’s combined MPG?

2.) Data

The data is collected from the R package: fueleconomy. The fueleconomy package’s data was collected from the Environmental Protection Agency’s website. The data is stored in the vehicles dataset.

The cases represent the various vehicles between years 1984 - 2015. There are 33,442 cases of vehicle data within the fueleconomy package.

The study is an observational study. The response variable is the combined (highway and city) MPG. The explanatory variable is how the fuel economy is effected by the number of cylinders and displacment in a vehicle.

Information relating to the data in this study can be found at the following links: https://blog.rstudio.com/2014/07/23/new-data-packages/

https://www.fueleconomy.gov/feg/download.shtml

3 & 4.) Exploratory Data Analysis & Infrence

An overview of the data found within the vehicles package is below:

summary(vehicles)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

The vehicle dataset found within the fueleconomy package does not contain data on combined MPG. Instead, the dataset gives us the highway (hwy) and city (cty) MPG. In order to answer the research question, we must first combine hwy and cty into one combined variable: mpg.

In order to calculate this variable, the hwy and cty variables must be combined using the EPA standards: \[cty * 0.55 + hwy * 0.45\]

This value was then added to the vehicles dataset as mpg

#remove data that has null values
vehicles <- na.omit(vehicles)

#combine hwy and cty mpg following EPA standards
vehicles <- vehicles %>%
  mutate(mpg = 0.55 * vehicles$cty + 0.45 * vehicles$hwy)

Now that our mpg variable has been calculated, we can do some further investigation.

ggplot1 <- ggplot(vehicles, aes(factor(vehicles$year), vehicles$mpg)) +
  geom_boxplot_interactive(aes(tooltip = vehicles$year, data_id = year, fill = factor(vehicles$year)))     +
  theme(legend.position="none")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_hline(aes(yintercept = median(mpg)), colour = 'red')

ggiraph(code = print(ggplot1))
## Warning: package 'gdtools' was built under R version 3.4.4

The boxplot confirms our summary statistics above - the data seems to hover around the median of 19.70 MPG, until around the year 2009 when it starts to show an increase in the average combined MPG.

qqnorm(vehicles$mpg)
qqline(vehicles$mpg)

ggplot(vehicles, aes(sample = mpg)) + 
  stat_qq(aes(color = factor(year))) +
  ylab("Sample Quantiles - MPG")

The two Q-Q plots also show that after 2009, the average combined MPG shows an increase.

mpg_year <- vehicles %>%
  group_by(year) %>%
  dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))
  
  
kable(mpg_year, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  row_spec(32, bold = T, color = "white", background = "green")  %>%
  row_spec(1, bold = T, color = "white", background = "red")
year n avgmpg median sd
1984 784 17.15944 17.250 4.182516
1985 1699 20.19741 19.600 5.322081
1986 1209 19.93201 19.600 5.255899
1987 1247 19.62097 19.350 5.135072
1988 1130 19.74969 19.250 5.041844
1989 1153 19.53877 19.150 5.175750
1990 1078 19.42032 19.050 4.955587
1991 1132 19.28101 18.700 4.916046
1992 1121 19.34095 19.050 4.894614
1993 1093 19.60018 19.050 4.869317
1994 982 19.53147 19.050 4.619627
1995 967 19.31541 18.700 4.639678
1996 773 20.11552 19.700 4.648382
1997 762 19.97749 19.600 4.547487
1998 809 19.92145 19.700 4.494452
1999 845 19.85521 20.050 4.531822
2000 836 19.77727 19.600 4.461690
2001 906 19.76297 19.600 4.607778
2002 972 19.57629 19.600 4.558214
2003 1043 19.44871 19.600 4.620918
2004 1122 19.58623 19.150 4.470160
2005 1166 19.75232 19.600 4.487480
2006 1104 19.51676 19.250 4.205074
2007 1126 19.52069 19.375 3.986880
2008 1186 19.78419 19.600 4.255072
2009 1184 20.34742 20.150 4.410107
2010 1109 21.19932 21.050 4.768022
2011 1126 21.46381 21.150 5.077788
2012 1144 22.09025 21.600 5.524931
2013 1170 23.02944 22.150 5.864990
2014 1202 23.47138 22.600 6.106523
2015 204 23.87426 23.500 4.898900
ggplotint2 <- ggplot(mpg_year, aes(mpg_year$year, mpg_year$avgmpg)) +
  geom_point_interactive(tooltip = mpg_year$avgmpg, data_id = mpg_year$avgmpg) +
  geom_smooth() +
  ggtitle("MPG vs. Year") +
  xlab("Years: 1985-2015") +
  ylab("Combined MPG")


ggiraph(code = print(ggplotint2))
## `geom_smooth()` using method = 'loess'

The data above seem indicate that there is a right-skew within the combined MPG data. The data seems to show a mostly normal distribution. The data shows that combined MPG begins to increase after the year 2009.

\[H_0: \mu_{1984} = \mu_{2015}\]

\[H_a: \mu_{1984} \neq \mu_{2015}\]

diff_years <- mpg_year %>%
  filter(year == "1984" | year == "2015")

kable(diff_years)
year n avgmpg median sd
1984 784 17.15944 17.25 4.182516
2015 204 23.87426 23.50 4.898900

\[T = \frac{(\bar{x}_{1}-\bar{x}_{2}) - \mu_0}{\sqrt{\frac{s_{1984}^2}{n_{1984}} + \frac{s_{2015}^2}{n_{2015}}}}\]

s1984 <- 4.182516
s2015 <- 4.898900
n1984 <- 784
n2015 <- 204
xbar1984 <- 17.15944
xbar2015 <- 23.87426
u0 <- 0

xdiff <-  xbar2015 - xbar1984
SE <- sqrt((s1984^2/n1984) + (s2015^2/n2015))
t <-  (xdiff-u0)/SE
2 * (1 - pt(17.94891, df = 203))
## [1] 0

Check to see if variance is equal to determine type of T test

test1984 <- filter(vehicles, year == "1984")
test1984 <- test1984$mpg

test2015 <- filter(vehicles, year == "2015")
test2015 <- test2015$mpg

var.test(test1984, test2015)
## 
##  F test to compare two variances
## 
## data:  test1984 and test2015
## F = 0.72892, num df = 783, denom df = 203, p-value = 0.003227
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5815171 0.9007487
## sample estimates:
## ratio of variances 
##           0.728917

Since the p-score is low, the Null Hypothesis is rejected. The variance is not equal.

Using the t.test function to calculate p-value on the 1984 and 2015 data.

t.test(test1984, test2015,  var.equal = F)
## 
##  Welch Two Sample t-test
## 
## data:  test1984 and test2015
## t = -17.949, df = 284.65, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.451194 -5.978458
## sample estimates:
## mean of x mean of y 
##  17.15944  23.87426

Since the p-value is low - null must go; meaning, the \(H_0\) is rejected and \(H_a\) is accepted. There is signicant evidence that fuel economy has changed from 1984 and 2015.

Running ANOVA on all of the years in the dataset:

model_year <- lm(mpg~factor(year), data = vehicles)
anova(model_year)
## Analysis of Variance Table
## 
## Response: mpg
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## factor(year)    31  47645 1536.95  65.149 < 2.2e-16 ***
## Residuals    33352 786813   23.59                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test also shows a p-value < 0.05. Meaning, he \(H_0\) is rejected and \(H_a\) is accepted. There is signicant evidence that fuel economy has changed from 1984 through 2015.

How does the number of cylinders impact the average combined MPG?

ggplotint3 <- ggplot(vehicles, aes(factor(cyl), mpg)) +
  geom_boxplot_interactive(aes(tooltip = mpg, data_id = mpg, fill=factor(cyl))) +
  theme(legend.position="none")

ggiraph(code = print(ggplotint3))
ggplot(vehicles, aes(sample = mpg)) + 
  stat_qq(aes(color = factor(cyl))) 

ggplot(vehicles, aes(mpg)) + 
  geom_histogram( bins = 50, aes(fill = factor(cyl))) 

mpg_cyl <- vehicles %>%
  group_by(cyl) %>%
  dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))
                
kable(mpg_cyl, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  row_spec(2, bold = T, color = "white", background = "green")  %>%
  row_spec(9, bold = T, color = "white", background = "red")
cyl n avgmpg median sd
2 45 18.37000 18.600 0.5289784
3 182 37.11456 35.925 6.0392450
4 12381 24.12813 23.500 4.2924181
5 718 20.85578 20.600 2.7252662
6 11885 18.90963 19.050 2.6275592
8 7550 15.48340 15.250 2.6219897
10 138 14.46920 14.600 1.8040640
12 478 13.36056 13.700 1.7624723
16 7 10.95714 11.150 0.2405351
ggplotint <- ggplot(mpg_cyl, aes(mpg_cyl$cyl, mpg_cyl$avgmpg)) +
  geom_point() +
  geom_smooth()

Looking at the graphs: on average, as the number of cylinders increases, the average combined MPG decreases.

Stastiscally we can test this by running the ANOVA test. ANOVA is used in this case since we are testing more than two samples.

\[H_0: \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6 = \mu_8 = \mu_{10} = \mu_{12} = \mu_{16}\] \[H_a: \mu_2 \neq \mu_3 \neq \mu_4 \neq \mu_5 \neq \mu_6 \neq \mu_8 \neq \mu_{10} \neq \mu_{12} \neq \mu_{16}\]

model <- lm(mpg~factor(cyl), data = vehicles)
anova(model)
## Analysis of Variance Table
## 
## Response: mpg
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## factor(cyl)     8 458546   57318  5088.9 < 2.2e-16 ***
## Residuals   33375 375913      11                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is low (less than 0.05) - null must go; meaning, the \(H_0\) is rejected and the \(H_a\) is accepted. There is signicant evidence that fuel economy is different for engines with different # of cylinders.

How does the number of displacment impact the average combined MPG?

 ggplot(vehicles, aes(vehicles$displ, mpg)) +
  geom_point(aes( color=factor(cyl), alpha = factor(cyl))) +
  theme_minimal() +
  geom_smooth()
## `geom_smooth()` using method = 'gam'

model_dis <- lm(mpg~as.factor(displ), data = vehicles)
anova(model_dis)
## Analysis of Variance Table
## 
## Response: mpg
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(displ)    63 573567  9104.2  1162.8 < 2.2e-16 ***
## Residuals        33320 260892     7.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is low - null must go; meaning, the H0 is rejected. There is signicant evidence that fuel economy is different for engines with different #’s of displacment.

Correlogram

library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.4
corrgram(vehicles, order=TRUE, 
         lower.panel=panel.shade,
        upper.panel=panel.pie, 
        text.panel=panel.txt,
        main="MPG Data")

From the above graph, we can see correlations between different variables in our dataset. For example: hwy, cty, and mpg are positively correlated. While hwy, mpg and cyl are negativelty correlated.

Conclusions:

After analyzing the fueleconomy package from the EPA, the data shows that the average MPG has increased since 1984 to 2015. From around ~17MPG to 23MPG.

According to the data, there is signicant evidence that fuel economy is different for engines with different # of cylinders. The data shows that overall, the more cylinders in a vehicle, the less overall MPG.

There is also signicant evidence that fuel economy is different for engines with different #’s of displacment. According to the data, overall, the more displacment the less the overall MPG.