Project 3: Assumptions and Issues


Fuel Economy

Jane Braun

RPI

April 20th - Final Draft


1. Select

Hadley Wickham Datasets

Hadley Wickham has compiled multiple large, thorough datasets that are very useful in performing analyses. They can be accessed at the following URL: https://github.com/hadley?tab=repositories

The dataset that will be used in the following analysis is Hadley Wickham’s “Fuel Economy” dataset. This can be accessed below: https://github.com/hadley/fueleconomy

Reading in dataset

Hadley Wickham’s dataset packages are accessed via the “devtools” package. This package provides the function install_github, which installs packages (including data-only packages like fueleconomy) directly from GitHub.

#install.packages("devtools")
library(devtools)
## Warning: package 'devtools' was built under R version 3.1.3
#install_github("hadley/fueleconomy")
library(fueleconomy)
x <- vehicles # Renamed for convenience

Variables of Interest

For this analysis, there must be two continuous independent variables and one continuous dependent variable. No categorical variables are to be used.

str(x)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...

Looking at the structure of the dataframe, seen above, there are 33,442 observations (rows) of 12 variables (columns). As stated before, we must use continuous variables. Therefore, I will remove the columns with categorical variables (make, model, class, etc.).

x <- x[c(-2,-3,-5,-6,-7, -10)] # Drop the categorical columns: make, model, class, trans, drive, fuel
x$year <- as.numeric(x$year)
x$cyl <- as.numeric(x$cyl)
x$hwy <- as.numeric(x$hwy)
x$cty <- as.numeric(x$cty)
xo <- x # Saves original dataset for power analysis later on
str(x)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  6 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ year : num  1984 1984 1984 1984 1985 ...
##  $ cyl  : num  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ hwy  : num  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : num  18 18 13 13 16 13 14 20 22 18 ...

This dataset is extremely large, so it will be subsetted down. The rationale behind this sample size can be found in the “2. Sample Sizes” section of this analysis.

N <- 485 # Sample size from the power analysis (see "2. Sample Sizes")
M <- nrow(x)
set.seed(99)
indices <- sample(M, N, replace=FALSE)
x <- x[indices,]

Independent Variables

Of the 5 remaining variables (besides id) - year, cylinders, displacement, highway mpg, and city mpg - I will select “year” and “displacement” as the independent variables.

Dependent Variable

For this analysis, I am interested in how various attributes of the car affect the car’s mileage. In this case, the dependent variable of interest will be highway mpg.

2. Describe

Selected Dataset

Hadley Wickham’s fuel economy dataset is taken from data on the U.S. Department of Energy’s website. It was gathered from testing completed by the Environmental Protection Agency.

The variables of interest - year, displacement, and highway mileage - will be in the following units:

  year: calendar year
  displacement: liters
  highway mileage: miles per gallon

Hypotheses

The null hypothesis we are testing is that the variation in highway mileage cannot be explained by anything other than randomization.

The alternative hypothesis is that something other than randomization can help to explain the variation in highway mileage - in this case, the variation in both year and displacement.

3. Model

Model Creation

For this model, I will use a step-wise method in order to see which independent variables appear to be most correlated with the dependent variable.

The following correlation matrix shows the pairwise correlations among the variables.

x <- x[c(-3,-6)] # Drop cyl and cty, leaving id, year, displ, and hwy
cor(x[c(-1)], use = "complete")
##             year       displ        hwy
## year  1.00000000  0.05045227  0.2427938
## displ 0.05045227  1.00000000 -0.7027743
## hwy   0.24279385 -0.70277431  1.0000000

There is only a slight correlation between displacement and year (0.05), so suppression is not much of a concern.

Linear Model with 1 Independent Variable

Based on the correlation matrix, displacement and highway mileage are most highly correlated, with an absolute value of 0.703.

attach(x)
model1 <- lm(hwy~displ)

Linear Model with 2 Independent Variables

The second highest correlation in the matrix is year and highway mileage, with a correlation of 0.24. This is the final step-wise model.

attach(x)
## The following objects are masked from x (pos = 3):
## 
##     displ, hwy, id, year
model2 <- lm(hwy~displ + year)

Model Summary

Model 1

summary(model1)
## 
## Call:
## lm(formula = hwy ~ displ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6042  -2.4888  -0.4249   2.1772  17.6906 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.2063     0.4899   67.78   <2e-16 ***
## displ        -2.9485     0.1358  -21.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.91 on 483 degrees of freedom
## Multiple R-squared:  0.4939, Adjusted R-squared:  0.4928 
## F-statistic: 471.3 on 1 and 483 DF,  p-value: < 2.2e-16

Model 2

summary(model2)
## 
## Call:
## lm(formula = hwy ~ displ + year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2640 -2.2858 -0.3919  1.6319 19.0293 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -297.08735   35.35020  -8.404 4.88e-16 ***
## displ         -3.00752    0.12525 -24.012  < 2e-16 ***
## year           0.16527    0.01769   9.344  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.602 on 482 degrees of freedom
## Multiple R-squared:  0.5715, Adjusted R-squared:  0.5697 
## F-statistic: 321.4 on 2 and 482 DF,  p-value: < 2.2e-16

4. Plots

Residuals versus Fitted

par(mfrow=c(1,1))
model.res <- resid(model2)
plot(fitted(model2), model.res, pch=21, cex=1, bg='ivory4',main="Plot of Fitted Values vs. Residuals  (Not Sandardized)", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')

par(mfrow=c(1,1))
model.st.res <- rstandard(model2)
plot(fitted(model2), model.st.res, pch=21, cex=1, bg='ivory4',main="Standardized Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')

Diagnostic Plots

attach(x)
## The following objects are masked from x (pos = 3):
## 
##     displ, hwy, id, year
## 
## The following objects are masked from x (pos = 4):
## 
##     displ, hwy, id, year
hist(hwy)

boxplot(hwy)

plot(x[c(-1)])

par(mfrow = c(1,1))
qqnorm(residuals(model2))
qqline(residuals(model2))   

5. Interpretation

Statistical Analysis

As seen in the summary of the second model, the estimate of the y-intercept is -297, the slope estimate for displacement is -3.008, and the slope estimate for year is 0.165. This means that for values of 0 for both year and displacement, the predicted highway mileage is -297 mpg (which of course is infeasible). With a slope of -3.008 for displacement, each additional liter of displacement decreases the highway mileage by 3.008 miles per gallon. With a slope of 0.165 for year, each additional year increases the highway mileage by 0.165 miles per gallon.

The adjusted R^2 value for the model is 0.5697. Generally, a larger R^2 value indicates a stronger relationship between the independent and dependent variables, where an R^2 of 1 would indicate a perfect fit, and a 0 would indicate no linear relationship. With this R^2 value of 0.5697, we can say that variation in the independent variables, engine displacement and year, can explain 56.97% of the variation in the dependent variable, highway mileage. This is a fairly high R^2 value, and therefore a large percent of the variation in the dependent variable can be explained by the independent variables.
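As a quick sanity check, these R^2 values can be recomputed by hand from the residual and total sums of squares (a minimal sketch, using the model frame that lm() stores on the fitted object):

sse <- sum(residuals(model2)^2)                              # residual (error) sum of squares
sst <- sum((model2$model$hwy - mean(model2$model$hwy))^2)    # total sum of squares
r2  <- 1 - sse/sst                                           # Multiple R-squared (~0.5715)
n <- nrow(model2$model); p <- 2
1 - (1 - r2) * (n - 1) / (n - p - 1)                         # Adjusted R-squared (~0.5697)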

Another indicator of the strength of the model is the set of p-values for the estimates, which in this case is < 2 x 10^-16 for both displacement and year. These are very small values: each gives the probability of observing a test statistic this extreme if the corresponding true coefficient were zero, and the overall model p-value attached to the F-statistic (321.4) is similarly small - showing that this event is very unlikely under the null. If we analyze using an alpha of 0.05, the p-values are less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in highway mileage cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in engine displacement and year may help to explain the variation in highway mileage.
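If needed, the individual coefficient p-values can be pulled directly from the summary table (a small convenience sketch):

coefs <- summary(model2)$coefficients  # matrix of estimates, SEs, t values, p-values
coefs[, "Pr(>|t|)"]                    # the p-values themselves
coefs[, "Pr(>|t|)"] < 0.05             # TRUE for each term significant at alpha = 0.05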

White’s Test

#install.packages("het.test")
#install.packages("vars")
library(het.test)
## Warning: package 'het.test' was built under R version 3.1.3
## Loading required package: vars
## Warning: package 'vars' was built under R version 3.1.3
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.1.2
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.2
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
## Loading required package: urca
## Warning: package 'urca' was built under R version 3.1.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.1.3
library(vars)

attach(x)
## The following objects are masked from x (pos = 11):
## 
##     displ, hwy, id, year
## 
## The following objects are masked from x (pos = 12):
## 
##     displ, hwy, id, year
## 
## The following objects are masked from x (pos = 13):
## 
##     displ, hwy, id, year
white_model1 <- VAR(na.omit(data.frame(y=hwy, x=displ)), p=1)

whites.htest(white_model1)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  24.8809 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0154
white_model2 <- VAR(na.omit(data.frame(y=hwy, x=year)), p=1)

whites.htest(white_model2)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  14.0951 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.2947

The Eight Assumptions

  1. The distribution of residuals is normal (at each value of the outcome).
  2. The variance of the residuals for every set of values for the predictor is equal (homoscedasticity).
  3. The error term is additive (no interactions).
  4. At every value of the outcome, the expected (mean) value of the residuals is zero (no non-linear relationship).
  5. The expected correlation between residuals, for any two cases, is 0 - the independence assumption (lack of autocorrelation).
  6. All predictor variables are uncorrelated with the error term.
  7. No predictors are a perfect linear function of other predictors (no perfect multicollinearity).
  8. The mean of the error term is zero.
1. The distribution of residuals is normal (at each value of the outcome)
par(mfrow = c(1,1))
hist(residuals(model2), main = "Histogram of Residuals of Model", xlab = "Residuals")

boxplot(residuals(model2), main = "Boxplot of Residuals of Model", xlab = "Residuals")

The histogram and boxplot above show the distribution of the residuals. The residuals appear to be somewhat skewed right, with a series of residuals trailing off to the right. This right skew pulls the mean of the residuals above the median, indicating a bias in the average. In general, the kurtosis seems relatively normal, so there is probably not much bias in the variance.
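To attach a number to this visual impression, a skewness statistic can be computed. This sketch assumes the e1071 package (one of several packages providing a skewness() function); it was not used in the original analysis:

#install.packages("e1071")
library(e1071)
skewness(residuals(model2))  # a positive value is consistent with the right skew seen above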

par(mfrow = c(1,1))
qqnorm(residuals(model2))
qqline(residuals(model2)) 

If the residuals are normally distributed, the points of the Q-Q plot will appear relatively linear, centering around the line. In the middle of the dynamic range, the points fall right on the line and seem linear. However, near the extremes of the range, the points trend away from the line.
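A formal test can supplement the visual check. The Shapiro-Wilk test from base R is one option (applicable here, since the sample is well under its 5,000-observation limit):

shapiro.test(residuals(model2))  # a small p-value suggests non-normal residuals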

2. The variance of the residuals for every set of values for the predictor is equal (homoscedasticity).
plot(model2$residuals, type = "p", col = "black", main = "Residuals plot")

Initially looking at this plot of the residuals ordered by observation, the variances of the residuals do not appear to vary much throughout the dynamic range. However, upon closer inspection, the variance at the lower end appears to be larger than the variance at the higher end of the range. This could demonstrate some possible heteroskedasticity.
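The Breusch-Pagan test offers a quick numeric check on the fitted model itself. This is a sketch using the lmtest package, which was already loaded above as a dependency of het.test:

bptest(model2)  # a small p-value points toward heteroskedasticity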

In order to look at this further, we can use White’s Test as well.

whites.htest(white_model1)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  24.8809 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0154
whites.htest(white_model2)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  14.0951 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.2947

The null hypothesis of White’s Test is that the data is homoskedastic. In the first White’s test (with displacement as the IV), the p-value is less than 0.05. If we are assuming a significance level of 0.05, then we would reject this null hypothesis, and instead support the alternative - that the data is heteroskedastic. However, looking at the second White’s test, with year as the IV, the p-value is greater than 0.05, so we fail to reject the null hypothesis - that the data demonstrates homoskedasticity.

3. The error term is additive

The two independent variables in this analysis are year and engine displacement. Based on my (limited) knowledge of the two variables and cars, I do not believe they have a multiplicative effect; instead, their effects are additive.

When looking at the plot of the standardized residuals versus fitted, however, there seems to be a different effect going on:

par(mfrow=c(1,1))
model.st.res <- rstandard(model2)
plot(fitted(model2), model.st.res, pch=21, cex=1, bg='ivory4',main="Standardized Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')

Instead of being randomly scattered around the 0 line, there seems to be a dip in the values around the center. This suggests that there could be some sort of nonlinear effect with the variables.

In the future, this could be analyzed with the help of some transformations as well.
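As a rough illustration of what such a follow-up might look like (a sketch only - model_quad is not part of the original analysis), a quadratic term for displacement could be added and compared against model2:

model_quad <- lm(hwy ~ displ + I(displ^2) + year, data = x)  # allow curvature in displ
summary(model_quad)$adj.r.squared                            # compare with model2's 0.5697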

4. At every value of the outcome, the expected (mean) value of the residuals is zero
par(mfrow=c(1,1))
model.st.res <- rstandard(model2)
plot(fitted(model2), model.st.res, pch=21, cex=1, bg='ivory4',main="Standardized Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')

Looking at the plot above of the fitted values vs. the residuals, the points appear relatively scattered and random across the dynamic range, which is consistent with a linear relationship between the variables. It is possible, though, that the expected value of the residuals is slightly greater than zero in places, since there seem to be slightly more points above the line than below it. As noted in the previous assumption, there also appears to be a dip in the residuals around the center of the dynamic range, which could suggest a non-linear effect.

5. The expected correlation between residuals, for any two cases, is 0.

This assumption is also known as the independence assumption - it says there is a lack of autocorrelation in the data. In order to analyze this, we can look at the plot of the residuals in the order they were observed:

plot(model2$residuals, type = "p", col = "black", main = "Residuals plot")

There does not appear to be any pattern (i.e. groups of data, seasonality, etc.). For this reason, I believe that we can assume independence in the data.
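The Durbin-Watson test provides a numeric complement to the plot. This sketch uses dwtest() from the lmtest package, already loaded above:

dwtest(model2)  # a statistic near 2 is consistent with uncorrelated residuals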

6. All predictor variables are uncorrelated with the error term

This is an interesting assumption - and one that is difficult to verify - because the fitted residuals are always uncorrelated with the IVs by construction, regardless of whether the IVs are correlated with the true (unobserved) error term.
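This can be seen numerically: the sample correlations between the residuals and the predictors are zero (up to floating-point error) no matter what, which is exactly why this check cannot detect a correlation with the true error term:

cor(residuals(model2), model2$model$displ)  # ~0 by construction
cor(residuals(model2), model2$model$year)   # ~0 by construction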

7. No predictors are a perfect linear function of other predictors (no perfect multicollinearity)

For this assumption to be verified, the two independent variables must not be linear functions of one another.

attach(x)
## The following objects are masked from x (pos = 3):
## 
##     displ, hwy, id, year
## 
## The following objects are masked from x (pos = 12):
## 
##     displ, hwy, id, year
## 
## The following objects are masked from x (pos = 13):
## 
##     displ, hwy, id, year
## 
## The following objects are masked from x (pos = 14):
## 
##     displ, hwy, id, year
plot(year ~ displ)

cor(x[c(-1,-4)], use = "complete")
##             year      displ
## year  1.00000000 0.05045227
## displ 0.05045227 1.00000000

Looking at both the plot of year versus displacement and the correlation matrix of year versus displacement, it does not appear as though there is any collinearity between the two independent variables.

8. The mean of the error term is zero.

For this assumption to be confirmed, the mean of the residuals must also be zero.

par(mfrow=c(1,1))
model.st.res <- rstandard(model2)
plot(fitted(model2), model.st.res, pch=21, cex=1, bg='ivory4',main="Standardized Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')

As stated before, it appears as though there is some sort of nonlinear effect going on in the data, which causes the parabola-like shape of the residuals. Because of this deformation, the expected value of the residuals likely drifts away from zero over parts of the fitted range, even though the overall mean of the residuals is exactly zero by construction for an OLS model with an intercept.
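The overall mean is easy to verify directly:

mean(residuals(model2))  # effectively zero (floating-point scale) for OLS with an intercept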

The Four Issues

  1. Causality
  2. Sample Sizes
  3. Collinearity
  4. Measurement Error
1. Causality

The two independent variables in this analysis - year and engine displacement - are proximal and probabilistic causes of highway mileage. They are proximal because they are not the very initial point of causation for a change in highway mileage. The two independent variables are also not deterministic, because, like most things, there is potential for exceptions. For example, even though many cars have increased their mileage in the past decade, it is still possible that there are pickup trucks/SUVs today with worse mpg than a sedan from about 10 years ago.

2. Sample Sizes

The dataset that was used is extremely large. Because of this, it can be unclear whether the independent variables are significant as a result of their actual effect, or if it is because there are so many data points.

In order to solve this problem, we must complete a power analysis using G*Power to find the sample size needed for certain levels of beta, alpha, etc.

First, we will calculate f^2, the effect size, from the original regression’s R^2 value.

r2 <- summary(lm(hwy~displ + year, data = xo))$r.squared # R^2 value from original dataset. 
f2 <- r2 / (1-r2)
f2
## [1] 1.381696

This value of f^2 is very large, and results in a very small sample size. So, instead, I will assume a small effect, which has an f^2 value of 0.02.

Using this f-squared value, an alpha level of 0.05, and a power level of 0.8, with 2 predictors, we can then calculate the sample size using the G*Power Program.

Using G*Power, a sample size of 485 was calculated. This is used for subsetting the data earlier in this analysis.
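For reproducibility within R, the same calculation can be sketched with the pwr package (an alternative to G*Power; this package is an assumption and was not used in the original analysis):

#install.packages("pwr")
library(pwr)
pa <- pwr.f2.test(u = 2, f2 = 0.02, sig.level = 0.05, power = 0.8)  # u = number of predictors
ceiling(pa$v) + 2 + 1  # n = v + u + 1; should land near G*Power's 485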

3. Collinearity

The issue of collinearity arises when the independent variables are correlated. This would show up as a high R^2 (close to 1) when one independent variable is regressed on the other.

model_collinearity <- lm(displ ~ year, data=x)
summary(model_collinearity)
## 
## Call:
## lm(formula = displ ~ year, data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2577 -1.0574 -0.3004  0.9210  3.4568 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.884552  12.832606  -0.848    0.397
## year          0.007125   0.006417   1.110    0.267
## 
## Residual standard error: 1.308 on 483 degrees of freedom
## Multiple R-squared:  0.002545,   Adjusted R-squared:  0.0004803 
## F-statistic: 1.233 on 1 and 483 DF,  p-value: 0.2675

As you can see in the summary above, the model of the IV displacement being explained by year is not significant. The R^2 value is 0.002545, meaning that only 0.2545% of the variation in displacement can be explained by variation in year. This leads me to believe that there is no collinearity between the independent variables.
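Variance inflation factors provide another standard diagnostic. This sketch assumes the car package (not used in the original analysis), whose vif() function reports how much each coefficient's variance is inflated by collinearity:

#install.packages("car")
library(car)
vif(model2)  # values near 1 indicate essentially no collinearity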

4. Measurement Error

In this study, it is unlikely that there is much room for measurement error. Year and engine displacement are both standard specifications that manufacturers would know about a car. Unless there are issues with the quality control in the plant, it is unlikely that a car that should have a 3 liter engine displacement actually has 6 liters.

The one area with the possibility of measurement error is how highway mileage is measured. The method would need to be standardized - all the cars going the same “highway” speed for the same period of time. It’s possible that various manufacturers have different definitions of what a highway speed is.