Data selection

This data is downloaded from here and the fuel economy information over a period of 31 years spanning 1984-2015. There are a total of 33,442 entries.

I’m focusing on the the following variables:
- year: Car model year
- displ: Engine displacement (L)
- hwy: Highway miles per gallon (mpg)

2. Null and alternate hypotheses

\(H_0\) = The fuel economy is not related to the car model year or the engine displacement.
\(H_1\) = The fuel economy is a function of BOTH the car model year and the engine displacement.

## Downloading github repo hadley/fueleconomy@master
## Installing fueleconomy
## '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  \
##   '/private/var/folders/hm/8t_hdfyd4pnbr0lg32923mkw0000gn/T/RtmpD8HfKV/devtools62d14cbe295/hadley-fueleconomy-188fa25'  \
##   --library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library'  \
##   --install-tests

Before I proceed with analyzing this data set, I want to remove any entries (rows) that contain ‘na’ values:

v = vehicles[,c('year','displ','hwy')]
v = v[complete.cases(v),]

I also want to subtract 1984 from the year values, so that our intercept values will make more sense during the interpretation of the model statistics. The year values will now mean “years since 1984”.

v$year = v$year - 1984

Model construction

1. Overview of the data in a scattergram:

plot(v,pch=19, cex=1, col="#00000010")

The only obvious outliers here are where we seem to have data points where engine displacement = 0. This cannot be right, so I’ll remove these entries

2. Downsampling

After initially creating this model using all 33K+ cases and finding a very large effect size of 0.761 using \(R^2\)=0.5608, \(alpha\)=0.05 and \(beta\)=0.95 (see Interpretting the model: Item #4), it seems likely that we should downsample the data in order to approach a more reasonable sample size. G*Power suggested a sample size of 10, but this seems too small. Instead, I’m downsampling to 100 cases and proceeding with the analysis.

sample_size=100
vector=nrow(v)
set.seed(100)
idx=sample(vector, sample_size, replace=FALSE)
v=v[idx,]

After removing the outliers and downsampling, here is the updated scattergram:

v = v[v$displ > 0,]
plot(v,pch=19, cex=1)

3. Correlation matrix

cor(v)

##              year       displ        hwy
## year   1.00000000 -0.09698428  0.3164187
## displ -0.09698428  1.00000000 -0.7530461
## hwy    0.31641873 -0.75304611  1.0000000

I’ve chosen car model year and engine displacement as independent variables because these are vehicle specifications that cannot be changed. The highway fuel economy, on the other hand, is likely a function of a variety of factors. These likely include (but are not limited to) the model year and engine displacement.

4. Step-wise model building

I’ll start with the independent variable ‘engine displacent’ because is has the largest correlation with highway fuel economy (-0.75).

mpg.lm_H1 <- lm(v$hwy ~ v$displ)
summary(mpg.lm_H1)

## 
## Call:
## lm(formula = v$hwy ~ v$displ)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5108 -2.7389 -0.8823  2.4116 15.9280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.0250     1.0035   32.91   <2e-16 ***
## v$displ      -3.0408     0.2684  -11.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.816 on 98 degrees of freedom
## Multiple R-squared:  0.5671, Adjusted R-squared:  0.5627 
## F-statistic: 128.4 on 1 and 98 DF,  p-value: < 2.2e-16

According to the model summary, this model which only considers the engine displacement accounts for 57% of the variation observed in the highway fuel economy (Adjusted R-squared = 0.5671).
Next, I’ll add the model year (correlation with highway fuel economy = 0.32) to the model and see what effect it has on the explained correlation.

mpg.lm_H2 <- lm(v$hwy ~ v$displ + v$year)
summary(mpg.lm_H2)

## 
## Call:
## lm(formula = v$hwy ~ v$displ + v$year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3239 -1.8671 -0.6823  1.5766 14.7768 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.51339    1.13253  26.943  < 2e-16 ***
## v$displ     -2.94457    0.25163 -11.702  < 2e-16 ***
## v$year       0.16081    0.04078   3.943 0.000152 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.561 on 97 degrees of freedom
## Multiple R-squared:  0.6269, Adjusted R-squared:  0.6192 
## F-statistic: 81.48 on 2 and 97 DF,  p-value: < 2.2e-16

Now, the model explains slightly more of the variation than was explained using the engine displacement alone (Adjusted R-squared = 0.6269).

Here is the 3D scatterplot showing how well our model (red plane with 95% confidence intervals shown in red) is representing our data.

#install.packages("scatterplot3d")
library(scatterplot3d)
H2_3d <- scatterplot3d(v$displ, v$year, v$hwy, pch=19, main = "Highway fuel economy vs. engine displacement and model year", bg = 'ivory4', xlab = "Displacement (L)", ylab = "Model year", zlab = "Highway fuel economy (mpg)", axis=TRUE)
H2_3d$plane3d(mpg.lm_H2, col = 'blue')
H2_3d$plane3d(confint(mpg.lm_H2)[,1],col="red")
H2_3d$plane3d(confint(mpg.lm_H2)[,2],col="red")

Residual evaluation

1. Residual normality diagnostic plots

We’ll make a series of plots that will tell us whether our residuals are normally distributed:

Scatterplot of the standardized residuals as a function of the fitted values. This will give us a sense of how well our model is working.
Histogram of the standardized residual values. This will help us determine the normality of the model residuals.
Boxplot of the standardized residuals, which will also show us if the values are skewed to one side of the distribution.
QQ plot of the residuals distribution vs. a theoretical normal distribution. This will reveal how much we’re differing from the normal.

mpg.res = resid(mpg.lm_H2)
mpg.stdres = rstandard(mpg.lm_H2)
par(mfrow=c(2,2))
plot(fitted(mpg.lm_H2), mpg.stdres, xlab="Fitted values", ylab="Standardized residual", main="(a) Std. residuals vs. fitted values")
abline(0,0,col='red')
hist(mpg.stdres,breaks=80, xlab="Standardized residual", main="(b) Std. residuals distribution")
boxplot(mpg.stdres, xlab="MPH ~ Displacement + Year", ylab="Standardized residual", main="(c) Std. residuals boxplot")
qqnorm(mpg.stdres, xlab="Normal quantiles", ylab="Standardized residual quantiles", main="(d) QQ plot: std. residuals vs. normal")
qqline(mpg.stdres,col='red')

par(mfrow=c(1,1))

1. The standardized residuals vs. fitted values plot looks pretty good, as most of the values appear to be part of a uniform distribution centered around zero. However, there are some residual values at either extreme that appear to be significantly departed from zero.
1. The histogram suggests that there is some non-normality in the residuals. Specifically, there is skewness in the distribution of residuals toward the right tail of the distribution. There is no evidence of kurtosis
1. The boxplot confirms that the right tail of the distribution is larger than the left tail. But I’m hesitant to call any of these values in the left tail outliers.
1. The QQ plot shows that the residual values are indeed deviated somewhat from the theoretical normal distribution.

2. Do the residuals have equal variance at all values of each of the predictors (independent variables)?

This is where we look for either homo- or heteroscedasticy. We can check this visually by referring to scatterplots of the residuals plotted against each of the independent variable values.

par(mfrow=c(1,2))
plot(v$displ, mpg.stdres, xlab="Index", ylab="Standardized residual", main="Std. residuals vs. engine displacement", pch=19)
abline(0,0,col='red')
plot(v$year, mpg.stdres, xlab="Index", ylab="Standardized residual", main="Std. residuals vs. model year", pch=19)
abline(0,0,col='red')

par(mfrow=c(1,1))

For the first plot, if you consider different vertical strips of data points from left to right, there is a slight difference in the variance of the points in each strip as you move from left to right. This means that there might be heteroscedasticity in the model. The second plot looks better, with the uniform variance associated with homoscedasticity.

3. Is the mean of the residual zero for each fitted value?

plot(fitted(mpg.lm_H2), mpg.stdres, xlab="Fitted values", ylab="Standardized residual", main="(a) Std. residuals vs. fitted values")
abline(0,0,col='red',lwd=5)
lines(smooth.spline(fitted(mpg.lm_H2), mpg.stdres,df=15),col='blue',lwd=5)

This shows us that the means of the residuals are close to zero for the central fitted values, but deviate from zero at both extremes. This likely means that our linear model is attempting to model a non-linear phenomenon.

Interpretting the model

summary(mpg.lm_H2)

## 
## Call:
## lm(formula = v$hwy ~ v$displ + v$year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3239 -1.8671 -0.6823  1.5766 14.7768 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.51339    1.13253  26.943  < 2e-16 ***
## v$displ     -2.94457    0.25163 -11.702  < 2e-16 ***
## v$year       0.16081    0.04078   3.943 0.000152 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.561 on 97 degrees of freedom
## Multiple R-squared:  0.6269, Adjusted R-squared:  0.6192 
## F-statistic: 81.48 on 2 and 97 DF,  p-value: < 2.2e-16

1. According to our linear model parameters:

\(b_0\): If there was a theoretical car with zero engine displacement manufactured in 1980 it would have highway fuel economy of 30.5 mpg.
\(b_1\): For a given model year, the addition of each additional liter of engine displacement contributes to a loss of 2.9 mpg.
\(b_2\): On the other hand, for a fixed engine displacement, each subsequent model year adds 0.16 mpg to each vehicle.
\(r\): Overall, the engine displacement and the model year account for 63% of the observed variation in highway fuel economy.

2. Will use White’s test twice: once for each independent variable.

In the first case, the y value is the dependent variable ‘highway fuel economy’ and the x value is the independent variable ‘engine displacement’:

dataset <- data.frame(y=v$hwy, x=v$displ)
model1 <- VAR(dataset, p = 1)
whites.htest(model1)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  11.6203 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.4766

The p-value of 0.4766 indicates that there not significant heteroscedasticity (not really in agreement with our visual assessment of the residuals plot above). For the second case, we’ll replace the x value with the independent variable ‘model year’:

dataset <- data.frame(y=v$hwy, x=v$year)
model1 <- VAR(dataset, p = 1)
whites.htest(model1)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  11.5676 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.4810

Again, the p-value indicates non-significant heteroscedastitity (this time in agreement with our plot).

3. Are we satisfying all eight assumptions?

Here’s a summary of the findings.

Residuals normally distributed?
- The standardized residuals are not quite normally distributed, as shown in the 2x2 residual diagnostic plots above.
Homoscedasticity?
- There is no significant heteroscedasticity based on White’s test for either the engine displacement or model year predictors. So the assumption of equal variance among the residuals holds true.
Additive error
- It’s not clear that the effects of both IV’s are additive (and not multiplicative). The engine displacement and the model year don’t seem to be very correlated (-0.10), so any multiplicative effect is likely small.
Expected residual value is zero?
- Nonlinearity in the system might explain the non-uniform, parabolic distribution seen in the plot of standardized residuals vs. fitted values.
No colinearity between residuals
- We’re not entirely sure that all the cases are independent, as the data set likely includes similar cars from the same manufacturer that have similar specifications outside of engine displacement and model year. These similar specs could invalidate the assumption that all the cases are unrelated and independent.
No correlations between predictors and error
- The main point of concern for this is the plot of standardized residuals vs. engine displacement (see the left plot in Residual evaluation: Item #2). There appears to be some structure in that plot that could suggest the existence of some other mpg-affecting variables that we are not fully accounting for with the engine displacement variable.
No predictors are linear functions of other predictors
- The scattergram indicates that neither of the predictors is a linear predictors of the other predictor.
Zero mean of the error
- The mean of the standardized residuals is very small.

mean(mpg.stdres)

## [1] 0.002334032

4. Which of the issues, if any, present problems for your analysis?

Is your sample size too small or too large?
- Originally, by including all 33K observations and and calculating a large effect size (0.76), it became clear that we were using many more samples than necessary to reject the null hypothesis. Accordingly, I’ve used this calculated effect size to inform my downsampling (see Model Construction: Item #2).
Is there a causal relationship?
- Neither the engine displacement nor the model year are the ultimate or deterministic causes of the increase in highway fuel economy. Both are proximal and probabalistic causes because they both contributed to the effect but are definitely not the only contributing factors. Other causal variables are likely vehicle aerodynamics, tire construction, fuel composition, etc. There are certainly cases where both engine displacement and model year would have otherwise resulted in a higher fuel economy, were it not for the effects of something like an non-aerodynamic vehicle body (hence probabalistic causality).
Is there collinearity?
- There is not strong correlation between the two independent variable (engine displacement and model year), so I don’t think colinearity is a problem for this model.
Measurement error?
- We have no reason to believe that the engine displacement measurements are inaccurate, as they are part of the vehicle specifications. And unless there are data-entry errors, there is likely no error in the model year values. The only variable that could contain measurement error is the highway fuel economy. We have to assume that these values were arrived at through repeated testing with a large enough N.

Applied Regression Analysis - Project #3

John Beaulaurier

April 13, 2015