Data selection

This data set comes from the fueleconomy R package (hadley/fueleconomy on GitHub) and contains fuel economy information covering a period of 31 years spanning 1984-2015. There are a total of 33,442 entries.

I’m focusing on the following variables:
- year: Car model year
- displ: Engine displacement (L)
- hwy: Highway miles per gallon (mpg)

The null and alternate hypotheses that I am testing are the following:

\(H_0\) = The fuel economy is not related to the car model year or the engine displacement.
\(H_1\) = The fuel economy is a function of BOTH the car model year and the engine displacement.

## Downloading github repo hadley/fueleconomy@master
## Installing fueleconomy
## '/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL  \
##   '/private/var/folders/hm/8t_hdfyd4pnbr0lg32923mkw0000gn/T/Rtmpap3axv/devtools6056471b567/hadley-fueleconomy-188fa25'  \
##   --library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library'  \
##   --install-tests

Before I proceed with analyzing this data set, I want to remove any entries (rows) that contain NA values in the three variables of interest:

library(fueleconomy)  # provides the vehicles data frame
v = vehicles[,c('year','displ','hwy')]
v = v[complete.cases(v),]
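As a minimal sketch of what the `complete.cases` filter does, the toy data frame below (hypothetical values, not taken from the vehicles data) has one row with a missing displacement and one with a missing mpg value. Only rows with no NA in any column are kept.

```r
# Toy data frame with hypothetical values; NA marks a missing measurement
toy <- data.frame(
  year  = c(1984, 1985, 1986, 1987),
  displ = c(2.0, NA, 3.5, 1.8),
  hwy   = c(30, 28, NA, 33)
)

# TRUE for rows that have no NA in any column
keep <- complete.cases(toy)
keep

# Keep only the complete rows, as done for v above
clean <- toy[keep, ]
nrow(clean)

# Count of missing values per column, useful to see what is being dropped
colSums(is.na(toy))
```

Checking `colSums(is.na(...))` before filtering shows which variable is responsible for the dropped rows.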

I also want to subtract 1984 from the year values, so that our intercept values will make more sense during the interpretation of the model statistics. The year values will now mean “years since 1984”.

v$year = v$year - 1984
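The effect of this shift can be illustrated with a small sketch on made-up numbers (the `toy` data frame and its values are hypothetical). Subtracting a constant from a predictor leaves the slope unchanged and moves only the intercept, which then equals the fitted value for model year 1984.

```r
# Toy data (hypothetical values, not from the vehicles data set)
toy <- data.frame(
  year = c(1984, 1990, 1996, 2002, 2008, 2014),
  hwy  = c(20, 21, 23, 24, 26, 27)
)

# Fit with the raw calendar year
fit_raw <- lm(hwy ~ year, data = toy)

# Fit with year re-expressed as "years since 1984"
toy$year0 <- toy$year - 1984
fit_shifted <- lm(hwy ~ year0, data = toy)

# Same slope, different intercept
coef(fit_raw)
coef(fit_shifted)

# The shifted intercept is the raw model's prediction for 1984
predict(fit_raw, newdata = data.frame(year = 1984))
```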

Model construction

First, I want to get an overview of the data in a scattergram:

plot(v,pch=19, cex=1, col="#00000010")

The only obvious outliers here are where we seem to have data points where engine displacement = 0. This cannot be right, so I’ll remove these entries and replot the scattergram.

v = v[v$displ > 0,]
plot(v,pch=19, cex=1, col="#00000010")

I’ll look at the correlation matrix for these three variables:

cor(v)

##             year       displ        hwy
## year  1.00000000  0.06278627  0.2075190
## displ 0.06278627  1.00000000 -0.7186687
## hwy   0.20751902 -0.71866866  1.0000000

I’ve chosen car model year and engine displacement as independent variables because these are vehicle specifications that cannot be changed. The highway fuel economy, on the other hand, is likely a function of a variety of factors. These likely include (but are not limited to) the model year and engine displacement.

To construct the model, I’m using a hierarchical approach.

I’ll start with the independent variable ‘engine displacement’ because it has the largest correlation with highway fuel economy (-0.72).

mpg.lm_H1 <- lm(v$hwy ~ v$displ)
summary(mpg.lm_H1)

## 
## Call:
## lm(formula = v$hwy ~ v$displ)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5284  -2.5284  -0.3967   2.3138  30.4187 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.60769    0.05799   579.6   <2e-16 ***
## v$displ     -3.02642    0.01603  -188.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.981 on 33380 degrees of freedom
## Multiple R-squared:  0.5165, Adjusted R-squared:  0.5165 
## F-statistic: 3.566e+04 on 1 and 33380 DF,  p-value: < 2.2e-16

According to the model summary, this model, which considers only the engine displacement, accounts for 52% of the variation observed in the highway fuel economy (Adjusted R-squared = 0.5165).
Next, I’ll add the model year (correlation with highway fuel economy = 0.21) to the model and see what effect it has on the explained variation.

mpg.lm_H2 <- lm(v$hwy ~ v$displ + v$year)
summary(mpg.lm_H2)

## 
## Call:
## lm(formula = v$hwy ~ v$displ + v$year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6082  -2.3083  -0.3834   1.8613  30.1202 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.49420    0.06159  511.38   <2e-16 ***
## v$displ     -3.09348    0.01496 -206.82   <2e-16 ***
## v$year       0.15494    0.00217   71.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.708 on 33379 degrees of freedom
## Multiple R-squared:  0.5806, Adjusted R-squared:  0.5805 
## F-statistic: 2.31e+04 on 2 and 33379 DF,  p-value: < 2.2e-16

Now, the model explains more of the variation than was explained using the engine displacement alone (Adjusted R-squared rises from 0.5165 to 0.5805).
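A hierarchical comparison like this one is often backed by a partial F-test between the nested models. The sketch below uses simulated data (the `sim` data frame and its generating coefficients are assumptions chosen to resemble the fitted values, not the real data) and the `data =` form of `lm`, which differs from the `v$hwy ~ v$displ` form used in this report.

```r
# Simulated data; the coefficients 31, -3 and 0.15 are illustrative assumptions
set.seed(1)
n <- 200
sim <- data.frame(
  displ = runif(n, 1, 6),
  year  = sample(0:31, n, replace = TRUE)
)
sim$hwy <- 31 - 3 * sim$displ + 0.15 * sim$year + rnorm(n, sd = 3)

# Nested models: displacement only, then displacement plus year
m1 <- lm(hwy ~ displ, data = sim)
m2 <- lm(hwy ~ displ + year, data = sim)

# Adjusted R-squared before and after adding year
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared

# Partial F-test: does adding year significantly reduce the residual error?
cmp <- anova(m1, m2)
cmp
```

When the models differ by a single term, this F statistic is the square of that term's t value, so for the real data the year t value of 71.41 already answers the same question.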

Here is the 3D scatterplot showing how well our model (blue plane, with planes built from the lower and upper 95% confidence limits of the coefficients shown in red) represents our data.

#install.packages("scatterplot3d")
library(scatterplot3d)
H2_3d <- scatterplot3d(v$displ, v$year, v$hwy, pch=19, color="#00000010", main = "Highway fuel economy vs. engine displacement and model year", bg = 'ivory4', xlab = "Displacement (L)", ylab = "Model year", zlab = "Highway fuel economy (mpg)", axis=TRUE)
H2_3d$plane3d(mpg.lm_H2, col = 'blue')
H2_3d$plane3d(confint(mpg.lm_H2)[,1],col="red")
H2_3d$plane3d(confint(mpg.lm_H2)[,2],col="red")
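The red planes above are built from the lower and upper confidence limits of all three coefficients at once. A pointwise confidence interval for the fitted surface is a different quantity, and one common way to get it is `predict` with `interval = "confidence"`. The sketch below is an illustration on simulated data (`sim` and `grid` are hypothetical), and it assumes the model is fitted with `data =` so that `newdata` can be supplied.

```r
# Simulated data with assumed coefficients, for illustration only
set.seed(2)
n <- 200
sim <- data.frame(
  displ = runif(n, 1, 6),
  year  = sample(0:31, n, replace = TRUE)
)
sim$hwy <- 31 - 3 * sim$displ + 0.15 * sim$year + rnorm(n, sd = 3)

fit <- lm(hwy ~ displ + year, data = sim)

# A small grid of predictor combinations to evaluate the fitted plane on
grid <- expand.grid(displ = c(2, 4), year = c(0, 30))

# Fitted value plus lower and upper 95% confidence limits at each grid point
ci <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
cbind(grid, ci)
```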

Residual evaluation

We’ll make a series of plots that will tell us whether our residuals are normally distributed:

1. Scatterplot of the standardized residuals as a function of the fitted values. This will give us a sense of how well our model is working.
1. Histogram of the standardized residual values. This will help us determine the normality of the model residuals.
1. Boxplot of the standardized residuals, which will also show us if the values are skewed to one side of the distribution.
1. QQ plot of the residuals distribution vs. a theoretical normal distribution. This will reveal how much we’re differing from the normal.

mpg.res = resid(mpg.lm_H2)
mpg.stdres = rstandard(mpg.lm_H2)
par(mfrow=c(2,2))
plot(fitted(mpg.lm_H2), mpg.stdres, xlab="Fitted values", ylab="Standardized residual", main="(a) Std. residuals vs. fitted values")
abline(0,0,col='red')
hist(mpg.stdres,breaks=80, xlab="Standardized residual", main="(b) Std. residuals distribution")
boxplot(mpg.stdres, xlab="MPG ~ Displacement + Year", ylab="Standardized residual", main="(c) Std. residuals boxplot")
qqnorm(mpg.stdres, xlab="Normal quantiles", ylab="Standardized residual quantiles", main="(d) QQ plot: std. residuals vs. normal")
qqline(mpg.stdres,col='red')

par(mfrow=c(1,1))

1. The first plot shows that the linear model performs much better at mid-range fitted values than at either end. In particular, the predicted highway fuel economy values undershoot the actual highway fuel economy values at the upper and lower extremes.
1. The histogram suggests that there is some non-normality in the residuals. Specifically, the distribution of residuals is skewed toward the right tail. There is no evidence of excess kurtosis.
1. The boxplot confirms that the right tail of the distribution is larger than the left tail, but there are no obvious outliers.
1. The QQ plot shows that the residual values do deviate substantially from the theoretical normal distribution.
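The visual impressions of skewness and kurtosis can be backed by numbers. The sketch below defines two small helper functions (`skewness` and `excess_kurtosis` are hypothetical names written here in base R, not functions from a package) and tries them on simulated samples. Both are close to zero for a normal sample, and skewness is positive for a right-skewed one.

```r
# Sample skewness: third standardized moment
skewness <- function(x) {
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
  mean(z^3)
}

# Excess kurtosis: fourth standardized moment minus 3 (the normal value)
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
  mean(z^4) - 3
}

set.seed(3)
symmetric_sample <- rnorm(5000)      # normal, so no skew expected
right_skewed     <- rexp(5000) - 1   # exponential, long right tail

skewness(symmetric_sample)
skewness(right_skewed)
excess_kurtosis(symmetric_sample)
```

Applied to this report, the calls would be `skewness(mpg.stdres)` and `excess_kurtosis(mpg.stdres)`.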

Next we’ll check whether the residuals have equal variance at all values of each of the predictors (independent variables). This is where we look for either homo- or heteroscedasticity. We can check this visually by referring to scatterplots of the residuals plotted against each of the independent variable values.

par(mfrow=c(1,2))
plot(v$displ, mpg.stdres, xlab="Engine displacement (L)", ylab="Standardized residual", main="Std. residuals vs. engine displacement", pch=19, col="#00000010")
abline(0,0,col='red')
plot(v$year, mpg.stdres, xlab="Years since 1984", ylab="Standardized residual", main="Std. residuals vs. model year", pch=19, col="#00000010")
abline(0,0,col='red')

par(mfrow=c(1,1))

For both these plots, if you consider different vertical strips of data points from left to right, there is clearly a difference in the variance of the points in each strip as you move from left to right. This means that there is heteroscedasticity in the model.
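The "vertical strips" idea can be made numeric by cutting the predictor into bins and computing the residual variance within each bin. The sketch below uses simulated residuals whose spread grows with the predictor by construction (all names and values are hypothetical), so the per-strip variances rise from left to right.

```r
# Simulated predictor and residuals; the spread is made to grow with x
set.seed(4)
n <- 2000
x   <- runif(n, 1, 6)
res <- rnorm(n, sd = 0.5 * x)

# Five equal-width vertical strips along the predictor axis
strips <- cut(x, breaks = 5)

# Residual variance within each strip; roughly equal values would
# suggest homoscedasticity, a clear trend suggests heteroscedasticity
strip_var <- tapply(res, strips, var)
strip_var
```

For the model in this report, the analogous call would be `tapply(mpg.stdres, cut(v$displ, breaks = 5), var)`, and likewise for `v$year`.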

Next, we’ll see whether the mean of the residual for each fitted value is zero:

plot(fitted(mpg.lm_H2), mpg.stdres, xlab="Fitted values", ylab="Standardized residual", main="(a) Std. residuals vs. fitted values")
abline(0,0,col='red',lwd=5)
lines(smooth.spline(fitted(mpg.lm_H2), mpg.stdres,df=15),col='blue',lwd=5)

This shows us that the means of the residuals are non-zero for a significant chunk of the fitted values. This likely means that our linear model is attempting to model a non-linear phenomenon.
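The same check can be done without a smoother by averaging the residuals within bins of the fitted values. The sketch below is an illustration on simulated data in which the true relationship is curved by construction (the `60 / x` form is an assumption picked for the example, not a claim about the vehicles data). A straight-line fit then leaves positive mean residuals at both ends and negative ones in the middle.

```r
# Simulated data with a curved (convex) true relationship
set.seed(5)
x <- runif(1000, 1, 6)
y <- 60 / x + rnorm(1000, sd = 1)

# Straight-line fit to curved data
fit <- lm(y ~ x)

# Mean residual within five equal-width bins of the fitted values;
# values far from zero indicate the linear form is missing curvature
bins <- cut(fitted(fit), breaks = 5)
bin_mean <- tapply(resid(fit), bins, mean)
bin_mean
```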

Interpret the model

summary(mpg.lm_H2)

## 
## Call:
## lm(formula = v$hwy ~ v$displ + v$year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6082  -2.3083  -0.3834   1.8613  30.1202 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.49420    0.06159  511.38   <2e-16 ***
## v$displ     -3.09348    0.01496 -206.82   <2e-16 ***
## v$year       0.15494    0.00217   71.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.708 on 33379 degrees of freedom
## Multiple R-squared:  0.5806, Adjusted R-squared:  0.5805 
## F-statistic: 2.31e+04 on 2 and 33379 DF,  p-value: < 2.2e-16

According to our linear model parameters:

\(b_0\): A theoretical car with zero engine displacement manufactured in 1984 (year = 0 after the shift applied earlier) would have a highway fuel economy of 31.5 mpg.
\(b_1\): For a given model year, each additional liter of engine displacement contributes to a loss of 3.1 mpg.
\(b_2\): On the other hand, for a fixed engine displacement, each subsequent model year adds 0.15 mpg to each vehicle.
\(R^2\): Overall, the engine displacement and the model year account for 58% of the observed variation in highway fuel economy (Adjusted R-squared = 0.5805).
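As a worked example of these parameters, the sketch below plugs the estimates from the summary into the fitted equation. The helper name `predict_hwy` and the example car (2.0 L, model year 2010) are hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from summary(mpg.lm_H2)
b0 <- 31.49420   # intercept: zero displacement, model year 1984
b1 <- -3.09348   # mpg change per additional liter of displacement
b2 <- 0.15494    # mpg change per model year after 1984

# Predicted highway mpg; the model uses "years since 1984"
predict_hwy <- function(displ, model_year) {
  b0 + b1 * displ + b2 * (model_year - 1984)
}

# A 2.0 L car from model year 2010:
# 31.49420 - 3.09348 * 2 + 0.15494 * 26 = 29.33568, about 29.3 mpg
predict_hwy(2.0, 2010)
```

Given the residual standard error of 3.708 mpg, individual vehicles can be expected to scatter several mpg around such a prediction.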

To test for heteroscedasticity formally, I’ll use White’s test twice: once for each independent variable. In the first case, the y value is the dependent variable ‘highway fuel economy’ and the x value is the independent variable ‘engine displacement’:

library(vars)      # provides VAR()
library(het.test)  # provides whites.htest()
dataset <- data.frame(y=v$hwy, x=v$displ)
model1 <- VAR(dataset, p = 1)
whites.htest(model1)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  6201.3419 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000

The p-value of 0.0000 indicates that there is significant heteroscedasticity (in agreement with our visual assessment of the residuals plot above). For the second case, we’ll replace the x value with the independent variable ‘model year’:

dataset <- data.frame(y=v$hwy, x=v$year)
model1 <- VAR(dataset, p = 1)
whites.htest(model1)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  4233.3897 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000

Again, the p-value indicates significant heteroscedasticity (also in agreement with the above plot). In both cases, the variance in highway fuel economy cannot be fully explained by the variance in either the engine displacement or model year. This means that our model is insufficient in some way, such as not completely specifying an independent variable.
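The `whites.htest` calls above operate on a VAR model object. For a plain `lm` fit, White's test is commonly written as an auxiliary regression: regress the squared residuals on the predictor and its square, then compare n times the auxiliary R-squared to a chi-squared distribution. The sketch below is a base R illustration of that textbook form on simulated data; `white_test` is a hypothetical helper, not the function used in this report, and its statistic will not match the values printed above.

```r
# White's test for a single-predictor lm, via the auxiliary regression
white_test <- function(fit, x) {
  u2   <- resid(fit)^2                     # squared residuals
  aux  <- lm(u2 ~ x + I(x^2))              # auxiliary regression
  stat <- length(u2) * summary(aux)$r.squared
  df   <- 2                                # two auxiliary regressors
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

set.seed(6)
n <- 1000
x <- runif(n, 1, 6)
y_het <- 30 - 3 * x + rnorm(n, sd = 0.5 * x)  # error spread grows with x
y_hom <- 30 - 3 * x + rnorm(n, sd = 1.5)      # constant error spread

res_het <- white_test(lm(y_het ~ x), x)
res_hom <- white_test(lm(y_hom ~ x), x)
res_het
res_hom
```

A small p-value rejects homoscedasticity, as in the heteroscedastic case simulated here.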

Which of the issues, if any, present problems for your analysis?

Is your sample size too small or too large?
- No, we have over 33K observations.
Is there a causal relationship?
- UNKNOWN
Is there collinearity?
- UNKNOWN
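
One common way to approach the collinearity question is the variance inflation factor (VIF). With only two predictors, the VIF of each is 1 / (1 - r^2), where r is the correlation between them. The sketch below takes r from the `cor(v)` output shown earlier; treating a VIF near 1 as a sign of little collinearity is a conventional rule of thumb, not a conclusion drawn in the original analysis.

```r
# Correlation between year and displ, copied from the cor(v) output
r_year_displ <- 0.06278627

# With two predictors, both share the same variance inflation factor
vif <- 1 / (1 - r_year_displ^2)
vif
```

A value this close to 1 would mean the standard errors of the two slope estimates are barely inflated by the overlap between the predictors.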

Applied Regression Analysis - Project #3

John Beaulaurier

April 13, 2015
