Project 3: Assumptions and Issues

Sam Rose

4/20/2015

Dataset

The dataset that I have selected is the fuel economy data set. This has information on different vehicles (make, model, etc.) and what kind of mileage they get in the city or on the highway. It consists of approximately 33 thousand observations spanning from 1984 to 2015. For the purposes of this analysis I am going to use years since 1983 to make the interpretability of the linear model summaries easier.

# load libraries
library(scatterplot3d)
library(RColorBrewer)
# load in data
load("vehicles.rdata")
ok <- complete.cases(vehicles)
vehicles <- vehicles[ok,]
attach(vehicles)
year <- year - 1983
str(vehicles)

## Classes 'tbl_df', 'tbl' and 'data.frame':    33384 obs. of  12 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...

plot(vehicles[,c('year', 'cyl', 'displ', 'hwy', 'cty')])

plot of chunk unnamed-chunk-1

IV and DV

The two independent variables I am going to select are cyl (number of cylinders in the engine) and year, and the dependent variable will be highway gas mileage. I chose not to use displ in this model even though it seems to have a significant effect on mileage because it is highly corellated with cyl.

Null/Alternate Hypothesis

My \( H_0 \) is that the year a car was made and the number of cylinders it has have no effect on the miles per gallon on the highway.

My \( H_1 \) is that the increasing the year will have a positive effect on the miles per gallon, while increasing the amount of cylinders will have a negative effect.

Model building

I am going to build a step-wise model to assess the contributions to the model adding each component one by one. The corellation matrix was used to pick terms in the order that they should be added since in constructing a step-wise model terms should be added in order of variance explained. According to the corellations, cyl should be added first, and then year.

The cyl term has an R-squared value of .426, and is highly significant in it's F statistic, meaning that it explains variance within hwy above a random chance level. The intercept for cyl is 35.8, meaning that a car with 0 cylinders would have 39 mpg. Obviously, this is inaccurate (except for electric cars!) but it makes sense with the slope of -2.14 in that the mpg will generally go down as more cylinders are added because of increased fuel consumption.

Adding the year term only adds ~ .05 to the R-squared, but it does seem to explain enough variance to be considered a significant component in the model. This model explains about 50% of the variance in highway mpg through # of cylinders and the year the card was made. The year P-value is likely inflated due to the large sample size. The \( b_0 \) for this model drops a little bit to 33.65. For every addition of 1 cylinder the mpg on the highway goes down 2.24 and for every increase in year the mpg goes up by .17, both of which make sense in reality.

cor(vehicles[,c('year', 'cyl', 'displ', 'hwy', 'cty')])

##             year        cyl       displ        hwy         cty
## year  1.00000000  0.1082726  0.06282948  0.2075353  0.09750968
## cyl   0.10827256  1.0000000  0.89888323 -0.6526728 -0.71184964
## displ 0.06282948  0.8988832  1.00000000 -0.7183769 -0.74742233
## hwy   0.20753530 -0.6526728 -0.71837691  1.0000000  0.92501599
## cty   0.09750968 -0.7118496 -0.74742233  0.9250160  1.00000000

fit <- lm(hwy ~ cyl)
summary(fit)

## 
## Call:
## lm(formula = hwy ~ cyl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2633  -2.9707  -0.2633   2.7367  31.5903 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.84868    0.08221   436.1   <2e-16 ***
## cyl         -2.14634    0.01364  -157.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.338 on 33382 degrees of freedom
## Multiple R-squared:  0.426,  Adjusted R-squared:  0.426 
## F-statistic: 2.477e+04 on 1 and 33382 DF,  p-value: < 2.2e-16

fit <- lm(hwy ~ cyl + year)
summary(fit)

## 
## Call:
## lm(formula = hwy ~ cyl + year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5335  -2.7711  -0.3395   2.3711  31.1563 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.659979   0.082129  409.84   <2e-16 ***
## cyl         -2.246568   0.012747 -176.24   <2e-16 ***
## year         0.171964   0.002368   72.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.031 on 33381 degrees of freedom
## Multiple R-squared:  0.5043, Adjusted R-squared:  0.5043 
## F-statistic: 1.698e+04 on 2 and 33381 DF,  p-value: < 2.2e-16

Plots

# plotting the model in 3d
mdl <- scatterplot3d(cyl, year, hwy, main = "# of Cylinders and Year vs. Highway MPG", pch = 21, bg = 'ivory4', xlab = "# of Cylinders", ylab = "Year since 1983", zlab = "Highway MPG", axis = TRUE)
mdl$plane3d(fit, col = 'dodgerblue4')

plot of chunk unnamed-chunk-3

Testing the Assumptions

Assumptions 1. The distribution of residuals is normal 2. The variance of the residuals for every set of values for the predictor is normal 3. The error term is additive 4. At every value of the outcome the expected value of the residuals is zero 5. The expected correlation between residuals for any 2 cases is 0 6. All predictors are uncorrelated with the error term 7. No predictors are a perfect linear function of other predictors 8. The mean of the error term is 0

1. Distribution of residuals

# plotting residuals
par(mfrow = c(2,2))

plot(fitted(fit), rstandard(fit), main = "Standardized Residuals of Linear Model", ylab = "Residual")
abline(h = 0, col = 'red', lty = 2)

# histogram of residuals
hist(rstandard(fit), xlab = "Residuals", main = "Histogram of Model Standardized Residuals")
# boxplt of residuals
boxplot(resid(fit), main = "Residual Boxplot")

# qq plot
qqnorm(resid(fit), main = "Q-Q Plot")
qqline(resid(fit))

plot of chunk unnamed-chunk-4

From the different diagnostic plots of the residual structure we can see that they are mostly normally distributed but there is a long tail of high residual values that skews the distribution. This is evident from the long tail in the histogram plot, the large number of outliers above the whiskers in the boxplot, and the deviation from the theoretical distribution at the upper end on the Q-Q plot. The residuals vs fitted values plot shows that the model is performing best at values between 15-25, but can do very poorly outside of this. There seems to be a fair amount of kurtosis going on as well. From this, I can conclude that the distribution is not exactly to normal.

2. The variance of the residuals for every set of values for the predictor is normal

First to test this I will look at the standardized residuals vs. fitted values for each term in the model.

# plot standardized residuals vs each term in the model
par(mfrow = c(1,2))
plot(cyl, rstandard(fit), main = "Cylinders vs Standardized Residuals", xlab = "Cylinders", ylab = "Standardized Residuals")
abline(h = 0, col = 'red', lty = 2)

plot(year, rstandard(fit), main = "Year vs Standardized Residuals", xlab = "Year", ylab = "Standardized Residuals")
abline(h = 0, col = 'red', lty = 2)

# run the heteroskedastity test (White's Test)
library(het.test)

## Loading required package: vars
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.1.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 3.1.3

## Loading required package: urca
## Loading required package: lmtest

plot of chunk unnamed-chunk-5

library(vars)

# hwy and cyl
mod1 = VAR(vehicles[,c('hwy', 'cyl')], p = 1)
whites.htest(mod1)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  4689.0894 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000

# hwy and year
mod2 = VAR(vehicles[,c('hwy', 'year')], p = 1)
whites.htest(mod2)

## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  4233.8485 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0000

From the standard residual vs terms plots we can clearly see that cyl exhibits Heteroscedasticity, in that it's variance changes across it's range of values. Year seems to behave a little bit better but still has a significant White's test and does seem to increase in variance as you move out from the middle. This means that there are likely terms within my model that I have not accounted for that are causing widely varying residuals. For both of these predictors we can say that by White's test they significantly reject the null hypothesis that the variables are homoscedastic and accept the alternative that the variables are heteroscedastic.

3. The error term is additive

I cannot tell from the plots from the previous assumption whether the error term is additive or not. Based on theory, I would not expect a significant interaction between years and cylinders, but there very well could have been technological improvements to engines with certain numbers of cylinders over the years that would significantly influence the error term. However, based on the heteroscedasticity of these variables this is a plausible possibility that the error term is not additive.

4. At every value of the outcome the expected value of the residuals is zero

From the standardized residuals vs terms in the model plots above for assumption 2 we can easily get an idea that this assumption is going to be false based on the amount of R-squared in the model for cylinders and the shape of it's plot vs the standardized residuals.

# plot fitted values vs standardized residuals
par(mfrow = c(1,1))
plot(fitted(fit), rstandard(fit), xlab="Fitted values", ylab="Standardized Residuals", main="Fitted Values vs. Standardized Residuals")
abline(h = 0,col='red',lty = 2)

plot of chunk unnamed-chunk-6

Clearly from this plot we can see that outside of the fitted values of 15-25 the mean of the residuals is not zero, making this model a very poor predictor at the fringes.

5. The expected correlation between residuals for any 2 cases is 0

For this we can look at the residuals plot based on sample id.

plot(resid(fit), main = "Residuals Plot", ylab = "Residuals")

plot of chunk unnamed-chunk-7

Looking at this there doesn't seem to be any noticeable latent structure in the residuals, but there are a few peaks that could be suggestive. Based on this plot though, I believe that the correlation between residuals is not an issue here although as previously mentioned technological advances over time could contribute.

6. All predictors are uncorrelated with the error term

To test this I ran a Pearson's correlation test between the predictors and the residuals. From the residual vs predictor plots above I could tell that these were probably not going to be significant correlations and they aren't.

# test correlations
cor.test(year, resid(fit))

## 
##  Pearson's product-moment correlation
## 
## data:  year and resid(fit)
## t = 0, df = 33382, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01072709  0.01072709
## sample estimates:
##          cor 
## 4.356683e-16

cor.test(cyl, resid(fit))

## 
##  Pearson's product-moment correlation
## 
## data:  cyl and resid(fit)
## t = 0, df = 33382, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01072709  0.01072709
## sample estimates:
##          cor 
## 1.828148e-14

7. No predictors are a perfect linear function of other predictors

I showed earlier that they have a low correlation, and in fact that this was the reason that I did not choos displacement as a predictor because it was very highly correlated with cylinders.

cor(vehicles[,c('year', 'cyl', 'hwy')])

##           year        cyl        hwy
## year 1.0000000  0.1082726  0.2075353
## cyl  0.1082726  1.0000000 -0.6526728
## hwy  0.2075353 -0.6526728  1.0000000

plot(vehicles[,c('year', 'cyl', 'hwy')], cex = .1)

plot of chunk unnamed-chunk-9

8. The mean of the error term is 0

This assumption is satisfied as the mean of the residuals is very close to 0.

mean(resid(fit))

## [1] -9.831174e-16

The Four Issues

1. Causality

I believe that the number of cylinders is a proximal and determinate cause. This is because the engineering of a specific engine will determine how much fuel it is going to use while driving. Obviously, there are other factors such as car weight and aerodynamics, but physically having more cylinders means that you will most likely consume more fuel because you can convert more fuel into energy faster.

Year is most likely more of a probabalistic correlation than anything else. Based on the idea that high gas mileage has become more of a desirable feature due to the large fluctuations in gas prices, the trends in engineering over the years will most likely reflect that market desire. This can correlate with the mpg but it is not the physical cause of it.

2. Sample Size

Given the large sample size of ~33k the sample here is definitely not too small. The large sample size may be boosting the p-value of the year term in the model, but it doesn't seem to be overestimating it's effect. Downsampling to 10k does not change the inherent structure in the data which is the main issue with this model, the predictions based on this don't change very much at all.

# downsampling
v.ds <- vehicles[sample(1:nrow(vehicles),10000),]
plot(v.ds[,c('year', 'cyl', 'hwy')], cex = .1)

plot of chunk unnamed-chunk-11

fit <- lm(v.ds$hwy ~ v.ds$cyl + v.ds$year)
summary(fit)

## 
## Call:
## lm(formula = v.ds$hwy ~ v.ds$cyl + v.ds$year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5217  -2.8111  -0.3291   2.4256  28.7608 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.143e+02  8.697e+00  -36.13   <2e-16 ***
## v.ds$cyl    -2.261e+00  2.354e-02  -96.05   <2e-16 ***
## v.ds$year    1.755e-01  4.357e-03   40.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.044 on 9997 degrees of freedom
## Multiple R-squared:  0.502,  Adjusted R-squared:  0.5019 
## F-statistic:  5039 on 2 and 9997 DF,  p-value: < 2.2e-16

3. Colinearity

The correlation between the predictors is not that high ~.1 so I would not suspect colinearity. I ran a vif (variance inflation factor) test to double check. The square root of this is around 1 for both variables meaning there is no inflation factors acting here.

library(car)

## Warning: package 'car' was built under R version 3.1.3

fit <- lm(hwy ~ cyl + year)
sqrt(vif(fit))

##      cyl     year 
## 1.005914 1.005914

4. Measurement Error

Measurement error should not be a problem in this dataset because the number of cylinders in a car's engine and the year it was made are discrete and known quantities. The only potential source of error here would be that of entry of the data into the dataset.