Applied Linear Regression: Project 3

NOTE: From the project outline it was found that there existed high correlation between the independent variables chosen (displacement and number of cylinders). In most of the available literature, solution to multicollinearity is:

  1. Remove highly correlated predictors from the model.
  2. Use Partial Least Squares Regression (PLS) or Principal Components Analysis.

Going by solution # 1 removing one of the variables will make the analysis limited to linear regression. Therefore, in the final project, the variable: number of cylinders(cyl) was removed and another IV: year (for the car model) was added instead for model building.

Dataset

The given data set is the fuel economy data from the EPA. It ranges from the year 1985 to 2015 for various car models and each row has a detailed specification of the vehicle. There are 74 variables and 34632 records in the original data set.

Below is the summary of data for a small subset of the entire dataset:

#vehicles<- (read.csv(file.choose(), header=T))
install.packages("fueleconomy", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/uzma/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## package 'fueleconomy' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\uzma\AppData\Local\Temp\RtmpgDy6C4\downloaded_packages
library("fueleconomy", lib.loc="C:\\Users\\uzma\\Documents\\R\\win-library\\3.1")
## Warning: package 'fueleconomy' was built under R version 3.1.3
x<-vehicles
head(x)
##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

Here we see how the data is organized and also create a data frame:

data.frame.new<-x
NROW(data.frame.new)
## [1] 33442
tail(data.frame.new)
##          id  make                             model year       class
## 33437 31064 smart   fortwo electric drive cabriolet 2011 Two Seaters
## 33438 33305 smart fortwo electric drive convertible 2013 Two Seaters
## 33439 34393 smart fortwo electric drive convertible 2014 Two Seaters
## 33440 31065 smart       fortwo electric drive coupe 2011 Two Seaters
## 33441 33306 smart       fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart       fortwo electric drive coupe 2014 Two Seaters
##                trans            drive cyl displ        fuel hwy cty
## 33437 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33438 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33439 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33440 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  79  94
## 33441 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33442 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
str(data.frame.new)
## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...
summary(data.frame.new)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

Hypothesis under test and the variables involved

The null hypothesis for this ‘explanatory model’ can be stated as: “The variation in the year (make year) (IV 1) and the displacement in engine(litres) (IV 2) cannot explain the variation in the vehicle’s city mileage (mpg) (DV)”
(Here IV - independent variable and DV - dependent variable)

Alternate hypothesis: “The variability in either or both the independent variables(make year and engine displacement) can explain the variability in the dependent variable”

Model building

The rationale behind this design: A factor like displacement in engine is the volume of air, the engine can consume in a single revolution. So the more air an engine moves, the more fuel it can consume with every turn. Essentially four-cylinder engines are thought of as more fuel-efficient than bigger motors like V-6’s or V-8’s. Therefore there is a chance that this independent variable impacts the city mileage of a vehicle. Further, with ever improving technology most car manufacturers are trying to come up with fuel efficient models.The real test of a vehicle’s mileage is in a chaotic city traffic, therefore it is quite important for the car manufacturer to see the impact of yearly change in design/features on the city mileage.

Model: Heirarchical Model

We aim to build a ‘heirarchical model’ for this problem which corresponds to a model in which we enter variables into the model in a predetermined manner based on a theoretical description of the model.

The dependent variable, city mileage can depend a number of factors ranging from those that correspond to the vehicle itself and also those that correspond to the external environment. However with the given dataset we can only focus our attention to the vehicle dependent factors. As described above, our choice of indpendent variables is based on some prior knowledge of the problem.The given dataset has multiple factors, however we focus our attention to the 2 variables mentioned above (engine displacement and make year) due to the following reasons:

  1. Engine displacement(displ): Engine displacement is the volume of an engine’s cylinders, therefore it is a general indicator of engines size and power. We can think of this variable as having a larger impact on the dependent variable than any other factors given its direct relationship with power. In order to test this relationship we develop a regression model here with this variable as one of the independent variable.

  2. Year(year): This variable is considered important because with technology changing at an extremely fast pace,most car manufacturers are trying to improve on mileage with each passing year.

myvars <- c("cty", "displ","year")
newdata1 <- data.frame.new[myvars]

attach(newdata1)

teststudy.lm1 <- lm(cty~displ+year) 
summary(teststudy.lm1)
## 
## Call:
## lm(formula = cty ~ displ + year)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5540  -1.7008  -0.4265   1.0280  29.9922 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.159e+02  3.473e+00  -33.37   <2e-16 ***
## displ       -2.549e+00  1.198e-02 -212.73   <2e-16 ***
## year         7.093e-02  1.738e-03   40.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.971 on 33382 degrees of freedom
##   (57 observations deleted due to missingness)
## Multiple R-squared:  0.5795, Adjusted R-squared:  0.5795 
## F-statistic: 2.301e+04 on 2 and 33382 DF,  p-value: < 2.2e-16

Power analysis

#Determine effect size for power analysis
r2 <- summary(teststudy.lm1)$r.squared
f2 <- r2 / (1-r2)
f2
## [1] 1.378315

For the above found effect size value, we get a sample size value of N=11 from GStar power analysis. However, we can use the effect size convention for an f value to be 0.02,0.15,0.35 for a small,medium,large size respectively.

Selecting our desired effect size to be low we get a required sample size of 485 for a Power of 80%. Therefore, we extract a sample of this size to re-build our heirarchical model:

detach(newdata1)
y <- 1:nrow(newdata1)
set.seed(99)
indices <- sample(y,485,replace=F)
newdata <- newdata1[indices,]
attach(newdata)

# Final model

teststudy.lm2 <- lm(cty~displ+year) 
summary(teststudy.lm2)
## 
## Call:
## lm(formula = cty ~ displ + year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8220 -1.4843 -0.4534  0.8587 23.3415 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -125.15361   28.28698  -4.424 1.20e-05 ***
## displ         -2.44188    0.10022 -24.364  < 2e-16 ***
## year           0.07532    0.01415   5.322 1.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.882 on 482 degrees of freedom
## Multiple R-squared:  0.5588, Adjusted R-squared:  0.5569 
## F-statistic: 305.2 on 2 and 482 DF,  p-value: < 2.2e-16

In the model with the reduced sample size we see the R-squared value decreases to 55% from 58%.However, both variables are found to be significant and we can infer that 55% of the variability in the DV can be explained by the 2 IVs together.

Based on both our models we can say that we reject the null hypothesis that the variation in the city mileage cannot be explained by the independent variables or in other words we can conclude that there is something other than randomization that causes the variability in the DV (Therefore the variability observed in the dependent variable can be attributed to the variability in one or all of the 2 the independent variables (with the given F-Statistic value).

Model Interpretation:

b0: If there was a car with zero engine displacement manufactured in the very first year of this survey, it would have city mileage of -1.159e+02 mpg.

b1: In a particular model year, the addition of each additional liter of engine displacement contributes to a loss of 2.549e+00.

b2: For a fixed engine displacement, each model year adds 7.093e-02 mpg to each vehicle.

R-squared: Both the IVs contribute to 54.8% of the observed variation in the DV-city mileage.

Plots (Scatter Plots and Regression plane)

# 3D Scatterplot with Coloring and Vertical Drop Lines
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.3
library(MASS)
## Warning: package 'MASS' was built under R version 3.1.3
library(rgl)
## Warning: package 'rgl' was built under R version 3.1.3
attach(newdata) 
## The following objects are masked from newdata (pos = 6):
## 
##     cty, displ, year
s3d <-scatterplot3d(displ,year,cty,pch=16,angle = 30,color="blue",highlight.3d=TRUE,type="h", main="3D Scatterplot: cty (MPG) vs year vs displ (litres)") 
## Warning in scatterplot3d(displ, year, cty, pch = 16, angle = 30, color =
## "blue", : color is ignored when highlight.3d = TRUE
s3d$plane3d(teststudy.lm2)

Checking Model Assumptions

Normality

# distribution of residuals

testmodel.resid <-teststudy.lm2$residuals
testmodel.st.resid <-rstandard(teststudy.lm2)

hist(testmodel.resid,breaks=80, xlab="Residual", main="Residuals distribution")

hist(testmodel.st.resid,breaks=80, xlab="Standardized Residual", main="Standardized Residuals distribution")

Histogram shows some kurtosis and many outliers to the right.This means there is some bias in the model. Further, we can say that there is some skew to the left. Next, we plot the box-plots:

boxplot(testmodel.resid, xlab="City Mileage ~ Displacement + Year", ylab="Residuals", main="Residuals boxplot")

boxplot(testmodel.st.resid, xlab="City Mileage ~ Displacement + Year", ylab="Standardized Residuals", main="St. Residuals boxplot")

Box-plots confirm presence of outliers. Finally we plot the P-P plots to see if there is some large deviation from normality:

qqnorm(testmodel.resid, xlab="Normal quantiles", ylab="Residual quantiles", main="QQ plot:residuals vs. normal")
qqline(testmodel.resid,col='blue')

qqnorm(testmodel.st.resid, xlab="Normal quantiles", ylab="St.Residual quantiles", main="QQ plot: st.residuals vs. normal")
qqline(testmodel.st.resid,col='blue')

As was evident from the histograms and box-plots, we see a similar trend here. There is deviation from normality especially at the two extreme ends. However, a fair number of data points seem to fall within the range of normality, therefore we can conclude that this assumption is not violated.

Heteroscedasticity

In order to check for Heteroscedasticity, we first perform a visual analysis through scatter plots of the residuals, plotted against each IV. The residuals plot can also be used to test the homogeneity of variance (homoscedasticity ) assumption(2 and 3).

## Plots for standard residuals

plot(displ,testmodel.st.resid,xlab="Engine displacement in litres", ylab="Standardized residuals")
abline(0,0, lwd=2, col='blue')

plot(year,testmodel.st.resid,xlab="Make Year", ylab="Standardized residuals")
abline(0,0, lwd=2, col='blue')

There is clearly a difference in the variance of the points through the entire range as we move from left to right. This means that there is some heteroscedasticity in the model which is less for the IV-year and more in the IV-engine displacement. We look at the vertical scatter at a given point along the x-axis. The homogeneity of variance assumption is not supported since the vertical scatter does not seem uniform.To confirm this we perform the White’s test:

library(het.test) 
## Warning: package 'het.test' was built under R version 3.1.3
## Loading required package: vars
## Warning: package 'vars' was built under R version 3.1.3
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.1.3
## Loading required package: urca
## Warning: package 'urca' was built under R version 3.1.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.1.3
library(vars)

attach(newdata)
## The following objects are masked from newdata (pos = 10):
## 
##     cty, displ, year
## 
## The following objects are masked from newdata (pos = 14):
## 
##     cty, displ, year
dataset <-data.frame(y=cty,x=displ)
datasetnew<-na.omit(dataset)
model1 <-VAR(datasetnew, p=1)

whites.htest(model1)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  25.2911 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.0135
dataset1 <-data.frame(y=cty,x=year)
datasetnew1<-na.omit(dataset1)
model2 <-VAR(datasetnew1, p=1)

whites.htest(model2)
## 
## White's Test for Heteroskedasticity:
## ==================================== 
## 
##  No Cross Terms
## 
##  H0: Homoskedasticity
##  H1: Heteroskedasticity
## 
##  Test Statistic:
##  11.3912 
## 
##  Degrees of Freedom:
##  12 
## 
##  P-value:
##  0.4957

From the results we see that, for the IV engine displacement p<0.05 therefore we reject the null hypothesis at 5% significance (alpha) and conclude that there is heteroskedasticity. However, for the second IV-year, p>0.05 therefore we fail to reject H0 and so there is no heteroskedasticity.

Expected value of residuals is zero

plot(fitted(teststudy.lm2), testmodel.st.resid, xlab="Fitted values", ylab="Standardized residual", main="St. residuals vs. fitted values")
abline(0,0,col='blue',lwd=2)
lines(smooth.spline(fitted(teststudy.lm2), testmodel.st.resid,df=15),col='red',lwd=2)

In general, values that are close to the horizontal line are predicted well. The points above the line are underpredicted and the ones below the line are overpredicted. The linearity assumption seems to be violated because the number of points above the line and below the line are not equal in number. This is again points to the fact that the mean of the residuals is not zero as can be seen in the above plot of standard residuals versus fitted values. Non-linearity can also be seen.

Expected correlation between variables and residuals

par(mfrow=c(1,1))
plot(testmodel.st.resid,cex=0.5,ylab="Standard Residuals",main="Residuals")
abline(0,0,col="red",lwd = 2)
lines(smooth.spline(testmodel.st.resid),col='blue',lwd=2)

The independence assumtion (error term uncorelated with the IVs) seem to hold for the entire dynamic range.

Additvity of IVs

test.lm.new<-lm(displ~year)
summary(test.lm.new)
## 
## Call:
## lm(formula = displ ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2577 -1.0574 -0.3004  0.9210  3.4568 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.884552  12.832606  -0.848    0.397
## year          0.007125   0.006417   1.110    0.267
## 
## Residual standard error: 1.308 on 483 degrees of freedom
## Multiple R-squared:  0.002545,   Adjusted R-squared:  0.0004803 
## F-statistic: 1.233 on 1 and 483 DF,  p-value: 0.2675

AS can be seen from the p-values, there is no significant relationship between the variables. So we can assume that the error terms are also additive for the two IVs.

Relationship between predictors

plot(newdata)

The plots do not show any significant relationships between the predictor variables. We can assume no correlation here.

Mean of error term

mean(testmodel.st.resid)
## [1] 0.00069174

This is approaching zero.

Summary for Assumptions

Here we summarize our results in terms of the eight assumtions that underlie regression modelling:

  1. Distribution of residuals is fairly normal.This can be inferred from the histograms, box-plots and Q-Q plots when all taken together.

  2. Variance of residuals for each predictor (set of values) is not equal. This assumption is violated by the predictor variable engine displacement although not to a very high degree as can be seen from the plots and White’s test as well.

  3. Error term is additive and there are no significant interactions found in the model. This is more dependent on the underlying theory i.e. the rationale of the model itself. There does not seem to be a direct multiplicative interaction between the IVs. From the fitted values plot there does seem to be some deviation but not too significant. Further the regression test performed to analyze any interactions gives a non-significant p-value indicating very little or no interactions.

  4. Expected Mean of the residuals is not exactly zero and there seems to be some non-linear relationship if we examine portions of the entire dynamic range. However, a holistic overview suggests that the mean would be approaching zero.

  5. Expected correlation between residuals for any 2 cases is zero as can be seen from the residual plot vs index. There are some extreme values, however no conspicuos patterns can be seen so we can say that there is lack of auto-corelation.

  6. Both Predictor variables can be assumed to be uncorrelated with the error term as can be seen from the plots of standard residuals with each IV. We can further check this assumtion by performing a regression test to check if the IVs are in some way explaining the variability in the residuals.

test.error.lm1<-lm(testmodel.st.resid~displ+year)

summary(test.error.lm1)
## 
## Call:
## lm(formula = testmodel.st.resid ~ displ + year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6782 -0.5168 -0.1582  0.2985  8.1377 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.422e-02  9.857e+00   0.003    0.997
## displ        1.300e-04  3.492e-02   0.004    0.997
## year        -1.699e-05  4.932e-03  -0.003    0.997
## 
## Residual standard error: 1.004 on 482 degrees of freedom
## Multiple R-squared:  5.081e-08,  Adjusted R-squared:  -0.004149 
## F-statistic: 1.225e-05 on 2 and 482 DF,  p-value: 1

Given the high p values and negligible R-squared, we can say that this assumption holds.

  1. No predictors are a perfect linear function of other predictors i.e. there is no perfect multicollinearity in the model. This is evident in the plot of Engine displacement vs. Year.

  2. Mean of the error term is not exactly zero but can be considered zero since it is extremely small.

Issues

  • Collinearity
cor(newdata,use="complete")
##              cty       displ       year
## cty    1.0000000 -0.72996114 0.12398398
## displ -0.7299611  1.00000000 0.05045227
## year   0.1239840  0.05045227 1.00000000

There is some correlation between the predictors. However, this does not seem to be too high to interfere with the overall predictions. Also, N is not very small so this does not seem to be a problem.

  • Causality: The two independent variables in this analysis - year and engine displacement - are proximal and probabilistic causes of city mileage.

  • Measurement error: Might not exist beacuse this is survey data from EPA and is supposed to be accurate as it is to be carried out for every vehicle.