Assignment 4: Logistic Regression

Sam Rose

5/14/2015

Dataset

The dataset I am using is a subset of a fuel economy dataset from the EPA's National Vehicle and Fuel Emissions Laboratory in Michigan in collaboration with automakers. The dataset can be found here: http://catalog.data.gov/dataset/fuel-economy-data. The dataset includes information from 1984 to present day.

# read in data
v <- read.csv("vehicles.csv", h = T, stringsAsFactors = F)
str(v)

## 'data.frame':    35907 obs. of  83 variables:
##  $ barrels08      : num  15.7 30 12.2 30 17.3 ...
##  $ barrelsA08     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ charge120      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ charge240      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ city08         : int  19 9 23 10 17 21 22 23 23 23 ...
##  $ city08U        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cityA08        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cityA08U       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cityCD         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cityE          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cityUF         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ co2            : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ co2A           : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ co2TailpipeAGpm: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ co2TailpipeGpm : num  423 808 329 808 468 ...
##  $ comb08         : int  21 11 27 11 19 22 25 24 26 25 ...
##  $ comb08U        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ combA08        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ combA08U       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ combE          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ combinedCD     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ combinedUF     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cylinders      : int  4 12 4 8 4 4 4 4 4 4 ...
##  $ displ          : num  2 4.9 2.2 5.2 2.2 1.8 1.8 1.6 1.6 1.8 ...
##  $ drive          : chr  "Rear-Wheel Drive" "Rear-Wheel Drive" "Front-Wheel Drive" "Rear-Wheel Drive" ...
##  $ engId          : int  9011 22020 2100 2850 66031 66020 66020 57005 57005 57006 ...
##  $ eng_dscr       : chr  "(FFS)" "(GUZZLER)" "(FFS)" "" ...
##  $ feScore        : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ fuelCost08     : int  1900 3650 1500 3650 2400 1850 1600 1700 1550 1600 ...
##  $ fuelCostA08    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fuelType       : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ fuelType1      : chr  "Regular Gasoline" "Regular Gasoline" "Regular Gasoline" "Regular Gasoline" ...
##  $ ghgScore       : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ ghgScoreA      : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ highway08      : int  25 14 33 12 23 24 29 26 31 30 ...
##  $ highway08U     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ highwayA08     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ highwayA08U    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ highwayCD      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ highwayE       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ highwayUF      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hlv            : int  0 0 19 0 0 0 0 0 0 0 ...
##  $ hpv            : int  0 0 77 0 0 0 0 0 0 0 ...
##  $ id             : int  1 10 100 1000 10000 10001 10002 10003 10004 10005 ...
##  $ lv2            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lv4            : int  0 0 0 0 14 15 15 13 13 13 ...
##  $ make           : chr  "Alfa Romeo" "Ferrari" "Dodge" "Dodge" ...
##  $ model          : chr  "Spider Veloce 2000" "Testarossa" "Charger" "B150/B250 Wagon 2WD" ...
##  $ mpgData        : chr  "Y" "N" "Y" "N" ...
##  $ phevBlended    : chr  "false" "false" "false" "false" ...
##  $ pv2            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pv4            : int  0 0 0 0 90 88 88 89 89 89 ...
##  $ range          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ rangeCity      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rangeCityA     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rangeHwy       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rangeHwyA      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ trany          : chr  "Manual 5-spd" "Manual 5-spd" "Manual 5-spd" "Automatic 3-spd" ...
##  $ UCity          : num  23.3 11 29 12.2 21 ...
##  $ UCityA         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ UHighway       : num  35 19 47 16.7 32 ...
##  $ UHighwayA      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VClass         : chr  "Two Seaters" "Two Seaters" "Subcompact Cars" "Vans" ...
##  $ year           : int  1985 1985 1985 1985 1993 1993 1993 1993 1993 1993 ...
##  $ youSaveSpend   : int  -1000 -9750 1000 -9750 -3500 -750 500 0 750 500 ...
##  $ guzzler        : chr  "" "T" "" "" ...
##  $ trans_dscr     : chr  "" "" "SIL" "" ...
##  $ tCharger       : logi  NA NA NA NA TRUE NA ...
##  $ sCharger       : chr  "" "" "" "" ...
##  $ atvType        : chr  "" "" "" "" ...
##  $ fuelType2      : chr  "" "" "" "" ...
##  $ rangeA         : chr  "" "" "" "" ...
##  $ evMotor        : chr  "" "" "" "" ...
##  $ mfrCode        : chr  "" "" "" "" ...
##  $ c240Dscr       : chr  "" "" "" "" ...
##  $ charge240b     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ c240bDscr      : chr  "" "" "" "" ...
##  $ createdOn      : chr  "Tue Jan 01 00:00:00 EST 2013" "Tue Jan 01 00:00:00 EST 2013" "Tue Jan 01 00:00:00 EST 2013" "Tue Jan 01 00:00:00 EST 2013" ...
##  $ modifiedOn     : chr  "Tue Jan 01 00:00:00 EST 2013" "Tue Jan 01 00:00:00 EST 2013" "Tue Jan 01 00:00:00 EST 2013" "Tue Jan 01 00:00:00 EST 2013" ...
##  $ startStop      : chr  "" "" "" "" ...
##  $ phevCity       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ phevHwy        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ phevComb       : int  0 0 0 0 0 0 0 0 0 0 ...

v <- v[,c('comb08', 'co2', 'fuelType')]
v <- v[which(v$fuelType == 'Diesel' | v$fuelType == 'Regular'),]
v <- v[which(v$co2 != -1),]

v$fuelType[v$fuelType == "Regular"] <- 0
v$fuelType[v$fuelType == "Diesel"] <- 1
v$fuelType <- as.factor(v$fuelType)
attach(v)

## The following object is masked from package:datasets:
## 
##     co2

# number of observations for each fuel type
table(v$fuelType)

## 
##    0    1 
## 1593  101

Independent Variables

Descriptions for data variables can be found here: http://www.fueleconomy.gov/feg/ws/index.shtml#fuelType1

From the myriad of continuous variables I am going to choose co2 emissions (co2) and combined MPG ('comb08') to be my independent variables.

Combined MPG is the combined highway and city mpg for a given vehicle. Co2 is defined as the tailpipe co2 emissions in grams/mile.

Dependent Variable

For my dependent variable I am going to choose fuel type ('fuelType'). I am going to subset the dataset to only look at diesel vs. regular fuel types.

Hypotheses

The \( H_0 \) for my model is that co2 emissions and combined MPG explain none of the variation in fuel type.

The \( H_1 \) for my model is that co2 emissions and combined MPG can explain variation within fuel type.

This model is meant to be predictive, assessing the ability to forecast whether a vehicle uses regular or diesel fuel based on the independent variables.

Model

I am building a hierarchical binomial logistic regression model based on theory in this case. It is known that diesel engines get better gas mileage than unleaded by about a 33% margin so I am going to use the combined mileage as my first term in the model. The second term will be co2, even though Diesel fuel emits less co2 than regular I expect this to be less informative based on theory. This model will be predictive, and is assessing based on known properties of different fuel types can the type of fuel be predicted by these variables.

# build model 1 
m1 <- glm(fuelType ~ comb08, data=v, family = binomial(logit))

# build model 2
m2 <- glm(fuelType ~ comb08 + co2, data=v, family = binomial(logit))


# summary of models

summary(m1)

## 
## Call:
## glm(formula = fuelType ~ comb08, family = binomial(logit), data = v)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1750  -0.3718  -0.2945  -0.2328   2.6037  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.98691    0.44943 -13.321  < 2e-16 ***
## comb08       0.11962    0.01504   7.951 1.85e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 765.44  on 1693  degrees of freedom
## Residual deviance: 703.50  on 1692  degrees of freedom
## AIC: 707.5
## 
## Number of Fisher Scoring iterations: 6

summary(m2)

## 
## Call:
## glm(formula = fuelType ~ comb08 + co2, family = binomial(logit), 
##     data = v)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1319  -0.2899  -0.2340  -0.2155   2.3459  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.626757   1.748248 -11.799   <2e-16 ***
## comb08        0.380746   0.034643  10.991   <2e-16 ***
## co2           0.020968   0.002229   9.405   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 765.44  on 1693  degrees of freedom
## Residual deviance: 629.29  on 1691  degrees of freedom
## AIC: 635.29
## 
## Number of Fisher Scoring iterations: 6

Plots

Fitted Values vs Standardized Residuals

plot(fitted(m2), rstandard(m2), main = "Fitted Values vs Standardized Residuals", ylab = "Residual")
abline(h = 0, col = 'red', lty = 2)

plot of chunk unnamed-chunk-3

Diagnostic Plots for 8 Assumptions

# histogram of residuals
hist(rstandard(m2), xlab = "Residuals", main = "Histogram of Model Standardized Residuals")

plot of chunk unnamed-chunk-4

# boxplt of residuals
boxplot(rstandard(m2), main = "Residual Boxplot")

plot of chunk unnamed-chunk-4

# qq plot
qqnorm(rstandard(m2), main = "Q-Q Plot")
qqline(rstandard(m2))

plot of chunk unnamed-chunk-4

# standardized residuals plot

plot(rstandard(m2), main = "Standardized Residuals Plot", ylab = "Residuals")

plot of chunk unnamed-chunk-4

# standardized residuals vs each variable

plot(comb08, rstandard(m2), main = "Combined mpg vs standardized residuals")

plot of chunk unnamed-chunk-4

plot(co2, rstandard(m2), main = "CO2 emissions vs standardized residuals")

plot of chunk unnamed-chunk-4

Interpretation

Model Interpretation

I am using the second model (m2) for interpretation. Here the estimate for the y intercept is 20.6, which is the log of the odds that a car uses diesel fuel if the co2 emissions and combined mpg are both 0.

The slope for the combined MPG coefficient is .38, meaning that for each increase in 1% mpg, the LOD goes up by .38 that the car is diesel. The slope for the co2 emissions coefficient is .02, which means the LOD goes up by .02 for every increase in 1% by co2.

The P-values for both of the terms in the model are very significant, meaning that they both explain some a significant proportion of variance with respect to the fuel type. This means that we can reject our null hypothesis.

LINE analysis

Linear

plot(fitted(m2), rstandard(m2), main = "Fitted Values vs Standardized Residuals", ylab = "Residual")
abline(h = 0, col = 'red', lty = 2)

plot of chunk unnamed-chunk-5

This assumption does not appear to be met because there are downward trends and one set of values clearly separated from the others. The data do not look random across this set of values and the expected value of the residuals does not appear to be zero.

Independent

# standardized residuals plot

plot(rstandard(m2), main = "Standardized Residuals Plot", ylab = "Residuals")

plot of chunk unnamed-chunk-6

There does not appear to be any dependence on sample order within the dataset meaning that we can assume independence within this dataset.

Normally Distributed

hist(rstandard(m2), xlab = "Residuals", main = "Histogram of Model Standardized Residuals")

plot of chunk unnamed-chunk-7

We definitely do not have a normal distribution of residuals as is evidenced by the spread of outliers from the midpoint in the histogram of standardized residuals.

Equal Variance

From the above residuals plots it seems like there could be some heteroskedasticity because of the apparent change in variance as the fitted values get higher, but I will use the Breusch-Pagan to check.

library(lmtest)

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.1.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(m2)

## 
##  studentized Breusch-Pagan test
## 
## data:  m2
## BP = 389.9304, df = 2, p-value < 2.2e-16

This gives a very significant result and suggests that I have very significant heteroskedasticity in my dataset. Therefore I can reject the null hypothesis that my dataset is homoskedastic.

Four Issues

1. Causality

I believe that both combined MPG and CO2 emissions are probablistic corellations with Diesel status. This is based on my reasoning above when building the model that both increased MPG and decreased CO2 are general properties of diesel as a fuel and will show through in measurements using this fuel overall, but car type and other factors will contribute to variability.

2. Sample Size

The size for the regular vehicles is large enough, but the diesel samples are much smaller in comparison (101), more samples of the diesel variety should be added ideally to improve the model.

3. Colinearity

This is definitely an issue for this model.

cor.test(v$co2, v$comb08)

## 
##  Pearson's product-moment correlation
## 
## data:  v$co2 and v$comb08
## t = -103.7092, df = 1692, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9357531 -0.9227800
## sample estimates:
##        cor 
## -0.9295537

These values are significantly corellated but I thought I would include both because they were most likely the best predictors of diesel status for this dataset.

4. Measurement Error

I do not believe that measurement error should be an issue in this dataset because these metrics are standardized and rigorously tested by the EPA.

Interactions

Here I am going to test interaction effects between the two terms in the model.

mi <- glm(fuelType ~ comb08 * co2, data=v, family = binomial(logit))
summary(mi)

## 
## Call:
## glm(formula = fuelType ~ comb08 * co2, family = binomial(logit), 
##     data = v)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.2777  -0.0459  -0.0246  -0.0092   0.6957  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -74.485373  16.337118  -4.559 5.13e-06 ***
## comb08       -0.251639   0.478035  -0.526    0.599    
## co2          -0.037261   0.037985  -0.981    0.327    
## comb08:co2    0.009746   0.001732   5.629 1.81e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 765.440  on 1693  degrees of freedom
## Residual deviance:  45.803  on 1690  degrees of freedom
## AIC: 53.803
## 
## Number of Fisher Scoring iterations: 10

The above test clearly shows that there is a significant interaction between combined mpg and co2 emissions, demonstrating a potential presence of moderators. This means that changing values for one of these effects the values of the other and that there may be other factors at play influencing these two variables.

Final Interpretation

Overall this model has several assumptions that are violated but does seem to exploit some properties of diesel fuel that can be used to classify it in cars, as increases in both co2 and combined mpg leads to an increase LOD of diesel. There exists a segment of the data that skews the rest of the distribution, and this is most likely due to a separate class of cars (i.e. trucks vs 4 doors) that would need to be further accounted for in a more comprehensive analysis. The interaction effect of the two terms in the model is also very significant in this analysis, owing to the colinearity of the terms, and thus inflating the p-values.