MUSA 500, Homework #3

Author

Minwook Kang, Ann Zhang, and Nissim Lebovits

Published

November 30, 2022

Show the code
library(tidyverse)
library(sf)
library(janitor)
#library(ggthemr) # mincomment : its not working on my pc
library(ggpubr)
library(ggrepel)
library(purrr)
#install.packages("kableExtra")
library(kableExtra)
library(caret)
library(corrplot)
#install.packages("prediction")
library(prediction)

#ggthemr("pale") #set global ggplot theme
options(scipen = 999) # turn off scientific notation

library(aod)
library(rms)
library(gmodels)

library(ROCR)

#install.packages("crosstable")
library(crosstable)
#install.packages("flextable")
library(flextable)

#knitr::opts_knit$set(root.dir = "C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/Homeworks/musa-500-hmwk-3")

#import data
crash_data = read.csv("https://github.com/kmu973/tempDataRepo/raw/main/Logistic%20Regression%20Data.csv") |>
              clean_names()

Introduction

The Problem

The goal of the current assignment is to identify predictors of accidents related to drunk driving. The data used in this assignment come from a data set containing all 53,260 car crashes in the City of Philadelphia for the years 2008 – 2012. The data set was compiled by the Pennsylvania Department of Transportation, and is made available to the public at OpenDataPhilly.org. In the past, Azavea, one of Philadelphia’s most prominent GIS software development firms, has used these data for a number of interesting analyses, which have been published on the company’s website.

Because the crash data are geocoded, it is possible to spatially join the data to the 2000 Census block group level data set that was used for the two previous homework assignments. After the spatial join, each crash point contains the median household income and the percent of individuals with at least a bachelor’s degree in the block group where the crash took place.

Even though the original data set has a total of 53,260 car crashes, for the sake of this assignment, we remove the 9,896 crash locations which took place in non-residential block groups, where median household income and vacancy rates are 0, from the data set. The final data set contains the 43,364 crashes that took place in Philadelphia’s residential block groups. Here, we will be regressing the binary dependent variable, DRINKING_D, on the following binary and continuous predictors: FATAL_OR_M, OVERTURNED, CELL_PHONE, SPEEDING, AGGRESSIVE, DRIVER1617, DRIVER65PLUS, PCTBACHMOR, and MEDHHINC.

Methods

Issues with OLS Regression

Let’s begin by recalling the five assumptions of OLS regression:

  1. A linear relationship between the dependent variable and predictors
  2. Normality of residuals
  3. Homoscedasticity
  4. Independence of observations (no spatial, temporal or other forms of dependence in the data)
  5. No multicollinearity

Additionally, OLS regression relies on continuous variables. In this assignment, we are dealing with binary variables as well as continuous variables, so regular OLS regression is inappropriate.

Logistic Regression

Assumptions of Logistic Regression

Instead of OLS regression, we will use an approach known as logistic regression. The assumptions of logistic regression are similar to the assumptions in OLS regression, but not identical. In logistic regression:

  1. The dependent variable must be binary
  2. Observations must be independent
  3. There must not be multicollinearity

We will also need a larger sample size than in OLS because in logistic regression we rely on maximum likelihood estimation. Logistic regression also differs from OLS in that:

  1. There’s no assumption of a linear relationship between the DV and each IV
  2. There’s no assumption of hetereoscedasticity
  3. Residuals do not need to be normally distributed

Odds and Odds Ratios

In order to understand logistic regression, we must first review the concept of odds.

Odds are similar to probability. However, while probability may be calculated as \(\frac{\text{\# desirable outcomes}}{\text{\# possible outcomes}}\), odds are calculated as \(\frac{\text{\# desirable outcomes}}{\text{\# undesirable outcomes}}\).

Technically, it’s the probability of being murdered by a serial killer

The log odds that \(Y = 1\) (or the logit) are simply \(ln(Odds(Y = 1))\). In other words, if \(Y\) is our dependent variable and \(X\) is our independent variable, a 1-unit increase in \(X\) means that the odds that \(Y=1\) go up by \(e^\text{ln(Odds(Y = 1))}\).

The odds ratio is the ratio of the odds of one outcome for our binary dependent variable versus the other possible outcome, or \(\frac{\text{Odds(Y=1)}}{\text{Odds(Y=0)}}\). It can also be calculated by exponentiating the coefficient of the independent variable in question. The odds ratio indicates how much more likely we are to see \(Y=1\) than \(Y=0\), e.g., an odds ratio of 3 means that it is 3 times more likely that \(Y=1\) than that \(Y=0\).1 Rather than looking at the estimated \(\beta\) coefficients, most statisticians prefer to look at odds ratios

Equation for the Logistic Regression and Multivariate Logistic Regression

For logistic regression with a single predictor, the equation is \[ln(\frac{p}{1-p}) = \beta_0 + \beta_1X_1 + \epsilon\]

where \(p = P(Y=1)\),

\(\frac{p}{1-p}\) are the odds,

and \(ln(\frac{p}{1-p})\) are the log odds, or logit.

Solving algebraically, we come to the inverse logit or logistic function for logistic regression with one variable: \[p = \frac{e^{\beta_0+\beta_1X_1}}{1+e^{\beta_0+\beta_1X_1}} = \frac{1}{1+e^{-\beta_0-\beta_1X_1}}\]

For multivariate logistic regression with \(n\) predictors, that equation becomes \[p = \frac{e^{\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_nX_n}}{1+e^{\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_nX_n}}\]

Interpretation and Hypotheses for Each Predictor

As explained above, we can say that \(e^{\beta_1}\) is the extent to which the odds of \(Y=1\) change as the predictor (i.e., Population) increases by 1 unit. So, when \(e^{\beta_1} = 1.001\), 1.001 is the extent to which the odds that \(Y=1\) go up as the independent variable \(X\) increases by 1 unit. If we want to examine the extent to which the odds that \(Y=1\) change as X increases by 100 units, we can simply raise \(e\) to \(100*\beta_1\), so \(e^{100*\beta_1}\). If \(\beta_1 < 0\), we would say that the odds that \(Y=1\) decrease by a factor of \(e^{\beta_1}\) as X increases by 1 unit. If \(\beta_1 = 0\), then the predictor has no effect on the dependent variable.

For each predictor \(X_i\):

  • \(H_0:\beta_i = 0 \hspace{1mm} (OR_i = 1)\)
  • \(H_a:\beta_i \neq 0 \hspace{1mm} (OR_i \neq 1)\)

Within the context of logistic regression, the z-score is equal to \(\frac{\hat{\beta_i}}{\sigma\hat{\beta_i}}\), which is normally distributed. This is also sometimes known as the Wald statistic, and yields a p-value for each term in the regression using the standard normal (z) tables and in the R output.

Assessing Quality of Model Fit

With logistic regression, it is possible to calculate an \(R^2\) value and, generally speaking, the higher the \(R^2\), the better. However, this value is rarely used, as it cannot be interpreted as the percent of variance that is explained by the model.

Instead, we can use the Akaike Information Criterion to compare models. The AIC estimates prediction error, and therefore the quality of our model, and includes a penality that “is an increasing function of the number of estimated parameters”.2 When comparing multiple models, “the one with the minimum AIC value assures a good balance of goodness of fit and complexity”.3

Finally, in assessing the quality of our model, we can consider specificity, sensitivity, and the misclassification rate.

Sensitivity (also called the true positive rate) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition), and is complementary to the false negative rate. Specificity (also called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition), and is complementary to the false positive rate. The misclassification rate is the proportion of the sum of false positives and false negatives out of the total number of observations in the sample.

False positives are dangerous, folks

A simple way of assessing model quality is to determine a cut-off value for the fitted values \(\hat{y_i}\), which is equal to P(Y = 1) at each observation \(i\). When the model performs well, observations where \(Y = 1\) will have higher values of \(\hat{y_i}\), and observations where \(Y = 0\) will have lower values of \(\hat{y_i}\). We can set a cut-off value, ranging between 0 and 1, for what constitutes a “high” versus a “low” value based on the distribution of our data. In practice, most statisticians use multiple values, e.g., 0.1, 0.2, 0.3 … 0.7.

Ideally, we want to choose a cut-off value that will optimize sensitivity and specificity. Typically, we want to see a high true positive rate and a low false positive rate, although, depending on the specific case, it may be more important to minimize the false positive rate than to maximize the true positive rate, or vice versa. To this end, we can plot the true positive rate (sensitivity) against the false positive rate (specificity), as in Figure 14. This is known an ROC curve, and it offers us a few different options for identifying out probability cut-off:

  1. The Youden Index is a cut-off for which the sum of sensitivity plus specificity is maximized
  2. A cut-off for which the ROC curve has the minimum distance from the upper left corner of the graph, i.e., the point at which specificy = 1 and sensitivity = 1 (this can be implemented in R)

In this assignment, we will be selecting our cut-off by taking the second approach and minimizing the distance of the curve from the upper left corner of the graph.

Figure 1: ROC Curve Example

If we calculate the area under our ROC curve (the AUC), we can measure the prediction accuracy of the model. It is effectively the probability that the model correctly ranks two randomly selected observations where one has \(Y = 1\) and the other has \(Y = 0\), i.e., the probability that the \(\hat{y}\) for the observation where one has \(Y = 1\) is higher than the \(\hat{y}\) where \(Y = 0\). Higher AUCs mean that we can find a cut-off value for which both sensitivity and specificity are relatively high. The AUC varies from 0.5 to 1, and roughly correspond to the following accuracy levels:

  • .9 to 1.0 = excellent
  • .8 to .9 = good
  • .7 to .8 = fair (some statisticians say that any value > .7 is good)
  • .6 to .7 = poor
  • .5 to .6 = fail

Exploratory Analysis Prior to Regression

Prior to running our regression model, we will run cross tabulations between the dependent variable DRINKING_D and each of the binary predictors (FATAL_OR_M, OVERTURNED, CELL_PHONE, SPEEDING, AGGRESSIVE, DRIVER1617, DRIVER65PLUS). A cross tabulation will show us the proportional distribution of the variables. More importantly for our purposes, we can use a Chi-squared test to examine the association between the DV and each individual predictor variable. Here, the p-value is calculated based on the \(\chi^2\) value and the degrees of freedom. If p < 0.05, then the relationship between the dependent variable and the given predictor is significant. If we have a high \(\chi^2\) and a p-value < 0.05, we can therefore reject \(H_0\), that the proportion of fatalities for crashes in which our predictor = 1 is the same as the proportion of fatalities for crashes in which our predictor = 0, in favor of \(H_a\), that the proportion of fatalities for crashes in which our predictor = 1 is not the same as the proportion of fatalities for crashes in which our predictor = 0. This suggests a relationship between the variables.

Additionally, we can compare the means of the continuous predictors in our model (PCTBACHMOR and MEDHHINC) by looking at independent sample t-tests. Here, \(H_0\) is that the average values for our continuous predictor variable are the same for crashes that involve drunk drivers and crashes that don’t. \(H_a\) is that the average values for our continuous predictor variable are different for crashes that involve drunk drivers and crashes that don’t. As before, a high value of the t-statistic, and a p-value lower than 0.05 suggest that there’s evidence to reject the null hypothesis in favor of the alternative.

Results

Findings from Exploratory Analysis

Distribution of Dependent Variable

To begin our data exploration, we’ll consider the distribution of our dependent variable. As we can see, from our prop table, crashes in which the driver were drunk account for only 5.7% of all 43,364 crashes between 2008 and 2012.

Show the code
drinking_d_table = table(crash_data$drinking_d)

DV_table <- left_join(as.data.frame(drinking_d_table), as.data.frame(prop.table(drinking_d_table)), by = "Var1") %>%
  mutate(Var1 = case_when(Var1 == 0 ~ "Non-drunk",
                          Var1 == 1 ~ "Drunk"),
         Freq.y = round(Freq.y, 3))

kable(DV_table, caption = "<center><span style='font-size:14px; color:black; font-family: Arial'>Table 1. Tabulation of the Dependent Variable</span></center>",
      col.names = c("Driver Status", "Count", "Proportion"), align = "c") %>% 
    kable_minimal(full_width = T, html_font = "Arial", font_size = 14) %>%
    row_spec(0:2, extra_css = "line-height: 30px")
Table 1. Tabulation of the Dependent Variable
Driver Status Count Proportion
Non-drunk 40879 0.943
Drunk 2485 0.057

Cross-Tabulations with Binary Predictors

Next, we’ll look at the cross-tabulations of our dependent variable with each of the binary predictors to evaluate whether the Chi-squared test indicates a statistically significant relationship.

Show the code
crosstable(crash_data, fatal_or_m, by = drinking_d, funs = "mean", test = TRUE) %>% 
  as_flextable()
crosstable(crash_data, overturned, by = drinking_d, funs = "mean", test = TRUE) %>% 
  as_flextable()
crosstable(crash_data, cell_phone, by = drinking_d, funs = "mean", test = TRUE) %>% 
  as_flextable()
crosstable(crash_data, speeding, by = drinking_d, funs = "mean", test = TRUE) %>% 
  as_flextable()
crosstable(crash_data, aggressive, by = drinking_d, funs = "mean", test = TRUE) %>% 
  as_flextable()
crosstable(crash_data, driver1617, by = drinking_d, funs = "mean", test = TRUE) %>% 
  as_flextable()
crosstable(crash_data, driver65plus, by = drinking_d, funs = "mean", test = TRUE) %>% 
  as_flextable()
#CrossTable(crash_data$fatal_or_m, crash_data$drinking_d, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)

#CrossTable(crash_data$overturned, crash_data$drinking_d, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)

#CrossTable(crash_data$cell_phone, crash_data$drinking_d, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)

#CrossTable(crash_data$speeding, crash_data$drinking_d, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)

#CrossTable(crash_data$aggressive, crash_data$drinking_d, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)

#CrossTable(crash_data$driver1617, crash_data$drinking_d, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)

#CrossTable(crash_data$driver65plus, crash_data$drinking_d, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)

label

variable

drinking_d

test

0

1

fatal_or_m

0

39698 (94.53%)

2297 (5.47%)

p value: <0.0001
(Pearson's Chi-squared test)

1

1181 (86.27%)

188 (13.73%)

label

variable

drinking_d

test

0

1

overturned

0

40267 (94.43%)

2375 (5.57%)

p value: <0.0001
(Pearson's Chi-squared test)

1

612 (84.76%)

110 (15.24%)

label

variable

drinking_d

test

0

1

cell_phone

0

40453 (94.27%)

2457 (5.73%)

p value: 0.6873
(Pearson's Chi-squared test)

1

426 (93.83%)

28 (6.17%)

label

variable

drinking_d

test

0

1

speeding

0

39618 (94.68%)

2225 (5.32%)

p value: <0.0001
(Pearson's Chi-squared test)

1

1261 (82.91%)

260 (17.09%)

label

variable

drinking_d

test

0

1

aggressive

0

22357 (93.44%)

1569 (6.56%)

p value: <0.0001
(Pearson's Chi-squared test)

1

18522 (95.29%)

916 (4.71%)

label

variable

drinking_d

test

0

1

driver1617

0

40205 (94.21%)

2473 (5.79%)

p value: <0.0001
(Pearson's Chi-squared test)

1

674 (98.25%)

12 (1.75%)

label

variable

drinking_d

test

0

1

driver65plus

0

36642 (93.93%)

2366 (6.07%)

p value: <0.0001
(Pearson's Chi-squared test)

1

4237 (97.27%)

119 (2.73%)

Cross-Tabulations

As we can see based on these cross-tabulations, the Chi-squared test shows a high \(\chi^2\) and a p-value < 0.05 for the dependent variable and each of the binary predictors except for CELL_PHONE. For all variables except for CELL_PHONE, then, we can reject \(H_0\), which, as we stated before, states that there is that the probability of a crash involving a drunk driver is no different whether our predictor = 1 or = 0. In other words, we can reject the hypothesis that there is no relationship between the dependent variable and all of the predictors other than CELL_PHONE.

Comparing Means of Continuous Variables

Show the code
means = c(tapply(crash_data$pctbachmor, crash_data$drinking_d, mean), tapply(crash_data$medhhinc, crash_data$drinking_d, mean))
sds = c(tapply(crash_data$pctbachmor, crash_data$drinking_d, sd), tapply(crash_data$medhhinc, crash_data$drinking_d, sd))

contvars_df = as.data.frame(rbind(means, sds)) 

colnames(contvars_df) = c("PCTBACHMOR 0", "PCTBACHMOR 1", "MEDHHINC 0", "MEDHHINC 1")

contvars_df
      PCTBACHMOR 0 PCTBACHMOR 1 MEDHHINC 0 MEDHHINC 1
means     16.56986     16.61173   31483.05   31998.75
sds       18.21426     18.72091   16930.10   17810.50
Show the code
kable(contvars_df, caption = "<center><span style='font-size:14px; color:black; font-family: Arial'>Comparing Means of Continuous Variables</span></center>", align = "c") %>% 
    kable_minimal(full_width = T, html_font = "Arial", font_size = 14) %>%
    row_spec(0:2, extra_css = "line-height: 30px")
Comparing Means of Continuous Variables
PCTBACHMOR 0 PCTBACHMOR 1 MEDHHINC 0 MEDHHINC 1
means 16.56986 16.61173 31483.05 31998.75
sds 18.21426 18.72091 16930.10 17810.50

T-Tests for Continuous Predictors

T-test shows us high p-value which implies that we can’t reject the Null Hypothesis, the average values for our continuous predictor variable are the same for crashes that involve drunk drivers and crashes that don’t.

Show the code
t.test(crash_data$pctbachmor~crash_data$drinking_d)
t.test(crash_data$medhhinc~crash_data$drinking_d)

    Welch Two Sample t-test

data:  crash_data$pctbachmor by crash_data$drinking_d
t = -0.10842, df = 2777.5, p-value = 0.9137
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.7991398  0.7153982
sample estimates:
mean in group 0 mean in group 1 
       16.56986        16.61173 

    Welch Two Sample t-test

data:  crash_data$medhhinc by crash_data$drinking_d
t = -1.4053, df = 2763.9, p-value = 0.16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -1235.2508   203.8544
sample estimates:
mean in group 0 mean in group 1 
       31483.05        31998.75 

Assumptions of Logistic Regression

Recalling from Method Section that there are three main assumptions for logistic regressions: 1) binary dependent variable, 2) independent observations, and 3) no severe multicollinearity. The dependent variable, DRINKING_D (whether drivers were drunk or not drunk) was binary, and the observations of each recorded car accident are independent of each other.

Multicollinearity usually refers to the situation when two or more predictors (explanatory variables) are strongly correlated with each other (r > 0.9 or r < -0.9), which signals a redundancy in including both predictors (in other words, adding the second one does not help improve the predictive power of the model). In the case of multicollinearity, we usually leave one predictor and take out the rest that are strongly correlated with that one predictor.

To test the third assumption, whether there is multicollinearity between all predictors (both binary and continuous), we used the Pearson correlations and results are shown below in the form of matrix. Note that since Pearson is usually run based on the assumption of continuous (not categorical) variables, there are potential limitations of using Pearson correlations to measure the association between binary predictors, since Pearson cannot determine non-linear relationships.

Show the code
#3-a multicollinearlity test

correlation <- crash_data[c(4:10, 12:13)]
cor(correlation, method = "pearson")
               fatal_or_m    overturned    cell_phone      speeding  aggressive
fatal_or_m    1.000000000  0.0331959240  0.0021603225  0.0817126678 -0.01104729
overturned    0.033195924  1.0000000000 -0.0009897786  0.0594402861  0.01643894
cell_phone    0.002160322 -0.0009897786  1.0000000000 -0.0036011640 -0.02574299
speeding      0.081712668  0.0594402861 -0.0036011640  1.0000000000  0.21152537
aggressive   -0.011047295  0.0164389397 -0.0257429929  0.2115253684  1.00000000
driver1617   -0.002808379  0.0037239674  0.0014851333  0.0160115997  0.02842895
driver65plus -0.012512349 -0.0195009743 -0.0027172590 -0.0328541108  0.01502693
pctbachmor   -0.014652265  0.0093321352 -0.0012458540 -0.0007390853  0.02712211
medhhinc     -0.018212431  0.0279213029  0.0020998852  0.0117866805  0.04344045
               driver1617 driver65plus    pctbachmor     medhhinc
fatal_or_m   -0.002808379 -0.012512349 -0.0146522648 -0.018212431
overturned    0.003723967 -0.019500974  0.0093321352  0.027921303
cell_phone    0.001485133 -0.002717259 -0.0012458540  0.002099885
speeding      0.016011600 -0.032854111 -0.0007390853  0.011786681
aggressive    0.028428953  0.015026930  0.0271221096  0.043440451
driver1617    1.000000000 -0.020848417 -0.0026359662  0.022877425
driver65plus -0.020848417  1.000000000  0.0261903901  0.050337711
pctbachmor   -0.002635966  0.026190390  1.0000000000  0.477869537
medhhinc      0.022877425  0.050337711  0.4778695368  1.000000000
Show the code
mc_test <- cor(correlation,use="pairwise.complete.obs")
corrplot(mc_test, method = "square",type = "lower",tl.cex = 0.5)

Although MEDHHINC (the median household income) and PCTBACHMOR (percentage of bachelor degree in the census tract where accident happened) having relatively higher correlation (r = 0.477 < 0.9, which does not qualify for multicollinearity), the correlation matrix indicates no severe multicollinearity across all predictors (all r satisfies -0.9 < r < 0.9, showing non-multicollinearity).

Logistic Regression Results

The results of logistic regression with all predictors is shown below. All predictors, except for CELL_PHONE (whether drivers were using cell-phones) and PCTBACHMOR (percentage of bachelor degree in the census tract where accident happened), are statistically significantly associated with drinking_d (whether drivers were drunk).

Show the code
logit <- glm(drinking_d ~ fatal_or_m + overturned + cell_phone + speeding + aggressive + driver1617 + driver65plus
             + pctbachmor + medhhinc, data = crash_data, family = "binomial")
summary(logit)

Call:
glm(formula = drinking_d ~ fatal_or_m + overturned + cell_phone + 
    speeding + aggressive + driver1617 + driver65plus + pctbachmor + 
    medhhinc, family = "binomial", data = crash_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1945  -0.3693  -0.3471  -0.2731   3.0099  

Coefficients:
                 Estimate   Std. Error z value             Pr(>|z|)    
(Intercept)  -2.732506616  0.045875659 -59.563 < 0.0000000000000002 ***
fatal_or_m    0.814013802  0.083806924   9.713 < 0.0000000000000002 ***
overturned    0.928921376  0.109166324   8.509 < 0.0000000000000002 ***
cell_phone    0.029550085  0.197777821   0.149               0.8812    
speeding      1.538975665  0.080545894  19.107 < 0.0000000000000002 ***
aggressive   -0.596915946  0.047779238 -12.493 < 0.0000000000000002 ***
driver1617   -1.280295964  0.293147168  -4.367 0.000012572447127937 ***
driver65plus -0.774664640  0.095858315  -8.081 0.000000000000000641 ***
pctbachmor   -0.000370634  0.001296387  -0.286               0.7750    
medhhinc      0.000002804  0.000001341   2.091               0.0365 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19036  on 43363  degrees of freedom
Residual deviance: 18340  on 43354  degrees of freedom
AIC: 18360

Number of Fisher Scoring iterations: 6

The following table indicates the odds ratio for each predictor. Odds is usually calculated as \(\frac{\text{number of desirable outcomes}}{\text{number of undesirable outcomes}}\). For all statistically significant predictors, if the OR (Odds Ratio) value is higher than 0 (i.e.,fatal_or_m, overturned, cell_phone, speeding, and medhhinc ), 1 unit increase in each predictor corresponds to a increase in the odds of the driver being drunk when accident happens; if the OR (Odds Ratio) value is lower than 0 (i.e.,aggressive, driver1617, driver65plus, and pctbachmor), 1 unit increase in each predictor corresponds to a increase in the odds of the driver being drunk when accident happens.

Show the code
exp(cbind(OR = coef(logit), confint(logit)))
                     OR      2.5 %     97.5 %
(Intercept)  0.06505601 0.05947628 0.07119524
fatal_or_m   2.25694878 1.90991409 2.65313350
overturned   2.53177687 2.03462326 3.12242730
cell_phone   1.02999102 0.68354737 1.48846840
speeding     4.65981462 3.97413085 5.45020642
aggressive   0.55050681 0.50101688 0.60423487
driver1617   0.27795502 0.14774429 0.47109277
driver65plus 0.46085831 0.37998364 0.55347851
pctbachmor   0.99962944 0.99707035 1.00215087
medhhinc     1.00000280 1.00000013 1.00000539
Show the code
logitoutput <- summary(logit)
logitcoeffs <- logitoutput$coefficients
logitcoeffs
                    Estimate     Std. Error     z value
(Intercept)  -2.732506616128 0.045875659265 -59.5633209
fatal_or_m    0.814013801855 0.083806923855   9.7129660
overturned    0.928921376176 0.109166323622   8.5092302
cell_phone    0.029550084767 0.197777821226   0.1494105
speeding      1.538975665492 0.080545894089  19.1068171
aggressive   -0.596915945677 0.047779237535 -12.4932079
driver1617   -1.280295964022 0.293147167807  -4.3674171
driver65plus -0.774664640320 0.095858315238  -8.0813505
pctbachmor   -0.000370633604 0.001296386538  -0.2858974
medhhinc      0.000002804492 0.000001340972   2.0913870
                                                                                              Pr(>|z|)
(Intercept)  0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
fatal_or_m   0.000000000000000000000265496694373546263134334977085160289789200760424137115478515625000
overturned   0.000000000000000017509193098994720030545102495977971557294949889183044433593750000000000
cell_phone   0.881229720508864278194494090712396427989006042480468750000000000000000000000000000000000
speeding     0.000000000000000000000000000000000000000000000000000000000000000000000000000000002215783
aggressive   0.000000000000000000000000000000000008130790694984008205523084988897153380094096064567566
driver1617   0.000012572447127937271245477074410601403542386833578348159790039062500000000000000000000
driver65plus 0.000000000000000640534374817719861497020139573521646525477990508079528808593750000000000
pctbachmor   0.774956667791637965336803972604684531688690185546875000000000000000000000000000000000000
medhhinc     0.036493383531930247143382217700491310097277164459228515625000000000000000000000000000000
Show the code
or_ci <- exp(cbind(OR = coef(logit), confint(logit)))

finallogitoutput <- cbind(logitcoeffs, or_ci)
finallogitoutput
                    Estimate     Std. Error     z value
(Intercept)  -2.732506616128 0.045875659265 -59.5633209
fatal_or_m    0.814013801855 0.083806923855   9.7129660
overturned    0.928921376176 0.109166323622   8.5092302
cell_phone    0.029550084767 0.197777821226   0.1494105
speeding      1.538975665492 0.080545894089  19.1068171
aggressive   -0.596915945677 0.047779237535 -12.4932079
driver1617   -1.280295964022 0.293147167807  -4.3674171
driver65plus -0.774664640320 0.095858315238  -8.0813505
pctbachmor   -0.000370633604 0.001296386538  -0.2858974
medhhinc      0.000002804492 0.000001340972   2.0913870
                                                                                              Pr(>|z|)
(Intercept)  0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
fatal_or_m   0.000000000000000000000265496694373546263134334977085160289789200760424137115478515625000
overturned   0.000000000000000017509193098994720030545102495977971557294949889183044433593750000000000
cell_phone   0.881229720508864278194494090712396427989006042480468750000000000000000000000000000000000
speeding     0.000000000000000000000000000000000000000000000000000000000000000000000000000000002215783
aggressive   0.000000000000000000000000000000000008130790694984008205523084988897153380094096064567566
driver1617   0.000012572447127937271245477074410601403542386833578348159790039062500000000000000000000
driver65plus 0.000000000000000640534374817719861497020139573521646525477990508079528808593750000000000
pctbachmor   0.774956667791637965336803972604684531688690185546875000000000000000000000000000000000000
medhhinc     0.036493383531930247143382217700491310097277164459228515625000000000000000000000000000000
                     OR      2.5 %     97.5 %
(Intercept)  0.06505601 0.05947628 0.07119524
fatal_or_m   2.25694878 1.90991409 2.65313350
overturned   2.53177687 2.03462326 3.12242730
cell_phone   1.02999102 0.68354737 1.48846840
speeding     4.65981462 3.97413085 5.45020642
aggressive   0.55050681 0.50101688 0.60423487
driver1617   0.27795502 0.14774429 0.47109277
driver65plus 0.46085831 0.37998364 0.55347851
pctbachmor   0.99962944 0.99707035 1.00215087
medhhinc     1.00000280 1.00000013 1.00000539

The following set of matrices indicate specificity, sensitivity at different probability cut-offs (thresholds), respectively cutting off at probability of 0.02, 0.03, 0.05, 0.07, 0.08, 0.09, 0.1, 0.15, 0.2, and 0.5. The values are then encoded into a table presenting not only specificity and sensitivity but also mis-classifications rate. Based on the table, the cut-off value that has the lowest mis-classification rate is 0.02 and the one corresponding to the highest mis-classification rate is 0.5.

Cut-Off Values Table

Since the ROC curve can be used to plot true positive rate against false positive rate and determine the quality of the model, a ROC curve is plotted for our model. Given the ROC curve, the calculated optimal cut-off point is at probability of 0.06. This suggested cut-off point diverge from the 0.02 cut-off point observed from the table earlier. This difference may result from the fact that the previous table only aims to minimize mis-classification rate, while the optimal cut-off point here aims to maximizing both sensitivity and specificity.

Show the code
#3-a 

a <- cbind(crash_data$drinking_d, fit)
colnames(a) <- c("labels", "predictions")
head(a)
  labels predictions
1      0  0.06794568
2      0  0.08305194
3      0  0.06794568
4      0  0.03858293
5      0  0.06794568
6      1  0.07048039
Show the code
roc <- as.data.frame(a)
pred <- ROCR::prediction(roc$predictions, roc$labels)
roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a = 0, b = 1)

Show the code
#3-a 

opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y-1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}

print(opt.cut(roc.perf, pred))
                  [,1]
sensitivity 0.66076459
specificity 0.54524328
cutoff      0.06365151
Show the code
#3-a 

auc.perf = performance(pred, measure ="auc")
auc.perf@y.values
[[1]]
[1] 0.6398695

Given the ROC curve as presented above, we then calculated the area under curve to further measure the accuracy of the model. The calculated AUC is 0.64, suggesting that the model is relatively unsatisfactory.

Logistic Regression with Binary Predictors

The following table is the results from a logistic regression only with binary predictors (explanatory variables). The results suggest that cell_phone remains statistically insignificant in the new model, while other binary predictors remain significant.

Show the code
#3-a 

logit2 <- glm(drinking_d ~ fatal_or_m + overturned + cell_phone + speeding + aggressive + driver1617 + driver65plus
            , data = crash_data, family = "binomial")

summary(logit2)

Call:
glm(formula = drinking_d ~ fatal_or_m + overturned + cell_phone + 
    speeding + aggressive + driver1617 + driver65plus, family = "binomial", 
    data = crash_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1961  -0.3692  -0.3153  -0.2764   3.0093  

Coefficients:
             Estimate Std. Error z value             Pr(>|z|)    
(Intercept)  -2.65190    0.02753 -96.324 < 0.0000000000000002 ***
fatal_or_m    0.80932    0.08376   9.662 < 0.0000000000000002 ***
overturned    0.93978    0.10903   8.619 < 0.0000000000000002 ***
cell_phone    0.03107    0.19777   0.157                0.875    
speeding      1.54032    0.08053  19.128 < 0.0000000000000002 ***
aggressive   -0.59365    0.04775 -12.433 < 0.0000000000000002 ***
driver1617   -1.27158    0.29311  -4.338  0.00001436374143293 ***
driver65plus -0.76646    0.09576  -8.004  0.00000000000000121 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19036  on 43363  degrees of freedom
Residual deviance: 18344  on 43356  degrees of freedom
AIC: 18360

Number of Fisher Scoring iterations: 6
Show the code
exp(cbind(OR = coef(logit2), confint(logit2)))
                     OR      2.5 %    97.5 %
(Intercept)  0.07051713 0.06678642 0.0743978
fatal_or_m   2.24636998 1.90112455 2.6404533
overturned   2.55942903 2.05736015 3.1556897
cell_phone   1.03156149 0.68459779 1.4907150
speeding     4.66608472 3.97961862 5.4573472
aggressive   0.55230941 0.50268818 0.6061758
driver1617   0.28038936 0.14904734 0.4751771
driver65plus 0.46465631 0.38318289 0.5579332
Show the code
logitoutput2 <- summary(logit2)
logitcoeffs2 <- logitoutput2$coefficients
logitcoeffs2
                Estimate Std. Error     z value
(Intercept)  -2.65189961 0.02753107 -96.3238683
fatal_or_m    0.80931557 0.08376150   9.6621431
overturned    0.93978420 0.10903433   8.6191585
cell_phone    0.03107367 0.19777088   0.1571195
speeding      1.54032033 0.08052787  19.1277908
aggressive   -0.59364687 0.04774781 -12.4329656
driver1617   -1.27157607 0.29310969  -4.3382260
driver65plus -0.76645727 0.09576440  -8.0035718
                                                                                             Pr(>|z|)
(Intercept)  0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000
fatal_or_m   0.00000000000000000000043663270416796089966326999132206765352748334407806396484375000000
overturned   0.00000000000000000674479490423146321037561889966127637308090925216674804687500000000000
cell_phone   0.87515064024584299229303496758802793920040130615234375000000000000000000000000000000000
speeding     0.00000000000000000000000000000000000000000000000000000000000000000000000000000000148224
aggressive   0.00000000000000000000000000000000001730915657124955061131957312348106370336608961224556
driver1617   0.00001436374143293184440627746623064808773051481693983078002929687500000000000000000000
driver65plus 0.00000000000000120861170055062276218298122909544645153800956904888153076171875000000000
Show the code
or_ci2 <- exp(cbind(OR = coef(logit2), confint(logit2)))

finallogitoutput2 <- cbind(logitcoeffs2, or_ci2)
finallogitoutput2
                Estimate Std. Error     z value
(Intercept)  -2.65189961 0.02753107 -96.3238683
fatal_or_m    0.80931557 0.08376150   9.6621431
overturned    0.93978420 0.10903433   8.6191585
cell_phone    0.03107367 0.19777088   0.1571195
speeding      1.54032033 0.08052787  19.1277908
aggressive   -0.59364687 0.04774781 -12.4329656
driver1617   -1.27157607 0.29310969  -4.3382260
driver65plus -0.76645727 0.09576440  -8.0035718
                                                                                             Pr(>|z|)
(Intercept)  0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000
fatal_or_m   0.00000000000000000000043663270416796089966326999132206765352748334407806396484375000000
overturned   0.00000000000000000674479490423146321037561889966127637308090925216674804687500000000000
cell_phone   0.87515064024584299229303496758802793920040130615234375000000000000000000000000000000000
speeding     0.00000000000000000000000000000000000000000000000000000000000000000000000000000000148224
aggressive   0.00000000000000000000000000000000001730915657124955061131957312348106370336608961224556
driver1617   0.00001436374143293184440627746623064808773051481693983078002929687500000000000000000000
driver65plus 0.00000000000000120861170055062276218298122909544645153800956904888153076171875000000000
                     OR      2.5 %    97.5 %
(Intercept)  0.07051713 0.06678642 0.0743978
fatal_or_m   2.24636998 1.90112455 2.6404533
overturned   2.55942903 2.05736015 3.1556897
cell_phone   1.03156149 0.68459779 1.4907150
speeding     4.66608472 3.97961862 5.4573472
aggressive   0.55230941 0.50268818 0.6061758
driver1617   0.28038936 0.14904734 0.4751771
driver65plus 0.46465631 0.38318289 0.5579332

In addition, by comparing the Akaike Information Criterion (AIC) values, we can see that the models have almost identical AIC values, suggesting that they have similar quality.

Show the code
AIC(logit, logit2)
       df      AIC
logit  10 18359.63
logit2  8 18360.47

Similar to OLS, we can do cross-validation to determine a quality of the model. See following links for more information: K-Fold cross-validation in R: https://www.r-bloggers.com/evaluating-logistic-regression-models/ More on cross-validation in R: https://www.r-bloggers.com/predicting-creditability-using-logistic-regression-in-r-cross-validating-the-classifier-part-2-2/

Discussion

This projects use logistic regression as an alternative to OLS regression to look at relationships between whether drivers were drunk or not when car crashes happened and a group of predictors. From the models, we can see that all predictors, except for cell_phone and pctbachmor, (i.e., fatal_or_m, overturned, speeding, aggressive, driver 1617, driver65plus, medhhinc) are significantly associated with drivers drunk status. The logistic regression model satisfies all assumptions needed, as shown in the previous section.

Most results, including the significance of relationships and the direction of the relationships, align with expectations. For instance, we wouldn’t expect whether drivers drank to be associated with the percentage of bachelor degree holders in the tracts of where the incidents happen, and it turns out to be not significant. Similarly, the use of cell phone is not expected to have a direct association with drunk drivers. On the other hand, the predictors related to age, diver1617 and driver65plus are expected to be valid predictors, since teenage drivers are not legal to drink, and drivers older than 65 are less likely to drink due to health concerns. One result that is unexpected is the negative association between aggressive driving behaviors and whether the driver was drunk, as we would usually assume drunk drivers may perform more aggressively when driving.

Rare Events Modeling in Logistic Regression

To judge whether logistic regression is appropriate here, we need to examine the number of cases, especially the number of the rarer outcome of the two – in this cases, drivers that were drunk at the time of crashes are a lot less than those who were not drunk. If there is a small sample bias here, we can use Paul Allison proposed methods for modeling rare events, similar to penalized likelihood, for reducing small-sample biases in Maximum likelihood estimation.

The question is then whether the number of drunk drivers should be considered small sample. The answer is uncertain. According to Table 1, we have 2,485 observations for drivers who were drunk at the time of incidents, versus 40,879 observations for drivers who were not drunk. Although 2,485 seems like a reasonably sized group, it only take 5.7% of all observations. Therefore, there might be a need to adopt Paul Allison’s methods for modeling rare events instead of just using logistic regression.

Limitations

In this Analysis we’ve only leveraged given independent variables, to improve our model, we should’ve to spend more time in exploring other independent variables. It might help us to get larger value of AUC than that of 0.64 in current model.

Footnotes

  1. Sperandei S. Understanding logistic regression analysis. Biochem Med (Zagreb). 2014 Feb 15;24(1):12-8. doi: 10.11613/BM.2014.003. PMID: 24627710; PMCID: PMC3936971.↩︎

  2. https://en.wikipedia.org/wiki/Akaike_information_criterion↩︎

  3. https://www.sciencedirect.com/topics/social-sciences/akaike-information-criterion↩︎

  4. Source: https://en.wikipedia.org/wiki/Receiver_operating_characteristic↩︎