library(tidyverse)library(sf)library(janitor)#library(ggthemr) # mincomment : its not working on my pclibrary(ggpubr)library(ggrepel)library(purrr)#install.packages("kableExtra")library(kableExtra)library(caret)library(corrplot)#install.packages("prediction")library(prediction)#ggthemr("pale") #set global ggplot themeoptions(scipen =999) # turn off scientific notationlibrary(aod)library(rms)library(gmodels)library(ROCR)#install.packages("crosstable")library(crosstable)#install.packages("flextable")library(flextable)#knitr::opts_knit$set(root.dir = "C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/Homeworks/musa-500-hmwk-3")#import datacrash_data =read.csv("https://github.com/kmu973/tempDataRepo/raw/main/Logistic%20Regression%20Data.csv") |>clean_names()
Introduction
The Problem
The goal of the current assignment is to identify predictors of accidents related to drunk driving. The data used in this assignment come from a data set containing all 53,260 car crashes in the City of Philadelphia for the years 2008 – 2012. The data set was compiled by the Pennsylvania Department of Transportation, and is made available to the public at OpenDataPhilly.org. In the past, Azavea, one of Philadelphia’s most prominent GIS software development firms, has used these data for a number of interesting analyses, which have been published on the company’s website.
Because the crash data are geocoded, it is possible to spatially join the data to the 2000 Census block group level data set that was used for the two previous homework assignments. After the spatial join, each crash point contains the median household income and the percent of individuals with at least a bachelor’s degree in the block group where the crash took place.
Even though the original data set has a total of 53,260 car crashes, for the sake of this assignment, we remove the 9,896 crash locations which took place in non-residential block groups, where median household income and vacancy rates are 0, from the data set. The final data set contains the 43,364 crashes that took place in Philadelphia’s residential block groups. Here, we will be regressing the binary dependent variable, DRINKING_D, on the following binary and continuous predictors: FATAL_OR_M, OVERTURNED, CELL_PHONE, SPEEDING, AGGRESSIVE, DRIVER1617, DRIVER65PLUS, PCTBACHMOR, and MEDHHINC.
Methods
Issues with OLS Regression
Let’s begin by recalling the five assumptions of OLS regression:
A linear relationship between the dependent variable and predictors
Normality of residuals
Homoscedasticity
Independence of observations (no spatial, temporal or other forms of dependence in the data)
No multicollinearity
Additionally, OLS regression relies on continuous variables. In this assignment, we are dealing with binary variables as well as continuous variables, so regular OLS regression is inappropriate.
Logistic Regression
Assumptions of Logistic Regression
Instead of OLS regression, we will use an approach known as logistic regression. The assumptions of logistic regression are similar to the assumptions in OLS regression, but not identical. In logistic regression:
The dependent variable must be binary
Observations must be independent
There must not be multicollinearity
We will also need a larger sample size than in OLS because in logistic regression we rely on maximum likelihood estimation. Logistic regression also differs from OLS in that:
There’s no assumption of a linear relationship between the DV and each IV
There’s no assumption of hetereoscedasticity
Residuals do not need to be normally distributed
Odds and Odds Ratios
In order to understand logistic regression, we must first review the concept of odds.
Odds are similar to probability. However, while probability may be calculated as \(\frac{\text{\# desirable outcomes}}{\text{\# possible outcomes}}\), odds are calculated as \(\frac{\text{\# desirable outcomes}}{\text{\# undesirable outcomes}}\).
Technically, it’s the probability of being murdered by a serial killer
The log odds that \(Y = 1\) (or the logit) are simply \(ln(Odds(Y = 1))\). In other words, if \(Y\) is our dependent variable and \(X\) is our independent variable, a 1-unit increase in \(X\) means that the odds that \(Y=1\) go up by \(e^\text{ln(Odds(Y = 1))}\).
The odds ratio is the ratio of the odds of one outcome for our binary dependent variable versus the other possible outcome, or \(\frac{\text{Odds(Y=1)}}{\text{Odds(Y=0)}}\). It can also be calculated by exponentiating the coefficient of the independent variable in question. The odds ratio indicates how much more likely we are to see \(Y=1\) than \(Y=0\), e.g., an odds ratio of 3 means that it is 3 times more likely that \(Y=1\) than that \(Y=0\).1 Rather than looking at the estimated \(\beta\) coefficients, most statisticians prefer to look at odds ratios
Equation for the Logistic Regression and Multivariate Logistic Regression
For logistic regression with a single predictor, the equation is \[ln(\frac{p}{1-p}) = \beta_0 + \beta_1X_1 + \epsilon\]
where \(p = P(Y=1)\),
\(\frac{p}{1-p}\) are the odds,
and \(ln(\frac{p}{1-p})\) are the log odds, or logit.
Solving algebraically, we come to the inverse logit or logistic function for logistic regression with one variable: \[p = \frac{e^{\beta_0+\beta_1X_1}}{1+e^{\beta_0+\beta_1X_1}} = \frac{1}{1+e^{-\beta_0-\beta_1X_1}}\]
For multivariate logistic regression with \(n\) predictors, that equation becomes \[p = \frac{e^{\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_nX_n}}{1+e^{\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_nX_n}}\]
Interpretation and Hypotheses for Each Predictor
As explained above, we can say that \(e^{\beta_1}\) is the extent to which the odds of \(Y=1\) change as the predictor (i.e., Population) increases by 1 unit. So, when \(e^{\beta_1} = 1.001\), 1.001 is the extent to which the odds that \(Y=1\) go up as the independent variable \(X\) increases by 1 unit. If we want to examine the extent to which the odds that \(Y=1\) change as X increases by 100 units, we can simply raise \(e\) to \(100*\beta_1\), so \(e^{100*\beta_1}\). If \(\beta_1 < 0\), we would say that the odds that \(Y=1\)decrease by a factor of \(e^{\beta_1}\) as X increases by 1 unit. If \(\beta_1 = 0\), then the predictor has no effect on the dependent variable.
For each predictor \(X_i\):
\(H_0:\beta_i = 0 \hspace{1mm} (OR_i = 1)\)
\(H_a:\beta_i \neq 0 \hspace{1mm} (OR_i \neq 1)\)
Within the context of logistic regression, the z-score is equal to \(\frac{\hat{\beta_i}}{\sigma\hat{\beta_i}}\), which is normally distributed. This is also sometimes known as the Wald statistic, and yields a p-value for each term in the regression using the standard normal (z) tables and in the R output.
Assessing Quality of Model Fit
With logistic regression, it is possible to calculate an \(R^2\) value and, generally speaking, the higher the \(R^2\), the better. However, this value is rarely used, as it cannot be interpreted as the percent of variance that is explained by the model.
Instead, we can use the Akaike Information Criterion to compare models. The AIC estimates prediction error, and therefore the quality of our model, and includes a penality that “is an increasing function of the number of estimated parameters”.2 When comparing multiple models, “the one with the minimum AIC value assures a good balance of goodness of fit and complexity”.3
Finally, in assessing the quality of our model, we can consider specificity, sensitivity, and the misclassification rate.
Sensitivity (also called the true positive rate) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition), and is complementary to the false negative rate. Specificity (also called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition), and is complementary to the false positive rate. The misclassification rate is the proportion of the sum of false positives and false negatives out of the total number of observations in the sample.
False positives are dangerous, folks
A simple way of assessing model quality is to determine a cut-off value for the fitted values \(\hat{y_i}\), which is equal to P(Y = 1) at each observation \(i\). When the model performs well, observations where \(Y = 1\) will have higher values of \(\hat{y_i}\), and observations where \(Y = 0\) will have lower values of \(\hat{y_i}\). We can set a cut-off value, ranging between 0 and 1, for what constitutes a “high” versus a “low” value based on the distribution of our data. In practice, most statisticians use multiple values, e.g., 0.1, 0.2, 0.3 … 0.7.
Ideally, we want to choose a cut-off value that will optimize sensitivity and specificity. Typically, we want to see a high true positive rate and a low false positive rate, although, depending on the specific case, it may be more important to minimize the false positive rate than to maximize the true positive rate, or vice versa. To this end, we can plot the true positive rate (sensitivity) against the false positive rate (specificity), as in Figure 14. This is known an ROC curve, and it offers us a few different options for identifying out probability cut-off:
The Youden Index is a cut-off for which the sum of sensitivity plus specificity is maximized
A cut-off for which the ROC curve has the minimum distance from the upper left corner of the graph, i.e., the point at which specificy = 1 and sensitivity = 1 (this can be implemented in R)
In this assignment, we will be selecting our cut-off by taking the second approach and minimizing the distance of the curve from the upper left corner of the graph.
Figure 1: ROC Curve Example
If we calculate the area under our ROC curve (the AUC), we can measure the prediction accuracy of the model. It is effectively the probability that the model correctly ranks two randomly selected observations where one has \(Y = 1\) and the other has \(Y = 0\), i.e., the probability that the \(\hat{y}\) for the observation where one has \(Y = 1\) is higher than the \(\hat{y}\) where \(Y = 0\). Higher AUCs mean that we can find a cut-off value for which both sensitivity and specificity are relatively high. The AUC varies from 0.5 to 1, and roughly correspond to the following accuracy levels:
.9 to 1.0 = excellent
.8 to .9 = good
.7 to .8 = fair (some statisticians say that any value > .7 is good)
.6 to .7 = poor
.5 to .6 = fail
Exploratory Analysis Prior to Regression
Prior to running our regression model, we will run cross tabulations between the dependent variable DRINKING_D and each of the binary predictors (FATAL_OR_M, OVERTURNED, CELL_PHONE, SPEEDING, AGGRESSIVE, DRIVER1617, DRIVER65PLUS). A cross tabulation will show us the proportional distribution of the variables. More importantly for our purposes, we can use a Chi-squared test to examine the association between the DV and each individual predictor variable. Here, the p-value is calculated based on the \(\chi^2\) value and the degrees of freedom. If p < 0.05, then the relationship between the dependent variable and the given predictor is significant. If we have a high \(\chi^2\) and a p-value < 0.05, we can therefore reject \(H_0\), that the proportion of fatalities for crashes in which our predictor = 1 is the same as the proportion of fatalities for crashes in which our predictor = 0, in favor of \(H_a\), that the proportion of fatalities for crashes in which our predictor = 1 is not the same as the proportion of fatalities for crashes in which our predictor = 0. This suggests a relationship between the variables.
Additionally, we can compare the means of the continuous predictors in our model (PCTBACHMOR and MEDHHINC) by looking at independent sample t-tests. Here, \(H_0\) is that the average values for our continuous predictor variable are the same for crashes that involve drunk drivers and crashes that don’t. \(H_a\) is that the average values for our continuous predictor variable are different for crashes that involve drunk drivers and crashes that don’t. As before, a high value of the t-statistic, and a p-value lower than 0.05 suggest that there’s evidence to reject the null hypothesis in favor of the alternative.
Results
Findings from Exploratory Analysis
Distribution of Dependent Variable
To begin our data exploration, we’ll consider the distribution of our dependent variable. As we can see, from our prop table, crashes in which the driver were drunk account for only 5.7% of all 43,364 crashes between 2008 and 2012.
Next, we’ll look at the cross-tabulations of our dependent variable with each of the binary predictors to evaluate whether the Chi-squared test indicates a statistically significant relationship.
As we can see based on these cross-tabulations, the Chi-squared test shows a high \(\chi^2\) and a p-value < 0.05 for the dependent variable and each of the binary predictors except for CELL_PHONE. For all variables except for CELL_PHONE, then, we can reject \(H_0\), which, as we stated before, states that there is that the probability of a crash involving a drunk driver is no different whether our predictor = 1 or = 0. In other words, we can reject the hypothesis that there is no relationship between the dependent variable and all of the predictors other than CELL_PHONE.
T-test shows us high p-value which implies that we can’t reject the Null Hypothesis, the average values for our continuous predictor variable are the same for crashes that involve drunk drivers and crashes that don’t.
Welch Two Sample t-test
data: crash_data$pctbachmor by crash_data$drinking_d
t = -0.10842, df = 2777.5, p-value = 0.9137
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.7991398 0.7153982
sample estimates:
mean in group 0 mean in group 1
16.56986 16.61173
Welch Two Sample t-test
data: crash_data$medhhinc by crash_data$drinking_d
t = -1.4053, df = 2763.9, p-value = 0.16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-1235.2508 203.8544
sample estimates:
mean in group 0 mean in group 1
31483.05 31998.75
Assumptions of Logistic Regression
Recalling from Method Section that there are three main assumptions for logistic regressions: 1) binary dependent variable, 2) independent observations, and 3) no severe multicollinearity. The dependent variable, DRINKING_D (whether drivers were drunk or not drunk) was binary, and the observations of each recorded car accident are independent of each other.
Multicollinearity usually refers to the situation when two or more predictors (explanatory variables) are strongly correlated with each other (r > 0.9 or r < -0.9), which signals a redundancy in including both predictors (in other words, adding the second one does not help improve the predictive power of the model). In the case of multicollinearity, we usually leave one predictor and take out the rest that are strongly correlated with that one predictor.
To test the third assumption, whether there is multicollinearity between all predictors (both binary and continuous), we used the Pearson correlations and results are shown below in the form of matrix. Note that since Pearson is usually run based on the assumption of continuous (not categorical) variables, there are potential limitations of using Pearson correlations to measure the association between binary predictors, since Pearson cannot determine non-linear relationships.
Although MEDHHINC (the median household income) and PCTBACHMOR (percentage of bachelor degree in the census tract where accident happened) having relatively higher correlation (r = 0.477 < 0.9, which does not qualify for multicollinearity), the correlation matrix indicates no severe multicollinearity across all predictors (all r satisfies -0.9 < r < 0.9, showing non-multicollinearity).
Logistic Regression Results
The results of logistic regression with all predictors is shown below. All predictors, except for CELL_PHONE (whether drivers were using cell-phones) and PCTBACHMOR (percentage of bachelor degree in the census tract where accident happened), are statistically significantly associated with drinking_d (whether drivers were drunk).
Call:
glm(formula = drinking_d ~ fatal_or_m + overturned + cell_phone +
speeding + aggressive + driver1617 + driver65plus + pctbachmor +
medhhinc, family = "binomial", data = crash_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1945 -0.3693 -0.3471 -0.2731 3.0099
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.732506616 0.045875659 -59.563 < 0.0000000000000002 ***
fatal_or_m 0.814013802 0.083806924 9.713 < 0.0000000000000002 ***
overturned 0.928921376 0.109166324 8.509 < 0.0000000000000002 ***
cell_phone 0.029550085 0.197777821 0.149 0.8812
speeding 1.538975665 0.080545894 19.107 < 0.0000000000000002 ***
aggressive -0.596915946 0.047779238 -12.493 < 0.0000000000000002 ***
driver1617 -1.280295964 0.293147168 -4.367 0.000012572447127937 ***
driver65plus -0.774664640 0.095858315 -8.081 0.000000000000000641 ***
pctbachmor -0.000370634 0.001296387 -0.286 0.7750
medhhinc 0.000002804 0.000001341 2.091 0.0365 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 19036 on 43363 degrees of freedom
Residual deviance: 18340 on 43354 degrees of freedom
AIC: 18360
Number of Fisher Scoring iterations: 6
The following table indicates the odds ratio for each predictor. Odds is usually calculated as \(\frac{\text{number of desirable outcomes}}{\text{number of undesirable outcomes}}\). For all statistically significant predictors, if the OR (Odds Ratio) value is higher than 0 (i.e.,fatal_or_m, overturned, cell_phone, speeding, and medhhinc ), 1 unit increase in each predictor corresponds to a increase in the odds of the driver being drunk when accident happens; if the OR (Odds Ratio) value is lower than 0 (i.e.,aggressive, driver1617, driver65plus, and pctbachmor), 1 unit increase in each predictor corresponds to a increase in the odds of the driver being drunk when accident happens.
The following set of matrices indicate specificity, sensitivity at different probability cut-offs (thresholds), respectively cutting off at probability of 0.02, 0.03, 0.05, 0.07, 0.08, 0.09, 0.1, 0.15, 0.2, and 0.5. The values are then encoded into a table presenting not only specificity and sensitivity but also mis-classifications rate. Based on the table, the cut-off value that has the lowest mis-classification rate is 0.02 and the one corresponding to the highest mis-classification rate is 0.5.
Cut-Off Values Table
Since the ROC curve can be used to plot true positive rate against false positive rate and determine the quality of the model, a ROC curve is plotted for our model. Given the ROC curve, the calculated optimal cut-off point is at probability of 0.06. This suggested cut-off point diverge from the 0.02 cut-off point observed from the table earlier. This difference may result from the fact that the previous table only aims to minimize mis-classification rate, while the optimal cut-off point here aims to maximizing both sensitivity and specificity.
Show the code
#3-a a <-cbind(crash_data$drinking_d, fit)colnames(a) <-c("labels", "predictions")head(a)
Given the ROC curve as presented above, we then calculated the area under curve to further measure the accuracy of the model. The calculated AUC is 0.64, suggesting that the model is relatively unsatisfactory.
Logistic Regression with Binary Predictors
The following table is the results from a logistic regression only with binary predictors (explanatory variables). The results suggest that cell_phone remains statistically insignificant in the new model, while other binary predictors remain significant.
In addition, by comparing the Akaike Information Criterion (AIC) values, we can see that the models have almost identical AIC values, suggesting that they have similar quality.
Show the code
AIC(logit, logit2)
df AIC
logit 10 18359.63
logit2 8 18360.47
Similar to OLS, we can do cross-validation to determine a quality of the model. See following links for more information: K-Fold cross-validation in R: https://www.r-bloggers.com/evaluating-logistic-regression-models/ More on cross-validation in R: https://www.r-bloggers.com/predicting-creditability-using-logistic-regression-in-r-cross-validating-the-classifier-part-2-2/
Discussion
This projects use logistic regression as an alternative to OLS regression to look at relationships between whether drivers were drunk or not when car crashes happened and a group of predictors. From the models, we can see that all predictors, except for cell_phone and pctbachmor, (i.e., fatal_or_m, overturned, speeding, aggressive, driver 1617, driver65plus, medhhinc) are significantly associated with drivers drunk status. The logistic regression model satisfies all assumptions needed, as shown in the previous section.
Most results, including the significance of relationships and the direction of the relationships, align with expectations. For instance, we wouldn’t expect whether drivers drank to be associated with the percentage of bachelor degree holders in the tracts of where the incidents happen, and it turns out to be not significant. Similarly, the use of cell phone is not expected to have a direct association with drunk drivers. On the other hand, the predictors related to age, diver1617 and driver65plus are expected to be valid predictors, since teenage drivers are not legal to drink, and drivers older than 65 are less likely to drink due to health concerns. One result that is unexpected is the negative association between aggressive driving behaviors and whether the driver was drunk, as we would usually assume drunk drivers may perform more aggressively when driving.
Rare Events Modeling in Logistic Regression
To judge whether logistic regression is appropriate here, we need to examine the number of cases, especially the number of the rarer outcome of the two – in this cases, drivers that were drunk at the time of crashes are a lot less than those who were not drunk. If there is a small sample bias here, we can use Paul Allison proposed methods for modeling rare events, similar to penalized likelihood, for reducing small-sample biases in Maximum likelihood estimation.
The question is then whether the number of drunk drivers should be considered small sample. The answer is uncertain. According to Table 1, we have 2,485 observations for drivers who were drunk at the time of incidents, versus 40,879 observations for drivers who were not drunk. Although 2,485 seems like a reasonably sized group, it only take 5.7% of all observations. Therefore, there might be a need to adopt Paul Allison’s methods for modeling rare events instead of just using logistic regression.
Limitations
In this Analysis we’ve only leveraged given independent variables, to improve our model, we should’ve to spend more time in exploring other independent variables. It might help us to get larger value of AUC than that of 0.64 in current model.
Footnotes
Sperandei S. Understanding logistic regression analysis. Biochem Med (Zagreb). 2014 Feb 15;24(1):12-8. doi: 10.11613/BM.2014.003. PMID: 24627710; PMCID: PMC3936971.↩︎