
The Application of Logistic Regression to Examine the Predictors of Car Crashes Caused by Alcohol

Brooke Porter, Dennis Espejo, Oindriza Reza Nodi

MUSA 5000: Eugene Brusilovskiy

November 20, 2025

Introduction

Drunk driving remains a persistent threat to public health and safety (Perrine et al., 1989), despite decades of efforts to curb this high-risk behavior through law enforcement, legislation, and activism (Eisenberg, 2003). In this report, we consider the issue of drunk driving in the city of Philadelphia. Philadelphia is a major metropolitan city and currently holds the title of the most walkable city in America (Visit Philadelphia, 2025). Alcohol-impaired driving threatens the wellbeing of other drivers and of pedestrians. As such, analyzing the characteristics of drunk driving accidents provides insights into how to create safer conditions for both drivers and pedestrians in the city.

In this report, we identify predictors of accidents involving alcohol-impaired driving. For our analysis, we use R to run a logistic regression. Specifically, we regress the binary dependent variable, drinking driver indicator (DRINKING_D), on the following binary and continuous predictors: crash resulted in fatality or major injury (FATAL_OR_M), crash involved an overturned vehicle (OVERTURNED), driver was using a cell phone (CELL_PHONE), crash involved a speeding car (SPEEDING), crash involved aggressive driving (AGGRESSIVE), crash involved at least one driver who was 16 or 17 years old (DRIVER1617), crash involved at least one driver who was at least 65 years old (DRIVER65PLUS), percentage of residents with at least a bachelor’s degree in the Census Block Group where the crash took place (PCTBACHMOR), and median household income in the Census Block Group where the crash took place (MEDHHINC). Our data come from a dataset compiled by the Pennsylvania Department of Transportation with information on car accidents in the City of Philadelphia that occurred between 2008 and 2012. The data are geocoded, which allows each crash point to be spatially joined to the 2000 Census block group data that we also use for this analysis. This spatial join provides the MEDHHINC and PCTBACHMOR of the block group for each crash point. This additional information is useful because census blocks are the smallest geographic unit of analysis provided by the U.S. Census Bureau (Rossitier, 2011), and fine-grained Census geography supports an in-depth analysis of trends across specific areas of Philadelphia.

Methods

OLS Regression and Binary Variables

Typically, in OLS regression, we work with a continuous dependent variable. We estimate a coefficient for each predictor, and when those coefficients are multiplied by their corresponding x values and summed with the intercept, the model produces a numerical prediction for the dependent variable. If we try to apply this same logic to a binary outcome, we could attempt to predict the probability that a variable Y equals 1. In this case, the probability P(Y=1) becomes the dependent variable, and the model would look like:

\[ P(Y = 1) = \beta_0 + \beta_1 x_1 + \varepsilon \]

However, suppose \(\beta_0 = 0.5\), \(\beta_1 = 0.4\), and the predictor takes the value \(x_1 = 15\). Then we would obtain the following result:

\[ P(Y = 1) = 0.5 + 0.4(15) \] \[ P(Y = 1) = 6.5 \]

A value of 6.5 makes little practical sense because probabilities must range between 0 and 1. A value of 1 represents a 100% probability, and it is nonsensical for a probability to exceed 100%. Logistic regression offers a workaround to these constraints of modeling binary variables with OLS.

Rundown of Logistic Regression

Logistic regression provides a much-needed solution by modeling a binary outcome in a way that guarantees predicted probabilities stay between 0 and 1. However, before we dive into logistic regression, it is essential to understand odds and odds ratios, a key component that is both conceptually and practically important in understanding logistic regression. The following formula represents the odds:

\[ \text{Odds} = \frac{p}{1 - p} \]

Where p equals the probability of desirable outcomes, and 1-p represents the probability of undesirable outcomes. For example, let’s consider a bag of fruit containing 5 apples and 15 other types of fruit. We want to know the odds of picking an apple (our desirable outcome p) over our undesirable outcome (1-p). We would model the equation as follows, where p = 5/20 = .25:

\[ \text{Odds} = \frac{0.25}{1 - 0.25} = \frac{0.25}{0.75} = 0.33 \]

The probability of picking an apple out of all possible outcomes is .25, but the odds of picking an apple over the undesirable outcomes are .33. When interpreting odds, it’s important to note the following:

Odds = 1 → equally likely

Odds > 1 → event is more likely

Odds < 1 → event is less likely

Because .33 is less than 1, picking an apple is less likely than not picking one. Put differently, the odds are 67% below even odds (1 − .33 = .67), meaning the event occurs at only about one-third the rate of the non-event.

Now suppose we have a new bag with 40 apples and 25 other fruits, and we want to know the odds of picking an apple from this bag. Here p = 40/65, so the odds are (40/65)/(25/65) = 40/25 = 1.6. Because 1.6 > 1, picking an apple is more likely than not; subtracting 1 gives 1.6 − 1 = .6, so the odds are 60% above even odds (when the odds are greater than 1, we subtract 1 from the odds). An odds ratio then compares these two odds using the following formula:

\[ \text{OR} = \frac{\text{Odds in comparison group}}{\text{Odds in reference group}} \]

Let’s say our first bag is the comparison group, and our second bag is the reference group.

\[ \text{OR} = \frac{0.33}{1.60} \approx 0.21 \]

Dividing .33 by 1.6 gives an odds ratio of about .21, which means the odds of picking an apple in the first bag are only about 21% of the odds in the second bag. In other words, the odds of selecting an apple are about 79% lower in Bag 1 than in Bag 2.
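The fruit-bag arithmetic above can be checked with a short sketch. It is written in Python purely for illustration (the report's own analysis was run in R), and the function name `odds` is our own choice rather than anything from the course materials.

```python
def odds(successes, failures):
    """Odds of the event, p / (1 - p), computed from raw counts."""
    total = successes + failures
    p = successes / total
    return p / (1 - p)

# Bag 1: 5 apples and 15 other fruits; Bag 2: 40 apples and 25 other fruits.
odds_bag1 = odds(5, 15)
odds_bag2 = odds(40, 25)

# Odds ratio: Bag 1 odds divided by Bag 2 odds, as in the worked example.
odds_ratio = odds_bag1 / odds_bag2

print(round(odds_bag1, 2), round(odds_bag2, 2), round(odds_ratio, 2))
```

Working from unrounded odds gives the same two-decimal odds ratio as dividing the rounded values .33 and 1.6.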

Essentially, with logistic regression, we use odds in the logit model:

\[ \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 \]

Which, when we use algebra to isolate p, gives us the logistic function:

\[ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}} \]

The logit model and logistic function ensure that our predicted probabilities are constrained between 0 and 1. To further illustrate how the logistic function handles a binary outcome, the following graph shows the difference between the outputs of an OLS regression and a regression using the logistic function.

Figure 1: Logistic curve (blue) mapping linear predictors to probabilities vs. OLS line (red) that can fall outside 0–1. (Brusilovskiy, 2025)

Whereas the OLS line can fall below 0 or rise above 1, making a sensible probability estimate impossible, the logistic curve stays between 0 and 1. The logit function expresses the same relationship with the x and y axes flipped: it takes a probability between 0 and 1 as input and returns an unbounded log-odds value.
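The contrast between the two curves can also be seen numerically. The sketch below, in Python for illustration only (the report's analysis was run in R), reuses the intercept 0.5 and slope 0.4 from the earlier OLS illustration and evaluates both the raw linear prediction and its logistic transformation at a few predictor values. The function names are hypothetical.

```python
import math

def linear_prediction(b0, b1, x):
    """Raw linear predictor b0 + b1 * x, as in an OLS-style probability model."""
    return b0 + b1 * x

def logistic(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = 0.5, 0.4
for x in (-15, 0, 15):
    z = linear_prediction(b0, b1, x)
    # The linear value can leave the 0-1 range; the logistic value cannot.
    print(x, z, round(logistic(z), 4))
```

At x = 15 and x = −15 the linear prediction falls far outside the 0 to 1 range, while the logistic value remains a valid probability in both cases.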

For this project, the logit model is:

\[ \ln\left(\frac{P(\text{DRINKING\_D}=1)}{1 - P(\text{DRINKING\_D}=1)}\right) = \beta_0 + \beta_1 \cdot \text{FATAL\_OR\_M} + \beta_2 \cdot \text{OVERTURNED} + \beta_3 \cdot \text{CELL\_PHONE} \]

\[ + \beta_4 \cdot \text{SPEEDING} + \beta_5 \cdot \text{AGGRESSIVE} + \beta_6 \cdot \text{DRIVER1617} + \beta_7 \cdot \text{DRIVER65PLUS} \]

\[ + \beta_8 \cdot \text{PCTBACHMOR} + \beta_9 \cdot \text{MEDHHINC} \]

The logit function takes the probability P(DRINKING_D = 1), converts it to the odds P(DRINKING_D = 1) / (1 − P(DRINKING_D = 1)), and takes the natural log, producing a quantity that is linear in the predictors. The inverse of the logit function, the logistic function, maps that linear combination back to a probability p. As previously stated, the logistic function is derived by using algebra to isolate p in the logit model and is represented by the following formula:

\[ P(\text{DRINKING\_D}=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{FATAL\_OR\_M} + \beta_2\,\text{OVERTURNED} + \beta_3\,\text{CELL\_PHONE} + \beta_4\,\text{SPEEDING} + \beta_5\,\text{AGGRESSIVE} + \beta_6\,\text{DRIVER1617} + \beta_7\,\text{DRIVER65PLUS} + \beta_8\,\text{PCTBACHMOR} + \beta_9\,\text{MEDHHINC})}} \]

Here P(DRINKING_D = 1) is the probability that a crash involves a drunk driver, and the exponent of e contains the regression equation, where \(\beta_0\) is the intercept and \(\beta_1\) through \(\beta_9\) are the coefficients of the predictors. As the linear predictor (\(\beta_0 + \beta_1 \cdot \text{FATAL\_OR\_M} + \cdots\)) grows large and positive, p approaches 1, signaling a higher likelihood of the event occurring; as it grows large and negative, p approaches 0. This is the main benefit of the logistic function: it constrains the predicted probability to lie between 0 and 1.

Hypothesis Testing

Within logistic regression, for each predictor, the null and alternative hypotheses are:

\[ H_0: \beta_i = 0 \; (\text{OR}_i = 1) \]

\[ H_1: \beta_i \ne 0 \; (\text{OR}_i \ne 1) \]

A coefficient of 0 means the predictor has no association with the outcome; under this scenario, no other value of \(\beta_i\) makes the likelihood function used in Maximum Likelihood Estimation, “a statistical method for estimating the coefficients of a model” (Brusilovskiy, 2025), any larger than the likelihood achieved at \(\beta_i = 0\). To test the significance of \(\beta_i\) in logistic regression, the following formula produces a z-statistic that follows a standard normal distribution under the null hypothesis:

\[ \frac{\hat{\beta}_i - E(\hat{\beta}_i)}{\sigma_{\hat{\beta}_i}} = \frac{\hat{\beta}_i - 0}{\sigma_{\hat{\beta}_i}} = \frac{\hat{\beta}_i}{\sigma_{\hat{\beta}_i}} \]

The ratio \(\hat{\beta}_i / \sigma_{\hat{\beta}_i}\), sometimes referred to as the Wald statistic, is the z-statistic, and the p-value is derived from it. For the purposes of this paper, we use the p-value reported by R, which is also based on the z-statistic, rather than calculating it ourselves. Additionally, when the odds ratio, which statisticians prefer to interpret, equals 1, the odds of the event are the same at every value of the predictor. This means the predictor does not shift the odds of the event one way or the other. The odds ratio is obtained by exponentiating the coefficient estimated for each predictor. For example, if we have a coefficient of 0.25, then \(e^{0.25} \approx 1.28\). Since 1.28 is greater than 1, we subtract 1 from 1.28 to get 0.28, then multiply by 100 to obtain 28%. This means the odds of the event increase by 28 percent for a one-unit increase in the predictor.
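As a rough illustration of the steps just described, the sketch below computes a Wald z-statistic, its two-sided p-value from the standard normal distribution, and the odds ratio for the coefficient of 0.25 used in the text. It is written in Python for illustration only (the report's analysis was run in R). The standard error of 0.10 is a hypothetical value chosen for the example, and the function name `wald_test` is our own.

```python
import math

def wald_test(beta_hat, std_error):
    """Return the Wald z-statistic and its two-sided standard normal p-value."""
    z = beta_hat / std_error
    # For Z ~ N(0, 1), P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

beta_hat = 0.25    # coefficient from the example in the text
std_error = 0.10   # hypothetical standard error, for illustration only

z, p_value = wald_test(beta_hat, std_error)
odds_ratio = math.exp(beta_hat)            # exponentiate the coefficient
pct_change = (odds_ratio - 1.0) * 100.0    # percent change in the odds

print(round(z, 2), round(p_value, 4), round(odds_ratio, 2), round(pct_change))
```

With these assumed inputs the z-statistic is 2.5, which would lead us to reject the null hypothesis at the .05 level.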

Assessing Quality of Model Fit

Although some logistic regression output reports a type of R-squared (often called a pseudo R-squared), it does not have the same interpretation as the R-squared used in OLS regression. In OLS, R-squared represents the percent of variance in the dependent variable explained by the model. Logistic regression, however, models the log-odds of a binary outcome, so explained variance is not the basis of model fit. Because of this, logistic regression uses alternative measures of model quality.

One of the most commonly used metrics for evaluating logistic regression models is the Akaike Information Criterion (AIC). AIC allows us to compare competing models, where the model with the lowest AIC is considered the best-fitting model. AIC is calculated using the formula:

\[ \text{AIC} = 2K - 2\ln(L) \]

Where K is the number of estimated parameters in the model, and L is the maximized value of the likelihood function (that is, how likely it is that the model produced the observed pattern of Y values) (Bevans, 2023). The logic of residuals in linear regression, \(e_i = y_i - \hat{y}_i\), does not directly translate to logistic regression because \(\hat{y}_i\) is a predicted probability rather than a predicted continuous value. In logistic regression, \(\hat{y}_i\) represents the predicted probability that \(Y = 1\):

\[ P(Y = 1) = \hat{y}_i = \frac{\exp(\beta_0 + \beta_1 x_{1i} + \cdots)} {1 + \exp(\beta_0 + \beta_1 x_{1i} + \cdots)} \]

To assess how well the model predicts the binary outcome, we examine whether high predicted probabilities correspond to cases where Y = 1, and low predicted probabilities correspond to cases where Y = 0. Three quantities summarize this: sensitivity (the share of actual Y = 1 cases predicted as 1), specificity (the share of actual Y = 0 cases predicted as 0), and the misclassification rate (the share of all cases predicted incorrectly). To illustrate these concepts, let’s consider a table where we are trying to determine if a grocery store is located within a census block, using a cutoff of 0.5.

Table 1: Sample Table for Specificity, Sensitivity, and Misclassification Analysis

From the table above, let’s assign the following variables to each value:

A: Census blocks with no grocery store predicted as not having a grocery store (25) (True Negative)

B: Census blocks with a grocery store predicted as not having a grocery store (7) (False Negative)

C: Census blocks with no grocery store predicted as having a grocery store (5) (False Positive)

D: Census blocks with a grocery store predicted as having a grocery store (23) (True Positive)

The following equations can be used to figure out the Sensitivity, Specificity, and Misclassification rates:

\[ \text{Sensitivity} = \frac{d}{b + d} \]

\[ \text{Specificity} = \frac{a}{a + c} \]

\[ \text{Misclassification\ Rate} = \frac{b + c}{a + b + c + d} \]

Plugging in our values, we get the following:

\[ \text{Sensitivity} = \frac{23}{30} = .77 \]

\[ \text{Specificity} = \frac{25}{30} = .83 \]

\[ \text{Misclassification Rate} = \frac{12}{60} = .20 \]

This means the model has a sensitivity of 0.77 (the share of true positives captured), a specificity of 0.83 (the share of true negatives captured), and a total misclassification rate of 0.20. To get the false positive and false negative rates, we subtract the specificity and sensitivity rates, respectively, from 1. In our example, the false positive rate is 1 − .83 = .17, and the false negative rate is 1 − .77 = .23.
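The three formulas above can be applied directly to the four cell counts from the grocery store example. The following is a minimal Python sketch for illustration (the report's analysis was run in R); the function name `classification_rates` is hypothetical.

```python
def classification_rates(a, b, c, d):
    """Rates from a 2x2 classification table.

    a = true negatives, b = false negatives,
    c = false positives, d = true positives.
    """
    sensitivity = d / (b + d)
    specificity = a / (a + c)
    misclassification = (b + c) / (a + b + c + d)
    return sensitivity, specificity, misclassification

# Cell counts from the grocery store example in the text.
sens, spec, miss = classification_rates(a=25, b=7, c=5, d=23)

false_negative_rate = 1.0 - sens
false_positive_rate = 1.0 - spec

print(round(sens, 2), round(spec, 2), round(miss, 2))
print(round(false_negative_rate, 2), round(false_positive_rate, 2))
```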

However, what counts as a “high” or “low” probability is not a fixed standard. The commonly used cutoff of 0.5, which we used in our previous example, is arbitrary and may not always be sensible. For example, if only a small proportion of cases in the dataset have Y = 1, the model may never generate predicted probabilities above 0.5 simply because the base rate is so low. In this case, using 0.5 as a cutoff would yield poor results. A better approach is to examine a histogram of \(\hat{y}\) to see where the predicted probabilities cluster and choose a cutoff point that aligns with the distribution of the data. For example, if we use a cutoff of 0.5 but most \(\hat{y}\) values fall below 0.5, then sensitivity will be low while specificity remains high. It is then up to us to choose a cutoff point that aligns with our research question. If we care more about identifying true positives than about capturing true negatives, as with a contagious disease, where it may be safest to flag as many potential positive cases as possible, we would choose a cutoff that yields higher sensitivity at the cost of lower specificity. With that said, there is no single answer as to whether higher or lower values of each quantity are better or worse; it mainly depends on the research question.

Several methods can help us determine the optimal cutoff point for our project. The ROC curve is “a way to plot true positive rates against false positive rates (1-specificity)” (Brusilovskiy, 2025).

Figure 2: Sample ROC Curve Illustration (Brusilovskiy, 2025)

There are several ways to identify probability cut-offs and to summarize classification performance based on ROC curves, including:

Youden Index: A cutoff for which (Sensitivity + Specificity) is maximized. A closely related approach picks the cutoff at which the ROC curve is closest to the upper left corner of the graph, where Sensitivity = 1 and Specificity = 1.

Area Under Curve (AUC): Rather than a cutoff rule, a summary measure of the proportion of the graph that lies below the ROC curve. A higher AUC reflects stronger overall classification performance, as the model achieves higher true positive rates while maintaining low false positive rates. Below is a “rough guide for classifying the accuracy” as well as a sample AUC, where the yellow portion highlights the AUC (Brusilovskiy, 2025):

• .90-1 = excellent

• .80-.90 = good

• .70-.80 = fair

• .60-.70 = poor

• .50-.60 = fail

Figure 3: Sample AUC illustration (Brusilovskiy, 2025)
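To make the ROC ideas above concrete, the sketch below builds ROC points from a small set of outcomes and predicted probabilities, picks the cutoff that maximizes the Youden index, and approximates the AUC with the trapezoid rule. It is written in Python for illustration only (the report's analysis was run in R). The ten observations are hypothetical, chosen so that every predicted probability sits far below 0.5 as in a low base-rate setting, and all function names are our own.

```python
def rates_at_cutoff(y_true, y_prob, cutoff):
    """Sensitivity and specificity when cases with p >= cutoff are predicted as 1."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def roc_points(y_true, y_prob):
    """ROC points (false positive rate, true positive rate), highest cutoff first."""
    cutoffs = sorted(set(y_prob), reverse=True)
    points = [(0.0, 0.0)]
    for cutoff in cutoffs:
        sens, spec = rates_at_cutoff(y_true, y_prob, cutoff)
        points.append((1.0 - spec, sens))
    return cutoffs, points

def auc_trapezoid(points):
    """Area under the ROC curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical outcomes and predicted probabilities (not from the crash data).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.20]

cutoffs, points = roc_points(y_true, y_prob)
auc = auc_trapezoid(points)

# Youden index J = sensitivity + specificity - 1; keep the cutoff with the largest J.
best_cutoff = max(cutoffs, key=lambda c: sum(rates_at_cutoff(y_true, y_prob, c)) - 1.0)

print(round(auc, 3), best_cutoff)
```

In this toy data a 0.5 cutoff would classify every case as 0, while the Youden-based cutoff lands among the small predicted probabilities where the positive cases actually sit.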

Assumptions of Logistic Regression

Within logistic regression, several assumptions from OLS regression still hold.

1. We assume independence of observations, meaning that each case in the dataset should be unrelated to the others. This assumption follows the same logic discussed in the first homework assignment.

2. We assume no severe multicollinearity among the predictors. Logistic regression can tolerate slightly more multicollinearity than OLS, but there should not be strong correlations between independent variables.

In contrast to OLS regression, logistic regression does not require several of the classical linear model assumptions:

There is no assumption of homoscedasticity because logistic regression estimates probabilities through maximum likelihood, not through minimizing squared residuals.
The residuals also do not need to be normally distributed, since logistic regression works with the log-odds of the outcome and uses likelihood-based inference rather than relying on the normal distribution of errors.
Logistic regression does not assume a linear relationship between the independent variables and the dependent variable.

Logistic regression also has several assumptions that are unique to the model itself.

The dependent variable must be binary, meaning it only takes on two possible values.
Logistic regression also performs best with larger samples, since maximum likelihood estimation benefits from more information to produce stable and unbiased parameter estimates.

Exploratory Analysis

Before running logistic regression, statisticians often use a Chi-Square test for exploratory analysis, specifically to see if there are associations between the dependent variable and binary predictors. For the initial chi-square analysis, the null and alternative hypotheses are:

\(H_0\): The proportion of crashes for which the binary predictor equals 1 is the same among crashes where the dependent variable equals 1 as among crashes where it equals 0.

\(H_a\): The proportion of crashes for which the binary predictor equals 1 is different among crashes where the dependent variable equals 1 than among crashes where it equals 0.

With chi-square, a high value of the \(\chi^2\) statistic and a significant p-value are necessary to reject the null hypothesis in favor of the alternative hypothesis. The \(\chi^2\) statistic is calculated using the following formula:

\[ \chi^2 = \sum \frac{(\text{Observed value} - \text{Expected value})^2} {\text{Expected value}} \]
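A minimal sketch of this calculation for a 2×2 cross-tabulation is shown below, in Python for illustration only (the report's analysis was run in R). The counts are hypothetical and the function name is our own. No continuity correction is applied, so the result may differ slightly from software that applies one by default for 2×2 tables.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts: rows = predictor (0, 1), columns = dependent variable (0, 1).
table = [[90, 10],
         [60, 40]]

chi2 = chi_square_2x2(table)

# With 1 degree of freedom, P(chi-squared > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2.0))

print(round(chi2, 2), p_value)
```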

For continuous predictors and a binary dependent variable, we compute the mean and standard deviation of each continuous predictor within the two categories of the DV, and then employ a t-test, which is commonly used to “compare the mean value of a continuous variable for two independent groups” (Brusilovskiy, 2025).

For this project, our continuous predictors are PCTBACHMOR and MEDHHINC, and the null and alternative hypotheses for each are as follows:

PCTBACHMOR

\(H_0\): Average values of the variable PCTBACHMOR are the same for crashes that involve drunk drivers and crashes that don’t.
\(H_a\): Average values of the variable PCTBACHMOR are different for crashes that involve drunk drivers and crashes that don’t.

MEDHHINC

\(H_0\): Average values of the variable MEDHHINC are the same for crashes that involve drunk drivers and crashes that don’t.
\(H_a\): Average values of the variable MEDHHINC are different for crashes that involve drunk drivers and crashes that don’t.
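The test statistic behind these hypotheses can be sketched as follows, in Python for illustration only (the report's analysis was run in R). The sketch uses Welch's unequal-variance form of the independent samples t-test, which is an assumption on our part since the text does not state which variant is used. The two small samples are hypothetical, not drawn from the crash data.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch t-statistic and approximate degrees of freedom for two independent samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical values of a continuous predictor in the two DV groups.
group_no_alcohol = [10, 12, 14, 16, 18]
group_alcohol = [11, 13, 15, 17, 19]

t_stat, df = welch_t(group_no_alcohol, group_alcohol)
print(round(t_stat, 3), round(df, 1))
```

The p-value would then come from comparing the statistic to a t distribution with the computed degrees of freedom; a statistic this close to zero would not lead us to reject the null hypothesis.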

Results

Table 2: Tabulation of the dependent variable (DRINKING_D)

Tabulating the dependent variable reveals that 6% (2,485) of crashes in the dataset involved drunk driving, compared to 94% (40,879) that did not.

Table 3: Cross tabulation of the dependent variable with each of the binary predictors

Next, we looked at the cross tabulation of DRINKING_D with each of the binary predictors and performed a chi-squared test for each:

The null and alternative hypotheses for the chi-squared test for the variable FATAL_OR_M were as follows: \(H_0\): the proportion of crashes resulting in a fatality or major injury is the same for crashes that involve drunk drivers as for crashes that don’t, vs. \(H_a\): the proportion of crashes resulting in a fatality or major injury is different for crashes that involve drunk drivers than for crashes that don’t. The chi-squared test p-value for FATAL_OR_M is less than .05 and the chi-squared statistic is high, meaning there is a statistically significant difference; therefore, we reject the null hypothesis in favor of the alternative.

The null and alternative hypotheses for the chi-squared test for the variable OVERTURNED were as follows: \(H_0\): the proportion of crashes with overturned vehicles is the same for crashes that involve drunk drivers as for crashes that don’t, vs. \(H_a\): the proportion of crashes with overturned vehicles is different for crashes that involve drunk drivers than for crashes that don’t. The chi-squared test p-value for OVERTURNED is less than .05 and the chi-squared statistic is high, meaning there is a statistically significant difference; therefore, we reject the null hypothesis in favor of the alternative.

The null and alternative hypotheses for the chi-squared test for the variable SPEEDING were as follows: \(H_0\): the proportion of crashes with speeding is the same for crashes that involve drunk drivers as for crashes that don’t, vs. \(H_a\): the proportion of crashes with speeding is different for crashes that involve drunk drivers than for crashes that don’t. The chi-squared test p-value for SPEEDING is less than .05 and the chi-squared statistic is high, meaning there is a statistically significant difference; therefore, we reject the null hypothesis in favor of the alternative.

The null and alternative hypotheses for the chi-squared test for the variable AGGRESSIVE were as follows: \(H_0\): the proportion of crashes with aggressive driving is the same for crashes that involve drunk drivers as for crashes that don’t, vs. \(H_a\): the proportion of crashes with aggressive driving is different for crashes that involve drunk drivers than for crashes that don’t. The chi-squared test p-value for AGGRESSIVE is less than .05 and the chi-squared statistic is high, meaning there is a statistically significant difference; therefore, we reject the null hypothesis in favor of the alternative.

The null and alternative hypotheses for the chi-squared test for the variable DRIVER1617 were as follows: \(H_0\): the proportion of crashes with 16-17 year old drivers is the same for crashes that involve drunk drivers as for crashes that don’t, vs. \(H_a\): the proportion of crashes with 16-17 year old drivers is different for crashes that involve drunk drivers than for crashes that don’t. The chi-squared test p-value for DRIVER1617 is less than .05 and the chi-squared statistic is high, meaning there is a statistically significant difference; therefore, we reject the null hypothesis in favor of the alternative.

The null and alternative hypotheses for the chi-squared test for the variable DRIVER65PLUS were as follows: \(H_0\): the proportion of crashes with 65+ year old drivers is the same for crashes that involve drunk drivers as for crashes that don’t, vs. \(H_a\): the proportion of crashes with 65+ year old drivers is different for crashes that involve drunk drivers than for crashes that don’t. The chi-squared test p-value for DRIVER65PLUS is less than .05 and the chi-squared statistic is high, meaning there is a statistically significant difference; therefore, we reject the null hypothesis in favor of the alternative.

Based on the results of the chi-squared tests and the p-values for each predictor, we reject the null hypothesis in favor of the alternative for the predictors FATAL_OR_M, OVERTURNED, SPEEDING, AGGRESSIVE, DRIVER1617, and DRIVER65PLUS. Each of these predictors has a high chi-squared statistic as well as p < .05, which indicates statistical significance. This statistical significance points to an association between drunk driving and these predictors.

The only variable for which we do not reject the null hypothesis is CELL_PHONE. Because the chi-squared statistic for CELL_PHONE is small and its p-value is greater than .05, we do not reject the null hypothesis that the proportion of crashes involving cell phone use is the same for crashes that involve drunk drivers as for crashes that don’t.


Table 4: Continuous Variables’ Independent Samples T-tests

Next, we looked at the continuous predictors, PCTBACHMOR and MEDHHINC, and whether their means varied across the two levels of DRINKING_D. To compare the mean values of each continuous variable, we used an independent samples t-test. The null and alternative hypotheses for the independent samples t-test for the variable PCTBACHMOR were as follows:

\(H_0\): average values of the variable PCTBACHMOR are the same for crashes that involve drunk drivers and crashes that don’t, vs. \(H_a\): average values of the variable PCTBACHMOR are different for crashes that involve drunk drivers and crashes that don’t.

The p-value for the PCTBACHMOR t-test is greater than .05 and the t-statistic is small, meaning there is not a statistically significant difference; therefore, we cannot reject the null hypothesis in favor of the alternative.

The null and alternative hypotheses for the independent samples t-test for the variable MEDHHINC were as follows: \(H_0\): average values of the variable MEDHHINC are the same for crashes that involve drunk drivers and crashes that don’t, vs. \(H_a\): average values of the variable MEDHHINC are different for crashes that involve drunk drivers and crashes that don’t. The p-value for the MEDHHINC t-test is greater than .05 and the t-statistic is small, meaning there is not a statistically significant difference; therefore, we cannot reject the null hypothesis in favor of the alternative. In short, the average values of PCTBACHMOR and MEDHHINC are not statistically significantly different between crashes that involve drunk drivers and crashes that don’t.

Our exploratory analysis also indicates which assumptions of logistic regression the model satisfies and which it violates. The dependent variable, DRINKING_D, is indeed binary, with only two possible values (0 = no drunk driving and 1 = drunk driving), satisfying the first assumption of logistic regression. The assumption of independent observations is likely violated because the crashes are geocoded and joined to Census block group data: crash sites may cluster spatially, and crashes in the same block group share the same MEDHHINC and PCTBACHMOR values. The model does meet the assumption of a sufficiently large sample size because there are more than 50 observations per predictor. Finally, we used a Pearson correlation matrix to check for multicollinearity among the predictors.

Table 5: Pearson Correlation Matrix

Analysis of the matrix reveals that there is no severe multicollinearity between any of the variables. We used the threshold of 0.8 as an indicator of whether any predictor pairs are excessively correlated. None of the pairs have values above 0.8; therefore, none of the predictors should be excessively correlated and the logistic assumption of no severe multicollinearity is met. It is important to note that Pearson Correlations provide an imperfect assessment of associations between binary predictors because Pearson correlations measure linear relationships and, therefore, might underestimate the strength of the association between non-linearly related binary variables.
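The multicollinearity check described above amounts to computing a Pearson correlation for each pair of predictors and comparing its absolute value to 0.8. A minimal Python sketch for one pair is given below, for illustration only (the report's analysis was run in R). The two binary vectors are hypothetical and the function name is our own; for two binary variables, the Pearson correlation is also known as the phi coefficient.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical binary predictors for eight crashes (not from the crash data).
predictor_a = [1, 1, 0, 0, 1, 0, 1, 0]
predictor_b = [1, 0, 0, 0, 1, 0, 1, 1]

r = pearson_r(predictor_a, predictor_b)

# Flag the pair only if it exceeds the 0.8 threshold used in the text.
severe_multicollinearity = abs(r) > 0.8

print(round(r, 2), severe_multicollinearity)
```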

Table 6: Logistic Regression Results

The reduction in deviance from 19,036 (null) to 18,340 (residual) indicates that the predictors improve model fit, while the AIC of 18,360 provides a baseline value that can be used for comparison with alternative models but cannot be interpreted on its own.
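The reported figures can be tied back to the AIC formula with a short check, sketched here in Python for illustration only (the report's analysis was run in R). It relies on the fact that, for a logistic regression fit to individual 0/1 outcomes, the residual deviance equals −2 ln(L). The parameter count of 10 (nine predictors plus the intercept) is taken from the model specification, the deviance values are the rounded figures quoted above, and the function name is hypothetical.

```python
def aic_from_deviance(residual_deviance, n_parameters):
    """AIC = 2K - 2 ln(L), using residual deviance = -2 ln(L) for 0/1 outcomes."""
    return 2 * n_parameters + residual_deviance

null_deviance = 19036       # rounded value reported in the text
residual_deviance = 18340   # rounded value reported in the text
n_parameters = 10           # 9 predictors + intercept

aic = aic_from_deviance(residual_deviance, n_parameters)
deviance_reduction = null_deviance - residual_deviance

print(aic, deviance_reduction)
```

The result agrees with the reported AIC, and the drop in deviance of 696 is the improvement attributable to the nine predictors.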

Turning to the individual predictors, and holding all else constant, crashes involving a fatality or major injury have 2.26 times the odds of involving alcohol. This indicates a strong positive association between injury severity and drinking behavior. Crashes with overturned vehicles have 2.53 times the odds of alcohol involvement, which suggests rollover-type crashes are more likely when drinking is involved. There is no evidence that cell phone involvement is associated with drinking-related crashes: the odds ratio of 1.03 is not statistically distinguishable from 1 because the p-value is above 0.05. Speeding multiplies the odds of a drinking-related crash by 4.66. This is one of the strongest predictors in the model, consistent with behavioral risk profiles among impaired drivers. Aggressive driving behavior decreases the odds of alcohol involvement by about 45%. This may reflect that aggressive driving is a behavioral mechanism distinct from drunk driving. Crashes involving drivers aged 16–17 have 72% lower odds of alcohol involvement than crashes without such drivers, and crashes involving drivers aged 65+ have 54% lower odds. The percentage of residents with a bachelor’s degree or more in the Census block group does not predict alcohol involvement: the coefficient for PCTBACHMOR is extremely small (β = –0.00037) and statistically insignificant (p = 0.775 > 0.05), indicating no meaningful association between educational attainment and the likelihood of a drinking-related crash. Although the coefficient for MEDHHINC (β = 2.8e-06) is statistically significant (p = 0.036 < 0.05), its magnitude is effectively zero for a one-unit change in income, and the corresponding odds ratio of 1.0000028 is not substantively important.
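The conversion from a coefficient to an odds ratio can be sketched for the two continuous predictors, using the coefficients quoted above. This is a Python illustration only (the report's analysis was run in R). The rescaling to a 10,000-unit change assumes MEDHHINC is recorded in dollars, which the text does not state explicitly, and the function names are our own.

```python
import math

def odds_ratio(beta, units=1.0):
    """Odds ratio for a change of `units` in the predictor: exp(beta * units)."""
    return math.exp(beta * units)

def percent_change_in_odds(or_value):
    """Percent change in the odds implied by an odds ratio."""
    return (or_value - 1.0) * 100.0

beta_medhhinc = 2.8e-06      # coefficient reported in the text
beta_pctbachmor = -0.00037   # coefficient reported in the text

or_medhhinc_per_unit = odds_ratio(beta_medhhinc)
# Assumption: MEDHHINC is in dollars, so this is a $10,000 difference in income.
or_medhhinc_per_10k = odds_ratio(beta_medhhinc, units=10_000)
or_pctbachmor = odds_ratio(beta_pctbachmor)

print(round(or_medhhinc_per_unit, 7))
print(round(percent_change_in_odds(or_medhhinc_per_10k), 1))
print(round(percent_change_in_odds(or_pctbachmor), 3))
```

Under the dollar assumption, a $10,000 difference in median household income corresponds to a change in the odds of a few percent, which helps put the tiny per-unit odds ratio in context.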

Table 7: Misclassification Rate Table

The table presents the model’s classification performance for probability cut-offs ranging from 0.02 to 0.50, showing how sensitivity, specificity, and misclassification rate change as the threshold increases. It is noticeable that sensitivity declines sharply as the probability cut off increases. At the lowest cut off value 0.02, sensitivity is 0.9835, which is the highest. Sensitivity collapses as the threshold increases, reaching 0.0016 at the 0.50 cut-off, indicating almost no true positives are captured at high thresholds.

Conversely, specificity increases as the threshold increases. At low cut-offs (0.02–0.03), specificity is extremely low (0.0581–0.0639), meaning the model incorrectly labels many non-drinking cases as drinking cases. By the highest cut-off of 0.50, specificity reaches 0.9999, indicating almost perfect identification of true negatives.

The misclassification rate decreases as the cut-off increases, reaching a minimum at 0.50. The highest misclassification rates occur at the lowest cut-offs; for example, at the lowest cut-off of 0.02, the misclassification rate is 0.8889. At these thresholds, the model over-predicts “drinking,” leading to many false positives. On the other hand, the lowest misclassification rate, 0.0573, occurs at the 0.50 cut-off, but this value should not necessarily be interpreted as the optimal threshold. At 0.50, specificity is extremely high (0.9999), meaning the model almost perfectly identifies non-drinking crashes. However, sensitivity falls to 0.0016, indicating that virtually all drinking-related crashes are misclassified as non-drinking; that is, the risk of false negatives is extremely high.

Figure 4: ROC Curve

The two criteria for choosing a probability cut-off lead to very different answers. Minimizing the misclassification rate selects the highest threshold considered, 0.50, where the misclassification rate is 0.0573. That cut-off produces very high specificity but extremely low sensitivity, meaning the model fails to identify almost all drinking-related crashes.

In contrast, the ROC-based distance minimization method, which selects the point on the ROC curve closest to the top-left corner (sensitivity = 1, specificity = 1), identifies a much lower optimal cut-off of 0.0637. This threshold yields a sensitivity of 0.661 and a specificity of 0.545. While still weak, it balances the two error types more evenly than the misclassification-based cut-off.
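The distance criterion can be made concrete with a short calculation. The sketch below is a Python illustration under stated assumptions rather than the code used in the analysis: it compares only the three cut-offs whose sensitivity and specificity are quoted in the text, whereas the actual search would run over every candidate threshold on the ROC curve.

```python
import math

def distance_to_perfect(sensitivity, specificity):
    """Euclidean distance from an ROC point to the ideal corner (sens = spec = 1)."""
    return math.hypot(1.0 - sensitivity, 1.0 - specificity)

# (sensitivity, specificity) pairs quoted in the text for three cut-offs.
reported = {
    0.02: (0.9835, 0.0581),
    0.0637: (0.661, 0.545),
    0.50: (0.0016, 0.9999),
}

distances = {
    cutoff: distance_to_perfect(sens, spec)
    for cutoff, (sens, spec) in reported.items()
}

# The preferred cut-off is the one with the smallest distance.
best_cutoff = min(distances, key=distances.get)

for cutoff, d in distances.items():
    print(f"cut-off {cutoff}: distance = {d:.4f}")
print("smallest distance at cut-off", best_cutoff)
```

Among these three points, the 0.0637 cut-off lies closest to the ideal corner, while both extreme cut-offs sit almost a full unit away because one of the two rates is near zero.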

The ROC curve for this model lies only slightly above the 45-degree reference line, and the area under the ROC curve (AUC) for this model is 0.6399, which indicates poor discriminatory performance. An AUC of 0.5 corresponds to random guessing, while values above 0.7 generally indicate acceptable discrimination. With an AUC of only 0.64, this model is only marginally better than chance at distinguishing drinking-related crashes from non–drinking-related crashes.
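One way to read the AUC is as the probability that a randomly chosen drinking-related crash receives a higher predicted probability than a randomly chosen non-drinking crash, with ties counted as one half. The sketch below illustrates that pairwise definition in Python. The scores are hypothetical toy values invented for the example and are not drawn from the crash data or the fitted model.

```python
def pairwise_auc(positive_scores, negative_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical predicted probabilities (toy values, NOT from the report).
toy_positive = [0.10, 0.07, 0.05]        # crashes that involved alcohol
toy_negative = [0.08, 0.04, 0.03, 0.06]  # crashes that did not

toy_auc = pairwise_auc(toy_positive, toy_negative)
print(f"toy AUC = {toy_auc:.2f}")
```

Under this reading, the reported AUC of 0.6399 means the model ranks a drinking-related crash above a non-drinking crash in roughly 64% of such pairs, compared with 50% for random guessing.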

Table 8: Logistic Regression Results Only With Binary Predictors

When PCTBACHMOR and MEDHHINC are removed from the model, none of the remaining predictors changes significance status. In both the full model and the binary-predictors-only model, FATAL_OR_M, OVERTURNED, SPEEDING, AGGRESSIVE, DRIVER1617, and DRIVER65PLUS are statistically significant; their coefficients and z-values change slightly, but their significance does not. CELL_PHONE is insignificant in both models (p > 0.05).

Table 9: AIC Criteria of Both Models

The AIC for the full model (including PCTBACHMOR and MEDHHINC) is 18,359.63, while the AIC for the reduced model containing only the binary predictors is 18,360.47. Because lower AIC values indicate a better balance between model fit and model complexity, the full model is slightly preferred according to the AIC criterion. However, the difference between the two values is only 0.84, suggesting that adding the two continuous socioeconomic predictors does not meaningfully change model performance.
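The size of the gap between the two AIC values can be checked directly. The sketch below, in Python and for the arithmetic only, computes the difference from the two reported AIC values. The relative-likelihood quantity exp(−ΔAIC / 2) is a commonly used reading of AIC differences that is added here as an assumption; it does not appear in the analysis itself.

```python
import math

# AIC values as reported in Table 9.
aic_full = 18359.63      # binary predictors plus PCTBACHMOR and MEDHHINC
aic_reduced = 18360.47   # binary predictors only

# Difference relative to the better (lower-AIC) model.
delta_aic = aic_reduced - aic_full

# Assumed interpretation aid (not in the report): relative likelihood of the
# reduced model compared with the full model.
relative_likelihood = math.exp(-delta_aic / 2.0)

print(f"delta AIC = {delta_aic:.2f}")
print(f"relative likelihood of reduced model = {relative_likelihood:.2f}")
```

A difference this small leaves the reduced model with substantial relative support, which is in line with treating the two models as practically equivalent.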

Discussion

In this analysis, we applied logistic regression to Philadelphia crash data to identify the factors most strongly associated with alcohol-involved driving incidents. Our results show that crash characteristics such as fatal or major injuries, overturned vehicles, speeding, and aggressive driving are the strongest and most consistent predictors of drunk driving, while socioeconomic variables such as median household income and educational attainment add little explanatory value.

Specifically, FATAL_OR_M, OVERTURNED, SPEEDING, AGGRESSIVE, DRIVER1617, and DRIVER65PLUS are all statistically significant predictors. Speeding and overturned vehicles substantially increase the odds of a crash involving alcohol, while aggressive driving and the presence of very young (16–17) or older (65+) drivers reduce the odds. In contrast, CELL_PHONE shows no significant relationship with alcohol involvement, and the two socioeconomic block-group variables, PCTBACHMOR and MEDHHINC, are either statistically insignificant or substantively negligible, indicating that they do not meaningfully explain drunk-driving crash risk.

These results are largely unsurprising and align with expectations for drunk-driving behavior. Risk-intensive crash characteristics, such as speeding, rollover crashes, and severe injuries, are logically associated with alcohol impairment, and the significant positive coefficients are consistent with this pattern. Likewise, the negative relationships for the youngest and oldest drivers are intuitive, as underage drivers are legally barred from drinking and older adults tend to exhibit more cautious driving behavior. The non-significance of cell phone use is also expected, since distracted driving and impaired driving arise from different behavioral mechanisms.

What was surprising, however, was that neither educational attainment nor median household income had any meaningful association with alcohol-related crashes. Although we initially expected socioeconomic context to influence impaired driving behavior, these predictors contributed almost nothing to the model. More broadly, despite the fact that several predictors behave in theoretically expected ways, the overall model performance remains quite poor, suggesting that the available variables do not capture the full complexity of alcohol-involved crashes and that important behavioral or contextual factors are missing from the dataset.

Standard logistic regression is still appropriate here. Drunk-driving crashes make up only 5.73% of all observations (2,485 drinking-related crashes against 40,879 non-drinking crashes, or 43,364 in total), but Paul Allison notes that rare-events logistic regression is generally necessary only when the absolute number of events is extremely small, e.g., fewer than 100 events. Our dataset contains nearly 2,500 drinking-related crashes, which is more than sufficient for maximum likelihood estimation to behave well. Thus, while alcohol involvement is relatively uncommon, it does not fall into the category of ‘rare events’ where specialized techniques such as Firth penalized likelihood or King & Zeng corrections would be required.

Several limitations should be acknowledged in interpreting the results of this analysis. First, the overall model performance is weak, with an AUC of approximately 0.64, suggesting that the predictors included do not fully capture the complex behavioral, situational, and environmental factors that influence alcohol-impaired driving. While crash-level characteristics such as speeding or severe injury are significant, the dataset lacks key information such as time of day, driver blood alcohol content, policing patterns, and roadway conditions, all of which are known to strongly influence drunk-driving risk. Second, the socioeconomic indicators drawn from census block groups (PCTBACHMOR and MEDHHINC) describe the area where the crash occurred and may not represent the characteristics of the individual drivers involved; this ecological mismatch likely contributes to their lack of predictive value.

Third, the binary nature of many predictors oversimplifies behaviors like “aggressive driving” or “speeding,” which exist on a spectrum and may require more granular data for accurate modeling. Finally, because the analysis relies exclusively on recorded crash data, it may be affected by underreporting, misclassification of alcohol involvement, or differential enforcement patterns, factors that can bias estimates and limit generalizability.

References

Bevans, R. (2023, June 22). Akaike Information Criterion | When & How to Use It (Example). Scribbr. Retrieved November 24, 2025, from https://www.scribbr.com/statistics/akaike-information-criterion/

Eisenberg, D. (2003). Evaluating the effectiveness of policies related to drunk driving. Journal of Policy Analysis and Management, 22(2), 249-274.

Perrine, M. W., Peck, R. C., & Fell, J. C. (1989). Epidemiologic perspectives on drunk driving. In Surgeon General’s workshop on drunk driving: Background papers (pp. 35-76). U.S. Department of Health and Human Services, Washington, DC.

Visit Philadelphia. (2025, June 25). Philly named America’s most walkable city by USA Today — For the third straight year! Visit Philadelphia. https://www.visitphilly.com/features/philly-voted-most-walkable-city-in-america-by-usa-today/