2025-03-07

Author: Faline Rezvani

Over 100 million U.S. citizens walk for recreation, in addition to the population that walks for transportation, 24% of whom have an income below the federal poverty level (Adjaye-Gbewonyo & Briones, 2024).

In 2022 alone, there were 7,714 non-occupant, or pedestrian, fatalities resulting from motor vehicle crashes.

Since 1975, the National Highway Traffic Safety Administration (NHTSA) Fatality Analysis Reporting System (FARS) has centralized fatality data from motor vehicle crashes.

FARS data are obtained from documents including police crash reports, death certificates, vehicle registrations, medical examiner reports, driver’s license files, highway department data, EMS reports, and vital records. Data are entered directly from these sources, then FARS analysts extract and compile those data into FARS datasets.

Using 2022 FARS data, we will inspect motor vehicle crash characteristics related to increased pedestrian risk in the U.S.

Loading FARS Auxiliary Accident and Vehicle .csv files, checking for missing values, joining datasets, and dropping duplicate records.
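
The loading and cleaning steps named above might look roughly like the sketch below. It uses two tiny in-memory tables in place of the real files. The file names in the comment, the join keys, and dropping only exact duplicates are assumptions, not the notebook's confirmed code.

```python
import io
import pandas as pd

# Tiny stand-ins for the FARS auxiliary files. With the real data these would be
# something like pd.read_csv("ACC_AUX.CSV") and pd.read_csv("VEH_AUX.CSV")
# (file names are assumed here).
acc_csv = io.StringIO(
    "YEAR,STATE,ST_CASE,FATALS,A_PED_F\n"
    "2022,1,10001,1,2\n"
    "2022,1,10002,1,1\n"
    "2022,1,10002,1,1\n"  # exact duplicate record
)
veh_csv = io.StringIO(
    "YEAR,STATE,ST_CASE,VEH_NO,A_BODY\n"
    "2022,1,10001,1,1\n"
    "2022,1,10001,2,3\n"
    "2022,1,10002,1,1\n"
)
accidents = pd.read_csv(acc_csv)
vehicles = pd.read_csv(veh_csv)

# Count missing values per column in each table.
print(accidents.isnull().sum())
print(vehicles.isnull().sum())

# Join vehicle rows to their crash on the shared keys (assumed),
# then drop rows that are exact duplicates.
crashes = (
    accidents.merge(vehicles, on=["YEAR", "STATE", "ST_CASE"], how="inner")
    .drop_duplicates()
    .reset_index(drop=True)
)
print(crashes.shape)
```

The two `isnull().sum()` calls produce per-column listings in the same form as the output that follows.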

YEAR                        0
STATE                       0
ST_CASE                     0
FATALS                      0
A_CRAINJ                    0
A_REGION                    0
A_RU                        0
A_INTER                     0
A_RELRD                     0
A_INTSEC                    0
A_ROADFC                    0
A_JUNC                      0
A_MANCOL                    0
A_TOD                       0
A_DOW                       0
A_CT                        0
A_WEATHER                   0
A_LT                        0
A_MC                        0
A_SPCRA                     0
A_PED                       0
A_PED_F                     0
A_PEDAL                     0
A_PEDAL_F                   0
A_ROLL                      0
A_POLPUR                    0
A_POSBAC                    0
A_D15_19                    0
A_D16_19                    0
A_D15_20                    0
A_D16_20                    0
A_D65PLS                    0
A_D21_24                    0
A_D16_24                    0
A_RD                        0
A_HR                        0
A_DIST                      0
A_DROWSY                    0
A_WRONGWAY                  0
BIA                         0
SPJ_INDIAN                  0
INDIAN_RES                  0
CENSUS_2020_TRACT_FIPS    257
TRACT                       0
dtype: int64
YEAR             0
STATE            0
ST_CASE          0
VEH_NO           0
A_WRONGWAYDRV    0
A_DRDIS          0
A_DRDRO          0
A_VRD            0
A_BODY           0
A_IMP1           0
A_VROLL          0
A_LIC_S          0
A_LIC_C          0
A_CDL_S          0
A_MC_L_S         0
A_SPVEH          0
A_SBUS           0
A_MOD_YR         0
A_FIRE_EXP       0
dtype: int64

A_PED_F ‘1’ - involving pedestrian fatality; ‘2’ - not involving pedestrian fatality

A_PED_F
2    31807
1     7414
Name: count, dtype: int64

Feature correlation heatmap. Highly correlated features (absolute correlation above 0.80) carry redundant information and can make coefficient estimates unstable during modeling.
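
One way to act on the 0.80 threshold is to scan the upper triangle of the correlation matrix and drop one feature from each highly correlated pair. The sketch below uses toy data, and the drop rule is an assumption rather than the notebook's confirmed procedure. The column names are borrowed from FARS only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "A_D15_19": base,
    "A_D16_19": base,  # perfectly correlated with A_D15_19 in this toy data
    "A_WEATHER": rng.integers(1, 10, size=200),
})

# Absolute pairwise correlations; this matrix could also be passed to a
# plotting call such as seaborn.heatmap(corr) to draw the heatmap.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of any pair whose correlation exceeds 0.80.
to_drop = [col for col in upper.columns if (upper[col] > 0.80).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```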

Shuffling the dataframe so the model cannot learn patterns from the order in which records were entered.
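
A minimal sketch of the shuffle with pandas, using a toy dataframe and an arbitrary seed (both are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "ST_CASE": [10001, 10002, 10003, 10004, 10005],
    "A_SPCRA": [1, 1, 1, 0, 0],
})

# sample(frac=1) returns every row in random order; resetting the index
# discards the original row positions.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```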

The outcome of a situation can be described statistically by the odds of success. A linear model is a poor fit for separating two classes, however, because its output is unbounded and cannot be read as a probability. For that we use logistic regression.

Rewriting the logistic function in terms of the odds and taking the natural log of both sides gives the logit function, which maps a probability in the range (0,1) to log-odds on the whole real line (Bati, n.d.). Its inverse, the logistic function, converts a linear combination of predictors back to a probability. Probabilities >0.50 are rounded up to 1, while probabilities <0.50 are rounded down to 0.
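
The relationship between the two functions can be checked numerically. In this sketch the helper names `logistic` and `logit` are illustrative, and the input values are arbitrary. The logistic function maps log-odds to a probability, and the logit maps that probability back.

```python
import numpy as np

def logistic(z):
    """Map log-odds z (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) to log-odds: ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 2.0])   # arbitrary log-odds values
p = logistic(z)                  # corresponding probabilities
labels = (p > 0.5).astype(int)   # class 1 only when probability exceeds 0.50

print(p)
print(logit(p))                  # recovers z
print(labels)
```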

With the logit function as its link, logistic regression is used as a statistical model to estimate predictor coefficients.

With a supervised machine learning classification model, the logistic function is used to make classifications, or predictions on future data.

We can treat whether speeding was a factor in a crash as the outcome, or response, making ‘speeding’ or ‘not speeding’ our binary response variable. The weights of predictors on that response, or coefficient estimates, will help us evaluate relationships within FARS.

A_SPCRA ‘1’ - not involving speeding, ‘0’ - involving speeding

A_SPCRA
1    28299
0    10922
Name: count, dtype: int64

Balancing the dataset so the algorithm is not skewed by the abundance of ‘1’ (incidents not involving speeding). The first count below is before balancing; the second is after.

Counter({1: 28299, 0: 10922})
Counter({0: 28299, 1: 28299})
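
The counts above show the minority class raised to the size of the majority class, but the text does not say which method was used. The sketch below illustrates one option, random oversampling with replacement, on toy data. Treat both the method and the data as assumptions.

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "A_PED_F": [1, 2, 2, 2, 1, 2, 2, 1],
    "A_SPCRA": [1, 1, 1, 1, 1, 1, 0, 0],
})
counts = df["A_SPCRA"].value_counts()
minority_label = counts.idxmin()

# Resample minority rows with replacement until both classes are the same size.
minority = df[df["A_SPCRA"] == minority_label]
extra = minority.sample(n=counts.max() - counts.min(), replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)

print(Counter(df["A_SPCRA"]))
print(Counter(balanced["A_SPCRA"]))
```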

Test/Train Split

(42448, 42)
(14150, 42)
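
The shapes above are consistent with a 75/25 split, which is scikit-learn's default test share. A minimal sketch on toy data follows; the feature subset and the seed are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.integers(0, 3, size=(100, 4)),
    columns=["A_RU", "A_LT", "A_PED_F", "A_ROLL"],
)
y = pd.Series(np.repeat([0, 1], 50), name="A_SPCRA")

# Hold out 25% of rows for testing; random_state only fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape)
print(X_test.shape)
```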

Using the statsmodels Logit function to calculate coefficients of predictors

Optimization terminated successfully.
         Current function value: 0.592043
         Iterations 6
Logit Regression Results
Dep. Variable:              A_SPCRA    No. Observations:      42448
Model:                        Logit    Df Residuals:          42406
Method:                         MLE    Df Model:                 41
Date:              Mon, 10 Mar 2025    Pseudo R-squ.:        0.1459
Time:                      13:16:15    Log-Likelihood:      -25131.
converged:                     True    LL-Null:             -29423.
Covariance Type:          nonrobust    LLR p-value:           0.000

                  coef   std err         z     P>|z|    [0.025    0.975]
STATE          -0.0099     0.001   -14.258     0.000    -0.011    -0.009
FATALS         -0.1973     0.031    -6.324     0.000    -0.258    -0.136
A_CRAINJ        6.4197     0.907     7.080     0.000     4.643     8.197
A_REGION       -0.0545     0.005   -11.384     0.000    -0.064    -0.045
A_RU           -0.3658     0.024   -15.154     0.000    -0.413    -0.319
A_INTER         0.2028     0.045     4.469     0.000     0.114     0.292
A_RELRD        -0.2158     0.015   -14.364     0.000    -0.245    -0.186
A_INTSEC        0.0493     0.030     1.664     0.096    -0.009     0.107
A_ROADFC       -0.0959     0.011    -8.761     0.000    -0.117    -0.074
A_MANCOL        0.1411     0.014    10.289     0.000     0.114     0.168
A_TOD          -0.0628     0.023    -2.704     0.007    -0.108    -0.017
A_DOW          -0.0542     0.023    -2.391     0.017    -0.099    -0.010
A_CT           -0.5279     0.030   -17.877     0.000    -0.586    -0.470
A_WEATHER       0.0028     0.001     4.789     0.000     0.002     0.004
A_LT           -0.1644     0.039    -4.259     0.000    -0.240    -0.089
A_MC           -0.2008     0.052    -3.860     0.000    -0.303    -0.099
A_PED_F        -1.5623     0.050   -30.962     0.000    -1.661    -1.463
A_PEDAL_F      -1.2182     0.090   -13.591     0.000    -1.394    -1.043
A_ROLL          0.3711     0.028    13.304     0.000     0.316     0.426
A_POLPUR        1.5951     0.119    13.453     0.000     1.363     1.827
A_POSBAC        0.1359     0.014     9.875     0.000     0.109     0.163
A_D16_20        0.3937     0.033    11.874     0.000     0.329     0.459
A_D65PLS       -0.6123     0.030   -20.311     0.000    -0.671    -0.553
A_D21_24        0.3245     0.032    10.265     0.000     0.263     0.387
A_RD            0.0043     0.035     0.120     0.904    -0.065     0.074
A_HR            0.6063     0.058    10.410     0.000     0.492     0.720
A_DIST         -0.0922     0.041    -2.262     0.024    -0.172    -0.012
A_DROWSY       -0.7990     0.093    -8.589     0.000    -0.981    -0.617
A_WRONGWAY     -0.6823     0.063   -10.831     0.000    -0.806    -0.559
SPJ_INDIAN      0.1538     0.240     0.641     0.521    -0.316     0.624
INDIAN_RES     -0.2293     0.178    -1.285     0.199    -0.579     0.121
TRACT           0.0698     0.137     0.510     0.610    -0.198     0.338
VEH_NO          0.5484     0.354     1.549     0.121    -0.145     1.242
A_BODY          0.0758     0.008     9.997     0.000     0.061     0.091
A_IMP1          0.0045     0.007     0.603     0.547    -0.010     0.019
A_LIC_S        -0.6428     0.055   -11.641     0.000    -0.751    -0.535
A_LIC_C         0.4093     0.063     6.488     0.000     0.286     0.533
A_CDL_S         0.2083     0.040     5.232     0.000     0.130     0.286
A_MC_L_S       -0.2219     0.053    -4.190     0.000    -0.326    -0.118
A_SBUS         -0.0855     0.206    -0.414     0.679    -0.490     0.319
A_MOD_YR     5.429e-05  1.21e-05     4.477     0.000  3.05e-05  7.81e-05
A_FIRE_EXP     -0.2217     0.050    -4.401     0.000    -0.320    -0.123

Pseudo R-squared (McFadden's, as reported by statsmodels) compares the log-likelihood of the fitted model with that of a null model containing only an intercept: 1 - (Log-Likelihood / LL-Null). Values near 0 mean the predictors improve little on the null model; higher values indicate a better fit.

The pseudo R-squared value of our logit model, 0.1459, tells us that the predictors only modestly improve on the null model and that there are many other independent variables to take into consideration for understanding something as complex as a driver speeding while operating a motor vehicle.
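
The reported pseudo R-squared can be reproduced from the two log-likelihoods printed in the regression summary. The variable names below are illustrative.

```python
# Log-likelihoods as printed in the regression summary (already rounded there).
ll_model = -25131.0   # Log-Likelihood of the fitted model
ll_null = -29423.0    # LL-Null, the intercept-only model

# McFadden's pseudo R-squared: 1 - (fitted log-likelihood / null log-likelihood).
pseudo_r2 = 1.0 - ll_model / ll_null
print(round(pseudo_r2, 4))  # 0.1459
```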

A p-value below the 0.05 significance level for the predictor ‘A_PED_F’ is evidence to reject the null hypothesis that there is no change in incidents involving speeding with a change in incidents of pedestrian fatality.
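
A logit coefficient is easier to read as an odds ratio, obtained by exponentiating it. The sketch below does this for the ‘A_PED_F’ coefficient and its 95% confidence bounds as printed in the summary. An odds ratio below 1 means that a one-unit increase in ‘A_PED_F’ is associated with lower odds of A_SPCRA = 1, holding the other predictors fixed. Reading that in terms of speeding depends on the codings stated in the text.

```python
import math

# Values copied from the A_PED_F row of the regression summary.
coef = -1.5623
ci_low, ci_high = -1.661, -1.463

# Exponentiating log-odds coefficients gives odds ratios.
odds_ratio = math.exp(coef)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)
print(f"{odds_ratio:.3f} ({or_low:.3f}, {or_high:.3f})")
```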

Building Supervised Machine Learning Logistic Regression Model
LogisticRegression(C=0.2, class_weight='balanced', max_iter=3000, solver='saga')

Making predictions on the test dataset

The accuracy score of our regression model is: 0.6398586572438163
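
For reference, fitting and scoring a classifier with the hyperparameters shown above might look like the sketch below. The data is simulated and the split seed is arbitrary. Only the `LogisticRegression` arguments come from the notebook output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Simulated stand-in data: three numeric features and a noisy binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same hyperparameters as the model representation printed in the notebook.
model = LogisticRegression(C=0.2, class_weight="balanced", max_iter=3000, solver="saga")
model.fit(X_train, y_train)

# Predict classes for the held-out rows and compare with the known labels.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```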

Known vs. Predicted

array([[4528, 2507],
       [2589, 4526]])

Reading the matrix with rows as known classes and columns as predicted classes, the predictions of our logistic regression model include 4,526 true positives with 2,507 false positives, and 4,528 true negatives with 2,589 false negatives.

              precision    recall  f1-score   support

           0       0.64      0.64      0.64      7035
           1       0.64      0.64      0.64      7115

    accuracy                           0.64     14150
   macro avg       0.64      0.64      0.64     14150
weighted avg       0.64      0.64      0.64     14150
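
The figures in the report can be recomputed from the confusion matrix printed above. This sketch assumes scikit-learn's layout, with rows as known classes and columns as predicted classes, which matches the support column (7035 and 7115).

```python
import numpy as np

# Confusion matrix as printed in the notebook output.
cm = np.array([[4528, 2507],
               [2589, 4526]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted 1s that are truly 1
recall = tp / (tp + fn)      # share of true 1s that were predicted 1
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```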

The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies. The ideal curve for our classification model would pass through coordinate (0,1), the point where the true positive rate is 1 and the false positive rate is 0. The area under the curve (AUC), in the range [0,1], is the probability that the model scores a randomly chosen positive case higher than a randomly chosen negative case; 0.5 corresponds to random guessing.
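
A small worked example makes the AUC interpretation concrete. The labels and scores below are made up. The sketch computes the ROC points and AUC with scikit-learn, then checks the AUC against the share of positive-negative pairs in which the positive case gets the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up known labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# False and true positive rates at each threshold, and the area under the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Pairwise check: fraction of (positive, negative) pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(auc, pairwise)
```

In practice `y_score` would come from the fitted model's `predict_proba` output for the positive class.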

Please see the presentations for recommendations and the report for findings here

  1. Adjaye-Gbewonyo, D., & Briones, E. M. (2024, July). Walking for Leisure and Transportation Among Adults: United States, 2022. https://www.cdc.gov/nchs/products/databriefs/db504.htm

  2. Bati, F. (n.d.). CMSC 437 Lecture 2c: Logistic Regression. University of Maryland Global Campus.

  3. National Highway Traffic Safety Administration (NHTSA). (2021, February). Fatality Analysis Reporting System (FARS) auxiliary datasets analytical user’s manual, 1982-2019 (Report No. DOT HS 813 071). Washington, DC: Author. https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813071