2025-03-07
Author: Faline Rezvani
More than 100 million U.S. citizens walk for recreation, and others walk for transportation; 24% of those walking for transportation have an income below the federal poverty level (Adjaye-Gbewonyo & Briones, 2024).
In 2022 alone, there were 7,714 non-occupant, or pedestrian, fatalities resulting from motor vehicle crashes.
Since 1975, the National Highway Traffic Safety Administration (NHTSA) Fatality Analysis Reporting System (FARS) has centralized fatality incident data related to motor vehicle crashes.
FARS data are obtained from documents including police crash reports, death certificates, vehicle registrations, medical examiner reports, driver’s license files, highway department data, EMS reports, and vital records. Data are entered directly from these sources, then FARS analysts extract and compile those data into FARS datasets.
Using 2022 FARS data, we will inspect motor vehicle crash characteristics related to increased pedestrian risk in the U.S.
Loading FARS Auxiliary Accident and Vehicle .csv files, checking for missing values, joining datasets, and dropping duplicate records.
YEAR 0
STATE 0
ST_CASE 0
FATALS 0
A_CRAINJ 0
A_REGION 0
A_RU 0
A_INTER 0
A_RELRD 0
A_INTSEC 0
A_ROADFC 0
A_JUNC 0
A_MANCOL 0
A_TOD 0
A_DOW 0
A_CT 0
A_WEATHER 0
A_LT 0
A_MC 0
A_SPCRA 0
A_PED 0
A_PED_F 0
A_PEDAL 0
A_PEDAL_F 0
A_ROLL 0
A_POLPUR 0
A_POSBAC 0
A_D15_19 0
A_D16_19 0
A_D15_20 0
A_D16_20 0
A_D65PLS 0
A_D21_24 0
A_D16_24 0
A_RD 0
A_HR 0
A_DIST 0
A_DROWSY 0
A_WRONGWAY 0
BIA 0
SPJ_INDIAN 0
INDIAN_RES 0
CENSUS_2020_TRACT_FIPS 257
TRACT 0
dtype: int64
YEAR 0
STATE 0
ST_CASE 0
VEH_NO 0
A_WRONGWAYDRV 0
A_DRDIS 0
A_DRDRO 0
A_VRD 0
A_BODY 0
A_IMP1 0
A_VROLL 0
A_LIC_S 0
A_LIC_C 0
A_CDL_S 0
A_MC_L_S 0
A_SPVEH 0
A_SBUS 0
A_MOD_YR 0
A_FIRE_EXP 0
dtype: int64
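The loading, missing-value check, join, and de-duplication steps can be sketched with pandas as follows. This is a minimal sketch, not the notebook's own code: the in-memory CSV text stands in for the real FARS auxiliary files, and the choice of `YEAR`, `STATE`, and `ST_CASE` as join keys is an assumption based on the columns the two files share in the output above.

```python
import io

import pandas as pd

# Tiny stand-ins for the FARS auxiliary Accident and Vehicle files.
# The real analysis would pass the downloaded .csv paths to pd.read_csv.
accident_csv = io.StringIO(
    "YEAR,STATE,ST_CASE,FATALS,A_PED_F\n"
    "2022,1,10001,1,2\n"
    "2022,1,10002,1,1\n"
    "2022,1,10002,1,1\n"  # deliberate duplicate row
)
vehicle_csv = io.StringIO(
    "YEAR,STATE,ST_CASE,VEH_NO,A_BODY\n"
    "2022,1,10001,1,1\n"
    "2022,1,10001,2,3\n"
    "2022,1,10002,1,1\n"
)

accident = pd.read_csv(accident_csv)
vehicle = pd.read_csv(vehicle_csv)

# Count missing values per column, as in the per-column listings above.
print(accident.isnull().sum())
print(vehicle.isnull().sum())

# Join on the shared case keys (assumed), then drop exact duplicate records.
keys = ["YEAR", "STATE", "ST_CASE"]
crashes = accident.merge(vehicle, on=keys, how="inner").drop_duplicates()
print(crashes.shape)
```

An inner join keeps only cases present in both files; a different `how` value may be more appropriate if unmatched cases need to be retained.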
A_PED_F ‘1’ - involving pedestrian fatality; ‘2’ - not involving pedestrian fatality
A_PED_F
2 31807
1 7414
Name: count, dtype: int64
Feature correlation heatmap. Highly correlated features (absolute correlation above 0.80) carry overlapping information, which can make coefficient estimates unstable during modeling.
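The correlation matrix behind a heatmap can also be scanned directly for pairs above the 0.80 threshold. The sketch below uses synthetic data with hypothetical column names `a`, `b`, and `c`; it illustrates the check only and is not the notebook's plotting code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                     # unrelated to "a" and "b"
})

# Absolute pairwise Pearson correlations.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is reported once
# and the diagonal (always 1.0) is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

high_pairs = [
    (row, col, float(upper.loc[row, col]))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.80  # NaN comparisons are False, so masked cells drop out
]
print(high_pairs)
```

One member of each flagged pair is a candidate for removal before fitting.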
Shuffling the dataframe so the model cannot learn patterns from the order in which records were entered.
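A row shuffle of this kind is commonly written with `DataFrame.sample`. The frame and the `random_state` value below are illustrative assumptions.

```python
import pandas as pd

# Small stand-in frame; the real data would be the joined FARS records.
df = pd.DataFrame({
    "ST_CASE": [10001, 10002, 10003, 10004],
    "A_SPCRA": [1, 1, 0, 0],
})

# frac=1 draws every row exactly once in random order;
# reset_index discards the old ordering from the index.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```

Fixing `random_state` keeps the shuffle reproducible between runs.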
The outcome of a situation can be described statistically by the odds of success. A linear model cannot be used to set a threshold between classes, however. For that we use logistic regression.
The logistic function converts any real value to a probability in the range [0,1]. Rewriting the logistic function in terms of the odds and taking the natural log of both sides gives the logit function, the log of the odds, which is linear in the predictors (Bati, n.d.). For classification, probabilities >0.50 are assigned to class 1, while probabilities <0.50 are assigned to class 0.
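The relationship between the two functions can be checked numerically. The sketch below uses only the standard library; the function names are chosen here for illustration.

```python
import math


def logistic(z: float) -> float:
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Map a probability p in (0, 1) to the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def classify(p: float, threshold: float = 0.5) -> int:
    """Assign class 1 when the probability exceeds the threshold, else class 0."""
    return 1 if p > threshold else 0


z = 1.7
p = logistic(z)
print(p, logit(p))                # logit undoes logistic, recovering z
print(classify(logistic(2.0)))    # positive score -> probability above 0.5
print(classify(logistic(-2.0)))   # negative score -> probability below 0.5
```

Because the logit is the inverse of the logistic function, a model that is linear on the log-odds scale produces probabilities once the logistic function is applied.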
With the logit function, logistic regression is used in a statistical model to calculate predictor coefficient estimates.
With a supervised machine learning classification model, the logistic function is used to make classifications, or predictions on future data.
Theoretically, we can represent the discovery that speeding was a factor in a crash as the outcome, or response, making ‘speeding’, or ‘not speeding’ our binary logistic variable. The weights of predictors, or coefficient estimates, on a response will help us evaluate relationships within FARS.
A_SPCRA ‘1’ - not involving speeding, ‘0’ - involving speeding
A_SPCRA
1 28299
0 10922
Name: count, dtype: int64
Balancing the dataset so the algorithm is not skewed by the abundance of class ‘1’, incidents not involving speeding.
Counter({1: 28299, 0: 10922})
Counter({0: 28299, 1: 28299})
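The counts above show the minority class raised to the size of the majority class. The notebook's resampling call is not shown; a minimal sketch of one way to get that result, random oversampling with replacement in plain pandas, is given below on a toy frame. Libraries such as imbalanced-learn provide ready-made oversamplers as an alternative.

```python
from collections import Counter

import pandas as pd

# Toy frame: six rows of class 1 and two rows of class 0.
df = pd.DataFrame({"A_SPCRA": [1] * 6 + [0] * 2, "FATALS": range(8)})
print(Counter(df["A_SPCRA"]))

majority_n = df["A_SPCRA"].value_counts().max()

parts = []
for label, group in df.groupby("A_SPCRA"):
    shortfall = majority_n - len(group)
    if shortfall > 0:
        # Draw extra rows with replacement until the class matches the majority.
        extra = group.sample(n=shortfall, replace=True, random_state=0)
        group = pd.concat([group, extra])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(Counter(balanced["A_SPCRA"]))
```

Oversampling before the train/test split can place copies of the same record in both sets, so resampling only the training portion is a common precaution.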
Test/Train Split
(42448, 42)
(14150, 42)
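The two shapes above correspond to roughly a 75/25 split of the 56,598 balanced rows. That is consistent with `test_size=0.25` in scikit-learn's `train_test_split`, though the actual arguments are not shown, so the values below are assumptions applied to toy arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (20 rows, 4 columns) and binary labels.
X = np.arange(80).reshape(20, 4)
y = np.array([0, 1] * 10)

# Hold out 25% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape)
print(X_test.shape)
```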
Using the statsmodels Logit function to calculate coefficients of predictors
Optimization terminated successfully.
Current function value: 0.592043
Iterations 6
| Dep. Variable: | A_SPCRA | No. Observations: | 42448 |
| Model: | Logit | Df Residuals: | 42406 |
| Method: | MLE | Df Model: | 41 |
| Date: | Mon, 10 Mar 2025 | Pseudo R-squ.: | 0.1459 |
| Time: | 13:16:15 | Log-Likelihood: | -25131. |
| converged: | True | LL-Null: | -29423. |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| STATE | -0.0099 | 0.001 | -14.258 | 0.000 | -0.011 | -0.009 |
| FATALS | -0.1973 | 0.031 | -6.324 | 0.000 | -0.258 | -0.136 |
| A_CRAINJ | 6.4197 | 0.907 | 7.080 | 0.000 | 4.643 | 8.197 |
| A_REGION | -0.0545 | 0.005 | -11.384 | 0.000 | -0.064 | -0.045 |
| A_RU | -0.3658 | 0.024 | -15.154 | 0.000 | -0.413 | -0.319 |
| A_INTER | 0.2028 | 0.045 | 4.469 | 0.000 | 0.114 | 0.292 |
| A_RELRD | -0.2158 | 0.015 | -14.364 | 0.000 | -0.245 | -0.186 |
| A_INTSEC | 0.0493 | 0.030 | 1.664 | 0.096 | -0.009 | 0.107 |
| A_ROADFC | -0.0959 | 0.011 | -8.761 | 0.000 | -0.117 | -0.074 |
| A_MANCOL | 0.1411 | 0.014 | 10.289 | 0.000 | 0.114 | 0.168 |
| A_TOD | -0.0628 | 0.023 | -2.704 | 0.007 | -0.108 | -0.017 |
| A_DOW | -0.0542 | 0.023 | -2.391 | 0.017 | -0.099 | -0.010 |
| A_CT | -0.5279 | 0.030 | -17.877 | 0.000 | -0.586 | -0.470 |
| A_WEATHER | 0.0028 | 0.001 | 4.789 | 0.000 | 0.002 | 0.004 |
| A_LT | -0.1644 | 0.039 | -4.259 | 0.000 | -0.240 | -0.089 |
| A_MC | -0.2008 | 0.052 | -3.860 | 0.000 | -0.303 | -0.099 |
| A_PED_F | -1.5623 | 0.050 | -30.962 | 0.000 | -1.661 | -1.463 |
| A_PEDAL_F | -1.2182 | 0.090 | -13.591 | 0.000 | -1.394 | -1.043 |
| A_ROLL | 0.3711 | 0.028 | 13.304 | 0.000 | 0.316 | 0.426 |
| A_POLPUR | 1.5951 | 0.119 | 13.453 | 0.000 | 1.363 | 1.827 |
| A_POSBAC | 0.1359 | 0.014 | 9.875 | 0.000 | 0.109 | 0.163 |
| A_D16_20 | 0.3937 | 0.033 | 11.874 | 0.000 | 0.329 | 0.459 |
| A_D65PLS | -0.6123 | 0.030 | -20.311 | 0.000 | -0.671 | -0.553 |
| A_D21_24 | 0.3245 | 0.032 | 10.265 | 0.000 | 0.263 | 0.387 |
| A_RD | 0.0043 | 0.035 | 0.120 | 0.904 | -0.065 | 0.074 |
| A_HR | 0.6063 | 0.058 | 10.410 | 0.000 | 0.492 | 0.720 |
| A_DIST | -0.0922 | 0.041 | -2.262 | 0.024 | -0.172 | -0.012 |
| A_DROWSY | -0.7990 | 0.093 | -8.589 | 0.000 | -0.981 | -0.617 |
| A_WRONGWAY | -0.6823 | 0.063 | -10.831 | 0.000 | -0.806 | -0.559 |
| SPJ_INDIAN | 0.1538 | 0.240 | 0.641 | 0.521 | -0.316 | 0.624 |
| INDIAN_RES | -0.2293 | 0.178 | -1.285 | 0.199 | -0.579 | 0.121 |
| TRACT | 0.0698 | 0.137 | 0.510 | 0.610 | -0.198 | 0.338 |
| VEH_NO | 0.5484 | 0.354 | 1.549 | 0.121 | -0.145 | 1.242 |
| A_BODY | 0.0758 | 0.008 | 9.997 | 0.000 | 0.061 | 0.091 |
| A_IMP1 | 0.0045 | 0.007 | 0.603 | 0.547 | -0.010 | 0.019 |
| A_LIC_S | -0.6428 | 0.055 | -11.641 | 0.000 | -0.751 | -0.535 |
| A_LIC_C | 0.4093 | 0.063 | 6.488 | 0.000 | 0.286 | 0.533 |
| A_CDL_S | 0.2083 | 0.040 | 5.232 | 0.000 | 0.130 | 0.286 |
| A_MC_L_S | -0.2219 | 0.053 | -4.190 | 0.000 | -0.326 | -0.118 |
| A_SBUS | -0.0855 | 0.206 | -0.414 | 0.679 | -0.490 | 0.319 |
| A_MOD_YR | 5.429e-05 | 1.21e-05 | 4.477 | 0.000 | 3.05e-05 | 7.81e-05 |
| A_FIRE_EXP | -0.2217 | 0.050 | -4.401 | 0.000 | -0.320 | -0.123 |
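Pseudo R-squared, reported by statsmodels as McFadden's pseudo R-squared, compares the log-likelihood of the fitted model with the log-likelihood of a null model that has no predictors. Values near 0 mean the predictors add little over the null model, and larger values indicate a better fit. Unlike R-squared in linear regression, it is not a proportion of variance explained.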
The pseudo R-squared value of 0.1459 for our logit model tells us that the predictors account for only a small part of the outcome, and that many other independent variables must be taken into consideration to understand something as complex as a driver speeding while operating a motor vehicle.
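The pseudo R-squared in the summary can be reproduced from the two log-likelihood values printed there, using McFadden's formula of one minus the ratio of the model log-likelihood to the null log-likelihood. The inputs below are the rounded values from the summary, so the result agrees with the reported figure only to about four decimal places.

```python
# Values copied from the model summary (rounded as printed there).
ll_model = -25131.0  # Log-Likelihood of the fitted model
ll_null = -29423.0   # LL-Null, the log-likelihood of the null model

# McFadden's pseudo R-squared.
pseudo_r2 = 1.0 - ll_model / ll_null
print(round(pseudo_r2, 4))
```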
For the predictor ‘A_PED_F’, a p-value below the significance level of 0.05 supports rejecting the null hypothesis that there is no change in incidents involving speeding with a change in incidents of pedestrian fatality.
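Beyond significance, a logit coefficient is often easier to read as an odds ratio, obtained by exponentiating it. The sketch below does this for the ‘A_PED_F’ row of the coefficient table. The interpretation in the comments assumes the encodings listed in the text (‘A_PED_F’ of 1 for a pedestrian fatality and 2 for none, ‘A_SPCRA’ of 1 for not involving speeding).

```python
import math

# Coefficient and 95% confidence bounds for A_PED_F, copied from the table.
coef, ci_low, ci_high = -1.5623, -1.661, -1.463

odds_ratio = math.exp(coef)
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

# An odds ratio below 1 means that moving A_PED_F from 1 to 2
# (pedestrian fatality -> no pedestrian fatality) lowers the odds of
# A_SPCRA = 1 (not involving speeding), other predictors held fixed.
print(round(odds_ratio, 3))
print(tuple(round(v, 3) for v in odds_ratio_ci))
```

Under those encodings, crashes with a pedestrian fatality are the ones less associated with speeding in this model.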
Building Supervised Machine Learning Logistic Regression Model
LogisticRegression(C=0.2, class_weight='balanced', max_iter=3000, solver='saga')
Making predictions on the test dataset
The accuracy score of our regression model is: 0.6398586572438163
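The build, predict, and score steps can be sketched with scikit-learn using the hyperparameters printed in the model representation above. The data here are synthetic and the split settings are assumptions, so the printed accuracy will not match the notebook's value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features; the label depends on the first two columns plus noise.
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same hyperparameters as the model representation shown above.
model = LogisticRegression(
    C=0.2, class_weight="balanced", max_iter=3000, solver="saga"
)
model.fit(X_train, y_train)

# Predict class labels for the held-out rows and score them.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

The `saga` solver tends to converge faster when features are on similar scales, so standardizing the predictors beforehand is worth considering.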
Known vs. Predicted
array([[4528, 2507],
[2589, 4526]])
Treating class ‘1’ as the positive class, the predictions of our logistic regression model include 4,526 true positives with 2,507 false positives, and 4,528 true negatives with 2,589 false negatives.
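The summary metrics can be recomputed by hand from the confusion matrix shown above. The sketch assumes scikit-learn's layout, with rows as known classes and columns as predicted classes, which is consistent with the per-class support totals of 7035 and 7115.

```python
# Confusion matrix copied from the output above:
# rows are known classes (0, 1), columns are predicted classes (0, 1).
confusion = [[4528, 2507],
             [2589, 4526]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted 1s that are truly 1
recall = tp / (tp + fn)        # share of true 1s that were predicted as 1
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```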
| | precision | recall | f1-score | support |
| 0 | 0.64 | 0.64 | 0.64 | 7035 |
| 1 | 0.64 | 0.64 | 0.64 | 7115 |
| accuracy | | | 0.64 | 14150 |
| macro avg | 0.64 | 0.64 | 0.64 | 14150 |
| weighted avg | 0.64 | 0.64 | 0.64 | 14150 |
[Figure: ROC curve for the logistic regression model, plotting true positive rate against false positive rate]
The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies. The ideal curve of our classification model would pass through coordinate (0,1), the point where the true positive rate is 1 and the false positive rate is 0. The area under the curve (AUC), in the range [0,1], measures the probability that the model scores a randomly chosen positive case higher than a randomly chosen negative case. A value of 0.5 corresponds to random guessing.
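The ranking interpretation of AUC can be made concrete by counting, over all positive and negative pairs, how often the positive case receives the higher score. The function name and the four example scores below are illustrative, and the sketch uses only the standard library.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of class 1

# Three of the four positive/negative pairs are ordered correctly.
print(auc_by_ranking(y_true, scores))
```

The pairwise loop is quadratic in the number of cases, so it suits small illustrations rather than the full test set.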
Please see the presentations for recommendations and the report for findings here
Adjaye-Gbewonyo, D., & Briones, E. M. (2024, July). Walking for Leisure and Transportation Among Adults: United States, 2022. https://www.cdc.gov/nchs/products/databriefs/db504.htm
Bati, F. (n.d.). CMSC 437 Lecture 2c: Logistic Regression. University of Maryland Global Campus.
National Highway Traffic Safety Administration (NHTSA). (2021, February). Fatality Analysis Reporting System (FARS) auxiliary datasets analytical user’s manual, 1982-2019 (Report No. DOT HS 813 071). Washington, DC: Author. https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813071