This small project is part of the weekly #TidyTuesday challenge and uses data from the Stanford Open Policing Project. More information about the project is available in this video.
I decided to analyze traffic stops by police in Albany, NY, because I used to live in New York State and was curious about the situation there. The Albany data contains 103281 police stops of drivers between December 2007 and December 2017 (the counts in the tables below add up to this total).
First, I looked at the number of stops by race. This immediately highlighted the first problem: there are too many NAs in this dataset.
| race | number |
|---|---|
| asian/pacific islander | 1983 |
| black | 22658 |
| hispanic | 2595 |
| other/unknown | 7815 |
| white | 28527 |
| NA | 39703 |
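
To put the NA problem in numbers, the counts from the table above can be turned into shares of all stops. This is a minimal sketch in plain Python, not the code behind this post; the names `race_counts` and `shares` are illustrative assumptions.

```python
# Counts copied from the table above; the key "NA" stands for a missing race value.
race_counts = {
    "asian/pacific islander": 1983,
    "black": 22658,
    "hispanic": 2595,
    "other/unknown": 7815,
    "white": 28527,
    "NA": 39703,
}

total = sum(race_counts.values())

# Share of all stops that falls into each category.
shares = {race: count / total for race, count in race_counts.items()}

# Print the categories from largest to smallest share.
for race, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{race:<25} {share:6.1%}")
```
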
As a bar chart, stops by race look like this:
Then I asked myself about age: maybe drivers of a specific race are stopped differently depending on their age? But the boxplots look similar across races. Moreover, they highlighted a second potential problem with the data: outliers. It looks like some drivers were over 100 years old when they were stopped by police. Is that plausible, or is it a data-entry error?
In total, there are 6 drivers with an age above 100 and 100 drivers with an age above 90.
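
A check like this one is simple to reproduce. The sketch below is plain Python on a made-up list of ages (not real rows from the Albany data), and the helper name `count_older_than` is hypothetical.

```python
def count_older_than(ages, threshold):
    """Count recorded ages strictly above `threshold`, skipping missing values (None)."""
    return sum(1 for age in ages if age is not None and age > threshold)


# Hypothetical sample for illustration only.
sample_ages = [23, 35, None, 91, 104, 67, 99, 18]

over_90 = count_older_than(sample_ages, 90)
over_100 = count_older_than(sample_ages, 100)
print(over_90, over_100)
```

Counting with two thresholds side by side makes it easier to judge whether the extreme values are a handful of typos or a larger group.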
My third idea was to check how stops are distributed by hour. Are there specific hours with more stops? It looks like police stop drivers much more frequently closer to midnight and are not very active early in the morning. Interestingly, the majority of stops without race identification were made during these late hours. Maybe officers do not have time to fill in the forms completely and skip some boxes?
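
The hourly breakdown, including the stops without a recorded race, can be sketched as below. This is an illustrative Python version, not the code used for the charts; it assumes times are stored as `HH:MM:SS` strings, and the function name and sample rows are made up.

```python
from collections import Counter


def stops_by_hour(rows):
    """Return (stops per hour, stops with missing race per hour).

    rows: iterable of (time, race) pairs, where time is an 'HH:MM:SS'
    string or None and race is a string or None.
    """
    total, missing_race = Counter(), Counter()
    for time, race in rows:
        if time is None:  # rows without a time cannot be assigned to an hour
            continue
        hour = int(time.split(":")[0])
        total[hour] += 1
        if race is None:
            missing_race[hour] += 1
    return total, missing_race


# Hypothetical sample rows for illustration only.
sample_rows = [
    ("23:15:00", None),
    ("23:50:10", "white"),
    ("00:05:00", None),
    ("07:30:00", "black"),
    (None, "white"),
    ("23:01:59", None),
]

total, missing_race = stops_by_hour(sample_rows)
for hour in sorted(total):
    print(f"{hour:02d}: {total[hour]} stops, {missing_race[hour]} without race")
```
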
Fourth idea: any patterns by gender? The good news is that almost all drivers were identified by gender.
| sex | number |
|---|---|
| male | 71848 |
| female | 31406 |
| NA | 27 |
Stops of male drivers clearly prevail (about 70% of all stops). Perhaps the majority of drivers are male. The chart looks like this:
OK, are there any specific patterns by race AND gender? Maybe some group was stopped disproportionately? The table looks like this:
| race | sex | number |
|---|---|---|
| asian/pacific islander | male | 1465 |
| asian/pacific islander | female | 518 |
| black | male | 15685 |
| black | female | 6973 |
| hispanic | male | 1868 |
| hispanic | female | 727 |
| other/unknown | male | 5131 |
| other/unknown | female | 2664 |
| other/unknown | NA | 20 |
| white | male | 18088 |
| white | female | 10436 |
| white | NA | 3 |
| NA | male | 29611 |
| NA | female | 10088 |
| NA | NA | 4 |
We can see that NAs create problems here, so it probably does not make much sense to visualize this.
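
Instead of a chart, one way to compare the groups is the share of male drivers within each race category, computed from the table above. The sketch below is plain Python with illustrative variable names; it is not the code behind this post, and it skips the 27 rows without a recorded sex.

```python
# (race, sex) counts copied from the table above; "NA" marks a missing value.
counts = {
    ("asian/pacific islander", "male"): 1465,
    ("asian/pacific islander", "female"): 518,
    ("black", "male"): 15685,
    ("black", "female"): 6973,
    ("hispanic", "male"): 1868,
    ("hispanic", "female"): 727,
    ("other/unknown", "male"): 5131,
    ("other/unknown", "female"): 2664,
    ("other/unknown", "NA"): 20,
    ("white", "male"): 18088,
    ("white", "female"): 10436,
    ("white", "NA"): 3,
    ("NA", "male"): 29611,
    ("NA", "female"): 10088,
    ("NA", "NA"): 4,
}

totals = {}  # stops with a recorded sex, per race category
males = {}   # stops of male drivers, per race category
for (race, sex), n in counts.items():
    if sex == "NA":  # skip rows without a recorded sex
        continue
    totals[race] = totals.get(race, 0) + n
    if sex == "male":
        males[race] = n

male_share = {race: males[race] / totals[race] for race in totals}
for race, share in male_share.items():
    print(f"{race:<25} {share:.1%}")
```
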
My next question was about the types of violations that led to stops. Any patterns here?
| violation | number |
|---|---|
| SPEED VIOL POSTED LIMIT | 13938 |
| TRAFFIC DEVICE VIOL - PAS | 11227 |
| OPERATE MV BY UNLICENSD D | 8181 |
| OPER MV W/O INSPECT CERTI | 5901 |
| DRIVE WHILE USING MOBILE | 4666 |
| FAILURE TO OBEY TRAFFIC D | 3875 |
| FAIL TO STOP-STOP SIGN | 3653 |
| AGGRAVATED UNLICENSED OPE | 2295 |
| ILL SIGNAL:PARKED | 2206 |
| REGIS PLATE DISPLAY VIOLA | 2123 |
These are the top 10 violations in Albany. Speed limit violations are the most common. I am not a US citizen and I do not drive, so I am not sure how bad these numbers are in general.
OK, any patterns in violations by gender? Maybe each group has its “favorite” violations? I selected the 20 most frequent violation/gender combinations (not 20 distinct violations) among stops with gender identification, and also created a stacked bar chart for this table.
| violation | sex | number |
|---|---|---|
| SPEED VIOL POSTED LIMIT | male | 8302 |
| TRAFFIC DEVICE VIOL - PAS | male | 7374 |
| OPERATE MV BY UNLICENSD D | male | 5782 |
| SPEED VIOL POSTED LIMIT | female | 5636 |
| OPER MV W/O INSPECT CERTI | male | 4284 |
| TRAFFIC DEVICE VIOL - PAS | female | 3849 |
| DRIVE WHILE USING MOBILE | male | 2497 |
| OPERATE MV BY UNLICENSD D | female | 2393 |
| FAIL TO STOP-STOP SIGN | male | 2341 |
| FAILURE TO OBEY TRAFFIC D | male | 2331 |
| DRIVE WHILE USING MOBILE | female | 2168 |
| REGIS PLATE DISPLAY VIOLA | male | 1895 |
| AGGRAVATED UNLICENSED OPE | male | 1822 |
| ILL SIGNAL:PARKED | male | 1664 |
| FAILURE TO STAY IN SINGLE | male | 1637 |
| OPER MV W/O INSPECT CERTI | female | 1615 |
| FAILURE TO OBEY TRAFFIC D | female | 1544 |
| SPEED VIOL-IMPRUDENT SPEE | male | 1411 |
| FAIL TO STOP-STOP SIGN | female | 1311 |
| OPER UNREGISTERD MV ON HI | male | 1296 |
Do you see any specific gender-based patterns? I do not, really.
OK, my final attempt was to look for patterns in violations by race.
| violation | race | number |
|---|---|---|
| SPEED VIOL POSTED LIMIT | white | 6564 |
| TRAFFIC DEVICE VIOL - PAS | white | 3700 |
| SPEED VIOL POSTED LIMIT | NA | 3281 |
| OPERATE MV BY UNLICENSD D | NA | 3240 |
| TRAFFIC DEVICE VIOL - PAS | NA | 3130 |
| OPERATE MV BY UNLICENSD D | black | 2896 |
| SPEED VIOL POSTED LIMIT | black | 2582 |
| OPER MV W/O INSPECT CERTI | NA | 2303 |
| TRAFFIC DEVICE VIOL - PAS | black | 2160 |
| DRIVE WHILE USING MOBILE | white | 2019 |
| FAILURE TO OBEY TRAFFIC D | NA | 1910 |
| TRAFFIC DEVICE VIOL - PAS | other/unknown | 1733 |
| AGGRAVATED UNLICENSED OPE | NA | 1448 |
| OPER MV W/O INSPECT CERTI | white | 1443 |
| OPER MV W/O INSPECT CERTI | black | 1435 |
| FAIL TO STOP-STOP SIGN | NA | 1424 |
| FAILURE TO STAY IN SINGLE | NA | 1416 |
| REGIS PLATE DISPLAY VIOLA | NA | 1353 |
| FAILURE TO OBEY TRAFFIC D | white | 1195 |
| OPERATE MV BY UNLICENSD D | white | 1137 |
But with data this full of NAs it does not really make sense: as you can see from the table above, NAs prevail. As a result, I dropped all NAs from the race column and visualized the top 20 combinations.
| violation | race | number |
|---|---|---|
| SPEED VIOL POSTED LIMIT | white | 6564 |
| TRAFFIC DEVICE VIOL - PAS | white | 3700 |
| OPERATE MV BY UNLICENSD D | black | 2896 |
| SPEED VIOL POSTED LIMIT | black | 2582 |
| TRAFFIC DEVICE VIOL - PAS | black | 2160 |
| DRIVE WHILE USING MOBILE | white | 2019 |
| TRAFFIC DEVICE VIOL - PAS | other/unknown | 1733 |
| OPER MV W/O INSPECT CERTI | white | 1443 |
| OPER MV W/O INSPECT CERTI | black | 1435 |
| FAILURE TO OBEY TRAFFIC D | white | 1195 |
| OPERATE MV BY UNLICENSD D | white | 1137 |
| FAIL TO STOP-STOP SIGN | white | 999 |
| DRIVE WHILE USING MOBILE | other/unknown | 759 |
| FAIL TO STOP-STOP SIGN | black | 725 |
| SPEED VIOL POSTED LIMIT | asian/pacific islander | 636 |
| OPER UNREGISTERD MV ON HI | black | 629 |
| DRIVE WHILE USING MOBILE | black | 614 |
| ILL SIGNAL:PARKED | white | 583 |
| DRIVER/PASS W/O FRNT SEAT | white | 531 |
| AGGRAVATED UNLICENSED OPE | black | 521 |
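
The "drop NAs, then take the most frequent combinations" step can be sketched as follows. This is an illustrative Python version on made-up rows, not the code used for the table above; `top_combinations` is a hypothetical helper name.

```python
from collections import Counter


def top_combinations(rows, n):
    """Return the n most frequent (violation, race) pairs, ignoring rows with no race."""
    pairs = Counter(
        (violation, race) for violation, race in rows if race is not None
    )
    return pairs.most_common(n)


# Hypothetical sample rows for illustration only; None marks a missing race.
sample_rows = [
    ("SPEED VIOL POSTED LIMIT", "white"),
    ("SPEED VIOL POSTED LIMIT", None),
    ("SPEED VIOL POSTED LIMIT", "white"),
    ("FAIL TO STOP-STOP SIGN", "black"),
    ("SPEED VIOL POSTED LIMIT", None),
    ("SPEED VIOL POSTED LIMIT", None),
]

top = top_combinations(sample_rows, 2)
for (violation, race), number in top:
    print(violation, race, number)
```

Note that in this sample the rows with a missing race would be the largest group if they were kept, which mirrors what happened in the first version of the table.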
Finally, I decided to create a heatmap of violations, because a plain map would have too many dots. But again, the problem with the data is NAs.
Trying to build a heatmap using leaflet.extras, I got a warning message:
In validateCoords(lng, lat, funcName) :
Data contains 9851 rows with either missing or invalid lat/lon values and will be ignored
I decided to check the total number of NAs per column:
| column | number of NAs |
|---|---|
| raw_row_number | 0 |
| date | 0 |
| time | 13 |
| location | 9779 |
| lat | 9851 |
| lng | 9851 |
| age | 18 |
| race | 39703 |
| sex | 27 |
| type | 0 |
| violation | 534 |
| vehicle_color | 1239 |
| vehicle_make | 796 |
| vehicle_registration_state | 971 |
| vehicle_year | 1281 |
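
A per-column NA count like the one above can be sketched in a few lines. The version below is plain Python on made-up rows stored as dictionaries, with `None` standing in for NA; it is an illustration, not the code that produced the table.

```python
def count_missing(rows):
    """Count None values per column for a list of dict rows."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            # `value is None` is True/False, which adds as 1/0.
            missing[column] = missing.get(column, 0) + (value is None)
    return missing


# Hypothetical sample rows for illustration only.
sample_rows = [
    {"date": "2015-03-01", "lat": 42.65, "lng": -73.75, "race": None},
    {"date": "2015-03-02", "lat": None, "lng": None, "race": "white"},
    {"date": "2015-03-03", "lat": 42.66, "lng": -73.77, "race": None},
]

missing = count_missing(sample_rows)
for column, number in missing.items():
    print(column, number)
```
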
To avoid this, I removed only the rows with NAs in the lng and lat columns (9851 rows) and visualized the remaining 93430 traffic stops (103281 minus 9851).
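
Dropping rows without usable coordinates can be sketched as below. This is an illustrative Python version on made-up rows, not the code used for the heatmap. Treating out-of-range values as invalid is an assumption of this sketch; the warning above does not say exactly what leaflet considers invalid.

```python
def has_valid_coords(row):
    """True if the row has both lat and lng and they fall in the usual ranges.

    The range check is an assumption of this sketch, not leaflet's documented rule.
    """
    lat, lng = row.get("lat"), row.get("lng")
    if lat is None or lng is None:
        return False
    return -90 <= lat <= 90 and -180 <= lng <= 180


# Hypothetical sample rows; the coordinates are only roughly where Albany, NY is.
sample_rows = [
    {"lat": 42.65, "lng": -73.75},
    {"lat": None, "lng": None},
    {"lat": 42.66, "lng": -73.77},
    {"lat": 142.0, "lng": -73.75},  # latitude out of range
]

mappable = [row for row in sample_rows if has_valid_coords(row)]
print(len(mappable), "of", len(sample_rows), "rows can be mapped")
```

Only the coordinate columns are checked here, so rows with a missing race or age still make it onto the map.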
Conclusion
Personally, I did not notice any clear patterns, but I must admit that this data has problems with NAs and strange outliers in age.
Author’s Twitter: OleksiyAnokhin