This small project is part of the weekly #TidyTuesday challenge and uses data from The Stanford Open Policing Project. More information about the project is available in this video.

I decided to analyze traffic stops by police in Albany, NY, because I used to live in New York State and was curious about the situation there. The Albany data contains 103281 police stops of drivers between December 2007 and December 2017.

First, I looked at the number of stops by race. This immediately highlighted the first problem: there are too many NAs in this dataset.

race number
asian/pacific islander 1983
black 22658
hispanic 2595
other/unknown 7815
white 28527
NA 39703
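
To put a number on "too many NAs", the counts from the table above can be summed and the missing share computed. This is a minimal Python sketch, not the code used for this post; the dictionary simply restates the table, with `None` standing in for NA.

```python
# Counts copied from the stops-by-race table above; None stands in for NA.
stops_by_race = {
    "asian/pacific islander": 1983,
    "black": 22658,
    "hispanic": 2595,
    "other/unknown": 7815,
    "white": 28527,
    None: 39703,
}

# Total number of stops and the share of stops with no recorded race.
total = sum(stops_by_race.values())
na_share = stops_by_race[None] / total

print(f"total stops: {total}")
print(f"share with missing race: {na_share:.1%}")
```

With these counts, a bit more than a third of all stops have no race recorded.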

As a bar chart, stops by race look like this:

Then I asked myself about age. Are drivers of a specific race stopped differently depending on their age? The boxplots look similar across races. Moreover, they highlighted a second potential problem with the data: outliers. It looks like some drivers were over 100 years old when they were stopped by police. Is that possible, or is it a data-entry error?

In total, there are 6 drivers with an age above 100 and 100 drivers with an age above 90.
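
A check like this can be sketched in a few lines of Python. The function name `count_older_than` and the sample ages are hypothetical and only illustrate the idea; they are not the Albany data.

```python
from typing import Iterable, Optional


def count_older_than(ages: Iterable[Optional[float]], threshold: float) -> int:
    """Count recorded ages strictly above the threshold, skipping missing values."""
    return sum(1 for age in ages if age is not None and age > threshold)


# Hypothetical sample, not the Albany data; None stands in for NA.
sample_ages = [23, 45, None, 91, 101, 67, 95, 112]

above_90 = count_older_than(sample_ages, 90)
above_100 = count_older_than(sample_ages, 100)
print(above_90, above_100)
```

Counting at two thresholds helps separate "unusual but plausible" ages from values that are almost certainly typos.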

My third idea was to check how stops are distributed by hour. Are there specific hours with more stops? It looks like police stop drivers much more frequently closer to midnight and are not very active early in the morning. Interestingly, the majority of stops without race identification were made during these late hours. Maybe officers do not have time to fill in the forms completely and skip some boxes?
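
Grouping stops by hour can be sketched as follows. This assumes the `time` column holds strings in "HH:MM:SS" format, which is an assumption about the data layout; the sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime


def stops_per_hour(times):
    """Count stops per hour of day, skipping missing times."""
    counts = Counter()
    for t in times:
        if t is None:
            continue
        # Assumed format "HH:MM:SS"; only the hour is kept.
        counts[datetime.strptime(t, "%H:%M:%S").hour] += 1
    return counts


# Hypothetical sample, not the Albany data.
sample_times = ["23:15:00", "23:40:10", "07:05:00", None, "00:30:00", "23:59:59"]
hourly = stops_per_hour(sample_times)

for hour in sorted(hourly):
    print(f"{hour:02d}:00  {hourly[hour]}")
```

Running the same count only on rows where race is missing would show whether the NAs really cluster in the late hours.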

Fourth idea: are there any patterns by gender? The good news is that almost all drivers were identified by gender.

sex number
male 71848
female 31406
NA 27

Stops of male drivers prevail significantly. One possible explanation is that the majority of drivers are male. The chart looks like this:

OK, what about specific patterns of race AND gender? Was some group stopped disproportionately? The table looks like this:

race sex number
asian/pacific islander male 1465
asian/pacific islander female 518
black male 15685
black female 6973
hispanic male 1868
hispanic female 727
other/unknown male 5131
other/unknown female 2664
other/unknown NA 20
white male 18088
white female 10436
white NA 3
NA male 29611
NA female 10088
NA NA 4
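
Because the groups differ a lot in size, raw counts are hard to compare. One option is to compute the share of male drivers within each race group. The Python sketch below restates the table above and leaves out the few rows with missing sex; it is an illustration, not part of the original analysis.

```python
# Counts copied from the race/sex table above; rows with missing sex are left out.
counts = {
    "asian/pacific islander": {"male": 1465, "female": 518},
    "black": {"male": 15685, "female": 6973},
    "hispanic": {"male": 1868, "female": 727},
    "other/unknown": {"male": 5131, "female": 2664},
    "white": {"male": 18088, "female": 10436},
    "NA": {"male": 29611, "female": 10088},
}

# Share of male drivers among stops with a known sex, per race group.
male_share = {
    race: by_sex["male"] / (by_sex["male"] + by_sex["female"])
    for race, by_sex in counts.items()
}

for race, share in sorted(male_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{race:<24} {share:.1%}")
```

These shares describe the stops only. Without a baseline of who actually drives in Albany, they cannot show whether any group is stopped disproportionately.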

We can see that NAs create problems here, so it probably does not make much sense to visualize this.

My next question was about the types of violations that caused the stops. Are there any patterns here?

violation number
SPEED VIOL POSTED LIMIT 13938
TRAFFIC DEVICE VIOL - PAS 11227
OPERATE MV BY UNLICENSD D 8181
OPER MV W/O INSPECT CERTI 5901
DRIVE WHILE USING MOBILE 4666
FAILURE TO OBEY TRAFFIC D 3875
FAIL TO STOP-STOP SIGN 3653
AGGRAVATED UNLICENSED OPE 2295
ILL SIGNAL:PARKED 2206
REGIS PLATE DISPLAY VIOLA 2123

The table above shows the top 10 violations in Albany. Speed limit violations are the biggest problem. I am not a US citizen and I do not drive, so I am not sure how bad these numbers are in general.
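
A top-N ranking like this can be sketched with Python's `collections.Counter`. The helper name `top_violations` and the sample list are hypothetical; the labels are borrowed from the table only for illustration.

```python
from collections import Counter


def top_violations(violations, n=10):
    """Return the n most frequent violations as (label, count) pairs, skipping missing values."""
    counts = Counter(v for v in violations if v is not None)
    return counts.most_common(n)


# Hypothetical sample, not the Albany data; None stands in for NA.
sample_violations = [
    "SPEED VIOL POSTED LIMIT",
    "FAIL TO STOP-STOP SIGN",
    "SPEED VIOL POSTED LIMIT",
    None,
    "DRIVE WHILE USING MOBILE",
    "SPEED VIOL POSTED LIMIT",
    "FAIL TO STOP-STOP SIGN",
]

ranking = top_violations(sample_violations, n=2)
print(ranking)
```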

OK, are there any patterns of violations by gender? Maybe the groups have their “favorite” violations? I selected the 20 most frequent cases (violation and gender combinations, not violations) that have a gender identification, and also created a stacked bar chart for this table.

violation sex number
SPEED VIOL POSTED LIMIT male 8302
TRAFFIC DEVICE VIOL - PAS male 7374
OPERATE MV BY UNLICENSD D male 5782
SPEED VIOL POSTED LIMIT female 5636
OPER MV W/O INSPECT CERTI male 4284
TRAFFIC DEVICE VIOL - PAS female 3849
DRIVE WHILE USING MOBILE male 2497
OPERATE MV BY UNLICENSD D female 2393
FAIL TO STOP-STOP SIGN male 2341
FAILURE TO OBEY TRAFFIC D male 2331
DRIVE WHILE USING MOBILE female 2168
REGIS PLATE DISPLAY VIOLA male 1895
AGGRAVATED UNLICENSED OPE male 1822
ILL SIGNAL:PARKED male 1664
FAILURE TO STAY IN SINGLE male 1637
OPER MV W/O INSPECT CERTI female 1615
FAILURE TO OBEY TRAFFIC D female 1544
SPEED VIOL-IMPRUDENT SPEE male 1411
FAIL TO STOP-STOP SIGN female 1311
OPER UNREGISTERD MV ON HI male 1296

Do you see any specific gender-based patterns? I do not, really.

OK, my final attempt was to look for patterns in violations by race.

violation race number
SPEED VIOL POSTED LIMIT white 6564
TRAFFIC DEVICE VIOL - PAS white 3700
SPEED VIOL POSTED LIMIT NA 3281
OPERATE MV BY UNLICENSD D NA 3240
TRAFFIC DEVICE VIOL - PAS NA 3130
OPERATE MV BY UNLICENSD D black 2896
SPEED VIOL POSTED LIMIT black 2582
OPER MV W/O INSPECT CERTI NA 2303
TRAFFIC DEVICE VIOL - PAS black 2160
DRIVE WHILE USING MOBILE white 2019
FAILURE TO OBEY TRAFFIC D NA 1910
TRAFFIC DEVICE VIOL - PAS other/unknown 1733
AGGRAVATED UNLICENSED OPE NA 1448
OPER MV W/O INSPECT CERTI white 1443
OPER MV W/O INSPECT CERTI black 1435
FAIL TO STOP-STOP SIGN NA 1424
FAILURE TO STAY IN SINGLE NA 1416
REGIS PLATE DISPLAY VIOLA NA 1353
FAILURE TO OBEY TRAFFIC D white 1195
OPERATE MV BY UNLICENSD D white 1137

But with data this full of NAs it does not really make sense: as you can see from the table above, NAs prevail. As a result, I dropped all NAs from the race column and visualized the top 20 cases.

violation race number
SPEED VIOL POSTED LIMIT white 6564
TRAFFIC DEVICE VIOL - PAS white 3700
OPERATE MV BY UNLICENSD D black 2896
SPEED VIOL POSTED LIMIT black 2582
TRAFFIC DEVICE VIOL - PAS black 2160
DRIVE WHILE USING MOBILE white 2019
TRAFFIC DEVICE VIOL - PAS other/unknown 1733
OPER MV W/O INSPECT CERTI white 1443
OPER MV W/O INSPECT CERTI black 1435
FAILURE TO OBEY TRAFFIC D white 1195
OPERATE MV BY UNLICENSD D white 1137
FAIL TO STOP-STOP SIGN white 999
DRIVE WHILE USING MOBILE other/unknown 759
FAIL TO STOP-STOP SIGN black 725
SPEED VIOL POSTED LIMIT asian/pacific islander 636
OPER UNREGISTERD MV ON HI black 629
DRIVE WHILE USING MOBILE black 614
ILL SIGNAL:PARKED white 583
DRIVER/PASS W/O FRNT SEAT white 531
AGGRAVATED UNLICENSED OPE black 521
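
Dropping rows with missing race before counting (violation, race) pairs can be sketched as follows. The helper `top_pairs` and the sample rows are hypothetical and only illustrate the filtering step.

```python
from collections import Counter


def top_pairs(rows, n=20):
    """Count (violation, race) pairs, skipping rows where race is missing."""
    counts = Counter(
        (row["violation"], row["race"])
        for row in rows
        if row["race"] is not None
    )
    return counts.most_common(n)


# Hypothetical sample, not the Albany data; None stands in for NA.
sample_rows = [
    {"violation": "SPEED VIOL POSTED LIMIT", "race": "white"},
    {"violation": "SPEED VIOL POSTED LIMIT", "race": None},
    {"violation": "SPEED VIOL POSTED LIMIT", "race": "white"},
    {"violation": "FAIL TO STOP-STOP SIGN", "race": "black"},
    {"violation": "FAIL TO STOP-STOP SIGN", "race": None},
]

result = top_pairs(sample_rows)
print(result)
```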

Finally, I decided to create a heatmap of violations, because a plain map would have too many dots. But again, the problem with the data is NAs.

Trying to build a heatmap using leaflet.extras, I got a warning message:

In validateCoords(lng, lat, funcName) :
  Data contains 9851 rows with either missing or invalid lat/lon values and will be ignored

I decided to check the total number of NAs per column:

column number
raw_row_number 0
date 0
time 13
location 9779
lat 9851
lng 9851
age 18
race 39703
sex 27
type 0
violation 534
vehicle_color 1239
vehicle_make 796
vehicle_registration_state 971
vehicle_year 1281
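
A per-column count of missing values like the one above can be sketched in Python. The helper `missing_per_column` and the sample rows are hypothetical; `None` stands in for NA.

```python
from collections import Counter


def missing_per_column(rows):
    """Count missing (None) values per column; columns without gaps are kept with a count of 0."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            # A bool adds 0 or 1, so every column gets an entry.
            missing[column] += value is None
    return dict(missing)


# Hypothetical sample, not the Albany data.
sample_rows = [
    {"date": "2015-01-01", "lat": 42.65, "lng": -73.75, "race": "white"},
    {"date": "2015-01-02", "lat": None, "lng": None, "race": None},
    {"date": "2015-01-03", "lat": None, "lng": None, "race": "black"},
]

na_counts = missing_per_column(sample_rows)
print(na_counts)
```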

To avoid this, I decided to remove NAs only from the lng and lat columns (9851 rows) and visualize the remaining 93430 traffic stops.
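
Filtering on the coordinate columns only, while keeping rows that have gaps elsewhere, can be sketched like this. The helper `has_coordinates` and the sample rows are hypothetical.

```python
def has_coordinates(row):
    """True when both lat and lng are present; other columns may still be missing."""
    return row.get("lat") is not None and row.get("lng") is not None


# Hypothetical sample, not the Albany data; None stands in for NA.
sample_stops = [
    {"lat": 42.65, "lng": -73.75, "race": None},    # kept: race is missing, coordinates are not
    {"lat": None, "lng": None, "race": "white"},    # dropped: no coordinates
    {"lat": 42.66, "lng": -73.77, "race": "black"}, # kept
]

mappable = [row for row in sample_stops if has_coordinates(row)]
print(len(mappable), "of", len(sample_stops), "stops can be mapped")
```

Keeping the filter this narrow avoids throwing away the many rows that are only missing race.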

Conclusion

Personally, I did not notice any patterns, but I must admit that this data has problems with NAs and strange outliers in age.

Author’s Twitter: OleksiyAnokhin