Nearly every week, sometimes every day, the news carries reports of interactions between the public and law enforcement agencies. Over the years there has been a growing call for more data on these interactions between the public and police, especially in the current climate. My project initially sought to tell a story about these interactions. As is sometimes the case, the story I stumbled across is different from the one I had planned to tell.
The Racial and Identity Profiling Act (RIPA) is a California state law passed in 2018 that, among other things, mandates data reporting by the largest law enforcement agencies in the state, starting with the top eight and since expanding to 15. The law specifies which agencies must collect information and when to submit it to the state, but is open to interpretation on collection method, storage scheme, and quality checks.
These gaps in methodology are big enough to drive a truck through and, as I hope to show, open the door to strange results from agency to agency, completely negating the ability to interpret and compare data among them, which is a primary reason for collecting these data in the first place.
What interested me most about these data were the demographics and characteristics of the people involved, as well as the events that occurred during interactions with police. I work with these kinds of data all the time in my day job. It is important that we, as data scientists, take both the responsibility and the opportunity to consider and discuss the meaning of demographic measures in any data set we work with, and how those measures may inadvertently perpetuate the legacy and lingering effects of structural racism, resource gatekeeping, and government policies and policing that hold communities of color back, deprive them of resources and services, and punish their neighborhoods.
The data files for each year are simply enormous, taking a while to download and read into RStudio. As a result, much of the pre-processing took place outside of R to improve run time. Despite the size, I appreciated the formatting of the data set. It had binary variables for certain categories of race, gender, age, and purpose of stop, as well as a collapsed column with all options in one. These collapsed options were numeric and needed to be translated to their meanings before being imported into RStudio.
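For illustration, that translation step can be a simple lookup in R. This is a minimal sketch: the data frame, column names, and numeric codes below are placeholders I made up, not the actual RIPA code book values.

```r
library(dplyr)

# Hypothetical codes for the collapsed race column; the real mapping
# comes from the RIPA data dictionary.
race_labels <- c(
  "1" = "Asian",
  "2" = "Black/African American",
  "3" = "Hispanic/Latino(a)",
  "4" = "White"
)

# "stops" and "race_code" are placeholder names.
stops <- stops %>%
  mutate(race_label = recode(as.character(race_code), !!!race_labels))
```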
For this project I wanted to focus on the two law enforcement agencies closest to where I live, and so reduced the set to the Los Angeles Police Department (LAPD) and the Los Angeles County Sheriff’s Department (LASD). The LAPD has responsibility over a geographic range roughly 100 square miles larger than New York City, with diverse geographies from rural to urban, and coastal to rugged and mountainous. The LASD covers the entire County of Los Angeles, the most populous county in the United States, at roughly 4,750 square miles.
Additionally, the data for type of stop were heavily weighted toward traffic stops. It made sense to filter for these types of events, as they are (evidently) the most common way for the public to interact with the police in everyday life.
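That narrowing is a one-liner with dplyr. A sketch, assuming placeholder column names ("agency_name", "stop_type") rather than the official RIPA field names:

```r
library(dplyr)

# Keep only LAPD and LASD records, and only traffic stops.
la_traffic <- stops %>%
  filter(
    agency_name %in% c("Los Angeles Police Department",
                       "Los Angeles County Sheriff's Department"),
    stop_type == "Traffic Stop"
  )
```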
Methodology aside, the data collection is based primarily on “officer perception” (i.e., whatever the officer thought or observed or can remember or wrote down at the time). For both innocent and malicious reasons, this is a bad approach. There are no rules or policies that I could find in my casual search, from either the LASD or LAPD, that govern how data are collected or what consequences there are for misreporting or omitting details about an interaction. Because these data are not connected to a particular officer and are observed at a system level, it is difficult to determine whether officers individually or agencies as a group are failing to report. At least, I thought it would be difficult.
Some additional background research uncovered both negligent and intentional misreporting of crime data in police departments across the US, so it is difficult to root out a purposeful (maybe even organized) attempt to obscure or influence data reporting. As it turns out, by visualizing certain aspects of these interactions, patterns emerged (or didn’t) that suggest gaps in data collection or quality, or technical gaps in storage and reporting. Let’s dive in.
As mentioned, much of the heavy data pre-processing took place outside RStudio on account of the vastness of the original file (600+ MB).
One perennial consideration that was not cleaned up in pre-processing is the date of the interaction. I have become a huge fan of the lubridate package for converting any date format in a CSV or Excel file to one RStudio can read. I have also found it essential to periodically check the results of my efforts to ensure that what I think is being changed actually did change.
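Here is the kind of parse-and-verify step I mean, assuming the raw dates arrive as MM/DD/YYYY strings (the column name is a placeholder):

```r
library(lubridate)

# Parse the character date column into a proper Date.
la_traffic$stop_date <- mdy(la_traffic$date_of_stop)

# Sanity checks: did anything fail to parse, and is the date
# range plausible for the reporting period?
sum(is.na(la_traffic$stop_date))
range(la_traffic$stop_date, na.rm = TRUE)
```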
Narrowing the focus of the project to two law enforcement agencies and one type of stop (traffic) made it harder for insights to hide among a sea of results. Setting aside the differences in collection and reporting methods among agencies, I hoped to be able to compare these agencies with each other.
There is a stark difference in the number of traffic stops performed by LASD and LAPD. Knowing a little about both of these agencies from my time working at the County and with both of them, this is not surprising to me. While both agencies have broad responsibilities, LASD has fewer patrol duties, as it contracts with 44 of the 88 cities in the county, the largest of these being Lancaster, Compton, and Santa Clarita. LAPD, by contrast, has more than 400 square miles to patrol in one of the most densely packed areas in the country, with lots of cars and plenty of moving violations to choose from.
I was interested to know how long interactions with police last. Using a box plot to observe the distribution, outliers quickly took over the entire figure. Both agencies operate jail facilities, so these long durations could be transports to jail, or lengthy detentions at the scene. Again, without more understanding of (or uniformity among) the agencies, it is hard to interpret these outliers as “legitimate” operational considerations versus data quality issues.
To better see what I was looking at, I trimmed the window down to 45 minutes so I could better see the interquartile range (IQR). The differences in these stops by race and between agencies stood out to me as potential areas of further study. Note the relative similarity of the IQRs for LASD across race groups. Compare this with LAPD, particularly for those who are Black or African American.
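For the curious, that trimming can be done in ggplot2 with coord_cartesian(), which zooms the view without dropping rows, so the quartiles are still computed from the full data. A sketch with my placeholder column names:

```r
library(ggplot2)

ggplot(la_traffic, aes(x = race_label, y = stop_duration_min)) +
  geom_boxplot(outlier.alpha = 0.2) +
  facet_wrap(~ agency_name) +
  # Zoom to 0-45 minutes without filtering the data, so the
  # IQRs still reflect every stop, including the extreme outliers.
  coord_cartesian(ylim = c(0, 45)) +
  labs(x = "Perceived race", y = "Stop duration (minutes)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```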
Our last assignment with D3 opened me up to the “stream” graph. I really wanted to use it in my final project and found the perfect application with the date of the stop. I wanted to know if there was seasonality in these data. Does the holiday season, with (presumably) more people on the road, affect the number of traffic stops? Do certain demographics influence when someone might be more likely to be pulled over?
Turns out, maybe…
Wrangling data at this stage was a simple matter of calling on the tidyverse to group and summarise. Initially, I had the agencies together, but discovered using this stream graph that LAPD stops are pinched during the summer. Through the months of April to August, LAPD’s traffic stops reported to RIPA tank to almost 10% of the level in other months. LASD’s, by contrast, remain relatively steady. I really liked the proportional vs. actual views, as they key into two different pieces. The demographics of LAPD’s stops hold steady in the non-summer months, while shifting around quite a bit in the summer. LASD’s reporting is consistent throughout. More than other visuals, I think this stream graph shows these shifts really nicely, and in a way that makes 8 or 9 different categories more legible.
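The wrangling really was just group-and-count; for the stream layer itself, the ggstream package is one way to get there in R (a sketch, again with my placeholder names):

```r
library(dplyr)
library(ggplot2)
library(ggstream)   # provides geom_stream()

# Count traffic stops per agency, month, and perceived race.
monthly <- la_traffic %>%
  mutate(stop_month = lubridate::floor_date(stop_date, "month")) %>%
  group_by(agency_name, stop_month, race_label) %>%
  summarise(n_stops = n(), .groups = "drop")

# type = "proportional" normalizes each month to 100% (the proportional
# view); type = "mirror" keeps raw counts (the actual view).
ggplot(monthly, aes(x = stop_month, y = n_stops, fill = race_label)) +
  geom_stream(type = "proportional") +
  facet_wrap(~ agency_name, ncol = 1) +
  labs(x = NULL, y = "Share of traffic stops", fill = "Perceived race")
```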
I think that using these types of visuals, combined with other predictive and analytical tools, the State could detect and call out areas of low data quality and recommend auditing of processes, standardization of data collection, and rooting out malfeasance on the part of officers individually and collectively, as well as departments that can’t or won’t take corrective action.
I’ve learned through this project and this course that data visuals can always tell us a story, whether that story is about results or about quality.