2020-06-15

library(tidyverse)
library(data.table)
library(reactable)
library(DT)
library(htmltab)

Below I give the code I use to pull police shooting data directly from the web. For each data set I show the first 50 results.

Washington Post (2015 - Present)

Washington Post’s National Police Shooting Database

Below I pull the Washington Post data set from their github page and create a few new variables, recoding Race, Sex, Armed, and Threat levels.

The Washington Post’s National Police Shooting Database began in 2015 and includes “only those shootings in which a police officer, in the line of duty, shot and killed a civilian” and therefore does not track “deaths of people in police custody, fatal shootings by off-duty officers or non-shooting deaths.”

wapo <- read_csv("https://cdn.rawgit.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv") %>%   
  dplyr::mutate(  
      
    Year=as.numeric(format(date,"%Y")),  
      
    age_f= cut(age, breaks=c(0, 14, 19, 39, 200),   
               
                    right= FALSE, labels=c("AgeUnder15", "Age15_19", "Age20_39", "Age40Above")),   
      
    Race = dplyr::recode(race,   
                         "W" = "White",  
                         "H" = "Hispanic",  
                         "B" = "Black",  
                         "A" = "Asian",  
                         "N" = "Native American",  
                         "O" = "Other"),  
      
    Sex = dplyr::recode(gender, "M" = "Male", "F" = "Female"),  
      
    Armed = dplyr::case_when(armed == "unarmed" ~ "Unarmed",   
                             is.na(armed) | armed == "undetermined" ~ "Unknown/Missing",  
                             TRUE ~ "Armed"),  
      
    Threat = dplyr::case_when(threat_level == "attack" ~ "Attacking Officer",   
                              threat_level %in% c("other", "undetermined") | is.na(threat_level) ~ "Undetermined Threat",   
                                  TRUE ~ "Undetermined Threat"))  
  ##dplyr::filter(Year %in% c(2015:2019)) ##filtering to completed years

  
##reactable::reactable(wapo)

 wapo %>% head(., n = 50) %>%     
   reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE,   
                        highlight = TRUE, striped = TRUE, compact = TRUE, wrap = FALSE)   #DT::datatable(.) 

Fatal Encounters (2000 - Present)

Fatal Encounters

The raw data can be obtained in two places:

I use the googlesheets4 package to download the google spreadsheet. Before I download it, I run googlesheets4::gs4_deauth() so that I don’t have to use credentials. Warning: downloading this can take several minutes!

Also, note that this data set differs from the Washington Post data in that it includes:

  • deaths that occurred in the presence of a police officer (e.g. suicides),
  • deaths of people in police custody (e.g. inmates in prisons),
  • fatal shootings by off-duty officers, and
  • non-shooting deaths (e.g. crashed vehicles in police chases, tasers, etc.)
library(googlesheets4)

googlesheets4::gs4_deauth()
fe <- googlesheets4::read_sheet(ss = "https://docs.google.com/spreadsheets/d/1dKmaV_JiWcG8XBoRgP8b4e9Eopkpgt7FL7nyspvzAsE/edit#gid=0", 
    sheet = 1)

## Attempted to use htmltab to scap the html table, but didn't work fe <-
## htmltab::htmltab(doc = 'https://numeracy.co/public/wqzC522zdz', which =
## '//*[@id='reactRoot']/div/div[1]/div[3]/div/div/div/div/div[2]/div[2]/div/div/div/div[1]/div/div[1]')

fe %>% head(., 50) %>% 
reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE, highlight = TRUE, 
    striped = TRUE, compact = TRUE, wrap = FALSE)

To find cases that occurred in prisons (in order to remove them for example), you can search for “prison” or “corrections” in the Agency responsible for death column or “prisoner” in the Brief description... column, as shown below. You will need to examine these in more detail, since some of these include escaped prisoners, parolee violators, etc.

fe_f <- fe %>% dplyr::filter(grepl("corrections|prison", `Agency responsible for death`, 
    ignore.case = TRUE) | grepl("prisoner", `A brief description of the circumstances surrounding the death`, 
    ignore.case = TRUE))

fe_f %>% head(., n = 10) %>% 
reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE, highlight = TRUE, 
    striped = TRUE, compact = TRUE, wrap = FALSE)

Mapping Police Violence (2013 - 2019)

Mapping Police Violence

This file is an .xlsx file with several tabs. I use the openxlsx::read.xlsx() function with the argument detectDates = TRUE. Finally, I use the janitor::clean_names() function to fix the column headers.

mpv <- openxlsx::read.xlsx("https://mappingpoliceviolence.org/s/MPVDatasetDownload.xlsx", 
    sheet = 1, detectDates = TRUE) %>% janitor::clean_names(.)


mpv %>% head(., n = 50) %>% 
reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE, highlight = TRUE, 
    striped = TRUE, compact = TRUE, wrap = FALSE)

Killed by Police (2015 - Present)

Killed by Police

Unfortunately, there is no link to the raw data. Instead, I have to use the htmltab package in order to scrape the data manually from the html tables. I used the developer tools in Chrome to find the Xpath //*[@id='post-3820']/div/div/tabl. Also, there is column named with only an astericks which indicates manner of death, so I am renaming this. The manner of death column also does not have a codebook, although I assume that ‘G’ means ‘gunshot’, I am not certain.

library(htmltab)

kbp_20 <- htmltab(doc = "https://killedbypolice.net/kbp2020/", which = "//*[@id='post-3820']/div/div/table") %>% 
    dplyr::rename(manner_of_death = `*`)

kbp_19 <- htmltab(doc = "https://killedbypolice.net/kbp2019/", which = "//*[@id='post-13']/div/div/table") %>% 
    dplyr::rename(manner_of_death = `*`)

kbp_18 <- htmltab(doc = "https://killedbypolice.net/kbp2018/", which = "//*[@id='post-31']/div/div/table") %>% 
    dplyr::rename(manner_of_death = `*`)

kbp_17 <- htmltab(doc = "https://killedbypolice.net/kbp2017/", which = "//*[@id='post-33']/div/div/table") %>% 
    dplyr::rename(manner_of_death = `*`)

kbp_16 <- htmltab(doc = "https://killedbypolice.net/kbp2016/", which = "//*[@id='post-35']/div/div/table") %>% 
    dplyr::rename(manner_of_death = `*`)

kbp_15 <- htmltab(doc = "https://killedbypolice.net/kbp2015/", which = "//*[@id='post-38']/div/div/table") %>% 
    dplyr::rename(manner_of_death = `*`)


kbp <- dplyr::bind_rows(kbp_20, kbp_19, kbp_18, kbp_17, kbp_16, kbp_15)

kbp %>% head(., n = 50) %>% 
reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE, highlight = TRUE, 
    striped = TRUE, compact = TRUE, wrap = FALSE)

The Guardian (2015-2016)

The Guardian’s The Counted Database

Only includes data for 2015 and 2016.

G_15 <- readr::read_csv("https://raw.githubusercontent.com/flother/thecounted/master/data/the-counted-2015.csv")

G_16 <- readr::read_csv("https://raw.githubusercontent.com/flother/thecounted/master/data/the-counted-2016.csv")


guard <- dplyr::bind_rows(G_15, G_16)


guard %>% head(., n = 50) %>% 
reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE, highlight = TRUE, 
    striped = TRUE, compact = TRUE, wrap = FALSE)

Lott & Moody (2013-2015)

This is an interesting data set compiled by researchers Lott & Moody (2016) in their paper entitled “Do White Police Officers Unfairly Target Black Suspects?”. The data itself is publicly available, but hard to find. You can download the zip file which contains a STATA .dta file at this url: https://crimeresearch.org/data/

Below I first download the data to a temp file using the tempfile() function to generate the temporary file name; the download.file() function; and the unz() to unzip the package, along with the read_stata() function from the haven package. Among the many useful variables in this data set are the columns:

  • ivcr - whether or not the offender/victim (the author’s use “offender” for the civilian killed) was Involved in Violent CRime at the time of the incident,
  • ipcr - whether or not the offender/victim was Involved in Property CRime at the time of the incident, and
  • drugrelated - whether or not the offender/victim was involved in drug-related offense at the time of the incident.
url <- "https://crimeresearch.org/wp-content/uploads/2013/06/police.shoot_.zip" 
 
temp <- tempfile() 
download.file(url, temp) 


lott <-  haven::read_stata(unz(temp, filename="police.shoot.dta")) %>%  
  dplyr::select(-contains("dum")) %>%   ##deleting dummy variables
  dplyr::mutate(
    OfficerRace = dplyr::case_when(
      pw == 1 ~ "White",
      ph == 1 ~ "Hispanic",
      pb == 1 ~ "Black",
      po == 1 ~ "Other",
      pu == 1 ~ "Unknown",
      TRUE ~ "Unknown"),
    Officer2Race = dplyr::case_when(
      officer2race==1 ~ "White",
      officer2race==2 ~ "Black",
      officer2race==3 ~ "Hispanic",
      officer2race==4 ~ "Asian",  ##includes Pacific Islander, 4 cases
      officer2race==5 ~ "Native American", ##No cases
      officer2race==6 ~ "Other", ##No cases
      ##here missing could mean unknown or there was no second officer
      numpolice <= 1 ~ "None",
      TRUE ~ "Unknown"),
    OfficerSex = dplyr::case_when(
      officer1gender==1 ~ "Male",
      officer1gender==0 ~ "Female",
      TRUE ~ "Unknown"),
    Officer2Sex = dplyr::case_when(
      officer2gender == 1 ~ "Male",
      officer2gender == 0 ~ "Female", 
      ##here missing could mean unknown or there was no second officer
      numpolice <= 1 ~ "None",
      TRUE ~ "Unknown"),
    age_f= cut(sa, breaks=c(0, 14, 19, 39, 200), 
               right= FALSE, labels=c("AgeUnder15", "Age15_19", "Age20_39", "Age40Above")),
    
    Race = dplyr::case_when(
      offender1race == 1 ~ "White",
      offender1race == 2 ~ "Black",
      offender1race == 3 ~ "Hispanic",
      offender1race == 4 ~ "Asian",
      offender1race == 5 ~ "Native American",
      TRUE ~ "Unknown"), ##Missing
    Age = sa,
    CrimeInvolved = dplyr::case_when(
      ivcr == 1 | ipcr == 1 ~ TRUE, ##involved in violent crime, or property crime
      is.na("ivcr") | is.na(ipcr) ~ NA,  ##actually no missing values here
      TRUE ~ FALSE), ##otherwise FALSE
    
    ## cases classified as missing for "armed" column but TRUE for firearm column
    Crime_or_Armed = dplyr::case_when(
      armed == 1 ~ TRUE,
      firearm == 1 ~ TRUE,
      vehicle == 1 ~ TRUE,
      knife == 1 ~ TRUE,
      otherweapon == 1 ~ TRUE,
      TRUE ~ FALSE) ##otherwise FALSE
  )
  

##reactable couldn't handle the column labels imported from the stata file  

##find all numeric variables
rounded_cols <- names(lott)[sapply(lott, FUN = is.double)]

lott %>% head(., n = 50) %>% 
    
  DT::datatable(., extensions = c('FixedColumns'),
                options = list(
                  dom = 't',
                  scrollX = TRUE,
                  fixedColumns = TRUE)) %>% 
  
  DT::formatRound(., columns = rounded_cols, digits = 2)

PKIC (2015)

PKIC - Police Killings in Context dataset

Data used in the analysis found in this book: Selby et al. (2016) From the github website, is this description:

“The StreetCred® Police Killings in Context (PKIC) database is an open-standard data set that includes incidents (other than traffic accidents) in which an unarmed civilian was killed during an encounter with American police who (whether on-, or off-duty) were acting in an official capacity, from the time of first contact through booking.”

Contains 159 cases of unarmed civilians killed by police in 2015. The data are limited in scope but extremely detailed.

pkic <- readr::read_csv("https://raw.githubusercontent.com/StreetCredSoftware/PKIC/master/StreetCred_%20PKIC_2015.csv")

pkic %>% head(., n = 50) %>% 
reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE, highlight = TRUE, 
    striped = TRUE, compact = TRUE, wrap = FALSE)

Vice News (2010-2016)

A description of this dataset can be found here. Their github page is here: https://github.com/vicenews/shot-by-cops/ . Here is a brief description:

“VICE News spent nine months collecting data on both fatal and nonfatal police shootings from the 50 largest local police departments in the United States. For every person shot and killed by cops in these departments from 2010 through 2016, we found, police shot at two more people who survived….”

This data set includes shootings as well as fatalities and is sourced directly from police departments, thus making it unique among the data sets covered in this post. The data is contained in a google docs spreadsheet, so I use the googlesheets4 package to download it.

The coding in the Vice file is atrocious. There are at least three date formats! The columns are all mixed data types as well. Perhaps the sloppiest data set I’ve ever seen. Below I manually enter the state and county FIPS codes.

url <- "https://docs.google.com/spreadsheets/d/1CaOQ7FUYsGFCHEqGzA2hlfj69sx3GE9GoJ40OcqI9KY/edit#gid=1271324584"

googlesheets4::gs4_deauth()
vice <- googlesheets4::read_sheet(url, col_types = "cdcccccccccccccc") %>% 
    dplyr::mutate_if(is.character, list(~trimws(iconv(.,"UTF-8", "ASCII")))) %>% 
    dplyr::mutate(
      year = ifelse(grepl("-", x=Date), 
                    substr(Date, start=3, stop=4),
                    substr(Date, start=nchar(Date)-1, stop=nchar(Date))),
      year = as.numeric(paste0("20", year)),
      # Date = trimws(Date),
      # Date2 = ifelse(nchar(Date) == 4, paste0("01/", "01/", Date), Date),
      # Date2 = ifelse(nchar(Date2) == 7, paste0(substr(Date2, 6,7), "/01/", year), Date2),
      # Date3 = lubridate::mdy(Date2)),
      State = dplyr::recode(City, 
        "Albuquerque" = "NM",
        "Atlanta" = "GA",
        "Austin"="TX",
        "BaltimoreCity"="MD",
        "BaltimoreCounty"="MD",
        "Boston"="MA",
        "CharlotteMecklenburg"="NC",
        "Chicago"="IL",
        "Cincinnati"="OH",
        "City of Miami"="FL",
        "Cleveland"="OH",
        "Columbus"="OH",
        "Dallas"="TX",
        "DekalbCounty"="GA",
        "Denver"="CO",
        "Detroit"="MI",  
        "El Paso"="TX",
        "FairfaxCounty"="VA",
        "Fort Worth"="TX",
        "Honolulu"="HI",
        "Houston"="TX",
        "Indianapolis"="IN",
        "Jacksonville"="FL",
        "Kansas City"="MO",
        "LasVegas"="NV",
        "LosAngeles"="CA",
        "Louisville"="KY",
        "Memphis"="TN",    
        "MiamiDade"="FL",
        "Milwaukee"="WI",
        "Nashville"="TN",
        "New York"="NY",
        "Newark"="NJ",
        "NewOrleans"="LA",
        "Philadelphia"="PA",
        "Phoenix"="AZ",
        "Portland"="OR",
        "PrinceGeorgesCounty"="MD",
        "San Antonio"="TX",
        "San Francisco" = "CA",
        "SanDiego" = "CA",
        "SanJose" = "CA",
        "Seattle" = "WA",
        "St. Louis" = "MO",
        "Tampa" = "FL",
        "Tucson" = "AZ",
        "Washington DC" = "DC"),
    FIPS = dplyr::recode(City, 
         "Albuquerque" = "35001",
         "Atlanta" = "13121",
         "Austin"="48453",
         "BaltimoreCity"="24510",
         "BaltimoreCounty"="24005",
         "Boston"="25025",
         "CharlotteMecklenburg"="37119",
         "Chicago"="17031",
         "Cincinnati"="39061",
         "City of Miami"="12086",  ##same as miami-dade county
         "Cleveland"="39035",
         "Columbus"="39049",
         "Dallas"="48113",
         "DekalbCounty"="13089",
         "Denver"="08031",
         "Detroit"="26163",  
         "El Paso"="48141",
         "FairfaxCounty"="51059",
         "Fort Worth"="48439",
         "Honolulu"="15003",
         "Houston"="48201",
         "Indianapolis"="18097",   #see also 18003, 
         "Jacksonville"="12031",
         "Kansas City"="29095",
         "LasVegas"="32003",
         "LosAngeles"="06037",
         "Louisville"="21111",
         "Memphis"="47157",    
         "MiamiDade"="12086", ##same as miami
         "Milwaukee"="55079",
         "Nashville"="47037",
         "New York"="36061",
         "Newark"="34013",
         "NewOrleans"="22071",
         "Philadelphia"="42101",
         "Phoenix"="04013",
         "Portland"="41051",
         "PrinceGeorgesCounty"="24033",
         "San Antonio"="48029",
         "San Francisco" = "06075",
         "SanDiego" = "06073",
         "SanJose" = "06085",
         "Seattle" = "53033",
         "St. Louis" = "29189",
         "Tampa" = "12057",
         "Tucson" = "04019",
         "Washington DC" = "11001"))

vice %>% head(., n = 50) %>% 
    
  reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE,   
                       highlight = TRUE, striped = TRUE, compact = TRUE, wrap = FALSE)

FBI (2014-2018)

FBI data can be found in its “Expanded Homicide Data Table 14”, subtitled, “Justifiable Homicide by Weapon, Law Enforcement.” There are several of these. To obtain the table for years 2014-2018, see: https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/tables/expanded-homicide-data-table-14.xls

The data set contains counts per year of cases involving “The killing of a felon by a law enforcement officer in the line of duty.” This dataset averages around 400 cases a year, less than half of the cases reported by the Washington Post and others. It also does not contain any information about the decedent (i.e. civilian victim/offender).

To obtain this data, I use the htmltab package.

library(htmltab)

url <- "https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/tables/expanded-homicide-data-table-14.xls"

fbi <- htmltab::htmltab(doc = url, which = "//*[@id='table-data-container']/table")


fbi %>% reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE, 
    highlight = TRUE, striped = TRUE, compact = TRUE, wrap = FALSE)

USPSD - (2011-2015)

USPSD - US Police Shootings Data

This is possibly the worst data set you could use aside from the FBI & CDC data. But it was used by an article that received a lot of media attention, Ross 2015, an article in PLoS ONE. You can read about this data set here . It also appears deprecated. It’s a crowd-sourced data set, but nobody has been contributing actively to it for several years, since 2015.

There are numerous duplicated cases, for example:

  • Travis McNeil & Travis McNeal (Miami)
  • Christopher Travis (mentioned twice) (Berkeley)
  • Antonio Tafolla & Antonio Tafolla-Diaz
  • Lorenzo Joseph Ciaramella & Lorenzo Ciaramella
  • William “Corey” Jackson & William Corey Jackson
  • Jesse Matthews (mentioned twice)
  • Jordan Hatcher & Jordan Ross Hatcher
  • John Geer & John B. Geer

It’s also important to point out that this dataset includes reported shootings, not just shootings that resulted in deaths.

The url to the google spreadsheet is here. Below I use googlesheets4::read_sheet() to download the data from sheet #2 which is sorted by the date of the shooting incident, which is oddly labeled “Date Searched”.

library(googlesheets4)

googlesheets4::gs4_deauth()
  
uspsd <- googlesheets4::read_sheet(ss = "https://docs.google.com/spreadsheets/d/1cEGQ3eAFKpFBVq1k2mZIy5mBPxC6nBTJHzuSWtZQSVw/edit#gid=1424771648", sheet = 2) %>% 
    
  dplyr::mutate(  
    Date = as.Date(`Date Searched`), 
      
    Year = as.numeric(format(Date,"%Y")), 
      
    Year = replace(Year, which(Year == 1012), 2012),  
      
    Year = replace(Year, which(Year == 1011), 2011) 
    )
  
uspsd %>% head(., n = 50) %>%   
      
  reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE,     
                       highlight = TRUE, striped = TRUE, compact = TRUE, wrap = FALSE) 

CDC + LEOKA (2010-2018)

The final way to obtain estimates of deaths by police is from the CDC’s WISQARS Fatal Injury Data from the National Vital Statistics System (NVSS) operated by the National Center for Health Statistics. The website is here: https://webappa.cdc.gov/sasweb/ncipc/dataRestriction_inj.html

However, to my knowledge there is no API for this data and you cannot simply scrape the data from the web because it’s restricted. Once you agree to the terms of use, you can manually get the data by clicking on “Legal intervention” under the “Homicide and Legal intervention” tab inside the “Intent or manner of the injury” box. This will give you estimates for “Legal Intervention Injury Deaths” for the year you select. If you select multiple years, it will, unfortunately, sum across those years rather than give you estimates for each year separately. Below I manually enter in this data.

Finally, to get an estimate of civilians killed by police, you have to subtract police officers killed in the line of duty, which are included in the CDC estimates. You can obtain this data directly from the FBI’s website

### https://ucr.fbi.gov/leoka/2019/tables/table-42.xls/ LEOKA - UCR FBI

url <- "https://ucr.fbi.gov/leoka/2019/tables/table-42.xls/"
leoka_url <- htmltab(doc = url, which = "//*[@id='table-data-container']/table")
leoka <- leoka_url[1, 4:13]
leoka <- leoka %>% tidyr::pivot_longer(`2010`:`2019`, names_to = "Year", values_to = "OfficersKilled") %>% 
    dplyr::mutate(Year = as.numeric(Year), OfficersKilled = as.numeric(OfficersKilled))


cdc <- tibble::tibble(Year = 2010:2018, cdc = c(412, 492, 550, 516, 515, 530, 549, 
    616, 618)) %>% dplyr::left_join(leoka, by = "Year") %>% dplyr::mutate(CDC = cdc - 
    OfficersKilled) %>% dplyr::select(Year, CDC, OfficersKilled)


cdc %>% reactable::reactable(., resizable = TRUE, filterable = TRUE, searchable = TRUE, 
    highlight = TRUE, striped = TRUE, compact = TRUE, wrap = FALSE)

Combine & Save

Below I combine all of these data sets into a list and save as a qs object using the qs package. Alternatively, you could save the list as an .rds file. Of course, you could join all of these data sets together into one data frame, or just save them as separate files.

all_data <- list(wapo = wapo, fe = fe, mpv = mpv, kbp = kbp, guard = guard, lott = lott, 
    pkic = pkic, vice = vice, fbi = fbi, uspsd = uspsd, cdc = cdc)


library(qs)

qs::qsave(all_data, "all-police-shooting-data.qs")