Introduction

UK Police rely on ‘reasonable suspicion’ of illegal behavior to warrant a stop-and-search. Compared with other policing practices, the level of oversight for stop-and-search is comparatively low (Borooah, 2021), which raises the question: what extent of bias pervades officers’ selection of individuals for stop-and-search?

Bias is defined as the disproportionate use of stop-and-search on a specific group relative to their proportion of the population. Bias may be indicative of discrimination and profiling by police, and it suggests a measure of unfair treatment. Yet, some suggest that strategic profiling is necessary for effective policing, so bias should be interpreted within its wider context (Petherick and Brooks, 2021).

The aim of this analysis is to assess stop-and-search bias by comparing the proportion of searches on a specific group relative to their proportion of the population. Where appropriate, this analysis will be conducted at the national, regional and local levels, controlling for geographic variances that may distort the measurement of bias. The primary demographics examined are age-group, sex, racial groups, investigating which groups experience more or less searches compared to population proportions.

Data

Stop-and-Search Data

A near-complete dataset of searches is extracted from the police.uk API for calendar year 2022 at the level of police force areas (PFAs), including information on ethnicity, age and gender. Missing data is either: (i) inputted using forward filling, which is used to retain accurate estimates of full-year search rates; or (ii), when PFAs have no 2022 data, data is inputted from an alternative source at gov.uk. A final dataset of 496,593 searches is produced for analysis.

Demographic Data

Population data is manually sourced from the 2021 Census from the Office of National Statistics (ONS) at 2 geographic levels: national (England and Wales) and lower super output areas (LSOAs). These are aggregated into total populations at middle super output area (MSOA) and PFA levels to estimate corresponding rates of stop-and-search by population. Racial data is known for 478,357 (94.71%) of available searches, compiled into a summary table of search rates by racial group.

Geographic Data

Geographic shapefile data is downloaded from geoportal.statistics.gov.uk, used for spatial and population transformations in visualizing and joining location data on the level of MSOA, LSOA, PFA.

Analysis

Figure 1: Proportion of stop and search by age and sex of person compared with the population, England and Wales, 2022

Given figure 1, a high majority of searches target males, across all age_groups, and typically at a rate that is higher than their proportion of the population. Males aged 18-24 represent 28.4% of searches in 2022, yet only 4.2% of the population, a disparity of 6.8 times. Conversely, women are represented near their population proportions in youth, but as they age, become significantly less represented in search statistics.

Figure 2: Map of stop and search per 1000 population, Police Force Area, England and Wales, 2022

Figure 3: Stop and search per 1000 population, Police Force Area, England and Wales, 2022

Figures 2 and 3 indicate that location substantially mediates one’s experience of stop and search on the national level. Certain urban zones, including London and Liverpool (Merseyside), have higher search rates on average. This location effect necessitates controlling for PFA in analysis of racial disparity in stop and search; it reveals location as significant in expected rate of stop and search.

Figure 4: Stop and search per 1000 population by Racial Group, Police Force Area, England and Wales, 2022

Police Force Area (PFA)
Rate per 1000 Population in 2022
Black White Mixed Asian Other All Races* Total Stops**
Avon and Somerset 18.05 3.10 4.06 2.92 5.49 3.46 3.69
Bedfordshire 17.88 4.37 2.47 7.38 5.21 5.57 5.86
Cambridgeshire 12.71 2.70 2.56 4.27 4.31 3.05 3.22
Cheshire 35.64 8.33 4.87 7.47 13.88 8.46 8.77
City of London 2,030.04 184.49 70.21 241.88 422.04 251.28 270.14
Cleveland 18.73 9.26 7.11 9.60 8.03 9.34 11.52
Cumbria 34.86 6.18 5.26 16.33 13.99 6.37 6.77
Derbyshire 10.05 1.65 2.64 4.34 3.60 1.93 2.22
Devon and Cornwall 23.76 3.64 1.18 4.00 2.67 3.68 3.92
Dorset 30.33 2.12 2.39 2.35 3.22 2.34 2.48
Durham 10.73 5.18 3.29 4.17 6.70 5.17 5.37
Dyfed-Powys 59.78 11.75 2.05 6.92 15.42 11.73 12.60
Essex 22.07 8.05 5.22 8.92 17.51 8.59 9.56
Gloucestershire 10.78 2.25 1.76 2.16 28.28 2.51 2.70
Hampshire 25.55 4.30 2.71 4.30 16.38 4.71 4.71
Hertfordshire 19.77 4.98 3.41 5.67 6.39 5.56 5.91
Humberside 16.96 5.54 3.05 6.79 2.90 5.61 6.06
Kent 18.20 6.01 6.78 5.79 21.89 6.59 6.71
Lancashire 27.28 6.06 3.11 12.90 4.64 6.85 7.48
Leicestershire 18.31 4.13 1.66 4.75 2.31 4.62 4.64
Lincolnshire 20.39 3.67 3.91 5.95 6.87 3.83 4.10
Merseyside 76.35 37.89 13.35 24.80 46.21 37.68 40.43
Metropolitan 53.36 15.62 7.21 15.60 16.50 20.29 20.64
Norfolk 30.54 4.27 1.39 4.10 14.92 4.53 4.56
North Wales 11.27 5.00 0.97 4.81 42.91 5.13 5.13
Northamptonshire 13.79 4.14 1.97 4.81 4.17 4.50 4.54
Northumbria 6.25 3.28 1.35 3.83 2.93 3.31 3.40
Nottinghamshire 12.45 2.80 4.95 4.32 3.70 3.33 3.36
South Wales 17.48 5.07 3.82 4.46 6.94 5.22 5.32
South Yorkshire 25.02 7.05 0.36 15.33 5.87 7.80 9.19
Staffordshire 11.09 3.36 1.84 6.49 4.61 3.58 3.80
Suffolk 40.33 4.60 1.74 3.94 26.46 5.20 5.26
Surrey 23.05 4.09 1.95 4.28 4.09 4.37 4.45
Sussex 24.53 2.84 1.32 3.21 6.92 3.14 3.17
Thames Valley 23.56 4.72 1.46 7.74 3.71 5.66 5.77
Warwickshire 14.36 1.70 1.77 3.29 4.83 2.01 2.01
West Mercia 20.47 2.61 4.71 10.17 5.11 3.02 3.02
West Midlands 16.98 6.49 5.57 11.31 0.00 8.18 10.18
West Yorkshire 17.10 5.63 4.28 13.43 6.65 7.20 7.47
Wiltshire 12.58 2.00 0.99 2.44 7.12 2.22 2.24
Greater Manchester 5.97 2.47 2.75 2.90 1.75 2.68 3.28
Gwent 13.44 2.63 4.07 4.02 3.41 2.78 3.14
North Yorkshire 7.89 1.88 3.53 7.01 2.91 2.04 2.29
*'All Races' is the rate for all stops where racial data is recorded | **'Total stops' is the rate for all stops recorded

Figure 4 demonstrates a significant racial disparity in stop-and-search rates after controlling for PFA. Generally, Mixed, Other and Asian are searched at a higher rate than White, and Black is searched at the highest rate, in every PFA in England and Wales.

Figure 5: Racial Disparity in Stop and Search Rate, Police Force Area, England and Wales, 2022

Figure 6: Stop and search per 1000 population for White and Black Racial Group, Police Force Area, England and Wales, 2022

Figures 5 and 6 indicate a White-Black disparity in search rates as heavily disfavoring the Black racial group across England and Wales. Blacks are often searched at rates 5-6 times as often as Whites. Interestingly, urban areas tend to have lower racial disparity yet higher overall search rates. This suggests that rural areas, despite fewer total searches, nonetheless practice disproportional targeting of Blacks.

Figure 7: Total Stop and Searches by MSOA, London, June 2022

Figure 8: BAME Population by MSOA, London, June 2022

Figure 9: Regression of BAME Population on Stop and Search Rates, London, June 2022

Characteristic Beta 95% CI p-value
Percent_BAME 0.24 0.15, 0.33 <0.001
Abbreviation: CI = Confidence Interval

Figures 7-9 demonstrate that a location effect of stop and search does not meaningfully interact with ‘Black Asian Minority Ethnic’ (BAME) population on the local level in London. Hotspots in search rates do not match high BAME population. While a statistically significant positive correlation exists (0.24), its magnitude is small. Location is significant for a person’s expected rate of stop-search, but it is unlikely because of local area’s ethnic minority population.

Appendix: All code in this assignment

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
#Load required packages
library(ukpolice)
library(tidyverse)
library(sf)
library(sp)
library(httr)
library(jsonlite)
library(RSQLite)
library(httr)
library(netstat)
library(texreg)
library(gt)
library(viridis)
library(gtsummary)
#Ethnicity, Sex and Age Population Data at LSOA level - Census 2021 - Office of National Statistics
ethnicity_data <- read.csv("data\\LSOA_ethnicity.csv")
pop_by_sex_age <- read.csv("data\\pop_by_sex_age.csv")

#GeoJSON Shape files for LSOA, MSOA and Police Force Areas - Census 2021 - geoportal.statistics.gov.uk
LSOA_shapes <- st_read("data\\LSOA_shapefiles.geojson", quiet = TRUE) %>% 
  rename(LSOA_code = LSOA21CD, LSOA_name = LSOA21NM)
pf_shapes <- st_read("data\\force_areas_shapefiles.geojson", quiet = TRUE) %>% 
  rename(force_name = pfa16nm, force_code = pfa16cd) %>%
  mutate(force_name = ifelse(force_name=="Metropolitan Police", "Metropolitan", force_name))
MSOA_shapes <- st_read("data//MSOA_shapefiles.geojson",quiet = TRUE) %>% 
  rename(MSOA_code = MSOA21CD, MSOA_name = MSOA21NM) %>% 
  #Join MSOA to Police Force Area for later spatial transformation
  st_join(pf_shapes, join = st_intersects) %>% 
  filter(force_name == "Metropolitan") %>% 
  st_transform(crs = 4326) 
#Join LSOA to Police Force Area (PFA) to associate each shapefiles with PFA for population aggregation
LSOA_by_force <- LSOA_shapes %>% 
  #Use LSOA centroid to join to PFA
  st_centroid() %>% 
  st_join(pf_shapes, join = st_intersects) %>% 
  as_tibble() %>% 
  select("LSOA_code", "LSOA_name", force_code, force_name)

#Sum all LSOA population statistics to get PFA-level statistics for ethnicity populations
ethnicity_by_LSOA <- ethnicity_data %>% 
  rename(LSOA_code = Lower.layer.Super.Output.Areas.Code, LSOA_name = Lower.layer.Super.Output.Areas, ethnic_group = Ethnic.group..20.categories.) %>%
  select(LSOA_code, Observation, ethnic_group) %>%
  pivot_wider(names_from = ethnic_group, values_from = Observation) %>% 
  mutate(total_pop = rowSums(across(where(is.numeric))))

#Dataframe with the necessary ethnicity statistics by PFA
ethnicity_by_force <- LSOA_by_force %>% 
  full_join(ethnicity_by_LSOA, by = "LSOA_code") %>% 
  group_by(force_code, force_name) %>%
  mutate(force_name = str_replace_all(force_name, " Police", "")) %>%
  summarize(across(where(is.numeric), sum, na.rm = TRUE))

#Preprocessing the population by age and sex data
#PFA-level aggregation not necessary here, as sex/age data only used on the national level
pop_by_sex_age <- pop_by_sex_age %>% 
  select(Sex..2.categories., Age..91.categories..Code, Observation) %>% 
  rename(gender = Sex..2.categories., age = Age..91.categories..Code, population = Observation) %>% 
  mutate(population_percent = (population/sum(population))*100)
#Preparing for API query
#Vector with correct police force inputs to iterate over
police_forces <- ukc_forces()%>% 
  rename(force_name = name, ukc_input = id)%>% 
  mutate(force_name = str_replace_all(force_name, " Police", "")) %>% 
  mutate(force_name = str_replace_all(force_name, " Constabulary", "")) %>% 
  mutate(force_name = str_replace_all(force_name, "&", "and")) %>%
  mutate(force_name = str_replace_all(force_name, " Service", "")) %>%
  #Join to get the force code, force name, creating a reference table for later joins
  inner_join(ethnicity_by_force, by = "force_name") %>% 
  select(force_name, ukc_input, force_code)

#Vector of months to loop through
dates <- c("2022-01","2022-02","2022-03","2022-04","2022-05","2022-06","2022-07","2022-08","2022-09","2022-10","2022-11","2022-12")

#NOTE: police.uk API is packaged into the ukpolice package for accessibility
#CODE BELOW TO RUN ONLY IF API QUERY IS REQUIRED
"
#Data Cleaning Function - To be passed to raw API Output
clean_ss_data <- function(ss_data, force, month) {
  force_data <- ss_data %>% 
    mutate(ukc_input = force) %>% 
    # Join to reference table to get force code
    inner_join(police_forces, by = 'ukc_input') %>%
    # Select Necessary Columns
    select(age_range, gender, self_defined_ethnicity, officer_defined_ethnicity, legislation, object_of_search, outcome_object_name, legislation, force_code, force_name, ukc_input) %>% 
    # Self-defined ethnicity is often coded as not stated, replace with NA as this is not useful for analysis
    mutate(self_defined_ethnicity = ifelse(grepl('Not stated', self_defined_ethnicity), NA, self_defined_ethnicity)) %>% 
    # Define racial_group based on either officer or self defined ethnicity (self-defined preferred) to maximize available racial information
    mutate(racial_group = ifelse(is.na(self_defined_ethnicity), officer_defined_ethnicity, self_defined_ethnicity)) %>% 
    # Classify racial_group based on official ethnicity groupings from Census 2021 
    # Note the ordering of the ifelse statements is important, otherwise grepl may classify incorrectly
    mutate(racial_group = ifelse(grepl('Mixed', racial_group, fixed = TRUE), 'Mixed',
                           ifelse(grepl('Asian', racial_group, fixed = TRUE), 'Asian',
                           ifelse(grepl('Black', racial_group, fixed = TRUE), 'Black',
                           ifelse(grepl('White', racial_group, fixed = TRUE), 'White',
                           ifelse(grepl('Other', racial_group, fixed = TRUE), 'Other', NA)))))) %>% 
    mutate(month = month)
  return(force_data)
}

#Beginning API query

#Creating empty tibble of correct dimensions to store the full dataset
stop_search_2022 <- ukc_stop_search_force('bedfordshire','2022-01') %>%
  clean_ss_data('bedfordshire', '2022-01') %>%
  slice(0)

#Creating table to store combinations of forces and time-periods that have no data available
missing_data = tibble(ukc_input = character(), month = character())

#Aim is to get full dataset of every stop and search in 2022, this is a large dataset, however it allows for granular analysis and a highly informative option, allowing us to query the API just once and store the data

#Iterate through all months in 2022 and PFAs
for (month in dates){
  for (force in police_forces$ukc_input){
    force_data <- ukc_stop_search_force(force, month)
    #Error handling for missing data
    if (length(force_data)==0 | is.null(force_data)){
      missing_data <- missing_data %>% 
        add_row(ukc_input = force, month = month)
      next
    } else {
    #API Query and data cleaning
    stop_search_2022 <- force_data %>% 
      clean_ss_data(force, month) %>%
      #Append to full dataset
      rbind(stop_search_2022)
    }}
  #Sleep to be cautious of API over-querying
  Sys.sleep(3) 
}
"
#After inspecting the missing_data table, there is a large proportion of missing forces with partial data for 2022
#Consulting API issue log at https://data.police.uk/changelog/ reveals these to be data reporting outages
#To avoid estimating inaccurate annual search rates, we will forward fill the data for the missing months
'
#Create table of partial missing data
partial_missing_forces <- missing_data %>% 
  #Filer out manchester, gwent and north yorkshire as these are fully missing 2022 data, will be dealt with separately
  filter(ukc_input %in% c("greater-manchester","gwent", "north-yorkshire")==FALSE) %>% 
  mutate(month_index = as.numeric(str_replace_all(month, "2022-", ""))) 

#Create input table for forward filling 
forward_prop_input <- missing_data %>% 
  filter(ukc_input %in% c("greater-manchester","gwent", "north-yorkshire")==FALSE) %>% 
  group_by(ukc_input) %>%
  summarize(missing_months = n()) %>% 
  inner_join(police_forces, by = "ukc_input")

#Iterate over partial missing forces
for (force in forward_prop_input$ukc_input){
  missing_months <- forward_prop_input$missing_months[forward_prop_input$ukc_input==force]
  last_record_month <- min(partial_missing_forces$month_index[partial_missing_forces$ukc_input==force]) -1
  month_input <- paste0("2022-", last_record_month)
  #Fill in between last recorded month and next valid observation
  for (i in 1:forward_prop_input$missing_months[forward_prop_input$ukc_input==force]){
    force_data <- ukc_stop_search_force(force, month_input)
    month_filled_input <- paste0("2022-", last_record_month+i)
    stop_search_2022 <- force_data %>% 
      clean_ss_data(force, month_filled_input) %>%
      rbind(stop_search_2022)
  }
}

#Running a short-test to ensure that forward filling worked as intended
stop_search_2022 %>% 
  filter(force_name == "South Wales") %>% 
  group_by(month) %>% 
  summarize(total_stops = n()) %>% 
  print()
  
#There should now be all 12 months of data for South Wales PFA
'
#Full API scrap and missing data fill is now complete, write data to csv to avoid future scraping - see stop_search_2022.csv in data folder (github access)
#write.csv(stop_search_2022, "data//stop_search_2022.csv", row.names=FALSE)
#Load in the pre-scraped data - Note this is missing Manchester, North Yorkshire and Gwent PFAs
stop_search_2022 <- read_csv("data//stop_search_2022.csv")
#Now we have the full dataset, we can begin to analyse the data
#First off, analysing age and sex of those stopped and searched 
#We do this at the national level, as there is less rationale for regional variation in this case

#Group national sex and age population statistics to match the encoding from police.uk data
pop_sex_age_formatted <- pop_by_sex_age %>% 
  mutate(age_range = ifelse(age < 10, "under 10", 
                    ifelse(age <=17, "10-17", 
                     ifelse(age <=24, "18-24",
                    ifelse(age <= 34, "25-34", "over 34"))))) %>% 
  group_by(age_range, gender) %>%
  summarize(population = sum(population), population_percent=sum(population_percent))
  
#Find proportion of searches by combinations of age group and sex
stops_by_age_and_sex <- stop_search_2022 %>% 
  filter(is.na(age_range)==FALSE) %>% 
  group_by(age_range, gender) %>%
  summarize(stops_recorded = n()) %>% 
  filter(gender %in% c("Male", "Female")) %>% 
  #total searches of this data is 426768 (i.e. searches that recorded age information)
  mutate(percentage_stops = (stops_recorded/426768)*100) %>% 
  #Join to population data to measure disproportionality of search rates
  inner_join(pop_sex_age_formatted, by = join_by(age_range, gender))

#Convert the searches data to tidy-long format for plotting
age_gender_percent_stops <- stops_by_age_and_sex %>% 
  select(age_range, gender, percentage_stops) %>% 
  mutate(gender = paste0("stops-", gender)) %>% 
  rename(percentage = percentage_stops)

#Finally, get the population data in tidy-long format, then append to the searches data
age_sex_stops_population <- stops_by_age_and_sex %>% 
  select(age_range, gender, population_percent) %>% 
  mutate(population_percent = -1*population_percent) %>% 
  mutate(gender = paste0("population-", gender)) %>% 
  rename(percentage = population_percent) %>% 
  rbind(age_gender_percent_stops) %>% 
  filter(age_range != "under 10")

#Plot the data in a mirrored vertical bar plot, filling by gender
#This serves to compare rates of searches against proportion of the population for gender and age group
stops_by_age_and_sex_histogram <- age_sex_stops_population %>% 
  ggplot(aes(x = age_range, y = percentage, fill = gender)) +
  geom_bar(stat = "identity", position = "stack", width = 0.60) +
  scale_y_continuous(breaks = seq(-60, 60, by = 10), labels = abs(seq(-60,60,by=10)), limits = c(-60,60)) +
  geom_hline(yintercept = 0, linetype = "solid", color = "black", size = 1) +
  coord_flip() +
  scale_fill_manual(values = c("population-Female" = "skyblue", "stops-Female" = "skyblue", "population-Male" = "skyblue4", "stops-Male" = "skyblue4"),
                      breaks = c("population-Female", "population-Male"),
                      labels = c("Female", "Male")) + 
  theme(aspect.ratio = 4/9, axis.text.y = element_text(angle = 0, hjust = 1)) +
  ylab("Percent") +
  xlab("Age Group") +
  annotate("text", x = 1, y = 50, label = "Proportion of Stop\nand Search in each \nage group", vjust = -1, size = 3.5) +
  annotate("text", x = 1, y = -35, label = "Proportion of \nPopulation in each \nage group", vjust = -1, size = 3.5) +
  annotate("text", x = 0.1, y = 35, label = "Based on 426,768 observed searches", vjust = -1, size = 3.5) +
  labs(fill = "Sex")
  
stops_by_age_and_sex_histogram
#Further age/gender visualizations could be generated, however the robustness of the national trend is clear

#Moving onto the racial analysis, a first step is to fill in the missing data for Manchester, North Yorkshire and Gwent so we can work with a complete sample

#Data extracted from an alternative gov.uk source
missing_forces_ss <- read.csv("data/missing_forces_stop_search.csv")

#Clean and filter this data to match the police.uk data
missing_forces_ss <- missing_forces_ss %>% 
  rename(number_of_stop_and_searches_ethnicity_reported = total_number_of_stop_and_search_carried_out_in_this_year_in_this_area_excluding_cases_where_the_ethnicity_was_unreported, stop_search_rate_ethnicity = proportion_of_total_stop_and_searches_of_this_ethnicity_in_the_financial_year_excludes_unreported, force_name = geography) %>%
  mutate(time = as.character(time)) %>% 
  filter(force_name %in% c("Gwent", "Greater Manchester", "North Yorkshire")==TRUE) %>% 
  #Note that this replacement is for 2021/22, yet it serves as a reasonable approximation for 2022
  filter(geography_type == "Police Force Area" & time == "2021/22") %>% 
  #Note variables are encoded differently in this data, it will be converted to police.uk standard
  select(ethnicity, force_name, number_of_stop_and_searches, number_of_stop_and_searches_ethnicity_reported, population_by_ethnicity, rate_per_1_000_population_by_ethnicity, stop_search_rate_ethnicity)
#First a few additional processing steps before finding racial search rate figures

#Aggregating racial group populations from the ethnicity populations by PFA dataset
race_by_force <- ethnicity_by_force %>% 
  pivot_longer(cols = -c(force_code,force_name), names_to = "racial_group", values_to = "population") %>% 
  #Using same grepl approach as before
  mutate(racial_group = ifelse(grepl("Mixed", racial_group, fixed = TRUE), 'Mixed',
                       ifelse(grepl("Asian", racial_group, fixed = TRUE), 'Asian',
                       ifelse(grepl("Black", racial_group, fixed = TRUE), 'Black',
                       ifelse(grepl("White", racial_group, fixed = TRUE), 'White',
                       ifelse(grepl("Other", racial_group, fixed = TRUE), 'Other', NA)))))) %>%
  group_by(force_name, racial_group) %>% 
  summarize(population = sum(population[racial_group==racial_group])) %>% 
  filter(is.na(racial_group)==FALSE) %>%
  pivot_wider(names_from = racial_group, values_from = "population", names_prefix = "population_") %>% 
  mutate(population_Total = population_White + population_Black + population_Mixed + population_Asian + population_Other) %>%
  inner_join(police_forces, by = "force_name")

#Getting the number of searches by race for the 3 missing forces
race_stops_data_missing <- missing_forces_ss %>% 
  filter(ethnicity %in% c("White", "Black", "Mixed", "Asian", "Other")==TRUE) %>% 
  select(c(number_of_stop_and_searches, ethnicity, force_name)) %>% 
  pivot_wider(names_from = ethnicity, values_from = number_of_stop_and_searches, names_prefix = "stops_")

#Finding the total_stops for each force from the complete search dataset
stops_total_by_force <- stop_search_2022 %>% 
  group_by(force_name) %>% 
  summarize(number_of_stop_and_searches = n()) %>% 
  inner_join(police_forces, by = "force_name") %>% 
  bind_rows(missing_forces_ss) %>% 
  #get first 43 rows, which are the police forces
  slice(1:43) %>%
  select(force_name, number_of_stop_and_searches) %>%
  rename(stops_Total = number_of_stop_and_searches)

#Lastly, aggregate all racial tables to find total search rates by PFA and racial group - FINAL RACIAL BIAS DATA
race_stops_data <- stop_search_2022 %>% 
  #Filter for stops where racial group is known
  filter(is.na(racial_group)==FALSE) %>% 
  group_by(force_name) %>% 
  summarize(stops_Black = sum(racial_group=="Black"), stops_White = sum(racial_group=="White"), stops_Mixed = sum(racial_group=="Mixed"), stops_Asian = sum(racial_group=="Asian"), stops_Other = sum(racial_group=="Other")) %>% 
  #Add the 3 missing forces
  rbind(race_stops_data_missing) %>%
  mutate(stops_All = stops_Black + stops_White + stops_Mixed + stops_Asian + stops_Other) %>%
  inner_join(police_forces, by = "force_name") %>%
  #Add population data
  inner_join(race_by_force, by = c("force_name","force_code","ukc_input")) %>%
  #Find rates of stop and search per 1000 people by Race
  inner_join(stops_total_by_force, by = "force_name") %>% 
  mutate(stop_rate_black = (stops_Black/population_Black)*1000, stop_rate_white = (stops_White/population_White)*1000, stop_rate_mixed = (stops_Mixed/population_Mixed)*1000, stop_rate_asian = (stops_Asian/population_Asian)*1000, stop_rate_other = (stops_Other/population_Other)*1000, stop_rate_all_races = (stops_All/population_Total)*1000, stops_rate_Total = (stops_Total/population_Total)*1000) 
#Now we can visualize the total search rate by force on a map by joining with PFA shapefiles

#Before this, let's load data on UK city locations to determine if an association with search rates is visible on a map
uk_cities <- read.csv("data//gb_cities.csv") %>% 
  st_as_sf(coords = c("lng", "lat"), crs = 4326) %>% 
  st_transform(27700) %>% 
  filter(population > 500000)

#Generating the map of search rates by PFA
total_stop_rate_map <- pf_shapes %>% 
  inner_join(race_stops_data, by = c("force_name")) %>%
  select(force_name, stops_rate_Total) %>%
  #Filter out city of london as outlier - It has a very high search rate due to high visitation and low resident population
  filter(force_name != "City of London") %>%
  ggplot() + geom_sf(aes(fill = stops_rate_Total)) +
  scale_fill_viridis_c( 
                       direction = -1, 
                       na.value = "grey90", 
                       name = str_wrap("Searches per 1000 population", width = 15), 
                       trans = "identity") +
  theme_void() +
  theme(legend.position = "right") +
  labs(caption = "Source: police.uk | Based on 507,381 Total Searches") +
  geom_sf(data = uk_cities, color = "red2", size = 2.6) +
  annotate("text", x = 180050.1, y = 480358.4, label = str_wrap("Dots Represent Cities with Population > 500,000",width=20), size = 3)
total_stop_rate_map
#The map is a good indicator of geographic-level trends but it is difficult to measure magnitude of differences
#A bar plot serves this function better

#Displaying via histogram
ss_rate_by_force <- race_stops_data %>% 
  filter(force_name != "City of London") %>% 
  select(force_name, stops_rate_Total) %>% 
  mutate(stop_rate_total = stops_rate_Total) %>% 
  arrange(desc(stop_rate_total))

#refactor forces in descending order of stop rate
ss_rate_by_force$force_name <- factor(ss_rate_by_force$force_name, levels = ss_rate_by_force$force_name[order(-ss_rate_by_force$stop_rate_total)])

#Generate Bar plot
total_stop_rate_histogram <- ggplot(ss_rate_by_force, aes(x = force_name, y = stop_rate_total)) +
  geom_bar(stat = "identity", color = "black", fill = "lightblue", width = 0.8) + # Increase bar spacing by reducing width
  scale_y_continuous(expand = c(0,0), limits = c(0, 50), breaks = seq(0, 50, by = 10)) + # Set y-axis limits and breaks
  theme_minimal(base_size = 10) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6), # Rotate and adjust the font size
        axis.title = element_text(size = 12), # Adjust the axis title font size if needed
        panel.grid.major = element_line(color = "gray", size = 0.5), # Add major grid lines
        panel.grid.minor = element_blank(), # Disable minor grid lines
        panel.background = element_rect(fill = "white"),axis.line = element_line(color = "black")) +  # Add axis lines
  labs(
       x = "Police Force Area (PFA)",
       y = "Searches per 1000 Residents")
total_stop_rate_histogram
#It's clear that there is a large variation in search rates by force, now let's examine how racial disparity varies across forces

#Summary table - race stop rates by force
race_stops_data %>% 
  select(force_name, stop_rate_black, stop_rate_white, stop_rate_mixed, stop_rate_asian, stop_rate_other, stop_rate_all_races, stops_rate_Total) %>%
  rename("Police Force Area (PFA)" = force_name, Black = stop_rate_black, White = stop_rate_white, Mixed = stop_rate_mixed, Asian = stop_rate_asian, Other = stop_rate_other, "All Races*" = stop_rate_all_races, "Total Stops**" = stops_rate_Total) %>%
  gt() %>% 
  fmt_number(columns = c(everything()), decimals = 2) %>%
  tab_spanner(label = md("**Rate per 1000 Population in 2022**"), columns = c(2:8)) %>% 
  opt_row_striping() %>% 
  tab_footnote(
    footnote = "
    *'All Races' is the rate for all stops where racial data is recorded | **'Total stops' is the rate for all stops recorded"
  ) %>% 
  tab_style(
    style = list(cell_text(weight = "bold")),
    locations = list(cells_column_labels(columns = everything()), cells_body(columns = c(1))))
#The table is informative, but it's difficult to capture exact trends across forces
#Appears the Black vs. White gap is significant, it is worth visualizing racial disparity (the ratio of black to white stop rates)

#Generating map of racial disparity by PFA
black_stop_rate_map <- pf_shapes %>% 
  inner_join(race_stops_data, by = c("force_name")) %>%
  select(force_name, stop_rate_black, stop_rate_white) %>% 
  mutate(black_white_ratio = stop_rate_black/stop_rate_white) %>%
  #limit range of ratio to 8 (in order to visualization the greatest data variation)
  mutate(black_white_ratio = ifelse(black_white_ratio > 8, 8, black_white_ratio)) %>%
  filter(force_name != "City of London") %>%
  ggplot() + 
  geom_sf(aes(fill = black_white_ratio)) +
  scale_fill_viridis_c(
    direction = -1, 
    na.value = "red", 
    name = str_wrap("Search Disparity Rate for Black compared to White", width=15),
    trans = "identity",
    limits = c(0, 8),
    breaks = c(0, 2, 4, 6, 8),
    labels = c("0", "2", "4", "6", "8+")
  ) +
  theme_void() +
  theme(legend.position = "right") +
  labs( caption = "Source: police.uk") +
  #Once again, add the cities to the map, to see if urbanization may be a factor in racial biasing
  geom_sf(data = uk_cities, color = "red2", size = 2.6) +
  annotate("text", x = 180050.1, y = 480358.4, label = str_wrap("Dots Represent Cities with Population > 500,000",width=20), size = 3)
black_stop_rate_map
#Generating a bar plot for further visualization of racial disparity by force

#Order forces in descending order by stop rate
race_stops_data$force_name <- factor(race_stops_data$force_name, levels = race_stops_data$force_name[order(-race_stops_data$stop_rate_black)])
race_stops_data <- race_stops_data %>% 
  filter(force_name != "City of London")

#Generate Bar plot
black_stop_rate_histogram <- race_stops_data %>% 
  filter(force_name != "City of London") %>% 
  select(force_name, stop_rate_black, stop_rate_white) %>%
  pivot_longer(cols = c(stop_rate_black,stop_rate_white), names_to = "race", values_to = "stop_rate") %>% 
  ggplot(aes(x = force_name, y = stop_rate, fill = race)) +
  geom_bar(stat = "identity", position = "identity", alpha = 0.5, width = 0.8) + 
  scale_y_continuous(expand = c(0,0), limits = c(0, 90), breaks = seq(0, 90, by = 10)) +
  theme_minimal(base_size = 10) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6), # Rotate and adjust the font size
        axis.title = element_text(size = 12), # Adjust the axis title font size if needed
        panel.grid.major = element_line(color = "gray", size = 0.5), # Add major grid lines
        panel.grid.minor = element_blank(), # Disable minor grid lines
        panel.background = element_rect(fill = "white"),
        axis.line = element_line(color = "black"), legend.position = "bottom") +  # Add axis lines
  labs(
       x = "Police Force Area (PFA)",
       y = "Searches per 1000 Population",
       fill = "Racial Group") +  
  scale_fill_manual(values = c("stop_rate_black" = "skyblue", "stop_rate_white" = "gray10"),
                     labels = c("Black Ethnicity", "White Ethnicity"))
black_stop_rate_histogram
#Now we ask the question: does race impact the location effect on stop and search? Are certain areas with high BAME populations more likely to experience stop and search?

#For this, we can a slice of the metropolitan search data, specifically June 2022
#We move up from LSOAs to MSOAs, as the former are too small to be useful for observing trends at the London level
#Aim is to compare BAME population to stop and search rates to see if a possible relationship exists

#Ensure spherical data set to false, otherwise the following code will often bug
sf_use_s2(FALSE)

#Finding BAME population per MSOA using shapefiles
#BAME is all non-white racial groups
MSOA_BAME_pop <- MSOA_shapes %>% 
  st_transform(crs = 27700) %>%
  st_join(LSOA_shapes, join = st_intersects) %>% 
  as_tibble() %>% 
  select(MSOA_code, LSOA_code) %>% 
  inner_join(ethnicity_by_LSOA, by = c("LSOA_code")) %>%
  select(-c("Does not apply", total_pop)) %>%
  pivot_longer(cols = -c(MSOA_code,LSOA_code), names_to = "ethnicity", values_to = "population") %>% 
    mutate(racial_group = ifelse(grepl("Mixed", ethnicity, fixed = TRUE), 'Mixed',
                           ifelse(grepl("Asian", ethnicity, fixed = TRUE), 'Asian',
                           ifelse(grepl("Black", ethnicity, fixed = TRUE), 'Black',
                           ifelse(grepl("White", ethnicity, fixed = TRUE), 'White',
                           ifelse(grepl("Other", ethnicity, fixed = TRUE), 'Other', NA)))))) %>% 
  mutate(BAME = ifelse(racial_group == "White", "White", "BAME")) %>% 
  group_by(MSOA_code, BAME) %>%
  summarise(population = sum(population)) %>%
  pivot_wider(names_from = BAME, values_from = population) %>%
  mutate(Total = White + BAME) %>%
  mutate(BAME_prop = (BAME/Total)*100, White_prop=(White/Total)*100)


#Finding the total number of searches per MSOA in Metropolitan police force area
metro_data <- ukc_stop_search_force("metropolitan","2022-06") %>% as_tibble()

metro_search_data <- metro_data %>% 
  filter(!is.na(latitude)) %>%
  #Convert lat-long coordinates to shapefile, then use shapefile intersect to group into MSOA
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>% 
  st_join(MSOA_shapes) %>%
  filter(!is.na(MSOA_code)) %>%
  #Summarize number of searches per MSOA
  group_by(MSOA_code) %>%
  summarise(total_searches = n())

#Mapping the hotspots of stop and search in London, June 2022  
metro_search_map <- MSOA_shapes %>%
  st_join(metro_search_data, by = c("MSOA_code")) %>%
  #Use total searches rather than search rate, as MSOAs all have uniform population, so actual variance would not change much
  mutate(total_searches = ifelse(is.na(total_searches), 0, total_searches)) %>% 
  mutate(total_searches = ifelse(total_searches > 150, 150, total_searches)) %>% 
  ggplot() +
  geom_sf(aes(fill = total_searches), alpha = 0.6) +
  scale_fill_viridis_c(name = "Total Searches",  
                       labels = c("0","50","100", "150+")) +
  theme_void()

metro_search_map
#Mapping the BAME population of London
metro_BAME_map <- MSOA_shapes %>%
  st_join(metro_search_data, by = c("MSOA_code")) %>%
  rename(MSOA_code = MSOA_code.x) %>% 
  inner_join(MSOA_BAME_pop, by = c("MSOA_code")) %>%
  ggplot() +
  geom_sf(aes(fill = BAME_prop), alpha = 0.6) +
  scale_fill_viridis_c(name = str_wrap("BAME Percent of Population",width=15)) +
  theme_void()

#Now we can visually compare the two maps to see if hotspots overlap with areas of high BAME population
metro_BAME_map
#Regression analysis
MSOA_BAME_pop <- MSOA_BAME_pop %>% 
  inner_join(metro_search_data, by = c("MSOA_code")) %>% 
  as_tibble() %>% 
  rename(Percent_BAME = BAME_prop) %>% 
  mutate(percent_searches = (total_searches/sum(total_searches))*100)
reg1 <- lm(total_searches ~ Percent_BAME, data = MSOA_BAME_pop)

tbl_regression(reg1)

References

Borooah, Vani. “Racial Bias in Policing: Police Stop and Searches in England and Wales.” Mpra.ub.uni-Muenchen.de, 2021, mpra.ub.uni-muenchen.de/113064/.

Petherick, Wayne, and Nathan Brooks. “Reframing Criminal Profiling: A Guide for Integrated Practice.” Psychiatry, Psychology and Law, vol. 28, no. 5, 10 Dec. 2020, pp. 1–17, www.ncbi.nlm.nih.gov/pmc/articles/PMC9103349/, https://doi.org/10.1080/13218719.2020.1837030.

Data Sources

GB Cities Coords Data

Sex by Single Year of Age National Data

Ethnic Population by LSOA

Stop and Search Data

Shapefiles - LSOA, PFA, MSOA

Census 2021 Data