Crime Data Analysis In West Yorkshire (April–September 2020)

Author

by Happiness Ndanu

Introduction

This report investigates crime trends in West Yorkshire from April to September 2020. The objectives are to:

Understand how crime types varied over six months
Identify unusual data features (e.g., missing values, duplicates)
Provide visual and spatial insight into crime patterns
Suggest areas for further analysis and improvement

Key R packages used include tidyverse (which provides ggplot2 and readr), leaflet, and janitor.

Load the Necessary Libraries

library(tidyverse)
library(summarytools)
library(patchwork)
library(readr)
library(leaflet)
library(janitor)
library(ggplot2)
library(paletteer)

Data Loading & Cleaning

Data from the six monthly files were merged. The April file lacked the reported_by column and the July file lacked falls_within. Each missing column was added and filled with NA values.

Rows with missing values in crime_id, month, longitude, latitude, lsoa_code, or crime_type were dropped, while missing values in the remaining columns were replaced with “Unknown”.

The columns needed for the analysis were selected, and duplicates based on the crime_id column were dropped to maintain data integrity.

# Data loading
april <- read_csv("2020-04-west-yorkshire-street.csv") %>% clean_names()
may <- read_csv("2020-05-west-yorkshire-street.csv") %>% clean_names()
june <- read_csv("2020-06-west-yorkshire-street.csv") %>% clean_names()
july <- read_csv("2020-07-west-yorkshire-street.csv") %>% clean_names()
aug <- read_csv("2020-08-west-yorkshire-street.csv") %>% clean_names()
sept <- read_csv("2020-09-west-yorkshire-street.csv") %>% clean_names()

# Create a list of all datasets
monthly_datasets <- list( April = april,
  May = may,
  June = june,
  July = july,
  August = aug,
  September = sept
)

# Compare column names across all datasets
lapply(monthly_datasets, colnames) %>% 
  purrr::map(~ sort(.x)) %>% 
  enframe(name = "Month", value = "Column_Names")

# A tibble: 6 × 2
  Month     Column_Names
  <chr>     <list>      
1 April     <chr [12]>  
2 May       <chr [13]>  
3 June      <chr [13]>  
4 July      <chr [12]>  
5 August    <chr [13]>  
6 September <chr [13]>

sapply(monthly_datasets, function(df) paste(sort(colnames(df)), collapse = ", ")) %>%
  unique()

[1] "context, crime_id, crime_type, falls_within, last_outcome_category, latitude, location, longitude, lsoa_code, lsoa_name, month, x1"             
[2] "context, crime_id, crime_type, falls_within, last_outcome_category, latitude, location, longitude, lsoa_code, lsoa_name, month, reported_by, x1"
[3] "context, crime_id, crime_type, last_outcome_category, latitude, location, longitude, lsoa_code, lsoa_name, month, reported_by, x1"

# An alternative is to call the base R function colnames() on each dataset in turn, e.g. colnames(april).
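
The comparison above shows that the column sets differ, but not directly which month lacks which column. A minimal sketch of one way to list the missing columns per month follows. It uses base R only, and the toy_months list with its columns and values is a hypothetical stand-in for the real monthly files.

```r
# Toy stand-ins for the monthly files (hypothetical columns and values)
toy_months <- list(
  April = data.frame(crime_id = "a1", falls_within = "West Yorkshire Police"),
  May   = data.frame(crime_id = "b1", reported_by = "West Yorkshire Police",
                     falls_within = "West Yorkshire Police"),
  July  = data.frame(crime_id = "c1", reported_by = "West Yorkshire Police")
)

# Reference set: every column that appears in at least one month
reference_cols <- Reduce(union, lapply(toy_months, colnames))

# For each month, list the reference columns it lacks
missing_by_month <- lapply(
  toy_months,
  function(df) setdiff(reference_cols, colnames(df))
)

missing_by_month
```

Applied to monthly_datasets, the same pattern should name the absent column for each affected month.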

# Let's list all the necessary columns needed for this analysis
all_column_names <- c("x1", "crime_id", "month", "reported_by", "falls_within", "longitude",
                      "latitude", "location", "lsoa_code", "lsoa_name", "crime_type",
                      "last_outcome_category", "context")
# Next, create a function that adds the missing columns to the two datasets and fills them with missing values (NAs)
add_missing_cols <- function(df, all_column_names) {
  missing <- setdiff(all_column_names, colnames(df))
  for (col in missing) {
    df[[col]] <- NA
  }
  df <- df[, all_column_names]
  return(df)
}
#The function adds any missing columns with NA values in the respective months (April and July)
#It then reorders the columns to match a standard structure.

april <- add_missing_cols(april, all_column_names)
may <- add_missing_cols(may, all_column_names)
june <- add_missing_cols(june, all_column_names)
july <- add_missing_cols(july, all_column_names)
aug <- add_missing_cols(aug, all_column_names)
sept <- add_missing_cols(sept, all_column_names)

combined_months <- bind_rows(april, may, june, july, aug, sept)
#The six datasets were combined into one complete dataset to facilitate comprehensive analysis
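
As a side note, dplyr's bind_rows() matches columns by name and fills any column absent from one input with NA, so the combined result does not depend on every input having identical columns. The sketch below illustrates this on toy data. The toy_april and toy_may data frames and the source_month identifier are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy inputs: the first lacks the reported_by column
toy_april <- data.frame(
  crime_id = c("a1", "a2"),
  falls_within = "West Yorkshire Police"
)
toy_may <- data.frame(
  crime_id = "b1",
  reported_by = "West Yorkshire Police",
  falls_within = "West Yorkshire Police"
)

# Columns are matched by name; missing ones are filled with NA.
# .id records which input each row came from.
toy_combined <- bind_rows(April = toy_april, May = toy_may, .id = "source_month")

toy_combined
```

An explicit helper such as add_missing_cols() is still useful when a fixed column order is wanted before combining.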

# Drop the context column, which is not needed for the analysis
combined_months <- combined_months %>% 
  select(-context)

#Identify any missing value in our data.
colSums(is.na(combined_months))

                   x1              crime_id                 month 
                    0                 29689                  2000 
          reported_by          falls_within             longitude 
                23785                 31278                  5487 
             latitude              location             lsoa_code 
                 5487                  2000                  5488 
            lsoa_name            crime_type last_outcome_category 
                 5488                  2000                 31334

combined_months <- combined_months %>%
  drop_na(crime_id, month, longitude, latitude, lsoa_code, crime_type) %>%
  mutate(
    reported_by = replace_na(reported_by, "Unknown"),
    falls_within = replace_na(falls_within, "Unknown"),
    location = replace_na(location, "Unknown"),
    lsoa_name = replace_na(lsoa_name, "Unknown"),
    last_outcome_category = replace_na(last_outcome_category, "Unknown")
  )

#verifying whether missing values have been removed
sum(is.na(combined_months))

[1] 0
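
Because drop_na() removes whole rows, it can be worth checking whether the rows without a crime_id are concentrated in particular crime types before they are dropped. The following is a minimal sketch on hypothetical toy data. The toy_crimes values are invented for illustration and do not reflect the real counts.

```r
library(dplyr)

# Hypothetical toy records; NA marks a missing crime_id
toy_crimes <- data.frame(
  crime_id = c("a1", NA, "a3", NA, "a5"),
  crime_type = c("Burglary", "Anti-social behaviour", "Burglary",
                 "Anti-social behaviour", "Drugs")
)

# Count rows and missing IDs per crime type, then compute the missing share
missing_id_by_type <- toy_crimes %>%
  group_by(crime_type) %>%
  summarise(
    n_rows = n(),
    n_missing_id = sum(is.na(crime_id)),
    .groups = "drop"
  ) %>%
  mutate(share_missing = n_missing_id / n_rows)

missing_id_by_type
```

If one crime type accounts for most of the missing IDs, dropping those rows removes that type from every later chart, which is relevant when reading the crime type distribution.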

#looking out for any duplicates in our data using the crime_id column since it is the primary identifier
combined_months %>%
  filter(!is.na(crime_id)) %>%
  summarise(duplicated_ids = sum(duplicated(crime_id)))

# A tibble: 1 × 1
  duplicated_ids
           <int>
1              2

#Two records with duplicate crime_id values were detected. 
#We will apply the distinct() function to retain only unique entries and eliminate duplicates in the dataset.

combined_months <- combined_months%>%
  distinct(crime_id, .keep_all = TRUE)

final_cleaned_data <- combined_months
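
Before removing duplicates with distinct(), it can help to look at the duplicated rows themselves, since a repeated crime_id may be an exact copy or may differ in other columns. The sketch below shows one way to do this on hypothetical toy data. The toy_crimes data frame is an assumption for illustration only.

```r
library(dplyr)

# Hypothetical toy records: "a2" is an exact duplicate,
# "a3" repeats the ID but differs in month
toy_crimes <- data.frame(
  crime_id = c("a1", "a2", "a2", "a3", "a3"),
  month = c("2020-04", "2020-05", "2020-05", "2020-06", "2020-07"),
  crime_type = c("Burglary", "Drugs", "Drugs", "Robbery", "Robbery")
)

# Keep every row whose crime_id occurs more than once
duplicate_rows <- toy_crimes %>%
  group_by(crime_id) %>%
  filter(n() > 1) %>%
  ungroup()

# Exact duplicates collapse when distinct() is applied to all columns;
# rows that share an ID but differ elsewhere remain separate
n_distinct_rows <- duplicate_rows %>%
  distinct() %>%
  nrow()

duplicate_rows
n_distinct_rows
```

When n_distinct_rows is larger than the number of repeated IDs, some duplicates are not exact copies, and distinct(crime_id, .keep_all = TRUE) keeps only the first row for each ID.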

Having standardized column structures, removed duplicates, addressed missing values, and retained only the necessary variables, the dataset is now clean, consistent, and ready for further analysis.

Exploratory Data Analysis (EDA)

This step aims to uncover patterns, trends, and anomalies within the data, and to develop an initial understanding of the variables and their relationships. In this report, we look into crime incidents in West Yorkshire and explore trends and relationships in the data.

Did Crime Rise Or Fall Over The 6 Months?

final_cleaned_data %>%
  count(month) %>%
  ggplot(aes(x = month, y = n)) +
  geom_line(group = 1, color = "#fdae61", linewidth = 1.2) +
  geom_point(color = "darkred", size = 2) +
  labs(
    title = "Total Monthly Reported Crimes in West Yorkshire",
    x = "Month",
    y = "Number of Crimes"
  ) +
  theme_minimal(base_size = 9)

Findings: There was a steady increase in reported crimes from April to August, followed by a noticeable decline in September.
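
To put numbers on a trend like this, the month-over-month change can be computed from the monthly counts. The sketch below uses hypothetical toy counts, so the values in toy_monthly are not the real figures. It assumes month is stored as a "YYYY-MM" string, which sorts chronologically.

```r
library(dplyr)

# Hypothetical monthly totals, for illustration only
toy_monthly <- data.frame(
  month = c("2020-04", "2020-05", "2020-06"),
  n = c(100, 110, 99)
)

# Absolute and percentage change relative to the previous month
monthly_change <- toy_monthly %>%
  arrange(month) %>%
  mutate(
    change = n - lag(n),
    pct_change = 100 * change / lag(n)
  )

monthly_change
```

On the real data, the input would be the result of final_cleaned_data %>% count(month). The first month has no predecessor, so its change is NA.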

Distribution of Reported Crime Types (April–September 2020)

final_cleaned_data %>%
  count(crime_type, sort = TRUE) %>%
  ggplot(aes(x = reorder(crime_type, n), y = n)) +
  geom_col(fill = "darkorange") +
    geom_text(aes(label = n), hjust = -0.1, size = 1) +
  coord_flip() +
  labs(title = "Crime Type Distribution",x = "Crime Type", y = "Count") +
  theme_minimal(base_size = 9)

Findings: Violence and sexual offences had the highest number of reported crimes, leading by a significant margin. This was followed by public order crimes. In contrast, exclusive crimes recorded the fewest cases.
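
The size of the lead can be expressed as a share of all records. A minimal sketch on hypothetical toy data follows. The counts in toy_crimes are invented and only illustrate the calculation.

```r
library(dplyr)

# Hypothetical toy records, for illustration only
toy_crimes <- data.frame(
  crime_type = c(
    rep("Violence and sexual offences", 5),
    rep("Public order", 3),
    rep("Robbery", 2)
  )
)

# Count per crime type and express each count as a percentage of the total
type_shares <- toy_crimes %>%
  count(crime_type, sort = TRUE) %>%
  mutate(share_pct = round(100 * n / sum(n), 1))

type_shares
```

Reporting the share alongside the raw count makes it easier to compare crime types across periods with different totals.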

Top 15 Locations by Number of Crimes

final_cleaned_data %>%
  count(location, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(x = reorder(location, n), y = n)) +
  geom_col(fill = "tomato") +
  geom_text(aes(label = n), hjust = -0.1, size = 1) +
  coord_flip() +
  labs(title = "Top 15 Locations by Number of Crimes",x = "Location",y = "Number of Crimes" ) +
  theme_minimal(base_size = 8)

Findings: The most common crime locations included public areas such as supermarkets, parking areas, petrol stations, and recreation spaces. Supermarkets were the leading location, recording over 2,600 incidents, followed by parking areas and petrol stations. This highlights the concentration of criminal activity in and around commercial and high-traffic public zones.

Spatial Overview of Reported Crimes (Apr–Sep 2020)

Because PDF output does not support interactive maps, the Leaflet map was rendered in HTML and a screenshot of the output is included after the code. The map visualizes 5,000 randomly sampled crime incidents across West Yorkshire. Each point is color-coded by reporting agency: blue for West Yorkshire Police and red for Unknown. Interactive popups provide additional context, including the month, crime type, reporting agency, and location. This spatial view helps identify crime clusters and spatial patterns, and allows a visual comparison between records with a known and an unknown reporting agency.

# Sample 5,000 records for performance (this can be adjusted as needed).
# The coordinate bounds are an approximation of the West Yorkshire boundary.
sampled_data <- final_cleaned_data %>%
  filter(
    !is.na(latitude), !is.na(longitude),
    latitude >= 53.5, latitude <= 54.0,
    longitude >= -1.9, longitude <= -1.2
  ) %>%
  slice_sample(n = 5000)

leaflet(sampled_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    color = ~ifelse(reported_by == "West Yorkshire Police", "blue", "red"),
    radius = 3,
    stroke = FALSE,
    fillOpacity = 0.6,
    popup = ~paste0(
      "<strong>Month:</strong> ", month, "<br>",
      "<strong>Crime Type:</strong> ", crime_type, "<br>",
      "<strong>Reported By:</strong> ", reported_by, "<br>",
      "<strong>Location:</strong> ", location
    )
  ) %>%
  addLegend(
    "bottomright",
    colors = c("blue", "red"),
    labels = c("West Yorkshire Police", "Unknown"),
    title = "Reported By",
    opacity = 0.8
  )

Figure: Sampled crime incidents (5,000 points) across West Yorkshire, colored by reporting authority.

Findings: The Leaflet map screenshot reveals clear crime hotspots in urban centers such as Leeds. Most incidents were reported by West Yorkshire Police (blue markers), while a smaller group of records with an unknown reporting agency (red markers), mainly from April, suggests inconsistencies in data collection or reporting. This is consistent with the reported_by column being absent from the April file.
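
The observation that unknown-agency records come mainly from one month can be checked with a per-month tabulation instead of reading it off the map. The sketch below uses hypothetical toy data. The toy_crimes values are invented, and "Unknown" is the placeholder assigned during cleaning.

```r
library(dplyr)

# Hypothetical toy records, for illustration only
toy_crimes <- data.frame(
  month = c("2020-04", "2020-04", "2020-04", "2020-05", "2020-05"),
  reported_by = c("Unknown", "Unknown", "West Yorkshire Police",
                  "West Yorkshire Police", "West Yorkshire Police")
)

# Number of records per month and how many have an unknown reporting agency
unknown_by_month <- toy_crimes %>%
  group_by(month) %>%
  summarise(
    n_records = n(),
    n_unknown = sum(reported_by == "Unknown"),
    .groups = "drop"
  )

unknown_by_month
```

Running the same summary on final_cleaned_data, and not only on the 5,000-point sample, avoids drawing conclusions from sampling noise.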

Map showing total crime count by LSOA

# Total crimes by LSOA
lsoa_summary <- final_cleaned_data %>%
  group_by(lsoa_code, lsoa_name) %>%
  summarise(
    crime_count = n(),
    lat = mean(latitude, na.rm = TRUE),
    lon = mean(longitude, na.rm = TRUE),
    .groups = "drop"
  )

leaflet(data = lsoa_summary) %>%
  addTiles() %>%
  addCircleMarkers(
    ~lon, ~lat,
    radius = ~sqrt(crime_count) / 2,
    color = "red",
    fillOpacity = 0.5,
    label = ~paste0(lsoa_name, ": ", crime_count, " crimes"),
    popup = ~paste0("<strong>", lsoa_name, "</strong><br/>", "Crimes: ", crime_count)
  ) %>%
  addLegend(position = "bottomright", colors = "red",
            labels = "Crime Density (scaled by marker size)", title = "Crime by LSOA")

Figure: Total reported crimes per LSOA (circle size indicates volume).

Findings: This map presents the total number of crimes aggregated by LSOA (Lower Super Output Area). Each red circle represents an LSOA. The marker radius is set to half the square root of the crime count, so marker area, rather than radius, grows in proportion to the number of reported crimes, which keeps the largest LSOAs from dominating the map.

High-crime LSOAs such as Leeds 111B, Calderdale 008E, and several in Bradford are visually prominent due to larger marker sizes. This allows for quick visual identification of crime hotspots across the region.

Each marker is placed at the average latitude and longitude of the crimes recorded in that LSOA, which approximates the area's location but is not necessarily its geographic center. Interactive popups and labels display the LSOA name and total crime count for added context.
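
The effect of the square-root rule on marker size can be verified with a few lines of arithmetic. The sketch below applies the same radius formula as the map, sqrt(crime_count) / 2, to hypothetical counts and shows that the circle area per crime stays constant.

```r
# Hypothetical crime counts, for illustration only
crime_counts <- c(4, 16, 64, 256)

# Same radius rule as the map: half the square root of the count
radius <- sqrt(crime_counts) / 2

# Area of each circle and area per recorded crime
area <- pi * radius^2
area_per_crime <- area / crime_counts

radius
area_per_crime
```

Each fourfold increase in the count doubles the radius, while area_per_crime is pi / 4 for every count, so the visual weight of a circle scales linearly with the number of crimes.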

Conclusion

This analysis provides a comprehensive overview of crime trends in West Yorkshire over the six-month period from April to September 2020. Some of the key insights include:

A steady rise in crime was observed from April to August, followed by a decline in September.
Violence and sexual offences consistently ranked as the most reported crime type across the region.
Certain locations like supermarkets, parking areas, and petrol stations emerged as crime hotspots.
Spatial visualization highlighted urban clusters of criminal activity, primarily reported by West Yorkshire Police.
Additionally, several data quality issues were identified and addressed through data cleaning: inconsistent column structure across the monthly files, missing values in critical fields (especially in April), and duplicate entries in crime_id.

Overall, the cleaned dataset enabled rich exploratory insights that can inform both local policy and resource allocation for crime prevention.

Future Improvements

If given more time or data, I would consider:

Comparing West Yorkshire to national crime trends
Analysing hourly and weekly patterns
Building predictive models using external variables (e.g., weather, holidays)
Comparing West Yorkshire with other regions