Crime Data Analysis In West Yorkshire (Apr–Sep 2020)

Author

Happiness Ndanu

Introduction

This report examines crime trends in West Yorkshire between April and September 2020. The objectives of this analysis are:

  • Understanding how crime type counts changed over the six-month period

  • Looking into any unusual features in the data such as missing data or duplicates

  • Suggesting any further improvements or analysis in the project

  • Assessing the accuracy of the report findings

Some of the key tools used include R packages such as tidyverse, leaflet, janitor and ggplot2 for data wrangling, spatial mapping, and visual analytics. Let’s dive in.

Several libraries are needed for the analysis, so we load them first.

Loading Libraries
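
The loading chunk itself is not shown in this rendering. A minimal sketch, assuming only the packages named in the introduction, might look like this:

```r
# Packages named in the introduction (assumed to be installed already)
library(tidyverse)  # attaches readr, dplyr, tidyr, ggplot2, forcats, among others
library(janitor)    # clean_names()
library(leaflet)    # interactive maps
```

Because ggplot2 is attached as part of tidyverse, it does not need a separate library() call.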

Next, we load the datasets for April through September using the readr package.

Code
april <- read_csv("2020-04-west-yorkshire-street.csv") %>% clean_names()
may   <- read_csv("2020-05-west-yorkshire-street.csv") %>% clean_names()
june  <- read_csv("2020-06-west-yorkshire-street.csv") %>% clean_names()
july  <- read_csv("2020-07-west-yorkshire-street.csv") %>% clean_names()
aug   <- read_csv("2020-08-west-yorkshire-street.csv") %>% clean_names()
sept  <- read_csv("2020-09-west-yorkshire-street.csv") %>% clean_names()

Data Cleaning

For compatibility and ease of analysis, we will assess whether the columns match across the different months.

Code
colnames(april)
 [1] "x1"                    "crime_id"              "month"                
 [4] "falls_within"          "longitude"             "latitude"             
 [7] "location"              "lsoa_code"             "lsoa_name"            
[10] "crime_type"            "last_outcome_category" "context"              
Code
colnames(may)
 [1] "x1"                    "crime_id"              "month"                
 [4] "reported_by"           "falls_within"          "longitude"            
 [7] "latitude"              "location"              "lsoa_code"            
[10] "lsoa_name"             "crime_type"            "last_outcome_category"
[13] "context"              
Code
colnames(june)
 [1] "x1"                    "crime_id"              "month"                
 [4] "reported_by"           "falls_within"          "longitude"            
 [7] "latitude"              "location"              "lsoa_code"            
[10] "lsoa_name"             "crime_type"            "last_outcome_category"
[13] "context"              
Code
colnames(july)
 [1] "x1"                    "crime_id"              "month"                
 [4] "reported_by"           "longitude"             "latitude"             
 [7] "location"              "lsoa_code"             "lsoa_name"            
[10] "crime_type"            "last_outcome_category" "context"              
Code
colnames(aug)
 [1] "x1"                    "crime_id"              "month"                
 [4] "reported_by"           "falls_within"          "longitude"            
 [7] "latitude"              "location"              "lsoa_code"            
[10] "lsoa_name"             "crime_type"            "last_outcome_category"
[13] "context"              
Code
colnames(sept)
 [1] "x1"                    "crime_id"              "month"                
 [4] "reported_by"           "falls_within"          "longitude"            
 [7] "latitude"              "location"              "lsoa_code"            
[10] "lsoa_name"             "crime_type"            "last_outcome_category"
[13] "context"              

The April and July datasets are each missing one column: April lacks reported_by and July lacks falls_within, both of which are present in the other four datasets (May, June, August and September). This is the first data quality issue, since all six months must be merged into a single dataset for comprehensive analysis.
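
The comparison above was done by reading the printed column names. As a minimal sketch, the same check can be automated with setdiff(). The toy data frames below are stand-ins (assumptions) for the monthly files and carry only a few columns each.

```r
# Toy stand-ins for the monthly data frames (illustrative only)
months <- list(
  april = data.frame(crime_id = "a1", falls_within = "X", crime_type = "Burglary"),
  may   = data.frame(crime_id = "b1", reported_by = "X", falls_within = "X",
                     crime_type = "Robbery"),
  july  = data.frame(crime_id = "c1", reported_by = "X", crime_type = "Drugs")
)

# The full set of column names seen in any month
all_cols <- Reduce(union, lapply(months, colnames))

# For each month, list the columns it lacks relative to the full set
missing_by_month <- lapply(months, function(df) setdiff(all_cols, colnames(df)))
print(missing_by_month)
```

An empty result for a month means its columns already match the full set.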

To address this, the following steps are taken:

  1. Listing all necessary columns needed for the analysis
  2. Creating a function that adds the missing columns to April and July, fills them with missing values (NA), and reorders the columns to match the standard structure
Code
# List all the columns needed for this analysis
all_column_names<- c("x1", "crime_id", "month", "reported_by", "falls_within", "longitude",
               "latitude", "location", "lsoa_code", "lsoa_name", "crime_type",
               "last_outcome_category", "context")
# Next, create a function that adds any missing columns to a dataset and fills them with missing values (NA)
add_missing_cols <- function(df, all_column_names) {
  missing <- setdiff(all_column_names, colnames(df))
  for (col in missing) {
    df[[col]] <- NA
  }
  df <- df[, all_column_names]
  return(df)
}
#The function adds any missing columns with NA values in the respective months (April and July), then reorders the columns to match a standard structure.


april <- add_missing_cols(april, all_column_names)
may <- add_missing_cols(may, all_column_names)
june <- add_missing_cols(june, all_column_names)
july <- add_missing_cols(july, all_column_names)
aug <- add_missing_cols(aug, all_column_names)
sept <- add_missing_cols(sept, all_column_names)
# reported_by has been added to April and falls_within to July; all six datasets now share the same column order.

To facilitate the analysis, we will merge the six months of data into one complete dataset, relying on the standardized columns to avoid inconsistencies.

Code
combined_months <- bind_rows(april, may, june, july, aug, sept)
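
As a side note, dplyr's bind_rows() matches columns by name and fills absent ones with NA, and its .id argument can record which table each row came from. The sketch below uses two toy tables (assumptions, not the real monthly files) to show both behaviors.

```r
library(dplyr)

# Toy monthly tables (illustrative only); july lacks falls_within
april <- tibble(crime_id = c("a1", "a2"), falls_within = "Force A")
july  <- tibble(crime_id = "j1")

# Naming the inputs and passing .id records the source table of each row;
# columns missing from one input are filled with NA
combined <- bind_rows(list(april = april, july = july), .id = "source_file")
print(combined)
```

Keeping a source column like this makes it easier to trace a gap back to a specific monthly file later on.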

We will now use the glimpse() function to inspect the data types of all columns and confirm they are correctly formatted for analysis.

Code
glimpse(combined_months)
Rows: 158,898
Columns: 13
$ x1                    <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ crime_id              <chr> NA, "5ea1997471c9de64fcfcf1145cadfff71ba37f21668…
$ month                 <chr> "2020-04", "2020-04", "2020-04", "2020-04", "202…
$ reported_by           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ falls_within          <chr> "West Yorkshire Police", "West Yorkshire Police"…
$ longitude             <dbl> -1.550626, -1.670108, -1.862742, -1.879031, -1.8…
$ latitude              <dbl> 53.59740, 53.55363, 53.94007, 53.94381, 53.92494…
$ location              <chr> "On or near Swithen Hill", "On or near Huddersfi…
$ lsoa_code             <chr> "E01007359", "E01007426", "E01010646", "E0101064…
$ lsoa_name             <chr> "Barnsley 005C", "Barnsley 027D", "Bradford 001A…
$ crime_type            <chr> "Anti-social behaviour", "Burglary", "Anti-socia…
$ last_outcome_category <chr> NA, "Investigation complete; no suspect identifi…
$ context               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

The columns have the right data types. We will retain the month column as a character type to maintain the “YYYY-MM” format. This will make it easier to group and summarize the data by month without adding the day-level detail.

Next, we keep only the columns needed for the analysis. The context column is dropped because it holds no meaningful data (only blanks).

Code
combined_months <- combined_months %>% 
  select(-context)

After checking the data types and selecting the columns for our analysis, we will identify any missing values in the data.

Code
colSums(is.na(combined_months))
                   x1              crime_id                 month 
                    0                 29689                  2000 
          reported_by          falls_within             longitude 
                23785                 31278                  5487 
             latitude              location             lsoa_code 
                 5487                  2000                  5488 
            lsoa_name            crime_type last_outcome_category 
                 5488                  2000                 31334 

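All columns except x1 contain missing values. To address this, we drop rows with missing values in the critical columns: crime_id, month, longitude, latitude, lsoa_code, and crime_type. This is done with the drop_na() function from the tidyr package. Note that this removes every record without a crime_id; in the glimpse() output above, the first row is an anti-social behaviour record with no crime_id, so records of that kind may be excluded from the rest of the analysis.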

For non-critical columns, missing values will be replaced with "Unknown" using the mutate() and replace_na() functions. This approach helps preserve valuable records while maintaining data consistency and interpretability.

Code
combined_months <- combined_months %>%
  drop_na(
    crime_id,
    month,
    longitude,
    latitude,
    lsoa_code,
    crime_type
  ) %>%
  mutate(
    reported_by = replace_na(reported_by, "Unknown"),
    # Fill falls_within from its own values (not from reported_by)
    falls_within = replace_na(falls_within, "Unknown"),
    location = replace_na(location, "Unknown"),
    lsoa_name = replace_na(lsoa_name, "Unknown"),
    last_outcome_category = replace_na(last_outcome_category, "Unknown")
  )
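
A more compact way to express the same replacement is dplyr's across(), which applies one function to several columns. The following is a minimal sketch on toy data (the values are made up for illustration), not a drop-in replacement for the chunk above.

```r
library(dplyr)
library(tidyr)

# Toy data with gaps in two text columns (illustrative only)
df <- tibble(
  crime_id  = c("a1", "a2", "a3"),
  location  = c("On or near High Street", NA, "On or near Park Road"),
  lsoa_name = c(NA, "Leeds 001A", NA)
)

# Replace NA with "Unknown" in several character columns in one step
cleaned <- df %>%
  mutate(across(c(location, lsoa_name), ~ replace_na(.x, "Unknown")))
print(cleaned)
```

With across(), each column is filled from its own values, which avoids naming the wrong source column when the calls are written out one by one.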

Let us now verify that all missing values have been removed or replaced.

Code
sum(is.na(combined_months))
[1] 0

Another form of data quality issue is the presence of duplicate records. To identify these, we will inspect the crime_id column, which serves as the dataset’s primary identifier, to check for any repeated entries.

Code
combined_months %>%
  filter(!is.na(crime_id)) %>%
  summarise(duplicated_ids = sum(duplicated(crime_id)))

One record with duplicate crime_id values was detected. We will apply the distinct() function to retain only unique entries and eliminate duplicates in the dataset.

Code
combined_months <- combined_months %>%
  distinct(crime_id, .keep_all = TRUE)
Code
final_cleaned_data <- combined_months
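
Before dropping duplicates it can help to look at the repeated rows themselves, since distinct() keeps only the first row for each crime_id. A minimal sketch on toy data (made-up values) might be:

```r
library(dplyr)

# Toy data with one repeated crime_id (illustrative only)
df <- tibble(
  crime_id   = c("a1", "a2", "a2", "a3"),
  month      = c("2020-04", "2020-05", "2020-06", "2020-06"),
  crime_type = c("Burglary", "Robbery", "Robbery", "Drugs")
)

# Show every row whose crime_id occurs more than once
dupes <- df %>%
  group_by(crime_id) %>%
  filter(n() > 1) %>%
  ungroup()
print(dupes)

# distinct() keeps the first row encountered for each crime_id
deduped <- distinct(df, crime_id, .keep_all = TRUE)
print(deduped)
```

In this toy case the two rows for the repeated ID differ in month, which is the kind of detail worth checking before deciding which copy to keep.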

Having standardized column structures, removed duplicates, addressed missing values, formatted key fields, and retained only the necessary variables, the dataset is now clean, consistent, and ready for further analysis.

Exploratory Data Analysis (EDA)

This step aims to uncover patterns, trends, and anomalies within the data, and to develop an initial understanding of the variables and their relationships.

Did Crime Rise Or Fall Over The 6 Months?

Code
final_cleaned_data %>%
  count(month) %>%
  ggplot(aes(x = month, y = n)) +
  geom_line(group = 1, color = "#fdae61", linewidth = 1.2) +
  geom_point(color = "darkred", size = 2) +
  labs(
    title = "Total Monthly Reported Crimes in West Yorkshire (Apr–Sept 2020)",
    x = "Month",
    y = "Number of Crimes"
  ) +
  theme_minimal()

Findings: There was a steady increase in reported crimes from April to August, followed by a noticeable decline in September.
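
To put numbers on a trend like this, the month-over-month change can be computed with dplyr's lag(). The counts below are made up for illustration and are not the report's figures.

```r
library(dplyr)

# Made-up monthly counts for illustration; not the report's figures
monthly <- tibble(
  month = c("2020-04", "2020-05", "2020-06"),
  n     = c(100, 120, 150)
)

monthly_change <- monthly %>%
  arrange(month) %>%
  mutate(
    change     = n - lag(n),                      # absolute change vs previous month
    pct_change = round(100 * change / lag(n), 1)  # percentage change vs previous month
  )
print(monthly_change)
```

The first month has no predecessor, so its change values are NA.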

Distribution of Reported Crime Types (April–September 2020)

Code
final_cleaned_data %>%
  count(crime_type, sort = TRUE) %>%
  ggplot(aes(x = reorder(crime_type, n), y = n)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(
    title = "Crime Type Distribution",
    x = "Crime Type",
    y = "Count"
  ) +
  theme_minimal()

Code
final_cleaned_data %>%
  count(month, crime_type) %>%
  ggplot(aes(x = month, y = fct_rev(fct_infreq(crime_type)), fill = n)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c() +
  labs(
    title = "Heatmap: Crime Type Trends Over Time",
    x = "Month",
    y = "Crime Type",
    fill = "Count"
  ) +
  theme_minimal()

Findings: Violence and sexual offences had the highest number of reported crimes, leading by a significant margin. This was followed by public order crimes. In contrast, exclusive crimes recorded the fewest cases.

Top 15 Locations by Number of Crimes

Code
final_cleaned_data %>%
  count(location, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(x = reorder(location, n), y = n)) +
  geom_col(fill = "tomato") +
  geom_text(aes(label = n), hjust = -0.1, size = 3) +
  coord_flip() +
  labs(
    title = "Top 15 Locations by Number of Crimes",
    x = "Location",
    y = "Number of Crimes"
  ) +
  theme_minimal()

The most common crime locations included public areas such as supermarkets, parking areas, petrol stations, and recreation spaces. Supermarkets were the leading location, recording over 2,600 incidents, followed by parking areas and petrol stations. This highlights the concentration of criminal activity in and around commercial and high-traffic public zones.

Count of Crimes by Reporting Authority

Code
reported_by_counts <- final_cleaned_data %>%
  count(reported_by, sort = TRUE)
Code
ggplot(reported_by_counts, aes(x = reorder(reported_by, n), y = n)) +
  geom_col(fill = "#2c7fb8") +
  coord_flip() +
  labs(
    title = "Count of Crimes by Reporting Authority (April-September)",
    x = "Reporting Authority",
    y = "Number of Crimes"
  ) +
  theme_minimal()

The majority of crimes (85.5%) were reported by West Yorkshire Police, while the remaining 14.5% were labeled as Unknown, likely due to missing or unrecorded values in the original dataset. These were handled during the data cleaning process to maintain dataset consistency.

Reported By vs Falls Within: Consistency Check

Code
final_cleaned_data %>%
  count(reported_by, falls_within, sort = TRUE) %>% print()
# A tibble: 2 × 3
  reported_by           falls_within               n
  <chr>                 <chr>                  <int>
1 West Yorkshire Police West Yorkshire Police 106354
2 Unknown               Unknown                18023

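In this output the reported_by and falls_within columns agree on every row. The result should be read with caution, though: the column check earlier showed that April lacked reported_by and July lacked falls_within, so the two raw columns cannot match in every month. Identical values on every row are what would appear if falls_within had been filled from reported_by during cleaning, so the check is only informative when each column is imputed from its own values. It is worth re-running on independently cleaned columns before drawing conclusions about jurisdiction.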
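
The 85.5% and 14.5% shares quoted for the reporting authority follow directly from the reported_by counts in the table above. A short check of that arithmetic:

```r
# reported_by counts taken from the tibble printed above
counts <- c("West Yorkshire Police" = 106354, "Unknown" = 18023)

# Share of each label as a percentage of all records, rounded to one decimal
shares <- round(100 * counts / sum(counts), 1)
print(shares)
```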

Data Quality Gaps:

  • The reported_by field was likely missing or incorrectly recorded during extraction or collection

  • This may have occurred in a specific month or batch. Let us explore this.

Code
final_cleaned_data %>%
  filter(reported_by == "Unknown") %>%
  count(month, sort = TRUE) %>% print()
# A tibble: 1 × 2
  month       n
  <chr>   <int>
1 2020-04 18023

All 18,023 records labeled as Unknown in reported_by were recorded in April 2020. This matches the column check at the start of the cleaning step: the April file had no reported_by column, so the column was added as NA and later relabeled as Unknown. The gap is therefore specific to the April extract rather than spread across individual records. We flag the anomaly for further investigation and confirmation with the data entry or records management team.
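
This kind of gap is easier to spot before imputation by tabulating the share of missing values per month. The sketch below uses toy data (assumed values, with "Force A" as a placeholder label) arranged to mimic the pattern described in this report.

```r
library(dplyr)

# Toy combined data before imputation (illustrative only):
# April rows lack reported_by, the July row lacks falls_within
raw <- tibble(
  month        = c("2020-04", "2020-04", "2020-05", "2020-07"),
  reported_by  = c(NA, NA, "Force A", "Force A"),
  falls_within = c("Force A", "Force A", "Force A", NA)
)

# Share of missing values per month for each column
na_by_month <- raw %>%
  group_by(month) %>%
  summarise(
    across(c(reported_by, falls_within), ~ mean(is.na(.x))),
    .groups = "drop"
  )
print(na_by_month)
```

A share of 1 for a single month and 0 elsewhere points to a column that was absent from that month's file rather than to scattered data-entry gaps.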

Top 10 Most Affected LSOAs by Crime

Code
final_cleaned_data %>%
  count(lsoa_name, sort = TRUE) %>%
  slice_max(n, n=10) %>%
  ggplot(aes(x = reorder(lsoa_name, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Most Affected LSOAs by Crime",
    x = "LSOA Name", y = "Number of Crimes"
  ) +
  theme_minimal()

Crime Type Distribution in Top 10 LSOAs

Code
top_lsoas <- final_cleaned_data %>%
  count(lsoa_name, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  pull(lsoa_name)

top_crime_types <- final_cleaned_data %>%
  count(crime_type, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  pull(crime_type)


final_cleaned_data %>%
  filter(lsoa_name %in% top_lsoas, crime_type %in% top_crime_types) %>%
  count(lsoa_name, crime_type) %>%
  ggplot(aes(x = crime_type, y = lsoa_name, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white", size = 3) +
  scale_fill_viridis_c() +
  scale_x_discrete(limits = top_crime_types) +
  theme_minimal() +
  labs(
    title = "Crime Type Distribution in Top 10 LSOAs",
    x = "Crime Type", y = "LSOA Name", fill = "Count"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The heatmap illustrates how crime types are distributed across the top 10 LSOAs with the highest overall crime counts. Leeds 111B, Calderdale 008E, and Bradford 039G recorded the highest number of crimes. The visual clearly shows that Violence and sexual offences, along with Public order and Criminal damage, are among the most prevalent across most of these areas. The intensity of the color represents the volume of crimes, helping to identify hotspots for specific offences.

Spatial Overview of Reported Crimes (Apr–Sep 2020)

Since the dataset contains a large number of records, a random sample of 5,000 crime incidents with valid geographic coordinates was selected for visualization using an interactive leaflet map.

Each point is color-coded by reporting agency: blue for West Yorkshire Police and red for Unknown. Interactive popups provide additional context, including the month, crime type, reporting agency, and location.

This spatial visualization facilitates the identification of crime clusters and spatial patterns, while also enabling a visual comparison between records reported by known and unknown sources.

Code
# Sample 5,000 records for performance (this can be adjusted as needed)
sampled_data <- final_cleaned_data %>%
  filter(
    !is.na(latitude), !is.na(longitude),
    # Approximate bounding box for West Yorkshire
    latitude >= 53.5, latitude <= 54.0,
    longitude >= -1.9, longitude <= -1.2
  ) %>%
  slice_sample(n = 5000)  # current dplyr replacement for sample_n()
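
Because the sample is random, the map will differ from run to run unless a seed is fixed. A minimal sketch of reproducible sampling on a toy table (the seed value 2020 is an arbitrary choice):

```r
library(dplyr)

# Toy table standing in for the cleaned data (illustrative only)
df <- tibble(id = 1:100)

# Fixing the seed makes the random sample repeatable across runs
set.seed(2020)
sample_a <- slice_sample(df, n = 10)

set.seed(2020)
sample_b <- slice_sample(df, n = 10)

# Both draws contain the same rows in the same order
print(identical(sample_a, sample_b))
```
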
Code
leaflet(sampled_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    color = ~ifelse(reported_by == "West Yorkshire Police", "blue", "red"),
    radius = 3,
    stroke = FALSE,
    fillOpacity = 0.6,
    popup = ~paste0(
      "<strong>Month:</strong> ", month, "<br>",
      "<strong>Crime Type:</strong> ", crime_type, "<br>",
      "<strong>Reported By:</strong> ", reported_by, "<br>",
      "<strong>Location:</strong> ", location
    )
  ) %>%
  addLegend(
    "bottomright",
    colors = c("blue", "red"),
    labels = c("West Yorkshire Police", "Unknown"),
    title = "Reported By",
    opacity = 0.8
  )

The interactive leaflet map reveals that crime incidents are concentrated around urban centers, particularly in areas like Leeds, Bradford, and Wakefield, suggesting these are hotspots of criminal activity.

The dominance of blue markers reflects that most records name West Yorkshire Police as the reporting authority. The red markers, though fewer, correspond to the April records whose reporting authority was labeled Unknown, which is the reporting anomaly flagged earlier in the analysis.

Map showing total crime count by LSOA

Code
# Total crimes by LSOA, with the mean incident position for each area
lsoa_summary <- final_cleaned_data %>%
  group_by(lsoa_code, lsoa_name) %>%
  summarise(
    crime_count = n(),
    lat = mean(latitude, na.rm = TRUE),
    lon = mean(longitude, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble and silence the grouping message
  )
Code
# Create a leaflet map with one marker per LSOA
leaflet(data = lsoa_summary) %>%
  addTiles() %>%
  addCircleMarkers(
    ~lon, ~lat,
    radius = ~sqrt(crime_count) / 2,  # Scale radius for visibility
    color = "red",
    fillOpacity = 0.5,
    label = ~paste0(lsoa_name, ": ", crime_count, " crimes"),
    popup = ~paste0("<strong>", lsoa_name, "</strong><br/>",
                    "Crimes: ", crime_count)
  ) %>%
  addLegend(
    position = "bottomright",
    colors = "red",
    labels = "Crime Density (scaled by marker size)",
    title = "Crime by LSOA"
  )

This map presents the total number of crimes aggregated by LSOA (Lower Super Output Area). Each red circle represents an LSOA. The marker radius is proportional to the square root of the crime count, so the marker area grows roughly in line with the number of reported crimes and the largest areas do not swamp the map.
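
To see what square-root scaling does in practice, the radius formula from the map code can be applied to a few example counts. The counts below are illustrative, not taken from the data.

```r
# Marker radius as used in the map code: sqrt(crime_count) / 2
crime_count <- c(100, 400, 1600)  # illustrative counts
radius <- sqrt(crime_count) / 2
area <- pi * radius^2

print(radius)          # 5, 10, 20
print(area / area[1])  # 1, 4, 16
```

Quadrupling the count doubles the radius and quadruples the area, so the visual weight of each marker tracks the count itself.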

High-crime LSOAs such as Leeds 111B, Calderdale 008E, and several in Bradford are visually prominent due to larger marker sizes. This allows for quick visual identification of crime hotspots across the region.

Each marker is placed at the mean latitude and longitude of the incidents recorded in that LSOA, which approximates the center of recorded activity rather than the geometric centroid of the area. Interactive popups and labels display the LSOA name and total crime count for added context.

Conclusion

  • Trends: Reported crime rose steadily from April to August and then declined in September. Violence and sexual offences was the most frequently reported crime type over the period, followed by public order offences.
  • Data Issues: The April and July files each lacked a column (reported_by and falls_within, respectively), and several columns contained missing values. These were addressed through cleaning and imputation.
  • Next Steps: With more time or data, further analysis could include:
    • Hourly/weekly patterns
    • Outcome-based clustering
    • Predictive modelling using external variables (e.g., weather, holidays)
    • Comparative studies with other regions

Future Improvements

If given more time or data, I would consider:

  • Mapping crimes per capita using LSOA population data
  • Linking outcomes to socioeconomic indicators
  • Comparing West Yorkshire to national crime trends
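
As a rough sketch of the per-capita idea, crime counts could be joined to a population table and converted to a rate per 1,000 residents. Both tables below are hypothetical; the LSOA codes, counts, and the residents column are placeholders rather than real figures.

```r
library(dplyr)

# Hypothetical crime counts and populations; real figures would come from
# the cleaned data and an external LSOA population table
crimes     <- tibble(lsoa_code = c("L1", "L2"), crime_count = c(50, 30))
population <- tibble(lsoa_code = c("L1", "L2"), residents = c(2000, 1000))

# Join on the LSOA code and compute crimes per 1,000 residents
rates <- crimes %>%
  left_join(population, by = "lsoa_code") %>%
  mutate(crimes_per_1000 = 1000 * crime_count / residents)
print(rates)
```

In this toy case the area with fewer crimes has the higher rate, which is the kind of reversal a per-capita view is meant to reveal.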