Los Angeles Crime Patterns: A Statistical Analysis (2020–2024)

Author

Zyam Jadoon Khawaja

Published

May 15, 2026

Introduction

Crime stands out as a top public safety issue in major American cities. Los Angeles is one of the largest and most diverse cities in America, and it has long kept one of the country’s best publicly available crime databases. This project analyzes the Los Angeles crime dataset (2020–2024), consisting of over 1 million crime incident records reported by the Los Angeles Police Department (LAPD).

The dataset has extensive information on crime classification, location, victim demographics (age, sex, descent), time of occurrence, type of premises, and weapon used. The data covers a five-year time frame, from January 2020 into 2024 (the 2024 records are incomplete), allowing for an analysis of trends in crime before, during and after the COVID-19 pandemic.

Research Questions:

Which LAPD areas report the highest volume of crime incidents?
How have crime counts changed year-over-year from 2020 to 2024?
Can we predict victim age using crime severity (Part 1 vs. Part 2), hour of occurrence, and whether a weapon was involved?

I picked this dataset because crime has a big impact on our daily lives and on the decisions our leaders make. By looking at where, when, and to whom crimes happen, we can help the people in charge of cities, police, and community groups use their resources in a smarter way. Since I live close to a big city, I think this information is really important, both for me and for my community.

The way the data was gathered is pretty straightforward. According to the Los Angeles Police Department’s (LAPD) documentation on their Open Data Portal, crime incident records are taken from original paper reports that LAPD officers fill out. Each of these reports is then entered into the LAPD’s Records Management System, or RMS for short. After that, the data gets exported to the open data portal on a weekly basis. Now, since the data comes from paper reports, there’s a chance that some of the information might not be entirely accurate. One thing to note is that the dataset doesn’t come with a formal ReadMe file, but you can find a data dictionary on the portal that gives a detailed description of each variable. This can be really helpful for understanding what the data is actually telling us.

Data Source: City of Los Angeles Open Data Portal — LAPD Crime Data 2020 to Present. Available at: https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8

Background Research

Crime in cities has been looked at a lot in studies about crime and public health. Some researchers, like Blumstein and Wallman, found out that when the economy is bad and there are problems in society, there tends to be more property crime. On the other hand, Chalfin and McCrary discovered that having more police on the streets can really cut down on violent crime. The COVID-19 pandemic was like a big test: some studies, such as the one by Campedelli and others, showed that when people were told to stay home, crime on the streets went down, but there was more domestic violence, and we might see this pattern in the data from 2020. This is pretty interesting because it shows how big events can change the way crime happens in cities. By looking at this data, we can learn more about what affects crime and how to make our cities safer.

Los Angeles has been looked at closely for where crimes happen in the city. A study in the Journal of Quantitative Criminology found that crimes in LA are mostly happening in a few small areas, or “hot spots”. This is similar to a common idea in criminology, which is that 80% of crimes happen in 20% of places.

References:

Blumstein, A., & Wallman, J. (2006). The Crime Drop in America. Cambridge University Press.
Chalfin, A., & McCrary, J. (2018). Are U.S. cities underpoliced? Review of Economics and Statistics, 100(1), 167–186.
Campedelli, G. M., Aziani, A., & Favarin, S. (2020). Exploring the effect of COVID-19 containment policies on crime. American Journal of Criminal Justice, 46, 704–727.
Los Angeles Open Data Portal. (2024). Crime Data from 2020 to Present. https://data.lacity.org

Data Loading and Cleaning

Code

setwd("C:/Users/zyamj/Downloads")

# Load required libraries
library(tidyverse)
library(lubridate)
library(plotly)
library(knitr)
library(scales)
library(broom)

Code

# Load dataset using readr::read_csv() as required (NOT base R's read.csv())
crime <- readr::read_csv("Crime_Data_from_2020_to_2024.csv")
glimpse(crime)

Rows: 1,004,894
Columns: 28
$ DR_NO            <dbl> 211507896, 201516622, 240913563, 210704711, 201418201…
$ `Date Rptd`      <chr> "04/11/2021 12:00:00 AM", "10/21/2020 12:00:00 AM", "…
$ `DATE OCC`       <chr> "11/07/2020 12:00:00 AM", "10/18/2020 12:00:00 AM", "…
$ `TIME OCC`       <chr> "0845", "1845", "1240", "1310", "1830", "1210", "1350…
$ AREA             <chr> "15", "15", "09", "07", "14", "04", "03", "11", "17",…
$ `AREA NAME`      <chr> "N Hollywood", "N Hollywood", "Van Nuys", "Wilshire",…
$ `Rpt Dist No`    <chr> "1502", "1521", "0933", "0782", "1454", "0429", "0396…
$ `Part 1-2`       <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2,…
$ `Crm Cd`         <dbl> 354, 230, 354, 331, 420, 354, 354, 812, 354, 354, 812…
$ `Crm Cd Desc`    <chr> "THEFT OF IDENTITY", "ASSAULT WITH DEADLY WEAPON, AGG…
$ Mocodes          <chr> "0377", "0416 0334 2004 1822 1414 0305 0319 0400", "0…
$ `Vict Age`       <dbl> 31, 32, 30, 47, 63, 35, 21, 14, 43, 57, 13, 34, 0, 0,…
$ `Vict Sex`       <chr> "M", "M", "M", "F", "M", "M", "F", "F", "M", "M", "M"…
$ `Vict Descent`   <chr> "H", "H", "W", "A", "H", "B", "B", "H", "W", "W", "H"…
$ `Premis Cd`      <dbl> 501, 102, 501, 101, 103, 502, 501, 121, 501, 501, 501…
$ `Premis Desc`    <chr> "SINGLE FAMILY DWELLING", "SIDEWALK", "SINGLE FAMILY …
$ `Weapon Used Cd` <dbl> NA, 200, NA, NA, NA, NA, NA, 500, NA, NA, 400, NA, NA…
$ `Weapon Desc`    <chr> NA, "KNIFE WITH BLADE 6INCHES OR LESS", NA, NA, NA, N…
$ Status           <chr> "IC", "IC", "IC", "IC", "IC", "IC", "IC", "AO", "IC",…
$ `Status Desc`    <chr> "Invest Cont", "Invest Cont", "Invest Cont", "Invest …
$ `Crm Cd 1`       <dbl> 354, 230, 354, 331, 420, 354, 354, 812, 354, 354, 812…
$ `Crm Cd 2`       <dbl> NA, NA, NA, NA, NA, NA, NA, 860, NA, NA, 860, NA, NA,…
$ `Crm Cd 3`       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Crm Cd 4`       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ LOCATION         <chr> "7800    BEEMAN                       AV", "ATOLL    …
$ `Cross Street`   <chr> NA, "N  GAULT", NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ LAT              <dbl> 34.2124, 34.1993, 34.1847, 34.0339, 33.9813, 34.0830,…
$ LON              <dbl> -118.4092, -118.4203, -118.4509, -118.3747, -118.4350…

Cleaning Steps

Code

# Step 1: Check dimensions and missing values
cat("Dimensions:", nrow(crime), "rows x", ncol(crime), "columns\n\n")

Dimensions: 1004894 rows x 28 columns

Code

cat("Missing values per column:\n")

Missing values per column:

Code

print(colSums(is.na(crime)))

         DR_NO      Date Rptd       DATE OCC       TIME OCC           AREA 
             0              0              0              0              0 
     AREA NAME    Rpt Dist No       Part 1-2         Crm Cd    Crm Cd Desc 
             0              0              0              0              0 
       Mocodes       Vict Age       Vict Sex   Vict Descent      Premis Cd 
        151598              0         144631         144643             16 
   Premis Desc Weapon Used Cd    Weapon Desc         Status    Status Desc 
           588         677678         677678              1              0 
      Crm Cd 1       Crm Cd 2       Crm Cd 3       Crm Cd 4       LOCATION 
            11         935740        1002580        1004894              0 
  Cross Street            LAT            LON 
        850666              0              0

Code

# Step 2: Rename columns to snake_case for easier use in R
# Original names contain spaces and hyphens which require backticks
crime_clean <- crime %>%
  rename(
    dr_no         = `DR_NO`,
    date_rptd     = `Date Rptd`,
    date_occ      = `DATE OCC`,
    time_occ      = `TIME OCC`,
    area          = `AREA`,
    area_name     = `AREA NAME`,
    part          = `Part 1-2`,
    crm_cd        = `Crm Cd`,
    crm_desc      = `Crm Cd Desc`,
    vict_age      = `Vict Age`,
    vict_sex      = `Vict Sex`,
    vict_descent  = `Vict Descent`,
    premis_cd     = `Premis Cd`,
    premis_desc   = `Premis Desc`,
    weapon_cd     = `Weapon Used Cd`,
    weapon_desc   = `Weapon Desc`,
    status        = `Status`,
    status_desc   = `Status Desc`,
    location      = `LOCATION`,
    lat           = `LAT`,
    lon           = `LON`
  )

Code

# Step 3: Parse date columns using lubridate (mdy_hms format)
crime_clean <- crime_clean %>%
  mutate(
    date_occ  = mdy_hms(date_occ),
    date_rptd = mdy_hms(date_rptd),
    year      = year(date_occ),
    month     = month(date_occ, label = TRUE),
    hour      = as.integer(as.character(time_occ)) %/% 100
  )

Code

# Step 4: Recode categorical variables
# Part 1 = serious/violent felonies; Part 2 = less serious offenses
crime_clean <- crime_clean %>%
  mutate(
    part_label  = if_else(part == 1, "Part 1 (Serious)", "Part 2 (Less Serious)"),
    weapon_flag = if_else(!is.na(weapon_cd), "Weapon Involved", "No Weapon"),
    vict_sex    = case_when(
      vict_sex == "M" ~ "Male",
      vict_sex == "F" ~ "Female",
      vict_sex == "X" ~ "Unknown",
      TRUE ~ NA_character_
    )
  )

Code

# Step 5: Filter out invalid victim ages
# Ages <= 0 or > 100 are data entry errors (raw data contains ages as low as -4)
# NOTE: NOT using na.omit() or drop_na() per project instructions
crime_clean <- crime_clean %>%
  filter(vict_age > 0, vict_age <= 100)

cat("Rows after age filter:", nrow(crime_clean), "\n")

Rows after age filter: 735578

Code

# Step 6: Filter to complete years only (2020-2023) for year-over-year analysis
# 2024 data is partial (only through mid-year), which would distort trend plots
crime_years <- crime_clean %>%
  filter(year %in% 2020:2023)

cat("Rows for 2020-2023 analysis:", nrow(crime_years), "\n")

Rows for 2020-2023 analysis: 658739

Code

# Step 7: Preview cleaned dataset
crime_clean %>%
  select(date_occ, year, area_name, crm_desc, vict_age, 
         vict_sex, part_label, weapon_flag) %>%
  slice_head(n = 10) %>%
  kable(caption = "First 10 rows of cleaned crime dataset")

First 10 rows of cleaned crime dataset
date_occ	year	area_name	crm_desc	vict_age	vict_sex	part_label	weapon_flag
2020-11-07	2020	N Hollywood	THEFT OF IDENTITY	31	Male	Part 2 (Less Serious)	No Weapon
2020-10-18	2020	N Hollywood	ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT	32	Male	Part 1 (Serious)	Weapon Involved
2020-10-30	2020	Van Nuys	THEFT OF IDENTITY	30	Male	Part 2 (Less Serious)	No Weapon
2020-12-24	2020	Wilshire	THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)	47	Female	Part 1 (Serious)	No Weapon
2020-09-29	2020	Pacific	THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)	63	Male	Part 1 (Serious)	No Weapon
2020-11-11	2020	Hollenbeck	THEFT OF IDENTITY	35	Male	Part 2 (Less Serious)	No Weapon
2020-04-16	2020	Southwest	THEFT OF IDENTITY	21	Female	Part 2 (Less Serious)	No Weapon
2020-07-07	2020	Northeast	CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 YRS OLDER)	14	Female	Part 2 (Less Serious)	Weapon Involved
2020-03-02	2020	Devonshire	THEFT OF IDENTITY	43	Male	Part 2 (Less Serious)	No Weapon
2020-09-01	2020	Topanga	THEFT OF IDENTITY	57	Male	Part 2 (Less Serious)	No Weapon

dplyr Exploration

Code

# dplyr command 1: group_by + summarise — crime counts by area, sorted descending
crimes_by_area <- crime_clean %>%
  group_by(area_name) %>%
  summarise(
    total_crimes   = n(),
    avg_vict_age   = round(mean(vict_age, na.rm = TRUE), 1),
    pct_part1      = round(mean(part == 1) * 100, 1),
    pct_weapon     = round(mean(!is.na(weapon_cd)) * 100, 1)
  ) %>%
  arrange(desc(total_crimes))

crimes_by_area %>%
  kable(caption = "Crime counts and statistics by LAPD area (2020–2024)")

Crime counts and statistics by LAPD area (2020–2024)
area_name	total_crimes	avg_vict_age	pct_part1	pct_weapon
Central	52095	38.0	62.5	42.5
Southwest	47804	35.7	50.6	42.5
77th Street	46213	38.6	48.3	57.5
Pacific	42260	41.0	63.4	34.1
Hollywood	39292	37.9	56.1	43.2
Southeast	36525	37.9	46.0	58.4
N Hollywood	36158	40.3	52.3	33.0
Olympic	36118	38.6	52.8	43.5
Wilshire	35764	39.8	56.2	30.8
Topanga	34374	41.4	53.4	31.1
Newton	33392	37.3	50.8	53.6
Van Nuys	33390	41.0	52.0	31.5
West LA	33165	43.0	55.3	27.2
Rampart	33037	37.6	51.6	50.1
West Valley	30912	42.5	49.1	36.6
Mission	30132	39.0	45.4	39.6
Northeast	29375	41.0	54.2	32.0
Devonshire	29174	42.5	49.3	32.7
Harbor	27727	40.6	45.3	45.2
Foothill	24557	40.4	44.6	40.0
Hollenbeck	24114	39.4	45.7	46.8

Code

# dplyr command 2: filter + select + mutate — focus on Part 1 (serious) crimes with known victim sex
violent_crimes <- crime_clean %>%
  filter(part == 1, !is.na(vict_sex), vict_sex != "Unknown") %>%
  select(year, area_name, crm_desc, vict_age, vict_sex, weapon_flag) %>%
  mutate(adult = if_else(vict_age >= 18, "Adult", "Minor"))

cat("Serious (Part 1) crimes with known victim sex:", nrow(violent_crimes), "\n")

Serious (Part 1) crimes with known victim sex: 378147

Code

# dplyr command 3: group_by + summarise — yearly crime trends
yearly_trend <- crime_years %>%
  group_by(year, part_label) %>%
  summarise(total = n(), .groups = "drop")

yearly_trend %>%
  pivot_wider(names_from = part_label, values_from = total) %>%
  kable(caption = "Crime counts by year and severity (2020–2023)")

Crime counts by year and severity (2020–2023)
year	Part 1 (Serious)	Part 2 (Less Serious)
2020	78042	73482
2021	83174	76161
2022	89414	89761
2023	88106	80599

Code

# dplyr command 4: arrange + slice_head — top 10 most common crime types
crime_clean %>%
  group_by(crm_desc) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice_head(n = 10) %>%
  kable(caption = "Top 10 most common crime types (2020–2024)")

Top 10 most common crime types (2020–2024)
crm_desc	count
BATTERY - SIMPLE ASSAULT	73876
BURGLARY FROM VEHICLE	61557
THEFT OF IDENTITY	61346
ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT	51495
THEFT PLAIN - PETTY ($950 & UNDER)	46932
VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)	46360
INTIMATE PARTNER - SIMPLE ASSAULT	46253
BURGLARY	39763
THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)	35142
THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD	27685

Multiple Linear Regression

Research Question

Can we predict a crime victim’s age using the crime’s severity (Part 1 vs. Part 2), the hour of occurrence, and whether a weapon was involved?

Model Preparation

Code

# Prepare regression dataset
# Use a sample of 50,000 for computational efficiency while retaining representativeness
set.seed(42)
reg_data <- crime_clean %>%
  # weapon_cd is deliberately not filtered: NA there means "no weapon recorded"
  filter(!is.na(vict_age), !is.na(hour), !is.na(part)) %>%
  mutate(weapon_binary = if_else(!is.na(weapon_cd), 1, 0)) %>%
  select(vict_age, part, hour, weapon_binary) %>%
  slice_sample(n = 50000)

Code

# Fit multiple linear regression
model <- lm(vict_age ~ part + hour + weapon_binary, data = reg_data)
summary(model)


Call:
lm(formula = vict_age ~ part + hour + weapon_binary, data = reg_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.666 -11.715  -2.865  10.285  62.086 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   40.58593    0.26206 154.872  < 2e-16 ***
part           0.54006    0.13936   3.875 0.000107 ***
hour          -0.03259    0.01059  -3.077 0.002090 ** 
weapon_binary -3.62508    0.14151 -25.618  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.41 on 49996 degrees of freedom
Multiple R-squared:  0.01323,   Adjusted R-squared:  0.01317 
F-statistic: 223.4 on 3 and 49996 DF,  p-value: < 2.2e-16

Regression Equation

Based on the fitted model, the estimated regression equation is:

\[ \hat{Age} = 40.59 + 0.54 \times Part - 0.033 \times Hour - 3.63 \times Weapon \]
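As a quick check on how the equation is used, the sketch below plugs two hypothetical incidents into the coefficients reported by summary(model). The helper name predict_age and the two example incidents are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(model) output above
b <- c(intercept = 40.58593, part = 0.54006, hour = -0.03259, weapon = -3.62508)

# Hypothetical helper: predicted victim age for a single incident
predict_age <- function(part, hour, weapon) {
  unname(b["intercept"] + b["part"] * part + b["hour"] * hour + b["weapon"] * weapon)
}

# Part 1 crime at 20:00 with a weapon involved
age_weapon <- predict_age(part = 1, hour = 20, weapon = 1)

# Part 2 crime at 08:00 with no weapon
age_no_weapon <- predict_age(part = 2, hour = 8, weapon = 0)

print(round(c(age_weapon, age_no_weapon), 1))  # 36.8 41.4
```

The two predictions differ by under five years even though the incidents differ on all three predictors, which is consistent with the narrow range of fitted values.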

Model Interpretation

Code

tidy(model) %>%
  kable(digits = 4,
        caption = "Regression coefficients, standard errors, and p-values")

Regression coefficients, standard errors, and p-values
term	estimate	std.error	statistic	p.value
(Intercept)	40.5859	0.2621	154.8716	0.0000
part	0.5401	0.1394	3.8752	0.0001
hour	-0.0326	0.0106	-3.0773	0.0021
weapon_binary	-3.6251	0.1415	-25.6175	0.0000

Code

glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value) %>%
  kable(digits = 4, caption = "Model fit statistics")

Model fit statistics
r.squared	adj.r.squared	sigma	statistic	p.value
0.0132	0.0132	15.4068	223.3685	0

Coefficient Interpretation:

The coefficients point to small but statistically significant differences in victim age. Because part is coded 1 or 2, its positive coefficient (0.54) means that crimes classified as Part 2 tend to involve victims who are about half a year older, on average, than victims of Part 1 serious crimes, holding hour and weapon use constant. This difference is statistically significant (p < 0.001), but it is small in practical terms. The hour coefficient is negative (-0.033): for every extra hour later in the day, the expected victim age drops by about 0.03 years, or roughly 0.75 years between midnight and 11 PM. It’s a small change, but it’s statistically significant (p = 0.002). Crimes where a weapon is used tend to involve victims who are around 3.6 years younger, on average, compared to crimes where no weapon is used (p < 2e-16). This is the largest of the three effects, so the presence of a weapon has the clearest association with the age of the victims involved.

Model Fit:

The connection between victim age and the factors we’re looking at is really weak. We’re talking about a tiny 1.3% of the variation in victim age being explained by our model (R² = 0.0132). Now, it’s true that all the predictors we’re using are statistically significant, but that’s mostly because we have such a huge sample size. The reality is, victim age is influenced by a whole bunch of complex social and situational factors that we’re not capturing with these three variables. This is something we see a lot in criminology: just because something is statistically significant doesn’t mean it’s actually significant in the real world. There’s a lot more going on here than our model can account for.
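The gap between statistical and practical significance can be illustrated with a small simulation. This is a sketch on made-up data, not the LAPD records: the one-year group difference, the standard deviation of 15, and the seed are all assumptions chosen to resemble the scale of the model above.

```r
set.seed(1)
n <- 50000

# Simulated data (NOT the LAPD data): a binary predictor shifts the mean by 1 year
x <- rbinom(n, size = 1, prob = 0.5)
y <- 40 + 1 * x + rnorm(n, mean = 0, sd = 15)

fit <- lm(y ~ x)
p_value   <- summary(fit)$coefficients["x", "Pr(>|t|)"]
r_squared <- summary(fit)$r.squared

# With n = 50,000 a one-year shift is easily detected, yet explains almost nothing
cat("p-value for x:", format.pval(p_value), "\n")
cat("R-squared:", round(r_squared, 4), "\n")
```

With a sample this large, the p-value should be far below 0.05 while R² stays well under 1%, the same pattern seen in the fitted model.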

Diagnostic Plots

Code

par(mfrow = c(2, 2))
plot(model)

Code

par(mfrow = c(1, 1))

Diagnostic Interpretation:

- Residuals vs Fitted: The residuals are pretty evenly spread around zero, which is a good sign that there aren’t any major nonlinear patterns going on. However, it’s worth noting that the range of fitted values is quite narrow, which is consistent with the low R². This suggests that the model isn’t doing a great job of capturing the underlying patterns in the data.
- Normal Q-Q: The residuals don’t follow a perfect normal distribution, especially at the extremes. This isn’t surprising, given that the age of victims ranges from 1 to 100 after cleaning and isn’t symmetric. We see a mild deviation from normality at the tails, which is what we’d expect in this case.
- Scale-Location: Relatively flat, suggesting homoscedasticity is approximately met.
- Residuals vs Leverage: There are no extremely influential data points that are significantly affecting the model’s outcome, based on the comparison of residuals and leverage.

Data Visualizations

Visualization 1 (Interactive): Crime Volume by LAPD Area (2020–2024)

Code

# Custom non-default color palette
area_colors <- colorRampPalette(c("#E63946", "#2A9D8F", "#F4A261", 
                                   "#264653", "#A8DADC"))(21)

# Static ggplot base
p1 <- crimes_by_area %>%
  mutate(area_name = fct_reorder(area_name, total_crimes)) %>%
  ggplot(aes(x = area_name, y = total_crimes, fill = area_name,
             text = paste0("<b>", area_name, "</b><br>",
                           "Total Crimes: ", scales::comma(total_crimes), "<br>",
                           "Avg Victim Age: ", avg_vict_age, "<br>",
                           "% Part 1 (Serious): ", pct_part1, "%"))) +
  geom_col(alpha = 0.9) +
  coord_flip() +
  scale_fill_manual(values = area_colors, guide = "none") +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title    = "Total Crime Incidents by LAPD Area — 2020 to 2024",
    x        = "LAPD Area",
    y        = "Total Crime Incidents",
    caption  = "Data source: Los Angeles Open Data Portal — LAPD Crime Data 2020–2024"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", size = 14),
    plot.caption     = element_text(color = "grey55", size = 9, hjust = 0),
    panel.grid.minor = element_blank()
  )

# Convert to interactive plotly
ggplotly(p1, tooltip = "text") %>%
  layout(title = list(text = "Total Crime Incidents by LAPD Area — 2020 to 2024",
                      font = list(size = 14)))

If you look at the crime statistics, you’ll see that Central has the highest total crime volume out of all 21 LAPD areas. Southwest and 77th Street follow closely behind, with Pacific in fourth place. What’s really interesting is that several areas in South LA, like Southwest, 77th Street, and Southeast, rank among the six highest by incident count, while Newton sits near the middle of the list. Keep in mind that these are raw counts, not rates adjusted for population. If you want to get more specific, you can hover over any of the bars to see the total count of crimes, the average age of the victims, and the percentage of serious crimes, also known as Part 1 crimes, for each area.

Visualization 2: Crime Trends by Year and Severity (2020–2023)

Code

yearly_trend %>%
  ggplot(aes(x = year, y = total, color = part_label, group = part_label)) +
  geom_line(linewidth = 1.4) +
  geom_point(size = 3.5, shape = 21, fill = "white", stroke = 2) +
  geom_text(aes(label = scales::comma(total)),
            vjust = -1, size = 3.2, fontface = "bold") +
  scale_color_manual(
    values = c("Part 1 (Serious)" = "#E63946", "Part 2 (Less Serious)" = "#2A9D8F"),
    name   = "Crime Severity"
  ) +
  scale_y_continuous(labels = scales::comma, limits = c(0, NA), expand = expansion(mult = c(0, 0.15))) +
  scale_x_continuous(breaks = 2020:2023) +
  labs(
    title    = "Annual Crime Counts by Severity — Los Angeles 2020–2023",
    subtitle = "Part 1 = serious felonies (homicide, robbery, assault); Part 2 = less serious offenses",
    x        = "Year",
    y        = "Number of Crime Incidents",
    caption  = "Data source: Los Angeles Open Data Portal — LAPD Crime Data 2020–2024\nNote: 2024 excluded (partial year data only)"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", size = 14),
    plot.subtitle    = element_text(color = "grey45", size = 11),
    plot.caption     = element_text(color = "grey55", size = 9, hjust = 0),
    legend.position  = "bottom",
    panel.grid.minor = element_blank()
  )

What this shows: Both Part 1 and Part 2 crimes increased from 2020 to 2022, then declined in 2023. The lower Part 2 count in 2020 likely reflects reduced reporting during COVID-19 lockdowns, when many lower-priority incidents went unreported. Part 1 serious crimes rose steadily as the city reopened (about 7% in each of 2021 and 2022), while Part 2 crimes showed the sharper jump in 2022 (about 18%) and the larger drop in 2023 (about 10%).
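The year-over-year changes behind this chart can be checked with a short sketch. The counts are copied from the "Crime counts by year and severity" table; the object names yearly and yoy are illustrative assumptions.

```r
library(dplyr)

# Counts copied from the yearly table (2020-2023)
yearly <- data.frame(
  year       = rep(2020:2023, times = 2),
  part_label = rep(c("Part 1 (Serious)", "Part 2 (Less Serious)"), each = 4),
  total      = c(78042, 83174, 89414, 88106,
                 73482, 76161, 89761, 80599)
)

# Percent change from the previous year within each severity group
yoy <- yearly %>%
  group_by(part_label) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(pct_change = round((total / lag(total) - 1) * 100, 1)) %>%
  ungroup()

print(yoy)
# Part 1: NA,  6.6,  7.5,  -1.5
# Part 2: NA,  3.6, 17.9, -10.2
```

The first year in each group has no previous value, so lag() returns NA there by design.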

Tableau Public Visualization

https://public.tableau.com/app/profile/zyam.khawaja/viz/Crime_In_Los_Angeles_By_Area_2020-2024/Sheet1

Closing Essay

a. How the Data Was Cleaned

The dataset we used was really big, with over a million rows and 28 columns. To start cleaning it up, we had to rename most of the columns. The original names had spaces, hyphens, and capital letters, which wouldn’t work well with the tools we were using. So, we used a function called dplyr::rename() to change them into a simpler format, like crm_desc instead of Crm Cd Desc, part instead of Part 1-2, and date_occ instead of DATE OCC. This was important because names with spaces or hyphens have to be wrapped in backticks every time they are used in tidyverse code, so clean and simple names are much easier to work with.

The next big step was to make sense of the date and time information. We had two columns, DATE OCC and Date Rptd, that were stored as text in a specific format, like "04/11/2021 12:00:00 AM". To work with this information, we needed to convert it into a format that the computer could understand as a date and time. We used a special tool called lubridate::mdy_hms() to do this, which turned the text into a POSIXct datetime object. Once we had this, we could extract specific parts of the date, like the year, month, and so on, and create new columns for each of these. We also had a TIME OCC column that was stored as four-digit text, like "1345", which represented 1:45 PM. To convert this into a usable hour format, we used simple math: converting the text to an integer and using integer division by 100 to get the hour.
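The date and hour conversions described above can be sketched on a couple of example values. The two date strings are taken from the glimpse() output; the time strings "1345" and "0845" are example values in the same four-digit format.

```r
library(lubridate)

# Example values in the same format as the raw columns
date_txt <- c("04/11/2021 12:00:00 AM", "10/21/2020 12:00:00 AM")
time_txt <- c("1345", "0845")

date_parsed <- mdy_hms(date_txt)             # POSIXct datetimes (UTC by default)
hour_occ    <- as.integer(time_txt) %/% 100  # integer division keeps only the hour

print(year(date_parsed))  # 2021 2020
print(hour_occ)           # 13 8
```

Integer division drops the minutes, so "1345" becomes hour 13 and "0845" becomes hour 8.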

To clean up the data, a few steps were taken. First, we looked at the ages of the victims. Some of the ages were clearly wrong, like -4 or 120 years old. These were probably mistakes when the data was entered. So, we removed any rows where the age was less than or equal to 0 or more than 100. We did this using filter(), and we made sure to follow the project rules by not using na.omit() or drop_na(). Next, we looked at the sex of the victims. The data used single letters like “M”, “F”, or “X” to represent male, female, or unknown. We changed these to full words like “Male”, “Female”, or “Unknown” to make it clearer. There were a couple of codes, “H” and “-”, that didn’t make sense, so we set them to missing (NA). The code “H” showed up 114 times, and “-” showed up once. We figured these were just mistakes when the data was entered, and we didn’t know what they were supposed to mean.
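The recoding of victim sex can be sketched on a toy vector. The vector raw_sex below is made up for illustration; it contains each code mentioned above once, plus one missing value.

```r
library(dplyr)

# Toy vector (not the real column) with every code described above
raw_sex <- c("M", "F", "X", "H", "-", NA)

# Same mapping as Step 4: unrecognized codes fall through to NA
recoded <- case_when(
  raw_sex == "M" ~ "Male",
  raw_sex == "F" ~ "Female",
  raw_sex == "X" ~ "Unknown",
  TRUE ~ NA_character_
)

print(table(recoded, useNA = "ifany"))
```

Here "H", "-", and the original NA all end up as NA, while "X" is kept as an explicit "Unknown" category.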

To look at trends from one year to the next, a new set of data was made, called crime_years, which only includes information from 2020 to 2023. The reason for not using 2024 data is that it’s not complete, and it would affect the way trends look. This choice was noted in the code with a comment. The data had some issues, like a lot of missing information in certain columns. For example, the type of weapon used was missing in about 67% of the cases, probably because no weapon was used. The cross street was missing in over 84% of the cases, and secondary crime codes were missing in more than 93% of the cases. Instead of removing these columns, they were changed into simple yes or no flags, like weapon_flag, where it made sense to do so.
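The missing-value shares quoted above can be reproduced from the NA counts printed in Step 1. The first part of this sketch uses those printed counts directly; the second part shows the same idea on a toy data frame (toy is made up for illustration), which is how it could be applied to the full dataset.

```r
# NA counts and row total copied from the Step 1 output
n_rows    <- 1004894
na_counts <- c(`Weapon Used Cd` = 677678,
               `Cross Street`   = 850666,
               `Crm Cd 2`       = 935740)

pct_missing <- round(na_counts / n_rows * 100, 1)
print(pct_missing)  # 67.4, 84.7, 93.1

# Same calculation on a toy data frame: share of NA values per column
toy <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"))
toy_pct <- round(colMeans(is.na(toy)) * 100, 1)
print(toy_pct)      # a = 50, b = 25
```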

b. What the Visualizations Show

Crime in Los Angeles: A Closer Look

If you take a look at the crime counts in Los Angeles, you’ll see that they’re not spread out evenly across the city’s 21 policing areas. Some areas have a lot more crime than others. Central has the most crime, followed by Southwest and 77th Street. These three areas account for about a fifth of the incidents in the cleaned dataset. When you use the interactive bar chart, you can hover over each area and see the exact number of crimes, the average age of the victims, and what percentage of the crimes are serious. This gives you a lot more detail than a regular chart. One thing that’s interesting is that Hollywood, which is really well-known and has a lot of tourists, isn’t at the top of the list for crime. It ranks fifth. This interactive chart is made with Plotly, which allows you to explore the data in a more detailed way. You can see that some areas have a higher share of serious crimes than others. Overall, the chart shows that crime in Los Angeles is a complex issue, and some areas need more attention than others.

Visualization 2 (Line Chart by Year): The trend chart shows that both serious and less serious crimes increased from 2020 to 2022, then began declining in 2023. The relatively lower Part 2 count in 2020 likely reflects the early COVID-19 lockdown period, when fewer people were out in public and lower-priority crimes may have gone unreported or unrecorded. The sharp rise in 2021–2022 coincides with the city’s reopening and the well-documented national post-pandemic crime surge. The 2023 decline is an encouraging signal, though it remains to be seen whether it continues in future years.

c. Limitations and Wishes

The regression model had a very low adjusted R² (≈ 0.013), meaning victim age is poorly predicted by severity, hour, and weapon use alone. A more sophisticated model would incorporate neighborhood-level socioeconomic data, victim descent, and crime category clusters. A Poisson or negative binomial regression would be more appropriate for predicting crime counts by area over time.
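A minimal sketch of what a count model could look like is shown below. It is only an illustration of the model family: it uses the eight yearly totals from the severity table rather than area-level counts, and the variable names yearly and years_since_2020 are assumptions. A real analysis would need counts by area and time period.

```r
# Yearly totals copied from the severity table (2020-2023)
yearly <- data.frame(
  year       = rep(2020:2023, times = 2),
  part_label = rep(c("Part 1 (Serious)", "Part 2 (Less Serious)"), each = 4),
  total      = c(78042, 83174, 89414, 88106,
                 73482, 76161, 89761, 80599)
)
yearly$years_since_2020 <- yearly$year - 2020

# Poisson regression with a log link: models counts as a multiplicative trend
fit <- glm(total ~ years_since_2020 + part_label,
           family = poisson(link = "log"), data = yearly)

# Pearson dispersion statistic: values far above 1 indicate overdispersion,
# which is the situation where a negative binomial model is usually preferred
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

print(round(exp(coef(fit)), 3))  # multiplicative effects (rate ratios)
cat("Dispersion:", round(dispersion, 1), "\n")
```

With only eight data points this is a toy fit, but the dispersion check shows why the negative binomial alternative is worth considering for crime counts.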

I would have also liked to add a map that shows where crimes are happening the most, using the location information in the data. This kind of map is really good at showing where the high-crime areas are in a city. I tried to make one using a tool called ggmap, but I needed a special code (an API key) to use the maps from Google. Luckily, the Tableau Public visualization lets people look at the data in a map and interact with it, which fills the gap.

Finally, the dataset’s reliance on paper report transcription introduces measurement error that is difficult to quantify or correct for in analysis.

Data Source: City of Los Angeles Open Data Portal (https://data.lacity.org)