setwd("C:/Users/zyamj/Downloads")
# Load required libraries
library(tidyverse)
library(lubridate)
library(plotly)
library(knitr)
library(scales)
library(broom)
Crime stands out as a top public safety issue in major American cities. Los Angeles is one of the largest and most diverse cities in America, and it has long maintained one of the country's best publicly available crime databases. This project analyzes the Los Angeles crime dataset (2020–2024), which consists of over 1 million crime incident records reported by the Los Angeles Police Department (LAPD).
The dataset has extensive information on crime classification, location, victim demographics (age, sex, descent), time of occurrence, type of premises, and weapon used. The data covers a five-year time frame, from January 2020 to late 2024, allowing for an analysis of trends in crime before, during, and after the COVID-19 pandemic.
Research Questions:
I picked this dataset because crime has a big impact on our daily lives and on the decisions our leaders make. By looking at where, when, and to whom crimes happen, we can help the people in charge of cities, police, and community groups use their resources in a smarter way. Since I live close to a big city, I think this information is really important, both for me and for my community.
The way the data was gathered is pretty straightforward. It starts with the Los Angeles Police Department's (LAPD) documentation on their Open Data Portal. Apparently, crime incident records are taken from the original paper reports that LAPD officers fill out. Each of these reports is then entered into the LAPD's Records Management System, or RMS for short. After that, the data gets exported to the open data portal on a weekly basis. Now, since the data comes from paper reports, there's a chance that some of the information might not be entirely accurate. One thing to note is that the dataset doesn't come with a formal ReadMe file, but you can find a data dictionary on the portal that gives a detailed description of each variable. This can be really helpful for understanding what the data is actually telling us.
Data Source: City of Los Angeles Open Data Portal — LAPD Crime Data 2020 to Present. Available at: https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8
Background Research
Crime in cities has been studied extensively in criminology and public health. Some researchers, like Blumstein and Wallman, found that when the economy is bad and there are problems in society, there tends to be more property crime. On the other hand, Chalfin and McCrary found that having more police on the streets can cut down on violent crime. The COVID-19 pandemic was like a big natural experiment: some studies, such as the one by Campedelli and others, showed that when people were told to stay home, crime on the streets went down but domestic violence went up, and we might see this pattern in the data from 2020. This is interesting because it shows how big events can change the way crime happens in cities. By looking at this data, we can learn more about what affects crime and how to make our cities safer.
Los Angeles has been looked at closely for where crimes happen in the city. A study in the Journal of Quantitative Criminology found that crimes in LA are mostly happening in a few small areas, or “hot spots”. This is similar to a common idea in criminology, which is that 80% of crimes happen in 20% of places.
References:
# Load dataset using readr::read_csv() as required (NOT base R's read.csv())
crime <- readr::read_csv("Crime_Data_from_2020_to_2024.csv")
glimpse(crime)
Rows: 1,004,894
Columns: 28
$ DR_NO <dbl> 211507896, 201516622, 240913563, 210704711, 201418201…
$ `Date Rptd` <chr> "04/11/2021 12:00:00 AM", "10/21/2020 12:00:00 AM", "…
$ `DATE OCC` <chr> "11/07/2020 12:00:00 AM", "10/18/2020 12:00:00 AM", "…
$ `TIME OCC` <chr> "0845", "1845", "1240", "1310", "1830", "1210", "1350…
$ AREA <chr> "15", "15", "09", "07", "14", "04", "03", "11", "17",…
$ `AREA NAME` <chr> "N Hollywood", "N Hollywood", "Van Nuys", "Wilshire",…
$ `Rpt Dist No` <chr> "1502", "1521", "0933", "0782", "1454", "0429", "0396…
$ `Part 1-2` <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2,…
$ `Crm Cd` <dbl> 354, 230, 354, 331, 420, 354, 354, 812, 354, 354, 812…
$ `Crm Cd Desc` <chr> "THEFT OF IDENTITY", "ASSAULT WITH DEADLY WEAPON, AGG…
$ Mocodes <chr> "0377", "0416 0334 2004 1822 1414 0305 0319 0400", "0…
$ `Vict Age` <dbl> 31, 32, 30, 47, 63, 35, 21, 14, 43, 57, 13, 34, 0, 0,…
$ `Vict Sex` <chr> "M", "M", "M", "F", "M", "M", "F", "F", "M", "M", "M"…
$ `Vict Descent` <chr> "H", "H", "W", "A", "H", "B", "B", "H", "W", "W", "H"…
$ `Premis Cd` <dbl> 501, 102, 501, 101, 103, 502, 501, 121, 501, 501, 501…
$ `Premis Desc` <chr> "SINGLE FAMILY DWELLING", "SIDEWALK", "SINGLE FAMILY …
$ `Weapon Used Cd` <dbl> NA, 200, NA, NA, NA, NA, NA, 500, NA, NA, 400, NA, NA…
$ `Weapon Desc` <chr> NA, "KNIFE WITH BLADE 6INCHES OR LESS", NA, NA, NA, N…
$ Status <chr> "IC", "IC", "IC", "IC", "IC", "IC", "IC", "AO", "IC",…
$ `Status Desc` <chr> "Invest Cont", "Invest Cont", "Invest Cont", "Invest …
$ `Crm Cd 1` <dbl> 354, 230, 354, 331, 420, 354, 354, 812, 354, 354, 812…
$ `Crm Cd 2` <dbl> NA, NA, NA, NA, NA, NA, NA, 860, NA, NA, 860, NA, NA,…
$ `Crm Cd 3` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Crm Cd 4` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ LOCATION <chr> "7800 BEEMAN AV", "ATOLL …
$ `Cross Street` <chr> NA, "N GAULT", NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ LAT <dbl> 34.2124, 34.1993, 34.1847, 34.0339, 33.9813, 34.0830,…
$ LON <dbl> -118.4092, -118.4203, -118.4509, -118.3747, -118.4350…
# Step 1: Check dimensions and missing values
cat("Dimensions:", nrow(crime), "rows x", ncol(crime), "columns\n\n")
Dimensions: 1004894 rows x 28 columns
cat("Missing values per column:\n")
Missing values per column:
print(colSums(is.na(crime)))
         DR_NO      Date Rptd       DATE OCC       TIME OCC           AREA
             0              0              0              0              0
     AREA NAME    Rpt Dist No       Part 1-2         Crm Cd    Crm Cd Desc
             0              0              0              0              0
       Mocodes       Vict Age       Vict Sex   Vict Descent      Premis Cd
        151598              0         144631         144643             16
   Premis Desc Weapon Used Cd    Weapon Desc         Status    Status Desc
           588         677678         677678              1              0
      Crm Cd 1       Crm Cd 2       Crm Cd 3       Crm Cd 4       LOCATION
            11         935740        1002580        1004894              0
  Cross Street            LAT            LON
        850666              0              0
# Step 2: Rename columns to snake_case for easier use in R
# Original names contain spaces and hyphens which require backticks
crime_clean <- crime %>%
rename(
dr_no = `DR_NO`,
date_rptd = `Date Rptd`,
date_occ = `DATE OCC`,
time_occ = `TIME OCC`,
area = `AREA`,
area_name = `AREA NAME`,
part = `Part 1-2`,
crm_cd = `Crm Cd`,
crm_desc = `Crm Cd Desc`,
vict_age = `Vict Age`,
vict_sex = `Vict Sex`,
vict_descent = `Vict Descent`,
premis_cd = `Premis Cd`,
premis_desc = `Premis Desc`,
weapon_cd = `Weapon Used Cd`,
weapon_desc = `Weapon Desc`,
status = `Status`,
status_desc = `Status Desc`,
location = `LOCATION`,
lat = `LAT`,
lon = `LON`
)
# Step 3: Parse date columns using lubridate (mdy_hms format)
crime_clean <- crime_clean %>%
mutate(
date_occ = mdy_hms(date_occ),
date_rptd = mdy_hms(date_rptd),
year = year(date_occ),
month = month(date_occ, label = TRUE),
hour = as.integer(as.character(time_occ)) %/% 100
)
# Step 4: Recode categorical variables
# Part 1 = serious/violent felonies; Part 2 = less serious offenses
crime_clean <- crime_clean %>%
mutate(
part_label = if_else(part == 1, "Part 1 (Serious)", "Part 2 (Less Serious)"),
weapon_flag = if_else(!is.na(weapon_cd), "Weapon Involved", "No Weapon"),
vict_sex = case_when(
vict_sex == "M" ~ "Male",
vict_sex == "F" ~ "Female",
vict_sex == "X" ~ "Unknown",
TRUE ~ NA_character_
)
)
# Step 5: Filter out invalid victim ages
# Ages <= 0 or > 100 are data entry errors (raw data contains ages as low as -4)
# NOTE: NOT using na.omit() or drop_na() per project instructions
crime_clean <- crime_clean %>%
filter(vict_age > 0, vict_age <= 100)
cat("Rows after age filter:", nrow(crime_clean), "\n")
Rows after age filter: 735578
# Step 6: Filter to complete years only (2020-2023) for year-over-year analysis
# 2024 data is partial (only through mid-year), which would distort trend plots
crime_years <- crime_clean %>%
filter(year %in% 2020:2023)
cat("Rows for 2020-2023 analysis:", nrow(crime_years), "\n")
Rows for 2020-2023 analysis: 658739
# Step 7: Preview cleaned dataset
crime_clean %>%
select(date_occ, year, area_name, crm_desc, vict_age,
vict_sex, part_label, weapon_flag) %>%
slice_head(n = 10) %>%
kable(caption = "First 10 rows of cleaned crime dataset")

| date_occ | year | area_name | crm_desc | vict_age | vict_sex | part_label | weapon_flag |
|---|---|---|---|---|---|---|---|
| 2020-11-07 | 2020 | N Hollywood | THEFT OF IDENTITY | 31 | Male | Part 2 (Less Serious) | No Weapon |
| 2020-10-18 | 2020 | N Hollywood | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | 32 | Male | Part 1 (Serious) | Weapon Involved |
| 2020-10-30 | 2020 | Van Nuys | THEFT OF IDENTITY | 30 | Male | Part 2 (Less Serious) | No Weapon |
| 2020-12-24 | 2020 | Wilshire | THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER) | 47 | Female | Part 1 (Serious) | No Weapon |
| 2020-09-29 | 2020 | Pacific | THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER) | 63 | Male | Part 1 (Serious) | No Weapon |
| 2020-11-11 | 2020 | Hollenbeck | THEFT OF IDENTITY | 35 | Male | Part 2 (Less Serious) | No Weapon |
| 2020-04-16 | 2020 | Southwest | THEFT OF IDENTITY | 21 | Female | Part 2 (Less Serious) | No Weapon |
| 2020-07-07 | 2020 | Northeast | CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 YRS OLDER) | 14 | Female | Part 2 (Less Serious) | Weapon Involved |
| 2020-03-02 | 2020 | Devonshire | THEFT OF IDENTITY | 43 | Male | Part 2 (Less Serious) | No Weapon |
| 2020-09-01 | 2020 | Topanga | THEFT OF IDENTITY | 57 | Male | Part 2 (Less Serious) | No Weapon |
# dplyr command 1: group_by + summarise — crime counts by area, sorted descending
crimes_by_area <- crime_clean %>%
group_by(area_name) %>%
summarise(
total_crimes = n(),
avg_vict_age = round(mean(vict_age, na.rm = TRUE), 1),
pct_part1 = round(mean(part == 1) * 100, 1),
pct_weapon = round(mean(!is.na(weapon_cd)) * 100, 1)
) %>%
arrange(desc(total_crimes))
crimes_by_area %>%
kable(caption = "Crime counts and statistics by LAPD area (2020–2024)")

| area_name | total_crimes | avg_vict_age | pct_part1 | pct_weapon |
|---|---|---|---|---|
| Central | 52095 | 38.0 | 62.5 | 42.5 |
| Southwest | 47804 | 35.7 | 50.6 | 42.5 |
| 77th Street | 46213 | 38.6 | 48.3 | 57.5 |
| Pacific | 42260 | 41.0 | 63.4 | 34.1 |
| Hollywood | 39292 | 37.9 | 56.1 | 43.2 |
| Southeast | 36525 | 37.9 | 46.0 | 58.4 |
| N Hollywood | 36158 | 40.3 | 52.3 | 33.0 |
| Olympic | 36118 | 38.6 | 52.8 | 43.5 |
| Wilshire | 35764 | 39.8 | 56.2 | 30.8 |
| Topanga | 34374 | 41.4 | 53.4 | 31.1 |
| Newton | 33392 | 37.3 | 50.8 | 53.6 |
| Van Nuys | 33390 | 41.0 | 52.0 | 31.5 |
| West LA | 33165 | 43.0 | 55.3 | 27.2 |
| Rampart | 33037 | 37.6 | 51.6 | 50.1 |
| West Valley | 30912 | 42.5 | 49.1 | 36.6 |
| Mission | 30132 | 39.0 | 45.4 | 39.6 |
| Northeast | 29375 | 41.0 | 54.2 | 32.0 |
| Devonshire | 29174 | 42.5 | 49.3 | 32.7 |
| Harbor | 27727 | 40.6 | 45.3 | 45.2 |
| Foothill | 24557 | 40.4 | 44.6 | 40.0 |
| Hollenbeck | 24114 | 39.4 | 45.7 | 46.8 |
# dplyr command 2: filter + select + mutate — focus on serious violent crimes only
violent_crimes <- crime_clean %>%
filter(part == 1, !is.na(vict_sex), vict_sex != "Unknown") %>%
select(year, area_name, crm_desc, vict_age, vict_sex, weapon_flag) %>%
mutate(adult = if_else(vict_age >= 18, "Adult", "Minor"))
cat("Serious (Part 1) crimes with known victim sex:", nrow(violent_crimes), "\n")
Serious (Part 1) crimes with known victim sex: 378147
# dplyr command 3: group_by + summarise — yearly crime trends
yearly_trend <- crime_years %>%
group_by(year, part_label) %>%
summarise(total = n(), .groups = "drop")
yearly_trend %>%
pivot_wider(names_from = part_label, values_from = total) %>%
kable(caption = "Crime counts by year and severity (2020–2023)")

| year | Part 1 (Serious) | Part 2 (Less Serious) |
|---|---|---|
| 2020 | 78042 | 73482 |
| 2021 | 83174 | 76161 |
| 2022 | 89414 | 89761 |
| 2023 | 88106 | 80599 |
# dplyr command 4: arrange + slice_head, top 10 most common crime types
crime_clean %>%
group_by(crm_desc) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
slice_head(n = 10) %>%
kable(caption = "Top 10 most common crime types (2020–2024)")

| crm_desc | count |
|---|---|
| BATTERY - SIMPLE ASSAULT | 73876 |
| BURGLARY FROM VEHICLE | 61557 |
| THEFT OF IDENTITY | 61346 |
| ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | 51495 |
| THEFT PLAIN - PETTY ($950 & UNDER) | 46932 |
| VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) | 46360 |
| INTIMATE PARTNER - SIMPLE ASSAULT | 46253 |
| BURGLARY | 39763 |
| THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER) | 35142 |
| THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD | 27685 |
Can we predict a crime victim’s age using the crime’s severity (Part 1 vs. Part 2), the hour of occurrence, and whether a weapon was involved?
# Prepare regression dataset
# Use a sample of 50,000 for computational efficiency while retaining representativeness
set.seed(42)
reg_data <- crime_clean %>%
# weapon_cd is deliberately not filtered: NA means no weapon and becomes weapon_binary = 0
filter(!is.na(vict_age), !is.na(hour), !is.na(part)) %>%
mutate(weapon_binary = if_else(!is.na(weapon_cd), 1, 0)) %>%
select(vict_age, part, hour, weapon_binary) %>%
slice_sample(n = 50000)
# Fit multiple linear regression
model <- lm(vict_age ~ part + hour + weapon_binary, data = reg_data)
summary(model)
Call:
lm(formula = vict_age ~ part + hour + weapon_binary, data = reg_data)
Residuals:
Min 1Q Median 3Q Max
-39.666 -11.715 -2.865 10.285 62.086
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.58593 0.26206 154.872 < 2e-16 ***
part 0.54006 0.13936 3.875 0.000107 ***
hour -0.03259 0.01059 -3.077 0.002090 **
weapon_binary -3.62508 0.14151 -25.618 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.41 on 49996 degrees of freedom
Multiple R-squared: 0.01323, Adjusted R-squared: 0.01317
F-statistic: 223.4 on 3 and 49996 DF, p-value: < 2.2e-16
Based on the fitted model, the estimated regression equation is:
\[ \widehat{\text{Age}} = 40.59 + 0.54 \times \text{Part} - 0.03 \times \text{Hour} - 3.63 \times \text{Weapon} \]
tidy(model) %>%
kable(digits = 4,
caption = "Regression coefficients, standard errors, and p-values")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 40.5859 | 0.2621 | 154.8716 | 0.0000 |
| part | 0.5401 | 0.1394 | 3.8752 | 0.0001 |
| hour | -0.0326 | 0.0106 | -3.0773 | 0.0021 |
| weapon_binary | -3.6251 | 0.1415 | -25.6175 | 0.0000 |
glance(model) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value) %>%
kable(digits = 4, caption = "Model fit statistics")

| r.squared | adj.r.squared | sigma | statistic | p.value |
|---|---|---|---|---|
| 0.0132 | 0.0132 | 15.4068 | 223.3685 | 0 |
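To make the coefficients concrete, here is a small worked prediction. It is only a sketch: the coefficient values are copied from the regression output above, and `predict_age()` is a hypothetical helper that is not part of the original analysis.

```r
# Coefficients copied from the summary(model) output above
coefs <- c(intercept = 40.58593, part = 0.54006,
           hour = -0.03259, weapon_binary = -3.62508)

# Hypothetical helper (not in the original analysis): predicted victim age
predict_age <- function(part, hour, weapon_binary) {
  unname(coefs["intercept"] +
           coefs["part"] * part +
           coefs["hour"] * hour +
           coefs["weapon_binary"] * weapon_binary)
}

# Part 1 incident at 20:00 with a weapon vs. Part 2 incident at 08:00 without one
age_a <- predict_age(part = 1, hour = 20, weapon_binary = 1)
age_b <- predict_age(part = 2, hour = 8, weapon_binary = 0)
round(c(age_a, age_b), 1)
```

The two predictions differ by only a few years, which fits the small R² reported above: the predictors shift the expected age slightly but leave most of the variation unexplained.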
Coefficient Interpretation:
Model Fit:
The connection between victim age and the factors we're looking at is really weak. Only about 1.3% of the variation in victim age is explained by our model (R² = 0.0132). It's true that all the predictors we're using are statistically significant, but that's mostly because we have such a large sample (50,000 rows). The reality is that victim age is influenced by many complex social and situational factors that we're not capturing with these three variables. This is something we see a lot in criminology: just because something is statistically significant doesn't mean it matters much in practice. There's a lot more going on here than our model can account for.
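The gap between statistical and practical significance can be illustrated with a small simulation. The data below are simulated, not taken from the LAPD dataset. The effect size (-3.6 years) and noise level (sd 15.4) are assumptions chosen to loosely resemble the fitted model.

```r
# Simulated data only: a modest effect buried in a lot of noise
set.seed(1)
n <- 50000
weapon <- rbinom(n, 1, 0.5)                      # assumed 50/50 split
age <- 40 - 3.6 * weapon + rnorm(n, sd = 15.4)   # assumed effect and noise

sim_fit <- lm(age ~ weapon)
sim_summary <- summary(sim_fit)

# With n this large the p-value is tiny, yet R-squared stays very small
p_value <- sim_summary$coefficients["weapon", "Pr(>|t|)"]
r_squared <- sim_summary$r.squared
c(p_value = p_value, r_squared = r_squared)
```

A very small p-value here says the effect is unlikely to be zero. It says nothing about how much of the outcome the predictor explains.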
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
Diagnostic Interpretation:
- Residuals vs. Fitted: The residuals are pretty evenly spread around zero, which is a good sign that there aren't any major nonlinear patterns going on. However, the range of fitted values is quite narrow, which is consistent with the very low R²: the model predicts nearly the same age for every incident.
- Normal Q-Q: The residuals don't follow a perfect normal distribution, especially at the extremes. This isn't surprising, given that victim age is bounded (1 to 100 after filtering) and isn't symmetric. We see a mild deviation from normality at the tails, which is what we'd expect in this case.
- Scale-Location: Relatively flat, suggesting homoscedasticity is approximately met.
- Residuals vs. Leverage: There are no extremely influential data points that are significantly affecting the model's outcome.
# Custom non-default color palette
area_colors <- colorRampPalette(c("#E63946", "#2A9D8F", "#F4A261",
"#264653", "#A8DADC"))(21)
# Static ggplot base
p1 <- crimes_by_area %>%
mutate(area_name = fct_reorder(area_name, total_crimes)) %>%
ggplot(aes(x = area_name, y = total_crimes, fill = area_name,
text = paste0("<b>", area_name, "</b><br>",
"Total Crimes: ", scales::comma(total_crimes), "<br>",
"Avg Victim Age: ", avg_vict_age, "<br>",
"% Part 1 (Serious): ", pct_part1, "%"))) +
geom_col(alpha = 0.9) +
coord_flip() +
scale_fill_manual(values = area_colors, guide = "none") +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Total Crime Incidents by LAPD Area — 2020 to 2024",
x = "LAPD Area",
y = "Total Crime Incidents",
caption = "Data source: Los Angeles Open Data Portal — LAPD Crime Data 2020–2024"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.caption = element_text(color = "grey55", size = 9, hjust = 0),
panel.grid.minor = element_blank()
)
# Convert to interactive plotly
ggplotly(p1, tooltip = "text") %>%
layout(title = list(text = "Total Crime Incidents by LAPD Area — 2020 to 2024",
font = list(size = 14)))
The chart shows that Central has the highest total incident count of all 21 LAPD areas, followed by Southwest and 77th Street, with Pacific fourth. South LA areas such as Southwest, 77th Street, and Southeast rank among the highest by count, while Newton sits near the middle. These are counts of incidents with a valid victim age, not per-capita crime rates. For more detail, hover over any bar to see the total count of crimes, the average age of the victims, and the percentage of serious (Part 1) crimes for each area.
yearly_trend %>%
ggplot(aes(x = year, y = total, color = part_label, group = part_label)) +
geom_line(linewidth = 1.4) +
geom_point(size = 3.5, shape = 21, fill = "white", stroke = 2) +
geom_text(aes(label = scales::comma(total)),
vjust = -1, size = 3.2, fontface = "bold") +
scale_color_manual(
values = c("Part 1 (Serious)" = "#E63946", "Part 2 (Less Serious)" = "#2A9D8F"),
name = "Crime Severity"
) +
scale_y_continuous(labels = scales::comma, limits = c(0, NA), expand = expansion(mult = c(0, 0.15))) +
scale_x_continuous(breaks = 2020:2023) +
labs(
title = "Annual Crime Counts by Severity — Los Angeles 2020–2023",
subtitle = "Part 1 = serious felonies (homicide, robbery, assault); Part 2 = less serious offenses",
x = "Year",
y = "Number of Crime Incidents",
caption = "Data source: Los Angeles Open Data Portal — LAPD Crime Data 2020–2024\nNote: 2024 excluded (partial year data only)"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(color = "grey45", size = 11),
plot.caption = element_text(color = "grey55", size = 9, hjust = 0),
legend.position = "bottom",
panel.grid.minor = element_blank()
)
What this shows: Both Part 1 and Part 2 counts rose from 2020 to 2022 and then fell in 2023. The relatively low Part 2 count in 2020 likely reflects reduced reporting during COVID-19 lockdowns, when many lower-priority incidents may have gone unreported. Part 1 counts rose steadily, by roughly 7% a year through 2022, while Part 2 counts rose modestly in 2021 and then jumped about 18% in 2022 as the city reopened.
https://public.tableau.com/app/profile/zyam.khawaja/viz/Crime_In_Los_Angeles_By_Area_2020-2024/Sheet1
The dataset we used was really big, with over a million rows and 28 columns. To start cleaning it up, we renamed the columns we planned to use. The original names had spaces, hyphens, and capital letters, which are awkward to work with because each one has to be wrapped in backticks. So, we used a function called `dplyr::rename()` to change them into a simpler format, like `crm_desc` instead of `Crm Cd Desc`, `part` instead of `Part 1-2`, and `date_occ` instead of `DATE OCC`. This was important because tidyverse code is much easier to write and read when column names are clean and simple.
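As an illustration of the renaming idea, the sketch below converts a few of the original column names to snake_case automatically. `to_snake()` is a hypothetical helper, not something used in the analysis. Its output differs from the hand-picked names in the report, giving `crm_cd_desc` where the report uses `crm_desc`.

```r
# A few original column names from the dataset
raw_names <- c("DR_NO", "Date Rptd", "DATE OCC", "Part 1-2", "Crm Cd Desc")

# Hypothetical helper: collapse runs of non-alphanumeric characters to "_",
# trim stray underscores at the ends, and lowercase everything
to_snake <- function(x) {
  x <- gsub("[^A-Za-z0-9]+", "_", x)
  x <- gsub("^_|_$", "", x)
  tolower(x)
}

clean_names <- to_snake(raw_names)
clean_names

# Applied to a data frame, the cleaned names no longer need backticks
df <- data.frame(`Part 1-2` = c(1, 2), check.names = FALSE)
names(df) <- to_snake(names(df))
df$part_1_2
```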
The next big step was to make sense of the date and time information. We had two columns, `DATE OCC` and `Date Rptd`, that were stored as text in a specific format, like "04/11/2021 12:00:00 AM". To work with this information, we needed to convert it into a format that the computer could understand as a date and time. We used a function called `lubridate::mdy_hms()` to do this, which turned the text into a POSIXct datetime object. Once we had this, we could extract specific parts of the date, like the year and month, and create new columns for each of these. We also had a `TIME OCC` column that was stored as a four-digit string in 24-hour format, like "1345", which represents 1:45 PM. To convert this into a usable hour, we converted it to an integer and used integer division by 100.
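The parsing steps described above can be sketched in base R, so the example runs without extra packages. The analysis itself uses `lubridate::mdy_hms()`. The `as.POSIXct()` format string here is an assumed equivalent for this date layout, and it relies on an English or C locale for the AM/PM marker.

```r
# Sample values in the same layout as the raw DATE OCC / TIME OCC columns
date_text <- c("04/11/2021 12:00:00 AM", "10/21/2020 12:00:00 AM")
time_text <- c("0845", "1345")

# Month/day/year with a 12-hour clock and AM/PM marker
parsed <- as.POSIXct(date_text, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

years  <- as.integer(format(parsed, "%Y"))
months <- as.integer(format(parsed, "%m"))

# "1345" -> 1345 -> hour 13 (integer division) and minute 45 (remainder)
hours   <- as.integer(time_text) %/% 100L
minutes <- as.integer(time_text) %% 100L

data.frame(parsed, years, months, hours, minutes)
```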
To clean up the data, a few steps were taken. First, we looked at the ages of the victims. Some of the ages were clearly wrong, like -4 or 120 years old. These were probably mistakes when the data was entered. So, we removed any rows where the age was less than or equal to 0 or more than 100. We did this with `filter()`, following the project rules by not using `na.omit()` or `drop_na()`. Next, we looked at the sex of the victims. The data used single letters like "M", "F", or "X" to represent male, female, or unknown. We changed these to full words like "Male", "Female", or "Unknown" to make it clearer. There were a couple of codes, "H" and "-", that didn't make sense, so we set them to missing (NA). The code "H" showed up 114 times, and "-" showed up once. We figured these were just mistakes when the data was entered, and we didn't know what they were supposed to mean.
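A minimal base-R sketch of the same two cleaning rules, using a small made-up data frame (the five rows below are illustrative, not real records). The analysis itself uses `dplyr::filter()` and `case_when()`.

```r
# Illustrative rows only: two valid ages, three invalid ones, and all five sex codes
victims <- data.frame(
  vict_age = c(31, -4, 0, 120, 63),
  vict_sex = c("M", "F", "X", "H", "-"),
  stringsAsFactors = FALSE
)

# Lookup table for the documented codes; anything else ("H", "-") becomes NA
sex_labels <- c(M = "Male", F = "Female", X = "Unknown")
victims$vict_sex <- unname(sex_labels[victims$vict_sex])

# Keep only plausible ages (greater than 0 and at most 100)
valid <- victims[victims$vict_age > 0 & victims$vict_age <= 100, ]
valid
```

Only the rows with ages 31 and 63 survive the filter. The second of those keeps its row but has a missing sex label, because "-" is not a documented code.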
To look at trends from one year to the next, a new set of data was made, called `crime_years`, which only includes information from 2020 to 2023. The reason for not using 2024 data is that it's not complete, and it would affect the way trends look. This choice was noted in the code with a comment. The data had some issues, like a lot of missing information in certain columns. For example, the type of weapon used was missing in about 67% of the cases, probably because no weapon was used. The cross street was missing in over 84% of the cases, and secondary crime codes were missing in more than 93% of the cases. Instead of removing these columns, they were changed into simple yes or no flags, like `weapon_flag`, where it made sense to do so.
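The missing-value percentages quoted above can be reproduced from the counts in the `colSums(is.na(crime))` output, and the flag conversion is a one-liner. This is a base-R sketch. The `weapon_cd` values are the first eight shown in the `glimpse()` output.

```r
# Counts taken from the missing-value summary earlier in the report
n_rows <- 1004894
missing_counts <- c(weapon_used_cd = 677678,
                    cross_street   = 850666,
                    crm_cd_2       = 935740)
pct_missing <- round(100 * missing_counts / n_rows, 1)
pct_missing

# Turning a mostly-missing code column into a flag instead of dropping it
weapon_cd <- c(NA, 200, NA, NA, NA, NA, NA, 500)
weapon_flag <- ifelse(is.na(weapon_cd), "No Weapon", "Weapon Involved")
table(weapon_flag)
```

Treating a missing weapon code as "No Weapon" is an assumption. The report itself notes that the code is only probably missing because no weapon was used.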
Crime in Los Angeles: A Closer Look
If you take a look at the crime counts in Los Angeles, you'll see that they're not spread out evenly across the city's 21 policing areas. Some areas have a lot more crime than others. Central has the most crime, followed by Southwest and 77th Street. These three areas account for a large share of the city's total crime incidents. When you use the interactive bar chart, you can hover over each area and see the exact number of crimes, the average age of the victims, and what percentage of the crimes are serious. This gives you a lot more detail than a regular chart. One thing that's interesting is that Hollywood, which is really well-known and has a lot of tourists, ranks fifth rather than first. This interactive chart is made with Plotly, which allows you to explore the data in a more detailed way. You can see that some areas have a higher share of serious crimes than others. Overall, the chart shows that crime in Los Angeles is a complex issue, and some areas need more attention than others.
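To put a number on how concentrated the counts are, the sketch below computes the share held by the three highest-count areas. The counts are copied from the area table earlier in the report, so they cover incidents with a valid victim age only.

```r
# Total incidents per LAPD area, copied from the crimes_by_area table
area_counts <- c(
  Central = 52095, Southwest = 47804, `77th Street` = 46213, Pacific = 42260,
  Hollywood = 39292, Southeast = 36525, `N Hollywood` = 36158, Olympic = 36118,
  Wilshire = 35764, Topanga = 34374, Newton = 33392, `Van Nuys` = 33390,
  `West LA` = 33165, Rampart = 33037, `West Valley` = 30912, Mission = 30132,
  Northeast = 29375, Devonshire = 29174, Harbor = 27727, Foothill = 24557,
  Hollenbeck = 24114
)

sorted <- sort(area_counts, decreasing = TRUE)
top3 <- names(sorted)[1:3]
top3_share <- 100 * sum(sorted[1:3]) / sum(sorted)

top3
round(top3_share, 1)
```

Three of 21 areas is about 14% of the areas, so comparing that figure with `top3_share` gives a rough sense of how uneven the distribution is.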
Visualization 2 (Line Chart by Year): The trend chart shows that both serious and less serious crimes increased from 2020 to 2022, then began declining in 2023. The relatively lower Part 2 count in 2020 likely reflects the early COVID-19 lockdown period, when fewer people were out in public and lower-priority crimes may have gone unreported or unrecorded. The sharp rise in 2021–2022 coincides with the city’s reopening and the well-documented national post-pandemic crime surge. The 2023 decline is an encouraging signal, though it remains to be seen whether it continues in future years.
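The year-over-year changes behind this reading of the trend can be computed directly from the yearly counts table. This is a base-R sketch, and `pct_change()` is a hypothetical helper.

```r
# Yearly counts copied from the "Crime counts by year and severity" table
yearly <- data.frame(
  year  = 2020:2023,
  part1 = c(78042, 83174, 89414, 88106),
  part2 = c(73482, 76161, 89761, 80599)
)

# Hypothetical helper: percent change from each year to the next
pct_change <- function(x) round(100 * diff(x) / head(x, -1), 1)

change <- data.frame(
  period = paste(head(yearly$year, -1), yearly$year[-1], sep = "-"),
  part1  = pct_change(yearly$part1),
  part2  = pct_change(yearly$part2)
)
change
```

Looking at the two columns side by side shows where each severity class rose or fell fastest, which is harder to judge from the line chart alone.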
The regression model had a very low adjusted R² (≈ 0.013), meaning victim age is poorly predicted by severity, hour, and weapon use alone. A more sophisticated model would incorporate neighborhood-level socioeconomic data, victim descent, and crime category clusters. For a different question, predicting crime counts by area over time, a Poisson or negative binomial regression would be more appropriate than linear regression.
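As a rough illustration of what a count model looks like in R, the sketch below fits a Poisson GLM to the eight citywide yearly counts from the trend table. The report has no area-by-year counts, so this only demonstrates the syntax. With so few points and likely overdispersion, the estimates should not be interpreted.

```r
# Citywide yearly counts from the trend table, in long format
counts <- data.frame(
  year     = rep(2020:2023, times = 2),
  severity = rep(c("Part 1", "Part 2"), each = 4),
  total    = c(78042, 83174, 89414, 88106,
               73482, 76161, 89761, 80599)
)

# Poisson regression with a log link: log(expected count) is linear in the predictors
pois_fit <- glm(total ~ I(year - 2020) + severity,
                family = poisson(link = "log"), data = counts)

# Exponentiated coefficients are rate ratios (multiplicative effects on the count)
rate_ratios <- exp(coef(pois_fit))
round(rate_ratios, 3)
```

For real use, a negative binomial fit would be the usual next step when the variance exceeds the mean. `MASS::glm.nb()` is one option.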
I would have also liked to add a map that shows where crimes are happening the most, using the location information in the data. This kind of map is really good at showing where incidents are concentrated in a city. I tried to make one using a tool called ggmap, but I needed an API key to use the map tiles from Google. Luckily, the Tableau Public visualization lets people look at the data on a map and interact with it, which fills the gap.
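One way to approximate a hot-spot view without any map API key is to bin the coordinates into a coarse grid and count incidents per cell. The sketch below uses the five complete LAT/LON pairs visible in the `glimpse()` output. The 0.1-degree cell size is an arbitrary assumption.

```r
# First five coordinate pairs shown in the glimpse() output
pts <- data.frame(
  lat = c(34.2124, 34.1993, 34.1847, 34.0339, 33.9813),
  lon = c(-118.4092, -118.4203, -118.4509, -118.3747, -118.4350)
)

# Label each point with its grid cell (coordinates rounded to one decimal place)
pts$cell <- sprintf("%.1f, %.1f", pts$lat, pts$lon)

# Count incidents per cell; the largest counts are the candidate hot spots
cell_counts <- sort(table(pts$cell), decreasing = TRUE)
cell_counts
```

On the full dataset the same counts could feed a heatmap, for example with `ggplot2::geom_tile()`, with no external map tiles needed.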
Data Source: City of Los Angeles Open Data Portal (https://data.lacity.org)