Introduction

Los Angeles is one of the most populous cities in the United States, which makes it a particularly interesting area to analyze. The primary objective of analyzing this dataset is to identify key trends, crime hotspots, and potential correlations between crime types, locations, and other factors. The findings can assist policymakers, law enforcement agencies, and community stakeholders in making data-driven decisions for crime prevention and public safety improvements.

Dataset

This dataset, obtained from Kaggle, consists of 28 variables relevant to crimes in Los Angeles from 2020 to 2024. While the dataset is generally well-structured, it contains a significant number of missing values, particularly in the following fields: Weapon Used Code and Weapon Description (676,308 missing values each); Crime Codes 2, 3, and 4 (934,330, 1,001,133, and 1,003,384 missing values, respectively). Despite these missing values, the dataset remains a valuable resource for conducting crime pattern analysis, geographic crime mapping, and understanding trends in criminal activity across Los Angeles.
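
As a quick check, these missing-value counts can be reproduced by counting NAs per column once the data are loaded; a minimal sketch, assuming the data frame df created in the code under Findings:

# Count missing values in each column and list the worst offenders first
na_counts <- sort(colSums(is.na(df)), decreasing = TRUE)
head(na_counts, 6)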

Findings

Crime Count by Crime Type

This graph presents the frequency of various crime types in Los Angeles from 2020 to 2024. The most frequently reported crimes include stolen vehicles, which account for a substantial portion of reported incidents, followed closely by battery (simple assault) and burglary from vehicles. Additionally, felony vandalism and grand theft auto consistently rank among the top-reported crimes. The “Other” category, which aggregates 573,967 less frequent crimes, indicates that while a few crime types dominate, a broad spectrum of other offenses still occurs regularly.

# I will be using the data set of crimes from 2020 to present.


setwd("C:/Users/nicos/OneDrive/Desktop/R Visualization")
library(data.table)
library(dplyr)
library(lubridate)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(plyr)   # provides round_any(); masked dplyr verbs are called with explicit dplyr:: below
library(leaflet)

filename <- "Crime_Data_from_2020_to_Present.csv"
df <- fread(filename, na.strings=c(NA, ""))

# Manually change the name of each column so there is no space between words
colnames(df) <- c("DR_NO", "Date_Rptd", "DATE_OCC", "TIME_OCC", "AREA", "AREA_NAME",
                  "Rpt_Dist_No", "Part_1_2", "Crm_Cd", "Crm_Cd_Desc", "Mocodes", "Vict_Age",
                  "Vict_Sex", "Vict_Descent", "Premis_Cd", "Premis_Desc", 
                  "Weapon_Used_Cd", "Weapon_Desc", "Status", "Status_Desc", "Crm_Cd_1",
                  "Crm_Cd_2", "Crm_Cd_3", "Crm_Cd_4", "LOCATION", "Cross_Street",
                  "LAT", "LON")

df_crimes <- dplyr::count(df, Crm_Cd_Desc)
df_crimes <- df_crimes[order(df_crimes$n, decreasing = TRUE),]

# ------------------ Data set up ------------------------

# graph: Type of crimes, data with each year stacked 


# Top 10 Crimes in my dataset 
top_crimes <- df_crimes$Crm_Cd_Desc[1:10]

# New column of year number
df$year <- year(mdy_hms(df$DATE_OCC))

new_df <- df %>%
  dplyr::filter(Crm_Cd_Desc %in% top_crimes) %>%
  dplyr::select(DATE_OCC, Crm_Cd_Desc) %>%
  dplyr::mutate(year = year(mdy_hms(DATE_OCC))) %>%
  dplyr::group_by(Crm_Cd_Desc, year) %>%
  dplyr::summarise(n = length(Crm_Cd_Desc), .groups = 'keep') %>%
  data.frame()

# long tail effect - rare crimes

other_df <- df %>%
  dplyr::filter(!Crm_Cd_Desc %in% top_crimes)  %>% 
  dplyr::select(DATE_OCC)  %>% # Selecting the date of those
  dplyr::mutate(year = year(mdy_hms(DATE_OCC)), Crm_Cd_Desc = "Other") %>%
  dplyr::group_by(Crm_Cd_Desc, year)  %>%
  dplyr::summarise(n = length(Crm_Cd_Desc), .groups = 'keep')  %>%
  data.frame()

new_df <- rbind(new_df, other_df)


# Total number of crimes per crime type
agg_tot <- new_df  %>%
  dplyr::select(Crm_Cd_Desc, n)  %>%
  dplyr::group_by(Crm_Cd_Desc)  %>%
  dplyr::summarise(tot = sum(n), .groups = 'keep')  %>% # sum up all of the n 
  data.frame()

# Year is numeric; convert it to a factor so it is treated as discrete
new_df$year <- as.factor(new_df$year)


# Round the maximum total up to the next multiple of 20,000 for the y-axis limit
max_y <- round_any(max(agg_tot$tot), 20000, ceiling)

# -------- GRAPH 1----------


# fill - what you will fill those bars with, in this case year
ggplot(new_df, aes(x = reorder(Crm_Cd_Desc, n, sum), y = n, fill= year)) +
  geom_bar(stat="identity", position = position_stack(reverse=TRUE)) + 
  coord_flip() + # flipping the axis, pivoting 
  labs(title = "Crime Count by Crime Type", x = "", y = "Crime Count", fill = "year") +
  theme_light()  + 
  theme(plot.title = element_text(hjust=0.5)) + 
  scale_fill_brewer(palette = "Paired", guide = guide_legend(reverse = TRUE)) + 
  geom_text(data = agg_tot, aes(x = Crm_Cd_Desc, y = tot, label = scales::comma(tot), fill = NULL), hjust = -0.1, size=1.7) +
  scale_y_continuous(labels = comma, 
                     breaks = seq(0, max_y, by = 25000),
                     limits = c(0, max_y))

Crimes by Day and Year (2020-2024)

The line graph showcases crime occurrences across different days of the week, highlighting Fridays and Saturdays as peak days for criminal activity, with counts reaching nearly 25,000 cases on some weekends. The trend remains consistent across years, with Sundays also exhibiting elevated crime rates, likely due to increased nightlife and weekend social activities. Conversely, Tuesdays and Wednesdays show the lowest crime rates, dropping below 15,000 reported incidents on average, which may correlate with decreased social activity and routine work schedules.

# -------- GRAPH 2 ----------

# Days per year, count of crimes
days_df <- df %>%
  dplyr::select(DATE_OCC) %>%
  dplyr::mutate(year = year(mdy_hms(DATE_OCC)),
         dayoftheweek = weekdays(mdy_hms(DATE_OCC), abbreviate = TRUE))  %>%
  dplyr::group_by(year, dayoftheweek)  %>%
  dplyr::summarise(n = length(DATE_OCC), .groups='keep') %>%
  data.frame()


# changing the year to factor
days_df$year <- as.factor(days_df$year)

# mon, tue, wed...
day_order <- factor(days_df$dayoftheweek, levels=c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))

ggplot(days_df, aes(x = day_order, y = n, group=year)) + # a line per year with group
  geom_line(aes(color=year), size=3) +
  labs(title = "Crimes by Day and by Year from 2020 to 2024", x = "Days of the Week", y= "Crime Count") +
  theme_light() +
  theme(plot.title = element_text(hjust=0.5)) +
  geom_point(shape=21, size=5, color="black", fill="white") +
  scale_y_continuous(labels=comma) +
  scale_color_brewer(palette = "Set2", name="Year", guide = guide_legend(reverse=TRUE))

Crimes by Month Per Year

The heatmap reveals seasonal crime patterns, with notable spikes in the summer months (June to August), when crime rates increase by 16.1% in certain years. December also experiences a high frequency of crimes (13.2%), potentially linked to holiday-related thefts and shopping season incidents. February and April tend to show the lowest crime rates, possibly due to fewer large-scale events and colder weather reducing outdoor activity.
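
One way to derive month-level percentages like those cited above is to compute each month's share of its year's total; a minimal sketch, assuming df and the libraries loaded earlier:

# Share of each year's crimes that occurred in each month
month_share <- df %>%
  dplyr::mutate(year = year(mdy_hms(DATE_OCC)),
                month = months(mdy_hms(DATE_OCC), abbreviate = TRUE)) %>%
  dplyr::count(year, month) %>%
  dplyr::group_by(year) %>%
  dplyr::mutate(percent = round(100 * n / sum(n), 1)) %>%
  dplyr::ungroup()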

# ---------- HEATMAPS --------------

months_df <- df %>%
  dplyr::select(DATE_OCC) %>%
  dplyr::mutate(year = year(mdy_hms(DATE_OCC)),
                months = months(mdy_hms(DATE_OCC), abbreviate = TRUE))  %>%
  dplyr::group_by(year, months)  %>%
  dplyr::summarise(n = length(DATE_OCC), .groups='keep') %>%
  data.frame()

# Change year to a factor; we do not want it treated as continuous
months_df$year <- factor(months_df$year)

# Month labels in calendar order
mymonths <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct',
              'Nov', 'Dec')
# Convert the month column to a factor with those levels so the axis follows calendar order
months_df$months <- factor(months_df$months, levels = mymonths)

# Correcting the breaks of the heatmap, the legend at the right
breaks <- c(seq(0, max(months_df$n), by=3000))

heatmap <- ggplot(months_df, aes(x = year, y = months, fill = n)) +
  geom_tile(color="black") +
  geom_text(aes(label=comma(n))) + 
  coord_equal(ratio = 0.3) +
  labs(title = "Heatmap: Crimes by Month per Year",
       x = "Year",
       y = "Month",
       fill = "Crime Count") + 
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5)) +
  # Reverse the month order so January appears at the top of the heatmap
  scale_y_discrete(limits = rev(levels(months_df$months))) +
  # Colors of the heat map, adding breaks 
  scale_fill_continuous(low="white", high="deeppink3", breaks = breaks) +
  # border of the breaks - legend at the right
  guides(fill = guide_legend(reverse=TRUE, override.aes=list(colour="black")))

heatmap

Crime Committed per Area

This visualization breaks down crime distribution across different precincts. The 77th Street and Newton divisions consistently report the highest crime rates, with some areas seeing crime percentages exceeding 12% of the city’s total reported incidents. Other high-crime regions include the Southeast and Southwest divisions, where theft and violent crimes are more concentrated. Lower crime rates appear in Westside precincts, a pattern that aligns with demographic and socioeconomic differences across neighborhoods.
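
The pie charts show the mix of top crimes within each area; to quantify an area's share of the citywide total, each area's incident count can also be expressed as a percentage of all reported incidents. A minimal sketch, assuming df and the libraries loaded earlier:

# Each area's share of all reported incidents in the dataset
area_share <- df %>%
  dplyr::count(AREA_NAME) %>%
  dplyr::mutate(percent_of_city = round(100 * n / sum(n), 1)) %>%
  dplyr::arrange(dplyr::desc(percent_of_city))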

# ----------- Pie Charts showing Crime Distribution by Area ----------------------


df_crime_area_count <- df %>%
  dplyr::filter(Crm_Cd_Desc %in% top_crimes) %>%
  dplyr::select(AREA_NAME, Crm_Cd_Desc) %>%
  dplyr::group_by(AREA_NAME, Crm_Cd_Desc) %>%
  dplyr::summarise(n = length(Crm_Cd_Desc), .groups = 'keep') %>%
  dplyr::group_by(AREA_NAME) %>%
  dplyr::mutate(percent_of_total = round(100*n/sum(n),1)) %>% # percentage of each crime type within the area
  dplyr::ungroup()  %>%
  data.frame()


ggplot(data = df_crime_area_count, aes(x="", y=n, fill=Crm_Cd_Desc)) +
  # Build a stacked bar chart first; coord_polar below turns it into a pie chart
  geom_bar(stat="identity", position="fill") +
  coord_polar(theta="y", start=0) +
  labs(fill = "Crime", x = NULL, y=NULL, 
       title = "Crime Committed per Area") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        legend.position = "bottom") +
  facet_wrap(~AREA_NAME, ncol=7, nrow=3) +
  scale_fill_brewer(palette = "Paired") +
  geom_text(aes(x=1.6, label = ifelse(percent_of_total>5, paste0(percent_of_total, "%"),"")),
            size = 2.5,
            position=position_fill(vjust = 0.5))

Map of Six Points (Two per Top Area) Where a Vehicle Was Stolen

This geospatial visualization pinpoints specific locations with high vehicle theft rates. The 77th Street, Newton, and Southeast areas collectively report more than 24,000 stolen vehicle cases over the 2020-2024 period, with Newton alone contributing over 8,000 cases. The clusters of theft incidents suggest patterns where vehicle security measures may need enhancement, particularly in high-density residential areas and commercial parking lots.

The analysis of Los Angeles crime data from 2020 to the present highlights several significant patterns and insights. Crime frequency varies by type, with vehicle theft and burglary remaining among the most prevalent offenses, affecting thousands of residents each year. Geographic analysis identifies hotspots such as the 77th Street and Newton divisions, where crime rates significantly surpass those of other areas.

# ------- GRAPH 5, Map of Vehicles Stolen in Top 3 Areas -----------

# Our top crime is Stolen Vehicles

df_vehicles_area_count <- df %>%
  dplyr::filter(Crm_Cd_Desc %in% "VEHICLE - STOLEN") %>%
  dplyr::select(AREA_NAME) %>%
  dplyr::group_by(AREA_NAME) %>%
  dplyr::summarise(n = length(AREA_NAME), .groups = 'keep') %>%
  data.frame()

df_vehicles_area_count <- df_vehicles_area_count[order(-df_vehicles_area_count$n), ]


df_selected_areas <- df%>%
  dplyr::filter(Crm_Cd_Desc %in% "VEHICLE - STOLEN",
                AREA_NAME %in% c("77th Street", "Newton", "Southeast")) %>%
  dplyr::select(AREA_NAME, Crm_Cd_Desc, LAT, LON) %>%
  group_by(AREA_NAME) %>%
  slice(1:2) %>%
  ungroup() %>%
  data.frame()

# Total stolen-vehicle count per area (from df_vehicles_area_count), repeated
# once for each of the two mapped points in that area
df_selected_areas$n <- c(8766, 8766, 8281, 8281, 7242, 7242)
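
Rather than hard-coding these totals, they could also be looked up directly from df_vehicles_area_count; a minimal sketch using the data frames built above:

# Look up each area's stolen-vehicle total by matching on AREA_NAME
df_selected_areas$n <- df_vehicles_area_count$n[
  match(df_selected_areas$AREA_NAME, df_vehicles_area_count$AREA_NAME)]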

# Creating the Map

m <- leaflet() %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  setView(lng = -118.2587, lat = 33.9747, zoom = 12 ) %>%
  addCircles(
    lng = subset(df_selected_areas, AREA_NAME == '77th Street')$LON,
    lat = subset(df_selected_areas, AREA_NAME == '77th Street')$LAT,
    opacity = 1,
    color = "red",
    popup = paste(row.names(subset(df_selected_areas, AREA_NAME == '77th Street')), subset(df_selected_areas, AREA_NAME == '77th Street')$n),
    radius = 200
  ) %>%
  addCircles(
    lng = subset(df_selected_areas, AREA_NAME == 'Newton')$LON,
    lat = subset(df_selected_areas, AREA_NAME == 'Newton')$LAT,
    opacity = 1,
    color = "blue",
    popup = paste(row.names(subset(df_selected_areas, AREA_NAME == 'Newton')), subset(df_selected_areas, AREA_NAME == 'Newton')$n),
    radius = 125
  ) %>%
  addCircles(
    lng = subset(df_selected_areas, AREA_NAME == 'Southeast')$LON,
    lat = subset(df_selected_areas, AREA_NAME == 'Southeast')$LAT,
    opacity = 1,
    color = "green",
    popup = paste(row.names(subset(df_selected_areas, AREA_NAME == 'Southeast')), subset(df_selected_areas, AREA_NAME == 'Southeast')$n),
    radius = 80
  )
m

Conclusion

These findings underscore the importance of targeted crime prevention strategies, including increased law enforcement presence in high-risk areas and public safety initiatives focusing on peak crime days and seasons. Further research into the socio-economic factors driving crime in specific regions could enhance intervention efforts. While this dataset provides valuable insights, additional contextual information, such as policing strategies and community engagement programs, could further refine our understanding of crime trends in Los Angeles. The data serves as a vital tool for policymakers, law enforcement, and community leaders in crafting informed policies and ensuring public safety.