This dataset comprises crime incidents reported in the City of Cambridge, as featured in the Cambridge Police Department’s Annual Crime Reports. The crime dates in the file run from 1980 to 2024, although almost all records date from 2009 onward. The data provides detailed information about crime types and their occurrence across the neighborhoods of Cambridge.
The objective is to identify crime trends in Cambridge and the most frequently reported crime type in each neighborhood. The analysis addresses the following questions:
i Which crime occurred most?
ii Which crime is reported most in each neighborhood?
iii How has crime changed over the years?
iv What share of crime does each neighborhood account for (pie chart)?
v Which month has the highest crime in each year?
vi What is the monthly crime trend in each year?
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.5.2
crime_report<-read.csv("C:/Users/HomePC/Desktop/Raheem.R/Crime_Reports G.csv", na.strings = c("NA", ""))
head(crime_report) # show the first 6 rows of the data set
## File.Number Date.of.Report Crime.Date.Time
## 1 2009-01323 02/21/2009 09:53:00 AM 02/21/2009 09:20 - 09:30
## 2 2009-01324 02/21/2009 09:59:00 AM 02/20/2009 22:30 - 02/21/2009 10:00
## 3 2009-01327 02/21/2009 12:32:00 PM 02/19/2009 21:00 - 02/21/2009 12:00
## 4 2009-01331 02/21/2009 03:05:00 PM 02/21/2009 15:00 - 15:10
## 5 2009-01346 02/22/2009 05:02:00 AM 02/22/2009 05:02
## 6 2009-01357 02/22/2009 09:39:00 PM 02/22/2009 21:39 - 21:45
## Crime Reporting.Area Neighborhood
## 1 Threats 105 East Cambridge
## 2 Auto Theft 1109 North Cambridge
## 3 Hit and Run 1109 North Cambridge
## 4 Larceny (Misc) 1303 Strawberry Hill
## 5 OUI 105 East Cambridge
## 6 Aggravated Assault 1109 North Cambridge
## Location
## 1 100 OTIS ST, Cambridge, MA
## 2 400 RINDGE AVE, Cambridge, MA
## 3 400 RINDGE AVE, Cambridge, MA
## 4 0 NORUMBEGA ST, Cambridge, MA
## 5 FIFTH ST & GORE ST, Cambridge, MA
## 6 400 RINDGE AVE, Cambridge, MA
tail(crime_report) # show the last 6 rows of the data set
## File.Number Date.of.Report Crime.Date.Time
## 95918 2024-03751 05/07/2024 12:48:00 PM 05/03/2024 12:47 - 05/07/2024 12:47
## 95919 2024-03755 05/07/2024 01:13:00 PM 05/04/2024 12:00 - 18:00
## 95920 2024-03756 05/07/2024 02:41:00 PM 05/07/2024 14:40 - 14:41
## 95921 2024-03777 05/07/2024 08:13:00 PM 05/07/2024 15:00 - 19:15
## 95922 2024-03806 05/08/2024 04:09:00 PM 05/07/2024 04:00 - 04:05
## 95923 2024-03824 05/09/2024 10:23:00 AM 05/05/2024 11:30 - 13:00
## Crime Reporting.Area Neighborhood
## 95918 Forgery 411 Area 4
## 95919 Larceny from MV 411 Area 4
## 95920 Accident 611 Mid-Cambridge
## 95921 Larceny of Bicycle 411 Area 4
## 95922 Larceny from MV 1005 West Cambridge
## 95923 Hit and Run 1204 Highlands
## Location
## 95918 100 BISHOP ALLEN DR, Cambridge, MA
## 95919 100 BISHOP ALLEN DR, Cambridge, MA
## 95920 MASSACHUSETTS AVE & PEABODY ST, Cambridge, MA
## 95921 0 COLUMBIA ST, Cambridge, MA
## 95922 0 FOSTER PL, Cambridge, MA
## 95923 200 Alewife Brook Pky, Cambridge, MA
str(crime_report) # to show the internal structure and data types
## 'data.frame': 95923 obs. of 7 variables:
## $ File.Number : chr "2009-01323" "2009-01324" "2009-01327" "2009-01331" ...
## $ Date.of.Report : chr "02/21/2009 09:53:00 AM" "02/21/2009 09:59:00 AM" "02/21/2009 12:32:00 PM" "02/21/2009 03:05:00 PM" ...
## $ Crime.Date.Time: chr "02/21/2009 09:20 - 09:30" "02/20/2009 22:30 - 02/21/2009 10:00" "02/19/2009 21:00 - 02/21/2009 12:00" "02/21/2009 15:00 - 15:10" ...
## $ Crime : chr "Threats" "Auto Theft" "Hit and Run" "Larceny (Misc)" ...
## $ Reporting.Area : int 105 1109 1109 1303 105 1109 501 501 1108 105 ...
## $ Neighborhood : chr "East Cambridge" "North Cambridge" "North Cambridge" "Strawberry Hill" ...
## $ Location : chr "100 OTIS ST, Cambridge, MA" "400 RINDGE AVE, Cambridge, MA" "400 RINDGE AVE, Cambridge, MA" "0 NORUMBEGA ST, Cambridge, MA" ...
summary(crime_report) # give a quick statistical summary of each column
## File.Number Date.of.Report Crime.Date.Time Crime
## Length:95923 Length:95923 Length:95923 Length:95923
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Reporting.Area Neighborhood Location
## Min. : 101.0 Length:95923 Length:95923
## 1st Qu.: 406.0 Class :character Class :character
## Median : 604.0 Mode :character Mode :character
## Mean : 632.5
## 3rd Qu.: 912.0
## Max. :1304.0
## NA's :8
dim(crime_report)
## [1] 95923 7
I removed the Date.of.Report column because the Crime.Date.Time column already contains the date on which the crime occurred.
crime_report<-select(crime_report,-Date.of.Report)
The Crime column contains entries labelled "Admin Error", which means that no crime type was recorded, so I remove those rows after handling the missing values.
# HANDLING MISSING VALUES
colSums(is.na(crime_report))
## File.Number Crime.Date.Time Crime Reporting.Area Neighborhood
## 0 11 0 8 8
## Location
## 295
# removing using na.omit
crime_report <- na.omit(crime_report)
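na.omit() drops a row when any column is missing, so rows that lack only a Location are removed together with rows that lack a crime date. The following is a minimal sketch on a made-up three-row data frame (the rows are hypothetical, only the column names match crime_report) that contrasts this with dropping rows only when the crime date is missing:

```r
# Toy data frame (hypothetical rows) with column names taken from crime_report
toy <- data.frame(
  Crime.Date.Time = c("02/21/2009 09:20 - 09:30", NA, "02/22/2009 05:02"),
  Crime           = c("Threats", "Auto Theft", "OUI"),
  Location        = c("100 OTIS ST, Cambridge, MA", "400 RINDGE AVE, Cambridge, MA", NA),
  stringsAsFactors = FALSE
)

# na.omit() removes a row if ANY column is NA
all_complete <- na.omit(toy)

# Alternative: remove a row only if the crime date is missing
date_known <- toy[!is.na(toy$Crime.Date.Time), ]

nrow(all_complete)  # rows with no missing value in any column
nrow(date_known)    # rows with a known crime date
```

Which variant is preferable depends on whether Location is needed later; this report does not use it in any plot, so the column-specific filter would keep more rows.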
# Check which crime categories appear and how often
table(crime_report$Crime)
##
## Accident Admin Error Aggravated Assault
## 2871 2857 2399
## Annoying & Accosting Arson Auto Theft
## 148 148 1941
## Commercial Break Commercial Robbery Counterfeiting
## 1014 368 257
## Disorderly Domestic Dispute Drinking in Public
## 494 10 362
## Drugs Embezzlement Extortion/Blackmail
## 1084 157 164
## Flim Flam Forgery Gambling
## 2583 6109 4
## Harassment Hit and Run Homicide
## 1521 9121 21
## Housebreak Indecent Exposure Kidnapping
## 3804 390 34
## Larceny (Misc) Larceny from Building Larceny from MV
## 702 4464 7369
## Larceny from Person Larceny from Residence Larceny of Bicycle
## 3228 3953 6292
## Larceny of Plate Larceny of Services Liquor Possession/Sale
## 434 303 74
## Mal. Dest. Property Missing Person Noise Complaint
## 6130 1953 165
## OUI Peeping & Spying Phone Calls
## 590 81 527
## Prostitution Rec. Stol. Property Sex Offender Violation
## 73 326 94
## Shoplifting Simple Assault Stalking
## 5610 4201 41
## Street Robbery Suspicious Package Taxi Violation
## 1223 1125 404
## Threats Trespassing Violation of H.O.
## 2766 746 298
## Violation of R.O. Warrant Arrest Weapon Violations
## 7 4405 167
# Remove rows where Crime is "Admin Error"
crime_report <- crime_report %>%
filter(Crime != "Admin Error")
# Check that they are gone
table(crime_report$Crime)
##
## Accident Aggravated Assault Annoying & Accosting
## 2871 2399 148
## Arson Auto Theft Commercial Break
## 148 1941 1014
## Commercial Robbery Counterfeiting Disorderly
## 368 257 494
## Domestic Dispute Drinking in Public Drugs
## 10 362 1084
## Embezzlement Extortion/Blackmail Flim Flam
## 157 164 2583
## Forgery Gambling Harassment
## 6109 4 1521
## Hit and Run Homicide Housebreak
## 9121 21 3804
## Indecent Exposure Kidnapping Larceny (Misc)
## 390 34 702
## Larceny from Building Larceny from MV Larceny from Person
## 4464 7369 3228
## Larceny from Residence Larceny of Bicycle Larceny of Plate
## 3953 6292 434
## Larceny of Services Liquor Possession/Sale Mal. Dest. Property
## 303 74 6130
## Missing Person Noise Complaint OUI
## 1953 165 590
## Peeping & Spying Phone Calls Prostitution
## 81 527 73
## Rec. Stol. Property Sex Offender Violation Shoplifting
## 326 94 5610
## Simple Assault Stalking Street Robbery
## 4201 41 1223
## Suspicious Package Taxi Violation Threats
## 1125 404 2766
## Trespassing Violation of H.O. Violation of R.O.
## 746 298 7
## Warrant Arrest Weapon Violations
## 4405 167
# Check again for missing values
sum(is.na(crime_report)) # no missing values remain
## [1] 0
head(crime_report)
## File.Number Crime.Date.Time Crime
## 1 2009-01323 02/21/2009 09:20 - 09:30 Threats
## 2 2009-01324 02/20/2009 22:30 - 02/21/2009 10:00 Auto Theft
## 3 2009-01327 02/19/2009 21:00 - 02/21/2009 12:00 Hit and Run
## 4 2009-01331 02/21/2009 15:00 - 15:10 Larceny (Misc)
## 5 2009-01346 02/22/2009 05:02 OUI
## 6 2009-01357 02/22/2009 21:39 - 21:45 Aggravated Assault
## Reporting.Area Neighborhood Location
## 1 105 East Cambridge 100 OTIS ST, Cambridge, MA
## 2 1109 North Cambridge 400 RINDGE AVE, Cambridge, MA
## 3 1109 North Cambridge 400 RINDGE AVE, Cambridge, MA
## 4 1303 Strawberry Hill 0 NORUMBEGA ST, Cambridge, MA
## 5 105 East Cambridge FIFTH ST & GORE ST, Cambridge, MA
## 6 1109 North Cambridge 400 RINDGE AVE, Cambridge, MA
I used the lubridate package to convert the crime date column to a date-time that R recognizes. Many entries hold a time window, so only the start of the window is kept. I also extracted the year, month, hour, and weekday of each crime. Before this step I checked for missing values and handled them, including those in the Reporting.Area column.
crime_report <- crime_report %>%
mutate(
# Clean Crime.Date.Time: keep only first date
Crime.Date.Time= str_trim(str_extract(Crime.Date.Time, "^[^-]+")),
Crime.Date.Time= parse_date_time(Crime.Date.Time,
orders = c("mdy HM", "mdy HMS", "mdy H", "mdy")),
# Extract useful parts AFTER cleaning
Crime_Year = year(Crime.Date.Time),
Crime_Hour = hour(Crime.Date.Time),
Crime_Month = month(Crime.Date.Time, label = TRUE),
Crime_Weekday = wday(Crime.Date.Time, label = TRUE)
)
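To make the cleaning step easier to follow, here is a small sketch of what the regular expression and the parser do to three values copied from the Crime.Date.Time output above. It assumes the stringr and lubridate packages are installed, and for simplicity it passes only the "mdy HM" order, whereas the report's call lists several orders.

```r
library(stringr)
library(lubridate)

# Example strings copied from the Crime.Date.Time column shown above
raw <- c(
  "02/21/2009 09:20 - 09:30",
  "02/20/2009 22:30 - 02/21/2009 10:00",
  "02/22/2009 05:02"
)

# Keep everything before the first hyphen, i.e. the start of the time window
start_text <- str_trim(str_extract(raw, "^[^-]+"))

# Parse month/day/year followed by hour:minute (the time zone defaults to UTC)
start_time <- parse_date_time(start_text, orders = "mdy HM")

# Show the raw value next to the cleaned text and the extracted hour
data.frame(raw, start_text, hour = hour(start_time))
```

Keeping only the start of the window means that a crime recorded as spanning several days is assigned to its first day.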
# Arranging year and month for crime date
crime_report <- crime_report %>%
arrange(Crime_Year, Crime_Month)
crime_report$Crime<-as.factor(crime_report$Crime)
crime_report$Neighborhood<-as.factor(crime_report$Neighborhood)
crime_report$Location<-as.factor(crime_report$Location)
crime_report$Crime.Date.Time<-as.factor(crime_report$Crime.Date.Time)
str(crime_report)
## 'data.frame': 92755 obs. of 10 variables:
## $ File.Number : chr "2010-04009" "2012-03794" "2018-03218" "2012-04627" ...
## $ Crime.Date.Time: Factor w/ 86994 levels "1980-01-01","1980-01-01 12:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Crime : Factor w/ 53 levels "Accident","Aggravated Assault",..: 16 25 16 16 16 15 44 16 16 16 ...
## $ Reporting.Area : int 1002 606 1101 702 304 1002 607 1012 101 1106 ...
## $ Neighborhood : Factor w/ 13 levels "Agassiz","Area 4",..: 13 7 9 11 6 13 7 13 4 9 ...
## $ Location : Factor w/ 5079 levels "0 ABERDEEN AVE, Cambridge, MA",..: 577 1937 77 1809 1134 1384 1030 1229 1696 1452 ...
## $ Crime_Year : num 1980 1980 1990 1993 1993 ...
## $ Crime_Hour : int 0 12 12 9 19 14 0 0 10 19 ...
## $ Crime_Month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 6 1 10 1 12 1 1 3 ...
## $ Crime_Weekday : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 6 6 6 1 4 7 1 4 ...
## - attr(*, "na.action")= 'omit' Named int [1:311] 172 480 768 1337 1513 2105 2155 3028 3337 3401 ...
## ..- attr(*, "names")= chr [1:311] "172" "480" "768" "1337" ...
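I converted the categorical columns from character to factor because R represents categorical data as factors. Note that Crime.Date.Time is also converted to a factor in this step, so it is no longer a date-time afterwards; the extracted year, month, hour, and weekday columns are used from here on.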
# Count how often each crime type occurs
crime_summary <- crime_report %>%
  group_by(Crime) %>%
  summarise(Total_Cases = n()) %>%  # count how many times each crime appears
  arrange(desc(Total_Cases))        # sort from highest to lowest
# Extract the top 10 crimes
top_crimes <- crime_summary %>%
  slice_head(n = 10)  # select the top 10
# View top_crimes
top_crimes
## # A tibble: 10 × 2
## Crime Total_Cases
## <fct> <int>
## 1 Hit and Run 9121
## 2 Larceny from MV 7369
## 3 Larceny of Bicycle 6292
## 4 Mal. Dest. Property 6130
## 5 Forgery 6109
## 6 Shoplifting 5610
## 7 Larceny from Building 4464
## 8 Warrant Arrest 4405
## 9 Simple Assault 4201
## 10 Larceny from Residence 3953
# Plot top 10
ggplot(top_crimes, aes(x = reorder(Crime, Total_Cases), y = Total_Cases, fill = Crime)) +
geom_col(show.legend = FALSE) + # Draw bars without legend
geom_text(aes(label = Total_Cases), hjust = -0.1, size = 3.5) + # Add data labels
coord_flip() + # Flip axes for horizontal bars
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + # Add padding to avoid label cutoff
theme_minimal() + # Use minimal theme
labs(title = "Top 10 Reported Crimes in Cambridge",
x = "Crime Type",
y = "Total Cases")
I grouped the data by crime type, counted the cases in each group, and sorted the result from highest to lowest. I then kept the 10 most frequent crimes and plotted them on a bar chart.
Hit and run was the most reported crime in Cambridge, followed by several categories of larceny.
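The ranking can also be read as shares of all cleaned records. The sketch below is plain arithmetic on counts copied from the top-10 table and on the 92,755 rows reported by str(); it is an illustration, not part of the original analysis.

```r
# Counts copied from the top-10 table above
top_counts <- c(
  "Hit and Run"            = 9121,
  "Larceny from MV"        = 7369,
  "Larceny of Bicycle"     = 6292,
  "Larceny from Building"  = 4464,
  "Larceny from Residence" = 3953
)

# Row count of the cleaned data frame, as reported by str(crime_report)
total_rows <- 92755

# Percentage of all cleaned records in each category
share_pct <- round(100 * top_counts / total_rows, 1)
share_pct

# Combined share of the four larceny categories that appear in the top 10
is_larceny  <- grepl("^Larceny", names(top_counts))
larceny_pct <- round(100 * sum(top_counts[is_larceny]) / total_rows, 1)
larceny_pct
```

Taken together, the four larceny categories in the top 10 account for a larger share of records than hit and run does on its own.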
# Highest crime by Neighborhood
# Top 5 Neighborhoods and Top 5 Crimes
top_crime_plot <- crime_report %>%
# Summarize counts per neighborhood and crime
group_by(Neighborhood, Crime) %>%
summarise(Total_Cases = n(), .groups = "drop") %>%
# Keep only top 5 neighborhoods and top 5 crimes overall
filter(Neighborhood %in% (crime_report %>% count(Neighborhood) %>% top_n(5, n) %>% pull(Neighborhood)),
Crime %in% (crime_report %>% count(Crime) %>% top_n(5, n) %>% pull(Crime))) %>%
arrange(desc(Total_Cases))
top_crime_plot
## # A tibble: 25 × 3
## Neighborhood Crime Total_Cases
## <fct> <fct> <int>
## 1 North Cambridge Hit and Run 1124
## 2 Cambridgeport Larceny from MV 1082
## 3 Cambridgeport Larceny of Bicycle 994
## 4 Cambridgeport Hit and Run 962
## 5 East Cambridge Hit and Run 955
## 6 Mid-Cambridge Hit and Run 940
## 7 Mid-Cambridge Larceny from MV 901
## 8 East Cambridge Forgery 866
## 9 Area 4 Hit and Run 840
## 10 North Cambridge Larceny from MV 796
## # ℹ 15 more rows
# Visualize with facet wrap by Crime and sort within each facet
ggplot(top_crime_plot, aes(x = reorder_within(Neighborhood, Total_Cases, Crime), y = Total_Cases, fill = Crime)) +
geom_col(show.legend = FALSE) + # Draw bars without legend
coord_flip() + # Flip axes for horizontal bars
facet_wrap(~ Crime, scales = "free_y", ncol = 2) + # Facet by crime type
scale_x_reordered() + # Fix axis labels after reorder_within
theme_minimal() + # Use minimal theme
labs(
title = "Top 5 Neighborhoods by Top 5 Crimes",
x = "Neighborhood",
y = "Total Cases"
)
OBSERVATIONS
1 East Cambridge records the most forgery incidents of the five neighborhoods. A likely reason is that it contains many banks, businesses, and retail stores, where frequent financial transactions create more opportunities for forgery.
2 North Cambridge records the most hit and run incidents, possibly because of its heavy traffic flow, large parking areas, and frequent interaction between commuter traffic and residential streets, which increases the chance of collisions and of drivers fleeing the scene.
3 Cambridgeport records the most larceny and malicious destruction of property incidents, possibly because of its busy commercial activity and dense street parking, which make it easier to break into parked cars unnoticed.
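The leading neighborhood for each crime can also be read off programmatically rather than from the facets. This is a minimal sketch, assuming dplyr is installed, that uses only the ten rows of top_crime_plot printed above; the other 15 rows are not shown in this report, so crimes that do not appear in those ten rows (such as Mal. Dest. Property) are absent here.

```r
library(dplyr)

# The ten rows of top_crime_plot printed above
shown <- data.frame(
  Neighborhood = c("North Cambridge", "Cambridgeport", "Cambridgeport",
                   "Cambridgeport", "East Cambridge", "Mid-Cambridge",
                   "Mid-Cambridge", "East Cambridge", "Area 4", "North Cambridge"),
  Crime        = c("Hit and Run", "Larceny from MV", "Larceny of Bicycle",
                   "Hit and Run", "Hit and Run", "Hit and Run",
                   "Larceny from MV", "Forgery", "Hit and Run", "Larceny from MV"),
  Total_Cases  = c(1124, 1082, 994, 962, 955, 940, 901, 866, 840, 796)
)

# For each crime, keep the neighborhood with the most cases
leaders <- shown %>%
  group_by(Crime) %>%
  slice_max(Total_Cases, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(desc(Total_Cases))

leaders
```

Because the printed table is sorted in descending order, the first row for each crime is that crime's maximum across all 25 rows, so the leaders found here hold for the full table as well.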
I then examined the crime trend over the years. The sharp difference between the years 1980 to 2008 and the later years reflects incomplete or missing crime records in the earlier period. Consistent data collection appears to have started around 2009, which produces a sudden increase in recorded cases. The near-zero counts before 2009 therefore reflect limited or inconsistent reporting rather than an actual absence of crime.
# Summarize yearly crime counts
crime_summary <- crime_report %>%
group_by(Crime_Year) %>%
summarise(Total_Cases = n()) %>%
arrange(Crime_Year)
crime_summary
## # A tibble: 30 × 2
## Crime_Year Total_Cases
## <dbl> <int>
## 1 1980 2
## 2 1990 1
## 3 1993 2
## 4 1995 1
## 5 1999 1
## 6 2000 5
## 7 2001 14
## 8 2002 4
## 9 2003 2
## 10 2004 16
## # ℹ 20 more rows
# Keep only records from 2009 onward before plotting the yearly trend
crime_report1 <- crime_report %>% filter(Crime_Year > 2008)
head(crime_report1)
## File.Number Crime.Date.Time Crime Reporting.Area
## 1 2009-00323 2009-01-10 23:00:00 Larceny from Building 707
## 2 2009-00330 2009-01-13 07:45:00 Housebreak 509
## 3 2009-00340 2009-01-13 08:24:00 Hit and Run 509
## 4 2009-00341 2009-01-10 12:00:00 Larceny from MV 801
## 5 2009-00345 2009-01-13 09:00:00 Hit and Run 1007
## 6 2009-00352 2009-01-06 15:00:00 Larceny from Building 708
## Neighborhood Location Crime_Year Crime_Hour
## 1 Riverside 100 Mount Auburn St, Cambridge, MA 2009 23
## 2 Cambridgeport 200 MAGAZINE ST, Cambridge, MA 2009 7
## 3 Cambridgeport 0 GRANITE ST, Cambridge, MA 2009 8
## 4 Agassiz 100 KIRKLAND ST, Cambridge, MA 2009 12
## 5 West Cambridge 0 HILLIARD ST, Cambridge, MA 2009 9
## 6 Riverside 0 JFK ST, Cambridge, MA 2009 15
## Crime_Month Crime_Weekday
## 1 Jan Sat
## 2 Jan Tue
## 3 Jan Tue
## 4 Jan Sat
## 5 Jan Tue
## 6 Jan Tue
crime_summary1 <- crime_report1 %>%
group_by(Crime_Year) %>%
summarise(Total_Cases = n()) %>%
arrange(Crime_Year)
crime_summary1
## # A tibble: 16 × 2
## Crime_Year Total_Cases
## <dbl> <int>
## 1 2009 6515
## 2 2010 6474
## 3 2011 6433
## 4 2012 6144
## 5 2013 6285
## 6 2014 6179
## 7 2015 6041
## 8 2016 5668
## 9 2017 5436
## 10 2018 5409
## 11 2019 5395
## 12 2020 5731
## 13 2021 5484
## 14 2022 5956
## 15 2023 6820
## 16 2024 2566
ggplot(crime_summary1, aes(x = Crime_Year, y = Total_Cases)) +
geom_line(color = "red3", linewidth = 1, group = 1) +
geom_point(color = "black", size = 1) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_x_continuous(limits = c(2009, 2024), breaks = 2009:2024) +
labs(
title = "Yearly Crime Trend in Cambridge",
x = "Year",
y = "Number of Crimes"
)
The large drop in 2024 in this plot is not a real drop in crime. It reflects incomplete data: the Kaggle dataset covers only part of 2024, and the latest report in the file is dated May 9, 2024.
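One rough way to check this is to compare average recorded crimes per day instead of yearly totals. The sketch below uses the 2023 and 2024 totals from crime_summary1 and assumes that 2024 is covered from January 1 through May 9, the latest report date visible in the tail() output; that coverage window is an assumption, not something the data set states.

```r
# Yearly totals copied from crime_summary1 above
cases_2023 <- 6820
cases_2024 <- 2566

# Assumption: 2024 is covered from January 1 through May 9,
# the latest report date visible in the tail() output
days_2023 <- as.integer(as.Date("2023-12-31") - as.Date("2022-12-31"))
days_2024 <- as.integer(as.Date("2024-05-09") - as.Date("2023-12-31"))

# Average recorded crimes per day in each period
rate_2023 <- cases_2023 / days_2023
rate_2024 <- cases_2024 / days_2024

round(c(rate_2023 = rate_2023, rate_2024 = rate_2024), 1)
```

Under that assumption the two daily rates are close to each other, which is consistent with the 2024 drop being an effect of partial coverage.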
crime_report %>%
count(Crime_Year, Crime_Month) %>%
ggplot(aes(x = Crime_Month, y = Crime_Year, fill = n)) +
geom_tile(color = "white") +
# Use multiple distinct colors for better contrast
scale_fill_gradientn(
colours = c("blue", "cyan", "yellow", "orange", "red", "darkred"),
values = scales::rescale(c(0, 100, 200, 300, 400, 600)), # Adjust based on your data range
guide = "colorbar"
) +
theme_minimal() +
labs(
title = "Heatmap: Crime Frequency by Year and Month",
x = "Month",
y = "Year",
fill = "Total Crimes"
)
This heatmap shows crime frequency by year and month. Blue marks low crime frequency and dark red marks high crime frequency. The red zone between 2009 and 2023 shows the years and months in which Cambridge had consistently high crime activity, especially in spring (April to June) and fall (October to November). The blue areas in earlier years mostly reflect low or incomplete data.
Before 2009 the data contains very few records per year.
# Filter to years with complete monthly data
complete_years <- crime_report1 %>%
count(Crime_Year, Crime_Month) %>%
count(Crime_Year) %>%
filter(n == 12) %>%
pull(Crime_Year)
# Use only complete years
crime_report1 %>%
filter(Crime_Year %in% complete_years) %>%
count(Crime_Year, Crime_Month) %>%
ggplot(aes(x = Crime_Month, y = n, group = Crime_Year)) +
geom_line(color = "red", linewidth = 1) +
facet_wrap(~ Crime_Year, ncol = 4) +
theme_minimal() +
labs(
title = "Monthly Crime Trend by Year (Complete Years Only)",
x = "Month",
y = "Crime Count"
)
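The line charts show the monthly pattern, but the question of which month has the highest crime in each year can also be answered with a table. This is a minimal sketch, assuming dplyr is installed, on made-up monthly counts (the numbers are hypothetical and only illustrate the technique):

```r
library(dplyr)

# Hypothetical monthly counts for two years (made-up numbers, for illustration only)
monthly <- data.frame(
  Crime_Year  = rep(c(2009, 2010), each = 4),
  Crime_Month = factor(rep(c("Jun", "Jul", "Aug", "Sep"), times = 2),
                       levels = month.abb, ordered = TRUE),
  Total_Cases = c(560, 610, 590, 575, 540, 565, 600, 580)
)

# Keep the month with the largest count within each year
peak_months <- monthly %>%
  group_by(Crime_Year) %>%
  slice_max(Total_Cases, n = 1, with_ties = FALSE) %>%
  ungroup()

peak_months
```

On the real data, the same grouping could be applied to the output of count(Crime_Year, Crime_Month, name = "Total_Cases") on crime_report1.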
My exploratory analysis led to the following findings:
i Hit and run is the most reported crime.
ii Cambridgeport has the largest share of crime by neighborhood in the pie chart.
iii From 1980 to 2008 the data contains very few records; consistent recording appears to begin in 2009. From then on the yearly totals rise and fall, peaking in 2023 with 6,820 cases. The sharp fall in 2024 reflects a partial year of records rather than a real decline in crime.
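The pie chart mentioned in finding ii is not reproduced in this report. The sketch below shows one way such a chart could be drawn with ggplot2; the neighborhood totals are made-up placeholder numbers, not values from the data set, and the object names shares and pie are hypothetical.

```r
library(ggplot2)

# Hypothetical neighborhood totals (made-up numbers, not taken from the data set)
shares <- data.frame(
  Neighborhood = c("Cambridgeport", "East Cambridge", "North Cambridge", "Other"),
  Total_Cases  = c(300, 250, 200, 250)
)

# Percentage of all cases in each neighborhood
shares$Share <- round(100 * shares$Total_Cases / sum(shares$Total_Cases), 1)

# A pie chart in ggplot2 is a single stacked bar drawn in polar coordinates
pie <- ggplot(shares, aes(x = "", y = Total_Cases, fill = Neighborhood)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(Share, "%")),
            position = position_stack(vjust = 0.5)) +
  theme_void() +
  labs(title = "Share of Reported Crimes by Neighborhood (toy data)")

pie
```

With 13 neighborhoods in the real data, a sorted bar chart of the same shares may be easier to read than a pie chart.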