Analyze Boston is the City of Boston’s open data hub to find facts, figures, and maps related to our lives within the city. We are working to make this the default technology platform to support the publication of the City’s public information, in the form of data, and to make this information easy to find, access, and use by a broad audience. This platform is managed by the Citywide Analytics Team.
Crime incident reports are provided by Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This is a dataset containing records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred.
Records begin on June 14, 2015 and continue to September 3, 2018.
The dataset is published on Analyze Boston and Kaggle.
We are data analysts at Analyze Boston whose job is to analyze the data in depth. We want to help police officers increase security in specific areas of Boston. We obtained the Crime in Boston 2015-2018 data and want to use it to determine how criminal cases are distributed across Boston and which types of crime occur most often.
Make sure the data file is placed in the data_input folder inside our R project directory.
# Read Dataset
crime <- read.csv("data_input/crime.csv")
head(crime, 10)
## INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP
## 1 I182070945 619 Larceny
## 2 I182070943 1402 Vandalism
## 3 I182070941 3410 Towed
## 4 I182070940 3114 Investigate Property
## 5 I182070938 3114 Investigate Property
## 6 I182070936 3820 Motor Vehicle Accident Response
## 7 I182070933 724 Auto Theft
## 8 I182070932 3301 Verbal Disputes
## 9 I182070931 301 Robbery
## 10 I182070929 3301 Verbal Disputes
## OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING
## 1 LARCENY ALL OTHERS D14 808
## 2 VANDALISM C11 347
## 3 TOWED MOTOR VEHICLE D4 151
## 4 INVESTIGATE PROPERTY D4 272
## 5 INVESTIGATE PROPERTY B3 421
## 6 M/V ACCIDENT INVOLVING PEDESTRIAN - INJURY C11 398
## 7 AUTO THEFT B2 330
## 8 VERBAL DISPUTE B2 584
## 9 ROBBERY - STREET C6 177
## 10 VERBAL DISPUTE C11 364
## OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET
## 1 2018-09-02 13:00:00 2018 9 Sunday 13 Part One LINCOLN ST
## 2 2018-08-21 00:00:00 2018 8 Tuesday 0 Part Two HECLA ST
## 3 2018-09-03 19:27:00 2018 9 Monday 19 Part Three CAZENOVE ST
## 4 2018-09-03 21:16:00 2018 9 Monday 21 Part Three NEWCOMB ST
## 5 2018-09-03 21:05:00 2018 9 Monday 21 Part Three DELHI ST
## 6 2018-09-03 21:09:00 2018 9 Monday 21 Part Three TALBOT AVE
## 7 2018-09-03 21:25:00 2018 9 Monday 21 Part One NORMANDY ST
## 8 2018-09-03 20:39:37 2018 9 Monday 20 Part Three LAWN ST
## 9 2018-09-03 20:48:00 2018 9 Monday 20 Part One MASSACHUSETTS AVE
## 10 2018-09-03 20:38:00 2018 9 Monday 20 Part Three LESLIE ST
## Lat Long Location
## 1 42.35779 -71.13937 (42.35779134, -71.13937053)
## 2 42.30682 -71.06030 (42.30682138, -71.06030035)
## 3 42.34659 -71.07243 (42.34658879, -71.07242943)
## 4 42.33418 -71.07866 (42.33418175, -71.07866441)
## 5 42.27537 -71.09036 (42.27536542, -71.09036101)
## 6 42.29020 -71.07159 (42.29019621, -71.07159012)
## 7 42.30607 -71.08273 (42.30607218, -71.08273260)
## 8 42.32702 -71.10555 (42.32701648, -71.10555088)
## 9 42.33152 -71.07085 (42.33152148, -71.07085307)
## 10 42.29515 -71.05861 (42.29514664, -71.05860832)
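Because read.csv() stops with a fairly terse error when the path is wrong, it can help to check the location first. The following is a minimal sketch, assuming the file lives at data_input/crime.csv relative to the working directory as in the call above. The helper name read_crime_data is hypothetical and not part of the original analysis, and the demonstration uses a temporary file so it runs without the real dataset.

```r
# Hypothetical helper: read the CSV only if it exists, otherwise fail with a clear message.
read_crime_data <- function(path = "data_input/crime.csv") {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         " (working directory: ", getwd(), ")")
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Demonstration on a temporary file (made-up single row, columns named as in the dataset).
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(INCIDENT_NUMBER = "I182070945", OFFENSE_CODE = 619),
          demo_path, row.names = FALSE)
demo <- read_crime_data(demo_path)
str(demo)
```

With the real data, the call would simply be crime <- read_crime_data().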
Checking Dataset
# Inspect
str(crime)
## 'data.frame': 319073 obs. of 17 variables:
## $ INCIDENT_NUMBER : chr "I182070945" "I182070943" "I182070941" "I182070940" ...
## $ OFFENSE_CODE : int 619 1402 3410 3114 3114 3820 724 3301 301 3301 ...
## $ OFFENSE_CODE_GROUP : chr "Larceny" "Vandalism" "Towed" "Investigate Property" ...
## $ OFFENSE_DESCRIPTION: chr "LARCENY ALL OTHERS" "VANDALISM" "TOWED MOTOR VEHICLE" "INVESTIGATE PROPERTY" ...
## $ DISTRICT : chr "D14" "C11" "D4" "D4" ...
## $ REPORTING_AREA : int 808 347 151 272 421 398 330 584 177 364 ...
## $ SHOOTING : chr "" "" "" "" ...
## $ OCCURRED_ON_DATE : chr "2018-09-02 13:00:00" "2018-08-21 00:00:00" "2018-09-03 19:27:00" "2018-09-03 21:16:00" ...
## $ YEAR : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ MONTH : int 9 8 9 9 9 9 9 9 9 9 ...
## $ DAY_OF_WEEK : chr "Sunday" "Tuesday" "Monday" "Monday" ...
## $ HOUR : int 13 0 19 21 21 21 21 20 20 20 ...
## $ UCR_PART : chr "Part One" "Part Two" "Part Three" "Part Three" ...
## $ STREET : chr "LINCOLN ST" "HECLA ST" "CAZENOVE ST" "NEWCOMB ST" ...
## $ Lat : num 42.4 42.3 42.3 42.3 42.3 ...
## $ Long : num -71.1 -71.1 -71.1 -71.1 -71.1 ...
## $ Location : chr "(42.35779134, -71.13937053)" "(42.30682138, -71.06030035)" "(42.34658879, -71.07242943)" "(42.33418175, -71.07866441)" ...
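Some columns have an inappropriate data type, and some are not needed for this analysis.
Columns that are not used and will be deleted: SHOOTING, REPORTING_AREA, Lat, and Long.
Columns whose data type should change: OFFENSE_CODE_GROUP, OFFENSE_DESCRIPTION, DISTRICT, UCR_PART, STREET, MONTH, and DAY_OF_WEEK (to factor), and OCCURRED_ON_DATE (to date-time).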
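To make the idea of a type conversion concrete, here is a small sketch on a toy data frame whose values are copied from the preview above. It uses base R only (as.factor() and as.POSIXct()) rather than the dplyr and lubridate functions used in the actual cleaning step, and the tz = "UTC" setting is an assumption made purely to keep the example reproducible.

```r
# Toy data frame mimicking three columns of the crime data.
toy <- data.frame(
  DISTRICT = c("D14", "C11", "D4"),
  OCCURRED_ON_DATE = c("2018-09-02 13:00:00", "2018-08-21 00:00:00",
                       "2018-09-03 19:27:00"),
  HOUR = c(13L, 0L, 19L),
  stringsAsFactors = FALSE
)

# Categorical text becomes a factor; the timestamp becomes POSIXct.
toy$DISTRICT <- as.factor(toy$DISTRICT)
toy$OCCURRED_ON_DATE <- as.POSIXct(toy$OCCURRED_ON_DATE,
                                   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Consistency check: the hour parsed from the timestamp should match HOUR.
parsed_hour <- as.integer(format(toy$OCCURRED_ON_DATE, "%H"))
identical(parsed_hour, toy$HOUR)

# First class of each column after conversion.
sapply(toy, function(col) class(col)[1])
```

A check like the one on parsed_hour is a cheap way to confirm that the parsed timestamp agrees with the precomputed HOUR column.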
Import Packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
crime_clean <- crime %>%
  # Drop the columns that are not used in this analysis
  select(-c("SHOOTING", "REPORTING_AREA", "Lat", "Long")) %>%
  # Convert categorical text to factors and the timestamp to date-time
  mutate(OFFENSE_CODE_GROUP = as.factor(OFFENSE_CODE_GROUP),
         OFFENSE_DESCRIPTION = as.factor(OFFENSE_DESCRIPTION),
         DISTRICT = as.factor(DISTRICT),
         OCCURRED_ON_DATE = ymd_hms(OCCURRED_ON_DATE),
         UCR_PART = as.factor(UCR_PART),
         STREET = as.factor(STREET))
# Replace month numbers with month names
crime_clean$MONTH <- sapply(as.character(crime_clean$MONTH), switch,
                            "1" = "January",
                            "2" = "February",
                            "3" = "March",
                            "4" = "April",
                            "5" = "May",
                            "6" = "June",
                            "7" = "July",
                            "8" = "August",
                            "9" = "September",
                            "10" = "October",
                            "11" = "November",
                            "12" = "December")
# Remove rows with an empty street or district
crime_clean <- crime_clean[!(crime_clean$STREET == ""),]
crime_clean <- crime_clean[!(crime_clean$DISTRICT == ""),]
# Convert the remaining categorical columns to factors
crime_clean$MONTH <- as.factor(crime_clean$MONTH)
crime_clean$DAY_OF_WEEK <- as.factor(crime_clean$DAY_OF_WEEK)
head(crime_clean)
## INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP
## 1 I182070945 619 Larceny
## 2 I182070943 1402 Vandalism
## 3 I182070941 3410 Towed
## 4 I182070940 3114 Investigate Property
## 5 I182070938 3114 Investigate Property
## 6 I182070936 3820 Motor Vehicle Accident Response
## OFFENSE_DESCRIPTION DISTRICT OCCURRED_ON_DATE YEAR
## 1 LARCENY ALL OTHERS D14 2018-09-02 13:00:00 2018
## 2 VANDALISM C11 2018-08-21 00:00:00 2018
## 3 TOWED MOTOR VEHICLE D4 2018-09-03 19:27:00 2018
## 4 INVESTIGATE PROPERTY D4 2018-09-03 21:16:00 2018
## 5 INVESTIGATE PROPERTY B3 2018-09-03 21:05:00 2018
## 6 M/V ACCIDENT INVOLVING PEDESTRIAN - INJURY C11 2018-09-03 21:09:00 2018
## MONTH DAY_OF_WEEK HOUR UCR_PART STREET Location
## 1 September Sunday 13 Part One LINCOLN ST (42.35779134, -71.13937053)
## 2 August Tuesday 0 Part Two HECLA ST (42.30682138, -71.06030035)
## 3 September Monday 19 Part Three CAZENOVE ST (42.34658879, -71.07242943)
## 4 September Monday 21 Part Three NEWCOMB ST (42.33418175, -71.07866441)
## 5 September Monday 21 Part Three DELHI ST (42.27536542, -71.09036101)
## 6 September Monday 21 Part Three TALBOT AVE (42.29019621, -71.07159012)
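One thing to keep in mind is that as.factor() orders its levels alphabetically, so April sorts before January. If calendar order matters for later tables or plots, a possible alternative is to index R's built-in month.name constant and pass it as the level set. This is only a sketch on made-up month numbers, not the method used in the cleaning step above.

```r
# Made-up month numbers standing in for the MONTH column.
month_numbers <- c(9, 8, 9, 1, 12, 6)

# month.name is a built-in character vector: "January", ..., "December".
# Using it as the level set keeps the factor in calendar order.
month_labels <- factor(month.name[month_numbers], levels = month.name)

levels(month_labels)[1:3]   # "January" "February" "March"
table(month_labels)[c("August", "September")]
```

The same indexing trick replaces the long switch() lookup with a single expression, at the cost of assuming every value is an integer from 1 to 12.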
Each column has now been converted to the desired data type.
Checking Missing Values
anyNA(crime_clean)
## [1] FALSE
colSums(is.na(crime_clean))
## INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
## 0 0 0 0
## DISTRICT OCCURRED_ON_DATE YEAR MONTH
## 0 0 0 0
## DAY_OF_WEEK HOUR UCR_PART STREET
## 0 0 0 0
## Location
## 0
Awesome! The data contains no NA values.
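A caveat worth noting: anyNA() and is.na() only detect true NA values, not empty strings. The summary() output further below lists 90 records with a blank UCR_PART, and those pass this check unnoticed. The sketch below uses a made-up vector to show the difference and one way to recode blanks as NA. Whether blanks should be treated as missing is a judgment call for the analysis.

```r
# Made-up values standing in for a text column with blank entries.
ucr <- c("Part One", "", "Part Three", "Part Two", "")

anyNA(ucr)          # FALSE: empty strings are not NA
sum(ucr == "")      # 2 blank entries

# Recode blanks as NA so that is.na() and anyNA() can see them.
ucr_clean <- ucr
ucr_clean[ucr_clean == ""] <- NA
anyNA(ucr_clean)    # TRUE
```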
Now the Crime in Boston data is ready to be processed and analyzed.
We can use the summary() function to get an overview of the data.
summary(crime_clean)
## INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP
## Length:307484 Min. : 111 Motor Vehicle Accident Response: 33684
## Class :character 1st Qu.:1001 Larceny : 25578
## Mode :character Median :2907 Medical Assistance : 23001
## Mean :2306 Investigate Person : 18377
## 3rd Qu.:3201 Other : 17515
## Max. :3831 Drug Violation : 15420
## (Other) :173909
## OFFENSE_DESCRIPTION DISTRICT
## INVESTIGATE PERSON : 18381 B2 :48132
## SICK/INJURED/MEDICAL - PERSON : 18344 C11 :41458
## M/V - LEAVING SCENE - PROPERTY DAMAGE: 15291 D4 :40228
## VANDALISM : 14864 B3 :34687
## ASSAULT SIMPLE - BATTERY : 14352 A1 :34179
## VERBAL DISPUTE : 13023 C6 :22514
## (Other) :213229 (Other):86286
## OCCURRED_ON_DATE YEAR MONTH
## Min. :2015-06-15 00:00:00 Min. :2015 August : 33557
## 1st Qu.:2016-04-11 07:30:00 1st Qu.:2016 July : 33441
## Median :2017-02-05 01:43:30 Median :2017 June : 29638
## Mean :2017-01-28 03:11:54 Mean :2017 September: 25445
## 3rd Qu.:2017-11-10 11:19:45 3rd Qu.:2017 May : 25364
## Max. :2018-09-03 21:25:00 Max. :2018 October : 24648
## (Other) :135391
## DAY_OF_WEEK HOUR UCR_PART
## Friday :46712 Min. : 0.00 : 90
## Monday :43966 1st Qu.: 9.00 Other : 1188
## Saturday :43179 Median :14.00 Part One : 60132
## Sunday :38979 Mean :13.12 Part Three:152143
## Thursday :44940 3rd Qu.:18.00 Part Two : 93931
## Tuesday :44626 Max. :23.00
## Wednesday:45082
## STREET Location
## WASHINGTON ST : 14192 Length:307484
## BLUE HILL AVE : 7794 Class :character
## BOYLSTON ST : 7219 Mode :character
## DORCHESTER AVE : 5143
## TREMONT ST : 4796
## MASSACHUSETTS AVE: 4707
## (Other) :263633
INSIGHT
First, we count the incidents in each crime category (OFFENSE_CODE_GROUP).
crime_category <- as.data.frame(sort(table(crime_clean$OFFENSE_CODE_GROUP), decreasing = TRUE))
names(crime_category) <- c("Category", "Frequency")
head(crime_category, 10)
## Category Frequency
## 1 Motor Vehicle Accident Response 33684
## 2 Larceny 25578
## 3 Medical Assistance 23001
## 4 Investigate Person 18377
## 5 Other 17515
## 6 Drug Violation 15420
## 7 Simple Assault 15363
## 8 Vandalism 15118
## 9 Verbal Disputes 13023
## 10 Towed 10966
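The same count-and-sort pattern is reused for streets and hours later on, so it could be wrapped in a small function. The sketch below is one possible way to do that in base R. The name count_values is hypothetical, the input is a made-up vector, and unlike the code above it returns the labels as character rather than factor.

```r
# Hypothetical helper: frequency table of a vector as a data frame,
# sorted from most to least frequent.
count_values <- function(x, name) {
  counts <- sort(table(x), decreasing = TRUE)
  out <- data.frame(names(counts), as.integer(counts),
                    stringsAsFactors = FALSE)
  names(out) <- c(name, "Frequency")
  out
}

# Made-up categories for demonstration.
toy_groups <- c("Larceny", "Towed", "Larceny", "Vandalism", "Larceny", "Towed")
res <- count_values(toy_groups, "Category")
res
```

With the real data this would be called as count_values(crime_clean$OFFENSE_CODE_GROUP, "Category").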
Plotting The Data
ggplot(head(crime_category, 10), aes(x = reorder(Category, Frequency), y = Frequency)) +
  geom_col(fill = "purple") +
  coord_flip() +
  labs(x = "",
       y = "Frequency",
       title = "Most Frequent Crime Categories") +
  theme_minimal()
Next, we count the incidents on each street.
crime_street <- as.data.frame(sort(table(crime_clean$STREET), decreasing = TRUE))
names(crime_street) <- c("Street", "Frequency")
head(crime_street, 10)
## Street Frequency
## 1 WASHINGTON ST 14192
## 2 BLUE HILL AVE 7794
## 3 BOYLSTON ST 7219
## 4 DORCHESTER AVE 5143
## 5 TREMONT ST 4796
## 6 MASSACHUSETTS AVE 4707
## 7 HARRISON AVE 4608
## 8 CENTRE ST 4379
## 9 COMMONWEALTH AVE 4134
## 10 HYDE PARK AVE 3470
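Raw counts are easier to interpret next to the total, so a share column can be a useful addition. The sketch below re-enters the ten counts from the table above by hand and divides by 307484, the row count reported by summary(crime_clean). The column name Share is our own choice.

```r
# Top-10 street counts copied from the table above.
street_top10 <- data.frame(
  Street = c("WASHINGTON ST", "BLUE HILL AVE", "BOYLSTON ST", "DORCHESTER AVE",
             "TREMONT ST", "MASSACHUSETTS AVE", "HARRISON AVE", "CENTRE ST",
             "COMMONWEALTH AVE", "HYDE PARK AVE"),
  Frequency = c(14192, 7794, 7219, 5143, 4796, 4707, 4608, 4379, 4134, 3470)
)
total_incidents <- 307484  # rows in crime_clean according to summary()

# Percentage of all incidents that occurred on each street.
street_top10$Share <- round(100 * street_top10$Frequency / total_incidents, 1)
street_top10

# Combined share of the ten streets.
top10_share <- sum(street_top10$Frequency) / total_incidents
top10_share
```

On these figures the ten streets together account for a little under a fifth of all recorded incidents.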
Plotting The Data
ggplotly(ggplot(head(crime_street, 10), aes(x = reorder(Street, Frequency), y = Frequency)) +
  geom_col(fill = "orange") +
  coord_flip() +
  labs(x = "",
       y = "Frequency",
       title = "Streets with the Most Crime Incidents") +
  theme_minimal())
Next, we count the incidents in each hour of the day.
crime_hour <- as.data.frame(table(crime_clean$HOUR))
names(crime_hour) <- c("Hour", "Frequency")
crime_hour
## Hour Frequency
## 1 0 14560
## 2 1 8770
## 3 2 7261
## 4 3 4392
## 5 4 3286
## 6 5 3177
## 7 6 4861
## 8 7 8542
## 9 8 12593
## 10 9 14311
## 11 10 15864
## 12 11 15935
## 13 12 18116
## 14 13 16324
## 15 14 16581
## 16 15 15926
## 17 16 19156
## 18 17 19855
## 19 18 19451
## 20 19 16897
## 21 20 15330
## 22 21 13624
## 23 22 12446
## 24 23 10226
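The busiest and quietest hours can also be read off programmatically instead of by eye. The sketch below re-enters the 24 counts from the table above and uses which.max() and which.min(). The data frame name hourly is our own and stands in for crime_hour.

```r
# Hourly counts copied from the table above (hours 0 to 23).
hourly <- data.frame(
  Hour = 0:23,
  Frequency = c(14560, 8770, 7261, 4392, 3286, 3177, 4861, 8542, 12593, 14311,
                15864, 15935, 18116, 16324, 16581, 15926, 19156, 19855, 19451,
                16897, 15330, 13624, 12446, 10226)
)

peak  <- hourly[which.max(hourly$Frequency), ]
quiet <- hourly[which.min(hourly$Frequency), ]

peak$Hour    # 17
quiet$Hour   # 5

# Sanity check: the hourly counts add up to the number of rows in crime_clean.
sum(hourly$Frequency) == 307484
```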
Plotting The Data
ggplotly(ggplot(crime_hour, aes(x = reorder(Hour, Frequency), y = Frequency)) +
  geom_col(fill = "red") +
  coord_flip() +
  labs(x = "Hour",
       y = "Frequency",
       title = "Hours with the Most Crime Incidents") +
  theme_minimal())
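The chart above sorts the bars by frequency, which makes the busiest hours easy to spot but breaks clock order. If the daily rhythm is of more interest, one option is to keep the hours in chronological order by giving the factor explicit levels. This is a sketch using only the first six rows of the hourly table, and it draws the plot only if ggplot2 is installed.

```r
# First six rows of the hourly table above.
hour_counts <- data.frame(
  Hour = 0:5,
  Frequency = c(14560, 8770, 7261, 4392, 3286, 3177)
)

# Explicit levels keep the axis in clock order instead of frequency order.
hour_counts$Hour <- factor(hour_counts$Hour, levels = 0:5)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(hour_counts, aes(x = Hour, y = Frequency)) +
    geom_col(fill = "red") +
    labs(x = "Hour", y = "Frequency", title = "Incidents by Hour of Day") +
    theme_minimal()
  print(p)
}
```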
Finally, we count the incidents for each combination of hour and day of the week.
crime_day <- as.data.frame(table(crime_clean$HOUR,
                                 crime_clean$DAY_OF_WEEK))
names(crime_day) <- c("Hour", "Day", "Frequency")
head(crime_day, 10)
## Hour Day Frequency
## 1 0 Friday 2086
## 2 1 Friday 1208
## 3 2 Friday 908
## 4 3 Friday 512
## 5 4 Friday 433
## 6 5 Friday 467
## 7 6 Friday 739
## 8 7 Friday 1346
## 9 8 Friday 1981
## 10 9 Friday 2234
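A two-way table like this also makes it straightforward to ask which hour is busiest on each day. The sketch below shows the idea on a handful of made-up records. With the real data, the table() call would take crime_clean$HOUR and crime_clean$DAY_OF_WEEK as above.

```r
# Made-up incident records (hour and weekday only).
toy <- data.frame(
  HOUR = c(17, 17, 9, 0, 0, 0, 12, 17),
  DAY_OF_WEEK = c("Friday", "Friday", "Friday", "Sunday",
                  "Sunday", "Sunday", "Sunday", "Monday")
)

# Two-way contingency table: rows are hours, columns are days.
hour_by_day <- table(toy$HOUR, toy$DAY_OF_WEEK)

# For each day (column), pick the hour (row name) with the highest count.
peak_hour <- apply(hour_by_day, 2,
                   function(col) rownames(hour_by_day)[which.max(col)])
peak_hour
```

Note that which.max() returns the first maximum, so ties between hours are resolved in favor of the earlier hour.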
Plotting The Data
ggplotly(ggplot(data = crime_day, mapping = aes(x = Frequency, y = reorder(Hour, Frequency))) +
  geom_col(mapping = aes(fill = Day)) + # default position: stacked bars
  labs(x = "Frequency",
       y = "Hour",
       fill = "",
       title = "Hours with the Most Crime Incidents",
       subtitle = "Colored by day of the week") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(legend.position = "top"))
From the analysis and plots shown above, it can be concluded that: