Take a Look at the Data Types of the “Crime in Boston” Data
## 'data.frame': 319073 obs. of 17 variables:
## $ INCIDENT_NUMBER : Factor w/ 282517 levels "142052550","I010370257-00",..: 282517 282516 282515 282514 282513 282512 282511 282510 282509 282508 ...
## $ OFFENSE_CODE : int 619 1402 3410 3114 3114 3820 724 3301 301 3301 ...
## $ OFFENSE_CODE_GROUP : Factor w/ 67 levels "Aggravated Assault",..: 35 64 63 33 33 44 5 65 59 65 ...
## $ OFFENSE_DESCRIPTION: Factor w/ 244 levels "A&B HANDS, FEET, ETC. - MED. ATTENTION REQ.",..: 130 231 223 124 124 165 22 232 207 232 ...
## $ DISTRICT : Factor w/ 13 levels "","A1","A15",..: 9 7 10 10 6 7 5 5 8 7 ...
## $ REPORTING_AREA : int 808 347 151 272 421 398 330 584 177 364 ...
## $ SHOOTING : Factor w/ 2 levels "","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OCCURRED_ON_DATE : Factor w/ 233229 levels "2015-06-15 00:00:00",..: 232988 230467 233216 233228 233226 233227 233229 233224 233225 233222 ...
## $ YEAR : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ MONTH : int 9 8 9 9 9 9 9 9 9 9 ...
## $ DAY_OF_WEEK : Factor w/ 7 levels "Friday","Monday",..: 4 6 2 2 2 2 2 2 2 2 ...
## $ HOUR : int 13 0 19 21 21 21 21 20 20 20 ...
## $ UCR_PART : Factor w/ 5 levels "","Other","Part One",..: 3 5 4 4 4 4 3 4 3 4 ...
## $ STREET : Factor w/ 4658 levels ""," ALBANY ST ",..: 2537 2075 786 3067 1242 4075 3100 2461 2742 2505 ...
## $ Lat : num 42.4 42.3 42.3 42.3 42.3 ...
## $ Long : num -71.1 -71.1 -71.1 -71.1 -71.1 ...
## $ Location : Factor w/ 18194 levels "(-1.00000000, -1.00000000)",..: 15616 6832 13472 11292 2034 4673 6752 10012 10748 5513 ...
Change the Data Type of “INCIDENT_NUMBER” and “OCCURRED_ON_DATE”
library(lubridate) # provides ymd_hms()
boston$INCIDENT_NUMBER <- as.character(boston$INCIDENT_NUMBER) # change the type to chr: incident numbers are (almost) unique identifiers, not categories
boston$OCCURRED_ON_DATE <- ymd_hms(boston$OCCURRED_ON_DATE, tz = "America/New_York") # parse into date-time format (POSIXct)
## Warning: 3 failed to parse.
Change the Data Type of “OFFENSE_CODE” and “REPORTING_AREA” into Factor (Repeated)
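The code for this step is not shown. A minimal sketch of one way to do it, using a small stand-in data frame called `toy` (a hypothetical name, with a few values copied from the `str()` output above) instead of the real `boston` data:

```r
# Toy stand-in for the boston data frame (hypothetical, for illustration only)
toy <- data.frame(OFFENSE_CODE   = c(619L, 1402L, 3410L, 3114L, 3114L),
                  REPORTING_AREA = c(808L, 347L, 151L, 272L, 272L))

# The codes are labels rather than quantities, so store them as factors
toy$OFFENSE_CODE   <- as.factor(toy$OFFENSE_CODE)
toy$REPORTING_AREA <- as.factor(toy$REPORTING_AREA)

str(toy)
```

With factors, `summary()` reports the most frequent codes instead of a mean and quartiles, which is what the summary further down shows for both columns.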
Take a Look at the Summary of the “Crime in Boston” Data
## INCIDENT_NUMBER OFFENSE_CODE
## Length:319073 3006 : 18783
## Class :character 3115 : 18754
## Mode :character 3831 : 16323
## 1402 : 15154
## 802 : 14799
## 3301 : 13099
## (Other):222161
## OFFENSE_CODE_GROUP
## Motor Vehicle Accident Response: 37132
## Larceny : 25935
## Medical Assistance : 23540
## Investigate Person : 18750
## Other : 18075
## Drug Violation : 16548
## (Other) :179093
## OFFENSE_DESCRIPTION DISTRICT
## SICK/INJURED/MEDICAL - PERSON : 18783 B2 :49945
## INVESTIGATE PERSON : 18754 C11 :42530
## M/V - LEAVING SCENE - PROPERTY DAMAGE: 16323 D4 :41915
## VANDALISM : 15154 A1 :35717
## ASSAULT SIMPLE - BATTERY : 14791 B3 :35442
## VERBAL DISPUTE : 13099 C6 :23460
## (Other) :222169 (Other):90064
## REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR
## 111 : 2372 :318054 Min. :2015-06-15 00:00:00 Min. :2015
## 186 : 2016 Y: 1019 1st Qu.:2016-04-12 00:43:15 1st Qu.:2016
## 329 : 1878 Median :2017-01-28 03:11:00 Median :2017
## 117 : 1832 Mean :2017-01-25 11:08:33 Mean :2017
## 143 : 1775 3rd Qu.:2017-11-05 18:14:00 3rd Qu.:2017
## (Other):288950 Max. :2018-09-03 21:25:00 Max. :2018
## NA's : 20250 NA's :3
## MONTH DAY_OF_WEEK HOUR UCR_PART
## Min. : 1.00 Friday :48495 Min. : 0.00 : 90
## 1st Qu.: 4.00 Monday :45679 1st Qu.: 9.00 Other : 1232
## Median : 7.00 Saturday :44818 Median :14.00 Part One : 61629
## Mean : 6.61 Sunday :40313 Mean :13.12 Part Three:158553
## 3rd Qu.: 9.00 Thursday :46656 3rd Qu.:18.00 Part Two : 97569
## Max. :12.00 Tuesday :46383 Max. :23.00
## Wednesday:46729
## STREET Lat Long
## WASHINGTON ST : 14194 Min. :-1.00 Min. :-71.18
## : 10871 1st Qu.:42.30 1st Qu.:-71.10
## BLUE HILL AVE : 7794 Median :42.33 Median :-71.08
## BOYLSTON ST : 7221 Mean :42.21 Mean :-70.91
## DORCHESTER AVE: 5149 3rd Qu.:42.35 3rd Qu.:-71.06
## TREMONT ST : 4796 Max. :42.40 Max. : -1.00
## (Other) :269048 NA's :19999 NA's :19999
## Location
## (0.00000000, 0.00000000) : 19999
## (42.34862382, -71.08277637): 1243
## (42.36183857, -71.05976489): 1208
## (42.28482577, -71.09137369): 1121
## (42.32866284, -71.08563401): 1042
## (42.25621592, -71.12401947): 898
## (Other) :293562
Clean Up the NAs and Blanks in the Data
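The cleaning code itself is not shown, so the following is only a sketch under assumptions: blank factor levels are treated as missing, and every row with a missing value is dropped. The data frame `toy` and the choice of columns are hypothetical; the original analysis may have cleaned different columns.

```r
# Toy stand-in with one blank DISTRICT, one blank UCR_PART and one missing Lat
toy <- data.frame(DISTRICT = c("B2", "", "D4", "A1"),
                  UCR_PART = c("Part One", "Part Two", "", "Part Three"),
                  Lat      = c(42.33, 42.30, 42.35, NA),
                  stringsAsFactors = TRUE)

# Turn the blank level "" into NA so blanks count as missing values
for (col in c("DISTRICT", "UCR_PART")) {
  levels(toy[[col]])[levels(toy[[col]]) == ""] <- NA
}

# Keep only the rows that have no missing value in any column
toy_clean <- toy[complete.cases(toy), ]
toy_clean
```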
Change Month from Numeric Format into Abbreviated Month Names and Reorder Them
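One possible way to do this, sketched on a hypothetical vector `toy_month`, uses the base R constant `month.abb`. Passing `month.abb` as the levels keeps the months in calendar order instead of alphabetical order.

```r
# Toy stand-in for boston$MONTH (hypothetical values)
toy_month <- c(9L, 8L, 1L, 12L, 7L)

# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
toy_month <- factor(month.abb[toy_month], levels = month.abb)

levels(toy_month)
```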
DAY_OF_WEEK Is Stored in Long Format and Not Ordered
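The levels of DAY_OF_WEEK are alphabetical ("Friday", "Monday", ...), so plots would not follow the week. A sketch of one fix, assuming a week that starts on Monday and three-letter abbreviations (both are assumptions, as are the names `toy_day` and `day_order`):

```r
# Toy stand-in for boston$DAY_OF_WEEK (levels are alphabetical by default)
toy_day <- factor(c("Friday", "Monday", "Sunday", "Wednesday", "Monday"))

# Put the levels in weekday order
day_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
toy_day <- factor(toy_day, levels = day_order)

# Shorten the level names to their first three letters
levels(toy_day) <- substr(day_order, 1, 3)

table(toy_day)
```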
Replace Missing Values in “SHOOTING” with “N”
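In the summary above, SHOOTING has two levels, a blank and "Y". A minimal sketch of the replacement, on a hypothetical vector `toy_shooting`, renames the blank level to "N":

```r
# Toy stand-in for boston$SHOOTING: blank means no shooting was recorded
toy_shooting <- factor(c("", "Y", "", ""))

# Rename the blank level to "N"
levels(toy_shooting)[levels(toy_shooting) == ""] <- "N"

table(toy_shooting)
```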
Clean Up the Long and Lat Anomalies
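The summary above shows a minimum Lat of -1 and a maximum Long of -1, which cannot be locations in Boston. How the original analysis handled them is not shown; the sketch below simply drops such rows from a hypothetical data frame `toy`.

```r
# Toy stand-in with one placeholder coordinate pair (-1, -1)
toy <- data.frame(Lat  = c(42.33, -1, 42.30, 42.35),
                  Long = c(-71.08, -1, -71.10, -71.06))

# Flag rows where either coordinate is the placeholder value -1
bad <- !is.na(toy$Lat) & !is.na(toy$Long) & (toy$Lat == -1 | toy$Long == -1)

# Keep only the rows with plausible coordinates
toy <- toy[!bad, ]
summary(toy)
```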
Remove unused levels
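After rows are removed, a factor still remembers levels that no longer occur. A small sketch with a hypothetical factor `f` shows how `droplevels()` removes them:

```r
# Toy factor with three levels
f <- factor(c("A1", "B2", "B2", "C6"))

# Subsetting keeps "C6" as a level even though no value uses it any more
sub <- f[f != "C6"]
levels_before <- nlevels(sub)

# droplevels() removes the levels that are no longer used
sub <- droplevels(sub)
levels_after <- nlevels(sub)

c(before = levels_before, after = levels_after)
```

`droplevels()` also accepts a whole data frame, in which case it is applied to every factor column.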
Simple Exploratory Data Analysis
Take a quick look at the Boston crime data to find patterns or interesting findings
By Year
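library(ggplot2) # plotting
library(ggthemes) # provides theme_igray()
library(scales) # provides comma()
ggplot(boston, aes(x = YEAR)) +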
geom_bar(fill = "royalblue2", col = "mediumblue") +
theme_igray() +
labs(x = NULL, y = NULL,
title = "Number of Crimes in Boston",
subtitle = "During Year of 2015 - 2019") +
geom_text(aes(label=comma(..count..)),stat="count", position=position_dodge(0.9),vjust=-0.5) +
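scale_y_continuous(labels=comma)
The chart shows that most crimes in Boston occurred in 2016 and 2017. Note that 2015 and 2018 are only partly covered (the data run from June 2015 to September 2018), so their totals are not directly comparable.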
By Month
ggplot(boston, aes(x = MONTH)) +
geom_bar(fill = "royalblue2", col = "mediumblue") +
theme_igray() +
labs(x = NULL, y = NULL,
title = "Number of Crimes in Boston by Month",
subtitle = "During Year of 2015 - 2019") +
geom_text(aes(label=comma(..count..)),stat="count", position=position_dodge(0.9),vjust=-0.5) +
scale_y_continuous(labels=comma)
Crimes in Boston peaked in July
By Season
Since the US has four seasons, it is interesting to see whether the season affects the number of crimes
# Create a function that maps an abbreviated month name to one of 4 seasons
season <- function(m){
  if(m %in% c("Mar", "Apr", "May")){
    "SPRING"
  }else if(m %in% c("Jun", "Jul", "Aug")){
    "SUMMER"
  }else if(m %in% c("Sep", "Oct", "Nov")){
    "AUTUMN"
  }else{
    "WINTER" # Dec, Jan, Feb
  }
}
# Apply it to MONTH and create a new column
boston$SEASON_OCCURED <- as.factor(sapply(boston$MONTH, season))
# Reorder the levels to follow the calendar year
boston$SEASON_OCCURED <- ordered(boston$SEASON_OCCURED,
                                 levels = c("WINTER",
                                            "SPRING",
                                            "SUMMER",
                                            "AUTUMN"))
ggplot(boston, aes(x = MONTH, fill = SEASON_OCCURED)) +
geom_bar() +
theme_igray() +
labs(x = NULL, y = NULL,
title = "Number of Crimes in Boston by Season",
subtitle = "During Year of 2015 - 2019",
fill = "Season") +
geom_text(aes(label=comma(..count..)),stat="count", position=position_dodge(0.9), vjust=-0.5, size = 3) +
scale_y_continuous(labels=comma) +
scale_fill_manual(values = cbPalette[1:4]) +
theme(legend.position = "bottom",
legend.title = element_text(size = 8),
legend.text = element_text(size = 8))
Summer is the season in which most crimes in Boston occurred
By Day of Week
ggplot(boston, aes(x = DAY_OF_WEEK)) +
geom_bar(fill = "royalblue2", col = "mediumblue") +
theme_igray() +
labs(x = NULL, y = NULL,
title = "Number of Crimes in Boston by Day of Week",
subtitle = "During Year of 2015 - 2019") +
geom_text(aes(label=comma(..count..)),stat="count", position=position_dodge(0.9),vjust=-0.5) +
scale_y_continuous(labels=comma)
Crimes in Boston occur mostly on weekdays and peak on Friday
By Time
# Map an hour of the day (0-23) to one of four six-hour periods
event <- function(x){
  if(x < 6){
    "12AM to 6AM"
  }else if(x < 12){
    "6AM to 12PM"
  }else if(x < 18){
    "12PM to 6PM"
  }else{
    "6PM to 12AM"
  }
}
boston$TIME_OCCURED <- as.factor(sapply(boston$HOUR, event))
boston$TIME_OCCURED <- ordered(boston$TIME_OCCURED,
levels = c("12AM to 6AM",
"6AM to 12PM",
"12PM to 6PM",
"6PM to 12AM"))
ggplot(boston, aes(x = TIME_OCCURED)) +
geom_bar(fill = "royalblue2", col = "mediumblue") +
theme_igray() +
labs(x = NULL, y = NULL,
title = "Number of Crimes in Boston by Time Occurred",
subtitle = "During Year of 2015 - 2019") +
geom_text(aes(label=comma(..count..)),stat="count", position=position_dodge(0.9),vjust=-0.5) +
scale_y_continuous(labels=comma)
Most crimes happen between 12PM and 6PM
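The same four periods produced by `event()` can also be built without `sapply()`. The sketch below is an alternative, not the code used for the chart: it bins a hypothetical vector `hours` with base R's `cut()`, and `right = FALSE` makes each interval include its lower bound, so hour 6 falls into "6AM to 12PM".

```r
# Toy stand-in for boston$HOUR, covering both edges of every period
hours <- c(0L, 5L, 6L, 11L, 12L, 17L, 18L, 23L)

# Bin the hours into [0,6), [6,12), [12,18) and [18,24)
period <- cut(hours,
              breaks = c(0, 6, 12, 18, 24),
              right = FALSE,
              labels = c("12AM to 6AM", "6AM to 12PM",
                         "12PM to 6PM", "6PM to 12AM"),
              ordered_result = TRUE)

table(period)
```

Because `cut()` is vectorised and returns an ordered factor directly, this would replace both the `sapply()` call and the separate `ordered()` step.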
Correlation between Day of Week and the Time when Crimes Occur
library(ggmosaic) # provides geom_mosaic() and product()
ggplot(boston) +
geom_mosaic(aes(x = product(DAY_OF_WEEK), fill=TIME_OCCURED)) +
labs(x = NULL, y = NULL,
title = "Number of Crimes in Boston by Day of Week vs Time Occurred",
subtitle = "During Year of 2015 - 2019",
fill = "Time Occurred") +
theme(legend.position = "bottom",
legend.title = element_text(size = 8),
legend.text = element_text(size = 8))
On weekdays, crimes in Boston follow the same time-of-day pattern. The weekend is different: part of the share shifts from 6AM to 12PM into 12AM to 6AM.
Take a look at the most frequent offense types, say those with at least 10,000 cases
boston10k <- as.data.frame(table(OFFENSE_TYPE = boston$OFFENSE_CODE_GROUP)) # count cases per offense type
boston10k <- boston10k[boston10k$Freq >= 10000, ] # keep the types with at least 10,000 cases
boston10k <- boston10k[order(boston10k$Freq, decreasing = TRUE),] # sort from largest to smallest
boston10k
## OFFENSE_TYPE Freq
## 44 Motor Vehicle Accident Response 30350
## 35 Larceny 25032
## 41 Medical Assistance 22326
## 31 Investigate Person 17958
## 47 Other 17009
## 62 Simple Assault 14829
## 64 Vandalism 14826
## 16 Drug Violation 14371
## 65 Verbal Disputes 12942
## 63 Towed 10712
## 33 Investigate Property 10566
## 36 Larceny From Motor Vehicle 10226
We would like to take out “Other”, since it may consist of several different offense types
Plot It as a Chart
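The removal step is not shown. A minimal sketch, using a hypothetical three-row stand-in `toy10k` with counts copied from the table above:

```r
# Toy stand-in for boston10k (three rows taken from the table above)
toy10k <- data.frame(OFFENSE_TYPE = factor(c("Larceny", "Other", "Vandalism")),
                     Freq = c(25032L, 17009L, 14826L))

# Drop the catch-all "Other" row
toy10k <- toy10k[toy10k$OFFENSE_TYPE != "Other", ]

# Remove the now-unused "Other" level so it does not appear in plots
toy10k$OFFENSE_TYPE <- droplevels(toy10k$OFFENSE_TYPE)

toy10k
```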
ggplot(boston10k, aes(x = reorder(OFFENSE_TYPE, Freq), y = Freq)) +
geom_col(aes(fill=OFFENSE_TYPE)) +
coord_flip() +
theme_igray() +
labs(x = NULL, y = NULL,
title = "Top 10 Offense Type Numbers in Boston Crimes",
subtitle = "During Year of 2015 - 2019") +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "none") +
scale_y_continuous(labels=comma)
Location Analysis
By Districts
newlev <- names(table(boston$DISTRICT))[order(table(boston$DISTRICT), decreasing = T)]
boston$DISTRICT <- factor(boston$DISTRICT, levels=newlev)
ggplot(boston, aes(x = DISTRICT)) +
geom_bar(fill = "royalblue2", col = "mediumblue") +
theme_igray() +
labs(x = NULL, y = NULL,
title = "Number of Crimes in Boston in Every District",
subtitle = "During Year of 2015 - 2019") +
geom_text(aes(label=comma(..count..)),stat="count", position=position_dodge(0.9),vjust=-0.5) +
scale_y_continuous(labels=comma)
B2 is the district with the most cases, and A15 is the one with the fewest
Districts vs Offense Type
We would like to see the distribution of the most frequent offense types (at least 10,000 cases) in every district
bostonsort <- boston[which(boston$OFFENSE_CODE_GROUP %in% boston10k$OFFENSE_TYPE), ] # subset of the boston dataset that contains only the top offense types
bostonsort <- droplevels(bostonsort) # remove unused levels
sortlev <- names(table(bostonsort$DISTRICT))[order(table(bostonsort$DISTRICT), decreasing = TRUE)] # sort districts from largest to smallest
bostonsort$DISTRICT <- factor(bostonsort$DISTRICT, levels=sortlev) # set as factor with the sorted levels
ggplot(bostonsort) +
geom_mosaic(aes(x = product(DISTRICT), fill=OFFENSE_CODE_GROUP)) +
labs(x = NULL, y = NULL,
title = "Crimes in Boston by Offense Type in Every District",
subtitle = "During Year of 2015 - 2019",
fill = "Offense Type") +
theme_igray() +
theme(legend.position = "none",
axis.text.x = element_text(size = 7)) +
scale_fill_brewer(palette = "Paired")
- As we can see, the order of the districts is the same as in the chart for the full dataset.
- Motor Vehicle Accident Response is the most frequent case in every district
- Larceny should be the main concern in D4 and A1
Number of Crimes on Every Date
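library(dplyr) # provides %>%, group_by(), summarise() and n()
# Change the format from POSIXct to Date; tz is passed to as.Date() so the calendar date is taken in Boston local time rather than UTC
bostonsort$OCCURRED_ON_DATE <- as.Date(bostonsort$OCCURRED_ON_DATE, tz = "America/New_York")
# Count the number of crimes by date of occurrence
bostoncount <- bostonsort %>%
group_by(OCCURRED_ON_DATE) %>%
summarise(count = n()) %>%
ungroup()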
ggplot(bostoncount, aes(x = OCCURRED_ON_DATE, y = count)) +
geom_point(col = "royalblue2") +
labs(x = NULL, y = NULL,
title = "Count of Crimes in Boston by Date of Occurrence",
subtitle = "During Year of 2015 - 2019") +
theme_igray()
As in the chart by month, the number of crimes is highest in the middle of the year and declines toward the end of the year