I started to analysed this project keeping one thing in my mind. I lived in Boston City for past 10 years but how safe is it compared to any other big cities around the United States? It’s no secret here that Boston is greater than most places on the planet (especially if you ask any Bostonian who is literally comparing it to New York). That’s not just their stance/attitude/ego talking; but in our analysis I have some serious data just to prove it. According to most of the recent polls and studies published as of today 12/21/2018, from tasty burger/water, shack shake to pedestrian safety(Vision Zero Boston) Boston seems to get better year after year. Vision Zero Boston is Cities commitment to focus the City’s resources on proven strategies to eliminate fatal and serious traffic crashes in the City by 2030, that rate Boston as tops in the nation.When I get a chance to look at Beantown’s incidence of rape (39 per 100,000, putting it third on the list of cities in the survey) and assault (483 per 100,000, fifth on the list of cities in the survey) [external data Source] it might seem like a surprisingly dangerous big city.Surprising then is that only 27% of those surveyed as of end of 2018 said they considered Boston to fairly or very unsafe. According to recent surveys the city’s low murder rate per capita (9 per 100,000) could likely be a contributing factor to this perception.The city also has the second-lowest incidence of vehicle theft at a rate of 258 per 100,000 and fares well on the other crimes rates.
Is it true to say that Boston was more violent than New York and Seattle, but less violent than Chicago and Las Vegas, according to numbers from the FBI, based on crimes committed back in 2015. As of Today 12/21/18 Nationally, Boston ranked 14 out of 50 according to Us News.After digging up and analysing the data what we found out that in recent numbers from the Boston Police Department, or BPD, show that violent crime, as well as property crime, has continued to drop, and has been steadily dropping for past years.
My Goal was to build R shiny app to see the results without digging too much in data. Took a lot of help from Online Community to do EDA as well as to built the R Shiny App. Hope you guys enjoy reading these and if you like please Upvote.
I have successfully built R shiny app which covers crime map based on murders, aggravated assaults, and robberies, as well as property crimes such as burglaries, auto thefts, larcenies, and arson for data of date between August 2015 till today. Crimes such as sexual assaults and rapes have not been included because of a lack of data.
Public safety is vital to public health and happiness, and a state’s safety can be a crucial factor in deciding where to relocate your family.The map shows that Mattapan,Roxbury, and Dorchester have had more aggravated assaults than other neighborhoods. Roxbury, downtown, and Dorchester had more robberies than other sections of the city. Back Bay, the South End, and downtown suffered more larcenies. In 2017 the rate of violent crime in Massachusetts fell for the sixth year in a row and dropped below national levels for the second time since 2016, according to FBI statistics.
In 2017, there were 358 incidents of violent crime per 100,000 people in Massachusetts, compared to the national rate of 394 per 100,000 people, according to data published in late September.The FBI defines violent crime as homicide, rape, robbery, and aggravated assault. The bureau said it calculates crime rates using figures voluntarily provided by law enforcement agencies, and tallies estimates when it doesn’t receive a complete year’s worth of data.
The overall crime rate in Boston is equal to the national average. For every 100,000 people, there are 7.56 daily crimes that occur in Boston. Boston is safer than 14% of the cities in the United States. In Boston you have a 1 in 37 chance of becoming a victim of any crime. The number of total year over year crimes in Boston has decreased by 9%.
I want to highlight what the City of Boston says about the data they provide:
“Although the City has made reasonable efforts to provide accurate data, the City makes no representations or guarantees about the accuracy, completeness, or currency of the information provided. The City of Boston provides this data as is and with all faults, and makes no warranty of any kind. Each user is responsible for determining the suitability of the data for their intended use or purpose.”
The dataset ‘Crime Incident Report’ is an open data intiative program led by the Boston government to document the initial details surrounding an incident to which Boston Police Department (BPD) officers respond. The dataset contains records from the Boston government’s new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occured. The Boston government took an initiative to improve the city of Boston by releasing its data source to the public. Over the past few decades, the way we look at the field of climate, genetics, sports, have been altered dramatically due to big data technology advancements; similarly, the way crime data was traditionally held by law enforcement agencies has also changed, crime prediction is a niche trend in this era.
The dataset begins from August 2015 to December 21st 2018 (1st day Of Winter), As of December 21st, 2018, there are 349080 incidents and 17 variables; ranging from types of offense, reported area and reporting area, date occured, street, and the longitude and latitude of the incident. The size of the dataset is 78.5mb.
Why we choose this dataset?
Performing an analysis of crime on the Boston dataset will help not only me but people living here to identify and analyze patterns in Boston’s crime incidents. I believe that the results of this analysis can be useful to law enforcement agencies in how they deploy their resources, and to assist in identifying and apprehending suspects. Last but not least, it would help many international people residing in Boston, to better understand the city of Boston.
I plan to explore the incidents backwards through time, and mine for patterns by time and location and to visually represent these results. These results are not only potentially beneficial to the law enforcement, but they can also be beneficial for the residents of Boston to see and understand where and how often crime is happening in their neighborhood. I hope that by understanding the frequency of incidents in a neighborhood, residents can be proactive about how they report incidents.
These are the packages involved in reading this report.
library("dplyr")
library("tidyverse")
library("chron")
library("ggplot2")
library("janitor")
library("Hmisc")
library("funModeling")
library("tidyverse")
library("openair")
library("rticles")
library("lubridate")
library("tidyr")
library("skimr")
library("rmarkdown")
library("visdat")
library("maps")
library("leaflet")
library("plotly")
library("waffle")
library("DataExplorer")
library("lattice")
library("wordcloud")
library("gridExtra")Theme Settings
Reads the csv file ‘crime_incident_reports.csv’ from the Government of Boston’s website and save it as variable raw_crime. The code is written so the reader of the .Rmd file does not need the dataset downloaded to run the code chunks. The datasets is from August 2015 till today December 21st 2018.
This report is based on crimes reported to, and arrests made by, the Boston Police Department within the City of Boston.
raw_crime = read.csv('https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b/download/crime_incident_reports.csv', sep = ",", na.strings =c('','NA','na','N/A','n/a','NaN','nan'), strip.white = TRUE, stringsAsFactors = FALSE)Overview of Massachusetts and Northeastern University
The red outlined state is the state of Massachusetts, located in the Northeastern part of the United States. Boston is situated on the Eastern side of Massachusetts, touching the Atlantic ocean and sits at 42.3601° N, 71.0589° W.
The second map map of Boston through Leaflet Package.
The Third Map is also Map of Boston distributed among Crime District by Latitude and Longitude.
ggplot(map_data("state", region = "Massachusetts"), aes(long, lat, group = group)) +
geom_polygon(fill = "gray", colour = "red") +
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
coord_quickmap()leaflet() %>%
setView(lng=-71.0892, lat=42.3398, zoom = 10) %>%
addTiles() %>%
addMarkers(lng=-71.0892, lat=42.3398, popup="Boston") # Crime Mapping
qplot(Long, Lat, data= raw_crime, color=DISTRICT, geom='point', xlim = c(-71.2,-70.95), ylim= c(42.22,42.4))+
theme_bw(base_size=15)+
geom_point(size = 2)+
labs(caption= "Data Source : Boston |@ Pankaj Shah")## Warning: Removed 22697 rows containing missing values (geom_point).
## Warning: Removed 22697 rows containing missing values (geom_point).
I used the latitude and longitude coordinates to plot each incident in the area. This roughly depicts the map of Boston.
Lets name our datasets so that we don’t have to download time and again when we modify our code.
# Lets name crime datasets, if we mess up then we dont have to import another time.
crime <- raw_crimeInfo about the crime datasets:
Gather some basic Information about the datasets before exploration. The datasets is 78.5 Mb with 349080 Rows/observation and 17 different Columns. One thing you will observe is most of the Missing data are coming from Reporting area, latitude and longitude.
Lets further explore only missing datas in table format. So when we sort them in descending order we can see Reporting area has 22236 missing values and lat and long has 21842 missing values as of today 12/21/18.
One of the false analysis is saying 99.59% of datasets is missing in Shooting Column. Upon diagnosis we found out if shooting didn’t take place the row was left empty and if it took place it was coded as “Y”. I also note that there are 12 districts in Boston datasets but if we see district it shows 13 unique values. There is something more than 12 district. Below it shows there is no missing data from District.
Project Boundaries
# Handling the Missing datasets
sort(sapply(crime, function(x) sum(is.na(x))), decreasing = TRUE)## SHOOTING REPORTING_AREA Lat
## 347884 22244 21856
## Long STREET DISTRICT
## 21856 11272 1910
## UCR_PART INCIDENT_NUMBER OFFENSE_CODE
## 97 0 0
## OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION OCCURRED_ON_DATE
## 0 0 0
## YEAR MONTH DAY_OF_WEEK
## 0 0 0
## HOUR Location
## 0 0
# Sometimes we need to be carefull when package ask us to remove the missing data.
plot_missing(crime)# Clean the column names.
crime <- clean_names(crime)If shooting hasn’t occured then the row is left empty. Thats why it is showing 347664 Missing Varaibles in Shooting Column. There are about 1913 Missing district datasets. One of the idea to fix that would be use Latitude and Longitude data if avialable. But we have 21842 Lat Long missing data as well. Also we can give a try by using Reporting Area if its avialable. As the streets are not unique it will be Little harder to find the missing district based on Street Name.
# Take a peek at our datasets
glimpse(crime)## Observations: 349,301
## Variables: 17
## $ incident_number <chr> "I182102934", "I182102933", "I182102923", ...
## $ offense_code <int> 1402, 3115, 614, 3301, 3006, 3006, 3831, 3...
## $ offense_code_group <chr> "Vandalism", "Investigate Person", "Larcen...
## $ offense_description <chr> "VANDALISM", "INVESTIGATE PERSON", "LARCEN...
## $ district <chr> "C11", "C6", "B2", "B2", "C11", NA, "C11",...
## $ reporting_area <int> 244, 958, 329, 329, 359, NA, 355, 774, 300...
## $ shooting <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ occurred_on_date <chr> "2018-12-21 21:30:00", "2018-12-21 20:21:0...
## $ year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, ...
## $ month <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12...
## $ day_of_week <chr> "Friday", "Friday", "Friday", "Friday", "F...
## $ hour <int> 21, 20, 16, 20, 20, 20, 20, 17, 20, 20, 17...
## $ ucr_part <chr> "Part Two", "Part Three", "Part One", "Par...
## $ street <chr> "DORCHESTER AVE", "BELLFLOWER ST", "NAZING...
## $ lat <dbl> 42.31812, 42.32445, 42.30664, 42.30833, 42...
## $ long <dbl> -71.05673, -71.05811, -71.08585, -71.07599...
## $ location <chr> "(42.31811990, -71.05673136)", "(42.324451...
## Fixing Missing Data Lat and Long
# latitude <- gsub('.*\\((.*),.*', '\\1', crime$location)
# longitude <- gsub('.*, (.*)\\).*', '\\1', crime$location)
# latitude <- as.factor(latitude)
# longitude <- as.factor(longitude)
# All the missing data has lat and long coordinates 0.000 & 0.000** Factor conversion for better Summary Results**
crime$incident_number <- as.factor(crime$incident_number)
crime$offense_code_group <- as.factor(crime$offense_code_group)
crime$offense_description <- as.factor(crime$offense_description)
crime$district <- as.factor(crime$district)
crime$shooting <- as.factor(crime$shooting)
crime$occurred_on_date <- as.Date(crime$occurred_on_date)
crime$day_of_week <- as.factor(crime$day_of_week)
crime$ucr_part <- as.factor(crime$ucr_part)
crime$street <- as.factor(crime$street)
crime$location <- as.factor(crime$location)Below is the number of Incidents Reported Each Year.
crime %>%
filter(occurred_on_date < ymd("2018-12-21")) %>%
group_by(occurred_on_date) %>%
summarise(n = n()) %>%
ggplot(aes(x = occurred_on_date, y = n)) +
geom_point() +
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
labs(title = " Scatterplot Number of Incidents Reported in each year",
y = 'Number of Incidents Reported',
x = 'Date')+
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_pankajOne Interesting observation to see is crime reporting frequency has been evenly distributed. If you follow the dot it seems to drop towards the end and have pick up around mid year.
** plotting the crime datasets is helpful to see skewness of datasets.**
# Nothing new or so specific to observe here. We have seen all these data descriptvely.
plot_num(crime)## Warning: attributes are not identical across measure variables; they will
## be dropped
Below is the Summary of Our Crime Datasets
summary(crime)## incident_number offense_code
## I152071596 : 20 Min. : 111
## I172053750 : 18 1st Qu.:1001
## I162067346 : 14 Median :2907
## I130041200-00: 13 Mean :2318
## I162030584 : 13 3rd Qu.:3201
## I152097957 : 12 Max. :3831
## (Other) :349211
## offense_code_group
## Motor Vehicle Accident Response: 40750
## Larceny : 28570
## Medical Assistance : 25967
## Investigate Person : 20311
## Other : 19694
## Drug Violation : 18074
## (Other) :195935
## offense_description district
## SICK/INJURED/MEDICAL - PERSON : 20784 B2 : 54804
## INVESTIGATE PERSON : 20316 C11 : 46487
## M/V - LEAVING SCENE - PROPERTY DAMAGE: 17827 D4 : 46012
## VANDALISM : 16462 A1 : 39397
## ASSAULT SIMPLE - BATTERY : 16213 B3 : 38692
## VERBAL DISPUTE : 14358 (Other):121999
## (Other) :243341 NA's : 1910
## reporting_area shooting occurred_on_date year
## Min. : 0.0 Y : 1417 Min. :2015-06-15 Min. :2015
## 1st Qu.:177.0 NA's:347884 1st Qu.:2016-05-09 1st Qu.:2016
## Median :343.0 Median :2017-03-25 Median :2017
## Mean :383.1 Mean :2017-03-19 Mean :2017
## 3rd Qu.:544.0 3rd Qu.:2018-02-01 3rd Qu.:2018
## Max. :962.0 Max. :2018-12-21 Max. :2018
## NA's :22244
## month day_of_week hour ucr_part
## Min. : 1.000 Friday :53039 Min. : 0.00 Other : 1369
## 1st Qu.: 4.000 Monday :49916 1st Qu.: 9.00 Part One : 67385
## Median : 7.000 Saturday :48978 Median :14.00 Part Three:173867
## Mean : 6.918 Sunday :44121 Mean :13.11 Part Two :106583
## 3rd Qu.:10.000 Thursday :51039 3rd Qu.:18.00 NA's : 97
## Max. :12.000 Tuesday :50846 Max. :23.00
## Wednesday:51362
## street lat long
## WASHINGTON ST : 15634 Min. :-1.00 Min. :-71.18
## BLUE HILL AVE : 8521 1st Qu.:42.30 1st Qu.:-71.10
## BOYLSTON ST : 7944 Median :42.33 Median :-71.08
## DORCHESTER AVE: 5647 Mean :42.21 Mean :-70.90
## TREMONT ST : 5261 3rd Qu.:42.35 3rd Qu.:-71.06
## (Other) :295022 Max. :42.40 Max. : -1.00
## NA's : 11272 NA's :21856 NA's :21856
## location
## (0.00000000, 0.00000000) : 21856
## (42.34862382, -71.08277637): 1384
## (42.36183857, -71.05976489): 1321
## (42.28482577, -71.09137369): 1187
## (42.32866284, -71.08563401): 1133
## (42.25621592, -71.12401947): 984
## (Other) :321436
describe(crime)## crime
##
## 17 Variables 349301 Observations
## ---------------------------------------------------------------------------
## incident_number
## n missing distinct
## 349301 0 308436
##
## lowest : 142052550 I010370257-00 I030217815-08 I050310906-00 I060168073-00
## highest: I182102916 I182102919 I182102923 I182102933 I182102934
## ---------------------------------------------------------------------------
## offense_code
## n missing distinct Info Mean Gmd .05 .10
## 349301 0 222 0.999 2318 1318 522 614
## .25 .50 .75 .90 .95
## 1001 2907 3201 3802 3831
##
## lowest : 111 112 121 123 301, highest: 3811 3820 3821 3830 3831
## ---------------------------------------------------------------------------
## offense_code_group
## n missing distinct
## 349301 0 67
##
## lowest : Aggravated Assault Aircraft Arson Assembly or Gathering Violations Auto Theft
## highest: Towed Vandalism Verbal Disputes Violations Warrant Arrests
## ---------------------------------------------------------------------------
## offense_description
## n missing distinct
## 349301 0 244
##
## lowest : A&B HANDS, FEET, ETC. - MED. ATTENTION REQ. A&B ON POLICE OFFICER ABDUCTION - INTICING AFFRAY AIRCRAFT INCIDENTS
## highest: WEAPON - FIREARM - CARRYING / POSSESSING, ETC WEAPON - FIREARM - OTHER VIOLATION WEAPON - FIREARM - SALE / TRAFFICKING WEAPON - OTHER - CARRYING / POSSESSING, ETC WEAPON - OTHER - OTHER VIOLATION
## ---------------------------------------------------------------------------
## district
## n missing distinct
## 347391 1910 12
##
## Value A1 A15 A7 B2 B3 C11 C6 D14 D4 E13
## Frequency 39397 7085 14259 54804 38692 46487 26053 21992 46012 19366
## Proportion 0.113 0.020 0.041 0.158 0.111 0.134 0.075 0.063 0.132 0.056
##
## Value E18 E5
## Frequency 18996 14248
## Proportion 0.055 0.041
## ---------------------------------------------------------------------------
## reporting_area
## n missing distinct Info Mean Gmd .05 .10
## 327057 22244 879 1 383.1 273.5 56 97
## .25 .50 .75 .90 .95
## 177 343 544 777 824
##
## lowest : 0 1 2 3 4, highest: 958 959 960 961 962
## ---------------------------------------------------------------------------
## shooting
## n missing distinct value
## 1417 347884 1 Y
##
## Value Y
## Frequency 1417
## Proportion 1
## ---------------------------------------------------------------------------
## occurred_on_date
## n missing distinct
## 349301 0 1286
##
## lowest : 2015-06-15 2015-06-16 2015-06-17 2015-06-18 2015-06-19
## highest: 2018-12-17 2018-12-18 2018-12-19 2018-12-20 2018-12-21
## ---------------------------------------------------------------------------
## year
## n missing distinct Info Mean Gmd
## 349301 0 4 0.929 2017 1.149
##
## Value 2015 2016 2017 2018
## Frequency 53577 99314 101145 95265
## Proportion 0.153 0.284 0.290 0.273
## ---------------------------------------------------------------------------
## month
## n missing distinct Info Mean Gmd .05 .10
## 349301 0 12 0.993 6.918 3.797 1 2
## .25 .50 .75 .90 .95
## 4 7 10 11 12
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 23684 21706 24190 24162 26329 30742 34801 35343 34362 34101
## Proportion 0.068 0.062 0.069 0.069 0.075 0.088 0.100 0.101 0.098 0.098
##
## Value 11 12
## Frequency 31197 28684
## Proportion 0.089 0.082
## ---------------------------------------------------------------------------
## day_of_week
## n missing distinct
## 349301 0 7
##
## Value Friday Monday Saturday Sunday Thursday Tuesday
## Frequency 53039 49916 48978 44121 51039 50846
## Proportion 0.152 0.143 0.140 0.126 0.146 0.146
##
## Value Wednesday
## Frequency 51362
## Proportion 0.147
## ---------------------------------------------------------------------------
## hour
## n missing distinct Info Mean Gmd .05 .10
## 349301 0 24 0.997 13.11 7.124 1 2
## .25 .50 .75 .90 .95
## 9 14 18 21 22
##
## lowest : 0 1 2 3 4, highest: 19 20 21 22 23
## ---------------------------------------------------------------------------
## ucr_part
## n missing distinct
## 349204 97 4
##
## Value Other Part One Part Three Part Two
## Frequency 1369 67385 173867 106583
## Proportion 0.004 0.193 0.498 0.305
## ---------------------------------------------------------------------------
## street
## n missing distinct
## 338029 11272 4730
##
## lowest : ALBANY ST BLUE HILL AVE COLUMBUS AVE COMMONWEALTH AVE MASSACHUSETTS AVE
## highest: YUILL CIR ZAMORA CT ZAMORA ST ZEIGLER ST ZELLER ST
## ---------------------------------------------------------------------------
## lat
## n missing distinct Info Mean Gmd .05 .10
## 327445 21856 18390 1 42.21 0.2582 42.27 42.28
## .25 .50 .75 .90 .95
## 42.30 42.33 42.35 42.36 42.37
##
## Value -1.0 42.0 42.5
## Frequency 841 3249 323355
## Proportion 0.003 0.010 0.988
## ---------------------------------------------------------------------------
## long
## n missing distinct Info Mean Gmd .05 .10
## 327445 21856 18390 1 -70.9 0.3914 -71.14 -71.13
## .25 .50 .75 .90 .95
## -71.10 -71.08 -71.06 -71.05 -71.04
##
## Value -71 -1
## Frequency 326604 841
## Proportion 0.997 0.003
## ---------------------------------------------------------------------------
## location
## n missing distinct
## 349301 0 18406
##
## lowest : (-1.00000000, -1.00000000) (0.00000000, 0.00000000) (42.23241330, -71.12971531) (42.23265556, -71.13069992) (42.23287025, -71.13004959)
## highest: (42.39351800, -71.01140500) (42.39352862, -71.01158244) (42.39407288, -71.01096386) (42.39421300, -71.06741300) (42.39504158, -71.01017732)
## ---------------------------------------------------------------------------
# df_status(crime) # to see percenatage and quantile of each columns.One thing to observe from above describe table is we cannot say crime has risen from 2015 because there were only observation from August compare to other years. For 2015 we start from August 15 missing almost 3/4 of data.For 2018 we are missing half a month of data. But its fair to say that the report of crime has stayed even in all these 4 years around 30% in each year.
Looking at the month Column we can say that crime among all months remain constant except month of December where crimes are fairly low. My guess is it’s cold and sometime it’s snowy in Boston around December and also we have Christmas. People might be more busy towards holiday Parties, more family related things happening and also people are less crossing each others path. I cannot say its very low towards winter and high around summer as we need to break it down by year to make a call of the pattern which we might do it later.
If we look at the day of the week column. It seems like Friday seems more arrest and Sunday seems less report of crime.
UCR part three is major almost 50% observation followed by part two 30% and then 20 % part I.
Something to Observe over here are the q_na which stands for quantity of NA. q_zeros : Quantity of Zeros in each column p_na which stands for percentage of NA and p_zeros which stands for percentage of Zeros. Apart from that we can see the unique values in each column. There are almost 222 different offense code for 67 different groups. We can witness 12 unique district for 4 year data. Interesting is 5 ucr_part as of data sets we should have only 3 ucr_part. What are other two? There are 4733 unique name of street.
Here we have another visualization of the number of incidents by district, this time we plot them in the order from least counts to most counts. As discussed before, Roxbury, Dorchester, and South End exhibits the most counts of crime between the years.
Lets see those missing district rows.
# Lets see number of observation in each district
table(crime$district, useNA = "ifany") # useNA shows the Na##
## A1 A15 A7 B2 B3 C11 C6 D14 D4 E13 E18 E5
## 39397 7085 14259 54804 38692 46487 26053 21992 46012 19366 18996 14248
## <NA>
## 1910
# sum(table(crime$district)) # sum + Missing = total obs.
missing_dist <- crime %>% filter(is.na(district))
head(missing_dist$district, 5) # See NA in district ## [1] <NA> <NA> <NA> <NA> <NA>
## Levels: A1 A15 A7 B2 B3 C11 C6 D14 D4 E13 E18 E5
head(missing_dist, 2)## incident_number offense_code offense_code_group
## 1 I182102915 3006 Medical Assistance
## 2 I182102903 1830 Drug Violation
## offense_description district reporting_area shooting
## 1 SICK/INJURED/MEDICAL - PERSON <NA> NA <NA>
## 2 DRUGS - SICK ASSIST - HEROIN <NA> NA <NA>
## occurred_on_date year month day_of_week hour ucr_part street lat
## 1 2018-12-21 2018 12 Friday 20 Part Three <NA> 42.33737
## 2 2018-12-21 2018 12 Friday 20 Part Two <NA> 42.29863
## long location
## 1 -71.07804 (42.33737410, -71.07803582)
## 2 -71.06064 (42.29863416, -71.06063601)
Where district is Missing we are also Missing Reporting area and only information we have in some is location which can be helpful to find.
Exploratory Data Analysis Boundaries
# Uncomment below to see details about missing district datasets.
# table(missing_dist$reporting_area) # Too see how many repoting area is available to fix.
# sum(table(missing_dist$reporting_area)) # getting the sum of reporting area.
# table(missing_dist$shooting) # to see how many shooting happen in these datasets. Y:2
# table(missing_dist$year) # to see which year has how many missing data.
# table(missing_dist$month) # To see if the datasets is missing from particular month.
# table(missing_dist$long) # To see how many unique longitude we have.
# table(missing_dist$lat) # To see how many unique latitude we have.
# table(missing_dist$day_of_week) # We have missing those district in each day of week.
# table(missing_dist$hour) # Check the hour
# table(missing_dist$offense_code_group) # chechking if crime was any help to figure out why these missing district.
# sum(table(missing_dist$lat))
# sum(table(missing_dist$long))Converting District Code to District Names
Just knowing the district codes is hard to know where the crime was reported in Boston. People do know the neighborhood name rather than District name so our first step will be to recode the district code to District name.
# Lets Rename the distrcit code to district name
district_name = c(
A1 = 'Downtown',
A15= 'Charlestown',
A7= 'East Boston',
B2= 'Roxbury',
B3= 'Mattapan',
C6= 'South Boston',
C11= 'Dorchester',
D4= 'South End',
D14= 'Brighton',
E5= 'West Roxbury',
E13= 'Jamaica Plain',
E18= 'Hyde Park')
district_name## A1 A15 A7 B2
## "Downtown" "Charlestown" "East Boston" "Roxbury"
## B3 C6 C11 D4
## "Mattapan" "South Boston" "Dorchester" "South End"
## D14 E5 E13 E18
## "Brighton" "West Roxbury" "Jamaica Plain" "Hyde Park"
We could clean up these missing datas in many ways. Some of the ideas are by using reporting area to figure out which district these missing district are coming from in total there are only 40 columns with reporting. Latitude and Longitude position will be helpful to fix these missing districts. We have almost 1593 observations. It will be tedious work to chek each rows to fix and geolocation doesnt range by district as the district are not rectangle or square. There could be some efficient way to do these but will deal with these later.
For now my plan is to ignore these 1909 which is 0.0054% which shouldnt have any affect on our datasets. While shooting it might fluctuate so I check that variable and found out there were 2 case where district is missing. I can fix by reporting area too. Most of the missing data are from 2018, almost 708, it could be still uploading as of today. I even break it down by month to see if certain month have whole missing data for district. But it was spread evenly between months as well.
Lets Move on and concentrate on bigger pie of data rather than focusing on small missing datasets.
Lets ignore that missing district.
# We will change all district code to district name by passing the following code.
crime$district <- as.factor(district_name[(crime$district)] )
#unique(crime$district)
sort(table(crime$district), decreasing = TRUE)##
## Roxbury South Boston Brighton Downtown Mattapan
## 54804 46487 46012 39397 38692
## Dorchester South End West Roxbury Jamaica Plain East Boston
## 26053 21992 19366 18996 14259
## Hyde Park Charlestown
## 14248 7085
# sum(table(crime$district)) # 346594 + 1915 = 348509 observationFrom the summary Column we can say that Motor Vehicle Accident Response was recorded most and Larceny the second most.There were 1416 Shooting cases as of today. Most of the crimes happens on Friday followed by Monday and Saturday. UCR (Uniform Crime Reporting ) part Three is mostly reported followed by Part II and then I. Most of the crime reported street is Washignton street,followed by Blue Hill Ave and then Boylston st. Also there are 4 different Washington street in Boston, MA. As of now I don’t have actual length of street so its hard to normalise the data based on crime and length.
# Looking duplicate data by incident_number, year, day_of_week : TIME POINT
get_dupes(crime, incident_number, year) %>% nrow## [1] 71342
duplicated <- get_dupes(crime, incident_number, year, day_of_week)
head(duplicated)## # A tibble: 6 x 18
## incident_number year day_of_week dupe_count offense_code
## <fct> <int> <fct> <int> <int>
## 1 I030217815-08 2015 Thursday 2 3125
## 2 I030217815-08 2015 Thursday 2 111
## 3 I060168073-00 2018 Saturday 3 3125
## 4 I060168073-00 2018 Saturday 3 1864
## 5 I060168073-00 2018 Saturday 3 1864
## 6 I070720870-00 2018 Thursday 11 3125
## # ... with 13 more variables: offense_code_group <fct>,
## # offense_description <fct>, district <fct>, reporting_area <int>,
## # shooting <fct>, occurred_on_date <date>, month <int>, hour <int>,
## # ucr_part <fct>, street <fct>, lat <dbl>, long <dbl>, location <fct>
As of today we have 71300 observation with same Incident Number with different crimes happened on same day. Only difference is the crimes that were committed have different offense code.
# Looking duplicate data by incident_number, year, street, district : LOCATION
duplicate_count <- get_dupes(crime, incident_number, year, street, district, day_of_week)
head(duplicate_count)## # A tibble: 6 x 18
## incident_number year street district day_of_week dupe_count offense_code
## <fct> <int> <fct> <fct> <fct> <int> <int>
## 1 I030217815-08 2015 RIVER… Jamaica… Thursday 2 3125
## 2 I030217815-08 2015 RIVER… Jamaica… Thursday 2 111
## 3 I060168073-00 2018 CENTR… West Ro… Saturday 3 3125
## 4 I060168073-00 2018 CENTR… West Ro… Saturday 3 1864
## 5 I060168073-00 2018 CENTR… West Ro… Saturday 3 1864
## 6 I070720870-00 2018 BROOK… Roxbury Thursday 11 3125
## # ... with 11 more variables: offense_code_group <fct>,
## # offense_description <fct>, reporting_area <int>, shooting <fct>,
## # occurred_on_date <date>, month <int>, hour <int>, ucr_part <fct>,
## # lat <dbl>, long <dbl>, location <fct>
Lets see which day most of those multiple offense crime occurs.
tabyl(duplicate_count$day_of_week)## duplicate_count$day_of_week n percent
## Friday 10940 0.1533459
## Monday 10007 0.1402680
## Saturday 10161 0.1424266
## Sunday 8671 0.1215413
## Thursday 10668 0.1495332
## Tuesday 10355 0.1451459
## Wednesday 10540 0.1477391
sort(table(duplicate_count$day_of_week), decreasing = TRUE)##
## Friday Thursday Wednesday Tuesday Saturday Monday Sunday
## 10940 10668 10540 10355 10161 10007 8671
Again Friday comes on top of the list followed by Thrusday and Wednesday.Remember these are all in duplicated datasets which is different from orginal Crime datasets. Most of the multiple offense crime happened on Friday followed by Thrusday and then Wednesday.
head(sort(table(duplicate_count$offense_code_group),decreasing = TRUE), 10)##
## Drug Violation Other Warrant Arrests
## 9491 6844 5581
## Simple Assault Larceny Aggravated Assault
## 4905 4178 3502
## Violations Missing Person Located Missing Person Reported
## 3466 3171 3059
## Vandalism
## 3022
In our observation we see top ten contenders are drug Violation, followed by Warrant Arrests, Simple Assault. Remember these are all in duplicated datasets which is different from orginal Crime datasets.
#sort(table(duplicate_count$dupe_count),decreasing = TRUE)
tabyl(duplicate_count$dupe_count)## duplicate_count$dupe_count n percent
## 2 47246 0.6622466429
## 3 13842 0.1940231561
## 4 5900 0.0827002327
## 5 2440 0.0342014522
## 6 1020 0.0142973284
## 7 371 0.0052003028
## 8 176 0.0024669900
## 9 108 0.0015138348
## 10 80 0.0011213591
## 11 33 0.0004625606
## 12 48 0.0006728155
## 13 26 0.0003644417
## 14 14 0.0001962378
## 18 18 0.0002523058
## 20 20 0.0002803398
Most of the duplicated crimes happened at the same location are mostly charged with two violation code followed all the way upto 20 different offense. In past 4 years there is 1 such case. Similarly 4 cases where someone is charged with 12 different offense code. I am really interested in finding out which was that one case where someone is charged with 20 different crimes offense.
duplicate_count[which(duplicate_count$dupe_count == 20), ] ## # A tibble: 20 x 18
## incident_number year street district day_of_week dupe_count
## <fct> <int> <fct> <fct> <fct> <int>
## 1 I152071596 2015 CREST… Roxbury Saturday 20
## 2 I152071596 2015 CREST… Roxbury Saturday 20
## 3 I152071596 2015 CREST… Roxbury Saturday 20
## 4 I152071596 2015 CREST… Roxbury Saturday 20
## 5 I152071596 2015 CREST… Roxbury Saturday 20
## 6 I152071596 2015 CREST… Roxbury Saturday 20
## 7 I152071596 2015 CREST… Roxbury Saturday 20
## 8 I152071596 2015 CREST… Roxbury Saturday 20
## 9 I152071596 2015 CREST… Roxbury Saturday 20
## 10 I152071596 2015 CREST… Roxbury Saturday 20
## 11 I152071596 2015 CREST… Roxbury Saturday 20
## 12 I152071596 2015 CREST… Roxbury Saturday 20
## 13 I152071596 2015 CREST… Roxbury Saturday 20
## 14 I152071596 2015 CREST… Roxbury Saturday 20
## 15 I152071596 2015 CREST… Roxbury Saturday 20
## 16 I152071596 2015 CREST… Roxbury Saturday 20
## 17 I152071596 2015 CREST… Roxbury Saturday 20
## 18 I152071596 2015 CREST… Roxbury Saturday 20
## 19 I152071596 2015 CREST… Roxbury Saturday 20
## 20 I152071596 2015 CREST… Roxbury Saturday 20
## # ... with 12 more variables: offense_code <int>,
## # offense_code_group <fct>, offense_description <fct>,
## # reporting_area <int>, shooting <fct>, occurred_on_date <date>,
## # month <int>, hour <int>, ucr_part <fct>, lat <dbl>, long <dbl>,
## # location <fct>
After digging dive in I think there should be 4 people involved. As we see there are 4 repeataed different offense code. I don’t think one person could be chargedd with 4 different offense. So this happen in 2015. Lets pull out the whole history about it.
For these we can take row name from original crime datasets.
crime[which(crime$incident_number == "I152071596"), ] ## incident_number offense_code offense_code_group
## 329107 I152071596 111 Homicide
## 329108 I152071596 111 Homicide
## 329109 I152071596 111 Homicide
## 329110 I152071596 111 Homicide
## 329111 I152071596 413 Aggravated Assault
## 329112 I152071596 413 Aggravated Assault
## 329113 I152071596 413 Aggravated Assault
## 329114 I152071596 413 Aggravated Assault
## 329115 I152071596 3001 Medical Assistance
## 329116 I152071596 3001 Medical Assistance
## 329117 I152071596 3001 Medical Assistance
## 329118 I152071596 3001 Medical Assistance
## 329119 I152071596 3503 Missing Person Located
## 329120 I152071596 3503 Missing Person Located
## 329121 I152071596 3503 Missing Person Located
## 329122 I152071596 3503 Missing Person Located
## 329123 I152071596 2662 Ballistics
## 329124 I152071596 2662 Ballistics
## 329125 I152071596 2662 Ballistics
## 329126 I152071596 2662 Ballistics
## offense_description district reporting_area
## 329107 MURDER, NON-NEGLIGIENT MANSLAUGHTER Roxbury 326
## 329108 MURDER, NON-NEGLIGIENT MANSLAUGHTER Roxbury 326
## 329109 MURDER, NON-NEGLIGIENT MANSLAUGHTER Roxbury 326
## 329110 MURDER, NON-NEGLIGIENT MANSLAUGHTER Roxbury 326
## 329111 ASSAULT - AGGRAVATED - BATTERY Roxbury 326
## 329112 ASSAULT - AGGRAVATED - BATTERY Roxbury 326
## 329113 ASSAULT - AGGRAVATED - BATTERY Roxbury 326
## 329114 ASSAULT - AGGRAVATED - BATTERY Roxbury 326
## 329115 DEATH INVESTIGATION Roxbury 326
## 329116 DEATH INVESTIGATION Roxbury 326
## 329117 DEATH INVESTIGATION Roxbury 326
## 329118 DEATH INVESTIGATION Roxbury 326
## 329119 MISSING PERSON - NOT REPORTED - LOCATED Roxbury 326
## 329120 MISSING PERSON - NOT REPORTED - LOCATED Roxbury 326
## 329121 MISSING PERSON - NOT REPORTED - LOCATED Roxbury 326
## 329122 MISSING PERSON - NOT REPORTED - LOCATED Roxbury 326
## 329123 BALLISTICS EVIDENCE/FOUND Roxbury 326
## 329124 BALLISTICS EVIDENCE/FOUND Roxbury 326
## 329125 BALLISTICS EVIDENCE/FOUND Roxbury 326
## 329126 BALLISTICS EVIDENCE/FOUND Roxbury 326
## shooting occurred_on_date year month day_of_week hour ucr_part
## 329107 Y 2015-08-29 2015 8 Saturday 2 Part One
## 329108 Y 2015-08-29 2015 8 Saturday 2 Part One
## 329109 Y 2015-08-29 2015 8 Saturday 2 Part One
## 329110 Y 2015-08-29 2015 8 Saturday 2 Part One
## 329111 Y 2015-08-29 2015 8 Saturday 2 Part One
## 329112 Y 2015-08-29 2015 8 Saturday 2 Part One
## 329113 Y 2015-08-29 2015 8 Saturday 2 Part One
## 329114 Y 2015-08-29 2015 8 Saturday 2 Part One
## 329115 Y 2015-08-29 2015 8 Saturday 2 Part Three
## 329116 Y 2015-08-29 2015 8 Saturday 2 Part Three
## 329117 Y 2015-08-29 2015 8 Saturday 2 Part Three
## 329118 Y 2015-08-29 2015 8 Saturday 2 Part Three
## 329119 Y 2015-08-29 2015 8 Saturday 2 Part Three
## 329120 Y 2015-08-29 2015 8 Saturday 2 Part Three
## 329121 Y 2015-08-29 2015 8 Saturday 2 Part Three
## 329122 Y 2015-08-29 2015 8 Saturday 2 Part Three
## 329123 Y 2015-08-29 2015 8 Saturday 2 Part Two
## 329124 Y 2015-08-29 2015 8 Saturday 2 Part Two
## 329125 Y 2015-08-29 2015 8 Saturday 2 Part Two
## 329126 Y 2015-08-29 2015 8 Saturday 2 Part Two
## street lat long location
## 329107 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329108 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329109 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329110 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329111 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329112 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329113 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329114 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329115 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329116 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329117 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329118 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329119 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329120 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329121 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329122 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329123 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329124 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329125 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329126 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
As we get guessed it happen in same location which can be confirmed by location address. Also by looking the offense code description seems like there was shooting involved from shooting column and also there were some death happened as we can see Murder and Manslaughter on offense code.
Offense_code “2662” is ballastic. Lets see what we can get from there.Lets Analysed by District first.
ballastic <- crime[which(crime$offense_code == "2662"), ]
ballastic %>% group_by(district) %>%
count(sort = TRUE)## # A tibble: 13 x 2
## # Groups: district [13]
## district n
## <fct> <int>
## 1 Roxbury 291
## 2 South Boston 239
## 3 Mattapan 214
## 4 West Roxbury 83
## 5 Dorchester 70
## 6 Brighton 53
## 7 Jamaica Plain 28
## 8 East Boston 24
## 9 Downtown 23
## 10 South End 22
## 11 Hyde Park 14
## 12 Charlestown 13
## 13 <NA> 5
Lets Analysed by Year
ballastic <- crime[which(crime$offense_code == "2662"), ]
ballastic %>% group_by(year) %>%
count()## # A tibble: 4 x 2
## # Groups: year [4]
## year n
## <int> <int>
## 1 2015 170
## 2 2016 303
## 3 2017 341
## 4 2018 265
Compare to 2016 and 2017 we have less crime involving ballastics.
But has shooting changed over course of year in each district.
group_shoot_year <- crime[which(crime$shooting == "Y"), ]
group_shoot_year %>%
group_by(district, year) %>%
count()## # A tibble: 49 x 3
## # Groups: district, year [49]
## district year n
## <fct> <int> <int>
## 1 Brighton 2015 21
## 2 Brighton 2016 10
## 3 Brighton 2017 36
## 4 Brighton 2018 25
## 5 Charlestown 2015 1
## 6 Charlestown 2016 1
## 7 Charlestown 2017 1
## 8 Charlestown 2018 2
## 9 Dorchester 2015 4
## 10 Dorchester 2016 9
## # ... with 39 more rows
Overall the shooting has decreased compare to last year in each district. It has gone down.Lets see below the percentage of the duplicated count based on incident number percentage.
head( sort(table(duplicate_count$street),decreasing = TRUE), 10)##
## WASHINGTON ST BOYLSTON ST BLUE HILL AVE HARRISON AVE
## 3497 2062 1887 1684
## MASSACHUSETTS AVE DORCHESTER AVE TREMONT ST CENTRE ST
## 1444 1343 1217 978
## COLUMBIA RD DUDLEY ST
## 795 711
When the repeated crimes are occured in same location. It seems like Washington street still has multiple offense but there are more multiple offense on Boylston street than blue hill avenue which is slightly different from our datasets.Others remain the same.
Lets see why there are 5 different observation of ucr_part?
tabyl(crime$ucr_part)## crime$ucr_part n percent valid_percent
## Other 1369 0.0039192559 0.003920345
## Part One 67385 0.1929138479 0.192967435
## Part Three 173867 0.4977569489 0.497895213
## Part Two 106583 0.3051322498 0.305217008
## <NA> 97 0.0002776975 NA
So we see 98 observation of empty rows which are not coded as NA or Zeros. and 1366 are coded as other. Thats why in describe it didnt show up as Missing or NA.
ggplot(subset(crime,!is.na(district)))+
aes(x=district)+
geom_bar(stat = "count",fill='red') +
geom_text(stat="count",aes(label=..count..),vjust=-1)+
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
labs(title="Frequency of Incidents by Disctrict",
x="Districts",
y="Number of Incidents")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(limit=c(0,60000))+
theme_pankaj# Dot plot
crime_dist <- crime %>%
filter() %>%
group_by(district) %>%
summarise(n = n()) %>%
ungroup() %>%
arrange(desc(n))
# Frequency of crime by district wise:
ggplot(crime_dist, aes(x = reorder(district, n), y = n)) +
geom_point(size = 12, stat = "identity", color = "red") +
geom_text(aes(label = n, fontface = "bold"), color = "black", size = 2) +
theme_minimal(base_size = 20) +
xlab("District") + ylab("Count") +
ggtitle("Frequency of crime district wise") +
scale_y_continuous(limits=c(0,max(crime_dist$n+2000)))+
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
theme_pankaj+
theme(axis.text.x = element_text(angle = 45, hjust = 1))# Save the dotplot as a png file
ggsave("dotplot.png", scale = 2, dpi = 400)## Saving 14 x 9 in image
# Plotly
plot_district = plot_ly(crime, x = ~ district, color = ~ district) %>%
add_histogram() %>%
layout(
title = "Total district count by the crime",
xaxis = list(title = "district count",
yaxis = list(title = "Count"))
)
plot_district## Warning: Ignoring 1910 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
xyplot(lat~long|district, data=crime, xlim=c(-71.2,-70.95), ylim=c(42.22,42.4))# Frequency of crime by district (in descending order)
count(crime, district, sort = TRUE)## # A tibble: 13 x 2
## district n
## <fct> <int>
## 1 Roxbury 54804
## 2 South Boston 46487
## 3 Brighton 46012
## 4 Downtown 39397
## 5 Mattapan 38692
## 6 Dorchester 26053
## 7 South End 21992
## 8 West Roxbury 19366
## 9 Jamaica Plain 18996
## 10 East Boston 14259
## 11 Hyde Park 14248
## 12 Charlestown 7085
## 13 <NA> 1910
Lets dig into our crime datasets as we have some sort of idea how the duplicate incidents number has behaved. Lets have uniform date by fixing time Zone.
# Detected data with Daylight Saving Time, converting to UTC/GMT
crime$date <- as.POSIXlt(crime$occurred_on_date, "GMT") # Fixing time zone
crime$date <- as.Date(crime$date) # Date
# Lets also create a seperate Year, month , day column for EDA.
crime <- crime %>% mutate(crime_date = as.Date(occurred_on_date, "%Y-%m-%d"))
# create a sepearte ymd crime_date then we will divide that crime_date into year,month, day.
crime <-crime %>% mutate(month=as.numeric(month(occurred_on_date)))
crime <- crime %>% mutate(year=as.numeric(year(occurred_on_date)))
crime <- crime %>% mutate(day=as.numeric(day(occurred_on_date)))
# Lets look our datasets by date column.
head(crime$occurred_on_date)## [1] "2018-12-21" "2018-12-21" "2018-12-21" "2018-12-21" "2018-12-21"
## [6] "2018-12-21"
crime %>%
count(date) %>%
ggplot(aes(date, n))+
geom_line()+ # line graph for date.
expand_limits(y = 0)+ # less misleading
labs(title ="Continous Crime Date Pattern", color = "red")+
labs(caption = "Source: Boston Data | @ Pankaj Shah")+
theme(plot.title = element_text(color="#D70026", size=14, face="bold.italic", hjust = 0.5, vjust=0.5))+
theme_pankajAs we can see the datasets that we extracted have continous recording without any breaks.Crime seems like pickup during the summer month and fall back in Winter. The pattern remains the same. Also we can see at the end of the year .i.e month of december have deep in crime rate. Living in Boston for a while one thing I have notice that December is coldest month, not one of the coldest but definetly the temperature remains high around 30-35F during the day and gets colder at night dropping down to 20-25F at night. Also Most of the schools have finals around this time of the year and most of the people live the town right after the final before the Christmas break. So it does explain the fall of crime around month of december. Also most of the people are busy either in Shopping Mall or with their family members for Christmas.
Frequency of Incidents by Disctrict
ggplot(subset(crime,!is.na(district)))+
aes(x=year, color=district)+
geom_line(stat="count")+
scale_x_continuous(breaks = seq(2012,2018,1))+
scale_y_continuous(breaks = seq(5000,50000,5000))+
labs(title="Frequency of Incidents by Disctrict", x="Districts", y="Number of Incidents")+
labs(caption = "Source: Boston Data | @ Pankaj Shah")+
theme_pankajCumulative Frequency of Incidents by District from 2015 to 2018. There seems to be a not any significant increase in number of crimes from 2016 upto 2018. The rise in line graph could be misleading interms of missing half a year of data in 2015.But from these graph we can see there are less crime in different district. We can analyse or rank the district based on the crime occured based on these graph.
crime%>%
count(offense_code, sort = TRUE) %>%
ggplot(aes (n))+
geom_histogram()+
labs(title="Majority of crimes offense code lies below 5000", x="offense_code", y="n")+
labs(caption = "Source: Boston Data | @ Pankaj Shah")+
theme_pankajThe majority of the offense code lies below 5000.
#table(crime$offense_code_group)
offense_code_crime <- sort(xtabs(formula = ~ offense_code_group, data = crime), decreasing = TRUE)
head(offense_code_crime)## offense_code_group
## Motor Vehicle Accident Response Larceny
## 40750 28570
## Medical Assistance Investigate Person
## 25967 20311
## Other Drug Violation
## 19694 18074
offense_code_crime <- as.tibble(offense_code_crime)
top_offense_code_crime <- offense_code_crime %>% top_n(10)## Selecting by n
top_offense_code_crime## # A tibble: 10 x 2
## offense_code_group n
## <chr> <int>
## 1 Motor Vehicle Accident Response 40750
## 2 Larceny 28570
## 3 Medical Assistance 25967
## 4 Investigate Person 20311
## 5 Other 19694
## 6 Drug Violation 18074
## 7 Simple Assault 17383
## 8 Vandalism 16744
## 9 Verbal Disputes 14358
## 10 Towed 12375
crime_offense_code <- sort(table(crime$offense_code), decreasing = TRUE)
top_10_offense_code_crime <- as.tibble(head(crime_offense_code, 10))
top_10_offense_code_crime## # A tibble: 10 x 2
## Var1 n
## <chr> <int>
## 1 3006 20784
## 2 3115 20316
## 3 3831 17827
## 4 1402 16462
## 5 802 16230
## 6 3301 14358
## 7 3410 12375
## 8 3114 12230
## 9 617 9969
## 10 2647 9963
crime %>%
filter(offense_code %in% top_10_offense_code_crime$Var1) %>%
group_by(district) %>%
count## # A tibble: 13 x 2
## # Groups: district [13]
## district n
## <fct> <int>
## 1 Brighton 17662
## 2 Charlestown 3182
## 3 Dorchester 11188
## 4 Downtown 14800
## 5 East Boston 6285
## 6 Hyde Park 6465
## 7 Jamaica Plain 8840
## 8 Mattapan 17914
## 9 Roxbury 23533
## 10 South Boston 21598
## 11 South End 10378
## 12 West Roxbury 8054
## 13 <NA> 615
Most of the arrest was made for offense code 3006. Lets see what is that.
crime[which(crime$offense_code == "3006"), ] %>% head(1)## incident_number offense_code offense_code_group
## 5 I182102916 3006 Medical Assistance
## offense_description district reporting_area shooting
## 5 SICK/INJURED/MEDICAL - PERSON South Boston 359 <NA>
## occurred_on_date year month day_of_week hour ucr_part street
## 5 2018-12-21 2018 12 Friday 20 Part Three MELVILLE AVE
## lat long location date crime_date day
## 5 42.29429 -71.06821 (42.29428541, -71.06820981) 2018-12-21 2018-12-21 21
It seems like most of the calls were made for Medical Assistance Sick/Injured Medical
medical_assiscatance <- crime[which(crime$offense_code == "3006"), ]
sort(table(medical_assiscatance$district), decreasing = TRUE)##
## Roxbury South Boston Brighton Downtown Mattapan
## 2985 2924 2072 2063 2055
## Dorchester Jamaica Plain South End Hyde Park West Roxbury
## 1610 1500 1405 1308 1271
## East Boston Charlestown
## 1143 398
top_10_street <- sort(table(medical_assiscatance$street), decreasing = TRUE)
head(top_10_street, 10)##
## WASHINGTON ST DORCHESTER AVE BLUE HILL AVE MASSACHUSETTS AVE
## 794 348 337 313
## CENTRE ST TREMONT ST COMMONWEALTH AVE BOYLSTON ST
## 309 290 273 267
## RIVER ST COLUMBIA RD
## 223 214
crime[which(crime$offense_code == "3115"), ] %>% head(1)## incident_number offense_code offense_code_group offense_description
## 2 I182102933 3115 Investigate Person INVESTIGATE PERSON
## district reporting_area shooting occurred_on_date year month
## 2 Dorchester 958 <NA> 2018-12-21 2018 12
## day_of_week hour ucr_part street lat long
## 2 Friday 20 Part Three BELLFLOWER ST 42.32445 -71.05811
## location date crime_date day
## 2 (42.32445145, -71.05810680) 2018-12-21 2018-12-21 21
investigate_person <- crime[which(crime$offense_code == "3115"), ]
sort(table(investigate_person$district), decreasing = TRUE)##
## Roxbury South Boston Mattapan Brighton Downtown
## 2957 2898 2678 2263 1744
## Dorchester South End Jamaica Plain Hyde Park West Roxbury
## 1583 1333 1296 1065 983
## East Boston Charlestown
## 925 514
head(sort(table(investigate_person$street), decreasing = TRUE),10)##
## WASHINGTON ST BLUE HILL AVE CENTRE ST DORCHESTER AVE
## 872 524 327 286
## BOYLSTON ST HARRISON AVE COMMONWEALTH AVE TREMONT ST
## 285 255 249 235
## W BROADWAY MASSACHUSETTS AVE
## 231 218
Did shooting took place while investigating person?
sort(table(investigate_person$shooting), decreasing = TRUE)## Y
## 12
There are 12 incidents out of 20236 Where shooting was involved while investigating the person.
In which district did the shooting took place while investigating person?
investigate_shooting <- investigate_person %>%
filter(shooting == "Y")
sort(table(investigate_shooting$district), decreasing = TRUE)##
## Roxbury West Roxbury East Boston Mattapan Brighton
## 8 2 1 1 0
## Charlestown Dorchester Downtown Hyde Park Jamaica Plain
## 0 0 0 0 0
## South Boston South End
## 0 0
In which reporting area did the shooting took place while investigating person?
sort(table(investigate_shooting$reporting_area), decreasing = TRUE)##
## 326 330 909 26 441 906 912
## 4 2 2 1 1 1 1
330, 326, 909 : Mattapan | 26 : Roxbury | 906, 912 :Jamaica Plain | 441 : South Boston |
Various other components to look at while the shooting took place when investigating person
sort(table(investigate_shooting$day_of_week), decreasing = TRUE)##
## Saturday Wednesday Friday Sunday Thursday Monday Tuesday
## 4 4 2 1 1 0 0
sort(table(investigate_shooting$month), decreasing = TRUE)##
## 7 9 1 2 5 6 11
## 4 3 1 1 1 1 1
sort(table(investigate_shooting$year), decreasing = TRUE)##
## 2018 2016 2017
## 8 2 2
head(sort(table(investigate_shooting$street), decreasing = TRUE),8)##
## FAYSTON ST BLUE HILL AVE BORDER ST CABOT ST CENTRE ST
## 4 2 1 1 1
## HEATH ST WALES ST WHITTIER ST
## 1 1 1
sort(table(investigate_shooting$hour), decreasing = TRUE)##
## 18 20 3 4 13 19 21 23
## 4 2 1 1 1 1 1 1
It is remarkable to see that shooting has been increased while investigating in 2018 we have 8 incidents compare to 2016, 2017. 4 shooting happen around 6pm while investigating.
crime$offense_category <- crime$offense_code_groupunique(crime$offense_code_group[grep("Motor",crime$offense_code_group)])## [1] Larceny From Motor Vehicle Motor Vehicle Accident Response
## 67 Levels: Aggravated Assault Aircraft ... Warrant Arrests
unique(crime$offense_code_group[grep("Larceny",crime$offense_code_group)])## [1] Larceny From Motor Vehicle Larceny
## 67 Levels: Aggravated Assault Aircraft ... Warrant Arrests
Offense grouping
crime<- crime %>%
mutate(offense_category = ifelse(offense_code_group %in% c("Fire Related Reports", "Firearm Discovery", "Arson"),"Arson",
ifelse(offense_code_group %in% c("Simple Assault", "Verbal Disputes", "Disorderly Conduct", "Aggravated Assault", "Prisoner Related Incidents"),"Assault",
ifelse(offense_code_group %in% c("Other Burglary", "Residential Burglary", "Commercial Burglary", "Burglary - No Property Taken"),"Burglary",
ifelse(offense_code_group %in% c("Medical Assistance", "Biological Threat"),"Medical",
ifelse(offense_code_group %in% c("Larceny", "Larceny From Motor Vehicle", "Auto Theft", "Auto Theft Recovery", "HOME INVASION", "Embezzlement"),"Larceny",
ifelse(offense_code_group %in% c("Missing Person Reported", "Missing Person Located"),"Missing",
ifelse(offense_code_group %in% c("Motor Vehicle Accident Response", "Towed", "License Plate Related Incidents"),"MVT",
ifelse(offense_code_group %in% c("Investigate Person", "INVESTIGATE PERSON","Investigate Property","Drug Violation","Violations","Counterfeiting","Operating Under the Influence","Firearm Violations","License Violation","Restraining Order Violations","Liquor Violation","Assembly or Gathering Violations","Confidence Games","Harbor Related Incidents","Evading Fare","Gambling"),"NON-VIO",
ifelse(offense_code_group %in% c("Property Lost","Property Found","Recovered Stolen Property","Property Related Damage","Landlord/Tenant Disputes"),"Property",
ifelse(offense_code_group %in% c("Robbery"),"Robbery",
ifelse(offense_code_group %in% c("Harassment", "Criminal Harassment", "Prostitution"),"SEX",
ifelse(offense_code_group %in% c("HUMAN TRAFFICKING", "HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE"), "TRAFFICKING",
ifelse(offense_code_group %in% c("Vandalism"),"Vandalism",
ifelse(offense_code_group %in% c("Warrant Arrests", "Search Warrants"),"Warrant",
ifelse(offense_code_group %in% c("Police Service Incidents","Ballistics","Bomb Hoax","Offenses Against Child / Family","Homicide","Explosives","Manslaughter"),"VIO",
ifelse(offense_code_group %in% c("Fraud"),"Fraud",
ifelse(offense_code_group %in% c("Phone Call Complaints", "Service", "Aircraft"),"Service+_callls",
"Other"))))))))))))))))))table(crime$offense_category)##
## Arson Assault Burglary Fraud
## 2921 43484 7921 6460
## Larceny Medical Missing MVT
## 47149 25969 9617 53737
## NON-VIO Other Property Robbery
## 71257 19694 18817 4989
## Service+_callls SEX TRAFFICKING Vandalism
## 387 4830 10 16744
## VIO Warrant
## 5176 10139
# Frequency and percent of crime category (in descending order)
count(crime, offense_category, sort = TRUE) %>%
mutate(percent = round(n/sum(n)*100, 1))## # A tibble: 18 x 3
## offense_category n percent
## <chr> <int> <dbl>
## 1 NON-VIO 71257 20.4
## 2 MVT 53737 15.4
## 3 Larceny 47149 13.5
## 4 Assault 43484 12.4
## 5 Medical 25969 7.4
## 6 Other 19694 5.6
## 7 Property 18817 5.4
## 8 Vandalism 16744 4.8
## 9 Warrant 10139 2.9
## 10 Missing 9617 2.8
## 11 Burglary 7921 2.3
## 12 Fraud 6460 1.8
## 13 VIO 5176 1.5
## 14 Robbery 4989 1.4
## 15 SEX 4830 1.4
## 16 Arson 2921 0.8
## 17 Service+_callls 387 0.1
## 18 TRAFFICKING 10 0
# Mean frequency of crime category per district
crime %>%
group_by(offense_category,district) %>%
summarise(total = n()) %>%
group_by(offense_category) %>%
summarise(average = round(mean(total, na.rm=TRUE), 0))## # A tibble: 18 x 2
## offense_category average
## <chr> <dbl>
## 1 Arson 225
## 2 Assault 3345
## 3 Burglary 609
## 4 Fraud 497
## 5 Larceny 3627
## 6 Medical 1998
## 7 Missing 740
## 8 MVT 4134
## 9 NON-VIO 5481
## 10 Other 1515
## 11 Property 1447
## 12 Robbery 384
## 13 Service+_callls 30
## 14 SEX 372
## 15 TRAFFICKING 2
## 16 Vandalism 1288
## 17 VIO 398
## 18 Warrant 780
# Count by district
# Frequency of crime category by district (in descending order)
crime %>%
group_by(offense_category, district) %>%
summarise(n = n()) %>%
ungroup() %>%
arrange(desc(n))## # A tibble: 226 x 3
## offense_category district n
## <chr> <fct> <int>
## 1 NON-VIO Roxbury 11502
## 2 Larceny Brighton 11108
## 3 NON-VIO South Boston 9587
## 4 Assault Roxbury 8350
## 5 NON-VIO Mattapan 8197
## 6 NON-VIO Brighton 8158
## 7 NON-VIO Downtown 8144
## 8 MVT Roxbury 8115
## 9 MVT South Boston 7255
## 10 Larceny Downtown 7224
## # ... with 216 more rows
crime %>%
count(offense_category, month = floor_date(date, "month")) %>%
filter(month > min(month)) %>%
filter(month < max(month)) %>%
ggplot(aes(month, n , color = offense_category))+
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
geom_line()+
labs(title = "Offense Category by Month")+
theme_pankajWe used crime classification from the Book of Criminal Justice and plotted the number of incidents in each type of crime. It helps to visualize the types of crime by classifying them into broader cattegories; e.g. instead of having many types of larceny, it helps to only visualize all types of larceny as one category.
plot_crime_offense_category = plot_ly(crime, x = ~ offense_category, color = ~ offense_category) %>%
add_histogram() %>%
layout(
title = "Total Offense category count by the crime",
xaxis = list(title = "Offense category",
yaxis = list(title = "Count"))
)
plot_crime_offense_category## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Larceny
# Frequency and percent of Vehicle crime by district (in descending order)
crime %>%
filter(offense_category == "Larceny") %>%
group_by(district) %>%
summarise(n = n()) %>%
ungroup() %>%
arrange(desc(n)) %>%
mutate(percent = round(n/sum(n)*100, 1))## # A tibble: 13 x 3
## district n percent
## <fct> <int> <dbl>
## 1 Brighton 11108 23.6
## 2 Downtown 7224 15.3
## 3 Roxbury 5907 12.5
## 4 South Boston 4636 9.8
## 5 Dorchester 3787 8
## 6 Mattapan 2993 6.3
## 7 South End 2877 6.1
## 8 West Roxbury 2680 5.7
## 9 Jamaica Plain 1941 4.1
## 10 East Boston 1429 3
## 11 Hyde Park 1388 2.9
## 12 Charlestown 1000 2.1
## 13 <NA> 179 0.4
df <- crime %>%
filter(offense_category == "Larceny") %>%
group_by(district) %>%
summarise(n = n())
calendar_heatmap <- ggplot(df, aes(x= reorder(district, -n), y=n, fill=n)) +
geom_tile(aes(fill=n)) +
geom_text(aes(label=n), size=3.5, color="black") +
scale_fill_gradient("Frequency", low = "darkgreen", high = "red") +
theme_pankaj+
labs(x = "District")+
ggtitle("Larceny levels in Boston") +
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
calendar_heatmap#ggsave("calendar_heatmap.png", scale = 1, dpi = 300)MA_Arrest_Summary <-crime %>%
filter(offense_category == 'Larceny') %>%
group_by(month, district, year) %>%
summarise(Count = n())
MA_Arrest_Summary %>% head(5)## # A tibble: 5 x 4
## # Groups: month, district [2]
## month district year Count
## <dbl> <fct> <dbl> <int>
## 1 1 Brighton 2016 277
## 2 1 Brighton 2017 249
## 3 1 Brighton 2018 205
## 4 1 Charlestown 2016 26
## 5 1 Charlestown 2017 27
ggplot(MA_Arrest_Summary, aes(x= district, y= Count, fill = year)) +
geom_point()+
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_pankaj # 80:20
tbl <- crime %>%
filter(offense_category == "Larceny") %>%
group_by(district) %>%
summarise(n = n()) %>%
ungroup() %>%
arrange(desc(n)) %>%
mutate(percent.crimes = round(n/sum(n)*100, 1),
cum.percent.crimes = round(cumsum(percent.crimes), 1),
percent.district = 1/n()*100,
cum.percent.district = round(cumsum(percent.district), 1)) %>%
select(DISTRICT = district,
`No. crimes` = n,
`% crimes` = percent.crimes,
`Cum. % crimes` = cum.percent.crimes,
`Cum. % district` = cum.percent.district)
# Create a simple table and save as a pdf
pdf("80-20_rule.pdf", height=11, width=8.5)
grid.table(tbl, rows = NULL)
dev.off()## quartz_off_screen
## 2
Larceny Proportion in December 2015 and December 2017 We take a closer look at larceny as a proportion of total robbery offences, by district, in December 2015. Since December is coming up, and it is a holiday month, we wanted to see the spread of larceny across districts. It is interesting to see that more larceny occurs in the South End and and Roxbury than any other district, but this is in line with our observation that a lot more crime occurs in these districts than any other.
However, what is more interesting is seeing the change in distrct level larceny between 2015 and 2017. Larceny counts increased in the South End from 2015 to 2017, whereas the larceny counts decreased significantly in Roxbury.
# Recoding 1 & 0 for shooting as there are less shooting in crime involved.
crime$shooting <- ifelse(as.character(crime$shooting) == "Y", 1, 0)
table(crime$shooting == 1)##
## TRUE
## 1417
MA_Arrest_Summary <-crime %>%
filter(offense_category == 'Larceny') %>%
group_by(month, district, year) %>%
summarise(Count = n()
)
MA_Arrest_Summary %>% head(5)## # A tibble: 5 x 4
## # Groups: month, district [2]
## month district year Count
## <dbl> <fct> <dbl> <int>
## 1 1 Brighton 2016 277
## 2 1 Brighton 2017 249
## 3 1 Brighton 2018 205
## 4 1 Charlestown 2016 26
## 5 1 Charlestown 2017 27
# Summarize the arrests
MA_Arrest_Summary <-crime %>%
filter(offense_category == 'Larceny' | shooting == "Y") %>%
group_by(month, district, year) %>%
summarise(Count = n())
MA_Arrest_Summary %>% head(5)## # A tibble: 5 x 4
## # Groups: month, district [2]
## month district year Count
## <dbl> <fct> <dbl> <int>
## 1 1 Brighton 2016 277
## 2 1 Brighton 2017 249
## 3 1 Brighton 2018 205
## 4 1 Charlestown 2016 26
## 5 1 Charlestown 2017 27
ggplot(MA_Arrest_Summary, aes(x= district, y= Count, fill = year)) +
geom_point()+
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_pankaj # Frequency of crime by district (in descending order)
par(mfrow=c(2,2))
df_3 <- crime %>%
filter(date == "2015-12-01" & offense_category == "Larceny") %>%
count(district, sort = TRUE) %>%
mutate(percent = round(n/sum(n)*100, 1)) %>%
select(district, percent) %>%
spread(district, percent)
df_3 <- df_3[ , order(-df_3 [which(rownames(df) == '1'), ]) ]
# Use waffle
waffle(df_3, rows = 4, size = 2,
colors=(RColorBrewer::brewer.pal(n=10,"Set3")),
title="District level Larceny as a proportion of \n total Robbery offences on year December 2015",
legend_pos = "right")+
labs(caption= "Data Source : Boston |@ Pankaj Shah")# ggsave("waffle.png", scale=1.5, dpi=300)
# Frequency of crime by district (in descending order)
df_3 <- crime %>%
filter(date == "2018-12-01" & offense_category == "Larceny") %>%
count(district, sort = TRUE) %>%
mutate(percent = round(n/sum(n)*100, 1)) %>%
select(district, percent) %>%
spread(district, percent)
df_3 <- df_3[ , order(-df_3 [which(rownames(df) == '1'), ]) ]
# Use waffle
waffle(df_3, rows = 4, size = 2,
colors=(RColorBrewer::brewer.pal(n=12,"Set3")),
title="District level Larceny as a proportion of \n total Robbery offences on year December 2018",
legend_pos = "right")+labs(caption= "Data Source : Boston |@ Pankaj Shah")# ggsave("waffle_1.png", scale=1.5, dpi=300)shoot <- crime[which(crime$shooting == 1), ]
sort(table(shoot$district), decreasing = TRUE)##
## Roxbury Mattapan South Boston Brighton West Roxbury
## 494 299 249 92 91
## Dorchester Jamaica Plain East Boston South End Hyde Park
## 46 44 27 27 22
## Downtown Charlestown
## 19 5
sort(table(shoot$year), decreasing = TRUE)##
## 2017 2018 2016 2015
## 457 364 345 251
m1 <- data.frame(sort(table(shoot$month), decreasing = TRUE))
m1 <- m1 %>% rename(month = Var1)
m1## month Freq
## 1 7 186
## 2 6 174
## 3 8 160
## 4 12 156
## 5 10 137
## 6 9 126
## 7 11 113
## 8 4 86
## 9 5 86
## 10 1 85
## 11 2 63
## 12 3 45
d1 <- data.frame(table(shoot$day_of_week))
d1 <- d1 %>% rename(wday = Var1)
d1## wday Freq
## 1 Friday 187
## 2 Monday 162
## 3 Saturday 336
## 4 Sunday 201
## 5 Thursday 179
## 6 Tuesday 143
## 7 Wednesday 209
p1 <- data.frame(sort(table(shoot$day), decreasing = TRUE))
p1 <- p1 %>% rename(day = Var1)
# Overall Shooting throughout the year
ggplot(data = shoot, aes(x = month)) +
geom_bar(label = TRUE)+
ggtitle("Overall Shooting throughout the year")+
labs(caption = "Source: Boston Datasets | @ Pankaj Shah", x = "Month") +
theme_pankaj## Warning: Ignoring unknown parameters: label
# Overall Shooting throughout the month
ggplot(data = shoot, aes(x = day)) +
geom_bar()+
ggtitle("Overall Shooting throughout the month ")+
labs(caption = "Source: Boston Datasets | @ Pankaj Shah", x = "Day")+
theme_pankaj# Overall Shooting throughout the week
ggplot(data = d1, aes(x = wday, Freq)) +
geom_bar(stat = "identity")+
ggtitle("Overall Shooting throughout the week ") +
labs(caption = "Source: Boston Datasets | @ Pankaj Shah", x = "Wday") +
theme_pankajp <- ggplot(d1, aes(x = wday, y = Freq, fill = Freq)) +
geom_bar(width = 1, stat="identity") + coord_polar("y", start=pi / 3) +
ggtitle("Pie Chart distribution of shooting throughout the week") +
labs(caption= "Data Source : Boston |@ Pankaj Shah")
print(p) plot(prop.table(table(crime$reporting_area)))We can see which regions have most of the crimes.
Lets divide day in 4 shifts to know the crime time. We generate six points of the day to bin the day into four equal segments.
time_diff <- c("0", "6", "12", "18", "24") # Breaking day into 6 interval period
crime$time_diff <- cut(crime$hour,
breaks = time_diff,
labels = c("00-06", "06-12", "12-18", "18-24"),
include.lowest = TRUE)
table(crime$time_diff)##
## 00-06 06-12 12-18 18-24
## 53012 96671 122138 77480
#createing Shift plot
crime <- crime %>% mutate(shift = ifelse(time_diff == "00-06", "Late Night",
ifelse(time_diff == "06-12", "Morning",
ifelse(time_diff == "12-18", "Day",
"Evening"))))
table(crime$shift)##
## Day Evening Late Night Morning
## 122138 77480 53012 96671
#
plot_shift = plot_ly(crime, x = ~ time_diff, color = ~ time_diff) %>%
add_histogram() %>%
layout(
title = "Total district count by the crime",
xaxis = list(title = "Shift count",
yaxis = list(title = "Count"))
)
plot_shifttemp <- aggregate(crime$offense_category,
by = list(crime$offense_category, crime$time_diff), FUN= length)
names(temp) <- c("offense_category", "time_diff", "count")
head(temp)## offense_category time_diff count
## 1 Arson 00-06 521
## 2 Assault 00-06 8168
## 3 Burglary 00-06 1608
## 4 Fraud 00-06 871
## 5 Larceny 00-06 5574
## 6 Medical 00-06 4223
Crime by Time Range of Day: It seems like most of the crime is committed between 6am and 6pm. Most incidents happen between 12pm and 6pm, while the least happens between 12am and 6am.
Lets look at the holiday data from 2015 till today
holiday <- crime %>% filter(crime_date == "2015-01-01"| # New Year 2015
crime_date == "2015-01-19"| # Martin Luther King, Jr.
crime_date == "2015-02-16"| # Washington’s Birthday
crime_date == "2015-05-25"| # Memorial Day
crime_date == "2015-07-04"| # Independence Day
crime_date == "2015-09-07"| # Labor Day
crime_date == "2015-10-12"| # Columbus Day
crime_date == "2015-11-11"| # Veterans Day
crime_date == "2015-11-26"| # Thanksgiving Day
crime_date == "2015-12-25"| # Christmas Day
crime_date == "2016-01-01"| # New Year 2016
crime_date == "2016-01-18"| # Martin Luther King, Jr.
crime_date == "2016-02-15"| # Washington’s Birthday
crime_date == "2016-05-30"| # Memorial Day
crime_date == "2016-07-04"| # Independence Day
crime_date == "2016-09-05"| # Labor Day
crime_date == "2016-10-10"| # Columbus Day
crime_date == "2016-11-11"| # Veterans Day
crime_date == "2016-11-24"| # Thanksgiving Day
crime_date == "2016-12-25"| # Christmas Day
crime_date == "2017-01-01"| # New Year 2017
crime_date == "2017-01-16"| # Martin Luther King, Jr.
crime_date == "2017-02-20"| # Washington’s Birthday
crime_date == "2017-05-29"| # Memorial Day
crime_date == "2017-07-04"| # Independence Day
crime_date == "2017-09-04"| # Labor Day
crime_date == "2017-10-09"| # Columbus Day
crime_date == "2017-11-10"| # Veterans Day
crime_date == "2017-11-23"| # Thanksgiving Day
crime_date == "2017-12-25"| # Christmas Day
crime_date == "2018-01-01"| # New Year 2018
crime_date == "2018-01-15"| # Martin Luther King, Jr.
crime_date == "2018-02-19"| # Washington’s Birthday
crime_date == "2018-05-28"| # Memorial Day
crime_date == "2018-07-04"| # Independence Day
crime_date == "2018-09-03"| # Labor Day
crime_date == "2018-10-08"| # Columbus Day
crime_date == "2018-11-12"| # Veterans Day
crime_date == "2018-11-22") # Thanksgiving Day
# No data for christmas 2018.count_holiday <- holiday %>% group_by(crime_date,offense_category) %>% summarise(count = n())
count_holiday_total <- holiday %>% group_by(crime_date) %>% summarise(count = n())
count_holiday_year <- holiday %>% group_by(year) %>% summarise(count = n())
count_crime <- crime %>% group_by(year,offense_category) %>% summarise(count = n())
count_shift <- crime %>% group_by(year,shift) %>% summarise(count = n())
count_shift$count <- as.numeric(count_shift$count)
count_crime$count <- as.numeric(count_crime$count)
count_district <- crime %>% group_by(year,district) %>% summarise(count = n())
choice <- unique(count_crime$offense_category)
total_crime <- crime %>% group_by(year) %>% summarise(total = n())
total_crime$count <- as.numeric(total_crime$total)There seems to be higher counts of crime on Fridays than the rest of the week, with Sunday having the lowest counts of crime. More crime also occur during the weekdays rather than on the weekend. Notice here that the y-axis scale starts from 43,000 and ends with 52,500.
crime$day <- crime$DAY_OF_WEEK
crime$day <- factor(crime$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
qplot(crime$day, xlab= "Day of week", main= "Crimes by day of week") + scale_y_continuous("Number of crimes")+
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
theme_pankajplot_crime_offense_day = plot_ly(crime, x = ~day , color = ~ day) %>%
add_histogram() %>%
layout(
title = "Total district count by the crime during the day",
xaxis = list(title = "Days",
yaxis = list(title = "Count"))
)
plot_crime_offense_dayMost crimes happen on the later half of the year, and the most crime-ridden quarter is the third quarter of the year.
crime$Month <- month.abb[crime$month]
table(crime$Month)##
## Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
## 24162 35343 28684 21706 23684 34801 30742 24190 26329 31197 34101 34362
crime$Month <- factor(crime$Month, levels= c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
ggplot(crime, aes(x= Month))+
geom_bar()+
theme_pankaj+
ggtitle("Crimes by Month")+
labs(y ="Number of crimes")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
labs(caption = "Boston Data Source | @ Pankaj Shah")Gradient Tile Plot of Crime Types by Time of Day This is a more detailed view of the types of crimes (by intensity) throughout the day segmented by 6 hour sections.
ggplot(temp, aes(x= offense_category, y= factor(time_diff))) +
geom_tile(aes(fill= count)) + scale_x_discrete("offense_category", expand = c(0,0)) + scale_y_discrete("Time of day", expand = c(0,-2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "orange") + theme_bw() + ggtitle("Crimes by time of day") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line (colour = NA))+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
theme_pankajAside from there being more crime between 6am and 6pm, non-violent crimes have the highest intensity in terms of occurence, and happen between 12pm and 6pm. More motor-vehicle crimes occur between 6am and 12pm than any othe time range, and most larceny occurs between 12pm and 6pm.
Gradient Tile Plot of Crime Types by Day of Week
temp1 <- aggregate(crime$offense_category, by = list(crime$offense_category, crime$day), FUN= length)
names(temp1) <- c("offense_category", "day", "count")
ggplot(temp1, aes(x= offense_category, y= day, fill= count)) + geom_tile(aes(fill= count)) +
scale_x_discrete("offense_category", expand = c(0,0)) + scale_y_discrete("Day of week", expand = c(0,-2)) + scale_fill_gradient("Number of crimes", low = "white", high = "lightpink") +
ggtitle("Crimes by day of week") +
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
theme_pankaj +
theme(axis.text.x = element_text(angle = 45, hjust = 1))In our previous plot ‘Crime by Day of Week’, we saw that the occurence of crime is even distributed throughout the weekdays, and that crime happens less on weekends. Here, we are able to see the each type of crime committed, and its frequency in terms of intensity, throughout the week.
Most crimes seem to be in line with what the previous graph told us, however, it seems that vandalism occurs more often on weekends than weekdays, as suggested by the deeper color on Saturday and Sunday than the rest of the week.
Gradient Tile Plot of Crime Types by Month
temp2 <- aggregate(crime$offense_category, by = list(crime$offense_category, crime$Month), FUN= length)
names(temp2) <- c("offense_category", "Month", "count")
ggplot(temp2, aes(x= offense_category, y= Month, fill= count)) +
geom_tile(aes(fill= count)) +
scale_x_discrete("OFFENSE_CAT", expand = c(0,0)) +
scale_y_discrete("month", expand = c(0,-2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "lightgreen") + theme_bw() +
ggtitle("Crimes by Month") +
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
theme_pankaj +
theme(axis.text.x = element_text(angle = 45, hjust = 1))Previously, we see that crimes most often occur in the third quarter of the year, here we are able to see what types of crime are committed throughout the year by month.
The majority of the types of crime are pretty evenly distributed, in terms of color intensity, throughout the year. For crimes such non-violent crime, motor vehicle theft, larceny, and assault, they are in line with our observation that crime occurs more in Q3 than other quarters.
Here we explore the possibility that seasonality affects the rate at which crime is committed per hour. However, it doesn’t seem like there are any significant differences between seasons on the rate of crime occurence.
#create seasons
crime<- crime %>% mutate(Season = ifelse(month %in% c(6,7,8), "Summer",
ifelse(month %in% c(9,10,11), "Fall",
ifelse(month %in% c(12,1,2), "Winter",
"Spring"))))
table(crime$Season)##
## Fall Spring Summer Winter
## 99660 74681 100886 74074
# Plot
plot_crime_offense_season = plot_ly(crime, x = ~ Season, color = ~ Season) %>%
add_histogram() %>%
layout(
title = "Total count of the crime by Season",
xaxis = list(title = "Season",
yaxis = list(title = "Count"))
)
plot_crime_offense_seasonstreet_crime <- sort(table(crime$street), decreasing = TRUE)
head(street_crime, 10)##
## WASHINGTON ST BLUE HILL AVE BOYLSTON ST DORCHESTER AVE
## 15634 8521 7944 5647
## TREMONT ST MASSACHUSETTS AVE HARRISON AVE CENTRE ST
## 5261 5145 5097 4817
## COMMONWEALTH AVE HYDE PARK AVE
## 4570 3801
library("RColorBrewer")
pal = brewer.pal(9,"Blues")
street_name <- as.tibble(table(crime$street))
colnames(street_name) <- c("Street_Name", "Count")
wordcloud(street_name$Street_Name, street_name$Count, min.freq = 200, random.order = F, random.color = F, colors =c("black", "cornflowerblue", "darkred"), scale = c(2,.3))This is a word cloud of the most dangerous streets in Boston. Washington St. appears the be the most dangerous street in Boston, however there are Wahsingston streets in multiple districts, so that is a limitation of this word cloud.
Most Dangerous Streets in Each District Here you can see the top three most crime-ridden streets by district.
streetd = data.frame(with(crime,table(street, district)))
topstreet = data.frame()
for (district_name in unique(streetd$district)) {
substreetd = subset(streetd, district==district_name)
tmp = substreetd[with(substreetd, order(Freq, decreasing=T))[1:3], ]
topstreet = rbind(topstreet, tmp)
}
topstreet$street = as.factor(topstreet$street)
topstreet$Rank = rep(3:1, 6)
library(ggplot2)
ggplot(topstreet, aes(x= Rank, y = Freq, label=street, fill = Freq))+
geom_text(aes(label=street), size=3.0) +
ylim(0,10000)+
geom_bar(stat='identity')+
geom_text(size=9, hjust=0, vjust=0)+
facet_wrap(~district, nrow=12)+
theme_bw(base_size=30)+
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
coord_flip()