1 Introduction

I started to analysed this project keeping one thing in my mind. I lived in Boston City for past 10 years but how safe is it compared to any other big cities around the United States? It’s no secret here that Boston is greater than most places on the planet (especially if you ask any Bostonian who is literally comparing it to New York). That’s not just their stance/attitude/ego talking; but in our analysis I have some serious data just to prove it. According to most of the recent polls and studies published as of today 12/21/2018, from tasty burger/water, shack shake to pedestrian safety(Vision Zero Boston) Boston seems to get better year after year. Vision Zero Boston is Cities commitment to focus the City’s resources on proven strategies to eliminate fatal and serious traffic crashes in the City by 2030, that rate Boston as tops in the nation.When I get a chance to look at Beantown’s incidence of rape (39 per 100,000, putting it third on the list of cities in the survey) and assault (483 per 100,000, fifth on the list of cities in the survey) [external data Source] it might seem like a surprisingly dangerous big city.Surprising then is that only 27% of those surveyed as of end of 2018 said they considered Boston to fairly or very unsafe. According to recent surveys the city’s low murder rate per capita (9 per 100,000) could likely be a contributing factor to this perception.The city also has the second-lowest incidence of vehicle theft at a rate of 258 per 100,000 and fares well on the other crimes rates.

Is it true to say that Boston was more violent than New York and Seattle, but less violent than Chicago and Las Vegas, according to numbers from the FBI, based on crimes committed back in 2015. As of Today 12/21/18 Nationally, Boston ranked 14 out of 50 according to Us News.After digging up and analysing the data what we found out that in recent numbers from the Boston Police Department, or BPD, show that violent crime, as well as property crime, has continued to drop, and has been steadily dropping for past years.

My Goal was to build R shiny app to see the results without digging too much in data. Took a lot of help from Online Community to do EDA as well as to built the R Shiny App. Hope you guys enjoy reading these and if you like please Upvote.

I have successfully built R shiny app which covers crime map based on murders, aggravated assaults, and robberies, as well as property crimes such as burglaries, auto thefts, larcenies, and arson for data of date between August 2015 till today. Crimes such as sexual assaults and rapes have not been included because of a lack of data.

1.1 Executive Summary

Public safety is vital to public health and happiness, and a state’s safety can be a crucial factor in deciding where to relocate your family.The map shows that Mattapan,Roxbury, and Dorchester have had more aggravated assaults than other neighborhoods. Roxbury, downtown, and Dorchester had more robberies than other sections of the city. Back Bay, the South End, and downtown suffered more larcenies. In 2017 the rate of violent crime in Massachusetts fell for the sixth year in a row and dropped below national levels for the second time since 2016, according to FBI statistics.

In 2017, there were 358 incidents of violent crime per 100,000 people in Massachusetts, compared to the national rate of 394 per 100,000 people, according to data published in late September.The FBI defines violent crime as homicide, rape, robbery, and aggravated assault. The bureau said it calculates crime rates using figures voluntarily provided by law enforcement agencies, and tallies estimates when it doesn’t receive a complete year’s worth of data.

The overall crime rate in Boston is equal to the national average. For every 100,000 people, there are 7.56 daily crimes that occur in Boston. Boston is safer than 14% of the cities in the United States. In Boston you have a 1 in 37 chance of becoming a victim of any crime. The number of total year over year crimes in Boston has decreased by 9%.

I want to highlight what the City of Boston says about the data they provide:

“Although the City has made reasonable efforts to provide accurate data, the City makes no representations or guarantees about the accuracy, completeness, or currency of the information provided. The City of Boston provides this data as is and with all faults, and makes no warranty of any kind. Each user is responsible for determining the suitability of the data for their intended use or purpose.”

1.2 Description

The dataset ‘Crime Incident Report’ is an open data intiative program led by the Boston government to document the initial details surrounding an incident to which Boston Police Department (BPD) officers respond. The dataset contains records from the Boston government’s new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occured. The Boston government took an initiative to improve the city of Boston by releasing its data source to the public. Over the past few decades, the way we look at the field of climate, genetics, sports, have been altered dramatically due to big data technology advancements; similarly, the way crime data was traditionally held by law enforcement agencies has also changed, crime prediction is a niche trend in this era.

The dataset begins from August 2015 to December 21st 2018 (1st day Of Winter), As of December 21st, 2018, there are 349080 incidents and 17 variables; ranging from types of offense, reported area and reporting area, date occured, street, and the longitude and latitude of the incident. The size of the dataset is 78.5mb.

1.3 Why crime data?

Why we choose this dataset?

Performing an analysis of crime on the Boston dataset will help not only me but people living here to identify and analyze patterns in Boston’s crime incidents. I believe that the results of this analysis can be useful to law enforcement agencies in how they deploy their resources, and to assist in identifying and apprehending suspects. Last but not least, it would help many international people residing in Boston, to better understand the city of Boston.

1.4 Analysis objective

I plan to explore the incidents backwards through time, and mine for patterns by time and location and to visually represent these results. These results are not only potentially beneficial to the law enforcement, but they can also be beneficial for the residents of Boston to see and understand where and how often crime is happening in their neighborhood. I hope that by understanding the frequency of incidents in a neighborhood, residents can be proactive about how they report incidents.

2 Data Preparation

2.1 Library

These are the packages involved in reading this report.

library("dplyr")
library("tidyverse")
library("chron")
library("ggplot2")
library("janitor")
library("Hmisc")
library("funModeling")
library("tidyverse")
library("openair")
library("rticles")
library("lubridate")
library("tidyr")
library("skimr")
library("rmarkdown")
library("visdat")
library("maps")
library("leaflet")
library("plotly")
library("waffle") 
library("DataExplorer")
library("lattice")
library("wordcloud")
library("gridExtra")

Theme Settings

2.2 Importing Datasets

Reads the csv file ‘crime_incident_reports.csv’ from the Government of Boston’s website and save it as variable raw_crime. The code is written so the reader of the .Rmd file does not need the dataset downloaded to run the code chunks. The datasets is from August 2015 till today December 21st 2018.

This report is based on crimes reported to, and arrests made by, the Boston Police Department within the City of Boston.

raw_crime = read.csv('https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/12cb3883-56f5-47de-afa5-3b1cf61b257b/download/crime_incident_reports.csv', sep = ",", na.strings =c('','NA','na','N/A','n/a','NaN','nan'), strip.white = TRUE, stringsAsFactors = FALSE)

2.3 MA | Boston

Overview of Massachusetts and Northeastern University

The red outlined state is the state of Massachusetts, located in the Northeastern part of the United States. Boston is situated on the Eastern side of Massachusetts, touching the Atlantic ocean and sits at 42.3601° N, 71.0589° W.

The second map map of Boston through Leaflet Package.

The Third Map is also Map of Boston distributed among Crime District by Latitude and Longitude.

ggplot(map_data("state", region = "Massachusetts"), aes(long, lat, group = group)) +
  geom_polygon(fill = "gray", colour = "red") +
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  coord_quickmap()

leaflet() %>%
  setView(lng=-71.0892, lat=42.3398, zoom = 10) %>% 
  addTiles() %>%
  addMarkers(lng=-71.0892, lat=42.3398, popup="Boston") 
# Crime Mapping
qplot(Long, Lat, data= raw_crime, color=DISTRICT, geom='point', xlim = c(-71.2,-70.95), ylim= c(42.22,42.4))+
theme_bw(base_size=15)+
  geom_point(size = 2)+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")
## Warning: Removed 22697 rows containing missing values (geom_point).

## Warning: Removed 22697 rows containing missing values (geom_point).

I used the latitude and longitude coordinates to plot each incident in the area. This roughly depicts the map of Boston.

Lets name our datasets so that we don’t have to download time and again when we modify our code.

# Lets name crime datasets, if we mess up then we dont have to import another time.
crime <- raw_crime

3 Statisticsl Analysis Exploration

Info about the crime datasets:

Gather some basic Information about the datasets before exploration. The datasets is 78.5 Mb with 349080 Rows/observation and 17 different Columns. One thing you will observe is most of the Missing data are coming from Reporting area, latitude and longitude.

Lets further explore only missing datas in table format. So when we sort them in descending order we can see Reporting area has 22236 missing values and lat and long has 21842 missing values as of today 12/21/18.

One of the false analysis is saying 99.59% of datasets is missing in Shooting Column. Upon diagnosis we found out if shooting didn’t take place the row was left empty and if it took place it was coded as “Y”. I also note that there are 12 districts in Boston datasets but if we see district it shows 13 unique values. There is something more than 12 district. Below it shows there is no missing data from District.

3.1 Missing Data Analysis

Project Boundaries

# Handling the Missing datasets
sort(sapply(crime, function(x) sum(is.na(x))), decreasing = TRUE)
##            SHOOTING      REPORTING_AREA                 Lat 
##              347884               22244               21856 
##                Long              STREET            DISTRICT 
##               21856               11272                1910 
##            UCR_PART     INCIDENT_NUMBER        OFFENSE_CODE 
##                  97                   0                   0 
##  OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION    OCCURRED_ON_DATE 
##                   0                   0                   0 
##                YEAR               MONTH         DAY_OF_WEEK 
##                   0                   0                   0 
##                HOUR            Location 
##                   0                   0
# Sometimes we need to be carefull when package ask us to remove the missing data.
plot_missing(crime)

# Clean the column names.
crime <- clean_names(crime)

If shooting hasn’t occured then the row is left empty. Thats why it is showing 347664 Missing Varaibles in Shooting Column. There are about 1913 Missing district datasets. One of the idea to fix that would be use Latitude and Longitude data if avialable. But we have 21842 Lat Long missing data as well. Also we can give a try by using Reporting Area if its avialable. As the streets are not unique it will be Little harder to find the missing district based on Street Name.

3.2 Glimpse

# Take a peek at our datasets
glimpse(crime)
## Observations: 349,301
## Variables: 17
## $ incident_number     <chr> "I182102934", "I182102933", "I182102923", ...
## $ offense_code        <int> 1402, 3115, 614, 3301, 3006, 3006, 3831, 3...
## $ offense_code_group  <chr> "Vandalism", "Investigate Person", "Larcen...
## $ offense_description <chr> "VANDALISM", "INVESTIGATE PERSON", "LARCEN...
## $ district            <chr> "C11", "C6", "B2", "B2", "C11", NA, "C11",...
## $ reporting_area      <int> 244, 958, 329, 329, 359, NA, 355, 774, 300...
## $ shooting            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ occurred_on_date    <chr> "2018-12-21 21:30:00", "2018-12-21 20:21:0...
## $ year                <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, ...
## $ month               <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12...
## $ day_of_week         <chr> "Friday", "Friday", "Friday", "Friday", "F...
## $ hour                <int> 21, 20, 16, 20, 20, 20, 20, 17, 20, 20, 17...
## $ ucr_part            <chr> "Part Two", "Part Three", "Part One", "Par...
## $ street              <chr> "DORCHESTER AVE", "BELLFLOWER ST", "NAZING...
## $ lat                 <dbl> 42.31812, 42.32445, 42.30664, 42.30833, 42...
## $ long                <dbl> -71.05673, -71.05811, -71.08585, -71.07599...
## $ location            <chr> "(42.31811990, -71.05673136)", "(42.324451...
## Fixing Missing Data Lat and Long

# latitude <- gsub('.*\\((.*),.*', '\\1', crime$location)
# longitude <- gsub('.*, (.*)\\).*', '\\1', crime$location)
# latitude <- as.factor(latitude)
# longitude <- as.factor(longitude)

# All the missing data has lat and long coordinates 0.000 & 0.000

** Factor conversion for better Summary Results**

crime$incident_number <- as.factor(crime$incident_number)
crime$offense_code_group <- as.factor(crime$offense_code_group)
crime$offense_description <- as.factor(crime$offense_description)
crime$district <- as.factor(crime$district)
crime$shooting <- as.factor(crime$shooting)
crime$occurred_on_date <- as.Date(crime$occurred_on_date)
crime$day_of_week <- as.factor(crime$day_of_week)
crime$ucr_part <- as.factor(crime$ucr_part)
crime$street <- as.factor(crime$street)
crime$location <- as.factor(crime$location)

3.3 Incidents/Year

Below is the number of Incidents Reported Each Year.

crime %>%
    filter(occurred_on_date < ymd("2018-12-21")) %>% 
    group_by(occurred_on_date) %>%
    summarise(n = n()) %>%
    ggplot(aes(x = occurred_on_date, y = n)) +
        geom_point() +
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
    labs(title = " Scatterplot Number of Incidents Reported in each year",
      y = 'Number of Incidents Reported', 
         x = 'Date')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_pankaj

One Interesting observation to see is crime reporting frequency has been evenly distributed. If you follow the dot it seems to drop towards the end and have pick up around mid year.

3.4 Plot

** plotting the crime datasets is helpful to see skewness of datasets.**

# Nothing new or so specific to observe here. We have seen all these  data descriptvely. 
plot_num(crime)
## Warning: attributes are not identical across measure variables; they will
## be dropped

3.5 Summary

Below is the Summary of Our Crime Datasets

summary(crime)
##       incident_number    offense_code 
##  I152071596   :    20   Min.   : 111  
##  I172053750   :    18   1st Qu.:1001  
##  I162067346   :    14   Median :2907  
##  I130041200-00:    13   Mean   :2318  
##  I162030584   :    13   3rd Qu.:3201  
##  I152097957   :    12   Max.   :3831  
##  (Other)      :349211                 
##                        offense_code_group
##  Motor Vehicle Accident Response: 40750  
##  Larceny                        : 28570  
##  Medical Assistance             : 25967  
##  Investigate Person             : 20311  
##  Other                          : 19694  
##  Drug Violation                 : 18074  
##  (Other)                        :195935  
##                             offense_description    district     
##  SICK/INJURED/MEDICAL - PERSON        : 20784   B2     : 54804  
##  INVESTIGATE PERSON                   : 20316   C11    : 46487  
##  M/V - LEAVING SCENE - PROPERTY DAMAGE: 17827   D4     : 46012  
##  VANDALISM                            : 16462   A1     : 39397  
##  ASSAULT SIMPLE - BATTERY             : 16213   B3     : 38692  
##  VERBAL DISPUTE                       : 14358   (Other):121999  
##  (Other)                              :243341   NA's   :  1910  
##  reporting_area  shooting      occurred_on_date          year     
##  Min.   :  0.0   Y   :  1417   Min.   :2015-06-15   Min.   :2015  
##  1st Qu.:177.0   NA's:347884   1st Qu.:2016-05-09   1st Qu.:2016  
##  Median :343.0                 Median :2017-03-25   Median :2017  
##  Mean   :383.1                 Mean   :2017-03-19   Mean   :2017  
##  3rd Qu.:544.0                 3rd Qu.:2018-02-01   3rd Qu.:2018  
##  Max.   :962.0                 Max.   :2018-12-21   Max.   :2018  
##  NA's   :22244                                                    
##      month           day_of_week         hour             ucr_part     
##  Min.   : 1.000   Friday   :53039   Min.   : 0.00   Other     :  1369  
##  1st Qu.: 4.000   Monday   :49916   1st Qu.: 9.00   Part One  : 67385  
##  Median : 7.000   Saturday :48978   Median :14.00   Part Three:173867  
##  Mean   : 6.918   Sunday   :44121   Mean   :13.11   Part Two  :106583  
##  3rd Qu.:10.000   Thursday :51039   3rd Qu.:18.00   NA's      :    97  
##  Max.   :12.000   Tuesday  :50846   Max.   :23.00                      
##                   Wednesday:51362                                      
##             street            lat             long       
##  WASHINGTON ST : 15634   Min.   :-1.00   Min.   :-71.18  
##  BLUE HILL AVE :  8521   1st Qu.:42.30   1st Qu.:-71.10  
##  BOYLSTON ST   :  7944   Median :42.33   Median :-71.08  
##  DORCHESTER AVE:  5647   Mean   :42.21   Mean   :-70.90  
##  TREMONT ST    :  5261   3rd Qu.:42.35   3rd Qu.:-71.06  
##  (Other)       :295022   Max.   :42.40   Max.   : -1.00  
##  NA's          : 11272   NA's   :21856   NA's   :21856   
##                         location     
##  (0.00000000, 0.00000000)   : 21856  
##  (42.34862382, -71.08277637):  1384  
##  (42.36183857, -71.05976489):  1321  
##  (42.28482577, -71.09137369):  1187  
##  (42.32866284, -71.08563401):  1133  
##  (42.25621592, -71.12401947):   984  
##  (Other)                    :321436

3.6 Describe

describe(crime)
## crime 
## 
##  17  Variables      349301  Observations
## ---------------------------------------------------------------------------
## incident_number 
##        n  missing distinct 
##   349301        0   308436 
## 
## lowest : 142052550     I010370257-00 I030217815-08 I050310906-00 I060168073-00
## highest: I182102916    I182102919    I182102923    I182102933    I182102934   
## ---------------------------------------------------------------------------
## offense_code 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   349301        0      222    0.999     2318     1318      522      614 
##      .25      .50      .75      .90      .95 
##     1001     2907     3201     3802     3831 
## 
## lowest :  111  112  121  123  301, highest: 3811 3820 3821 3830 3831
## ---------------------------------------------------------------------------
## offense_code_group 
##        n  missing distinct 
##   349301        0       67 
## 
## lowest : Aggravated Assault               Aircraft                         Arson                            Assembly or Gathering Violations Auto Theft                      
## highest: Towed                            Vandalism                        Verbal Disputes                  Violations                       Warrant Arrests                 
## ---------------------------------------------------------------------------
## offense_description 
##        n  missing distinct 
##   349301        0      244 
## 
## lowest : A&B HANDS, FEET, ETC.  - MED. ATTENTION REQ.  A&B ON POLICE OFFICER                         ABDUCTION - INTICING                          AFFRAY                                        AIRCRAFT INCIDENTS                           
## highest: WEAPON - FIREARM - CARRYING / POSSESSING, ETC WEAPON - FIREARM - OTHER VIOLATION            WEAPON - FIREARM - SALE / TRAFFICKING         WEAPON - OTHER - CARRYING / POSSESSING, ETC   WEAPON - OTHER - OTHER VIOLATION             
## ---------------------------------------------------------------------------
## district 
##        n  missing distinct 
##   347391     1910       12 
##                                                                       
## Value         A1   A15    A7    B2    B3   C11    C6   D14    D4   E13
## Frequency  39397  7085 14259 54804 38692 46487 26053 21992 46012 19366
## Proportion 0.113 0.020 0.041 0.158 0.111 0.134 0.075 0.063 0.132 0.056
##                       
## Value        E18    E5
## Frequency  18996 14248
## Proportion 0.055 0.041
## ---------------------------------------------------------------------------
## reporting_area 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   327057    22244      879        1    383.1    273.5       56       97 
##      .25      .50      .75      .90      .95 
##      177      343      544      777      824 
## 
## lowest :   0   1   2   3   4, highest: 958 959 960 961 962
## ---------------------------------------------------------------------------
## shooting 
##        n  missing distinct    value 
##     1417   347884        1        Y 
##                
## Value         Y
## Frequency  1417
## Proportion    1
## ---------------------------------------------------------------------------
## occurred_on_date 
##        n  missing distinct 
##   349301        0     1286 
## 
## lowest : 2015-06-15 2015-06-16 2015-06-17 2015-06-18 2015-06-19
## highest: 2018-12-17 2018-12-18 2018-12-19 2018-12-20 2018-12-21
## ---------------------------------------------------------------------------
## year 
##        n  missing distinct     Info     Mean      Gmd 
##   349301        0        4    0.929     2017    1.149 
##                                       
## Value        2015   2016   2017   2018
## Frequency   53577  99314 101145  95265
## Proportion  0.153  0.284  0.290  0.273
## ---------------------------------------------------------------------------
## month 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   349301        0       12    0.993    6.918    3.797        1        2 
##      .25      .50      .75      .90      .95 
##        4        7       10       11       12 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency  23684 21706 24190 24162 26329 30742 34801 35343 34362 34101
## Proportion 0.068 0.062 0.069 0.069 0.075 0.088 0.100 0.101 0.098 0.098
##                       
## Value         11    12
## Frequency  31197 28684
## Proportion 0.089 0.082
## ---------------------------------------------------------------------------
## day_of_week 
##        n  missing distinct 
##   349301        0        7 
##                                                                       
## Value         Friday    Monday  Saturday    Sunday  Thursday   Tuesday
## Frequency      53039     49916     48978     44121     51039     50846
## Proportion     0.152     0.143     0.140     0.126     0.146     0.146
##                     
## Value      Wednesday
## Frequency      51362
## Proportion     0.147
## ---------------------------------------------------------------------------
## hour 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   349301        0       24    0.997    13.11    7.124        1        2 
##      .25      .50      .75      .90      .95 
##        9       14       18       21       22 
## 
## lowest :  0  1  2  3  4, highest: 19 20 21 22 23
## ---------------------------------------------------------------------------
## ucr_part 
##        n  missing distinct 
##   349204       97        4 
##                                                       
## Value           Other   Part One Part Three   Part Two
## Frequency        1369      67385     173867     106583
## Proportion      0.004      0.193      0.498      0.305
## ---------------------------------------------------------------------------
## street 
##        n  missing distinct 
##   338029    11272     4730 
## 
## lowest :  ALBANY ST           BLUE HILL AVE       COLUMBUS AVE        COMMONWEALTH AVE    MASSACHUSETTS AVE 
## highest: YUILL CIR           ZAMORA CT           ZAMORA ST           ZEIGLER ST          ZELLER ST          
## ---------------------------------------------------------------------------
## lat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   327445    21856    18390        1    42.21   0.2582    42.27    42.28 
##      .25      .50      .75      .90      .95 
##    42.30    42.33    42.35    42.36    42.37 
##                                
## Value        -1.0   42.0   42.5
## Frequency     841   3249 323355
## Proportion  0.003  0.010  0.988
## ---------------------------------------------------------------------------
## long 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   327445    21856    18390        1    -70.9   0.3914   -71.14   -71.13 
##      .25      .50      .75      .90      .95 
##   -71.10   -71.08   -71.06   -71.05   -71.04 
##                         
## Value         -71     -1
## Frequency  326604    841
## Proportion  0.997  0.003
## ---------------------------------------------------------------------------
## location 
##        n  missing distinct 
##   349301        0    18406 
## 
## lowest : (-1.00000000, -1.00000000)  (0.00000000, 0.00000000)    (42.23241330, -71.12971531) (42.23265556, -71.13069992) (42.23287025, -71.13004959)
## highest: (42.39351800, -71.01140500) (42.39352862, -71.01158244) (42.39407288, -71.01096386) (42.39421300, -71.06741300) (42.39504158, -71.01017732)
## ---------------------------------------------------------------------------
# df_status(crime) # to see percenatage and quantile of each columns.

One thing to observe from above describe table is we cannot say crime has risen from 2015 because there were only observation from August compare to other years. For 2015 we start from August 15 missing almost 3/4 of data.For 2018 we are missing half a month of data. But its fair to say that the report of crime has stayed even in all these 4 years around 30% in each year.

Looking at the month Column we can say that crime among all months remain constant except month of December where crimes are fairly low. My guess is it’s cold and sometime it’s snowy in Boston around December and also we have Christmas. People might be more busy towards holiday Parties, more family related things happening and also people are less crossing each others path. I cannot say its very low towards winter and high around summer as we need to break it down by year to make a call of the pattern which we might do it later.

If we look at the day of the week column. It seems like Friday seems more arrest and Sunday seems less report of crime.

UCR part three is major almost 50% observation followed by part two 30% and then 20 % part I.

Something to Observe over here are the q_na which stands for quantity of NA. q_zeros : Quantity of Zeros in each column p_na which stands for percentage of NA and p_zeros which stands for percentage of Zeros. Apart from that we can see the unique values in each column. There are almost 222 different offense code for 67 different groups. We can witness 12 unique district for 4 year data. Interesting is 5 ucr_part as of data sets we should have only 3 ucr_part. What are other two? There are 4733 unique name of street.

Here we have another visualization of the number of incidents by district, this time we plot them in the order from least counts to most counts. As discussed before, Roxbury, Dorchester, and South End exhibits the most counts of crime between the years.

Lets see those missing district rows.

# Lets see number of observation in each district
table(crime$district, useNA = "ifany") # useNA shows the Na
## 
##    A1   A15    A7    B2    B3   C11    C6   D14    D4   E13   E18    E5 
## 39397  7085 14259 54804 38692 46487 26053 21992 46012 19366 18996 14248 
##  <NA> 
##  1910
# sum(table(crime$district)) # sum + Missing = total obs.

missing_dist <- crime %>% filter(is.na(district))
head(missing_dist$district, 5) # See NA in district 
## [1] <NA> <NA> <NA> <NA> <NA>
## Levels: A1 A15 A7 B2 B3 C11 C6 D14 D4 E13 E18 E5
head(missing_dist, 2)
##   incident_number offense_code offense_code_group
## 1      I182102915         3006 Medical Assistance
## 2      I182102903         1830     Drug Violation
##             offense_description district reporting_area shooting
## 1 SICK/INJURED/MEDICAL - PERSON     <NA>             NA     <NA>
## 2  DRUGS - SICK ASSIST - HEROIN     <NA>             NA     <NA>
##   occurred_on_date year month day_of_week hour   ucr_part street      lat
## 1       2018-12-21 2018    12      Friday   20 Part Three   <NA> 42.33737
## 2       2018-12-21 2018    12      Friday   20   Part Two   <NA> 42.29863
##        long                    location
## 1 -71.07804 (42.33737410, -71.07803582)
## 2 -71.06064 (42.29863416, -71.06063601)

Where district is Missing we are also Missing Reporting area and only information we have in some is location which can be helpful to find.

Exploratory Data Analysis Boundaries

# Uncomment below to see details about missing district datasets.

# table(missing_dist$reporting_area) # Too see how many repoting area is available to fix.
# sum(table(missing_dist$reporting_area)) # getting the sum of reporting area.
# table(missing_dist$shooting) # to see how many shooting happen in these datasets. Y:2
# table(missing_dist$year) # to see which year has how many missing data.
# table(missing_dist$month) # To see if the datasets is missing from particular month.
# table(missing_dist$long) # To see how many unique longitude we have.
# table(missing_dist$lat) # To see how many unique latitude we have.
# table(missing_dist$day_of_week) # We have missing those district in each day of week.
# table(missing_dist$hour) # Check the hour
# table(missing_dist$offense_code_group) # chechking if crime was any help to figure out why these missing district.
# sum(table(missing_dist$lat))
# sum(table(missing_dist$long))

4 District Names

Converting District Code to District Names

Just knowing the district codes is hard to know where the crime was reported in Boston. People do know the neighborhood name rather than District name so our first step will be to recode the district code to District name.

# Lets Rename the distrcit code to district name
district_name = c(
A1 = 'Downtown',
A15= 'Charlestown',
A7= 'East Boston',
B2= 'Roxbury',
B3= 'Mattapan',
C6= 'South Boston',
C11= 'Dorchester',
D4= 'South End',
D14= 'Brighton',
E5= 'West Roxbury',
E13= 'Jamaica Plain',
E18= 'Hyde Park')

district_name
##              A1             A15              A7              B2 
##      "Downtown"   "Charlestown"   "East Boston"       "Roxbury" 
##              B3              C6             C11              D4 
##      "Mattapan"  "South Boston"    "Dorchester"     "South End" 
##             D14              E5             E13             E18 
##      "Brighton"  "West Roxbury" "Jamaica Plain"     "Hyde Park"

We could clean up these missing datas in many ways. Some of the ideas are by using reporting area to figure out which district these missing district are coming from in total there are only 40 columns with reporting. Latitude and Longitude position will be helpful to fix these missing districts. We have almost 1593 observations. It will be tedious work to chek each rows to fix and geolocation doesnt range by district as the district are not rectangle or square. There could be some efficient way to do these but will deal with these later.

For now my plan is to ignore these 1909 which is 0.0054% which shouldnt have any affect on our datasets. While shooting it might fluctuate so I check that variable and found out there were 2 case where district is missing. I can fix by reporting area too. Most of the missing data are from 2018, almost 708, it could be still uploading as of today. I even break it down by month to see if certain month have whole missing data for district. But it was spread evenly between months as well.

Lets Move on and concentrate on bigger pie of data rather than focusing on small missing datasets.

Lets ignore that missing district.

# We will change all district code to district name by passing the following code.
crime$district <- as.factor(district_name[(crime$district)] )
#unique(crime$district)
sort(table(crime$district), decreasing = TRUE)
## 
##       Roxbury  South Boston      Brighton      Downtown      Mattapan 
##         54804         46487         46012         39397         38692 
##    Dorchester     South End  West Roxbury Jamaica Plain   East Boston 
##         26053         21992         19366         18996         14259 
##     Hyde Park   Charlestown 
##         14248          7085
# sum(table(crime$district))  # 346594 + 1915 = 348509 observation

From the summary Column we can say that Motor Vehicle Accident Response was recorded most and Larceny the second most.There were 1416 Shooting cases as of today. Most of the crimes happens on Friday followed by Monday and Saturday. UCR (Uniform Crime Reporting ) part Three is mostly reported followed by Part II and then I. Most of the crime reported street is Washignton street,followed by Blue Hill Ave and then Boylston st. Also there are 4 different Washington street in Boston, MA. As of now I don’t have actual length of street so its hard to normalise the data based on crime and length.

5 Duplicate Datasets

5.1 Time point

# Looking duplicate data by incident_number, year, day_of_week : TIME POINT
get_dupes(crime, incident_number, year) %>% nrow
## [1] 71342
duplicated <- get_dupes(crime, incident_number, year, day_of_week)
head(duplicated)
## # A tibble: 6 x 18
##   incident_number  year day_of_week dupe_count offense_code
##   <fct>           <int> <fct>            <int>        <int>
## 1 I030217815-08    2015 Thursday             2         3125
## 2 I030217815-08    2015 Thursday             2          111
## 3 I060168073-00    2018 Saturday             3         3125
## 4 I060168073-00    2018 Saturday             3         1864
## 5 I060168073-00    2018 Saturday             3         1864
## 6 I070720870-00    2018 Thursday            11         3125
## # ... with 13 more variables: offense_code_group <fct>,
## #   offense_description <fct>, district <fct>, reporting_area <int>,
## #   shooting <fct>, occurred_on_date <date>, month <int>, hour <int>,
## #   ucr_part <fct>, street <fct>, lat <dbl>, long <dbl>, location <fct>

As of today we have 71300 observation with same Incident Number with different crimes happened on same day. Only difference is the crimes that were committed have different offense code.

5.2 Location Count

# Looking duplicate data by incident_number, year, street, district : LOCATION 
duplicate_count <- get_dupes(crime, incident_number, year, street, district, day_of_week)
head(duplicate_count)
## # A tibble: 6 x 18
##   incident_number  year street district day_of_week dupe_count offense_code
##   <fct>           <int> <fct>  <fct>    <fct>            <int>        <int>
## 1 I030217815-08    2015 RIVER… Jamaica… Thursday             2         3125
## 2 I030217815-08    2015 RIVER… Jamaica… Thursday             2          111
## 3 I060168073-00    2018 CENTR… West Ro… Saturday             3         3125
## 4 I060168073-00    2018 CENTR… West Ro… Saturday             3         1864
## 5 I060168073-00    2018 CENTR… West Ro… Saturday             3         1864
## 6 I070720870-00    2018 BROOK… Roxbury  Thursday            11         3125
## # ... with 11 more variables: offense_code_group <fct>,
## #   offense_description <fct>, reporting_area <int>, shooting <fct>,
## #   occurred_on_date <date>, month <int>, hour <int>, ucr_part <fct>,
## #   lat <dbl>, long <dbl>, location <fct>

5.3 Day of week

Lets see which day most of those multiple offense crime occurs.

tabyl(duplicate_count$day_of_week)
##  duplicate_count$day_of_week     n   percent
##                       Friday 10940 0.1533459
##                       Monday 10007 0.1402680
##                     Saturday 10161 0.1424266
##                       Sunday  8671 0.1215413
##                     Thursday 10668 0.1495332
##                      Tuesday 10355 0.1451459
##                    Wednesday 10540 0.1477391
sort(table(duplicate_count$day_of_week), decreasing = TRUE)
## 
##    Friday  Thursday Wednesday   Tuesday  Saturday    Monday    Sunday 
##     10940     10668     10540     10355     10161     10007      8671

Again Friday comes on top of the list followed by Thrusday and Wednesday.Remember these are all in duplicated datasets which is different from orginal Crime datasets. Most of the multiple offense crime happened on Friday followed by Thrusday and then Wednesday.

5.4 Offense code

head(sort(table(duplicate_count$offense_code_group),decreasing = TRUE), 10)
## 
##          Drug Violation                   Other         Warrant Arrests 
##                    9491                    6844                    5581 
##          Simple Assault                 Larceny      Aggravated Assault 
##                    4905                    4178                    3502 
##              Violations  Missing Person Located Missing Person Reported 
##                    3466                    3171                    3059 
##               Vandalism 
##                    3022

In our observation we see top ten contenders are drug Violation, followed by Warrant Arrests, Simple Assault. Remember these are all in duplicated datasets which is different from orginal Crime datasets.

5.5 Dupe count

#sort(table(duplicate_count$dupe_count),decreasing = TRUE)
tabyl(duplicate_count$dupe_count)
##  duplicate_count$dupe_count     n      percent
##                           2 47246 0.6622466429
##                           3 13842 0.1940231561
##                           4  5900 0.0827002327
##                           5  2440 0.0342014522
##                           6  1020 0.0142973284
##                           7   371 0.0052003028
##                           8   176 0.0024669900
##                           9   108 0.0015138348
##                          10    80 0.0011213591
##                          11    33 0.0004625606
##                          12    48 0.0006728155
##                          13    26 0.0003644417
##                          14    14 0.0001962378
##                          18    18 0.0002523058
##                          20    20 0.0002803398

Most of the duplicated crimes happened at the same location are mostly charged with two violation code followed all the way upto 20 different offense. In past 4 years there is 1 such case. Similarly 4 cases where someone is charged with 12 different offense code. I am really interested in finding out which was that one case where someone is charged with 20 different crimes offense.

5.6 Dupe count Twenty

duplicate_count[which(duplicate_count$dupe_count == 20), ] 
## # A tibble: 20 x 18
##    incident_number  year street district day_of_week dupe_count
##    <fct>           <int> <fct>  <fct>    <fct>            <int>
##  1 I152071596       2015 CREST… Roxbury  Saturday            20
##  2 I152071596       2015 CREST… Roxbury  Saturday            20
##  3 I152071596       2015 CREST… Roxbury  Saturday            20
##  4 I152071596       2015 CREST… Roxbury  Saturday            20
##  5 I152071596       2015 CREST… Roxbury  Saturday            20
##  6 I152071596       2015 CREST… Roxbury  Saturday            20
##  7 I152071596       2015 CREST… Roxbury  Saturday            20
##  8 I152071596       2015 CREST… Roxbury  Saturday            20
##  9 I152071596       2015 CREST… Roxbury  Saturday            20
## 10 I152071596       2015 CREST… Roxbury  Saturday            20
## 11 I152071596       2015 CREST… Roxbury  Saturday            20
## 12 I152071596       2015 CREST… Roxbury  Saturday            20
## 13 I152071596       2015 CREST… Roxbury  Saturday            20
## 14 I152071596       2015 CREST… Roxbury  Saturday            20
## 15 I152071596       2015 CREST… Roxbury  Saturday            20
## 16 I152071596       2015 CREST… Roxbury  Saturday            20
## 17 I152071596       2015 CREST… Roxbury  Saturday            20
## 18 I152071596       2015 CREST… Roxbury  Saturday            20
## 19 I152071596       2015 CREST… Roxbury  Saturday            20
## 20 I152071596       2015 CREST… Roxbury  Saturday            20
## # ... with 12 more variables: offense_code <int>,
## #   offense_code_group <fct>, offense_description <fct>,
## #   reporting_area <int>, shooting <fct>, occurred_on_date <date>,
## #   month <int>, hour <int>, ucr_part <fct>, lat <dbl>, long <dbl>,
## #   location <fct>

After digging dive in I think there should be 4 people involved. As we see there are 4 repeataed different offense code. I don’t think one person could be chargedd with 4 different offense. So this happen in 2015. Lets pull out the whole history about it.

5.7 Incident ID

For these we can take row name from original crime datasets.

crime[which(crime$incident_number == "I152071596"), ] 
##        incident_number offense_code     offense_code_group
## 329107      I152071596          111               Homicide
## 329108      I152071596          111               Homicide
## 329109      I152071596          111               Homicide
## 329110      I152071596          111               Homicide
## 329111      I152071596          413     Aggravated Assault
## 329112      I152071596          413     Aggravated Assault
## 329113      I152071596          413     Aggravated Assault
## 329114      I152071596          413     Aggravated Assault
## 329115      I152071596         3001     Medical Assistance
## 329116      I152071596         3001     Medical Assistance
## 329117      I152071596         3001     Medical Assistance
## 329118      I152071596         3001     Medical Assistance
## 329119      I152071596         3503 Missing Person Located
## 329120      I152071596         3503 Missing Person Located
## 329121      I152071596         3503 Missing Person Located
## 329122      I152071596         3503 Missing Person Located
## 329123      I152071596         2662             Ballistics
## 329124      I152071596         2662             Ballistics
## 329125      I152071596         2662             Ballistics
## 329126      I152071596         2662             Ballistics
##                            offense_description district reporting_area
## 329107     MURDER, NON-NEGLIGIENT MANSLAUGHTER  Roxbury            326
## 329108     MURDER, NON-NEGLIGIENT MANSLAUGHTER  Roxbury            326
## 329109     MURDER, NON-NEGLIGIENT MANSLAUGHTER  Roxbury            326
## 329110     MURDER, NON-NEGLIGIENT MANSLAUGHTER  Roxbury            326
## 329111          ASSAULT - AGGRAVATED - BATTERY  Roxbury            326
## 329112          ASSAULT - AGGRAVATED - BATTERY  Roxbury            326
## 329113          ASSAULT - AGGRAVATED - BATTERY  Roxbury            326
## 329114          ASSAULT - AGGRAVATED - BATTERY  Roxbury            326
## 329115                     DEATH INVESTIGATION  Roxbury            326
## 329116                     DEATH INVESTIGATION  Roxbury            326
## 329117                     DEATH INVESTIGATION  Roxbury            326
## 329118                     DEATH INVESTIGATION  Roxbury            326
## 329119 MISSING PERSON - NOT REPORTED - LOCATED  Roxbury            326
## 329120 MISSING PERSON - NOT REPORTED - LOCATED  Roxbury            326
## 329121 MISSING PERSON - NOT REPORTED - LOCATED  Roxbury            326
## 329122 MISSING PERSON - NOT REPORTED - LOCATED  Roxbury            326
## 329123               BALLISTICS EVIDENCE/FOUND  Roxbury            326
## 329124               BALLISTICS EVIDENCE/FOUND  Roxbury            326
## 329125               BALLISTICS EVIDENCE/FOUND  Roxbury            326
## 329126               BALLISTICS EVIDENCE/FOUND  Roxbury            326
##        shooting occurred_on_date year month day_of_week hour   ucr_part
## 329107        Y       2015-08-29 2015     8    Saturday    2   Part One
## 329108        Y       2015-08-29 2015     8    Saturday    2   Part One
## 329109        Y       2015-08-29 2015     8    Saturday    2   Part One
## 329110        Y       2015-08-29 2015     8    Saturday    2   Part One
## 329111        Y       2015-08-29 2015     8    Saturday    2   Part One
## 329112        Y       2015-08-29 2015     8    Saturday    2   Part One
## 329113        Y       2015-08-29 2015     8    Saturday    2   Part One
## 329114        Y       2015-08-29 2015     8    Saturday    2   Part One
## 329115        Y       2015-08-29 2015     8    Saturday    2 Part Three
## 329116        Y       2015-08-29 2015     8    Saturday    2 Part Three
## 329117        Y       2015-08-29 2015     8    Saturday    2 Part Three
## 329118        Y       2015-08-29 2015     8    Saturday    2 Part Three
## 329119        Y       2015-08-29 2015     8    Saturday    2 Part Three
## 329120        Y       2015-08-29 2015     8    Saturday    2 Part Three
## 329121        Y       2015-08-29 2015     8    Saturday    2 Part Three
## 329122        Y       2015-08-29 2015     8    Saturday    2 Part Three
## 329123        Y       2015-08-29 2015     8    Saturday    2   Part Two
## 329124        Y       2015-08-29 2015     8    Saturday    2   Part Two
## 329125        Y       2015-08-29 2015     8    Saturday    2   Part Two
## 329126        Y       2015-08-29 2015     8    Saturday    2   Part Two
##            street      lat      long                    location
## 329107 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329108 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329109 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329110 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329111 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329112 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329113 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329114 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329115 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329116 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329117 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329118 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329119 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329120 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329121 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329122 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329123 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329124 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329125 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)
## 329126 CRESTON ST 42.31195 -71.07872 (42.31195475, -71.07871912)

As we get guessed it happen in same location which can be confirmed by location address. Also by looking the offense code description seems like there was shooting involved from shooting column and also there were some death happened as we can see Murder and Manslaughter on offense code.

5.8 Ballastic District

Offense_code “2662” is ballastic. Lets see what we can get from there.Lets Analysed by District first.

ballastic <- crime[which(crime$offense_code == "2662"), ]
ballastic %>% group_by(district) %>% 
  count(sort = TRUE)
## # A tibble: 13 x 2
## # Groups:   district [13]
##    district          n
##    <fct>         <int>
##  1 Roxbury         291
##  2 South Boston    239
##  3 Mattapan        214
##  4 West Roxbury     83
##  5 Dorchester       70
##  6 Brighton         53
##  7 Jamaica Plain    28
##  8 East Boston      24
##  9 Downtown         23
## 10 South End        22
## 11 Hyde Park        14
## 12 Charlestown      13
## 13 <NA>              5

5.9 Ballastic Year.

Lets Analysed by Year

ballastic <- crime[which(crime$offense_code == "2662"), ]
ballastic %>% group_by(year) %>% 
  count()
## # A tibble: 4 x 2
## # Groups:   year [4]
##    year     n
##   <int> <int>
## 1  2015   170
## 2  2016   303
## 3  2017   341
## 4  2018   265

Compare to 2016 and 2017 we have less crime involving ballastics.

5.10 Ballastic District & Year

But has shooting changed over course of year in each district.

group_shoot_year <- crime[which(crime$shooting == "Y"), ]
group_shoot_year %>% 
  group_by(district, year) %>% 
  count()
## # A tibble: 49 x 3
## # Groups:   district, year [49]
##    district     year     n
##    <fct>       <int> <int>
##  1 Brighton     2015    21
##  2 Brighton     2016    10
##  3 Brighton     2017    36
##  4 Brighton     2018    25
##  5 Charlestown  2015     1
##  6 Charlestown  2016     1
##  7 Charlestown  2017     1
##  8 Charlestown  2018     2
##  9 Dorchester   2015     4
## 10 Dorchester   2016     9
## # ... with 39 more rows

Overall the shooting has decreased compare to last year in each district. It has gone down.Lets see below the percentage of the duplicated count based on incident number percentage.

5.11 Ballastic by street.

head( sort(table(duplicate_count$street),decreasing = TRUE), 10)
## 
##     WASHINGTON ST       BOYLSTON ST     BLUE HILL AVE      HARRISON AVE 
##              3497              2062              1887              1684 
## MASSACHUSETTS AVE    DORCHESTER AVE        TREMONT ST         CENTRE ST 
##              1444              1343              1217               978 
##       COLUMBIA RD         DUDLEY ST 
##               795               711

When the repeated crimes are occured in same location. It seems like Washington street still has multiple offense but there are more multiple offense on Boylston street than blue hill avenue which is slightly different from our datasets.Others remain the same.

6 Variables

6.1 UCR Part

Lets see why there are 5 different observation of ucr_part?

tabyl(crime$ucr_part)
##  crime$ucr_part      n      percent valid_percent
##           Other   1369 0.0039192559   0.003920345
##        Part One  67385 0.1929138479   0.192967435
##      Part Three 173867 0.4977569489   0.497895213
##        Part Two 106583 0.3051322498   0.305217008
##            <NA>     97 0.0002776975            NA

So we see 98 observation of empty rows which are not coded as NA or Zeros. and 1366 are coded as other. Thats why in describe it didnt show up as Missing or NA.

6.2 District

ggplot(subset(crime,!is.na(district)))+
  aes(x=district)+
  geom_bar(stat = "count",fill='red') + 
  geom_text(stat="count",aes(label=..count..),vjust=-1)+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  labs(title="Frequency of Incidents by Disctrict",
       x="Districts",
       y="Number of Incidents")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
scale_y_continuous(limit=c(0,60000))+
  theme_pankaj

# Dot plot 
crime_dist <- crime %>% 
  filter() %>% 
  group_by(district) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n))

# Frequency of crime by district wise:
ggplot(crime_dist, aes(x = reorder(district, n), y = n)) + 
  geom_point(size = 12, stat = "identity", color = "red") + 
  geom_text(aes(label = n, fontface = "bold"), color = "black", size = 2) + 
  theme_minimal(base_size = 20) + 
  xlab("District") + ylab("Count") + 
  ggtitle("Frequency of crime district wise") +
  scale_y_continuous(limits=c(0,max(crime_dist$n+2000)))+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  theme_pankaj+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Save the dotplot as a png file
ggsave("dotplot.png", scale = 2, dpi = 400)
## Saving 14 x 9 in image
# Plotly
plot_district = plot_ly(crime, x = ~ district, color = ~ district) %>% 
  add_histogram() %>%
  layout(
    title = "Total district count by the crime",
    xaxis = list(title = "district count",
    yaxis = list(title = "Count"))
  )
plot_district
## Warning: Ignoring 1910 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
xyplot(lat~long|district, data=crime, xlim=c(-71.2,-70.95), ylim=c(42.22,42.4))

# Frequency of crime by district (in descending order)
count(crime, district, sort = TRUE)
## # A tibble: 13 x 2
##    district          n
##    <fct>         <int>
##  1 Roxbury       54804
##  2 South Boston  46487
##  3 Brighton      46012
##  4 Downtown      39397
##  5 Mattapan      38692
##  6 Dorchester    26053
##  7 South End     21992
##  8 West Roxbury  19366
##  9 Jamaica Plain 18996
## 10 East Boston   14259
## 11 Hyde Park     14248
## 12 Charlestown    7085
## 13 <NA>           1910

6.3 Date and Time

Lets dig into our crime datasets as we have some sort of idea how the duplicate incidents number has behaved. Lets have uniform date by fixing time Zone.

# Detected data with Daylight Saving Time, converting to UTC/GMT
crime$date <- as.POSIXlt(crime$occurred_on_date, "GMT") # Fixing time zone
crime$date <- as.Date(crime$date) # Date 

# Lets also create a seperate Year, month , day column for EDA.

crime <- crime %>% mutate(crime_date = as.Date(occurred_on_date, "%Y-%m-%d")) 
# create a sepearte ymd crime_date then we will divide that crime_date into year,month, day.
crime <-crime %>% mutate(month=as.numeric(month(occurred_on_date)))
crime <- crime %>% mutate(year=as.numeric(year(occurred_on_date)))
crime <- crime %>% mutate(day=as.numeric(day(occurred_on_date)))

# Lets look our datasets by date column.
head(crime$occurred_on_date)
## [1] "2018-12-21" "2018-12-21" "2018-12-21" "2018-12-21" "2018-12-21"
## [6] "2018-12-21"

6.4 Time Series pattern

crime %>% 
  count(date) %>% 
  ggplot(aes(date, n))+
  geom_line()+ # line graph for date.
  expand_limits(y = 0)+ # less misleading
  labs(title ="Continous Crime Date Pattern", color = "red")+
  labs(caption = "Source: Boston Data | @ Pankaj Shah")+
  theme(plot.title = element_text(color="#D70026", size=14, face="bold.italic", hjust = 0.5, vjust=0.5))+
  theme_pankaj

As we can see the datasets that we extracted have continous recording without any breaks.Crime seems like pickup during the summer month and fall back in Winter. The pattern remains the same. Also we can see at the end of the year .i.e month of december have deep in crime rate. Living in Boston for a while one thing I have notice that December is coldest month, not one of the coldest but definetly the temperature remains high around 30-35F during the day and gets colder at night dropping down to 20-25F at night. Also Most of the schools have finals around this time of the year and most of the people live the town right after the final before the Christmas break. So it does explain the fall of crime around month of december. Also most of the people are busy either in Shopping Mall or with their family members for Christmas.

6.5 Incident by District

Frequency of Incidents by Disctrict

ggplot(subset(crime,!is.na(district)))+
  aes(x=year, color=district)+
  geom_line(stat="count")+
  scale_x_continuous(breaks = seq(2012,2018,1))+
  scale_y_continuous(breaks = seq(5000,50000,5000))+
  labs(title="Frequency of Incidents by Disctrict", x="Districts", y="Number of Incidents")+
  labs(caption = "Source: Boston Data | @ Pankaj Shah")+
  theme_pankaj

Cumulative Frequency of Incidents by District from 2015 to 2018. There seems to be a not any significant increase in number of crimes from 2016 upto 2018. The rise in line graph could be misleading interms of missing half a year of data in 2015.But from these graph we can see there are less crime in different district. We can analyse or rank the district based on the crime occured based on these graph.

7 Offense code distribution

crime%>% 
  count(offense_code, sort = TRUE) %>% 
  ggplot(aes (n))+
  geom_histogram()+
  labs(title="Majority of crimes offense code lies below 5000", x="offense_code", y="n")+
  labs(caption = "Source: Boston Data | @ Pankaj Shah")+
  theme_pankaj

The majority of the offense code lies below 5000.

7.1 Table

#table(crime$offense_code_group)
offense_code_crime <- sort(xtabs(formula = ~ offense_code_group, data = crime), decreasing = TRUE)
head(offense_code_crime)
## offense_code_group
## Motor Vehicle Accident Response                         Larceny 
##                           40750                           28570 
##              Medical Assistance              Investigate Person 
##                           25967                           20311 
##                           Other                  Drug Violation 
##                           19694                           18074
offense_code_crime <- as.tibble(offense_code_crime)
top_offense_code_crime <- offense_code_crime %>% top_n(10)
## Selecting by n
top_offense_code_crime
## # A tibble: 10 x 2
##    offense_code_group                  n
##    <chr>                           <int>
##  1 Motor Vehicle Accident Response 40750
##  2 Larceny                         28570
##  3 Medical Assistance              25967
##  4 Investigate Person              20311
##  5 Other                           19694
##  6 Drug Violation                  18074
##  7 Simple Assault                  17383
##  8 Vandalism                       16744
##  9 Verbal Disputes                 14358
## 10 Towed                           12375

7.2 Top ten code

crime_offense_code <- sort(table(crime$offense_code), decreasing = TRUE)
top_10_offense_code_crime <- as.tibble(head(crime_offense_code, 10))
top_10_offense_code_crime
## # A tibble: 10 x 2
##    Var1      n
##    <chr> <int>
##  1 3006  20784
##  2 3115  20316
##  3 3831  17827
##  4 1402  16462
##  5 802   16230
##  6 3301  14358
##  7 3410  12375
##  8 3114  12230
##  9 617    9969
## 10 2647   9963

7.3 Top ten by district

crime %>% 
  filter(offense_code %in% top_10_offense_code_crime$Var1) %>%
  group_by(district) %>% 
  count
## # A tibble: 13 x 2
## # Groups:   district [13]
##    district          n
##    <fct>         <int>
##  1 Brighton      17662
##  2 Charlestown    3182
##  3 Dorchester    11188
##  4 Downtown      14800
##  5 East Boston    6285
##  6 Hyde Park      6465
##  7 Jamaica Plain  8840
##  8 Mattapan      17914
##  9 Roxbury       23533
## 10 South Boston  21598
## 11 South End     10378
## 12 West Roxbury   8054
## 13 <NA>            615

Most of the arrest was made for offense code 3006. Lets see what is that.

7.4 Medical assistance

crime[which(crime$offense_code == "3006"), ] %>% head(1)
##   incident_number offense_code offense_code_group
## 5      I182102916         3006 Medical Assistance
##             offense_description     district reporting_area shooting
## 5 SICK/INJURED/MEDICAL - PERSON South Boston            359     <NA>
##   occurred_on_date year month day_of_week hour   ucr_part       street
## 5       2018-12-21 2018    12      Friday   20 Part Three MELVILLE AVE
##        lat      long                    location       date crime_date day
## 5 42.29429 -71.06821 (42.29428541, -71.06820981) 2018-12-21 2018-12-21  21

It seems like most of the calls were made for Medical Assistance Sick/Injured Medical

7.5 Medical assistance by district

medical_assiscatance <- crime[which(crime$offense_code == "3006"), ]
sort(table(medical_assiscatance$district), decreasing = TRUE)
## 
##       Roxbury  South Boston      Brighton      Downtown      Mattapan 
##          2985          2924          2072          2063          2055 
##    Dorchester Jamaica Plain     South End     Hyde Park  West Roxbury 
##          1610          1500          1405          1308          1271 
##   East Boston   Charlestown 
##          1143           398

7.6 Medical assistance by street

top_10_street <- sort(table(medical_assiscatance$street), decreasing = TRUE)
head(top_10_street, 10)
## 
##     WASHINGTON ST    DORCHESTER AVE     BLUE HILL AVE MASSACHUSETTS AVE 
##               794               348               337               313 
##         CENTRE ST        TREMONT ST  COMMONWEALTH AVE       BOYLSTON ST 
##               309               290               273               267 
##          RIVER ST       COLUMBIA RD 
##               223               214

7.7 Investigate Person by street

crime[which(crime$offense_code == "3115"), ] %>% head(1)
##   incident_number offense_code offense_code_group offense_description
## 2      I182102933         3115 Investigate Person  INVESTIGATE PERSON
##     district reporting_area shooting occurred_on_date year month
## 2 Dorchester            958     <NA>       2018-12-21 2018    12
##   day_of_week hour   ucr_part        street      lat      long
## 2      Friday   20 Part Three BELLFLOWER ST 42.32445 -71.05811
##                      location       date crime_date day
## 2 (42.32445145, -71.05810680) 2018-12-21 2018-12-21  21

7.8 Investigate Person by district

investigate_person <- crime[which(crime$offense_code == "3115"), ]
sort(table(investigate_person$district), decreasing = TRUE)
## 
##       Roxbury  South Boston      Mattapan      Brighton      Downtown 
##          2957          2898          2678          2263          1744 
##    Dorchester     South End Jamaica Plain     Hyde Park  West Roxbury 
##          1583          1333          1296          1065           983 
##   East Boston   Charlestown 
##           925           514
head(sort(table(investigate_person$street), decreasing = TRUE),10)
## 
##     WASHINGTON ST     BLUE HILL AVE         CENTRE ST    DORCHESTER AVE 
##               872               524               327               286 
##       BOYLSTON ST      HARRISON AVE  COMMONWEALTH AVE        TREMONT ST 
##               285               255               249               235 
##        W BROADWAY MASSACHUSETTS AVE 
##               231               218

7.9 Shooting while Investigate

Did shooting took place while investigating person?

sort(table(investigate_person$shooting), decreasing = TRUE)
##  Y 
## 12

There are 12 incidents out of 20236 Where shooting was involved while investigating the person.

7.10 Shooting district

In which district did the shooting took place while investigating person?

investigate_shooting <- investigate_person %>% 
   filter(shooting == "Y")

sort(table(investigate_shooting$district), decreasing = TRUE)
## 
##       Roxbury  West Roxbury   East Boston      Mattapan      Brighton 
##             8             2             1             1             0 
##   Charlestown    Dorchester      Downtown     Hyde Park Jamaica Plain 
##             0             0             0             0             0 
##  South Boston     South End 
##             0             0

7.11 Shooting Reporting Area

In which reporting area did the shooting took place while investigating person?

sort(table(investigate_shooting$reporting_area), decreasing = TRUE)
## 
## 326 330 909  26 441 906 912 
##   4   2   2   1   1   1   1

330, 326, 909 : Mattapan | 26 : Roxbury | 906, 912 :Jamaica Plain | 441 : South Boston |

7.12 Other components

Various other components to look at while the shooting took place when investigating person

sort(table(investigate_shooting$day_of_week), decreasing = TRUE)
## 
##  Saturday Wednesday    Friday    Sunday  Thursday    Monday   Tuesday 
##         4         4         2         1         1         0         0
sort(table(investigate_shooting$month), decreasing = TRUE)
## 
##  7  9  1  2  5  6 11 
##  4  3  1  1  1  1  1
sort(table(investigate_shooting$year), decreasing = TRUE)
## 
## 2018 2016 2017 
##    8    2    2
head(sort(table(investigate_shooting$street), decreasing = TRUE),8)
## 
##    FAYSTON ST BLUE HILL AVE     BORDER ST      CABOT ST     CENTRE ST 
##             4             2             1             1             1 
##      HEATH ST      WALES ST   WHITTIER ST 
##             1             1             1
sort(table(investigate_shooting$hour), decreasing = TRUE)
## 
## 18 20  3  4 13 19 21 23 
##  4  2  1  1  1  1  1  1

It is remarkable to see that shooting has been increased while investigating in 2018 we have 8 incidents compare to 2016, 2017. 4 shooting happen around 6pm while investigating.

8 Offense category

crime$offense_category <- crime$offense_code_group

8.1 Motor

unique(crime$offense_code_group[grep("Motor",crime$offense_code_group)])
## [1] Larceny From Motor Vehicle      Motor Vehicle Accident Response
## 67 Levels: Aggravated Assault Aircraft ... Warrant Arrests

8.2 Larceny

unique(crime$offense_code_group[grep("Larceny",crime$offense_code_group)])
## [1] Larceny From Motor Vehicle Larceny                   
## 67 Levels: Aggravated Assault Aircraft ... Warrant Arrests

Offense grouping

crime<- crime %>%
  mutate(offense_category = ifelse(offense_code_group %in% c("Fire Related Reports", "Firearm Discovery", "Arson"),"Arson", 
                  ifelse(offense_code_group %in% c("Simple Assault", "Verbal Disputes", "Disorderly Conduct", "Aggravated Assault", "Prisoner Related Incidents"),"Assault",
                  ifelse(offense_code_group %in% c("Other Burglary", "Residential Burglary", "Commercial Burglary", "Burglary - No Property Taken"),"Burglary",
                  ifelse(offense_code_group %in% c("Medical Assistance", "Biological Threat"),"Medical",
                  ifelse(offense_code_group %in% c("Larceny", "Larceny From Motor Vehicle", "Auto Theft", "Auto Theft Recovery", "HOME INVASION", "Embezzlement"),"Larceny",
                  ifelse(offense_code_group %in% c("Missing Person Reported", "Missing Person Located"),"Missing", 
                  ifelse(offense_code_group %in% c("Motor Vehicle Accident Response", "Towed", "License Plate Related Incidents"),"MVT", 
                  ifelse(offense_code_group %in% c("Investigate Person", "INVESTIGATE PERSON","Investigate Property","Drug Violation","Violations","Counterfeiting","Operating Under the Influence","Firearm Violations","License Violation","Restraining Order Violations","Liquor Violation","Assembly or Gathering Violations","Confidence Games","Harbor Related Incidents","Evading Fare","Gambling"),"NON-VIO",
                  ifelse(offense_code_group %in% c("Property Lost","Property Found","Recovered Stolen Property","Property Related Damage","Landlord/Tenant Disputes"),"Property",
                  ifelse(offense_code_group %in% c("Robbery"),"Robbery", 
                  ifelse(offense_code_group %in% c("Harassment", "Criminal Harassment", "Prostitution"),"SEX", 
                  ifelse(offense_code_group %in% c("HUMAN TRAFFICKING", "HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE"), "TRAFFICKING",
                  ifelse(offense_code_group %in% c("Vandalism"),"Vandalism", 
                  ifelse(offense_code_group %in% c("Warrant Arrests", "Search Warrants"),"Warrant", 
                  ifelse(offense_code_group %in% c("Police Service Incidents","Ballistics","Bomb Hoax","Offenses Against Child / Family","Homicide","Explosives","Manslaughter"),"VIO",
                  ifelse(offense_code_group %in% c("Fraud"),"Fraud", 
                  ifelse(offense_code_group %in% c("Phone Call Complaints", "Service", "Aircraft"),"Service+_callls",
                                                                    "Other"))))))))))))))))))

8.3 Percentage

table(crime$offense_category)
## 
##           Arson         Assault        Burglary           Fraud 
##            2921           43484            7921            6460 
##         Larceny         Medical         Missing             MVT 
##           47149           25969            9617           53737 
##         NON-VIO           Other        Property         Robbery 
##           71257           19694           18817            4989 
## Service+_callls             SEX     TRAFFICKING       Vandalism 
##             387            4830              10           16744 
##             VIO         Warrant 
##            5176           10139
# Frequency and percent of crime category (in descending order)
count(crime, offense_category, sort = TRUE) %>%
  mutate(percent = round(n/sum(n)*100, 1))
## # A tibble: 18 x 3
##    offense_category     n percent
##    <chr>            <int>   <dbl>
##  1 NON-VIO          71257    20.4
##  2 MVT              53737    15.4
##  3 Larceny          47149    13.5
##  4 Assault          43484    12.4
##  5 Medical          25969     7.4
##  6 Other            19694     5.6
##  7 Property         18817     5.4
##  8 Vandalism        16744     4.8
##  9 Warrant          10139     2.9
## 10 Missing           9617     2.8
## 11 Burglary          7921     2.3
## 12 Fraud             6460     1.8
## 13 VIO               5176     1.5
## 14 Robbery           4989     1.4
## 15 SEX               4830     1.4
## 16 Arson             2921     0.8
## 17 Service+_callls    387     0.1
## 18 TRAFFICKING         10     0

8.4 Count

# Mean frequency of crime category per district
crime %>%
  group_by(offense_category,district) %>%
  summarise(total = n()) %>% 
  group_by(offense_category) %>% 
  summarise(average = round(mean(total, na.rm=TRUE), 0))
## # A tibble: 18 x 2
##    offense_category average
##    <chr>              <dbl>
##  1 Arson                225
##  2 Assault             3345
##  3 Burglary             609
##  4 Fraud                497
##  5 Larceny             3627
##  6 Medical             1998
##  7 Missing              740
##  8 MVT                 4134
##  9 NON-VIO             5481
## 10 Other               1515
## 11 Property            1447
## 12 Robbery              384
## 13 Service+_callls       30
## 14 SEX                  372
## 15 TRAFFICKING            2
## 16 Vandalism           1288
## 17 VIO                  398
## 18 Warrant              780
# Count by district
# Frequency of crime category by district (in descending order)
crime %>%
  group_by(offense_category, district) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n))
## # A tibble: 226 x 3
##    offense_category district         n
##    <chr>            <fct>        <int>
##  1 NON-VIO          Roxbury      11502
##  2 Larceny          Brighton     11108
##  3 NON-VIO          South Boston  9587
##  4 Assault          Roxbury       8350
##  5 NON-VIO          Mattapan      8197
##  6 NON-VIO          Brighton      8158
##  7 NON-VIO          Downtown      8144
##  8 MVT              Roxbury       8115
##  9 MVT              South Boston  7255
## 10 Larceny          Downtown      7224
## # ... with 216 more rows

8.5 Count by Month

crime %>% 
  count(offense_category, month = floor_date(date, "month")) %>% 
  filter(month > min(month)) %>% 
  filter(month < max(month)) %>% 
  ggplot(aes(month, n , color = offense_category))+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  geom_line()+
  labs(title = "Offense Category by Month")+
  theme_pankaj

We used crime classification from the Book of Criminal Justice and plotted the number of incidents in each type of crime. It helps to visualize the types of crime by classifying them into broader cattegories; e.g. instead of having many types of larceny, it helps to only visualize all types of larceny as one category.

8.6 Plot Crime

plot_crime_offense_category = plot_ly(crime, x = ~ offense_category, color = ~ offense_category) %>% 
  add_histogram() %>%
  layout(
    title = "Total Offense category count by the crime",
    xaxis = list(title = "Offense category",
    yaxis = list(title = "Count"))
  )
plot_crime_offense_category
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Larceny

# Frequency and percent of Vehicle crime by district (in descending order)
crime %>% 
  filter(offense_category == "Larceny") %>% 
  group_by(district) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  mutate(percent = round(n/sum(n)*100, 1))
## # A tibble: 13 x 3
##    district          n percent
##    <fct>         <int>   <dbl>
##  1 Brighton      11108    23.6
##  2 Downtown       7224    15.3
##  3 Roxbury        5907    12.5
##  4 South Boston   4636     9.8
##  5 Dorchester     3787     8  
##  6 Mattapan       2993     6.3
##  7 South End      2877     6.1
##  8 West Roxbury   2680     5.7
##  9 Jamaica Plain  1941     4.1
## 10 East Boston    1429     3  
## 11 Hyde Park      1388     2.9
## 12 Charlestown    1000     2.1
## 13 <NA>            179     0.4
df <- crime %>%
  filter(offense_category == "Larceny") %>%
  group_by(district) %>%
  summarise(n = n())

calendar_heatmap <- ggplot(df, aes(x= reorder(district, -n), y=n, fill=n)) +
  geom_tile(aes(fill=n)) +
  geom_text(aes(label=n), size=3.5, color="black") +
  scale_fill_gradient("Frequency", low = "darkgreen", high = "red") +
  theme_pankaj+ 
  labs(x = "District")+
  ggtitle("Larceny levels in Boston") +
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

calendar_heatmap

#ggsave("calendar_heatmap.png", scale = 1, dpi = 300)

8.7 Larceny Arrest

MA_Arrest_Summary <-crime %>%
    filter(offense_category == 'Larceny') %>%
    group_by(month, district, year) %>%
    summarise(Count = n())

MA_Arrest_Summary %>% head(5)
## # A tibble: 5 x 4
## # Groups:   month, district [2]
##   month district     year Count
##   <dbl> <fct>       <dbl> <int>
## 1     1 Brighton     2016   277
## 2     1 Brighton     2017   249
## 3     1 Brighton     2018   205
## 4     1 Charlestown  2016    26
## 5     1 Charlestown  2017    27
ggplot(MA_Arrest_Summary, aes(x= district, y= Count, fill = year)) +
  geom_point()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_pankaj 

# 80:20

tbl <- crime %>% 
  filter(offense_category == "Larceny") %>% 
  group_by(district) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  mutate(percent.crimes = round(n/sum(n)*100, 1),
         cum.percent.crimes = round(cumsum(percent.crimes), 1),
         percent.district = 1/n()*100,
         cum.percent.district = round(cumsum(percent.district), 1)) %>% 
          select(DISTRICT = district,
         `No. crimes` = n,
         `% crimes` = percent.crimes,
         `Cum. % crimes` = cum.percent.crimes,
         `Cum. % district` = cum.percent.district)

# Create a simple table and save as a pdf
pdf("80-20_rule.pdf", height=11, width=8.5)
grid.table(tbl, rows = NULL)
dev.off()
## quartz_off_screen 
##                 2

Larceny Proportion in December 2015 and December 2017 We take a closer look at larceny as a proportion of total robbery offences, by district, in December 2015. Since December is coming up, and it is a holiday month, we wanted to see the spread of larceny across districts. It is interesting to see that more larceny occurs in the South End and and Roxbury than any other district, but this is in line with our observation that a lot more crime occurs in these districts than any other.

However, what is more interesting is seeing the change in distrct level larceny between 2015 and 2017. Larceny counts increased in the South End from 2015 to 2017, whereas the larceny counts decreased significantly in Roxbury.

8.8 Larceny District

# Recoding 1 & 0 for shooting as there are less shooting in crime involved.
crime$shooting <- ifelse(as.character(crime$shooting) == "Y", 1, 0)
table(crime$shooting == 1)
## 
## TRUE 
## 1417
MA_Arrest_Summary <-crime %>%
    filter(offense_category == 'Larceny') %>%
    group_by(month, district, year) %>%
    summarise(Count = n()
              )
MA_Arrest_Summary %>% head(5)
## # A tibble: 5 x 4
## # Groups:   month, district [2]
##   month district     year Count
##   <dbl> <fct>       <dbl> <int>
## 1     1 Brighton     2016   277
## 2     1 Brighton     2017   249
## 3     1 Brighton     2018   205
## 4     1 Charlestown  2016    26
## 5     1 Charlestown  2017    27

8.9 Larceny Shooting

# Summarize the arrests
MA_Arrest_Summary <-crime %>%
    filter(offense_category == 'Larceny' |  shooting == "Y") %>%
    group_by(month, district, year) %>%
    summarise(Count = n())
              
MA_Arrest_Summary %>% head(5)
## # A tibble: 5 x 4
## # Groups:   month, district [2]
##   month district     year Count
##   <dbl> <fct>       <dbl> <int>
## 1     1 Brighton     2016   277
## 2     1 Brighton     2017   249
## 3     1 Brighton     2018   205
## 4     1 Charlestown  2016    26
## 5     1 Charlestown  2017    27
ggplot(MA_Arrest_Summary, aes(x= district, y= Count, fill = year)) +
  geom_point()+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_pankaj 

8.10 Comparison Year

# Frequency of crime by district (in descending order)
par(mfrow=c(2,2))
df_3 <- crime %>% 
  filter(date == "2015-12-01" & offense_category == "Larceny") %>% 
  count(district, sort = TRUE) %>%
  mutate(percent = round(n/sum(n)*100, 1)) %>% 
  select(district, percent) %>% 
  spread(district, percent)
df_3 <- df_3[ , order(-df_3 [which(rownames(df) == '1'), ]) ]

# Use waffle
waffle(df_3, rows = 4, size = 2, 
       colors=(RColorBrewer::brewer.pal(n=10,"Set3")),
       title="District level Larceny as a proportion of \n total Robbery  offences on year December 2015",
       legend_pos = "right")+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")

# ggsave("waffle.png", scale=1.5, dpi=300)
# Frequency of crime by district (in descending order)
df_3 <- crime %>% 
  filter(date == "2018-12-01" & offense_category == "Larceny") %>% 
  count(district, sort = TRUE) %>%
  mutate(percent = round(n/sum(n)*100, 1)) %>% 
  select(district, percent) %>% 
  spread(district, percent)
df_3 <- df_3[ , order(-df_3 [which(rownames(df) == '1'), ]) ]

# Use waffle
waffle(df_3, rows = 4, size = 2, 
       colors=(RColorBrewer::brewer.pal(n=12,"Set3")),
       title="District level Larceny as a proportion of \n total Robbery offences on year December 2018",
       legend_pos = "right")+labs(caption= "Data Source : Boston |@ Pankaj Shah")

# ggsave("waffle_1.png", scale=1.5, dpi=300)

9 Shooting

shoot <- crime[which(crime$shooting == 1), ]
sort(table(shoot$district), decreasing = TRUE)
## 
##       Roxbury      Mattapan  South Boston      Brighton  West Roxbury 
##           494           299           249            92            91 
##    Dorchester Jamaica Plain   East Boston     South End     Hyde Park 
##            46            44            27            27            22 
##      Downtown   Charlestown 
##            19             5
sort(table(shoot$year), decreasing = TRUE)
## 
## 2017 2018 2016 2015 
##  457  364  345  251
m1 <- data.frame(sort(table(shoot$month), decreasing = TRUE))
m1 <- m1 %>% rename(month = Var1)
m1
##    month Freq
## 1      7  186
## 2      6  174
## 3      8  160
## 4     12  156
## 5     10  137
## 6      9  126
## 7     11  113
## 8      4   86
## 9      5   86
## 10     1   85
## 11     2   63
## 12     3   45
d1 <- data.frame(table(shoot$day_of_week))
d1 <- d1 %>% rename(wday = Var1)
d1
##        wday Freq
## 1    Friday  187
## 2    Monday  162
## 3  Saturday  336
## 4    Sunday  201
## 5  Thursday  179
## 6   Tuesday  143
## 7 Wednesday  209
p1 <- data.frame(sort(table(shoot$day), decreasing = TRUE))
p1 <- p1 %>% rename(day = Var1)

# Overall Shooting throughout the year
ggplot(data = shoot, aes(x = month)) +
  geom_bar(label = TRUE)+
  ggtitle("Overall Shooting throughout the year")+
  labs(caption = "Source: Boston Datasets | @ Pankaj Shah", x = "Month") +
  theme_pankaj
## Warning: Ignoring unknown parameters: label

# Overall Shooting throughout the month
ggplot(data = shoot, aes(x = day)) +
  geom_bar()+
  ggtitle("Overall Shooting throughout the month ")+
  labs(caption = "Source: Boston Datasets | @ Pankaj Shah", x = "Day")+
  theme_pankaj

# Overall Shooting throughout the week 

ggplot(data = d1, aes(x = wday, Freq)) +
  geom_bar(stat = "identity")+
  ggtitle("Overall Shooting throughout the week ") +
  labs(caption = "Source: Boston Datasets | @ Pankaj Shah", x = "Wday") +
  theme_pankaj

p <- ggplot(d1, aes(x = wday, y = Freq, fill = Freq)) + 
       geom_bar(width = 1, stat="identity") + coord_polar("y", start=pi / 3) + 
       ggtitle("Pie Chart distribution of shooting throughout the week") +
  labs(caption= "Data Source : Boston |@ Pankaj Shah")
print(p) 

10 Reporting Area

plot(prop.table(table(crime$reporting_area)))

We can see which regions have most of the crimes.

11 Crime Shift breakdown

Lets divide day in 4 shifts to know the crime time. We generate six points of the day to bin the day into four equal segments.

11.1 Shift Count

time_diff <- c("0", "6", "12", "18", "24") # Breaking day into 6 interval period
crime$time_diff <- cut(crime$hour, 
                      breaks = time_diff,
                      labels = c("00-06", "06-12", "12-18", "18-24"), 
                      include.lowest = TRUE)
table(crime$time_diff)
## 
##  00-06  06-12  12-18  18-24 
##  53012  96671 122138  77480
#createing Shift plot
crime <- crime %>% mutate(shift = ifelse(time_diff == "00-06", "Late Night",
                                                     ifelse(time_diff == "06-12", "Morning",
                                                             ifelse(time_diff == "12-18", "Day",
                                                                    "Evening"))))
table(crime$shift)
## 
##        Day    Evening Late Night    Morning 
##     122138      77480      53012      96671
#
plot_shift = plot_ly(crime, x = ~ time_diff, color = ~ time_diff) %>% 
  add_histogram() %>%
  layout(
    title = "Total district count by the crime",
    xaxis = list(title = "Shift count",
    yaxis = list(title = "Count"))
  )
plot_shift
temp <- aggregate(crime$offense_category, 
                  by = list(crime$offense_category, crime$time_diff), FUN= length)
names(temp) <- c("offense_category", "time_diff", "count")
head(temp)
##   offense_category time_diff count
## 1            Arson     00-06   521
## 2          Assault     00-06  8168
## 3         Burglary     00-06  1608
## 4            Fraud     00-06   871
## 5          Larceny     00-06  5574
## 6          Medical     00-06  4223

Crime by Time Range of Day: It seems like most of the crime is committed between 6am and 6pm. Most incidents happen between 12pm and 6pm, while the least happens between 12am and 6am.

12 Holiday

Lets look at the holiday data from 2015 till today

holiday <- crime %>% filter(crime_date == "2015-01-01"| # New Year 2015
                              crime_date == "2015-01-19"| # Martin Luther King, Jr.
                              crime_date == "2015-02-16"| # Washington’s Birthday
                              crime_date == "2015-05-25"| # Memorial Day
                              crime_date == "2015-07-04"| # Independence Day
                              crime_date == "2015-09-07"| # Labor Day
                              crime_date == "2015-10-12"| # Columbus Day
                              crime_date == "2015-11-11"| # Veterans Day
                              crime_date == "2015-11-26"| # Thanksgiving Day
                              crime_date == "2015-12-25"| # Christmas Day
                              crime_date == "2016-01-01"| # New Year 2016
                              crime_date == "2016-01-18"| # Martin Luther King, Jr.
                              crime_date == "2016-02-15"| # Washington’s Birthday
                              crime_date == "2016-05-30"| # Memorial Day
                              crime_date == "2016-07-04"| # Independence Day
                              crime_date == "2016-09-05"| # Labor Day
                              crime_date == "2016-10-10"| # Columbus Day
                              crime_date == "2016-11-11"| # Veterans Day
                              crime_date == "2016-11-24"| # Thanksgiving Day
                              crime_date == "2016-12-25"| # Christmas Day
                              crime_date == "2017-01-01"| # New Year 2017
                              crime_date == "2017-01-16"| # Martin Luther King, Jr.
                              crime_date == "2017-02-20"| # Washington’s Birthday
                              crime_date == "2017-05-29"| # Memorial Day
                              crime_date == "2017-07-04"| # Independence Day
                              crime_date == "2017-09-04"| # Labor Day
                              crime_date == "2017-10-09"| # Columbus Day
                              crime_date == "2017-11-10"| # Veterans Day
                              crime_date == "2017-11-23"| # Thanksgiving Day
                              crime_date == "2017-12-25"| # Christmas Day
                              crime_date == "2018-01-01"| # New Year 2018
                              crime_date == "2018-01-15"| # Martin Luther King, Jr.
                              crime_date == "2018-02-19"| # Washington’s Birthday
                              crime_date == "2018-05-28"| # Memorial Day
                              crime_date == "2018-07-04"| # Independence Day
                              crime_date == "2018-09-03"| # Labor Day
                              crime_date == "2018-10-08"| # Columbus Day
                              crime_date == "2018-11-12"| # Veterans Day
                              crime_date == "2018-11-22") # Thanksgiving Day 
                              # No data for christmas 2018.

12.1 Holiday EDA

count_holiday <- holiday %>% group_by(crime_date,offense_category) %>% summarise(count = n())
count_holiday_total <- holiday %>% group_by(crime_date) %>% summarise(count = n())
count_holiday_year <- holiday %>% group_by(year) %>% summarise(count = n())
count_crime <- crime %>% group_by(year,offense_category) %>% summarise(count = n())
count_shift <- crime %>%  group_by(year,shift) %>% summarise(count = n())
count_shift$count <- as.numeric(count_shift$count)
count_crime$count <- as.numeric(count_crime$count)
count_district <- crime %>% group_by(year,district) %>% summarise(count = n())
choice <- unique(count_crime$offense_category)
total_crime <- crime %>% group_by(year) %>% summarise(total = n())
total_crime$count <- as.numeric(total_crime$total)

13 Crime Breakdown by timepoint

13.1 Day of the Week

There seems to be higher counts of crime on Fridays than the rest of the week, with Sunday having the lowest counts of crime. More crime also occur during the weekdays rather than on the weekend. Notice here that the y-axis scale starts from 43,000 and ends with 52,500.

crime$day <- crime$DAY_OF_WEEK
crime$day <- factor(crime$day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

qplot(crime$day, xlab= "Day of week", main= "Crimes by day of week") + scale_y_continuous("Number of crimes")+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  theme_pankaj

13.2 District count

plot_crime_offense_day = plot_ly(crime, x = ~day , color = ~ day) %>% 
  add_histogram() %>%
  layout(
    title = "Total district count by the crime during the day",
    xaxis = list(title = "Days",
    yaxis = list(title = "Count"))
  )
plot_crime_offense_day

13.3 Crime by Month

Most crimes happen on the later half of the year, and the most crime-ridden quarter is the third quarter of the year.

crime$Month <- month.abb[crime$month]
table(crime$Month)
## 
##   Apr   Aug   Dec   Feb   Jan   Jul   Jun   Mar   May   Nov   Oct   Sep 
## 24162 35343 28684 21706 23684 34801 30742 24190 26329 31197 34101 34362
crime$Month <- factor(crime$Month, levels= c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
ggplot(crime, aes(x= Month))+
  geom_bar()+
  theme_pankaj+
  ggtitle("Crimes by Month")+
  labs(y ="Number of crimes")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(caption = "Boston Data Source | @ Pankaj Shah")

13.4 Time of Day

Gradient Tile Plot of Crime Types by Time of Day This is a more detailed view of the types of crimes (by intensity) throughout the day segmented by 6 hour sections.

ggplot(temp, aes(x= offense_category, y= factor(time_diff))) +
geom_tile(aes(fill= count)) + scale_x_discrete("offense_category", expand = c(0,0)) + scale_y_discrete("Time of day", expand = c(0,-2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "orange") + theme_bw() + ggtitle("Crimes by time of day") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line (colour = NA))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  theme_pankaj

Aside from there being more crime between 6am and 6pm, non-violent crimes have the highest intensity in terms of occurence, and happen between 12pm and 6pm. More motor-vehicle crimes occur between 6am and 12pm than any othe time range, and most larceny occurs between 12pm and 6pm.

13.5 Day of Week

Gradient Tile Plot of Crime Types by Day of Week

temp1 <- aggregate(crime$offense_category, by = list(crime$offense_category, crime$day), FUN= length)
names(temp1) <- c("offense_category", "day", "count")

ggplot(temp1, aes(x= offense_category, y= day, fill= count)) + geom_tile(aes(fill= count)) +
scale_x_discrete("offense_category", expand = c(0,0)) + scale_y_discrete("Day of week", expand = c(0,-2)) + scale_fill_gradient("Number of crimes", low = "white", high = "lightpink") +
ggtitle("Crimes by day of week") +
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  theme_pankaj +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In our previous plot ‘Crime by Day of Week’, we saw that the occurence of crime is even distributed throughout the weekdays, and that crime happens less on weekends. Here, we are able to see the each type of crime committed, and its frequency in terms of intensity, throughout the week.

Most crimes seem to be in line with what the previous graph told us, however, it seems that vandalism occurs more often on weekends than weekdays, as suggested by the deeper color on Saturday and Sunday than the rest of the week.

13.6 Types by Month

Gradient Tile Plot of Crime Types by Month

temp2 <- aggregate(crime$offense_category, by = list(crime$offense_category, crime$Month), FUN= length)
names(temp2) <- c("offense_category", "Month", "count")

ggplot(temp2, aes(x= offense_category, y= Month, fill= count)) +
geom_tile(aes(fill= count)) +
scale_x_discrete("OFFENSE_CAT", expand = c(0,0)) +
scale_y_discrete("month", expand = c(0,-2)) +
scale_fill_gradient("Number of crimes", low = "white", high = "lightgreen") + theme_bw() + 
ggtitle("Crimes by Month") + 
labs(caption= "Data Source : Boston |@ Pankaj Shah")+
theme_pankaj +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Previously, we see that crimes most often occur in the third quarter of the year, here we are able to see what types of crime are committed throughout the year by month.

The majority of the types of crime are pretty evenly distributed, in terms of color intensity, throughout the year. For crimes such non-violent crime, motor vehicle theft, larceny, and assault, they are in line with our observation that crime occurs more in Q3 than other quarters.

14 Crime rate by Seasons

Here we explore the possibility that seasonality affects the rate at which crime is committed per hour. However, it doesn’t seem like there are any significant differences between seasons on the rate of crime occurence.

#create seasons
crime<- crime %>% mutate(Season = ifelse(month %in% c(6,7,8), "Summer",
                                                     ifelse(month %in% c(9,10,11), "Fall",
                                                             ifelse(month %in% c(12,1,2), "Winter",
                                                                    "Spring"))))
table(crime$Season)
## 
##   Fall Spring Summer Winter 
##  99660  74681 100886  74074
# Plot

plot_crime_offense_season = plot_ly(crime, x = ~ Season, color = ~ Season) %>% 
  add_histogram() %>%
  layout(
    title = "Total count of the crime by Season",
    xaxis = list(title = "Season",
    yaxis = list(title = "Count"))
  )
plot_crime_offense_season

15 Street Crimes

street_crime <- sort(table(crime$street), decreasing = TRUE)
head(street_crime, 10)
## 
##     WASHINGTON ST     BLUE HILL AVE       BOYLSTON ST    DORCHESTER AVE 
##             15634              8521              7944              5647 
##        TREMONT ST MASSACHUSETTS AVE      HARRISON AVE         CENTRE ST 
##              5261              5145              5097              4817 
##  COMMONWEALTH AVE     HYDE PARK AVE 
##              4570              3801

15.1 Most Dangerous Streets

library("RColorBrewer") 
pal = brewer.pal(9,"Blues")
street_name <- as.tibble(table(crime$street))
colnames(street_name) <- c("Street_Name", "Count")
wordcloud(street_name$Street_Name, street_name$Count, min.freq = 200, random.order = F, random.color = F, colors =c("black", "cornflowerblue", "darkred"), scale = c(2,.3))

This is a word cloud of the most dangerous streets in Boston. Washington St. appears the be the most dangerous street in Boston, however there are Wahsingston streets in multiple districts, so that is a limitation of this word cloud.

15.2 Streets | District

Most Dangerous Streets in Each District Here you can see the top three most crime-ridden streets by district.

streetd = data.frame(with(crime,table(street, district)))
topstreet = data.frame()
for (district_name in unique(streetd$district)) {
  substreetd = subset(streetd, district==district_name)
  tmp = substreetd[with(substreetd, order(Freq, decreasing=T))[1:3], ]
  topstreet = rbind(topstreet, tmp)
}

topstreet$street = as.factor(topstreet$street)
topstreet$Rank = rep(3:1, 6)
library(ggplot2)
ggplot(topstreet, aes(x= Rank, y = Freq, label=street, fill = Freq))+
  geom_text(aes(label=street), size=3.0) +
  ylim(0,10000)+
geom_bar(stat='identity')+
geom_text(size=9, hjust=0, vjust=0)+
facet_wrap(~district, nrow=12)+
  theme_bw(base_size=30)+
  labs(caption= "Data Source : Boston |@ Pankaj Shah")+
  coord_flip()

16 Save and Read RDS

# RENAME for R shiny App
crime <- crime %>% rename(CRIME = offense_category, LONGITUDE = long, LATITUDE = lat)
holiday <- holiday %>% rename( CRIME = offense_category,LONGITUDE = long, LATITUDE = lat)

saveRDS(crime, file = "crime.rds")
saveRDS(holiday, file = "holiday.rds")

crime <- readRDS( "crime.rds")
holiday <- readRDS("holiday.rds")

17 Conclusion

After performing in-depth analysis on the Boston Crime Data Set, we can clearly see the trends and relations between the types of crimes, location and the occurance of the crime. Some of the notable takeaways from the analysis are mentioned below:

  1. The most affected districts in Boston are Dorchester, South-End, Roxbury.
  2. It can be noted that the highest number of the crimes were reported during summer months of July and Auguest.
  3. Motor-Vehicle accident response were the highest number of report registered with the Boston Police.

This analysis can help Boston Police act accordingly and try to reduce the crimes frequently occuring in the city of Boston.

Frequency of Incidents by District: Dorchester, Roxbury, and South End districts have the highest cumulative incidents across the dataset. These three districts are also neighbors with each other, spanning from the South to the Southeast of Boston.

Link for R Shiny app for further exploration is here.

https://cloudstudio.shinyapps.io/github/