Introduction:

The data set I have chosen to analyze takes data from Crime Incident Reports from the City of Boston between 2015 and 2020. The data set has 531,942 rows of data and is organized into seventeen variables. This covers various criminal incident reports across the six year span and includes various pieces of each line of data such as an internal Boston Police Dept. (BPD) Number, a numerical code of the offense, which group each offense falls into, and even when and where the shooting took place.

Descriptive Statistics:

summary(project_df)
##  INCIDENT_NUMBER     OFFENSE_CODE  OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
##  Length:531942      Min.   : 111   Length:531942      Length:531942      
##  Class :character   1st Qu.:1102   Class :character   Class :character   
##  Mode  :character   Median :3005   Mode  :character   Mode  :character   
##                     Mean   :2331                                         
##                     3rd Qu.:3201                                         
##                     Max.   :3831                                         
##                                                                          
##    DISTRICT         REPORTING_AREA    SHOOTING         OCCURRED_ON_DATE  
##  Length:531942      Min.   :  0.0   Length:531942      Length:531942     
##  Class :character   1st Qu.:179.0   Class :character   Class :character  
##  Mode  :character   Median :347.0   Mode  :character   Mode  :character  
##                     Mean   :385.4                                        
##                     3rd Qu.:542.0                                        
##                     Max.   :962.0                                        
##                     NA's   :40183                                        
##       YEAR          MONTH        DAY_OF_WEEK             HOUR      
##  Min.   :2015   Min.   : 1.000   Length:531942      Min.   : 0.00  
##  1st Qu.:2016   1st Qu.: 4.000   Class :character   1st Qu.: 9.00  
##  Median :2018   Median : 7.000   Mode  :character   Median :14.00  
##  Mean   :2018   Mean   : 6.744                      Mean   :13.07  
##  3rd Qu.:2019   3rd Qu.:10.000                      3rd Qu.:18.00  
##  Max.   :2020   Max.   :12.000                      Max.   :23.00  
##                                                                    
##    UCR_PART            STREET               Lat             Long       
##  Length:531942      Length:531942      Min.   :-1.00   Min.   :-71.20  
##  Class :character   Class :character   1st Qu.:42.30   1st Qu.:-71.10  
##  Mode  :character   Mode  :character   Median :42.33   Median :-71.08  
##                                        Mean   :42.24   Mean   :-70.95  
##                                        3rd Qu.:42.35   3rd Qu.:-71.06  
##                                        Max.   :42.40   Max.   :  0.00  
##                                        NA's   :30249   NA's   :30249   
##    Location        
##  Length:531942     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Plot 1: Horizontal Stacked Bar Chart of Report by Group Type

This graph is a horizontal bar chart showing the crimes that were reported the most often for each year of the data set, from 2015 to 2020. Each of the bars refers to a specific incident, such as investigating a suspicious person or a hit-and-run accident. An interesting quality of this graph is that it shows that there was a wide variety of calls reported in, with investigating suspicious persons and sick/injured persons being the most common incidents reported. The top ten incidents are split evenly between being people and property related. Another interesting statistic is the lower number of vehicle break-ins, which is in tenth. With the population density and concentration of motor vehicles in the city, it would be expected to have a higher number of calls for that, but some of the calls that could have been classified as a vehicle break-in could also have been called in as a suspicious person.

df_descriptions <- dplyr::count(project_df, OFFENSE_DESCRIPTION)
df_descriptions <- df_descriptions[order(df_descriptions$n, decreasing = TRUE),]

top_descriptions <- df_descriptions$OFFENSE_DESCRIPTION[1:10]

df_newplot <- project_df %>% 
  filter(OFFENSE_DESCRIPTION %in% top_descriptions) %>%
  select(OCCURRED_ON_DATE, OFFENSE_DESCRIPTION, YEAR) %>%
  group_by(OFFENSE_DESCRIPTION, YEAR) %>%
  dplyr::summarise(n = length(OFFENSE_DESCRIPTION), .groups = 'keep') %>%
  data.frame()

agg_tot <- df_newplot %>%
  select(OFFENSE_DESCRIPTION, n) %>%
  group_by(OFFENSE_DESCRIPTION) %>%
  dplyr::summarise(tot = sum(n), .groups = 'keep') %>%
  data.frame()


ggplot(df_newplot, aes(x = reorder(OFFENSE_DESCRIPTION, n, sum), y = n, fill = YEAR)) +
  geom_bar(stat="identity", position = position_stack(reverse = TRUE)) +
  coord_flip() +
  labs(title="Crime Report Count by Type of Crime", x="Offense Type", y= "Report Count", fill="Year", caption="Source: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system") +
  theme_light() +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_color_brewer(palette = "Accent") +
  geom_text(data=agg_tot, aes(x=OFFENSE_DESCRIPTION, y=tot, label=scales::comma(tot), fill=NULL), hjust =-0.2, size=4) +
  scale_y_continuous(labels = comma,
                     breaks = seq(0, 40000, by = 5000),
                     limits = c(0, 40000))

Plot 2: Crime Report Count by Hour of the Day

This graph is a line chart that shows how many reports were called in at each hour of the day. Each point on the line is one hour in military time, so 0 is 12 AM, 13 is 1 PM, etc. up through 11 PM at night. There are a couple notable points in the graph, the first of which are the lowest and highest points of the graph, which occur at 5 AM and 5 PM, respectively. The low point makes the most sense of the two points, since there is very little activity taking place at that time of day other than night shift workers and early morning exercise. The high point makes slightly less sense, being at an odd time of day, but it is still understandable. At five in the afternoon, most people are making their way home from work and they might be tired to the point that their guard is lowered since they are focused on getting home or going to get food. This could mean increased irritability which can lead to more incidents such as verbal disputes, as well as more non-violent incidents such as traffic accidents.

hours_df_project <- project_df %>%
  select(HOUR) %>%
  group_by(HOUR) %>%
  dplyr::summarise(n=length(HOUR), .groups='keep') %>%
  data.frame()

hi_lo <- hours_df_project %>%
  filter(n == min(n) | n == max(n)) %>%
  data.frame()

x_axis_labels <-  min(hours_df_project$HOUR):max(hours_df_project$HOUR)



ggplot(hours_df_project, aes(x=HOUR, y=n)) +
  geom_line(color='black', size=1) +
  geom_point(shape=21, size=4, color='red', fill='white') +
  labs(x="Hour", y="Incident Count", 
       title="Crime Incident Report Count by Hour", 
       caption="Source: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system") +
  scale_y_continuous(labels=comma) +
  theme_light() +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_x_continuous(labels=x_axis_labels, breaks=x_axis_labels, minor_breaks = NULL) +
  geom_point(data = hi_lo, aes(x=HOUR, y=n), shape=21, size=4, fill='red', color='red') +
  geom_label_repel(aes(label = ifelse(n== max(n) | n == min(n), scales::comma(n), "")), 
                   box.padding = 1, point.padding = 1, size=4, 
                   color='Grey50', segment.color = 'darkblue')

Plot 3: Reports by Day of the Week By Year

This graph is a split line chart that shows how many reports came in on each day of the week, with one line for each year of the data set. The chart has been structured so that the days go in the correct order, with the weekdays coming first followed by the weekend. One interesting observation is the difference in where the lines for 2015 and 2020 are compared to the rest of the lines. While there is overlap for most of the lines, the lines for those two years in particular sit a good couple thousand per day below the other four years. Unfortunately, with data like this, there is no one way to determine what caused the spike in reports over the other four years in the graph, 2016-2019, but one possible change could be reform in police procedures or a shift in political climate. On the other end, the lower number of reports from 2020 can likely be attributed to the coronavirus pandemic, which saw people becoming generally less active during the day as a result.

weekday_df <- project_df %>%
  select(DAY_OF_WEEK, YEAR) %>%
  group_by(YEAR, DAY_OF_WEEK) %>%
  dplyr::summarise(n = length(DAY_OF_WEEK), .groups='keep') %>%
  data.frame()
weekday_df$YEAR <- as.factor(weekday_df$YEAR)

weekday_order <- factor(weekday_df$DAY_OF_WEEK,
                        level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
                                'Friday', 'Saturday', 'Sunday'))

ggplot(weekday_df, aes(x=weekday_order, y=n, group=YEAR)) +
  geom_line(aes(color=YEAR), size=3) +
  labs(title = "Crime Incident Reports by Day of the Week by Year", x="Day of the Week", y="Report Count", caption="Source: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system") +
  theme_light() +
  theme(plot.title=element_text(hjust=0.5)) +
  scale_color_brewer(palette = "Accent") +
  geom_point(shape=21, size=5, color="black", fill="white") +
  scale_y_continuous(labels=comma)

Plot 4: Reports by Year by UCR Part

This graph takes a look at what percent of crimes fell under each part of the UCR, or Universal Crime Reporting Code. According to the FBI, part one is standard crimes and arrests known by the police, whereas part two crimes are more victim-less crimes such as gambling or public intoxication,and part three has no clear classification. The one part of this data set that is making these graphs appear unusual is that none of the crimes for 2020 have a classification that would allow them be sorted yet, meaning the classifications are likely applied retroactively. It is also interesting noting the gradual decrease in part one crimes, which tend to be more violent. It is intriguing how so many of the crimes fit under part three, which often takes up the largest portion of each pie chart.

ucr_parts <- count(project_df, UCR_PART)
ucr_parts <- ucr_parts[order(-n),]

df_ucr <- project_df %>%
  select(UCR_PART, YEAR) %>%
  dplyr::mutate(myParts = ifelse(UCR_PART=="Part Three", "Part Three",
                                 ifelse(UCR_PART=="Part Two", "Part Two",
                                        ifelse(UCR_PART=="Part One", "Part One","Other")))) %>%
  group_by(YEAR, UCR_PART) %>%
  dplyr::summarise(n=length(UCR_PART), .groups='keep') %>%
  group_by(YEAR) %>%
  dplyr::mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()

df_ucr$UCR_PART <- factor(df_ucr$UCR_PART, levels=c("Part One", "Part Two", "Part Three"))

ggplot(data = df_ucr, aes(x="", y=n, fill=UCR_PART)) +
  geom_bar(stat = "identity", position = "fill") +
  coord_polar(theta="y", start=0) +
  labs(fill= "Parts", x= NULL, y= NULL, title = "Crime Reports by Year by UCR Part",
       caption="Slices under 5% are not labeled") +
  theme_light() +
  theme(plot.title = element_text(hjust=0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  facet_wrap(~YEAR, ncol = 3, nrow = 2) +
  scale_fill_brewer(palette = "Reds") +
  geom_text(aes(x=1.7, label = ifelse(percent_of_total > 5,paste0(percent_of_total,"%"),"")),
            size=3,
            position = position_fill(vjust=0.5))

Plot 5: Heatmap of Report Count by Day of the Week

The last of these graphs is a heatmap, once again showing the concentration of the number of crimes reported on each day of the week for each year of the data set. One interesting observation is the larger numbers being centered on Friday, which has the most crimes of any day of the week reported each year. This type of graph also helps to give a better sense of the concentration of the data, and where it appears to be more dense than other spots, such as in 2015.

weekday_order <- factor(weekday_df$DAY_OF_WEEK,
                        level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
                                'Friday', 'Saturday', 'Sunday'))

mylevels<- c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday')
weekday_df$DAY_OF_WEEK <- factor(weekday_df$DAY_OF_WEEK, levels = mylevels)
breaks <- c(seq(0, max(weekday_df$n), by = 2500))


ggplot(data = weekday_df, aes(x=YEAR, y=DAY_OF_WEEK, fill=n)) +
  geom_tile(color="black") +
  geom_text(aes(label=comma(n)), size=2) +
  coord_equal(ratio=1) +
  labs(title = "Heatmap: Citations by Day of the Week",
       x= "Year",
       y= "Day of the Week",
       fill= "Citation Count",
       caption="Source: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_y_discrete(limits = rev(levels(weekday_df$DAY_OF_WEEK))) +
  scale_fill_continuous(low="white", high="red", breaks = breaks) +
  guides(fill = guide_legend(reverse=TRUE, override.aes=list(colour="black")))