Introduction

Below is real-world data containing crime incident reports provided by the Boston Police Department (BPD). The data covers August 2015 through December 2020. It contains records from the Boston Police Department's new crime incident report system, including a reduced set of fields focused on capturing the type of offense that occurred, where it occurred, and when it occurred.

Dataset: Boston Crime

The Boston Crime dataset includes 531,942 rows and 18 columns, ranging from incident numbers, to offense descriptions, to district numbers, and even latitude and longitude coordinates. Dating back to August of 2015, this dataset records not only the year each crime occurred, but also the month, day of the week, and hour. With 531,942 rows and 18 columns to work with, I decided to focus on only certain aspects of the data to develop a storyline.

Since there are 284 unique crime offenses in this dataset, I first wanted to narrow down my criteria and focus on the most common offenses and how often each one occurred.

Once I found the most common crime offenses in the Boston area, I decided to dig deeper and get a sense of when they occurred, more specifically in what year. Using frequencies, I created a histogram to show how many of the 531,942 total crime offenses occurred each year.

The next step in telling this story came from digging even deeper into the 6 years of data from the Boston Police Department. By creating a stacked bar chart, I was able to see not only in what year each crime offense occurred, but also how many times per year. This gave me a much better sense of the most common crime offenses, as well as the years in which they happened the most.

My next visualization delved deeper into months of the year. Separating my data out by year and by month, I could visualize which were the most popular years and months for crimes.

The next part of my story involved taking a step away from just the most popular crime offense, and rather looking at the most common time of the day each one occurred. Out of the 284 unique crime offenses, I was able to analyze the most popular times of the day crimes occurred, the least popular time of day they occurred, and how many crimes occurred each hour of the day on average.

Keeping up with the 284 crime offenses, I continued analyzing the most popular times they occurred by looking at which days of the week they happened. Using the days of the week, year, and count, I was able to figure out which days were the most popular in committing crimes, as well as how many were committed and in what year.

Last but not least, I ended my visualizations with a heatmap. This was the last step of my story, and I knew this would be a fun, interactive way for people to see the most popular crime offenses, how many occurred, and on what day.

Overall, this Boston Crime dataset was especially useful for creating visualizations that tell a story about the data. Working with such a large (84 MB) dataset, my goal was to narrow down the crime offenses and use variables such as year, month, day of week, and hour to find out which crime offenses were the most common and when. Starting broad, with the top 10 crime offenses and their counts, was a good place to begin, and I was able to narrow it down from there. Analyzing the year, then the month, then the day of week, and finally the hour of each offense was the best way to present my data from the broadest to the narrowest view.

Data Summary

A summary of my dataset is included below. For each character column it lists the length (number of rows), class, and mode. For each numeric column it lists the minimum, 1st quartile, median, mean, 3rd quartile, and maximum, plus the number of NA's where any exist.

summary(df)
##  INCIDENT_NUMBER     OFFENSE_CODE  OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
##  Length:531942      Min.   : 111   Length:531942      Length:531942      
##  Class :character   1st Qu.:1102   Class :character   Class :character   
##  Mode  :character   Median :3005   Mode  :character   Mode  :character   
##                     Mean   :2331                                         
##                     3rd Qu.:3201                                         
##                     Max.   :3831                                         
##                                                                          
##    DISTRICT         REPORTING_AREA    SHOOTING         OCCURRED_ON_DATE  
##  Length:531942      Min.   :  0.0   Length:531942      Length:531942     
##  Class :character   1st Qu.:179.0   Class :character   Class :character  
##  Mode  :character   Median :347.0   Mode  :character   Mode  :character  
##                     Mean   :385.4                                        
##                     3rd Qu.:542.0                                        
##                     Max.   :962.0                                        
##                     NA's   :40183                                        
##       YEAR          MONTH        DAY_OF_WEEK             HOUR      
##  Min.   :2015   Min.   : 1.000   Length:531942      Min.   : 0.00  
##  1st Qu.:2016   1st Qu.: 4.000   Class :character   1st Qu.: 9.00  
##  Median :2018   Median : 7.000   Mode  :character   Median :14.00  
##  Mean   :2018   Mean   : 6.744                      Mean   :13.07  
##  3rd Qu.:2019   3rd Qu.:10.000                      3rd Qu.:18.00  
##  Max.   :2020   Max.   :12.000                      Max.   :23.00  
##                                                                    
##    UCR_PART            STREET               Lat             Long       
##  Length:531942      Length:531942      Min.   :-1.00   Min.   :-71.20  
##  Class :character   Class :character   1st Qu.:42.30   1st Qu.:-71.10  
##  Mode  :character   Mode  :character   Median :42.33   Median :-71.08  
##                                        Mean   :42.24   Mean   :-70.95  
##                                        3rd Qu.:42.35   3rd Qu.:-71.06  
##                                        Max.   :42.40   Max.   :  0.00  
##                                        NA's   :30249   NA's   :30249   
##    Location        
##  Length:531942     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
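The NA rows in the summary can also be checked column by column. Below is a minimal sketch on a made-up data frame (the `toy` data and the `needed` column selection are assumptions for illustration, not the real dataset) showing one way to confirm that the columns an analysis relies on have no missing values.

```r
# Toy stand-in for the crime data; the values are made up for illustration.
toy <- data.frame(
  OFFENSE_DESCRIPTION = c("VANDALISM", "VERBAL DISPUTE", "VANDALISM"),
  YEAR = c(2015, 2016, 2016),
  Lat = c(42.33, NA, 42.35),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Columns the analysis relies on (hypothetical selection for this toy example).
needed <- c("OFFENSE_DESCRIPTION", "YEAR")
all_complete <- all(na_counts[needed] == 0)
print(all_complete)
```

The same two calls could be run on `df` with the columns actually used in the charts.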

Top 10 Number of Offenses

Below is a bar chart of the 10 most frequent crime offenses in the Boston area and their counts. Since this chart draws on the dataset as a whole, the counts are totals for 2015-2020. The counts range from about 14,000 to about 29,000 offenses, and the most common offense was sick/injured/medical personnel. At just over 14,000 offenses, the 10th highest offense is larceny theft from building. I wanted my first visualization to be the broadest one, with each later visualization getting more detailed. This chart gives readers an idea of the most common crime offenses and how many occurred over the years. Having seen the most common offenses, the next step is to find out when they occurred the most.

library(dplyr)    # count(), %>%
library(ggplot2)  # plotting
library(scales)   # comma()

# Count incidents per offense description and sort from most to least frequent
offensecount <- data.frame(dplyr::count(df, OFFENSE_DESCRIPTION))
offensecount <- offensecount[order(offensecount$n, decreasing = TRUE), ]
# Plot rows 2 through 11 of the sorted counts (the first row is left out)
v1 <- ggplot(offensecount[2:11, ], aes(x = n, y = reorder(OFFENSE_DESCRIPTION, n))) +
  geom_bar(colour = "black", fill = "darkolivegreen3", stat = "identity") +
  labs(title = "Number of Offenses (Top 10)", x = "Number of Offenses", y = "Offense Description") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = scales::comma(n)), vjust = 0, hjust = -0.04, size = 3.5)
v1
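The chart code plots rows 2 through 11 of the sorted counts, while the stacked bar chart later in the report takes rows 1 through 10. A small sketch with made-up counts (the `toy_counts` data frame is hypothetical) shows how the two slices differ, which may help when checking that the plotted offenses are the intended ones.

```r
# Made-up counts, only to show how the row slice affects the result.
toy_counts <- data.frame(
  OFFENSE_DESCRIPTION = c("A", "B", "C", "D"),
  n = c(5, 40, 10, 25),
  stringsAsFactors = FALSE
)

# Sort from most to least frequent, as in the chart code.
ranked <- toy_counts[order(toy_counts$n, decreasing = TRUE), ]

top2_from_first <- ranked$OFFENSE_DESCRIPTION[1:2]   # ranks 1 and 2
top2_from_second <- ranked$OFFENSE_DESCRIPTION[2:3]  # ranks 2 and 3

print(top2_from_first)
print(top2_from_second)
```

Printing `head(offensecount, 1)` on the real data would show which row the `2:11` slice leaves out.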

Histogram of Offenses by Year

The next visualization is a histogram of the total number of offenses per year. A histogram is especially useful for taking large sets of data and grouping them into bins, with the height of each bin showing how many times an event occurred. In this case, the visualization totals all crime offenses and shows the frequency for each year. One thing that I noticed right away was the low number of offenses in 2015. With a difference of almost 50,000 total crime offenses between 2015 and 2016, it was interesting to find out that 2015 had so many fewer offenses because this dataset runs from the middle of 2015 through 2020. Because 2015 has only about half a year of data, its total is not directly comparable to the totals for the full years.

Another thing I noticed was a big drop from 2019 to 2020. With almost 20,000 fewer total crime offenses than 2019, 2020 had the lowest number of crime offenses of the 5 full years in this dataset. One likely reason is the start of the pandemic in March 2020. Once Covid-19 hit the United States, businesses shut down and the entire country went into lockdown. With a deadly disease spreading, less and less crime happened in the Boston area. Fewer people went outside, most businesses closed, and only essential personnel were allowed out of the house and into work during certain time periods. For most of the summer of 2020, Boston was under a citywide curfew, and people could not leave their houses between 12am - 5am. Since so many crimes happen late at night, I definitely think the pandemic had a huge impact on why crime offenses dropped so much in 2020.

# One bin per year; the text layer prints each bin's count above its bar
p1 <- ggplot(df, aes(x = YEAR)) +
  geom_histogram(bins = 6, color = "black", fill = "lightskyblue1") +
  labs(title = "Histogram of Offenses by Year", x = "Year", y = "Count of Offenses") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::comma) +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  stat_bin(binwidth = 1, geom = "text", color = "black",
           aes(label = scales::comma(after_stat(count))), vjust = -0.5)
# Label every year on the x-axis
x_axis_labels <- min(df$YEAR):max(df$YEAR)
p1 <- p1 + scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels)
p1
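Because 2015 covers only part of a year, one way to compare years on a more equal footing is to divide each year's total by the number of months that actually appear in the data for that year. The sketch below uses made-up rows (the `toy` data frame is an assumption, not the real dataset) and only base R.

```r
# Made-up incident rows: 2015 has data for 2 months, 2016 for 3 months.
toy <- data.frame(
  YEAR = c(2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016),
  MONTH = c(11, 12, 12, 1, 1, 2, 3, 3, 3)
)

# Rows (offenses) per year.
offenses <- tapply(toy$MONTH, toy$YEAR, length)

# Number of distinct months with at least one record in each year.
months_covered <- tapply(toy$MONTH, toy$YEAR, function(m) length(unique(m)))

# Average offenses per covered month.
per_month <- offenses / months_covered
print(per_month)
```

On the real data, replacing `toy` with `df` would give a per-month rate for each year, which avoids penalizing a year for having fewer months of coverage.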

Offense Count by Offense Type

The next visualization delves deeper into the top 10 crime offenses. It takes the first bar chart one step further by breaking out the data by year. Just as in the previous visualization, 2015 has the fewest crime offenses because it covers only about half of the year's data. Breaking out this data by year allows readers to see which crime offenses occurred in each year and get a sense of how many.

One thing that stood out to me in this visualization was that every crime offense occurred at least once in all 6 years except assault simple battery. This could be for a number of reasons, the pandemic again being the most likely. In 2015, vandalism and property damage were the most common offenses. The most common offenses in 2016, 2017, 2018, and 2019 were investigative personnel and sick/injured/medical personnel. Since these two offenses were the most common overall, it makes sense that they were the most common in the majority of the 6 years of data. In 2020, the most common offenses were verbal disputes and investigative personnel. Overall, we can see that sick/injured/medical personnel is the most common offense in Boston from 2015-2020, followed by investigative personnel.

# Rank offense descriptions by frequency and keep the 10 most common
df_reasons <- dplyr::count(df, OFFENSE_DESCRIPTION)
df_reasons <- df_reasons[order(df_reasons$n, decreasing = TRUE), ]
top_reasons <- df_reasons$OFFENSE_DESCRIPTION[1:10]

# Count incidents per offense and year for the top 10 offenses
new_df2 <- df %>%
  filter(OFFENSE_DESCRIPTION %in% top_reasons) %>%
  select(YEAR, OFFENSE_DESCRIPTION) %>%
  group_by(OFFENSE_DESCRIPTION, YEAR) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  data.frame()

# Total per offense across all years, used for the labels at the end of each bar
agg_tot2 <- new_df2 %>%
  select(OFFENSE_DESCRIPTION, n) %>%
  group_by(OFFENSE_DESCRIPTION) %>%
  dplyr::summarise(tot = sum(n), .groups = "drop") %>%
  data.frame()

new_df2$YEAR <- as.factor(new_df2$YEAR)
# Round the largest total up to the next multiple of 35,000 for the axis limit
max_y <- plyr::round_any(max(agg_tot2$tot), 35000, ceiling)

v2 <- ggplot(new_df2, aes(x = reorder(OFFENSE_DESCRIPTION, n, sum), y = n, fill = YEAR)) +
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
  coord_flip() +
  labs(title = "Offense Count by Offense Type (Top 10)", x = "Offense Description", y = "Offense Count", fill = "Year") +
  ggthemes::theme_clean() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Set2") +
  geom_text(data = agg_tot2, aes(x = OFFENSE_DESCRIPTION, y = tot, label = scales::comma(tot), fill = NULL), hjust = -0.1, size = 4) +
  scale_y_continuous(labels = scales::comma,
                     breaks = seq(0, max_y, by = 5000),
                     limits = c(0, max_y))
v2

Total Offenses by Month (Numerical) and Year

While the previous visualization shows how many times an offense occurred in each year, this one goes deeper and shows how many offenses occurred in each month. This multiple bar chart is split into 6 panels, one for each year. Each panel is then divided further into 12 bars, one for each month.

Again, we can see here that only about half of 2015 has data, but for the most part the years 2016-2020 did not fluctuate too much. The beginning and end of each year seem to have less crime, whereas counts rise again in the middle months. These middle months are mainly summer months, when the weather is warmer, people are outside, and more crime occurs than in the winter months.

# Count offenses for every year/month combination
months_df <- df %>%
  select(MONTH, YEAR) %>%
  group_by(YEAR, MONTH) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  data.frame()
months_df$YEAR <- factor(months_df$YEAR)
# Treat months as a categorical axis in calendar order (1 = January, ..., 12 = December)
months_df$MONTH <- factor(months_df$MONTH, levels = 1:12)
v5 <- ggplot(months_df, aes(x = MONTH, y = n, fill = YEAR)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = scales::comma, breaks = seq(0, 8000, by = 2000)) +
  labs(title = "Multiple Bar Charts - Total Offenses by Month (Numerical) by Year",
       x = "Months of the Year (Numerical)",
       y = "Offense Count",
       fill = "Year") +
  scale_fill_brewer(palette = "Set3") +
  # One panel per year, laid out in a 2 x 3 grid
  facet_wrap(~YEAR, ncol = 3, nrow = 2)
v5
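The chart above labels months by number. If month names are preferred, one option is R's built-in `month.abb` constant, sketched here on a made-up vector of month numbers (`toy_months` is hypothetical). Setting the factor levels to `month.abb` keeps the axis in calendar order rather than alphabetical order.

```r
# Made-up month numbers standing in for the MONTH column.
toy_months <- c(8, 12, 1, 8)

# Map numbers to abbreviated names and keep calendar ordering via the levels.
month_labels <- factor(month.abb[toy_months], levels = month.abb)

print(month_labels)
print(table(month_labels)[c("Jan", "Aug", "Dec")])
```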

Offenses by Hour

This next visualization shows the total number of offenses by hour. After seeing which years and months had the most crime, this line chart shows the count of crimes for each hour of the day in the Boston area.

Looking at the chart, you can see that the fewest crimes occurred in the middle of the night, between 2-7 am. At 5 am, crimes were at their lowest, with only 5,656 total crime offenses between 2015-2020. The biggest drop was from midnight to 1 am, after which counts stay low for the rest of the night. Crimes are highest around 5 pm, with 34,132 total crime offenses in that hour. Overall, this line chart shows that crimes are highest in the afternoon and early evening between 1 pm - 7 pm and lowest in the middle of the night between 1 am - 6 am.

# Count offenses for each hour of the day (0-23)
hours_df <- df %>%
  select(HOUR) %>%
  group_by(HOUR) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  data.frame()
x_axis_labels <- min(hours_df$HOUR):max(hours_df$HOUR)
# Keep only the hours with the lowest and highest counts so they can be highlighted
hi_lo <- hours_df %>%
  filter(n == min(n) | n == max(n)) %>%
  data.frame()

v3 <- ggplot(hours_df, aes(x = HOUR, y = n)) +
  geom_line(color = 'black', size = 1) +
  geom_point(shape = 21, size = 4, color = 'dodgerblue4', fill = 'white') +
  labs(x = "Hour", y = "Offense Count", title = "Offenses by Hour") +
  scale_y_continuous(labels = scales::comma) +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels, minor_breaks = NULL) +
  # Filled points mark the minimum and maximum hours
  geom_point(data = hi_lo, aes(x = HOUR, y = n), shape = 21, size = 4, fill = 'dodgerblue4', color = 'dodgerblue4') +
  # Label only the minimum and maximum; every other point gets an empty label
  ggrepel::geom_label_repel(aes(label = ifelse(n == max(n) | n == min(n), scales::comma(n), "")),
                   box.padding = 1.5,
                   point.padding = 1.5,
                   size = 4,
                   color = 'Gray28',
                   segment.color = 'darkblue')
v3
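The peak hour, the lowest hour, and the largest hour-to-hour drop described above can also be read off numerically rather than from the chart. The sketch below uses made-up hourly counts (`toy_hours` is an assumption and does not contain the real Boston figures).

```r
# Made-up hourly counts for hours 0 through 5.
toy_hours <- data.frame(HOUR = 0:5, n = c(120, 80, 60, 45, 50, 90))

# Hours with the highest and lowest counts.
peak <- toy_hours$HOUR[which.max(toy_hours$n)]
trough <- toy_hours$HOUR[which.min(toy_hours$n)]

# Change from each hour to the next; the most negative value is the biggest drop.
change <- diff(toy_hours$n)
biggest_drop_from <- toy_hours$HOUR[which.min(change)]

print(c(peak = peak, trough = trough, biggest_drop_from = biggest_drop_from))
```

Running the same steps on `hours_df` would give the corresponding hours for the real data.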

Offenses by Day and Year

This next visualization is a multiple line chart, with offense count on the y-axis and days of the week on the x-axis. Again, this data is split up by year, with each line representing a year between 2015-2020. The chart therefore shows the total number of crimes committed on each day of the week in each year.

Right away, the chart shows that 2015 and 2020 had the lowest crime counts, while the in-between years of 2016, 2017, 2018, and 2019 were all about the same. Fridays in particular stood out as the day when the most crimes were committed. After Friday, counts dropped back down again towards the start of the next week. This wasn’t surprising to me, since more people are out and more is going on heading into the weekend, whereas earlier in the week most people are in their regular work routine and fewer are out and about.

# Count offenses for every year/day-of-week combination
days_df <- df %>%
  select(DAY_OF_WEEK, YEAR) %>%
  group_by(YEAR, DAY_OF_WEEK) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  data.frame()
days_df$YEAR <- as.factor(days_df$YEAR)
# Order the days Monday through Sunday instead of alphabetically
days_df$DAY_OF_WEEK <- factor(days_df$DAY_OF_WEEK, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
v4 <- ggplot(days_df, aes(x = DAY_OF_WEEK, y = n, group = YEAR)) +
  geom_line(aes(color = YEAR), size = 3) +
  labs(title = "Offenses by Day and Year", x = "Days of the Week", y = "Offense Count") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point(shape = 21, size = 5, color = "black", fill = "white") +
  scale_y_continuous(labels = scales::comma) +
  scale_color_brewer(palette = "Dark2", name = "Year", guide = guide_legend(reverse = TRUE))
v4

Heatmap of Offenses by Day of the Week

This final visualization is similar to the previous multiple line chart, but offers a different way of looking at the data. A heatmap is a visual technique that shows the magnitude of a phenomenon as color in two dimensions. The color varies with intensity, with darker shades meaning higher values. Here, the day and year combinations with the most offenses are the darkest shade of red, and those with the fewest offenses are the lightest.

The previous multiple line chart could not tell us exactly how many crimes occurred on each day of the week, but this heatmap prints the count in every cell, so it gives two ways (color and number) of seeing which days had the most crime. From the data, we can see that Fridays in 2017 had the most crime, with 15,521 offenses in total. The least crime happened on Sundays in 2015, with 6,600 offenses. Fridays were the most common days for crime offenses in Boston, totaling 80,990 offenses from 2015-2020. Conversely, Sundays were the least common days, totaling only 67,483 offenses over the 6 years.

# Order the days Monday through Sunday
mylevels <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
days_df$DAY_OF_WEEK <- factor(days_df$DAY_OF_WEEK, levels = mylevels)
# Legend breaks every 2,000 offenses
breaks <- seq(0, max(days_df$n), by = 2000)
v6 <- ggplot(days_df, aes(x = YEAR, y = DAY_OF_WEEK, fill = n)) +
  geom_tile(color = "black") +
  # Print the exact count inside each tile
  geom_text(aes(label = scales::comma(n)), size = 3) +
  coord_equal(ratio = 1) +
  labs(title = "Heatmap: Offenses by Day of the Week",
       x = "Year",
       y = "Days of the Week",
       fill = "Offense Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  # Reverse the y-axis so Monday appears at the top
  scale_y_discrete(limits = rev(levels(days_df$DAY_OF_WEEK))) +
  scale_fill_continuous(low = "White", high = "tomato3", labels = scales::comma, breaks = breaks) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
v6
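The per-day totals across all years quoted above are sums over the rows of the heatmap. A minimal sketch of that aggregation, using a made-up table in the same shape as `days_df` (the `toy_days` values are hypothetical), could look like this:

```r
# Made-up year/day counts in the same shape as days_df.
toy_days <- data.frame(
  YEAR = c(2015, 2015, 2016, 2016),
  DAY_OF_WEEK = c("Friday", "Sunday", "Friday", "Sunday"),
  n = c(10, 6, 15, 9),
  stringsAsFactors = FALSE
)

# Sum the counts over all years for each day of the week.
totals <- aggregate(n ~ DAY_OF_WEEK, data = toy_days, FUN = sum)

# Sort so the day with the most offenses comes first.
totals <- totals[order(totals$n, decreasing = TRUE), ]
print(totals)
```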

Conclusion

Overall, I think these data visualizations told a story about the most common crimes committed, along with the year, month, day of the week, and hour in which crimes were committed the most. Starting out broad, the first visualization showed the top 10 offenses and how many times each one was committed over the span of 6 years. The second visualization showed the total number of offenses in each year, revealing that 2015 had the fewest offenses (53,597) and 2017 had the most (101,338). After finding out which year had the most crime, the third visualization broke down the top 10 offenses by year. The next 4 visualizations delved deeper into how offenses were distributed across the months, hours, and days of the week over the 6 years.

Since this dataset was taken from the Boston Police Department website, I considered the numbers reliable overall. I did not have to take out any “NA’s” or “bad” data, because every column I needed was fully filled in. Using 5 columns of data in this dataset, accompanied by 7 different visualizations, I was able to capture the most common crime offenses in the Boston area and when those offenses occurred the most.