The data set I have chosen to analyze takes data from Crime Incident Reports from the City of Boston between 2015 and 2020. The data set has 531,942 rows of data and is organized into seventeen variables. This covers various criminal incident reports across the six year span and includes various pieces of each line of data such as an internal Boston Police Dept. (BPD) Number, a numerical code of the offense, which group each offense falls into, and even when and where the shooting took place.
summary(project_df)
## INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
## Length:531942 Min. : 111 Length:531942 Length:531942
## Class :character 1st Qu.:1102 Class :character Class :character
## Mode :character Median :3005 Mode :character Mode :character
## Mean :2331
## 3rd Qu.:3201
## Max. :3831
##
## DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE
## Length:531942 Min. : 0.0 Length:531942 Length:531942
## Class :character 1st Qu.:179.0 Class :character Class :character
## Mode :character Median :347.0 Mode :character Mode :character
## Mean :385.4
## 3rd Qu.:542.0
## Max. :962.0
## NA's :40183
## YEAR MONTH DAY_OF_WEEK HOUR
## Min. :2015 Min. : 1.000 Length:531942 Min. : 0.00
## 1st Qu.:2016 1st Qu.: 4.000 Class :character 1st Qu.: 9.00
## Median :2018 Median : 7.000 Mode :character Median :14.00
## Mean :2018 Mean : 6.744 Mean :13.07
## 3rd Qu.:2019 3rd Qu.:10.000 3rd Qu.:18.00
## Max. :2020 Max. :12.000 Max. :23.00
##
## UCR_PART STREET Lat Long
## Length:531942 Length:531942 Min. :-1.00 Min. :-71.20
## Class :character Class :character 1st Qu.:42.30 1st Qu.:-71.10
## Mode :character Mode :character Median :42.33 Median :-71.08
## Mean :42.24 Mean :-70.95
## 3rd Qu.:42.35 3rd Qu.:-71.06
## Max. :42.40 Max. : 0.00
## NA's :30249 NA's :30249
## Location
## Length:531942
## Class :character
## Mode :character
##
##
##
##
This graph is a horizontal bar chart showing the crimes that were reported the most often for each year of the data set, from 2015 to 2020. Each of the bars refers to a specific incident, such as investigating a suspicious person or a hit-and-run accident. An interesting quality of this graph is that it shows that there was a wide variety of calls reported in, with investigating suspicious persons and sick/injured persons being the most common incidents reported. The top ten incidents are split evenly between being people and property related. Another interesting statistic is the lower number of vehicle break-ins, which is in tenth. With the population density and concentration of motor vehicles in the city, it would be expected to have a higher number of calls for that, but some of the calls that could have been classified as a vehicle break-in could also have been called in as a suspicious person.
df_descriptions <- dplyr::count(project_df, OFFENSE_DESCRIPTION)
df_descriptions <- df_descriptions[order(df_descriptions$n, decreasing = TRUE),]
top_descriptions <- df_descriptions$OFFENSE_DESCRIPTION[1:10]
df_newplot <- project_df %>%
filter(OFFENSE_DESCRIPTION %in% top_descriptions) %>%
select(OCCURRED_ON_DATE, OFFENSE_DESCRIPTION, YEAR) %>%
group_by(OFFENSE_DESCRIPTION, YEAR) %>%
dplyr::summarise(n = length(OFFENSE_DESCRIPTION), .groups = 'keep') %>%
data.frame()
agg_tot <- df_newplot %>%
select(OFFENSE_DESCRIPTION, n) %>%
group_by(OFFENSE_DESCRIPTION) %>%
dplyr::summarise(tot = sum(n), .groups = 'keep') %>%
data.frame()
ggplot(df_newplot, aes(x = reorder(OFFENSE_DESCRIPTION, n, sum), y = n, fill = YEAR)) +
geom_bar(stat="identity", position = position_stack(reverse = TRUE)) +
coord_flip() +
labs(title="Crime Report Count by Type of Crime", x="Offense Type", y= "Report Count", fill="Year", caption="Source: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system") +
theme_light() +
theme(plot.title = element_text(hjust=0.5)) +
scale_color_brewer(palette = "Accent") +
geom_text(data=agg_tot, aes(x=OFFENSE_DESCRIPTION, y=tot, label=scales::comma(tot), fill=NULL), hjust =-0.2, size=4) +
scale_y_continuous(labels = comma,
breaks = seq(0, 40000, by = 5000),
limits = c(0, 40000))
This graph is a line chart that shows how many reports were called in at each hour of the day. Each point on the line is one hour in military time, so 0 is 12 AM, 13 is 1 PM, etc. up through 11 PM at night. There are a couple notable points in the graph, the first of which are the lowest and highest points of the graph, which occur at 5 AM and 5 PM, respectively. The low point makes the most sense of the two points, since there is very little activity taking place at that time of day other than night shift workers and early morning exercise. The high point makes slightly less sense, being at an odd time of day, but it is still understandable. At five in the afternoon, most people are making their way home from work and they might be tired to the point that their guard is lowered since they are focused on getting home or going to get food. This could mean increased irritability which can lead to more incidents such as verbal disputes, as well as more non-violent incidents such as traffic accidents.
hours_df_project <- project_df %>%
select(HOUR) %>%
group_by(HOUR) %>%
dplyr::summarise(n=length(HOUR), .groups='keep') %>%
data.frame()
hi_lo <- hours_df_project %>%
filter(n == min(n) | n == max(n)) %>%
data.frame()
x_axis_labels <- min(hours_df_project$HOUR):max(hours_df_project$HOUR)
ggplot(hours_df_project, aes(x=HOUR, y=n)) +
geom_line(color='black', size=1) +
geom_point(shape=21, size=4, color='red', fill='white') +
labs(x="Hour", y="Incident Count",
title="Crime Incident Report Count by Hour",
caption="Source: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system") +
scale_y_continuous(labels=comma) +
theme_light() +
theme(plot.title = element_text(hjust=0.5)) +
scale_x_continuous(labels=x_axis_labels, breaks=x_axis_labels, minor_breaks = NULL) +
geom_point(data = hi_lo, aes(x=HOUR, y=n), shape=21, size=4, fill='red', color='red') +
geom_label_repel(aes(label = ifelse(n== max(n) | n == min(n), scales::comma(n), "")),
box.padding = 1, point.padding = 1, size=4,
color='Grey50', segment.color = 'darkblue')
This graph is a split line chart that shows how many reports came in on each day of the week, with one line for each year of the data set. The chart has been structured so that the days go in the correct order, with the weekdays coming first followed by the weekend. One interesting observation is the difference in where the lines for 2015 and 2020 are compared to the rest of the lines. While there is overlap for most of the lines, the lines for those two years in particular sit a good couple thousand per day below the other four years. Unfortunately, with data like this, there is no one way to determine what caused the spike in reports over the other four years in the graph, 2016-2019, but one possible change could be reform in police procedures or a shift in political climate. On the other end, the lower number of reports from 2020 can likely be attributed to the coronavirus pandemic, which saw people becoming generally less active during the day as a result.
weekday_df <- project_df %>%
select(DAY_OF_WEEK, YEAR) %>%
group_by(YEAR, DAY_OF_WEEK) %>%
dplyr::summarise(n = length(DAY_OF_WEEK), .groups='keep') %>%
data.frame()
weekday_df$YEAR <- as.factor(weekday_df$YEAR)
weekday_order <- factor(weekday_df$DAY_OF_WEEK,
level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'))
ggplot(weekday_df, aes(x=weekday_order, y=n, group=YEAR)) +
geom_line(aes(color=YEAR), size=3) +
labs(title = "Crime Incident Reports by Day of the Week by Year", x="Day of the Week", y="Report Count", caption="Source: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system") +
theme_light() +
theme(plot.title=element_text(hjust=0.5)) +
scale_color_brewer(palette = "Accent") +
geom_point(shape=21, size=5, color="black", fill="white") +
scale_y_continuous(labels=comma)
This graph takes a look at what percent of crimes fell under each part of the UCR, or Universal Crime Reporting Code. According to the FBI, part one is standard crimes and arrests known by the police, whereas part two crimes are more victim-less crimes such as gambling or public intoxication,and part three has no clear classification. The one part of this data set that is making these graphs appear unusual is that none of the crimes for 2020 have a classification that would allow them be sorted yet, meaning the classifications are likely applied retroactively. It is also interesting noting the gradual decrease in part one crimes, which tend to be more violent. It is intriguing how so many of the crimes fit under part three, which often takes up the largest portion of each pie chart.
ucr_parts <- count(project_df, UCR_PART)
ucr_parts <- ucr_parts[order(-n),]
df_ucr <- project_df %>%
select(UCR_PART, YEAR) %>%
dplyr::mutate(myParts = ifelse(UCR_PART=="Part Three", "Part Three",
ifelse(UCR_PART=="Part Two", "Part Two",
ifelse(UCR_PART=="Part One", "Part One","Other")))) %>%
group_by(YEAR, UCR_PART) %>%
dplyr::summarise(n=length(UCR_PART), .groups='keep') %>%
group_by(YEAR) %>%
dplyr::mutate(percent_of_total = round(100*n/sum(n),1)) %>%
ungroup() %>%
data.frame()
df_ucr$UCR_PART <- factor(df_ucr$UCR_PART, levels=c("Part One", "Part Two", "Part Three"))
ggplot(data = df_ucr, aes(x="", y=n, fill=UCR_PART)) +
geom_bar(stat = "identity", position = "fill") +
coord_polar(theta="y", start=0) +
labs(fill= "Parts", x= NULL, y= NULL, title = "Crime Reports by Year by UCR Part",
caption="Slices under 5% are not labeled") +
theme_light() +
theme(plot.title = element_text(hjust=0.5),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
facet_wrap(~YEAR, ncol = 3, nrow = 2) +
scale_fill_brewer(palette = "Reds") +
geom_text(aes(x=1.7, label = ifelse(percent_of_total > 5,paste0(percent_of_total,"%"),"")),
size=3,
position = position_fill(vjust=0.5))
The last of these graphs is a heatmap, once again showing the concentration of the number of crimes reported on each day of the week for each year of the data set. One interesting observation is the larger numbers being centered on Friday, which has the most crimes of any day of the week reported each year. This type of graph also helps to give a better sense of the concentration of the data, and where it appears to be more dense than other spots, such as in 2015.
weekday_order <- factor(weekday_df$DAY_OF_WEEK,
level=c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'))
mylevels<- c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday')
weekday_df$DAY_OF_WEEK <- factor(weekday_df$DAY_OF_WEEK, levels = mylevels)
breaks <- c(seq(0, max(weekday_df$n), by = 2500))
ggplot(data = weekday_df, aes(x=YEAR, y=DAY_OF_WEEK, fill=n)) +
geom_tile(color="black") +
geom_text(aes(label=comma(n)), size=2) +
coord_equal(ratio=1) +
labs(title = "Heatmap: Citations by Day of the Week",
x= "Year",
y= "Day of the Week",
fill= "Citation Count",
caption="Source: https://data.boston.gov/dataset/crime-incident-reports-august-2015-to-date-source-new-system") +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5)) +
scale_y_discrete(limits = rev(levels(weekday_df$DAY_OF_WEEK))) +
scale_fill_continuous(low="white", high="red", breaks = breaks) +
guides(fill = guide_legend(reverse=TRUE, override.aes=list(colour="black")))