Having grown up in the suburbs outside of Boston on the East Coast, I wanted to analyze the available data to see whether it backs my admittedly biased belief that Boston is a safe city to live in and explore.
The Boston Crime data set has eleven variables and over 560,000 entries, which makes it large and diverse, with plenty of information available for study. Its eleven variables range from the type of criminal offense and the date it was committed to the street and district where it was committed. There are also some identification variables.
The first thing I did after reading the data set into RStudio was look closely at the data using basic functions such as summary, head, and colnames. Once I knew what I was working with, I could clean the data by omitting NA values and build a story around the data set. I looked at the streets to see where certain crimes took place and at the districts to see where certain streets were located. I then built my storyline around police district C6, South Boston, or “Southie,” where my mother grew up as a long-haired brunette in the hunting grounds of the Boston Strangler, just streets over from the Wahlbergs (New Kids on the Block) and Whitey Bulger, and I learned just how unsafe those streets actually were.
Getting started with data cleaning and data frame organization: the file, libraries, and data frames used in my visualizations are shown below.
setwd("C:/Users/bbay7/Documents/DS736/Boston Crime Data")
df <- read.csv('BostonCrimeAllYears.csv')
##libraries to manipulate data frames
library(dplyr)
library(plyr)
##ggplot2 to work with charts and data analysis visualization tools
library(ggplot2)
library(plotly)
## to include commas
library(scales)
## RColorBrewer to make aesthetically pleasing visualizations
library(RColorBrewer)
## Lubridate to work with days and years
library(lubridate)
## to make visualizations nicer than the default
library(ggthemes)
library(data.table)
#Maps
library(leaflet)
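## plyr::count() takes the column name as a string and returns a "freq" column;
## calling it with the namespace avoids confusion with dplyr::count()
district <- data.frame(plyr::count(df, "DISTRICT"))
streetname <- data.frame(plyr::count(df, "STREET"))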
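The inspection and cleaning steps described earlier (summary, head, colnames, then omitting NA values) can be sketched as follows. This is only an illustration on an invented toy data frame, not the real crime file; the names `toy`, `na_per_column`, and `toy_clean` are hypothetical.

```r
# Toy stand-in for the crime data; every value here is invented for illustration.
toy <- data.frame(
  DISTRICT = c("B2", "C6", NA, "A1", "B2"),
  STREET   = c("WASHINGTON ST", "W BROADWAY", "BOYLSTON ST", NA, "DUDLEY ST"),
  YEAR     = c(2016, 2017, 2018, 2019, NA),
  stringsAsFactors = FALSE
)

# First look at the structure and contents.
colnames(toy)
head(toy, 3)
summary(toy)

# Count missing values per column before deciding how to clean.
na_per_column <- colSums(is.na(toy))

# Drop every row that has an NA in any column.
toy_clean <- na.omit(toy)
nrow(toy_clean)
```

One thing to keep in mind: with its default settings, read.csv reads an empty field in a text column as an empty string rather than NA, so na.omit alone will not remove rows whose district is simply blank.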
Boston Criminal Offenses by District (Top 10)
The first graph I created was a very basic bar chart comparing the top 10 police districts in Boston by number of criminal offenses. Here I found that Roxbury (B2) is in fact the least safe district, with the highest number of offenses in the city. Washington Street is a major street in Roxbury and appears in the data quite frequently. Roxbury is very close to Jamaica Plain (JP), where Joey McIntyre (NKOTB) grew up and which is actually one of the safest places to live in Boston. This made me ask: if B2 is the least safe district, what is its actual national rating, given that it sits so close to JP, which is touted as one of the safest places to live in Beantown? According to the Niche website, Roxbury gets a C- for safety relative to national crime data, which I take as support for my belief that Boston is one of the safest cities in the country to live in. The safest district, with the lowest number of offenses according to my data, is A1, Downtown Boston and the Charlestown area. There you will find the Boston Garden sports complex, the Sudbury Street shopping district, and many other safe tourist attractions, including the original home of Paul Revere and other areas that are heavily patrolled and well kept.
###Begin Data Visualizations##
###Boston Criminal Offenses by District (Top 10)##
district$n <- as.numeric(district$freq)
## Rows 2:11 skip the first row of the count table and keep the next ten rows
## in table order; they are not sorted by count
ggplot(district[2:11,], aes(x = DISTRICT, y = n)) +
  geom_bar(colour="black", fill="gray76", stat="Identity") +
  scale_y_continuous(labels=comma)+
  labs(title = "Boston Criminal Offenses by District (Top 10)", x = "District_Id", y = "Number_of_Offenses") +
  theme(plot.title = element_text(hjust =0.5))
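If the goal is a true "top 10 by count", one option is to sort the count table before slicing it instead of relying on row positions. The sketch below uses invented counts in the same shape as the plyr count output (a label column plus `freq`); `district_toy`, `labelled`, `ranked`, and `top_k` are hypothetical names, and k is 3 only to keep the example small.

```r
# Hypothetical counts shaped like the output of plyr::count(); values are invented.
district_toy <- data.frame(
  DISTRICT = c("", "A1", "A15", "B2", "C6", "D4"),
  freq     = c(300, 5000, 900, 8000, 3500, 6000),
  stringsAsFactors = FALSE
)

# Drop the row with no district label, then sort by count, largest first.
labelled <- district_toy[district_toy$DISTRICT != "", ]
ranked   <- labelled[order(-labelled$freq), ]

# Keep the top k rows (k = 3 here; it would be 10 for the chart above).
top_k <- head(ranked, 3)
top_k
```

To draw the bars in the same order, the x aesthetic could be written as reorder(DISTRICT, -freq), since ggplot2 otherwise places character labels alphabetically.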
Pie Chart showing District Crime Percentages
To look at individual districts a little more deeply, I created a pie chart that breaks down each district's share of overall crime, displaying the number of offenses recorded per district as well as its percentage. The total number of offenses is displayed in the center.
## Pie Chart
p <- plot_ly(district, labels = ~DISTRICT, values = ~n) %>%
  add_pie(hole = 0.5) %>%
  layout(title = "Offenses By District") %>%
  layout(annotations = list(text = paste0("Total Offense Count: \n",
                                          scales::comma(sum(district$n))),
                            showarrow = FALSE))
p
htmlwidgets::saveWidget(p, "BostonPie.html")
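The percentages that the pie chart displays can also be computed directly, which is a handy cross-check on the figure. This is a small sketch with invented counts; `counts`, `share`, and `total` are hypothetical names and the numbers are not taken from the crime data.

```r
# Invented offense counts per district, for illustration only.
counts <- c(A1 = 5000, B2 = 8000, C6 = 3000, D4 = 4000)

# Total shown in the center of the donut.
total <- sum(counts)

# Each district's share of the total, as a percentage rounded to one decimal.
share <- round(100 * prop.table(counts), 1)
share
```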
Histogram displaying the number of offenses in Boston by year
This histogram displays the number of offenses in Boston by year. We see that crime was very low in Boston in 2015 and rose sharply between 2016 and 2018. It then fell again starting in 2019, around the onset of the pandemic, though not back to the 2015 level.
##Boston Criminal Offenses by year##
p1 <- ggplot(df, aes(x=YEAR)) +
  geom_histogram(bins = 7, color="darkgreen", fill="lightgreen") +
  labs(title = "Histogram of Offenses by Year", x = "Year", y = "Count of Offenses") +
  scale_y_continuous(labels=comma)+
  ## after_stat(count) replaces the deprecated ..count.. notation (ggplot2 3.3.0 and later)
  stat_bin(binwidth = 1, geom='text', color='black', aes(label=scales::comma(after_stat(count))), vjust=-0.5)
x_axis_labels <- min(df$YEAR):max(df$YEAR)
p1 <- p1 + scale_x_continuous(labels = x_axis_labels, breaks =x_axis_labels)
p1
Line Graph Incidents grouped by Day of Week and Year
We see the same rise and fall of crime around the pandemic in the line graph I created, which displays the number of incidents by day of the week, separated by year. 2015 sits at the bottom of the chart, hovering around the 8,000 mark per day of the week, followed by 2019-2021 in the 10,000s, with 2016-2018 the highest, rising above 14,000. As for the days of the week, Friday always had the most incidents while Sunday had the fewest. Boston still lives by its blue laws, which prevent many businesses from opening on the day of rest.
###Line Plots Offenses by Day of Week###
## Keep only the columns needed for the day-of-week plots
## (an earlier count of SHOOTING was overwritten by this assignment, so it is dropped)
Dates_of_shooting <- df[,c("MONTH", "DAY_OF_WEEK", "YEAR", "INCIDENT_NUMBER", "DISTRICT")]
Dates_of_shooting$YEAR <- as.factor(Dates_of_shooting$YEAR)
df1 <- Dates_of_shooting %>%
group_by(YEAR, DAY_OF_WEEK)%>%
dplyr::summarise(n = length(DAY_OF_WEEK), .groups = 'keep') %>%
data.frame()
## Order the weekdays Monday to Sunday; the levels must match the full day names used in the data
day_order <- factor(df1$DAY_OF_WEEK, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
ggplot(df1, aes(x = day_order, y = n, group = YEAR)) +
  geom_line(aes(color = YEAR), size=1) +
  geom_point(shape=21, size=4, color='red', fill='white') +
  labs(title = "Incidents Occurred: Grouped by Day of the Week and Year", x = "Days of the Week", y = "Number of Incidents Occurred") +
  theme(plot.title = element_text(hjust = 0.5))+
  scale_y_continuous(labels = comma)
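The weekday ordering used for the x axis relies on how R factors work: character values sort alphabetically, while a factor sorts by the order of its levels. A minimal base R sketch of that idea, with made-up input (`days`, `week_levels`, and the other names are hypothetical):

```r
# Made-up day labels, in the full-name spelling.
days <- c("Sunday", "Friday", "Monday", "Friday")

week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")

# Plain character vectors sort alphabetically.
as_is <- sort(unique(days))

# A factor with explicit levels sorts in level order instead.
days_f <- factor(days, levels = week_levels)
in_week_order <- as.character(sort(unique(days_f)))

# Any label that is not listed in the levels silently becomes NA.
mismatch <- factor("Monday", levels = c("Mon", "Tue"))
is.na(mismatch)
```

This is why the spelling of the levels has to match the data exactly: abbreviated levels paired with full day names would turn every value into NA.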
Heatmap Occurrences by Day of Week and District
I then created a heatmap to lay out the occurrences by day of the week, this time separated by district. It shows the information above in a visually clearer way. I could see that A15 in fact had the lowest crime count, at only about 1,800 offenses on any given weekday, with a decrease on Saturday and Sunday. In fact, all districts had less crime on Sunday. B2 still had the most crime, averaging about 13,000 occurrences on any given weekday. C6, my district of interest, was on the low end at about 5,000.
#### Heat Map
district_heat = df %>%
select(DISTRICT, OCCURRED_ON_DATE, INCIDENT_NUMBER, YEAR) %>%
mutate(YEAR = year(ymd_hms(OCCURRED_ON_DATE))) %>%
group_by(DISTRICT) %>%
dplyr::summarise(n = length(INCIDENT_NUMBER), .groups = 'keep') %>%
data.frame()
heat1 <- Dates_of_shooting %>%
filter(DISTRICT %in% c("A1", "A7", "A15", "B2", "B3", "B11","C6", "C11", "D4", "D14", "E13")) %>%
group_by(DAY_OF_WEEK, DISTRICT)%>%
dplyr::summarise(n = length(DAY_OF_WEEK), .groups = 'keep') %>%
data.frame()
mylevels <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
heat1$DAY_OF_WEEK <- factor(heat1$DAY_OF_WEEK, levels = mylevels)
ggplot(heat1, aes(x = DAY_OF_WEEK, y = DISTRICT, fill=n))+
geom_tile(color="blue") +
geom_text(aes(label=comma(n))) +
coord_equal(ratio = .5) +
labs(title="Heatmap: Criminal Offenses by Days of the Week per District",
x = "Day of the Week",
y = "District",
fill = "Count of Offenses") +
theme_minimal()+
theme(plot.title = element_text(hjust=0.5))
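The grid of counts behind a heatmap like this one is a two-way frequency table. As a rough sketch in base R, on a handful of invented incidents (`incidents`, `grid`, and `long` are hypothetical names), the same district-by-day counts can be built with table and then reshaped into the long form that geom_tile expects:

```r
# Six invented incidents, for illustration only.
incidents <- data.frame(
  DISTRICT    = c("B2", "B2", "C6", "A15", "B2", "C6"),
  DAY_OF_WEEK = c("Monday", "Sunday", "Monday", "Friday", "Monday", "Sunday"),
  stringsAsFactors = FALSE
)

# Fix the day order so columns do not come out alphabetically.
incidents$DAY_OF_WEEK <- factor(incidents$DAY_OF_WEEK,
                                levels = c("Monday", "Friday", "Sunday"))

# Two-way table: one row per district, one column per day.
grid <- table(incidents$DISTRICT, incidents$DAY_OF_WEEK)

# Long format: one row per district/day pair, including pairs with zero incidents.
long <- as.data.frame(grid, stringsAsFactors = FALSE)
names(long) <- c("DISTRICT", "DAY_OF_WEEK", "n")
long
```

One difference from a group_by and summarise pipeline is that the table keeps combinations with zero incidents, so every tile of the grid has a value.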
Maps showing safest areas in and around Boston
I then went to work on maps that would familiarize a Boston visitor with the safest neighborhoods, should they want to look into relocating to the area. On the first map, I used flags to mark the safest neighborhoods with their names and the number of occurrences found in the data. My second map shows colored dots representing attractions, colleges, and transportation, with the number of local incidents reported shown when you click a dot.
Allston <- c(42.3529, -71.1321, "#1 Safest Neighborhood in Boston", 1553)
Hyde_Park <- c(42.2557, -71.1256, "#2 SNB Hyde Park", 1759)
Newton <- c(42.3122, -71.2206, "#3 SNB Newton", 560)
West_Roxbury <- c(42.2782, -71.1600, "#4 SNB West_Roxbury", 870)
South_Boston <- c(42.3381437, -71.0475773, "Most Searched Neighborhood, South_Boston", 1917)
Airport <- c(42.3656, -71.0096, "Transportation", 693)
Fenway_Park <- c(42.3467, -71.0972, "Attraction", 1794)
## Longitude is negative (west of Greenwich); note this point is not included in gps_df below
Boston_Common <- c(42.3551, -71.0657, "Attraction", 655)
Boston_College <- c(42.3355, -71.1685, "College", 987)
Harvard <- c(42.3770, -71.1167, "College", 213)
gps_df <- data.frame(rbind(Allston, Hyde_Park, Newton, West_Roxbury, South_Boston, Airport, Fenway_Park, Boston_College, Harvard))
colnames(gps_df) <- c("Lat1", "Long1", "Location_Name", "n")
gps_df$Lat1 <- as.numeric(gps_df$Lat1)
gps_df$Long1 <- as.numeric(gps_df$Long1)
gps_df$n <- as.numeric(gps_df$n)
icon.glyphicon <- makeAwesomeIcon(icon = 'flag', markerColor = 'red', iconColor = 'blue')
m <-leaflet() %>%
addTiles() %>%
addAwesomeMarkers(lng = gps_df$Long1, lat = gps_df$Lat1,
icon = icon.glyphicon,
popup = paste(row.names(gps_df), gps_df$n),
label = row.names(gps_df))
htmlwidgets::saveWidget(m, "BostonMap1.html")
m
m2 <- leaflet() %>%
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%
  setView(lng = -71.0475773, lat = 42.3381437, zoom = 12) %>%
  ## Colleges: red circles, radius scaled by the square root of the incident count
  addCircles(
    lng = subset(gps_df, Location_Name == 'College')$Long1,
    lat = subset(gps_df, Location_Name == 'College')$Lat1,
    opacity = 1,
    color = "red",
    popup = paste(row.names(subset(gps_df, Location_Name == 'College')), subset(gps_df, Location_Name == 'College')$n),
    radius = sqrt(subset(gps_df, Location_Name == 'College')$n)
  ) %>%
  ## Attractions: blue circles with a fixed radius
  addCircles(
    lng = subset(gps_df, Location_Name == 'Attraction')$Long1,
    lat = subset(gps_df, Location_Name == 'Attraction')$Lat1,
    opacity = 1,
    color = "blue",
    popup = paste(row.names(subset(gps_df, Location_Name == 'Attraction')), subset(gps_df, Location_Name == 'Attraction')$n),
    radius = 80
  ) %>%
  ## Transportation: green circles with a fixed radius
  addCircles(
    lng = subset(gps_df, Location_Name == 'Transportation')$Long1,
    lat = subset(gps_df, Location_Name == 'Transportation')$Lat1,
    opacity = 1,
    color = "green",
    popup = paste(row.names(subset(gps_df, Location_Name == 'Transportation')), subset(gps_df, Location_Name == 'Transportation')$n),
    radius = 80
  )
htmlwidgets::saveWidget(m2, "BostonMap2.html")
m2
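A side note on how the location table is built: combining numbers and text in a single c() call turns everything into character strings, which is why the coordinates have to be converted back with as.numeric. An alternative sketch, using two of the points from above, builds the data frame column by column so each column keeps its type, and adds a rough bounding-box check on the coordinates. The names `places`, `Category`, `in_boston_box`, and `mixed` are hypothetical, and the box limits are a loose assumption about the Boston area.

```r
# Two of the locations from the map code, entered column by column.
places <- data.frame(
  Name     = c("Fenway_Park", "Harvard"),
  Lat1     = c(42.3467, 42.3770),
  Long1    = c(-71.0972, -71.1167),
  Category = c("Attraction", "College"),
  n        = c(1794, 213),
  stringsAsFactors = FALSE
)

# Rough sanity check (assumed bounds): latitude 42 to 43, longitude -72 to -70.
in_boston_box <- places$Lat1 > 42 & places$Lat1 < 43 &
  places$Long1 > -72 & places$Long1 < -70
stopifnot(all(in_boston_box))

# For comparison: mixing numbers and text in c() coerces every element to character.
mixed <- c(42.3467, -71.0972, "Attraction", 1794)
class(mixed)
```

A check like this would flag a longitude typed without its minus sign before the point ends up on the wrong side of the world.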
With its low crime rate compared to my current city, I know Boston would be my number one choice for a home if it weren’t for the high cost of living, especially district A15, which covers Charlestown. I have also learned that Dorchester, the Back Bay, and Roxbury are areas to stay away from. Now that this is on RPubs, maybe someone else will see it and be better informed.