The graphs used to create data visualizations derive from a dataset entitled, “NYPD Arrest Data (Historic)”. This dataset overviews fourteen years’ worth of arrests that were made throughout the five boroughs (Bronx, Brooklyn, Manhattan, Staten Island and Queens) of New York City from 2006 - 2019. These arrests were made and recorded by the New York Police Department (NYPD). The dataset is composed of over 4.5 million rows and 19 columns. The columns used to create data frames and graphs range from, perpetrator arrest date, police description (of crime), perpetrator gender, perpetrator race, etc. to provide insight of this large dataset with data visualizations.
setwd("~/R_datafiles")
library(data.table)
filename <- "NYC_Arrests.csv"
df <- fread(filename, na.strings = c(NA, ""))
library(ggplot2)
library(lubridate)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(dplyr)
library(plyr)
#Bar Chart Setup
Perp_Race <- data.frame(dplyr::count(df, PERP_RACE))
Perp_Race <- Perp_Race[order(Perp_Race$n, decreasing = TRUE), ]
#Histogram Setup
df$year <-year(mdy(df$ARREST_DATE))
#Stacked Bar Chart Setup
df_reasons <- dplyr::count(df, PD_DESC)
df_reasons <- df_reasons [order(df_reasons$n, decreasing = TRUE),]
top_reasons_for_arrest <- df_reasons$PD_DESC[1:10]
new_df <- df %>%
filter(PD_DESC %in% top_reasons_for_arrest) %>%
select(ARREST_DATE, PD_DESC) %>%
mutate(year = year(mdy(ARREST_DATE))) %>%
group_by(PD_DESC, year) %>%
dplyr::summarise(n = length(PD_DESC), .groups = 'keep') %>%
data.frame()
other_df <- df %>%
filter(!PD_DESC %in% top_reasons_for_arrest) %>%
select(ARREST_DATE) %>%
mutate(year = year(mdy(ARREST_DATE)), PD_DESC = "Other") %>%
group_by(PD_DESC, year) %>%
dplyr::summarise(n = length(PD_DESC), .groups = "keep") %>%
data.frame()
new_df <- rbind(new_df, other_df)
agg_tot <- new_df %>%
select(PD_DESC, n) %>%
group_by(PD_DESC) %>%
dplyr::summarise(tot =sum(n), .groups = "keep") %>%
data.frame
#Multibar Chart Setup
months_df<- df %>%
select(ARREST_DATE) %>%
mutate(months = months(mdy(ARREST_DATE), abbreviate = TRUE),
year = year(mdy(ARREST_DATE))) %>%
group_by(year, months) %>%
dplyr::summarise(n = length(ARREST_DATE), .groups = 'keep') %>%
data.frame()
months_df$year <- factor(months_df$year)
mymonths <- c('Jan', 'Feb', 'Mar', 'Apr', 'May','Jun', 'Jul','Aug', 'Sep', 'Oct', 'Nov', 'Dec')
month_order <- factor(months_df$months, level = mymonths)
x = min(as.numeric(levels(months_df$year)))
y = max(as.numeric(levels(months_df$year)))
months_df$year <- factor(months_df$year, levels = seq(y, x, by = -1))
#Pie Chart Setup
length(unique(df$PERP_SEX))
## [1] 2
perp_sex <- dplyr::count(df, PERP_SEX)
perp_sex[perp_sex$PERP_SEX %in% c("Male", "Female")]
## Empty data.table (0 rows and 2 cols): PERP_SEX,n
perp_sex_df <- df %>%
select(PERP_SEX, ARREST_DATE) %>%
mutate(year = year (mdy(ARREST_DATE)),
myPERP_SEX = ifelse(PERP_SEX== "M", "M", ifelse(PERP_SEX == "F", "F", "Other"))) %>%
group_by(year, myPERP_SEX) %>%
dplyr::summarise(n=length(myPERP_SEX), .groups = 'keep') %>%
group_by(year) %>%
dplyr::mutate(percent_of_total = round(100*n/sum(n), 1)) %>%
ungroup() %>%
data.frame()
#Line Chart Setup
days_df <- df%>%
select(ARREST_DATE) %>%
mutate(year = year(mdy(ARREST_DATE)),
dayoftheweek = weekdays(mdy(ARREST_DATE), abbreviate = TRUE)) %>%
group_by(year, dayoftheweek) %>%
dplyr::summarise(n = length(ARREST_DATE), .groups = 'keep') %>%
data.frame()
days_df$year <- as.factor(days_df$year)
day_order <- factor(days_df$dayoftheweek, level=c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))
#Heatmap Setup
mylevels <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')
days_df$dayoftheweek <- factor(days_df$dayoftheweek, levels = mylevels)
breaks_heatmap <- c(seq(0, max(days_df$n), by = 10000))
The data visualizations that were generated for this project shed light on a variety of fields correlated with NYPD arrests from 2006 - 2019. From displaying graphs breaking down total number of arrests by years, months and weekdays, top reasons for arrest, breakdown of arrests by race, and comparing the percentages of total arrests for male verses female perpetrators (“perps”). Individuals viewing this data visualization set should develop a more well-rounded perspective of arrests made by the NYPD.
The following histogram highlights the total number of arrests by year. From 2006 - 2009, arrest rates in New York City continued to increase slightly each year. 2010 has the highest number of arrests with a total of 422,322 arrests. After this peak, arrest rates continue to decrease over time, reaching an all time low in 2019 with only 214,617 arrests. Thus, New York City saw over a 50% decrease in overall arrests from 2010 - 2019. Perhaps this means that the New York Police Department (NYPD) grew and had more individuals patrolling the streets.
histogram <- ggplot(df, aes(x=year)) +
geom_histogram(bins = 14, color = "darkblue", fill = "lightblue") +
labs(title = "Histogram of Arrests by Year", x = "Year", y = "Count of Arrests", size = 10) +
scale_y_continuous(labels=comma) +
theme(plot.title = element_text(hjust = 0.5)) +
stat_bin(binwidth=1, geom= 'text', color= "black", aes(label=scales::comma(..count..)), vjust= -0.7, size = 4)
x_axis_labels <- min(df$year) : max(df$year)
histogram <- histogram + scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels)
histogram
This multiple bar chart expands upon the previous graph by breaking down the total number of arrests by month and year. With this, viewers are able to observe months in which arrests were higher and lower in a particular year. From 2006 - 2019, there appears to be a lower number of arrests in the month of February. Similar to the month of February, December is another month in which the total number of arrests is lower than other months in a pertained year (throughout all fourteen years). Throughout the months of March and May, more arrests are made compared to others. Therefore, arrests are less common in winter months as opposed to those in the spring time based on the visualization.
multibarchart <- ggplot(months_df, aes( x = month_order, y = n, fill = year))+
geom_bar(stat = "identity", position = "dodge") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = comma) +
labs(title = "Multiple Bar Charts - Total Arrests by Month & Year",
x = "Months of the Year",
y = "Citation Count",
fill = "Year") +
facet_wrap(~year , ncol = 2, nrow = 7)
multibarchart
The third graph that was generated is a stacked bar chart that breaks down the total number of arrests by reason for arrest and year. The number one reason for arrest in New York City is “other” and following is possession of marijuana. Possession of marijuana makes up 423,299 arrests, which is about 10% of overall arrests throughout the fourteen year period. The third most common reason for arrest is, “Assault 3” or Assault in the Third Degree. Each year is more-less correlated with one another in terms of reason for arrest.
new_df$year <- as.factor(new_df$year)
library(plyr)
max_y <-round_any(max(agg_tot$tot), 850000, ceiling)
stackedbarchart <- ggplot(new_df, aes(x = reorder(PD_DESC, n, sum), y = n, fill = year)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Arrests by Reason for Arrest", x = "", y = "Arrest Count", fill = "Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(data = agg_tot, aes(x = PD_DESC, y = tot, label = tot, fill = NULL), hjust=-0.3, size = 3) +
scale_y_continuous(labels = comma, limits=c(0, max_y))
stackedbarchart
Previous graphs determined that the year of 2010 had the highest number of arrests in New York while 2019 has the lowest. This line chart reaffirms this fact and presents a breakdown of total arrests by day of the week and year. Wednesdays and Thursdays have the highest number of arrests throughout the week whereas, Sundays and Mondays appear to have the lowest. Each year is consistent with Wednesdays and Thursdays having higher rates of arrests and Sundays and Mondays having lower.
linechart <- ggplot(days_df, aes(x = day_order, y = n, group = year)) +
geom_line(aes(color = year), size = 3) +
labs(title = "Arrests by Day of the Week & Year", x = "Days of the Week", y = "Arrests") +
theme_light()+
theme(plot.title = element_text(hjust = 0.5)) +
geom_point(shape=21, size = 5, color = "black", fill = "white") +
scale_y_continuous(labels = comma)
linechart
This heatmap observes a breakdown of total number of arrests by day of the week and year. In areas where the color yellow is more pigmented, the higher the number of total arrests during the week are. For instance, in years 2009 and 2010, Wednesdays and Thursdays have high numbers of arrests, which makes the yellow color of the heatmap more pigmented than in other areas. In 2018 and 2019 on Sundays and Mondays, less arrests were made; thus, the yellow squares appear to be a very light colored yellow or white.
heatmap <- ggplot(days_df, aes(x = year, y = dayoftheweek, fill = n)) +
geom_tile(color="black") +
geom_text(aes(label=comma(n)), size=3) +
coord_equal(ratio = 1) +
labs(title = "Heatmap for Arrests by Day of the Week & Year",
x = "Year",
y = "Days of the Week",
fill = "Number of Arrests") +
theme_minimal()+
theme(plot.title=element_text(hjust=0.5)) +
scale_y_discrete(limits = rev(levels(days_df$dayoftheweek))) +
scale_fill_continuous(low = "white", high = "yellow", breaks = breaks_heatmap) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
heatmap
To shed light on perpetrator (“perps”) demographics, a bar chart was made to display a breakdown of individual arrests by race. Black individuals make up two thirds of all arrests from 2006 - 2019 with almost 2.5 million arrests. White Hispanics follow with over 1 million arrests. Combined, these two racial groups compose about two thirds of overall arrests in New York City throughout a fourteen-year period.
# Bar Chart
barchart <- ggplot(Perp_Race[1:8,], aes(x= reorder(PERP_RACE, -n), y=n)) +
geom_bar(colour= "black", fill= "yellow", stat = "identity") +
labs(title = "Number of Perps by Race", x ="Perp Race", y = "Total Number of Arrests by Race", size = 8) +
scale_y_continuous(label = comma) +
theme(plot.title = element_text(hjust = 0.5))
barchart
Finally, this pie chart presents arrests by gender and year. Throughout a fourteen-year period, males account for over 80% of all arrests while females make up <20%. Therefore, males get arrested more often than females in New York City.
perp_sex_df$myPERP_SEX = factor (perp_sex_df$myPERP_SEX, levels = c("M", "F"))
piechart <- ggplot(data = perp_sex_df, aes(x = "", y = n, fill = myPERP_SEX)) +
geom_bar(stat = "identity", position = "fill") +
coord_polar(theta = "y", start = 0 ) +
labs(fill = "Perp Gender", x = NULL, y = NULL, title = "Arrests by Gender and Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5),
axis.text = element_blank(),
axis.ticks = element_blank (),
panel.grid = element_blank()) +
facet_wrap(~year, ncol= 4, nrow = 4) +
scale_fill_brewer(palette = "Set3") +
geom_text(aes(x = 1.7, label = paste0(percent_of_total, "%")),
size = 4,
position = position_fill(vjust = 0.5))
piechart
In conclusion, this report analyzes NYPD arrests made in New York City from 2006 - 2019. From 2006 – 2010, there was a steady increase of arrest activity in New York, reaching its peak in 2010. Following 2010, arrests continuously decreased and by 2019, total arrests were down by over 50%. The winter months of December and February suggest lower rates of arrest rates while the spring months of March and May present higher rates throughout each year. Aside from “other”, possession of marijuana is the top reason for arrest and weekdays (Wednesday and Thursday) incur higher crime activity contrasting lower activity in the early days of the week (Sunday and Monday). Finally, black individuals account for about half of all arrests made and males tend to get arrests more often than females.