library(tidyverse)
library(knitr)
setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets")
hatecrimes <- read_csv("NYPD_Hate_Crimes_19-26.csv")
NY Hate Crimes 2019-2026
About this dataset
This dataset records hate crimes reported by the NYPD in the five New York City counties, by type of hate crime, from 2019 to 2026: https://data.cityofnewyork.us/Public-Safety/NYPD-Hate-Crimes/bqiq-cu78/about_data
My caveat:
Flawed hate crime data collection - we should know how the data was collected
(Nathan Yau of Flowing Data, Dec 5, 2017)
Data can provide you with important information, but when the collection process is flawed, there’s not much you can do. Ken Schwencke, reporting for ProPublica, researched the tiered system that the FBI relies on to gather hate crime data for the United States:
“Under a federal law passed in 1990, the FBI is required to track and tabulate crimes in which there was ‘manifest evidence of prejudice’ against a host of protected groups, regardless of differences in how state laws define who’s protected. The FBI, in turn, relies on local law enforcement agencies to collect and submit this data, but can’t compel them to do so.”
This is a link to the ProPublica Article: https://www.propublica.org/article/why-america-fails-at-gathering-hate-crime-statistics
Here is a data visualization of where hate crimes do NOT get reported around the country (Ken Schwencke, 2017): https://projects.propublica.org/graphics/hatecrime-map
Now that we know there is possible bias in the dataset, what can we do with it?
Clean up the data:
Make all headers lowercase and remove spaces
After cleaning up the variable names, look at the structure of the data. Since there are 14 variables in this dataset, you can use “summary” to decide which hate crimes to focus on. In the output of “summary”, look at the min/max values. Some have a max-value of 1.
names(hatecrimes) <- tolower(names(hatecrimes))
names(hatecrimes) <- gsub(" ","",names(hatecrimes))
head(hatecrimes)
# A tibble: 6 × 14
fullcomplaintid complaintyearnumber monthnumber recordcreatedate
<dbl> <dbl> <dbl> <chr>
1 2.02e14 2019 1 1/23/2019
2 2.02e14 2019 2 2/25/2019
3 2.02e14 2019 2 2/27/2019
4 2.02e14 2019 4 4/16/2019
5 2.02e14 2019 6 6/20/2019
6 2.02e14 2019 7 7/31/2019
# ℹ 10 more variables: complaintprecinctcode <dbl>, patrolboroughname <chr>,
# county <chr>, lawcodecategorydescription <chr>, offensedescription <chr>,
# pdcodedescription <chr>, biasmotivedescription <chr>,
# offensecategory <chr>, arrestdate <lgl>, arrestid <chr>
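The two renaming steps can be tried on a small example before touching the real data. This is only a sketch: the data frame below is made up, and `clean_header` is a hypothetical helper name that does not appear in the tutorial.

```r
# Hypothetical helper: lowercase a header and remove every space.
clean_header <- function(x) {
  gsub(" ", "", tolower(x), fixed = TRUE)
}

# Made-up data frame; check.names = FALSE keeps the spaces in the names.
toy <- data.frame(
  "Complaint Year Number" = c(2019, 2020),
  "Bias Motive Description" = c("ANTI-JEWISH", "ANTI-ASIAN"),
  check.names = FALSE
)

names(toy) <- clean_header(names(toy))
names(toy)
```

Wrapping both steps in one function makes it harder to apply one of them and forget the other when the same cleanup is needed on a second dataset.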
Explore the bias motive (biasmotivedescription)
bias_count <- hatecrimes |>
select(biasmotivedescription) |>
group_by(biasmotivedescription) |>
count() |>
arrange(desc(n))
head(bias_count)
# A tibble: 6 × 2
# Groups: biasmotivedescription [6]
biasmotivedescription n
<chr> <int>
1 ANTI-JEWISH 1906
2 ANTI-MALE HOMOSEXUAL (GAY) 489
3 ANTI-ASIAN 401
4 ANTI-BLACK 315
5 ANTI-OTHER ETHNICITY 168
6 ANTI-MUSLIM 156
We can see that the highest counts come from hate crimes against Jewish people, gay males, Asian people and Black people.
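A shorter route to the same kind of table is `count()` with `sort = TRUE`, which counts and orders in one step and returns an ungrouped tibble. The sketch below uses a few made-up rows rather than the real file, and `toy_count` is just an illustrative name.

```r
library(dplyr)

# Made-up rows standing in for the biasmotivedescription column.
toy <- tibble(
  biasmotivedescription = c("ANTI-JEWISH", "ANTI-ASIAN", "ANTI-JEWISH",
                            "ANTI-BLACK", "ANTI-JEWISH", "ANTI-ASIAN")
)

# count() groups, tallies, and ungroups; sort = TRUE orders by n, largest first.
toy_count <- toy |>
  count(biasmotivedescription, sort = TRUE)

toy_count
```

Because the result is not grouped, later steps such as `head(10)` operate on the whole table without any grouping surprises.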
Visualize these counts as a bar graph
ggplot(hatecrimes, aes(x = biasmotivedescription))+
geom_bar()
Use inclusion/exclusion criteria to filter
As we saw in the table, there are 29 different levels, and some have counts of only one or two. Filter for the top 10. This time, use the bias_count subset with geom_col().
bias_count |>
head(10) |>
ggplot(aes(x=biasmotivedescription, y = n)) +
geom_col()
Arrange the bars according to height and rotate
Use “reorder” and “coord_flip”
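To see what “reorder” does before it is used inside a plot, it helps to run it on a small vector. The sketch below takes three motives and their counts from the table above; the variable names are illustrative only.

```r
# Three motives and their counts, copied from the bias_count table.
motive <- c("ANTI-ASIAN", "ANTI-BLACK", "ANTI-JEWISH")
n <- c(401, 315, 1906)

# reorder() returns a factor whose levels are sorted by the second argument,
# smallest first. ggplot2 draws bars in level order.
ordered_motive <- reorder(motive, n)
levels(ordered_motive)
```

The levels run from the smallest count to the largest, and after `coord_flip()` the first level sits at the bottom, so the longest bar ends up at the top of the chart.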
bias_count |>
head(10) |>
ggplot(aes(x=reorder(biasmotivedescription, n), y = n)) +
geom_col() +
coord_flip()
Add title, caption for the data source, and x-axis label
bias_count |>
head(10) |>
ggplot(aes(x=reorder(biasmotivedescription, n), y = n)) +
geom_col() +
coord_flip()+
labs(x = "",
y = "Counts of hatecrime types based on motive",
title = "Bar Graph of Hate Crimes from 2019-2026",
subtitle = "Counts based on the hatecrime motive",
caption = "Source: NY State Division of Criminal Justice Services")
Finally add color and change the theme
bias_count |>
head(10) |>
ggplot(aes(x=reorder(biasmotivedescription, n), y = n)) +
geom_col(fill = "salmon") +
coord_flip()+
labs(x = "",
y = "Counts of hatecrime types based on motive",
title = "Bar Graph of Hate Crimes from 2019-2026",
subtitle = "Counts based on the hatecrime motive",
caption = "Source: NY State Division of Criminal Justice Services") +
theme_minimal()
Add annotations for counts and remove the x-axis values
bias_count |>
head(10) |>
ggplot(aes(x=reorder(biasmotivedescription, n), y = n)) +
geom_col(fill = "salmon") +
coord_flip()+
labs(x = "",
y = "Counts of hatecrime types based on motive",
title = "Bar Graph of Hate Crimes from 2019-2026",
subtitle = "Counts based on the hatecrime motive",
caption = "Source: NY State Division of Criminal Justice Services") +
theme_minimal()+
geom_text(aes(label = n), hjust = -.05, size = 3) +
theme(axis.text.x = element_blank())
Look deeper into crimes against Jewish, Asian, Black people, and gay males
Spelling makes a difference, so be careful!
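One way to catch a misspelled category before filtering is to compare the values you plan to filter on with the values that actually occur in the column. This is a minimal sketch with hand-typed vectors; in practice `available` would come from something like `unique(hatecrimes$biasmotivedescription)`.

```r
# Values assumed to be present in the column (typed by hand for this sketch).
available <- c("ANTI-JEWISH", "ANTI-MALE HOMOSEXUAL (GAY)",
               "ANTI-ASIAN", "ANTI-BLACK", "ANTI-MUSLIM")

# Values we intend to filter on; the second one is deliberately misspelled.
wanted <- c("ANTI-JEWISH", "ANTI-ASAIN")

# setdiff() returns the wanted values that do not occur in the data.
not_found <- setdiff(wanted, available)
not_found
```

A misspelled value inside `%in%` does not raise an error. It simply matches nothing, so a check like this is worth the extra line.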
First check the year totals
hate_year <- hatecrimes |>
filter(biasmotivedescription %in% c("ANTI-JEWISH", "ANTI-MALE HOMOSEXUAL (GAY)", "ANTI-ASIAN", "ANTI-BLACK"))|>
group_by(complaintyearnumber) |>
count(biasmotivedescription)|>
arrange(desc(n))
hate_year
# A tibble: 28 × 3
# Groups: complaintyearnumber [7]
complaintyearnumber biasmotivedescription n
<dbl> <chr> <int>
1 2024 ANTI-JEWISH 371
2 2023 ANTI-JEWISH 343
3 2025 ANTI-JEWISH 320
4 2022 ANTI-JEWISH 279
5 2019 ANTI-JEWISH 252
6 2021 ANTI-JEWISH 215
7 2021 ANTI-ASIAN 150
8 2020 ANTI-JEWISH 126
9 2023 ANTI-MALE HOMOSEXUAL (GAY) 116
10 2022 ANTI-ASIAN 91
# ℹ 18 more rows
Then check the county totals
hate_county <- hatecrimes |>
filter(biasmotivedescription %in% c("ANTI-JEWISH", "ANTI-MALE HOMOSEXUAL (GAY)", "ANTI-ASIAN", "ANTI-BLACK"))|>
group_by(county) |>
count(biasmotivedescription)|>
arrange(desc(n))
hate_county
# A tibble: 20 × 3
# Groups: county [5]
county biasmotivedescription n
<chr> <chr> <int>
1 KINGS ANTI-JEWISH 798
2 NEW YORK ANTI-JEWISH 651
3 QUEENS ANTI-JEWISH 289
4 NEW YORK ANTI-MALE HOMOSEXUAL (GAY) 237
5 NEW YORK ANTI-ASIAN 228
6 KINGS ANTI-MALE HOMOSEXUAL (GAY) 120
7 KINGS ANTI-BLACK 99
8 BRONX ANTI-JEWISH 92
9 QUEENS ANTI-MALE HOMOSEXUAL (GAY) 91
10 KINGS ANTI-ASIAN 80
11 NEW YORK ANTI-BLACK 79
12 QUEENS ANTI-ASIAN 78
13 RICHMOND ANTI-JEWISH 76
14 QUEENS ANTI-BLACK 75
15 BRONX ANTI-MALE HOMOSEXUAL (GAY) 35
16 RICHMOND ANTI-BLACK 35
17 BRONX ANTI-BLACK 27
18 BRONX ANTI-ASIAN 10
19 RICHMOND ANTI-MALE HOMOSEXUAL (GAY) 6
20 RICHMOND ANTI-ASIAN 5
Check information combining totals from counties and years
hate2 <- hatecrimes |>
filter(biasmotivedescription %in% c("ANTI-JEWISH", "ANTI-MALE HOMOSEXUAL (GAY)", "ANTI-ASIAN", "ANTI-BLACK"))|>
group_by(complaintyearnumber, county) |>
count(biasmotivedescription)|>
arrange(desc(n))
hate2
# A tibble: 127 × 4
# Groups: complaintyearnumber, county [35]
complaintyearnumber county biasmotivedescription n
<dbl> <chr> <chr> <int>
1 2024 KINGS ANTI-JEWISH 152
2 2024 NEW YORK ANTI-JEWISH 136
3 2025 KINGS ANTI-JEWISH 136
4 2019 KINGS ANTI-JEWISH 128
5 2023 KINGS ANTI-JEWISH 126
6 2022 KINGS ANTI-JEWISH 125
7 2023 NEW YORK ANTI-JEWISH 124
8 2025 NEW YORK ANTI-JEWISH 110
9 2022 NEW YORK ANTI-JEWISH 104
10 2021 NEW YORK ANTI-ASIAN 84
# ℹ 117 more rows
Plot these four types of hate crimes together
Use the following arguments to finalize your barplot:
- position = “dodge” makes side-by-side bars, rather than stacked bars
- stat = “identity” uses the counts already stored in n as the bar heights for each year between 2019 and 2026
- labs(title = ...) gives the plot a title
- labs(fill = ...) gives a title to the legend
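The options listed above can be tried on a tiny made-up table first. The sketch below also shows that, as far as I can tell, `geom_col()` is the shorthand for `geom_bar(stat = "identity")`. The data and object names are illustrative and are not part of the tutorial.

```r
library(ggplot2)

# Made-up counts: one row per year and motive.
toy <- data.frame(
  year = c(2019, 2019, 2020, 2020),
  motive = c("A", "B", "A", "B"),
  n = c(5, 3, 2, 7)
)

# stat = "identity" uses n as the bar height instead of counting rows.
p_bar <- ggplot(toy, aes(x = year, y = n, fill = motive)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(fill = "Motive", title = "Toy example")

# geom_col() draws the same bars with less typing.
p_col <- ggplot(toy, aes(x = year, y = n, fill = motive)) +
  geom_col(position = "dodge")

# layer_data() returns the computed bar geometry without drawing anything.
bars_bar <- layer_data(p_bar)
bars_col <- layer_data(p_col)
```

Dodged bars work best with exactly one row per combination of x value and fill value. When several rows share the same combination, the bars are drawn on top of each other rather than added up.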
# hate_year has one row per year and motive, so each dodged bar is a citywide total
ggplot(data = hate_year) +
  geom_bar(aes(x = complaintyearnumber, y = n, fill = biasmotivedescription),
           position = "dodge", stat = "identity") +
  labs(fill = "Hate Crime Type",
       y = "Number of Hate Crime Incidents",
       title = "Hate Crime Type in NY Counties Between 2019-2026",
       caption = "Source: NY State Division of Criminal Justice Services")
We can see that hate crimes against Jewish people rose every year from 2020 to 2024, while hate crimes against gay males and Asian people decreased during that same time frame.
What about the counties?
I have not dealt with the counties, but I think that is the next place to explore. I can make bar graphs by county instead of by year.
# hate_county has one row per county and motive, so each dodged bar is a county total
ggplot(data = hate_county) +
  geom_bar(aes(x = county, y = n, fill = biasmotivedescription),
           position = "dodge", stat = "identity") +
  labs(fill = "Hate Crime Type",
       y = "Number of Hate Crime Incidents",
       title = "Hate Crime Type in NY Counties Between 2019-2026",
       caption = "Source: NY State Division of Criminal Justice Services")
The highest counts
We can see that the highest counts of hate crimes against Jewish, Asian, and Black people took place in Kings County (Brooklyn) and New York County (Manhattan).
Put it all together with years and counties using “facet”
ggplot(data = hate2) +
geom_bar(aes(x=complaintyearnumber, y=n, fill = biasmotivedescription),
position = "dodge", stat = "identity") +
facet_wrap(~county) +
labs(fill = "Hate Crime Type",
y = "Number of Hate Crime Incidents",
title = "Hate Crime Type in NY Counties Between 2019-2026",
caption = "Source: NY State Division of Criminal Justice Services")
How would calculations be affected by looking at hate crimes in counties per year by population densities?
Bring in census data for populations of New York counties. These are counts from the 2020 census.
setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets")
nypop <- read_csv("nyc_census_pop_2020.csv")
Clean the county name to match the other dataset
Rename the variable “Area Name” as “county” so that it matches the other dataset.
nypop$`Area Name` <- gsub(" County", "", nypop$`Area Name`)
nypop2 <- nypop |>
rename(county = `Area Name`)|>
select(county, `2020 Census Population`)
head(nypop2)
# A tibble: 6 × 2
county `2020 Census Population`
<chr> <dbl>
1 Albany 314848
2 Allegany 46456
3 Bronx 1472654
4 Broome 198683
5 Cattaraugus 77042
6 Cayuga 76248
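The suffix removal can be checked on a short vector before it is applied to the whole column. The sketch below compares the unanchored pattern with one anchored to the end of the string using `$`. The third name is invented purely to show the difference and is not in the census file.

```r
# Two real county labels plus one invented name containing " County" mid-string.
area_names <- c("Kings County", "New York County", "Tri County Area")

# Unanchored: removes " County" wherever it appears.
unanchored <- gsub(" County", "", area_names)

# Anchored: removes " County" only when it ends the string.
anchored <- sub(" County$", "", area_names)

unanchored
anchored
```

For the census labels shown here both versions give the same result. The anchored form is a slightly safer habit when names might contain the word elsewhere.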
Join the hate2 data with nypop2
datajoin <- left_join(hate2, nypop2, by=c("county"))
datajoin
# A tibble: 127 × 5
# Groups: complaintyearnumber, county [35]
complaintyearnumber county biasmotivedescription n 2020 Census Populati…¹
<dbl> <chr> <chr> <int> <dbl>
1 2024 KINGS ANTI-JEWISH 152 NA
2 2024 NEW Y… ANTI-JEWISH 136 NA
3 2025 KINGS ANTI-JEWISH 136 NA
4 2019 KINGS ANTI-JEWISH 128 NA
5 2023 KINGS ANTI-JEWISH 126 NA
6 2022 KINGS ANTI-JEWISH 125 NA
7 2023 NEW Y… ANTI-JEWISH 124 NA
8 2025 NEW Y… ANTI-JEWISH 110 NA
9 2022 NEW Y… ANTI-JEWISH 104 NA
10 2021 NEW Y… ANTI-ASIAN 84 NA
# ℹ 117 more rows
# ℹ abbreviated name: ¹`2020 Census Population`
It didn’t work - the new column has NA values
The county names are upper case in hate2 and mixed case in nypop2, so none of the keys match.
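A quick way to diagnose this kind of mismatch is to compare the join keys directly. The sketch below types the keys by hand as they appear in the two outputs above (the population side is abbreviated); the variable names are illustrative.

```r
# County keys as they appear in the hate crime data.
hate_keys <- c("KINGS", "NEW YORK", "QUEENS", "BRONX", "RICHMOND")

# A few county keys as they appear in the population data.
pop_keys <- c("Albany", "Bronx", "Kings", "New York", "Queens", "Richmond")

# Keys on the left side with no exact match on the right side.
unmatched_raw <- setdiff(hate_keys, pop_keys)

# The same comparison after lowercasing both sides.
unmatched_lower <- setdiff(tolower(hate_keys), tolower(pop_keys))

unmatched_raw
unmatched_lower
```

All five keys fail to match as written, and none fail once both sides are lowercased, which is why converting the case fixes the join.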
hate_new <- hate2 |>
mutate(county = as_factor(str_to_lower(as.character(county))))
nypop_new <- nypop2 |>
mutate(county = as_factor(str_to_lower(as.character(county))))
Try joining again
datajoin <- left_join(hate_new, nypop_new, by=c("county"))
datajoin
# A tibble: 127 × 5
# Groups: complaintyearnumber, county [35]
complaintyearnumber county biasmotivedescription n 2020 Census Populati…¹
<dbl> <fct> <chr> <int> <dbl>
1 2024 kings ANTI-JEWISH 152 2736074
2 2024 new y… ANTI-JEWISH 136 1694251
3 2025 kings ANTI-JEWISH 136 2736074
4 2019 kings ANTI-JEWISH 128 2736074
5 2023 kings ANTI-JEWISH 126 2736074
6 2022 kings ANTI-JEWISH 125 2736074
7 2023 new y… ANTI-JEWISH 124 1694251
8 2025 new y… ANTI-JEWISH 110 1694251
9 2022 new y… ANTI-JEWISH 104 1694251
10 2021 new y… ANTI-ASIAN 84 1694251
# ℹ 117 more rows
# ℹ abbreviated name: ¹`2020 Census Population`
Calculate the rate of incidents per 100,000. Then arrange in descending order
datajoinrate <- datajoin |>
mutate(rate = n/`2020 Census Population`* 100000) |>
arrange(desc(rate))
datajoinrate
# A tibble: 127 × 6
# Groups: complaintyearnumber, county [35]
complaintyearnumber county biasmotivedescription n 2020 Census Populati…¹
<dbl> <fct> <chr> <int> <dbl>
1 2024 new y… ANTI-JEWISH 136 1694251
2 2023 new y… ANTI-JEWISH 124 1694251
3 2025 new y… ANTI-JEWISH 110 1694251
4 2022 new y… ANTI-JEWISH 104 1694251
5 2024 kings ANTI-JEWISH 152 2736074
6 2025 kings ANTI-JEWISH 136 2736074
7 2021 new y… ANTI-ASIAN 84 1694251
8 2021 new y… ANTI-JEWISH 84 1694251
9 2019 kings ANTI-JEWISH 128 2736074
10 2023 kings ANTI-JEWISH 126 2736074
# ℹ 117 more rows
# ℹ abbreviated name: ¹`2020 Census Population`
# ℹ 1 more variable: rate <dbl>
Notice that the highest rates of hate crimes happened in:
New York and Kings counties
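The reordering is easier to follow with the arithmetic written out for two rows of the table. The sketch below uses the 2024 ANTI-JEWISH counts and the 2020 census populations shown above; `rate_per_100k` is a hypothetical helper name.

```r
# Hypothetical helper: incidents per 100,000 residents.
rate_per_100k <- function(count, population) {
  count / population * 100000
}

# 2024 ANTI-JEWISH counts and 2020 census populations from the tables above.
rate_new_york <- rate_per_100k(136, 1694251)
rate_kings <- rate_per_100k(152, 2736074)

round(c(new_york = rate_new_york, kings = rate_kings), 1)
```

Kings County has more incidents (152 versus 136) but also a much larger population, so its rate comes out lower: roughly 5.6 per 100,000 compared with roughly 8.0 for New York County.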
Your turn!
Once you complete this tutorial, include an essay of about 150-200 words that answers the following questions:
Write about the positive and negative aspects of this hatecrimes dataset.
List 2 different paths you would hypothetically like to study about this dataset at some future point.