DATA 607: Assignment 1

Introduction

For Assignment 1, the dataset I chose from GitHub was Hate Crimes (https://raw.githubusercontent.com/fivethirtyeight/data/master/hate-crimes/hate_crimes.csv), used in the article “Higher Rates Of Hate Crimes Are Tied To Income Inequality” by Maimuna Majumder: https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/. To provide some context, the author wanted to look at how hate crimes varied by state, as well as how income inequality by state predicted higher rates of hate crimes before and after the 2016 presidential election. Majumder used data collected by the FBI and the Southern Poverty Law Center, both of which include self-reported, voluntary, and publicly available hate crime data.

Load in Libraries

knitr::opts_chunk$set(echo = TRUE)

library(readr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import the data (.csv file) from GitHub into R

The hate_crime dataset has 51 observations (rows) and 12 variables (columns).

hate_crime <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/hate-crimes/hate_crimes.csv")
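The dimensions stated above can be confirmed right after the import. The following is a minimal sketch; `check_shape()` is a hypothetical helper name introduced here for illustration and is not part of any package.

```r
# Hypothetical helper: stop with an error unless a data frame
# has the expected number of rows and columns.
check_shape <- function(df, expected_rows, expected_cols) {
  stopifnot(
    is.data.frame(df),
    nrow(df) == expected_rows,
    ncol(df) == expected_cols
  )
  invisible(TRUE)
}
```

With the data frame imported above, `check_shape(hate_crime, 51, 12)` should return silently if the import matches the description.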

View Data

Take a look at the data to make sure the variables are appropriately labeled.

glimpse(hate_crime)

## Rows: 51
## Columns: 12
## $ state                                    <chr> "Alabama", "Alaska", "Arizona…
## $ median_household_income                  <int> 42278, 67629, 49254, 44922, 6…
## $ share_unemployed_seasonal                <dbl> 0.060, 0.064, 0.063, 0.052, 0…
## $ share_population_in_metro_areas          <dbl> 0.64, 0.63, 0.90, 0.69, 0.97,…
## $ share_population_with_high_school_degree <dbl> 0.821, 0.914, 0.842, 0.824, 0…
## $ share_non_citizen                        <dbl> 0.02, 0.04, 0.10, 0.04, 0.13,…
## $ share_white_poverty                      <dbl> 0.12, 0.06, 0.09, 0.12, 0.09,…
## $ gini_index                               <dbl> 0.472, 0.422, 0.455, 0.458, 0…
## $ share_non_white                          <dbl> 0.35, 0.42, 0.49, 0.26, 0.61,…
## $ share_voters_voted_trump                 <dbl> 0.63, 0.53, 0.50, 0.60, 0.33,…
## $ hate_crimes_per_100k_splc                <dbl> 0.12583893, 0.14374012, 0.225…
## $ avg_hatecrimes_per_100k_fbi              <dbl> 1.8064105, 1.6567001, 3.41392…

hate_crime <- hate_crime %>%
  rename(income_inequality = gini_index)

In the dataset, the variables are: state, median_household_income, share_unemployed_seasonal, share_population_in_metro_areas, share_population_with_high_school_degree, share_non_citizen, share_white_poverty, gini_index, share_non_white, share_voters_voted_trump, hate_crimes_per_100k_splc, and avg_hatecrimes_per_100k_fbi. These names accurately describe the data they contain, so we will leave them as they are, except for gini_index. Because gini_index is income inequality as measured by the Gini index, we renamed it to income_inequality.

The main predictor is income inequality, labeled income_inequality. The outcome is hate crimes, which is captured in two different variables:

hate_crimes_per_100k_splc: Hate crimes per 100,000 population in 2016 after the election, collected by the Southern Poverty Law Center

avg_hatecrimes_per_100k_fbi: Hate crimes per 100,000 population in 2015 before the election, collected by the FBI
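As a quick first look at how the predictor relates to each outcome, the correlations can be computed directly. This is a minimal sketch, assuming the gini_index column has already been renamed to income_inequality as above; `inequality_correlations()` is a hypothetical helper name, and Pearson correlation is my choice here, not necessarily what the article used.

```r
library(dplyr)

# Hypothetical helper: Pearson correlation between income inequality and each
# hate crime measure, using only the rows where both values are present.
inequality_correlations <- function(df) {
  df %>%
    summarise(
      cor_splc = cor(income_inequality, hate_crimes_per_100k_splc,
                     use = "complete.obs"),
      cor_fbi  = cor(income_inequality, avg_hatecrimes_per_100k_fbi,
                     use = "complete.obs")
    )
}
```

Calling `inequality_correlations(hate_crime)` would return a one-row data frame with one correlation per source.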

Plot of Hate Crimes in 2015 vs. 2016 by State

# Bar chart of the FBI hate crime rate by state
ggplot(data = hate_crime) +
  aes(x = state, y = avg_hatecrimes_per_100k_fbi) +
  geom_bar(stat = "identity", fill = "blue") +
  ggtitle("Hate Crimes per 100k in 2015") +
  xlab("States in the US") + ylab("Average Hate Crimes per 100k") +
  theme(axis.text.x = element_text(angle = 90))

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

# Bar chart of the SPLC hate crime rate by state
ggplot(data = hate_crime) +
  aes(x = state, y = hate_crimes_per_100k_splc) +
  geom_bar(stat = "identity", fill = "green") +
  ggtitle("Hate Crimes per 100k after 2016 Presidential Election") +
  xlab("States in the US") + ylab("Average Hate Crimes per 100k") +
  theme(axis.text.x = element_text(angle = 90))

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).
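The warnings above mean that some states have no value for one of the two measures. A minimal sketch for listing those states follows; `states_missing_outcome()` is a hypothetical helper name used only for illustration.

```r
library(dplyr)

# Hypothetical helper: keep only the states that are missing
# at least one of the two hate crime measures.
states_missing_outcome <- function(df) {
  df %>%
    filter(is.na(hate_crimes_per_100k_splc) |
             is.na(avg_hatecrimes_per_100k_fbi)) %>%
    select(state, hate_crimes_per_100k_splc, avg_hatecrimes_per_100k_fbi)
}
```

Running `states_missing_outcome(hate_crime)` would show which rows ggplot2 dropped from each chart.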

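The graphs above are bar graphs of hate crimes per 100,000 population by state before and after the 2016 presidential election. The two measures come from different sources and are on different scales (for Alabama, about 1.81 in the FBI data versus about 0.13 in the SPLC data), so the bar heights cannot be compared directly across the two plots, and an increase from 2015 to 2016 cannot be read off them without first putting both measures on a common basis.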
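One way to view both measures side by side is to stack them into a single column and draw one panel per source, each with its own y-axis. The following is a sketch under that assumption; `plot_outcomes_by_state()` and the column names `source` and `per_100k` are hypothetical names introduced here.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper: reshape the two hate crime measures into long form
# and draw one bar chart panel per source with independent y-axis scales.
plot_outcomes_by_state <- function(df) {
  df %>%
    pivot_longer(
      cols = c(avg_hatecrimes_per_100k_fbi, hate_crimes_per_100k_splc),
      names_to = "source",
      values_to = "per_100k"
    ) %>%
    filter(!is.na(per_100k)) %>%  # drop missing values explicitly
    ggplot(aes(x = state, y = per_100k)) +
    geom_col() +
    facet_wrap(~ source, ncol = 1, scales = "free_y") +
    labs(x = "States in the US", y = "Hate Crimes per 100k") +
    theme(axis.text.x = element_text(angle = 90))
}
```

Calling `plot_outcomes_by_state(hate_crime)` returns a ggplot object. The free y-axes keep the smaller SPLC values readable, but they do not make the two sources comparable.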

Conclusion

Majumder found that some states had far more hate crimes than others. Overall, the article's analysis showed that hate crimes increased after the 2016 election. To update the analysis and findings in the article, I would use a different source of data rather than the ones from the FBI and the Southern Poverty Law Center. While the FBI only collects data on prosecutable hate crimes, the Southern Poverty Law Center also includes non-prosecutable incidents, which makes the two sources hard to compare. I would look at prosecutable and non-prosecutable data separately to see how those trends changed pre- and post-election.

Furthermore, since Majumder found that income inequality was associated with higher rates of hate crimes before and after the election, I would rerun the regression with current data to see how hate crime trends continued into the 2020 election and to check whether income inequality was still a predictor.
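A starting point for that check could look like the sketch below. It fits a single-predictor linear model and is not a reproduction of the article's model specification; `fit_inequality_model()` is a hypothetical helper name, and it assumes the gini_index column has been renamed to income_inequality as in this report.

```r
# Hypothetical helper: simple linear regression of one hate crime measure
# on income inequality. With R's default settings, lm() drops rows
# that contain missing values.
fit_inequality_model <- function(df, outcome = "avg_hatecrimes_per_100k_fbi") {
  lm(reformulate("income_inequality", response = outcome), data = df)
}
```

For example, `summary(fit_inequality_model(hate_crime))` would report the FBI model, and passing `outcome = "hate_crimes_per_100k_splc"` would fit the same model to the SPLC measure.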