Load necessary packages:

library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
library(fivethirtyeight)

# Create a categorical variable of Trump support level based on the numerical
# variable of share of 2016 U.S. presidential voters who voted for Donald Trump.
# Each of the low, medium, and high levels has roughly one third of the states.
hate_crimes <- hate_crimes %>% 
  mutate(
    trump_support_level = cut_number(share_vote_trump, 3, labels=c("low", "medium", "high"))
    )
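To see what cut_number() does before applying it to the real data, here is a minimal sketch on nine made-up vote shares. The share vector is hypothetical and is not taken from hate_crimes; the point is only that each label ends up with the same number of observations.

```r
library(ggplot2)  # provides cut_number()

# Hypothetical vote shares for nine made-up states (not the real data).
share <- c(0.30, 0.35, 0.41, 0.44, 0.47, 0.52, 0.57, 0.60, 0.68)

# Three bins with (approximately) equal counts, relabelled low/medium/high.
level <- cut_number(share, 3, labels = c("low", "medium", "high"))

# Count how many observations fall in each bin.
table(level)
```

With a number of observations that is not divisible by three, the bins can only be roughly equal in size, which is why the comment above says "roughly one third".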

Preparation

  • Make sure to load all the necessary packages and create the new categorical variable trump_support_level.
  • Read the following
    1. The first four paragraphs on the Wikipedia entry for the Gini coefficient/index: a statistical measure of income inequality.
    2. The Jan 23, 2017 FiveThirtyEight article “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. You will be partially reconstructing their analysis.
  • The dataset used for this article is included in the fivethirtyeight package: hate_crimes.
    • Read the help file corresponding to this data by running ?hate_crimes
    • Run View(hate_crimes) and explore the dataset
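As a companion to the Gini reading, the following is a small sketch of one common way to compute the coefficient, the mean absolute difference form. The gini() helper and the income vectors are made up for illustration; they are not part of the fivethirtyeight package or of the article's analysis.

```r
# Mean absolute difference form of the Gini coefficient (illustrative helper).
gini <- function(income) {
  n <- length(income)
  # Sum of |x_i - x_j| over all ordered pairs, scaled by 2 * n^2 * mean.
  sum(abs(outer(income, income, "-"))) / (2 * n^2 * mean(income))
}

equal_case   <- gini(c(10, 10, 10, 10))  # everyone earns the same
unequal_case <- gini(c(0, 0, 0, 40))     # one person earns everything

c(equal = equal_case, unequal = unequal_case)
```

With only four people the most unequal case gives (n - 1) / n rather than exactly 1; the value approaches 1 as the population grows.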

Question 0: Preliminary questions

  1. What value of the Gini index indicates perfect income equality? 0

  2. What value of the Gini index indicates perfect income inequality? 1

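  3. Why are the two hate crime variables based on counts per 100,000 individuals and not based on just the count? Because state populations differ widely, raw counts mostly reflect how many people live in a state; rates per 100,000 individuals make states of different sizes comparable.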

  4. Name some differences on how and when the FBI reports-based and the Southern Poverty Law Center-based rates of hate crime were collected, and how this could impact any analyses.

Write your answers here: The FBI does not track hate crimes systematically (it relies on data that law enforcement agencies submit voluntarily), so it is hard to gauge how comprehensive its data are. Moreover, its publicly accessible records were collected between 2010 and 2015. The SPLC data, on the other hand, did not exist before the 2016 election and combine hate crimes with non-prosecutable hate incidents. Although both data sets reveal similar trends, an analysis based on SPLC data might suffer from awareness bias, while an analysis based on FBI data might leave some questions unanswered because of its limited coverage.
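The reason for using rates per 100,000 individuals can be illustrated with a short sketch. The two states, their incident counts, and their populations are invented for the example.

```r
# Hypothetical counts and populations for two made-up states.
incidents  <- c(big_state = 400, small_state = 40)
population <- c(big_state = 20e6, small_state = 1e6)

# Rate per 100,000 residents: count divided by population, scaled by 1e5.
rate_per_100k <- incidents / population * 1e5
rate_per_100k
```

Here the larger state has ten times as many incidents, yet the smaller state has the higher rate, which is the comparison a raw count would hide.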

Question 1: Hate crimes and Trump support

Let’s model the relationship, both visually and via regression, between:

  • \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
  • \(x\): Level of Trump support in the state: low, medium, or high

a) Visual model

Create a visual model of this data (do not forget to include appropriate axes labels and title):

# Write code to plot this model below:
ggplot(data = hate_crimes, aes(x = trump_support_level, y = hate_crimes_per_100k_splc)) + 
  geom_boxplot() +
  labs(x = "Trump support level", y = "Hate crimes per 100k people after the election",
       title = "Hate crime - Trump support relationship.")

b) Regression model

Output the regression table and interpret the results

# Write code to generate a regression table below:
# A regression on a categorical predictor reproduces the group means, so compute them first:
hate_crimes %>% 
  group_by(trump_support_level) %>%
  summarize(mean = mean(hate_crimes_per_100k_splc, na.rm = TRUE))%>%
  kable()

| trump_support_level |      mean |
|:--------------------|----------:|
| low                 | 0.4601833 |
| medium              | 0.2222983 |
| high                | 0.1910425 |

lm(hate_crimes_per_100k_splc ~ trump_support_level, data=hate_crimes) %>% 
  tidy()%>%
  kable()
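
| term                      |   estimate | std.error | statistic |   p.value |
|:--------------------------|-----------:|----------:|----------:|----------:|
| (Intercept)               |  0.4601833 | 0.0529413 |  8.692337 | 0.0000000 |
| trump_support_levelmedium | -0.2378850 | 0.0785246 | -3.029433 | 0.0040912 |
| trump_support_levelhigh   | -0.2691408 | 0.0800397 | -3.362593 | 0.0016070 |
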
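The intercept, 0.4601833, is the mean number of hate crimes per 100K individuals in states with a low Trump support level, the baseline group. The trump_support_levelmedium estimate, -0.2378850, is the difference in means between the medium and low groups: medium-support states averaged about 0.24 fewer hate crimes per 100K (0.2222983 versus 0.4601833). Likewise, trump_support_levelhigh, -0.2691408, is the difference in means between the high and low groups.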
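The link between the regression table and the group means can be checked on a toy example. The data frame below is made up (six hypothetical states, two per level) and uses only base R; it is a sketch of the idea, not a rerun of the analysis above.

```r
# Made-up rates for six hypothetical states, two per support level.
toy <- data.frame(
  level = factor(rep(c("low", "medium", "high"), each = 2),
                 levels = c("low", "medium", "high")),
  rate  = c(0.5, 0.4, 0.3, 0.2, 0.2, 0.1)
)

group_means <- tapply(toy$rate, toy$level, mean)
fit <- coef(lm(rate ~ level, data = toy))

# The intercept is the baseline ("low") mean; the other coefficients are
# differences from that baseline.
all.equal(unname(fit["(Intercept)"]), unname(group_means["low"]))
all.equal(unname(fit["levelmedium"]),
          unname(group_means["medium"] - group_means["low"]))
all.equal(unname(fit["levelhigh"]),
          unname(group_means["high"] - group_means["low"]))
```

Because "low" is the first factor level, lm() treats it as the baseline, which matches how trump_support_level is coded by the labels argument.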

c) Conclusion

  1. Give a one-sentence answer as to whether or not your results above are consistent with
  2. Which state was the outlier in the “low” group?

Write your answers here:

  1. No. Some Trump stronghold states reported noticeably more hate crimes than others, as the choropleth map in the FiveThirtyEight article shows, and the group means above fall rather than rise from low to high support, so there is no sign of hate crime rates increasing with Trump support level.
  2. District of Columbia

Question 2: Two simple linear regressions

Create two separate visualizations (do not forget to include appropriate axes labels and title) and run two separate simple linear regressions (using only one predictor) for \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election with

  1. \(x\): the gini index
  2. \(x\): high school education level

and interpret any slope values.

a) Gini Index

# Write code to plot this model below:
ggplot(hate_crimes, aes(x=gini_index, y=hate_crimes_per_100k_splc)) +
  geom_point() +
  labs(x="Gini index", y="Hate crimes per 100k people after the election", title="Regression model 1") +
  geom_smooth(method="lm", se=FALSE)

# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ gini_index, data=hate_crimes) %>% 
  tidy()%>%
  kable()
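
| term        |  estimate | std.error | statistic |   p.value |
|:------------|----------:|----------:|----------:|----------:|
| (Intercept) | -1.527463 | 0.7833043 | -1.950025 | 0.0574197 |
| gini_index  |  4.020510 | 1.7177215 |  2.340606 | 0.0237445 |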

b) High school education level

# Write code to plot this model below:
ggplot(hate_crimes, aes(x=share_pop_hs, y=hate_crimes_per_100k_splc)) +
  geom_point() +
  labs(x="High school education level", y="Hate crimes per 100k people after the election", title="Regression model 2") +
  geom_smooth(method="lm", se=FALSE)

# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ share_pop_hs, data=hate_crimes) %>% 
  tidy()%>%
  kable()
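
| term         |  estimate | std.error | statistic |   p.value |
|:-------------|----------:|----------:|----------:|----------:|
| (Intercept)  | -1.705274 | 0.9228076 | -1.847919 | 0.0711930 |
| share_pop_hs |  2.320228 | 1.0647852 |  2.179057 | 0.0346031 |
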
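Slope values:

  • First model: about 4.02. Each 0.01 increase in a state's Gini index is associated with roughly 0.04 more hate crimes per 100K individuals, on average.
  • Second model: about 2.32. Each 0.01 increase (one percentage point) in share_pop_hs is associated with roughly 0.023 more hate crimes per 100K individuals, on average.

Both slopes are positive, and both describe associations in this data rather than causal effects.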

Question 3: Multiple regression

Run a multiple regression for

  1. \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
  2. Using the following predictor variables simultaneously
    1. \(x_1\): the gini index
    2. \(x_2\): high school education level

and interpret both slope coefficients

# Write code to generate a regression table below. No need for a visualization
# here:
lm(hate_crimes_per_100k_splc ~ share_pop_hs+gini_index, data=hate_crimes) %>% 
  tidy()%>%
  kable()
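
| term         |  estimate | std.error | statistic | p.value |
|:-------------|----------:|----------:|----------:|--------:|
| (Intercept)  | -8.211991 |  1.418930 | -5.787453 | 7.0e-07 |
| share_pop_hs |  5.255865 |  1.002924 |  5.240545 | 4.3e-06 |
| gini_index   |  8.702370 |  1.629755 |  5.339678 | 3.1e-06 |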

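Write your interpretation below: The two slopes differ (about 5.3 versus 8.7). Holding the Gini index constant, each 0.01 increase in share_pop_hs is associated with about 0.053 more hate crimes per 100K individuals, on average. Holding high school education level constant, each 0.01 increase in the Gini index is associated with about 0.087 more. Both slopes are larger than in the simple regressions of Question 2 (2.32 and 4.02).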
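What "holding the other predictor constant" means can be made concrete with a small sketch that plugs the coefficients from the table above into the fitted equation. The predict_rate() helper and the two hypothetical states (share_pop_hs of 0.85, Gini index of 0.45 and 0.46) are assumptions made for illustration.

```r
# Coefficients copied from the multiple regression table above.
b0     <- -8.211991
b_hs   <-  5.255865
b_gini <-  8.702370

# Fitted hate crime rate per 100K for given predictor values (illustrative helper).
predict_rate <- function(share_pop_hs, gini_index) {
  b0 + b_hs * share_pop_hs + b_gini * gini_index
}

# Two hypothetical states with the same education level whose Gini index
# differs by 0.01.
diff_gini <- predict_rate(0.85, 0.46) - predict_rate(0.85, 0.45)
diff_gini  # the Gini slope times 0.01
```

The difference depends only on the Gini slope because the education term cancels, which is exactly the "all else equal" reading of a multiple regression coefficient.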

Question 4: Impact of DC on analyses

Create two new data frames:

  1. hate_crimes_no_new_york: the hate_crimes dataset without New York
  2. hate_crimes_no_DC: the hate_crimes data without the District of Columbia

Repeat the multiple regression from Question 3 and indicate the removal of which state from the dataset has a bigger impact on the analysis. Why do you think this is?

# Write code to generate regression tables below:
hate_crimes_no_new_york <- hate_crimes %>% 
  filter(state != "New York")

hate_crimes_no_DC <- hate_crimes %>% 
  filter(state != "District of Columbia")

# View() opens an interactive viewer, so run View(hate_crimes_no_new_york) and
# View(hate_crimes_no_DC) from the console rather than in knitted code.

lm(hate_crimes_per_100k_splc ~ share_pop_hs+gini_index, data=hate_crimes_no_DC) %>% 
  tidy()%>%
  kable()
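
| term         |  estimate | std.error | statistic |   p.value |
|:-------------|----------:|----------:|----------:|----------:|
| (Intercept)  | -3.989258 | 1.5083121 | -2.644849 | 0.0113652 |
| share_pop_hs |  3.284001 | 0.9432964 |  3.481409 | 0.0011577 |
| gini_index   |  3.135572 | 1.8352715 |  1.708506 | 0.0947527 |
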
lm(hate_crimes_per_100k_splc ~ share_pop_hs+gini_index, data=hate_crimes_no_new_york) %>% 
  tidy()%>%
  kable()
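
| term         |  estimate | std.error | statistic | p.value |
|:-------------|----------:|----------:|----------:|--------:|
| (Intercept)  | -8.655118 |  1.448044 | -5.977109 | 4.0e-07 |
| share_pop_hs |  5.399269 |  1.001034 |  5.393694 | 2.8e-06 |
| gini_index   |  9.414876 |  1.706481 |  5.517131 | 1.8e-06 |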

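Write your response here: Removing DC has the bigger impact. Without DC, the Gini index slope drops from 8.70 to 3.14 and its p-value rises to about 0.095, and the share_pop_hs slope drops from 5.26 to 3.28. Without New York, the estimates barely move (9.41 and 5.40). DC's unusually high number of reported hate crimes skews the original analysis, so it is a far more influential observation than New York.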
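One way to check which observation moves a fit the most, without refitting once per state, is Cook's distance. The sketch below uses synthetic data (ten ordinary points plus one extreme point in row 11) and base R only; it illustrates the idea and is not the analysis of hate_crimes itself.

```r
# Synthetic data: ten ordinary points plus one extreme point (row 11).
x <- c(0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.53)
y <- c(0.20, 0.25, 0.18, 0.30, 0.22, 0.28, 0.24, 0.31, 0.26, 0.29, 1.50)
toy <- data.frame(x, y)

fit_all  <- lm(y ~ x, data = toy)
fit_drop <- lm(y ~ x, data = toy[-11, ])

# Slope with and without the extreme point.
c(all = unname(coef(fit_all)["x"]), dropped = unname(coef(fit_drop)["x"]))

# Cook's distance flags the observation that moves the fit the most.
most_influential <- unname(which.max(cooks.distance(fit_all)))
most_influential
```

Applied to the real model, the same cooks.distance() call on the fitted lm object would be expected to single out the row for the District of Columbia, though that is an inference from the tables above rather than something this sketch verifies.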