Load necessary packages:

library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
library(fivethirtyeight)

# Create a categorical variable of Trump support level based on the numerical
# variable of share of 2016 U.S. presidential voters who voted for Donald Trump.
# Each of the low, medium, and high levels have roughly one third of states.
hate_crimes <- hate_crimes %>% 
  mutate(
    trump_support_level = cut_number(share_vote_trump, 3, labels=c("low", "medium", "high"))
    )
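As a quick check of the claim that each level holds roughly one third of the states, the sketch below applies cut_number() to a made-up vector of 51 evenly spaced values and counts the observations per level. The vector toy_share is an assumption standing in for share_vote_trump; it is not the real data.

```r
library(ggplot2)

# Toy stand-in for share_vote_trump: 51 made-up, evenly spaced values
# (one per state plus DC). These are NOT the real vote shares.
toy_share <- seq(0.05, 0.70, length.out = 51)

# cut_number() splits a numeric vector into groups that contain
# (approximately) equal numbers of observations.
toy_level <- cut_number(toy_share, 3, labels = c("low", "medium", "high"))

# Count how many observations landed in each level
level_counts <- table(toy_level)
level_counts
```

Running table(hate_crimes$trump_support_level) after the mutate() above gives the same kind of check on the real data.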

Preparation

  • Make sure to load all the necessary packages and create the new categorical variable trump_support_level.
  • Read the following:
    1. The first four paragraphs on the Wikipedia entry for the Gini coefficient/index: a statistical measure of income inequality.
    2. The Jan 23, 2017 FiveThirtyEight article “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. You will be partially reconstructing their analysis.
  • The dataset used for this article is included in the fivethirtyeight package: hate_crimes.
    • Read the help file corresponding to this data by running ?hate_crimes
    • Run View(hate_crimes) and explore the dataset

Question 0: Preliminary questions

  1. What value of the Gini index indicates perfect income equality?
  2. What value of the Gini index indicates perfect income inequality?
  3. Why are the two hate crime variables based on counts per 100,000 individuals and not based on just the count?
  4. Name some differences on how and when the FBI reports-based and the Southern Poverty Law Center-based rates of hate crime were collected, and how this could impact any analyses.

Write your answers here:

1. 0 indicates perfect income equality

2. 1 indicates perfect income inequality

3. If the hate crime variables were based on raw counts, they would mostly reflect population size: the larger a state's population, the more hate crimes it tends to record. Dividing by population (per 100,000 individuals) makes states of different sizes comparable, so the data can more accurately reflect the relationship between hate crimes and other variables.

4. The FBI-based hate crime rates were collected before 2016 from law enforcement agencies, while the Southern Poverty Law Center-based rates were collected after the 2016 election from a combination of curated media accounts and self-reported form entries. Because the two measures come from different sources, time periods, and collection methods, they are not directly comparable, and any analysis that uses them carries unavoidable inaccuracies.
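Answers 1 to 3 can be illustrated with a few lines of arithmetic. The sketch below computes the Gini index as half the relative mean absolute difference between all pairs of incomes, then converts counts to rates per 100,000. All incomes, counts, and populations are made up for illustration, and the helper gini() is a hypothetical function, not part of any package loaded above.

```r
# Gini index as half the relative mean absolute difference between all
# pairs of incomes: G = sum_ij |x_i - x_j| / (2 * n^2 * mean(x))
gini <- function(income) {
  n <- length(income)
  sum(abs(outer(income, income, "-"))) / (2 * n^2 * mean(income))
}

# Made-up incomes for four people (illustrative only)
equal_incomes   <- c(10, 10, 10, 10)  # everyone earns the same
unequal_incomes <- c(0, 0, 0, 40)     # one person earns everything

gini_equal   <- gini(equal_incomes)    # 0
gini_unequal <- gini(unequal_incomes)  # 0.75, i.e. (n - 1) / n with n = 4

# Rates per 100,000 for two hypothetical states: A has more incidents in
# total, but B has the higher rate once population is accounted for.
incidents     <- c(A = 50, B = 10)
population    <- c(A = 5e6, B = 5e5)
rate_per_100k <- incidents / population * 1e5  # A = 1, B = 2
```

With one person holding all the income, the value (n - 1) / n approaches 1 as the population grows, which is why 1 is quoted as perfect inequality.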

Question 1: Hate crimes and Trump support

Let’s model the relationship, both visually and via regression, between:

  • \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
  • \(x\): Level of Trump support in the state: low, medium, or high

a) Visual model

Create a visual model of this data (do not forget to include appropriate axes labels and title):

# Write code to plot this model below:
ggplot(hate_crimes, aes(x = trump_support_level, y = hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(x = "Trump support level",
       y = "Hate crimes per 100K individuals in the 10 days after the 2016 US election",
       title = "Hate crimes vs. Trump support")

b) Regression model

Output the regression table and interpret the results

# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ trump_support_level, data = hate_crimes) %>% 
  tidy()
##                        term   estimate  std.error statistic      p.value
## 1               (Intercept)  0.4601833 0.05294126  8.692337 4.182015e-11
## 2 trump_support_levelmedium -0.2378850 0.07852458 -3.029433 4.091171e-03
## 3   trump_support_levelhigh -0.2691408 0.08003966 -3.362593 1.606977e-03

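From the regression table, I found a negative relationship between the level of Trump support and the hate crime rate. The intercept, 0.46, is the mean rate (per 100K individuals) for states in the baseline “low” group; the mean rates for the “medium” and “high” groups are about 0.24 and 0.27 lower, respectively. In other words, states with higher Trump support tended to have lower hate crime rates in the 10 days after the election.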
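With a categorical predictor, the intercept is the mean of the baseline group and each remaining coefficient is an offset from that baseline. The short sketch below recovers the three group means from the estimates in the regression table above; the variable names are only illustrative.

```r
# Estimates copied from the regression table above
intercept   <- 0.4601833   # mean rate in the baseline "low" group
diff_medium <- -0.2378850  # "medium" minus "low"
diff_high   <- -0.2691408  # "high" minus "low"

# Group mean = baseline mean + offset for that group
group_means <- c(
  low    = intercept,
  medium = intercept + diff_medium,
  high   = intercept + diff_high
)
round(group_means, 3)
# low 0.460, medium 0.222, high 0.191
```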

c) Conclusion

  1. Give one sentence as to whether or not your results above are consistent with
  2. Which state was the outlier in the “low” group?

Write your answers here:

Yes, my results are consistent with the two graphs given.

The District of Columbia is the outlier in the “low” group.
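One way to find the outlier is to sort the states in the “low” group by their hate crime rate. The sketch below assumes the fivethirtyeight package is installed and rebuilds trump_support_level so that it runs on its own. It simply ranks the rates; it does not apply the 1.5 × IQR rule that geom_boxplot() uses to flag points. The name low_support_states is illustrative.

```r
library(dplyr)
library(ggplot2)          # for cut_number()
library(fivethirtyeight)  # for the hate_crimes data

low_support_states <- hate_crimes %>%
  mutate(
    trump_support_level = cut_number(share_vote_trump, 3, labels = c("low", "medium", "high"))
  ) %>%
  # Keep the "low" group and drop states with no SPLC rate recorded
  filter(trump_support_level == "low", !is.na(hate_crimes_per_100k_splc)) %>%
  # Highest rates first
  arrange(desc(hate_crimes_per_100k_splc)) %>%
  select(state, share_vote_trump, hate_crimes_per_100k_splc)

head(low_support_states, 3)
```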

Question 2: Two simple linear regressions

Create two separate visualizations (do not forget to include appropriate axes labels and title) and run two separate simple linear regressions (using only one predictor) for \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election with

  1. \(x\): the Gini index
  2. \(x\): high school education level

and interpret any slope values.

a) Gini Index

# Write code to plot this model below:
ggplot(hate_crimes, aes(x = gini_index, y = hate_crimes_per_100k_splc)) + 
  geom_point() + 
  labs(x = "Gini index",
       y = "Hate crimes per 100K individuals in the 10 days after the 2016 US election",
       title = "Hate crimes vs. Gini index") +
  geom_smooth(method = "lm", se = FALSE)

# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ gini_index, data = hate_crimes) %>% 
  tidy()
##          term  estimate std.error statistic    p.value
## 1 (Intercept) -1.527463 0.7833043 -1.950025 0.05741966
## 2  gini_index  4.020510 1.7177215  2.340606 0.02374447

b) High school education level

# Write code to plot this model below:
ggplot(hate_crimes, aes(x = share_pop_hs, y = hate_crimes_per_100k_splc)) + 
  geom_point() + 
  labs(x = "Share of adults with a high-school degree",
       y = "Hate crimes per 100K individuals in the 10 days after the 2016 US election",
       title = "Hate crimes vs. High school education level") +
  geom_smooth(method = "lm", se = FALSE)

# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ share_pop_hs, data = hate_crimes) %>% 
  tidy()
##           term  estimate std.error statistic    p.value
## 1  (Intercept) -1.705274 0.9228076 -1.847919 0.07119297
## 2 share_pop_hs  2.320228 1.0647852  2.179057 0.03460305
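To interpret the two slopes, it helps to pick a realistic step size. The sketch below assumes both predictors are recorded on a 0-to-1 scale (which the magnitudes of the estimates suggest), so a full one-unit change is not meaningful and a 0.01 change is used instead. The slopes are copied from the two regression tables above.

```r
# Slopes copied from the two simple regression tables above
slope_gini <- 4.020510
slope_hs   <- 2.320228

# Assumed step: 0.01, i.e. one hundredth of the 0-to-1 scale
step <- 0.01

# Associated change in hate crimes per 100K for a 0.01 increase
change_gini <- slope_gini * step  # about 0.040
change_hs   <- slope_hs * step    # about 0.023
```

Read this way, a 0.01 increase in the Gini index is associated with about 0.040 more hate crimes per 100K individuals, and a 0.01 increase in the share of adults with a high school degree with about 0.023 more.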

Question 3: Multiple regression

Run a multiple regression for

  1. \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
  2. Using the following predictor variables simultaneously
    1. \(x_1\): the Gini index
    2. \(x_2\): high school education level

and interpret both slope coefficients

# Write code to generate a regression table below. No need for a visualization
# here:
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data = hate_crimes) %>% 
  tidy()
##           term  estimate std.error statistic      p.value
## 1  (Intercept) -8.211991  1.418930 -5.787453 6.921925e-07
## 2 share_pop_hs  5.255865  1.002924  5.240545 4.340152e-06
## 3   gini_index  8.702370  1.629755  5.339678 3.117468e-06

Write your interpretation below:

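From the table just created, both high school education level and the Gini index have strong positive relationships with the hate crime rate, each holding the other constant. For a fixed Gini index, a 0.01 increase in the share of adults with a high school degree is associated with about 0.053 more hate crimes per 100K individuals; for a fixed high school share, a 0.01 increase in the Gini index is associated with about 0.087 more. The slope for the Gini index (8.7) is larger than that for high school education (5.3), which suggests that income inequality has the larger impact on hate crimes. However, the two predictors vary over different ranges, so the raw slopes alone do not settle which one matters more.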
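One common way to compare slopes for predictors with different spreads is to refit the model on standardized variables, so that each slope is expressed in standard deviations of the outcome per standard deviation of the predictor. The sketch below is one possible approach, not part of the original assignment. It assumes the fivethirtyeight package is installed, and the names hate_z, hs_z, gini_z, and std_table are illustrative.

```r
library(dplyr)
library(broom)
library(fivethirtyeight)

hate_crimes_std <- hate_crimes %>%
  # Use only the rows lm() would use, so the scaling matches the fit
  filter(
    !is.na(hate_crimes_per_100k_splc),
    !is.na(share_pop_hs),
    !is.na(gini_index)
  ) %>%
  # scale() centers each variable and divides by its standard deviation
  mutate(
    hate_z = as.numeric(scale(hate_crimes_per_100k_splc)),
    hs_z   = as.numeric(scale(share_pop_hs)),
    gini_z = as.numeric(scale(gini_index))
  )

# Same model as above, but every variable is in standard-deviation units
std_table <- lm(hate_z ~ hs_z + gini_z, data = hate_crimes_std) %>%
  tidy()
std_table
```

The standardized slopes can then be compared directly, with the caveat that they still describe associations, not causal effects.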

Question 4: Impact of DC on analyses

Create two new data frames:

  1. hate_crimes_no_new_york: the hate_crimes dataset without New York
  2. hate_crimes_no_DC: the hate_crimes data without the District of Columbia

Repeat the multiple regression from Question 3 and indicate the removal of which state from the dataset has a bigger impact on the analysis. Why do you think this is?

# Write code to generate regression tables below:
hate_crimes_no_new_york <- hate_crimes %>%
  filter(state != "New York")
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data = hate_crimes_no_new_york) %>% 
  tidy()
##           term  estimate std.error statistic      p.value
## 1  (Intercept) -8.655118  1.448044 -5.977109 3.947701e-07
## 2 share_pop_hs  5.399269  1.001034  5.393695 2.763474e-06
## 3   gini_index  9.414876  1.706481  5.517131 1.833887e-06
hate_crimes_no_DC <- hate_crimes %>%
  filter(state != "District of Columbia")
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data = hate_crimes_no_DC) %>% 
  tidy()
##           term  estimate std.error statistic     p.value
## 1  (Intercept) -3.989258 1.5083121 -2.644849 0.011365187
## 2 share_pop_hs  3.284001 0.9432964  3.481409 0.001157652
## 3   gini_index  3.135572 1.8352715  1.708506 0.094752663

Write your response here:

Comparing the two tables just created with the original table from Question 3, removing New York results in slopes of 5.4 and 9.4 for high school education and the Gini index respectively, which is not a big departure from the original slopes of 5.3 and 8.7. Removing DC, however, results in slopes of 3.3 and 3.1, and the Gini index slope is no longer significant at the 5% level (p-value of about 0.095). Thus, we can conclude that the removal of DC has the bigger impact on the analysis. This can be attributed to the fact that DC is an outlier in this dataset: a single observation with an unusually high hate crime rate pulls the fitted slopes strongly.
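The comparison can be made concrete by expressing each refitted slope as a percent change from the original fit. The sketch below uses only the estimates printed in the regression tables for Questions 3 and 4; the object names are illustrative.

```r
# Slopes copied from the regression tables in Questions 3 and 4
slopes <- rbind(
  original    = c(share_pop_hs = 5.255865, gini_index = 8.702370),
  no_new_york = c(share_pop_hs = 5.399269, gini_index = 9.414876),
  no_DC       = c(share_pop_hs = 3.284001, gini_index = 3.135572)
)

# Divide each refitted slope by the original slope for the same predictor,
# then convert the ratio to a percent change
pct_change <- sweep(slopes[-1, ], 2, slopes["original", ], "/") * 100 - 100
round(pct_change, 1)
# no_new_york: share_pop_hs  2.7, gini_index   8.2
# no_DC:       share_pop_hs -37.5, gini_index -64.0
```

Dropping New York moves both slopes by less than 10%, while dropping DC cuts the high school slope by more than a third and the Gini slope by almost two thirds.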