Load necessary packages:

library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
library(fivethirtyeight)

# Create a categorical variable of Trump support level based on the numerical
# variable of share of 2016 U.S. presidential voters who voted for Donald Trump.
# Each of the low, medium, and high levels has roughly one third of states.
hate_crimes <- hate_crimes %>% 
  mutate(
    trump_support_level = cut_number(share_vote_trump, 3, labels=c("low", "medium", "high"))
    )
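cut_number() comes from ggplot2 and splits a numeric variable into groups holding roughly equal numbers of observations. As a rough illustration of that idea, the sketch below does something similar in base R on made-up vote shares. The vector toy_share is hypothetical and is not taken from hate_crimes.

```r
# Hypothetical vote shares (not from hate_crimes), for illustration only
toy_share <- c(0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70)

# Break points at the 0%, 33.3%, 66.7%, and 100% quantiles
breaks <- quantile(toy_share, probs = seq(0, 1, length.out = 4))

# Assign each value to one of three labelled bins
toy_level <- cut(toy_share, breaks = breaks, include.lowest = TRUE,
                 labels = c("low", "medium", "high"))

# Each level should hold about one third of the values
table(toy_level)
```

With real data the groups are only roughly equal in size, because ties and missing values can shift the counts.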

Preparation

  • Make sure to load all the necessary packages and create the new categorical variable trump_support_level.
  • Read the following
    1. The first four paragraphs on the Wikipedia entry for the Gini coefficient/index: a statistical measure of income inequality.
    2. The Jan 23, 2017 FiveThirtyEight article “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. You will be partially reconstructing their analysis.
  • The dataset used for this article is included in the fivethirtyeight package: hate_crimes.
    • Read the help file corresponding to this data by running ?hate_crimes
    • Run View(hate_crimes) and explore the dataset

Question 0: Preliminary questions

  1. What value of the Gini index indicates perfect income equality?
  2. What value of the Gini index indicates perfect income inequality?
  3. Why are the two hate crime variables based on counts per 100,000 individuals and not based on just the count?
  4. Name some differences in how and when the FBI reports-based and the Southern Poverty Law Center-based rates of hate crime were collected, and how this could impact any analyses.

Write your answers here:

  1. A Gini coefficient of zero indicates perfect equality.
  2. A Gini coefficient of one indicates perfect inequality.
  3. Populations differ widely across the 51 states (including the District of Columbia), so raw counts would mostly reflect population size. Rates per 100,000 individuals make both the FBI-based and the Southern Poverty Law Center-based figures comparable across states.
  4. The FBI reports cover 2010-2015, whereas the Southern Poverty Law Center (SPLC) data cover only Nov. 9-18, 2016, the period just after the election. Analyses could be affected if hate crimes during the election period differed dramatically from typical levels. Also, according to the article, reporting around the time of the election was influenced by awareness bias.
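To make answers 1 and 2 concrete, here is a minimal sketch of one standard definition of the Gini coefficient (half the relative mean absolute difference), applied to made-up incomes. The function name gini and the income vectors are illustrative assumptions. For a finite group of n people, the maximum under this definition is (n - 1)/n, which approaches one as n grows.

```r
# Gini coefficient as half the relative mean absolute difference.
# Assumes x is a numeric vector of non-negative incomes with a positive mean.
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

# Hypothetical incomes: everyone earns the same (perfect equality)
equal_incomes <- rep(50, 4)

# Hypothetical incomes: one person earns everything (maximal inequality for n = 4)
unequal_incomes <- c(0, 0, 0, 100)

gini(equal_incomes)
gini(unequal_incomes)
```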

Question 1: Hate crimes and Trump support

Let’s model the relationship, both visually and via regression, between:

  • \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
  • \(x\): Level of Trump support in the state: low, medium, or high

a) Visual model

Create a visual model of this data (do not forget to include appropriate axes labels and title):

# Write code to plot this model below:
ggplot(hate_crimes, aes(x=trump_support_level, y=hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(x = "Support of Trump", y = "Hate crimes per 100K (SPLC)", title = "Comparison between Support of Trump and Hate Crimes")

b) Regression model

Output the regression table and interpret the results

# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ trump_support_level, data=hate_crimes) %>% 
  tidy()
##                        term   estimate  std.error statistic      p.value
## 1               (Intercept)  0.4601833 0.05294126  8.692337 4.182015e-11
## 2 trump_support_levelmedium -0.2378850 0.07852458 -3.029433 4.091171e-03
## 3   trump_support_levelhigh -0.2691408 0.08003966 -3.362593 1.606977e-03
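
With a categorical predictor, lm() treats the first factor level (here "low") as the baseline: the intercept is the baseline group's mean, and every other coefficient is a difference from that mean. Read this way, the table above gives fitted group means of about 0.460 for low, 0.460 - 0.238 = 0.222 for medium, and 0.460 - 0.269 = 0.191 for high. The sketch below checks this reading on a small made-up data frame (toy is hypothetical, not the hate_crimes data).

```r
# Hypothetical data: three groups of three observations each
toy <- data.frame(
  level = factor(rep(c("low", "medium", "high"), each = 3),
                 levels = c("low", "medium", "high")),
  y = c(0.5, 0.4, 0.6, 0.2, 0.3, 0.1, 0.20, 0.15, 0.25)
)

fit <- lm(y ~ level, data = toy)
group_means <- tapply(toy$y, toy$level, mean)

# Intercept alone reproduces the mean of the baseline ("low") group
mean_low <- coef(fit)[["(Intercept)"]]

# Intercept plus an offset reproduces the mean of each other group
mean_medium <- mean_low + coef(fit)[["levelmedium"]]
mean_high <- mean_low + coef(fit)[["levelhigh"]]

c(low = mean_low, medium = mean_medium, high = mean_high)
group_means
```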

c) Conclusion

  1. Give one sentence as to whether or not your results above are consistent with the FiveThirtyEight article.
  2. Which state was the outlier in the “low” group?

Write your answers here:

  1. My results are not consistent with the FiveThirtyEight article: hate crime rates are high in states with low Trump support, such as Massachusetts, and low in states with high Trump support, such as Louisiana.
  2. The District of Columbia was the outlier in the “low” group.

Question 2: Two simple linear regressions

Create two separate visualizations (do not forget to include appropriate axes labels and title) and run two separate simple linear regressions (using only one predictor) for \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election with

  1. \(x\): the Gini index
  2. \(x\): high school education level

and interpret any slope values.

a) Gini Index

# Write code to plot this model below:
ggplot(hate_crimes, aes(x=gini_index, y=hate_crimes_per_100k_splc)) + 
  geom_point() + 
  labs(x="Gini Index", y="Hate crimes per 100K (SPLC)", title= "Comparison of Gini Index and Hate Crimes") +
  geom_smooth(method="lm", se=FALSE)

# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ gini_index, data=hate_crimes) %>% 
  tidy()
##          term  estimate std.error statistic    p.value
## 1 (Intercept) -1.527463 0.7833043 -1.950025 0.05741966
## 2  gini_index  4.020510 1.7177215  2.340606 0.02374447

b) High school education level

# Write code to plot this model below:
ggplot(hate_crimes, aes(x=share_pop_hs, y=hate_crimes_per_100k_splc)) + 
  geom_point() + 
  labs(x="High School Education Level", y="Hate crimes per 100K (SPLC)", title= "Comparison of High School Education Level and Hate Crimes") +
  geom_smooth(method="lm", se=FALSE)

# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ share_pop_hs, data=hate_crimes) %>% 
  tidy()
##           term  estimate std.error statistic    p.value
## 1  (Intercept) -1.705274 0.9228076 -1.847919 0.07119297
## 2 share_pop_hs  2.320228 1.0647852  2.179057 0.03460305
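
Each slope is the average change in hate crimes per 100K associated with a one-unit increase in the predictor. Assuming gini_index and share_pop_hs are both recorded as proportions between 0 and 1 (an assumption worth checking with ?hate_crimes), a full one-unit change is not realistic, so it can help to rescale to a 0.01 change. A small sketch using the estimates printed in the two tables above:

```r
# Slope estimates copied from the two simple regression tables
slope_gini <- 4.020510
slope_hs <- 2.320228

# A 0.01 step is assumed to be a more realistic change than a full unit
step <- 0.01

# Average change in hate crimes per 100K associated with a 0.01 increase
change_gini <- slope_gini * step
change_hs <- slope_hs * step

c(gini_index = change_gini, share_pop_hs = change_hs)
```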

Question 3: Multiple regression

Run a multiple regression for

  1. \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
  2. Using the following predictor variables simultaneously
    1. \(x_1\): the Gini index
    2. \(x_2\): high school education level

and interpret both slope coefficients.

# Write code to generate a regression table below. No need for a visualization
# here:
lm(hate_crimes_per_100k_splc ~ gini_index + share_pop_hs, data=hate_crimes) %>% 
  tidy()
##           term  estimate std.error statistic      p.value
## 1  (Intercept) -8.211991  1.418930 -5.787453 6.921925e-07
## 2   gini_index  8.702370  1.629755  5.339678 3.117468e-06
## 3 share_pop_hs  5.255865  1.002924  5.240545 4.340152e-06

Write your interpretation below: Holding share_pop_hs constant, each one-unit increase in gini_index is associated with an average increase of 8.702370 in hate_crimes_per_100k_splc. Holding gini_index constant, each one-unit increase in share_pop_hs is associated with an average increase of 5.255865 in hate_crimes_per_100k_splc.
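
The fitted equation behind this table can also be used to compute a predicted value. The sketch below plugs the estimates from the table into a small helper for a hypothetical state. The function name predict_rate and the inputs 0.45 and 0.87 are made up for illustration and are not taken from hate_crimes.

```r
# Coefficient estimates copied from the multiple regression table
b0 <- -8.211991      # intercept
b_gini <- 8.702370   # slope for gini_index
b_hs <- 5.255865     # slope for share_pop_hs

# Predicted hate crimes per 100K for given predictor values
predict_rate <- function(gini_index, share_pop_hs) {
  b0 + b_gini * gini_index + b_hs * share_pop_hs
}

# Hypothetical state: Gini index 0.45, high school share 0.87
base_pred <- predict_rate(0.45, 0.87)

# Raising gini_index by 0.01 while holding share_pop_hs fixed
shifted_pred <- predict_rate(0.46, 0.87)

base_pred
shifted_pred - base_pred
```

The difference between the two predictions equals 0.01 times the gini_index slope, which is what "holding the other predictor constant" means in practice.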

Question 4: Impact of DC on analyses

Create two new data frames:

  1. hate_crimes_no_new_york: the hate_crimes dataset without New York
  2. hate_crimes_no_DC: the hate_crimes data without the District of Columbia

Repeat the multiple regression from Question 3 and indicate the removal of which state from the dataset has a bigger impact on the analysis. Why do you think this is?

# Write code to generate regression tables below:

hate_crimes_no_new_york <- hate_crimes %>%
  filter(state != "New York")

lm(hate_crimes_per_100k_splc ~ gini_index + share_pop_hs, data=hate_crimes_no_new_york) %>% 
  tidy()
##           term  estimate std.error statistic      p.value
## 1  (Intercept) -8.655118  1.448044 -5.977109 3.947701e-07
## 2   gini_index  9.414876  1.706481  5.517131 1.833887e-06
## 3 share_pop_hs  5.399269  1.001034  5.393695 2.763474e-06

hate_crimes_no_DC <- hate_crimes %>%
  filter(state != "District of Columbia")

lm(hate_crimes_per_100k_splc ~ gini_index + share_pop_hs, data=hate_crimes_no_DC) %>% 
  tidy()
##           term  estimate std.error statistic     p.value
## 1  (Intercept) -3.989258 1.5083121 -2.644849 0.011365187
## 2   gini_index  3.135572 1.8352715  1.708506 0.094752663
## 3 share_pop_hs  3.284001 0.9432964  3.481409 0.001157652

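Write your response here: The removal of the District of Columbia has the bigger impact on the analysis. Without New York, the estimates change only slightly (the gini_index slope moves from 8.702370 to 9.414876) and all p-values remain significant. Without DC, the gini_index slope drops to 3.135572 and its p-value (0.0948) is no longer significant at the 0.05 level. This is likely because DC, the outlier identified in Question 1, has an unusually high hate crime rate and therefore pulls the fitted model strongly.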
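
One way to compare the effect of each removal is to measure how far each slope moves from the fit on all data. The sketch below does this with the estimates copied from the three regression tables; the data frame slopes and its column names are illustrative.

```r
# Slope estimates copied from the regression tables in Questions 3 and 4
slopes <- data.frame(
  fit = c("all data", "no New York", "no DC"),
  gini_index = c(8.702370, 9.414876, 3.135572),
  share_pop_hs = c(5.255865, 5.399269, 3.284001)
)

# Change in each slope relative to the fit on all data (row 1)
slopes$gini_change <- slopes$gini_index - slopes$gini_index[1]
slopes$hs_change <- slopes$share_pop_hs - slopes$share_pop_hs[1]

slopes
```

In this comparison, dropping DC moves both slopes much further from the all-data fit than dropping New York does.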