Collaboration

Please indicate who you collaborated with on this problem set:

I am forever grateful to Alexis Kilayko, my shinyuu, for helping me with this assignment.

Background

Think back to the hate crimes data we used in Problem Set 02. The FiveThirtyEight article about those data is the Jan 23, 2017 piece “Higher Rates Of Hate Crimes Are Tied To Income Inequality”.

We will use these data in this Problem Set to run regression models with a single categorical predictor (explanatory) variable.

Setup

First, load the necessary packages:

library(ggplot2)
library(dplyr)
library(moderndive)
library(fivethirtyeight)

Next let’s explore the hate_crimes dataset in the fivethirtyeight package using the glimpse() function from the dplyr package:

glimpse(hate_crimes)
## Observations: 51
## Variables: 12
## $ state                       <chr> "Alabama", "Alaska", "Arizona", "A...
## $ median_house_inc            <int> 42278, 67629, 49254, 44922, 60487,...
## $ share_unemp_seas            <dbl> 0.060, 0.064, 0.063, 0.052, 0.059,...
## $ share_pop_metro             <dbl> 0.64, 0.63, 0.90, 0.69, 0.97, 0.80...
## $ share_pop_hs                <dbl> 0.821, 0.914, 0.842, 0.824, 0.806,...
## $ share_non_citizen           <dbl> 0.02, 0.04, 0.10, 0.04, 0.13, 0.06...
## $ share_white_poverty         <dbl> 0.12, 0.06, 0.09, 0.12, 0.09, 0.07...
## $ gini_index                  <dbl> 0.472, 0.422, 0.455, 0.458, 0.471,...
## $ share_non_white             <dbl> 0.35, 0.42, 0.49, 0.26, 0.61, 0.31...
## $ share_vote_trump            <dbl> 0.63, 0.53, 0.50, 0.60, 0.33, 0.44...
## $ hate_crimes_per_100k_splc   <dbl> 0.12583893, 0.14374012, 0.22531995...
## $ avg_hatecrimes_per_100k_fbi <dbl> 1.8064105, 1.6567001, 3.4139280, 0...

You should also use the View() function to take a look at the data in the viewer. Recall that we can’t have View() in an R Markdown document! Finally, type ?hate_crimes into the console to see a description of the variables in this data set.

Data manipulation

Next, we will add a new column to this data set that expresses the share of 2016 U.S. presidential voters who voted for Trump as a categorical variable. Run the code below.

hate_crimes <- hate_crimes %>% 
  mutate(trump_support = cut_number(share_vote_trump, 3, labels = c("low", "medium", "high")))

The cut_number() function used in the code above sorts the share_vote_trump variable from lowest to highest and cuts it into three groups of roughly 17 states each (51/3). It labels the lowest group of values “low”, the middle group “medium”, and the top group “high”. We have created a categorical variable!
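To see the binning described above on a small scale, here is a minimal sketch that applies cut_number() to nine made-up vote shares. The vector `share` and its values are hypothetical and are not taken from the hate_crimes data; only cut_number() from ggplot2 and base R functions are used.

```r
library(ggplot2)  # provides cut_number()

# Hypothetical vote shares for nine made-up states (NOT the real data)
share <- c(0.33, 0.63, 0.50, 0.44, 0.60, 0.53, 0.41, 0.70, 0.36)

# Split the values into 3 bins with (approximately) equal counts,
# labeled from lowest to highest
support <- cut_number(share, 3, labels = c("low", "medium", "high"))

# Count how many values landed in each bin
table(support)

# Show each value next to its assigned level, sorted by value
data.frame(share, support)[order(share), ]
```

The same kind of check could presumably be run on the real data with count(hate_crimes, trump_support) to confirm that the three groups are about the same size.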

Question 1: Trump support level

Let’s model the relationship between:

  • \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election as measured by the SPLC
  • \(x\): Level of Trump support in the state: low, medium, or high, as contained in the variable trump_support we created above.

a) Visual model

  1. Create a visual model of this data (a graph) that will allow you to conduct an “eyeball test” of the relationship between hate crimes per 100K and level of Trump support. Include appropriate axis labels and a title.
  2. Comment on the relationship between these two variables.

# Boxplot of SPLC hate crime rates for each level of Trump support
ggplot(hate_crimes, aes(x = trump_support, y = hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(title = "Hate crimes per 100k by level of Trump support",
       x = "Level of Trump support", y = "Hate crimes per 100k (SPLC)")

b) Regression model

Now run a model that examines the relationship between hate crime rates and the level of Trump support. Generate a regression table.

trumpsupport1 <- lm(hate_crimes_per_100k_splc ~ trump_support, data = hate_crimes)
get_regression_table(trumpsupport1)
term                 estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept               0.460      0.053      8.692    0.000     0.353     0.567
trump_supportmedium    -0.238      0.079     -3.029    0.004    -0.396    -0.080
trump_supporthigh      -0.269      0.080     -3.363    0.002    -0.430    -0.108

  1. What does the intercept mean in this regression table?
  2. What does the model estimate as the number of hate crimes per 100,000 people in states with “low” Trump support?
  3. Does the model estimate that hate crimes are more frequent in states that show “low” or “medium” support for Trump?
  4. What does the model predict as the number of hate crimes per 100,000 people in states with “high” Trump support?
  5. What are the three possible fitted values \(\widehat{y}\) for this model? (Hint: use the get_regression_points() function to explore this if you are not sure!)

Write your answers here (if possible in an enumerated list just like above):

  1. The intercept is the average number of hate crimes per 100k people in states with “low” Trump support, which is the baseline group.
  2. The model estimates that states with “low” Trump support have, on average, 0.460 hate crimes per 100k people.
  3. The model estimates that hate crimes are more frequent in states with “low” support for Trump (0.460 per 100k) than in states with “medium” support (0.460 - 0.238 = 0.222 per 100k).
  4. In states with “high” Trump support, the model predicts a lower frequency of hate crimes: on average, 0.460 - 0.269 = 0.191 hate crimes per 100k people.
  5. The three fitted \(\widehat{y}\) values are 0.460 hate crimes per 100k people for “low” Trump support, 0.222 for “medium”, and 0.191 for “high”.
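The arithmetic behind answers 2, 4, and 5 can be written out as a worked equation. This is a sketch that relies on “low” being the baseline level, as the regression table's row names suggest; the indicator notation \(\mathbb{1}_{\text{medium}}(x)\) and \(\mathbb{1}_{\text{high}}(x)\), equal to 1 when a state is in that group and 0 otherwise, is introduced here for illustration and does not appear in the problem set.

\[
\widehat{y} = 0.460 - 0.238 \cdot \mathbb{1}_{\text{medium}}(x) - 0.269 \cdot \mathbb{1}_{\text{high}}(x)
\]

\[
\widehat{y}_{\text{low}} = 0.460, \qquad
\widehat{y}_{\text{medium}} = 0.460 - 0.238 = 0.222, \qquad
\widehat{y}_{\text{high}} = 0.460 - 0.269 = 0.191
\]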

c) Questions

For these questions, showing your work is optional; solve it any way you choose.

  1. Which 5 states had the highest rate of hate crimes? Describe levels of Trump support in these 5 states.
  2. Which 5 states had the lowest rate of hate crimes? Describe levels of Trump support in these 5 states.
  3. Do these results surprise you? There is no right answer to this question.

Write your answers here (if possible in an enumerated list just like above):

# Sort states from highest to lowest SPLC hate crime rate
highest5 <- hate_crimes %>%
  select(state, hate_crimes_per_100k_splc, trump_support) %>%
  arrange(desc(hate_crimes_per_100k_splc))
head(highest5)  # head() shows 6 rows by default; the top 5 are the first five

state                 hate_crimes_per_100k_splc  trump_support
District of Columbia                  1.5223017  low
Oregon                                0.8328496  low
Washington                            0.6774876  low
Massachusetts                         0.6308106  low
Minnesota                             0.6274799  low
Maine                                 0.6155740  low

# Sort states from lowest to highest SPLC hate crime rate
lowest5 <- hate_crimes %>%
  select(state, hate_crimes_per_100k_splc, trump_support) %>%
  arrange(hate_crimes_per_100k_splc)
head(lowest5)  # again 6 rows; the bottom 5 are the first five

state                 hate_crimes_per_100k_splc  trump_support
Mississippi                           0.0674468  high
Arkansas                              0.0690608  high
New Jersey                            0.0783059  low
Rhode Island                          0.0954016  low
Kansas                                0.1051525  high
Louisiana                             0.1097333  high

  1. The five states with the highest rates of hate crimes are the District of Columbia, Oregon, Washington, Massachusetts, and Minnesota. All five are in the “low” Trump support group.

  2. The five states with the lowest rates of hate crimes are Mississippi, Arkansas, New Jersey, Rhode Island, and Kansas. Mississippi, Arkansas, and Kansas are in the “high” Trump support group; New Jersey and Rhode Island are in the “low” group.

  3. The results do surprise me. The media likes to portray Trump and his supporters as the root of all hate, so it is interesting to see that the data show quite the opposite. At the same time, I find it hard to believe that a single variable like Trump support could correlate with something like hate crime. I also think it is important to note that D.C.’s value skews the graph’s trend quite a lot: its rate of about 1.52 per 100k is almost twice that of the next-highest state (Oregon, about 0.83). The trend might not have been as strong without D.C.’s high hate crime rate in the “low” Trump support group.

Question 2

For this exercise, we will model the relationship between

  • \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election as measured by the SPLC
  • \(x\): Level of unemployment in the state: low or high. We will create this categorical variable.

Using the tools and code examples from above, complete the following tasks:

a) Data manipulation

  1. Make a new categorical variable called unemployment that has two levels, “low” and “high”, based on the variable share_unemp_seas.

hate_crimes1 <- hate_crimes %>%
  mutate(unemployment = cut_number(share_unemp_seas, 2, labels = c("low", "high")))

b) Visual data exploration

Create a visual model of this data (a graph) that will allow you to conduct an “eyeball test” of the relationship between hate crimes per 100K and unemployment level. Include appropriate axis labels and a title.

# Boxplot of SPLC hate crime rates for each unemployment level
ggplot(hate_crimes1, aes(x = unemployment, y = hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(title = "Hate crimes per 100k by unemployment level",
       x = "Unemployment level", y = "Hate crimes per 100k (SPLC)")

c) Running the model

Now run a model that examines the relationship between hate crime rates and the unemployment level. Generate a regression table.

unemployment1 <- lm(hate_crimes_per_100k_splc ~ unemployment, data = hate_crimes1)
get_regression_table(unemployment1)
term              estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept            0.320      0.053      6.020    0.000     0.213     0.427
unemploymenthigh    -0.031      0.074     -0.421    0.676    -0.181     0.119

d) Interpreting the model

Answer the following questions:

  1. What does the intercept mean in this regression table?
  2. What does the model estimate as the number of hate crimes per 100,000 people in states with “high” unemployment?
  3. What are the two possible fitted values \(\widehat{y}\) for this model? Why are there only two this time? (There were three in the last question.)

Answer the questions here:

  1. The intercept is the average number of hate crimes per 100k people in states with “low” unemployment (the baseline group).
  2. The model estimates that states with “high” unemployment have an average of 0.320 - 0.031 = 0.289 hate crimes per 100k people.
  3. The two possible \(\widehat{y}\) values are 0.320 for “low” unemployment states and 0.289 for “high” unemployment states. There are only two possible fitted values because the categorical explanatory variable in this example has only two levels (“low” and “high”), whereas trump_support in the previous question had three levels.
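A model with a single categorical predictor produces one fitted value per level of that predictor, and each fitted value is that group's mean. The sketch below illustrates this with base R on a small made-up data set; the data frame `toy`, its column names, and its numbers are hypothetical and are not drawn from hate_crimes.

```r
# Hypothetical toy data: a two-level factor and a numeric outcome (made-up numbers)
toy <- data.frame(
  group = factor(c("low", "low", "low", "high", "high", "high"),
                 levels = c("low", "high")),
  y = c(0.30, 0.40, 0.26, 0.20, 0.35, 0.32)
)

# Regression with one categorical predictor; "low" is the baseline level
fit <- lm(y ~ group, data = toy)

# Group means computed directly from the data
group_means <- tapply(toy$y, toy$group, mean)

# Intercept = mean of the baseline group ("low");
# the "grouphigh" coefficient = mean("high") - mean("low")
coef(fit)
group_means

# One distinct fitted value per level of the predictor
unique(round(fitted(fit), 6))
```

With three levels, as in Question 1, the same reasoning would give three distinct fitted values.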

e) Interpreting residuals

Use the get_regression_points() function to generate a table showing the predictions and the residuals. How are the residuals calculated here?

head(get_regression_points(unemployment1))
ID  hate_crimes_per_100k_splc  unemployment  hate_crimes_per_100k_splc_hat  residual
 1                      0.126  high                                  0.289    -0.163
 2                      0.144  high                                  0.289    -0.145
 3                      0.225  high                                  0.289    -0.063
 4                      0.069  high                                  0.289    -0.220
 5                      0.256  high                                  0.289    -0.033
 6                      0.391  low                                   0.320     0.070

Answer the question here: Residuals are calculated by taking the difference between the observed hate crime rate and the predicted (fitted) hate crime rate, i.e., \(y - \widehat{y}\).
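The residual calculation can be checked by hand against the six rows shown in the table above. The sketch below uses only the rounded values printed there; the vector names `observed`, `fitted_value`, and `residual_by_hand` are chosen here for illustration.

```r
# First six rows of the regression-points table above
# (values as printed, already rounded to 3 digits)
observed     <- c(0.126, 0.144, 0.225, 0.069, 0.256, 0.391)
fitted_value <- c(0.289, 0.289, 0.289, 0.289, 0.289, 0.320)

# Residual = observed value minus fitted value
residual_by_hand <- round(observed - fitted_value, 3)
residual_by_hand
```

Rows 3 and 6 come out as -0.064 and 0.071 rather than the -0.063 and 0.070 printed in the table. This is most likely a rounding effect: the table's residuals appear to be computed from the unrounded values and rounded afterwards, while this check starts from values that were already rounded.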