Please indicate who you collaborated with on this problem set:
I am forever grateful to Alexis Kilayko, my shinyuu, for helping me with this assignment.
Think back to the hate crimes data we used in Problem Set 02. The FiveThirtyEight article about those data is the Jan 23, 2017 piece “Higher Rates Of Hate Crimes Are Tied To Income Inequality”.
We will use these data in this Problem Set to run regression models with a single categorical predictor (explanatory) variable.
First, load the necessary packages:
library(ggplot2)
library(dplyr)
library(moderndive)
library(fivethirtyeight)
Next let’s explore the hate_crimes dataset in the fivethirtyeight package using the glimpse() function from the dplyr package:
glimpse(hate_crimes)
## Observations: 51
## Variables: 12
## $ state <chr> "Alabama", "Alaska", "Arizona", "A...
## $ median_house_inc <int> 42278, 67629, 49254, 44922, 60487,...
## $ share_unemp_seas <dbl> 0.060, 0.064, 0.063, 0.052, 0.059,...
## $ share_pop_metro <dbl> 0.64, 0.63, 0.90, 0.69, 0.97, 0.80...
## $ share_pop_hs <dbl> 0.821, 0.914, 0.842, 0.824, 0.806,...
## $ share_non_citizen <dbl> 0.02, 0.04, 0.10, 0.04, 0.13, 0.06...
## $ share_white_poverty <dbl> 0.12, 0.06, 0.09, 0.12, 0.09, 0.07...
## $ gini_index <dbl> 0.472, 0.422, 0.455, 0.458, 0.471,...
## $ share_non_white <dbl> 0.35, 0.42, 0.49, 0.26, 0.61, 0.31...
## $ share_vote_trump <dbl> 0.63, 0.53, 0.50, 0.60, 0.33, 0.44...
## $ hate_crimes_per_100k_splc <dbl> 0.12583893, 0.14374012, 0.22531995...
## $ avg_hatecrimes_per_100k_fbi <dbl> 1.8064105, 1.6567001, 3.4139280, 0...
You should also use the View() function to take a look at the data in the viewer. Recall that we can’t have View() in an R Markdown document! Finally, type ?hate_crimes into the console to see a description of the variables in this data set.
We will next add a new column to this data set that expresses the Share of 2016 U.S. presidential voters who voted for Trump as a categorical variable. Run this code below.
hate_crimes <- hate_crimes %>%
  mutate(trump_support = cut_number(share_vote_trump, 3, labels = c("low", "medium", "high")))
The cut_number() function used in the code above sorts the share_vote_trump variable from lowest to highest and cuts it into three groups of roughly 17 states each (51/3). It labels the lowest third of values “low”, the middle third “medium”, and the top third “high”. We have created a categorical variable!
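To make the grouping step concrete, here is a minimal base-R sketch of the same idea: compute quantile breakpoints, then cut at them. The vector toy_share is made up for illustration (it is not taken from hate_crimes), and the quantile-then-cut approach is an approximation of what cut_number() does, not its exact source.

```r
# Hypothetical toy values standing in for share_vote_trump (not real data)
toy_share <- c(0.33, 0.41, 0.44, 0.50, 0.53, 0.57, 0.60, 0.63, 0.70)

# Breakpoints at the minimum, the two tercile cut points, and the maximum
breaks <- quantile(toy_share, probs = seq(0, 1, length.out = 4))

# include.lowest = TRUE keeps the minimum value in the first group
toy_support <- cut(toy_share, breaks = breaks,
                   labels = c("low", "medium", "high"),
                   include.lowest = TRUE)

# Count how many values landed in each group
table(toy_support)
```

With nine distinct values, each group receives three of them, which mirrors the roughly 17-17-17 split of the 51 rows in hate_crimes.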
Let’s model the relationship between:
hate crimes per 100k, as contained in the variable hate_crimes_per_100k_splc, and
the level of Trump support (low, medium, or high), as contained in the variable trump_support we created above.
ggplot(hate_crimes, aes(x = trump_support, y = hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(x = "levels of Trump support", y = "hate crimes per 100k")
Now run a model that examines the relationship between hate crime rates and the level of Trump support. Generate a regression table.
trumpsupport1 <- lm(hate_crimes_per_100k_splc ~ trump_support, data = hate_crimes)
get_regression_table(trumpsupport1)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 0.460 | 0.053 | 8.692 | 0.000 | 0.353 | 0.567 |
| trump_supportmedium | -0.238 | 0.079 | -3.029 | 0.004 | -0.396 | -0.080 |
| trump_supporthigh | -0.269 | 0.080 | -3.363 | 0.002 | -0.430 | -0.108 |
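With a categorical predictor, the intercept is the mean of the baseline group (here “low”, the first factor level), and each other estimate is an offset from that baseline. The short sketch below turns the table’s estimates into group means; the variable names are my own, and the numbers are simply copied from the table above.

```r
# Estimates copied from the regression table above
intercept <- 0.460   # mean rate for the baseline group, "low"
b_medium  <- -0.238  # offset for "medium" relative to "low"
b_high    <- -0.269  # offset for "high" relative to "low"

# Each group's fitted mean is the intercept plus that group's offset
group_means <- c(low    = intercept,
                 medium = intercept + b_medium,
                 high   = intercept + b_high)
round(group_means, 3)
```

This gives fitted means of about 0.460, 0.222, and 0.191 hate crimes per 100k for the low, medium, and high groups respectively.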
Use the get_regression_points() function to explore this if you are not sure!
Write your answers here (if possible in an enumerated list just like above):
For these questions, showing your work is optional; solve it any way you choose.
Write your answers here (if possible in an enumerated list just like above):
highest5 <- hate_crimes %>%
select(state, hate_crimes_per_100k_splc, trump_support) %>%
arrange(desc(hate_crimes_per_100k_splc))
head(highest5)
| state | hate_crimes_per_100k_splc | trump_support |
|---|---|---|
| District of Columbia | 1.5223017 | low |
| Oregon | 0.8328496 | low |
| Washington | 0.6774876 | low |
| Massachusetts | 0.6308106 | low |
| Minnesota | 0.6274799 | low |
| Maine | 0.6155740 | low |
lowest5 <- hate_crimes %>%
select(state, hate_crimes_per_100k_splc, trump_support) %>%
arrange(hate_crimes_per_100k_splc)
head(lowest5)
| state | hate_crimes_per_100k_splc | trump_support |
|---|---|---|
| Mississippi | 0.0674468 | high |
| Arkansas | 0.0690608 | high |
| New Jersey | 0.0783059 | low |
| Rhode Island | 0.0954016 | low |
| Kansas | 0.1051525 | high |
| Louisiana | 0.1097333 | high |
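Both tables show six rows because head() returns six rows by default, while the question asks for five. A small sketch of the fix, using a data frame rebuilt by hand from the highest-rates table earlier in this section (the object names highest and top5 are my own):

```r
# The six highest rates, copied from the highest-rates table
highest <- data.frame(
  state = c("District of Columbia", "Oregon", "Washington",
            "Massachusetts", "Minnesota", "Maine"),
  hate_crimes_per_100k_splc = c(1.5223017, 0.8328496, 0.6774876,
                                0.6308106, 0.6274799, 0.6155740),
  trump_support = "low"
)

# head() returns six rows by default; n = 5 keeps only the top five
top5 <- head(highest, n = 5)
top5$state
```

The same n = 5 argument would apply to head(highest5) and head(lowest5) in the code above.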
5 states with the highest rate of hate crimes:
1. District of Columbia, “low” Trump support
2. Oregon, “low” Trump support
3. Washington, “low” Trump support
4. Massachusetts, “low” Trump support
5. Minnesota, “low” Trump support
5 states with the lowest rate of hate crimes:
1. Mississippi, “high” Trump support
2. Arkansas, “high” Trump support
3. New Jersey, “low” Trump support
4. Rhode Island, “low” Trump support
5. Kansas, “high” Trump support
The results do surprise me. Because the media likes to portray Trump and his supporters as the root of all hate, it’s interesting to see that the data show quite the opposite. At the same time, I find it hard to believe that a single variable, like Trump support, could correlate so strongly with something like hate crime. I also think it is important to note that D.C.’s value skews the graph’s trend quite a lot. The trend might not have been as strong if not for D.C.’s very high hate crime rate in the “low” Trump support group.
For this exercise, we will model the relationship between
hate crimes per 100k, and
the level of unemployment: low or high. We will create this categorical variable.
Using the tools and code examples from above, complete the following tasks:
Create a new categorical variable called unemployment that has two levels, “low” and “high”, based on the variable share_unemp_seas.
hate_crimes1 <- hate_crimes %>%
  mutate(unemployment = cut_number(share_unemp_seas, 2, labels = c("low", "high")))
Create a visual model of this data (a graph) that will allow you to conduct an “eyeball test” of the relationship between hate crimes per 100K and unemployment level. Include appropriate axes labels and a title.
ggplot(hate_crimes1, aes(x = unemployment, y = hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(x = "unemployment level", y = "hate crimes per 100k",
       title = "Hate crimes per 100k by unemployment level")
Now run a model that examines the relationship between hate crime rates and the unemployment level. Generate a regression table.
unemployment1 <- lm(hate_crimes_per_100k_splc ~ unemployment, data = hate_crimes1)
get_regression_table(unemployment1)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 0.320 | 0.053 | 6.020 | 0.000 | 0.213 | 0.427 |
| unemploymenthigh | -0.031 | 0.074 | -0.421 | 0.676 | -0.181 | 0.119 |
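One way to read this table is to convert it into the two group means and check whether the confidence interval for the difference contains zero. The sketch below does that with numbers copied from the table; the variable names are my own, and the interval is assumed to be the 95% interval that get_regression_table() reports by default.

```r
# Values copied from the regression table above
intercept <- 0.320             # mean rate for the baseline group, "low"
b_high    <- -0.031            # offset for "high" relative to "low"
ci_high   <- c(-0.181, 0.119)  # confidence interval for the offset

mean_low  <- intercept
mean_high <- intercept + b_high

# TRUE when zero lies inside the interval, i.e. "no difference" is plausible
ci_includes_zero <- ci_high[1] < 0 && ci_high[2] > 0

c(low = mean_low, high = mean_high)
ci_includes_zero
```

The fitted means are about 0.320 and 0.289, and the interval spans zero, which is consistent with the large p-value (0.676) in the table: these data give little evidence of a difference between the two unemployment groups.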
Answer the following questions:
Answer the questions here:
Use the get_regression_points() function to generate a table showing the predictions and the residuals. How are the residuals calculated here?
head(get_regression_points(unemployment1))
| ID | hate_crimes_per_100k_splc | unemployment | hate_crimes_per_100k_splc_hat | residual |
|---|---|---|---|---|
| 1 | 0.126 | high | 0.289 | -0.163 |
| 2 | 0.144 | high | 0.289 | -0.145 |
| 3 | 0.225 | high | 0.289 | -0.063 |
| 4 | 0.069 | high | 0.289 | -0.220 |
| 5 | 0.256 | high | 0.289 | -0.033 |
| 6 | 0.391 | low | 0.320 | 0.070 |
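Answer the question here: Residuals are calculated by taking the difference between the observed hate crime rate and the predicted hate crime rate, i.e., \(y - \widehat{y}\). Here the predicted value is the mean of the state’s unemployment group (0.320 for “low”, 0.289 for “high”).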
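As a quick check of that formula, the sketch below recomputes the residuals from the six rows of the get_regression_points() output shown above. The vector names are my own. Because the table rounds each column to three decimals, the recomputed values can differ from the reported ones by about 0.001.

```r
# First six rows of the get_regression_points() output above
observed  <- c(0.126, 0.144, 0.225, 0.069, 0.256, 0.391)
predicted <- c(0.289, 0.289, 0.289, 0.289, 0.289, 0.320)
reported  <- c(-0.163, -0.145, -0.063, -0.220, -0.033, 0.070)

# Residual = observed value minus fitted value
recomputed <- observed - predicted

# Largest gap between recomputed and reported residuals
max(abs(recomputed - reported))
```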