Please indicate who you collaborated with on this problem set:
We will again use the hate crimes data we used in Problem Set 02 & 07. The FiveThirtyEight article article about those data are in the Jan 23, 2017 “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. This week, we will use these data to run regression models with a single categorical predictor (explanatory) variable and a single numeric predictor (explanatory) variable.
First load the necessary packages
library(ggplot2)
library(dplyr)
library(moderndive)
library(fivethirtyeight)Next let’s explore the hate_crimes dataset in the fivethirtyeight package using the glimpse() function from the dplyr package:
glimpse(hate_crimes)## Observations: 51
## Variables: 12
## $ state <chr> "Alabama", "Alaska", "Arizona", "A...
## $ median_house_inc <int> 42278, 67629, 49254, 44922, 60487,...
## $ share_unemp_seas <dbl> 0.060, 0.064, 0.063, 0.052, 0.059,...
## $ share_pop_metro <dbl> 0.64, 0.63, 0.90, 0.69, 0.97, 0.80...
## $ share_pop_hs <dbl> 0.821, 0.914, 0.842, 0.824, 0.806,...
## $ share_non_citizen <dbl> 0.02, 0.04, 0.10, 0.04, 0.13, 0.06...
## $ share_white_poverty <dbl> 0.12, 0.06, 0.09, 0.12, 0.09, 0.07...
## $ gini_index <dbl> 0.472, 0.422, 0.455, 0.458, 0.471,...
## $ share_non_white <dbl> 0.35, 0.42, 0.49, 0.26, 0.61, 0.31...
## $ share_vote_trump <dbl> 0.63, 0.53, 0.50, 0.60, 0.33, 0.44...
## $ hate_crimes_per_100k_splc <dbl> 0.12583893, 0.14374012, 0.22531995...
## $ avg_hatecrimes_per_100k_fbi <dbl> 1.8064105, 1.6567001, 3.4139280, 0...
You should also use the View() function to take a look at the data in the viewer, Recall we can’t have View() in an R Markdown document! And finally, type ?hate_crimes into the console to see a description of the variables in this data set.
We will next add a new column to this data set that expresses the Share of the population that lives in metropolitan areas, 2015 categorical variable. Run this code below.
hate_crimes <- hate_crimes %>%
mutate(urbanization = cut_number(share_pop_metro, 2, labels = c("low", "high")))Recall, the cut_numbers function sorts the share_pop_metro variable from lowest to highest, cuts it into two groups. It categorizes all the lowest values as "low", and the largest values as "high". Finally, let’s express the share of adults 25 and older with a high-school degree in 2009 as a percentage.
hate_crimes <- hate_crimes %>%
mutate(prct_hs = share_pop_hs *100)Let’s model the relationship between:
prct_hs variable we mutated above.low, or high, as contained in the variable urbanization we mutated above.First, visualize the parallel slopes model for this data that will allow you to conduct an “eyeball test” of the relationship between all three variables (see Lec22 for instructions on how to update the moderndive package to include the gg_parallel_slopes function that does this):
gg_parallel_slopes(y = "median_house_inc", num_x = "prct_hs", cat_x = "urbanization",
data = hate_crimes, alpha = 0.50) +
labs(x = "education", y = "income",
title = "income in 2016: Parallel slopes model")Comment on the relationship between the variables.
Write your answer below:
In general, those with households in lower levels of urbanization have lower levels of income.
Now run a model that examines the relationship between median household income, high-school education, and urbanization. Show the regression table.
inc_model1 <- lm(median_house_inc ~ prct_hs+urbanization, data = hate_crimes)
get_regression_table(inc_model1)| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -119929.869 | 22994.810 | -5.216 | 0 | -166164.033 | -73695.706 |
| prct_hs | 1977.508 | 262.257 | 7.540 | 0 | 1450.206 | 2504.811 |
| urbanizationhigh | 6980.048 | 1772.649 | 3.938 | 0 | 3415.897 | 10544.198 |
Let’s now quantify the relationship between the outcome and explanatory variables.
get_regression_points function to look at the predicted values for median income. We have a categorical value this week, just like last week. Why don’t we only have two unique predicted values again this week?Write your answers below:
Show your work for each question
Write your answers below:
# You don't necessarily need to use this code block, but rather it is optional:
hate_crimes %>%
filter(state == "Alaska")| state | median_house_inc | share_unemp_seas | share_pop_metro | share_pop_hs | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_vote_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | urbanization | prct_hs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alaska | 67629 | 0.064 | 0.63 | 0.914 | 0.04 | 0.06 | 0.422 | 0.42 | 0.53 | 0.1437401 | 1.6567 | low | 91.4 |
Imagine that federal budget cuts caused a reduction across all states of ~ 2% in high school graduation rates. We can make a new data set that reflects these budget cuts like so:
budget_cuts <- hate_crimes %>%
mutate(prct_hs = prct_hs - 2) %>%
select(urbanization, prct_hs)Generate a table that shows the predicted median household income for each state, based on our previous model. (For the sake of the exercise, assume that it is OK to make predictions just slightly outside of the range of percent HS education values used in our original model)
get_regression_points(inc_model1, newdata = budget_cuts)| ID | prct_hs | urbanization | median_house_inc_hat |
|---|---|---|---|
| 1 | 80.1 | low | 38468.56 |
| 2 | 89.4 | low | 56859.39 |
| 3 | 82.2 | high | 49601.37 |
| 4 | 80.4 | low | 39061.81 |
| 5 | 78.6 | high | 42482.34 |
| 6 | 87.3 | high | 59686.67 |
| 7 | 86.6 | high | 58302.41 |
| 8 | 85.4 | high | 55929.40 |
| 9 | 85.1 | high | 55336.15 |
| 10 | 83.3 | high | 51776.63 |
| 11 | 81.9 | high | 49008.12 |
| 12 | 88.4 | low | 54881.88 |
| 13 | 86.4 | low | 50926.86 |
| 14 | 84.4 | high | 53951.89 |
| 15 | 84.6 | low | 47367.35 |
| 16 | 89.4 | low | 56859.39 |
| 17 | 87.7 | low | 53497.62 |
| 18 | 79.7 | low | 37677.55 |
| 19 | 80.2 | high | 45646.36 |
| 20 | 88.2 | low | 54486.38 |
| 21 | 87.0 | high | 59093.41 |
| 22 | 87.0 | high | 59093.41 |
| 23 | 85.9 | high | 56918.15 |
| 24 | 89.5 | low | 57057.14 |
| 25 | 78.4 | low | 35106.79 |
| 26 | 84.8 | low | 47762.85 |
| 27 | 88.8 | low | 55672.88 |
| 28 | 87.8 | low | 53695.37 |
| 29 | 81.9 | high | 49008.12 |
| 30 | 89.3 | low | 56661.64 |
| 31 | 85.4 | high | 55929.40 |
| 32 | 80.8 | low | 39852.81 |
| 33 | 82.7 | high | 50590.13 |
| 34 | 82.3 | low | 42819.08 |
| 35 | 88.1 | low | 54288.62 |
| 36 | 85.6 | low | 49344.85 |
| 37 | 83.6 | low | 45389.84 |
| 38 | 87.1 | high | 59291.16 |
| 39 | 85.9 | high | 56918.15 |
| 40 | 82.7 | high | 50590.13 |
| 41 | 81.6 | low | 41434.82 |
| 42 | 87.9 | low | 53893.12 |
| 43 | 81.1 | high | 47426.11 |
| 44 | 77.9 | high | 41098.09 |
| 45 | 88.4 | high | 61861.93 |
| 46 | 89.0 | low | 56068.38 |
| 47 | 84.6 | high | 54347.39 |
| 48 | 87.7 | high | 60477.67 |
| 49 | 80.8 | low | 39852.81 |
| 50 | 87.8 | low | 53695.37 |
| 51 | 89.8 | low | 57650.39 |