Collaboration

Please indicate who you collaborated with on this problem set:

Background

We will again use the hate crimes data we used in Problem Set 02 & 07. The FiveThirtyEight article article about those data are in the Jan 23, 2017 “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. This week, we will use these data to run regression models with a single categorical predictor (explanatory) variable and a single numeric predictor (explanatory) variable.

Setup

First load the necessary packages

library(ggplot2)
library(dplyr)
library(moderndive)
library(fivethirtyeight)

Next let’s explore the hate_crimes dataset in the fivethirtyeight package using the glimpse() function from the dplyr package:

glimpse(hate_crimes)
## Observations: 51
## Variables: 12
## $ state                       <chr> "Alabama", "Alaska", "Arizona", "A...
## $ median_house_inc            <int> 42278, 67629, 49254, 44922, 60487,...
## $ share_unemp_seas            <dbl> 0.060, 0.064, 0.063, 0.052, 0.059,...
## $ share_pop_metro             <dbl> 0.64, 0.63, 0.90, 0.69, 0.97, 0.80...
## $ share_pop_hs                <dbl> 0.821, 0.914, 0.842, 0.824, 0.806,...
## $ share_non_citizen           <dbl> 0.02, 0.04, 0.10, 0.04, 0.13, 0.06...
## $ share_white_poverty         <dbl> 0.12, 0.06, 0.09, 0.12, 0.09, 0.07...
## $ gini_index                  <dbl> 0.472, 0.422, 0.455, 0.458, 0.471,...
## $ share_non_white             <dbl> 0.35, 0.42, 0.49, 0.26, 0.61, 0.31...
## $ share_vote_trump            <dbl> 0.63, 0.53, 0.50, 0.60, 0.33, 0.44...
## $ hate_crimes_per_100k_splc   <dbl> 0.12583893, 0.14374012, 0.22531995...
## $ avg_hatecrimes_per_100k_fbi <dbl> 1.8064105, 1.6567001, 3.4139280, 0...

You should also use the View() function to take a look at the data in the viewer, Recall we can’t have View() in an R Markdown document! And finally, type ?hate_crimes into the console to see a description of the variables in this data set.

Data manipulation

We will next add a new column to this data set that expresses the Share of the population that lives in metropolitan areas, 2015 categorical variable. Run this code below.

hate_crimes <- hate_crimes %>% 
  mutate(urbanization = cut_number(share_pop_metro, 2, labels = c("low", "high")))

Recall, the cut_numbers function sorts the share_pop_metro variable from lowest to highest, cuts it into two groups. It categorizes all the lowest values as "low", and the largest values as "high". Finally, let’s express the share of adults 25 and older with a high-school degree in 2009 as a percentage.

hate_crimes <- hate_crimes %>% 
  mutate(prct_hs = share_pop_hs *100)

Question 1:

Let’s model the relationship between:

  • \(y\): Median household income in 2016
  • \(x_1\): numerical variable percent of adults 25 and older with a high-school degree in 2009, contained in the prct_hs variable we mutated above.
  • \(x_2\): categorical variable level of urbanization in a state: low, or high, as contained in the variable urbanization we mutated above.

a) Visual model

First, visualize the parallel slopes model for this data that will allow you to conduct an “eyeball test” of the relationship between all three variables (see Lec22 for instructions on how to update the moderndive package to include the gg_parallel_slopes function that does this):

gg_parallel_slopes(y = "median_house_inc", num_x = "prct_hs", cat_x = "urbanization",
                   data = hate_crimes, alpha = 0.50) +
  labs(x = "education", y = "income",
       title = "income in 2016: Parallel slopes model")

b) Relationship between variables

Comment on the relationship between the variables.

Write your answer below:

In general, those with households in lower levels of urbanization have lower levels of income.

c) Regression model

Now run a model that examines the relationship between median household income, high-school education, and urbanization. Show the regression table.

inc_model1 <- lm(median_house_inc ~ prct_hs+urbanization, data = hate_crimes)
get_regression_table(inc_model1)
term estimate std_error statistic p_value lower_ci upper_ci
intercept -119929.869 22994.810 -5.216 0 -166164.033 -73695.706
prct_hs 1977.508 262.257 7.540 0 1450.206 2504.811
urbanizationhigh 6980.048 1772.649 3.938 0 3415.897 10544.198

d) Quantifying the relationship

Let’s now quantify the relationship between the outcome and explanatory variables.

  1. Is the intercept the same for the states with a “low” and “high” level of urbanization? (look at the data visualization above to help with this!) Is the slope the same?
  2. What is the slope for the regression line of the states with a “low” level of urbanization? What is the intercept?
  3. What is the slope for the regression line of the states with a “high” level of urbanization? What is the intercept?
  4. For every increase in 1% of high-school educated adults, what is the associated increase in the median household income of a state?
  5. Based on your regression table output, is median household income greater in states that have lower or higher levels of urbanization?
  6. Use the get_regression_points function to look at the predicted values for median income. We have a categorical value this week, just like last week. Why don’t we only have two unique predicted values again this week?

Write your answers below:

  1. The intercepts for the same states with “low” and “high” levels of urbanization are different. But with the parallel slopes model, the value of their slopes are the same (because they are parallel).
  2. The slope is 1977.508 and the intercept is -119929.869.
  3. The slope is also 1977.508 and the intercept is -119929.869 + 6980.048 = -112949.821
  4. For every 1% increase in high-school educated adults, the associated increase in median income is 1977.508
  5. Household income is greater in states of higher levels of urbanization.
  6. Last week, we were modelling the average number of hate crimes for each corresponding categorical value, and were assuming that for each categorical level, the average hate crimes remained the same. On the other hand, this week, we assume that median income changes as percentage of highschool educated adults changes within each categorical level so there aren’t set values for each categorical value.

e) Prediction

  1. What would you predict as the median household income for a state with a high level of urbanization, in which 85% of the share of adults have a high school degree?
  2. What would you predict as the median household income for a state with a low level of urbanization, in which 85% of the share of adults have a high school degree?
  3. What would you predict as the median household income for a state with a low level of urbanization, in which 30% of the share of adults have a high school degree?
  4. What is the prediction for Alaska? What is the residual? Did our model over or underestimate the median income for this state?

Show your work for each question

Write your answers below:

  1. I would predict that the median household income is -112949.821 + 1977.508*85 = 55138.359
  2. I would predict that the median household income is -119929.869 + 1977.508*85 = 48158.311
  3. I would predict that the median household income is -119929.869 + 1977.508*30 = -60604.629
  4. Alaska has low levels of urbanization and its share of highschool educated adults is 91.4%, so the median income prediction would be -119929.869 + 1977.508*91.4 = 60814.3622. The observed median income is 67629, so the residual would be 67629 - 60814.3622 = 6814.6378. So the model underestimated the median income for Alaska by 6814.6378.
# You don't necessarily need to use this code block, but rather it is optional:
hate_crimes %>% 
  filter(state == "Alaska")
state median_house_inc share_unemp_seas share_pop_metro share_pop_hs share_non_citizen share_white_poverty gini_index share_non_white share_vote_trump hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi urbanization prct_hs
Alaska 67629 0.064 0.63 0.914 0.04 0.06 0.422 0.42 0.53 0.1437401 1.6567 low 91.4

Extra credit:

Imagine that federal budget cuts caused a reduction across all states of ~ 2% in high school graduation rates. We can make a new data set that reflects these budget cuts like so:

budget_cuts <- hate_crimes %>%
  mutate(prct_hs = prct_hs - 2) %>% 
  select(urbanization, prct_hs)

Generate a table that shows the predicted median household income for each state, based on our previous model. (For the sake of the exercise, assume that it is OK to make predictions just slightly outside of the range of percent HS education values used in our original model)

get_regression_points(inc_model1, newdata = budget_cuts)
ID prct_hs urbanization median_house_inc_hat
1 80.1 low 38468.56
2 89.4 low 56859.39
3 82.2 high 49601.37
4 80.4 low 39061.81
5 78.6 high 42482.34
6 87.3 high 59686.67
7 86.6 high 58302.41
8 85.4 high 55929.40
9 85.1 high 55336.15
10 83.3 high 51776.63
11 81.9 high 49008.12
12 88.4 low 54881.88
13 86.4 low 50926.86
14 84.4 high 53951.89
15 84.6 low 47367.35
16 89.4 low 56859.39
17 87.7 low 53497.62
18 79.7 low 37677.55
19 80.2 high 45646.36
20 88.2 low 54486.38
21 87.0 high 59093.41
22 87.0 high 59093.41
23 85.9 high 56918.15
24 89.5 low 57057.14
25 78.4 low 35106.79
26 84.8 low 47762.85
27 88.8 low 55672.88
28 87.8 low 53695.37
29 81.9 high 49008.12
30 89.3 low 56661.64
31 85.4 high 55929.40
32 80.8 low 39852.81
33 82.7 high 50590.13
34 82.3 low 42819.08
35 88.1 low 54288.62
36 85.6 low 49344.85
37 83.6 low 45389.84
38 87.1 high 59291.16
39 85.9 high 56918.15
40 82.7 high 50590.13
41 81.6 low 41434.82
42 87.9 low 53893.12
43 81.1 high 47426.11
44 77.9 high 41098.09
45 88.4 high 61861.93
46 89.0 low 56068.38
47 84.6 high 54347.39
48 87.7 high 60477.67
49 80.8 low 39852.81
50 87.8 low 53695.37
51 89.8 low 57650.39