Collaboration

Please indicate who you collaborated with on this problem set:

Background

We will again use the hate crimes data we used in Problem Set 02 & 07. The FiveThirtyEight article article about those data are in the Jan 23, 2017 “Higher Rates Of Hate Crimes Are Tied To Income Inequality”. This week, we will use these data to run regression models with a single categorical predictor (explanatory) variable and a single numeric predictor (explanatory) variable.

Setup

First load the necessary packages

library(ggplot2)
library(dplyr)
library(moderndive)
library(fivethirtyeight)

Next let’s explore the hate_crimes dataset in the fivethirtyeight package using the glimpse() function from the dplyr package:

glimpse(hate_crimes)

## Observations: 51
## Variables: 12
## $ state                       <chr> "Alabama", "Alaska", "Arizona", "A...
## $ median_house_inc            <int> 42278, 67629, 49254, 44922, 60487,...
## $ share_unemp_seas            <dbl> 0.060, 0.064, 0.063, 0.052, 0.059,...
## $ share_pop_metro             <dbl> 0.64, 0.63, 0.90, 0.69, 0.97, 0.80...
## $ share_pop_hs                <dbl> 0.821, 0.914, 0.842, 0.824, 0.806,...
## $ share_non_citizen           <dbl> 0.02, 0.04, 0.10, 0.04, 0.13, 0.06...
## $ share_white_poverty         <dbl> 0.12, 0.06, 0.09, 0.12, 0.09, 0.07...
## $ gini_index                  <dbl> 0.472, 0.422, 0.455, 0.458, 0.471,...
## $ share_non_white             <dbl> 0.35, 0.42, 0.49, 0.26, 0.61, 0.31...
## $ share_vote_trump            <dbl> 0.63, 0.53, 0.50, 0.60, 0.33, 0.44...
## $ hate_crimes_per_100k_splc   <dbl> 0.12583893, 0.14374012, 0.22531995...
## $ avg_hatecrimes_per_100k_fbi <dbl> 1.8064105, 1.6567001, 3.4139280, 0...

You should also use the View() function to take a look at the data in the viewer, Recall we can’t have View() in an R Markdown document! And finally, type ?hate_crimes into the console to see a description of the variables in this data set.

Data manipulation

We will next add a new column to this data set that expresses the Share of the population that lives in metropolitan areas, 2015 categorical variable. Run this code below.

hate_crimes <- hate_crimes %>% 
  mutate(urbanization = cut_number(share_pop_metro, 2, labels = c("low", "high")))

Recall, the cut_numbers function sorts the share_pop_metro variable from lowest to highest, cuts it into two groups. It categorizes all the lowest values as "low", and the largest values as "high". Finally, let’s express the share of adults 25 and older with a high-school degree in 2009 as a percentage.

hate_crimes <- hate_crimes %>% 
  mutate(prct_hs = share_pop_hs *100)

Question 1:

Let’s model the relationship between:

\(y\): Median household income in 2016
\(x_1\): numerical variable percent of adults 25 and older with a high-school degree in 2009, contained in the prct_hs variable we mutated above.
\(x_2\): categorical variable level of urbanization in a state: low, or high, as contained in the variable urbanization we mutated above.

a) Visual model

First, visualize the parallel slopes model for this data that will allow you to conduct an “eyeball test” of the relationship between all three variables (see Lec22 for instructions on how to update the moderndive package to include the gg_parallel_slopes function that does this):

gg_parallel_slopes(y = "median_house_inc", num_x = "prct_hs", cat_x = "urbanization",
                   data = hate_crimes, alpha = 0.50) +
  labs(x = "education", y = "income",
       title = "income in 2016: Parallel slopes model")

b) Relationship between variables

Comment on the relationship between the variables.

Write your answer below:

In general, those with households in lower levels of urbanization have lower levels of income.

c) Regression model

Now run a model that examines the relationship between median household income, high-school education, and urbanization. Show the regression table.

inc_model1 <- lm(median_house_inc ~ prct_hs+urbanization, data = hate_crimes)
get_regression_table(inc_model1)

term	estimate	std_error	statistic	lower_ci	upper_ci
intercept	-119929.869	22994.810	-5.216	-166164.033	-73695.706
prct_hs	1977.508	262.257	7.540	1450.206	2504.811
urbanizationhigh	6980.048	1772.649	3.938	3415.897	10544.198

d) Quantifying the relationship

Let’s now quantify the relationship between the outcome and explanatory variables.

Is the intercept the same for the states with a “low” and “high” level of urbanization? (look at the data visualization above to help with this!) Is the slope the same?
What is the slope for the regression line of the states with a “low” level of urbanization? What is the intercept?
What is the slope for the regression line of the states with a “high” level of urbanization? What is the intercept?
For every increase in 1% of high-school educated adults, what is the associated increase in the median household income of a state?
Based on your regression table output, is median household income greater in states that have lower or higher levels of urbanization?
Use the get_regression_points function to look at the predicted values for median income. We have a categorical value this week, just like last week. Why don’t we only have two unique predicted values again this week?

Write your answers below:

The intercepts for the same states with “low” and “high” levels of urbanization are different. But with the parallel slopes model, the value of their slopes are the same (because they are parallel).
The slope is 1977.508 and the intercept is -119929.869.
The slope is also 1977.508 and the intercept is -119929.869 + 6980.048 = -112949.821
For every 1% increase in high-school educated adults, the associated increase in median income is 1977.508
Household income is greater in states of higher levels of urbanization.
Last week, we were modelling the average number of hate crimes for each corresponding categorical value, and were assuming that for each categorical level, the average hate crimes remained the same. On the other hand, this week, we assume that median income changes as percentage of highschool educated adults changes within each categorical level so there aren’t set values for each categorical value.

e) Prediction

What would you predict as the median household income for a state with a high level of urbanization, in which 85% of the share of adults have a high school degree?
What would you predict as the median household income for a state with a low level of urbanization, in which 85% of the share of adults have a high school degree?
What would you predict as the median household income for a state with a low level of urbanization, in which 30% of the share of adults have a high school degree?
What is the prediction for Alaska? What is the residual? Did our model over or underestimate the median income for this state?

Show your work for each question

Write your answers below:

I would predict that the median household income is -112949.821 + 1977.508*85 = 55138.359
I would predict that the median household income is -119929.869 + 1977.508*85 = 48158.311
I would predict that the median household income is -119929.869 + 1977.508*30 = -60604.629
Alaska has low levels of urbanization and its share of highschool educated adults is 91.4%, so the median income prediction would be -119929.869 + 1977.508*91.4 = 60814.3622. The observed median income is 67629, so the residual would be 67629 - 60814.3622 = 6814.6378. So the model underestimated the median income for Alaska by 6814.6378.

# You don't necessarily need to use this code block, but rather it is optional:
hate_crimes %>% 
  filter(state == "Alaska")

state	median_house_inc	share_unemp_seas	share_pop_metro	share_pop_hs	share_non_citizen	share_white_poverty	gini_index	share_non_white	share_vote_trump	hate_crimes_per_100k_splc	avg_hatecrimes_per_100k_fbi	urbanization	prct_hs
Alaska	67629	0.064	0.63	0.914	0.04	0.06	0.422	0.42	0.53	0.1437401	1.6567	low	91.4

Extra credit:

Imagine that federal budget cuts caused a reduction across all states of ~ 2% in high school graduation rates. We can make a new data set that reflects these budget cuts like so:

budget_cuts <- hate_crimes %>%
  mutate(prct_hs = prct_hs - 2) %>% 
  select(urbanization, prct_hs)

Generate a table that shows the predicted median household income for each state, based on our previous model. (For the sake of the exercise, assume that it is OK to make predictions just slightly outside of the range of percent HS education values used in our original model)

get_regression_points(inc_model1, newdata = budget_cuts)

ID	prct_hs	urbanization	median_house_inc_hat
1	80.1	low	38468.56
2	89.4	low	56859.39
3	82.2	high	49601.37
4	80.4	low	39061.81
5	78.6	high	42482.34
6	87.3	high	59686.67
7	86.6	high	58302.41
8	85.4	high	55929.40
9	85.1	high	55336.15
10	83.3	high	51776.63
11	81.9	high	49008.12
12	88.4	low	54881.88
13	86.4	low	50926.86
14	84.4	high	53951.89
15	84.6	low	47367.35
16	89.4	low	56859.39
17	87.7	low	53497.62
18	79.7	low	37677.55
19	80.2	high	45646.36
20	88.2	low	54486.38
21	87.0	high	59093.41
22	87.0	high	59093.41
23	85.9	high	56918.15
24	89.5	low	57057.14
25	78.4	low	35106.79
26	84.8	low	47762.85
27	88.8	low	55672.88
28	87.8	low	53695.37
29	81.9	high	49008.12
30	89.3	low	56661.64
31	85.4	high	55929.40
32	80.8	low	39852.81
33	82.7	high	50590.13
34	82.3	low	42819.08
35	88.1	low	54288.62
36	85.6	low	49344.85
37	83.6	low	45389.84
38	87.1	high	59291.16
39	85.9	high	56918.15
40	82.7	high	50590.13
41	81.6	low	41434.82
42	87.9	low	53893.12
43	81.1	high	47426.11
44	77.9	high	41098.09
45	88.4	high	61861.93
46	89.0	low	56068.38
47	84.6	high	54347.39
48	87.7	high	60477.67
49	80.8	low	39852.81
50	87.8	low	53695.37
51	89.8	low	57650.39

Problem Set 08

CAITLIN ONG

2018-11-07