Load necessary packages:
library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
library(fivethirtyeight)
# Create a categorical variable of Trump support level based on the numerical
# variable of share of 2016 U.S. presidential voters who voted for Donald Trump.
# Each of the low, medium, and high levels has roughly one third of the states.
hate_crimes <- hate_crimes %>%
  mutate(
    trump_support_level = cut_number(share_vote_trump, 3, labels = c("low", "medium", "high"))
  )
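To see what splitting a numeric variable into three groups of roughly equal size means, here is a minimal base R sketch of quantile-based binning. It is an analogue of cut_number(), not necessarily its exact implementation, and the values in `share` are made-up stand-ins for share_vote_trump.

```r
# Hypothetical toy values standing in for share_vote_trump (not from the dataset)
share <- c(0.30, 0.35, 0.41, 0.44, 0.47, 0.50, 0.55, 0.60, 0.68)

# Break points at the minimum, the two tertiles, and the maximum
breaks <- quantile(share, probs = seq(0, 1, length.out = 4))

# Assign each value to a bin; include.lowest keeps the minimum in the first bin
level <- cut(share, breaks = breaks,
             labels = c("low", "medium", "high"),
             include.lowest = TRUE)

# Each level should hold about one third of the observations
counts <- table(level)
print(counts)
```

On the real data, count(hate_crimes, trump_support_level) gives the same kind of check.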
Run ?hate_crimes and View(hate_crimes) and explore the dataset.
What value of the Gini index indicates perfect income equality? 0
What value of the Gini index indicates perfect income inequality? 1
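The two extremes can be checked numerically. The sketch below uses one common definition of the Gini index, the mean absolute difference between all pairs of incomes divided by twice the mean income. The function name `gini` and the income vectors are illustrative assumptions, not part of the dataset.

```r
# Gini index as the mean absolute pairwise difference over twice the mean
gini <- function(income) {
  n <- length(income)
  sum(abs(outer(income, income, "-"))) / (2 * n^2 * mean(income))
}

# Everyone earns the same: perfect equality
equal <- gini(c(10, 10, 10, 10))

# One person earns everything: maximal inequality for four people
unequal <- gini(c(0, 0, 0, 40))

print(c(equal = equal, unequal = unequal))
```

With a finite group of n people the most unequal case gives (n - 1) / n, which approaches 1 as n grows.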
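Why are the two hate crime variables based on counts per 100,000 individuals and not based on just the count? Due to the uneven distribution of population across the United States: raw counts would mostly reflect how many people live in a state, whereas rates per 100,000 individuals make states comparable.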
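A small sketch of the conversion from counts to rates, using two invented states (the names, incident counts, and populations are hypothetical and only illustrate the arithmetic):

```r
# Hypothetical states: A is large with many incidents, B is small with few
states <- data.frame(
  state      = c("A", "B"),
  incidents  = c(200, 20),
  population = c(20000000, 500000)
)

# Rate per 100,000 residents
states$per_100k <- states$incidents / states$population * 100000

print(states)
```

State A has ten times as many incidents, yet state B has the higher rate once population is taken into account.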
Name some differences in how and when the FBI reports-based and the Southern Poverty Law Center-based rates of hate crime were collected, and how this could impact any analyses.
Write your answers here: The FBI doesn’t track hate crimes systematically (it collects data voluntarily submitted by law enforcement agencies), so it’s hard to gauge how comprehensive its data is. Moreover, its publicly accessible records were collected between 2010 and 2015. The SPLC data, on the other hand, did not exist before the 2016 election and combines both hate crimes and non-prosecutable hate incidents. Although both datasets reveal similar trends, an SPLC-based analysis might suffer from awareness bias, while an FBI-based analysis might leave some questions unanswered because of its limited coverage.
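Let’s model the relationship, both visually and via regression, between:
\(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election (hate_crimes_per_100k_splc)
\(x\): Level of Trump support: low, medium, or high (trump_support_level)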
Create a visual model of this data (do not forget to include appropriate axes labels and title):
# Write code to plot this model below:
ggplot(data = hate_crimes, aes(x = trump_support_level, y = hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(x = "Trump support level", y = "Hate crimes per 100k people after the election", title = "Hate crime - Trump support relationship.")

Output the regression table and interpret the results
# Write code to generate a regression table below:
# Since this regression compares group means, compute them first:
hate_crimes %>%
  group_by(trump_support_level) %>%
  summarize(mean = mean(hate_crimes_per_100k_splc, na.rm = TRUE)) %>%
  kable()

| trump_support_level | mean |
|---|---|
| low | 0.4601833 |
| medium | 0.2222983 |
| high | 0.1910425 |
lm(hate_crimes_per_100k_splc ~ trump_support_level, data = hate_crimes) %>%
  tidy() %>%
  kable()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.4601833 | 0.0529413 | 8.692337 | 0.0000000 |
| trump_support_levelmedium | -0.2378850 | 0.0785246 | -3.029433 | 0.0040912 |
| trump_support_levelhigh | -0.2691408 | 0.0800397 | -3.362593 | 0.0016070 |
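
The intercept, 0.4601833, is the mean number of hate crimes per 100,000 people in states with a low Trump support level, which is the baseline group. The estimate for trump_support_levelmedium, -0.2378850, is the difference in means between the medium and low groups, so the medium group's mean is 0.4601833 - 0.2378850 = 0.2222983. Likewise, the estimate for trump_support_levelhigh, -0.2691408, gives a high-group mean of 0.4601833 - 0.2691408 = 0.1910425. Both offsets are negative, so the mean rates in the medium and high groups are lower than in the low group.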
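The link between the regression table and the table of group means can be checked on a small example. The data frame `toy` below is invented for illustration; only the mechanics carry over to hate_crimes.

```r
# Hypothetical toy data: three observations per support level
toy <- data.frame(
  level = factor(rep(c("low", "medium", "high"), each = 3),
                 levels = c("low", "medium", "high")),
  rate  = c(0.5, 0.4, 0.6,  0.2, 0.3, 0.1,  0.2, 0.1, 0.3)
)

# Regression with a categorical predictor; "low" is the baseline level
fit <- lm(rate ~ level, data = toy)

# Group means computed directly
group_means <- tapply(toy$rate, toy$level, mean)

# Intercept = baseline mean; other coefficients = offsets from that baseline
rebuilt <- c(
  low    = unname(coef(fit)["(Intercept)"]),
  medium = unname(coef(fit)["(Intercept)"] + coef(fit)["levelmedium"]),
  high   = unname(coef(fit)["(Intercept)"] + coef(fit)["levelhigh"])
)

print(rbind(group_means, rebuilt))
```

The two rows agree, which is why the intercept in the table above equals the mean of the low group.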
Write your answers here:
- Some Trump stronghold states reported significantly more hate crimes than others, as reflected in the choropleth map on FiveThirtyEight, so no. There seems to be no direct relationship between Trump support level and the number of hate crimes committed.
- District of Columbia
Create two separate visualizations (do not forget to include appropriate axis labels and title) and run two separate simple linear regressions (using only one predictor) for \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election with
\(x\): the Gini index (gini_index)
\(x\): the share of the population with a high school degree (share_pop_hs)
and interpret any slope values.
# Write code to plot this model below:
ggplot(hate_crimes, aes(x = gini_index, y = hate_crimes_per_100k_splc)) +
  geom_point() +
  labs(x = "Gini index", y = "Hate crimes per 100k people after the election", title = "Regression model 1") +
  geom_smooth(method = "lm", se = FALSE)
# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ gini_index, data = hate_crimes) %>%
  tidy() %>%
  kable()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.527463 | 0.7833043 | -1.950025 | 0.0574197 |
| gini_index | 4.020510 | 1.7177215 | 2.340606 | 0.0237445 |
# Write code to plot this model below:
ggplot(hate_crimes, aes(x = share_pop_hs, y = hate_crimes_per_100k_splc)) +
  geom_point() +
  labs(x = "High school education level", y = "Hate crimes per 100k people after the election", title = "Regression model 2") +
  geom_smooth(method = "lm", se = FALSE)
# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ share_pop_hs, data = hate_crimes) %>%
  tidy() %>%
  kable()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.705274 | 0.9228076 | -1.847919 | 0.0711930 |
| share_pop_hs | 2.320228 | 1.0647852 | 2.179057 | 0.0346031 |
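
Slope values:
first model: ~4.02
second model: ~2.32
In the first model, each one-unit increase in the Gini index is associated with an average increase of about 4.02 hate crimes per 100,000 people. In the second model, each one-unit increase in share_pop_hs is associated with an average increase of about 2.32. Because both predictors are proportions, a step of 0.01 is a more practical unit: about 0.040 and 0.023 more hate crimes per 100,000 people, respectively.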
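To make the slope concrete, the fitted line from the first model can be written as a small function using the rounded estimates from its regression table. The Gini values 0.45 and 0.46 are arbitrary illustrative inputs, and `predict_rate` is a hypothetical helper name.

```r
# Estimates copied from the regression table for model 1
intercept <- -1.527463
slope     <- 4.020510

# Fitted line: predicted hate crimes per 100k for a given Gini index
predict_rate <- function(gini) intercept + slope * gini

# Change in the prediction when the Gini index rises by 0.01
diff_001 <- predict_rate(0.46) - predict_rate(0.45)

print(diff_001)
```

The difference equals the slope times 0.01, regardless of the starting value, because the model is a straight line.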
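Run a multiple regression for
\(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election
\(x_1\): share_pop_hs
\(x_2\): gini_index
and interpret both slope coefficients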
# Write code to generate a regression table below. No need for a visualization
# here:
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data = hate_crimes) %>%
  tidy() %>%
  kable()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -8.211991 | 1.418930 | -5.787453 | 7.0e-07 |
| share_pop_hs | 5.255865 | 1.002924 | 5.240545 | 4.3e-06 |
| gini_index | 8.702370 | 1.629755 | 5.339678 | 3.1e-06 |
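Write your interpretation below: The two slopes differ (about 5.3 for share_pop_hs vs. 8.7 for gini_index). Holding the Gini index constant, a one-unit increase in share_pop_hs is associated with an average increase of about 5.3 hate crimes per 100,000 people. Holding share_pop_hs constant, a one-unit increase in the Gini index is associated with an average increase of about 8.7. Both slopes are larger than in the simple regressions (2.32 and 4.02), so the association of each predictor with hate crimes changes once the other is taken into account.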
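One possible reason a slope grows when a second predictor is added is that the two predictors are negatively correlated, so each one partly masks the other in a simple regression. The sketch below shows this mechanism on constructed data. The variables `x1`, `x2`, and `y` are invented, and whether the same thing happens in hate_crimes is an assumption that would need checking, for example with cor().

```r
# Constructed predictors that are negatively correlated
x1 <- 1:10
x2 <- -0.5 * x1 + rep(c(1, -1), times = 5)

# Outcome built with known slopes: 2 for x1 and 3 for x2, no noise
y <- 1 + 2 * x1 + 3 * x2

# Simple regression ignores x2; multiple regression adjusts for it
simple   <- lm(y ~ x1)
multiple <- lm(y ~ x1 + x2)

simple_slope   <- unname(coef(simple)["x1"])
multiple_slope <- unname(coef(multiple)["x1"])

print(c(correlation = cor(x1, x2),
        simple = simple_slope,
        multiple = multiple_slope))
```

The multiple regression recovers the slope of 2 that was built into the data, while the simple regression understates it because higher x1 goes together with lower x2.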
Create two new data frames:
hate_crimes_no_new_york: the hate_crimes dataset without New York
hate_crimes_no_DC: the hate_crimes dataset without the District of Columbia
Repeat the multiple regression from Question 3 and indicate the removal of which state from the dataset has a bigger impact on the analysis. Why do you think this is?
# Write code to generate regression tables below:
hate_crimes_no_new_york <- hate_crimes %>%
  filter(state != "New York")
View(hate_crimes_no_new_york)
hate_crimes_no_DC <- hate_crimes %>%
  filter(state != "District of Columbia")
View(hate_crimes_no_DC)
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data = hate_crimes_no_DC) %>%
  tidy() %>%
  kable()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.989258 | 1.5083121 | -2.644849 | 0.0113652 |
| share_pop_hs | 3.284001 | 0.9432964 | 3.481409 | 0.0011577 |
| gini_index | 3.135572 | 1.8352715 | 1.708506 | 0.0947527 |
lm(hate_crimes_per_100k_splc ~ share_pop_hs + gini_index, data = hate_crimes_no_new_york) %>%
  tidy() %>%
  kable()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -8.655118 | 1.448044 | -5.977109 | 4.0e-07 |
| share_pop_hs | 5.399269 | 1.001034 | 5.393694 | 2.8e-06 |
| gini_index | 9.414876 | 1.706481 | 5.517131 | 1.8e-06 |
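Write your response here: Removing DC has the bigger effect. Without DC, the slope for gini_index falls from about 8.70 to 3.14 and the slope for share_pop_hs from about 5.26 to 3.28, whereas removing New York changes them only slightly (to about 9.41 and 5.40). DC's unusually high rate of reported hate crimes skews the original analysis, so excluding it changes the results far more than excluding New York does.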
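The effect of a single unusual observation can also be measured without refitting by hand. The sketch below uses Cook's distance from base R on constructed data in which the tenth point plays the role of an outlier such as DC. The numbers are invented for illustration only.

```r
# Constructed data: nine ordinary points plus one extreme point (index 10)
x <- c(1:9, 15)
y <- c(2, 1, 2, 1, 2, 1, 2, 1, 2, 10)

# Fit with and without the extreme point
fit_all     <- lm(y ~ x)
fit_dropped <- lm(y[-10] ~ x[-10])

slope_all     <- unname(coef(fit_all)[2])
slope_dropped <- unname(coef(fit_dropped)[2])

# Cook's distance: how much the fit changes when each point is left out
influence <- cooks.distance(fit_all)
most_influential <- unname(which.max(influence))

print(c(slope_all = slope_all, slope_dropped = slope_dropped))
print(most_influential)
```

Here the extreme point alone creates the positive slope, and it has by far the largest Cook's distance. Applying cooks.distance() to the multiple regression on hate_crimes would be one way to compare DC and New York directly.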