Load necessary packages:
library(ggplot2)
library(dplyr)
library(broom)
library(knitr)
library(fivethirtyeight)
# Create a categorical variable of Trump support level based on the numerical
# variable of share of 2016 U.S. presidential voters who voted for Donald Trump.
# Each of the low, medium, and high levels has roughly one third of the states.
hate_crimes <- hate_crimes %>%
  mutate(
    trump_support_level = cut_number(share_vote_trump, 3, labels = c("low", "medium", "high"))
  )
Run ?hate_crimes and View(hate_crimes) and explore the dataset.
Write your answers here:
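Let’s model the relationship, both visually and via regression, between \(y\): hate crimes per 100K individuals in the 10 days after the 2016 US election (hate_crimes_per_100k_splc) and \(x\): level of Trump support (trump_support_level).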
Create a visual model of this data (do not forget to include appropriate axes labels and title):
# Write code to plot this model below:
ggplot(hate_crimes, aes(x = trump_support_level, y = hate_crimes_per_100k_splc)) +
  geom_boxplot() +
  labs(x = "Support of Trump", y = "Hate Crimes", title = "Comparison between Support of Trump and Hate Crimes")
Output the regression table and interpret the results.
# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ trump_support_level, data = hate_crimes) %>%
  tidy()
##                        term   estimate  std.error statistic      p.value
## 1               (Intercept)  0.4601833 0.05294126  8.692337 4.182015e-11
## 2 trump_support_levelmedium -0.2378850 0.07852458 -3.029433 4.091171e-03
## 3   trump_support_levelhigh -0.2691408 0.08003966 -3.362593 1.606977e-03
Write your answers here:
1. My results are not consistent with the FiveThirtyEight article: hate crime rates are high in states with low support for Trump, such as Massachusetts, and low in states with high support for Trump, such as Louisiana. Relative to the low-support baseline of about 0.46 hate crimes per 100K individuals, the medium and high groups average about 0.24 and 0.27 fewer, respectively.
2. The District of Columbia was the outlier in the “low” group.
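As a quick check on the regression table, the three group means can be recovered from the coefficients: the intercept is the mean of the baseline “low” group, and the other two estimates are offsets from it. This is a minimal sketch with the estimates typed in by hand from the table; the names coefs and group_means are illustrative only.

```r
# Estimates copied by hand from the regression table (not re-fitted here).
coefs <- c(intercept = 0.4601833, medium = -0.2378850, high = -0.2691408)

# With a categorical predictor, the intercept is the baseline ("low") mean
# and each remaining coefficient is that group's offset from the baseline.
group_means <- c(
  low    = coefs[["intercept"]],
  medium = coefs[["intercept"]] + coefs[["medium"]],
  high   = coefs[["intercept"]] + coefs[["high"]]
)

print(round(group_means, 3))
```

The same numbers should match the group averages computed directly from the data, which is a useful sanity check on how the baseline level was chosen.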
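Create two separate visualizations (do not forget to include appropriate axes labels and title) and run two separate simple linear regressions (using only one predictor) for \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election with \(x\): the Gini index (gini_index) in the first model and \(x\): the share of the population with a high school degree (share_pop_hs) in the second, and interpret any slope values.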
# Write code to plot this model below:
ggplot(hate_crimes, aes(x = gini_index, y = hate_crimes_per_100k_splc)) +
  geom_point() +
  labs(x = "Gini Index", y = "Hate Crimes", title = "Comparison of Gini Index and Hate Crimes") +
  geom_smooth(method = "lm", se = FALSE)
# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ gini_index, data = hate_crimes) %>%
  tidy()
##          term  estimate std.error statistic    p.value
## 1 (Intercept) -1.527463 0.7833043 -1.950025 0.05741966
## 2  gini_index  4.020510 1.7177215  2.340606 0.02374447
# Write code to plot this model below:
ggplot(hate_crimes, aes(x = share_pop_hs, y = hate_crimes_per_100k_splc)) +
  geom_point() +
  labs(x = "High School Education Level", y = "Hate Crimes", title = "Comparison of High School Education Level and Hate Crimes") +
  geom_smooth(method = "lm", se = FALSE)
# Write code to generate a regression table below:
lm(hate_crimes_per_100k_splc ~ share_pop_hs, data = hate_crimes) %>%
  tidy()
##           term  estimate std.error statistic    p.value
## 1  (Intercept) -1.705274 0.9228076 -1.847919 0.07119297
## 2 share_pop_hs  2.320228 1.0647852  2.179057 0.03460305
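To make the two fitted lines concrete, the sketch below plugs an example predictor value into each simple regression equation. The coefficients are copied by hand from the two tables; the helper predict_line and the input values 0.45 and 0.87 are assumptions chosen only for illustration, not values taken from the dataset.

```r
# Hypothetical helper (not from any package): evaluates intercept + slope * x.
predict_line <- function(intercept, slope, x) intercept + slope * x

# Coefficients copied by hand from the two simple regression tables.
# 0.45 is an assumed Gini index; 0.87 is an assumed high school share.
pred_gini <- predict_line(-1.527463, 4.020510, x = 0.45)
pred_hs   <- predict_line(-1.705274, 2.320228, x = 0.87)

# Predicted hate crimes per 100K individuals under each model.
print(c(gini_model = pred_gini, hs_model = pred_hs))
```

With the fitted model objects available, predict() with a newdata data frame would give the same values without retyping the coefficients.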
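Run a multiple regression for \(y\): Hate crimes per 100K individuals in the 10 days after the 2016 US election with the two predictors gini_index and share_pop_hs, and interpret both slope coefficients.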
# Write code to generate a regression table below. No need for a visualization
# here:
lm(hate_crimes_per_100k_splc ~ gini_index + share_pop_hs, data = hate_crimes) %>%
  tidy()
##           term  estimate std.error statistic      p.value
## 1  (Intercept) -8.211991  1.418930 -5.787453 6.921925e-07
## 2   gini_index  8.702370  1.629755  5.339678 3.117468e-06
## 3 share_pop_hs  5.255865  1.002924  5.240545 4.340152e-06
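Write your interpretation below: Holding share_pop_hs constant, each one-unit increase in gini_index is associated with an average increase of 8.702370 in hate_crimes_per_100k_splc. Holding gini_index constant, each one-unit increase in share_pop_hs is associated with an average increase of 5.255865 in hate_crimes_per_100k_splc. Because both predictors are proportions between 0 and 1, a one-unit change spans the whole scale; a 0.01 increase corresponds to increases of about 0.087 and 0.053, respectively.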
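Since a full one-unit change is far larger than any realistic difference between states, it can help to rescale the slopes to a 0.01 change in each predictor. A minimal sketch, using the estimates copied by hand from the multiple regression table (the names slopes and per_point are illustrative):

```r
# Slope estimates copied by hand from the multiple regression table.
slopes <- c(gini_index = 8.702370, share_pop_hs = 5.255865)

# Expected change in hate_crimes_per_100k_splc for a 0.01 increase in one
# predictor, holding the other predictor constant.
per_point <- slopes * 0.01

print(round(per_point, 3))
```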
Create two new data frames:
hate_crimes_no_new_york: the hate_crimes dataset without New York
hate_crimes_no_DC: the hate_crimes dataset without the District of Columbia
Repeat the multiple regression from Question 3 and indicate which state's removal from the dataset has a bigger impact on the analysis. Why do you think this is?
# Write code to generate regression tables below:
hate_crimes_no_new_york <- hate_crimes %>%
  filter(state != "New York")
lm(hate_crimes_per_100k_splc ~ gini_index + share_pop_hs, data = hate_crimes_no_new_york) %>%
  tidy()
##           term  estimate std.error statistic      p.value
## 1  (Intercept) -8.655118  1.448044 -5.977109 3.947701e-07
## 2   gini_index  9.414876  1.706481  5.517131 1.833887e-06
## 3 share_pop_hs  5.399269  1.001034  5.393695 2.763474e-06
hate_crimes_no_DC <- hate_crimes %>%
  filter(state != "District of Columbia")
lm(hate_crimes_per_100k_splc ~ gini_index + share_pop_hs, data = hate_crimes_no_DC) %>%
  tidy()
##           term  estimate std.error statistic     p.value
## 1  (Intercept) -3.989258 1.5083121 -2.644849 0.011365187
## 2   gini_index  3.135572 1.8352715  1.708506 0.094752663
## 3 share_pop_hs  3.284001 0.9432964  3.481409 0.001157652
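Write your response here: The removal of the District of Columbia has a bigger impact on the analysis. Dropping New York barely changes the estimates (the gini_index slope moves from 8.70 to 9.41 and the share_pop_hs slope from 5.26 to 5.40), and both slopes stay significant at the 0.05 level. Dropping the District of Columbia cuts the gini_index slope from 8.70 to 3.14 and the share_pop_hs slope from 5.26 to 3.28, and the gini_index slope is no longer significant (p.value of about 0.095, which is greater than 0.05). I think this is because the District of Columbia is an outlier, as noted earlier, so that single observation strongly influences the fitted coefficients.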
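One way to compare the two refits is to measure how far each slope moves from its full-data estimate. The sketch below does this with the estimates copied by hand from the three multiple regression tables; the object names full, no_ny, no_dc, and change are illustrative only.

```r
# Slope estimates copied by hand from the three multiple regression tables.
full  <- c(gini_index = 8.702370, share_pop_hs = 5.255865)
no_ny <- c(gini_index = 9.414876, share_pop_hs = 5.399269)
no_dc <- c(gini_index = 3.135572, share_pop_hs = 3.284001)

# Change in each slope relative to the model fitted on all observations.
change <- rbind(no_new_york = no_ny - full, no_DC = no_dc - full)

print(round(change, 3))
```

The row with the larger absolute changes identifies the observation whose removal moves the fit more.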