Research Question
In this project proposal I intend to use the “hate_crimes” dataset from fivethirtyeight, which is described below. The research question I would like to answer is: are there any significant relationships between hate crimes in the US and other parameters in the dataset, such as unemployment, median household income, race, etc.?
A “hate crime” is defined as a crime that is motivated by a particular bias or prejudice. Examples include African-Americans being policed differently than others, potentially in fatal ways; crimes against Muslims or other Islamic religious groups because of their faith; and crimes against people because of their sexual orientation.
This is an important question because in recent years there has been an uptick in recorded hate crimes. It is unclear whether hate crime rates are actually increasing over time or whether more light is being shed on these incidents due to the abundance of technology in today’s society compared with previous decades. Nonetheless, it is important to understand the relationship between hate crimes and other potential factors in order to take measures, both politically and socially, to mitigate the issue.
library(fivethirtyeight)
library(DT)
library(GGally)
library(Hmisc)
library(tidyverse)
library(knitr)
library(RColorBrewer)
library(broom)
colnames(hate_crimes)
## [1] "state" "state_abbrev"
## [3] "median_house_inc" "share_unemp_seas"
## [5] "share_pop_metro" "share_pop_hs"
## [7] "share_non_citizen" "share_white_poverty"
## [9] "gini_index" "share_non_white"
## [11] "share_vote_trump" "hate_crimes_per_100k_splc"
## [13] "avg_hatecrimes_per_100k_fbi"
hatecrimes<-hate_crimes
There are 51 cases in this dataset: one for each of the 50 states plus the District of Columbia. Hate crimes are measured by two metrics in this dataset:
hate_crimes_per_100k_splc - Hate crimes per 100,000 people, as reported by the Southern Poverty Law Center (SPLC)
avg_hatecrimes_per_100k_fbi - Average annual hate crimes per 100,000 people, as reported by the FBI
It’s important to note that these are aggregated observations: the data are not granular enough to provide each individual hate crime as an observation with information about it.
The variables of interest are most of the other variables in the dataset. These will be used as our predictor variables, while hate crimes will be our response variable:
median_house_inc - Median Household income for the year of 2016
share_unemp_seas - Share of the population that is unemployed (seasonally adjusted)
share_pop_metro - Share of the population that lives in a metropolitan area for the year of 2015
share_non_citizen - Share of the population that are not U.S. Citizens as of 2015
share_white_poverty - Share of white residents who live in poverty for 2015
gini_index - A measure of income inequality, summarizing how income is distributed across income percentiles in a population
share_non_white - Share of the population that is not white for 2015
share_vote_trump - Share of 2016 U.S. presidential voters who voted for Donald Trump
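For intuition about gini_index, the following is a minimal sketch of how a Gini coefficient can be computed from a vector of individual incomes. The function name gini_coefficient and the toy income vectors are illustrative assumptions; the dataset itself supplies precomputed state-level values, and the method used to produce them is not stated here.

```r
# Hypothetical helper: Gini coefficient of a vector of non-negative incomes.
# 0 means everyone has the same income; larger values mean more inequality.
gini_coefficient <- function(income) {
  income <- sort(income[!is.na(income)])
  n <- length(income)
  if (n == 0 || sum(income) == 0) return(NA_real_)
  ranks <- seq_len(n)
  # Rank-weighted sum of the sorted incomes, scaled by n times total income.
  sum((2 * ranks - n - 1) * income) / (n * sum(income))
}

equal_incomes   <- c(50, 50, 50, 50)   # toy data: perfectly equal
unequal_incomes <- c(0, 0, 0, 200)     # toy data: one person holds everything

gini_coefficient(equal_incomes)     # 0
gini_coefficient(unequal_incomes)   # 0.75, i.e. (n - 1) / n for n = 4
```

On this scale, the state values in the dataset (roughly 0.42 to 0.53) sit between the two toy extremes.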
This is an observational study because we are evaluating our hypotheses using previously collected historical data. There is no experimental design with placebo, control, or experimental groups. The scope of our inference is the US at the state level, since the data cover every individual state; the results describe states, not individual residents. Because this is not a randomized controlled trial, we will not use these data to infer causality.
The data table below allows the user to look through the raw dataset from fivethirtyeight.
datatable(hate_crimes)
For exploratory analysis, we will employ tools that help us understand the distribution of our data, including summary statistics across all of the columns. The dataset is in wide format, but we will also create a long version to make it easier to look at the different parameters.
Cycle through the tabs below to view the distribution and normality of our parameters in the dataset.
The histograms below show all of our potential predictor variables as well as our hate crime response measures. Most of the predictor variables follow a normal or close to normal distribution. There is some skewness in a few parameters due to outliers: in gini_index, avg_hatecrimes_per_100k_fbi, and hate_crimes_per_100k_splc, outliers make the data appear right skewed. Similarly, outliers in share_vote_trump cause left skewness in that distribution.
From this we tentatively expect that our linear models will:
1) Have nearly normal residuals
2) Have constant variability
The shape of the individual variables does not guarantee either condition, so we will check for linearity and variability in the residuals plots during downstream analysis.
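The skewness described above can also be summarized with a single number per parameter. The sketch below uses a moment-based sample skewness; the helper name sample_skewness and the toy vectors are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment).
# Positive values indicate a longer right tail, negative values a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy vectors (not from the dataset) to show the sign convention.
right_tailed <- c(1, 2, 2, 3, 3, 3, 20)
left_tailed  <- -right_tailed
sample_skewness(right_tailed) > 0   # TRUE
sample_skewness(left_tailed) < 0    # TRUE

# Assumed usage on the report's data frame, where `hatecrimes` is available:
# sapply(hatecrimes[sapply(hatecrimes, is.numeric)], sample_skewness)
```

A single extreme value can dominate this statistic, which is consistent with the outlier-driven skew noted in the histograms.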
# Reshape from wide to long: one row per state and parameter (columns 3 onward).
hatecrimes_long <- hatecrimes %>%
  pivot_longer(cols = 3:length(hatecrimes), names_to = "Parameter")
# Histogram of each parameter, one panel per parameter with free axes.
hatecrimes_long %>%
  ggplot(mapping = aes(x = value, fill = Parameter)) +
  geom_histogram(alpha = 0.4) +
  facet_wrap(Parameter ~ ., scales = "free", ncol = 3) +
  theme(panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill = NA),
        panel.grid = element_blank(),
        legend.position = "top",
        legend.title = element_blank(),
        strip.background = element_blank())
The density plots below show the same parameters, but with a smoothed curve instead of a histogram to better show their skewness, peaks, and distributions. As stated in the histogram tab, most of the predictor variables follow a normal or close to normal distribution. See the QQ_plot tab to see how far these parameters deviate from the Gaussian distribution.
# Density curve of each parameter, one panel per parameter with free axes.
hatecrimes_long %>%
  ggplot(mapping = aes(x = value, fill = Parameter)) +
  facet_wrap(Parameter ~ ., scales = "free", ncol = 3) +
  geom_density(fill = NA, linetype = 2, na.rm = TRUE) +
  theme(panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill = NA),
        panel.grid = element_blank(),
        legend.position = "top",
        legend.title = element_blank(),
        strip.background = element_blank())
The QQ plots below show the same parameters, with a reference line for the normal distribution overlaid on the data to show where and how far each parameter deviates from normality. As stated in the histogram tab, most of the predictor variables follow a normal or close to normal distribution. For the analyses downstream, we will assume that these parameters satisfy the normality conditions for our regressions.
# Normal QQ plot of each parameter with a reference line, one panel per parameter.
hatecrimes_long %>%
  ggplot(mapping = aes(sample = value)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(Parameter ~ ., scales = "free", ncol = 3) +
  theme(panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill = NA),
        panel.grid = element_blank(),
        legend.position = "top",
        legend.title = element_blank(),
        strip.background = element_blank())
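As a numerical companion to the QQ plots above, one option is a Shapiro-Wilk test per parameter. This is a sketch under the assumption that such a test is an acceptable screen here; the helper name normality_table is hypothetical, and the demonstration uses simulated data rather than the hate_crimes dataset.

```r
# Hypothetical helper: run a Shapiro-Wilk normality test on every numeric column.
# A small p-value suggests the column departs from a normal distribution.
normality_table <- function(data) {
  numeric_cols <- data[sapply(data, is.numeric)]
  results <- lapply(names(numeric_cols), function(col) {
    test <- shapiro.test(numeric_cols[[col]])   # missing values are dropped by shapiro.test
    data.frame(parameter = col,
               W = unname(test$statistic),
               p_value = test$p.value)
  })
  do.call(rbind, results)
}

# Demonstration on simulated data (not the hate_crimes dataset).
set.seed(1)
toy <- data.frame(roughly_normal = rnorm(50), right_skewed = rexp(50))
normality_table(toy)

# Assumed usage in this report: normality_table(hatecrimes)
```

With only about 50 observations per parameter the test has limited power, so it is best read alongside the plots rather than as a replacement for them.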
Summary statistics for the dataset are shown below:
describe(hatecrimes)
## hatecrimes
##
## 13 Variables 51 Observations
## --------------------------------------------------------------------------------
## state
## n missing distinct
## 51 0 51
##
## lowest : Alabama Alaska Arizona Arkansas California
## highest: Virginia Washington West Virginia Wisconsin Wyoming
## --------------------------------------------------------------------------------
## state_abbrev
## n missing distinct
## 51 0 51
##
## lowest : AK AL AR AZ CA, highest: VT WA WI WV WY
## --------------------------------------------------------------------------------
## median_house_inc
## n missing distinct Info Mean Gmd .05 .10
## 51 0 51 1 55224 10575 42342 43716
## .25 .50 .75 .90 .95
## 48657 54916 60719 67629 70692
##
## lowest : 35521 39552 42278 42406 42786, highest: 68277 70161 71223 73397 76165
## --------------------------------------------------------------------------------
## share_unemp_seas
## n missing distinct Info Mean Gmd .05 .10
## 51 0 32 0.999 0.04957 0.01235 0.0340 0.0360
## .25 .50 .75 .90 .95
## 0.0420 0.0510 0.0575 0.0630 0.0670
##
## lowest : 0.028 0.029 0.034 0.035 0.036, highest: 0.063 0.064 0.067 0.068 0.073
## --------------------------------------------------------------------------------
## share_pop_metro
## n missing distinct Info Mean Gmd .05 .10
## 51 0 31 0.998 0.7502 0.2059 0.400 0.510
## .25 .50 .75 .90 .95
## 0.630 0.790 0.895 0.970 0.985
##
## lowest : 0.31 0.34 0.35 0.45 0.50, highest: 0.92 0.94 0.96 0.97 1.00
## --------------------------------------------------------------------------------
## share_pop_hs
## n missing distinct Info Mean Gmd .05 .10
## 51 0 40 1 0.8691 0.03925 0.8115 0.8220
## .25 .50 .75 .90 .95
## 0.8405 0.8740 0.8980 0.9100 0.9140
##
## lowest : 0.799 0.804 0.806 0.817 0.821, highest: 0.910 0.913 0.914 0.915 0.918
## --------------------------------------------------------------------------------
## share_non_citizen
## n missing distinct Info Mean Gmd .05 .10
## 48 3 12 0.985 0.05458 0.03516 0.0135 0.0200
## .25 .50 .75 .90 .95
## 0.0300 0.0450 0.0800 0.1000 0.1100
##
## lowest : 0.01 0.02 0.03 0.04 0.05, highest: 0.08 0.09 0.10 0.11 0.13
##
## Value 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11
## Frequency 3 4 9 8 4 4 2 5 2 3 3
## Proportion 0.062 0.083 0.188 0.167 0.083 0.083 0.042 0.104 0.042 0.062 0.062
##
## Value 0.13
## Frequency 1
## Proportion 0.021
## --------------------------------------------------------------------------------
## share_white_poverty
## n missing distinct Info Mean Gmd .05 .10
## 51 0 12 0.98 0.09176 0.02729 0.060 0.060
## .25 .50 .75 .90 .95
## 0.075 0.090 0.100 0.120 0.135
##
## lowest : 0.04 0.05 0.06 0.07 0.08, highest: 0.11 0.12 0.13 0.14 0.17
##
## Value 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## Frequency 1 1 4 7 7 11 8 3 5 1 2
## Proportion 0.020 0.020 0.078 0.137 0.137 0.216 0.157 0.059 0.098 0.020 0.039
##
## Value 0.17
## Frequency 1
## Proportion 0.020
## --------------------------------------------------------------------------------
## gini_index
## n missing distinct Info Mean Gmd .05 .10
## 51 0 39 0.999 0.4538 0.02278 0.4240 0.4300
## .25 .50 .75 .90 .95
## 0.4400 0.4540 0.4665 0.4740 0.4805
##
## lowest : 0.419 0.422 0.423 0.425 0.427, highest: 0.474 0.475 0.486 0.499 0.532
## --------------------------------------------------------------------------------
## share_non_white
## n missing distinct Info Mean Gmd .05 .10
## 51 0 34 0.999 0.3157 0.186 0.090 0.150
## .25 .50 .75 .90 .95
## 0.195 0.280 0.420 0.500 0.615
##
## lowest : 0.06 0.07 0.09 0.10 0.15, highest: 0.56 0.61 0.62 0.63 0.81
## --------------------------------------------------------------------------------
## share_vote_trump
## n missing distinct Info Mean Gmd .05 .10
## 51 0 33 0.999 0.49 0.1303 0.330 0.350
## .25 .50 .75 .90 .95
## 0.415 0.490 0.575 0.630 0.645
##
## lowest : 0.04 0.30 0.33 0.34 0.35, highest: 0.63 0.64 0.65 0.69 0.70
## --------------------------------------------------------------------------------
## hate_crimes_per_100k_splc
## n missing distinct Info Mean Gmd .05 .10
## 47 4 47 1 0.3041 0.2355 0.08343 0.10790
## .25 .50 .75 .90 .95
## 0.14271 0.22620 0.35693 0.62034 0.66348
##
## lowest : 0.06744680 0.06906077 0.07830591 0.09540164 0.10515247
## highest: 0.62747993 0.63081059 0.67748765 0.83284961 1.52230172
## --------------------------------------------------------------------------------
## avg_hatecrimes_per_100k_fbi
## n missing distinct Info Mean Gmd .05 .10
## 50 1 50 1 2.368 1.671 0.4896 0.6905
## .25 .50 .75 .90 .95
## 1.2931 1.9871 3.1843 3.8568 4.5935
##
## lowest : 0.2669408 0.4120118 0.4309276 0.5613956 0.6227460
## highest: 4.2078896 4.4132026 4.7410699 4.8018993 10.9534797
## --------------------------------------------------------------------------------
Removing the one outlier from our response variable (the District of Columbia):
hatecrimes<-hatecrimes %>%
filter(state!= "District of Columbia")
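The report does not state which rule was used to identify the outlier. As one possible way to make that choice reproducible, the sketch below flags values outside the Tukey fences (1.5 times the interquartile range beyond the quartiles). The helper name flag_iqr_outliers and the toy rates are assumptions for illustration.

```r
# Hypothetical helper: flag values outside the Tukey fences (k * IQR beyond the quartiles).
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  # Missing values are never flagged.
  !is.na(x) & (x < q[[1]] - fence | x > q[[2]] + fence)
}

# Toy rates (not from the dataset): the last value sits far from the rest.
toy_rates <- c(0.10, 0.14, 0.20, 0.23, 0.30, 0.36, 1.52)
flag_iqr_outliers(toy_rates)   # only the last element is TRUE

# Assumed usage in this report, before filtering:
# hate_crimes$state[flag_iqr_outliers(hate_crimes$hate_crimes_per_100k_splc)]
```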
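# Parameter names follow the column order of the wide data (columns 3 to 13).
responsevariable <- unique(hatecrimes_long$Parameter)[10]     # hate_crimes_per_100k_splc
predictorvariables <- unique(hatecrimes_long$Parameter)[1:9]  # median_house_inc through share_vote_trump
# Minimal theme shared by the model plots.
plottheme <- theme(panel.background = element_blank(),
                   panel.border = element_blank(),
                   panel.grid = element_blank())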
# Fit a simple linear regression of y on x, then print a scatter plot with the
# fitted line, a residuals-vs-fitted plot, and the model summary.
hatecrimeslm <- function(data, x, y) {
  linearM <- lm(formula(paste(y, "~", x)), data)
  intercept <- round(coef(linearM)[[1]], 2)
  slope <- signif(coef(linearM)[[2]], 3)                     # signif keeps very small slopes from rounding to 0
  r2 <- round(summary(linearM)$r.squared, 2)                 # multiple (unadjusted) R-squared
  p_value <- round(summary(linearM)$coefficients[2, 4], 3)   # p-value of the slope
  p1 <- ggplot(data = data, mapping = aes(x = .data[[x]], y = .data[[y]])) +
    geom_point(pch = 21, color = "black", fill = "skyblue", alpha = 0.7, size = 3) +
    geom_smooth(method = "lm", formula = y ~ x) +
    plottheme +
    labs(x = x, y = y,
         title = "Linear Model Plot",
         subtitle = paste0("Y = ", intercept, ifelse(slope < 0, " - ", " + "), abs(slope), "x",
                           "\nR^2 = ", r2,
                           "\nP-Value = ", p_value))
  # Residuals against fitted values, with a dashed segment from each point to zero.
  resplot <- augment(linearM)
  p2 <- ggplot(resplot, aes(x = .fitted, y = .resid)) +
    geom_point(pch = 21, color = "black", fill = "skyblue", alpha = 0.7, size = 3) +
    geom_segment(aes(xend = .fitted, yend = 0), linetype = 2, color = "red") +
    geom_hline(yintercept = 0) +
    plottheme +
    labs(title = "Residuals Plot")
  print(p1)
  print(p2)
  print(summary(linearM))
}
hatecrimeslm(hatecrimes,predictorvariables[1],responsevariable)
##
## Call:
## lm(formula = formula(paste(y, "~", x)), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25922 -0.10473 -0.02883 0.07081 0.53087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.668e-02 1.553e-01 -0.172 0.8643
## median_house_inc 5.582e-06 2.810e-06 1.987 0.0532 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1722 on 44 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.08232, Adjusted R-squared: 0.06147
## F-statistic: 3.947 on 1 and 44 DF, p-value: 0.0532
Null Hypothesis: There is no relationship between median household income and hate crimes per 100k people
Alternate Hypothesis: There is a relationship between median household income and hate crimes per 100k people
Based on the above, we notice a slight upward trend but a weak fit, with an R-squared of 0.08 (a correlation of about 0.29). The p-value of 0.053 is also greater than 0.05, so there is insufficient evidence to reject the null hypothesis and we cannot conclude dependence.
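The figures in the regression output above can be cross-checked by hand. The sketch below recomputes the slope's t statistic and two-sided p-value from the printed estimate, standard error, and degrees of freedom, and converts R-squared to a correlation coefficient. Only the rounded values printed in the summary are used, so the results agree with the output to rounding precision; the variable names are illustrative.

```r
# Values copied from the printed summary for median_house_inc (rounded as shown there).
estimate  <- 5.582e-06
std_error <- 2.810e-06
df_resid  <- 44
r_squared <- 0.08232

t_value <- estimate / std_error                   # slope divided by its standard error
p_value <- 2 * pt(-abs(t_value), df = df_resid)   # two-sided p-value from the t distribution
r       <- sign(estimate) * sqrt(r_squared)       # in simple regression, r = sign(slope) * sqrt(R^2)

round(c(t = t_value, p = p_value, r = r), 3)
```

This also shows why 0.08 should be read as R-squared rather than as a correlation: the corresponding correlation is its square root, about 0.29.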
hatecrimeslm(hatecrimes,predictorvariables[7],responsevariable)
##
## Call:
## lm(formula = formula(paste(y, "~", x)), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20381 -0.14320 -0.04881 0.08871 0.54960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8009 0.6917 1.158 0.253
## gini_index -1.1528 1.5227 -0.757 0.453
##
## Residual standard error: 0.1786 on 44 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.01286, Adjusted R-squared: -0.009577
## F-statistic: 0.5731 on 1 and 44 DF, p-value: 0.4531
Null Hypothesis: There is no relationship between the Gini index and hate crimes per 100k people
Alternate Hypothesis: There is a relationship between the Gini index and hate crimes per 100k people
Based on the above, there is no visible trend and, as expected, a weak fit, with an R-squared of 0.01 (a correlation of about -0.11). The p-value of 0.453 is also much greater than 0.05, so there is insufficient evidence to reject the null hypothesis and we cannot conclude dependence.
In conclusion, from a bivariate standpoint (only one response and one predictor variable), the strongest relationship we obtain has a coefficient of 0.18, indicating that as the share of the population with only a HS degree increases, the number of hate crimes per 100k people tends to increase. Similarly, and interestingly, we obtain our second strongest coefficient of 0.17 for share_vote_trump: as the share of a state's voters who voted for Trump increases, the hate crime rate tends to decrease. To me, one speculative reading is that states with more partisan divisiveness have higher rates of hate crimes, so that heavily red or heavily blue states would tend to have lower rates. A linear fit on its own cannot confirm this reading, and because the study is observational these associations should not be interpreted causally.
Both regression coefficients had P < 0.05 and are statistically significant. The conditions for inference were also met: linearity, nearly normal residuals (after outlier removal), and scattered residuals.
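A comparison like the one in this conclusion is easier to audit when every bivariate fit is collected in one table. The sketch below is one way to do that with base R; the helper name fit_each_predictor is hypothetical, and the demonstration uses R's built-in mtcars data rather than the hate_crimes dataset.

```r
# Hypothetical helper: fit one simple linear regression per predictor and
# collect the slope, R-squared, and the slope's p-value in a single table.
fit_each_predictor <- function(data, response, predictors) {
  rows <- lapply(predictors, function(predictor) {
    model <- lm(reformulate(predictor, response = response), data = data)
    model_summary <- summary(model)
    data.frame(predictor = predictor,
               slope     = unname(coef(model)[2]),
               r_squared = model_summary$r.squared,
               p_value   = model_summary$coefficients[2, 4])
  })
  results <- do.call(rbind, rows)
  results[order(-results$r_squared), ]   # strongest fit first
}

# Demonstration on R's built-in mtcars data (not the hate_crimes dataset).
fit_each_predictor(mtcars, "mpg", c("wt", "hp", "qsec"))

# Assumed usage in this report:
# fit_each_predictor(hatecrimes, responsevariable, predictorvariables)
```

Sorting by R-squared puts the strongest relationships at the top, and the sign of each slope shows the direction of the association.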
Openintro Statistics, Fourth Edition, David Diez