Part 1 - Introduction
There is growing sentiment within society that instances of hate are increasing. In the 10 days following the 2016 election, nearly 900 hate incidents were reported to the Southern Poverty Law Center (Majumder 2017). Thats an average of 90 hate crimes per day. For contrast, 36,000 hate crimes were reported to the FBI from 2010-2015, or 16 hate crimes per day on average.
The data we will be using for the analysis is a compilation of data from the Kaiser Family Foundation, the US Census Bureau, the US Election Project, Southern Poverty Law Center, and the FBI. The data is posted on FiveThirtyEight’s GitHub in a CSV format. There are some caveats associated with the data: 1) the federal government doesn’t track hate crimes systematically and thus some of the data is reliant upon media accounts and people’s self-reports, 2) the FBI collects hate crime data from law enforcement agencies but the data is submitted voluntarily, 3) the FBI uses the Uniform Crime Reporting Program (UCR) to collect crime data, but the UCR only collects data on prosecutable hate crimes, 4) heightened news coverage of hate incidents after the election may have caused people to report hate incidents more frequently due to awareness bias (Majumder 2017). All of these things contribute to bias baked in to the data being used for analysis.
The question we will be looking at in this analysis is:
Are there more annual hate crimes per 100,00 population in areas where the greater share of population voted for Trump in 2016?
Part 2 - Data
hateCrimes <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data606/master/project/data/hate_crimes.csv")
regionMap <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data606/master/project/data/region_mapping.csv") %>%
rename_all(funs(str_to_lower(.)))
Some manipulations will need to be done to the data. hateCrimes will be joined to regionMap to add the columns: Region, Division, and State.Code. The following code will join the two datasets and create an additional categorical variable to denote Trump support
data <- hateCrimes %>%
# join two datasets
left_join(regionMap, by = "state") %>%
# add trump_support variables
mutate(support = case_when(
share_voters_voted_trump > .5 ~ 'Trump',
share_voters_voted_trump < .5 ~ 'Clinton',
TRUE ~ 'Split'
))
The decision here was made to create a column titled support. The condition was based on the percentage share of voters. A share of voters greater than 50% received a value of ‘Trump’, less than 50% received ‘Clinton’, and 50% received ‘Split’.
We can further trim some of the “fat” in the dataset to only keep the columns we are analyzing. We will keep the median household income as a comparison for ‘Trump Support’. The original FiveThirtyEight article quoted income being the greatest predictor of hate crimes. We can also rename some columns for easier reference later
data %<>%
select(-(3:9)) %>%
# rename column
rename(median_income = median_household_income,
trump_share = share_voters_voted_trump,
hate_crime_proportion = hate_crimes_per_100k_splc,
avg_hatecrimes = avg_hatecrimes_per_100k_fbi)
Part 3 - Exploratory Data Analysis
General Information
Lets get some general information on the dataset
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51 obs. of 9 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ median_income : int 42278 67629 49254 44922 60487 60940 70161 57522 68277 46140 ...
## $ trump_share : num 0.63 0.53 0.5 0.6 0.33 0.44 0.41 0.42 0.04 0.49 ...
## $ hate_crime_proportion: num 0.1258 0.1437 0.2253 0.0691 0.2558 ...
## $ avg_hatecrimes : num 1.806 1.657 3.414 0.869 2.398 ...
## $ state code : chr "AL" "AK" "AZ" "AR" ...
## $ region : chr "South" "West" "West" "South" ...
## $ division : chr "East South Central" "Pacific" "Mountain" "West South Central" ...
## $ support : chr "Trump" "Trump" "Split" "Trump" ...
We are able to see that there are 51 observations, and only 9 variables in the trimmed data set. We may want to convert some of the columns to factors in later analysis.
summary(data)
## state median_income trump_share hate_crime_proportion
## Length:51 Min. :35521 Min. :0.040 Min. :0.06745
## Class :character 1st Qu.:48657 1st Qu.:0.415 1st Qu.:0.14271
## Mode :character Median :54916 Median :0.490 Median :0.22620
## Mean :55224 Mean :0.490 Mean :0.30409
## 3rd Qu.:60719 3rd Qu.:0.575 3rd Qu.:0.35694
## Max. :76165 Max. :0.700 Max. :1.52230
## NA's :4
## avg_hatecrimes state code region
## Min. : 0.2669 Length:51 Length:51
## 1st Qu.: 1.2931 Class :character Class :character
## Median : 1.9871 Mode :character Mode :character
## Mean : 2.3676
## 3rd Qu.: 3.1843
## Max. :10.9535
## NA's :1
## division support
## Length:51 Length:51
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Histograms
plotSetup <- ggplot(data)
plotSetup +
geom_bar(mapping = aes(x = support,
color = support,
fill = support)) +
labs(title = "Voter Support", x = "", y = "Frequency")
Nationwide, Trump appears to be the less popular option. This coincides with the results of the election. Clinton won the popular vote but lost the electoral college.
We can get more granular by looking at the distribution by region
plotSetup +
geom_bar(mapping = aes(x = support,
color = support,
fill = support)) +
facet_grid(. ~ region) +
labs(x = "Voter Support", y = "Frequency", title = "Regional Voter Support")
By getting more granular, we are able to determine that Trump is more favored in the Midwest and the South. In the Northeast, Trump did not get any majority support in any of the states, and while the West did have some states where he had the majority, Clinton was overall the most favored
Lets take a look at the distribution of the avg_hatecrimes
plotSetup +
geom_histogram(mapping = aes(x = avg_hatecrimes), binwidth = 1, na.rm = TRUE, alpha = .8, fill = "#377EB8") +
labs(x = "Average Hate Crimes", y = "Frequency", title = "Frequency Distribution")
The distribution shows a unimodal, skewed right distribution. If we remove the outlier…
plotSetup +
geom_histogram(mapping = aes(x = avg_hatecrimes), binwidth = 1, na.rm = TRUE, alpha = .8, fill = "#377EB8") +
labs(x = "Average Hate Crimes", y = "Frequency", title = "Frequency Distribution") +
xlim(0, 6)
…the distribution is still slightly skewed right and still unimodal.
The main question being asked in this analysis is whether regions with voter support favoring Trump experienced instances of higher hate crime rates. To do some exploratory analysis, we can use a series of box plots and scatterplots to ascertain that relationship.
Box Plots
# color setup
colors <- c("dodgerblue", "firebrick1", "grey")
names(colors) <- c("Clinton", "Trump", "Split")
plotSetup +
geom_boxplot(mapping = aes(x = support, y = avg_hatecrimes, fill = support)) +
labs(x = "", y = "Average Hate Crimes", title = "Voter Support") +
scale_fill_manual(values = colors)
This boxplot clearly identifies and further reinforces that Trump has less support nationwide. The median, Q1, and Q3 values are all lower for Trump. Now lets create a similar boxplot, except break it down by region…
plotSetup +
geom_boxplot(mapping = aes(x = support, y = avg_hatecrimes, fill = support)) +
facet_grid(. ~ region) +
labs(x = "", y = "Average Hate Crimes", title = "Regional Voter Support") +
scale_fill_manual(values = colors)
This boxplot is interesting. Clinton actually had a higher median of support in the South than Trump. This was hidden in the histogram. Lets take a look at the scatterplots
Scatterplot
model <- lm(avg_hatecrimes ~ trump_share, data = data)
bf <- coef(model)
plotSetup +
geom_point(mapping = aes(x = trump_share, y = avg_hatecrimes, color = region)) +
geom_smooth(aes(x = trump_share, y = avg_hatecrimes), method = "lm", na.rm = T) +
#geom_abline(slope = bf[[2]], intercept = bf[[1]]) +
labs(x = "Trump Voter Share", y = "Average Hate Crimes", title = "Trump Share vs Average Hate Crimes")
summary(model)
##
## Call:
## lm(formula = avg_hatecrimes ~ trump_share, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1081 -1.1586 -0.0971 0.8863 5.2238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0260 0.9281 6.493 4.41e-08 ***
## trump_share -7.4087 1.8300 -4.049 0.000187 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.495 on 48 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2546, Adjusted R-squared: 0.239
## F-statistic: 16.39 on 1 and 48 DF, p-value: 0.0001869
The slope is negative, which does indicate that there is a negative correlation between trump_share and avg_hatecrimes. The R-Squared indicates that only 25.46% of the variation.
We can also use the other variable hate_crime_proportion and plot it versus trump_share
model2 <- lm(hate_crime_proportion ~ trump_share, data = data)
bf2 <- coef(model2)
plotSetup +
geom_point(mapping = aes(x = trump_share, y = hate_crime_proportion, color = region)) +
geom_smooth(aes(x = trump_share, y = hate_crime_proportion), method = "lm", na.rm = T) +
#geom_abline(slope = bf2[[2]], intercept = bf2[[1]]) +
labs(x = "Trump Voter Share", y = "Hate Crime Proportion", title = "Trump Share vs Hate Crime Proportion") +
lims(x = c(.2, .8))
summary(model2)
##
## Call:
## lm(formula = hate_crime_proportion ~ trump_share, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33201 -0.11410 -0.02299 0.07858 0.56395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0173 0.1252 8.127 2.25e-10 ***
## trump_share -1.4748 0.2522 -5.847 5.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1926 on 45 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.4317, Adjusted R-squared: 0.4191
## F-statistic: 34.19 on 1 and 45 DF, p-value: 5.263e-07
The slope is also negative, which does indicate that there is a negative correlation between trump_share and hate_crime_proportion. The R-Squared indicates that only 25.46% of the variation.
Both scatterplots show evidence of a positive correlation and both R-Squared values demonstrate that trump_share is not a great explanatory variable for hate crimes.
Part 4 - Inference
We will be performing two signifance tests to determine if there is enough evidence to statistically determine if there were more instances of hate crimes in areas that favored Trump
Part 5 - Conclusion
Through significance testing, we were able to confirm that there was not enough evidence to reject the null hypothesis that Trump supported areas have higher instances of hate crime. The exploratory data analysis also showed a decreasing trend line as Trump support increased
References
Majumder, Maimuna. “Higher Rates Of Hate Crimes Are Tied To Income Inequality.” FiveThirtyEight, FiveThirtyEight, 21 Apr. 2017, fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/.