Data 606 - Project

Part 1 - Introduction

There is growing sentiment within society that instances of hate are increasing. In the 10 days following the 2016 election, nearly 900 hate incidents were reported to the Southern Poverty Law Center (Majumder 2017). Thats an average of 90 hate crimes per day. For contrast, 36,000 hate crimes were reported to the FBI from 2010-2015, or 16 hate crimes per day on average.

The data we will be using for the analysis is a compilation of data from the Kaiser Family Foundation, the US Census Bureau, the US Election Project, Southern Poverty Law Center, and the FBI. The data is posted on FiveThirtyEight’s GitHub in a CSV format. There are some caveats associated with the data: 1) the federal government doesn’t track hate crimes systematically and thus some of the data is reliant upon media accounts and people’s self-reports, 2) the FBI collects hate crime data from law enforcement agencies but the data is submitted voluntarily, 3) the FBI uses the Uniform Crime Reporting Program (UCR) to collect crime data, but the UCR only collects data on prosecutable hate crimes, 4) heightened news coverage of hate incidents after the election may have caused people to report hate incidents more frequently due to awareness bias (Majumder 2017). All of these things contribute to bias baked in to the data being used for analysis.

The question we will be looking at in this analysis is:

Are there more annual hate crimes per 100,00 population in areas where the greater share of population voted for Trump in 2016?

Part 2 - Data

hateCrimes <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data606/master/project/data/hate_crimes.csv")
regionMap <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data606/master/project/data/region_mapping.csv") %>%
  rename_all(funs(str_to_lower(.)))

Some manipulations will need to be done to the data. hateCrimes will be joined to regionMap to add the columns: Region, Division, and State.Code. The following code will join the two datasets and create an additional categorical variable to denote Trump support

data <- hateCrimes %>%
  # join two datasets
  left_join(regionMap, by = "state") %>%
  # add trump_support variables
  mutate(support = case_when(
    share_voters_voted_trump > .5 ~ 'Trump',
    share_voters_voted_trump < .5 ~ 'Clinton',
    TRUE ~ 'Split'
  ))

The decision here was made to create a column titled support. The condition was based on the percentage share of voters. A share of voters greater than 50% received a value of ‘Trump’, less than 50% received ‘Clinton’, and 50% received ‘Split’.

We can further trim some of the “fat” in the dataset to only keep the columns we are analyzing. We will keep the median household income as a comparison for ‘Trump Support’. The original FiveThirtyEight article quoted income being the greatest predictor of hate crimes. We can also rename some columns for easier reference later

data %<>%
  select(-(3:9)) %>%
  # rename column
  rename(median_income = median_household_income,
         trump_share = share_voters_voted_trump,
         hate_crime_proportion = hate_crimes_per_100k_splc,
         avg_hatecrimes = avg_hatecrimes_per_100k_fbi)

Part 3 - Exploratory Data Analysis

General Information

Lets get some general information on the dataset

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51 obs. of  9 variables:
##  $ state                : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ median_income        : int  42278 67629 49254 44922 60487 60940 70161 57522 68277 46140 ...
##  $ trump_share          : num  0.63 0.53 0.5 0.6 0.33 0.44 0.41 0.42 0.04 0.49 ...
##  $ hate_crime_proportion: num  0.1258 0.1437 0.2253 0.0691 0.2558 ...
##  $ avg_hatecrimes       : num  1.806 1.657 3.414 0.869 2.398 ...
##  $ state code           : chr  "AL" "AK" "AZ" "AR" ...
##  $ region               : chr  "South" "West" "West" "South" ...
##  $ division             : chr  "East South Central" "Pacific" "Mountain" "West South Central" ...
##  $ support              : chr  "Trump" "Trump" "Split" "Trump" ...

We are able to see that there are 51 observations, and only 9 variables in the trimmed data set. We may want to convert some of the columns to factors in later analysis.

summary(data)
##     state           median_income    trump_share    hate_crime_proportion
##  Length:51          Min.   :35521   Min.   :0.040   Min.   :0.06745      
##  Class :character   1st Qu.:48657   1st Qu.:0.415   1st Qu.:0.14271      
##  Mode  :character   Median :54916   Median :0.490   Median :0.22620      
##                     Mean   :55224   Mean   :0.490   Mean   :0.30409      
##                     3rd Qu.:60719   3rd Qu.:0.575   3rd Qu.:0.35694      
##                     Max.   :76165   Max.   :0.700   Max.   :1.52230      
##                                                     NA's   :4            
##  avg_hatecrimes     state code           region         
##  Min.   : 0.2669   Length:51          Length:51         
##  1st Qu.: 1.2931   Class :character   Class :character  
##  Median : 1.9871   Mode  :character   Mode  :character  
##  Mean   : 2.3676                                        
##  3rd Qu.: 3.1843                                        
##  Max.   :10.9535                                        
##  NA's   :1                                              
##    division           support         
##  Length:51          Length:51         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

Histograms

plotSetup <- ggplot(data)

plotSetup +
  geom_bar(mapping = aes(x = support, 
                         color = support, 
                         fill = support)) +
  labs(title = "Voter Support", x = "", y = "Frequency")

Nationwide, Trump appears to be the less popular option. This coincides with the results of the election. Clinton won the popular vote but lost the electoral college.

We can get more granular by looking at the distribution by region

plotSetup +
  geom_bar(mapping = aes(x = support, 
                         color = support, 
                         fill = support)) +
  facet_grid(. ~ region) +
  labs(x = "Voter Support", y = "Frequency", title = "Regional Voter Support")

By getting more granular, we are able to determine that Trump is more favored in the Midwest and the South. In the Northeast, Trump did not get any majority support in any of the states, and while the West did have some states where he had the majority, Clinton was overall the most favored

Lets take a look at the distribution of the avg_hatecrimes

plotSetup +
  geom_histogram(mapping = aes(x = avg_hatecrimes), binwidth = 1, na.rm = TRUE, alpha = .8, fill = "#377EB8") +
  labs(x = "Average Hate Crimes", y = "Frequency", title = "Frequency Distribution")

The distribution shows a unimodal, skewed right distribution. If we remove the outlier…

plotSetup +
  geom_histogram(mapping = aes(x = avg_hatecrimes), binwidth = 1, na.rm = TRUE, alpha = .8, fill = "#377EB8") +
  labs(x = "Average Hate Crimes", y = "Frequency", title = "Frequency Distribution") + 
  xlim(0, 6)

…the distribution is still slightly skewed right and still unimodal.

The main question being asked in this analysis is whether regions with voter support favoring Trump experienced instances of higher hate crime rates. To do some exploratory analysis, we can use a series of box plots and scatterplots to ascertain that relationship.

Box Plots

# color setup
colors <- c("dodgerblue", "firebrick1", "grey")
names(colors) <- c("Clinton", "Trump", "Split")

plotSetup +
  geom_boxplot(mapping = aes(x = support, y = avg_hatecrimes, fill = support)) +
  labs(x = "", y = "Average Hate Crimes", title = "Voter Support") +
  scale_fill_manual(values = colors)

This boxplot clearly identifies and further reinforces that Trump has less support nationwide. The median, Q1, and Q3 values are all lower for Trump. Now lets create a similar boxplot, except break it down by region…

plotSetup +
  geom_boxplot(mapping = aes(x = support, y = avg_hatecrimes, fill = support)) +
  facet_grid(. ~ region) +
  labs(x = "", y = "Average Hate Crimes", title = "Regional Voter Support") + 
  scale_fill_manual(values = colors)

This boxplot is interesting. Clinton actually had a higher median of support in the South than Trump. This was hidden in the histogram. Lets take a look at the scatterplots

Scatterplot

model <- lm(avg_hatecrimes ~ trump_share, data = data)
bf <- coef(model)

plotSetup +
  geom_point(mapping = aes(x = trump_share, y = avg_hatecrimes, color = region)) +
  geom_smooth(aes(x = trump_share, y = avg_hatecrimes), method = "lm", na.rm = T) +
  #geom_abline(slope = bf[[2]], intercept = bf[[1]]) +
  labs(x = "Trump Voter Share", y = "Average Hate Crimes", title = "Trump Share vs Average Hate Crimes")

summary(model)
## 
## Call:
## lm(formula = avg_hatecrimes ~ trump_share, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1081 -1.1586 -0.0971  0.8863  5.2238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0260     0.9281   6.493 4.41e-08 ***
## trump_share  -7.4087     1.8300  -4.049 0.000187 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.495 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2546, Adjusted R-squared:  0.239 
## F-statistic: 16.39 on 1 and 48 DF,  p-value: 0.0001869

The slope is negative, which does indicate that there is a negative correlation between trump_share and avg_hatecrimes. The R-Squared indicates that only 25.46% of the variation.

We can also use the other variable hate_crime_proportion and plot it versus trump_share

model2 <- lm(hate_crime_proportion ~ trump_share, data = data)
bf2 <- coef(model2)

plotSetup +
  geom_point(mapping = aes(x = trump_share, y = hate_crime_proportion, color = region)) +
  geom_smooth(aes(x = trump_share, y = hate_crime_proportion), method = "lm", na.rm = T) +
  #geom_abline(slope = bf2[[2]], intercept = bf2[[1]]) +
  labs(x = "Trump Voter Share", y = "Hate Crime Proportion", title = "Trump Share vs Hate Crime Proportion") +
  lims(x = c(.2, .8))

summary(model2)
## 
## Call:
## lm(formula = hate_crime_proportion ~ trump_share, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33201 -0.11410 -0.02299  0.07858  0.56395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0173     0.1252   8.127 2.25e-10 ***
## trump_share  -1.4748     0.2522  -5.847 5.26e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1926 on 45 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.4317, Adjusted R-squared:  0.4191 
## F-statistic: 34.19 on 1 and 45 DF,  p-value: 5.263e-07

The slope is also negative, which does indicate that there is a negative correlation between trump_share and hate_crime_proportion. The R-Squared indicates that only 25.46% of the variation.

Both scatterplots show evidence of a positive correlation and both R-Squared values demonstrate that trump_share is not a great explanatory variable for hate crimes.

Part 4 - Inference

We will be performing two signifance tests to determine if there is enough evidence to statistically determine if there were more instances of hate crimes in areas that favored Trump

trump_share vs avg_hatecrimes

Let \(\bar{x}_{cc}\) = Mean of crime in areas of Clinton support
Let \(\bar{x}_{ct}\) = Mean of crime in areas of Trump support
\(H_o\): \(\bar{x}_{cc}\) = \(\bar{x}_{ct}\)
\(H_a\): \(\bar{x}_{ct}\) > \(\bar{x}_{cc}\)

# mean of avg hate crimes
crime_summary <- data %>%
  select(support, avg_hatecrimes) %>%
  group_by(support) %>%
  summarise(mean = mean(avg_hatecrimes, na.rm = T),
            sd = sd(avg_hatecrimes, na.rm = T),
            size = n(),
            st_error = sd/sqrt(nrow(data) - 1)) %>%
  filter(support != "Split")

kable(crime_summary)

support	mean	sd	size	st_error
Clinton	2.735555	2.066828	26	0.2922937
Trump	1.940743	1.190869	24	0.1684144

# significance test
t <- (crime_summary$mean[crime_summary$support == 'Trump'] - crime_summary$mean[crime_summary$support == 'Clinton']) %>%
  divide_by(crime_summary$st_error[crime_summary$support == 'Trump'])

# upper tail
p <- pt(t, df = nrow(data) - 2, lower.tail = F) %>%
  print
## [1] 0.9999899

Based on the p-value, there is not enough evidence to reject the null hypothesis thus we cannot conclude that there is higher hate crime in Trump favored areas

trump_share vs hate_crime_propotion

Let \(\bar{x}_{cc}\) = Mean of crime in areas of Clinton support
Let \(\bar{x}_{ct}\) = Mean of crime in areas of Trump support
\(H_o\): \(\bar{x}_{cc}\) = \(\bar{x}_{ct}\)
\(H_a\): \(\bar{x}_{ct}\) > \(\bar{x}_{cc}\)

# mean of hate_crime_proportion
crime_summary <- data %>%
  select(support, hate_crime_proportion) %>%
  group_by(support) %>%
  summarise(mean = mean(hate_crime_proportion, na.rm = T),
            sd = sd(hate_crime_proportion, na.rm = T),
            size = n(),
            st_error = sd/sqrt(nrow(data) - 1)) %>%
  filter(support != "Split")

kable(crime_summary)

support	mean	sd	size	st_error
Clinton	0.3926267	0.3062642	26	0.0433123
Trump	0.2024468	0.1156196	24	0.0163511

# significance test
t <- (crime_summary$mean[crime_summary$support == 'Trump'] - crime_summary$mean[crime_summary$support == 'Clinton']) %>%
  divide_by(crime_summary$st_error[crime_summary$support == 'Trump'])

# upper tail
p <- pt(t, df = nrow(data) - 2, lower.tail = F) %>%
  print
## [1] 1