DATA 621 Blog1

Part 1 - Introduction

What is your research question? Why do you care? Why should others care?

My research question is “Did we see a higher percentage of hate crimes in states with a higher percentage of Trump voters or in states with lower median incomes?” By analyzing the data, we can determine if people voted and behaved the way they did because of political or economic reasons which would be interesting to understand for anyone interpreting the results of this dataset (political scientists, politicians, voters etc.).

##Data Preparation

I removed the states with NAs, which excludes 6 of the 51 records (the regressions below run on 45 complete cases), and added a classification column that takes the value 1 for states where the Trump vote share equals or exceeds 50% and 0 for states where it is below 50%. I then drew a box plot and the results were interesting: the hate crime rate distribution was higher in states that did not vote for Trump and lower in states with a higher Trump vote share. Since the Trump vote variable is numerical rather than categorical, a correlation is perhaps a better metric for assessing the relationship between crime and vote than relying on the box plot alone. With this in mind, I calculated the correlations. The correlation between crime and income (0.34) indicates a weak positive relationship, whereas the correlation between crime and vote is -0.65, indicating a negative relationship. Hence, states with a higher percentage of Trump voters saw lower hate crime rates, which is what the box plot depicts.
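As a rough check on how strong a correlation of -0.65 is, the sketch below converts it into a t-statistic and a two-sided p-value. The value of r is copied from the cor(crime, vote) output in this post; n = 45 is an assumption inferred from the 42 residual degrees of freedom reported by the regressions later on. This is only an illustration of the standard formula, not part of the original analysis.

```r
# Correlation copied from the cor(crime, vote) output in this post
r <- -0.654785
# Assumption: 45 complete cases, inferred from the residual degrees of freedom
n <- 45

# t-statistic for testing whether the population correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

t_stat
p_value
```

The t-statistic computed this way should agree with the t value for share_vote_trump in the single-predictor regression shown near the end of the post, since a simple regression slope test and a correlation test are equivalent.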

I used the 2016 US median household income of $57,617 from the American Community Survey as the cutoff for the box plot classification of crime against income. Again, this variable didn’t seem to explain the share of hate crimes in any intuitive way. Washington DC seemed to be an outlier, but removing this data point neither strengthened the correlation nor made the model more intuitive.

# load data and the plotting library
library(fivethirtyeight)
library(ggplot2)  # needed for the ggplot() calls below
data(hate_crimes)
row <- nrow(hate_crimes); row

## [1] 51

col <- ncol(hate_crimes); col

## [1] 13

naomit1 <- na.omit(hate_crimes); naomit1

naomit <- na.omit(hate_crimes); naomit

vote <- naomit1$share_vote_trump
crime <- naomit1$hate_crimes_per_100k_splc
income <- naomit1$median_house_inc
state <- naomit1$state
inequality <- naomit1$gini_index
cor1 <- cor(crime, income); cor1

## [1] 0.3437892

cor2 <- cor(crime, vote); cor2

## [1] -0.654785

naomit$class1 <- 0
naomit$class2 <- 0
naomit$class1[vote >= 0.5] <- 1; naomit

naomit$class2[income >= 57617] <- 1; naomit

ggplot(naomit, aes(x = as.factor(class1), y = crime)) + geom_boxplot(fill = 'red', alpha=.3) + xlab("Trump Voter classification") + ylab('crime')

ggplot(data = naomit) + geom_point(mapping = aes(x = vote, y = crime)) + geom_smooth(mapping = aes(x = vote, y = crime))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

plot(crime ~ vote)

ggplot(naomit, aes(x = as.factor(class2), y = crime)) + geom_boxplot(fill = 'blue', alpha=.3) + xlab("Median Household Income") + ylab('crime')

ggplot(data = naomit) + geom_point(mapping = aes(x = income, y = crime)) + geom_smooth(mapping = aes(x = income, y = crime))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

plot(crime ~ income)

Part 2 - Data

After excluding states that have NAs, the naomit1 dataset has 45 rows. Four of its columns are used here, and their names are self-explanatory - state (State name), income (Median Household Income in 2016), crime (Hate crimes per 100,000 population from the Southern Poverty Law Center 2016 dataset) and vote (Share of 2016 U.S. presidential voters who voted for Donald Trump).

##Data collection

I used the dataset from fivethirtyeight. However, the website admits to the challenges in data collection. The federal government doesn’t track hate crimes systematically (agencies report to FBI voluntarily). Voluntary reporting means that the hate crimes may be underreported. Also, the FBI UCR program only collects data on prosecutable hate crimes, which also leads to underreporting as there may be a significant share of crimes that are non-prosecutable but still categorized as hate crimes. The data was aggregated using both the federal government websites as well as from the Southern Poverty Law Center which is the dataset we use here. The Southern Poverty Law Center uses media accounts and people’s self-reports to assess the situation. To explain the hate crime percentage, the data on key socioeconomic factors was collected for each state from the Census Bureau and from the Kaiser Foundation website.

##Cases

The number of cases (observations) in this dataset is 45, each representing the hate crime rate in a state where none of the variables is NA.

##Variables

The variables I’ll be studying here are share_vote_trump and median_house_inc (income) and their relationship with hate_crimes_per_100k_splc. hate_crimes_per_100k_splc and share_vote_trump are both numeric variables while median_house_inc is an integer.

sapply(naomit1, class)

##                       state                state_abbrev 
##                 "character"                 "character" 
##            median_house_inc            share_unemp_seas 
##                   "integer"                   "numeric" 
##             share_pop_metro                share_pop_hs 
##                   "numeric"                   "numeric" 
##           share_non_citizen         share_white_poverty 
##                   "numeric"                   "numeric" 
##                  gini_index             share_non_white 
##                   "numeric"                   "numeric" 
##            share_vote_trump   hate_crimes_per_100k_splc 
##                   "numeric"                   "numeric" 
## avg_hatecrimes_per_100k_fbi 
##                   "numeric"

##Type of study

This is an observational study as no treatments were imposed on people and data was aggregated from surveys.

Scope of inference (Generalizability) - The population of interest is the entire US population. The findings from this analysis cannot be applied to the entire US population because this is a small, probably biased sample with very specific characteristics (because of voluntary reporting) and the study is observational in nature. Awareness bias is one of the biases present, specifically in the Southern Poverty Law Center data, and it prevents us from generalizing the results: self-reporting rates may have increased after people became aware of the incidence of hate crimes. Another possible bias is that states whose law enforcement agencies and residents are more likely to report could be overrepresented in the dataset, and states without that presence underrepresented.

Scope of inference (Causality) - This data cannot be used to establish causal links between the variables of interest as this is an observational study. Although such studies can provide evidence of a naturally occurring association between variables, they cannot by themselves show a causal connection. Causality can only be established by conducting an experiment. Hence, we can only determine whether there is an association between Trump vote share, income and the share of hate crimes.

Part 3 - Exploratory data analysis

The crime variable distribution is unimodal and right-skewed as indicated by the histogram (its mean of 0.302 sits well above its median of 0.226). The vote variable is unimodal and left-skewed. The skew should not be a problem since the sample size is more than 30. Income, on the other hand, is unimodal and approximately normally distributed.
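The direction of skew can also be checked numerically rather than only by eye. The sketch below defines a small moment-based skewness helper in base R and applies it to two made-up vectors; the helper name `skewness` and the toy data are hypothetical and are not part of the original analysis. Positive values suggest a longer right tail, negative values a longer left tail.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5 (hypothetical helper)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy vectors for illustration only, not taken from hate_crimes
right_tailed <- c(1, 2, 3, 10)
symmetric <- c(1, 2, 3, 4, 5)

skewness(right_tailed)   # positive: long right tail
skewness(symmetric)      # zero up to floating point error
```

In the post's own session, the same helper could be applied to the crime, vote and income vectors to confirm what the histograms suggest.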

summary(crime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06906 0.14374 0.22620 0.30243 0.35062 1.52230

hist(crime)

sd(crime)

## [1] 0.2515628

qqnorm(crime)
qqline(crime)

summary(vote)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.4100  0.4900  0.4822  0.5700  0.6900

hist(vote)

IQR(vote)

## [1] 0.16

sd(vote)

## [1] 0.1140751

qqnorm(vote)
qqline(vote)

summary(income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39552   48060   54916   55299   60708   76165

hist(income)

IQR(income)

## [1] 12648

sd(income)

## [1] 8979.492

qqnorm(income)
qqline(income)

Part 4 - Inference

H0: Crime is not dependent on Trump vote share and median household incomes of states.

Ha: Crime is dependent on Trump vote share and median household incomes of states.
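One common way to test hypotheses of this form is the overall F-test of a regression. The sketch below recomputes the F-statistic and its p-value for the income-and-vote model from the R-squared reported in the summary(regr) output in this post. The sample size n = 45 is an assumption inferred from the 42 residual degrees of freedom; the variable names are illustrative only.

```r
# Values copied from the summary(regr) output in this post
r2 <- 0.4374   # multiple R-squared
p <- 2         # number of predictors (income and vote)
n <- 45        # assumption: 42 residual df + 2 predictors + intercept

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

f_stat
p_value
```

Because R-squared is entered with four decimals, the result matches the reported F-statistic only up to rounding.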

Conditions: The cases are treated as independent, on the grounds that the sample constitutes less than 10% of the U.S. population. The distributions of the income and vote variables are unimodal, any skew is moderate, and given a sample size > 30 the underlying distributions can be considered approximately normal. So, all three conditions are taken as satisfied.

We use multiple regression methods which generally depend on the following four assumptions: 1). The residuals of the model are nearly normal, 2). The variability of the residuals is nearly constant, 3). The residuals are independent and 4). Each variable is linearly related to the outcome.

We can trust the p-values and parameter estimates only if the conditions for regression are reasonable, so we verify the conditions for the model using diagnostic plots. The Q-Q plot below shows deviations from the straight line. Hence, the residuals are not normally distributed. Secondly, we plot the absolute values of the residuals against the fitted values to determine whether the variance of the residuals is approximately constant; it is not. Finally, we check whether the residuals are independent, and from the box plots we see that there is some difference in the variability of the residuals between the two groups. Hence, the conditions for this model are not reasonable, as at least two of the four conditions for multiple regression are not satisfied.
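A Q-Q plot can be complemented with a formal normality test on the residuals. The sketch below runs a Shapiro-Wilk test (shapiro.test from base R's stats package) on the residuals of a model fitted to simulated data with deliberately skewed noise. The data, the seed and the names `sim_x`, `sim_y` and `sim_fit` are all hypothetical; this is not the model from the post.

```r
# Simulated data for illustration only (not the hate_crimes data)
set.seed(1)
sim_x <- runif(45)
# Exponential noise is right-skewed, so the residuals are non-normal by design
sim_y <- 0.5 - 0.7 * sim_x + rexp(45, rate = 5)

sim_fit <- lm(sim_y ~ sim_x)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(residuals(sim_fit))
sw$p.value
```

In the post's own session the same call could be applied to regr$residuals; with only 45 cases the test has limited power, so it is best read alongside the plots rather than instead of them.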

In the search for the best model, we start with a full model (m_full) and eliminate the variables with the highest p-values, which happen to be share_white_poverty and share_non_white. Dropping those two variables barely changed the coefficients of the other variables, implying that the dropped variables had only a small collinearity with the other explanatory variables. However, the adjusted R-squared improves from 0.6213 to 0.6419. Finally, we use backward selection with the p-value method to identify the best model. We eliminate all variables from the model that have a p-value > 0.05, which leaves us with the two variables share_vote_trump and avg_hatecrimes_per_100k_fbi. The model becomes crime = 0.45604 - 0.73414 * share_vote_trump + 0.08442 * avg_hatecrimes_per_100k_fbi. This model has an adjusted R-squared of 0.6415. Furthermore, eliminating either of the explanatory variables in the model above did not improve the adjusted R-squared further.
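The elimination described above was done by hand. As a minimal sketch of the same idea, the function below repeatedly drops the predictor with the largest p-value until every remaining p-value is at or below the threshold. The function name `backward_by_pvalue`, the simulated data frame `sim` and its columns are hypothetical, and the sketch assumes all predictors are numeric so that coefficient names match column names.

```r
# Simulated data for illustration: only x1 truly affects y
set.seed(621)
n_sim <- 45
sim <- data.frame(x1 = rnorm(n_sim), x2 = rnorm(n_sim), x3 = rnorm(n_sim))
sim$y <- 1 + 2 * sim$x1 + rnorm(n_sim, sd = 0.5)

# Hypothetical helper: backward elimination using p-values
backward_by_pvalue <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # p-values of the predictors, with the intercept row removed
    pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
    worst <- which.max(pvals[, 1])
    # Stop when all predictors are significant or only one is left
    if (pvals[worst, 1] <= alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, rownames(pvals)[worst])
  }
  fit
}

final_fit <- backward_by_pvalue(sim, "y")
names(coef(final_fit))
```

Dropping one variable at a time and refitting matters because p-values change whenever the set of predictors changes. Base R's step() function offers a related procedure based on AIC rather than p-values.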

regr <- lm(formula = crime ~ income + vote, data = naomit1, y = TRUE); regr

## 
## Call:
## lm(formula = crime ~ income + vote, data = naomit1, y = TRUE)
## 
## Coefficients:
## (Intercept)       income         vote  
##   1.266e+00   -3.364e-06   -1.612e+00

summary(regr)

## 
## Call:
## lm(formula = crime ~ income + vote, data = naomit1, y = TRUE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32836 -0.10969 -0.03346  0.06452  0.55066 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.266e+00  3.563e-01   3.553 0.000956 ***
## income      -3.364e-06  4.196e-06  -0.802 0.427118    
## vote        -1.612e+00  3.303e-01  -4.881 1.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1931 on 42 degrees of freedom
## Multiple R-squared:  0.4374, Adjusted R-squared:  0.4106 
## F-statistic: 16.32 on 2 and 42 DF,  p-value: 5.687e-06

qqnorm(regr$residuals)
qqline(regr$residuals)

plot(abs(regr$residuals) ~ regr$fitted.values,
     xlab = "Fitted Values", ylab = "Absolute Value of Residuals")

boxplot(regr$residuals ~ naomit$class1)

boxplot(regr$residuals ~ naomit$class2)

m_full <- lm(hate_crimes_per_100k_splc ~ median_house_inc + share_unemp_seas + share_pop_metro + share_pop_hs
             + share_non_citizen + share_white_poverty + gini_index + share_non_white
             + share_vote_trump + avg_hatecrimes_per_100k_fbi, data = naomit1)
summary(m_full)

## 
## Call:
## lm(formula = hate_crimes_per_100k_splc ~ median_house_inc + share_unemp_seas + 
##     share_pop_metro + share_pop_hs + share_non_citizen + share_white_poverty + 
##     gini_index + share_non_white + share_vote_trump + avg_hatecrimes_per_100k_fbi, 
##     data = naomit1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35739 -0.09222 -0.01030  0.07309  0.34927 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -1.139e+00  2.381e+00  -0.478 0.635375    
## median_house_inc            -5.066e-06  6.049e-06  -0.837 0.408229    
## share_unemp_seas             1.720e+00  3.348e+00   0.514 0.610769    
## share_pop_metro             -8.536e-02  2.448e-01  -0.349 0.729436    
## share_pop_hs                 1.890e+00  1.703e+00   1.110 0.274731    
## share_non_citizen           -6.705e-01  1.607e+00  -0.417 0.679218    
## share_white_poverty          2.413e-01  1.961e+00   0.123 0.902799    
## gini_index                   9.655e-01  2.176e+00   0.444 0.660094    
## share_non_white             -3.509e-02  3.472e-01  -0.101 0.920080    
## share_vote_trump            -1.088e+00  4.195e-01  -2.593 0.013934 *  
## avg_hatecrimes_per_100k_fbi  7.335e-02  2.035e-02   3.604 0.000991 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1548 on 34 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.6213 
## F-statistic: 8.219 on 10 and 34 DF,  p-value: 1.391e-06

m_full1 <- lm(hate_crimes_per_100k_splc ~ median_house_inc + share_unemp_seas + share_pop_metro + share_pop_hs
             + share_non_citizen + gini_index
             + share_vote_trump + avg_hatecrimes_per_100k_fbi, data = naomit1)
summary(m_full1)

## 
## Call:
## lm(formula = hate_crimes_per_100k_splc ~ median_house_inc + share_unemp_seas + 
##     share_pop_metro + share_pop_hs + share_non_citizen + gini_index + 
##     share_vote_trump + avg_hatecrimes_per_100k_fbi, data = naomit1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35522 -0.08943 -0.01455  0.06648  0.35860 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -1.057e+00  2.114e+00  -0.500 0.620078    
## median_house_inc            -5.553e-06  4.965e-06  -1.119 0.270746    
## share_unemp_seas             1.529e+00  2.995e+00   0.511 0.612726    
## share_pop_metro             -8.724e-02  2.345e-01  -0.372 0.712049    
## share_pop_hs                 1.882e+00  1.540e+00   1.222 0.229771    
## share_non_citizen           -7.866e-01  1.340e+00  -0.587 0.560802    
## gini_index                   9.077e-01  2.100e+00   0.432 0.668149    
## share_vote_trump            -1.079e+00  4.022e-01  -2.683 0.010949 *  
## avg_hatecrimes_per_100k_fbi  7.442e-02  1.901e-02   3.915 0.000386 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1505 on 36 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.6419 
## F-statistic: 10.86 on 8 and 36 DF,  p-value: 1.272e-07

m_backward <- lm(formula = hate_crimes_per_100k_splc ~ share_vote_trump + avg_hatecrimes_per_100k_fbi, data = naomit1)
summary(m_backward)

## 
## Call:
## lm(formula = hate_crimes_per_100k_splc ~ share_vote_trump + avg_hatecrimes_per_100k_fbi, 
##     data = naomit1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44198 -0.08285 -0.01422  0.06106  0.39120 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.45604    0.14211   3.209  0.00255 ** 
## share_vote_trump            -0.73414    0.23989  -3.060  0.00384 ** 
## avg_hatecrimes_per_100k_fbi  0.08442    0.01592   5.302 3.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1506 on 42 degrees of freedom
## Multiple R-squared:  0.6578, Adjusted R-squared:  0.6415 
## F-statistic: 40.37 on 2 and 42 DF,  p-value: 1.66e-10

Part 5 - Conclusion

In the final model above, both share_vote_trump and avg_hatecrimes_per_100k_fbi have small p-values, below the 0.05 level. Using these results, we can reject the null hypothesis in favor of the alternative hypothesis, with one qualification: the hate crime rate is associated with Trump vote share, but median household income was not a significant predictor (p = 0.43 in the income-and-vote model) and did not survive backward selection.

Finally, we look at diagnostic plots to consider whether the final model derived (crime = 0.45604 - 0.73414 * share_vote_trump + 0.08442 * avg_hatecrimes_per_100k_fbi) satisfies the four assumptions - 1). The residuals of the model are nearly normal, 2). The variability of the residuals is nearly constant, 3). The residuals are independent and 4). Each variable is linearly related to the outcome. The normal probability plot of the residuals below shows some minor irregularities, with one outlier in the lower left-hand side indicating long tails in the distribution of residuals. A plot of the absolute value of the residuals against their corresponding fitted values shows deviations from constant variance. Since two of the four conditions are not satisfied, the diagnostics do not support the model assumptions, as we see underlying structure in the residuals. I tried further reducing the model by removing either the share_vote_trump variable or the avg_hatecrimes_per_100k_fbi variable. Neither removal improved the adjusted R-squared. So, I’m going ahead and reporting the final model derived (crime = 0.45604 - 0.73414 * share_vote_trump + 0.08442 * avg_hatecrimes_per_100k_fbi) while noting the shortcomings mentioned above.
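The comparison of adjusted R-squared values can be reproduced from the multiple R-squared figures in the summaries below. The sketch applies the usual adjustment formula; the helper name `adj_r2` is hypothetical, and n = 45 is an assumption inferred from the residual degrees of freedom in the outputs. The R-squared inputs are copied from the printed summaries, so results agree with the reported values only up to rounding.

```r
# Adjusted R-squared from R-squared, sample size n and predictor count p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_obs <- 45  # assumption: inferred from residual degrees of freedom

# Multiple R-squared values copied from the model summaries in this post
both_vars <- adj_r2(0.6578, n_obs, 2)   # share_vote_trump + FBI rate
fbi_only <- adj_r2(0.5815, n_obs, 1)    # FBI rate only
vote_only <- adj_r2(0.4287, n_obs, 1)   # share_vote_trump only

round(c(both_vars = both_vars, fbi_only = fbi_only, vote_only = vote_only), 4)
```

The adjustment penalizes each extra predictor, so a two-variable model only keeps a higher adjusted value when the second variable explains enough additional variance, which is the case here.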

summary(m_backward)

## 
## Call:
## lm(formula = hate_crimes_per_100k_splc ~ share_vote_trump + avg_hatecrimes_per_100k_fbi, 
##     data = naomit1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44198 -0.08285 -0.01422  0.06106  0.39120 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.45604    0.14211   3.209  0.00255 ** 
## share_vote_trump            -0.73414    0.23989  -3.060  0.00384 ** 
## avg_hatecrimes_per_100k_fbi  0.08442    0.01592   5.302 3.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1506 on 42 degrees of freedom
## Multiple R-squared:  0.6578, Adjusted R-squared:  0.6415 
## F-statistic: 40.37 on 2 and 42 DF,  p-value: 1.66e-10

qqnorm(m_backward$residuals)
qqline(m_backward$residuals)

plot(abs(m_backward$residuals) ~ m_backward$fitted.values,
     xlab = "Fitted Values", ylab = "Absolute Value of Residuals")

m_backward1 <- lm(hate_crimes_per_100k_splc ~ avg_hatecrimes_per_100k_fbi, data = naomit1)
summary(m_backward1)

## 
## Call:
## lm(formula = hate_crimes_per_100k_splc ~ avg_hatecrimes_per_100k_fbi, 
##     data = naomit1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45176 -0.12099  0.00908  0.07426  0.41645 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.03747    0.04216   0.889    0.379    
## avg_hatecrimes_per_100k_fbi  0.11162    0.01444   7.729 1.15e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1646 on 43 degrees of freedom
## Multiple R-squared:  0.5815, Adjusted R-squared:  0.5717 
## F-statistic: 59.74 on 1 and 43 DF,  p-value: 1.15e-09

m_backward2 <- lm(hate_crimes_per_100k_splc ~ share_vote_trump, data = naomit1)
summary(m_backward2)

## 
## Call:
## lm(formula = hate_crimes_per_100k_splc ~ share_vote_trump, data = naomit1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32575 -0.11385 -0.02260  0.07346  0.58132 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.9987     0.1259   7.934 5.88e-10 ***
## share_vote_trump  -1.4440     0.2542  -5.681 1.06e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1923 on 43 degrees of freedom
## Multiple R-squared:  0.4287, Adjusted R-squared:  0.4155 
## F-statistic: 32.27 on 1 and 43 DF,  p-value: 1.063e-06

References

##OpenIntro Statistics
##Data citations are as below:
##Median Annual Household Income - https://www.kff.org/other/state-indicator/median-annual-income/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D
##Unemployment Rate (Seasonally Adjusted) - https://www.kff.org/other/state-indicator/unemployment-rate/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D
##Educational Attainment in the United States: 2009 - https://www.census.gov/prod/2012pubs/p20-566.pdf
##Population Distribution by Citizenship Status - https://www.kff.org/other/state-indicator/distribution-by-citizenship-status/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D
##Poverty Rate by Race/Ethnicity - https://www.kff.org/other/state-indicator/poverty-rate-by-raceethnicity/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D
##GINI INDEX OF INCOME INEQUALITY - https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_10_1YR_B19083&prodType=table
##Population Distribution by Race/Ethnicity - https://www.kff.org/other/state-indicator/distribution-by-raceethnicity/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D
##2016 November General Election Turnout Rates - http://www.electproject.org/2016g
##Ten Days After: Harassment and Intimidation in the Aftermath of the Election - https://www.splcenter.org/20161129/ten-days-after-harassment-and-intimidation-aftermath-election
##FBI - https://ucr.fbi.gov/hate-crime