Data Preparation
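# packages used below: tidyverse (readr, dplyr, stringr, ggplot2), magrittr (%<>%), knitr and kableExtra (tables)
library(tidyverse)
library(magrittr)
library(knitr)
library(kableExtra)
data <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data606/master/project/data/hate_crimes.csv")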
## Parsed with column specification:
## cols(
## state = col_character(),
## median_household_income = col_integer(),
## share_unemployed_seasonal = col_double(),
## share_population_in_metro_areas = col_double(),
## share_population_with_high_school_degree = col_double(),
## share_non_citizen = col_double(),
## share_white_poverty = col_double(),
## gini_index = col_double(),
## share_non_white = col_double(),
## share_voters_voted_trump = col_double(),
## hate_crimes_per_100k_splc = col_double(),
## avg_hatecrimes_per_100k_fbi = col_double()
## )
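region_mapping <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data606/master/project/data/region_mapping.csv") %>%
  # convert all headers to lowercase so that `State` becomes `state`, the join key used below
  # (rename_with() replaces the deprecated rename_all(funs(...)) idiom)
  rename_with(str_to_lower)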
## Parsed with column specification:
## cols(
## State = col_character(),
## `State Code` = col_character(),
## Region = col_character(),
## Division = col_character()
## )
Adding Qualitative Variables
data %<>%
  left_join(region_mapping, by = "state")
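A left join keeps every row of data and fills the region columns with NA wherever a state name has no match, so a spelling difference between the two files would pass silently. A minimal sketch of a check follows, using small stand-in tibbles rather than the real files; the helper name unmatched_states and the toy mapping are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: rows of `left` whose `state` has no partner in `right`
unmatched_states <- function(left, right) {
  anti_join(left, right, by = "state")
}

# Stand-in data (income values as reported in the top-5 table of this report)
toy_data <- tibble(
  state = c("Maryland", "New Hampshire", "Hawaii"),
  median_household_income = c(76165, 73397, 71223)
)

# Stand-in mapping with capitalised headers, like the raw region file;
# Hawaii is left out on purpose so that the check has something to find
toy_regions <- tibble(
  State = c("Maryland", "New Hampshire"),
  Region = c("South", "Northeast")
) %>%
  rename_with(str_to_lower)

unmatched_states(toy_data, toy_regions)  # one row: Hawaii
```

On the real objects, unmatched_states(data, region_mapping) returning zero rows would indicate that all 51 cases received a region.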
Research question
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Do states where a greater share of voters voted for Trump in 2016 have more annual hate crimes per 100,000 population?
Cases
What are the cases, and how many are there?
There are 51 cases: the 50 US states and the District of Columbia. Each case records hate crime rates and the 2016 Trump vote share, along with demographic and economic indicators.
Data collection
Describe the method of data collection.
The data was downloaded as a CSV file from FiveThirtyEight’s GitHub. FiveThirtyEight gathered it from numerous sources, including the Kaiser Family Foundation, the Census Bureau, the United States Elections Project, the Southern Poverty Law Center, and the FBI.
Type of study
What type of study is this (observational/experiment)?
This is an observational study: it analyzes existing data on events that have already occurred, and no treatment was assigned.
Data Source
If you collected the data, state self-collected. If not, provide a citation/link.
Response
What is the response variable, and what type is it (numerical/categorical)?
The response variable is average annual hate crimes per 100,000 population and it is a numerical variable.
Explanatory
What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variable is the share of each state's voters who voted for Trump in 2016. This is also a numerical variable.
Relevant summary statistics
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(data)
## state median_household_income share_unemployed_seasonal
## Length:51 Min. :35521 Min. :0.02800
## Class :character 1st Qu.:48657 1st Qu.:0.04200
## Mode :character Median :54916 Median :0.05100
## Mean :55224 Mean :0.04957
## 3rd Qu.:60719 3rd Qu.:0.05750
## Max. :76165 Max. :0.07300
##
## share_population_in_metro_areas share_population_with_high_school_degree
## Min. :0.3100 Min. :0.7990
## 1st Qu.:0.6300 1st Qu.:0.8405
## Median :0.7900 Median :0.8740
## Mean :0.7502 Mean :0.8691
## 3rd Qu.:0.8950 3rd Qu.:0.8980
## Max. :1.0000 Max. :0.9180
##
## share_non_citizen share_white_poverty gini_index share_non_white
## Min. :0.01000 Min. :0.04000 Min. :0.4190 Min. :0.0600
## 1st Qu.:0.03000 1st Qu.:0.07500 1st Qu.:0.4400 1st Qu.:0.1950
## Median :0.04500 Median :0.09000 Median :0.4540 Median :0.2800
## Mean :0.05458 Mean :0.09176 Mean :0.4538 Mean :0.3157
## 3rd Qu.:0.08000 3rd Qu.:0.10000 3rd Qu.:0.4665 3rd Qu.:0.4200
## Max. :0.13000 Max. :0.17000 Max. :0.5320 Max. :0.8100
## NA's :3
## share_voters_voted_trump hate_crimes_per_100k_splc
## Min. :0.040 Min. :0.06745
## 1st Qu.:0.415 1st Qu.:0.14271
## Median :0.490 Median :0.22620
## Mean :0.490 Mean :0.30409
## 3rd Qu.:0.575 3rd Qu.:0.35694
## Max. :0.700 Max. :1.52230
## NA's :4
## avg_hatecrimes_per_100k_fbi state code region
## Min. : 0.2669 Length:51 Length:51
## 1st Qu.: 1.2931 Class :character Class :character
## Median : 1.9871 Mode :character Mode :character
## Mean : 2.3676
## 3rd Qu.: 3.1843
## Max. :10.9535
## NA's :1
## division
## Length:51
## Class :character
## Mode :character
##
##
##
##
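The summary shows missing values in three columns (3 in share_non_citizen, 4 in hate_crimes_per_100k_splc, 1 in avg_hatecrimes_per_100k_fbi), which matters because lm() drops incomplete rows by default. The sketch below counts NAs per column and computes a correlation on complete pairs only; count_missing is a hypothetical helper, and the toy tibble is made up purely for illustration.

```r
library(dplyr)

# Hypothetical helper: number of NA values in each column
count_missing <- function(df) {
  summarise(df, across(everything(), ~ sum(is.na(.x))))
}

# Made-up stand-in for the two variables in the research question
toy <- tibble(
  share_voters_voted_trump    = c(0.30, 0.40, 0.50, 0.60),
  avg_hatecrimes_per_100k_fbi = c(3.0, NA, 2.0, 1.0)
)

count_missing(toy)

# correlation using only the rows where both values are present
cor(toy$share_voters_voted_trump, toy$avg_hatecrimes_per_100k_fbi,
    use = "complete.obs")
```

Applied to the real data, count_missing(data) should reproduce the NA counts printed by summary(data).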
# top 5 states by median household income
one <- data %>%
  select(state, median_household_income, share_voters_voted_trump) %>%
  arrange(desc(median_household_income)) %>%
  head(5)
kable(one, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
state | median_household_income | share_voters_voted_trump |
---|---|---|
Maryland | 76165 | 0.35 |
New Hampshire | 73397 | 0.47 |
Hawaii | 71223 | 0.30 |
Connecticut | 70161 | 0.41 |
District of Columbia | 68277 | 0.04 |
Visualizations
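ggplot(data, aes(x = share_voters_voted_trump, y = avg_hatecrimes_per_100k_fbi, col = region)) +
  geom_point(aes(size = avg_hatecrimes_per_100k_fbi), alpha = .6, shape = 16) +
  # single least-squares line across all states; a bare geom_abline() would draw y = x instead
  geom_smooth(aes(group = 1), method = "lm", formula = y ~ x, se = FALSE, colour = "black")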
Linear Model
model <- lm(avg_hatecrimes_per_100k_fbi ~ share_voters_voted_trump, data = data)
summary(model)
##
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ share_voters_voted_trump,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1081 -1.1586 -0.0971 0.8863 5.2238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0260 0.9281 6.493 4.41e-08 ***
## share_voters_voted_trump -7.4087 1.8300 -4.049 0.000187 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.495 on 48 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2546, Adjusted R-squared: 0.239
## F-statistic: 16.39 on 1 and 48 DF, p-value: 0.0001869
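To make the coefficients concrete, the arithmetic below reuses only the numbers printed in the summary above. It is a worked illustration rather than a refit of the model, and the variable names are introduced here for convenience.

```r
# Coefficients as printed in the model summary
intercept <- 6.0260
slope     <- -7.4087
se_slope  <- 1.8300
r_squared <- 0.2546

# Predicted FBI rate for a state at the median Trump vote share (0.49)
pred_at_median <- intercept + slope * 0.49

# Change in the predicted rate for a vote share 10 percentage points higher
change_per_10pts <- slope * 0.10

# The t value is the estimate divided by its standard error
t_value <- slope / se_slope

# In simple regression, r has the sign of the slope and magnitude sqrt(R^2)
r <- sign(slope) * sqrt(r_squared)

round(c(pred_at_median = pred_at_median,
        change_per_10pts = change_per_10pts,
        t_value = t_value,
        r = r), 3)
```

The predicted rate at the median vote share is about 2.4 hate crimes per 100,000, and each additional 10 percentage points of Trump vote share is associated with roughly 0.74 fewer. The recomputed t value agrees with the printed -4.049 up to rounding of the inputs, and the implied correlation is about -0.5.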
Multiplot
# one scatterplot per region, each with its own least-squares line
# (a bare geom_abline() would draw y = x rather than a fitted line)
p1 <- data %>%
  filter(region == "Northeast") %>%
  ggplot(aes(x = share_voters_voted_trump, y = avg_hatecrimes_per_100k_fbi)) +
  geom_point(aes(size = avg_hatecrimes_per_100k_fbi), alpha = .6, shape = 16) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p2 <- data %>%
  filter(region == "South") %>%
  ggplot(aes(x = share_voters_voted_trump, y = avg_hatecrimes_per_100k_fbi)) +
  geom_point(aes(size = avg_hatecrimes_per_100k_fbi), alpha = .6, shape = 16) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p3 <- data %>%
  filter(region == "Midwest") %>%
  ggplot(aes(x = share_voters_voted_trump, y = avg_hatecrimes_per_100k_fbi)) +
  geom_point(aes(size = avg_hatecrimes_per_100k_fbi), alpha = .6, shape = 16) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p4 <- data %>%
  filter(region == "West") %>%
  ggplot(aes(x = share_voters_voted_trump, y = avg_hatecrimes_per_100k_fbi)) +
  geom_point(aes(size = avg_hatecrimes_per_100k_fbi), alpha = .6, shape = 16) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p1
p2
p3
p4
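The four panels above repeat the same plotting code with a different filter() each time. One possible alternative, sketched here under the assumption that a single figure with shared axes is acceptable, is to let facet_wrap() split the data by region. The helper name plot_by_region and the toy rows are hypothetical.

```r
library(ggplot2)

# Hypothetical helper: one panel per region, each with its own least-squares line
plot_by_region <- function(df) {
  ggplot(df, aes(x = share_voters_voted_trump, y = avg_hatecrimes_per_100k_fbi)) +
    geom_point(alpha = .6, shape = 16) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    facet_wrap(~ region)
}

# Made-up rows, only to show the call
toy <- data.frame(
  region = rep(c("Northeast", "South"), each = 3),
  share_voters_voted_trump = c(0.30, 0.40, 0.50, 0.40, 0.50, 0.60),
  avg_hatecrimes_per_100k_fbi = c(1, 2, 3, 3, 2, 1)
)

p <- plot_by_region(toy)
```

With the real data the call would be plot_by_region(data). Shared axes make the regions directly comparable, at the cost of compressing panels whose values span a narrow range.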
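Within each region, there appears to be a fairly weak positive relationship, although each panel contains only a subset of the 51 cases. This contrasts with the model fitted to all states, where the estimated slope is negative (-7.41, p < 0.001).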
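A trend read off small panels by eye is hard to judge reliably. One way to check it, sketched below, is to fit the same simple regression within each region and compare the slopes. The helper slope_by_region is hypothetical, and the toy rows are made up so that the expected slopes are obvious; they are not results from the hate crimes data.

```r
# Hypothetical helper: least-squares slope of the FBI rate on Trump vote share,
# fitted separately within each region (lm() drops rows with missing values)
slope_by_region <- function(df) {
  sapply(split(df, df$region), function(d) {
    fit <- lm(avg_hatecrimes_per_100k_fbi ~ share_voters_voted_trump, data = d)
    coef(fit)[["share_voters_voted_trump"]]
  })
}

# Made-up rows: the rate rises with vote share in one region and falls in the other
toy <- data.frame(
  region = rep(c("Northeast", "South"), each = 3),
  share_voters_voted_trump = c(0.30, 0.40, 0.50, 0.40, 0.50, 0.60),
  avg_hatecrimes_per_100k_fbi = c(1, 2, 3, 3, 2, 1)
)

slopes <- slope_by_region(toy)
slopes
```

Running slope_by_region(data) on the real data would give one slope per region to set against the overall estimate. With so few states per region, each slope would carry a wide margin of error.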