Data Preparation

# Load the FiveThirtyEight hate-crimes data from GitHub
hate_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/hate-crimes/hate_crimes.csv"
hate_url <- read.csv(hate_url)  # the URL string is replaced by the data frame

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Does unemployment also drive hate crimes? Specifically, is there a relationship between a state's unemployment rate and its rate of hate crimes?

Cases

What are the cases, and how many are there?

Each case represents a state in the United States (the data also include the District of Columbia). There are 51 observations in the data set; 47 are used in this analysis because rows with NA values were excluded.
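
A quick count confirms those numbers. This is a minimal sketch; analysis_cols is an illustrative helper listing the same five columns kept in Step 1 below.

# Count cases before and after dropping rows with missing values (sketch)
analysis_cols <- c("state", "median_household_income", "share_unemployed_seasonal",
                   "hate_crimes_per_100k_splc", "avg_hatecrimes_per_100k_fbi")
nrow(hate_url)                                  # 51 rows: 50 states plus DC
sum(complete.cases(hate_url[, analysis_cols]))  # 47 rows with no NA in these columns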

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(extraoperators)
library(psych)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
# Step 1: keep only the columns used in this analysis
hate_url <- hate_url %>%
  select(state, median_household_income, share_unemployed_seasonal,
         hate_crimes_per_100k_splc, avg_hatecrimes_per_100k_fbi)

# Step 2: drop rows that contain missing values
hate_url <- hate_url[complete.cases(hate_url), ]

# Step 3: add a column combining the SPLC and FBI hate-crime rates (both per 100k population)
hate_url$hate_crimes_combine <- hate_url$hate_crimes_per_100k_splc + hate_url$avg_hatecrimes_per_100k_fbi

# Step 4: summary statistics; the median of share_unemployed_seasonal is 0.052
summary(hate_url)
##     state           median_household_income share_unemployed_seasonal
##  Length:47          Min.   :35521           Min.   :0.02900          
##  Class :character   1st Qu.:47630           1st Qu.:0.04350          
##  Mode  :character   Median :54310           Median :0.05200          
##                     Mean   :54802           Mean   :0.05087          
##                     3rd Qu.:60598           3rd Qu.:0.05800          
##                     Max.   :76165           Max.   :0.07300          
##  hate_crimes_per_100k_splc avg_hatecrimes_per_100k_fbi hate_crimes_combine
##  Min.   :0.06745           Min.   : 0.412              Min.   : 0.5324    
##  1st Qu.:0.14271           1st Qu.: 1.304              1st Qu.: 1.4788    
##  Median :0.22620           Median : 1.937              Median : 2.2272    
##  Mean   :0.30409           Mean   : 2.342              Mean   : 2.6460    
##  3rd Qu.:0.35693           3rd Qu.: 3.119              3rd Qu.: 3.4408    
##  Max.   :1.52230           Max.   :10.953              Max.   :12.4758
# Step 5: flag states whose seasonal unemployment share is above the median (0.052);
# %g% is the strictly-greater-than operator from the extraoperators package
hate_url$high_unemployed <- hate_url$share_unemployed_seasonal %g% 0.05200
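
The same flag can be built in base R without hard-coding the median; a minimal sketch, equivalent to Step 5 above:

# Base R equivalent of the %g% comparison (sketch)
hate_url$high_unemployed <- hate_url$share_unemployed_seasonal > median(hate_url$share_unemployed_seasonal)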

Data collection

Describe the method of data collection.

The data come from the FBI and the Southern Poverty Law Center (SPLC).

The FBI's Uniform Crime Reporting (UCR) Program collects hate crime data from law enforcement agencies. The UCR Program collects data only on prosecutable hate crimes, which make up a fraction of all hate incidents (a broader category that includes non-prosecutable offenses, such as the circulation of white nationalist recruitment materials on college campuses).

The Southern Poverty Law Center counts hate incidents using media accounts and people’s self-reports.

Type of study

What type of study is this (observational/experiment)?

It is an observational study; it can show an association between unemployment and hate crimes but cannot establish causation.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data set is published by FiveThirtyEight: https://github.com/fivethirtyeight/data/tree/master/hate-crimes

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is hate_crimes_combine; it is quantitative (numeric).

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

high_unemployed is the qualitative variable, and share_unemployed_seasonal is the quantitative variable.

high_unemployed is TRUE when share_unemployed_seasonal is above the median (0.052) and FALSE otherwise.
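
If a labeled version of the flag is preferred for plots and tables, the logical column can be recoded as a factor. This is an optional sketch; the column name unemployment_group is illustrative and not part of the original analysis.

# Optional: labeled factor version of the high/low unemployment flag (sketch)
hate_url$unemployment_group <- factor(hate_url$high_unemployed,
                                      levels = c(FALSE, TRUE),
                                      labels = c("Low unemployment", "High unemployment"))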

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

# Counts of states in the low (FALSE) and high (TRUE) unemployment groups
table(hate_url$high_unemployed, useNA = "ifany")
## 
## FALSE  TRUE 
##    27    20
# Summary statistics for the combined hate-crime rate
describe(hate_url$hate_crimes_combine)
##    vars  n mean  sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 47 2.65 1.9   2.23    2.43 1.46 0.53 12.48 11.94 2.93    12.61 0.28
# Summary statistics for the seasonal unemployment share
describe(hate_url$share_unemployed_seasonal)
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 47 0.05 0.01   0.05    0.05 0.01 0.03 0.07  0.04 0.03    -0.71  0
# Combined hate-crime rate summarized within each unemployment group
describeBy(hate_url$hate_crimes_combine,
           group = hate_url$high_unemployed, mat = TRUE)
##     item group1 vars  n     mean       sd   median  trimmed      mad       min
## X11    1  FALSE    1 27 2.638884 1.262283 2.257538 2.585261 1.545220 0.8855916
## X12    2   TRUE    1 20 2.655671 2.560312 2.120228 2.182719 1.046404 0.5324321
##          max     range     skew   kurtosis        se
## X11  5.43271  4.547118 0.341475 -0.9504367 0.2429265
## X12 12.47578 11.943349 2.753245  7.9135465 0.5725033
# Histogram of the combined hate-crime rate
ggplot(hate_url, aes(x = hate_crimes_combine)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histogram of the seasonal unemployment share
ggplot(hate_url, aes(x = share_unemployed_seasonal)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
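
The research question concerns the relationship between unemployment and hate crimes, so a scatter plot (quantitative predictor) and a boxplot (qualitative predictor) are natural additions. A minimal sketch using the variables defined above and the ggplot2 package already loaded:

# Scatter plot: seasonal unemployment share vs. combined hate-crime rate (sketch)
ggplot(hate_url, aes(x = share_unemployed_seasonal, y = hate_crimes_combine)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Boxplot: combined hate-crime rate by high/low unemployment group (sketch)
ggplot(hate_url, aes(x = high_unemployed, y = hate_crimes_combine)) +
  geom_boxplot()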