
Part 1 - Introduction:

With the rise of social media and the rapid distribution of information, the world's problems can now be brought to the masses. We are no longer limited by time; news is delivered to our pockets as it happens. This availability of information has shed light on many social problems. One such problem is the prevalence of “Death By Cop.” More and more, we hear of police officers using violent force that results in the death of suspected perpetrators. Though most of these deaths likely occur while officers are attempting to perform their duties, the lack of available data and the lack of oversight from cities and states make these incidents cause for concern. One attempt to document these deaths is the Guardian’s site “The Counted.” This crowdsourced database relies on the public as its primary source of information, allowing anyone to submit reports to the website. Opening submissions to the public, and then separately verifying each case, allows for broader coverage. With this data, the Guardian is attempting to bring to light the prevalence of deaths by police, allowing the public to act as police oversight. Knowing that they are being “watched,” police officers and government agencies face pressure to take action by imposing stricter rules and regulations, making officers more “socially aware,” and making safety a top priority. Though we are a long way from a perfect society, this effort takes us one step closer to a better one.

In this project, we take the Counted data and analyse some possible root causes of fatalities caused by police officers. Specifically, we look at Census data in conjunction with the Counted data to see whether there is any association between fatalities and the income and poverty levels of the surrounding area. We also investigate whether the data differ by ethnicity, and whether socioeconomic factors play a part. Generally speaking, we want to know: do the locations of fatalities depend on factors such as race? And do income and poverty rate play a role as well?

Part 2 - Data:

The data we use for our investigation comes from the Guardian’s crowdsourced database. It includes key demographic information about the deceased (name, age, race, gender, etc.) along with basic data about the incident (date, whether the person was armed, location, etc.). These incidents were then combined with the 2015 Community Survey Census data. The join was performed by locating each incident with Google and Bing GPS data and overlaying the result with Census API data. All of this was done by FiveThirtyEight, a polling aggregation and data journalism website that draws on many data sources. Because the data record specific, tragic real-world events rather than the results of an experiment, this is an entirely observational study.

Our grouping (explanatory) variable is ethnicity, and our response variables are the median household income and the poverty rate of the surrounding area; we also look at the number of deaths per ethnicity. Our goal is to determine what role, if any, income and poverty rate play in fatalities by police. For the record, we are assuming several things. First, the incidents took place in a location that is relevant to the victim (i.e., work or home). Second, the census data is an accurate description of the victim, or is in some way relevant to the incident itself. This may not be the case, as the victim may have traveled away from his or her home location. Third, the crowdsourced data is in and of itself accurate. This is, of course, a wide array of assumptions, and we know most of them are unlikely to hold exactly. However, for our purposes here, we will assume they are true for the time being and discuss the possible implications later.

For our conditions of inference: because this is observational data on separate incidents, we assume that the observations are independent of one another (and we greatly hope that they are; police killings that depended on one another would be cause for extreme concern). Furthermore, we assume that the groups we extracted (i.e., each ethnicity) are independent of one another. Our sample size (which we discuss further in a moment) is sufficiently large to meet our criteria, so we can perform inference on the data at large.

Our population of interest is the total population of the US. However, because this data will be biased towards “criminals” and those engaging in “criminal activity,” it will for the most part not generalize to the general population. Rather, it will give us a more general idea of what kind of person is likely to face a fatal police encounter, or more specifically in what locale. Our scope of inference is to determine whether race is a factor and what link it has with socioeconomic status. Causality, however, will be very difficult to establish. Though this data may well link race and income to fatalities by police officers, that link by no means establishes a cause. There are just too many unanswered questions to make that leap.

Part 3 - Exploratory data analysis:

For our exploratory analysis, we would like to see whether median household income differs among the ethnicities involved in fatalities by police. Our initial summary data shows the following:

## Asian/Pacific Islander                  Black        Hispanic/Latino 
##                     10                    135                     67 
##        Native American                Unknown                  White 
##                      4                     15                    236

From this data, we can see some problems. First, the data set we are drawing from is limited. To ensure adequate group sizes, we will focus specifically on White, Black, and Hispanic/Latino victims. Each of these groups is large enough to make inference possible. Although the census data contains many variables, we will focus on only two: median household income and poverty rate. A brief summary of this data follows:

##     h_income           pov       
##  Min.   : 10290   Min.   : 1.10  
##  1st Qu.: 32625   1st Qu.:10.90  
##  Median : 42759   Median :18.20  
##  Mean   : 46627   Mean   :21.11  
##  3rd Qu.: 56190   3rd Qu.:28.70  
##  Max.   :142500   Max.   :79.20  
##  NA's   :2        NA's   :2
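As a hedged sketch of how counts and summaries like those above could be produced in base R: the data frame below is invented toy data, and only the column names `raceethnicity`, `h_income`, and `pov` come from this report. It assumes missing values arrive as non-numeric text such as `"-"`, which is one way NA's could end up in each column; the actual cleaning code is not shown in this report.

```r
# Toy stand-in for the real file; the column names match those used in this
# report, but the rows are invented for illustration only.
killings <- data.frame(
  raceethnicity = c("White", "Black", "Hispanic/Latino", "White", "Unknown"),
  h_income      = c("51367", "27972", "-", "45365", "48295"),
  pov           = c("14.1", "28.8", "-", "17.9", "11.2"),
  stringsAsFactors = FALSE
)

# Assumption: missing values arrive as non-numeric text such as "-".
# as.numeric() turns them into NA (and would normally warn about coercion).
to_number <- function(x) suppressWarnings(as.numeric(x))
killings$h_income <- to_number(killings$h_income)
killings$pov      <- to_number(killings$pov)

# Counts per group, then per-column summaries that include the NA counts.
race_counts <- table(killings$raceethnicity)
print(race_counts)
print(summary(killings[, c("h_income", "pov")]))
```

Rows with NA in a column are simply dropped from that column's summary statistics, which is why the NA count is reported separately.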

Our next step is to create some basic plots. The following is a breakdown of the total median household income, followed by the three ethnicities chosen. Each of these graphs represents the median household income of the location where the fatality occurred.

Next we plotted similar histograms, but this time based on poverty rate of the surrounding area.

To perform inference with the theoretical method, we need the data to be approximately normal. Eyeballing the histograms, the distributions do look roughly normal. However, a Q-Q plot is a better way to judge whether the data approach normality. For brevity, only the Q-Q plots for the full sample are shown, for both income and poverty rate (the others have a similar skew, albeit with fewer data points).

Normal Q-Q Plot for Income

## Warning: Removed 2 rows containing missing values (stat_qq).

Normal Q-Q Plot for Poverty Rate

## Warning: Removed 2 rows containing missing values (stat_qq).

As you can see, both plots show some signs of skew; in both cases the data are skewed to the right. This might be cause for concern, but for now the distribution of this sample will be treated as normal. With group sizes of 67 or more, the sample means should still be approximately normal, which allows us to run inference based on the sample distributions.
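As a rough illustration of checking for right skew, the sketch below computes a moment-based sample skewness and draws a Q-Q plot in base R. The income values are invented, the `skewness` helper is hypothetical, and base R's `qqnorm` is swapped in here for the `stat_qq` layer that produced the plots in this report.

```r
# Invented right-skewed values standing in for neighborhood incomes.
income <- c(21000, 28000, 31000, 35000, 38000, 42000, 47000, 56000, 75000, 142000)

# Moment-based sample skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values, as the plots do
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
income_skew <- skewness(income)
print(income_skew)

# Base R Q-Q plot; points rising above the reference line at the right end
# are the visual signature of right skew.
qqnorm(income, main = "Normal Q-Q Plot for Income (toy data)")
qqline(income)
```

A skewness near zero would support treating the data as roughly symmetric; a clearly positive value agrees with the right skew seen in the plots.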

Part 4 - Inference:

For our analysis, we want to run comparisons to see whether there is a statistically significant difference in both income and poverty rate. The question we wished to investigate was whether poverty rate and income differ by race. To compare the ethnicities against each other, we used the inference function from a previous assignment and created subsets for the 3 possible pairings (White vs. Black, Black vs. Hispanic, and White vs. Hispanic).
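The pairwise subsets could be built along the following lines. This is a minimal sketch on invented rows: the object names `black_hisp`, `white_hisp`, and `black_white` and the column `raceethnicity` appear in this report, while the `pair_subset` helper is hypothetical and not the code actually used.

```r
# Invented rows for illustration; only the column and object names follow the report.
killings <- data.frame(
  raceethnicity = c("White", "Black", "Hispanic/Latino", "White",
                    "Black", "Native American", "Hispanic/Latino"),
  h_income = c(51000, 28000, 41000, 62000, 39000, 35000, 44000),
  pov      = c(14.1, 28.8, 22.3, 9.5, 25.0, 30.2, 19.7)
)

# Keep only rows for two named groups and rebuild the factor, so the
# grouping variable has exactly two levels.
pair_subset <- function(data, group_a, group_b) {
  out <- data[data$raceethnicity %in% c(group_a, group_b), ]
  out$raceethnicity <- factor(out$raceethnicity)
  out
}

black_hisp  <- pair_subset(killings, "Black", "Hispanic/Latino")
white_hisp  <- pair_subset(killings, "White", "Hispanic/Latino")
black_white <- pair_subset(killings, "Black", "White")

print(table(black_white$raceethnicity))
```

Rebuilding the factor matters because a two-group comparison expects a grouping variable with exactly two levels; leftover empty levels from the full data set could otherwise confuse the test.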

With each of these pairings, we compare the mean income and mean poverty rate across every incident involving a fatality. Our null hypothesis in each case is that the mean income (or poverty rate) of the two ethnicities is equal. Our alternative hypothesis is that they are not equal, so we performed a two-sided test. Our thinking is that by comparing mean income and poverty rate, we can see whether any specific race was more likely to be involved in a fatal encounter in a lower-income neighborhood. Our significance level is 0.05. The inferences are as follows:

*Please note: the inference function must be loaded from the previous Lab 5 assignment; be sure to run that before knitting this R Markdown file.
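For reference, the function does not print the formulas it uses. The following is a reconstruction that assumes the standard unpooled standard error for a difference of two means, with $\bar{x}_i$, $s_i$, and $n_i$ denoting each group's sample mean, standard deviation, and size. Under that assumption, the numbers from the first comparison (Black vs Hispanic median income) reproduce the reported standard error and test statistic.

```latex
SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}, \qquad
Z = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{SE}

% Black vs Hispanic median income, using the reported summary statistics
SE = \sqrt{\frac{22167.82^2}{133} + \frac{15324.03^2}{67}} \approx 2683.2, \qquad
Z = \frac{42608.8 - 42767.96}{2683.2} \approx -0.059
```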

#### Black Vs Hispanic Median Income

inference(y = black_hisp$h_income, x = black_hisp$raceethnicity, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Black = 133, mean_Black = 42608.8, sd_Black = 22167.82
## n_Hispanic/Latino = 67, mean_Hispanic/Latino = 42767.96, sd_Hispanic/Latino = 15324.03
## Observed difference between means (Black-Hispanic/Latino) = -159.1507
## 
## H0: mu_Black - mu_Hispanic/Latino = 0 
## HA: mu_Black - mu_Hispanic/Latino != 0 
## Standard error = 2683.224 
## Test statistic: Z =  -0.059 
## p-value =  0.9528

#### White Vs Hispanic Median Income

inference(y = white_hisp$h_income, x = white_hisp$raceethnicity, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Hispanic/Latino = 67, mean_Hispanic/Latino = 42767.96, sd_Hispanic/Latino = 15324.03
## n_White = 236, mean_White = 49826.94, sd_White = 20322.17
## Observed difference between means (Hispanic/Latino-White) = -7058.985
## 
## H0: mu_Hispanic/Latino - mu_White = 0 
## HA: mu_Hispanic/Latino - mu_White != 0 
## Standard error = 2292.34 
## Test statistic: Z =  -3.079 
## p-value =  0.002

#### White Vs Black Median Income

inference(y = black_white$h_income, x = black_white$raceethnicity, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Black = 133, mean_Black = 42608.8, sd_Black = 22167.82
## n_White = 236, mean_White = 49826.94, sd_White = 20322.17
## Observed difference between means (Black-White) = -7218.136
## 
## H0: mu_Black - mu_White = 0 
## HA: mu_Black - mu_White != 0 
## Standard error = 2333.407 
## Test statistic: Z =  -3.093 
## p-value =  0.002

#### Black Vs Hispanic Poverty Rate

inference(y = black_hisp$pov, x = black_hisp$raceethnicity, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Black = 133, mean_Black = 25.7564, sd_Black = 15.6229
## n_Hispanic/Latino = 67, mean_Hispanic/Latino = 24.2478, sd_Hispanic/Latino = 12.286
## Observed difference between means (Black-Hispanic/Latino) = 1.5086
## 
## H0: mu_Black - mu_Hispanic/Latino = 0 
## HA: mu_Black - mu_Hispanic/Latino != 0 
## Standard error = 2.022 
## Test statistic: Z =  0.746 
## p-value =  0.4556

#### White Vs Hispanic Poverty Rate

inference(y = white_hisp$pov, x = white_hisp$raceethnicity, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Hispanic/Latino = 67, mean_Hispanic/Latino = 24.2478, sd_Hispanic/Latino = 12.286
## n_White = 236, mean_White = 17.5742, sd_White = 10.8589
## Observed difference between means (Hispanic/Latino-White) = 6.6736
## 
## H0: mu_Hispanic/Latino - mu_White = 0 
## HA: mu_Hispanic/Latino - mu_White != 0 
## Standard error = 1.659 
## Test statistic: Z =  4.022 
## p-value =  0

#### White Vs Black Poverty Rate

inference(y = black_white$pov, x = black_white$raceethnicity, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_Black = 133, mean_Black = 25.7564, sd_Black = 15.6229
## n_White = 236, mean_White = 17.5742, sd_White = 10.8589
## Observed difference between means (Black-White) = 8.1822
## 
## H0: mu_Black - mu_White = 0 
## HA: mu_Black - mu_White != 0 
## Standard error = 1.528 
## Test statistic: Z =  5.355 
## p-value =  0
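The reported results can be sanity-checked from the summary statistics alone. The sketch below uses a hypothetical helper, `two_sample_z`, which is not part of the Lab 5 inference function; it assumes an unpooled standard error and a normal reference distribution. It also shows that a printed p-value of 0 is a rounded value rather than an exact zero.

```r
# Two-sample z test computed from summary statistics alone.
two_sample_z <- function(mean1, sd1, n1, mean2, sd2, n2) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)   # standard error of the difference
  z  <- (mean1 - mean2) / se            # null value of the difference is 0
  p  <- 2 * pnorm(-abs(z))              # two-sided p-value
  list(diff = mean1 - mean2, se = se, z = z, p = p)
}

# Black vs White median household income, values copied from the output above.
income_bw <- two_sample_z(42608.8, 22167.82, 133, 49826.94, 20322.17, 236)

# Black vs White poverty rate, values copied from the output above.
pov_bw <- two_sample_z(25.7564, 15.6229, 133, 17.5742, 10.8589, 236)

print(round(unlist(income_bw), 4))
print(signif(pov_bw$p, 3))   # tiny but strictly positive
```

Because the inputs are the rounded values printed above, the results may differ from the reported ones in the last decimal place.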

The inferences above were all conducted as two-sided tests. As you can see, between Black and Hispanic fatalities there was no apparent difference in income or poverty level (we failed to reject the null hypothesis, with p-values of 0.95 and 0.46, well above 0.05). This indicates that, for our purposes, Hispanic and Black fatalities tend to occur in socioeconomically similar neighborhoods, averaging roughly $42,700 in median household income with a poverty rate of about 25%.

On the other hand, the neighborhoods where White fatalities occurred look very different. There is a statistically significant difference between the locations of White fatalities and those of both Black and Hispanic individuals. On average, the income level was about $49,800 (a difference of about $7,000), and the poverty rate was significantly lower at approximately 17.6%, a difference of roughly 7 to 8 percentage points from the Hispanic and Black locations, respectively.

Part 5 - Conclusion:

This data is very revealing, if not immediately useful. Black and Hispanic fatalities by police officers appear to occur more often in lower-income neighborhoods with higher poverty rates. There is a significant difference between the locations of those two groups' fatalities and the locations of White fatalities. It is unlikely that this difference is due to random chance, and it is evidence of some connection between income disparity and race. However, it is by no means an indication of causality (with the data available, there isn’t enough evidence to set aside several of the assumptions we made earlier). What could explain this substantial difference? One possible reason is that police officers are more likely to patrol lower-income neighborhoods, as these areas are more likely to have criminal activity. Furthermore, minorities are often disproportionately represented in lower-income neighborhoods, which could produce the discrepancy seen here. However, there is not enough evidence here to support one theory over the other. What we do know is that, more often than not, Black and Hispanic police fatalities occur in neighborhoods of lower socioeconomic status than those of White people.

There are plenty of other avenues we could investigate. By comparing other criminal activity with socioeconomic status, it might be possible to determine whether there is an apparent link between race and arrests or killings. From what we see here, there is cause for concern, especially regarding the type of person who is subject to increased numbers of police fatalities. As a society we should investigate these matters, as we can never better ourselves until we have a better grasp of the causes of these tragic events.

References:

Flowers, A. and Scheinkman, A. “Police Killings” (2015). GitHub repository. https://github.com/fivethirtyeight/data/tree/master/police-killings

Casselman, B. “Where Police Have Killed Americans In 2015.” FiveThirtyEight. June 3, 2015. http://fivethirtyeight.com/features/where-police-have-killed-americans-in-2015/