Introduction

As we know, these are tumultuous times for our society. With recent events, police brutality has been at the forefront of everyone's minds. This has created the need for better oversight and more realistic expectations of our police forces. As such, many people and journalists have taken it upon themselves to document police violence. One site in particular is “The Counted.” This site uses crowdsourced data to identify all fatalities caused by police officers in 2015. It creates an immediate source of oversight and documentation, something that is a necessity in our world today. With a site like this, we are no longer relying on the government to provide the data we need to understand trends in our society; we are documenting the data ourselves.

The purpose of this project isn’t just the findings; it is also a stepping stone to better understanding the importance of up-to-date data. This project was done in conjunction with another project that analyzed just a single dataset. That project focused on poverty data and can be seen here. My hope with this project is to better understand how to extract data from the web and elsewhere and combine it into a unique dataset.

This project takes data from three separate sources, The Counted dataset, population data by state, and the most recently available data on police force size, to determine whether there is some link between police force size and the number of police fatalities. We hope to answer the question: do larger police forces result in an increased number of deaths by police?

Obtaining Data

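The analysis relies on several R packages for downloading, parsing, reshaping, and plotting the data. The setup chunk is not echoed in the rendered report, but judging from the startup messages and the functions used below, it would look roughly like the following sketch (multiplot() is an additional plotting helper, e.g. from the Cookbook for R, that is not defined here):

library(RCurl)     # getURL() to download the CSV
library(jsonlite)  # fromJSON() for the Census API
library(plyr)      # join() -- loaded before dplyr, as the masking messages suggest
library(dplyr)     # group_by(), summarise(), mutate()
library(tidyr)     # separate()
library(stringr)   # str_trim()
library(rvest)     # read_html(), html_nodes(), html_table()
library(ggplot2)   # plotting
library(ggthemes)  # theme_stata(), scale_colour_stata()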

Counted Data Extraction

The first data set we will be pulling from is the Counted data. Currently, the data source is available for download as a CSV. The data file was uploaded to a GitHub page and can be extracted as follows:

url_data <- getURL("https://raw.githubusercontent.com/mfarris9505/FinalProject/master/the-counted.csv")

count_data <- read.csv(textConnection(url_data))
#Removing some data that is unnecessary for our purpose here to streamline the appearance 
count_data <- count_data[-c(1,3:10,13)]
head(count_data)
##                name state    classification              armed
## 1   Matthew Ajibade    GA  Death in custody                 No
## 2      Lewis Lembke    OR           Gunshot            Firearm
## 3 Michael Kocher Jr    HI Struck by vehicle                 No
## 4     John Quintero    KS           Gunshot                 No
## 5       Tim Elliott    WA           Gunshot            Firearm
## 6   Matthew Hoffman    CA           Gunshot Non-lethal firearm

Population Data Extraction

To create a rate, we needed the general population per state. To do this, data was pulled from the American Community Survey API on Census.gov. A brief overview of the census data can be seen via the Census Data Discovery Tool (an aside: this is the most cumbersome API I have seen to date; it was not very well organized or documented, I think it is in beta at the moment, and it seems they removed all their old documentation).

This specific data source uses five years of data to produce its population estimates. The data is from 2014, which is the most up-to-date population data I could find after navigating the site. Using this API, the data was pulled as JSON:

url<- "http://api.census.gov/data/2014/acs5?get=NAME,B01001_001E&for=state:*"

pop2014_data <- fromJSON(url)
pop2014_data <- data.frame(pop2014_data)
# Drop the header row returned by the API
pop2014_data <- pop2014_data[-1,]
# Remove the excess column (the state FIPS code)
pop2014_data$X3 <- NULL
names(pop2014_data) <- c("State","Population")

head(pop2014_data)
##        State Population
## 2    Alabama    4817678
## 3     Alaska     728300
## 4    Arizona    6561516
## 5   Arkansas    2947036
## 6 California   38066920
## 7   Colorado    5197580

Police Force Data Extraction

For the other data source, we will perform a simple webscrape, extracting a web table. This is actually recent FBI data, which is also available in CSV form; however, the version in the table on the webpage is the one that facilitates cleaning (the CSV is surprisingly more difficult to transform). The table can be seen here. This data will be cleaned and then combined to create a single rate per state for each of the two categories so they can be compared.

police_url <- "http://www.governing.com/gov-data/safety-justice/police-officers-per-capita-rates-employment-for-city-departments.html"

police_data <- police_url %>% 
  read_html() %>% 
  html_nodes(xpath ='//*[@id="inputdata"]') %>% 
  html_table()

police_data <- data.frame(police_data)
# Removing some Excess data (Working with rates not totals)
police_data <- police_data[-c(2:4,6)]
head(police_data)
##               City Officers.per.10K.residents
## 1   Washington, DC                       61.2
## 2    Baltimore, MD                       47.4
## 3      Chicago, IL                       44.1
## 4   Wilmington, DE                       42.6
## 5 Philadelphia, PA                       42.4
## 6     New York, NY                       41.7

This data set was chosen because it best normalizes the data: it includes each city's rate per 10k population. Since there is more than one city per state (for the most part), an average police force size can be found by taking the average of the individual rates from the cities in each state. This should give an approximation of the state-wide number of police officers per citizen. This makes a lot of assumptions and creates an inherent bias toward large populations, but that can’t be prevented, as the data is limited to large cities.

State Abbreviations

One other piece of data that I needed was state abbreviations. One of the data sources (population) listed only full state names, which would be problematic later, so we loaded a state abbreviations data frame with another simple webscrape.

abbr_url <- "http://www.50states.com/abbreviations.htm#.Vm3q9UorKUk"

# Scrape the table of state names and abbreviations
abbr_data <- abbr_url %>% 
  read_html() %>% 
  html_nodes(xpath ='//*[@id="content"]/div[1]/table') %>% 
  html_table()

abbr_data <- data.frame(abbr_data)
names(abbr_data) <- c("State", "Abbr")

Data Scrubbing and Transformation

The Counted Data

For our purposes here we need to transform this data into a single data frame with the rate of police fatalities per million population. To do this we need to use some summary functions.

# The helper count column isn't strictly needed, since n() counts rows per group
count_data$count <- 1

# Tally the number of fatalities recorded for each state
counted_data <- count_data %>%
  group_by(state) %>%
  summarise(n()) 
names(counted_data) <- c("State", "Fatalities")

Now that we have a summary of the data, we must combine it with the population data to create a rate. This can be done simply using the join function from the plyr package. But first we have to convert the state names to abbreviations. This can be done several ways, but the easiest is to use the same join function to match the state columns. We are using a left join here, as we only need the data from the left table.

pop2014_data <- join(pop2014_data,abbr_data, by = "State")
#Removing the Full State Name
pop2014_data$State <- NULL
names(pop2014_data) <- c("Population", "State")
head(pop2014_data)
##   Population State
## 1    4817678    AL
## 2     728300    AK
## 3    6561516    AZ
## 4    2947036    AR
## 5   38066920    CA
## 6    5197580    CO

Now the population can be combined with the Counted data. We will use another left join, as the Counted data does not have a complete list of states; some places are not represented (e.g. Puerto Rico and Vermont, which had no deaths this year). The rate per million was also calculated in this step.

counted_combined <- join(counted_data, pop2014_data, by = "State")
counted_combined$Fatalities <- as.numeric(as.character(counted_combined$Fatalities))
counted_combined$Population <- as.numeric(as.character(counted_combined$Population))

counted_combined <- mutate(counted_combined, Fatal_Rate_Per_Mil = (Fatalities/Population * 1000000))

head(counted_combined)
##   State Fatalities Population Fatal_Rate_Per_Mil
## 1    AK          5     728300           6.865303
## 2    AL         19    4817678           3.943809
## 3    AR          5    2947036           1.696620
## 4    AZ         41    6561516           6.248556
## 5    CA        194   38066920           5.096288
## 6    CO         27    5197580           5.194725

Police Force Data

We need to complete a similar step for the police data. To start, we need to pull the state abbreviation out of the City column. This can be done using the tidyr separate function:

police_data <- police_data %>% separate(City, c("City", "State"), ",")
## Warning: Too many values at 4 locations: 238, 351, 366, 581
police_data$City <- NULL
police_data$State <-str_trim(police_data$State)
head(police_data)
##   State Officers.per.10K.residents
## 1    DC                       61.2
## 2    MD                       47.4
## 3    IL                       44.1
## 4    DE                       42.6
## 5    PA                       42.4
## 6    NY                       41.7

We can now transform this into similar data. As we are working with rates here, it is important to take the average rather than the sum when aggregating to the state level:

police_final <- police_data %>%
  group_by(State) %>%
  summarise(mean(Officers.per.10K.residents)) 
names(police_final) <- c("State", "Officer_Per_10k" )

Now we can effectively combine both into a single dataset that we can use for modeling.

# Rebuild the State factors so both tables have clean, matching levels before joining
counted_combined$State <- as.character(counted_combined$State)
counted_combined$State <- factor(counted_combined$State)
police_final$State <- as.character(police_final$State)
police_final$State <- factor(police_final$State)

counted_combined <- join(counted_combined, police_final, by = "State")

head(counted_combined)
##   State Fatalities Population Fatal_Rate_Per_Mil Officer_Per_10k
## 1    AK          5     728300           6.865303        12.40000
## 2    AL         19    4817678           3.943809        25.82500
## 3    AR          5    2947036           1.696620        20.58750
## 4    AZ         41    6561516           6.248556        15.33529
## 5    CA        194   38066920           5.096288        11.17197
## 6    CO         27    5197580           5.194725        15.72353

Explore and Visualize

Now that the data is condensed and collated, it is time to start exploring what we have created. The best way to do this would be to provide some summary statistics:

summary(counted_combined)
##      State      Fatalities       Population       Fatal_Rate_Per_Mil
##  AK     : 1   Min.   :  1.00   Min.   :  575251   Min.   :0.9494    
##  AL     : 1   1st Qu.:  5.00   1st Qu.: 1854315   1st Qu.:2.3022    
##  AR     : 1   Median : 16.50   Median : 4492160   Median :3.5054    
##  AZ     : 1   Mean   : 21.36   Mean   : 6269615   Mean   :3.7157    
##  CA     : 1   3rd Qu.: 21.75   3rd Qu.: 6838665   3rd Qu.:4.4258    
##  CO     : 1   Max.   :194.00   Max.   :38066920   Max.   :9.6888    
##  (Other):44                                                         
##  Officer_Per_10k
##  Min.   :11.17  
##  1st Qu.:15.16  
##  Median :18.95  
##  Mean   :20.15  
##  3rd Qu.:22.11  
##  Max.   :61.20  
## 

After reviewing the summary statistics, it is important to visualize the data to check for discrepancies, to see whether the data approaches a normal distribution, and so on. Boxplots and histograms are an appropriate way to better “see” the data:
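The chunk that produced those plots is not echoed in the original report; a minimal sketch of the kind of plots being described (boxplots of the two rates plus a histogram of the fatality rate, using the same multiplot() helper) might look like this:

# Boxplots of each rate and a histogram of the fatality rate (sketch only)
p1 <- ggplot(counted_combined, aes(x = "", y = Fatal_Rate_Per_Mil)) +
  geom_boxplot() +
  ggtitle("Fatality Rate per Million")

p2 <- ggplot(counted_combined, aes(x = "", y = Officer_Per_10k)) +
  geom_boxplot() +
  ggtitle("Officers per 10k Residents")

p3 <- ggplot(counted_combined, aes(Fatal_Rate_Per_Mil)) +
  geom_histogram() +
  ggtitle("Histogram of Fatality Rate")

multiplot(p1, p2, p3, cols = 3)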

From what we can see here, we have some potential outliers. An examination of outliers and how to remove them will be discussed in a later course. However, for our purposes here, the one outlier causing the most concern is the DC data point, particularly its much higher number of police officers per citizen. As DC is not a state and represents a rather heavily policed area (particularly because of the importance of certain individuals who reside there), I believe it is warranted to remove this particular point; since we want to make a comparison against police force size, it would create an inherent bias. Again, I am not too familiar with the process of eliminating outliers, but these reasons seem to justify its removal. With this in mind we can start our modeling process.
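One quick check (not part of the original chunk) is to confirm which row holds the DC observation before dropping it:

# Locate the DC row; with the states ordered as above it should be row 8
which(counted_combined$State == "DC")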

Modelling

For our process here, we determined that linear regression would likely be the best model, and it is the one I am currently most familiar with. Using linear regression, it is best to start with a plot of the two variables to determine whether police force size in any way influences fatalities by officers. We will fit a simple linear regression model at the 95% confidence level. Our null hypothesis is that the slope \(\beta\) = 0, and our alternative is that \(\beta\) does not equal 0.

# Drop row 8, the DC observation discussed above
counted_combined <- counted_combined[-8,]
ggplot(counted_combined, aes(Officer_Per_10k, Fatal_Rate_Per_Mil)) +
  geom_point()+
  geom_smooth(method="lm") +
  theme_stata() + 
  scale_colour_stata()

As we can see from the graph, the linear model shows an apparent decrease in fatalities with larger police forces. However, it is worth noting that the confidence region shows there is not much significance in this data, as with 95% probability the true slope could also be positive. This can be further seen by reviewing the linear model summary.

counted_lm <- lm(Fatal_Rate_Per_Mil ~ Officer_Per_10k, data = counted_combined)
summary(counted_lm)
## 
## Call:
## lm(formula = Fatal_Rate_Per_Mil ~ Officer_Per_10k, data = counted_combined)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6267 -1.5505  0.0077  0.8580  5.8598 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.63974    1.02694   4.518  4.2e-05 ***
## Officer_Per_10k -0.05226    0.05102  -1.024    0.311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.026 on 47 degrees of freedom
## Multiple R-squared:  0.02184,    Adjusted R-squared:  0.001024 
## F-statistic: 1.049 on 1 and 47 DF,  p-value: 0.3109

Again, we can see from our summary that the model is not statistically significant. The p-value in the model is much too high, so we fail to reject the null hypothesis.
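To make that uncertainty explicit, the 95% confidence interval for the slope can be computed directly with confint() (not run in the original analysis); since the p-value is well above 0.05, this interval necessarily contains zero:

# 95% confidence interval for the intercept and slope (output not shown)
confint(counted_lm, level = 0.95)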

Even though we failed to reject the null, it is important to determine whether the linear model is valid. To do this we must check that it meets certain criteria. First, the model must be linear, which we can see from the graph; plotting the residuals against police force size also gives us an idea of linearity, as seen in Plot 1 below. Furthermore, a histogram of the residuals should be approximately normal, which can be seen in Plot 2.

# Residuals against the predictor, to check for remaining patterns
q1 <- ggplot(data = NULL, aes(x=counted_combined$Officer_Per_10k, y=counted_lm$residuals)) +
  geom_point() +
  geom_abline(intercept=0, slope=0) +
  ggtitle("Residual Plot 1")

# Histogram of the residuals, to check for approximate normality
q2 <- ggplot(data = NULL, aes(x = counted_lm$residuals)) +
  geom_histogram() +
  ggtitle("Histogram Plot 2")

multiplot(q1,q2, cols = 2)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

From what I can see, the linearity doesn’t pass muster: the residuals are not consistently distributed, and we can see some patterns developing. This could be an indication that some other model (one I am unfamiliar with) may be better suited. The residuals do, however, appear to be roughly normally distributed, with a slight skew. From both the p-value and these checks, we can see that the linear model is not a good fit for the data we have. It would be best to use a more appropriate modelling method.
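R’s built-in regression diagnostics give another quick look at the same issues (residuals vs. fitted values, a normal Q-Q plot, scale-location, and leverage); this was not part of the original analysis, but a sketch would be:

# Standard lm diagnostic plots for the fitted model
par(mfrow = c(2, 2))
plot(counted_lm)
par(mfrow = c(1, 1))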

Interpret and Conclusion

From the data presented above, we can see that we failed to reject our null hypothesis, and the data shows no apparent linear relationship between police force size and police fatalities. However, this isn’t a clear indication that there is no pattern at all. The interesting structure in the residuals could indicate that another model would give a more accurate representation. Furthermore, the data itself was obtained on a “best-estimate” basis: the state figures were calculated from the most populous cities, which likely biases the data toward big cities. As there was no cross-check to determine whether police fatalities came mostly from densely populated cities, we can safely say that the data may not be an accurate representation of the population in focus. Further study would be best done using city-level data. However, as police fatalities are dispersed over a large population, city data would yield only small counts, and because there was not a one-to-one relation between our two sources, some city data would not be represented. This was one of the reasons I chose to combine the data at the state level, as it allowed for a closer one-to-one relationship. As you can see, this choice may have limited any patterns in the data.

The takeaway from this project is this: our data is getting better. With the prevalence of modern distribution of information, including but not limited to crowdsourcing, we now have better ways to identify problems. One such problem is the number of deaths caused by police officers. The value of having an up-to-date source is that it allows for more real-time analysis that can better prevent future deaths. However, this real-time data is not readily available everywhere. Some data is still exceedingly difficult to update, and certain government data remains significantly behind. Hopefully, one day, data will come closer and closer to real-time.

Sources