- Describe your motivation for performing this analysis (2).
- Have a recognizable data science workflow (4.a - 4.e, 5.a - 5.c).
- Include data from at least two different types of data sources (4.b.i, 4.e.iv.3, 5.a.i).
- Include at least one data transformation operation (4.b.ii, 5.a.ii).
- Include at least one statistical analysis and at least one graphic that describes or validates the data (4.d, 5.b).
- Include at least one graphic that supports the conclusion (4.e, 5.b).
- Include at least one statistical analysis that supports the conclusion (4.e, 5.c).
- Include at least one feature not covered in class (5.a - 5.b).
- Deliver the presentation in the allotted time of 3 to 5 minutes.
- Show at least one challenge encountered in code or data, and what was done when the challenge was encountered (4.e.iv).
- Ensure the audience comes away with a clear understanding of the motivation for undertaking the project.
- Ensure the audience comes away with a clear understanding of at least one insight gained, the conclusion you reached, or the hypothesis you confirmed.
- Deliver self-contained code and data.
- Deliver fully reproducible results.
- Ensure all of the delivered code runs without errors.
- Deliver code and conclusions using a reproducible research tool.
- Deliver the draft project proposal, project, and presentation on time.
In April of 2015 there was a murder in New York City. There was a surge of police activity. Then, when the crime scene investigation was done, the body remained covered by a white sheet and guarded by a police officer for a few hours until it was picked up by an ambulance. The strange thing is, a few news outlets reported that “EMS rushed him to St. Barnabas Hospital, where he was pronounced dead, cops said.” The whole situation was disturbing, but these news reports also made me wonder how crime statistics are reported. Was this death reported in the statistics for the precinct of the hospital (48th) or the precinct of the crime (46th)? I never did the digging necessary to find out.
Then, in September of 2016, I read about a similar situation. There was a story about the time of death reported by the NYC Police Department. The report said “Police officials initially said that Graham died at Montefiore Medical Center. His death certificate says he was shot at 3:01 p.m. and died at 3:53 p.m. According to his family, the video indicates he was already dead, and contradicts the city account.” Although this story was disturbing in and of itself, it also made me wonder whether the death was reported in the statistics for the precinct of the hospital (52nd) or the precinct of the crime (47th). I also wondered whether the total number of mortalities reported in NYC Police statistics matches the total number of mortalities reported in NYC Hospital statistics.
Do NYC Police Precincts and NYC Hospitals report different mortality figures? This question will be examined in general terms (whether NYC Police Precincts and NYC Hospitals report the same total mortality figures in their respective statistics) and much more specific terms (whether mortality statistics are skewed toward police precincts near hospitals).
library(XML)
library(httr)
library(tidyr)
library(dplyr)
library(RCurl)
library(ggmap)
library(ggplot2)
library(stringr)
library(jsonlite)
Do NYC Police Precincts and NYC Hospitals report the same total mortality figures in their statistics?
Mortality_1 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_1.csv"), stringsAsFactors = F)
Mortality_2 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_2.csv"), stringsAsFactors = F)
Precincts_1 <- Mortality_1 %>%
filter(grepl('MURDER', CRIME)) %>%
gather(xYear, Count, 3:18) %>%
na.omit(Count) %>%
mutate(Year = as.integer(sub("X", "", xYear))) %>%
group_by(Year) %>%
summarise(Total = sum(Count))
Hospital_1 <- Mortality_2 %>%
filter(grepl('HOMICIDE', Cause.of.Death)) %>%
group_by(Year) %>%
summarise(Total = sum(Count))
Test the significance of the difference between pairs.
shapiro.test(unlist(Hospital_1))
##
## Shapiro-Wilk normality test
##
## data: unlist(Hospital_1)
## W = 0.8467, p-value = 0.05306
shapiro.test(unlist(Precincts_1))
##
## Shapiro-Wilk normality test
##
## data: unlist(Precincts_1)
## W = 0.69943, p-value = 8.524e-07
bartlett.test(Hospital_1,Precincts_1)
##
## Bartlett test of homogeneity of variances
##
## data: Hospital_1
## Bartlett's K-squared = 22.052, df = 1, p-value = 2.654e-06
Given fairly normal and homoscedastic matched pairs, and assuming independent cases (years), the paired t-test would be appropriate. The paired t-test compares two populations of paired data that are correlated due to said pairing. It uses the mean \(\bar{d}\) and standard deviation \(s_d\) of the \(n\) paired differences to calculate \(t=\frac{\bar{d}}{s_d/\sqrt{n}}\). The null hypothesis of the test is that the true mean difference is zero.
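As a quick illustration of the formula, the statistic can be computed by hand and checked against `t.test()` using a pair of made-up vectors (the values below are arbitrary and for illustration only).
x <- c(500, 520, 480, 510, 530)         # hypothetical totals from source A
y <- c(505, 515, 490, 520, 525)         # hypothetical totals from source B
d <- y - x                              # paired differences
mean(d) / (sd(d) / sqrt(length(d)))     # t = d-bar / (s_d / sqrt(n))
t.test(y, x, paired = TRUE)$statistic   # same value reported by t.test()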
(Compare_1 <- full_join(Precincts_1, Hospital_1, by = "Year", suffix = c(".NYPD", ".DOHMH")))
## # A tibble: 16 × 3
## Year Total.NYPD Total.DOHMH
## <int> <int> <int>
## 1 2000 673 NA
## 2 2001 649 NA
## 3 2002 587 NA
## 4 2003 597 NA
## 5 2004 570 NA
## 6 2005 539 NA
## 7 2006 596 NA
## 8 2007 496 2012
## 9 2008 523 2096
## 10 2009 471 1952
## 11 2010 536 2124
## 12 2011 515 2084
## 13 2012 419 NA
## 14 2013 335 NA
## 15 2014 333 NA
## 16 2015 352 NA
visualize <- rbind(cbind(Hospital_1, Source = "DOHMH"), cbind(Precincts_1, Source = "NYPD"))
ggplot() + geom_area(aes(x = Year, y = Total, fill = Source), data = visualize,
stat="identity", position = position_dodge(width = 0), alpha= I(0.5))
Compare_2 <- right_join(Precincts_1, Hospital_1, by = "Year", suffix = c(".NYPD", ".DOHMH"))
data.frame(Compare_2, Difference = matrix((unlist(Compare_2[3] - Compare_2[2])), ncol=1),
Factor = matrix((unlist(Compare_2[3] / Compare_2[2])), ncol=1))
## Year Total.NYPD Total.DOHMH Difference Factor
## 1 2007 496 2012 1516 4.056452
## 2 2008 523 2096 1573 4.007648
## 3 2009 471 1952 1481 4.144374
## 4 2010 536 2124 1588 3.962687
## 5 2011 515 2084 1569 4.046602
t.test(Compare_2[3] - Compare_2[2])
##
## One Sample t-test
##
## data: Compare_2[3] - Compare_2[2]
## t = 76.578, df = 4, p-value = 1.743e-07
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1489.369 1601.431
## sample estimates:
## mean of x
## 1545.4
The data reported by the NYC PD and by the NYC DOHMH are conflicting and flawed. The NYC PD data cover 16 years with an average of 512, while the NYC DOHMH data cover only 5 years with an average of 2054. Looking at the years where both have data available, the average difference is 1545, which is very strange. The NYC DOHMH data are inflated by a roughly constant factor of about 4.04.
Something is clearly wrong here, and the discrepancy is so large that it does not justify even discussing the t-test results.
Occam’s razor demands that the fewest assumptions be made in the interpretation of these findings. Therefore, it is likely that something is wrong in the data. Further investigation is necessary.
FOIL Request #: 2016.0056
This is in response to your October 27, 2016 (received by the Office of the Inspector General for the NYPD [“OIG-NYPD”] on October 27, 2016) request, pursuant to the Freedom of Information Law, for “formal definition of the ‘MURDER & NON NEGL. MANSLAUGHTER’ CompStat category.”
The New York Police Department uses the New York State Penal Law to define all crimes/offenses/infractions referred to in their CompStat 2.0 reporting data.
Please note that the OIG-NYPD is not part of the NYPD. For records kept by the NYPD, you may make an electronic FOIL request to the NYPD’s Records Access Officer at http://www.nyc.gov/html/nypd/html/legal_matters/dclm_doc_production_foil.shtml.
This letter represents our complete response to your request. Should you wish to appeal this determination, you must send written notice within thirty days to Asim Rehman, General Counsel at the Office of the Inspector General for the New York City Police Department, City of New York Department of Investigation, 80 Maiden Lane, 14th Floor, New York, New York 10038.
New York State Penal Law
S 125.00 Homicide defined: Homicide means conduct which causes the death of a person or an unborn child with which a female has been pregnant for more than twenty-four weeks under circumstances constituting murder, manslaughter in the first degree, manslaughter in the second degree, criminally negligent homicide, abortion in the first degree or self-abortion in the first degree.
Section | Offense | Class |
---|---|---|
125.10 | Criminally negligent homicide. | E FELONY |
125.11 | Aggravated criminally negligent homicide. | C FELONY |
125.12 | Vehicular manslaughter in the second degree. | D FELONY |
125.13 | Vehicular manslaughter in the first degree. | C FELONY |
125.14 | Aggravated vehicular homicide. | B FELONY |
125.15 | Manslaughter in the second degree. | C FELONY |
125.20 | Manslaughter in the first degree. | B FELONY |
125.21 | Aggravated manslaughter in the second degree. | C FELONY |
125.22 | Aggravated manslaughter in the first degree. | B FELONY |
125.25 | Murder in the second degree. | A-I FELONY |
125.26 | Aggravated murder. | A-I FELONY |
125.27 | Murder in the first degree. | A-I FELONY |
125.40 | Abortion in the second degree. | E FELONY |
125.45 | Abortion in the first degree. | D FELONY |
125.50 | Self-abortion in the second degree. | B MISD |
125.55 | Self-abortion in the first degree. | A MISD |
125.60 | Issuing abortional articles. | B MISD |
FOIL Request #216FR02936
The NYC Department of Health and Mental Hygiene has received your Freedom of Information Law request and assigned it the control number noted above. The data you seek for 2000 - 2014 is available online at the Department’s EpiQuery website: https://sasebiweb200.health.dohmh.nycnet/epiquery/VS/index.html . If you select “Mortality - by select causes” and then click on the “SUBMIT” button, the data will appear. Please note that 2015 data has not yet been finalized.
The definition of “homicide” is also available online at http://www1.nyc.gov/assets/doh/downloads/pdf/ip/ip-homicides-in-new-york-city.pdf.
This concludes the Department’s response to your Freedom of Information Law request.
International Classification of Diseases (ICD-10)
Category | Condition |
---|---|
X85 | Assault by drugs, medicaments and biological substances |
X86 | Assault by corrosive substance |
X87 | Assault by pesticides |
X88 | Assault by gases and vapors |
X89 | Assault by other specified chemicals and noxious substances |
X90 | Assault by unspecified chemical or noxious substance |
X91 | Assault by hanging, strangulation and suffocation |
X92 | Assault by drowning and submersion |
X93 | Assault by handgun discharge |
X94 | Assault by rifle, shotgun and larger firearm discharge |
X95 | Assault by other and unspecified firearm discharge |
X96 | Assault by explosive material |
X97 | Assault by smoke, fire and flames |
X98 | Assault by steam, hot vapors and hot objects |
X99 | Assault by sharp object |
Y00 | Assault by blunt object |
Y01 | Assault by pushing from high place |
Y02 | Assault by pushing or placing victim before moving object |
Y03 | Assault by crashing of motor vehicle |
Y04 | Assault by bodily force |
Y05 | Sexual assault by bodily force |
Y06 | Neglect and abandonment |
Y07 | Other maltreatment syndromes |
Y08 | Assault by other specified means |
Y09 | Assault by unspecified means |
Y87.1 | Sequelae of assault i
i Sequelae include conditions reported as such, or occurring as “late effects” one year or more after the originating event.
Mortality_3 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_3.csv"), stringsAsFactors = F)
Hospital_2 <- Mortality_3[1:15, 2:3]
Hospital_2[1] <- as.integer(unlist(Hospital_2[1]))
colnames(Hospital_2) <- colnames(Hospital_1)
shapiro.test(unlist(Hospital_2))
##
## Shapiro-Wilk normality test
##
## data: unlist(Hospital_2)
## W = 0.70284, p-value = 1.716e-06
bartlett.test(Hospital_2, Precincts_1)
##
## Bartlett test of homogeneity of variances
##
## data: Hospital_2
## Bartlett's K-squared = 67.331, df = 1, p-value = 2.296e-16
visualize <- rbind(cbind(Hospital_2, Source = "DOHMH"), cbind(Precincts_1, Source = "NYPD"))
ggplot() + geom_area(aes(x = Year, y = Total, fill = Source), data = visualize,
stat="identity", position = position_dodge(width = 0), alpha= I(0.5))
Compare_3 <- right_join(Precincts_1, Hospital_2, by = "Year", suffix = c(".NYPD", ".DOHMH"))
data.frame(Compare_3, Difference = matrix((unlist(Compare_3[3] - Compare_3[2])), ncol=1),
Factor = matrix((unlist(Compare_3[3] / Compare_3[2])), ncol=1))
## Year Total.NYPD Total.DOHMH Difference Factor
## 1 2000 673 711 38 1.056464
## 2 2001 649 669 20 1.030817
## 3 2002 587 612 25 1.042589
## 4 2003 597 654 57 1.095477
## 5 2004 570 597 27 1.047368
## 6 2005 539 578 39 1.072356
## 7 2006 596 624 28 1.046980
## 8 2007 496 513 17 1.034274
## 9 2008 523 556 33 1.063098
## 10 2009 471 496 25 1.053079
## 11 2010 536 547 11 1.020522
## 12 2011 515 528 13 1.025243
## 13 2012 419 440 21 1.050119
## 14 2013 335 341 6 1.017910
## 15 2014 333 353 20 1.060060
t.test(Compare_3[3] - Compare_3[2])
##
## One Sample t-test
##
## data: Compare_3[3] - Compare_3[2]
## t = 7.6627, df = 14, p-value = 2.253e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 18.24250 32.42417
## sample estimates:
## mean of x
## 25.33333
These data are much better. The new NYC DOHMH data cover 15 years with an average of 548. Looking at the years where both have data available, the average difference is 25. The new NYC DOHMH data are still inflated, but now only by a factor of about 1.05.
For the t-test, the null hypothesis is a mean difference of zero, \(\bar{d} = 0\), between the NYC PD murder statistics and the NYC DOHMH mortality statistics (which would indicate that any variation is due to reporting discrepancies), and the alternative hypothesis is that an actual difference, \(\bar{d} \neq 0\), exists between the two agencies. The decision is made at a significance level of \(0.05\): the null hypothesis is rejected when \(p \le 0.05\) and not rejected otherwise. The test yields a \(p\)-value of \(0.000002253 < 0.05\), so the probability of observing a sample \(t\) statistic with 14 degrees of freedom as extreme as 7.662652 under the null hypothesis is extremely low. Therefore, the null hypothesis is rejected.
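The same decision rule can be applied programmatically by pulling the \(p\)-value out of the object returned by `t.test()` (a small illustrative check of the result above).
paired_test <- t.test(Compare_3[3] - Compare_3[2])
paired_test$p.value <= 0.05   # TRUE here, so the null hypothesis of zero mean difference is rejected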
The data support the conclusion that an actual difference exists between NYC PD mortality statistics and NYC DOHMH mortality statistics. This raises the question of where this statistically significant and substantial difference in mortalities is coming from. Examining the data definitions, it seems reasonable to check whether a sequela lag (a mortality reported as such, or occurring as a “late effect,” one year or more after the originating event) could be the reason.
(sequela <- data.frame("Year" = Compare_3[1] - 1, Compare_3[3] - Compare_3[2]))
## Year Total.DOHMH
## 1 1999 38
## 2 2000 20
## 3 2001 25
## 4 2002 57
## 5 2003 27
## 6 2004 39
## 7 2005 28
## 8 2006 17
## 9 2007 33
## 10 2008 25
## 11 2009 11
## 12 2010 13
## 13 2011 21
## 14 2012 6
## 15 2013 20
sequela_lag <- Precincts_1[1:14, ]
sequela_lag[ , 2] <- sequela_lag[ , 2] + sequela[2:15 ,2]
visualize <- rbind(cbind(Hospital_2[1:14,], Source = "DOHMH"), cbind(sequela_lag, Source = "NYPD"))
ggplot() + geom_area(aes(x = Year, y = Total, fill = Source), data = visualize,
stat="identity", position = position_dodge(width = 0), alpha= I(0.5))
Compare_4 <- left_join(sequela_lag, Hospital_2, by = "Year", suffix = c(".NYPD", ".DOHMH"))
data.frame(Compare_4, Difference = matrix((unlist(Compare_4[3] - Compare_4[2])), ncol=1),
Factor = matrix((unlist(Compare_4[3] / Compare_4[2])), ncol=1))
## Year Total.NYPD Total.DOHMH Difference Factor
## 1 2000 693 711 18 1.0259740
## 2 2001 674 669 -5 0.9925816
## 3 2002 644 612 -32 0.9503106
## 4 2003 624 654 30 1.0480769
## 5 2004 609 597 -12 0.9802956
## 6 2005 567 578 11 1.0194004
## 7 2006 613 624 11 1.0179445
## 8 2007 529 513 -16 0.9697543
## 9 2008 548 556 8 1.0145985
## 10 2009 482 496 14 1.0290456
## 11 2010 549 547 -2 0.9963570
## 12 2011 536 528 -8 0.9850746
## 13 2012 425 440 15 1.0352941
## 14 2013 355 341 -14 0.9605634
t.test(Compare_4[3] - Compare_4[2])
##
## One Sample t-test
##
## data: Compare_4[3] - Compare_4[2]
## t = 0.28746, df = 13, p-value = 0.7783
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -8.376886 10.948314
## sample estimates:
## mean of x
## 1.285714
After lagging the difference between agencies by one year (although the lag can be greater), the modified NYC PD data cover 14 years with an average of 561. Looking at the years where both have data available, the average difference is 1. The modified data still show some variation, but the mean factor is now 1; the differences essentially cancel out.
For this t-test, the null hypothesis is again a mean difference of zero, \(\bar{d} = 0\), between the NYC PD mortality statistics including sequelae and the NYC DOHMH mortality statistics (which would indicate that any variation is due to reporting discrepancies), and the alternative hypothesis is that an actual difference, \(\bar{d} \neq 0\), exists between the two agencies. Using the same \(0.05\) significance level, the test yields a \(p\)-value of \(0.7782909 > 0.05\): the probability of observing a sample \(t\) statistic with 13 degrees of freedom as extreme as 0.2874606 under the null hypothesis is high. Therefore, we fail to reject the null hypothesis.
The data support the conclusion that no difference exists between NYC PD mortality statistics and NYC DOHMH mortality statistics; in other words, the variation in the differences is due to reporting differences. The question of where mortalities are reported, at the precinct of the hospital or the precinct of the crime, will now be addressed.
Are mortality statistics skewed toward police precincts near hospitals?
The `geocode()` function from the `ggmap` library is extremely efficient but only returns longitude and latitude coordinates. Using Google's geocoding API returns much more information. The API allows 2,500 free requests per day, is limited to 50 requests per second, and returns results in `XML` or `JSON` format.
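To make the structure of the JSON response concrete before the full loop below, here is a minimal sketch of a single request. The address is arbitrary, and at the time of writing low-volume requests worked without an API key (a key parameter may now be required).
base_url <- "https://maps.googleapis.com/maps/api/geocode/json?address="
request <- str_replace_all(paste0(base_url, "1 Police Plaza, New York, NY"), "\\s", "+")
response <- fromJSON(rawToChar(GET(request)$content))
response$status                           # "OK" when the request succeeds
response$results$formatted_address[1]     # standardized address of the first match
response$results$geometry$location[1, ]   # latitude and longitude of the first match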
NYC_Precincts <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_4.csv"), stringsAsFactors = F)
NYC_Hospitals <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_5.csv"), stringsAsFactors = F)
Geocode_1 <- matrix(NA, nrow(NYC_Precincts), 13)
for (i in 1:nrow(NYC_Precincts)) {
Geocode_1[i, 1] = NYC_Precincts[i, 1]
url <- "https://maps.googleapis.com/maps/api/geocode/json?address="
url <- str_replace_all(paste0(url, NYC_Precincts[i, 3],", ", NYC_Precincts[i, 4], ", NY"), "\\s", "+")
json <- GET(url)
json <- rawToChar(json$content)
json <- fromJSON(json)
geocode <- data.frame(json)
Geocode_1[i, c(2:3)] <- if(!is.null(geocode[[1,3]]$lat[1])) {
c(geocode[[1,3]]$lat[1], geocode[[1,3]]$lng[1])
} else if (length(unlist(geocode[[1,3]][2])) == 1){
c(unlist(geocode[[1,3]][1]), unlist(geocode[[1,3]][2]))
} else { unlist(geocode[[1,3]][1]) }
Geocode_1[i, 4] <- geocode[[2]][1]
Geocode_1[i, c(5:13)] <- data.frame(geocode[[1]][1])[1:9, 1]
# Google returns a variable number of address components; shift them right until
# the five-character ZIP code lands in column 13, keeping the borough in column 9.
while (nchar(Geocode_1[i, 13]) != 5 | is.na(Geocode_1[i, 13])) {
if (nchar(Geocode_1[i, 12]) != 5 | is.na(Geocode_1[i, 12])) {
Geocode_1[i, c(10:13)] <- Geocode_1[i, c(9:12)]
Geocode_1[i, 8] <- Geocode_1[i, 7]
} else {
Geocode_1[i, c(10:13)] <- Geocode_1[i, c(9:12)]
Geocode_1[i, 9] <- Geocode_1[i, 8]
}
}
Sys.sleep(1 / 3) # Make no more than 3 requests per second
}
head(Geocode_1[, 2:4])
## [,1] [,2] [,3]
## [1,] "40.7204042" "-74.0068671" "16 Ericsson Pl, New York, NY 10013, USA"
## [2,] "40.716319" "-73.9972801" "19 Elizabeth St, New York, NY 10013, USA"
## [3,] "40.7343033" "-74.0052104" "233 W 10th St, New York, NY 10014, USA"
## [4,] "40.7163623" "-73.9839341" "19 1/2 Pitt St, New York, NY 10002, USA"
## [5,] "40.7266893" "-73.9876452" "321 E 5th St, New York, NY 10003, USA"
## [6,] "40.7428389" "-73.9984916" "230 W 20th St, New York, NY 10011, USA"
Geocode_2 <- matrix(NA, nrow(NYC_Hospitals), 13)
for(i in 1:nrow(NYC_Hospitals)) {
Geocode_2[i, 1] = NYC_Hospitals[i, 1]
url <- "https://maps.googleapis.com/maps/api/geocode/json?address="
url <- str_replace_all(paste0(url, NYC_Hospitals[i, 2], ", ",
NYC_Hospitals[i, 3],", NY ", NYC_Hospitals[i, 5]), "\\s", "+")
json <- GET(url)
json <- rawToChar(json$content)
json <- fromJSON(json)
geocode <- data.frame(json)
Geocode_2[i, c(2:3)] <- if(!is.null(geocode[[1,3]]$lat[1])) {
c(geocode[[1,3]]$lat[1], geocode[[1,3]]$lng[1])
} else if (length(unlist(geocode[[1,3]][2])) == 1){
c(unlist(geocode[[1,3]][1]), unlist(geocode[[1,3]][2]))
} else { unlist(geocode[[1,3]][1]) }
Geocode_2[i, 4] <- geocode[[2]][1]
Geocode_2[i, c(5:13)] <- data.frame(geocode[[1]][1])[1:9, 1]
# Google returns a variable number of address components; shift them right until
# the five-character ZIP code lands in column 13, keeping the borough in column 9.
while (nchar(Geocode_2[i, 13]) != 5 | is.na(Geocode_2[i, 13])) {
if (nchar(Geocode_2[i, 12]) != 5 | is.na(Geocode_2[i, 12])) {
Geocode_2[i, c(10:13)] <- Geocode_2[i, c(9:12)]
Geocode_2[i, 8] <- Geocode_2[i, 7]
} else {
Geocode_2[i, c(10:13)] <- Geocode_2[i, c(9:12)]
Geocode_2[i, 9] <- Geocode_2[i, 8]
}
}
Sys.sleep(1 / 2) # Make no more than 2 requests per second
}
head(Geocode_2[, 2:4])
## [,1] [,2]
## [1,] "40.7400449" "-73.9743966"
## [2,] "40.7348835" "-73.9897585"
## [3,] "40.603877" "-73.9964509"
## [4,] "40.852633" "-73.828528"
## [5,] "40.8316796" "-73.9024215"
## [6,] "40.6560033" "-73.911295"
## [,3]
## [1,] "462 1st Avenue, New York, NY 10016, USA"
## [2,] "10 Union Square E, New York, NY 10003, USA"
## [3,] "2063 86th St, Brooklyn, NY 11214, USA"
## [4,] "3251 Westchester Ave, Bronx, NY 10461, USA"
## [5,] "1276 Fulton Ave, Bronx, NY 10456, USA"
## [6,] "1 Brookdale Plaza, Brooklyn, NY 11212, USA"
Given latitudes \(\lambda_{ 1 }, \lambda_{ 2 }\) and longitudes \(\varphi _{ 1 }, \varphi _{ 2 }\) in radians, as well as the Earth’s mean radius in miles of \(R=3959\), the distance between two points in miles can be calculated trigonometrically using the below formula.
\[d = R\,\arccos\left( \sin(\lambda_{1})\sin(\lambda_{2}) + \cos(\lambda_{1})\cos(\lambda_{2})\cos(\varphi_{1}-\varphi_{2}) \right)\]
Distance <- matrix(NA, nrow(Geocode_1), nrow(Geocode_2))
R = 3959 # Earth mean radius in miles
for (i in 1:nrow(Geocode_1)){
lat1 = as.double(Geocode_1[i, 2]) * pi / 180
lng1 = as.double(Geocode_1[i, 3]) * pi / 180
for (j in 1:nrow(Geocode_2)) {
lat2 = as.double(Geocode_2[j, 2]) * pi / 180
lng2 = as.double(Geocode_2[j, 3]) * pi / 180
Distance[i, j] = acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lng2 - lng1)) * R
}
}
NYC_Precincts <- Distance %>% as.data.frame() %>%
mutate(Min_Dist = apply(.[, apply(., 1, is.numeric)], 1, min)) %>%
select(ncol(Distance) + 1) %>%
cbind(Geocode_1[ , 1:3], ., Geocode_1[ , c(4,9,13)])
colnames(NYC_Precincts) <- c("Precinct", "Lat", "Lng", "Min_Dist", "Address", "Borough", "ZIP")
NYC_Hospitals <- Distance %>% t() %>% as.data.frame() %>%
mutate(Min_Dist = apply(.[, apply(., 2, is.numeric)], 1, min)) %>%
select(nrow(Distance) + 1) %>%
cbind(Geocode_2[ , 1:3], ., Geocode_2[ , c(4,9,13)])
colnames(NYC_Hospitals) <- c("Hospital", "Lat", "Lng", "Min_Dist", "Address", "Borough", "ZIP")
Precincts_2 <- Mortality_1 %>%
filter(grepl('MURDER', CRIME) & PCT != "DOC") %>%
gather(Year, Count, 3:18) %>%
mutate(Count = ifelse(is.na(Count), 0, Count)) %>%
mutate(Year = as.integer(sub("X", "", Year))) %>%
rename(Precinct = PCT) %>%
group_by(Precinct) %>%
summarise(Total = sum(Count))
Test the correlation between mortality and distance.
Correlation coefficients measure the strength of association between two variables. The most common correlation coefficient, called the Pearson product-moment correlation coefficient, measures the strength of the linear association between variables.
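As a sketch of what `cor()` computes, Pearson's \(r\) is the covariance of the two variables scaled by their standard deviations (the vectors below are made up for illustration only).
dist_miles <- c(0.5, 1.2, 2.0, 3.1, 4.4)                       # hypothetical distances to nearest hospital
mortality <- c(60, 85, 70, 90, 65)                              # hypothetical precinct mortality totals
cov(dist_miles, mortality) / (sd(dist_miles) * sd(mortality))   # r = cov(x, y) / (s_x * s_y)
cor(dist_miles, mortality)                                      # identical result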
ggmap(get_map(location=geocode("NYC"), source="google", maptype="roadmap")) +
geom_point(aes(y = as.numeric(levels(Lat))[Lat], x = as.numeric(levels(Lng))[Lng]),
data = NYC_Precincts, alpha = 0.5, color="#009999", size = 2) +
geom_point(aes(y = as.numeric(levels(Lat))[Lat], x = as.numeric(levels(Lng))[Lng]),
data = NYC_Hospitals, alpha = 0.5, color="#FF6699", size = 2)
ggmap(get_map(location=geocode("Brooklyn, NY"), source="google", maptype="roadmap", zoom=11)) +
geom_point(aes(y = as.numeric(levels(Lat))[Lat], x = as.numeric(levels(Lng))[Lng]),
data = NYC_Precincts, alpha = 0.5, color="#009999", size = 3) +
geom_point(aes(y = as.numeric(levels(Lat))[Lat], x = as.numeric(levels(Lng))[Lng]),
data = NYC_Hospitals, alpha = 0.5, color="#FF6699", size = 3)
Compare_5 <- left_join(NYC_Precincts, Precincts_2, by = "Precinct")
cor(Compare_5[, c(4,8)])[[2]]
## [1] -0.03227174
ggplot(Compare_5, aes(Min_Dist, Total)) +
geom_point(aes(colour = factor(Borough))) +
labs(x = "Miles to Nearest Hospital", y = "Total Mortality in Precinct") +
labs(colour = "Borough", title = "Mortality Relative to Proximity") +
theme(plot.title = element_text(hjust = 0.5))
(Mortality_4 <- Compare_5 %>%
group_by(Borough) %>% summarise(Mean = mean(Total), Total = sum(Total)) %>%
cbind(., Population = c(1385107, 2504710, 1585874, 2230541, 468730)) %>%
mutate(Frequency = paste0(round(100 * Mean / Population, 5),"%")))
## Borough Mean Total Population Frequency
## 1 Bronx 177.58333 2131 1385107 0.01282%
## 2 Brooklyn 140.39130 3229 2504710 0.00561%
## 3 New York 54.13636 1191 1585874 0.00341%
## 4 Queens 88.12500 1410 2230541 0.00395%
## 5 Staten Island 57.50000 230 468730 0.01227%
Plotted on a map, police precincts and hospitals are spaced relatively evenly, and zooming in shows how close police precincts and hospitals are to each other. The scatter plot makes this relationship even more obvious, with the vast majority of distances clustered below two miles. It is also worth noting the difference between mortality totals and mortality frequency relative to population in each borough, as perception could be biased by looking solely at totals.
The correlation between mortality and distance to the nearest hospital is negative, with \(r=-0.0322717\). This is an extremely weak correlation and does not support the existence of a relationship between the mortality figures a precinct reports and its proximity to a hospital.
The data support the conclusion that no relationship exists between mortality figures reported by precincts closer to hospitals and those reported by precincts farther from hospitals.
In general, NYC police statistics appear to match NYC hospital statistics when it comes to mortalities. Differences in the data reported by each agency appear to be attributable to how sequelae are handled. This analysis also shows both the disparity in data quality between sources and the importance of working through initial findings critically.
Specifically, there is no evidence to support the hypothesis that mortality figures are skewed toward precincts closer to hospitals. Therefore, it is highly unlikely that mortality is reported at the precinct of the hospital rather than the precinct of the crime. It is worth noting that the number of crimes, police precincts, and hospitals all tend to be higher in high-density areas, and that hospitals do not necessarily lie in the precinct to which they are closest. Given the questions raised and the findings, however, a micro-level examination of those factors is unwarranted in this analysis.
The events that inspired this analysis were tragedies. Trying to explain what happened is not easy, if it is possible at all. At least now some possibilities can be ruled out. Yet regardless of what transpired and why, I would like to add this note extending condolences and respect to the families of the individuals involved.
http://nypost.com/2015/04/06/daily-blotter-485/
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm
http://www.statstutor.ac.uk/resources/uploaded/paired-t-test.pdf
http://www.statisticssolutions.com/manova-analysis-paired-sample-t-test/
http://stattrek.com/hypothesis-test/paired-means.aspx?Tutorial=AP
http://stattrek.com/statistics/correlation.aspx
http://www.nyc.gov/html/oignypd/pages/about/foil-request.shtml
http://ypdcrime.com/penal.law/article125.htm#p125.27
http://www.nycourts.gov/judges/cji/2-PenalLaw/125/art125hp.shtml
https://www1.nyc.gov/site/doh/about/ogc-foil.page
https://a816-healthpsi.nyc.gov/epiquery/VS/index.html
http://apps.who.int/classifications/apps/icd/icd10online2005/fr-icd.htm?gx85.htm+
http://apps.who.int/classifications/apps/icd/icd10online2005/fr-icd.htm?gy85.htm+
https://en.wikipedia.org/wiki/ICD-10_Chapter_XX:_External_causes_of_morbidity_and_mortality
https://a816-healthpsi.nyc.gov/epiquery/CHS/uhf-zip-information.pdf
https://www1.nyc.gov/assets/doh/downloads/pdf/ah/zipcodetable.pdf
http://www.health.ny.gov/regulations/hcra/provider/provhosp.htm
https://en.wikipedia.org/wiki/Acute_care
http://www.nyc.gov/html/nypd/html/home/precincts.shtml
https://developers.google.com/maps/documentation/geocoding