- Describe your motivation for performing this analysis (2).
- Have a recognizable data science workflow (4.a - 4.e, 5.a - 5.c).
- Include data from at least two different types of data sources (4.b.i, 4.e.iv.3, 5.a.i).
- Include at least one data transformation operation (4.b.ii, 5.a.ii).
- Include at least one statistical analysis and at least one graphic that describes or validates the data (4.d, 5.b).
- Include at least one graphic that supports the conclusion (4.e, 5.b).
- Include at least one statistical analysis that supports the conclusion (4.e, 5.c).
- Include at least one feature not covered in class (5.a - 5.b).
- Deliver the presentation in the allotted time of 3 to 5 minutes.
- Show at least one challenge encountered in code or data, and what was done when the challenge was encountered (4.e.iv).
- Ensure the audience comes away with a clear understanding of the motivation for undertaking the project.
- Ensure the audience comes away with a clear understanding of at least one insight gained, the conclusion you reached, or the hypothesis you confirmed.
- Deliver self-contained code and data.
- Deliver fully reproducible results.
- Ensure all of the delivered code runs without errors.
- Deliver code and conclusions using a reproducible research tool.
- Deliver the draft project proposal, project, and presentation on time.
In April of 2015 there was a murder in New York City. There was a surge of police activity. Then, when the crime scene investigation was done, the body remained covered by a white sheet and guarded by a police officer for a few hours until it was picked up by an ambulance. The strange thing is, a few news outlets reported that “EMS rushed him to St. Barnabas Hospital, where he was pronounced dead, cops said.” The whole situation was disturbing, but these news reports also made me wonder how crime statistics are reported. Was this death reported in the statistics for the precinct of the hospital (48th) or the precinct of the crime (46th)? I never did the digging necessary to find out.
Then, in September of 2016, I read about a similar situation. There was a story about the time of death reported by the NYC Police Department. The report said “Police officials initially said that Graham died at Montefiore Medical Center. His death certificate says he was shot at 3:01 p.m. and died at 3:53 p.m. According to his family, the video indicates he was already dead, and contradicts the city account.” Although this story was disturbing in and of itself, it also made me wonder whether the death was reported in the statistics for the precinct of the hospital (52nd) or the precinct of the crime (47th). I also wondered whether the total number of mortalities reported in NYC Police statistics matches the total number of mortalities reported in NYC Hospital statistics.
Do NYC Police Precincts and NYC Hospitals report different mortality figures? This question will be examined in general terms (whether NYC Police Precincts and NYC Hospitals report the same total mortality figures in their respective statistics) and much more specific terms (whether mortality statistics are skewed toward police precincts near hospitals).
library(XML)
library(httr)
library(tidyr)
library(dplyr)
library(RCurl)
library(ggmap)
library(ggplot2)
library(stringr)
library(jsonlite)
Do NYC Police Precincts and NYC Hospitals report the same total mortality figures in their statistics?
Mortality_1 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_1.csv"), stringsAsFactors = F)
Mortality_2 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_2.csv"), stringsAsFactors = F)
Precincts_1 <- Mortality_1 %>%
filter(grepl('MURDER', CRIME)) %>%
gather(xYear, Count, 3:18) %>%
na.omit(Count) %>%
mutate(Year = as.integer(sub("X", "", xYear))) %>%
group_by(Year) %>%
summarise(Total = sum(Count))
Hospital_1 <- Mortality_2 %>%
filter(grepl('HOMICIDE', Cause.of.Death)) %>%
group_by(Year) %>%
summarise(Total = sum(Count))
Test the significance of the difference between pairs.
shapiro.test(unlist(Hospital_1))
##
## Shapiro-Wilk normality test
##
## data: unlist(Hospital_1)
## W = 0.8467, p-value = 0.05306
shapiro.test(unlist(Precincts_1))
##
## Shapiro-Wilk normality test
##
## data: unlist(Precincts_1)
## W = 0.69943, p-value = 8.524e-07
bartlett.test(Hospital_1,Precincts_1)
##
## Bartlett test of homogeneity of variances
##
## data: Hospital_1
## Bartlett's K-squared = 22.052, df = 1, p-value = 2.654e-06
Given fairly normal and homoscedastic matched pairs, and assuming independent cases (years), the paired t-test would be appropriate. The paired t-test compares two populations of paired data that are correlated due to said pairing. It uses the mean \(\bar{d}\) and standard deviation \(s_d\) of the \(n\) paired differences to calculate \(t=\frac{\bar{d}}{s_d/\sqrt{n}}\). The null hypothesis of the test is that the true mean difference is zero.
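As a quick illustration of the formula, the statistic can be computed by hand and checked against `t.test()` using a pair of made-up vectors (the values below are arbitrary and for illustration only).
x <- c(500, 520, 480, 510, 530)         # hypothetical totals from source A
y <- c(505, 515, 490, 520, 525)         # hypothetical totals from source B
d <- y - x                              # paired differences
mean(d) / (sd(d) / sqrt(length(d)))     # t = d-bar / (s_d / sqrt(n))
t.test(y, x, paired = TRUE)$statistic   # same value reported by t.test()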
(Compare_1 <- full_join(Precincts_1, Hospital_1, by = "Year", suffix = c(".NYPD", ".DOHMH")))
## # A tibble: 16 × 3
## Year Total.NYPD Total.DOHMH
## <int> <int> <int>
## 1 2000 673 NA
## 2 2001 649 NA
## 3 2002 587 NA
## 4 2003 597 NA
## 5 2004 570 NA
## 6 2005 539 NA
## 7 2006 596 NA
## 8 2007 496 2012
## 9 2008 523 2096
## 10 2009 471 1952
## 11 2010 536 2124
## 12 2011 515 2084
## 13 2012 419 NA
## 14 2013 335 NA
## 15 2014 333 NA
## 16 2015 352 NA
visualize <- rbind(cbind(Hospital_1, Source = "DOHMH"), cbind(Precincts_1, Source = "NYPD"))
ggplot() + geom_area(aes(x = Year, y = Total, fill = Source), data = visualize,
stat="identity", position = position_dodge(width = 0), alpha= I(0.5))
Compare_2 <- right_join(Precincts_1, Hospital_1, by = "Year", suffix = c(".NYPD", ".DOHMH"))
data.frame(Compare_2, Difference = matrix((unlist(Compare_2[3] - Compare_2[2])), ncol=1),
Factor = matrix((unlist(Compare_2[3] / Compare_2[2])), ncol=1))
## Year Total.NYPD Total.DOHMH Difference Factor
## 1 2007 496 2012 1516 4.056452
## 2 2008 523 2096 1573 4.007648
## 3 2009 471 1952 1481 4.144374
## 4 2010 536 2124 1588 3.962687
## 5 2011 515 2084 1569 4.046602
t.test(Compare_2[3] - Compare_2[2])
##
## One Sample t-test
##
## data: Compare_2[3] - Compare_2[2]
## t = 76.578, df = 4, p-value = 1.743e-07
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1489.369 1601.431
## sample estimates:
## mean of x
## 1545.4
The data reported by the NYC PD and by the NYC DOHMH are conflicting and flawed. The NYC PD data cover 16 years with an average of 512, while the NYC DOHMH data cover only 5 years with an average of 2054. Looking at the years where both have data available, the average difference is 1545, which is very strange. The NYC DOHMH data are inflated by a roughly constant factor of about 4.04.
Something is clearly wrong here, and the discrepancy is so large that it does not justify even discussing the t-test results.
Occam’s razor demands that the fewest assumptions be made in the interpretation of these findings. Therefore, it is likely that something is wrong in the data. Further investigation is necessary.
FOIL Request #: 2016.0056
This is in response to your October 27, 2016 (received by the Office of the Inspector General for the NYPD [“OIG-NYPD”] on October 27, 2016) request, pursuant to the Freedom of Information Law, for “formal definition of the ‘MURDER & NON NEGL. MANSLAUGHTER’ CompStat category.”
The New York Police Department uses the New York State Penal Law to define all crimes/offenses/infractions referred to in their CompStat 2.0 reporting data.
Please note that the OIG-NYPD is not part of the NYPD. For records kept by the NYPD, you may make an electronic FOIL request to the NYPD’s Records Access Officer at http://www.nyc.gov/html/nypd/html/legal_matters/dclm_doc_production_foil.shtml.
This letter represents our complete response to your request. Should you wish to appeal this determination, you must send written notice within thirty days to Asim Rehman, General Counsel at the Office of the Inspector General for the New York City Police Department, City of New York Department of Investigation, 80 Maiden Lane, 14th Floor, New York, New York 10038.
New York State Penal Law
S 125.00 Homicide defined: Homicide means conduct which causes the death of a person or an unborn child with which a female has been pregnant for more than twenty-four weeks under circumstances constituting murder, manslaughter in the first degree, manslaughter in the second degree, criminally negligent homicide, abortion in the first degree or self-abortion in the first degree.
Section | Offense | Class |
---|---|---|
125.10 | Criminally negligent homicide. | E FELONY |
125.11 | Aggravated criminally negligent homicide. | C FELONY |
125.12 | Vehicular manslaughter in the second degree. | D FELONY |
125.13 | Vehicular manslaughter in the first degree. | C FELONY |
125.14 | Aggravated vehicular homicide. | B FELONY |
125.15 | Manslaughter in the second degree. | C FELONY |
125.20 | Manslaughter in the first degree. | B FELONY |
125.21 | Aggravated manslaughter in the second degree. | C FELONY |
125.22 | Aggravated manslaughter in the first degree. | B FELONY |
125.25 | Murder in the second degree. | A-I FELONY |
125.26 | Aggravated murder. | A-I FELONY |
125.27 | Murder in the first degree. | A-I FELONY |
125.40 | Abortion in the second degree. | E FELONY |
125.45 | Abortion in the first degree. | D FELONY |
125.50 | Self-abortion in the second degree. | B MISD |
125.55 | Self-abortion in the first degree. | A MISD |
125.60 | Issuing abortional articles. | B MISD |
FOIL Request #216FR02936
The NYC Department of Health and Mental Hygiene has received your Freedom of Information Law request and assigned it the control number noted above. The data you seek for 2000 - 2014 is available online at the Department’s EpiQuery website: https://sasebiweb200.health.dohmh.nycnet/epiquery/VS/index.html . If you select “Mortality - by select causes” and then click on the “SUBMIT” button, the data will appear. Please note that 2015 data has not yet been finalized.
The definition of “homicide” is also available online at http://www1.nyc.gov/assets/doh/downloads/pdf/ip/ip-homicides-in-new-york-city.pdf.
This concludes the Department’s response to your Freedom of Information Law request.
International Classification of Diseases (ICD-10)
Category | Condition |
---|---|
X85 | Assault by drugs, medicaments and biological substances |
X86 | Assault by corrosive substance |
X87 | Assault by pesticides |
X88 | Assault by gases and vapors |
X89 | Assault by other specified chemicals and noxious substances |
X90 | Assault by unspecified chemical or noxious substance |
X91 | Assault by hanging, strangulation and suffocation |
X92 | Assault by drowning and submersion |
X93 | Assault by handgun discharge |
X94 | Assault by rifle, shotgun and larger firearm discharge |
X95 | Assault by other and unspecified firearm discharge |
X96 | Assault by explosive material |
X97 | Assault by smoke, fire and flames |
X98 | Assault by steam, hot vapors and hot objects |
X99 | Assault by sharp object |
Y00 | Assault by blunt object |
Y01 | Assault by pushing from high place |
Y02 | Assault by pushing or placing victim before moving object |
Y03 | Assault by crashing of motor vehicle |
Y04 | Assault by bodily force |
Y05 | Sexual assault by bodily force |
Y06 | Neglect and abandonment |
Y07 | Other maltreatment syndromes |
Y08 | Assault by other specified means |
Y09 | Assault by unspecified means |
Y87.1 | Sequelae of assault i
i Sequelae include conditions reported as such, or occurring as “late effects” one year or more after the originating event.
Mortality_3 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_3.csv"), stringsAsFactors = F)
Hospital_2 <- Mortality_3[1:15, 2:3]
Hospital_2[1] <- as.integer(unlist(Hospital_2[1]))
colnames(Hospital_2) <- colnames(Hospital_1)
shapiro.test(unlist(Hospital_2))
##
## Shapiro-Wilk normality test
##
## data: unlist(Hospital_2)
## W = 0.70284, p-value = 1.716e-06
bartlett.test(Hospital_2, Precincts_1)
##
## Bartlett test of homogeneity of variances
##
## data: Hospital_2
## Bartlett's K-squared = 67.331, df = 1, p-value = 2.296e-16
visualize <- rbind(cbind(Hospital_2, Source = "DOHMH"), cbind(Precincts_1, Source = "NYPD"))
ggplot() + geom_area(aes(x = Year, y = Total, fill = Source), data = visualize,
stat="identity", position = position_dodge(width = 0), alpha= I(0.5))
Compare_3 <- right_join(Precincts_1, Hospital_2, by = "Year", suffix = c(".NYPD", ".DOHMH"))
data.frame(Compare_3, Difference = matrix((unlist(Compare_3[3] - Compare_3[2])), ncol=1),
Factor = matrix((unlist(Compare_3[3] / Compare_3[2])), ncol=1))
## Year Total.NYPD Total.DOHMH Difference Factor
## 1 2000 673 711 38 1.056464
## 2 2001 649 669 20 1.030817
## 3 2002 587 612 25 1.042589
## 4 2003 597 654 57 1.095477
## 5 2004 570 597 27 1.047368
## 6 2005 539 578 39 1.072356
## 7 2006 596 624 28 1.046980
## 8 2007 496 513 17 1.034274
## 9 2008 523 556 33 1.063098
## 10 2009 471 496 25 1.053079
## 11 2010 536 547 11 1.020522
## 12 2011 515 528 13 1.025243
## 13 2012 419 440 21 1.050119
## 14 2013 335 341 6 1.017910
## 15 2014 333 353 20 1.060060
t.test(Compare_3[3] - Compare_3[2])
##
## One Sample t-test
##
## data: Compare_3[3] - Compare_3[2]
## t = 7.6627, df = 14, p-value = 2.253e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 18.24250 32.42417
## sample estimates:
## mean of x
## 25.33333
These data are much better. The new NYC DOHMH data cover 15 years with an average of 548. Looking at the years where both have data available, the average difference is 25. The new NYC DOHMH data are still inflated, but now only by a factor of about 1.05.
For the t-test, the null hypothesis is a mean difference of zero, \(\bar{d} = 0\), between the NYC PD murder statistics and the NYC DOHMH mortality statistics (which would indicate that any variation is due to reporting discrepancies), and the alternative hypothesis is that an actual difference, \(\bar{d} \neq 0\), exists between the two agencies. The decision is made at a significance level of \(0.05\): the null hypothesis is rejected when \(p \le 0.05\) and not rejected otherwise. The test yields a \(p\)-value of \(0.000002253 < 0.05\), so the probability of observing a sample \(t\) statistic with 14 degrees of freedom as extreme as 7.662652 under the null hypothesis is extremely low. Therefore, the null hypothesis is rejected.
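The same decision rule can be applied programmatically by pulling the \(p\)-value out of the object returned by `t.test()` (a small illustrative check of the result above).
paired_test <- t.test(Compare_3[3] - Compare_3[2])
paired_test$p.value <= 0.05   # TRUE here, so the null hypothesis of zero mean difference is rejected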
The data support the conclusion that an actual difference exists between NYC PD mortality statistics and NYC DOHMH mortality statistics. This raises the question of where this statistically significant and substantial difference in mortalities is coming from. Examining the data definitions, it seems reasonable to check whether a sequela lag (a mortality reported as such, or occurring as a “late effect,” one year or more after the originating event) could be the reason.
(sequela <- data.frame("Year" = Compare_3[1] - 1, Compare_3[3] - Compare_3[2]))
## Year Total.DOHMH
## 1 1999 38
## 2 2000 20
## 3 2001 25
## 4 2002 57
## 5 2003 27
## 6 2004 39
## 7 2005 28
## 8 2006 17
## 9 2007 33
## 10 2008 25
## 11 2009 11
## 12 2010 13
## 13 2011 21
## 14 2012 6
## 15 2013 20
sequela_lag <- Precincts_1[1:14, ]
sequela_lag[ , 2] <- sequela_lag[ , 2] + sequela[2:15 ,2]
visualize <- rbind(cbind(Hospital_2[1:14,], Source = "DOHMH"), cbind(sequela_lag, Source = "NYPD"))
ggplot() + geom_area(aes(x = Year, y = Total, fill = Source), data = visualize,
stat="identity", position = position_dodge(width = 0), alpha= I(0.5))
Compare_4 <- left_join(sequela_lag, Hospital_2, by = "Year", suffix = c(".NYPD", ".DOHMH"))
data.frame(Compare_4, Difference = matrix((unlist(Compare_4[3] - Compare_4[2])), ncol=1),
Factor = matrix((unlist(Compare_4[3] / Compare_4[2])), ncol=1))
## Year Total.NYPD Total.DOHMH Difference Factor
## 1 2000 693 711 18 1.0259740
## 2 2001 674 669 -5 0.9925816
## 3 2002 644 612 -32 0.9503106
## 4 2003 624 654 30 1.0480769
## 5 2004 609 597 -12 0.9802956
## 6 2005 567 578 11 1.0194004
## 7 2006 613 624 11 1.0179445
## 8 2007 529 513 -16 0.9697543
## 9 2008 548 556 8 1.0145985
## 10 2009 482 496 14 1.0290456
## 11 2010 549 547 -2 0.9963570
## 12 2011 536 528 -8 0.9850746
## 13 2012 425 440 15 1.0352941
## 14 2013 355 341 -14 0.9605634
t.test(Compare_4[3] - Compare_4[2])
##
## One Sample t-test
##
## data: Compare_4[3] - Compare_4[2]
## t = 0.28746, df = 13, p-value = 0.7783
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -8.376886 10.948314
## sample estimates:
## mean of x
## 1.285714
After lagging the difference between agencies by one year (although the lag can be greater), the modified NYC PD data cover 14 years with an average of 561. Looking at the years where both have data available, the average difference is 1. The modified data still show some variation, but the mean factor is now 1; the differences essentially cancel out.
For this t-test, the null hypothesis is again a mean difference of zero, \(\bar{d} = 0\), between the NYC PD mortality statistics including sequelae and the NYC DOHMH mortality statistics (which would indicate that any variation is due to reporting discrepancies), and the alternative hypothesis is that an actual difference, \(\bar{d} \neq 0\), exists between the two agencies. Using the same \(0.05\) significance level, the test yields a \(p\)-value of \(0.7782909 > 0.05\): the probability of observing a sample \(t\) statistic with 13 degrees of freedom as extreme as 0.2874606 under the null hypothesis is high. Therefore, we fail to reject the null hypothesis.
The data support the conclusion that no difference exists between NYC PD mortality statistics and NYC DOHMH mortality statistics; in other words, the variation in the differences is due to reporting differences. The question of where mortalities are reported, at the precinct of the hospital or the precinct of the crime, will now be addressed.
Are mortality statistics skewed toward police precincts near hospitals?
The `geocode()` function from the `ggmap` library is extremely efficient but only returns longitude and latitude coordinates. Using Google's geocoding API returns much more information. The API allows 2,500 free requests per day, is limited to 50 requests per second, and returns results in `XML` or `JSON` format.
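To make the structure of the JSON response concrete before the full loop below, here is a minimal sketch of a single request. The address is arbitrary, and at the time of writing low-volume requests worked without an API key (a key parameter may now be required).
base_url <- "https://maps.googleapis.com/maps/api/geocode/json?address="
request <- str_replace_all(paste0(base_url, "1 Police Plaza, New York, NY"), "\\s", "+")
response <- fromJSON(rawToChar(GET(request)$content))
response$status                           # "OK" when the request succeeds
response$results$formatted_address[1]     # standardized address of the first match
response$results$geometry$location[1, ]   # latitude and longitude of the first match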
NYC_Precincts <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_4.csv"), stringsAsFactors = F)
NYC_Hospitals <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/master/",
"DATA%20607/DATA_607_Final_Project_5.csv"), stringsAsFactors = F)
Geocode_1 <- matrix(NA, nrow(NYC_Precincts), 13)
for (i in 1:nrow(NYC_Precincts)) {
Geocode_1[i, 1] = NYC_Precincts[i, 1]
url <- "https://maps.googleapis.com/maps/api/geocode/json?address="
url <- str_replace_all(paste0(url, NYC_Precincts[i, 3],", ", NYC_Precincts[i, 4], ", NY"), "\\s", "+")
json <- GET(url)
json <- rawToChar(json$content)
json <- fromJSON(json)
geocode <- data.frame(json)
Geocode_1[i, c(2:3)] <- if(!is.null(geocode[[1,3]]$lat[1])) {
c(geocode[[1,3]]$lat[1], geocode[[1,3]]$lng[1])
} else if (length(unlist(geocode[[1,3]][2])) == 1){
c(unlist(geocode[[1,3]][1]), unlist(geocode[[1,3]][2]))
} else { unlist(geocode[[1,3]][1]) }
Geocode_1[i, 4] <- geocode[[2]][1]
Geocode_1[i, c(5:13)] <- data.frame(geocode[[1]][1])[1:9, 1]
# Google returns a variable number of address components; shift them right until
# the five-character ZIP code lands in column 13, keeping the borough in column 9.
while (nchar(Geocode_1[i, 13]) != 5 | is.na(Geocode_1[i, 13])) {
if (nchar(Geocode_1[i, 12]) != 5 | is.na(Geocode_1[i, 12])) {
Geocode_1[i, c(10:13)] <- Geocode_1[i, c(9:12)]
Geocode_1[i, 8] <- Geocode_1[i, 7]
} else {
Geocode_1[i, c(10:13)] <- Geocode_1[i, c(9:12)]
Geocode_1[i, 9] <- Geocode_1[i, 8]
}
}
Sys.sleep(1 / 3) # Make no more than 3 requests per second
}
head(Geocode_1[, 2:4])
## [,1] [,2] [,3]
## [1,] "40.7204042" "-74.0068671" "16 Ericsson Pl, New York, NY 10013, USA"
## [2,] "40.716319" "-73.9972801" "19 Elizabeth St, New York, NY 10013, USA"
## [3,] "40.7343033" "-74.0052104" "233 W 10th St, New York, NY 10014, USA"
## [4,] "40.7163623" "-73.9839341" "19 1/2 Pitt St, New York, NY 10002, USA"
## [5,] "40.7266893" "-73.9876452" "321 E 5th St, New York, NY 10003, USA"
## [6,] "40.7428389" "-73.9984916" "230 W 20th St, New York, NY 10011, USA"
Geocode_2 <- matrix(NA, nrow(NYC_Hospitals), 13)
for(i in 1:nrow(NYC_Hospitals)) {
Geocode_2[i, 1] = NYC_Hospitals[i, 1]
url <- "https://maps.googleapis.com/maps/api/geocode/json?address="
url <- str_replace_all(paste0(url, NYC_Hospitals[i, 2], ", ",
NYC_Hospitals[i, 3],", NY ", NYC_Hospitals[i, 5]), "\\s", "+")
json <- GET(url)
json <- rawToChar(json$content)
json <- fromJSON(json)
geocode <- data.frame(json)
Geocode_2[i, c(2:3)] <- if(!is.null(geocode[[1,3]]$lat[1])) {
c(geocode[[1,3]]$lat[1], geocode[[1,3]]$lng[1])
} else if (length(unlist(geocode[[1,3]][2])) == 1){
c(unlist(geocode[[1,3]][1]), unlist(geocode[[1,3]][2]))
} else { unlist(geocode[[1,3]][1]) }
Geocode_2[i, 4] <- geocode[[2]][1]
Geocode_2[i, c(5:13)] <- data.frame(geocode[[1]][1])[1:9, 1]
# Google returns a variable number of address components; shift them right until
# the five-character ZIP code lands in column 13, keeping the borough in column 9.
while (nchar(Geocode_2[i, 13]) != 5 | is.na(Geocode_2[i, 13])) {
if (nchar(Geocode_2[i, 12]) != 5 | is.na(Geocode_2[i, 12])) {
Geocode_2[i, c(10:13)] <- Geocode_2[i, c(9:12)]
Geocode_2[i, 8] <- Geocode_2[i, 7]
} else {
Geocode_2[i, c(10:13)] <- Geocode_2[i, c(9:12)]
Geocode_2[i, 9] <- Geocode_2[i, 8]
}
}
Sys.sleep(1 / 2) # Make no more than 2 requests per second
}
head(Geocode_2[, 2:4])
## [,1] [,2]
## [1,] "40.7400449" "-73.9743966"
## [2,] "40.7348835" "-73.9897585"
## [3,] "40.603877" "-73.9964509"
## [4,] "40.852633" "-73.828528"
## [5,] "40.8316796" "-73.9024215"
## [6,] "40.6560033" "-73.911295"
## [,3]
## [1,] "462 1st Avenue, New York, NY 10016, USA"
## [2,] "10 Union Square E, New York, NY 10003, USA"
## [3,] "2063 86th St, Brooklyn, NY 11214, USA"
## [4,] "3251 Westchester Ave, Bronx, NY 10461, USA"
## [5,] "1276 Fulton Ave, Bronx, NY 10456, USA"
## [6,] "1 Brookdale Plaza, Brooklyn, NY 11212, USA"
Given latitudes \(\lambda_{ 1 }, \lambda_{ 2 }\) and longitudes \(\varphi _{ 1 }, \varphi _{ 2 }\) in radians, as well as the Earth’s mean radius in miles of \(R=3959\), the distance between two points in miles can be calculated trigonometrically using the below formula.
\[d = R\,\arccos\left( \sin(\lambda_{1})\sin(\lambda_{2}) + \cos(\lambda_{1})\cos(\lambda_{2})\cos(\varphi_{1}-\varphi_{2}) \right)\]
Distance <- matrix(NA, nrow(Geocode_1), nrow(Geocode_2))
R = 3959 # Earth mean radius in miles
for (i in 1:nrow(Geocode_1)){
lat1 = as.double(Geocode_1[i, 2]) * pi / 180
lng1 = as.double(Geocode_1[i, 3]) * pi / 180
for (j in 1:nrow(Geocode_2)) {
lat2 = as.double(Geocode_2[j, 2]) * pi / 180
lng2 = as.double(Geocode_2[j, 3]) * pi / 180
Distance[i, j] = acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lng2 - lng1)) * R
}
}
NYC_Precincts <- Distance %>% as.data.frame() %>%
mutate(Min_Dist = apply(.[, apply(., 1, is.numeric)], 1, min)) %>%
select(ncol(Distance) + 1) %>%
cbind(Geocode_1[ , 1:3], ., Geocode_1[ , c(4,9,13)])
colnames(NYC_Precincts) <- c("Precinct", "Lat", "Lng", "Min_Dist", "Address", "Borough", "ZIP")
NYC_Hospitals <- Distance %>% t() %>% as.data.frame() %>%
mutate(Min_Dist = apply(.[, apply(., 2, is.numeric)], 1, min)) %>%
select(nrow(Distance) + 1) %>%
cbind(Geocode_2[ , 1:3], ., Geocode_2[ , c(4,9,13)])
colnames(NYC_Hospitals) <- c("Hospital", "Lat", "Lng", "Min_Dist", "Address", "Borough", "ZIP")
Precincts_2 <- Mortality_1 %>%
filter(grepl('MURDER', CRIME) & PCT != "DOC") %>%
gather(Year, Count, 3:18) %>%
mutate(Count = ifelse(is.na(Count), 0, Count)) %>%
mutate(Year = as.integer(sub("X", "", Year))) %>%
rename(Precinct = PCT) %>%
group_by(Precinct) %>%
summarise(Total = sum(Count))
Test the correlation between mortality and distance.
Correlation coefficients measure the strength of association between two variables. The most common correlation coefficient, called the Pearson product-moment correlation coefficient, measures the strength of the linear association between variables.
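As a sketch of what `cor()` computes, Pearson's \(r\) is the covariance of the two variables scaled by their standard deviations (the vectors below are made up for illustration only).
dist_miles <- c(0.5, 1.2, 2.0, 3.1, 4.4)                       # hypothetical distances to nearest hospital
mortality <- c(60, 85, 70, 90, 65)                              # hypothetical precinct mortality totals
cov(dist_miles, mortality) / (sd(dist_miles) * sd(mortality))   # r = cov(x, y) / (s_x * s_y)
cor(dist_miles, mortality)                                      # identical result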
ggmap(get_map(location=geocode("NYC"), source="google", maptype="roadmap")) +
geom_point(aes(y = as.numeric(levels(Lat))[Lat], x = as.numeric(levels(Lng))[Lng]),
data = NYC_Precincts, alpha = 0.5, color="#009999", size = 2) +
geom_point(aes(y = as.numeric(levels(Lat))[Lat], x = as.numeric(levels(Lng))[Lng]),
data = NYC_Hospitals, alpha = 0.5, color="#FF6699", size = 2)
ggmap(get_map(location=geocode("Brooklyn, NY"), source="google", maptype="roadmap", zoom=11)) +
geom_point(aes(y = as.numeric(levels(Lat))[Lat], x = as.numeric(levels(Lng))[Lng]),
data = NYC_Precincts, alpha = 0.5, color="#009999", size = 3) +
geom_point(aes(y = as.numeric(levels(Lat))[Lat], x = as.numeric(levels(Lng))[Lng]),
data = NYC_Hospitals, alpha = 0.5, color="#FF6699", size = 3)
Compare_5 <- left_join(NYC_Precincts, Precincts_2, by = "Precinct")
cor(Compare_5[, c(4,8)])[[2]]
## [1] -0.03227174
ggplot(Compare_5, aes(Min_Dist, Total)) +
geom_point(aes(colour = factor(Borough))) +
labs(x = "Miles to Nearest Hospital", y = "Total Mortality in Precinct") +
labs(colour = "Borough", title = "Mortality Relative to Proximity") +
theme(plot.title = element_text(hjust = 0.5))
(Mortality_4 <- Compare_5 %>%
group_by(Borough) %>% summarise(Mean = mean(Total), Total = sum(Total)) %>%
cbind(., Population = c(1385107, 2504710, 1585874, 2230541, 468730)) %>%
mutate(Frequency = paste0(round(100 * Mean / Population, 5),"%")))
## Borough Mean Total Population Frequency
## 1 Bronx 177.58333 2131 1385107 0.01282%
## 2 Brooklyn 140.39130 3229 2504710 0.00561%
## 3 New York 54.13636 1191 1585874 0.00341%
## 4 Queens 88.12500 1410 2230541 0.00395%
## 5 Staten Island 57.50000 230 468730 0.01227%
Plotted on a map, police precincts and hospitals are spaced relatively evenly, and zooming in shows how close police precincts and hospitals are to each other. The scatter plot makes this relationship even more obvious, with the vast majority of distances clustered below two miles. It is also worth noting the difference between mortality totals and mortality frequency relative to population in each borough, as perception could be biased by looking solely at totals.
The correlation between mortality and distance to the nearest hospital is negative, with \(r=-0.0322717\). This is an extremely weak correlation and does not support the existence of a relationship between the mortality figures a precinct reports and its proximity to a hospital.
The data support the conclusion that no relationship exists between mortality figures reported by precincts closer to hospitals and those reported by precincts farther from hospitals.
In general, NYC police statistics appear to match NYC hospital statistics when it comes to mortalities. Differences in the data reported by each agency appear to be attributable to how sequelae are handled. This analysis also shows both the disparity in data quality between sources and the importance of working through initial findings critically.
Specifically, there is no evidence to support the hypothesis that mortality figures are skewed toward precincts closer to hospitals. Therefore, it is highly unlikely that mortality is reported at the precinct of the hospital rather than the precinct of the crime. It is worth noting that the number of crimes, police precincts, and hospitals all tend to be higher in high-density areas, and that hospitals do not necessarily lie in the precinct to which they are closest. Given the questions raised and the findings, however, a micro-level examination of those factors is unwarranted in this analysis.
The events that inspired this analysis were tragedies. Trying to explain what happened is not easy, if it is possible at all. At least now some possibilities can be ruled out. Yet regardless of what transpired and why, I would like to add this note extending condolences and respect to the families of the individuals involved.
http://nypost.com/2015/04/06/daily-blotter-485/
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm
http://www.statstutor.ac.uk/resources/uploaded/paired-t-test.pdf
http://www.statisticssolutions.com/manova-analysis-paired-sample-t-test/
http://stattrek.com/hypothesis-test/paired-means.aspx?Tutorial=AP
http://stattrek.com/statistics/correlation.aspx
http://www.nyc.gov/html/oignypd/pages/about/foil-request.shtml
http://ypdcrime.com/penal.law/article125.htm#p125.27
http://www.nycourts.gov/judges/cji/2-PenalLaw/125/art125hp.shtml
https://www1.nyc.gov/site/doh/about/ogc-foil.page
https://a816-healthpsi.nyc.gov/epiquery/VS/index.html
http://apps.who.int/classifications/apps/icd/icd10online2005/fr-icd.htm?gx85.htm+
http://apps.who.int/classifications/apps/icd/icd10online2005/fr-icd.htm?gy85.htm+
https://en.wikipedia.org/wiki/ICD-10_Chapter_XX:_External_causes_of_morbidity_and_mortality
https://a816-healthpsi.nyc.gov/epiquery/CHS/uhf-zip-information.pdf
https://www1.nyc.gov/assets/doh/downloads/pdf/ah/zipcodetable.pdf
http://www.health.ny.gov/regulations/hcra/provider/provhosp.htm
https://en.wikipedia.org/wiki/Acute_care
http://www.nyc.gov/html/nypd/html/home/precincts.shtml
https://developers.google.com/maps/documentation/geocoding