An interesting application of big data in the public sector is optimizing the use of constrained government resources.
A recent example of this from NYC was a Department of Health initiative to target restaurant inspections by data mining Yelp reviews to find which ones mentioned roaches. More generally, government agencies and corporations are realizing that bottom-up, crowd-sourced data collection is a good way to gather some types of important data that were previously difficult or too expensive to collect accurately.
A more recent scourge on cities has been rats. Surprisingly, NYC is not the most rat-infested city in the US (perhaps you can guess which one is). For this exercise, however, we will use data from NYC, simply because the city collects particularly good data through its open data initiatives. In NYC, rodents have commanded attention from government offices, the press, and the community (fueled in part by viral videos like Pizza Rat and, more recently, the chubby sewer rat). The resurgence of rats in New York has been attributed to many forces, including population growth, construction, trash pickup problems, and even global warming; many would also argue that a major redistribution of rats occurred as a result of Hurricane Sandy, which hit New York City on October 29th, 2012. (In Philadelphia, some have argued that gentrification-related rebuilding is having a similar effect, displacing rats around the city.)
Like other vermin, rats are difficult to detect, and they vastly outnumber the rodent inspectors and resources the city has available for monitoring. In fact, some estimates have put the number of rats in New York City at about one per person (i.e., slightly over 8 million), although these estimates are controversial. Nevertheless, there is little debate that rats remain a significant problem for residents of the five boroughs. As in the Yelp-roach-restaurant inspection example above, the goal of this assignment is to uncover temporal patterns in the city’s rat problems and to use prediction-based techniques to investigate whether the city’s restaurant inspection activities (which are separate from its rodent inspection activities) can be useful for targeting rodent inspections. Other major cities are also using crowd-sourcing efforts to fight this menace.
In addition to the NYC rodent inspection data, we will use data on NYC 311 calls, a fascinating source of information about the city, as well as the NYC restaurant health inspection data. All of these stakeholders (rodent inspectors, restaurant inspectors, and citizens) can be valuable sources of information about which parts of the city have rat problems. Note that these data sources are not “independent”: many of the inspections in the rodent inspection database are initially undertaken due to 311 complaints.
These data are all available through NYC Open Data.
I will start by loading the relevant packages.
library(TeachingDemos)
library(dplyr)
library(readr)
library(ggplot2)
library(gridExtra)
library(ggalt)
library(scales)
#Load the mapping library
library("ggmap")
#Set your API key (read from an environment variable rather than hard-coding a credential)
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))
Now I will load the first dataset and randomly sample 80% of its rows.
#Read the rodent inspection data
rod.inspection = read_csv("C:/Users/fzy20/Documents/Penn/OIDD 245/Rats/Rodent_Inspection.csv")
#Seed the RNG from my name so the sample is reproducible
myname = "David"
set.seed(char2seed(myname))
#Keep a random 80% of the rows, then shuffle the column order
rod.inspection = sample_frac(rod.inspection, .8)
rod.inspection = rod.inspection[, sample(1:ncol(rod.inspection))]
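As a quick sanity check (a minimal sketch using dplyr, which is already loaded), we can confirm the sample size and column types before proceeding:
#Verify the 80% row sample and inspect the (shuffled) columns
nrow(rod.inspection)
glimpse(rod.inspection)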
I aim to answer these questions in this part of my analysis:
How have rat sightings (inspections yielding “Active Rat Signs”) changed from month to month over the most recent five years available in the data set?
In general, have rat sightings been increasing, decreasing, or holding steady over this five-year window?
Are there seasonal trends in rat sightings? Are there more rat sightings in some months than in others?
Generate a similar plot illustrating the “efficiency” of rat inspections, where efficiency for a given month is the number of inspections yielding “Active Rat Signs” divided by the total number of inspections that month.
Generate a list of the top ten zip codes with the largest number of inspections yielding active rat signs.
Before proceeding to the graphical analysis needed to answer these questions, I will first clean the rodent inspection dataset.
# Extracting Date Information
rod.inspection$INSPECTION_DAT_TIME <- sapply(strsplit(rod.inspection$INSPECTION_DATE," "),"[",2)
rod.inspection$INSPECTION_DAT_DATE <- sapply(strsplit(rod.inspection$INSPECTION_DATE," "),"[",1)
rod.inspection$INSPECTION_DAT_DATE <- as.Date(rod.inspection$INSPECTION_DAT_DATE,format='%m/%d/%Y')
#Splitting date information into month and year
rod.inspection$INSPECTION_DAT_DATE_MONTH <- format(rod.inspection$INSPECTION_DAT_DATE,"%m")
rod.inspection$INSPECTION_DAT_DATE_YEAR <- format(rod.inspection$INSPECTION_DAT_DATE,"%Y")
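As an aside, the same extraction can be done with lubridate instead of string splitting; this is a sketch that assumes timestamps of the form “03/24/2016 12:00:00 AM” (I have not verified the exact format in the raw file):
library(lubridate)
#Parse the full timestamp (12-hour clock with AM/PM), then extract month and year
insp_dt <- parse_date_time(rod.inspection$INSPECTION_DATE, orders = "mdY IMS p")
rod.inspection$INSPECTION_DAT_DATE_MONTH <- sprintf("%02d", month(insp_dt))
rod.inspection$INSPECTION_DAT_DATE_YEAR <- as.character(year(insp_dt))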
Now I will reshape the dataset in a way that makes the later analysis easier.
#Grouping the data by borough and month/year, and counting inspections with active rat signs
BOROUGH_TIMELINE_DF <- rod.inspection %>% group_by(BOROUGH,INSPECTION_DAT_DATE_MONTH,INSPECTION_DAT_DATE_YEAR) %>% filter(INSPECTION_DAT_DATE_YEAR >= 2011) %>% summarize(sightings = sum(RESULT == "Active Rat Signs"))
#engineer the year-month variable
BOROUGH_TIMELINE_DF$YEAR_MONTH <-paste(as.character(BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_YEAR), "." ,as.character(BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_MONTH),sep = "")
BOROUGH_TIMELINE_DF$YEAR_MONTH <- as.double(BOROUGH_TIMELINE_DF$YEAR_MONTH)
BOROUGH_TIMELINE_DF <- BOROUGH_TIMELINE_DF[order(BOROUGH_TIMELINE_DF$YEAR_MONTH),]
BOROUGH_TIMELINE_DF$YEAR_MONTH <- as.factor(BOROUGH_TIMELINE_DF$YEAR_MONTH)
BOROUGH_TIMELINE_DF[is.na(BOROUGH_TIMELINE_DF)] <- 0
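A note on the YEAR_MONTH trick above: pasting year and month with a “.” and coercing to double happens to sort correctly, but it is fragile (for instance, 2011.10 prints as 2011.1 once it becomes a factor level). A sturdier alternative, sketched here, is to build a real Date on the first of each month, which sorts chronologically by construction:
#Build a proper Date key (first of the month) instead of a numeric year.month
BOROUGH_TIMELINE_DF$MONTH_DATE <- as.Date(paste(BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_YEAR, BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_MONTH, "01", sep = "-"))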
To answer whether rat sightings have been increasing and whether there are seasonal trends, I have created three charts: one divided by borough, one showing an aggregated view of active rat signs across all five boroughs, and one aggregated by month.
ggplot(BOROUGH_TIMELINE_DF, aes(x = YEAR_MONTH, y = sightings, group = BOROUGH, color = BOROUGH)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6,angle=90)) + geom_smooth(method = "lm",se = FALSE)
#data aggregation not separated by boroughs
BOROUGH_SINGLE_TIMELINE_DF <- BOROUGH_TIMELINE_DF %>% group_by(YEAR_MONTH) %>% summarize(sightings = sum(sightings))
#plotting
ggplot(BOROUGH_SINGLE_TIMELINE_DF, aes(x = YEAR_MONTH, y = sightings, group = 1)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6,angle=90)) + geom_smooth(method = "lm")
In general, rat sightings seem quite steady over the five-year window; in the Bronx, however, there appears to be an increase in sightings, possibly related to population growth.
#data aggregation not separated by boroughs, by months
BOROUGH_MONTHS_DF <- BOROUGH_TIMELINE_DF %>% group_by(INSPECTION_DAT_DATE_MONTH) %>% summarize(sightings = sum(sightings))
#plot
ggplot(BOROUGH_MONTHS_DF, aes(x = INSPECTION_DAT_DATE_MONTH, y = sightings, group = 1)) + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 10,angle=90)) + geom_line() + geom_rect(aes(xmin = "03", xmax = "05",ymin = -Inf, ymax = Inf), alpha = 0.04)
As for seasonality, it is quite evident that March, April, and May (the shaded region in the plot) have significantly higher rat sighting counts than the other months.
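To put the spring spike on firmer footing than eyeballing the plot, here is a quick check (a sketch that treats each borough-month cell in BOROUGH_TIMELINE_DF as one observation):
#Compare mean sightings in March-May against the rest of the year
spring <- BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_MONTH %in% c("03", "04", "05")
t.test(BOROUGH_TIMELINE_DF$sightings[spring], BOROUGH_TIMELINE_DF$sightings[!spring])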
Now, to look at the efficiency of rat inspections, I have made two graphs: one divided by borough and one aggregated across all boroughs.
#data grouping and engineering the efficiency variable
BOROUGH_EFFICIENCY_DF <- rod.inspection %>% group_by(BOROUGH,INSPECTION_DAT_DATE_MONTH,INSPECTION_DAT_DATE_YEAR) %>% filter(INSPECTION_DAT_DATE_YEAR >= 2011) %>% summarize(efficiency = sum(RESULT == "Active Rat Signs")/n())
#reconstructing the year.month variable
BOROUGH_EFFICIENCY_DF$YEAR_MONTH <-paste(as.character(BOROUGH_EFFICIENCY_DF$INSPECTION_DAT_DATE_YEAR), "." ,as.character(BOROUGH_EFFICIENCY_DF$INSPECTION_DAT_DATE_MONTH),sep = "")
BOROUGH_EFFICIENCY_DF$YEAR_MONTH <- as.double(BOROUGH_EFFICIENCY_DF$YEAR_MONTH)
BOROUGH_EFFICIENCY_DF <- BOROUGH_EFFICIENCY_DF[order(BOROUGH_EFFICIENCY_DF$YEAR_MONTH),]
BOROUGH_EFFICIENCY_DF$YEAR_MONTH <- as.factor(BOROUGH_EFFICIENCY_DF$YEAR_MONTH)
BOROUGH_EFFICIENCY_DF[is.na(BOROUGH_EFFICIENCY_DF)] <- 0
#plotting
ggplot(BOROUGH_EFFICIENCY_DF, aes(x = YEAR_MONTH, y = efficiency, group = BOROUGH, color = BOROUGH)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6,angle=90))
#data aggregation across boroughs (unweighted mean of the borough-level efficiencies)
BOROUGH_SINGLE_EFFICIENCY_DF <- BOROUGH_EFFICIENCY_DF %>% group_by(YEAR_MONTH) %>% summarize(efficiency = mean(efficiency))
#plotting
ggplot(BOROUGH_SINGLE_EFFICIENCY_DF, aes(x = YEAR_MONTH, y = efficiency, group = 1)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6,angle=90)) + geom_smooth(method = "lm")
Efficiency is evidently increasing across all boroughs. However, is that really because the city has gotten better at finding rats, or simply because there are more rats to find?
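One caveat about the aggregated plot: averaging the per-borough ratios weights every borough equally, regardless of how many inspections it performs. A sketch of the count-based alternative, which computes citywide efficiency directly as total hits over total inspections per month:
#Citywide efficiency from raw counts: Active Rat Signs / all inspections, per month
CITY_EFFICIENCY_DF <- rod.inspection %>% filter(INSPECTION_DAT_DATE_YEAR >= 2011) %>% group_by(INSPECTION_DAT_DATE_YEAR, INSPECTION_DAT_DATE_MONTH) %>% summarize(efficiency = sum(RESULT == "Active Rat Signs")/n())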
Moving on to tackling the zip codes!
#data grouping by zipcode
ZIP_CODE_DF <- rod.inspection %>% group_by(ZIP_CODE) %>% filter(INSPECTION_DAT_DATE_YEAR >= 2011) %>% summarize(sightings = sum(RESULT == "Active Rat Signs"))
TOP_ZIP_CODE_DF <- ZIP_CODE_DF[order(ZIP_CODE_DF$sightings,decreasing = TRUE),][1:10,]
#constructing valid zip code variable to merge
library(zipcode)
data(zipcode)
zipcode$zip <- as.integer(zipcode$zip)
TOP_ZIP_CODE_DF <- merge(zipcode, TOP_ZIP_CODE_DF, by.x = "zip", by.y = "ZIP_CODE")
#Top ten Zip codes
TOP_ZIP_CODE_DF
## zip city state latitude longitude sightings
## 1 10009 New York NY 40.72709 -73.97864 1567
## 2 10452 Bronx NY 40.83875 -73.92234 2085
## 3 10453 Bronx NY 40.85302 -73.91214 2439
## 4 10456 Bronx NY 40.82968 -73.90856 3328
## 5 10457 Bronx NY 40.84674 -73.89861 4649
## 6 10458 Bronx NY 40.86417 -73.88881 4388
## 7 10467 Bronx NY 40.87226 -73.86937 1755
## 8 10468 Bronx NY 40.86711 -73.89916 2747
## 9 11221 Brooklyn NY 40.69123 -73.92637 2629
## 10 11237 Brooklyn NY 40.70336 -73.91993 1981
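One caveat: merge() re-sorts the result by zip code, so the table above is no longer in sightings order. Restoring the ranking is a one-liner:
#Re-sort so the zip codes with the most active rat signs come first
TOP_ZIP_CODE_DF <- TOP_ZIP_CODE_DF[order(TOP_ZIP_CODE_DF$sightings, decreasing = TRUE),]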
I have also mapped out these zip codes!
NYC= get_map("New York", zoom=10)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=New%20York&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=New+York&key=xxx
ggmap(NYC) + geom_point(aes(x = longitude, y = latitude, size = sightings/10), color = "red", alpha = 0.5, data = TOP_ZIP_CODE_DF) + theme(legend.position="bottom")
The Bronx seems like a particularly bad place for rats; Brooklyn comes close but is slightly better off in terms of the pest situation.
Moving on to the next part, here I aim to do the following:
Create a chart of phone calls about rat/vermin sightings in the weeks before and after Hurricane Sandy (Oct 29, 2012) and assess a possible causal relationship between calamities and rat sightings.
Explore the idea of using other “proxy” indicators as predictors of rodent problems. Specifically, across all zip codes, which two other complaint types are most highly correlated with rodent sightings?
I will examine the 311 rodent reports to understand the effect of Hurricane Sandy.
#data reading
sandy_311_df_og <- read_csv("C:/Users/fzy20/Documents/Penn/OIDD 245/Rats/sandyrelated.csv")
sandy_311_df <- sandy_311_df_og
#date extraction
sandy_311_df$`Created Time` <- sapply(strsplit(sandy_311_df$`Created Date`," "),"[",2)
sandy_311_df$`Created date` <- sapply(strsplit(sandy_311_df$`Created Date`," "),"[",1)
sandy_311_df$`Created_date` <- as.Date(sandy_311_df$`Created date`, format = '%m/%d/%y')
#data filtering
sandy_311_df <- sandy_311_df[sandy_311_df$`Complaint Type` == "Rodent",]
sandy_311_agg_df <- sandy_311_df %>% group_by(Created_date) %>% summarize(count = n())
#average daily rodent calls before (rows 1-2) and after (rows 3-15) the storm
val1 <- mean(sandy_311_agg_df$count[1:2])
val2 <- mean(sandy_311_agg_df$count[3:15])
First, I want to examine whether rat sightings increased after the hurricane.
#plotting result
ggplot(sandy_311_agg_df, aes(x = Created_date, y = count)) + geom_point() + geom_segment(aes(x = as.Date("10/28/2012", format = '%m/%d/%Y'), y = val1, xend = as.Date("10/29/2012", format = '%m/%d/%Y'), yend = val1), color="red", linetype="dashed", size = 1.5) + geom_segment(aes(x = as.Date("10/29/2012", format = '%m/%d/%Y'), y = val2, xend = as.Date("11/11/2012", format = '%m/%d/%Y'), yend = val2), color="red", linetype="dashed", size = 1.5) + geom_line() + geom_vline(xintercept = as.Date("10/29/2012", format = '%m/%d/%Y'), linetype="dotted", color = "blue", size=1.5)
Rat sightings quite clearly increased after the hurricane, though the increase was gradual rather than immediate.
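For a rough formal check of that difference (a sketch; it assumes, as in the val1/val2 computation above, that the first two rows of sandy_311_agg_df are the pre-storm days, which leaves a very small pre-storm sample):
#Welch t-test: did daily rodent-call counts rise after landfall?
pre <- sandy_311_agg_df$count[1:2]
post <- sandy_311_agg_df$count[3:15]
t.test(post, pre, alternative = "greater")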
To analyze other complaint types as proxies for rodent sightings, I will compute the correlation of rodent complaints with the other most common complaint types.
#data aggregation on complaint type and zipcode
sandy_311_df <- sandy_311_df_og
sandy_311_df$Complaint_Type <- sandy_311_df$`Complaint Type`
sandy_311_complaint_type_df <- sandy_311_df %>% group_by(Complaint_Type) %>% summarize(count = n())
sandy_311_complaint_type_top_df <- sandy_311_complaint_type_df[order(sandy_311_complaint_type_df$count, decreasing = TRUE),][1:15,]
big_with_sight_df <- sandy_311_df %>% group_by(`Incident Zip`, `Complaint Type`) %>% summarize(sightings = n())
top_with_sight_df <- big_with_sight_df %>% filter(`Complaint Type` %in% append(sandy_311_complaint_type_top_df$Complaint_Type, "Rodent"))
#zip code incident dataframe construction
library(reshape2)
transposed_incident_sighting_df <-dcast(top_with_sight_df, `Incident Zip` ~ `Complaint Type`, value.var="sightings")
transposed_incident_sighting_df[is.na(transposed_incident_sighting_df)] = 0
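For reference, the same reshaping can be done with tidyr instead of reshape2 (a sketch operating on the same inputs):
library(tidyr)
#Wide format: one row per zip code, one column per complaint type, zeros for gaps
transposed_incident_sighting_df <- top_with_sight_df %>% ungroup() %>% pivot_wider(names_from = `Complaint Type`, values_from = sightings, values_fill = 0)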
#correlation plot
library(corrplot)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor(transposed_incident_sighting_df[,2:16]), type = 'upper', method = "color", col = col(200), order="hclust", tl.col="black",tl.srt=45, tl.cex = 1, addCoef.col = "black",number.cex = 1)
From the above correlation matrix, we can see that NONCONST, PLUMBING, HEATING, GENERAL CONSTRUCTION, PAINT - PLASTER, and ELECTRIC reports are all highly correlated with rodent reports, suggesting that rats are likely to turn up around construction areas. Other “interesting” observations include how strongly the construction-related complaint types correlate with one another, and that plumbing is, surprisingly, quite correlated with traffic complaints.
rodent_cor_df <- as.data.frame(cor(transposed_incident_sighting_df[,2:16])[,"Rodent"])
rodent_cor_df$Variables <- rownames(rodent_cor_df)
colnames(rodent_cor_df) <- c("correlation", "Variables")
rodent_cor_df[order(rodent_cor_df$correlation,decreasing = TRUE),][2:15,]
## correlation Variables
## NONCONST 0.68376218 NONCONST
## PLUMBING 0.62223256 PLUMBING
## HEATING 0.60929215 HEATING
## GENERAL CONSTRUCTION 0.60178168 GENERAL CONSTRUCTION
## PAINT - PLASTER 0.56999270 PAINT - PLASTER
## ELECTRIC 0.40313951 ELECTRIC
## Street Condition 0.22066711 Street Condition
## Blocked Driveway 0.20969805 Blocked Driveway
## General Construction/Plumbing 0.19438780 General Construction/Plumbing
## Sewer 0.10522515 Sewer
## Consumer Complaint 0.10448663 Consumer Complaint
## Traffic Signal Condition -0.03759359 Traffic Signal Condition
## Street Light Condition -0.04472930 Street Light Condition
## Damaged Tree -0.09858071 Damaged Tree
This table gives a clearer view of the correlations with rodent complaints.
Now we will look at another dataset, with the goal of building a model to predict rat sightings. Specifically, I want to investigate whether the restaurant violation database, which is created and maintained by a different agency, can be useful for predicting which property inspections are most likely to yield Active Rat Signs. This leads to a simple predictive model of the following form:
ActiveRatSigns ~ log(RestaurantRatViolations + 1) + month + year + ϵ
#data reading
rest_inspec_DF <- read_csv("C:/Users/fzy20/Documents/Penn/OIDD 245/Rats/dohmh.csv")
To do this, I will use both the restaurant violation and rodent inspection databases and construct a new variable, RestaurantRatViolations: the number of rodent-related restaurant violations in a given zip code, year, and month (the relevant violation codes are 04L, 04K, and 08A). The dependent variable, ActiveRatSigns, is a binary measure created from the rodent inspection database: 1 if an inspection yielded Active Rat Signs, and 0 otherwise. I will also include month and year in the regression as dummy variables, to keep the estimated relationship between restaurant violations and rat sightings as free of seasonal and annual confounds as possible.
#date extraction
rest_inspec_DF$`INSPECTION DATE` <- as.Date(rest_inspec_DF$`INSPECTION DATE`, format = "%m/%d/%Y")
rest_inspec_DF$INSPECTION_DAT_DATE_MONTH <- format(rest_inspec_DF$`INSPECTION DATE`,"%m")
rest_inspec_DF$INSPECTION_DAT_DATE_YEAR <- format(rest_inspec_DF$`INSPECTION DATE`,"%Y")
#data filtering
count_per_combo_DF <- rest_inspec_DF %>% filter(`VIOLATION CODE` %in% c("04L","04K","08A")) %>% group_by(INSPECTION_DAT_DATE_MONTH, INSPECTION_DAT_DATE_YEAR, ZIPCODE) %>% summarize(rest_log = log(n()+1))
merged_df <- merge(rod.inspection, count_per_combo_DF, by.x = c("INSPECTION_DAT_DATE_YEAR","INSPECTION_DAT_DATE_MONTH", "ZIP_CODE"), by.y=c("INSPECTION_DAT_DATE_YEAR","INSPECTION_DAT_DATE_MONTH","ZIPCODE"),all.x = TRUE)
merged_df[is.na(merged_df$rest_log),"rest_log"] = 0
#target variable construction
merged_df$target_val <- as.integer(merged_df$RESULT == "Active Rat Signs")
#converting categorical variables
merged_df = merged_df[merged_df$INSPECTION_DAT_DATE_YEAR >= 2011,]
merged_df$INSPECTION_DAT_DATE_YEAR <- as.factor(merged_df$INSPECTION_DAT_DATE_YEAR)
merged_df$INSPECTION_DAT_DATE_MONTH <- as.factor(merged_df$INSPECTION_DAT_DATE_MONTH)
#model construction
summary(glm("target_val ~ rest_log + INSPECTION_DAT_DATE_YEAR + INSPECTION_DAT_DATE_MONTH", data = merged_df, family=binomial))
##
## Call:
## glm(formula = "target_val ~ rest_log + INSPECTION_DAT_DATE_YEAR + INSPECTION_DAT_DATE_MONTH",
## family = binomial, data = merged_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6128 -0.5198 -0.4956 -0.4571 2.2485
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.929375 0.017979 -107.312 < 2e-16 ***
## rest_log 0.073344 0.005397 13.590 < 2e-16 ***
## INSPECTION_DAT_DATE_YEAR2012 -0.238356 0.013715 -17.380 < 2e-16 ***
## INSPECTION_DAT_DATE_YEAR2013 -0.062181 0.017654 -3.522 0.000428 ***
## INSPECTION_DAT_DATE_YEAR2014 -0.091465 0.018138 -5.043 4.59e-07 ***
## INSPECTION_DAT_DATE_YEAR2015 -0.029082 0.018601 -1.563 0.117935
## INSPECTION_DAT_DATE_YEAR2016 0.049270 0.025912 1.901 0.057242 .
## INSPECTION_DAT_DATE_MONTH02 -0.176610 0.020595 -8.575 < 2e-16 ***
## INSPECTION_DAT_DATE_MONTH03 -0.094459 0.019308 -4.892 9.97e-07 ***
## INSPECTION_DAT_DATE_MONTH04 -0.235894 0.020169 -11.696 < 2e-16 ***
## INSPECTION_DAT_DATE_MONTH05 -0.187639 0.020524 -9.142 < 2e-16 ***
## INSPECTION_DAT_DATE_MONTH06 -0.098212 0.020868 -4.706 2.52e-06 ***
## INSPECTION_DAT_DATE_MONTH07 -0.007209 0.020718 -0.348 0.727855
## INSPECTION_DAT_DATE_MONTH08 -0.063688 0.020998 -3.033 0.002421 **
## INSPECTION_DAT_DATE_MONTH09 -0.150461 0.021311 -7.060 1.66e-12 ***
## INSPECTION_DAT_DATE_MONTH10 -0.041501 0.021013 -1.975 0.048266 *
## INSPECTION_DAT_DATE_MONTH11 -0.138429 0.022319 -6.202 5.56e-10 ***
## INSPECTION_DAT_DATE_MONTH12 -0.276888 0.022232 -12.454 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 444177 on 614034 degrees of freedom
## Residual deviance: 442619 on 614017 degrees of freedom
## (4 observations deleted due to missingness)
## AIC: 442655
##
## Number of Fisher Scoring iterations: 4
The restaurant variable is statistically significant and positively associated with rat sightings. Because this is a logistic regression, the coefficient of 0.073 means that each one-unit increase in log(RestaurantRatViolations + 1) within a zip code multiplies the odds of an inspection yielding Active Rat Signs by about 1.08, i.e., roughly a 7.6% increase in the odds (not the raw sighting rate).
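To make that interpretation concrete, the logistic coefficient acts on the log-odds, so exponentiating it gives the multiplicative effect on the odds:
#Odds ratio for a one-unit increase in log(RestaurantRatViolations + 1)
exp(0.073344)
## [1] 1.076101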
merged_df <- merged_df[!(is.na(merged_df$target_val)),]
#Compare restaurant-violation intensity for inspections with vs. without active rat signs
ggplot(merged_df, aes(x = as.factor(target_val), y = rest_log)) +
geom_boxplot(colour = "grey50")
Here is a possible question that could be investigated further:
Does the presence of rodents correlate with the level of poverty in an area? Can area poverty be a significant factor in predicting rodent sightings?
The data at http://a816-dohbesp.nyc.gov/IndicatorPublic/VisualizationData.aspx?id=103,4466a0,109,Summarize could be used for this, though the portal's geographies would have to be mapped onto zip codes.
One approach: compute the correlation between poverty level and rodent sightings across zip codes, then run a logistic regression like the one in Part 3 with poverty level per zip code as an added predictor (sketched below).
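A minimal sketch of that extension, assuming a hypothetical data frame poverty_df with columns ZIP_CODE and poverty_rate built from the linked portal export (these column names are placeholders, not the portal's actual schema):
#Hypothetical: poverty_df must first be prepared from the EH Data Portal export
merged_pov_df <- merge(merged_df, poverty_df, by = "ZIP_CODE", all.x = TRUE)
summary(glm(target_val ~ rest_log + poverty_rate + INSPECTION_DAT_DATE_YEAR + INSPECTION_DAT_DATE_MONTH, data = merged_pov_df, family = binomial))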