Data Context:

An interesting application of big data in the public sector is for optimizing the use of constrained government resources.

A recent example of this from NYC was a Department of Health initiative to target restaurant inspections by data mining Yelp reviews to find which ones mentioned roaches. More generally, government agencies and corporations are realizing that bottom-up, crowd-sourced data collection is a good way to gather some types of important data that were previously difficult or too expensive to collect accurately.

A more recent scourge on cities has been RATS. Surprisingly, NYC is NOT the most rat-infested city in the US. Perhaps you can guess which one is. For this exercise, however, we will use data from NYC, simply because the city collects particularly good data through its open data initiatives. In NYC, rodents have commanded attention from government offices, the press, and the community (fueled in part by viral videos like pizza rat and, more recently, chubby sewer rat). The resurgence of rats in New York has been driven by many forces, including population growth, construction, and trash pickup problems; some would even point to global warming, and many argue that a major redistribution of rats occurred as a result of Hurricane Sandy, which hit New York City on October 29th, 2012. (In Philadelphia, some have argued that gentrification-related rebuilding is having a similar effect, displacing rats around the city.)

Like other vermin, rats are difficult to detect, and they vastly outnumber the rodent inspectors and resources the city has available for monitoring. In fact, some estimates have put the number of rats in New York City at about one per person (i.e., slightly over 8 million), although these estimates are controversial. Nevertheless, there is little debate that rats remain a significant problem for residents of the five boroughs. As in the Yelp-roach restaurant inspection example above, the goal of this assignment is to uncover some temporal patterns in rat problems in the city, and to use prediction-based techniques to investigate whether the city’s restaurant inspection activities (which are separate from its rodent inspection activities) can be useful for targeting rodent inspections. Other major cities are also using crowd-sourcing efforts to fight this menace.

In addition to the NYC rodent inspection data, we will use the NYC restaurant health inspection data and data on NYC 311 calls, which are a fascinating source of information on New York City. All of these stakeholders–rodent inspectors, restaurant inspectors, and citizens–can be valuable sources of information about which parts of the city have rat problems. Note that these data sources are not “independent”: many of the inspections in the rodent inspection database are initially undertaken due to 311 complaints.

Data sources

These data are all available through NYC Open Data.

Code Proper

Part 0. File setup:


I will first start by loading the relevant packages.

library(TeachingDemos)
library(dplyr)
library(readr)
library(ggplot2)
library(gridExtra)
library(ggalt)
library(scales)
#Load the library
library("ggmap")
#Set your API Key (the actual key is redacted here; substitute your own)
register_google(key = "YOUR_API_KEY")
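As a safer alternative (a sketch; the GOOGLE_MAPS_API_KEY variable name is my own choice, set in .Renviron, not a ggmap default), the key can be kept out of the source file entirely and read from an environment variable:

#read the API key from an environment variable instead of hard-coding it
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))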

Now I will load the first dataset and take a random 80% sample of its rows.

rod.inspection = read_csv("C:/Users/fzy20/Documents/Penn/OIDD 245/Rats/Rodent_Inspection.csv")
myname = "David"
set.seed(char2seed(myname))
rod.inspection = sample_frac(rod.inspection, .8)  #keep a random 80% of the rows
rod.inspection = rod.inspection[, sample(1:ncol(rod.inspection))]  #shuffle the column order

Part I. Descriptive Statistics and Figures


I aim to answer these questions in this part of my analysis:

  1. How have rat sightings (inspections yielding Active Rat Signs) changed from month to month over the most recent five years available in the data set?

  2. In general, have rat sightings been increasing, decreasing, or holding steady over this five-year window?

  3. Are there seasonal trends in rat sightings? Are there more rat sightings in some months than others?

  4. Generate a similar plot illustrating “efficiency” of rat inspections where efficiency for any given month is computed as the number of inspections yielding “Active Rat Signs” in that month divided by the total number of inspections that take place in that month.

  5. Generate a list of the top ten zip codes with the largest number of inspections yielding active rat signs.

Before proceeding to the graphical analysis needed to answer these questions, I will first clean the rodent inspection dataset.

# Extracting Date Information
rod.inspection$INSPECTION_DAT_TIME <- sapply(strsplit(rod.inspection$INSPECTION_DATE," "),"[",2)
rod.inspection$INSPECTION_DAT_DATE <- sapply(strsplit(rod.inspection$INSPECTION_DATE," "),"[",1)
rod.inspection$INSPECTION_DAT_DATE <- as.Date(rod.inspection$INSPECTION_DAT_DATE,format='%m/%d/%Y')
#Splitting date information into month and year
rod.inspection$INSPECTION_DAT_DATE_MONTH <- format(rod.inspection$INSPECTION_DAT_DATE,"%m")
rod.inspection$INSPECTION_DAT_DATE_YEAR <- format(rod.inspection$INSPECTION_DAT_DATE,"%Y")

Now, I will transform the dataset in a way that makes our later analysis easier.

#Grouping the data by borough and aggregating rats sighting count
BOROUGH_TIMELINE_DF <- rod.inspection %>% group_by(BOROUGH,INSPECTION_DAT_DATE_MONTH,INSPECTION_DAT_DATE_YEAR) %>% filter(INSPECTION_DAT_DATE_YEAR >= 2011) %>% summarize(sightings = sum(RESULT == "Active Rat Signs"))
#engineer the year-month variable
BOROUGH_TIMELINE_DF$YEAR_MONTH <-paste(as.character(BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_YEAR), "." ,as.character(BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_MONTH),sep = "")
BOROUGH_TIMELINE_DF$YEAR_MONTH <- as.double(BOROUGH_TIMELINE_DF$YEAR_MONTH) 
BOROUGH_TIMELINE_DF <- BOROUGH_TIMELINE_DF[order(BOROUGH_TIMELINE_DF$YEAR_MONTH),]
BOROUGH_TIMELINE_DF$YEAR_MONTH <- as.factor(BOROUGH_TIMELINE_DF$YEAR_MONTH)
BOROUGH_TIMELINE_DF[is.na(BOROUGH_TIMELINE_DF)] <- 0
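One caveat on this numeric year.month key: it sorts correctly only because %m zero-pads the month, and factor labels like “2011.1” (October) can read oddly on axes. A sturdier alternative, as a sketch (the MONTH_DATE column name is my own), is to build a real Date for the first of each month:

#build a proper Date (first of the month), which sorts and labels cleanly
BOROUGH_TIMELINE_DF$MONTH_DATE <- as.Date(paste(BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_YEAR, BOROUGH_TIMELINE_DF$INSPECTION_DAT_DATE_MONTH, "01", sep = "-"))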

To answer whether rat sightings have been increasing and whether there are seasonal trends, I have created three charts: one divided by borough, one showing an aggregated view of active rat signs across all five boroughs, and one aggregated by month.

Rat sightings per borough over 5 years

ggplot(BOROUGH_TIMELINE_DF, aes(x = YEAR_MONTH, y = sightings, group = BOROUGH, color = BOROUGH)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6,angle=90)) + geom_smooth(method = "lm",se = FALSE)

overall rat sightings over the past 5 years

#data aggregation not separated by boroughs
BOROUGH_SINGLE_TIMELINE_DF <- BOROUGH_TIMELINE_DF %>% group_by(YEAR_MONTH) %>% summarize(sightings = sum(sightings))
#plotting
ggplot(BOROUGH_SINGLE_TIMELINE_DF, aes(x = YEAR_MONTH, y = sightings, group = 1)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6,angle=90)) + geom_smooth(method = "lm")

In general, rat sightings seem quite steady over the five-year window; in the Bronx, however, there appears to be an increase in sightings, possibly related to population growth. A quick check of the per-borough trends is sketched below.
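To put a rough number on the per-borough trends, here is a sketch that fits a simple linear trend of monthly sightings within each borough (the slope is the change in sightings per month; trend_df is a name of my choosing):

#fit sightings against a monthly time index within each borough and extract the slope
trend_df <- BOROUGH_TIMELINE_DF %>% group_by(BOROUGH) %>% summarize(slope = coef(lm(sightings ~ seq_along(sightings)))[2])
trend_df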

rat sightings by month

#data aggregation not separated by boroughs, by months
BOROUGH_MONTHS_DF <- BOROUGH_TIMELINE_DF %>% group_by(INSPECTION_DAT_DATE_MONTH) %>% summarize(sightings = sum(sightings))
#plot
ggplot(BOROUGH_MONTHS_DF, aes(x = INSPECTION_DAT_DATE_MONTH, y = sightings, group = 1)) + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 10,angle=90)) + geom_line() + geom_rect(aes(xmin = "03", xmax = "05",ymin = -Inf, ymax = Inf), alpha = 0.04)

As for seasonality, it is quite evident that March, April, and May have significantly more rat sighting incidents than the other months.
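Ranking the months by total sightings (a quick sketch using the aggregated data frame built above) backs up this seasonal claim:

#rank months by total sightings, highest first
BOROUGH_MONTHS_DF[order(BOROUGH_MONTHS_DF$sightings, decreasing = TRUE),]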

Now, to look at the efficiency of rat inspections, I have made two graphs: one divided by borough and the other aggregated across all boroughs.

inspection efficiency by borough

#data grouping and engineering the efficiency variable
BOROUGH_EFFICIENCY_DF <- rod.inspection %>% group_by(BOROUGH,INSPECTION_DAT_DATE_MONTH,INSPECTION_DAT_DATE_YEAR) %>% filter(INSPECTION_DAT_DATE_YEAR >= 2011) %>% summarize(efficiency = sum(RESULT == "Active Rat Signs")/n())
#reconstructing the year.month variable
BOROUGH_EFFICIENCY_DF$YEAR_MONTH <-paste(as.character(BOROUGH_EFFICIENCY_DF$INSPECTION_DAT_DATE_YEAR), "." ,as.character(BOROUGH_EFFICIENCY_DF$INSPECTION_DAT_DATE_MONTH),sep = "")
BOROUGH_EFFICIENCY_DF$YEAR_MONTH <- as.double(BOROUGH_EFFICIENCY_DF$YEAR_MONTH) 
BOROUGH_EFFICIENCY_DF <- BOROUGH_EFFICIENCY_DF[order(BOROUGH_EFFICIENCY_DF$YEAR_MONTH),]
BOROUGH_EFFICIENCY_DF$YEAR_MONTH <- as.factor(BOROUGH_EFFICIENCY_DF$YEAR_MONTH)
BOROUGH_EFFICIENCY_DF[is.na(BOROUGH_EFFICIENCY_DF)] <- 0
#plotting
ggplot(BOROUGH_EFFICIENCY_DF, aes(x = YEAR_MONTH, y = efficiency, group = BOROUGH, color = BOROUGH)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6,angle=90))

inspection efficiency aggregate

#aggregate across boroughs by averaging the per-borough rates (summing ratios would not yield a valid efficiency)
BOROUGH_SINGLE_EFFICIENCY_DF <- BOROUGH_EFFICIENCY_DF %>% group_by(YEAR_MONTH) %>% summarize(efficiency = mean(efficiency))
#plotting
ggplot(BOROUGH_SINGLE_EFFICIENCY_DF, aes(x = YEAR_MONTH, y = efficiency, group = 1)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6,angle=90)) + geom_smooth(method = "lm")

Efficiency is evidently increasing across all boroughs. However, is that really because we are getting better at finding rats, or simply because there are more rats? One way to probe this is shown below.
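A quick sketch of the denominator, total inspections per month (INSPECTION_VOLUME_DF is a name of my choosing): if inspection counts are falling while sightings hold steady, measured efficiency rises without any real improvement in targeting.

#total inspections per month, to contextualize the efficiency trend
INSPECTION_VOLUME_DF <- rod.inspection %>% filter(INSPECTION_DAT_DATE_YEAR >= 2011) %>% group_by(INSPECTION_DAT_DATE_YEAR, INSPECTION_DAT_DATE_MONTH) %>% summarize(inspections = n())
INSPECTION_VOLUME_DF$YEAR_MONTH <- paste(INSPECTION_VOLUME_DF$INSPECTION_DAT_DATE_YEAR, INSPECTION_VOLUME_DF$INSPECTION_DAT_DATE_MONTH, sep = ".")
ggplot(INSPECTION_VOLUME_DF, aes(x = YEAR_MONTH, y = inspections, group = 1)) + geom_line() + geom_point() + theme(axis.text.x = element_text(vjust = 0.2, size = 6, angle = 90))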

Moving on to tackling the zip codes!

Top Zip Code List

#data grouping by zipcode
ZIP_CODE_DF <- rod.inspection %>% group_by(ZIP_CODE) %>% filter(INSPECTION_DAT_DATE_YEAR >= 2011) %>% summarize(sightings = sum(RESULT == "Active Rat Signs"))
TOP_ZIP_CODE_DF <- ZIP_CODE_DF[order(ZIP_CODE_DF$sightings,decreasing = TRUE),][1:10,]
#constructing valid zip code variable to merge
library(zipcode)
data(zipcode)
zipcode$zip <- as.integer(zipcode$zip)
TOP_ZIP_CODE_DF <- merge(zipcode, TOP_ZIP_CODE_DF, by.x = "zip", by.y = "ZIP_CODE")
#Top ten Zip codes
TOP_ZIP_CODE_DF
##      zip     city state latitude longitude sightings
## 1  10009 New York    NY 40.72709 -73.97864      1567
## 2  10452    Bronx    NY 40.83875 -73.92234      2085
## 3  10453    Bronx    NY 40.85302 -73.91214      2439
## 4  10456    Bronx    NY 40.82968 -73.90856      3328
## 5  10457    Bronx    NY 40.84674 -73.89861      4649
## 6  10458    Bronx    NY 40.86417 -73.88881      4388
## 7  10467    Bronx    NY 40.87226 -73.86937      1755
## 8  10468    Bronx    NY 40.86711 -73.89916      2747
## 9  11221 Brooklyn    NY 40.69123 -73.92637      2629
## 10 11237 Brooklyn    NY 40.70336 -73.91993      1981

I have also mapped out these zip codes!

Top Zipcode Map

NYC= get_map("New York", zoom=10)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=New%20York&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=New+York&key=xxx
ggmap(NYC) + geom_point(aes(x = longitude, y = latitude, size = sightings/10), color = "red", alpha = 0.5, data = TOP_ZIP_CODE_DF) + theme(legend.position="bottom")

The Bronx seems like a particularly bad place for rats; Brooklyn comes in close but is slightly better in terms of the pest situation.

PART 2. The 311 data


Moving on to the next part, here I aim to do the following:

  1. Create a chart of phone calls about rat/vermin sightings in the weeks before and after Hurricane Sandy (Oct 29, 2012), and look for a possible causal relationship between calamities and rat sightings.

  2. Explore the idea of using other “proxy” indicators as predictors of rodent problems. Specifically, across all zip codes, which two types of other complaints are most highly correlated with rodent sightings?

I will examine the 311 rodent reports to understand the effect of Hurricane Sandy.

#data reading
sandy_311_df_og <- read_csv("C:/Users/fzy20/Documents/Penn/OIDD 245/Rats/sandyrelated.csv")
sandy_311_df <- sandy_311_df_og
#date extraction
sandy_311_df$`Created Time` <- sapply(strsplit(sandy_311_df$`Created Date`," "),"[",2)
sandy_311_df$`Created date` <- sapply(strsplit(sandy_311_df$`Created Date`," "),"[",1)
sandy_311_df$`Created_date` <- as.Date(sandy_311_df$`Created date`, format = '%m/%d/%y')
#data filtering
sandy_311_df <- sandy_311_df[sandy_311_df$`Complaint Type` == "Rodent",]
sandy_311_agg_df <- sandy_311_df %>% group_by(Created_date) %>% summarize(count = n())
#average daily call counts just before and after the storm
val1 <- mean(sandy_311_agg_df$count[1:2])
val2 <- mean(sandy_311_agg_df$count[2:15])

I first want to examine whether rat sightings increased after the hurricane.

line graph before and after hurricane

#plotting result
ggplot(sandy_311_agg_df, aes(x = Created_date, y = count)) + geom_point() + geom_segment(aes(x = as.Date("10/28/2012", format = '%m/%d/%Y'), y = val1, xend = as.Date("10/29/2012", format = '%m/%d/%Y'), yend = val1), color="red", linetype="dashed", size = 1.5) + geom_segment(aes(x = as.Date("10/29/2012", format = '%m/%d/%Y'), y = val2, xend = as.Date("11/11/2012", format = '%m/%d/%Y'), yend = val2), color="red", linetype="dashed", size = 1.5) + geom_line() + geom_vline(xintercept = as.Date("10/29/2012", format = '%m/%d/%Y'), linetype="dotted", color = "blue", size=1.5)

Rat sightings quite clearly increased after the hurricane, though the increase was gradual rather than immediate; a rough significance check is sketched below.
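As a rough check that the jump is more than day-to-day noise, a two-sample t-test can compare daily call counts before and after October 29th (a sketch; it treats days as independent, which is a simplification):

#compare average daily rodent calls before vs. after the storm
cutoff <- as.Date("10/29/2012", format = "%m/%d/%Y")
pre <- sandy_311_agg_df$count[sandy_311_agg_df$Created_date < cutoff]
post <- sandy_311_agg_df$count[sandy_311_agg_df$Created_date >= cutoff]
t.test(pre, post)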
To analyze other complaint types as proxies for rodent sightings, I will next extract the correlations of rodent reports with the other most common complaint types.

Correlation Matrix

#data aggregation on complaint type and zipcode
sandy_311_df <- sandy_311_df_og
sandy_311_df$Complaint_Type <- sandy_311_df$`Complaint Type`
sandy_311_complaint_type_df <- sandy_311_df %>% group_by(Complaint_Type) %>% summarize(count = n())
sandy_311_complaint_type_top_df <- sandy_311_complaint_type_df[order(sandy_311_complaint_type_df$count, decreasing = TRUE),][1:15,]
big_with_sight_df <- sandy_311_df %>% group_by(`Incident Zip`, `Complaint Type`) %>% summarize(sightings = n()) 
top_with_sight_df <- big_with_sight_df %>% filter(`Complaint Type` %in% append(sandy_311_complaint_type_top_df$Complaint_Type, "Rodent"))
#zip code incident dataframe construction
library(reshape2)
transposed_incident_sighting_df <-dcast(top_with_sight_df, `Incident Zip` ~ `Complaint Type`, value.var="sightings")
transposed_incident_sighting_df[is.na(transposed_incident_sighting_df)] = 0
#correlation plot
library(corrplot)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(cor(transposed_incident_sighting_df[,2:16]), type = 'upper', method = "color", order="hclust", tl.col="black",tl.srt=45, tl.cex = 1,  addCoef.col = "black",number.cex = 1)

From the correlation matrix above, we can see that NONCONST, PLUMBING, HEATING, GENERAL CONSTRUCTION, PAINT - PLASTER, and ELECTRIC reports are all highly correlated with rodent reports, suggesting that rats are likely to turn up around construction areas. Some other “interesting” insights include how all the construction-related complaint types are highly correlated with one another, and how plumbing is, surprisingly, quite correlated with traffic complaints.

Correlation Coefficient List

rodent_cor_df <- as.data.frame(cor(transposed_incident_sighting_df[,2:16])[,"Rodent"])
rodent_cor_df$Variables <- rownames(rodent_cor_df)
colnames(rodent_cor_df) <- c("correlation", "Variables")
rodent_cor_df[order(rodent_cor_df$correlation,decreasing = TRUE),][2:15,]
##                               correlation                     Variables
## NONCONST                       0.68376218                      NONCONST
## PLUMBING                       0.62223256                      PLUMBING
## HEATING                        0.60929215                       HEATING
## GENERAL CONSTRUCTION           0.60178168          GENERAL CONSTRUCTION
## PAINT - PLASTER                0.56999270               PAINT - PLASTER
## ELECTRIC                       0.40313951                      ELECTRIC
## Street Condition               0.22066711              Street Condition
## Blocked Driveway               0.20969805              Blocked Driveway
## General Construction/Plumbing  0.19438780 General Construction/Plumbing
## Sewer                          0.10522515                         Sewer
## Consumer Complaint             0.10448663            Consumer Complaint
## Traffic Signal Condition      -0.03759359      Traffic Signal Condition
## Street Light Condition        -0.04472930        Street Light Condition
## Damaged Tree                  -0.09858071                  Damaged Tree

This table simply provides an easier-to-read view of the correlations with rodent reports.

PART 3. Rodent and Restaurant Inspection.

Now we will look at another dataset, with the goal of constructing a model to predict rat sightings. Specifically, I want to investigate whether the restaurant violation database, which is created and maintained by a different agency, can be useful for predicting which property inspections are most likely to yield Active Rat Signs. This results in a simple predictive model of the following form:

ActiveRatSigns = β0 + β1 · log(RestaurantRatViolations + 1) + month + year + ϵ

#data reading
rest_inspec_DF <- read_csv("C:/Users/fzy20/Documents/Penn/OIDD 245/Rats/dohmh.csv")

To do this, I will use both the restaurant violation and rodent inspection databases and construct a new variable, RestaurantRatViolations, which counts the rodent-related restaurant violations in a given zip code, year, and month (the relevant violation codes for a restaurant rat violation are 04L, 04K, and 08A). The dependent variable, ActiveRatSigns, will be a binary measure created from the rodent inspection database: 1 or 0 depending on whether an inspection yielded Active Rat Signs. I will also include month and year in the regression as dummy variables, to keep the estimated relationship between restaurant violations and rat sightings as clean as possible.

#date extraction
rest_inspec_DF$`INSPECTION DATE` <- as.Date(rest_inspec_DF$`INSPECTION DATE`, format = "%m/%d/%Y")
rest_inspec_DF$INSPECTION_DAT_DATE_MONTH <- format(rest_inspec_DF$`INSPECTION DATE`,"%m")
rest_inspec_DF$INSPECTION_DAT_DATE_YEAR <- format(rest_inspec_DF$`INSPECTION DATE`,"%Y")
#data filtering
count_per_combo_DF <- rest_inspec_DF %>% filter(`VIOLATION CODE` %in% c("04L","04K","08A")) %>% group_by(INSPECTION_DAT_DATE_MONTH, INSPECTION_DAT_DATE_YEAR, ZIPCODE) %>% summarize(rest_log = log(n()+1))
merged_df <- merge(rod.inspection, count_per_combo_DF, by.x = c("INSPECTION_DAT_DATE_YEAR","INSPECTION_DAT_DATE_MONTH", "ZIP_CODE"), by.y=c("INSPECTION_DAT_DATE_YEAR","INSPECTION_DAT_DATE_MONTH","ZIPCODE"),all.x = TRUE)
merged_df[is.na(merged_df$rest_log),"rest_log"] = 0
#target variable construction
merged_df$target_val <- as.integer(merged_df$RESULT == "Active Rat Signs")
#converting categorical variables
merged_df = merged_df[merged_df$INSPECTION_DAT_DATE_YEAR >= 2011,]
merged_df$INSPECTION_DAT_DATE_YEAR <- as.factor(merged_df$INSPECTION_DAT_DATE_YEAR)
merged_df$INSPECTION_DAT_DATE_MONTH <- as.factor(merged_df$INSPECTION_DAT_DATE_MONTH)

Model Construction and summary:

#model construction
summary(glm("target_val ~ rest_log + INSPECTION_DAT_DATE_YEAR + INSPECTION_DAT_DATE_MONTH", data = merged_df, family=binomial))
## 
## Call:
## glm(formula = "target_val ~ rest_log + INSPECTION_DAT_DATE_YEAR + INSPECTION_DAT_DATE_MONTH", 
##     family = binomial, data = merged_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6128  -0.5198  -0.4956  -0.4571   2.2485  
## 
## Coefficients:
##                               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                  -1.929375   0.017979 -107.312  < 2e-16 ***
## rest_log                      0.073344   0.005397   13.590  < 2e-16 ***
## INSPECTION_DAT_DATE_YEAR2012 -0.238356   0.013715  -17.380  < 2e-16 ***
## INSPECTION_DAT_DATE_YEAR2013 -0.062181   0.017654   -3.522 0.000428 ***
## INSPECTION_DAT_DATE_YEAR2014 -0.091465   0.018138   -5.043 4.59e-07 ***
## INSPECTION_DAT_DATE_YEAR2015 -0.029082   0.018601   -1.563 0.117935    
## INSPECTION_DAT_DATE_YEAR2016  0.049270   0.025912    1.901 0.057242 .  
## INSPECTION_DAT_DATE_MONTH02  -0.176610   0.020595   -8.575  < 2e-16 ***
## INSPECTION_DAT_DATE_MONTH03  -0.094459   0.019308   -4.892 9.97e-07 ***
## INSPECTION_DAT_DATE_MONTH04  -0.235894   0.020169  -11.696  < 2e-16 ***
## INSPECTION_DAT_DATE_MONTH05  -0.187639   0.020524   -9.142  < 2e-16 ***
## INSPECTION_DAT_DATE_MONTH06  -0.098212   0.020868   -4.706 2.52e-06 ***
## INSPECTION_DAT_DATE_MONTH07  -0.007209   0.020718   -0.348 0.727855    
## INSPECTION_DAT_DATE_MONTH08  -0.063688   0.020998   -3.033 0.002421 ** 
## INSPECTION_DAT_DATE_MONTH09  -0.150461   0.021311   -7.060 1.66e-12 ***
## INSPECTION_DAT_DATE_MONTH10  -0.041501   0.021013   -1.975 0.048266 *  
## INSPECTION_DAT_DATE_MONTH11  -0.138429   0.022319   -6.202 5.56e-10 ***
## INSPECTION_DAT_DATE_MONTH12  -0.276888   0.022232  -12.454  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 444177  on 614034  degrees of freedom
## Residual deviance: 442619  on 614017  degrees of freedom
##   (4 observations deleted due to missingness)
## AIC: 442655
## 
## Number of Fisher Scoring iterations: 4

The restaurant violation variable is statistically significant and has a positive effect on rat sightings. Since this is a logistic regression, the interpretation is that each one-unit increase in the log of restaurant rat violations in a zip code is associated with roughly a 7.6% increase in the odds that an inspection yields Active Rat Signs (exp(0.0733) ≈ 1.076).
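For completeness, the odds ratio can be recovered by storing the fitted model and exponentiating the rest_log coefficient (a sketch; rat_model is a name of my choosing):

#refit the model, then convert the rest_log coefficient to an odds ratio
rat_model <- glm(target_val ~ rest_log + INSPECTION_DAT_DATE_YEAR + INSPECTION_DAT_DATE_MONTH, data = merged_df, family = binomial)
exp(coef(rat_model)["rest_log"]) #approximately 1.076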

graphical supplement

merged_df <- merged_df[!(is.na(merged_df$target_val)),]
ggplot(merged_df, aes(x = as.factor(target_val), y = rest_log)) +
  geom_boxplot(colour = "grey50")

Part 4: Recommendation

Here is a possible question that could be investigated further:

Overview/Question:

Does the presence of rodents correlate with the level of poverty in an area? Can area poverty be a significant factor in predicting rodent sightings?

Analysis:

Compute the correlation between zip-code-level poverty and rodent sightings, then run a logistic regression like the one in Part 3 with the poverty level of each zip code as an added variable.
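A minimal sketch of that regression, assuming a hypothetical file poverty_by_zip.csv with columns ZIP and poverty_rate (e.g., derived from Census/ACS data; neither the file nor the column names come from this assignment's datasets):

#merge a hypothetical zip-level poverty measure into the Part 3 data
poverty_df <- read_csv("poverty_by_zip.csv")
merged_poverty_df <- merge(merged_df, poverty_df, by.x = "ZIP_CODE", by.y = "ZIP", all.x = TRUE)
#re-run the Part 3 logistic regression with poverty_rate added
summary(glm(target_val ~ rest_log + poverty_rate + INSPECTION_DAT_DATE_YEAR + INSPECTION_DAT_DATE_MONTH, data = merged_poverty_df, family = binomial))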