Part 0: File Setup

Load in the relevant packages.

library(TeachingDemos)
library(dplyr)
library(readr)
library(stringr)
library(ggplot2)
library(lubridate)
library(tidyr)

Load in the relevant datasets. For the given code, it will generate a random 80% sample of the total dataset for us to use.

rodent = read.csv("~/Desktop/OIDD245/HW 1/Data/Rodent_Inspection.csv")

myname = "Pitchaya Liewchanpatana"
set.seed(char2seed(myname))
rodent = sample_frac(rodent, .8)
rodent = rodent[, sample(1:ncol(rodent))]

Part 1: Descriptive Statistics and Figures.

This part addresses the trend of rat sightings in 5 of New York’s boroughs (namely Bronx, Brooklyn, Manhatten, Queens and Staten Island).

a. how rat sightings (inspections yielding Active Rat Signs) have been changing from month to month over the most recent five years that are available in the data set

Clean the dataset and transform the data into the appropriate time format to be used for graphing.

# cleaning up column names
colnames(rodent) <- tolower(colnames(rodent))
rodent.untouched <- rodent
dim(rodent) # 751939 rows, 20 columns

## [1] 751939     20

# transforming 'inspection date' variable from character to Date class. 
head(rodent) # %m/%d/%Y 00:00:00 AM/PM.

rodent$reformat.date<- as.Date(rodent$inspection_date, "%m/%d/%Y")
rodent$year.month<- strftime(rodent$reformat.date, "%Y-%m") # %Y/%m
rodent$year<- strftime(rodent$reformat.date, "%Y") # %Y 
rodent$month<- strftime(rodent$reformat.date, "%m") # %m

Group the data by borough and aggregate the rats sighting count.

# use group_by and summarize functions
rodent.borough <- rodent %>% 
  group_by(borough, year.month) %>% 
  filter(year >= 2012) %>% 
  summarize(sightings = sum(result == "Active Rat Signs"))

rodent.borough

Plot the month-year time (x-variable) against the rat sightings count (y-variable) by each of the 5 boroughs.

# make elemental plot and add in additional elements
ggplot(rodent.borough, aes(x = year.month, y = sightings, group = borough, color = borough)) + geom_line() + geom_point() +geom_smooth(method = "lm", se = FALSE) + ggtitle("Rat Sightings in 2012-2016") + xlab("year and month") +ylab("no. of rat sightings") + theme(axis.text.x = element_text(vjust = 0.2, angle = 90, hjust = 1, size = 5 )) + theme(plot.title = element_text(hjust = 0.5))

b. In general, have rat sightings been increasing, decreasing, or remained steady over this five year window?

In general, rat sightings have been steady over the past 5 years for all 5 boroughs as evident from the best-fit-line plotted with their slope almost zero. This means there were no significant increase/decrease in the rat sightings on average over the years. Note that there is a very slight but not significant increase in Manhatten and Queens.

c. Are there seasonal trends in rat sightings? Are there more rat sightings in some months than others?

Group the data by months (as opposed to borough in a.) and aggregate the rats sighting count.

rodent.month <- rodent %>% 
  group_by(month) %>% 
  summarize(sightings = sum(result == "Active Rat Signs")) %>% 
  na.omit()

rodent.month

Plot the month time (x-variable) against the rat sightings count (y-variable) by each of the 12 months.

ggplot(rodent.month, aes(x = month, y = sightings, group = 1)) + geom_point() + geom_line() + geom_text(aes(label = sightings), vjust = 2, hjust = 1.1, col = "red", size = 3) + ylim(5600, 10000) + ggtitle("Rat Sightings Per Month") + xlab("month") +ylab("no. of rat sightings") + theme(plot.title = element_text(hjust = 0.5))

Interpretation: There is a seasonal trend in which month 02 (Feb) increases dramatically from 6689 sightings to 9547 sightings in month 03 (Mar). It then drops steadily till month 09 (Sep) which then increases but as not dramatic to 8222 sightings. This might be explained by the seasonal temperature of different months. As Feb-Mar experience some of the coldest seasons of the year, the rats might move into residential homes to hide from the cold and move out as the weather grows warmer during the summer. Hence, the spike in rat sightings at the start of the year and end of the year. (source: https://www.victorpest.com/articles/rodent-season)

d. Generate a similar plot illustrating the efficiency of rat inspections.

Note: efficiency for any given month is computed as the number of inspections yielding “Active Rat Signs” in that month divided by the total number of inspections that take place in that month.

Group the data by borough and aggregate the efficiency counts.

# use group_by and summarize functions
rodent.borough.efficiency <- rodent %>% 
  group_by(borough, year.month) %>% 
  filter(year >= 2012) %>% 
  summarize(total.sightings = n(), rat.sightings = sum(result == "Active Rat Signs")) %>% 
  mutate(efficiency = rat.sightings/total.sightings)

rodent.borough.efficiency

Plot the year-month time (x-variable) against the efficiency (y-variable) by each of 5 boroughs.

ggplot(rodent.borough.efficiency, aes(x = year.month, y = efficiency, group = borough, colour = borough)) + geom_point() + geom_line() + ggtitle("Rat Sightings Efficiency in 2012-2016") + xlab("year and month") +ylab("ratio of rat sightings effectiveness") + theme(axis.text.x = element_text(vjust = 0.2, angle = 90, hjust = 1, size = 5 )) + theme(plot.title = element_text(hjust = 0.5))

e. Generate a list of the top ten zip codes with the largest number of inspections yielding active rat signs.

Aggregate the number of inspections by zipcode and find the top 10.

rodent.zipcode <- rodent %>% 
  group_by(zip_code) %>% 
  filter(year >= 2012) %>% 
  summarize(sightings = sum(result == "Active Rat Signs")) %>%
  arrange(desc(sightings)) %>% 
  slice(1:10)

rodent.zipcode

Part 2: 311 Data

For those who do not know, 311 refers to the hotline residents call for non-emergencies. It is very much like 911 version for non-emergencies. In this part, we will explore how to use the 311 data to spot trend during the Hurricane Sandy (in Oct 29 2012) and find correlations between different types of complaints to Rodent complaints.

a.1 Load the dataset and clean the column names. Transform the character variable ‘created.date’ into the time format. Combine the date column into the original file.

sandy = read.csv("~/Desktop/OIDD245/HW 1/Data/sandyrelated.csv")

# cleaning col names. 
colnames(sandy) <- tolower(colnames(sandy))
sandy.untouched <- sandy
dim(sandy) # total complaints = 82977

## [1] 82977    52

table(sandy$complaint.type) # 'Rodent' complaints = 390

## 
##                            Adopt-A-Basket 
##                                         2 
##                             Agency Issues 
##                                        15 
##                               Air Quality 
##                                       109 
##                          Animal in a Park 
##                                        22 
##                                 APPLIANCE 
##                                       159 
##                                  Asbestos 
##                                        29 
##                  Benefit Card Replacement 
##                                       322 
##                          BEST/Site Safety 
##                                        24 
##                 Bike/Roller/Skate Chronic 
##                                         3 
##                          Blocked Driveway 
##                                      1144 
##                                   Boilers 
##                                       118 
##                          Bridge Condition 
##                                         7 
##                         Broken Muni Meter 
##                                       314 
##                      Broken Parking Meter 
##                                        18 
##                        Building Condition 
##                                         1 
##                              Building/Use 
##                                       413 
##                Bus Stop Shelter Placement 
##                                         7 
##            City Vehicle Placard Complaint 
##                                         4 
##                    Collection Truck Noise 
##                                         6 
##                                   Comment 
##                                         1 
##                                 Complaint 
##                                        10 
##                                Compliment 
##                                        11 
##                              Construction 
##                                       119 
##                              CONSTRUCTION 
##                                        46 
##                        Consumer Complaint 
##                                      1246 
##                       Cranes and Derricks 
##                                        47 
##                            Curb Condition 
##                                         5 
##                              Damaged Tree 
##                                     25812 
## DCA / DOH New License Application Request 
##                                        30 
##                    DCA Literature Request 
##                                         1 
##                                 Dead Tree 
##                                        40 
##                          Derelict Bicycle 
##                                         4 
##                          Derelict Vehicle 
##                                       151 
##                         Derelict Vehicles 
##                                       197 
##                          Dirty Conditions 
##                                       353 
##                          Disorderly Youth 
##                                         7 
##                    DOF Literature Request 
##                                       409 
##               DOF Parking - Tax Exemption 
##                                        10 
##                    DOT Literature Request 
##                                         8 
##                    DPR Literature Request 
##                                         6 
##                                  Drinking 
##                                        15 
##                            Drinking Water 
##                                         3 
##                      EAP Inspection - F59 
##                                        14 
##                                  ELECTRIC 
##                                      1427 
##                                Electrical 
##                                        36 
##                                  Elevator 
##                                       310 
##                           Ferry Complaint 
##                                         5 
##                             Ferry Inquiry 
##                                        33 
##                     Fire Alarm - Addition 
##                                         2 
##                 Fire Alarm - Modification 
##                                         1 
##                   Fire Alarm - New System 
##                                         1 
##                 Fire Alarm - Reinspection 
##                                         5 
##                Fire Safety Director - F58 
##                                        79 
##                        Food Establishment 
##                                       121 
##                            Food Poisoning 
##                                        50 
##                For Hire Vehicle Complaint 
##                                       164 
##                            Found Property 
##                                        12 
##                      GENERAL CONSTRUCTION 
##                                      3250 
##             General Construction/Plumbing 
##                                      1199 
##                                  Graffiti 
##                                       262 
##                      Harboring Bees/Wasps 
##                                         2 
##                       Hazardous Materials 
##                                       180 
##                        Hazmat Storage/Use 
##                                         1 
##                                   HEATING 
##                                     20237 
##                         Highway Condition 
##                                        24 
##                    Highway Sign - Damaged 
##                                         3 
##                   Highway Sign - Dangling 
##                                         1 
##                       Homeless Encampment 
##                                        35 
##                Illegal Animal Kept as Pet 
##                                         7 
##                       Illegal Animal Sold 
##                                         1 
##                           Illegal Parking 
##                                       475 
##                       Illegal Tree Damage 
##                                         6 
##                        Indoor Air Quality 
##                                        96 
##                             Indoor Sewage 
##                                        41 
##                          Industrial Waste 
##                                        86 
##       Investigations and Discipline (IAD) 
##                                        19 
##                                Invitation 
##                                         5 
##                                      Lead 
##                                        48 
##                   Litter Basket / Request 
##                                        23 
##                   Maintenance or Facility 
##                                        84 
##                            Misc. Comments 
##                                         7 
##                  Miscellaneous Categories 
##                                         1 
##         Missed Collection (All Materials) 
##                                        65 
##                                      Mold 
##                                        12 
##                                     Noise 
##                                       511 
##                        Noise - Commercial 
##                                       418 
##                        Noise - Helicopter 
##                                        11 
##                  Noise - House of Worship 
##                                         9 
##                              Noise - Park 
##                                        12 
##                   Noise - Street/Sidewalk 
##                                       176 
##                           Noise - Vehicle 
##                                       120 
##                              Noise Survey 
##                                       116 
##                      Non-Residential Heat 
##                                       128 
##                                  NONCONST 
##                                      1575 
##                    OEM Literature Request 
##                                        12 
##                         Other Enforcement 
##                                        65 
##                Overflowing Litter Baskets 
##                                        11 
##                   Overgrown Tree/Branches 
##                                        53 
##                           PAINT - PLASTER 
##                                      1884 
##                              Parking Card 
##                                         1 
##                                     Plant 
##                                         4 
##                                  Plumbing 
##                                        60 
##                                  PLUMBING 
##                                      2614 
##                     Posting Advertisement 
##                                         2 
##                 Public Payphone Complaint 
##                                        24 
##                             Public Toilet 
##                                         1 
##                      Radioactive Material 
##                                         1 
##                     Recycling Enforcement 
##                                        12 
##                   Request for Information 
##                                        12 
##                                    Rodent 
##                                       390 
##             Root/Sewer/Sidewalk Condition 
##                                         3 
##                      Sanitation Condition 
##                                       307 
##                           Scaffold Safety 
##                                        44 
##                        School Maintenance 
##                                        47 
##                   Senior Center Complaint 
##                                         2 
##                                     Sewer 
##                                      2121 
##                        Sidewalk Condition 
##                                       125 
##                                   Smoking 
##                                        26 
##                                      Snow 
##                                        25 
##                       Special Enforcement 
##                                       244 
##   Special Projects Inspection Team (SPIT) 
##                                       125 
##                            Standing Water 
##                                         8 
##                          Street Condition 
##                                      1069 
##                    Street Light Condition 
##                                      3245 
##                     Street Sign - Damaged 
##                                       196 
##                    Street Sign - Dangling 
##                                       114 
##                     Street Sign - Missing 
##                                       103 
##                Sweeping/Missed-Inadequate 
##                                         3 
##                                 Tattooing 
##                                         2 
##                            Taxi Complaint 
##                                       617 
##                           Taxi Compliment 
##                                        25 
##                                   Traffic 
##                                       191 
##                  Traffic Signal Condition 
##                                      5036 
##                             Unleashed Dog 
##                                         9 
##                Unsanitary Animal Facility 
##                                         1 
##            Unsanitary Animal Pvt Property 
##                                        32 
##               Unsanitary Pigeon Condition 
##                                         2 
##                       Urinating in Public 
##                                         9 
##                                Vacant Lot 
##                                        21 
##                                   Vending 
##                                        32 
##                   Violation of Park Rules 
##                                        19 
##                        Water Conservation 
##                                        53 
##                             Water Quality 
##                                        58 
##                              Water System 
##                                      1162 
##                   X-Ray Machine/Equipment 
##                                         1

# transforming 'created.date' variable from character to Date class. 
head(sandy) # %m/%d/%Y 00:00

sandy1 <- as.data.frame(str_split_fixed(sandy$created.date, " ", n = 8)) # %m/%d/%Y 
colnames(sandy1)[1] <- "reformat.date"
sandy1$reformat.date<- as.Date(sandy1$reformat.date, "%m/%d/%y") # note to self: y vs Y are different. the small y used when the original data is 12 as the year vs big Y when the original data is 2012. 
class(sandy1$reformat.date)

## [1] "Date"

head(sandy1)

# only keep the date column
sandy1$V2 <- NULL
sandy1$V3 <- NULL
sandy1$V4 <- NULL
sandy1$V5 <- NULL
sandy1$V6 <- NULL
sandy1$V7 <- NULL
sandy1$V8 <- NULL
head(sandy1)

sandy.hurricane <- sandy %>% 
  cbind(sandy1) %>% 
  group_by(reformat.date, complaint.type) %>% 
  filter(complaint.type == "Rodent") %>% 
  summarize(counts = n())

sandy.hurricane

a.2 Create a chart of phone calls about rat/vermin sightings in the weeks before and after Hurricane Sandy (Oct 29, 2012). To identify these, use the ‘Complaint Type’ tag “Rodent”. There is not a lot of data from the before period, but based on what you have, do these data suggest that Hurricane Sandy led to an increase in the daily numbers of rodent sightings on the island?

ggplot(sandy.hurricane, aes(x = reformat.date, y = counts)) + geom_point() + geom_line() + ggtitle("Rodent Complaints Before And After Hurricane Sandy") + xlab("Year and Month") + ylab("no. of rodent complaints") + theme(axis.text.x = element_text(vjust = 0.2, angle = 90, hjust = 1, size = 7)) + theme(plot.title = element_text(hjust = 0.5)) + geom_text(aes(label = counts), vjust = 1, hjust = 1.7, col = "red", size = 3) + geom_smooth(method = "lm", se = FALSE, col = "red")

Interpretation: There is not a lot of data from the before period, but based on what you have, the graph that I generated do suggest that Hurricane Sandy led to an increase in the daily numbers of rodent sightings on the island. Hurrican Sandy occured on Oct 29th where before that the number of Rodent complaints were 8. After Oct 29th, it rose to 12 and consistently rising to 37 and then drop and then spike to 39 complaints. This is also seen from the linear regression line where it is consistenly increasing.

b. Using the 311 data, report a table of correlations between Rodent complaints and the top fifteen types of non-Rodent related complaint types, where the correlations of interest are between the counts of complaints of each type across the different zip codes in the data set (and where top fifteen is according to total counts of complaint types in the 311 data set). Across all zip codes, what two types of other complaints are most highly correlated with Rodent sightings?

# Step1: Aggregate sandy dataset by complaint type. 
sandy.complaint <- sandy %>% 
  group_by(complaint.type) %>% 
  summarize(count = n()) %>%
  arrange(desc(count)) %>% 
  slice(1:15) %>% 
  add_row(complaint.type = "Rodent", count = 390)

sandy.complaint

# Step 2: Create a dataframe of counts & complaint.type. Clean the rows with empty zipcodes. 
sandy.zipcode <- sandy %>% 
  group_by(incident.zip, complaint.type) %>% 
  summarize(counts = n()) 
sandy.zipcode$incident.zip[sandy.zipcode$incident.zip ==""] <- NA
sandy.zipcode <- na.omit(sandy.zipcode)

sandy.zipcode

# Merge Step 1 and Step 2 together
sandy.merged <- merge(sandy.complaint, sandy.zipcode, by = "complaint.type", all.x= T)
sandy.merged$count <- NULL

sandy.merged

# Step 3: Create 'spread' out version of the previous dataframe 
sandy.spread <- sandy.merged %>% 
  spread(key = complaint.type, value = counts) 
colnames(sandy.spread)

##  [1] "incident.zip"                  "Blocked Driveway"             
##  [3] "Consumer Complaint"            "Damaged Tree"                 
##  [5] "ELECTRIC"                      "GENERAL CONSTRUCTION"         
##  [7] "General Construction/Plumbing" "HEATING"                      
##  [9] "NONCONST"                      "PAINT - PLASTER"              
## [11] "PLUMBING"                      "Rodent"                       
## [13] "Sewer"                         "Street Condition"             
## [15] "Street Light Condition"        "Traffic Signal Condition"     
## [17] "Water System"

# Step 4: Remove zipcode column and replace all empty values with 0
sandy.spread$incident.zip <- NULL
sandy.spread[is.na(sandy.spread)] <- 0
colnames(sandy.spread) <- tolower(colnames(sandy.spread))

sandy.spread

Instead of manually finding every correlation of a complaint to a Rodent complaint, we can construct a correlation graph as taught in class.

# Plot using corrplot() package.
library(corrplot)

## corrplot 0.84 loaded

# Used https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html for instructions on what to do into corrplot. 
corrplot(cor(sandy.spread), type = 'upper', method = "circle", order="FPC", tl.col="black",tl.srt=30, tl.cex = 0.5,  addCoef.col = "black", number.cex = 0.5)

From the graph above, we can draw many insights about the Rodent’s complaint relationship with other complaint types. The complaint types with the highest correlation to rodent are, in descending order of correlation, are: 1. nonconst (0.68), 2. plumbing (0.62), 3. heating (0.61), 4. general construction (0.6), 5. paint-plaster (0.57)

Hence, to answer the question about which two other complaint types are most highly correlated with Rodent sightings, it is nonconst and plumbing.

Part 3: Rodent and Restaurant Inspection

For part 3, our main objective is to investigate whether the restaurant violation database, which is created and maintained by a different agency, can be useful for predicting which property inspections are most likely to yield Active Rat Signs. I will estimate using a simple predictive model.

Load in the dataset. Perform simple cleaning of column names. Transform the time variable into Date class.

# Load dataset. 
# Let's call the restaurant dataset as rest for simplicatin.
rest= read.csv("~/Desktop/OIDD245/HW 1/Data/DOHMH_New_York_City_Restaurant_Inspection_Results.csv")

# Clean column names and examine relevant columns. 
rest

colnames(rest) <- tolower(colnames(rest))
colnames(rest) #relevant columns are 'zipcode', 'violation.code', 'inspection.date'

##  [1] "camis"                 "dba"                  
##  [3] "boro"                  "building"             
##  [5] "street"                "zipcode"              
##  [7] "phone"                 "cuisine.description"  
##  [9] "inspection.date"       "action"               
## [11] "violation.code"        "violation.description"
## [13] "critical.flag"         "score"                
## [15] "grade"                 "grade.date"           
## [17] "record.date"           "inspection.type"

table(rest$violation.code) # relevant are '04L', '04K', '08A'. It has 3071+ 33320+ 45006 = 81397 datapoints respectively,

## 
##         02A   02B   02C   02D   02E   02F   02G   02H   02I   02J   03A 
##  8915   418 22364   556    41     8    12 37781  4289    86     3   438 
##   03B   03C   03D   03E   03F   03G   04A   04B   04C   04D   04E   04F 
##   256   231    33    16     1    18  8429    45  4081   500   312    83 
##   04G   04H   04I   04J   04K   04L   04M   04N   04O   05A   05B   05C 
##     6 12258    19  5011  3071 33320 11881 20651   365    82    39    58 
##   05D   05E   05F   05H   05I   06A   06B   06C   06D   06E   06F   06G 
##  4369   146   750   796     1  8807  3011 25483 28795 10986  8902   122 
##   06H   06I   07A   08A   08B   08C   09A   09B   09C   10A   10B   10C 
##    11    17    80 45006  1906  5217  1997  4121  5547  5091 25456   489 
##   10D   10E   10F   10G   10H   10I   10J   10K   15E   15H   15I   15J 
##  4149  3254 63609    75  7465  4133  2846     1     6     1   603   294 
##   15K   15L   15S   15T   16A   16B   16C   16E   16F   16J   18B   18C 
##   216  5245   227    83   647  4681   212    24    12     1    13    19 
##   18D   18F   18G   18I   20A   20B   20D   20E   20F   22A   22B   22C 
##   120   154     1     1   511    11  1630   358  3164  1398   253  3971 
##   22E   22F   22G   99B 
##    68   148    47     5

# Transform time variable into Date class. 
rest$inspection.date <- as.Date(rest$inspection.date, format = "%m/%d/%Y")
rest$month.year<- strftime(rest$inspection.date, "%m-%Y")

rest

Let’s start following the steps given in the prompt.

step 1: From the restaurant inspection database, generate a data set which, for each month-year-zip code combination, contains the total number of rodent-related inspection violations.

rest.rodent <- rest %>% 
  filter(violation.code == "04L" | 
         violation.code == "04K" |
         violation.code == "08A") %>% 
  unite('month.year.zipcode', c(month.year, zipcode), sep = "-") %>% 
  group_by(month.year.zipcode) %>% 
  summarize(count = n())

sum(rest.rodent$count) #81397 counts, correct

## [1] 81397

rest.rodent

step 2: join the measure with the rodent inspection data so that there is a row for each rodent inspection that also contains a measure of the number of restaurant violations for the zip code, month, and year in which the inspection occurred.

# Repeat the same process for the Rodent inspection data. We will use the month-year-zipcode combination as the variable for the outer merge later on. 

head(rodent) # %m/%d/%Y 00:00:00 AM/PM.

rodent$reformat.date<- as.Date(rodent$inspection_date, "%m/%d/%Y")
rodent$month.year<- strftime(rodent$reformat.date, "%m-%Y") 

colnames(rodent)

##  [1] "longitude"                   "house_number"               
##  [3] "x_coord"                     "bbl"                        
##  [5] "job_id"                      "latitude"                   
##  [7] "location"                    "job_ticket_or_work_order_id"
##  [9] "zip_code"                    "approved_date"              
## [11] "inspection_date"             "boro_code"                  
## [13] "result"                      "street_name"                
## [15] "job_progress"                "y_coord"                    
## [17] "lot"                         "borough"                    
## [19] "inspection_type"             "block"                      
## [21] "reformat.date"               "year.month"                 
## [23] "year"                        "month"                      
## [25] "month.year"

rodent.rodent <- rodent %>% 
  unite('month.year.zipcode', c(month.year, zip_code), sep = "-") %>% 
  group_by(month.year.zipcode, result) %>% 
  summarize(sightings = n())
rodent.rodent

# Create a new dataframe where we merge the rodent inspection and restaurant inspection dataframe together. 

merged.rodent <- merge(rodent.rodent, rest.rodent, by = c("month.year.zipcode"), all = T)
colnames(merged.rodent) = c("month.year.zipcode", 
                            "result",
                            "rodent.inspection", 
                            "rest.inspection")
merged.rodent

# Convert any missing restaurant violation numbers to zero.

merged.rodent$rest.inspection[is.na(merged.rodent$rest.inspection)] <- 0
merged.rodent$rodent.inspection[is.na(merged.rodent$rodent.inspection)] <- 0

# Add one to this restaurant violation measure and log it.

merged.rodent$rest.inspection <- merged.rodent$rest.inspection+1
merged.rodent$log.rest.inspection <- log(merged.rodent$rest.inspection)

step 3: create a binary variable that takes the value 1 or 0 depending on whether an inspection yields Active Rat Signs.

merged.rodent$dummy[merged.rodent$result == "Active Rat Signs"] <- 1
merged.rodent$dummy[!merged.rodent$result == "Active Rat Signs"] <- 0

step 4.1: Run a logistic regression on the new data set to test whether the estimated coefficient on the restaurant sightings variable has a statistically significant relationship with whether or not an inspection yields Active Rat Signs and–if it is significant–whether the relationship is positive or negative.

Note: Include month and year in your regression as factor (dummy) variables. You should include a single dummy variable for each month (Jan, Feb, etc) and for each year, not for each month-year combination. This should leave you with about 15 or 16 dummy variables in your regression output.

merged.rodent$month<- as.numeric(substr(merged.rodent$month.year.zipcode, 1, 2))

## Warning: NAs introduced by coercion

merged.rodent$year<- as.numeric(substr(merged.rodent$month.year.zipcode, 4, 7))
merged.rodent <- filter(merged.rodent, merged.rodent$year >= 2012)

table(merged.rodent$month) # only month 1-12 are present (12 variables)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
## 2495 2344 2549 2158 2062 2099 2123 2148 2149 2141 2019 2033

table(merged.rodent$year) # only from year 2012-2016 are present (5 variables)

## 
## 2012 2013 2014 2015 2016 
## 6032 6075 6279 6384 1550

modell = glm("dummy ~ log.rest.inspection + as.factor(year) + as.factor(month)", data = merged.rodent, family=binomial) 
# total 12 month dummy variables + 4 year variables against 2012 dummy variable = 16 dummy variables
summary(modell)

## 
## Call:
## glm(formula = "dummy ~ log.rest.inspection + as.factor(year) + as.factor(month)", 
##     family = binomial, data = merged.rodent)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8020  -0.6814  -0.6402  -0.5836   2.0208  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.549479   0.061985 -24.998  < 2e-16 ***
## log.rest.inspection  0.174157   0.017374  10.024  < 2e-16 ***
## as.factor(year)2013 -0.214050   0.052027  -4.114 3.89e-05 ***
## as.factor(year)2014 -0.196619   0.052666  -3.733 0.000189 ***
## as.factor(year)2015 -0.218805   0.053169  -4.115 3.87e-05 ***
## as.factor(year)2016 -0.262694   0.083672  -3.140 0.001692 ** 
## as.factor(month)2   -0.037005   0.074118  -0.499 0.617593    
## as.factor(month)3   -0.011625   0.072076  -0.161 0.871869    
## as.factor(month)4   -0.028662   0.076677  -0.374 0.708552    
## as.factor(month)5   -0.029529   0.077705  -0.380 0.703936    
## as.factor(month)6   -0.064259   0.077726  -0.827 0.408389    
## as.factor(month)7    0.008316   0.076750   0.108 0.913720    
## as.factor(month)8   -0.046491   0.076964  -0.604 0.545805    
## as.factor(month)9   -0.057438   0.077010  -0.746 0.455756    
## as.factor(month)10  -0.134397   0.077918  -1.725 0.084554 .  
## as.factor(month)11  -0.095516   0.078946  -1.210 0.226323    
## as.factor(month)12  -0.095719   0.078758  -1.215 0.224230    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25374  on 25746  degrees of freedom
## Residual deviance: 25267  on 25730  degrees of freedom
##   (573 observations deleted due to missingness)
## AIC: 25301
## 
## Number of Fisher Scoring iterations: 4

step 4.2: Does this evidence indicate that there is a statistically significant relationship between the restaurant inspection based measures and whether a rodent inspection yields Active Rat Signs? Show the regression statistics and justify your answer.

# install.packages("margins")
library(margins)

margins_modell = margins(modell)
plot(margins_modell)

This evidence do indicate that there is a statistically significant relationship between the restaurant inspection (log.rest.inspection) based measures and whether a rodent inspection yields Active Rat Signs (dummy). We can see this from the Z-value which is 10.024 and is more than the critical value Z = 1.645 at 0.05 siginificance level. The p value is also less than 0.05 which is correct.

We can also see a relationship between where there is a Active Rat inspection result with a particular month and year. Every month and year except June would have a siginifcant negative effect on the Active Rat inspection result.

PART 4. Beyond inspections

Provide one example of a story you might be able to tell by connecting the NYC rats data with one or more data sources that we did not use here.

a. An overview of the question you might like to answer in your data analysis – no more than one or two sentences.

NYC has locations where residents can drop off their food to be composted. Since food scraps easily attract rats and rodents as food waste are one of their food sources, I want to see if the food scrap drop-off location contributes to the location and number of rodents being spotted.

b. Pointers to a specific data set you would use and the variables you would use to join it with any of the rats data sets described above. This can be a URL link that points to a data set or you can describe how you might collect the data using web scraping or an API.

Food Scrap Drop-Off Locations in NYC (Map): https://data.cityofnewyork.us/Environment/Food-Scrap-Drop-Off-Locations-in-NYC-Map-/qfn2-4jea Food Scrap Drop-Off Locations in NYC: https://data.cityofnewyork.us/Environment/Food-Scrap-Drop-Off-Locations-in-NYC/if26-z6xq

These two links are from the same datasource. The first one is visualized onto an interactive map while the second one is the raw data source itself. The variables I would use to join with the ‘rodent’ and ‘rest’ dataset are 1) borough eg. Bronx, Manhattan and 2) zipcode eg. 10010, 10011.

c. A short description of the type of supporting analysis you might conduct (one or two sentences). This hypothetical analysis can be a chart, a graph, or a regression.

First, I will use outer merge to join the food dropoff location dataset together with either the rodent or restaurant dataset that contain rodent sightings. The merging variable would be either the ‘borough’ or ‘zipcode’. Next, I will use group_by and summarize to count the number of food dropoff locations in each borough/zipcode.

From here onwards, I can plot a chart for the number of food-drop locations (x-axis) against the number of rodent sightings (y-axis) and add a linear regression line to see if there’s a relationship. Similarly to Part 1, I can graph 5 lines for each of the 5 boroughs. To check the results, I can perform a logistic regression model to check if there’s a significant relationship between the number of food-drop locations and the number of rodent sightings in each borough/zipcode.

OIDD245 HW 1 : Rats!

Pitchaya Liewchanpatana

10/28/2019