Load in the relevant packages.
library(TeachingDemos)
library(dplyr)
library(readr)
library(stringr)
library(ggplot2)
library(lubridate)
library(tidyr)
Load in the relevant datasets. For the given code, it will generate a random 80% sample of the total dataset for us to use.
rodent = read.csv("~/Desktop/OIDD245/HW 1/Data/Rodent_Inspection.csv")
myname = "Pitchaya Liewchanpatana"
set.seed(char2seed(myname))
rodent = sample_frac(rodent, .8)
rodent = rodent[, sample(1:ncol(rodent))]
This part addresses the trend of rat sightings in 5 of New York’s boroughs (namely Bronx, Brooklyn, Manhatten, Queens and Staten Island).
Clean the dataset and transform the data into the appropriate time format to be used for graphing.
# cleaning up column names
colnames(rodent) <- tolower(colnames(rodent))
rodent.untouched <- rodent
dim(rodent) # 751939 rows, 20 columns
## [1] 751939 20
# transforming 'inspection date' variable from character to Date class.
head(rodent) # %m/%d/%Y 00:00:00 AM/PM.
rodent$reformat.date<- as.Date(rodent$inspection_date, "%m/%d/%Y")
rodent$year.month<- strftime(rodent$reformat.date, "%Y-%m") # %Y/%m
rodent$year<- strftime(rodent$reformat.date, "%Y") # %Y
rodent$month<- strftime(rodent$reformat.date, "%m") # %m
Group the data by borough and aggregate the rats sighting count.
# use group_by and summarize functions
rodent.borough <- rodent %>%
group_by(borough, year.month) %>%
filter(year >= 2012) %>%
summarize(sightings = sum(result == "Active Rat Signs"))
rodent.borough
Plot the month-year time (x-variable) against the rat sightings count (y-variable) by each of the 5 boroughs.
# make elemental plot and add in additional elements
ggplot(rodent.borough, aes(x = year.month, y = sightings, group = borough, color = borough)) + geom_line() + geom_point() +geom_smooth(method = "lm", se = FALSE) + ggtitle("Rat Sightings in 2012-2016") + xlab("year and month") +ylab("no. of rat sightings") + theme(axis.text.x = element_text(vjust = 0.2, angle = 90, hjust = 1, size = 5 )) + theme(plot.title = element_text(hjust = 0.5))
In general, rat sightings have been steady over the past 5 years for all 5 boroughs as evident from the best-fit-line plotted with their slope almost zero. This means there were no significant increase/decrease in the rat sightings on average over the years. Note that there is a very slight but not significant increase in Manhatten and Queens.
Group the data by months (as opposed to borough in a.) and aggregate the rats sighting count.
rodent.month <- rodent %>%
group_by(month) %>%
summarize(sightings = sum(result == "Active Rat Signs")) %>%
na.omit()
rodent.month
Plot the month time (x-variable) against the rat sightings count (y-variable) by each of the 12 months.
ggplot(rodent.month, aes(x = month, y = sightings, group = 1)) + geom_point() + geom_line() + geom_text(aes(label = sightings), vjust = 2, hjust = 1.1, col = "red", size = 3) + ylim(5600, 10000) + ggtitle("Rat Sightings Per Month") + xlab("month") +ylab("no. of rat sightings") + theme(plot.title = element_text(hjust = 0.5))
Interpretation: There is a seasonal trend in which month 02 (Feb) increases dramatically from 6689 sightings to 9547 sightings in month 03 (Mar). It then drops steadily till month 09 (Sep) which then increases but as not dramatic to 8222 sightings. This might be explained by the seasonal temperature of different months. As Feb-Mar experience some of the coldest seasons of the year, the rats might move into residential homes to hide from the cold and move out as the weather grows warmer during the summer. Hence, the spike in rat sightings at the start of the year and end of the year. (source: https://www.victorpest.com/articles/rodent-season)
Note: efficiency for any given month is computed as the number of inspections yielding “Active Rat Signs” in that month divided by the total number of inspections that take place in that month.
Group the data by borough and aggregate the efficiency counts.
# use group_by and summarize functions
rodent.borough.efficiency <- rodent %>%
group_by(borough, year.month) %>%
filter(year >= 2012) %>%
summarize(total.sightings = n(), rat.sightings = sum(result == "Active Rat Signs")) %>%
mutate(efficiency = rat.sightings/total.sightings)
rodent.borough.efficiency
Plot the year-month time (x-variable) against the efficiency (y-variable) by each of 5 boroughs.
ggplot(rodent.borough.efficiency, aes(x = year.month, y = efficiency, group = borough, colour = borough)) + geom_point() + geom_line() + ggtitle("Rat Sightings Efficiency in 2012-2016") + xlab("year and month") +ylab("ratio of rat sightings effectiveness") + theme(axis.text.x = element_text(vjust = 0.2, angle = 90, hjust = 1, size = 5 )) + theme(plot.title = element_text(hjust = 0.5))
Aggregate the number of inspections by zipcode and find the top 10.
rodent.zipcode <- rodent %>%
group_by(zip_code) %>%
filter(year >= 2012) %>%
summarize(sightings = sum(result == "Active Rat Signs")) %>%
arrange(desc(sightings)) %>%
slice(1:10)
rodent.zipcode
For those who do not know, 311 refers to the hotline residents call for non-emergencies. It is very much like 911 version for non-emergencies. In this part, we will explore how to use the 311 data to spot trend during the Hurricane Sandy (in Oct 29 2012) and find correlations between different types of complaints to Rodent complaints.
sandy = read.csv("~/Desktop/OIDD245/HW 1/Data/sandyrelated.csv")
# cleaning col names.
colnames(sandy) <- tolower(colnames(sandy))
sandy.untouched <- sandy
dim(sandy) # total complaints = 82977
## [1] 82977 52
table(sandy$complaint.type) # 'Rodent' complaints = 390
##
## Adopt-A-Basket
## 2
## Agency Issues
## 15
## Air Quality
## 109
## Animal in a Park
## 22
## APPLIANCE
## 159
## Asbestos
## 29
## Benefit Card Replacement
## 322
## BEST/Site Safety
## 24
## Bike/Roller/Skate Chronic
## 3
## Blocked Driveway
## 1144
## Boilers
## 118
## Bridge Condition
## 7
## Broken Muni Meter
## 314
## Broken Parking Meter
## 18
## Building Condition
## 1
## Building/Use
## 413
## Bus Stop Shelter Placement
## 7
## City Vehicle Placard Complaint
## 4
## Collection Truck Noise
## 6
## Comment
## 1
## Complaint
## 10
## Compliment
## 11
## Construction
## 119
## CONSTRUCTION
## 46
## Consumer Complaint
## 1246
## Cranes and Derricks
## 47
## Curb Condition
## 5
## Damaged Tree
## 25812
## DCA / DOH New License Application Request
## 30
## DCA Literature Request
## 1
## Dead Tree
## 40
## Derelict Bicycle
## 4
## Derelict Vehicle
## 151
## Derelict Vehicles
## 197
## Dirty Conditions
## 353
## Disorderly Youth
## 7
## DOF Literature Request
## 409
## DOF Parking - Tax Exemption
## 10
## DOT Literature Request
## 8
## DPR Literature Request
## 6
## Drinking
## 15
## Drinking Water
## 3
## EAP Inspection - F59
## 14
## ELECTRIC
## 1427
## Electrical
## 36
## Elevator
## 310
## Ferry Complaint
## 5
## Ferry Inquiry
## 33
## Fire Alarm - Addition
## 2
## Fire Alarm - Modification
## 1
## Fire Alarm - New System
## 1
## Fire Alarm - Reinspection
## 5
## Fire Safety Director - F58
## 79
## Food Establishment
## 121
## Food Poisoning
## 50
## For Hire Vehicle Complaint
## 164
## Found Property
## 12
## GENERAL CONSTRUCTION
## 3250
## General Construction/Plumbing
## 1199
## Graffiti
## 262
## Harboring Bees/Wasps
## 2
## Hazardous Materials
## 180
## Hazmat Storage/Use
## 1
## HEATING
## 20237
## Highway Condition
## 24
## Highway Sign - Damaged
## 3
## Highway Sign - Dangling
## 1
## Homeless Encampment
## 35
## Illegal Animal Kept as Pet
## 7
## Illegal Animal Sold
## 1
## Illegal Parking
## 475
## Illegal Tree Damage
## 6
## Indoor Air Quality
## 96
## Indoor Sewage
## 41
## Industrial Waste
## 86
## Investigations and Discipline (IAD)
## 19
## Invitation
## 5
## Lead
## 48
## Litter Basket / Request
## 23
## Maintenance or Facility
## 84
## Misc. Comments
## 7
## Miscellaneous Categories
## 1
## Missed Collection (All Materials)
## 65
## Mold
## 12
## Noise
## 511
## Noise - Commercial
## 418
## Noise - Helicopter
## 11
## Noise - House of Worship
## 9
## Noise - Park
## 12
## Noise - Street/Sidewalk
## 176
## Noise - Vehicle
## 120
## Noise Survey
## 116
## Non-Residential Heat
## 128
## NONCONST
## 1575
## OEM Literature Request
## 12
## Other Enforcement
## 65
## Overflowing Litter Baskets
## 11
## Overgrown Tree/Branches
## 53
## PAINT - PLASTER
## 1884
## Parking Card
## 1
## Plant
## 4
## Plumbing
## 60
## PLUMBING
## 2614
## Posting Advertisement
## 2
## Public Payphone Complaint
## 24
## Public Toilet
## 1
## Radioactive Material
## 1
## Recycling Enforcement
## 12
## Request for Information
## 12
## Rodent
## 390
## Root/Sewer/Sidewalk Condition
## 3
## Sanitation Condition
## 307
## Scaffold Safety
## 44
## School Maintenance
## 47
## Senior Center Complaint
## 2
## Sewer
## 2121
## Sidewalk Condition
## 125
## Smoking
## 26
## Snow
## 25
## Special Enforcement
## 244
## Special Projects Inspection Team (SPIT)
## 125
## Standing Water
## 8
## Street Condition
## 1069
## Street Light Condition
## 3245
## Street Sign - Damaged
## 196
## Street Sign - Dangling
## 114
## Street Sign - Missing
## 103
## Sweeping/Missed-Inadequate
## 3
## Tattooing
## 2
## Taxi Complaint
## 617
## Taxi Compliment
## 25
## Traffic
## 191
## Traffic Signal Condition
## 5036
## Unleashed Dog
## 9
## Unsanitary Animal Facility
## 1
## Unsanitary Animal Pvt Property
## 32
## Unsanitary Pigeon Condition
## 2
## Urinating in Public
## 9
## Vacant Lot
## 21
## Vending
## 32
## Violation of Park Rules
## 19
## Water Conservation
## 53
## Water Quality
## 58
## Water System
## 1162
## X-Ray Machine/Equipment
## 1
# transforming 'created.date' variable from character to Date class.
head(sandy) # %m/%d/%Y 00:00
sandy1 <- as.data.frame(str_split_fixed(sandy$created.date, " ", n = 8)) # %m/%d/%Y
colnames(sandy1)[1] <- "reformat.date"
sandy1$reformat.date<- as.Date(sandy1$reformat.date, "%m/%d/%y") # note to self: y vs Y are different. the small y used when the original data is 12 as the year vs big Y when the original data is 2012.
class(sandy1$reformat.date)
## [1] "Date"
head(sandy1)
# only keep the date column
sandy1$V2 <- NULL
sandy1$V3 <- NULL
sandy1$V4 <- NULL
sandy1$V5 <- NULL
sandy1$V6 <- NULL
sandy1$V7 <- NULL
sandy1$V8 <- NULL
head(sandy1)
sandy.hurricane <- sandy %>%
cbind(sandy1) %>%
group_by(reformat.date, complaint.type) %>%
filter(complaint.type == "Rodent") %>%
summarize(counts = n())
sandy.hurricane
ggplot(sandy.hurricane, aes(x = reformat.date, y = counts)) + geom_point() + geom_line() + ggtitle("Rodent Complaints Before And After Hurricane Sandy") + xlab("Year and Month") + ylab("no. of rodent complaints") + theme(axis.text.x = element_text(vjust = 0.2, angle = 90, hjust = 1, size = 7)) + theme(plot.title = element_text(hjust = 0.5)) + geom_text(aes(label = counts), vjust = 1, hjust = 1.7, col = "red", size = 3) + geom_smooth(method = "lm", se = FALSE, col = "red")
Interpretation: There is not a lot of data from the before period, but based on what you have, the graph that I generated do suggest that Hurricane Sandy led to an increase in the daily numbers of rodent sightings on the island. Hurrican Sandy occured on Oct 29th where before that the number of Rodent complaints were 8. After Oct 29th, it rose to 12 and consistently rising to 37 and then drop and then spike to 39 complaints. This is also seen from the linear regression line where it is consistenly increasing.
For part 3, our main objective is to investigate whether the restaurant violation database, which is created and maintained by a different agency, can be useful for predicting which property inspections are most likely to yield Active Rat Signs. I will estimate using a simple predictive model.
Load in the dataset. Perform simple cleaning of column names. Transform the time variable into Date class.
# Load dataset.
# Let's call the restaurant dataset as rest for simplicatin.
rest= read.csv("~/Desktop/OIDD245/HW 1/Data/DOHMH_New_York_City_Restaurant_Inspection_Results.csv")
# Clean column names and examine relevant columns.
rest
colnames(rest) <- tolower(colnames(rest))
colnames(rest) #relevant columns are 'zipcode', 'violation.code', 'inspection.date'
## [1] "camis" "dba"
## [3] "boro" "building"
## [5] "street" "zipcode"
## [7] "phone" "cuisine.description"
## [9] "inspection.date" "action"
## [11] "violation.code" "violation.description"
## [13] "critical.flag" "score"
## [15] "grade" "grade.date"
## [17] "record.date" "inspection.type"
table(rest$violation.code) # relevant are '04L', '04K', '08A'. It has 3071+ 33320+ 45006 = 81397 datapoints respectively,
##
## 02A 02B 02C 02D 02E 02F 02G 02H 02I 02J 03A
## 8915 418 22364 556 41 8 12 37781 4289 86 3 438
## 03B 03C 03D 03E 03F 03G 04A 04B 04C 04D 04E 04F
## 256 231 33 16 1 18 8429 45 4081 500 312 83
## 04G 04H 04I 04J 04K 04L 04M 04N 04O 05A 05B 05C
## 6 12258 19 5011 3071 33320 11881 20651 365 82 39 58
## 05D 05E 05F 05H 05I 06A 06B 06C 06D 06E 06F 06G
## 4369 146 750 796 1 8807 3011 25483 28795 10986 8902 122
## 06H 06I 07A 08A 08B 08C 09A 09B 09C 10A 10B 10C
## 11 17 80 45006 1906 5217 1997 4121 5547 5091 25456 489
## 10D 10E 10F 10G 10H 10I 10J 10K 15E 15H 15I 15J
## 4149 3254 63609 75 7465 4133 2846 1 6 1 603 294
## 15K 15L 15S 15T 16A 16B 16C 16E 16F 16J 18B 18C
## 216 5245 227 83 647 4681 212 24 12 1 13 19
## 18D 18F 18G 18I 20A 20B 20D 20E 20F 22A 22B 22C
## 120 154 1 1 511 11 1630 358 3164 1398 253 3971
## 22E 22F 22G 99B
## 68 148 47 5
# Transform time variable into Date class.
rest$inspection.date <- as.Date(rest$inspection.date, format = "%m/%d/%Y")
rest$month.year<- strftime(rest$inspection.date, "%m-%Y")
rest
Let’s start following the steps given in the prompt.
# Repeat the same process for the Rodent inspection data. We will use the month-year-zipcode combination as the variable for the outer merge later on.
head(rodent) # %m/%d/%Y 00:00:00 AM/PM.
rodent$reformat.date<- as.Date(rodent$inspection_date, "%m/%d/%Y")
rodent$month.year<- strftime(rodent$reformat.date, "%m-%Y")
colnames(rodent)
## [1] "longitude" "house_number"
## [3] "x_coord" "bbl"
## [5] "job_id" "latitude"
## [7] "location" "job_ticket_or_work_order_id"
## [9] "zip_code" "approved_date"
## [11] "inspection_date" "boro_code"
## [13] "result" "street_name"
## [15] "job_progress" "y_coord"
## [17] "lot" "borough"
## [19] "inspection_type" "block"
## [21] "reformat.date" "year.month"
## [23] "year" "month"
## [25] "month.year"
rodent.rodent <- rodent %>%
unite('month.year.zipcode', c(month.year, zip_code), sep = "-") %>%
group_by(month.year.zipcode, result) %>%
summarize(sightings = n())
rodent.rodent
# Create a new dataframe where we merge the rodent inspection and restaurant inspection dataframe together.
merged.rodent <- merge(rodent.rodent, rest.rodent, by = c("month.year.zipcode"), all = T)
colnames(merged.rodent) = c("month.year.zipcode",
"result",
"rodent.inspection",
"rest.inspection")
merged.rodent
# Convert any missing restaurant violation numbers to zero.
merged.rodent$rest.inspection[is.na(merged.rodent$rest.inspection)] <- 0
merged.rodent$rodent.inspection[is.na(merged.rodent$rodent.inspection)] <- 0
# Add one to this restaurant violation measure and log it.
merged.rodent$rest.inspection <- merged.rodent$rest.inspection+1
merged.rodent$log.rest.inspection <- log(merged.rodent$rest.inspection)
merged.rodent$dummy[merged.rodent$result == "Active Rat Signs"] <- 1
merged.rodent$dummy[!merged.rodent$result == "Active Rat Signs"] <- 0
Note: Include month and year in your regression as factor (dummy) variables. You should include a single dummy variable for each month (Jan, Feb, etc) and for each year, not for each month-year combination. This should leave you with about 15 or 16 dummy variables in your regression output.
merged.rodent$month<- as.numeric(substr(merged.rodent$month.year.zipcode, 1, 2))
## Warning: NAs introduced by coercion
merged.rodent$year<- as.numeric(substr(merged.rodent$month.year.zipcode, 4, 7))
merged.rodent <- filter(merged.rodent, merged.rodent$year >= 2012)
table(merged.rodent$month) # only month 1-12 are present (12 variables)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 2495 2344 2549 2158 2062 2099 2123 2148 2149 2141 2019 2033
table(merged.rodent$year) # only from year 2012-2016 are present (5 variables)
##
## 2012 2013 2014 2015 2016
## 6032 6075 6279 6384 1550
modell = glm("dummy ~ log.rest.inspection + as.factor(year) + as.factor(month)", data = merged.rodent, family=binomial)
# total 12 month dummy variables + 4 year variables against 2012 dummy variable = 16 dummy variables
summary(modell)
##
## Call:
## glm(formula = "dummy ~ log.rest.inspection + as.factor(year) + as.factor(month)",
## family = binomial, data = merged.rodent)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8020 -0.6814 -0.6402 -0.5836 2.0208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.549479 0.061985 -24.998 < 2e-16 ***
## log.rest.inspection 0.174157 0.017374 10.024 < 2e-16 ***
## as.factor(year)2013 -0.214050 0.052027 -4.114 3.89e-05 ***
## as.factor(year)2014 -0.196619 0.052666 -3.733 0.000189 ***
## as.factor(year)2015 -0.218805 0.053169 -4.115 3.87e-05 ***
## as.factor(year)2016 -0.262694 0.083672 -3.140 0.001692 **
## as.factor(month)2 -0.037005 0.074118 -0.499 0.617593
## as.factor(month)3 -0.011625 0.072076 -0.161 0.871869
## as.factor(month)4 -0.028662 0.076677 -0.374 0.708552
## as.factor(month)5 -0.029529 0.077705 -0.380 0.703936
## as.factor(month)6 -0.064259 0.077726 -0.827 0.408389
## as.factor(month)7 0.008316 0.076750 0.108 0.913720
## as.factor(month)8 -0.046491 0.076964 -0.604 0.545805
## as.factor(month)9 -0.057438 0.077010 -0.746 0.455756
## as.factor(month)10 -0.134397 0.077918 -1.725 0.084554 .
## as.factor(month)11 -0.095516 0.078946 -1.210 0.226323
## as.factor(month)12 -0.095719 0.078758 -1.215 0.224230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25374 on 25746 degrees of freedom
## Residual deviance: 25267 on 25730 degrees of freedom
## (573 observations deleted due to missingness)
## AIC: 25301
##
## Number of Fisher Scoring iterations: 4
# install.packages("margins")
library(margins)
margins_modell = margins(modell)
plot(margins_modell)
This evidence do indicate that there is a statistically significant relationship between the restaurant inspection (log.rest.inspection) based measures and whether a rodent inspection yields Active Rat Signs (dummy). We can see this from the Z-value which is 10.024 and is more than the critical value Z = 1.645 at 0.05 siginificance level. The p value is also less than 0.05 which is correct.
We can also see a relationship between where there is a Active Rat inspection result with a particular month and year. Every month and year except June would have a siginifcant negative effect on the Active Rat inspection result.
NYC has locations where residents can drop off their food to be composted. Since food scraps easily attract rats and rodents as food waste are one of their food sources, I want to see if the food scrap drop-off location contributes to the location and number of rodents being spotted.
Food Scrap Drop-Off Locations in NYC (Map): https://data.cityofnewyork.us/Environment/Food-Scrap-Drop-Off-Locations-in-NYC-Map-/qfn2-4jea Food Scrap Drop-Off Locations in NYC: https://data.cityofnewyork.us/Environment/Food-Scrap-Drop-Off-Locations-in-NYC/if26-z6xq
These two links are from the same datasource. The first one is visualized onto an interactive map while the second one is the raw data source itself. The variables I would use to join with the ‘rodent’ and ‘rest’ dataset are 1) borough eg. Bronx, Manhattan and 2) zipcode eg. 10010, 10011.
First, I will use outer merge to join the food dropoff location dataset together with either the rodent or restaurant dataset that contain rodent sightings. The merging variable would be either the ‘borough’ or ‘zipcode’. Next, I will use group_by and summarize to count the number of food dropoff locations in each borough/zipcode.
From here onwards, I can plot a chart for the number of food-drop locations (x-axis) against the number of rodent sightings (y-axis) and add a linear regression line to see if there’s a relationship. Similarly to Part 1, I can graph 5 lines for each of the 5 boroughs. To check the results, I can perform a logistic regression model to check if there’s a significant relationship between the number of food-drop locations and the number of rodent sightings in each borough/zipcode.