Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is there a relationship with geographical location and failing inspections in New York City.

Cases

What are the cases, and how many are there? Each case represents an inspection report of a tank. There are 31,536 cases in this data set.

Data collection

Describe the method of data collection. The data is collected by the New York City Department of Health and Mental Hygiene. The inspections are self reported by building owners are collected by the city agency.

Type of study

What type of study is this (observational/experiment)? This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

https://data.cityofnewyork.us/Health/Rooftop-Drinking-Water-Tank-Inspection-Results/gjm4-k24g

Dependent Variable

What is the response variable? Is it quantitative or qualitative? The response variable is the quantity of failed inpsection per building with a tank. The variable is quatitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative. The independent variables are zip code and distance to major transportation hub.

Libraries

library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggmap)
library(geosphere)

Importing Data and Cleaning

df <- read_csv("https://raw.githubusercontent.com/engine2031/Data-607/main/Rooftop_Drinking_Water_Tank_Inspection_Results2.csv")

#Reducing Columns to those I thought would be most useful.
df1 <- df %>% select(-BIN, -HOUSE_NUM, -STREET_NAME, -INSPECTION_BY_FIRM, -INSPECTION_DATE)



#Replacing Yes and Nos in column to be 0 and 1.  This will allow me to sum the quantity of failed text per inspection. 
df1$SI_RESULT_SEDIMENT[df1$SI_RESULT_SEDIMENT == "N"] <- 0
df1$SI_RESULT_SEDIMENT[df1$SI_RESULT_SEDIMENT == "A"] <- 1

df1$SI_RESULT_BIOLOGICAL_GROWTH[df1$SI_RESULT_BIOLOGICAL_GROWTH == "N"] <- 0
df1$SI_RESULT_BIOLOGICAL_GROWTH[df1$SI_RESULT_BIOLOGICAL_GROWTH == "A"] <- 1

df1$SI_RESULT_DEBRIS_INSECTS[df1$SI_RESULT_DEBRIS_INSECTS == "N"] <- 0
df1$SI_RESULT_DEBRIS_INSECTS[df1$SI_RESULT_DEBRIS_INSECTS == "A"] <- 1

df1$SI_RESULT_RODENT_BIRD[df1$SI_RESULT_RODENT_BIRD == "N"] <- 0
df1$SI_RESULT_RODENT_BIRD[df1$SI_RESULT_RODENT_BIRD == "A"] <- 1

df1$COLIFORM[df1$COLIFORM == "A"] <- 0
df1$COLIFORM[df1$COLIFORM == "P"] <- 1

df1$ECOLI[df1$ECOLI == "A"] <- 0
df1$ECOLI[df1$ECOLI == "P"] <- 1

#Converting columns from character to numeric type 
df1$SI_RESULT_SEDIMENT <- as.numeric(df1$SI_RESULT_SEDIMENT)
df1$SI_RESULT_BIOLOGICAL_GROWTH <- as.numeric(df1$SI_RESULT_BIOLOGICAL_GROWTH)
df1$SI_RESULT_DEBRIS_INSECTS <- as.numeric(df1$SI_RESULT_DEBRIS_INSECTS)
df1$SI_RESULT_RODENT_BIRD <- as.numeric(df1$SI_RESULT_RODENT_BIRD)
df1$COLIFORM <- as.numeric(df1$COLIFORM)
df1$ECOLI <- as.numeric(df1$ECOLI)

head(df1)
## # A tibble: 6 x 18
##   BOROUGH     ZIP BLOCK   LOT REPORTING_YEAR TANK_NUM SI_RESULT_SEDIMENT
##   <chr>     <dbl> <dbl> <dbl>          <dbl>    <dbl>              <dbl>
## 1 QUEENS    11374  2086    50           2018        1                  0
## 2 MANHATTAN 10128  1523    34           2017        1                  0
## 3 MANHATTAN 10036  1260     7           2019        1                  0
## 4 MANHATTAN 10017  1278     1           2016        3                  0
## 5 MANHATTAN 10022  1292    37           2019        1                  0
## 6 MANHATTAN 10024  1203     1           2017        1                  0
## # ... with 11 more variables: SI_RESULT_BIOLOGICAL_GROWTH <dbl>,
## #   SI_RESULT_DEBRIS_INSECTS <dbl>, SI_RESULT_RODENT_BIRD <dbl>,
## #   SAMPLE_COLLECTED <chr>, COLIFORM <dbl>, ECOLI <dbl>, MEET_STANDARDS <chr>,
## #   LATITUDE <dbl>, LONGITUDE <dbl>, BBL <dbl>, NTA <chr>

Data Manipulation

#Summation of violation types and creation of new column 
df2 <- df1 %>% mutate(Qty_Fail_Insp = SI_RESULT_SEDIMENT + 
                 SI_RESULT_BIOLOGICAL_GROWTH +
                 SI_RESULT_DEBRIS_INSECTS+
                 SI_RESULT_RODENT_BIRD+
                 COLIFORM+
                 ECOLI)

#Plot to understand the overall distribution of failed inspections.  Based on data bar plot it is seen that the majority of tanks did not have failed or corrective action required.   
ggplot(data = df2, aes(x = Qty_Fail_Insp))+
  geom_bar()

#Filter for Inspections Greater than 0. Allows for better visualizations of quantity of tanks with failed inspections.  In addition after looking at a visualization of the quantity of failed inspections, it can be seen that the majority of tanks are in Manhattan.  For this reason the dataframe is further filter to be limited to Manhattan.  
df2b <- df2 %>% filter(Qty_Fail_Insp > 0)%>% filter(BOROUGH == "MANHATTAN")


ggplot(data = df2, aes(x = BOROUGH))+
  geom_bar()

ggplot(data = df2b, aes(x = Qty_Fail_Insp))+
  geom_bar()

Geospatial Data Visualization

Two geospatial visualizations are provided. The first indicated with blue dots on a Manhattan map identifies all of the locations where house tanks were inspected. The second indicates building tank locations where non-structural related failed inspections occurred.

nyc_map <- get_stamenmap(bbox = c(left = -74.05, bottom = 40.6977 , right = -73.89
  , top = 40.8812), zoom = 13, maptype = "terrain")
ggmap(nyc_map) +
  geom_point(data = df2, 
             aes(x = LONGITUDE, y = LATITUDE), 
             size = .7, colour = "blue") 

ggmap(nyc_map) +
  geom_point(data = df2b, 
             aes(x = LONGITUDE, y = LATITUDE), 
             size = .7, colour = "red") 

Water Tanks in Relation to Major NYC Transportation Hubs

Often when prospective tenants are looking for major office space to lease one of the major contributing factors in making the decision is the relation to major transportation hubs. Closer proximity to these major transportation hubs provides employees with an easier when transferring from one of these locations. Looking at the data visualizations below, it can be seen that the south street seaport station has the least amount of house tank locations with failed inspections within close proximity.

Abbreviations: Grand Central Station = gct Penn Station = pst South Street Station = sst Distance of Tanks from Transportation Hub = dist_xxx

df3 <- df2 %>% mutate(gct_lat = 40.7527, gct_lon = -73.9772)%>%
  mutate(pst_lat = 40.7610, pst_lon = -73.9903) %>%
  mutate(sst_lat = 40.7016, sst_lon = -74.0124)
  
  
df3 <- df3 %>% filter(BOROUGH == "MANHATTAN")

df4 <- df3 %>% mutate(dist_gct = distHaversine(cbind(LONGITUDE, LATITUDE), cbind(gct_lon, gct_lat)))%>%
  mutate(dist_pst = distHaversine(cbind(LONGITUDE, LATITUDE), cbind(pst_lon, pst_lat)))%>%
  mutate(dist_sst = distHaversine(cbind(LONGITUDE, LATITUDE), cbind(sst_lon, sst_lat)))

#Jitter Plot for density of failed inspection in relation to distance from Grand Central Terminal                              
ggplot(data = df4, aes(x = dist_gct, y = Qty_Fail_Insp))+
  geom_jitter()+ scale_x_log10()+ ylim(1,4)

#Jitter Plot for density of failed inspection in relation to distance from Penn Station 
ggplot(data = df4, aes(x = dist_pst, y = Qty_Fail_Insp))+
  geom_jitter()+ scale_x_log10()+ ylim(1,4)

#Jitter Plot for density of failed inspection in relation to distance from South Street Station 
ggplot(data = df4, aes(x = dist_sst, y = Qty_Fail_Insp))+
  geom_jitter(alpha = 0.4)+ scale_x_log10()+ ylim(1,4)

Modeling

A liner regression model is applied to predict quantity of failed inspections based on zip code and distance to a major transportation hubs. After performing a linear model it is seen that this type of model is not appropriate. The predictions column in tank_test_results did not match the actual failed inspections in any case.

df5 <- df4 %>% drop_na(Qty_Fail_Insp)%>% select(-BOROUGH, -TANK_NUM, -SI_RESULT_SEDIMENT, -SI_RESULT_BIOLOGICAL_GROWTH, SI_RESULT_DEBRIS_INSECTS, SI_RESULT_RODENT_BIRD, -SAMPLE_COLLECTED, -COLIFORM, -ECOLI, -MEET_STANDARDS, -NTA, -SI_RESULT_DEBRIS_INSECTS, -SI_RESULT_RODENT_BIRD, -gct_lat, -gct_lon, -pst_lat, -pst_lon, -sst_lat, -sst_lon, -BLOCK,-LOT, -LATITUDE, -LONGITUDE )

df6 <- df5 %>% pivot_longer(dist_gct:dist_pst:dist_sst, 
                            names_to = "hub", values_to = "distance", 
                            values_drop_na = TRUE) %>%
  mutate(distance = log10(distance))%>%
  mutate_if(is.character, factor) 
  
df6$ZIP <- as.factor(df6$ZIP)

set.seed(123)
tank_split <- initial_split(df6, prop = .75, strata = Qty_Fail_Insp)

tank_training <- tank_split %>%
  training()

tank_test <- tank_split %>%
  testing()

lm_model <- linear_reg() %>% 
  set_engine('lm') %>% 
  set_mode('regression')


lm_fit <- lm_model %>% 
  fit(Qty_Fail_Insp ~ ZIP+ hub + distance, data = df6)

##tidy(lm_fit)

Insp_Pred <- lm_fit %>% predict(new_data = tank_test)

tank_test_results <- tank_test %>% 
  select(ZIP, hub, distance, Qty_Fail_Insp)%>% 
  bind_cols(Insp_Pred)

tank_test_results
## # A tibble: 20,451 x 5
##    ZIP   hub      distance Qty_Fail_Insp     .pred
##    <fct> <fct>       <dbl>         <dbl>     <dbl>
##  1 10036 dist_pst     2.97             0 0.00715  
##  2 10017 dist_gct     2.44             0 0.00986  
##  3 10017 dist_pst     3.07             0 0.0108   
##  4 10022 dist_gct     3.04             0 0.0294   
##  5 10009 dist_pst     3.56             0 0.00921  
##  6 10009 dist_sst     3.58             0 0.00858  
##  7 10001 dist_sst     3.74             0 0.0141   
##  8 10038 dist_sst     3.22             0 0.00954  
##  9 10027 dist_pst     3.81             0 0.0000138
## 10 10003 dist_pst     3.47             0 0.00423  
## # ... with 20,441 more rows

References

  1. Robin Love Lace, Geocomputation with R
  2. Lisa Lendway - Statistic & Data Science, https://www.youtube.com/watch?v=2k8O-Y_uiRU
  3. stackoverflow, https://stackoverflow.com/questions/49532911/calculate-distance-longitude-latitude-of-multiple-in-dataframe-r