You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Is there a relationship with geographical location and failing inspections in New York City.
What are the cases, and how many are there? Each case represents an inspection report of a tank. There are 31,536 cases in this data set.
Describe the method of data collection. The data is collected by the New York City Department of Health and Mental Hygiene. The inspections are self reported by building owners are collected by the city agency.
What type of study is this (observational/experiment)? This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
https://data.cityofnewyork.us/Health/Rooftop-Drinking-Water-Tank-Inspection-Results/gjm4-k24g
What is the response variable? Is it quantitative or qualitative? The response variable is the quantity of failed inpsection per building with a tank. The variable is quatitative.
You should have two independent variables, one quantitative and one qualitative. The independent variables are zip code and distance to major transportation hub.
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggmap)
library(geosphere)
df <- read_csv("https://raw.githubusercontent.com/engine2031/Data-607/main/Rooftop_Drinking_Water_Tank_Inspection_Results2.csv")
#Reducing Columns to those I thought would be most useful.
df1 <- df %>% select(-BIN, -HOUSE_NUM, -STREET_NAME, -INSPECTION_BY_FIRM, -INSPECTION_DATE)
#Replacing Yes and Nos in column to be 0 and 1. This will allow me to sum the quantity of failed text per inspection.
df1$SI_RESULT_SEDIMENT[df1$SI_RESULT_SEDIMENT == "N"] <- 0
df1$SI_RESULT_SEDIMENT[df1$SI_RESULT_SEDIMENT == "A"] <- 1
df1$SI_RESULT_BIOLOGICAL_GROWTH[df1$SI_RESULT_BIOLOGICAL_GROWTH == "N"] <- 0
df1$SI_RESULT_BIOLOGICAL_GROWTH[df1$SI_RESULT_BIOLOGICAL_GROWTH == "A"] <- 1
df1$SI_RESULT_DEBRIS_INSECTS[df1$SI_RESULT_DEBRIS_INSECTS == "N"] <- 0
df1$SI_RESULT_DEBRIS_INSECTS[df1$SI_RESULT_DEBRIS_INSECTS == "A"] <- 1
df1$SI_RESULT_RODENT_BIRD[df1$SI_RESULT_RODENT_BIRD == "N"] <- 0
df1$SI_RESULT_RODENT_BIRD[df1$SI_RESULT_RODENT_BIRD == "A"] <- 1
df1$COLIFORM[df1$COLIFORM == "A"] <- 0
df1$COLIFORM[df1$COLIFORM == "P"] <- 1
df1$ECOLI[df1$ECOLI == "A"] <- 0
df1$ECOLI[df1$ECOLI == "P"] <- 1
#Converting columns from character to numeric type
df1$SI_RESULT_SEDIMENT <- as.numeric(df1$SI_RESULT_SEDIMENT)
df1$SI_RESULT_BIOLOGICAL_GROWTH <- as.numeric(df1$SI_RESULT_BIOLOGICAL_GROWTH)
df1$SI_RESULT_DEBRIS_INSECTS <- as.numeric(df1$SI_RESULT_DEBRIS_INSECTS)
df1$SI_RESULT_RODENT_BIRD <- as.numeric(df1$SI_RESULT_RODENT_BIRD)
df1$COLIFORM <- as.numeric(df1$COLIFORM)
df1$ECOLI <- as.numeric(df1$ECOLI)
head(df1)
## # A tibble: 6 x 18
## BOROUGH ZIP BLOCK LOT REPORTING_YEAR TANK_NUM SI_RESULT_SEDIMENT
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 QUEENS 11374 2086 50 2018 1 0
## 2 MANHATTAN 10128 1523 34 2017 1 0
## 3 MANHATTAN 10036 1260 7 2019 1 0
## 4 MANHATTAN 10017 1278 1 2016 3 0
## 5 MANHATTAN 10022 1292 37 2019 1 0
## 6 MANHATTAN 10024 1203 1 2017 1 0
## # ... with 11 more variables: SI_RESULT_BIOLOGICAL_GROWTH <dbl>,
## # SI_RESULT_DEBRIS_INSECTS <dbl>, SI_RESULT_RODENT_BIRD <dbl>,
## # SAMPLE_COLLECTED <chr>, COLIFORM <dbl>, ECOLI <dbl>, MEET_STANDARDS <chr>,
## # LATITUDE <dbl>, LONGITUDE <dbl>, BBL <dbl>, NTA <chr>
#Summation of violation types and creation of new column
df2 <- df1 %>% mutate(Qty_Fail_Insp = SI_RESULT_SEDIMENT +
SI_RESULT_BIOLOGICAL_GROWTH +
SI_RESULT_DEBRIS_INSECTS+
SI_RESULT_RODENT_BIRD+
COLIFORM+
ECOLI)
#Plot to understand the overall distribution of failed inspections. Based on data bar plot it is seen that the majority of tanks did not have failed or corrective action required.
ggplot(data = df2, aes(x = Qty_Fail_Insp))+
geom_bar()
#Filter for Inspections Greater than 0. Allows for better visualizations of quantity of tanks with failed inspections. In addition after looking at a visualization of the quantity of failed inspections, it can be seen that the majority of tanks are in Manhattan. For this reason the dataframe is further filter to be limited to Manhattan.
df2b <- df2 %>% filter(Qty_Fail_Insp > 0)%>% filter(BOROUGH == "MANHATTAN")
ggplot(data = df2, aes(x = BOROUGH))+
geom_bar()
ggplot(data = df2b, aes(x = Qty_Fail_Insp))+
geom_bar()
Two geospatial visualizations are provided. The first indicated with blue dots on a Manhattan map identifies all of the locations where house tanks were inspected. The second indicates building tank locations where non-structural related failed inspections occurred.
nyc_map <- get_stamenmap(bbox = c(left = -74.05, bottom = 40.6977 , right = -73.89
, top = 40.8812), zoom = 13, maptype = "terrain")
ggmap(nyc_map) +
geom_point(data = df2,
aes(x = LONGITUDE, y = LATITUDE),
size = .7, colour = "blue")
ggmap(nyc_map) +
geom_point(data = df2b,
aes(x = LONGITUDE, y = LATITUDE),
size = .7, colour = "red")
Often when prospective tenants are looking for major office space to lease one of the major contributing factors in making the decision is the relation to major transportation hubs. Closer proximity to these major transportation hubs provides employees with an easier when transferring from one of these locations. Looking at the data visualizations below, it can be seen that the south street seaport station has the least amount of house tank locations with failed inspections within close proximity.
Abbreviations: Grand Central Station = gct Penn Station = pst South Street Station = sst Distance of Tanks from Transportation Hub = dist_xxx
df3 <- df2 %>% mutate(gct_lat = 40.7527, gct_lon = -73.9772)%>%
mutate(pst_lat = 40.7610, pst_lon = -73.9903) %>%
mutate(sst_lat = 40.7016, sst_lon = -74.0124)
df3 <- df3 %>% filter(BOROUGH == "MANHATTAN")
df4 <- df3 %>% mutate(dist_gct = distHaversine(cbind(LONGITUDE, LATITUDE), cbind(gct_lon, gct_lat)))%>%
mutate(dist_pst = distHaversine(cbind(LONGITUDE, LATITUDE), cbind(pst_lon, pst_lat)))%>%
mutate(dist_sst = distHaversine(cbind(LONGITUDE, LATITUDE), cbind(sst_lon, sst_lat)))
#Jitter Plot for density of failed inspection in relation to distance from Grand Central Terminal
ggplot(data = df4, aes(x = dist_gct, y = Qty_Fail_Insp))+
geom_jitter()+ scale_x_log10()+ ylim(1,4)
#Jitter Plot for density of failed inspection in relation to distance from Penn Station
ggplot(data = df4, aes(x = dist_pst, y = Qty_Fail_Insp))+
geom_jitter()+ scale_x_log10()+ ylim(1,4)
#Jitter Plot for density of failed inspection in relation to distance from South Street Station
ggplot(data = df4, aes(x = dist_sst, y = Qty_Fail_Insp))+
geom_jitter(alpha = 0.4)+ scale_x_log10()+ ylim(1,4)
A liner regression model is applied to predict quantity of failed inspections based on zip code and distance to a major transportation hubs. After performing a linear model it is seen that this type of model is not appropriate. The predictions column in tank_test_results did not match the actual failed inspections in any case.
df5 <- df4 %>% drop_na(Qty_Fail_Insp)%>% select(-BOROUGH, -TANK_NUM, -SI_RESULT_SEDIMENT, -SI_RESULT_BIOLOGICAL_GROWTH, SI_RESULT_DEBRIS_INSECTS, SI_RESULT_RODENT_BIRD, -SAMPLE_COLLECTED, -COLIFORM, -ECOLI, -MEET_STANDARDS, -NTA, -SI_RESULT_DEBRIS_INSECTS, -SI_RESULT_RODENT_BIRD, -gct_lat, -gct_lon, -pst_lat, -pst_lon, -sst_lat, -sst_lon, -BLOCK,-LOT, -LATITUDE, -LONGITUDE )
df6 <- df5 %>% pivot_longer(dist_gct:dist_pst:dist_sst,
names_to = "hub", values_to = "distance",
values_drop_na = TRUE) %>%
mutate(distance = log10(distance))%>%
mutate_if(is.character, factor)
df6$ZIP <- as.factor(df6$ZIP)
set.seed(123)
tank_split <- initial_split(df6, prop = .75, strata = Qty_Fail_Insp)
tank_training <- tank_split %>%
training()
tank_test <- tank_split %>%
testing()
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
lm_fit <- lm_model %>%
fit(Qty_Fail_Insp ~ ZIP+ hub + distance, data = df6)
##tidy(lm_fit)
Insp_Pred <- lm_fit %>% predict(new_data = tank_test)
tank_test_results <- tank_test %>%
select(ZIP, hub, distance, Qty_Fail_Insp)%>%
bind_cols(Insp_Pred)
tank_test_results
## # A tibble: 20,451 x 5
## ZIP hub distance Qty_Fail_Insp .pred
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 10036 dist_pst 2.97 0 0.00715
## 2 10017 dist_gct 2.44 0 0.00986
## 3 10017 dist_pst 3.07 0 0.0108
## 4 10022 dist_gct 3.04 0 0.0294
## 5 10009 dist_pst 3.56 0 0.00921
## 6 10009 dist_sst 3.58 0 0.00858
## 7 10001 dist_sst 3.74 0 0.0141
## 8 10038 dist_sst 3.22 0 0.00954
## 9 10027 dist_pst 3.81 0 0.0000138
## 10 10003 dist_pst 3.47 0 0.00423
## # ... with 20,441 more rows