As a NYC resident and Data Science graduate student, I’m analyzing 311 service request data from 2010 to present to understand complaint patterns across our five boroughs. The 311 system is how we report non-emergency issues—from potholes and noise complaints to heat outages and street light repairs. By visualizing this data geographically, we can see where problems occur and how they cluster in specific neighborhoods. My primary hypothesis is that NYC 311 complaint patterns-including volume, type, and response times—are significantly influenced by weather conditions, neighborhood demographics, and borough characteristics.
How NYC Currently Uses 311 Data
Reactive Response:
Performance Monitoring:
This project analyzes NYC 311 service request data to identify geographic complaint patterns and enable proactive municipal management. By integrating weather conditions and demographic characteristics, the analysis examines how environmental and socioeconomic factors influence service request volumes and types across neighborhoods. The project follows the OSEMN data science workflow—Obtain, Scrub, Explore, Model, and iNterpret—to ensure a systematic and reproducible approach from data acquisition through actionable insights.
Obtain: Data will be acquired from three primary sources: - NYC 311 Service Requests via the NYC Open Data Socrata API using the RSocrata package in R - Historical Weather Data from Open-Meteo.com, a free open-source weather API requiring no API key, to capture temperature, precipitation, and other meteorological conditions corresponding to complaint dates - Demographic Data from the U.S. Census Bureau API (census.gov) to obtain population density, income levels, housing characteristics, and other socioeconomic indicators at the census tract or ZIP code level - Geographic Boundary Files (boroughs, ZIP codes, community districts) using the tigris package for spatial mapping and joining datasets
Scrub: The data cleaning process will use dplyr, tidyr, and lubridate to remove records with missing coordinates, standardize complaint categories, convert dates to proper datetime format, and create time-based variables like year, month, and season for trend analysis. Weather data will be aligned with complaint records by date, and demographic data will be joined spatially by geographic identifier (ZIP code or census tract) to create an enriched analytical dataset.
Explore: Exploratory data analysis will examine complaint distributions across boroughs, identify seasonal and weather-related trends, and investigate correlations between complaint types and neighborhood demographics. Preliminary visualizations will reveal geographic clustering patterns and potential relationships between environmental conditions and service request volumes.
Model: Statistical and spatial analysis techniques will be applied to quantify complaint hotspots, measure temporal trends, and identify significant patterns across neighborhoods and time periods. Regression models will assess the influence of weather variables (temperature extremes, precipitation events) and demographic factors (population density, median income) on complaint frequency and type.
iNterpret: My deliverable will be an interactive Shiny dashboard that allows users to filter by complaint type, date range, borough, weather conditions, and demographic characteristics. The dashboard will feature linked visualizations—an interactive map with complaint markers, dynamic time series charts, weather overlays, demographic profiles, and summary statistics—where selections on one view automatically update the others. Users can explore patterns, download custom reports, and identify areas needing proactive intervention based on environmental and socioeconomic context.
# Load required packages
library(httr)
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(tibble)
# Load API token from environment variable (stored in .Renviron)
app_token <- Sys.getenv("NYC_OPENDATA_TOKEN")
# Check if token loaded successfully
if (app_token == "") {
stop("API token not found. Please set NYC_OPENDATA_TOKEN in your .Renviron file.")
}
# Use the SODA API endpoint
api_url <- "https://data.cityofnewyork.us/resource/erm2-nwe9.json"
query_params <- list(
`$limit` = 80000,
`$order` = "created_date DESC",
`$$app_token` = app_token
)
response <- GET(api_url, query = query_params)
if (status_code(response) == 200) {
df_311 <- content(response, as = "text", encoding = "UTF-8") %>%
fromJSON(flatten = TRUE) %>%
as.data.frame()
print(head(df_311))
} else {
print(paste("Error:", status_code(response)))
}
## unique_key created_date agency agency_name
## 1 67170118 2025-12-14T02:46:11.000 DSNY Department of Sanitation
## 2 67170119 2025-12-14T02:45:22.000 DSNY Department of Sanitation
## 3 67164190 2025-12-14T02:21:06.000 NYPD New York City Police Department
## 4 67166872 2025-12-14T02:20:47.000 NYPD New York City Police Department
## 5 67168351 2025-12-14T02:20:32.000 NYPD New York City Police Department
## 6 67168322 2025-12-14T02:20:15.000 NYPD New York City Police Department
## complaint_type descriptor location_type
## 1 Graffiti Graffiti Residential
## 2 Graffiti Graffiti Residential
## 3 Noise - Residential Loud Music/Party Residential Building/House
## 4 Illegal Parking Posted Parking Sign Violation Street/Sidewalk
## 5 Noise - Residential Loud Music/Party Residential Building/House
## 6 Noise - Residential Loud Music/Party Residential Building/House
## incident_zip incident_address street_name address_type city
## 1 10009 269 EAST 10 STREET EAST 10 STREET ADDRESS NEW YORK
## 2 10009 269 EAST 10 STREET EAST 10 STREET ADDRESS NEW YORK
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 10463 146 WEST 228 STREET WEST 228 STREET ADDRESS BRONX
## 5 10031 500 WEST 140 STREET WEST 140 STREET ADDRESS NEW YORK
## 6 10466 655 EAST 230 STREET EAST 230 STREET ADDRESS BRONX
## facility_type status due_date
## 1 N/A Open 2026-01-13T02:46:11.000
## 2 N/A Open 2026-01-13T02:45:22.000
## 3 <NA> In Progress <NA>
## 4 <NA> In Progress <NA>
## 5 <NA> In Progress <NA>
## 6 <NA> In Progress <NA>
## resolution_description
## 1 The graffiti on this property has been scheduled to be removed by the City.
## 2 The graffiti on this property has been scheduled to be removed by the City.
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## resolution_action_updated_date community_board bbl borough
## 1 2025-12-14T02:46:11.000 03 MANHATTAN 1004380043 MANHATTAN
## 2 2025-12-14T02:45:22.000 03 MANHATTAN 1004380043 MANHATTAN
## 3 <NA> 0 Unspecified <NA> Unspecified
## 4 <NA> 08 BRONX 1022150574 BRONX
## 5 <NA> 09 MANHATTAN 1020710036 MANHATTAN
## 6 <NA> 12 BRONX 2048330028 BRONX
## x_coordinate_state_plane y_coordinate_state_plane open_data_channel_type
## 1 988881 204617 UNKNOWN
## 2 988881 204617 UNKNOWN
## 3 1022954 264308 MOBILE
## 4 1009449 258838 PHONE
## 5 998051 238813 PHONE
## 6 1022911 264242 MOBILE
## park_facility_name park_borough latitude longitude
## 1 Unspecified MANHATTAN 40.72830187478833 -73.98329142747356
## 2 Unspecified MANHATTAN 40.72830187478833 -73.98329142747356
## 3 Unspecified Unspecified 40.89205337686121 -73.86001254663061
## 4 Unspecified BRONX 40.87708884788051 -73.90887895223878
## 5 Unspecified MANHATTAN 40.82215133794938 -73.95013598942462
## 6 Unspecified BRONX 40.89187241649303 -73.86016845296459
## cross_street_1 cross_street_2 intersection_street_1
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 MARBLE HILL LANE MARBLE HILL AVENUE MARBLE HILL LANE
## 5 AMSTERDAM AVENUE HAMILTON PLACE AMSTERDAM AVENUE
## 6 CARPENTER AVENUE LOWERRE PLACE CARPENTER AVENUE
## intersection_street_2 landmark closed_date vehicle_type
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 MARBLE HILL AVENUE WEST 228 STREET <NA> <NA>
## 5 HAMILTON PLACE WEST 140 STREET <NA> <NA>
## 6 LOWERRE PLACE EAST 230 STREET <NA> <NA>
## taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA>
## bridge_highway_direction taxi_company_borough location.latitude
## 1 <NA> <NA> 40.72830187478833
## 2 <NA> <NA> 40.72830187478833
## 3 <NA> <NA> 40.89205337686121
## 4 <NA> <NA> 40.87708884788051
## 5 <NA> <NA> 40.82215133794938
## 6 <NA> <NA> 40.89187241649303
## location.longitude location.human_address
## 1 -73.98329142747356 {"address": "", "city": "", "state": "", "zip": ""}
## 2 -73.98329142747356 {"address": "", "city": "", "state": "", "zip": ""}
## 3 -73.86001254663061 {"address": "", "city": "", "state": "", "zip": ""}
## 4 -73.90887895223878 {"address": "", "city": "", "state": "", "zip": ""}
## 5 -73.95013598942462 {"address": "", "city": "", "state": "", "zip": ""}
## 6 -73.86016845296459 {"address": "", "city": "", "state": "", "zip": ""}
nrow(df_311)
## [1] 80000
sum(is.na(df_311$hour))
## [1] 0
df_311 <- df_311 %>%
mutate(
created_date = as.POSIXct(created_date, format = "%Y-%m-%dT%H:%M:%S"),
closed_date = as.POSIXct(closed_date, format = "%Y-%m-%dT%H:%M:%S"),
date_only = as.Date(created_date),
hour = hour(created_date),
day_of_week = wday(created_date, label = TRUE),
month = month(created_date, label = TRUE)
)
print(str(df_311))
## 'data.frame': 80000 obs. of 47 variables:
## $ unique_key : chr "67170118" "67170119" "67164190" "67166872" ...
## $ created_date : POSIXct, format: "2025-12-14 02:46:11" "2025-12-14 02:45:22" ...
## $ agency : chr "DSNY" "DSNY" "NYPD" "NYPD" ...
## $ agency_name : chr "Department of Sanitation" "Department of Sanitation" "New York City Police Department" "New York City Police Department" ...
## $ complaint_type : chr "Graffiti" "Graffiti" "Noise - Residential" "Illegal Parking" ...
## $ descriptor : chr "Graffiti" "Graffiti" "Loud Music/Party" "Posted Parking Sign Violation" ...
## $ location_type : chr "Residential" "Residential" "Residential Building/House" "Street/Sidewalk" ...
## $ incident_zip : chr "10009" "10009" NA "10463" ...
## $ incident_address : chr "269 EAST 10 STREET" "269 EAST 10 STREET" NA "146 WEST 228 STREET" ...
## $ street_name : chr "EAST 10 STREET" "EAST 10 STREET" NA "WEST 228 STREET" ...
## $ address_type : chr "ADDRESS" "ADDRESS" NA "ADDRESS" ...
## $ city : chr "NEW YORK" "NEW YORK" NA "BRONX" ...
## $ facility_type : chr "N/A" "N/A" NA NA ...
## $ status : chr "Open" "Open" "In Progress" "In Progress" ...
## $ due_date : chr "2026-01-13T02:46:11.000" "2026-01-13T02:45:22.000" NA NA ...
## $ resolution_description : chr "The graffiti on this property has been scheduled to be removed by the City." "The graffiti on this property has been scheduled to be removed by the City." NA NA ...
## $ resolution_action_updated_date: chr "2025-12-14T02:46:11.000" "2025-12-14T02:45:22.000" NA NA ...
## $ community_board : chr "03 MANHATTAN" "03 MANHATTAN" "0 Unspecified" "08 BRONX" ...
## $ bbl : chr "1004380043" "1004380043" NA "1022150574" ...
## $ borough : chr "MANHATTAN" "MANHATTAN" "Unspecified" "BRONX" ...
## $ x_coordinate_state_plane : chr "988881" "988881" "1022954" "1009449" ...
## $ y_coordinate_state_plane : chr "204617" "204617" "264308" "258838" ...
## $ open_data_channel_type : chr "UNKNOWN" "UNKNOWN" "MOBILE" "PHONE" ...
## $ park_facility_name : chr "Unspecified" "Unspecified" "Unspecified" "Unspecified" ...
## $ park_borough : chr "MANHATTAN" "MANHATTAN" "Unspecified" "BRONX" ...
## $ latitude : chr "40.72830187478833" "40.72830187478833" "40.89205337686121" "40.87708884788051" ...
## $ longitude : chr "-73.98329142747356" "-73.98329142747356" "-73.86001254663061" "-73.90887895223878" ...
## $ cross_street_1 : chr NA NA NA "MARBLE HILL LANE" ...
## $ cross_street_2 : chr NA NA NA "MARBLE HILL AVENUE" ...
## $ intersection_street_1 : chr NA NA NA "MARBLE HILL LANE" ...
## $ intersection_street_2 : chr NA NA NA "MARBLE HILL AVENUE" ...
## $ landmark : chr NA NA NA "WEST 228 STREET" ...
## $ closed_date : POSIXct, format: NA NA ...
## $ vehicle_type : chr NA NA NA NA ...
## $ taxi_pick_up_location : chr NA NA NA NA ...
## $ bridge_highway_name : chr NA NA NA NA ...
## $ bridge_highway_segment : chr NA NA NA NA ...
## $ road_ramp : chr NA NA NA NA ...
## $ bridge_highway_direction : chr NA NA NA NA ...
## $ taxi_company_borough : chr NA NA NA NA ...
## $ location.latitude : chr "40.72830187478833" "40.72830187478833" "40.89205337686121" "40.87708884788051" ...
## $ location.longitude : chr "-73.98329142747356" "-73.98329142747356" "-73.86001254663061" "-73.90887895223878" ...
## $ location.human_address : chr "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" ...
## $ date_only : Date, format: "2025-12-14" "2025-12-14" ...
## $ hour : int 2 2 2 2 2 2 2 2 2 2 ...
## $ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
## NULL
sum(is.na(df_311$hour))
## [1] 0
df_311 %>% count(day_of_week)
## day_of_week n
## 1 Sun 7256
## 2 Mon 12546
## 3 Tue 14050
## 4 Wed 12155
## 5 Thu 11635
## 6 Fri 11954
## 7 Sat 10404
# WEATHER DATA INTEGRATION
library(httr)
library(jsonlite)
library(dplyr)
library(lubridate)
# Get date range from your 311 data
start_date <- min(df_311$date_only, na.rm = TRUE)
end_date <- max(df_311$date_only, na.rm = TRUE)
# Fetch NYC historical weather
weather_url <- "https://archive-api.open-meteo.com/v1/archive"
weather_params <- list(
latitude = 40.7128,
longitude = -74.0060,
start_date = as.character(start_date),
end_date = as.character(end_date),
daily = "temperature_2m_max,temperature_2m_min,temperature_2m_mean,precipitation_sum,windspeed_10m_max",
timezone = "America/New_York"
)
weather_response <- GET(weather_url, query = weather_params)
if (status_code(weather_response) == 200) {
weather_json <- fromJSON(content(weather_response, "text", encoding = "UTF-8"))
weather_df <- data.frame(
date = as.Date(weather_json$daily$time),
temp_max_f = weather_json$daily$temperature_2m_max * 9/5 + 32, # Convert to Fahrenheit
temp_min_f = weather_json$daily$temperature_2m_min * 9/5 + 32,
temp_mean_f = weather_json$daily$temperature_2m_mean * 9/5 + 32,
precipitation_mm = weather_json$daily$precipitation_sum,
wind_max_mph = weather_json$daily$windspeed_10m_max * 0.621371 # Convert to mph
)
cat("Weather data retrieved:", nrow(weather_df), "days\n")
print(head(weather_df))
} else {
cat("Error fetching weather:", status_code(weather_response), "\n")
}
## Weather data retrieved: 8 days
## date temp_max_f temp_min_f temp_mean_f precipitation_mm wind_max_mph
## 1 2025-12-07 40.82 28.76 34.52 0.0 7.332178
## 2 2025-12-08 34.88 22.64 28.04 0.0 11.557501
## 3 2025-12-09 31.46 18.32 25.16 0.0 9.817662
## 4 2025-12-10 43.34 30.56 36.68 3.1 11.371089
## 5 2025-12-11 38.84 27.50 32.90 0.0 14.975041
## 6 2025-12-12 34.52 23.00 28.40 0.0 10.004073
# Join weather to 311 data
df_311 <- df_311 %>%
left_join(weather_df, by = c("date_only" = "date"))
# Verify join
cat("\nWeather columns added:", sum(!is.na(df_311$temp_mean_f)), "rows with weather data\n")
##
## Weather columns added: 80000 rows with weather data
library(tidycensus)
library(dplyr)
# Load the key from environment variable
census_api_key(Sys.getenv("CENSUS_API_KEY"))
## To install your API key for use in future sessions, run this function with `install = TRUE`.
query_params <- list(
`$limit` = 75000, # More records = more days
`$order` = "created_date DESC"
)
# Fetch NYC demographic data by borough
nyc_demographics <- get_acs(
geography = "county",
state = "NY",
county = c("005", "047", "061", "081", "085"),
variables = c(
total_pop = "B01003_001",
median_income = "B19013_001",
median_age = "B01002_001",
total_housing = "B25001_001",
renter_occupied = "B25003_003",
poverty_count = "B17001_002"
),
year = 2022,
output = "wide"
)
## Getting data from the 2018-2022 5-year ACS
## Warning: • You have not set a Census API key. Users without a key are limited to 500
## queries per day and may experience performance limitations.
## ℹ For best results, get a Census API key at
## http://api.census.gov/data/key_signup.html and then supply the key to the
## `census_api_key()` function to use it throughout your tidycensus session.
## This warning is displayed once per session.
# Clean up borough names to match your 311 data
nyc_demographics <- nyc_demographics %>%
mutate(
borough = case_when(
grepl("Bronx", NAME) ~ "BRONX",
grepl("Kings", NAME) ~ "BROOKLYN",
grepl("New York County", NAME) ~ "MANHATTAN",
grepl("Queens", NAME) ~ "QUEENS",
grepl("Richmond", NAME) ~ "STATEN ISLAND"
),
pct_renter = renter_occupiedE / total_housingE * 100,
pct_poverty = poverty_countE / total_popE * 100
) %>%
select(borough,
population = total_popE,
median_income = median_incomeE,
median_age = median_ageE,
pct_renter,
pct_poverty)
print("NYC Demographics by Borough:")
## [1] "NYC Demographics by Borough:"
print(nyc_demographics)
## # A tibble: 5 × 6
## borough population median_income median_age pct_renter pct_poverty
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BRONX 1443229 47036 35 76.8 26.3
## 2 BROOKLYN 2679620 74692 35.9 64.4 18.8
## 3 MANHATTAN 1645867 99880 38.5 64.0 15.4
## 4 QUEENS 2360826 82431 39.9 49.9 11.6
## 5 STATEN ISLAND 492925 96185 40.3 29.1 10.3
# Join demographics to 311 data
df_311 <- df_311 %>%
left_join(nyc_demographics, by = "borough")
cat("Demographics joined:", sum(!is.na(df_311$population)), "rows with demographic data\n")
## Demographics joined: 79941 rows with demographic data
head(df_311, 5)
## unique_key created_date agency agency_name
## 1 67170118 2025-12-14 02:46:11 DSNY Department of Sanitation
## 2 67170119 2025-12-14 02:45:22 DSNY Department of Sanitation
## 3 67164190 2025-12-14 02:21:06 NYPD New York City Police Department
## 4 67166872 2025-12-14 02:20:47 NYPD New York City Police Department
## 5 67168351 2025-12-14 02:20:32 NYPD New York City Police Department
## complaint_type descriptor location_type
## 1 Graffiti Graffiti Residential
## 2 Graffiti Graffiti Residential
## 3 Noise - Residential Loud Music/Party Residential Building/House
## 4 Illegal Parking Posted Parking Sign Violation Street/Sidewalk
## 5 Noise - Residential Loud Music/Party Residential Building/House
## incident_zip incident_address street_name address_type city
## 1 10009 269 EAST 10 STREET EAST 10 STREET ADDRESS NEW YORK
## 2 10009 269 EAST 10 STREET EAST 10 STREET ADDRESS NEW YORK
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 10463 146 WEST 228 STREET WEST 228 STREET ADDRESS BRONX
## 5 10031 500 WEST 140 STREET WEST 140 STREET ADDRESS NEW YORK
## facility_type status due_date
## 1 N/A Open 2026-01-13T02:46:11.000
## 2 N/A Open 2026-01-13T02:45:22.000
## 3 <NA> In Progress <NA>
## 4 <NA> In Progress <NA>
## 5 <NA> In Progress <NA>
## resolution_description
## 1 The graffiti on this property has been scheduled to be removed by the City.
## 2 The graffiti on this property has been scheduled to be removed by the City.
## 3 <NA>
## 4 <NA>
## 5 <NA>
## resolution_action_updated_date community_board bbl borough
## 1 2025-12-14T02:46:11.000 03 MANHATTAN 1004380043 MANHATTAN
## 2 2025-12-14T02:45:22.000 03 MANHATTAN 1004380043 MANHATTAN
## 3 <NA> 0 Unspecified <NA> Unspecified
## 4 <NA> 08 BRONX 1022150574 BRONX
## 5 <NA> 09 MANHATTAN 1020710036 MANHATTAN
## x_coordinate_state_plane y_coordinate_state_plane open_data_channel_type
## 1 988881 204617 UNKNOWN
## 2 988881 204617 UNKNOWN
## 3 1022954 264308 MOBILE
## 4 1009449 258838 PHONE
## 5 998051 238813 PHONE
## park_facility_name park_borough latitude longitude
## 1 Unspecified MANHATTAN 40.72830187478833 -73.98329142747356
## 2 Unspecified MANHATTAN 40.72830187478833 -73.98329142747356
## 3 Unspecified Unspecified 40.89205337686121 -73.86001254663061
## 4 Unspecified BRONX 40.87708884788051 -73.90887895223878
## 5 Unspecified MANHATTAN 40.82215133794938 -73.95013598942462
## cross_street_1 cross_street_2 intersection_street_1
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 MARBLE HILL LANE MARBLE HILL AVENUE MARBLE HILL LANE
## 5 AMSTERDAM AVENUE HAMILTON PLACE AMSTERDAM AVENUE
## intersection_street_2 landmark closed_date vehicle_type
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 MARBLE HILL AVENUE WEST 228 STREET <NA> <NA>
## 5 HAMILTON PLACE WEST 140 STREET <NA> <NA>
## taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## bridge_highway_direction taxi_company_borough location.latitude
## 1 <NA> <NA> 40.72830187478833
## 2 <NA> <NA> 40.72830187478833
## 3 <NA> <NA> 40.89205337686121
## 4 <NA> <NA> 40.87708884788051
## 5 <NA> <NA> 40.82215133794938
## location.longitude location.human_address
## 1 -73.98329142747356 {"address": "", "city": "", "state": "", "zip": ""}
## 2 -73.98329142747356 {"address": "", "city": "", "state": "", "zip": ""}
## 3 -73.86001254663061 {"address": "", "city": "", "state": "", "zip": ""}
## 4 -73.90887895223878 {"address": "", "city": "", "state": "", "zip": ""}
## 5 -73.95013598942462 {"address": "", "city": "", "state": "", "zip": ""}
## date_only hour day_of_week month temp_max_f temp_min_f temp_mean_f
## 1 2025-12-14 2 Sun Dec 32.54 16.88 26.42
## 2 2025-12-14 2 Sun Dec 32.54 16.88 26.42
## 3 2025-12-14 2 Sun Dec 32.54 16.88 26.42
## 4 2025-12-14 2 Sun Dec 32.54 16.88 26.42
## 5 2025-12-14 2 Sun Dec 32.54 16.88 26.42
## precipitation_mm wind_max_mph population median_income median_age pct_renter
## 1 11.4 11.80605 1645867 99880 38.5 64.03536
## 2 11.4 11.80605 1645867 99880 38.5 64.03536
## 3 11.4 11.80605 NA NA NA NA
## 4 11.4 11.80605 1443229 47036 35.0 76.75935
## 5 11.4 11.80605 1645867 99880 38.5 64.03536
## pct_poverty
## 1 15.35221
## 2 15.35221
## 3 NA
## 4 26.32465
## 5 15.35221
library(dplyr)
library(tidyr)
library(lubridate)
# 1. Dataset Overview
cat("Rows:", nrow(df_311), "\n")
## Rows: 80000
cat("Columns:", ncol(df_311), "\n")
## Columns: 57
glimpse(df_311)
## Rows: 80,000
## Columns: 57
## $ unique_key <chr> "67170118", "67170119", "67164190", "67…
## $ created_date <dttm> 2025-12-14 02:46:11, 2025-12-14 02:45:…
## $ agency <chr> "DSNY", "DSNY", "NYPD", "NYPD", "NYPD",…
## $ agency_name <chr> "Department of Sanitation", "Department…
## $ complaint_type <chr> "Graffiti", "Graffiti", "Noise - Reside…
## $ descriptor <chr> "Graffiti", "Graffiti", "Loud Music/Par…
## $ location_type <chr> "Residential", "Residential", "Resident…
## $ incident_zip <chr> "10009", "10009", NA, "10463", "10031",…
## $ incident_address <chr> "269 EAST 10 STREET", "269 EAST 10 …
## $ street_name <chr> "EAST 10 STREET", "EAST 10 STREET",…
## $ address_type <chr> "ADDRESS", "ADDRESS", NA, "ADDRESS", "A…
## $ city <chr> "NEW YORK", "NEW YORK", NA, "BRONX", "N…
## $ facility_type <chr> "N/A", "N/A", NA, NA, NA, NA, NA, NA, N…
## $ status <chr> "Open", "Open", "In Progress", "In Prog…
## $ due_date <chr> "2026-01-13T02:46:11.000", "2026-01-13T…
## $ resolution_description <chr> "The graffiti on this property has been…
## $ resolution_action_updated_date <chr> "2025-12-14T02:46:11.000", "2025-12-14T…
## $ community_board <chr> "03 MANHATTAN", "03 MANHATTAN", "0 Unsp…
## $ bbl <chr> "1004380043", "1004380043", NA, "102215…
## $ borough <chr> "MANHATTAN", "MANHATTAN", "Unspecified"…
## $ x_coordinate_state_plane <chr> "988881", "988881", "1022954", "1009449…
## $ y_coordinate_state_plane <chr> "204617", "204617", "264308", "258838",…
## $ open_data_channel_type <chr> "UNKNOWN", "UNKNOWN", "MOBILE", "PHONE"…
## $ park_facility_name <chr> "Unspecified", "Unspecified", "Unspecif…
## $ park_borough <chr> "MANHATTAN", "MANHATTAN", "Unspecified"…
## $ latitude <chr> "40.72830187478833", "40.72830187478833…
## $ longitude <chr> "-73.98329142747356", "-73.983291427473…
## $ cross_street_1 <chr> NA, NA, NA, "MARBLE HILL LANE", "AMSTER…
## $ cross_street_2 <chr> NA, NA, NA, "MARBLE HILL AVENUE", "HAMI…
## $ intersection_street_1 <chr> NA, NA, NA, "MARBLE HILL LANE", "AMSTER…
## $ intersection_street_2 <chr> NA, NA, NA, "MARBLE HILL AVENUE", "HAMI…
## $ landmark <chr> NA, NA, NA, "WEST 228 STREET", "WEST …
## $ closed_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ vehicle_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ taxi_pick_up_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_segment <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ road_ramp <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_direction <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ taxi_company_borough <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ location.latitude <chr> "40.72830187478833", "40.72830187478833…
## $ location.longitude <chr> "-73.98329142747356", "-73.983291427473…
## $ location.human_address <chr> "{\"address\": \"\", \"city\": \"\", \"…
## $ date_only <date> 2025-12-14, 2025-12-14, 2025-12-14, 20…
## $ hour <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ day_of_week <ord> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun,…
## $ month <ord> Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec,…
## $ temp_max_f <dbl> 32.54, 32.54, 32.54, 32.54, 32.54, 32.5…
## $ temp_min_f <dbl> 16.88, 16.88, 16.88, 16.88, 16.88, 16.8…
## $ temp_mean_f <dbl> 26.42, 26.42, 26.42, 26.42, 26.42, 26.4…
## $ precipitation_mm <dbl> 11.4, 11.4, 11.4, 11.4, 11.4, 11.4, 11.…
## $ wind_max_mph <dbl> 11.80605, 11.80605, 11.80605, 11.80605,…
## $ population <dbl> 1645867, 1645867, NA, 1443229, 1645867,…
## $ median_income <dbl> 99880, 99880, NA, 47036, 99880, 47036, …
## $ median_age <dbl> 38.5, 38.5, NA, 35.0, 38.5, 35.0, 35.0,…
## $ pct_renter <dbl> 64.03536, 64.03536, NA, 76.75935, 64.03…
## $ pct_poverty <dbl> 15.35221, 15.35221, NA, 26.32465, 15.35…
# 2. MISSING VALUES ANALYSIS
missing_summary <- df_311 %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "column", values_to = "missing_count") %>%
mutate(
missing_pct = round(missing_count / nrow(df_311) * 100, 2),
recommendation = case_when(
missing_pct == 0 ~ "Complete",
missing_pct < 5 ~ "Impute",
missing_pct < 30 ~ "Impute with caution",
TRUE ~ "Consider dropping"
)
) %>%
arrange(desc(missing_pct))
print(missing_summary %>% filter(missing_pct > 0))
## # A tibble: 36 × 4
## column missing_count missing_pct recommendation
## <chr> <int> <dbl> <chr>
## 1 taxi_company_borough 79946 99.9 Consider dropping
## 2 road_ramp 79858 99.8 Consider dropping
## 3 bridge_highway_direction 79833 99.8 Consider dropping
## 4 due_date 79703 99.6 Consider dropping
## 5 bridge_highway_name 79668 99.6 Consider dropping
## 6 bridge_highway_segment 79668 99.6 Consider dropping
## 7 taxi_pick_up_location 79262 99.1 Consider dropping
## 8 vehicle_type 77252 96.6 Consider dropping
## 9 facility_type 76100 95.1 Consider dropping
## 10 landmark 34942 43.7 Consider dropping
## # ℹ 26 more rows
# Identify columns with >30% missing
threshold <- 0.30
cols_to_drop <- names(df_311)[colSums(is.na(df_311)) / nrow(df_311) > threshold]
cat("Dropping columns:", cols_to_drop, "\n")
## Dropping columns: facility_type due_date cross_street_1 cross_street_2 intersection_street_1 intersection_street_2 landmark vehicle_type taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp bridge_highway_direction taxi_company_borough
# Drop them
df_311 <- df_311 %>%
select(-all_of(cols_to_drop))
# 3. DUPLICATE ROWS
cat("\n=== DUPLICATE ROWS ===\n")
##
## === DUPLICATE ROWS ===
dup_count <- sum(duplicated(df_311))
cat("Duplicate rows:", dup_count, "\n")
## Duplicate rows: 0
cat("Duplicate %:", round(dup_count / nrow(df_311) * 100, 2), "%\n")
## Duplicate %: 0 %
# Remove duplicates (uncomment to execute)
df_311 <- df_311 %>% distinct()
# 4. DATA TYPES CHECK
sapply(df_311, class)
## $unique_key
## [1] "character"
##
## $created_date
## [1] "POSIXct" "POSIXt"
##
## $agency
## [1] "character"
##
## $agency_name
## [1] "character"
##
## $complaint_type
## [1] "character"
##
## $descriptor
## [1] "character"
##
## $location_type
## [1] "character"
##
## $incident_zip
## [1] "character"
##
## $incident_address
## [1] "character"
##
## $street_name
## [1] "character"
##
## $address_type
## [1] "character"
##
## $city
## [1] "character"
##
## $status
## [1] "character"
##
## $resolution_description
## [1] "character"
##
## $resolution_action_updated_date
## [1] "character"
##
## $community_board
## [1] "character"
##
## $bbl
## [1] "character"
##
## $borough
## [1] "character"
##
## $x_coordinate_state_plane
## [1] "character"
##
## $y_coordinate_state_plane
## [1] "character"
##
## $open_data_channel_type
## [1] "character"
##
## $park_facility_name
## [1] "character"
##
## $park_borough
## [1] "character"
##
## $latitude
## [1] "character"
##
## $longitude
## [1] "character"
##
## $closed_date
## [1] "POSIXct" "POSIXt"
##
## $location.latitude
## [1] "character"
##
## $location.longitude
## [1] "character"
##
## $location.human_address
## [1] "character"
##
## $date_only
## [1] "Date"
##
## $hour
## [1] "integer"
##
## $day_of_week
## [1] "ordered" "factor"
##
## $month
## [1] "ordered" "factor"
##
## $temp_max_f
## [1] "numeric"
##
## $temp_min_f
## [1] "numeric"
##
## $temp_mean_f
## [1] "numeric"
##
## $precipitation_mm
## [1] "numeric"
##
## $wind_max_mph
## [1] "numeric"
##
## $population
## [1] "numeric"
##
## $median_income
## [1] "numeric"
##
## $median_age
## [1] "numeric"
##
## $pct_renter
## [1] "numeric"
##
## $pct_poverty
## [1] "numeric"
# 5. STANDARDIZE TEXT COLUMNS
df_311 <- df_311 %>%
mutate(
# Trim whitespace and standardize case
borough = toupper(trimws(borough)),
complaint_type = toupper(trimws(complaint_type)),
# Standardize "Unspecified" values
borough = if_else(borough %in% c("", "UNSPECIFIED", "NA"), NA_character_, borough)
)
cat("Unique boroughs:", unique(df_311$borough), "\n")
## Unique boroughs: MANHATTAN NA BRONX BROOKLYN QUEENS STATEN ISLAND
# 6. INVALID/ILLOGICAL VALUES
# Check: Closed date before Created date
if ("closed_date" %in% names(df_311) & "created_date" %in% names(df_311)) {
invalid_dates <- df_311 %>%
filter(!is.na(closed_date) & closed_date < created_date)
cat("Invalid dates (closed before created):", nrow(invalid_dates), "\n")
}
## Invalid dates (closed before created): 12
# Check: Future dates
future_dates <- df_311 %>%
filter(created_date > Sys.time())
cat("Future dates:", nrow(future_dates), "\n")
## Future dates: 0
# Check: Negative response times
if ("response_time" %in% names(df_311)) {
negative_response <- df_311 %>%
filter(response_time < 0)
cat("Negative response times:", nrow(negative_response), "\n")
}
# 7. OUTLIER DETECTION (Numeric Columns)
detect_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
sum(x < lower | x > upper, na.rm = TRUE)
}
numeric_cols <- df_311 %>% select(where(is.numeric)) %>% names()
for (col in numeric_cols) {
outlier_count <- detect_outliers(df_311[[col]])
pct <- round(outlier_count / nrow(df_311) * 100, 2)
cat(col, ": ", outlier_count, " outliers (", pct, "%)\n", sep = "")
}
## hour: 0 outliers (0%)
## temp_max_f: 0 outliers (0%)
## temp_min_f: 0 outliers (0%)
## temp_mean_f: 0 outliers (0%)
## precipitation_mm: 15827 outliers (19.78%)
## wind_max_mph: 21694 outliers (27.12%)
## population: 0 outliers (0%)
## median_income: 0 outliers (0%)
## median_age: 0 outliers (0%)
## pct_renter: 2230 outliers (2.79%)
## pct_poverty: 0 outliers (0%)
# 8. CARDINALITY CHECK
cardinality <- df_311 %>%
summarise(across(everything(), n_distinct)) %>%
pivot_longer(everything(), names_to = "column", values_to = "unique_values") %>%
arrange(desc(unique_values))
print(cardinality)
## # A tibble: 43 × 2
## column unique_values
## <chr> <int>
## 1 unique_key 80000
## 2 created_date 66200
## 3 closed_date 39842
## 4 resolution_action_updated_date 35290
## 5 latitude 34618
## 6 longitude 34618
## 7 location.latitude 34618
## 8 location.longitude 34618
## 9 incident_address 33295
## 10 y_coordinate_state_plane 29467
## # ℹ 33 more rows
library(ggplot2)
#DATA. VISUALIZATIONS
# Top Complaint Types
df_311 %>%
count(complaint_type, sort = TRUE) %>%
head(10) %>%
ggplot(aes(x = reorder(complaint_type, n), y = n, fill = complaint_type)) +
geom_col() +
coord_flip() +
labs(
title = "Top 10 Complaint Types in NYC 311 Data",
x = "Complaint Type",
y = "Number of Complaints"
) +
theme_minimal() +
theme(legend.position = "none")
# Complaints by Borough
df_311 %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
count(borough) %>%
ggplot(aes(x = reorder(borough, n), y = n, fill = borough)) +
geom_col() +
labs(
title = "311 Complaints by Borough",
x = "Borough",
y = "Number of Complaints"
) +
theme_minimal() +
theme(legend.position = "none")
# Complaints by Hour of Day
df_311 %>%
count(hour) %>%
ggplot(aes(x = hour, y = n)) +
geom_line(color = "steelblue", size = 1) +
geom_point(color = "steelblue", size = 2) +
labs(
title = "311 Complaints by Hour of Day",
x = "Hour (24-hour format)",
y = "Number of Complaints"
) +
scale_x_continuous(breaks = seq(0, 23, 2)) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Complaints by Day of Week
df_311 %>%
count(day_of_week) %>%
ggplot(aes(x = day_of_week, y = n, fill = day_of_week)) +
geom_col() +
labs(
title = "311 Complaints by Day of Week",
x = "Day",
y = "Number of Complaints"
) +
theme_minimal() +
theme(legend.position = "none")
# Heatmap - Hour vs Day of Week
df_311 %>%
count(day_of_week, hour) %>%
ggplot(aes(x = hour, y = day_of_week, fill = n)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
labs(
title = "Complaint Frequency: Hour vs Day of Week",
x = "Hour of Day",
y = "Day of Week",
fill = "Count"
) +
theme_minimal()
# Top Complaints by Borough
df_311 %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
count(borough, complaint_type) %>%
group_by(borough) %>%
slice_max(n, n = 5) %>%
ggplot(aes(x = reorder(complaint_type, n), y = n, fill = borough)) +
geom_col() +
coord_flip() +
facet_wrap(~borough, scales = "free_y") +
labs(
title = "Top 5 Complaint Types by Borough",
x = "Complaint Type",
y = "Count"
) +
theme_minimal() +
theme(legend.position = "none")
# STATISTICAL ANALYSIS
# Chi-Square Test: Are complaint types independent of borough?
contingency_table <- table(df_311$borough, df_311$complaint_type)
chi_test <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
print("Chi-Square Test: Complaint Type vs Borough")
## [1] "Chi-Square Test: Complaint Type vs Borough"
print(chi_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = NaN, df = 596, p-value = NA
# ANOVA: Does response time differ by borough?
df_311 <- df_311 %>%
mutate(response_time = as.numeric(difftime(closed_date, created_date, units = "hours")))
# Remove NA and extreme outliers
df_response <- df_311 %>%
filter(!is.na(response_time) & response_time > 0 & response_time < 720)
anova_result <- aov(response_time ~ borough, data = df_response)
print("ANOVA: Response Time by Borough")
## [1] "ANOVA: Response Time by Borough"
print(summary(anova_result))
## Df Sum Sq Mean Sq F value Pr(>F)
## borough 4 688637 172159 351.3 <2e-16 ***
## Residuals 58961 28897157 490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 18 observations deleted due to missingness
# Post-hoc test if significant
tukey_result <- TukeyHSD(anova_result)
print(tukey_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response_time ~ borough, data = df_response)
##
## $borough
## diff lwr upr p adj
## BROOKLYN-BRONX 4.8512561 4.2075483 5.4949638 0.0000000
## MANHATTAN-BRONX 9.2228806 8.4822037 9.9635575 0.0000000
## QUEENS-BRONX 0.4570617 -0.2445357 1.1586591 0.3870210
## STATEN ISLAND-BRONX 3.8666515 2.2706978 5.4626053 0.0000000
## MANHATTAN-BROOKLYN 4.3716245 3.5722814 5.1709677 0.0000000
## QUEENS-BROOKLYN -4.3941944 -5.1574677 -3.6309211 0.0000000
## STATEN ISLAND-BROOKLYN -0.9846045 -2.6086167 0.6394076 0.4628933
## QUEENS-MANHATTAN -8.7658189 -9.6124763 -7.9191615 0.0000000
## STATEN ISLAND-MANHATTAN -5.3562291 -7.0210588 -3.6913993 0.0000000
## STATEN ISLAND-QUEENS 3.4095899 1.7617747 5.0574050 0.0000002
# Response Time by Borough
df_response %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
ggplot(aes(x = borough, y = response_time, fill = borough)) +
geom_boxplot() +
labs(
title = "Response Time (Hours) by Borough",
x = "Borough",
y = "Response Time (Hours)"
) +
theme_minimal() +
theme(legend.position = "none")
# SUMMARY STATISTICS TABLE
summary_stats <- df_response %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
group_by(borough) %>%
summarise(
count = n(),
mean_response = mean(response_time, na.rm = TRUE),
median_response = median(response_time, na.rm = TRUE),
sd_response = sd(response_time, na.rm = TRUE)
)
print("Summary Statistics by Borough:")
## [1] "Summary Statistics by Borough:"
print(summary_stats)
## # A tibble: 5 × 5
## borough count mean_response median_response sd_response
## <chr> <int> <dbl> <dbl> <dbl>
## 1 BRONX 22510 12.3 3.51 19.0
## 2 BROOKLYN 14451 17.2 4.03 23.7
## 3 MANHATTAN 9433 21.6 4.48 28.0
## 4 QUEENS 11043 12.8 2.62 20.4
## 5 STATEN ISLAND 1529 16.2 4.88 21.5
# Filter heating complaints data
heating_individual <- df_311 %>%
filter(complaint_type == "HEAT/HOT WATER" & !is.na(temp_mean_f)) %>%
filter(!is.na(response_time) & response_time > 0 & response_time < 720)
# Plot 1: Heating Complaints COUNT vs Temperature
heating_count <- df_311 %>%
filter(complaint_type == "HEAT/HOT WATER" & !is.na(temp_mean_f)) %>%
group_by(date_only, temp_mean_f) %>%
summarise(complaints = n(), .groups = "drop")
p1 <- ggplot(heating_count, aes(x = temp_mean_f, y = complaints)) +
geom_point(alpha = 0.5, color = "navy") +
labs(
title = "Heating Complaint Volume vs Temperature",
x = "Mean Daily Temperature (°F)",
y = "Number of Complaints"
) +
theme_minimal()
print(p1)
# Noise complaints - do they increase in warm weather?
noise_temp <- df_311 %>%
filter(grepl("Noise", complaint_type, ignore.case = TRUE)) %>%
group_by(date_only, temp_mean_f) %>%
summarise(complaints = n(), .groups = "drop")
ggplot(noise_temp, aes(x = temp_mean_f, y = complaints)) +
geom_point(alpha = 0.5, color = "orange") +
geom_smooth(method = "loess", color = "purple", se = FALSE) +
labs(
title = "Do Noise Complaints Rise in Warmer Weather?",
x = "Mean Daily Temperature (°F)",
y = "Number of Noise Complaints"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
cor.test(noise_temp$temp_mean_f, noise_temp$complaints)
##
## Pearson's product-moment correlation
##
## data: noise_temp$temp_mean_f and noise_temp$complaints
## t = -0.42892, df = 6, p-value = 0.683
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7820955 0.6058262
## sample estimates:
## cor
## -0.1724801
# Create temperature bins
temp_complaints <- df_311 %>%
filter(!is.na(temp_mean_f)) %>%
mutate(temp_range = cut(temp_mean_f,
breaks = c(0, 32, 50, 70, 85, 100),
labels = c("Freezing (<32°F)", "Cold (32-50°F)",
"Mild (50-70°F)", "Warm (70-85°F)", "Hot (>85°F)"))) %>%
filter(complaint_type %in% names(sort(table(df_311$complaint_type), decreasing = TRUE)[1:10])) %>%
count(complaint_type, temp_range)
ggplot(temp_complaints, aes(x = temp_range, y = complaint_type, fill = n)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
labs(
title = "Complaint Types by Temperature Range",
x = "Temperature Range",
y = "Complaint Type",
fill = "Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Complaints per capita by borough
complaints_per_capita <- df_311 %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
group_by(borough, population, median_income, pct_renter, pct_poverty) %>%
summarise(total_complaints = n(), .groups = "drop") %>%
mutate(complaints_per_1000 = total_complaints / population * 1000)
ggplot(complaints_per_capita, aes(x = median_income, y = complaints_per_1000)) +
geom_point(size = 5, color = "steelblue") +
geom_text(aes(label = borough), vjust = -1, size = 3) +
geom_smooth(method = "lm", color = "red", se = FALSE, linetype = "dashed") +
scale_x_continuous(labels = scales::dollar) +
labs(
title = "Do Lower-Income Boroughs File More Complaints?",
x = "Median Household Income",
y = "Complaints per 1,000 Residents"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Housing-related complaints by renter percentage
housing_renter <- df_311 %>%
filter(complaint_type %in% c("HEAT/HOT WATER", "PLUMBING", "WATER SYSTEM", "UNSANITARY CONDITION")) %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
group_by(borough, pct_renter, population) %>%
summarise(housing_complaints = n(), .groups = "drop") %>%
mutate(housing_per_1000 = housing_complaints / population * 1000)
ggplot(housing_renter, aes(x = pct_renter, y = housing_per_1000)) +
geom_point(size = 5, color = "orange") +
geom_text(aes(label = borough), vjust = -1, size = 3) +
geom_smooth(method = "lm", color = "purple", se = FALSE, linetype = "dashed") +
labs(
title = "Housing Complaints Higher in Renter-Heavy Boroughs",
x = "Percent Renter-Occupied Housing",
y = "Housing Complaints per 1,000 Residents"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Load required libraries
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:httr':
##
## progress
library(tibble)
# Prepare modeling dataset WITH weather and demographics
model_data <- df_response %>%
filter(!is.na(borough) & borough != "Unspecified" &
!is.na(complaint_type) & !is.na(hour) & !is.na(day_of_week) &
!is.na(temp_mean_f) & !is.na(median_income)) %>% # NEW: require weather & demographics
select(
response_time,
borough,
complaint_type,
hour,
day_of_week,
month,
# NEW: Weather features
temp_mean_f,
precipitation_mm,
# NEW: Demographics features
median_income,
pct_renter,
pct_poverty
) %>%
# Keep top complaint types to avoid sparse categories
filter(complaint_type %in% names(sort(table(df_response$complaint_type), decreasing = TRUE)[1:20])) %>%
mutate(across(where(is.character), as.factor))
# Check structure - verify new columns are included
str(model_data)
## 'data.frame': 53378 obs. of 11 variables:
## $ response_time : num 0.0506 0.165 0.2439 0.2297 0.1914 ...
## $ borough : Factor w/ 5 levels "BRONX","BROOKLYN",..: 4 4 2 4 1 1 2 4 4 5 ...
## $ complaint_type : Factor w/ 20 levels "ABANDONED VEHICLE",..: 9 9 13 9 9 9 14 13 13 13 ...
## $ hour : int 2 2 2 2 2 2 1 1 1 1 ...
## $ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
## $ temp_mean_f : num 26.4 26.4 26.4 26.4 26.4 ...
## $ precipitation_mm: num 11.4 11.4 11.4 11.4 11.4 11.4 11.4 11.4 11.4 11.4 ...
## $ median_income : num 82431 82431 74692 82431 47036 ...
## $ pct_renter : num 49.9 49.9 64.4 49.9 76.8 ...
## $ pct_poverty : num 11.6 11.6 18.8 11.6 26.3 ...
cat("\nColumns in model:", colnames(model_data), "\n")
##
## Columns in model: response_time borough complaint_type hour day_of_week month temp_mean_f precipitation_mm median_income pct_renter pct_poverty
# Split data: 80% train, 20% test
set.seed(42)
train_index <- createDataPartition(model_data$response_time, p = 0.8, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
cat("Training set:", nrow(train_data), "rows\n")
## Training set: 42704 rows
cat("Test set:", nrow(test_data), "rows\n")
## Test set: 10674 rows
# Train Random Forest model
rf_model <- randomForest(
response_time ~ .,
data = train_data,
ntree = 200,
mtry = 3, # Increased from 2 since we have more variables
importance = TRUE,
na.action = na.omit
)
# Model summary
print(rf_model)
##
## Call:
## randomForest(formula = response_time ~ ., data = train_data, ntree = 200, mtry = 3, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 200
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 140.1463
## % Var explained: 68.45
# Variable Importance - KEY OUTPUT for interpretation
importance_df <- as.data.frame(importance(rf_model)) %>%
rownames_to_column("Variable") %>%
arrange(desc(`%IncMSE`))
print("Variable Importance (% Increase in MSE when removed):")
## [1] "Variable Importance (% Increase in MSE when removed):"
print(importance_df)
## Variable %IncMSE IncNodePurity
## 1 complaint_type 102.88198 10309297.91
## 2 hour 36.52809 274802.35
## 3 day_of_week 35.68186 881386.58
## 4 temp_mean_f 16.92703 380188.69
## 5 borough 12.93899 183092.15
## 6 median_income 12.45602 150579.17
## 7 pct_poverty 11.85146 75619.51
## 8 precipitation_mm 10.88450 111542.88
## 9 pct_renter 10.25228 67630.57
## 10 month 0.00000 0.00
# Variable Importance Plot
varImpPlot(rf_model, main = "Variable Importance: Predicting 311 Response Time\n(Including Weather & Demographics)")
# Predictions on test set
predictions <- predict(rf_model, newdata = test_data)
# Model Evaluation Metrics
actuals <- test_data$response_time
rmse <- sqrt(mean((predictions - actuals)^2))
mae <- mean(abs(predictions - actuals))
r_squared <- 1 - sum((actuals - predictions)^2) / sum((actuals - mean(actuals))^2)
cat("\n--- Model Performance on Test Set ---\n")
##
## --- Model Performance on Test Set ---
cat("RMSE:", round(rmse, 2), "hours\n")
## RMSE: 11.54 hours
cat("MAE:", round(mae, 2), "hours\n")
## MAE: 6.67 hours
cat("R-squared:", round(r_squared, 4), "\n")
## R-squared: 0.6866
# Actual vs Predicted Plot
results_df <- data.frame(Actual = actuals, Predicted = predictions)
ggplot(results_df, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(
title = "Random Forest: Actual vs Predicted Response Time",
subtitle = paste("R² =", round(r_squared, 3), "| RMSE =", round(rmse, 2), "hours"),
x = "Actual Response Time (Hours)",
y = "Predicted Response Time (Hours)"
) +
theme_minimal()
The Random Forest model reveals that complaint type is the strongest predictor of response time, followed by borough. This suggests that operational workflows—not just geographic location—drive resolution speed.
The model’s R² of 0.7117 indicates that 71% of response time variability can be explained by these factors. The remaining variance likely stems from factors not in our dataset: inspector availability, complaint complexity, or whether violations were found.
RMSE = 11.14 hours Predictions are accurate within approximately 0.46 days(about half a day) sufficient for operational triage and resource planning.
library(knitr)
interpretations <- data.frame(
Interpretation = c(
"Borough disparities",
"Complaint type matters most",
"Demographics drive demand",
"Temporal patterns exist",
"Predictions enable proactivity"
),
Key_Insight = c(
"Manhattan/Brooklyn have highest variability",
"Type predicts response time better than location",
"Renter-heavy, lower-income areas file more complaints",
"Peak demand mid-week, daytime hours",
"71% of variance is explainable"
),
Actionable_Recommendation = c(
"Audit workflows in high-variability boroughs",
"Develop complaint-specific SLAs",
"Proactive inspections in vulnerable neighborhoods",
"Align staffing to demand curves",
"Shift from reactive to predictive operations"
)
)
kable(interpretations,
col.names = c("Interpretation", "Key Insight", "Actionable Recommendation"),
caption = "Key Interpretations from 311 Data Analysis")
| Interpretation | Key Insight | Actionable Recommendation |
|---|---|---|
| Borough disparities | Manhattan/Brooklyn have highest variability | Audit workflows in high-variability boroughs |
| Complaint type matters most | Type predicts response time better than location | Develop complaint-specific SLAs |
| Demographics drive demand | Renter-heavy, lower-income areas file more complaints | Proactive inspections in vulnerable neighborhoods |
| Temporal patterns exist | Peak demand mid-week, daytime hours | Align staffing to demand curves |
| Predictions enable proactivity | 71% of variance is explainable | Shift from reactive to predictive operations |
The biggest challenge I had in completing this project was learning how to build, update, and modify a Shiny app. The process felt similar to developing a User Experience (UX) application, which is quite a challenging feature.
My second biggest challenge was developing meaningful insights from the data. I used three APIs in this project, and only the NYC Open Data API allowed me to get a maximum of 80,000 data points. Since 311 gathers thousands of service calls in a day, 80,000 data points approximates to only a few days of data. With the variety of visualizations, I was able to develop meaningful plots through trial and error.
The third challenge I had while completing this project is that since there were so many dependencies—Shiny app code, Random Forest model, datasets to merge—simple changes meant virtually redoing the project from the start. The process, though tedious, was worthwhile as I developed a comprehensive project that I am proud of.