As a NYC resident and Data Science graduate student, I’m analyzing 311 service request data from 2010 to present to understand complaint patterns across our five boroughs. The 311 system is how we report non-emergency issues—from potholes and noise complaints to heat outages and street light repairs. By visualizing this data geographically, we can see where problems occur and how they cluster in specific neighborhoods. My primary hypothesis is that NYC 311 complaint patterns-including volume, type, and response times—are significantly influenced by neighborhood demographics, and borough characteristics.
How NYC Currently Uses 311 Data
Reactive Response:
Performance Monitoring:
This project analyzes NYC 311 service request data to identify geographic complaint patterns and enable proactive municipal management. By integrating weather conditions and demographic characteristics, the analysis examines how environmental and socioeconomic factors influence service request volumes and types across neighborhoods. The project follows the OSEMN data science workflow—Obtain, Scrub, Explore, Model, and iNterpret—to ensure a systematic and reproducible approach from data acquisition through actionable insights.
Obtain: Data will be acquired from three primary sources: - NYC 311 Service Requests via the NYC Open Data Socrata API using the RSocrata package in R - Historical Weather Data from Open-Meteo.com, a free open-source weather API requiring no API key, to capture temperature, precipitation, and other meteorological conditions corresponding to complaint dates - Demographic Data from the U.S. Census Bureau API (census.gov) to obtain population density, income levels, housing characteristics, and other socioeconomic indicators at the census tract or ZIP code level - Geographic Boundary Files (boroughs, ZIP codes, community districts) using the tigris package for spatial mapping and joining datasets
Scrub: The data cleaning process will use dplyr, tidyr, and lubridate to remove records with missing coordinates, standardize complaint categories, convert dates to proper datetime format, and create time-based variables like year, month, and season for trend analysis. Weather data will be aligned with complaint records by date, and demographic data will be joined spatially by geographic identifier (ZIP code or census tract) to create an enriched analytical dataset.
Explore: Exploratory data analysis will examine complaint distributions across boroughs, identify seasonal and weather-related trends, and investigate correlations between complaint types and neighborhood demographics. Preliminary visualizations will reveal geographic clustering patterns and potential relationships between environmental conditions and service request volumes.
Model: Statistical and spatial analysis techniques will be applied to quantify complaint hotspots, measure temporal trends, and identify significant patterns across neighborhoods and time periods. Regression models will assess the influence of weather variables (temperature extremes, precipitation events) and demographic factors (population density, median income) on complaint frequency and type.
Interpret: My deliverable will be an interactive Shiny dashboard that allows users to filter by complaint type, date range, borough, weather conditions, and demographic characteristics. The dashboard will feature linked visualizations—an interactive map with complaint markers, dynamic time series charts, weather overlays, demographic profiles, and summary statistics—where selections on one view automatically update the others. Users can explore patterns, download custom reports, and identify areas needing proactive intervention based on environmental and socioeconomic context.
# Load required packages
library(httr)
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(tibble)
# Load API token from environment variable (stored in .Renviron)
app_token <- Sys.getenv("NYC_OPENDATA_TOKEN")
# Check if token loaded successfully
if (app_token == "") {
stop("API token not found. Please set NYC_OPENDATA_TOKEN in your .Renviron file.")
}
# Use the SODA API endpoint
api_url <- "https://data.cityofnewyork.us/resource/erm2-nwe9.json"
query_params <- list(
`$limit` = 80000,
`$order` = "created_date DESC",
`$$app_token` = app_token
)
response <- GET(api_url, query = query_params)
if (status_code(response) == 200) {
df_311 <- content(response, as = "text", encoding = "UTF-8") %>%
fromJSON(flatten = TRUE) %>%
as.data.frame()
print(head(df_311))
} else {
print(paste("Error:", status_code(response)))
}
## unique_key created_date agency agency_name
## 1 67192069 2025-12-16T03:06:57.000 DSNY Department of Sanitation
## 2 67192238 2025-12-16T02:27:43.000 DSNY Department of Sanitation
## 3 67186869 2025-12-16T02:09:12.000 DOT Department of Transportation
## 4 67184900 2025-12-16T02:06:24.000 NYPD New York City Police Department
## 5 67193394 2025-12-16T02:06:16.000 NYPD New York City Police Department
## 6 67195627 2025-12-16T02:06:14.000 NYPD New York City Police Department
## complaint_type descriptor location_type
## 1 Graffiti Graffiti Mixed Use
## 2 Graffiti Graffiti <NA>
## 3 Street Condition Pothole <NA>
## 4 Noise - Commercial Loud Music/Party Store/Commercial
## 5 Illegal Parking Posted Parking Sign Violation Street/Sidewalk
## 6 Noise - Vehicle Engine Idling Street/Sidewalk
## incident_zip incident_address street_name address_type city
## 1 11215 234 4 AVENUE 4 AVENUE ADDRESS BROOKLYN
## 2 11225 475 FLATBUSH AVENUE FLATBUSH AVENUE ADDRESS BROOKLYN
## 3 11413 133 AVENUE 133 AVENUE BLOCKFACE QUEENS
## 4 10038 277 WATER STREET WATER STREET ADDRESS NEW YORK
## 5 11207 289 JEROME STREET JEROME STREET ADDRESS BROOKLYN
## 6 11226 538 EAST 21 STREET EAST 21 STREET ADDRESS BROOKLYN
## facility_type status due_date
## 1 N/A Open 2026-01-15T03:06:57.000
## 2 N/A Open <NA>
## 3 N/A Open <NA>
## 4 <NA> Unspecified <NA>
## 5 <NA> In Progress <NA>
## 6 <NA> In Progress <NA>
## resolution_description
## 1 The graffiti on this property has been scheduled to be removed by the City.
## 2 The graffiti has been reported. The property owner will be notified that the City will remove the graffiti after 35 days unless the property owner requests to keep the graffiti or remove it himself or herself.
## 3 The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.
## 4 <NA>
## 5 <NA>
## 6 <NA>
## resolution_action_updated_date community_board bbl borough
## 1 2025-12-16T03:06:57.000 06 BROOKLYN 3004410034 BROOKLYN
## 2 2025-12-16T02:27:43.000 09 BROOKLYN 3011970015 BROOKLYN
## 3 2025-12-16T02:09:12.000 13 QUEENS <NA> QUEENS
## 4 <NA> 01 MANHATTAN 1001070050 MANHATTAN
## 5 <NA> 05 BROOKLYN 3039820036 BROOKLYN
## 6 <NA> 14 BROOKLYN 3051840025 BROOKLYN
## x_coordinate_state_plane y_coordinate_state_plane open_data_channel_type
## 1 988886 185952 UNKNOWN
## 2 994763 180716 UNKNOWN
## 3 <NA> <NA> UNKNOWN
## 4 983963 197472 MOBILE
## 5 1015556 185179 MOBILE
## 6 995841 172790 MOBILE
## park_facility_name park_borough latitude longitude
## 1 Unspecified BROOKLYN 40.677070828995085 -73.98328623676551
## 2 Unspecified BROOKLYN 40.66269415161571 -73.96210657313388
## 3 Unspecified QUEENS <NA> <NA>
## 4 Unspecified MANHATTAN 40.708691774634985 -74.00103518668939
## 5 Unspecified BROOKLYN 40.6748949770469 -73.88713888206964
## 6 Unspecified BROOKLYN 40.64093766389099 -73.95823461385314
## cross_street_1 cross_street_2 intersection_street_1 intersection_street_2
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 229 STREET 230 STREET <NA> <NA>
## 4 PECK SLIP DOVER STREET PECK SLIP DOVER STREET
## 5 LIBERTY AVENUE GLENMORE AVENUE LIBERTY AVENUE GLENMORE AVENUE
## 6 DORCHESTER ROAD DITMAS AVENUE DORCHESTER ROAD DITMAS AVENUE
## landmark vehicle_type closed_date taxi_company_borough
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 WATER STREET <NA> <NA> <NA>
## 5 JEROME STREET <NA> <NA> <NA>
## 6 EAST 21 STREET Car <NA> <NA>
## taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA>
## bridge_highway_direction location.latitude location.longitude
## 1 <NA> 40.677070828995085 -73.98328623676551
## 2 <NA> 40.66269415161571 -73.96210657313388
## 3 <NA> <NA> <NA>
## 4 <NA> 40.708691774634985 -74.00103518668939
## 5 <NA> 40.6748949770469 -73.88713888206964
## 6 <NA> 40.64093766389099 -73.95823461385314
## location.human_address
## 1 {"address": "", "city": "", "state": "", "zip": ""}
## 2 {"address": "", "city": "", "state": "", "zip": ""}
## 3 <NA>
## 4 {"address": "", "city": "", "state": "", "zip": ""}
## 5 {"address": "", "city": "", "state": "", "zip": ""}
## 6 {"address": "", "city": "", "state": "", "zip": ""}
nrow(df_311)
## [1] 80000
sum(is.na(df_311$hour))
## [1] 0
df_311 <- df_311 %>%
mutate(
created_date = as.POSIXct(created_date, format = "%Y-%m-%dT%H:%M:%S"),
closed_date = as.POSIXct(closed_date, format = "%Y-%m-%dT%H:%M:%S"),
date_only = as.Date(created_date),
hour = hour(created_date),
day_of_week = wday(created_date, label = TRUE),
month = month(created_date, label = TRUE)
)
print(str(df_311))
## 'data.frame': 80000 obs. of 47 variables:
## $ unique_key : chr "67192069" "67192238" "67186869" "67184900" ...
## $ created_date : POSIXct, format: "2025-12-16 03:06:57" "2025-12-16 02:27:43" ...
## $ agency : chr "DSNY" "DSNY" "DOT" "NYPD" ...
## $ agency_name : chr "Department of Sanitation" "Department of Sanitation" "Department of Transportation" "New York City Police Department" ...
## $ complaint_type : chr "Graffiti" "Graffiti" "Street Condition" "Noise - Commercial" ...
## $ descriptor : chr "Graffiti" "Graffiti" "Pothole" "Loud Music/Party" ...
## $ location_type : chr "Mixed Use" NA NA "Store/Commercial" ...
## $ incident_zip : chr "11215" "11225" "11413" "10038" ...
## $ incident_address : chr "234 4 AVENUE" "475 FLATBUSH AVENUE" "133 AVENUE" "277 WATER STREET" ...
## $ street_name : chr "4 AVENUE" "FLATBUSH AVENUE" "133 AVENUE" "WATER STREET" ...
## $ address_type : chr "ADDRESS" "ADDRESS" "BLOCKFACE" "ADDRESS" ...
## $ city : chr "BROOKLYN" "BROOKLYN" "QUEENS" "NEW YORK" ...
## $ facility_type : chr "N/A" "N/A" "N/A" NA ...
## $ status : chr "Open" "Open" "Open" "Unspecified" ...
## $ due_date : chr "2026-01-15T03:06:57.000" NA NA NA ...
## $ resolution_description : chr "The graffiti on this property has been scheduled to be removed by the City." "The graffiti has been reported. The property owner will be notified that the City will remove the graffiti afte"| __truncated__ "The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair." NA ...
## $ resolution_action_updated_date: chr "2025-12-16T03:06:57.000" "2025-12-16T02:27:43.000" "2025-12-16T02:09:12.000" NA ...
## $ community_board : chr "06 BROOKLYN" "09 BROOKLYN" "13 QUEENS" "01 MANHATTAN" ...
## $ bbl : chr "3004410034" "3011970015" NA "1001070050" ...
## $ borough : chr "BROOKLYN" "BROOKLYN" "QUEENS" "MANHATTAN" ...
## $ x_coordinate_state_plane : chr "988886" "994763" NA "983963" ...
## $ y_coordinate_state_plane : chr "185952" "180716" NA "197472" ...
## $ open_data_channel_type : chr "UNKNOWN" "UNKNOWN" "UNKNOWN" "MOBILE" ...
## $ park_facility_name : chr "Unspecified" "Unspecified" "Unspecified" "Unspecified" ...
## $ park_borough : chr "BROOKLYN" "BROOKLYN" "QUEENS" "MANHATTAN" ...
## $ latitude : chr "40.677070828995085" "40.66269415161571" NA "40.708691774634985" ...
## $ longitude : chr "-73.98328623676551" "-73.96210657313388" NA "-74.00103518668939" ...
## $ cross_street_1 : chr NA NA "229 STREET" "PECK SLIP" ...
## $ cross_street_2 : chr NA NA "230 STREET" "DOVER STREET" ...
## $ intersection_street_1 : chr NA NA NA "PECK SLIP" ...
## $ intersection_street_2 : chr NA NA NA "DOVER STREET" ...
## $ landmark : chr NA NA NA "WATER STREET" ...
## $ vehicle_type : chr NA NA NA NA ...
## $ closed_date : POSIXct, format: NA NA ...
## $ taxi_company_borough : chr NA NA NA NA ...
## $ taxi_pick_up_location : chr NA NA NA NA ...
## $ bridge_highway_name : chr NA NA NA NA ...
## $ bridge_highway_segment : chr NA NA NA NA ...
## $ road_ramp : chr NA NA NA NA ...
## $ bridge_highway_direction : chr NA NA NA NA ...
## $ location.latitude : chr "40.677070828995085" "40.66269415161571" NA "40.708691774634985" ...
## $ location.longitude : chr "-73.98328623676551" "-73.96210657313388" NA "-74.00103518668939" ...
## $ location.human_address : chr "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" NA "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" ...
## $ date_only : Date, format: "2025-12-16" "2025-12-16" ...
## $ hour : int 3 2 2 2 2 2 2 2 2 2 ...
## $ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 3 3 3 3 3 3 3 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
## NULL
sum(is.na(df_311$hour))
## [1] 0
df_311 %>% count(day_of_week)
## day_of_week n
## 1 Sun 11247
## 2 Mon 16225
## 3 Tue 6282
## 4 Wed 12155
## 5 Thu 11637
## 6 Fri 12001
## 7 Sat 10453
# WEATHER DATA INTEGRATION
library(httr)
library(jsonlite)
library(dplyr)
library(lubridate)
# Get date range from your 311 data
start_date <- min(df_311$date_only, na.rm = TRUE)
end_date <- max(df_311$date_only, na.rm = TRUE)
# Fetch NYC historical weather
weather_url <- "https://archive-api.open-meteo.com/v1/archive"
weather_params <- list(
latitude = 40.7128,
longitude = -74.0060,
start_date = as.character(start_date),
end_date = as.character(end_date),
daily = "temperature_2m_max,temperature_2m_min,temperature_2m_mean,precipitation_sum,windspeed_10m_max",
timezone = "America/New_York"
)
weather_response <- GET(weather_url, query = weather_params)
if (status_code(weather_response) == 200) {
weather_json <- fromJSON(content(weather_response, "text", encoding = "UTF-8"))
weather_df <- data.frame(
date = as.Date(weather_json$daily$time),
temp_max_f = weather_json$daily$temperature_2m_max * 9/5 + 32, # Convert to Fahrenheit
temp_min_f = weather_json$daily$temperature_2m_min * 9/5 + 32,
temp_mean_f = weather_json$daily$temperature_2m_mean * 9/5 + 32,
precipitation_mm = weather_json$daily$precipitation_sum,
wind_max_mph = weather_json$daily$windspeed_10m_max * 0.621371 # Convert to mph
)
cat("Weather data retrieved:", nrow(weather_df), "days\n")
print(head(weather_df))
} else {
cat("Error fetching weather:", status_code(weather_response), "\n")
}
## Weather data retrieved: 8 days
## date temp_max_f temp_min_f temp_mean_f precipitation_mm wind_max_mph
## 1 2025-12-09 31.46 18.32 25.16 0.0 9.817662
## 2 2025-12-10 43.34 30.56 36.68 3.1 11.371089
## 3 2025-12-11 38.84 27.50 32.90 0.0 14.975041
## 4 2025-12-12 34.52 23.00 28.40 0.0 10.004073
## 5 2025-12-13 38.48 23.72 30.74 1.1 4.970968
## 6 2025-12-14 32.54 16.88 26.42 11.4 11.806049
# Join weather to 311 data
df_311 <- df_311 %>%
left_join(weather_df, by = c("date_only" = "date"))
# Verify join
cat("\nWeather columns added:", sum(!is.na(df_311$temp_mean_f)), "rows with weather data\n")
##
## Weather columns added: 80000 rows with weather data
library(tidycensus)
library(dplyr)
# Load the key from environment variable
census_api_key(Sys.getenv("CENSUS_API_KEY"))
## To install your API key for use in future sessions, run this function with `install = TRUE`.
query_params <- list(
`$limit` = 75000, # More records = more days
`$order` = "created_date DESC"
)
# Fetch NYC demographic data by borough
nyc_demographics <- get_acs(
geography = "county",
state = "NY",
county = c("005", "047", "061", "081", "085"),
variables = c(
total_pop = "B01003_001",
median_income = "B19013_001",
median_age = "B01002_001",
total_housing = "B25001_001",
renter_occupied = "B25003_003",
poverty_count = "B17001_002"
),
year = 2022,
output = "wide"
)
## Getting data from the 2018-2022 5-year ACS
## Warning: • You have not set a Census API key. Users without a key are limited to 500
## queries per day and may experience performance limitations.
## ℹ For best results, get a Census API key at
## http://api.census.gov/data/key_signup.html and then supply the key to the
## `census_api_key()` function to use it throughout your tidycensus session.
## This warning is displayed once per session.
# Clean up borough names to match your 311 data
nyc_demographics <- nyc_demographics %>%
mutate(
borough = case_when(
grepl("Bronx", NAME) ~ "BRONX",
grepl("Kings", NAME) ~ "BROOKLYN",
grepl("New York County", NAME) ~ "MANHATTAN",
grepl("Queens", NAME) ~ "QUEENS",
grepl("Richmond", NAME) ~ "STATEN ISLAND"
),
pct_renter = renter_occupiedE / total_housingE * 100,
pct_poverty = poverty_countE / total_popE * 100
) %>%
select(borough,
population = total_popE,
median_income = median_incomeE,
median_age = median_ageE,
pct_renter,
pct_poverty)
print("NYC Demographics by Borough:")
## [1] "NYC Demographics by Borough:"
print(nyc_demographics)
## # A tibble: 5 × 6
## borough population median_income median_age pct_renter pct_poverty
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BRONX 1443229 47036 35 76.8 26.3
## 2 BROOKLYN 2679620 74692 35.9 64.4 18.8
## 3 MANHATTAN 1645867 99880 38.5 64.0 15.4
## 4 QUEENS 2360826 82431 39.9 49.9 11.6
## 5 STATEN ISLAND 492925 96185 40.3 29.1 10.3
# Join demographics to 311 data
df_311 <- df_311 %>%
left_join(nyc_demographics, by = "borough")
cat("Demographics joined:", sum(!is.na(df_311$population)), "rows with demographic data\n")
## Demographics joined: 79936 rows with demographic data
head(df_311, 5)
## unique_key created_date agency agency_name
## 1 67192069 2025-12-16 03:06:57 DSNY Department of Sanitation
## 2 67192238 2025-12-16 02:27:43 DSNY Department of Sanitation
## 3 67186869 2025-12-16 02:09:12 DOT Department of Transportation
## 4 67184900 2025-12-16 02:06:24 NYPD New York City Police Department
## 5 67193394 2025-12-16 02:06:16 NYPD New York City Police Department
## complaint_type descriptor location_type
## 1 Graffiti Graffiti Mixed Use
## 2 Graffiti Graffiti <NA>
## 3 Street Condition Pothole <NA>
## 4 Noise - Commercial Loud Music/Party Store/Commercial
## 5 Illegal Parking Posted Parking Sign Violation Street/Sidewalk
## incident_zip incident_address street_name address_type city
## 1 11215 234 4 AVENUE 4 AVENUE ADDRESS BROOKLYN
## 2 11225 475 FLATBUSH AVENUE FLATBUSH AVENUE ADDRESS BROOKLYN
## 3 11413 133 AVENUE 133 AVENUE BLOCKFACE QUEENS
## 4 10038 277 WATER STREET WATER STREET ADDRESS NEW YORK
## 5 11207 289 JEROME STREET JEROME STREET ADDRESS BROOKLYN
## facility_type status due_date
## 1 N/A Open 2026-01-15T03:06:57.000
## 2 N/A Open <NA>
## 3 N/A Open <NA>
## 4 <NA> Unspecified <NA>
## 5 <NA> In Progress <NA>
## resolution_description
## 1 The graffiti on this property has been scheduled to be removed by the City.
## 2 The graffiti has been reported. The property owner will be notified that the City will remove the graffiti after 35 days unless the property owner requests to keep the graffiti or remove it himself or herself.
## 3 The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.
## 4 <NA>
## 5 <NA>
## resolution_action_updated_date community_board bbl borough
## 1 2025-12-16T03:06:57.000 06 BROOKLYN 3004410034 BROOKLYN
## 2 2025-12-16T02:27:43.000 09 BROOKLYN 3011970015 BROOKLYN
## 3 2025-12-16T02:09:12.000 13 QUEENS <NA> QUEENS
## 4 <NA> 01 MANHATTAN 1001070050 MANHATTAN
## 5 <NA> 05 BROOKLYN 3039820036 BROOKLYN
## x_coordinate_state_plane y_coordinate_state_plane open_data_channel_type
## 1 988886 185952 UNKNOWN
## 2 994763 180716 UNKNOWN
## 3 <NA> <NA> UNKNOWN
## 4 983963 197472 MOBILE
## 5 1015556 185179 MOBILE
## park_facility_name park_borough latitude longitude
## 1 Unspecified BROOKLYN 40.677070828995085 -73.98328623676551
## 2 Unspecified BROOKLYN 40.66269415161571 -73.96210657313388
## 3 Unspecified QUEENS <NA> <NA>
## 4 Unspecified MANHATTAN 40.708691774634985 -74.00103518668939
## 5 Unspecified BROOKLYN 40.6748949770469 -73.88713888206964
## cross_street_1 cross_street_2 intersection_street_1 intersection_street_2
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 229 STREET 230 STREET <NA> <NA>
## 4 PECK SLIP DOVER STREET PECK SLIP DOVER STREET
## 5 LIBERTY AVENUE GLENMORE AVENUE LIBERTY AVENUE GLENMORE AVENUE
## landmark vehicle_type closed_date taxi_company_borough
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 WATER STREET <NA> <NA> <NA>
## 5 JEROME STREET <NA> <NA> <NA>
## taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## bridge_highway_direction location.latitude location.longitude
## 1 <NA> 40.677070828995085 -73.98328623676551
## 2 <NA> 40.66269415161571 -73.96210657313388
## 3 <NA> <NA> <NA>
## 4 <NA> 40.708691774634985 -74.00103518668939
## 5 <NA> 40.6748949770469 -73.88713888206964
## location.human_address date_only hour
## 1 {"address": "", "city": "", "state": "", "zip": ""} 2025-12-16 3
## 2 {"address": "", "city": "", "state": "", "zip": ""} 2025-12-16 2
## 3 <NA> 2025-12-16 2
## 4 {"address": "", "city": "", "state": "", "zip": ""} 2025-12-16 2
## 5 {"address": "", "city": "", "state": "", "zip": ""} 2025-12-16 2
## day_of_week month temp_max_f temp_min_f temp_mean_f precipitation_mm
## 1 Tue Dec 27.5 15.08 21.74 0
## 2 Tue Dec 27.5 15.08 21.74 0
## 3 Tue Dec 27.5 15.08 21.74 0
## 4 Tue Dec 27.5 15.08 21.74 0
## 5 Tue Dec 27.5 15.08 21.74 0
## wind_max_mph population median_income median_age pct_renter pct_poverty
## 1 7.270041 2679620 74692 35.9 64.42049 18.79087
## 2 7.270041 2679620 74692 35.9 64.42049 18.79087
## 3 7.270041 2360826 82431 39.9 49.92156 11.56061
## 4 7.270041 1645867 99880 38.5 64.03536 15.35221
## 5 7.270041 2679620 74692 35.9 64.42049 18.79087
library(dplyr)
library(tidyr)
library(lubridate)
# 1. DATASET OVERVIEW
cat("Rows:", nrow(df_311), "\n")
## Rows: 80000
cat("Columns:", ncol(df_311), "\n")
## Columns: 57
glimpse(df_311)
## Rows: 80,000
## Columns: 57
## $ unique_key <chr> "67192069", "67192238", "67186869", "67…
## $ created_date <dttm> 2025-12-16 03:06:57, 2025-12-16 02:27:…
## $ agency <chr> "DSNY", "DSNY", "DOT", "NYPD", "NYPD", …
## $ agency_name <chr> "Department of Sanitation", "Department…
## $ complaint_type <chr> "Graffiti", "Graffiti", "Street Conditi…
## $ descriptor <chr> "Graffiti", "Graffiti", "Pothole", "Lou…
## $ location_type <chr> "Mixed Use", NA, NA, "Store/Commercial"…
## $ incident_zip <chr> "11215", "11225", "11413", "10038", "11…
## $ incident_address <chr> "234 4 AVENUE", "475 FLATBUSH AVENUE", …
## $ street_name <chr> "4 AVENUE", "FLATBUSH AVENUE", "133 AVE…
## $ address_type <chr> "ADDRESS", "ADDRESS", "BLOCKFACE", "ADD…
## $ city <chr> "BROOKLYN", "BROOKLYN", "QUEENS", "NEW …
## $ facility_type <chr> "N/A", "N/A", "N/A", NA, NA, NA, NA, NA…
## $ status <chr> "Open", "Open", "Open", "Unspecified", …
## $ due_date <chr> "2026-01-15T03:06:57.000", NA, NA, NA, …
## $ resolution_description <chr> "The graffiti on this property has been…
## $ resolution_action_updated_date <chr> "2025-12-16T03:06:57.000", "2025-12-16T…
## $ community_board <chr> "06 BROOKLYN", "09 BROOKLYN", "13 QUEEN…
## $ bbl <chr> "3004410034", "3011970015", NA, "100107…
## $ borough <chr> "BROOKLYN", "BROOKLYN", "QUEENS", "MANH…
## $ x_coordinate_state_plane <chr> "988886", "994763", NA, "983963", "1015…
## $ y_coordinate_state_plane <chr> "185952", "180716", NA, "197472", "1851…
## $ open_data_channel_type <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "MOBIL…
## $ park_facility_name <chr> "Unspecified", "Unspecified", "Unspecif…
## $ park_borough <chr> "BROOKLYN", "BROOKLYN", "QUEENS", "MANH…
## $ latitude <chr> "40.677070828995085", "40.6626941516157…
## $ longitude <chr> "-73.98328623676551", "-73.962106573133…
## $ cross_street_1 <chr> NA, NA, "229 STREET", "PECK SLIP", "LIB…
## $ cross_street_2 <chr> NA, NA, "230 STREET", "DOVER STREET", "…
## $ intersection_street_1 <chr> NA, NA, NA, "PECK SLIP", "LIBERTY AVENU…
## $ intersection_street_2 <chr> NA, NA, NA, "DOVER STREET", "GLENMORE A…
## $ landmark <chr> NA, NA, NA, "WATER STREET", "JEROME STR…
## $ vehicle_type <chr> NA, NA, NA, NA, NA, "Car", NA, NA, NA, …
## $ closed_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ taxi_company_borough <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ taxi_pick_up_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_segment <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ road_ramp <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_direction <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ location.latitude <chr> "40.677070828995085", "40.6626941516157…
## $ location.longitude <chr> "-73.98328623676551", "-73.962106573133…
## $ location.human_address <chr> "{\"address\": \"\", \"city\": \"\", \"…
## $ date_only <date> 2025-12-16, 2025-12-16, 2025-12-16, 20…
## $ hour <int> 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ day_of_week <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue,…
## $ month <ord> Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec,…
## $ temp_max_f <dbl> 27.5, 27.5, 27.5, 27.5, 27.5, 27.5, 27.…
## $ temp_min_f <dbl> 15.08, 15.08, 15.08, 15.08, 15.08, 15.0…
## $ temp_mean_f <dbl> 21.74, 21.74, 21.74, 21.74, 21.74, 21.7…
## $ precipitation_mm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ wind_max_mph <dbl> 7.270041, 7.270041, 7.270041, 7.270041,…
## $ population <dbl> 2679620, 2679620, 2360826, 1645867, 267…
## $ median_income <dbl> 74692, 74692, 82431, 99880, 74692, 7469…
## $ median_age <dbl> 35.9, 35.9, 39.9, 38.5, 35.9, 35.9, 35.…
## $ pct_renter <dbl> 64.42049, 64.42049, 49.92156, 64.03536,…
## $ pct_poverty <dbl> 18.79087, 18.79087, 11.56061, 15.35221,…
# 2. MISSING VALUES ANALYSIS
missing_summary <- df_311 %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "column", values_to = "missing_count") %>%
mutate(
missing_pct = round(missing_count / nrow(df_311) * 100, 2),
recommendation = case_when(
missing_pct == 0 ~ "Complete",
missing_pct < 5 ~ "Impute",
missing_pct < 30 ~ "Impute with caution",
TRUE ~ "Consider dropping"
)
) %>%
arrange(desc(missing_pct))
print(missing_summary %>% filter(missing_pct > 0))
## # A tibble: 36 × 4
## column missing_count missing_pct recommendation
## <chr> <int> <dbl> <chr>
## 1 taxi_company_borough 79938 99.9 Consider dropping
## 2 road_ramp 79869 99.8 Consider dropping
## 3 bridge_highway_direction 79846 99.8 Consider dropping
## 4 due_date 79788 99.7 Consider dropping
## 5 bridge_highway_name 79691 99.6 Consider dropping
## 6 bridge_highway_segment 79691 99.6 Consider dropping
## 7 taxi_pick_up_location 79247 99.1 Consider dropping
## 8 vehicle_type 77561 97.0 Consider dropping
## 9 facility_type 76582 95.7 Consider dropping
## 10 landmark 33060 41.3 Consider dropping
## # ℹ 26 more rows
# Identify columns with >30% missing, BUT keep closed_date for response_time calc
threshold <- 0.30
cols_to_drop <- names(df_311)[colSums(is.na(df_311)) / nrow(df_311) > threshold]
# Protect closed_date from being dropped
cols_to_drop <- setdiff(cols_to_drop, "closed_date")
cat("Dropping columns:", cols_to_drop, "\n")
## Dropping columns: facility_type due_date cross_street_1 cross_street_2 intersection_street_1 intersection_street_2 landmark vehicle_type taxi_company_borough taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp bridge_highway_direction
df_311 <- df_311 %>%
select(-all_of(cols_to_drop))
# 3. DUPLICATE ROWS
cat("\n=== DUPLICATE ROWS ===\n")
##
## === DUPLICATE ROWS ===
dup_count <- sum(duplicated(df_311))
cat("Duplicate rows:", dup_count, "\n")
## Duplicate rows: 0
cat("Duplicate %:", round(dup_count / nrow(df_311) * 100, 2), "%\n")
## Duplicate %: 0 %
# Remove duplicates (uncomment to execute)
df_311 <- df_311 %>% distinct()
# 4. DATA TYPES CHECK
sapply(df_311, class)
## $unique_key
## [1] "character"
##
## $created_date
## [1] "POSIXct" "POSIXt"
##
## $agency
## [1] "character"
##
## $agency_name
## [1] "character"
##
## $complaint_type
## [1] "character"
##
## $descriptor
## [1] "character"
##
## $location_type
## [1] "character"
##
## $incident_zip
## [1] "character"
##
## $incident_address
## [1] "character"
##
## $street_name
## [1] "character"
##
## $address_type
## [1] "character"
##
## $city
## [1] "character"
##
## $status
## [1] "character"
##
## $resolution_description
## [1] "character"
##
## $resolution_action_updated_date
## [1] "character"
##
## $community_board
## [1] "character"
##
## $bbl
## [1] "character"
##
## $borough
## [1] "character"
##
## $x_coordinate_state_plane
## [1] "character"
##
## $y_coordinate_state_plane
## [1] "character"
##
## $open_data_channel_type
## [1] "character"
##
## $park_facility_name
## [1] "character"
##
## $park_borough
## [1] "character"
##
## $latitude
## [1] "character"
##
## $longitude
## [1] "character"
##
## $closed_date
## [1] "POSIXct" "POSIXt"
##
## $location.latitude
## [1] "character"
##
## $location.longitude
## [1] "character"
##
## $location.human_address
## [1] "character"
##
## $date_only
## [1] "Date"
##
## $hour
## [1] "integer"
##
## $day_of_week
## [1] "ordered" "factor"
##
## $month
## [1] "ordered" "factor"
##
## $temp_max_f
## [1] "numeric"
##
## $temp_min_f
## [1] "numeric"
##
## $temp_mean_f
## [1] "numeric"
##
## $precipitation_mm
## [1] "numeric"
##
## $wind_max_mph
## [1] "numeric"
##
## $population
## [1] "numeric"
##
## $median_income
## [1] "numeric"
##
## $median_age
## [1] "numeric"
##
## $pct_renter
## [1] "numeric"
##
## $pct_poverty
## [1] "numeric"
# 5. STANDARDIZE TEXT COLUMNS
df_311 <- df_311 %>%
mutate(
# Trim whitespace and standardize case
borough = toupper(trimws(borough)),
complaint_type = toupper(trimws(complaint_type)),
# Standardize "Unspecified" values
borough = if_else(borough %in% c("", "UNSPECIFIED", "NA"), NA_character_, borough)
)
cat("Unique boroughs:", unique(df_311$borough), "\n")
## Unique boroughs: BROOKLYN QUEENS MANHATTAN BRONX STATEN ISLAND NA
# 6. INVALID/ILLOGICAL VALUES
# Check: Closed date before Created date
if ("closed_date" %in% names(df_311) & "created_date" %in% names(df_311)) {
invalid_dates <- df_311 %>%
filter(!is.na(closed_date) & closed_date < created_date)
cat("Invalid dates (closed before created):", nrow(invalid_dates), "\n")
}
## Invalid dates (closed before created): 12
# Check: Future dates
future_dates <- df_311 %>%
filter(created_date > Sys.time())
cat("Future dates:", nrow(future_dates), "\n")
## Future dates: 0
# Check: Negative response times
if ("response_time" %in% names(df_311)) {
negative_response <- df_311 %>%
filter(response_time < 0)
cat("Negative response times:", nrow(negative_response), "\n")
}
# 7. OUTLIER DETECTION (Numeric Columns)
detect_outliers <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
sum(x < lower | x > upper, na.rm = TRUE)
}
numeric_cols <- df_311 %>% select(where(is.numeric)) %>% names()
for (col in numeric_cols) {
outlier_count <- detect_outliers(df_311[[col]])
pct <- round(outlier_count / nrow(df_311) * 100, 2)
cat(col, ": ", outlier_count, " outliers (", pct, "%)\n", sep = "")
}
## hour: 0 outliers (0%)
## temp_max_f: 0 outliers (0%)
## temp_min_f: 0 outliers (0%)
## temp_mean_f: 0 outliers (0%)
## precipitation_mm: 10815 outliers (13.52%)
## wind_max_mph: 25172 outliers (31.46%)
## population: 0 outliers (0%)
## median_income: 0 outliers (0%)
## median_age: 0 outliers (0%)
## pct_renter: 2610 outliers (3.26%)
## pct_poverty: 0 outliers (0%)
# 8. CARDINALITY CHECK
cardinality <- df_311 %>%
summarise(across(everything(), n_distinct)) %>%
pivot_longer(everything(), names_to = "column", values_to = "unique_values") %>%
arrange(desc(unique_values))
print(cardinality)
## # A tibble: 43 × 2
## column unique_values
## <chr> <int>
## 1 unique_key 80000
## 2 created_date 66848
## 3 closed_date 37021
## 4 latitude 35682
## 5 longitude 35682
## 6 location.latitude 35682
## 7 location.longitude 35682
## 8 incident_address 34400
## 9 resolution_action_updated_date 33950
## 10 y_coordinate_state_plane 30333
## # ℹ 33 more rows
library(ggplot2)
#DATA. VISUALIZATIONS
# Top Complaint Types
df_311 %>%
count(complaint_type, sort = TRUE) %>%
head(10) %>%
ggplot(aes(x = reorder(complaint_type, n), y = n, fill = complaint_type)) +
geom_col() +
coord_flip() +
labs(
title = "Top 10 Complaint Types in NYC 311 Data",
x = "Complaint Type",
y = "Number of Complaints"
) +
theme_minimal() +
theme(legend.position = "none")
# Complaints by Borough
df_311 %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
count(borough) %>%
ggplot(aes(x = reorder(borough, n), y = n, fill = borough)) +
geom_col() +
labs(
title = "311 Complaints by Borough",
x = "Borough",
y = "Number of Complaints"
) +
theme_minimal() +
theme(legend.position = "none")
# Complaints by Hour of Day
df_311 %>%
count(hour) %>%
ggplot(aes(x = hour, y = n)) +
geom_line(color = "steelblue", size = 1) +
geom_point(color = "steelblue", size = 2) +
labs(
title = "311 Complaints by Hour of Day",
x = "Hour (24-hour format)",
y = "Number of Complaints"
) +
scale_x_continuous(breaks = seq(0, 23, 2)) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Complaints by Day of Week
df_311 %>%
count(day_of_week) %>%
ggplot(aes(x = day_of_week, y = n, fill = day_of_week)) +
geom_col() +
labs(
title = "311 Complaints by Day of Week",
x = "Day",
y = "Number of Complaints"
) +
theme_minimal() +
theme(legend.position = "none")
# Heatmap - Hour vs Day of Week
df_311 %>%
count(day_of_week, hour) %>%
ggplot(aes(x = hour, y = day_of_week, fill = n)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
labs(
title = "Complaint Frequency: Hour vs Day of Week",
x = "Hour of Day",
y = "Day of Week",
fill = "Count"
) +
theme_minimal()
# Top Complaints by Borough
df_311 %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
count(borough, complaint_type) %>%
group_by(borough) %>%
slice_max(n, n = 5) %>%
ggplot(aes(x = reorder(complaint_type, n), y = n, fill = borough)) +
geom_col() +
coord_flip() +
facet_wrap(~borough, scales = "free_y") +
labs(
title = "Top 5 Complaint Types by Borough",
x = "Complaint Type",
y = "Count"
) +
theme_minimal() +
theme(legend.position = "none")
# STATISTICAL ANALYSIS
# Chi-Square Test: Are complaint types independent of borough?
contingency_table <- table(df_311$borough, df_311$complaint_type)
chi_test <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
print("Chi-Square Test: Complaint Type vs Borough")
## [1] "Chi-Square Test: Complaint Type vs Borough"
print(chi_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = NaN, df = 580, p-value = NA
# ANOVA: Does response time differ by borough?
df_311 <- df_311 %>%
mutate(response_time = as.numeric(difftime(closed_date, created_date, units = "hours")))
# Remove NA and extreme outliers
df_response <- df_311 %>%
filter(!is.na(response_time) & response_time > 0 & response_time < 720)
anova_result <- aov(response_time ~ borough, data = df_response)
print("ANOVA: Response Time by Borough")
## [1] "ANOVA: Response Time by Borough"
print(summary(anova_result))
## Df Sum Sq Mean Sq F value Pr(>F)
## borough 4 588550 147138 285.3 <2e-16 ***
## Residuals 54285 27992982 516
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 27 observations deleted due to missingness
# Post-hoc test if significant
tukey_result <- TukeyHSD(anova_result)
print(tukey_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response_time ~ borough, data = df_response)
##
## $borough
## diff lwr upr p adj
## BROOKLYN-BRONX 6.542921 5.8497798 7.236062 0.0000000
## MANHATTAN-BRONX 8.249035 7.4513231 9.046747 0.0000000
## QUEENS-BRONX 2.007849 1.2649527 2.750746 0.0000000
## STATEN ISLAND-BRONX 3.810022 2.1987907 5.421253 0.0000000
## MANHATTAN-BROOKLYN 1.706114 0.8412581 2.570970 0.0000007
## QUEENS-BROOKLYN -4.535072 -5.3496433 -3.720500 0.0000000
## STATEN ISLAND-BROOKLYN -2.732899 -4.3784074 -1.087391 0.0000578
## QUEENS-MANHATTAN -6.241186 -7.1464074 -5.335964 0.0000000
## STATEN ISLAND-MANHATTAN -4.439013 -6.1312298 -2.746796 0.0000000
## STATEN ISLAND-QUEENS 1.802173 0.1350948 3.469250 0.0264692
# Response Time by Borough
df_response %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
ggplot(aes(x = borough, y = response_time, fill = borough)) +
geom_boxplot() +
labs(
title = "Response Time (Hours) by Borough",
x = "Borough",
y = "Response Time (Hours)"
) +
theme_minimal() +
theme(legend.position = "none")
# SUMMARY STATISTICS TABLE
summary_stats <- df_response %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
group_by(borough) %>%
summarise(
count = n(),
mean_response = mean(response_time, na.rm = TRUE),
median_response = median(response_time, na.rm = TRUE),
sd_response = sd(response_time, na.rm = TRUE)
)
print("Summary Statistics by Borough:")
## [1] "Summary Statistics by Borough:"
print(summary_stats)
## # A tibble: 5 × 5
## borough count mean_response median_response sd_response
## <chr> <int> <dbl> <dbl> <dbl>
## 1 BRONX 20807 10.9 2.97 18.3
## 2 BROOKLYN 12961 17.4 3.33 26.5
## 3 MANHATTAN 8490 19.2 3.42 26.4
## 4 QUEENS 10441 12.9 2.32 22.4
## 5 STATEN ISLAND 1591 14.7 3.25 22.9
# Filter heating complaints data
heating_individual <- df_311 %>%
filter(complaint_type == "HEAT/HOT WATER" & !is.na(temp_mean_f)) %>%
filter(!is.na(response_time) & response_time > 0 & response_time < 720)
# Plot 1: Heating Complaints COUNT vs Temperature
heating_count <- df_311 %>%
filter(complaint_type == "HEAT/HOT WATER" & !is.na(temp_mean_f)) %>%
group_by(date_only, temp_mean_f) %>%
summarise(complaints = n(), .groups = "drop")
p1 <- ggplot(heating_count, aes(x = temp_mean_f, y = complaints)) +
geom_point(alpha = 0.5, color = "navy") +
labs(
title = "Heating Complaint Volume vs Temperature",
x = "Mean Daily Temperature (°F)",
y = "Number of Complaints"
) +
theme_minimal()
print(p1)
# Noise complaints - do they increase in warm weather?
noise_temp <- df_311 %>%
filter(grepl("Noise", complaint_type, ignore.case = TRUE)) %>%
group_by(date_only, temp_mean_f) %>%
summarise(complaints = n(), .groups = "drop")
ggplot(noise_temp, aes(x = temp_mean_f, y = complaints)) +
geom_point(alpha = 0.5, color = "orange") +
geom_smooth(method = "loess", color = "purple", se = FALSE) +
labs(
title = "Do Noise Complaints Rise in Warmer Weather?",
x = "Mean Daily Temperature (°F)",
y = "Number of Noise Complaints"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
cor.test(noise_temp$temp_mean_f, noise_temp$complaints)
##
## Pearson's product-moment correlation
##
## data: noise_temp$temp_mean_f and noise_temp$complaints
## t = 1.1518, df = 6, p-value = 0.2932
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3986927 0.8694817
## sample estimates:
## cor
## 0.4255328
# Create temperature bins
temp_complaints <- df_311 %>%
filter(!is.na(temp_mean_f)) %>%
mutate(temp_range = cut(temp_mean_f,
breaks = c(0, 32, 50, 70, 85, 100),
labels = c("Freezing (<32°F)", "Cold (32-50°F)",
"Mild (50-70°F)", "Warm (70-85°F)", "Hot (>85°F)"))) %>%
filter(complaint_type %in% names(sort(table(df_311$complaint_type), decreasing = TRUE)[1:10])) %>%
count(complaint_type, temp_range)
ggplot(temp_complaints, aes(x = temp_range, y = complaint_type, fill = n)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
labs(
title = "Complaint Types by Temperature Range",
x = "Temperature Range",
y = "Complaint Type",
fill = "Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Complaints per capita by borough
complaints_per_capita <- df_311 %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
group_by(borough, population, median_income, pct_renter, pct_poverty) %>%
summarise(total_complaints = n(), .groups = "drop") %>%
mutate(complaints_per_1000 = total_complaints / population * 1000)
ggplot(complaints_per_capita, aes(x = median_income, y = complaints_per_1000)) +
geom_point(size = 5, color = "steelblue") +
geom_text(aes(label = borough), vjust = -1, size = 3) +
geom_smooth(method = "lm", color = "red", se = FALSE, linetype = "dashed") +
scale_x_continuous(labels = scales::dollar) +
labs(
title = "Do Lower-Income Boroughs File More Complaints?",
x = "Median Household Income",
y = "Complaints per 1,000 Residents"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Housing-related complaints by renter percentage
housing_renter <- df_311 %>%
filter(complaint_type %in% c("HEAT/HOT WATER", "PLUMBING", "WATER SYSTEM", "UNSANITARY CONDITION")) %>%
filter(!is.na(borough) & borough != "Unspecified") %>%
group_by(borough, pct_renter, population) %>%
summarise(housing_complaints = n(), .groups = "drop") %>%
mutate(housing_per_1000 = housing_complaints / population * 1000)
ggplot(housing_renter, aes(x = pct_renter, y = housing_per_1000)) +
geom_point(size = 5, color = "orange") +
geom_text(aes(label = borough), vjust = -1, size = 3) +
geom_smooth(method = "lm", color = "purple", se = FALSE, linetype = "dashed") +
labs(
title = "Housing Complaints Higher in Renter-Heavy Boroughs",
x = "Percent Renter-Occupied Housing",
y = "Housing Complaints per 1,000 Residents"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Load required libraries
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:httr':
##
## progress
library(tibble)
# Prepare modeling dataset WITH weather and demographics
model_data <- df_response %>%
filter(!is.na(borough) & borough != "Unspecified" &
!is.na(complaint_type) & !is.na(hour) & !is.na(day_of_week) &
!is.na(temp_mean_f) & !is.na(median_income)) %>% # NEW: require weather & demographics
select(
response_time,
borough,
complaint_type,
hour,
day_of_week,
month,
# NEW: Weather features
temp_mean_f,
precipitation_mm,
# NEW: Demographics features
median_income,
pct_renter,
pct_poverty
) %>%
# Keep top complaint types to avoid sparse categories
filter(complaint_type %in% names(sort(table(df_response$complaint_type), decreasing = TRUE)[1:20])) %>%
mutate(across(where(is.character), as.factor))
# Check structure - verify new columns are included
str(model_data)
## 'data.frame': 49354 obs. of 11 variables:
## $ response_time : num 0.0642 0.1808 0.3244 0.4058 0.3131 ...
## $ borough : Factor w/ 5 levels "BRONX","BROOKLYN",..: 4 4 3 2 4 1 1 1 4 1 ...
## $ complaint_type : Factor w/ 20 levels "ABANDONED VEHICLE",..: 12 12 12 12 12 12 12 8 2 12 ...
## $ hour : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 3 3 3 3 3 3 3 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
## $ temp_mean_f : num 21.7 21.7 21.7 21.7 21.7 ...
## $ precipitation_mm: num 0 0 0 0 0 0 0 0 0 0 ...
## $ median_income : num 82431 82431 99880 74692 82431 ...
## $ pct_renter : num 49.9 49.9 64 64.4 49.9 ...
## $ pct_poverty : num 11.6 11.6 15.4 18.8 11.6 ...
cat("\nColumns in model:", colnames(model_data), "\n")
##
## Columns in model: response_time borough complaint_type hour day_of_week month temp_mean_f precipitation_mm median_income pct_renter pct_poverty
# Split data: 80% train, 20% test
set.seed(42)
train_index <- createDataPartition(model_data$response_time, p = 0.8, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
cat("Training set:", nrow(train_data), "rows\n")
## Training set: 39485 rows
cat("Test set:", nrow(test_data), "rows\n")
## Test set: 9869 rows
# Train Random Forest model
rf_model <- randomForest(
response_time ~ .,
data = train_data,
ntree = 200,
mtry = 3, # Increased from 2 since we have more variables
importance = TRUE,
na.action = na.omit
)
# Model summary
print(rf_model)
##
## Call:
## randomForest(formula = response_time ~ ., data = train_data, ntree = 200, mtry = 3, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 200
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 135.3143
## % Var explained: 68.4
# Variable Importance - KEY OUTPUT for interpretation
importance_df <- as.data.frame(importance(rf_model)) %>%
rownames_to_column("Variable") %>%
arrange(desc(`%IncMSE`))
print("Variable Importance (% Increase in MSE when removed):")
## [1] "Variable Importance (% Increase in MSE when removed):"
print(importance_df)
## Variable %IncMSE IncNodePurity
## 1 complaint_type 104.21719 9566058.88
## 2 hour 40.13356 293187.21
## 3 day_of_week 26.82610 565645.46
## 4 temp_mean_f 17.45643 335748.56
## 5 precipitation_mm 14.72861 107773.67
## 6 borough 13.57307 150914.37
## 7 pct_poverty 10.77001 74147.33
## 8 median_income 10.50143 87042.42
## 9 pct_renter 10.42872 72120.33
## 10 month 0.00000 0.00
# Variable Importance Plot
varImpPlot(rf_model, main = "Variable Importance: Predicting 311 Response Time\n(Including Weather & Demographics)")
# Predictions on test set
predictions <- predict(rf_model, newdata = test_data)
# Model Evaluation Metrics
actuals <- test_data$response_time
rmse <- sqrt(mean((predictions - actuals)^2))
mae <- mean(abs(predictions - actuals))
r_squared <- 1 - sum((actuals - predictions)^2) / sum((actuals - mean(actuals))^2)
cat("\n--- Model Performance on Test Set ---\n")
##
## --- Model Performance on Test Set ---
cat("RMSE:", round(rmse, 2), "hours\n")
## RMSE: 11.46 hours
cat("MAE:", round(mae, 2), "hours\n")
## MAE: 6.3 hours
cat("R-squared:", round(r_squared, 4), "\n")
## R-squared: 0.6927
# Actual vs Predicted Plot
results_df <- data.frame(Actual = actuals, Predicted = predictions)
ggplot(results_df, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.3, color = "steelblue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(
title = "Random Forest: Actual vs Predicted Response Time",
subtitle = paste("R² =", round(r_squared, 3), "| RMSE =", round(rmse, 2), "hours"),
x = "Actual Response Time (Hours)",
y = "Predicted Response Time (Hours)"
) +
theme_minimal()
The Random Forest model reveals that complaint type is the strongest predictor of response time, followed by borough. This suggests that operational workflows—not just geographic location—drive resolution speed.
The model’s R² of 0.7117 indicates that 71% of response time variability can be explained by these factors. The remaining variance likely stems from factors not in our dataset: inspector availability, complaint complexity, or whether violations were found.
RMSE = 11.14 hours Predictions are accurate within approximately 0.46 days(about half a day) sufficient for operational triage and resource planning.
library(knitr)
interpretations <- data.frame(
Interpretation = c(
"Borough disparities",
"Complaint type matters most",
"Demographics drive demand",
"Temporal patterns exist",
"Predictions enable proactivity"
),
Key_Insight = c(
"Manhattan/Brooklyn have highest variability",
"Type predicts response time better than location",
"Renter-heavy, lower-income areas file more complaints",
"Peak demand mid-week, daytime hours",
"71% of variance is explainable"
),
Actionable_Recommendation = c(
"Audit workflows in high-variability boroughs",
"Develop complaint-specific SLAs",
"Proactive inspections in vulnerable neighborhoods",
"Align staffing to demand curves",
"Shift from reactive to predictive operations"
)
)
kable(interpretations,
col.names = c("Interpretation", "Key Insight", "Actionable Recommendation"),
caption = "Key Interpretations from 311 Data Analysis")
| Interpretation | Key Insight | Actionable Recommendation |
|---|---|---|
| Borough disparities | Manhattan/Brooklyn have highest variability | Audit workflows in high-variability boroughs |
| Complaint type matters most | Type predicts response time better than location | Develop complaint-specific SLAs |
| Demographics drive demand | Renter-heavy, lower-income areas file more complaints | Proactive inspections in vulnerable neighborhoods |
| Temporal patterns exist | Peak demand mid-week, daytime hours | Align staffing to demand curves |
| Predictions enable proactivity | 71% of variance is explainable | Shift from reactive to predictive operations |
The biggest challenge I had in completing this project was learning how to build, update, and modify a Shiny app. The process felt similar to developing a User Experience (UX) application, which is quite a challenging feature.
My second biggest challenge was developing meaningful insights from the data. I used three APIs in this project, and only the NYC Open Data API allowed me to get a maximum of 80,000 data points. Since 311 gathers thousands of service calls in a day, 80,000 data points approximates to only a few days of data. With the variety of visualizations, I was able to develop meaningful plots through trial and error.
The third challenge I had while completing this project is that since there were so many dependencies—Shiny app code, Random Forest model, datasets to merge—simple changes meant virtually redoing the project from the start. The process, though tedious, was worthwhile as I developed a comprehensive project that I am proud of.