As a NYC resident and Data Science graduate student, I’m analyzing 311 service request data from 2010 to present to understand complaint patterns across our five boroughs. The 311 system is how we report non-emergency issues—from potholes and noise complaints to heat outages and street light repairs. By visualizing this data geographically, we can see where problems occur and how they cluster in specific neighborhoods. My primary hypothesis is that NYC 311 complaint patterns-including volume, type, and response times—are significantly influenced by neighborhood demographics, and borough characteristics.

How NYC Currently Uses 311 Data

Reactive Response:

Performance Monitoring:


Introduction

This project analyzes NYC 311 service request data to identify geographic complaint patterns and enable proactive municipal management. By integrating weather conditions and demographic characteristics, the analysis examines how environmental and socioeconomic factors influence service request volumes and types across neighborhoods. The project follows the OSEMN data science workflow—Obtain, Scrub, Explore, Model, and iNterpret—to ensure a systematic and reproducible approach from data acquisition through actionable insights.

Obtain: Data will be acquired from three primary sources: - NYC 311 Service Requests via the NYC Open Data Socrata API using the RSocrata package in R - Historical Weather Data from Open-Meteo.com, a free open-source weather API requiring no API key, to capture temperature, precipitation, and other meteorological conditions corresponding to complaint dates - Demographic Data from the U.S. Census Bureau API (census.gov) to obtain population density, income levels, housing characteristics, and other socioeconomic indicators at the census tract or ZIP code level - Geographic Boundary Files (boroughs, ZIP codes, community districts) using the tigris package for spatial mapping and joining datasets

Scrub: The data cleaning process will use dplyr, tidyr, and lubridate to remove records with missing coordinates, standardize complaint categories, convert dates to proper datetime format, and create time-based variables like year, month, and season for trend analysis. Weather data will be aligned with complaint records by date, and demographic data will be joined spatially by geographic identifier (ZIP code or census tract) to create an enriched analytical dataset.

Explore: Exploratory data analysis will examine complaint distributions across boroughs, identify seasonal and weather-related trends, and investigate correlations between complaint types and neighborhood demographics. Preliminary visualizations will reveal geographic clustering patterns and potential relationships between environmental conditions and service request volumes.

Model: Statistical and spatial analysis techniques will be applied to quantify complaint hotspots, measure temporal trends, and identify significant patterns across neighborhoods and time periods. Regression models will assess the influence of weather variables (temperature extremes, precipitation events) and demographic factors (population density, median income) on complaint frequency and type.

Interpret: My deliverable will be an interactive Shiny dashboard that allows users to filter by complaint type, date range, borough, weather conditions, and demographic characteristics. The dashboard will feature linked visualizations—an interactive map with complaint markers, dynamic time series charts, weather overlays, demographic profiles, and summary statistics—where selections on one view automatically update the others. Users can explore patterns, download custom reports, and identify areas needing proactive intervention based on environmental and socioeconomic context.


Obtain Data

Obtain data from NYC Open Data via the Socrata REST AP

# Load required packages
library(httr)
library(jsonlite)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(tibble)


# Load API token from environment variable (stored in .Renviron)
app_token <- Sys.getenv("NYC_OPENDATA_TOKEN")

# Check if token loaded successfully
if (app_token == "") {
  stop("API token not found. Please set NYC_OPENDATA_TOKEN in your .Renviron file.")
}

# Use the SODA API endpoint
api_url <- "https://data.cityofnewyork.us/resource/erm2-nwe9.json"

query_params <- list(
  `$limit` = 80000,
  `$order` = "created_date DESC",
  `$$app_token` = app_token
)

response <- GET(api_url, query = query_params)

if (status_code(response) == 200) {
  df_311 <- content(response, as = "text", encoding = "UTF-8") %>%
    fromJSON(flatten = TRUE) %>%
    as.data.frame()
  
  print(head(df_311))
} else {
  print(paste("Error:", status_code(response)))
}
##   unique_key            created_date agency                     agency_name
## 1   67192069 2025-12-16T03:06:57.000   DSNY        Department of Sanitation
## 2   67192238 2025-12-16T02:27:43.000   DSNY        Department of Sanitation
## 3   67186869 2025-12-16T02:09:12.000    DOT    Department of Transportation
## 4   67184900 2025-12-16T02:06:24.000   NYPD New York City Police Department
## 5   67193394 2025-12-16T02:06:16.000   NYPD New York City Police Department
## 6   67195627 2025-12-16T02:06:14.000   NYPD New York City Police Department
##       complaint_type                    descriptor    location_type
## 1           Graffiti                      Graffiti        Mixed Use
## 2           Graffiti                      Graffiti             <NA>
## 3   Street Condition                       Pothole             <NA>
## 4 Noise - Commercial              Loud Music/Party Store/Commercial
## 5    Illegal Parking Posted Parking Sign Violation  Street/Sidewalk
## 6    Noise - Vehicle                 Engine Idling  Street/Sidewalk
##   incident_zip     incident_address      street_name address_type     city
## 1        11215         234 4 AVENUE         4 AVENUE      ADDRESS BROOKLYN
## 2        11225  475 FLATBUSH AVENUE  FLATBUSH AVENUE      ADDRESS BROOKLYN
## 3        11413           133 AVENUE       133 AVENUE    BLOCKFACE   QUEENS
## 4        10038     277 WATER STREET     WATER STREET      ADDRESS NEW YORK
## 5        11207    289 JEROME STREET    JEROME STREET      ADDRESS BROOKLYN
## 6        11226 538 EAST   21 STREET EAST   21 STREET      ADDRESS BROOKLYN
##   facility_type      status                due_date
## 1           N/A        Open 2026-01-15T03:06:57.000
## 2           N/A        Open                    <NA>
## 3           N/A        Open                    <NA>
## 4          <NA> Unspecified                    <NA>
## 5          <NA> In Progress                    <NA>
## 6          <NA> In Progress                    <NA>
##                                                                                                                                                                                              resolution_description
## 1                                                                                                                                       The graffiti on this property has been scheduled to be removed by the City.
## 2 The graffiti has been reported. The property owner will be notified that the City will remove the graffiti after 35 days unless the property owner requests to keep the graffiti or remove it himself or herself.
## 3                                                                                                          The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.
## 4                                                                                                                                                                                                              <NA>
## 5                                                                                                                                                                                                              <NA>
## 6                                                                                                                                                                                                              <NA>
##   resolution_action_updated_date community_board        bbl   borough
## 1        2025-12-16T03:06:57.000     06 BROOKLYN 3004410034  BROOKLYN
## 2        2025-12-16T02:27:43.000     09 BROOKLYN 3011970015  BROOKLYN
## 3        2025-12-16T02:09:12.000       13 QUEENS       <NA>    QUEENS
## 4                           <NA>    01 MANHATTAN 1001070050 MANHATTAN
## 5                           <NA>     05 BROOKLYN 3039820036  BROOKLYN
## 6                           <NA>     14 BROOKLYN 3051840025  BROOKLYN
##   x_coordinate_state_plane y_coordinate_state_plane open_data_channel_type
## 1                   988886                   185952                UNKNOWN
## 2                   994763                   180716                UNKNOWN
## 3                     <NA>                     <NA>                UNKNOWN
## 4                   983963                   197472                 MOBILE
## 5                  1015556                   185179                 MOBILE
## 6                   995841                   172790                 MOBILE
##   park_facility_name park_borough           latitude          longitude
## 1        Unspecified     BROOKLYN 40.677070828995085 -73.98328623676551
## 2        Unspecified     BROOKLYN  40.66269415161571 -73.96210657313388
## 3        Unspecified       QUEENS               <NA>               <NA>
## 4        Unspecified    MANHATTAN 40.708691774634985 -74.00103518668939
## 5        Unspecified     BROOKLYN   40.6748949770469 -73.88713888206964
## 6        Unspecified     BROOKLYN  40.64093766389099 -73.95823461385314
##    cross_street_1  cross_street_2 intersection_street_1 intersection_street_2
## 1            <NA>            <NA>                  <NA>                  <NA>
## 2            <NA>            <NA>                  <NA>                  <NA>
## 3      229 STREET      230 STREET                  <NA>                  <NA>
## 4       PECK SLIP    DOVER STREET             PECK SLIP          DOVER STREET
## 5  LIBERTY AVENUE GLENMORE AVENUE        LIBERTY AVENUE       GLENMORE AVENUE
## 6 DORCHESTER ROAD   DITMAS AVENUE       DORCHESTER ROAD         DITMAS AVENUE
##           landmark vehicle_type closed_date taxi_company_borough
## 1             <NA>         <NA>        <NA>                 <NA>
## 2             <NA>         <NA>        <NA>                 <NA>
## 3             <NA>         <NA>        <NA>                 <NA>
## 4     WATER STREET         <NA>        <NA>                 <NA>
## 5    JEROME STREET         <NA>        <NA>                 <NA>
## 6 EAST   21 STREET          Car        <NA>                 <NA>
##   taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp
## 1                  <NA>                <NA>                   <NA>      <NA>
## 2                  <NA>                <NA>                   <NA>      <NA>
## 3                  <NA>                <NA>                   <NA>      <NA>
## 4                  <NA>                <NA>                   <NA>      <NA>
## 5                  <NA>                <NA>                   <NA>      <NA>
## 6                  <NA>                <NA>                   <NA>      <NA>
##   bridge_highway_direction  location.latitude location.longitude
## 1                     <NA> 40.677070828995085 -73.98328623676551
## 2                     <NA>  40.66269415161571 -73.96210657313388
## 3                     <NA>               <NA>               <NA>
## 4                     <NA> 40.708691774634985 -74.00103518668939
## 5                     <NA>   40.6748949770469 -73.88713888206964
## 6                     <NA>  40.64093766389099 -73.95823461385314
##                                location.human_address
## 1 {"address": "", "city": "", "state": "", "zip": ""}
## 2 {"address": "", "city": "", "state": "", "zip": ""}
## 3                                                <NA>
## 4 {"address": "", "city": "", "state": "", "zip": ""}
## 5 {"address": "", "city": "", "state": "", "zip": ""}
## 6 {"address": "", "city": "", "state": "", "zip": ""}
nrow(df_311)
## [1] 80000
sum(is.na(df_311$hour))
## [1] 0

Data transformation on create date column

df_311 <- df_311 %>%
  mutate(
    created_date = as.POSIXct(created_date, format = "%Y-%m-%dT%H:%M:%S"),
    closed_date = as.POSIXct(closed_date, format = "%Y-%m-%dT%H:%M:%S"),
    date_only = as.Date(created_date),
    hour = hour(created_date),
    day_of_week = wday(created_date, label = TRUE),
    month = month(created_date, label = TRUE)
  )
print(str(df_311))
## 'data.frame':    80000 obs. of  47 variables:
##  $ unique_key                    : chr  "67192069" "67192238" "67186869" "67184900" ...
##  $ created_date                  : POSIXct, format: "2025-12-16 03:06:57" "2025-12-16 02:27:43" ...
##  $ agency                        : chr  "DSNY" "DSNY" "DOT" "NYPD" ...
##  $ agency_name                   : chr  "Department of Sanitation" "Department of Sanitation" "Department of Transportation" "New York City Police Department" ...
##  $ complaint_type                : chr  "Graffiti" "Graffiti" "Street Condition" "Noise - Commercial" ...
##  $ descriptor                    : chr  "Graffiti" "Graffiti" "Pothole" "Loud Music/Party" ...
##  $ location_type                 : chr  "Mixed Use" NA NA "Store/Commercial" ...
##  $ incident_zip                  : chr  "11215" "11225" "11413" "10038" ...
##  $ incident_address              : chr  "234 4 AVENUE" "475 FLATBUSH AVENUE" "133 AVENUE" "277 WATER STREET" ...
##  $ street_name                   : chr  "4 AVENUE" "FLATBUSH AVENUE" "133 AVENUE" "WATER STREET" ...
##  $ address_type                  : chr  "ADDRESS" "ADDRESS" "BLOCKFACE" "ADDRESS" ...
##  $ city                          : chr  "BROOKLYN" "BROOKLYN" "QUEENS" "NEW YORK" ...
##  $ facility_type                 : chr  "N/A" "N/A" "N/A" NA ...
##  $ status                        : chr  "Open" "Open" "Open" "Unspecified" ...
##  $ due_date                      : chr  "2026-01-15T03:06:57.000" NA NA NA ...
##  $ resolution_description        : chr  "The graffiti on this property has been scheduled to be removed by the City." "The graffiti has been reported. The property owner will be notified that the City will remove the graffiti afte"| __truncated__ "The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair." NA ...
##  $ resolution_action_updated_date: chr  "2025-12-16T03:06:57.000" "2025-12-16T02:27:43.000" "2025-12-16T02:09:12.000" NA ...
##  $ community_board               : chr  "06 BROOKLYN" "09 BROOKLYN" "13 QUEENS" "01 MANHATTAN" ...
##  $ bbl                           : chr  "3004410034" "3011970015" NA "1001070050" ...
##  $ borough                       : chr  "BROOKLYN" "BROOKLYN" "QUEENS" "MANHATTAN" ...
##  $ x_coordinate_state_plane      : chr  "988886" "994763" NA "983963" ...
##  $ y_coordinate_state_plane      : chr  "185952" "180716" NA "197472" ...
##  $ open_data_channel_type        : chr  "UNKNOWN" "UNKNOWN" "UNKNOWN" "MOBILE" ...
##  $ park_facility_name            : chr  "Unspecified" "Unspecified" "Unspecified" "Unspecified" ...
##  $ park_borough                  : chr  "BROOKLYN" "BROOKLYN" "QUEENS" "MANHATTAN" ...
##  $ latitude                      : chr  "40.677070828995085" "40.66269415161571" NA "40.708691774634985" ...
##  $ longitude                     : chr  "-73.98328623676551" "-73.96210657313388" NA "-74.00103518668939" ...
##  $ cross_street_1                : chr  NA NA "229 STREET" "PECK SLIP" ...
##  $ cross_street_2                : chr  NA NA "230 STREET" "DOVER STREET" ...
##  $ intersection_street_1         : chr  NA NA NA "PECK SLIP" ...
##  $ intersection_street_2         : chr  NA NA NA "DOVER STREET" ...
##  $ landmark                      : chr  NA NA NA "WATER STREET" ...
##  $ vehicle_type                  : chr  NA NA NA NA ...
##  $ closed_date                   : POSIXct, format: NA NA ...
##  $ taxi_company_borough          : chr  NA NA NA NA ...
##  $ taxi_pick_up_location         : chr  NA NA NA NA ...
##  $ bridge_highway_name           : chr  NA NA NA NA ...
##  $ bridge_highway_segment        : chr  NA NA NA NA ...
##  $ road_ramp                     : chr  NA NA NA NA ...
##  $ bridge_highway_direction      : chr  NA NA NA NA ...
##  $ location.latitude             : chr  "40.677070828995085" "40.66269415161571" NA "40.708691774634985" ...
##  $ location.longitude            : chr  "-73.98328623676551" "-73.96210657313388" NA "-74.00103518668939" ...
##  $ location.human_address        : chr  "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" NA "{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}" ...
##  $ date_only                     : Date, format: "2025-12-16" "2025-12-16" ...
##  $ hour                          : int  3 2 2 2 2 2 2 2 2 2 ...
##  $ day_of_week                   : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ month                         : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
## NULL
sum(is.na(df_311$hour))
## [1] 0
df_311 %>% count(day_of_week)
##   day_of_week     n
## 1         Sun 11247
## 2         Mon 16225
## 3         Tue  6282
## 4         Wed 12155
## 5         Thu 11637
## 6         Fri 12001
## 7         Sat 10453

Obtain data from Open-Meteo Historical Weather API

# WEATHER DATA INTEGRATION

library(httr)
library(jsonlite)
library(dplyr)
library(lubridate)

# Get date range from your 311 data
start_date <- min(df_311$date_only, na.rm = TRUE)
end_date <- max(df_311$date_only, na.rm = TRUE)

# Fetch NYC historical weather
weather_url <- "https://archive-api.open-meteo.com/v1/archive"

weather_params <- list(
  latitude = 40.7128,
  longitude = -74.0060,
  start_date = as.character(start_date),
  end_date = as.character(end_date),
  daily = "temperature_2m_max,temperature_2m_min,temperature_2m_mean,precipitation_sum,windspeed_10m_max",
  timezone = "America/New_York"
)

weather_response <- GET(weather_url, query = weather_params)

if (status_code(weather_response) == 200) {
  weather_json <- fromJSON(content(weather_response, "text", encoding = "UTF-8"))
  
  weather_df <- data.frame(
    date = as.Date(weather_json$daily$time),
    temp_max_f = weather_json$daily$temperature_2m_max * 9/5 + 32,  # Convert to Fahrenheit
    temp_min_f = weather_json$daily$temperature_2m_min * 9/5 + 32,
    temp_mean_f = weather_json$daily$temperature_2m_mean * 9/5 + 32,
    precipitation_mm = weather_json$daily$precipitation_sum,
    wind_max_mph = weather_json$daily$windspeed_10m_max * 0.621371  # Convert to mph
  )
  
  cat("Weather data retrieved:", nrow(weather_df), "days\n")
  print(head(weather_df))
} else {
  cat("Error fetching weather:", status_code(weather_response), "\n")
}
## Weather data retrieved: 8 days
##         date temp_max_f temp_min_f temp_mean_f precipitation_mm wind_max_mph
## 1 2025-12-09      31.46      18.32       25.16              0.0     9.817662
## 2 2025-12-10      43.34      30.56       36.68              3.1    11.371089
## 3 2025-12-11      38.84      27.50       32.90              0.0    14.975041
## 4 2025-12-12      34.52      23.00       28.40              0.0    10.004073
## 5 2025-12-13      38.48      23.72       30.74              1.1     4.970968
## 6 2025-12-14      32.54      16.88       26.42             11.4    11.806049
# Join weather to 311 data
df_311 <- df_311 %>%
  left_join(weather_df, by = c("date_only" = "date"))

# Verify join
cat("\nWeather columns added:", sum(!is.na(df_311$temp_mean_f)), "rows with weather data\n")
## 
## Weather columns added: 80000 rows with weather data

Obtain census data from the U.S. Census Bureau’s American Community Survey via the tidycensus package

library(tidycensus)
library(dplyr)

# Load the key from environment variable
census_api_key(Sys.getenv("CENSUS_API_KEY"))
## To install your API key for use in future sessions, run this function with `install = TRUE`.
query_params <- list(
  `$limit` = 75000,  # More records = more days
  `$order` = "created_date DESC"
)

# Fetch NYC demographic data by borough
nyc_demographics <- get_acs(
  geography = "county",
  state = "NY",
  county = c("005", "047", "061", "081", "085"),
  variables = c(
    total_pop = "B01003_001",
    median_income = "B19013_001",
    median_age = "B01002_001",
    total_housing = "B25001_001",
    renter_occupied = "B25003_003",
    poverty_count = "B17001_002"
  ),
  year = 2022,
  output = "wide"
)
## Getting data from the 2018-2022 5-year ACS
## Warning: • You have not set a Census API key. Users without a key are limited to 500
## queries per day and may experience performance limitations.
## ℹ For best results, get a Census API key at
## http://api.census.gov/data/key_signup.html and then supply the key to the
## `census_api_key()` function to use it throughout your tidycensus session.
## This warning is displayed once per session.
# Clean up borough names to match your 311 data
nyc_demographics <- nyc_demographics %>%
  mutate(
    borough = case_when(
      grepl("Bronx", NAME) ~ "BRONX",
      grepl("Kings", NAME) ~ "BROOKLYN",
      grepl("New York County", NAME) ~ "MANHATTAN",
      grepl("Queens", NAME) ~ "QUEENS",
      grepl("Richmond", NAME) ~ "STATEN ISLAND"
    ),
    pct_renter = renter_occupiedE / total_housingE * 100,
    pct_poverty = poverty_countE / total_popE * 100
  ) %>%
  select(borough, 
         population = total_popE, 
         median_income = median_incomeE, 
         median_age = median_ageE,
         pct_renter,
         pct_poverty)

print("NYC Demographics by Borough:")
## [1] "NYC Demographics by Borough:"
print(nyc_demographics)
## # A tibble: 5 × 6
##   borough       population median_income median_age pct_renter pct_poverty
##   <chr>              <dbl>         <dbl>      <dbl>      <dbl>       <dbl>
## 1 BRONX            1443229         47036       35         76.8        26.3
## 2 BROOKLYN         2679620         74692       35.9       64.4        18.8
## 3 MANHATTAN        1645867         99880       38.5       64.0        15.4
## 4 QUEENS           2360826         82431       39.9       49.9        11.6
## 5 STATEN ISLAND     492925         96185       40.3       29.1        10.3
# Join demographics to 311 data
df_311 <- df_311 %>%
  left_join(nyc_demographics, by = "borough")

cat("Demographics joined:", sum(!is.na(df_311$population)), "rows with demographic data\n")
## Demographics joined: 79936 rows with demographic data
head(df_311, 5)
##   unique_key        created_date agency                     agency_name
## 1   67192069 2025-12-16 03:06:57   DSNY        Department of Sanitation
## 2   67192238 2025-12-16 02:27:43   DSNY        Department of Sanitation
## 3   67186869 2025-12-16 02:09:12    DOT    Department of Transportation
## 4   67184900 2025-12-16 02:06:24   NYPD New York City Police Department
## 5   67193394 2025-12-16 02:06:16   NYPD New York City Police Department
##       complaint_type                    descriptor    location_type
## 1           Graffiti                      Graffiti        Mixed Use
## 2           Graffiti                      Graffiti             <NA>
## 3   Street Condition                       Pothole             <NA>
## 4 Noise - Commercial              Loud Music/Party Store/Commercial
## 5    Illegal Parking Posted Parking Sign Violation  Street/Sidewalk
##   incident_zip    incident_address     street_name address_type     city
## 1        11215        234 4 AVENUE        4 AVENUE      ADDRESS BROOKLYN
## 2        11225 475 FLATBUSH AVENUE FLATBUSH AVENUE      ADDRESS BROOKLYN
## 3        11413          133 AVENUE      133 AVENUE    BLOCKFACE   QUEENS
## 4        10038    277 WATER STREET    WATER STREET      ADDRESS NEW YORK
## 5        11207   289 JEROME STREET   JEROME STREET      ADDRESS BROOKLYN
##   facility_type      status                due_date
## 1           N/A        Open 2026-01-15T03:06:57.000
## 2           N/A        Open                    <NA>
## 3           N/A        Open                    <NA>
## 4          <NA> Unspecified                    <NA>
## 5          <NA> In Progress                    <NA>
##                                                                                                                                                                                              resolution_description
## 1                                                                                                                                       The graffiti on this property has been scheduled to be removed by the City.
## 2 The graffiti has been reported. The property owner will be notified that the City will remove the graffiti after 35 days unless the property owner requests to keep the graffiti or remove it himself or herself.
## 3                                                                                                          The Department of Transportation referred this complaint to the appropriate Maintenance Unit for repair.
## 4                                                                                                                                                                                                              <NA>
## 5                                                                                                                                                                                                              <NA>
##   resolution_action_updated_date community_board        bbl   borough
## 1        2025-12-16T03:06:57.000     06 BROOKLYN 3004410034  BROOKLYN
## 2        2025-12-16T02:27:43.000     09 BROOKLYN 3011970015  BROOKLYN
## 3        2025-12-16T02:09:12.000       13 QUEENS       <NA>    QUEENS
## 4                           <NA>    01 MANHATTAN 1001070050 MANHATTAN
## 5                           <NA>     05 BROOKLYN 3039820036  BROOKLYN
##   x_coordinate_state_plane y_coordinate_state_plane open_data_channel_type
## 1                   988886                   185952                UNKNOWN
## 2                   994763                   180716                UNKNOWN
## 3                     <NA>                     <NA>                UNKNOWN
## 4                   983963                   197472                 MOBILE
## 5                  1015556                   185179                 MOBILE
##   park_facility_name park_borough           latitude          longitude
## 1        Unspecified     BROOKLYN 40.677070828995085 -73.98328623676551
## 2        Unspecified     BROOKLYN  40.66269415161571 -73.96210657313388
## 3        Unspecified       QUEENS               <NA>               <NA>
## 4        Unspecified    MANHATTAN 40.708691774634985 -74.00103518668939
## 5        Unspecified     BROOKLYN   40.6748949770469 -73.88713888206964
##   cross_street_1  cross_street_2 intersection_street_1 intersection_street_2
## 1           <NA>            <NA>                  <NA>                  <NA>
## 2           <NA>            <NA>                  <NA>                  <NA>
## 3     229 STREET      230 STREET                  <NA>                  <NA>
## 4      PECK SLIP    DOVER STREET             PECK SLIP          DOVER STREET
## 5 LIBERTY AVENUE GLENMORE AVENUE        LIBERTY AVENUE       GLENMORE AVENUE
##        landmark vehicle_type closed_date taxi_company_borough
## 1          <NA>         <NA>        <NA>                 <NA>
## 2          <NA>         <NA>        <NA>                 <NA>
## 3          <NA>         <NA>        <NA>                 <NA>
## 4  WATER STREET         <NA>        <NA>                 <NA>
## 5 JEROME STREET         <NA>        <NA>                 <NA>
##   taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp
## 1                  <NA>                <NA>                   <NA>      <NA>
## 2                  <NA>                <NA>                   <NA>      <NA>
## 3                  <NA>                <NA>                   <NA>      <NA>
## 4                  <NA>                <NA>                   <NA>      <NA>
## 5                  <NA>                <NA>                   <NA>      <NA>
##   bridge_highway_direction  location.latitude location.longitude
## 1                     <NA> 40.677070828995085 -73.98328623676551
## 2                     <NA>  40.66269415161571 -73.96210657313388
## 3                     <NA>               <NA>               <NA>
## 4                     <NA> 40.708691774634985 -74.00103518668939
## 5                     <NA>   40.6748949770469 -73.88713888206964
##                                location.human_address  date_only hour
## 1 {"address": "", "city": "", "state": "", "zip": ""} 2025-12-16    3
## 2 {"address": "", "city": "", "state": "", "zip": ""} 2025-12-16    2
## 3                                                <NA> 2025-12-16    2
## 4 {"address": "", "city": "", "state": "", "zip": ""} 2025-12-16    2
## 5 {"address": "", "city": "", "state": "", "zip": ""} 2025-12-16    2
##   day_of_week month temp_max_f temp_min_f temp_mean_f precipitation_mm
## 1         Tue   Dec       27.5      15.08       21.74                0
## 2         Tue   Dec       27.5      15.08       21.74                0
## 3         Tue   Dec       27.5      15.08       21.74                0
## 4         Tue   Dec       27.5      15.08       21.74                0
## 5         Tue   Dec       27.5      15.08       21.74                0
##   wind_max_mph population median_income median_age pct_renter pct_poverty
## 1     7.270041    2679620         74692       35.9   64.42049    18.79087
## 2     7.270041    2679620         74692       35.9   64.42049    18.79087
## 3     7.270041    2360826         82431       39.9   49.92156    11.56061
## 4     7.270041    1645867         99880       38.5   64.03536    15.35221
## 5     7.270041    2679620         74692       35.9   64.42049    18.79087

Scrub the Data

Scrub:Data set overview and missing values analysis

library(dplyr)
library(tidyr)
library(lubridate)

# 1. DATASET OVERVIEW

cat("Rows:", nrow(df_311), "\n")
## Rows: 80000
cat("Columns:", ncol(df_311), "\n")
## Columns: 57
glimpse(df_311)
## Rows: 80,000
## Columns: 57
## $ unique_key                     <chr> "67192069", "67192238", "67186869", "67…
## $ created_date                   <dttm> 2025-12-16 03:06:57, 2025-12-16 02:27:…
## $ agency                         <chr> "DSNY", "DSNY", "DOT", "NYPD", "NYPD", …
## $ agency_name                    <chr> "Department of Sanitation", "Department…
## $ complaint_type                 <chr> "Graffiti", "Graffiti", "Street Conditi…
## $ descriptor                     <chr> "Graffiti", "Graffiti", "Pothole", "Lou…
## $ location_type                  <chr> "Mixed Use", NA, NA, "Store/Commercial"…
## $ incident_zip                   <chr> "11215", "11225", "11413", "10038", "11…
## $ incident_address               <chr> "234 4 AVENUE", "475 FLATBUSH AVENUE", …
## $ street_name                    <chr> "4 AVENUE", "FLATBUSH AVENUE", "133 AVE…
## $ address_type                   <chr> "ADDRESS", "ADDRESS", "BLOCKFACE", "ADD…
## $ city                           <chr> "BROOKLYN", "BROOKLYN", "QUEENS", "NEW …
## $ facility_type                  <chr> "N/A", "N/A", "N/A", NA, NA, NA, NA, NA…
## $ status                         <chr> "Open", "Open", "Open", "Unspecified", …
## $ due_date                       <chr> "2026-01-15T03:06:57.000", NA, NA, NA, …
## $ resolution_description         <chr> "The graffiti on this property has been…
## $ resolution_action_updated_date <chr> "2025-12-16T03:06:57.000", "2025-12-16T…
## $ community_board                <chr> "06 BROOKLYN", "09 BROOKLYN", "13 QUEEN…
## $ bbl                            <chr> "3004410034", "3011970015", NA, "100107…
## $ borough                        <chr> "BROOKLYN", "BROOKLYN", "QUEENS", "MANH…
## $ x_coordinate_state_plane       <chr> "988886", "994763", NA, "983963", "1015…
## $ y_coordinate_state_plane       <chr> "185952", "180716", NA, "197472", "1851…
## $ open_data_channel_type         <chr> "UNKNOWN", "UNKNOWN", "UNKNOWN", "MOBIL…
## $ park_facility_name             <chr> "Unspecified", "Unspecified", "Unspecif…
## $ park_borough                   <chr> "BROOKLYN", "BROOKLYN", "QUEENS", "MANH…
## $ latitude                       <chr> "40.677070828995085", "40.6626941516157…
## $ longitude                      <chr> "-73.98328623676551", "-73.962106573133…
## $ cross_street_1                 <chr> NA, NA, "229 STREET", "PECK SLIP", "LIB…
## $ cross_street_2                 <chr> NA, NA, "230 STREET", "DOVER STREET", "…
## $ intersection_street_1          <chr> NA, NA, NA, "PECK SLIP", "LIBERTY AVENU…
## $ intersection_street_2          <chr> NA, NA, NA, "DOVER STREET", "GLENMORE A…
## $ landmark                       <chr> NA, NA, NA, "WATER STREET", "JEROME STR…
## $ vehicle_type                   <chr> NA, NA, NA, NA, NA, "Car", NA, NA, NA, …
## $ closed_date                    <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ taxi_company_borough           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ taxi_pick_up_location          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_name            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_segment         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ road_ramp                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bridge_highway_direction       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ location.latitude              <chr> "40.677070828995085", "40.6626941516157…
## $ location.longitude             <chr> "-73.98328623676551", "-73.962106573133…
## $ location.human_address         <chr> "{\"address\": \"\", \"city\": \"\", \"…
## $ date_only                      <date> 2025-12-16, 2025-12-16, 2025-12-16, 20…
## $ hour                           <int> 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ day_of_week                    <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue,…
## $ month                          <ord> Dec, Dec, Dec, Dec, Dec, Dec, Dec, Dec,…
## $ temp_max_f                     <dbl> 27.5, 27.5, 27.5, 27.5, 27.5, 27.5, 27.…
## $ temp_min_f                     <dbl> 15.08, 15.08, 15.08, 15.08, 15.08, 15.0…
## $ temp_mean_f                    <dbl> 21.74, 21.74, 21.74, 21.74, 21.74, 21.7…
## $ precipitation_mm               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ wind_max_mph                   <dbl> 7.270041, 7.270041, 7.270041, 7.270041,…
## $ population                     <dbl> 2679620, 2679620, 2360826, 1645867, 267…
## $ median_income                  <dbl> 74692, 74692, 82431, 99880, 74692, 7469…
## $ median_age                     <dbl> 35.9, 35.9, 39.9, 38.5, 35.9, 35.9, 35.…
## $ pct_renter                     <dbl> 64.42049, 64.42049, 49.92156, 64.03536,…
## $ pct_poverty                    <dbl> 18.79087, 18.79087, 11.56061, 15.35221,…
# 2. MISSING VALUES ANALYSIS

missing_summary <- df_311 %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "missing_count") %>%
  mutate(
    missing_pct = round(missing_count / nrow(df_311) * 100, 2),
    recommendation = case_when(
      missing_pct == 0 ~ "Complete",
      missing_pct < 5 ~ "Impute",
      missing_pct < 30 ~ "Impute with caution",
      TRUE ~ "Consider dropping"
    )
  ) %>%
  arrange(desc(missing_pct))

print(missing_summary %>% filter(missing_pct > 0))
## # A tibble: 36 × 4
##    column                   missing_count missing_pct recommendation   
##    <chr>                            <int>       <dbl> <chr>            
##  1 taxi_company_borough             79938        99.9 Consider dropping
##  2 road_ramp                        79869        99.8 Consider dropping
##  3 bridge_highway_direction         79846        99.8 Consider dropping
##  4 due_date                         79788        99.7 Consider dropping
##  5 bridge_highway_name              79691        99.6 Consider dropping
##  6 bridge_highway_segment           79691        99.6 Consider dropping
##  7 taxi_pick_up_location            79247        99.1 Consider dropping
##  8 vehicle_type                     77561        97.0 Consider dropping
##  9 facility_type                    76582        95.7 Consider dropping
## 10 landmark                         33060        41.3 Consider dropping
## # ℹ 26 more rows

Scrub: Drop columns with >30% of data missing

# Identify columns with >30% missing, BUT keep closed_date for response_time calc
threshold <- 0.30
cols_to_drop <- names(df_311)[colSums(is.na(df_311)) / nrow(df_311) > threshold]

# Protect closed_date from being dropped
cols_to_drop <- setdiff(cols_to_drop, "closed_date")

cat("Dropping columns:", cols_to_drop, "\n")
## Dropping columns: facility_type due_date cross_street_1 cross_street_2 intersection_street_1 intersection_street_2 landmark vehicle_type taxi_company_borough taxi_pick_up_location bridge_highway_name bridge_highway_segment road_ramp bridge_highway_direction
df_311 <- df_311 %>%
  select(-all_of(cols_to_drop))

Scrub: Identify duplicates, verify data types, and validate entries

# 3. DUPLICATE ROWS

cat("\n=== DUPLICATE ROWS ===\n")
## 
## === DUPLICATE ROWS ===
dup_count <- sum(duplicated(df_311))
cat("Duplicate rows:", dup_count, "\n")
## Duplicate rows: 0
cat("Duplicate %:", round(dup_count / nrow(df_311) * 100, 2), "%\n")
## Duplicate %: 0 %
# Remove duplicates (uncomment to execute)
df_311 <- df_311 %>% distinct()


# 4. DATA TYPES CHECK
sapply(df_311, class)
## $unique_key
## [1] "character"
## 
## $created_date
## [1] "POSIXct" "POSIXt" 
## 
## $agency
## [1] "character"
## 
## $agency_name
## [1] "character"
## 
## $complaint_type
## [1] "character"
## 
## $descriptor
## [1] "character"
## 
## $location_type
## [1] "character"
## 
## $incident_zip
## [1] "character"
## 
## $incident_address
## [1] "character"
## 
## $street_name
## [1] "character"
## 
## $address_type
## [1] "character"
## 
## $city
## [1] "character"
## 
## $status
## [1] "character"
## 
## $resolution_description
## [1] "character"
## 
## $resolution_action_updated_date
## [1] "character"
## 
## $community_board
## [1] "character"
## 
## $bbl
## [1] "character"
## 
## $borough
## [1] "character"
## 
## $x_coordinate_state_plane
## [1] "character"
## 
## $y_coordinate_state_plane
## [1] "character"
## 
## $open_data_channel_type
## [1] "character"
## 
## $park_facility_name
## [1] "character"
## 
## $park_borough
## [1] "character"
## 
## $latitude
## [1] "character"
## 
## $longitude
## [1] "character"
## 
## $closed_date
## [1] "POSIXct" "POSIXt" 
## 
## $location.latitude
## [1] "character"
## 
## $location.longitude
## [1] "character"
## 
## $location.human_address
## [1] "character"
## 
## $date_only
## [1] "Date"
## 
## $hour
## [1] "integer"
## 
## $day_of_week
## [1] "ordered" "factor" 
## 
## $month
## [1] "ordered" "factor" 
## 
## $temp_max_f
## [1] "numeric"
## 
## $temp_min_f
## [1] "numeric"
## 
## $temp_mean_f
## [1] "numeric"
## 
## $precipitation_mm
## [1] "numeric"
## 
## $wind_max_mph
## [1] "numeric"
## 
## $population
## [1] "numeric"
## 
## $median_income
## [1] "numeric"
## 
## $median_age
## [1] "numeric"
## 
## $pct_renter
## [1] "numeric"
## 
## $pct_poverty
## [1] "numeric"
# 5. STANDARDIZE TEXT COLUMNS
df_311 <- df_311 %>%
  mutate(
    # Trim whitespace and standardize case
    borough = toupper(trimws(borough)),
    complaint_type = toupper(trimws(complaint_type)),
    # Standardize "Unspecified" values
    borough = if_else(borough %in% c("", "UNSPECIFIED", "NA"), NA_character_, borough)
  )

cat("Unique boroughs:", unique(df_311$borough), "\n")
## Unique boroughs: BROOKLYN QUEENS MANHATTAN BRONX STATEN ISLAND NA
# 6. INVALID/ILLOGICAL VALUES

# Check: Closed date before Created date
if ("closed_date" %in% names(df_311) & "created_date" %in% names(df_311)) {
  invalid_dates <- df_311 %>%
    filter(!is.na(closed_date) & closed_date < created_date)
  cat("Invalid dates (closed before created):", nrow(invalid_dates), "\n")
}
## Invalid dates (closed before created): 12
# Check: Future dates
future_dates <- df_311 %>%
  filter(created_date > Sys.time())
cat("Future dates:", nrow(future_dates), "\n")
## Future dates: 0
# Check: Negative response times
if ("response_time" %in% names(df_311)) {
  negative_response <- df_311 %>%
    filter(response_time < 0)
  cat("Negative response times:", nrow(negative_response), "\n")
}

Scrub: Outlier detection and cardinality check

# 7. OUTLIER DETECTION (Numeric Columns)
detect_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower <- Q1 - 1.5 * IQR
  upper <- Q3 + 1.5 * IQR
  sum(x < lower | x > upper, na.rm = TRUE)
}

numeric_cols <- df_311 %>% select(where(is.numeric)) %>% names()

for (col in numeric_cols) {
  outlier_count <- detect_outliers(df_311[[col]])
  pct <- round(outlier_count / nrow(df_311) * 100, 2)
  cat(col, ": ", outlier_count, " outliers (", pct, "%)\n", sep = "")
}
## hour: 0 outliers (0%)
## temp_max_f: 0 outliers (0%)
## temp_min_f: 0 outliers (0%)
## temp_mean_f: 0 outliers (0%)
## precipitation_mm: 10815 outliers (13.52%)
## wind_max_mph: 25172 outliers (31.46%)
## population: 0 outliers (0%)
## median_income: 0 outliers (0%)
## median_age: 0 outliers (0%)
## pct_renter: 2610 outliers (3.26%)
## pct_poverty: 0 outliers (0%)
# 8. CARDINALITY CHECK
cardinality <- df_311 %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), names_to = "column", values_to = "unique_values") %>%
  arrange(desc(unique_values))

print(cardinality)
## # A tibble: 43 × 2
##    column                         unique_values
##    <chr>                                  <int>
##  1 unique_key                             80000
##  2 created_date                           66848
##  3 closed_date                            37021
##  4 latitude                               35682
##  5 longitude                              35682
##  6 location.latitude                      35682
##  7 location.longitude                     35682
##  8 incident_address                       34400
##  9 resolution_action_updated_date         33950
## 10 y_coordinate_state_plane               30333
## # ℹ 33 more rows

Explore the data

Data Visualizations and Statistical Analysis

library(ggplot2)

#DATA. VISUALIZATIONS

# Top Complaint Types
df_311 %>%
  count(complaint_type, sort = TRUE) %>%
  head(10) %>%
  ggplot(aes(x = reorder(complaint_type, n), y = n, fill = complaint_type)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Complaint Types in NYC 311 Data",
    x = "Complaint Type",
    y = "Number of Complaints"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Complaints by Borough

df_311 %>%
  filter(!is.na(borough) & borough != "Unspecified") %>%
  count(borough) %>%
  ggplot(aes(x = reorder(borough, n), y = n, fill = borough)) +
  geom_col() +
  labs(
    title = "311 Complaints by Borough",
    x = "Borough",
    y = "Number of Complaints"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Complaints by Hour of Day

df_311 %>%
  count(hour) %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color = "steelblue", size = 2) +
  labs(
    title = "311 Complaints by Hour of Day",
    x = "Hour (24-hour format)",
    y = "Number of Complaints"
  ) +
  scale_x_continuous(breaks = seq(0, 23, 2)) +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Complaints by Day of Week

df_311 %>%
  count(day_of_week) %>%
  ggplot(aes(x = day_of_week, y = n, fill = day_of_week)) +
  geom_col() +
  labs(
    title = "311 Complaints by Day of Week",
    x = "Day",
    y = "Number of Complaints"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# Heatmap - Hour vs Day of Week

df_311 %>%
  count(day_of_week, hour) %>%
  ggplot(aes(x = hour, y = day_of_week, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(
    title = "Complaint Frequency: Hour vs Day of Week",
    x = "Hour of Day",
    y = "Day of Week",
    fill = "Count"
  ) +
  theme_minimal()

# Top Complaints by Borough

df_311 %>%
  filter(!is.na(borough) & borough != "Unspecified") %>%
  count(borough, complaint_type) %>%
  group_by(borough) %>%
  slice_max(n, n = 5) %>%
  ggplot(aes(x = reorder(complaint_type, n), y = n, fill = borough)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~borough, scales = "free_y") +
  labs(
    title = "Top 5 Complaint Types by Borough",
    x = "Complaint Type",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# STATISTICAL ANALYSIS

# Chi-Square Test: Are complaint types independent of borough?
contingency_table <- table(df_311$borough, df_311$complaint_type)
chi_test <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
print("Chi-Square Test: Complaint Type vs Borough")
## [1] "Chi-Square Test: Complaint Type vs Borough"
print(chi_test)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = NaN, df = 580, p-value = NA
# ANOVA: Does response time differ by borough?
df_311 <- df_311 %>%
  mutate(response_time = as.numeric(difftime(closed_date, created_date, units = "hours")))

# Remove NA and extreme outliers
df_response <- df_311 %>%
  filter(!is.na(response_time) & response_time > 0 & response_time < 720)

anova_result <- aov(response_time ~ borough, data = df_response)
print("ANOVA: Response Time by Borough")
## [1] "ANOVA: Response Time by Borough"
print(summary(anova_result))
##                Df   Sum Sq Mean Sq F value Pr(>F)    
## borough         4   588550  147138   285.3 <2e-16 ***
## Residuals   54285 27992982     516                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 27 observations deleted due to missingness
# Post-hoc test if significant
tukey_result <- TukeyHSD(anova_result)
print(tukey_result)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = response_time ~ borough, data = df_response)
## 
## $borough
##                              diff        lwr       upr     p adj
## BROOKLYN-BRONX           6.542921  5.8497798  7.236062 0.0000000
## MANHATTAN-BRONX          8.249035  7.4513231  9.046747 0.0000000
## QUEENS-BRONX             2.007849  1.2649527  2.750746 0.0000000
## STATEN ISLAND-BRONX      3.810022  2.1987907  5.421253 0.0000000
## MANHATTAN-BROOKLYN       1.706114  0.8412581  2.570970 0.0000007
## QUEENS-BROOKLYN         -4.535072 -5.3496433 -3.720500 0.0000000
## STATEN ISLAND-BROOKLYN  -2.732899 -4.3784074 -1.087391 0.0000578
## QUEENS-MANHATTAN        -6.241186 -7.1464074 -5.335964 0.0000000
## STATEN ISLAND-MANHATTAN -4.439013 -6.1312298 -2.746796 0.0000000
## STATEN ISLAND-QUEENS     1.802173  0.1350948  3.469250 0.0264692
# Response Time by Borough

df_response %>%
  filter(!is.na(borough) & borough != "Unspecified") %>%
  ggplot(aes(x = borough, y = response_time, fill = borough)) +
  geom_boxplot() +
  labs(
    title = "Response Time (Hours) by Borough",
    x = "Borough",
    y = "Response Time (Hours)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

# SUMMARY STATISTICS TABLE
summary_stats <- df_response %>%
  filter(!is.na(borough) & borough != "Unspecified") %>%
  group_by(borough) %>%
  summarise(
    count = n(),
    mean_response = mean(response_time, na.rm = TRUE),
    median_response = median(response_time, na.rm = TRUE),
    sd_response = sd(response_time, na.rm = TRUE)
  )

print("Summary Statistics by Borough:")
## [1] "Summary Statistics by Borough:"
print(summary_stats)
## # A tibble: 5 × 5
##   borough       count mean_response median_response sd_response
##   <chr>         <int>         <dbl>           <dbl>       <dbl>
## 1 BRONX         20807          10.9            2.97        18.3
## 2 BROOKLYN      12961          17.4            3.33        26.5
## 3 MANHATTAN      8490          19.2            3.42        26.4
## 4 QUEENS        10441          12.9            2.32        22.4
## 5 STATEN ISLAND  1591          14.7            3.25        22.9

Heating Complaints Response Time vs Tempaerature

# Filter heating complaints data
heating_individual <- df_311 %>%
  filter(complaint_type == "HEAT/HOT WATER" & !is.na(temp_mean_f)) %>%
  filter(!is.na(response_time) & response_time > 0 & response_time < 720)

# Plot 1: Heating Complaints COUNT vs Temperature
heating_count <- df_311 %>%
  filter(complaint_type == "HEAT/HOT WATER" & !is.na(temp_mean_f)) %>%
  group_by(date_only, temp_mean_f) %>%
  summarise(complaints = n(), .groups = "drop")

p1 <- ggplot(heating_count, aes(x = temp_mean_f, y = complaints)) +
  geom_point(alpha = 0.5, color = "navy") +
  labs(
    title = "Heating Complaint Volume vs Temperature",
    x = "Mean Daily Temperature (°F)",
    y = "Number of Complaints"
  ) +
  theme_minimal()
print(p1)

Noise complaints vs Temperature

# Noise complaints - do they increase in warm weather?
noise_temp <- df_311 %>%
  filter(grepl("Noise", complaint_type, ignore.case = TRUE)) %>%
  group_by(date_only, temp_mean_f) %>%
  summarise(complaints = n(), .groups = "drop")

ggplot(noise_temp, aes(x = temp_mean_f, y = complaints)) +
  geom_point(alpha = 0.5, color = "orange") +
  geom_smooth(method = "loess", color = "purple", se = FALSE) +
  labs(
    title = "Do Noise Complaints Rise in Warmer Weather?",
    x = "Mean Daily Temperature (°F)",
    y = "Number of Noise Complaints"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

cor.test(noise_temp$temp_mean_f, noise_temp$complaints)
## 
##  Pearson's product-moment correlation
## 
## data:  noise_temp$temp_mean_f and noise_temp$complaints
## t = 1.1518, df = 6, p-value = 0.2932
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3986927  0.8694817
## sample estimates:
##       cor 
## 0.4255328

Heatmap: Complaint Types vs Tempaerature Ranges

# Create temperature bins
temp_complaints <- df_311 %>%
  filter(!is.na(temp_mean_f)) %>%
  mutate(temp_range = cut(temp_mean_f, 
                          breaks = c(0, 32, 50, 70, 85, 100),
                          labels = c("Freezing (<32°F)", "Cold (32-50°F)", 
                                     "Mild (50-70°F)", "Warm (70-85°F)", "Hot (>85°F)"))) %>%
  filter(complaint_type %in% names(sort(table(df_311$complaint_type), decreasing = TRUE)[1:10])) %>%
  count(complaint_type, temp_range)

ggplot(temp_complaints, aes(x = temp_range, y = complaint_type, fill = n)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(
    title = "Complaint Types by Temperature Range",
    x = "Temperature Range",
    y = "Complaint Type",
    fill = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Complaints per capita vs Median Income

# Complaints per capita by borough
complaints_per_capita <- df_311 %>%
  filter(!is.na(borough) & borough != "Unspecified") %>%
  group_by(borough, population, median_income, pct_renter, pct_poverty) %>%
  summarise(total_complaints = n(), .groups = "drop") %>%
  mutate(complaints_per_1000 = total_complaints / population * 1000)

ggplot(complaints_per_capita, aes(x = median_income, y = complaints_per_1000)) +
  geom_point(size = 5, color = "steelblue") +
  geom_text(aes(label = borough), vjust = -1, size = 3) +
  geom_smooth(method = "lm", color = "red", se = FALSE, linetype = "dashed") +
  scale_x_continuous(labels = scales::dollar) +
  labs(
    title = "Do Lower-Income Boroughs File More Complaints?",
    x = "Median Household Income",
    y = "Complaints per 1,000 Residents"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Housing Complaints vs Renter Percentage

# Housing-related complaints by renter percentage
housing_renter <- df_311 %>%
  filter(complaint_type %in% c("HEAT/HOT WATER", "PLUMBING", "WATER SYSTEM", "UNSANITARY CONDITION")) %>%
  filter(!is.na(borough) & borough != "Unspecified") %>%
  group_by(borough, pct_renter, population) %>%
  summarise(housing_complaints = n(), .groups = "drop") %>%
  mutate(housing_per_1000 = housing_complaints / population * 1000)

ggplot(housing_renter, aes(x = pct_renter, y = housing_per_1000)) +
  geom_point(size = 5, color = "orange") +
  geom_text(aes(label = borough), vjust = -1, size = 3) +
  geom_smooth(method = "lm", color = "purple", se = FALSE, linetype = "dashed") +
  labs(
    title = "Housing Complaints Higher in Renter-Heavy Boroughs",
    x = "Percent Renter-Occupied Housing",
    y = "Housing Complaints per 1,000 Residents"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Model the data

Diagnosing Delays: Predicting 311 Response Time with Random Forest

# Load required libraries
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:httr':
## 
##     progress
library(tibble)

# Prepare modeling dataset WITH weather and demographics
model_data <- df_response %>%
  filter(!is.na(borough) & borough != "Unspecified" & 
         !is.na(complaint_type) & !is.na(hour) & !is.na(day_of_week) &
         !is.na(temp_mean_f) & !is.na(median_income)) %>%  # NEW: require weather & demographics
  select(
    response_time, 
    borough, 
    complaint_type, 
    hour, 
    day_of_week, 
    month,
    # NEW: Weather features
    temp_mean_f,
    precipitation_mm,
    # NEW: Demographics features
    median_income,
    pct_renter,
    pct_poverty
  ) %>%
  # Keep top complaint types to avoid sparse categories
  filter(complaint_type %in% names(sort(table(df_response$complaint_type), decreasing = TRUE)[1:20])) %>%
  mutate(across(where(is.character), as.factor))

# Check structure - verify new columns are included
str(model_data)
## 'data.frame':    49354 obs. of  11 variables:
##  $ response_time   : num  0.0642 0.1808 0.3244 0.4058 0.3131 ...
##  $ borough         : Factor w/ 5 levels "BRONX","BROOKLYN",..: 4 4 3 2 4 1 1 1 4 1 ...
##  $ complaint_type  : Factor w/ 20 levels "ABANDONED VEHICLE",..: 12 12 12 12 12 12 12 8 2 12 ...
##  $ hour            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day_of_week     : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ month           : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ temp_mean_f     : num  21.7 21.7 21.7 21.7 21.7 ...
##  $ precipitation_mm: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ median_income   : num  82431 82431 99880 74692 82431 ...
##  $ pct_renter      : num  49.9 49.9 64 64.4 49.9 ...
##  $ pct_poverty     : num  11.6 11.6 15.4 18.8 11.6 ...
cat("\nColumns in model:", colnames(model_data), "\n")
## 
## Columns in model: response_time borough complaint_type hour day_of_week month temp_mean_f precipitation_mm median_income pct_renter pct_poverty
# Split data: 80% train, 20% test
set.seed(42)
train_index <- createDataPartition(model_data$response_time, p = 0.8, list = FALSE)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]

cat("Training set:", nrow(train_data), "rows\n")
## Training set: 39485 rows
cat("Test set:", nrow(test_data), "rows\n")
## Test set: 9869 rows
# Train Random Forest model
rf_model <- randomForest(
  response_time ~ ., 
  data = train_data,
  ntree = 200,
  mtry = 3,           # Increased from 2 since we have more variables
  importance = TRUE,
  na.action = na.omit
)

# Model summary
print(rf_model)
## 
## Call:
##  randomForest(formula = response_time ~ ., data = train_data,      ntree = 200, mtry = 3, importance = TRUE, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 200
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 135.3143
##                     % Var explained: 68.4
# Variable Importance - KEY OUTPUT for interpretation
importance_df <- as.data.frame(importance(rf_model)) %>%
  rownames_to_column("Variable") %>%
  arrange(desc(`%IncMSE`))

print("Variable Importance (% Increase in MSE when removed):")
## [1] "Variable Importance (% Increase in MSE when removed):"
print(importance_df)
##            Variable   %IncMSE IncNodePurity
## 1    complaint_type 104.21719    9566058.88
## 2              hour  40.13356     293187.21
## 3       day_of_week  26.82610     565645.46
## 4       temp_mean_f  17.45643     335748.56
## 5  precipitation_mm  14.72861     107773.67
## 6           borough  13.57307     150914.37
## 7       pct_poverty  10.77001      74147.33
## 8     median_income  10.50143      87042.42
## 9        pct_renter  10.42872      72120.33
## 10            month   0.00000          0.00
# Variable Importance Plot
varImpPlot(rf_model, main = "Variable Importance: Predicting 311 Response Time\n(Including Weather & Demographics)")

# Predictions on test set
predictions <- predict(rf_model, newdata = test_data)

# Model Evaluation Metrics
actuals <- test_data$response_time
rmse <- sqrt(mean((predictions - actuals)^2))
mae <- mean(abs(predictions - actuals))
r_squared <- 1 - sum((actuals - predictions)^2) / sum((actuals - mean(actuals))^2)

cat("\n--- Model Performance on Test Set ---\n")
## 
## --- Model Performance on Test Set ---
cat("RMSE:", round(rmse, 2), "hours\n")
## RMSE: 11.46 hours
cat("MAE:", round(mae, 2), "hours\n")
## MAE: 6.3 hours
cat("R-squared:", round(r_squared, 4), "\n")
## R-squared: 0.6927
# Actual vs Predicted Plot
results_df <- data.frame(Actual = actuals, Predicted = predictions)

ggplot(results_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(
    title = "Random Forest: Actual vs Predicted Response Time",
    subtitle = paste("R² =", round(r_squared, 3), "| RMSE =", round(rmse, 2), "hours"),
    x = "Actual Response Time (Hours)",
    y = "Predicted Response Time (Hours)"
  ) +
  theme_minimal()

Interpret the data

The Random Forest model reveals that complaint type is the strongest predictor of response time, followed by borough. This suggests that operational workflows—not just geographic location—drive resolution speed.

The model’s R² of 0.7117 indicates that 71% of response time variability can be explained by these factors. The remaining variance likely stems from factors not in our dataset: inspector availability, complaint complexity, or whether violations were found.

RMSE = 11.14 hours Predictions are accurate within approximately 0.46 days(about half a day) sufficient for operational triage and resource planning.

Key Takeaways from the data

library(knitr)

interpretations <- data.frame(
  Interpretation = c(
    "Borough disparities",
    "Complaint type matters most",
    "Demographics drive demand",
    "Temporal patterns exist",
    "Predictions enable proactivity"
  ),
  Key_Insight = c(
    "Manhattan/Brooklyn have highest variability",
    "Type predicts response time better than location",
    "Renter-heavy, lower-income areas file more complaints",
    "Peak demand mid-week, daytime hours",
    "71% of variance is explainable"
  ),
  Actionable_Recommendation = c(
    "Audit workflows in high-variability boroughs",
    "Develop complaint-specific SLAs",
    "Proactive inspections in vulnerable neighborhoods",
    "Align staffing to demand curves",
    "Shift from reactive to predictive operations"
  )
)

kable(interpretations, 
      col.names = c("Interpretation", "Key Insight", "Actionable Recommendation"),
      caption = "Key Interpretations from 311 Data Analysis")
Key Interpretations from 311 Data Analysis
Interpretation Key Insight Actionable Recommendation
Borough disparities Manhattan/Brooklyn have highest variability Audit workflows in high-variability boroughs
Complaint type matters most Type predicts response time better than location Develop complaint-specific SLAs
Demographics drive demand Renter-heavy, lower-income areas file more complaints Proactive inspections in vulnerable neighborhoods
Temporal patterns exist Peak demand mid-week, daytime hours Align staffing to demand curves
Predictions enable proactivity 71% of variance is explainable Shift from reactive to predictive operations

The Biggest Challenges Encountered in Completing the Project