Assignment 1 DATA607: Basic Data Loading and Transformation (NYC 311)
Author
Muhammad Suffyan Khan
Published
February 1, 2026
Dataset Source
The dataset used for this assignment is the NYC 311 Service Requests dataset, which records non-emergency service requests submitted by residents of New York City. Each record includes the complaint type, responsible agency, borough, request status, and timestamps for when the request was created and closed. The data originates from NYC Open Data and is distributed in CSV format. Source Link: https://www.kaggle.com/search?q=NYC+311+Service+Requests
Motivation for Dataset Selection
This dataset was selected because it represents a realistic and widely used urban administrative data source that is highly relevant to data science and public policy analysis. The NYC 311 dataset reflects real-world challenges in data acquisition and management, including working with categorical variables, date-time fields, and missing values.
Additionally, the dataset’s focus on New York City makes it directly applicable to analyzing service demand, operational efficiency, and citizen-reported issues, aligning well with the course’s emphasis on practical, professional data workflows.
Planned Approach
The planned approach involves loading the dataset from the GitHub repository into R, selecting a subset of relevant columns, and standardizing column names to improve clarity and reproducibility. Where appropriate, data values will be transformed into more interpretable formats, such as converting date strings into date-time objects or standardizing categorical values.
Libraries Loaded
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
X Created.Date Closed.Date Agency
1 0 10/31/2013 02:08:41 AM <NA> NYPD
2 1 10/31/2013 02:01:04 AM <NA> NYPD
3 2 10/31/2013 02:00:24 AM 10/31/2013 02:40:32 AM NYPD
4 3 10/31/2013 01:56:23 AM 10/31/2013 02:21:48 AM NYPD
5 4 10/31/2013 01:53:44 AM <NA> DOHMH
6 5 10/31/2013 01:46:52 AM <NA> NYPD
Agency.Name Complaint.Type
1 New York City Police Department Noise - Street/Sidewalk
2 New York City Police Department Illegal Parking
3 New York City Police Department Noise - Commercial
4 New York City Police Department Noise - Vehicle
5 Department of Health and Mental Hygiene Rodent
6 New York City Police Department Noise - Commercial
Descriptor Location.Type Incident.Zip Address.Type
1 Loud Talking Street/Sidewalk 11432.0 ADDRESS
2 Commercial Overnight Parking Street/Sidewalk 11378.0 BLOCKFACE
3 Loud Music/Party Club/Bar/Restaurant 10032.0 ADDRESS
4 Car/Truck Horn Street/Sidewalk 10023.0 BLOCKFACE
5 Condition Attracting Rodents Vacant Lot 10027.0 BLOCKFACE
6 Banging/Pounding Club/Bar/Restaurant 11372.0 BLOCKFACE
City Landmark Status Borough
1 JAMAICA <NA> Assigned QUEENS
2 MASPETH <NA> Open QUEENS
3 NEW YORK <NA> Closed MANHATTAN
4 NEW YORK <NA> Closed MANHATTAN
5 NEW YORK <NA> Pending MANHATTAN
6 JACKSON HEIGHTS <NA> Open QUEENS
The dataset is loaded directly from a public GitHub repository to ensure reproducibility. After loading the data into R, an initial preview is performed to confirm that the file was read correctly and to inspect the available variables. Rather than repeatedly working around spaces and inconsistent capitalization in the original column names, the names are standardized once, which makes all later column selection straightforward.
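The loading and renaming steps can be sketched as follows. This is a minimal sketch, not the exact chunk used: the raw-file URL is a placeholder for the actual repository path, and clean_names() from janitor is assumed as the renaming step because the inspected variable names below appear in snake_case.

# Hypothetical raw-file URL; replace with the actual GitHub path used for the assignment
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/nyc_311.csv"

nyc_311 <- read.csv(url, stringsAsFactors = FALSE)
head(nyc_311)                     # preview raw columns (Created.Date, Closed.Date, ...)

nyc_311 <- clean_names(nyc_311)   # standardize names to snake_case (created_date, closed_date, ...)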
Data Inspection
Before performing transformations, the dataset structure and available variables are inspected to understand data types and naming conventions.
str(nyc_311)
'data.frame': 111069 obs. of 14 variables:
$ x : int 0 1 2 3 4 5 6 7 8 9 ...
$ created_date : chr "10/31/2013 02:08:41 AM" "10/31/2013 02:01:04 AM" "10/31/2013 02:00:24 AM" "10/31/2013 01:56:23 AM" ...
$ closed_date : chr NA NA "10/31/2013 02:40:32 AM" "10/31/2013 02:21:48 AM" ...
$ agency : chr "NYPD" "NYPD" "NYPD" "NYPD" ...
$ agency_name : chr "New York City Police Department" "New York City Police Department" "New York City Police Department" "New York City Police Department" ...
$ complaint_type: chr "Noise - Street/Sidewalk" "Illegal Parking" "Noise - Commercial" "Noise - Vehicle" ...
$ descriptor : chr "Loud Talking" "Commercial Overnight Parking" "Loud Music/Party" "Car/Truck Horn" ...
$ location_type : chr "Street/Sidewalk" "Street/Sidewalk" "Club/Bar/Restaurant" "Street/Sidewalk" ...
$ incident_zip : chr "11432.0" "11378.0" "10032.0" "10023.0" ...
$ address_type : chr "ADDRESS" "BLOCKFACE" "ADDRESS" "BLOCKFACE" ...
$ city : chr "JAMAICA" "MASPETH" "NEW YORK" "NEW YORK" ...
$ landmark : chr NA NA NA NA ...
$ status : chr "Assigned" "Open" "Closed" "Closed" ...
$ borough : chr "QUEENS" "QUEENS" "MANHATTAN" "MANHATTAN" ...
# Total rows loaded
nrow(nyc_311)
[1] 111069
Column Selection
A subset of relevant variables is selected to focus on key attributes of each service request, including the type of complaint, responsible agency, location, status, and request timing. This reduces complexity and prepares the data for clearer downstream analysis. Some records contain missing values in key fields; for this introductory transformation exercise, incomplete rows are removed from the selected subset to simplify downstream analysis and ensure consistent datetime parsing. The closed_date column is the exception, since a request may still be open and therefore legitimately have no closing timestamp.
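A possible version of this selection and parsing step is sketched below. The retained columns follow the str() output above, but the exact subset, the fields required to be non-missing, and the lubridate parsing call are illustrative assumptions rather than the assignment's actual chunk.

nyc_311_clean <- nyc_311 %>%
  select(created_date, closed_date, agency, agency_name, complaint_type,
         descriptor, location_type, incident_zip, city, status, borough) %>%
  # drop rows with missing key fields, but keep NA closed_date (request may still be open)
  drop_na(created_date, complaint_type, agency, borough, status) %>%
  mutate(
    created_date = parse_date_time(created_date, orders = "mdY IMS p"),  # e.g., "10/31/2013 02:08:41 AM"
    closed_date  = parse_date_time(closed_date,  orders = "mdY IMS p")
  )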
# Calculate response time (hours) only when closed_date exists
# and treat negative / unrealistic values as invalid (set to NA)
nyc_311_clean <- nyc_311_clean %>%
  mutate(
    response_time_hours = case_when(
      is.na(closed_date) ~ NA_real_,
      TRUE ~ as.numeric(difftime(closed_date, created_date, units = "hours"))
    ),
    response_time_hours = ifelse(response_time_hours < 0, NA_real_, response_time_hours),
    response_time_hours = ifelse(response_time_hours > 24 * 365, NA_real_, response_time_hours)  # optional cap: > 1 year
  )
Quick Note
During validation, some records produced negative response times because the closed timestamp occurred earlier than the created timestamp. This indicates inconsistent or erroneous timestamps in the source data. These negative values were treated as invalid and set to missing (NA) to avoid misleading results.
summary(nyc_311_clean$response_time_hours)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 1.524 14.817 35.726 48.000 639.025 52846
The top 10 complaint categories provide a high-level view of the issues residents most frequently report through NYC 311. A small number of complaint types account for a large share of total requests, suggesting that many service demands are driven by recurring quality-of-life issues. This type of frequency analysis helps prioritize which categories are most important for deeper operational or policy-focused investigation.
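The frequency table behind this discussion can be reproduced with a short dplyr summary; the variable names match the cleaned dataset, but the exact chunk used in the assignment is not shown here, so treat this as a sketch.

nyc_311_clean %>%
  count(complaint_type, sort = TRUE, name = "requests") %>%  # requests per complaint category
  slice_max(requests, n = 10)                                # keep the ten most frequent categories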
Feature Engineering: Create an “issue_priority” label
To demonstrate feature engineering, I created a simplified “issue_priority” variable by grouping complaints into High, Medium, and Low/Other categories using rule-based matching on complaint keywords (e.g., heat/hot water and utility-related issues). This engineered feature makes the dataset more interpretable for downstream analysis by reducing many complaint categories into broader urgency levels. While the classification rules are simple, they mirror real-world preprocessing where domain knowledge is used to create decision-support variables.
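A minimal sketch of this rule-based step is shown below; the keyword patterns are illustrative examples of the matching rules described above, not the exact rules used in the assignment.

nyc_311_clean <- nyc_311_clean %>%
  mutate(
    issue_priority = case_when(
      str_detect(complaint_type, regex("heat|hot water|electric|water system", ignore_case = TRUE)) ~ "High",
      str_detect(complaint_type, regex("noise|illegal parking|blocked driveway", ignore_case = TRUE)) ~ "Medium",
      TRUE ~ "Low/Other"
    )
  )

count(nyc_311_clean, issue_priority, sort = TRUE)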
Comparing total service requests across boroughs highlights differences in reported service demand by location. Borough-level request volume can be influenced by multiple factors such as population size, housing density, local infrastructure, and reporting behavior. This comparison provides useful context for future work, such as normalizing requests by population or exploring borough-specific complaint patterns.
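The borough comparison can be expressed as a simple grouped count and, optionally, a bar chart. This is a sketch consistent with the cleaned column names rather than the exact chunk used.

nyc_311_clean %>%
  count(borough, name = "total_requests") %>%
  arrange(desc(total_requests))

# Optional visual comparison of request volume by borough
ggplot(nyc_311_clean, aes(x = fct_infreq(borough))) +
  geom_bar() +
  labs(x = "Borough", y = "Total service requests")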
Conclusions
This assignment demonstrated a reproducible workflow for acquiring and transforming a real-world administrative dataset using R and Quarto. The NYC 311 service request data was loaded directly from a public GitHub repository, column names were standardized for consistent use in code, and a meaningful subset of variables was selected to support downstream analysis. Key data preparation steps included handling missing values appropriately and converting timestamp fields into usable datetime formats.
As a creative extension, I explored the distribution of complaint types across NYC by identifying the most frequent complaint categories, and I performed simple rule-based feature engineering to classify complaints into priority tiers. I also compared request volumes across boroughs to highlight differences in service demand by location.