Assignment 1 DATA607: Basic Data Loading and Transformation (NYC 311)
Author
Muhammad Suffyan Khan
Published
February 1, 2026
Dataset Source
The dataset used for this assignment is the NYC 311 Service Requests dataset, which records non-emergency service requests submitted by residents of New York City. Each record includes the complaint type, responsible agency, borough, request status, and timestamps for when the request was created and closed. The data originates from NYC Open Data and is distributed in CSV format. Source Link: https://www.kaggle.com/search?q=NYC+311+Service+Requests
Motivation for Dataset Selection
This dataset was selected because it represents a realistic and widely used urban administrative data source that is highly relevant to data science and public policy analysis. The NYC 311 dataset reflects real-world challenges in data acquisition and management, including working with categorical variables, date-time fields, and missing values.
Additionally, the dataset’s focus on New York City makes it directly applicable to analyzing service demand, operational efficiency, and citizen-reported issues, aligning well with the course’s emphasis on practical, professional data workflows.
Planned Approach
The planned approach involves loading the dataset from the GitHub repository into R, selecting a subset of relevant columns, and standardizing column names to improve clarity and reproducibility. Where appropriate, data values will be transformed into more interpretable formats, such as converting date strings into date-time objects or standardizing categorical values.
Libraries Loaded
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
X Created.Date Closed.Date Agency
1 0 10/31/2013 02:08:41 AM <NA> NYPD
2 1 10/31/2013 02:01:04 AM <NA> NYPD
3 2 10/31/2013 02:00:24 AM 10/31/2013 02:40:32 AM NYPD
4 3 10/31/2013 01:56:23 AM 10/31/2013 02:21:48 AM NYPD
5 4 10/31/2013 01:53:44 AM <NA> DOHMH
6 5 10/31/2013 01:46:52 AM <NA> NYPD
Agency.Name Complaint.Type
1 New York City Police Department Noise - Street/Sidewalk
2 New York City Police Department Illegal Parking
3 New York City Police Department Noise - Commercial
4 New York City Police Department Noise - Vehicle
5 Department of Health and Mental Hygiene Rodent
6 New York City Police Department Noise - Commercial
Descriptor Location.Type Incident.Zip Address.Type
1 Loud Talking Street/Sidewalk 11432.0 ADDRESS
2 Commercial Overnight Parking Street/Sidewalk 11378.0 BLOCKFACE
3 Loud Music/Party Club/Bar/Restaurant 10032.0 ADDRESS
4 Car/Truck Horn Street/Sidewalk 10023.0 BLOCKFACE
5 Condition Attracting Rodents Vacant Lot 10027.0 BLOCKFACE
6 Banging/Pounding Club/Bar/Restaurant 11372.0 BLOCKFACE
City Landmark Status Borough
1 JAMAICA <NA> Assigned QUEENS
2 MASPETH <NA> Open QUEENS
3 NEW YORK <NA> Closed MANHATTAN
4 NEW YORK <NA> Closed MANHATTAN
5 NEW YORK <NA> Pending MANHATTAN
6 JACKSON HEIGHTS <NA> Open QUEENS
The dataset is loaded directly from a public GitHub repository to ensure reproducibility. After loading the data into R, an initial preview is performed to confirm that the file was read correctly and to inspect the available variables. Rather than repeatedly working around spaces and inconsistent capitalization in the original column names, the names are standardized once, which makes all later column selection straightforward.
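The loading and renaming steps can be sketched as follows. This is a minimal sketch, not the exact chunk used: the raw-file URL is a placeholder for the actual repository path, and clean_names() from janitor is assumed as the renaming step because the inspected variable names below appear in snake_case.

# Hypothetical raw-file URL; replace with the actual GitHub path used for the assignment
url <- "https://raw.githubusercontent.com/<user>/<repo>/main/nyc_311.csv"

nyc_311 <- read.csv(url, stringsAsFactors = FALSE)
head(nyc_311)                     # preview raw columns (Created.Date, Closed.Date, ...)

nyc_311 <- clean_names(nyc_311)   # standardize names to snake_case (created_date, closed_date, ...)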
Data Inspection
Before performing transformations, the dataset structure and available variables are inspected to understand data types and naming conventions.
str(nyc_311)
'data.frame': 111069 obs. of 14 variables:
$ x : int 0 1 2 3 4 5 6 7 8 9 ...
$ created_date : chr "10/31/2013 02:08:41 AM" "10/31/2013 02:01:04 AM" "10/31/2013 02:00:24 AM" "10/31/2013 01:56:23 AM" ...
$ closed_date : chr NA NA "10/31/2013 02:40:32 AM" "10/31/2013 02:21:48 AM" ...
$ agency : chr "NYPD" "NYPD" "NYPD" "NYPD" ...
$ agency_name : chr "New York City Police Department" "New York City Police Department" "New York City Police Department" "New York City Police Department" ...
$ complaint_type: chr "Noise - Street/Sidewalk" "Illegal Parking" "Noise - Commercial" "Noise - Vehicle" ...
$ descriptor : chr "Loud Talking" "Commercial Overnight Parking" "Loud Music/Party" "Car/Truck Horn" ...
$ location_type : chr "Street/Sidewalk" "Street/Sidewalk" "Club/Bar/Restaurant" "Street/Sidewalk" ...
$ incident_zip : chr "11432.0" "11378.0" "10032.0" "10023.0" ...
$ address_type : chr "ADDRESS" "BLOCKFACE" "ADDRESS" "BLOCKFACE" ...
$ city : chr "JAMAICA" "MASPETH" "NEW YORK" "NEW YORK" ...
$ landmark : chr NA NA NA NA ...
$ status : chr "Assigned" "Open" "Closed" "Closed" ...
$ borough : chr "QUEENS" "QUEENS" "MANHATTAN" "MANHATTAN" ...
# Total rows loaded
nrow(nyc_311)
[1] 111069
Column Selection
A subset of relevant variables is selected to focus on key attributes of each service request, including the type of complaint, responsible agency, location, status, and request timing. This reduces complexity and prepares the data for clearer downstream analysis. Some records contain missing values in key fields; for this introductory transformation exercise, incomplete rows are removed from the selected subset to simplify downstream analysis and ensure consistent datetime parsing. The closed_date column is the exception, since a request may still be open and therefore legitimately have no closing timestamp.
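A possible version of this selection and parsing step is sketched below. The retained columns follow the str() output above, but the exact subset, the fields required to be non-missing, and the lubridate parsing call are illustrative assumptions rather than the assignment's actual chunk.

nyc_311_clean <- nyc_311 %>%
  select(created_date, closed_date, agency, agency_name, complaint_type,
         descriptor, location_type, incident_zip, city, status, borough) %>%
  # drop rows with missing key fields, but keep NA closed_date (request may still be open)
  drop_na(created_date, complaint_type, agency, borough, status) %>%
  mutate(
    created_date = parse_date_time(created_date, orders = "mdY IMS p"),  # e.g., "10/31/2013 02:08:41 AM"
    closed_date  = parse_date_time(closed_date,  orders = "mdY IMS p")
  )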
# Calculate response time (hours) only when closed_date exists
# and treat negative / unrealistic values as invalid (set to NA)
nyc_311_clean <- nyc_311_clean %>%
  mutate(
    response_time_hours = case_when(
      is.na(closed_date) ~ NA_real_,
      TRUE ~ as.numeric(difftime(closed_date, created_date, units = "hours"))
    ),
    response_time_hours = ifelse(response_time_hours < 0, NA_real_, response_time_hours),
    response_time_hours = ifelse(response_time_hours > 24 * 365, NA_real_, response_time_hours)  # optional cap: > 1 year
  )
Quick Note
During validation, some records produced negative response times because the closed timestamp occurred earlier than the created timestamp. This indicates inconsistent or erroneous timestamps in the source data. These negative values were treated as invalid and set to missing (NA) to avoid misleading results.
summary(nyc_311_clean$response_time_hours)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 1.524 14.817 35.726 48.000 639.025 52846
The top 10 complaint categories provide a high-level view of the issues residents most frequently report through NYC 311. A small number of complaint types account for a large share of total requests, suggesting that many service demands are driven by recurring quality-of-life issues. This type of frequency analysis helps prioritize which categories are most important for deeper operational or policy-focused investigation.
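The frequency table behind this discussion can be reproduced with a short dplyr summary; the variable names match the cleaned dataset, but the exact chunk used in the assignment is not shown here, so treat this as a sketch.

nyc_311_clean %>%
  count(complaint_type, sort = TRUE, name = "requests") %>%  # requests per complaint category
  slice_max(requests, n = 10)                                # keep the ten most frequent categories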
Feature Engineering: Create an “issue_priority” label
To demonstrate feature engineering, I created a simplified “issue_priority” variable by grouping complaints into High, Medium, and Low/Other categories using rule-based matching on complaint keywords (e.g., heat/hot water and utility-related issues). This engineered feature makes the dataset more interpretable for downstream analysis by reducing many complaint categories into broader urgency levels. While the classification rules are simple, they mirror real-world preprocessing where domain knowledge is used to create decision-support variables.
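A minimal sketch of this rule-based step is shown below; the keyword patterns are illustrative examples of the matching rules described above, not the exact rules used in the assignment.

nyc_311_clean <- nyc_311_clean %>%
  mutate(
    issue_priority = case_when(
      str_detect(complaint_type, regex("heat|hot water|electric|water system", ignore_case = TRUE)) ~ "High",
      str_detect(complaint_type, regex("noise|illegal parking|blocked driveway", ignore_case = TRUE)) ~ "Medium",
      TRUE ~ "Low/Other"
    )
  )

count(nyc_311_clean, issue_priority, sort = TRUE)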
Comparing total service requests across boroughs highlights differences in reported service demand by location. Borough-level request volume can be influenced by multiple factors such as population size, housing density, local infrastructure, and reporting behavior. This comparison provides useful context for future work, such as normalizing requests by population or exploring borough-specific complaint patterns.
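The borough comparison can be expressed as a simple grouped count and, optionally, a bar chart. This is a sketch consistent with the cleaned column names rather than the exact chunk used.

nyc_311_clean %>%
  count(borough, name = "total_requests") %>%
  arrange(desc(total_requests))

# Optional visual comparison of request volume by borough
ggplot(nyc_311_clean, aes(x = fct_infreq(borough))) +
  geom_bar() +
  labs(x = "Borough", y = "Total service requests")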
Conclusions
This assignment demonstrated a reproducible workflow for acquiring and transforming a real-world administrative dataset using R and Quarto. The NYC 311 service request data was loaded directly from a public GitHub repository, column names were standardized for consistent use in code, and a meaningful subset of variables was selected to support downstream analysis. Key data preparation steps included handling missing values appropriately and converting timestamp fields into usable datetime formats.
As a creative extension, I explored the distribution of complaint types across NYC by identifying the most frequent complaint categories, and I performed simple rule-based feature engineering to classify complaints into priority tiers. I also compared request volumes across boroughs to highlight differences in service demand by location.