Data Analysis on NYC Service Requests After COVID-19's First Year¶

Melvin Paredes
February 1, 2023

Table of Content¶

  • Project Objectives and Data Source
  • Preview of the Data
    • Environment
    • Preview of the Original Dataset
    • Attributes of the Dataset
  • Data Cleaning
    • Addressing Missing Values
    • Formatting and Data Types
  • Data Manipulation
    • Most Frequent Complaints
    • Complaint Rate per 1,000 People
    • Aggregation of Complaint Types
    • Creation of Time Series
    • Group Proportions
    • Complaint Cycle Time
  • Data Analysis and Visualizations
    • 10 Most Frequent Complaints in New York City
    • Seasonal Patterns in the Number of Complaints
    • Descriptors' Proportion of NYC's Most Frequent Complaints
    • Number of Complaints According to Borough
    • Annual Complaint Rate According to Borough
    • Number of Complaints According to Responsible Government Agency
    • Heatmap: Level of Activity per Hour of the Day
    • Heatmap: Level of Activity According to Type and Weekday
    • Descriptive Statistics: Complaint Processing Time
  • Conclusions and Recommendations

Project Objectives and Data Source ¶

The objective of the present project was to analyze service requests from the City of New York (NYC 311) for the year 2021 to derive insights that could be used by local authorities for data-driven decision-making. The period that was analyzed was the year 2021, as it was a period in which the population was trying to go back to normality, following the economy shut down and confinement measures of the year 2020. In that regard, this analysis sought to provide recommendations to make the NYC 311 System and city services more efficient with the ultimate objective of increasing the level of satisfaction and well-being of its residents.

To enhance the impact that could be derived from this work, the present project focused largely on examining in detail the most frequent types of complaints but also on analyzing the data according to the five boroughs of New York City: Brooklyn, the Bronx, Manhattan, Queens, and Staten Island. In that respect, by comparing the results between the boroughs, a clearer picture was obtained in terms of overall performance in different areas, which in turn led to key recommendations that could be implemented in data-driven decision-making by public officials.

Concerning the data source, it was concluded that the dataset used for this project (311 Service Requests from 2010 to Present, 2023) was reliable and could be trusted since it was collected by the NYC 311 system and made public by NYC Open Data, which is managed by the Open Data Team at the NYC Office of Technology and Innovation (OTI).

The original dataset, as provided by NYC Open Data, contained 30.6 million rows and extended from January 1, 2010, to July 30, 2022, with each row representing one particular claim. Since a file of that size could either be difficult to load or slow down processing in Python, prior to this analysis we manipulated the dataset in BigQuery and reduced its size using SQL queries to select only the data that corresponded to the year 2021. The resulting subset contained 3.2 million rows and it was used as the original dataset for this analysis.

Preview of the Data ¶

Environment ¶

In [1]:
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mticker

%matplotlib inline

%reload_ext nb_black
In [2]:
# set option to display all columns
pd.set_option("display.max_columns", None)

# set option to display more rows
pd.set_option("display.min_rows", None)
pd.set_option("display.max_rows", 300)

# set options for plots
pd.set_option("display.width", 4500)
plt.rcParams["figure.figsize"] = (15, 7)

# Format numbers
pd.options.display.float_format = "{:,.0f}".format

# Set the plot background to white
plt.rcParams["axes.facecolor"] = "white"
plt.rcParams["figure.facecolor"] = "white"

# Set plot size
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["axes.axisbelow"] = True

styles = [
    {"selector": "th", "props": [("text-align", "left")]},
    {
        "selector": ".row_heading",
        "props": [
            ("text-align", "left"),
            ("font-size", "100%"),
            ("color", "black"),
        ],
    },
    {
        "selector": "caption",
        "props": [
            ("text-align", "left"),
            ("font-size", "120%"),
            ("color", "black"),
        ],
    },
]

Create function to display tables with correct format

In [3]:
def tbl_format(
    df, table_name, hide_index=True, col_types="formatted", col_names="titlecase"
):

    # create function to change column names to title case
    if col_names == "titlecase":

        def titlecase_col(string):
            def capitalize_words(word):
                exception = ["a", "an", "the", "or", "of", "and"]
                if word not in exception:
                    return word.capitalize()
                return word

            words_list = string.replace(" ", "_")
            words_list = string.split("_")
            words_formatted = map(capitalize_words, words_list)
            return " ".join(words_formatted)

        df = df.rename(columns=titlecase_col)

        if df.index.name is not None:
            df.index.name = titlecase_col(df.index.name)

    string_cols = list(df.select_dtypes(include="string"))
    date_cols = list(df.select_dtypes(include="datetime64"))
    int_cols = list(df.select_dtypes(include="int64"))
    float_cols = list(df.select_dtypes(include="float64"))

    cols_left = string_cols + date_cols
    cols_right = int_cols + float_cols

    # create dictionaries to format integer and float columns
    int_dict = dict.fromkeys(int_cols, "{:,.0f}")
    float_dict = dict.fromkeys(float_cols, "{:,.1f}")

    # Merge the dictionaries
    number_format = {**int_dict, **float_dict}

    # dict used to align column names of string columns to the left
    cols_left_dict = dict.fromkeys(
        cols_left, [{"selector": "th", "props": [("text-align", "left")]}]
    )

    # dict used to align column names of number columns to the right
    cols_right_dict = dict.fromkeys(
        cols_right, [{"selector": "th", "props": [("text-align", "right")]}]
    )

    header_align_dict = {**cols_left_dict, **cols_right_dict}
    styles_display = [
        {"selector": "th", "props": [("text-align", "left")]},
        {
            "selector": ".row_heading",
            "props": [
                ("text-align", "left"),
                ("font-size", "100%"),
                ("color", "black"),
            ],
        },
        {
            "selector": "caption",
            "props": [
                ("text-align", "left"),
                ("font-size", "120%"),
                ("color", "black"),
            ],
        },
    ]

    # if the columns have been assigned data types
    if col_types == "formatted":

        if hide_index == False:
            styler = (
                df.style.format(number_format)
                # align the headers according to dict values
                .set_table_styles(header_align_dict)
                # align index and configure caption using styles_display
                .set_table_styles(styles_display, overwrite=False)
                # align values to text if they are string or date
                .applymap(lambda x: "text-align:left", subset=cols_left).set_caption(
                    table_name
                )
            )

        else:
            styler = (
                df.style.format(number_format)
                .set_table_styles(header_align_dict)
                .set_table_styles(styles_display, overwrite=False)
                .applymap(lambda x: "text-align:left", subset=cols_left)
                .hide(axis="index")
                .set_caption(table_name)
            )

    # if the columns have not been assigned data types
    # and contain mostly strings
    elif col_types == "text_unformatted":
        if hide_index == False:
            styler = (
                df.style.format(number_format)
                .applymap(lambda x: "text-align:left")
                .applymap(lambda x: "text-align:right", subset=cols_right)
                .set_table_styles(header_align_dict, overwrite=True)
                .set_table_styles(styles_display, overwrite=False)
                .set_caption(table_name)
            )
        else:
            styler = (
                df.style.format(number_format)
                .applymap(lambda x: "text-align:left")
                .applymap(lambda x: "text-align:right", subset=cols_right)
                .set_table_styles(header_align_dict, overwrite=True)
                .set_table_styles(styles_display, overwrite=False)
                .hide(axis="index")
                .set_caption(table_name)
            )

    # if the columns have not been assigned data types
    # and contain mostly numbers
    elif col_types == "number_unformatted":
        if hide_index == False:
            styler = (
                df.style.format(number_format)
                .set_table_styles(styles_display, overwrite=False)
                .hide(axis="index")
                .set_caption(table_name)
            )
        else:
            styler = (
                df.style.format(number_format)
                .set_table_styles(styles_display, overwrite=False)
                .set_caption(table_name)
            )

    else:
        return "Ambiguous argument for col_types"
    display(styler)

Preview of the Original Dataset ¶

In [4]:
complaints_orig = pd.read_csv("C:\\dataset\\311_service_requests_2021.csv", dtype="str")
In [5]:
# Preview the first 5 rows of the original dataset
tbl_format(
    complaints_orig.head(),
    "Table 1. Preview of the original dataset: NYC Service Requests 2021",
    col_types="text_unformatted",
    col_names="actual",
)
Table 1. Preview of the original dataset: NYC Service Requests 2021
created_date closed_date complaint_type descriptor borough agency agency_name
2021-09-05 21:47:24 UTC nan Mold Public Complaint - Comm Location STATEN ISLAND DOHMH Department of Health and Mental Hygiene
2021-02-04 11:46:00 UTC 2021-02-21 12:00:00 UTC Snow E9 Snow / Icy Sidewalk BROOKLYN DSNY Department of Sanitation
2021-02-03 15:27:00 UTC 2021-02-05 12:00:00 UTC Snow E9 Snow / Icy Sidewalk QUEENS DSNY Department of Sanitation
2021-02-08 13:23:00 UTC 2021-02-22 12:00:00 UTC Snow E9 Snow / Icy Sidewalk BROOKLYN DSNY Department of Sanitation
2021-09-06 15:14:00 UTC 2021-09-07 12:00:00 UTC Storm 2S Bulk-Storm Related QUEENS DSNY Department of Sanitation

It is important to note that the column names of the original dataset were in snake_case format, which we kept throughout the analysis to perform data manipulation for the reasons explained in the data cleaning phase; however, in order to present tables in a format that is appropriate for visualization purposes, the column names were shown in title case whenever they were displayed in such tables. This was taken care by one of the functionalities of the function *tbl_format*, which we created earlier for this project.

Attributes of the Dataset ¶

In [6]:
# Identify the number of observations (complaints) and columns in the data
row_count = complaints_orig.shape[0]
col_count = complaints_orig.shape[1]

print("Number of rows: ", "{:,.0f}".format(row_count))
print("Number of selected columns: ", col_count)
print("Date of first record: ", complaints_orig["created_date"].min())
print("Date of Last record: ",complaints_orig["created_date"].max())
Number of rows:  3,214,274
Number of selected columns:  7
Date of first record:  2021-01-01 00:00:00 UTC
Date of Last record:  2021-12-30 23:59:00 UTC

The 2021 subset of the NYC 311 dataset, which we extracted beforehand through SQL queries, contained 3.2 million rows. This figure is the total number of complaints made in the City of New York during the year 2021. Similarly, the initial dataset contained 52 columns, but it was reduced to the seven most relevant fields prior to further manipulation in Python to speed up processing time in any device. Therefore, the size of the dataset that was analyzed herein was 22.5 million cells.

Relevant fields selected from the original dataset

In [7]:
# Get column names from dataset
column_names = complaints_orig.columns.to_list()

# Create a data frame for the attributes of the dataset
selected_columns = pd.DataFrame({"column_name": column_names})

# load dataset with field descriptions of all columns
column_descriptions = pd.read_csv("C:\\dataset\\field_descriptions.csv", dtype="str")

# Merge the data frames to create table with field descriptions
selected_columns = pd.merge(selected_columns, column_descriptions, on="column_name")

selected_columns["column_name"] = (
    selected_columns["column_name"].str.title().str.replace("_", " ")
)

# table
tbl_format(
    selected_columns,
    "Table 2. Selected fields from the original dataset",
    col_types="text_unformatted",
)
Table 2. Selected fields from the original dataset
Column Name Description
Created Date Date SR was created
Closed Date Date SR was closed by responding agency
Complaint Type This is the first level of a hierarchy identifying the topic of the incident or condition
Descriptor This is associated to the Complaint Type, and provides further detail on the incident or condition
Borough Borough Block and Lot, provided by geovalidation
Agency Acronym of responding City Government Agency
Agency Name Full Agency name of responding City Government Agency

Data types of the selected fields

In [8]:
# Check columns" data types for the relevant variables
print("\033[1m" + "Data types of the selected columns:" + "\033[0m")

complaints_orig.dtypes
Data types of the selected columns:
Out[8]:
created_date      object
closed_date       object
complaint_type    object
descriptor        object
borough           object
agency            object
agency_name       object
dtype: object

The seven columns that were selected for this analysis were originally formatted as general objects. More specific and appropriate data types were assigned to these seven variables during the cleaning stage.

Types of complaints

In [9]:
# Explore how many different types of complaints there are
types_of_compl = complaints_orig["complaint_type"].unique().size

print ("Types of complaints: ",types_of_compl)
Types of complaints:  224

The degree of comprehensiveness of the dataset can be illustrated by identifying how many different types of complaints were found in the data. There were 224 different types of complaints. This figure demonstrates that the 311 Service Requests dataset is very complete and could be used for a wide range of analyses; however, the scope of this project was limited to the 10 most frequent complaints of New York City.

Preview: number of occurrences of the most frequent types of conplaints

In [10]:
# Count the occurrences of unique values in the Complaint Type column
top_complaints = (
    complaints_orig["complaint_type"]
    .value_counts()
    .rename_axis("complaint_type")
    .reset_index(name="count")
)

tbl_format(
    top_complaints.head(15),
    "Table 3. 2021 Most frequent complaints",
    col_types="text_unformatted",
)
Table 3. 2021 Most frequent complaints
Complaint Type Count
Noise - Residential 361,416
Illegal Parking 317,202
Request Large Bulky Item Collection 302,558
HEAT/HOT WATER 199,049
Noise - Street/Sidewalk 184,190
Blocked Driveway 152,264
UNSANITARY CONDITION 90,271
Noise - Vehicle 84,900
Street Condition 73,027
Street Light Condition 56,184
Water System 55,342
Noise - Commercial 53,982
PAINT/PLASTER 52,310
PLUMBING 51,839
Abandoned Vehicle 48,627

As shown in Table 3, which contains the most frequent complaints in New York City, the text formatting of the values corresponding to the Complaint Type variable was not consistent, as some values were in upper case letters while others in lower case. This issue was corrected during the data cleaning stage of this analysis.

Unique values in the Borough column

In [11]:
borough_values = pd.DataFrame(complaints_orig["borough"].unique())
borough_values.rename(
    columns={borough_values.columns[0]: "Unique values in the Borough column"}
).sort_values("Unique values in the Borough column").head(15).style.set_properties(
    **{"text-align": "left"}
).hide(
    axis="index"
)
Out[11]:
Unique values in the Borough column
BRONX
BROOKLYN
MANHATTAN
QUEENS
STATEN ISLAND
Unspecified
nan

The list of unique values that were present in the Borough field of the original dataset showed that this column, in addition to containing the five boroughs of New York City, included missing values (NA) as well as rows containing the text "Unspecified". The exact figure for the number of cells with these characteristics is shown in Table 4, which added the unspecified values to the number of NA cells.

City government agencies in the data

In [12]:
agencies = complaints_orig.copy()
agencies["name_and_acronym"] = (
    complaints_orig["agency_name"] + " (" + complaints_orig["agency"] + ")"
)
agencies = pd.DataFrame(agencies["name_and_acronym"].unique())

agencies.rename(
    columns={
        agencies.columns[
            0
        ]: "City Government Agencies that responded to Service Requests"
    }
).sort_values("City Government Agencies that responded to Service Requests").head(
    50
).style.hide(
    axis="index"
).set_properties(
    **{"text-align": "left"}
)
Out[12]:
City Government Agencies that responded to Service Requests
Department for the Aging (DFTA)
Department of Buildings (DOB)
Department of Consumer Affairs (DCA)
Department of Education (DOE)
Department of Environmental Protection (DEP)
Department of Health and Mental Hygiene (DOHMH)
Department of Homeless Services (DHS)
Department of Housing Preservation and Development (HPD)
Department of Information Technology and Telecommunications (DOITT)
Department of Parks and Recreation (DPR)
Department of Sanitation (DSNY)
Department of Transportation (DOT)
Economic Development Corporation (EDC)
Mayor’s Office of Special Enforcement (MAYOR’S OFFICE OF SPECIAL ENFORCEMENT)
New York City Police Department (NYPD)
Taxi and Limousine Commission (TLC)

The list of unique values found in the Agency Name column showed that a total of 16 City Government agencies were included in the data and that the name for one of those institutions needed to be corrected.

Missing values (NA)

To have an accurate representation of the initial number of missing values in the relevant variables, the values entered as "Unspecified" in the Borough column were replaced with NA values.

In [13]:
# Replace "Unspecified" values with NA values
complaints_orig["borough"] = complaints_orig["borough"].replace("Unspecified", np.nan)

# Count the number of NA values for each column
na_values = (
    complaints_orig[
        [
            "created_date",
            "closed_date",
            "complaint_type",
            "descriptor",
            "borough",
            "agency",
            "agency_name",
        ]
    ]
    .isna()
    .sum()
    .rename_axis(["Column"])
    .reset_index(name="NA Values")
)

tbl_format(
    na_values, "Table 4. NA values in the selected fields", col_types="text_unformatted"
)
Table 4. NA values in the selected fields
Column Na values
created_date 0
closed_date 110,211
complaint_type 0
descriptor 90,682
borough 16,317
agency 0
agency_name 0

Data Cleaning ¶

The data cleaning steps consisted of removing the rows that contained missing (or Unspecified) values in the Borough column, assigning correct data types to each column, fixing the text formatting for the values contained in the Complaint Type and Borough columns, and correcting typos found in the Agency Name field.

Addressing Missing Values ¶

Create new data frame for data cleaning and manipulation

In [14]:
complaints_clean = complaints_orig.copy()

As shown in Table 4, three columns of the original dataset contained missing values; nonetheless, the Created Date and Borough fields, which were the most relevant columns for this analysis, had zero and 16,317 NA values, respectively. Considering that the dataset had a total of 3.2 million rows, the number of rows with missing values in the Borough column was relatively low, amounting to only 0.5% of the total number of rows. In view of this, the NA values that were present in the Borough could be removed without affecting the results.

In [15]:
NA_percentage = "{:.1%}".format(
    complaints_clean["borough"].isnull().sum() / complaints_clean.shape[0]
)
print("Percentage of missing values in the Borough Column: ", NA_percentage)
Percentage of missing values in the Borough Column:  0.5%

Remove the missing values in the Borough Column

In [16]:
# Remove the remaining NA values in the Borough Column
complaints_clean = complaints_clean[complaints_clean["borough"].notna()]

Formatting and Data Types ¶

Column names

Despite the fact that Python can handle column names containing spaces and capital letters, it is more appropriate to use snake_case format not only to simplify operations within the program itself but also to make the code and files more consistent with current naming conventions. For instance, many databases already do not allow column names with capital letters, and popular software libraries like Pandas and Seaborn use underscores in their documentation. In addition, considering that the aforementioned trend is increasing as more statistical packages implement this naming convention, using the snake_case format for this analysis would make it more robust for future reference.

Therefore, although normal title case was used to display tables, snake_case was the format implemented throughout the analysis to perform operations related to data frames.

Assign appropriate data types to the columns

As mentioned earlier, the data type of the seven columns that were relevant to this analysis needed to be converted to more specific data types. This was addressed in the following code cells.

In [17]:
# Convert "Created Date" column to datetime
complaints_clean["created_date"] = pd.to_datetime(
    complaints_clean["created_date"], format="%Y-%m-%d %H:%M:%S UTC"
)

# Convert "Closed Date" column to datetime
complaints_clean["closed_date"] = pd.to_datetime(
    complaints_clean["closed_date"], format="%Y-%m-%d %H:%M:%S UTC"
)

# Convert "City" and "Complaint Type" columns to string data type

complaints_clean["complaint_type"] = complaints_clean["complaint_type"].astype("string")
complaints_clean["descriptor"] = complaints_clean["descriptor"].astype("string")
complaints_clean["borough"] = complaints_clean["borough"].astype("string")
complaints_clean["agency"] = complaints_clean["agency"].astype("string")
complaints_clean["agency_name"] = complaints_clean["agency_name"].astype("string")

complaints_clean.dtypes
Out[17]:
created_date      datetime64[ns]
closed_date       datetime64[ns]
complaint_type            string
descriptor                string
borough                   string
agency                    string
agency_name               string
dtype: object

Change text entries to title case format

As mentioned earlier, the values contained in the Complaint Type and Borough columns had inconsistent formatting; thus, in order to have uniformity across all the text entries, the format for all these values was changed to title case. To achieve this, we created a simple but useful function called titlecase_vals because the built-in function str.title capitalizes all words without exception, including conjunction words, which was something we needed to avoid. Our function titlecase_vals addresses these exceptions according to the possible values found in the dataset that was used for this analysis, but it could also be used for more general purposes by expanding the list of exceptions to included more conjunctions, articles, and other words that should not be capitalized.

In [23]:
def titlecase_vals(df, column_name):
    df[column_name] = df[column_name].str.title()
    df[column_name] = df[column_name].replace(
        {" Or ": " or ", " Of ": " of ", " And ": " and "}, regex=True
    )
    return df[column_name]
In [24]:
complaints_clean["complaint_type"] = titlecase_vals(complaints_clean, "complaint_type")
complaints_clean["descriptor"] = titlecase_vals(complaints_clean, "descriptor")
complaints_clean["borough"] = complaints_clean["borough"].str.title()

Correct the name for one of the government agencies

In [25]:
complaints_clean["agency_name"] = complaints_clean["agency_name"].replace(
    "Mayor’s Office of Special Enforcement", "Mayor's Office of Special Enforcement"
)

Sort the data frame based on created date and complaint Type

In [26]:
complaints_clean.sort_values(
    ["created_date", "complaint_type"], ascending=[True, False], inplace=True
)
complaints_clean.reset_index(drop=True, inplace=True)

Display a preview the cleaned data frame

In [27]:
# Display preview ofthe cleaned data frame

tbl_format(
    complaints_clean.head(),
    "Table 5. Preview of the cleaned data frame used for data manipulation",
    col_names="actual",
    hide_index=True,
)
Table 5. Preview of the cleaned data frame used for data manipulation
created_date closed_date complaint_type descriptor borough agency agency_name
2021-01-01 00:00:00 2021-01-08 00:00:01 Food Poisoning 1 or 2 Brooklyn DOHMH Department of Health and Mental Hygiene
2021-01-01 00:00:04 2021-01-01 00:23:27 Noise - Commercial Car/Truck Music Brooklyn NYPD New York City Police Department
2021-01-01 00:00:17 2021-01-01 00:43:43 Illegal Fireworks Brooklyn NYPD New York City Police Department
2021-01-01 00:00:24 2021-01-01 00:42:21 Illegal Fireworks Brooklyn NYPD New York City Police Department
2021-01-01 00:00:49 2021-01-01 01:08:12 Noise - Residential Loud Talking Bronx NYPD New York City Police Department

Data Manipulation ¶

Most Frequent Complaints ¶

As discussed previously, this analysis was based on the 10 most frequent complaints of New York City. Therefore, a new data frame was created that included only the 10 types of complaints that had the most rows in the original dataset.

In [28]:
# Create a list containing only the most frequent types of complaints
major_complaints_list = (
    complaints_clean["complaint_type"].value_counts()[:10].index.tolist()
)

# Create a new data frame with only these complaint types
major_complaints = complaints_clean[
    complaints_clean["complaint_type"].isin(major_complaints_list)
]
In [29]:
print("\033[1m" + "Most Frequent Complaints Data Frame" + "\033[0m")
print("Number of rows: ", "{:,.0f}".format(major_complaints.shape[0]))
Most Frequent Complaints Data Frame
Number of rows:  1,815,615

The new data frame that included only the 10 most frequent complaint types had 1.8 million rows, which was still a significant amount of data and represented more than 50% of total complaints.

Complaint Rate per 1,000 People ¶

Create function to calculate complaint rate per 1,000 people for each borough

To be able to compare the data between different boroughs effectively, it was necessary to have a variable that would take into consideration the difference in population between such boroughs. In that regard, we created a complaint rate variable to calculate the number of complaints per 1,000 people for each borough and complaint type. This variable was created by dividing the number of complaints of each complaint type by the borough's population and then multiplying this value by 1,000.

Population data for each borough were obtained from the U.S. Census Bureau Website (County Population Totals: 2020-2021, 2022).

In [30]:
def rate(df):

    # Set rate factor to 1,000 people
    rate_factor = 1000

    # Set populations for each borough according to US Census Bureau 2021 estimates
    bronx_population = 1424948
    brooklyn_population = 2641052
    manhattan_population = 1576876
    queens_population = 2331143
    staten_island_population = 493494

    # Calculate rate factor for each complaint type and borough

    if df["borough"] == "Bronx":
        return df["number_of_complaints"] * rate_factor / bronx_population
    elif df["borough"] == "Brooklyn":
        return df["number_of_complaints"] * rate_factor / brooklyn_population
    elif df["borough"] == "Manhattan":
        return df["number_of_complaints"] * rate_factor / manhattan_population
    elif df["borough"] == "Queens":
        return df["number_of_complaints"] * rate_factor / queens_population
    elif df["borough"] == "Staten Island":
        return df["number_of_complaints"] * rate_factor / staten_island_population
    else:
        return np.nan

Create function to calculate complaint rate per 1,000 people for the entire City of New York

In [31]:
def rate_all(df):

    # Set rate factor to 1,000 people
    rate_factor = 1000

    # Set populations for each borough according to US Census Bureau
    bronx_population = 1424948
    brooklyn_population = 2641052
    manhattan_population = 1576876
    queens_population = 2331143
    staten_island_population = 493494
    
    nyc_population = (bronx_population
                      + brooklyn_population
                      + manhattan_population
                      + queens_population
                      + staten_island_population)

    # Calculate rate factor for the entire City of New York
    return df["number_of_complaints"] * rate_factor / nyc_population
    

Aggregation of Complaint Types ¶

Total aggregation of complaints types in the City of New York

In [32]:
# Calculate total number of complaints by counting the occurences
nyc_year = (
    major_complaints["complaint_type"]
    .value_counts()
    .rename_axis(["complaint_type"])
    .reset_index(name="number_of_complaints")
)

# Add column with the calculated annual rate using the previously defined function
nyc_year["annual_rate_(per_1,000_people)"] = nyc_year.apply(rate_all, axis=1)

tbl_format(nyc_year, "Table 6. Annual rate of NYC's 10 most frequent complaints")
Table 6. Annual rate of NYC's 10 most frequent complaints
Complaint Type Number of Complaints Annual Rate (per 1,000 People)
Noise - Residential 361,411 42.7
Illegal Parking 317,193 37.5
Request Large Bulky Item Collection 302,555 35.7
Heat/Hot Water 199,044 23.5
Noise - Street/Sidewalk 184,185 21.8
Blocked Driveway 152,264 18.0
Unsanitary Condition 90,271 10.7
Noise - Vehicle 84,896 10.0
Street Condition 68,499 8.1
Water System 55,297 6.5

Total aggregation of complaints types according to borough

In [33]:
# Calculate total number of complaints based on borough
borough_year = (
    major_complaints.groupby("complaint_type")["borough"]
    .value_counts()
    .reset_index(name="number_of_complaints")
)

# Add column with the calculated annual rate using the previously defined function
borough_year["annual_rate_(per_1,000_people)"] = borough_year.apply(rate, axis=1)

# Create wide version for number of complaints according to borough
borough_complaints_wide = (
    pd.pivot(
        borough_year,
        index="borough",
        columns="complaint_type",
        values="number_of_complaints",
    )
    .rename_axis(None, axis=0)
    .rename_axis(None, axis=1)
)

tbl_format(
    borough_complaints_wide,
    "Table 7. NYC's 10 most frequent complaints according to borough",
    hide_index=False,
)
Table 7. NYC's 10 most frequent complaints according to borough
  Blocked driveway Heat/hot water Illegal parking Noise - residential Noise - street/sidewalk Noise - vehicle Request large bulky item collection Street condition Unsanitary condition Water system
Bronx 27,010 69,573 53,469 116,882 57,536 24,221 22,356 8,088 30,423 12,801
Brooklyn 57,179 56,152 109,920 91,497 38,985 16,300 113,285 19,186 28,098 14,097
Manhattan 4,644 44,301 46,056 71,611 64,664 19,262 35,414 10,154 17,724 7,899
Queens 59,776 26,749 98,071 71,992 20,916 23,750 103,085 23,156 11,693 14,343
Staten Island 3,655 2,269 9,677 9,429 2,084 1,363 28,415 7,915 2,333 6,157
In [34]:
# Create wide version for complaint rate according to borough
borough_rate_wide = (
    pd.pivot(
        borough_year,
        index="borough",
        columns="complaint_type",
        values="annual_rate_(per_1,000_people)",
    )
    .astype(float)
    .rename_axis(None, axis=0)
    .rename_axis(None, axis=1)
)

tbl_format(
    borough_rate_wide,
    "Table 8. Complaint annual rate according to borough",
    hide_index=False,
)
Table 8. Complaint annual rate according to borough
  Blocked driveway Heat/hot water Illegal parking Noise - residential Noise - street/sidewalk Noise - vehicle Request large bulky item collection Street condition Unsanitary condition Water system
Bronx 19.0 48.8 37.5 82.0 40.4 17.0 15.7 5.7 21.4 9.0
Brooklyn 21.7 21.3 41.6 34.6 14.8 6.2 42.9 7.3 10.6 5.3
Manhattan 2.9 28.1 29.2 45.4 41.0 12.2 22.5 6.4 11.2 5.0
Queens 25.6 11.5 42.1 30.9 9.0 10.2 44.2 9.9 5.0 6.2
Staten Island 7.4 4.6 19.6 19.1 4.2 2.8 57.6 16.0 4.7 12.5

Total aggregation of complaints types according to agency

In [35]:
# Calculate total number of complaints based on agency
agency_year = (
    major_complaints.groupby("agency_name")["complaint_type"]
    .value_counts()
    .reset_index(name="number_of_complaints")
    .sort_values("agency_name")
)

Creation of Time Series ¶

Hourly aggregation of complaints types according to borough

In [36]:
# Create a data frame for time series data
complaints_hourly = major_complaints.copy()

# Group by hour to count number of complaints
complaints_hourly = (
    complaints_hourly.groupby(
        [
            pd.Grouper(key="created_date", freq="H"),
            "complaint_type",
            "borough",
            "agency_name",
        ]
    )
    .size()
    .reset_index(name="number_of_complaints")
)

complaints_hourly.sort_values(
    ["created_date", "complaint_type"],
    ascending=[True, True],
    ignore_index=True,
    inplace=True,
)

Hourly total according to complaint type

In [37]:
hour_total = (
    complaints_hourly.groupby(["complaint_type"])
    .resample("H", on="created_date")
    .sum()
    .reset_index()
    .sort_values(by="created_date", ignore_index=True)
)

Averge number of complaints according to day of the week and hour

In [38]:
weekday_hour = hour_total.copy()
weekday_hour["weekday"] = weekday_hour["created_date"].dt.day_name()
weekday_hour["hour"] = weekday_hour["created_date"].dt.hour


weekday_hour = (
    weekday_hour.groupby(["weekday", "hour"])["number_of_complaints"]
    .aggregate("mean")
    .reset_index()
)

# Create list and category to set the order of weekdays
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday_hour["weekday"] = pd.Categorical(
    weekday_hour["weekday"], categories=days, ordered=True
)


weekday_hour_wide = (
    pd.pivot(
        weekday_hour,
        index="weekday",
        columns="hour",
        values="number_of_complaints",
    )
    .astype(float)
    .sort_values("weekday")
    .rename_axis(None, axis=0)
    .rename_axis(None, axis=1)
)
weekday_hour_wide = weekday_hour_wide.add_suffix(":00")
tbl_format(
    weekday_hour_wide,
    "Table 9. Averge number of complaints according to hour and day of the week",
    hide_index=False,
)
Table 9. Averge number of complaints according to hour and day of the week
  0:00 1:00 2:00 3:00 4:00 5:00 6:00 7:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00
Monday 25.0 15.4 9.6 6.6 5.7 6.8 9.6 16.2 21.3 24.8 25.8 25.8 25.8 24.9 25.6 24.6 25.3 25.4 25.5 26.0 26.6 27.6 28.3 23.8
Tuesday 17.0 10.7 7.1 4.9 4.7 6.5 9.6 17.2 21.5 24.7 24.5 24.6 24.5 24.3 24.8 23.7 25.4 24.7 24.4 25.0 26.9 28.4 28.9 25.1
Wednesday 17.6 10.7 6.8 4.9 4.6 6.4 9.6 17.1 20.9 23.7 23.1 23.0 23.0 23.0 23.3 22.9 23.7 22.9 24.3 24.6 26.7 27.5 27.8 23.9
Thursday 17.9 10.7 7.1 4.9 4.8 6.4 9.2 16.4 20.9 24.6 23.1 24.0 23.3 22.5 22.8 21.9 22.8 22.2 22.7 23.4 25.1 27.4 29.5 25.3
Friday 19.6 12.5 8.0 5.7 5.2 6.4 9.3 16.1 20.7 23.6 23.0 23.7 23.1 22.3 22.1 21.8 22.1 22.7 22.9 24.0 26.1 30.7 35.5 36.4
Saturday 34.8 23.8 15.7 11.1 8.0 6.8 7.8 11.2 15.5 17.7 18.8 19.4 18.9 20.0 20.3 20.3 21.1 21.9 23.8 26.4 30.2 37.6 48.6 54.7
Sunday 51.3 38.7 25.3 16.9 11.1 8.3 8.1 11.0 14.5 17.9 20.2 20.9 21.2 21.9 22.7 22.7 23.9 24.4 26.0 28.0 31.1 35.1 37.5 33.6

Average number of complaints for each day of the week according to type

In [39]:
daily = (
    complaints_hourly.groupby(["complaint_type"])
    .resample("D", on="created_date")
    .sum()
    .reset_index()
    .sort_values(by="created_date", ignore_index=True)
)

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday = daily.copy()
weekday["weekday"] = weekday["created_date"].dt.day_name()

weekday["weekday"] = pd.Categorical(weekday["weekday"], categories=days, ordered=True)

weekday_avg = (
    weekday.groupby(["weekday", "complaint_type"])
    .mean("number_of_complaints")
    .reset_index()
)

weekday_avg_wide = (
    pd.pivot(
        weekday_avg,
        index="complaint_type",
        columns="weekday",
        values="number_of_complaints",
    )
    .astype(float)
    .rename_axis(None, axis=1)
    .rename_axis("complaint_type", axis=0)
)
tbl_format(
    weekday_avg_wide,
    "Table 10. Average number of complaints for each day of the week according to type",
    hide_index=False,
)
Table 10. Average number of complaints for each day of the week according to type
  Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Complaint Type              
Blocked Driveway 420.3 404.3 414.0 419.2 424.6 423.2 422.5
Heat/Hot Water 632.3 593.1 509.8 505.2 553.8 520.3 513.2
Illegal Parking 865.8 893.3 881.1 909.2 896.8 809.2 844.5
Noise - Residential 780.4 690.4 660.0 672.1 899.6 1,616.3 1,631.4
Noise - Street/Sidewalk 400.2 336.1 342.3 359.7 482.2 777.6 843.9
Noise - Vehicle 195.5 187.9 192.6 197.0 232.0 299.4 328.2
Request Large Bulky Item Collection 1,035.8 965.1 920.3 872.5 750.2 518.0 756.5
Street Condition 216.2 231.4 231.1 211.3 180.2 116.5 130.6
Unsanitary Condition 313.4 315.0 291.8 274.3 251.4 148.6 141.5
Water System 157.7 173.3 176.7 168.4 162.9 114.7 109.6

Monthly Aggregation of Complaints Types

In [40]:
nyc_monthly = (
    complaints_hourly.groupby(["complaint_type"])
    .resample("M", on="created_date")
    .sum()
    .reset_index()
    .sort_values(by="created_date", ignore_index=True)
)

# Create data frame with months sorted for table
nyc_monthly_tbl = nyc_monthly.copy()

# Create category to keep month names ordered
months = [
    "Jan",
    "Feb",
    "Mar",
    "Apr",
    "May",
    "Jun",
    "Jul",
    "Aug",
    "Sep",
    "Oct",
    "Nov",
    "Dec",
]

# Extract month name from created date and rename column
nyc_monthly_tbl["created_date"] = (
    nyc_monthly_tbl["created_date"].dt.month_name().str[:3]
)
nyc_monthly_tbl.rename(columns={"created_date": "month"}, inplace=True)

# Assign category to sort months
nyc_monthly_tbl["month"] = pd.Categorical(
    nyc_monthly_tbl["month"], categories=months, ordered=True
)

# Rearrange columns back to original positions
nyc_monthly_tbl = nyc_monthly_tbl[
    list(("month", "complaint_type", "number_of_complaints"))
]

nyc_monthly_tbl.sort_values(by="month", inplace=True)
# Reshape the data frame
nyc_monthly_wide = (
    pd.pivot(
        nyc_monthly_tbl,
        index="month",
        columns="complaint_type",
        values="number_of_complaints",
    )
    .astype(float)
    # .reset_index()
    .rename_axis(None, axis=1)
    .rename_axis(None, axis=0)
)

tbl_format(
    nyc_monthly_wide,
    "Table 11. Monthly aggregation of complaints types",
    hide_index=False,
)
Table 11. Monthly aggregation of complaints types
  Blocked driveway Heat/hot water Illegal parking Noise - residential Noise - street/sidewalk Noise - vehicle Request large bulky item collection Street condition Unsanitary condition Water system
Jan 10,478.0 35,308.0 19,314.0 27,428.0 3,915.0 3,461.0 23,917.0 4,927.0 6,033.0 4,081.0
Feb 12,797.0 28,885.0 18,315.0 21,899.0 3,300.0 2,474.0 16,812.0 4,965.0 5,773.0 2,616.0
Mar 10,881.0 19,933.0 23,384.0 24,471.0 9,818.0 5,978.0 28,905.0 7,094.0 6,021.0 3,896.0
Apr 11,047.0 12,126.0 24,621.0 28,301.0 14,381.0 8,149.0 27,104.0 6,644.0 5,580.0 3,162.0
May 12,811.0 7,155.0 28,813.0 36,245.0 25,046.0 11,638.0 27,739.0 6,383.0 5,832.0 3,625.0
Jun 12,899.0 2,733.0 28,628.0 35,574.0 29,349.0 10,932.0 29,205.0 6,914.0 7,025.0 7,948.0
Jul 12,245.0 2,314.0 25,548.0 41,602.0 25,126.0 9,003.0 28,759.0 5,952.0 8,235.0 8,462.0
Aug 12,359.0 2,256.0 26,325.0 32,561.0 21,728.0 8,867.0 28,630.0 5,263.0 8,864.0 7,403.0
Sep 13,778.0 5,338.0 30,389.0 38,055.0 22,156.0 9,051.0 27,911.0 6,218.0 10,885.0 4,306.0
Oct 14,731.0 12,572.0 32,592.0 31,291.0 17,787.0 7,250.0 23,661.0 4,956.0 9,891.0 3,275.0
Nov 14,541.0 40,630.0 30,645.0 22,379.0 6,612.0 4,078.0 19,864.0 4,166.0 9,036.0 3,117.0
Dec 13,697.0 29,794.0 28,619.0 21,605.0 4,967.0 4,015.0 20,048.0 5,017.0 7,096.0 3,406.0

Group Proportions ¶

Since the data frame that was being analyzed contained ten different types of complaints and each complaint type had different kinds of subcategories—in some cases even more than 20 subcategories (e.g. descriptors)—, a simple method for calculating proportions would not work, as the pie charts would end up displaying too many slices. To address this issue, we created a function that not only calculated the proportions for all subcategories (descriptors) within each parent category (complaint type) but was also able to add an extra subcategory called "Other" to group the subcategories with lower proportions under one term, with the purpose of limiting the number of slices of each subplot.

Create a function to calculate the proportion of subcategories within each category

In [41]:
def create_proportion(df, n_slices, slices_col=None, subplot_col=None):

    # Calculate proportion of all slices, no subplots
    if subplot_col == None:
        df_proportion = (
            df[slices_col]
            .value_counts(normalize=True)
            # extract the string to rename column in results df
            .rename_axis([format(slices_col)])
            .reset_index(name="proportion")
            .sort_values("proportion", ascending=False)
        )

        # If there are more than 1 group left after n_slices,
        # group the remaining proportions under the category "Other"
        # by adding together all the rows after the n_slice row
        # and add them to last row
        if len(df_proportion["proportion"][n_slices:]) > 1:
            others_total = df_proportion["proportion"][n_slices:].sum()
            df_top = df_proportion[:n_slices].copy()
            df_top.loc[n_slices, :] = ["Other", others_total]

        # If there are no groups after n_slices
        else:
            df_top = df_proportion.copy()

    # If df has subplot groups (e.g. one group per subplot),
    # calculate slice proportions for each subplot group individually
    else:
        df_proportion = (
            df.groupby(subplot_col)[slices_col]
            .value_counts(normalize=True)
            .reset_index(name="proportion")
        )

        groups_list = df_proportion[subplot_col].unique().tolist()
        df_group = groups_list
        df_top = pd.DataFrame()

        # Create a temporary separate data frame for each subplot group
        # by filtering each subplot group
        for i in range(len(groups_list)):
            group_x = str(groups_list[i])
            df_group[i] = df_proportion.copy(deep=True)
            df_group[i] = df_group[i].loc[df_group[i][subplot_col] == group_x]

            # If there are more than 1 group left after n_slices,
            # group the remaining proportions under the "Other" group
            if len(df_group[i]["proportion"][n_slices:]) > 1:
                others_total = df_group[i]["proportion"][n_slices:].sum()
                df_group[i] = df_group[i][:n_slices].copy()
                df_group[i].loc[n_slices, :] = [group_x, "Other", others_total]

            df_top = pd.concat([df_top, df_group[i]])

            df_top = (
                df_top.sort_values([subplot_col, "proportion"], ascending=[True, False])
                .groupby(subplot_col)
                .head(50)
            )

    return df_top

The create_proportion function, which was created for this project, proved to be very useful during the visualization section, as it could be used to generate many subplots containing the proportions of each parent category. Furthermore, the function added a degree of flexibility for the analysis since the number of slices to be displayed in the pie charts could be changed via an input argument.

Data Frame: Complaint Cycle Time ¶

Create a data frame to analyze complaint cycle time

To perform a more thorough analysis of the data, we created a new variable to calculate the number of days that each complaint remained open. It is important to note that this variable had be created within a new data frame because the calculations required the elimination of the NA values of the Closed Date column, as well as the removal of rows in which the value entered corresponded to a date earlier than the date the complaint was created, which implied that there was an error in the data collection process for that particular complaint and, hence, needed to be removed.

In [42]:
# Create data frame for cycle time (days open)
processing_time = major_complaints.copy()

# Create variable to calculate the hours a complaint remained open
processing_time["processing_hours"] = (
    processing_time["closed_date"] - processing_time["created_date"]
).dt.total_seconds() / 3600  # must use total_seconds because it is timedelta

Clean the cycle time data frame: negative values, NA values and outliers

In addition to removing the negative and missing values, we also removed possible outliers by setting a reasonable limit for the time a complaint could remained open. This limit was set at 6 months, or 4,320 hours.

In [43]:
# Remove rows with NA values
processing_time.dropna(axis=0, how="any", thresh=None, subset=None, inplace=True)

# Remove rows with Negative values
processing_time = processing_time[processing_time["processing_hours"] >= 0]

# Use only the complaints that do not have outliers
processing_time = processing_time[processing_time["processing_hours"] <= 4320]

# Calculate number of rows in the base data frame
initial_rows = major_complaints.shape[0]

# Calculate number of rows in the cycle time data frame
cycle_rows= processing_time.shape[0]

print("\nNumber of rows in the base data frame:\n", "{:,.0f}".format(initial_rows))
print("\nNumber of rows in the cycle time data frame:\n", "{:,.0f}".format(cycle_rows))
Number of rows in the base data frame:
 1,815,615

Number of rows in the cycle time data frame:
 1,807,068

After removing the missing and negative values that were present in the Closed Date and Days Open columns, respectively, the resulting data frame for the cycle time data had 1.8 million rows.

Add variable for the number of days that complaints remained open

In [44]:
processing_time["processing_days"] = processing_time["processing_hours"] / 24

Complaint processing days average according to boroughs

In [45]:
boroughs_days_avg = processing_time.copy()
boroughs_days_avg = (
    boroughs_days_avg.groupby(["complaint_type", "borough"])
    .mean("processing_days")
    .reset_index()
)

Complaint processing days average according to descriptors

In [46]:
descriptor_days_avg = processing_time.copy()
descriptor_days_avg = (
    descriptor_days_avg.groupby(["complaint_type", "descriptor"])
    .mean("processing_days")
    .reset_index()
)

Data Analysis and Visualizations ¶

10 Most Frequent Complaints in New York City ¶

As mentioned earlier, the original dataset contained 224 different types of complaints, which makes the dataset interesting and valuable for a wide range of purposes; however, the main objective of this analysis was to provide recommendations to improve the overall performance of the city services in terms of service requests , and therefore it was be more appropriate to examine only the types of complaints that could have a greater impact on the residents' lives based on how frequent they occur. On that account, this analysis was focused on the 10 most frequent types of complaints of the City of New York during a year in which the population was seeking to return to normality after the economy shut down and confinement measures due the COVID-19 pandemic.

NYC's 10 most common complaints are illustrated in Figure 1, which shows the number of complaints for each type as recorded in the year 2021.

In [47]:
# Visualize most frequent complaint types
fig, ax = plt.subplots()
ax.set_ylabel("Number of Complaints", fontsize=12)
ax.set_title("Figure 1. NYC's 10 Most Frequent Complaints", fontsize=12)
plt.xticks(rotation=40, ha="right")
ax.set_facecolor("white")
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))

ax.bar(
    x=nyc_year["complaint_type"],
    height=nyc_year["number_of_complaints"],
    color="darkblue",
    width=0.7,
)
ax.grid(axis="y")

As Figure 1 shows, three types of complaints were significantly more common than the rest. These complaints were noise - residential, illegal parking, and request large bulky item Collection. Nevertheless, the other seven types of complaints were still significant in terms of frequency and, hence, were included in Figure 3 to examine their composition.

Seasonal Patterns in the Number of Complaints ¶

In [71]:
# cnfigure a new data frame to plot data  
nyc_monthly_plot =nyc_monthly.copy()

complaint_colors = [
    "black",
    "red",
    "grey",
    "orange",
    "yellow",
    "pink",
    "green",
    "darkgrey",
    "purple",
    "blue",
]

# Convert datetime to float so that lmplot is able to use it in the x axis
nyc_monthly_plot["created_date"] = nyc_monthly_plot["created_date"].apply(
    datetime.timestamp
)

monthly_plt = (sns.lmplot(data=nyc_monthly_plot, x='created_date', y='number_of_complaints',
            hue='complaint_type',
            order=3,
            scatter_kws={"s": 20},
            palette= complaint_colors,
            height=7, aspect=1.5,
            ci=None,
            scatter=False,
            hue_order=nyc_monthly_plot.groupby(
                'complaint_type').count().sort_values(by="complaint_type").index))
ax = plt.gca()
# fix xticks
#plt.title("Figure 2. NYC Number of Complaints per Month 2021")
ticks_loc = ax.get_xticks().tolist()
xticks = ax.get_xticks()
ax.set(ylim=(0, 50000))
ax.set_xlabel("Month", fontsize=14)
ax.set_ylabel("Number of Complaints", fontsize=14)
monthly_plt._legend.set_title('Complaint Type')
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
ax.xaxis.set_major_locator(mticker.FixedLocator(ticks_loc))
ax.set_xticklabels([pd.to_datetime(tm, unit='s').strftime('%b') for tm in xticks])
ax.set_title("Figure 2. NYC Number of Complaints per Month 2021", fontsize=15)

plt.show(monthly_plt)

Figure 2 revealed several seasonal patterns in the complaints data. First, as it was expected, complaints related to heat and hot water decreased considerably during the summer months. On the other hand, the three complaint types concerning noise - residential, noise - street/sidewalk, and noise - vehicle showed a pattern that was the inverse of the heat/hot water complaint type; that is, noise complaints were high during the summer months, and decreased significantly in winter. Similarly, the requests for large bulky item collection increased during the summer months but to a lesser extent.

Descriptors' Proportion of NYC's Most Frequent Complaints ¶

In [49]:
complaint_descriptors = create_proportion(
    major_complaints, 3, slices_col="descriptor", subplot_col="complaint_type"
)

complaint_agencies = create_proportion(
    major_complaints, 3, slices_col="complaint_type", subplot_col="agency_name"
)
In [52]:
n_groups = len(complaint_descriptors["complaint_type"].unique())
fig, axs = plt.subplots(
    nrows=4,
    ncols=3,
    figsize=(2 * n_groups, 24),
    subplot_kw={"aspect": "equal"},
)
fig.subplots_adjust(top=0.7)
axs = axs.flatten()

for ax, (group_name, df_group) in zip(axs, complaint_descriptors.groupby("complaint_type")):
    texts, autotexts, autopcts = ax.pie(
        df_group.proportion,
        labels=df_group.descriptor,
        counterclock=False,
        startangle=87,
        autopct="%1.0f%%",
    )
    ax.set_title(group_name)
    # ax.yaxis.label.set_color("darkblue")
    ax.set_title(group_name, weight='bold', color="darkred",fontsize=14)
    #[autotext.set_color("white") for autotext in autotexts]
    plt.setp(autopcts, **{'color':'white', 'weight':'bold', 'fontsize':9})
#fig.tight_layout(pad=2)
fig.subplots_adjust(hspace=.2)
axs[10].set_visible(False)
axs[11].set_visible(False)
plt.suptitle("Figure 3. Composition of NYC's Most Frequent Complaints", size=20, y=.75)

plt.show()

According to the results shown in Figure 2, the complaint type noise - residential, which was the most frequent kind of complaint in 2021 (with a total of 361,416, see Table 4), mostly consisted of complaints related to music or social activities (party), with banging/pounding coming in a far second place. Whereas in the case of illegal parking, which was the second most common type of complaint, it was largely driven by blocked hydrants situations.

It is worth mentioning that the requests for large bulky item collection had only one descriptor, which went by the same name. We addressed this issue in the recommendation section, as it would be advisable to obtain more detail data on this kind of complaint since it was the third most frequent service request.

Number of Complaints According to Borough ¶

As mentioned previously, it is important to analyze complaint data based on more specific geographical locations in order to provide recommendations in relation to resource allocation and city performance. In that regard, Figure 4 illustrates the number of complaints made during 2021 according to borough and complaint type, while Table 12 shows the total amount for each borough.

In [53]:
borough_total = (
    major_complaints["borough"]
    .value_counts()
    .rename_axis("borough")
    .reset_index(name="number_of_complaints")
)

tbl_format(
    borough_total,
    "Table 12. Annual rate according to borough and complaint type (per 1,000 people)",
)
Table 12. Annual rate according to borough and complaint type (per 1,000 people)
Borough Number of Complaints
Brooklyn 544,699
Queens 453,531
Bronx 422,359
Manhattan 321,729
Staten Island 73,297
In [70]:
ax = borough_complaints_wide.plot(
    kind="bar",
    width=0.7,
    edgecolor="white",
    linewidth=2,
    color=complaint_colors,
)
# ax = plt.axes()
plt.xticks(
    rotation=0, horizontalalignment="center", fontweight="light", fontsize="medium"
)
# ax.set_facecolor("pink")
plt.legend(fontsize=11)
plt.title("Figure 4. NYC Total Number of Complaints According to Borough 2021")
plt.xlabel("Borough", fontweight="bold")
plt.ylabel("Number  of Complaints")
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
ax.get_legend().set_bbox_to_anchor((1, 1))

plt.show()

Although Figure 4 is very useful to understand the weight each borough exerts on the NYC 311 system and the city services, it is also important to note that the boroughs' population varies significantly from one another.

To counteract any data misinterpretation that could arise from the difference in population, we calculated the annual rate per 1,000 people for each borough. This variable took into account the population of each borough to estimate the number of complaints that were being done per 1,000 people. In this sense, we were able to level the field such that the boroughs could be compared with each other in a more objective way. Figure 5 shows the annual rate per 1,000 people for each borough.

Annual Complaint Rate According to Borough ¶

In [69]:
#

ax = borough_rate_wide.plot(
    kind="bar",
    width=0.7,
    edgecolor="white",
    linewidth=2,
    color=complaint_colors,
)
# ax = plt.axes()
plt.xticks(
    rotation=0, horizontalalignment="center", fontweight="light", fontsize="medium"
)

plt.legend(fontsize=11)
plt.title("Figure 5. NYC Complaint Rate (Per 1,000 People) 2021")
plt.xlabel("Borough", fontweight="bold")
plt.ylabel("Complaint Rate (per 1,000 people)")
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))
ax.get_legend().set_bbox_to_anchor((1, 1))

plt.show()

By examining the annual complaint rate, as illustrated in Figure 4, it became evident that the Bronx was generating more complaints per 1,000 people, although both Brooklyn and Queens had greater numbers of complaints in total (see Table 12). Thus, in relative terms, once the population factor was accounted for, the Bronx was the most active borough with regard to the number of complaints. On the other hand, Brooklyn and Queens were leading in the total number of complaints due to their larger populations, which in 2021 were 2.64 and 2.33 million people, respectively, while the Bronx only had 1.42 million people, according to data from the U.S. Census Bureau.

Since the Bronx exhibited the largest annual complaint rate per 1,000 people, we proceeded to analyze this borough in more detail by examining its complaint rate for each complaint type and comparing this number with the other boroughs. In this respect, Table 8 highlights in red all the annual complaint rates that were 20% higher than the overall complaint rate of the entire City of New York.

In [56]:
def highlight_greater20(x):

    col_name = x.name
    comparison = nyc_year.query("complaint_type == @col_name")[
        "annual_rate_(per_1,000_people)"
    ].iloc[0]

    highlight = "color: red;"
    greater20 = x > comparison + (0.3 * comparison)

    return [highlight if i else "" for i in greater20]
In [57]:
table = (
    borough_rate_wide.style.format("{:,.1f}")  # .rename_axis("Borough")
    .set_table_styles(styles)
    .set_properties(**{"text-align": "left"})
    .apply(highlight_greater20)
    .set_caption(
        "Table 13. Annual rate (per 1,000 people) according to borough and complaint type "
    )
)

table
Out[57]:
Table 13. Annual rate (per 1,000 people) according to borough and complaint type
  Blocked Driveway Heat/Hot Water Illegal Parking Noise - Residential Noise - Street/Sidewalk Noise - Vehicle Request Large Bulky Item Collection Street Condition Unsanitary Condition Water System
Bronx 19.0 48.8 37.5 82.0 40.4 17.0 15.7 5.7 21.4 9.0
Brooklyn 21.7 21.3 41.6 34.6 14.8 6.2 42.9 7.3 10.6 5.3
Manhattan 2.9 28.1 29.2 45.4 41.0 12.2 22.5 6.4 11.2 5.0
Queens 25.6 11.5 42.1 30.9 9.0 10.2 44.2 9.9 5.0 6.2
Staten Island 7.4 4.6 19.6 19.1 4.2 2.8 57.6 16.0 4.7 12.5

By examining the results shown in Table 13, it can be seen that the Bronx had the highest annual complaint rate per 1,000 people in four out of the 10 categories of complaint types; and in six of these categories, this borough displayed a complaint rate that was 20% higher than the overall complaint rate of New York City.

Similarly, Table 13 was also used to analyze the other boroughs and make the corresponding recommendations (see the Recommendations Section).

Number of Complaints According to Responsible Government Agency ¶

In [58]:
agency_year = (
    major_complaints["agency"]
    .value_counts()
    .rename_axis("agency")
    .reset_index(name="number_of_complaints")
)

# Visualize most frequent complaint types
fig, ax = plt.subplots()
ax.set_ylabel("Number of Complaints", fontsize=12)
ax.set_title("Figure 6. NYC's Most Frequent Complaints Accordin to Agency", fontsize=12)
plt.xticks(rotation=30, ha="right")
ax.set_facecolor("white")
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter("{x:,.0f}"))

ax.bar(
    x=agency_year["agency"],
    height=agency_year["number_of_complaints"],
    color="darkblue",
    width=0.7,
)
ax.grid(axis="y")

NYPD: New York City Police Department
DSNY: Department of Sanitation of New York City
HPD: Department of Housing Preservation and Development
DOT: Department of Transportation
*DEP: Department of Environmental Protection

In [59]:
complaints_nypd = major_complaints.loc[(major_complaints["agency"] == "NYPD")]
# nypd = major_complaints["agency_name"] == "New York City Police Department"
nypd_proportion = create_proportion(complaints_nypd, 5, slices_col="complaint_type")


axes = nypd_proportion.plot(
    kind="pie",
    y="proportion",
    labels=nypd_proportion["complaint_type"],
    startangle=90,
    autopct="%.0f%%",
    legend=None,
    counterclock=False,
    # ax=axes[0],
    ylabel="",
    figsize=(12, 4),
)
axes.set_title("Figure 7. Complaints Directed to NYPD", color="black", fontsize=12)
# axes[1].set_title("Illegal Parking", color="r", fontsize=14)
plt.setp(autopcts, **{"color": "white", "weight": "bold", "fontsize": 9})
[text.set_color("red") for text in texts]
texts[0].set_color("blue")
[autotext.set_color("white") for autotext in autotexts]

plt.show()

As shown in Figure 7, the most frequent complaints for which the New York City Police Department was responsible for were noise - residential and illegal parking, with noise - street/sidewalk coming in third place.

Heatmap: Level of Activity per Hour of the Day ¶

In [68]:
fig, ax = plt.subplots(figsize=(12, 3))

from matplotlib.colors import TwoSlopeNorm

norm = TwoSlopeNorm(vcenter=25)

title = "Figure 8. Average Number of Complaints per Hour of the Day"

plt.title(title, fontsize=12)
t = ax.title
t.set_position([0.5, 1.05])


sns.heatmap(
    weekday_hour_wide,
    cmap="YlGnBu",
    linewidths=0.50,
    ax=ax,
    norm=norm,
)
plt.xticks(rotation=40)
plt.yticks(rotation=0)
plt.show()

Based on the results illustrated by Figure 8, the time in which the average number of complaints was the highest was from Saturday 10:00 p.m. to Sunday 1:00 a.m. This was driven largely by complaints related to noise in residential locations, as can be seen in Figure 9.

Heatmap: Level of Activity According to Type and Weekday ¶

In [67]:
fig, ax = plt.subplots(figsize=(10, 4))

from matplotlib.colors import TwoSlopeNorm

# Adjust intensity in heatmap
norm = TwoSlopeNorm(vcenter=70)

title = "Figure 9. Average Number of Complaints According to Type and Day of the Week"

plt.title(title, fontsize=11)
t = ax.title
t.set_position([0.5, 5])

sns.heatmap(
    weekday_avg_wide,
    cmap="YlGnBu",
    linewidths=0.30,
    ax=ax,
    norm=norm,
)
ax.set(ylabel=None)
plt.show()

Descriptive Statistics: Complaint Processing Time ¶

In [63]:
# descriptive statistics for the number of hours that complaints remained open
cycle_stats = processing_time.groupby("complaint_type")["processing_hours"].describe()
cycle_stats.sort_values("mean", ascending=False, inplace=True)

tbl_format(
    cycle_stats,
    "Table 14. Descriptive statistics for complaint cycle time (hours)",
    hide_index=False,
)
Table 14. Descriptive statistics for complaint cycle time (hours)
  Count Mean Std Min 25% 50% 75% Max
Complaint Type                
Unsanitary Condition 88,645.0 488.8 611.8 0.0 137.2 289.3 552.6 4,318.8
Street Condition 66,734.0 169.9 301.0 0.0 21.0 65.8 210.1 4,318.3
Request Large Bulky Item Collection 302,545.0 88.5 75.4 0.0 37.0 62.6 108.8 1,467.6
Water System 54,624.0 82.3 275.3 0.0 0.1 5.0 34.5 4,291.9
Heat/Hot Water 198,830.0 58.3 49.2 0.0 27.5 49.3 77.8 2,789.1
Noise - Residential 360,188.0 7.1 36.2 0.0 0.3 0.7 1.7 1,730.2
Blocked Driveway 151,648.0 1.6 3.0 0.0 0.5 0.9 1.8 165.4
Illegal Parking 316,095.0 1.4 5.0 0.0 0.4 0.8 1.5 1,601.2
Noise - Street/Sidewalk 183,349.0 1.2 2.7 0.0 0.3 0.6 1.3 161.9
Noise - Vehicle 84,410.0 1.2 2.5 0.0 0.3 0.6 1.2 162.4

Complaints related to unsanitary conditions remained open considerably more than the others (see Table 14). Although this could be expected, as these types of complaints are more difficult to resolve than the rest, these kinds of calls were examined more closely, considering that unsanitary conditions play a key role in the level of well-being of the residents and their perception towards their city government.

In [66]:
# Create data frame for processing days
# of unsanitary conditions according to borough
unsanitary_days_borough = boroughs_days_avg[
    boroughs_days_avg["complaint_type"] == "Unsanitary Condition"
].sort_values("processing_days", ascending=False)

# Create data frame for processing days
# of unsanitary conditions according to descriptor
unsanitary_days_descriptors = descriptor_days_avg[
    descriptor_days_avg["complaint_type"] == "Unsanitary Condition"
].sort_values("processing_days", ascending=False)

fig, ax = plt.subplots(ncols=2, figsize=(9, 3))

plt1 = unsanitary_days_borough.plot.bar(
    x="borough", y="processing_days", color="blue", ax=ax[0], legend=None
)
plt2 = unsanitary_days_descriptors.plot.bar(
    x="descriptor", y="processing_days", color="red", ax=ax[1], legend=None
)

plt.suptitle("Figure 10. Unsanitary Condition Processing Days", size=12, y=1.05)

ax[0].set(ylabel="Processing Days")
ax[0].set_title("According to Borough", color="black", fontsize=9)
ax[0].set(xlabel=None)
ax[0].set_xticklabels(plt1.get_xticklabels(), rotation=40, ha="right")
ax[0].yaxis.set_ticks(np.arange(0, 30, 2))
ax[0].grid(axis="y")

ax[1].set_title("According to Descriptor", color="black", fontsize=9)
ax[1].set(xlabel=None)
ax[1].yaxis.set_ticks(np.arange(0, 30, 2))
ax[1].set_xticklabels(plt2.get_xticklabels(), rotation=40, ha="right")
ax[1].grid(axis="y")

plt.show()

As mentioned earlier, complaints concerning unsanitary conditions took the longest time to resolve. Figure 10 (right), which shows the descriptors for this type of complaint, sheds light on why the processing time for unsanitary condition is high. As shown in Figure 10, the unsanitary condition complaint type consisted of calls related to garbage/recycling storage, sewage, pests, and mold. The last two, pests and mold, are situations that due to their nature require more resources and time to resolve. However, the descriptor that corresponded to garbage/recycling storage showed an average processing time that was even longer than the number of days that were required to resolve mold and pests conditions, which is something that should be evaluated by the responsible agency, as garbage/recycling storage situations should be, in theory, more straightforward to address.

In relation to the boroughs of New York City, Figure 10 revealed that unsanitary conditions in Staten Island, which had an average processing time of 27 days, took almost one week longer to resolve than in most other boroughs. Whereas in Queens and Brooklyn, on average, this type of complaint took less than 19 days to close or correct.

Conclusions and Recommendations ¶

Noise in Residential Areas

In 2021, calls concerning noise in residential areas were the most frequent kinds of complaint in New York City. This type of complaint was driven in large part by calls coming from the Bronx, which in the case of residential noise had an annual complaint rate (per 1,000 people) twice as high as the borough that came in second place for this kind of complaint. This is an interesting fact, considering that it cannot be explained by population density since the Bronx ranks third in terms of people per square miles compared to the other boroughs.

This analysis revealed that in the case of noise in residential areas, 69% of this type of complaint was due to loud music or parties. In that context, it is recommended to develop policies, within the Bronx communities, that promote thoughtful ways of coexistence between neighbors. Likewise, such policies should discourage the use of residential areas for activities that are characterized by loud music by either establishing a reasonable fine system or by providing alternative locations that could be rented at a low rate for such events. Also, policies should focus on educating people about how to live in a culturally diverse community.


Capitalizing on Complaints Seasonality

Several types of complaints displayed strong seasonal patterns. Complaints concerning noise, especially residential and street/sidewalk noise, along with collection requests for large bulky items increased significantly during the summer months. Whereas, as expected, complaints relating to heat/hot water decreased during summer. With this in mind, it is advisable to allocate technical resources and expertise according to these seasonal patterns.

One way in which the local government could use seasonal patterns for its own advantage is by establishing collaboration programs between the NYC 311 administrators and the government agency responsible for addressing a particular seasonal complaint such that during the months of high activity for that type of complaint, the NYC 311 call center is provided with a small team from the responsible agency to handle the calls that require technical expertise and, whenever possible, resolve the complaint immediately.


Targeted Investment According to Borough's Needs

The annual complaint rate per 1,000 people for each of the five boroughs, as presented in Table 13 and Figure 5, is a valuable tool to identify the areas that need improvement in each geographical location. This indicator is a useful tool to determine the real priorities for each borough because it takes into account the differences in population to provide a more objective and meaningful way to measure their use of the NYC 311 system.

Table 13 highlights in red the cases in which a borough's annual complaint rate (per 1,000 people) was 20% higher than the city's rate for a particular type of complaint. In this respect, each borough displayed a different set of needs that must be addressed. Thus, it is recommended to prioritize investment or the development of technical expertise to address proactively the conditions that led to the following complaints in each borough:


The Bronx

The Bronx had six complaints types for which the annual complaint rate per 1,000 people was 20% higher than the average of the entire city for those same kinds of complaints. Those complaints were heat/hot water, noise - residential, noise - stree/sidewalk, noise - vehicle, unsanitary condition, and water systems.


Brooklyn

Although Brooklyn had the highest total number of complaints, with regard to the annual complaint rate per 1,00 people this borough was either close to or below the average in the 10 complaints that were analyzed. This is an indication that Brooklyn was doing better than the other boroughs with respect to complaints or service requests. The high number of total complaints that this borough had was due to its large population, which is the largest among the five boroughs.


Manhattan

Out of the 10 most frequent types of complaints, in only one case Manhattan had a complaint rate 20% higher than the average. This type of complaint was noise in street/sidewalk, which can be explained by the very own nature and characteristics of this borough. Considering that it is normal for Manhattan to have a high volume of pedestrians due to tourism, entertainment, shopping, and nightlife, this type of complaint is difficult to address but not impossible. Figure 3 shows that in most cases, similarly to complaints concerning noise in residential areas, the noise in street/sidewalk was also due to loud music or parties, which are activities that could restrained within the normal legal framework.


Queens

In the case of Queens, the only complaint type in which this borough had a complaint rate that was 20% higher than the average was blocked driveways. However, before investing resources to address this issue, it is recommended to perform a data analysis that not only focuses on the resolution measures of past incidents but that is also based on several years of data in order to determine if these complaints were caused by construction works that took place in 2021.


Staten Island

Regarding Staten Island, this borough had complaint rates that were 20% higher than the average in three categories. These complaint types were collection request of large bulky items, street condition and water system.


Greater Collaboration Between NYPD and NYC 311

The New York City Police Department (NYPD) was by far the government agency responsible for addressing the most complaints. In that sense, it is essential to evaluate how the process efficiency of handling complaints could be improved by increasing the level of collaboration between NYPD and NYC 311. NYPD was responsible for addressing more than 1.1 million complaints, whereas the second most busy government agency was the Department of Sanitation of New York City, which came in a far second place with around 300,000 complaints. Most of the complaint assigned to NYPD were noise in residential areas and illegal parking (see Figure 7), with 33% and 29%, respectively.


Shift Work Scheduling

The analysis revealed that the time and day of the week with the highest traffic in terms of complaints was from Saturday 10:00 p.m. to Sunday 1:00 a.m. This high volume of calls was largely driven by noise in residential areas (see figures 8 and 9). In that sense, the work shift that includes Saturday's late hours should be assigned the most operators, and a rotation schedule should be adopted if necessary.


Unsanitary Condition Processing Days

As it would be expected, the analysis confirmed that complaints related to unsanitary conditions took the longest time to resolve; however, these processing times were exceedingly high, and more resources should be allocated to evaluating and resolving these kinds of complaints given the effects that unsanitary conditions could have on health and well-being. Furthermore, this type of complaint could play a fundamental role in the way residents perceive the performance of their city government.

The results showed that on average complaints concerning unsanitary conditions took 20 days to resolved. This type of complaint consisted of conditions related to pests, mold, sewage, and garbage/recycling storage. The latter, surprisingly, had a processing time higher than the others. Garbage/recycling storage showed an average processing time of 28 days, while sewage, pests and mold were resolved in 23, 21 and 14 days, respectively. In view of this, and considering that, in theory, garbage/recycling storage should be easier to address than conditions relating to sewage and pests, more resources should be allocated to evaluating the former to determine why its processing time is so high.

References¶

311 Service Requests from 2010 to Present. NYC Open Data. January 15, 2023.
https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

County Population Totals: 2020-2021. US Census Bureau. March 24, 2022.
https://www.census.gov/data/tables/time-series/demo/popest/2020s-counties-total.html

Gazetter Files. US Census Bureau. February 1, 2023.
https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.2021.html#list-tab-SOHTLTAHR50YZCT6KC