Introduction

In the context of an academic assignment within a Data Mining course, this project delves into the exploration and analysis of a publicly available dataset from the city of Austin, Texas. The dataset, sourced from Austin’s 311 Public Data, comprises a diverse range of service requests made to city authorities. The primary objective of this undertaking is to employ data mining techniques to uncover patterns, insights, and actionable information from the dataset.

Through a systematic series of data preprocessing, filtering, and visualization steps, this project aims to gain a deeper understanding of service request trends, their associations with various factors, and the utilization of data analytics for decision-making in municipal services. The subsequent sections present an organized and informative analysis of the dataset, employing R programming and data visualization tools to extract valuable knowledge and facilitate data-driven decision support.

Load necessary packages:

library(dplyr)
library(scales)
library(data.table)
library(tidyverse)
library(lubridate)
library(anytime)
library(ggplot2)
library(leaflet)
library(tidyr)

Data Pre-processing

In this section of the analysis, necessary R packages for data manipulation and visualization were loaded, and the dataset from the city of Austin, Texas, was imported. The structure of the dataset was examined using the str() function, providing an overview of its characteristics, including the number of variables (22) and records (479,361).

The dataset was filtered to include only records with a specific Zip Code (78753), resulting in 21,019 records.

# Read Data
df <- read.csv("Austin_311_Public_Data.csv")

# Display structure of the dataset
str(df)
## 'data.frame':    479361 obs. of  22 variables:
##  $ Service.Request..SR..Number: chr  "19-00090956" "20-00135805" "20-00052942" "19-00467964" ...
##  $ SR.Description             : chr  "Animal - Proper Care" "Tree Issue ROW" "Pothole Repair" "ARR Dead Animal Collection" ...
##  $ Method.Received            : chr  "Phone" "Mobile Device" "Web" "Phone" ...
##  $ SR.Status                  : chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ Status.Change.Date         : chr  "11/23/2020 01:41:21 PM" "11/23/2020 12:02:05 AM" "02/12/2020 12:57:35 PM" "12/11/2019 09:57:54 AM" ...
##  $ Created.Date               : chr  "03/14/2019 03:02:15 PM" "04/07/2020 07:06:32 PM" "02/12/2020 01:55:04 AM" "12/10/2019 02:17:04 PM" ...
##  $ Last.Update.Date           : chr  "11/23/2020 01:41:22 PM" "11/23/2020 12:02:05 AM" "08/11/2020 01:03:41 AM" "12/11/2019 09:57:54 AM" ...
##  $ Close.Date                 : chr  "11/23/2020 01:41:21 PM" "11/23/2020 12:02:05 AM" "02/12/2020 12:57:35 PM" "12/11/2019 09:57:54 AM" ...
##  $ SR.Location                : chr  "4609 RIBBECKE AVE, AUSTIN, TX 78721" "3521 WEST AVE, AUSTIN, TX 78705" "7900 E BEN WHITE BLVD WB, AUSTIN, TX" "7100 METROPOLIS DR, AUSTIN, TX 78744" ...
##  $ Street.Number              : chr  "4609" "3521" "7900" "7100" ...
##  $ Street.Name                : chr  "RIBBECKE" "WEST" "BEN WHITE BLVD WB" "METROPOLIS" ...
##  $ City                       : chr  "AUSTIN" "AUSTIN" "AUSTIN" "AUSTIN" ...
##  $ Zip.Code                   : int  78721 78705 78741 78744 78741 78723 78712 78741 78745 78745 ...
##  $ County                     : chr  "TRAVIS" "TRAVIS" "TRAVIS" "TRAVIS" ...
##  $ State.Plane.X.Coordinate   : num  3132348 3114174 3131888 3127538 3125956 ...
##  $ State.Plane.Y.Coordinate   : num  10075731 10083407 10053622 10048498 10054146 ...
##  $ Latitude.Coordinate        : num  30.3 30.3 30.2 30.2 30.2 ...
##  $ Longitude.Coordinate       : num  -97.7 -97.7 -97.7 -97.7 -97.7 ...
##  $ X.Latitude.Longitude.      : chr  "(30.28058076, -97.68531793)" "(30.30285681, -97.74231249)" "(30.21983764, -97.68844011)" "(30.20603574, -97.70259216)" ...
##  $ Council.District           : int  1 9 3 2 3 9 9 3 5 3 ...
##  $ Map.Page                   : chr  "586L" "555S" "646F" "646N" ...
##  $ Map.Tile                   : chr  "MM23" "MJ25" "ML18" "ML17" ...

The dataset contains 22 variables (columns) and 479,361 records (rows).

# Filter records by Zipcode
df_filtered <- filter(df, Zip.Code == "78753")

# Count filtered records
cat("There are", nrow(df_filtered), "records in the dataset.")
## There are 21019 records in the dataset.

Handling Missing Data

A check for missing values showed that there were no missing values in this subset. However, blank cells in the dataset were converted to NA values to ensure data integrity. After this conversion, a recheck for missing values revealed 3,633 NA values. This step was essential for identifying and handling missing data appropriately.

Are there any NA values in the dataframe?

# Check for NA values in the dataframe
cat("There are", sum(is.na(df_filtered)), "missing values in the dataset.")
## There are 0 missing values in the dataset.
# Calculate percentage of complete cases

cat(percent(sum(complete.cases(df_filtered)) / nrow(df_filtered), accuracy = 0.01), "of the rows in the dataset are complete cases.")
## 100.00% of the rows in the dataset are complete cases.

Note: A complete case is an observation in the dataset for which there are no missing values (NA values) for any of the given variables.

# Convert blank cells in the dataframe into NAs
df_filtered[df_filtered == ""] <- NA

# Recount number of NAs
cat("There are now", sum(is.na(df_filtered)), "missing values in the dataset.")
## There are now 3633 missing values in the dataset.
# Recalculate percentage of complete cases
cat("Now,", percent(sum(complete.cases(df_filtered)) / nrow(df_filtered), accuracy = 0.01), "of the rows in the dataset are complete cases.")
## Now, 82.81% of the rows in the dataset are complete cases.

Note: This value decreased a result of converting blank cells to NA cells. In other words, the blank cells were being counted as complete cases instead of incomplete cases (i.e., records with missing information); as these missing values were not properly characterized as ’NA’s.

# Generate table of missing values
num_NAs <- colSums(is.na(df_filtered))
percent_NAs <- percent(colSums(is.na(df_filtered)) / nrow(df_filtered), accuracy = 0.01)
NA.df <- data.frame(num_NAs, percent_NAs)

NA.table <- data.table(NA.df, keep.rownames = TRUE) 
colnames(NA.table) <- c("Variable", "Number of NAs", "Percent of NAs")

# Sort the table by the "Number of NAs" column in descending order
NA.table <- NA.table[order(-NA.table$`Number of NAs`)]

# Display the missing values table
cat("Missing Values by Column:\n")
## Missing Values by Column:
NA.table
##                        Variable Number of NAs Percent of NAs
##  1:               Street.Number          3514         16.72%
##  2:                  Close.Date            94          0.45%
##  3:                      County            25          0.12%
##  4: Service.Request..SR..Number             0          0.00%
##  5:              SR.Description             0          0.00%
##  6:             Method.Received             0          0.00%
##  7:                   SR.Status             0          0.00%
##  8:          Status.Change.Date             0          0.00%
##  9:                Created.Date             0          0.00%
## 10:            Last.Update.Date             0          0.00%
## 11:                 SR.Location             0          0.00%
## 12:                 Street.Name             0          0.00%
## 13:                        City             0          0.00%
## 14:                    Zip.Code             0          0.00%
## 15:    State.Plane.X.Coordinate             0          0.00%
## 16:    State.Plane.Y.Coordinate             0          0.00%
## 17:         Latitude.Coordinate             0          0.00%
## 18:        Longitude.Coordinate             0          0.00%
## 19:       X.Latitude.Longitude.             0          0.00%
## 20:            Council.District             0          0.00%
## 21:                    Map.Page             0          0.00%
## 22:                    Map.Tile             0          0.00%
##                        Variable Number of NAs Percent of NAs

The missing values table suggests that the ‘Street.Number’, ‘Close.Date’, & ‘County’ columns contain missing values. Less than 1% of the data is missing for the ‘Close.Date’ & ‘County’ variables. However, a substantial portion of the observations are missing values for the ‘Street.Number’ column (16.72%). We will leave the NA values for now to avoid unnecessary data reduction.

Handling Dates

Date variables, specifically ‘Created.Date’ and ‘Close.Date’, were converted from character to the ‘Date’ data type using the anydate() function. Additionally, the ‘Duration’ variable, representing the time between ‘Close.Date’ and ‘Created.Date’, was calculated. This information was crucial for understanding the time it takes to address service requests.

Inspect data types of date variables:

str(df$Created.Date)
##  chr [1:479361] "03/14/2019 03:02:15 PM" "04/07/2020 07:06:32 PM" ...
str(df$Close.Date)
##  chr [1:479361] "11/23/2020 01:41:21 PM" "11/23/2020 12:02:05 AM" ...

R interprets the ‘Created.Date’ and ‘Close.Date’ variables as a “character” data type, when they are actually of the data type “date.”

# Convert date variables to the 'Date' type
df_filtered$Created.Date <- anydate(df_filtered$Created.Date)
df_filtered$Close.Date <- anydate(df_filtered$Close.Date)
# Generate new variable 'Duration' (time between Close.Date & Created.Date)
df_filtered <- df_filtered %>%
  filter(!is.na(Created.Date) & !is.na(Close.Date)) %>%
  mutate(Duration = difftime(Close.Date, Created.Date, units = "days"))

Next, I want to query for requests on a specific date & identify the most common type of ‘SR.Description’ for those requests. For the date, I will use my birthday:

  • October 19 (10/19)
# Query all service requests created on Oct. 19
bday.df <- filter(df_filtered, Created.Date == '2019-10-19')

# Return total number of records corresponding to this date
cat(nrow(bday.df), "city service requests were initiated on Oct. 19 in Austin, TX.\n")
## 148 city service requests were initiated on Oct. 19 in Austin, TX.
# Identify most common type of 'SR.Description'
cat(max(bday.df$SR.Description), "was the most common SR.Description for for requests made on this date.")
## Traffic Signal - New/Change was the most common SR.Description for for requests made on this date.

Exploring the Dataset

Various aspects of the dataset were explored to uncover insights. The percentage of service requests received through the Spot311 interface was calculated, revealing that 24.11% of requests came through this method. Similarly, it was found that 7.03% of requests were made due to issues related to loose dogs. Additionally, eight unique methods through which service requests were received were identified.

# Calculate the percentage of all the 311 city service requests that came in through the Spot311 interface
cat(percent(sum(df_filtered$Method.Received == "Spot311 Interface") / nrow(df_filtered), accuracy = 0.01), "of all 311 city service requests in Austin were received via the Spot311 Interface.")
## 24.11% of all 311 city service requests in Austin were received via the Spot311 Interface.
# Calculate the percentage of all the 311 city service requests that were made because of loose dogs
cat(percent(sum(df_filtered$SR.Description == "Loose Dog") / nrow(df_filtered), accuracy = 0.01), "of all 311 city service requests in Austin were made because of loose dogs.")
## 7.03% of all 311 city service requests in Austin were made because of loose dogs.
# Identify unique types of methods through which service requests were received
cat("The 311 city service requests in Austin were received through", length(unique(df_filtered$Method.Received)), "unique types of methods.")
## The 311 city service requests in Austin were received through 8 unique types of methods.

Visualization

To provide a visual representation of the data, several plots were created. These included barplots showing the counts of service requests by season and by the six most common ‘SR.Description’ types. These visualizations allowed for the identification of trends, such as the higher number of service requests in the Spring season and the most common ‘SR.Description’ types.

Faceted bar charts were also generated to explore the relationship between the most common ‘Method.Received’ and ‘SR.Description’ types. These visualizations highlighted patterns in how different service request types were received.

Furthermore, a histogram was created to visualize the distribution of service request durations, showing that most requests were resolved within a few days.

df_filtered <- subset(df_filtered, select = -c(Map.Page))

Create Season Variable

Construct a new column called ‘Season’ & rename Quarters 1-4 so that they become “Winter”, “Spring”, “Summer” & “Fall.”

df_filtered <- df_filtered %>% 
  mutate(Season = quarter(df_filtered$Created.Date))

df_filtered$Season[df_filtered$Season == "1"] <- "Winter"

df_filtered$Season[df_filtered$Season == "2"] <- "Spring"

df_filtered$Season[df_filtered$Season == "3"] <- "Summer"

df_filtered$Season[df_filtered$Season == "4"] <- "Fall"

Barplot by Season

ggplot(df_filtered, aes(x = Season, fill = Season)) + geom_bar() + labs(title = "Number of City Service Requests by Season", x = "Season", y = "Number of Service Requests")

Regarding the findings derived from the barplot depicting service request counts across seasons, it is evident that Spring has the highest number of service requests, while Fall exhibits the lowest. In contrast, service requests in Fall, Summer, and Winter are fairly consistent and relatively similar in magnitude.

Barplot by Common SR.Description Types

# Filter dataset so that only rows with the 6 most common 'SR.Description' types remain
common_SR_descriptions <- c(
  "Austin Code - Request Code Officer",
  "Traffic Signal - Maintenance",
  "Loose Dog",
  "Parking Violation Enforcement",
  "Animal Control - Assistance Request",
  "ARR Missed Yard Trimmings/Compost"
)

df_filtered <- filter(df_filtered, SR.Description %in% common_SR_descriptions)

# Barplot of 6 Most Common 'SR.Description' Types
ggplot(df_filtered, aes(x = fct_infreq(SR.Description), fill = SR.Description)) +
  geom_bar() +
  labs(title = "Most Common SR.Description Types", x = "SR.Description Types", y = "Count of SR.Description Types") +
  scale_x_discrete(guide = guide_axis(check.overlap = TRUE))

The above barplot illustrates the top 6 most prevalent SR.Description categories in a descending order of occurrence. These represent the most frequently encountered types of 311 city service requests reported in Austin. Notably, “Austin Code - Request Code Officer” emerges as the most frequent request type, followed by “Traffic Signal - Maintenance” and “Loose Dog” as the second and third most common request descriptions in Austin. It’s worth emphasizing that less common request types were excluded in a prior data processing step.

Faceted Bar Charts

# Filter the dataset so that only rows with the 6 most common types of Method.Received remain
common_methods_received <- c(
  "Phone",
  "Spot311 Interface",
  "Web",
  "Mobile Device",
  "Other",
  "E-Mail"
)

df_filtered <- filter(df_filtered, Method.Received %in% common_methods_received)

# Faceted Bar chart
ggplot(df_filtered, aes(x = fct_infreq(Method.Received), fill = Method.Received)) +
  geom_bar() +
  labs(title = "Most Common Type of Method.Received by SR.Description", x = "Types of Method.Received", y = "Count of Method.Received") +
  facet_wrap(~ SR.Description, labeller = as_labeller(function(x) strwrap(x, width = 20))) +
  scale_x_discrete(guide = guide_axis(check.overlap = TRUE)) +
  theme(strip.text = element_text(size = 7, face = "bold", hjust = 0.5))
## Warning in strip_mat[panel_pos] <- unlist(unname(strips), recursive =
## FALSE)[[params$strip.position]]: number of items to replace is not a multiple
## of replacement length

The faceted bar chart presents a breakdown of the predominant methods used to submit service requests, organized according to the most frequent service description types. Across most service request categories, it is apparent that phone calls are the prevailing method of contact. However, when examining “Austin Code - Request Code Officer,” “Loose Dog,” and “Parking Violation Enforcement,” it becomes evident that the Spot311 Interface is also a notably high-frequency method of request submission.

For the majority of SR.Descriptions and service requests, a single common method is employed. An exception to this pattern is the request for a code officer, which demonstrates a distinctive characteristic by being commonly requested through all six of the most prevalent methods (i.e., “Method.Received”), with the exception of “Other.”

Histogram - Distribution of Duration

ggplot(df_filtered, aes(x = Duration, fill = Method.Received)) +
  geom_histogram(binwidth = 50) +
  labs(title = "Distribution of Service Request Duration by Method Received", x = "Duration", y = "Count")
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

The histogram provides insight into the durations of service requests, revealing that the majority of requests have relatively short durations, with nearly half of them lasting only a few days. Additionally, it’s a rarity for a service request to extend beyond 250 days.

Examining the data further, it appears that, in general, phone calls were the most common method of contact for the six most prevalent service description types. Interestingly, service requests initiated via mobile devices or email tend to have shorter durations compared to other methods. Nonetheless, it’s important to note that there are only a relatively small number of cases where the service request duration exceeds 250 days.

Proportional Bar Chart

# Filter the dataframe so that only the rows with the six most common streets remain
common_streets <- c(
  "IH 35 SVRD",
  "NORTH CREEK",
  "PARMER",
  "RUNDBERG",
  "IH 35",
  "BROWNIE"
)

df_filtered <- filter(df_filtered, Street.Name %in% common_streets)


# Proportional Bar chart
ggplot(df_filtered, aes(x = Street.Name, fill = SR.Description)) +
  geom_bar(position = "fill") +
  labs(title = "Service Request Descriptions by Street Name", x = "Street Names", y = "Proportion") +
  scale_x_discrete(guide = guide_axis(check.overlap = TRUE)) +
  scale_y_continuous(limits = c(0, 1)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))  

In this proportional bar chart, we observe the proportions of the most common service request descriptions across the most frequently mentioned street names for service requests in Austin. What stands out as intriguing and somewhat unusual is the consistent heights of the bars. It’s worth noting that the number of service requests on each of these six streets is not equal, yet the uniformity in bar heights allows for effective comparison of the proportions of our “fill” variable, “SR.Description,” across these prominent streets.

From the chart, we can discern some notable patterns. Specifically, a significant proportion of service requests on these streets are made to request a code officer. However, there are exceptions to this trend. On “Parmer” and “North Creek,” a larger proportion of service requests are related to parking violation enforcement, setting them apart from the other streets in terms of service request distribution.

Geospatial Analysis

To add a geospatial component to the analysis, the ‘leaflet’ package was used to create interactive maps. Service request locations were plotted on a map of Austin, Texas, with circles representing the requests. This allowed for a visual understanding of the geographic distribution of service requests.

Finally, the map was enhanced by adding additional provider tiles, such as National Geographic, to provide more detailed geographical context.

Create Leaflet Map

m <- leaflet() %>% addTiles() %>% addCircles(
    lng = df_filtered$Longitude.Coordinate, 
    lat = df_filtered$Latitude.Coordinate
  )

m

Screenshot of the plot zoomed in to 78753 zipcode of Austin, TX:

Enhance Leaflet Map

m <- leaflet() %>% addTiles() %>% addCircles(
    lng = df_filtered$Longitude.Coordinate, 
    lat = df_filtered$Latitude.Coordinate
  ) %>% addProviderTiles(providers$Esri.NatGeoWorldMap)

m

The ‘provider’ argument with a dollar sign allows us to alter the appearance of the map (e.g., adding more descriptive features like road/highway names) based on different providers (e.g., this map is from National Geographic).

Screenshot of the plot zoomed in to 78753 zipcode of Austin, TX:

Conclusions

The insights gained from the exploratory data analysis of Austin’s 311 service requests hold significant value for various stakeholders involved in municipal service management and decision-making. This section highlights key findings and their relevance to stakeholders, helping them make informed decisions and enhance the quality of public services in Austin, Texas.

Key Insights

  1. Zip Code Analysis: Focus was placed on service requests in the 78753 Zip Code area, which had over 21,000 records. This information is valuable for city planners and local government officials who need to allocate resources efficiently based on geographic demand.

  2. Missing Data Handling: The identification and handling of missing data are critical for ensuring data quality. Stakeholders can be confident that the analysis is based on complete and accurate information.

  3. Seasonal Trends: Seasonal trends in service requests were observed, with Spring having the highest number of requests. This insight can help city authorities prepare for increased demand during specific seasons and plan resource allocation accordingly.

  4. Common Service Request Types: Identification of the most common service request types allows stakeholders to prioritize their response and allocate resources to address these issues effectively. For example, “Austin Code - Request Code Officer” requests were prevalent.

  5. Method of Request: Understanding the methods through which service requests are received can help stakeholders optimize communication channels. The analysis revealed that “Phone” was the most common method, but certain request types, such as code officer requests, were also commonly received through the Spot311 interface.

  6. Duration of Service Requests: The histogram analysis of service request durations indicated that most requests are resolved within a few days. This information can assist in setting expectations for response times and managing service request workflows.

  7. Geospatial Distribution: The interactive map displayed the geographic distribution of service requests in the 78753 area. This can be invaluable for local law enforcement, neighborhood associations, and emergency services to understand where specific issues are concentrated.

Stakeholder Relevance

  1. City Planners: City planners can use the insights from the analysis to make data-driven decisions about resource allocation, service prioritization, and seasonal adjustments in service delivery.

  2. Local Government Officials: Elected officials and government agencies can leverage the findings to address common issues efficiently, allocate budgets, and enhance communication channels with residents.

  3. Emergency Services: Emergency service providers can benefit from geospatial information to understand the distribution of service requests in specific areas, enabling quicker response times.

  4. Community Organizations: Neighborhood associations and community organizations can use the data to advocate for improvements in their respective areas and collaborate with local authorities to address common concerns.

  5. Residents: Residents of Austin can gain insights into common issues in their neighborhoods and better understand how to report service requests effectively.

In summary, this exploratory data analysis of Austin’s 311 service requests provides valuable information for a wide range of stakeholders. By understanding the data, addressing missing values, and visualizing trends, stakeholders can make informed decisions and improve the quality of municipal services in Austin, Texas.