Data 608 Final Project
NYC 311 complaint analysis
Write-up on visualization
Parameters of dataset
All data points in the NYC 311 dataset come from New York City.
The data source is the Socrata API, which serves the NYC 311 dataset and is updated daily. Each API call therefore returns the most recently published records.
The Twitter dataset consists of the 1,100 most recent tweets mentioning NYC 311.
Some of the main fields in the NYC 311 dataset are:
“agency”
“agency_name”
“complaint_type”
“descriptor”
“incident_zip”
“incident_address”
“street_name”
“cross_street_1”
“cross_street_2”
“intersection_street_1”
“intersection_street_2”
“status”
“community_board”
“borough”
“x_coordinate_state_plane”
“y_coordinate_state_plane”
“open_data_channel_type”
“park_facility_name”
“park_borough”
“latitude”
“longitude”
“location”
“resolution_description”
“resolution_action_updated_date”
About NYC Open Data set
Beginning in 2010, NYC launched an initiative to expose government data via NYC Open Data in an effort to “improve the accessibility, transparency, and accountability of City government, this catalog offers access to a repository of government-produced, machine-readable data sets.”
What the dataset shows and why it is important
NYC 311’s mission is to provide the public with quick, easy access to all New York City government services and information while offering the best customer service. It helps agencies improve service delivery by allowing them to focus on their core missions and manage their workload efficiently.
NYC 311 data is updated daily and is provided by DoITT, where I am currently an intern. I therefore wanted to apply the visualization concepts studied in Data 608 to this dataset.
Aim
To analyze and build visualizations of issues across New York City (including Manhattan, Queens, Brooklyn, and the Bronx) by frequency of reported incidents in each area.
NYC 311 Service Requests & Resolution Analysis through Text Mining
Explore and analyze NYC 311 Service requests (historical data sets) to understand diverse patterns, regular themes and trends, as well as community satisfaction levels derived from resolution categories and timing.
I would also like to run sentiment analysis with the Syuzhet package on tweets mentioning “nyc311” to determine the emotions they express, especially during the period of the virus outbreak, and to create visualizations of the results.
Import libraries
Load all the necessary packages
Load the data using socrata API
Analyze the dataset with socrata API
api_endpoint <- "https://data.cityofnewyork.us/resource/erm2-nwe9.json"
json_dataset311 <- fromJSON(api_endpoint)
class(json_dataset311)
## [1] "data.frame"
Display column Names and no. of rows
## [1] "unique_key" "created_date"
## [3] "agency" "agency_name"
## [5] "complaint_type" "descriptor"
## [7] "location_type" "incident_zip"
## [9] "incident_address" "street_name"
## [11] "cross_street_1" "cross_street_2"
## [13] "intersection_street_1" "intersection_street_2"
## [15] "city" "landmark"
## [17] "status" "community_board"
## [19] "bbl" "borough"
## [21] "x_coordinate_state_plane" "y_coordinate_state_plane"
## [23] "open_data_channel_type" "park_facility_name"
## [25] "park_borough" "latitude"
## [27] "longitude" "location"
## [29] ":@computed_region_efsh_h5xi" ":@computed_region_f5dn_yrer"
## [31] ":@computed_region_yeji_bk3q" ":@computed_region_92fq_4b7q"
## [33] ":@computed_region_sbqj_enih" "closed_date"
## [35] "resolution_description" "resolution_action_updated_date"
## [37] "address_type" "facility_type"
## [39] "taxi_pick_up_location"
## [1] 1000
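The row count of 1000 reflects the Socrata API's default page size rather than the size of the full 311 dataset. The following is a minimal sketch of how a larger, most-recent-first pull could be requested with the SoQL `$limit` and `$order` query parameters. The helper name `build_311_url` and the limit of 5000 are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper: builds a Socrata query URL for the 311 endpoint.
# `$limit` and `$order` are SoQL query parameters.
build_311_url <- function(limit = 5000, order = "created_date DESC") {
  api_endpoint <- "https://data.cityofnewyork.us/resource/erm2-nwe9.json"
  query <- paste0("$limit=", format(limit, scientific = FALSE),
                  "&$order=", order)
  # URLencode() escapes the space inside the ordering clause
  URLencode(paste0(api_endpoint, "?", query))
}

url_5000 <- build_311_url(limit = 5000)
print(url_5000)

# The request itself would then look like this (needs network access):
# json_dataset311 <- jsonlite::fromJSON(url_5000)
```

Because the counts and rankings later in the report are computed on whatever rows are returned, a larger pull would likely change the exact figures.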
First 5 rows of dataset
Data Exploration and Visualization
1) Top 50 most common complaint types
dataset311 <- json_dataset311
ggplot(subset(dataset311, complaint_type %in% count(dataset311, complaint_type, sort=T)[1:50,]$complaint_type), aes(complaint_type)) +
# geom_bar() counts rows per category, which is what geom_histogram(stat = "count") was doing
geom_bar(color="black", fill="purple") +
labs(x="Complaint Type", y="Service Requests") +
coord_flip() + theme_bw()
As the chart shows, the Noise - Residential complaint type has the highest number of service requests, followed by Noise - Street/Sidewalk.
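The nested `count(...)[1:50,]$complaint_type` expression is what restricts the plot to the most frequent complaint types. A minimal base R sketch of the same selection follows; the toy vector and the helper name `top_n_types` are illustrative assumptions.

```r
# Toy complaint types standing in for dataset311$complaint_type (illustrative data)
complaint_type <- c("Noise - Residential", "Noise - Residential", "Noise - Residential",
                    "Noise - Street/Sidewalk", "Noise - Street/Sidewalk",
                    "Illegal Parking")

# Tabulate, sort in decreasing order of frequency, and keep the first n names
top_n_types <- function(x, n = 50) {
  freq <- sort(table(x), decreasing = TRUE)
  names(freq)[seq_len(min(n, length(freq)))]
}

top2 <- top_n_types(complaint_type, n = 2)
print(top2)
```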
2) Most common complaint types by borough and status
Count of complaints by borough and status:
# dataset_borough is built from dataset311 (assumed; no earlier definition of dataset_borough is shown)
dataset_borough <- dataset311 %>% select(complaint_type, borough, status) %>% filter(!str_detect(borough, "Unspecified"))
ggplot(dataset_borough, aes(x=status, y = complaint_type)) +
geom_point() +
geom_count(colour="darkgreen") +
facet_wrap(~borough)
As the graph shows, the Bronx, Brooklyn and Manhattan each have over 100 complaints in Closed status, which suggests good progress by NYC 311 in resolving complaints.
Service Request Resolutions Tidying and Analysis - Using Tidytext
In this section, we analyze the words used most frequently in Service Request resolutions.
Let’s use tidytext for this purpose.
Most frequent words used in NYC311 Service Requests
The following step also filters out records whose resolution_description is NA, so they are not included in the tokenized_resolutions dataset
library(tidytext)
data(stop_words)
tokenized_resolutions <- dataset311 %>%
select(complaint_type, descriptor, street_name, city, resolution_description, borough, open_data_channel_type) %>%
filter(!str_detect(borough, "Unspecified")) %>%
filter(!is.na(resolution_description)) %>% # drop missing values only, not every text containing "NA"
unnest_tokens(word, resolution_description) %>%
anti_join(stop_words) %>%
group_by(borough, word) %>%
tally()
## Joining, by = "word"
## Rows: 281
## Columns: 3
## Groups: borough [5]
## $ borough <chr> "BRONX", "BRONX", "BRONX", "BRONX", "BRONX", "BRONX", "BRONX"…
## $ word <chr> "act", "action", "additional", "arrival", "attempt", "complai…
## $ n <int> 1, 65, 2, 3, 32, 128, 95, 32, 32, 32, 32, 100, 5, 4, 21, 60, …
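To make the tokenization pipeline concrete, here is a minimal base R sketch of the same three steps: split each resolution into lower-case words, drop stop words, and count words per borough. The toy records, the short stop-word list and the helper name `tokenize` are illustrative assumptions; the analysis itself uses `unnest_tokens()`, `anti_join(stop_words)` and `tally()`.

```r
# Toy resolution texts per borough (illustrative, not real 311 records)
resolutions <- data.frame(
  borough = c("BRONX", "BRONX", "QUEENS"),
  resolution_description = c(
    "The Police Department responded to the complaint.",
    "The Police Department took action.",
    "The Department reviewed the complaint."
  ),
  stringsAsFactors = FALSE
)

# A tiny stop-word list standing in for tidytext's stop_words
stop_words_small <- c("the", "to", "and", "of")

# Lower-case, split on non-letters, and drop empty strings and stop words
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nzchar(words) & !(words %in% stop_words_small)]
}

# One row per (borough, word) occurrence
tokens <- do.call(rbind, lapply(seq_len(nrow(resolutions)), function(i) {
  data.frame(borough = resolutions$borough[i],
             word = tokenize(resolutions$resolution_description[i]),
             stringsAsFactors = FALSE)
}))

# Count occurrences per borough and word, like group_by() followed by tally()
word_counts <- aggregate(n ~ borough + word, data = cbind(tokens, n = 1), FUN = sum)
print(word_counts[order(word_counts$borough, -word_counts$n), ])
```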
Analyze internal structure of tokenized_resolutions
## tibble [281 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ borough: chr [1:281] "BRONX" "BRONX" "BRONX" "BRONX" ...
## $ word : chr [1:281] "act" "action" "additional" "arrival" ...
## $ n : int [1:281] 1 65 2 3 32 128 95 32 32 32 ...
## - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ borough: chr [1:5] "BRONX" "BROOKLYN" "MANHATTAN" "QUEENS" ...
## ..$ .rows :List of 5
## .. ..$ : int [1:40] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ : int [1:64] 41 42 43 44 45 46 47 48 49 50 ...
## .. ..$ : int [1:78] 105 106 107 108 109 110 111 112 113 114 ...
## .. ..$ : int [1:59] 183 184 185 186 187 188 189 190 191 192 ...
## .. ..$ : int [1:40] 242 243 244 245 246 247 248 249 250 251 ...
## ..- attr(*, ".drop")= logi TRUE
Let’s see first few rows of tokenized_resolutions
Now let’s look at the 25 most frequent words for each of the 5 boroughs:
tokenized_resolutions %>%
group_by(borough) %>%
top_n(25, n) %>% # rank explicitly by the word count n
arrange(desc(n)) %>%
ggplot(aes(x = reorder(word,n), y = n, fill = factor(borough))) +
geom_bar(stat = "identity") +
theme(legend.position = "none") +
facet_wrap(~borough, scales = "free") +
coord_flip() +
labs(x = "Words",
y = "Frequency",
title = "Top words used in NYC311 Service Requests by Borough",
subtitle = "")
As the chart shows, in all 5 boroughs the most frequently used word is police, followed by department, complaint and responded.
Determining terms/words truly characteristic for SRs by Borough leveraging textmining (TF-IDF)
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
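As a worked illustration of the statistic, the sketch below computes tf, idf and tf-idf by hand for a toy table in which each borough is treated as one document. The numbers are made up, and the natural-log idf is assumed here to match the convention used by `bind_tf_idf()`.

```r
# Toy word counts per borough (illustrative numbers)
counts <- data.frame(
  borough = c("BRONX", "BRONX", "QUEENS", "QUEENS"),
  word    = c("police", "reviewed", "police", "reported"),
  n       = c(8, 2, 6, 4),
  stringsAsFactors = FALSE
)

# Term frequency: share of a borough's words taken up by each word
counts$tf <- counts$n / ave(counts$n, counts$borough, FUN = sum)

# Inverse document frequency: ln(number of boroughs / boroughs containing the word)
n_docs <- length(unique(counts$borough))
docs_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_word)

# tf-idf is the product of the two
counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

A word such as police that occurs in every borough gets an idf of zero, so it cannot be distinctive no matter how frequent it is, while a word found in a single borough keeps a positive score.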
tf_idf_words <- tokenized_resolutions %>%
bind_tf_idf(word, borough, n) %>%
arrange(desc(tf_idf))
tf_idf_words
Presenting characteristic terms/words for SRs by Borough
Let’s analyze some distinctive words used by each borough
tf_idf_words %>%
top_n(25, tf_idf) %>% # rank explicitly by the tf-idf score
arrange(desc(tf_idf)) %>%
ggplot(aes(x = reorder(word, tf_idf), y = tf, fill = borough)) +
geom_col() +
labs(x = "Words", y = "tf",
title = "Distinctive words used in NYC311 Service Requests by Borough",
subtitle = "") +
coord_flip() +
theme(legend.position = "none") +
facet_wrap(~ borough, scales = "free")
As we can infer from the graph above, the most distinctive words by borough are:
Bronx: reviewed, followed by provided
Manhattan: unable, followed by premises
Brooklyn: reviewed, followed by provided
Queens: reported, followed by city
Staten Island: violation, followed by time
Map Analysis
Now let’s analyse the NYC 311 data using Map Analysis
Preparing and tidying up the data for map plotting
dataset_map <- subset(dataset311, complaint_type %in% count(dataset311, complaint_type, sort=T)[1:50,]$complaint_type)
dataset_map <- dataset_map %>% select(complaint_type, borough, latitude, longitude) %>% drop_na()
library(plyr)
counts <- ddply(dataset_map, .(complaint_type), "count")
counts_filtered <- filter(counts, freq > 2)
counts_filtered$freq <- as.numeric(counts_filtered$freq)
counts_filtered$longitude <- as.numeric(counts_filtered$longitude)
counts_filtered$latitude <- as.numeric(counts_filtered$latitude)
Dataset of counts_filtered in which the count of complaints is greater than 2
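Loading plyr after dplyr masks several dplyr functions, including `count()`, so the same per-location counts can also be sketched in base R without plyr. The toy data frame and the `_toy` object names below are illustrative assumptions; the logic mirrors the steps above (count identical rows, keep counts above 2, convert coordinates to numeric).

```r
# Toy geocoded complaints (illustrative coordinates, stored as text as in the JSON feed)
dataset_map_toy <- data.frame(
  complaint_type = c("Noise - Residential", "Noise - Residential",
                     "Noise - Residential", "Illegal Parking"),
  latitude  = c("40.70", "40.70", "40.70", "40.80"),
  longitude = c("-73.90", "-73.90", "-73.90", "-73.95"),
  stringsAsFactors = FALSE
)

# Count identical (complaint_type, latitude, longitude) rows
dataset_map_toy$freq <- 1
counts_toy <- aggregate(freq ~ complaint_type + latitude + longitude,
                        data = dataset_map_toy, FUN = sum)

# Keep locations with more than 2 requests and convert coordinates to numeric
counts_toy <- counts_toy[counts_toy$freq > 2, ]
counts_toy$latitude  <- as.numeric(counts_toy$latitude)
counts_toy$longitude <- as.numeric(counts_toy$longitude)
print(counts_toy)
```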
Map Plotting
Now that the map plotting dataset is prepared, let’s plot the longitude and latitude points and analyze the highest numbers of service requests by complaint type
#install.packages("rworldmap")
#install.packages("rworldxtra")
library(rworldmap)
library(rworldxtra)
newmap <- getMap(resolution = "high")
nyc_coorflimits <- data.frame( long = c(-74.5, -73.5), lat = c(40.5, 41), stringsAsFactors = FALSE)
nyc <- ggplot() + geom_polygon(data = newmap, aes(x=long, y = lat, group = group), fill = "gray", color = "blue") + xlim(-74.5, -73.5) + ylim(40.5, 41)
nyc_SRs <- nyc +
geom_point(data=counts_filtered, aes(longitude, latitude, size=freq), colour="red") +
facet_wrap(~complaint_type, scales = "free") +
labs(x = "Longitude", y = "Latitude", title = "Highest Number of SRs by Complaint Type") + scale_size(name="# of SRs")
nyc_SRs
As the map shows, Noise - Residential has the highest number of service requests of any complaint type.
Service Request Resolutions Tidying and Analysis - Using TM
The tm package provides a text mining framework for R; this section uses it to analyze the Service Request resolutions.
Load required libraries
Filtering dataset to the most relevant Complaint Types
# dplyr::count() is named explicitly because plyr, loaded earlier, also defines count()
dataset_filt <- subset(dataset311, complaint_type %in% dplyr::count(dataset311, complaint_type, sort=T)[1:50,]$complaint_type)
sr_resolution <- dataset_filt$resolution_description
Cleaning up non-standard characters (encoding conversion)
sr_resolution_cln <- sr_resolution %>% iconv("latin1", "ASCII")
control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE, minDocFreq=5) # stemming=TRUE does not provide much value
Creating Corpus and TDM
sr_corpus <- VCorpus(VectorSource(sr_resolution_cln))
sr_tdm <- TermDocumentMatrix(sr_corpus, control)
sr_tdm
## <<TermDocumentMatrix (terms: 121, documents: 998)>>
## Non-/sparse entries: 8179/112579
## Sparsity : 93%
## Maximal term length: 15
## Weighting : term frequency (tf)
Removing sparse terms (maximum allowed sparsity of 80%)
## <<TermDocumentMatrix (terms: 10, documents: 998)>>
## Non-/sparse entries: 5414/4566
## Sparsity : 46%
## Maximal term length: 11
## Weighting : term frequency (tf)
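The code for this step is not shown; it most likely relies on tm's `removeSparseTerms()` to produce `sr_tdm_unsprsd`. The sketch below reproduces the idea in base R on a toy term-document matrix, assuming that a term's sparsity is the share of documents in which it does not occur and that terms at or above the threshold are dropped. All names and counts are illustrative.

```r
# Toy term-document matrix: rows are terms, columns are documents (illustrative counts)
tdm_toy <- matrix(
  c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1,   # "police": present in 8 of 10 documents
    1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # "hydrant": present in 1 of 10 documents
    0, 1, 1, 0, 0, 1, 0, 1, 0, 0),  # "action": present in 4 of 10 documents
  nrow = 3, byrow = TRUE,
  dimnames = list(c("police", "hydrant", "action"), paste0("doc", 1:10))
)

# Sparsity of a term = share of documents in which it does not occur
term_sparsity <- rowMeans(tdm_toy == 0)
print(term_sparsity)

# Keep only terms whose sparsity is below the threshold
max_sparsity <- 0.8
tdm_dense <- tdm_toy[term_sparsity < max_sparsity, , drop = FALSE]
print(rownames(tdm_dense))
```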
Top terms by frequency (mentioned at least 50 times)
## [1] 10
Displaying top terms
## [1] "action" "available" "complaint" "condition" "department"
## [6] "fix" "information" "police" "responded" "took"
Find top associations using findAssocs() for the top terms (lower correlation limit of 0.4). More consistent term association patterns found in service requests
sr_topterms <- sr_topterms[!is.na(sr_topterms)]
sr_assocs <- findAssocs(sr_tdm_unsprsd, sr_topterms[1:5], 0.4)
lapply(sr_assocs, function(x) kable(x))
## $action
##
##
## x
## ---------- -----
## fix 0.88
## took 0.88
## responded 0.59
## police 0.54
##
## $available
##
##
## x
## ------------ -----
## information 0.84
##
## $complaint
##
##
## x
## ------- ----
## police 0.4
##
## $condition
##
##
## x
## ----- ----
## fix 0.6
## took 0.6
##
## $department
##
##
## x
## ---------- -----
## police 0.84
## responded 0.82
## fix 0.43
## took 0.43
As per the association figures above, action is correlated with:
fix (0.88), took (0.88), responded (0.59) and police (0.54).
The other terms can be interpreted in the same way.
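The association scores are correlations between term occurrence vectors across documents, so a value such as 0.88 is a correlation coefficient, not a percentage. A minimal base R sketch of that computation on a toy term-document matrix follows; the matrix and the helper name `assoc_with` are illustrative assumptions and only approximate what `findAssocs()` does internally.

```r
# Toy term-document matrix (illustrative counts): rows are terms, columns are documents
tdm_toy <- matrix(
  c(1, 1, 0, 0, 1, 0,   # "action"
    1, 1, 0, 0, 1, 0,   # "took": always appears together with "action"
    0, 1, 1, 0, 1, 0,   # "police"
    0, 0, 1, 1, 0, 1),  # "hydrant"
  nrow = 4, byrow = TRUE,
  dimnames = list(c("action", "took", "police", "hydrant"), paste0("doc", 1:6))
)

# Pearson correlation between one term and every other term across documents,
# keeping only correlations at or above the lower limit
assoc_with <- function(tdm, term, corlimit = 0.4) {
  cors <- cor(t(tdm))[term, ]
  cors <- cors[names(cors) != term]
  sort(round(cors[cors >= corlimit], 2), decreasing = TRUE)
}

action_assocs <- assoc_with(tdm_toy, "action")
print(action_assocs)
```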
Creating a WordCloud for the top terms/words in the SRs
library(wordcloud)
sr_tdm_cloud <- as.matrix(sr_tdm_unsprsd)
v <- sort(rowSums(sr_tdm_cloud),decreasing=TRUE)
d <- data.frame(word=names(v),freq=v)
wordcloud(d$word,d$freq,max.words=50, min.freq=10, colors=brewer.pal(8, 'Dark2'))
NYC311 Tweets Analysis
Now, let’s analyze tweets about NYC 311.
Data Collection and Exploration
API Set-up (Application Name and security context). Commands commented and keys masked
##------- store api keys (these are fake example values; replace with your own keys)
library(rtweet)
api_key <- "aaa"
api_secret_key <- "bbb"
access_token <- "ccc"
access_token_secret <- "ddd"
Search and collect 1,100 tweets that mention the “nyc311” service in any way (hashtag, user, follower, etc.)
| x |
|---|
| @KGRLogic Good afternoon, thank you for reaching out. Please DM me with details on the type of inspection you requested. Thanks! https://t.co/hDTCua1AH9 |
| @willardk Good afternoon, please send us a DM so I may ask you a few questions about this food delivery. Thank you! https://t.co/hDTCu9JZPB |
| @pixistik04 Good evening. You can report a fire hydrant that’s open online here: https://t.co/XGvdzZiRb8 or by sending us a DM for help with reporting. https://t.co/hDTCu9JZPB |
| @immichaelmorgan @NYCMayor @NYCMayorsOffice Good evening. You can get information and guidance about DMV service changes online at https://t.co/KXNs23W0Ry or you can reach out to them by phone at (718) 966-6155 Monday through Friday from 8:30 AM to 4 PM. Thanks! |
| @andrewPnelson2 Hi, all non-essential construction in NYC has been halted. DOB created a Real-Time Essential Construction Map, which shows the location of allowed essential construction sites in NYC. If a worksite isn’t on the map, DM us to file a report. https://t.co/hDTCu9JZPB |
Sample of Users tweeting about “nyc311”
| name | location | followers_count | friends_count |
|---|---|---|---|
| New York City 311 | New York City | 346557 | 238 |
| Yalaisa Wright | United States | 75 | 295 |
| Boerum Hill Neighbors | Brooklyn, NY | 518 | 1702 |
| Kevin | New York, NY | 221 | 451 |
| Nicholas F | | 382 | 1330 |
| Eagle One 🇺🇸🦅 | ’Merica | 277 | 307 |
Let’s plot the “nyc311” tweets as a time series (last 7-9 days)
The graph shows alternating increases and decreases in the number of tweets from May 03 to May 08, followed by a sharp drop in tweets to NYC 311 between May 10 and May 12. COVID-19 may be one of the main reasons for this drop.
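The plotting code for the time series is not shown. As a rough sketch of the underlying aggregation, the base R example below counts tweets per calendar day from a vector of timestamps. The toy timestamps are illustrative stand-ins for the tweets' creation times, and the column name `created_at` is an assumption.

```r
# Toy tweet timestamps standing in for nyc311_tweets$created_at (illustrative values)
created_at <- as.POSIXct(
  c("2020-05-03 09:15:00", "2020-05-03 17:40:00", "2020-05-04 08:05:00",
    "2020-05-06 12:30:00", "2020-05-06 13:45:00", "2020-05-06 21:10:00"),
  tz = "UTC"
)

# Count tweets per calendar day
tweet_day <- as.Date(created_at, tz = "UTC")
daily_counts <- as.data.frame(table(day = tweet_day), stringsAsFactors = FALSE)
daily_counts$day <- as.Date(daily_counts$day)
print(daily_counts)
```

Days without any tweets (May 5 in the toy data) do not appear in the table, which is worth keeping in mind when reading dips in a daily series.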
Sentiment Analysis - Syuzhet Package
Syuzhet's NRC lexicon scores text on eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise and trust) and two sentiments (negative and positive).
Let’s determine “nyc311” Tweet’s Emotions
#devtools::install_github("mjockers/syuzhet")
library(syuzhet)
nyc311_tweets_txt <- as.vector(nyc311_tweets$text)
emotion_df <- get_nrc_sentiment(nyc311_tweets_txt)
twt_emotion_df <- cbind(nyc311_tweets_txt, emotion_df)
kable(head(twt_emotion_df,3))
| nyc311_tweets_txt | anger | anticipation | disgust | fear | joy | sadness | surprise | trust | negative | positive |
|---|---|---|---|---|---|---|---|---|---|---|
| @KGRLogic Good afternoon, thank you for reaching out. Please DM me with details on the type of inspection you requested. Thanks! https://t.co/hDTCua1AH9 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| @willardk Good afternoon, please send us a DM so I may ask you a few questions about this food delivery. Thank you! https://t.co/hDTCu9JZPB | 0 | 2 | 0 | 0 | 2 | 0 | 1 | 2 | 0 | 3 |
| @pixistik04 Good evening. You can report a fire hydrant that’s open online here: https://t.co/XGvdzZiRb8 or by sending us a DM for help with reporting. https://t.co/hDTCu9JZPB | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
Sentiment Scoring
The core idea of sentiment scoring is to set the number of positive terms in a text against the number of negative terms.
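The code that computes the score is not shown, so the following is only a sketch under a stated assumption: the score of a tweet is its NRC positive count minus its negative count, and the sign of the score gives the label. The original may instead use a function such as `get_sentiment()`. The `_toy` names and values are illustrative.

```r
# Toy NRC-style counts per tweet (illustrative values, same column names as emotion_df)
emotion_toy <- data.frame(
  positive = c(3, 0, 1),
  negative = c(0, 2, 1)
)

# Net score: positive word count minus negative word count (assumed definition)
sent_value_toy <- emotion_toy$positive - emotion_toy$negative

# Label each tweet by the sign of its score
category_toy <- ifelse(sent_value_toy < 0, "Negative",
                       ifelse(sent_value_toy > 0, "Positive", "Neutral"))
print(table(category_toy))
```

Under this definition a tweet with no lexicon matches at all also scores zero and is labelled Neutral, which is consistent with text the English lexicon cannot match ending up in the neutral group.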
Let’s have a look at Positive Tweets
| x |
|---|
| @KGRLogic Good afternoon, thank you for reaching out. Please DM me with details on the type of inspection you requested. Thanks! https://t.co/hDTCua1AH9 |
| @willardk Good afternoon, please send us a DM so I may ask you a few questions about this food delivery. Thank you! https://t.co/hDTCu9JZPB |
| @pixistik04 Good evening. You can report a fire hydrant that’s open online here: https://t.co/XGvdzZiRb8 or by sending us a DM for help with reporting. https://t.co/hDTCu9JZPB |
| @immichaelmorgan @NYCMayor @NYCMayorsOffice Good evening. You can get information and guidance about DMV service changes online at https://t.co/KXNs23W0Ry or you can reach out to them by phone at (718) 966-6155 Monday through Friday from 8:30 AM to 4 PM. Thanks! |
| @andrewPnelson2 Hi, all non-essential construction in NYC has been halted. DOB created a Real-Time Essential Construction Map, which shows the location of allowed essential construction sites in NYC. If a worksite isn’t on the map, DM us to file a report. https://t.co/hDTCu9JZPB |
Most Positive Tweet
## [1] "@PQuinceNYC @HelenRosenthal @NYCDOB Good morning, please send us a Direct Message. We have a few questions to clarify what is happening at the construction site to ensure that we file the correct report. Thank you. https://t.co/hDTCu9JZPB"
Let’s have a look at Negative Tweets
| x |
|---|
| .@NYCDHS’s Code Blue is in effect until tomorrow, Sunday, May 10 at 8:00 AM. If you see a homeless person outside in these frigid temperatures, please call us at 311. https://t.co/jEaQyOxlxc |
| @domiruiz02 Hi, thank you for your tweets. Call 911 to report an emergency situation or condition that might cause danger to life or personal property and to report a medical or health-related emergency: https://t.co/Gf62x24xHN. |
| @DiamondVMedia @NYC_DOT @Pollytrott @NYCSpeakerCoJo @BPEricAdams @NYCMayor @NYGovCuomo Good morning, if the potholes are dangerous and likely to cause an accident, call 911. You can report potholes at https://t.co/MkR064QHhv or DM me and I’ll file for you. https://t.co/hDTCu9JZPB |
| UPDATE: #NYCASP rules are suspended through Sunday, May 17. #NYCASP resumes Monday, May 18 through Sunday, May 24 for a citywide clean sweep. #NYCASP rules will then be suspended again through Sunday, June 7. Parking meters will remain in effect. Follow @NYCASP for more. https://t.co/Qfh0v6R3Ia |
| @megshashin @NYCMayor @NYCMayorsOffice Good morning, we’re sorry to hear about your experience. If you believe you’ve been discriminated against, you can file a complaint with NYC Commission on Human Rights at https://t.co/FvPrtMXcLe or send us a DM. https://t.co/hDTCu9JZPB |
Most Negative Tweet
## [1] "@nyc311 @NYPD13Pct @CarlinaRivera I reported a recurring homeless condition in Gramercy. It was referred to the NYPD and subsequently closed as “non crime corrected” It’s a disgusting and unhealthy situation. He’s defecating on the sidewalk. 333 East 23 Street b/t 1st and 2nd https://t.co/Huc4taBHkc"
Let’s now see Neutral Tweets
| x |
|---|
| #NYCASP Las reglas de estacionamiento alterno están suspendidas hoy, sábado, 9 de mayo, hasta el martes, 12 de mayo. Los parquímetros permanecerán en efecto. Sigue @NYCASP y baja la aplicación móvil para recibir alertas directas a tu teléfono: https://t.co/9GSt3VfwSg https://t.co/8oFLl91kmn |
| #NYCASP Las reglas de estacionamiento alterno están suspendidas hoy, miércoles, 13 de mayo. Los parquímetros permanecerán en efecto. Sigue @NYCASP y baja la aplicación móvil para recibir alertas directas a tu teléfono: https://t.co/9GSt3VfwSg |
| #NYCASP Las reglas de estacionamiento alterno están suspendidas hoy, jueves, 7 de mayo, hasta el martes, 12 de mayo. Los parquímetros permanecerán en efecto. Sigue @NYCASP y baja la aplicación móvil para recibir alertas directas a tu teléfono: https://t.co/9GSt3VfwSg https://t.co/7OmeoBoGbv |
| #NYCASP Las reglas de estacionamiento alterno están suspendidas hoy, martes, 12 de mayo. Los parquímetros permanecerán en efecto. Sigue @NYCASP y baja la aplicación móvil para recibir alertas directas a tu teléfono: https://t.co/9GSt3VfwSg https://t.co/bbZezAr66f |
| #NYCASP Las reglas de estacionamiento alterno están suspendidas hoy, miércoles, 6 de mayo, hasta el martes, 12 de mayo. Los parquímetros permanecerán en efecto. Sigue @NYCASP y baja la aplicación móvil para recibir alertas directas a tu teléfono: https://t.co/9GSt3VfwSg https://t.co/TMiRuFsyUf |
Total Tweets by Sentiment using plotly package
#install.packages("plotly")
library(plotly)
category_sent <- ifelse(sent.value < 0, "Negative", ifelse(sent.value > 0, "Positive", "Neutral"))
totals <- data.frame(table(category_sent))
plot_ly(totals, x = ~category_sent, y = ~Freq, type = 'bar',
marker = list(color = c('red', 'orange',
'green'))) %>% layout(title = 'NYC311 Tweets by Sentiment')
Conclusion
Based on the analyses performed, the NYC311 service is a popular and reliable channel through which NYC communities alert local agencies and citizen service providers to a wide range of issues that matter to residents' well-being.
I was able to identify overall themes and topics affecting the boroughs of New York City and, more importantly, to narrow down the themes and patterns most characteristic of each one, giving an idea of the specific challenges, needs and local dynamics each borough community experiences on a daily basis.
In the sentiment analysis of the “nyc311” tweets, the majority express a positive sentiment. Surprisingly, few complaints or negative mentions are raised through Twitter. The NYC311 service also uses the channel to provide resolution advice, status updates and redirection guidance to its users and followers.
Issues faced during the creation of this Analytics project
Twitter developer account - The process for obtaining a Twitter developer account has been modified and now requires much more detailed information, which Twitter then reviews. It took 3 days to explain how and where I would use the NYC 311 Twitter data, but I finally received permission to create an app with a Twitter developer account.
Map plotting in R Markdown - The code that plots maps of the NYC area with SR statistics overlaid in multiple facets by complaint type worked perfectly in the RStudio console. When I ran the same code within R Markdown, it threw an error: facets were not supported and the SR statistics were not overlaid. I added a picture of the correct plot right after the affected code section as a reference.