First: weeks / months / quarters / years - all with categories.
Make maps showing the volume of complaints. Think about how this material can be presented.
The main dataset is the collection of all requests submitted to the Municipal Services Office from Holland-Bukit Panjang Town Council residential units.
## Timestamp of the first request in the dataset: 14/Nov/2016 13:16:57:323
## Timestamp of the last request in the dataset: 01/Jan/2012 00:07:05:32
## Dataset dimensions: 169397 18
Sanity check: the number of duplicates is
## [1] 0
There was some simple initial pre-processing, such as converting the two-symbol “def_cat” codes into a more explicit “report_cat”. Below is a sample of the initial raw data (10 variables):
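As a rough illustration, the recoding might look like the sketch below. The two-symbol codes, their labels, and the data frame name `reports` are all assumptions for illustration, not the actual mapping.

```r
# hypothetical mapping from two-symbol def_cat codes to readable categories;
# the codes and labels here are illustrative only
cat_map <- c("IP" = "illegalparking", "SI" = "social issues", "CL" = "cleanliness")
reports$report_cat <- unname(cat_map[reports$def_cat])
```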
We have cleaned the report texts before, but only for reports from Bukit Panjang. A description of the cleaning will be added here later, and eventually all reports will be cleaned. For now, this section covers only reports from Bukit Panjang residential blocks.
## [1] 126042
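A minimal sketch of the kind of cleaning applied is below; the exact steps are an assumption until the full description is added, and `bp_reports` with its columns is a hypothetical name.

```r
# lowercase, mask identifiers with placeholders such as _car_no, tidy whitespace
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("\\b[a-z]{2,3}\\d{1,4}[a-z]\\b", "_car_no", x)  # e.g. vehicle plate numbers
  x <- gsub("\\s+", " ", trimws(x))
  x
}
bp_reports$clean_text <- clean_text(bp_reports$report_text)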
This section takes a long time to run, and we only need to run it once.
In order to do text processing, we will need a document-term matrix.
## [1] 4828 5198
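A minimal sketch of how the document-term matrix might be built with the tm package, assuming the cleaned report texts are in a character vector `clean_text`:

```r
library(tm)

corpus <- VCorpus(VectorSource(clean_text))
dtm <- DocumentTermMatrix(corpus)   # one row per report, one column per term
dim(dtm)                            # documents x terms, as printed above
```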
This goes beyond simple exploration.
First, we calculate the overall sentiment score of each complaint as the number of neutral words it contains minus the number of negative words: \[ \mathrm{score}(d) = \sum_{w \in d \cap \{\mbox{neutral}\}} 1 \;-\; \sum_{w \in d \cap \{\mbox{negative}\}} 1. \] We then label a report as neutral if its score is non-negative and as negative if the score is negative.
## There are 1173 frequent words
## First few frequent words are _car_no _case_id _date _floor_no _motorbike_no _phone_no
## Neutral words are: please thanks serious especially heavy inconvenience kindly quite really frequent necessary urgently bad afraid frequently previous concern several twice usually badly appreciate definitely
## Negative words are: always alot still complaining also many already dangerous just disturbing even affecting almost often insist unhappy much angry yet despite never unbearable constantly frustrated irritating numerous psychotic scared worried extremely regularly concerned continuous
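Given the neutral and negative word lists above, a minimal sketch of the score and label computation; `dtm_m`, `neutral_words` and `negative_words` are assumed names for the document-term matrix (as a plain matrix) and the two word lists.

```r
dtm_m <- as.matrix(dtm)

# occurrences of neutral and negative words in each report
n_neutral  <- rowSums(dtm_m[, intersect(colnames(dtm_m), neutral_words),  drop = FALSE])
n_negative <- rowSums(dtm_m[, intersect(colnames(dtm_m), negative_words), drop = FALSE])

score <- n_neutral - n_negative
final_sentiment <- ifelse(score >= 0, "neutral", "negative")
```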
Here is the table of sentiments:
## final_sentiment
## negative neutral
## 1648 3180
Here is the distribution of reports by word count:
By the number of negative words:
By the number of neutral words:
By the sentiment score:
Below is a sample of reports without negative or neutral words.
## [1] " feedback unknown upper level thrown cigarettes cigarettes down and burning her clothes , issue have been on going for long time . request officer in charge to look into it "
## [2] " feedback 18 - 26 keep dripping wet clothes . please advise "
## [3] " illegalparking bicycles blocking the access road link way . it's everyday issue , caller request towncouncil to call him due no responce . "
## [4] " it - motorbike illegalparking at voiddeck "
## [5] " it spotted _motorbike_no at block 474 voiddeck at _time on 17feb . "
## [6] " request officerincharge to return call regarding about some construction in progress in front of this block . request what construction is that . please return call "
## [7] " resident inform neighbour washing bicycles under the voiddeck . request to advice . "
## [8] " tng testing mso for illegalparking "
## [9] " it the attached photo was taken on 24may around _time showing a motorbike parked at a voiddeck near block 474 segar road . as that day was a sunday and there is no enforcement officer available . please assist to look into this areas . thanks you . "
## [10] " resident inform that unit _unit_no is doing renovation . work and has been noisy for the past 3 week , distracting her son studying for exam . request officerincharge 's attention "
Now we will generate a dataset for sentiments and construct a model.
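A minimal sketch of how such a model might be fit with caret; the object names `X` (frequent-word matrix) and `y` (sentiment labels) are assumptions, while the scaling, 5-fold cross-validation, fixed alpha = 1 and lambda grid mirror the output below.

```r
library(caret)
library(glmnet)

set.seed(1)
fit <- train(
  x = as.matrix(X), y = y,
  method     = "glmnet",
  preProcess = "scale",
  trControl  = trainControl(method = "cv", number = 5),
  tuneGrid   = expand.grid(
    alpha  = 1,                                 # lasso only
    lambda = 10^seq(-4, -1, length.out = 20)    # the lambda grid shown below
  )
)
fit
```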
## glmnet
##
## 4828 samples
## 1173 predictors
## 2 classes: 'negative', 'neutral'
##
## Pre-processing: scaled (1173)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3862, 3862, 3862, 3863, 3863
## Resampling results across tuning parameters:
##
## lambda Accuracy Kappa
## 0.0001000000 0.9975142 0.9944727
## 0.0001438450 0.9975142 0.9944727
## 0.0002069138 0.9975142 0.9944727
## 0.0002976351 0.9975142 0.9944727
## 0.0004281332 0.9975142 0.9944727
## 0.0006158482 0.9975142 0.9944727
## 0.0008858668 0.9971002 0.9935548
## 0.0012742750 0.9964790 0.9921737
## 0.0018329807 0.9958575 0.9907928
## 0.0026366509 0.9933713 0.9852730
## 0.0037926902 0.9904715 0.9788600
## 0.0054555948 0.9881932 0.9737810
## 0.0078475997 0.9817726 0.9596066
## 0.0112883789 0.9757659 0.9462472
## 0.0162377674 0.9664461 0.9254324
## 0.0233572147 0.9446993 0.8749750
## 0.0335981829 0.9150780 0.8030272
## 0.0483293024 0.8782094 0.7125936
## 0.0695192796 0.8218739 0.5627555
## 0.1000000000 0.8152454 0.5437783
##
## Tuning parameter 'alpha' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0006158482.
Here is what we got.
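The word lists below come from the non-zero coefficients of the selected fit. A minimal sketch of the extraction, assuming the caret object is called `fit` and that positive coefficients push a report towards the second factor level, "neutral":

```r
cf <- as.matrix(coef(fit$finalModel, s = fit$bestTune$lambda))
nz <- cf[cf[, 1] != 0, 1]                  # drop coefficients shrunk to exactly zero
neutral_terms  <- sort(names(nz[nz > 0]))  # words pushing towards "neutral"
negative_terms <- sort(names(nz[nz < 0]))  # words pushing towards "negative"
```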
According to the model, “neutral” words are
## [1] "(Intercept)" "`182`" "`2014`" "`7th`"
## [5] "afraid" "appreciate" "bad" "badly"
## [9] "blk170" "concern" "especially" "frequent"
## [13] "frequently" "heavy" "inconvenience" "kindly"
## [17] "necessary" "please" "previous" "quite"
## [21] "really" "serious" "several" "thanks"
## [25] "twice" "urgently" "usually"
According to the model, “negative” words are
## [1] "affecting" "almost" "alot" "already" "also"
## [6] "always" "angry" "complaining" "constantly" "continuous"
## [11] "daily" "dangerous" "despite" "disturbing" "even"
## [16] "extremely" "foul" "frustrated" "insist" "irritating"
## [21] "just" "mahjong" "many" "much" "never"
## [26] "numerous" "often" "oranges" "psychotic" "regularly"
## [31] "resident" "scared" "still" "unbearable" "unhappy"
## [36] "worried" "yet"
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 3 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 24
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[308 item(s), 4828 transaction(s)] done [0.00s].
## sorting and recoding items ... [137 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [178 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 178 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 66 112
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.629 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005178 Min. :0.4015 Min. :0.005385 Min. : 3.170
## 1st Qu.:0.006266 1st Qu.:0.5953 1st Qu.:0.007871 1st Qu.: 8.179
## Median :0.006835 Median :0.7250 Median :0.010563 Median : 14.991
## Mean :0.011144 Mean :0.7251 Mean :0.016898 Mean : 24.805
## 3rd Qu.:0.012945 3rd Qu.:0.8762 3rd Qu.:0.020971 3rd Qu.: 30.898
## Max. :0.059445 Max. :1.0000 Max. :0.090307 Max. :137.943
## count
## Min. : 25.00
## 1st Qu.: 30.25
## Median : 33.00
## Mean : 53.80
## 3rd Qu.: 62.50
## Max. :287.00
##
## mining info:
## data ntransactions support confidence
## T 4828 0.005 0.4
We can visualize association rules by a scatterplot as follows:
We can extract rules above a chosen threshold:
We can colour the scatterplot according to the rule length:
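A minimal sketch of these three steps with the arulesViz package; the confidence and lift thresholds below are illustrative assumptions.

```r
library(arulesViz)

plot(rules, measure = c("support", "confidence"), shading = "lift")  # scatterplot
strong <- subset(rules, confidence > 0.8 & lift > 10)                # threshold the rules
plot(rules, shading = "order")                                       # colour by rule length
```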
We can plot a small set of rules as a directed graph, where items and rules are vertices: there is an edge from item \(A\) to rule \(R\) whenever \(A\) appears in the LHS of rule \(R\), and an edge from rule \(R\) to item \(B\) whenever \(B\) appears in the RHS of rule \(R\).
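A minimal sketch of such graphs with arulesViz, assuming we take the top rules by lift:

```r
plot(head(sort(rules, by = "lift"), 10), method = "graph")   # 10-rule graph
plot(head(sort(rules, by = "lift"), 38), method = "graph")   # 38-rule graph
```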
This is a graph for 10 rules:
And here is the graph for 38 rules:
And here is the table of 38 rules:
Social issues
We will work with reports categorized as social issues. Below is a sample:
And below are some cleaned reports containing the words “bug” and “litter”:
We will remove stopwords.
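A minimal sketch of the stopword removal and word cloud, assuming the cleaned social-issue texts are in a character vector `social_texts`:

```r
library(tm)
library(wordcloud)

words <- unlist(strsplit(social_texts, "\\s+"))
words <- words[!words %in% stopwords("en") & nchar(words) > 0]  # drop English stopwords
freq  <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), max.words = 100)
```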
Below is the word cloud of all reports in social issues:
Most common terms
Unigrams
Bigrams
Trigrams
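A minimal sketch of how the unigram, bigram and trigram counts might be tabulated with tidytext; `social_df` with a `text` column is a hypothetical data frame of cleaned reports.

```r
library(dplyr)
library(tidytext)

count_ngrams <- function(df, n) {
  df %>%
    unnest_tokens(term, text, token = "ngrams", n = n) %>%
    count(term, sort = TRUE)
}
unigrams <- count_ngrams(social_df, 1)
bigrams  <- count_ngrams(social_df, 2)
trigrams <- count_ngrams(social_df, 3)
```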
We manually labelled words that appear at least 5 times.
Facts
Table of words representing facts:
For example, here are perpetrator words:
Co-occurrence matrix:
Below is the co-occurrence matrix of facts:
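A minimal sketch of the co-occurrence computation, assuming `fact_words` holds the manually labelled fact words and `dtm` is the document-term matrix:

```r
M <- as.matrix(dtm)
X <- (M[, intersect(colnames(M), fact_words), drop = FALSE] > 0) * 1  # presence/absence
cooc <- crossprod(X)   # cooc[i, j] = number of reports containing both words i and j
diag(cooc) <- 0        # ignore self-co-occurrence
```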
Here is a Venn diagram:
Sentiments
Sentiments:
Here are neutral words:
Issues
Table of words representing issues:
For example, here are noise words: