First: weeks / months / quarters / years - all with categories.
Make maps showing the volume of complaints. Think about how this material can be presented.
The main dataset is the collection of all requests submitted to the Municipal Services Office from Holland-Bukit Panjang Town Council residential units.
## Timestamp of the first request in the dataset: 14/Nov/2016 13:16:57:323
## Timestamp of the last request in the dataset: 01/Jan/2012 00:07:05:32
## Dataset dimensions: 169397 18
Sanity check: the number of duplicates is
## [1] 0
There was some simple initial pre-processing, such as converting the two-symbol “def_cat” codes into a more explicit “report_cat”. Below is a sample of the initial raw data (10 variables):
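As a rough illustration, the recoding might look like the sketch below. The two-symbol codes, their labels, and the data frame name `reports` are all assumptions for illustration, not the actual mapping.

```r
# hypothetical mapping from two-symbol def_cat codes to readable categories;
# the codes and labels here are illustrative only
cat_map <- c("IP" = "illegalparking", "SI" = "social issues", "CL" = "cleanliness")
reports$report_cat <- unname(cat_map[reports$def_cat])
```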
We have cleaned the report texts before, but only for reports from Bukit Panjang. A description of the cleaning will be added here later, and eventually all reports will be cleaned. For now, this section covers only reports from Bukit Panjang residential blocks.
## [1] 126042
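A minimal sketch of the kind of cleaning applied is below; the exact steps are an assumption until the full description is added, and `bp_reports` with its columns is a hypothetical name.

```r
# lowercase, mask identifiers with placeholders such as _car_no, tidy whitespace
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("\\b[a-z]{2,3}\\d{1,4}[a-z]\\b", "_car_no", x)  # e.g. vehicle plate numbers
  x <- gsub("\\s+", " ", trimws(x))
  x
}
bp_reports$clean_text <- clean_text(bp_reports$report_text)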
This section takes a long time to run, and we only need to run it once.
In order to do text processing, we will need a document-term matrix.
## [1] 4828 5198
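A minimal sketch of how the document-term matrix might be built with the tm package, assuming the cleaned report texts are in a character vector `clean_text`:

```r
library(tm)

corpus <- VCorpus(VectorSource(clean_text))
dtm <- DocumentTermMatrix(corpus)   # one row per report, one column per term
dim(dtm)                            # documents x terms, as printed above
```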
This goes beyond simple exploration.
First, we calculate the overall sentiment score of each complaint as the number of neutral words it contains minus the number of negative words: \[ \mathrm{score}(d) = \sum_{w \in d \cap \{\mbox{neutral}\}} 1 \;-\; \sum_{w \in d \cap \{\mbox{negative}\}} 1. \] We then label a report as neutral if its score is non-negative and as negative if the score is negative.
## There are 1173 frequent words
## First few frequent words are _car_no _case_id _date _floor_no _motorbike_no _phone_no
## Neutral words are: please thanks serious especially heavy inconvenience kindly quite really frequent necessary urgently bad afraid frequently previous concern several twice usually badly appreciate definitely
## Negative words are: always alot still complaining also many already dangerous just disturbing even affecting almost often insist unhappy much angry yet despite never unbearable constantly frustrated irritating numerous psychotic scared worried extremely regularly concerned continuous
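Given the neutral and negative word lists above, a minimal sketch of the score and label computation; `dtm_m`, `neutral_words` and `negative_words` are assumed names for the document-term matrix (as a plain matrix) and the two word lists.

```r
dtm_m <- as.matrix(dtm)

# occurrences of neutral and negative words in each report
n_neutral  <- rowSums(dtm_m[, intersect(colnames(dtm_m), neutral_words),  drop = FALSE])
n_negative <- rowSums(dtm_m[, intersect(colnames(dtm_m), negative_words), drop = FALSE])

score <- n_neutral - n_negative
final_sentiment <- ifelse(score >= 0, "neutral", "negative")
```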
Here is the table of sentiments:
## final_sentiment
## negative neutral
## 1648 3180
Here is the distribution of reports by word count:
By the number of negative words:
By the number of neutral words:
By the sentiment score:
Below is a sample of reports without negative or neutral words.
## [1] " feedback unknown upper level thrown cigarettes cigarettes down and burning her clothes , issue have been on going for long time . request officer in charge to look into it "
## [2] " feedback 18 - 26 keep dripping wet clothes . please advise "
## [3] " illegalparking bicycles blocking the access road link way . it's everyday issue , caller request towncouncil to call him due no responce . "
## [4] " it - motorbike illegalparking at voiddeck "
## [5] " it spotted _motorbike_no at block 474 voiddeck at _time on 17feb . "
## [6] " request officerincharge to return call regarding about some construction in progress in front of this block . request what construction is that . please return call "
## [7] " resident inform neighbour washing bicycles under the voiddeck . request to advice . "
## [8] " tng testing mso for illegalparking "
## [9] " it the attached photo was taken on 24may around _time showing a motorbike parked at a voiddeck near block 474 segar road . as that day was a sunday and there is no enforcement officer available . please assist to look into this areas . thanks you . "
## [10] " resident inform that unit _unit_no is doing renovation . work and has been noisy for the past 3 week , distracting her son studying for exam . request officerincharge 's attention "
Now we will generate a dataset for sentiments and construct a model.
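A minimal sketch of how such a model might be fit with caret; the object names `X` (frequent-word matrix) and `y` (sentiment labels) are assumptions, while the scaling, 5-fold cross-validation, fixed alpha = 1 and lambda grid mirror the output below.

```r
library(caret)
library(glmnet)

set.seed(1)
fit <- train(
  x = as.matrix(X), y = y,
  method     = "glmnet",
  preProcess = "scale",
  trControl  = trainControl(method = "cv", number = 5),
  tuneGrid   = expand.grid(
    alpha  = 1,                                 # lasso only
    lambda = 10^seq(-4, -1, length.out = 20)    # the lambda grid shown below
  )
)
fit
```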
## glmnet
##
## 4828 samples
## 1173 predictors
## 2 classes: 'negative', 'neutral'
##
## Pre-processing: scaled (1173)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3862, 3862, 3862, 3863, 3863
## Resampling results across tuning parameters:
##
## lambda Accuracy Kappa
## 0.0001000000 0.9975142 0.9944727
## 0.0001438450 0.9975142 0.9944727
## 0.0002069138 0.9975142 0.9944727
## 0.0002976351 0.9975142 0.9944727
## 0.0004281332 0.9975142 0.9944727
## 0.0006158482 0.9975142 0.9944727
## 0.0008858668 0.9971002 0.9935548
## 0.0012742750 0.9964790 0.9921737
## 0.0018329807 0.9958575 0.9907928
## 0.0026366509 0.9933713 0.9852730
## 0.0037926902 0.9904715 0.9788600
## 0.0054555948 0.9881932 0.9737810
## 0.0078475997 0.9817726 0.9596066
## 0.0112883789 0.9757659 0.9462472
## 0.0162377674 0.9664461 0.9254324
## 0.0233572147 0.9446993 0.8749750
## 0.0335981829 0.9150780 0.8030272
## 0.0483293024 0.8782094 0.7125936
## 0.0695192796 0.8218739 0.5627555
## 0.1000000000 0.8152454 0.5437783
##
## Tuning parameter 'alpha' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.0006158482.
Here is what we got.
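The word lists below come from the non-zero coefficients of the selected fit. A minimal sketch of the extraction, assuming the caret object is called `fit` and that positive coefficients push a report towards the second factor level, "neutral":

```r
cf <- as.matrix(coef(fit$finalModel, s = fit$bestTune$lambda))
nz <- cf[cf[, 1] != 0, 1]                  # drop coefficients shrunk to exactly zero
neutral_terms  <- sort(names(nz[nz > 0]))  # words pushing towards "neutral"
negative_terms <- sort(names(nz[nz < 0]))  # words pushing towards "negative"
```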
According to the model, “neutral” words are
## [1] "(Intercept)" "`182`" "`2014`" "`7th`"
## [5] "afraid" "appreciate" "bad" "badly"
## [9] "blk170" "concern" "especially" "frequent"
## [13] "frequently" "heavy" "inconvenience" "kindly"
## [17] "necessary" "please" "previous" "quite"
## [21] "really" "serious" "several" "thanks"
## [25] "twice" "urgently" "usually"
According to the model, “negative” words are
## [1] "affecting" "almost" "alot" "already" "also"
## [6] "always" "angry" "complaining" "constantly" "continuous"
## [11] "daily" "dangerous" "despite" "disturbing" "even"
## [16] "extremely" "foul" "frustrated" "insist" "irritating"
## [21] "just" "mahjong" "many" "much" "never"
## [26] "numerous" "often" "oranges" "psychotic" "regularly"
## [31] "resident" "scared" "still" "unbearable" "unhappy"
## [36] "worried" "yet"
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 3 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 24
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[308 item(s), 4828 transaction(s)] done [0.00s].
## sorting and recoding items ... [137 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [178 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 178 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 66 112
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.629 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005178 Min. :0.4015 Min. :0.005385 Min. : 3.170
## 1st Qu.:0.006266 1st Qu.:0.5953 1st Qu.:0.007871 1st Qu.: 8.179
## Median :0.006835 Median :0.7250 Median :0.010563 Median : 14.991
## Mean :0.011144 Mean :0.7251 Mean :0.016898 Mean : 24.805
## 3rd Qu.:0.012945 3rd Qu.:0.8762 3rd Qu.:0.020971 3rd Qu.: 30.898
## Max. :0.059445 Max. :1.0000 Max. :0.090307 Max. :137.943
## count
## Min. : 25.00
## 1st Qu.: 30.25
## Median : 33.00
## Mean : 53.80
## 3rd Qu.: 62.50
## Max. :287.00
##
## mining info:
## data ntransactions support confidence
## T 4828 0.005 0.4
We can visualize association rules by a scatterplot as follows:
We can extract rules above a chosen threshold:
We can colour the scatterplot according to the rule length:
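A minimal sketch of these three steps with the arulesViz package; the confidence and lift thresholds below are illustrative assumptions.

```r
library(arulesViz)

plot(rules, measure = c("support", "confidence"), shading = "lift")  # scatterplot
strong <- subset(rules, confidence > 0.8 & lift > 10)                # threshold the rules
plot(rules, shading = "order")                                       # colour by rule length
```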
We can plot a small set of rules as a directed graph, where items and rules are vertices: there is an edge from item \(A\) to rule \(R\) whenever \(A\) appears in the LHS of rule \(R\), and an edge from rule \(R\) to item \(B\) whenever \(B\) appears in the RHS of rule \(R\).
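A minimal sketch of such graphs with arulesViz, assuming we take the top rules by lift:

```r
plot(head(sort(rules, by = "lift"), 10), method = "graph")   # 10-rule graph
plot(head(sort(rules, by = "lift"), 38), method = "graph")   # 38-rule graph
```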
This is a graph for 10 rules:
And here is the graph for 38 rules:
And here is the table of 38 rules:
Social issues
We will work with reports categorized as social issues. Below is a sample:
And below are some cleaned reports containing the words “bug” and “litter”:
We will remove stopwords.
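A minimal sketch of the stopword removal and word cloud, assuming the cleaned social-issue texts are in a character vector `social_texts`:

```r
library(tm)
library(wordcloud)

words <- unlist(strsplit(social_texts, "\\s+"))
words <- words[!words %in% stopwords("en") & nchar(words) > 0]  # drop English stopwords
freq  <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), max.words = 100)
```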
Below is the word cloud of all reports in social issues:
Most common terms
Unigrams
Bigrams
Trigrams
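A minimal sketch of how the unigram, bigram and trigram counts might be tabulated with tidytext; `social_df` with a `text` column is a hypothetical data frame of cleaned reports.

```r
library(dplyr)
library(tidytext)

count_ngrams <- function(df, n) {
  df %>%
    unnest_tokens(term, text, token = "ngrams", n = n) %>%
    count(term, sort = TRUE)
}
unigrams <- count_ngrams(social_df, 1)
bigrams  <- count_ngrams(social_df, 2)
trigrams <- count_ngrams(social_df, 3)
```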
We manually labelled words that appear at least 5 times.
Facts
Table of words representing facts:
For example, here are perpetrator words:
Co-occurrence matrix:
Below is the co-occurrence matrix of facts:
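A minimal sketch of the co-occurrence computation, assuming `fact_words` holds the manually labelled fact words and `dtm` is the document-term matrix:

```r
M <- as.matrix(dtm)
X <- (M[, intersect(colnames(M), fact_words), drop = FALSE] > 0) * 1  # presence/absence
cooc <- crossprod(X)   # cooc[i, j] = number of reports containing both words i and j
diag(cooc) <- 0        # ignore self-co-occurrence
```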
Here is a Venn diagram:
Sentiments
Sentiments:
Here are neutral words:
Issues
Table of words representing issues:
For example, here are noise words: