---
title: "The Effect of Words"
subtitle: Yelp Kaggle, The Effect of Words
author: "Tony Chuo, tonychuo@gmail.com"
date: "2017/08/06"
output:
  html_notebook:
    highlight: textmate
    theme: lumen
---

<br>
<br>
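We first select a set of words (or word stems) by their ...

+ frequency of occurrence, or
+ relative importance (average TF-IDF score)

and then observe their effect on `review$useful`, `review$cool`, or
`review$funny` across the entire corpus.<br> <br>

We can also ...

+ split the entire corpus in two by business category (e.g. `Restaurants` & Non-`Restaurants`),
+ pick any two business categories (e.g. `Bars` & `Shopping`), or
+ define any two groups of business categories (e.g. {`Chinese`, `Japanese`} & {`Mexican`, `Tex-Mex`})

and then compare the effect of these words in the two sub-corpora.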

- - -

```{r set-options, echo=FALSE, cache=FALSE}
library(knitr)
options(width=90)
opts_chunk$set(comment = NA)
```


```{r results='asis', warning=F, message=F, cache=F}
Sys.setlocale('LC_ALL','C')
library(magrittr)
library(highcharter)
library(slam)
library(tm)
library(SnowballC)

# get color palette
library(RColorBrewer)          
pals = c(brewer.pal(8,"Set2")[c(6)],
         brewer.pal(8,"Dark2"),
         brewer.pal(8,"Set1")[c(1)])

# load data
load('data/yelp1.rdata')
load('data/average.rdata')
load('data/empath.rdata')
```
<br>

## (1) Document-Term Matrix (DTM)
First, we build two document-term matrices ...

###  `dtm1` : full words, no stemming
```{r}
corp = Corpus(VectorSource(review$text))
corp = tm_map(corp,  content_transformer(tolower))
corp = tm_map(corp, removePunctuation)
dtm1 = DocumentTermMatrix(corp); dtm1        # terms: 215402
dtm1 = removeSparseTerms(dtm1, .9999); dtm1  # terms: 18664
```
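
The second argument of `removeSparseTerms()` is the maximal sparsity a term may have: a term is kept only if it occurs in more than a fraction `1 - sparse` of the documents. With `.9999`, that means more than 0.01% of the reviews. A minimal sketch on a made-up four-document corpus (all `toy_` names are illustrative and not part of the analysis):

```r
library(tm)

# Made-up corpus: "good" occurs in every document, every other word in one
toy_corp = VCorpus(VectorSource(c("good pizza", "good tacos", "good sushi", "good fez")))
toy_full = DocumentTermMatrix(toy_corp)

# sparse = 0.5: keep only terms found in more than half of the documents
toy_kept = removeSparseTerms(toy_full, 0.5)

nTerms(toy_full)   # 5 terms before filtering
Terms(toy_kept)    # only "good" survives
```

The same logic, with a far more permissive threshold, is what shrinks the vocabulary above.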
<br>

###  `dtm2` : stemmed words
```{r}
corp = tm_map(corp, removeWords, stopwords("english"))
corp = tm_map(corp, stemDocument)
dtm2 = DocumentTermMatrix(corp); dtm2        # terms: 177911
dtm2 = removeSparseTerms(dtm2, .9999); dtm2  # terms: 12968
```
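
Stemming maps different forms of a word onto one stem, which is why `dtm2` ends up with fewer terms than `dtm1`. A small sketch of the idea, calling the Snowball stemmer from `SnowballC` directly; that `stemDocument()` applies this same English stemmer is my assumption, and the word list is made up:

```r
library(SnowballC)

# Three surface forms of the same word (illustrative input)
toy_forms = c("dress", "dresses", "dressed")
toy_stems = wordStem(toy_forms, language = "english")

toy_stems           # all three collapse to "dress"
unique(toy_stems)   # a single term instead of three
```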
<br>

```{r}
save(dtm1,dtm2,file='data/dtm.rdata')
```
Save them first, so that we do not have to rebuild them every time. <br>
<br>

- - -
<br>

## (2) Preparation

```{r}
load('data/dtm.rdata')
DTM = dtm1 %>% {.[, order(-col_sums(.))]}
```
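For ease of inspection, we use the full-word (unstemmed) `DTM`. After loading it, we usually start by sorting the term columns by term frequency, in descending order.<br>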

```{r}
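Effect = function(y, x, m=rep(TRUE, length(y))) {
  x = x[m]; y = y[m]                               # keep only the rows selected by mask m
  n = as.numeric(length(x))                        # number of documents after masking
  pX = sum(x)/n; pY = sum(y)/n; pXY = sum(x&y)/n   # Pr[X], Pr[Y], Pr[X and Y]
  ef = c(usage=pX, base=pY, support=pXY, conf=pXY/pX, lift=pXY/pX/pY)
  c(round(100*ef, 2), count=n) }                   # measures in percent, plus the row count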
```
We define a function, `Effect()`, that computes several measures of the effect of `X` on `Y`. All of them except the count are reported in percent:

+ Usage (`Pr[X]`) -- the share of documents in which `X` occurs
+ Base (`Pr[Y]`) -- the overall probability of `Y`
+ Support (`Pr[Y^X]`) -- the joint probability of `X` and `Y`
+ Confidence (`Pr[Y|X]`) -- the probability of `Y` given `X`
+ Lift (`Pr[Y|X]/Pr[Y]`) -- `X`'s effect on `Y`: how much `X` raises (above 100%) or lowers (below 100%) the probability of `Y`
+ Count -- the length of the vectors (`X` and `Y` must have the same length)
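
To make these definitions concrete, here is a small worked example on eight made-up documents. The vectors `toy_x` and `toy_y` are invented for illustration; `Effect()` is repeated from the chunk above only so that the block runs on its own.

```r
# Effect() as defined above, repeated so this block is self-contained
Effect = function(y, x, m=rep(TRUE, length(y))) {
  x = x[m]; y = y[m]
  n = as.numeric(length(x))
  pX = sum(x)/n; pY = sum(y)/n; pXY = sum(x&y)/n
  ef = c(usage=pX, base=pY, support=pXY, conf=pXY/pX, lift=pXY/pX/pY)
  c(round(100*ef, 2), count=n) }

# Made-up data: the word occurs in 3 of 8 documents, 3 documents are "useful",
# and 2 documents are both
toy_x = c(TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE)
toy_y = c(TRUE, TRUE, FALSE, FALSE, TRUE,  FALSE, FALSE, FALSE)

toy_effect = Effect(toy_y, toy_x)
toy_effect
# usage = 37.5, base = 37.5, support = 25, conf = 66.67, lift = 177.78, count = 8
```

In this toy case the word raises the probability of a useful review from 37.5% to 66.67%, a lift of about 178%. A lift of exactly 100% would mean that the word and usefulness are unrelated.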


```{r}
Y = review[,"useful"] %>% {. > median(.)}
Effect(Y, as.vector(DTM[,"pizza"]) > 0)
```
`pizza` has almost no effect on `review$useful`: its `lift` is barely above 100%.


```{r}
Effect(Y, as.vector(DTM[,"dresses"]) > 0)
```
`dresses` has a positive effect on `review$useful`: its `lift` is above 150%.<br> <br> 
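
Note that `Y` is a strict median split: a review counts as useful only if it received more votes than the median review. When many reviews tie at the median, fewer than half of them end up `TRUE`, so the base rate `Pr[Y]` can sit well below 50%. A small sketch with made-up vote counts:

```r
# Made-up vote counts; most reviews receive no votes at all
toy_votes = c(0, 0, 0, 1, 0, 2, 5, 0, 1, 0)

median(toy_votes)                         # 0
toy_Y = toy_votes > median(toy_votes)     # strictly above the median
mean(toy_Y)                               # 0.4, not 0.5
```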

### 2.1 -- The Effect of the 20 Most Frequent Words
Now we can observe the effect of the 20 most frequent words ... 
```{r}
df = t(sapply(colnames(DTM)[1:20], function(w) 
  Effect(Y, as.vector(DTM[, w]) > 0)))
df
```
As you have just experienced, calculating the effect is a lengthy task. Let's turn on parallel computation.

```{r}
library(doParallel)
K = 4; clust=makeCluster(K)
registerDoParallel(clust)
getDoParWorkers()
```
<br>

- - -
<br>

## (3) Words' Effect on the Entire Corpus

### 3.1 -- The Most Frequent Words
Pick the 360 most frequent words from the `DTM`. Since we sorted `DTM` by descending `col_sums()`, we simply take its first 360 `colnames()`. This takes some time even with parallel processing (~60 seconds on my notebook), so be patient.   
```{r}
Words = colnames(DTM)[1:360] 
t0 = Sys.time()
df = foreach(word = Words, .combine=rbind) %dopar% {
  library(slam)
  Effect(Y, row_sums(DTM[,word]) > 0) }
Sys.time() - t0
```

```{r}
df = data.frame(df, word=Words)
df[1:6,]
```
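
The loop above relies on `foreach` returning one named vector per word and on `.combine=rbind` stacking those vectors into a matrix, which `data.frame()` then turns into one row per word. A minimal sketch of that shape, assuming the `foreach` package is installed. It uses the sequential `%do%` operator instead of `%dopar%` so that it runs without a cluster, and `toy_words` and `toy_stat()` are made up for illustration:

```r
library(foreach)

toy_words = c("the", "pizza", "fez")

# Hypothetical per-word summary: a named numeric vector, like the one Effect() returns
toy_stat = function(w) {
  chars = strsplit(w, "")[[1]]
  c(chars = length(chars), vowels = sum(chars %in% c("a", "e", "i", "o", "u")))
}

# One vector per word, row-bound into a 3 x 2 matrix
toy_res = foreach(w = toy_words, .combine = rbind) %do% toy_stat(w)
toy_res = data.frame(toy_res, word = toy_words)
toy_res
```

Replacing `%do%` with `%dopar%` sends the same loop body to the registered workers; the combined result has the same shape.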

```{r}
hchart(df,"scatter",hcaes(
  y=usage,x=lift,color=conf,size=conf,group=word)) %>% 
  hc_legend(enabled=F) %>% hc_chart(zoomType="xy") %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_plotOptions(bubble=list(maxSize="2%",minSize=10)) %>% 
  hc_xAxis(plotLines=list(list(color="pink",value=mean(df$lift),width=2))) %>% 
  hc_title(text="Effect of the Most Frequent Words (360) on Usefulness")
```
<br>
In the figure above, each bubble represents a word. The bubbles' ...

* Colors represent the `confidence: Pr[Y|X]`; yellow/blue indicates high/low confidence.
    + Because lift is confidence divided by the (constant) base rate, confidence is proportional to lift.<br>
    
* X-coordinates represent the `lift: Pr[Y|X]/Pr[Y]`, in percent.
    + On the right are the words with the highest lift, including `yes`, `review`, `those`, `into`, and `finally`.
    + On the left are some words with low/negative lift.
    + Interestingly, words with positive connotations, such as `great`, `excellent`, and `recommend`, all carry negative lifts.
    + On the upper left are the two most frequent words in English -- `the` and `and`. 
    + The lifts are clearly biased upward: even such common words as `the` and `and` carry positive lifts. To cope with this bias, we move the neutral lift from 100 to 125 (the mean of the lifts).<br>
    
* Y-coordinates represent the `usage: Pr[X]`.
    + Lift is negatively correlated with usage; all of the high-lift words have low usage.<br>
<br>

### 3.2 -- Select Words with TF-IDF
Usually, the most important words are not the same as the most frequent words. We can use the [TF-IDF](http://zh.wikipedia.org/wiki/Tf-idf) (Term Frequency -- Inverse Document Frequency) method to pick words that are important yet less frequent. 
```{r}
tfidf = DTM %>% {tapply(.$v/row_sums(.)[.$i], .$j, mean) *
    log2( nDocs(.) / col_sums(. > 0) )}
TFxIDF = DTM[,which(tfidf[1:3000] > quantile(tfidf)[3])] %>%
  col_sums() %>% sort %>% names
length(TFxIDF)
```
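
As I read it, the `tfidf` score is an average term frequency (each count divided by the length of its document, averaged over the documents that contain the term) multiplied by a log2 inverse document frequency. A minimal base-R sketch of that reading on a made-up four-document matrix, with a dense matrix standing in for the sparse `DTM`:

```r
# Made-up counts: rows are documents, columns are terms
toy_dtm = matrix(c(2, 1, 0,
                   1, 0, 0,
                   1, 1, 0,
                   2, 0, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("d", 1:4), c("the", "pizza", "fez")))

toy_tf     = toy_dtm / rowSums(toy_dtm)     # share of each document taken up by each term
toy_ndoc   = colSums(toy_dtm > 0)           # number of documents containing each term
toy_avg_tf = colSums(toy_tf) / toy_ndoc     # mean TF over the documents containing the term
toy_idf    = log2(nrow(toy_dtm) / toy_ndoc) # inverse document frequency
toy_tfidf  = toy_avg_tf * toy_idf

round(toy_tfidf, 3)
# the = 0, pizza = 0.417, fez = 1
```

`the` appears in every document, so its IDF, and hence its score, is zero, while the rare `fez` scores highest. This is how the method pushes very common words out of the selection.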
<br>
The `TFxIDF` selection yields 377 words. To be consistent with the previous chart, we take only the first 360 of them.  We put these words in `Words` and use the same code to calculate their effects ...
```{r}
Words = TFxIDF[1:360]
t0 = Sys.time()
df = foreach(word = Words, .combine=rbind) %dopar% {
  library(slam)
  Effect(Y, row_sums(DTM[,word]) > 0) }
Sys.time() - t0 
```
and then make an interactive chart.

```{r}
df = data.frame(df, word=Words)
hchart(df,"scatter",hcaes(
  y=usage,x=lift,color=conf,size=conf,group=word)) %>% 
  hc_legend(enabled=F) %>% hc_chart(zoomType="xy") %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_plotOptions(bubble=list(maxSize="2%",minSize=10)) %>%
  hc_xAxis(plotLines=list(list(color="pink",value=mean(df$lift),width=2))) %>% 
  hc_title(text="Effect of TFxIDF Words (360-, censored) on Usefulness")
```
<br>

There are some major differences between the two charts:

1. The __Scale__ : The usages of the TF-IDF words are lower, but their lifts are more diverse than those of the frequent words.
  +  Because [TF-IDF](http://zh.wikipedia.org/wiki/Tf-idf) weights words by __inverse__ document frequency, common words such as `this` and `that` are excluded. In this way, it helps to find the less frequent but important words.
  + To avoid glitches, we censor the words with usage lower than 0.1%. The upper bound of usage is merely 1.33%, whereas that of the frequent words is 91.48%.
  + Although the usage of these words is low, the spread of their lifts is much wider -- [75.9%, 205.8%] compared to [98.6%, 153.4%].<br>
  
2. The __Distribution of Lift__ : The lifts are roughly normally distributed, with a mean of around 130%. It is therefore easier to identify the 'good' and 'bad' words. <br>

3. The __Goods__ are: `fez`, `gay`, `republic`, `lolos`, `hood`, ... <br>

4. The __Bads__ are: `definately`, `haircut`, `auto`, `repair`, `gluten`, ... <br>
<br>

- - -

<br>

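## (4) Words' Effect Across Two Sub-Corpora

Besides a word's effect across the entire corpus, we can also compare its effect in different kinds of sub-corpora.

### 4.1 -- Helper Function

A word has a different lift in each sub-corpus. If we use these lifts as coordinates and plot a whole group of words on the same plane, we can compare the words' effects across the sub-corpora. We first write a plotting helper function and call it `WordsLift()`. 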

```{r}
WordsLift = function(L, On="Usefulness", c1="Cat1", c2="Cat2", Wd="Frequent") {
  ttl = sprintf("Effect of %s Words on %s", Wd, On)
  sub = sprintf("Size: Total Usage; Color: %s(yellow) | %s(blue)", c2, c1)
  ttlx = sprintf("%% Lift on %s (blue)", c1)
  ttly = sprintf("%% Lift on %s (yellow)", c2)
  
  df = merge(L[[1]],L[[2]],by='word',sort=F,suffixes=c("_x","_y"))
  df = df[df$usage_x > 0.1 & df$usage_y > 0.1, ] 
  df$ratio = round(df$usage_y / df$usage_x, 3)
  df$total = round(df$usage_y + df$usage_x, 3)
  
  tips=paste0("<b>{point.word}</b><br>",
              "conf: ({point.conf_x}%, {point.conf_y}%)<br>",
              "usage: ({point.usage_x}%, {point.usage_y}%)<br>",
              "total usage (y + x): {point.total}%<br>",
              "usage ratio (y / x): {point.ratio}")
  
  hchart(df,"scatter",hcaes(
    x=lift_x, y=lift_y, size=log(total), color=log(ratio))) %>% 
    hc_chart(zoomType="xy") %>% hc_add_theme(hc_theme_538()) %>% 
    hc_plotOptions(bubble=list(maxSize="2%",minSize=4)) %>% 
    hc_xAxis(title=list(text=ttlx)) %>% hc_yAxis(title=list(text=ttly)) %>% 
    hc_tooltip(headerFormat="",hideDelay=100,useHTML=T,pointFormat=tips) %>%
    hc_xAxis(plotLines=list(list(color="orange",value=mean(df$lift_x),width=2))) %>% 
    hc_yAxis(plotLines=list(list(color="orange",value=mean(df$lift_y),width=2))) %>% 
    hc_title(text=ttl) %>% hc_subtitle(text=sub)
}
```
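
Before plotting, `WordsLift()` merges the two per-group tables by word, drops words that are too rare in either group, and derives the usage ratio and total. A small sketch of just that data-preparation step, on two made-up tables (`toy_a` and `toy_b` stand in for `L[[1]]` and `L[[2]]`; their values are invented):

```r
# Made-up effect tables for two sub-corpora (usage and lift in percent)
toy_a = data.frame(usage = c(5.00, 0.05, 2.00), lift = c(110, 150,  90),
                   word = c("good", "fez", "pizza"), stringsAsFactors = FALSE)
toy_b = data.frame(usage = c(4.00, 0.50, 0.20), lift = c(105, 160, 130),
                   word = c("good", "fez", "pizza"), stringsAsFactors = FALSE)

# Columns shared by both tables get the suffixes _x (first group) and _y (second group)
toy_m = merge(toy_a, toy_b, by = "word", sort = FALSE, suffixes = c("_x", "_y"))

# Censor words used in 0.1% or less of either group: "fez" drops out (0.05% in the first)
toy_m = toy_m[toy_m$usage_x > 0.1 & toy_m$usage_y > 0.1, ]

toy_m$ratio = round(toy_m$usage_y / toy_m$usage_x, 3)   # relative usage, y over x
toy_m$total = round(toy_m$usage_y + toy_m$usage_x, 3)   # combined usage
toy_m[, c("word", "lift_x", "lift_y", "ratio", "total")]
```

The chart then uses `lift_x` and `lift_y` as coordinates, `log(total)` for the bubble size, and `log(ratio)` for the color.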


### 4.2 -- Food vs. Non-Food, 360 Frequent Words

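We define two sub-corpora in `CatGroup`:

+ `Food`: the `Restaurants` and `Food` categories
+ `nonFood`: others

We put the words to analyze in `Words`, then loop over the two sub-corpora, run the effect analysis on each, and store the results in the list `L`.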

```{r}
bids = rowSums(mxBC[,c('Restaurants','Food')]) > 0
CatGroup = list(Food=bids, nFood=!bids)
Words = colnames(DTM)[1:360]

L = list(); for(i in 1:length(CatGroup)) {
  rmask = review$bid %in% biz$bid[ CatGroup[[i]] ]
  df = foreach(word = Words, .combine=rbind, .packages="slam") %dopar% {
    Effect(Y, row_sums(DTM[,word]) > 0, rmask) }
  df = data.frame(df, word=Words, catgrp=names(CatGroup)[i])
  L[[i]] = df}
food.freq = L
```
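
The key step in the loop is `rmask`, which turns a selection of businesses into a selection of reviews: a review is kept if its business id appears among the ids of the selected businesses. A small sketch with made-up tables (`toy_biz`, `toy_review` and `toy_is_food` are illustrative stand-ins for `biz`, `review` and one element of `CatGroup`):

```r
# Three made-up businesses; the first and third belong to the category group
toy_biz     = data.frame(bid = c("b1", "b2", "b3"), stringsAsFactors = FALSE)
toy_is_food = c(TRUE, FALSE, TRUE)

# Five made-up reviews, each pointing at a business
toy_review = data.frame(bid = c("b1", "b2", "b2", "b3", "b1"), stringsAsFactors = FALSE)

# TRUE for every review whose business is in the group
toy_rmask = toy_review$bid %in% toy_biz$bid[toy_is_food]
toy_rmask        # TRUE FALSE FALSE TRUE TRUE
sum(toy_rmask)   # 3 reviews fall into this sub-corpus
```

Passing such a mask as the third argument of `Effect()` restricts every probability, including the base rate, to that sub-corpus.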

We can then plot the result with `WordsLift()`.
```{r}
WordsLift(food.freq, "Usefulness", "Food", "nonFood", "Frequent (360)")
```
<br>

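Each point in the chart represents a word ...

+ The x (y) coordinate of a point is the word's lift in the `Food` (`nonFood`) sub-corpus
+ The size of a point represents the word's usage (the sum of its usage in the two sub-corpora, log-transformed)
+ The color of a point represents the word's relative usage in the `Food` (blue) and `nonFood` (yellow) sub-corpora

<br>
We can observe that:

+ As we observed in the previous charts, lift is negatively correlated with usage.
+ The lift in `Food` is positively correlated with the lift in `nonFood`.
+ In the upper left, some food-related words (`pizza`, `tacos`, `sushi`, `chips`, and `spicy`) have a low lift in `Food` but a high lift in `nonFood`.<br>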

<br>

### 4.3 -- Food vs. Non-Food, 360 TF-IDF Words

Let's repeat the process of the previous sub-section with the TF-IDF words.  

```{r}
Words = TFxIDF[1:360]
L = list(); for(i in 1:length(CatGroup)) {
  rmask = review$bid %in% biz$bid[ CatGroup[[i]] ]
  df = foreach(word = Words, .combine=rbind, .packages="slam") %dopar% {
    Effect(Y, row_sums(DTM[,word]) > 0, rmask) }
  df = data.frame(df, word=Words, catgrp=names(CatGroup)[i])
  L[[i]] = df}
food.tfidf = L
```

```{r}
WordsLift(food.tfidf, "Usefulness", "Food", "nonFood", 
         "TFxIDF (360, censored by 0.1% usage)")
```
<br>

To improve the comparative validity, we censor the words with usage lower than 0.1% in either sub-corpus. As a result, 174 words are displayed instead of 360.  As we can see ...

1. The distribution of lift of the TF-IDF words is wider than that of the frequent words.  
2. The correlation between lift and usage remains.
3. But the correlation between the lifts in `Food` and `nonFood` is no longer apparent.
4. By median split, we can divide the plane into four quadrants. On the ...
    + upper right -- `gay` and `republic` are good for both sub-corpora
    + lower left -- `definately` is bad for both sub-corpora
    + lower right -- `mike`, `polish`, `chris` and `result` are good in `Food` but bad in `nonFood`
    + upper left -- `magaritas` and `peaks` are good in `nonFood` but bad in `Food`

<br>
<br>

- - -

<br>

### 4.4 -- Bars vs Shopping, 1000 Frequent Words 

As an exercise, we compare the effect of 

+ the most __frequent__ 1000 words 
+ on `review$cool`
+ across `Bars` and `Shopping` categories

Simply put the criteria in `Y`, `CatGroup` and `Words` ...
```{r}
Y = review[,"cool"] %>% {. > median(.)}
CatGroup = list(Bar = mxBC[,'Bars'] > 0, Shopping = mxBC[,'Shopping'] > 0)  # logical masks over businesses
Words = colnames(DTM)[1:1000]
```
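
Each element of `CatGroup` is later used to index `biz$bid`, so it should be a logical vector. Whether `mxBC` holds logicals or 0/1 numbers is not shown in this notebook; if it were numeric, R would read the values as positions rather than as a mask. A small sketch of the difference, with made-up values:

```r
toy_bid  = c("b1", "b2", "b3")
toy_flag = c(1, 0, 1)              # hypothetical 0/1 coding of a category column

toy_by_number = toy_bid[toy_flag]       # numeric index: position 1 twice, the 0 is dropped
toy_by_mask   = toy_bid[toy_flag > 0]   # logical mask: the first and third business

toy_by_number   # "b1" "b1"
toy_by_mask     # "b1" "b3"
```

Comparing with `> 0` gives the intended mask for both numeric and logical columns.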

It takes about 5 minutes to evaluate the effect of 1000 words. 
```{r}
t0 = Sys.time()
L = list(); for(i in 1:length(CatGroup)) {
  rmask = review$bid %in% biz$bid[ CatGroup[[i]] ]
  df = foreach(word = Words, .combine=rbind, .packages="slam") %dopar% {
    Effect(Y, row_sums(DTM[,word]) > 0, rmask) }
  df = data.frame(df, word=Words, catgrp=names(CatGroup)[i])
  L[[i]] = df}
Sys.time() - t0 
bars_shop = L
```

We can plot the data with the same helper function (`WordsLift()`) 
```{r}
WordsLift(bars_shop, "Coolness", "Bars", "Shopping", "Frequent (1000)")
```
<br>

__QUIZ :__

1. What are your observations in the chart above?
2. Try to 
    + pick two groups of categories
    + make a lift comparison chart with the 360 most frequent words
    + make a lift chart with the TF-IDF words
    + share with us your major findings ...

<br>
```{r}
# save before leaving 
save(bars_shop, food.freq, food.tfidf, file="data/wordeffect.rdata")
stopCluster(clust) # stop parallel processing
```

<br>
<br>
<br>
