(1) Document-Term Matrix (DTM)
First, build two Document-Term Matrices …
dtm1 : full words, without stemming
corp = Corpus(VectorSource(review$text))
corp = tm_map(corp, content_transformer(tolower))
corp = tm_map(corp, removePunctuation)
dtm1 = DocumentTermMatrix(corp); dtm1 # terms: 215402
<<DocumentTermMatrix (documents: 215879, terms: 215402)>>
Non-/sparse entries: 15740971/46485027387
Sparsity : 100%
Maximal term length: 932
Weighting : term frequency (tf)
dtm1 = removeSparseTerms(dtm1, .9999); dtm1 # terms: 18664
<<DocumentTermMatrix (documents: 215879, terms: 18664)>>
Non-/sparse entries: 15289544/4013876112
Sparsity : 100%
Maximal term length: 18
Weighting : term frequency (tf)
dtm2 : stemmed (stop words are also removed)
corp = tm_map(corp, removeWords, stopwords("english"))
corp = tm_map(corp, stemDocument)
dtm2 = DocumentTermMatrix(corp); dtm2 # terms: 177911
<<DocumentTermMatrix (documents: 215879, terms: 177911)>>
Non-/sparse entries: 11655667/38395593102
Sparsity : 100%
Maximal term length: 932
Weighting : term frequency (tf)
dtm2 = removeSparseTerms(dtm2, .9999); dtm2 # terms: 12968
<<DocumentTermMatrix (documents: 215879, terms: 12968)>>
Non-/sparse entries: 11312194/2788206678
Sparsity : 100%
Maximal term length: 17
Weighting : term frequency (tf)
save(dtm1,dtm2,file='data/dtm.rdata')
Save them first, so we don't have to rebuild them every time.
(2) Preparation
load('data/dtm.rdata')
DTM = dtm1 %>% {.[, order(-col_sums(.))]}
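For ease of observation, we choose the DTM with full words (not stemmed). After loading the DTM, we usually start by sorting the term columns in descending order of term frequency.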
Effect = function(y, x, m=rep(TRUE, length(y))) {
x = x[m]; y = y[m]
n = as.numeric(length(x))
pX = sum(x)/n; pY = sum(y)/n; pXY = sum(x&y)/n
ef = c(usage=pX, base=pY, support=pXY, conf=pXY/pX, lift=pXY/pX/pY)
c(round(100*ef, 2), count=n) }
We define a function, Effect(), to calculate the various effects of X on Y, including:
- Usage (Pr[X]) – the usage ratio of X
- Base (Pr[Y]) – the overall probability of Y
- Confidence (Pr[Y|X]) – the probability of Y given X
- Support (Pr[X∧Y]) – the probability that X and Y both occur
- Lift (Pr[Y|X]/Pr[Y]) – X's effect on Y (the lift of Y's probability given X)
- Count – the length of the vectors (X and Y should have the same length)
All values except Count are reported as percentages.
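As a quick sanity check of these definitions, here is a minimal sketch on two made-up logical vectors. The toy values of x and y are assumptions for illustration only, not data from the reviews. Effect() is repeated from above so that the block runs on its own.

```r
# Effect() copied from the definition above
Effect = function(y, x, m=rep(TRUE, length(y))) {
  x = x[m]; y = y[m]
  n = as.numeric(length(x))
  pX = sum(x)/n; pY = sum(y)/n; pXY = sum(x&y)/n
  ef = c(usage=pX, base=pY, support=pXY, conf=pXY/pX, lift=pXY/pX/pY)
  c(round(100*ef, 2), count=n) }

# toy data (hypothetical): 10 documents
x = c(TRUE, TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)  # word present?
y = c(TRUE, TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE)  # review useful?

res = Effect(y, x)
res
```

With these toy vectors, X occurs in 4 of 10 documents, Y in 4 of 10 and both in 3 of 10, so the confidence is 3/4 = 75% and the lift is 75/40 = 187.5%.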
Y = review[,"useful"] %>% {. > median(.)}
Effect(Y, as.vector(DTM[,"pizza"]) > 0)
usage base support conf lift count
6.18 30.08 1.89 30.54 101.55 215879.00
pizza has almost no effect on review$useful (its lift, 101.55%, is barely higher than 100%).
Effect(Y, as.vector(DTM[,"dresses"]) > 0)
usage base support conf lift count
0.20 30.08 0.09 47.07 156.50 215879.00
dresses has a positive effect on review$useful (its lift, 156.50%, is higher than 150%).
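Note that base is 30.08 rather than 50, even though Y is a median split. A likely reason (an assumption, since the distribution of review$useful is not shown here) is that many reviews tie at the median, and the strict > comparison puts all ties into the FALSE group. A toy sketch with made-up vote counts:

```r
# hypothetical vote counts with many ties at the median
useful = c(0, 0, 0, 0, 0, 0, 1, 1, 2, 5)
med = median(useful)      # the median is 0 here
Y_toy = useful > med      # ties at the median become FALSE
share = mean(Y_toy)       # share of TRUE, well below 0.5
share
```

In this toy vector only 4 of the 10 values are strictly above the median, so the base rate would be 40%, not 50%.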
2.1 – The effect of 20 most frequent words
Now we can observe the effect of the 20 most frequent words …
df = t(sapply(colnames(DTM)[1:20], function(w)
Effect(Y, as.vector(DTM[, w]) > 0)))
df
usage base support conf lift count
the 91.48 30.08 28.99 31.69 105.36 215879
and 89.02 30.08 28.33 31.82 105.80 215879
was 56.59 30.08 19.63 34.69 115.35 215879
for 65.38 30.08 22.75 34.80 115.71 215879
that 51.60 30.08 19.56 37.91 126.05 215879
with 52.02 30.08 19.22 36.94 122.82 215879
but 54.98 30.08 19.55 35.56 118.21 215879
this 56.42 30.08 19.91 35.28 117.30 215879
you 44.47 30.08 16.71 37.59 124.96 215879
they 46.53 30.08 16.65 35.79 118.98 215879
have 48.28 30.08 17.46 36.16 120.22 215879
not 42.80 30.08 15.75 36.81 122.37 215879
had 40.13 30.08 14.31 35.67 118.59 215879
are 40.38 30.08 14.26 35.31 117.39 215879
good 41.73 30.08 13.67 32.75 108.87 215879
place 42.74 30.08 14.40 33.68 111.99 215879
were 31.79 30.08 11.86 37.31 124.04 215879
food 38.89 30.08 12.28 31.57 104.97 215879
there 33.28 30.08 12.55 37.71 125.36 215879
great 35.08 30.08 10.43 29.75 98.90 215879
As you have just experienced, calculating the effect is a lengthy task. Let’s turn on parallel computation.
library(doParallel)
K = 4; clust=makeCluster(K)
registerDoParallel(clust)
getDoParWorkers()
[1] 4
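The loops in the following sections all follow the same foreach pattern: each iteration returns a named vector and .combine=rbind stacks the vectors into a matrix. The sketch below shows that pattern on a tiny made-up indicator matrix. The matrix M, the 2-worker cluster cl and the row naming are assumptions for illustration, not part of the original analysis.

```r
library(doParallel)    # also attaches foreach and parallel

cl = makeCluster(2)    # a separate 2-worker cluster, just for this sketch
registerDoParallel(cl)

# toy 0/1 indicator columns standing in for DTM columns (made-up data)
M = cbind(a=c(1,0,1,1), b=c(0,0,1,0), c=c(1,1,1,1))

# each iteration returns a named vector; rbind stacks them row by row
res = foreach(w = colnames(M), .combine=rbind) %dopar% {
  c(usage = mean(M[, w] > 0), count = nrow(M)) }
rownames(res) = colnames(M)   # label the rows with the words

stopCluster(cl)
res
```

Because the combined result does not carry the words by itself, the word column has to be attached afterwards, which is what data.frame(df, word=Words) does later on.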
(3) Words’ Effect on the Entire Corpus
3.1 – The Most Frequent Words
Pick the 360 most frequent words from the DTM. Since we have sorted the DTM by descending col_sums(), we simply take the first 360 colnames() from it. It takes some time even with parallel processing (~60 seconds on my notebook), so be patient.
Words = colnames(DTM)[1:360]
t0 = Sys.time()
df = foreach(word = Words, .combine=rbind) %dopar% {
library(slam)
Effect(Y, row_sums(DTM[,word]) > 0) }
Sys.time() - t0
Time difference of 1.0529 mins
df = data.frame(df, word=Words)
df[1:6,]
hchart(df,"scatter",hcaes(
y=usage,x=lift,color=conf,size=conf,group=word)) %>%
hc_legend(enabled=F) %>% hc_chart(zoomType="xy") %>%
hc_add_theme(hc_theme_flat()) %>%
hc_plotOptions(bubble=list(maxSize="2%",minSize=10)) %>%
hc_xAxis(plotLines=list(list(color="pink",value=mean(df$lift),width=2))) %>%
hc_title(text="Effect of the Most Frequent Words (360) on Usefulness")
In the figure above, each bubble represents a word. The bubbles' …
- Colors represent the confidence: Pr[Y|X]; yellow/blue indicates high/low confidence.
  - The confidence is, by definition, closely correlated with the lift.
- X-coordinates represent the lift: Pr[Y|X]/Pr[Y], in percent.
  - On the right, we can see the words with the highest lift, including: yes, review, those, into, finally.
  - On the left, we can see some words with low/negative lift (below 100%).
    - Interestingly, words with positive connotations such as great, excellent, recommend all carry negative lifts.
  - On the upper left, we have the two most frequent words in English – the and and.
  - The positive bias of lift is quite obvious. Even such common words as the and and carry positive lifts. To cope with this bias, we will adjust the neutral lift from 100 to 125 (the mean of the lifts).
- Y-coordinates represent the usage: Pr[X].
  - Lift is negatively correlated with usage; all of the high-lift words exhibit low usage.
3.2 – Select Words with TF-IDF
Usually, the most important words are not the same as the most frequent words. We can use the TF-IDF (Term Frequency – Inverse Document Frequency) method to pick the important yet less frequent words.
tfidf = DTM %>% {tapply(.$v/row_sums(.)[.$i], .$j, mean) *
log2( nDocs(.) / col_sums(. > 0) )}
TFxIDF = DTM[,which(tfidf[1:3000] > quantile(tfidf)[3])] %>%
col_sums() %>% sort %>% names
length(TFxIDF)
[1] 377
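To make the tfidf expression above easier to follow, here is a minimal base-R sketch of the same computation on a small dense matrix: the mean length-normalized term frequency (over the documents that contain the term) multiplied by log2 of the inverse document frequency. The matrix m and its counts are made up for illustration, and the dense-matrix functions replace the slam calls used on the real DTM.

```r
# toy count matrix: 4 documents x 3 terms (made-up numbers)
m = matrix(c(2, 1, 0,
             1, 0, 0,
             3, 0, 1,
             1, 0, 0), nrow=4, byrow=TRUE,
           dimnames=list(NULL, c("the", "pizza", "fez")))

tf  = m / rowSums(m)                             # term frequency, normalized by document length
mtf = apply(tf, 2, function(v) mean(v[v > 0]))   # mean over the documents containing the term
idf = log2(nrow(m) / colSums(m > 0))             # inverse document frequency
toy_tfidf = round(mtf * idf, 3)
toy_tfidf
```

In this toy matrix "the" appears in every document, so its IDF is log2(4/4) = 0 and its score is 0, while the rarer "pizza" and "fez" get positive scores. This is why very common words drop out of the TF-IDF selection.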
The algorithm generates 377 words. To be consistent, we only take the first 360 words. We put the words in Words and use the same code to calculate the effects of these words …
Words = TFxIDF[1:360]
t0 = Sys.time()
df = foreach(word = Words, .combine=rbind) %dopar% {
library(slam)
Effect(Y, row_sums(DTM[,word]) > 0) }
Sys.time() - t0
Time difference of 56.508 secs
and then make an interactive chart.
df = data.frame(df, word=Words)
hchart(df,"scatter",hcaes(
y=usage,x=lift,color=conf,size=conf,group=word)) %>%
hc_legend(enabled=F) %>% hc_chart(zoomType="xy") %>%
hc_add_theme(hc_theme_flat()) %>%
hc_plotOptions(bubble=list(maxSize="2%",minSize=10)) %>%
hc_xAxis(plotLines=list(list(color="pink",value=mean(df$lift),width=2))) %>%
hc_title(text="Effect of TFxIDF Words (360-, censored) on Usefulness")
There are some major differences between the two charts:
- The Scale : The usages of the TF-IDF words are lower, but their lifts are more diverse than those of the frequent words.
  - Because TF-IDF rates the words by inverse document frequency, such common words as this and that are excluded. In this way, it helps to find the less frequent but important words.
  - To avoid glitches, we censor the words with usage lower than 0.1%. The upper bound of usage is merely 1.33%, whilst that of the frequent words is 91.48%.
  - Although the usages of these words are low, the spread of their lifts is much wider – [75.9%, 205.8%] compared to [98.6%, 153.4%].
- The Distribution of Lift : The lifts are approximately normally distributed, with a mean of around 130%. Therefore it is easier to identify the 'good' and 'bad' words.
  - The Goods are: fez, gay, republic, lolos, hood, …
  - The Bads are: definately, haircut, auto, repair, gluten, …
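(4) Words' Effect Across Two Sub-Corpora
Besides looking at how words perform across the entire corpus, we can also compare the effect of words across different kinds of sub-corpora.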
4.1 – Helper Function
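A word can have a different lift in each sub-corpus. If we use these lifts as coordinates and plot a whole group of words on the same plane, we can compare the effect of the words across the sub-corpora. We first write a plotting helper function and call it WordsLift().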
WordsLift = function(L, On="Usefulness", c1="Cat1", c2="Cat2", Wd="Frequent") {
ttl = sprintf("Effect of %s Words on %s", Wd, On)
sub = sprintf("Size: Total Usage; Color: %s(yellow) | %s(blue)", c2, c1)
ttlx = sprintf("%% Lift on %s (blue)", c1)
ttly = sprintf("%% Lift on %s (yellow)", c2)
df = merge(L[[1]],L[[2]],by='word',sort=F,suffixes=c("_x","_y"))
df = df[df$usage_x > 0.1 & df$usage_y > 0.1, ]
df$ratio = round(df$usage_y / df$usage_x, 3)
df$total = round(df$usage_y + df$usage_x, 3)
tips=paste0("<b>{point.word}</b><br>",
"conf: ({point.conf_x}%, {point.conf_y}%)<br>",
"usage: ({point.usage_x}%, {point.usage_y}%)<br>",
"total usage (y + x): {point.total}%<br>",
"usage ratio (y / x): {point.ratio}")
hchart(df,"scatter",hcaes(
x=lift_x, y=lift_y, size=log(total), color=log(ratio))) %>%
hc_chart(zoomType="xy") %>% hc_add_theme(hc_theme_538()) %>%
hc_plotOptions(bubble=list(maxSize="2%",minSize=4)) %>%
hc_xAxis(title=list(text=ttlx)) %>% hc_yAxis(title=list(text=ttly)) %>%
hc_tooltip(headerFormat="",hideDelay=100,useHTML=T,pointFormat=tips) %>%
hc_xAxis(plotLines=list(list(color="orange",value=mean(df$lift_x),width=2))) %>%
hc_yAxis(plotLines=list(list(color="orange",value=mean(df$lift_y),width=2))) %>%
hc_title(text=ttl) %>% hc_subtitle(text=sub)
}
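The data preparation inside WordsLift() can be tried on its own without any plotting. The sketch below repeats the merge, censoring and ratio steps on two made-up result tables. The data frames a and b and all their numbers are hypothetical stand-ins for L[[1]] and L[[2]].

```r
# two made-up effect tables, one per sub-corpus (hypothetical numbers, in percent)
a = data.frame(word=c("pizza", "fez", "the"), usage=c(8.0, 0.05, 90), lift=c(95, 180, 104))
b = data.frame(word=c("pizza", "fez", "the"), usage=c(0.4, 0.30, 92), lift=c(140, 170, 106))

# same steps as in WordsLift(): columns of a get suffix _x, columns of b get suffix _y
toy = merge(a, b, by="word", sort=FALSE, suffixes=c("_x", "_y"))
toy = toy[toy$usage_x > 0.1 & toy$usage_y > 0.1, ]   # censor words that are rare in either table
toy$ratio = round(toy$usage_y / toy$usage_x, 3)      # drives the point color (log scale)
toy$total = round(toy$usage_y + toy$usage_x, 3)      # drives the point size (log scale)
toy
```

Here "fez" is dropped because its usage in the first table (0.05%) is below the 0.1% threshold, which is the same censoring that reduces the number of displayed words in the charts.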
4.2 – Food vs. Non-Food, 360 Frequent Words
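Define two sub-corpora in CatGroup:
Food: the Restaurants and Food categories
nonFood (named nFood in the code): others
Put the words to be analyzed in Words, then loop over the two sub-corpora, run the effect analysis on each of them, and store the results in the list L.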
bids = rowSums(mxBC[,c('Restaurants','Food')]) > 0
CatGroup = list(Food=bids, nFood=!bids)
Words = colnames(DTM)[1:360]
L = list(); for(i in 1:length(CatGroup)) {
rmask = review$bid %in% biz$bid[ CatGroup[[i]] ]
df = foreach(word = Words, .combine=rbind) %dopar% {
Effect(Y, row_sums(DTM[,word]) > 0, rmask) }
df = data.frame(df, word=Words, catgrp=names(CatGroup)[i])
L[[i]] = df}
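The row mask rmask translates a business-level selection into a review-level selection. A minimal sketch with made-up tables follows. The toy biz_toy and review_toy data frames and the isFood vector are assumptions for illustration; the real biz, review and mxBC objects are built elsewhere.

```r
# toy tables (hypothetical): 4 businesses, 6 reviews
biz_toy    = data.frame(bid = 1:4)
isFood     = c(TRUE, FALSE, TRUE, FALSE)         # one logical flag per business
review_toy = data.frame(bid = c(1, 1, 2, 3, 4, 4))

# reviews whose business belongs to the selected group
rmask_toy = review_toy$bid %in% biz_toy$bid[isFood]
rmask_toy

# the flags must be logical: a numeric 0/1 vector would be read as positions,
# dropping the zeros and picking the first business once per 1
wrong = biz_toy$bid[as.numeric(isFood)]
wrong
```

In this sketch the logical flags select the reviews of businesses 1 and 3, whereas the numeric version silently selects business 1 twice.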
After that, we can make the chart with WordsLift().
# food.freq = L
WordsLift(food.freq, "Usefulness", "Food", "nonFood", "Frequent (360)")
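Each point in the chart represents a word …
- The x (y) coordinate of a point is the word's lift in the Food (nonFood) sub-corpus
- The size of a point represents the word's usage (the sum of the two usages, log-transformed)
- The color of a point represents the word's relative usage in the Food (blue) and nonFood (yellow) sub-corpora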
We can observe that:
- As we observed in the previous charts, lift is negatively correlated with usage.
- The lift on Food is positively correlated with the lift on nonFood.
- In the upper left, some food-related words (pizza, tacos, sushi, chips, and spicy) have low lift on Food but high lift on nonFood.
4.3 – Food vs. Non-Food, 194 TF-IDF Words
Let's repeat the process of the previous subsection with the TF-IDF words.
Words = TFxIDF[1:360]
L = list(); for(i in 1:length(CatGroup)) {
rmask = review$bid %in% biz$bid[ CatGroup[[i]] ]
df = foreach(word = Words, .combine=rbind) %dopar% {
Effect(Y, row_sums(DTM[,word]) > 0, rmask) }
df = data.frame(df, word=Words, catgrp=names(CatGroup)[i])
L[[i]] = df}
# food.tfidf = L
WordsLift(food.tfidf, "Usefulness", "Food", "nonFood",
"TFxIDF (360, censored by 0.1% usage)")
To improve the validity of the comparison, we censor the words with usage lower than 0.1% in either sub-corpus. As a result, 174 words are displayed instead of 360. As we can see …
- The distribution of lift of the TF-IDF words is wider than that of the frequent words.
- The correlation between lift and usage remains.
- But the correlation between the lifts on Food and nonFood is no longer significant.
- By median split, we can divide the plane into four quadrants. On the …
  - upper right – gay and republic are good for both sub-corpora
  - lower left – definately is bad for both sub-corpora
  - lower right – mike, polish, chris and result are good in Food but bad in nonFood
  - upper left – magaritas and peaks are good in nonFood but bad in Food
4.4 – Bars vs Shopping, 1000 Frequent Words
As an exercise, we compare the effect of
- the most frequent 1000 words
- on review$cool
- across the Bars and Shopping categories
Simply put the criteria in Y, CatGroup and Words …
Y = review[,"cool"] %>% {. > median(.)}
CatGroup = list(Bar = mxBC[,'Bars'] > 0, Shopping = mxBC[,'Shopping'] > 0) # logical masks, as in 4.2
Words = colnames(DTM)[1:1000]
It takes about 5 minutes to evaluate the effect of 1000 words.
t0 = Sys.time()
L = list(); for(i in 1:length(CatGroup)) {
rmask = review$bid %in% biz$bid[ CatGroup[[i]] ]
df = foreach(word = Words, .combine=rbind) %dopar% {
Effect(Y, row_sums(DTM[,word]) > 0, rmask) }
df = data.frame(df, word=Words, catgrp=names(CatGroup)[i])
L[[i]] = df}
Sys.time() - t0 # 4.7617 mins
Time difference of 4.8501 mins
We can plot the data with the same helper function, WordsLift().
# bars_shop = L
WordsLift(bars_shop, "Coolness", "Bars", "Shopping", "Frequent (1000)")
QUIZ :
- What are your observations in the chart above?
- Try to
  - pick two groups of categories
  - make a lift comparison chart with the 360 most frequent words
  - make a lift comparison chart with the TF-IDF words
  - share with us your major findings …
# save before leaving
save(bars_shop, food.freq, food.tfidf, file="data/wordeffect.rdata")
stopCluster(clust) # stop parallel processing
