動機與目地:欲了解機器人所產生的評論和真人所寫之評論有何不同之處,並以文字分析來判斷真假評論。

載入packages

library(readr)
library(tm)
## Loading required package: NLP
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(tidytext)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(tidyr)
library(topicmodels)
library(LDAvis)
library(wordcloud2)
library(webshot)
library(htmlwidgets)
library(servr)
library(SnowballC)

載入資料集

reviews <- fread("./deceptive-opinion.csv") 

head(reviews)
##    deceptive  hotel polarity      source
## 1:  truthful conrad positive TripAdvisor
## 2:  truthful  hyatt positive TripAdvisor
## 3:  truthful  hyatt positive TripAdvisor
## 4:  truthful   omni positive TripAdvisor
## 5:  truthful  hyatt positive TripAdvisor
## 6:  truthful   omni positive TripAdvisor
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 text
## 1:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     We stayed for a one night getaway with family on a thursday. Triple AAA rate of 173 was a steal. 7th floor room complete with 44in plasma TV bose stereo, voss and evian water, and gorgeous bathroom(no tub but was fine for us) Concierge was very helpful. You cannot beat this location... Only flaw was breakfast was pricey and service was very very slow(2hours for four kids and four adults on a friday morning) even though there were only two other tables in the restaurant. Food was very good so it was worth the wait. I would return in a heartbeat. A gem in chicago... \n
## 2:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Triple A rate with upgrade to view room was less than $200 which also included breakfast vouchers. Had a great view of river, lake, Wrigley Bldg. & Tribune Bldg. Most major restaurants, Shopping, Sightseeing attractions within walking distance. Large room with a very comfortable bed. \n
## 3: This comes a little late as I'm finally catching up on my reviews from the past several months:) A dear friend and I stayed at the Hyatt Regency in late October 2007 for one night while visiting a friend and her husband from out of town. This hotel is perfect, IMO. Easy check in and check out. Lovely, clean, comfortable rooms with great views of the city. I know this area pretty well and it's very convenient to many downtown Chicago attractions. We had dinner and went clubing with our friends around Division St.. We had no problems getting cabs back and forth to the Hyatt and there's even public transportation right near by but we didn't bother since we only needed cabs from and to the hotel. Parking, as is usual for Chicago, was expensive but we were able to get our car out quickly (however, we left on a Sunday morning, not exactly a high traffic time although it was a Bears homegame day, so a bit busier than usual I would think). No problems at all and the best part is that we got a rate of $100 through Hotwire, a downright steal for this area of Chicago and the quality of the hotel. \n
## 4:                                                                                                                                                                                                                                                                                                                                                                                                              The Omni Chicago really delivers on all fronts, from the spaciousness of the rooms to the helpful staff to the prized location on Michigan Avenue. While this address in Chicago requires a high level of quality, the Omni delivers. Check in for myself and a whole group of people with me was under 3 minutes, the staff had plentiful recommendations for dining and events, and the rooms are some of the largest you'll find at this price range in Chicago. Even the 'standard' room has a separate living area and work desk. The fitness center has free weights, weight machines, and two rows of cardio equipment. I shared the room with 7 others and did not feel cramped in any way! All in all, a great property! \n
## 5:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 I asked for a high floor away from the elevator and that is what I got. The room was pleasantly decorated, functional and very clean. I didn't need a whole lot of service but when I did they were pleasant and prompt. I used the fitness center which was well equipped and everything was in working order. It is in a great location at one end of the Michigan Avenue shopping district. \n
## 6:                                                              I stayed at the Omni for one night following a business meeting at another downtown Chicago hotel. I was completely impressed by the service; all personnel during my stay were absolutely outstanding. I checked in quite early (no problem) and was efficiently checked in. My room had a somewhat heavy scent of air freshener (the ONLY negative from the entire stay), which was managed reasonably well by opening the windows. I don't generally require much during my hotel stays, but suffice to say the doorman, housekeeping, the night manager and bartender at 676, the day waiter at 676, and the concierge were amazing - I never waited more than about 30 seconds for anything. The room was very comfy and the amenities were superior. One very tiny complaint - there was no wastebasket near the sink or near the wet bar - had to walk to the other end of the bathroom or sitting room to dispose of kleenex/coffee paraphernalia. One wastebasket would make all the difference. All that said - what a great hotel!! Thanks, Omni - I had a great stay! \n

簡單的描述統計

正負評論各800筆

table(reviews$deceptive)
## 
## deceptive  truthful 
##       800       800

人工標記的情緒 分別800筆

table(reviews$polarity)
## 
## negative positive 
##      800      800

評論的來源

table(reviews$source)
## 
##       MTurk TripAdvisor         Web 
##         800         400         400

正負情緒評論也分別有一半是真的 一半是造假的

table(reviews$deceptive, reviews$polarity)
##            
##             negative positive
##   deceptive      400      400
##   truthful       400      400

來源的部分假評論全來自MTurk(土耳其機器人) 其他來源都是真的評論

table(reviews$deceptive, reviews$source)
##            
##             MTurk TripAdvisor Web
##   deceptive   800           0   0
##   truthful      0         400 400

機器人產出分別400筆正負評論資料 而web都是負面評論 tripadvisor都是正面評論

table(reviews$polarity, reviews$source)
##           
##            MTurk TripAdvisor Web
##   negative   400           0 400
##   positive   400         400   0

20間飯店 一間各80筆

table(reviews$hotel)
## 
##          affinia          allegro           amalfi       ambassador 
##               80               80               80               80 
##           conrad         fairmont         hardrock           hilton 
##               80               80               80               80 
##         homewood            hyatt intercontinental            james 
##               80               80               80               80 
##    knickerbocker           monaco             omni           palmer 
##               80               80               80               80 
##         sheraton          sofitel        swissotel          talbott 
##               80               80               80               80

1600筆評論中 最長有784個字 最短25字 平均150字左右

reviews %>% 
  mutate(id = c(1:nrow(.))) %>% 
  unnest_tokens(word, text) %>% 
  group_by(id) %>% 
  summarize(count = n()) %>% 
  select(count) %>% 
  summary()
##      count       
##  Min.   : 25.00  
##  1st Qu.: 88.75  
##  Median :127.50  
##  Mean   :149.31  
##  3rd Qu.:183.00  
##  Max.   :784.00

每個飯店分別有40筆真假的評論

table(reviews$deceptive, reviews$hotel)
##            
##             affinia allegro amalfi ambassador conrad fairmont hardrock
##   deceptive      40      40     40         40     40       40       40
##   truthful       40      40     40         40     40       40       40
##            
##             hilton homewood hyatt intercontinental james knickerbocker
##   deceptive     40       40    40               40    40            40
##   truthful      40       40    40               40    40            40
##            
##             monaco omni palmer sheraton sofitel swissotel talbott
##   deceptive     40   40     40       40      40        40      40
##   truthful      40   40     40       40      40        40      40

每個飯店正負情緒的數量也剛好各佔一半

table(reviews$polarity, reviews$hotel)
##           
##            affinia allegro amalfi ambassador conrad fairmont hardrock
##   negative      40      40     40         40     40       40       40
##   positive      40      40     40         40     40       40       40
##           
##            hilton homewood hyatt intercontinental james knickerbocker
##   negative     40       40    40               40    40            40
##   positive     40       40    40               40    40            40
##           
##            monaco omni palmer sheraton sofitel swissotel talbott
##   negative     40   40     40       40      40        40      40
##   positive     40   40     40       40      40        40      40

word colud

reviews_words <- reviews %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>% 
  count(word) 
## Joining, by = "word"
reviews_words %>% wordcloud2()

因為旅館皆位於芝加哥因此提到最多地名,也可以看到除了提到服務之外,還有一些像是bathroom、desk、bar等描述硬體設施的字占多數

TFIDF

以TFIDF來觀察真假評論常用的字有沒有不同

reviews_words <- reviews %>%
  mutate(id = row_number())%>%
  unnest_tokens(word, text)


deceptive_words <- reviews_words %>%
  count(deceptive, word, sort = T)

deceptive_words <- deceptive_words %>%
  bind_tf_idf(word, deceptive, n) %>%
  arrange(desc(tf_idf))

deceptive_words
## # A tibble: 13,899 x 6
##    deceptive word          n       tf   idf    tf_idf
##    <chr>     <chr>     <int>    <dbl> <dbl>     <dbl>
##  1 truthful  priceline    46 0.000380 0.693 0.000263 
##  2 truthful  25           16 0.000132 0.693 0.0000915
##  3 truthful  hancock      16 0.000132 0.693 0.0000915
##  4 truthful  seasons      16 0.000132 0.693 0.0000915
##  5 truthful  attached     15 0.000124 0.693 0.0000858
##  6 truthful  hotwire      15 0.000124 0.693 0.0000858
##  7 truthful  thru         15 0.000124 0.693 0.0000858
##  8 truthful  3rd          13 0.000107 0.693 0.0000744
##  9 truthful  facing       13 0.000107 0.693 0.0000744
## 10 truthful  subway       13 0.000107 0.693 0.0000744
## # ... with 13,889 more rows
deceptive_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(deceptive) %>% 
  top_n(10) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = deceptive)) +
  geom_col(show.legend = F) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~deceptive, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf

可以看到真假評論用字是有差別的,而比較有趣的是真實的評論都會用一些數字來描述(如25min)

對各個飯店的評論有何差別

hotel_words <- reviews_words %>%
  count(hotel, word, sort = T)

hotel_words <- hotel_words %>%
  bind_tf_idf(word, hotel, n) %>%
  arrange(desc(tf_idf))

hotel_words
## # A tibble: 41,011 x 6
##    hotel         word              n      tf   idf tf_idf
##    <chr>         <chr>         <int>   <dbl> <dbl>  <dbl>
##  1 affinia       affinia          61 0.00546  3.00 0.0164
##  2 amalfi        amalfi           84 0.00685  2.30 0.0158
##  3 talbott       talbott          88 0.00640  2.30 0.0147
##  4 allegro       allegro          77 0.00633  2.30 0.0146
##  5 sofitel       sofitel          67 0.00625  2.30 0.0144
##  6 james         james           105 0.00846  1.61 0.0136
##  7 knickerbocker knickerbocker    65 0.00534  2.30 0.0123
##  8 homewood      homewood         50 0.00404  3.00 0.0121
##  9 ambassador    ambassador       78 0.00629  1.90 0.0119
## 10 omni          omni             95 0.00809  1.39 0.0112
## # ... with 41,001 more rows

各hotel的tf-idf

hotel_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(hotel) %>% 
  top_n(5) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = hotel)) +
  geom_col(show.legend = F) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~hotel, ncol = 5, scales = "free") +
  theme(axis.text = element_text(size = 7)) +
  coord_flip()
## Selecting by tf_idf

由上圖的結果可以觀察到,因為旅館名稱數量最多。但旅館的名稱在評論中不具意義,因此將旅館名稱移除再做一次tf-idf

# 有評論的hotel名稱筆誤(ex: affinia打成affina、talbott打成talbot)
#修正hotel名稱
hotel_name = c("affinia","allegro","amalfi","ambassador","conrad","fairmont","hardrock","hilton","homewood","hyatt","intercontinental","james","knickerbocker","monaco","omni","palmer","sheraton","sofitel","swissotel","talbott","affina","talbot")

reviews_words %>%
  filter(!word %in% hotel_name) %>% #移除hotel名稱
  count(hotel, word, sort = T) %>% 
  bind_tf_idf(word, hotel, n) %>%
  arrange(desc(tf_idf)) %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(hotel) %>% 
  top_n(5) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = hotel)) +
  geom_col(show.legend = F) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~hotel, ncol = 5, scales = "free") +
  theme(axis.text = element_text(size = 7)) +
  coord_flip()
## Selecting by tf_idf

Bigram + TF-IDF

透過詞組來看評論中重要的內容是什麼

all_bigrams <- reviews %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

all_bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 82,443 x 2
##    bigram         n
##    <chr>      <int>
##  1 in the      1283
##  2 at the      1129
##  3 the hotel   1124
##  4 of the      1075
##  5 and the      923
##  6 the room     896
##  7 i was        802
##  8 to the       741
##  9 it was       716
## 10 this hotel   688
## # ... with 82,433 more rows
bigrams_separated <- all_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts
## # A tibble: 15,054 x 3
##    word1       word2         n
##    <chr>       <chr>     <int>
##  1 front       desk        339
##  2 downtown    chicago     111
##  3 chicago     hotel       110
##  4 customer    service      99
##  5 hard        rock         96
##  6 walking     distance     80
##  7 michigan    ave          76
##  8 highly      recommend    71
##  9 ambassador  east         69
## 10 magnificent mile         67
## # ... with 15,044 more rows

由bigram的內容可以發現:

bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

#bigrams_united

bigram_tf_idf <- bigrams_united %>%
  count(source,deceptive, bigram) %>%
  bind_tf_idf(bigram, source, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## # A tibble: 16,656 x 7
##    source      deceptive bigram                    n      tf   idf  tf_idf
##    <chr>       <chr>     <chr>                 <int>   <dbl> <dbl>   <dbl>
##  1 MTurk       deceptive chicago millennium       32 0.00294 1.10  0.00323
##  2 MTurk       deceptive luxury hotel             28 0.00257 1.10  0.00282
##  3 MTurk       deceptive intercontinental chi…    27 0.00248 1.10  0.00272
##  4 TripAdvisor truthful  2 blocks                  8 0.00138 1.10  0.00151
##  5 TripAdvisor truthful  john hancock              8 0.00138 1.10  0.00151
##  6 MTurk       deceptive allegro chicago          14 0.00128 1.10  0.00141
##  7 MTurk       deceptive monaco chicago           14 0.00128 1.10  0.00141
##  8 TripAdvisor truthful  hancock tower             7 0.00120 1.10  0.00132
##  9 MTurk       deceptive park hotel               13 0.00119 1.10  0.00131
## 10 MTurk       deceptive sheraton chicago         33 0.00303 0.405 0.00123
## # ... with 16,646 more rows
bigram_tf_idf$source <- bigram_tf_idf$source %>% as.factor()

bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>% 
  group_by(source) %>% 
  top_n(10) %>% 
  ungroup() %>%
  ggplot(aes(bigram, tf_idf, fill = source)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~source, ncol = 3, scales = "free") +
  coord_flip()
## Selecting by tf_idf

情緒分析

雖然原本的data裡面已經有標記正負情緒 但我們想知道真人寫的評論和機器人寫的情緒程度上有何不同(機器人使用的文字情緒是否較為強烈?)

reviews_word <- reviews %>%
  mutate(id = row_number())%>%
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>% 
  group_by(deceptive) %>% 
  count(id, word)
## Joining, by = "word"
reviews_word_plot <- reviews_word %>% 
  inner_join(get_sentiments("bing")) %>%
  group_by(id, deceptive,sentiment) %>%
  summarise(sum=n()) %>%
  spread(sentiment, sum, fill = 0) %>%
  mutate(sentiment = positive - negative) %>% 
  mutate(color = ifelse(sentiment < 0, "negative","positive"))
## Joining, by = "word"
reviews_word_plot %>%
  ggplot(aes(x= id,y=sentiment)) +
  geom_col(aes(fill = deceptive))

上圖所顯示的結果看起來沒有很明顯的差距 因此以ANOVA來檢驗看看真假評論情緒程度上有沒有顯著差距 Pr(>F)遠大於0.05 無法拒絕H0 因此各組的平均值間沒有明顯差異

anova <- aov(reviews_word_plot$sentiment~as.character(reviews_word_plot$deceptive))
summary(anova)
##                                             Df Sum Sq Mean Sq F value
## as.character(reviews_word_plot$deceptive)    1      6   6.126   0.228
## Residuals                                 1598  42970  26.890        
##                                           Pr(>F)
## as.character(reviews_word_plot$deceptive)  0.633
## Residuals

LDA

dtm資料前處理

corpus = Corpus(VectorSource(reviews$text)) 
##轉為小寫
corpus = tm_map(corpus, content_transformer(tolower))
##移除標點
corpus = tm_map(corpus, removePunctuation)
##去除贅字
corpus = tm_map(corpus, removeWords, c("hotel", stopwords("english")))
##字根還原
corpus = tm_map(corpus, stemDocument)

轉成dtm格式

### 文件字詞矩陣 (字頻表,DTM)
##建立文件字詞矩陣 (Document Term Matrix)
frequencies = DocumentTermMatrix(corpus)
# Look at matrix 
#findFreqTerms(frequencies, lowfreq=10) #出現次數超過10以上的字顯現出來

##### 移除頻率太低的字詞
sparse = removeSparseTerms(frequencies, 0.995) 
sparse
## <<DocumentTermMatrix (documents: 1600, terms: 1444)>>
## Non-/sparse entries: 82356/2228044
## Sparsity           : 96%
## Maximal term length: 14
## Weighting          : term frequency (tf)
##### 轉成資料框
# Convert to a data frame把矩陣轉成資料框
review_Sparse = as.data.frame(as.matrix(sparse))

#合法取出欄位
colnames(review_Sparse) = make.names(colnames(review_Sparse))

Topic Model

reviews_lda <- LDA(frequencies, k = 2, control = list(seed = 1234))
reviews_lda
## A LDA_VEM topic model with 2 topics.

看每個topic的字的分布

library(tidytext)

reviews_topics <- tidy(reviews_lda, matrix = "beta") 
reviews_topics
## # A tibble: 14,744 x 3
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     1 173   0.0000101 
##  2     2 173   0.00000779
##  3     1 44in  0.0000115 
##  4     2 44in  0.00000639
##  5     1 7th   0.0000699 
##  6     2 7th   0.0000195 
##  7     1 aaa   0.0000540 
##  8     2 aaa   0.0000710 
##  9     1 adult 0.0000830 
## 10     2 adult 0.000203  
## # ... with 14,734 more rows

畫圖看兩個topic Beta值最高的字 在沒有去除stopwords的情況下得到很差的結果

reviews_top_terms <- reviews_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

reviews_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  labs(x = NULL, y = "Beta")+
  coord_flip()

去除stop_word重新做一次topic model

remove_words<-c("hotel","room","stay","chicago")
custom_topic<-reviews_topics %>%
  filter(! term %in% remove_words) %>% 
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

custom_topic %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

以gamma值來檢驗LDA在旅館評論上分析主題的效果(每則評論屬於各topic的機率)

reviews_gamma <- tidy(reviews_lda, matrix = "gamma")
reviews_gamma <- reviews_gamma %>% 
  spread(topic, gamma) 

reviews_gamma
## # A tibble: 1,600 x 3
##    document   `1`   `2`
##    <chr>    <dbl> <dbl>
##  1 1        0.512 0.488
##  2 10       0.490 0.510
##  3 100      0.503 0.497
##  4 1000     0.510 0.490
##  5 1001     0.483 0.517
##  6 1002     0.473 0.527
##  7 1003     0.510 0.490
##  8 1004     0.502 0.498
##  9 1005     0.504 0.496
## 10 1006     0.498 0.502
## # ... with 1,590 more rows

由結果可以看到,兩個主題的機率幾乎各是0.5,所以整體而言,每個評論中,兩個主題的gamma值太接近,所以這些旅館的評論做LDA效果皆不慎理想

判斷評論真偽

CART classification tree

# 新增 dependent variable欄位做分類模型
review_Sparse$class = reviews$deceptive 


library(caTools)
set.seed(123)
split = sample.split(review_Sparse$class, SplitRatio = 0.7)

TR = subset(review_Sparse, split==TRUE)
TS = subset(review_Sparse, split==FALSE)

table(TR$class)
## 
## deceptive  truthful 
##       560       560

建立CART模型

library(rpart)
library(rpart.plot)
trialCART = rpart(class~., data=TS, method="class")
rpart.plot(trialCART,cex=0.6)

predict CART model

pred = predict(trialCART, TS, type = "class")  # predict classes
x = table(actual = TS$class, pred); x 
##            pred
## actual      deceptive truthful
##   deceptive       203       37
##   truthful         43      197

正確率

sum(diag(x))/sum(x)
## [1] 0.8333333

ROC & AUC

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
PredictROC = predict(trialCART, TS)              
head(PredictROC)
##    deceptive  truthful
## 2  0.0000000 1.0000000
## 4  0.8272727 0.1727273
## 5  0.1875000 0.8125000
## 8  0.1875000 0.8125000
## 11 0.8059701 0.1940299
## 16 0.2727273 0.7272727
pred = predict(trialCART, TS)[,2]             
colAUC(pred, TS$class, T) 

##                             [,1]
## deceptive vs. truthful 0.8672569
#AUC= 0.8672569

結論

  1. 主題分析在旅館評論無法有效地判斷出主題,TF-IDF分析在此有較佳的結果。
  2. 將去除stopwords並做前處理後的評論文字之詞頻作為屬性放進分類樹做預測,就算不使用複雜的模型,也能達到不錯的預測效果。