library(readr)
library(tm)
## Loading required package: NLP
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(tidytext)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(tidyr)
library(topicmodels)
library(LDAvis)
library(wordcloud2)
library(webshot)
library(htmlwidgets)
library(servr)
library(SnowballC)
reviews <- fread("./deceptive-opinion.csv")
head(reviews)
## deceptive hotel polarity source
## 1: truthful conrad positive TripAdvisor
## 2: truthful hyatt positive TripAdvisor
## 3: truthful hyatt positive TripAdvisor
## 4: truthful omni positive TripAdvisor
## 5: truthful hyatt positive TripAdvisor
## 6: truthful omni positive TripAdvisor
## text
## 1: We stayed for a one night getaway with family on a thursday. Triple AAA rate of 173 was a steal. 7th floor room complete with 44in plasma TV bose stereo, voss and evian water, and gorgeous bathroom(no tub but was fine for us) Concierge was very helpful. You cannot beat this location... Only flaw was breakfast was pricey and service was very very slow(2hours for four kids and four adults on a friday morning) even though there were only two other tables in the restaurant. Food was very good so it was worth the wait. I would return in a heartbeat. A gem in chicago... \n
## 2: Triple A rate with upgrade to view room was less than $200 which also included breakfast vouchers. Had a great view of river, lake, Wrigley Bldg. & Tribune Bldg. Most major restaurants, Shopping, Sightseeing attractions within walking distance. Large room with a very comfortable bed. \n
## 3: This comes a little late as I'm finally catching up on my reviews from the past several months:) A dear friend and I stayed at the Hyatt Regency in late October 2007 for one night while visiting a friend and her husband from out of town. This hotel is perfect, IMO. Easy check in and check out. Lovely, clean, comfortable rooms with great views of the city. I know this area pretty well and it's very convenient to many downtown Chicago attractions. We had dinner and went clubing with our friends around Division St.. We had no problems getting cabs back and forth to the Hyatt and there's even public transportation right near by but we didn't bother since we only needed cabs from and to the hotel. Parking, as is usual for Chicago, was expensive but we were able to get our car out quickly (however, we left on a Sunday morning, not exactly a high traffic time although it was a Bears homegame day, so a bit busier than usual I would think). No problems at all and the best part is that we got a rate of $100 through Hotwire, a downright steal for this area of Chicago and the quality of the hotel. \n
## 4: The Omni Chicago really delivers on all fronts, from the spaciousness of the rooms to the helpful staff to the prized location on Michigan Avenue. While this address in Chicago requires a high level of quality, the Omni delivers. Check in for myself and a whole group of people with me was under 3 minutes, the staff had plentiful recommendations for dining and events, and the rooms are some of the largest you'll find at this price range in Chicago. Even the 'standard' room has a separate living area and work desk. The fitness center has free weights, weight machines, and two rows of cardio equipment. I shared the room with 7 others and did not feel cramped in any way! All in all, a great property! \n
## 5: I asked for a high floor away from the elevator and that is what I got. The room was pleasantly decorated, functional and very clean. I didn't need a whole lot of service but when I did they were pleasant and prompt. I used the fitness center which was well equipped and everything was in working order. It is in a great location at one end of the Michigan Avenue shopping district. \n
## 6: I stayed at the Omni for one night following a business meeting at another downtown Chicago hotel. I was completely impressed by the service; all personnel during my stay were absolutely outstanding. I checked in quite early (no problem) and was efficiently checked in. My room had a somewhat heavy scent of air freshener (the ONLY negative from the entire stay), which was managed reasonably well by opening the windows. I don't generally require much during my hotel stays, but suffice to say the doorman, housekeeping, the night manager and bartender at 676, the day waiter at 676, and the concierge were amazing - I never waited more than about 30 seconds for anything. The room was very comfy and the amenities were superior. One very tiny complaint - there was no wastebasket near the sink or near the wet bar - had to walk to the other end of the bathroom or sitting room to dispose of kleenex/coffee paraphernalia. One wastebasket would make all the difference. All that said - what a great hotel!! Thanks, Omni - I had a great stay! \n
正負評論各800筆
table(reviews$deceptive)
##
## deceptive truthful
## 800 800
人工標記的情緒 分別800筆
table(reviews$polarity)
##
## negative positive
## 800 800
評論的來源
table(reviews$source)
##
## MTurk TripAdvisor Web
## 800 400 400
正負情緒評論也分別有一半是真的 一半是造假的
table(reviews$deceptive, reviews$polarity)
##
## negative positive
## deceptive 400 400
## truthful 400 400
來源的部分假評論全來自MTurk(土耳其機器人) 其他來源都是真的評論
table(reviews$deceptive, reviews$source)
##
## MTurk TripAdvisor Web
## deceptive 800 0 0
## truthful 0 400 400
機器人產出分別400筆正負評論資料 而web都是負面評論 tripadvisor都是正面評論
table(reviews$polarity, reviews$source)
##
## MTurk TripAdvisor Web
## negative 400 0 400
## positive 400 400 0
20間飯店 一間各80筆
table(reviews$hotel)
##
## affinia allegro amalfi ambassador
## 80 80 80 80
## conrad fairmont hardrock hilton
## 80 80 80 80
## homewood hyatt intercontinental james
## 80 80 80 80
## knickerbocker monaco omni palmer
## 80 80 80 80
## sheraton sofitel swissotel talbott
## 80 80 80 80
1600筆評論中 最長有784個字 最短25字 平均150字左右
reviews %>%
mutate(id = c(1:nrow(.))) %>%
unnest_tokens(word, text) %>%
group_by(id) %>%
summarize(count = n()) %>%
select(count) %>%
summary()
## count
## Min. : 25.00
## 1st Qu.: 88.75
## Median :127.50
## Mean :149.31
## 3rd Qu.:183.00
## Max. :784.00
每個飯店分別有40筆真假的評論
table(reviews$deceptive, reviews$hotel)
##
## affinia allegro amalfi ambassador conrad fairmont hardrock
## deceptive 40 40 40 40 40 40 40
## truthful 40 40 40 40 40 40 40
##
## hilton homewood hyatt intercontinental james knickerbocker
## deceptive 40 40 40 40 40 40
## truthful 40 40 40 40 40 40
##
## monaco omni palmer sheraton sofitel swissotel talbott
## deceptive 40 40 40 40 40 40 40
## truthful 40 40 40 40 40 40 40
每個飯店正負情緒的數量也剛好各佔一半
table(reviews$polarity, reviews$hotel)
##
## affinia allegro amalfi ambassador conrad fairmont hardrock
## negative 40 40 40 40 40 40 40
## positive 40 40 40 40 40 40 40
##
## hilton homewood hyatt intercontinental james knickerbocker
## negative 40 40 40 40 40 40
## positive 40 40 40 40 40 40
##
## monaco omni palmer sheraton sofitel swissotel talbott
## negative 40 40 40 40 40 40 40
## positive 40 40 40 40 40 40 40
reviews_words <- reviews %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word)
## Joining, by = "word"
reviews_words %>% wordcloud2()
因為旅館皆位於芝加哥因此提到最多地名,也可以看到除了提到服務之外,還有一些像是bathroom、desk、bar等描述硬體設施的字占多數
以TFIDF來觀察真假評論常用的字有沒有不同
reviews_words <- reviews %>%
mutate(id = row_number())%>%
unnest_tokens(word, text)
deceptive_words <- reviews_words %>%
count(deceptive, word, sort = T)
deceptive_words <- deceptive_words %>%
bind_tf_idf(word, deceptive, n) %>%
arrange(desc(tf_idf))
deceptive_words
## # A tibble: 13,899 x 6
## deceptive word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 truthful priceline 46 0.000380 0.693 0.000263
## 2 truthful 25 16 0.000132 0.693 0.0000915
## 3 truthful hancock 16 0.000132 0.693 0.0000915
## 4 truthful seasons 16 0.000132 0.693 0.0000915
## 5 truthful attached 15 0.000124 0.693 0.0000858
## 6 truthful hotwire 15 0.000124 0.693 0.0000858
## 7 truthful thru 15 0.000124 0.693 0.0000858
## 8 truthful 3rd 13 0.000107 0.693 0.0000744
## 9 truthful facing 13 0.000107 0.693 0.0000744
## 10 truthful subway 13 0.000107 0.693 0.0000744
## # ... with 13,889 more rows
deceptive_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(deceptive) %>%
top_n(10) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = deceptive)) +
geom_col(show.legend = F) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~deceptive, ncol = 2, scales = "free") +
coord_flip()
## Selecting by tf_idf
可以看到真假評論用字是有差別的,而比較有趣的是真實的評論都會用一些數字來描述(如25min)
hotel_words <- reviews_words %>%
count(hotel, word, sort = T)
hotel_words <- hotel_words %>%
bind_tf_idf(word, hotel, n) %>%
arrange(desc(tf_idf))
hotel_words
## # A tibble: 41,011 x 6
## hotel word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 affinia affinia 61 0.00546 3.00 0.0164
## 2 amalfi amalfi 84 0.00685 2.30 0.0158
## 3 talbott talbott 88 0.00640 2.30 0.0147
## 4 allegro allegro 77 0.00633 2.30 0.0146
## 5 sofitel sofitel 67 0.00625 2.30 0.0144
## 6 james james 105 0.00846 1.61 0.0136
## 7 knickerbocker knickerbocker 65 0.00534 2.30 0.0123
## 8 homewood homewood 50 0.00404 3.00 0.0121
## 9 ambassador ambassador 78 0.00629 1.90 0.0119
## 10 omni omni 95 0.00809 1.39 0.0112
## # ... with 41,001 more rows
各hotel的tf-idf
hotel_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(hotel) %>%
top_n(5) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = hotel)) +
geom_col(show.legend = F) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~hotel, ncol = 5, scales = "free") +
theme(axis.text = element_text(size = 7)) +
coord_flip()
## Selecting by tf_idf
由上圖的結果可以觀察到,因為旅館名稱數量最多。但旅館的名稱在評論中不具意義,因此將旅館名稱移除再做一次tf-idf
# 有評論的hotel名稱筆誤(ex: affinia打成affina、talbott打成talbot)
#修正hotel名稱
hotel_name = c("affinia","allegro","amalfi","ambassador","conrad","fairmont","hardrock","hilton","homewood","hyatt","intercontinental","james","knickerbocker","monaco","omni","palmer","sheraton","sofitel","swissotel","talbott","affina","talbot")
reviews_words %>%
filter(!word %in% hotel_name) %>% #移除hotel名稱
count(hotel, word, sort = T) %>%
bind_tf_idf(word, hotel, n) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(hotel) %>%
top_n(5) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = hotel)) +
geom_col(show.legend = F) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~hotel, ncol = 5, scales = "free") +
theme(axis.text = element_text(size = 7)) +
coord_flip()
## Selecting by tf_idf
透過詞組來看評論中重要的內容是什麼
all_bigrams <- reviews %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
all_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 82,443 x 2
## bigram n
## <chr> <int>
## 1 in the 1283
## 2 at the 1129
## 3 the hotel 1124
## 4 of the 1075
## 5 and the 923
## 6 the room 896
## 7 i was 802
## 8 to the 741
## 9 it was 716
## 10 this hotel 688
## # ... with 82,433 more rows
bigrams_separated <- all_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
## # A tibble: 15,054 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 front desk 339
## 2 downtown chicago 111
## 3 chicago hotel 110
## 4 customer service 99
## 5 hard rock 96
## 6 walking distance 80
## 7 michigan ave 76
## 8 highly recommend 71
## 9 ambassador east 69
## 10 magnificent mile 67
## # ... with 15,044 more rows
由bigram的內容可以發現:
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
#bigrams_united
bigram_tf_idf <- bigrams_united %>%
count(source,deceptive, bigram) %>%
bind_tf_idf(bigram, source, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf
## # A tibble: 16,656 x 7
## source deceptive bigram n tf idf tf_idf
## <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 MTurk deceptive chicago millennium 32 0.00294 1.10 0.00323
## 2 MTurk deceptive luxury hotel 28 0.00257 1.10 0.00282
## 3 MTurk deceptive intercontinental chi… 27 0.00248 1.10 0.00272
## 4 TripAdvisor truthful 2 blocks 8 0.00138 1.10 0.00151
## 5 TripAdvisor truthful john hancock 8 0.00138 1.10 0.00151
## 6 MTurk deceptive allegro chicago 14 0.00128 1.10 0.00141
## 7 MTurk deceptive monaco chicago 14 0.00128 1.10 0.00141
## 8 TripAdvisor truthful hancock tower 7 0.00120 1.10 0.00132
## 9 MTurk deceptive park hotel 13 0.00119 1.10 0.00131
## 10 MTurk deceptive sheraton chicago 33 0.00303 0.405 0.00123
## # ... with 16,646 more rows
bigram_tf_idf$source <- bigram_tf_idf$source %>% as.factor()
bigram_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
group_by(source) %>%
top_n(10) %>%
ungroup() %>%
ggplot(aes(bigram, tf_idf, fill = source)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~source, ncol = 3, scales = "free") +
coord_flip()
## Selecting by tf_idf
雖然原本的data裡面已經有標記正負情緒 但我們想知道真人寫的評論和機器人寫的情緒程度上有何不同(機器人使用的文字情緒是否較為強烈?)
reviews_word <- reviews %>%
mutate(id = row_number())%>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
group_by(deceptive) %>%
count(id, word)
## Joining, by = "word"
reviews_word_plot <- reviews_word %>%
inner_join(get_sentiments("bing")) %>%
group_by(id, deceptive,sentiment) %>%
summarise(sum=n()) %>%
spread(sentiment, sum, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
mutate(color = ifelse(sentiment < 0, "negative","positive"))
## Joining, by = "word"
reviews_word_plot %>%
ggplot(aes(x= id,y=sentiment)) +
geom_col(aes(fill = deceptive))
上圖所顯示的結果看起來沒有很明顯的差距 因此以ANOVA來檢驗看看真假評論情緒程度上有沒有顯著差距 Pr(>F)遠大於0.05 無法拒絕H0 因此各組的平均值間沒有明顯差異
anova <- aov(reviews_word_plot$sentiment~as.character(reviews_word_plot$deceptive))
summary(anova)
## Df Sum Sq Mean Sq F value
## as.character(reviews_word_plot$deceptive) 1 6 6.126 0.228
## Residuals 1598 42970 26.890
## Pr(>F)
## as.character(reviews_word_plot$deceptive) 0.633
## Residuals
dtm資料前處理
corpus = Corpus(VectorSource(reviews$text))
##轉為小寫
corpus = tm_map(corpus, content_transformer(tolower))
##移除標點
corpus = tm_map(corpus, removePunctuation)
##去除贅字
corpus = tm_map(corpus, removeWords, c("hotel", stopwords("english")))
##字根還原
corpus = tm_map(corpus, stemDocument)
轉成dtm格式
### 文件字詞矩陣 (字頻表,DTM)
##建立文件字詞矩陣 (Document Term Matrix)
frequencies = DocumentTermMatrix(corpus)
# Look at matrix
#findFreqTerms(frequencies, lowfreq=10) #出現次數超過10以上的字顯現出來
##### 移除頻率太低的字詞
sparse = removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1600, terms: 1444)>>
## Non-/sparse entries: 82356/2228044
## Sparsity : 96%
## Maximal term length: 14
## Weighting : term frequency (tf)
##### 轉成資料框
# Convert to a data frame把矩陣轉成資料框
review_Sparse = as.data.frame(as.matrix(sparse))
#合法取出欄位
colnames(review_Sparse) = make.names(colnames(review_Sparse))
Topic Model
reviews_lda <- LDA(frequencies, k = 2, control = list(seed = 1234))
reviews_lda
## A LDA_VEM topic model with 2 topics.
library(tidytext)
reviews_topics <- tidy(reviews_lda, matrix = "beta")
reviews_topics
## # A tibble: 14,744 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 173 0.0000101
## 2 2 173 0.00000779
## 3 1 44in 0.0000115
## 4 2 44in 0.00000639
## 5 1 7th 0.0000699
## 6 2 7th 0.0000195
## 7 1 aaa 0.0000540
## 8 2 aaa 0.0000710
## 9 1 adult 0.0000830
## 10 2 adult 0.000203
## # ... with 14,734 more rows
畫圖看兩個topic Beta值最高的字 在沒有去除stopwords的情況下得到很差的結果
reviews_top_terms <- reviews_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
reviews_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
labs(x = NULL, y = "Beta")+
coord_flip()
remove_words<-c("hotel","room","stay","chicago")
custom_topic<-reviews_topics %>%
filter(! term %in% remove_words) %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
custom_topic %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
reviews_gamma <- tidy(reviews_lda, matrix = "gamma")
reviews_gamma <- reviews_gamma %>%
spread(topic, gamma)
reviews_gamma
## # A tibble: 1,600 x 3
## document `1` `2`
## <chr> <dbl> <dbl>
## 1 1 0.512 0.488
## 2 10 0.490 0.510
## 3 100 0.503 0.497
## 4 1000 0.510 0.490
## 5 1001 0.483 0.517
## 6 1002 0.473 0.527
## 7 1003 0.510 0.490
## 8 1004 0.502 0.498
## 9 1005 0.504 0.496
## 10 1006 0.498 0.502
## # ... with 1,590 more rows
由結果可以看到,兩個主題的機率幾乎各是0.5,所以整體而言,每個評論中,兩個主題的gamma值太接近,所以這些旅館的評論做LDA效果皆不慎理想
# 新增 dependent variable欄位做分類模型
review_Sparse$class = reviews$deceptive
library(caTools)
set.seed(123)
split = sample.split(review_Sparse$class, SplitRatio = 0.7)
TR = subset(review_Sparse, split==TRUE)
TS = subset(review_Sparse, split==FALSE)
table(TR$class)
##
## deceptive truthful
## 560 560
library(rpart)
library(rpart.plot)
trialCART = rpart(class~., data=TS, method="class")
rpart.plot(trialCART,cex=0.6)
pred = predict(trialCART, TS, type = "class") # predict classes
x = table(actual = TS$class, pred); x
## pred
## actual deceptive truthful
## deceptive 203 37
## truthful 43 197
正確率
sum(diag(x))/sum(x)
## [1] 0.8333333
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
PredictROC = predict(trialCART, TS)
head(PredictROC)
## deceptive truthful
## 2 0.0000000 1.0000000
## 4 0.8272727 0.1727273
## 5 0.1875000 0.8125000
## 8 0.1875000 0.8125000
## 11 0.8059701 0.1940299
## 16 0.2727273 0.7272727
pred = predict(trialCART, TS)[,2]
colAUC(pred, TS$class, T)
## [,1]
## deceptive vs. truthful 0.8672569
#AUC= 0.8672569