In This dataset around 20.000 reviews from tripadvisor are presented. In this project following methods will be used: sentimental analysis and topic modeling.
#turn off scientific notation
options(scipen=999)
#upload data
setwd("~/DSBA/year 2/Semester1/Text mining/Text mining project")
review <- readr::read_csv("tripadvisor_hotel_reviews.csv")
head(review)
## # A tibble: 6 x 2
## Review Rating
## <chr> <dbl>
## 1 nice hotel expensive parking got good deal stay hotel anniversary, arr~ 4
## 2 ok nothing special charge diamond member hilton decided chain shot 20t~ 2
## 3 nice rooms not 4* experience hotel monaco seattle good hotel n't 4* le~ 3
## 4 unique, great stay, wonderful time hotel monaco, location excellent sh~ 5
## 5 great stay great stay, went seahawk game awesome, downfall view buildi~ 5
## 6 love monaco staff husband stayed hotel crazy weekend attending memoria~ 5
str(review)
## spec_tbl_df [20,491 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Review: chr [1:20491] "nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous re"| __truncated__ "ok nothing special charge diamond member hilton decided chain shot 20th anniversary seattle, start booked suite"| __truncated__ "nice rooms not 4* experience hotel monaco seattle good hotel n't 4* level.positives large bathroom mediterranea"| __truncated__ "unique, great stay, wonderful time hotel monaco, location excellent short stroll main downtown shopping area, p"| __truncated__ ...
## $ Rating: num [1:20491] 4 2 3 5 5 5 5 4 5 5 ...
## - attr(*, "spec")=
## .. cols(
## .. Review = col_character(),
## .. Rating = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
library(pander)
pandoc.table(review[2:6,],
justify = c('left', 'center'), style = 'grid')
##
##
## +--------------------------------+--------+
## | Review | Rating |
## +================================+========+
## | ok nothing special charge | 2 |
## | diamond member hilton decided | |
## | chain shot 20th anniversary | |
## | seattle, start booked suite | |
## | paid extra website description | |
## | not, suite bedroom bathroom | |
## | standard hotel room, took | |
## | printed reservation desk | |
## | showed said things like tv | |
## | couch ect desk clerk told oh | |
## | mixed suites description | |
## | kimpton website sorry free | |
## | breakfast, got kidding, | |
## | embassy suits sitting room | |
## | bathroom bedroom unlike | |
## | kimpton calls suite, 5 day | |
## | stay offer correct false | |
## | advertising, send kimpton | |
## | preferred guest website email | |
## | asking failure provide suite | |
## | advertised website reservation | |
## | description furnished hard | |
## | copy reservation printout | |
## | website desk manager duty did | |
## | not reply solution, send email | |
## | trip guest survey did not | |
## | follow email mail, guess tell | |
## | concerned guest.the staff | |
## | ranged indifferent not | |
## | helpful, asked desk good | |
## | breakfast spots neighborhood | |
## | hood told no hotels, gee best | |
## | breakfast spots seattle 1/2 | |
## | block away convenient hotel | |
## | does not know exist, arrived | |
## | late night 11 pm inside run | |
## | bellman busy chating cell | |
## | phone help bags.prior arrival | |
## | emailed hotel inform 20th | |
## | anniversary half really picky | |
## | wanted make sure good, got | |
## | nice email saying like deliver | |
## | bottle champagne chocolate | |
## | covered strawberries room | |
## | arrival celebrate, told needed | |
## | foam pillows, arrival no | |
## | champagne strawberries no foam | |
## | pillows great room view alley | |
## | high rise building good not | |
## | better housekeeping staff | |
## | cleaner room property, | |
## | impressed left morning | |
## | shopping room got short trips | |
## | 2 hours, beds comfortable.not | |
## | good ac-heat control 4 x 4 | |
## | inch screen bring green shine | |
## | directly eyes night, light | |
## | sensitive tape controls.this | |
## | not 4 start hotel clean | |
## | business hotel super high | |
## | rates, better chain hotels | |
## | seattle, | |
## +--------------------------------+--------+
## | nice rooms not 4* experience | 3 |
## | hotel monaco seattle good | |
## | hotel n't 4* level.positives | |
## | large bathroom mediterranean | |
## | suite comfortable bed | |
## | pillowsattentive housekeeping | |
## | staffnegatives ac unit | |
## | malfunctioned stay desk | |
## | disorganized, missed 3 | |
## | separate wakeup calls, | |
## | concierge busy hard touch, did | |
## | n't provide guidance special | |
## | requests.tv hard use ipod | |
## | sound dock suite non | |
## | functioning. decided book | |
## | mediterranean suite 3 night | |
## | weekend stay 1st choice rest | |
## | party filled, comparison w | |
## | spent 45 night larger square | |
## | footage room great soaking tub | |
## | whirlpool jets nice | |
## | shower.before stay hotel | |
## | arrange car service price 53 | |
## | tip reasonable driver waiting | |
## | arrival.checkin easy downside | |
## | room picked 2 person jacuzi | |
## | tub no bath accessories salts | |
## | bubble bath did n't stay, | |
## | night got 12/1a checked | |
## | voucher bottle champagne nice | |
## | gesture fish waiting room, | |
## | impression room huge open | |
## | space felt room big, tv far | |
## | away bed chore change channel, | |
## | ipod dock broken | |
## | disappointing.in morning way | |
## | asked desk check thermostat | |
## | said 65f 74 2 degrees warm try | |
## | cover face night bright blue | |
## | light kept, got room night no, | |
## | 1st drop desk, called | |
## | maintainence came look | |
## | thermostat told play settings | |
## | happy digital box wo n't work, | |
## | asked wakeup 10am morning did | |
## | n't happen, called later 6pm | |
## | nap wakeup forgot, 10am wakeup | |
## | morning yep forgotten.the | |
## | bathroom facilities great room | |
## | surprised room sold whirlpool | |
## | bath tub n't bath amenities, | |
## | great relax water jets going, | |
## +--------------------------------+--------+
## | unique, great stay, wonderful | 5 |
## | time hotel monaco, location | |
## | excellent short stroll main | |
## | downtown shopping area, pet | |
## | friendly room showed no signs | |
## | animal hair smells, monaco | |
## | suite sleeping area big | |
## | striped curtains pulled closed | |
## | nice touch felt cosy, goldfish | |
## | named brandi enjoyed, did n't | |
## | partake free wine coffee/tea | |
## | service lobby thought great | |
## | feature, great staff friendly, | |
## | free wireless internet hotel | |
## | worked suite 2 laptops, decor | |
## | lovely eclectic mix pattens | |
## | color palatte, animal print | |
## | bathrobes feel like rock | |
## | stars, nice did n't look like | |
## | sterile chain hotel hotel | |
## | personality excellent stay, | |
## +--------------------------------+--------+
## | great stay great stay, went | 5 |
## | seahawk game awesome, downfall | |
## | view building did n't | |
## | complain, room huge staff | |
## | helpful, booked hotels website | |
## | seahawk package, no charge | |
## | parking got voucher taxi, | |
## | problem taxi driver did n't | |
## | want accept voucher barely | |
## | spoke english, funny thing | |
## | speak arabic called started | |
## | making comments girlfriend | |
## | cell phone buddy, took second | |
## | realize just said fact speak | |
## | language face priceless, ass | |
## | told, said large city, told | |
## | head doorman issue called cab | |
## | company promply answer did | |
## | n't, apologized offered pay | |
## | taxi, bucks 2 miles stadium, | |
## | game plan taxi return going | |
## | humpin, great walk did n't | |
## | mind, right christmas | |
## | wonderful lights, homeless | |
## | stowed away building entrances | |
## | leave, police presence not | |
## | greatest area stadium, | |
## | activities 7 blocks pike | |
## | street waterfront great coffee | |
## | shops way, hotel mantained | |
## | foyer awesome, wine tasting | |
## | available evening, best dog, | |
## | taking st. bernard time | |
## | family, safes hotel located | |
## | service desk room, bathroom | |
## | huge jetted tub huge, funny | |
## | house keeping walked | |
## | girlfriend getting dressed, | |
## | did n't hear knock doing turn | |
## | service, screamed girlfriend | |
## | screams hit floor laughing, | |
## | started talking spanish | |
## | worked, place recommend price, | |
## | check online deals just good | |
## | not better, besite contains | |
## | deals vouchers travel websites | |
## | n't tell, | |
## +--------------------------------+--------+
## | love monaco staff husband | 5 |
## | stayed hotel crazy weekend | |
## | attending memorial service | |
## | best friend husband | |
## | celebrating 12th wedding | |
## | anniversary, talk mixed | |
## | emotions, booked suite hotel | |
## | monte carlos, loaned beautiful | |
## | fan-tanned goldfish named | |
## | joliet weekend visited dogs | |
## | worked desk human companions, | |
## | room decorated nicely couch | |
## | used pillows, l'occitane bath | |
## | amenities welcome sight, room | |
## | quiet peaceful, wireless | |
## | internet access wonderful | |
## | server went morning leaving | |
## | problems printing boarding | |
## | passes, afternoon reception | |
## | serves oenophile-satisfying | |
## | wine australia scrumptious | |
## | cookies, restaurant closed | |
## | renovation stay finally ate | |
## | food good drinks better, word | |
## | caution restaurant larger | |
## | person not sit booths wo n't | |
## | fit, 5'6 125 lbs husband 5'9 | |
## | 175. table smack-against | |
## | stomach couple inches space | |
## | mighty uncomfortable patron | |
## | larger pregnant, bad design | |
## | opinion place decorated funky | |
## | welcoming way metal wood | |
## | handblown glass light fixtures | |
## | expect seattle capital glass | |
## | art industry, definitely stay | |
## | reason, | |
## +--------------------------------+--------+
table(review$Rating)
##
## 1 2 3 4 5
## 1421 1793 2184 6039 9054
library(dplyr)
library(ggplot2)
review %>%
summarize(number_rows = n())
## # A tibble: 1 x 1
## number_rows
## <int>
## 1 20491
review %>%
group_by(Rating) %>%
summarize(number_rows = n())
## # A tibble: 5 x 2
## Rating number_rows
## <dbl> <int>
## 1 1 1421
## 2 2 1793
## 3 3 2184
## 4 4 6039
## 5 5 9054
# We have more reviews for Rating 5 compared to Rating 1 and 2.
review %>%
count(Rating,sort=TRUE) %>%
ggplot(aes(x= Rating, y=n)) +
geom_bar(stat="identity", fill = 'orange') +
geom_text(aes(label=n), vjust=1.4, hjust=1, color="white", size= 5)
library(tidytext)
#tokenization
tidy_review <- review %>%
unnest_tokens(word, Review)
head(tidy_review)
## # A tibble: 6 x 2
## Rating word
## <dbl> <chr>
## 1 4 nice
## 2 4 hotel
## 3 4 expensive
## 4 4 parking
## 5 4 got
## 6 4 good
tidy_review %>%
count(word) %>%
arrange(desc(n))
## # A tibble: 72,806 x 2
## word n
## <chr> <int>
## 1 hotel 49016
## 2 room 34669
## 3 not 31617
## 4 great 21175
## 5 n't 18984
## 6 good 17084
## 7 staff 16260
## 8 stay 15176
## 9 did 13904
## 10 just 12616
## # ... with 72,796 more rows
tidy_review2 <- review %>%
unnest_tokens(word, Review) %>%
anti_join(stop_words)
## Joining, by = "word"
tidy_review2
## # A tibble: 1,704,496 x 2
## Rating word
## <dbl> <chr>
## 1 4 nice
## 2 4 hotel
## 3 4 expensive
## 4 4 parking
## 5 4 deal
## 6 4 stay
## 7 4 hotel
## 8 4 anniversary
## 9 4 arrived
## 10 4 late
## # ... with 1,704,486 more rows
tidy_review2 %>%
count(word) %>%
arrange(desc(n))
## # A tibble: 72,203 x 2
## word n
## <chr> <int>
## 1 hotel 49016
## 2 n't 18984
## 3 staff 16260
## 4 stay 15176
## 5 nice 12460
## 6 location 11095
## 7 stayed 10467
## 8 service 10088
## 9 time 9901
## 10 night 9899
## # ... with 72,193 more rows
# We still see that the most frequent words are "hotel" and "n't". In the next step mentioned words will be removed from our data.
library(stopwords)
library(ngram)
#creating cusotm stop words
custom_stop_words <- tibble(word = c("hotel", "room", "hotels", "1", "2", "3", "4", "5", "stay", "stayed", "restaurants","6", "30", "15", "20", "7", "la", "9", "10", "13", "16", "it__ç_é_", "don__ç_é_", "4th", "ac","100", "200", '12' , '27', 'n\'t', "wow", "5th", "1st","18", "2006", "2008", "dr", "ubid", "pike", "45", "14", "itc", "ike", "mtr", "mithila", "sarento", "marcial", "manggis", "30hk", "16yr", "matteo", "mocenigo", "bahn", stopwords("en")))
tidy_review2 = tidy_review2 %>%
anti_join(custom_stop_words, by = c("word" = "word"))
#tidy_review2 = tidy_review %>%
# mutate(word = wordStem(word)) # for this method stemming cuts too many words and they are loosing their meaning that's why it was decided to continue analysis without it
tidy_review2 %>%
count(word) %>%
arrange(desc(n))
## # A tibble: 72,151 x 2
## word n
## <chr> <int>
## 1 staff 16260
## 2 nice 12460
## 3 location 11095
## 4 service 10088
## 5 time 9901
## 6 night 9899
## 7 beach 9765
## 8 day 9680
## 9 clean 9407
## 10 breakfast 9336
## # ... with 72,141 more rows
#tidy_review2 = removeNumbers(tidy_review2$stem)
tidy_review2 %>%
count(word, sort = TRUE) %>%
filter(n > 3000) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = word)) +
geom_col(show.legend=FALSE) +
xlab(NULL) +
coord_flip() +
ggtitle("Review Word Counts")
The most frequent words are staff, nice and location.
library(forcats)
sentiment_review <- tidy_review2 %>%
inner_join(get_sentiments("bing"))
# negative and positive sentiments
sentiment_review %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 77871
## 2 positive 183416
# the most often words for a given sentiment
sentiment_review %>%
count(word, sentiment) %>%
arrange(desc(n))
## # A tibble: 3,537 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 nice positive 12460
## 2 clean positive 9407
## 3 friendly positive 6764
## 4 excellent positive 6037
## 5 helpful positive 5477
## 6 recommend positive 4850
## 7 beautiful positive 4675
## 8 comfortable positive 4429
## 9 free positive 4111
## 10 wonderful positive 4053
## # ... with 3,527 more rows
counts_sentiment <- sentiment_review %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word2 = fct_reorder(word, n))
# visualization
ggplot(counts_sentiment, aes(x=word2, y=n, fill=sentiment)) +
geom_col(show.legend=FALSE) +
scale_fill_manual(values = c("Red", "Darkgreen")) +
facet_wrap(~sentiment, scales="free") +
coord_flip() +
labs(title = "Sentiment Word Counts",x = "Words") +
ylim(0, 15000)
The most frequent negative words are bad, noise, expensive, hard, cold …
The most frequently appeared positive word in the review are following: nice, clean, friendly, excellent, helpful …
library(tidyr)
sentiment_rating <- tidy_review2 %>%
inner_join(get_sentiments("bing")) %>%
count(Rating, sentiment) %>%
spread(sentiment, n) %>%
mutate(overall = positive - negative,
Stars = fct_reorder(as.factor(Rating), overall))
ggplot(sentiment_rating, aes(x=Rating, y=overall, fill=as.factor(Rating))) +
geom_col(show.legend=FALSE) +
coord_flip() +
scale_fill_manual(values = c("Red","Red", "Darkgreen", "Darkgreen", "Darkgreen")) +
labs(title = "Overall Sentiment by Rating",
x = "Rating",
y = "Overall Sentiment")
As it is expected for the low rating our overall score is below 0 and the higher the rating the higher overall score we can observe.
library(wordcloud)
library(reshape2)
tidy_review2 %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100, colors = "#ec9a1c"))
sentiment_review%>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "darkgreen"),
max.words = 100)
Negative part of the wordclud is represented by the following most frequent words: noise, expensive, bad, cold, dissapointed. As an example positive part of the word cloud has next words: great, good, nice, clean, fun, friendly, helpful, beautiful.
library(tidyverse)
library(tm)
# combining data for modeling
tidy_review_modeling =
tidy_review2 %>%
group_by(Rating) %>%
summarise(text = str_c(word, collapse = " ")) %>%
ungroup()
# creation corpus
Corpus <- Corpus(VectorSource(tidy_review_modeling$text))
DTM <- DocumentTermMatrix(Corpus)
Due to the big amount of data, it was decided to divide it into sub samples.
Below code is commented as it is very time consuming. Results of the procedures are paste to the report in a form of image.
# Corpus_K1 <- Corpus(VectorSource(tidy_review_modeling$text[c(1,3)])) # make a corpus object
# DTM_K1 <- DocumentTermMatrix(Corpus_K1)
# library(topicmodels)
# library(ldatuning)
# result_1 <- FindTopicsNumber(
# DTM_K1,
# topics = seq(from = 2, to = 15, by = 1),
# metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
# method = "Gibbs",
# control = list(seed = 77),
# mc.cores = 2L,
# verbose = TRUE
# )
#FindTopicsNumber_plot(result_1)
Figure 1. Number of Topics
Based on the first plots the best number of K is 3-5.
# Corpus_K2 <- Corpus(VectorSource(tidy_review_modeling$text[c(2,5)])) # make a corpus object
# DTM_K2 <- DocumentTermMatrix(Corpus_K2)
#
# library(topicmodels)
# library(ldatuning)
# result_2 <- FindTopicsNumber(
# DTM_K2,
# topics = seq(from = 2, to = 15, by = 1),
# metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
# method = "Gibbs",
# control = list(seed = 77),
# mc.cores = 2L,
# verbose = TRUE)
#
# FindTopicsNumber_plot(result_2)
Figure 2. Number of Topics
Plots based on the second sample are showing that we need to take K value equal to 4-5. Therefore we can assume that K should be equal to a value out of the range 3-5. After trying each of the separately it was decided to used K equal to 3, as it is the most informative.
# Corpus_K3 <- Corpus(VectorSource(tidy_review_modeling$text[c(3,4)])) # make a corpus object
# DTM_K3 <- DocumentTermMatrix(Corpus_K3)
#
# library(topicmodels)
# library(ldatuning)
# result_3 <- FindTopicsNumber(
# DTM_K3,
# topics = seq(from = 2, to = 15, by = 1),
# metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
# method = "Gibbs",
# control = list(seed = 77),
# mc.cores = 2L,
# verbose = TRUE)
#
# FindTopicsNumber_plot(result_3)
library(topicmodels)
library(tm)
library(tidytext)
unique_indexes <- unique(DTM$i) # unique index
DTM <- DTM[unique_indexes,]
# LDA modeling
# k - the number of topics that we specified
# beta - the word probabilities to define the topics
lda <- LDA(DTM, k = 3, control = list(seed = 1234))
tidy_topics <- tidy(lda, matrix = "beta")
# stemming
library(SnowballC)
library(dplyr)
tidy_topics <- tidy_topics %>%
mutate(stem = wordStem(term))
library(ggplot2)
tidy_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
labs(x = NULL, y = "Beta") +
coord_flip() +
ylim(0, 0.017) + # sa,e y-axis for all the graphs
scale_fill_manual(values = c("orange","Darkgreen", "deeppink3"))
We can notice that topics 2 and 3 are similar to each other. However, we can identified some similarities and main ideas of each topic. 1 topics is describing good time that customer can have in the hotel and rooms and stuff. 2nd topics describes beach and pool. 3rd topic is about staff, food, and location.
library(dplyr)
library(tidytext)
# counts of words for specific review
review_words = tidy_review2 %>%
count(Rating, word, sort = TRUE)
# total counts of words for all the reviews
total_words = review_words %>%
group_by(Rating) %>%
summarize(total = sum(n))
review_words = left_join(review_words, total_words)
# caclucating td-idf score
tf_idf = review_words %>%
bind_tf_idf(word, Rating, n) %>%
select(-total) %>%
arrange(desc(tf_idf))
# visualization
tf_idf %>%
group_by(Rating) %>%
top_n(5) %>%
ungroup %>%
ggplot(aes(word, tf_idf, fill = as.factor(Rating))) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~Rating, scales = "free") +
coord_flip() +
ylim(0, 0.00009) +
theme(axis.text.x=element_text(angle = -50, hjust = 0)) +
scale_fill_manual(values = c("orange","Darkgreen", "deeppink3", "cornflowerblue", "darkcyan"))
We can say that this supervised learning method did not provide any useful insights regarding our data. The words with higher tf-idf score are common to the for the Review for specific Rating and uncommon across all the reviews.
We can notice that all the reviews are very similar to each other.
Source:
1. source