Data Overview

In This dataset around 20.000 reviews from tripadvisor are presented. In this project following methods will be used: sentimental analysis and topic modeling.

#turn off scientific notation
options(scipen=999)

#upload data
setwd("~/DSBA/year 2/Semester1/Text mining/Text mining project")

review <- readr::read_csv("tripadvisor_hotel_reviews.csv")
head(review)
## # A tibble: 6 x 2
##   Review                                                                  Rating
##   <chr>                                                                    <dbl>
## 1 nice hotel expensive parking got good deal stay hotel anniversary, arr~      4
## 2 ok nothing special charge diamond member hilton decided chain shot 20t~      2
## 3 nice rooms not 4* experience hotel monaco seattle good hotel n't 4* le~      3
## 4 unique, great stay, wonderful time hotel monaco, location excellent sh~      5
## 5 great stay great stay, went seahawk game awesome, downfall view buildi~      5
## 6 love monaco staff husband stayed hotel crazy weekend attending memoria~      5
str(review)
## spec_tbl_df [20,491 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Review: chr [1:20491] "nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous re"| __truncated__ "ok nothing special charge diamond member hilton decided chain shot 20th anniversary seattle, start booked suite"| __truncated__ "nice rooms not 4* experience hotel monaco seattle good hotel n't 4* level.positives large bathroom mediterranea"| __truncated__ "unique, great stay, wonderful time hotel monaco, location excellent short stroll main downtown shopping area, p"| __truncated__ ...
##  $ Rating: num [1:20491] 4 2 3 5 5 5 5 4 5 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Review = col_character(),
##   ..   Rating = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
library(pander)

pandoc.table(review[2:6,], 
             justify = c('left', 'center'), style = 'grid')
## 
## 
## +--------------------------------+--------+
## | Review                         | Rating |
## +================================+========+
## | ok nothing special charge      |   2    |
## | diamond member hilton decided  |        |
## | chain shot 20th anniversary    |        |
## | seattle, start booked suite    |        |
## | paid extra website description |        |
## | not, suite bedroom bathroom    |        |
## | standard hotel room, took      |        |
## | printed reservation desk       |        |
## | showed said things like tv     |        |
## | couch ect desk clerk told oh   |        |
## | mixed suites description       |        |
## | kimpton website sorry free     |        |
## | breakfast, got kidding,        |        |
## | embassy suits sitting room     |        |
## | bathroom bedroom unlike        |        |
## | kimpton calls suite, 5 day     |        |
## | stay offer correct false       |        |
## | advertising, send kimpton      |        |
## | preferred guest website email  |        |
## | asking failure provide suite   |        |
## | advertised website reservation |        |
## | description furnished hard     |        |
## | copy reservation printout      |        |
## | website desk manager duty did  |        |
## | not reply solution, send email |        |
## | trip guest survey did not      |        |
## | follow email mail, guess tell  |        |
## | concerned guest.the staff      |        |
## | ranged indifferent not         |        |
## | helpful, asked desk good       |        |
## | breakfast spots neighborhood   |        |
## | hood told no hotels, gee best  |        |
## | breakfast spots seattle 1/2    |        |
## | block away convenient hotel    |        |
## | does not know exist, arrived   |        |
## | late night 11 pm inside run    |        |
## | bellman busy chating cell      |        |
## | phone help bags.prior arrival  |        |
## | emailed hotel inform 20th      |        |
## | anniversary half really picky  |        |
## | wanted make sure good, got     |        |
## | nice email saying like deliver |        |
## | bottle champagne chocolate     |        |
## | covered strawberries room      |        |
## | arrival celebrate, told needed |        |
## | foam pillows, arrival no       |        |
## | champagne strawberries no foam |        |
## | pillows great room view alley  |        |
## | high rise building good not    |        |
## | better housekeeping staff      |        |
## | cleaner room property,         |        |
## | impressed left morning         |        |
## | shopping room got short trips  |        |
## | 2 hours, beds comfortable.not  |        |
## | good ac-heat control 4 x 4     |        |
## | inch screen bring green shine  |        |
## | directly eyes night, light     |        |
## | sensitive tape controls.this   |        |
## | not 4 start hotel clean        |        |
## | business hotel super high      |        |
## | rates, better chain hotels     |        |
## | seattle,                       |        |
## +--------------------------------+--------+
## | nice rooms not 4* experience   |   3    |
## | hotel monaco seattle good      |        |
## | hotel n't 4* level.positives   |        |
## | large bathroom mediterranean   |        |
## | suite comfortable bed          |        |
## | pillowsattentive housekeeping  |        |
## | staffnegatives ac unit         |        |
## | malfunctioned stay desk        |        |
## | disorganized, missed 3         |        |
## | separate wakeup calls,         |        |
## | concierge busy hard touch, did |        |
## | n't provide guidance special   |        |
## | requests.tv hard use ipod      |        |
## | sound dock suite non           |        |
## | functioning. decided book      |        |
## | mediterranean suite 3 night    |        |
## | weekend stay 1st choice rest   |        |
## | party filled, comparison w     |        |
## | spent 45 night larger square   |        |
## | footage room great soaking tub |        |
## | whirlpool jets nice            |        |
## | shower.before stay hotel       |        |
## | arrange car service price 53   |        |
## | tip reasonable driver waiting  |        |
## | arrival.checkin easy downside  |        |
## | room picked 2 person jacuzi    |        |
## | tub no bath accessories salts  |        |
## | bubble bath did n't stay,      |        |
## | night got 12/1a checked        |        |
## | voucher bottle champagne nice  |        |
## | gesture fish waiting room,     |        |
## | impression room huge open      |        |
## | space felt room big, tv far    |        |
## | away bed chore change channel, |        |
## | ipod dock broken               |        |
## | disappointing.in morning way   |        |
## | asked desk check thermostat    |        |
## | said 65f 74 2 degrees warm try |        |
## | cover face night bright blue   |        |
## | light kept, got room night no, |        |
## | 1st drop desk, called          |        |
## | maintainence came look         |        |
## | thermostat told play settings  |        |
## | happy digital box wo n't work, |        |
## | asked wakeup 10am morning did  |        |
## | n't happen, called later 6pm   |        |
## | nap wakeup forgot, 10am wakeup |        |
## | morning yep forgotten.the      |        |
## | bathroom facilities great room |        |
## | surprised room sold whirlpool  |        |
## | bath tub n't bath amenities,   |        |
## | great relax water jets going,  |        |
## +--------------------------------+--------+
## | unique, great stay, wonderful  |   5    |
## | time hotel monaco, location    |        |
## | excellent short stroll main    |        |
## | downtown shopping area, pet    |        |
## | friendly room showed no signs  |        |
## | animal hair smells, monaco     |        |
## | suite sleeping area big        |        |
## | striped curtains pulled closed |        |
## | nice touch felt cosy, goldfish |        |
## | named brandi enjoyed, did n't  |        |
## | partake free wine coffee/tea   |        |
## | service lobby thought great    |        |
## | feature, great staff friendly, |        |
## | free wireless internet hotel   |        |
## | worked suite 2 laptops, decor  |        |
## | lovely eclectic mix pattens    |        |
## | color palatte, animal print    |        |
## | bathrobes feel like rock       |        |
## | stars, nice did n't look like  |        |
## | sterile chain hotel hotel      |        |
## | personality excellent stay,    |        |
## +--------------------------------+--------+
## | great stay great stay, went    |   5    |
## | seahawk game awesome, downfall |        |
## | view building did n't          |        |
## | complain, room huge staff      |        |
## | helpful, booked hotels website |        |
## | seahawk package, no charge     |        |
## | parking got voucher taxi,      |        |
## | problem taxi driver did n't    |        |
## | want accept voucher barely     |        |
## | spoke english, funny thing     |        |
## | speak arabic called started    |        |
## | making comments girlfriend     |        |
## | cell phone buddy, took second  |        |
## | realize just said fact speak   |        |
## | language face priceless, ass   |        |
## | told, said large city, told    |        |
## | head doorman issue called cab  |        |
## | company promply answer did     |        |
## | n't, apologized offered pay    |        |
## | taxi, bucks 2 miles stadium,   |        |
## | game plan taxi return going    |        |
## | humpin, great walk did n't     |        |
## | mind, right christmas          |        |
## | wonderful lights, homeless     |        |
## | stowed away building entrances |        |
## | leave, police presence not     |        |
## | greatest area stadium,         |        |
## | activities 7 blocks pike       |        |
## | street waterfront great coffee |        |
## | shops way, hotel mantained     |        |
## | foyer awesome, wine tasting    |        |
## | available evening, best dog,   |        |
## | taking st. bernard time        |        |
## | family, safes hotel located    |        |
## | service desk room, bathroom    |        |
## | huge jetted tub huge, funny    |        |
## | house keeping walked           |        |
## | girlfriend getting dressed,    |        |
## | did n't hear knock doing turn  |        |
## | service, screamed girlfriend   |        |
## | screams hit floor laughing,    |        |
## | started talking spanish        |        |
## | worked, place recommend price, |        |
## | check online deals just good   |        |
## | not better, besite contains    |        |
## | deals vouchers travel websites |        |
## | n't tell,                      |        |
## +--------------------------------+--------+
## | love monaco staff husband      |   5    |
## | stayed hotel crazy weekend     |        |
## | attending memorial service     |        |
## | best friend husband            |        |
## | celebrating 12th wedding       |        |
## | anniversary, talk mixed        |        |
## | emotions, booked suite hotel   |        |
## | monte carlos, loaned beautiful |        |
## | fan-tanned goldfish named      |        |
## | joliet weekend visited dogs    |        |
## | worked desk human companions,  |        |
## | room decorated nicely couch    |        |
## | used pillows, l'occitane bath  |        |
## | amenities welcome sight, room  |        |
## | quiet peaceful, wireless       |        |
## | internet access wonderful      |        |
## | server went morning leaving    |        |
## | problems printing boarding     |        |
## | passes, afternoon reception    |        |
## | serves oenophile-satisfying    |        |
## | wine australia scrumptious     |        |
## | cookies, restaurant closed     |        |
## | renovation stay finally ate    |        |
## | food good drinks better, word  |        |
## | caution restaurant larger      |        |
## | person not sit booths wo n't   |        |
## | fit, 5'6 125 lbs husband 5'9   |        |
## | 175. table smack-against       |        |
## | stomach couple inches space    |        |
## | mighty uncomfortable patron    |        |
## | larger pregnant, bad design    |        |
## | opinion place decorated funky  |        |
## | welcoming way metal wood       |        |
## | handblown glass light fixtures |        |
## | expect seattle capital glass   |        |
## | art industry, definitely stay  |        |
## | reason,                        |        |
## +--------------------------------+--------+
table(review$Rating)
## 
##    1    2    3    4    5 
## 1421 1793 2184 6039 9054
library(dplyr)
library(ggplot2)
review %>%
  summarize(number_rows = n())
## # A tibble: 1 x 1
##   number_rows
##         <int>
## 1       20491
review %>%
  group_by(Rating) %>%
  summarize(number_rows = n())
## # A tibble: 5 x 2
##   Rating number_rows
##    <dbl>       <int>
## 1      1        1421
## 2      2        1793
## 3      3        2184
## 4      4        6039
## 5      5        9054
# We have more reviews for Rating 5 compared to Rating 1 and 2.

review %>%
    count(Rating,sort=TRUE) %>%
    ggplot(aes(x= Rating, y=n)) +
    geom_bar(stat="identity", fill = 'orange') +
    geom_text(aes(label=n), vjust=1.4, hjust=1, color="white", size= 5)

Sentiment Analysis

Data PreProcessing

library(tidytext)

#tokenization
tidy_review <- review %>%
  unnest_tokens(word, Review)

head(tidy_review)
## # A tibble: 6 x 2
##   Rating word     
##    <dbl> <chr>    
## 1      4 nice     
## 2      4 hotel    
## 3      4 expensive
## 4      4 parking  
## 5      4 got      
## 6      4 good
tidy_review %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 72,806 x 2
##    word      n
##    <chr> <int>
##  1 hotel 49016
##  2 room  34669
##  3 not   31617
##  4 great 21175
##  5 n't   18984
##  6 good  17084
##  7 staff 16260
##  8 stay  15176
##  9 did   13904
## 10 just  12616
## # ... with 72,796 more rows
tidy_review2 <- review %>%
  unnest_tokens(word, Review) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_review2
## # A tibble: 1,704,496 x 2
##    Rating word       
##     <dbl> <chr>      
##  1      4 nice       
##  2      4 hotel      
##  3      4 expensive  
##  4      4 parking    
##  5      4 deal       
##  6      4 stay       
##  7      4 hotel      
##  8      4 anniversary
##  9      4 arrived    
## 10      4 late       
## # ... with 1,704,486 more rows
tidy_review2 %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 72,203 x 2
##    word         n
##    <chr>    <int>
##  1 hotel    49016
##  2 n't      18984
##  3 staff    16260
##  4 stay     15176
##  5 nice     12460
##  6 location 11095
##  7 stayed   10467
##  8 service  10088
##  9 time      9901
## 10 night     9899
## # ... with 72,193 more rows
# We still see that the most frequent words are "hotel" and "n't". In the next step mentioned words will be removed from our data.
library(stopwords)
library(ngram)

#creating cusotm stop words
custom_stop_words <- tibble(word = c("hotel", "room", "hotels", "1", "2", "3", "4", "5", "stay", "stayed", "restaurants","6", "30", "15", "20", "7", "la", "9", "10", "13", "16", "it__ç_é_", "don__ç_é_", "4th", "ac","100", "200", '12' , '27', 'n\'t', "wow", "5th", "1st","18", "2006", "2008", "dr", "ubid", "pike", "45", "14", "itc", "ike", "mtr", "mithila", "sarento", "marcial", "manggis", "30hk", "16yr", "matteo", "mocenigo", "bahn",  stopwords("en")))

tidy_review2 = tidy_review2 %>%
  anti_join(custom_stop_words, by = c("word" = "word"))

#tidy_review2 = tidy_review  %>% 
#  mutate(word = wordStem(word)) # for this method stemming cuts too many words and they are loosing their meaning that's why it was decided to continue analysis without it

tidy_review2 %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 72,151 x 2
##    word          n
##    <chr>     <int>
##  1 staff     16260
##  2 nice      12460
##  3 location  11095
##  4 service   10088
##  5 time       9901
##  6 night      9899
##  7 beach      9765
##  8 day        9680
##  9 clean      9407
## 10 breakfast  9336
## # ... with 72,141 more rows
#tidy_review2 = removeNumbers(tidy_review2$stem)
tidy_review2 %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 3000) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = word)) + 
  geom_col(show.legend=FALSE) + 
  xlab(NULL) + 
  coord_flip() +
  ggtitle("Review Word Counts")

The most frequent words are staff, nice and location.

Sentiment Analysis by words

library(forcats)

sentiment_review <- tidy_review2 %>%
  inner_join(get_sentiments("bing"))

# negative and positive sentiments
sentiment_review %>%
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment      n
##   <chr>      <int>
## 1 negative   77871
## 2 positive  183416
#  the most often words for a given sentiment
sentiment_review %>%
  count(word, sentiment) %>%
  arrange(desc(n))
## # A tibble: 3,537 x 3
##    word        sentiment     n
##    <chr>       <chr>     <int>
##  1 nice        positive  12460
##  2 clean       positive   9407
##  3 friendly    positive   6764
##  4 excellent   positive   6037
##  5 helpful     positive   5477
##  6 recommend   positive   4850
##  7 beautiful   positive   4675
##  8 comfortable positive   4429
##  9 free        positive   4111
## 10 wonderful   positive   4053
## # ... with 3,527 more rows
counts_sentiment <- sentiment_review %>%
  count(word, sentiment) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

# visualization
ggplot(counts_sentiment, aes(x=word2, y=n, fill=sentiment)) + 
  geom_col(show.legend=FALSE) +
  scale_fill_manual(values = c("Red", "Darkgreen")) +
  facet_wrap(~sentiment, scales="free") +
  coord_flip() +
  labs(title = "Sentiment Word Counts",x = "Words") +
  ylim(0, 15000)

The most frequent negative words are bad, noise, expensive, hard, cold …
The most frequently appeared positive word in the review are following: nice, clean, friendly, excellent, helpful …

Sentiment Analysis by Rating

library(tidyr)

sentiment_rating <- tidy_review2 %>%
  inner_join(get_sentiments("bing")) %>%
  count(Rating, sentiment) %>%
  spread(sentiment, n) %>%
  mutate(overall = positive - negative,
         Stars = fct_reorder(as.factor(Rating), overall))

ggplot(sentiment_rating, aes(x=Rating, y=overall, fill=as.factor(Rating))) + 
  geom_col(show.legend=FALSE) +
  coord_flip() +
  scale_fill_manual(values = c("Red","Red",  "Darkgreen", "Darkgreen", "Darkgreen")) +
  labs(title = "Overall Sentiment by Rating",
       x = "Rating",
       y = "Overall Sentiment")

As it is expected for the low rating our overall score is below 0 and the higher the rating the higher overall score we can observe.

Wordcloud

library(wordcloud)
library(reshape2)

tidy_review2 %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, colors = "#ec9a1c"))

sentiment_review%>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "darkgreen"),
                   max.words = 100)

Negative part of the wordclud is represented by the following most frequent words: noise, expensive, bad, cold, dissapointed. As an example positive part of the word cloud has next words: great, good, nice, clean, fun, friendly, helpful, beautiful.

Topic Modeling

Data Preparation

library(tidyverse)
library(tm)

# combining data for modeling

tidy_review_modeling = 
  tidy_review2 %>% 
  group_by(Rating) %>% 
  summarise(text = str_c(word, collapse = " ")) %>% 
  ungroup()
# creation corpus
Corpus <- Corpus(VectorSource(tidy_review_modeling$text))
DTM <- DocumentTermMatrix(Corpus)

Choosing Number of Topics - K

Due to the big amount of data, it was decided to divide it into sub samples.

Below code is commented as it is very time consuming. Results of the procedures are paste to the report in a form of image.

# Corpus_K1 <- Corpus(VectorSource(tidy_review_modeling$text[c(1,3)])) # make a corpus object
# DTM_K1 <- DocumentTermMatrix(Corpus_K1)
# library(topicmodels)
# library(ldatuning)
# result_1 <- FindTopicsNumber(
#   DTM_K1,
#   topics = seq(from = 2, to = 15, by = 1),
#   metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
#   method = "Gibbs",
#   control = list(seed = 77),
#   mc.cores = 2L,
#   verbose = TRUE
# )
#FindTopicsNumber_plot(result_1)
Figure 1. Number of Topics

Figure 1. Number of Topics

Based on the first plots the best number of K is 3-5.

# Corpus_K2 <- Corpus(VectorSource(tidy_review_modeling$text[c(2,5)])) # make a corpus object
# DTM_K2 <- DocumentTermMatrix(Corpus_K2)
# 
# library(topicmodels)
# library(ldatuning)
# result_2 <- FindTopicsNumber(
#   DTM_K2,
#   topics = seq(from = 2, to = 15, by = 1),
#   metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
#   method = "Gibbs",
#   control = list(seed = 77),
#   mc.cores = 2L,
#   verbose = TRUE)
# 
# FindTopicsNumber_plot(result_2)
Figure 2. Number of Topics

Figure 2. Number of Topics

Plots based on the second sample are showing that we need to take K value equal to 4-5. Therefore we can assume that K should be equal to a value out of the range 3-5. After trying each of the separately it was decided to used K equal to 3, as it is the most informative.

# Corpus_K3 <- Corpus(VectorSource(tidy_review_modeling$text[c(3,4)])) # make a corpus object
# DTM_K3 <- DocumentTermMatrix(Corpus_K3)
# 
# library(topicmodels)
# library(ldatuning)
# result_3 <- FindTopicsNumber(
#   DTM_K3,
#   topics = seq(from = 2, to = 15, by = 1),
#   metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
#   method = "Gibbs",
#   control = list(seed = 77),
#   mc.cores = 2L,
#   verbose = TRUE)
# 
# FindTopicsNumber_plot(result_3)

LDA Modelling

library(topicmodels) 
library(tm)
library(tidytext)

unique_indexes <- unique(DTM$i) # unique index
DTM <- DTM[unique_indexes,]

# LDA modeling
# k - the number of topics that we specified
# beta - the word probabilities to define the topics

lda <- LDA(DTM, k = 3, control = list(seed = 1234))
tidy_topics <- tidy(lda, matrix = "beta")
# stemming
library(SnowballC)
library(dplyr)
tidy_topics <- tidy_topics %>% 
    mutate(stem = wordStem(term))
library(ggplot2)


tidy_topics  %>% 
  group_by(topic) %>% 
  top_n(10, beta) %>% 
  ungroup() %>% 
  arrange(topic, -beta) %>% 
  mutate(term = reorder(term, beta)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~ topic, scales = "free") +
  labs(x = NULL, y = "Beta") +
  coord_flip() +
  ylim(0, 0.017) + # sa,e y-axis for all the graphs
  scale_fill_manual(values = c("orange","Darkgreen",  "deeppink3"))

We can notice that topics 2 and 3 are similar to each other. However, we can identified some similarities and main ideas of each topic. 1 topics is describing good time that customer can have in the hotel and rooms and stuff. 2nd topics describes beach and pool. 3rd topic is about staff, food, and location.

TF-IDF Modelling

library(dplyr)
library(tidytext)

# counts of words for specific review
review_words = tidy_review2 %>%
  count(Rating, word, sort = TRUE)

# total counts of words for all the reviews
total_words = review_words %>% 
  group_by(Rating) %>% 
  summarize(total = sum(n))

review_words = left_join(review_words, total_words)

# caclucating td-idf score
tf_idf = review_words %>%
  bind_tf_idf(word, Rating, n) %>%
  select(-total) %>%
  arrange(desc(tf_idf))

# visualization
tf_idf %>% 
  group_by(Rating) %>% 
  top_n(5) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = as.factor(Rating))) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~Rating, scales = "free") +
  coord_flip() +
  ylim(0, 0.00009) +
  theme(axis.text.x=element_text(angle = -50, hjust = 0)) +
  scale_fill_manual(values = c("orange","Darkgreen",  "deeppink3", "cornflowerblue", "darkcyan"))

We can say that this supervised learning method did not provide any useful insights regarding our data. The words with higher tf-idf score are common to the for the Review for specific Rating and uncommon across all the reviews.

We can notice that all the reviews are very similar to each other.

Source:
1. source