For homework 8 we were asked to do some text mining on Yelp reviews of 7 different restaurants. The data set contains 50 reviews in total.

library(textir)
## Loading required package: distrom
## Loading required package: Matrix
## Loading required package: gamlr
## Loading required package: parallel
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:distrom':
## 
##     collapse
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(tidytext)
rests <- read.csv("~/DataMining/Data/YelpReviewsFall2017.csv", stringsAsFactors = FALSE)
dim(rests)
## [1] 50  3
head(rests)
##              Restauraunt X5StarRating
## 1 O'Gara's Bar And Grill            4
## 2 O'Gara's Bar And Grill            2
## 3 O'Gara's Bar And Grill            1
## 4 O'Gara's Bar And Grill            3
## 5 O'Gara's Bar And Grill            1
## 6 O'Gara's Bar And Grill            4
##                                                                                                                                                                                                                                                                                                                                                                   Review
## 1 "My hubbie and I walked over after a long day of work in and out of the house. I was hungry for a burger. I almost never have a burger. When I do I'm often disappointed. Not tonight. I had the jalapeno business. It was juicy and delicious. The waiter was attentive, nice looking too :) It's a bit expensive, but most places are for someone who usually cooks"
## 2                                                                                                                                                                                                                                                                  "It's ok. Been there three times, and underwhelmed with their food. Nice tap beer selection however."
## 3                                                                                                                                                                                                                                      "Nasty rude staff. I was recently accosted by a woman who came outside while I was taking a smoke break from my overpriced drink"
## 4                         " I had fish and chips the first time that I was there and corned beef and cabbage the second time...both were ehhh. Fish and chips were crispy though which is a plus, not greasy, and the corned beef and cabbage was a little blah. Keep in mind that this was during their week of specials, so maybe it wasn't typical of their regular "
## 5                                                                                                                                                                     "It smells like a porta potty inside. The staff comes off very dull and lifeless. But I got to meet the local drunk! That was definitely the only highlight of this place. Typical bad Irish bar."
## 6                                                                                                                                                                            "I was surprised by how good the food was - I honestly expected some boring bar food, but the Jalapeno Business burger was killer good. I don't even like french fries, but I loved these."
length(rests$Review)
## [1] 50

The rests data frame has 50 rows and 3 columns: Restauraunt (the restaurant name, spelled this way in the source file), X5StarRating (the star rating), and Review (the review text).

# tibble() replaces the deprecated data_frame(); one row per review
rests_df <- tibble(line = seq_len(nrow(rests)), text = as.character(rests$Review))
rests_df
## # A tibble: 50 x 2
##     line
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10
## # ... with 40 more rows, and 1 more variables: text <chr>

Next we split each review into individual words, one word per row, so that we can analyze single words rather than whole reviews.

tidy_rests <- rests_df %>% unnest_tokens(word, text)
tidy_rests
## # A tibble: 2,857 x 2
##     line   word
##    <int>  <chr>
##  1     1     my
##  2     1 hubbie
##  3     1    and
##  4     1      i
##  5     1 walked
##  6     1   over
##  7     1  after
##  8     1      a
##  9     1   long
## 10     1    day
## # ... with 2,847 more rows
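
To make the tokenization step concrete, here is a minimal sketch on made-up text. The toy tibble and its two reviews are hypothetical and not part of the Yelp data. It illustrates the default behavior of unnest_tokens() seen in the output above: one row per word, lowercased, with punctuation dropped.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-review toy data (not taken from the Yelp file)
toy <- tibble(line = 1:2,
              text = c("Nice tap beer selection.", "Nasty, rude staff!"))

# One row per word; the line column records which review each word came from
toy_tokens <- toy %>% unnest_tokens(word, text)
print(toy_tokens)
```

Keeping the line column means every word can still be traced back to its review.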

Next we remove the filler words (stop words such as "and", "a" and "my"). They are by far the most common words, so leaving them in would skew our results.

data(stop_words)
tidy_rests <- tidy_rests %>% anti_join(stop_words)
## Joining, by = "word"
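
The anti_join() call above keeps only the rows of tidy_rests whose word does not appear in stop_words. A minimal sketch of the same idea, assuming a tiny hand-made stop list (toy_stop and toy_words are hypothetical stand-ins, not the real stop_words table):

```r
library(dplyr)

# Hypothetical tokens from one review
toy_words <- tibble(line = 1L,
                    word = c("the", "burger", "was", "juicy"))

# Hypothetical stop list; the real stop_words table is much larger
toy_stop <- tibble(word = c("the", "was"))

# Keep rows of toy_words that have no match in toy_stop
kept <- toy_words %>% anti_join(toy_stop, by = "word")
print(kept)
```

Passing by = "word" explicitly names the join column, which also avoids the "Joining, by" message.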

Now we count the remaining words and then look for words with a positive meaning, using the words tagged "joy" in the NRC sentiment lexicon.

tidy_rests
## # A tibble: 1,055 x 2
##     line         word
##    <int>        <chr>
##  1     1       hubbie
##  2     1       walked
##  3     1          day
##  4     1        house
##  5     1       hungry
##  6     1       burger
##  7     1       burger
##  8     1 disappointed
##  9     1      tonight
## 10     1     jalapeno
## # ... with 1,045 more rows
tidy_rests %>% count(word, sort = TRUE)
## # A tibble: 642 x 2
##           word     n
##          <chr> <int>
##  1        food    28
##  2         bar    15
##  3     service    15
##  4      burger    12
##  5        time    11
##  6   delicious    10
##  7 minneapolis    10
##  8        beer     7
##  9        spot     7
## 10     amazing     6
## # ... with 632 more rows
rests_joy <- get_sentiments("nrc") %>% filter(sentiment=="joy")
rests_joy
## # A tibble: 689 x 2
##             word sentiment
##            <chr>     <chr>
##  1    absolution       joy
##  2     abundance       joy
##  3      abundant       joy
##  4      accolade       joy
##  5 accompaniment       joy
##  6    accomplish       joy
##  7  accomplished       joy
##  8       achieve       joy
##  9   achievement       joy
## 10       acrobat       joy
## # ... with 679 more rows
tidy_rests %>% inner_join(rests_joy) %>% count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 44 x 2
##         word     n
##        <chr> <int>
##  1      food    28
##  2 delicious    10
##  3      beer     7
##  4 excellent     4
##  5   special     4
##  6     enjoy     3
##  7   perfect     3
##  8  favorite     2
##  9  friendly     2
## 10       fun     2
## # ... with 34 more rows
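
The lexicon lookup above is an inner join followed by a count: only words that appear in the lexicon survive, and each is tallied. A small sketch with a made-up lexicon (toy_joy is hypothetical and only mimics the word/sentiment layout of the NRC table):

```r
library(dplyr)

# Hypothetical tokens with stop words already removed
toy_words <- tibble(word = c("food", "delicious", "slow",
                             "food", "delicious", "food"))

# Hypothetical stand-in for the "joy" subset of a sentiment lexicon
toy_joy <- tibble(word = c("food", "delicious"), sentiment = "joy")

# Keep only lexicon words, then count them from most to least frequent
joy_counts <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)
print(joy_counts)
```

Words missing from the lexicon ("slow" here) simply drop out, so the result depends entirely on which words the lexicon happens to include.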

Next we build a word cloud, limited to at most 50 words.

library(wordcloud)
## Loading required package: RColorBrewer
tidy_rests %>% anti_join(stop_words) %>% count(word) %>% with(wordcloud(word,n,max.words=50))
## Joining, by = "word"

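With the basic word cloud in place, we now split the words into positive and negative using the Bing lexicon. Positive words are drawn in the darker gray and negative words in the lighter gray. Words that cannot be fit on the page are left out of the plot, which is what the warnings below report.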

library(reshape2)
tidy_rests %>% inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment,sort = TRUE) %>% 
  acast(word~sentiment, value.var = "n", fill = 0) %>% 
  comparison.cloud(colors =c("gray80", "gray20"), max.words=100)
## Joining, by = "word"
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): forgetful could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): indulge could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): lukewarm could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): needless could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): overpriced could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): poorly could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): pricey could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): rough could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): slow could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): smoke could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): split could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): stuck could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): stupid could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): tense could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): unsatisfactory could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): weaker could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): wrong could not be fit on page. It will not be plotted.
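
The acast() step above reshapes the counts into a matrix with one row per word and one column per sentiment, which is the input comparison.cloud() works on. A minimal sketch with made-up counts (toy_counts is hypothetical) shows why "gray80" ends up on the negative words: the columns come out in alphabetical order, and the colors are matched to the columns in that order.

```r
library(reshape2)

# Hypothetical word counts by sentiment (not the real Yelp counts)
toy_counts <- data.frame(
  word      = c("delicious", "slow", "amazing", "rude"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(10, 2, 6, 1),
  stringsAsFactors = FALSE
)

# Rows are words, columns are sentiments; missing combinations become 0
m <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)
print(m)
print(colnames(m))
```

Because "negative" sorts before "positive", the first color in the colors vector goes to the negative words and the second to the positive words.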

Next we look at bigrams, which are pairs of consecutive words.

rests_bigrams <- rests_df %>% unnest_tokens(bigrams, text, token="ngrams", n=2)
rests_bigrams
## # A tibble: 2,807 x 2
##     line     bigrams
##    <int>       <chr>
##  1     1   my hubbie
##  2     1  hubbie and
##  3     1       and i
##  4     1    i walked
##  5     1 walked over
##  6     1  over after
##  7     1     after a
##  8     1      a long
##  9     1    long day
## 10     1      day of
## # ... with 2,797 more rows
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
## The following object is masked from 'package:Matrix':
## 
##     expand
bigrams_separated <- rests_bigrams %>% separate(bigrams, c("word1", "word2"), sep=" ")
bigrams_separated
## # A tibble: 2,807 x 3
##     line  word1  word2
##  * <int>  <chr>  <chr>
##  1     1     my hubbie
##  2     1 hubbie    and
##  3     1    and      i
##  4     1      i walked
##  5     1 walked   over
##  6     1   over  after
##  7     1  after      a
##  8     1      a   long
##  9     1   long    day
## 10     1    day     of
## # ... with 2,797 more rows
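
As a rough base R illustration of what a bigram is (a sketch only, not how unnest_tokens() is implemented), each word is paired with the word that follows it. The words vector below is just the start of the first review.

```r
# First few tokens of review 1, typed in by hand
words <- c("my", "hubbie", "and", "i", "walked")

# Pair every word except the last with its successor
toy_bigrams <- paste(head(words, -1), tail(words, -1))
print(toy_bigrams)

# A text with k words yields k - 1 bigrams
print(length(words) - length(toy_bigrams))
```

This is consistent with the counts above: 2,857 words across 50 reviews, minus one per review, gives 2,807 bigrams.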

To keep only the bigrams in which neither word is a stop word, we filter on both columns:

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)
bigrams_filtered
## # A tibble: 363 x 3
##     line     word1     word2
##    <int>     <chr>     <chr>
##  1     1  jalapeno  business
##  2     1 attentive      nice
##  3     1       bit expensive
##  4     2      food      nice
##  5     2      nice       tap
##  6     2       tap      beer
##  7     2      beer selection
##  8     3     nasty      rude
##  9     3      rude     staff
## 10     3  recently  accosted
## # ... with 353 more rows
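
The same separate-then-filter pattern can be tried on a few made-up bigrams. In this sketch toy_stop is a hypothetical two-word stop list standing in for stop_words$word:

```r
library(dplyr)
library(tidyr)

# Hypothetical bigrams (not taken from the Yelp file)
toy_bigrams <- tibble(bigrams = c("the burger", "juicy lucy",
                                  "was juicy", "french fries"))

# Hypothetical stop list
toy_stop <- c("the", "was")

# Split each bigram into two columns, then drop rows where either word is a stop word
kept <- toy_bigrams %>%
  separate(bigrams, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% toy_stop, !word2 %in% toy_stop)
print(kept)
```

A bigram is removed as soon as one of its two words is a stop word, which is why the filtered table is so much shorter than the original.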

To see how often each pair of non-stop words occurs, we count the filtered bigrams:

bigram_counts <- bigrams_filtered %>% count(word1, word2, sort =TRUE)
bigram_counts
## # A tibble: 339 x 3
##       word1       word2     n
##       <chr>       <chr> <int>
##  1      bar          la     3
##  2     beer   selection     3
##  3   french       fries     3
##  4    juicy        lucy     3
##  5       la      grassa     3
##  6        5       stars     2
##  7   corned        beef     2
##  8  deviled        eggs     2
##  9     dive         bar     2
## 10 downtown minneapolis     2
## # ... with 329 more rows
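
If the two-column layout is awkward to read, the word pairs can be glued back into a single phrase after counting. A minimal sketch on made-up pairs, using tidyr's unite() (this step is an optional addition and not part of the original analysis):

```r
library(dplyr)
library(tidyr)

# Hypothetical filtered bigrams, one row per occurrence
toy_pairs <- tibble(
  word1 = c("juicy", "french", "juicy", "juicy", "french"),
  word2 = c("lucy",  "fries",  "lucy",  "lucy",  "fries")
)

# Count each pair, then rejoin the two words into one column
toy_counts <- toy_pairs %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ")
print(toy_counts)
```

In the real counts, "bar la" and "la grassa" both appear 3 times, which suggests they may come from the same three-word name.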