For Homework 8 we were asked to do some text mining on Yelp reviews of 7 different restaurants, 50 reviews in total.
library(textir)
## Loading required package: distrom
## Loading required package: Matrix
## Loading required package: gamlr
## Loading required package: parallel
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:distrom':
##
## collapse
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(tidytext)
rests <- read.csv("~/DataMining/Data/YelpReviewsFall2017.csv", stringsAsFactors = FALSE)
dim(rests)
## [1] 50 3
head(rests)
## Restauraunt X5StarRating
## 1 O<U+0092>Gara<U+0092>s Bar And Grill 4
## 2 O<U+0092>Gara<U+0092>s Bar And Grill 2
## 3 O<U+0092>Gara<U+0092>s Bar And Grill 1
## 4 O<U+0092>Gara<U+0092>s Bar And Grill 3
## 5 O<U+0092>Gara<U+0092>s Bar And Grill 1
## 6 O<U+0092>Gara<U+0092>s Bar And Grill 4
## Review
## 1 "My hubbie and I walked over after a long day of work in and out of the house. I was hungry for a burger. I almost never have a burger. When I do I'm often disappointed. Not tonight. I had the jalapeno business. It was juicy and delicious. The waiter was attentive, nice looking too :) It's a bit expensive, but most places are for someone who usually cooks"
## 2 "It's ok. Been there three times, and underwhelmed with their food. Nice tap beer selection however."
## 3 "Nasty rude staff. I was recently accosted by a woman who came outside while I was taking a smoke break from my overpriced drink"
## 4 "Â I had fish and chips the first time that I was there and corned beef and cabbage the second time...both were ehhh. Fish and chips were crispy though which is a plus, not greasy, and the corned beef and cabbage was a little blah. Keep in mind that this was during their week of specials, so maybe it wasn't typical of their regular "
## 5 "It smells like a porta potty inside. The staff comes off very dull and lifeless. But I got to meet the local drunk! That was definitely the only highlight of this place. Typical bad Irish bar."
## 6 "I was surprised by how good the food was - I honestly expected some boring bar food, but the Jalapeno Business burger was killer good. I don't even like french fries, but I loved these."
length(rests$Review)
## [1] 50
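The <U+0092> in the restaurant names above is most likely a Windows-1252 apostrophe (byte 0x92) that was read under a different encoding. A minimal cleanup sketch, assuming that is the cause; names_raw is a hypothetical one-element stand-in for rests$Restauraunt:

```r
# Hypothetical stand-in for rests$Restauraunt (assumption: U+0092 marks an apostrophe)
names_raw <- c("O\u0092Gara\u0092s Bar And Grill")

# Replace the stray control character with a plain apostrophe
names_clean <- gsub("\u0092", "'", names_raw, fixed = TRUE)
names_clean
```

The same gsub() call could be applied to the Review column if the character shows up there as well.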
The rests data frame has 50 rows and 3 columns: Restauraunt (the restaurant name, spelled this way in the file), X5StarRating (the star rating) and Review (the review text).
rests_df <- tibble(line = seq_len(nrow(rests)), text = rests$Review)
rests_df
## # A tibble: 50 x 2
## line
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## # ... with 40 more rows, and 1 more variables: text <chr>
Next we split each review into individual words (tokens), one word per row, so that we can analyze single words rather than whole reviews.
tidy_rests <- rests_df %>% unnest_tokens(word, text)
tidy_rests
## # A tibble: 2,857 x 2
## line word
## <int> <chr>
## 1 1 my
## 2 1 hubbie
## 3 1 and
## 4 1 i
## 5 1 walked
## 6 1 over
## 7 1 after
## 8 1 a
## 9 1 long
## 10 1 day
## # ... with 2,847 more rows
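To make the tokenization step concrete, here is a small sketch on made-up data. toy_df and its two sentences are hypothetical and stand in for rests_df; the call itself is the same unnest_tokens() used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for rests_df
toy_df <- tibble(
  line = 1:2,
  text = c("It was ok. Nice tap beer selection!", "Nasty rude staff.")
)

# One row per word; by default the text is lower-cased and punctuation is dropped
toy_tokens <- toy_df %>% unnest_tokens(word, text)
toy_tokens
```

Each output row keeps the line number of the review it came from, which is what lets later steps tie a word back to its review.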
Next we remove the stop words (filler words such as "the", "and" and "a"). They would otherwise be the most common words and skew our results.
data(stop_words)
tidy_rests <- tidy_rests %>% anti_join(stop_words)
## Joining, by = "word"
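A small sketch of the same stop-word removal on made-up tokens. toy_tokens is hypothetical. Passing by = "word" explicitly is optional, but it names the join column and suppresses the "Joining, by" message.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens standing in for tidy_rests
toy_tokens <- tibble(
  line = 1L,
  word = c("i", "had", "the", "burger", "and", "fries")
)

# Keep only the rows whose word does not appear in the stop_words lexicon
toy_clean <- toy_tokens %>% anti_join(stop_words, by = "word")
toy_clean$word
```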
With the stop words removed, we look at the most common remaining words and then pick out the ones with a positive meaning, using the words tagged "joy" in the NRC lexicon.
tidy_rests
## # A tibble: 1,055 x 2
## line word
## <int> <chr>
## 1 1 hubbie
## 2 1 walked
## 3 1 day
## 4 1 house
## 5 1 hungry
## 6 1 burger
## 7 1 burger
## 8 1 disappointed
## 9 1 tonight
## 10 1 jalapeno
## # ... with 1,045 more rows
tidy_rests %>% count(word, sort = TRUE)
## # A tibble: 642 x 2
## word n
## <chr> <int>
## 1 food 28
## 2 bar 15
## 3 service 15
## 4 burger 12
## 5 time 11
## 6 delicious 10
## 7 minneapolis 10
## 8 beer 7
## 9 spot 7
## 10 amazing 6
## # ... with 632 more rows
rests_joy <- get_sentiments("nrc") %>% filter(sentiment=="joy")
rests_joy
## # A tibble: 689 x 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
## 7 accomplished joy
## 8 achieve joy
## 9 achievement joy
## 10 acrobat joy
## # ... with 679 more rows
tidy_rests %>% inner_join(rests_joy) %>% count(word, sort=TRUE)
## Joining, by = "word"
## # A tibble: 44 x 2
## word n
## <chr> <int>
## 1 food 28
## 2 delicious 10
## 3 beer 7
## 4 excellent 4
## 5 special 4
## 6 enjoy 3
## 7 perfect 3
## 8 favorite 2
## 9 friendly 2
## 10 fun 2
## # ... with 34 more rows
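The lexicon join above can be sketched on made-up data as follows. toy_tokens and toy_joy are hypothetical; toy_joy is a two-word stand-in for the filtered NRC lexicon, so the sketch needs no lexicon download.

```r
library(dplyr)

# Hypothetical tokens standing in for tidy_rests
toy_tokens <- tibble(
  line = c(1L, 1L, 2L, 2L, 3L),
  word = c("delicious", "burger", "delicious", "rude", "perfect")
)

# Hypothetical mini-lexicon standing in for get_sentiments("nrc") filtered to joy
toy_joy <- tibble(word = c("delicious", "perfect"), sentiment = "joy")

# inner_join keeps only tokens found in the lexicon; count tallies them
joy_counts <- toy_tokens %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)
joy_counts
```

Words missing from the lexicon ("burger", "rude") drop out of the result, so the counts depend on what the lexicon chooses to tag. In the real output above, for instance, "food" and "beer" are counted as joy words.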
Next we build a word cloud of the most frequent words, showing at most 50 of them.
library(wordcloud)
## Loading required package: RColorBrewer
# stop words were already removed from tidy_rests above, so no second anti_join is needed
tidy_rests %>% count(word) %>% with(wordcloud(word, n, max.words = 50))
After the original word cloud, we now separate the words that the Bing lexicon labels positive from those it labels negative. In the comparison cloud, positive words are drawn in the darker gray and negative words in the lighter gray.
library(reshape2)
tidy_rests %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray80", "gray20"), max.words = 100)
## Joining, by = "word"
## Warning in comparison.cloud(., colors = c("gray80", "gray20"), max.words =
## 100): forgetful could not be fit on page. It will not be plotted.
## (The same warning was repeated for: indulge, lukewarm, needless, overpriced,
## poorly, pricey, rough, slow, smoke, split, stuck, stupid, tense,
## unsatisfactory, weaker, wrong.)
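The color assignment depends on the matrix that acast() builds: one row per word and one column per sentiment, with the columns in alphabetical order. As far as I understand comparison.cloud(), it assigns colors in column order, so "gray80" goes to negative and "gray20" to positive. A small sketch on made-up counts (toy_counts is hypothetical):

```r
library(dplyr)
library(reshape2)

# Hypothetical word/sentiment counts standing in for the Bing join result
toy_counts <- tibble(
  word = c("delicious", "rude", "slow", "amazing"),
  sentiment = c("positive", "negative", "negative", "positive"),
  n = c(10L, 2L, 1L, 6L)
)

# Words become rows, sentiments become columns, missing combinations become 0
toy_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)
colnames(toy_matrix)
toy_matrix
```

The "could not be fit on page" warnings mean that some words were skipped when drawing. Lowering the scale argument of comparison.cloud() or enlarging the plotting device may let more of them fit, though this was not tried here.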
Next we look at bigrams, that is, pairs of consecutive words. We first split each review into bigrams and then separate each bigram into its two words.
rests_bigrams <- rests_df %>% unnest_tokens(bigrams, text, token="ngrams", n=2)
rests_bigrams
## # A tibble: 2,807 x 2
## line bigrams
## <int> <chr>
## 1 1 my hubbie
## 2 1 hubbie and
## 3 1 and i
## 4 1 i walked
## 5 1 walked over
## 6 1 over after
## 7 1 after a
## 8 1 a long
## 9 1 long day
## 10 1 day of
## # ... with 2,797 more rows
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
## The following object is masked from 'package:Matrix':
##
## expand
bigrams_separated <- rests_bigrams %>% separate(bigrams, c("word1", "word2"), sep=" ")
bigrams_separated
## # A tibble: 2,807 x 3
## line word1 word2
## * <int> <chr> <chr>
## 1 1 my hubbie
## 2 1 hubbie and
## 3 1 and i
## 4 1 i walked
## 5 1 walked over
## 6 1 over after
## 7 1 after a
## 8 1 a long
## 9 1 long day
## 10 1 day of
## # ... with 2,797 more rows
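The two steps above can be sketched together on a one-sentence example. toy_df is hypothetical. A text with k words yields k - 1 bigrams, which matches the counts above: 2,857 words across 50 reviews give 2,807 bigrams.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-review data frame standing in for rests_df
toy_df <- tibble(line = 1L, text = "Nice tap beer selection")

# Split into overlapping word pairs, then put each word in its own column
toy_bigrams <- toy_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ")
toy_bigrams
```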
To keep only the bigrams in which both words are meaningful, we drop every pair where either word is a stop word:
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigrams_filtered
## # A tibble: 363 x 3
## line word1 word2
## <int> <chr> <chr>
## 1 1 jalapeno business
## 2 1 attentive nice
## 3 1 bit expensive
## 4 2 food nice
## 5 2 nice tap
## 6 2 tap beer
## 7 2 beer selection
## 8 3 nasty rude
## 9 3 rude staff
## 10 3 recently accosted
## # ... with 353 more rows
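A small sketch of the same filter on made-up pairs. toy_pairs is hypothetical. The two conditions are written in a single filter() call here, which should behave the same as the two chained calls above.

```r
library(dplyr)
library(tidytext)

# Hypothetical separated bigrams standing in for bigrams_separated
toy_pairs <- tibble(
  word1 = c("the", "corned", "beef", "french"),
  word2 = c("corned", "beef", "and", "fries")
)

# A pair survives only if neither of its words is in the stop_words lexicon
toy_filtered <- toy_pairs %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)
toy_filtered
```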
To see how often each pair of non-stop words occurs, we count the filtered bigrams:
bigram_counts <- bigrams_filtered %>% count(word1, word2, sort =TRUE)
bigram_counts
## # A tibble: 339 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 bar la 3
## 2 beer selection 3
## 3 french fries 3
## 4 juicy lucy 3
## 5 la grassa 3
## 6 5 stars 2
## 7 corned beef 2
## 8 deviled eggs 2
## 9 dive bar 2
## 10 downtown minneapolis 2
## # ... with 329 more rows
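As a possible follow-up, the two word columns can be glued back into a single bigram column with tidyr's unite(). This step is not part of the analysis above. The sketch below uses made-up pairs (toy_filtered is hypothetical).

```r
library(dplyr)
library(tidyr)

# Hypothetical filtered bigrams standing in for bigrams_filtered
toy_filtered <- tibble(
  word1 = c("french", "juicy", "french", "dive"),
  word2 = c("fries", "lucy", "fries", "bar")
)

# Count each pair, most frequent first, then rejoin the two words into one column
toy_counts <- toy_filtered %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ")
toy_counts
```

With only 50 reviews most pairs occur once, so the top of the real list is dominated by dish and restaurant names such as "juicy lucy" and "la grassa".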