This data set contains 50 Yelp reviews of 7 restaurants: O’Gara’s Bar and Grill, Butcher & The Boar, 112 Eatery, Hell’s Kitchen, Bar La Grassa, The Lowry, and Mama Maria’s. We will use text mining to identify the positive and negative words that reviewers use.
library(textir)
## Loading required package: distrom
## Loading required package: Matrix
## Loading required package: gamlr
## Loading required package: parallel
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:distrom':
##
## collapse
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(tidytext)
rests<-read.csv("c:/users/abbey/Desktop/Data Mining/Yelp Reviews Fall2017.csv")
rests[1:5,]
## Restauraunt X5StarRating
## 1 OGaras Bar And Grill 4
## 2 OGaras Bar And Grill 2
## 3 OGaras Bar And Grill 1
## 4 OGaras Bar And Grill 3
## 5 OGaras Bar And Grill 1
## Review
## 1 "My hubbie and I walked over after a long day of work in and out of the house. I was hungry for a burger. I almost never have a burger. When I do I'm often disappointed. Not tonight. I had the jalapeno business. It was juicy and delicious. The waiter was attentive, nice looking too :) It's a bit expensive, but most places are for someone who usually cooks"
## 2 "It's ok. Been there three times, and underwhelmed with their food. Nice tap beer selection however."
## 3 "Nasty rude staff. I was recently accosted by a woman who came outside while I was taking a smoke break from my overpriced drink"
## 4 " I had fish and chips the first time that I was there and corned beef and cabbage the second time...both were ehhh. Fish and chips were crispy though which is a plus, not greasy, and the corned beef and cabbage was a little blah. Keep in mind that this was during their week of specials, so maybe it wasn't typical of their regular "
## 5 "It smells like a porta potty inside. The staff comes off very dull and lifeless. But I got to meet the local drunk! That was definitely the only highlight of this place. Typical bad Irish bar."
dim(rests)
## [1] 50 3
The data set has 50 rows and 3 columns: the restaurant name, the star rating the reviewer gave, and the review text. Before tokenizing, we convert the Review column to character strings (read.csv may have stored it as a factor) and add a line number that identifies each review.
rests_df<-tibble(line=1:50,text=as.character(rests$Review))
rests_df
## # A tibble: 50 x 2
## line
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## # ... with 40 more rows, and 1 more variables: text <chr>
Next, we split each review into individual words, one word per row.
tidy_rests<-rests_df %>% unnest_tokens(word,text)
tidy_rests
## # A tibble: 2,894 x 2
## line word
## <int> <chr>
## 1 1 my
## 2 1 hubbie
## 3 1 and
## 4 1 i
## 5 1 walked
## 6 1 over
## 7 1 after
## 8 1 a
## 9 1 long
## 10 1 day
## # ... with 2,884 more rows
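As a minimal sketch of what the tokenizing step does, the example below runs unnest_tokens() on two made-up reviews. The toy_df and toy_tokens names and the review text are hypothetical and are not part of the Yelp data.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-review example (not taken from the Yelp data)
toy_df <- tibble(line = 1:2,
                 text = c("Great burger, GREAT fries!", "Service was slow."))

# unnest_tokens() puts one word on each row, converts the words
# to lowercase, and drops punctuation
toy_tokens <- toy_df %>% unnest_tokens(word, text)
toy_tokens
```

The line column is carried along, so every word can still be traced back to the review it came from.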
Next, we load the stop_words data set that ships with tidytext. Stop words are common filler words such as “in”, “and”, and “or”. Removing them from the reviews lets us focus on the words that carry meaning.
data(stop_words)
tidy_rests<-tidy_rests %>% anti_join(stop_words)
## Joining, by = "word"
tidy_rests
## # A tibble: 1,040 x 2
## line word
## <int> <chr>
## 1 1 hubbie
## 2 1 walked
## 3 1 day
## 4 1 house
## 5 1 hungry
## 6 1 burger
## 7 1 burger
## 8 1 disappointed
## 9 1 tonight
## 10 1 jalapeno
## # ... with 1,030 more rows
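The stop-word step can be sketched on a few made-up tokens. The toy_tokens and toy_clean names are hypothetical; stop_words is the data set that ships with tidytext.

```r
library(dplyr)
library(tidytext)

data(stop_words)  # ships with tidytext; has columns `word` and `lexicon`

# Hypothetical tokens, one per row, in the shape unnest_tokens() returns
toy_tokens <- tibble(line = 1L,
                     word = c("the", "burger", "was", "delicious"))

# anti_join() keeps only the rows whose `word` has no match in stop_words
toy_clean <- toy_tokens %>% anti_join(stop_words, by = "word")
toy_clean$word
```

Passing by = "word" explicitly gives the same result as the call above and suppresses the "Joining, by" message.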
First we count how many times each remaining word appears across all the reviews. Then we pull a list of good words from the NRC sentiment lexicon; here, good words are the ones the lexicon tags as joy. Joining that list to the remaining words keeps only the joy words, which we count again.
tidy_rests %>% count(word,sort=TRUE)
## # A tibble: 635 x 2
## word n
## <chr> <int>
## 1 food 31
## 2 service 13
## 3 bar 12
## 4 delicious 10
## 5 minneapolis 10
## 6 time 10
## 7 pasta 8
## 8 staff 8
## 9 beer 6
## 10 eggs 6
## # ... with 625 more rows
# Pull the joy words from the NRC sentiment lexicon
rests_joy<-get_sentiments("nrc") %>%
filter(sentiment=="joy")
rests_joy
## # A tibble: 689 x 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
## 7 accomplished joy
## 8 achieve joy
## 9 achievement joy
## 10 acrobat joy
## # ... with 679 more rows
tidy_rests %>% inner_join(rests_joy) %>% count(word,sort=TRUE)
## Joining, by = "word"
## # A tibble: 42 x 2
## word n
## <chr> <int>
## 1 food 31
## 2 delicious 10
## 3 beer 6
## 4 excellent 4
## 5 enjoy 3
## 6 glad 3
## 7 pretty 3
## 8 special 3
## 9 favorite 2
## 10 friendly 2
## # ... with 32 more rows
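The join-then-count pattern can be tried without downloading a lexicon. In this sketch, toy_joy is a hypothetical three-word stand-in for the filtered NRC table, and toy_tokens is made-up data.

```r
library(dplyr)

# Hypothetical stand-in for get_sentiments("nrc") %>% filter(sentiment == "joy")
toy_joy <- tibble(word = c("delicious", "excellent", "friendly"),
                  sentiment = "joy")

# Hypothetical tokens with stop words already removed
toy_tokens <- tibble(line = c(1L, 1L, 2L, 2L, 3L),
                     word = c("delicious", "burger", "delicious",
                              "staff", "friendly"))

# inner_join() keeps only the tokens found in the lexicon;
# count() then tallies each surviving word, most frequent first
toy_counts <- toy_tokens %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)
toy_counts
```

Words that are missing from the lexicon, such as "burger" and "staff" here, drop out of the result entirely, so the counts depend on which words the lexicon happens to include.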
This creates a word cloud of up to 50 of the most frequent words. The larger a word appears, the more frequently it is used in the reviews.
library(wordcloud)
## Loading required package: RColorBrewer
tidy_rests %>% anti_join(stop_words) %>% count(word) %>% with(wordcloud(word,n,max.words=50))
## Joining, by = "word"
Next, we classify the words as negative or positive using the Bing lexicon and plot them in a comparison cloud. Negative words are drawn in dark gray and positive words in light gray, so more light gray words indicate more positive reviews.
library(reshape2)
tidy_rests %>%
inner_join(get_sentiments("bing")) %>%
count(word,sentiment,sort=TRUE) %>%
acast(word~sentiment,value.var="n", fill=0) %>%
comparison.cloud(colors=c("gray20","gray80"), max.words=50)
## Joining, by = "word"
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 50): recommendation could not be fit on page. It will not be plotted.
The most frequently used positive word in these reviews is “delicious”.
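The acast() step in the comparison-cloud code reshapes the word and sentiment counts into a matrix before plotting. The sketch below shows that reshaping on made-up counts; toy_counts and toy_matrix are hypothetical names and values.

```r
library(dplyr)
library(reshape2)

# Hypothetical counts in the shape produced by count(word, sentiment)
toy_counts <- tibble(word = c("delicious", "rude", "friendly", "bland"),
                     sentiment = c("positive", "negative",
                                   "positive", "negative"),
                     n = c(10L, 2L, 2L, 1L))

# acast() pivots to a matrix with one row per word and one column
# per sentiment; fill = 0 covers word/sentiment pairs that never occur
toy_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)
toy_matrix
```

The columns come out as negative, then positive, which is why the first color passed to comparison.cloud() applies to the negative words and the second to the positive words.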
Next, we can look at bigrams, which are pairs of consecutive words. Each word is paired with the word that follows it, in the order they appear in the review.
rests_bigrams<-rests_df %>% unnest_tokens(bigram,text,token="ngrams", n=2)
rests_bigrams
## # A tibble: 2,844 x 2
## line bigram
## <int> <chr>
## 1 1 my hubbie
## 2 1 hubbie and
## 3 1 and i
## 4 1 i walked
## 5 1 walked over
## 6 1 over after
## 7 1 after a
## 8 1 a long
## 9 1 long day
## 10 1 day of
## # ... with 2,834 more rows
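A review with k words produces k - 1 bigrams, because bigrams do not span two reviews. That is consistent with the counts in this analysis: 2,894 words across 50 reviews and 2,894 - 50 = 2,844 bigrams. A small check on made-up text, where toy_df, toy_words, and toy_bigrams are hypothetical names:

```r
library(dplyr)
library(tidytext)

# Hypothetical reviews with 5 and 3 words
toy_df <- tibble(line = 1:2,
                 text = c("the staff was very friendly",
                          "great beer selection"))

toy_words   <- toy_df %>% unnest_tokens(word, text)
toy_bigrams <- toy_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

nrow(toy_words)    # 8 words in total
nrow(toy_bigrams)  # 6 bigrams: (5 - 1) + (3 - 1)
```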
We can also split each bigram into two columns, one per word, while keeping the pair on the same row. For example, in the bigram “my hubbie”, “my” goes into word1 and “hubbie” goes into word2.
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
## The following object is masked from 'package:Matrix':
##
## expand
bigrams_separated<-rests_bigrams %>% separate(bigram,c("word1","word2"), sep=" ")
bigrams_separated
## # A tibble: 2,844 x 3
## line word1 word2
## * <int> <chr> <chr>
## 1 1 my hubbie
## 2 1 hubbie and
## 3 1 and i
## 4 1 i walked
## 5 1 walked over
## 6 1 over after
## 7 1 after a
## 8 1 a long
## 9 1 long day
## 10 1 day of
## # ... with 2,834 more rows
With the bigrams separated, we can filter out stop words. This keeps only the bigrams in which neither word is a stop word, so every remaining pair is two meaningful words that appeared next to each other in a review.
# Remove bigrams that contain a stop word
bigrams_filtered<-bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_filtered
## # A tibble: 345 x 3
## line word1 word2
## <int> <chr> <chr>
## 1 1 jalapeno business
## 2 1 attentive nice
## 3 1 bit expensive
## 4 2 food nice
## 5 2 nice tap
## 6 2 tap beer
## 7 2 beer selection
## 8 3 nasty rude
## 9 3 rude staff
## 10 3 recently accosted
## # ... with 335 more rows
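One point worth checking is that filtering removes whole bigrams. It does not re-pair the non-stop words that are left over. The sketch below uses a made-up sentence; toy_df and toy_filtered are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(tidytext)

data(stop_words)

# Hypothetical review text
toy_df <- tibble(line = 1L,
                 text = "the beer selection was nice and the staff was rude")

toy_filtered <- toy_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Keep a bigram only when neither of its words is a stop word
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)
toy_filtered
```

Only "beer selection" survives. "selection" and "nice" are both non-stop words, but the stop word "was" sits between them, so they never form a bigram.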
Next, we count how many times each pair of adjacent non-stop words appears to see which pairs are used most often.
bigram_counts<-bigrams_filtered %>%
count(word1,word2,sort=TRUE)
bigram_counts
## # A tibble: 324 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 bar la 3
## 2 beer selection 3
## 3 french fries 3
## 4 la grassa 3
## 5 5 stars 2
## 6 corned beef 2
## 7 delicious food 2
## 8 deviled eggs 2
## 9 downtown minneapolis 2
## 10 hollandaise sauce 2
## # ... with 314 more rows
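“Beer selection” and “french fries” each appear three times. The pairs “bar la” and “la grassa” also appear three times, but they come from the restaurant name Bar La Grassa.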
Next, we can join each filtered pair back into a single bigram column, keeping the line number of the review it came from.
# Put the two words back together
bigrams_united<-bigrams_filtered %>%
unite(bigram,word1,word2,sep=" ")
bigrams_united
## # A tibble: 345 x 2
## line bigram
## * <int> <chr>
## 1 1 jalapeno business
## 2 1 attentive nice
## 3 1 bit expensive
## 4 2 food nice
## 5 2 nice tap
## 6 2 tap beer
## 7 2 beer selection
## 8 3 nasty rude
## 9 3 rude staff
## 10 3 recently accosted
## # ... with 335 more rows
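Once the pairs are united, the bigram column can be counted like any single word column. A minimal sketch on made-up pairs, where toy_filtered and toy_united are hypothetical names:

```r
library(dplyr)
library(tidyr)

# Hypothetical filtered bigrams from two reviews
toy_filtered <- tibble(line  = c(1L, 2L, 2L),
                       word1 = c("beer", "beer", "french"),
                       word2 = c("selection", "selection", "fries"))

# unite() pastes word1 and word2 into one column and drops the originals
toy_united <- toy_filtered %>% unite(bigram, word1, word2, sep = " ")

# Count each united bigram, most frequent first
toy_united %>% count(bigram, sort = TRUE)
```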