Homework 8

This data set includes 50 Yelp reviews of 7 different restaurants: O’Gara’s Bar and Grill, Butcher & The Boar, 112 Eatery, Hell’s Kitchen, Bar La Grassa, The Lowry, and Mama Maria’s. We will use text mining on the reviews to find the words that signal good and bad experiences.

library(textir)

## Loading required package: distrom

## Loading required package: Matrix

## Loading required package: gamlr

## Loading required package: parallel

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:distrom':
## 
##     collapse

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(tidytext)

rests<-read.csv("c:/users/abbey/Desktop/Data Mining/Yelp Reviews Fall2017.csv")
rests[1:5,]

##            Restauraunt X5StarRating
## 1 OGaras Bar And Grill            4
## 2 OGaras Bar And Grill            2
## 3 OGaras Bar And Grill            1
## 4 OGaras Bar And Grill            3
## 5 OGaras Bar And Grill            1
##                                                                                                                                                                                                                                                                                                                                                                   Review
## 1 "My hubbie and I walked over after a long day of work in and out of the house. I was hungry for a burger. I almost never have a burger. When I do I'm often disappointed. Not tonight. I had the jalapeno business. It was juicy and delicious. The waiter was attentive, nice looking too :) It's a bit expensive, but most places are for someone who usually cooks"
## 2                                                                                                                                                                                                                                                                  "It's ok. Been there three times, and underwhelmed with their food. Nice tap beer selection however."
## 3                                                                                                                                                                                                                                      "Nasty rude staff. I was recently accosted by a woman who came outside while I was taking a smoke break from my overpriced drink"
## 4                         " I had fish and chips the first time that I was there and corned beef and cabbage the second time...both were ehhh. Fish and chips were crispy though which is a plus, not greasy, and the corned beef and cabbage was a little blah. Keep in mind that this was during their week of specials, so maybe it wasn't typical of their regular "
## 5                                                                                                                                                                     "It smells like a porta potty inside. The staff comes off very dull and lifeless. But I got to meet the local drunk! That was definitely the only highlight of this place. Typical bad Irish bar."

dim(rests)

## [1] 50  3

This data set contains 50 rows and 3 columns: the name of the restaurant, the star rating the reviewer gave it, and the text of the review. We have to convert the reviews to character strings so that the tokenizer can treat them as text; read.csv may have stored them as factors, which are integer codes under the hood.

rests_df<-tibble(line=seq_len(nrow(rests)),text=as.character(rests$Review))
rests_df

## # A tibble: 50 x 2
##     line
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10
## # ... with 40 more rows, and 1 more variables: text <chr>
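
The conversion with as.character() matters when read.csv has stored the reviews as a factor, which was its default before R 4.0. A minimal sketch with made-up reviews (not taken from the Yelp file) shows what a factor holds underneath:

```r
# Made-up reviews standing in for rests$Review; a factor is what read.csv
# produces for text columns when stringsAsFactors = TRUE
review_factor <- factor(c("Great food", "Nasty staff", "Great food"))

# as.character() recovers the review text
review_text <- as.character(review_factor)

# as.integer() exposes the integer level codes stored inside the factor
review_codes <- as.integer(review_factor)

print(review_text)
print(review_codes)
```

If the column is already character, as.character() leaves it unchanged, so the call is safe either way.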

We need to split each review into its individual words, so that every word gets its own row.

tidy_rests<-rests_df %>% unnest_tokens(word,text)
tidy_rests

## # A tibble: 2,894 x 2
##     line   word
##    <int>  <chr>
##  1     1     my
##  2     1 hubbie
##  3     1    and
##  4     1      i
##  5     1 walked
##  6     1   over
##  7     1  after
##  8     1      a
##  9     1   long
## 10     1    day
## # ... with 2,884 more rows
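
To see what the tokenizer does to the text itself, here is a small sketch on two made-up reviews (the object names toy_df and toy_words are illustrative only, not part of the homework data):

```r
library(dplyr)
library(tidytext)

# Two made-up reviews, one per row, numbered like rests_df
toy_df <- tibble(line = 1:2,
                 text = c("The burger was juicy. Loved it!",
                          "Nasty, rude staff."))

# unnest_tokens() emits one row per word; by default it also
# lowercases the words and drops punctuation
toy_words <- toy_df %>% unnest_tokens(word, text)

print(toy_words)
```

The line column is carried along, so every word can still be traced back to the review it came from.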

Next, we load the stop_words dictionary that is built into tidytext. Stop words are filler words such as “in”, “and”, and “or”. Removing them from the reviews lets us focus on the more meaningful words.

data(stop_words)
tidy_rests<-tidy_rests %>% anti_join(stop_words)

## Joining, by = "word"

tidy_rests

## # A tibble: 1,040 x 2
##     line         word
##    <int>        <chr>
##  1     1       hubbie
##  2     1       walked
##  3     1          day
##  4     1        house
##  5     1       hungry
##  6     1       burger
##  7     1       burger
##  8     1 disappointed
##  9     1      tonight
## 10     1     jalapeno
## # ... with 1,030 more rows
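
The removal relies on anti_join(), which keeps only the rows of the left table that have no match in the right table. A minimal sketch, assuming a tiny hand-made stop list (toy_stops is hypothetical and much shorter than the real stop_words table):

```r
library(dplyr)

# Words from one made-up review, already tokenized
toy_words <- tibble(line = 1L,
                    word = c("the", "burger", "was", "juicy"))

# Hypothetical three-word stop list standing in for stop_words
toy_stops <- tibble(word = c("the", "was", "and"))

# Keep only the words that do NOT appear in the stop list
toy_kept <- toy_words %>% anti_join(toy_stops, by = "word")

print(toy_kept)
```

Passing by = "word" explicitly gives the same result as letting dplyr guess the column, but without the "Joining, by" message.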

First we count how many times each remaining word appears across all of the reviews. We then pull from the NRC sentiment dictionary and keep the words it tags as joy, which serve as our good words in this case. Joining the review words to that list keeps only the joy words, and we count how many times each of them appears across all of the reviews.

tidy_rests %>% count(word,sort=TRUE)

## # A tibble: 635 x 2
##           word     n
##          <chr> <int>
##  1        food    31
##  2     service    13
##  3         bar    12
##  4   delicious    10
##  5 minneapolis    10
##  6        time    10
##  7       pasta     8
##  8       staff     8
##  9        beer     6
## 10        eggs     6
## # ... with 625 more rows

## take the joy words from the NRC sentiment dictionary
rests_joy<-get_sentiments("nrc") %>%
  filter(sentiment=="joy")
rests_joy

## # A tibble: 689 x 2
##             word sentiment
##            <chr>     <chr>
##  1    absolution       joy
##  2     abundance       joy
##  3      abundant       joy
##  4      accolade       joy
##  5 accompaniment       joy
##  6    accomplish       joy
##  7  accomplished       joy
##  8       achieve       joy
##  9   achievement       joy
## 10       acrobat       joy
## # ... with 679 more rows

tidy_rests %>% inner_join(rests_joy) %>% count(word,sort=TRUE)

## Joining, by = "word"

## # A tibble: 42 x 2
##         word     n
##        <chr> <int>
##  1      food    31
##  2 delicious    10
##  3      beer     6
##  4 excellent     4
##  5     enjoy     3
##  6      glad     3
##  7    pretty     3
##  8   special     3
##  9  favorite     2
## 10  friendly     2
## # ... with 32 more rows
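
The join-then-count pattern can be sketched on toy data. The three-word toy_joy list below is a hypothetical stand-in for the NRC joy words, used so the example runs without the lexicon:

```r
library(dplyr)

# Words left after stop-word removal in some made-up reviews
toy_words <- tibble(word = c("food", "delicious", "rude",
                             "delicious", "staff"))

# Hypothetical joy list standing in for get_sentiments("nrc")
toy_joy <- tibble(word = c("delicious", "glad", "food"),
                  sentiment = "joy")

# inner_join() keeps only words found in both tables;
# count() then tallies each surviving word, most frequent first
joy_counts <- toy_words %>%
  inner_join(toy_joy, by = "word") %>%
  count(word, sort = TRUE)

print(joy_counts)
```

Words that are missing from the lexicon, such as "rude" and "staff" here, simply drop out of the result.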

Next we draw a word cloud of the 50 most frequent words. The bigger a word is drawn, the more frequently it is used in the reviews.

library(wordcloud)

## Loading required package: RColorBrewer

tidy_rests %>% anti_join(stop_words) %>% count(word) %>% with(wordcloud(word,n,max.words=50))

## Joining, by = "word"

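Next, we classify each word as negative or positive using the Bing sentiment dictionary and plot the two groups in a comparison cloud. Negative words are drawn in dark gray (gray20) and positive words in light gray (gray80), so we hope for more light gray words, which indicate more positive reviews.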

library(reshape2)
tidy_rests %>%
  inner_join(get_sentiments("bing")) %>%
  count(word,sentiment,sort=TRUE) %>%
  acast(word~sentiment,value.var="n", fill=0) %>%
  comparison.cloud(colors=c("gray20","gray80"), max.words=50)

## Joining, by = "word"

## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 50): recommendation could not be fit on page. It will not be plotted.

In these reviews, the most frequently used positive word is “delicious”.
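
The acast() step in the comparison cloud code reshapes the long table of counts into a matrix with one row per word and one column per sentiment, which is the layout comparison.cloud() expects. A minimal sketch with made-up counts (the numbers are illustrative, not taken from the reviews):

```r
library(reshape2)

# Made-up word counts in long format: one row per word/sentiment pair
toy_counts <- data.frame(word = c("delicious", "rude", "nice"),
                         sentiment = c("positive", "negative", "positive"),
                         n = c(10L, 2L, 5L))

# One row per word, one column per sentiment; combinations that
# never occur are filled with 0
toy_matrix <- acast(toy_counts, word ~ sentiment,
                    value.var = "n", fill = 0)

print(toy_matrix)
```

Because the columns come out in alphabetical order (negative, then positive), the first color passed to comparison.cloud() goes to the negative words and the second to the positive words.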

Next, we can look at bigrams, which are pairs of consecutive words that can form a phrase. Each word is paired with the word that follows it, in the order the words appear in the review.

rests_bigrams<-rests_df %>% unnest_tokens(bigram,text,token="ngrams", n=2)
rests_bigrams

## # A tibble: 2,844 x 2
##     line      bigram
##    <int>       <chr>
##  1     1   my hubbie
##  2     1  hubbie and
##  3     1       and i
##  4     1    i walked
##  5     1 walked over
##  6     1  over after
##  7     1     after a
##  8     1      a long
##  9     1    long day
## 10     1      day of
## # ... with 2,834 more rows
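
A small sketch on two made-up three-word reviews shows how the pairs overlap (toy_df and toy_bigrams are illustrative names, not part of the homework data):

```r
library(dplyr)
library(tidytext)

# Two made-up reviews of three words each
toy_df <- tibble(line = 1:2,
                 text = c("Nasty rude staff",
                          "Great beer selection"))

# token = "ngrams" with n = 2 slides a two-word window over each review
toy_bigrams <- toy_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

print(toy_bigrams)
```

Each review of three words yields two bigrams, one fewer than its word count, and no bigram spans two reviews. That matches the totals above: 2,894 words across 50 reviews give 2,844 bigrams.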

To work with the two words of a bigram individually, we can separate each bigram into two columns while still keeping the words in pairs. For example, in the bigram “my hubbie”, “my” is classified as word1 and “hubbie” is classified as word2.

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

## The following object is masked from 'package:Matrix':
## 
##     expand

bigrams_separated<-rests_bigrams %>% separate(bigram,c("word1","word2"), sep=" ")
bigrams_separated

## # A tibble: 2,844 x 3
##     line  word1  word2
##  * <int>  <chr>  <chr>
##  1     1     my hubbie
##  2     1 hubbie    and
##  3     1    and      i
##  4     1      i walked
##  5     1 walked   over
##  6     1   over  after
##  7     1  after      a
##  8     1      a   long
##  9     1   long    day
## 10     1    day     of
## # ... with 2,834 more rows

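We can use the separated bigrams to take out stop words. This keeps only the pairs in which neither word is a stop word. It does not create new pairs: the two words in every remaining bigram were still next to each other in the original review.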

## taking stop words out
bigrams_filtered<-bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigrams_filtered

## # A tibble: 345 x 3
##     line     word1     word2
##    <int>     <chr>     <chr>
##  1     1  jalapeno  business
##  2     1 attentive      nice
##  3     1       bit expensive
##  4     2      food      nice
##  5     2      nice       tap
##  6     2       tap      beer
##  7     2      beer selection
##  8     3     nasty      rude
##  9     3      rude     staff
## 10     3  recently  accosted
## # ... with 335 more rows
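
A sketch on a made-up phrase shows which pairs survive the filter. The two-word toy_stops vector is a hypothetical stand-in for stop_words$word:

```r
library(dplyr)
library(tidyr)

# Bigrams from the made-up phrase "bit expensive but most places"
toy_bigrams <- tibble(line = 1L,
                      bigram = c("bit expensive", "expensive but",
                                 "but most", "most places"))

# Hypothetical stop list standing in for stop_words$word
toy_stops <- c("but", "most")

# Split each bigram into two columns, then drop every pair
# in which either word is a stop word
toy_filtered <- toy_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% toy_stops, !word2 %in% toy_stops)

print(toy_filtered)
```

Only "bit expensive" remains. "expensive" and "places" are never paired with each other, even though only stop words sit between them.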

Next, we can count how many times each remaining pair of words occurs to tell which ones are used frequently.

bigram_counts<-bigrams_filtered %>%
  count(word1,word2,sort=TRUE)
bigram_counts

## # A tibble: 324 x 3
##          word1       word2     n
##          <chr>       <chr> <int>
##  1         bar          la     3
##  2        beer   selection     3
##  3      french       fries     3
##  4          la      grassa     3
##  5           5       stars     2
##  6      corned        beef     2
##  7   delicious        food     2
##  8     deviled        eggs     2
##  9    downtown minneapolis     2
## 10 hollandaise       sauce     2
## # ... with 314 more rows

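“Beer selection” and “french fries” were each used three times. “Bar la” and “la grassa” also appear three times, but those most likely come from the restaurant name Bar La Grassa rather than from a descriptive phrase.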

Next, we can put the two words of each filtered pair back together into a single bigram, while keeping track of which line (review) it came from.

## putting back together
bigrams_united<-bigrams_filtered %>%
  unite(bigram,word1,word2,sep=" ")
bigrams_united

## # A tibble: 345 x 2
##     line            bigram
##  * <int>             <chr>
##  1     1 jalapeno business
##  2     1    attentive nice
##  3     1     bit expensive
##  4     2         food nice
##  5     2          nice tap
##  6     2          tap beer
##  7     2    beer selection
##  8     3        nasty rude
##  9     3        rude staff
## 10     3 recently accosted
## # ... with 335 more rows
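
As a minimal sketch of this last step, unite() can be tried on a few made-up pairs (toy_pairs is an illustrative table, not the homework data):

```r
library(dplyr)
library(tidyr)

# Made-up filtered pairs with the review line they came from
toy_pairs <- tibble(line = c(2L, 2L, 3L),
                    word1 = c("tap", "beer", "rude"),
                    word2 = c("beer", "selection", "staff"))

# unite() pastes word1 and word2 into one column and drops the originals
toy_united <- toy_pairs %>%
  unite(bigram, word1, word2, sep = " ")

print(toy_united)
```

unite() is the inverse of separate(): using the same separator in both directions returns the original two columns.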

Abbey Ober

November 29, 2017