Hey there! Welcome to my project for my unstructured data class. The course focused on working with text and images to do some cool analysis. I use R, a programming language built for data analysis.
My project scrapes Yelp reviews for some of the “best” bars on the Jersey Shore. I hope you watched the TV show, because that’s what I’m going for.
My first step, which you can’t see, is loading some R packages that help me import, clean, and analyze text.
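Since that setup step is hidden, here is a rough sketch of what it might look like. The package list is my best guess based on the functions used later in this post (for example `read_html()` from rvest and `unnest_tokens()` from tidytext), so treat it as an assumption rather than the exact chunk that was run.

```r
# Packages inferred from the functions called later in this post.
# This list is an assumption; the original setup chunk is not shown.
pkgs = c("rvest", "dplyr", "stringr", "tidytext", "wordcloud2",
         "sentimentr", "lexicon", "DT", "ggplot2")

# Check which of them are installed on this machine.
installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs = pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Nothing is attached in this case; install the missing ones and rerun.
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Attach everything so read_html(), %>%, unnest_tokens(), etc. are available.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```

If anything shows up in the message, `install.packages(missing_pkgs)` should take care of it.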
And with that…
Mike “The Situation” Sorrentino, what a guy
Let’s start with Bamboo, from the actual TV show
Home of the “Jersey Turnpike”
# Download the Yelp page for the bar
Bamboo = "https://www.yelp.com/biz/the-bamboo-bar-seaside-heights"
BambooHTML = read_html(Bamboo)
# Pull the star rating out of each review's title attribute (first digit 0-5)
bbRatings = BambooHTML %>%
  html_nodes(".review-wrapper .review-content .i-stars") %>%
  html_attr("title") %>%
  stringr::str_extract("[0-5]")
# Pull the text of each review
bbReviews = BambooHTML %>%
  html_nodes(".review-wrapper .review-content p") %>%
  html_text()
# One row per review, labeled with the bar's name
BambooData = data.frame(ratings = bbRatings,
                        reviews = bbReviews,
                        restaurant = "The Bamboo Bar",
                        stringsAsFactors = FALSE)
DJais, one of our classmate’s favorite spots in Belmar, NJ
The most annoying people I went to middle school with love this place too
DJais = "https://www.yelp.com/biz/d-jais-oceanview-bar-and-cafe-belmar"
DJaisHTML = read_html(DJais)
djRatings = DJaisHTML %>%
  html_nodes(".review-wrapper .review-content .i-stars") %>%
  html_attr("title") %>%
  stringr::str_extract("[0-5]")
djReviews = DJaisHTML %>%
  html_nodes(".review-wrapper .review-content p") %>%
  html_text()
DJaisData = data.frame(ratings = djRatings,
                       reviews = djReviews,
                       restaurant = "DJais",
                       stringsAsFactors = FALSE)
Bar Anticipation, home of the legendary Beat-the-Clock special that turns Tuesdays into unwanted college reunions
BarA = "https://www.yelp.com/biz/bar-anticipation-lake-como?osq=bar+anticipation"
BarAHTML = read_html(BarA)
BarARatings = BarAHTML %>%
  html_nodes(".review-wrapper .review-content .i-stars") %>%
  html_attr("title") %>%
  stringr::str_extract("[0-5]")
BarAReviews = BarAHTML %>%
  html_nodes(".review-wrapper .review-content p") %>%
  html_text()
BarAData = data.frame(ratings = BarARatings,
                      reviews = BarAReviews,
                      restaurant = "Bar A",
                      stringsAsFactors = FALSE)
Johnny Mac’s, the best bar I’ve ever been to: you get free pizza with every drink, and the pizza is better than anything in the Midwest
JohnnyMacs = "https://www.yelp.com/biz/johnny-mac-house-of-spirits-asbury-park"
JMacsHTML = read_html(JohnnyMacs)
jmRatings = JMacsHTML %>%
  html_nodes(".review-wrapper .review-content .i-stars") %>%
  html_attr("title") %>%
  stringr::str_extract("[0-5]")
jmReviews = JMacsHTML %>%
  html_nodes(".review-wrapper .review-content p") %>%
  html_text()
jmData = data.frame(ratings = jmRatings,
                    reviews = jmReviews,
                    restaurant = "Johnny Mac's",
                    stringsAsFactors = FALSE)
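The four scraping blocks above differ only in the URL and the bar name, so they could be folded into one helper. Here is a rough sketch of that idea, assuming rvest, stringr, and dplyr are installed. `scrape_bar` and `bars` are names I made up for illustration, and the CSS selectors are the same ones used above, so they only work as long as Yelp keeps that page layout.

```r
# Hypothetical helper: scrape ratings and review text for one bar.
scrape_bar = function(url, name) {
  page = rvest::read_html(url)

  # Star rating lives in the title attribute; keep the first digit 0-5.
  stars = rvest::html_nodes(page, ".review-wrapper .review-content .i-stars")
  ratings = stringr::str_extract(rvest::html_attr(stars, "title"), "[0-5]")

  # Review text lives in the paragraph tags.
  paragraphs = rvest::html_nodes(page, ".review-wrapper .review-content p")
  reviews = rvest::html_text(paragraphs)

  data.frame(ratings = ratings,
             reviews = reviews,
             restaurant = name,
             stringsAsFactors = FALSE)
}

# Named vector: names are the bar labels, values are the Yelp URLs from above.
bars = c("The Bamboo Bar" = "https://www.yelp.com/biz/the-bamboo-bar-seaside-heights",
         "DJais" = "https://www.yelp.com/biz/d-jais-oceanview-bar-and-cafe-belmar",
         "Bar A" = "https://www.yelp.com/biz/bar-anticipation-lake-como?osq=bar+anticipation",
         "Johnny Mac's" = "https://www.yelp.com/biz/johnny-mac-house-of-spirits-asbury-park")

# Not run here because it needs a live connection to Yelp:
# allBars = dplyr::bind_rows(Map(scrape_bar, bars, names(bars)))
```

The upside is that adding a fifth bar would be one more entry in `bars` instead of another copy-pasted block.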
Combining all the separate reviews and ratings for each bar into one data frame
allReviews = dplyr::bind_rows(BambooData, DJaisData, BarAData, jmData) %>%
  dplyr::mutate(ratings = as.numeric(ratings),
                wordCount = stringr::str_count(reviews, pattern = "\\S+"))
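To show what the two text patterns are doing, here is a small base R sketch on made-up inputs. I'm assuming the star titles look like "4.0 star rating", and I'm using `regmatches()` as a stand-in for the stringr calls; one difference is that `regmatches()` drops non-matches while `str_extract()` returns `NA` for them.

```r
# Made-up inputs for illustration; the title format is an assumption.
titles = c("4.0 star rating", "1.0 star rating")
reviews = c("Great pizza, cheap drinks.", "Never again.")

# "[0-5]" grabs the first digit between 0 and 5, i.e. the whole-star rating.
ratings = as.numeric(regmatches(titles, regexpr("[0-5]", titles)))

# "\\S+" matches each run of non-whitespace characters, i.e. each word.
wordCount = lengths(regmatches(reviews, gregexpr("\\S+", reviews)))

ratings    # 4 1
wordCount  # 4 2
```

Because only the first digit is kept, a half-star title such as "3.5 star rating" would come out as 3 under this pattern.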
Let’s do a word cloud to see some of the most frequent terms
allReviews %>%
  unnest_tokens(output = word, input = reviews) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(n > 5) %>%
  na.omit() %>%
  wordcloud2(shape = "cardioid")
## Joining, by = "word"
I also tried this for DJais, but it wasn’t family friendly
Let’s move on to something more analytical: sentiment analysis
allReviews = allReviews %>%
  mutate(reviewID = 1:nrow(.)) # Add a review ID so the sentiment scores can be joined back on
reviewSentiment = sentiment(get_sentences(allReviews$reviews),
                            polarity_dt = hash_sentiment_jockers)
reviewSentiment = reviewSentiment %>%
  group_by(element_id) %>%
  summarize(meanSentiment = mean(sentiment))
allReviews = left_join(allReviews, reviewSentiment, by = c("reviewID" = "element_id"))
So everything above this was just cleaning and prepping. Below is the actually useful stuff
To explain what was done above: I took the text of each review and compared its words to a preset lexicon that scores each word as positive or negative, then averaged the sentence scores within each review
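To make the word-lookup idea concrete, here is a toy version in base R. It is a big simplification and not what `sentiment()` actually runs (the real function also handles things like negators and amplifiers). The lexicon scores and the `score_sentence` helper are made up for illustration.

```r
# Tiny made-up lexicon (hypothetical scores, not the Jockers values).
toy_lexicon = c(great = 0.75, nice = 0.5, terrible = -0.75, worst = -1)

# Hypothetical helper: score one sentence against a named vector of word scores.
score_sentence = function(sentence, lexicon) {
  # Lowercase, drop punctuation, split on whitespace.
  cleaned = gsub("[^a-z' ]", "", tolower(sentence))
  words = strsplit(cleaned, "\\s+")[[1]]
  words = words[nzchar(words)]
  if (length(words) == 0) return(0)

  # Look each word up; words missing from the lexicon count as 0.
  scores = lexicon[words]
  scores[is.na(scores)] = 0

  # Scale by sentence length so long sentences don't dominate.
  sum(scores) / sqrt(length(words))
}

score_sentence("Worst service, terrible place", toy_lexicon)  # negative
score_sentence("Great pizza", toy_lexicon)                    # positive
```

Averaging these sentence scores within a review gives something in the spirit of the `meanSentiment` column.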
allReviews %>%
  group_by(restaurant) %>%
  summarize(meanRating = mean(ratings),
            meanSentiment = mean(meanSentiment))
## # A tibble: 4 x 3
##   restaurant     meanRating meanSentiment
##   <chr>               <dbl>         <dbl>
## 1 Bar A                1.85       -0.0107
## 2 DJais                2.5         0.0431
## 3 Johnny Mac's         2.95        0.116
## 4 The Bamboo Bar       2.35        0.0152
Looking at the ratings and sentiment, there is a bit of a pattern: the bars with higher average ratings also have more positive review sentiment. Obviously we want higher values for both
I thought there would be more of a gap between Johnny Mac’s and everyone else, but I guess my taste sucks too
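One quick way to put a number on that pattern is a rank correlation on the four per-bar averages. This is only a sketch: the values are copied from the summary table above, `barSummary` and `rankCor` are names I made up, and with just four bars it is more of a sanity check than real evidence.

```r
# Per-bar averages copied from the summary table above.
barSummary = data.frame(
  restaurant = c("Bar A", "DJais", "Johnny Mac's", "The Bamboo Bar"),
  meanRating = c(1.85, 2.5, 2.95, 2.35),
  meanSentiment = c(-0.0107, 0.0431, 0.116, 0.0152)
)

# Rank correlation: do the bars sort the same way on both measures?
rankCor = cor(barSummary$meanRating, barSummary$meanSentiment,
              method = "spearman")
rankCor  # 1: the two rankings agree exactly

# With the full data, the review-level version would be (not run here):
# cor(allReviews$ratings, allReviews$meanSentiment, use = "complete.obs")
```

The review-level correlation would be the more honest check, since averaging hides a lot of noise.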
Anyhow, let’s make things visual. The first trick is a nice table of the reviews with word counts, ratings, and sentiment, where each review is shaded by its sentiment score (red for negative, blue for positive)
If the table doesn’t load, don’t worry about it; it wasn’t that cool anyway
sentimentBreaks = c(-.5, 0, .5)
breakColors = c('rgb(178,24,43)', 'rgb(239,138,98)', 'rgb(103,169,207)', 'rgb(33,102,172)')
datatable(allReviews, rownames = FALSE) %>%
  formatStyle("reviews", "meanSentiment",
              backgroundColor = styleInterval(sentimentBreaks, breakColors))
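One detail worth spelling out: `styleInterval()` needs one more color than it has breaks, because three cut points split the sentiment scale into four bins. Here is a rough base R sketch of that mapping using `cut()`. `color_for` is a made-up helper, and I'm assuming each break is the upper edge of its bin, which is how I read the DT documentation.

```r
sentimentBreaks = c(-.5, 0, .5)
breakColors = c('rgb(178,24,43)', 'rgb(239,138,98)',
                'rgb(103,169,207)', 'rgb(33,102,172)')

# Hypothetical helper: which color would a given sentiment score get?
color_for = function(x) {
  # Bins: (-Inf, -0.5], (-0.5, 0], (0, 0.5], (0.5, Inf)
  as.character(cut(x,
                   breaks = c(-Inf, sentimentBreaks, Inf),
                   labels = breakColors))
}

color_for(-0.1158715)  # "rgb(239,138,98)", the light red bin
color_for(0.116)       # "rgb(103,169,207)", the light blue bin
```

Since the average sentiment scores in this data sit close to zero, most reviews probably land in the two middle colors.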
This next graph just plots sentiment vs ratings, which we expect to be related as we saw before
ggplot(allReviews, aes(ratings, meanSentiment, color = restaurant)) +
  geom_point() +
  theme_minimal()
This second one looks at sentiment vs word count
The lines show the linear trend for each bar (so the more people write, the less they liked JMac’s)
If you look really close you can see that some person wrote the Bible for The Bamboo Bar in a review
ggplot(allReviews, aes(wordCount, meanSentiment, color = restaurant)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
For shiggles let’s see what was going on in that long review
I apologize for the massive amount of whitespace
allReviews[which.max(allReviews$wordCount),]
## ratings
## 9 1
## reviews
## 9 had the worst experience ever here. Such terrible service and it's not even a great place. First of all, I am 21 years of age and they wouldn't let me in cause I didn't have a drivers license (I don't like to drive it gives me anxiety and I have recurring seizures). My workers ID and credit cards WITH my pictures and name on them apparently were not good enough either. Second of all my friend who went inside who is 20 got kicked out. She paid $15 cover charge to get in (which doesn't make sense at all) got kicked out because they stuck her wrist band on to her skin instead of the actual paper and when she washed her hands and when it got wet it was sticking to her and irritating her skin (plus she was also sweating so it was giving her a rash) so she took it off and put it in her pocket. Because of this she got FORCIBLY removed from the club as if she was drinking or doing something illegal and terrible. The bouncers and the manager were UNFRIENDLY and do not have manners with people at all. I will not be attending again and I'll make sure no one I know does either. The only positive experience was that Gary the bouncer on the other side of bamboo (the bar side) was extremely nice and helped us get an uber because oh yea they kicked us out IN THE RAIN and didn't give us an option to get an uber and then leave. Horrible place DO NOT GO!!!! Also, they had so many people in there using fake IDs and drinking alcohol yet somehow we were attacked. My friends witnessed the manager send the bouncer over to specifically target us. AND the woman who was holding the IDS (which is a stupid policy) told us "he wanted us out from the start and didn't like us because we questioned why he kept the IDs". seems like that's a little childish for a manager. Again will not be returning, except to spit in the managers face.
## restaurant wordCount reviewID meanSentiment
## 9 The Bamboo Bar 355 9 -0.1158715
This person needs a hobby besides yelping
Thanks for reading!
KLove