Project: Data Wrangling in R

Yelp Restaurant Reviews Analysis

Yelp.com is a popular website on which customers can rate businesses and write about their experiences. This analysis examines the reviews posted by users for various types of restaurants located in Las Vegas.

Overview

Problem Statement

Analyse the ratings of different types of restaurants in Las Vegas, NV. This includes sentiment analysis of the posted reviews and exploratory analysis of the different types of restaurants based on their attributes.

Both analyses aim to provide useful information to consumers as well as business owners. Business owners can use the analysis to improve existing features or add ones that their establishment currently lacks. For consumers of the website, the analysis will make it easier to decide on their next place to visit in the city.

Methodology

The data for this analysis has been downloaded from the Yelp website. It will be used to visualise trends across different attributes and to determine popular opinion. A text mining approach will be used to examine sentiment based on the comments written in the reviews. The analysis will help users of the website quickly find a restaurant that matches their interests by eliminating the need to read through the reviews thoroughly.

Packages Required

The following packages will be used for the analysis:

tidyverse: Package of multiple R packages used for data manipulation
stringr: String operations in R
ggplot2: Data visualisation in R
dplyr: Easy functions to perform data manipulation in R
tm: Framework for common text mining applications in R
jsonlite: Package to read JSON formatted files
NLP: Package for basic classes and methods for Natural Language Processing
DT: Package to put data objects in R as HTML tables
lubridate: Package to manipulate dates and times
tidytext: Package for text mining for word processing and sentiment analysis
wordcloud: Package to generate word clouds
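igraph: Package for network analysis and visualization
ggraph: Package for plotting graphs and networks in the ggplot2 style
widyr: Package to widen, process and re-tidy data
leaflet: Package to create and customize interactive maps
ggmap: Package for spatial visualisation with ggplot2
scales: Package providing graphical scales that map data to aesthetics, along with label formatters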

## Function to install (if needed) and load multiple packages at once in R

install_load <- function(pack) {
  ## Check which of the packages have not been installed previously
  new_pack_load <- pack[!(pack %in% installed.packages()[, "Package"])]
  ## Install only the missing packages
  if (length(new_pack_load)) {
    install.packages(new_pack_load, dependencies = TRUE)
  }
  ## Load all the packages; returns TRUE/FALSE for each package
  sapply(pack, require, character.only = TRUE)
}

package_load <- c("ggplot2", "dplyr", "tidyverse", "NLP", "tm", "stringr", "jsonlite", "DT", "lubridate", "tidytext", "wordcloud", "igraph", "ggraph","widyr", "ggmap", "leaflet", "scales")
install_load(package_load)

Data Preparation

This section explains the steps taken to prepare the data for analysis.

Source of Data

Since 2014, Yelp.com has been organising the Yelp Dataset Challenge, which invites research and analysis on the large volume of business data available on its website. Students use the challenge to uncover new insights from the data and even to publish or present their results at major conferences. More information about the challenge can be found at this link.

The data for this analysis has been obtained from the Yelp.com website. The datasets can be accessed at this link. They can be downloaded either as a single SQL database file or as multiple JSON-formatted files. For this analysis, the JSON-formatted files were downloaded.

Data Description

The downloaded data contains information for 156,000 businesses in 12 metropolitan areas and 4,700,000 reviews provided by 1,100,000 users. The data consists of six different tables: business, review, user, checkin, tip and photos. These tables are described below:

business.json : Contains business data including location, attributes and categories
review.json : Contains user text review data for businesses
user.json : Contains user information data
checkin.json : Contains checkin data on a business
tip.json : Contains tips written by a user on a business. Tips are shorter than reviews and convey quick suggestions.
photos.json : Contains the photo_id used to link to the pictures (stored in a separate dataset) posted for a business on the website

Further information regarding the data can be found at this link.

For this analysis, the focus is to examine only restaurant category businesses in Las Vegas, NV.

Data Import

This section describes how the data was imported and filtered for analysis.

Import

For the analysis, the following datasets were used:

business.json: The primary key for the data is business_id
review.json: The primary key for the data is the combination of variables business_id and review_id

Initially, the stream_in() function from the jsonlite package was used to import these datasets into R. Once loaded, each dataset was saved in RData format. RData files are quicker to read in R and therefore save time when the datasets are reloaded for later exploratory analysis. The code to perform this task is shown below:

yelp_biz <- stream_in(file("C:/Users/rohit/Documents/Fall_2017/Data_Wrangling_R/Project_data/yelp_dataset.tar/dataset/business.json"))
save(yelp_biz, file = "yelp_biz.RData")

yelp_review <- stream_in(file("C:/Users/rohit/Documents/Fall_2017/Data_Wrangling_R/Project_data/yelp_dataset.tar/dataset/review.json"))
save(yelp_review, file = "yelp_review.RData")
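
As a minimal sketch of the same import, save and reload cycle, the example below writes a hypothetical two-record file in the newline-delimited JSON layout that stream_in() reads, saves the result as RData and restores it with load(). The sample records, the temporary file paths and the object name toy_biz are assumptions made for illustration only.

```r
library(jsonlite)

# Hypothetical two-record sample, one JSON object per line
json_path <- tempfile(fileext = ".json")
writeLines(c('{"business_id":"b1","stars":4.5,"city":"Las Vegas"}',
             '{"business_id":"b2","stars":3.0,"city":"Phoenix"}'),
           json_path)

# Read the line-delimited JSON file into a data frame
toy_biz <- stream_in(file(json_path), verbose = FALSE)

# Save the data frame in RData format
rdata_path <- tempfile(fileext = ".RData")
save(toy_biz, file = rdata_path)

# In a later session the object is restored under its original name
rm(toy_biz)
load(rdata_path)
dim(toy_biz)
```

Note that load() restores the object under the name it was saved with, so the name used in save() is the name available after reloading.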

Filter

Filtering variables

There are 210 variables in the yelp_biz data, containing information such as name, city, rating, review count, attributes and categories. For the analysis, only the variables of interest have been kept. The categories variable is used only to identify restaurant-category businesses in the data.

#### keeping the required variables for business data
yelp_biz_f <- data.frame(yelp_biz$business_id, yelp_biz$name, yelp_biz$neighborhood, yelp_biz$address,
                       yelp_biz$city, yelp_biz$state, yelp_biz$postal_code, yelp_biz$latitude,
                       yelp_biz$longitude, yelp_biz$stars, yelp_biz$review_count, yelp_biz$is_open,
                       yelp_biz$attributes$RestaurantsPriceRange2, yelp_biz$attributes$WheelchairAccessible,
                       yelp_biz$attributes$GoodForMeal$dessert, yelp_biz$attributes$GoodForMeal$latenight,
                       yelp_biz$attributes$GoodForMeal$lunch, yelp_biz$attributes$GoodForMeal$dinner,
                       yelp_biz$attributes$GoodForMeal$breakfast, yelp_biz$attributes$GoodForMeal$brunch,
                       yelp_biz$attributes$RestaurantsGoodForGroups, yelp_biz$attributes$NoiseLevel,
                       yelp_biz$attributes$RestaurantsAttire, yelp_biz$attributes$RestaurantsReservations,
                       yelp_biz$attributes$OutdoorSeating, yelp_biz$attributes$BusinessAcceptsCreditCards,
                       yelp_biz$attributes$RestaurantsDelivery, yelp_biz$attributes$Ambience$romantic,
                       yelp_biz$attributes$Ambience$intimate, yelp_biz$attributes$Ambience$classy,
                       yelp_biz$attributes$Ambience$hipster, yelp_biz$attributes$Ambience$divey,
                       yelp_biz$attributes$Ambience$touristy, yelp_biz$attributes$Ambience$trendy,
                       yelp_biz$attributes$Ambience$upscale, yelp_biz$attributes$Ambience$casual,
                       yelp_biz$attributes$RestaurantsTakeOut, yelp_biz$attributes$GoodForKids,
                       yelp_biz$attributes$WiFi, yelp_biz$attributes$RestaurantsTableService,
                       yelp_biz$attributes$Alcohol, yelp_biz$attributes$Caters, yelp_biz$attributes$DogsAllowed,
                       yelp_biz$attributes$Music$dj, yelp_biz$attributes$Music$background_music,
                       yelp_biz$attributes$Music$no_music, yelp_biz$attributes$Music$karaoke,
                       yelp_biz$attributes$Music$live, yelp_biz$attributes$Music$video, 
                       yelp_biz$attributes$Music$jukebox, yelp_biz$attributes$HappyHour,
                       yelp_biz$attributes$GoodForDancing, yelp_biz$attributes$DriveThru,
                       yelp_biz$attributes$Smoking, yelp_biz$attributes$BestNights$monday,
                       yelp_biz$attributes$BestNights$tuesday, yelp_biz$attributes$BestNights$friday,
                       yelp_biz$attributes$BestNights$wednesday, yelp_biz$attributes$BestNights$thursday,
                       yelp_biz$attributes$BestNights$sunday, yelp_biz$attributes$BestNights$saturday,
                       yelp_biz$attributes$DietaryRestrictions$`dairy-free`, 
                       yelp_biz$attributes$DietaryRestrictions$`gluten-free`,
                       yelp_biz$attributes$DietaryRestrictions$vegan,
                       yelp_biz$attributes$DietaryRestrictions$kosher,
                       yelp_biz$attributes$DietaryRestrictions$halal,
                       yelp_biz$attributes$DietaryRestrictions$`soy-free`,
                       yelp_biz$attributes$DietaryRestrictions$vegetarian,
                       yelp_biz$attributes$Open24Hours,
                       stringsAsFactors = FALSE)
#### removing the initial 9 characters from the column names, as all these names start with the text "yelp_biz."
names(yelp_biz_f) <- substring(names(yelp_biz_f), 10)
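
The naming behaviour behind this step can be seen on a small example. The sketch below builds a hypothetical nested data frame called toy, shows the column names that data.frame() generates from expressions such as toy$attributes$WiFi, and strips the prefix with a regular expression, which is an alternative to substring() that does not depend on counting characters. All names and values here are illustrative assumptions.

```r
# Hypothetical nested data frame, similar in shape to what stream_in() returns
toy <- data.frame(business_id = c("b1", "b2"), stringsAsFactors = FALSE)
toy$attributes <- data.frame(WiFi = c("free", NA), Caters = c(TRUE, FALSE),
                             stringsAsFactors = FALSE)

# Unnamed arguments are named after their expressions, with "$" turned into "."
toy_f <- data.frame(toy$business_id, toy$attributes$WiFi, toy$attributes$Caters,
                    stringsAsFactors = FALSE)
prefixed_names <- names(toy_f)

# Removing the leading "toy." prefix regardless of its length
names(toy_f) <- sub("^toy\\.", "", names(toy_f))
names(toy_f)
```

The nested attribute columns keep the "attributes." part of their names after the prefix is removed.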

Filtering data

The business data is now filtered to keep only the observations for restaurant-category businesses in the city of Las Vegas. After this operation, 8,703 restaurants are available for analysis. This data will be used for exploratory analysis. The yelp_review data is then filtered to keep the text reviews for the restaurants obtained in the previous step. This data will be used for sentiment analysis. The resulting yelp_review_vegas data contains 1,009,036 reviews for the 8,703 restaurants.

#### Categories of business for the analysis
catg <- c("Restaurant","Food", "Pubs", "Nightlife", "Bars", "Dance Clubs", "Lounges", "Dive Bars", "Sports Bars")

#### Filtering the business data for the restaurant category
# Variable 'categories' is of the type list in the original data and thus, it is first converted to a character variable using the mutate function in the dplyr package 
yelp_biz_catg <-  yelp_biz %>%
                  select(business_id, categories) %>% 
                  mutate(categories = as.character(categories)) %>%  
                  filter(str_detect(categories, paste(catg, collapse = '|')))

#### Filtering business data for Las Vegas. There are 8,703 restaurants.
yelp_biz_vegas <- yelp_biz_f %>%
                  semi_join(yelp_biz_catg, by = "business_id") %>%
                  filter(str_detect(city, "Las Vegas"))

#### Dimension of the business data
#dim(yelp_biz_vegas)

#### Filtering the review data for Las Vegas restaurants. There are 1,009,036 text reviews for 8,703 restaurants.
yelp_review_vegas <- yelp_review %>%
                     semi_join(yelp_biz_vegas, by = "business_id")

#### Dimension of the review data
#dim(yelp_review_vegas)
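
The filtering logic above can be traced on a few rows. The following sketch uses hypothetical toy tables (toy_biz, toy_review) and a shortened category list to show how the list column is converted to text, matched against a combined pattern and then used in semi_join(). It is an illustration under these assumptions, not the project data.

```r
library(dplyr)
library(stringr)

# Hypothetical businesses; 'categories' is a list column as in the imported data
toy_biz <- tibble(
  business_id = c("b1", "b2", "b3"),
  city        = c("Las Vegas", "Las Vegas", "Phoenix"),
  categories  = list(c("Restaurants", "Thai"), "Dentists", "Bars")
)
toy_review <- tibble(review_id   = c("r1", "r2", "r3", "r4"),
                     business_id = c("b1", "b1", "b2", "b3"))

toy_catg <- c("Restaurant", "Bars")

# Businesses whose categories contain any of the search terms
toy_biz_catg <- toy_biz %>%
  mutate(categories = as.character(categories)) %>%
  filter(str_detect(categories, paste(toy_catg, collapse = "|")))

# Keep matching businesses located in Las Vegas
toy_biz_vegas <- toy_biz %>%
  semi_join(toy_biz_catg, by = "business_id") %>%
  filter(str_detect(city, "Las Vegas"))

# Keep only the reviews written for those businesses
toy_review_vegas <- toy_review %>%
  semi_join(toy_biz_vegas, by = "business_id")
```

Because str_detect() matches substrings, a term such as "Restaurant" also matches "Restaurants" and any other category containing that text, so broad terms can pull in more businesses than intended.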

Data Cleaning

This section describes the data cleaning process and then previews the cleaned data.

Steps followed

To clean the data in both yelp_biz_vegas and yelp_review_vegas, the following steps were followed:

Determine the count of missing values in each column of the dataset
Filter out columns in which 40% or more of the values are missing
Replace the missing values in the remaining columns
Check for outliers in the numeric variables
Treat any outliers in the data

Upon checking the summary of all the variables in the data, it was found that the logical columns in the yelp_biz_vegas data required missing value treatment. The rest of the variables were structured properly.

The missing values in these logical columns were replaced with “FALSE”. Further, no missing values were observed for the yelp_review_vegas data.

#### cleaning the data
## Determining the percentage of missing values in the business data
#sort(colMeans(is.na(yelp_biz_vegas)), decreasing = TRUE)

## Keeping only the variables which have less than 40% missing values in them
keep_vars <- (colMeans(is.na(yelp_biz_vegas))) < 0.4

## Selecting the list of final columns for analysis in business data
yelp_biz_vegas_2 <- yelp_biz_vegas %>% select(names(keep_vars[keep_vars == TRUE]))

## Replacing the missing values in logical columns with "FALSE"
# Splitting off the columns having logical values (TRUE/FALSE) and keeping the "business_id" column in both parts, so that the split datasets can be combined again after cleaning
col_name_logical <- sapply(yelp_biz_vegas_2, is.logical)

yelp_biz_vegas_bl <- yelp_biz_vegas_2 %>% 
                     select(business_id, names(col_name_logical[col_name_logical == FALSE])) %>%
                     inner_join(yelp_biz_vegas_2 %>% 
                                  select(business_id, names(col_name_logical[col_name_logical == TRUE])) %>%
                                  # replacing NA with the string "FALSE" converts these columns to character
                                  mutate_all(~ replace(., is.na(.), "FALSE")),
                                by = "business_id")

## Determining the percentage of missing values in the review data
#sort(colMeans(is.na(yelp_review_vegas)), decreasing = TRUE)
# No missing value treatment required for review data

## Converting the string formatted date variable to date format
yelp_review_vegas$date <- ymd(yelp_review_vegas$date)

## Extracting review year from the date
yelp_review_vegas$year_review <- year(yelp_review_vegas$date)

## Removing the datasets that are no longer required
rm(yelp_biz, yelp_review, yelp_biz_f, yelp_biz_catg, yelp_biz_vegas, yelp_biz_vegas_2)
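
A compact variant of the missing-value steps is sketched below on a hypothetical toy table. It assumes dplyr 1.0.0 or later (for across() and where()) and, unlike the code above, fills the gaps with the logical value FALSE so the columns keep their logical type. The column names and values are invented for illustration.

```r
library(dplyr)

# Hypothetical business table with missing values
toy <- tibble(
  business_id = c("b1", "b2", "b3", "b4", "b5"),
  takeout     = c(TRUE, NA, FALSE, TRUE, FALSE),  # 20% missing
  open24      = c(NA, NA, NA, TRUE, NA),          # 80% missing
  stars       = c(4, 3.5, 5, 2, 4.5)
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Keep the columns with less than 40% missing values
keep <- names(na_share)[na_share < 0.4]

# Replace the remaining NA values in logical columns with FALSE
toy_clean <- toy %>%
  select(all_of(keep)) %>%
  mutate(across(where(is.logical), ~ replace(.x, is.na(.x), FALSE)))
```

Whether a missing attribute should really be read as FALSE is a modelling choice; keeping NA as a separate level is another option when the distinction matters.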

Preview of Final Data

A sample preview of the cleaned data is shown below:

head(yelp_biz_vegas_bl,200) %>%
  datatable(caption = "Table 1: Restaurants in Las Vegas")

head(yelp_review_vegas, 200) %>%
  datatable(caption = "Table 2: Text reviews of the restaurants")

Data Summary

Table Summary

The summary of both the datasets is as follows:

Restaurant Information Data

Information	Restaurant Dataset
Number of rows	8,703
Number of columns	39
Date Range	April 19, 2004 to July 26, 2017
Number of numerical variables	9
Number of character variables	33

Text Review Data

Information	Review Dataset
Number of rows	1,009,036
Number of columns	9
Date Range	April 19, 2004 to July 26, 2017
Number of numerical variables	5
Number of character variables	4
Number of date variables	1
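
Figures of this kind can be derived directly from a data frame. The sketch below computes the dimensions, the count of columns per type and the date range for a small hypothetical table; the toy values are assumptions and are unrelated to the numbers reported above.

```r
# Hypothetical review table
toy <- data.frame(
  stars  = c(4, 3, 5),
  useful = c(0L, 2L, 1L),
  text   = c("good", "ok", "great"),
  date   = as.Date(c("2014-01-05", "2015-06-30", "2016-12-31")),
  stringsAsFactors = FALSE
)

# Number of rows and columns
toy_dims <- dim(toy)

# Number of columns by (first) class
type_counts <- table(sapply(toy, function(col) class(col)[1]))

# Earliest and latest date
date_range <- range(toy$date)
```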

Variable Summary

Summary of some of the important variables in the data is given below:

Variable Name	Dataset	Remarks
`stars`	`yelp_biz_vegas_bl`	Provides the rating of the restaurant. Minimum = 0, Median = 3.5, Maximum = 5
`review_count`	`yelp_biz_vegas_bl`	Count of reviews for a particular restaurant. Minimum = 3, Median = 30, Maximum = 6979
`latitude`	`yelp_biz_vegas_bl`	Latitude coordinate of the restaurant. Will be used to plot the location on the map.
`longitude`	`yelp_biz_vegas_bl`	Longitude coordinate of the restaurant. Will be used to plot the location on the map.
`attributes.RestaurantsPriceRange2`	`yelp_biz_vegas_bl`	Categorized price range of the restaurant
`stars`	`yelp_review_vegas`	Star rating of the review posted
`text`	`yelp_review_vegas`	Text of the review posted
`useful`	`yelp_review_vegas`	Count of users that found the particular review useful. Minimum = 0, Median = 0, Maximum = 168
`funny`	`yelp_review_vegas`	Count of users that found the particular review funny. Minimum = 0, Median = 0, Maximum = 154
`cool`	`yelp_review_vegas`	Count of users that found the particular review cool. Minimum = 0, Median = 0, Maximum = 156

Data Analysis

This section explores useful insights derived from the data and presents them visually. For this purpose, the data has been sliced across different parameters and new variables have been created to gather information.

Time Trend of Reviews

A time series trend of the reviews posted on the website is analyzed here. These reviews were posted from the year 2004 to the year 2017. For 2017, the data covers only part of the year (the latest review is dated July 26, 2017), so the analysis in this section does not include reviews posted in 2017. Further, there are approximately 7,000 different neighborhoods containing all the restaurants in Las Vegas, and for this analysis only the ten most popular neighborhoods have been chosen. The figures below present the count of reviews posted across the years and the count of different restaurants that opened across the years.

## To know the range of year of reviews in the data
#range(yelp_review_vegas$year_review)

## Filtering data to have observations where neighborhood is populated
not_missing_neighrhood <- yelp_biz_vegas_bl %>%
                            filter(str_detect(neighborhood, "[a-z']$")) %>%
                            select(business_id, neighborhood)

## Keeping text reviews for these neighborhoods only
yelp_review_vegas_neighr <- yelp_review_vegas %>%
                              inner_join(not_missing_neighrhood, by = "business_id")

## Top 10 popular neighborhood, by count of reviews, from 2004 to 2016
top_ten_neighrhd_count <- yelp_review_vegas_neighr %>%
                            filter(year_review %in% c(2004:2016)) %>%
                            group_by(neighborhood) %>%
                            summarise(n = n()) %>%
                            arrange(desc(n)) %>%
                            top_n(10, n)

## Summarizing the number of reviews by neighborhood and year of review
count_review_neighrhd_year <- yelp_review_vegas_neighr %>%
                                filter(year_review %in% c(2004:2016)) %>%
                                group_by(neighborhood, year_review) %>%
                                summarise(n = n()) %>%
                                semi_join(top_ten_neighrhd_count, by = "neighborhood")

## Plot of number of reviews across different years by neighborhood
ggplot(data = count_review_neighrhd_year, aes(x = year_review, y = n, col = neighborhood)) + 
  geom_line(size = 1.5) +
  labs(title = "Time series of count of reviews for the top ten popular neighborhoods",
        x = "Year of review", y = "Number of reviews") +
  theme_light() + theme(panel.grid.minor = element_blank())  ## theme_light() comes first so that it does not reset the grid setting

The figure above shows how the number of reviews posted on the website has risen over the years. In particular, The Strip neighborhood has seen a large increase in the number of reviews posted for its restaurants.
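
The aggregation behind this plot can be followed on a handful of rows. The sketch below uses a hypothetical toy table and keeps the top two neighborhoods instead of ten; count() is used as a shorthand for group_by() followed by summarise(n = n()).

```r
library(dplyr)

# Hypothetical reviews with neighborhood and review year
toy_reviews <- tibble(
  neighborhood = c("The Strip", "The Strip", "The Strip",
                   "Downtown", "Downtown", "Chinatown"),
  year_review  = c(2015, 2016, 2016, 2015, 2016, 2016)
)

# Neighborhoods ranked by number of reviews; keep the top two
top_two <- toy_reviews %>%
  count(neighborhood, sort = TRUE) %>%
  head(2)

# Reviews per neighborhood and year, restricted to the top neighborhoods
yearly_counts <- toy_reviews %>%
  count(neighborhood, year_review) %>%
  semi_join(top_two, by = "neighborhood")
```

Unlike top_n(), head() does not keep ties, so the two approaches can return a different number of neighborhoods when counts are equal at the cut-off.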

## Number of distinct businesses reviewed in each year (used here as a proxy for businesses opened) and the average rating of these businesses across years
count_business_neighrhd_year <- yelp_review_vegas_neighr %>%
                                filter(year_review %in% c(2004:2016)) %>%
                                group_by(neighborhood, year_review) %>%
                                summarise(n = n_distinct(business_id), average_star_rating = mean(stars)) %>%
                                semi_join(top_ten_neighrhd_count, by = "neighborhood")

## Horizontal bar plot for the analysis
## (coord_flip() draws the year on the vertical axis; labs() still refers to the original x and y aesthetics)
ggplot(data = count_business_neighrhd_year, aes(x = as.factor(year_review), y = n)) + 
  facet_wrap(~ neighborhood) +
  geom_bar(stat = 'identity', aes(fill = average_star_rating), width = 0.75) +
  coord_flip() +
  labs(title = "Number of restaurants opened across the top ten popular neighborhoods",
       x = "Year of Opening", y = "Number of restaurants") +
  theme_light() + theme(panel.grid.minor = element_blank())

## removing datasets to free the memory
rm(yelp_review_vegas_neighr, top_ten_neighrhd_count, count_review_neighrhd_year, count_business_neighrhd_year)

The figure above shows the number of restaurants opened across the years in these top ten neighborhoods. It is observed that there are a lot of restaurants in The Strip neighborhood, which explains the huge increase in the number of reviews being posted on the website.
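
The review data carries no opening date, so any count of restaurants opened has to be approximated. The sketch below contrasts two possible measures on a hypothetical toy table: the number of distinct restaurants reviewed in a year, and the number of restaurants whose first review falls in that year. Treating the first review year as the opening year is an assumption made here for illustration.

```r
library(dplyr)

# Hypothetical reviews: one row per review
toy_reviews <- tibble(
  business_id = c("b1", "b1", "b2", "b2", "b3"),
  year_review = c(2014, 2015, 2015, 2016, 2016)
)

# Distinct restaurants that received at least one review in each year
reviewed_per_year <- toy_reviews %>%
  group_by(year_review) %>%
  summarise(n_reviewed = n_distinct(business_id))

# Restaurants counted once, in the year of their first review
first_seen_per_year <- toy_reviews %>%
  group_by(business_id) %>%
  summarise(first_year = min(year_review)) %>%
  count(first_year, name = "n_first_seen")
```

The first measure counts a restaurant again in every year it is reviewed, while the second counts each restaurant only once.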

Summary The reviews available to us were posted from the year 2004 onwards, and the years 2004 to 2016 were analyzed above. That represents a huge set of reviews to analyze. As the plots above show, there has been a surge in the number of reviews for various restaurants in recent years and thus, for further analysis in the project, only the reviews posted between the year 2014 and the year 2016 have been selected.

Trending Words across Reviews

This section analyses the words that were trending in the reviews. This helps to determine how these words contribute to the star rating of the reviews posted for a restaurant and how different users react to the reviews.

## Filtering the text review data for years between 2014 and 2016
yelp_review_vegas_ft <- yelp_review_vegas %>% filter(year_review >= 2014, year_review <= 2016)

## Saving the data to disk, as it is removed from memory later in this part due to memory constraints and reloaded when needed
save(yelp_review_vegas_ft, file = "yelp_review_vegas_ft.RData")

## Removing this data to free up the system memory
rm(yelp_review_vegas)

## Unnesting the text into one word per row to form a tidy dataset
unnest_text_review <- yelp_review_vegas_ft %>%
                        select(review_id, text) %>%
                        unnest_tokens(word, text) 

## Saving the data to disk, as it is removed from memory later in this part due to memory constraints and reloaded when needed
save(unnest_text_review, file = "unnest_text_review.RData")

## Filtering out common words and words that don't end with a letter or an apostrophe. Counting the number of occurrences of each word in the resulting data.
unnest_text_review_word_cnt <- unnest_text_review %>%
                                anti_join(stop_words, by = "word") %>%
                                filter(str_detect(word, "[a-z']$")) %>%
                                count(word) 

## Selecting top 130 words to form the word cloud
word_cloud <- unnest_text_review_word_cnt %>% 
                arrange(desc(n)) %>%
                top_n(130, n) 
word_cloud %>%
  with(wordcloud(word, n, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2")))

The figure above shows the most frequently used words in the posted reviews. As can be observed from the figure, most of the top trending words are positive in sentiment. Next, we analyse how the top two words are related to other words in the text reviews.
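
The word counts behind the word cloud can be reproduced on a small scale. The sketch below tokenizes two hypothetical reviews with unnest_tokens(), drops the entries of the stop_words dataset from tidytext and counts what remains. The review texts are invented for illustration.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical reviews
toy_reviews <- tibble(
  review_id = c("r1", "r2"),
  text      = c("The food was delicious and the service was quick",
                "Food arrived cold after 45 minutes")
)

toy_word_counts <- toy_reviews %>%
  unnest_tokens(word, text) %>%                 # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%        # drop common words
  filter(str_detect(word, "[a-z']$")) %>%       # drop tokens such as "45"
  count(word, sort = TRUE)
```

The stop_words dataset combines several lexicons and includes some evaluative words, so it is worth inspecting which words are removed before interpreting the most frequent terms.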

## Selecting top 2 words from the word cloud to do Bi-gram analysis, i.e. analysis using pair of words as a combination
top_two_text_words <- word_cloud %>%
                        top_n(2, n) 

## Selecting only the review id's associated with these top two words
review_ids <- unnest_text_review %>%
                filter(word %in% c(top_two_text_words$word)) %>%
                distinct(review_id)

## List of words to be removed from the bi-gram analysis
remove_cusine <- tibble(word = c("chinese", "dog", "indian", "mexican", "asian", "italian", "thai", "korean", "japanese"))

## removing datasets to free the memory (yelp_review_vegas_neighr was already removed in the previous part)
rm(unnest_text_review, unnest_text_review_word_cnt, word_cloud)

## Tokenizing the text into pairs of consecutive words for Bi-gram analysis and removing common words from the analysis
bi_nested_text_review <- yelp_review_vegas_ft %>%
                            semi_join(review_ids, by = "review_id") %>%
                            select(review_id, text) %>%
                            unnest_tokens(word, text, token = "ngrams", n = 2) %>%
                            separate(word, c("word1", "word2"), sep = " ") %>%
                            filter(word2 %in% c(top_two_text_words$word)) %>%
                            filter(!word1 %in% c(stop_words$word,remove_cusine$word))

## Count using the pair of words as a combination
bigram_counts <- bi_nested_text_review %>% 
                    count(word1, word2, sort = TRUE)

## Displaying the bi-gram connection for the top 30 pairs
#head(bigram_counts$n, n = 30) ## provides the cut-off value for 30 pairs

bigram_graph <- bigram_counts %>%
                  filter(n >= 760) %>%
                  graph_from_data_frame()

## Setting seed to replicate the graph
set.seed(24)

## Visual parameter setting
a_size <- grid::arrow(type = "closed", length = unit(.1, "inches"))

## Visual representation of connection of pair of words
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = TRUE,
                 arrow = a_size, end_cap = circle(.1, 'inches')) +
  geom_node_point(color = "lightblue", size = 7) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

## removing datasets to free the memory
rm(bi_nested_text_review, top_two_text_words, review_ids, bigram_counts, bigram_graph, a_size, yelp_review_vegas_ft)

The figure here shows that the words food and service are frequently preceded by positive sentiment words. The legend in the plot shows how the transparency of each arrow varies with the frequency of the word pair. Further, it can be seen that customer and service have the highest count of repetition in the text reviews. This suggests that customer service is one of the main criteria for a user when rating a restaurant.
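
The bi-gram counts feeding the graph can be illustrated on two hypothetical reviews. The sketch below splits the text into pairs of consecutive words, separates each pair into two columns and counts the words that directly precede a target word, here "service". The review texts and the single target word are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical reviews
toy_reviews <- tibble(
  review_id = c("r1", "r2"),
  text      = c("Customer service was excellent",
                "Friendly staff and excellent customer service")
)

toy_bigram_counts <- toy_reviews %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%   # pairs of consecutive words
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word2 == "service",                                 # keep pairs ending in the target word
         !word1 %in% stop_words$word) %>%                    # drop common preceding words
  count(word1, word2, sort = TRUE)
```

Because only the second word is matched against the target, the resulting graph edges all point from a preceding word to the target word.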

## Extracting nested words that form the useful category of text reviews posted on the website
load("unnest_text_review.RData") ## Loading this data, as it was earlier removed
load("yelp_review_vegas_ft.RData") 

useful_words_summary_rm_stop_wrds <- unnest_text_review %>%
                                      inner_join(yelp_review_vegas_ft %>%
                                                   mutate(helpful = ifelse(useful > 0, "Yes", "No")) %>%
                                                   select(review_id, helpful), by = "review_id") %>%
                                      anti_join(stop_words, by = "word") %>%
                                      filter(str_detect(word, "[a-z']$")) %>%
                                      group_by(word, helpful) %>%
                                      summarise(count_words = n()) %>%
                                      mutate(category = "useful")

## Extracting nested words that form the cool category of text reviews posted on the website
cool_words_summary_rm_stop_wrds <- unnest_text_review %>%
                                    inner_join(yelp_review_vegas_ft %>%
                                                 mutate(helpful = ifelse(cool > 0, "Yes", "No")) %>%
                                                 select(review_id, helpful), by = "review_id") %>%
                                    anti_join(stop_words, by = "word") %>%
                                    filter(str_detect(word, "[a-z']$")) %>%
                                    group_by(word, helpful) %>%
                                    summarise(count_words = n()) %>%
                                    mutate(category = "cool")

## Extracting nested words that form the funny category of text reviews posted on the website
funny_words_summary_rm_stop_wrds <- unnest_text_review %>%
                                      inner_join(yelp_review_vegas_ft %>%
                                                   mutate(helpful = ifelse(funny > 0, "Yes", "No")) %>%
                                                   select(review_id, helpful), by = "review_id") %>%
                                      anti_join(stop_words, by = "word") %>%
                                      filter(str_detect(word, "[a-z']$")) %>%
                                      group_by(word, helpful) %>%
                                      summarise(count_words = n()) %>%
                                      mutate(category = "funny")

## Binding all the above three datasets and arranging by maximum count of words
final_uniword_data <- bind_rows(useful_words_summary_rm_stop_wrds, cool_words_summary_rm_stop_wrds,
                                funny_words_summary_rm_stop_wrds) %>%
                        arrange(desc(count_words))

## Removing data to free the memory
rm(useful_words_summary_rm_stop_wrds, cool_words_summary_rm_stop_wrds, funny_words_summary_rm_stop_wrds)

## Getting the top 10 words across the different categories of text review and whether these were helpful or not
final_uniword_data_rank <- final_uniword_data %>%
                            group_by(category, helpful) %>%
                            mutate(rank = dense_rank(desc(count_words))) 

## Filtering for the top 10 words
final_uniword_data_rank_top10 <- final_uniword_data_rank %>% filter(rank <= 10)

## Plotting the frequencies of the most used words across the different categories
ggplot(final_uniword_data_rank_top10, aes(x = count_words, y = word)) + 
  geom_point(size = 3) +
  facet_grid(helpful ~  category, scales = "free_y") + 
  geom_segment(aes(x = 0, 
                   xend = count_words, 
                   y = word, 
                   yend = word), col = "blue") + 
  labs(title = "Frequently used words in review category",
       x = "Count of words", y = "Words") +
  theme_bw() + theme(axis.text.x = element_text(angle = 65, vjust = 0.6))  ## theme_bw() comes first so that it does not reset the axis text setting

## removing data to free up memory space
rm(final_uniword_data)

Further, for each review posted on the website, there is an option to tag the review as useful, funny and/or cool. The figure above shows how top words trend in these categories, with an additional layer to determine if the reviews were helpful or not. As discovered above, positive sentiment words top the most frequently used words in a text review. Further, words like food, service and time dominate to determine whether the review was helpful to other users or not.
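
The three summaries for useful, cool and funny follow the same pattern, so they could also be produced by one helper. The sketch below is one possible way to do this, using a hypothetical function summarise_votes() and toy data; the function name, its arguments and the sample values are assumptions and not part of the original analysis.

```r
library(dplyr)

# Hypothetical tidy words (one row per word) and review vote counts
toy_words <- tibble(review_id = c("r1", "r1", "r2"),
                    word      = c("food", "service", "food"))
toy_votes <- tibble(review_id = c("r1", "r2"),
                    useful    = c(1, 0),
                    cool      = c(0, 2),
                    funny     = c(0, 0))

# Count words by whether the review received at least one vote in 'vote_col'
summarise_votes <- function(words, reviews, vote_col) {
  reviews %>%
    mutate(helpful = ifelse(.data[[vote_col]] > 0, "Yes", "No")) %>%
    select(review_id, helpful) %>%
    inner_join(words, by = "review_id") %>%
    count(word, helpful, name = "count_words") %>%
    mutate(category = vote_col)
}

# Apply the helper to each vote column and stack the results
toy_uniword <- bind_rows(lapply(c("useful", "cool", "funny"),
                                function(v) summarise_votes(toy_words, toy_votes, v)))

# Rank the words within each category and helpfulness group
toy_uniword_rank <- toy_uniword %>%
  group_by(category, helpful) %>%
  mutate(rank = dense_rank(desc(count_words))) %>%
  ungroup()
```

With a helper of this kind, a further vote column would only require one more entry in the vector passed to lapply().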

## Adding the rating category (above/below average) to be used in the term frequency scatter plot
unnest_text_review_rate <- unnest_text_review %>%
                            inner_join(yelp_review_vegas_ft %>%
                                         select(review_id, stars), by = "review_id") %>%
                            anti_join(stop_words, by = "word") %>%
                            filter(str_detect(word, "[a-z]$")) %>%
                            mutate(rating = ifelse(stars >= 3, "Above_Average_Rating", "Below_Average_Rating"))

## Finding the frequency of the words used in each of the rating category
freq_words <- unnest_text_review_rate %>%
                group_by(rating) %>% 
                count(word, sort = TRUE) %>% 
                left_join(unnest_text_review_rate %>% 
                            group_by(rating) %>% 
                            summarise(total = n()), by = "rating") %>%
                mutate(freq = n/total)

## Using the spread function to get one column per rating category, so that the frequencies can be plotted on the two axes
freq_words <- freq_words %>% 
                select(rating, word, freq) %>% 
                spread(rating, freq) %>%
                arrange(Above_Average_Rating, Below_Average_Rating)

## Plotting the frequency of words used in different ratings
ggplot(freq_words, aes(x = Below_Average_Rating, y = Above_Average_Rating)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  labs(title = "Frequency of words used in different ratings",
       x = "Below Average Rating", y = "Above Average Rating") +
  geom_abline(color = "red") + theme_bw()

## Removing data to free storage
rm(unnest_text_review_rate)

For the purpose of the analysis in this section, any review with a rating of 3 or more is considered an above average review. The scatter plot of frequent words across the rating categories shows how word sentiment relates to the rating of the posted review, which in turn affects the rating of the restaurant.
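
The relative frequencies on the two axes are each word's count divided by the total number of words in its rating category. The sketch below works this through on hypothetical toy words and uses pivot_wider() from tidyr (version 1.0.0 or later is assumed) in place of spread(); the data is invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical words with the star rating of the review they came from
toy_words <- tibble(
  word  = c("food", "food", "tasty", "service", "food", "rude"),
  stars = c(5, 4, 5, 3, 1, 2)
)

toy_freq <- toy_words %>%
  mutate(rating = ifelse(stars >= 3, "Above_Average_Rating", "Below_Average_Rating")) %>%
  count(rating, word) %>%
  group_by(rating) %>%
  mutate(freq = n / sum(n)) %>%     # share of the word within its rating category
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = rating, values_from = freq)
```

Words that occur in only one rating category get NA in the other column and therefore do not appear in a scatter plot of the two frequencies.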

Summary In this analysis, the frequency of words and its impact have been analyzed. It is observed that customer service is of prime importance in determining the rating of a review posted on the website. Along with this, word sentiment also plays a significant role in determining the review rating. Other users use these reviews and their ratings to decide whether to visit a place. Thus, the review ratings posted on the website might have an impact on the popularity of the restaurant.

Sentiment Analysis

In this section, we analyse how the sentiment associated with individual words affects the rating of the restaurant. To determine the sentiment score of each word, the AFINN lexicon, available through the tidytext package, has been used.

## Filtering out the common (stop) words and tokens that do not end with a letter
unnest_text_review_flt <- unnest_text_review %>%
                            anti_join(stop_words, by = "word") %>%
                            filter(str_detect(word, "[a-z]$")) 

## Joining with the AFINN sentiment data to calculate the sentiment score
word_contri <- unnest_text_review_flt %>%
                inner_join(get_sentiments("afinn"), by = "word") %>%
                group_by(word) %>%
                summarize(occurrences = n(),
                          contribution = sum(score))

## Plotting the contribution of the top 35 words in the sentiment of the text
word_contri %>%
  top_n(35, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(x = word, y = contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Words with greatest contribution to Positive/Negative Sentiment",
       x = "Word", y = "Contribution") +
  coord_flip() + theme_bw()

The figure above shows the contribution of the top words to the sentiment of the text reviews. Bars to the right of zero on the horizontal axis show the contribution of positive words. As observed in the previous section, positive sentiment words are the most frequently used words in the text reviews posted on the website.
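To make the contribution measure concrete, here is a minimal sketch on made-up data. The names `toy_words` and `toy_lexicon` are hypothetical, and the scores are illustrative values rather than entries copied from the AFINN lexicon. A word's contribution is the sum of its score over every occurrence, so a frequent, mildly scored word can outweigh a rare, strongly scored one.

```r
library(dplyr)

# Hypothetical tokens, one row per word occurrence
toy_words <- tibble(word = c("great", "great", "bad", "table", "bad", "bad"))

# Illustrative stand-in for a scored lexicon (values are assumptions)
toy_lexicon <- tibble(word = c("great", "bad"), score = c(3, -3))

toy_contri <- toy_words %>%
  # Words without a score ("table") are dropped by the inner join
  inner_join(toy_lexicon, by = "word") %>%
  group_by(word) %>%
  summarise(occurrences = n(),
            contribution = sum(score))

print(toy_contri)
```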

## Joining to get the stars rating across the reviews posted
words_by_rating <- unnest_text_review_flt %>%
                    inner_join(yelp_review_vegas_ft %>%
                                 select(review_id, stars), by = "review_id") %>%
                    count(stars, word, sort = TRUE) %>%
                    ungroup()

## Forming the data to calculate the sentiment score across different star rating category
top_sentiment_words <- words_by_rating %>%
                        inner_join(get_sentiments("afinn"), by = "word") %>%
                        mutate(contribution = score * n / sum(n))

## Plotting to get the sentiment of top words in different category ratings
top_sentiment_words %>%
  group_by(stars) %>%
  top_n(15, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(x = word, y = contribution, fill = contribution > 0)) +
  facet_wrap(~ stars, scales = "free_y") +
  geom_col(show.legend = FALSE) +
  labs(title = "Top 15 Words that contribute to Sentiment Score across Star Rating",
       x = "Word", y = "Sentiment Score * # of occurrences") +
  coord_flip()

The figure above shows how word sentiment varies across the different star ratings of the reviews. As expected, 1-star reviews have a high number of negative sentiment words. As the review rating increases, the positive sentiment of the text also increases. Words such as amazing, awesome and nice give the review text a high positive sentiment and are associated with a higher star rating.
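The per-rating contribution used for this figure weights each word's score by its share of all scored word occurrences. The sketch below reproduces that arithmetic on hypothetical counts (`toy_counts` and `toy_lexicon` are made-up names with illustrative values). Note that `sum(n)` is taken after the join, so unscored words such as "table" do not enter the denominator.

```r
library(dplyr)

# Hypothetical word counts per star rating
toy_counts <- tibble(
  stars = c(1, 1, 5, 5, 5),
  word  = c("bad", "great", "great", "bad", "table"),
  n     = c(6, 1, 8, 1, 4)
)

# Illustrative stand-in for a scored lexicon (values are assumptions)
toy_lexicon <- tibble(word = c("great", "bad"), score = c(3, -3))

toy_contribution <- toy_counts %>%
  inner_join(toy_lexicon, by = "word") %>%
  # Score weighted by the word's share of all scored occurrences
  mutate(contribution = score * n / sum(n))

# Net sentiment per star rating: negative for 1 star, positive for 5 stars
toy_net <- toy_contribution %>%
  group_by(stars) %>%
  summarise(net_sentiment = sum(contribution))

print(toy_net)
```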

Summary: In this analysis, we determined how the sentiment associated with words affects the star rating of the reviews posted on the website. Further, reviews whose text contains positive sentiment words tend to receive a higher rating.

Sentiment Review across Neighborhood

In this section, we try to determine how word sentiment varies across the popular neighborhoods. For this purpose, the nrc lexicon, available through the tidytext package, has been used. Further, only the reviews posted in the year 2016 have been considered for analysis.

## Calculating the average star rating in the Year 2016
avg_biz_rating_2016 <-  yelp_review_vegas_ft %>%
                          filter(year_review == 2016) %>%
                          group_by(business_id) %>%
                          summarise(average_star_rating = mean(stars))

yelp_biz_vegas_bl_rate <- yelp_biz_vegas_bl %>%
                            inner_join(avg_biz_rating_2016, by = "business_id") 

## Plotting on a map the average rating, highlighting the high rated restaurants
leaflet(data = yelp_biz_vegas_bl_rate) %>% 
  addProviderTiles("CartoDB.DarkMatter") %>% 
  setView(lng = -115.2, lat = 36.13, zoom = 12) %>%
  addCircleMarkers(~longitude, ~latitude, radius = ifelse(yelp_biz_vegas_bl_rate$average_star_rating >= 4, 3, 0.5),
                   color = ifelse(yelp_biz_vegas_bl_rate$average_star_rating >= 4, "white", "blue"),
                   fillOpacity = 0.5)

The figure above highlights the areas with high rated restaurants. A high rated restaurant is one whose reviews have an average star rating of four or more. The white dots in the figure mark these high rated restaurants, which are well distributed across the region. A user can zoom in on the map for a closer look at a neighborhood and see the streets where such restaurants are located.
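As a small illustration of how the high rated flag behind the white markers is derived, the sketch below averages 2016 star ratings per business and applies the same cutoff as the map code (an average of at least 4). The data and the `high_rated` column are hypothetical.

```r
library(dplyr)

# Hypothetical reviews; business "a" has one 2015 review that is ignored
toy_reviews <- tibble(
  business_id = c("a", "a", "a", "b", "b", "c", "c"),
  year_review = c(2016, 2016, 2015, 2016, 2016, 2016, 2016),
  stars       = c(5, 4, 1, 4, 4, 3, 4)
)

toy_rating <- toy_reviews %>%
  filter(year_review == 2016) %>%
  group_by(business_id) %>%
  summarise(average_star_rating = mean(stars)) %>%
  # Same threshold as the marker styling: average of 4 stars or more
  mutate(high_rated = average_star_rating >= 4)

print(toy_rating)
```

Storing the flag as a column keeps the threshold in one place, so marker radius and colour could both be derived from it.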

## Keeping only restaurants that have a neighborhood populated against them
yelp_review_vegas_ft_neighr <- yelp_review_vegas_ft %>%
                                 inner_join(not_missing_neighrhood, by = "business_id")

## Top 10 popular neighborhoods, by count of reviews, in the year 2016
top_ten_neighrhd_count_2016 <- yelp_review_vegas_ft_neighr %>%
                                filter(year_review == 2016) %>%
                                group_by(neighborhood) %>%
                                summarise(n = n()) %>%
                                arrange(desc(n)) %>%
                                top_n(10, n)

## get word-sentiment lexicon from the nrc data
nrc <- get_sentiments("nrc") %>%
        select(word, sentiment)

## Unnesting token words for reviews in the year 2016
unnest_review_2016 <- yelp_review_vegas_ft_neighr %>%
                        filter(year_review == 2016, neighborhood %in% top_ten_neighrhd_count_2016$neighborhood) %>%
                        unnest_tokens(word, text) %>%
                        filter(!word %in% stop_words$word, str_detect(word, "^[a-z']+$")) %>%
                        select(review_id, business_id, neighborhood, word)

## Count total number of words in each neighbourhood
total_words_neigh <- unnest_review_2016 %>%
                      group_by(neighborhood) %>%
                      mutate(total_words = n()) %>%
                      ungroup() %>%
                      distinct(neighborhood, total_words)

## Number of words associated with each type of sentiment in each neighbourhood
sentiment_neigh <- unnest_review_2016 %>%
                    inner_join(nrc, by = "word") %>%
                    count(sentiment, neighborhood) %>%
                    ungroup() %>%
                    complete(sentiment, neighborhood, fill = list(n = 0)) %>%
                    inner_join(total_words_neigh, by = "neighborhood") %>%
                    group_by(neighborhood, sentiment, total_words) %>%
                    summarize(words = sum(n)) %>%
                    mutate(proportion = round(words / total_words * 100, digits = 1)) %>%
                    ungroup()

## Plotting to see how the sentiment of top words varies across the neighborhoods
ggplot(data = sentiment_neigh) +
  geom_bar(mapping = aes(x = neighborhood, y = proportion),
           stat = "identity",  fill = "orange") +
  facet_wrap( ~ sentiment) +
  labs(title = "Sentiment Analysis in Top 10 Neighbourhoods",
       x = "Neighbourhood", y = "Proportion of sentiment across total word count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The figure above shows how the different sentiments vary across neighborhoods. As observed in the previous analysis, positive sentiment words contribute the most to the text sentiment of the reviews. Further, sentiments associated with positive words, such as joy and trust, show the same trend across the different neighborhoods.
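The proportions plotted above are the share of a neighborhood's words that carry a given sentiment. The sketch below shows the same steps on made-up data; `toy_tokens`, `toy_nrc` and `toy_totals` are hypothetical, and the toy lexicon assigns a single sentiment per word for simplicity, whereas the nrc lexicon can assign several. The call to `complete()` adds a zero row for every sentiment that a neighborhood never uses, so each facet shows every neighborhood.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens per neighborhood
toy_tokens <- tibble(
  neighborhood = c("Downtown", "Downtown", "Downtown", "Downtown",
                   "The Strip", "The Strip"),
  word         = c("good", "happy", "table", "bad", "good", "table")
)

# Illustrative word-to-sentiment lookup (assumed values)
toy_nrc <- tibble(word      = c("good", "happy", "bad"),
                  sentiment = c("positive", "joy", "negative"))

# Total number of words in each neighborhood
toy_totals <- toy_tokens %>%
  count(neighborhood, name = "total_words")

toy_sentiment <- toy_tokens %>%
  inner_join(toy_nrc, by = "word") %>%
  count(sentiment, neighborhood) %>%
  # Add missing sentiment/neighborhood combinations with a count of zero
  complete(sentiment, neighborhood, fill = list(n = 0L)) %>%
  inner_join(toy_totals, by = "neighborhood") %>%
  mutate(proportion = round(n / total_words * 100, digits = 1))

print(toy_sentiment)
```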

Summary: This analysis shows how sentiment varies across neighborhoods. These top ten neighborhoods are the areas most preferred by users and thus show a high number of positive sentiment words.

Conclusion

Problem Statement: The project aimed to analyse the text reviews posted on the Yelp.com website, specifically for restaurants in Las Vegas, NV. The analysis includes results on how the text reviews varied over time, the frequency of words across different categories, and the sentiment analysis of the review text.

Implementation of the analysis: The data for this analysis has been downloaded from the Yelp website. This data has been sliced and diced to extract interesting insights that will help both the users of the website and the owners of the restaurants. The text mining approach has been utilized to examine the sentiments of people based on the comments written in their reviews. Various R packages have been used to produce the results.

Insights from the analysis: Overall, the reviews posted on the website carry a positive sentiment.

With the advent of technology, there has been a huge increase in the usage of the website. Further, there has been an increase in the number of restaurants opened. Together, these have led to an increase in the number of reviews posted online.
From the word frequency analysis, it was determined that most users consider customer service one of the prime criteria when rating a restaurant.
Words such as food, service and time dominate in determining whether a review was helpful to other users. Further, in above average reviews, positive words have a higher contribution.
Words such as amazing, awesome and nice give the review text a high positive sentiment and are associated with a higher star rating.
High rated restaurants are well distributed across the region, as was determined from the map analysis of the reviews.

Implication to the users of this analysis: This analysis can be used by restaurant owners to determine how the reviews posted drive the sentiment of users. They can thus devise strategies to improve upon their existing status.

Limitations of this analysis: The analysis only considers the contribution of single words to the sentiment of the reviews. In actual text, more than one word is often required to accurately determine the true sentiment of what the user wrote. Various advanced machine learning techniques can be applied here to further improve the results.
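One simple way to see the single-word limitation is negation: "not good" is scored as positive when only "good" is looked up. The sketch below is one possible refinement under stated assumptions, not part of the analysis above. It flips the score of a word that directly follows a negation word. The data, the lexicon score and the `negations` vector are all hypothetical. Such a check would have to run before stop words are removed, because negation words are typically in stop word lists.

```r
library(dplyr)

# Hypothetical tokens in reading order, two short reviews
toy_tokens <- tibble(
  review_id = c(1, 1, 1, 2, 2, 2, 2),
  word      = c("food", "was", "good", "food", "was", "not", "good")
)

# Illustrative lexicon entry and negation list (assumptions)
toy_lexicon <- tibble(word = "good", score = 3)
negations   <- c("not", "no", "never")

toy_scores <- toy_tokens %>%
  group_by(review_id) %>%
  # Word immediately before each token within the same review
  mutate(previous = lag(word)) %>%
  ungroup() %>%
  inner_join(toy_lexicon, by = "word") %>%
  # Reverse the score when the preceding word is a negation
  mutate(adjusted = if_else(previous %in% negations, -score, score)) %>%
  group_by(review_id) %>%
  summarise(single_word   = sum(score),
            with_negation = sum(adjusted))

print(toy_scores)
```

Both reviews get the same single-word score, while the negation-aware score separates them.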