Yelp Restaurant Reviews Analysis

Yelp.com is a popular website on which customers can rate businesses and write about their experiences. This analysis examines the reviews posted by users for various types of restaurants located in Las Vegas.

Overview

Problem Statement

Analyse the ratings of different types of restaurants in Las Vegas, NV. This includes sentiment analysis of the reviews posted and exploratory analysis of different types of restaurants based on their attributes.

Both analyses aim to provide useful information to consumers as well as business owners. Business owners can use this analysis to improve existing features or add features that their establishment currently lacks. For consumers of the website, the analysis will help them easily decide on their next place to visit in the city.

Methodology

The data for this analysis has been downloaded from the Yelp website. This data will be used to visualise trends across different attributes and to determine popular opinion. A text mining approach will be used to examine sentiment based on the comments written in the reviews. The analysis will help users of the website quickly find a restaurant matching their interests, eliminating the need to read through the reviews thoroughly.

Packages Required

The following packages will be used for the analysis:

  • tidyverse: Package of multiple R packages used for data manipulation
  • stringr: String operations in R
  • ggplot2: Data visualisation in R
  • dplyr: Easy functions to perform data manipulation in R
  • tm: Framework for common text mining applications in R
  • jsonlite: Package to read JSON formatted files
  • NLP: Package for basic classes and methods for Natural Language Processing
  • DT: Package to put data objects in R as HTML tables
  • lubridate: Package to manipulate dates and times
  • tidytext: Package for text mining for word processing and sentiment analysis
  • wordcloud: Package to generate word clouds
  • igraph: Package for network analysis and visualization
  • widyr: Package to widen, process and re-tidy data
  • leaflet: Package to create and customize interactive maps
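  • scales: Package to graphically scale map data to aesthetics
  • ggraph: Package for plotting graphs and networks with the grammar of graphics
  • ggmap: Package for spatial visualization with ggplot2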

## Function to install (if needed) and load multiple packages at once in R

install_load <- function(pack) {
  ## Identify the packages that have not been installed yet
  new_pack_load <- pack[!(pack %in% installed.packages()[, "Package"])]
  ## Install only the missing packages
  if (length(new_pack_load)) {
    install.packages(new_pack_load, dependencies = TRUE)
  }
  ## Load every requested package; returns TRUE/FALSE for each package
  sapply(pack, require, character.only = TRUE)
}

package_load <- c("ggplot2", "dplyr", "tidyverse", "NLP", "tm", "stringr", "jsonlite", "DT", "lubridate", "tidytext", "wordcloud", "igraph", "ggraph","widyr", "ggmap", "leaflet", "scales")
install_load(package_load)

Data Preparation

The section here explains the steps to prepare the data for analysis.

Source of Data

Since 2014, Yelp.com has been organising the Yelp Dataset Challenge to encourage research and analysis on the large volume of business data available on its website. Students use this challenge to uncover new insights from the data and even to publish or present their results at major conferences. More information about this challenge can be found at this link.

The data for this analysis has been obtained from the Yelp.com website. The datasets can be accessed at this link. The datasets can be downloaded either as a single SQL database file or as multiple JSON formatted files. For this analysis, the JSON formatted files were downloaded.

Data Description

The downloaded data contains information for 156,000 businesses in 12 metropolitan areas and 4,700,000 reviews provided by 1,100,000 users. The data consists of six different tables: business, review, user, checkin, tip and photos. The description of these tables is as follows:

  • business.json : Contains business data including location, attributes and categories
  • review.json : Contains user text review data for businesses
  • user.json : Contains user information data
  • checkin.json : Contains checkin data on a business
  • tip.json : Contains tips written by a user on a business. Tips are shorter than reviews and convey quick suggestions.
  • photos.json : Contains photo_id to link to the different pictures (in a different data) posted for a business on the website

Further information regarding the data can be found at this link.

For this analysis, the focus is to examine only restaurant category businesses in Las Vegas, NV.

Data Import

This section highlights the process on how the data was imported and filtered for analysis.

Import

For the analysis, the following datasets were used:

  • business.json: The primary key for the data is business_id
  • review.json: The primary key for the data is the combination of variables business_id and review_id

Initially, the stream_in() function in the jsonlite package was used to import these datasets into R. Once loaded, each dataset was saved in RData format. RData files are quicker to read in R and thus save time when the datasets are reloaded for later exploratory analysis. The code to perform this task is shown below:

yelp_biz <- stream_in(file("C:/Users/rohit/Documents/Fall_2017/Data_Wrangling_R/Project_data/yelp_dataset.tar/dataset/business.json"))
save(yelp_biz, file = "yelp_biz.RData")

yelp_review <- stream_in(file("C:/Users/rohit/Documents/Fall_2017/Data_Wrangling_R/Project_data/yelp_dataset.tar/dataset/review.json"))
save(yelp_review, file = "yelp_review.RData")
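
As a minimal sketch of the reload step, assuming the .RData files were written as above, a later session can restore the saved objects with load() instead of re-parsing the JSON. The toy data frame and the temporary file path below are illustrative stand-ins, not part of the original analysis.

```r
# Toy stand-in for a parsed Yelp table (hypothetical values)
toy_biz <- data.frame(business_id = c("b1", "b2"),
                      stars = c(4.5, 3.0),
                      stringsAsFactors = FALSE)

# Save to a temporary .RData file, mirroring save(yelp_biz, file = "yelp_biz.RData")
rdata_path <- tempfile(fileext = ".RData")
save(toy_biz, file = rdata_path)

# Simulate a fresh session: remove the object, then restore it from disk
rm(toy_biz)
restored_names <- load(rdata_path)   # load() returns the names of the restored objects

print(restored_names)
print(toy_biz)
```

Note that load() restores each object under the name it had when it was saved, so no assignment is needed.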

Filter

Filtering variables

There are 210 variables in the yelp_biz data, containing information such as name, city, rating, review count, attributes and categories. For the analysis, only the variables of interest have been kept. Further, the categories variable is used only to extract restaurant category businesses from the data.

#### keeping the required variables for business data
yelp_biz_f <- data.frame(yelp_biz$business_id, yelp_biz$name, yelp_biz$neighborhood, yelp_biz$address,
                       yelp_biz$city, yelp_biz$state, yelp_biz$postal_code, yelp_biz$latitude,
                       yelp_biz$longitude, yelp_biz$stars, yelp_biz$review_count, yelp_biz$is_open,
                       yelp_biz$attributes$RestaurantsPriceRange2, yelp_biz$attributes$WheelchairAccessible,
                       yelp_biz$attributes$GoodForMeal$dessert, yelp_biz$attributes$GoodForMeal$latenight,
                       yelp_biz$attributes$GoodForMeal$lunch, yelp_biz$attributes$GoodForMeal$dinner,
                       yelp_biz$attributes$GoodForMeal$breakfast, yelp_biz$attributes$GoodForMeal$brunch,
                       yelp_biz$attributes$RestaurantsGoodForGroups, yelp_biz$attributes$NoiseLevel,
                       yelp_biz$attributes$RestaurantsAttire, yelp_biz$attributes$RestaurantsReservations,
                       yelp_biz$attributes$OutdoorSeating, yelp_biz$attributes$BusinessAcceptsCreditCards,
                       yelp_biz$attributes$RestaurantsDelivery, yelp_biz$attributes$Ambience$romantic,
                       yelp_biz$attributes$Ambience$intimate, yelp_biz$attributes$Ambience$classy,
                       yelp_biz$attributes$Ambience$hipster, yelp_biz$attributes$Ambience$divey,
                       yelp_biz$attributes$Ambience$touristy, yelp_biz$attributes$Ambience$trendy,
                       yelp_biz$attributes$Ambience$upscale, yelp_biz$attributes$Ambience$casual,
                       yelp_biz$attributes$RestaurantsTakeOut, yelp_biz$attributes$GoodForKids,
                       yelp_biz$attributes$WiFi, yelp_biz$attributes$RestaurantsTableService,
                       yelp_biz$attributes$Alcohol, yelp_biz$attributes$Caters, yelp_biz$attributes$DogsAllowed,
                       yelp_biz$attributes$Music$dj, yelp_biz$attributes$Music$background_music,
                       yelp_biz$attributes$Music$no_music, yelp_biz$attributes$Music$karaoke,
                       yelp_biz$attributes$Music$live, yelp_biz$attributes$Music$video, 
                       yelp_biz$attributes$Music$jukebox, yelp_biz$attributes$HappyHour,
                       yelp_biz$attributes$GoodForDancing, yelp_biz$attributes$DriveThru,
                       yelp_biz$attributes$Smoking, yelp_biz$attributes$BestNights$monday,
                       yelp_biz$attributes$BestNights$tuesday, yelp_biz$attributes$BestNights$friday,
                       yelp_biz$attributes$BestNights$wednesday, yelp_biz$attributes$BestNights$thursday,
                       yelp_biz$attributes$BestNights$sunday, yelp_biz$attributes$BestNights$saturday,
                       yelp_biz$attributes$DietaryRestrictions$`dairy-free`, 
                       yelp_biz$attributes$DietaryRestrictions$`gluten-free`,
                       yelp_biz$attributes$DietaryRestrictions$vegan,
                       yelp_biz$attributes$DietaryRestrictions$kosher,
                       yelp_biz$attributes$DietaryRestrictions$halal,
                       yelp_biz$attributes$DietaryRestrictions$`soy-free`,
                       yelp_biz$attributes$DietaryRestrictions$vegetarian,
                       yelp_biz$attributes$Open24Hours,
                       stringsAsFactors = FALSE)
#### removing the initial 9 characters from the column names (keeping character 10 onward), as all these variables start with the text "yelp_biz."
names(yelp_biz_f) <- substring(names(yelp_biz_f), 10)
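
The renaming step can be checked on a small example. The sketch below uses a two-column toy data frame (hypothetical values, named yelp_biz only so that the generated prefix matches) to show that data.frame() prefixes each column with "yelp_biz." and that substring(x, 10) keeps everything from the tenth character onward.

```r
# Toy stand-in for the business table (hypothetical values)
yelp_biz <- data.frame(business_id = c("b1", "b2"), stars = c(4, 3.5),
                       stringsAsFactors = FALSE)

# data.frame() derives column names from the expressions passed to it
toy_f <- data.frame(yelp_biz$business_id, yelp_biz$stars, stringsAsFactors = FALSE)
raw_names <- names(toy_f)
print(raw_names)

# The shared prefix "yelp_biz." is 9 characters long
prefix_length <- nchar("yelp_biz.")

# Keep characters 10 onward, i.e. drop the 9-character prefix
clean_names <- substring(raw_names, prefix_length + 1)
print(clean_names)
```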

Filtering data

The business data is now filtered to keep only the observations for restaurant category businesses in the city of Las Vegas. After the operation, there are 8,703 restaurants available for analysis. This data will be used for exploratory analysis. Further, the yelp_review data is filtered to keep the text reviews for the list of restaurants obtained in the previous step. This data will be used to perform sentiment analysis. The yelp_review_vegas data contains 1,009,036 reviews for the 8,703 restaurants.

#### Categories of business for the analysis
catg <- c("Restaurant","Food", "Pubs", "Nightlife", "Bars", "Dance Clubs", "Lounges", "Dive Bars", "Sports Bars")

#### Filtering the business data for the restaurant category
# Variable 'categories' is of the type list in the original data and thus, it is first converted to a character variable using the mutate function in the dplyr package 
yelp_biz_catg <-  yelp_biz %>%
                  select(business_id, categories) %>% 
                  mutate(categories = as.character(categories)) %>%  
                  filter(str_detect(categories, paste(catg, collapse = '|')))

#### Filtering business data for Las Vegas. There are 8,703 restaurants.
yelp_biz_vegas <- yelp_biz_f %>%
                  semi_join(yelp_biz_catg, by = "business_id") %>%
                  filter(str_detect(city, "Las Vegas"))

#### Dimension of the business data
#dim(yelp_biz_vegas)

#### Filtering the review data for Las Vegas restaurants. There are 1,009,036 text reviews for 8,703 restaurants.
yelp_review_vegas <- yelp_review %>%
                     semi_join(yelp_biz_vegas, by = "business_id")

#### Dimension of the review data
#dim(yelp_review_vegas)
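
The filtering logic above can be illustrated on a toy table. This is a sketch under stated assumptions: the rows, the shortened category list and the object names prefixed with toy_ are all hypothetical.

```r
library(dplyr)
library(stringr)

# Toy business table (hypothetical rows); categories held as one string per business
toy_biz <- tibble(
  business_id = c("b1", "b2", "b3", "b4"),
  city        = c("Las Vegas", "Las Vegas", "Phoenix", "North Las Vegas"),
  categories  = c("Restaurants, Mexican", "Hair Salons", "Bars, Nightlife", "Food, Bakeries")
)

toy_catg <- c("Restaurant", "Food", "Bars")

# paste(collapse = "|") builds the regular expression "Restaurant|Food|Bars"
pattern <- paste(toy_catg, collapse = "|")

toy_biz_catg <- toy_biz %>%
  filter(str_detect(categories, pattern)) %>%
  select(business_id)

# semi_join keeps rows of the left table that have a match, without adding columns
toy_vegas <- toy_biz %>%
  semi_join(toy_biz_catg, by = "business_id") %>%
  filter(str_detect(city, "Las Vegas"))

print(toy_vegas$business_id)
```

Because str_detect() matches substrings, a city value such as "North Las Vegas" also passes the filter in this sketch; an exact comparison (city == "Las Vegas") would exclude it, if that is the intent.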

Data Cleaning

This section provides information regarding the process of data cleaning and finally, preview of the cleaned data.

Steps followed

To clean the data in both yelp_biz_vegas and yelp_review_vegas, the following steps have been followed:

  • Determine the count of missing values in each column of the dataset
  • Filter out columns in which 40% or more of the values are missing
  • Replace the missing values in the remaining columns
  • Check for outliers in the numeric variables
  • Treat any outliers in the data

Upon checking the summary of all the variables in the data, it was found that the logical columns in the yelp_biz_vegas data required missing value treatment. The rest of the variables were structured properly.

The missing values in these logical columns were replaced with “FALSE”. Further, no missing values were observed for the yelp_review_vegas data.

#### cleaning the data
## Determining the percentage of missing values in the business data
#sort(colMeans(is.na(yelp_biz_vegas)), decreasing = TRUE)

## Keeping only the variables which have less than 40% missing values in them
keep_vars <- (colMeans(is.na(yelp_biz_vegas))) < 0.4

## Selecting the list of final columns for analysis in business data
yelp_biz_vegas_2 <- yelp_biz_vegas %>% select(names(keep_vars[keep_vars == TRUE]))

## Replacing the missing values in logical columns with "FALSE"
# Selecting the columns having logical values (TRUE/FALSE) and keeping the "business_id" column, so that the split datasets can be combined again after cleaning
col_name_logical <- sapply(yelp_biz_vegas_2, is.logical)

yelp_biz_vegas_bl <- yelp_biz_vegas_2 %>% 
                     select(business_id, names(col_name_logical[col_name_logical == FALSE])) %>%
                     inner_join(yelp_biz_vegas_2 %>% 
                     select(business_id, names(col_name_logical[col_name_logical == TRUE])) %>%
                     mutate_all(funs(replace(., is.na(.),"FALSE"))), by = "business_id")

## Determining the percentage of missing values in the review data
#sort(colMeans(is.na(yelp_review_vegas)), decreasing = TRUE)
# No missing value treatment required for review data

## Converting the string formatted date variable to date format
yelp_review_vegas$date <- ymd(yelp_review_vegas$date)

## Extracting review year from the date
yelp_review_vegas$year_review <- year(yelp_review_vegas$date)

## Removing the datasets that are no longer required
rm(yelp_biz, yelp_review, yelp_biz_f, yelp_biz_catg, yelp_biz_vegas, yelp_biz_vegas_2)
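
The cleaning steps can be sketched in base R on a toy table (hypothetical data and column names). One deliberate difference from the code above: this sketch fills missing values with the logical FALSE rather than the string "FALSE", so the columns stay logical instead of being coerced to character.

```r
# Toy business table with missing values (hypothetical data)
toy <- data.frame(
  business_id = c("b1", "b2", "b3", "b4", "b5"),
  stars       = c(4.0, 3.5, NA, 5.0, 2.0),      # 20% missing: kept
  happy_hour  = c(TRUE, NA, NA, NA, FALSE),      # 60% missing: dropped
  takeout     = c(TRUE, NA, FALSE, TRUE, TRUE),  # 20% missing: kept
  stringsAsFactors = FALSE
)

# Share of missing values per column; keep columns below the 40% threshold
missing_share <- colMeans(is.na(toy))
toy_kept <- toy[, missing_share < 0.4, drop = FALSE]

# Replace NA with FALSE in the logical columns only
is_lgl <- vapply(toy_kept, is.logical, logical(1))
toy_kept[is_lgl] <- lapply(toy_kept[is_lgl], function(x) replace(x, is.na(x), FALSE))

print(missing_share)
str(toy_kept)
```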

Preview of Final Data

A sample preview of the cleaned data is shown below:

head(yelp_biz_vegas_bl,200) %>%
  datatable(caption = "Table 1: Restaurants in Las Vegas")
head(yelp_review_vegas, 200) %>%
  datatable(caption = "Table 2: Text reviews of the restaurants")

Data Summary

  • Table Summary

The summary of both the datasets is as follows:

Restaurant Information Data

Information                      Restaurant Dataset
Number of rows                   8,703
Number of columns                39
Date Range                       April 19, 2004 to July 26, 2017
Number of numerical variables    9
Number of character variables    33

Text Review Data

Information                      Review Dataset
Number of rows                   1,009,036
Number of columns                9
Date Range                       April 19, 2004 to July 26, 2017
Number of numerical variables    5
Number of character variables    4
Number of date variables         1

  • Variable Summary

Summary of some of the important variables in the data is given below:

Variable Name                       Dataset            Remarks
stars                               yelp_biz_vegas_bl  Provides the rating of the restaurant. Minimum = 0, Median = 3.5, Maximum = 5
review_count                        yelp_biz_vegas_bl  Count of reviews for a particular restaurant. Minimum = 3, Median = 30, Maximum = 6979
latitude                            yelp_biz_vegas_bl  Latitude coordinate of the restaurant. Will be used to plot the location on the map.
longitude                           yelp_biz_vegas_bl  Longitude coordinate of the restaurant. Will be used to plot the location on the map.
attributes.RestaurantsPriceRange2   yelp_biz_vegas_bl  Categorized price range of the restaurant
stars                               yelp_review_vegas  Star rating of the review posted
text                                yelp_review_vegas  Text of the review posted
useful                              yelp_review_vegas  Count of users that found the particular review useful. Minimum = 0, Median = 0, Maximum = 168
funny                               yelp_review_vegas  Count of users that found the particular review funny. Minimum = 0, Median = 0, Maximum = 154
cool                                yelp_review_vegas  Count of users that found the particular review cool. Minimum = 0, Median = 0, Maximum = 156

Data Analysis

This section explores the useful insights derived from the data and presents them visually. For this purpose, the data has been sliced and diced across different parameters, and new variables have been formed to gather information.

Time Trend of Reviews

A time series trend of the reviews posted on the website is analyzed here. These reviews were posted from the year 2004 to the year 2017. For the year 2017, only a partial year of reviews is available, so the analysis in this section does not count reviews posted in 2017. Further, there are approximately 7,000 different neighborhoods containing all the restaurants in Las Vegas, and for this analysis only the ten most popular neighborhoods have been chosen. The figures below present the count of reviews posted across the years and the count of distinct restaurants reviewed in each year.

## To know the range of year of reviews in the data
#range(yelp_review_vegas$year_review)

## Filtering data to have observations where neighborhood is populated
not_missing_neighrhood <- yelp_biz_vegas_bl %>%
                            filter(str_detect(neighborhood, "[a-z']$")) %>%
                            select(business_id, neighborhood)

## Keeping text reviews for these neighborhoods only
yelp_review_vegas_neighr <- yelp_review_vegas %>%
                              inner_join(not_missing_neighrhood, by = "business_id")

## Top 10 popular neighborhood, by count of reviews, from 2004 to 2016
top_ten_neighrhd_count <- yelp_review_vegas_neighr %>%
                            filter(year_review %in% c(2004:2016)) %>%
                            group_by(neighborhood) %>%
                            summarise(n = n()) %>%
                            arrange(desc(n)) %>%
                            top_n(10, n)

## Summarizing the number of reviews by neighborhood and year of review
count_review_neighrhd_year <- yelp_review_vegas_neighr %>%
                                filter(year_review %in% c(2004:2016)) %>%
                                group_by(neighborhood, year_review) %>%
                                summarise(n = n()) %>%
                                semi_join(top_ten_neighrhd_count, by = "neighborhood")

## Plot of number of reviews across different years by neighborhood
ggplot(data = count_review_neighrhd_year, aes(x = year_review, y = n, col = neighborhood)) + 
  geom_line(size = 1.5) +
  labs(title = "Time series of count of reviews for the top ten popular neighborhoods",
        x = "Year of review", y = "Number of reviews") +
  # theme_light() goes first: as a complete theme it would otherwise reset the grid setting
  theme_light() + theme(panel.grid.minor = element_blank())

The figure above shows how the number of reviews posted on the website has risen across the years. In particular, The Strip neighborhood has seen a large increase in the number of reviews posted for its restaurants.
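
The aggregation behind this plot can be sketched on toy data. The rows below are hypothetical; count() is used here as shorthand for group_by() followed by summarise(n = n()).

```r
library(dplyr)
library(lubridate)

# Toy review table (hypothetical rows) with string dates, as in the raw data
toy_reviews <- tibble(
  neighborhood = c("The Strip", "The Strip", "The Strip", "Downtown", "Downtown", "Westside"),
  date         = c("2015-03-01", "2016-07-15", "2016-11-02", "2015-05-20", "2016-01-09", "2017-02-14")
) %>%
  mutate(year_review = year(ymd(date)))

# Drop the partial year 2017, then rank neighborhoods by total review count
top_neigh <- toy_reviews %>%
  filter(year_review <= 2016) %>%
  count(neighborhood, sort = TRUE) %>%
  head(2)

# Reviews per neighborhood and year, restricted to the top neighborhoods
reviews_by_year <- toy_reviews %>%
  filter(year_review <= 2016) %>%
  count(neighborhood, year_review) %>%
  semi_join(top_neigh, by = "neighborhood")

print(reviews_by_year)
```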

## Number of distinct businesses reviewed in each year (used here as a proxy for businesses opened) and the average rating of these businesses across years
count_business_neighrhd_year <- yelp_review_vegas_neighr %>%
                                filter(year_review %in% c(2004:2016)) %>%
                                group_by(neighborhood, year_review) %>%
                                summarise(n = n_distinct(business_id), average_star_rating = mean(stars)) %>%
                                semi_join(top_ten_neighrhd_count, by = "neighborhood")

## Horizontal bar plot for the analysis
## Note: labs() labels follow the aesthetics (x = year, y = n), not the flipped axes
ggplot(data = count_business_neighrhd_year, aes(x = as.factor(year_review), y = n)) + 
  facet_wrap(~ neighborhood) +
  geom_bar(stat = 'identity', aes(fill = average_star_rating), width = 0.75) +
  coord_flip() +
  labs(title = "Number of restaurants opened across the top ten popular neighborhoods",
       x = "Year", y = "Number of restaurants") +
  theme_light() + theme(panel.grid.minor = element_blank())

## removing datasets to free the memory
rm(yelp_review_vegas_neighr, top_ten_neighrhd_count, count_review_neighrhd_year, count_business_neighrhd_year)

The figure above shows the number of restaurants across the years in these top ten neighborhoods. There are many restaurants in The Strip neighborhood, which helps explain the large increase in the number of reviews posted on the website.

Summary: The reviews analyzed in this section span the years 2004 to 2016, which is a very large set of reviews. As the plots above show, the number of reviews for various restaurants has surged in recent years. For further analysis in the project, therefore, only the reviews posted between the year 2014 and the year 2016 have been selected.

Sentiment Analysis

In this section, we analyse how the sentiments associated with the words affect the rating of the restaurant. To determine the sentiment score associated with the word, the AFINN lexicon in the sentiment dataset, which is available in the tidytext package, has been used.

## Filtering out the common words and words that don't end with a lowercase letter
unnest_text_review_flt <- unnest_text_review %>%
                            anti_join(stop_words, by = "word") %>%
                            filter(str_detect(word, "[a-z]$")) 

## Joining with the AFINN sentiment data to calculate the sentiment score
## Note: in recent tidytext releases the AFINN score column is named `value` rather than `score`
word_contri <- unnest_text_review_flt %>%
                inner_join(get_sentiments("afinn"), by = "word") %>%
                group_by(word) %>%
                summarize(occurences = n(),
                          contribution = sum(score))

## Plotting the contribution of the top 35 words in the sentiment of the text
word_contri %>%
  top_n(35, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(x = word, y = contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Words with greatest contribution to Positive/Negative Sentiment",
       x = "Word", y = "Contribution") +
  coord_flip() + theme_bw()

The figure above shows the contribution of the top words to the sentiment of the text reviews. The bars to the right of zero on the horizontal axis show the contribution of positive words. As observed in the previous section, positive sentiment words are the most frequent words in the text reviews posted on the website.
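
The word-level contribution can be illustrated on a toy example. This is a minimal sketch: the two-column toy_lexicon is a hypothetical stand-in for AFINN (so that no lexicon download is needed), and the three reviews are made up.

```r
library(dplyr)
library(tidytext)

# Toy reviews (hypothetical text)
toy_reviews <- tibble(
  review_id = c("r1", "r2", "r3"),
  text = c("Amazing food and great service",
           "Terrible service, the food was bad",
           "Great tacos, great prices")
)

# Hypothetical mini-lexicon standing in for AFINN (scores range from -5 to 5)
toy_lexicon <- tibble(
  word  = c("amazing", "great", "terrible", "bad"),
  score = c(4, 3, -3, -3)
)

word_contri_toy <- toy_reviews %>%
  unnest_tokens(word, text) %>%              # one lower-cased word per row
  inner_join(toy_lexicon, by = "word") %>%   # keeps only words present in the lexicon
  group_by(word) %>%
  summarize(occurrences = n(), contribution = sum(score))

print(word_contri_toy)
```

A word's contribution is its score multiplied by the number of times it occurs, so a moderately scored but frequent word can outweigh a strongly scored rare one.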

## Joining to get the stars rating across the reviews posted
words_by_rating <- unnest_text_review_flt %>%
                    inner_join(yelp_review_vegas_ft %>%
                                 select(review_id, stars), by = "review_id") %>%
                    count(stars, word, sort = TRUE) %>%
                    ungroup()

## Forming the data to calculate the sentiment score across different star rating category
top_sentiment_words <- words_by_rating %>%
                        inner_join(get_sentiments("afinn"), by = "word") %>%
                        mutate(contribution = score * n / sum(n))

## Plotting to get the sentiment of top words in different category ratings
top_sentiment_words %>%
  group_by(stars) %>%
  top_n(15, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(x = word, y = contribution, fill = contribution > 0)) +
  facet_wrap(~ stars, scales = "free_y") +
  geom_col(show.legend = FALSE) +
  labs(title = "Top 15 Words that contribute to Sentiment Score across Star Rating",
       x = "Word", y = "Sentiment Score * # of occurrences") +
  coord_flip()

The figure above shows how the word sentiment varies across the different star ratings of the reviews. As expected, 1-star reviews have a high number of negative sentiment words. As the review rating increases, the positive sentiment of the text also increases. Words such as amazing, awesome and nice give a text review a high positive sentiment and are associated with a higher star rating.

Summary: From this analysis, we determined how the sentiment associated with the words relates to the star rating of the reviews posted on the website. Reviews whose text contains positive sentiment words tend to carry a higher star rating.
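
The per-rating contribution used in the plot is score * n / sum(n), where sum(n) is taken over the whole ungrouped table. A small worked example with hypothetical counts:

```r
library(dplyr)

# Toy counts of lexicon words by star rating (hypothetical numbers)
toy_counts <- tibble(
  stars = c(1, 1, 5, 5),
  word  = c("bad", "great", "bad", "great"),
  n     = c(30, 10, 5, 55),
  score = c(-3, 3, -3, 3)
)

# sum(n) covers all rows: 30 + 10 + 5 + 55 = 100
toy_contribution <- toy_counts %>%
  mutate(contribution = score * n / sum(n))

print(toy_contribution)
```

Here "bad" in 1-star reviews contributes -3 * 30 / 100 = -0.9, while "great" in 5-star reviews contributes 3 * 55 / 100 = 1.65.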

Sentiment Review across Neighborhood

In this section, we try to determine how the sentiment of the words varies across the popular neighborhoods. For this purpose, the nrc lexicon in the sentiments dataset, available in the tidytext package, has been used. Further, only the reviews posted in the year 2016 have been considered for analysis.

## Calculating the average star rating in the Year 2016
avg_biz_rating_2016 <-  yelp_review_vegas_ft %>%
                          filter(year_review == 2016) %>%
                          group_by(business_id) %>%
                          summarise(average_star_rating = mean(stars))

yelp_biz_vegas_bl_rate <- yelp_biz_vegas_bl %>%
                            inner_join(avg_biz_rating_2016, by = "business_id") 

## Plotting on a map the average rating, highlighting the high rated restaurants
leaflet(data = yelp_biz_vegas_bl_rate) %>% 
  addProviderTiles("CartoDB.DarkMatter") %>% 
  setView(lng = -115.2, lat = 36.13, zoom = 12) %>%
  addCircleMarkers(~longitude, ~latitude, radius = ifelse(yelp_biz_vegas_bl_rate$average_star_rating >= 4, 3, 0.5),
                   color = ifelse(yelp_biz_vegas_bl_rate$average_star_rating >= 4, "white", "blue"),
                   fillOpacity = 0.5)

The figure above highlights the areas with highly rated restaurants. A highly rated restaurant is one whose reviews have an average star rating of four or more. The white dots in the figure mark these highly rated restaurants. As can be seen, highly rated restaurants are well distributed across the region. A user can also zoom in on the map for a closer look at a neighborhood and see the streets where such restaurants are located.

## Keeping only restaurants that have a neighborhood populated against them
yelp_review_vegas_ft_neighr <- yelp_review_vegas_ft %>%
                                 inner_join(not_missing_neighrhood, by = "business_id")

## top 10 popular neighborhood, by count of reviews, in the Year 2016
top_ten_neighrhd_count_2016 <- yelp_review_vegas_ft_neighr %>%
                                filter(year_review == 2016) %>%
                                group_by(neighborhood) %>%
                                summarise(n = n()) %>%
                                arrange(desc(n)) %>%
                                top_n(10, n)

## get word-sentiment lexicon from the nrc data
nrc <- sentiments %>%
        filter(lexicon == "nrc") %>%
        select(word, sentiment)

## Unnesting token words for reviews in the year 2016
unnest_review_2016 <- yelp_review_vegas_ft_neighr %>%
                        filter(year_review == 2016, neighborhood %in% top_ten_neighrhd_count_2016$neighborhood) %>%
                        unnest_tokens(word, text) %>%
                        filter(!word %in% stop_words$word, str_detect(word, "^[a-z']+$")) %>%
                        select(review_id, business_id, neighborhood, word)

## Count total number of words in each neighbourhood
total_words_neigh <- unnest_review_2016 %>%
                      group_by(neighborhood) %>%
                      mutate(total_words = n()) %>%
                      ungroup() %>%
                      distinct(neighborhood, total_words)

## Number of words associated with each type of sentiment in each neighbourhood
sentiment_neigh <- unnest_review_2016 %>%
                    inner_join(nrc, by = "word") %>%
                    count(sentiment, neighborhood) %>%
                    ungroup() %>%
                    complete(sentiment, neighborhood, fill = list(n = 0)) %>%
                    inner_join(total_words_neigh, by = "neighborhood") %>%
                    group_by(neighborhood, sentiment, total_words) %>%
                    summarize(words = sum(n)) %>%
                    mutate(propotion = round(words / total_words * 100, digits = 1)) %>%
                    ungroup()

## Plotting to see how the sentiment of top words vary across the neighborhoods
ggplot(data = sentiment_neigh) +
  geom_bar(mapping = aes(x = neighborhood, y = propotion),
           stat = "identity",  fill = "orange") +
  facet_wrap( ~ sentiment) +
  labs(title = "Sentiment Analysis in Top 10 Neighbourhoods",
       x = "Neighbourhood", y = "Proportion of sentiment across total word count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The figure above shows how the different sentiments vary across neighborhoods. As observed in the previous analysis, positive sentiment words contribute the most to the text sentiment of the reviews. Further, sentiments associated with positivity, such as joy and trust, show the same trend across the different neighborhoods.

Summary: The analysis here shows how sentiment varies across neighborhoods. These top ten neighborhoods are the areas most preferred by users, and thus we see a high number of positive sentiment words.
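
The proportion calculation can be sketched on pre-tokenised toy data. The word list and the three-word toy_nrc lexicon are hypothetical stand-ins for the tokenised reviews and the nrc lexicon.

```r
library(dplyr)
library(tidyr)

# Pre-tokenised toy reviews: one word per row (hypothetical data)
toy_words <- tibble(
  neighborhood = c("The Strip", "The Strip", "The Strip", "The Strip",
                   "Downtown", "Downtown"),
  word = c("delicious", "friendly", "wait", "awful", "delicious", "table")
)

# Hypothetical mini-lexicon standing in for nrc
toy_nrc <- tibble(
  word      = c("delicious", "friendly", "awful"),
  sentiment = c("joy", "positive", "negative")
)

# Total number of words per neighborhood (the denominator)
toy_totals <- toy_words %>% count(neighborhood, name = "total_words")

toy_sentiment <- toy_words %>%
  inner_join(toy_nrc, by = "word") %>%
  count(sentiment, neighborhood) %>%
  complete(sentiment, neighborhood, fill = list(n = 0)) %>%  # zero rows for absent pairs
  inner_join(toy_totals, by = "neighborhood") %>%
  mutate(proportion = round(n / total_words * 100, digits = 1))

print(toy_sentiment)
```

The complete() step matters for the faceted plot: without it, a neighborhood with no words for a given sentiment would simply be missing from that panel rather than shown as zero.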

Conclusion

Problem Statement: The project aimed to analyse the text reviews posted on the Yelp.com website, specifically for restaurants in Las Vegas, NV. The analysis includes results on how the text reviews varied across time, the frequency of words across different categories and the sentiment analysis of the text posted in the reviews.

Implementation of the analysis: The data for this analysis has been downloaded from the Yelp website. This data has been sliced and diced to draw out interesting insights, which will help both the users of the website and the owners of the restaurants. A text mining approach has been used to examine sentiment based on the comments written in the reviews. Various packages in R have been used to produce the results of the analysis.

Insights from the analysis: Overall it can be determined that there is a positive sentiment associated across the reviews posted on the website.

  • With the advent of technology, there has been a huge increase in the usage of the website. Further, there has been an increase in the number of restaurants opened. Combined, these have led to an increase in the number of reviews being posted online.
  • From the word frequency analysis, it was determined that most users consider customer service one of the prime criteria when rating a restaurant
  • Words such as food, service and time dominate in determining whether a review was helpful to other users. Further, in reviews with an above-average rating, positive words have a higher contribution.
  • Words such as amazing, awesome and nice give a text review a high positive sentiment and are associated with a higher star rating
  • High rated restaurants are well distributed across the region as was determined from the map analysis of the reviews

Implication to the users of this analysis: This analysis can be used by restaurant owners to determine how the reviews posted drive the sentiment of users. They can thus devise strategies to improve upon their existing status.

Limitations of this analysis: The analysis focuses only on the contribution of single words to the sentiment of the reviews. In actual text, more than one word is often required to accurately determine the true sentiment of the text posted by the user. Various advanced machine learning techniques could be applied here to further improve the results.
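
One possible next step, short of machine learning, is to score word pairs so that a negation flips the sign of the word that follows it. The sketch below is illustrative only and was not part of the analysis above; the reviews, the two-word toy_lexicon and the negation list are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy reviews (hypothetical text)
toy_reviews <- tibble(
  review_id = c("r1", "r2"),
  text = c("the food was not good", "service was good")
)

# Hypothetical mini-lexicon standing in for AFINN, plus a short negation list
toy_lexicon <- tibble(word = c("good", "bad"), score = c(3, -3))
negations <- c("not", "no", "never")

bigram_scores <- toy_reviews %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%   # consecutive word pairs
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  inner_join(toy_lexicon, by = c("word2" = "word")) %>%      # score the second word
  mutate(score = ifelse(word1 %in% negations, -score, score))

review_scores <- bigram_scores %>%
  group_by(review_id) %>%
  summarize(sentiment = sum(score))

print(review_scores)
```

With single-word scoring both toy reviews would score +3; with the bigram rule, "not good" scores -3. One limitation of this simple rule is that a sentiment word at the very start of a review has no preceding word and is therefore never scored.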