Yelp Data Analysis of Restaurant Attributes

1. Introduction

Yelp’s mission is to “connect consumers with great local businesses” and it describes itself as “a one-stop local platform for consumers to discover, connect and transact with local businesses of all sizes” (Yelp Fast Facts). The value proposition to consumers is free access to business information, past customer reviews, and submitted photos. Potential customers use Yelp to look up businesses and find out more about them. It would make sense, then, that the more complete a business’s Yelp page is, the more likely a customer is to interact with, and potentially patronize, that business. While Yelp is best known for the reviews customers post, it is still important for businesses to claim their page and fill out basic business information such as address, business type/categories, and hours of operation. This analysis attempts to answer two questions: does greater data transparency lead to more customer engagement (number of reviews), and is that engagement more positive (star rating and review sentiment)?

The analysis focuses on restaurants in the United States and has two parts. The first part focuses on the business data file, which holds what appears on a restaurant’s business page on Yelp. I will look at how many restaurants have complete basic information (business name and address), hours of operation, and the number of factual attribute fields filled in. This will give a baseline of how many restaurants are using the free page features. Attribute fields are Boolean data: a restaurant either has the amenity or not. However, these fields can also be left null. I will analyze whether the number of reviews and the average rating differ when a field is filled out (even if ‘FALSE’) compared to when it is left blank. I will also look at whether there is a difference between marking an attribute as ‘TRUE’, indicating the business has that amenity, and marking it as ‘FALSE’ or leaving it blank. Looking at the data in these two ways will help determine whether full data transparency, or posting only the attributes the restaurant has, is better. The second part of the project looks at the reviews themselves. A basic sentiment analysis will be performed on the review text. I will compare whether businesses with the key attributes identified in the first part have more reviews with positive sentiment than those that do not.

The data sets used for this analysis are publicly available and are described below, as well as the data preparation process.

Data Sets

Data was accessed from Yelp’s Open Dataset. In total, this source has over 6.9 million reviews for 150,000 businesses in 11 metropolitan areas. However, as previously noted, this analysis focuses only on restaurants in the United States within this data. Descriptions of the data sets and their fields can also be found on the Yelp Open Dataset website.

Business: After data cleaning steps, there were 43,294 restaurants to be studied. This data file includes descriptor fields about the businesses (attributes) that will be the main focus of this analysis. The file also includes the restaurant’s lifetime star rating and number of reviews posted.
Reviews: This full data set is over 8.6 million rows in total, so a sample was taken for ease of processing. Only restaurant reviews since 2015 were considered. After data cleaning steps, the sample contains 838,020 reviews.

Packages Used

This project was completed in R and utilized several packages. The main ones used were:

jsonlite: Used to load the original JSON files from Yelp
dplyr: Used to manipulate the data
ggplot2: Used to create graphs
heatmaply: Used to create the interactive correlation matrix
rwa: Used to run the relative weights analysis
tidytext and textdata: Used to run the sentiment analysis of the review text
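
The code in this report also calls functions from a few packages that are not in the list above: rownames_to_column() (tibble), str_detect() (stringr), percent() (scales), kable() (knitr), and kable_minimal() and scroll_box() (kableExtra). A minimal setup sketch, assuming all of these are installed from CRAN, could check for them before attaching; the names pkgs, installed, and not_installed are only illustrative and are not part of the original project.

```r
# Packages named in the list above, plus the ones the later code calls implicitly
pkgs <- c("jsonlite", "dplyr", "ggplot2", "heatmaply", "rwa",
          "tidytext", "textdata",
          "tibble", "stringr", "scales", "knitr", "kableExtra")

# requireNamespace() returns FALSE (rather than stopping) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Install before running the analysis: ",
          paste(not_installed, collapse = ", "))
} else {
  # Attach everything so the unqualified calls in this report resolve
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```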

Data Preparation

The vast majority of the data preparation focused on the Business data file. When the JSON file was flattened, there were 37 attribute columns and 7 hour columns. These needed to be transformed into binary variables in order to study whether or not the restaurant listed the attribute on its Yelp page and whether it claimed to have it. User-voted attributes were also removed in order to focus the analysis only on attributes that are in the control of the business.

Data preparation of the Reviews data set will be explained in the Sentiment Analysis section.

First, attribute columns that were nearly entirely Null were removed from the Business data. Of the six that were over 99.9% Null, two were not applicable to restaurants. It may be of interest to Yelp to understand why the others are being left blank and if they need to remain available to restaurants.

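# Count missing values in each column of the restaurant data
NAcount_rest <- sapply(Restaurants_US, function(x) sum(is.na(x)))
NAcount_rest <- data.frame(NAcount_rest)
NAcount_rest <- tibble::rownames_to_column(NAcount_rest, var = "Col_name")
# Per_NA is the share of restaurants with a missing value in that column
NAcount_rest <- NAcount_rest %>%
  mutate(Per_NA = NAcount_rest/(nrow(Restaurants_US))) %>%
  arrange(desc(NAcount_rest))
# Columns that are more than 99% missing are flagged for removal
Cols_to_drop_rest <- NAcount_rest %>%
  filter(Per_NA > 0.99)

# all_of() replaces the superseded one_of() for selecting by a character vector
test_restaurants <- Restaurants_US %>%
  select(-all_of(Cols_to_drop_rest$Col_name))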

NAcount_rest %>%
  arrange(desc(NAcount_rest)) %>% slice(1:10) %>%
  ggplot(aes(x=reorder(Col_name, NAcount_rest), y=Per_NA, fill=Per_NA)) + geom_col() + 
  geom_text(aes(label=(percent(Per_NA, accuracy = 0.1))), color = 'black', size = 3, hjust = -0.1) +
  scale_y_continuous(labels = percent, name ="", limits = c(0,1.2), expand = c(0, 0)) + 
  scale_fill_gradient(low="light grey", high="black") + theme_classic() + 
  theme(legend.position = 'none', plot.title.position = 'plot') + 
  labs(title = 'Top 10 Attributes by % Null',x='') + coord_flip()

Filled Attributes

Then, we look to see which attributes are being listed on the profile, regardless of whether the restaurant has the amenity or not. We will refer to these as ‘Filled Attributes’ throughout this analysis. A function was created to convert the attribute strings to 1 if filled in and 0 if NA. Looking at the graph of attributes by percentage filled, there is a steep drop-off between the top and bottom half of attributes.

attr_function <- function(x) {
  if_else(condition = is.na(x), 
          true = 0, 
          false = 1)}
#Use mutate to count number of attributes/hours filled for each restaurant
#Taking out user-voted attributes (list below) as businesses cannot edit those on page themselves:
#GoodForGroups, PriceRange, Ambience, GoodForMeal, NoiseLevel, RestaurantsAttire, GoodForKids,
#GoodForDancing, BestNights, Music
filled_columns <- test_restaurants %>%
  select(business_id, address, starts_with("attr"), starts_with("hours")) %>%
  mutate(across(-c(business_id,address), attr_function)) %>%
  select(-attributes.Ambience, -attributes.RestaurantsPriceRange2, -attributes.GoodForMeal,
         -attributes.NoiseLevel, -attributes.RestaurantsAttire, -attributes.GoodForKids,
         -attributes.GoodForDancing, -attributes.Music, -attributes.BestNights,
         -attributes.RestaurantsGoodForGroups) %>%
  mutate(attributes.HasAddress = if_else(address == "",0,1),
         total_attr = select(.,starts_with("attr")) %>% rowSums(na.rm = TRUE),
         total_hrs = select(.,starts_with("hours")) %>% rowSums(na.rm = TRUE),
         HasHours = if_else(total_hrs>0,1,0),
         total_filled = total_attr + HasHours) %>%
  select(-address, -starts_with('hours'), -total_attr, -total_hrs)
names(filled_columns) <- gsub(x = names(filled_columns), 
                                pattern = "attributes.", replacement = "", fixed = TRUE) 

filled_summary <- filled_columns %>%
  select(-business_id, -total_filled) %>%
  colSums(na.rm=TRUE)
filled_summary <- data.frame(filled_summary)
colnames(filled_summary) <- "Total_Filled"
filled_summary <- tibble::rownames_to_column(filled_summary,'Attribute')
filled_summary <- filled_summary %>% mutate(Per_Filled = Total_Filled/(nrow(Restaurants_US)))

#Graph of Attributes and % of Restaurants with them FILLED
filled_summary%>% 
  ggplot(aes(x=reorder(Attribute, Total_Filled), y=Per_Filled, fill=Per_Filled)) + geom_col() + 
  geom_text(aes(label=(percent(Per_Filled, accuracy = 0.1))), color = 'black', size = 3, hjust = -0.1) +
  scale_y_continuous(labels = percent, name ="", limits = c(0,1.15), expand = c(0, 0)) + 
  scale_fill_gradient(low="light grey", high="black") + theme_classic() + 
  theme(legend.position = 'none', plot.title.position = 'plot') + 
  labs(title = '% of Restaurants with Filled Attribute',x='') + coord_flip()

Yes Attributes

Next, we will look at which attributes restaurants claim to possess. We will refer to these as ‘Yes Attributes’ throughout this analysis. A function was created to convert a value indicating that the restaurant has the amenity to 1; if the restaurant indicated it does not have the amenity, or the field was left blank, the value was converted to 0. Looking at the graph of attributes by percentage of restaurants that have the amenity, the drop-off between the top and bottom half of attributes is not as steep as with Filled. The top 3 are the same, but then the rankings start to differ.

attributes <- test_restaurants %>%
  select(starts_with("attr")) %>%
  lapply(unique)
#First create data frame to convert string of Attributes to 1 for Yes, 0 for No, NA left as null
#Also taking out the same user-voted attributes (listed in previous section)
Yes_columns1 <- test_restaurants %>%
  select(business_id, address, starts_with("attr"), starts_with("hours")) %>%
  select(-attributes.Ambience, -attributes.RestaurantsPriceRange2, -attributes.GoodForMeal,
         -attributes.NoiseLevel, -attributes.RestaurantsAttire, -attributes.GoodForKids,
         -attributes.GoodForDancing, -attributes.Music, -attributes.BestNights,
         -attributes.RestaurantsGoodForGroups) %>%
  mutate(RestaurantsTableService = str_detect(attributes.RestaurantsTableService, "True") * 1,
         # One regex with alternation; a vector of patterns is matched row by row, not as alternatives
         WiFi = str_detect(attributes.WiFi, "'free'|'paid'") * 1,
         BikeParking = str_detect(attributes.BikeParking, "True") * 1,
         BusinessParking = str_detect(attributes.BusinessParking, "True") * 1,
         BusinessAcceptsCreditCards = str_detect(attributes.BusinessAcceptsCreditCards, "True") * 1,
         RestaurantsReservations = str_detect(attributes.RestaurantsReservations, "True") * 1,
         WheelchairAccessible = str_detect(attributes.WheelchairAccessible, "True") * 1,
         Caters = str_detect(attributes.Caters, "True") * 1,
         OutdoorSeating = str_detect(attributes.OutdoorSeating, "True") * 1,
         HappyHour = str_detect(attributes.HappyHour, "True") * 1,
         BusinessAcceptsBitcoin = str_detect(attributes.BusinessAcceptsBitcoin, "True") * 1,
         HasTV = str_detect(attributes.HasTV, "True") * 1,
         # One regex with alternation; a vector of patterns is matched row by row, not as alternatives
         Alcohol = str_detect(attributes.Alcohol, "'beer_and_wine'|'full_bar'") * 1,
         DogsAllowed = str_detect(attributes.DogsAllowed, "True") * 1,
         RestaurantsTakeOut = str_detect(attributes.RestaurantsTakeOut, "True") * 1,
         RestaurantsDelivery = str_detect(attributes.RestaurantsDelivery, "True") * 1,
         ByAppointmentOnly = str_detect(attributes.ByAppointmentOnly, "True") * 1,
         BYOB = str_detect(attributes.BYOB, "True") * 1,
         CoatCheck = str_detect(attributes.CoatCheck, "True") * 1,
         # One regex with alternation; a vector of patterns is matched row by row, not as alternatives
         Smoking = str_detect(attributes.Smoking, "'outdoor'|'yes'") * 1,
         DriveThru = str_detect(attributes.DriveThru, "True") * 1,
         BYOBCorkage = str_detect(attributes.BYOBCorkage, "yes") *1,
         Corkage = str_detect(attributes.Corkage, "True") *1,
         HasAddress = if_else(address == "",0,1),
         TotalHours = rowSums(across(starts_with("hours"), attr_function)),
         HasHours = if_else(TotalHours >0, 1, 0)) %>%
  select(-starts_with("attr"),-starts_with("hours"), -address) %>%
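  # Sum only the attribute flags and HasAddress; TotalHours and HasHours are left out
  # here so that hours are counted once, through HasHours, as in total_filled
  mutate(attr_yes = select(., -business_id, -TotalHours, -HasHours) %>% rowSums(na.rm = TRUE),
         total_yes = HasHours + attr_yes) %>%
  select(-attr_yes, -TotalHours)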

Yes_summary <- Yes_columns1 %>%
  select(-business_id, -total_yes) %>%
  colSums(na.rm=TRUE)
Yes_summary <- data.frame(Yes_summary)
colnames(Yes_summary) <- "Total_Yes"
Yes_summary <- tibble::rownames_to_column(Yes_summary,'Attribute')
Yes_summary <- Yes_summary %>% mutate(Per_Yes = Total_Yes/(nrow(Restaurants_US)))

#In order to do linear regression, need to have as many complete cases as possible
#So create 2nd data frame to convert 1=Yes, 0=No or Null so all rows are complete cases
yes_function <- function(x) {
  if_else(condition = is.na(x), 
          true = 0, 
          false = (if_else(condition = x > 0, 1, 0)))}

Yes_columns <- Yes_columns1 %>% mutate(across(-c(business_id, total_yes, HasAddress, HasHours), yes_function))

#Graph of Attributes and % of Restaurants with them marked as 'True' or equivalent
Yes_summary%>% 
  ggplot(aes(x=reorder(Attribute, Total_Yes), y=Per_Yes, fill=Per_Yes)) + geom_col() + 
  geom_text(aes(label=(percent(Per_Yes, accuracy = 0.1))), color = 'black', size = 3, hjust = -0.1) +
  scale_y_continuous(labels = percent, name ="", limits = c(0,1.15), expand = c(0, 0)) + 
  scale_fill_gradient(low="light grey", high="black") + theme_classic() + 
  theme(legend.position = 'none', plot.title.position = 'plot') + 
  labs(title = '% of Restaurants with Yes Attribute',x='') + coord_flip()
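
To make the difference between the two codings concrete, here is a small base R sketch on made-up values. The example strings (such as "u'free'") are assumptions about how the flattened attribute values look, and the helper names filled() and yes() are hypothetical; the project itself uses attr_function and str_detect as shown above.

```r
# Toy values in the style of the flattened attribute strings (illustrative only)
wifi <- c("u'free'", "'no'", NA, "'paid'")
takeout <- c("True", "False", NA, "True")

# Filled coding: 1 if anything was entered, 0 if the field was left blank
filled <- function(x) ifelse(is.na(x), 0, 1)

# Yes coding: 1 only when the value signals the amenity; "no" and blank both become 0.
# A single alternation pattern is applied to every row.
yes <- function(x, pattern) ifelse(!is.na(x) & grepl(pattern, x), 1, 0)

coding <- data.frame(
  wifi_filled = filled(wifi),
  wifi_yes = yes(wifi, "free|paid"),
  takeout_filled = filled(takeout),
  takeout_yes = yes(takeout, "True")
)
print(coding)
```

A restaurant that answers "no" counts toward Filled but not toward Yes, while a blank field counts toward neither.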

2. Initial Analysis

These two data frames for Filled and Yes Attributes were merged with the original business data to bring in the restaurant’s Star Rating and Review Count. Now, we can look at descriptive statistics of the business data to see if there are any initial trends.
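
The merge itself is not shown in this report. A minimal sketch, assuming the join key is business_id and that the business data carries stars and review_count, could look like the following. The toy data frames business, filled_totals, and yes_totals are hypothetical stand-ins; only the name Filled_Yes comes from the code used later.

```r
# Toy stand-ins for the real data frames (values are made up for illustration)
business <- data.frame(business_id = c("a", "b", "c"),
                       stars = c(4.0, 3.5, 2.0),
                       review_count = c(120, 45, 8))
filled_totals <- data.frame(business_id = c("a", "b", "c"),
                            total_filled = c(14, 11, 6))
yes_totals <- data.frame(business_id = c("c", "a", "b"),
                         total_yes = c(3, 9, 7))

# Join on business_id; merge() matches rows by key, so row order does not matter
Filled_Yes <- merge(business, filled_totals, by = "business_id")
Filled_Yes <- merge(Filled_Yes, yes_totals, by = "business_id")
print(Filled_Yes)
```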

Overall Averages

Star Rating

Star ratings range from 1 to 5 in half-steps. The most common rating is 4 stars, and the average rating is about 3.5 stars.

Review Count

The majority of restaurants have under 200 reviews, with the average at 115. However, there is a large spread in values. Review count ranged from 5 to 9185.

Filled and Yes Attribute counts

The number of Filled Attributes is approximately normally distributed (peaking at 13-14 attributes), while the number of Yes Attributes has two peaks (at 5-6 and 17-18 attributes).

Averages by Star Rating

Combining the data above, we can look to see if the average number of reviews and attributes changes with star ratings. The averages of total filled, total yes, and reviews rise consistently up to 4 stars, then fall at 4.5 and 5 stars.

stars	mean_filled	mean_yes	mean_reviews
1.0	7.18	10.60	17.54
1.5	9.60	12.60	27.77
2.0	10.22	13.03	34.65
2.5	10.69	13.44	51.97
3.0	11.42	14.34	78.13
3.5	12.30	15.28	125.66
4.0	12.84	15.56	166.42
4.5	12.56	14.90	138.10
5.0	10.47	12.86	39.46
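
The code behind the table above is not shown. One way such a summary could be produced, sketched here in base R on made-up data, is to average each measure within each star rating. The data frame toy is hypothetical; the output column names mirror the table.

```r
# Toy data (made up); the real data has one row per restaurant
toy <- data.frame(stars = c(3.5, 3.5, 4.0, 4.0, 5.0),
                  total_filled = c(12, 14, 13, 15, 10),
                  total_yes = c(15, 17, 16, 18, 12),
                  review_count = c(100, 150, 160, 180, 40))

# Mean of each measure within each star rating
by_star <- aggregate(cbind(total_filled, total_yes, review_count) ~ stars,
                     data = toy, FUN = mean)
names(by_star) <- c("stars", "mean_filled", "mean_yes", "mean_reviews")
print(by_star)
```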

Baseline Linear Regression

Star Rating by Review Count

First, we want to see if there is already a relationship between Star Rating and Review Count. While the p-value is very small (so the relationship is statistically significant), the estimate is nearly zero and the R-squared is very low at 0.02. Therefore, Review Count is not a great predictor of Star Rating and we should do further analysis.

review_stars_lm <- lm(stars ~ review_count, data = Filled_Yes)
summary(review_stars_lm)

## 
## Call:
## lm(formula = stars ~ review_count, data = Filled_Yes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8246 -0.4941  0.0091  0.5064  1.5191 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.478e+00  4.227e-03   822.8   <2e-16 ***
## review_count 5.276e-04  1.707e-05    30.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7788 on 43292 degrees of freedom
## Multiple R-squared:  0.02158,    Adjusted R-squared:  0.02156 
## F-statistic:   955 on 1 and 43292 DF,  p-value: < 2.2e-16

ggplot(Filled_Yes, aes(review_count, stars)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
  scale_y_continuous(limits = c(1,5)) + labs(title = 'Predicting Star Rating by Review Count', size = "# Restaurants") +
  theme_classic() + theme(plot.title.position = 'plot')
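
To put the size of the review_count coefficient in perspective, the arithmetic below uses the estimates from the summary output above and the average review count of 115 reported earlier. It is only a worked illustration of the fitted line; the variable names are illustrative.

```r
# Coefficients copied from the summary output above
intercept <- 3.478
slope <- 5.276e-04

# Predicted change in stars for 100 additional reviews
change_per_100 <- slope * 100

# Predicted rating at the average review count of 115
pred_at_mean <- intercept + slope * 115

round(c(change_per_100 = change_per_100, pred_at_mean = pred_at_mean), 3)
```

An extra 100 reviews corresponds to only about 0.05 stars on the fitted line, which is why the model explains so little despite the very small p-value.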

Total Filled

Then, we look to see if the value Total Filled attributes is a good predictor for Star Rating or Review Count.

For Star Rating and Total Filled, the relationship is again statistically significant, but the R-squared is very low (0.028).

lm_filled_stars <- lm(stars ~ total_filled, data = Filled_Yes)
summary(lm_filled_stars)

## 
## Call:
## lm(formula = stars ~ total_filled, data = Filled_Yes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72541 -0.50880  0.02214  0.49120  1.83159 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.1684140  0.0111836  283.31   <2e-16 ***
## total_filled 0.0309444  0.0008803   35.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7763 on 43292 degrees of freedom
## Multiple R-squared:  0.02775,    Adjusted R-squared:  0.02773 
## F-statistic:  1236 on 1 and 43292 DF,  p-value: < 2.2e-16

ggplot(Filled_Yes, aes(total_filled, stars)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
  scale_y_continuous(limits = c(1,5)) + labs(title = 'Predicting Star Rating by Total Filled', size = "# Restaurants") +
  theme_classic() + theme(plot.title.position = 'plot')

For Review Count and Total Filled, the relationship is also statistically significant. The R-squared is low (0.142), but it will be one of the highest in this analysis. Looking at the graph, one can see that the very large Review Counts are towards the higher end of total_filled.

lm_filled_review <- lm(review_count ~ total_filled, data = Filled_Yes)
summary(lm_filled_review)

## 
## Call:
## lm(formula = review_count ~ total_filled, data = Filled_Yes)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -305.6  -83.6  -36.6   30.9 9010.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -118.4478     2.9250  -40.49   <2e-16 ***
## total_filled   19.5004     0.2302   84.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203 on 43292 degrees of freedom
## Multiple R-squared:  0.1421, Adjusted R-squared:  0.1421 
## F-statistic:  7173 on 1 and 43292 DF,  p-value: < 2.2e-16

ggplot(Filled_Yes, aes(total_filled, review_count)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
  scale_y_continuous(expand = c(0,0)) + labs(title = 'Predicting Review Count by Total Filled', size = "# Restaurants") +
  theme_classic() + theme(plot.title.position = 'plot')

Total Yes

Finally, we check if Total Yes is a good predictor for Star Rating or Review Count.

For Star Rating and Total Yes, the relationship is statistically significant, but the correlation is very low. In fact, it is worse than the models using Review Count or Total Filled.

lm_yes_stars <- lm(stars ~ total_yes, data = Filled_Yes)
summary(lm_yes_stars)

## 
## Call:
## lm(formula = stars ~ total_yes, data = Filled_Yes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64854 -0.54359 -0.00861  0.47390  1.71876 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.2812366  0.0116731  281.09   <2e-16 ***
## total_yes   0.0174906  0.0007498   23.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7824 on 43292 degrees of freedom
## Multiple R-squared:  0.01241,    Adjusted R-squared:  0.01239 
## F-statistic: 544.1 on 1 and 43292 DF,  p-value: < 2.2e-16

ggplot(Filled_Yes, aes(total_yes, stars)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
  scale_y_continuous(limits = c(1,5)) + labs(title = 'Predicting Star Rating by Total Yes', size = "# Restaurants") +
  theme_classic() + theme(plot.title.position = 'plot')

For Review Count and Total Yes, the relationship is again statistically significant, and the R-squared is also low (0.088). The graph shows a pattern similar to Total Filled, although less pronounced, with the very high review counts towards the higher end of total_yes.

lm_yes_reviews <- lm(review_count ~ total_yes, data = Filled_Yes)
summary(lm_yes_reviews)

## 
## Call:
## lm(formula = review_count ~ total_yes, data = Filled_Yes)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -256.3  -92.5  -50.5   31.3 9053.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -76.2337     3.1230  -24.41   <2e-16 ***
## total_yes    12.9820     0.2006   64.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.3 on 43292 degrees of freedom
## Multiple R-squared:  0.0882, Adjusted R-squared:  0.08818 
## F-statistic:  4188 on 1 and 43292 DF,  p-value: < 2.2e-16

ggplot(Filled_Yes, aes(total_yes, review_count)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
  scale_y_continuous(expand = c(0,0)) + labs(title = 'Predicting Review Count by Total Yes', size = "# Restaurants") +
  theme_classic() + theme(plot.title.position = 'plot')

3. Difference of Means

As a starting point for analysis by attribute, we will compare how the average Star Rating and average Review Count differ between the 1/0 (True/False) groups of Filled and Yes Attributes.
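
The comparison can be sketched in base R as follows. The data are made up and the helper mean_diff() is hypothetical; the column names AvgStar_0, AvgStar_1, and star_diff mirror the tables in this section, where the difference is the mean for the 1 group minus the mean for the 0 group.

```r
# Toy data (made up): two 0/1 attribute columns plus the star rating
toy <- data.frame(stars = c(3.0, 4.0, 4.5, 2.5, 3.5, 5.0),
                  DriveThru = c(1, 0, 0, 1, 0, 0),
                  BYOB = c(0, 1, 1, 0, 0, 0))

# For one attribute column, average the outcome within the 0 group and the 1 group
mean_diff <- function(flag, outcome) {
  avg0 <- mean(outcome[flag == 0])
  avg1 <- mean(outcome[flag == 1])
  c(AvgStar_0 = avg0, AvgStar_1 = avg1, star_diff = avg1 - avg0)
}

attrs <- c("DriveThru", "BYOB")
diffs <- t(sapply(toy[attrs], mean_diff, outcome = toy$stars))

# Sort by the absolute size of the gap, largest first
diffs <- diffs[order(-abs(diffs[, "star_diff"])), , drop = FALSE]
print(round(diffs, 2))
```

The same function applies to Review Count by passing the review counts as the outcome.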

Star Rating

Filled Attributes

Looking first by Filled Attributes, we can see the spread of average Star Rating for restaurants that include information about the amenity (AvgStar_1) and those that do not (AvgStar_0).

The five largest differences in mean Star Rating for Filled Attributes were DriveThru, BYOB, WheelchairAccessible, DogsAllowed, and HasAddress.

Attribute	AvgStar_0	AvgStar_1	star_diff
DriveThru	3.59	3.06	-0.54
BYOB	3.50	4.02	0.52
WheelchairAccessible	3.41	3.90	0.49
DogsAllowed	3.43	3.89	0.46
HasAddress	3.96	3.53	-0.43
BusinessAcceptsBitcoin	3.48	3.90	0.42
RestaurantsTableService	3.40	3.77	0.37
ByAppointmentOnly	3.51	3.84	0.32
Corkage	3.51	3.80	0.28
HappyHour	3.46	3.74	0.28
HasHours	3.31	3.58	0.27
Smoking	3.52	3.77	0.25
WiFi	3.41	3.58	0.17
BikeParking	3.44	3.58	0.15
BusinessParking	3.40	3.55	0.15
RestaurantsReservations	3.67	3.52	-0.15
CoatCheck	3.53	3.67	0.14
Caters	3.45	3.58	0.13
RestaurantsDelivery	3.63	3.53	-0.10
BusinessAcceptsCreditCards	3.60	3.54	-0.07
BYOBCorkage	3.54	3.48	-0.07
Alcohol	3.49	3.55	0.06
HasTV	3.49	3.55	0.06
RestaurantsTakeOut	3.58	3.54	-0.05
OutdoorSeating	3.51	3.54	0.03

Yes Attributes

For the Yes Attributes, the largest average differences were for nearly the same attributes as for the Filled Attributes, although in a slightly different order.

The largest differences were for DriveThru, BYOB, DogsAllowed, and WheelchairAccessible, followed by ByAppointmentOnly and HasAddress, which are tied at 0.43 in absolute value after rounding.

The average difference for DriveThru was nearly twice as large when marked Yes (-1.01) as when Filled (-0.54). BYOB and DogsAllowed also had larger spreads for Yes than for Filled.

Attribute	AvgStar_0	AvgStar_1	star_diff
DriveThru	3.59	2.58	-1.01
BYOB	3.53	4.23	0.70
DogsAllowed	3.50	4.07	0.57
WheelchairAccessible	3.42	3.89	0.47
ByAppointmentOnly	3.54	3.96	0.43
HasAddress	3.96	3.53	-0.43
BusinessAcceptsBitcoin	3.54	3.95	0.41
Corkage	3.53	3.87	0.34
HasHours	3.31	3.58	0.27
RestaurantsTableService	3.48	3.75	0.27
BikeParking	3.41	3.65	0.24
BusinessParking	3.37	3.61	0.24
CoatCheck	3.54	3.76	0.23
HasTV	3.66	3.44	-0.22
Caters	3.46	3.66	0.21
OutdoorSeating	3.45	3.66	0.20
RestaurantsDelivery	3.63	3.45	-0.18
Smoking	3.54	3.71	0.17
BusinessAcceptsCreditCards	3.69	3.53	-0.16
HappyHour	3.51	3.65	0.14
RestaurantsReservations	3.51	3.62	0.11
RestaurantsTakeOut	3.64	3.53	-0.11
Alcohol	3.52	3.62	0.10
WiFi	3.53	3.59	0.06
BYOBCorkage	3.54	3.49	-0.05

Review Count

Filled Attributes

Review Counts had different top attributes than Star Ratings for Filled Attributes. The attributes with the largest differences in average number of reviews were ByAppointmentOnly, DogsAllowed, BYOBCorkage, HappyHour, and Corkage.

Unlike the differences in Star Ratings, no Filled Attribute showed a negative difference in review count. For every attribute, restaurants with Filled = 1 averaged more reviews than those with Filled = 0.

Attribute	AvgRev_0	AvgRev_1	rev_diff
ByAppointmentOnly	98.76	315.47	216.70
DogsAllowed	72.03	251.11	179.08
BYOBCorkage	100.04	277.98	177.95
HappyHour	72.91	225.08	152.17
Corkage	102.68	247.33	144.64
Smoking	105.16	242.90	137.74
CoatCheck	103.58	236.47	132.89
Caters	32.04	151.70	119.66
BikeParking	33.13	151.23	118.10
WiFi	26.65	143.61	116.95
BYOB	107.08	219.40	112.33
BusinessParking	17.56	124.72	107.16
Alcohol	28.33	134.72	106.39
BusinessAcceptsCreditCards	16.57	121.17	104.60
HasTV	31.27	133.50	102.24
OutdoorSeating	26.10	127.52	101.42
HasHours	30.08	130.21	100.13
RestaurantsDelivery	24.93	122.67	97.74
RestaurantsTableService	77.81	175.20	97.38
BusinessAcceptsBitcoin	101.46	197.03	95.57
RestaurantsReservations	33.24	127.94	94.69
RestaurantsTakeOut	27.21	120.63	93.42
WheelchairAccessible	91.87	178.39	86.52
HasAddress	32.34	115.91	83.58
DriveThru	111.37	147.69	36.32

Yes Attributes

The attributes with the largest differences in average number of reviews were different for Yes Attributes compared to Filled Attributes. ByAppointmentOnly, CoatCheck, Corkage, BYOBCorkage, and RestaurantsTableService made up the top five.

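DriveThru was the only Yes Attribute with a negative difference in Review Count when Yes = 1. As with the Star Ratings, the largest differences in average Review Count were higher for Yes Attributes than for Filled (for example, 289.33 versus 216.70 for ByAppointmentOnly), although many attributes further down the list, such as WiFi and Caters, showed smaller differences for Yes than for Filled.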

Attribute	AvgRev_0	AvgRev_1	rev_diff
ByAppointmentOnly	114.17	403.50	289.33
CoatCheck	113.47	341.52	228.05
Corkage	109.93	313.44	203.52
BYOBCorkage	110.35	248.09	137.74
RestaurantsTableService	86.77	217.49	130.72
BusinessParking	31.55	149.38	117.83
HappyHour	94.28	211.57	117.29
BYOB	114.08	228.18	114.10
HasHours	30.08	130.21	100.13
BikeParking	63.36	159.51	96.15
RestaurantsReservations	89.62	184.68	95.07
DogsAllowed	109.19	198.64	89.44
Alcohol	95.74	184.51	88.77
WheelchairAccessible	93.20	180.71	87.51
HasAddress	32.34	115.91	83.58
BusinessAcceptsCreditCards	40.25	121.94	81.69
Smoking	114.26	173.55	59.29
DriveThru	117.78	64.10	-53.68
Caters	95.40	144.67	49.26
WiFi	105.38	151.55	46.17
OutdoorSeating	95.94	141.24	45.31
BusinessAcceptsBitcoin	115.02	146.92	31.90
RestaurantsDelivery	101.21	128.80	27.59
HasTV	100.08	126.55	26.47
RestaurantsTakeOut	112.01	115.53	3.51

4. Correlation Matrix

If we were to build a multiple regression model, we would need to first check that the independent variables are not highly correlated to each other. Regression analysis requires predictors to not have high levels of multi-collinearity. A correlation matrix can help us to evaluate all variables at once and their correlations between them. This same matrix will also help us to see which attributes have the strongest correlations to Star Rating and Review Count.

The correlation matrices reveal that there is a higher level of multi-collinearity between the variables than desired for a multiple regression model. Therefore, another type of analysis should be used to evaluate the attributes and their impact on Star Rating and Review Count.

Filled Attributes

There are several pairs of attributes that have correlations over 0.5, which would suggest a high level of collinearity. RestaurantsTableService/WheelchairAccessible, WiFi/Caters, WiFi/BikeParking, WiFi/HasTV, BikeParking/Caters, RestaurantsReservations/OutdoorSeating, RestaurantsReservations/HasTV, RestaurantsReservations/Alcohol, RestaurantsReservations/RestaurantsDelivery, WheelchairAccessible/DogsAllowed, OutdoorSeating/HasTV, OutdoorSeating/Alcohol, BYOB/Corkage, and CoatCheck/Smoking all have correlation coefficients over +/- 0.5.
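
Pairs like these can also be pulled out of a correlation matrix programmatically rather than read off the heatmap. The base R sketch below uses simulated 0/1 indicators (AttrA, AttrB, and AttrC are made-up names) and lists each pair whose correlation exceeds 0.5 in absolute value.

```r
set.seed(42)
n <- 200
# Toy 0/1 indicators (made up); b is built to agree with a most of the time
a <- rbinom(n, 1, 0.5)
b <- ifelse(runif(n) < 0.9, a, 1 - a)
c_ <- rbinom(n, 1, 0.5)
toy <- data.frame(AttrA = a, AttrB = b, AttrC = c_)

cm <- cor(toy)

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(cm) & abs(cm) > 0.5, arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                         var2 = colnames(cm)[idx[, "col"]],
                         r = cm[idx])
print(high_pairs)
```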

The correlation matrix can help us quickly spot these pairs, with higher correlation coefficients having a darker color. Hover over the boxes to see the pairs and the values.

cor_FS_full <- Filled_Stars %>%
  select(-business_id,-total_filled) %>% cor()
cor_FS_full <- data.frame(cor_FS_full)

heatmaply_cor(cor(Filled_Stars[4:28]),
  dendrogram = "none",
  xlab = "", ylab = "", main = "Filled Attribute Correlation",
  grid_gap = 1, grid_width = 0.00001, margins = c(30,30,40,10),
  hide_colorbar = TRUE,
  plot_method = "plotly",
  label_names = c("row", "column", "Correlation"),
  fontsize_row = 10, fontsize_col = 8,
  heatmap_layers = theme(axis.line=element_blank()))

Looking at just Stars and Review Count, no Filled Attribute has very high correlation.

The three highest for Stars are WheelchairAccessible (0.276), DogsAllowed (0.252), and RestaurantsTableService (0.231).

The three highest for Review Count are DogsAllowed (0.349), HappyHour (0.311), and ByAppointmentOnly (0.261).

heatmaply(cor_FS_full[1:2],
          scale_fill_gradient_fun = ggplot2::scale_fill_gradient2(low = "black", 
          high = "red3", midpoint = 0, limits = c(-0.5, 1)),
          dendrogram = "none",
          xlab = "", ylab = "", main = "Filled Attribute Correlation",
          grid_color = "white", grid_width = 0.00001, margins = c(30,30,40,10),
          hide_colorbar = TRUE,
          label_names = c("row", "column", "Correlation"),
          fontsize_row = 10, fontsize_col = 8,
          heatmap_layers = theme(axis.line=element_blank()))

Yes Attributes

In contrast to Filled Attributes, no pair of Yes Attributes has a correlation beyond +/-0.5.

The highest correlations are RestaurantsTableService/WheelchairAccessible (0.403), RestaurantsReservations/RestaurantsTableService (0.356), BikeParking/BusinessParking (0.291), BikeParking/Caters (0.273), and BikeParking/WheelchairAccessible (0.263).

Again, we can see where there is high correlation in the below heatmap, with darker colors signaling higher correlations. Hover over the boxes to see the pairs and the values.

cor_Yes_full <- Yes_Stars %>%
  select(-business_id, -total_yes) %>% cor()
cor_Yes_full <- data.frame(cor_Yes_full)

heatmaply_cor(cor(Yes_Stars[4:28]),
               dendrogram = "none",
  xlab = "", ylab = "", main = "Yes Attribute Correlation",
  grid_gap = 1, grid_width = 0.00001, margins = c(30,30,40,10),
  hide_colorbar = TRUE,
  plot_method = "plotly",
  label_names = c("row", "column", "Correlation"),
  fontsize_row = 10, fontsize_col = 8,
  heatmap_layers = theme(axis.line=element_blank()))

Looking at just Stars and Review Count, no Yes Attribute has very high correlation.

The three highest for Stars are DriveThru (-0.28), WheelchairAccessible (0.257), and DogsAllowed (0.18). The three highest for Review Count are RestaurantsTableService (0.247), BusinessParking (0.244), and BikeParking (0.219).

heatmaply(cor_Yes_full[1:2],
          scale_fill_gradient_fun = ggplot2::scale_fill_gradient2(low = "black", 
          high = "red3", midpoint = 0, limits = c(-0.5, 1)),
          dendrogram = "none",
          xlab = "", ylab = "", main = "Yes Attribute Correlation",
          grid_color = "white", grid_width = 0.00001, margins = c(30,30,40,10),
          hide_colorbar = TRUE,
          label_names = c("row", "column", "Correlation"),
          fontsize_row = 10, fontsize_col = 8,
          heatmap_layers = theme(axis.line=element_blank()))

5. Relative Weights Analysis

Because there is a high level of correlation between these predictors, a multiple regression model with all attributes is not the best course of action. When predictors are correlated, the individual regression coefficients become unstable and are hard to interpret as measures of importance. Instead, a Relative Weights Analysis (also called Key Driver Analysis) is used in this project. RWA can help find which variables are most important and have the most impact on the dependent variable.

The results of the RWA give us the R-squared of a model with all attributes. However, the results we are most interested in are the Signed Rescaled Relative Weights. These values estimate relative importance as the percentage of the predicted variance attributed to each variable, and the sign signals whether the variable has a positive or negative impact.
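
For readers who want to see what the relative weights represent, here is a base R sketch of the calculation as it is commonly described (Johnson's relative weights), run on simulated data. It is not the rwa package's code, and the package may differ in details such as sign handling, which is omitted here. The function relative_weights() and the variables x1, x2, x3, and y are hypothetical.

```r
set.seed(1)
n <- 500
# Simulated predictors (made up); x2 is deliberately correlated with x1
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- rnorm(n)
y <- 0.5 * x1 - 0.3 * x2 + 0.2 * x3 + rnorm(n)
X <- cbind(x1 = x1, x2 = x2, x3 = x3)

relative_weights <- function(X, y) {
  R_xx <- cor(X)       # correlations among predictors
  r_xy <- cor(X, y)    # correlations of predictors with the outcome
  eig <- eigen(R_xx)
  # Symmetric square root of R_xx: links predictors to their orthogonal counterparts
  lambda <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
  # Standardized coefficients of the outcome on the orthogonal variables
  beta <- solve(lambda, r_xy)
  # Raw weight: share of R-squared credited to each original predictor
  raw <- (lambda^2) %*% (beta^2)
  rsquare <- sum(raw)
  data.frame(Variables = colnames(X),
             Raw.RelWeight = as.numeric(raw),
             Rescaled.RelWeight = as.numeric(raw) / rsquare * 100)
}

rw <- relative_weights(X, y)
print(rw)
```

The raw weights add up to the R-squared of the full regression, and the rescaled weights add up to 100, which is why they can be read as percentages of the explained variance.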

In general, it appears that Yes Attributes have more impact on Star Ratings, while Filled Attributes have more impact on Review Count.

Star Rating

Filled Attributes

Looking at Filled Attributes first, results of the RWA show that a model with all 25 attributes has an R-squared of 0.176. This is low, but it is higher than the baseline regression run earlier.

The attributes with the largest rescaled relative weights are DriveThru, WheelchairAccessible, DogsAllowed, RestaurantsTableService, and BusinessAcceptsBitcoin.

Fill_col_names <- colnames(Filled_Stars[c(4:28)])

rwa_fill_stars <- Filled_Stars %>% 
  rwa(outcome = 'stars', 
      predictors = Fill_col_names,
      applysigns = TRUE)
rwa_fill_s_results <- rwa_fill_stars$result
rwa_fill_stars[2]

## $rsquare
## [1] 0.1760108

rwa_fill_s_results %>% 
  select(-Sign) %>%
  arrange(desc(Rescaled.RelWeight)) %>%
  kable(digits=3) %>%
  kable_minimal(full_width = F, position = 'center') %>%
  scroll_box(height = '300px')

Variables	Raw.RelWeight	Rescaled.RelWeight	Sign.Rescaled.RelWeight
DriveThru	0.047	26.505	-26.505
WheelchairAccessible	0.023	13.177	13.177
DogsAllowed	0.023	12.926	12.926
RestaurantsTableService	0.020	11.141	11.141
BusinessAcceptsBitcoin	0.009	5.052	5.052
HasHours	0.008	4.684	4.684
BYOB	0.007	4.116	4.116
HappyHour	0.006	3.440	3.440
RestaurantsReservations	0.005	3.062	-3.062
ByAppointmentOnly	0.004	2.004	2.004
HasAddress	0.003	1.774	-1.774
Corkage	0.003	1.674	1.674
WiFi	0.003	1.634	1.634
BusinessParking	0.002	1.407	1.407
RestaurantsDelivery	0.002	1.161	-1.161
BusinessAcceptsCreditCards	0.002	1.066	-1.066
BikeParking	0.002	0.943	0.943
Smoking	0.001	0.745	0.745
CoatCheck	0.001	0.660	-0.660
Caters	0.001	0.654	0.654
HasTV	0.001	0.551	-0.551
Alcohol	0.001	0.470	0.470
BYOBCorkage	0.001	0.462	-0.462
OutdoorSeating	0.001	0.358	-0.358
RestaurantsTakeOut	0.001	0.334	-0.334

rwa_fill_s_results %>% 
  arrange(desc(Rescaled.RelWeight)) %>% slice(1:25) %>%
  ggplot(aes(x=reorder(Variables, Rescaled.RelWeight), y=Sign.Rescaled.RelWeight, fill=Sign)) + geom_col() +
  scale_y_continuous(limits = c(-30,15))+
  geom_text(aes(label=round(Sign.Rescaled.RelWeight,2),
                # hjust is computed per row inside aes() so it follows the arranged data
                hjust=ifelse(Sign.Rescaled.RelWeight < 0, 0, 1)),
            color = 'black', size = 3) +
  scale_fill_manual(values = c('grey80', 'grey60')) + theme_classic() + 
  theme(legend.position = 'none', plot.title.position = 'plot') + 
  labs(title = 'Relative Weight by Filled Attribute - Stars',x='') + coord_flip()

Yes Attributes

Results of the RWA show that a model with all 25 Yes Attributes has an R-squared of 0.215. This is higher than both the baseline regression and the model for Filled Attributes.

The attributes with the largest rescaled relative weights are DriveThru, WheelchairAccessible, HasTV, DogsAllowed, and BikeParking.

Yes_col_names <- colnames(Yes_Stars[c(4:28)])

rwa_yes_stars <- Yes_Stars %>% 
  rwa(outcome = 'stars', 
      predictors = Yes_col_names,
      applysigns = TRUE)
rwa_yes_s_results <- rwa_yes_stars$result
rwa_yes_stars[2]

## $rsquare
## [1] 0.2151438

rwa_yes_s_results %>%
  select(-Sign) %>%
  arrange(desc(Rescaled.RelWeight)) %>%
  kable(digits=3) %>%
  kable_minimal(full_width = F, position = 'center') %>%
  scroll_box(height = '300px')

Variables	Raw.RelWeight	Rescaled.RelWeight	Sign.Rescaled.RelWeight
DriveThru	0.063	29.351	-29.351
WheelchairAccessible	0.034	15.746	15.746
HasTV	0.017	8.037	-8.037
DogsAllowed	0.014	6.391	6.391
BikeParking	0.011	5.338	5.338
RestaurantsDelivery	0.011	5.286	-5.286
HasHours	0.011	4.987	4.987
BusinessParking	0.010	4.532	4.532
BusinessAcceptsCreditCards	0.008	3.846	-3.846
Caters	0.008	3.827	3.827
RestaurantsTableService	0.007	3.428	3.428
OutdoorSeating	0.007	3.237	3.237
BYOB	0.003	1.431	1.431
HasAddress	0.003	1.168	-1.168
RestaurantsTakeOut	0.002	1.023	-1.023
Corkage	0.001	0.577	0.577
HappyHour	0.001	0.377	0.377
RestaurantsReservations	0.001	0.351	0.351
Alcohol	0.001	0.298	0.298
ByAppointmentOnly	0.000	0.215	0.215
WiFi	0.000	0.187	0.187
BYOBCorkage	0.000	0.145	-0.145
Smoking	0.000	0.089	-0.089
BusinessAcceptsBitcoin	0.000	0.076	0.076
CoatCheck	0.000	0.057	0.057

rwa_yes_s_results %>% 
  arrange(desc(Rescaled.RelWeight)) %>% slice(1:25) %>%
  ggplot(aes(x=reorder(Variables, Rescaled.RelWeight), y=Sign.Rescaled.RelWeight, fill=Sign)) + geom_col() +
  scale_y_continuous(limits = c(-40,30))+
  geom_text(aes(label=round(Sign.Rescaled.RelWeight,2),
                # hjust is computed per row inside aes() so it follows the arranged data
                hjust=ifelse(Sign.Rescaled.RelWeight < 0, 0, 0.5)),
            color = 'black', size = 3) +
  scale_fill_manual(values = c('grey80', 'grey60')) + theme_classic() + 
  theme(legend.position = 'none', plot.title.position = 'plot') + 
  labs(title = 'Relative Weight by Yes Attribute - Stars',x='') + coord_flip()

Review Count

Filled Attributes

The R-squared for the model of Review Count on all Filled Attributes is 0.256, the highest of the four RWA models.

The Filled Attributes with the largest rescaled relative weights are DogsAllowed, BYOBCorkage, ByAppointmentOnly, HappyHour, and BikeParking.

rwa_fill_reviews <- Filled_Stars %>% 
  rwa(outcome = 'review_count', 
      predictors = Fill_col_names,
      applysigns = TRUE)
rwa_fill_r_results <- rwa_fill_reviews$result
rwa_fill_reviews[2]

## $rsquare
## [1] 0.2555005

rwa_fill_r_results %>%
  select(-Sign) %>%
  arrange(desc(Rescaled.RelWeight)) %>%
  kable(digits=3) %>%
  kable_minimal(full_width = F, position = 'center') %>%
  scroll_box(height = '300px')

Variables	Raw.RelWeight	Rescaled.RelWeight	Sign.Rescaled.RelWeight
DogsAllowed	0.052	20.442	20.442
BYOBCorkage	0.039	15.158	15.158
ByAppointmentOnly	0.034	13.315	13.315
HappyHour	0.033	12.987	12.987
BikeParking	0.013	5.276	5.276
Caters	0.013	5.209	5.209
Corkage	0.009	3.424	3.424
WiFi	0.008	3.275	3.275
RestaurantsTableService	0.007	2.910	2.910
HasHours	0.006	2.419	2.419
Smoking	0.005	2.017	2.017
CoatCheck	0.005	1.906	1.906
WheelchairAccessible	0.005	1.839	-1.839
BusinessAcceptsBitcoin	0.004	1.512	1.512
Alcohol	0.004	1.472	1.472
HasTV	0.003	1.314	1.314
BYOB	0.003	1.109	1.109
BusinessParking	0.003	1.041	1.041
RestaurantsReservations	0.003	1.035	1.035
OutdoorSeating	0.002	0.725	0.725
RestaurantsDelivery	0.001	0.524	0.524
BusinessAcceptsCreditCards	0.001	0.351	0.351
RestaurantsTakeOut	0.001	0.317	0.317
DriveThru	0.001	0.284	0.284
HasAddress	0.000	0.139	0.139

rwa_fill_r_results %>% 
  arrange(desc(Rescaled.RelWeight)) %>% slice(1:25) %>%
  ggplot(aes(x=reorder(Variables, Rescaled.RelWeight), y=Sign.Rescaled.RelWeight, fill=Sign)) + geom_col() +
  scale_y_continuous(limits = c(-3,25)) +
  geom_text(aes(label=round(Sign.Rescaled.RelWeight,2)), color = 'black', size = 3, 
            hjust=-0.1) +
  scale_fill_manual(values = c('grey80', 'grey60')) + theme_classic() + 
  theme(legend.position = 'none', plot.title.position = 'plot') + 
  labs(title = 'Relative Weight by Filled Attribute - Reviews',x='') + coord_flip()

Yes Attributes

The R-squared for the model of Review Count on all Yes Attributes is 0.164.

The Yes Attributes with the largest rescaled relative weights are BusinessParking, RestaurantsTableService, BikeParking, HappyHour, and RestaurantsReservations.

rwa_yes_reviews <- Yes_Stars %>% 
  rwa(outcome = 'review_count', 
      predictors = Yes_col_names,
      applysigns = TRUE)
rwa_yes_r_results <- rwa_yes_reviews$result
rwa_yes_reviews[2]

## $rsquare
## [1] 0.1635754

rwa_yes_r_results %>%
  select(-Sign) %>%
  arrange(desc(Rescaled.RelWeight)) %>%
  kable(digits=3) %>%
  kable_minimal(full_width = F, position = 'center') %>%
  scroll_box(height = '300px')

Variables	Raw.RelWeight	Rescaled.RelWeight	Sign.Rescaled.RelWeight
BusinessParking	0.026	16.008	16.008
RestaurantsTableService	0.023	14.107	14.107
BikeParking	0.021	12.853	12.853
HappyHour	0.014	8.302	8.302
RestaurantsReservations	0.011	6.874	6.874
HasHours	0.010	6.059	6.059
Alcohol	0.010	5.870	5.870
BYOBCorkage	0.009	5.698	5.698
Corkage	0.009	5.292	5.292
WheelchairAccessible	0.006	3.441	3.441
ByAppointmentOnly	0.004	2.583	2.583
CoatCheck	0.003	2.126	2.126
DogsAllowed	0.003	1.710	1.710
RestaurantsDelivery	0.003	1.604	1.604
Caters	0.002	1.295	1.295
BusinessAcceptsCreditCards	0.002	1.153	1.153
WiFi	0.002	1.123	1.123
OutdoorSeating	0.002	1.114	1.114
RestaurantsTakeOut	0.002	0.972	-0.972
DriveThru	0.001	0.823	-0.823
HasTV	0.001	0.307	0.307
BYOB	0.000	0.271	0.271
HasAddress	0.000	0.250	0.250
Smoking	0.000	0.155	-0.155
BusinessAcceptsBitcoin	0.000	0.011	-0.011

rwa_yes_r_results %>% 
  arrange(desc(Rescaled.RelWeight)) %>% slice(1:25) %>%
  ggplot(aes(x=reorder(Variables, Rescaled.RelWeight), y=Sign.Rescaled.RelWeight, fill=Sign)) + geom_col() +
  scale_y_continuous(limits = c(-1,18)) +
  geom_text(aes(label=round(Sign.Rescaled.RelWeight,2)), color = 'black', size = 3, 
            hjust=-0.1) +
  scale_fill_manual(values = c('grey80', 'grey60')) + theme_classic() + 
  theme(legend.position = 'none', plot.title.position = 'plot') + 
  labs(title = 'Relative Weight by Yes Attribute - Reviews',x='') + coord_flip()

6. Attribute Analysis Summary

To compare the results of the three methods (difference of means, correlation, and RWA), summary tables list the Top 5 Filled and Yes Attributes by impact on Star Rating and Review Count for each method.

Star Rating

Filled Attributes

Across all three methods, three Filled Attributes were in the Top 5 of each: DogsAllowed, DriveThru, and WheelchairAccessible.

Rank	Mean Diff.	Correlation	RWA
1	DriveThru	WheelchairAccessible	DriveThru
2	BYOB	DogsAllowed	WheelchairAccessible
3	WheelchairAccessible	RestaurantsTableService	DogsAllowed
4	DogsAllowed	DriveThru	RestaurantsTableService
5	Address	BusinessAcceptsBitcoin	BusinessAcceptsBitcoin

Filled Attr.	#
DogsAllowed	3
DriveThru	3
WheelchairAccessible	3
BusinessAcceptsBitcoin	2
RestaurantsTableService	2
Address	1
BYOB	1

Yes Attributes

Although there were some slight differences in the Top 5 across the three analyses for Yes Attributes, there was overlap. DogsAllowed, DriveThru, and WheelchairAccessible were all in the Top 5 for each method.

Rank	Mean Diff.	Correlation	RWA
1	DriveThru	DriveThru	DriveThru
2	BYOB	WheelchairAccessible	WheelchairAccessible
3	DogsAllowed	DogsAllowed	HasTV
4	WheelchairAccessible	BikeParking	DogsAllowed
5	ByAppointmentOnly	BusinessParking	BikeParking

Yes Attr.	#
DogsAllowed	3
DriveThru	3
WheelchairAccessible	3
BikeParking	2
BusinessParking	1
ByAppointmentOnly	1
BYOB	1
HasTV	1

Overall

If we combine the Filled and Yes Attribute results, we can see the overlap in attributes. From this list, we can see that DogsAllowed, DriveThru, and WheelchairAccessible are clearly important to show on a Yelp page for Star Rating. The other top attributes were only included in the Top 5 in one or two methods.

Attribute	#
DogsAllowed	6
DriveThru	6
WheelchairAccessible	6
BikeParking	2
BusinessAcceptsBitcoin	2
BYOB	2
RestaurantsTableService	2
Address	1
BusinessParking	1
ByAppointmentOnly	1
HasTV	1

Review Count

Filled Attributes

Like with Star Rating, there were three Filled Attributes that were in the Top 5 of each analysis: ByAppointmentOnly, DogsAllowed, and HappyHour.

Rank	Mean Diff.	Correlation	RWA
1	ByAppointmentOnly	DogsAllowed	DogsAllowed
2	DogsAllowed	HappyHour	BYOBCorkage
3	BYOBCorkage	ByAppointmentOnly	ByAppointmentOnly
4	HappyHour	Caters	HappyHour
5	Corkage	BikeParking	BikeParking

Filled Attr.	#
ByAppointmentOnly	3
DogsAllowed	3
HappyHour	3
BikeParking	2
BYOBCorkage	2
Caters	1
Corkage	1

Yes Attributes

Yes Attributes had the largest variety in Top 5 attributes across the Review Count analyses. Only RestaurantsTableService was in all three. BikeParking, HappyHour, BusinessParking, and RestaurantsReservations were in two.

Rank	Mean Diff.	Correlation	RWA
1	ByAppointmentOnly	RestaurantsTableService	BusinessParking
2	CoatCheck	BusinessParking	RestaurantsTableService
3	Corkage	BikeParking	BikeParking
4	BYOBCorkage	HappyHour	HappyHour
5	RestaurantsTableService	RestaurantsReservations	RestaurantsReservations

Yes Attr.	#
RestaurantsTableService	3
BikeParking	2
BusinessParking	2
HappyHour	2
RestaurantsReservations	2
ByAppointmentOnly	1
BYOBCorkage	1
CoatCheck	1
Corkage	1

Overall

If we combine the Filled and Yes Attribute results, we can see the overlap in attributes. Compared to Star Rating, there was more variety in the Top 5 Filled and Yes Attributes. HappyHour appeared in five of the six analyses. BikeParking and ByAppointmentOnly appeared four times.

Attribute	#
HappyHour	5
BikeParking	4
ByAppointmentOnly	4
BYOBCorkage	3
DogsAllowed	3
RestaurantsTableService	3
BusinessParking	2
Corkage	2
RestaurantsReservations	2
Caters	1
CoatCheck	1

Key Takeaways

If we create a list of all the attributes that were in the Top 5 of any of the previous analyses, 17 of the 25 attributes studied are shown. But there were six attributes that appeared most consistently. DogsAllowed was in 9 of the 12 Top 5 lists (three methods, Filled and Yes, for Star Rating and Review Count). BikeParking, DriveThru, and WheelchairAccessible were in six. ByAppointmentOnly and HappyHour were in five. RestaurantsTableService also appeared five times, and the remaining 10 attributes appeared three or fewer times in a Top 5 results list.

While no attribute alone was a good predictor of Star Rating or Review Count, Yelp could still promote these top six attributes to restaurant owners as important to list on their Yelp page and, where applicable, to offer at their restaurant. DriveThru is the exception, as it was the only attribute to have a consistent negative effect.

Attribute	#
DogsAllowed	9
BikeParking	6
DriveThru	6
WheelchairAccessible	6
ByAppointmentOnly	5
HappyHour	5
RestaurantsTableService	5
BusinessParking	3
BYOBCorkage	3
BusinessAcceptsBitcoin	2
BYOB	2
Corkage	2
RestaurantsReservations	2
Address	1
Caters	1
CoatCheck	1
HasTV	1
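The tally in this table can be reproduced from the Top 5 lists in the summary tables above. The sketch below transcribes those lists by hand; the list names (such as `stars_filled_rwa`) are labels invented here for illustration and do not come from the original code.

```r
# Top 5 lists transcribed from the Star Rating and Review Count summary tables.
top5 <- list(
  stars_filled_meandiff   = c("DriveThru", "BYOB", "WheelchairAccessible", "DogsAllowed", "Address"),
  stars_filled_cor        = c("WheelchairAccessible", "DogsAllowed", "RestaurantsTableService", "DriveThru", "BusinessAcceptsBitcoin"),
  stars_filled_rwa        = c("DriveThru", "WheelchairAccessible", "DogsAllowed", "RestaurantsTableService", "BusinessAcceptsBitcoin"),
  stars_yes_meandiff      = c("DriveThru", "BYOB", "DogsAllowed", "WheelchairAccessible", "ByAppointmentOnly"),
  stars_yes_cor           = c("DriveThru", "WheelchairAccessible", "DogsAllowed", "BikeParking", "BusinessParking"),
  stars_yes_rwa           = c("DriveThru", "WheelchairAccessible", "HasTV", "DogsAllowed", "BikeParking"),
  reviews_filled_meandiff = c("ByAppointmentOnly", "DogsAllowed", "BYOBCorkage", "HappyHour", "Corkage"),
  reviews_filled_cor      = c("DogsAllowed", "HappyHour", "ByAppointmentOnly", "Caters", "BikeParking"),
  reviews_filled_rwa      = c("DogsAllowed", "BYOBCorkage", "ByAppointmentOnly", "HappyHour", "BikeParking"),
  reviews_yes_meandiff    = c("ByAppointmentOnly", "CoatCheck", "Corkage", "BYOBCorkage", "RestaurantsTableService"),
  reviews_yes_cor         = c("RestaurantsTableService", "BusinessParking", "BikeParking", "HappyHour", "RestaurantsReservations"),
  reviews_yes_rwa         = c("BusinessParking", "RestaurantsTableService", "BikeParking", "HappyHour", "RestaurantsReservations")
)

# Count how many of the Top 5 lists each attribute appears in.
counts <- sort(table(unlist(top5)), decreasing = TRUE)

print(length(top5))    # number of Top 5 lists
print(length(counts))  # number of distinct attributes that reached any Top 5
print(counts)
```

Keeping the lists in one named structure makes it easy to recount after any change, for example if a method is rerun on a specific restaurant category.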

7. Sentiment Analysis

To analyze the actual text of reviews left on restaurants’ Yelp pages, we will look at the most common words used and conduct a sentiment analysis. As noted previously, the Review data set was extremely large, so a sample was taken to speed up processing. The sample was then split into restaurants that have, and restaurants that do not have, the most common attributes from the prior attribute analysis. The reviews will be compared to see if different words were used or different sentiments expressed between the groups of restaurants.

Overall, there were not large differences in sentiments of the reviews across the groups for both Star Rating and Review Count by Filled or Yes Attribute. However, the reviews that were for restaurants with the desired attributes were consistently more positive than the reviews for the restaurants without.

Top 20 Words

Because this analysis looks at individual words rather than complete phrases, as some other text mining methods do, the review text was not fully scrubbed for typos or non-English characters. Data cleaning consisted of converting the data to a tidy text format (a table with one word token per row).

Common words like “the”, “of”, and “to”, along with any numerals, were removed before looking at the most common words.
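The cleaning steps described above can be sketched in base R as follows. The three reviews and the short stop-word list are made up for illustration. In a tidytext workflow, `unnest_tokens()` and the package's `stop_words` data set would play the same roles, but this sketch avoids package dependencies.

```r
# Made-up reviews standing in for the sampled Yelp review text.
reviews <- data.frame(
  review_id = 1:3,
  text = c("The food was great and the patio was great too!",
           "Waited 45 minutes for cold food.",
           "Friendly staff, great tacos."),
  stringsAsFactors = FALSE
)

# Hypothetical short stop-word list; a real analysis would use a fuller one.
stop_words_toy <- c("the", "of", "to", "and", "was", "for", "too")

# Tokenize: lower-case, then split on anything that is not a letter, digit, or apostrophe.
tokens <- strsplit(tolower(reviews$text), "[^a-z0-9']+")

# Tidy text format: one word token per row, keyed by review.
tidy_words <- data.frame(
  review_id = rep(reviews$review_id, lengths(tokens)),
  word      = unlist(tokens),
  stringsAsFactors = FALSE
)

# Drop empty tokens, stop words, and purely numeric tokens.
tidy_words <- tidy_words[nzchar(tidy_words$word), ]
tidy_words <- tidy_words[!tidy_words$word %in% stop_words_toy, ]
tidy_words <- tidy_words[!grepl("^[0-9]+$", tidy_words$word), ]

# Most common remaining words (the report plots the top 20).
word_counts <- sort(table(tidy_words$word), decreasing = TRUE)
print(head(word_counts, 20))
```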

From this graph, we can see that “food” is by far the most common word used. It has no real meaning in this context, so it will also be removed from the subsequent analyses.

Stars / Filled

Using the most common Filled Attributes for Star Ratings from the previous analysis, we will look at whether there are differences within the review texts themselves for restaurants that have them shown on their Yelp page versus those that do not.

Top Words

The first group of reviews were for restaurants that had DogsAllowed and WheelchairAccessible filled on their page and did not have DriveThru filled (remember that DriveThru had a negative impact on Star Rating). There are 6,445 restaurants that meet these conditions.

The opposite group of restaurants would be those that do not have DogsAllowed and WheelchairAccessible filled on the Yelp page, but do have DriveThru filled. There are 2,609 restaurants that meet these three conditions.

Comparing the two graphs, there is some overlap, and the top words do not show how customers truly felt about their experiences at the restaurants.

Sentiments

While looking at the most used words can be interesting, we are more interested in the sentiments expressed in the reviews. For restaurants that met the conditions, we would expect the reviews to be more positive overall than those that did not. We will measure sentiment with two lexicons available through the tidytext package. The NRC lexicon from Saif Mohammad and Peter Turney associates words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The AFINN lexicon from Finn Årup Nielsen is numerical: it assigns each word a value from -5 (most negative sentiment) to 5 (most positive sentiment).

The differences in the NRC sentiments between the two Filled groups are slight. Positive is the most common sentiment in both groups, but the Filled=0 group has a higher proportion of negative sentiments.

If we quantify the sentiments using the AFINN lexicon, we can get a sense of how positive or negative the reviews are on average. The average AFINN score of the Filled = 1 group is 0.76 higher.

AFINN	Filled = 0	Filled = 1	Avg. Difference
Star_AFINN_filled	0.551359	1.315273	0.764
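As a rough illustration of how a per-group AFINN average and its difference could be computed, the sketch below scores four made-up reviews with a tiny hand-written lexicon. The word scores are invented in the AFINN style and are not taken from the real lexicon, and averaging review-level scores within each group is an assumption, since the report does not state the exact unit of averaging.

```r
# Hypothetical mini-lexicon in the AFINN style (integer scores from -5 to 5).
toy_afinn <- c(great = 3, friendly = 2, love = 3, cold = -1, rude = -2, terrible = -3)

# Made-up reviews, labelled by attribute group.
reviews <- data.frame(
  group = c("Filled = 1", "Filled = 1", "Filled = 0", "Filled = 0"),
  text  = c("Great patio and friendly staff",
            "Love the tacos, great service",
            "Cold fries and rude cashier",
            "Terrible wait but friendly manager"),
  stringsAsFactors = FALSE
)

# Average lexicon score of the words in one review (NA if no word matches).
score_review <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  hits  <- lexicon[words[words %in% names(lexicon)]]
  if (length(hits) == 0) NA_real_ else mean(hits)
}

reviews$score <- vapply(reviews$text, score_review, numeric(1),
                        lexicon = toy_afinn, USE.NAMES = FALSE)

# Mean score per group and the difference between groups.
group_means <- tapply(reviews$score, reviews$group, mean, na.rm = TRUE)
avg_diff <- unname(group_means["Filled = 1"] - group_means["Filled = 0"])

print(group_means)
print(avg_diff)
```

With the real data, the toy lexicon would be replaced by the AFINN lexicon, which tidytext exposes through `get_sentiments("afinn")`.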

Stars / Yes

Using the most common Yes Attributes for Star Ratings from the previous analysis, we will look at whether there are differences within the review texts themselves for restaurants that have them shown on their Yelp page versus those that do not.

Top Words

The first group of reviews were for restaurants that claim to have the amenities DogsAllowed and WheelchairAccessible and do not have a DriveThru (remember that DriveThru had a negative impact on Star Rating). There are 2,262 restaurants that meet these 3 conditions in the data set. Several of the words here are also among the most common words for the Filled=1 group shown previously.

The opposite group of restaurants would be those that do not claim to have the amenities DogsAllowed and WheelchairAccessible, but do have a DriveThru. There are 1,817 restaurants that meet these 3 conditions.

Comparing the two graphs, there is some overlap, but the words do not show how customers truly felt about their experiences at that restaurant as the terms are fairly generic to all restaurants.

Sentiments

The difference in the NRC sentiments is most noticeable in the proportions of the ‘Joy’ emotion and the ‘Negative’ sentiment. The groups had opposite rankings: the Yes=1 group had ‘Joy’ as the 2nd largest sentiment (14.9%) and ‘Negative’ as the 5th largest (10.2%), while the Yes=0 group had ‘Negative’ as the 2nd largest sentiment (13.6%) and ‘Joy’ as the 5th largest (10.6%).

If we quantify the sentiments using the AFINN lexicon, we can get a sense of on average how positive or negative the reviews are for each group. On average, the Yes = 1 group is higher by just over a full point.

AFINN	Yes = 0	Yes = 1	Avg. Difference
Star_AFINN_yes	0.3503304	1.354613	1.004

Reviews / Filled

Using the most common Filled Attributes for Review Count from the previous analysis, we will look at whether there are differences within the review texts themselves for restaurants that have them shown on their Yelp page versus those that do not.

Top Words

The first group of reviews were for restaurants that have DogsAllowed, ByAppointmentOnly, and HappyHour filled on their Yelp page. There are 1,656 restaurants that meet these 3 conditions in the data set.

The opposite group of restaurants would be those that do not have DogsAllowed, ByAppointmentOnly, and HappyHour filled on their Yelp page. There are 26,660 restaurants that meet these 3 conditions.

Comparing the two graphs, there is again a large amount of overlap with the previous Top Words graphs, so a sentiment analysis will better evaluate how customers felt while writing their reviews.

Sentiments

The differences in the NRC sentiments between the two Filled groups are very slight. Proportions of each feeling and sentiment are nearly the same for each. Positive remains the largest sentiment expressed.

If we quantify the sentiments using the AFINN lexicon, we can get a sense of how positive or negative the reviews are on average. The Filled = 1 group has a slightly higher average AFINN score.

AFINN	Filled = 0	Filled = 1	Avg. Difference
Rev_AFINN_filled	0.8739431	1.344708	0.471

Reviews / Yes

Using the most common Yes Attributes for Review Count from the previous analysis, we will look at whether there are differences within the review texts themselves for restaurants that have them shown on their Yelp page versus those that do not.

Top Words

The first group of reviews were for restaurants that claim to have the amenities of RestaurantsTableService, BusinessParking, and BikeParking. There are 5,948 restaurants that meet these 3 conditions in the data set.

The opposite group of restaurants would be those that do not claim to have the amenities of RestaurantsTableService, BusinessParking, and BikeParking. There are 8,086 restaurants that meet these 3 conditions.

Again, these graphs are similar to the ones previously shown and do not indicate how the customers felt when writing the reviews.

Sentiments

The differences in the NRC sentiments between the two Yes groups are again pretty slight. Positive is the most common sentiment, but the Yes=1 group had a higher proportion of ‘Joy’.

If we quantify the sentiments using the AFINN lexicon, results are similar to the Filled difference for Review Count. The Yes = 1 group is on average higher by 0.44.

AFINN	Yes = 0	Yes = 1	Avg. Difference
Rev_AFINN_yes	0.8060857	1.247525	0.441

8. Limitations

The attributes that a business can show on its Yelp page were not proven to be strong predictors of Star Rating or Review Count through this analysis. There was a subset of six attributes that were most commonly in the Top 5 results, but overall we were unable to find a model that would explain even 30% of the variation in the data. The R-squared values of the RWA models ranged from about 16% to 26%. Therefore, Yelp cannot go to restaurant owners with full confidence that listing certain attributes will help improve Star Ratings or Review Count.

This analysis covered restaurants across the United States, with all metros and restaurant types grouped together. Future analysis could break the data down by type of restaurant. For example, DriveThru consistently showed a negative impact on Star Rating. If we were to look at the restaurants with DriveThru, the list would likely be mostly fast food restaurants, which are very different from fine dining restaurants. If the analysis were done again for more specific restaurant categories, different attributes could prove important within the subgroups.

Yelp Data Analysis of Restaurant Attributes

University of Indianapolis MSDA Capstone

Brianna Marshall

May 2022
