Yelp’s mission is to “connect consumers with great local businesses” and the company describes itself as “a one-stop local platform for consumers to discover, connect and transact with local businesses of all sizes” (Yelp Fast Facts). The value proposition to consumers is free access to business information, past customer reviews, and submitted photos. Potential customers use Yelp to look up businesses and find out more about them. It would make sense, then, that the more complete a business’s Yelp page is, the more likely a customer is to interact with, and potentially patronize, that business. While Yelp is best known for the reviews customers post, it is still important for businesses to claim their page and fill out basic business information such as address, business type/categories, and hours of operation. This analysis attempts to answer two questions: does greater data transparency lead to more customer engagement (number of reviews), and is that engagement more positive (star rating and review sentiment)?
The analysis focuses on restaurants in the United States and is two-fold. The first part focuses on the business data file, which holds what appears on a restaurant’s business page on Yelp. I will look at how many restaurants have complete basic information (business name and address), hours of operation, and the number of factual attribute fields filled in. This gives a baseline of how many restaurants are using the free page features. Attribute fields are Boolean data: a restaurant either has the amenity or it does not. However, these fields can also be left null. I will analyze whether the number of reviews and average rating differ when a field is filled out (even if ‘FALSE’) versus left blank. I will also look at whether there is a difference between marking an attribute as ‘TRUE’, indicating the business has that amenity, versus marking it ‘FALSE’ or leaving it blank. Looking at the data these two ways will help determine whether full data transparency, or posting only the attributes the restaurant has, is better. The second phase of the project looks at the reviews themselves. A basic sentiment analysis will be performed on the review text. I will compare whether businesses with the key attributes determined in the first phase have more reviews with positive sentiment than those that do not.
The data sets used for this analysis are publicly available and are described below, as well as the data preparation process.
Data was accessed from Yelp’s Open Dataset. In total, this source has over 6.9 million reviews for 150,000 businesses in 11 metropolitan areas. However, as previously noted, this analysis focuses only on the restaurants in the United States within this data. Descriptions of the data sets and the fields included can also be found on the Yelp Open Dataset website.
This project was completed in R and utilized several packages. The main ones used were:
The large majority of the data preparation focused on the Business data file. When the JSON file was flattened, there were 37 attributes and 7 hour columns. These needed to be transformed into binary variables in order to study whether or not the restaurant listed the attribute on their Yelp page and if they claimed to have it. User-voted attributes were also removed in order to focus the analysis only on attributes that are in control of the business.
Data preparation of the Reviews data set will be explained in the Sentiment Analysis section.
First, attribute columns that were nearly entirely Null were removed from the Business data. Of the six that were over 99.9% Null, two were not applicable to restaurants. It may be of interest to Yelp to understand why the others are being left blank and if they need to remain available to restaurants.
NAcount_rest <- sapply(Restaurants_US, function(x) sum(is.na(x)))
NAcount_rest <- data.frame(NAcount_rest)
NAcount_rest <- rownames_to_column(NAcount_rest, var = "Col_name")
NAcount_rest <- NAcount_rest %>%
mutate(Per_NA = NAcount_rest/(nrow(Restaurants_US))) %>%
arrange(desc(NAcount_rest))
Cols_to_drop_rest <- NAcount_rest %>%
filter(Per_NA > 0.99)
test_restaurants <- Restaurants_US %>%
select(-one_of(Cols_to_drop_rest$Col_name))
NAcount_rest %>%
arrange(desc(NAcount_rest)) %>% slice(1:10) %>%
ggplot(aes(x=reorder(Col_name, NAcount_rest), y=Per_NA, fill=Per_NA)) + geom_col() +
geom_text(aes(label=(percent(Per_NA, accuracy = 0.1))), color = 'black', size = 3, hjust = -0.1) +
scale_y_continuous(labels = percent, name ="", limits = c(0,1.2), expand = c(0, 0)) +
scale_fill_gradient(low="light grey", high="black") + theme_classic() +
theme(legend.position = 'none', plot.title.position = 'plot') +
labs(title = 'Top 10 Attributes by % Null',x='') + coord_flip()
Filled Attributes
Then, we look at which attributes are listed on the profile, regardless of whether the restaurant has the amenity. We will refer to these as ‘Filled Attributes’ throughout this analysis. A function was created to convert each attribute string to 1 if it is filled in and 0 if it is NA. In the graph of attributes by percentage filled, there is a steep drop-off between the top and bottom halves of the attributes.
attr_function <- function(x) {
if_else(condition = is.na(x),
true = 0,
false = 1)}
#Use mutate to count number of attributes/hours filled for each restaurant
#Taking out user-voted attributes (list below) as businesses cannot edit those on page themselves:
#GoodForGroups, PriceRange, Ambience, GoodForMeal, NoiseLevel, RestaurantsAttire, GoodForKids,
#GoodForDancing, BestNights, Music
filled_columns <- test_restaurants %>%
select(business_id, address, starts_with("attr"), starts_with("hours")) %>%
mutate(across(-c(business_id,address), attr_function)) %>%
select(-attributes.Ambience, -attributes.RestaurantsPriceRange2, -attributes.GoodForMeal,
-attributes.NoiseLevel, -attributes.RestaurantsAttire, -attributes.GoodForKids,
-attributes.GoodForDancing, -attributes.Music, -attributes.BestNights,
-attributes.RestaurantsGoodForGroups) %>%
mutate(attributes.HasAddress = if_else(address == "",0,1),
total_attr = select(.,starts_with("attr")) %>% rowSums(na.rm = TRUE),
total_hrs = select(.,starts_with("hours")) %>% rowSums(na.rm = TRUE),
HasHours = if_else(total_hrs>0,1,0),
total_filled = total_attr + HasHours) %>%
select(-address, -starts_with('hours'), -total_attr, -total_hrs)
names(filled_columns) <- gsub(x = names(filled_columns),
pattern = "attributes.", replacement = "", fixed = TRUE)
filled_summary <- filled_columns %>%
select(-business_id, -total_filled) %>%
colSums(na.rm=TRUE)
filled_summary <- data.frame(filled_summary)
colnames(filled_summary) <- "Total_Filled"
filled_summary <- tibble::rownames_to_column(filled_summary,'Attribute')
filled_summary <- filled_summary %>% mutate(Per_Filled = Total_Filled/(nrow(Restaurants_US)))
#Graph of Attributes and % of Restaurants with them FILLED
filled_summary%>%
ggplot(aes(x=reorder(Attribute, Total_Filled), y=Per_Filled, fill=Per_Filled)) + geom_col() +
geom_text(aes(label=(percent(Per_Filled, accuracy = 0.1))), color = 'black', size = 3, hjust = -0.1) +
scale_y_continuous(labels = percent, name ="", limits = c(0,1.15), expand = c(0, 0)) +
scale_fill_gradient(low="light grey", high="black") + theme_classic() +
theme(legend.position = 'none', plot.title.position = 'plot') +
labs(title = '% of Restaurants with Filled Attribute',x='') + coord_flip()
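As a minimal sketch of the Filled encoding, assume a toy data frame (`toy`, with made-up values that are not from the Yelp data) and a hypothetical helper `is_filled`. Any non-missing entry counts as filled, even when the value is 'False' or 'no':

```r
# Toy data (hypothetical values, not taken from the Yelp dataset)
toy <- data.frame(
  business_id = c("a", "b", "c"),
  attributes.WiFi = c("u'free'", NA, "u'no'"),
  attributes.HasTV = c("True", "False", NA),
  stringsAsFactors = FALSE
)

# 1 if the field is populated (whatever its value), 0 if it is NA
is_filled <- function(x) ifelse(is.na(x), 0, 1)

filled_toy <- data.frame(
  business_id = toy$business_id,
  WiFi = is_filled(toy$attributes.WiFi),
  HasTV = is_filled(toy$attributes.HasTV)
)

# Count of filled fields per restaurant
filled_toy$total_filled <- rowSums(filled_toy[, c("WiFi", "HasTV")])
print(filled_toy)
```

Restaurant "c" still gets credit for WiFi here because the field was answered, which is the distinction between Filled and Yes used in the rest of the analysis.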
Yes Attributes
Next, we will look at which attributes restaurants report having. We will refer to these as ‘Yes Attributes’ throughout this analysis. A function was created to convert any value indicating that the restaurant has the amenity to 1, and to convert values indicating that it does not, as well as blank fields, to 0. In the graph of attributes by percentage of restaurants that have the amenity, the drop-off between the top and bottom halves of the attributes is not as steep as it is for Filled. The top three attributes are the same, but the rankings then start to differ.
attributes <- test_restaurants %>%
select(starts_with("attr")) %>%
lapply(unique)
#First create data frame to convert string of Attributes to 1 for Yes, 0 for No, NA left as null
#Also taking out the same user-voted attributes (listed in previous section)
Yes_columns1 <- test_restaurants %>%
select(business_id, address, starts_with("attr"), starts_with("hours")) %>%
select(-attributes.Ambience, -attributes.RestaurantsPriceRange2, -attributes.GoodForMeal,
-attributes.NoiseLevel, -attributes.RestaurantsAttire, -attributes.GoodForKids,
-attributes.GoodForDancing, -attributes.Music, -attributes.BestNights,
-attributes.RestaurantsGoodForGroups) %>%
mutate(RestaurantsTableService = str_detect(attributes.RestaurantsTableService, "True") * 1,
WiFi = str_detect(attributes.WiFi, "'free'|'paid'") * 1,
BikeParking = str_detect(attributes.BikeParking, "True") * 1,
BusinessParking = str_detect(attributes.BusinessParking, "True") * 1,
BusinessAcceptsCreditCards = str_detect(attributes.BusinessAcceptsCreditCards, "True") * 1,
RestaurantsReservations = str_detect(attributes.RestaurantsReservations, "True") * 1,
WheelchairAccessible = str_detect(attributes.WheelchairAccessible, "True") * 1,
Caters = str_detect(attributes.Caters, "True") * 1,
OutdoorSeating = str_detect(attributes.OutdoorSeating, "True") * 1,
HappyHour = str_detect(attributes.HappyHour, "True") * 1,
BusinessAcceptsBitcoin = str_detect(attributes.BusinessAcceptsBitcoin, "True") * 1,
HasTV = str_detect(attributes.HasTV, "True") * 1,
Alcohol = str_detect(attributes.Alcohol, "'beer_and_wine'|'full_bar'") * 1,
DogsAllowed = str_detect(attributes.DogsAllowed, "True") * 1,
RestaurantsTakeOut = str_detect(attributes.RestaurantsTakeOut, "True") * 1,
RestaurantsDelivery = str_detect(attributes.RestaurantsDelivery, "True") * 1,
ByAppointmentOnly = str_detect(attributes.ByAppointmentOnly, "True") * 1,
BYOB = str_detect(attributes.BYOB, "True") * 1,
CoatCheck = str_detect(attributes.CoatCheck, "True") * 1,
Smoking = str_detect(attributes.Smoking, "'outdoor'|'yes'") * 1,
DriveThru = str_detect(attributes.DriveThru, "True") * 1,
BYOBCorkage = str_detect(attributes.BYOBCorkage, "yes") *1,
Corkage = str_detect(attributes.Corkage, "True") *1,
HasAddress = if_else(address == "",0,1),
TotalHours = rowSums(across(starts_with("hours"), attr_function)),
HasHours = if_else(TotalHours >0, 1, 0)) %>%
select(-starts_with("attr"),-starts_with("hours"), -address) %>%
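#Note: `.` is the data frame piped into mutate(), so this row sum also covers
#HasAddress, TotalHours and HasHours, not only the attribute flags
mutate(attr_yes = select(.,-business_id) %>% rowSums(na.rm = TRUE),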
total_yes = HasHours + attr_yes) %>%
select(-attr_yes, -TotalHours)
Yes_summary <- Yes_columns1 %>%
select(-business_id, -total_yes) %>%
colSums(na.rm=TRUE)
Yes_summary <- data.frame(Yes_summary)
colnames(Yes_summary) <- "Total_Yes"
Yes_summary <- tibble::rownames_to_column(Yes_summary,'Attribute')
Yes_summary <- Yes_summary %>% mutate(Per_Yes = Total_Yes/(nrow(Restaurants_US)))
#In order to do linear regression, need to have as many complete cases as possible
#So create 2nd data frame to convert 1=Yes, 0=No or Null so all rows are complete cases
yes_function <- function(x) {
if_else(condition = is.na(x),
true = 0,
false = (if_else(condition = x > 0, 1, 0)))}
Yes_columns <- Yes_columns1 %>% mutate(across(-c(business_id, total_yes, HasAddress, HasHours), yes_function))
#Graph of Attributes and % of Restaurants with them marked as 'True' or equivalent
Yes_summary%>%
ggplot(aes(x=reorder(Attribute, Total_Yes), y=Per_Yes, fill=Per_Yes)) + geom_col() +
geom_text(aes(label=(percent(Per_Yes, accuracy = 0.1))), color = 'black', size = 3, hjust = -0.1) +
scale_y_continuous(labels = percent, name ="", limits = c(0,1.15), expand = c(0, 0)) +
scale_fill_gradient(low="light grey", high="black") + theme_classic() +
theme(legend.position = 'none', plot.title.position = 'plot') +
labs(title = '% of Restaurants with Yes Attribute',x='') + coord_flip()
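For multi-valued fields such as WiFi, Alcohol, and Smoking, several raw values count as "Yes". The sketch below shows one way to express that with a single regular expression using alternation. It uses base R `grepl()` so that it needs no packages, whereas the report's code uses `stringr`. The raw values are hypothetical examples in the style of the Yelp attribute strings. A single pattern matters because vectorised matchers pair a vector of patterns with the elements of the input one by one, rather than testing every element against every pattern.

```r
# Hypothetical raw values in the style of the Yelp attribute strings
wifi <- c("u'free'", "'paid'", "u'no'", NA, "'free'")

# One regular expression with alternation is tested against every element;
# grepl() returns FALSE for NA, so missing values fall into the 0 group
has_wifi <- grepl("'free'|'paid'", wifi)

# Yes = 1 only when an accepted value is present; "no" and NA both become 0
wifi_yes <- as.integer(has_wifi)
print(wifi_yes)
```

This corresponds to the second, complete-case encoding (1 = Yes, 0 = No or Null) rather than the intermediate one that keeps NA.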
These two data frames for Filled and Yes Attributes were merged with the original business data to bring in the restaurant’s Star Rating and Review Count. Now, we can look at descriptive statistics of the business data to see if there are any initial trends.
Star Rating
Star ratings range from 1 to 5 in half-steps. The most common rating is 4 stars, and the average rating is about 3.5 stars.
Review Count
The majority of restaurants have under 200 reviews, with the average at 115. However, there is a large spread in values. Review count ranged from 5 to 9185.
Filled and Yes Attribute counts
The number of Filled Attributes is approximately normally distributed, peaking at 13-14 attributes, whereas the number of Yes Attributes has two peaks (at 5-6 and 17-18 attributes).
Averages by Star Rating
Combining the data above, we can see whether the average number of reviews and attributes changes with star rating. The averages of total filled, total yes, and reviews all rise consistently up to 4 stars, then fall at 4.5 and 5 stars.
| stars | mean_filled | mean_yes | mean_reviews |
|---|---|---|---|
| 1.0 | 7.18 | 10.60 | 17.54 |
| 1.5 | 9.60 | 12.60 | 27.77 |
| 2.0 | 10.22 | 13.03 | 34.65 |
| 2.5 | 10.69 | 13.44 | 51.97 |
| 3.0 | 11.42 | 14.34 | 78.13 |
| 3.5 | 12.30 | 15.28 | 125.66 |
| 4.0 | 12.84 | 15.56 | 166.42 |
| 4.5 | 12.56 | 14.90 | 138.10 |
| 5.0 | 10.47 | 12.86 | 39.46 |
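The report does not show how this summary table was built. One way to compute such per-rating averages, sketched here with base R `aggregate()` on a made-up data frame (`toy`, whose column names simply mirror those used in the report), is:

```r
# Toy data (hypothetical values); one row per restaurant
toy <- data.frame(
  stars = c(3.5, 3.5, 4.0, 4.0, 4.0),
  total_filled = c(10, 12, 13, 15, 14),
  review_count = c(40, 60, 100, 200, 150)
)

# Mean of each measure within each star rating
by_star <- aggregate(cbind(total_filled, review_count) ~ stars,
                     data = toy, FUN = mean)
print(by_star)
```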
Star Rating by Review Count
First, we want to see whether there is already a relationship between Star Rating and Review Count. While the p-value is very small (so the relationship is statistically significant), the estimate is nearly zero and the R-squared is very low at 0.02. Therefore, Review Count is not a good predictor of Star Rating, and further analysis is needed.
review_stars_lm <- lm(stars ~ review_count, data = Filled_Yes)
summary(review_stars_lm)
##
## Call:
## lm(formula = stars ~ review_count, data = Filled_Yes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8246 -0.4941 0.0091 0.5064 1.5191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.478e+00 4.227e-03 822.8 <2e-16 ***
## review_count 5.276e-04 1.707e-05 30.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7788 on 43292 degrees of freedom
## Multiple R-squared: 0.02158, Adjusted R-squared: 0.02156
## F-statistic: 955 on 1 and 43292 DF, p-value: < 2.2e-16
ggplot(Filled_Yes, aes(review_count, stars)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
scale_y_continuous(limits = c(1,5)) + labs(title = 'Predicting Star Rating by Review Count', size = "# Restaurants") +
theme_classic() + theme(plot.title.position = 'plot')
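The gap between statistical significance and practical effect can be made concrete with a little arithmetic. The sketch below simply plugs the coefficients printed in the summary output into the fitted line. `predict_stars` is a hypothetical helper name, and 115 is the average review count reported earlier.

```r
# Coefficients copied from the summary() output of review_stars_lm
intercept <- 3.478
slope <- 5.276e-04

# Fitted star rating for a given review count
predict_stars <- function(review_count) intercept + slope * review_count

# Fitted rating at the average review count and at 100 reviews more
at_115 <- predict_stars(115)
at_215 <- predict_stars(215)
gain_per_100 <- at_215 - at_115

print(round(c(at_115 = at_115, at_215 = at_215, gain_per_100 = gain_per_100), 3))
```

Under this fit, 100 additional reviews correspond to only about 0.05 of a star, which is why the model is of little practical use despite its small p-value.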
Total Filled
Then, we look at whether the number of Total Filled attributes is a good predictor of Star Rating or Review Count.
For Star Rating and Total Filled, the relationship is again statistically significant, but the R-squared is very low (0.028).
lm_filled_stars <- lm(stars ~ total_filled, data = Filled_Yes)
summary(lm_filled_stars)
##
## Call:
## lm(formula = stars ~ total_filled, data = Filled_Yes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72541 -0.50880 0.02214 0.49120 1.83159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1684140 0.0111836 283.31 <2e-16 ***
## total_filled 0.0309444 0.0008803 35.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7763 on 43292 degrees of freedom
## Multiple R-squared: 0.02775, Adjusted R-squared: 0.02773
## F-statistic: 1236 on 1 and 43292 DF, p-value: < 2.2e-16
ggplot(Filled_Yes, aes(total_filled, stars)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
scale_y_continuous(limits = c(1,5)) + labs(title = 'Predicting Star Rating by Total Filled', size = "# Restaurants") +
theme_classic() + theme(plot.title.position = 'plot')
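In a regression with a single predictor, R-squared is the square of the Pearson correlation between the predictor and the outcome, so the R-squared of 0.02775 above implies a correlation of about 0.17 in magnitude. The following sketch checks that identity on simulated data. The variables `x` and `y` are made up for illustration and are not from the Yelp data.

```r
# Simulated data with a weak linear relationship (illustrative only)
set.seed(1)
x <- rnorm(200)
y <- 0.2 * x + rnorm(200)

fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared
r <- cor(x, y)

# In simple linear regression these two quantities coincide
print(c(r_squared = r_squared, r_times_r = r^2))

# Correlation magnitude implied by the reported R-squared for stars ~ total_filled
implied_r <- sqrt(0.02775)
print(round(implied_r, 3))
```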
For Review Count and Total Filled, the relationship is also statistically significant. The R-squared is low (0.142), but it is one of the highest in this analysis. In the graph, the very large Review Counts sit toward the higher end of total_filled.
lm_filled_review <- lm(review_count ~ total_filled, data = Filled_Yes)
summary(lm_filled_review)
##
## Call:
## lm(formula = review_count ~ total_filled, data = Filled_Yes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -305.6 -83.6 -36.6 30.9 9010.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -118.4478 2.9250 -40.49 <2e-16 ***
## total_filled 19.5004 0.2302 84.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 203 on 43292 degrees of freedom
## Multiple R-squared: 0.1421, Adjusted R-squared: 0.1421
## F-statistic: 7173 on 1 and 43292 DF, p-value: < 2.2e-16
ggplot(Filled_Yes, aes(total_filled, review_count)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
scale_y_continuous(expand = c(0,0)) + labs(title = 'Predicting Review Count by Total Filled', size = "# Restaurants") +
theme_classic() + theme(plot.title.position = 'plot')
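To read the fitted line in concrete terms, the sketch below plugs the printed coefficients into the model equation. `predicted_reviews` is a hypothetical helper name. Because the intercept is negative, the line predicts fewer than zero reviews for restaurants with roughly six or fewer filled fields, which is one sign that a straight line is only a rough description of count data like this.

```r
# Coefficients copied from the summary() output of lm_filled_review
intercept <- -118.4478
slope <- 19.5004

# Fitted review count for a given number of filled fields
predicted_reviews <- function(total_filled) intercept + slope * total_filled

# Number of filled fields at which the fitted review count crosses zero
break_even <- -intercept / slope
print(round(break_even, 2))

# Fitted values at 5 and 13 filled fields
print(round(predicted_reviews(c(5, 13)), 1))
```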
Total Yes
Finally, we check if Total Yes is a good predictor for Star Rating or Review Count.
For Star Rating and Total Yes, the relationship is statistically significant, but the R-squared is very low (0.012). In fact, it is lower than for the models using Review Count or Total Filled.
lm_yes_stars <- lm(stars ~ total_yes, data = Filled_Yes)
summary(lm_yes_stars)
##
## Call:
## lm(formula = stars ~ total_yes, data = Filled_Yes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64854 -0.54359 -0.00861 0.47390 1.71876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2812366 0.0116731 281.09 <2e-16 ***
## total_yes 0.0174906 0.0007498 23.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7824 on 43292 degrees of freedom
## Multiple R-squared: 0.01241, Adjusted R-squared: 0.01239
## F-statistic: 544.1 on 1 and 43292 DF, p-value: < 2.2e-16
ggplot(Filled_Yes, aes(total_yes, stars)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
scale_y_continuous(limits = c(1,5)) + labs(title = 'Predicting Star Rating by Total Yes', size = "# Restaurants") +
theme_classic() + theme(plot.title.position = 'plot')
For Review Count and Total Yes, the relationship is again statistically significant, and the R-squared is also low (0.088). The graph shows a pattern similar to that for Total Filled, although less pronounced, with the very high review counts toward the higher end of total_yes.
lm_yes_reviews <- lm(review_count ~ total_yes, data = Filled_Yes)
summary(lm_yes_reviews)
##
## Call:
## lm(formula = review_count ~ total_yes, data = Filled_Yes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -256.3 -92.5 -50.5 31.3 9053.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.2337 3.1230 -24.41 <2e-16 ***
## total_yes 12.9820 0.2006 64.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.3 on 43292 degrees of freedom
## Multiple R-squared: 0.0882, Adjusted R-squared: 0.08818
## F-statistic: 4188 on 1 and 43292 DF, p-value: < 2.2e-16
ggplot(Filled_Yes, aes(total_yes, review_count)) + geom_count() + geom_smooth(method = 'lm', se = FALSE, color = 'red3') +
scale_y_continuous(expand = c(0,0)) + labs(title = 'Predicting Review Count by Total Yes', size = "# Restaurants") +
theme_classic() + theme(plot.title.position = 'plot')
As a starting point for the analysis by attribute, we compare how the average Star Rating and average Review Count differ between the 1 and 0 groups of the Filled and Yes Attributes.
Filled Attributes
Looking first at Filled Attributes, we can compare the average Star Rating for restaurants that include information about the amenity (AvgStar_1) with that for restaurants that do not (AvgStar_0).
The five largest differences in mean Star Rating for Filled Attributes were for DriveThru, BYOB, WheelchairAccessible, DogsAllowed, and HasAddress.
| Attribute | AvgStar_0 | AvgStar_1 | star_diff |
|---|---|---|---|
| DriveThru | 3.59 | 3.06 | -0.54 |
| BYOB | 3.50 | 4.02 | 0.52 |
| WheelchairAccessible | 3.41 | 3.90 | 0.49 |
| DogsAllowed | 3.43 | 3.89 | 0.46 |
| HasAddress | 3.96 | 3.53 | -0.43 |
| BusinessAcceptsBitcoin | 3.48 | 3.90 | 0.42 |
| RestaurantsTableService | 3.40 | 3.77 | 0.37 |
| ByAppointmentOnly | 3.51 | 3.84 | 0.32 |
| Corkage | 3.51 | 3.80 | 0.28 |
| HappyHour | 3.46 | 3.74 | 0.28 |
| HasHours | 3.31 | 3.58 | 0.27 |
| Smoking | 3.52 | 3.77 | 0.25 |
| WiFi | 3.41 | 3.58 | 0.17 |
| BikeParking | 3.44 | 3.58 | 0.15 |
| BusinessParking | 3.40 | 3.55 | 0.15 |
| RestaurantsReservations | 3.67 | 3.52 | -0.15 |
| CoatCheck | 3.53 | 3.67 | 0.14 |
| Caters | 3.45 | 3.58 | 0.13 |
| RestaurantsDelivery | 3.63 | 3.53 | -0.10 |
| BusinessAcceptsCreditCards | 3.60 | 3.54 | -0.07 |
| BYOBCorkage | 3.54 | 3.48 | -0.07 |
| Alcohol | 3.49 | 3.55 | 0.06 |
| HasTV | 3.49 | 3.55 | 0.06 |
| RestaurantsTakeOut | 3.58 | 3.54 | -0.05 |
| OutdoorSeating | 3.51 | 3.54 | 0.03 |
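Each row of such a table is a difference in group means: the average rating where the indicator is 1 minus the average where it is 0. A minimal sketch on made-up data (`toy`, with hypothetical values) shows the calculation for a single attribute:

```r
# Toy data (hypothetical values): star rating and one 0/1 attribute indicator
toy <- data.frame(
  stars = c(3.0, 3.5, 4.0, 4.5, 2.5, 3.0),
  DriveThru = c(0, 0, 0, 0, 1, 1)
)

# Mean rating within each group of the indicator
avg <- tapply(toy$stars, toy$DriveThru, mean)

# Difference reported in the table: mean for group 1 minus mean for group 0
star_diff <- avg[["1"]] - avg[["0"]]
print(c(AvgStar_0 = avg[["0"]], AvgStar_1 = avg[["1"]], star_diff = star_diff))
```

A negative value, as in this toy case, means restaurants with the indicator set to 1 are rated lower on average.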
Yes Attributes
For the Yes Attributes, the largest average differences were nearly the same as for the Filled Attributes, although in a slightly different order.
The largest differences were for DriveThru, BYOB, DogsAllowed, and WheelchairAccessible, followed by ByAppointmentOnly and HasAddress, which are tied at 0.43 in absolute value.
The average difference for DriveThru was about twice as large when marked Yes as when Filled. BYOB and DogsAllowed also had larger spreads for Yes than for Filled.
| Attribute | AvgStar_0 | AvgStar_1 | star_diff |
|---|---|---|---|
| DriveThru | 3.59 | 2.58 | -1.01 |
| BYOB | 3.53 | 4.23 | 0.70 |
| DogsAllowed | 3.50 | 4.07 | 0.57 |
| WheelchairAccessible | 3.42 | 3.89 | 0.47 |
| ByAppointmentOnly | 3.54 | 3.96 | 0.43 |
| HasAddress | 3.96 | 3.53 | -0.43 |
| BusinessAcceptsBitcoin | 3.54 | 3.95 | 0.41 |
| Corkage | 3.53 | 3.87 | 0.34 |
| HasHours | 3.31 | 3.58 | 0.27 |
| RestaurantsTableService | 3.48 | 3.75 | 0.27 |
| BikeParking | 3.41 | 3.65 | 0.24 |
| BusinessParking | 3.37 | 3.61 | 0.24 |
| CoatCheck | 3.54 | 3.76 | 0.23 |
| HasTV | 3.66 | 3.44 | -0.22 |
| Caters | 3.46 | 3.66 | 0.21 |
| OutdoorSeating | 3.45 | 3.66 | 0.20 |
| RestaurantsDelivery | 3.63 | 3.45 | -0.18 |
| Smoking | 3.54 | 3.71 | 0.17 |
| BusinessAcceptsCreditCards | 3.69 | 3.53 | -0.16 |
| HappyHour | 3.51 | 3.65 | 0.14 |
| RestaurantsReservations | 3.51 | 3.62 | 0.11 |
| RestaurantsTakeOut | 3.64 | 3.53 | -0.11 |
| Alcohol | 3.52 | 3.62 | 0.10 |
| WiFi | 3.53 | 3.59 | 0.06 |
| BYOBCorkage | 3.54 | 3.49 | -0.05 |
Filled Attributes
Review Counts had different top attributes than Star Ratings for Filled Attributes. The attributes with the largest differences in average number of reviews were ByAppointmentOnly, DogsAllowed, BYOBCorkage, HappyHour, and Corkage.
Unlike with Star Ratings, no Filled Attribute had a negative difference in review count. For every attribute, restaurants with Filled = 1 had more reviews on average than those with Filled = 0.
| Attribute | AvgRev_0 | AvgRev_1 | rev_diff |
|---|---|---|---|
| ByAppointmentOnly | 98.76 | 315.47 | 216.70 |
| DogsAllowed | 72.03 | 251.11 | 179.08 |
| BYOBCorkage | 100.04 | 277.98 | 177.95 |
| HappyHour | 72.91 | 225.08 | 152.17 |
| Corkage | 102.68 | 247.33 | 144.64 |
| Smoking | 105.16 | 242.90 | 137.74 |
| CoatCheck | 103.58 | 236.47 | 132.89 |
| Caters | 32.04 | 151.70 | 119.66 |
| BikeParking | 33.13 | 151.23 | 118.10 |
| WiFi | 26.65 | 143.61 | 116.95 |
| BYOB | 107.08 | 219.40 | 112.33 |
| BusinessParking | 17.56 | 124.72 | 107.16 |
| Alcohol | 28.33 | 134.72 | 106.39 |
| BusinessAcceptsCreditCards | 16.57 | 121.17 | 104.60 |
| HasTV | 31.27 | 133.50 | 102.24 |
| OutdoorSeating | 26.10 | 127.52 | 101.42 |
| HasHours | 30.08 | 130.21 | 100.13 |
| RestaurantsDelivery | 24.93 | 122.67 | 97.74 |
| RestaurantsTableService | 77.81 | 175.20 | 97.38 |
| BusinessAcceptsBitcoin | 101.46 | 197.03 | 95.57 |
| RestaurantsReservations | 33.24 | 127.94 | 94.69 |
| RestaurantsTakeOut | 27.21 | 120.63 | 93.42 |
| WheelchairAccessible | 91.87 | 178.39 | 86.52 |
| HasAddress | 32.34 | 115.91 | 83.58 |
| DriveThru | 111.37 | 147.69 | 36.32 |
Yes Attributes
The attributes with the largest differences in average number of reviews were different for Yes Attributes than for Filled Attributes. ByAppointmentOnly, CoatCheck, Corkage, BYOBCorkage, and RestaurantsTableService made up the top five.
DriveThru was the only Yes Attribute with a negative difference in Review Count when Yes = 1. As with the Star Ratings, the largest differences in average Review Count were higher for Yes Attributes than for Filled (289 versus 217 for ByAppointmentOnly), although most attributes showed a smaller difference for Yes than for Filled.
| Attribute | AvgRev_0 | AvgRev_1 | rev_diff |
|---|---|---|---|
| ByAppointmentOnly | 114.17 | 403.50 | 289.33 |
| CoatCheck | 113.47 | 341.52 | 228.05 |
| Corkage | 109.93 | 313.44 | 203.52 |
| BYOBCorkage | 110.35 | 248.09 | 137.74 |
| RestaurantsTableService | 86.77 | 217.49 | 130.72 |
| BusinessParking | 31.55 | 149.38 | 117.83 |
| HappyHour | 94.28 | 211.57 | 117.29 |
| BYOB | 114.08 | 228.18 | 114.10 |
| HasHours | 30.08 | 130.21 | 100.13 |
| BikeParking | 63.36 | 159.51 | 96.15 |
| RestaurantsReservations | 89.62 | 184.68 | 95.07 |
| DogsAllowed | 109.19 | 198.64 | 89.44 |
| Alcohol | 95.74 | 184.51 | 88.77 |
| WheelchairAccessible | 93.20 | 180.71 | 87.51 |
| HasAddress | 32.34 | 115.91 | 83.58 |
| BusinessAcceptsCreditCards | 40.25 | 121.94 | 81.69 |
| Smoking | 114.26 | 173.55 | 59.29 |
| DriveThru | 117.78 | 64.10 | -53.68 |
| Caters | 95.40 | 144.67 | 49.26 |
| WiFi | 105.38 | 151.55 | 46.17 |
| OutdoorSeating | 95.94 | 141.24 | 45.31 |
| BusinessAcceptsBitcoin | 115.02 | 146.92 | 31.90 |
| RestaurantsDelivery | 101.21 | 128.80 | 27.59 |
| HasTV | 100.08 | 126.55 | 26.47 |
| RestaurantsTakeOut | 112.01 | 115.53 | 3.51 |
If we were to build a multiple regression model, we would first need to check that the independent variables are not highly correlated with each other, because regression analysis requires predictors without high levels of multi-collinearity. A correlation matrix lets us evaluate all of the variables and the correlations between them at once. The same matrix also shows which attributes have the strongest correlations with Star Rating and Review Count.
The correlation matrices reveal a higher level of multi-collinearity between the variables than is desirable for a multiple regression model. Therefore, another type of analysis should be used to evaluate the attributes and their impact on Star Rating and Review Count.
Several pairs of Filled Attributes have correlation coefficients beyond +/- 0.5, which suggests a high level of collinearity: RestaurantsTableService/WheelchairAccessible, WiFi/Caters, WiFi/BikeParking, WiFi/HasTV, BikeParking/Caters, RestaurantsReservations/OutdoorSeating, RestaurantsReservations/HasTV, RestaurantsReservations/Alcohol, RestaurantsReservations/RestaurantsDelivery, WheelchairAccessible/DogsAllowed, OutdoorSeating/HasTV, OutdoorSeating/Alcohol, BYOB/Corkage, and CoatCheck/Smoking.
The correlation matrix helps us spot these pairs quickly, with higher correlation coefficients shown in a darker color. Hover over the boxes to see the pairs and the values.
cor_FS_full <- Filled_Stars %>%
select(-business_id,-total_filled) %>% cor()
cor_FS_full <- data.frame(cor_FS_full)
heatmaply_cor(cor(Filled_Stars[4:28]),
dendrogram = "none",
xlab = "", ylab = "", main = "Filled Attribute Correlation",
grid_gap = 1, grid_width = 0.00001, margins = c(30,30,40,10),
hide_colorbar = TRUE,
plot_method = "plotly",
label_names = c("row", "column", "Correlation"),
fontsize_row = 10, fontsize_col = 8,
heatmap_layers = theme(axis.line=element_blank()))
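Pairs beyond a correlation threshold can also be pulled out of the matrix programmatically rather than read off the heatmap. The sketch below does this in base R on simulated 0/1 indicators. The names `AttrA`, `AttrB`, and `AttrC` are hypothetical and the data are not from Yelp. Only the upper triangle is scanned so that each pair is reported once and the diagonal is skipped.

```r
# Simulated 0/1 indicators (illustrative only)
set.seed(42)
n <- 500
a <- rbinom(n, 1, 0.5)
# b copies a about 90% of the time, so the two are strongly correlated
b <- ifelse(runif(n) < 0.9, a, 1 - a)
# c_ind is generated independently of a and b
c_ind <- rbinom(n, 1, 0.5)
toy <- data.frame(AttrA = a, AttrB = b, AttrC = c_ind)

cm <- cor(toy)

# Positions in the upper triangle where |r| exceeds the threshold
idx <- which(abs(cm) > 0.5 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx]
)
print(high_pairs)
```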
Looking at just Stars and Review Count, no Filled Attribute has very high correlation.
The three highest for Stars are WheelchairAccessible (0.276), DogsAllowed (0.252), and RestaurantsTableService (0.231).
The three highest for Review Count are DogsAllowed (0.349), HappyHour (0.311), and ByAppointmentOnly (0.261).
heatmaply(cor_FS_full[1:2],
scale_fill_gradient_fun = ggplot2::scale_fill_gradient2(low = "black",
high = "red3", midpoint = 0, limits = c(-0.5, 1)),
dendrogram = "none",
xlab = "", ylab = "", main = "Filled Attribute Correlation",
grid_color = "white", grid_width = 0.00001, margins = c(30,30,40,10),
hide_colorbar = TRUE,
label_names = c("row", "column", "Correlation"),
fontsize_row = 10, fontsize_col = 8,
heatmap_layers = theme(axis.line=element_blank()))
In contrast to the Filled Attributes, no pair of Yes variables has a correlation beyond +/- 0.5.
The highest correlations are RestaurantsTableService/WheelchairAccessible (0.403), RestaurantsReservations/RestaurantsTableService (0.356), BikeParking/BusinessParking (0.291), BikeParking/Caters (0.273), and BikeParking/WheelchairAccessible (0.263).
Again, we can see where there is high correlation in the below heatmap, with darker colors signaling higher correlations. Hover over the boxes to see the pairs and the values.
cor_Yes_full <- Yes_Stars %>%
select(-business_id, -total_yes) %>% cor()
cor_Yes_full <- data.frame(cor_Yes_full)
heatmaply_cor(cor(Yes_Stars[4:28]),
dendrogram = "none",
xlab = "", ylab = "", main = "Yes Attribute Correlation",
grid_gap = 1, grid_width = 0.00001, margins = c(30,30,40,10),
hide_colorbar = TRUE,
plot_method = "plotly",
label_names = c("row", "column", "Correlation"),
fontsize_row = 10, fontsize_col = 8,
heatmap_layers = theme(axis.line=element_blank()))
Looking at just Stars and Review Count, no Yes Attribute has very high correlation.
The three highest for Stars are DriveThru (-0.28), WheelchairAccessible (0.257), and DogsAllowed (0.18). The three highest for Review Count are RestaurantsTableService (0.247), BusinessParking (0.244), and BikeParking (0.219).
heatmaply(cor_Yes_full[1:2],
scale_fill_gradient_fun = ggplot2::scale_fill_gradient2(low = "black",
high = "red3", midpoint = 0, limits = c(-0.5, 1)),
dendrogram = "none",
xlab = "", ylab = "", main = "Yes Attribute Correlation",
grid_color = "white", grid_width = 0.00001, margins = c(30,30,40,10),
hide_colorbar = TRUE,
label_names = c("row", "column", "Correlation"),
fontsize_row = 10, fontsize_col = 8,
heatmap_layers = theme(axis.line=element_blank()))
Because several of these predictors are correlated with one another, a multiple regression model with all attributes is not the best course of action: multi-collinearity makes the individual coefficient estimates unstable and hard to interpret. Instead, a Relative Weights Analysis (RWA, also called Key Driver Analysis) is used in this project. RWA helps find which variables are most important and have the most impact on the dependent variable.
The results of the RWA give us the R-squared of a model with all attributes. However, the results we are most interested in are the Signed Rescaled Relative Weights. These values estimate relative importance as the percentage of the predicted variance attributed to each variable, and the sign signals whether the variable has a positive or negative impact.
In general, it appears that Yes Attributes have more impact on Star Ratings, while Filled Attributes have more impact on Review Count.
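To make the relative weights idea concrete, the following is a minimal base R sketch of Johnson's relative weights calculation on simulated data. It is not the `rwa` package's own code. The function name `relative_weights`, the simulated variables, and the choice to take each sign from the simple correlation with the outcome are all assumptions of this sketch, and the package may assign signs differently. The raw weights sum to the model R-squared, and the rescaled weights divide each raw weight by that R-squared and multiply by 100.

```r
# Sketch of Johnson's relative weights (illustrative, not the rwa package source)
relative_weights <- function(X, y) {
  X <- as.matrix(X)
  Rxx <- cor(X)        # correlations among the predictors
  rxy <- cor(X, y)     # correlations of each predictor with the outcome
  eig <- eigen(Rxx, symmetric = TRUE)
  # Lambda is the symmetric square root of Rxx; it links the predictors
  # to their closest orthogonal counterparts
  Lambda <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
  # Standardized coefficients of the outcome on the orthogonal variables
  beta <- solve(Lambda, rxy)
  raw <- as.vector((Lambda^2) %*% (beta^2))
  names(raw) <- colnames(X)
  list(
    rsquare = sum(raw),                  # raw weights add up to R-squared
    raw = raw,
    rescaled = 100 * raw / sum(raw),     # share of explained variance, in %
    sign = sign(as.vector(rxy))          # assumption: sign of simple correlation
  )
}

# Simulated predictors, two of which are correlated (illustrative only)
set.seed(123)
n <- 300
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- rnorm(n)
y <- 0.5 * x1 - 0.3 * x3 + rnorm(n)
X <- cbind(x1 = x1, x2 = x2, x3 = x3)

rw <- relative_weights(X, y)
print(round(rw$rsquare, 3))
print(round(rw$rescaled * rw$sign, 1))
```

Unlike raw regression coefficients, these weights are non-negative shares that add up to 100, which is what allows correlated predictors to be ranked against each other.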
Filled Attributes
Looking at Filled Attributes first, the results of the RWA show that a model with all 25 attributes has an R-squared of 0.176. This is low, but it is higher than the baseline regressions run earlier.
The attributes with the largest rescaled relative weights are DriveThru, WheelchairAccessible, DogsAllowed, RestaurantsTableService, and BusinessAcceptsBitcoin.
Fill_col_names <- colnames(Filled_Stars[c(4:28)])
rwa_fill_stars <- Filled_Stars %>%
rwa(outcome = 'stars',
predictors = Fill_col_names,
applysigns = TRUE)
rwa_fill_s_results <- rwa_fill_stars$result
rwa_fill_stars[2]
## $rsquare
## [1] 0.1760108
rwa_fill_s_results %>%
select(-Sign) %>%
arrange(desc(Rescaled.RelWeight)) %>%
kable(digits=3) %>%
kable_minimal(full_width = F, position = 'center') %>%
scroll_box(height = '300px')
| Variables | Raw.RelWeight | Rescaled.RelWeight | Sign.Rescaled.RelWeight |
|---|---|---|---|
| DriveThru | 0.047 | 26.505 | -26.505 |
| WheelchairAccessible | 0.023 | 13.177 | 13.177 |
| DogsAllowed | 0.023 | 12.926 | 12.926 |
| RestaurantsTableService | 0.020 | 11.141 | 11.141 |
| BusinessAcceptsBitcoin | 0.009 | 5.052 | 5.052 |
| HasHours | 0.008 | 4.684 | 4.684 |
| BYOB | 0.007 | 4.116 | 4.116 |
| HappyHour | 0.006 | 3.440 | 3.440 |
| RestaurantsReservations | 0.005 | 3.062 | -3.062 |
| ByAppointmentOnly | 0.004 | 2.004 | 2.004 |
| HasAddress | 0.003 | 1.774 | -1.774 |
| Corkage | 0.003 | 1.674 | 1.674 |
| WiFi | 0.003 | 1.634 | 1.634 |
| BusinessParking | 0.002 | 1.407 | 1.407 |
| RestaurantsDelivery | 0.002 | 1.161 | -1.161 |
| BusinessAcceptsCreditCards | 0.002 | 1.066 | -1.066 |
| BikeParking | 0.002 | 0.943 | 0.943 |
| Smoking | 0.001 | 0.745 | 0.745 |
| CoatCheck | 0.001 | 0.660 | -0.660 |
| Caters | 0.001 | 0.654 | 0.654 |
| HasTV | 0.001 | 0.551 | -0.551 |
| Alcohol | 0.001 | 0.470 | 0.470 |
| BYOBCorkage | 0.001 | 0.462 | -0.462 |
| OutdoorSeating | 0.001 | 0.358 | -0.358 |
| RestaurantsTakeOut | 0.001 | 0.334 | -0.334 |
rwa_fill_s_results %>%
arrange(desc(Rescaled.RelWeight)) %>% slice(1:25) %>%
ggplot(aes(x=reorder(Variables, Rescaled.RelWeight), y=Sign.Rescaled.RelWeight, fill=Sign)) + geom_col() +
scale_y_continuous(limits = c(-30,15))+
geom_text(aes(label=round(Sign.Rescaled.RelWeight,2),
# hjust is mapped inside aes() so it follows the arranged, plotted rows
hjust=ifelse(Sign.Rescaled.RelWeight < 0, 0, 1)), color = 'black', size = 3) +
scale_fill_manual(values = c('grey80', 'grey60')) + theme_classic() +
theme(legend.position = 'none', plot.title.position = 'plot') +
labs(title = 'Relative Weight by Filled Attribute - Stars',x='') + coord_flip()
Yes Attributes
Results of the RWA show that a model with all 25 Yes Attributes has an R-squared of 0.215. This is higher than the baseline regression and the model for Filled Attributes.
The attributes with the largest rescaled relative weights are DriveThru, WheelchairAccessible, HasTV, DogsAllowed, and BikeParking.
Yes_col_names <- colnames(Yes_Stars[c(4:28)])
rwa_yes_stars <- Yes_Stars %>%
rwa(outcome = 'stars',
predictors = Yes_col_names,
applysigns = TRUE)
rwa_yes_s_results <- rwa_yes_stars$result
rwa_yes_stars[2]
## $rsquare
## [1] 0.2151438
rwa_yes_s_results %>%
select(-Sign) %>%
arrange(desc(Rescaled.RelWeight)) %>%
kable(digits=3) %>%
kable_minimal(full_width = F, position = 'center') %>%
scroll_box(height = '300px')
| Variables | Raw.RelWeight | Rescaled.RelWeight | Sign.Rescaled.RelWeight |
|---|---|---|---|
| DriveThru | 0.063 | 29.351 | -29.351 |
| WheelchairAccessible | 0.034 | 15.746 | 15.746 |
| HasTV | 0.017 | 8.037 | -8.037 |
| DogsAllowed | 0.014 | 6.391 | 6.391 |
| BikeParking | 0.011 | 5.338 | 5.338 |
| RestaurantsDelivery | 0.011 | 5.286 | -5.286 |
| HasHours | 0.011 | 4.987 | 4.987 |
| BusinessParking | 0.010 | 4.532 | 4.532 |
| BusinessAcceptsCreditCards | 0.008 | 3.846 | -3.846 |
| Caters | 0.008 | 3.827 | 3.827 |
| RestaurantsTableService | 0.007 | 3.428 | 3.428 |
| OutdoorSeating | 0.007 | 3.237 | 3.237 |
| BYOB | 0.003 | 1.431 | 1.431 |
| HasAddress | 0.003 | 1.168 | -1.168 |
| RestaurantsTakeOut | 0.002 | 1.023 | -1.023 |
| Corkage | 0.001 | 0.577 | 0.577 |
| HappyHour | 0.001 | 0.377 | 0.377 |
| RestaurantsReservations | 0.001 | 0.351 | 0.351 |
| Alcohol | 0.001 | 0.298 | 0.298 |
| ByAppointmentOnly | 0.000 | 0.215 | 0.215 |
| WiFi | 0.000 | 0.187 | 0.187 |
| BYOBCorkage | 0.000 | 0.145 | -0.145 |
| Smoking | 0.000 | 0.089 | -0.089 |
| BusinessAcceptsBitcoin | 0.000 | 0.076 | 0.076 |
| CoatCheck | 0.000 | 0.057 | 0.057 |
rwa_yes_s_results %>%
arrange(desc(Rescaled.RelWeight)) %>% slice(1:25) %>%
ggplot(aes(x=reorder(Variables, Rescaled.RelWeight), y=Sign.Rescaled.RelWeight, fill=Sign)) + geom_col() +
scale_y_continuous(limits = c(-40,30))+
geom_text(aes(label=round(Sign.Rescaled.RelWeight,2),
# hjust is mapped inside aes() so it follows the arranged, plotted rows
hjust=ifelse(Sign.Rescaled.RelWeight < 0, 0, 0.5)), color = 'black', size = 3) +
scale_fill_manual(values = c('grey80', 'grey60')) + theme_classic() +
theme(legend.position = 'none', plot.title.position = 'plot') +
labs(title = 'Relative Weight by Yes Attribute - Stars',x='') + coord_flip()
Filled Attributes
The R-squared for the model of all Filled Attributes on Review Count is 0.256, one of the highest R-squared values in this analysis.
The Filled Attributes with the largest rescaled relative weights are DogsAllowed, BYOBCorkage, ByAppointmentOnly, HappyHour, and BikeParking.
rwa_fill_reviews <- Filled_Stars %>%
rwa(outcome = 'review_count',
predictors = Fill_col_names,
applysigns = TRUE)
rwa_fill_r_results <- rwa_fill_reviews$result
rwa_fill_reviews[2]
## $rsquare
## [1] 0.2555005
rwa_fill_r_results %>%
select(-Sign) %>%
arrange(desc(Rescaled.RelWeight)) %>%
kable(digits=3) %>%
kable_minimal(full_width = F, position = 'center') %>%
scroll_box(height = '300px')
| Variables | Raw.RelWeight | Rescaled.RelWeight | Sign.Rescaled.RelWeight |
|---|---|---|---|
| DogsAllowed | 0.052 | 20.442 | 20.442 |
| BYOBCorkage | 0.039 | 15.158 | 15.158 |
| ByAppointmentOnly | 0.034 | 13.315 | 13.315 |
| HappyHour | 0.033 | 12.987 | 12.987 |
| BikeParking | 0.013 | 5.276 | 5.276 |
| Caters | 0.013 | 5.209 | 5.209 |
| Corkage | 0.009 | 3.424 | 3.424 |
| WiFi | 0.008 | 3.275 | 3.275 |
| RestaurantsTableService | 0.007 | 2.910 | 2.910 |
| HasHours | 0.006 | 2.419 | 2.419 |
| Smoking | 0.005 | 2.017 | 2.017 |
| CoatCheck | 0.005 | 1.906 | 1.906 |
| WheelchairAccessible | 0.005 | 1.839 | -1.839 |
| BusinessAcceptsBitcoin | 0.004 | 1.512 | 1.512 |
| Alcohol | 0.004 | 1.472 | 1.472 |
| HasTV | 0.003 | 1.314 | 1.314 |
| BYOB | 0.003 | 1.109 | 1.109 |
| BusinessParking | 0.003 | 1.041 | 1.041 |
| RestaurantsReservations | 0.003 | 1.035 | 1.035 |
| OutdoorSeating | 0.002 | 0.725 | 0.725 |
| RestaurantsDelivery | 0.001 | 0.524 | 0.524 |
| BusinessAcceptsCreditCards | 0.001 | 0.351 | 0.351 |
| RestaurantsTakeOut | 0.001 | 0.317 | 0.317 |
| DriveThru | 0.001 | 0.284 | 0.284 |
| HasAddress | 0.000 | 0.139 | 0.139 |
rwa_fill_r_results %>%
arrange(desc(Rescaled.RelWeight)) %>% slice(1:25) %>%
ggplot(aes(x=reorder(Variables, Rescaled.RelWeight), y=Sign.Rescaled.RelWeight, fill=Sign)) + geom_col() +
scale_y_continuous(limits = c(-3,25)) +
geom_text(aes(label=round(Sign.Rescaled.RelWeight,2)), color = 'black', size = 3,
hjust=-0.1) +
scale_fill_manual(values = c('grey80', 'grey60')) + theme_classic() +
theme(legend.position = 'none', plot.title.position = 'plot') +
labs(title = 'Relative Weight by Filled Attribute - Reviews',x='') + coord_flip()
Yes Attributes
The R-squared for the model of all Yes Attributes on Review Count is 0.164.
The Yes Attributes with the largest rescaled relative weights are BusinessParking, RestaurantsTableService, BikeParking, HappyHour, and RestaurantsReservations.
rwa_yes_reviews <- Yes_Stars %>%
rwa(outcome = 'review_count',
predictors = Yes_col_names,
applysigns = TRUE)
rwa_yes_r_results <- rwa_yes_reviews$result
rwa_yes_reviews[2]
## $rsquare
## [1] 0.1635754
rwa_yes_r_results %>%
select(-Sign) %>%
arrange(desc(Rescaled.RelWeight)) %>%
kable(digits=3) %>%
kable_minimal(full_width = F, position = 'center') %>%
scroll_box(height = '300px')
| Variables | Raw.RelWeight | Rescaled.RelWeight | Sign.Rescaled.RelWeight |
|---|---|---|---|
| BusinessParking | 0.026 | 16.008 | 16.008 |
| RestaurantsTableService | 0.023 | 14.107 | 14.107 |
| BikeParking | 0.021 | 12.853 | 12.853 |
| HappyHour | 0.014 | 8.302 | 8.302 |
| RestaurantsReservations | 0.011 | 6.874 | 6.874 |
| HasHours | 0.010 | 6.059 | 6.059 |
| Alcohol | 0.010 | 5.870 | 5.870 |
| BYOBCorkage | 0.009 | 5.698 | 5.698 |
| Corkage | 0.009 | 5.292 | 5.292 |
| WheelchairAccessible | 0.006 | 3.441 | 3.441 |
| ByAppointmentOnly | 0.004 | 2.583 | 2.583 |
| CoatCheck | 0.003 | 2.126 | 2.126 |
| DogsAllowed | 0.003 | 1.710 | 1.710 |
| RestaurantsDelivery | 0.003 | 1.604 | 1.604 |
| Caters | 0.002 | 1.295 | 1.295 |
| BusinessAcceptsCreditCards | 0.002 | 1.153 | 1.153 |
| WiFi | 0.002 | 1.123 | 1.123 |
| OutdoorSeating | 0.002 | 1.114 | 1.114 |
| RestaurantsTakeOut | 0.002 | 0.972 | -0.972 |
| DriveThru | 0.001 | 0.823 | -0.823 |
| HasTV | 0.001 | 0.307 | 0.307 |
| BYOB | 0.000 | 0.271 | 0.271 |
| HasAddress | 0.000 | 0.250 | 0.250 |
| Smoking | 0.000 | 0.155 | -0.155 |
| BusinessAcceptsBitcoin | 0.000 | 0.011 | -0.011 |
rwa_yes_r_results %>%
arrange(desc(Rescaled.RelWeight)) %>% slice(1:25) %>%
ggplot(aes(x=reorder(Variables, Rescaled.RelWeight), y=Sign.Rescaled.RelWeight, fill=Sign)) + geom_col() +
scale_y_continuous(limits = c(-1,18)) +
geom_text(aes(label=round(Sign.Rescaled.RelWeight,2)), color = 'black', size = 3,
hjust=-0.1) +
scale_fill_manual(values = c('grey80', 'grey60')) + theme_classic() +
theme(legend.position = 'none', plot.title.position = 'plot') +
labs(title = 'Relative Weight by Yes Attribute - Reviews',x='') + coord_flip()
To compare the results of the three methods of analyzing the attributes, a summary table was created showing the Top 5 Filled and Yes Attributes by impact on Star Rating and Review Count under each method.
Filled Attributes
Across all three methods, three Filled Attributes were in the Top 5 of each: DogsAllowed, DriveThru, and WheelchairAccessible.
| Rank | Mean Diff. | Correlation | RWA |
|---|---|---|---|
| 1 | DriveThru | WheelchairAccessible | DriveThru |
| 2 | BYOB | DogsAllowed | WheelchairAccessible |
| 3 | WheelchairAccessible | RestaurantsTableService | DogsAllowed |
| 4 | DogsAllowed | DriveThru | RestaurantsTableService |
| 5 | Address | BusinessAcceptsBitcoin | BusinessAcceptsBitcoin |
| Filled Attr. | # |
|---|---|
| DogsAllowed | 3 |
| DriveThru | 3 |
| WheelchairAccessible | 3 |
| BusinessAcceptsBitcoin | 2 |
| RestaurantsTableService | 2 |
| Address | 1 |
| BYOB | 1 |
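The tally in the table above can be reproduced with a few lines of base R. This is only a sketch: the `top5` list is typed in by hand from the Filled Attributes ranking table for Star Rating, and the object names are illustrative rather than taken from the original analysis.

```r
# Top 5 Filled Attributes for Star Rating, one vector per method
# (copied from the ranking table; object names are hypothetical)
top5 <- list(
  mean_diff   = c("DriveThru", "BYOB", "WheelchairAccessible",
                  "DogsAllowed", "Address"),
  correlation = c("WheelchairAccessible", "DogsAllowed", "RestaurantsTableService",
                  "DriveThru", "BusinessAcceptsBitcoin"),
  rwa         = c("DriveThru", "WheelchairAccessible", "DogsAllowed",
                  "RestaurantsTableService", "BusinessAcceptsBitcoin")
)

# Count how many Top 5 lists each attribute appears in
counts <- sort(table(unlist(top5)), decreasing = TRUE)
counts

# Attributes that made the Top 5 under every method
in_all_methods <- names(counts)[as.vector(counts) == length(top5)]
in_all_methods
```

The same pattern extends to the Yes Attributes and Review Count tables by swapping in their Top 5 lists.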
Yes Attributes
| Rank | Mean Diff. | Correlation | RWA |
|---|---|---|---|
| 1 | DriveThru | DriveThru | DriveThru |
| 2 | BYOB | WheelchairAccessible | WheelchairAccessible |
| 3 | DogsAllowed | DogsAllowed | HasTV |
| 4 | WheelchairAccessible | BikeParking | DogsAllowed |
| 5 | ByAppointmentOnly | BusinessParking | BikeParking |
| Yes Attr. | # |
|---|---|
| DogsAllowed | 3 |
| DriveThru | 3 |
| WheelchairAccessible | 3 |
| BikeParking | 2 |
| BusinessParking | 1 |
| ByAppointmentOnly | 1 |
| BYOB | 1 |
| HasTV | 1 |
Overall
If we combine the Filled and Yes Attribute results, we can see the overlap in attributes. From this list, we can see that DogsAllowed, DriveThru, and WheelchairAccessible are clearly important to show on a Yelp page for Star Rating. The other top attributes were only included in the Top 5 in one or two methods.
| Attribute | # |
|---|---|
| DogsAllowed | 6 |
| DriveThru | 6 |
| WheelchairAccessible | 6 |
| BikeParking | 2 |
| BusinessAcceptsBitcoin | 2 |
| BYOB | 2 |
| RestaurantsTableService | 2 |
| Address | 1 |
| BusinessParking | 1 |
| ByAppointmentOnly | 1 |
| HasTV | 1 |
Filled Attributes
As with Star Rating, three Filled Attributes were in the Top 5 of each analysis: ByAppointmentOnly, DogsAllowed, and HappyHour.
| Rank | Mean Diff. | Correlation | RWA |
|---|---|---|---|
| 1 | ByAppointmentOnly | DogsAllowed | DogsAllowed |
| 2 | DogsAllowed | HappyHour | BYOBCorkage |
| 3 | BYOBCorkage | ByAppointmentOnly | ByAppointmentOnly |
| 4 | HappyHour | Caters | HappyHour |
| 5 | Corkage | BikeParking | BikeParking |
| Filled Attr. | # |
|---|---|
| ByAppointmentOnly | 3 |
| DogsAllowed | 3 |
| HappyHour | 3 |
| BikeParking | 2 |
| BYOBCorkage | 2 |
| Caters | 1 |
| Corkage | 1 |
Yes Attributes
| Rank | Mean Diff. | Correlation | RWA |
|---|---|---|---|
| 1 | ByAppointmentOnly | RestaurantsTableService | BusinessParking |
| 2 | CoatCheck | BusinessParking | RestaurantsTableService |
| 3 | Corkage | BikeParking | BikeParking |
| 4 | BYOBCorkage | HappyHour | HappyHour |
| 5 | RestaurantsTableService | RestaurantsReservations | RestaurantsReservations |
| Yes Attr. | # |
|---|---|
| RestaurantsTableService | 3 |
| BikeParking | 2 |
| BusinessParking | 2 |
| HappyHour | 2 |
| RestaurantsReservations | 2 |
| ByAppointmentOnly | 1 |
| BYOBCorkage | 1 |
| CoatCheck | 1 |
| Corkage | 1 |
Overall
If we combine the Filled and Yes Attribute results, we can see the overlap in attributes. Compared to Star Rating, there was more variety in the Top 5 Filled and Yes Attributes. HappyHour appeared in five of the six analyses. BikeParking and ByAppointmentOnly appeared four times.
| Attribute | # |
|---|---|
| HappyHour | 5 |
| BikeParking | 4 |
| ByAppointmentOnly | 4 |
| BYOBCorkage | 3 |
| DogsAllowed | 3 |
| RestaurantsTableService | 3 |
| BusinessParking | 2 |
| Corkage | 2 |
| RestaurantsReservations | 2 |
| Caters | 1 |
| CoatCheck | 1 |
If we create a list of all the attributes that were in the Top 5 of any of the previous analyses, 17 of the 25 attributes studied appear. Seven of them appeared most consistently. DogsAllowed was in 9 of the 12 Top 5 lists for Star Rating or Review Count (three methods, two attribute codings, two outcomes). BikeParking, DriveThru, and WheelchairAccessible were in six. ByAppointmentOnly, HappyHour, and RestaurantsTableService were in five. The other 10 attributes appeared three or fewer times in a Top 5 list.
While none of the attributes alone was a good predictor of Star Rating or Review Count, Yelp could still promote the most consistent attributes to restaurant owners as important to list on their Yelp page and, if applicable, to offer at their restaurant (DriveThru being the exception, as it was the only attribute with a consistently negative effect).
| Attribute | # |
|---|---|
| DogsAllowed | 9 |
| BikeParking | 6 |
| DriveThru | 6 |
| WheelchairAccessible | 6 |
| ByAppointmentOnly | 5 |
| HappyHour | 5 |
| RestaurantsTableService | 5 |
| BusinessParking | 3 |
| BYOBCorkage | 3 |
| BusinessAcceptsBitcoin | 2 |
| BYOB | 2 |
| Corkage | 2 |
| RestaurantsReservations | 2 |
| Address | 1 |
| Caters | 1 |
| CoatCheck | 1 |
| HasTV | 1 |
To analyze the actual text of reviews left on restaurants’ Yelp pages, we will look at most common words used and conduct a sentiment analysis. As noted previously, the Review data set was extremely large. Therefore, a sample was taken to speed up processing. The sample was then filtered for the restaurants that have or do not have the most common attributes listed from the prior attribute analysis. The reviews will be compared to see if different words were used or different sentiments expressed between the groups of restaurants.
Overall, there were not large differences in sentiments of the reviews across the groups for both Star Rating and Review Count by Filled or Yes Attribute. However, the reviews that were for restaurants with the desired attributes were consistently more positive than the reviews for the restaurants without.
Because this analysis looks at individual words rather than complete phrases, as some other text mining methods do, the review text was not scrubbed completely for typos or non-English characters. The main data cleaning step was to convert the data to a tidy text format (a table with one word token per row).
Common words like “the”, “of”, and “to” and any numerals were removed before looking at the most common words.
From this graph, we can see that “food” is by far the most common word used. It has no real meaning in this context, so it will also be removed from the subsequent analyses.
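The cleaning steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used in the analysis: the two reviews are invented, the `stop_words` vector is a tiny stand-in for a full stop-word list, and the splitting rule is a simple approximation of a word tokenizer.

```r
# Hypothetical mini-sample of review text (not from the Yelp data set)
reviews <- data.frame(
  review_id = c(1, 2),
  text = c("The food was great and the service was great too",
           "Waited 45 minutes for cold food"),
  stringsAsFactors = FALSE
)

# Tiny illustrative stop-word list; a real analysis would use a fuller one
stop_words <- c("the", "of", "to", "and", "was", "for", "too")

# Tidy text format: one lower-case word token per row
tokens <- do.call(rbind, lapply(seq_len(nrow(reviews)), function(i) {
  words <- unlist(strsplit(tolower(reviews$text[i]), "[^a-z0-9']+"))
  data.frame(review_id = reviews$review_id[i],
             word = words[words != ""],
             stringsAsFactors = FALSE)
}))

# Drop stop words and pure numerals
keep   <- !(tokens$word %in% stop_words) & !grepl("^[0-9]+$", tokens$word)
tokens <- tokens[keep, ]

# Most common remaining words
word_counts <- sort(table(tokens$word), decreasing = TRUE)
word_counts
```

Even in this toy sample "food" ties for the most frequent word, which mirrors why it is treated as uninformative and dropped from the later comparisons.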
Using the most common Filled Attributes for Star Rating from the previous analysis, we will look at whether the review texts differ between restaurants that show these attributes on their Yelp page and those that do not.
Top Words
The first group of reviews were for restaurants that had DogsAllowed and WheelchairAccessible filled on their page and did not have DriveThru filled (remember that DriveThru had a negative impact on Star Rating). There are 6,445 restaurants that meet these conditions.
The opposite group of restaurants would be those that do not have DogsAllowed and WheelchairAccessible filled on the Yelp page, but do have DriveThru filled. There are 2,609 restaurants that meet these three conditions.
Comparing the two graphs, there is some overlap, and the top words do not show how customers truly felt about their experiences at the restaurant.
Sentiments
While looking at the most used words can be interesting, we are more interested in the sentiments expressed in the reviews. For restaurants that met the conditions, we would expect the reviews to be more positive overall than for those that did not. We will measure sentiment with two lexicons available through the tidytext package. The NRC lexicon from Saif Mohammad and Peter Turney associates words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The AFINN lexicon from Finn Årup Nielsen is numerical, assigning each word a value from -5 (most negative sentiment) to 5 (most positive sentiment).
The differences in the NRC sentiments between the two Filled groups are slight. Positive is the most common sentiment in both groups, but the Filled=0 group has a higher proportion of negative sentiments.
If we quantify the sentiments using the AFINN lexicon, we can get a sense of how positive or negative the reviews are on average. The Filled = 1 group's average AFINN score is 0.76 higher.
| AFINN | Filled = 0 | Filled = 1 | Avg. Difference |
|---|---|---|---|
| Star_AFINN_filled | 0.551359 | 1.315273 | 0.764 |
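A comparison like the one in the table above can be sketched in base R. The lexicon values and tokens below are made up for illustration and are not actual AFINN entries or Yelp reviews. The sketch also assumes the group score is the mean value of the words that match the lexicon, since the exact aggregation used in the analysis is not stated.

```r
# Hypothetical AFINN-style lexicon: word -> integer score between -5 and 5
lexicon <- data.frame(
  word  = c("great", "love", "cold", "terrible"),
  value = c(3, 3, -1, -3),
  stringsAsFactors = FALSE
)

# Hypothetical tidy tokens, tagged by group (1 = attributes filled, 0 = not)
tokens <- data.frame(
  group = c(1, 1, 1, 0, 0, 0),
  word  = c("great", "love", "patio", "cold", "terrible", "great"),
  stringsAsFactors = FALSE
)

# Inner join keeps only words present in the lexicon ("patio" is dropped)
scored <- merge(tokens, lexicon, by = "word")

# Assumed aggregation: mean lexicon value of matched words in each group
group_means <- tapply(scored$value, scored$group, mean)
afinn_diff  <- group_means[["1"]] - group_means[["0"]]

group_means
afinn_diff
```

Words that are missing from the lexicon contribute nothing to the score, so the result depends on how many review words the lexicon actually covers.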
Using the most common Yes Attributes for Star Rating from the previous analysis, we will look at whether the review texts differ between restaurants that claim these amenities on their Yelp page and those that do not.
Top Words
The first group of reviews was for restaurants that claim to have the amenities DogsAllowed and WheelchairAccessible and do not have a DriveThru (remember that DriveThru had a negative impact on Star Rating). There are 2,262 restaurants that meet these 3 conditions in the data set. Several of the words here also appear among the most common words for the Filled=1 group shown previously.
The opposite group of restaurants would be those that do not claim to have the amenities DogsAllowed and WheelchairAccessible, but do have a DriveThru. There are 1,817 restaurants that meet these 3 conditions.
Comparing the two graphs, there is some overlap, but the words do not show how customers truly felt about their experiences at that restaurant as the terms are fairly generic to all restaurants.
Sentiments
The differences in the NRC sentiments are most noticeable in the proportions of the ‘Joy’ emotion and the ‘Negative’ sentiment. The groups had opposite proportions: the Yes=1 group had ‘Joy’ as the 2nd largest sentiment (14.9%) and ‘Negative’ as the 5th largest (10.2%), while the Yes=0 group had ‘Negative’ as the 2nd largest sentiment (13.6%) and ‘Joy’ as the 5th largest (10.6%).
If we quantify the sentiments using the AFINN lexicon, we can get a sense of on average how positive or negative the reviews are for each group. On average, the Yes = 1 group is higher by just over a full point.
| AFINN | Yes = 0 | Yes = 1 | Avg. Difference |
|---|---|---|---|
| Star_AFINN_yes | 0.3503304 | 1.354613 | 1.004 |
Using the most common Filled Attributes for Review Count from the previous analysis, we will look at whether the review texts differ between restaurants that show these attributes on their Yelp page and those that do not.
Top Words
The first group of reviews were for restaurants that have DogsAllowed, ByAppointmentOnly, and HappyHour filled on their Yelp page. There are 1,656 restaurants that meet these 3 conditions in the data set.
The opposite group of restaurants would be those that do not have DogsAllowed, ByAppointmentOnly, and HappyHour filled on their Yelp page. There are 26,660 restaurants that meet these 3 conditions.
Comparing the two graphs, there is again a large amount of overlap with the previous Top Words graphs, so a sentiment analysis will better evaluate how customers felt while writing their reviews.
Sentiments
The differences in the NRC sentiments between the two Filled groups are very slight. Proportions of each feeling and sentiment are nearly the same for each. Positive remains the largest sentiment expressed.
If we quantify the sentiments using the AFINN lexicon, we can get a sense of how positive or negative the reviews are on average. The Filled = 1 group's average AFINN score is slightly higher, by 0.47.
| AFINN | Filled = 0 | Filled = 1 | Avg. Difference |
|---|---|---|---|
| Rev_AFINN_filled | 0.8739431 | 1.344708 | 0.471 |
Using the most common Yes Attributes for Review Count from the previous analysis, we will look at whether the review texts differ between restaurants that claim these amenities on their Yelp page and those that do not.
Top Words
The first group of reviews were for restaurants that claim to have the amenities of RestaurantsTableService, BusinessParking, and BikeParking. There are 5,948 restaurants that meet these 3 conditions in the data set.
The opposite group of restaurants would be those that do not claim to have the amenities of RestaurantsTableService, BusinessParking, and BikeParking. There are 8,086 restaurants that meet these 3 conditions.
Again, these graphs are similar to the ones previously shown and do not indicate how the customers felt when writing the reviews.
Sentiments
The differences in the NRC sentiments between the two Yes groups are again slight. Positive is the most common sentiment in both, but the Yes=1 group had a higher proportion of ‘Joy’.
If we quantify the sentiments using the AFINN lexicon, results are similar to the Filled difference for Review Count. The Yes = 1 group is on average higher by 0.44.
| AFINN | Yes = 0 | Yes = 1 | Avg. Difference |
|---|---|---|---|
| Rev_AFINN_yes | 0.8060857 | 1.247525 | 0.441 |
The attributes that a business can show on its Yelp page were not found to be strong predictors of Star Rating or Review Count in this analysis. A small subset of attributes appeared most often in the Top 5 results, but overall we were unable to find a model that would explain even 30% of the variation in the data. The R-squared values of the RWA models ranged from about 0.16 to 0.26. Therefore, Yelp cannot go to restaurant owners with full confidence that listing certain attributes will help improve Star Rating or Review Count.
This analysis covered restaurants across the United States, with all metros and restaurant types grouped together. Future work could analyze restaurants by type. For example, DriveThru consistently showed a negative impact on Star Rating. The restaurants with a DriveThru are likely to be mostly fast food restaurants, which are very different from fine dining restaurants. If the analysis were repeated for more specific restaurant categories, different attributes could prove important within the subgroups.