INTRODUCTION

We live in a review-obsessed culture: when looking for a place to have lunch or dinner, checking online ratings before making a final call has become a common reflex. This, of course, reflects the customer's point of view: narrowing the choice down to 3-, 4- or 5-star venues depending on taste and available budget.

But there is also the business owner's point of view!

The importance of ratings for restaurants was heavily emphasized in France by the suicide of the famous chef Bernard Loiseau, whose tragic death 12 years ago has been linked to the threat of losing his prestigious three-star ranking in the Michelin Red Guide.

A restaurant with a poor rating will probably face serious business difficulties and may even have to close. According to an article published in the New Yorker following the death of Loiseau (http://www.newyorker.com/magazine/2003/05/12/death-of-a-chef):

“the loss of a Michelin star can mean financial strain: business can drop by as much as twenty-five per cent. (When Loiseau got his third star, in 1991, business at La Côte d’Or increased by sixty per cent.)”

This raises a simple question: beyond the quality of the food, which factors explain restaurant ratings?

Being able to identify these factors and predict ratings would give owners valuable information, allowing them to offer a better customer experience and ultimately grow their businesses.

The results of our initial exploratory analysis are presented below and illustrated in Figures 1 to 4 and in Table 1. The algorithm calibration is illustrated in Figures 5 and 6, and the prediction results and accuracy are presented in Figure 7 and Table 2.

METHODS & DATA

(a) Getting the data

The dataset is available at http://www.yelp.com/dataset_challenge. It contains in particular:

  • 1.6M reviews and 500K tips by 366K users for 61K businesses

  • 481K business attributes, e.g., hours, parking availability, ambience

If you are not familiar with Yelp (maybe you use TripAdvisor instead?), here is an example of a review page for a French bakery in Oslo: http://www.yelp.co.uk/biz/pascal-oslo-2

Note: The data format is mainly JSON files.

(b) R Libraries

In order to clean the data, visualize it and publish our results, we will use the following R packages. If they are not installed, please use install.packages("name of the package"); installing packages such as “DT” may also require installing “devtools” first.

library(rjson)
library(plyr)
library(dplyr)
library(ggplot2)
library(knitr)
library(glmnet)
library(googleVis)
library(DT)
library(scales)
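
As a convenience, the check for missing packages can be sketched as follows. The helper name `missing_pkgs` is hypothetical (it is not part of any package), and the install call is left commented out so that nothing is installed by accident.

```r
# Packages loaded above.
required <- c("rjson", "plyr", "dplyr", "ggplot2", "knitr",
              "glmnet", "googleVis", "DT", "scales")

# Hypothetical helper: return the packages that cannot be found locally.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(required)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```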

The key package is “glmnet”. More details are provided in Reference (1).

(c) Reading the “Business” JSON file

Note: The files must first be saved in your working directory. Also, do not forget to close the connection once the file has been read.

con <- file("yelp_academic_dataset_business.json", "r")
input <- readLines(con, -1L)
close(con)

(d) Creating and loading our dataframe

This simply consists of extracting each business and making it a row of a data frame that we will call “yelpdata”.

yelpdata <- input %>%
   lapply(function(x) t(unlist(fromJSON(x)))) %>% 
   ldply()
save(yelpdata, file= 'yelpdata.rdata')
load("yelpdata.rdata")
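
To make the flattening step more concrete, here is a minimal sketch on a single made-up record. The list below only mimics the kind of nested structure that fromJSON() returns; its field values are assumptions for illustration. It shows why a multi-valued field ends up as numbered columns (categories1, categories2, ...) and how nested attributes get compound names.

```r
# Toy record mimicking the structure returned by fromJSON() for one business.
# All values are made up for illustration.
business <- list(
  business_id = "abc123",
  stars       = 4.5,
  categories  = c("Restaurants", "French"),
  attributes  = list(`Happy Hour` = TRUE, Parking = list(lot = FALSE))
)

# unlist() flattens the nested list into a named character vector;
# t() turns it into a one-row matrix that ldply() can stack.
flat <- t(unlist(business))
print(colnames(flat))

# Name cleaning in the style of this report: keep alphanumerics, lower-case.
clean <- tolower(gsub("[^[:alnum:]]", "", colnames(flat)))
print(clean)
```

Note that unlist() coerces every value to character, which is why numeric fields such as stars have to be converted back with as.numeric() later on.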

(e) Cleaning the data

We will give clearer names to the different features, such as whether there is a “Happy Hour”, whether “WIFI” is available, and so on.

We will also handle the “N/A” values.

clean.names <- function(df){
  colnames(df) <- gsub("[^[:alnum:]]", "", colnames(df))
  colnames(df) <- tolower(colnames(df))
  return(df)
}
yelpdata <- clean.names(yelpdata)
yelpdata <- yelpdata[,!duplicated(colnames(yelpdata))]

# Features 
yelpdata$stars <- as.numeric(as.character(yelpdata$stars))
yelpdata$reviewcount <- as.numeric(as.character(yelpdata$reviewcount))
names(yelpdata)[names(yelpdata)=="attributeshappyhour"] <- "happyhour"
names(yelpdata)[names(yelpdata)=="attributesacceptscreditcards"] <- "acc"
names(yelpdata)[names(yelpdata)=="attributesgoodforgroups"] <- "groups"
names(yelpdata)[names(yelpdata)=="attributesoutdoorseating"] <- "outdoor"
names(yelpdata)[names(yelpdata)=="attributespricerange"] <- "price"
names(yelpdata)[names(yelpdata)=="attributesalcohol"] <- "alcohol"
names(yelpdata)[names(yelpdata)=="attributesnoiselevel"] <- "noiselevel"
names(yelpdata)[names(yelpdata)=="attributesambienceclassy"] <- "classy"
names(yelpdata)[names(yelpdata)=="attributesparkingvalet"] <- "valet"
names(yelpdata)[names(yelpdata)=="neighborhoods"] <- "nhood"
names(yelpdata)[names(yelpdata)=="attributesdrivethru"] <- "drivethru"
names(yelpdata)[names(yelpdata)=="attributesparkinglot"] <- "parkinglot"
names(yelpdata)[names(yelpdata)=="attributespaymenttypescashonly"] <- "cash"
names(yelpdata)[names(yelpdata)=="attributesambiencecasual"] <- "casual"
names(yelpdata)[names(yelpdata)=="attributesgoodfordancing"] <- "dance"
names(yelpdata)[names(yelpdata)=="attributesdelivery"] <- "delivery"
names(yelpdata)[names(yelpdata)=="attributescoatcheck"] <- "ccheck"
names(yelpdata)[names(yelpdata)=="attributestakeout"] <- "takeout"
names(yelpdata)[names(yelpdata)=="attributestakesreservations"] <- "res"
names(yelpdata)[names(yelpdata)=="attributeswaiterservice"] <- "service"
names(yelpdata)[names(yelpdata)=="attributesparkingstreet"] <- "street"
names(yelpdata)[names(yelpdata)=="attributesparkinggarage"] <- "garage"
names(yelpdata)[names(yelpdata)=="attributesgoodforlatenight"] <- "late"
names(yelpdata)[names(yelpdata)=="attributesgoodfordessert"] <- "desert"
names(yelpdata)[names(yelpdata)=="attributescaters"] <- "caters"
names(yelpdata)[names(yelpdata)=="attributeswifi"] <- "wifi"
names(yelpdata)[names(yelpdata)=="attributesattire"] <- "attire"
names(yelpdata)[names(yelpdata)=="attributesgoodforkids"] <- "goodforkids"
names(yelpdata)[names(yelpdata)=="attributeshastv"] <- "tv"
names(yelpdata)[names(yelpdata)=="attributesambienceromantic"] <- "romantic"
names(yelpdata)[names(yelpdata)=="attributesambiencetrendy"] <- "trendy"
names(yelpdata)[names(yelpdata)=="attributesambienceupscale"] <- "upscale"
names(yelpdata)[names(yelpdata)=="attributesambiencedivey"] <- "divey"
names(yelpdata)[names(yelpdata)=="attributeswheelchairaccessible"] <- "wheelchair"
names(yelpdata)[names(yelpdata)=="attributesmusicbackgroundmusic"] <- "bkgmusic"
names(yelpdata)[names(yelpdata)=="attributesmusiclive"] <- "livemusic"
names(yelpdata)[names(yelpdata)=="attributesbyob"] <- "byob"
names(yelpdata)[names(yelpdata)=="attributesdogsallowed"] <- "dogsallowed"
names(yelpdata)[names(yelpdata)=="attributesopen24hours"] <- "open24hrs"
names(yelpdata)[names(yelpdata)=="attributespaymenttypesamex"] <- "amex"
names(yelpdata)[names(yelpdata)=="attributesorderatcounter"] <- "orderatcounter"
names(yelpdata)[names(yelpdata)=="attributespaymenttypesvisa"] <- "visa"


# Identify N/A as "dnr" (did not respond).
addDNR <- function(x){
  if(is.factor(x)) return(factor(x, levels=c(levels(x), "dnr")))
  return(x)
}
yelpdata <- as.data.frame(lapply(yelpdata, addDNR))
yelpdata[is.na(yelpdata)] <- "dnr"
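
The reason for adding the "dnr" level before filling in the missing values can be seen on a toy data frame (the values are made up). A factor only accepts values that are among its levels, so assigning "dnr" without first extending the levels would produce NA and a warning.

```r
# Same helper as above, repeated so that the example is self-contained.
addDNR <- function(x){
  if(is.factor(x)) return(factor(x, levels=c(levels(x), "dnr")))
  return(x)
}

# Toy data: one factor column with a missing answer, one numeric column.
toy <- data.frame(wifi = factor(c("free", NA, "paid")),
                  stars = c(4, 3.5, 2))

toy <- as.data.frame(lapply(toy, addDNR))
toy[is.na(toy)] <- "dnr"

print(toy)
print(levels(toy$wifi))
```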

Then, we have to:

  • Identify the different cities and US states

  • Classify the restaurants into roughly 60 of the most important categories (Wine Bar, Tex-Mex, Vegan…)
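
For the first point, a compact alternative to a long chain of nested ifelse() calls is a named lookup vector. This is only a sketch: `state_to_loc` and `lookup_loc` are hypothetical names, and the mapping simply restates the state codes and city labels used in this report (with the labels spelled as in the report).

```r
# Lookup table from the dataset's state codes to the city labels of this report.
state_to_loc <- c(
  NV  = "Las Vegas, NV",        PA  = "Pittsburg, PA",
  NC  = "Charlotte, NC",        SC  = "Charlotte, NC",
  AZ  = "Phoenix, AZ",          IL  = "Urbana-Champaign, IL",
  WI  = "Madison, WI",          BW  = "Karlsruhe, Germany",
  QC  = "Montreal, Canada",     ON  = "Waterloo, Canada",
  MLN = "Edinburgh, UK",        EDH = "Edinburgh, UK",
  KHL = "Edinburgh, UK",        XGL = "Edinburgh, UK",
  NTH = "Edinburgh, UK",        SCB = "Edinburgh, UK"
)

# Hypothetical helper: unknown state codes map to NA.
lookup_loc <- function(state) unname(state_to_loc[as.character(state)])

print(lookup_loc(c("NV", "EDH", "ZZ")))
```

With such a helper, the location column could be filled with something like yelpdata$loc <- lookup_loc(yelpdata$state).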

yelpdata <- mutate(yelpdata, loc = ifelse(yelpdata$state=="NV", "Las Vegas, NV",
                                        ifelse(yelpdata$state=="PA", "Pittsburg, PA",
                                          ifelse(yelpdata$state=="NC", "Charlotte, NC",
                                            ifelse(yelpdata$state=="AZ", "Phoenix, AZ",
                                              ifelse(yelpdata$state=="IL", "Urbana-Champaign, IL",
                                                ifelse(yelpdata$state=="WI", "Madison, WI",
                                                  ifelse(yelpdata$state=="MLN", "Edinburgh, UK",
                                                    ifelse(yelpdata$state=="BW", "Karlsruhe, Germany",
                                                      ifelse(yelpdata$state=="QC", "Montreal, Canada",  
                                                       ifelse(yelpdata$state=="ON", "Waterloo, Canada",
                                                        ifelse(yelpdata$state=="SC", "Charlotte, NC",
                                                         ifelse(yelpdata$state=="EDH", "Edinburgh, UK",
                                                          ifelse(yelpdata$state=="KHL", "Edinburgh, UK",
                                                           ifelse(yelpdata$state=="XGL", "Edinburgh, UK",
                                                            ifelse(yelpdata$state=="NTH", "Edinburgh, UK",
                                                            ifelse(yelpdata$state=="SCB", "Edinburgh, UK",
                                                         NA)))))))))))))))))

# Filter the restaurants.
all_restaurants <- filter(yelpdata, categories == "Restaurants" |
                     categories1 == "Restaurants" | 
                     categories2 == "Restaurants"| 
                     categories3 == "Restaurants"|
                     categories4 == "Restaurants"|
                     categories5 == "Restaurants"|
                     categories6 == "Restaurants"|
                     categories7 == "Restaurants"|
                     categories8 == "Restaurants"|
                     categories9 == "Restaurants"|
                     categories10 == "Restaurants") 

# Show all of the categories of the restaurants.
bigcat <- c(as.character(all_restaurants$categories1), 
            as.character(all_restaurants$categories2), 
            as.character(all_restaurants$categories3),
            as.character(all_restaurants$categories4), 
            as.character(all_restaurants$categories5), 
            as.character(all_restaurants$categories6),
            as.character(all_restaurants$categories7), 
            as.character(all_restaurants$categories8), 
            as.character(all_restaurants$categories9),
            as.character(all_restaurants$categories10),
            as.character(all_restaurants$categories)) %>% 
  table() %>% 
  sort()
 
# Let's have a look at the most important categories
# tail(bigcat,65)

# "Varmaker" function creates a column for a category
# 1 = yes, 0 = no
varmaker <- function(x){
  all_restaurants <- mutate(all_restaurants, 
                            a = 
                              ifelse(
                                categories == x |
                                categories1 == x | 
                                categories2 == x | 
                                categories3 == x | 
                                categories4 == x | 
                                categories5 == x | 
                                categories6 == x | 
                                categories7 == x | 
                                categories8 == x | 
                                categories9 == x | 
                                categories10 == x , 1, 0) )
  all_restaurants$a <- as.factor(all_restaurants$a)
  names(all_restaurants)[names(all_restaurants)=="a"] <- gsub(" ", "", x, fixed = TRUE)
  return(all_restaurants)
  }

# Create new columns associated to the most important categories
all_restaurants <- varmaker("Fast Food")
all_restaurants <- varmaker("Pizza")
all_restaurants <- varmaker("Mexican")
all_restaurants <- varmaker("American (Traditional)")
all_restaurants <- varmaker("Nightlife")
all_restaurants <- varmaker("Sandwiches")
all_restaurants <- varmaker("Bars")
all_restaurants <- varmaker("Food")
all_restaurants <- varmaker("Italian")
all_restaurants <- varmaker("Chinese")
all_restaurants <- varmaker("American (New)")
all_restaurants <- varmaker("Burgers")
all_restaurants <- varmaker("Breakfast & Brunch")
all_restaurants <- varmaker("Cafes")
all_restaurants <- varmaker("Japanese")
all_restaurants <- varmaker("Sushi Bars")
all_restaurants <- varmaker("Delis")
all_restaurants <- varmaker("Steakhouses")
all_restaurants <- varmaker("Seafood")
all_restaurants <- varmaker("Chicken Wings")
all_restaurants <- varmaker("Sports Bars")
all_restaurants <- varmaker("Coffee & Tea")
all_restaurants <- varmaker("Mediterranean")
all_restaurants <- varmaker("Barbeque")
all_restaurants <- varmaker("Thai")
all_restaurants <- varmaker("Asian Fusion")
all_restaurants <- varmaker("French")
all_restaurants <- varmaker("Buffets")
all_restaurants <- varmaker("Indian")
all_restaurants <- varmaker("Pubs")
all_restaurants <- varmaker("Greek")
all_restaurants <- varmaker("Diners")
all_restaurants <- varmaker("Bakeries")
all_restaurants <- varmaker("Vietnamese")
all_restaurants <- varmaker("Tex-Mex")
all_restaurants <- varmaker("Vegetarian")
all_restaurants <- varmaker("Salad")
all_restaurants <- varmaker("Hot Dogs")
all_restaurants <- varmaker("Middle Eastern")
all_restaurants <- varmaker("Event Planning & Services")
all_restaurants <- varmaker("Specialty Food")
all_restaurants <- varmaker("Lounges")
all_restaurants <- varmaker("Korean")
all_restaurants <- varmaker("Canadian (New)")
all_restaurants <- varmaker("Arts & Entertainment")
all_restaurants <- varmaker("Wine Bars")
all_restaurants <- varmaker("Gluten-Free")
all_restaurants <- varmaker("Latin American")
all_restaurants <- varmaker("British")
all_restaurants <- varmaker("Gastropubs")
all_restaurants <- varmaker("Ice Cream & Frozen Yogurt")
all_restaurants <- varmaker("Southern")
all_restaurants <- varmaker("Vegan")
all_restaurants <- varmaker("Desserts")
all_restaurants <- varmaker("Hawaiian")
all_restaurants <- varmaker("German")
all_restaurants <- varmaker("Bagels")
all_restaurants <- varmaker("Caterers")
all_restaurants <- varmaker("Juice Bars & Smoothies")
all_restaurants <- varmaker("Fish & Chips")
all_restaurants <- varmaker("Ethnic Food")
all_restaurants <- varmaker("Tapas Bars")
all_restaurants <- varmaker("Soup")
all_restaurants <- varmaker("Halal")
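
The indicator columns created by the repeated varmaker() calls above could also be produced in a loop. The sketch below works on made-up toy data and assumes, as in this report, that missing categories have already been replaced by "dnr"; `add_category_flags` is a hypothetical helper, not the function used above.

```r
# Toy data: two category columns per restaurant (made-up values).
toy <- data.frame(
  categories1 = c("Restaurants", "Pizza", "Bars"),
  categories2 = c("Pizza", "Restaurants", "dnr"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: add one 0/1 factor column per requested category.
add_category_flags <- function(df, categories) {
  # All columns named categories, categories1, categories2, ...
  cat_cols <- grep("^categories[0-9]*$", names(df), value = TRUE)
  for (category in categories) {
    # TRUE if any category column of the row equals the requested category.
    hit <- Reduce(`|`, lapply(df[cat_cols],
                              function(col) as.character(col) == category))
    # Column name without spaces, as in varmaker().
    new_name <- gsub(" ", "", category, fixed = TRUE)
    df[[new_name]] <- factor(as.integer(hit), levels = c(0, 1))
  }
  df
}

flagged <- add_category_flags(toy, c("Pizza", "Fast Food"))
print(flagged)
```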

(f) Observations

Contrary to what might naively be expected, the distribution of ratings is not normal but rather negatively skewed (Figure 1). Low ratings seem to be less frequent, and some people would too quickly interpret this as a sign of biased data. Furthermore, the histogram in Figure 2 shows that most places tend to have fewer than 10 reviews.

Quite surprisingly, Figure 3 shows that there are more reviews for 5-star restaurants than for 1-star ones, which suggests that reporting a very good customer experience is a more common behavioural pattern among Yelp users than reporting a very bad one. One could also argue that 1-star restaurants simply do not stay open very long!

In any case, as the number of reviews increases, the Yelp rating will begin to approach the restaurant's “true” rating, which will most likely not be 5 stars.
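
This convergence can be illustrated with a small simulation. It is only a sketch under stated assumptions: every simulated restaurant shares the same hypothetical distribution of individual review scores (`star_probs` is made up, not estimated from the Yelp data), and the average is not rounded to half stars.

```r
set.seed(42)

# Hypothetical probabilities of a single review giving 1, 2, 3, 4 or 5 stars.
star_probs <- c(0.10, 0.10, 0.20, 0.35, 0.25)
true_mean  <- sum(1:5 * star_probs)

# Average rating of many simulated restaurants with the same number of reviews.
simulate_avg <- function(n_reviews, n_restaurants = 2000) {
  replicate(n_restaurants,
            mean(sample(1:5, n_reviews, replace = TRUE, prob = star_probs)))
}

few  <- simulate_avg(5)    # restaurants with only 5 reviews
many <- simulate_avg(200)  # restaurants with 200 reviews

print(true_mean)
# Spread of the average rating around the "true" rating.
print(c(few = sd(few), many = sd(many)))
# Share of restaurants whose average is exactly 5 stars.
print(c(few = mean(few == 5), many = mean(many == 5)))
```

With few reviews the averages are widely scattered and a perfect score can occur by chance; with many reviews they concentrate around the underlying mean.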

Figure 1

ggplot(all_restaurants, aes(as.factor(stars))) + 
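  # stars is treated as discrete here, so we count with geom_bar
  # (recent ggplot2 versions reject geom_histogram on a discrete x)
  geom_bar(fill = "blue", col="black") +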
  xlab("Stars") +
  ylab("Number of Restaurants") +
  theme_classic()

Figure 2

ggplot(all_restaurants, aes(reviewcount)) + 
  geom_histogram(binwidth = .15, fill = "blue", col="white") + 
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                   labels = trans_format("log10", math_format(10^.x))) +
  xlab("Number of Reviews") +
  ylab("Number of Restaurants") + 
  theme_classic()

Figure 3

ggplot(all_restaurants, aes(as.factor(stars), reviewcount)) + 
  geom_boxplot(col = "blue") +
  scale_y_log10() +
  xlab("Stars") +
  ylab("Number of Reviews") + 
  theme_classic()

Figure 4

# Location of the restaurants in the US and Europe
all_restaurants$latlong <- paste(all_restaurants$latitude, all_restaurants$longitude, sep=":")
counts <- all_restaurants %>% 
  group_by(loc) %>%
  summarize(Restaurants = n(),  Avg_Rating = round(mean(stars),2))

locdata <- data.frame(latlong = all_restaurants$latlong, 
                      loc = all_restaurants$loc)

counts <- inner_join(counts, locdata, by="loc") %>%
  group_by(loc) %>%
  summarize(Restaurants = first(Restaurants), 
            latlong = first(latlong),
            Avg_Rating = first(Avg_Rating))

require(datasets)

USmap <- gvisGeoChart(counts, "loc", 
                          sizevar="Restaurants",
                          colorvar="Avg_Rating", 
                          options=list(region = 'US',
                                       displayMode = "markers",
                                       colorAxis="{colors:['white', 'blue']}"))

Europemap <- gvisGeoChart(counts, "loc", 
                          sizevar = "Restaurants",
                          colorvar = "Avg_Rating", 
                          options=list(region = '150',
                                       displayMode = "markers",
                                       colorAxis="{colors:['white', 'blue']}"
                                       ))


print(gvisMerge(USmap, Europemap, horizontal=TRUE), "chart")

Table 1

dstate <- all_restaurants %>%
  group_by(loc) %>%
  summarise(num_rest = n(),
            avg_stars = round(mean(stars), digits =  2), 
            avg_num_rev = round(mean(reviewcount), digits =  2)) %>%
  tbl_df()

dstate <- dstate[order(-dstate$avg_stars),]

kable(dstate, col.names = c("City","Restaurants","Average Star Rating of Restaurants","Average Number of Ratings per Restaurant") , align  = "c")
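|         City         | Restaurants | Average Star Rating of Restaurants | Average Number of Ratings per Restaurant |
|:--------------------:|:-----------:|:----------------------------------:|:----------------------------------------:|
|          NA          |     19      |                4.00                |                   5.58                   |
|    Edinburgh, UK     |    1137     |                3.81                |                  14.47                   |
|  Karlsruhe, Germany  |     469     |                3.71                |                  11.88                   |
|   Montreal, Canada   |    2353     |                3.61                |                  16.16                   |
|    Pittsburg, PA     |    1361     |                3.55                |                  37.38                   |
|   Waterloo, Canada   |     244     |                3.55                |                  10.03                   |
|     Madison, WI      |     981     |                3.45                |                  34.10                   |
|    Charlotte, NC     |    2119     |                3.43                |                  33.74                   |
|     Phoenix, AZ      |    7985     |                3.43                |                  51.23                   |
|    Las Vegas, NV     |    4960     |                3.42                |                  90.13                   |
| Urbana-Champaign, IL |     264     |                3.38                |                  34.28                   |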

(h) The final step in this Capstone Project: Predict Ratings

In order to be able to predict the ratings and create the necessary algorithm, we will:

  • Use the business attributes (trendy, romantic and so on) as predictors

  • Define a training and a testing set in order to examine the prediction accuracy

  • Use the LASSO (Least Absolute Shrinkage and Selection Operator) method with our linear regression analysis in order to avoid overfitting. A more detailed description of this statistical approach can be found in the book “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by Trevor Hastie et al. (see Reference (2)).

(i) Implementing the LASSO Model

When fitting a linear regression, we estimate the dependent variable \(\hat{y}\) from a series of independent variables \(x_j\):

\[ \hat{y}~=~b_0~+~b_1x_1~+~...~+~b_kx_k . \]

The ordinary least squares approach to linear regression finds the coefficients \(b_j\) that minimize \(\sum{(y-\hat{y})^2}\). With LASSO regression, we impose an additional constraint:

\[\sum{| b_j |} \leq s,\]

so the sum of the magnitudes of all the coefficients cannot exceed the value of \(s\).

If we make \(s\) small (but greater than zero), the coefficients of unimportant variables are driven to exactly zero and are thus effectively excluded from the model.

Therefore, only the variables with a significant impact will appear in the model (see Reference (3)).
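
Why small coefficients become exactly zero can be seen in the simplest special case. Assuming orthonormal predictors and the penalized form of the problem (half the sum of squared errors plus a penalty weight `lambda` times the sum of the absolute coefficients, where a larger `lambda` plays the role of a smaller \(s\)), the LASSO solution is obtained by "soft-thresholding" the least squares coefficients. The sketch below uses made-up coefficient values; `soft_threshold` is a hypothetical helper name.

```r
# Soft-thresholding: shrink every coefficient towards zero by lambda
# and set it to exactly zero if its magnitude is below lambda.
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

# Made-up least squares coefficients, for illustration only.
b_ols <- c(price = 0.80, wifi = -0.30, tv = 0.05)

print(soft_threshold(b_ols, lambda = 0.1))  # tv is dropped
print(soft_threshold(b_ols, lambda = 0.5))  # only price survives
```

This special case does not apply directly to the correlated predictors of the Yelp data, but it conveys the mechanism: large effects are shrunk, small ones vanish.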

How to determine \(s\)?

It is simply a trade-off between the smallest error and the fewest number of variables.
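
For reference, the glmnet package does not take the bound \(s\) as input. For a linear model with alpha = 1, it minimizes the equivalent penalized form below, where \(n\) is the number of observations and \(\lambda\) is a penalty weight:

\[ \min_{b_0, b_1, ..., b_k}~\frac{1}{2n}\sum{(y-\hat{y})^2}~+~\lambda\sum{| b_j |} . \]

A larger \(\lambda\) corresponds to a smaller \(s\), so choosing \(s\) amounts to choosing \(\lambda\), which is typically done by cross-validation.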

As predictors, we will use the relevant variables included for restaurants in the dataset. This includes categories (e.g. Chinese, French or Breakfast), the geographic location (city, neighborhood), and various attributes (wifi, classy and so on).

Figure 5 shows how the coefficients vary as the lasso parameter \(s\) increases (x-axis) and helps identify the key variables, as they “jump out” early on.

library(dplyr)
# Make dataset with predictors.
dataset <- all_restaurants %>%
  select(businessid,
         stars,   
         city,
         price,
         alcohol,
         noiselevel,
         classy,
         valet,
         cash,
         nhood,
         drivethru,
         parkinglot,
         casual,
         dance,
         delivery,
         ccheck,
         takeout,
         res,
         service,
         street,
         garage,
         late,
         desert,
         caters,
         wifi,
         goodforkids,
         tv,
         romantic,
         trendy,
         upscale,
         divey,
         wheelchair,
         bkgmusic,
         livemusic,
         byob,
         dogsallowed,
         open24hrs,
         amex,
         orderatcounter,
         visa
         )

dataset <- left_join(dataset, all_restaurants[c(1,119:(length(all_restaurants)-1))], by = "businessid")
dataset <- subset(dataset, select = -businessid)

# Turn the predictors into a numeric model matrix (dropping the intercept column).
x <- model.matrix(stars ~ ., data = dataset)[,-1]
y <- dataset$stars
# Define training and test sets.
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.train <- y[train]
y.test <- y[! (1:nrow(x)) %in% train]
grid=10^seq(10,-2, length =100)
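
It may help to see what model.matrix() does to factor predictors, since this determines the variable names that appear in the coefficient table later on. A minimal sketch on made-up toy data:

```r
# Toy data: one multi-level factor and one 0/1 factor (made-up values).
toy <- data.frame(
  stars = c(4, 3, 5, 2),
  wifi  = factor(c("free", "paid", "dnr", "free")),
  Vegan = factor(c(0, 1, 0, 1))
)

# Each factor is expanded into 0/1 dummy columns, one per non-baseline level;
# [, -1] drops the intercept column, as in the code above.
x_toy <- model.matrix(stars ~ ., data = toy)[, -1]
print(colnames(x_toy))
print(x_toy)
```

Each dummy column is named after the variable followed by the level, so a 0/1 category column such as Vegan shows up as "Vegan1", and the first level of each factor (here "dnr" for wifi) acts as the baseline.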

Figure 5

# Train the LASSO model. Make plot of coefficients for increasing "s".
lasso.mod <- glmnet(x[train,], y[train], alpha=1, lambda=grid)
plot(lasso.mod)

Figure 6 shows the variation of the cross-validated Mean-Squared Error (\(\frac{1}{k}\sum{(y-\hat{y})^2}\)) as a function of \(log(\lambda)\), where \(\lambda\) is the penalty weight used by glmnet in place of the bound \(s\): a larger \(\lambda\) corresponds to a smaller \(s\).

The top of the plot indicates the number of variables included in the model, and the optimal value of \(\lambda\) (and hence of \(s\)) corresponds to the lowest mean-squared error.

Figure 6

set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha=1)
plot(cv.out, col = "blue")

bestlam <- cv.out$lambda.min
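
To see the calibration steps end to end on data where the answer is known, here is a small simulation sketch. The data are made up: ten predictors of which only x1 and x2 actually influence the response. All object names ending in `_sim` are hypothetical and unrelated to the Yelp objects.

```r
library(glmnet)

set.seed(1)
n <- 200
p <- 10
x_sim <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("x", 1:p)))
# Only x1 and x2 truly matter in this made-up data.
y_sim <- 3 * x_sim[, "x1"] - 2 * x_sim[, "x2"] + rnorm(n)

# Half of the rows for training, the rest for testing.
train_sim <- sample(1:n, n / 2)

# Cross-validated LASSO (alpha = 1) on the training half.
cv_sim <- cv.glmnet(x_sim[train_sim, ], y_sim[train_sim], alpha = 1)

# Coefficients at the lambda with the lowest cross-validated error.
coef_sim <- as.matrix(coef(cv_sim, s = "lambda.min"))[-1, 1]
print(round(coef_sim, 3))

# Held-out predictions and test-set R-squared.
pred_sim <- as.numeric(predict(cv_sim, newx = x_sim[-train_sim, ],
                               s = "lambda.min"))
y_test_sim <- y_sim[-train_sim]
r2_sim <- 1 - mean((pred_sim - y_test_sim)^2) / var(y_test_sim)
print(r2_sim)
```

The two informative predictors should keep large coefficients while most of the noise predictors are shrunk to zero or close to it.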

RESULTS

The prediction accuracy is shown in Figure 7. The R-squared is close to 17%; in other words, about 17% of the variance in the ratings can be explained by the selected independent variables. Note: The thick blue line indicates theoretically perfect predictions and the black points represent individual predictions.

Figure 7

lasso <- data.frame(y.act = y.test)
lasso$y.act <- lasso$y.act %>%
  as.character() %>%
  as.numeric()
lasso$pred <- predict(lasso.mod, s=bestlam, newx=x[test,])  %>%
  as.character() %>%
  as.numeric()

# R-squared value estimate
1-(mean((lasso$pred -y.test)^2)/var(y.test))
## [1] 0.1735251
# Plot of actual vs. predicted values for the test data set.
ggplot(lasso, aes(jitter(y.act), pred)) + 
  geom_point() + 
  geom_smooth(method = "lm", colour="#000099") + 
  geom_abline(intercept = 0, colour="#000099", size = 2) +
  coord_cartesian(ylim = c(1,5)) +
  xlab("Actual Yelp Star Ratings") +
  ylab("LASSO Predicted Star Ratings") + 
  theme_classic()
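
The R-squared estimate used above can be wrapped in a small helper to make its meaning explicit. This is a sketch with a hypothetical function name and made-up numbers; it uses the textbook ratio of sums of squares, which differs from the mean/var expression above only by a factor of (n-1)/n in the second term.

```r
# Hypothetical helper: share of the variance explained on held-out data.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

actual <- c(1, 2, 3, 4, 5)

# Perfect predictions give 1.
print(r_squared(actual, actual))
# Always predicting the mean gives 0.
print(r_squared(actual, rep(mean(actual), 5)))
# Predictions shrunk towards the mean land in between.
print(r_squared(actual, c(2.5, 2.8, 3, 3.2, 3.5)))
```

The third case mirrors the pattern visible in Figure 7: predictions that stay close to the average rating explain only part of the variance.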

Finally, the largest coefficients are shown in Table 2.

Table 2

out <- glmnet(x, y, alpha=1, lambda=grid)
lasso.coef <- predict(out, type="coefficients", s=bestlam) [-1,] %>% data.frame()
names(lasso.coef)[names(lasso.coef)=="row.names"] <- "Variable"
names(lasso.coef)[names(lasso.coef)=="."] <- "Coefficient"
lasso.coef$Coefficient <- round(as.numeric(lasso.coef$Coefficient), 3) 
datatable(lasso.coef)
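
Reading off the strongest effects from such a table amounts to dropping the zero coefficients and sorting the rest. A minimal sketch follows; the coefficient names and values are made up for illustration and are not the values of Table 2, and `top_coefficients` is a hypothetical helper.

```r
# Made-up coefficients, for illustration only (not the values of Table 2).
toy_coef <- c(nhoodA = 0.42, French1 = 0.15, tv = 0,
              Buffets1 = -0.31, noiselevelloud = -0.12, wifipaid = -0.20)

# Hypothetical helper: largest positive and most negative non-zero coefficients.
top_coefficients <- function(coefs, n = 3) {
  kept <- coefs[coefs != 0]  # variables the LASSO did not drop
  list(positive = head(sort(kept[kept > 0], decreasing = TRUE), n),
       negative = head(sort(kept[kept < 0]), n))
}

top <- top_coefficients(toy_coef)
print(top)
```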

Table 2 indicates that the top ten positive variables are related to the location within the city. This might indicate that customers in wealthier neighborhoods are interested in better-quality food. If the business environment is favourable, it would not be surprising that good restaurants tend to cluster together, competing for the same potential customers.

Besides the location, categories like ‘Latin American’, ‘French’ and ‘Vegan1’ tend to be better rated and may reveal the main current dining trends (see Table 2, pages 63 to 70). This impression is reinforced by the negative impact on ratings of categories like ‘Buffets’, ‘Chicken Wings’ or ‘Fast Food’. Clearly, a ‘loud’ or ‘very loud’ environment negatively affects the clients’ perception of a restaurant (see Table 2, page 39). Moreover, having to pay for wifi is also penalized (see Table 2, page 60).

When establishments failed to provide information (‘dnr’ = ‘did not respond’), the ratings are also impacted. A likely explanation is that consumers who are not able to make informed decisions are more prone to be disappointed with their experience, which triggers negative reviews. Examples include failing to state whether or not a restaurant is ‘upscale’, or whether or not dogs are allowed. Filling in these responses resulted in better ratings in every single category except ‘drive-thru: TRUE’, ‘wifi: paid’, ‘delivery: TRUE’, ‘garage: TRUE’, and ‘open-late: TRUE’ (see Table 2, pages 57 to 62).

DISCUSSION

This LASSO model captures the overall trend of the star ratings. Unsurprisingly, since it is based on linear regression, the model estimates the average rating precisely but fails to predict the extremes. The lowest predicted ratings significantly overestimate the actual low ratings, and the highest predicted ratings underestimate the 5-star ratings actually present in the Yelp database. Again, this conclusion might have been anticipated given the linear nature of this machine learning algorithm. On the other hand, LASSO does a great job of reducing the dimensionality of the problem, i.e. performing variable selection (Note: Principal Components Analysis could also have been used; see References (4) and (5)), and there is no guarantee that non-linear modelling techniques such as Random Forest or SVM (Support Vector Machine) would perform better (see Reference (6)).

Which lessons learned from this project are useful for a restaurant owner?

As stated in the introduction, ratings matter, as restaurants with lower ratings are more likely to close (see Reference (6)). However, we should also sound a note of caution: fixing the problems mentioned above does not guarantee an immediately better rating. Correlation is not causation!

Restaurant owners should use these conclusions as an indication of possible improvements to implement in order to grow their business. In summary, they should exercise sound judgment and business acumen. Nothing new under the sun!

Finally, what can we do to improve our prediction model?

The Yelp database is so rich that it would make sense to use Natural Language Processing and Sentiment Analysis to extract information from the users' reviews (see Reference (8)).