Introduction

The review website Yelp not only connects customers with businesses, but also lets customers rate their experiences. The dataset contains millions of records covering user and business information, reviews, votes, and more. For the Johns Hopkins University Coursera capstone project, a specific Yelp dataset was provided for analysis.

From this dataset, I will attempt to answer the following questions:

  1. Can the rating of a restaurant be predicted from its attributes?
  2. Which attributes of a restaurant affect its rating on Yelp, and how?

Methods and Data

Step 1: Getting and Manipulating Data

The dataset was downloaded from the Coursera page. The raw files were in JSON format and were first converted to RDS format. To enable exploratory analysis and model building, the business data was flattened and cleaned.

Below we explain each step.

We start by loading the necessary libraries.

# A helper function that installs a package if it is not already
# installed on the machine, then loads it.
usePackage <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(as.character(pkg), dependencies = TRUE)
    require(pkg, character.only = TRUE)
  }
}

usePackage("dplyr")
usePackage("ggplot2")
usePackage("knitr")
usePackage("scales")
usePackage("tidyr")
usePackage("jsonlite")
usePackage("caret")
usePackage("randomForest")
usePackage("gtools")

# To make the code reproducible, we set the seed
set.seed(12345)

We first convert the files from JSON format to RDS format, as RDS is much more memory-efficient to load.

# The snippet of code below converts the large JSON files to the more
# memory-efficient RDS format. It only needs to be run once, so it is
# commented out here.

#setwd("/home/sanjay_meena/rworkspace/capstone project")
#dataDir <- "yelp_dataset_challenge_academic_dataset"

# read and write reviews json

#review  <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_review.json")))

#Save to RDS Format

#saveRDS(review, file="review.RDS")

#Remove the loaded dataframe for json
#rm(review)

# read and write business json

#business  <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_business.json")))
#saveRDS(business, file="business.RDS")
#rm(business)

# read and write checkin json

#checkin  <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_checkin.json")))
#saveRDS(checkin, file="checkin.RDS")
#rm(checkin)

# read and write tip json

#tip  <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_tip.json")))
#saveRDS(tip, file="tip.RDS")
#rm(tip)

# read and write user json

#user  <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_user.json")))
#saveRDS(user, file="user.RDS")

#rm(user)

Load the Business and Reviews data

#setwd("/home/sanjay_meena/rworkspace/capstone project")
setwd("~/rworkspace/capstone_project")
review   <- readRDS("dataset/review.RDS")
business <- readRDS("dataset/business.RDS")

Clean Business data

  1. We will extract restaurant information from the business data.
  2. We will create columns from restaurant categories as they have useful information related to restaurant reviews.
  3. We will create columns from business attributes as that information is necessary for making the prediction model.

Extract Restaurant information from Business data.

# Select all business data except attributes and hours; we will deal with those separately.
business1 <- business %>% select(-attributes, -hours)

restaurants1 <- separate(data = business1, col = categories,
                         into = paste0("categories", 1:10),
                         sep = ",", extra = "merge")


# Pick out the restaurants.
all_restaurants <- filter(restaurants1, 
                            grepl("Restaurants", categories1) | 
                            grepl("Restaurants", categories2)| 
                            grepl("Restaurants", categories3)|
                            grepl("Restaurants", categories4)|
                            grepl("Restaurants", categories5)|
                            grepl("Restaurants", categories6)|
                            grepl("Restaurants", categories7)|
                            grepl("Restaurants", categories8)|
                            grepl("Restaurants", categories9)|
                            grepl("Restaurants", categories10)) 

Extract major categories from the restaurant data.

majorCategories <- c(as.character(all_restaurants$categories1), 
            as.character(all_restaurants$categories2), 
            as.character(all_restaurants$categories3),
            as.character(all_restaurants$categories4), 
            as.character(all_restaurants$categories5), 
            as.character(all_restaurants$categories6),
            as.character(all_restaurants$categories7), 
            as.character(all_restaurants$categories8), 
            as.character(all_restaurants$categories9),
            as.character(all_restaurants$categories10)
            ) %>% 
  table() %>% 
  sort() 

Create new flat columns in the restaurant data from the categories.

# This function creates a column for a category, 1 = yes, 0 = no.
variableCreator <- function(x){
  all_restaurants <- mutate(all_restaurants, 
                            a = 
                              ifelse(
                                  grepl(x, categories1)| 
                                  grepl(x, categories2) | 
                                  grepl(x, categories3)| 
                                  grepl(x, categories4)| 
                                  grepl(x, categories5) | 
                                  grepl(x, categories6) | 
                                  grepl(x, categories7) | 
                                  grepl(x, categories8)| 
                                  grepl(x, categories9) | 
                                  grepl(x, categories10) , 1, 0) )
  all_restaurants$a <- as.factor(all_restaurants$a)
  names(all_restaurants)[names(all_restaurants)=="a"] <- gsub(" ", "", x, fixed = TRUE)
  return(all_restaurants)
}
  
# Make the new columns.
all_restaurants <- variableCreator("Restaurants")
all_restaurants <- variableCreator("Fast Food")
all_restaurants <- variableCreator("Pizza")
all_restaurants <- variableCreator("Mexican")
all_restaurants <- variableCreator("American (Traditional)")
all_restaurants <- variableCreator("Nightlife")
all_restaurants <- variableCreator("Sandwiches")
all_restaurants <- variableCreator("Bars")
all_restaurants <- variableCreator("Food")
all_restaurants <- variableCreator("Italian")
all_restaurants <- variableCreator("Chinese")
all_restaurants <- variableCreator("American (New)")
all_restaurants <- variableCreator("Burgers")
all_restaurants <- variableCreator("Breakfast & Brunch")
all_restaurants <- variableCreator("Cafes")
all_restaurants <- variableCreator("Japanese")
all_restaurants <- variableCreator("Sushi Bars")
all_restaurants <- variableCreator("Delis")
all_restaurants <- variableCreator("Steakhouses")
all_restaurants <- variableCreator("Seafood")
all_restaurants <- variableCreator("Chicken Wings")
all_restaurants <- variableCreator("Sports Bars")
all_restaurants <- variableCreator("Coffee & Tea")
all_restaurants <- variableCreator("Mediterranean")
all_restaurants <- variableCreator("Barbeque")
all_restaurants <- variableCreator("Thai")
all_restaurants <- variableCreator("Asian Fusion")
all_restaurants <- variableCreator("French")
all_restaurants <- variableCreator("Buffets")
all_restaurants <- variableCreator("Indian")
all_restaurants <- variableCreator("Pubs")
all_restaurants <- variableCreator("Greek")
all_restaurants <- variableCreator("Diners")
all_restaurants <- variableCreator("Bakeries")
all_restaurants <- variableCreator("Vietnamese")
all_restaurants <- variableCreator("Tex-Mex")
all_restaurants <- variableCreator("Vegetarian")
all_restaurants <- variableCreator("Salad")
all_restaurants <- variableCreator("Hot Dogs")
all_restaurants <- variableCreator("Middle Eastern")
all_restaurants <- variableCreator("Event Planning & Services")
all_restaurants <- variableCreator("Specialty Food")
all_restaurants <- variableCreator("Lounges")
all_restaurants <- variableCreator("Korean")
all_restaurants <- variableCreator("Canadian (New)")
all_restaurants <- variableCreator("Arts & Entertainment")
all_restaurants <- variableCreator("Wine Bars")
all_restaurants <- variableCreator("Gluten-Free")
all_restaurants <- variableCreator("Latin American")
all_restaurants <- variableCreator("British")
all_restaurants <- variableCreator("Gastropubs")
all_restaurants <- variableCreator("Ice Cream & Frozen Yogurt")
all_restaurants <- variableCreator("Southern")
all_restaurants <- variableCreator("Vegan")
all_restaurants <- variableCreator("Desserts")
all_restaurants <- variableCreator("Hawaiian")
all_restaurants <- variableCreator("German")
all_restaurants <- variableCreator("Bagels")
all_restaurants <- variableCreator("Caterers")
all_restaurants <- variableCreator("Juice Bars & Smoothies")
all_restaurants <- variableCreator("Fish & Chips")
all_restaurants <- variableCreator("Ethnic Food")
all_restaurants <- variableCreator("Tapas Bars")
all_restaurants <- variableCreator("Soup")
all_restaurants <- variableCreator("Halal")


all_restaurants <- select(all_restaurants,-categories1,-categories2,-categories3,-categories4,-categories5,-categories6,-categories7,-categories8,-categories9,-categories10)

Clean restaurant data

  1. Convert some list columns to vector columns for model building.
  2. Clean the column names.
  3. Deal with NA (missing) values.

Convert list columns to vector columns
# Function to replace empty character(0) entries with "dnr"
removeEmptyCharacter <- function(x){
  if (invalid(x)) {
    x <- "dnr"
  }
  return(x)
}

# Function to help convert a list to a vector: replace invalid entries with NA
removeEmptyCharacter2 <- function(x){
  if (invalid(x)) {
    x <- NA
  }
  return(x)
}


# Replace character(0)
all_restaurants$neighborhoods <- sapply(all_restaurants$neighborhoods, removeEmptyCharacter)

# Convert the list column to a vector column, as the model does not accept list columns
all_restaurants$neighborhoods <- as.vector(all_restaurants$neighborhoods, mode="character")

Clean column names
# Function to clean the names of the variables
clean.names <- function(df){
  colnames(df) <- gsub("attributes[^[:alnum:]]", "", colnames(df))
  colnames(df) <- gsub("[^[:alnum:]]", "", colnames(df))
  colnames(df) <- tolower(colnames(df))
  return(df)
}



# The column names will look like the following after flattening and extracting:
all_restaurants <- clean.names(all_restaurants)
#colnames(all_restaurants)

Create business attributes information

# Flatten the business attributes
business_attributes <- business %>% select(business_id,attributes)
business_attributes <- flatten(business_attributes, recursive = TRUE)
business_attributes <- clean.names(business_attributes)
# remove duplicate variables
business_attributes <- business_attributes[,!duplicated(colnames(business_attributes))]

# Convert the acceptscreditcards column from list to vector
temp <- sapply(business_attributes$acceptscreditcards, removeEmptyCharacter2)
temp <- as.vector(temp, mode="logical")
business_attributes$acceptscreditcards <- temp

#colnames(business_attributes)

Step 2: Exploratory Data Analysis

Let’s look at the star ratings of restaurants from several angles.

Star rating histogram of restaurants

We find that most ratings lie between 3.5 and 4.
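
The plotting code does not appear in the report source (only the rendered figure did). A minimal sketch of how such a histogram could be produced, assuming the all_restaurants data frame built above with half-star ratings in its stars column:

# Sketch only: distribution of restaurant star ratings
g <- ggplot(all_restaurants, aes(x = factor(stars)))
g <- g + geom_bar(fill = "orange")
g <- g + ggtitle("Star ratings of restaurants")
g <- g + xlab("Star rating") + ylab("Number of restaurants")
g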

Number of reviews per restaurant

We see that most restaurants have fewer than 10 reviews.
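
Again, only the rendered figure was included; a sketch under the same assumptions (after clean.names, the review count column is named reviewcount):

# Sketch only: number of reviews per restaurant; the x-axis is truncated
# because a few restaurants have very many reviews
g <- ggplot(all_restaurants, aes(x = reviewcount))
g <- g + geom_histogram(binwidth = 10, fill = "orange")
g <- g + xlim(0, 250)
g <- g + ggtitle("Reviews per restaurant")
g <- g + xlab("Number of reviews") + ylab("Number of restaurants")
g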

Reviews per restaurant classified by star rating
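
The figure for this view is likewise omitted from the source; one way to reproduce it, again assuming the all_restaurants data frame above, is a boxplot of review counts grouped by star rating:

# Sketch only: review counts per restaurant, grouped by star rating;
# a log scale is used because review counts are heavily skewed
g <- ggplot(all_restaurants, aes(x = factor(stars), y = reviewcount))
g <- g + geom_boxplot(fill = "orange")
g <- g + scale_y_log10()
g <- g + ggtitle("Reviews per restaurant by star rating")
g <- g + xlab("Star rating") + ylab("Number of reviews (log scale)")
g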

Step 3: Model Building

Preparation of dataset

Let’s create the dataset that will be used to build and test the models.

dataset <- left_join(all_restaurants, business_attributes, by = "businessid")
dataset <- clean.names(dataset)

# We leave out the business id, full address, name, latitude, and longitude columns.
dataset <- select(dataset,-businessid,-fulladdress,-name,-latitude,-longitude)

# Replace missing values with "dnr"
dataset[is.na(dataset)] <- "dnr"

We will convert the character columns to factors, and then convert the factors to numeric values.

# Take a small portion of the dataset (the first 1,000 rows)
t <- head(dataset, n=1000)
t <- as.data.frame(t, stringsAsFactors=T)

# make all character columns as factors
t[sapply(t, is.character)] <- lapply(t[sapply(t, is.character)],  as.factor)
# make all factors as numeric variables
t[sapply(t, is.factor)] <- lapply(t[sapply(t, is.factor)],  as.numeric)
# make star rating as factor
t$stars <- as.factor(t$stars)

#str(t)

We will now create the training and testing sets: 70% of the data will be used for training and 30% for testing.

# Split into training and cross-validation sets
inTrain <- createDataPartition(t$stars, p = 0.70, list = FALSE)
trainData <- t[inTrain, ]
crossValidationData <- t[-inTrain, ]
dim(trainData)
## [1] 705 149

Step 4: Model Validation

We will create several models using random forests. The steps are:

  1. Train a first model on all the variables, with stars as the outcome.
  2. Find the important variables using the importance function.
  3. Select the important variables for retraining.
  4. Retrain the model.

# Fit a random forest with all variables
rf.1 <- randomForest(as.factor(stars) ~ ., data=trainData, importance=TRUE, ntree=500)

# Check the importance of the variables
imp <- importance(rf.1)
vars <- dimnames(imp)[[1]]
imp <- data.frame(vars=vars, imp=as.numeric(imp[,1]))
imp <- imp[order(imp$imp, decreasing=T),]
#imp

# Select the most important variables for retraining
#set.seed(42)
selected <- c(as.character(imp[1:38,1]), 'stars')
#selected

# Train again on the selected variables
rf.2 <- randomForest(stars ~ ., data=trainData[,selected], replace=T, ntree=500)

# Perform prediction

# We load the pre-compiled model here for the generation of the report
rf.2 <- get(load("yelp_star_model.rda"))

Plot of variable importance using random forest

varImpPlot(rf.2,main='Variable Importance : Final Model',pch=16,col='blue')

#plot(rf.1, main='Error vs No. of trees plot: Final Model',col='blue')
predict_rf <- predict(rf.2, crossValidationData, type = "class")
confusion_matrix <- confusionMatrix(predict_rf,crossValidationData$stars)
confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1 1.5  2 2.5  3 3.5  4 4.5  5
##        1    0   0  0   0  0   0  0   0  0
##        1.5  0   0  0   0  0   0  0   0  0
##        2    1   0  1   0  2   1  0   0  0
##        2.5  0   2  0   8  1   2  2   0  0
##        3    0   2  4   5 14   4  4   1  0
##        3.5  1   1  6  10 28  57 22   4  1
##        4    0   0  0  10 14  23 46  10  0
##        4.5  0   0  0   0  2   1  1   4  0
##        5    0   0  0   0  0   0  0   0  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4407          
##                  95% CI : (0.3832, 0.4994)
##     No Information Rate : 0.2983          
##     P-Value [Acc > NIR] : 1.644e-07       
##                                           
##                   Kappa : 0.2521          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 1.5 Class: 2 Class: 2.5 Class: 3
## Sensitivity           0.00000    0.00000  0.09091    0.24242  0.22951
## Specificity           1.00000    1.00000  0.98592    0.97328  0.91453
## Pos Pred Value            NaN        NaN  0.20000    0.53333  0.41176
## Neg Pred Value        0.99322    0.98305  0.96552    0.91071  0.81992
## Prevalence            0.00678    0.01695  0.03729    0.11186  0.20678
## Detection Rate        0.00000    0.00000  0.00339    0.02712  0.04746
## Detection Prevalence  0.00000    0.00000  0.01695    0.05085  0.11525
## Balanced Accuracy     0.50000    0.50000  0.53841    0.60785  0.57202
##                      Class: 3.5 Class: 4 Class: 4.5 Class: 5
## Sensitivity              0.6477   0.6133    0.21053  0.00000
## Specificity              0.6473   0.7409    0.98551  1.00000
## Pos Pred Value           0.4385   0.4466    0.50000      NaN
## Neg Pred Value           0.8121   0.8490    0.94774  0.99661
## Prevalence               0.2983   0.2542    0.06441  0.00339
## Detection Rate           0.1932   0.1559    0.01356  0.00000
## Detection Prevalence     0.4407   0.3492    0.02712  0.00000
## Balanced Accuracy        0.6475   0.6771    0.59802  0.50000

Confusion matrix for the prediction results after variable selection

# Create prediction table
all.predictions <- data.frame(actual=crossValidationData$stars,random.forest = predict_rf)
all.predictions <- gather(all.predictions,key = model,value = predictions,2)

all.predictions$actual <- as.numeric(as.character( all.predictions$actual ))
all.predictions$predictions <- as.numeric(as.character( all.predictions$predictions ))
all.predictions$difference <- all.predictions$predictions-all.predictions$actual
#str(all.predictions)

Results

Important Variables

We identified the variables with the highest impact on restaurant ratings; they are shown in the variable importance plot. Review count, city, and waiter service are the top three attributes for predicting a restaurant's rating.
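
As a quick cross-check, the top-ranked variables can also be listed directly from the final model. A sketch, assuming rf.2 as fitted above (trained without importance=TRUE, so randomForest reports the mean decrease in Gini):

# Sketch only: print the ten most important variables of the final model
imp2 <- importance(rf.2)
head(imp2[order(imp2[, 1], decreasing = TRUE), , drop = FALSE], 10)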

Prediction model

As validation for the prediction model, a confusion matrix was generated. The calculated accuracy is ~44%.

Discussion

We find that about 30 attributes, such as review count, city, waiter service, and good for kids, have the highest impact on restaurant ratings.

If we are willing to relax the accuracy requirement, the model may be a good rough predictor. To verify this, we draw a histogram of the difference between the predicted and actual star ratings. It shows that most predicted values are within 1 star of the actual value.

g <- ggplot(all.predictions)
g <- g + geom_histogram(aes(x=difference),fill="orange",binwidth=.3)
g <-  g + ggtitle("Difference between predicted and actual rating")
g <- g + ylab("Count") + xlab("Predicted - Actual")
g

Roughly 91% of predictions are within +/- 1 star, as shown below:

difference <- as.data.frame(table(all.predictions$difference))
difference
##   Var1 Freq
## 1 -1.5    5
## 2   -1   12
## 3 -0.5   37
## 4    0  130
## 5  0.5   57
## 6    1   32
## 7  1.5   20
## 8    2    1
## 9  2.5    1
(difference[2,2]+difference[3,2]+difference[4,2]+difference[5,2]+difference[6,2])/sum(difference[,2])
## [1] 0.9084746