The review website Yelp not only connects customers with businesses, but also allows customers to rate their experiences. There are millions of data consisting of user/business information, reviews, votes, and so on. For Johns Hopkins University Coursera capstone project, a specific Yelp dataset had been provided for analytical purpose.
From this dataset, I will attempt to answer the following questions:
The dataset was downloaded from the Coursera Page. The raw files were in JSON Format. They were first converted to “RDS”" Format . To be able to exploratory analysis and model building , The business data was flattened and cleaned .
Below we explain each step :
# We use a very useful function which installs the package automatically if they are not alreay installed on the machine.
usePackage<-function(pkg){
if ( !require(pkg, character.only=TRUE) ) {
install.packages(as.character(pkg), dependencies=TRUE)
require(pkg, character.only=TRUE) }
}
usePackage("dplyr")
usePackage("ggplot2")
usePackage("knitr")
usePackage("scales")
usePackage("tidyr")
usePackage("jsonlite")
usePackage("caret")
usePackage("randomForest")
usePackage("gtools")
# To make the code reproducible , we set the seed
set.seed(12345)
We first convert the files from “JSON” format to “RDS” format as it much more memory efficient to load and run.
#The snippet of code below is for converting the big json files to more memory efficient RDS format.
#setwd("/home/sanjay_meena/rworkspace/capstone project")
#dataDir <- "yelp_dataset_challenge_academic_dataset"
# read and write reviews json
#review <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_review.json")))
#Save to RDS Format
#saveRDS(review, file="review.RDS")
#Remove the loaded dataframe for json
#rm(review)
# read and write business json
#business <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_business.json")))
#saveRDS(business, file="business.RDS")
#rm(business)
# read and write checkin json
#checkin <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_checkin.json")))
#saveRDS(checkin, file="checkin.RDS")
#rm(checkin)
# read and write tip json
#tip <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_tip.json")))
#saveRDS(tip, file="tip.RDS")
#rm(tip)
# read and write user json
#user <- stream_in(file(paste0(dataDir, "/yelp_academic_dataset_user.json")))
#saveRDS(user, file="user.RDS")
#
#rm(user)
#setwd("/home/sanjay_meena/rworkspace/capstone project")
setwd("~/rworkspace/capstone_project")
review <- readRDS("dataset/review.RDS")
business <- readRDS("dataset/business.RDS")
# Select all business data except attributes and hours. We will deal with them separately
business1 <- business %>% select (-attributes,-hours)
restaurants1 <- separate(data = business1, col = categories, into = c("categories1","categories2","categories3","categories4","categories5","categories6","categories7","categories8","categories9","categories10"), sep = ",",extra="merge")
# Pick out the restaurants.
all_restaurants <- filter(restaurants1,
grepl("Restaurants", categories1) |
grepl("Restaurants", categories2)|
grepl("Restaurants", categories3)|
grepl("Restaurants", categories4)|
grepl("Restaurants", categories5)|
grepl("Restaurants", categories6)|
grepl("Restaurants", categories7)|
grepl("Restaurants", categories8)|
grepl("Restaurants", categories9)|
grepl("Restaurants", categories10))
majorCategories <- c(as.character(all_restaurants$categories1),
as.character(all_restaurants$categories2),
as.character(all_restaurants$categories3),
as.character(all_restaurants$categories4),
as.character(all_restaurants$categories5),
as.character(all_restaurants$categories6),
as.character(all_restaurants$categories7),
as.character(all_restaurants$categories8),
as.character(all_restaurants$categories9),
as.character(all_restaurants$categories10)
) %>%
table() %>%
sort()
# This function creates a column for a category, 1 = yes, 0 = no.
variableCreator <- function(x){
all_restaurants <- mutate(all_restaurants,
a =
ifelse(
grepl(x, categories1)|
grepl(x, categories2) |
grepl(x, categories3)|
grepl(x, categories4)|
grepl(x, categories5) |
grepl(x, categories6) |
grepl(x, categories7) |
grepl(x, categories8)|
grepl(x, categories9) |
grepl(x, categories10) , 1, 0) )
all_restaurants$a <- as.factor(all_restaurants$a)
names(all_restaurants)[names(all_restaurants)=="a"] <- gsub(" ", "", x, fixed = TRUE)
return(all_restaurants)
}
# Make the new columns.
all_restaurants <- variableCreator("Restaurants")
all_restaurants <- variableCreator("Fast Food")
all_restaurants <- variableCreator("Pizza")
all_restaurants <- variableCreator("Mexican")
all_restaurants <- variableCreator("American (Traditional)")
all_restaurants <- variableCreator("Nightlife")
all_restaurants <- variableCreator("Sandwiches")
all_restaurants <- variableCreator("Bars")
all_restaurants <- variableCreator("Food")
all_restaurants <- variableCreator("Italian")
all_restaurants <- variableCreator("Chinese")
all_restaurants <- variableCreator("American (New)")
all_restaurants <- variableCreator("Burgers")
all_restaurants <- variableCreator("Breakfast & Brunch")
all_restaurants <- variableCreator("Cafes")
all_restaurants <- variableCreator("Japanese")
all_restaurants <- variableCreator("Sushi Bars")
all_restaurants <- variableCreator("Delis")
all_restaurants <- variableCreator("Steakhouses")
all_restaurants <- variableCreator("Seafood")
all_restaurants <- variableCreator("Chicken Wings")
all_restaurants <- variableCreator("Sports Bars")
all_restaurants <- variableCreator("Coffee & Tea")
all_restaurants <- variableCreator("Mediterranean")
all_restaurants <- variableCreator("Barbeque")
all_restaurants <- variableCreator("Thai")
all_restaurants <- variableCreator("Asian Fusion")
all_restaurants <- variableCreator("French")
all_restaurants <- variableCreator("Buffets")
all_restaurants <- variableCreator("Indian")
all_restaurants <- variableCreator("Pubs")
all_restaurants <- variableCreator("Greek")
all_restaurants <- variableCreator("Diners")
all_restaurants <- variableCreator("Bakeries")
all_restaurants <- variableCreator("Vietnamese")
all_restaurants <- variableCreator("Tex-Mex")
all_restaurants <- variableCreator("Vegetarian")
all_restaurants <- variableCreator("Salad")
all_restaurants <- variableCreator("Hot Dogs")
all_restaurants <- variableCreator("Middle Eastern")
all_restaurants <- variableCreator("Event Planning & Services")
all_restaurants <- variableCreator("Specialty Food")
all_restaurants <- variableCreator("Lounges")
all_restaurants <- variableCreator("Korean")
all_restaurants <- variableCreator("Canadian (New)")
all_restaurants <- variableCreator("Arts & Entertainment")
all_restaurants <- variableCreator("Wine Bars")
all_restaurants <- variableCreator("Gluten-Free")
all_restaurants <- variableCreator("Latin American")
all_restaurants <- variableCreator("British")
all_restaurants <- variableCreator("Gastropubs")
all_restaurants <- variableCreator("Ice Cream & Frozen Yogurt")
all_restaurants <- variableCreator("Southern")
all_restaurants <- variableCreator("Vegan")
all_restaurants <- variableCreator("Desserts")
all_restaurants <- variableCreator("Hawaiian")
all_restaurants <- variableCreator("German")
all_restaurants <- variableCreator("Bagels")
all_restaurants <- variableCreator("Caterers")
all_restaurants <- variableCreator("Juice Bars & Smoothies")
all_restaurants <- variableCreator("Fish & Chips")
all_restaurants <- variableCreator("Ethnic Food")
all_restaurants <- variableCreator("Tapas Bars")
all_restaurants <- variableCreator("Soup")
all_restaurants <- variableCreator("Halal")
all_restaurants <- select(all_restaurants,-categories1,-categories2,-categories3,-categories4,-categories5,-categories6,-categories7,-categories8,-categories9,-categories10)
# Function to remove empty character(0) from the list
removeEmptyCharacter <- function(x){
if(invalid(x)) {
x <- "dnr"
}
return (x)
}
# Function to help conversion of list to vector
removeEmptyCharacter2 <- function(x){
t1 <- x
if(invalid(x)) {
x <- NA
}
return (x)
}
# Replace character(0)
all_restaurants$neighborhoods <- sapply(all_restaurants$neighborhoods, removeEmptyCharacter)
# Convert list column to vector column as model does not accept list
all_restaurants$neighborhoods <- as.vector(all_restaurants$neighborhoods,mode="character")
# Function to clean the names of the variables
clean.names <- function(df){
colnames(df) <- gsub("attributes[^[:alnum:]]", "", colnames(df))
colnames(df) <- gsub("[^[:alnum:]]", "", colnames(df))
colnames(df) <- tolower(colnames(df))
return(df)
}
#The column names will look like the following after flattening and extracting :
all_restaurants <- clean.names(all_restaurants)
#colnames(all_restaurants)
# Fixing business attributes
business_attributes <- business %>% select(business_id,attributes)
business_attributes <- flatten(business_attributes, recursive = TRUE)
business_attributes <- clean.names(business_attributes)
# remove duplicate variables
business_attributes <- business_attributes[,!duplicated(colnames(business_attributes))]
# Change acceptcredit card column from list to vector
temp <- sapply(business_attributes$acceptscreditcards, removeEmptyCharacter2)
temp <- as.vector(temp,mode="logical")
business_attributes$acceptscreditcards <- temp
#colnames(business_attributes)
Let’s look at the star ratings of a restaurants from various views.
We find that most ratings are between 3.5 to 4.
We see that most restaurants have less than 10 reviews.
Let’s create the dataset which will used to build and test models.
dataset <- left_join( all_restaurants,business_attributes, by = "businessid")
dataset <- clean.names(dataset)
# We will leave out business id, full address , name, lattitue, longitude information.
dataset <- select(dataset,-businessid,-fulladdress,-name,-latitude,-longitude)
#replace missing values by "dnr"
dataset[is.na(dataset)] <- "dnr"
We will factor the variables and convert character value to numeric values.
#take small portion of dataset
t <- dataset
t <- head(dataset, n=1000)
t=as.data.frame(t, stringsAsFactors=T)
# make all character columns as factors
t[sapply(t, is.character)] <- lapply(t[sapply(t, is.character)], as.factor)
# make all factors as numeric variables
t[sapply(t, is.factor)] <- lapply(t[sapply(t, is.factor)], as.numeric)
# make star rating as factor
t$stars <- as.factor(t$stars)
#str(t)
We will now create the training and testing set. 70% data will be used for training and 30% data will be used for testing.
# cross validation data
inTrain <- createDataPartition(t$stars, p=0.70, list=F)
trainData <- t[inTrain, ]
crossValidationData <- t[-inTrain, ]
dim(trainData)
## [1] 705 149
We will create several models using Random Forest The steps are :
# fit random forest
rf.1 <- randomForest(as.factor(stars) ~ . , data=trainData, importance=TRUE, ntree=500)
#check for importance of variables
imp<-importance(rf.1)
vars<-dimnames(imp)[[1]]
imp<-data.frame(vars=vars,imp=as.numeric(imp[,1]))
imp<-imp[order(imp$imp,decreasing=T),]
#imp
# Select important variables for retraining
#set.seed(42)
selected<-c(as.character(imp[1:38,1]),'stars')
#selected
# train again
rf.2<-randomForest(stars~.,data=trainData[,selected],replace=T,ntree=500)
# Peform prediction
# We will load the pre compiled model for the generation of the report here
rf.2 <- get(load("yelp_star_model.rda"))
Plot of variable importance using random forest
varImpPlot(rf.2,main='Variable Importance : Final Model',pch=16,col='blue')
#plot(rf.1, main='Error vs No. of trees plot: Final Model',col='blue')
predict_rf <- predict(rf.2, crossValidationData, type = "class")
confusion_matrix <- confusionMatrix(predict_rf,crossValidationData$stars)
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 1.5 2 2.5 3 3.5 4 4.5 5
## 1 0 0 0 0 0 0 0 0 0
## 1.5 0 0 0 0 0 0 0 0 0
## 2 1 0 1 0 2 1 0 0 0
## 2.5 0 2 0 8 1 2 2 0 0
## 3 0 2 4 5 14 4 4 1 0
## 3.5 1 1 6 10 28 57 22 4 1
## 4 0 0 0 10 14 23 46 10 0
## 4.5 0 0 0 0 2 1 1 4 0
## 5 0 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.4407
## 95% CI : (0.3832, 0.4994)
## No Information Rate : 0.2983
## P-Value [Acc > NIR] : 1.644e-07
##
## Kappa : 0.2521
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 1.5 Class: 2 Class: 2.5 Class: 3
## Sensitivity 0.00000 0.00000 0.09091 0.24242 0.22951
## Specificity 1.00000 1.00000 0.98592 0.97328 0.91453
## Pos Pred Value NaN NaN 0.20000 0.53333 0.41176
## Neg Pred Value 0.99322 0.98305 0.96552 0.91071 0.81992
## Prevalence 0.00678 0.01695 0.03729 0.11186 0.20678
## Detection Rate 0.00000 0.00000 0.00339 0.02712 0.04746
## Detection Prevalence 0.00000 0.00000 0.01695 0.05085 0.11525
## Balanced Accuracy 0.50000 0.50000 0.53841 0.60785 0.57202
## Class: 3.5 Class: 4 Class: 4.5 Class: 5
## Sensitivity 0.6477 0.6133 0.21053 0.00000
## Specificity 0.6473 0.7409 0.98551 1.00000
## Pos Pred Value 0.4385 0.4466 0.50000 NaN
## Neg Pred Value 0.8121 0.8490 0.94774 0.99661
## Prevalence 0.2983 0.2542 0.06441 0.00339
## Detection Rate 0.1932 0.1559 0.01356 0.00000
## Detection Prevalence 0.4407 0.3492 0.02712 0.00000
## Balanced Accuracy 0.6475 0.6771 0.59802 0.50000
Confusion Matrix for the prediction results after updated variables
# Create prediction table
all.predictions <- data.frame(actual=crossValidationData$stars,random.forest = predict_rf)
all.predictions <- gather(all.predictions,key = model,value = predictions,2)
all.predictions$actual <- as.numeric(as.character( all.predictions$actual ))
all.predictions$predictions <- as.numeric(as.character( all.predictions$predictions ))
all.predictions$difference <- all.predictions$predictions-all.predictions$actual
#str(all.predictions)
We found out the variables which have the highest impact on the restaurant rating. The variable are mentioned on the variable importance plot. We see that review count, city, waiter service are top three attributes for restaurant rating.
As a validation for the prediction model, a confusion matrix was generated. The calculated accuracy is ~44%.
We find that that about 30 attributes like review count, city, waiter service, good for kids etc. have the highest impact on the restaurant ratings.
If we are willing to relax the accuracy requirement, the model may be a good rough predictor. To verify this, we draw a histogram for the difference between the predicted and actual star ratings. It shows that most of predicted values are within 1 star rating from actual value.
g <- ggplot(all.predictions)
g <- g + geom_histogram(aes(x=difference),fill="orange",binwidth=.3)
g <- g + ggtitle("Difference between predicted and actual rating")
g <- g + ylab("Count") + xlab("Predicted - Actual")
g
Roughly 91% of predictions are between +/- 1 star as shown below:
difference <- as.data.frame(table(all.predictions$difference))
difference
## Var1 Freq
## 1 -1.5 5
## 2 -1 12
## 3 -0.5 37
## 4 0 130
## 5 0.5 57
## 6 1 32
## 7 1.5 20
## 8 2 1
## 9 2.5 1
(difference[2,2]+difference[3,2]+difference[4,2]+difference[5,2]+difference[6,2])/sum(difference[,2])
## [1] 0.9084746