Our project used data from Kaggle’s 2013 Yelp Challenge, which included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. The data covers user reviews, ratings, and check-in data for a wide range of businesses.
Data was acquired and transformed in the preprocessing.R file located within our repository's final-project folder. The source data was provided as multi-record JSON files, meaning each file is a collection of JSON records. We used the stream_in function from the jsonlite package, which parses JSON data line by line, to read the files from the data folder of our repository. The collections included three large files covering Yelp businesses, users, and reviews.
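A minimal sketch of this ingestion step is shown below; the file names are illustrative assumptions rather than the exact names used in preprocessing.R.
# sketch of the JSON ingestion step (file names are assumed)
library(jsonlite)
business <- stream_in(file("data/yelp_business.json"))
users    <- stream_in(file("data/yelp_user.json"))
reviews  <- stream_in(file("data/yelp_review.json"))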
Once obtained, we prepared our data for our recommender system using the following transformations:
We chose to limit the scope of our recommender system to businesses with tags related to food and beverages. There were originally 508 unique category tags listed within our business data, from which we manually selected 112 targeted categories to subset our data.
We applied additional transformations to remove unnecessary data. There were 1,224 businesses in our data that were permanently closed, accounting for 9.8% of all businesses; these were removed from our data. We also removed 3 businesses in our data set located outside of Arizona.
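A rough sketch of these filters follows; the column names (categories, open, state) reflect the standard Yelp schema, and food_categories stands in for our manually curated list of 112 tags — both are assumptions about the actual code in preprocessing.R.
library(dplyr)
# food_categories: hypothetical character vector holding the 112 curated tags
business_sub <- business %>%
  # keep businesses tagged with at least one food/beverage category
  filter(sapply(categories, function(tags) any(tags %in% food_categories))) %>%
  # drop permanently closed businesses and those outside Arizona
  filter(open == TRUE, state == "AZ")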
As a result of these transformations, our business data was reduced to 4,828 unique businesses. This was further limited to 4,332 after randomly sampling our user data. The output can be previewed below:
We then subset our review data to the selected food and beverage businesses, which dropped our review data from 229,907 to 165,823 reviews. We later applied another filter to keep only reviews from 10,000 randomly sampled users, further reducing the reviews to 44,494 observations. Our review data can be previewed in two parts below:
Next, we applied a similar filter to our user data, keeping only users who reviewed our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations. Due to processing constraints in R, we chose to randomly sample 10,000 users from these unique profiles.
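A sketch of this subsetting and sampling is shown below, assuming the Yelp business_id and user_id key columns; the object names are illustrative.
set.seed(1)
# keep only reviews for the selected food/beverage businesses
reviews_sub <- reviews %>%
  filter(business_id %in% business_sub$business_id)
# restrict users to those with in-scope reviews, then randomly sample 10,000
users_sub <- users %>%
  filter(user_id %in% reviews_sub$user_id) %>%
  sample_n(10000)
# keep only reviews written by the sampled users
reviews_sub <- reviews_sub %>%
  filter(user_id %in% users_sub$user_id)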
The data frame preview below shows aggregated user data for all reviews an individual user provided on Yelp within our data selection.
Lastly, we created our main data frame by merging the business and review data on business ID. This data frame serves as the source of data for our recommender algorithms. The unique user and business character keys were simplified to numeric user/item identifiers.
This data frame will be referenced later on when building our recommender matrices and algorithms.
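A minimal sketch of this merge and key simplification is shown below; the business columns kept in the join and the factor-based numeric IDs are assumptions, while the userID and itemID names match the columns used in the matrix-building code later on.
# join reviews to the selected businesses and add compact numeric user/item IDs;
# only non-colliding business columns are carried over (illustrative selection)
df <- reviews_sub %>%
  inner_join(select(business_sub, business_id, name, categories),
             by = "business_id") %>%
  mutate(userID = as.numeric(factor(user_id)),
         itemID = as.numeric(factor(business_id)))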
We converted our raw ratings data into a user-item matrix to train and test our recommender system algorithms. The matrix was stored as a realRatingMatrix from the recommenderlab package for later processing.
# spread data from long to wide format
matrix_data <- df %>% select(userID, itemID, stars) %>% spread(itemID, stars)
# set row names to userid
rownames(matrix_data)<-matrix_data$userID
# remove userid from columns
matrix_data <-matrix_data %>% select(-userID)
# randomize dataframe
set.seed(1)
matrix_data <- matrix_data[sample(nrow(matrix_data)),]
# convert to matrix
ui_mat <- matrix_data %>% as.matrix()
# store matrix as realRatingMatrix
ui_mat <- as(ui_mat, "realRatingMatrix")

Our data was split into training and test sets for evaluating both recommender algorithms. We used five random train/test splits (k = 5), retaining 80% of the data for training and 20% for testing.
# evaluation method with 80% of data for train and 20% for test
set.seed(1)
evalu <- evaluationScheme(ui_mat, method="split", train=0.8, given=1, goodRating=4, k=5)
# prep data
train <- getData(evalu, 'train')     # training dataset
dev_test <- getData(evalu, 'known')  # test data from evaluationScheme of type KNOWN
test <- getData(evalu, 'unknown')    # unknown dataset used for RMSE / model evaluation

We tested recommender algorithms with both recommenderlab and sparklyr to see which performed best on our recommender data, using the user-item matrix and train/test splits created above.
In our first example, user-based collaborative filtering (UBCF) is used to create recommendations with the recommenderlab package in R. We start by training our recommender on the training set, normalizing the data with Z-scores and using cosine similarity for comparisons.
We then create predictions on the known test set, with ratings as our prediction output. It is important to set a floor and ceiling, as predictions sometimes fall outside of our 1-5 rating scale.
Finally, we calculated the prediction accuracy against the unknown test data.
# using recommender lab, create UBCF recommender with z-score normalized data using cosine similarity
UB <- Recommender(getData(evalu, "train"), "UBCF",
param=list(normalize = "Z-score",method="Cosine"))
# create rating predictions and store
p <- predict(UB, getData(evalu, "known"), type="ratings")
# set floor and ceiling for ratings that fall outside scale
p@data@x[p@data@x[] < 1] <- 1
p@data@x[p@data@x[] > 5] <- 5
# calculate the prediction accuracy based on our test data
UB_acc <- calcPredictionAccuracy(p, getData(evalu, "unknown"))

This is the same process as above, except this time we use item-based collaborative filtering (IBCF) to create our recommendations.
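Since the IBCF code mirrors the block above, a brief sketch of that step (reusing the same evaluation scheme) is shown here; the object names are illustrative.
# item-based counterpart, mirroring the UBCF steps above
IB <- Recommender(getData(evalu, "train"), "IBCF",
                  param = list(normalize = "Z-score", method = "Cosine"))
# predict ratings on the known test data and clamp to the 1-5 scale
p_ib <- predict(IB, getData(evalu, "known"), type = "ratings")
p_ib@data@x[p_ib@data@x[] < 1] <- 1
p_ib@data@x[p_ib@data@x[] > 5] <- 5
IB_acc <- calcPredictionAccuracy(p_ib, getData(evalu, "unknown"))
Combining the two accuracy vectors (for example with rbind(UB_acc, IB_acc)) produces the comparison shown below.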
After our analysis, we see that user-based CF outperforms item-based CF on all error metrics. Considering the size of our data set, the RMSE is relatively low.
##             RMSE      MSE       MAE
## UB_acc 1.337380 1.788585 0.9812829
## IB_acc 1.473754 2.171951 1.1263370
Due to the size of our data, we chose to use Spark in R to avoid input/output (I/O) bottleneck issues and maximize the speed of our recommender algorithm calculations.
We initiated a local connection with Spark (v2.4.3). Our Yelp data was loaded into Spark tables for training and testing purposes. We uploaded the same training and test splits used above to minimize variance in our comparisons.
# configure spark connection
config <- spark_config()
config$spark.executor.memory <- "8G"
config$spark.executor.cores <- 2
config$spark.executor.instances <- 3
config$spark.dynamicAllocation.enabled <- "false"
# initiate connection
sc <- spark_connect(master = "local", config=config, version = "2.4.3")
# unhash to verify version:
# spark_version(sc)
# select data for spark and create spark table
spark_train <- as(train,"data.frame")
spark_test<- as(test,"data.frame")
spark_train <- sdf_copy_to(sc, spark_train, "spark_train", overwrite = TRUE)
spark_test <- sdf_copy_to(sc, spark_test, "spark_test", overwrite = TRUE)
# Transform features
spark_train <- spark_train %>%
ft_string_indexer(input_col = "user", output_col = "user_index") %>%
ft_string_indexer(input_col = "item", output_col = "item_index") %>%
sdf_register("spark_train")
spark_test <- spark_test %>%
ft_string_indexer(input_col = "user", output_col = "user_index") %>%
ft_string_indexer(input_col = "item", output_col = "item_index") %>%
sdf_register("spark_test")Once connected, we applied the alternating least squares (ALS) for our recommender predictions.
# build model using user/business/ratings
als_fit <- ml_als(spark_train,
max_iter = 5,
nonnegative = TRUE,
rank = 2,
rating_col = "rating",
user_col = "user_index",
item_col = "item_index")
# predict from the model for the training data
als_predict_train <- ml_predict(als_fit, spark_train) %>% collect()
als_predict_test <- ml_predict(als_fit, spark_test) %>% collect()
# Remove NaN (result of test/train splits - not data)
als_predict_train <- als_predict_train[!is.na(als_predict_train$prediction), ]
als_predict_test <- als_predict_test[!is.na(als_predict_test$prediction), ]
# Set floor/ceiling for predictions
als_predict_train$prediction[als_predict_train$prediction < 1] <- 1
als_predict_train$prediction[als_predict_train$prediction > 5] <- 5
als_predict_test$prediction[als_predict_test$prediction < 1] <- 1
als_predict_test$prediction[als_predict_test$prediction > 5] <- 5
# View results
als_predict_test %>% head() %>% kable() %>% kable_styling()

| user | item | rating | user_index | item_index | prediction |
|---|---|---|---|---|---|
| 8004 | 2697 | 4 | 1763 | 12 | 4.619310 |
| 463 | 2697 | 5 | 16 | 12 | 3.483246 |
| 2272 | 2697 | 4 | 138 | 12 | 3.857820 |
| 2898 | 2697 | 4 | 243 | 12 | 5.000000 |
| 1677 | 2697 | 4 | 180 | 12 | 4.086490 |
| 3017 | 2697 | 5 | 195 | 12 | 4.077936 |
Our ALS calculations for RMSE, MSE, and MAE can be viewed below:
# Calculate RMSE/MSE/MAE
als_mse_train <- mean((als_predict_train$rating - als_predict_train$prediction)^2)
als_rmse_train <- sqrt(als_mse_train)
als_mae_train <- mean(abs(als_predict_train$rating - als_predict_train$prediction))
als_mse_test <- mean((als_predict_test$rating - als_predict_test$prediction)^2)
als_rmse_test <- sqrt(als_mse_test)
als_mae_test <- mean(abs(als_predict_test$rating - als_predict_test$prediction))

| type | rmse | mse | mae |
|---|---|---|---|
| ALS_train | 0.6942522 | 0.4819861 | 0.5052867 |
| ALS_test | 1.3477505 | 1.8164314 | 1.0831787 |
The ml_recommend function allows us to see the top-n recommendations for each user or item. Below, we use this function and filter the output to show the top 10 restaurant recommendations for a selected user.
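A hedged sketch of how this lookup can be done is shown below; the specific user_index value is illustrative, and the exact output columns depend on the columns used to fit the ALS model.
# top 10 item recommendations per user, then filter to one example user
top_recs <- ml_recommend(als_fit, type = "items", n = 10)
top_recs %>%
  filter(user_index == 16) %>%   # illustrative user_index value
  collect() %>%
  kable() %>% kable_styling()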
Through this project, we took a comprehensive look at the different recommender methods we learned this semester. We built a realRatingMatrix in recommenderlab and ran several algorithms on our data. We then compared this approach to running the training and test data in Spark using sparklyr's ALS algorithm. We found that the user-based recommender algorithm performed the best and had the lowest RMSE, although our ALS calculations performed very similarly. Our item-based recommender produced the highest error scores.
Our transition to sparklyr showed us how effective distributed computing engines can be for large datasets. Our training and prediction speeds significantly improved when using Spark, even over a local connection. ALS in sparklyr was the clear winner for efficiency.
The size of our data significantly limited our performance with certain packages in R. Functions in recommenderlab took roughly 15 minutes to run, compared to roughly 2 minutes with sparklyr. The Spark approach would have been able to handle our full data set, whereas our personal computers lacked the memory to rely solely on recommenderlab. However, sparklyr lacks the built-in functions that recommenderlab provides for building and evaluating recommender algorithms.
For future work, we recommend performing natural language processing on the review text sentiment and analyzing the term frequency of our categories to see how these variables could improve our recommendations. We would also benefit from using data processing engines such as Spark for all of our future large-scale recommender calculations.