Our project used data from Kaggle’s 2013 Yelp Challenge, which included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. The data covers user reviews, ratings, and check-in data for a wide range of businesses.
Data was acquired and transformed in the preprocessing.R file located within our repository's final-project folder. The source data was provided as multi-record JSON files, meaning each file is a collection of JSON records. We used the stream_in function from the jsonlite package, which parses JSON data line by line, to read the files from the data folder of our repository. The collections included three large files covering Yelp businesses, users, and reviews.
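A minimal sketch of this ingestion step is shown below; the file names are illustrative assumptions rather than the exact names used in preprocessing.R.
# sketch of the JSON ingestion step (file names are assumed)
library(jsonlite)
business <- stream_in(file("data/yelp_business.json"))
users    <- stream_in(file("data/yelp_user.json"))
reviews  <- stream_in(file("data/yelp_review.json"))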
Once obtained, we prepared our data for our recommender system using the following transformations:
We chose to limit the scope of our recommender system to businesses with tags related to food and beverages. There were originally 508 unique category tags listed within our business data, from which we manually selected 112 targeted categories to subset our data.
We applied additional transformations to remove unnecessary data. There were 1,224 businesses in our data that were permanently closed, accounting for 9.8% of all businesses; these were removed from our data. We also removed 3 businesses in our data set located outside of Arizona.
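A rough sketch of these filters follows; the column names (categories, open, state) reflect the standard Yelp schema, and food_categories stands in for our manually curated list of 112 tags — both are assumptions about the actual code in preprocessing.R.
library(dplyr)
# food_categories: hypothetical character vector holding the 112 curated tags
business_sub <- business %>%
  # keep businesses tagged with at least one food/beverage category
  filter(sapply(categories, function(tags) any(tags %in% food_categories))) %>%
  # drop permanently closed businesses and those outside Arizona
  filter(open == TRUE, state == "AZ")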
As a result of these transformations, our business data was reduced to 4,828 unique businesses. This was further limited to 4,332 after randomly sampling our user data. The output can be previewed below:
We then subset our review data to the selected food and beverage businesses, which dropped our review data from 229,907 to 165,823 reviews. We later applied another filter to keep only reviews from 10,000 randomly sampled users, further reducing the reviews to 44,494 observations. Our review data can be previewed in two parts below:
Next, we applied a similar filter to our user data, keeping only users who reviewed our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations. Due to processing constraints in R, we chose to randomly sample 10,000 users from these unique profiles.
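A sketch of this subsetting and sampling is shown below, assuming the Yelp business_id and user_id key columns; the object names are illustrative.
set.seed(1)
# keep only reviews for the selected food/beverage businesses
reviews_sub <- reviews %>%
  filter(business_id %in% business_sub$business_id)
# restrict users to those with in-scope reviews, then randomly sample 10,000
users_sub <- users %>%
  filter(user_id %in% reviews_sub$user_id) %>%
  sample_n(10000)
# keep only reviews written by the sampled users
reviews_sub <- reviews_sub %>%
  filter(user_id %in% users_sub$user_id)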
The data frame preview below shows aggregated user data for all reviews an individual user provided on Yelp within our data selection.
Lastly, we created our main data frame by merging the business and review data on business ID. This data frame serves as the source of data for our recommender algorithms. The unique user and business character keys were simplified to numeric user/item identifiers.
This data frame will be referenced later on when building our recommender matrices and algorithms.
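A minimal sketch of this merge and key simplification is shown below; the business columns kept in the join and the factor-based numeric IDs are assumptions, while the userID and itemID names match the columns used in the matrix-building code later on.
# join reviews to the selected businesses and add compact numeric user/item IDs;
# only non-colliding business columns are carried over (illustrative selection)
df <- reviews_sub %>%
  inner_join(select(business_sub, business_id, name, categories),
             by = "business_id") %>%
  mutate(userID = as.numeric(factor(user_id)),
         itemID = as.numeric(factor(business_id)))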
We converted our raw ratings data into a user-item matrix to train and test our recommender system algorithms. The matrix was stored as a realRatingMatrix from the recommenderlab package for later processing.
# spread data from long to wide format
matrix_data <- df %>% select(userID, itemID, stars) %>% spread(itemID, stars)
# set row names to userid
rownames(matrix_data)<-matrix_data$userID
# remove userid from columns
matrix_data <-matrix_data %>% select(-userID)
# randomize dataframe
set.seed(1)
matrix_data <- matrix_data[sample(nrow(matrix_data)),]
# convert to matrix
ui_mat <- matrix_data %>% as.matrix()
# store matrix as realRatingMatrix
ui_mat <- as(ui_mat, "realRatingMatrix")

Our data was split into training and test sets for evaluating both recommender algorithms. We used five random train/test splits (k = 5), retaining 80% of the data for training and 20% for testing.
# evaluation method with 80% of data for train and 20% for test
set.seed(1)
evalu <- evaluationScheme(ui_mat, method="split", train=0.8, given=1, goodRating=4, k=5)
# prep data
train <- getData(evalu, 'train')     # training dataset
dev_test <- getData(evalu, 'known')  # test data from evaluationScheme of type KNOWN
test <- getData(evalu, 'unknown')    # unknown dataset used for RMSE / model evaluation

We tested recommender algorithms with both recommenderlab and sparklyr to see which performed best on our recommender data, using the user-item matrix and train/test splits created above.
In our first example, user-based collaborative filtering (UBCF) is used to create recommendations with the recommenderlab package in R. We start by training our recommender on the training set, normalizing the data with Z-scores and using cosine similarity for comparisons.
We then create predictions on the known test set, with ratings as our prediction output. It is important to set a floor and ceiling, as predictions sometimes fall outside of our 1-5 rating scale.
Finally, we calculated the prediction accuracy against the unknown test data.
# using recommender lab, create UBCF recommender with z-score normalized data using cosine similarity
UB <- Recommender(getData(evalu, "train"), "UBCF",
param=list(normalize = "Z-score",method="Cosine"))
# create rating predictions and store
p <- predict(UB, getData(evalu, "known"), type="ratings")
# set floor and ceiling for ratings that fall outside scale
p@data@x[p@data@x[] < 1] <- 1
p@data@x[p@data@x[] > 5] <- 5
# calculate the prediction accuracy based on our test data
UB_acc <- calcPredictionAccuracy(p, getData(evalu, "unknown"))

This is the same process as above, except this time we use item-based collaborative filtering (IBCF) to create our recommendations.
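Since the IBCF code mirrors the block above, a brief sketch of that step (reusing the same evaluation scheme) is shown here; the object names are illustrative.
# item-based counterpart, mirroring the UBCF steps above
IB <- Recommender(getData(evalu, "train"), "IBCF",
                  param = list(normalize = "Z-score", method = "Cosine"))
# predict ratings on the known test data and clamp to the 1-5 scale
p_ib <- predict(IB, getData(evalu, "known"), type = "ratings")
p_ib@data@x[p_ib@data@x[] < 1] <- 1
p_ib@data@x[p_ib@data@x[] > 5] <- 5
IB_acc <- calcPredictionAccuracy(p_ib, getData(evalu, "unknown"))
Combining the two accuracy vectors (for example with rbind(UB_acc, IB_acc)) produces the comparison shown below.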
After our analysis, we see that user-based CF outperforms item-based CF on all error metrics. Considering the size of our data set, the RMSE is relatively low.
##             RMSE      MSE       MAE
## UB_acc 1.337380 1.788585 0.9812829
## IB_acc 1.473754 2.171951 1.1263370
Due to the size of our data, we chose to use Spark in R to avoid input/output (I/O) bottleneck issues and maximize the speed of our recommender algorithm calculations.
We initiated a local connection with Spark (v2.4.3). Our Yelp data was loaded into Spark tables for training and testing purposes. We uploaded the same training and test splits used above to minimize variance in our comparisons.
# configure spark connection
config <- spark_config()
config$spark.executor.memory <- "8G"
config$spark.executor.cores <- 2
config$spark.executor.instances <- 3
config$spark.dynamicAllocation.enabled <- "false"
# initiate connection
sc <- spark_connect(master = "local", config=config, version = "2.4.3")
# unhash to verify version:
# spark_version(sc)
# select data for spark and create spark table
spark_train <- as(train,"data.frame")
spark_test<- as(test,"data.frame")
spark_train <- sdf_copy_to(sc, spark_train, "spark_train", overwrite = TRUE)
spark_test <- sdf_copy_to(sc, spark_test, "spark_test", overwrite = TRUE)
# Transform features
spark_train <- spark_train %>%
ft_string_indexer(input_col = "user", output_col = "user_index") %>%
ft_string_indexer(input_col = "item", output_col = "item_index") %>%
sdf_register("spark_train")
spark_test <- spark_test %>%
ft_string_indexer(input_col = "user", output_col = "user_index") %>%
ft_string_indexer(input_col = "item", output_col = "item_index") %>%
sdf_register("spark_test")Once connected, we applied the alternating least squares (ALS) for our recommender predictions.
# build model using user/business/ratings
als_fit <- ml_als(spark_train,
max_iter = 5,
nonnegative = TRUE,
rank = 2,
rating_col = "rating",
user_col = "user_index",
item_col = "item_index")
# predict from the model for the training data
als_predict_train <- ml_predict(als_fit, spark_train) %>% collect()
als_predict_test <- ml_predict(als_fit, spark_test) %>% collect()
# Remove NaN (result of test/train splits - not data)
als_predict_train <- als_predict_train[!is.na(als_predict_train$prediction), ]
als_predict_test <- als_predict_test[!is.na(als_predict_test$prediction), ]
# Set floor/ceiling for predictions
als_predict_train$prediction[als_predict_train$prediction < 1] <- 1
als_predict_train$prediction[als_predict_train$prediction > 5] <- 5
als_predict_test$prediction[als_predict_test$prediction < 1] <- 1
als_predict_test$prediction[als_predict_test$prediction > 5] <- 5
# View results
als_predict_test %>% head() %>% kable() %>% kable_styling()

| user | item | rating | user_index | item_index | prediction |
|---|---|---|---|---|---|
| 8004 | 2697 | 4 | 1763 | 12 | 4.619310 |
| 463 | 2697 | 5 | 16 | 12 | 3.483246 |
| 2272 | 2697 | 4 | 138 | 12 | 3.857820 |
| 2898 | 2697 | 4 | 243 | 12 | 5.000000 |
| 1677 | 2697 | 4 | 180 | 12 | 4.086490 |
| 3017 | 2697 | 5 | 195 | 12 | 4.077936 |
Our ALS calculations for RMSE, MSE, and MAE can be viewed below:
# Calculate RMSE/MSE/MAE
als_mse_train <- mean((als_predict_train$rating - als_predict_train$prediction)^2)
als_rmse_train <- sqrt(als_mse_train)
als_mae_train <- mean(abs(als_predict_train$rating - als_predict_train$prediction))
als_mse_test <- mean((als_predict_test$rating - als_predict_test$prediction)^2)
als_rmse_test <- sqrt(als_mse_test)
als_mae_test <- mean(abs(als_predict_test$rating - als_predict_test$prediction))

| type | rmse | mse | mae |
|---|---|---|---|
| ALS_train | 0.6942522 | 0.4819861 | 0.5052867 |
| ALS_test | 1.3477505 | 1.8164314 | 1.0831787 |
The ml_recommend function allows us to see the top-n recommendations for each user or item. Below, we use this function and filter the output to show the top 10 restaurant recommendations for a selected user.
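A hedged sketch of how this lookup can be done is shown below; the specific user_index value is illustrative, and the exact output columns depend on the columns used to fit the ALS model.
# top 10 item recommendations per user, then filter to one example user
top_recs <- ml_recommend(als_fit, type = "items", n = 10)
top_recs %>%
  filter(user_index == 16) %>%   # illustrative user_index value
  collect() %>%
  kable() %>% kable_styling()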
Through this project, we took a comprehensive look at the different recommender methods we learned this semester. We built a realRatingMatrix in recommenderlab and ran several algorithms on our data. We then compared this approach to running the training and test data in Spark using sparklyr's ALS algorithm. We found that the user-based recommender algorithm performed the best and had the lowest RMSE, although our ALS calculations performed very similarly. Our item-based recommender produced the highest error scores.
Our transition to sparklyr showed us how effective distributed computing engines can be for large datasets. Our training and prediction speeds significantly improved when using Spark, even over a local connection. ALS in sparklyr was the clear winner for efficiency.
The size of our data significantly limited our performance with certain packages in R. Functions in recommenderlab took roughly 15 minutes to run, compared to roughly 2 minutes with sparklyr. The Spark approach would have been able to handle our full data set, whereas our personal computers lacked the memory to rely solely on recommenderlab. However, sparklyr lacks the built-in functions that recommenderlab provides for building and evaluating recommender algorithms.
For future work, we recommend performing natural language processing on the review text sentiment and analyzing the term frequency of our categories to see how these variables could improve our recommendations. We would also benefit from using data processing engines such as Spark for all of our future large-scale recommender calculations.