Overview

Our project used data from Kaggle’s 2013 Yelp Challenge. This challenge included a subset of Yelp data from the metropolitan area of Phoenix, Arizona. Our data takes into account user reviews, ratings, and check-in data for a wide-range of businesses.

Data Aquisition & Transformations

Data was acquired and transformed in the preprocessing.R file located within our repositories final-project folder. Our data source was provided as multiarray Json files, meaning each file is a collection of json data. We used stream_in function, which parses json data line-by-line from the data folder of our repository. The collections included three, large data for Yelp businesses, users, and reviews.

Once obtained, we prepared our data for our recommender system using the following transformations:

Business

We choose to limit the scope to our recommender system to only businesses with tags related to food and beverages. There were originally 508 unique category tags listed within our business data. We manually filtered 112 targeted categories to subset our data.

We applied additional transformation to remove unnecessary data. There were 1,224 business in our data that were permanently closed. These companies accounted for 9.8% of all businesses, which were subsequently removed from our data. There were also 3 businesses in our data set from outside of AZ that we also removed.

As a result of our transformations, our recommender data was shortened 4,828 unique businesses. This was further limited to 4,332 after randomly sampling our user-data. The output of which can be previewed below:

Review

We subset our review data from the subset of food and beverage businesses. This dropped our review data from 229,907 to 165,823 reviews. We later applied another filter to the data to only use reviews from 10,000 randomly sampled users. This further decreases reviews to 44,494 observations. Our review data can be previewed in two parts below:

User

Next, we applied a similar filter to users to subset our data based on only our selected businesses. This decreased our user data from 43,873 to 35,268 distinct user_id observations. Do to processing constraints in R, we choose to randomly sample 10,000 users from these unique profiles.

The data frame preview below shows aggregate user data for all reviews an individual user provided for yelp within our data selection.

Merged Dataframe

Last, we created our main data frame by merging business and reviews on Business_ID. This data frame will serve as the source of data for our recommender algorithms. The user and business unique keys were simplified from characters to numeric user/item identifiers.

This data frame will be referenced later on when building our recommender matrices and algorithms.

Recommender Algorithms

We tested recommender algorithms using recommenderlab and sparklyr to see which performed the best on our recommender system data. To test the algorithms, we first had to create a user-item matrix and then split our data into training and test sets.

RecommenderLab

Matrix Building

We converted our raw ratings data into a user-item matrix to test and train our subsequent recommender system algorithms. The matrix was saved as a realRatingMatrix for processing purposes later on using the recommenderlab package.

# spread data from long to wide format
matrix_data <- df %>% select(userID, itemID, stars) %>% spread(itemID, stars)
# set row names to userid
rownames(matrix_data) <- matrix_data$userID
# remove userid from columns
matrix_data <- matrix_data %>% select(-userID)
# convert to matrix
ui_mat <- matrix_data %>% as.matrix()
# store matrix as realRatingMatrix
ui_mat <- as(ui_mat, "realRatingMatrix")
# view matrix data

head(matrix_data)

Binary Ratings Matrix

  • In the approach we will work with Binary data set and see how Business can be Recommended for the Given Business.

Binary Rating Matrix

  • I first convert the ratings into a binary format to keep things simple. ratings of 4 and 5 are
  • mapped to 1, representing likes, and ratings of 3 and below are mapped to 0, representing dislikes.
# Columns are Busienss ID and Row is User_id
dat_binaryRatingMatrix <- df[, c(11, 1, 13)] %>% mutate(Type = case_when(stars == 
    1 ~ 0, stars == 2 ~ 0, stars == 3 ~ 0, stars == 4 ~ 1, stars == 5 ~ 1, TRUE ~ 
    0)) %>% .[c(1, 2, 4)] %>% as("binaryRatingMatrix")

# Creation of Sample dataset. for Evalution Scheme
evalu_a2 <- evaluationScheme(dat_binaryRatingMatrix, method = "split", train = 0.9, 
    given = 1)

# Prep data
ratings_train <- getData(evalu_a2, "train")  # Training Dataset 
ratings_test_known <- getData(evalu_a2, "known")  # Test data from evaluationScheme of type KNOWN
ratings_test_unknown <- getData(evalu_a2, "unknown")  # Unknow datset used for RMSE / model

head(as(dat_binaryRatingMatrix[1:5, 1:3], "matrix"))
FALSE                        --5jkZ3-nUPZxUvtcbr8Uw --BlvDO_RG2yElKu9XA1_g
FALSE --lMCM6K8-9NTvPlbCMXEA                  FALSE                  FALSE
FALSE --LzFD0UDbYE-Oho3AhsOg                  FALSE                  FALSE
FALSE --M-cIkGnH1KhnLaCOmoPQ                  FALSE                  FALSE
FALSE -01H9S7YxFrhRgNdvxmaVQ                  FALSE                  FALSE
FALSE -06LYbA4Qm_9E83KNT1Jrg                  FALSE                  FALSE
FALSE                        -_JBgygYYD_UkuD-GVTp6A
FALSE --lMCM6K8-9NTvPlbCMXEA                  FALSE
FALSE --LzFD0UDbYE-Oho3AhsOg                  FALSE
FALSE --M-cIkGnH1KhnLaCOmoPQ                  FALSE
FALSE -01H9S7YxFrhRgNdvxmaVQ                  FALSE
FALSE -06LYbA4Qm_9E83KNT1Jrg                  FALSE

Evaluating the Model

  • In this section we are trying to see how “POPULAR” method and User Based Collaborative Filtering UBCF can recommend the top n items.
  • Its also important to note that Binary ratings are not well evaluated with the help of Precision and Recall .
# creation of recommender model based on ubcf
Rec.ubcf <- Recommender(ratings_train, "UBCF")
# Creating POPULAR recommender model for comparison
Rec.pop <- Recommender(ratings_train, "POPULAR")
pred_ubcf2 <- readRDS("pred_ubcf2.rds")
pred_ubcf3 <- readRDS("pred_ubcf3.rds")
pred_ubcf5 <- readRDS("pred_ubcf5.rds")
pred_ubcf10 <- readRDS("pred_ubcf10.rds")
# SAVE IT FOR LATTER USE


# making predictions on the test data set witj Top 2, 3, 5, and 10
pred_pop2 <- readRDS("pred_pop2.rds")
pred_pop3 <- readRDS("pred_pop3.rds")
pred_pop5 <- readRDS("pred_pop5.rds")
pred_pop10 <- readRDS("pred_pop10.rds")

# Evaluting the Result of UBCF
error.ubcf2 <- calcPredictionAccuracy(pred_ubcf2, ratings_test_unknown, given = 0)
error.ubcf3 <- calcPredictionAccuracy(pred_ubcf3, ratings_test_unknown, given = 0)
error.ubcf5 <- calcPredictionAccuracy(pred_ubcf5, ratings_test_unknown, given = 0)
error.ubcf10 <- calcPredictionAccuracy(pred_ubcf10, ratings_test_unknown, given = 0)

# Evaluting the Result of Popular predictor
error.pop2 <- calcPredictionAccuracy(pred_pop2, ratings_test_unknown, given = 0)
error.pop3 <- calcPredictionAccuracy(pred_pop3, ratings_test_unknown, given = 0)
error.pop5 <- calcPredictionAccuracy(pred_pop5, ratings_test_unknown, given = 0)
error.pop10 <- calcPredictionAccuracy(pred_pop10, ratings_test_unknown, given = 0)
# error.ibcf<-calcPredictionAccuracy(p.ibcf, ratings_test_unknown)

error <- rbind(error.ubcf2, error.ubcf3, error.ubcf5, error.ubcf10, error.pop2, 
    error.pop3, error.pop5, error.pop10) %>% cbind(TopN = c(2, 3, 5, 10, 2, 
    3, 5, 10), Type = c("UBCF", "UBCF", "UBCF", "UBCF", "POP", "POP", "POP", 
    "POP"))

Precision-Recall Below plot displays the Precision and Recall of the POPULAR and UBCF Model, We are looking for higher recall with maximum precision, and to our surprise POPULAR (Popular items/Business in our case) MODEL is doing well compare to UBCF. It could be due to the SAMPLE size.

as_data_frame(error) %>% ggplot(aes(recall, precision, color = Type)) + geom_line() + 
    geom_label(aes(label = TopN)) + labs(title = "Precision-Recall for Top N", 
    colour = "Model") + theme_grey(base_size = 14) + theme(axis.text.x = element_text(angle = 70, 
    hjust = 1))
FALSE Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
FALSE This warning is displayed once per session.
FALSE geom_path: Each group consists of only one observation. Do you need to
FALSE adjust the group aesthetic?

** ROC Curve TPR vs FPR** Below Graph is showing the how much is TPR i.e TRUE Positive (Valid Item/business is getting recommended ,which are more likely to be visited by users) vs same amount of FPR i.e not a relevant recommendation . In this case Our focus is to find Max TPR for any given level of FPR.

as_data_frame(error) %>% ggplot(aes(FPR, TPR, color = Type)) + geom_line() + 
    geom_label(aes(label = TopN)) + labs(title = "ROC Curve / TPR vs FPR", colour = "Model") + 
    theme_grey(base_size = 14) + theme(axis.text.x = element_text(angle = 70, 
    hjust = 1))
FALSE geom_path: Each group consists of only one observation. Do you need to
FALSE adjust the group aesthetic?

Predictions For a New User

He we will build the persona of User by visiting to a Business (business_id)

Function to find top 10 Business for the Given Business.

# Predictions For a New User
REC_TOP_BUSINESS <- function(bus_id = "usAsSV36QmUej8--yvN-dg") {
  

# Generating a prediction with the best performing model. 

# Creating an user visiting to a business :
# # customer_visit <- c("usAsSV36QmUej8--yvN-dg", # Food, Grocery--Phoenix
#                      "PzOqRohWw7F7YEPBz6AubA", # Food, Bagels, Delis, Restaurants--Glendale Az
#                      "JxVGJ9Nly2FFIs_WpJvkug") # Pizza, Restaurants--Scottsdale

customer_visit <- c(bus_id)


# put this order in a format that recommenderlab accept: As shown below
# --5jkZ3-nUPZxUvtcbr8Uw --BlvDO_RG2yElKu9XA1_g -_JBgygYYD_UkuD-GVTp6A .....
#                      0                      0                      0 .....

new_cust_rat_matrx <- df[,c(11,1,13)] %>%  # Select only needed variables
                    # Add a column of 1s an 0s
# Select Business Name
  select(business_id) %>% 
  unique() %>% 
# Add a 'value' column with 1's for Users Visited Business
  mutate(value = as.numeric(business_id %in% customer_visit)) %>% 
# Spread into sparse matrix format
  spread(key = business_id, value = value) %>% 
# Change to a matrix
  as.matrix() %>% 
# Convert to recommenderlab class 'binaryRatingsMatrix'
  as("binaryRatingMatrix")  


# Creating UBCF Model again:
recomm <- Recommender(getData(evalu_a2, 'train'),method  = "POPULAR")
recomm

#I can pass the Recommender and the made-up user's visits  to the predict function to create a top 10 recommendation list for the new customer.
pred <- predict(recomm, 
                newdata = new_cust_rat_matrx, 
                n       = 10)

pred_result <- data.frame(as(pred, 'list') )  %>% rename('business_id'=X1) %>% left_join(df[,c('business_id','categories','city','name')],by='business_id') %>% unique()

pred_result['MeanRate']<- 0

for (i in 1:10){
 pred_result$MeanRate[i] <- mean(review[which(review$business_id == pred_result$business_id[i]),'stars'])
}


pred_result[,c('business_id','MeanRate','categories','city','name')]
  
}
suppressWarnings(REC_TOP_BUSINESS("SvdlC39JGPI_Tj3pS0ruzw"))

Conclusion

Analysis:

BinaryMatrix can be best evaluated with the help of precision and recall, here Precision stands for predicting business which is more likely to get the visits. Precision shows how sensitive models are to false positives (i.e. recommending a business not very likely to be visited)

We noted the POPULAR Item method (POPULAR) model gave good results with high recall and the same level of precision. When compared with all the models for n = 1,2,3,10 for top n items, as number of top n increases we noted a decrease in the precision and an increase in recall, since recall is sensitive to false negatives (i.e. do not suggest a business which is highly likely to be visited), so in contrast we aim for higher recall with maximum precision.