Assignment Instructions

Build out the system that you described in your final project planning document.

Introduction

My original objective for this project was to create a recommendation system that predicted Amazon beauty product ratings. However, the compact version of the Amazon - Ratings (Beauty Products) dataset (available from Kaggle) lacked metadata, so it did not contain the names of the beauty products.

I attempted to get around this issue by using the full version of the dataset (available from UCSD), which contains the metadata that I needed, but the files were too large to process locally. This forced me to switch to the Amazon Magazine Subscription Ratings dataset, which is small enough to process locally and also contains the metadata that I needed.

Project Objective

The objective of this project is to build a recommendation system that predicts magazine subscription ratings. The dataset for this project is considerably larger than those of previous projects, so I will leverage Spark to store the data. In order to integrate Spark with R, I will use the R sparklyr package. For the recommendation model, I will use the Alternating Least Squares (ALS) matrix factorization algorithm.

Data Source

As noted in the introduction section above, I will use the Amazon Magazine Subscription Ratings dataset made available by the University of California San Diego. The dataset was compiled in 2018, and is split into two JSON files: a reviews file and a metadata file. The reviews file contains 89,689 magazine subscription reviews, and the metadata file contains 3,493 products.

Data Manipulation

Pull in the datasets from GitHub

In order to make the JSON data files accessible outside of my local machine, I will store them in GitHub, and then ingest them using the jsonlite package.

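# Load the packages used in this report (plyr is loaded before dplyr to avoid masking problems).
library(plyr)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(sparklyr)

# Import the JSON data from GitHub.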
amazon_meta_source <- url('https://raw.githubusercontent.com/stephen-haslett/data612/612-final-project/amazon_magazine_meta.json')
amazon_ratings_source <- url('https://raw.githubusercontent.com/stephen-haslett/data612/612-final-project/amazon_magazine_ratings.json')

amazon_magazine_metadata <- do.call(rbind, lapply(paste(readLines(amazon_meta_source, warn = FALSE), collapse = ''), jsonlite::fromJSON))
amazon_magazine_ratings <- do.call(rbind, lapply(paste(readLines(amazon_ratings_source, warn = FALSE), collapse = ''), jsonlite::fromJSON))

Explore the dataset column names.

Metadata column names.

knitr::kable(names(amazon_magazine_metadata), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
  add_header_above(c('Metadata Column Names' = 1))
Metadata Column Names
x
category
description
also_buy
image
brand
also_view
details
main_cat
asin
rank
title

Ratings column names.

knitr::kable(names(amazon_magazine_ratings), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
  add_header_above(c('Ratings Column Names' = 1))
Ratings Column Names
x
overall
vote
verified
reviewTime
reviewerID
asin
reviewerName
reviewText
summary
unixReviewTime
style
image

Remove the columns that we don’t need.

There are quite a few columns in the datasets that we do not need. The only columns that we are interested in from the metadata set are the ‘asin’ (product ID), and the ‘title’ (product name) columns, so we can remove all the other columns. The same goes for the ratings set; we only need the asin, reviewerID, reviewerName, and overall (rating) columns.

# Select the columns we need from the metadata dataset.
metadata_refined <- select(amazon_magazine_metadata, asin, title)
# We only want items that have product titles, so remove all the rows that do not have a title.
metadata_refined <- metadata_refined[!is.na(metadata_refined$title),]

# Select the columns we need from the ratings dataset.
ratings_refined <- select(amazon_magazine_ratings, asin, reviewerID, reviewerName, overall)

Combine the metadata and ratings datasets.

Now that we have the columns that we need from both datasets, we can combine the datasets using a join. Both sets share the ‘asin’ column, so we can join on it. We use a right join, which keeps every product in the metadata set, and then drop the products that have no reviews. This will leave us with a table that contains both the user review data, and the item ids and product names.

magazine_ratings <- join(ratings_refined, metadata_refined, by = 'asin', type = 'right')

# Remove rows that are missing reviewerID values.
magazine_ratings <- magazine_ratings[!is.na(magazine_ratings$reviewerID),]

# Rename column names to make them more descriptive.
names(magazine_ratings)[1] <- 'item_id'
names(magazine_ratings)[2] <- 'user_id'
names(magazine_ratings)[3] <- 'user_name'
names(magazine_ratings)[4] <- 'rating'
names(magazine_ratings)[5] <- 'item_name'
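
To make the effect of the right join and the NA filter easier to see, here is a minimal sketch on made-up data. It uses base R's merge() with all.y = TRUE as a stand-in for the join above, so it is an analogue rather than the exact call used in this report. The data frames ratings_toy and metadata_toy and all of their values are hypothetical.

```r
# Toy stand-ins for ratings_refined and metadata_refined (all values are made up).
ratings_toy <- data.frame(
  asin = c('A1', 'A1', 'A2', 'A9'),
  reviewerID = c('U1', 'U2', 'U1', 'U3'),
  overall = c(5, 3, 4, 1),
  stringsAsFactors = FALSE
)
metadata_toy <- data.frame(
  asin = c('A1', 'A2', 'A3'),
  title = c('Magazine One', 'Magazine Two', 'Magazine Three'),
  stringsAsFactors = FALSE
)

# all.y = TRUE keeps every metadata row (a right join).
# The review for A9 has no metadata row, so it is dropped.
joined_toy <- merge(ratings_toy, metadata_toy, by = 'asin', all.y = TRUE)

# A3 has a title but no reviews, so its reviewerID is NA; remove that row.
joined_toy <- joined_toy[!is.na(joined_toy$reviewerID), ]

# Rename by column name instead of by position, so a change in column order cannot mislabel a column.
new_names <- c(asin = 'item_id', reviewerID = 'user_id', overall = 'rating', title = 'item_name')
names(joined_toy) <- unname(new_names[names(joined_toy)])

print(joined_toy)
```

Only the products that appear in both files survive, which is the same outcome as the right join followed by the reviewerID filter.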

Convert the item IDs and user IDs to numeric values.

The item_ids and user_ids are alphanumeric. When we create the model, we use sparklyr’s “ml_als()” function. The columns named by the function’s ‘user_col’ and ‘item_col’ parameters must contain integer values, so we need to convert the IDs to integers. Alphanumeric strings cannot be converted to integers directly, so we will first convert these values to factors, and then take the integer codes of the factor levels.

item_factor <- as.factor(magazine_ratings$item_id)
item_id <- as.integer(item_factor)
magazine_ratings$item_id <- item_id

user_factor <- as.factor(magazine_ratings$user_id)
user_id <- as.integer(user_factor)
magazine_ratings$user_id <- user_id

knitr::kable(head(magazine_ratings, 10), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
  add_header_above(c('Refined Ratings Dataset with Numeric Item and User IDs' = 5))
Refined Ratings Dataset with Numeric Item and User IDs
item_id user_id user_name rating item_name
2 3179 Amazon Customer 2 Natural Health
2 1650 Amazon Customer 1 Natural Health
2 3137 Bita Hunt 1 Natural Health
2 2678 kristina bogar 1 Natural Health
2 2152 K. Salinger, Holistic Nurse Practitioner 1 Natural Health
2 2127 Marce T. Hanson 5 Natural Health
2 2798 H. Lum 1 Natural Health
2 3320 David Allen Hazlewood 3 Natural Health
2 3928 Valerio Valentino 1 Natural Health
2 796 Farmer Jane 4 Natural Health
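
The factor-to-integer conversion can be illustrated on a few made-up IDs. The sketch below also keeps the factor levels as a lookup table so that the original IDs can be recovered later. The vector raw_ids and its values are hypothetical, and the lookup step is a suggestion rather than something done in this report.

```r
# Hypothetical alphanumeric IDs standing in for the real asin / reviewerID values.
raw_ids <- c('B002', 'B001', 'B002', 'B003')

# Converting the strings directly fails: the result is NA (with a warning).
direct <- suppressWarnings(as.integer(raw_ids))

# Going through a factor assigns one integer code per distinct ID.
id_factor <- as.factor(raw_ids)
int_ids <- as.integer(id_factor)

# The factor levels act as a lookup table from integer code back to the original ID.
id_lookup <- levels(id_factor)
recovered <- id_lookup[int_ids]

print(data.frame(raw_ids, int_ids, recovered))
```

Keeping id_lookup around would make it possible to translate a recommended integer item ID back into its original product ID.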

Distribution of Ratings

With the alterations to our dataset complete, we are now in a position to take a look at the ratings distribution. As we can see from the bar plot below, the majority of the ratings are high, with far fewer low ratings.

magazine_ratings %>% 
  ggplot(aes(rating)) +
  geom_bar(fill = 'darkgreen') +
  labs(title = 'Distribution of Ratings', y = 'Frequency', x = 'Ratings') +
  theme_minimal()
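
The same distribution can also be summarised numerically. The following is a minimal base R sketch on a made-up vector of ratings (toy_ratings is hypothetical, not taken from the dataset) that counts each rating value and computes the share of ratings that are 4 or higher.

```r
# Made-up ratings on the 1 to 5 scale, used only for illustration.
toy_ratings <- c(5, 5, 4, 5, 1, 3, 5, 4, 2, 5)

# Count each rating value; fixing the levels keeps unused values in the table.
rating_counts <- table(factor(toy_ratings, levels = 1:5))

# Express the counts as percentages of all ratings.
rating_share <- round(100 * prop.table(rating_counts), 1)

print(rbind(count = rating_counts, percent = rating_share))

# Proportion of ratings that are "high" (4 or 5 stars).
high_share <- mean(toy_ratings >= 4)
print(high_share)
```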

Copy the data over to Spark.

Now that we have a table that contains only the values that we need, we can copy the data over to Spark. Because the dataset was too large for my local machine to handle, I needed to reduce the number of records sent to Spark. To accomplish this, I selected the first 2000 records.

# Select the first 2000 records from the dataset to accommodate the limited resources on my local machine.
magazine_ratings <- head(magazine_ratings, 2000)

# Connect to Spark and copy over the dataset.
sc <- spark_connect(master = 'local', version = '3.0.0')
ratings <- sdf_copy_to(sc, magazine_ratings, overwrite = TRUE)
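
One thing to note is that head() keeps the first 2000 rows in their existing order, and the preview of the dataset above suggests that the rows are grouped by item. A random sample is one possible alternative that would spread the selected rows across more items. The sketch below compares the two approaches on a made-up data frame (toy and its columns are hypothetical); it is not what was run for this report.

```r
set.seed(1099)

# Made-up data: 50 items with 100 rows each, stored grouped by item.
toy <- data.frame(item_id = rep(1:50, each = 100), row_id = 1:5000)

# Taking the first 2000 rows only reaches the first 20 items.
first_n <- head(toy, 2000)

# Sampling 2000 row indices without replacement draws from the whole table.
random_n <- toy[sample(nrow(toy), 2000), ]

print(length(unique(first_n$item_id)))
print(length(unique(random_n$item_id)))
```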

Split the data into test and training sets.

Now that the data is in Spark, we can split the dataset into training and test sets using a ratio of 75:25 (75% of the entire dataset for training, and 25% for testing).

partition <- ratings %>% sdf_partition(training = 0.75, test = 0.25, seed = 1099)
training <- partition$training
test <- partition$test
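
As a rough base R analogue of a weighted random split, the sketch below assigns each of 2000 row indices to a training or test group with probabilities 0.75 and 0.25. It only illustrates the idea and does not reproduce how Spark performs the split. With this kind of row-by-row assignment the group sizes are close to, but not exactly, 75% and 25%. The names n and assignment are illustrative.

```r
set.seed(1099)

n <- 2000  # number of rows to split

# Assign each row to a group independently, using the split weights as probabilities.
assignment <- sample(c('training', 'test'), size = n, replace = TRUE, prob = c(0.75, 0.25))

training_rows <- which(assignment == 'training')
test_rows <- which(assignment == 'test')

# The observed proportions are approximately, not exactly, 0.75 and 0.25.
print(table(assignment) / n)
```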

Create the Alternating Least Squares (ALS) model.

To create the ALS model, I used sparklyr’s ml_als() function. After running my predictions, I found that many of them had NaN values. This happens when the test set contains users or items that did not appear in the training set, so the model has no factors for them. Thankfully, the ml_als() function provides a parameter that takes care of this issue: “cold_start_strategy”. Setting this parameter to “drop” removes rows containing NaN prediction values from the dataframe.

als_model <- ml_als(training, rating_col = 'rating', user_col = 'user_id', item_col = 'item_id', cold_start_strategy = 'drop')

knitr::kable(summary(als_model), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
  add_header_above(c('ALS Model Summary' = 4))
ALS Model Summary
Length Class Mode
uid 1 -none- character
param_map 5 -none- list
rank 1 -none- numeric
recommend_for_all_items 1 -none- function
recommend_for_all_users 1 -none- function
item_factors 2 tbl_spark list
user_factors 2 tbl_spark list
user_col 1 -none- character
item_col 1 -none- character
prediction_col 1 -none- character
.jobj 2 spark_jobj environment
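
For readers unfamiliar with ALS, the following is a minimal base R sketch of the idea on a tiny made-up ratings matrix. It alternates between solving a small ridge regression for each user's factors (with the item factors held fixed) and for each item's factors (with the user factors held fixed). It is a simplified illustration and not the implementation behind ml_als(), which has its own defaults and regularisation details. All names here (R, U, V, k, lambda, solve_factors, rmse_observed) are assumptions made for the example.

```r
# Toy ratings matrix (made-up values): rows are users, columns are items, NA is unrated.
R <- matrix(c(5, 4, NA, 1,
              4, NA, 2, 1,
              1, 1, 5, NA,
              NA, 2, 4, 5),
            nrow = 4, byrow = TRUE)

set.seed(1099)
k <- 2         # number of latent factors (the "rank" of the model)
lambda <- 0.1  # ridge penalty that keeps the factors small

U <- matrix(runif(nrow(R) * k), nrow(R), k)  # user factors, random start
V <- matrix(runif(ncol(R) * k), ncol(R), k)  # item factors, random start

# Ridge regression for one user (or item), using only the observed ratings.
solve_factors <- function(ratings, fixed, lambda) {
  seen <- !is.na(ratings)
  A <- fixed[seen, , drop = FALSE]
  solve(crossprod(A) + lambda * diag(ncol(fixed)), crossprod(A, ratings[seen]))
}

# Root mean squared error over the observed entries only.
rmse_observed <- function(R, U, V) {
  sqrt(mean((R - U %*% t(V))^2, na.rm = TRUE))
}

rmse_start <- rmse_observed(R, U, V)

for (iter in 1:20) {
  # Update every user's factors with the item factors held fixed.
  for (u in seq_len(nrow(R))) U[u, ] <- solve_factors(R[u, ], V, lambda)
  # Update every item's factors with the user factors held fixed.
  for (i in seq_len(ncol(R))) V[i, ] <- solve_factors(R[, i], U, lambda)
}

rmse_end <- rmse_observed(R, U, V)
print(c(start = rmse_start, end = rmse_end))

# Predicted ratings for every user-item pair, including the unrated ones.
print(round(U %*% t(V), 2))
```

The product of the two factor matrices fills in the NA cells, which is how a factorisation model produces predictions for items a user has not rated.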

Perform predictions on the data.

predictions <- als_model$.jobj %>%
  invoke('transform', spark_dataframe(test)) %>%
  collect()

knitr::kable(head(predictions, 50), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'), fixed_thead = TRUE) %>%
  add_header_above(c('Prediction Results' = 6))
Prediction Results
item_id user_id user_name rating item_name prediction
12 1069 African Daisy 4 CosmoGIRL! (1-year) 3.9345741
12 2461 A. Denise 5 CosmoGIRL! (1-year) 4.9182177
12 3815 Amazon Customer 2 CosmoGIRL! (1-year) 1.9672871
12 54 Amazon Customer 5 CosmoGIRL! (1-year) 4.9182177
12 2430 Cari 5 CosmoGIRL! (1-year) 4.9182177
12 2539 Joseph Diorio 5 CosmoGIRL! (1-year) 4.9182177
12 3641 Stephanie Pressman 5 CosmoGIRL! (1-year) 4.9182177
12 4090 LeoraKate 4 CosmoGIRL! (1-year) 3.9345741
12 1505 Ashley 5 CosmoGIRL! (1-year) 4.9182177
12 1842 Jane Christenson 5 CosmoGIRL! (1-year) 4.9182177
12 3188 Jennifer Lux 5 CosmoGIRL! (1-year) 4.9182177
12 1667 H. Hok 5 CosmoGIRL! (1-year) 4.9182177
12 2198 Emily 5 CosmoGIRL! (1-year) 4.9182177
12 2721 AgilitynHorseCrazed 3 CosmoGIRL! (1-year) 2.9509304
12 3589 Shanno 2 CosmoGIRL! (1-year) 1.9672871
13 909 Annamarie Rutschke 5 Diabetic Cooking 4.9509950
13 2736 Em's mom 5 Diabetic Cooking 4.9509950
13 3836 Mary S. 4 Diabetic Cooking 3.9607959
14 1142 M. McMahon 5 Selling Power 4.9174910
18 150 Peplon423 5 Ladies Home Journal (1-year) 4.9270101
18 3769 Constance Rowell 3 Ladies Home Journal (1-year) 2.9562063
18 3626 Heather=..= 2 Ladies Home Journal (1-year) 1.9708043
18 3883 PPencle 3 Ladies Home Journal (1-year) 2.9562063
18 678 LG 3 Ladies Home Journal (1-year) 2.9562063
18 2486 GP The Engineer 4 Ladies Home Journal (1-year) 3.9416087
18 3406 JFK 5 Ladies Home Journal (1-year) 4.9270101
37 2570 Bosse 1 Brio 0.9870335
37 4127 June 4 Brio 3.9481342
37 1041 S. Thompson 5 Brio 4.9351678
37 3270 Steve Slater 3 Brio 2.9611008
38 3110 Martin 1 Camcorder & Computer Video 0.9825760
46 393 Jim Proulx 5 Rev! (Pastoral Resource) 4.8877940
50 2249 rowena gibson 5 Vermont Magazine 4.9551368
50 1856 Constance B. Andrews 5 Vermont Magazine 4.9551368
50 2572 Mary M. Miller 5 Vermont Magazine 4.9551368
52 4100 Doost 4 Current History 3.9559941
52 1110 C. M. Wood 5 Current History 4.9449925
56 513 J. Cooper 5 Australian Patchwork & Quilting 4.9523754
6 295 Patty D 1 PC World 0.9808766
6 1101 Deimos 3 PC World 2.9426303
6 1545 WAC 5 PC World 4.9043832
6 2217 Charlie Spivey 5 PC World 4.9043832
6 3273 Frank 4 PC World 3.9235065
6 3285 Old-and-Wise 4 PC World 3.7350338
6 695 sarmad 5 PC World 4.9043832
6 1455 CSKapper 5 PC World 4.9043832
6 1562 Gregory 1 PC World 0.9808766
6 1874 Lori_Mac 4 PC World 3.9235065
6 3292 gorillazfan249 4 PC World 3.9235065
6 3333 Steve Wilson 1 PC World 0.9808766
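
A natural next step would be to summarise the prediction error with a single number. The sketch below computes the root mean squared error (RMSE) and mean absolute error (MAE) in base R for five (rating, prediction) pairs copied from the prediction results above. It only demonstrates the calculation; it was not part of the original analysis, and five rows are not a measure of the model's overall accuracy.

```r
# Five (rating, prediction) pairs copied from the prediction results table.
results <- data.frame(
  rating = c(4, 5, 2, 3, 1),
  prediction = c(3.9345741, 4.9182177, 1.9672871, 2.9509304, 0.9870335)
)

# Difference between the actual rating and the predicted rating for each row.
errors <- results$rating - results$prediction

# RMSE penalises large errors more heavily; MAE is the average absolute error.
rmse <- sqrt(mean(errors^2))
mae <- mean(abs(errors))

print(c(rmse = rmse, mae = mae))
```

The same two lines could be applied to the full collected predictions data frame by using its rating and prediction columns.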

Conclusion

In this project, we built a recommendation system using the ALS model and an Amazon magazine ratings dataset. The first challenge was with the dataset. The dataset came in two separate files: one contained the ratings, and the other contained the product metadata. Because we needed the product names (which were contained in the metadata file) as well as the user ratings, we needed to join the data on a column that was common to both sets. Both sets contained the ‘asin’ (product ID) column, so we were able to join on this column.

The next challenge with the dataset was the fact that the product IDs and user IDs were alphanumeric, and the function that we used to create the model only accepts integer values for these columns. To get around this issue, we needed to convert these values to integers. Alphanumeric characters cannot be directly converted to numeric values, so we first had to convert them to factors, and then to integers. This allowed us to build the model using sparklyr’s ml_als() function.

Additionally, running Spark locally proved to be problematic. Due to limited computational resources, I had to greatly reduce the size of the dataset in order for the model to run. Given more time, I would have liked to run Spark via Amazon EMR.

Finally, the recommender system is missing a UI. With more time, I would have liked to build a UI that allowed users to interact with the system in order to get magazine subscription recommendations. My plan was to use Shiny for this purpose, but I am not familiar with setting up UIs using Shiny, and did not have the time to experiment with it.