Build out the system that you described in your final project planning document.
My original objective for this project was to create a recommendation system that predicted Amazon beauty product ratings. However, the compact version of the Amazon - Ratings (Beauty Products) dataset (available from Kaggle) was missing metadata, and thus did not contain the names of the beauty products.
I attempted to get around this issue by using the full version of the dataset (available from UCSD), which contained the metadata I needed, but the files were too large to run locally. This forced me to switch to a smaller dataset (Amazon Magazine Subscription Ratings) that was small enough to run locally and also contained the metadata I needed.
The objective of this project is to build a recommendation system that predicts magazine subscription ratings. The dataset for this project is considerably larger than those of previous projects, so I will leverage Spark to store the data. To integrate Spark with R, I will use the sparklyr package. For the recommendation model, I will use the Alternating Least Squares (ALS) matrix factorization algorithm.
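Concretely, ALS factors the user-item rating matrix into low-rank user and item factor matrices by minimizing a regularized squared error over the set of observed ratings $\mathcal{K}$, where $r_{ui}$ is the rating user $u$ gave item $i$, and $x_u$ and $y_i$ are the latent factor vectors. Each alternating step holds one factor matrix fixed and solves a closed-form least-squares problem for the other:

$$\min_{X,\,Y} \sum_{(u,\,i) \in \mathcal{K}} \left(r_{ui} - x_u^{\top} y_i\right)^2 + \lambda \left(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\right)$$

The factor dimension and the regularization weight $\lambda$ are exposed through ml_als()'s rank and reg_param arguments (defaults are used in this project).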
As noted in the introduction above, I will use the Amazon Magazine Subscription Ratings dataset made available by the University of California San Diego. The dataset was compiled in 2018 and is split into two JSON files: a reviews file and a metadata file. The reviews file contains 89,689 magazine subscription reviews, and the metadata file contains 3,493 products.
To make the JSON data files accessible outside the confines of my local machine, I will store them on GitHub and then ingest them using the jsonlite package.
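The chunks that follow rely on a handful of packages; a minimal setup, inferred from the functions used throughout this document, might look like this:

# Packages used throughout this project.
library(plyr)       # join() - loaded before dplyr to avoid masking issues.
library(dplyr)      # select() and the %>% pipe.
library(ggplot2)    # Ratings distribution plot.
library(sparklyr)   # Spark connection, sdf_* helpers, and ml_als().
library(knitr)      # kable() tables.
library(kableExtra) # kable_styling() and add_header_above().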
# Import the JSON data from GitHub.
amazon_meta_source <- url('https://raw.githubusercontent.com/stephen-haslett/data612/612-final-project/amazon_magazine_meta.json')
amazon_ratings_source <- url('https://raw.githubusercontent.com/stephen-haslett/data612/612-final-project/amazon_magazine_ratings.json')

# Read each file in full and parse it as a single JSON document.
amazon_magazine_metadata <- do.call(rbind, lapply(paste(readLines(amazon_meta_source, warn = FALSE), collapse = ''), jsonlite::fromJSON))
amazon_magazine_ratings <- do.call(rbind, lapply(paste(readLines(amazon_ratings_source, warn = FALSE), collapse = ''), jsonlite::fromJSON))
Metadata column names.
knitr::kable(names(amazon_magazine_metadata), format = 'html') %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
add_header_above(c('Metadata Column Names' = 1))
| Metadata Column Names |
|---|
| category |
| description |
| also_buy |
| image |
| brand |
| also_view |
| details |
| main_cat |
| asin |
| rank |
| title |
Ratings column names.
knitr::kable(names(amazon_magazine_ratings), format = 'html') %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
add_header_above(c('Ratings Column Names' = 1))
| Ratings Column Names |
|---|
| overall |
| vote |
| verified |
| reviewTime |
| reviewerID |
| asin |
| reviewerName |
| reviewText |
| summary |
| unixReviewTime |
| style |
| image |
There are quite a few columns in the datasets that we do not need. The only columns we are interested in from the metadata set are 'asin' (product ID) and 'title' (product name), so we can remove all the others. The same goes for the ratings set; we only need the asin, reviewerID, reviewerName, and overall (rating) columns.
# Select the columns we need from the metadata dataset.
metadata_refined <- select(amazon_magazine_metadata, asin, title)
# We only want items that have product titles, so remove all the rows that do not have a title.
metadata_refined <- metadata_refined[!is.na(metadata_refined$title),]
# Select the columns we need from the ratings dataset.
ratings_refined <- select(amazon_magazine_ratings, asin, reviewerID, reviewerName, overall)
Now that we have the columns we need from both datasets, we can combine them using a join. Both sets share the 'asin' column, so we can join on it. This will leave us with a table that contains the user review data alongside the item IDs and product names.
# Join the ratings and metadata on the shared 'asin' column.
magazine_ratings <- join(ratings_refined, metadata_refined, by = 'asin', type = 'right')
# Remove rows that are missing reviewerID values.
magazine_ratings <- magazine_ratings[!is.na(magazine_ratings$reviewerID),]
# Rename the columns to make them more descriptive.
names(magazine_ratings) <- c('item_id', 'user_id', 'user_name', 'rating', 'item_name')
The item_id and user_id values are alphanumeric. When we create the model with sparklyr's ml_als() function, the 'user_col' and 'item_col' values must be integers, so we need to convert them. To achieve this, we first convert the values to factors, and then convert the factors to integers.
# Convert the alphanumeric item and user IDs to integers via factors.
magazine_ratings$item_id <- as.integer(as.factor(magazine_ratings$item_id))
magazine_ratings$user_id <- as.integer(as.factor(magazine_ratings$user_id))
knitr::kable(head(magazine_ratings, 10), format = 'html') %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
add_header_above(c('Refined Ratings Dataset with Numeric Item and User IDs' = 5))
| item_id | user_id | user_name | rating | item_name |
|---|---|---|---|---|
| 2 | 3179 | Amazon Customer | 2 | Natural Health |
| 2 | 1650 | Amazon Customer | 1 | Natural Health |
| 2 | 3137 | Bita Hunt | 1 | Natural Health |
| 2 | 2678 | kristina bogar | 1 | Natural Health |
| 2 | 2152 | K. Salinger, Holistic Nurse Practitioner | 1 | Natural Health |
| 2 | 2127 | Marce T. Hanson | 5 | Natural Health |
| 2 | 2798 | H. Lum | 1 | Natural Health |
| 2 | 3320 | David Allen Hazlewood | 3 | Natural Health |
| 2 | 3928 | Valerio Valentino | 1 | Natural Health |
| 2 | 796 | Farmer Jane | 4 | Natural Health |
With the alterations to our dataset complete, we can now take a look at the ratings distribution. As the bar plot below shows, the majority of magazine subscriptions received high ratings, with relatively few items receiving low ratings.
magazine_ratings %>%
ggplot(aes(rating)) +
geom_bar(fill = 'darkgreen') +
labs(title = 'Distribution of Ratings', y = 'Frequency', x = 'Ratings') +
theme_minimal()
Now that we have a table containing only the values we need, we can copy the data over to Spark. Because the full dataset was too large for my local machine to handle, I reduced the number of records sent to Spark by selecting the first 2,000 records.
# Select the first 2000 records from the dataset to accommodate the limited resources on my local machine.
magazine_ratings <- head(magazine_ratings, 2000)
# Connect to Spark and copy over the dataset.
sc <- spark_connect(master = 'local', version = '3.0.0')
ratings <- sdf_copy_to(sc, magazine_ratings, overwrite = TRUE)
Now that the data is in Spark, we can split the dataset into training and test sets using a 75:25 ratio (75% of the entire dataset for training, and 25% for testing).
partition <- ratings %>% sdf_partition(training = 0.75, test = 0.25, seed = 1099)
training <- partition$training
test <- partition$test
To create the ALS model, I used sparklyr's ml_als() function. After running my predictions, I found that many of them had NaN values, which is less than ideal. Thankfully, the ml_als() function provides a parameter that takes care of this issue: 'cold_start_strategy'. Setting this parameter to 'drop' removes rows containing NaN prediction values from the dataframe.
als_model <- ml_als(training, rating_col = 'rating', user_col = 'user_id', item_col = 'item_id', cold_start_strategy = 'drop')
knitr::kable(summary(als_model), format = 'html') %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
add_header_above(c('ALS Model Summary' = 4))
|  | Length | Class | Mode |
|---|---|---|---|
| uid | 1 | -none- | character |
| param_map | 5 | -none- | list |
| rank | 1 | -none- | numeric |
| recommend_for_all_items | 1 | -none- | function |
| recommend_for_all_users | 1 | -none- | function |
| item_factors | 2 | tbl_spark | list |
| user_factors | 2 | tbl_spark | list |
| user_col | 1 | -none- | character |
| item_col | 1 | -none- | character |
| prediction_col | 1 | -none- | character |
| .jobj | 2 | spark_jobj | environment |
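The recommend_for_all_items and recommend_for_all_users entries in the summary above are the model's built-in recommendation methods, which sparklyr wraps in ml_recommend(). Although not used in this project, a short sketch of how top-5 per-user recommendations could be pulled (assuming type = 'items' yields item recommendations for each user):

# Top 5 recommended magazines for every user in the model.
user_recommendations <- ml_recommend(als_model, type = 'items', n = 5) %>%
  collect()
head(user_recommendations)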
# Generate predictions by invoking the underlying Spark model's
# transform method on the test set, then collect the results into R.
predictions <- als_model$.jobj %>%
  invoke('transform', spark_dataframe(test)) %>%
  collect()
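For reference, a higher-level route should produce the same prediction column without touching the underlying Java object directly, assuming a sparklyr version whose ml_predict() supports ALS models (the predictions_alt name is illustrative):

# Equivalent prediction using sparklyr's high-level API.
predictions_alt <- ml_predict(als_model, test) %>%
  collect()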
knitr::kable(head(predictions, 50), format = 'html') %>%
kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'), fixed_thead = T) %>%
add_header_above(c('Prediction Results' = 6))
| item_id | user_id | user_name | rating | item_name | prediction |
|---|---|---|---|---|---|
| 12 | 1069 | African Daisy | 4 | CosmoGIRL! (1-year) | 3.9345741 |
| 12 | 2461 | A. Denise | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 3815 | Amazon Customer | 2 | CosmoGIRL! (1-year) | 1.9672871 |
| 12 | 54 | Amazon Customer | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 2430 | Cari | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 2539 | Joseph Diorio | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 3641 | Stephanie Pressman | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 4090 | LeoraKate | 4 | CosmoGIRL! (1-year) | 3.9345741 |
| 12 | 1505 | Ashley | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 1842 | Jane Christenson | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 3188 | Jennifer Lux | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 1667 | H. Hok | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 2198 | Emily | 5 | CosmoGIRL! (1-year) | 4.9182177 |
| 12 | 2721 | AgilitynHorseCrazed | 3 | CosmoGIRL! (1-year) | 2.9509304 |
| 12 | 3589 | Shanno | 2 | CosmoGIRL! (1-year) | 1.9672871 |
| 13 | 909 | Annamarie Rutschke | 5 | Diabetic Cooking | 4.9509950 |
| 13 | 2736 | Em's mom | 5 | Diabetic Cooking | 4.9509950 |
| 13 | 3836 | Mary S. | 4 | Diabetic Cooking | 3.9607959 |
| 14 | 1142 | M. McMahon | 5 | Selling Power | 4.9174910 |
| 18 | 150 | Peplon423 | 5 | Ladies Home Journal (1-year) | 4.9270101 |
| 18 | 3769 | Constance Rowell | 3 | Ladies Home Journal (1-year) | 2.9562063 |
| 18 | 3626 | Heather=..= | 2 | Ladies Home Journal (1-year) | 1.9708043 |
| 18 | 3883 | PPencle | 3 | Ladies Home Journal (1-year) | 2.9562063 |
| 18 | 678 | LG | 3 | Ladies Home Journal (1-year) | 2.9562063 |
| 18 | 2486 | GP The Engineer | 4 | Ladies Home Journal (1-year) | 3.9416087 |
| 18 | 3406 | JFK | 5 | Ladies Home Journal (1-year) | 4.9270101 |
| 37 | 2570 | Bosse | 1 | Brio | 0.9870335 |
| 37 | 4127 | June | 4 | Brio | 3.9481342 |
| 37 | 1041 | S. Thompson | 5 | Brio | 4.9351678 |
| 37 | 3270 | Steve Slater | 3 | Brio | 2.9611008 |
| 38 | 3110 | Martin | 1 | Camcorder & Computer Video | 0.9825760 |
| 46 | 393 | Jim Proulx | 5 | Rev! (Pastoral Resource) | 4.8877940 |
| 50 | 2249 | rowena gibson | 5 | Vermont Magazine | 4.9551368 |
| 50 | 1856 | Constance B. Andrews | 5 | Vermont Magazine | 4.9551368 |
| 50 | 2572 | Mary M. Miller | 5 | Vermont Magazine | 4.9551368 |
| 52 | 4100 | Doost | 4 | Current History | 3.9559941 |
| 52 | 1110 | C. M. Wood | 5 | Current History | 4.9449925 |
| 56 | 513 | J. Cooper | 5 | Australian Patchwork & Quilting | 4.9523754 |
| 6 | 295 | Patty D | 1 | PC World | 0.9808766 |
| 6 | 1101 | Deimos | 3 | PC World | 2.9426303 |
| 6 | 1545 | WAC | 5 | PC World | 4.9043832 |
| 6 | 2217 | Charlie Spivey | 5 | PC World | 4.9043832 |
| 6 | 3273 | Frank | 4 | PC World | 3.9235065 |
| 6 | 3285 | Old-and-Wise | 4 | PC World | 3.7350338 |
| 6 | 695 | sarmad | 5 | PC World | 4.9043832 |
| 6 | 1455 | CSKapper | 5 | PC World | 4.9043832 |
| 6 | 1562 | Gregory | 1 | PC World | 0.9808766 |
| 6 | 1874 | Lori_Mac | 4 | PC World | 3.9235065 |
| 6 | 3292 | gorillazfan249 | 4 | PC World | 3.9235065 |
| 6 | 3333 | Steve Wilson | 1 | PC World | 0.9808766 |
In this project, we built a recommendation system using the ALS model and an Amazon magazine ratings dataset. The first challenge was with the dataset itself. It came in two separate files: one contained the ratings, and the other contained the product metadata. Because we needed the product names (contained in the metadata file) as well as the user ratings, we had to join the data on a column common to both sets. Both sets contained the 'asin' (product ID) column, so we were able to join on it.
The next challenge was that the product IDs and user IDs were alphanumeric, and the function we used to create the model only accepts numeric values for these parameters. To get around this, we needed to convert the IDs to numeric values. Alphanumeric characters cannot be directly converted to numeric values, so we first had to convert them to factors, and then to integers. This allowed us to build the model using sparklyr's ml_als() function.
Additionally, running Spark locally proved problematic. Due to limited computational resources, I had to greatly reduce the size of the dataset for the model to run. Given more time, I would have liked to have run Spark via Amazon EMR.
Finally, the recommender system is missing a UI. With more time, I would have liked to build a UI that allowed users to interact with the system to get magazine subscription recommendations. My plan was to use Shiny for this purpose, but I am not familiar with setting up UIs in Shiny and did not have the time to experiment with it.
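For reference, the envisioned front end would not require much code. Below is a minimal Shiny sketch built around the collected predictions; the recommend_for_user() helper is hypothetical and simply ranks the predictions data frame from the modeling step, rather than querying the live model:

library(shiny)

# Hypothetical helper: rank the collected ALS predictions for a user and
# return the top n magazines (assumes the 'predictions' data frame from
# the modeling step is in scope).
recommend_for_user <- function(user, n = 5) {
  user_rows <- predictions[predictions$user_id == user, ]
  user_rows <- user_rows[order(-user_rows$prediction), ]
  head(user_rows[, c('item_name', 'prediction')], n)
}

ui <- fluidPage(
  titlePanel('Magazine Subscription Recommender'),
  numericInput('user_id', 'User ID:', value = 1, min = 1),
  tableOutput('recommendations')
)

server <- function(input, output) {
  output$recommendations <- renderTable({
    recommend_for_user(input$user_id)
  })
}

shinyApp(ui = ui, server = server)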