DATA 612 Final Project

Assignment Instructions

Build out the system that you described in your final project planning document.

Introduction

My original objective for this project was to create a recommendation system that predicted Amazon beauty product ratings. However, the compact version of the Amazon - Ratings (Beauty Products) dataset (available from Kaggle) was missing metadata, so it did not contain the names of the beauty products.

I attempted to get around this issue by using the full version of the dataset (available from UCSD), which contained the metadata that I needed, but the files were too large to run locally. This forced me to switch to the Amazon Magazine Subscription Ratings dataset, which was small enough to run locally and also contained the metadata that I needed.

Project Objective

The objective of this project is to build a recommendation system that predicts magazine subscription ratings. The dataset for this project is considerably larger than those of previous projects, so I will leverage Spark to store the data. In order to integrate Spark with R, I will utilize the R sparklyr package. For the recommendation model, I will use the Alternating Least Squares (ALS) matrix factorization algorithm.

Data Source

As noted in the introduction section above, I will use the Amazon Magazine Subscription Ratings dataset made available by the University of California San Diego. The dataset was compiled in 2018, and is split into two JSON files: a reviews file and a metadata file. The reviews file contains 89,689 magazine subscription reviews, and the metadata file contains 3,493 products.

Data Manipulation

Pull in the datasets from GitHub

In order to make the JSON data files accessible outside of my local machine, I will store them in GitHub, and then ingest them using the jsonlite package.

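# Load the packages used in this report. plyr is loaded before dplyr so that dplyr's verbs are not masked.
library(plyr)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(sparklyr)
# Import the JSON data from GitHub.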
amazon_meta_source <- url('https://raw.githubusercontent.com/stephen-haslett/data612/612-final-project/amazon_magazine_meta.json')
amazon_ratings_source <- url('https://raw.githubusercontent.com/stephen-haslett/data612/612-final-project/amazon_magazine_ratings.json')

amazon_magazine_metadata <- do.call(rbind, lapply(paste(readLines(amazon_meta_source, warn = FALSE), collapse = ''), jsonlite::fromJSON))
amazon_magazine_ratings <- do.call(rbind, lapply(paste(readLines(amazon_ratings_source, warn = FALSE), collapse = ''), jsonlite::fromJSON))
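
The import above relies on each file parsing as a single JSON document once its lines are pasted together. As a hedged sketch (the two-record sample below is hypothetical and not taken from the real files), jsonlite can read either a JSON array or newline-delimited records, and both routes end in the same kind of data frame:

```r
library(jsonlite)

# Hypothetical two-record sample; the real files are far larger.
array_json <- '[{"asin":"A1","overall":5},{"asin":"A2","overall":3}]'
ndjson <- c('{"asin":"A1","overall":5}', '{"asin":"A2","overall":3}')

# A single JSON array parses straight into a data frame.
from_array <- fromJSON(array_json)

# Newline-delimited JSON is parsed one record per line.
con <- textConnection(ndjson)
from_lines <- stream_in(con, verbose = FALSE)
close(con)
```

If a hosted file turned out to be newline-delimited, passing its connection to stream_in() would likely be the matching call, but that depends on how the files were saved.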

Explore the dataset column names.

Metadata column names.

knitr::kable(names(amazon_magazine_metadata), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
  add_header_above(c('Metadata Column Names' = 1))

Metadata Column Names
x
category
description
also_buy
image
brand
also_view
details
main_cat
asin
rank
title

Ratings column names.

knitr::kable(names(amazon_magazine_ratings), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
  add_header_above(c('Ratings Column Names' = 1))

Ratings Column Names
x
overall
vote
verified
reviewTime
reviewerID
asin
reviewerName
reviewText
summary
unixReviewTime
style
image

Remove the columns that we don’t need.

There are quite a few columns in the datasets that we do not need. The only columns that we are interested in from the metadata set are the ‘asin’ (product ID), and the ‘title’ (product name) columns, so we can remove all the other columns. The same goes for the ratings set; we only need the asin, reviewerID, reviewerName, and overall (rating) columns.

# Select the columns we need from the metadata dataset.
metadata_refined <- select(amazon_magazine_metadata, asin, title)
# We only want items that have product titles, so remove all the rows that do not have a title.
metadata_refined <- metadata_refined[!is.na(metadata_refined$title),]

# Select the columns we need from the ratings dataset.
ratings_refined <- select(amazon_magazine_ratings, asin, reviewerID, reviewerName, overall)

Combine the metadata and ratings datasets.

Now that we have the columns that we need from both datasets, we can combine the datasets using a join. Both sets share the ‘asin’ column, so we can join on it. This will leave us with a table that contains the user review data together with the item IDs and product names.

magazine_ratings <- join(ratings_refined, metadata_refined, by = 'asin', type = 'right')

# Remove rows that are missing reviewerID values.
magazine_ratings <- magazine_ratings[!is.na(magazine_ratings$reviewerID),]

# Rename column names to make them more descriptive.
names(magazine_ratings)[1] <- 'item_id'
names(magazine_ratings)[2] <- 'user_id'
names(magazine_ratings)[3] <- 'user_name'
names(magazine_ratings)[4] <- 'rating'
names(magazine_ratings)[5] <- 'item_name'
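
The right join followed by the reviewerID filter keeps only reviews whose product appears in the metadata, which is what an inner join does in one step. The sketch below illustrates this with base R's merge() (swapped in for the join() call used above) on hypothetical toy data:

```r
# Hypothetical toy data standing in for ratings_refined and metadata_refined.
ratings_toy <- data.frame(
  asin       = c("A1", "A1", "A2", "A9"),
  reviewerID = c("U1", "U2", "U1", "U3"),
  overall    = c(5, 3, 4, 1),
  stringsAsFactors = FALSE
)
metadata_toy <- data.frame(
  asin  = c("A1", "A2", "A3"),
  title = c("PC World", "Brio", "Vermont Magazine"),
  stringsAsFactors = FALSE
)

# Right join: keep every product in the metadata, even those without reviews.
right_joined <- merge(ratings_toy, metadata_toy, by = "asin", all.y = TRUE)

# Dropping rows with no reviewer removes the unreviewed products again.
right_filtered <- right_joined[!is.na(right_joined$reviewerID), ]

# An inner join reaches the same rows in one step.
inner_joined <- merge(ratings_toy, metadata_toy, by = "asin")
```

In this toy example the review for "A9" is dropped in both versions because that product has no metadata row.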

Convert the item IDs and user IDs to numeric values.

The item IDs and user IDs are alphanumeric. When we create the model, we use sparklyr’s “ml_als()” function, and the columns named by its ‘user_col’ and ‘item_col’ parameters must contain integer values. To achieve this, we first convert the IDs to factors, and then convert the factors to their integer codes.

item_factor <- as.factor(magazine_ratings$item_id)
item_id <- as.integer(item_factor)
magazine_ratings$item_id <- item_id

user_factor <- as.factor(magazine_ratings$user_id)
user_id <- as.integer(user_factor)
magazine_ratings$user_id <- user_id

knitr::kable(head(magazine_ratings, 10), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
  add_header_above(c('Refined Ratings Dataset with Numeric Item and User IDs' = 5))

Refined Ratings Dataset with Numeric Item and User IDs
item_id	user_id	user_name	rating	item_name
2	3179	Amazon Customer	2	Natural Health
2	1650	Amazon Customer	1	Natural Health
2	3137	Bita Hunt	1	Natural Health
2	2678	kristina bogar	1	Natural Health
2	2152	K. Salinger, Holistic Nurse Practitioner	1	Natural Health
2	2127	Marce T. Hanson	5	Natural Health
2	2798	H. Lum	1	Natural Health
2	3320	David Allen Hazlewood	3	Natural Health
2	3928	Valerio Valentino	1	Natural Health
2	796	Farmer Jane	4	Natural Health
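
The factor-based conversion can be sketched on a small vector. The IDs below are hypothetical. Keeping the factor levels around gives a lookup table from each integer code back to the original alphanumeric ID, which is useful because the conversion above overwrites the original values:

```r
# Hypothetical alphanumeric IDs, similar in shape to Amazon ASINs.
raw_ids <- c("B00A", "A001", "B00A", "C123")

id_factor <- as.factor(raw_ids)

# Integer codes index into the sorted factor levels.
int_ids <- as.integer(id_factor)

# The levels act as a lookup table from integer code back to the original ID.
lookup <- levels(id_factor)
recovered <- lookup[int_ids]
```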

Distribution of Ratings

With the alterations to our dataset complete, we are now in a position to take a look at the ratings distribution. As the bar plot below shows, most of the ratings are high, and low ratings are less common.

magazine_ratings %>% 
  ggplot(aes(rating)) +
  geom_bar(fill = 'darkgreen') +
  labs(title = 'Distribution of Ratings', y = 'Frequency', x = 'Ratings') +
  theme_minimal()

Copy the data over to Spark.

Now that we have a table that contains only the values that we need, we can copy the data over to Spark. Because the dataset was too large for my local machine to handle, I needed to reduce the number of records sent over to Spark. To accomplish this, I selected the first 2000 records.

# Select the first 2000 records from the dataset to accommodate for limited resources on my local machine.
magazine_ratings <- head(magazine_ratings, 2000)

# Connect to Spark and copy over the dataset.
sc <- spark_connect(master = 'local', version = '3.0.0')
ratings <- sdf_copy_to(sc, magazine_ratings, overwrite = TRUE)
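
Taking the first 2000 rows keeps whatever ordering the join produced, so the subset may cover only a small number of products. One possible alternative, shown here as a sketch on a hypothetical stand-in data frame and not as the method used in this project, is to draw a seeded random sample in base R before copying the data to Spark:

```r
# Hypothetical stand-in for magazine_ratings.
toy_ratings <- data.frame(
  item_id = rep(1:50, each = 100),
  rating  = rep(1:5, times = 1000)
)

set.seed(1099)  # fixed seed so the same rows are drawn on every run
sample_size <- 2000

# Draw row indices without replacement, then subset the data frame.
sampled_rows <- sample(nrow(toy_ratings), size = sample_size)
ratings_sample <- toy_ratings[sampled_rows, ]
```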

Split the data into test and training sets.

Now that the data is in Spark, we can split the dataset into training and test sets using a ratio of 75:25 (75% of the entire dataset for training, and 25% for testing).

partition <- ratings %>% sdf_partition(training = 0.75, test = 0.25, seed = 1099)
training <- partition$training
test <- partition$test
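
A random split can leave users or items in the test set that never appear in the training set, and a factorization model has no learned factors for them. A minimal base R sketch, using hypothetical toy data frames in place of the Spark tables, shows one way to find such rows:

```r
# Hypothetical toy splits standing in for the Spark training and test tables.
training_toy <- data.frame(user_id = c(1, 2, 3, 3), item_id = c(10, 10, 11, 12))
test_toy     <- data.frame(user_id = c(1, 4, 3),    item_id = c(11, 10, 13))

# A test row is "cold" if its user or its item was never seen in training.
unseen_user <- !(test_toy$user_id %in% training_toy$user_id)
unseen_item <- !(test_toy$item_id %in% training_toy$item_id)
cold_rows <- test_toy[unseen_user | unseen_item, ]
```

In this toy data, user 4 and item 13 are absent from the training set, so two of the three test rows are cold.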

Create the Alternating Least Squares (ALS) model.

To create the ALS model, I used sparklyr’s ml_als() function. After running my predictions, I found that many of them had NaN values. This happens when a user or item in the test set was not present in the training set, so the model has no factors for it. Thankfully, the ml_als() function provides a parameter that takes care of this issue - “cold_start_strategy”. Setting this parameter to “drop” removes rows containing NaN prediction values from the dataframe.

als_model <- ml_als(training, rating_col = 'rating', user_col = 'user_id', item_col = 'item_id', cold_start_strategy = 'drop')

knitr::kable(summary(als_model), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover')) %>%
  add_header_above(c('ALS Model Summary' = 4))

ALS Model Summary
	Length	Class	Mode
uid	1	-none-	character
param_map	5	-none-	list
rank	1	-none-	numeric
recommend_for_all_items	1	-none-	function
recommend_for_all_users	1	-none-	function
item_factors	2	tbl_spark	list
user_factors	2	tbl_spark	list
user_col	1	-none-	character
item_col	1	-none-	character
prediction_col	1	-none-	character
.jobj	2	spark_jobj	environment
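
To give a feel for what alternating least squares does, here is a small base R sketch on a hypothetical 4 x 4 ratings matrix. It is an illustration of the general technique under simple assumptions (plain L2 regularization, a fixed number of sweeps) and not Spark's implementation. The function name als_sketch and its parameters are made up for this example:

```r
# Toy ALS: alternately solve for user factors and item factors.
als_sketch <- function(R, rank = 2, lambda = 0.1, iters = 50, seed = 1) {
  set.seed(seed)
  observed <- !is.na(R)
  U <- matrix(rnorm(nrow(R) * rank, sd = 0.1), nrow(R), rank)
  V <- matrix(rnorm(ncol(R) * rank, sd = 0.1), ncol(R), rank)

  # Regularized least-squares solve for one row of factors.
  solve_row <- function(fixed, targets) {
    solve(crossprod(fixed) + lambda * diag(rank), crossprod(fixed, targets))
  }

  for (iter in seq_len(iters)) {
    # Hold item factors fixed and update each user from the items they rated.
    for (u in seq_len(nrow(R))) {
      seen <- which(observed[u, ])
      if (length(seen) > 0) U[u, ] <- solve_row(V[seen, , drop = FALSE], R[u, seen])
    }
    # Hold user factors fixed and update each item from the users who rated it.
    for (i in seq_len(ncol(R))) {
      seen <- which(observed[, i])
      if (length(seen) > 0) V[i, ] <- solve_row(U[seen, , drop = FALSE], R[seen, i])
    }
  }
  U %*% t(V)  # predicted rating for every user-item pair
}

# Hypothetical ratings; NA marks a user-item pair with no rating.
toy <- matrix(c(5, 4, NA, 1,
                4, NA, 1, 1,
                1, 1, NA, 5,
                NA, 1, 5, 4), nrow = 4, byrow = TRUE)
fitted <- als_sketch(toy)
fit_rmse <- sqrt(mean((fitted - toy)^2, na.rm = TRUE))
```

The fitted matrix has a value for every cell, including the ones that were NA in the input, which is how a factorization model produces predictions for unrated items.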

Perform predictions on the data.

predictions <- als_model$.jobj %>%
  invoke('transform', spark_dataframe(test)) %>%
  collect()

knitr::kable(head(predictions, 50), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'), fixed_thead = TRUE) %>%
  add_header_above(c('Prediction Results' = 6))

Prediction Results
item_id	user_id	user_name	rating	item_name	prediction
12	1069	African Daisy	4	CosmoGIRL! (1-year)	3.9345741
12	2461	A. Denise	5	CosmoGIRL! (1-year)	4.9182177
12	3815	Amazon Customer	2	CosmoGIRL! (1-year)	1.9672871
12	54	Amazon Customer	5	CosmoGIRL! (1-year)	4.9182177
12	2430	Cari	5	CosmoGIRL! (1-year)	4.9182177
12	2539	Joseph Diorio	5	CosmoGIRL! (1-year)	4.9182177
12	3641	Stephanie Pressman	5	CosmoGIRL! (1-year)	4.9182177
12	4090	LeoraKate	4	CosmoGIRL! (1-year)	3.9345741
12	1505	Ashley	5	CosmoGIRL! (1-year)	4.9182177
12	1842	Jane Christenson	5	CosmoGIRL! (1-year)	4.9182177
12	3188	Jennifer Lux	5	CosmoGIRL! (1-year)	4.9182177
12	1667	H. Hok	5	CosmoGIRL! (1-year)	4.9182177
12	2198	Emily	5	CosmoGIRL! (1-year)	4.9182177
12	2721	AgilitynHorseCrazed	3	CosmoGIRL! (1-year)	2.9509304
12	3589	Shanno	2	CosmoGIRL! (1-year)	1.9672871
13	909	Annamarie Rutschke	5	Diabetic Cooking	4.9509950
13	2736	Em's mom	5	Diabetic Cooking	4.9509950
13	3836	Mary S.	4	Diabetic Cooking	3.9607959
14	1142	M. McMahon	5	Selling Power	4.9174910
18	150	Peplon423	5	Ladies Home Journal (1-year)	4.9270101
18	3769	Constance Rowell	3	Ladies Home Journal (1-year)	2.9562063
18	3626	Heather=^..=	2	Ladies Home Journal (1-year)	1.9708043
18	3883	PPencle	3	Ladies Home Journal (1-year)	2.9562063
18	678	LG	3	Ladies Home Journal (1-year)	2.9562063
18	2486	GP The Engineer	4	Ladies Home Journal (1-year)	3.9416087
18	3406	JFK	5	Ladies Home Journal (1-year)	4.9270101
37	2570	Bosse	1	Brio	0.9870335
37	4127	June	4	Brio	3.9481342
37	1041	S. Thompson	5	Brio	4.9351678
37	3270	Steve Slater	3	Brio	2.9611008
38	3110	Martin	1	Camcorder & Computer Video	0.9825760
46	393	Jim Proulx	5	Rev! (Pastoral Resource)	4.8877940
50	2249	rowena gibson	5	Vermont Magazine	4.9551368
50	1856	Constance B. Andrews	5	Vermont Magazine	4.9551368
50	2572	Mary M. Miller	5	Vermont Magazine	4.9551368
52	4100	Doost	4	Current History	3.9559941
52	1110	C. M. Wood	5	Current History	4.9449925
56	513	J. Cooper	5	Australian Patchwork & Quilting	4.9523754
6	295	Patty D	1	PC World	0.9808766
6	1101	Deimos	3	PC World	2.9426303
6	1545	WAC	5	PC World	4.9043832
6	2217	Charlie Spivey	5	PC World	4.9043832
6	3273	Frank	4	PC World	3.9235065
6	3285	Old-and-Wise	4	PC World	3.7350338
6	695	sarmad	5	PC World	4.9043832
6	1455	CSKapper	5	PC World	4.9043832
6	1562	Gregory	1	PC World	0.9808766
6	1874	Lori_Mac	4	PC World	3.9235065
6	3292	gorillazfan249	4	PC World	3.9235065
6	3333	Steve Wilson	1	PC World	0.9808766
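
One way to summarize how close the predictions are to the actual ratings is the root mean squared error (RMSE). The helper below is a hypothetical sketch. It is applied here to the first four rows of the table above, and the same function could be applied to the full collected predictions data frame:

```r
# Root mean squared error between actual and predicted ratings.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# First four rows of the prediction results shown above.
sample_predictions <- data.frame(
  rating     = c(4, 5, 2, 5),
  prediction = c(3.9345741, 4.9182177, 1.9672871, 4.9182177)
)

sample_rmse <- rmse(sample_predictions$rating, sample_predictions$prediction)
```

A value computed on rows the model was trained on would be optimistic, so the test set is the right place to measure it.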

Conclusion

In this project, we built a recommendation system using the ALS model and an Amazon magazine ratings dataset. The first challenge was with the dataset, which came in two separate files: one contained the ratings, and the other contained the product metadata. Because we needed the product names (which were contained in the metadata file) as well as the user ratings, we had to join the data on a column that was common to both sets. Both sets contained the ‘asin’ (product ID) column, so we were able to join on this column.

The next challenge with the dataset was the fact that the product IDs and user IDs were alphanumeric, and the function that we used to create the model requires the user and item columns to hold numeric values. To get around this issue, we needed to convert these values to numeric values. Alphanumeric characters cannot be directly converted to numeric values, so we first had to convert them to factors, and then to numeric values. This allowed us to build the model using sparklyr’s ml_als() function.

Additionally, running Spark locally proved to be problematic. Due to limited computational resources, I had to greatly reduce the size of the dataset in order for the model to run. Given more time, I would have liked to run Spark via Amazon EMR.

Finally, the recommender system is missing a UI. With more time, I would have liked to build a UI that allowed users to interact with the system in order to get magazine subscription recommendations. My plan was to use Shiny for this purpose, but I am not familiar with setting up UIs using Shiny, and did not have the time to experiment with it.