library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(recosystem)
library(sparklyr)
##
## Attaching package: 'sparklyr'
##
## The following object is masked from 'package:purrr':
##
## invoke
##
## The following object is masked from 'package:stats':
##
## filter
When working on this assignment two weeks ago, processing time was such a hinderance. Steps such as model tuning and running loops to calculate similarity ground my progress to a temporary halt. Spark allows for online processing, so I can calculate beyond the limitations of my laptop. I’m going back to working with the Jester dataset, looking at how jokes were rated by 54905 users.
jester <- read_xls("C:/Users/ddebo/Downloads/JesterDataset3/FINAL jester 2006-15.xls", col_names = FALSE)
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`
## • `` -> `...20`
## • `` -> `...21`
## • `` -> `...22`
## • `` -> `...23`
## • `` -> `...24`
## • `` -> `...25`
## • `` -> `...26`
## • `` -> `...27`
## • `` -> `...28`
## • `` -> `...29`
## • `` -> `...30`
## • `` -> `...31`
## • `` -> `...32`
## • `` -> `...33`
## • `` -> `...34`
## • `` -> `...35`
## • `` -> `...36`
## • `` -> `...37`
## • `` -> `...38`
## • `` -> `...39`
## • `` -> `...40`
## • `` -> `...41`
## • `` -> `...42`
## • `` -> `...43`
## • `` -> `...44`
## • `` -> `...45`
## • `` -> `...46`
## • `` -> `...47`
## • `` -> `...48`
## • `` -> `...49`
## • `` -> `...50`
## • `` -> `...51`
## • `` -> `...52`
## • `` -> `...53`
## • `` -> `...54`
## • `` -> `...55`
## • `` -> `...56`
## • `` -> `...57`
## • `` -> `...58`
## • `` -> `...59`
## • `` -> `...60`
## • `` -> `...61`
## • `` -> `...62`
## • `` -> `...63`
## • `` -> `...64`
## • `` -> `...65`
## • `` -> `...66`
## • `` -> `...67`
## • `` -> `...68`
## • `` -> `...69`
## • `` -> `...70`
## • `` -> `...71`
## • `` -> `...72`
## • `` -> `...73`
## • `` -> `...74`
## • `` -> `...75`
## • `` -> `...76`
## • `` -> `...77`
## • `` -> `...78`
## • `` -> `...79`
## • `` -> `...80`
## • `` -> `...81`
## • `` -> `...82`
## • `` -> `...83`
## • `` -> `...84`
## • `` -> `...85`
## • `` -> `...86`
## • `` -> `...87`
## • `` -> `...88`
## • `` -> `...89`
## • `` -> `...90`
## • `` -> `...91`
## • `` -> `...92`
## • `` -> `...93`
## • `` -> `...94`
## • `` -> `...95`
## • `` -> `...96`
## • `` -> `...97`
## • `` -> `...98`
## • `` -> `...99`
## • `` -> `...100`
## • `` -> `...101`
## • `` -> `...102`
## • `` -> `...103`
## • `` -> `...104`
## • `` -> `...105`
## • `` -> `...106`
## • `` -> `...107`
## • `` -> `...108`
## • `` -> `...109`
## • `` -> `...110`
## • `` -> `...111`
## • `` -> `...112`
## • `` -> `...113`
## • `` -> `...114`
## • `` -> `...115`
## • `` -> `...116`
## • `` -> `...117`
## • `` -> `...118`
## • `` -> `...119`
## • `` -> `...120`
## • `` -> `...121`
## • `` -> `...122`
## • `` -> `...123`
## • `` -> `...124`
## • `` -> `...125`
## • `` -> `...126`
## • `` -> `...127`
## • `` -> `...128`
## • `` -> `...129`
## • `` -> `...130`
## • `` -> `...131`
## • `` -> `...132`
## • `` -> `...133`
## • `` -> `...134`
## • `` -> `...135`
## • `` -> `...136`
## • `` -> `...137`
## • `` -> `...138`
## • `` -> `...139`
## • `` -> `...140`
## • `` -> `...141`
## • `` -> `...142`
## • `` -> `...143`
## • `` -> `...144`
## • `` -> `...145`
## • `` -> `...146`
## • `` -> `...147`
## • `` -> `...148`
## • `` -> `...149`
## • `` -> `...150`
## • `` -> `...151`
# first column is number of jokes rated
names(jester)[-1] <- paste0("joke_", 1:150)
jester <- jester[-1]
# converting 99 to na
jester[jester == 99] <- NA
# Spark isn't working with columns that start with numbers apparently
coljest <- colnames(jester)
new_coljest <- paste0("joke_", coljest)
colnames(jester) <- new_coljest
jester$user_id <- seq_len(nrow(jester))
# removing jokes with no ratings at all
empty <- which(colSums(!is.na(jester))==0)
kept_jokes <- setdiff(1:ncol(jester), empty)
jester <- jester[, kept_jokes]
#converting to long format and creating a numerical joke id while keeping the column name starting with letters
ratings_long <- jester |>
pivot_longer(cols = -user_id, names_to = 'joke_id', values_to = 'rating')
ratings_long <- ratings_long |>
filter(!is.na(rating)) |>
mutate(
user_id = as.integer(user_id),
joke_id = as.integer(gsub("joke_", "", joke_id)),
rating = as.numeric(rating)
)
# csv format might make Spark work faster? One time forming the table took about an hour
write.csv(ratings_long, 'jester_long.csv', row.names = FALSE)
Now that I have Temurin installed, I can connect to Spark. I tried for a long time to get databricks to work, but this was not possible with the current Free edition. In any case, I’ll be using Spark locally. Since time is a focus for this week, I will note that this connection was far from instantaneous. Each connection takes about three minutes, not considering the hours I spent trying to make it work.
Sys.setenv(JAVA_HOME = "C:/Program Files/Eclipse Adoptium/jdk-8.0.452.9-hotspot")
config <- spark_config()
config$spark.driver.memory <- "2G"
options(sparklyr.log.console = TRUE)
sc <- spark_connect(master = 'local', version = '3.0.0', config = config)
ratings_tbl <- spark_read_csv(
sc,
name = "ratings",
path = 'jester_long.csv',
header = TRUE,
infer_schema = TRUE
)
Let’s see how this process goes when building a recommender using a paradigm that I have used before.
# getting average rating for joke
avg_ratings_tbl <- ratings_tbl |>
group_by(joke_id) |>
summarize(avg_rating = mean(rating, na.rm = TRUE))
# using average to center data
ratings_cent_tbl <- ratings_tbl |>
left_join(avg_ratings_tbl, by = 'joke_id')|>
mutate(rating_centered = rating - avg_rating) |>
sdf_register("ratings_centered")
joke_pairs <- ratings_cent_tbl |>
inner_join(ratings_cent_tbl,
by = 'user_id',
suffix = c('.joke1', '.joke2')) |>
filter(joke_id_joke1 < joke_id_joke2)
## Replacing '.' with '_' in suffixes. New suffixes: _joke1, _joke2
pair_scores <- joke_pairs |>
mutate(
prod = rating_centered_joke1 * rating_centered_joke2,
sq1 = rating_centered_joke1^2,
sq2 = rating_centered_joke2^2
) |>
group_by(joke_id_joke1, joke_id_joke2) |>
summarize(
sim_numerator = sum(prod, na.rm = TRUE),
sim_den1 = sqrt(sum(sq1, na.rm = TRUE)),
sim_den2 = sqrt(sum(sq2, na.rm = TRUE)),
.groups = 'drop'
) |>
mutate(
cosine_similarity = sim_numerator/ (sim_den1 * sim_den2)
)|>
filter(!is.na(cosine_similarity))
With the cosine similarity computed, let’s split the data into a training and test set so we have a base for comparison to calculate error from ratings.
splits <- ratings_cent_tbl |>
sdf_random_split(training = .8, test = .2, seed = 24601)
train_tbl <- splits$training
test_tbl <- splits$test
user_history <- train_tbl |>
select(user_id, joke_id, rating_centered)
test_history <- test_tbl |>
inner_join(user_history, by = 'user_id', suffix = c("_target", "_neighbor"))
pred_input <- test_history |>
inner_join(pair_scores, by = c('joke_id_target' = 'joke_id_joke1',
'joke_id_neighbor' = 'joke_id_joke2'))
preds_cf <- pred_input |>
mutate(weighted = rating_centered_neighbor * cosine_similarity) |>
group_by(user_id, joke_id_target) |>
summarize(
pred = sum(weighted) / sum(abs(cosine_similarity)),
.groups = "drop"
)
I want to know what the predicted ratings would be in order to compare to past weeks’ models, so in reporting I want to undo the centering. Otherwise the root mean squared error would be based on different scales.
preds_cf <- preds_cf |>
inner_join(avg_ratings_tbl, by = c('joke_id_target'='joke_id')) |>
mutate(pred_final = pred + avg_rating)
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
Now we can calculate the root mean squared error as well as the mean average error. The progress bar is a somewhat welcome sight here, since at least there is progress to follow. The whole process took about five minutes.
rmse_cf <- preds_cf |>
inner_join(test_tbl, by = c('user_id', 'joke_id_target'= 'joke_id')) |>
mutate(sq_error = (pred_final - rating)^2) |>
summarize(rmse = sqrt(mean(sq_error)))
rmse_cf
## # Source: SQL [?? x 1]
## # Database: spark_connection
## rmse
## <dbl>
## 1 4.34
mae_cf <- preds_cf |>
inner_join(test_tbl, by = c('user_id', 'joke_id_target' = 'joke_id')) |>
mutate(abs_error = abs(pred_final - rating)) |>
summarize(mae = mean(abs_error))
mae_cf
## # Source: SQL [?? x 1]
## # Database: spark_connection
## mae
## <dbl>
## 1 3.23
This was not a particularly complex model and these values are in the same range as previous models. I want to test it and generate some recommendations though, so I need to make predictions for the whole userbase on their unrated jokes and connect that back to the original joke text.
Even though it is a small table, I’ll import this table to Spark since I want to see the text of the recommended jokes
jokes <- read_xlsx("C:/Users/ddebo/Downloads/Dataset3JokeSet/Dataset3JokeSet.xlsx", col_names = FALSE)
## New names:
## • `` -> `...1`
joketext <- jokes$...1
joke_text_df <- tibble(
item = seq_along(joketext),
text = joketext
)
jokes_tbl <- copy_to(sc, joke_text_df, "jokes_tbl", overwrite = TRUE)
pair_scores_tbl <- sdf_copy_to(sc, pair_scores, overwrite = TRUE)
get_itemcf_recs <- function(user_id, ratings_cent_tbl, pair_scores_tbl, avg_ratings_tbl, jokes_tbl, n =10){
user_ratings <- ratings_cent_tbl |>
filter(user_id == !!user_id)
user_sim_scores <- user_ratings |>
rename(joke_id_joke1 = joke_id, rating_centered_joke1 = rating_centered)|>
inner_join(pair_scores_tbl, by = 'joke_id_joke1') |>
mutate(weighted_rating = rating_centered_joke1 * cosine_similarity)
predicted_centered <- user_sim_scores |>
group_by(joke_id_joke2) |>
summarize(
sim_sum = sum(abs(cosine_similarity), na.rm = TRUE),
weighted_sum = sum(weighted_rating, na.rm = TRUE),
.groups = 'drop'
) |>
filter(sim_sum > 0) |>
mutate(predicted_centered = weighted_sum / sim_sum) |>
rename(joke_id = joke_id_joke2)
# adding back average to decenter predicted scores so we can see the predicted rating
predicted_final <- predicted_centered |>
inner_join(avg_ratings_tbl, by = 'joke_id') |>
mutate(predicted_rating = predicted_centered * avg_rating)
already_rated <- ratings_cent_tbl |>
filter(user_id == !!user_id) |>
select(joke_id)
recommendations <- predicted_final |>
anti_join(already_rated, by = 'joke_id') |>
inner_join(jokes_tbl, by=c('joke_id' = 'item')) |>
mutate(row_num = dense_rank(desc(predicted_rating)))|>
filter(row_num <=!!n) |>
select(joke_id, predicted_rating, text)
return(recommendations)
}
Let’s see what this model recommends for user number 5.
top_10_recs <- get_itemcf_recs(
user_id = 5,
ratings_cent_tbl = ratings_cent_tbl,
pair_scores_tbl = pair_scores,
avg_ratings_tbl = avg_ratings_tbl,
jokes_tbl = jokes_tbl,
n = 10
)
top_10_recs |>
collect()
## # A tibble: 10 × 3
## joke_id predicted_rating text
## <int> <dbl> <chr>
## 1 141 8.61 "Jack Bauer can get McDonald's breakfast after 10:3…
## 2 124 7.66 "Person 1: Hey, wanna hear a great knock-knock joke…
## 3 58 3.52 "How many teddybears does it take to change a light…
## 4 75 2.49 "Q: Do you know the difference between an intellige…
## 5 74 2.02 "Q: How many stalkers does it take to change a ligh…
## 6 24 1.79 "What do you get when you run over a parakeet with …
## 7 44 1.27 "A horse walks into a bar. Bartender says: \"So, wh…
## 8 57 1.04 "Why are there so many Jones's in the phone book? B…
## 9 123 1.03 "When most people claim to be \"killing time\", it'…
## 10 101 0.970 "Did you hear about the Buddhist who refused Novoca…
These jokes are somewhat surprising given what other recommenders have recommended for this person. In the past, there was an emphasis on political humor that I do not see here. These are mostly puns. That being said, compared to the last item-item recommender, the predicted ratings for these jokes are more in line with what I expected; the one from week 2 for instance had the highest joke getting around a 2 on the scale as opposed to the 8.6 seen here. ### Using Spark Whatever time was saved on computing in Spark was completely negated by the amount of time spent navigating working in different data formats. The incompatibility of Spark with some of the features of other R packages made troubleshooting a particularly trying process.
The fact that Spark contains a function to compute the matrices required for this algorithm. This is my first time using this algorithm, though in the past I used Singular Value Decomposition which is also based on latent factors.
als_input <- ratings_tbl |>
filter(!is.na(rating)) |>
rename(user = user_id, item = joke_id) |>
mutate(user = as.integer(user), item = as.integer(item), rating = as.numeric(rating)) |>
filter(!is.na(user), !is.na(item), !is.na(rating))|>
sdf_register('als_input')
splits <- als_input |>
sdf_random_split(training = .8, test = .2, seed = 24601)
train_als <- splits$training
test_als <- splits$test
sdf_nrow(train_als)
## [1] 1473211
library(purrr)
library(Metrics)
param_grid <- expand.grid(
rank = c(10, 20, 50),
reg_param = c(.01, .1, 1)
)
results <- list()
for (i in seq_len(nrow(param_grid))) {
params <- param_grid[i, ]
cat("Fitting ALS model with rank =", params$rank, "and lambda =", params$reg_param, "\n")
# Train ALS model
model <- tryCatch({
ml_als(
train_als,
rating_col = "rating",
user_col = "user",
item_col = "item",
rank = params$rank,
reg_param = params$reg_param,
max_iter = 10,
nonnegative = TRUE
)
# clearly I had many issues with this section so I wanted to analyze the error messages
}, error = function(e) {
warning(paste("Model failed:", e$message))
return(NULL)
})
if (!is.null(model)) {
preds <- tryCatch({
ml_predict(model, test_als) %>%
collect() %>%
filter(!is.na(prediction))
}, error = function(e) {
warning(paste("Prediction failed:", e$message))
return(tibble())
})
if (nrow(preds) > 0) {
rmse_val <- rmse(preds$rating, preds$prediction)
mae_val <- mae(preds$rating, preds$prediction)
} else {
rmse_val <- NA
mae_val <- NA
}
results[[i]] <- tibble(
rank = params$rank,
lambda = params$reg_param,
rmse = rmse_val,
mae = mae_val
)
}
}
## Fitting ALS model with rank = 10 and lambda = 0.01
## Fitting ALS model with rank = 20 and lambda = 0.01
## Fitting ALS model with rank = 50 and lambda = 0.01
## Fitting ALS model with rank = 10 and lambda = 0.1
## Fitting ALS model with rank = 20 and lambda = 0.1
## Fitting ALS model with rank = 50 and lambda = 0.1
## Fitting ALS model with rank = 10 and lambda = 1
## Fitting ALS model with rank = 20 and lambda = 1
## Fitting ALS model with rank = 50 and lambda = 1
results_tbl <- dplyr::bind_rows(results)
print(results_tbl)
## # A tibble: 9 × 4
## rank lambda rmse mae
## <dbl> <dbl> <dbl> <dbl>
## 1 10 0.01 4.66 3.50
## 2 20 0.01 4.78 3.62
## 3 50 0.01 4.92 3.75
## 4 10 0.1 4.52 3.42
## 5 20 0.1 4.56 3.48
## 6 50 0.1 4.58 3.52
## 7 10 1 4.55 3.65
## 8 20 1 4.55 3.65
## 9 50 1 4.55 3.65
It seems that the lowest values of RMSE and MAE were found where rank is ten and the lambda is .1. This printout was hard to follow, so a graph is going to be particularly helpful.
ggplot(results_tbl, aes(x = rank, y = rmse, color = as.factor(lambda))) +
geom_line() + geom_point() +
labs(title = 'ALS Hyperparameter Tuning', color = 'lambda')
Looking at the graph, it seems that having a regularization parameter of 1 produces a pretty consistent value for RMSE no matter how many latent features are used. However, the lowest value overall occured with a lambda of .1.
ggplot(results_tbl, aes(x = rank, y = mae, color = as.factor(lambda))) +
geom_line() + geom_point() +
labs(title = 'ALS Hyperparameter Tuning', color = 'lambda')
For both mean average error and root mean square error, the values stay consistent despite the number of latent features where the parameter is equal to one, where the error increases for the other values of lambda. It is clearer looking at mean average error that .1 is the optimal value of lambda.
best_als <- ml_als(
train_als,
rating_col = 'rating',
user_col = 'user',
item_col = 'item',
rank = 10,
reg_param = .1,
max_iter = 10,
nonnegative = TRUE
)
Now that the best Alternating Least Squares model has been identified. Let’s see how the model performs in predicting values for unrated jokes.
# making grid of all joke user combinations before removing those already rated
users <- train_als |>
select(user)|>
distinct()
jokes <- train_als |>
select(item) |>
distinct()
# made up a column to join on then deleted it
users_jokes <- users |>
mutate(temp = 1)|>
inner_join(jokes |> mutate(temp = 1), by = 'temp') |>
select(-temp)
rated_pairs <- train_als |>
select(user, item)
to_predict <- users_jokes |>
anti_join(rated_pairs, by = c('user', 'item'))
predictions <- ml_predict(best_als, to_predict)
At this point, I have a sense of the taste of user 5 let’s see what this model comes up with.
top_n_recs <- predictions |>
group_by(user) |>
mutate(rank = row_number(desc(prediction))) |>
filter(rank<=10) |>
ungroup()
head(collect(top_n_recs))
## # A tibble: 6 × 4
## user item prediction rank
## <int> <int> <dbl> <int>
## 1 7 140 5.22 1
## 2 7 119 4.74 2
## 3 7 123 4.73 3
## 4 7 132 4.61 4
## 5 7 72 4.46 5
## 6 7 127 4.32 6
user_recs_raw <- ml_recommend(best_als, type = 'items', n=10)
urr2 <- user_recs_raw |>
mutate(recommendations = sql('array(recommendations)'))
sdf_register(urr2, "user_recs_raw_tbl")
## # Source: table<`user_recs_raw_tbl`> [?? x 4]
## # Database: spark_connection
## user recommendations item rating
## <int> <list> <int> <dbl>
## 1 12 <list [1]> 138 5.02
## 2 12 <list [1]> 89 4.53
## 3 12 <list [1]> 17 3.67
## 4 12 <list [1]> 114 3.63
## 5 12 <list [1]> 106 3.61
## 6 12 <list [1]> 32 3.00
## 7 12 <list [1]> 105 2.63
## 8 12 <list [1]> 116 2.53
## 9 12 <list [1]> 73 2.29
## 10 12 <list [1]> 127 2.28
## # ℹ more rows
user_recs_flat <- tbl(sc, sql("
SELECT
user AS user_id, rec.item AS joke_id, rec.rating AS predicted_rating
FROM user_recs_raw_tbl
LATERAL VIEW explode(recommendations) exploded_table as rec"))
recs_text <- user_recs_flat |>
inner_join(jokes_tbl, by = c('joke_id' = 'item')) |>
select(user_id, joke_id, predicted_rating, text)
filtered_recs <- recs_text |>
anti_join(ratings_tbl |>
select(user_id, joke_id), by = c("user_id", 'joke_id'))
top_recs_user <- filtered_recs |>
filter(user_id == 5) |>
arrange(desc(predicted_rating))|>
head(10)|>
collect()|>
print()
## # A tibble: 8 × 4
## user_id joke_id predicted_rating text
## <int> <int> <dbl> <chr>
## 1 5 58 6.65 "How many teddybears does it take to change …
## 2 5 124 6.40 "Person 1: Hey, wanna hear a great knock-kno…
## 3 5 24 5.73 "What do you get when you run over a parakee…
## 4 5 141 5.63 "Jack Bauer can get McDonald's breakfast aft…
## 5 5 44 4.88 "A horse walks into a bar. Bartender says: \…
## 6 5 57 4.17 "Why are there so many Jones's in the phone …
## 7 5 74 3.93 "Q: How many stalkers does it take to change…
## 8 5 123 3.77 "When most people claim to be \"killing time…
This does not quite line up with the expected emphasis on political and occupational humor we have seen in the past. This time, we’re seeing a lot of the same jokes being recommended as our item-item cf model. Let’s look at another familiar user.
top_recs_user <- filtered_recs |>
filter(user_id == 5000) |>
arrange(desc(predicted_rating))|>
head(10)|>
collect()|>
print()
## # A tibble: 8 × 4
## user_id joke_id predicted_rating text
## <int> <int> <dbl> <chr>
## 1 5000 140 3.75 "Chuck Norris' calendar goes straight from M…
## 2 5000 123 3.17 "When most people claim to be \"killing time…
## 3 5000 38 1.32 "\"May I take your order?\" the waiter asked…
## 4 5000 56 1.17 "A man and Cindy Crawford get stranded on a …
## 5 5000 144 1.09 "A man is driving in the country one evening…
## 6 5000 45 1.08 "A boy comes home from school and tells his …
## 7 5000 40 1.00 "How many Irishmen does it take to change a …
## 8 5000 5 0.942 "Q.\tWhat's O. J. Simpson's Internet address…
This time, the recommended jokes are pretty similar to the past model. The jokes are generally based on dated celebrity references or gender. One issue I want to highlight was the persistent Java heap space which means I need more memory for my Java Virtual Machine. In the end, the error persisted even after trying to allocate more RAM to this process. In the end, I decided to look at the most commonly recommended jokes
top_recs <- filtered_recs|>
group_by(joke_id)|>
summarize(freq = n()) |>
arrange(desc(freq)) |>
head(10)|>
inner_join(jokes_tbl, by = c('joke_id' = 'item'))|>
collect()
print(top_recs)
## # A tibble: 10 × 3
## joke_id freq text
## <int> <dbl> <chr>
## 1 27 12067 "Clinton returns from a vacation in Arkansas and walks down t…
## 2 31 8986 "President Clinton looks up from his desk in the Oval Office t…
## 3 50 8747 "A guy goes into confession and says to the priest, \"Father, …
## 4 60 7294 "What did the Buddhist say to the hot dog vendor? Make me one …
## 5 80 9948 "Hillary, Bill Clinton and the Pope are sitting together on an…
## 6 98 7432 "Age and Womanhood 1. Between the ages of 13 and 18 ... S…
## 7 116 10611 "A man in a hot air balloon realized he was lost. He reduced a…
## 8 123 10410 "When most people claim to be \"killing time\", it's only an e…
## 9 138 8119 "WASHINGTON (Reuters) - A tragic fire on Monday destroyed the …
## 10 140 10011 "Chuck Norris' calendar goes straight from March 31st to April…
It is interesting to note the frequency of dated cultural references in these jokes but this is not universal so I am not as worried about the prevalence of dated jokes in the recommender.
With this model, I was more reliant on the processing power of Spark to deal with this large dataset. The main limitation was highlighted upon generating the recommendations, causing the JVM to run out of memory, rendering that power useless.
The main issue in this model for this assignment is that the program that creates the model requires data frames written to separate places and not the tables used in Spark.
# already in the form of three columns corresponding to user, item, and rating
ratings_df <- ratings_tbl |> collect()
splits <- ratings_tbl |>
sdf_random_split(training = .8, test = .2, seed = 24601)
train_svd <- splits$training
test_svd <- splits$test
r <- Reco()
train_file <- tempfile()
write.table(train_svd, file = train_file, sep = ' ', row.names = FALSE, col.names = FALSE)
test_file <- tempfile()
write.table(test_svd, file = test_file, sep = ' ', row.names = FALSE, col.names = FALSE)
To find the best SVD model, hyperparameter tuning is an important step, but one that costs a lot of memory. Apparently this cannot be done in spark because of the limitations of the recosystem package.
opts <- r$tune(train_file, opts = list(dim = c(10, 20, 30), lrate = c(.05, .1), niter = 10))
## Warning in r$tune(train_file, opts = list(dim = c(10, 20, 30), lrate = c(0.05, : API has changed since version 0.4
## use data_file(path) for argument 'train_data' instead
r$train(train_file, opts = c(opts$min, niter=20))
## Warning in r$train(train_file, opts = c(opts$min, niter = 20)): API has changed since version 0.4
## use data_file(path) for argument 'train_data' instead
## iter tr_rmse obj
## 0 4.5636 3.2959e+07
## 1 4.1712 2.8468e+07
## 2 3.9875 2.6356e+07
## 3 3.9079 2.5528e+07
## 4 3.8444 2.4944e+07
## 5 3.7750 2.4370e+07
## 6 3.6956 2.3749e+07
## 7 3.6103 2.3099e+07
## 8 3.5235 2.2454e+07
## 9 3.4367 2.1821e+07
## 10 3.3541 2.1229e+07
## 11 3.2771 2.0688e+07
## 12 3.2079 2.0196e+07
## 13 3.1465 1.9774e+07
## 14 3.0921 1.9395e+07
## 15 3.0437 1.9066e+07
## 16 3.0005 1.8772e+07
## 17 2.9614 1.8509e+07
## 18 2.9256 1.8273e+07
## 19 2.8925 1.8054e+07
pred_file <- tempfile()
r$predict(test_file, pred_file)
## Warning in r$predict(test_file, pred_file): API has changed since version 0.4
## use data_file(path) for argument 'test_data' instead
## Warning in r$predict(test_file, pred_file): API has changed since version 0.4
## use out_file(path) for argument 'out_pred' instead
## prediction output generated at C:\Users\ddebo\AppData\Local\Temp\RtmporOQTa\file98c176f387d
preds <- scan(pred_file)
test_svd_df <- test_svd |>
select(user_id, joke_id, rating) |>
collect()
write.table(test_svd_df, file = "test_data.txt", sep = ' ', row.names=FALSE, col.names=FALSE)
r$predict(data_file('test_data.txt'), out_file('pred.txt'))
## prediction output generated at pred.txt
preds <- scan('pred.txt')
actuals <- test_svd_df$rating
rmse <- sqrt(mean((preds - actuals)^2))
mae <- mean(abs(preds - actuals))
rmse
## [1] 4.032697
mae
## [1] 3.032528
Finally got values for Root Mean Squared Error and Mean Average Error, both of which are the lowest of any model seen thus far. I’d like to visualize the tuning process as I did with the ALS model, so I need to get the results from the tuning in a plotable form. The problem is in getting everything in the df form to be understood by ggplot, I couldn’t take advantage of Spark’s decreased processing time, so this took hours on knitting.
params <- expand.grid(
dim = c(10, 20, 30),
lrate = c(.05, .1)
)
results <- data.frame()
for (i in 1:nrow(params)) {
r <- Reco()
this_result <- r$tune(
train_data = train_file,
opts = list(
dim = params$dim[i],
lrate = params$lrate[i]
)
)
results <- rbind(results, cbind(params[i, ], rmse=this_result$min))
}
## Warning in r$tune(train_data = train_file, opts = list(dim = params$dim[i], : API has changed since version 0.4
## use data_file(path) for argument 'train_data' instead
## Warning in r$tune(train_data = train_file, opts = list(dim = params$dim[i], : API has changed since version 0.4
## use data_file(path) for argument 'train_data' instead
## Warning in r$tune(train_data = train_file, opts = list(dim = params$dim[i], : API has changed since version 0.4
## use data_file(path) for argument 'train_data' instead
## Warning in r$tune(train_data = train_file, opts = list(dim = params$dim[i], : API has changed since version 0.4
## use data_file(path) for argument 'train_data' instead
## Warning in r$tune(train_data = train_file, opts = list(dim = params$dim[i], : API has changed since version 0.4
## use data_file(path) for argument 'train_data' instead
## Warning in r$tune(train_data = train_file, opts = list(dim = params$dim[i], : API has changed since version 0.4
## use data_file(path) for argument 'train_data' instead
tune_df <- as.data.frame(results)
ggplot(tune_df, aes(x = dim, y = rmse.loss_fun, color = as.factor(lrate))) +
geom_point() +
geom_line() +
labs(title = 'RMSE by Number of Latent Factors', x = 'Number of latent factors', y = 'Validation RMSE', color = 'Learning Rate') +
theme_minimal()
All versions of the model have a lower RMSE than the other two models, with the 20 factor model with the learning rate of .05 getting the lowest RMSE. Now to continue investigating by seeing the results generated for some users.
train_svd_df <- train_svd |>
select(user_id, joke_id, rating) |>
collect()
r <- Reco()
train_data_svd <- data_memory(
user_index = train_svd_df$user_id,
item_index = train_svd_df$joke_id,
rating = train_svd_df$rating
)
r$train(train_data_svd, opts = list(dim = 20, lrate = .05, nthread = 2, niter=20, verbose = TRUE))
## iter tr_rmse obj
## 0 4.6706 3.3625e+07
## 1 4.2813 2.8678e+07
## 2 4.0815 2.6450e+07
## 3 3.9483 2.4973e+07
## 4 3.8627 2.4099e+07
## 5 3.7811 2.3339e+07
## 6 3.6937 2.2577e+07
## 7 3.6041 2.1814e+07
## 8 3.5198 2.1113e+07
## 9 3.4418 2.0484e+07
## 10 3.3701 1.9917e+07
## 11 3.3048 1.9416e+07
## 12 3.2455 1.8955e+07
## 13 3.1938 1.8560e+07
## 14 3.1477 1.8205e+07
## 15 3.1077 1.7897e+07
## 16 3.0725 1.7628e+07
## 17 3.0415 1.7389e+07
## 18 3.0139 1.7172e+07
## 19 2.9896 1.6986e+07
user_id = 5
rated_items <- ratings_df |>
filter(user_id == !!user_id) |>
pull(joke_id)
all_jokes <- unique(ratings_df$joke_id)
items_to_predict <- setdiff(all_jokes, rated_items)
user_predict_df <- data.frame(
user = rep(user_id, length(items_to_predict)),
item = items_to_predict
)
pred_data <- data_memory(
user_index= user_predict_df$user,
item_index = user_predict_df$item
)
user_predict_df$predicted_rating <- r$predict(pred_data, out_memory())
top_10_svd <- user_predict_df |>
arrange(desc(predicted_rating)) |>
slice_head(n = 10)
top_10_svd <- top_10_svd |>
left_join(joke_text_df, by = 'item') |>
select(item, predicted_rating, text)
top_10_svd
## item predicted_rating
## 1 132 2.80467916
## 2 127 1.73782670
## 3 129 1.34252834
## 4 119 0.99050230
## 5 105 0.57031387
## 6 56 0.29345974
## 7 130 0.20472500
## 8 32 0.06898821
## 9 126 -0.10279253
## 10 69 -0.10415936
## text
## 1 Mickey Mouse is having a nasty divorce with Minnie Mouse. Mickey spoke to the judge about the separation."I'm sorry Mickey, but I can't legally separate you two on the grounds that Minnie is mentally insane..." Mickey replied, "I didn't say she was mentally insane, I said that she's fucking Goofy!"
## 2 A little boy goes to his dad and asks, "What is politics?"His dad says, "Well son, let me try to explain it this way: I'm the breadwinner of the family, so let's call me capitalism. Your Mom, she's the administrator of the money, so we'll call her the government. We're here to take care of your needs, so we'll call you the people. The nanny, we'll consider her the working class. And your baby brother, we'll call him the future. Now, think about that and see if that makes sense." So the little boy goes off to bed thinking about what dad had said. Later that night, he hears his baby brother crying, so he gets up to check on him. He finds that the baby has severely soiled his diaper. So the little boy goes to his parents' room and finds his mother sound asleep. Not wanting to wake her, he goes to the nanny's room. Finding the door locked, he peeks in the keyhole and sees his father in bed with the nanny. He gives up and goes back to bed. The next morning, the little boy says to his father, "Dad, I think I understand the concept of politics now." The father says, "Good, son. Tell me in your own words what you think politics is all about." The little boy replies, "Well, while capitalism is screwing the working class, the government is sound asleep, the people are being ignored and the future is in deep shit."
## 3 A group of girlfriends is on vacation when they see a 5-story hotel with a sign that reads: "For Women Only." Since they are without their boyfriends and husbands, they decide to go in. The bouncer, a very attractive guy, explains to them how it works. "We have 5 floors. Go up floor by floor, and once you find what you are looking for, you can stay there. It's easy to decide since each floor has a sign telling you what's inside." So they start going up and on the first floor the sign reads: "All the men on this floor are short and plain." The friends laugh and without hesitation move on to the next floor. The sign on the second floor reads: "All the men here are short and handsome." Still, this isn't good enough, so the friends continue on up. They reach the third floor and the sign reads: "All the men here are tall and plain." They still want to do better, and so, knowing there are still two floors left, they continued on up. On the fourth floor, the sign is perfect: "All the men here are tall and handsome." The women get all excited and are going in when they realize that there is still one floor left. Wondering what they are missing, they head on up to the fifth floor. There they find a sign that reads: "There are no men here. This floor was built only to prove that there is no way to please a woman."
## 4 One day the first grade teacher was reading the story of the Three Little Pigs to her class. She came to the part of the story where the first pig was trying to accumulate the building materials for his home. She read, "...and so the pig went up to the man with the wheelbarrow full of straw and said, 'Pardon me sir, but may I have some of that straw to build my house?'" The teacher paused then asked the class, "And what do you think that man said?" One little boy raised his hand and said, "I know...he said, 'Holy Shit! A talking pig!'"
## 5 A couple of hunters are out in the woods in the deep south when one of them falls to the ground. He doesn't seem to be breathing, and his eyes are rolled back in his head.The other guy whips out his cell phone and calls 911. He gasps to the operator, "My friend is dead! What can I do?" The operator, in a calm and soothing voice, says, "Alright, take it easy. I can help. First, let's make sure he's dead." There is silence, and then a gun shot is heard.The hunter comes back on the line. "Okay. Now what??"
## 6 A man and Cindy Crawford get stranded on a desert island. After a couple of days they fall in love and start sleeping together. Time pass the man seems frustrated, Cindy asks if there is anything she can do? He says there is one thing, "Could you put on this baseball cap and go to the other side of the island and answer me when I call you Bob?" She agrees. Next day he is walking on the other side of the island, runs into her and says "Hi Bob!" She says "Hello, what's up?" He replies: "Bob you won't believe it: I've been sleeping with Cindy Crawford for the past two weeks!!!!"
## 7 An old man goes to the doctor for his yearly physical, his wife tagging along. When the doctor enters the examination room, he tells the old man, "I need a urine sample, a stool sample and a sperm sample." The old man, being hard of hearing, looks at his wife and yells: "WHAT? What did he say? What's he want?" His wife yells back, "He needs your underwear."
## 8 A man arrives at the gates of heaven. St. Peter asks, "Religion?" The man says, "Methodist." St. Peter looks down his list, and says, "Go to room 24, but be very quiet as you pass room 8." Another man arrives at the gates of heaven. "Religion?" "Baptist." "Go to room 18, but be very quiet as you pass room 8." A third man arrives at the gates. "Religion?" "Jewish." "Go to room 11, but be very quiet as you pass room 8." The man says, "I can understand there being different rooms for different religions, but why must I be quiet when I pass room 8?" St. Peter tells him, "Well the Catholics are in room 8, and they think they're the only ones here.
## 9 A Briton, a Frenchman and a Russian are viewing a painting of Adam and Eve frolicking in the Garden of Eden. "Look at their reserve, their calm," muses the Brit. "They must be British." "Nonsense," the Frenchman disagrees. "They're naked, and so beautiful. Clearly, they are French." "No way! They have no clothes and no shelter," the Russian points out, "They have only an apple to eat, and they are being told they live in a paradise. Obviously, they are Russian."
## 10 This guys wife asks, "Honey if I died would you remarry?" and he replies, "Well, after a considerable period of grieving, we all need companionship, I guess I would." She then asks, "If I died and you remarried, would she live in this house?" and he replies, "We've spent a lot of time and money getting this house just the way we want it. I'm not going to get rid of my house, I guess she would." "If I died and you remarried, and she lived in this house, would she sleep in our bed?" and he says, "That bed is brand new, we just paid two thousand dollars for it, it's going to last a long time, I guess she would." So she asks, "If I died and you remarried, and she lived in this house, and slept in our bed, would she use my golf clubs?" "Oh no, she's left handed."
This set of jokes is closer to what user five was recommended in previous models, seeing more jokes about politicians and Chuck Norris, less about wordplay. Also the ratings are fairly low for the top recommendations compared to the others. Moving on to user 5000.
user_id = 5000
rated_items <- ratings_df |>
filter(user_id == !!user_id) |>
pull(joke_id)
all_jokes <- unique(ratings_df$joke_id)
items_to_predict <- setdiff(all_jokes, rated_items)
user_predict_df <- data.frame(
user = rep(user_id, length(items_to_predict)),
item = items_to_predict
)
pred_data <- data_memory(
user_index= user_predict_df$user,
item_index = user_predict_df$item
)
user_predict_df$predicted_rating <- r$predict(pred_data, out_memory())
top_10_svd <- user_predict_df |>
arrange(desc(predicted_rating)) |>
slice_head(n = 10)
top_10_svd <- top_10_svd |>
left_join(joke_text_df, by = 'item') |>
select(item, predicted_rating, text)
top_10_svd
## item predicted_rating
## 1 75 1.8099052
## 2 142 0.3741921
## 3 120 -0.1494167
## 4 143 -0.1907443
## 5 20 -0.2726785
## 6 79 -0.3540100
## 7 115 -0.3669051
## 8 146 -0.5688461
## 9 58 -0.6347480
## 10 124 -0.7673085
## text
## 1 Q: Do you know the difference between an intelligent male and the Sasquatch? A: There have been actual reported sightings of the Sasquatch.
## 2 One day, three men went to a shrine to ask the Father for forgiveness. The first man went to the Father...First Man: "Father, Father, I have sinned!" Father: "What have you done?" First Man: "I have lied!" Father: "Drink the holy water and you will be saved." And so the man drank the water and was "saved." The second man went up to the Father... Second Man: "Father, Father, I have sinned!" Father: "What have you done?" Second Man: "I have stolen from the jeweler's!" Father: "Drink the holy water and you will be saved." And so the second man drank the holy water and was "saved." The third man went up to the Father... Third Man: "Father, Father, I have sinned!" Father: "What have you done?" Third Man: "I peed in the holy water!"
## 3 Judy was having trouble with her computer, so she called Tony, the computer guy, over to her desk. Tony clicked a couple buttons and solved the problem. As he was walking away, Judy called after him, "So, what was wrong?" And he replied, "It was an ID Ten T Error." A puzzled expression ran riot over Judy's face. "An ID Ten T Error? What's that...in case I need to fix it again?" He gave her a grin..."Haven't you ever heard of an ID Ten T Error before?" "No," replied Judy. "Write it down," he said, "and I think you'll figure it out."(She wrote...) I D 1 0 T
## 4 A preist, a 12-year-old kid, and the smartest guy in the world are on a plane. The pilot screams, "The plane is going down! You have to jump!" He then grabs a parachute and jumps off, leaving only two more parachutes on the plane. The smartest guy in the world says, "I have to go. I mean, I'm the smartest guy in the world!" He grabs a parachute, and jumps. The priest then looks at the 12-year-old kid, and says, "Go, my son. You have a long life to live." The kid calmly responds: "Dude, chill. We'll be fine. The 'smartest guy in the world' took my backpack."
## 5 What's the difference between a MacIntosh and an Etch-A-Sketch? You don't have to shake the Mac to clear the screen.
## 6 Q: Ever wonder why the IRS calls it Form 1040? A: Because for every $50 that you earn, you get 10 and they get 40.
## 7 A lady bought a new Lexus. It cost a bundle. Two days later, she brought it back, complaining that the radio was not working."Madam," said the sales manager, "the audio system in this car is completely automatic. All you need to do is tell it what you want to listen to, and you will hear exactly that!" She drove out, somewhat amazed and a little confused. She looked at the radio and said, "Nelson." The radio responded, "Ricky or Willie?" She was astounded. If she wanted Beethoven, that's what she got. If she wanted Nat King Cole, she got it. She was stopped at a traffic light enjoying "On the Road Again" when the light turned green and she pulled out. Suddenly an enormous sports utility vehicle coming from the street she was crossing sped toward her, obviously not paying attention to the light. She swerved and narrowly missed a collision."Idiot!" she yelled and, from the radio, "Ladies and gentlemen, the President of the United States."
## 8 America: 8:00 - Welcome to work! 12:00 - Lunch break 17:00 - The work day is over Japan: 8:00 - Are you already at work? 12:00 - Continue your work 17:00 - The work day is over 20:00 - Please finish your work Romania: 8:00 - Has anyone come to work? 12:00 - Did someone start working? 16:00 - Is anyone at work?
## 9 How many teddybears does it take to change a lightbulb? It takes only one teddybear, but it takes a whole lot of lightbulbs.
## 10 Person 1: Hey, wanna hear a great knock-knock joke? Person 2: Sure, What is it? Person 1: Okay, you start. Person 2: Knock-knock. Person 1: Who's there? Person 2: … Person 1: Hah!