Assignment 11

Approach

This is a personalized recommendation system, so it depends heavily on each user's own ratings. I use user-user collaborative filtering: compile the ratings into a user-by-movie matrix, compute pairwise similarity between users from their existing ratings, and recommend movies that similar users rated highly.
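To make the similarity step concrete, here is a minimal base R sketch of cosine similarity on mean-centered ratings. The toy matrix, the user and movie labels, and the helper `cosine_sim` are hypothetical, and the convention of comparing only co-rated movies is an assumption; recommenderlab may handle missing ratings differently internally.

```r
# Hypothetical toy ratings (rows = users, columns = movies); NA = not rated.
ratings <- matrix(
  c(5,  4, NA,  1,
    4,  5,  2, NA,
    1, NA,  5,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("A", "B", "C", "D"))
)

# Mean-center each user's ratings, ignoring missing values.
centered <- ratings - rowMeans(ratings, na.rm = TRUE)

# Cosine similarity restricted to the movies both users have rated
# (an assumed convention for this sketch).
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (sum(both) < 2) return(NA_real_)
  sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
}

# Pairwise user-user similarity matrix.
users <- rownames(centered)
sim <- outer(seq_along(users), seq_along(users),
             Vectorize(function(i, j) cosine_sim(centered[i, ], centered[j, ])))
dimnames(sim) <- list(users, users)
round(sim, 2)
```

In this toy data u1 and u2 agree on movies A and B, so their similarity is positive, while u1 and u3 disagree on A and D, so theirs is negative.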

Code Base

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(recommenderlab)

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loading required package: arules

Attaching package: 'arules'

The following object is masked from 'package:dplyr':

    recode

The following objects are masked from 'package:base':

    abbreviate, write

Loading required package: proxy

Attaching package: 'proxy'

The following object is masked from 'package:Matrix':

    as.matrix

The following objects are masked from 'package:stats':

    as.dist, dist

The following object is masked from 'package:base':

    as.matrix

url <- "https://raw.githubusercontent.com/AslamF/DATA607-Assignment-3A-/refs/heads/main/finaldata.csv"
raw <- read.csv(url, row.names = 1, check.names = FALSE)

# Convert to realRatingMatrix (recommenderlab's required format)
r <- as.matrix(raw)
r <- apply(r, 2, as.numeric)
rownames(r) <- rownames(raw)
rmat <- as(r, "realRatingMatrix")

# Build User-Based Collaborative Filtering model
model <- Recommender(rmat, method = "UBCF",
                     parameter = list(method = "cosine", nn = 3, normalize = "center"))

# Generate top-3 recommendations for all users
recs <- predict(model, rmat, n = 3)
as(recs, "list")

$`0`
[1] "Deadpool"

$`1`
character(0)

$`2`
[1] "JungleBook"

$`3`
[1] "PitchPerfect2" "JungleBook"    "Frozen"       

$`4`
[1] "Deadpool"   "JungleBook"

$`5`
[1] "StarWarsForce" "Deadpool"     

$`6`
character(0)

$`7`
[1] "Deadpool"   "JungleBook"

$`8`
[1] "PitchPerfect2" "JungleBook"   

$`9`
character(0)

$`10`
[1] "PitchPerfect2"

$`11`
[1] "CaptainAmerica" "Deadpool"       "PitchPerfect2" 

$`12`
character(0)

$`13`
[1] "Deadpool"   "JungleBook"

$`14`
[1] "StarWarsForce"

$`15`
[1] "Deadpool"       "CaptainAmerica" "StarWarsForce"
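One way to investigate the empty character(0) results is to count how many movies each user has left to be recommended. The sketch below uses a hypothetical toy matrix rather than the assignment data, and assumes missing ratings are stored as NA.

```r
# Hypothetical toy ratings (rows = users, columns = movies); NA = not rated.
ratings <- matrix(
  c( 5,  4, 3,
     4, NA, 2,
    NA, NA, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("A", "B", "C"))
)

# Number of movies each user has not rated yet:
# this is the candidate pool a recommender can draw from.
n_candidates <- rowSums(is.na(ratings))
n_candidates

# Users who have rated everything cannot receive any recommendation.
names(n_candidates)[n_candidates == 0]
```

Assuming unrated movies are NA in the numeric matrix `r` built earlier, the same count could be taken with `rowSums(is.na(r))`.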

# Evaluation: 80/20 split, measure RMSE and MAE
set.seed(42)
eval_scheme <- evaluationScheme(rmat, method = "split", train = 0.8,
                                given = -1, goodRating = 4)

Warning in .local(data, ...): The following users do not have enough ratings
leaving no given items: 8

eval_results <- evaluate(eval_scheme, method = "UBCF",
                         parameter = list(method = "cosine", nn = 3, normalize = "center"),
                         type = "ratings")

UBCF run fold/sample [model time/prediction time]
     1  [0.001sec/0.018sec]

cat("RMSE / MSE / MAE:\n")

RMSE / MSE / MAE:

print(avg(eval_results))

         RMSE      MSE      MAE
[1,] 1.409958 1.987982 1.307616
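For reference, the three error metrics reported above can be computed by hand as follows. The rating vectors are hypothetical illustrative values, not the held-out ratings from this evaluation.

```r
# Hypothetical held-out ratings and model predictions (illustrative values only).
actual    <- c(4, 2, 5, 3)
predicted <- c(3.5, 3.0, 4.0, 3.5)

err  <- predicted - actual
mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error, in rating units (stars)
mae  <- mean(abs(err))   # mean absolute error, in rating units (stars)

c(RMSE = rmse, MSE = mse, MAE = mae)
```

RMSE is the square root of MSE, which matches the reported values (1.409958 squared is about 1.987982). Because squaring weights large misses more heavily, RMSE is never smaller than MAE.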

Conclusion

This project built a user-based collaborative filtering (UBCF) recommender with the recommenderlab package in R. The model computes cosine similarity between users on their mean-centered ratings, takes each user's 3 most similar neighbors, and predicts ratings for unseen movies, returning up to 3 personalized recommendations per user. Of the 16 users, 12 received at least one recommendation. The other 4 (users 1, 6, 9, and 12) returned character(0), which can happen when a user's neighbors have too few overlapping ratings to score any unseen movie, or when the user has already rated every movie in the list. Both are common limitations with sparse data. Evaluation on an 80/20 split produced an RMSE of 1.41 and an MAE of 1.31, so predictions were off by about 1.3 stars on average. That error is higher than ideal but unsurprising given only 16 users and a sparse matrix, and the estimate itself comes from a single split of a very small user base. A larger dataset would likely improve both the accuracy and the reliability of the evaluation.
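The neighbor-based prediction step described above can be sketched in base R as follows. All values here are hypothetical (the centered ratings, the similarities, and the target user's mean), and the unweighted average over neighbors is a simplification for illustration; it is not necessarily the exact formula recommenderlab uses.

```r
# Hypothetical mean-centered ratings (rows = users, columns = movies); NA = not rated.
centered <- matrix(
  c( 1.0, -1.0,   NA,  NA,
     0.5, -0.5,  1.0,  NA,
     1.5, -1.5,  0.5,  NA,
    -1.0,  1.0, -2.0, 1.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("target", "n1", "n2", "far"), c("A", "B", "C", "D"))
)
target_mean <- 3.5                          # hypothetical mean rating of the target user
sims <- c(n1 = 0.9, n2 = 0.8, far = -0.7)   # hypothetical similarities to the target

# Keep the k most similar users as the neighborhood.
k <- 2
neighbors <- names(sort(sims, decreasing = TRUE))[seq_len(k)]

# Predict each unrated movie as the target's mean plus the
# neighbors' average centered rating (unweighted, for simplicity).
unrated <- colnames(centered)[is.na(centered["target", ])]
pred <- sapply(unrated, function(movie) {
  vals <- centered[neighbors, movie]
  if (all(is.na(vals))) return(NA_real_)    # no neighbor rated it: nothing to predict
  target_mean + mean(vals, na.rm = TRUE)
})
pred
```

Movie C gets a prediction because both neighbors rated it. Movie D does not, because only the dissimilar user rated it. When every unrated movie ends up like D, the user has nothing to be recommended, which is one way an empty result can arise.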