M.A._Data607_Assignment11

Author

Muhammad Ahmad

Approach

I chose to use an existing package for content-based filtering: recommenderlab, authored by Michael Hahsler. I would like my algorithm to recommend the top 5 items based on each customer's purchase history. I will test the algorithm with rank-based metrics, including precision, recall, and F1 score, to see how accurate it is on the data.
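As a rough illustration of the rank-based metrics named above, the sketch below computes precision, recall, and F1 for a single user's top-N list in base R. The helper name `top_n_metrics` and the item labels are hypothetical and are not part of recommenderlab.

```r
# Hypothetical helper: rank-based metrics for one user's top-N list.
# `recommended` = items the model suggested
# `relevant`    = items the user actually liked (ground truth)
top_n_metrics <- function(recommended, relevant) {
  hits <- length(intersect(recommended, relevant))
  precision <- if (length(recommended) > 0) hits / length(recommended) else 0
  recall    <- if (length(relevant) > 0) hits / length(relevant) else 0
  f1 <- if (precision + recall > 0) {
    2 * precision * recall / (precision + recall)
  } else {
    0
  }
  c(precision = precision, recall = recall, f1 = f1)
}

# Made-up example: 5 recommendations, 2 of which are among the 3 relevant items
metrics <- top_n_metrics(
  recommended = c("A", "B", "C", "D", "E"),
  relevant    = c("B", "E", "F")
)
print(metrics)
```

Here precision is 2/5, recall is 2/3, and F1 is their harmonic mean, 0.5.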

Challenges

The main challenge I anticipate is missing data: if a customer has too little history, the system cannot make accurate recommendations. The first fix I can think of is to recommend items that are popular in the customer's area. I am also worried that the system might overspecialize on certain items based on a customer's data. I experience this problem as a user of applications like YouTube, where I keep getting recommended videos similar to ones I watched a lot recently but have since grown tired of. Using weights, or mixing in some randomly selected items, may help diversify the recommendations.
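One possible way to sketch the popularity fallback described above is shown below in base R. It assumes popularity is measured as the mean rating across all users rather than by region, since no location data is available here. The toy matrix and the helper `popular_fallback` are made up for illustration.

```r
# Toy ratings (rows = users, columns = items); NA = not rated. Values are made up.
toy <- matrix(c( 5, NA,  3,
                 4,  2, NA,
                NA, NA, NA),   # u3 has no history (cold start)
              nrow = 3, byrow = TRUE,
              dimnames = list(c("u1", "u2", "u3"), c("A", "B", "C")))

# Hypothetical helper: rank the items a user has not rated by overall mean rating.
popular_fallback <- function(m, user, n = 2) {
  item_means <- colMeans(m, na.rm = TRUE)       # average rating per item
  unseen <- names(which(is.na(m[user, ])))      # items this user has not rated
  head(names(sort(item_means[unseen], decreasing = TRUE)), n)
}

fallback <- popular_fallback(toy, "u3")
print(fallback)
```

For the user with no history, the fallback returns the two items with the highest average rating ("A", then "C").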

Libraries

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(recommenderlab)

Loading required package: Matrix

Loading required package: arules


Attaching package: 'arules'

The following object is masked from 'package:dplyr':

    recode

The following objects are masked from 'package:base':

    abbreviate, write

Loading required package: proxy


Attaching package: 'proxy'

The following object is masked from 'package:Matrix':

    as.matrix

The following objects are masked from 'package:stats':

    as.dist, dist

The following object is masked from 'package:base':

    as.matrix

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.1     ✔ readr     2.2.0
✔ ggplot2   4.0.2     ✔ stringr   1.6.0
✔ lubridate 1.9.5     ✔ tibble    3.3.1
✔ purrr     1.2.1     ✔ tidyr     1.3.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::expand()  masks Matrix::expand()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ tidyr::pack()    masks Matrix::pack()
✖ arules::recode() masks dplyr::recode()
✖ tidyr::unpack()  masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyr)
library(readxl)
library(caret)

Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

The following objects are masked from 'package:recommenderlab':

    MAE, RMSE

Dataset

url <- "https://raw.githubusercontent.com/MuhammadAhmad0006/Data607_Assignment11/refs/heads/main/MovieRatings.csv"


MovieRatingsRAW <- read.csv(url)
head(MovieRatingsRAW)

     Critic CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2
1    Burton             NA       NA     NA          4            NA
2   Charley              4        5      4          3             2
3       Dan             NA        5     NA         NA            NA
4 Dieudonne              5        4     NA         NA            NA
5      Matt              4       NA      2         NA             2
6  Mauricio              4       NA      3          3             4
  StarWarsForce
1             4
2             3
3             5
4             5
5             5
6            NA

After reviewing this data set, I believe the best personalized recommendation algorithm for it is user-to-user collaborative filtering, because the only information available is each user's ratings.
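To make the idea of user-to-user collaborative filtering concrete, here is a minimal base R sketch: it measures how similar two users are on the items both have rated, then predicts a missing rating as a similarity-weighted average of other users' ratings. This is a simplified illustration with made-up data and hypothetical helper names (`cosine_sim`, `predict_ubcf`), not recommenderlab's internal implementation.

```r
# Toy ratings (rows = users, columns = items); NA = not rated. Values are made up.
toy <- matrix(c(5, 4, NA,
                5, 4,  2,
                1, 2,  5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("u1", "u2", "u3"), c("A", "B", "C")))

# Cosine similarity computed only over items both users have rated
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(0)
  sum(x[both] * y[both]) / (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

# Predict `user`'s rating of `item` as a similarity-weighted mean of the
# ratings given to that item by the other users
predict_ubcf <- function(m, user, item) {
  others <- setdiff(rownames(m)[!is.na(m[, item])], user)
  sims <- sapply(others, function(o) cosine_sim(m[user, ], m[o, ]))
  sum(sims * m[others, item]) / sum(sims)
}

sim_u1_u2 <- cosine_sim(toy["u1", ], toy["u2", ])
pred_u1_C <- predict_ubcf(toy, "u1", "C")
print(sim_u1_u2)
print(pred_u1_C)
```

Because u1 and u2 gave identical ratings to the movies they share, their similarity is 1, and u2's rating of item C carries the most weight in the prediction for u1.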

MovieRatings_df <- as.data.frame(MovieRatingsRAW)
row.names(MovieRatings_df) <- MovieRatings_df[,1]
MovieRatings_df <- MovieRatings_df[,-1]
MovieRatings_df

          CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
Burton                NA       NA     NA          4            NA             4
Charley                4        5      4          3             2             3
Dan                   NA        5     NA         NA            NA             5
Dieudonne              5        4     NA         NA            NA             5
Matt                   4       NA      2         NA             2             5
Mauricio               4       NA      3          3             4            NA
Max                    4        4      4          2             2             4
Nathan                NA       NA     NA         NA            NA             4
Param                  4        4      1         NA            NA             5
Parshu                 4        3      5          5             2             3
Prashanth              5        5      5          5            NA             4
Shipra                NA       NA      4          5            NA             3
Sreejaya               5        5      5          4             4             5
Steve                  4       NA     NA         NA            NA             4
Vuthy                  4        5      3          3             3            NA
Xingjia               NA       NA      5          5            NA            NA

MovieRatings_df[] <- lapply(MovieRatings_df, as.numeric)
ratings_matrix <- as.matrix(MovieRatings_df)
ratings <- as(ratings_matrix, "realRatingMatrix")

Implementing User Based Collaborative Algorithm

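# 3-fold cross-validation; for each test user, 1 rating is given to the
# recommender ("known") and the rest are withheld ("unknown")
scheme <- evaluationScheme(ratings,
                           method = "cross-validation",
                           k = 3,
                           given = 1)

# Train a user-based collaborative filtering (UBCF) model on the training users
model <- Recommender(getData(scheme, "train"),
                     method = "UBCF")

# Predict the withheld ratings from the known ones
pred <- predict(model,
                getData(scheme, "known"),
                type = "ratings")

# Compare the predictions with the withheld ratings (RMSE, MSE, MAE)
accuracy <- calcPredictionAccuracy(pred,
                                   getData(scheme, "unknown"))
accuracy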

    RMSE      MSE      MAE 
1.302345 1.696104 1.088165

The MAE of about 1.09 indicates that the predicted ratings are off by roughly one star on average. With a data set this small (16 critics and 6 movies), an error of this size is expected because there is little data to learn from.
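For reference, the three error metrics reported above can be computed by hand as in the following sketch. The `actual` and `predicted` vectors are made-up values, not the model's output.

```r
# Made-up example ratings, for illustration only
actual    <- c(4, 3, 5, 2)
predicted <- c(3, 3, 4, 4)

err  <- predicted - actual   # per-rating error: -1, 0, -1, 2
mse  <- mean(err^2)          # mean squared error
rmse <- sqrt(mse)            # root mean squared error, in rating units (stars)
mae  <- mean(abs(err))       # mean absolute error, in rating units (stars)

print(c(RMSE = rmse, MSE = mse, MAE = mae))
```

RMSE is always the square root of MSE, which is consistent with the values reported above (1.302345 squared is about 1.696104). Because it squares the errors, RMSE penalizes large misses more heavily than MAE does.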

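# Refit the UBCF model on the full data set
model <- Recommender(ratings, method = "UBCF")

# Generate up to 3 recommendations for each critic
for(i in seq_len(nrow(ratings))){

  top3 <- predict(model, ratings[i], n = 3)
  rec_list <- as(top3, "list")

  # Print only the critics who received at least one recommendation
  if(length(rec_list[[1]]) > 0){
    cat("Top 3 Recommendations for", rownames(ratings_matrix)[i], ":\n")
    print(unlist(rec_list))
  }
}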

Top 3 Recommendations for Dieudonne :
             01              02              03 
   "JungleBook"        "Frozen" "PitchPerfect2" 
Top 3 Recommendations for Matt :
          01           02 
  "Deadpool" "JungleBook" 
Top 3 Recommendations for Mauricio :
             01              02 
"StarWarsForce"      "Deadpool" 
Top 3 Recommendations for Param :
             01              02 
   "JungleBook" "PitchPerfect2" 
Top 3 Recommendations for Prashanth :
              0 
"PitchPerfect2" 
Top 3 Recommendations for Shipra :
              01               02               03 
"CaptainAmerica"       "Deadpool"  "PitchPerfect2" 
Top 3 Recommendations for Vuthy :
              0 
"StarWarsForce"

The critics who received fewer than three recommendations have most likely already rated most of the movies. The recommender will not recommend a movie that a critic has already rated. Vuthy, for example, has rated every movie except StarWarsForce, so that was the only movie recommended.
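The point above can be checked by counting unrated movies per critic. The sketch below copies three rows from the ratings table shown earlier; the variable names `subset_ratings`, `unrated_count`, and `max_recs` are introduced here for illustration.

```r
# Three critics copied from the ratings table above (NA = not rated)
subset_ratings <- rbind(
  Dieudonne = c(5, 4, NA, NA, NA,  5),
  Prashanth = c(5, 5,  5,  5, NA,  4),
  Vuthy     = c(4, 5,  3,  3,  3, NA)
)
colnames(subset_ratings) <- c("CaptainAmerica", "Deadpool", "Frozen",
                              "JungleBook", "PitchPerfect2", "StarWarsForce")

# Number of movies each critic has not rated
unrated_count <- rowSums(is.na(subset_ratings))

# A top-3 list can never be longer than the number of unrated movies
max_recs <- pmin(unrated_count, 3)
print(max_recs)
```

Dieudonne has three unrated movies and can receive a full list of three, while Prashanth and Vuthy each have only one unrated movie, matching the single recommendation each received.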

Conclusions

Using the recommenderlab package, I built a user-based collaborative filter that recommends movies each critic would likely want to watch based on their ratings. I used cross-validation to split the data into training and testing sets. The accuracy metrics showed that the model's predictions are typically off by about one star. I then ran the model for each critic to recommend movies they have not yet rated.

Citations

Hahsler, M. (2022). recommenderlab: An R framework for developing and testing recommendation algorithms. CRAN. https://cran.r-project.org/package=recommenderlab