607 Assignment 11 Dylan Gold

Approach

In this assignment we have to develop a personalized recommended system. This assignment builds of a previous assignment where we build a global estimate system.
I intend to use the data that was given to us for the previous assignment.
There are 4 ways suggested, Content-based filtering, Item-to-item collaborative filtering, User-to-user collaborative filtering, Matrix factorization.
I have taken some classes on linear algebra so I considered the Matrix Factorization method but actually computing UV looks very tedious.
I would probably also use a k value of 2 just to keep it as simple as I can. I am not sure if k = 1 is valid but probably less successful. I also see that there is a function to find the derivative of an expression. As well as a way to get an expression equal to 0 with the stats package. I will try this out and hopefully it works.

f <- expression((x+1)^2)  # sample
d = D(f, "x") # Derivative, we need to minimize the root mean square error for UV decomposition. 
# We take the derivative(slope) find where its 0(local minima/maxima) and hope that's a minima.
# It should just be parabolic degree 2 so the local minima is the global minima
d2 <- function(x) eval(d) # Wrap in function, uniroot needs a function
uniroot(f = d2, lower = -10, upper = 10) # Gets the root with a range (finds when equal 0 in our case)

$root
[1] -1

$f.root
[1] -3.552714e-15

$iter
[1] 2

$init.it
[1] NA

$estim.prec
[1] 6.103516e-05

In the case that this does not work I will probably use a user to user collaborative filtering with a normalized cosine to measure the difference between users. It is described in Mining Massive Datasets in section 9.3.2

Codebase

Imports

library(tidyverse)

First we need to import the data for which we are creating a personalized recommendation system.
I will be reusing the given data in assignment 3A.

url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Assignment-3/refs/heads/main/MovieRatings3A.csv"

df <- read_csv(
  file = url,
  col_names = FALSE # Missing headers in file.
)

# Set the names of the column to the movies that were watched.
df <- df %>% set_names("Critic", "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect2", "StarWarsForce")

head(df, 5)

# A tibble: 5 × 7
  Critic   CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
  <chr>             <dbl>    <dbl>  <dbl>      <dbl>         <dbl>         <dbl>
1 Burton               NA       NA     NA          4            NA             4
2 Charley               4        5      4          3             2             3
3 Dan                  NA        5     NA         NA            NA             5
4 Dieudon…              5        4     NA         NA            NA             5
5 Matt                  4       NA      2         NA             2             5

We will keep our data in this non-tidy format. This format will make it easier to work with as a matrix for UV decomposition.
We will let the number of people equal to the rows and number of movies as the columns.
I will first just get the numeric values as a matrix.

M <- as.matrix(df)
M <- M[, -1] # drop first column (Critic names)

# Convert to numeric values
M <- matrix(
  as.numeric(trimws(M)),
  nrow = nrow(M)
)
M

      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]   NA   NA   NA    4   NA    4
 [2,]    4    5    4    3    2    3
 [3,]   NA    5   NA   NA   NA    5
 [4,]    5    4   NA   NA   NA    5
 [5,]    4   NA    2   NA    2    5
 [6,]    4   NA    3    3    4   NA
 [7,]    4    4    4    2    2    4
 [8,]   NA   NA   NA   NA   NA    4
 [9,]    4    4    1   NA   NA    5
[10,]    4    3    5    5    2    3
[11,]    5    5    5    5   NA    4
[12,]   NA   NA    4    5   NA    3
[13,]    5    5    5    4    4    5
[14,]    4   NA   NA   NA   NA    4
[15,]    4    5    3    3    3   NA
[16,]   NA   NA    5    5   NA   NA

I had tried the matrix factorization but I was not able to get it to work.

Rather I will try a user to user collaborative filtering. This method was described in 9.3.2 in Mining of Massive Data sets.
In collaborative filtering we are using how similar users are to other users (or items are to items).
What we will do and one method suggested by the text book is first find similar users(predetermined N) using some sort of distance measure, and take the average of those users for that specific item for the original users prediction. It is also suggested that the matrix is normalized first.

First I will normalize the matrix. We will just subtract everyone average rating from their own ratings.
We can see that each user has their reviews decreased by a fixed amount, their respective averages.

normalization_vector <- rowMeans(M, na.rm = TRUE) # Get vector of everyones average rating.
M_norm <- M - normalization_vector
M_norm

           [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
 [1,]        NA         NA         NA  0.0000000         NA  0.0000000
 [2,] 0.5000000  1.5000000  0.5000000 -0.5000000 -1.5000000 -0.5000000
 [3,]        NA  0.0000000         NA         NA         NA  0.0000000
 [4,] 0.3333333 -0.6666667         NA         NA         NA  0.3333333
 [5,] 0.7500000         NA -1.2500000         NA -1.2500000  1.7500000
 [6,] 0.5000000         NA -0.5000000 -0.5000000  0.5000000         NA
 [7,] 0.6666667  0.6666667  0.6666667 -1.3333333 -1.3333333  0.6666667
 [8,]        NA         NA         NA         NA         NA  0.0000000
 [9,] 0.5000000  0.5000000 -2.5000000         NA         NA  1.5000000
[10,] 0.3333333 -0.6666667  1.3333333  1.3333333 -1.6666667 -0.6666667
[11,] 0.2000000  0.2000000  0.2000000  0.2000000         NA -0.8000000
[12,]        NA         NA  0.0000000  1.0000000         NA -1.0000000
[13,] 0.3333333  0.3333333  0.3333333 -0.6666667 -0.6666667  0.3333333
[14,] 0.0000000         NA         NA         NA         NA  0.0000000
[15,] 0.4000000  1.4000000 -0.6000000 -0.6000000 -0.6000000         NA
[16,]        NA         NA  0.0000000  0.0000000         NA         NA

I was confused about this strategy because users who give all 3s will be the same as someone who gives all 5s but with collaborative filtering we want to see patterns like how much someone likes something more than another.
We need to now find users that are similar to eachother. Because so many users though had all the same rating which normalized to 0 the cosine difference will not work very well. The cosine diff treats 0s as unseen which is not good for us.

Instead of using the cosine difference like planned I will switch to the Jaccard distance.
The Jaccard divides the intersection by the union of the values.
Because our matrix has values describe how much they like a movie relative to their average, values equal to and above 0 will represent a 1, and less than 0 is 0. We then can look at the unions and the intersections.

M_Jac <- ifelse(M_norm >= 0, 1, 0)

Jac_diff <- function(x, y) {# Difference between two row/vectors,(users) x and y.

  neither_na <- !is.na(x) & !is.na(y) # Get columns that have no na values.
  
  x <- x[neither_na] # Update x and y, they still have corresponding columns
  y <- y[neither_na] 
  
  intersection = sum(x == y)
  
  union = length(x) # We only have values that are not na for both vectors now, counting how many gives us our union.
  
  if(union == 0) {return(0)} # Avoid divide by 0
  
  return(intersection/union)
}
# Test examples, 2 and 3 should have .5
Jac_diff(M_Jac[3,], M_Jac[2,])

[1] 0.5

Now that we have a metric to find how similar two users are we can find who are the most similar users.
We can make a matrix that has every pair of users distance from each other.

create_jac_matrix <- function(J){
  # Create N by N matrix N being number of users(rows)
  n <- nrow(J)
  new_matrix <- matrix(0, n, n) 
  # Iterate through and populate M
  for(i in 1:n){ # Iterate over rows
    for(j in 1:n){
      # Diagonal keep as 0
      if(i != j) (new_matrix[i,j] <- Jac_diff(J[i,], J[j,])) # Set equal to the jaccard distance for each person(should be 1 for itself?)
    }
  }
  return(new_matrix)
}

jac_matrix<- create_jac_matrix(M_Jac)

Now that we have a way to relate each user to each other we can find which users are most similar to each other, then use the average rating of those users to recommend movies.

get_recommended_users <- function(sim_matrix, n){ # gets the top n users returns as list of indicies.
  # Iterate through rows, sort the rows and take the top n values and append to list. 
  # Return the list
  rows <- nrow(sim_matrix)
  list <- (c())
  for(i in 1:rows){ # Iterate over matrix rows
    # The user itself should be 0, I will set to -1 to make sure they are not accidentally selected by themselves
    sim_matrix[i,i] <- -1
    # Get the top n values by sorting then indexing 1:n
    top_n <- order(sim_matrix[i,], decreasing = TRUE)[1:n] # get top n
    list[[length(list)+1]] = top_n
  }
  return(list)
}
closest_users <- get_recommended_users(jac_matrix, 2)

Now we have a list of the closest 2 users. We can now get a recommended rating through this. We are getting the average difference (from being normalized) of the similar users.

create_recommended_ratings <- function(ratings, related_users_list){ # Iterate through the rows of the users and get the average movie ratings of these users from the org_ratings. 
  rows <- nrow(ratings)
  cols <- ncol(ratings)
  new_matrix <- matrix(0, rows, cols) # Create new matrix to populate
  
  for(i in 1:rows){ # Iterate through users
      related_users <- related_users_list[[i]] # Get the list as a vector for the related_users
      for(j in 1:cols){ # iterate items
        current_ratings <- ratings[related_users, j] # Gets a list of the related users from the normalized ratings
        # If the current ratings are all na then no similar users rated different movies currently leaves as NA.
      
        new_matrix[i, j] <- mean(current_ratings, na.rm = TRUE) # the prediction is a average rating of these users
      }
  }
  return(new_matrix)
}
rec_diff <- create_recommended_ratings(M_norm,closest_users)

Finally we add this difference to the average mean score each user had. Some of the differences may be NA depending on N.

user_avgs <- rowMeans(M, na.rm = TRUE)
final_recommendations <- rec_diff + user_avgs
final_recommendations

          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
 [1,] 4.333333 3.666667      NaN      NaN      NaN 4.166667
 [2,] 4.000000 4.000000 4.000000 2.500000 2.500000 4.000000
 [3,] 5.750000      NaN 3.750000 5.000000 3.750000 5.875000
 [4,] 5.416667      NaN 3.416667 4.666667 3.416667 5.541667
 [5,]      NaN 3.250000      NaN 3.250000      NaN 3.250000
 [6,] 3.916667 3.416667 1.000000      NaN      NaN 4.416667
 [7,]      NaN 3.333333      NaN      NaN      NaN 3.333333
 [8,]      NaN 4.000000      NaN 4.000000      NaN 4.000000
 [9,]      NaN 3.500000      NaN 3.500000      NaN 3.500000
[10,]      NaN      NaN 3.666667 4.166667      NaN 2.666667
[11,]      NaN      NaN 4.800000 5.300000      NaN 3.800000
[12,] 4.266667 3.766667 4.766667 4.766667 2.333333 3.266667
[13,] 5.333333 5.000000 5.333333 3.333333 3.333333 5.000000
[14,]      NaN 4.000000      NaN 4.000000      NaN 4.000000
[15,] 4.350000 3.600000 2.350000      NaN 2.350000 4.475000
[16,] 5.333333 4.333333 6.333333 5.666667 3.333333 4.666667

This whole process was very messy. I will try to combine everything into a single function, that way we can also test using hold-out data easily by rerunning it with missing data.
I will also add back the data from the original dataframe

get_recommendations <- function(df, n) { # Original matrix, n users for user to user collaborative recommendations.
  df <- as.data.frame(df)
  
  col_names <- colnames(df) # save column(movie) names
  critics <- df[,1] # save rows(users)
  
  # Convert to matrix dropping the first column.
  M <- as.matrix(df)
  M <- M[, -1] # drop first column (Critic names)
  
  # Convert to numeric values
  M <- matrix(
    as.numeric(trimws(M)),
    nrow = nrow(M)
  )
    
  
  normalization_vector <- rowMeans(M, na.rm = TRUE) # Get vector of everyones average rating.
  M_norm <- M - normalization_vector 
  
  M_Jac <- ifelse(M_norm >= 0, 1, 0) # Modify a matrix for Jaccard distance. 
  
  jac_matrix <- create_jac_matrix(M_Jac) # Use the binary values to create a distance
  
  closest_users <- get_recommended_users(jac_matrix, n) # use the distances to get the closest n users
  
  rec_diff <- create_recommended_ratings(M_norm,closest_users) # Use closest users to get the recommendations 
  
  recommendations <- rec_diff + normalization_vector # Add back normalization
  
  # There is also no point in recommending movies that the user has already seen.
  # Note that we can have na values from either no recommended or they seen the movie already.
  # If the original has an na values, replace with NA in recommendations
  
  recommendations[!is.na(df[,-1])] <- NA # Cut first column from the df 
  
  # Convert back into a dataframe
  rec_df <- as.data.frame(recommendations)
  # Add back column names and critics
  rec_df <- cbind(critics, rec_df)
  rec_df <- setNames(rec_df, col_names)
  
  return(rec_df)
}

recommendations_2n <- get_recommendations(df, 2) 
head(recommendations_2n)

     Critic CaptainAmerica Deadpool   Frozen JungleBook PitchPerfect2
1    Burton       4.333333 3.666667      NaN         NA           NaN
2   Charley             NA       NA       NA         NA            NA
3       Dan       5.750000       NA 3.750000   5.000000      3.750000
4 Dieudonne             NA       NA 3.416667   4.666667      3.416667
5      Matt             NA 3.250000       NA   3.250000            NA
6  Mauricio             NA 3.416667       NA         NA            NA
  StarWarsForce
1            NA
2            NA
3            NA
4            NA
5            NA
6      4.416667

recommendations_3n <- get_recommendations(df, 3) 
head(recommendations_3n)

     Critic CaptainAmerica Deadpool   Frozen JungleBook PitchPerfect2
1    Burton       4.541667 3.666667 2.750000         NA      2.750000
2   Charley             NA       NA       NA         NA            NA
3       Dan       5.708333       NA 4.708333   4.333333      3.708333
4 Dieudonne             NA       NA 3.791667   4.416667      4.291667
5      Matt             NA 2.916667       NA   3.250000            NA
6  Mauricio             NA 3.416667       NA         NA            NA
  StarWarsForce
1            NA
2            NA
3            NA
4            NA
5            NA
6      4.111111

We can try n = 4. Comparing to our previous assignment some values match decently. We can try to drop values and see if our system is able to guess accurately. We have now created recommendations for users for movies they have not seen based on what similar users had reviewed.

recommendations_4n <- get_recommendations(df, 4)
recommendations_4n

      Critic CaptainAmerica Deadpool   Frozen JungleBook PitchPerfect2
1     Burton       4.541667 3.666667 2.750000         NA      2.750000
2    Charley             NA       NA       NA         NA            NA
3        Dan       5.708333       NA 4.708333   4.333333      3.708333
4  Dieudonne             NA       NA 3.791667   4.416667      4.291667
5       Matt             NA 2.916667       NA   3.250000            NA
6   Mauricio             NA 3.911111       NA         NA            NA
7        Max             NA       NA       NA         NA            NA
8     Nathan       4.541667 3.666667 2.750000   4.000000      2.750000
9      Param             NA       NA       NA   3.250000      3.125000
10    Parshu             NA       NA       NA         NA            NA
11 Prashanth             NA       NA       NA         NA      3.216667
12    Shipra       4.344444 4.344444       NA         NA      2.416667
13  Sreejaya             NA       NA       NA         NA            NA
14     Steve             NA 3.666667 2.750000   4.000000      2.750000
15     Vuthy             NA       NA       NA         NA            NA
16   Xingjia       5.266667 4.766667       NA         NA      3.333333
   StarWarsForce
1             NA
2             NA
3             NA
4             NA
5             NA
6       4.111111
7             NA
8             NA
9             NA
10            NA
11            NA
12            NA
13            NA
14            NA
15      4.412500
16      4.383333

We can test out some values with hold-out data.
We can drop Charley’s rating of 4 for Captain America, we can see our rating came very close, if we had rounded to an integer then we would have guessed correctly. I performed this test more than once and saw it was typically close to the original value.

dropped_df <- df # Copy
original_rating <- dropped_df[2,2] # Save charleys rating of 4 for captain america
dropped_df[2,2] <- NA # Remove
recommendations_dropped_4n <- get_recommendations(dropped_df, 4) # Recreate with dropped value matrix
recommendations_dropped_4n[2,2] # new value

[1] 3.8

(original_rating - recommendations_dropped_4n[2,2])[1,1] #difference

[1] 0.2

Finally we can take the highest value and pick out that column name to have a single suggestion for users. Some users may not have a recommendation due to already seeing all the movies or other reasons.

recommendations_4n$Recommended_Movie <- colnames(recommendations_4n)[max.col(recommendations_4n)]
recommendations_4n$Recommended_Movie <- apply(recommendations_4n, 1, function(val){
  ifelse(all(is.na(val)), NA, colnames(recommendations_4n)[which.max(val)])
})
rec_movie <- recommendations_4n %>%
  select("Critic", "Recommended_Movie")
rec_movie

      Critic Recommended_Movie
1     Burton    CaptainAmerica
2    Charley              <NA>
3        Dan    CaptainAmerica
4  Dieudonne        JungleBook
5       Matt        JungleBook
6   Mauricio     StarWarsForce
7        Max              <NA>
8     Nathan    CaptainAmerica
9      Param        JungleBook
10    Parshu              <NA>
11 Prashanth     PitchPerfect2
12    Shipra    CaptainAmerica
13  Sreejaya              <NA>
14     Steve        JungleBook
15     Vuthy     StarWarsForce
16   Xingjia    CaptainAmerica

Conclusion

In conclusion I was able to create a collaborative recommendation system. I was able to find users that were similar to other users with a similarity measure. Prior to doing this I first normalized the matrix by subtracting everyone’s average rating from their ratings. Then I used Jaccard’s distance to find how similar users were to one another. By selecting n similar users where n is an arbitrary number, I was able to get the average difference in ratings of those similar users. From that I had added back the normalization of the original users to get the critics recommendation that is based off similar critics. There are several ways I could expand/improve on this assignment. For example finding the best value for N, trying out different methods like item to item or matrix factorization and comparing them to each other, or even combining different techniques together.