607 Assignment 3A Dylan Gold

Approach

In this assignment we build a rating system from either the ratings collected in Assignment 2A or the new data provided. At first I was confused when looking only at the xlsx document. Maybe I missed something, but I did not see much in its 4 tables. After the class lecture I understood it better.

I will use the data given to us. The data I collected skews toward high ratings because I only picked shows that I liked and that my friends, who generally like them too, have discussed. That makes it a biased sample for a global baseline.

From the lecture, we break this down into these steps:
1. Create the global baseline from the average of all ratings.
2. Generate the average rating for each user.
3. Generate the average rating for each movie.
4. Create the user and movie baselines as the difference between each average and the global baseline.
5. Estimate each unseen movie's rating as the sum of the global baseline, the user baseline, and the movie baseline.
6. Perhaps recommend, for each user, the unseen movie with the highest estimate (at least 1).
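
The baseline and estimate steps above can be written as one formula. The symbols are assumed notation for illustration only and do not come from the assignment: $\mu$ is the global average, $\bar{r}_u$ is the average of user $u$, $\bar{r}_m$ is the average of movie $m$, and $\hat{r}_{u,m}$ is the estimated rating.

```latex
\hat{r}_{u,m} = \mu + (\bar{r}_u - \mu) + (\bar{r}_m - \mu) = \bar{r}_u + \bar{r}_m - \mu
```

The right-hand form shows that the estimate reduces to the user average plus the movie average minus the global average.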

Ideally I can create a function that does this when given a data frame.

Codebase:

After uploading the raw data to GitHub, I perform the usual setup for the R environment.

library(tidyverse)

Getting data into data frame

url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Assignment-3/refs/heads/main/MovieRatings3A.csv"

df <- read_csv(
  file = url,
  col_names = FALSE # The csv has no header row, so this is needed.
)


head(df, 10)
# A tibble: 10 × 7
   X1           X2    X3    X4    X5    X6    X7
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Burton       NA    NA    NA     4    NA     4
 2 Charley       4     5     4     3     2     3
 3 Dan          NA     5    NA    NA    NA     5
 4 Dieudonne     5     4    NA    NA    NA     5
 5 Matt          4    NA     2    NA     2     5
 6 Mauricio      4    NA     3     3     4    NA
 7 Max           4     4     4     2     2     4
 8 Nathan       NA    NA    NA    NA    NA     4
 9 Param         4     4     1    NA    NA     5
10 Parshu        4     3     5     5     2     3

The data from the problem statement is missing its headers, so I put them back.

df <- df %>% set_names("Critic", "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect2", "StarWarsForce")

tail(df, 5)
# A tibble: 5 × 7
  Critic   CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
  <chr>             <dbl>    <dbl>  <dbl>      <dbl>         <dbl>         <dbl>
1 Shipra               NA       NA      4          5            NA             3
2 Sreejaya              5        5      5          4             4             5
3 Steve                 4       NA     NA         NA            NA             4
4 Vuthy                 4        5      3          3             3            NA
5 Xingjia              NA       NA      5          5            NA            NA

Now we create the global baseline from the average of all ratings, and generate the averages for users and movies.

Create a one-row tibble with the average of each column (the movie averages).

# I create movies for easier selection
movies <- c("CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect2", "StarWarsForce")
# This following line was kind of tricky for me. My understanding is that ~ creates an anonymous function whose input is the period (.), and across() applies that function to each selected column.
# Here the function just takes the mean of a column's values, ignoring NAs. The columns are specified beforehand with all_of(movies).
movie_avg <- df %>% summarize(across(all_of(movies), ~ mean(., na.rm = TRUE)))
movie_avg
# A tibble: 1 × 6
  CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
           <dbl>    <dbl>  <dbl>      <dbl>         <dbl>         <dbl>
1           4.27     4.44   3.73        3.9          2.71          4.15

Get the user_mean

user_avg <- df %>%
  rowwise() %>%
  summarize(user_mean = mean(c_across(all_of(movies)), na.rm = TRUE)) #c_across with rowwise to perform row-wise aggregations
user_avg
# A tibble: 16 × 1
   user_mean
       <dbl>
 1      4   
 2      3.5 
 3      5   
 4      4.67
 5      3.25
 6      3.5 
 7      3.33
 8      4   
 9      3.5 
10      3.67
11      4.8 
12      4   
13      4.67
14      4   
15      3.6 
16      5   

Let's combine this with the names for readability and usability. We can use bind_cols.

user_avg <- bind_cols(select(df, c("Critic")), user_avg)
user_avg
# A tibble: 16 × 2
   Critic    user_mean
   <chr>         <dbl>
 1 Burton         4   
 2 Charley        3.5 
 3 Dan            5   
 4 Dieudonne      4.67
 5 Matt           3.25
 6 Mauricio       3.5 
 7 Max            3.33
 8 Nathan         4   
 9 Param          3.5 
10 Parshu         3.67
11 Prashanth      4.8 
12 Shipra         4   
13 Sreejaya       4.67
14 Steve          4   
15 Vuthy          3.6 
16 Xingjia        5   

Now we get the global baseline average. From what I have seen online, we can just convert to a matrix for this.

df_matrix <- as.matrix(select(df, all_of(movies))) # Create matrix without the Critic column, which holds character values
global_avg <- mean(df_matrix, na.rm = TRUE)
global_avg 
[1] 3.934426

We now have the values for user_avg, movie_avg and global_avg. Checking them against the Excel file, they match.

Now we can calculate the scores for unseen movies. First I create a function that produces a single estimate, so that we can later apply it to the whole data frame. I pass everything it needs as arguments so that it does not depend on global variables and has no side effects.

create_estimate <- function(user, movie, users_means, movies_means, global_mean){
  user_val <- filter(users_means, Critic == user) %>% select(user_mean) %>% pull() # Get the user's mean: filter because the data has one row per critic, select the mean column, then pull the value
  movie_val <- select(movies_means, all_of(movie)) %>% pull() # Get the movie's mean: select because the data has one column per movie, then pull the value
  # Now generate the baselines for both the user and movie
  user_baseline <- user_val - global_mean 
  movie_baseline <- movie_val - global_mean

  estimate <- global_mean + movie_baseline + user_baseline # Use the formula. 
  
  # estimate <- user_val + movie_val - global_mean # Equivalent shorter form: the global_mean terms above collapse to a single subtraction
  
  return(estimate)
  
}
create_estimate("Param", "PitchPerfect2", user_avg, movie_avg, global_avg)
[1] 2.279859

Comparing this result to the example in the Excel file shows that the function works properly.
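
As a sanity check, the same number can be reproduced by hand from the ratings shown earlier. This is a minimal sketch in base R. The names param_ratings and pitch_ratings are made up for illustration, and global_mean is typed in from the rounded value printed above, so the last digits can differ slightly from the function's result.

```r
# Param's ratings (row 9 of the data), NA = unseen
param_ratings <- c(4, 4, 1, NA, NA, 5)
# All PitchPerfect2 ratings, one entry per critic in row order
pitch_ratings <- c(NA, 2, NA, NA, 2, 4, 2, NA, NA, 2, NA, NA, 4, NA, 3, NA)
# Global mean as printed above (rounded)
global_mean <- 3.934426

user_mean  <- mean(param_ratings, na.rm = TRUE)  # 3.5
movie_mean <- mean(pitch_ratings, na.rm = TRUE)  # about 2.714

# Global baseline plus the user and movie offsets
estimate <- global_mean + (user_mean - global_mean) + (movie_mean - global_mean)
round(estimate, 2)  # 2.28
```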

The next function takes the values we just calculated together with the data frame, and returns a data frame holding an estimate for every unseen movie and NA for every movie already rated. I struggled to figure this out in dplyr. I believe that in dplyr you are encouraged to do either row OR column work at once. For working cell by cell, converting to a matrix seemed like the best way.

I really tried to avoid a nested loop, but I could not find a way to do so while keeping a reference to the current row and column. Perhaps this was a design flaw in my approach, or I could have bound the averages to the data frame as columns, but I had already set up the previous function.
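
One possible way to avoid the nested loop is base R's outer(), which builds the whole critic-by-movie grid of sums in one call. The following is only a sketch of that alternative, not the approach used in this write-up. The ratings are retyped from the tables shown in this document, and the object names ratings, estimates and recommended are illustrative.

```r
critics <- c("Burton", "Charley", "Dan", "Dieudonne", "Matt", "Mauricio", "Max", "Nathan",
             "Param", "Parshu", "Prashanth", "Shipra", "Sreejaya", "Steve", "Vuthy", "Xingjia")
movies <- c("CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect2", "StarWarsForce")

# One row per critic, one column per movie, NA = not yet rated
ratings <- matrix(c(
  NA, NA, NA,  4, NA,  4,
   4,  5,  4,  3,  2,  3,
  NA,  5, NA, NA, NA,  5,
   5,  4, NA, NA, NA,  5,
   4, NA,  2, NA,  2,  5,
   4, NA,  3,  3,  4, NA,
   4,  4,  4,  2,  2,  4,
  NA, NA, NA, NA, NA,  4,
   4,  4,  1, NA, NA,  5,
   4,  3,  5,  5,  2,  3,
   5,  5,  5,  5, NA,  4,
  NA, NA,  4,  5, NA,  3,
   5,  5,  5,  4,  4,  5,
   4, NA, NA, NA, NA,  4,
   4,  5,  3,  3,  3, NA,
  NA, NA,  5,  5, NA, NA
), ncol = 6, byrow = TRUE, dimnames = list(critics, movies))

global_mean <- mean(ratings, na.rm = TRUE)
user_means  <- rowMeans(ratings, na.rm = TRUE)
movie_means <- colMeans(ratings, na.rm = TRUE)

# user mean + movie mean for every critic/movie pair, minus the global mean
estimates <- outer(user_means, movie_means, "+") - global_mean

# Keep estimates only for unseen movies
estimates[!is.na(ratings)] <- NA

# Highest-scoring unseen movie per critic, NA if the critic has seen everything
recommended <- apply(estimates, 1, function(row) {
  if (all(is.na(row))) NA_character_ else names(which.max(row))
})
```

For this data the sketch gives about 2.28 for Param and PitchPerfect2, in line with the create_estimate() result above, and it keeps every value numeric because the Critic column never enters the matrix.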

create_estimate_df <- function(df, users_mean, movies_mean, global_mean){
  # as.matrix() on a data frame with a character column gives a character matrix,
  # so the estimates written below are stored as strings
  df_matrix <- as.matrix(df)
  movie_names <- colnames(df_matrix)[-1] # Movie columns only, drop "Critic"
  rownames(df_matrix) <- df_matrix[, "Critic"] # Index the rows by critic name
  critics <- rownames(df_matrix)

  for (movie_name in movie_names){ # Iterate movies
    for (critic in critics){ # Iterate critics
      if (is.na(df_matrix[critic, movie_name])){ # Unseen movie: generate the estimate from the arguments, not from globals
        df_matrix[critic, movie_name] <- create_estimate(critic, movie_name, users_mean, movies_mean, global_mean)
      } else { # Already rated: blank it out with NA
        df_matrix[critic, movie_name] <- NA
      }
    }
  }
  # Convert back to a data frame and drop the Critic column (critics are now the row names)
  df_matrix <- as.data.frame(df_matrix) %>% select(all_of(movie_names))

  return(df_matrix)
}
estimated_review <- create_estimate_df(df, user_avg, movie_avg, global_avg)
head(estimated_review)
            CaptainAmerica         Deadpool           Frozen       JungleBook
Burton    4.33830104321908 4.51001821493625 3.79284649776453             <NA>
Charley               <NA>             <NA>             <NA>             <NA>
Dan       5.33830104321908             <NA> 4.79284649776453  4.9655737704918
Dieudonne             <NA>             <NA>  4.4595131644312 4.63224043715847
Matt                  <NA> 3.76001821493625             <NA>  3.2155737704918
Mauricio              <NA> 4.01001821493625             <NA>             <NA>
             PitchPerfect2    StarWarsForce
Burton    2.77985948477752             <NA>
Charley               <NA>             <NA>
Dan       3.77985948477752             <NA>
Dieudonne 3.44652615144418             <NA>
Matt                  <NA>             <NA>
Mauricio              <NA> 3.71941992433796

We now have a data frame showing the generated scores. An NA value means the critic has already seen that movie. Finally, we can go row-wise to find the max score for each person.

# Tibble operations drop row names (they are discouraged), so move them into a "Critic" column first.
rec_movies <- estimated_review %>% 
  rownames_to_column("Critic") %>%
  rowwise() %>% # Work one critic at a time
  filter(any(!is.na(c_across(all_of(movies))))) %>% # which.max() returns an empty result for an all-NA row, so drop critics who have seen every movie
  mutate(Recommended_Movie = movies[which.max(c_across(all_of(movies)))]) # Index into movies, not colnames(df): colnames(df) starts with "Critic", which would shift every pick by one column
  
rec_movies
# A tibble: 12 × 8
# Rowwise: 
   Critic  CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
   <chr>   <chr>          <chr>    <chr>  <chr>      <chr>         <chr>        
 1 Burton  4.33830104321… 4.51001… 3.792… <NA>       2.7798594847… <NA>         
 2 Dan     5.33830104321… <NA>     4.792… 4.9655737… 3.7798594847… <NA>         
 3 Dieudo… <NA>           <NA>     4.459… 4.6322404… 3.4465261514… <NA>         
 4 Matt    <NA>           3.76001… <NA>   3.2155737… <NA>          <NA>         
 5 Mauric… <NA>           4.01001… <NA>   <NA>       <NA>          3.7194199243…
 6 Nathan  4.33830104321… 4.51001… 3.792… 3.9655737… 2.7798594847… <NA>         
 7 Param   <NA>           <NA>     <NA>   3.4655737… 2.2798594847… <NA>         
 8 Prasha… <NA>           <NA>     <NA>   <NA>       3.5798594847… <NA>         
 9 Shipra  4.33830104321… 4.51001… <NA>   <NA>       2.7798594847… <NA>         
10 Steve   <NA>           4.51001… 3.792… 3.9655737… 2.7798594847… <NA>         
11 Vuthy   <NA>           <NA>     <NA>   <NA>       <NA>          3.8194199243…
12 Xingjia 5.33830104321… 5.51001… <NA>   <NA>       3.7798594847… 5.2194199243…
# ℹ 1 more variable: Recommended_Movie <chr>
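
A detail worth double-checking in this step is which vector the which.max() index is applied to. which.max() over the six movie columns returns a position from 1 to 6, while the column names of the full data frame also include Critic in position 1. The sketch below uses Burton's estimates, rounded from the output above, to show the difference. The names cols, scores and best are illustrative.

```r
cols   <- c("Critic", "CaptainAmerica", "Deadpool", "Frozen",
            "JungleBook", "PitchPerfect2", "StarWarsForce")
movies <- cols[-1]  # movie columns only

# Burton's estimated scores in movie order (NA = already seen)
scores <- c(4.338, 4.510, 3.793, NA, 2.780, NA)

best <- which.max(scores)  # position of the highest score, NAs are skipped
movies[best]  # "Deadpool", the highest estimate
cols[best]    # "CaptainAmerica", shifted one column to the left
```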

We now have the recommended movie for each critic. I want to join this with the original data, with NA filling in for critics who are not in this result. A left join does this if we use the original df as the anchor. We only want the recommended movie column.

# We keep just Critic and Recommended_Movie and join that to our original data frame.
ratings_with_recommendation <- left_join(df, rec_movies[, c("Critic", "Recommended_Movie")], by = "Critic")
ratings_with_recommendation
# A tibble: 16 × 8
   Critic  CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
   <chr>            <dbl>    <dbl>  <dbl>      <dbl>         <dbl>         <dbl>
 1 Burton              NA       NA     NA          4            NA             4
 2 Charley              4        5      4          3             2             3
 3 Dan                 NA        5     NA         NA            NA             5
 4 Dieudo…              5        4     NA         NA            NA             5
 5 Matt                 4       NA      2         NA             2             5
 6 Mauric…              4       NA      3          3             4            NA
 7 Max                  4        4      4          2             2             4
 8 Nathan              NA       NA     NA         NA            NA             4
 9 Param                4        4      1         NA            NA             5
10 Parshu               4        3      5          5             2             3
11 Prasha…              5        5      5          5            NA             4
12 Shipra              NA       NA      4          5            NA             3
13 Sreeja…              5        5      5          4             4             5
14 Steve                4       NA     NA         NA            NA             4
15 Vuthy                4        5      3          3             3            NA
16 Xingjia             NA       NA      5          5            NA            NA
# ℹ 1 more variable: Recommended_Movie <chr>

We now have the recommended movie for all users who have not seen all the movies. Those who have seen every movie have NA as their Recommended_Movie.

Conclusion

In this assignment we heavily modified a data frame according to a given algorithm. By breaking the algorithm into steps, we computed in sequence the values needed to estimate a rating for each movie a user has not yet seen. With those estimates we could find the movie the algorithm would most recommend to each user. One way to extend this would be to make the whole thing more modular, so that any data frame in the same format can be passed in to generate the top recommendation for each user. I also feel the double loop over a matrix was a bit clumsy, and there may be a better way to do that in the future.