This assignment will use the following packages:

library(readr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)

General Overview

This assignment for extra credit will utilize the data from Assignment 2 from my GitHub page.

This assignment will also be available on my GitHub Repo.

The purpose of this assignment is to utilize Global Baseline estimates to predict what critics would rate films that they did not see. Similar to what I mentioned on my previous assignment, there will be a scale of 1 to 5, where 1 is the lowest and 5 is the highest. If and only if there are critics that did not see such movie, the data will output as NA. The data set utilizes a random algorithm, which has each of the random 20 names generated to “rate” the 20 videos picked from the dataset taken from MovieLens and output a ratings file. The file was then combined into movieratings.csv for this assignment for simplification purposes.

Movie Ratings File Extraction

We will first extract the data from the movie_ratings.csv file to gather the required variables from Assignment 2 utilizing the following code block:

# Initialize directory
a <- getwd()
setwd(a)

movieRatings <- read.csv("movie_ratings.csv")

Initializing Global Mean

Here we have the globalMean equate to 2.965.

globalMean <- mean(movieRatings$rating, na.rm = TRUE)
print(globalMean)
## [1] 2.965

Critics vs Global Mean

The following table uses the following values to show the average ratings of each user, and the User Average subtracted by the globalMean. The globalMeanUserAvg subtracts userAvg with globalMean. This will be used further in the assignment. The number of digits was condensed into a format of #.## to make the tables a bit smaller.

averageUserRating <- movieRatings %>%
  group_by(userName) %>%
  summarise(userAvg = mean(rating, na.rm = TRUE)) %>%
  mutate(globalMeanUserAvg = userAvg - globalMean)

averageUserRating %>%
  kable(col.names = c("Critic", "Average Rating", "User Average - Mean Movie"), digits = 2, align = 'lcc') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Critic Average Rating User Average - Mean Movie
Alice 2.95 -0.01
Bob 2.35 -0.61
Charlie 3.05 0.08
David 3.40 0.44
Eve 2.75 -0.21
Frank 2.55 -0.42
Grace 3.45 0.49
Hannah 2.75 -0.21
Ivy 3.05 0.08
Jack 2.60 -0.36
Karen 3.15 0.19
Leo 2.70 -0.26
Mona 2.85 -0.11
Nathan 2.35 -0.61
Olivia 3.65 0.69
Paul 3.40 0.44
Quincy 2.90 -0.06
Rachel 2.95 -0.01
Sam 3.20 0.24
Tina 3.25 0.29

Movies vs Global Mean

The following table uses the following values to show the average ratings of each movie, and the Movie Average subtracted by the globalMean. The globalMeanMovieAvg subtracts MovieAvg with globalMean. This will be used further in the assignment. The number of digits was condensed into a format of #.## to make the tables a bit smaller.

averageUserRating <- movieRatings %>%
  group_by(movieName) %>%
  summarise(movieAvg = mean(rating, na.rm = TRUE)) %>%
  mutate(globalMeanMovieAvg = movieAvg - globalMean)

averageUserRating %>%
  kable(col.names = c("Movie", "Average Rating", "Movie Average - Mean Movie"), digits = 2, align = 'lcc') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Movie Average Rating Movie Average - Mean Movie
Back to the Future 3.00 0.04
Fight Club 3.05 0.08
Forrest Gump 3.15 0.19
Gladiator 3.15 0.19
Goodfellas 3.00 0.04
Inception 3.30 0.33
Interstellar 2.75 -0.21
Jurassic Park 2.75 -0.21
Pulp Fiction 3.00 0.04
Saving Private Ryan 3.10 0.14
Schindler’s List 3.20 0.24
Se7en 2.85 -0.11
The Dark Knight 2.80 -0.17
The Empire Strikes Back 3.30 0.33
The Godfather 2.85 -0.11
The Lord of the Rings: The Return of the King 2.70 -0.26
The Matrix 2.25 -0.71
The Shawshank Redemption 2.80 -0.17
The Silence of the Lambs 3.05 0.08
The Usual Suspects 3.25 0.29

Generating the Tables

This code block will prepare the table to for the global base line estimate. Since the movie title numbers is greater than 10, I decided to split it apart to make it better visually. I also included a section to calculate the Movie Average on the last row to get the average of each movie in each column. The last column for the last row will be NA. Another line named Movie Avg - Mean Movie is going to output Movie Average subtracted by globalMean. After splitting the two tables, I will utilize the knitr and kableExtra packages to output the tables. The wideRatings will be utilized to calculate the mean rating for each critic, and will tkae the difference from globalMean. The columns that are for userAvg and globalMeanUserAvg will be renamed to Average Rating and User Average - Mean Movie respectively (combines the previous tables in a condensed format).

wideRatings <- movieRatings %>%
  select(userName, movieName, rating) %>%
  pivot_wider(names_from = movieName, values_from = rating)

globalMean <- mean(movieRatings$rating, na.rm = TRUE)

wideRatings <- wideRatings %>%
  rowwise() %>%
  mutate(userAvg = mean(c_across(where(is.numeric)), na.rm = TRUE),
         globalMeanUserAvg = userAvg - globalMean)

wideRatings <- wideRatings %>%
  rename(Critic = userName, 
         `Average Rating` = userAvg, 
         `User Average - Mean Movie` = globalMeanUserAvg)

moviecolumns <- colnames(wideRatings)[2:(ncol(wideRatings) - 2)]
firsttenMovies <- moviecolumns[1:10]
lasttenMovies  <- moviecolumns[11:20]

firstMovieset <- wideRatings %>%
  select(Critic, all_of(firsttenMovies), `Average Rating`, `User Average - Mean Movie`)

secondMovieset <- wideRatings %>%
  select(Critic, all_of(lasttenMovies), `Average Rating`, `User Average - Mean Movie`)

firstMovieSetAvg <- movieRatings %>%
  filter(movieName %in% firsttenMovies) %>%
  group_by(movieName) %>%
  summarise(movie_avg = mean(rating, na.rm = TRUE)) %>%
  pivot_wider(names_from = movieName, values_from = movie_avg) %>%
  mutate(Critic = "Movie Average", `Average Rating` = globalMean, `User Average - Mean Movie` = NA)

secondMovieSetAvg <- movieRatings %>%
  filter(movieName %in% lasttenMovies) %>%
  group_by(movieName) %>%
  summarise(movie_avg = mean(rating, na.rm = TRUE)) %>%
  pivot_wider(names_from = movieName, values_from = movie_avg) %>%
  mutate(Critic = "Movie Average", `Average Rating` = globalMean, `User Average - Mean Movie` = NA)

firstMoviesetGAvg <- firstMovieSetAvg %>%
  mutate(across(all_of(firsttenMovies), ~ . - globalMean)) %>%
  mutate(Critic = "Movie Avg - Mean Movie", `Average Rating` = NA, `User Average - Mean Movie` = NA)

secondMoviesetGAvg <- secondMovieSetAvg %>%
  mutate(across(all_of(lasttenMovies), ~ . - globalMean)) %>%
  mutate(Critic = "Movie Avg - Mean Movie", `Average Rating` = NA, `User Average - Mean Movie` = NA)

firstMovieset <- bind_rows(firstMovieset, firstMovieSetAvg, firstMoviesetGAvg)
secondMovieset <- bind_rows(secondMovieset, secondMovieSetAvg, secondMoviesetGAvg)

First 10 Movies

This table outputs the first 10 movies with the Global Base Estimate:

firstMovieset %>%
  kable(digits = 2, align = 'lcccccccccccc') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Critic The Shawshank Redemption The Godfather The Dark Knight Pulp Fiction Schindler’s List Forrest Gump Inception Fight Club The Matrix Goodfellas Average Rating User Average - Mean Movie
Alice 3.00 2.00 5.00 5.00 4.00 5.00 1.00 2.00 3.00 5.00 2.95 -0.01
Bob 3.00 1.00 5.00 5.00 2.00 3.00 2.00 1.00 3.00 2.00 2.35 -0.61
Charlie 2.00 3.00 4.00 3.00 2.00 1.00 3.00 2.00 1.00 3.00 3.05 0.08
David 2.00 4.00 5.00 1.00 4.00 2.00 4.00 5.00 4.00 5.00 3.40 0.44
Eve 3.00 1.00 2.00 4.00 4.00 1.00 3.00 5.00 2.00 4.00 2.75 -0.21
Frank 5.00 3.00 1.00 1.00 1.00 2.00 1.00 1.00 1.00 5.00 2.55 -0.42
Grace 4.00 5.00 1.00 1.00 3.00 5.00 5.00 2.00 2.00 2.00 3.45 0.49
Hannah 1.00 4.00 3.00 3.00 3.00 3.00 5.00 5.00 4.00 1.00 2.75 -0.21
Ivy 2.00 2.00 1.00 4.00 1.00 4.00 2.00 4.00 5.00 4.00 3.05 0.08
Jack 3.00 5.00 5.00 1.00 3.00 4.00 3.00 2.00 1.00 2.00 2.60 -0.36
Karen 5.00 1.00 1.00 3.00 5.00 1.00 5.00 2.00 1.00 5.00 3.15 0.19
Leo 3.00 1.00 2.00 5.00 2.00 4.00 1.00 3.00 1.00 1.00 2.70 -0.26
Mona 3.00 2.00 4.00 3.00 3.00 1.00 4.00 1.00 5.00 5.00 2.85 -0.11
Nathan 1.00 3.00 4.00 2.00 2.00 3.00 2.00 1.00 3.00 1.00 2.35 -0.61
Olivia 4.00 4.00 3.00 5.00 5.00 4.00 4.00 5.00 1.00 4.00 3.65 0.69
Paul 1.00 5.00 1.00 5.00 5.00 3.00 5.00 5.00 2.00 3.00 3.40 0.44
Quincy 1.00 5.00 2.00 3.00 3.00 5.00 5.00 3.00 2.00 1.00 2.90 -0.06
Rachel 5.00 3.00 1.00 2.00 4.00 4.00 5.00 2.00 1.00 1.00 2.95 -0.01
Sam 3.00 1.00 2.00 2.00 4.00 4.00 5.00 5.00 2.00 2.00 3.20 0.24
Tina 2.00 2.00 4.00 2.00 4.00 4.00 1.00 5.00 1.00 4.00 3.25 0.29
Movie Average 2.80 2.85 2.80 3.00 3.20 3.15 3.30 3.05 2.25 3.00 2.96 NA
Movie Avg - Mean Movie -0.17 -0.11 -0.17 0.04 0.24 0.19 0.33 0.08 -0.71 0.04 NA NA

Last 10 Movies

This will output the last 10 movies with the Global Base Estimate:

secondMovieset %>%
  kable(digits = 2, align = 'lcccccccccccc') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Critic The Empire Strikes Back Interstellar Gladiator The Lord of the Rings: The Return of the King Back to the Future Saving Private Ryan The Silence of the Lambs Se7en The Usual Suspects Jurassic Park Average Rating User Average - Mean Movie
Alice 2.00 5.00 3.00 3.00 2.00 3.00 1.00 2.00 1.00 2.00 2.95 -0.01
Bob 1.00 1.00 4.00 5.00 2.00 2.00 1.00 2.00 1.00 1.00 2.35 -0.61
Charlie 5.00 1.00 3.00 5.00 5.00 5.00 3.00 2.00 4.00 4.00 3.05 0.08
David 5.00 2.00 3.00 3.00 4.00 4.00 4.00 1.00 4.00 2.00 3.40 0.44
Eve 1.00 1.00 3.00 1.00 4.00 2.00 4.00 3.00 5.00 2.00 2.75 -0.21
Frank 1.00 2.00 5.00 3.00 5.00 1.00 5.00 4.00 1.00 3.00 2.55 -0.42
Grace 5.00 5.00 3.00 4.00 4.00 4.00 2.00 3.00 4.00 5.00 3.45 0.49
Hannah 5.00 1.00 2.00 2.00 1.00 1.00 2.00 3.00 5.00 1.00 2.75 -0.21
Ivy 2.00 3.00 3.00 5.00 2.00 4.00 4.00 3.00 3.00 3.00 3.05 0.08
Jack 2.00 2.00 1.00 1.00 1.00 5.00 2.00 4.00 2.00 3.00 2.60 -0.36
Karen 5.00 2.00 5.00 4.00 2.00 4.00 1.00 4.00 5.00 2.00 3.15 0.19
Leo 1.00 5.00 1.00 3.00 3.00 2.00 4.00 3.00 4.00 5.00 2.70 -0.26
Mona 1.00 4.00 4.00 1.00 2.00 2.00 3.00 4.00 4.00 1.00 2.85 -0.11
Nathan 5.00 1.00 2.00 2.00 3.00 4.00 1.00 2.00 1.00 4.00 2.35 -0.61
Olivia 5.00 5.00 2.00 3.00 1.00 3.00 5.00 3.00 3.00 4.00 3.65 0.69
Paul 4.00 2.00 4.00 3.00 4.00 5.00 3.00 4.00 3.00 1.00 3.40 0.44
Quincy 5.00 5.00 1.00 1.00 2.00 2.00 4.00 2.00 3.00 3.00 2.90 -0.06
Rachel 5.00 2.00 4.00 2.00 3.00 3.00 2.00 1.00 5.00 4.00 2.95 -0.01
Sam 4.00 2.00 5.00 1.00 5.00 3.00 5.00 3.00 2.00 4.00 3.20 0.24
Tina 2.00 4.00 5.00 2.00 5.00 3.00 5.00 4.00 5.00 1.00 3.25 0.29
Movie Average 3.30 2.75 3.15 2.70 3.00 3.10 3.05 2.85 3.25 2.75 2.96 NA
Movie Avg - Mean Movie 0.33 -0.21 0.19 -0.26 0.04 0.14 0.08 -0.11 0.29 -0.21 NA NA

Conclusion

My initial plan was to utilize the large dataset that I had from MovieLens, but this would have took a bit more time and I am still working with tables to make it better. I did experience a similar issue when I worked with GTFSR with my assignment in undergrad, but as I progress in this course, I believe like this can improve significantly.