This assignment will use the following packages:
library(readr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
This assignment for extra credit will utilize the data from Assignment 2 from my GitHub page.
This assignment will also be available on my GitHub Repo.
The purpose of this assignment is to utilize Global Baseline
estimates to predict what critics would rate films that they did not
see. Similar to what I mentioned on my previous assignment, there will
be a scale of 1 to 5, where 1 is the lowest and 5 is the highest. If and
only if there are critics that did not see such movie, the data will
output as NA
. The data set utilizes a random algorithm,
which has each of the random 20 names generated to “rate” the 20 videos
picked from the dataset taken from MovieLens and
output a ratings file. The file was then combined into
movieratings.csv
for this assignment for simplification
purposes.
We will first extract the data from the
movie_ratings.csv
file to gather the required variables
from Assignment 2 utilizing the following code block:
# Initialize directory
a <- getwd()
setwd(a)
movieRatings <- read.csv("movie_ratings.csv")
Here we have the globalMean
equate to
2.965
.
globalMean <- mean(movieRatings$rating, na.rm = TRUE)
print(globalMean)
## [1] 2.965
The following table uses the following values to show the average
ratings of each user, and the User Average subtracted by the
globalMean
. The globalMeanUserAvg
subtracts
userAvg
with globalMean
. This will be used
further in the assignment. The number of digits was condensed into a
format of #.##
to make the tables a bit smaller.
averageUserRating <- movieRatings %>%
group_by(userName) %>%
summarise(userAvg = mean(rating, na.rm = TRUE)) %>%
mutate(globalMeanUserAvg = userAvg - globalMean)
averageUserRating %>%
kable(col.names = c("Critic", "Average Rating", "User Average - Mean Movie"), digits = 2, align = 'lcc') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Critic | Average Rating | User Average - Mean Movie |
---|---|---|
Alice | 2.95 | -0.01 |
Bob | 2.35 | -0.61 |
Charlie | 3.05 | 0.08 |
David | 3.40 | 0.44 |
Eve | 2.75 | -0.21 |
Frank | 2.55 | -0.42 |
Grace | 3.45 | 0.49 |
Hannah | 2.75 | -0.21 |
Ivy | 3.05 | 0.08 |
Jack | 2.60 | -0.36 |
Karen | 3.15 | 0.19 |
Leo | 2.70 | -0.26 |
Mona | 2.85 | -0.11 |
Nathan | 2.35 | -0.61 |
Olivia | 3.65 | 0.69 |
Paul | 3.40 | 0.44 |
Quincy | 2.90 | -0.06 |
Rachel | 2.95 | -0.01 |
Sam | 3.20 | 0.24 |
Tina | 3.25 | 0.29 |
The following table uses the following values to show the average
ratings of each movie, and the Movie Average subtracted by the
globalMean
. The globalMeanMovieAvg
subtracts
MovieAvg
with globalMean
. This will be used
further in the assignment. The number of digits was condensed into a
format of #.##
to make the tables a bit smaller.
averageUserRating <- movieRatings %>%
group_by(movieName) %>%
summarise(movieAvg = mean(rating, na.rm = TRUE)) %>%
mutate(globalMeanMovieAvg = movieAvg - globalMean)
averageUserRating %>%
kable(col.names = c("Movie", "Average Rating", "Movie Average - Mean Movie"), digits = 2, align = 'lcc') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Movie | Average Rating | Movie Average - Mean Movie |
---|---|---|
Back to the Future | 3.00 | 0.04 |
Fight Club | 3.05 | 0.08 |
Forrest Gump | 3.15 | 0.19 |
Gladiator | 3.15 | 0.19 |
Goodfellas | 3.00 | 0.04 |
Inception | 3.30 | 0.33 |
Interstellar | 2.75 | -0.21 |
Jurassic Park | 2.75 | -0.21 |
Pulp Fiction | 3.00 | 0.04 |
Saving Private Ryan | 3.10 | 0.14 |
Schindler’s List | 3.20 | 0.24 |
Se7en | 2.85 | -0.11 |
The Dark Knight | 2.80 | -0.17 |
The Empire Strikes Back | 3.30 | 0.33 |
The Godfather | 2.85 | -0.11 |
The Lord of the Rings: The Return of the King | 2.70 | -0.26 |
The Matrix | 2.25 | -0.71 |
The Shawshank Redemption | 2.80 | -0.17 |
The Silence of the Lambs | 3.05 | 0.08 |
The Usual Suspects | 3.25 | 0.29 |
This code block will prepare the table to for the global base line
estimate. Since the movie title numbers is greater than 10, I decided to
split it apart to make it better visually. I also included a section to
calculate the Movie Average
on the last row to get the
average of each movie in each column. The last column for the last row
will be NA
. Another line named
Movie Avg - Mean Movie
is going to output
Movie Average
subtracted by globalMean
. After
splitting the two tables, I will utilize the knitr
and
kableExtra
packages to output the tables. The
wideRatings
will be utilized to calculate the mean rating
for each critic, and will tkae the difference from
globalMean
. The columns that are for userAvg
and globalMeanUserAvg
will be renamed to
Average Rating
and User Average - Mean Movie
respectively (combines the previous tables in a condensed format).
wideRatings <- movieRatings %>%
select(userName, movieName, rating) %>%
pivot_wider(names_from = movieName, values_from = rating)
globalMean <- mean(movieRatings$rating, na.rm = TRUE)
wideRatings <- wideRatings %>%
rowwise() %>%
mutate(userAvg = mean(c_across(where(is.numeric)), na.rm = TRUE),
globalMeanUserAvg = userAvg - globalMean)
wideRatings <- wideRatings %>%
rename(Critic = userName,
`Average Rating` = userAvg,
`User Average - Mean Movie` = globalMeanUserAvg)
moviecolumns <- colnames(wideRatings)[2:(ncol(wideRatings) - 2)]
firsttenMovies <- moviecolumns[1:10]
lasttenMovies <- moviecolumns[11:20]
firstMovieset <- wideRatings %>%
select(Critic, all_of(firsttenMovies), `Average Rating`, `User Average - Mean Movie`)
secondMovieset <- wideRatings %>%
select(Critic, all_of(lasttenMovies), `Average Rating`, `User Average - Mean Movie`)
firstMovieSetAvg <- movieRatings %>%
filter(movieName %in% firsttenMovies) %>%
group_by(movieName) %>%
summarise(movie_avg = mean(rating, na.rm = TRUE)) %>%
pivot_wider(names_from = movieName, values_from = movie_avg) %>%
mutate(Critic = "Movie Average", `Average Rating` = globalMean, `User Average - Mean Movie` = NA)
secondMovieSetAvg <- movieRatings %>%
filter(movieName %in% lasttenMovies) %>%
group_by(movieName) %>%
summarise(movie_avg = mean(rating, na.rm = TRUE)) %>%
pivot_wider(names_from = movieName, values_from = movie_avg) %>%
mutate(Critic = "Movie Average", `Average Rating` = globalMean, `User Average - Mean Movie` = NA)
firstMoviesetGAvg <- firstMovieSetAvg %>%
mutate(across(all_of(firsttenMovies), ~ . - globalMean)) %>%
mutate(Critic = "Movie Avg - Mean Movie", `Average Rating` = NA, `User Average - Mean Movie` = NA)
secondMoviesetGAvg <- secondMovieSetAvg %>%
mutate(across(all_of(lasttenMovies), ~ . - globalMean)) %>%
mutate(Critic = "Movie Avg - Mean Movie", `Average Rating` = NA, `User Average - Mean Movie` = NA)
firstMovieset <- bind_rows(firstMovieset, firstMovieSetAvg, firstMoviesetGAvg)
secondMovieset <- bind_rows(secondMovieset, secondMovieSetAvg, secondMoviesetGAvg)
This table outputs the first 10 movies with the Global Base Estimate:
firstMovieset %>%
kable(digits = 2, align = 'lcccccccccccc') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Critic | The Shawshank Redemption | The Godfather | The Dark Knight | Pulp Fiction | Schindler’s List | Forrest Gump | Inception | Fight Club | The Matrix | Goodfellas | Average Rating | User Average - Mean Movie |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Alice | 3.00 | 2.00 | 5.00 | 5.00 | 4.00 | 5.00 | 1.00 | 2.00 | 3.00 | 5.00 | 2.95 | -0.01 |
Bob | 3.00 | 1.00 | 5.00 | 5.00 | 2.00 | 3.00 | 2.00 | 1.00 | 3.00 | 2.00 | 2.35 | -0.61 |
Charlie | 2.00 | 3.00 | 4.00 | 3.00 | 2.00 | 1.00 | 3.00 | 2.00 | 1.00 | 3.00 | 3.05 | 0.08 |
David | 2.00 | 4.00 | 5.00 | 1.00 | 4.00 | 2.00 | 4.00 | 5.00 | 4.00 | 5.00 | 3.40 | 0.44 |
Eve | 3.00 | 1.00 | 2.00 | 4.00 | 4.00 | 1.00 | 3.00 | 5.00 | 2.00 | 4.00 | 2.75 | -0.21 |
Frank | 5.00 | 3.00 | 1.00 | 1.00 | 1.00 | 2.00 | 1.00 | 1.00 | 1.00 | 5.00 | 2.55 | -0.42 |
Grace | 4.00 | 5.00 | 1.00 | 1.00 | 3.00 | 5.00 | 5.00 | 2.00 | 2.00 | 2.00 | 3.45 | 0.49 |
Hannah | 1.00 | 4.00 | 3.00 | 3.00 | 3.00 | 3.00 | 5.00 | 5.00 | 4.00 | 1.00 | 2.75 | -0.21 |
Ivy | 2.00 | 2.00 | 1.00 | 4.00 | 1.00 | 4.00 | 2.00 | 4.00 | 5.00 | 4.00 | 3.05 | 0.08 |
Jack | 3.00 | 5.00 | 5.00 | 1.00 | 3.00 | 4.00 | 3.00 | 2.00 | 1.00 | 2.00 | 2.60 | -0.36 |
Karen | 5.00 | 1.00 | 1.00 | 3.00 | 5.00 | 1.00 | 5.00 | 2.00 | 1.00 | 5.00 | 3.15 | 0.19 |
Leo | 3.00 | 1.00 | 2.00 | 5.00 | 2.00 | 4.00 | 1.00 | 3.00 | 1.00 | 1.00 | 2.70 | -0.26 |
Mona | 3.00 | 2.00 | 4.00 | 3.00 | 3.00 | 1.00 | 4.00 | 1.00 | 5.00 | 5.00 | 2.85 | -0.11 |
Nathan | 1.00 | 3.00 | 4.00 | 2.00 | 2.00 | 3.00 | 2.00 | 1.00 | 3.00 | 1.00 | 2.35 | -0.61 |
Olivia | 4.00 | 4.00 | 3.00 | 5.00 | 5.00 | 4.00 | 4.00 | 5.00 | 1.00 | 4.00 | 3.65 | 0.69 |
Paul | 1.00 | 5.00 | 1.00 | 5.00 | 5.00 | 3.00 | 5.00 | 5.00 | 2.00 | 3.00 | 3.40 | 0.44 |
Quincy | 1.00 | 5.00 | 2.00 | 3.00 | 3.00 | 5.00 | 5.00 | 3.00 | 2.00 | 1.00 | 2.90 | -0.06 |
Rachel | 5.00 | 3.00 | 1.00 | 2.00 | 4.00 | 4.00 | 5.00 | 2.00 | 1.00 | 1.00 | 2.95 | -0.01 |
Sam | 3.00 | 1.00 | 2.00 | 2.00 | 4.00 | 4.00 | 5.00 | 5.00 | 2.00 | 2.00 | 3.20 | 0.24 |
Tina | 2.00 | 2.00 | 4.00 | 2.00 | 4.00 | 4.00 | 1.00 | 5.00 | 1.00 | 4.00 | 3.25 | 0.29 |
Movie Average | 2.80 | 2.85 | 2.80 | 3.00 | 3.20 | 3.15 | 3.30 | 3.05 | 2.25 | 3.00 | 2.96 | NA |
Movie Avg - Mean Movie | -0.17 | -0.11 | -0.17 | 0.04 | 0.24 | 0.19 | 0.33 | 0.08 | -0.71 | 0.04 | NA | NA |
This will output the last 10 movies with the Global Base Estimate:
secondMovieset %>%
kable(digits = 2, align = 'lcccccccccccc') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Critic | The Empire Strikes Back | Interstellar | Gladiator | The Lord of the Rings: The Return of the King | Back to the Future | Saving Private Ryan | The Silence of the Lambs | Se7en | The Usual Suspects | Jurassic Park | Average Rating | User Average - Mean Movie |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Alice | 2.00 | 5.00 | 3.00 | 3.00 | 2.00 | 3.00 | 1.00 | 2.00 | 1.00 | 2.00 | 2.95 | -0.01 |
Bob | 1.00 | 1.00 | 4.00 | 5.00 | 2.00 | 2.00 | 1.00 | 2.00 | 1.00 | 1.00 | 2.35 | -0.61 |
Charlie | 5.00 | 1.00 | 3.00 | 5.00 | 5.00 | 5.00 | 3.00 | 2.00 | 4.00 | 4.00 | 3.05 | 0.08 |
David | 5.00 | 2.00 | 3.00 | 3.00 | 4.00 | 4.00 | 4.00 | 1.00 | 4.00 | 2.00 | 3.40 | 0.44 |
Eve | 1.00 | 1.00 | 3.00 | 1.00 | 4.00 | 2.00 | 4.00 | 3.00 | 5.00 | 2.00 | 2.75 | -0.21 |
Frank | 1.00 | 2.00 | 5.00 | 3.00 | 5.00 | 1.00 | 5.00 | 4.00 | 1.00 | 3.00 | 2.55 | -0.42 |
Grace | 5.00 | 5.00 | 3.00 | 4.00 | 4.00 | 4.00 | 2.00 | 3.00 | 4.00 | 5.00 | 3.45 | 0.49 |
Hannah | 5.00 | 1.00 | 2.00 | 2.00 | 1.00 | 1.00 | 2.00 | 3.00 | 5.00 | 1.00 | 2.75 | -0.21 |
Ivy | 2.00 | 3.00 | 3.00 | 5.00 | 2.00 | 4.00 | 4.00 | 3.00 | 3.00 | 3.00 | 3.05 | 0.08 |
Jack | 2.00 | 2.00 | 1.00 | 1.00 | 1.00 | 5.00 | 2.00 | 4.00 | 2.00 | 3.00 | 2.60 | -0.36 |
Karen | 5.00 | 2.00 | 5.00 | 4.00 | 2.00 | 4.00 | 1.00 | 4.00 | 5.00 | 2.00 | 3.15 | 0.19 |
Leo | 1.00 | 5.00 | 1.00 | 3.00 | 3.00 | 2.00 | 4.00 | 3.00 | 4.00 | 5.00 | 2.70 | -0.26 |
Mona | 1.00 | 4.00 | 4.00 | 1.00 | 2.00 | 2.00 | 3.00 | 4.00 | 4.00 | 1.00 | 2.85 | -0.11 |
Nathan | 5.00 | 1.00 | 2.00 | 2.00 | 3.00 | 4.00 | 1.00 | 2.00 | 1.00 | 4.00 | 2.35 | -0.61 |
Olivia | 5.00 | 5.00 | 2.00 | 3.00 | 1.00 | 3.00 | 5.00 | 3.00 | 3.00 | 4.00 | 3.65 | 0.69 |
Paul | 4.00 | 2.00 | 4.00 | 3.00 | 4.00 | 5.00 | 3.00 | 4.00 | 3.00 | 1.00 | 3.40 | 0.44 |
Quincy | 5.00 | 5.00 | 1.00 | 1.00 | 2.00 | 2.00 | 4.00 | 2.00 | 3.00 | 3.00 | 2.90 | -0.06 |
Rachel | 5.00 | 2.00 | 4.00 | 2.00 | 3.00 | 3.00 | 2.00 | 1.00 | 5.00 | 4.00 | 2.95 | -0.01 |
Sam | 4.00 | 2.00 | 5.00 | 1.00 | 5.00 | 3.00 | 5.00 | 3.00 | 2.00 | 4.00 | 3.20 | 0.24 |
Tina | 2.00 | 4.00 | 5.00 | 2.00 | 5.00 | 3.00 | 5.00 | 4.00 | 5.00 | 1.00 | 3.25 | 0.29 |
Movie Average | 3.30 | 2.75 | 3.15 | 2.70 | 3.00 | 3.10 | 3.05 | 2.85 | 3.25 | 2.75 | 2.96 | NA |
Movie Avg - Mean Movie | 0.33 | -0.21 | 0.19 | -0.26 | 0.04 | 0.14 | 0.08 | -0.11 | 0.29 | -0.21 | NA | NA |
My initial plan was to utilize the large dataset that I had from MovieLens, but this would have took a bit more time and I am still working with tables to make it better. I did experience a similar issue when I worked with GTFSR with my assignment in undergrad, but as I progress in this course, I believe like this can improve significantly.