Assignment 3A: Global Baseline Estimate

Author

Emily El Mouaquite

Approach

The formula below will be applied to a data frame built from my Assignment 2A movie ratings:

Predicted Rating = Global Mean + User Bias + Item Bias

Using this, I will add the global mean (the mean of all movie ratings), the user bias (each rater’s average minus the global mean), and the item bias (each movie’s average rating minus the global mean) to the data frame. I will then create a variable that implements the formula above and fill the null values with global baseline estimates for each unseen or unrated movie.
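As a quick illustration of the arithmetic, here is a minimal sketch of the formula for a single rater and movie. The function name `baseline_estimate` is hypothetical and is not part of the assignment code; the input values are taken from the data frame printed near the end of this report (global mean 3.733333, Ally’s average 4.0, and the Howl’s Moving Castle average 4.25).

```r
# Hypothetical helper: global baseline estimate for one rater/movie pair.
baseline_estimate <- function(global_mean, rater_avg, movie_avg) {
  user_bias <- rater_avg - global_mean  # how far this rater sits from the global mean
  item_bias <- movie_avg - global_mean  # how far this movie sits from the global mean
  global_mean + user_bias + item_bias
}

# Ally has not seen Howl's Moving Castle, so her rating is estimated.
ally_howl <- baseline_estimate(3.733333, 4.0, 4.25)
round(ally_howl, 6)
# [1] 4.516667
```

Because the two biases each subtract the global mean once, the estimate simplifies to rater average + movie average - global mean, which is a handy way to spot-check individual cells by hand.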

Code Base

Introduction

The data for this project come from my Assignment 2A movie ratings. They include each rater’s name, the movie names, and the ratings.

# Read movie ratings CSV, convert 0 ratings (movies not seen by raters) to null
df <- read.csv("ghibli_ratings.csv", na.strings = "0")
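To show what `na.strings = "0"` does without needing the original file, the sketch below reads a small inline CSV. The column names and values are hypothetical stand-ins for `ghibli_ratings.csv`, not the actual data.

```r
# Hypothetical inline CSV; a 0 marks a movie the rater has not seen.
csv_text <- "Critic,MovieA,MovieB
Rater1,4,0
Rater2,0,5
Rater3,3,2"

toy <- read.csv(text = csv_text, na.strings = "0")

is.na(toy)          # TRUE wherever the file held a 0
sapply(toy, class)  # rating columns are still read as numbers (integer)
```

One caveat worth keeping in mind: `na.strings` matches whole fields in every column, so this only works cleanly when 0 is never a legitimate rating.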

Body

To calculate the predicted ratings for each rater’s unseen movies with the global baseline estimate, each rater’s average rating and each movie’s average rating must first be calculated and added to the data frame.

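# ensure all ratings are numeric (column 1 holds the critic names)
df[, -1] <- sapply(df[, -1], as.numeric)
# rater averages: mean of each row, ignoring unseen movies
df$RaterAverage <- rowMeans(df[, -1], na.rm = TRUE)
# movie averages: mean of each column, ignoring unseen movies
# (the last entry is the mean of the RaterAverage column)
movie_avg <- colMeans(df[, -1], na.rm = TRUE)
# append the averages as a new row; c() coerces every value to character
df[nrow(df) + 1, ] <- c("MovieAverage", as.character(movie_avg))
# the new row turned the columns into character, so convert back to numeric
df[, -1] <- sapply(df[, -1], as.numeric)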
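One way to sidestep the character coercion is to keep the averages in their own vectors instead of appending them as a row. The sketch below is an alternative under stated assumptions, not the approach used in this report: the toy data frame and the `toy_` names are hypothetical.

```r
# Hypothetical toy ratings; NA marks an unseen movie.
toy <- data.frame(
  Critic = c("Rater1", "Rater2", "Rater3"),
  MovieA = c(4, NA, 3),
  MovieB = c(NA, 5, 2),
  MovieC = c(5, 4, NA)
)

# Work on a numeric matrix of ratings only, so no text is ever mixed in.
toy_ratings <- as.matrix(toy[, -1])

toy_rater_avg <- rowMeans(toy_ratings, na.rm = TRUE)  # one average per rater
toy_movie_avg <- colMeans(toy_ratings, na.rm = TRUE)  # one average per movie
```

Keeping the averages separate means the ratings never need to be converted back to numeric, at the cost of not having everything in a single printable data frame.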

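The global mean can then be read from the cell where the MovieAverage row meets the RaterAverage column (3.733333 in the output below). Note that this cell holds the mean of the rater averages; because raters rated different numbers of movies, it can differ slightly from the mean of all individual ratings described in the Approach.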

# create global mean variable from the last row (MovieAverage) of the RaterAverage column
global_mean <- as.numeric(df$RaterAverage[nrow(df)])
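The two ways of defining a global mean can give different answers when raters have seen different numbers of movies. The sketch below compares them on a hypothetical toy matrix (the data and variable names are illustrative only, not the assignment data).

```r
# Hypothetical toy ratings; Rater1 has seen one movie, the others all three.
toy <- data.frame(
  Critic = c("Rater1", "Rater2", "Rater3"),
  MovieA = c(4, NA, 3),
  MovieB = c(NA, 5, 2),
  MovieC = c(NA, 4, 1)
)
toy_ratings <- as.matrix(toy[, -1])

# Every observed rating counts once.
mean_of_all_ratings <- mean(toy_ratings, na.rm = TRUE)

# Every rater counts once, regardless of how many movies they rated.
mean_of_rater_avgs <- mean(rowMeans(toy_ratings, na.rm = TRUE))

c(mean_of_all_ratings, mean_of_rater_avgs)
# [1] 3.166667 3.500000
```

Either definition can serve as a baseline; the first matches the usual description of the global mean, while the second gives a rater with few ratings the same weight as a rater with many.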

With the global mean defined, the user and item biases can be computed.

# user bias: each rater's average minus the global mean
user_bias <- df$RaterAverage - global_mean
# item bias: each movie's average minus the global mean
item_bias <- movie_avg - global_mean

The last step is to apply the global baseline estimate formula to each NA value, replacing it with the predicted rating.

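# loop over raters (skipping the final MovieAverage row)
for (u in 1:(nrow(df) - 1)) {
  # loop over movie columns (skipping Critic in column 1 and RaterAverage in the last column)
  for (i in 2:(ncol(df) - 1)) {
    if (is.na(df[u, i])) {
      # item_bias starts at the first movie, so column i maps to item_bias[i - 1]
      df[u, i] <- global_mean + user_bias[u] + item_bias[i - 1]
    }
  }
}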
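The same fill can also be written without explicit loops. The sketch below is one possible vectorized version on hypothetical toy data, using `outer()` to build the full rater-by-movie grid of predictions; the `toy_` names are illustrative, and the toy global mean is taken over all observed ratings.

```r
# Hypothetical toy ratings; NA marks an unseen movie.
toy <- data.frame(
  Critic = c("Rater1", "Rater2", "Rater3"),
  MovieA = c(4, NA, 3),
  MovieB = c(NA, 5, 2),
  MovieC = c(5, 4, NA)
)
toy_ratings <- as.matrix(toy[, -1])

toy_global_mean <- mean(toy_ratings, na.rm = TRUE)
toy_user_bias <- rowMeans(toy_ratings, na.rm = TRUE) - toy_global_mean
toy_item_bias <- colMeans(toy_ratings, na.rm = TRUE) - toy_global_mean

# outer() adds every user bias to every item bias, giving a prediction per cell.
toy_predicted <- toy_global_mean + outer(toy_user_bias, toy_item_bias, "+")

# Keep observed ratings and fill only the missing cells.
toy_filled <- toy_ratings
toy_filled[is.na(toy_ratings)] <- toy_predicted[is.na(toy_ratings)]
```

This avoids the index offsets needed in the loop, since the bias vectors line up with the rows and columns of the ratings matrix directly.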

This produces the data frame below.

print(df)
        Critic Howls.Moving.Castle Kiki.s.Delivery.Service Pom.Poko
1         Ally            4.516667                4.000000 4.266667
2      Barbara            5.000000                5.000000 5.266667
3       Martin            3.000000                3.000000 4.000000
4         Lucy            4.000000                3.016667 3.266667
5         Amal            5.000000                3.000000 4.000000
6 MovieAverage            4.250000                3.750000 4.000000
  Princess.Mononoke    Ponyo Spirited.Away RaterAverage
1          3.600000 3.933333      3.266667     4.000000
2          4.600000 5.000000      5.000000     5.000000
3          2.000000 2.933333      3.000000     3.000000
4          3.000000 3.000000      2.000000     3.000000
5          5.000000 3.000000      2.000000     3.666667
6          3.333333 3.666667      3.000000     3.733333

Conclusion

My largest challenge throughout this assignment was ensuring that the data types were correct. I had trouble ordering my code in a way that did not overwrite the numeric data type. This issue aside, I was able to populate the data frame with the predicted ratings for each critic’s unseen movies. To further verify this work, one could remove some of the ratings collected from the critics and check how well the global baseline estimate predicts them.
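The verification idea above could be sketched as follows: hide one known rating, recompute the averages without it, and compare the estimate to the hidden value. The toy data, the choice of held-out cell, and all variable names are hypothetical.

```r
# Hypothetical toy ratings; NA marks an unseen movie.
toy <- data.frame(
  Critic = c("Rater1", "Rater2", "Rater3"),
  MovieA = c(4, 5, 3),
  MovieB = c(NA, 5, 2),
  MovieC = c(5, 4, NA)
)
toy_ratings <- as.matrix(toy[, -1])

# Hold out one known rating: Rater3's rating of MovieA.
held_out_value <- toy_ratings[3, "MovieA"]
toy_train <- toy_ratings
toy_train[3, "MovieA"] <- NA

# Recompute the baseline pieces from the remaining ratings only.
train_mean <- mean(toy_train, na.rm = TRUE)
rater_avg <- rowMeans(toy_train, na.rm = TRUE)[3]
movie_avg_a <- colMeans(toy_train, na.rm = TRUE)[["MovieA"]]

# Estimate = global mean + user bias + item bias.
estimate <- train_mean + (rater_avg - train_mean) + (movie_avg_a - train_mean)
abs_error <- abs(estimate - held_out_value)
```

Repeating this for several held-out ratings and averaging the errors would give a rough sense of accuracy, provided each rater and movie still has at least one remaining rating after the removal.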