Global Baseline Estimate

Overview

This assignment takes the results of my film survey and uses the Global Baseline Estimate to predict how the participants would rate the films they had not seen. Each participant rated each film they had seen on a scale from 1 to 5, with 5 being the most enjoyable. Participants did not rate films they had not seen, which leaves missing data. The Global Baseline Estimate fills those gaps in three steps. First, it compares each participant’s average rating to the overall average rating to determine how much higher or lower that participant’s ratings tend to be. Second, it compares each film’s average rating to the overall average to determine how much higher or lower that film’s ratings tend to be. Finally, it adds both adjustments to the overall average to predict the rating for each missing combination of participant and film.
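The three steps above can be sketched as a small function. This is only an illustration: the helper name `baseline_estimate` and the input values are hypothetical and are not taken from the survey data.

```r
# Hypothetical helper illustrating the Global Baseline Estimate for one
# participant-film pair; the name and arguments are assumptions for this sketch.
baseline_estimate <- function(overall_avg, film_avg, rater_avg) {
  film_adj <- film_avg - overall_avg    # how far the film sits from the overall average
  rater_adj <- rater_avg - overall_avg  # how far the participant sits from the overall average
  overall_avg + film_adj + rater_adj
}

# Made-up inputs: overall average 3.9, film average 4.2, participant average 4.0
prediction <- baseline_estimate(overall_avg = 3.9, film_avg = 4.2, rater_avg = 4.0)
print(round(prediction, 2))
```

With these made-up inputs the film sits 0.3 above the overall average and the participant sits 0.1 above it, so the prediction is the overall average plus 0.4.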

Importing Survey Data

This code block imports the data from my GitHub repository and removes the “Date and Time” column, which is not useful for this assignment. It also calculates the overall average film rating from all participants across all films.

csvlink = "https://raw.githubusercontent.com/Marley-Myrianthopoulos/607assignment2/main/607%20Assignment%202%20Movie%20Survey%20Data.csv"

initial_ratings <- read.csv(url(csvlink))

initial_ratings <- subset(initial_ratings, select = -1)

# Overall average: total of all ratings divided by the number of non-missing ratings.
overall_average <- sum(initial_ratings[2:ncol(initial_ratings)], na.rm = TRUE) / sum(!is.na(initial_ratings[2:ncol(initial_ratings)]))
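As a quick check of what the overall average measures, the sketch below uses a small made-up data frame (the `toy` object and its column names are hypothetical, not the survey data). It computes the average as a sum divided by the count of non-missing ratings, and then shows that `mean()` on the ratings converted to a matrix should give the same value.

```r
# Hypothetical toy data: three participants, two films, NA = film not seen
toy <- data.frame(
  participant = c("A", "B", "C"),
  film_1 = c(5, NA, 3),
  film_2 = c(4, 2, NA)
)

# Keep only the rating columns (everything after the participant name)
rating_cols <- toy[2:ncol(toy)]

# Sum of all ratings divided by the number of non-missing ratings
manual_average <- sum(rating_cols, na.rm = TRUE) / sum(!is.na(rating_cols))

# Same quantity from mean() applied to the ratings as a numeric matrix
matrix_average <- mean(as.matrix(rating_cols), na.rm = TRUE)

print(c(manual_average, matrix_average))
```

Both approaches average over individual ratings, not over per-film or per-participant means, which matters when participants have seen different numbers of films.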

Each Participant’s Average Rating

This code block calculates the average rating for each participant. It then uses this information to determine how much higher or lower each participant’s average rating is than the overall average (this information will be used later to adjust the predicted rating for each film from each participant).

initial_ratings$rater_mean <- rowMeans(initial_ratings[2:ncol(initial_ratings)], na.rm = TRUE)

initial_ratings$rater_adj <- initial_ratings$rater_mean - overall_average

Each Film’s Average Rating

This code block identifies the last column that contains film ratings and calculates the average rating for each film. The next section uses these averages to determine how much higher or lower each film’s average rating is than the overall average (this information will be used later to adjust the predicted rating for each film from each participant).

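# The last two columns now hold rater_mean and rater_adj, so the film columns end two columns earlier.
last_film_column <- ncol(initial_ratings) - 2

# Average only the film columns, ignoring missing ratings.
film_averages <- colMeans(initial_ratings[2:last_film_column], na.rm = TRUE)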

Adding Average Film Ratings to the Data Frame

The participant averages and adjustments were added to the data frame as new columns in the same step that calculated them. The film averages and film adjustments, by contrast, must be added as new rows in a separate step. This code block calculates the film adjustments and appends both rows to the data frame.

initial_ratings[nrow(initial_ratings)+1,1] <- "Film Average"

for (i in 2:last_film_column) {
  initial_ratings[nrow(initial_ratings),i] <- film_averages[i-1]
}

initial_ratings[nrow(initial_ratings)+1,1] <- "Film Adjustment"

film_adj <- film_averages - overall_average

for (i in 2:last_film_column) {
  initial_ratings[nrow(initial_ratings),i] <- film_adj[i-1]
}

Projecting Ratings

This code block calculates the projected rating for each missing combination of film and participant. The nested for loops visit every participant and film cell and use the Global Baseline Estimate to replace any “NA” values they find. Since ratings are on a scale from 1 to 5, predictions greater than 5 are replaced with 5, and predictions less than 1 are replaced with 1. The kable function from the knitr package is then used for the tabular output.

respondents <- nrow(initial_ratings) - 2

for (i in 2:last_film_column) {
  for (j in 1:respondents) {
    if (is.na(initial_ratings[j, i])) {
      # Baseline prediction: overall average + film adjustment + participant adjustment
      prediction <- overall_average + initial_ratings[nrow(initial_ratings), i] + initial_ratings[j, ncol(initial_ratings)]
      if (prediction > 5) {
        # Cap predictions above the top of the scale
        initial_ratings[j, i] <- 5
      } else if (prediction < 1) {
        # Raise predictions below the bottom of the scale
        initial_ratings[j, i] <- 1
      } else {
        initial_ratings[j, i] <- prediction
      }
    }
  }
}

library(knitr)

kable(initial_ratings, col.names = c("Participant", "Barbie", "Oppenheimer", "Spider Man: Across the Spiderverse", "Asteroid City", "Dungeons & Dragons: Honor Among Thieves", "Glass Onion: A Knives Out Adventure", "Participant Average", "Participant Adjustment"), format = "pipe", caption = "Movie Ratings (w/ Global Baseline Estimate)", align = "lcccccccc", digits = 2)

Movie Ratings (w/ Global Baseline Estimate)
Participant	Barbie	Oppenheimer	Spider Man: Across the Spiderverse	Asteroid City	Dungeons & Dragons: Honor Among Thieves	Glass Onion: A Knives Out Adventure	Participant Average	Participant Adjustment
Mani	5.00	4.00	5.00	3.00	5.00	5.00	4.50	0.59
Crossman	4.77	4.16	4.51	3.70	5.00	4.00	4.50	0.59
Wilson	4.52	5.00	3.00	5.00	4.25	4.00	4.25	0.34
Julie	4.00	3.86	5.00	2.00	5.00	5.00	4.20	0.29
Tim	5.00	4.66	5.00	4.20	5.00	5.00	5.00	1.09
Blue	5.00	4.66	5.00	4.20	5.00	5.00	5.00	1.09
Elaine	4.00	4.16	4.51	3.70	4.50	5.00	4.50	0.59
Chris	1.00	1.00	1.00	1.00	1.00	4.00	1.50	-2.41
Martin	5.00	5.00	1.00	1.00	1.00	3.00	2.67	-1.24
Zoe	5.00	4.66	5.00	4.20	5.00	5.00	5.00	1.09
Wade	4.27	3.66	4.00	3.20	4.00	4.40	4.00	0.09
Sidus	4.60	3.99	5.00	3.53	4.00	4.00	4.33	0.42
Lemon	3.00	4.26	5.00	5.00	5.00	5.00	4.60	0.69
Berry	5.00	1.00	3.00	4.00	5.00	5.00	3.83	-0.08
Sema	5.00	3.99	4.00	3.53	4.33	4.00	4.33	0.42
John	5.00	5.00	5.00	3.00	3.00	2.00	3.83	-0.08
Dianne	4.00	4.00	5.00	4.00	4.00	4.00	4.17	0.26
Film Average	4.18	3.57	3.92	3.11	3.91	4.31	NA	NA
Film Adjustment	0.27	-0.34	0.01	-0.80	0.00	0.40	NA	NA

Findings and Recommendations

One of the things I’m interested in pursuing further is code efficiency. I’m proud of the algorithm I came up with, which uses three conditional statements inside a for loop inside another for loop, but I suspect that there is a much better way to carry out the process. I’m just not sure where to look for it yet. I hope to revisit this at the end of the semester and try again with my greater knowledge of R.
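One possible direction, offered only as a sketch and not as the method used in this assignment, is to vectorize the calculation with base R’s `outer()`, `pmax()`, and `pmin()`. The example below runs on a small made-up ratings matrix. The object names (`ratings`, `predicted`, `filled`) and the values are hypothetical and are not the survey data.

```r
# Hypothetical toy ratings: rows = participants, columns = films, NA = not seen
ratings <- matrix(
  c( 5, NA,  4,
     4,  2, NA,
    NA,  5,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("film_1", "film_2", "film_3"))
)

# Overall average and the two sets of adjustments
overall_avg <- mean(ratings, na.rm = TRUE)
rater_adj <- rowMeans(ratings, na.rm = TRUE) - overall_avg
film_adj <- colMeans(ratings, na.rm = TRUE) - overall_avg

# outer() builds every participant-adjustment + film-adjustment combination at once
predicted <- overall_avg + outer(rater_adj, film_adj, "+")

# Clamp predictions to the 1-to-5 rating scale
predicted <- pmin(pmax(predicted, 1), 5)

# Replace only the missing cells, leaving observed ratings untouched
filled <- ratings
filled[is.na(ratings)] <- predicted[is.na(ratings)]

print(round(filled, 2))
```

In this sketch the clamping step takes the place of the two range checks, and the logical index `is.na(ratings)` takes the place of the missing-value check inside the loops.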