Tidytuesday’s 2020 week 12 challenge is on The Office! Everyone’s favorite show featuring everyone’s favorite character, Michael Scott. Or is he? I’d like to take a look at how IMDB ratings relate to Michael’s (verbal) presence.

Let’s get started!

#load packages and read in dataset
library(tidyverse)
library(gghighlight)
library(schrute)

raw_data <- schrute::theoffice

raw_data

The Schrute package (great name) pulls in every line by every character in every episode. Awesome! For my purposes, I’m going to want to know how many words each character is speaking. So let’s manipulate this data.

lines_manip <- raw_data %>%
    select(season, episode, character, text) %>%
    tidytext::unnest_tokens(word, text) %>%
    mutate(ep_key = paste0("s", season, "e", episode)) %>%
    add_count(character, ep_key) %>%
    rename(char_words = n) %>%
    add_count(ep_key) %>%
    rename(ep_words = n)

lines_sum <- lines_manip %>%
    select(ep_key, character, char_words, ep_words) %>%
    distinct()

The “lines_manip” dataset gives me a 570k row dataset on every word said during the show. Cool, but not that necessary for what I’m currently doing. “lines_sum” aggregates this up to be one row per character per episode, counting their total words in the episode, and the overall total words in the episode. Below you can see an example for season 1 episode 1.

Now that we have a basic dataset to work off of, let’s start digging into the World’s Greatest Boss (according to coffee mugs). I want to plot the IMDB rating for each epsiode compared to Michael’s share of the total word count.

episodes <- raw_data %>%
    mutate(ep_key = paste0("s", season, "e", episode)) %>%
    select(ep_key, episode_name, imdb_rating) %>%
    unique()

michael_words <- lines_sum %>%
    filter(character == "Michael") 

michael_ratings <- episodes %>%
    left_join(michael_words, by = "ep_key") %>%
    mutate(michael_share = char_words / ep_words) %>%
    replace_na(list(michael_share = 0)) %>%
    arrange(michael_share)

ggplot(michael_ratings, aes(x = michael_share, y = imdb_rating)) +
    geom_point() +
    theme_bw()

Some things immediately stand out here. First question: what’s up with the big line of dots on the left? Well, Michael is not in every episode! Let’s take a step aside and see how the average rating changes based on Michael’s presence.

michael_presence <- michael_ratings %>%
    mutate(michael_exists = if_else(michael_share == 0, "No", "Yes")) %>%
    group_by(michael_exists) %>%
    summarise(avg_rating = mean(imdb_rating))

michael_presence %>%
    knitr::kable(digits = 1, col.names = c("Michael Exists", "Avg Rating"),
                 align = "c")
Michael Exists Avg Rating
No 7.8
Yes 8.4

Looks like viewers do appreciate him being in the show! That’s a good start. Now that we have that taken care of, I want to dial in on the episodes that feature Mike the Magic. Let’s look at the same plot, but only showing episodes where Michael speaks at least 1 word.

michael_ratings_no0 <- episodes %>%
    left_join(michael_words, by = "ep_key") %>%
    mutate(michael_share = char_words / ep_words) %>%
    #replace_na(list(michael_share = 0))  >>> no longer replacing NAs!
    arrange(michael_share)

ggplot(michael_ratings_no0, aes(x = michael_share, y = imdb_rating)) +
    geom_point() +
    theme_bw()

Looks like there’s a few outlier points. Let’s see if we can figure out what’s up with these four:

Aha! This makes some sense. Hardcore fans of the show may have even guessed what each of these data points are. Here’s my brief insight into them:

Now that we have our outliers understood, let’s see if there’s any correlation between Michael’s share of the script and the IMDB rating. I’m going to remove the above four outlier episodes for this.

michael_reg_data <- michael_ratings_highlight %>%
  filter(!ep_key %in% highlights) %>%
  filter(!is.na(michael_share))
  
ggplot(michael_reg_data, aes(x = michael_share, y = imdb_rating)) +
    geom_point() +
    theme_bw() +
    geom_smooth(method = "lm") 

The regression line through these points does have a positive slope, however the confidence bands clearly contain lines that go anywhere from flat to negative.

This scatterplot is a bit too messy to really discern a trend, so let’s look at deciles of Michael share instead.

#Going back to use the full dataset of episodes, including ones Michael isn't in
michael_decile <- michael_ratings %>%
    mutate(decile = cut(michael_share, seq(0, 1, by = .1), right = FALSE))

ggplot(michael_decile, aes(x = decile, y = imdb_rating)) +
    geom_boxplot() +
    theme_bw() +
    labs(x = "Decile of Michael Share")

It does seem that the IMDB ratings generally trend upward as Michael speaks more, though it’s not a perfect pattern. Once Michael starts to take over the episode by speaking up close to 50% or more of total words, the episodes tend to receive lower ratings.

Does this align with your expectations? Do you wish Michael would speak ALL of the lines in an episode?

Feel free to let me know what you think of this analysis at . You can find the full RMarkdown code on my github.