The data used in this analysis came from two sources:

  • IMDB
  • The schrute library

Both data sources were parsed and combined into one data set, TheOffice. IMDB describes The Office as “a mockumentary on a group of typical office workers, where the workday consists of ego clashes, inappropriate behavior, and tedium.”

I have seen The Office several times, and in 2021 it still ranks among my top 5 favorite shows of all time. Let’s begin, high five!

Introduction

The Office consists of 9 seasons and ~200 episodes, and ran from March 2005 to May 2013.

library(tidyverse)   # dplyr, ggplot2, stringr, forcats
library(schrute)     # the theoffice transcript data set
library(tidytext)    # stop_words, unnest_tokens, bind_tf_idf, reorder_within
library(tidymodels)  # recipes, parsnip, rsample, tune

fdf <- data.frame(theoffice)
glimpse(fdf)
## Rows: 55,130
## Columns: 12
## $ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ season           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ episode          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ episode_name     <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
## $ director         <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
## $ writer           <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
## $ character        <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
## $ text             <chr> "All right Jim. Your quarterlies look very good. How …
## $ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
## $ imdb_rating      <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
## $ total_votes      <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
## $ air_date         <fct> 2005-03-24, 2005-03-24, 2005-03-24, 2005-03-24, 2005-…

Exploratory Data Analysis

Ratings analysis

To begin we will look at how the IMDB ratings changed throughout the show.

The boxplots show that season 4 was the best season on average and season 7 has the most variation. Season 9 has a few dedicated reviewers, giving scores in the high 9s.

fdf %>%
  ggplot(aes(x=season, y=imdb_rating, fill=as.factor(season))) +
  geom_boxplot(show.legend = F) +
  scale_x_continuous(breaks=1:9) +
  labs(title='IMDB ratings per season', y='IMDB Ratings', x='Season')

The plot below shows how the average ratings peaked around seasons 4 and 5. Stress Relief, Goodbye Michael and the finale have the highest ratings along with a lot of votes. The Banker and Gone Girl were the least favorite episodes.

fdf %>% distinct(season, episode, episode_name, imdb_rating, total_votes) %>%
  mutate(title = str_replace_all(string=episode_name, pattern=' ', replacement='\n') %>%
           str_replace_all(pattern='\n\\(Parts\n1&2\\)', replacement=''),  # strip the '(Parts 1&2)' suffix
         title = fct_inorder(title),
         rn = row_number()) %>%
  ggplot(aes(x=rn, y=imdb_rating)) + 
  geom_line() + 
  geom_smooth(method = "loess", size = 1, colour = "blueviolet", fill = "blueviolet", alpha=0.3) +
  geom_point(aes(color=factor(season), size=total_votes)) +
  geom_text(aes(label=title), check_overlap = T, hjust=1) +
  geom_vline(xintercept = 136 , color = "red", linetype = "dashed") +
  expand_limits(x=-10, y=10.01) +
  theme(axis.text.x = element_blank(), legend.position = 'none') + 
  labs(x = 'Episode Number', 
       y='IMDB Rating', 
       title='The office IMDB Ratings', 
       subtitle = 'Color indicates seasons')
## `geom_smooth()` using formula 'y ~ x'

My favorite episode of The Office is Stress Relief, where Dwight starts a fire in the office. It looks like my favorite ranks number 3. Also, seasons 1 and 8 didn’t land any episodes in the top 10.

fdf %>%
  distinct(season, episode, episode_name, imdb_rating) %>%
  arrange(desc(imdb_rating)) %>%
  mutate(title=paste0('S',season, 'E', episode,'-', episode_name),
         title=fct_reorder(title, imdb_rating)) %>%
  top_n(10, wt=imdb_rating) %>%
  ggplot(aes(x=title, y=imdb_rating)) + 
  geom_point(aes(color=factor(season))) + 
  geom_text(label='My favorite', y=9.55, x='S5E14-Stress Relief (Parts 1&2)', color='darkgray') +
  coord_flip() +
  labs(color='Seasons',
       title='Top 10 most popular episodes',
       y='IMDB Ratings')

Transcript analysis

Now we will move to the transcripts. The data set contains all of the lines in the show, so we can see how unique each character is.

I want to look at the most common words said by specific characters. The steps are:

  • Remove words that are not very interesting.
  • Filter by the characters we are interested in.
  • Calculate term frequency and inverse document frequency (tf-idf).
# Extend the standard stop word list with Office-specific filler words
st_words <- rbind(stop_words,
                  data.frame(word=c('yeah', 'hey', 'huh', 'gonna', 'em', 'um', 'uh'),
                             lexicon='office_words'))

interesting_characters <- c('Dwight', 'Michael', 'Jim', 'Pam', 'Angela', 'Andy')

fdf %>%
  unnest_tokens(word, text) %>%         # one row per word spoken
  anti_join(st_words, by='word') %>%    # drop the stop words
  add_count(word) %>%
  filter(n > 20) %>%                    # keep words said more than 20 times overall
  count(word, character) %>%
  bind_tf_idf(word, character, n) %>%   # each character is a 'document'
  filter(character %in% interesting_characters) %>%
  group_by(character) %>%
  top_n(20, tf_idf) %>%                 # 20 most distinctive words per character
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, character)) %>%
  ggplot(aes(word, tf_idf)) +
  geom_col() +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~character, scales='free') +
  labs(y = 'Term Frequency and Inverse Document Frequency',
       x = '')

It should be noted that these are not the most common words said, but the words most unique to each character. It just so happens that sometimes the most unique words are also the most frequent ones.
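
To see how tf-idf separates “frequent” from “distinctive”, here is a toy example on made-up data (not the show): a word used by every document gets an inverse document frequency of log(2/2) = 0, so its tf-idf is 0 no matter how often it is said.

toy <- tribble(
  ~doc,   ~word,   ~n,
  'doc1', 'paper', 10,
  'doc1', 'beets',  4,
  'doc2', 'paper', 12,
  'doc2', 'tuna',   3
)

toy %>% bind_tf_idf(word, doc, n)
# 'paper' appears in both docs, so its tf_idf is 0; 'beets' and 'tuna' score high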

To fully appreciate this plot you need to have watched The Office. Dwight talks to Jim and Michael a lot and likes to refer to himself by his last name, hence his top 3 words. Andy calls Jim “tuna” and says “whoa” a lot. Michael, Jim and Angela refer to Dwight a lot. Pam reported to Michael, so she often refers to him. A lot more can be said about these 6 plots.

Another set of characters worth looking at are those that played only a small role. Even though I have watched The Office multiple times, I have no idea who these people are. Most of their names don’t ring a bell. I guess I have to watch the series again.

no_name_character <- c('Woman', 'Pizza guy', 'Nurse', 'Host', 'Bar Manager')

fdf %>% 
  add_count(character, name='lines') %>% 
  distinct(character, season, episode_name, lines) %>% 
  filter(lines > 20, !(character %in% no_name_character)) %>% 
  arrange(desc(lines)) %>% 
  tail(20) %>% 
  ggplot(aes(x=reorder(character, lines), y=lines)) + 
  geom_point(aes(color=factor(season))) + 
  geom_text(aes(label=episode_name), check_overlap = T, hjust=1.15) +
  coord_flip() + 
  expand_limits(y=17.5) +
  labs(x='Characters',
       y='Number of lines',
       color='Seasons',
       title= 'Forgotten Characters',
       subtitle = 'Episode Name (season)')

Unsupervised learning - Principal Component Analysis

Principal Component Analysis (PCA) is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
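
As a minimal illustration on toy data (not the show), PCA turns two correlated columns into uncorrelated components:

set.seed(1)
m <- matrix(rnorm(200), ncol = 2)        # 100 observations, 2 variables
m[, 2] <- m[, 1] + rnorm(100, sd = 0.3)  # make the second column track the first
p <- prcomp(m, scale. = TRUE)
round(cor(p$x), 2)                       # off-diagonal correlations are 0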

For every episode we will count the number of times each character speaks. To prevent this data set from getting too big I will filter out characters who have fewer than 600 lines overall. For example, in the Pilot, Andy and Erin have no lines because they had not joined the show yet, while Michael and Dwight speak the most.
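
The chunk that builds this per-episode table isn’t shown; a minimal sketch of one way it could be constructed (assuming the 600-line cutoff described above; the column order differs from the glimpse below):

main_cast <- fdf %>%
  count(character, name='total_lines') %>%
  filter(total_lines >= 600) %>%      # keep only frequently speaking characters
  pull(character)

df <- fdf %>%
  filter(character %in% main_cast) %>%
  count(episode_name, character, season, episode, imdb_rating, name='lines') %>%
  pivot_wider(names_from=character, values_from=lines, values_fill=0)

Glimpsing the Pilot row of the result: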

## Rows: 1
## Columns: 20
## $ episode_name <chr> "Pilot"
## $ Andy         <int> 0
## $ Angela       <int> 1
## $ Darryl       <int> 0
## $ Dwight       <int> 29
## $ Jim          <int> 36
## $ Kelly        <int> 0
## $ Kevin        <int> 1
## $ Michael      <int> 81
## $ Oscar        <int> 3
## $ Pam          <int> 41
## $ Phyllis      <int> 2
## $ Ryan         <int> 8
## $ Stanley      <int> 5
## $ Toby         <int> 0
## $ Erin         <int> 0
## $ Jan          <int> 12
## $ season       <int> 1
## $ episode      <int> 1
## $ imdb_rating  <dbl> 7.6
rdf <- df %>% select(-c('imdb_rating', 'episode_name'))

pca_rec <- recipe(~., data=rdf) %>%
  update_role(season, episode, new_role = 's-e') %>%  # keep as identifiers, not predictors
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())

pca_prep <- prep(pca_rec)
df_prep <- tidy(pca_prep, 2)  # loadings from the second step (the PCA)

The principal components are a bit difficult to interpret here, so let’s look at the first two, which contribute the most.

  • PC1 has characters that never had a strong relationship with others in the show, except for Angela.
  • PC2 groups Michael, Pam, Jim and Dwight, who are like the couples of the show, so their grouping makes sense. Erin is also in this group, but her loading points in the opposite direction.

df_prep %>%
  group_by(component) %>%
  top_n(5, abs(value)) %>%
  ungroup() %>%
  mutate(terms=reorder_within(terms, abs(value), component),
         component=fct_inorder(component)) %>%
  ggplot(aes(x=abs(value), y=terms, fill=value > 0)) + 
  geom_col() + 
  facet_wrap(~component, scales='free_y') + 
  scale_y_reordered() +
  labs(y=NULL, 
       x=NULL,
       fill='Positive?',
       title='Principal Components')

PC1 to PC3 capture 43% of the variance in the show by number of lines spoken. Said differently, Kevin, Andy, Oscar, Erin, Angela, Michael, Pam, Jim, Dwight, Kelly and Ryan make up most of the show. Any Office fan can agree with that.

# The standard deviation of each component lives inside the prepped PCA step
sdev <- pca_prep$steps[[2]]$res$sdev
pct_var <- sdev^2 / sum(sdev^2)
ncomp <- length(sdev)  # one component per normalized predictor

data.frame(component = paste0('PC', 1:ncomp), percent_var=pct_var) %>%
  mutate(component = fct_inorder(component)) %>%
  ggplot(aes(x=component, y=percent_var)) + 
  geom_line(aes(x=component, y=cumsum(pct_var)), group=1, size=1.5) +
  geom_point(aes(x=component, y=cumsum(pct_var))) + 
  geom_col()

Supervised learning - Ridge Regression

In this section we will try to predict IMDB ratings from the same data set: the number of lines spoken by each character. We will use Ridge Regression, a linear least squares model with l2 regularization.
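
Concretely, ridge regression picks coefficients \(\beta\) by minimizing

\[ \sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \]

where \(\lambda \ge 0\) controls the amount of shrinkage. In the model specification below, penalty corresponds to \(\lambda\) and mixture = 0 selects the pure l2 (ridge) penalty.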

To begin we split the data into training and testing sets. Training will be used to tune the model and testing will be used to validate the model.

set.seed(72)
df_split <- initial_split(df, strata=season)
df_train <- training(df_split)
df_test <- testing(df_split)

Create a recipe with the numeric data, and normalize all values.

df_rec <- recipe(imdb_rating ~., data=df_train) %>%
  update_role(episode_name, season, episode, new_role='ense') %>% 
  step_normalize(all_predictors())

df_prep <- df_rec %>% prep(strings_as_factors=F)

Build the model and set up the workflow.

model <- linear_reg(penalty = tune(), mixture = 0) %>%  # mixture = 0 means pure ridge
  set_engine('glmnet')

model_wf <- workflow() %>%
  add_recipe(df_rec) %>% 
  add_model(model)

We will take multiple samples of the data set through bootstrapping. Those samples will be used to tune the model.

set.seed(42)

grid <- grid_regular(penalty(), levels=30)
folds <- bootstraps(df_train, strata = season)

As one would expect, model tuning is a long process, so we will employ parallelism to speed it up.

doParallel::registerDoParallel()
set.seed(42)
model_grid <- tune_grid(model_wf, resamples=folds, grid=grid)

With the tuning process completed, we can visualize the metrics. We want the lowest Root Mean Square Error (RMSE) and the highest \(R^2\).
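
For reference, \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\) measures the typical prediction error in rating units, while \(R^2\) is the share of rating variance the model explains.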

model_grid %>% collect_metrics() %>% 
  ggplot(aes(x=penalty, y=mean, color=.metric)) + 
  geom_line() + 
  facet_wrap(~.metric, scales='free', nrow=2) + 
  scale_x_log10() +
  theme(legend.position = 'none')
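
The best penalty then has to be selected from the grid and locked into the workflow. That step is not shown here, but presumably it looked something like this sketch (assuming the RMSE-optimal penalty is chosen):

best_param <- select_best(model_grid, metric='rmse')    # penalty with the lowest RMSE
final_model <- finalize_workflow(model_wf, best_param)  # fix the tuned penalty
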

It looks like the critics really like Michael, followed by Jan, Dwight and Jim, as they contribute to higher IMDB ratings. Erin was not liked much, along with Andy, Darryl and Ryan, who contributed negatively to the ratings. I agree with the positives, as Dwight was my favorite character, but I wouldn’t be so harsh on the negatives.

library(vip)  # vi() pulls coefficient-based importance from the glmnet fit

final_model %>%
  fit(df_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = best_param$penalty) %>%
  mutate(Importance=abs(Importance),
         Variable=fct_reorder(Variable, Importance)) %>%
  ggplot(aes(x=Importance, y=Variable, fill=Sign)) +
  geom_col() +
  scale_x_continuous() +
  labs(y=NULL)

However, these results shouldn’t be taken too seriously: as we can see from the RMSE and \(R^2\) plots, the model still has a lot to learn. I would say a linear model isn’t enough to capture the complexities of this show.

model_fit <- last_fit(final_model, df_split) 
model_fit %>% collect_metrics()
## # A tibble: 2 x 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       0.478 Preprocessor1_Model1
## 2 rsq     standard       0.236 Preprocessor1_Model1
collect_predictions(model_fit) %>%
  ggplot(aes(x=imdb_rating, y=.pred)) + 
  geom_abline(lty=2) +
  geom_point()