For my first-ever data analysis project, and as a devoted fan of the
American mockumentary TV series The Office, I’m excited to embark on a
project that showcases the importance of the characters and their
relationships. As someone who has seen the full series more times than I
can count on my fingers, I’ve always been curious about the relationships
and patterns that lie within the characters and episodes.
This project represents my initial foray into data analysis, where I aim to:
Questions I answer in this project include:
Michael Scott once said, “You miss 100% of the shots you don’t take”,
and in the same spirit, I embark on my first data analysis project.
For this analysis of The Office, I integrated four data sets from Kaggle, each providing information about every episode of the show. The raw data sets require significant processing to produce an analysis-ready data structure. Here are the links to the Kaggle pages where I obtained the data sets.
Working with the episode information data sets:
| Season | Episode | Title | Director | Writer | Air Date | Viewers | IMDb | Viewership |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais & Stephen Merchant and Greg Daniels | Mar 24, 2005 | 11.2M | 7.4 | 11.2 |
| 1 | 2 | Diversity Day | Ken Kwapis | B. J. Novak | Mar 29, 2005 | 6M | 8.3 | 6.0 |
| 1 | 3 | Health Care | Ken Whittingham | Paul Lieberstein | Apr 05, 2005 | 5.8M | 7.7 | 5.8 |
| 1 | 4 | The Alliance | Bryan Gordon | Michael Schur | Apr 12, 2005 | 5.4M | 8.0 | 5.4 |
| 1 | 5 | Basketball | Greg Daniels | Greg Daniels | Apr 19, 2005 | 5M | 8.4 | 5.0 |
| 1 | 6 | Hot Girl | Amy Heckerling | Mindy Kaling | Apr 26, 2005 | 4.8M | 7.7 | 4.8 |
| 2 | 1 | The Dundies | Greg Daniels | Mindy Kaling | Sep 20, 2005 | 9M | 8.7 | 9.0 |
| 2 | 2 | Sexual Harassment | Ken Kwapis | B. J. Novak | Sep 27, 2005 | 7.1M | 8.2 | 7.1 |
| 2 | 3 | Office Olympics | Paul Feig | Michael Schur | Oct 04, 2005 | 8.3M | 8.3 | 8.3 |
| 2 | 4 | The Fire | Ken Kwapis | B. J. Novak | Oct 11, 2005 | 7.6M | 8.3 | 7.6 |
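The table above carries viewership both as text (`Viewers`, e.g. `11.2M`) and as a number (`Viewership`). The original analysis was done in R and its code is not shown, so the following is only a minimal Python sketch of how such a conversion could look. The helper names `parse_viewers` and `parse_air_date` are hypothetical, and the assumption that `Viewership` was derived from `Viewers` is mine.

```python
from datetime import datetime


def parse_viewers(text):
    """Convert a string such as '11.2M' into millions of viewers (float)."""
    cleaned = text.strip().upper()
    if cleaned.endswith("M"):
        return float(cleaned[:-1])
    # Assumption: a value without the 'M' suffix is a raw viewer count.
    return float(cleaned) / 1_000_000


def parse_air_date(text):
    """Convert a string such as 'Mar 24, 2005' into a date object.

    Month abbreviations are parsed with the current locale, so this
    assumes an English (or C) locale.
    """
    return datetime.strptime(text, "%b %d, %Y").date()


# Two rows copied from the table above.
rows = [
    {"Title": "Pilot", "Air Date": "Mar 24, 2005", "Viewers": "11.2M"},
    {"Title": "Diversity Day", "Air Date": "Mar 29, 2005", "Viewers": "6M"},
]

for row in rows:
    row["Viewership"] = parse_viewers(row["Viewers"])
    row["Air Date"] = parse_air_date(row["Air Date"])
```

Keeping the numeric column in millions makes later averages directly comparable with the figures quoted in the text.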
Working with the script data set:
Initial data loading
Filter out deleted scenes
Only include relevant columns
Turn season and episode values into integers
Remove any duplicates
| season | episode | line_text | speaker |
|---|---|---|---|
| 1 | 1 | All right Jim. Your quarterlies look very good. How are things at the library? | Michael |
| 1 | 1 | Oh, I told you. I couldn’t close it. So… | Jim |
| 1 | 1 | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? | Michael |
| 1 | 1 | Actually, you called me in here, but yeah. | Jim |
| 1 | 1 | All right. Well, let me show you how it’s done. | Michael |
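The cleaning steps listed above (load, drop deleted scenes, keep the relevant columns, convert season and episode to integers, remove duplicates) can be sketched roughly as follows. This is a Python illustration using only the standard library, not the R code behind the report. The `deleted` and `scene` columns, the `TRUE`/`FALSE` values, and the function name `clean_script` are assumptions about the raw file.

```python
import csv
import io

# Tiny stand-in for the raw script file. The `scene` and `deleted`
# columns and the TRUE/FALSE flag values are assumed, not confirmed.
RAW_CSV = """season,episode,scene,line_text,speaker,deleted
1,1,1,"All right Jim. Your quarterlies look very good.",Michael,FALSE
1,1,1,"Oh, I told you. I couldn't close it. So...",Jim,FALSE
1,1,1,"Oh, I told you. I couldn't close it. So...",Jim,FALSE
1,1,2,"A line from a deleted scene.",Pam,TRUE
"""

KEPT_COLUMNS = ("season", "episode", "line_text", "speaker")


def clean_script(csv_text):
    """Return cleaned script rows as a list of dicts."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Filter out deleted scenes.
        if row["deleted"].strip().upper() == "TRUE":
            continue
        # Keep only the relevant columns, with season/episode as integers.
        record = (
            int(row["season"]),
            int(row["episode"]),
            row["line_text"],
            row["speaker"],
        )
        # Remove exact duplicates of the kept columns.
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(dict(zip(KEPT_COLUMNS, record)))
    return cleaned


lines = clean_script(RAW_CSV)
```

One consequence of deduplicating after dropping the other columns is that a character who genuinely repeats the same line within an episode is counted once, which slightly lowers line counts.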
Now that both data sets have been thoroughly cleaned and structured with consistent formatting, I have a solid foundation to begin the analysis of The Office.
To explore how the focus shifts between characters from season to season, I created a season-by-season breakdown of the characters with the most lines. Below is an interactive plot of the top 5 characters per season by lines spoken. Hovering over a segment of the bar graph shows which character that colour represents and how many lines the character spoke in that season.
To better understand the character dynamics throughout The Office, I analyzed the script data to identify the most frequently appearing characters in each season. Using the dialogue records, I counted the number of lines spoken by each character per season. This method highlights the main characters of the series, such as Michael, Dwight, Jim, and Pam, and also reveals how the supporting characters fluctuate over time. For example, Ryan appears in the top 5 in Season 4, which is when he takes over Jan’s job and becomes the Vice President of Northeast Sales at Dunder Mifflin. Similarly, Angela appears in the top 5 in the last season, since she gets a side story with Dwight in which they get married. The plot gives visible insight into character development and character arcs, and hovering over a section of the stacked bar graph makes it easier to see which character it represents.
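The counting described above, lines per character per season followed by a top-5 cut, can be sketched like this. It is a Python illustration rather than the R code used for the plot, and the function name `top_speakers_by_season` is hypothetical. The sample rows are the five Season 1 lines shown in the script table earlier.

```python
from collections import Counter, defaultdict


def top_speakers_by_season(lines, n=5):
    """Map each season to its n most frequent speakers with line counts."""
    counts = defaultdict(Counter)
    for line in lines:
        counts[line["season"]][line["speaker"]] += 1
    return {
        season: counter.most_common(n)
        for season, counter in sorted(counts.items())
    }


# Speakers of the five cleaned script rows shown earlier.
sample_lines = [
    {"season": 1, "speaker": "Michael"},
    {"season": 1, "speaker": "Jim"},
    {"season": 1, "speaker": "Michael"},
    {"season": 1, "speaker": "Jim"},
    {"season": 1, "speaker": "Michael"},
]

top = top_speakers_by_season(sample_lines)
print(top)
```

The resulting per-season lists are the kind of input a stacked bar chart needs: one bar per season, one segment per character.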
One of the most pivotal moments in The Office is when Michael leaves the show near the end of Season 7. It is a well-discussed view in The Office fan community that the show goes downhill after his departure. I want to explore whether this claim can be backed up by data. Below is a chronological line graph of the IMDb rating of each episode before and after Michael leaves the show.
| Era | Avg. Viewers (Millions) | Avg. IMDb Rating |
|---|---|---|
| With Michael | 8.19 | 8.44 |
| Without Michael | 4.93 | 7.77 |
The data here supports the widely held belief among fans of The Office that the show declined after Michael left. During the Michael era, the show averaged 8.08 million viewers per episode with an impressive average IMDb rating of 8.40. After he leaves, by contrast, viewership declines significantly to an average of 4.89 million viewers, with an average IMDb rating of 7.71.
The noticeable dip in viewership and ratings showcases the significant role Michael played in the success of the show. We can also see a gradual increase in ratings during the last few episodes of the series, suggesting that the lead-up to the series finale had viewers intrigued. The two top-rated episodes in the series each mark the end of an era: the first is Michael’s departure, and the second is the series finale.
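The era comparison above boils down to labelling each episode as before or after Michael's departure and averaging within each group. Below is a minimal Python sketch of that idea, not the R code behind the table. The cutoff `LAST_MICHAEL_EPISODE` is an assumption, since the exact episode number depends on how the data set counts two-part episodes. The two Season 8 rows are hypothetical placeholders, not real ratings.

```python
from statistics import mean

# Assumption: Michael's last regular episode is season 7, episode 22.
# Adjust to match how the data set numbers two-part episodes.
LAST_MICHAEL_EPISODE = (7, 22)


def era(season, episode):
    """Label an episode by whether it airs before or after the cutoff."""
    if (season, episode) <= LAST_MICHAEL_EPISODE:
        return "With Michael"
    return "Without Michael"


def era_averages(episodes):
    """Average viewership (millions) and IMDb rating per era."""
    groups = {}
    for ep in episodes:
        groups.setdefault(era(ep["season"], ep["episode"]), []).append(ep)
    return {
        name: {
            "avg_viewers_millions": round(mean(e["viewership"] for e in eps), 2),
            "avg_rating": round(mean(e["rating"] for e in eps), 2),
        }
        for name, eps in groups.items()
    }


episodes = [
    # Rows taken from the episode table earlier in this report.
    {"season": 1, "episode": 1, "viewership": 11.2, "rating": 7.4},
    {"season": 1, "episode": 2, "viewership": 6.0, "rating": 8.3},
    # Hypothetical placeholder rows, used only to exercise the second era.
    {"season": 8, "episode": 1, "viewership": 5.0, "rating": 7.5},
    {"season": 8, "episode": 2, "viewership": 4.0, "rating": 8.0},
]

summary = era_averages(episodes)
```

Comparing `(season, episode)` tuples keeps the cutoff logic in one place, so moving the boundary by an episode only requires changing the constant.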
“That’s what she said” is the most iconic running joke in The Office, made popular by Michael. Not only is it a fan favourite, it is also a defining part of the show’s humour. To explore whether episodes that feature the joke show any measurable difference in rating, I collected data from The Office Wiki, which lists every episode in which the joke is used. This section aims to measure the impact by comparing the IMDb ratings of episodes that include the joke with those that don’t.
| Non-TWSS Episodes | TWSS Episodes | p-value |
|---|---|---|
| 8.22 | 8.518 | 0.022 |
The boxplot and the statistical analysis compare the average IMDb ratings of episodes that include the joke with those that don’t. Episodes with the joke have an average IMDb rating of 8.42, and episodes without it have an average rating of 8.18, a modest increase.
Looking more closely at the statistical analysis, comparing the two groups gives a p-value of 0.071, which is slightly above the conventional threshold of 0.05. This indicates that we cannot confidently say the difference in ratings is due to the presence of the joke rather than random variation.
In summary, while we can see a small increase in the average ratings, the data does not provide strong enough evidence to conclude that including the joke has an impact on an episode’s IMDb rating.
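The report does not say which test produced the p-value, so the following is a swapped-in illustration rather than the original method: an exact two-sided permutation test on the difference in group means, written in Python with only the standard library. The function name `permutation_p_value` is hypothetical, and the two rating lists are made-up placeholders, not the real TWSS and non-TWSS ratings.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means.

    Enumerates every way of splitting the pooled values into groups of
    the original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical placeholder ratings, for illustration only.
with_joke = [8.7, 8.4, 8.3, 8.9]
without_joke = [8.2, 7.7, 8.0, 8.3, 7.9]

p_value = permutation_p_value(with_joke, without_joke)
print(f"p-value: {p_value:.3f}")
```

A p-value above 0.05 from a test like this would lead to the same reading as in the text: the observed gap could plausibly arise from random variation. For the full data set, a randomly sampled permutation test or a t-test would be more practical than full enumeration.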
Embarking on this project, both as a fan of The Office and as a beginner on my journey as a data analyst, has been very rewarding. Through this project I was able to take my curiosity about my favourite show and translate it into a data-driven analysis that answers questions I’ve had throughout my viewings of the show. Along the way, I gained valuable hands-on experience, from data wrangling, where I merged multiple messy data sets into something usable, to creating appealing visual plots and running statistical tests in R, with the report published as HTML.
Key insights I gathered about the show include:
Character relevance shifts throughout the show: the main characters Michael, Dwight, Jim, and Pam dominate, but side characters fluctuate in interesting ways, such as Ryan’s peak in Season 4 and Angela’s rise in the final season.
Michael Scott’s departure had a clear impact on the show, with a significant drop in both viewership and IMDb ratings, which supports the widely debated fan view that the show declined after he left.
The iconic “That’s what she said” joke, while beloved, did not have a statistically significant effect on episode ratings, though episodes with the joke had slightly higher IMDb ratings.
This project not only answered my questions but also helped me develop stronger data analysis skills, learn how to code in R, and create a meaningful report using RStudio, Markdown, and HTML.