1. Introduction

This is the second part of the Project 2.

The goal of this exercise is to get the data on the viewership of each episode of the TV series “The Walking Dead”, and to compare it with IMDB ratings for the episodes. In particular, the tasks are:
1) Compare viewership across seasons
2) Compare public ratings of episodes on a site like IMDB or Rotten Tomatoes to the viewership

2. Read the data

In order to read the data directly from Wikipedia and IMDB, the package rvest is used. Link to the tutorial I used to understand its functions is provided at the bottom of the page.

# Load the required packages
library(readr)
library(knitr)
library(tidyr)
library(dplyr)
library(plotly)
library(rvest)
library(DT)

First, we extract the viewership data from Wikipedia

viewers = read_html("https://en.wikipedia.org/wiki/Template:The_Walking_Dead_ratings")

# Get the table with viewer counts

wiki = viewers %>% 
  html_nodes(".wikitable") %>%
  .[[2]] %>% 
  html_table(header = T, fill = T)

head(wiki)
##   Season Episode number Episode number Episode number Episode number
## 1 Season          1.000          2.000          3.000          4.000
## 2      1          5.350          4.710          5.070          4.750
## 3      2          7.260          6.700          6.100          6.290
## 4      3         10.970          9.550         10.510          9.270
## 5      4         16.111         13.950         12.920         13.310
## 6      5         17.300         15.143         13.801         14.518
##   Episode number Episode number Episode number Episode number
## 1           5.00          6.000              7              8
## 2           5.56          5.970              —              —
## 3           6.12          6.080          6.620          8.100
## 4          10.37          9.210         10.430         10.480
## 5          12.20         12.000         11.290         12.050
## 6          13.53         14.068         13.330         14.807
##   Episode number Episode number Episode number Episode number
## 1              9             10             11             12
## 2              —              —              —              —
## 3          6.890          7.040          6.770          6.890
## 4         12.260         11.050         11.010         11.300
## 5         15.760         13.340         13.120         12.610
## 6         15.643         12.267         13.438         14.430
##   Episode number Episode number Episode number Episode number
## 1             13             14             15             16
## 2              —              —              —              —
## 3          8.990              —              —              —
## 4         11.460         10.840         10.990         12.420
## 5         12.650         12.870         13.470         15.680
## 6         14.534         13.781         13.757         15.784

Secondly, we extract episode names and ratings from the respective IMDB webpage

wd_ratings = read_html("http://www.imdb.com/list/ls012062344/?start=1&view=compact&sort=release_date_us:asc&defaults=1&scb=0.6008014953041274")

# Get episode names
episode_list = wd_ratings %>% 
  html_nodes(".episode a:nth-child(3)") %>%
  html_text() 

# Get episode ratings
episode_rating = wd_ratings %>% 
  html_nodes(".user_rating ") %>%
  html_text()

# Remove the empty string in the ratings vector
episode_rating = episode_rating[2:length(episode_rating)]

# Bind together
imdb = data.frame(name = as.character(episode_list), 
                  rating = as.numeric(episode_rating))

head(imdb)
##                   name rating
## 1        Days Gone Bye    9.2
## 2                 Guts    8.7
## 3 Tell It to the Frogs    8.3
## 4                Vatos    8.6
## 5             Wildfire    8.3
## 6                TS-19    8.7

The data from IMDB seems inconsistent in terms of the number of episodes with rating: Wikipedia lists 17 episodes, and IMDB data has just 2 episodes and no indication of season and episode number.

Therefore we use an additional datasource listing all episode numbers and names to facilitate the joining of the data. This data was downloaded from the website EpGuides and stored as .CSV file.

epnames = read_csv("wd_episodes.csv")
## Parsed with column specification:
## cols(
##   number = col_character(),
##   season = col_integer(),
##   episode = col_integer(),
##   airdate = col_character(),
##   title = col_character(),
##   `tvmaze link` = col_character()
## )
kable(epnames[1:10,])
number season episode airdate title tvmaze link
1 1 1 31 Oct 10 Days Gone Bye http://www.tvmaze.com/episodes/4095/the-walking-dead-1x01-days-gone-bye
2 1 2 07 Nov 10 Guts http://www.tvmaze.com/episodes/4096/the-walking-dead-1x02-guts
3 1 3 14 Nov 10 Tell It to the Frogs http://www.tvmaze.com/episodes/4097/the-walking-dead-1x03-tell-it-to-the-frogs
4 1 4 21 Nov 10 Vatos http://www.tvmaze.com/episodes/4098/the-walking-dead-1x04-vatos
5 1 5 28 Nov 10 Wildfire http://www.tvmaze.com/episodes/4099/the-walking-dead-1x05-wildfire
6 1 6 05 Dec 10 TS-19 http://www.tvmaze.com/episodes/4100/the-walking-dead-1x06-ts-19
7 2 1 16 Oct 11 What Lies Ahead http://www.tvmaze.com/episodes/4101/the-walking-dead-2x01-what-lies-ahead
8 2 2 23 Oct 11 Bloodletting http://www.tvmaze.com/episodes/4102/the-walking-dead-2x02-bloodletting
9 2 3 30 Oct 11 Save the Last One http://www.tvmaze.com/episodes/4103/the-walking-dead-2x03-save-the-last-one
10 2 4 06 Nov 11 Cherokee Rose http://www.tvmaze.com/episodes/4104/the-walking-dead-2x04-cherokee-rose

3. Transform the data into tidy format

Now the Wikipedia data should be cleaned and transformed into a long format for joining with the names and rating data.

# Apply correct column names and drop the row with names
names(wiki) = wiki[1,]

wiki = slice(wiki, 2:nrow(wiki))

# Transform into tidy format and fix variable types

wiki1 = wiki %>% 
  # Gather counts per episode into one column
  gather("episode","Mln_viewers", 2:17) %>% 
  # Convert missing values into NAs
  mutate(Mln_viewers = as.numeric(Mln_viewers)) %>% 
  # Fix variable names and types for joining
  rename(season = Season) %>% 
  mutate(season = as.integer(season),
         episode = as.integer(episode))

The Wikipedia dataset is now in tidy format and can be joined with the ratings data

kable(wiki[1:10,])
Season 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 1 5.350 4.710 5.070 4.750 5.56 5.970
2 2 7.260 6.700 6.100 6.290 6.12 6.080 6.620 8.100 6.890 7.040 6.770 6.890 8.990
3 3 10.970 9.550 10.510 9.270 10.37 9.210 10.430 10.480 12.260 11.050 11.010 11.300 11.460 10.840 10.990 12.420
4 4 16.111 13.950 12.920 13.310 12.20 12.000 11.290 12.050 15.760 13.340 13.120 12.610 12.650 12.870 13.470 15.680
5 5 17.300 15.143 13.801 14.518 13.53 14.068 13.330 14.807 15.643 12.267 13.438 14.430 14.534 13.781 13.757 15.784
6 6 14.633 12.183 13.143 13.339 12.44 12.871 13.224 13.981 13.742 13.483 12.794 12.812 12.530 12.686 12.384 14.193
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Join the episode viewership with the episode names and then with ratings

df = left_join(epnames, wiki1) %>% select (- `tvmaze link`)
df = left_join(df, imdb, by = c("title"="name"))

Remove rows without viewer information

df = df %>% filter(complete.cases(Mln_viewers))

Surprizingly, most of the name strings provided and exact match, and we now have a cleaned up tidy table with episode names, viewers, and ratings that is ready for analysis:

datatable(df)

4. Calculate summary statistics

4.1. Compare viewership across seasons

Let’s have a look at the average viewership across the seasons and the maximum viewer numbers per season

season_stats = df %>% 
  group_by(season) %>% 
  summarise(season_avg_v = mean(Mln_viewers), # average viewers 
         season_max_v = max(Mln_viewers), # max viewers
         nobs = n())

datatable(season_stats)
plot_ly(data = season_stats, x = season, y = season_avg_v, 
        type = "bar", name = "Average viewers") %>% 
  add_trace(x = season, y = season_max_v, name = "Max viewers") %>% 
  layout(title = "Average and maximum viewers per season (Mln)",
         yaxis = list(title = "Mln. viewers"))

We see that based on the viewership numbers, it seems that “Walking Dead” has reached maximum popularity with the season 5, and afterwards it started to decline.

What would the ratings say?

4.2. Compare public ratings of from IMDB to the viewership

We calculate max and average episode rating per season and see if the data follows the similar pattern with a peak in season 5.

rating_stats = df %>% 
  group_by(season) %>% 
  summarise(season_avg_r = mean(rating, na.rm=T), # average rating 
         season_max_r = max(rating, na.rm=T), # max rating
         nobs = n())

datatable(rating_stats)
plot_ly(data = rating_stats, x = season, y = season_avg_r, 
        type = "bar", name = "Average rating") %>% 
  add_trace(x = season, y = season_max_r, name = "Max rating") %>% 
  layout(title = "Average and maximum IMDB ratings per season",
         yaxis = list(title = "IMDB Rating"))

We see that in terms of the ratings, the show has been received well from the start, which has possibly contributed to its increasing popularity.

Finally, let is have a look at the (possible) correlation between viewership and rating

plot_ly(data = df, x = Mln_viewers, y = rating, mode = "markers", group = season, name = "Season") %>% layout(title = "Number of viewers vs. IMDB Rating per Episode", yaxis = list(title = "IMDB Rating"))

The colors represent the different seasons.

From the plot we see that there is no linear relationship between the viewer number and the rating.

5. Conclusion

“The Walking Dead” seems to have passed the peak of its popularity, and the number of its viewers has start to decline in the season 6. Nevertheless, the show is consistently rated above 8 stars, and has even achieved a rating of 9.7 for some of the episodes.

So far, the most watched episode has been “No Sanctuary”, with 17.3 mln. viewers.

df = ungroup(df)
kable(filter(df, Mln_viewers == max(Mln_viewers)))
number season episode airdate title Mln_viewers rating
52 5 1 12 Oct 14 No Sanctuary 17.3 9.6

And the top rated (9.7 out of 10) episodes on IMDB so far are:

kable(filter(df, rating == 9.7))
number season episode airdate title Mln_viewers rating
43 4 8 01 Dec 13 Too Far Gone 12.050 9.7
76 6 9 14 Feb 16 No Way Out 13.742 9.7

6. Sources

  1. Data sources are listed in the code.

  2. Rvest:easy web scraping with R