Data 607 Project 2-2

1. Introduction

This is the second part of the Project 2.

The goal of this exercise is to get the data on the viewership of each episode of the TV series “The Walking Dead”, and to compare it with IMDB ratings for the episodes. In particular, the tasks are:
1) Compare viewership across seasons
2) Compare public ratings of episodes on a site like IMDB or Rotten Tomatoes to the viewership

2. Read the data

In order to read the data directly from Wikipedia and IMDB, the package rvest is used. Link to the tutorial I used to understand its functions is provided at the bottom of the page.

# Load the required packages
library(readr)
library(knitr)
library(tidyr)
library(dplyr)
library(plotly)
library(rvest)
library(DT)

First, we extract the viewership data from Wikipedia

viewers = read_html("https://en.wikipedia.org/wiki/Template:The_Walking_Dead_ratings")

# Get the table with viewer counts

wiki = viewers %>% 
  html_nodes(".wikitable") %>%
  .[[2]] %>% 
  html_table(header = T, fill = T)

head(wiki)

##   Season Episode number Episode number Episode number Episode number
## 1 Season          1.000          2.000          3.000          4.000
## 2      1          5.350          4.710          5.070          4.750
## 3      2          7.260          6.700          6.100          6.290
## 4      3         10.970          9.550         10.510          9.270
## 5      4         16.111         13.950         12.920         13.310
## 6      5         17.300         15.143         13.801         14.518
##   Episode number Episode number Episode number Episode number
## 1           5.00          6.000              7              8
## 2           5.56          5.970              —              —
## 3           6.12          6.080          6.620          8.100
## 4          10.37          9.210         10.430         10.480
## 5          12.20         12.000         11.290         12.050
## 6          13.53         14.068         13.330         14.807
##   Episode number Episode number Episode number Episode number
## 1              9             10             11             12
## 2              —              —              —              —
## 3          6.890          7.040          6.770          6.890
## 4         12.260         11.050         11.010         11.300
## 5         15.760         13.340         13.120         12.610
## 6         15.643         12.267         13.438         14.430
##   Episode number Episode number Episode number Episode number
## 1             13             14             15             16
## 2              —              —              —              —
## 3          8.990              —              —              —
## 4         11.460         10.840         10.990         12.420
## 5         12.650         12.870         13.470         15.680
## 6         14.534         13.781         13.757         15.784

Secondly, we extract episode names and ratings from the respective IMDB webpage

wd_ratings = read_html("http://www.imdb.com/list/ls012062344/?start=1&view=compact&sort=release_date_us:asc&defaults=1&scb=0.6008014953041274")

# Get episode names
episode_list = wd_ratings %>% 
  html_nodes(".episode a:nth-child(3)") %>%
  html_text() 

# Get episode ratings
episode_rating = wd_ratings %>% 
  html_nodes(".user_rating ") %>%
  html_text()

# Remove the empty string in the ratings vector
episode_rating = episode_rating[2:length(episode_rating)]

# Bind together
imdb = data.frame(name = as.character(episode_list), 
                  rating = as.numeric(episode_rating))

head(imdb)

##                   name rating
## 1        Days Gone Bye    9.2
## 2                 Guts    8.7
## 3 Tell It to the Frogs    8.3
## 4                Vatos    8.6
## 5             Wildfire    8.3
## 6                TS-19    8.7

The data from IMDB seems inconsistent in terms of the number of episodes with rating: Wikipedia lists 17 episodes, and IMDB data has just 2 episodes and no indication of season and episode number.

Therefore we use an additional datasource listing all episode numbers and names to facilitate the joining of the data. This data was downloaded from the website EpGuides and stored as .CSV file.

epnames = read_csv("wd_episodes.csv")

## Parsed with column specification:
## cols(
##   number = col_character(),
##   season = col_integer(),
##   episode = col_integer(),
##   airdate = col_character(),
##   title = col_character(),
##   `tvmaze link` = col_character()
## )

kable(epnames[1:10,])

number	season	episode	airdate	title	tvmaze link
1	1	1	31 Oct 10	Days Gone Bye	http://www.tvmaze.com/episodes/4095/the-walking-dead-1x01-days-gone-bye
2	1	2	07 Nov 10	Guts	http://www.tvmaze.com/episodes/4096/the-walking-dead-1x02-guts
3	1	3	14 Nov 10	Tell It to the Frogs	http://www.tvmaze.com/episodes/4097/the-walking-dead-1x03-tell-it-to-the-frogs
4	1	4	21 Nov 10	Vatos	http://www.tvmaze.com/episodes/4098/the-walking-dead-1x04-vatos
5	1	5	28 Nov 10	Wildfire	http://www.tvmaze.com/episodes/4099/the-walking-dead-1x05-wildfire
6	1	6	05 Dec 10	TS-19	http://www.tvmaze.com/episodes/4100/the-walking-dead-1x06-ts-19
7	2	1	16 Oct 11	What Lies Ahead	http://www.tvmaze.com/episodes/4101/the-walking-dead-2x01-what-lies-ahead
8	2	2	23 Oct 11	Bloodletting	http://www.tvmaze.com/episodes/4102/the-walking-dead-2x02-bloodletting
9	2	3	30 Oct 11	Save the Last One	http://www.tvmaze.com/episodes/4103/the-walking-dead-2x03-save-the-last-one
10	2	4	06 Nov 11	Cherokee Rose	http://www.tvmaze.com/episodes/4104/the-walking-dead-2x04-cherokee-rose

3. Transform the data into tidy format

Now the Wikipedia data should be cleaned and transformed into a long format for joining with the names and rating data.

# Apply correct column names and drop the row with names
names(wiki) = wiki[1,]

wiki = slice(wiki, 2:nrow(wiki))

# Transform into tidy format and fix variable types

wiki1 = wiki %>% 
  # Gather counts per episode into one column
  gather("episode","Mln_viewers", 2:17) %>% 
  # Convert missing values into NAs
  mutate(Mln_viewers = as.numeric(Mln_viewers)) %>% 
  # Fix variable names and types for joining
  rename(season = Season) %>% 
  mutate(season = as.integer(season),
         episode = as.integer(episode))

The Wikipedia dataset is now in tidy format and can be joined with the ratings data

kable(wiki[1:10,])

	Season	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
1	1	5.350	4.710	5.070	4.750	5.56	5.970	—	—	—	—	—	—	—	—	—	—
2	2	7.260	6.700	6.100	6.290	6.12	6.080	6.620	8.100	6.890	7.040	6.770	6.890	8.990	—	—	—
3	3	10.970	9.550	10.510	9.270	10.37	9.210	10.430	10.480	12.260	11.050	11.010	11.300	11.460	10.840	10.990	12.420
4	4	16.111	13.950	12.920	13.310	12.20	12.000	11.290	12.050	15.760	13.340	13.120	12.610	12.650	12.870	13.470	15.680
5	5	17.300	15.143	13.801	14.518	13.53	14.068	13.330	14.807	15.643	12.267	13.438	14.430	14.534	13.781	13.757	15.784
6	6	14.633	12.183	13.143	13.339	12.44	12.871	13.224	13.981	13.742	13.483	12.794	12.812	12.530	12.686	12.384	14.193
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA.1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA.2	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA.3	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Join the episode viewership with the episode names and then with ratings

df = left_join(epnames, wiki1) %>% select (- `tvmaze link`)
df = left_join(df, imdb, by = c("title"="name"))

Remove rows without viewer information

df = df %>% filter(complete.cases(Mln_viewers))

Surprizingly, most of the name strings provided and exact match, and we now have a cleaned up tidy table with episode names, viewers, and ratings that is ready for analysis:

datatable(df)

4. Calculate summary statistics

4.1. Compare viewership across seasons

Let’s have a look at the average viewership across the seasons and the maximum viewer numbers per season

season_stats = df %>% 
  group_by(season) %>% 
  summarise(season_avg_v = mean(Mln_viewers), # average viewers 
         season_max_v = max(Mln_viewers), # max viewers
         nobs = n())

datatable(season_stats)

plot_ly(data = season_stats, x = season, y = season_avg_v, 
        type = "bar", name = "Average viewers") %>% 
  add_trace(x = season, y = season_max_v, name = "Max viewers") %>% 
  layout(title = "Average and maximum viewers per season (Mln)",
         yaxis = list(title = "Mln. viewers"))

We see that based on the viewership numbers, it seems that “Walking Dead” has reached maximum popularity with the season 5, and afterwards it started to decline.

What would the ratings say?

4.2. Compare public ratings of from IMDB to the viewership

We calculate max and average episode rating per season and see if the data follows the similar pattern with a peak in season 5.

rating_stats = df %>% 
  group_by(season) %>% 
  summarise(season_avg_r = mean(rating, na.rm=T), # average rating 
         season_max_r = max(rating, na.rm=T), # max rating
         nobs = n())

datatable(rating_stats)

plot_ly(data = rating_stats, x = season, y = season_avg_r, 
        type = "bar", name = "Average rating") %>% 
  add_trace(x = season, y = season_max_r, name = "Max rating") %>% 
  layout(title = "Average and maximum IMDB ratings per season",
         yaxis = list(title = "IMDB Rating"))

We see that in terms of the ratings, the show has been received well from the start, which has possibly contributed to its increasing popularity.

Finally, let is have a look at the (possible) correlation between viewership and rating

plot_ly(data = df, x = Mln_viewers, y = rating, mode = "markers", group = season, name = "Season") %>% layout(title = "Number of viewers vs. IMDB Rating per Episode", yaxis = list(title = "IMDB Rating"))

The colors represent the different seasons.

From the plot we see that there is no linear relationship between the viewer number and the rating.

5. Conclusion

“The Walking Dead” seems to have passed the peak of its popularity, and the number of its viewers has start to decline in the season 6. Nevertheless, the show is consistently rated above 8 stars, and has even achieved a rating of 9.7 for some of the episodes.

So far, the most watched episode has been “No Sanctuary”, with 17.3 mln. viewers.

df = ungroup(df)
kable(filter(df, Mln_viewers == max(Mln_viewers)))

number	season	episode	airdate	title	Mln_viewers	rating
52	5	1	12 Oct 14	No Sanctuary	17.3	9.6

And the top rated (9.7 out of 10) episodes on IMDB so far are:

kable(filter(df, rating == 9.7))

number	season	episode	airdate	title	Mln_viewers	rating
43	4	8	01 Dec 13	Too Far Gone	12.050	9.7
76	6	9	14 Feb 16	No Way Out	13.742	9.7

6. Sources

Data sources are listed in the code.
Rvest:easy web scraping with R