This is the second part of the Project 2.
The goal of this exercise is to get the data on the viewership of each episode of the TV series “The Walking Dead”, and to compare it with IMDB ratings for the episodes. In particular, the tasks are:
1) Compare viewership across seasons
2) Compare public ratings of episodes on a site like IMDB or Rotten Tomatoes to the viewership
In order to read the data directly from Wikipedia and IMDB, the package rvest is used. Link to the tutorial I used to understand its functions is provided at the bottom of the page.
# Load the required packages
library(readr)
library(knitr)
library(tidyr)
library(dplyr)
library(plotly)
library(rvest)
library(DT)
First, we extract the viewership data from Wikipedia
viewers = read_html("https://en.wikipedia.org/wiki/Template:The_Walking_Dead_ratings")
# Get the table with viewer counts
wiki = viewers %>%
html_nodes(".wikitable") %>%
.[[2]] %>%
html_table(header = T, fill = T)
head(wiki)
## Season Episode number Episode number Episode number Episode number
## 1 Season 1.000 2.000 3.000 4.000
## 2 1 5.350 4.710 5.070 4.750
## 3 2 7.260 6.700 6.100 6.290
## 4 3 10.970 9.550 10.510 9.270
## 5 4 16.111 13.950 12.920 13.310
## 6 5 17.300 15.143 13.801 14.518
## Episode number Episode number Episode number Episode number
## 1 5.00 6.000 7 8
## 2 5.56 5.970 — —
## 3 6.12 6.080 6.620 8.100
## 4 10.37 9.210 10.430 10.480
## 5 12.20 12.000 11.290 12.050
## 6 13.53 14.068 13.330 14.807
## Episode number Episode number Episode number Episode number
## 1 9 10 11 12
## 2 — — — —
## 3 6.890 7.040 6.770 6.890
## 4 12.260 11.050 11.010 11.300
## 5 15.760 13.340 13.120 12.610
## 6 15.643 12.267 13.438 14.430
## Episode number Episode number Episode number Episode number
## 1 13 14 15 16
## 2 — — — —
## 3 8.990 — — —
## 4 11.460 10.840 10.990 12.420
## 5 12.650 12.870 13.470 15.680
## 6 14.534 13.781 13.757 15.784
Secondly, we extract episode names and ratings from the respective IMDB webpage
wd_ratings = read_html("http://www.imdb.com/list/ls012062344/?start=1&view=compact&sort=release_date_us:asc&defaults=1&scb=0.6008014953041274")
# Get episode names
episode_list = wd_ratings %>%
html_nodes(".episode a:nth-child(3)") %>%
html_text()
# Get episode ratings
episode_rating = wd_ratings %>%
html_nodes(".user_rating ") %>%
html_text()
# Remove the empty string in the ratings vector
episode_rating = episode_rating[2:length(episode_rating)]
# Bind together
imdb = data.frame(name = as.character(episode_list),
rating = as.numeric(episode_rating))
head(imdb)
## name rating
## 1 Days Gone Bye 9.2
## 2 Guts 8.7
## 3 Tell It to the Frogs 8.3
## 4 Vatos 8.6
## 5 Wildfire 8.3
## 6 TS-19 8.7
The data from IMDB seems inconsistent in terms of the number of episodes with rating: Wikipedia lists 17 episodes, and IMDB data has just 2 episodes and no indication of season and episode number.
Therefore we use an additional datasource listing all episode numbers and names to facilitate the joining of the data. This data was downloaded from the website EpGuides and stored as .CSV file.
epnames = read_csv("wd_episodes.csv")
## Parsed with column specification:
## cols(
## number = col_character(),
## season = col_integer(),
## episode = col_integer(),
## airdate = col_character(),
## title = col_character(),
## `tvmaze link` = col_character()
## )
kable(epnames[1:10,])
Now the Wikipedia data should be cleaned and transformed into a long format for joining with the names and rating data.
# Apply correct column names and drop the row with names
names(wiki) = wiki[1,]
wiki = slice(wiki, 2:nrow(wiki))
# Transform into tidy format and fix variable types
wiki1 = wiki %>%
# Gather counts per episode into one column
gather("episode","Mln_viewers", 2:17) %>%
# Convert missing values into NAs
mutate(Mln_viewers = as.numeric(Mln_viewers)) %>%
# Fix variable names and types for joining
rename(season = Season) %>%
mutate(season = as.integer(season),
episode = as.integer(episode))
The Wikipedia dataset is now in tidy format and can be joined with the ratings data
kable(wiki[1:10,])
| Season | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 5.350 | 4.710 | 5.070 | 4.750 | 5.56 | 5.970 | — | — | — | — | — | — | — | — | — | — |
| 2 | 2 | 7.260 | 6.700 | 6.100 | 6.290 | 6.12 | 6.080 | 6.620 | 8.100 | 6.890 | 7.040 | 6.770 | 6.890 | 8.990 | — | — | — |
| 3 | 3 | 10.970 | 9.550 | 10.510 | 9.270 | 10.37 | 9.210 | 10.430 | 10.480 | 12.260 | 11.050 | 11.010 | 11.300 | 11.460 | 10.840 | 10.990 | 12.420 |
| 4 | 4 | 16.111 | 13.950 | 12.920 | 13.310 | 12.20 | 12.000 | 11.290 | 12.050 | 15.760 | 13.340 | 13.120 | 12.610 | 12.650 | 12.870 | 13.470 | 15.680 |
| 5 | 5 | 17.300 | 15.143 | 13.801 | 14.518 | 13.53 | 14.068 | 13.330 | 14.807 | 15.643 | 12.267 | 13.438 | 14.430 | 14.534 | 13.781 | 13.757 | 15.784 |
| 6 | 6 | 14.633 | 12.183 | 13.143 | 13.339 | 12.44 | 12.871 | 13.224 | 13.981 | 13.742 | 13.483 | 12.794 | 12.812 | 12.530 | 12.686 | 12.384 | 14.193 |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA.3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Join the episode viewership with the episode names and then with ratings
df = left_join(epnames, wiki1) %>% select (- `tvmaze link`)
df = left_join(df, imdb, by = c("title"="name"))
Remove rows without viewer information
df = df %>% filter(complete.cases(Mln_viewers))
Surprizingly, most of the name strings provided and exact match, and we now have a cleaned up tidy table with episode names, viewers, and ratings that is ready for analysis:
datatable(df)
Let’s have a look at the average viewership across the seasons and the maximum viewer numbers per season
season_stats = df %>%
group_by(season) %>%
summarise(season_avg_v = mean(Mln_viewers), # average viewers
season_max_v = max(Mln_viewers), # max viewers
nobs = n())
datatable(season_stats)
plot_ly(data = season_stats, x = season, y = season_avg_v,
type = "bar", name = "Average viewers") %>%
add_trace(x = season, y = season_max_v, name = "Max viewers") %>%
layout(title = "Average and maximum viewers per season (Mln)",
yaxis = list(title = "Mln. viewers"))
We see that based on the viewership numbers, it seems that “Walking Dead” has reached maximum popularity with the season 5, and afterwards it started to decline.
What would the ratings say?
We calculate max and average episode rating per season and see if the data follows the similar pattern with a peak in season 5.
rating_stats = df %>%
group_by(season) %>%
summarise(season_avg_r = mean(rating, na.rm=T), # average rating
season_max_r = max(rating, na.rm=T), # max rating
nobs = n())
datatable(rating_stats)
plot_ly(data = rating_stats, x = season, y = season_avg_r,
type = "bar", name = "Average rating") %>%
add_trace(x = season, y = season_max_r, name = "Max rating") %>%
layout(title = "Average and maximum IMDB ratings per season",
yaxis = list(title = "IMDB Rating"))
We see that in terms of the ratings, the show has been received well from the start, which has possibly contributed to its increasing popularity.
Finally, let is have a look at the (possible) correlation between viewership and rating
plot_ly(data = df, x = Mln_viewers, y = rating, mode = "markers", group = season, name = "Season") %>% layout(title = "Number of viewers vs. IMDB Rating per Episode", yaxis = list(title = "IMDB Rating"))
The colors represent the different seasons.
From the plot we see that there is no linear relationship between the viewer number and the rating.
“The Walking Dead” seems to have passed the peak of its popularity, and the number of its viewers has start to decline in the season 6. Nevertheless, the show is consistently rated above 8 stars, and has even achieved a rating of 9.7 for some of the episodes.
So far, the most watched episode has been “No Sanctuary”, with 17.3 mln. viewers.
df = ungroup(df)
kable(filter(df, Mln_viewers == max(Mln_viewers)))
| number | season | episode | airdate | title | Mln_viewers | rating |
|---|---|---|---|---|---|---|
| 52 | 5 | 1 | 12 Oct 14 | No Sanctuary | 17.3 | 9.6 |
And the top rated (9.7 out of 10) episodes on IMDB so far are:
kable(filter(df, rating == 9.7))
| number | season | episode | airdate | title | Mln_viewers | rating |
|---|---|---|---|---|---|---|
| 43 | 4 | 8 | 01 Dec 13 | Too Far Gone | 12.050 | 9.7 |
| 76 | 6 | 9 | 14 Feb 16 | No Way Out | 13.742 | 9.7 |
Data sources are listed in the code.