What are the best and worst Friends seasons according to IMDb users? How can we plot this information in a clean and effective way, so that we can also show the best and worst episodes (overall and within each season), the average rating for each season, and so forth? Let’s see how we can achieve this with just a few lines of R code.
According to IMDb's "Can I use IMDb data in my software?" page, we are not allowed to use data mining, robots, screen scraping, or similar online data gathering and extraction tools on the IMDb website. However, IMDb does provide data files, refreshed daily, that are available for download.
For the plot we want to create, we will need both title.episode.tsv.gz and title.ratings.tsv.gz. IMDb also publishes documentation for these data files.
library(tidyverse)
episodes <- read_tsv('data/title.episode.tsv', na = "\\N", quote = '')
ratings <- read_tsv('data/title.ratings.tsv', na = "\\N", quote = '')
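The calls above read uncompressed .tsv files, so the downloads were presumably unpacked first. As a minimal sketch, relying on readr's documented behaviour of decompressing .gz files on the fly, the same call can point at a compressed file. The toy file, its rows, and the names toy_lines, toy_path and toy_episodes are made up for illustration.

```r
library(readr)

# Toy stand-in for an IMDb data file: tab-separated, with \N marking missing values.
# All rows here are invented for illustration.
toy_lines <- c(
  "tconst\tparentTconst\tseasonNumber\tepisodeNumber",
  "tt9000001\ttt9999999\t1\t2",
  "tt9000002\ttt9999999\t\\N\t\\N"
)

# Write it gzip-compressed to a temporary file, mimicking a .tsv.gz download.
toy_path <- tempfile(fileext = ".tsv.gz")
con <- gzfile(toy_path, "w")
writeLines(toy_lines, con)
close(con)

# readr decompresses files ending in .gz automatically;
# na = "\\N" turns the literal \N into NA, and quote = "" disables quote handling.
toy_episodes <- read_tsv(toy_path, na = "\\N", quote = "")
toy_episodes
```

Because the missing values are converted to NA, the season and episode columns can still be read as numbers rather than text.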
Basically, title.ratings.tsv provides rating information for every title on IMDb (episodes, movies, and so on). It has just three attributes: an identifier, the average rating, and the number of votes.
head(ratings)
## # A tibble: 6 x 3
## tconst averageRating numVotes
## <chr> <dbl> <dbl>
## 1 tt0000001 5.8 1470
## 2 tt0000002 6.4 177
## 3 tt0000003 6.6 1096
## 4 tt0000004 6.5 106
## 5 tt0000005 6.2 1803
## 6 tt0000006 5.6 95
We also need to know which of the identifiers in that file belong to Friends episodes. Luckily, title.episode.tsv contains this information, along with the season and episode number of each episode, which we also need for our little study.
head(episodes)
## # A tibble: 6 x 4
## tconst parentTconst seasonNumber episodeNumber
## <chr> <chr> <dbl> <dbl>
## 1 tt0041951 tt0041038 1 9
## 2 tt0042816 tt0989125 1 17
## 3 tt0042889 tt0989125 NA NA
## 4 tt0043426 tt0040051 3 42
## 5 tt0043631 tt0989125 2 16
## 6 tt0043693 tt0989125 2 8
With this in place, we need to know the identifier for Friends. A good option could be to use a third file to find that out. But in this case we can simply look up the IMDb page for Friends and find that the identifier of the series is tt0108778. By the way, we also see that Friends lasted 10 seasons, with a total of 236 episodes. This will be useful in just a second to double-check that we are working with correctly filtered data.
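For completeness, here is a rough sketch of the "third file" route. It assumes that file is title.basics.tsv.gz, which IMDb documents as carrying titleType and primaryTitle columns; the post itself does not name it. The table below is a toy stand-in, and every identifier other than the Friends one is made up.

```r
library(dplyr)
library(tibble)

# Toy rows shaped like title.basics (assumed to be the "third file").
# Only the Friends row uses a real identifier; the others are invented.
toy_basics <- tribble(
  ~tconst,     ~titleType, ~primaryTitle,
  "tt0108778", "tvSeries", "Friends",
  "tt9999991", "movie",    "Friends",
  "tt9999992", "tvSeries", "Some Other Show"
)

# Several titles can share a name, so filter on the title type as well.
friends_id <- toy_basics %>%
  filter(primaryTitle == "Friends", titleType == "tvSeries") %>%
  pull(tconst)

friends_id
```

On the real file, more than one series could still match a given name, so it is worth checking how many rows come back before using the result.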
Now that we know the identifier for Friends, let’s use it to filter() all the episodes and keep only the ones we are interested in.
friends_episodes <- episodes %>% filter(parentTconst == 'tt0108778')
friends_episodes$seasonNumber <- as.factor(friends_episodes$seasonNumber)
We can see we get 236 rows, as expected :)
nrow(friends_episodes)
## [1] 236
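Beyond the total row count, a per-season breakdown is a cheap extra check. The sketch below shows the idea on a toy table; the rows and the names toy_episodes, toy_friends, season_counts and missing_season are invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy episode table; every identifier except the Friends parent is made up.
toy_episodes <- tribble(
  ~tconst,     ~parentTconst, ~seasonNumber, ~episodeNumber,
  "tt9000001", "tt0108778",   1,             1,
  "tt9000002", "tt0108778",   1,             2,
  "tt9000003", "tt0108778",   2,             1,
  "tt9000004", "tt9999999",   1,             1
)

# Keep only the episodes whose parent is the series we care about.
toy_friends <- toy_episodes %>% filter(parentTconst == "tt0108778")

# Episodes per season, plus a count of episodes with no season recorded.
season_counts <- toy_friends %>% count(seasonNumber)
missing_season <- sum(is.na(toy_friends$seasonNumber))

season_counts
```

Applied to friends_episodes, the same two lines would show whether each of the 10 seasons has a plausible number of episodes and whether any row lacks a season number.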
Let’s add ratings information to each episode using left_join(). We will also need a new column with the overall episode number (from 1 to 236). To achieve this, we first arrange() our data by season and episode number and then use add_column() with the values 1:236.
friends_ratings <- friends_episodes %>%
  left_join(ratings, by = 'tconst') %>%
  arrange(seasonNumber, episodeNumber) %>%
  add_column(overallEpisodeNumber = 1:nrow(friends_episodes))
friends_ratings %>% glimpse()
## Observations: 236
## Variables: 7
## $ tconst <chr> "tt0583459", "tt0583647", "tt0583653", "tt0…
## $ parentTconst <chr> "tt0108778", "tt0108778", "tt0108778", "tt0…
## $ seasonNumber <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ episodeNumber <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ averageRating <dbl> 8.4, 8.1, 8.2, 8.2, 8.5, 8.2, 9.0, 8.2, 8.3…
## $ numVotes <dbl> 5356, 3950, 3718, 3586, 3564, 3445, 4447, 3…
## $ overallEpisodeNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
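To see what the join and the numbering do on a small scale, here is a sketch on toy data. It swaps add_column() for mutate() with row_number(), which is an alternative rather than the approach used above; both give a running index after sorting. All rows and ratings below are made up.

```r
library(dplyr)
library(tibble)

# Toy episodes, deliberately out of order; identifiers are invented.
toy_episodes <- tribble(
  ~tconst,     ~seasonNumber, ~episodeNumber,
  "tt9000003", 2,             1,
  "tt9000001", 1,             1,
  "tt9000002", 1,             2
)

# Toy ratings; one episode is missing on purpose.
toy_ratings <- tribble(
  ~tconst,     ~averageRating, ~numVotes,
  "tt9000001", 8.1,            100,
  "tt9000002", 8.4,            90
)

toy_joined <- toy_episodes %>%
  left_join(toy_ratings, by = "tconst") %>%      # keeps every episode
  arrange(seasonNumber, episodeNumber) %>%       # chronological order
  mutate(overallEpisodeNumber = row_number())    # running index after sorting

toy_joined
```

Because left_join() keeps all rows from the left table, an episode without a rating stays in the result with NA in averageRating and numVotes instead of silently disappearing.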
And that is all the data we need to finally create our plot.
ggplot(friends_ratings) +
  aes(x = overallEpisodeNumber, y = averageRating, color = seasonNumber) +
  scale_color_manual(values = rep(two_colors, 5)) +
  stat_smooth(method = 'lm', formula = y ~ 1, size = 0.2) +
  geom_point(aes(size = numVotes)) +
  scale_size_continuous(range = c(.2, 5)) +
  scale_x_continuous(breaks = seq(0, 225, 25)) +
  ylim(7, 10) +
  labs(title = "Friends Ratings (236 Episodes, 10 Seasons)",
       x = "Overall Episode Number",
       y = "Average Rating by IMDb Users",
       caption = credits_imdb) +
  theme(legend.position = "none") +
  pudding_theme()
Some interesting points:
With color = seasonNumber and scale_color_manual(values = rep(two_colors, 5)) we can represent all 10 seasons using just two alternating colors.
With stat_smooth(method = 'lm', formula = y ~ 1, size = 0.2) we can plot the average rating for each season.
With geom_point(aes(size = numVotes)) we can represent the rating for each episode with a bigger or smaller dot based on the number of votes that episode received (and we control the minimum and maximum sizes with scale_size_continuous(range = c(.2, 5))).
Please note that two_colors is just a vector with blue and red colors, and pudding_theme() returns a ggplot2 theme to customize this plot (basically font families, sizes, colors, and similar things). I keep them in a separate themes.R file for easier code reuse.
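Why does formula = y ~ 1 give the season average? A linear model with only an intercept estimates a single constant, and the least-squares constant is the mean. Since color = seasonNumber groups the data, the fit is done once per season. The sketch below checks this on made-up ratings using base R only; the names toy, ratings_by_season, lm_intercepts and season_means are illustrative.

```r
# Made-up ratings for two seasons.
toy <- data.frame(
  seasonNumber  = factor(c(1, 1, 1, 2, 2)),
  averageRating = c(8.0, 8.5, 9.0, 7.5, 8.5)
)

# One vector of ratings per season.
ratings_by_season <- split(toy$averageRating, toy$seasonNumber)

# Intercept of an intercept-only linear model, fitted separately per season.
lm_intercepts <- sapply(ratings_by_season, function(r) coef(lm(r ~ 1))[[1]])

# Plain per-season means for comparison.
season_means <- sapply(ratings_by_season, mean)

all.equal(lm_intercepts, season_means)
```

The two results agree, which is why the horizontal line drawn for each season sits at that season's mean rating.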
Finally, remember that we can save our plots (e.g. as SVG) by assigning the result of the ggplot() call to a variable (e.g. p) and then using the ggsave() function.
ggsave(filename = "friends-plot.svg", plot = p, width = 10, height = 8)
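Putting the last steps together, here is a self-contained sketch on toy data: build a plot, assign it to p, and save it. The two_colors vector and pudding_theme() below are hypothetical stand-ins for the objects kept in themes.R, and the ratings are made up. The sketch saves a PNG to a temporary folder because, as far as I know, writing SVG through ggsave() needs the svglite package to be installed.

```r
library(ggplot2)

# Hypothetical stand-ins for the objects the post keeps in themes.R.
two_colors <- c("blue", "red")
pudding_theme <- function() theme_minimal()  # placeholder, not the author's theme

# Made-up ratings for two short seasons.
toy <- data.frame(
  overallEpisodeNumber = 1:6,
  averageRating        = c(8.1, 8.4, 8.2, 8.6, 8.3, 8.9),
  numVotes             = c(500, 300, 250, 400, 280, 600),
  seasonNumber         = factor(c(1, 1, 1, 2, 2, 2))
)

# Assign the plot to a variable instead of printing it.
p <- ggplot(toy) +
  aes(x = overallEpisodeNumber, y = averageRating, color = seasonNumber) +
  scale_color_manual(values = two_colors) +
  geom_point(aes(size = numVotes)) +
  pudding_theme() +
  theme(legend.position = "none")  # applied after the stand-in theme

# Save to a temporary location; the device is picked from the file extension.
out_path <- file.path(tempdir(), "toy-plot.png")
ggsave(filename = out_path, plot = p, width = 10, height = 8)
```

In this sketch the legend setting comes after the stand-in theme because theme_minimal() is a complete theme and would otherwise reset it; whether that ordering matters for the real pudding_theme() depends on how it is defined.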