Classification model with tidymodels and Netflix movie data
On the correlation between TV rating and the description of Netflix films
In this post, after exploring the variables of Kaggle’s Netflix data set from TidyTuesday, we shall analyse the text of the film descriptions and build a classification model that predicts the rating of a film from the words in its description.
Loading the libraries needed
library(tidyverse)
library(tidymodels)
library(tidytext)
library(tidylo)
library(textrecipes)
tidymodels_prefer()
Load the data set
Let us start by reading the data set:
netflix_title <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv")
Let us also copy for reference the variable descriptors from the website:
| variable | class | description |
|---|---|---|
| show_id | character | Unique ID for every Movie / TV Show |
| type | character | Identifier - A Movie or TV Show |
| title | character | Title of the Movie / TV Show |
| director | character | Director of the Movie/Show |
| cast | character | Actors involved in the movie / show |
| country | character | Country where the movie / show was produced |
| date_added | character | Date it was added on Netflix |
| release_year | double | Actual Release year of the movie / show |
| rating | character | TV Rating of the movie / show |
| duration | character | Total Duration - in minutes or number of seasons |
| listed_in | character | Genre |
| description | character | Summary description of the film/show |
Exploratory Data Analysis (EDA)
Let us begin to look into the data set to find interesting relationship between variables.
netflix_title
# A tibble: 7,787 × 12
show_id type title director cast country date_added release_year rating
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 s1 TV Show 3% <NA> João… Brazil August 14… 2020 TV-MA
2 s2 Movie 7:19 Jorge Mic… Demi… Mexico December … 2016 TV-MA
3 s3 Movie 23:59 Gilbert C… Tedd… Singap… December … 2011 R
4 s4 Movie 9 Shane Ack… Elij… United… November … 2009 PG-13
5 s5 Movie 21 Robert Lu… Jim … United… January 1… 2008 PG-13
6 s6 TV Show 46 Serdar Ak… Erda… Turkey July 1, 2… 2016 TV-MA
7 s7 Movie 122 Yasir Al … Amin… Egypt June 1, 2… 2019 TV-MA
8 s8 Movie 187 Kevin Rey… Samu… United… November … 1997 R
9 s9 Movie 706 Shravan K… Divy… India April 1, … 2019 TV-14
10 s10 Movie 1920 Vikram Bh… Rajn… India December … 2008 TV-MA
# ℹ 7,777 more rows
# ℹ 3 more variables: duration <chr>, listed_in <chr>, description <chr>
glimpse(netflix_title)
Rows: 7,787
Columns: 12
$ show_id <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
$ type <chr> "TV Show", "Movie", "Movie", "Movie", "Movie", "TV Show",…
$ title <chr> "3%", "7:19", "23:59", "9", "21", "46", "122", "187", "70…
$ director <chr> NA, "Jorge Michel Grau", "Gilbert Chan", "Shane Acker", "…
$ cast <chr> "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Val…
$ country <chr> "Brazil", "Mexico", "Singapore", "United States", "United…
$ date_added <chr> "August 14, 2020", "December 23, 2016", "December 20, 201…
$ release_year <dbl> 2020, 2016, 2011, 2009, 2008, 2016, 2019, 1997, 2019, 200…
$ rating <chr> "TV-MA", "TV-MA", "R", "PG-13", "PG-13", "TV-MA", "TV-MA"…
$ duration <chr> "4 Seasons", "93 min", "78 min", "80 min", "123 min", "1 …
$ listed_in <chr> "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy",…
$ description <chr> "In a future where the elite inhabit an island paradise f…
netflix_title |>
group_by(type) |>
count() |>
ggplot(aes(x = type, y = n, fill = type)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = n), vjust = -0.5)
netflix_title |>
mutate(decade = 10 * (release_year %/% 10)) |>
ggplot(aes(x = decade)) +
geom_histogram(binwidth = 10, fill = "midnightblue", alpha = 0.8)
netflix_title |>
filter(type == "Movie") |>
count(rating, sort = TRUE)
# A tibble: 15 × 2
rating n
<chr> <int>
1 TV-MA 1845
2 TV-14 1272
3 R 663
4 TV-PG 505
5 PG-13 386
6 PG 247
7 TV-Y 117
8 TV-G 111
9 TV-Y7 95
10 NR 79
11 G 39
12 TV-Y7-FV 5
13 UR 5
14 <NA> 5
15 NC-17 3
netflix_title |>
separate_rows(listed_in, sep = ", ") |>
rename(genre = listed_in) |>
filter(type == "Movie") |>
count(genre, sort = TRUE)
# A tibble: 20 × 2
genre n
<chr> <int>
1 International Movies 2437
2 Dramas 2106
3 Comedies 1471
4 Documentaries 786
5 Action & Adventure 721
6 Independent Movies 673
7 Children & Family Movies 532
8 Romantic Movies 531
9 Thrillers 491
10 Stand-Up Comedy 329
11 Music & Musicals 321
12 Horror Movies 312
13 Sci-Fi & Fantasy 218
14 Sports Movies 196
15 Classic Movies 103
16 LGBTQ Movies 90
17 Cult Movies 59
18 Anime Features 57
19 Faith & Spirituality 57
20 Movies 56
netflix_title |> count(rating)
# A tibble: 15 × 2
rating n
<chr> <int>
1 G 39
2 NC-17 3
3 NR 84
4 PG 247
5 PG-13 386
6 R 665
7 TV-14 1931
8 TV-G 194
9 TV-MA 2863
10 TV-PG 806
11 TV-Y 280
12 TV-Y7 271
13 TV-Y7-FV 6
14 UR 5
15 <NA> 7
From the previous EDA, it turns out that there are many more movies than TV shows, so we shall restrict our analysis to movies. It seems interesting to study the rating of the films. Let us start by reshaping the data set so that each row carries the genre of a film, and let us add a variable that indicates whether the film is suitable for children or not. The variable shall be called adult and it will take the value 0 if the film is appropriate for children and 1 if the film is for adults only, i.e. rated TV-MA, NC-17 or R.
movies <- netflix_title |>
mutate(date_added = mdy(date_added)) |>
separate_rows(listed_in, sep = ", ") |>
rename(genre = listed_in) |>
filter(type == "Movie") |>
mutate(adult = if_else(rating %in% c("TV-MA", "NC-17", "R"), 1, 0),
adult = factor(adult, levels = c(0, 1))) |>
drop_na()
The variable description could be related to the rating of the film. Let us explore it by choosing a small sample and having a quick look at it.
set.seed(123)
movies |>
sample_n(10) |>
pull(description)
 [1] "A questionable character working as a self-styled \"biblical archaeologist\" advances his cause by creating phony religious artifacts."
[2] "Keen to bring honor to his clan, young villager Dong Yilong embarks on a perilous journey to compete in a tournament that selects warriors for battle."
[3] "An idealistic young woman travels to the jungles of Indonesia to teach literacy – and much more – to local children."
[4] "In three interwoven stories, love ends up in limbo as romantic partners navigate bliss, loss, failures and feelings while trying to make happiness last."
[5] "Comic book artist Holden meets the perfect woman, only to learn that she's a lesbian. But that doesn't stop him from falling in love with her."
[6] "Years after the presumed death of his three sons in battle, a grieving farmer journeys to Turkey to find them and return them to their homeland."
[7] "Anticipating the next move in a serial killer's grisly game, a police chief realizes that his daughter and estranged wife are the next likely victims."
[8] "After seducing a philandering con man, a rebellious young woman enlists his help in carrying out a heinous crime against her sister."
[9] "This documentary profiles music and culture icon Quincy Jones, offering unprecedented access to his private life and stories from his unparalleled career."
[10] "To evade greedy pharmaceutical crooks, a medical scientist disguises himself as a struggling salesman to protect his groundbreaking cancer research."
It looks like some data cleaning might be needed, e.g. for elements like the escaped \" characters, but we can attempt to work on the text as it is, using the tidytext package to analyse it. We shall also use tidylo to work out the log-odds of each word. Let us recall that the log-odds of a word w in document i is defined as
O_{w,i} = \log\left(\dfrac{p_{w,i}}{1 - p_{w,i}}\right),
where p_{w,i} = \dfrac{y_{w,i}}{n_i} + \varepsilon is the ratio between the count y_{w,i} of word w in document i and the number n_i of words in document i, plus a small positive number \varepsilon which smooths the word frequency when a word appears in only one of the documents (in which case p_{w,i} would otherwise be zero for the other document and the log-odds would be undefined).
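To make the formula concrete, here is a minimal hand computation of this (unweighted) log-odds for a hypothetical word count; note that bind_log_odds() from tidylo actually computes a weighted version of this quantity, using an empirical Bayes prior, so the numbers below only illustrate the idea.
# Hypothetical counts, for illustration only
y_wi <- 3        # count of word w in document (group) i
n_i  <- 200      # total number of words in document i
eps  <- 1e-4     # small smoothing constant

p_wi <- y_wi / n_i + eps    # smoothed word frequency
log(p_wi / (1 - p_wi))      # the log-odds O_{w,i}, roughly -4.18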
The result for the 10 words with the highest log-odds are shown in the next diagram.
movies |>
unnest_tokens(word, description) |>
anti_join(stop_words, by = join_by(word)) |>
count(adult, word, sort = TRUE) |>
bind_log_odds(adult, word, n) |>
arrange(-log_odds_weighted) |>
group_by(adult) |>
top_n(10) |>
ungroup() |>
mutate(word = reorder_within(word, log_odds_weighted, adult)) |>
ggplot(aes(x = log_odds_weighted, y = word, fill = adult)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ adult, scales = "free_y",
labeller = labeller(adult = c("1" = "Adult", "0" = "Not adult"))) +
scale_y_reordered()
It is clear that some words are more related to non-adult films, such as “dragon”, “renowned” and “science”, whereas others are clearly linked to adult films, such as “unravels”, “dystopian”, “abducted”, “sexuality”, etc. It is interesting to do the same with digrams.
movies |>
unnest_tokens(digram, description, token = "ngrams", n = 2) |>
count(adult, digram, sort = TRUE) |>
bind_log_odds(adult, digram, n) |>
arrange(-log_odds_weighted) |>
group_by(adult) |>
top_n(10) |>
ungroup() |>
mutate(digram = reorder_within(digram, log_odds_weighted, adult)) |>
ggplot(aes(x = log_odds_weighted, y = digram, fill = adult)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ adult, scales = "free_y",
labeller = labeller(adult = c("1" = "Adult", "0" = "Not adult"))) +
scale_y_reordered()
Let’s do the same for trigrams.
movies |>
unnest_tokens(trigram, description, token = "ngrams", n = 3) |>
count(adult, trigram, sort = TRUE) |>
bind_log_odds(adult, trigram, n) |>
arrange(-log_odds_weighted) |>
group_by(adult) |>
top_n(10) |>
ungroup() |>
mutate(trigram = reorder_within(trigram, log_odds_weighted, adult)) |>
ggplot(aes(x = log_odds_weighted, y = trigram, fill = adult)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ adult, scales = "free_y",
labeller = labeller(adult = c("1" = "Adult", "0" = "Not adult"))) +
scale_y_reordered()
It seems clear that there is a large difference between the log-odds of certain keywords across the two groups. Thus it is safe to say that there is a correlation between the rating (adult/non-adult) and the description of the film. In the rest of this article, we shall build a model to try to predict whether a film is adult or non-adult based on the words in its description.
Building a model
We shall use tidymodels with textrecipes to build a classification model for the outcome adult, with genre and the words of the description as predictors.
Let us start by creating the data set that we want to use for the model.
movie_df <- movies |>
select(adult, genre, description)
Let us then split the data set into training and testing:
set.seed(123)
movie_split <- initial_split(movie_df, strata = adult)
movie_train <- training(movie_split)
movie_test <- testing(movie_split)
We also need to pre-process the data. We shall do this with a recipe.
movie_recipe <- recipe(adult ~ ., data = movie_train) |>
step_tokenize(description, token = "words") |>
step_stopwords(description) |>
step_tokenfilter(description, max_tokens = 500) |>
step_tfidf(description) |>
step_other(genre, threshold = 0.01) |>
step_dummy(genre) |>
step_normalize(all_predictors())
We can now prepare the model. We want to use a random forest. We could tune its parameters, but to understand whether the model works at all, let us only choose a suitable number of trees and prepare a basic model with the default values for the hyperparameters. If it gives decent results, we may then decide to update the model, tuning the hyperparameters to maximise the accuracy (or some other performance metric).
rf_spec <- rand_forest(trees = 1000) |>
set_engine("ranger") |>
set_mode("classification")Let us create a workflow in order to keep together the model and the pre-process.
movie_wf <- workflow() |>
add_model(rf_spec) |>
add_recipe(movie_recipe)
Let us start by checking the model with cross-validation.
set.seed(234)
movie_folds <- vfold_cv(movie_train, strata = adult)
doParallel::registerDoParallel()
cv_res <- fit_resamples(
movie_wf,
resamples = movie_folds,
metrics = metric_set(roc_auc, accuracy, spec, sens),
control = control_resamples(save_pred = TRUE)
)
Let us collect the metrics of the previous cross-validation and see whether the model predicts the rating correctly.
collect_metrics(cv_res)
# A tibble: 4 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.780 10 0.00505 Preprocessor1_Model1
2 roc_auc binary 0.853 10 0.00497 Preprocessor1_Model1
3 sens binary 0.809 10 0.00517 Preprocessor1_Model1
4 spec binary 0.746 10 0.00897 Preprocessor1_Model1
conf_mat_resampled(cv_res) |>
mutate(percent = Freq / sum(Freq) * 100)
# A tibble: 4 × 4
Prediction Truth Freq percent
<fct> <fct> <dbl> <dbl>
1 0 0 335. 43.1
2 0 1 91.9 11.8
3 1 0 79.1 10.2
4 1 1 270. 34.8
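As a side note, yardstick treats the first factor level (here 0, i.e. non-adult) as the event by default, so the sensitivity and specificity reported above can be read off the resampled confusion matrix directly. A quick hand check (not part of the original pipeline):
# sensitivity: non-adult (0) films correctly predicted as non-adult
335 / (335 + 79.1)   # roughly 0.809
# specificity: adult (1) films correctly predicted as adult
270 / (270 + 91.9)   # roughly 0.746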
Let us see the ROC curve as well.
collect_predictions(cv_res) |>
group_by(id) |>
roc_curve(adult, .pred_0) |>
ggplot(aes(x = 1 - specificity, y = sensitivity, colour = id)) +
geom_abline(linetype = "dashed") +
geom_path() +
coord_equal()
The curves look reasonable: this is probably not the most accurate prediction model, but it is a decent one. We could try to improve it by tuning the hyperparameters of the random forest (i.e. mtry and min_n) and choosing the values that maximise the specificity, because we would rather have the model label a non-adult film as adult only than the opposite, i.e. label an adult-only film as suitable for children.
Tuning the random forest
In order to tune the random forest, we need a model specification whose hyperparameters are marked for tuning, and we then update the workflow accordingly.
tune_spec <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) |>
set_engine("ranger") |>
set_mode("classification")
movie_wf <- movie_wf |>
update_model(tune_spec)
Let us create another set of cross-validation folds to tune the hyperparameters.
set.seed(345)
tune_folds <- vfold_cv(movie_train, strata = adult)
Let us also create a grid of possible hyperparameter values to check during tuning. The dials functions mtry() and min_n() describe the range of admissible values for the corresponding parameters; here we build the grid of candidate values directly with expand.grid().
rf_grid <- expand.grid(mtry = 5:15, min_n = 2:4)
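# An equivalent grid could also be built with the dials helpers mentioned
# above (an alternative sketch, not used here):
# rf_grid <- grid_regular(mtry(range = c(5L, 15L)),
#                         min_n(range = c(2L, 4L)),
#                         levels = c(11, 3))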
# NOTICE: This operation takes over 1 hour
# on a iMac M1 with 16GB RAM
tune_res <- tune_grid(
movie_wf,
tune_folds,
grid = rf_grid,
metrics = metric_set(roc_auc, accuracy, sens, spec),
control = control_grid(save_pred = TRUE)
)
Let us draw a quick diagram to see which parameters have performed better.
tune_res |>
collect_metrics() |>
filter(.metric == "spec") |>
select(mean, min_n, mtry) |>
pivot_longer(cols = min_n:mtry,
values_to = "value",
names_to = "parameter") |>
ggplot(aes(x = value, y = mean, colour = parameter)) +
geom_point() +
facet_wrap(~ parameter, scales = "free_x") +
theme(legend.position = "none")Let us also check which values are the best according to the specificity.
show_best(tune_res, "spec")# A tibble: 5 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 14 2 spec binary 0.739 10 0.0105 Preprocessor1_Model10
2 14 4 spec binary 0.738 10 0.00957 Preprocessor1_Model32
3 15 3 spec binary 0.737 10 0.00936 Preprocessor1_Model22
4 15 4 spec binary 0.737 10 0.00991 Preprocessor1_Model33
5 12 2 spec binary 0.737 10 0.00920 Preprocessor1_Model08
We can now select the values of the parameters that maximise the specificity.
best_spec <- select_best(tune_res, "spec")At this point, we can finalize the workflow with this new information and proceed to the final fit to check how the model works with the testing data.
final_wf <- finalize_workflow(movie_wf, best_spec)
Application of the model to the testing data
movie_res <- last_fit(final_wf, movie_split)
To complete the analysis, let us extract the metrics and look at the confusion matrix of the final fit to see how the model performs on new data.
collect_metrics(movie_res)
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.813 Preprocessor1_Model1
2 roc_auc binary 0.888 Preprocessor1_Model1
collect_predictions(movie_res) |>
conf_mat(adult, .pred_class)
          Truth
Prediction    0    1
         0 1152  257
         1  228  951
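As a quick sanity check (a hand computation, not part of the original pipeline), the accuracy reported above is just the proportion of correct predictions in this confusion matrix:
(1152 + 951) / (1152 + 257 + 228 + 951)   # roughly 0.813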
Let us also see the sensitivity and specificity on the test data:
collect_predictions(movie_res) |>
sensitivity(adult, .pred_class) |>
bind_rows(
collect_predictions(movie_res) |>
specificity(adult, .pred_class)
) |>
bind_rows(
collect_predictions(movie_res) |>
roc_auc(adult, .pred_0)
)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sensitivity binary 0.835
2 specificity binary 0.787
3 roc_auc binary 0.888
As we can see, the model correctly identifies most of the titles, though it misses some whose descriptions may be borderline between adult and children-safe content. It is not surprising that, even though we tried to maximise the specificity, the sensitivity is still higher than the specificity. Indeed, the language of non-adult films is more restricted than that of adult films, and many words can lead either to a children-safe description or to an adult-only one, depending on the context in which they are used.