Classification model with tidymodels and Netflix movie data
On the correlation between TV rating and the description of Netflix films
In this post, after exploring the variables of Kaggle’s Netflix data set from TidyTuesday, we shall analyse the text of the film descriptions and build a classification model that predicts the rating of a film from the words in its description.
Loading the libraries needed
library(tidyverse)
library(tidymodels)
library(tidytext)
library(tidylo)
library(textrecipes)
tidymodels_prefer()
Load the data set
Let us start by reading the data set:
netflix_title <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv")
Let us also copy for reference the variable descriptors from the website:
| variable | class | description |
|---|---|---|
| show_id | character | Unique ID for every Movie / TV Show |
| type | character | Identifier - A Movie or TV Show |
| title | character | Title of the Movie / TV Show |
| director | character | Director of the Movie/Show |
| cast | character | Actors involved in the movie / show |
| country | character | Country where the movie / show was produced |
| date_added | character | Date it was added on Netflix |
| release_year | double | Actual Release year of the movie / show |
| rating | character | TV Rating of the movie / show |
| duration | character | Total Duration - in minutes or number of seasons |
| listed_in | character | Genre |
| description | character | Summary description of the film/show |
Exploratory Data Analysis (EDA)
Let us begin to look into the data set to find interesting relationship between variables.
netflix_title
# A tibble: 7,787 × 12
show_id type title director cast country date_added release_year rating
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 s1 TV Show 3% <NA> João… Brazil August 14… 2020 TV-MA
2 s2 Movie 7:19 Jorge Mic… Demi… Mexico December … 2016 TV-MA
3 s3 Movie 23:59 Gilbert C… Tedd… Singap… December … 2011 R
4 s4 Movie 9 Shane Ack… Elij… United… November … 2009 PG-13
5 s5 Movie 21 Robert Lu… Jim … United… January 1… 2008 PG-13
6 s6 TV Show 46 Serdar Ak… Erda… Turkey July 1, 2… 2016 TV-MA
7 s7 Movie 122 Yasir Al … Amin… Egypt June 1, 2… 2019 TV-MA
8 s8 Movie 187 Kevin Rey… Samu… United… November … 1997 R
9 s9 Movie 706 Shravan K… Divy… India April 1, … 2019 TV-14
10 s10 Movie 1920 Vikram Bh… Rajn… India December … 2008 TV-MA
# ℹ 7,777 more rows
# ℹ 3 more variables: duration <chr>, listed_in <chr>, description <chr>
glimpse(netflix_title)
Rows: 7,787
Columns: 12
$ show_id <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
$ type <chr> "TV Show", "Movie", "Movie", "Movie", "Movie", "TV Show",…
$ title <chr> "3%", "7:19", "23:59", "9", "21", "46", "122", "187", "70…
$ director <chr> NA, "Jorge Michel Grau", "Gilbert Chan", "Shane Acker", "…
$ cast <chr> "João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Val…
$ country <chr> "Brazil", "Mexico", "Singapore", "United States", "United…
$ date_added <chr> "August 14, 2020", "December 23, 2016", "December 20, 201…
$ release_year <dbl> 2020, 2016, 2011, 2009, 2008, 2016, 2019, 1997, 2019, 200…
$ rating <chr> "TV-MA", "TV-MA", "R", "PG-13", "PG-13", "TV-MA", "TV-MA"…
$ duration <chr> "4 Seasons", "93 min", "78 min", "80 min", "123 min", "1 …
$ listed_in <chr> "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy",…
$ description <chr> "In a future where the elite inhabit an island paradise f…
netflix_title |>
group_by(type) |>
count() |>
ggplot(aes(x = type, y = n, fill = type)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = n), vjust = -0.5)
netflix_title |>
mutate(decade = 10 * (release_year %/% 10)) |>
ggplot(aes(x = decade)) +
geom_histogram(binwidth = 10, fill = "midnightblue", alpha = 0.8)
netflix_title |>
filter(type == "Movie") |>
count(rating, sort = TRUE)
# A tibble: 15 × 2
rating n
<chr> <int>
1 TV-MA 1845
2 TV-14 1272
3 R 663
4 TV-PG 505
5 PG-13 386
6 PG 247
7 TV-Y 117
8 TV-G 111
9 TV-Y7 95
10 NR 79
11 G 39
12 TV-Y7-FV 5
13 UR 5
14 <NA> 5
15 NC-17 3
netflix_title |>
separate_rows(listed_in, sep = ", ") |>
rename(genre = listed_in) |>
filter(type == "Movie") |>
count(genre, sort = TRUE)
# A tibble: 20 × 2
genre n
<chr> <int>
1 International Movies 2437
2 Dramas 2106
3 Comedies 1471
4 Documentaries 786
5 Action & Adventure 721
6 Independent Movies 673
7 Children & Family Movies 532
8 Romantic Movies 531
9 Thrillers 491
10 Stand-Up Comedy 329
11 Music & Musicals 321
12 Horror Movies 312
13 Sci-Fi & Fantasy 218
14 Sports Movies 196
15 Classic Movies 103
16 LGBTQ Movies 90
17 Cult Movies 59
18 Anime Features 57
19 Faith & Spirituality 57
20 Movies 56
netflix_title |> count(rating)
# A tibble: 15 × 2
rating n
<chr> <int>
1 G 39
2 NC-17 3
3 NR 84
4 PG 247
5 PG-13 386
6 R 665
7 TV-14 1931
8 TV-G 194
9 TV-MA 2863
10 TV-PG 806
11 TV-Y 280
12 TV-Y7 271
13 TV-Y7-FV 6
14 UR 5
15 <NA> 7
From the previous EDA, it turns out that there are many more movies than TV shows, so we shall restrict our analysis to movies. It seems interesting to study the rating of the films. Let us start by reshaping the data set so that each row carries the genre of a film, and let us add a variable that indicates whether the film is suitable for children or not. The variable shall be called adult and it will take the value 0 if the film is appropriate for children and 1 if the film is for adults only, i.e. rated TV-MA, NC-17 or R.
movies <- netflix_title |>
mutate(date_added = mdy(date_added)) |>
separate_rows(listed_in, sep = ", ") |>
rename(genre = listed_in) |>
filter(type == "Movie") |>
mutate(adult = if_else(rating %in% c("TV-MA", "NC-17", "R"), 1, 0),
adult = factor(adult, levels = c(0, 1))) |>
drop_na()
The variable description could be related to the rating of the film. Let us explore it by choosing a small sample and having a quick look at it.
set.seed(123)
movies |>
sample_n(10) |>
pull(description)
 [1] "A questionable character working as a self-styled \"biblical archaeologist\" advances his cause by creating phony religious artifacts."
[2] "Keen to bring honor to his clan, young villager Dong Yilong embarks on a perilous journey to compete in a tournament that selects warriors for battle."
[3] "An idealistic young woman travels to the jungles of Indonesia to teach literacy – and much more – to local children."
[4] "In three interwoven stories, love ends up in limbo as romantic partners navigate bliss, loss, failures and feelings while trying to make happiness last."
[5] "Comic book artist Holden meets the perfect woman, only to learn that she's a lesbian. But that doesn't stop him from falling in love with her."
[6] "Years after the presumed death of his three sons in battle, a grieving farmer journeys to Turkey to find them and return them to their homeland."
[7] "Anticipating the next move in a serial killer's grisly game, a police chief realizes that his daughter and estranged wife are the next likely victims."
[8] "After seducing a philandering con man, a rebellious young woman enlists his help in carrying out a heinous crime against her sister."
[9] "This documentary profiles music and culture icon Quincy Jones, offering unprecedented access to his private life and stories from his unparalleled career."
[10] "To evade greedy pharmaceutical crooks, a medical scientist disguises himself as a struggling salesman to protect his groundbreaking cancer research."
It looks like some data cleaning might be needed, e.g. for elements like the escaped \" characters, but we can attempt to work on the text as it is, using the tidytext package to analyse it. We shall also use tidylo to work out the log-odds of each word. Let us recall that the log-odds of a word w in document i is defined as
O_{w,i} = \log\left(\dfrac{p_{w,i}}{1 - p_{w,i}}\right),
where p_{w,i} = \dfrac{y_{w,i}}{n_i} + \varepsilon is the ratio between the count y_{w,i} of word w in document i and the number n_i of words in document i, plus a small positive number \varepsilon which smooths the word frequency when a word appears in only one of the documents (in which case p_{w,i} would otherwise be zero for the other document and the log-odds would be undefined).
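To make the formula concrete, here is a minimal hand computation of this (unweighted) log-odds for a hypothetical word count; note that bind_log_odds() from tidylo actually computes a weighted version of this quantity, using an empirical Bayes prior, so the numbers below only illustrate the idea.
# Hypothetical counts, for illustration only
y_wi <- 3        # count of word w in document (group) i
n_i  <- 200      # total number of words in document i
eps  <- 1e-4     # small smoothing constant

p_wi <- y_wi / n_i + eps    # smoothed word frequency
log(p_wi / (1 - p_wi))      # the log-odds O_{w,i}, roughly -4.18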
The result for the 10 words with the highest log-odds are shown in the next diagram.
movies |>
unnest_tokens(word, description) |>
anti_join(stop_words, by = join_by(word)) |>
count(adult, word, sort = TRUE) |>
bind_log_odds(adult, word, n) |>
arrange(-log_odds_weighted) |>
group_by(adult) |>
top_n(10) |>
ungroup() |>
mutate(word = reorder_within(word, log_odds_weighted, adult)) |>
ggplot(aes(x = log_odds_weighted, y = word, fill = adult)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ adult, scales = "free_y",
labeller = labeller(adult = c("1" = "Adult", "0" = "Not adult"))) +
scale_y_reordered()
It is clear that some words are more related to non-adult films, such as “dragon”, “renowned” and “science”, whereas others are clearly linked to adult films, such as “unravels”, “dystopian”, “abducted”, “sexuality”, etc. It is interesting to do the same with digrams.
movies |>
unnest_tokens(digram, description, token = "ngrams", n = 2) |>
count(adult, digram, sort = TRUE) |>
bind_log_odds(adult, digram, n) |>
arrange(-log_odds_weighted) |>
group_by(adult) |>
top_n(10) |>
ungroup() |>
mutate(digram = reorder_within(digram, log_odds_weighted, adult)) |>
ggplot(aes(x = log_odds_weighted, y = digram, fill = adult)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ adult, scales = "free_y",
labeller = labeller(adult = c("1" = "Adult", "0" = "Not adult"))) +
scale_y_reordered()
Let’s do the same for trigrams.
movies |>
unnest_tokens(trigram, description, token = "ngrams", n = 3) |>
count(adult, trigram, sort = TRUE) |>
bind_log_odds(adult, trigram, n) |>
arrange(-log_odds_weighted) |>
group_by(adult) |>
top_n(10) |>
ungroup() |>
mutate(trigram = reorder_within(trigram, log_odds_weighted, adult)) |>
ggplot(aes(x = log_odds_weighted, y = trigram, fill = adult)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ adult, scales = "free_y",
labeller = labeller(adult = c("1" = "Adult", "0" = "Not adult"))) +
scale_y_reordered()
It seems clear that there is a large difference between the log-odds of certain keywords across the two groups. Thus it is safe to say that there is a correlation between the rating (adult/non-adult) and the description of the film. In the rest of this article, we shall build a model to try to predict whether a film is adult or non-adult based on the words in its description.
Building a model
We shall use tidymodels with textrecipes to build a classification model for the outcome adult, with genre and the words of the description as predictors.
Let us start by creating the data set that we want to use for the model.
movie_df <- movies |>
select(adult, genre, description)
Let us then split the data set into training and testing:
set.seed(123)
movie_split <- initial_split(movie_df, strata = adult)
movie_train <- training(movie_split)
movie_test <- testing(movie_split)
We also need to pre-process the data. We shall do this with a recipe.
movie_recipe <- recipe(adult ~ ., data = movie_train) |>
step_tokenize(description, token = "words") |>
step_stopwords(description) |>
step_tokenfilter(description, max_tokens = 500) |>
step_tfidf(description) |>
step_other(genre, threshold = 0.01) |>
step_dummy(genre) |>
step_normalize(all_predictors())
We can now prepare the model. We want to use a random forest. We could tune its parameters, but to understand whether the model works at all, let us only choose a suitable number of trees and prepare a basic model with the default values for the hyperparameters. If it gives decent results, we may then decide to update the model, tuning the hyperparameters to maximise the accuracy (or some other performance metric).
rf_spec <- rand_forest(trees = 1000) |>
set_engine("ranger") |>
set_mode("classification")Let us create a workflow in order to keep together the model and the pre-process.
movie_wf <- workflow() |>
add_model(rf_spec) |>
add_recipe(movie_recipe)
Let us start by checking the model with cross-validation.
set.seed(234)
movie_folds <- vfold_cv(movie_train, strata = adult)
doParallel::registerDoParallel()
cv_res <- fit_resamples(
movie_wf,
resamples = movie_folds,
metrics = metric_set(roc_auc, accuracy, spec, sens),
control = control_resamples(save_pred = TRUE)
)
Let us collect the metrics of the previous cross-validation and see whether the model predicts the rating correctly.
collect_metrics(cv_res)
# A tibble: 4 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.780 10 0.00505 Preprocessor1_Model1
2 roc_auc binary 0.853 10 0.00497 Preprocessor1_Model1
3 sens binary 0.809 10 0.00517 Preprocessor1_Model1
4 spec binary 0.746 10 0.00897 Preprocessor1_Model1
conf_mat_resampled(cv_res) |>
mutate(percent = Freq / sum(Freq) * 100)
# A tibble: 4 × 4
Prediction Truth Freq percent
<fct> <fct> <dbl> <dbl>
1 0 0 335. 43.1
2 0 1 91.9 11.8
3 1 0 79.1 10.2
4 1 1 270. 34.8
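As a side note, yardstick treats the first factor level (here 0, i.e. non-adult) as the event by default, so the sensitivity and specificity reported above can be read off the resampled confusion matrix directly. A quick hand check (not part of the original pipeline):
# sensitivity: non-adult (0) films correctly predicted as non-adult
335 / (335 + 79.1)   # roughly 0.809
# specificity: adult (1) films correctly predicted as adult
270 / (270 + 91.9)   # roughly 0.746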
Let us see the ROC curve as well.
collect_predictions(cv_res) |>
group_by(id) |>
roc_curve(adult, .pred_0) |>
ggplot(aes(x = 1 - specificity, y = sensitivity, colour = id)) +
geom_abline(linetype = "dashed") +
geom_path() +
coord_equal()
The curves look reasonable: this is probably not the most accurate prediction model, but it is a decent one. We could try to improve it by tuning the hyperparameters of the random forest (i.e. mtry and min_n) and choosing the values that maximise the specificity, because we would rather have the model label a non-adult film as adult only than the opposite, i.e. label an adult-only film as suitable for children.
Tuning the random forest
In order to tune the random forest, we need a model specification whose hyperparameters are marked for tuning, and we then update the workflow accordingly.
tune_spec <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) |>
set_engine("ranger") |>
set_mode("classification")
movie_wf <- movie_wf |>
update_model(tune_spec)
Let us create another set of cross-validation folds to tune the hyperparameters.
set.seed(345)
tune_folds <- vfold_cv(movie_train, strata = adult)
Let us also create a grid of possible hyperparameter values to check during tuning. The dials functions mtry() and min_n() describe the range of admissible values for the corresponding parameters; here we build the grid of candidate values directly with expand.grid().
rf_grid <- expand.grid(mtry = 5:15, min_n = 2:4)
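# An equivalent grid could also be built with the dials helpers mentioned
# above (an alternative sketch, not used here):
# rf_grid <- grid_regular(mtry(range = c(5L, 15L)),
#                         min_n(range = c(2L, 4L)),
#                         levels = c(11, 3))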
# NOTICE: This operation takes over 1 hour
# on a iMac M1 with 16GB RAM
tune_res <- tune_grid(
movie_wf,
tune_folds,
grid = rf_grid,
metrics = metric_set(roc_auc, accuracy, sens, spec),
control = control_grid(save_pred = TRUE)
)
Let us draw a quick diagram to see which parameters have performed better.
tune_res |>
collect_metrics() |>
filter(.metric == "spec") |>
select(mean, min_n, mtry) |>
pivot_longer(cols = min_n:mtry,
values_to = "value",
names_to = "parameter") |>
ggplot(aes(x = value, y = mean, colour = parameter)) +
geom_point() +
facet_wrap(~ parameter, scales = "free_x") +
theme(legend.position = "none")Let us also check which values are the best according to the specificity.
show_best(tune_res, "spec")# A tibble: 5 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 14 2 spec binary 0.739 10 0.0105 Preprocessor1_Model10
2 14 4 spec binary 0.738 10 0.00957 Preprocessor1_Model32
3 15 3 spec binary 0.737 10 0.00936 Preprocessor1_Model22
4 15 4 spec binary 0.737 10 0.00991 Preprocessor1_Model33
5 12 2 spec binary 0.737 10 0.00920 Preprocessor1_Model08
We can now select the values of the parameters that maximise the specificity.
best_spec <- select_best(tune_res, "spec")At this point, we can finalize the workflow with this new information and proceed to the final fit to check how the model works with the testing data.
final_wf <- finalize_workflow(movie_wf, best_spec)
Application of the model to the testing data
movie_res <- last_fit(final_wf, movie_split)
To complete the analysis, let us extract the metrics and look at the confusion matrix of the final fit to see how the model performs on new data.
collect_metrics(movie_res)
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.813 Preprocessor1_Model1
2 roc_auc binary 0.888 Preprocessor1_Model1
collect_predictions(movie_res) |>
conf_mat(adult, .pred_class)
          Truth
Prediction    0    1
         0 1152  257
         1  228  951
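As a quick sanity check (a hand computation, not part of the original pipeline), the accuracy reported above is just the proportion of correct predictions in this confusion matrix:
(1152 + 951) / (1152 + 257 + 228 + 951)   # roughly 0.813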
Let us also see the sensitivity and specificity on the test data:
collect_predictions(movie_res) |>
sensitivity(adult, .pred_class) |>
bind_rows(
collect_predictions(movie_res) |>
specificity(adult, .pred_class)
) |>
bind_rows(
collect_predictions(movie_res) |>
roc_auc(adult, .pred_0)
)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sensitivity binary 0.835
2 specificity binary 0.787
3 roc_auc binary 0.888
As we can see, the model correctly identifies most of the titles, though it misses some whose descriptions may be borderline between adult and children-safe content. It is not surprising that, even though we tried to maximise the specificity, the sensitivity is still higher than the specificity. Indeed, the language of non-adult films is more restricted than that of adult films, and many words can lead either to a children-safe description or to an adult-only one, depending on the context in which they are used.