Loading packages and dataset

pacotes <- c("readr", "dplyr", "ggplot2", "PerformanceAnalytics", "tidyr", "tm", "caret", "stringr", "tidytext" )
invisible(lapply(pacotes, library, character.only = TRUE))  # invisible() hides the list that lapply() returns

filmes <- read_csv("imdb_movies.csv")
filmes$Runtime <- gsub(" min", "", filmes$Runtime)
filmes$Runtime <- as.numeric(filmes$Runtime)
filmes$Released_Year <- as.numeric(as.character(filmes$Released_Year))

The ‘Overview’ column can provide several insights about the movies. For example, it gives an idea of a movie’s emotional tone: whether the description is more positive, neutral, or negative. It also reveals the themes the movie addresses, and reading it together with the movie’s genre gives a clearer overall picture of the movie.

It is also possible to infer the movie’s genre from the ‘Overview’ column, and machine learning and natural language processing can automate this. The same techniques can be used to determine the movie’s emotional tone, as well as its themes and topics.
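As a rough illustration of the emotional-tone idea, the sketch below scores two made-up overviews with the Bing sentiment lexicon that ships with tidytext. The data frame `exemplo` and its contents are hypothetical and stand in for the real ‘Overview’ column; a lexicon count like this is only a crude proxy for tone.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data (not taken from imdb_movies.csv)
exemplo <- tibble(
  titulo   = c("Filme A", "Filme B"),
  Overview = c("A brave hero finds love.",
               "A brutal war brings death and fear.")
)

# Count +1 for each positive word and -1 for each negative word
tom <- exemplo %>%
  unnest_tokens(word, Overview) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(valor = if_else(sentiment == "positive", 1L, -1L)) %>%
  group_by(titulo) %>%
  summarise(pontuacao = sum(valor))

print(tom)
```

A positive `pontuacao` suggests a more positive description, and a negative one a more negative description.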

I will demonstrate these possibilities with some simple R code. First, let’s find the most common words in the ‘Overview’ column.

Looking at the most common words in the ‘Overview’ column

# Cleaning the text in the "Overview" variable
overview_texto <- filmes$Overview
texto_limpo <- tolower(overview_texto)  # Converting to lowercase
texto_limpo <- str_replace_all(texto_limpo, "[^[:alnum:][:space:]]", "")  # Removing punctuation (letters and digits are kept)

# Tokenizing and removing stop words
tokens <- unnest_tokens(tibble(text = texto_limpo), word, text) %>%
  anti_join(stop_words, by = "word")  # Remove stop words

# Displaying word frequency
frequencia_palavras <- tokens %>%
  count(word, sort = TRUE)

head(frequencia_palavras, 10)

## # A tibble: 10 × 2
##    word        n
##    <chr>   <int>
##  1 life      101
##  2 world      78
##  3 story      63
##  4 love       61
##  5 war        61
##  6 woman      60
##  7 family     59
##  8 boy        42
##  9 friends    41
## 10 girl       39

The output above shows the words that appear most often in the ‘Overview’ column. Next, I will show which movie genre is most common for each of these words.

Looking at the most common genres for the most common words in the ‘Overview’ column

# Cleaning and tokenizing the texts
textolimpo <- filmes %>%
  mutate(Overview = tolower(Overview)) %>%
  mutate(Overview = str_replace_all(Overview, "[^[:alnum:][:space:]]", "")) %>%
  unnest_tokens(word, Overview) %>%
  anti_join(stop_words, by = "word")

# Counting the word frequency
frequencia_palavras <- textolimpo %>%
  count(word, sort = TRUE) %>%
  slice(1:10)

# Identifying the most common genre for each of the most frequent words
frequencia_genero_palavras <- textolimpo %>%
  filter(word %in% frequencia_palavras$word) %>%
  group_by(word, Genre) %>%
  count(sort = TRUE)

genero_mais_comum_por_palavra <- frequencia_genero_palavras %>%
  group_by(word) %>%
  slice_max(order_by = n, n = 1)

# Displaying the results
print(genero_mais_comum_por_palavra)

## # A tibble: 12 × 3
## # Groups:   word [10]
##    word    Genre                           n
##    <chr>   <chr>                       <int>
##  1 boy     Drama                           4
##  2 family  Drama                           9
##  3 friends Comedy, Drama                   6
##  4 girl    Animation, Adventure, Drama     5
##  5 life    Drama                          14
##  6 love    Drama, Romance                 10
##  7 story   Biography, Drama, Sport         6
##  8 war     Drama, War                     10
##  9 woman   Drama, Romance                 12
## 10 world   Animation, Adventure, Drama     4
## 11 world   Drama, Romance                  4
## 12 world   Drama, War                      4
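The table has 12 rows for 10 words because `slice_max()` keeps ties by default, so ‘world’ appears once for each of its three tied genres. The minimal sketch below, built on a hypothetical toy table `contagens`, shows the difference that `with_ties = FALSE` makes; which of the tied rows is kept then depends on row order, so this is a design choice rather than a fix.

```r
library(dplyr)

# Hypothetical counts mirroring part of the table above
contagens <- tibble(
  word  = c("world", "world", "world", "life"),
  Genre = c("Animation, Adventure, Drama", "Drama, Romance",
            "Drama, War", "Drama"),
  n     = c(4L, 4L, 4L, 14L)
)

# Default behavior: every tied row is kept
com_empates <- contagens %>%
  group_by(word) %>%
  slice_max(order_by = n, n = 1) %>%
  ungroup()

# with_ties = FALSE: exactly one row per word
sem_empates <- contagens %>%
  group_by(word) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(com_empates)
nrow(sem_empates)
```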

Now, building on the code above, it’s possible to create an algorithm that returns a potential genre for a given overview.

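In summary, the algorithm tokenizes the overview and looks up each word in the list of most common words. Every word found contributes the genre most often associated with it (or several genres, when there is a tie), and the genre that appears most often among these matches is returned. If no word matches, the function returns “Genre not found”.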

Algorithm to identify the genre through the Overview

# Cleaning and tokenizing the texts, keeping the Genre column
texto_limpo <- filmes %>%
  mutate(Overview = tolower(Overview)) %>%
  mutate(Overview = str_replace_all(Overview, "[^[:alnum:][:space:]]", "")) %>%
  unnest_tokens(word, Overview) %>%
  anti_join(stop_words, by = "word")

# Counting the frequency of words
frequencia_palavras <- texto_limpo %>%
  count(word, sort = TRUE) %>%
  slice(1:100)

# Identifying the most common genre for each of the 100 most frequent words
frequencia_genero_palavras <- texto_limpo %>%
  filter(word %in% frequencia_palavras$word) %>%
  group_by(word, Genre) %>%
  count(sort = TRUE)

# For each word, finding the most common genre
genero_mais_comum_por_palavra <- frequencia_genero_palavras %>%
  group_by(word) %>%
  slice_max(order_by = n, n = 1) %>%
  ungroup() %>%
  select(word, Genre)

# Function to predict the genre based on a new "Overview"
prever_genero <- function(novo_overview, genero_mais_comum_por_palavra) {
  # Clean and tokenize the new overview
  novo_overview <- tolower(novo_overview)
  novo_overview <- str_replace_all(novo_overview, "[^[:alnum:][:space:]]", "")
  
  novos_tokens <- tibble(texto = novo_overview) %>%
    unnest_tokens(word, texto) %>%
    anti_join(stop_words, by = "word")
  
  # Checking which words are in the list of most frequent words
  generos_correspondentes <- novos_tokens %>%
    inner_join(genero_mais_comum_por_palavra, by = "word") %>%
    count(Genre, sort = TRUE)
  
  # Returning the most common genre found
  if (nrow(generos_correspondentes) == 0) {
    return("Genre not found")
  } else {
    return(generos_correspondentes$Genre[1])
  }
}

Testing the function with the Overview of the movie “The Godfather”

overview <- "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son."
prever_genero(overview, genero_mais_comum_por_palavra)

## [1] "Crime, Drama, Thriller"

Testing the function with the fictional Overview suggested at the end of the EDA

overview2 <- "In a distant future, a group of unlikely adventurers set off on a thrilling quest of action and adventure to rescue their home planet."
prever_genero(overview2, genero_mais_comum_por_palavra)

## [1] "Action, Adventure, Sci-Fi"

Notice that, using the 100 most common words from the “Overview” column, the algorithm returned “Crime, Drama, Thriller” for The Godfather, a plausible result for that movie. It also returned “Action, Adventure, Sci-Fi” for the fictional movie suggested during the EDA phase, which fits the story that overview describes! Two examples are only an illustration, though, and not a measure of accuracy.

Although this is just a simple example, the algorithm demonstrates how it is possible to automate the task of identifying a movie’s genre through the “Overview”. It is worth noting that this algorithm could be improved with additional observations to enhance its accuracy, and there are various other algorithms available for this type of task.
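One possible refinement, sketched here as an assumption and not something the analysis above does: the Genre column stores combined labels such as “Drama, War”, so the counts for a word are spread across many combinations. Splitting the labels with `tidyr::separate_rows()` lets each individual genre be counted on its own. The table `exemplo_generos` is hypothetical toy data.

```r
library(dplyr)
library(tidyr)

# Hypothetical word/genre pairs in the same shape as texto_limpo
exemplo_generos <- tibble(
  word  = c("war", "war", "love"),
  Genre = c("Drama, War", "Action, War", "Drama, Romance")
)

# One row per individual genre, then count word/genre pairs
por_genero <- exemplo_generos %>%
  separate_rows(Genre, sep = ",\\s*") %>%
  count(word, Genre, sort = TRUE)

print(por_genero)
```

Here ‘war’ is associated with the single genre War twice, even though the two combined labels it came from are different.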

Therefore, yes, it is possible to gain insights from the “Overview” column, such as the movie’s tone, its themes, and its genre, and this task can be automated with algorithms designed for this purpose!

What insights can be drawn from the ‘Overview’ column? Is it possible to infer the genre of the movie from this column?

rafael

2024-07-01