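# Load the packages used throughout the analysis
pacotes <- c("readr", "dplyr", "ggplot2", "PerformanceAnalytics", "tidyr", "tm", "caret", "stringr", "tidytext")
invisible(lapply(pacotes, library, character.only = TRUE)) # invisible() hides the list returned by lapply()
# Read the data and convert Runtime (stored as text ending in " min") and Released_Year to numeric
filmes <- read_csv("imdb_movies.csv")
filmes$Runtime <- gsub(" min", "", filmes$Runtime)
filmes$Runtime <- as.numeric(filmes$Runtime)
filmes$Released_Year <- as.numeric(as.character(filmes$Released_Year))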
The ‘Overview’ column can provide several insights about the movies. For example, it can indicate a movie’s emotional tone: whether the description reads as more positive, neutral, or negative. It can also reveal the themes a movie addresses, which, combined with the movie’s genre, gives a clearer overall picture of the movie.
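As a rough illustration of the emotional-tone idea, the sketch below scores two overviews by counting positive and negative words from the Bing lexicon that ships with tidytext. The two overviews and the names `exemplo`, `tom`, `score`, and `tone` are invented for illustration and are not part of the original analysis. A simple word count like this is only one of many possible ways to estimate tone.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy overviews, invented for illustration (not from imdb_movies.csv)
exemplo <- tibble(
  id = 1:2,
  Overview = c(
    "A joyful story of love and friendship.",
    "A brutal war leaves a city in ruin and despair."
  )
)

# Score each overview: number of positive words minus number of negative words
tom <- exemplo %>%
  unnest_tokens(word, Overview) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(id) %>%
  summarise(score = sum(sentiment == "positive") - sum(sentiment == "negative")) %>%
  mutate(tone = case_when(
    score > 0 ~ "positive",
    score < 0 ~ "negative",
    TRUE      ~ "neutral"
  ))

tom
```

Note that an overview containing no lexicon words drops out of the result because of the `inner_join()`, so such overviews would need to be treated as neutral separately.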
Moreover, it is possible to infer a movie’s genre from the ‘Overview’ column, and machine learning and natural language processing algorithms can automate this. The same kinds of algorithms can also be used to estimate the movie’s emotional tone and to extract its themes and topics.
I will demonstrate these possibilities with some simple R code. First, let’s find the most common words in the ‘Overview’ column.
# Cleaning the text in the "Overview" variable
overview_texto <- filmes$Overview
texto_limpo <- tolower(overview_texto) # Converting to lowercase
texto_limpo <- str_replace_all(texto_limpo, "[^[:alnum:][:space:]]", "") # Removing punctuation (letters and digits are kept)
# Tokenizing and removing stop words
tokens <- unnest_tokens(tibble(text = texto_limpo), word, text) %>%
  anti_join(stop_words, by = "word") # Remove stop words
# Displaying word frequency
frequencia_palavras <- tokens %>%
  count(word, sort = TRUE)
head(frequencia_palavras, 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 life 101
## 2 world 78
## 3 story 63
## 4 love 61
## 5 war 61
## 6 woman 60
## 7 family 59
## 8 boy 42
## 9 friends 41
## 10 girl 39
The output above shows the ten most frequent words in the ‘Overview’ column. Next, I will show which movie genre is most common for each of these words.
# Cleaning and tokenizing the texts
textolimpo <- filmes %>%
  mutate(Overview = tolower(Overview)) %>%
  mutate(Overview = str_replace_all(Overview, "[^[:alnum:][:space:]]", "")) %>%
  unnest_tokens(word, Overview) %>%
  anti_join(stop_words, by = "word")
# Counting the word frequency
frequencia_palavras <- textolimpo %>%
  count(word, sort = TRUE) %>%
  slice(1:10)
# Identifying the most common genre for each of the most frequent words
frequencia_genero_palavras <- textolimpo %>%
  filter(word %in% frequencia_palavras$word) %>%
  group_by(word, Genre) %>%
  count(sort = TRUE)
genero_mais_comum_por_palavra <- frequencia_genero_palavras %>%
  group_by(word) %>%
  slice_max(order_by = n, n = 1)
# Displaying the results
print(genero_mais_comum_por_palavra)
## # A tibble: 12 × 3
## # Groups: word [10]
## word Genre n
## <chr> <chr> <int>
## 1 boy Drama 4
## 2 family Drama 9
## 3 friends Comedy, Drama 6
## 4 girl Animation, Adventure, Drama 5
## 5 life Drama 14
## 6 love Drama, Romance 10
## 7 story Biography, Drama, Sport 6
## 8 war Drama, War 10
## 9 woman Drama, Romance 12
## 10 world Animation, Adventure, Drama 4
## 11 world Drama, Romance 4
## 12 world Drama, War 4
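The result has 12 rows for 10 words because “world” is tied across three genres, and `slice_max()` keeps ties by default. The sketch below reproduces this behaviour on a small table built from the counts shown above (the names `contagem`, `com_empates`, and `sem_empates` are illustrative) and shows how `with_ties = FALSE` would keep a single row per word, if that were preferred.

```r
library(dplyr)

# Small table copied from the counts printed above, for illustration only
contagem <- tibble(
  word  = c("world", "world", "world", "war"),
  Genre = c("Animation, Adventure, Drama", "Drama, Romance", "Drama, War", "Drama, War"),
  n     = c(4L, 4L, 4L, 10L)
)

# Default behaviour: tied rows are all kept, so "world" appears three times
com_empates <- contagem %>%
  group_by(word) %>%
  slice_max(order_by = n, n = 1) %>%
  ungroup()

# with_ties = FALSE keeps exactly one row per word
sem_empates <- contagem %>%
  group_by(word) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(com_empates)
nrow(sem_empates)
```

Dropping ties makes the lookup table one-to-one, but the choice among tied genres is then essentially arbitrary, so keeping the ties is a defensible choice as well.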
Building on the code above, it’s possible to create an algorithm that returns a likely genre for a given overview.
In summary, the algorithm looks in the overview for any of the 100 most frequent words. Each word found is mapped to the genre in which it appears most often, and the genre matched by the most words is returned. If none of the words is found, the function returns "Genre not found".
# Cleaning and tokenizing the texts, keeping the Genre column
texto_limpo <- filmes %>%
  mutate(Overview = tolower(Overview)) %>%
  mutate(Overview = str_replace_all(Overview, "[^[:alnum:][:space:]]", "")) %>%
  unnest_tokens(word, Overview) %>%
  anti_join(stop_words, by = "word")
# Counting the frequency of words
frequencia_palavras <- texto_limpo %>%
  count(word, sort = TRUE) %>%
  slice(1:100)
# Identifying the most common genre for each of the 100 most frequent words
frequencia_genero_palavras <- texto_limpo %>%
  filter(word %in% frequencia_palavras$word) %>%
  group_by(word, Genre) %>%
  count(sort = TRUE)
# For each word, finding the most common genre
genero_mais_comum_por_palavra <- frequencia_genero_palavras %>%
  group_by(word) %>%
  slice_max(order_by = n, n = 1) %>%
  ungroup() %>%
  select(word, Genre)
# Function to predict the genre based on a new "Overview"
prever_genero <- function(novo_overview, genero_mais_comum_por_palavra) {
  # Clean and tokenize the new overview
  novo_overview <- tolower(novo_overview)
  novo_overview <- str_replace_all(novo_overview, "[^[:alnum:][:space:]]", "")
  novos_tokens <- tibble(texto = novo_overview) %>%
    unnest_tokens(word, texto) %>%
    anti_join(stop_words, by = "word")
  # Checking which words are in the list of most frequent words
  generos_correspondentes <- novos_tokens %>%
    inner_join(genero_mais_comum_por_palavra, by = "word") %>%
    count(Genre, sort = TRUE)
  # Returning the most common genre found
  if (nrow(generos_correspondentes) == 0) {
    return("Genre not found")
  } else {
    return(generos_correspondentes$Genre[1])
  }
}
overview <- "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son."
prever_genero(overview, genero_mais_comum_por_palavra)
## [1] "Crime, Drama, Thriller"
overview2 <- "In a distant future, a group of unlikely adventurers set off on a thrilling quest of action and adventure to rescue their home planet."
prever_genero(overview2, genero_mais_comum_por_palavra)
## [1] "Action, Adventure, Sci-Fi"
Notice that, using only the 100 most common words from the “Overview” column, the algorithm returned a genre for the overview of The Godfather that is reasonably close to the movie’s actual genre. It also returned a fitting genre for the fictional movie suggested during the EDA phase!
Although this is just a simple example, the algorithm demonstrates how the task of identifying a movie’s genre from the “Overview” column can be automated. Its accuracy could likely be improved with additional observations, and various other algorithms are available for this type of task.
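One possible refinement, offered here as an assumption rather than something the analysis above does, is to split each combined Genre string (for example "Drama, War") into individual genres before counting. This gives every single genre more observations than the rarer combined labels have. The sketch below uses `tidyr::separate_rows()` on a hypothetical toy table; the column `titulo` and the object names are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, invented for illustration (not from imdb_movies.csv)
filmes_exemplo <- tibble(
  titulo = c("A", "B", "C"),
  Genre  = c("Crime, Drama", "Drama, War", "Comedy")
)

# One row per (movie, individual genre) instead of one combined label per movie
generos_individuais <- filmes_exemplo %>%
  separate_rows(Genre, sep = ",\\s*")

# Frequency of each individual genre
contagem_generos <- generos_individuais %>%
  count(Genre, sort = TRUE)

contagem_generos
```

With the genres split this way, the same word-to-genre lookup could be built per individual genre, at the cost of predicting one genre at a time instead of a combined label.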
Therefore, yes, it is possible to gain insights from the “Overview” column, such as the movie’s tone, its themes, and its genre. Additionally, this task can be automated through computational algorithms specifically designed for this purpose!