This activity compares two movie scripts: Aladdin (1992) and Aladdin (2019). The first is the animated version, so the task is to analyze how much the new script has been modernized and whether it remains faithful to its predecessor.
# Packages for text mining, sentiment analysis, and visualization
library(tm)
library(pdftools)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(syuzhet)
library(ggplot2)
library(stringr)
library(text)
library(textdata)
library(tidytext)
library(dplyr)
library(tidyr)
library(zoo)
library(gridExtra)
For the 1992 script (plain text):
text_1992 <- readLines("Aladdin_1992.txt")
corp_1992 <- Corpus(VectorSource(text_1992))
For the 2019 script (PDF or plain text):
# text_2019 <- pdf_text("Aladdin_2019.pdf")
text_2019 <- readLines("Aladdin_2019.txt")
corp_2019 <- Corpus(VectorSource(text_2019))
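Note that pdf_text() returns one character string per page rather than per line; if the 2019 script were read from the PDF, the pages could be split into lines so both scripts are processed at the same granularity (a sketch, assuming the file name above):
# pdf_text() yields one string per page; split on newlines to obtain
# line-level units comparable to readLines() on the plain-text script
# pages_2019 <- pdf_text("Aladdin_2019.pdf")
# text_2019 <- unlist(strsplit(pages_2019, "\n"))
# corp_2019 <- Corpus(VectorSource(text_2019))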
Now we clean each corpus by converting to lowercase and removing punctuation, numbers, and common English stopwords; this is necessary for further analysis. Stemming could also be applied, but the matrices below were built without it (see the optional step after the cleaning pipeline).
# Helper transformers for stripping quotation marks (including smart quotes)
removeQuotationMarks <- content_transformer(function(x) gsub("\"", "", x))
cleanTextEnhanced <- content_transformer(function(x) {
x <- gsub("[\"“”‘’]", "", x) # Remove all types of quotation marks including smart quotes
x <- gsub("[[:punct:]]", "", x) # Remove all other punctuation
x <- gsub("\\s+", " ", x) # Replace multiple spaces with a single space
x <- trimws(x) # Trim leading and trailing whitespace
return(x)
})
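These custom transformers are defined but not applied in the pipeline below (the frequency tables later still contain terms with leading quotation marks); if we wanted to use them, a minimal sketch:
# Optional: strip quotation marks before the standard cleaning steps
# (not applied to the results shown below)
# corp_1992 <- tm_map(corp_1992, cleanTextEnhanced)
# corp_2019 <- tm_map(corp_2019, cleanTextEnhanced)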
# Apply cleaning steps to corp_1992
corp_1992_clean <- tm_map(corp_1992, content_transformer(tolower))
corp_1992_clean <- tm_map(corp_1992_clean, removePunctuation, ucp = TRUE)
corp_1992_clean <- tm_map(corp_1992_clean, removeNumbers)
corp_1992_clean <- tm_map(corp_1992_clean, removeWords, stopwords("english"))
# Apply cleaning steps to corp_2019
corp_2019_clean <- tm_map(corp_2019, content_transformer(tolower))
corp_2019_clean <- tm_map(corp_2019_clean, removePunctuation, ucp = TRUE)
corp_2019_clean <- tm_map(corp_2019_clean, removeNumbers)
corp_2019_clean <- tm_map(corp_2019_clean, removeWords, stopwords("english"))
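As mentioned above, stemming is optional; a sketch of how it could be added with tm's stemDocument (backed by SnowballC), noting that the matrices below were built without it:
# Optional stemming step (not applied to the results shown below)
# corp_1992_clean <- tm_map(corp_1992_clean, stemDocument)
# corp_2019_clean <- tm_map(corp_2019_clean, stemDocument)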
We combine both cleaned corpora into a single corpus for comparative analysis:
corp_combined <- c(corp_1992_clean, corp_2019_clean)
We generate term-document matrices to analyze word frequency:
# Term-Document Matrix for the 1992 version
dtm_1992 <- TermDocumentMatrix(corp_1992_clean)
# Term-Document Matrix for the 2019 version
dtm_2019 <- TermDocumentMatrix(corp_2019_clean)
# Term-Document Matrix for Combined versions
dtm <- TermDocumentMatrix(corp_combined)
dtm_matrix <- as.matrix(dtm)
We find the most frequent terms across both scripts, together with their counts:
freq_terms <- findFreqTerms(dtm, lowfreq = 20)
freq_matrix <- dtm_matrix[freq_terms, ]
apply(freq_matrix, 1, sum)
## "", "abu "aladdin "genie "iago "jafar "jasmine "sultan
## 1236 47 396 168 68 156 178 71
## abu abu", agrabah ahead aladdin aladdin", aladdins ali
## 126 22 25 21 144 33 22 58
## ali", around away back begins boy can cant
## 23 24 30 81 23 27 63 43
## carpet cave come comes dont find free friend
## 84 27 44 30 94 24 20 25
## genie genie", genies get gonna good got gotta
## 106 23 20 74 31 38 62 31
## grabs guards hand head hes hey iago ill
## 26 26 21 40 49 33 38 23
## ive jafar jasmine just know lamp lamp", last
## 20 88 71 79 52 64 31 20
## let lets like little look looks love magic
## 21 21 94 41 51 59 23 25
## make man need never new now okay one
## 50 39 26 63 36 69 35 77
## palace people please prince princess pulls rajah really
## 24 30 20 105 54 36 26 21
## right say second see sees sorry stop street
## 62 35 20 69 34 20 24 27
## sultan take tell thank thats theres think three
## 62 40 31 20 55 20 55 25
## time top try turns two want way well
## 32 20 23 40 27 24 40 52
## whole will wish world yes youre
## 29 74 66 25 43 58
Next, we analyze which words commonly co-occur with a specific term, such as “genie”:
associations <- findAssocs(dtm, "genie", 0.3)
# Extract the associations for 'genie' and sort them in decreasing order
genie_associations <- associations$genie
genie_associations_sorted <- sort(genie_associations, decreasing = TRUE)
# Get the top 10 associations
top_10_associations <- head(genie_associations_sorted, 10)
# Print the top 10 associations
print(top_10_associations)
## "iago "jafar along", another apple arms bow boy", brand bread
## 1 1 1 1 1 1 1 1 1 1
To see all of the associations, uncomment the following:
# findAssocs(dtm, "genie", 0.3)
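One way to make these associations easier to read is a simple bar chart built from the top_10_associations vector computed above (a sketch):
# Bar chart of the strongest 'genie' associations
assoc_df <- data.frame(term = names(top_10_associations),
                       correlation = as.numeric(top_10_associations))
ggplot(assoc_df, aes(x = reorder(term, correlation), y = correlation)) +
  geom_col(fill = "#63238E") +
  coord_flip() +
  labs(title = "Top terms associated with 'genie'", x = "Term", y = "Correlation") +
  theme_minimal()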
# Find frequent terms in the 1992 version (note the higher lowfreq threshold)
ft_1992 <- findFreqTerms(dtm_1992, lowfreq = 90)
# Find frequent terms in the 2019 version
ft_2019 <- findFreqTerms(dtm_2019, lowfreq = 10)
# Extract frequencies for the 1992 version
freq_matrix_1992 <- as.matrix(dtm_1992[ft_1992, ])
freqs_1992 <- rowSums(freq_matrix_1992)
# Extract frequencies for the 2019 version
freq_matrix_2019 <- as.matrix(dtm_2019[ft_2019, ])
freqs_2019 <- rowSums(freq_matrix_2019)
# Combine into a data frame for comparison
# This will include only terms that were identified as frequent in at least one of the versions
freq_comparison <- merge(data.frame(Term = names(freqs_1992), Aladdin_1992 = freqs_1992),
data.frame(Term = names(freqs_2019), Aladdin_2019 = freqs_2019),
by = "Term", all = TRUE)
# Replace NA with 0 for terms that did not meet the frequency threshold in one version
# (a 0 therefore means below-threshold, not necessarily absent from that script)
freq_comparison[is.na(freq_comparison)] <- 0
# Ordering by one of the frequencies for better visualization
freq_comparison <- freq_comparison[order(-freq_comparison$Aladdin_1992),]
# View the comparison
print(freq_comparison)
## Term Aladdin_1992 Aladdin_2019
## 4 aladdin 381 193
## 42 jafar 197 66
## 45 jasmine 194 74
## 1 abu 166 29
## 28 genie 164 133
## 95 sultan 110 43
## 39 iago 93 20
## 2 agrabah 0 17
## 3 ahead 0 12
## 5 ali 0 45
## 6 anders 0 10
## 7 arabian 0 11
## 8 away 0 11
## 9 baba 0 15
## 10 back 0 17
## 11 better 0 10
## 12 boy 0 14
## 13 bracelet 0 11
## 14 can 0 43
## 15 cant 0 24
## 16 carpet 0 14
## 17 cave 0 11
## 18 come 0 22
## 19 dalia 0 15
## 20 done 0 12
## 21 dont 0 60
## 22 enough 0 12
## 23 ever 0 11
## 24 every 0 12
## 25 find 0 11
## 26 free 0 12
## 27 friend 0 18
## 29 get 0 39
## 30 gonna 0 16
## 31 good 0 17
## 32 got 0 30
## 33 gotta 0 14
## 34 guards 0 12
## 35 hakim 0 14
## 36 help 0 10
## 37 hes 0 32
## 38 hey 0 19
## 40 ill 0 10
## 41 ive 0 11
## 43 jamal 0 11
## 44 jams 0 13
## 46 jump 0 13
## 47 just 0 42
## 48 kid 0 13
## 49 know 0 41
## 50 lamp 0 38
## 51 let 0 13
## 52 life 0 12
## 53 like 0 52
## 54 little 0 20
## 55 look 0 19
## 56 love 0 12
## 57 magic 0 14
## 58 make 0 25
## 59 man 0 13
## 60 marry 0 12
## 61 master 0 13
## 62 mean 0 12
## 63 monkey 0 19
## 64 much 0 10
## 65 nay 0 11
## 66 need 0 18
## 67 never 0 37
## 68 new 0 14
## 69 nothing 0 13
## 70 now 0 40
## 71 okay 0 32
## 72 one 0 47
## 73 palace 0 10
## 74 people 0 24
## 75 place 0 12
## 76 please 0 17
## 77 power 0 10
## 78 powerful 0 11
## 79 prince 0 77
## 80 princess 0 29
## 81 really 0 11
## 82 right 0 43
## 83 said 0 15
## 84 say 0 25
## 85 second 0 13
## 86 see 0 31
## 87 seen 0 10
## 88 sherabad 0 11
## 89 something 0 10
## 90 sorry 0 10
## 91 speechless 0 10
## 92 steal 0 11
## 93 stop 0 14
## 94 street 0 10
## 96 take 0 25
## 97 tell 0 16
## 98 thank 0 14
## 99 thats 0 34
## 100 theres 0 13
## 101 thief 0 14
## 102 think 0 40
## 103 thought 0 10
## 104 three 0 15
## 105 time 0 18
## 106 try 0 14
## 107 want 0 11
## 108 way 0 21
## 109 well 0 26
## 110 whats 0 10
## 111 whole 0 13
## 112 will 0 47
## 113 wish 0 45
## 114 wont 0 16
## 115 world 0 17
## 116 years 0 10
## 117 yes 0 26
## 118 youre 0 34
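The head of freq_comparison can also be reshaped for a side-by-side bar chart of the shared main characters; a minimal sketch using tidyr's pivot_longer():
# Dodged bar chart of the top terms in the comparison table
top_terms <- head(freq_comparison, 7)
top_long <- pivot_longer(top_terms, cols = c(Aladdin_1992, Aladdin_2019),
                         names_to = "Version", values_to = "Frequency")
ggplot(top_long, aes(x = reorder(Term, -Frequency), y = Frequency, fill = Version)) +
  geom_col(position = position_dodge(width = 0.7)) +
  scale_fill_manual(values = c("Aladdin_1992" = "#EEB405", "Aladdin_2019" = "#63238E")) +
  labs(title = "Most frequent terms: 1992 vs. 2019", x = "Term", y = "Frequency") +
  theme_minimal()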
color_palette <- colorRampPalette(c("#63238E", "#EEB405"))
# Convert the term-document matrix to a matrix
dtm_matrix <- as.matrix(TermDocumentMatrix(corp_combined))
# Calculate word frequencies
word_freqs <- sort(rowSums(dtm_matrix), decreasing = TRUE)
# Remove words that start or end with quotation marks from the frequency list
word_freqs <- word_freqs[!grepl('^"|"$', names(word_freqs))]
# Generate the word cloud, now excluding words with quotation marks
wordcloud(names(word_freqs), word_freqs, max.words = 20, colors = color_palette(4))
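For a direct visual contrast, the wordcloud package also offers comparison.cloud(), which sizes each term by how much its frequency deviates between the two scripts; a sketch built from the per-version matrices:
# Term matrix with one column per version for comparison.cloud()
m_1992 <- rowSums(as.matrix(dtm_1992))
m_2019 <- rowSums(as.matrix(dtm_2019))
all_terms <- union(names(m_1992), names(m_2019))
term_matrix <- cbind(Aladdin_1992 = m_1992[all_terms],
                     Aladdin_2019 = m_2019[all_terms])
term_matrix[is.na(term_matrix)] <- 0   # terms absent from one version
rownames(term_matrix) <- all_terms
comparison.cloud(term_matrix, max.words = 40, colors = c("#EEB405", "#63238E"))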
# Load the 1992 script
text_1992 <- readLines("Aladdin_1992.txt")
corp_1992 <- Corpus(VectorSource(text_1992))
# Load the 2019 script
text_2019 <- readLines("Aladdin_2019.txt")
corp_2019 <- Corpus(VectorSource(text_2019))
corp_1992 <- tm_map(corp_1992, content_transformer(tolower))
corp_1992 <- tm_map(corp_1992, removePunctuation)
corp_1992 <- tm_map(corp_1992, removeNumbers)
corp_1992 <- tm_map(corp_1992, removeWords, stopwords("english"))
corp_1992 <- tm_map(corp_1992, stripWhitespace)
# Convert the corpus to plain text
text_1992_clean <- sapply(corp_1992, as.character)
# Sentiment analysis
emociones_df <- get_nrc_sentiment(text_1992_clean)
# Plotting emotion distribution
emotions_sums <- colSums(prop.table(emociones_df[, 1:8]))
barplot(emotions_sums, main = "Distribution of Emotions", ylab = "Proportion", las = 2)
# Calculate sentiment values using syuzhet
sentiment_values <- get_sentiment(text_1992_clean, method = "syuzhet")
# Generate the Syuzhet plot
plot_data <- data.frame(scores = sentiment_values, index = 1:length(sentiment_values))
ggplot(plot_data, aes(x = index, y = scores)) +
geom_line() +
geom_smooth(span = 0.05, method = "loess", colour = "blue", se = FALSE) +
geom_smooth(span = 0.1, method = "loess", colour = "red", se = FALSE) +
labs(title = "Syuzhet Plot", x = "Full Narrative Time", y = "Scaled Sentiment") +
theme_minimal()
# Generate the simplified macro shape plot
dct_values_1992 <- get_dct_transform(sentiment_values, low_pass_size = 5, x_reverse_len = length(sentiment_values))
plot_data_dct <- data.frame(scores = dct_values_1992, index = 1:length(dct_values_1992))
ggplot(plot_data_dct, aes(x = index, y = scores)) +
geom_line(colour = "red") +
labs(title = "Simplified Macro Shape", x = "Normalized Narrative Time", y = "Scaled Sentiment") +
theme_minimal()
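syuzhet also bundles the moving-average, loess, and DCT views into a single figure via simple_plot(), a base-graphics alternative to the ggplot2 charts above (a sketch):
# All three smoothed views of the 1992 sentiment in one figure
simple_plot(sentiment_values, title = "Aladdin 1992")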
corp_2019 <- tm_map(corp_2019, content_transformer(tolower))
corp_2019 <- tm_map(corp_2019, removePunctuation)
corp_2019 <- tm_map(corp_2019, removeNumbers)
corp_2019 <- tm_map(corp_2019, removeWords, stopwords("english"))
corp_2019 <- tm_map(corp_2019, stripWhitespace)
# Convert the corpus to plain text
text_2019_clean <- sapply(corp_2019, as.character)
# Sentiment analysis
emociones_df <- get_nrc_sentiment(text_2019_clean)
# Plotting emotion distribution
emotions_sums_2019 <- colSums(prop.table(emociones_df[, 1:8]))
barplot(emotions_sums_2019, main = "Distribution of Emotions", ylab = "Proportion", las = 2)
# Calculate sentiment values using syuzhet
sentiment_values <- get_sentiment(text_2019_clean, method = "syuzhet")
# Generate the Syuzhet plot
plot_data <- data.frame(scores = sentiment_values, index = 1:length(sentiment_values))
ggplot(plot_data, aes(x = index, y = scores)) +
geom_line() +
geom_smooth(span = 0.05, method = "loess", colour = "blue", se = FALSE) +
geom_smooth(span = 0.1, method = "loess", colour = "red", se = FALSE) +
labs(title = "Syuzhet Plot", x = "Full Narrative Time", y = "Scaled Sentiment") +
theme_minimal()
# Generate the simplified macro shape plot
dct_values_2019 <- get_dct_transform(sentiment_values, low_pass_size = 5, x_reverse_len = length(sentiment_values))
plot_data_dct <- data.frame(scores = dct_values_2019, index = 1:length(dct_values_2019))
ggplot(plot_data_dct, aes(x = index, y = scores)) +
geom_line(colour = "red") +
labs(title = "Simplified Macro Shape", x = "Normalized Narrative Time", y = "Scaled Sentiment") +
theme_minimal()
# Bar plot for emotion distribution for both versions
emotions_df_combined <- rbind(data.frame(Emotion = names(emotions_sums), Proportion = emotions_sums, Version = "1992"),
data.frame(Emotion = names(emotions_sums_2019), Proportion = emotions_sums_2019, Version = "2019"))
# Plotting combined bar plot
ggplot(emotions_df_combined, aes(x = Emotion, y = Proportion, fill = Version)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
scale_fill_manual(values = c("1992" = "#EEB405", "2019" = "#63238E")) +
labs(title = "Emotion Distribution in Aladdin 1992 vs. 2019", y = "Proportion", x = "Emotion") +
theme_minimal()
# Plotting simplified sentiment trend for 1992 version
plot_1992 <- ggplot(data = data.frame(Index = 1:length(dct_values_1992), Score = dct_values_1992),
aes(x = Index, y = Score)) +
geom_line(color = "#EEB405") +
labs(title = "Simplified Sentiment Trend in Aladdin 1992", x = "Narrative Time", y = "Sentiment Score") +
theme_minimal()
# Plotting simplified sentiment trend for 2019 version
plot_2019 <- ggplot(data = data.frame(Index = 1:length(dct_values_2019), Score = dct_values_2019),
aes(x = Index, y = Score)) +
geom_line(color = "#63238E") +
labs(title = "Simplified Sentiment Trend in Aladdin 2019", x = "Narrative Time", y = "Sentiment Score") +
theme_minimal()
# Arrange both sentiment trend plots in a single column for comparison
grid.arrange(plot_1992, plot_2019, ncol = 1)
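Because dct_values_1992 and dct_values_2019 have different lengths, the stacked panels above are not directly comparable along the x-axis; overlaying both macro shapes on a normalized 0-1 axis avoids this (a sketch):
# Overlay both macro shapes on a common normalized time axis
trend_df <- rbind(
  data.frame(Time = seq(0, 1, length.out = length(dct_values_1992)),
             Score = dct_values_1992, Version = "1992"),
  data.frame(Time = seq(0, 1, length.out = length(dct_values_2019)),
             Score = dct_values_2019, Version = "2019"))
ggplot(trend_df, aes(x = Time, y = Score, colour = Version)) +
  geom_line() +
  scale_colour_manual(values = c("1992" = "#EEB405", "2019" = "#63238E")) +
  labs(title = "Macro Sentiment Shapes on a Common Time Axis",
       x = "Normalized Narrative Time", y = "Scaled Sentiment") +
  theme_minimal()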
Length and Repetition: The script from Aladdin (1992) exhibits a greater length, resulting in a more frequent repetition of certain words and character names. This trend suggests a richer dialogue and more numerous musical sequences in the animated classic.
Adaptation and Innovation: The 2019 rendition of Aladdin diverges subtly from its predecessor, not only by condensing the script but also by introducing new dialogues and songs absent in the original. This variation indicates an effort to modernize the story while maintaining its core essence.
Key Characters: The names “Aladdin” and “Abu” are amongst the most repeated, underscoring their central role in the narrative. Their prominence across both versions highlights their significance as the main protagonist and his companion, respectively.
Character Associations: The character “Genie” is notably linked with “carpet,” reflecting their interactions within the narrative. The association analysis also connects Genie with terms like “afraid” and “along,” likely emerging from dialogues involving Aladdin or Abu, pointing to the depth of their relationships.
Sentiment Analysis: Despite differing lengths, both scripts show a similar sentiment trajectory, peaking with positive emotions towards the climax. Notably, the 2019 script exhibits fewer negative sentiments compared to the 1992 version, which emphasized such emotions early on. This shift suggests a modern adaptation’s preference for gradually building positive moments, illustrating a nuanced change in storytelling approach.
Emotional Dynamics: The overall balance of emotions between the two versions remains remarkably consistent, indicating that the fundamental tone of the story has been preserved across adaptations.
Inclusion Criteria: - All of the text in the script, including character names, stage notes, songs, etc. - Titles and subtitles are also analyzed.
Exclusion Criteria: - Stop words and punctuation are removed, and capital letters are converted to lowercase.
The elimination of punctuation, stop words, and capitalization does not seek to bias the analysis; on the contrary, it simplifies the text and keeps the data from exceeding what is necessary to obtain meaningful insights.