Film Scripts: Linguistic Genre Analysis with Word Frequency & Word Count

options(repos = "https://cloud.r-project.org")

Montclair State University • FMTV 210-01 • Story Analysis and Introduction to Screenwriting

For this short lab, we will look at how two film genres can be analyzed from a linguistic standpoint. In essence, we will perform a small-scale linguistic genre analysis.

We will be using the Film Corpus 2.0, created by the Natural Language and Dialogue Systems lab at the University of California, Santa Cruz. Please follow this link and download the corpus so that you can follow along with this lab! Takeaway: how to screenwrite for a particular genre. What type of language do we find in a particular genre?

Films which belong to the Family and War genres are those that we will be analyzing. Obviously, films that are labeled “family” will use very different language than those which have the “war” label… But you will not get the full picture until we visualize this and really bring this point home!

Word Clouds

In this section, we will create word clouds for the Family films and the War films… hopefully we’ll get some eye-opening information.

But… before we do any visualizing in R, we need to do some cleaning and pre-processing of the corpora. I will show some necessary code in both R and Python throughout this lab.

To work with the data, we should first combine all of the scripts for each genre into a single file. Please add all 39 Family scripts to one file named Family.txt, and do the same for the 26 War scripts.
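
If you would rather not paste the scripts together by hand, here is a minimal Python sketch of the combining step. It assumes each genre's scripts sit in their own folder as .txt files; the function name `combine_scripts` and the example paths are hypothetical, not part of the corpus.

```python
from pathlib import Path

def combine_scripts(genre_dir, output_file):
    """Concatenate every .txt file in genre_dir into one output file."""
    # Collect the script paths first, in alphabetical order
    script_paths = sorted(Path(genre_dir).glob("*.txt"))
    with open(output_file, "w", encoding="utf-8") as combined:
        for script_path in script_paths:
            # errors="replace" keeps going if a script contains odd bytes
            combined.write(script_path.read_text(encoding="utf-8", errors="replace"))
            combined.write("\n")  # keep one script from running into the next
    return len(script_paths)

# Hypothetical paths: replace with your own local folders
# combine_scripts("/path/to/Family", "/path/to/Family.txt")
```

Save the combined file outside of the genre folder, so that it is not picked up as one more script if you run the code a second time.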

First: we need to normalize our text. I will provide the Python code that converts all of the text to lowercase letters. Now… if you do not have NLP experience, you may be asking why we need to do this. Why does it matter if the words are capitalized? Because if we are counting frequency, we do not want instances of “Baby” and “baby” to count as two separate words.

#Normalizing

file_for_input = '/Users/jordanthomas/Desktop/Family.txt' #Replace this with your own local path to the Family genre scripts
file_for_output = '/Users/jordanthomas/Desktop/Family_normalized.txt' #Replace this with your own local path for where you'd like the Family genre scripts to go after normalization

try:
    with open(file_for_input, 'r') as input_file:
        text = input_file.read()

    modified_corpus = text.lower()

    with open(file_for_output, 'w') as output_file:
        output_file.write(modified_corpus)

    print(f"Text has been normalized and saved to: '{file_for_output}'.")
except Exception as e:
    print(f"An error occurred, try again: {e}")

## An error occurred, try again: [Errno 2] No such file or directory: '/Users/jordanthomas/Desktop/Family.txt'

Be sure to use the same code for the War scripts.

Next: we will need to remove stop words. Stop words, for those who are unfamiliar with commonly used NLP lingo, are high-frequency function words (such as “the,” “and,” and “of”) that may add unwanted bloat to a dataset. However, if you are working on a context-dependent analysis, you will want to leave the stop words in the text. For the purpose of creating word clouds from our corpora, we definitely want to remove them.

Refer to NLTK documentation for instructions on how to remove stop words from various text types.
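
To give a sense of what this step does, here is a minimal Python sketch of stop word removal. The tiny stop list and the function name `remove_stop_words` are assumptions made for illustration; in practice, NLTK's much longer English list (`stopwords.words("english")`) could be swapped in.

```python
import re

# A tiny illustrative stop list (an assumption for this sketch);
# NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the words in text that are not in stop_words."""
    # Lowercase first, then keep runs of letters and apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in stop_words]

print(remove_stop_words("The fox is in the room and it looks great"))
# ['fox', 'room', 'looks', 'great']
```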

Our final step before making our word clouds will be to get frequency counts for every unique word in each corpus: first for the Family scripts and then, once again, for the War scripts.
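
One way to get these counts is with Python's built-in `collections.Counter`. This is only a sketch: the function name `top_words` and the sample word list are hypothetical, and the input is assumed to be a list of words that has already been normalized and stripped of stop words.

```python
from collections import Counter

def top_words(words, n):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)

sample = ["know", "one", "know", "well", "know", "one"]
print(top_words(sample, n=2))  # [('know', 3), ('one', 2)]
```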

Now, let’s get started with creating our word clouds! I chose to show the 67 most frequently occurring words in each genre for this particular exercise… Some unnecessary words remained after our normalization and removal of stop words, so I removed them manually.

Let’s begin this section by installing the wordcloud2 package below:

install.packages("wordcloud2")

library(wordcloud2)

Family_word_freq <- data.frame(
  word = c("know", "one", "well", "right", "see", "come", "look", "fox", "think", "going",
           "time", "want", "little", "jack", "good", "let", "yeah", "hey", "looks", "never",
           "would", "take", "man", "say", "could", "around", "way", "yes", "okay", "away",
           "something", "tell", "gon", "make", "really", "fred", "mean", "big", "give", "head",
           "peter", "wait", "old", "george", "thing", "sure", "mary", "stop", "ever", "still",
           "need", "home", "room", "maybe", "must", "love", "much", "help", "find", "said",
           "michael", "please", "even", "door", "new", "says", "great")
  ,
  freq = c(791, 788, 682, 672, 623, 589, 544, 521, 518, 485,
           480, 466, 458, 456, 444, 444, 424, 418, 407, 393,
           372, 372, 349, 343, 337, 337, 332, 327, 323, 315,
           307, 299, 287, 268, 255, 253, 248, 243, 243, 242,
           239, 237, 233, 231, 223, 222, 218, 217, 214, 213,
           213, 206, 201, 201, 199, 196, 195, 192, 191, 188,
           187, 185, 185, 183, 178, 177, 174)
)
wordcloud2(Family_word_freq, size = .6)

War_word_freq <- data.frame(
  word = c("sir", "war", "army", "mean", "stauffenberg", "shit", "general", "dead", "god", "colonel",
           "captain", "love", "money", "hell", "father", "life", "beat", "fight", "kill", "gold",
           "french", "fucking", "die", "killed", "shoot", "old", "hitler", "fire", "horse", "sergeant",
           "soldiers", "olbricht", "bad", "fuck", "orders", "english", "world", "side", "country", "king",
           "thousand", "alive", "pay", "shot", "ass", "hope", "sal", "damn", "private", "field", "jesus",
           "hit", "merle", "cut", "officer", "drink", "lord", "lost", "gentlemen", "soldier", "wallace",
           "crazy", "death", "german", "fuckin", "fort", "sarge")
  ,
  freq = c(531, 283, 197, 184, 178, 175, 173, 170, 166, 153,
           152, 145, 142, 136, 134, 134, 132, 124, 121, 112,
           111, 106, 102, 98, 97, 96, 96, 91, 86, 85,
           84, 84, 83, 83, 81, 79, 78, 77, 77, 77,
           74, 72, 71, 70, 69, 68, 67, 65, 64, 64,
           64, 63, 62, 61, 61, 60, 59, 58, 57, 56,
           56, 56, 55, 55, 54, 54, 54)
)
wordcloud2(War_word_freq, size = .9)

When reading a word cloud, the larger a word appears, the more often it is used within a given context. So, we see that “know” is the most used word in the Family corpus (791 occurrences), while “new” (178) and “mean” (248) appear far less often across the scripts.

Now… what do these visualizations tell us about these two genres?

Immediately, we can see that the word cloud that we made from the War corpus has words that fit directly in the linguistic register of a soldier or a captain: “captain,” “general,” “private,” “army,” “field,” and “fight,” to name a few. The genre also features words like “fucking,” “shit,” “hell,” “die,” “hitler,” and “shoot,” which we do not see in our Family word cloud.

This is just one way that we can see exactly what type of language is used in these two very different film genres. The Family word cloud shows us that “know,” “one,” “well,” “right,” and “see” are the most commonly occurring words across the genre.

Hovering over each word will allow you to see the count for that word… be sure to try this wordcloud2 feature out!

Word Count

Next, let’s see if we can gather any information on these two film genres when it comes to QUANTITY of text.

For this particular task, I have consulted Box Office Mojo to narrow down the 10 top-grossing Family and War films from our corpora.

We want to gather the word count for the dialogue from the Family and War film top 10 lists… we can do so by creating a function in R:

#Gather word count for Family films
wordcount <- function(file_path_for_Family) {
  text <- readLines(file_path_for_Family, warn = FALSE)
  text <- paste(text, collapse = " ")
  words <- strsplit(text, "\\s+")[[1]]
  return(length(words))
}
file_paths_for_Family <- c(
  "/Users/jordanthomas/Desktop/Family/aladdin.txt",
  "/Users/jordanthomas/Desktop/Family/chroniclesofnarniathelionthewitchandthewardrobe.txt",
  "/Users/jordanthomas/Desktop/Family/e.t..txt",
  "/Users/jordanthomas/Desktop/Family/findingnemo.txt",
  "/Users/jordanthomas/Desktop/Family/happyfeet.txt",
  "/Users/jordanthomas/Desktop/Family/kungfupanda.txt",
  "/Users/jordanthomas/Desktop/Family/shrek.txt",
  "/Users/jordanthomas/Desktop/Family/toystory.txt",
  "/Users/jordanthomas/Desktop/Family/up.txt",
  "/Users/jordanthomas/Desktop/Family/walle.txt"
  )
word_counts <- list()

for (file_path_for_Family in file_paths_for_Family) {
  count <- wordcount(file_path_for_Family)
  film_name <- basename(file_path_for_Family)
  word_counts[film_name] <- count
}

for (film_name in names(word_counts)) {
  cat("Film:", film_name, "- Word Count:", word_counts[[film_name]], "\n")
}

## Film: aladdin.txt - Word Count: 17154 
## Film: chroniclesofnarniathelionthewitchandthewardrobe.txt - Word Count: 8212 
## Film: e.t..txt - Word Count: 17248 
## Film: findingnemo.txt - Word Count: 12264 
## Film: happyfeet.txt - Word Count: 14648 
## Film: kungfupanda.txt - Word Count: 14792 
## Film: shrek.txt - Word Count: 12977 
## Film: toystory.txt - Word Count: 21426 
## Film: up.txt - Word Count: 17145 
## Film: walle.txt - Word Count: 17298

#Average word count for Family films
average_count <- mean(unlist(word_counts))
cat("Average word count for Family films:", average_count, "\n")
#The 10 counts printed above average to 15316.4

#Gather word count for War films
wordcount <- function(file_path_for_War) {
  text <- readLines(file_path_for_War, warn = FALSE)
  text <- paste(text, collapse = " ")
  words <- strsplit(text, "\\s+")[[1]]
  return(length(words))
}
file_paths_for_War <- c(
  "/Users/jordanthomas/Desktop/War/braveheart.txt",
  "/Users/jordanthomas/Desktop/War/inglouriousbasterds.txt",
  "/Users/jordanthomas/Desktop/War/lastsamuraithe.txt",
  "/Users/jordanthomas/Desktop/War/patriotthe.txt",
  "/Users/jordanthomas/Desktop/War/pearlharbor.txt",
  "/Users/jordanthomas/Desktop/War/savingprivateryan.txt",
  "/Users/jordanthomas/Desktop/War/schindlerslist.txt",
  "/Users/jordanthomas/Desktop/War/tropicthunder.txt",
  "/Users/jordanthomas/Desktop/War/valkyrie.txt",
  "/Users/jordanthomas/Desktop/War/warhorse.txt"
  )
word_counts <- list()

for (file_path_for_War in file_paths_for_War) {
  count <- wordcount(file_path_for_War)
  film_name <- basename(file_path_for_War)
  word_counts[film_name] <- count
}

for (film_name in names(word_counts)) {
  cat("Film:", film_name, "- Word Count:", word_counts[[film_name]], "\n")
}

## Film: braveheart.txt - Word Count: 27450 
## Film: inglouriousbasterds.txt - Word Count: 33210 
## Film: lastsamuraithe.txt - Word Count: 28748 
## Film: patriotthe.txt - Word Count: 30311 
## Film: pearlharbor.txt - Word Count: 33249 
## Film: savingprivateryan.txt - Word Count: 22206 
## Film: schindlerslist.txt - Word Count: 29909 
## Film: tropicthunder.txt - Word Count: 21346 
## Film: valkyrie.txt - Word Count: 24456 
## Film: warhorse.txt - Word Count: 24928

#Average word count for War films
average_count <- mean(unlist(word_counts))
cat("Average word count for War films:", average_count, "\n")
#The 10 counts printed above average to 27581.3

Keep in mind: this is not the most accurate method of getting a definitive word count for these scripts. Character names and other non-dialogue information remain in the text, and that skews the numbers. But this code will suffice for this particular demonstration.
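
For anyone curious how the count could be tightened, here is a rough Python sketch. It rests on an assumption about script formatting that this lab has not verified for the corpus: that character cues and scene headings are written entirely in capital letters. The function name `dialogue_word_count` and the sample lines are hypothetical.

```python
def dialogue_word_count(lines):
    """Count whitespace-separated words, skipping all-uppercase lines."""
    total = 0
    for line in lines:
        stripped = line.strip()
        # Assumption: character cues and scene headings are written in
        # all capitals, so they are not counted as dialogue.
        if not stripped or stripped.isupper():
            continue
        total += len(stripped.split())
    return total

sample_script = [
    "INT. KITCHEN - DAY",
    "MARY",
    "Please stop. I need to find the door.",
    "",
    "GEORGE",
    "Wait, let me help.",
]
print(dialogue_word_count(sample_script))  # 12
```

Action lines written in ordinary sentence case would still be counted, so this narrows the gap rather than closing it.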

We can already see that the War genre has a far higher average word count across its 10 films’ scripts… but let’s see a visual representation of this difference!

Next, we will create two bar charts using ggplot2.

library(ggplot2)

#FAMILY word count
Family_data <- read.csv("/Users/jordanthomas/Desktop/Family.csv")

bar_plot <- ggplot(Family_data, aes(x = Movie, y = Count, fill = Movie)) +
  geom_bar(stat = "identity") +
  labs(title = "Dialogue count for Family genre", x = "Movie", y = "Word Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(bar_plot)

#WAR word count
War_data <- read.csv("/Users/jordanthomas/Desktop/War.csv")

bar_plot <- ggplot(War_data, aes(x = Movie, y = Count, fill = Movie)) +
  geom_bar(stat = "identity") +
  labs(title = "Dialogue count for War genre", x = "Movie", y = "Word Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(bar_plot)
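
The plotting code above reads Family.csv and War.csv and expects the columns Movie and Count. The lab does not show how those files were made, so here is one possible Python sketch using the built-in `csv` module. The function name `write_counts_csv`, the movie labels, and the output path are all assumptions for illustration.

```python
import csv

def write_counts_csv(word_counts, csv_path):
    """Write {movie: count} pairs to a CSV with Movie and Count columns."""
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Movie", "Count"])  # header names used by aes()
        for movie, count in word_counts.items():
            writer.writerow([movie, count])

# Two of the Family counts printed earlier in this lab
family_counts = {"aladdin": 17154, "toystory": 21426}
# write_counts_csv(family_counts, "/path/to/Family.csv")  # hypothetical path
```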

A bar plot is most useful when we want to show the relationship between two variables: a numeric variable versus a categorical variable. In our case, the numeric variable is the word count for each movie script. The categorical variable is the top 10 films in each genre. We can also use the reorder function inside aes() to order the bars by word count.

After looking at these two bar plots, we can see that Pearl Harbor (2001), which has the highest word count of the War films, has a script that is over 55% longer than that of Toy Story (1995), which has the highest word count of the Family films. Looking at raw numbers may be confusing for some… and that is why visualizations are a must when depicting a narrative, whether that narrative is one that you are creating for an academic paper, a lecture, or a screenplay!
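
The percentage in the comparison above is simple arithmetic on the word counts printed earlier. A short Python sketch of that calculation follows; the function name `percent_longer` is hypothetical.

```python
def percent_longer(longer, shorter):
    """Return how much larger `longer` is than `shorter`, in percent."""
    return (longer - shorter) / shorter * 100

pearl_harbor = 33249  # word count printed for pearlharbor.txt
toy_story = 21426     # word count printed for toystory.txt
print(round(percent_longer(pearl_harbor, toy_story), 1))  # 55.2
```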

These are just a few ways that we can create our own visualizations in R to further our understanding of screenwriting!

Jordan Thomas

Mon Dec 18 2023