Introduction

For my project, I decided to analyze Anthony Burgess’ novel “A Clockwork Orange” and Stanley Kubrick’s film adaptation of it. The two main differences between them are that one is a novel and the other is a screenplay, and that the film excludes the last chapter of the book. The 21st and final chapter (which was left out of both the film and the American version of the book) ends with Alex abandoning his unscrupulous lifestyle and reaching a state of redemption. This ending was thought of as unconvincing, and the chapter was later removed. Without that chapter, Kubrick’s film ends with Alex receiving treatment to undo the Ludovico technique, which restores him to his old mischievous, violent lifestyle with no repercussions. The opposing endings leave the reader and the audience with different takeaways about society, violence, morality, and humanity in general, all themes the book speaks to. The movie was thought of as the darker version, both because the redemptive last chapter was removed and because it is littered with crude depictions of violence and rape. In my analysis, I hope to dissect the text and understand how the sentiment of the book and the film differs as a result of these nuances.

Data

To conduct my sentiment analysis, I used the AFINN lexicon to score the sentiment of the words in A Clockwork Orange. Our textbook, “Text Mining with R” by Julia Silge and David Robinson, describes it as follows: “The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.”

Finding a data set for the A Clockwork Orange book took some digging because the book is banned and, as a result, was not available on Project Gutenberg. Instead, I found the text files through Google. The text file for the book came from https://vk.com/wall244814709_739, and the text file for the film’s screenplay came from https://www.scriptslug.com/script/a-clockwork-orange-1971.

When comparing frequencies between the book and the film, it is important to note the large difference in overall word count. In total, the data set for the book contains 5,579 words, whereas the film’s screenplay contains 2,788 words.
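
Because the two texts differ so much in length, raw counts are hard to compare directly. One possible way to handle this, sketched below, is to convert each word count into a share of all words in its text. The sample lines and the helper name `relative_freq` are made up for illustration and are not part of the original analysis.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-ins for the book and screenplay (one row per line of text).
book_lines <- tibble(text = c("the night was dark and cold",
                              "a dark street and a bad fight"))
film_lines <- tibble(text = c("a dark room", "a bad fight"))

# Hypothetical helper: tokenize, count each word, and express the count
# as a share of all tokens in that text.
relative_freq <- function(lines) {
  lines %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE) %>%
    mutate(share = n / sum(n))
}

book_freq <- relative_freq(book_lines)
film_freq <- relative_freq(film_lines)

# Total token counts, comparable to the word totals quoted above.
sum(book_freq$n)
sum(film_freq$n)
```

With shares instead of counts, a word that appears twice in a short screenplay can be compared fairly with a word that appears twice in a much longer book.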

In addition, it is important to note that A Clockwork Orange is written in a fictional slang called Nadsat. Anthony Burgess, the author of A Clockwork Orange, made up this language specifically for this story. The terms of this language are defined here. Throughout this analysis, I kept in mind that the use of an invented language may skew my results, as Nadsat words do not appear in the AFINN lexicon and are therefore left out of the sentiment scores.
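
To get a rough sense of how much text goes unscored because of Nadsat, one could check which counted words have no match in the lexicon. The sketch below uses a tiny made-up lexicon and made-up counts (`toy_lexicon`, `toy_counts`) in place of the real AFINN data and the real book, so the numbers are purely illustrative.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the AFINN lexicon (same column names: word, value).
toy_lexicon <- tibble(word  = c("horrible", "funny", "bastard"),
                      value = c(-3, 4, -5))

# Hypothetical word counts, mixing Nadsat terms with ordinary English words.
toy_counts <- tibble(word = c("horrorshow", "droog", "funny", "horrible", "veck"),
                     n    = c(40, 30, 9, 12, 25))

# Words that receive no sentiment score because the lexicon lacks them.
unscored <- toy_counts %>%
  anti_join(toy_lexicon, by = "word")

# Share of word occurrences that the lexicon can actually score.
coverage <- 1 - sum(unscored$n) / sum(toy_counts$n)
unscored
coverage
```

A low coverage value would suggest that the sentiment scores describe only a small part of the text.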

Predictions

I predict that the screenplay, although shorter in actual words, will have a more negative overall sentiment than the book. In his adaptation of “A Clockwork Orange,” Kubrick instigates discomfort in the audience by adding details that heighten the violence depicted and by sending an overall message that humans are evil by nature. I am aware that comparing a screenplay and a book creates difficulties, as they are not written in the same form. In my analysis, I will take this factor into account as I come to a conclusion about whether the film or the book is darker.

Project Structure

I divided up the code for the book and the film, analyzing the book first and then the film, and I followed the same structure for both. Each part starts with a word cloud of the top words and then moves into the sentiments, where I created visuals of the overall, most positive, and most negative sentiments of the top words in each data set. I conclude my analysis by comparing the sentiments of the book and the film. Lastly, to wrap up the project, I summarize my findings qualitatively and come to a conclusion about my predictions.

First, let’s load the necessary packages and read in the text of the book:

library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(textdata)
library(ggplot2)
library(ggthemes)
library(wordcloud2)
library(readr)
acwo <- read_csv("~/Desktop/acwo.txt", col_names = FALSE, show_col_types = FALSE)
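
One caveat about the import step: `read_csv()` treats commas as column separators, so lines of prose that contain commas may be split or trigger parsing issues. A line-based reader avoids this. The sketch below is an alternative under that assumption, not the method used for the results in this report; it writes a small temporary file so that it runs on its own, and the object name `acwo_alt` is hypothetical.

```r
library(readr)
library(tibble)

# Small temporary file standing in for the real text file.
tmp <- tempfile(fileext = ".txt")
write_lines(c("first line, with a comma", "second line"), tmp)

# read_lines() returns one string per line and ignores commas entirely.
# The column is named X1 so later steps such as unnest_tokens(word, X1)
# would work unchanged.
acwo_alt <- tibble(X1 = read_lines(tmp))
acwo_alt
```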

Then I tidied the data, creating a data set in which the text is split into one word per row.

acwo %>% 
  unnest_tokens(word, X1) -> acwo_words

Next, I created a data set of the 50 most frequent words in the text and visualized it as a word cloud.

acwo_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word %in% c("hke", "ing")) %>% 
  head(50) -> acwo_top_words 
## Joining, by = "word"
acwo_top_words %>% 
wordcloud2()

This word cloud shows the most common words in the book, with “brothers” being the most common at 238 instances. Alex, the main character and narrator of the book, often uses the word “brothers” to address the readers.

Next, I moved into analyzing the sentiment of the text. I joined the word counts with the AFINN lexicon so that each matching word has a value on a scale from 5 to -5, where 5 marks the most positive words and -5 marks the most negative words.

acwo_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  inner_join(get_sentiments("afinn")) -> acwo_sentiment
## Joining, by = "word"
## Joining, by = "word"

I decided to graph the overall sentiment of the most frequent words before going on to graph the most negative and most positive words. This way we can visualize the proportion of negative to positive words and gain a better understanding of the overall sentiment of the book.

acwo_sentiment %>% 
  head(10) %>% 
  ggplot(aes(reorder(word, n), value, fill = value)) + geom_col() + 
  ggtitle("Overall Sentiment Analysis of A Clockwork Orange Book ") +
  xlab("Words") +
  ylab("Value") + 
  theme_stata()

Of the 10 words, seven are negative. Based on this visual of the most frequently used words, we can infer that the book has an overall negative sentiment. Notice that two of the three positive words, which are also the two most frequent words in the plot, are valued at 3, a solidly positive rating.
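
The plot only covers ten words, so the same question could also be asked of every scored word. The sketch below shows one way to compute the share of positive and negative word occurrences, weighted by frequency. It uses a small made-up table (`toy_sentiment`) with the same columns as `acwo_sentiment` (word, n, value); the numbers are hypothetical and are not results from the book.

```r
library(dplyr)
library(tibble)

# Hypothetical scored word counts with the same columns as acwo_sentiment.
toy_sentiment <- tibble(
  word  = c("bad", "fight", "funny", "kill", "like"),
  n     = c(20, 15, 9, 6, 30),
  value = c(-3, -1, 4, -3, 2)
)

# Label each word by polarity, add up occurrences per polarity,
# and express each total as a share of all scored occurrences.
polarity_share <- toy_sentiment %>%
  mutate(polarity = if_else(value < 0, "negative", "positive")) %>%
  group_by(polarity) %>%
  summarise(occurrences = sum(n), .groups = "drop") %>%
  mutate(share = occurrences / sum(occurrences))

polarity_share
```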

Next, I decided to get more specific and plot the most positive and most negative words by AFINN sentiment value. Positive first:

acwo_sentiment %>% 
  filter(!duplicated(word)) %>% 
  arrange(desc(value)) %>% 
  head(10) -> acwo_pos_sentiment


acwo_pos_sentiment %>% 
  ggplot(aes(reorder(word, n), n, fill= value)) + geom_col() + 
  ggtitle("Most Positive Words in A Clockwork Orange Book ") +
  xlab("Words") +
  ylab("Frequency") + 
  theme_stata()

We can see that one word was rated 5, the most positive rating. Notice how rarely the positive words occur: the most frequent one, “funny,” appears only nine times.

Now the negative:

# acwo_sentiment already has stop words removed and AFINN values attached,
# so the joins do not need to be repeated here.
acwo_sentiment %>% 
  filter(!duplicated(word)) %>% 
  arrange(value) %>% 
  head(10) -> acwo_neg_sentiment
acwo_neg_sentiment %>% 
  ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + 
  ggtitle("Most Negative Words in A Clockwork Orange Book ") +
  xlab("Words") +
  ylab("Frequency") + 
  theme_stata()

The negative words were more spread out than the positive words, with three words rated -5, four words rated -4, and three words rated -3. The three most frequent words were rated -3, meaning that the negative words occurring most frequently in the text weren’t extremely negative.

Now let’s do the same for the film. We will then compare the two and draw conclusions to test my predictions.

A Clockwork Orange Film Screenplay

First I cleaned the data.

library(tidyverse)
library(tidytext)
library(textdata)
library(ggthemes)
library(devtools)
## Loading required package: usethis
library(wordcloud2)
library(readr)
acwofilm <- read_csv("~/Desktop/acwofilm.txt", col_names = FALSE, show_col_types = FALSE)
## Warning: One or more parsing issues, see `problems()` for details
acwofilm %>% 
  unnest_tokens(word, X1) -> acwofilm_words

I then created a data set of the most frequent words in the film and built a word cloud to visualize it.

acwofilm_words %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(!word %in% c("int", "v.o")) %>% 
  head(50) -> acwofilm_top_words
## Joining, by = "word"
acwofilm_top_words %>% 
wordcloud2()

“Alex” was the most frequently used word in the film at 407 instances. The second most common word, “sir,” was far behind at 141 instances. This shows how central Alex’s character was to the film.

Now that we know what the most common words are, we can dive into the sentiment analysis of the screenplay. I started by joining the word counts with the AFINN lexicon and then plotted the sentiment values of the most frequent words.

acwofilm_words %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  inner_join(get_sentiments("afinn")) -> acwofilm_sentiment
## Joining, by = "word"
## Joining, by = "word"
acwofilm_sentiment %>% 
  head(10) %>% 
  ggplot(aes(reorder(word, n), value, fill = value)) + geom_col() + 
  ggtitle("Sentiment of The Most Common Words in A Clockwork Orange Film ") +
  xlab("Words") +
  ylab("Value") + 
  theme_stata()

As with the book, the sentiment of the most frequent words in the film is predominantly negative. The least common of the ten words is rated the most negative. The three most common words (on the right) include both positive and negative ratings. We can conclude that the sentiment of the film is generally negative, with positive words dulling the severity of its negative sentiment.

I then plotted the most positive words based on AFINN’s assigned sentiment values.

acwofilm_sentiment %>% 
  arrange(desc(value)) %>% 
  filter(!duplicated(word)) %>% 
  head(10) -> acwofilm_pos

acwofilm_pos %>% 
  ggplot(aes(reorder(word, n), n, fill = value)) + geom_col() + 
  ggtitle("Most Positive Words in A Clockwork Orange Film ") +
  xlab("Words") +
  ylab("Frequency") + 
  theme_stata()

The bars are scaled to the number of times each word occurred, and I filled them with color based on the sentiment value assigned to each word. The top words were rated 3 and 4, meaning that the positive sentiment of the film was higher than average, given that we are rating positive values on a scale of 0-5.

Now we will plot the most negative words:

acwofilm_sentiment %>% 
  arrange(value) %>% 
  filter(!duplicated(word)) %>% 
  head(10) -> acwofilm_neg

acwofilm_neg %>% 
ggplot(aes(reorder(word, n), n, fill = value)) + geom_col()+
  ggtitle("Most Negative Words in A Clockwork Orange Film") +
  xlab("Words") +
  ylab("Frequency") +
  theme_stata()

We can see that the negative words received strongly negative sentiment values. Two of the words, including the most frequently used one, “bastard,” received ratings of -5. The rest of the words were rated -4. Compared to the positive words, the negative words were more extreme. Because of this, we can conclude that the overall sentiment of the film would feel more negative than positive.

Conclusion

I was able to discover key differences between the film and the book that help explain the controversy around the famous novel and film, A Clockwork Orange.

Starting with an overall look at the most frequent words, I found that the three most frequent words in the book were “brothers,” “real,” and “veck” (which translates to person, man, or fellow), compared to “Alex,” “sir,” and “chief” in the film. We can draw from this that the book is clearly written from the perspective of Alex, who uses these words to speak to others, whereas in the film the narration appears to be external to Alex’s character, as his name is the word repeated most frequently.

After the initial visual of the most frequent words, I moved on to the sentiment analysis. The overall sentiment analysis broke the top ten words into positive and negative values assigned by the AFINN lexicon. Both the book and the film had three positive words among the top ten, and those words were rated the same positive values. The rest of the words were rated negative, which helps explain the overall negative sentiment of the A Clockwork Orange story in general. Further, we can conclude that the book had a more negative sentiment, as its positive words were less frequent relative to its negative words than was the case in the film.

I decided to further my sentiment analysis by looking at the positive and negative words individually. In terms of the positive words, it is important to note the difference in scales: the book’s values go up to 5, whereas the film’s only go up to 4. At first glance we could conclude that the film has an overall more positive sentiment. However, keeping the difference in scales in mind, the book’s positive words are more positive, as all of them are rated either 4 or 5, whereas the film’s positive words have values of 3 and 4. We can conclude that the book was more positive than the film on this measure. Keep in mind that this conclusion is skewed, as it does not take into account the negative sentiments, which could overpower the positive.

In terms of the negative words, the book was more negative than the film. I come to this conclusion primarily based on the frequency ranges of the two plots. The frequency scale for the book ranges from 0 to 55, whereas the film’s ranges from 0 to 7, meaning there are more instances of the negative words in the book. Although the book’s most frequent negative words were rated -3, its most negatively rated words, those receiving a -5, occurred more frequently than the words rated -5 in the film. With all of this in mind, we can conclude that the book was more negative than the film because, as we saw in the overall sentiment plots, the negative words in the book outweigh the positive words and show stronger sentiment values than those in the film. My prediction has been disproven by this analysis, as the negative sentiment of the book is stronger than the positive sentiment of the book as well as the positive and negative sentiments of the film.
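
The comparison of the two texts could also be summarized in a single number per text. One option, sketched below under stated assumptions, is a frequency-weighted mean of the AFINN values, which does not depend on how long each text is. The helper `weighted_sentiment` and both toy tables are hypothetical, and the values they produce are not the results of this project.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: mean sentiment value, weighting each word by how
# often it occurs. Expects columns word, n, value (like acwo_sentiment).
weighted_sentiment <- function(scored) {
  scored %>%
    summarise(mean_value = sum(n * value) / sum(n)) %>%
    pull(mean_value)
}

# Made-up scored counts standing in for the book and the film.
book_toy <- tibble(word  = c("bad", "funny", "kill"),
                   n     = c(50, 9, 10),
                   value = c(-3, 4, -3))
film_toy <- tibble(word  = c("bastard", "good", "pain"),
                   n     = c(7, 12, 5),
                   value = c(-5, 3, -2))

book_mean <- weighted_sentiment(book_toy)
film_mean <- weighted_sentiment(film_toy)
book_mean
film_mean
```

In the real analysis, the same helper could be applied to `acwo_sentiment` and `acwofilm_sentiment`; the text with the lower mean would be the more negative one on this measure.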