Hey there! Welcome to another R project of mine. This time I’m biting off a bit of a larger chunk than usual: a pretty clean dataset of every line of dialogue from the entire South Park show, from season 1 in 1997 up through the recent specials in 2021. That’s about 95k lines of dialogue across 24 seasons. I’m going to run some text analytics on it and have some fun in the spirit of learning with silly data.

Why South Park? Besides being a fan of the show, I know that most episodes are written and aired within the same week, so they stay pretty current on events and pop culture. The show can be read as a kind of social history book: we can explore the rise and fall of themes and events through the years just by looking at it. We can also have fun with the crazier and raunchier sides of the show.

The rough flow of this project will be: importing & cleaning the data, simple word counts, sentiment analysis, then topic modelling.

Anyhow, let’s see what we have. I used a lot of packages here: tidyverse, knitr, textmineR, kableExtra, tm, topicmodels, reshape2, pals, SnowballC, lda, ldatuning, flextable, sentimentr, lexicon, magrittr, DT, wordcloud2, tidytext, stringr.
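
For reference, loading them all up front looks roughly like this (I’m assuming the usual CRAN capitalizations, e.g. kableExtra and SnowballC):

#loading every package used in this project
library(tidyverse)   # dplyr, ggplot2, stringr, etc.
library(knitr)
library(kableExtra)
library(textmineR)
library(tm)
library(topicmodels)
library(reshape2)
library(pals)
library(SnowballC)
library(lda)
library(ldatuning)
library(flextable)
library(sentimentr)
library(lexicon)
library(magrittr)
library(DT)
library(wordcloud2)
library(tidytext)
library(stringr)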

And with that…

South Park

So here is where I’m going to import two datasets locally: lines_sp, which has all the lines connected to episodes and who said them, and episodes_sp, which has all the episodes connected to the seasons and when they aired.

Real simple: we join the two on episode title to make masterdata_sp, which now has every line rolled up to a season and air date. I also get rid of the character “Scene Description” because it was messing up results. The resulting dataframe is sp_text; I keep masterdata_sp untouched so that when I mess up I can just reload it.

I included a head() call, which shows the first six rows of the dataset.

#loading the two datasets, lines and episodes
lines_sp <- read.csv(file = "SouthPark_Lines.csv")
episodes_sp <-read.csv(file = "SouthPark_Episodes.csv")

#joining the two datasets
masterdata_sp <- left_join(lines_sp,episodes_sp, by ="Title")
sp_text<-masterdata_sp
sp_text<-dplyr::mutate(sp_text, ID = row_number())
#removing scene descriptions
sp_text<-filter(sp_text, Character!="Scene Description")

head(sp_text)
##                        Title       Character
## 1 Cartman Gets an Anal Probe        The Boys
## 2 Cartman Gets an Anal Probe Kyle Broflovski
## 3 Cartman Gets an Anal Probe  Ike Broflovski
## 4 Cartman Gets an Anal Probe            Kyle
## 5 Cartman Gets an Anal Probe    Eric Cartman
## 6 Cartman Gets an Anal Probe            Kyle
##                                                                    Line
## 1                      School days, school days, teacher's golden ru...
## 2 Ah, damn it! My little brother's trying to follow me to school again.
## 3                                                           Eat banana.
## 4                Ike, you can't come to school with me. [Ike Chortles.]
## 5                                       Yeah, go home you little dildo!
## 6                                  Dude, don't call my brother a dildo!
##          Air.Date Code X.
## 1 August 13, 1997  101  1
## 2 August 13, 1997  101  1
## 3 August 13, 1997  101  1
## 4 August 13, 1997  101  1
## 5 August 13, 1997  101  1
## 6 August 13, 1997  101  1
##                                                                                                                             Description
## 1 While the boys are waiting for the school bus, Cartman explains the odd nightmare he had the previous night involving alien visitors.
## 2 While the boys are waiting for the school bus, Cartman explains the odd nightmare he had the previous night involving alien visitors.
## 3 While the boys are waiting for the school bus, Cartman explains the odd nightmare he had the previous night involving alien visitors.
## 4 While the boys are waiting for the school bus, Cartman explains the odd nightmare he had the previous night involving alien visitors.
## 5 While the boys are waiting for the school bus, Cartman explains the odd nightmare he had the previous night involving alien visitors.
## 6 While the boys are waiting for the school bus, Cartman explains the odd nightmare he had the previous night involving alien visitors.
##   Season Episode ID
## 1      1       1  2
## 2      1       1  3
## 3      1       1  4
## 4      1       1  5
## 5      1       1  6
## 6      1       1  7

Great, now to some actual fun stuff.

So first up is a stupid word cloud that every text analytics project has to include. Ignore it unless you find it pretty.

###STUPID WORD CLOUD THING###

#getting a count by word, transformed into a world cloud
sp_text %>%
  unnest_tokens(output = word, input = Line) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 500) %>%
  na.omit() %>% 
  wordcloud2(shape = "cardioid")
## Joining, by = "word"

Next we get more interesting with swear words. The show notoriously (and increasingly) uses words like ass, fuck, bitch, crap and shit, and used to get in trouble for it as a cartoon that preteens watch. Ever wonder what the most popular curse word on the show is and how often it’s said? Wonder no more: it’s “ass,” with almost 800 mentions. This doesn’t count extensions like “asses” (though see the quick sketch after the counts below for one way to include them).

#swear words counts
foulWordCounts <- sp_text %>%
  unnest_tokens(output = word, input = Line) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter( word %in% c("ass","fuck","bitch","crap", "shit")) %>%
  na.omit()
## Joining, by = "word"
foulWordCounts
##    word   n
## 1   ass 793
## 2  fuck 714
## 3  shit 620
## 4 bitch 488
## 5  crap 423
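
If you did want to count extensions like “asses” or “fucking,” one rough way (a sketch, not what I actually ran above) is to match on word variants with a regular expression and roll them up to their base word:

#hedged sketch: roll plural/verb forms up to their base swear word
sp_text %>%
  unnest_tokens(output = word, input = Line) %>%
  filter(str_detect(word, "^(ass(es)?|fuck(s|ed|ing)?|bitch(es)?|crap(s)?|shit(s|ty)?)$")) %>%
  mutate(base = str_extract(word, "^(ass|fuck|bitch|crap|shit)")) %>%
  count(base, sort = TRUE)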

So bear with me on this next piece. Family Guy creator Seth MacFarlane has said that gay jokes in his show decreased over time as the show matured and as the humor of the era used the gay demographic less and less as a punchline. Well, I wanted to see if that trend held true for South Park. This line chart looks at mentions of the word “gay” over time, and we see a downward slope with two big spikes in seasons 13 and 18, which probably have pretty specific episodes to explain them. Note that I wanted to capture jokes and insults and not just the overall theme, so I didn’t include “gays”; having watched the show, the context is more “that’s so gay” or “you’re gay.”

#"gay" over time
wordOverTime<-sp_text %>%
  group_by(Season) %>%
  unnest_tokens(output = word, input = Line) %>% 
  anti_join(stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(word == "gay") %>%  # unnest_tokens lowercases tokens, so this also covers "Gay"
  na.omit() %>%
  arrange(Season)
## Joining, by = "word"
#plotting "gay" over time
plot(wordOverTime$Season, wordOverTime$n, type = "l")

#word count of "gay"
sum(wordOverTime$n)
## [1] 482
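
(If you prefer the ggplot look used elsewhere in this post, the same wordOverTime data can be drawn like this; it’s just an alternative to the base plot above.)

#alternative ggplot version of the "gay" mentions by season
ggplot(wordOverTime, aes(x = Season, y = n)) +
  geom_line() +
  labs(x = "Season", y = "mentions of \"gay\"")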

Let’s kick off the training wheels and get into some data science with sentiment analysis. Sentiment analysis is pretty simple: how positive or negative is a statement or collection of words? If I say “I hate you,” it gets a negative score because of words like “hate,” and if I say “I love chicken and waffles,” it gets a positive score because of words like “love.” Now, I was a little lazy here in a moderate way; I’ll Venmo whoever figures it out $5. Hint: the sentiment scores have a low range because of this error. The range for sentiment is -1 to +1.
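
Just to see it in action before pointing it at 95k lines, here’s what that looks like on the two example sentences (a quick sanity check; one should come out negative and the other positive):

#quick sanity check of sentimentr on the two example sentences
sentiment(get_sentences(c("I hate you", "I love chicken and waffles")),
          polarity_dt = hash_sentiment_jockers)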

There’s a lot I can do with sentiment and this dataset, but the story is to find the darkest episode of the darkest season and the lightest episode of the lightest season. That’s it. I also do this by character.

###SENTIMENT ANALYSIS###

#getting sentiment by sentence
episodeSentiment = sentiment(get_sentences(sp_text$Line), 
                            polarity_dt = hash_sentiment_jockers)
#aggregating sentiment by line
episodeSentiment = episodeSentiment%>% 
  group_by(element_id) %>% 
  summarize(meanSentiment = mean(sentiment))
#adding an ID row to original dataset
sp_text<-dplyr::mutate(sp_text, ID = row_number())
#joining sentiment by line to original dataset
sp_text = left_join(sp_text, episodeSentiment, by = c("ID" = "element_id"))
#checking out sentiment by season, although season 24 is the darkest, it's only two episodes and still going
#setting up plot by season
spt_s<-sp_text %>% 
  group_by(Season) %>% 
  summarize(meanSentiment = mean(meanSentiment)) %>%
  arrange(desc(meanSentiment))
#plotting sentiment by season
ggplot(spt_s, aes(x=reorder(Season,-meanSentiment),y= meanSentiment, fill=meanSentiment))+
  geom_col()+
  scale_fill_viridis_c(option = "viridis")

The darkest season ends up being season 20, since season 24 was still going and consisted of just two specials when I pulled this data. The lightest season is 17. Diving in, the darkest episode is “The Damned” in S20. This makes sense: the episode revolves around internet trolling and even suicide. It’s pretty dark. In contrast, the lightest episode is “Ginger Cow” in S17. This makes sense too, as the episode includes an end to the war between Muslims, Jews, and Christians and even showcases a massive party in Jerusalem headlined by Van Halen. So that’s funny.

#creating an object to dive into the episodes
spt_e <-sp_text %>% 
  group_by(Code,Season, Title) %>% 
  summarize(meanSentiment = mean(meanSentiment))
## `summarise()` has grouped output by 'Code', 'Season'. You can override using the `.groups` argument.
#what's the most negative episode of the most negative season?
spt_e %>% filter(Season == "20") %>% arrange((meanSentiment))
## # A tibble: 10 x 4
## # Groups:   Code, Season [10]
##     Code Season Title                                  meanSentiment
##    <int>  <int> <chr>                                          <dbl>
##  1  2003     20 The Damned                                  -0.0267 
##  2  2009     20 Not Funny                                   -0.0229 
##  3  2006     20 Fort Collins                                -0.00392
##  4  2001     20 Member Berries                               0.00102
##  5  2010     20 The End of Serialization as We Know It       0.00394
##  6  2005     20 Douche and a Danish                          0.00464
##  7  2004     20 Wieners Out                                  0.00785
##  8  2002     20 Skank Hunt                                   0.0179 
##  9  2008     20 Members Only                                 0.0200 
## 10  2007     20 Oh, Jeez                                     0.0479
#what's the most positive episode of the most positive season?
spt_e %>% filter(Season == "17") %>% arrange(desc(meanSentiment)) 
## # A tibble: 10 x 4
## # Groups:   Code, Season [10]
##     Code Season Title                           meanSentiment
##    <int>  <int> <chr>                                   <dbl>
##  1  1706     17 Ginger Cow                            0.105  
##  2  1701     17 Let Go, Let Gov                       0.0663 
##  3  1709     17 Titties and Dragons                   0.0463 
##  4  1708     17 A Song of Ass and Fire                0.0457 
##  5  1705     17 Taming Strange                        0.0432 
##  6  1707     17 Black Friday                          0.0314 
##  7  1703     17 World War Zimmerman                   0.0122 
##  8  1710     17 The Hobbit                            0.0110 
##  9  1702     17 Informative Murder Porn               0.00219
## 10  1704     17 Goth Kids 3: Dawn of the Posers      -0.0231

Additionally, we look at characters, filtering for those with more than 2,000 lines. Butters, especially in early seasons, is a very aloof and positive character who rarely curses (“Oh hamburgers!”), whereas Kyle is the foil to Cartman and is known to complain a lot.

#checking out sentiment by character, keeping only characters with more than 2000 lines
spt_c<-sp_text %>% 
  group_by(Character)%>% 
  filter(n()>2000)%>%
  summarize(meanSentiment = mean(meanSentiment)) %>%
  arrange((meanSentiment))
#plotting sentiment of the most active characters
ggplot(spt_c, aes(x=reorder(Character,-meanSentiment),y= meanSentiment, fill=meanSentiment))+
  geom_col()+
  scale_fill_viridis_c(option = "viridis")

Alright, now on to topic modelling. I’ll admit, I only half know what I’m doing here. But that’s why I make these.

So rolling with the darkest season, we’re going to look at topics and themes for S20. The steps here are: filter the data to season 20, convert it to a simple ID & Line form, clean up the text (removing stopwords like “the”), set parameters for which words to consider in terms of frequency and length, explore the ideal number of topics, run the model with that number of topics, and finally examine the results.

On the cleanup side of things, stopwords like “a,” “the,” and “it” don’t really tell us anything useful, so we remove them. We need to create a corpus object to help with that, and also to set parameters for the topic model run. One of the important parameters is how many words we want to treat as a single term within a topic; the default is 1, for single words. I set it to a range of 1 to 3, so we can look at short phrases too. This is just one of the many levers you can pull in this analysis, and it drastically impacts both the results and the computational workload. This analysis originally took my laptop about 30 minutes to run.
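
To show what that 1-to-3 word setting actually produces, here’s a tiny illustration of the same n-gram idea (the real tokenizer, NLP_tokenizer, is defined in the chunk below) on a made-up line:

#illustration only: 1- to 3-word tokens for a single made-up line
library(NLP)
toks <- strsplit("oh my god they killed kenny", " ")[[1]]
unlist(lapply(ngrams(toks, 1:3), paste, collapse = "_"))
#yields unigrams like "oh", bigrams like "oh_my", and trigrams like "oh_my_god"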

###TOPIC MODELING### attempt #1

# grabbing dictionary of stop words that hopefully is better than the default one
english_stopwords <- readLines("https://slcladal.github.io/resources/stopwords_en.txt", encoding = "UTF-8")
# grabbing and converting sp_text columns to a format that the topic model algo can use
sp_tp <- sp_text %>%
  filter(Season==20)
sp_tp <-sp_tp[c('ID','Line')]
sp_tp <-rename(sp_tp, doc_id = ID, text = Line)
# creating a corpus object to clean
corpus <- VCorpus(DataframeSource(sp_tp))
# Preprocessing chain
processedCorpus <- tm_map(corpus, content_transformer(tolower))
processedCorpus <- tm_map(processedCorpus, removeWords, english_stopwords)
processedCorpus <- tm_map(processedCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)
processedCorpus <- tm_map(processedCorpus, removeNumbers)
processedCorpus <- tm_map(processedCorpus, stemDocument, language = "en")
processedCorpus <- tm_map(processedCorpus, stripWhitespace)
# so this is just one of a few ways to expand how many words I want to include in a topic, i.e. phrases. I chose 3 max
NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}
# this is probably redundant
control_list_ngram = list(tokenize = NLP_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE, 
                          stopwords = stopwords("english"), 
                          tolower = T, 
                          stemming = T, 
                          global = c(100, 1000),
                          weighting = function(x)
                            weightTf(x)
)
# compute document term matrix with terms >= minimumFrequency
DTM <- DocumentTermMatrix(processedCorpus, control = control_list_ngram)
# have a look at the number of documents and terms in the matrix
dim(DTM)
## [1]  2210 21409
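
Before tuning, it doesn’t hurt to eyeball what actually made it into the matrix. A quick sketch using tm’s findFreqTerms (the 100 threshold is just for illustration):

#peek at terms appearing at least 100 times in the season-20 DTM
head(findFreqTerms(DTM, lowfreq = 100), 20)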

Next we run a tuning step and set some more parameters, like how many candidate numbers of topics to try and evaluate. I want to try everything from 2 up to 20 topics. So I tell it to sort all the terms and phrases into two topics, then do it again with three topics, and so on until it hits 20. I also ask it to plot a couple of fit metrics (CaoJuan2009 and Deveaud2014) to evaluate which number of topics (K) performs best. Here’s the funny part: my results are terrible. The plot clearly labels whether each line should be maximized or minimized, and both of my lines go the opposite way of what would be considered “good.” The funnier part is that the results later still make sense. So we chose K = 11 as the number of topics, because why not.

# due to vocabulary pruning, we have empty rows in our DTM
# LDA does not like this. So we remove those docs from the
# DTM and the metadata
sel_idx <- slam::row_sums(DTM) > 0
DTM <- DTM[sel_idx, ]
sp_tp <- sp_tp[sel_idx, ]
# create models with different number of topics
result <- ldatuning::FindTopicsNumber(
  DTM,
  topics = seq(from = 2, to = 20, by = 1),
  metrics = c("CaoJuan2009",  "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  verbose = TRUE
)
## fit models... done.
## calculate metrics:
##   CaoJuan2009... done.
##   Deveaud2014... done.
# plotting the number of topics against how well they make sense
FindTopicsNumber_plot(result)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

# number of topics
K <- 11
# set random number generator seed
set.seed(9161)

Now that we have K we can run the model; note the various levers, like alpha, that can be changed to tune it. I don’t really know what they all do, but roughly, a lower alpha pushes each document toward fewer, more concentrated topics. As we proceed, we generate 11 topics and they’re not bad. We pull the top 10 terms from each, and a fan of the show could probably tell you what’s going on. Topic 3 refers to Kyle’s dad, Gerald, who (spoiler alert) is the internet troll and gets into trouble with the country of Denmark. Topic 8 is just insults and curses. Topic 6 has to do with Cartman and his girlfriend Heidi as they try to escape to Mars on Elon Musk’s SpaceX rocket, where they also run into Butters. Not bad, eh? We try to give a summary name to each of the topics and, at the end, look at how much of the season each topic accounts for (the mean topic proportions), where Topic 6 comes out on top. Those proportions sum to 1 across the 11 topics, so values between roughly 0.07 and 0.11 just mean the topics are spread fairly evenly.

# compute the LDA model, inference via 500 iterations of Gibbs sampling
topicModel <- LDA(DTM, K, method="Gibbs", control=list(iter = 500, verbose = 25, alpha = 0.1))
## K = 11; V = 21409; M = 2068
## Sampling 500 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Gibbs sampling completed!
# have a look a some of the results (posterior distributions)
tmResult <- posterior(topicModel)
#let's see what we got
topicmodels::terms(topicModel, 10)
##       Topic 1            Topic 2  Topic 3        Topic 4        Topic 5 
##  [1,] "troll"            "ike"    "troll"        "member"       "funni" 
##  [2,] "internet"         "door"   "peopl"        "yeah"         "girl"  
##  [3,] "danish"           "open"   "skankhunt"    "berri"        "guy"   
##  [4,] "happen"           "kyle"   "denmark"      "member_berri" "women" 
##  [5,] "histori"          "gerald" "gerald"       "hey"          "thing" 
##  [6,] "onlin"            "back"   "person"       "member_memb"  "make"  
##  [7,] "trolltrac"        "walk"   "social"       "yeah_memb"    "mar"   
##  [8,] "internet_histori" "board"  "media"        "love"         "vagina"
##  [9,] "denmark"          "room"   "social_media" "wow"          "gonna" 
## [10,] "freja"            "troll"  "speak"        "good"         "peopl" 
##       Topic 6   Topic 7   Topic 8 Topic 9    Topic 10   Topic 11       
##  [1,] "kyle"    "god"     "fuck"  "douch"    "make"     "gonna"        
##  [2,] "cartman" "take"    "yeah"  "quit"     "presid"   "anthem"       
##  [3,] "walk"    "comput"  "phone" "vote"     "gonna"    "stand"        
##  [4,] "heidi"   "car"     "bitch" "peopl"    "face"     "nation"       
##  [5,] "talk"    "move"    "dick"  "sandwich" "turn"     "sit"          
##  [6,] "eric"    "god_god" "mouth" "turd"     "work"     "nation_anthem"
##  [7,] "butter"  "mail"    "dude"  "elect"    "garrison" "abram"        
##  [8,] "stop"    "stop"    "guy"   "tjing"    "hand"     "show"         
##  [9,] "elon"    "death"   "gonna" "huh"      "line"     "bill"         
## [10,] "hey"     "govern"  "back"  "talk"     "walk"     "heidi"
# topics are probability distributions over the entire vocabulary
beta <- tmResult$terms   
# for every document we have a probability distribution of its contained topics
theta <- tmResult$topics 
# trying to fit a name to the topics as a summary
topicNames <- apply(lda::top.topic.words(beta, 5, by.score = T), 2, paste, collapse = " ")
topicNames
##                                                1 
## "troll internet danish histori internet_histori" 
##                                                2 
##                       "ike door open board room" 
##                                                3 
##            "troll peopl skankhunt social person" 
##                                                4 
##     "member berri yeah member_berri member_memb" 
##                                                5 
##                     "funni girl women guy thing" 
##                                                6 
##                   "kyle cartman walk eric heidi" 
##                                                7 
##                    "god take car comput god_god" 
##                                                8 
##                     "fuck yeah phone bitch dick" 
##                                                9 
##                  "douch quit vote sandwich turd" 
##                                               10 
##                "make presid face garrison gonna" 
##                                               11 
##        "anthem nation stand gonna nation_anthem"
# mean topic probabilities over all documents, i.e. each topic's share of season 20
topicProportions <- colSums(theta) / nDocs(DTM)  
# assign the topic names we created before
names(topicProportions) <- topicNames    
# taking a look at how the topics are distributed across the season
sort(topicProportions, decreasing = TRUE)
##                   kyle cartman walk eric heidi 
##                                     0.11006558 
##                     funni girl women guy thing 
##                                     0.10454930 
##                       ike door open board room 
##                                     0.10363370 
##                     fuck yeah phone bitch dick 
##                                     0.09519340 
##                make presid face garrison gonna 
##                                     0.09142821 
##            troll peopl skankhunt social person 
##                                     0.08886895 
##        anthem nation stand gonna nation_anthem 
##                                     0.08745880 
##     member berri yeah member_berri member_memb 
##                                     0.08646296 
##                  douch quit vote sandwich turd 
##                                     0.08336096 
##                    god take car comput god_god 
##                                     0.07695534 
## troll internet danish histori internet_histori 
##                                     0.07202282
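
If you do want an actual per-topic coherence score (as opposed to the proportions above), textmineR, which is already in the package list, has a probabilistic coherence function. A sketch, assuming the tm DTM converts cleanly to a sparse Matrix and that the term names in beta and the DTM line up:

#hedged sketch: probabilistic coherence per topic via textmineR
library(Matrix)
sparse_dtm <- sparseMatrix(i = DTM$i, j = DTM$j, x = DTM$v,
                           dims = dim(DTM), dimnames = dimnames(DTM))
coherence <- textmineR::CalcProbCoherence(phi = beta, dtm = sparse_dtm, M = 5)
sort(coherence, decreasing = TRUE)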

That’s all I have today. Let me know what you think and thanks for reading!