
ERB: The Joker vs Pennywise. Who (actually) Won?!

The beloved YouTube channel ERB (Epic Rap Battles of History), run by our favourite artists and content creators Nice Peter and EpicLLOYD, is hugely famous, with over 14.8 million subscribers.

Whether we’re listening to their newest release or revisiting some of our favourite tracks, at the end of every video we are invited to cast our vote in the comment section for whichever contestant we think won the colourful exchange of verses.

As somebody who can find themselves absorbed not only in the YouTube videos but in the comment section too, I’ve always wanted to see if I could capture the ‘public’ sentiment about who won some of my favourite rap battles.

In this document, I present a brief outline of who YOU (or at least, the YouTube public) declared the winner of one of my favourite duels of disparagement: The Joker vs Pennywise.

##                                                comments
## 1195  The lines in this were the most hardcore savag...
## 1196                                             Joker.
## 1197                                              Joker
## 1198                                          Ask robin
## 1199                                          pennywise

First off, there are over 93,000 comments on this video, and I have ~1,200. Ultimately, this mini-investigation is hamstrung by only considering a sliver of the total comments.

import matplotlib.pyplot as plt
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Turn each row of the pandas DataFrame into a single string
# (if there were several columns they would be joined with ' is ').
word_series = df.astype(str).apply(lambda x: ' is '.join(x), axis=1)

# Collect every comment and tokenise the lot in one pass.
# str() of the list is kept from the original run, since the counts below depend on it.
everything = list(word_series)
words = word_tokenize(str(everything))
#print(len(words))
# 20,222 tokens in total

# Keep alphabetic tokens only, lower-cased.
words_no_punc = [w.lower() for w in words if w.isalpha()]
#print(len(words_no_punc))
# 13,658 words after filtering out punctuation

# Use a separate name so the nltk stopwords module is not overwritten.
stop_words = set(stopwords.words('english'))
clean_words = [w for w in words_no_punc if w not in stop_words]
#print(len(clean_words))
# 7,413 words after filtering out punctuation and stopwords

fdist = FreqDist(clean_words)
#fdist.most_common(10)
plt.tight_layout()
fig=fdist.plot(15)

fdist.most_common(15)
## [('joker', 355), ('pennywise', 167), ('like', 118), ('one', 99), ('vs', 87), ('ledger', 67), ('rap', 67), ('heath', 66), ('line', 65), ('bars', 64), ('robin', 60), ('drop', 56), ('batman', 56), ('battle', 54), ('play', 51)]

Just inspecting the distribution of words used throughout the comments (after filtering out punctuation and stopwords), it’s reassuring that ‘joker’ and ‘pennywise’ are the two most frequent words.
To avoid contending with whether a commenter wrote ‘Joker’ or ‘joker’ (i.e. capitalised the first letter or not), I simply lower-cased every word.

At a glance, it looks like the Joker is mentioned approximately twice as often as Pennywise. Let’s see about tallying votes.

Returning to the comments, I want to work out which ones can be counted as votes for either The Joker or Pennywise.
For each contestant, I will first consider comments that consist of nothing but their name, with no other context, matching them against the following spellings:

import numpy as np

JK_words = ['joker','Joker','The joker.','The Joker','the joker','the Joker','Joker.']
PW_words = ['Pennywise','pennywise','Penny wise','penny wise']

# Collect the index labels of every row whose comment is exactly one of the JK words
JK_rows=[]
for word in JK_words:
    tmp=df[df['comments']==word].index.values
    # Skip spellings with no exact match; checking the size (rather than
    # .any()) also keeps a match that happens to sit at index 0
    if tmp.size == 0:
        continue
    JK_rows.append(tmp)

JK_rows = np.concatenate(JK_rows,axis=0)

# Do the same for Pennywise
PW_rows=[]
for word in PW_words:
    tmp=df[df['comments']==word].index.values
    if tmp.size == 0:
        continue
    PW_rows.append(tmp)

PW_rows = np.concatenate(PW_rows,axis=0)
# The collected values are index labels, so select with .loc
df.loc[JK_rows, ['comments']]
##      comments
## 1197    Joker
## 1196   Joker.
df.loc[PW_rows, ['comments']]
##         comments
## 1141   Pennywise
## 1149   Pennywise
## 1151   Pennywise
## 1153   Pennywise
## 1154   Pennywise
## 1156   Pennywise
## 1157   Pennywise
## 1158   Pennywise
## 1159   Pennywise
## 1161   Pennywise
## 1162   Pennywise
## 1189   pennywise
## 1191   pennywise
## 1194   pennywise
## 1199   pennywise
## 1172  Penny wise
## 1178  Penny wise

Hmm, well. So far it’s not looking good for The Joker, with only 2 votes, while Pennywise takes an early lead with 17.
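
Listing every spelling by hand is easy to get wrong, so one alternative is to normalise each comment before comparing it. The sketch below is only an illustration and is not the code behind the counts above: `normalise`, `name_only_votes` and the `NAME_TO_CONTESTANT` mapping are hypothetical helpers, and the sample comments are made up from the spellings already listed.

```python
import re
from collections import Counter

# Hypothetical mapping from a normalised comment to the contestant it votes for.
NAME_TO_CONTESTANT = {
    "joker": "The Joker",
    "the joker": "The Joker",
    "pennywise": "Pennywise",
    "penny wise": "Pennywise",
}

def normalise(comment):
    """Lower-case, turn every run of non-letters into one space, trim the ends."""
    letters_only = re.sub(r"[^a-z]+", " ", comment.lower())
    return letters_only.strip()

def name_only_votes(comments):
    """Count comments that consist of nothing but a contestant's name."""
    votes = Counter()
    for comment in comments:
        contestant = NAME_TO_CONTESTANT.get(normalise(comment))
        if contestant is not None:
            votes[contestant] += 1
    return votes

# Made-up sample: four name-only comments and one that is not a vote.
sample = ["Joker.", "The Joker", "pennywise", "Penny wise", "Ask robin"]
print(name_only_votes(sample))
```

With this approach ‘Joker.’, ‘JOKER!!!’ and ‘the joker’ all collapse to the same key, so the spelling lists do not need to anticipate every variant.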

It would be naive of me to count only comments like this, however. A lot of viewers cast their verdicts in a more expressive way, so we’ve got more rows to tally for each contestant.
Now we’ll look for comments that state more explicitly who they thought won. For both The Joker and Pennywise, we’ll look for any comment containing the contestant’s name followed by ‘won’, ‘wins’ or ‘winner’ (for Pennywise, ‘IT’ counts as a name too).

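# Flag every comment containing one of the explicit "Joker won" phrases (case-sensitive).
# df.apply passes each column in turn, so the lambda receives a column of comments.
Joker_comments = df.apply(lambda col: col.astype(str).str.contains('Joker won|Joker wins|Joker winner'))
# Row positions of the matching comments
Joker_comments = np.where(Joker_comments)[0]
Joker_comments = df.iloc[Joker_comments,[0]]
print(Joker_comments)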
##                                                comments
## 39    "You giggling sewer ginger"\nYeah, the Joker won.
## 277             Joker won the moment It mentioned Bat's
## 376                        I’d have to say Joker won~~!
## 453                             Joker won that by miles
## 532   Joker won for me you can’t beat the clown Prin...
## 534                                  Honestly Joker won
## 542   Of course The Joker won. It was never even a c...
## 543   I was kinda hoping Gwynplaine--the character w...
## 572   Joker won, but Pennywise had a point with the ...
## 603                 Joker won with his insane wordplay.
## 654   Joker won he threatened him A LOT and I want t...
## 699                                    Joker won easily
## 703                            Joker won by a long shot
## 712   I think Joker won because he's so good and muc...
## 734                           Joker wins by a long shot
## 736   Joker won.  Let's get master Roshi vs Master S...
## 743                              Joker won he killed it
## 751                                    Joker won a 100%
## 773                                  Joker won this one
## 811                                          Joker won!
## 814                                  Joker won obvs lol
## 817                                          Joker won!
## 827                              Joker won that shitttt
## 838                                      Joker won this
## 839                                      Joker won easy
## 852                                        Joker won!!!
## 857                                       Joker wins!!!
## 871   Joker wins because Pennywise loses when afraid...
## 973                                          Joker won.
## 1016                                         Joker wins
## 1019                                          Joker won
## 1028                                          Joker won
## 1032                                          Joker won
## 1036                                          Joker won
## 1037                                          Joker won
## 1039                                          Joker won
## 1041                                          Joker won
## 1042                                          Joker won
## 1051                                          Joker won
## 1054                                          Joker won
## 1058                               Joker wins game over
## 1064                                         Joker wins
## 1072                                         Joker wins
## 1081                                         Joker wins
## 1083                                  Joker won I think
## 1086                                         Joker wins
## 1089                                         Joker wins
## 1094                                         Joker wins
## 1101                                         Joker wins
## 1117                                         Joker wins
# Same approach for Pennywise, also accepting the character's other name, 'IT'
PW_comments = df.apply(lambda col: col.astype(str).str.contains('Pennywise won|Pennywise wins|Pennywise winner|IT won|IT wins|IT winner'))
# Row positions of the matching comments
PW_comments = np.where(PW_comments)[0]
PW_comments = df.iloc[PW_comments,[0]]
#PW_comments = df.iloc[np.append(PW_comments,PW_rows),[0]]
print(PW_comments)
##                                               comments
## 66   ugh so much to this I loved everything about i...
## 151          Pennywise won with that Heath Ledger line
## 251  Pennywise won, but he's going to have to limp ...
## 292  The Edgar Allen Poe reference was amazing... P...
## 359  Nah. Pennywise won this one.  Well. Idk. They ...
## 443       Pennywise won it with the heath ledger verse
## 528                    I think Pennywise won this one!
## 634  Well I don’t know who to pick but I do know wh...
## 753                 I love the Joker But Pennywise won
## 764                             Pennywise wins. Totes.
## 769                            Pennywise won by a mile
## 813                                  Pennywise won lol
## 894                                      Pennywise won
## 899                                     Pennywise wins
## 914                                     Pennywise wins
## 917                                     Pennywise wins
## 921                                      Pennywise won
## 922                                      Pennywise won
## 923                                      Pennywise won
## 927                                      Pennywise won
## 931                                      Pennywise won
## 932                                      Pennywise won
## 933                                      Pennywise won
## 934                                      Pennywise won
## 938                                      Pennywise won
## 939                                      Pennywise won
## 944                                      Pennywise won
## 947                                      Pennywise won
## 952                                     Oh btw IT won😂
## 953                                      Pennywise won
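
The patterns used here are case-sensitive substring matches, so a lower-case ‘joker won’ is missed, and ‘IT won’ would also match inside a longer phrase such as ‘WAIT won’. A hedged sketch of a stricter matcher follows. It uses word boundaries, ignores case for the full names only, and skips ‘won’t’. The function `classify_vote` and the example comments are hypothetical and were not part of the tally above.

```python
import re

# Verb part: 'won', 'wins' or 'winner' as a whole word, not followed by an
# apostrophe (so "won't" is not treated as "won").
VERB = r"(?:won|wins|winner)\b(?!['’])"

# Full names: ignore case and require whole words.
JOKER_RE = re.compile(r"\bjoker\s+" + VERB, re.IGNORECASE)
PENNYWISE_RE = re.compile(r"\bpenny\s?wise\s+" + VERB, re.IGNORECASE)
# 'IT' stays case-sensitive so that ordinary phrases like "it won" are not counted.
IT_RE = re.compile(r"\bIT\s+" + VERB)

def classify_vote(comment):
    """Return the set of contestants the comment explicitly declares the winner."""
    winners = set()
    if JOKER_RE.search(comment):
        winners.add("The Joker")
    if PENNYWISE_RE.search(comment) or IT_RE.search(comment):
        winners.add("Pennywise")
    return winners

# Made-up comments to exercise the patterns.
for text in ["joker WON easily", "Oh btw IT won😂", "WAIT won't you stay", "Penny wise wins"]:
    print(repr(text), "->", classify_vote(text))
```

Returning a set rather than a single name leaves room for comments that name both contestants, which can then be reviewed by hand instead of being silently counted for one side.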

Let’s visualise the combined results from the two approaches to tallying votes.

[Figure: bar chart titled ‘ERB: The Joker vs Pennywise’, with ‘Votes’ on the x-axis]

Oh, that’s close. Of the ~1,200 comments in my dataset, only 99 could be counted as votes. The final tally for the contestants was:
Pennywise: 47
The Joker: 52

This was a very close call. In fact, considering I could only scrape ~1,200 of the 93,000 comments (~1.3% of the total), I would not feel confident declaring either contestant the overall winner or loser based on this small subset.
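
To put a rough number on that uncertainty, the sketch below computes The Joker’s share of the counted votes (52 of 52 + 47) with a normal-approximation 95% confidence interval. It assumes the counted votes behave like a simple random sample of all voters, which a scrape of ~1,200 comments may well not be. The helper `vote_share_interval` is hypothetical.

```python
import math

def vote_share_interval(votes_for, votes_total, z=1.96):
    """Return (share, lower, upper) using the normal approximation to a binomial proportion."""
    share = votes_for / votes_total
    standard_error = math.sqrt(share * (1 - share) / votes_total)
    return share, share - z * standard_error, share + z * standard_error

# Final tally reported above.
joker_votes, pennywise_votes = 52, 47
share, lower, upper = vote_share_interval(joker_votes, joker_votes + pennywise_votes)
print(f"Joker share: {share:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Under those assumptions the interval runs from roughly 0.43 to 0.62, so it includes 0.5 and the data cannot separate the two contestants.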

Next time I dabble with this kind of analysis, I’d like to use YouTube’s API to retrieve comments from the video so that I can analyse a more representative sample. Secondly, I would like to record the username for each comment to make sure I am not counting votes from the same user multiple times. Lastly, I am interested in investigating whether the votes for any contestant change over time (cough, Donald Trump, cough).
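
As a sketch of the de-duplication idea, assuming each scraped comment came with a username (the current dataset has none), one could keep only each user’s most recent vote. The function `tally_unique_votes` is hypothetical, and the `(username, contestant)` records below are invented purely for illustration.

```python
from collections import Counter

def tally_unique_votes(records):
    """Count one vote per user, keeping the last vote seen for each username."""
    latest_vote = {}
    for username, contestant in records:
        latest_vote[username] = contestant  # a later vote overwrites an earlier one
    return Counter(latest_vote.values())

# Invented records, assumed to be in chronological order.
records = [
    ("user_a", "The Joker"),
    ("user_b", "Pennywise"),
    ("user_a", "The Joker"),   # duplicate vote from the same user
    ("user_c", "Pennywise"),
    ("user_b", "The Joker"),   # user_b changed their mind
]
print(tally_unique_votes(records))
```

Keeping the latest vote is one possible policy. Keeping the first vote, or discarding users who voted for both sides, would be equally easy to swap in.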