The beloved YouTube channel ERB (Epic Rap Battles of History), run by our favourite artists/content creators Nice Peter and EpicLLOYD, is hugely famous, with over 14.8 million subscribers.
Whether we’re listening to their newest release or revisiting some of our favourite tracks, at the end of every video we are given the option to cast our vote for which contestant we think won in their colourful exchange of verses in the comment section of the video.
As somebody who can find themselves absorbed in not only the YouTube videos but the comment section too, I’ve always wanted to see if I could capture the ‘public’ sentiment about who won some of my favourite rap battles.
In this document, I present a brief outline of who YOU (or at least, the YouTube public) declared the winner of one of my favourite duels of disparagement: The Joker vs Pennywise.
## comments
## 1195 The lines in this were the most hardcore savag...
## 1196 Joker.
## 1197 Joker
## 1198 Ask robin
## 1199 pennywise
First off, there are over 93,000 comments on this video. I have ~1,200. So ultimately this mini-investigation is hamstrung by only considering a sliver of the total comments.
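# Imports used throughout (the nltk 'punkt' and 'stopwords' data must be downloaded once)
import matplotlib.pyplot as plt
import numpy as np
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Make the pandas dataframe into a series of strings, one per comment.
word_series = df.astype(str).apply(lambda x: ' is '.join(x), axis=1)
# Take every element in the series and append them into a single string for
# decomposition and tokenisation
everything = []
for i in word_series:
    everything.append(i)
words = word_tokenize(str(everything))
#print(len(words))
# 20,222 words in total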
# Keep only purely alphabetic tokens, lowercased
words_no_punc = []
for w in words:
    if w.isalpha():
        words_no_punc.append(w.lower())
#print(len(words_no_punc))
# 13,658 words after filtering out punctuation
# Use a new name so the nltk `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))
clean_words = []
for w in words_no_punc:
    if w not in stop_words:
        clean_words.append(w)
#print(len(clean_words))
# 7,413 words after filtering out punctuation and stopwords
fdist = FreqDist(clean_words)
#fdist.most_common(10)
plt.tight_layout()
fig=fdist.plot(15)
fdist.most_common(15)
## [('joker', 355), ('pennywise', 167), ('like', 118), ('one', 99), ('vs', 87), ('ledger', 67), ('rap', 67), ('heath', 66), ('line', 65), ('bars', 64), ('robin', 60), ('drop', 56), ('batman', 56), ('battle', 54), ('play', 51)]
Just inspecting the distribution of words used throughout the comments (after filtering out punctuation and stopwords), it’s reassuring that ‘joker’ and ‘pennywise’ are the two most frequent words used.
To contend with whether the commenter wrote ‘Joker’ or ‘joker’ (i.e. capitalised the first letter or not), I simply lowercased every word.
At a glance, it looks like the Joker is mentioned approximately twice as often as Pennywise. Let’s see about tallying votes.
Returning to the comments I have, I wish to tabulate which ones can be counted as votes for either the Joker or Pennywise.
For each contestant, I will consider comments which feature only their name, with no other context. For example:
JK_words = ['joker','Joker','The joker.','The Joker','the joker','the Joker','Joker.']
PW_words = ['Pennywise','pennywise','Penny wise','penny wise']
# Retrieve any row whose comment is exactly one of the JK words
JK_rows = []
for word in JK_words:
    tmp = df[df['comments'] == word].index.values
    # Skip spellings with no matches (check the size: .any() is False for index 0)
    if tmp.size == 0:
        continue
    JK_rows.append(tmp)
JK_rows = np.concatenate(JK_rows, axis=0)
# Do the same for Pennywise
PW_rows = []
for word in PW_words:
    tmp = df[df['comments'] == word].index.values
    if tmp.size == 0:
        continue
    PW_rows.append(tmp)
PW_rows = np.concatenate(PW_rows, axis=0)
df.iloc[JK_rows,[0]]
## comments
## 1197 Joker
## 1196 Joker.
df.iloc[PW_rows,[0]]
## comments
## 1141 Pennywise
## 1149 Pennywise
## 1151 Pennywise
## 1153 Pennywise
## 1154 Pennywise
## 1156 Pennywise
## 1157 Pennywise
## 1158 Pennywise
## 1159 Pennywise
## 1161 Pennywise
## 1162 Pennywise
## 1189 pennywise
## 1191 pennywise
## 1194 pennywise
## 1199 pennywise
## 1172 Penny wise
## 1178 Penny wise
Hmm, well. So far it’s not looking good for The Joker, with only 2 votes, while Pennywise takes an early lead with 17 votes.
It would be naive of me to only count comments like this, however. A lot of viewers in the comments cast their verdicts in a more expressive way, so we’ve got more rows to tally for each contestant.
Now we’ll look for comments which state more explicitly who they thought won. For both The Joker and Pennywise, we’ll look for any string which contains the contestant’s name (‘Joker’ or ‘Pennywise’, with ‘IT’ also accepted for Pennywise) followed by ‘won’, ‘wins’ or ‘winner’.
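As an aside, the hand-written lists of spellings could be swapped for a small normalisation step. The sketch below is not what I ran for the tallies above; `name_only_vote` and `NAME_ONLY` are hypothetical names, and it works on plain strings rather than the dataframe. It lowercases the comment, drops anything that is not a letter or a space, and ignores a leading ‘the’ before comparing against the contestant names.

```python
import re

# Hypothetical lookup: normalised comment text -> contestant
NAME_ONLY = {
    "joker": "joker",
    "pennywise": "pennywise",
    "penny wise": "pennywise",
}

def name_only_vote(comment):
    """Return 'joker', 'pennywise', or None for a name-only comment."""
    # Lowercase, then drop everything except letters and whitespace
    text = re.sub(r"[^a-z\s]", "", comment.lower())
    # Collapse runs of whitespace and trim the ends
    text = " ".join(text.split())
    # Ignore a leading "the" so "The Joker." and "joker" are treated alike
    if text.startswith("the "):
        text = text[4:]
    return NAME_ONLY.get(text)

sample = ["Joker.", "The Joker", "Penny wise", "pennywise", "Ask robin"]
votes = [name_only_vote(c) for c in sample]
print(votes)
```

Because punctuation, digits and emoji are stripped before matching, this would accept a few more comments than the exact-match lists do, so the counts could differ slightly from the ones reported here.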
# Flag every comment that contains an explicit "Joker won/wins/winner" statement
Joker_mask = df['comments'].astype(str).str.contains('Joker won|Joker wins|Joker winner')
Joker_comments = df.loc[Joker_mask, ['comments']]
print(Joker_comments)
## comments
## 39 "You giggling sewer ginger"\nYeah, the Joker won.
## 277 Joker won the moment It mentioned Bat's
## 376 I’d have to say Joker won~~!
## 453 Joker won that by miles
## 532 Joker won for me you can’t beat the clown Prin...
## 534 Honestly Joker won
## 542 Of course The Joker won. It was never even a c...
## 543 I was kinda hoping Gwynplaine--the character w...
## 572 Joker won, but Pennywise had a point with the ...
## 603 Joker won with his insane wordplay.
## 654 Joker won he threatened him A LOT and I want t...
## 699 Joker won easily
## 703 Joker won by a long shot
## 712 I think Joker won because he's so good and muc...
## 734 Joker wins by a long shot
## 736 Joker won. Let's get master Roshi vs Master S...
## 743 Joker won he killed it
## 751 Joker won a 100%
## 773 Joker won this one
## 811 Joker won!
## 814 Joker won obvs lol
## 817 Joker won!
## 827 Joker won that shitttt
## 838 Joker won this
## 839 Joker won easy
## 852 Joker won!!!
## 857 Joker wins!!!
## 871 Joker wins because Pennywise loses when afraid...
## 973 Joker won.
## 1016 Joker wins
## 1019 Joker won
## 1028 Joker won
## 1032 Joker won
## 1036 Joker won
## 1037 Joker won
## 1039 Joker won
## 1041 Joker won
## 1042 Joker won
## 1051 Joker won
## 1054 Joker won
## 1058 Joker wins game over
## 1064 Joker wins
## 1072 Joker wins
## 1081 Joker wins
## 1083 Joker won I think
## 1086 Joker wins
## 1089 Joker wins
## 1094 Joker wins
## 1101 Joker wins
## 1117 Joker wins
# Same again for Pennywise, also accepting "IT" as the character's name
PW_mask = df['comments'].astype(str).str.contains('Pennywise won|Pennywise wins|Pennywise winner|IT won|IT wins|IT winner')
PW_comments = df.loc[PW_mask, ['comments']]
#PW_comments = df.iloc[np.append(PW_comments,PW_rows),[0]]
print(PW_comments)
## comments
## 66 ugh so much to this I loved everything about i...
## 151 Pennywise won with that Heath Ledger line
## 251 Pennywise won, but he's going to have to limp ...
## 292 The Edgar Allen Poe reference was amazing... P...
## 359 Nah. Pennywise won this one. Well. Idk. They ...
## 443 Pennywise won it with the heath ledger verse
## 528 I think Pennywise won this one!
## 634 Well I don’t know who to pick but I do know wh...
## 753 I love the Joker But Pennywise won
## 764 Pennywise wins. Totes.
## 769 Pennywise won by a mile
## 813 Pennywise won lol
## 894 Pennywise won
## 899 Pennywise wins
## 914 Pennywise wins
## 917 Pennywise wins
## 921 Pennywise won
## 922 Pennywise won
## 923 Pennywise won
## 927 Pennywise won
## 931 Pennywise won
## 932 Pennywise won
## 933 Pennywise won
## 934 Pennywise won
## 938 Pennywise won
## 939 Pennywise won
## 944 Pennywise won
## 947 Pennywise won
## 952 Oh btw IT won😂
## 953 Pennywise won
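The two searches above are case-sensitive and are run separately, so a comment such as ‘joker wins’ would be missed and a comment naming both contestants would be counted twice. Below is a minimal sketch of one way to handle both, assuming plain strings as input. The names `explicit_vote`, `JOKER_RE` and `PENNYWISE_RE` are my own, and the last three sample comments are made up for illustration. ‘IT’ is deliberately kept case-sensitive, since a lowercase ‘it won’ is usually just a pronoun.

```python
import re
from collections import Counter

# Contestant name followed by won/wins/winner, ignoring case
JOKER_RE = re.compile(r"\bjoker\s+(?:won|wins|winner)\b", re.IGNORECASE)
# "Pennywise"/"Penny wise" in any case, or "IT" in capitals only
PENNYWISE_RE = re.compile(
    r"\b(?:(?i:penny\s?wise)|IT)\s+(?i:won|wins|winner)\b"
)

def explicit_vote(comment):
    """Classify a comment as 'joker', 'pennywise', 'ambiguous', or None."""
    joker = bool(JOKER_RE.search(comment))
    pennywise = bool(PENNYWISE_RE.search(comment))
    if joker and pennywise:
        return "ambiguous"  # names both as winners, so set aside for manual review
    if joker:
        return "joker"
    if pennywise:
        return "pennywise"
    return None

sample = [
    "Joker won that by miles",
    "I love the Joker But Pennywise won",
    "Oh btw IT won😂",
    "joker WINS",                                           # hypothetical
    "Honestly it won me over",                              # hypothetical
    "Joker won the first verse but Pennywise won overall",  # hypothetical
]
labels = [explicit_vote(c) for c in sample]
print(Counter(labels))
```

Setting the ‘ambiguous’ comments aside rather than assigning them keeps the two tallies from overlapping.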
Let’s visualise the combined results from the two approaches to tallying votes.
## <BarContainer object of 2 artists>
## Text(0.5, 1.0, 'ERB: The Joker vs Pennywise')
## Text(0.5, 0, 'Votes')
Oh, that’s close. From the ~1,200 comments in my dataset, the final tally for the contestants was:
Pennywise: 47
The Joker: 52
This was a very close call. In fact, considering I could only scrape ~1,200 of the 93,000 comments (~1.3% of the total), I would not feel confident declaring either contestant the overall winner or loser based on this small subset.
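To put a rough number on that hesitation, here is a back-of-the-envelope check using the normal approximation to a binomial proportion. It assumes the 99 counted votes behave like a random sample of all voters, which scraped comments almost certainly do not, so treat it as indicative only.

```python
import math

joker_votes, pennywise_votes = 52, 47
n = joker_votes + pennywise_votes

# Share of counted votes that went to The Joker
p = joker_votes / n
# Standard error of a proportion under the normal approximation
se = math.sqrt(p * (1 - p) / n)
# Approximate 95% interval (z = 1.96)
z = 1.96
lower, upper = p - z * se, p + z * se

print(f"Joker share: {p:.3f}, approx. 95% interval: {lower:.3f} to {upper:.3f}")
```

The interval runs from roughly 0.43 to 0.62, which comfortably includes 0.5, so a 52 to 47 split is entirely compatible with an even contest.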
Next time I dabble with this kind of analysis, I’d like to use YouTube’s API to retrieve comments from the video so I can analyse a more representative sample. Secondly, I would like to record the username for each comment to make sure I am not counting votes from the same user multiple times. Lastly, I am interested to investigate whether the votes for any contestant change over time (cough, Donald Trump, cough).
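For the username idea, a minimal sketch of the de-duplication step might look like the following. Everything here is hypothetical: the usernames, dates and the `latest_vote_per_user` helper are invented for illustration, and I am assuming each comment has already been reduced to a (username, date, vote) record with ISO-formatted dates.

```python
from collections import Counter

# Hypothetical records: (username, ISO date, vote)
records = [
    ("user_a", "2019-10-01", "joker"),
    ("user_b", "2019-10-02", "pennywise"),
    ("user_a", "2019-10-05", "pennywise"),  # user_a changes their mind
    ("user_c", "2019-10-03", "joker"),
    ("user_c", "2019-10-04", "joker"),      # user_c repeats their vote
]

def latest_vote_per_user(records):
    """Keep only each user's most recent vote."""
    latest = {}
    # ISO dates sort chronologically as strings, so later votes overwrite earlier ones
    for user, date, vote in sorted(records, key=lambda r: r[1]):
        latest[user] = vote
    return latest

tally = Counter(latest_vote_per_user(records).values())
print(tally)
```

Keeping the dates around also leaves the door open to the third idea, since the same records could be grouped by month to see how the split moves over time.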