I have found some domain-specific lexicons, created from some of the most popular Reddit communities and made available by William L. Hamilton under the Apache License v2.0. https://github.com/williamleif/socialsent
We will be using Eric Cartman’s dialogue from seasons 1-19 of South Park to compare these lexicons. The data are available at https://github.com/BobAdamsEE/SouthParkData under the Creative Commons Attribution-ShareAlike 3.0 Unported license.
First things first: we load the data described above. The lexicons we will compare give the sentiment associated with the 5,000 most common words (minus stop words) on r/AskMen, r/AskWomen, r/Parenting, r/MensRights, and r/Anarcho_Capitalism.
library(dplyr)
library(tm)
library(tidytext)
cartman <- read.csv("https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-character/cartman.csv", header = FALSE)
cartman <- cartman$V1 %>%
  stringr::str_replace_all('\n', ' ')
# schema: <word> <average sentiment> <standard deviation>
askmen <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/AskMen.tsv",
  header = FALSE) %>%
  select(-V3))
askwomen <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/AskWomen.tsv",
  header = FALSE) %>%
  select(-V3))
parenting <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/Parenting.tsv",
  header = FALSE) %>%
  select(-V3))
mensrights <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/MensRights.tsv",
  header = FALSE) %>%
  select(-V3))
ancap <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/Anarcho_Capitalism.tsv",
  header = FALSE) %>%
  select(-V3))
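Since all five lexicons share the same schema, the five read.delim() calls above could also be collapsed into one small helper. This is only a sketch (the read_lexicon() name is mine and is not used later); it reuses the same raw.githubusercontent.com paths:
# Sketch: one loader for all five lexicons (V1 = word, V2 = average sentiment).
read_lexicon <- function(name) {
  url <- paste0(
    "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/",
    name, ".tsv")
  read.delim(url, header = FALSE) %>%
    select(-V3) %>%
    tibble::as_tibble()
}
askmen     <- read_lexicon("AskMen")
askwomen   <- read_lexicon("AskWomen")
parenting  <- read_lexicon("Parenting")
mensrights <- read_lexicon("MensRights")
ancap      <- read_lexicon("Anarcho_Capitalism")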
Next, we create a corpus of Cartman’s language, transform it a bit, and remove stop words. Stemming seems unhelpful. Then we can make a document-term matrix…
NOTE: A regular expression along the lines of "?:\w*(?:’-)+|[’].?)+" should remove the endings of anything with an apostrophe, but no amount of backslash escaping would make it stop throwing errors, so we fall back on tm’s removePunctuation() instead.
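For what it’s worth, stripping the apostrophe endings from the raw text before building the corpus does seem workable with stringr; a minimal sketch (cartman_noapos is just an illustrative name, and the pipeline below still relies on removePunctuation()):
# Sketch only: drop each apostrophe and whatever follows it ("can't" -> "can"),
# applied to the raw text; not used in the pipeline below.
cartman_noapos <- stringr::str_replace_all(cartman, "['’]\\w*", "")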
EC <- VCorpus(VectorSource(cartman)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("SMART")) %>%
  tm_map(removePunctuation, ucp = TRUE)
EMatrix <- DocumentTermMatrix(EC)
inspect(EMatrix)
## <<DocumentTermMatrix (documents: 9774, terms: 8585)>>
## Non-/sparse entries: 43486/83866304
## Sparsity : 100%
## Maximal term length: 31
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs butters dude god gonna guys hey kenny kyle people yeah
## 1484 0 0 0 0 0 0 0 0 1 0
## 1650 0 0 0 1 1 0 0 0 3 0
## 1831 0 0 0 0 0 0 0 0 0 0
## 2039 0 1 0 2 0 0 0 4 0 0
## 3086 0 0 0 2 0 0 0 4 0 0
## 3631 0 0 0 0 0 0 0 0 2 0
## 6240 0 0 0 1 1 0 0 0 1 0
## 7386 0 0 0 3 0 0 0 0 0 4
## 7419 0 0 0 0 0 0 2 0 0 0
## 8233 0 0 0 0 0 0 0 0 0 0
Using different stopword lists affects the outcome considerably. Compare
[1] "eh got free free face life ahead board captain climb aboard search tomorrow every shore try oh lord try carry maymaynemay maymaynemay maymaynemay maymaynemay gathering angels appeared head sang us song hope said come sail away come sail away come sail away lads come sail away come sail away come sail away thought angels surprise climbed starship headed skies come sail away come sail away come sail away lads come sail away come sail away come sail away lads"
with
EC$content[4616][[1]]$content
## [1] "eh free free face life ahead board captain climb aboard search tomorrow shore lord carry maymaynemay maymaynemay maymaynemay maymaynemay gathering angels appeared head sang song hope sail sail sail lads sail sail sail thought angels surprise climbed starship headed skies sail sail sail lads sail sail sail lads"
(He’s just singing “Come Sail Away” by Styx.)
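The first quote clearly retains more words. For comparison, here is a sketch of the same pipeline with tm’s default English stopword list; which list actually produced that first quote isn’t recorded here, so "en" is an assumption:
# Sketch: same pipeline as above, with the default English stopword list
# instead of SMART, then the same lyric inspected again.
EC_en <- VCorpus(VectorSource(cartman)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(removePunctuation, ucp = TRUE)
EC_en$content[4616][[1]]$content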
Which subreddit communities have the most positive/negative associations toward Eric Cartman’s language?
We weight Cartman’s vocabulary by word count and apply each lexicon’s sentiment scores to it.
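That weighting can be expressed with tidytext by tidying the document-term matrix and joining it to a lexicon. A minimal sketch for one lexicon (cartmanWords and ancapWeighted are illustrative names only); the per-subreddit scores reported below are computed by the blocks that follow:
# Sketch: total count of each term Cartman uses, joined to the ancap lexicon,
# then summed as count-weighted sentiment. tidy(EMatrix) gives one row per
# (document, term, count); V1/V2 are the lexicon's word and average sentiment.
cartmanWords <- tidy(EMatrix) %>%
  group_by(term) %>%
  summarize(n = sum(count))
ancapWeighted <- cartmanWords %>%
  inner_join(ancap, by = c("term" = "V1")) %>%
  summarize(sentiment = sum(n * V2)) %>%
  mutate(method = "ancap")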
ancapScore <- ancap %>%
  inner_join(ancap) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "ancap")
## Joining, by = c("V1", "V2")
parentingScore <- parenting %>%
  inner_join(parenting) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "parenting")
## Joining, by = c("V1", "V2")
askmenScore <- askmen %>%
  inner_join(askmen) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "askmen")
## Joining, by = c("V1", "V2")
askwomenScore <- askwomen %>%
  inner_join(askwomen) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "askwomen")
## Joining, by = c("V1", "V2")
mensrightsScore <- mensrights %>%
  inner_join(mensrights) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "mensrights")
## Joining, by = c("V1", "V2")
rbind(mensrightsScore, ancapScore, askmenScore, askwomenScore, parentingScore)[2:3]
## # A tibble: 5 x 2
## sentiment method
## <dbl> <chr>
## 1 0.0100 mensrights
## 2 0.220 ancap
## 3 -0.100 askmen
## 4 0.390 askwomen
## 5 -0.310 parenting
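For a quick visual comparison, the combined tibble could also be plotted; a minimal sketch, assuming ggplot2 is installed:
# Sketch: bar chart of the per-lexicon sentiment scores from the table above.
library(ggplot2)
scores <- rbind(mensrightsScore, ancapScore, askmenScore,
                askwomenScore, parentingScore)
ggplot(scores, aes(x = method, y = sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  labs(x = "subreddit lexicon", y = "sentiment score")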
Apparently, Eric Cartman’s vocabulary sits best with r/AskWomen, with r/Anarcho_Capitalism in second place. r/MensRights is essentially neutral on his speech, r/AskMen is mildly negative, and r/Parenting hates him the most. Who knew?