I have found some domain-specific lexicons, created from some of the most popular Reddit communities and made available by William L. Hamilton under the Apache License v2.0. https://github.com/williamleif/socialsent
We will be using Eric Cartman’s dialogue from seasons 1-19 of South Park to compare these lexicons. The data are available at https://github.com/BobAdamsEE/SouthParkData under the Creative Commons Attribution-ShareAlike 3.0 Unported license.
First things first: we load the data described above. The lexicons we will compare give the sentiment associated with the 5,000 most common words (minus stop words) on r/AskMen, r/AskWomen, r/Parenting, r/MensRights, and r/Anarcho_Capitalism.
library(dplyr)
library(tm)
library(tidytext)
cartman <- read.csv("https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-character/cartman.csv", header = FALSE)
cartman <- cartman$V1 %>%
  stringr::str_replace_all('\n', ' ')
# schema: <word> <average sentiment> <standard deviation>
askmen <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/AskMen.tsv",
  header = FALSE) %>%
  select(-V3))
askwomen <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/AskWomen.tsv",
  header = FALSE) %>%
  select(-V3))
parenting <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/Parenting.tsv",
  header = FALSE) %>%
  select(-V3))
mensrights <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/MensRights.tsv",
  header = FALSE) %>%
  select(-V3))
ancap <- tibble(read.delim(
  "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/Anarcho_Capitalism.tsv",
  header = FALSE) %>%
  select(-V3))
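Since all five lexicons share the same schema, the five read.delim() calls above could also be collapsed into one small helper. This is only a sketch (the read_lexicon() name is mine and is not used later); it reuses the same raw.githubusercontent.com paths:
# Sketch: one loader for all five lexicons (V1 = word, V2 = average sentiment).
read_lexicon <- function(name) {
  url <- paste0(
    "https://raw.githubusercontent.com/TheWerefriend/data607/master/week10/",
    name, ".tsv")
  read.delim(url, header = FALSE) %>%
    select(-V3) %>%
    tibble::as_tibble()
}
askmen     <- read_lexicon("AskMen")
askwomen   <- read_lexicon("AskWomen")
parenting  <- read_lexicon("Parenting")
mensrights <- read_lexicon("MensRights")
ancap      <- read_lexicon("Anarcho_Capitalism")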
Next, we create a corpus of Cartman’s language, transform it a bit, and remove stop words. Stemming seems unhelpful. Then we can make a document-term matrix…
NOTE: A regular expression along the lines of "?:\w*(?:’-)+|[’].?)+" should remove the endings of anything with an apostrophe, but no amount of backslash escaping would make it stop throwing errors, so we fall back on tm’s removePunctuation() instead.
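For what it’s worth, stripping the apostrophe endings from the raw text before building the corpus does seem workable with stringr; a minimal sketch (cartman_noapos is just an illustrative name, and the pipeline below still relies on removePunctuation()):
# Sketch only: drop each apostrophe and whatever follows it ("can't" -> "can"),
# applied to the raw text; not used in the pipeline below.
cartman_noapos <- stringr::str_replace_all(cartman, "['’]\\w*", "")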
EC <- VCorpus(VectorSource(cartman)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("SMART")) %>%
  tm_map(removePunctuation, ucp = TRUE)
EMatrix <- DocumentTermMatrix(EC)
inspect(EMatrix)
## <<DocumentTermMatrix (documents: 9774, terms: 8585)>>
## Non-/sparse entries: 43486/83866304
## Sparsity : 100%
## Maximal term length: 31
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs butters dude god gonna guys hey kenny kyle people yeah
## 1484 0 0 0 0 0 0 0 0 1 0
## 1650 0 0 0 1 1 0 0 0 3 0
## 1831 0 0 0 0 0 0 0 0 0 0
## 2039 0 1 0 2 0 0 0 4 0 0
## 3086 0 0 0 2 0 0 0 4 0 0
## 3631 0 0 0 0 0 0 0 0 2 0
## 6240 0 0 0 1 1 0 0 0 1 0
## 7386 0 0 0 3 0 0 0 0 0 4
## 7419 0 0 0 0 0 0 2 0 0 0
## 8233 0 0 0 0 0 0 0 0 0 0
Using different stopword lists affects the outcome considerably. Compare
[1] "eh got free free face life ahead board captain climb aboard search tomorrow every shore try oh lord try carry maymaynemay maymaynemay maymaynemay maymaynemay gathering angels appeared head sang us song hope said come sail away come sail away come sail away lads come sail away come sail away come sail away thought angels surprise climbed starship headed skies come sail away come sail away come sail away lads come sail away come sail away come sail away lads"
with
EC$content[4616][[1]]$content
## [1] "eh free free face life ahead board captain climb aboard search tomorrow shore lord carry maymaynemay maymaynemay maymaynemay maymaynemay gathering angels appeared head sang song hope sail sail sail lads sail sail sail thought angels surprise climbed starship headed skies sail sail sail lads sail sail sail lads"
(He’s just singing “Come Sail Away” by Styx.)
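The first quote clearly retains more words. For comparison, here is a sketch of the same pipeline with tm’s default English stopword list; which list actually produced that first quote isn’t recorded here, so "en" is an assumption:
# Sketch: same pipeline as above, with the default English stopword list
# instead of SMART, then the same lyric inspected again.
EC_en <- VCorpus(VectorSource(cartman)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(removePunctuation, ucp = TRUE)
EC_en$content[4616][[1]]$content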
Which subreddit communities have the most positive/negative associations toward Eric Cartman’s language?
We weight Cartman’s vocabulary by word count and apply each lexicon’s sentiment scores to it.
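That weighting can be expressed with tidytext by tidying the document-term matrix and joining it to a lexicon. A minimal sketch for one lexicon (cartmanWords and ancapWeighted are illustrative names only); the per-subreddit scores reported below are computed by the blocks that follow:
# Sketch: total count of each term Cartman uses, joined to the ancap lexicon,
# then summed as count-weighted sentiment. tidy(EMatrix) gives one row per
# (document, term, count); V1/V2 are the lexicon's word and average sentiment.
cartmanWords <- tidy(EMatrix) %>%
  group_by(term) %>%
  summarize(n = sum(count))
ancapWeighted <- cartmanWords %>%
  inner_join(ancap, by = c("term" = "V1")) %>%
  summarize(sentiment = sum(n * V2)) %>%
  mutate(method = "ancap")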
ancapScore <- ancap %>%
  inner_join(ancap) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "ancap")
## Joining, by = c("V1", "V2")
parentingScore <- parenting %>%
  inner_join(parenting) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "parenting")
## Joining, by = c("V1", "V2")
askmenScore <- askmen %>%
  inner_join(askmen) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "askmen")
## Joining, by = c("V1", "V2")
askwomenScore <- askwomen %>%
  inner_join(askwomen) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "askwomen")
## Joining, by = c("V1", "V2")
mensrightsScore <- mensrights %>%
  inner_join(mensrights) %>%
  group_by(index = 'V1') %>%
  summarize(sentiment = sum(V2)) %>%
  mutate(method = "mensrights")
## Joining, by = c("V1", "V2")
rbind(mensrightsScore, ancapScore, askmenScore, askwomenScore, parentingScore)[2:3]
## # A tibble: 5 x 2
## sentiment method
## <dbl> <chr>
## 1 0.0100 mensrights
## 2 0.220 ancap
## 3 -0.100 askmen
## 4 0.390 askwomen
## 5 -0.310 parenting
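For a quick visual comparison, the combined tibble could also be plotted; a minimal sketch, assuming ggplot2 is installed:
# Sketch: bar chart of the per-lexicon sentiment scores from the table above.
library(ggplot2)
scores <- rbind(mensrightsScore, ancapScore, askmenScore,
                askwomenScore, parentingScore)
ggplot(scores, aes(x = method, y = sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  labs(x = "subreddit lexicon", y = "sentiment score")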
Apparently, Eric Cartman’s vocabulary sits best with r/AskWomen, with r/Anarcho_Capitalism in second place. r/MensRights is essentially neutral on his speech, r/AskMen is mildly negative, and r/Parenting hates him the most. Who knew?