Introduction

The goal of this analysis is to use sentiment analysis techniques in R to examine anti-trans Facebook comments across various news platforms. Rather than analyzing each comment as a whole, we analyze the individual words within it. While the context and relationships between words in a sentence are crucial, this word-level method allows for a quick “snapshot” view of the sentiment of these comments.

Structure

The original dataset has one comment per row; for our analysis, however, each word of a comment will have its own row. This is called “unnesting”. Additionally, we will remove stop words from the dataset. These are words like “as”, “it”, and “the” that appear frequently in text but do not provide any valuable information. More about stop words here: https://www.opinosis-analytics.com/knowledge-base/stop-words-explained/#.YfBtvVhKjt0
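Below is a minimal sketch of this step using the tidytext package. The names of the raw comments data frame and its id/text columns (comments, comment_id, comment_text) are assumptions; column_label and media_source match the example table below.

library(dplyr)
library(tidytext)

words <- comments %>%
  select(comment_id, column_label, media_source, comment_text) %>%  # comment_id / comment_text are assumed names
  unnest_tokens(word, comment_text) %>%    # "unnesting": one word per row
  anti_join(stop_words, by = "word")       # drop stop words like "as", "it", "the"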

An example of the dataset we’re working with is here:

column_label media_source word
nbc1521 NBC didn’t
nbc1521 NBC read
nbc1521 NBC article
nbc1521 NBC headline
nbc1521 NBC told

Positive/Negative Sentiment

We can analyze the amount of positive/negative sentiment in the comments for a given news website. Drawing from ‘AFINN’, a set of ~2,000 words with assigned polarity scores ranging from -5 (most negative) to 5 (most positive), we can aggregate the scores of the words that appear in our comments to calculate an overall sentiment score per comment, and then an average score for each news website.

(More info about AFINN here: https://www.geeksforgeeks.org/python-sentiment-analysis-using-affin/#:~:text=Afinn%20is%20the%20simplest%20yet,built%20function%20for%20this%20lexicon.)

Here’s an example of some words in the AFINN set:
word value
abandon -2
abandoned -2
abandons -2
abducted -2
abduction -2
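A minimal sketch of this scoring with tidytext and dplyr, reusing the unnested words table from above; the per-comment identifier (comment_id) is an assumed column, and the object names are ours.

library(dplyr)
library(tidytext)

afinn <- get_sentiments("afinn")                       # columns: word, value

comment_sentiments <- words %>%
  inner_join(afinn, by = "word") %>%
  group_by(comment_id, media_source) %>%               # comment_id is assumed
  summarise(sentiment = sum(value), .groups = "drop")  # one score per comment

comment_sentiments %>%
  group_by(media_source) %>%
  summarise(total = mean(sentiment)) %>%               # average score per news source
  arrange(desc(total))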

Here is the ranking of all the media sources from most to least positive (average sentiment score per comment):
media_source total
The Economist 0.9629630
Washington Post 0.6231884
Jezebel 0.5232558
BBC 0.2150538
NYT 0.1891892
Slate.com -0.0660377
Reuters -0.0735294
AP -0.0797546
HuffPost -0.0873016
CNN -0.0951374
CBS -0.1104651
Politico -0.1451613
CNBC -0.1521739
Fox News -0.1879562
Daily Wire -0.1909263
Huff Post -0.2391304
Palmer Report -0.2631579
New York Post -0.2671395
WSJ -0.3879004
The Hill -0.4246154
The Guardian -0.4658385
USA today -0.4689655
Stars and Stripes -0.6035242
ABC News -0.6053571
Fox News -0.6601942
NPR -0.6666667
Washington Times -0.7378277
Breitbart -0.8062016
Washington Examiner -0.8571429
PBS -1.0481928
Daily Beast -1.1928934
NBC -1.2413793

Here are the most frequent positive/negative words from The Economist and NBC, respectively.

The Economist:
word n val
save 3 2
advantages 2 2
fair 2 2
miracle 2 4
stronger 2 2
unfair 2 -2
winning 2 4
banned 1 -2
beloved 1 3
broke 1 -1

NBC:
word n val
killed 7 -3
stop 6 -1
hate 4 -3
hurting 4 -2
killing 4 -3
stronger 4 2
violence 4 -3
accidental 3 -2
crap 3 -3
death 3 -2

Here is the positivity/negativity by source bias categorization. The second table shows the breakdown of source bias categorization across all articles studied. Since there are only 7 articles labelled as “orange”, it would be unwise to draw conclusions about that category. Articles from “green” sources are slightly more negative than articles from “yellow” sources.
source_bias total n
Orange -0.1979522 586
NA -0.2767329 1861
Yellow -0.3686736 2292
Green -0.4621005 2190
## # A tibble: 4 × 2
## # Groups:   source_bias [4]
##   source_bias     n
##   <chr>       <int>
## 1 Yellow         37
## 2 Green          30
## 3 <NA>           24
## 4 Orange          7
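A hedged sketch of how these two tables might be built, reusing comment_sentiments from earlier; it assumes the original comments data frame carries a source_bias column.

# average sentiment and number of comments per bias category
comment_sentiments %>%
  left_join(distinct(comments, comment_id, source_bias), by = "comment_id") %>%
  group_by(source_bias) %>%
  summarise(total = mean(sentiment), n = n()) %>%
  arrange(desc(total))

# number of articles per bias category
comments %>%
  distinct(column_label, source_bias) %>%
  count(source_bias, sort = TRUE)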

Correlation between sentiment + responses to comment?

Is there a relationship between the strength of positive/negative sentiment and the number of “reactions”, likes, or replies a comment receives? Based on the correlation tests below, no.
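The tests below can be reproduced with cor.test() (Pearson by default); the sent_vs_reacts data frame and its column names are taken from the output, but its construction is not shown here.

cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$total_reactions)
cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$number_of_replies_to_comment)
cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$like_reacts)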

## 
##  Pearson's product-moment correlation
## 
## data:  sent_vs_reacts$sentiment and sent_vs_reacts$total_reactions
## t = 0.45531, df = 1473, p-value = 0.649
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03920206  0.06286501
## sample estimates:
##        cor 
## 0.01186237
## 
##  Pearson's product-moment correlation
## 
## data:  sent_vs_reacts$sentiment and sent_vs_reacts$number_of_replies_to_comment
## t = 0.84488, df = 1473, p-value = 0.3983
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02906485  0.07296723
## sample estimates:
##       cor 
## 0.0220085
## 
##  Pearson's product-moment correlation
## 
## data:  sent_vs_reacts$sentiment and sent_vs_reacts$like_reacts
## t = 0.031771, df = 2123, p-value = 0.9747
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04183366  0.04321025
## sample estimates:
##          cor 
## 0.0006895433

Other types of sentiment

Apart from the AFINN set of words, there is NRC, a lexicon of about 13,000 words along with the emotions associated with them: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust. A word can be associated with more than one emotion.

Using the same process for positive/negative sentiment, we can find the overarching “emotion” expressed in comments.

For more information about NRC: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
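A minimal sketch of the NRC join, again using the unnested words table; the construction of all_sent_words (the object used in the lookup code later) is a guess on our part. NRC also tags plain positive/negative, so those rows are filtered out here to keep only the eight emotions.

library(tidytext)

nrc <- get_sentiments("nrc")                          # columns: word, sentiment

all_sent_words <- words %>%
  inner_join(nrc, by = "word") %>%                    # a word can match several emotion rows
  filter(!sentiment %in% c("positive", "negative"))   # keep only the eight emotions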

Here’s an example of the words listed in NRC:
word sentiment
abacus trust
abandon fear
abandon negative
abandon sadness
abandoned anger
abandoned fear
abandoned negative
abandoned sadness
abandonment anger
abandonment fear

We can find the top emotion expressed per article; the first 10 rows are shown:
article emotion percent
ABC1618 anger 0.2367865
ABC31121 trust 0.1742424
ABC31221 anger 0.1777778
ABC3721 trust 0.1843318
ABC52621 trust 0.1881188
ABC6121 trust 0.1990050
AP102520 trust 0.1911765
AP12120 anticipation 0.2319588
AP42721 trust 0.2500000
AP43121 fear 0.2324324

Note: The percentage represents the number of words associated with the specific emotion divided by the total number of emotion words identified by the NRC lexicon. It does not represent the total number of words in a comment.
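A sketch of that calculation (object names are ours); swapping column_label for media_source and averaging across articles gives the per-source table that follows.

top_emotion_per_article <- all_sent_words %>%
  count(column_label, sentiment) %>%
  group_by(column_label) %>%
  mutate(percent = n / sum(n)) %>%        # share of the article's NRC-matched emotion words
  slice_max(percent, n = 1) %>%
  ungroup() %>%
  select(article = column_label, emotion = sentiment, percent)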

The same can be done across media sources:

media_source emotion avg_percent
ABC News trust 0.1747331
AP trust 0.1921672
BBC trust 0.1967213
Breitbart fear 0.1791872
CBS anticipation 0.1754267
CNBC fear 0.2083744
CNN fear 0.2002517
Daily Beast fear 0.2008817
Daily Wire trust 0.2073240
Fox News fear 0.2361111
Fox News trust 0.2045601
Huff Post trust 0.2334630
HuffPost trust 0.1854545
Jezebel trust 0.2635417
NBC trust 0.1903244
New York Post trust 0.2043473
NPR fear 0.1724138
NYT trust 0.2314815
Palmer Report trust 0.2592593
PBS trust 0.1902017
Politico trust 0.2079646
Reuters trust 0.2290503
Slate.com joy 0.1697568
Stars and Stripes fear 0.2375491
The Economist trust 0.2117647
The Guardian trust 0.1883851
The Hill trust 0.1776031
USA today sadness 0.1612732
Washington Examiner fear 0.1781788
Washington Post sadness 0.1804511
Washington Times trust 0.2052631
WSJ fear 0.1768743

It seems that trust and fear are the most common emotions. Here are examples of common trust and fear words:

Trust:
word n
god 134
serve 127
sex 113
medical 96
school 83
real 82
law 80
governor 73
pay 67
hope 63
Fear:
word n
military 254
god 134
change 133
government 121
medical 96
prison 88
surgery 85
hate 83
hell 52
fight 49

Note that the “trust” emotion doesn’t necessarily mean that these comments express trust; more likely, the comments center on certain topics that are sensitive or important.

We can look at the top words for a specific emotion for a given media source. Here are the top “sadness” words from CNN comments. Code provided in a later section allows this kind of article/media source + emotion lookup.
word n
hate 12
unfair 7
attacking 3
discrimination 3
bigoted 2
disabled 2
doubt 2
hurt 2
mother 2
bad 1

Figure: amount of fear words per media source.

Topic vs sentiment

Here is a table showing the most common emotion associated with comments on articles of various topics.

## # A tibble: 7 × 3
## # Groups:   topic_of_article [7]
##   topic_of_article            emotion avg_emotion
##   <chr>                       <chr>         <dbl>
## 1 1 = Transgender Athletes    trust         0.207
## 2 2 = Transgender Bathroom    fear          0.200
## 3 3 = Transgender in Military fear          0.232
## 4 4 = Transgender Adolescents trust         0.193
## 5 5 = Transgender Inmates     fear          0.227
## 6 6 = Other                   trust         0.178
## 7 7: Pageants                 trust         0.184

Usable functions for your enjoyment…

Finding most frequent words used per article

# Most frequent words in the comments under a single article
find_top_words <- function(article){
  article_words <- words %>%
    filter(column_label == article)

  cloud_words <- article_words %>%
    count(word)

  return(cloud_words %>% arrange(desc(n)))
}

# Example, limit 10 words
find_top_words("CNN12521") %>% head(10) %>% kbl() %>% kable_paper("hover", full_width = F)
word n
country 18
people 16
serve 15
military 10
protect 8
american 5
biden 5
care 5
equal 5
trans 5

And the same for a media source:

# Most frequent words in the comments across an entire media source
find_top_words_media <- function(source){
  article_words <- words %>%
    filter(media_source == source)

  cloud_words <- article_words %>%
    count(word)

  return(cloud_words %>% arrange(desc(n)))
}

# Example, limit 10 words
find_top_words_media("BBC") %>% head(10) %>% kbl() %>% kable_paper("hover", full_width = F)
word n
sports 18
trans 13
women 11
girls 10
fair 9
mississippi 9
people 8
compete 7
it’s 7
sense 7

Code to look through sentiment words

all_sent_words %>%
  filter(media_source == "NBC") %>%    # edit this line to change the media source
  filter(sentiment == "fear") %>%      # edit this line to change the emotion
  count(word) %>%
  arrange(desc(n)) %>%
  head(10)                             # remove head(10) to see all words
##          word n
## 1        hate 4
## 2     hurting 4
## 3     killing 4
## 4    violence 4
## 5  accidental 3
## 6       death 3
## 7        fear 3
## 8        hell 3
## 9      murder 3
## 10    caution 2
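The same lookup can also be wrapped in a small function in the style of find_top_words above (the name find_emotion_words is ours):

find_emotion_words <- function(source, emotion){
  all_sent_words %>%
    filter(media_source == source) %>%
    filter(sentiment == emotion) %>%
    count(word) %>%
    arrange(desc(n))
}

# Example, limit 10 words
find_emotion_words("NBC", "fear") %>% head(10)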

Brief Introduction to Latent Dirichlet Allocation (LDA)

LDA is a topic modeling method that aims to find natural groupings of topics within a collection of texts. It is considered “unsupervised” machine learning: the algorithm is not given any known outcomes and instead infers and constructs patterns on its own.

In LDA, you must first specify the number of topics to group the text into. LDA then estimates the probability of each word belonging to each topic. For example, word1 could have a 0.6 probability of being in topic 2, a 0.3 probability of being in topic 1, and a 0.1 probability of being in topic 3. Based on this information, documents (in this case, the set of Facebook comments under an article) are classified as belonging to a certain topic based on the words they contain. For more information, check out this article: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2

After specifying 6 topics, we can take a look at a subset of words within each topic that have a high probability of belonging to it (i.e., we are more confident that these words belong to this specific topic rather than another one). It is interesting to see that the 6 topics identified by the LDA algorithm match closely with the groups identified through qualitative analysis. Topic 1 seems to be related to bathroom policies, topic 5 seems to be about transgender prison inmates, etc.
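Here is a hedged sketch of how such a model might be fit with the topicmodels package, reusing the per-article word counts; the object names and the seed are ours.

library(dplyr)
library(tidytext)
library(topicmodels)

comments_dtm <- words %>%
  count(column_label, word) %>%
  cast_dtm(column_label, word, n)            # document-term matrix, one document per article

comments_lda <- LDA(comments_dtm, k = 6, control = list(seed = 1234))

# per-topic word probabilities (beta): high-probability words within each topic
tidy(comments_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10)

# per-document topic probabilities (gamma): the most likely topic for each article
tidy(comments_lda, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1)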

Each document (the collection of comments under one article) can also be assigned topic probabilities. Here we show a selection of articles and the topic number with the highest probability for each. For example, ABC1618 is most likely to be within topic 5, which is hypothesized to be related to prison inmates. The title of this article is “Transgender inmate seeks rare transfer to female prison”, so this classification is very accurate.

## # A tibble: 5 × 3
## # Groups:   document [5]
##   document topic gamma
##   <chr>    <int> <dbl>
## 1 ABC1618      3 1.00 
## 2 ABC31121     3 0.560
## 3 ABC31221     3 1.00 
## 4 ABC3721      3 1.00 
## 5 ABC52621     3 0.838

This method is very useful for finding potential topics within a large volume of text that is hard to parse manually.