The goal of this analysis is to use sentiment analysis techniques in R to examine anti-trans Facebook comments across various news platforms. Rather than analyzing each comment as a whole, we analyze the individual words within it. While context and the relationships between words in a sentence are crucial, this word-level method allows for a quick “snapshot” view of the sentiment of these comments.
The original dataset has one comment per row; for our analysis, however, each word of a comment gets its own row. This is called “unnesting”. Additionally, we remove stop words from the dataset: words like “as”, “it”, and “the” that appear frequently in text but do not provide any valuable information. More about stop words here: https://www.opinosis-analytics.com/knowledge-base/stop-words-explained/#.YfBtvVhKjt0
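A minimal sketch of this preprocessing with tidytext (the `comments` data frame and its `comment_text` column are assumed names, not taken from the original code):

library(dplyr)
library(tidytext)

# Assumed input: a `comments` data frame with one comment per row and
# columns column_label, media_source, comment_text (names are assumptions)
words <- comments %>%
  unnest_tokens(word, comment_text) %>%   # one row per word ("unnesting")
  anti_join(stop_words, by = "word")      # drop "as", "it", "the", etc.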
An example of the dataset we’re working with is here:
| column_label | media_source | word |
|---|---|---|
| nbc1521 | NBC | didn’t |
| nbc1521 | NBC | read |
| nbc1521 | NBC | article |
| nbc1521 | NBC | headline |
| nbc1521 | NBC | told |
We can analyze the amount of positive/negative sentiment in the comments for a given news website. Drawing from ‘AFINN’, a lexicon of roughly 2,500 English words with assigned polarity scores ranging from -5 (most negative) to +5 (most positive), we can sum the scores of the words that appear in our comments to get an overall sentiment score per comment, and then average those scores per news website.
(More info about AFINN here: https://www.geeksforgeeks.org/python-sentiment-analysis-using-affin/)
Here’s an example of some words in the AFINN set:
| word | value |
|---|---|
| abandon | -2 |
| abandoned | -2 |
| abandons | -2 |
| abducted | -2 |
| abduction | -2 |
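Scoring then reduces to a join and two aggregations. A sketch with tidytext, assuming the unnested `words` data frame from above (`get_sentiments("afinn")` prompts a one-time download via the textdata package):

afinn <- get_sentiments("afinn")   # columns: word, value

words %>%
  inner_join(afinn, by = "word") %>%                           # keep only scored words
  group_by(media_source, column_label) %>%
  summarise(comment_score = sum(value), .groups = "drop") %>%  # score per comment
  group_by(media_source) %>%
  summarise(total = mean(comment_score)) %>%                   # average per source
  arrange(desc(total))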
Average sentiment score per media source, from most positive to most negative:
| media_source | total |
|---|---|
| The Economist | 0.9629630 |
| Washington Post | 0.6231884 |
| Jezebel | 0.5232558 |
| BBC | 0.2150538 |
| NYT | 0.1891892 |
| Slate.com | -0.0660377 |
| Reuters | -0.0735294 |
| AP | -0.0797546 |
| HuffPost | -0.0873016 |
| CNN | -0.0951374 |
| CBS | -0.1104651 |
| Politico | -0.1451613 |
| CNBC | -0.1521739 |
| Fox News | -0.1879562 |
| Daily Wire | -0.1909263 |
| Huff Post | -0.2391304 |
| Palmer Report | -0.2631579 |
| New York Post | -0.2671395 |
| WSJ | -0.3879004 |
| The Hill | -0.4246154 |
| The Guardian | -0.4658385 |
| USA today | -0.4689655 |
| Stars and Stripes | -0.6035242 |
| ABC News | -0.6053571 |
| Fox News | -0.6601942 |
| NPR | -0.6666667 |
| Washington Times | -0.7378277 |
| Breitbart | -0.8062016 |
| Washington Examiner | -0.8571429 |
| PBS | -1.0481928 |
| Daily Beast | -1.1928934 |
| NBC | -1.2413793 |
(Some sources appear twice, e.g. “Fox News” and “HuffPost”/“Huff Post”; these likely reflect inconsistent source labels in the raw data.)
To see which words drive these scores, we can count a source’s AFINN words (n) alongside their values (val). For example, here are common AFINN words from a source that skews positive:
| word | n | val |
|---|---|---|
| save | 3 | 2 |
| advantages | 2 | 2 |
| fair | 2 | 2 |
| miracle | 2 | 4 |
| stronger | 2 | 2 |
| unfair | 2 | -2 |
| winning | 2 | 4 |
| banned | 1 | -2 |
| beloved | 1 | 3 |
| broke | 1 | -1 |
And from one that skews negative:
| word | n | val |
|---|---|---|
| killed | 7 | -3 |
| stop | 6 | -1 |
| hate | 4 | -3 |
| hurting | 4 | -2 |
| killing | 4 | -3 |
| stronger | 4 | 2 |
| violence | 4 | -3 |
| accidental | 3 | -2 |
| crap | 3 | -3 |
| death | 3 | -2 |
We can also aggregate sentiment by the bias rating of the source:
| source_bias | total | n |
|---|---|---|
| Orange | -0.1979522 | 586 |
| NA | -0.2767329 | 1861 |
| Yellow | -0.3686736 | 2292 |
| Green | -0.4621005 | 2190 |
And count the number of articles per bias category:
## # A tibble: 4 × 2
## # Groups: source_bias [4]
## source_bias n
## <chr> <int>
## 1 Yellow 37
## 2 Green 30
## 3 <NA> 24
## 4 Orange 7
Is there a relationship between the strength of positive/negative sentiment and the number of reactions, likes, or replies a comment receives? No: none of the Pearson correlations below are statistically significant (all p > 0.39, all |r| < 0.03).
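The three tests can be reproduced with base R’s cor.test(); the column names are taken from the output below:

cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$total_reactions)
cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$number_of_replies_to_comment)
cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$like_reacts)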
##
## Pearson's product-moment correlation
##
## data: sent_vs_reacts$sentiment and sent_vs_reacts$total_reactions
## t = 0.45531, df = 1473, p-value = 0.649
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03920206 0.06286501
## sample estimates:
## cor
## 0.01186237
##
## Pearson's product-moment correlation
##
## data: sent_vs_reacts$sentiment and sent_vs_reacts$number_of_replies_to_comment
## t = 0.84488, df = 1473, p-value = 0.3983
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02906485 0.07296723
## sample estimates:
## cor
## 0.0220085
##
## Pearson's product-moment correlation
##
## data: sent_vs_reacts$sentiment and sent_vs_reacts$like_reacts
## t = 0.031771, df = 2123, p-value = 0.9747
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04183366 0.04321025
## sample estimates:
## cor
## 0.0006895433
Apart from the AFINN lexicon, there is NRC, which contains roughly 13,000 word-emotion associations covering eight emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust (along with overall positive/negative labels). A word can be associated with multiple emotions.
Using the same process as for positive/negative sentiment, we can find the overarching “emotion” expressed in the comments.
For more information about NRC: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Here’s an example of the words listed in NRC:
| word | sentiment |
|---|---|
| abacus | trust |
| abandon | fear |
| abandon | negative |
| abandon | sadness |
| abandoned | anger |
| abandoned | fear |
| abandoned | negative |
| abandoned | sadness |
| abandonment | anger |
| abandonment | fear |
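A sketch of the emotion-share calculation, again assuming the unnested `words` data frame (excluding NRC’s overall positive/negative rows is an assumption about the original analysis):

nrc <- get_sentiments("nrc")   # columns: word, sentiment

emotion_pct <- words %>%
  inner_join(nrc, by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%  # keep the eight emotions
  count(column_label, sentiment) %>%
  group_by(column_label) %>%
  mutate(percent = n / sum(n)) %>%   # share of all matched emotion words
  ungroup()

# Top emotion per article
emotion_pct %>%
  group_by(column_label) %>%
  slice_max(percent, n = 1)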
The top emotion per article, with the share of emotion words it accounts for:
| article | emotion | percent |
|---|---|---|
| ABC1618 | anger | 0.2367865 |
| ABC31121 | trust | 0.1742424 |
| ABC31221 | anger | 0.1777778 |
| ABC3721 | trust | 0.1843318 |
| ABC52621 | trust | 0.1881188 |
| ABC6121 | trust | 0.1990050 |
| AP102520 | trust | 0.1911765 |
| AP12120 | anticipation | 0.2319588 |
| AP42721 | trust | 0.2500000 |
| AP43121 | fear | 0.2324324 |
Note: the percentage represents the number of words associated with the specific emotion divided by the total number of emotion words identified by the NRC lexicon. It does not reflect the total number of words in a comment.
The same can be done across media sources:
| media_source | emotion | avg_percent |
|---|---|---|
| ABC News | trust | 0.1747331 |
| AP | trust | 0.1921672 |
| BBC | trust | 0.1967213 |
| Breitbart | fear | 0.1791872 |
| CBS | anticipation | 0.1754267 |
| CNBC | fear | 0.2083744 |
| CNN | fear | 0.2002517 |
| Daily Beast | fear | 0.2008817 |
| Daily Wire | trust | 0.2073240 |
| Fox News | fear | 0.2361111 |
| Fox News | trust | 0.2045601 |
| Huff Post | trust | 0.2334630 |
| HuffPost | trust | 0.1854545 |
| Jezebel | trust | 0.2635417 |
| NBC | trust | 0.1903244 |
| New York Post | trust | 0.2043473 |
| NPR | fear | 0.1724138 |
| NYT | trust | 0.2314815 |
| Palmer Report | trust | 0.2592593 |
| PBS | trust | 0.1902017 |
| Politico | trust | 0.2079646 |
| Reuters | trust | 0.2290503 |
| Slate.com | joy | 0.1697568 |
| Stars and Stripes | fear | 0.2375491 |
| The Economist | trust | 0.2117647 |
| The Guardian | trust | 0.1883851 |
| The Hill | trust | 0.1776031 |
| USA today | sadness | 0.1612732 |
| Washington Examiner | fear | 0.1781788 |
| Washington Post | sadness | 0.1804511 |
| Washington Times | trust | 0.2052631 |
| WSJ | fear | 0.1768743 |
It seems that trust and fear are the most common emotions across sources. Here are examples of common trust and fear words.
Trust:
| word | n |
|---|---|
| god | 134 |
| serve | 127 |
| sex | 113 |
| medical | 96 |
| school | 83 |
| real | 82 |
| law | 80 |
| governor | 73 |
| pay | 67 |
| hope | 63 |
Fear:
| word | n |
|---|---|
| military | 254 |
| god | 134 |
| change | 133 |
| government | 121 |
| medical | 96 |
| prison | 88 |
| surgery | 85 |
| hate | 83 |
| hell | 52 |
| fight | 49 |
Note that the “trust” emotion doesn’t necessarily mean the comments express trust; more likely, the comments center on certain sensitive or high-stakes topics (e.g., “god”, “law”, “medical”).
We can look at the top words for a specific emotion within a media source. Here are the top “sadness” words from CNN comments. (A function provided in a later section allows lookup by article/media source and emotion.)
| word | n |
|---|---|
| hate | 12 |
| unfair | 7 |
| attacking | 3 |
| discrimination | 3 |
| bigoted | 2 |
| disabled | 2 |
| doubt | 2 |
| hurt | 2 |
| mother | 2 |
| bad | 1 |
Here is a table showing the most common emotion associated with comments on articles of various topics.
## # A tibble: 7 × 3
## # Groups: topic_of_article [7]
## topic_of_article emotion avg_emotion
## <chr> <chr> <dbl>
## 1 1 = Transgender Athletes trust 0.207
## 2 2 = Transgender Bathroom fear 0.200
## 3 3 = Transgender in Military fear 0.232
## 4 4 = Transgender Adolescents trust 0.193
## 5 5 = Transgender Inmates fear 0.227
## 6 6 = Other trust 0.178
## 7 7: Pageants trust 0.184
Finding the most frequent words used per article:
find_top_words <- function(article){
  # Count word frequencies within a single article's comments,
  # most frequent first
  words %>%
    filter(column_label == article) %>%
    count(word, sort = TRUE)
}
# Example, limit 10 words
find_top_words("CNN12521") %>% head(10) %>% kbl() %>% kable_paper("hover", full_width = F)
| word | n |
|---|---|
| country | 18 |
| people | 16 |
| serve | 15 |
| military | 10 |
| protect | 8 |
| american | 5 |
| biden | 5 |
| care | 5 |
| equal | 5 |
| trans | 5 |
And the same for a media source:
find_top_words_media <- function(source){
  # Count word frequencies across all comments for a media source,
  # most frequent first
  words %>%
    filter(media_source == source) %>%
    count(word, sort = TRUE)
}
# Example, limit 10 words
find_top_words_media("BBC") %>% head(10) %>% kbl() %>% kable_paper("hover", full_width = F)
| word | n |
|---|---|
| sports | 18 |
| trans | 13 |
| women | 11 |
| girls | 10 |
| fair | 9 |
| mississippi | 9 |
| people | 8 |
| compete | 7 |
| it’s | 7 |
| sense | 7 |
Code to look through the sentiment words for a given media source and emotion:
all_sent_words %>%
  filter(media_source == "NBC") %>%   # edit this line to change the media source
  filter(sentiment == "fear") %>%     # edit this line to change the emotion
  count(word) %>%
  arrange(desc(n)) %>%
  head(10)                            # remove head(10) to see all matching words
## word n
## 1 hate 4
## 2 hurting 4
## 3 killing 4
## 4 violence 4
## 5 accidental 3
## 6 death 3
## 7 fear 3
## 8 hell 3
## 9 murder 3
## 10 caution 2
LDA (Latent Dirichlet Allocation) is a topic modeling method that aims to find natural groupings of topics within a collection of texts. The method is considered “unsupervised” machine learning: the algorithm is not trained on known outcomes, but infers and constructs patterns on its own.
In LDA, you must first specify the number of topics to group the text into. LDA then calculates the probability of each word belonging to each topic. For example, word1 could have a 0.6 probability of being in topic 2, a 0.3 probability of being in topic 1, and a 0.1 probability of being in topic 3. Based on this information, documents (in this case, Facebook comments) are classified as belonging to a certain topic based on which words they contain. For more information, check out this article: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
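A minimal sketch of fitting such a model with the topicmodels and tidytext packages (object names and the seed are assumptions):

library(topicmodels)

# Cast word counts into a document-term matrix, one document per article's comments
comment_dtm <- words %>%
  count(column_label, word) %>%
  cast_dtm(column_label, word, n)

# Fit LDA with 6 topics
lda_fit <- LDA(comment_dtm, k = 6, control = list(seed = 1234))

# Per-topic word probabilities (beta): the 10 most probable words per topic
tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10)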
After specifying 6 topics, we can look at a subset of words within each topic that have a high probability of belonging to it (i.e., words we are more confident belong to that specific topic rather than another one). It is interesting to see that the 6 groupings identified by the LDA algorithm closely match the groups identified through qualitative analysis: topic 1 appears to be related to bathroom policies, topic 5 to transgender prison inmates, and so on.
Each document (the collection of comments under a Facebook article) can also be assigned topic probabilities. Below is a selection of articles and the topic with the highest probability (gamma) for each. For example, ABC1618 is assigned almost entirely (gamma ≈ 1) to the topic hypothesized to relate to prison inmates; the title of this article is “Transgender inmate seeks rare transfer to female prison”, so this classification is very accurate.
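The per-document probabilities come from the gamma matrix (sketch, assuming the `lda_fit` object from above):

tidy(lda_fit, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1)   # each document's most probable topic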
## # A tibble: 5 × 3
## # Groups: document [5]
## document topic gamma
## <chr> <int> <dbl>
## 1 ABC1618 3 1.00
## 2 ABC31121 3 0.560
## 3 ABC31221 3 1.00
## 4 ABC3721 3 1.00
## 5 ABC52621 3 0.838
This method is very useful for finding potential topics within a large volume of text that is hard to parse manually.