The goal of this analysis is to use sentiment analysis techniques in R to examine anti-trans Facebook comments across various news platforms. Rather than analyzing each comment as a whole, we analyze the individual words within it. While context and the relationships between words in a sentence are crucial, this word-level method allows for a quick “snapshot” view of the sentiment of these comments.
The original dataset has one comment per row; for our analysis, however, each word of a comment gets its own row. This is called “unnesting”. Additionally, we remove stop words from the dataset: words like “as”, “it”, and “the” that appear frequently in text but do not provide any valuable information. More about stop words here: https://www.opinosis-analytics.com/knowledge-base/stop-words-explained/#.YfBtvVhKjt0
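A minimal sketch of this preprocessing with tidytext (the `comments` data frame and its `comment_text` column are assumed names, not taken from the original code):

library(dplyr)
library(tidytext)

# Assumed input: a `comments` data frame with one comment per row and
# columns column_label, media_source, comment_text (names are assumptions)
words <- comments %>%
  unnest_tokens(word, comment_text) %>%   # one row per word ("unnesting")
  anti_join(stop_words, by = "word")      # drop "as", "it", "the", etc.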
An example of the dataset we’re working with is here:
| column_label | media_source | word |
|---|---|---|
| nbc1521 | NBC | didn’t |
| nbc1521 | NBC | read |
| nbc1521 | NBC | article |
| nbc1521 | NBC | headline |
| nbc1521 | NBC | told |
We can analyze the amount of positive/negative sentiment in the comments for a given news website. Drawing from ‘AFINN’, a lexicon of roughly 2,500 English words with assigned polarity scores ranging from -5 (most negative) to +5 (most positive), we can sum the scores of the words that appear in our comments to get an overall sentiment score per comment, and then average those scores per news website.
(More info about AFINN here: https://www.geeksforgeeks.org/python-sentiment-analysis-using-affin/)
Here’s an example of some words in the AFINN set:
| word | value |
|---|---|
| abandon | -2 |
| abandoned | -2 |
| abandons | -2 |
| abducted | -2 |
| abduction | -2 |
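Scoring then reduces to a join and two aggregations. A sketch with tidytext, assuming the unnested `words` data frame from above (`get_sentiments("afinn")` prompts a one-time download via the textdata package):

afinn <- get_sentiments("afinn")   # columns: word, value

words %>%
  inner_join(afinn, by = "word") %>%                           # keep only scored words
  group_by(media_source, column_label) %>%
  summarise(comment_score = sum(value), .groups = "drop") %>%  # score per comment
  group_by(media_source) %>%
  summarise(total = mean(comment_score)) %>%                   # average per source
  arrange(desc(total))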
Average sentiment score per media source, from most positive to most negative:
| media_source | total |
|---|---|
| The Economist | 0.9629630 |
| Washington Post | 0.6231884 |
| Jezebel | 0.5232558 |
| BBC | 0.2150538 |
| NYT | 0.1891892 |
| Slate.com | -0.0660377 |
| Reuters | -0.0735294 |
| AP | -0.0797546 |
| HuffPost | -0.0873016 |
| CNN | -0.0951374 |
| CBS | -0.1104651 |
| Politico | -0.1451613 |
| CNBC | -0.1521739 |
| Fox News | -0.1879562 |
| Daily Wire | -0.1909263 |
| Huff Post | -0.2391304 |
| Palmer Report | -0.2631579 |
| New York Post | -0.2671395 |
| WSJ | -0.3879004 |
| The Hill | -0.4246154 |
| The Guardian | -0.4658385 |
| USA today | -0.4689655 |
| Stars and Stripes | -0.6035242 |
| ABC News | -0.6053571 |
| Fox News | -0.6601942 |
| NPR | -0.6666667 |
| Washington Times | -0.7378277 |
| Breitbart | -0.8062016 |
| Washington Examiner | -0.8571429 |
| PBS | -1.0481928 |
| Daily Beast | -1.1928934 |
| NBC | -1.2413793 |
(Some sources appear twice, e.g. “Fox News” and “HuffPost”/“Huff Post”; these likely reflect inconsistent source labels in the raw data.)
To see which words drive these scores, we can count a source’s AFINN words (n) alongside their values (val). For example, here are common AFINN words from a source that skews positive:
| word | n | val |
|---|---|---|
| save | 3 | 2 |
| advantages | 2 | 2 |
| fair | 2 | 2 |
| miracle | 2 | 4 |
| stronger | 2 | 2 |
| unfair | 2 | -2 |
| winning | 2 | 4 |
| banned | 1 | -2 |
| beloved | 1 | 3 |
| broke | 1 | -1 |
And from one that skews negative:
| word | n | val |
|---|---|---|
| killed | 7 | -3 |
| stop | 6 | -1 |
| hate | 4 | -3 |
| hurting | 4 | -2 |
| killing | 4 | -3 |
| stronger | 4 | 2 |
| violence | 4 | -3 |
| accidental | 3 | -2 |
| crap | 3 | -3 |
| death | 3 | -2 |
We can also aggregate sentiment by the bias rating of the source:
| source_bias | total | n |
|---|---|---|
| Orange | -0.1979522 | 586 |
| NA | -0.2767329 | 1861 |
| Yellow | -0.3686736 | 2292 |
| Green | -0.4621005 | 2190 |
And count the number of articles per bias category:
## # A tibble: 4 × 2
## # Groups: source_bias [4]
## source_bias n
## <chr> <int>
## 1 Yellow 37
## 2 Green 30
## 3 <NA> 24
## 4 Orange 7
Is there a relationship between the strength of positive/negative sentiment and the number of reactions, likes, or replies a comment receives? No: none of the Pearson correlations below are statistically significant (all p > 0.39, all |r| < 0.03).
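The three tests can be reproduced with base R’s cor.test(); the column names are taken from the output below:

cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$total_reactions)
cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$number_of_replies_to_comment)
cor.test(sent_vs_reacts$sentiment, sent_vs_reacts$like_reacts)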
##
## Pearson's product-moment correlation
##
## data: sent_vs_reacts$sentiment and sent_vs_reacts$total_reactions
## t = 0.45531, df = 1473, p-value = 0.649
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03920206 0.06286501
## sample estimates:
## cor
## 0.01186237
##
## Pearson's product-moment correlation
##
## data: sent_vs_reacts$sentiment and sent_vs_reacts$number_of_replies_to_comment
## t = 0.84488, df = 1473, p-value = 0.3983
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02906485 0.07296723
## sample estimates:
## cor
## 0.0220085
##
## Pearson's product-moment correlation
##
## data: sent_vs_reacts$sentiment and sent_vs_reacts$like_reacts
## t = 0.031771, df = 2123, p-value = 0.9747
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04183366 0.04321025
## sample estimates:
## cor
## 0.0006895433
Apart from the AFINN lexicon, there is NRC, which contains roughly 13,000 word-emotion associations covering eight emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust (along with overall positive/negative labels). A word can be associated with multiple emotions.
Using the same process as for positive/negative sentiment, we can find the overarching “emotion” expressed in the comments.
For more information about NRC: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Here’s an example of the words listed in NRC:
| word | sentiment |
|---|---|
| abacus | trust |
| abandon | fear |
| abandon | negative |
| abandon | sadness |
| abandoned | anger |
| abandoned | fear |
| abandoned | negative |
| abandoned | sadness |
| abandonment | anger |
| abandonment | fear |
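A sketch of the emotion-share calculation, again assuming the unnested `words` data frame (excluding NRC’s overall positive/negative rows is an assumption about the original analysis):

nrc <- get_sentiments("nrc")   # columns: word, sentiment

emotion_pct <- words %>%
  inner_join(nrc, by = "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>%  # keep the eight emotions
  count(column_label, sentiment) %>%
  group_by(column_label) %>%
  mutate(percent = n / sum(n)) %>%   # share of all matched emotion words
  ungroup()

# Top emotion per article
emotion_pct %>%
  group_by(column_label) %>%
  slice_max(percent, n = 1)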
The top emotion per article, with the share of emotion words it accounts for:
| article | emotion | percent |
|---|---|---|
| ABC1618 | anger | 0.2367865 |
| ABC31121 | trust | 0.1742424 |
| ABC31221 | anger | 0.1777778 |
| ABC3721 | trust | 0.1843318 |
| ABC52621 | trust | 0.1881188 |
| ABC6121 | trust | 0.1990050 |
| AP102520 | trust | 0.1911765 |
| AP12120 | anticipation | 0.2319588 |
| AP42721 | trust | 0.2500000 |
| AP43121 | fear | 0.2324324 |
Note: the percentage represents the number of words associated with the specific emotion divided by the total number of emotion words identified by the NRC lexicon. It does not reflect the total number of words in a comment.
The same can be done across media sources:
| media_source | emotion | avg_percent |
|---|---|---|
| ABC News | trust | 0.1747331 |
| AP | trust | 0.1921672 |
| BBC | trust | 0.1967213 |
| Breitbart | fear | 0.1791872 |
| CBS | anticipation | 0.1754267 |
| CNBC | fear | 0.2083744 |
| CNN | fear | 0.2002517 |
| Daily Beast | fear | 0.2008817 |
| Daily Wire | trust | 0.2073240 |
| Fox News | fear | 0.2361111 |
| Fox News | trust | 0.2045601 |
| Huff Post | trust | 0.2334630 |
| HuffPost | trust | 0.1854545 |
| Jezebel | trust | 0.2635417 |
| NBC | trust | 0.1903244 |
| New York Post | trust | 0.2043473 |
| NPR | fear | 0.1724138 |
| NYT | trust | 0.2314815 |
| Palmer Report | trust | 0.2592593 |
| PBS | trust | 0.1902017 |
| Politico | trust | 0.2079646 |
| Reuters | trust | 0.2290503 |
| Slate.com | joy | 0.1697568 |
| Stars and Stripes | fear | 0.2375491 |
| The Economist | trust | 0.2117647 |
| The Guardian | trust | 0.1883851 |
| The Hill | trust | 0.1776031 |
| USA today | sadness | 0.1612732 |
| Washington Examiner | fear | 0.1781788 |
| Washington Post | sadness | 0.1804511 |
| Washington Times | trust | 0.2052631 |
| WSJ | fear | 0.1768743 |
It seems that trust and fear are the most common emotions across sources. Here are examples of common trust and fear words.
Trust:
| word | n |
|---|---|
| god | 134 |
| serve | 127 |
| sex | 113 |
| medical | 96 |
| school | 83 |
| real | 82 |
| law | 80 |
| governor | 73 |
| pay | 67 |
| hope | 63 |
Fear:
| word | n |
|---|---|
| military | 254 |
| god | 134 |
| change | 133 |
| government | 121 |
| medical | 96 |
| prison | 88 |
| surgery | 85 |
| hate | 83 |
| hell | 52 |
| fight | 49 |
Note that the “trust” emotion doesn’t necessarily mean the comments express trust; more likely, the comments center on certain sensitive or high-stakes topics (e.g., “god”, “law”, “medical”).
We can look at the top words for a specific emotion within a media source. Here are the top “sadness” words from CNN comments. (A function provided in a later section allows lookup by article/media source and emotion.)
| word | n |
|---|---|
| hate | 12 |
| unfair | 7 |
| attacking | 3 |
| discrimination | 3 |
| bigoted | 2 |
| disabled | 2 |
| doubt | 2 |
| hurt | 2 |
| mother | 2 |
| bad | 1 |
Here is a table showing the most common emotion associated with comments on articles of various topics.
## # A tibble: 7 × 3
## # Groups: topic_of_article [7]
## topic_of_article emotion avg_emotion
## <chr> <chr> <dbl>
## 1 1 = Transgender Athletes trust 0.207
## 2 2 = Transgender Bathroom fear 0.200
## 3 3 = Transgender in Military fear 0.232
## 4 4 = Transgender Adolescents trust 0.193
## 5 5 = Transgender Inmates fear 0.227
## 6 6 = Other trust 0.178
## 7 7: Pageants trust 0.184
Finding the most frequent words used per article:
find_top_words <- function(article){
  # Count word frequencies within a single article's comments,
  # most frequent first
  words %>%
    filter(column_label == article) %>%
    count(word, sort = TRUE)
}
# Example, limit 10 words
find_top_words("CNN12521") %>% head(10) %>% kbl() %>% kable_paper("hover", full_width = F)
| word | n |
|---|---|
| country | 18 |
| people | 16 |
| serve | 15 |
| military | 10 |
| protect | 8 |
| american | 5 |
| biden | 5 |
| care | 5 |
| equal | 5 |
| trans | 5 |
And the same for a media source:
find_top_words_media <- function(source){
  # Count word frequencies across all comments for a media source,
  # most frequent first
  words %>%
    filter(media_source == source) %>%
    count(word, sort = TRUE)
}
# Example, limit 10 words
find_top_words_media("BBC") %>% head(10) %>% kbl() %>% kable_paper("hover", full_width = F)
| word | n |
|---|---|
| sports | 18 |
| trans | 13 |
| women | 11 |
| girls | 10 |
| fair | 9 |
| mississippi | 9 |
| people | 8 |
| compete | 7 |
| it’s | 7 |
| sense | 7 |
Code to look through the sentiment words for a given media source and emotion:
all_sent_words %>%
  filter(media_source == "NBC") %>%   # edit this line to change the media source
  filter(sentiment == "fear") %>%     # edit this line to change the emotion
  count(word) %>%
  arrange(desc(n)) %>%
  head(10)                            # remove head(10) to see all matching words
## word n
## 1 hate 4
## 2 hurting 4
## 3 killing 4
## 4 violence 4
## 5 accidental 3
## 6 death 3
## 7 fear 3
## 8 hell 3
## 9 murder 3
## 10 caution 2
LDA (Latent Dirichlet Allocation) is a topic modeling method that aims to find natural groupings of topics within a collection of texts. The method is considered “unsupervised” machine learning: the algorithm is not trained on known outcomes, but infers and constructs patterns on its own.
In LDA, you must first specify the number of topics to group the text into. LDA then calculates the probability of each word belonging to each topic. For example, word1 could have a 0.6 probability of being in topic 2, a 0.3 probability of being in topic 1, and a 0.1 probability of being in topic 3. Based on this information, documents (in this case, Facebook comments) are classified as belonging to a certain topic based on which words they contain. For more information, check out this article: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
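A minimal sketch of fitting such a model with the topicmodels and tidytext packages (object names and the seed are assumptions):

library(topicmodels)

# Cast word counts into a document-term matrix, one document per article's comments
comment_dtm <- words %>%
  count(column_label, word) %>%
  cast_dtm(column_label, word, n)

# Fit LDA with 6 topics
lda_fit <- LDA(comment_dtm, k = 6, control = list(seed = 1234))

# Per-topic word probabilities (beta): the 10 most probable words per topic
tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10)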
After specifying 6 topics, we can look at a subset of words within each topic that have a high probability of belonging to it (i.e., words we are more confident belong to that specific topic rather than another one). It is interesting to see that the 6 groupings identified by the LDA algorithm closely match the groups identified through qualitative analysis: topic 1 appears to be related to bathroom policies, topic 5 to transgender prison inmates, and so on.
Each document (the collection of comments under a Facebook article) can also be assigned topic probabilities. Below is a selection of articles and the topic with the highest probability (gamma) for each. For example, ABC1618 is assigned almost entirely (gamma ≈ 1) to the topic hypothesized to relate to prison inmates; the title of this article is “Transgender inmate seeks rare transfer to female prison”, so this classification is very accurate.
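The per-document probabilities come from the gamma matrix (sketch, assuming the `lda_fit` object from above):

tidy(lda_fit, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1)   # each document's most probable topic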
## # A tibble: 5 × 3
## # Groups: document [5]
## document topic gamma
## <chr> <int> <dbl>
## 1 ABC1618 3 1.00
## 2 ABC31121 3 0.560
## 3 ABC31221 3 1.00
## 4 ABC3721 3 1.00
## 5 ABC52621 3 0.838
This method is very useful for finding potential topics within a large volume of text that is hard to parse manually.