Assignment

In Part 1, we reviewed the sentiment analysis from Chapter 2 of Silge and Robinson’s “Text Mining with R” (https://www.tidytextmining.com/). Now, I’ll perform similar analysis on another corpus.

The corpus I’ve chosen is a collection of ~280,000 tweets from 424 congressional candidates from the 2022 election cycle (hosted here on Kaggle: https://www.kaggle.com/datasets/kac624/politicaltweets). I gathered the tweets by scraping Twitter in Python using the snscrape package. As an aside, I attempted to gather this corpus in R, but all approaches I found require Twitter API keys. I’ve been waiting several weeks for a key for the Twitter API, but I suspect it may never come, as Twitter has recently moved to monetize the API. So, I’ve had to resort to the few tools that are still working to gather tweets.

Data on the 424 target candidates were gathered from Federal Election Commission data. I initially gathered many more candidates (3000+), but I’m not yet able to reliably pair all of those candidates with their Twitter IDs / handle. I plan to continue working on this effort for Project 4. Please see https://github.com/kac624/cuny/raw/main/D607/Project4_politicalTweets.Rmd for the collection and cleaning of the list of candidates and their IDs, and https://github.com/kac624/cuny/raw/main/D607/twitterScrape.ipynb for details on scraping.

Setup

library(tidyverse)
library(tidytext)
library(wordcloud)
library(lexicon)
library(reshape2)
library(httr)
library(kableExtra)

Extension to Another Corpus with Another Lexicon

First, we’ll read in our data from Kaggle. We’ll take a quick preview of the dataframe, with the actual tweet (under the text column) in a second table.

kaggle <- jsonlite::read_json('data/kaggle.json')
username <- kaggle$username
authkey <- kaggle$key

url <- 'https://www.kaggle.com/api/v1/datasets/download/kac624/politicaltweets/candidate_tweets2022.csv'
response <- GET(url, authenticate(username, authkey, type = 'basic'))
temp <- tempfile()
download.file(response$url, temp, mode = 'wb')
tweets <- read_csv(unz(temp, 'candidate_tweets2022.csv'))
unlink(temp)

tweets %>%
  select(-text) %>%
  head(5) %>%
  kable(align = 'l') %>% 
  kable_classic(position = 'center')

…1	twitter_id	username	verified_status	date_created	tweet_id	likes	views	replies	retweets	source	language	retweeted_id	quoted_id
0	2916086925	RepAdams	TRUE	2022-12-31 20:00:08	1.609278e+18	11	NA	5	4	TweetDeck	en	NA	NA
1	2916086925	RepAdams	TRUE	2022-12-31 00:50:02	1.608989e+18	10	NA	3	2	Twitter for iPhone	en	NA	NA
2	2916086925	RepAdams	TRUE	2022-12-29 22:38:47	1.608593e+18	10	NA	3	2	Twitter for iPhone	en	NA	https://twitter.com/bhalle87/status/1608592326691287042
3	2916086925	RepAdams	TRUE	2022-12-29 20:40:45	1.608564e+18	16	NA	0	3	Twitter for iPhone	fr	NA	https://twitter.com/MLSist/status/1608560933428776962
4	2916086925	RepAdams	TRUE	2022-12-28 15:58:30	1.608130e+18	42	NA	2	13	TweetDeck	en	NA	NA

tweets %>%
  select(username, text) %>%
  head(10) %>%
  kable(align = 'l') %>% 
  kable_classic(position = 'center')

username	text
RepAdams	It’s already 2023 in many parts of the world! As we celebrate tonight, please remember to do so responsibly. Don’t drink and drive, call a cab or rideshare, and tip your drivers and hospitality workers well. Happy New Year! https://t.co/63SfPHBxsv
RepAdams	I’m sad the Congress is losing a visionary leader like @RepDavidEPrice as well. https://t.co/vNb1SH6lf2
RepAdams	Thank you, @SecretaryPete. if you or your loved ones were impacted by the Southwest Airlines cancellations, please read below:
RepAdams	Rest in Power, Pelé.
RepAdams	U.S. citizen Paul Whelan has had four years stolen from him while wrongfully detained by the Russian government. Americans will not be used as political pawns. We will not relent until Russia gives up their inhumane games and returns Paul home. #FreePaulWhelan https://t.co/cjt3MBbNbK
RepAdams	Wishing a happy first day of #Kwanzaa to everyone celebrating! ❤️💚🖤 From Umoja to Imani, may all the blessings of #Kwanzaa be yours! https://t.co/hJ5hlV8vHW
RepAdams	I wish you and your family a #HappyHanukkah on the last night of the holiday tonight! Long after the final flame is extinguished, may this be a season of love, light and laughter for you and your loved ones. https://t.co/e2pFJXbkN3
RepAdams	Merry Christmas! I hope joy fills your home and your heart as we celebrate the birth of a Savior today. This is truly a day of wonder, from our family rituals to the ways we express our love through food, togetherness, and gifts. Season’s Greetings from your Congresswoman! https://t.co/FyXIh4dowI
RepAdams	From my family to yours, have a Merry Christmas Eve! https://t.co/o3KYpAEicz
RepAdams	Let’s pass the #Momnibus in the 118th Congress!

I’ll remove a few extraneous columns and attempt to clean up the tweet text. I’ll remove poorly captured characters (e.g. the ampersand), remove links, eliminate non-UTF characters and emojis, eliminate “’s” at the end of words (for better aggregation), and remove line breaks / control elements. We again preview our text column to see the cleaned version. Compared to the above, it looks much nicer!

tweets <- tweets %>%
  select(-...1, -views, -retweeted_id, -quoted_id)

tweets$text <- tweets$text %>%
  str_remove_all('&amp;') %>%
  str_remove_all('https://t.co/[a-z,A-Z,0-9]*') %>%
  str_remove_all('[â¢\u0080-\uFFFF]') %>%
  str_remove_all('\\p{So}') %>%
  str_remove_all('\'s') %>%
  str_replace_all('[[:cntrl:]]', ' ')

tweets %>%
  select(username, text) %>%
  head(10) %>%
  kable(align = 'l') %>% 
  kable_classic(position = 'center')

username	text
RepAdams	It already 2023 in many parts of the world! As we celebrate tonight, please remember to do so responsibly. Don’t drink and drive, call a cab or rideshare, and tip your drivers and hospitality workers well. Happy New Year!
RepAdams	Im sad the Congress is losing a visionary leader like @RepDavidEPrice as well.
RepAdams	Thank you, @SecretaryPete. if you or your loved ones were impacted by the Southwest Airlines cancellations, please read below:
RepAdams	Rest in Power, Pel.
RepAdams	U.S. citizen Paul Whelan has had four years stolen from him while wrongfully detained by the Russian government. Americans will not be used as political pawns. We will not relent until Russia gives up their inhumane games and returns Paul home. #FreePaulWhelan
RepAdams	Wishing a happy first day of #Kwanzaa to everyone celebrating! From Umoja to Imani, may all the blessings of #Kwanzaa be yours!
RepAdams	I wish you and your family a #HappyHanukkah on the last night of the holiday tonight! Long after the final flame is extinguished, may this be a season of love, light and laughter for you and your loved ones.
RepAdams	Merry Christmas! I hope joy fills your home and your heart as we celebrate the birth of a Savior today. This is truly a day of wonder, from our family rituals to the ways we express our love through food, togetherness, and gifts. Season Greetings from your Congresswoman!
RepAdams	From my family to yours, have a Merry Christmas Eve!
RepAdams	Lets pass the #Momnibus in the 118th Congress!

Next, we’ll tokenize our tweets to support the sentiment analysis.

tidy_tweets <- tweets %>%
  select(twitter_id, tweet_id, date_created, text) %>%
  unnest_tokens(word, text)

head(tidy_tweets) %>% 
  kable(align = 'l') %>% 
  kable_classic(position = 'center')

twitter_id	tweet_id	date_created	word
2916086925	1.609278e+18	2022-12-31 20:00:08	it
2916086925	1.609278e+18	2022-12-31 20:00:08	already
2916086925	1.609278e+18	2022-12-31 20:00:08	2023
2916086925	1.609278e+18	2022-12-31 20:00:08	in
2916086925	1.609278e+18	2022-12-31 20:00:08	many
2916086925	1.609278e+18	2022-12-31 20:00:08	parts

We’ll use the AFINN lexicon for most of our analysis. The code below captures that lexicon in a dataframe and joins it with our tidy tweets dataframe. Each tweet is treated as a separate “chapter” or document. The net sentiment of all words in a single tweet is calculated to provide an overall positive / negative rating for the tweet.

afinn <- get_sentiments('afinn')

tweets_afinn <- tidy_tweets %>% 
  inner_join(afinn, by = 'word') %>% 
  group_by(twitter_id, tweet_id, date_created) %>% 
  summarize(sentiment = sum(value), .groups = 'keep')

head(tweets_afinn) %>% 
  kable(align = 'l') %>% 
  kable_classic(position = 'center')

twitter_id	tweet_id	date_created	sentiment
7698032	1.477266e+18	2022-01-01 13:10:41	3
7698032	1.477346e+18	2022-01-01 18:29:01	2
7698032	1.477680e+18	2022-01-02 16:34:02	1
7698032	1.477824e+18	2022-01-03 02:06:43	2
7698032	1.478039e+18	2022-01-03 16:23:42	4
7698032	1.478056e+18	2022-01-03 17:31:30	6

Now that we have our data cleaned, we can begin with some analysis. I’ll first pull in the previously mentioned data that provides details on all congressional candidates from the 2022 cycle (see https://github.com/kac624/cuny/raw/main/D607/Project4_politicalTweets.Rmd). This dataset also contains a unique Twitter ID for each candidate. We’ll use that ID as a key to join our sentiment dataframe with the candidates dataframe so we can classify each tweet by party.

candidates <- read_csv('data/candidates2022.csv')

## Rows: 922 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): fec_id, name, state, district, party, office, incumbent_challenge, ...
## dbl (1): twitter_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

For now, we’ll focus on just Democrats and Republicans. Our first visualization provides a sense of the distribution of positive and negative tweets for candidates in each party.

tweets_afinn %>%
  left_join(select(candidates, party, twitter_id), 
            by = 'twitter_id') %>%
  filter(!is.na(party),
         party == 'DEMOCRATIC PARTY' |
           party == 'REPUBLICAN PARTY') %>%
  ggplot(aes(sentiment, fill = party)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c('blue','red'))

The distributions are very similar, with most tweets having a slightly positive sentiment. Overall, however, the mean is very close to neutral (zero). We do see a second mode just under zero, indicating that a significant number of tweets are slightly negative.

Next, we’ll examine the sentiment of tweets over the whole election year, separated by week. The week of the election in November is highlighted as a vertical line. Overall, the dataset is biased (in terms of volume) towards Democratic candidates (i.e. there are more tweets from Democrats than Republicans). To create a more “apples-to-apples” comparison, I standardized the sentiment scores, by taking the sum sentiment rating of all tweets within a week and dividing by the number of tweets.

tweets_afinn %>%
  left_join(select(candidates, party, twitter_id),
            by = 'twitter_id') %>%
  filter(!is.na(party),
         party == 'DEMOCRATIC PARTY' |
           party == 'REPUBLICAN PARTY') %>%
  group_by(week = lubridate::week(date_created), party) %>%
  summarize(sum_sentiment = sum(sentiment),
            std_sentiment = sum_sentiment / n(),
            .groups = 'keep') %>%
  ggplot(aes(week, std_sentiment, fill = party)) +
  geom_col() + 
  geom_vline(xintercept = lubridate::week('02-11-2022'),
             color = 'purple') +
  scale_fill_manual(values = c('blue','red')) +
  facet_wrap(~party, ncol = 1)

In terms of a time series, no clear trend emerges. The sentiment rating appears mostly consistent through the year, with a few aberrant peaks. What’s most notable is that the aggregate sentiment rating remains positive for all weeks. This insight aligns with the density distribution above, showing that most tweets lean positive. Overall, it seems that Democratic candidates tweeted with more positive language. This could be related to the position of each party in 2022. With a Democratic administration in the White House, Democrats may have been more inclined to lean on positive news in support of the status quo. By contrast, Republicans were looking to win seats in Congress, so they appeared to be more focused on highlighting problems and negative news to inspire voters to vote for change.

Our next visualization is a wordcloud, highting the most frequently used words in tweets from candidates of each party.

tidy_tweets %>%
  anti_join(stop_words) %>%
  left_join(select(candidates, party, twitter_id), 
            by = 'twitter_id') %>%
  filter(!is.na(party),
         party == 'DEMOCRATIC PARTY' |
           party == 'REPUBLICAN PARTY') %>%
  count(word, party, sort = TRUE) %>%
  acast(word ~ party, value.var = 'n', fill = 0) %>%
  comparison.cloud(colors = c('blue', 'red'),
                   max.words = 80)

In line with the narrative highlighted by the previous visualization, it seems that Republicans, seeking to drive voters to vote for change, tweeted mostly about topics they construe as problematic: the border, inflation, energy prices, and drugs. They also focused a lot on Biden. Democratic candidates, by contrast, tweeted a lot about issues that are typically associated with Democrats: gun control, abortion rights, race and climate.

Next, we’ll use the NRC lexicon to highlight the general emotions most frequently evoked in tweets. I’ll remove the binary “positive” and “negative” labels and instead focus more on specific emotions, such as anger, joy and trust. I’ve excluded a few custom stop words that were high ranking, but appeared incorrectly categorized (e.g. “vote” was associated with “fear”, and “congress” with “disgust”).

The first visualization shows the top words that contribute to each sentiment across all parties. The second then provides a side-by-side comparison of the sentiments most frequently evoked by candidates on each side. As above, we use a “standardized” sentiment (calculated the proportion of total words tied to each sentiment) to address the imbalance in the dataset.

nrc <- get_sentiments('nrc')

exclusions <- tibble(word = c('vote','congress'))

tweets_nrc <- tidy_tweets %>%
  anti_join(exclusions) %>%
  inner_join(nrc, by = 'word') %>%
  left_join(select(candidates, party, twitter_id), 
            by = 'twitter_id') %>%
  filter(!is.na(party),
         party == 'DEMOCRATIC PARTY' |
           party == 'REPUBLICAN PARTY') %>%
  count(word, sentiment, party, sort = TRUE)

tweets_nrc %>%
  filter(sentiment != 'positive',
         sentiment != 'negative') %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = 'free_y')

tweets_nrc %>%
  filter(sentiment != 'positive',
         sentiment != 'negative') %>%
  group_by(sentiment, party) %>%
  summarize(total = sum(n)) %>%
  group_by(party) %>%
  mutate(freq = total / sum(total)) %>%
  ggplot(aes(freq, sentiment, fill = party)) +
  geom_col() + 
  scale_fill_manual(values = c('blue','red')) +
  facet_wrap(~party)

The distribution of sentiments is very similar across parties. The most noticeable difference is that Republican candidates appear to use words associated with fear more frequently than Democratic candidates. Similarly, Democratic candidates appear to use words associated with joy more frequently. These differences, however, appears quite minor.

Finally, we’ll bring in a new lexicon, sourced from an assignment posted to RPubs by Colin Vail (MBA 676: Political Sentiment Lexicon, https://rpubs.com/colinvail/338458). The analysis aims to produce a lexicon characterizing words as right-leaning or left-leaning (in terms of the US political spectrum) on a scale from -4 to 4. The author produces the lexicon with a relatively straightforward approach, but they do provide some validation work. Overall, however, I can’t confirm the quality of the lexicon with a high degree of confidence. In applying it to our data, however, we can provide an assessment of its quality. We would expect that tweets from Democratic candidates would have a left-leaning score (i.e. positive) and tweets from Republican candidates would have a right-leaning (i.e. negative) score.

Let’s read in the data and join it to our tidy_tweets dataframe to see if the lexicon can correctly characterize tweets according to the candidates’ political orientation.

poli_sent <- read_csv('data/poli_sent_lexicon.csv') %>%
  select(-...1)

tweets_poli <- tidy_tweets %>%
  inner_join(poli_sent, by = 'word') %>%
  left_join(select(candidates, party, twitter_id), 
            by = 'twitter_id') %>%
  filter(!is.na(party),
         party == 'DEMOCRATIC PARTY' |
           party == 'REPUBLICAN PARTY') %>% 
  group_by(tweet_id, party) %>% 
  summarize(orientation = sum(value), .groups = 'keep')

tweets_poli %>%
  ggplot(aes(orientation, fill = party)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c('blue','red'))

Unfortunately, it seems the lexicon provides little value in assessing the political orientation of candidates from different parties. The distributions above are nearly identical, indicating little difference between the orientation of opposing candidates. Obviously, however, this is not the case. We know the political orientations of Republican and Democratic candidates differ significantly. So, we would expect divergent peaks in each distribution, with Republicans having primarily negative scores and Democrats positive. The distributions above, however, show no such difference.

We can conclude that this lexicon alone provides little insight on political orientation when used for sentiment analysis. I hope to characterize tweets according to political orientation for future assignments, so finding (or constructing) a suitable lexicon will be an area of future research!

Week 10 - NLP

Keith Colella

2023-04-02

Assignment

Setup

Extension to Another Corpus with Another Lexicon