Before beginning this project, I was interested in how tweets relate to elections, in the relationships between tweets and political parties, and in which keywords occur and how frequently. My goal was an exploratory analysis using rtweet, the R package that wraps the Twitter API (Application Programming Interface), to collect tweets and run several methods of analysis linking keywords across thousands of tweets. The final step of this project is to apply sentiment analysis to determine the overall sentiment of this divisive election, which I predicted would be strongly negative.
Several steps are required before beginning an analysis, the lengthiest of which is getting API access approved through a Twitter Developer Account. Once you complete this process and your account is approved, I recommend installing the following packages in R.
For searching tweets: rtweet.
For text analysis as well as graphing large numbers of tweets in a data frame: dplyr, tidyr, tidytext, ggplot2, igraph, and ggraph.
Finally, for sentiment analysis, I recommend: syuzhet.
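A minimal setup chunk, with the package list inferred from the functions used later in this post, might look like this (install commands included for convenience):
# install the packages used in this post (run once)
install.packages(c("rtweet", "dplyr", "tidyr", "tidytext", "ggplot2", "igraph", "ggraph", "syuzhet"))
# load them for the session
library(rtweet)    # search_tweets() and ts_plot()
library(dplyr)     # select(), count(), anti_join(), and other data manipulation
library(tidyr)     # separate() for splitting bigrams
library(tidytext)  # unnest_tokens() and the stop_words lexicon
library(ggplot2)   # plotting
library(igraph)    # graph_from_data_frame()
library(ggraph)    # plotting the word networks
library(syuzhet)   # get_nrc_sentiment() and get_sentiment()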
In this project, I focused on the U.S. Senate runoff election in Mississippi. Before running any code, it is important to research hashtags and keywords that pertain to the election and that are receiving significant traction on Twitter. An effective way to do this is to use Twitter's search bar and follow a series of highly retweeted messages to identify keywords. It is also important, especially for sentiment analysis, that the keywords or hashtags you use are not inherently biased, unless you plan to omit the sentiment analysis portion of the project.
The hashtags and keywords that I used in this project appear below in the rtweet search_tweets function. I have commented on each part of the function to make the results as replicable as possible.
#set up to authenticate API
consumer_key <- "*******************"
consumer_secret<- "**********************************************"
access_token <- "***********************************************"
access_secret <- "********************************************"
#NOTE: I omitted the authentication keys for confidentiality reasons
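#NOTE: a sketch of the token-creation step that rtweet typically requires once the keys are defined;
#"my_app" is a placeholder for whatever app name is registered on your developer account
twitter_token <- create_token(
  app = "my_app",
  consumer_key = consumer_key,
  consumer_secret = consumer_secret,
  access_token = access_token,
  access_secret = access_secret)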
#Searching tweets from top election hashtags
Miss_tweets <- search_tweets(q = "#msleg OR 'Cindy Hyde-Smith' OR 'Mike Espy' OR 'Hyde-Smith' OR #EpsyForMississippi OR #Epsy4Senate OR #MSSen OR #MississippiSenateRace OR #MSSen OR #mselex OR #MSdems OR #MSReps OR #EpsyForSenate OR #Epsy4All OR #Vote4MikeEspy4Senator OR #CindyHydeSmith OR #Cindy2018 OR #TeamCindy",
n = 30000,
lang = "en",
include_rts = FALSE,
parse = TRUE,
retryonratelimit = TRUE,
max_id = TRUE)
#q identifies the hashtags or keywords you would like to search
#n specifies the number of tweets you would like to collect
#parse = TRUE is highly useful as it stores the collected tweets in a data frame
#Normally the Twitter API will only allow you to call 18,000 tweets per 15 minutes.
#Setting retryonratelimit = TRUE (together with max_id) lets the call pause and resume so you can collect more than that.
#view the first tweets to ensure the query was successful
head(Miss_tweets$text)
The result from the head function will look like the image below:
Two things to note:
The Twitter API on a free developer account will only let you search back about 9 days from when you execute the function. To go back further, you would need to purchase an expensive premium Twitter API subscription. A cheaper method for larger projects is simply to run the function for the maximum number of tweets you can collect (increase the n parameter until the count returned stops growing), re-run it every 9 days, and merge the resulting data frames into one larger file, as sketched below.
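As a rough sketch of that merging step (the object names here are placeholders for two saved collection runs, and status_id is the tweet ID column in rtweet's output):
# combine two collection windows and drop tweets that appear in both
all_tweets <- dplyr::bind_rows(week1_tweets, week2_tweets) %>%
  dplyr::distinct(status_id, .keep_all = TRUE)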
For this project, I ran the above code 3 times:
The first time using all the hashtags above, the second time using only the general and Republican hashtags (relating to Cindy Hyde-Smith), and the third time using the general and Democratic hashtags (relating to Mike Espy).
Before running any analysis on the data, there are some interesting observations from the searches themselves. The first search (all hashtags) returned 35,341 tweets, the second (Republican) returned 35,324, and the third (Democratic) returned 16,220. This is interesting because there were seven unique Democratic hashtags or keywords compared to five unique Republican ones, yet there were almost twice as many tweets using Republican hashtags. Two initial educated guesses for this are a larger prevalence of negative media about the Republican candidate being shared under Republican hashtags (potentially by Democratic users), or that fewer people were tweeting with the Democratic candidate's hashtags due to an uphill campaign in a deep red state that had been called for the Republican candidate more than a week before the vote.
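For reference, those counts come from simply checking the number of rows in each returned data frame; the Republican and Democratic object names below are placeholders for however you saved the second and third searches.
nrow(Miss_tweets)      # all hashtags
nrow(Miss_tweets_rep)  # placeholder: Republican-hashtag search
nrow(Miss_tweets_dem)  # placeholder: Democratic-hashtag search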
According to the Pew Research Report from 2018, “Democrats are more likely than Republicans to report participating… on social media. Most notably, Democrats are more than twice as likely as Republicans (by a 24% to 9% margin) to say they have used social media in the past year… 59% of Democrats have performed at least one of these five activities the Center measured, as have 45% of Republicans.” This makes the collection results even more striking: despite Democrats reporting more political activity on social media, nearly twice as many tweets in this election came from Republican hashtags as from Democratic ones.
For this next step, no pre-cleaning is required, as we will be using the ts_plot function to graph the frequency of when tweets were posted.
ts_plot(Miss_tweets, "10 mins") +
ggplot2::theme_light() +
ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
ggplot2::labs(
x = NULL, y = NULL,
title = "Frequency of Mississippi Special Election Twitter statuses from past 9 days",
subtitle = "Twitter status (tweet) counts aggregated using 10 minute intervals",
caption = "\nSource: Data collected from Twitter's REST API via rtweet"
)
The code above produces a graph that plots tweet counts in 10-minute intervals; it was run for the general, Republican, and Democratic data frames from the first step.
Results
The results from each of the plots are below.
Findings
In all three plots, we can see that the overall trends of the data are roughly the same, with high activity during the day and evening and lower activity at night. There are, however, two interesting things to note. First, the frequency of tweets in the Democratic graph is roughly half that of the Republican graph, with Democratic hashtags struggling to reach 100 tweets in any given 10-minute interval, a level the Republican hashtags hit several times at their peaks. This is expected given the relative number of Republican and Democratic tweets collected, but it is interesting that the two track each other closely in when they were posted.
The second most obvious feature is the huge increase on November 28th at 1:44 PM. After researching the tweets and Twitter, I was able to tie the first bump to a tweet by President Donald Trump. The massive uptick from 3:00 AM until 4:30 AM, however, was shocking. At first, I suspected botnets or other interference as an explanation, but as described later in this post, after isolating those times I was able to determine this was not the case.
In this step, you must clean up your raw Twitter data frame and then process it to create a graph of the most common words. The data cleaning step is very important, or the results will likely make little to no sense.
# the cleaning steps below use the name miss_data for the search results collected above
miss_data <- Miss_tweets
# First, remove http elements manually
miss_data$stripped_text <- gsub("http.*","", miss_data$text)
miss_data$stripped_text <- gsub("https.*","", miss_data$stripped_text)
#clean punctuation and split tweets into individual words
miss_data_clean <- miss_data %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)
#remove stop words
miss_cleaned_tweet_words <- miss_data_clean %>%
anti_join(stop_words)
At this point, the data has been cleaned of pesky punctuation issues, empty spaces, hyperlinks, and stop words.
Stop words are likely the most important thing to clean, as they are the most common words in the English language (such as “the”) and would otherwise dominate any frequency count. In this project, I removed them with an anti_join from dplyr against the stop_words lexicon bundled with tidytext, but there are other stop word lists, and I recommend finding the one that best fits your needs.
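If you want to see which stop word lists tidytext bundles, or restrict removal to a single lexicon, a quick sketch is:
# stop_words bundles three lexicons: SMART, onix, and snowball
table(stop_words$lexicon)
# to remove only one lexicon's words, filter it before the anti_join
snowball_stops <- dplyr::filter(stop_words, lexicon == "snowball")
miss_cleaned_snowball <- dplyr::anti_join(miss_data_clean, snowball_stops, by = "word")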
Now that your data is clean, let’s plot the top 20 words from each of the data frames.
# Finally, plot the top 20 words
miss_cleaned_tweet_words %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(y = "Count",
x = "Unique words",
title = "Most Common words found in Mississippi Special Election tweets",
subtitle = '"Miss_Elections" , "IDK" tops our most used words"',
caption = "\nSource: Data collected from Twitter's REST API via rtweet")
The results for this plot are shown below for all three data frames:
After looking at the graphs, one can see several confirmations of previous questions as well as similarities between the data sets. For Republicans, their candidate dominates the word frequency, and the same is true on the Democratic side. This suggests that there actually were more Republicans tweeting about the race with Republican hashtags than Democrats with their own, as questioned in the first analysis. Some interesting words that appear on both sides include “trump” and “racist”. However, the Republican graph contains several other words, including “white” and “hanging”, that do not appear in the Democratic list at all. The best explanation is that Democrats were using Republican hashtags sarcastically or antagonistically while employing such keywords, on the rationale that controversial mud-slinging would have been unlikely under Republican hashtags unless it did not come from their intended base (Republicans). I did find the information concerning, however, since both words are missing from the list that includes Democratic keywords, and I find it unlikely that sarcastic Democrats would tag Republican hashtags without also including one of their own candidate's. It is possible that controversial actors or the media hijacked and used these hashtags as well. I did, however, omit the more antagonistic hashtags targeting candidates or parties from this project, and another analysis including those hashtags may help investigate this issue further. I also found it interesting that “defeats” made it onto the Democratic list of words while “wins” made it onto the Republican list; this could reflect perceptions leading up to the election.
The next analysis uses word networks to connect common words across tweets. This produces a very visual graph that also helps answer some of our previous questions about how certain words are connected. This step requires re-cleaning the original data frame (to prevent any errors carrying over from the first cleaning) and tokenizing the cleaned text into paired words before creating the word maps.
# remove punctuation, convert to lowercase, and add a token for each tweet
miss_cleaned_tweet_words_1 <- miss_data %>%
dplyr::select(stripped_text) %>%
unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)
miss_cleaned_tweet_words_1 %>%
count(paired_words, sort = TRUE)
miss_cleaned_tweet_words_2 <- miss_cleaned_tweet_words_1 %>%
separate(paired_words, c("word1", "word2"), sep = " ")
miss_tweets_filtered <- miss_cleaned_tweet_words_2 %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
miss_word_count <- miss_tweets_filtered %>%
count(word1, word2, sort = TRUE)
Now that the tweets have been cleaned and paired, we will run a word network using the code below.
# plot word network
miss_word_count %>%
filter(n >= 24) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
labs(title = "Word Network: Tweets based on top hastags and keywords from the 2018 Special Senate Election in Mississippi ",
subtitle = "Text mining twitter data ",
x = "", y = "", caption = "\nSource: Data collected from Twitter's REST API via rtweet")
The results for all three maps are shown below with one in a larger resolution and one in a smaller resolution.
General Tweets Word Networks
Republican Tweets Word Network
Democratic Tweets Word Network
From the results above, we can see several interesting correlations. First, if we look at the nexus point for words like “racist” and their connecting words for Democrats and Republicans, some interesting connections emerge: Republicans with words like “folks” or “people”, compared to Democrats with words like “bigot” and “rural”. From these connections we can draw some inferences about the demographics and ideologies of the two groups, and based on our prior analysis and the connections between words, we can rule out that most of the racially charged messages in the Republican analysis came from sarcastic or antagonistic Democrats. Interestingly, Cindy Hyde-Smith dominates the most-used words in all three maps, appearing only as the second-largest cluster of words in the Democratic word map, where Mike Espy is the dominant name. This further supports the argument that Democrats were mostly using Democratic hashtags and Republicans mostly Republican ones.
In the final step, I will delve into sentiment analysis on the full set of tweets collected from both Democratic and Republican sources to examine the sentiment of tweets during the election.
The first step is subsetting the data frame down to just the tweets. There are several ways of doing this, but for simplicity I selected the column of the data that contains the tweet text (column 5) and moved it into a new data frame for analysis.
Sent_miss <- miss_data[c(1:35341),c(5)]
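A more defensive alternative (a sketch, not what was run above) is to select the text column by name rather than by position, since column order can differ between rtweet versions:
# equivalent subset, selecting the tweet text column by name
Sent_miss <- dplyr::select(miss_data, text)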
The next step involves removing the hyperlinks embedded in tweets to increase the effectiveness of the analysis. At more advanced levels, you can analyze emojis as well as other pieces of data; in this project, however, I focused only on the most obvious data: the words themselves. As I removed the unwanted hyperlinks using the gsub function, I placed the results in a new column called stripped_text to verify that the cleaning had worked correctly.
Sent_miss$stripped_text <- gsub("http.*","", Sent_miss$text)
Sent_miss$stripped_text <- gsub("https.*","", Sent_miss$stripped_text)
After examining the now-cleaned data frame, I subset it again into a new data frame containing only the stripped_text column.
Sent_miss1 <- Sent_miss[c(1:35341),c(2)]
Finally we get to measuring sentiment. For this portion I utilized the syuzhet package. I first converted the cleaned data frame into a character vector so the sentiment analysis would work; I noticed that passing the data frame directly, or converting it another way, produced an error, so this step is crucial. I then used the get_nrc_sentiment function on the new character object. I chose get_nrc_sentiment because it is the most detailed option: it scores each tweet on eight different emotions as well as a negative and a positive value, which allows analysis of specific types of sentiment in addition to overall negative and positive totals for even larger projects. The ten values it provides for each tweet are anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, and positive.
sent_miss2 <- as.character(Sent_miss1)
emotion.df <- get_nrc_sentiment(sent_miss2)
emotion.df2 <- cbind(Sent_miss1, emotion.df)
Next, I conducted some analysis on the top positive tweet. The code and the result are shown below.
sent.value <- get_sentiment(sent_miss2)
most.positive <- sent_miss2[sent.value == max(sent.value)]
most.positive
I next repeated the same process for the most negative tweet. The code and result are shown below.
most.negative <- sent_miss2[sent.value <= min(sent.value)]
most.negative
The final step in sentiment analysis is totaling the values across the whole data set and sorting them into positive, negative, and neutral sentiment. Conveniently, the get_sentiment function scores positive tweets with a positive number and negative tweets with a negative number, which makes the process much easier. I then created a table of the totals, which is printed below.
category_senti <- ifelse(sent.value < 0, "Negative", ifelse(sent.value > 0, "Positive", "Neutral"))
table(category_senti)
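Since get_nrc_sentiment also returns the eight emotion columns described earlier, a quick way to summarize them across all tweets (not shown in the original output) is to sum each column:
# tally the eight NRC emotion categories across all tweets
colSums(emotion.df[, c("anger", "anticipation", "disgust", "fear",
                       "joy", "sadness", "surprise", "trust")])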
Sentiment Analysis Thoughts
The results were actually quite shocking. While the sentiment analysis code clearly worked, as seen from the top positive and negative tweets, the number of tweets with an overall positive score far surpasses those with a negative score, which was not what I expected to find in such a divisive election. In a report released by the Pew Research Center after the 2012 election, the results were almost the opposite, with negative tweets nearly double the positive ones, as shown in the chart below. This indicates that in this election, despite the divisive comments and racially charged terms appearing among the top words in both the Republican and Democratic analyses, most of the tweets overall were categorized as positive, in contrast to the data from Pew's report. This result contradicted what I expected from reading articles and reports on social media posting and challenges the notion that most political tweets are negative. An interesting follow-up would be to compare the amount of positive and negative tweets leading up to election day, when more politically active users are posting and the results would be less diluted by news organizations' postings or congratulatory remarks.
Now that I have explained the steps for analyzing the data and visualizing it in word networks, I will explain how I examined the outlier in the data: the early-morning spike on November 28th.
First, I subset the data frame into a new one based on the times of the tweets. What I found interesting is that although my raw data and ts_plot indicated the time frame was roughly 3:00 AM to 4:30 AM, when I tried subsetting those times I received a data frame that was five hours off, because Twitter records timestamps in UTC rather than Eastern time (EST is UTC-5, so the recorded times run five hours ahead of local time). Below you can see the two plots I created as I narrowed down the data set to ensure I was capturing the right surge in tweets.
miss_russian9 <- subset(miss_data,created_at < "2018-11-28 00:30:00" & created_at > "2018-11-27 22:00:00")
ts_plot(miss_russian9, "10 mins") +
ggplot2::theme_light() +
ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
ggplot2::labs(
x = NULL, y = NULL,
title = "Frequency of Mississippi Special Election Twitter statuses from November 28th",
subtitle = "Twitter status (tweet) counts aggregated using 10 minute intervals",
caption = "\nSource: Data collected from Twitter's REST API via rtweet"
)
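An alternative to adjusting the subset bounds by hand is to convert the created_at timestamps (which the API returns in UTC) to Eastern time first; here is a sketch assuming the lubridate package, which is not otherwise used in this post:
# convert the UTC timestamps to US Eastern time before subsetting
library(lubridate)
miss_data$created_at_local <- with_tz(miss_data$created_at, tzone = "America/New_York")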
Now that I had confirmed I had the right subset of tweets isolated by posting time, I analyzed the top 20 words and created a word network. The steps and results are shown below.
# First, remove http elements manually
miss_russian9$stripped_text <- gsub("http.*","", miss_russian9$text)
miss_russian9$stripped_text <- gsub("https.*","", miss_russian9$stripped_text)
#clean punctuation and split tweets into individual words
miss_russian9_clean <- miss_russian9 %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)
#remove stop words
miss_russian9_clean_w <- miss_russian9_clean %>%
anti_join(stop_words)
# Finally, plot the top 20 words
miss_russian9_clean_w %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(y = "Count",
x = "Unique words",
title = "Most Common words found in Mississippi Special Election tweets",
subtitle = '"Miss_Elections" , "IDK" tops our most used words"',
caption = "\nSource: Data collected from Twitter's REST API via rtweet")
# remove punctuation, convert to lowercase (yes, again), and tokenize each tweet into paired words
miss_r_cleaned_tweet_words_1 <- miss_russian9 %>%
dplyr::select(stripped_text) %>%
unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)
miss_r_cleaned_tweet_words_1 %>%
count(paired_words, sort = TRUE)
miss_r_cleaned_tweet_words_2 <- miss_r_cleaned_tweet_words_1 %>%
separate(paired_words, c("word1", "word2"), sep = " ")
miss_r_tweets_filtered <- miss_r_cleaned_tweet_words_2 %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
miss_r_word_count <- miss_r_tweets_filtered %>%
count(word1, word2, sort = TRUE)
# plot word network
miss_r_word_count %>%
filter(n >= 24) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
geom_node_point(color = "darkslategray4", size = 3) +
geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
labs(title = "Word Network: Tweets based on top hastags and keywords from the 2018 Special Senate Election in Mississippi ",
subtitle = "Text mining twitter data ",
x = "", y = "", caption = "\nSource: Data collected from Twitter's REST API via rtweet")
The results of this subset of tweets led me to rule out any “foreign interference” and instead conclude that the majority of the tweets were late-night and early-morning news-based tweets about the election results. While this isn't a juicy result, it is still interesting, not only because of the time of posting but also because of the influence these news tweets had on the overall volume of tweets about the election. It also offers an interesting perspective: more tweets were posted after the results were confirmed than before, which leads me to believe that most election-related tweets were less about advocating for change or urging individuals to vote and more about congratulating or complaining about the eventual winner.
Based on the various types of analysis of the Twitter data from the 2018 Mississippi Special Senate Election, there are several interesting findings. First, contrary to Pew Research and my prior beliefs, most of the tweets were positive in nature. This is especially surprising given the racially charged remarks made during the election and the high profile of this midterm race during Trump's presidency. Second, of the nine days of tweets analyzed, most were posted after the election had been called rather than before, which implies most tweets come not from political activists urging people to vote but from individuals congratulating or complaining about the winner. This connects to interesting research on social media as a tool for venting frustrations rather than encouraging action in the real world. Finally, some interesting comparisons emerged between Republican and Democratic tweets in both their semantic content and their volume. While Mississippi is a very red state, Pew Research's findings would suggest a more substantial number of Democratic tweets gathered from outside the state; based on both the size of the collections keyed to partisan and campaign hashtags and the frequency with which the Democratic candidate was mentioned, this did not appear to be true. There are many more examinations to be made of this data, and further ways to manipulate and visualize connections between tweets, but based on my analysis, many assumptions about partisan social media posting, in both sentiment and volume, need to be re-examined.
The following tutorials were utilized to help create this project:
Analyzing Trends, Word Networks and Data Mining
In addition, the following resources are excellent references for further exploration into the subject:
Finally, the following sources were used for academic or informational content in this posting.
Pew Research Report regarding users
Pew Research Report regarding Sentiment & Image Source
This project was completed as an Assignment for Dr. Michael McDonald’s course at the University of Florida on Election Data Science.