library(rtweet)
library(tidytext)
library(stringr)
library(dplyr)
library(ggplot2)
app <- "Ngolaz"
# Credentials are read from environment variables rather than hard-coded.
# The variable names below are placeholders; define them in ~/.Renviron.
consumer_key <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret <- Sys.getenv("TWITTER_ACCESS_SECRET")
my_token <- create_token(app = app,
consumer_key = consumer_key,
consumer_secret = consumer_secret,
access_token = access_token,
access_secret = access_secret)
identical(my_token, get_token())
## [1] FALSE
Prior to beginning this analysis, I perceived the major support base of minority political candidates to be mostly concerned with racial issues. As a result, I selected Kamala Harris and Andrew Yang for this analysis: two presidential aspirants from minority backgrounds, African American and Asian American respectively. My goal is to understand whether both groups of supporters reference race in similar proportions.
Let’s begin by pulling tweets with the hashtags #KamalaHarris and #AndrewYang.
num_tweets <- 2000
mt <- search_tweets('#Kamalaharris', n = num_tweets, include_rts = FALSE)
mt2 <- search_tweets('#AndrewYang', n = num_tweets, include_rts = FALSE)
#mt3 <- search_tweets('kamala harris black women', n = num_tweets, include_rts = FALSE)
#mt4 <- search_tweets('kamala harris race', n = num_tweets, include_rts = FALSE)
#head(mt)
#length(mt2)
#head(mt3)
First, I wanted to see how prominently words such as black, white, and asian (racial identities) appear in their tweets. Note: the words black and white can be ambiguous, so further analysis follows to understand how they are used.
#Top 20 words amongst tweets with #KamalaHarris
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
kamala_words <- mt %>% select(status_id, text) %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
kamala_words %>% group_by(word) %>% summarize(n = n()) %>% arrange(desc(n)) %>% top_n(20)
## Selecting by n
## # A tibble: 21 x 2
## word n
## <chr> <int>
## 1 #kamalaharris 1980
## 2 @kamalaharris 223
## 3 kamala 220
## 4 campaign 189
## 5 harris 180
## 6 race 178
## 7 black 175
## 8 #khive 150
## 9 people 127
## 10 candidate 106
## # ... with 11 more rows
#Top 20 words amongst tweets with #AndrewYang
andrew_words <- mt2 %>% select(status_id, text) %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
andrew_words %>% group_by(word) %>% summarize(n = n()) %>% arrange(desc(n)) %>% top_n(20)
## Selecting by n
## # A tibble: 20 x 2
## word n
## <chr> <int>
## 1 #andrewyang 1949
## 2 #yanggang 814
## 3 #yang2020 526
## 4 @andrewyang 405
## 5 yang 292
## 6 #humanityfirst 189
## 7 people 161
## 8 #ubi 156
## 9 andrew 152
## 10 #yanggang2020 134
## 11 #andrewyang2020 127
## 12 #freedomdividend 121
## 13 candidate 106
## 14 vote 96
## 15 #yangganglove 85
## 16 time 85
## 17 #math 81
## 18 campaign 80
## 19 president 77
## 20 support 66
Excluding #kamalaharris, @kamalaharris, and kamala, the counts of the 10 most frequent remaining words in tweets with #KamalaHarris sum to 1302. About 20% of that total comes from words referring to race (black and white). For Andrew Yang, by contrast, none of the 20 most frequent words in tweets with #AndrewYang reference race.
The words black and white could mean multiple things, so for the next steps I’ll conduct a few more analyses to show that they most likely refer to race. First, let’s use the screen name to learn more about the top tweeters of the hashtag #kamalaharris.
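The share calculation described here can be sketched in a few lines. This is a minimal illustration, not the code used to produce the figures above: the counts in `word_counts` are made up, and the `name_terms` and `race_terms` vectors are assumed helper lists.

```r
library(dplyr)

# Illustrative counts only (not the real data): a tiny stand-in for the
# word-frequency table produced by the tokenizing pipeline.
word_counts <- tibble(
  word = c("#kamalaharris", "campaign", "black", "white", "people"),
  n    = c(50L, 20L, 15L, 5L, 10L)
)

# Assumed helper lists: tokens to drop, and tokens treated as race terms.
name_terms <- c("#kamalaharris", "@kamalaharris", "kamala")
race_terms <- c("black", "white", "asian")

# Keep the 10 most frequent words after removing the candidate's name.
top_words <- word_counts %>%
  filter(!word %in% name_terms) %>%
  slice_max(n, n = 10, with_ties = FALSE)

# Share of the summed counts that comes from race terms.
race_share <- top_words %>%
  summarize(share = sum(n[word %in% race_terms]) / sum(n)) %>%
  pull(share)

race_share
```

Applied to the real `kamala_words` counts, the same steps would give the proportion quoted in the text, provided the same words are excluded.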
mt %>%
group_by(screen_name) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
slice(1:10)
## # A tibble: 10 x 2
## screen_name n
## <chr> <int>
## 1 SophiaLamar1 43
## 2 blackwomenviews 29
## 3 Kennyboo93B 15
## 4 shuboogie 14
## 5 seandrayton 12
## 6 chanteezy 10
## 7 HainesForSF 10
## 8 MAGA4EVERY1 10
## 9 TheBasedLiberal 10
## 10 janna_bastone 8
Let’s read the profile descriptions of some of these users:
head(
mt$description[mt$screen_name=='blackwomenviews'],1)
## [1] "<U+0001F449><U+0001F3FE> Follow for the latest in Black Entertainment <U+0001F37F> Culture <U+270A><U+0001F3FE> Politics <U+0001F4AF> & News Commentary <U+0001F4F0>\nYoutube/IG/FB @ blackwomenviews"
head(
mt$description[mt$screen_name=='chanteezy'],1)
## [1] "Seasoned Non-Professional, Pro-Black; down for the revolution <U+270A><U+0001F3FF>"
head(
mt$description[mt$screen_name=='HainesForSF'],1)
## [1] "#SF #Native #HainesForSF 2020 restoring power with our people - join us | Text Squad to 66599"
head(
mt$description[mt$screen_name=='TheBasedLiberal'],1)
## [1] "A Very Good Boy\nPronouns: Barron/Master\nNever Accept The Leftist's Premise\nGaming+Anime+Politics=My Hobbies\nI May Or May Not Work For The God Emperor"
head(
mt$description[mt$screen_name=='janna_bastone'],1)
## [1] "Defining deviancy down is dangerous to democracy & humanity. The bar hasn’t been lowered. Individual-1 violently ripped it out and beats America with it daily."
Among the top individuals who tweeted the hashtag #kamalaharris, the profile descriptions of the majority seem to indicate an interest in race. This suggests that the words black and white probably refer to race. A second analysis would be to find out how often each of the top users refers to race in their tweets. Because of the scope of this course, I will not be conducting this analysis, since it would involve aggregating data from multiple profiles.
To finalize this work, I’ll look at the location of the users who tweeted the hashtag #kamalaharris. My goal is to see whether they are in locations where race is a popular topic.
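A rough version of that per-user check could be run on the tweets already collected, without pulling each user's full timeline. The sketch below uses made-up rows in place of `mt`, and the `race_pattern` term list is an assumption. It also shows why simple matching is only approximate: "White House" counts as a match.

```r
library(dplyr)
library(stringr)

# Illustrative rows only; in the analysis this would be `mt`,
# which has screen_name and text columns.
tweets <- tibble(
  screen_name = c("user_a", "user_a", "user_b", "user_b", "user_b"),
  text = c("Black voters back #KamalaHarris",
           "Debate night #KamalaHarris",
           "The campaign is over #KamalaHarris",
           "White House run ends #KamalaHarris",
           "Polls today #KamalaHarris")
)

# Assumed pattern: whole-word, case-insensitive match on a few race terms.
race_pattern <- regex("\\b(black|white|asian)\\b", ignore_case = TRUE)

# For each user, the fraction of their sampled tweets that match the pattern.
race_by_user <- tweets %>%
  mutate(mentions_race = str_detect(text, race_pattern)) %>%
  group_by(screen_name) %>%
  summarize(n_tweets = n(), race_share = mean(mentions_race)) %>%
  arrange(desc(race_share))

race_by_user
```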
mt%>%
count(location, sort = TRUE) %>%
mutate(location = reorder(location,n)) %>%
na.omit() %>%
top_n(5) %>%
ggplot(aes(x = location,y = n)) +
geom_col() +
scale_y_sqrt() +
coord_flip() +
labs(x = "Location",
y = "Count",
title = "#kamalaHarris Tweet Locations")
## Selecting by n
New York City and Los Angeles are two of the largest and most diverse metropolitan areas in the United States, so it is plausible that tweets from these locations are about race, although location alone does not confirm it.
Conclusion: It’s interesting to see that for two sets of supporters of minority candidates, the issues of interest are not homogeneous. Kamala’s base has a high interest in race and women’s issues, but Yang’s tends to resemble that of the majority (not shown in this analysis).
For future analysis, I recommend using the context in which the words are used to resolve the ambiguity.
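One simple way to look at context is to examine the word that follows black or white. The sketch below uses tidytext bigrams on made-up tweets. It is an illustration under those assumptions, not part of the analysis above.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Illustrative tweets only; in the analysis this would be `mt`.
tweets <- tibble(
  status_id = c("1", "2", "3"),
  text = c("Black women support her campaign",
           "The White House race is over",
           "She speaks for black voters")
)

# Split each tweet into two-word sequences (lowercased by default) and keep
# those that start with "black" or "white", so the following word is visible.
context_pairs <- tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(str_detect(bigram, "^(black|white) ")) %>%
  count(bigram, sort = TRUE)

context_pairs
```

Pairs such as "black women" or "black voters" point toward a racial meaning, while "white house" does not, so a count of these pairs gives a rough sense of how often each reading occurs.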