Assignment #4 MBA 676

library(rtweet)
library(tidytext)
library(stringr)
library(dplyr) 
library(ggplot2)

# Read the API credentials from environment variables instead of hard-coding them.
# The variable names below are assumptions; define them in ~/.Renviron or the shell.
app <- "Ngolaz"
consumer_key <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret <- Sys.getenv("TWITTER_ACCESS_SECRET")
my_token <- create_token(app = app,
            consumer_key = consumer_key,
            consumer_secret = consumer_secret,
            access_token = access_token,
            access_secret = access_secret)

identical(my_token, get_token())

## [1] FALSE

Prior to beginning this analysis, I assumed that the main supporter base of minority political candidates would be mostly concerned with racial issues. I therefore selected Kamala Harris and Andrew Yang for this analysis: two presidential aspirants from minority backgrounds, African American and Asian American respectively. My goal is to understand whether both groups of supporters reference race in similar proportions.

Let’s begin by pulling tweets with the hashtags #KamalaHarris and #AndrewYang.

num_tweets <- 2000
mt <- search_tweets('#Kamalaharris', n = num_tweets, include_rts = FALSE)
mt2 <- search_tweets('#AndrewYang', n = num_tweets, include_rts = FALSE)
#mt3 <- search_tweets('kamala harris black women', n = num_tweets, include_rts = FALSE)
#mt4 <- search_tweets('kamala harris race', n = num_tweets, include_rts = FALSE)
#head(mt)
#length(mt2)
#head(mt3)

First, I wanted to see what percentage of the tweets referenced words such as black, white, and asian (racial identities). Note: the words black and white can be ambiguous, so we shall do further analysis to understand their usage.

#Top 20 words amongst tweets with #KamalaHarris
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
kamala_words <- mt %>% select(status_id, text) %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

kamala_words %>% group_by(word) %>% summarize(n = n()) %>% arrange(desc(n)) %>% top_n(20)

## Selecting by n

## # A tibble: 21 x 2
##    word              n
##    <chr>         <int>
##  1 #kamalaharris  1980
##  2 @kamalaharris   223
##  3 kamala          220
##  4 campaign        189
##  5 harris          180
##  6 race            178
##  7 black           175
##  8 #khive          150
##  9 people          127
## 10 candidate       106
## # ... with 11 more rows

#Top 20 words amongst tweets with #AndrewYang

andrew_words <- mt2 %>% select(status_id, text) %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

andrew_words %>% group_by(word) %>% summarize(n = n()) %>% arrange(desc(n)) %>% top_n(20)

## Selecting by n

## # A tibble: 20 x 2
##    word                 n
##    <chr>            <int>
##  1 #andrewyang       1949
##  2 #yanggang          814
##  3 #yang2020          526
##  4 @andrewyang        405
##  5 yang               292
##  6 #humanityfirst     189
##  7 people             161
##  8 #ubi               156
##  9 andrew             152
## 10 #yanggang2020      134
## 11 #andrewyang2020    127
## 12 #freedomdividend   121
## 13 candidate          106
## 14 vote                96
## 15 #yangganglove       85
## 16 time                85
## 17 #math               81
## 18 campaign            80
## 19 president           77
## 20 support             66

Excluding the words #kamalaharris, @kamalaharris, and kamala, the counts of the 10 most frequent words in tweets with #KamalaHarris sum to 1302. About 20% of that sum comes from words referring to race (white and black), whereas for Andrew Yang none of the 20 most frequent words in tweets with #AndrewYang references race.

The words black and white could mean multiple things, so in the next steps I’ll conduct a few more analyses to show that they most likely refer to race. First, let’s use the screen name to learn more about the top tweeters of the hashtag #kamalaharris.
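
The share calculation described above can be sketched as follows. This is only an illustration: it uses the ten rows visible in the printed #KamalaHarris output (rows 11 to 21 are not shown), and the `name_terms` and `race_terms` vectors are assumptions introduced here, so the result is not expected to reproduce the 1302 total or the 20% figure.

```r
library(dplyr)

# Rows copied from the printed #KamalaHarris output (only the ten visible rows).
kamala_top <- data.frame(
  word = c("#kamalaharris", "@kamalaharris", "kamala", "campaign", "harris",
           "race", "black", "#khive", "people", "candidate"),
  n = c(1980, 223, 220, 189, 180, 178, 175, 150, 127, 106),
  stringsAsFactors = FALSE
)

# Assumed term lists, chosen for illustration only.
name_terms <- c("#kamalaharris", "@kamalaharris", "kamala")
race_terms <- c("black", "white", "asian")

# Drop the candidate's own name and handles, then compute the race-term share.
race_share <- kamala_top %>%
  filter(!word %in% name_terms) %>%
  summarize(total = sum(n),
            race_total = sum(n[word %in% race_terms]),
            share = race_total / total)

print(race_share)
```

The word race is deliberately left out of `race_terms` here, because in campaign tweets it may just as well mean the presidential race.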

mt %>%
  group_by(screen_name) %>% 
  summarize(n = n()) %>%
  arrange(desc(n)) %>% 
  slice(1:10)

## # A tibble: 10 x 2
##    screen_name         n
##    <chr>           <int>
##  1 SophiaLamar1       43
##  2 blackwomenviews    29
##  3 Kennyboo93B        15
##  4 shuboogie          14
##  5 seandrayton        12
##  6 chanteezy          10
##  7 HainesForSF        10
##  8 MAGA4EVERY1        10
##  9 TheBasedLiberal    10
## 10 janna_bastone       8

Let’s read the descriptions of these users:

head(
mt$description[mt$screen_name=='blackwomenviews'],1)

## [1] "<U+0001F449><U+0001F3FE> Follow for the latest in Black Entertainment <U+0001F37F> Culture <U+270A><U+0001F3FE> Politics <U+0001F4AF> & News Commentary <U+0001F4F0>\nYoutube/IG/FB @ blackwomenviews"

head(
mt$description[mt$screen_name=='chanteezy'],1)

## [1] "Seasoned Non-Professional, Pro-Black; down for the revolution <U+270A><U+0001F3FF>"

head(
mt$description[mt$screen_name=='HainesForSF'],1)

## [1] "#SF #Native #HainesForSF 2020 restoring power with our people - join us | Text Squad to 66599"

head(
mt$description[mt$screen_name=='TheBasedLiberal'],1)

## [1] "A Very Good Boy\nPronouns: Barron/Master\nNever Accept The Leftist's Premise\nGaming+Anime+Politics=My Hobbies\nI May Or May Not Work For The God Emperor"

head(
mt$description[mt$screen_name=='janna_bastone'],1)

## [1] "Defining deviancy down is dangerous to democracy & humanity. The bar hasn’t been lowered. Individual-1 violently ripped it out and beats America with it daily."

Out of the top individuals who tweeted the hashtag #kamalaharris, the profile descriptions of the majority seem to indicate an interest in race. This suggests that the words black and white probably refer to race. A second analysis would be to find out how often each of the top users refers to race in their tweets. Given the scope of this course, I will not conduct that analysis, since it would involve aggregating data from multiple profiles.

To finalize this work, I’ll look at the locations of the users who tweeted the hashtag #kamalaharris. My goal is to see whether they are in locations where race is a popular topic.
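
For reference, a reduced version of that second analysis, limited to tweets already collected rather than full user timelines, might look like the sketch below. The `tweets` data frame is a hypothetical stand-in for `mt` (screen names and texts are invented), and the list of race terms is an assumption.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for `mt`; screen names and texts are invented for illustration.
tweets <- data.frame(
  screen_name = c("user_a", "user_a", "user_a", "user_b", "user_b"),
  text = c("Black women will decide this primary",
           "Watching the debate tonight",
           "White voters and Black voters differ on this",
           "The race for the nomination is tight",
           "Blackout in my neighborhood again"),
  stringsAsFactors = FALSE
)

# Whole-word, case-insensitive match on the assumed race terms.
race_pattern <- regex("\\b(black|white|asian)\\b", ignore_case = TRUE)

# Share of each user's tweets that mention at least one race term.
race_by_user <- tweets %>%
  mutate(mentions_race = str_detect(text, race_pattern)) %>%
  group_by(screen_name) %>%
  summarize(n_tweets = n(),
            race_tweets = sum(mentions_race),
            share = race_tweets / n_tweets) %>%
  arrange(desc(share))

print(race_by_user)
```

The word-boundary anchors keep words such as "Blackout" from counting as a race reference, but they cannot tell "black dress" from "Black voters", so this remains a rough indicator.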

mt%>%
  count(location, sort = TRUE) %>%
  mutate(location = reorder(location,n)) %>%
  na.omit() %>%
  top_n(5) %>%
  ggplot(aes(x = location,y = n)) +
  geom_col() +
  scale_y_sqrt() +
  coord_flip() +
  labs(x = "Location",
       y = "Count",
       title = "#kamalaHarris Tweet Locations")

## Selecting by n

New York City and Los Angeles are two of the largest and most diverse metropolitan areas in the United States. It is plausible that a tweet from these locations is about race.

Conclusion: It’s interesting to see that, for two sets of supporters of minority candidates, the issues of interest are not homogeneous. Kamala’s base has a high interest in race and women’s issues, but Yang’s tends to resemble that of the majority (not shown in this analysis).

For future analysis, I recommend using the context in which the words are used to resolve the ambiguity.
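
One simple way to look at context is to tabulate the word that immediately follows black or white. The sketch below is a minimal illustration under stated assumptions: the example texts are invented, and in practice the `text` column of `mt` would take their place.

```r
library(stringr)

# Invented example texts; in practice this would be mt$text.
texts <- c("Black women are the backbone of the party",
           "She wore a black dress to the rally",
           "White voters in Iowa are undecided")

# Capture the target term and the word that immediately follows it.
context_pattern <- regex("\\b(black|white)\\s+(\\w+)", ignore_case = TRUE)
matches <- do.call(rbind, str_match_all(texts, context_pattern))

# Column 2 holds the term, column 3 the following word.
context <- data.frame(term = tolower(matches[, 2]),
                      next_word = tolower(matches[, 3]),
                      stringsAsFactors = FALSE)

# Cross-tabulate terms against the words that follow them.
print(table(context$term, context$next_word))
```

Following words such as "women" or "voters" point to the racial sense, while a word such as "dress" points to a colour, so a table like this gives a first indication of how often each sense occurs.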

JOY.OFIELU

11/22/2019