Voting on Question 1
In light of the recent voting, I thought I’d look at what Twitter had to say about legalizing recreational marijuana.
I searched Twitter for the last 1000 #Legalization tweets.
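Before searching, the twitteR package needs to authenticate against the Twitter API. A minimal setup sketch, assuming an app has already been registered (the key and token values below are placeholders, not real credentials):
library(twitteR)
# Placeholder credentials from a registered Twitter app
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"
# Authorize this R session against the Twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)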
# Pull the most recent 1,000 tweets containing #Legalization
n_tweets <- 1000
legalization <- searchTwitter('#Legalization', n = n_tweets)
There weren’t many results when I searched for tweets containing both #Legalization and a reference to a state, so I instead searched for the last 3000 #Legalization-plus-state tweets for each state that has legalized recreational marijuana.
num_tweets <- 3000
legal_me <- searchTwitter('"#Legalization" "maine"', n = num_tweets)
me_df <- twListToDF(legal_me)
# list of 49
legal_ca <- searchTwitter('"#Legalization" "california"', n = num_tweets)
ca_df <- twListToDF(legal_ca)
# list of 553
legal_nv <- searchTwitter('"#Legalization" "nevada"', n = num_tweets)
nv_df <- twListToDF(legal_nv)
# list of 299
legal_ma <- searchTwitter('"#Legalization" "massachusetts"', n = num_tweets)
## [1] "Rate limited .... blocking for a minute and retrying up to 119 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 118 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 117 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 116 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 115 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 114 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 113 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 112 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 111 times ..."
ma_df <- twListToDF(legal_ma)
# list of 296
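The “Rate limited” messages above are searchTwitter pausing and retrying once Twitter’s search API limit is hit. The number of retries can be set explicitly with the retryOnRateLimit argument; a sketch of the same Massachusetts search with it spelled out:
# Retry up to 120 times (roughly once a minute) if the search API rate limit is hit
legal_ma <- searchTwitter('"#Legalization" "massachusetts"', n = num_tweets,
                          retryOnRateLimit = 120)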
legal_wa <- searchTwitter('"#Legalization" "washington"', n = num_tweets)
# list of 9
legal_or <- searchTwitter('"#Legalization" "oregon"', n = num_tweets)
# list of 4
legal_co <- searchTwitter('"#Legalization" "colorado"', n = num_tweets)
# list of 28
legal_ak <- searchTwitter('"#Legalization" "alaska"', n = num_tweets)
# list of 3
Interestingly enough, the number of tweets referencing both #Legalization and a state is roughly proportional to the percentage by which the question passed in that state. The states that voted to legalize in 2016 had, by far, the most tweets.
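As a rough check on that observation, the per-state result counts noted in the comments above can be pulled into a single data frame (a quick sketch using length() on the result lists already created):
# Number of tweets returned for each state search
state_counts <- data.frame(
  state  = c("Maine", "California", "Nevada", "Massachusetts",
             "Washington", "Oregon", "Colorado", "Alaska"),
  tweets = c(length(legal_me), length(legal_ca), length(legal_nv), length(legal_ma),
             length(legal_wa), length(legal_or), length(legal_co), length(legal_ak)),
  stringsAsFactors = FALSE
) %>% arrange(desc(tweets))
state_counts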
I converted my list of 1000 #Legalization tweets to a data frame.
legalization_df <- twListToDF(legalization)
Then I looked at the count of tweets by platform.
The code below shows the top 10 platforms used for the 1,000 tweets and what percentage of the tweets came from each platform.
# Extract the platform name from the statusSource HTML anchor
legalization_df$statusSource <- substr(legalization_df$statusSource,
                                       regexpr('>', legalization_df$statusSource) + 1,
                                       regexpr('</a>', legalization_df$statusSource) - 1)
legalization_platform <- legalization_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n/sum(n)) %>%
arrange(desc(n))
legalization_platform$percent_of_tweets <- round(legalization_platform$percent_of_tweets, digits = 3)
legal_plat_10 <- legalization_platform %>% top_n(10)
kable(legal_plat_10)
| statusSource | n | percent_of_tweets |
|---|---|---|
| Twitter Web Client | 199 | 0.218 |
| Twitter for Android | 149 | 0.164 |
| Twitter for iPhone | 141 | 0.155 |
| Hootsuite | 88 | 0.097 |
| IFTTT | 58 | 0.064 |
| The Social Jukebox | 49 | 0.054 |
| dlvr.it | 40 | 0.044 |
| Buffer | 32 | 0.035 |
| TweetDeck | 22 | 0.024 |
| Twitter for iPad | 21 | 0.023 |
I created a table of words and their sentiments and stored it as the nrc data frame.
nrc <- sentiments %>%
filter(lexicon == "nrc") %>%
select(word, sentiment)
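This relies on the version of tidytext whose sentiments data frame bundled all of the lexicons. In more recent tidytext releases the NRC lexicon is fetched separately (it downloads via the textdata package on first use), so an equivalent alternative would be:
library(tidytext)
# Newer tidytext versions: the NRC lexicon is downloaded/cached on first use
nrc <- get_sentiments("nrc")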
*The top 3 platforms were Twitter Web Client, Twitter for Android and Twitter for iPhone.
I extracted the tweets sent from the Twitter for Android and Twitter for iPhone apps.
tweets <- legalization_df %>%
  select(id, statusSource, text) %>%
  extract(statusSource, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone"))
I then divided the tweets into individual words and removed common “stopwords.”
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
tweet_words_sentiment <- tweet_words %>% inner_join(nrc, by = "word")
I did the same for the full set of #Legalization tweets, storing the result as legal_words.
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
legal_words <- legalization_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
tbl_legal_words <- legal_words %>%
  group_by(word) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  top_n(10)
pander(tbl_df(tbl_legal_words))
| word | n |
|---|---|
| #legalization | 884 |
| rt | 362 |
| #cannabis | 344 |
| #marijuana | 285 |
| marijuana | 78 |
| #pot | 76 |
| #mmj | 73 |
| #weed | 72 |
| #california | 67 |
| https | 60 |
I joined nrc to legal_words to look at the different sentiment counts.
legal_words_sentiments <- legal_words %>% inner_join(nrc, by = "word")
tbl_legal_word_sent <- legal_words_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% arrange(desc(n))
pander(tbl_df(tbl_legal_word_sent))
| sentiment | n |
|---|---|
| positive | 486 |
| trust | 347 |
| anticipation | 266 |
| negative | 189 |
| fear | 157 |
| joy | 146 |
| anger | 109 |
| surprise | 69 |
| sadness | 58 |
| disgust | 50 |
The code below shows the first 5 positive and the first 5 negative tweets.
pos_tw_ids <- legal_words_sentiments %>%
  filter(sentiment == "positive") %>%
  distinct(id, word)
pos_legal_df <- legalization_df %>%
  inner_join(pos_tw_ids, by = "id") %>%
  select(text, word) %>%
  slice(1:5)
pander(tbl_df(pos_legal_df))
| text | word |
|---|---|
| AXIM Biotechnologies Granted U.S. Patent for #MMJ Chewing Gum Products - https://t.co/qtOUs3owHk #witloncannabis #legalization | granted |
| AXIM Biotechnologies Granted U.S. Patent for #MMJ Chewing Gum Products - https://t.co/qtOUs3owHk #witloncannabis #legalization | patent |
| #Recreational #cannabis #legalization in Canada could create a $22.6 Billion economic impact, set to begin in 2017.… https://t.co/08fuRAtEA2 | create |
| RT @JUJUJoints: America’s first #cannabis-friendly bars are coming to Denver > https://t.co/hiEWVs9FiZ #legalization #cannabiscommunity htt… | friendly |
| RT @JUJUJoints: America’s first #cannabis-friendly bars are coming to Denver > https://t.co/hiEWVs9FiZ #legalization #cannabiscommunity htt… | friendly |
neg_tw_ids <- legal_words_sentiments %>%
  filter(sentiment == "negative") %>%
  distinct(id, word)
neg_legal_df <- legalization_df %>%
  inner_join(neg_tw_ids, by = "id") %>%
  select(text, word) %>%
  slice(1:5)
pander(tbl_df(neg_legal_df))
| text | word |
|---|---|
| @CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | bankrupt |
| @CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | government |
| @CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | wrought |
| RT @SamDCress: What do Texas Republicans have to fear with #legalization? @arepublicantx @SenateGOP @texasgreenleaf @ProgressTX @GregAbbott… | fear |
| What do Texas Republicans have to fear with #legalization? @arepublicantx @SenateGOP @texasgreenleaf @ProgressTX @GregAbbott_TX @DanPatrick | fear |
The code below measures sentiment on Android and iPhone platforms.
sources <- tweet_words %>%
group_by(source) %>%
mutate(total_words = n()) %>%
ungroup() %>%
distinct(id, source, total_words)
by_source_sentiment <- tweet_words %>%
inner_join(nrc, by = "word") %>%
count(sentiment, id) %>%
ungroup() %>%
complete(sentiment, id, fill = list(n = 0)) %>%
inner_join(sources) %>%
group_by(source, sentiment, total_words) %>%
summarize(words = sum(n)) %>%
ungroup()
by_source_sentiment <- by_source_sentiment %>% group_by(source) %>% mutate(tot_sentiment = sum(words))
by_source_sentiment <- by_source_sentiment %>% mutate(percent_of_tweets = (words / tot_sentiment) * 100)
by_source_sentiment$percent_of_tweets <- round(by_source_sentiment$percent_of_tweets, digits = 2)
This measures sentiment for the #Legalization tweets sent from the Android, iPhone and Web Client platforms.
pf <- c("Twitter for Android", "Twitter for iPhone", "Twitter Web Client")
pf_legal_words <- legal_words %>% filter(statusSource %in% pf)
legal_sources <- legal_words %>%
group_by(statusSource) %>%
mutate(total_words = n()) %>%
ungroup() %>%
distinct(id, statusSource, total_words)
legal_source_sentiment <- pf_legal_words %>%
inner_join(nrc, by = "word") %>%
count(sentiment, id) %>%
ungroup() %>%
complete(sentiment, id, fill = list(n = 0)) %>%
inner_join(legal_sources) %>%
group_by(statusSource, sentiment, total_words) %>%
summarize(words = sum(n)) %>%
ungroup()
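As with by_source_sentiment above, these counts can be turned into per-platform percentages before plotting; a sketch mirroring the earlier calculation:
legal_source_sentiment <- legal_source_sentiment %>%
  group_by(statusSource) %>%
  mutate(tot_sentiment = sum(words),                                       # total sentiment words per platform
         percent_of_tweets = round((words / tot_sentiment) * 100, 2)) %>%  # share of each sentiment
  ungroup()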
I used the state tweets collected above to prepare the Maine, Massachusetts, Nevada and California data.
Maine Data
#extract the platform
me_df$statusSource = substr(me_df$statusSource,
regexpr('>', me_df$statusSource) + 1,
regexpr('</a>', me_df$statusSource) - 1)
me_platform <- me_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n / sum(n)) %>%
arrange(desc(n))
#extract the words and join to nrc sentiment words
me_words <- me_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
me_words_sentiments <- me_words %>% inner_join(nrc, by = "word")
Massachusetts Data
#extract the platform
ma_df$statusSource = substr(ma_df$statusSource,
regexpr('>', ma_df$statusSource) + 1,
regexpr('</a>', ma_df$statusSource) - 1)
ma_platform <- ma_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n / sum(n)) %>%
arrange(desc(n))
#extract the words and join to nrc sentiment words
ma_words <- ma_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
ma_words_sentiments <- ma_words %>% inner_join(nrc, by = "word")
Nevada Data
#extract the platform
nv_df$statusSource = substr(nv_df$statusSource,
regexpr('>', nv_df$statusSource) + 1,
regexpr('</a>', nv_df$statusSource) - 1)
nv_platform <- nv_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n / sum(n)) %>%
arrange(desc(n))
#extract the words and join to nrc sentiment words
nv_words <- nv_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
nv_words_sentiments <- nv_words %>% inner_join(nrc, by = "word")
California Data
#extract the platform
ca_df$statusSource = substr(ca_df$statusSource,
regexpr('>', ca_df$statusSource) + 1,
regexpr('</a>', ca_df$statusSource) - 1)
ca_platform <- ca_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n / sum(n)) %>%
arrange(desc(n))
#extract the words and join to nrc sentiment words
ca_words <- ca_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
ca_words_sentiments <- ca_words %>% inner_join(nrc, by = "word")
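The Maine, Massachusetts, Nevada and California blocks above repeat the same steps, so they could also be wrapped in a small helper. A sketch; prepare_state is a hypothetical name, and it relies on reg, nrc and stop_words defined earlier:
prepare_state <- function(df) {
  # Pull the platform name out of the statusSource HTML anchor
  df$statusSource <- substr(df$statusSource,
                            regexpr('>', df$statusSource) + 1,
                            regexpr('</a>', df$statusSource) - 1)
  # Count tweets by platform
  platform <- df %>%
    group_by(statusSource) %>%
    summarize(n = n()) %>%
    mutate(percent_of_tweets = n / sum(n)) %>%
    arrange(desc(n))
  # Tokenize the tweet text and join to the NRC sentiment words
  words_sentiments <- df %>%
    filter(!str_detect(text, '^"')) %>%
    mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
    unnest_tokens(word, text, token = "regex", pattern = reg) %>%
    filter(!word %in% stop_words$word, str_detect(word, "[a-z]")) %>%
    inner_join(nrc, by = "word")
  list(platform = platform, words_sentiments = words_sentiments)
}
# Example use: nv_prep <- prepare_state(nv_df); nv_prep$platform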
I combined the state data by creating a new State column in each data frame and binding the rows together.
me_platform$State <- "Maine"
ma_platform$State <- "Massachusetts"
nv_platform$State <- "Nevada"
ca_platform$State <- "California"
me_words_sentiments$State <- "Maine"
ma_words_sentiments$State <- "Massachusetts"
nv_words_sentiments$State <- "Nevada"
ca_words_sentiments$State <- "California"
platform <- rbind(me_platform, ma_platform, nv_platform, ca_platform)
words_sentiments <- rbind(me_words_sentiments, ma_words_sentiments, nv_words_sentiments, ca_words_sentiments)
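An equivalent way to do the combination, starting from the per-state data frames before the State columns are added manually, is dplyr’s bind_rows() with a named list, which creates the State column automatically via .id (shown here as a sketch):
# Same result as the manual State columns + rbind above
words_sentiments <- bind_rows(
  list(Maine = me_words_sentiments,
       Massachusetts = ma_words_sentiments,
       Nevada = nv_words_sentiments,
       California = ca_words_sentiments),
  .id = "State"
)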
Looking at sentiment by platform…
source_sentiment <- by_source_sentiment %>% select(source, sentiment, percent_of_tweets)
ggplot(source_sentiment, aes(x = source, y = percent_of_tweets, fill = sentiment)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette = "RdBu") +
  xlab("Platform") +
  ylab("Percent of Tweets") +
  theme(axis.text.x = element_text(hjust = 1)) +
  ggtitle("Sentiment of Android versus iPhone tweets")
*It looks like iPhone users have fewer angry tweets, but more negative tweets, than Android users.
Comparing the frequency of tweets by sentiment across the four states, I found that the frequency of negative and positive tweets roughly tracks the percentage that voted to pass the measure. CA passed with 56%, NV with 54%, MA with 54% and Maine with 50.3%. Nevada did have a slightly higher negative tweet frequency, but the measure passed by almost the same margin in MA and NV.
word_sent <- words_sentiments %>%
  group_by(State, sentiment) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n))
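To line the sentiment frequencies up against the vote shares quoted above, a small comparison table can be built (a sketch; the passing percentages are hard-coded from the text, and spread() comes from tidyr, which is already loaded for complete() above):
# Yes-vote share per state, as quoted in the text
vote_share <- data.frame(State = c("California", "Nevada", "Massachusetts", "Maine"),
                         percent_yes = c(56, 54, 54, 50.3),
                         stringsAsFactors = FALSE)
word_sent %>%
  ungroup() %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  select(State, sentiment, frequency) %>%
  spread(sentiment, frequency) %>%
  inner_join(vote_share, by = "State") %>%
  arrange(desc(percent_yes))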
# Stacked barplot
ggplot(word_sent, aes(x = sentiment, y = frequency, fill = State)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#8B0000", "#FF6A6A", "#00BFFF", "#104E8B")) +
  xlab("Sentiment") +
  ylab("Frequency of Tweets") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Frequency of Sentiment by State")
This shows the same information as the previous barplot, but in grouped rather than stacked form.
ggplot(word_sent, aes(x = sentiment, y = frequency, fill = State)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#FF3030", "#FFFF00", "#66CD00", "#4876FF")) +
  xlab("Sentiment") +
  ylab("Frequency of Tweets") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Frequency of Sentiment by State")
*Note: the barplots differ from those in the previous version because the tweets were pulled at a different time.
Looking at the Nevada “negative” words, the majority were “margin”, with a few occurrences of “error” as well. I’m guessing “margin” is in reference to the polling margins.
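That observation can be checked directly by counting the negative words in the Nevada tweets (a quick sketch using the nv_words_sentiments data frame built above):
# Most common NRC-negative words in the Nevada tweets
nv_words_sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE) %>%
  head(10)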