Voting on Question 1

In light of the recent voting, I thought I’d look at what Twitter had to say about legalizing recreational marijuana.

I searched Twitter for the last 1000 #Legalization tweets.

n_tweets <- 1000
legalization <- searchTwitter('#Legalization', n = n_tweets)
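
Note that searchTwitter() only works after authenticating with the Twitter API. A minimal setup sketch, assuming you have your own API credentials (the four values below are placeholders):

library(twitteR)

# Placeholder credentials -- substitute your own keys
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")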

There weren’t many results when I searched Twitter for tweets containing #Legalization and a reference to a state, so I searched for the last 3000 tweets containing #Legalization and the state name, for every state that has legalized recreational marijuana.

num_tweets <- 3000
legal_me <- searchTwitter('"#Legalization" "maine"', n = num_tweets)
me_df <- twListToDF(legal_me)
# list of 49
  
legal_ca <- searchTwitter('"#Legalization" "california"', n = num_tweets)
ca_df <- twListToDF(legal_ca)
# list of 553
  
legal_nv <- searchTwitter('"#Legalization" "nevada"', n = num_tweets)
nv_df <- twListToDF(legal_nv)
# list of 299
  
legal_ma <- searchTwitter('"#Legalization" "massachusetts"', n = num_tweets)
## [1] "Rate limited .... blocking for a minute and retrying up to 119 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 118 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 117 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 116 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 115 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 114 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 113 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 112 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 111 times ..."
ma_df <- twListToDF(legal_ma)
# list of 296

legal_wa <- searchTwitter('"#Legalization" "washington"', n = num_tweets)
# list of 9

legal_or <- searchTwitter('"#Legalization" "oregon"', n = num_tweets)
# list of 4

legal_co <- searchTwitter('"#Legalization" "colorado"', n = num_tweets)
# list of 28

legal_ak <- searchTwitter('"#Legalization" "alaska"', n = num_tweets)
# list of 3
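
The eight searches above repeat the same pattern, so they could also be written as a loop; a compact sketch, assuming the same authenticated twitteR session:

states <- c("maine", "california", "nevada", "massachusetts",
            "washington", "oregon", "colorado", "alaska")
# One query per state; searchTwitter() may rate-limit and retry as shown above
state_results <- lapply(states, function(s)
  searchTwitter(sprintf('"#Legalization" "%s"', s), n = num_tweets))
names(state_results) <- states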

Interestingly enough, the number of tweets referencing both #Legalization and a state roughly tracks the margin by which the question passed in that state. The states that voted to legalize in 2016 had by far the most tweets.
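
Collecting the result counts noted in the comments above into a single data frame makes the comparison easier to eyeball:

# Counts copied from the search results above
result_counts <- data.frame(
  state    = c("California", "Nevada", "Massachusetts", "Maine",
               "Colorado", "Washington", "Oregon", "Alaska"),
  n_tweets = c(553, 299, 296, 49, 28, 9, 4, 3)
)
result_counts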

I converted my list of 1000 #Legalization tweets to a data frame.

legalization_df <- twListToDF(legalization)

Then I looked at the number of tweets by platform.

The code below shows the top 10 platforms used for the 1000 tweets and the percentage of tweets that came from each platform.

legalization_df$statusSource = substr(legalization_df$statusSource,
        regexpr('>', legalization_df$statusSource) + 1,
        regexpr('</a>', legalization_df$statusSource) - 1)
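# For reference, raw statusSource values are HTML anchors such as
# '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>';
# the substr()/regexpr() call above keeps only the anchor text, i.e. the platform name.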
legalization_platform <- legalization_df %>% group_by(statusSource) %>% 
        summarize(n = n()) %>%
        mutate(percent_of_tweets = n/sum(n)) %>%
        arrange(desc(n))
legalization_platform$percent_of_tweets <- round(legalization_platform$percent_of_tweets, digits = 3)
legal_plat_10 <- legalization_platform %>% top_n(10)
kable(legal_plat_10)
statusSource             n   percent_of_tweets
--------------------  ----  ------------------
Twitter Web Client     199               0.218
Twitter for Android    149               0.164
Twitter for iPhone     141               0.155
Hootsuite               88               0.097
IFTTT                   58               0.064
The Social Jukebox      49               0.054
dlvr.it                 40               0.044
Buffer                  32               0.035
TweetDeck               22               0.024
Twitter for iPad        21               0.023

I created a table of words and their sentiments and stored it as the nrc data frame.

nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
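
Note that in newer versions of tidytext the built-in sentiments table only ships the Bing lexicon; on those versions the same two-column frame would come from get_sentiments() instead (this assumes the textdata package is installed):

# Equivalent on recent tidytext releases (requires the textdata package)
nrc <- get_sentiments("nrc")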

*The top 3 platforms were Twitter Web Client, Twitter for Android, and Twitter for iPhone.

I extracted the tweets sent from the two mobile applications, Twitter for Android and Twitter for iPhone.

tweets <- legalization_df %>%
  select(id, statusSource, text) %>%
  extract(statusSource, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone"))

I then divided the tweets into individual words and removed common “stopwords.”

reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

tweet_words_sentiment <- tweet_words %>% inner_join(nrc, by = "word")

I did the same for the full set of #Legalization tweets, storing the words as legal_words.

reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
legal_words <- legalization_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

tbl_legal_words <- legal_words %>% group_by(word) %>% summarize(n = n()) %>% arrange(desc(n)) %>% top_n(10)
pander(tbl_df(tbl_legal_words))
word              n
--------------  ---
#legalization   884
rt              362
#cannabis       344
#marijuana      285
marijuana        78
#pot             76
#mmj             73
#weed            72
#california      67
https            60

I joined nrc to legal_words to look at the different sentiment counts.

legal_words_sentiments <- legal_words %>% inner_join(nrc, by = "word")

tbl_legal_word_sent <- legal_words_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% arrange(desc(n))

pander(tbl_df(tbl_legal_word_sent))
sentiment        n
------------   ---
positive       486
trust          347
anticipation   266
negative       189
fear           157
joy            146
anger          109
surprise        69
sadness         58
disgust         50

The code below shows the first 5 positive and negative tweets.

pos_tw_ids <- legal_words_sentiments %>% filter(sentiment == "positive") %>% distinct(id, word)
pos_legal_df <- legalization_df %>% inner_join(pos_tw_ids, by = "id") %>% select(text, word) %>% slice(1:5)
pander(tbl_df(pos_legal_df))
text | word
---- | ----
AXIM Biotechnologies Granted U.S. Patent for #MMJ Chewing Gum Products - https://t.co/qtOUs3owHk #witloncannabis #legalization | granted
AXIM Biotechnologies Granted U.S. Patent for #MMJ Chewing Gum Products - https://t.co/qtOUs3owHk #witloncannabis #legalization | patent
#Recreational #cannabis #legalization in Canada could create a $22.6 Billion economic impact, set to begin in 2017.… https://t.co/08fuRAtEA2 | create
RT @JUJUJoints: America’s first #cannabis-friendly bars are coming to Denver > https://t.co/hiEWVs9FiZ #legalization #cannabiscommunity htt… | friendly
RT @JUJUJoints: America’s first #cannabis-friendly bars are coming to Denver > https://t.co/hiEWVs9FiZ #legalization #cannabiscommunity htt… | friendly
neg_tw_ids <- legal_words_sentiments %>% filter(sentiment == "negative") %>% distinct(id, word)
neg_legal_df <- legalization_df %>% inner_join(neg_tw_ids, by = "id") %>% select(text, word) %>% slice(1:5)
pander(tbl_df(neg_legal_df))
text | word
---- | ----
@CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | bankrupt
@CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | government
@CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | wrought
RT @SamDCress: What do Texas Republicans have to fear with #legalization? @arepublicantx @SenateGOP @texasgreenleaf @ProgressTX @GregAbbott | fear
What do Texas Republicans have to fear with #legalization? @arepublicantx @SenateGOP @texasgreenleaf @ProgressTX @GregAbbott_TX @DanPatrick | fear

The code below measures sentiment on Android and iPhone platforms.

# total word count per platform, kept as one row per tweet id
sources <- tweet_words %>%
  group_by(source) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(id, source, total_words)

# count sentiment words per tweet, fill missing sentiment/tweet pairs with
# zeros, then total the counts by platform and sentiment
by_source_sentiment <- tweet_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, id) %>%
  ungroup() %>%
  complete(sentiment, id, fill = list(n = 0)) %>%
  inner_join(sources, by = "id") %>%
  group_by(source, sentiment, total_words) %>%
  summarize(words = sum(n)) %>%
  ungroup()

by_source_sentiment <- by_source_sentiment %>% group_by(source) %>% mutate(tot_sentiment = sum(words))

by_source_sentiment <- by_source_sentiment %>% mutate(percent_of_tweets = (words / tot_sentiment) * 100)
by_source_sentiment$percent_of_tweets <- round(by_source_sentiment$percent_of_tweets, digits = 2)
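
Note that percent_of_tweets here is really each sentiment’s share of a platform’s sentiment-tagged words, since the denominator tot_sentiment is the platform’s total count of sentiment words.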

The code below does the same for the full #Legalization set, measuring sentiment on the Android, iPhone, and Web Client platforms.

pf <- c("Twitter for Android", "Twitter for iPhone", "Twitter Web Client")

pf_legal_words <- legal_words %>% filter(statusSource %in% pf)

legal_sources <- legal_words %>%
  group_by(statusSource) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(id, statusSource, total_words)


legal_source_sentiment <- pf_legal_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, id) %>%
  ungroup() %>%
  complete(sentiment, id, fill = list(n = 0)) %>%
  inner_join(legal_sources, by = "id") %>%
  group_by(statusSource, sentiment, total_words) %>%
  summarize(words = sum(n)) %>%
  ungroup()

I used my collected state tweets to prepare the Maine, Massachusetts, Nevada, and California data.

Maine Data

#extract the platform
me_df$statusSource = substr(me_df$statusSource, 
                        regexpr('>', me_df$statusSource) + 1, 
                        regexpr('</a>', me_df$statusSource) - 1)
me_platform <- me_df %>% group_by(statusSource) %>% 
                summarize(n = n()) %>% 
                mutate(percent_of_tweets = n / sum(n)) %>% 
                arrange(desc(n))

#extract the words and join to nrc sentiment words
me_words <- me_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
me_words_sentiments <- me_words %>% inner_join(nrc, by = "word")

Massachusetts Data

#extract the platform
ma_df$statusSource = substr(ma_df$statusSource, 
                        regexpr('>', ma_df$statusSource) + 1, 
                        regexpr('</a>', ma_df$statusSource) - 1)
ma_platform <- ma_df %>% group_by(statusSource) %>% 
                summarize(n = n()) %>% 
                mutate(percent_of_tweets = n / sum(n)) %>% 
                arrange(desc(n))

#extract the words and join to nrc sentiment words
ma_words <- ma_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
ma_words_sentiments <- ma_words %>% inner_join(nrc, by = "word")

Nevada Data

#extract the platform
nv_df$statusSource = substr(nv_df$statusSource, 
                        regexpr('>', nv_df$statusSource) + 1, 
                        regexpr('</a>', nv_df$statusSource) - 1)
nv_platform <- nv_df %>% group_by(statusSource) %>% 
                summarize(n = n()) %>% 
                mutate(percent_of_tweets = n / sum(n)) %>% 
                arrange(desc(n))

#extract the words and join to nrc sentiment words
nv_words <- nv_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
nv_words_sentiments <- nv_words %>% inner_join(nrc, by = "word")

California Data

#extract the platform
ca_df$statusSource = substr(ca_df$statusSource, 
                        regexpr('>', ca_df$statusSource) + 1, 
                        regexpr('</a>', ca_df$statusSource) - 1)
ca_platform <- ca_df %>% group_by(statusSource) %>% 
                summarize(n = n()) %>% 
                mutate(percent_of_tweets = n / sum(n)) %>% 
                arrange(desc(n))

#extract the words and join to nrc sentiment words
ca_words <- ca_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
ca_words_sentiments <- ca_words %>% inner_join(nrc, by = "word")
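
The four state blocks above repeat the same two steps, so they could be collapsed into a helper; a sketch, assuming the reg pattern and nrc table defined earlier (prep_state is a hypothetical name):

prep_state <- function(df) {
  # keep only the anchor text of statusSource, i.e. the platform name
  df$statusSource <- substr(df$statusSource,
                            regexpr('>', df$statusSource) + 1,
                            regexpr('</a>', df$statusSource) - 1)
  platform <- df %>%
    group_by(statusSource) %>%
    summarize(n = n()) %>%
    mutate(percent_of_tweets = n / sum(n)) %>%
    arrange(desc(n))
  words <- df %>%
    filter(!str_detect(text, '^"')) %>%
    mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
    unnest_tokens(word, text, token = "regex", pattern = reg) %>%
    filter(!word %in% stop_words$word, str_detect(word, "[a-z]"))
  list(platform = platform,
       sentiments = words %>% inner_join(nrc, by = "word"))
}

# e.g. me <- prep_state(me_df), then me$platform and me$sentiments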

I combined the state data after creating a State column for the state name in each data frame.

me_platform$State <- "Maine"
ma_platform$State <- "Massachusetts"
nv_platform$State <- "Nevada"
ca_platform$State <- "California"

me_words_sentiments$State <- "Maine"
ma_words_sentiments$State <- "Massachusetts"
nv_words_sentiments$State <- "Nevada"
ca_words_sentiments$State <- "California"

platform <- rbind(me_platform, ma_platform, nv_platform, ca_platform)

words_sentiments <- rbind(me_words_sentiments, ma_words_sentiments, nv_words_sentiments, ca_words_sentiments)

Looking at sentiment by platform…

source_sentiment <- by_source_sentiment %>% select(source, sentiment, percent_of_tweets)

ggplot(source_sentiment, aes(x = source, y = percent_of_tweets, fill = sentiment)) +
      geom_bar(stat = "identity", position = "stack") +
      scale_fill_brewer(palette = "RdBu") +
      xlab("Platform") +
      ylab("Percent of Tweets") +
      theme(axis.text.x = element_text(hjust = 1)) +
      ggtitle("Sentiment of Android versus iPhone tweets")

*It looks like iPhone users have fewer angry tweets, but more negative tweets, than Android users.

Comparing the frequency of tweets by sentiment across the four states, I found that the frequencies of negative and positive tweets roughly track the share of the vote in favor: California passed with 56%, Nevada with 54%, Massachusetts with 54%, and Maine with 50.3%. Nevada did have a slightly higher negative tweet frequency, even though the measure passed by almost the same margin in Massachusetts and Nevada.

word_sent <- words_sentiments %>% group_by(State, sentiment) %>% summarize(n = n()) %>% mutate(frequency = n/sum(n))

# Stacked barplot
ggplot(word_sent, aes(x = sentiment, y = frequency, fill = State)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#8B0000", "#FF6A6A", "#00BFFF", "#104E8B")) +
  xlab("Sentiment") +
  ylab("Frequency of Tweets") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Frequency of Sentiment by State")

This shows the same information as the previous barplot, but in grouped rather than stacked form.

ggplot(word_sent, aes(x = sentiment, y = frequency, fill = State)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#FF3030", "#FFFF00", "#66CD00", "#4876FF")) +
  xlab("Sentiment") +
  ylab("Frequency of Tweets") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Frequency of Sentiment by State")

*Note that the barplots differ from the previous version because of when the tweets were pulled.

Looking at the Nevada “negative” words, it looks like the majority were “margin”, with a few occurrences of “error” as well. I’m guessing “margin” is in reference to the polling margins.
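
A quick way to check this, assuming the nv_words_sentiments frame built above:

# Tally Nevada's negative-sentiment words, most frequent first
nv_words_sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE) %>%
  head(10)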