Voting on Question 1
In light of the recent voting, I thought I’d look at what Twitter had to say about legalizing recreational marijuana.
I searched Twitter for the last 1000 #Legalization tweets.
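Before searching, the twitteR package needs to authenticate against the Twitter API. A minimal setup sketch, assuming an app has already been registered (the key and token values below are placeholders, not real credentials):
library(twitteR)
# Placeholder credentials from a registered Twitter app
consumer_key    <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
access_token    <- "YOUR_ACCESS_TOKEN"
access_secret   <- "YOUR_ACCESS_SECRET"
# Authorize this R session against the Twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)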
# Pull the most recent 1,000 tweets containing #Legalization
n_tweets <- 1000
legalization <- searchTwitter('#Legalization', n = n_tweets)
There weren’t many results when I searched for tweets containing both #Legalization and a reference to a state, so I instead searched for the last 3000 #Legalization-plus-state tweets for each state that has legalized recreational marijuana.
num_tweets <- 3000
legal_me <- searchTwitter('"#Legalization" "maine"', n = num_tweets)
me_df <- twListToDF(legal_me)
# list of 49
legal_ca <- searchTwitter('"#Legalization" "california"', n = num_tweets)
ca_df <- twListToDF(legal_ca)
# list of 553
legal_nv <- searchTwitter('"#Legalization" "nevada"', n = num_tweets)
nv_df <- twListToDF(legal_nv)
# list of 299
legal_ma <- searchTwitter('"#Legalization" "massachusetts"', n = num_tweets)
## [1] "Rate limited .... blocking for a minute and retrying up to 119 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 118 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 117 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 116 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 115 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 114 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 113 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 112 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 111 times ..."
ma_df <- twListToDF(legal_ma)
# list of 296
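The “Rate limited” messages above are searchTwitter pausing and retrying once Twitter’s search API limit is hit. The number of retries can be set explicitly with the retryOnRateLimit argument; a sketch of the same Massachusetts search with it spelled out:
# Retry up to 120 times (roughly once a minute) if the search API rate limit is hit
legal_ma <- searchTwitter('"#Legalization" "massachusetts"', n = num_tweets,
                          retryOnRateLimit = 120)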
legal_wa <- searchTwitter('"#Legalization" "washington"', n = num_tweets)
# list of 9
legal_or <- searchTwitter('"#Legalization" "oregon"', n = num_tweets)
# list of 4
legal_co <- searchTwitter('"#Legalization" "colorado"', n = num_tweets)
# list of 28
legal_ak <- searchTwitter('"#Legalization" "alaska"', n = num_tweets)
# list of 3
Interestingly enough, the number of tweets referencing both #Legalization and a state is roughly proportional to the percentage by which the question passed in that state. The states that voted to legalize in 2016 had, by far, the most tweets.
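As a rough check on that observation, the per-state result counts noted in the comments above can be pulled into a single data frame (a quick sketch using length() on the result lists already created):
# Number of tweets returned for each state search
state_counts <- data.frame(
  state  = c("Maine", "California", "Nevada", "Massachusetts",
             "Washington", "Oregon", "Colorado", "Alaska"),
  tweets = c(length(legal_me), length(legal_ca), length(legal_nv), length(legal_ma),
             length(legal_wa), length(legal_or), length(legal_co), length(legal_ak)),
  stringsAsFactors = FALSE
) %>% arrange(desc(tweets))
state_counts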
I converted my list of 1000 #Legalization tweets to a data frame.
legalization_df <- twListToDF(legalization)
Then I looked at the count of tweets by platform.
The code below shows the top 10 platforms used for the 1,000 tweets and what percentage of the tweets came from each platform.
# Extract the platform name from the statusSource HTML anchor
legalization_df$statusSource <- substr(legalization_df$statusSource,
                                       regexpr('>', legalization_df$statusSource) + 1,
                                       regexpr('</a>', legalization_df$statusSource) - 1)
legalization_platform <- legalization_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n/sum(n)) %>%
arrange(desc(n))
legalization_platform$percent_of_tweets <- round(legalization_platform$percent_of_tweets, digits = 3)
legal_plat_10 <- legalization_platform %>% top_n(10)
kable(legal_plat_10)
| statusSource | n | percent_of_tweets |
|---|---|---|
| Twitter Web Client | 199 | 0.218 |
| Twitter for Android | 149 | 0.164 |
| Twitter for iPhone | 141 | 0.155 |
| Hootsuite | 88 | 0.097 |
| IFTTT | 58 | 0.064 |
| The Social Jukebox | 49 | 0.054 |
| dlvr.it | 40 | 0.044 |
| Buffer | 32 | 0.035 |
| TweetDeck | 22 | 0.024 |
| Twitter for iPad | 21 | 0.023 |
I created a table of words and their sentiments and stored it as the nrc data frame.
nrc <- sentiments %>%
filter(lexicon == "nrc") %>%
select(word, sentiment)
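This relies on the version of tidytext whose sentiments data frame bundled all of the lexicons. In more recent tidytext releases the NRC lexicon is fetched separately (it downloads via the textdata package on first use), so an equivalent alternative would be:
library(tidytext)
# Newer tidytext versions: the NRC lexicon is downloaded/cached on first use
nrc <- get_sentiments("nrc")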
*The top 3 platforms were Twitter Web Client, Twitter for Android and Twitter for iPhone.
I extracted the tweets sent from the Twitter for Android and Twitter for iPhone apps.
tweets <- legalization_df %>%
  select(id, statusSource, text) %>%
  extract(statusSource, "source", "Twitter for (.*)") %>%
  filter(source %in% c("Android", "iPhone"))
I then divided the tweets into individual words and removed common “stopwords.”
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
tweet_words_sentiment <- tweet_words %>% inner_join(nrc, by = "word")
I did the same for the full set of #Legalization tweets, storing the result as legal_words.
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
legal_words <- legalization_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
tbl_legal_words <- legal_words %>%
  group_by(word) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  top_n(10)
pander(tbl_df(tbl_legal_words))
| word | n |
|---|---|
| #legalization | 884 |
| rt | 362 |
| #cannabis | 344 |
| #marijuana | 285 |
| marijuana | 78 |
| #pot | 76 |
| #mmj | 73 |
| #weed | 72 |
| #california | 67 |
| https | 60 |
I joined nrc to legal_words to look at the different sentiment counts.
legal_words_sentiments <- legal_words %>% inner_join(nrc, by = "word")
tbl_legal_word_sent <- legal_words_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% arrange(desc(n))
pander(tbl_df(tbl_legal_word_sent))
| sentiment | n |
|---|---|
| positive | 486 |
| trust | 347 |
| anticipation | 266 |
| negative | 189 |
| fear | 157 |
| joy | 146 |
| anger | 109 |
| surprise | 69 |
| sadness | 58 |
| disgust | 50 |
The code below shows the first 5 positive and the first 5 negative tweets.
pos_tw_ids <- legal_words_sentiments %>%
  filter(sentiment == "positive") %>%
  distinct(id, word)
pos_legal_df <- legalization_df %>%
  inner_join(pos_tw_ids, by = "id") %>%
  select(text, word) %>%
  slice(1:5)
pander(tbl_df(pos_legal_df))
| text | word |
|---|---|
| AXIM Biotechnologies Granted U.S. Patent for #MMJ Chewing Gum Products - https://t.co/qtOUs3owHk #witloncannabis #legalization | granted |
| AXIM Biotechnologies Granted U.S. Patent for #MMJ Chewing Gum Products - https://t.co/qtOUs3owHk #witloncannabis #legalization | patent |
| #Recreational #cannabis #legalization in Canada could create a $22.6 Billion economic impact, set to begin in 2017.… https://t.co/08fuRAtEA2 | create |
| RT @JUJUJoints: America’s first #cannabis-friendly bars are coming to Denver > https://t.co/hiEWVs9FiZ #legalization #cannabiscommunity htt… | friendly |
| RT @JUJUJoints: America’s first #cannabis-friendly bars are coming to Denver > https://t.co/hiEWVs9FiZ #legalization #cannabiscommunity htt… | friendly |
neg_tw_ids <- legal_words_sentiments %>%
  filter(sentiment == "negative") %>%
  distinct(id, word)
neg_legal_df <- legalization_df %>%
  inner_join(neg_tw_ids, by = "id") %>%
  select(text, word) %>%
  slice(1:5)
pander(tbl_df(neg_legal_df))
| text | word |
|---|---|
| @CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | bankrupt |
| @CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | government |
| @CNMMAOfficial @mtlgazette #legalization should bankrupt the government via #restitution for the injustices wrought by #Prohibition. | wrought |
| RT @SamDCress: What do Texas Republicans have to fear with #legalization? @arepublicantx @SenateGOP @texasgreenleaf @ProgressTX @GregAbbott… | fear |
| What do Texas Republicans have to fear with #legalization? @arepublicantx @SenateGOP @texasgreenleaf @ProgressTX @GregAbbott_TX @DanPatrick | fear |
The code below measures sentiment on Android and iPhone platforms.
sources <- tweet_words %>%
group_by(source) %>%
mutate(total_words = n()) %>%
ungroup() %>%
distinct(id, source, total_words)
by_source_sentiment <- tweet_words %>%
inner_join(nrc, by = "word") %>%
count(sentiment, id) %>%
ungroup() %>%
complete(sentiment, id, fill = list(n = 0)) %>%
inner_join(sources) %>%
group_by(source, sentiment, total_words) %>%
summarize(words = sum(n)) %>%
ungroup()
by_source_sentiment <- by_source_sentiment %>% group_by(source) %>% mutate(tot_sentiment = sum(words))
by_source_sentiment <- by_source_sentiment %>% mutate(percent_of_tweets = (words / tot_sentiment) * 100)
by_source_sentiment$percent_of_tweets <- round(by_source_sentiment$percent_of_tweets, digits = 2)
This measures sentiment for the #Legalization tweets sent from the Android, iPhone and Web Client platforms.
pf <- c("Twitter for Android", "Twitter for iPhone", "Twitter Web Client")
pf_legal_words <- legal_words %>% filter(statusSource %in% pf)
legal_sources <- legal_words %>%
group_by(statusSource) %>%
mutate(total_words = n()) %>%
ungroup() %>%
distinct(id, statusSource, total_words)
legal_source_sentiment <- pf_legal_words %>%
inner_join(nrc, by = "word") %>%
count(sentiment, id) %>%
ungroup() %>%
complete(sentiment, id, fill = list(n = 0)) %>%
inner_join(legal_sources) %>%
group_by(statusSource, sentiment, total_words) %>%
summarize(words = sum(n)) %>%
ungroup()
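As with by_source_sentiment above, these counts can be turned into per-platform percentages before plotting; a sketch mirroring the earlier calculation:
legal_source_sentiment <- legal_source_sentiment %>%
  group_by(statusSource) %>%
  mutate(tot_sentiment = sum(words),                                       # total sentiment words per platform
         percent_of_tweets = round((words / tot_sentiment) * 100, 2)) %>%  # share of each sentiment
  ungroup()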
I used the state tweets collected above to prepare the Maine, Massachusetts, Nevada and California data.
Maine Data
#extract the platform
me_df$statusSource = substr(me_df$statusSource,
regexpr('>', me_df$statusSource) + 1,
regexpr('</a>', me_df$statusSource) - 1)
me_platform <- me_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n / sum(n)) %>%
arrange(desc(n))
#extract the words and join to nrc sentiment words
me_words <- me_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
me_words_sentiments <- me_words %>% inner_join(nrc, by = "word")
Massachusetts Data
#extract the platform
ma_df$statusSource = substr(ma_df$statusSource,
regexpr('>', ma_df$statusSource) + 1,
regexpr('</a>', ma_df$statusSource) - 1)
ma_platform <- ma_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n / sum(n)) %>%
arrange(desc(n))
#extract the words and join to nrc sentiment words
ma_words <- ma_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
ma_words_sentiments <- ma_words %>% inner_join(nrc, by = "word")
Nevada Data
#extract the platform
nv_df$statusSource = substr(nv_df$statusSource,
regexpr('>', nv_df$statusSource) + 1,
regexpr('</a>', nv_df$statusSource) - 1)
nv_platform <- nv_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n / sum(n)) %>%
arrange(desc(n))
#extract the words and join to nrc sentiment words
nv_words <- nv_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
nv_words_sentiments <- nv_words %>% inner_join(nrc, by = "word")
California Data
#extract the platform
ca_df$statusSource = substr(ca_df$statusSource,
regexpr('>', ca_df$statusSource) + 1,
regexpr('</a>', ca_df$statusSource) - 1)
ca_platform <- ca_df %>% group_by(statusSource) %>%
summarize(n = n()) %>%
mutate(percent_of_tweets = n / sum(n)) %>%
arrange(desc(n))
#extract the words and join to nrc sentiment words
ca_words <- ca_df %>%
filter(!str_detect(text, '^"')) %>%
mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
unnest_tokens(word, text, token = "regex", pattern = reg) %>%
filter(!word %in% stop_words$word,
str_detect(word, "[a-z]"))
ca_words_sentiments <- ca_words %>% inner_join(nrc, by = "word")
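The Maine, Massachusetts, Nevada and California blocks above repeat the same steps, so they could also be wrapped in a small helper. A sketch; prepare_state is a hypothetical name, and it relies on reg, nrc and stop_words defined earlier:
prepare_state <- function(df) {
  # Pull the platform name out of the statusSource HTML anchor
  df$statusSource <- substr(df$statusSource,
                            regexpr('>', df$statusSource) + 1,
                            regexpr('</a>', df$statusSource) - 1)
  # Count tweets by platform
  platform <- df %>%
    group_by(statusSource) %>%
    summarize(n = n()) %>%
    mutate(percent_of_tweets = n / sum(n)) %>%
    arrange(desc(n))
  # Tokenize the tweet text and join to the NRC sentiment words
  words_sentiments <- df %>%
    filter(!str_detect(text, '^"')) %>%
    mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
    unnest_tokens(word, text, token = "regex", pattern = reg) %>%
    filter(!word %in% stop_words$word, str_detect(word, "[a-z]")) %>%
    inner_join(nrc, by = "word")
  list(platform = platform, words_sentiments = words_sentiments)
}
# Example use: nv_prep <- prepare_state(nv_df); nv_prep$platform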
I combined the state data by creating a new State column in each data frame and binding the rows together.
me_platform$State <- "Maine"
ma_platform$State <- "Massachusetts"
nv_platform$State <- "Nevada"
ca_platform$State <- "California"
me_words_sentiments$State <- "Maine"
ma_words_sentiments$State <- "Massachusetts"
nv_words_sentiments$State <- "Nevada"
ca_words_sentiments$State <- "California"
platform <- rbind(me_platform, ma_platform, nv_platform, ca_platform)
words_sentiments <- rbind(me_words_sentiments, ma_words_sentiments, nv_words_sentiments, ca_words_sentiments)
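An equivalent way to do the combination, starting from the per-state data frames before the State columns are added manually, is dplyr’s bind_rows() with a named list, which creates the State column automatically via .id (shown here as a sketch):
# Same result as the manual State columns + rbind above
words_sentiments <- bind_rows(
  list(Maine = me_words_sentiments,
       Massachusetts = ma_words_sentiments,
       Nevada = nv_words_sentiments,
       California = ca_words_sentiments),
  .id = "State"
)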
Looking at sentiment by platform…
source_sentiment <- by_source_sentiment %>% select(source, sentiment, percent_of_tweets)
ggplot(source_sentiment, aes(x = source, y = percent_of_tweets, fill = sentiment)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette = "RdBu") +
  xlab("Platform") +
  ylab("Percent of Tweets") +
  theme(axis.text.x = element_text(hjust = 1)) +
  ggtitle("Sentiment of Android versus iPhone tweets")
*It looks like iPhone users have fewer angry tweets, but more negative tweets, than Android users.
Comparing the frequency of tweets by sentiment across the four states, I found that the frequency of negative and positive tweets roughly tracks the percentage that voted to pass the measure. CA passed with 56%, NV with 54%, MA with 54% and Maine with 50.3%. Nevada did have a slightly higher negative tweet frequency, but the measure passed by almost the same margin in MA and NV.
word_sent <- words_sentiments %>%
  group_by(State, sentiment) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n))
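To line the sentiment frequencies up against the vote shares quoted above, a small comparison table can be built (a sketch; the passing percentages are hard-coded from the text, and spread() comes from tidyr, which is already loaded for complete() above):
# Yes-vote share per state, as quoted in the text
vote_share <- data.frame(State = c("California", "Nevada", "Massachusetts", "Maine"),
                         percent_yes = c(56, 54, 54, 50.3),
                         stringsAsFactors = FALSE)
word_sent %>%
  ungroup() %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  select(State, sentiment, frequency) %>%
  spread(sentiment, frequency) %>%
  inner_join(vote_share, by = "State") %>%
  arrange(desc(percent_yes))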
# Stacked barplot
ggplot(word_sent, aes(x = sentiment, y = frequency, fill = State)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#8B0000", "#FF6A6A", "#00BFFF", "#104E8B")) +
  xlab("Sentiment") +
  ylab("Frequency of Tweets") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Frequency of Sentiment by State")
This shows the same information as the previous barplot, but in grouped rather than stacked form.
ggplot(word_sent, aes(x = sentiment, y = frequency, fill = State)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#FF3030", "#FFFF00", "#66CD00", "#4876FF")) +
  xlab("Sentiment") +
  ylab("Frequency of Tweets") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Frequency of Sentiment by State")
*Note: the barplots differ from those in the previous version because the tweets were pulled at a different time.
Looking at the Nevada “negative” words, the majority were “margin”, with a few occurrences of “error” as well. I’m guessing “margin” is in reference to the polling margins.
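That observation can be checked directly by counting the negative words in the Nevada tweets (a quick sketch using the nv_words_sentiments data frame built above):
# Most common NRC-negative words in the Nevada tweets
nv_words_sentiments %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE) %>%
  head(10)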