The program performs sentiment analysis of words using the AFINN lexicon. This “dictionary” maps each word to a “valence,” an integer score between -5 and +5 that represents a negative-to-positive sentiment rating.
The program downloads 750 tweets tagged #Texas and, simultaneously, 750 tagged #Maine. Words from the tweet text are matched against the lexicon, and each matched word is assigned its sentiment valence. Scores are aggregated by the minute, averaged, and graphed.
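The chunks below assume a setup chunk (not shown in the original) that loads the required packages and authenticates with the Twitter API. A minimal sketch, with placeholder credentials, would be:
library(plyr)       # ddply()
library(dplyr)      # select(), arrange(), mutate(), filter(), inner_join()
library(stringr)    # str_detect(), str_replace_all()
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
library(twitteR)    # searchTwitter(), twListToDF(), setup_twitter_oauth()
library(ggvis)      # ggvis(), layer_lines()
# Placeholder credentials -- substitute your own Twitter API keys
setup_twitter_oauth("consumer_key", "consumer_secret", "access_token", "access_secret")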
TEXAS
Texas (TX) tweets are downloaded, converted to a df, and uniformly date-formatted
num_tweets <- 750
Tt <- searchTwitter('#Texas', n = num_tweets)
Tt_df <- twListToDF(Tt) %>% select(text, created) %>% arrange(created)
Tt_df$created <- as.POSIXct(Tt_df$created, format="%Y-%m-%d %H:%M:%S")
Reference times to CST and create independent hour, min, and sec vectors
Tt_df$created <- format(Tt_df$created, tz="America/Chicago")
Tt_df$hour <- as.POSIXlt(Tt_df$created)$hour
Tt_df$min <- as.POSIXlt(Tt_df$created)$min
Tt_df$sec <- as.POSIXlt(Tt_df$created)$sec
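For a sense of what as.POSIXlt() exposes, a single timestamp (an illustrative value, not one taken from the data) decomposes like this:
t <- as.POSIXlt("2016-11-19 10:46:15")
t$hour  # 10
t$min   # 46
t$sec   # 15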
TX tweet words isolated
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
Tt_sep_words <- Tt_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]")) %>%
  arrange(created)
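To see what this pipeline does, here it is applied to a single made-up tweet (hypothetical text; shown before the stop-word filter, which would then drop “the”):
sample_df <- data.frame(text = "Loving the #Texas sky tonight https://t.co/abc123",
                        stringsAsFactors = FALSE)
sample_df %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(str_detect(word, "[a-z]"))
# word: "loving" "the" "#texas" "sky" "tonight"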
Create a df of sentiment-scored words: the lexicon word scores and the TX words are inner-joined
TaFinn <- get_sentiments("afinn")
Tscore_clock <- Tt_sep_words %>% inner_join(TaFinn, by="word") %>% arrange(created)
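For context, TaFinn is simply a two-column table pairing each lexicon word with its integer score, so the inner join keeps only tweet words that appear in the lexicon. (Newer tidytext releases rename the score column to value, which would require adjusting the code that follows.)
head(TaFinn)  # columns: word, score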
x <- length(Tscore_clock$score)
T_reg <- Tscore_clock
T_reg$time <- as.POSIXct(paste0(as.character(T_reg$hour), ':',
                                as.character(T_reg$min)), format="%H:%M")
head(T_reg, 25)
## # A tibble: 25 × 7
## created hour min sec word score
## <chr> <int> <int> <dbl> <chr> <int>
## 1 2016-11-19 10:46:00 10 46 0 forget -1
## 2 2016-11-19 10:46:00 10 46 0 grant 1
## 3 2016-11-19 10:46:00 10 46 0 prison -2
## 4 2016-11-19 10:46:02 10 46 2 limited -1
## 5 2016-11-19 10:46:03 10 46 3 die -3
## 6 2016-11-19 10:46:03 10 46 3 ugly -3
## 7 2016-11-19 10:46:03 10 46 3 death -2
## 8 2016-11-19 10:46:10 10 46 10 pressure -1
## 9 2016-11-19 10:46:15 10 46 15 mock -2
## 10 2016-11-19 10:46:15 10 46 15 assassination -3
## # ... with 15 more rows, and 1 more variables: time <dttm>
A summary df is built for the TX word scores, which are aggregated and averaged by the minute
Tscore_clock_summary <- ddply(Tscore_clock, c("hour","min"), summarise,
N = length(score), TX_sentiment = mean(score))
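For reference, the same aggregation written with dplyr (an alternative to plyr::ddply, not the code used here):
Tscore_clock %>%
  group_by(hour, min) %>%
  summarise(N = n(), TX_sentiment = mean(score))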
Clock times are entered for scores
Tscore_clock_summary$time <- as.POSIXct(paste0(as.character(Tscore_clock_summary$hour), ':',
                                               as.character(Tscore_clock_summary$min)),
                                        format="%H:%M")
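Because only %H:%M is supplied, as.POSIXct() fills in the date on which the code runs, which is why the summary below shows 2016-11-19 beside each time:
as.POSIXct("10:46", format = "%H:%M")  # the current date at 10:46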
i <- length(Tscore_clock_summary$TX_sentiment)
head(Tscore_clock_summary, 25)
## hour min N TX_sentiment time
## 1 10 46 11 -1.4545455 2016-11-19 10:46:00
## 2 10 47 6 0.1666667 2016-11-19 10:47:00
## 3 10 48 2 0.5000000 2016-11-19 10:48:00
## 4 10 49 2 0.5000000 2016-11-19 10:49:00
## 5 10 50 1 3.0000000 2016-11-19 10:50:00
## 6 10 51 3 -1.3333333 2016-11-19 10:51:00
## 7 10 52 2 2.5000000 2016-11-19 10:52:00
## 8 10 53 1 -1.0000000 2016-11-19 10:53:00
## 9 10 54 3 -1.3333333 2016-11-19 10:54:00
## 10 10 55 10 1.3000000 2016-11-19 10:55:00
## 11 10 56 1 -3.0000000 2016-11-19 10:56:00
## 12 10 57 4 -0.7500000 2016-11-19 10:57:00
## 13 10 58 1 -4.0000000 2016-11-19 10:58:00
## 14 10 59 1 1.0000000 2016-11-19 10:59:00
## 15 11 0 6 -1.3333333 2016-11-19 11:00:00
## 16 11 1 5 -1.2000000 2016-11-19 11:01:00
## 17 11 2 6 0.3333333 2016-11-19 11:02:00
## 18 11 4 3 -1.3333333 2016-11-19 11:04:00
## 19 11 6 4 -0.7500000 2016-11-19 11:06:00
## 20 11 7 8 -1.1250000 2016-11-19 11:07:00
## 21 11 8 10 -1.6000000 2016-11-19 11:08:00
## 22 11 9 13 -1.1538462 2016-11-19 11:09:00
## 23 11 10 2 0.0000000 2016-11-19 11:10:00
## 24 11 12 5 -1.0000000 2016-11-19 11:12:00
## 25 11 13 1 -1.0000000 2016-11-19 11:13:00
a <- sum(Tscore_clock_summary$N!=1)
MAINE
The same process is performed on Maine tweets
Mt <- searchTwitter('#Maine', n = num_tweets)
Mt_df <- twListToDF(Mt) %>% select(text, created) %>% arrange(created)
Mt_df$created <- as.POSIXct(Mt_df$created, format="%Y-%m-%d %H:%M:%S")
Mt_df$created <- format(Mt_df$created, tz="America/Chicago")
Mt_df$hour <- as.POSIXlt(Mt_df$created)$hour
Mt_df$min <- as.POSIXlt(Mt_df$created)$min
Mt_df$sec <- as.POSIXlt(Mt_df$created)$sec
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
Mt_sep_words <- Mt_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]")) %>%
  arrange(created)
MaFinn <- get_sentiments("afinn")
Mscore_clock <- Mt_sep_words %>% inner_join(MaFinn, by="word") %>% arrange(created)
y <- length(Mscore_clock$score)
M_reg <- Mscore_clock
M_reg$time <- as.POSIXct(paste0(as.character(M_reg$hour), ':',
                                as.character(M_reg$min)), format="%H:%M")
# Collect the counts of individual word scores used for Texas (x) and Maine (y)
nT <- as.integer(c(x, 2, 3))
nM <- as.integer(c(y, 5, 6))
n <- data.frame(nT, nM, stringsAsFactors = FALSE)
Mscore_clock_summary <- ddply(Mscore_clock, c("hour","min"), summarise,
N = length(score), ME_sentiment = mean(score))
Mscore_clock_summary$time <- as.POSIXct(paste0(as.character(Mscore_clock_summary$hour), ':',
                                               as.character(Mscore_clock_summary$min)),
                                        format="%H:%M")
j <- length(Mscore_clock_summary$ME_sentiment)
b <- sum(Mscore_clock_summary$N!=1)
# Collect the counts of per-minute aggregates for Texas (i) and Maine (j)
n_T <- as.integer(c(i, 12, 13))
n_M <- as.integer(c(j, 15, 16))
NN <- data.frame(n_T, n_M, stringsAsFactors = FALSE)
# Collect the counts of aggregates that average more than one score (a = TX, b = ME)
a_T <- as.integer(c(a, 21, 22))
a_M <- as.integer(c(b, 24, 25))
A <- data.frame(a_T, a_M, stringsAsFactors = FALSE)
Graph the tweet sentiments. (I used echo=FALSE to keep the rendered graphs clean; that code is reproduced below.)
#Tscore_clock_summary %>%
#  ggvis(x = ~time, y = ~TX_sentiment) %>%
#  layer_lines()
#Mscore_clock_summary %>%
#  ggvis(x = ~time, y = ~ME_sentiment) %>%
#  layer_lines()
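For a static (non-interactive) alternative to the ggvis graphs, a ggplot2 sketch would be (assuming ggplot2 is installed; this is not the code used for the figures here):
library(ggplot2)
ggplot(Tscore_clock_summary, aes(x = time, y = TX_sentiment)) + geom_line()
ggplot(Mscore_clock_summary, aes(x = time, y = ME_sentiment)) + geom_line()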
Tweeted words are assigned a “valence,” an integer score between -5 and +5 that represents a negative-to-positive sentiment rating. Tweets for both states were downloaded simultaneously, minute by minute, in real time. Scores were aggregated by the minute, averaged, and graphed. The times were referenced to the CST zone, but they can be referenced to EST or any other time zone. The states chosen, the number of tweets, and the times are all configurable.
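For example, re-referencing the timestamps to Eastern time is a one-line change:
Tt_df$created <- format(Tt_df$created, tz = "America/New_York")  # EST/EDT instead of CST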
Drag the lower-right corner of the graph to the right to stretch the time axis
Maine (ME) Sentiment Scores vs. Time
- - - - -  raw Maine data points, plotted vertically at the aggregated minute times
_________  minute-averaged sentiment scores
The small data frames built above collect the counts behind these graphs:
- nT (Texas) and nM (Maine): the cumulative number of individual word scores used for each state, whether graphed on their own or folded into an average
- n_T (Texas) and n_M (Maine): the number of per-minute points plotted, each either a single score or an average of several
- a_T (Texas) and a_M (Maine): the number of those per-minute points that are true averages of more than one score
Averages should not obscure raw numbers. The takeaway was that the information I got from these graphs depended on how I viewed the data’s context. It reminded me how limiting statistics alone, without knowledge of the raw data, can be.
It would be interesting to chart usernames with unique symbols or colors to see what they’re adding to the graphs. Coding by events or using tags might also be interesting.