The program performs sentiment analysis of words using the AFINN lexicon. This “dictionary” maps each word to a “valence,” an integer score between -5 and +5 that represents a negative-to-positive sentiment rating.
The program downloads 750 tweets tagged #Texas and, simultaneously, 750 tagged #Maine. Words from the tweet text are matched against the lexicon, and each matched word is assigned its sentiment valence. Scores are aggregated by the minute, averaged, and graphed.
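The chunks below assume a setup chunk (not shown in the original) that loads the required packages and authenticates with the Twitter API. A minimal sketch, with placeholder credentials, would be:
library(plyr)       # ddply()
library(dplyr)      # select(), arrange(), mutate(), filter(), inner_join()
library(stringr)    # str_detect(), str_replace_all()
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
library(twitteR)    # searchTwitter(), twListToDF(), setup_twitter_oauth()
library(ggvis)      # ggvis(), layer_lines()
# Placeholder credentials -- substitute your own Twitter API keys
setup_twitter_oauth("consumer_key", "consumer_secret", "access_token", "access_secret")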
TEXAS
Texas (TX) tweets are downloaded, converted to a df, and uniformly date-formatted
num_tweets <- 750
Tt <- searchTwitter('#Texas', n = num_tweets)
Tt_df <- twListToDF(Tt) %>% select(text, created) %>% arrange(created)
Tt_df$created <- as.POSIXct(Tt_df$created, format="%Y-%m-%d %H:%M:%S")
Reference times to CST and create independent hour, min, and sec vectors
Tt_df$created <- format(Tt_df$created, tz="America/Chicago")
Tt_df$hour <- as.POSIXlt(Tt_df$created)$hour
Tt_df$min <- as.POSIXlt(Tt_df$created)$min
Tt_df$sec <- as.POSIXlt(Tt_df$created)$sec
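For a sense of what as.POSIXlt() exposes, a single timestamp (an illustrative value, not one taken from the data) decomposes like this:
t <- as.POSIXlt("2016-11-19 10:46:15")
t$hour  # 10
t$min   # 46
t$sec   # 15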
TX tweet words isolated
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
Tt_sep_words <- Tt_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]")) %>%
  arrange(created)
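To see what this pipeline does, here it is applied to a single made-up tweet (hypothetical text; shown before the stop-word filter, which would then drop “the”):
sample_df <- data.frame(text = "Loving the #Texas sky tonight https://t.co/abc123",
                        stringsAsFactors = FALSE)
sample_df %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(str_detect(word, "[a-z]"))
# word: "loving" "the" "#texas" "sky" "tonight"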
Create a df of sentiment-scored words: the lexicon word scores and the TX words are inner-joined
TaFinn <- get_sentiments("afinn")
Tscore_clock <- Tt_sep_words %>% inner_join(TaFinn, by="word") %>% arrange(created)
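For context, TaFinn is simply a two-column table pairing each lexicon word with its integer score, so the inner join keeps only tweet words that appear in the lexicon. (Newer tidytext releases rename the score column to value, which would require adjusting the code that follows.)
head(TaFinn)  # columns: word, score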
x <- length(Tscore_clock$score)
T_reg <- Tscore_clock
T_reg$time <- as.POSIXct(paste0(as.character(T_reg$hour), ':',
                                as.character(T_reg$min)), format="%H:%M")
head(T_reg, 25)
## # A tibble: 25 × 7
## created hour min sec word score
## <chr> <int> <int> <dbl> <chr> <int>
## 1 2016-11-19 10:46:00 10 46 0 forget -1
## 2 2016-11-19 10:46:00 10 46 0 grant 1
## 3 2016-11-19 10:46:00 10 46 0 prison -2
## 4 2016-11-19 10:46:02 10 46 2 limited -1
## 5 2016-11-19 10:46:03 10 46 3 die -3
## 6 2016-11-19 10:46:03 10 46 3 ugly -3
## 7 2016-11-19 10:46:03 10 46 3 death -2
## 8 2016-11-19 10:46:10 10 46 10 pressure -1
## 9 2016-11-19 10:46:15 10 46 15 mock -2
## 10 2016-11-19 10:46:15 10 46 15 assassination -3
## # ... with 15 more rows, and 1 more variables: time <dttm>
A summary df is built for the TX word scores, which are aggregated and averaged by the minute
Tscore_clock_summary <- ddply(Tscore_clock, c("hour","min"), summarise,
N = length(score), TX_sentiment = mean(score))
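For reference, the same aggregation written with dplyr (an alternative to plyr::ddply, not the code used here):
Tscore_clock %>%
  group_by(hour, min) %>%
  summarise(N = n(), TX_sentiment = mean(score))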
Clock times are entered for scores
Tscore_clock_summary$time <- as.POSIXct(paste0(as.character(Tscore_clock_summary$hour), ':',
                                               as.character(Tscore_clock_summary$min)),
                                        format="%H:%M")
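Because only %H:%M is supplied, as.POSIXct() fills in the date on which the code runs, which is why the summary below shows 2016-11-19 beside each time:
as.POSIXct("10:46", format = "%H:%M")  # the current date at 10:46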
i <- length(Tscore_clock_summary$TX_sentiment)
head(Tscore_clock_summary, 25)
## hour min N TX_sentiment time
## 1 10 46 11 -1.4545455 2016-11-19 10:46:00
## 2 10 47 6 0.1666667 2016-11-19 10:47:00
## 3 10 48 2 0.5000000 2016-11-19 10:48:00
## 4 10 49 2 0.5000000 2016-11-19 10:49:00
## 5 10 50 1 3.0000000 2016-11-19 10:50:00
## 6 10 51 3 -1.3333333 2016-11-19 10:51:00
## 7 10 52 2 2.5000000 2016-11-19 10:52:00
## 8 10 53 1 -1.0000000 2016-11-19 10:53:00
## 9 10 54 3 -1.3333333 2016-11-19 10:54:00
## 10 10 55 10 1.3000000 2016-11-19 10:55:00
## 11 10 56 1 -3.0000000 2016-11-19 10:56:00
## 12 10 57 4 -0.7500000 2016-11-19 10:57:00
## 13 10 58 1 -4.0000000 2016-11-19 10:58:00
## 14 10 59 1 1.0000000 2016-11-19 10:59:00
## 15 11 0 6 -1.3333333 2016-11-19 11:00:00
## 16 11 1 5 -1.2000000 2016-11-19 11:01:00
## 17 11 2 6 0.3333333 2016-11-19 11:02:00
## 18 11 4 3 -1.3333333 2016-11-19 11:04:00
## 19 11 6 4 -0.7500000 2016-11-19 11:06:00
## 20 11 7 8 -1.1250000 2016-11-19 11:07:00
## 21 11 8 10 -1.6000000 2016-11-19 11:08:00
## 22 11 9 13 -1.1538462 2016-11-19 11:09:00
## 23 11 10 2 0.0000000 2016-11-19 11:10:00
## 24 11 12 5 -1.0000000 2016-11-19 11:12:00
## 25 11 13 1 -1.0000000 2016-11-19 11:13:00
a <- sum(Tscore_clock_summary$N!=1)
MAINE
The same process is performed on Maine tweets
Mt <- searchTwitter('#Maine', n = num_tweets)
Mt_df <- twListToDF(Mt) %>% select(text, created) %>% arrange(created)
Mt_df$created <- as.POSIXct(Mt_df$created, format="%Y-%m-%d %H:%M:%S")
Mt_df$created <- format(Mt_df$created, tz="America/Chicago")
Mt_df$hour <- as.POSIXlt(Mt_df$created)$hour
Mt_df$min <- as.POSIXlt(Mt_df$created)$min
Mt_df$sec <- as.POSIXlt(Mt_df$created)$sec
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
Mt_sep_words <- Mt_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]")) %>%
  arrange(created)
MaFinn <- get_sentiments("afinn")
Mscore_clock <- Mt_sep_words %>% inner_join(MaFinn, by="word") %>% arrange(created)
y <- length(Mscore_clock$score)
M_reg <- Mscore_clock
M_reg$time <- as.POSIXct(paste0(as.character(M_reg$hour), ':',
                                as.character(M_reg$min)), format="%H:%M")
# Collect the counts of individual word scores used for Texas (x) and Maine (y)
nT <- as.integer(c(x, 2, 3))
nM <- as.integer(c(y, 5, 6))
n <- data.frame(nT, nM, stringsAsFactors = FALSE)
Mscore_clock_summary <- ddply(Mscore_clock, c("hour","min"), summarise,
N = length(score), ME_sentiment = mean(score))
Mscore_clock_summary$time <- as.POSIXct(paste0(as.character(Mscore_clock_summary$hour), ':',
                                               as.character(Mscore_clock_summary$min)),
                                        format="%H:%M")
j <- length(Mscore_clock_summary$ME_sentiment)
b <- sum(Mscore_clock_summary$N!=1)
# Collect the counts of per-minute aggregates for Texas (i) and Maine (j)
n_T <- as.integer(c(i, 12, 13))
n_M <- as.integer(c(j, 15, 16))
NN <- data.frame(n_T, n_M, stringsAsFactors = FALSE)
# Collect the counts of aggregates that average more than one score (a = TX, b = ME)
a_T <- as.integer(c(a, 21, 22))
a_M <- as.integer(c(b, 24, 25))
A <- data.frame(a_T, a_M, stringsAsFactors = FALSE)
Graph the tweet sentiments. (I used echo=FALSE to keep the rendered graphs clean; that code is reproduced below.)
#Tscore_clock_summary %>%
#  ggvis(x = ~time, y = ~TX_sentiment) %>%
#  layer_lines()
#Mscore_clock_summary %>%
#  ggvis(x = ~time, y = ~ME_sentiment) %>%
#  layer_lines()
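For a static (non-interactive) alternative to the ggvis graphs, a ggplot2 sketch would be (assuming ggplot2 is installed; this is not the code used for the figures here):
library(ggplot2)
ggplot(Tscore_clock_summary, aes(x = time, y = TX_sentiment)) + geom_line()
ggplot(Mscore_clock_summary, aes(x = time, y = ME_sentiment)) + geom_line()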
Tweeted words are assigned a “valence,” an integer score between -5 and +5 that represents a negative-to-positive sentiment rating. Tweets for both states were downloaded simultaneously, minute by minute, in real time. Scores were aggregated by the minute, averaged, and graphed. The times were referenced to the CST zone, but they can be referenced to EST or any other time zone. The states chosen, the number of tweets, and the times are all configurable.
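For example, re-referencing the timestamps to Eastern time is a one-line change:
Tt_df$created <- format(Tt_df$created, tz = "America/New_York")  # EST/EDT instead of CST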
Drag the lower-right corner of the graph to the right to stretch the time axis
Maine (ME) Sentiment Scores vs. Time
- - - - -  raw Maine data points, plotted vertically at the aggregated minute times
_________  minute-averaged sentiment scores
The small data frames built above collect the counts behind these graphs:
- nT (Texas) and nM (Maine): the cumulative number of individual word scores used for each state, whether graphed on their own or folded into an average
- n_T (Texas) and n_M (Maine): the number of per-minute points plotted, each either a single score or an average of several
- a_T (Texas) and a_M (Maine): the number of those per-minute points that are true averages of more than one score
Averages should not obscure raw numbers. The takeaway was that the information I got from these graphs depended on how I viewed the data’s context. It reminded me how limiting statistics alone, without knowledge of the raw data, can be.
It would be interesting to chart usernames with unique symbols or colors to see what they’re adding to the graphs. Coding by events or using tags might also be interesting.