Bullying is a national health concern (Xu et al. 2012). Bullying online, known as cyberbullying, frequently pervades social media sites like Twitter, with as many as 15,000 bully-related tweets being sent each day (Huff Post 2012, Phys 2012). In this study, I collected English tweets for just over four hours on 8 September 2016. I originally intended to replicate parts of Xu et al.’s (2012) study, but given the current events at the time, my results quickly turned political. Here, I focused on differences and similarities in the sentiment and number of words used in tweets sent to different Twitter users, and I created word maps. I began with the full set of tweets, then subsetted it to include only tweets that were in direct reply to a popular person within each sentiment frame (negative, positive, neutral, and combined). I then subsetted tweets to include only those sent in direct reply to one of the two 2016 U.S. presidential candidates or to either of two news sites: Fox News, a conservative news source, or CNN, a moderate news source that Donald Trump repeatedly claimed was biased against him during his campaign and repeatedly called the “Clinton News Network.”
After witnessing a vast number of insults and bullying infiltrating Twitter over several months, I read Xu et al.’s (2012) article from the Association for Computational Linguistics. Their research team used natural language processing to detect tweets capturing reports of real-life bullying as well as cyberbullying. I adapted some of their methods for my study. I collected tweets using the Twitter API with the intention of finding cyberbullying traces. I performed the first half of this project using Python, and my code can be found in Appendix i and Appendix ii. I completed the rest of the project using R, and the code can be found in this document.
I selected tweets for this study based on the following criteria: that they were in English, that they contained at least one word from AFINN-111 (README), a list of rated English words (Nielsen 2011), and that they were all original content (no retweets; Xu et al. 2012).
I collected data from the Twitter API using Python code that was made available to me by Bill Howe’s Data Manipulation at Scale course on Coursera. I used Python to access the Twitter API simply because I had previous experience in that language, as opposed to R. I began collecting tweets at 18:09:49 GMT on 8 Sep 2016 and finished at 22:20:52 GMT on the same day (time difference of 4.18 hours). During that time frame, I collected 870,852 tweets.
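As a quick check of the collection window, the elapsed time between the two timestamps above can be computed in base R. This is a minimal sketch; the variable names are illustrative and are not taken from the original collection code.

```r
# Collection start and end times reported above, in GMT.
start_time <- as.POSIXct("2016-09-08 18:09:49", tz = "GMT")
end_time   <- as.POSIXct("2016-09-08 22:20:52", tz = "GMT")

# Elapsed time in hours, rounded to two decimals.
elapsed_hours <- as.numeric(difftime(end_time, start_time, units = "hours"))
round(elapsed_hours, 2)
```

Asking `difftime` for hours explicitly avoids relying on its automatic choice of units.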
After collecting the specified tweet set, I calculated each tweet’s sentiment using unigrams from the AFINN file. Then, I removed from my analysis any URLs, words with numbers in them, words with the ‘@’ symbol, and words with letters that repeated 3 or more times. I kept hashtags (compound phrases starting with #) as single tokens (Xu et al. 2012). Then, I imputed the sentiment score of all remaining unigrams not found in the AFINN file. I did this by calculating the average sentiment score of all tweets where an unrated or “unknown” word appeared. Then, I assigned each unknown word its imputed sentiment score and recalculated the sentiment of the entire tweet set using both the AFINN file and the unknown words set. Once again, this code can be found in Appendix i.
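The scoring and imputation steps were done in Python (Appendix i), which is not reproduced here. The following is only a rough R sketch of the idea, under two assumptions that the text does not confirm: that a tweet's score is the sum of its unigram scores, and that tokens are split on whitespace. The lexicon, tweets, and function name `score_tweet` are hypothetical.

```r
# Toy lexicon standing in for AFINN-111 (scores here are illustrative).
afinn <- c(love = 3, good = 3, bad = -3, hate = -3)

tweets <- c("love this good video", "hate this bad video", "good day")
tokens <- strsplit(tolower(tweets), "\\s+")

# Step 1: score each tweet by summing the scores of its rated unigrams
# (summing is an assumption made for this sketch).
score_tweet <- function(words, lexicon) {
  sum(lexicon[words[words %in% names(lexicon)]])
}
base_scores <- sapply(tokens, score_tweet, lexicon = afinn)

# Step 2: impute a score for each unrated word as the mean base score
# of the tweets in which it appears.
unknown <- setdiff(unique(unlist(tokens)), names(afinn))
imputed <- sapply(unknown, function(w) {
  in_tweet <- sapply(tokens, function(words) w %in% words)
  mean(base_scores[in_tweet])
})

# Step 3: rescore every tweet with the rated and imputed words combined.
full_lexicon <- c(afinn, imputed)
final_scores <- sapply(tokens, score_tweet, lexicon = full_lexicon)
```

In this toy data, "day" only appears in a positive tweet, so it inherits a positive score and raises that tweet's total. Words that appear in both a positive and a negative tweet average out to zero.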
Before reading the data into R, I deparsed it using the JSONIO package. The code for this step can be found in Appendix 1. Then, using R, I created word maps of the combined tweets and then of tweets based on their sentiment score: positive, negative, or neutral. I analyzed sentiment score and tweet length across sentiment frames within a list and then across subsets. The original frames list was the tweets collected in full, followed by the tweets subsetted by a direct reply to a popular Twitter user, followed by tweets subsetted by a direct reply to a political figure.
First, I loaded all of the R packages I needed for the analysis (run in RStudio version 0.99.903).
library(plyr) ## reshape data
library(tm) ## removeSparseTerms
## Loading required package: NLP
library(slam)
library(RColorBrewer)
library(wordcloud) ## visualizations
library(ggplot2) ## visualizations
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
Next, I loaded the tweets I collected using Python. I subset the resulting data frame, df, into smaller data frames based on a tweet’s sentiment score, namely: neutral, positive, or negative. I modified the dataframe a bit, ensuring that the text attribute was in character format and that the created_at (timestamp) was in as.POSIXct format (source). I concatenated the resulting data frames in a list called frames_list. To see the structure and summary figures for df, please view Appendix 2.
df <- read.csv("tweet_frame.csv")
df$text <- as.character(df$text)
df$created_at <- as.POSIXct(strptime(df$created_at,
"%a %b %d %H:%M:%S %z %Y",tz="GMT"), tz="GMT")
neu_df <- subset(df, sentiment==0)
pos_df <- subset(df, sentiment>0)
neg_df <- subset(df, sentiment<0)
frames_list <- list(df, neu_df, pos_df, neg_df)
names(frames_list) <- c("df", "neu_df", "pos_df", "neg_df")
lapply(frames_list, function(x) {
length(x[,1]) })
## $df
## [1] 27949
##
## $neu_df
## [1] 871
##
## $pos_df
## [1] 22883
##
## $neg_df
## [1] 4195
There were 27949 total tweets analyzed, with 22883 classified as positive (82%), 4195 as negative (15%), and 871 classified as neutral (3%).
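The percentages above can be reproduced directly from the counts in the output. This is a small sketch; the vector names are illustrative.

```r
# Tweet counts per sentiment class, taken from the output above.
counts <- c(positive = 22883, negative = 4195, neutral = 871)

# Share of each class as a whole-number percentage of all tweets.
shares <- round(100 * counts / sum(counts))
shares
```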
Then, I created some user-defined functions that I would repeat several times throughout the report. The first one was to remove most punctuation from the data frame (source 1, source 2). The second one removes any repeating characters in strings with 3 or more of the same character repeated.
removeMostPunctuation <-
function (x) {
x <- gsub("([@])|[[:punct:]]", "\\1", x)
return(x)
}
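To illustrate what this function does, here is a short usage sketch on a made-up string. The function is restated so the example runs on its own; the sample text is hypothetical.

```r
# Same definition as above, restated so this example is self-contained.
removeMostPunctuation <- function(x) {
  # "([@])" captures an @ sign and "\\1" puts it back;
  # any other punctuation matches the second alternative and is dropped.
  gsub("([@])|[[:punct:]]", "\\1", x)
}

cleaned <- removeMostPunctuation("Hey @user, what's up?!")
cleaned
```

The handle keeps its @ sign, while the comma, apostrophe, question mark, and exclamation mark are removed. This is why contractions show up as "dont" and "youre" in the word lists later in the report.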
Then, I created a user-defined function, term_doc_df, that would convert term document matrices to data frames (adapted from source 1, source 2). I removed the words ‘like’, ‘just’, and ‘amp’, which the English stop word list did not cover, as well as ‘via’, a common word that appeared when sharing media. I added a grepl function call that would remove any links (words beginning with ‘http’), unigrams with repeating @, words with characters that repeat 3 times or more, and the Twitter handles @youtube and @c0nvey, because they were frequent in tweets with shared media generated by those providers. Note that because removeNumbers runs before this filter, @c0nvey has already become @cnvey by the time the pattern is applied, so it is not matched and still appears in the output.
term_doc_df <- function(x) {
text <- iconv(x$text, 'UTF-8', 'ASCII')
#text <- removeMostPunctuation(text)
# text <- gsub(" +", " ", text)
corpus <- Corpus(DataframeSource(data.frame(text)))
corpus <- tm_map(corpus, content_transformer(removeMostPunctuation))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c('just', 'like', 'via', 'amp'))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
tdm <- TermDocumentMatrix(corpus)
tdm <- removeSparseTerms(tdm, .9999)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d_pre <- data.frame(word=names(v), freq=v)
d <- d_pre[!grepl("^http|.*@{2,}.*|(\\w)\\1{2,}|@youtube|@c0nvey", d_pre$word),]
## This function currently returns several empty values w/ frequencies > 0
return(d)
}
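For readers without the tm package, the core of what term_doc_df produces (a word and frequency table, then a pattern filter) can be sketched in base R. This is a simplified stand-in and not the function above: it skips stop word removal and sparse-term pruning, and the toy inputs are hypothetical.

```r
# Toy tweets standing in for x$text (illustrative only).
toy_text <- c("love this video", "love love it", "good video")

# Tokenise on whitespace and count how often each word occurs.
tokens <- unlist(strsplit(tolower(toy_text), "\\s+"))
freq <- sort(table(tokens), decreasing = TRUE)
d_pre <- data.frame(word = names(freq), freq = as.integer(freq),
                    stringsAsFactors = FALSE)

# The same filter pattern used in term_doc_df, applied to sample words.
pattern <- "^http|.*@{2,}.*|(\\w)\\1{2,}|@youtube|@c0nvey"
sample_words <- c("love", "httpstcoabc", "sooo", "@youtube", "@cnn")
kept <- sample_words[!grepl(pattern, sample_words)]
kept
```

The filter drops the link remnant, the word with a tripled letter, and the @youtube handle, and keeps ordinary words and other handles.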
I created a custom avg_sent function that would calculate and print the average sentiment score of a tweet data frame.
avg_sent <- function(x) {
avg <- mean(x$sentiment)
stdev <- sd(x$sentiment)
paste("The average sentiment score was", avg,
"sentiment points per tweet with a standard deviation of", stdev,
sep=" ")
}
Finally, I defined a make_wordcloud function that created wordclouds for a term document data frame (source 1, source 2).
make_wordcloud <- function(x, y, z=NULL, n=2) {
for(i in 1:length(x)) {
set.seed(0)
layout(matrix(c(1,2), nrow=2), heights=c(1,4))
par(bg=z)
par(mar=rep(0,4))
plot.new()
text(x=0.5, y=0.5, names(x[i]))
wordcloud(x[[i]]$word, x[[i]]$freq, scale=c(8,.3), min.freq=n,
max.words=80, random.order=TRUE, rot.per=.15,
colors=y[[i]], vfont=c("sans serif", "plain"),
main=names(x[[i]]))
}
}
I assigned colors for each data frame’s wordcloud to be constructed.
df.pal <- brewer.pal(6, "PuOr"); df.pal <- df.pal[-(3:4)]
neu_df.pal <- brewer.pal(5, "Greys"); neu_df.pal <- neu_df.pal[-(1:2)]
pos_df.pal <- brewer.pal(5, "GnBu"); pos_df.pal <- pos_df.pal[-(1:2)]
neg_df.pal <- brewer.pal(5, "YlOrRd"); neg_df.pal <- neg_df.pal[-(1:2)]
cols_list <- list(df.pal, neu_df.pal, pos_df.pal, neg_df.pal)
After setting up my environment, I ran term_doc_df() using lapply on the frames_list. I stored the resulting data frames of word occurrences in a new list called d_frames_list.
d_frames_list <- lapply(frames_list, term_doc_df)
names(d_frames_list) <- names(frames_list)
lapply(d_frames_list, function(x) {
head(x, 10)
})
## $df
## word freq
## 1 love 1424
## 2 good 1177
## 3 thanks 1046
## 4 lol 1013
## 5 dont 991
## 6 can 933
## 7 video 916
## 8 get 902
## 9 will 816
## 10 one 813
##
## $neu_df
## word freq
## 1 dont 42
## 2 ill 36
## 3 know 32
## 4 get 28
## 5 miss 28
## 6 sorry 28
## 7 people 23
## 8 stop 22
## 9 fuck 20
## 10 well 19
##
## $pos_df
## word freq
## 1 love 1411
## 2 good 1155
## 3 thanks 1036
## 4 lol 971
## 5 video 873
## 6 can 848
## 7 get 792
## 8 dont 790
## 9 one 726
## 10 will 720
##
## $neg_df
## word freq
## 1 shit 259
## 2 fuck 219
## 3 fucking 161
## 4 dont 159
## 5 bad 146
## 6 ill 141
## 7 ass 118
## 8 hate 106
## 9 hell 104
## 10 know 104
At this point, I noticed a pattern other than the glaring observation that a lot of Twitter users have foul mouths: the 2016 U.S. presidential candidates, Hillary Clinton and Donald Trump, made it into the wordclouds. Who was mentioned the most in each data frame?
lapply(d_frames_list, function(x) {
mentions <- x[grepl("^@", x$word),]
my_df <- data.frame(mentions)
head(my_df, 10) } )
## $df
## word freq
## 91 @realdonaldtrump 214
## 194 @jacobwhitesides 125
## 205 @hillaryclinton 120
## 265 @cnvey 95
## 274 @harrystyles 92
## 378 @foxnews 67
## 425 @camerondallas 59
## 441 @cnn 57
## 519 @dtopbeautyworld 48
## 753 @clevernetwork 33
##
## $neu_df
## word freq
## 68 @hillaryclinton 7
## 69 @realdonaldtrump 7
## 148 @amazingphil 4
## 149 @ladygaga 4
## 225 @huffpostpol 3
## 226 @rickyvaughn 3
## 352 @barackobama 2
## 353 @bastilledan 2
## 354 @cnni 2
## 355 @cnvey 2
##
## $pos_df
## word freq
## 135 @realdonaldtrump 141
## 170 @jacobwhitesides 115
## 225 @harrystyles 91
## 264 @hillaryclinton 78
## 304 @cnvey 67
## 371 @camerondallas 56
## 436 @dtopbeautyworld 48
## 569 @foxnews 37
## 635 @clevernetwork 33
## 744 @cnn 28
##
## $neg_df
## word freq
## 26 @realdonaldtrump 66
## 63 @hillaryclinton 35
## 84 @cnn 29
## 85 @foxnews 29
## 101 @cnvey 26
## 259 @msnbc 13
## 319 @govgaryjohnson 11
## 320 @potus 11
## 365 @reince 10
## 400 @mlauer 9
@realdonaldtrump was the most-mentioned account in every data frame except neu_df, where it tied with @hillaryclinton at 7 mentions. Additionally, @hillaryclinton made it into the top 10 mentions in every data frame, followed by @foxnews and @cnn, which made it everywhere except for the neutral frame, neu_df.
Given the polarization of the most recent U.S. politics, I figured that there might be something distinct happening in tweets aimed at the four previously mentioned figures. In general, I also wanted to see how tweets changed when they were sent to a specific person.
I first calculated the average sentiment of the data frames in frames_list. Then, I created sentiment histograms for each data frame to check the underlying distribution of data before performing a t-test.
lapply(frames_list, avg_sent)
## $df
## [1] "The average sentiment score was 7.33278471501664 sentiment points per tweet with a standard deviation of 8.94353934404517"
##
## $neu_df
## [1] "The average sentiment score was 0 sentiment points per tweet with a standard deviation of 0"
##
## $pos_df
## [1] "The average sentiment score was 9.85618144474064 sentiment points per tweet with a standard deviation of 7.53944176408914"
##
## $neg_df
## [1] "The average sentiment score was -4.90941597139452 sentiment points per tweet with a standard deviation of 5.20297542595142"
hist(df$sentiment, prob=TRUE, breaks=50)
lines(density(df$sentiment), col="chocolate4", lwd=1)
par(mfrow=c(1,2))
hist(pos_df$sentiment, breaks=20)
hist(neg_df$sentiment, breaks=10)
Overall, our tweets are approximately Gaussian. Naturally, our positive and negative tweet frames are skewed to the right and to the left, respectively. By the Central Limit Theorem, t-tests for these data should be valid so long as the sample sizes are sufficiently large.
On average, tweets in English that contained a user mention had a positive sentiment of 7.33 points. Positive tweets of the same type averaged at 9.86 sentiment points, and negative tweets were at -4.91. Using a 95% confidence interval and Student’s t-tests, negative, neutral, and positive tweets’ sentiment all differed significantly from the mean in df (Appendix 3.a).
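The actual tests are in Appendix 3.a and are not reproduced here. As a rough illustration of the kind of comparison described, the sketch below runs a one-sample t-test of a subset's scores against a fixed null mean. Whether the appendix used a one-sample or two-sample form is an assumption on my part, and the scores are deterministic toy values, not the study data.

```r
# Deterministic toy sentiment scores for a "positive" subset (illustrative).
pos_scores <- rep(c(2, 6, 10, 14, 18), times = 40)

# Mean of the full tweet set reported above, used as the null value.
null_mean <- 7.33

# One-sample Student's t-test at the 95% confidence level.
result <- t.test(pos_scores, mu = null_mean, conf.level = 0.95)
result$p.value < 0.05
```

`result$conf.int` holds the 95% confidence interval for the subset mean; the test is significant when that interval excludes `null_mean`.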
Next, I looked at the number of words a Twitterer used. Who was more verbose? I defined a function for calculating the total words used in each tweet in the data frames (adapted from source 1, 2). I removed English stop words and words with links, numbers, @ symbols, and repeating characters. The resulting function would create an extra variable on the data frame it is iterating over.
calc_tweet_length <- function(x) {
new_text <- iconv(x$text, 'UTF-8', 'ASCII')
new_text <- removeWords(new_text, stopwords("english"))
new_text <- removeWords(new_text, c("just", "like", "amp", "via"))
new_text <- gsub("\\s+", " ", new_text)
new_text <- removeMostPunctuation(new_text)
x$num_words <- sapply(new_text, function(x) length(x[!grepl("http|@|(\\w)\\1{2,}|[0-9]",
unlist(strsplit(x, "\\W+")))]))
return(x)
}
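The counting logic can be illustrated on a single string without tm. This is a simplified sketch: the short stop word vector is a hypothetical stand-in for stopwords("english"), tokens are split on whitespace, and `count_words` is an illustrative name.

```r
# Count the words left in one tweet after dropping stop words and tokens
# that contain links, @ symbols, tripled characters, or digits.
count_words <- function(text,
                        stop_words = c("the", "a", "is", "just", "like",
                                       "amp", "via")) {
  tokens <- unlist(strsplit(tolower(text), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  tokens <- tokens[!tokens %in% stop_words]
  tokens <- tokens[!grepl("http|@|(\\w)\\1{2,}|[0-9]", tokens)]
  length(tokens)
}

n_words <- count_words("@cnn this is just sooo bad http://t.co/abc 2016 vote")
n_words
```

Only "this", "bad", and "vote" survive the filters in this example, so handles, links, and elongated words do not inflate the length of a tweet.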
I ran the function calc_tweet_length, which added a new column called num_words to each data frame in frames_list.
frames_list <- lapply(frames_list, calc_tweet_length)
for (i in 1:length(frames_list)) {
print(paste("The average number of words per tweet in",
names(frames_list[i]), "was", mean(frames_list[[i]]$num_words),
"with a standard deviation of",
sd(frames_list[[i]]$num_words)))
}
## [1] "The average number of words per tweet in df was 8.49958853626248 with a standard deviation of 4.28957235497388"
## [1] "The average number of words per tweet in neu_df was 7.50746268656716 with a standard deviation of 3.6175977125453"
## [1] "The average number of words per tweet in pos_df was 8.57435650919897 with a standard deviation of 4.31674737014887"
## [1] "The average number of words per tweet in neg_df was 8.29773539928486 with a standard deviation of 4.23789275174261"
The null tweet frame, df, had an average of 8.50 words per tweet. Neutral and negative tweets were significantly shorter, with 7.51 and 8.30 words per tweet, respectively (p-values<0.05, Appendix 3.b). Positive tweets were significantly longer than the null, with 8.57 average words per tweet (p-value<0.05, Appendix 3.b).
Next, I asked the question, how did our tweet stats (sentiment and tweet length) change when we only focused on tweets sent to the top 10 most popular people in each data frame? I subset frames_list by those tweets which have a value for column 14, in_reply_to_screen_name. Note that this is a distinct feature that requires a tweet to begin with a specific user’s name, not simply mention it in the text anywhere. I created reps_frames_list by subsetting frames_list using lapply.
reps_frames_list <- lapply(frames_list, function(x) {
with_reply_df <- subset(x, !is.na(x[,14]))
})
names(reps_frames_list) <- names(frames_list)
Then, I calculated which were the ten most popular people to receive tweets in each data frame and stored the results in top10reps.
top10reps <- lapply(reps_frames_list, function(x) {
temp <- data.frame(table(x[,14]))
head(arrange(temp, desc(Freq)), 10)
})
names(top10reps) <- names(frames_list)
for (i in 1:length(top10reps)) {
print(top10reps[i])
top10reps[[i]] <- as.character(top10reps[[i]][,1])
}
## $df
## Var1 Freq
## 1 jacobwhitesides 93
## 2 realdonaldtrump 77
## 3 harry_styles 72
## 4 camerondallas 42
## 5 foxnews 38
## 6 clever_network 31
## 7 hillaryclinton 30
## 8 jack_septic_eye 28
## 9 amazingphil 22
## 10 cnn 21
##
## $neu_df
## Var1 Freq
## 1 amazingphil 4
## 2 hillaryclinton 2
## 3 jacobwhitesides 2
## 4 ricky_vaughn99 2
## 5 __shelbihoffman 1
## 6 _813am 1
## 7 _arudeboi 1
## 8 _bcmt817x 1
## 9 _esstar 1
## 10 _grendan 1
##
## $pos_df
## Var1 Freq
## 1 jacobwhitesides 88
## 2 harry_styles 72
## 3 realdonaldtrump 49
## 4 camerondallas 41
## 5 clever_network 31
## 6 foxnews 25
## 7 jack_septic_eye 24
## 8 sebtsb 18
## 9 shawnmendes 18
## 10 amazingphil 17
##
## $neg_df
## Var1 Freq
## 1 realdonaldtrump 28
## 2 foxnews 12
## 3 hillaryclinton 11
## 4 cnn 10
## 5 mcuban 6
## 6 thehill 6
## 7 joyannreid 5
## 8 msnbc 5
## 9 reince 5
## 10 cnnpolitics 4
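The same ranking can be sketched in base R without plyr's arrange(). This is an illustrative alternative on toy data, not the code used above; the vector of reply targets is hypothetical.

```r
# Toy reply targets standing in for the in_reply_to_screen_name column.
reply_to <- c("cnn", "foxnews", "cnn", NA, "realdonaldtrump", "cnn", "foxnews")

# table() drops NA by default, so tweets that are not replies are ignored.
counts <- sort(table(reply_to), decreasing = TRUE)

# Names of the two most frequent recipients.
top_two <- head(names(counts), 2)
top_two
```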
I then subset reps_frames_list, including only tweets which were sent to Twitter users in the top10reps list for that particular data frame. I stored the result in the list top_rep_frames_list.
subset_by_top_reps <- function(x,y) {
df_list <- NULL
for (i in 1:length(x)) {
temp <- subset(x[[i]], x[[i]][,14] %in% y[[i]])
df_list[[i]] <- temp
}
return(df_list)
}
top_rep_frames_list <- subset_by_top_reps(reps_frames_list, top10reps)
names(top_rep_frames_list) <- names(frames_list)
lapply(top_rep_frames_list, function(x) {
length(x[,1])
})
## $df
## [1] 454
##
## $neu_df
## [1] 16
##
## $pos_df
## [1] 383
##
## $neg_df
## [1] 92
There were 454 tweets in df sent to the most popular ten people. For neutral tweets, there were only 16. For positive, 383, and for negative, 92.
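The loop in subset_by_top_reps pairs each data frame with its own list of top recipients. A possible alternative, shown here only as a sketch on toy data, is Map(), which walks two lists in parallel. The data frames and recipient lists below are hypothetical, and the column is referenced by name for readability.

```r
# Two toy data frames and their top recipients (illustrative only).
frames <- list(
  df     = data.frame(in_reply_to_screen_name = c("cnn", "foxnews", "other"),
                      sentiment = c(2, -1, 4), stringsAsFactors = FALSE),
  neg_df = data.frame(in_reply_to_screen_name = c("foxnews", "other"),
                      sentiment = c(-1, -3), stringsAsFactors = FALSE)
)
top <- list(df = c("cnn", "foxnews"), neg_df = "foxnews")

# Map() pairs each frame with its recipients and keeps the frame names.
subset_frames <- Map(function(d, keep) {
  d[d$in_reply_to_screen_name %in% keep, , drop = FALSE]
}, frames, top)

sapply(subset_frames, nrow)
```

Because Map() carries the names of the first list over to the result, the separate names() assignment would not be needed with this approach.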
Then, I created term document data frames for top_rep_frames_list, but this time removing any user handles and any words containing ‘donald’, ‘trump’, ‘hillary’, or ‘clinton’.
d_top_rep_frames_list <- lapply(top_rep_frames_list, term_doc_df)
d_top_rep_frames_list <- lapply(d_top_rep_frames_list, function(x) {
remove_mentions <- x[!grepl("^@|donald|trump|hillary|clinton", x$word),]
})
names(d_top_rep_frames_list) <- names(frames_list)
lapply(d_top_rep_frames_list, function(x) {
head(x, 10)
})
## $df
## word freq
## 1 love 59
## 2 whytonight 58
## 3 youre 40
## 4 follow 36
## 5 happy 34
## 6 cleanconfession 30
## 7 following 22
## 8 mind 22
## 9 will 21
## 10 birthday 20
##
## $neu_df
## word freq
## 1 dont 2
## 2 people 2
## 3 sure 2
## 4 whytonight 2
## 5 accident 1
## 6 album 1
## 7 america 1
## 8 bad 1
## 9 bet 1
## 10 blocked 1
##
## $pos_df
## word freq
## 1 love 65
## 2 whytonight 54
## 3 follow 37
## 4 youre 34
## 5 happy 33
## 6 cleanconfession 30
## 7 please 28
## 8 following 22
## 9 mind 22
## 10 birthday 20
##
## $neg_df
## word freq
## 1 liar 8
## 2 lied 5
## 3 ass 4
## 4 cant 4
## 5 died 4
## 6 get 4
## 7 kill 4
## 8 obama 4
## 9 old 4
## 10 prison 4
I calculated the average sentiment score of tweets in the top_rep_frames_list.
lapply(top_rep_frames_list, avg_sent)
## $df
## [1] "The average sentiment score was 10.2466960352423 sentiment points per tweet with a standard deviation of 12.2453189726458"
##
## $neu_df
## [1] "The average sentiment score was 0 sentiment points per tweet with a standard deviation of 0"
##
## $pos_df
## [1] "The average sentiment score was 13.2114882506527 sentiment points per tweet with a standard deviation of 10.3199939947445"
##
## $neg_df
## [1] "The average sentiment score was -7.1195652173913 sentiment points per tweet with a standard deviation of 9.12771698681025"
Overall, sentiment shifted in the positive direction, with df producing an average sentiment score of 10.25. Positive tweets moved up to 13.21, and negative tweets decreased in sentiment to -7.12. Neutral, positive, and negative tweets all differed significantly in sentiment from the null, top_rep_frames_list$df data frame (Appendix 4.a).
Then, I compared the data frames in our original frames_list with the ones in top_rep_frames_list. Both df and pos_df sentiment increased significantly after subsetting by tweets sent to a top 10 recipient of tweets in their data frames (p-values<0.05), but there was no significant difference in neg_df sentiment after subsetting similarly (Appendix 4.b).
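The tests themselves are in Appendix 4.b. As a hedged illustration of comparing a frame with its subset, the sketch below runs a two-sample t-test on deterministic toy scores. Which variant the appendix used is not shown here; R's default is Welch's test, and passing var.equal = TRUE would give the classic pooled Student's test instead.

```r
# Deterministic toy scores: a full frame and a subset of it (illustrative).
full_scores   <- rep(c(-2, 3, 7, 11, 16), times = 60)
subset_scores <- rep(c(2, 8, 12, 16, 22), times = 20)

# With two samples, t.test() runs Welch's test by default (var.equal = FALSE).
comparison <- t.test(subset_scores, full_scores)
comparison$p.value < 0.05
```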
Which tweets contained more words?
top_rep_frames_list <- lapply(top_rep_frames_list, calc_tweet_length)
for (i in 1:length(top_rep_frames_list)) {
print(paste("The average number of words per tweet in",
names(top_rep_frames_list[i]), "was",
mean(top_rep_frames_list[[i]]$num_words),
"with a standard deviation of",
sd(top_rep_frames_list[[i]]$num_words)))
}
## [1] "The average number of words per tweet in df was 9.58810572687225 with a standard deviation of 4.78868602469423"
## [1] "The average number of words per tweet in neu_df was 7.5625 with a standard deviation of 2.96577702016408"
## [1] "The average number of words per tweet in pos_df was 8.93733681462141 with a standard deviation of 4.91724609097849"
## [1] "The average number of words per tweet in neg_df was 11.3804347826087 with a standard deviation of 3.6938987155815"
Overall, our tweets in the top replies df had an average of 9.59 words per tweet. Neutral tweets were still shorter at 7.56. This time, positive and negative tweets switched: negative tweets were longer than the null at 11.38 words per tweet, and positive tweets shorter at 8.94 words per tweet on average. I tested the significance of the length of the neutral, positive, and negative tweets compared to the mean in this list, df. Neutral and positive tweets aimed at a top 10 recipient in each respective data frame were significantly shorter than in the null frame, and negative tweets were significantly longer (p-values<0.05; Appendix 4.c).
I then compared these results to the previous ones from frames_list. How did tweet length change when tweets were only being sent to the most popular people in each data frame? The average length of tweets in df, pos_df, and neg_df increased significantly after subsetting tweets sent to a top 10 recipient in each data frame (p-values<0.05), but there was no significant difference in neutral tweet length (Appendix 4.d).
Next, I split the data frames further into a list of data frames of tweets addressed to the left and to the right. The left included tweets aimed at @hillaryclinton and @cnn, and the right, tweets to @realdonaldtrump and @foxnews. Then, I combined the two frames and stored them in a list called political_frames_list.
left <- c("hillaryclinton", "cnn")
left_frames_list <- lapply(reps_frames_list, function(x) {
tweets_at_left <- subset(x, x[,14] %in% left)
tweets_at_left$side <- "L"
return(tweets_at_left)
}
)
names(left_frames_list) <- names(frames_list)
lapply(left_frames_list, function(x) {
length(x[,1])
})
## $df
## [1] 51
##
## $neu_df
## [1] 2
##
## $pos_df
## [1] 28
##
## $neg_df
## [1] 21
left_frames_list <- list(left_frames_list[[1]], left_frames_list[[3]],
left_frames_list[[4]])
names(left_frames_list) <- names(frames_list)[-2]
right <- c("realdonaldtrump", "foxnews")
right_frames_list <- lapply(reps_frames_list, function(x) {
tweets_at_right <- subset(x, x[,14] %in% right)
tweets_at_right$side <- "R"
return(tweets_at_right)
}
)
names(right_frames_list) <- names(frames_list)
lapply(right_frames_list, function(x) {
length(x[,1])
})
## $df
## [1] 115
##
## $neu_df
## [1] 1
##
## $pos_df
## [1] 74
##
## $neg_df
## [1] 40
right_frames_list <- list(right_frames_list[[1]], right_frames_list[[3]],
right_frames_list[[4]])
names(right_frames_list) <- names(frames_list)[-2]
political_frames_list <- NULL
for (i in 1:length(left_frames_list)) {
political_frames_list[[i]] <- rbind(left_frames_list[[i]],
right_frames_list[[i]])
}
names(political_frames_list) <- names(frames_list)[-2]
The number of tweets in each data frame dropped sharply after subsetting to politically aimed tweets. With the exception of neutral tweets, the number of tweets, n, was large enough per the Central Limit Theorem that using t-tests was still valid. Because there weren’t enough observations of neutral tweets, I dropped them from the list and from future analyses.
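The combining step above uses a for loop over list positions. A more compact alternative, sketched here on toy data only, is Map(rbind, ...), which row-binds the two lists pairwise. The toy frames and their names are hypothetical.

```r
# Toy left- and right-aimed frames with matching names (illustrative only).
left_toy <- list(
  df     = data.frame(sentiment = c(1, -2), side = "L",
                      stringsAsFactors = FALSE),
  pos_df = data.frame(sentiment = 1, side = "L", stringsAsFactors = FALSE)
)
right_toy <- list(
  df     = data.frame(sentiment = c(3, 4, -1), side = "R",
                      stringsAsFactors = FALSE),
  pos_df = data.frame(sentiment = c(3, 4), side = "R",
                      stringsAsFactors = FALSE)
)

# Row-bind each pair of frames; names carry over from the first list.
political_toy <- Map(rbind, left_toy, right_toy)
sapply(political_toy, nrow)
```

Each combined frame keeps the side column, so the left and right tweets can still be told apart after merging.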
Before splitting the political tweets, I analyzed them in the political_frames_list, first generating term document data frames and then creating wordclouds. I removed mentions of Twitter handles and the 2016 U.S. presidential candidates’ names that could confound results.
d_political_frames_list <- lapply(political_frames_list, term_doc_df)
d_political_frames_list <- lapply(d_political_frames_list, function(x) {
remove_mentions <- x[!grepl("^@|donald|trump|hillary|clinton", x$word),]
})
names(d_political_frames_list) <- names(political_frames_list)
lapply(d_political_frames_list, function(x) {
head(x, 10) })
## $df
## word freq
## 1 america 11
## 2 country 10
## 3 will 10
## 4 youre 10
## 5 can 9
## 6 cant 8
## 7 liar 8
## 8 lied 8
## 9 night 8
## 10 time 8
##
## $pos_df
## word freq
## 1 america 9
## 2 country 9
## 3 will 8
## 4 can 7
## 5 night 7
## 6 time 7
## 7 best 6
## 8 putin 6
## 9 thing 6
## 10 youre 6
##
## $neg_df
## word freq
## 1 liar 5
## 2 lied 5
## 3 old 4
## 4 prison 4
## 5 boy 3
## 6 cant 3
## 7 lie 3
## 8 obama 3
## 9 racist 3
## 10 right 3
Then, I calculated average sentiment and tweet length. I also compared the results to the tweets in the previous lists, frames_list and top_rep_frames_list.
lapply(political_frames_list, avg_sent)
## $df
## [1] "The average sentiment score was 2.74698795180723 sentiment points per tweet with a standard deviation of 10.2405532207259"
##
## $pos_df
## [1] "The average sentiment score was 8.42156862745098 sentiment points per tweet with a standard deviation of 6.33774841255064"
##
## $neg_df
## [1] "The average sentiment score was -6.60655737704918 sentiment points per tweet with a standard deviation of 8.75838396152431"
The null tweet set, df, had an average sentiment score of 2.75. Positive tweets averaged 8.42 and negative tweets -6.61 sentiment points per tweet. Both positive and negative tweets differed significantly in sentiment from the null, df (p-values<0.05, Appendix 5.a). Compared to the original data frames in frames_list, both combined (df) and positive tweet sentiment decreased significantly (p-values<0.05), but there was no change in negative tweet sentiment (Appendix 5.b). Political tweets differed in the same way from tweets sent to the top 10 most popular people in each data frame: both combined and positive tweets had a lower sentiment score (p-values<0.05), but negative tweets were unaffected (Appendix 5.c).
How did tweet length differ among political tweets?
political_frames_list <- lapply(political_frames_list, calc_tweet_length)
for (i in 1:length(political_frames_list)) {
print(paste("The average number of words per tweet in",
names(political_frames_list[i]), "was",
mean(political_frames_list[[i]]$num_words),
"with a standard deviation of",
sd(political_frames_list[[i]]$num_words)))
}
## [1] "The average number of words per tweet in df was 11.6385542168675 with a standard deviation of 3.69956037766"
## [1] "The average number of words per tweet in pos_df was 12.0098039215686 with a standard deviation of 3.67893413868965"
## [1] "The average number of words per tweet in neg_df was 11.0983606557377 with a standard deviation of 3.76255639174922"
There was no difference in tweet length between positive or negative tweets and the null frame df (Appendix 5.d). However, all tweets (combined, positive, and negative) in the political frames list were longer than their counterparts in the original frames_list (Appendix 5.e). Positive and combined tweets were longer than in their parallel data frames in top_rep_frames_list (p-values<0.05), but there was no difference in the length of negative tweets between the two lists (Appendix 5.f).
Finally, I divided the tweets by left and right. I repeated the previous steps by creating term document data frames and wordclouds and by calculating average tweet sentiment and length for each side.
d_left_frames_list <- lapply(left_frames_list, term_doc_df)
d_left_frames_list <- lapply(d_left_frames_list, function(x) {
remove_mentions <- x[!grepl("^@|donald|trump|hillary|clinton|fox|cnn", x$word),]
})
names(d_left_frames_list) <- names(left_frames_list)
lapply(d_left_frames_list, function(x) {
head(x, 10) })
## $df
## word freq
## 1 putin 5
## 2 thing 5
## 3 time 5
## 4 know 4
## 5 prison 4
## 6 will 4
## 7 america 3
## 8 cant 3
## 9 chief 3
## 10 let 3
##
## $pos_df
## word freq
## 1 thing 5
## 2 time 5
## 3 know 4
## 4 putin 4
## 5 never 3
## 6 will 3
## 7 youre 3
## 8 america 2
## 9 bashing 2
## 10 best 2
##
## $neg_df
## word freq
## 1 prison 4
## 2 billion 2
## 3 calls 2
## 4 deal 2
## 5 doctors 2
## 6 kill 2
## 7 least 2
## 8 mistake 2
## 9 now 2
## 10 obama 2
d_right_frames_list <- lapply(right_frames_list, term_doc_df)
d_right_frames_list <- lapply(d_right_frames_list, function(x) {
remove_mentions <- x[!grepl("^@|donald|trump|hillary|clinton|fox|cnn", x$word),]
})
names(d_right_frames_list) <- names(right_frames_list)
lapply(d_right_frames_list, function(x) {
head(x, 10) } )
## $df
## word freq
## 1 america 8
## 2 can 8
## 3 country 8
## 4 youre 7
## 5 liar 6
## 6 lied 6
## 7 make 6
## 8 will 6
## 9 cant 5
## 10 good 5
##
## $pos_df
## word freq
## 1 america 7
## 2 country 7
## 3 can 6
## 4 make 5
## 5 night 5
## 6 want 5
## 7 will 5
## 8 best 4
## 9 good 4
## 10 better 3
##
## $neg_df
## word freq
## 1 liar 4
## 2 lied 4
## 3 old 3
## 4 still 3
## 5 youre 3
## 6 aleppo 2
## 7 another 2
## 8 ass 2
## 9 boy 2
## 10 brain 2
There was a lot of emotion apparent in these lists. I calculated average sentiment score for these political tweets.
lapply(left_frames_list, avg_sent)
## $df
## [1] "The average sentiment score was 1.01960784313725 sentiment points per tweet with a standard deviation of 11.9272632168129"
##
## $pos_df
## [1] "The average sentiment score was 8.25 sentiment points per tweet with a standard deviation of 6.12599198316304"
##
## $neg_df
## [1] "The average sentiment score was -8.52380952380952 sentiment points per tweet with a standard deviation of 11.6645576324996"
I observed an average sentiment score of 1.02 across all tweets aimed at the left, 8.25 for positive tweets aimed at the left, and -8.52 for negative tweets aimed at the left. Then, I tested for significance. I omitted neutral data frames from all analyses due to insufficient sample size. Positive tweets sent to @hillaryclinton and @cnn had significantly higher sentiment than the null, and negative tweets sent to the left had a significantly lower sentiment than the null (p-values<0.05, Appendix 6.a).
Compared to the sentiment in the original tweet lists in frames_list, combined sentiment decreased significantly (p-value<0.05), but there was no difference in sentiment on the positive and negative levels (Appendix 6.b). When compared to the sentiment across generally popular people in top_rep_frames_list, sentiment decreased significantly overall and among positive tweets (p-values<0.05), but there was no difference in negative tweet sentiment (Appendix 6.c).
I then compared sentiment in the left-aimed tweets to the combined political tweets, but there was no difference in sentiment (Appendix 6.d).
How did tweet length change in left-aimed tweets?
left_frames_list <- lapply(left_frames_list, calc_tweet_length)
for (i in 1:length(left_frames_list)) {
print(paste("The average number of words per tweet in",
names(left_frames_list[i]), "was",
mean(left_frames_list[[i]]$num_words),
"with a standard deviation of",
sd(left_frames_list[[i]]$num_words)))
}
## [1] "The average number of words per tweet in df was 12.2352941176471 with a standard deviation of 3.60881274268487"
## [1] "The average number of words per tweet in pos_df was 12.6071428571429 with a standard deviation of 3.75489099032267"
## [1] "The average number of words per tweet in neg_df was 11.9047619047619 with a standard deviation of 3.59033093049599"
Left-aimed tweets had average tweet lengths of 12.24 words overall, with 12.61 and 11.90 average words per tweet in positive and negative tweets, respectively. There was no difference in left-aimed positive or negative tweet length compared to the null, df (Appendix 6.e). All tweet frames were longer when compared to parallel frames in the original frames_list (p-values<0.05, Appendix 6.f). Both positive and combined tweets were longer in the left-aimed frames list than parallel frames in top_rep_frames_list (p-values<0.05), but there was no difference in negative tweet length (Appendix 6.g). Additionally, there was no difference in tweet length between the left-aimed tweets and the broader, political frames list (Appendix 6.h).
Then, I switched sides and asked, how did the right change?
lapply(right_frames_list, avg_sent)
## $df
## [1] "The average sentiment score was 3.51304347826087 sentiment points per tweet with a standard deviation of 9.35214134161432"
##
## $pos_df
## [1] "The average sentiment score was 8.48648648648649 sentiment points per tweet with a standard deviation of 6.45584208877179"
##
## $neg_df
## [1] "The average sentiment score was -5.6 sentiment points per tweet with a standard deviation of 6.72461990156416"
Right-aimed tweets had a sentiment score of 3.51 points across all tweets, 8.49 for positive tweets, and -5.6 for negative tweets. Positive tweets sent to @realdonaldtrump and @foxnews were significantly higher in sentiment than the null, and negative tweets were significantly lower than the null (p-values<0.05; Appendix 7.a). Compared to the original tweets in frames_list, combined sentiment decreased in tweets sent to the right (p-value<0.05), but there was no sentiment difference on the positive or negative tweet level (Appendix 7.b). When compared to top_rep_frames_list, combined sentiment and positive tweet sentiment decreased significantly (p-values<0.05), but there was no difference in negative tweet sentiment (Appendix 7.c).
I calculated tweet length in the right-aimed tweet frames list.
right_frames_list <- lapply(right_frames_list, calc_tweet_length)
for (i in seq_along(right_frames_list)) {
  print(paste("The average number of words per tweet in",
              names(right_frames_list[i]), "was",
              mean(right_frames_list[[i]]$num_words),
              "with a standard deviation of",
              sd(right_frames_list[[i]]$num_words)))
}
## [1] "The average number of words per tweet in df was 11.3739130434783 with a standard deviation of 3.72394072909005"
## [1] "The average number of words per tweet in pos_df was 11.7837837837838 with a standard deviation of 3.64999632302412"
## [1] "The average number of words per tweet in neg_df was 10.675 with a standard deviation of 3.82560536520123"
There was an average of 11.37 words per tweet across all tweets in the right-aimed tweet frame. Positive tweets were slightly longer at 11.78 average words per tweet, and negative tweets were slightly shorter at 10.68.
When left-aimed and right-aimed tweets were compared to each other, there was no difference in sentiment across any data frame (Appendix 8.a).
Next, I explored how political sentiment differed from the original data frames, as well as the top10replies data frames.
t.test(left_frames_list$df$sentiment, reps_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$sentiment and reps_frames_list$df$sentiment
## t = -3.2595, df = 50.117, p-value = 0.002009
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.803384 -2.090653
## sample estimates:
## mean of x mean of y
## 1.019608 6.466627
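The output above reports a non-integer df because t.test defaults to Welch's unequal-variance test. As a rough sketch of where the t, df, and p-value come from, the statistic can be recomputed by hand and checked against t.test. The helper welch_by_hand and the two sample vectors are hypothetical and are not drawn from the tweet data.

```r
# Hypothetical helper: Welch's t statistic and Welch-Satterthwaite degrees
# of freedom, computed from the two samples directly.
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  dof <- (vx + vy)^2 /
    (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df = dof)  # two-sided p-value
  c(t = t_stat, df = dof, p.value = p_value)
}

# Invented sentiment scores, not taken from the tweet data.
x <- c(1, -4, 8, 3, -2, 5)
y <- c(9, 12, 6, 14, 7, 10, 11)

by_hand  <- welch_by_hand(x, y)
built_in <- t.test(x, y)     # Welch is the default (var.equal = FALSE)
print(by_hand)
```

Because the degrees of freedom depend on both sample variances and both sample sizes, comparisons involving the small political subsets report far fewer degrees of freedom than comparisons within the original list.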
Recall the original list of tweets collected from the Twitter API (Table 1). The specifications for that list were that a tweet was in English, contained a user mention, and had at least one word found in the AFINN-111 word list.
I made several wordclouds from the original and political tweets.
Wordclouds created on the sentiment frames of the original tweet list using a random set of 100 words that occurred at least 7 times in each frame.
Wordclouds created on the sentiment frames of the political tweet list using a random set of a maximum of 100 words that occurred at least 2 times in each frame.
Wordclouds created on the sentiment frames of the left-aimed tweet list using a random set of a maximum of 100 words that occurred at least 2 times in each frame.
Wordclouds created on the sentiment frames of the right-aimed tweet list using a random set of a maximum of 100 words that occurred at least 2 times in each frame.
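The word selection described in these captions (a minimum frequency and a random cap of 100 words) could be sketched in base R as follows. This is an illustration under stated assumptions, not the code used for the figures: pick_cloud_words is a hypothetical helper, the tokenizing rule is a guess, and the toy tweets are invented.

```r
# Hypothetical helper: count words in a character vector of tweets, keep
# those that occur at least min_freq times, and sample at most max_words.
pick_cloud_words <- function(tweets, min_freq = 2, max_words = 100) {
  # Assumed tokenizer: lowercase, split on anything that is not a letter,
  # @, #, or apostrophe.
  tokens <- unlist(strsplit(tolower(tweets), "[^a-z@#']+"))
  tokens <- tokens[nzchar(tokens)]
  counts <- table(tokens)
  counts <- counts[counts >= min_freq]
  if (length(counts) > max_words) {
    counts <- counts[sample(length(counts), max_words)]
  }
  data.frame(word = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

set.seed(1)
toy_tweets <- c("@cnn great debate tonight",
                "@cnn the debate was great",
                "what a debate")
cloud_words <- pick_cloud_words(toy_tweets, min_freq = 2)
print(cloud_words)
# cloud_words$word and cloud_words$freq could then be passed to a plotting
# function such as wordcloud::wordcloud(), if that package is installed.
```

Setting a seed before sampling would make the random subset of words reproducible between runs.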
Combined tweet sentiment increased from the original list to the list of tweets sent to a popular recipient, but political tweets were lower in sentiment than the first two lists (Fig. 1). The combined set of tweets also increased in tweet length from the original list to the top replies list and continued to increase in the political list (Fig. 2). After subsetting political tweets by left or right focus, right-aimed tweets had a higher observed sentiment and a shorter observed length than left-aimed ones (Figs. 3 & 4). However, these differences were not statistically significant.
Fig. 1. Boxplots of combined tweet sentiment using the df frame in three lists of tweets that decrease
in size from left to right. The mean sentiment score of the first list (original) was 7.33, the second (top
replies) was 10.25, and the third (political) was 2.75.
Fig. 2. Boxplots of combined tweet length (in words) using the df frame in three lists of tweets that
decrease in size from left to right. The mean number of words of the first list (original) was 8.50, the
second (top replies) was 9.59, and the third (political) was 11.64.
Fig. 3. Boxplots of combined tweet sentiment using the df frame in two political lists of tweets that are
aimed at @hillaryclinton and @cnn on the left and @realdonaldtrump and @foxnews on the right. The mean
sentiment score of the first list of tweets (left) was 1.02, and the second (right) was 3.51. These observed
differences were not significant.
Fig. 4. Boxplots of combined tweet length (in words) using the df frame in two political lists of tweets that
are aimed at @hillaryclinton and @cnn on the left and @realdonaldtrump and @foxnews on the right. The mean
number of words of the first list of tweets (left) was 12.24, and the second (right) was 11.37. These observed
differences were not significant.
Then, I analyzed sentiment and length differences within each list and across subsets. Recall that there were many more positive than negative tweets, and very few neutral tweets in the original tweet list (Table 1). In the original list, neutral and negative tweets were shorter than the null mean of combined tweets, df. There was no significant difference between positive tweet length and the null mean (Fig. 5).
Table 1. The total number of tweets per sentiment frame in the original tweet list. df contains the combined tweets.
##
## df neg_df neu_df pos_df
## 27949 4195 871 22883
Fig. 5. Boxplots of tweet length (in words) for each of the sentiment frames in the original frames list. df
represents the combined tweets and the null mean. neg_df and pos_df contain tweets with a negative
or positive sentiment score, respectively, and neu_df contains the tweets with a sentiment score equal
to 0. The null mean tweet length (in words) was 8.50. For negative tweets, the mean length was 8.30; for
neutral tweets, 7.51; and for positive tweets, 8.57 (not significantly different from the null).
In the list of tweets only sent to a top 10 recipient of tweets for each data frame, negative tweets were longer than the null (df), and neutral tweets were shorter. There was no significant difference in the length of positive tweets from the null (Fig. 6).
Recall that there were only 3 neutral tweets in the political frames list, so I eliminated them. In the list of political tweets, there was no significant difference in the length of positive or negative tweets compared to the null, df (Fig. 7).
Then, I looked at how each of the sentiment frames changed across the subsets.
## [1] 36
## [1] "@CNN Wars kill innocent people,TheBible says,God doesn't kill the innocent,wars disobey TheBible,God's Word&sin is disobedience toGod's Word"
## [1] 136
## [1] "@FoxNews Happy Birthday too a awesome show that I have watched sense of the time it started,The thing's that Gene Roddenberry have made"
Did tweet length increase over time? The change was not significant.
Python code for filtering tweets from Twitter API. https://github.com/kairstenfay/cyberbullying/blob/master/filter_rate_tweets.py.
Supplementary Python code used to collect tweets from Twitter API. https://github.com/kairstenfay/cyberbullying/blob/master/twitterstream.py
Load tweets from JSON formatted text file and save them into a new file, ‘tweet_frame.csv’. This step was done in advance of the document creation.
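library(RJSONIO)  # fromJSON()
library(plyr)     # rbind.fill()
# Parse the file of tweets; JSON nulls become NA
tweets_json <- fromJSON("one_line_tweets.txt", nullValue = NA)
# Flatten each tweet to a one-row data frame, replacing nested lists with NA
dat <- lapply(tweets_json, function(j) {
  as.data.frame(replace(j, sapply(j, is.list), NA))
})
# Stack the rows, filling fields that a tweet lacks with NA
df <- rbind.fill(dat)
df$text <- as.character(df$text)
str(df)
Sys.setlocale('LC_ALL','C')
write.csv(df, "tweet_frame.csv", row.names = FALSE, na = "NA")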
The raw file can be seen at https://github.com/kairstenfay/cyberbullying/blob/master/one_line_tweets.txt.
Structure and summary of df.
str(df)
## 'data.frame': 27949 obs. of 33 variables:
## $ contributors : logi NA NA NA NA NA NA ...
## $ truncated : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ text : chr "@Crimson__Hybrid B-But I want to...-" "@UltimateKing26 nice uncle" "Enter to #Win a $40 @Penningtons Gift Card from @MapleMouseMama CAN/US 9/29 #Giveaway #DisneySMMC https://t.co/r6riQKReUZ" "@eeelizzzabeth @DannyShapiro13 @Beaumont_Sports \nA) we're the best of chums IRL.\nB) all musical copyright to .@daveth89" ...
## $ is_quote_status : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ in_reply_to_status_id : num 7.74e+17 NA NA 7.74e+17 7.74e+17 ...
## $ id : num 7.74e+17 7.74e+17 7.74e+17 7.74e+17 7.74e+17 ...
## $ favorite_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sentiment : int 1 3 32 14 20 11 -4 8 18 9 ...
## $ source : Factor w/ 398 levels "<a href=\"http://02cd446.netsolhost.com/poli/hoc.html\" rel=\"nofollow\">The Long Road</a>",..: 312 148 144 312 148 144 148 147 147 148 ...
## $ retweeted : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ coordinates : logi NA NA NA NA NA NA ...
## $ timestamp_ms : num 1.47e+12 1.47e+12 1.47e+12 1.47e+12 1.47e+12 ...
## $ entities : logi NA NA NA NA NA NA ...
## $ in_reply_to_screen_name : Factor w/ 19481 levels "_______anisa",..: 4207 18339 NA 5458 751 NA 14664 13154 12663 467 ...
## $ id_str : num 7.74e+17 7.74e+17 7.74e+17 7.74e+17 7.74e+17 ...
## $ retweet_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ in_reply_to_user_id : num 4.85e+08 3.12e+09 NA 3.01e+08 4.70e+08 ...
## $ favorited : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ user : logi NA NA NA NA NA NA ...
## $ geo : logi NA NA NA NA NA NA ...
## $ in_reply_to_user_id_str : num 4.85e+08 3.12e+09 NA 3.01e+08 4.70e+08 ...
## $ lang : Factor w/ 1 level "en": 1 1 1 1 1 1 1 1 1 1 ...
## $ created_at : POSIXct, format: "2016-09-08 18:09:49" "2016-09-08 18:09:49" ...
## $ filter_level : Factor w/ 1 level "low": 1 1 1 1 1 1 1 1 1 1 ...
## $ in_reply_to_status_id_str: num 7.74e+17 NA NA 7.74e+17 7.74e+17 ...
## $ place : logi NA NA NA NA NA NA ...
## $ possibly_sensitive : logi NA NA FALSE NA NA FALSE ...
## $ extended_entities : logi NA NA NA NA NA NA ...
## $ quoted_status_id : num NA NA NA NA NA NA NA NA NA NA ...
## $ quoted_status : logi NA NA NA NA NA NA ...
## $ quoted_status_id_str : num NA NA NA NA NA NA NA NA NA NA ...
## $ scopes : logi NA NA NA NA NA NA ...
## $ withheld_in_countries : Factor w/ 1 level "TR": NA NA NA NA NA NA NA NA NA NA ...
summary(df)
## contributors truncated text is_quote_status
## Mode:logical Mode :logical Length:27949 Mode :logical
## NA's:27949 FALSE:27949 Class :character FALSE:27301
## NA's :0 Mode :character TRUE :648
## NA's :0
##
##
##
## in_reply_to_status_id id favorite_count
## Min. :1.279e+17 Min. :7.739e+17 Min. :0
## 1st Qu.:7.740e+17 1st Qu.:7.740e+17 1st Qu.:0
## Median :7.740e+17 Median :7.740e+17 Median :0
## Mean :7.735e+17 Mean :7.740e+17 Mean :0
## 3rd Qu.:7.740e+17 3rd Qu.:7.740e+17 3rd Qu.:0
## Max. :7.740e+17 Max. :7.740e+17 Max. :0
## NA's :9758
## sentiment
## Min. :-55.000
## 1st Qu.: 2.000
## Median : 6.000
## Mean : 7.333
## 3rd Qu.: 12.000
## Max. :111.000
##
## source
## <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> :11709
## <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a> : 5485
## <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>: 5206
## <a href="http://www.google.com/" rel="nofollow">Google</a> : 835
## <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a> : 782
## <a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a> : 697
## (Other) : 3235
## retweeted coordinates timestamp_ms entities
## Mode :logical Mode:logical Min. :1.473e+12 Mode:logical
## FALSE:27949 NA's:27949 1st Qu.:1.473e+12 NA's:27949
## NA's :0 Median :1.473e+12
## Mean :1.473e+12
## 3rd Qu.:1.473e+12
## Max. :1.473e+12
##
## in_reply_to_screen_name id_str retweet_count
## jacobwhitesides: 93 Min. :7.739e+17 Min. :0
## realdonaldtrump: 77 1st Qu.:7.740e+17 1st Qu.:0
## harry_styles : 72 Median :7.740e+17 Median :0
## camerondallas : 42 Mean :7.740e+17 Mean :0
## foxnews : 38 3rd Qu.:7.740e+17 3rd Qu.:0
## (Other) :21297 Max. :7.740e+17 Max. :0
## NA's : 6330
## in_reply_to_user_id favorited user geo
## Min. :1.200e+01 Mode :logical Mode:logical Mode:logical
## 1st Qu.:1.350e+08 FALSE:27949 NA's:27949 NA's:27949
## Median :5.750e+08 NA's :0
## Mean :6.851e+16
## 3rd Qu.:2.770e+09
## Max. :7.740e+17
## NA's :6330
## in_reply_to_user_id_str lang created_at
## Min. :1.200e+01 en:27949 Min. :2016-09-08 18:09:49
## 1st Qu.:1.350e+08 1st Qu.:2016-09-08 19:14:24
## Median :5.750e+08 Median :2016-09-08 20:16:08
## Mean :6.851e+16 Mean :2016-09-08 20:15:32
## 3rd Qu.:2.770e+09 3rd Qu.:2016-09-08 21:16:22
## Max. :7.740e+17 Max. :2016-09-08 22:20:52
## NA's :6330
## filter_level in_reply_to_status_id_str place possibly_sensitive
## low:27949 Min. :1.279e+17 Mode:logical Mode :logical
## 1st Qu.:7.740e+17 NA's:27949 FALSE:6070
## Median :7.740e+17 TRUE :91
## Mean :7.735e+17 NA's :21788
## 3rd Qu.:7.740e+17
## Max. :7.740e+17
## NA's :9758
## extended_entities quoted_status_id quoted_status quoted_status_id_str
## Mode:logical Min. :4.804e+16 Mode:logical Min. :4.804e+16
## NA's:27949 1st Qu.:7.738e+17 NA's:27949 1st Qu.:7.738e+17
## Median :7.739e+17 Median :7.739e+17
## Mean :7.690e+17 Mean :7.690e+17
## 3rd Qu.:7.740e+17 3rd Qu.:7.740e+17
## Max. :7.740e+17 Max. :7.740e+17
## NA's :27303 NA's :27303
## scopes withheld_in_countries
## Mode:logical TR : 1
## NA's:27949 NA's:27948
##
##
##
##
##
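The structure output above shows the id and id_str columns stored as num with values near 7.74e+17. Integers of that size exceed 2^53, so distinct tweet IDs can collapse to the same double. As a hedged sketch, assuming the frame is read back from a CSV with read.csv, the string ID could be kept as character through the colClasses argument. The two IDs and the temporary file below are made up for illustration.

```r
# Two distinct, made-up 18-digit IDs of the same magnitude as the tweet IDs.
ids <- c("774000000000000001", "774000000000000002")

# As doubles they collapse to the same value, because integers above 2^53
# cannot all be represented exactly.
print(as.numeric(ids[1]) == as.numeric(ids[2]))

# Round-trip through a temporary CSV, keeping id_str as character.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id_str = ids, sentiment = c(1L, 3L),
                     stringsAsFactors = FALSE),
          tmp, row.names = FALSE)
kept <- read.csv(tmp, colClasses = c(id_str = "character"))
unlink(tmp)
print(kept$id_str)
```

This does not affect the sentiment or length results, but it would matter for any step that matches tweets by ID, such as joining replies to the tweets they answer.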
Did neutral, positive, or negative tweet sentiment differ significantly from the null df?
t.test(neu_df$sentiment, df$sentiment)
##
## Welch Two Sample t-test
##
## data: neu_df$sentiment and df$sentiment
## t = -137.07, df = 27948, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.437641 -7.227929
## sample estimates:
## mean of x mean of y
## 0.000000 7.332785
t.test(pos_df$sentiment, df$sentiment)
##
## Welch Two Sample t-test
##
## data: pos_df$sentiment and df$sentiment
## t = 34.512, df = 50787, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.380088 2.666705
## sample estimates:
## mean of x mean of y
## 9.856181 7.332785
t.test(neg_df$sentiment, df$sentiment)
##
## Welch Two Sample t-test
##
## data: neg_df$sentiment and df$sentiment
## t = -126.84, df = 8488.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.43139 -12.05301
## sample estimates:
## mean of x mean of y
## -4.909416 7.332785
Did neutral, positive, or negative tweet length differ significantly from the null, df?
t.test(frames_list$neu_df$num_words, frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: frames_list$neu_df$num_words and frames_list$df$num_words
## t = -7.9222, df = 947.85, p-value = 6.519e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.2378944 -0.7463573
## sample estimates:
## mean of x mean of y
## 7.507463 8.499589
t.test(frames_list$pos_df$num_words, frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: frames_list$pos_df$num_words and frames_list$df$num_words
## t = 1.9483, df = 48749, p-value = 0.05138
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0004487008 0.1499846467
## sample estimates:
## mean of x mean of y
## 8.574357 8.499589
t.test(frames_list$neg_df$num_words, frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: frames_list$neg_df$num_words and frames_list$df$num_words
## t = -2.872, df = 5563.3, p-value = 0.004094
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.33963361 -0.06407267
## sample estimates:
## mean of x mean of y
## 8.297735 8.499589
After subsetting tweets by those in reply to a top 10 recipient of tweets for each data frame, did neutral, positive, or negative tweet sentiment differ significantly from the null, df?
t.test(top_rep_frames_list$neu_df$sentiment, top_rep_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$neu_df$sentiment and top_rep_frames_list$df$sentiment
## t = -17.83, df = 453, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.376107 -9.117285
## sample estimates:
## mean of x mean of y
## 0.0000 10.2467
t.test(top_rep_frames_list$pos_df$sentiment, top_rep_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$pos_df$sentiment and top_rep_frames_list$df$sentiment
## t = 3.8012, df = 835, p-value = 0.0001545
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.433857 4.495727
## sample estimates:
## mean of x mean of y
## 13.21149 10.24670
t.test(top_rep_frames_list$neg_df$sentiment, top_rep_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$neg_df$sentiment and top_rep_frames_list$df$sentiment
## t = -15.621, df = 165.07, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.56125 -15.17127
## sample estimates:
## mean of x mean of y
## -7.119565 10.246696
After subsetting tweets by those in reply to a top 10 recipient of tweets for each data frame, did any data frame see a significant change in sentiment from its parallel data frame in frames_list?
t.test(top_rep_frames_list$df$sentiment, frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$df$sentiment and frames_list$df$sentiment
## t = 5.0485, df = 460.88, p-value = 6.424e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.779669 4.048153
## sample estimates:
## mean of x mean of y
## 10.246696 7.332785
t.test(top_rep_frames_list$pos_df$sentiment, frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$pos_df$sentiment and frames_list$pos_df$sentiment
## t = 6.3346, df = 388.85, p-value = 6.583e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.313917 4.396696
## sample estimates:
## mean of x mean of y
## 13.211488 9.856181
t.test(top_rep_frames_list$neg_df$sentiment, frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$neg_df$sentiment and frames_list$neg_df$sentiment
## t = -2.3143, df = 92.301, p-value = 0.02287
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.1068086 -0.3134899
## sample estimates:
## mean of x mean of y
## -7.119565 -4.909416
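The comparisons above repeat the same t.test call for each pair of parallel data frames. As a minimal sketch, the pattern could be wrapped in a helper that loops over the shared frame names and returns one row per comparison. The helper compare_parallel and the randomly generated lists are hypothetical; only the frame names and the sentiment column name follow the document.

```r
# Toy stand-ins for two frames lists with parallel names; values are random.
set.seed(42)
make_frames <- function(shift) {
  list(
    df     = data.frame(sentiment = rnorm(40, mean = shift)),
    pos_df = data.frame(sentiment = rnorm(25, mean = shift + 2)),
    neg_df = data.frame(sentiment = rnorm(15, mean = shift - 2))
  )
}
subset_list   <- make_frames(shift = 1)
original_list <- make_frames(shift = 0)

# Hypothetical helper: Welch t-test on one column for every frame name the
# two lists share, returned as one row per frame.
compare_parallel <- function(a, b, column = "sentiment") {
  shared <- intersect(names(a), names(b))
  rows <- lapply(shared, function(nm) {
    res <- t.test(a[[nm]][[column]], b[[nm]][[column]])
    data.frame(frame = nm,
               mean_a = unname(res$estimate[1]),
               mean_b = unname(res$estimate[2]),
               p_value = res$p.value,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

comparison <- compare_parallel(subset_list, original_list)
print(comparison)
```

Passing column = "num_words" would give the matching tweet length comparisons, provided both lists carry that column.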
After subsetting tweets by those in reply to a top 10 recipient of tweets for each data frame, did neutral, positive, or negative tweet length differ significantly from the null, df?
t.test(top_rep_frames_list$neu_df$num_words, top_rep_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$neu_df$num_words and top_rep_frames_list$df$num_words
## t = -2.6145, df = 17.878, p-value = 0.01762
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.6541077 -0.3971038
## sample estimates:
## mean of x mean of y
## 7.562500 9.588106
t.test(top_rep_frames_list$pos_df$num_words, top_rep_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$pos_df$num_words and top_rep_frames_list$df$num_words
## t = -1.9305, df = 803.86, p-value = 0.0539
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.31248329 0.01094546
## sample estimates:
## mean of x mean of y
## 8.937337 9.588106
t.test(top_rep_frames_list$neg_df$num_words, top_rep_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$neg_df$num_words and top_rep_frames_list$df$num_words
## t = 4.0196, df = 159.81, p-value = 8.961e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.911719 2.672939
## sample estimates:
## mean of x mean of y
## 11.380435 9.588106
After subsetting tweets by those in reply to a top 10 recipient of tweets for each data frame, did any data frame see a significant change in tweet length from its parallel data frame in frames_list?
t.test(top_rep_frames_list$df$num_words, frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$df$num_words and frames_list$df$num_words
## t = 4.8121, df = 464.88, p-value = 2.022e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.644008 1.533026
## sample estimates:
## mean of x mean of y
## 9.588106 8.499589
t.test(top_rep_frames_list$neu_df$num_words, frames_list$neu_df$num_words)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$neu_df$num_words and frames_list$neu_df$num_words
## t = 0.073236, df = 15.831, p-value = 0.9425
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.539473 1.649548
## sample estimates:
## mean of x mean of y
## 7.562500 7.507463
t.test(top_rep_frames_list$pos_df$num_words, frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$pos_df$num_words and frames_list$pos_df$num_words
## t = 1.4354, df = 391.92, p-value = 0.152
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1341806 0.8601412
## sample estimates:
## mean of x mean of y
## 8.937337 8.574357
t.test(top_rep_frames_list$neg_df$num_words, frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: top_rep_frames_list$neg_df$num_words and frames_list$neg_df$num_words
## t = 7.8915, df = 96.328, p-value = 4.73e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.307330 3.858069
## sample estimates:
## mean of x mean of y
## 11.380435 8.297735
After subsetting by tweets with a political focus, did positive and negative tweets differ significantly in sentiment from the null?
t.test(political_frames_list$pos_df$sentiment, political_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: political_frames_list$pos_df$sentiment and political_frames_list$df$sentiment
## t = 5.6035, df = 265.98, p-value = 5.244e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.680678 7.668484
## sample estimates:
## mean of x mean of y
## 8.421569 2.746988
t.test(political_frames_list$neg_df$sentiment, political_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: political_frames_list$neg_df$sentiment and political_frames_list$df$sentiment
## t = -6.805, df = 124.04, p-value = 3.832e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.07407 -6.63302
## sample estimates:
## mean of x mean of y
## -6.606557 2.746988
How did tweets with a political focus compare in sentiment to their parallel data frames in frames_list?
t.test(political_frames_list$df$sentiment, frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: political_frames_list$df$sentiment and frames_list$df$sentiment
## t = -5.7566, df = 166.5, p-value = 4.036e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.158574 -3.013020
## sample estimates:
## mean of x mean of y
## 2.746988 7.332785
t.test(political_frames_list$pos_df$sentiment, frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: political_frames_list$pos_df$sentiment and frames_list$pos_df$sentiment
## t = -2.2789, df = 102.28, p-value = 0.02475
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.683196 -0.186030
## sample estimates:
## mean of x mean of y
## 8.421569 9.856181
t.test(political_frames_list$neg_df$sentiment, frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: political_frames_list$neg_df$sentiment and frames_list$neg_df$sentiment
## t = -1.5095, df = 60.617, p-value = 0.1364
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.945545 0.551262
## sample estimates:
## mean of x mean of y
## -6.606557 -4.909416
How did political tweets compare in sentiment to the tweets sent to the people in top_rep_frames_list?
t.test(political_frames_list$df$sentiment, top_rep_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: political_frames_list$df$sentiment and top_rep_frames_list$df$sentiment
## t = -7.6463, df = 347.98, p-value = 2.039e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.428804 -5.570613
## sample estimates:
## mean of x mean of y
## 2.746988 10.246696
t.test(political_frames_list$pos_df$sentiment, top_rep_frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: political_frames_list$pos_df$sentiment and top_rep_frames_list$pos_df$sentiment
## t = -5.8437, df = 259.76, p-value = 1.524e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.403975 -3.175864
## sample estimates:
## mean of x mean of y
## 8.421569 13.211488
t.test(political_frames_list$neg_df$sentiment, top_rep_frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: political_frames_list$neg_df$sentiment and top_rep_frames_list$neg_df$sentiment
## t = 0.34881, df = 132.3, p-value = 0.7278
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.396237 3.422253
## sample estimates:
## mean of x mean of y
## -6.606557 -7.119565
After subsetting by politically-aimed tweets only, did positive or negative tweet length differ from the null, df?
t.test(political_frames_list$pos_df$num_words, political_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: political_frames_list$pos_df$num_words and political_frames_list$df$num_words
## t = 0.80039, df = 214.76, p-value = 0.4244
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5430005 1.2854999
## sample estimates:
## mean of x mean of y
## 12.00980 11.63855
t.test(political_frames_list$neg_df$num_words, political_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: political_frames_list$neg_df$num_words and political_frames_list$df$num_words
## t = -0.96321, df = 105.37, p-value = 0.3376
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.6521690 0.5717819
## sample estimates:
## mean of x mean of y
## 11.09836 11.63855
How did the tweets in political_frames_list differ in length from their parallel data frames in frames_list?
t.test(political_frames_list$df$num_words, frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: political_frames_list$df$num_words and frames_list$df$num_words
## t = 10.888, df = 167.65, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.569828 3.708104
## sample estimates:
## mean of x mean of y
## 11.638554 8.499589
t.test(political_frames_list$pos_df$num_words, frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: political_frames_list$pos_df$num_words and frames_list$pos_df$num_words
## t = 9.4023, df = 102.24, p-value = 1.679e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.710729 4.160166
## sample estimates:
## mean of x mean of y
## 12.009804 8.574357
t.test(political_frames_list$neg_df$num_words, frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: political_frames_list$neg_df$num_words and frames_list$neg_df$num_words
## t = 5.7606, df = 62.234, p-value = 2.787e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.828860 3.772391
## sample estimates:
## mean of x mean of y
## 11.098361 8.297735
How did the tweets in political_frames_list differ in length from their parallel data frames in top_rep_frames_list?
t.test(political_frames_list$df$num_words, top_rep_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: political_frames_list$df$num_words and top_rep_frames_list$df$num_words
## t = 5.6233, df = 377.48, p-value = 3.651e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.333474 2.767423
## sample estimates:
## mean of x mean of y
## 11.638554 9.588106
t.test(political_frames_list$pos_df$num_words, top_rep_frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: political_frames_list$pos_df$num_words and top_rep_frames_list$pos_df$num_words
## t = 6.9431, df = 207.55, p-value = 4.829e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.200059 3.944876
## sample estimates:
## mean of x mean of y
## 12.009804 8.937337
t.test(political_frames_list$neg_df$num_words, top_rep_frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: political_frames_list$neg_df$num_words and top_rep_frames_list$neg_df$num_words
## t = -0.45735, df = 127, p-value = 0.6482
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.5025318 0.9383835
## sample estimates:
## mean of x mean of y
## 11.09836 11.38043
Did the sentiment score in positive and negative tweets sent to the left differ significantly from the null?
t.test(left_frames_list$pos_df$sentiment, left_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$sentiment and left_frames_list$df$sentiment
## t = 3.558, df = 76.77, p-value = 0.0006446
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.183648 11.277137
## sample estimates:
## mean of x mean of y
## 8.250000 1.019608
t.test(left_frames_list$neg_df$sentiment, left_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$sentiment and left_frames_list$df$sentiment
## t = -3.1347, df = 38.103, p-value = 0.003306
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.705993 -3.380841
## sample estimates:
## mean of x mean of y
## -8.523810 1.019608
How did the sentiment score in tweets aimed at the left compare to the parallel data frames in the original frames_list?
t.test(left_frames_list$df$sentiment, frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$sentiment and frames_list$df$sentiment
## t = -3.7781, df = 50.103, p-value = 0.0004208
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.669324 -2.957030
## sample estimates:
## mean of x mean of y
## 1.019608 7.332785
t.test(left_frames_list$pos_df$sentiment, frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$sentiment and frames_list$pos_df$sentiment
## t = -1.3861, df = 27.1, p-value = 0.177
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.9833823 0.7710194
## sample estimates:
## mean of x mean of y
## 8.250000 9.856181
t.test(left_frames_list$neg_df$sentiment, frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$sentiment and frames_list$neg_df$sentiment
## t = -1.4193, df = 20.04, p-value = 0.1712
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.926003 1.697216
## sample estimates:
## mean of x mean of y
## -8.523810 -4.909416
How did the sentiment in tweets aimed at the left compare to the parallel data frames in top_rep_frames_list?
t.test(left_frames_list$df$sentiment, top_rep_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$sentiment and top_rep_frames_list$df$sentiment
## t = -5.2241, df = 62.445, p-value = 2.138e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.757298 -5.696879
## sample estimates:
## mean of x mean of y
## 1.019608 10.246696
t.test(left_frames_list$pos_df$sentiment, top_rep_frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$sentiment and top_rep_frames_list$pos_df$sentiment
## t = -3.9001, df = 39.246, p-value = 0.0003662
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.534127 -2.388850
## sample estimates:
## mean of x mean of y
## 8.25000 13.21149
t.test(left_frames_list$neg_df$sentiment, top_rep_frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$sentiment and top_rep_frames_list$neg_df$sentiment
## t = -0.51674, df = 25.871, p-value = 0.6097
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.991481 4.182992
## sample estimates:
## mean of x mean of y
## -8.523810 -7.119565
How did sentiment in left-aimed tweets compare to sentiment in the bipartisan political tweets?
t.test(left_frames_list$df$sentiment, political_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$sentiment and political_frames_list$df$sentiment
## t = -0.9339, df = 74.061, p-value = 0.3534
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.412805 1.958045
## sample estimates:
## mean of x mean of y
## 1.019608 2.746988
t.test(left_frames_list$pos_df$sentiment, political_frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$sentiment and political_frames_list$pos_df$sentiment
## t = -0.13029, df = 44.177, p-value = 0.8969
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.825188 2.482051
## sample estimates:
## mean of x mean of y
## 8.250000 8.421569
t.test(left_frames_list$neg_df$sentiment, political_frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$sentiment and political_frames_list$neg_df$sentiment
## t = -0.68929, df = 28.163, p-value = 0.4963
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.613381 3.778877
## sample estimates:
## mean of x mean of y
## -8.523810 -6.606557
Among political tweets aimed at the left, how did the length of positive and negative tweets compare to the null, df?
t.test(left_frames_list$pos_df$num_words, left_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$num_words and left_frames_list$df$num_words
## t = 0.42685, df = 53.85, p-value = 0.6712
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.374818 2.118515
## sample estimates:
## mean of x mean of y
## 12.60714 12.23529
t.test(left_frames_list$neg_df$num_words, left_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$num_words and left_frames_list$df$num_words
## t = -0.35453, df = 37.506, p-value = 0.7249
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.218706 1.557641
## sample estimates:
## mean of x mean of y
## 11.90476 12.23529
How did left-aimed tweet length compare to tweets in frames_list?
t.test(left_frames_list$df$num_words, frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$num_words and frames_list$df$num_words
## t = 7.383, df = 50.258, p-value = 1.469e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.719532 4.751879
## sample estimates:
## mean of x mean of y
## 12.235294 8.499589
t.test(left_frames_list$pos_df$num_words, frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$num_words and frames_list$pos_df$num_words
## t = 5.6785, df = 27.087, p-value = 4.899e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.575835 5.489738
## sample estimates:
## mean of x mean of y
## 12.607143 8.574357
t.test(left_frames_list$neg_df$num_words, frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$num_words and frames_list$neg_df$num_words
## t = 4.5879, df = 20.28, p-value = 0.0001727
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.968488 5.245565
## sample estimates:
## mean of x mean of y
## 11.904762 8.297735
How did left-aimed tweet length compare to tweets in top_rep_frames_list?
t.test(left_frames_list$df$num_words, top_rep_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$num_words and top_rep_frames_list$df$num_words
## t = 4.7865, df = 71.427, p-value = 8.893e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.544536 3.749841
## sample estimates:
## mean of x mean of y
## 12.235294 9.588106
t.test(left_frames_list$pos_df$num_words, top_rep_frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$num_words and top_rep_frames_list$pos_df$num_words
## t = 4.875, df = 34.157, p-value = 2.468e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.140236 5.199376
## sample estimates:
## mean of x mean of y
## 12.607143 8.937337
t.test(left_frames_list$neg_df$num_words, top_rep_frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$num_words and top_rep_frames_list$neg_df$num_words
## t = 0.6006, df = 30.442, p-value = 0.5526
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.257512 2.306167
## sample estimates:
## mean of x mean of y
## 11.90476 11.38043
How did left-aimed tweet length compare to tweets in the broader political set?
t.test(left_frames_list$df$num_words, political_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$num_words and political_frames_list$df$num_words
## t = 1.0267, df = 84.82, p-value = 0.3075
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5589112 1.7523910
## sample estimates:
## mean of x mean of y
## 12.23529 11.63855
t.test(left_frames_list$pos_df$num_words, political_frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$num_words and political_frames_list$pos_df$num_words
## t = 0.74888, df = 42.319, p-value = 0.4581
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.012011 2.206689
## sample estimates:
## mean of x mean of y
## 12.60714 12.00980
t.test(left_frames_list$neg_df$num_words, political_frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$num_words and political_frames_list$neg_df$num_words
## t = 0.87678, df = 36.255, p-value = 0.3864
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.058452 2.671254
## sample estimates:
## mean of x mean of y
## 11.90476 11.09836
Did the sentiment scores of positive and negative tweets sent to the right differ significantly from the null, df (all right-aimed tweets)?
t.test(right_frames_list$pos_df$sentiment, right_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: right_frames_list$pos_df$sentiment and right_frames_list$df$sentiment
## t = 4.3227, df = 186.04, p-value = 2.508e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.703646 7.243240
## sample estimates:
## mean of x mean of y
## 8.486486 3.513043
t.test(right_frames_list$neg_df$sentiment, right_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: right_frames_list$neg_df$sentiment and right_frames_list$df$sentiment
## t = -6.6269, df = 94.494, p-value = 2.096e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.843264 -6.382823
## sample estimates:
## mean of x mean of y
## -5.600000 3.513043
Did the sentiment scores of all, positive, and negative tweets change significantly after subsetting to tweets sent to the right?
t.test(right_frames_list$df$sentiment, frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: right_frames_list$df$sentiment and frames_list$df$sentiment
## t = -4.3718, df = 114.86, p-value = 2.721e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.550457 -2.089026
## sample estimates:
## mean of x mean of y
## 3.513043 7.332785
t.test(right_frames_list$pos_df$sentiment, frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: right_frames_list$pos_df$sentiment and frames_list$pos_df$sentiment
## t = -1.8211, df = 73.645, p-value = 0.07265
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.8684658 0.1290759
## sample estimates:
## mean of x mean of y
## 8.486486 9.856181
t.test(right_frames_list$neg_df$sentiment, frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: right_frames_list$neg_df$sentiment and frames_list$neg_df$sentiment
## t = -0.64765, df = 39.446, p-value = 0.521
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.846571 1.465403
## sample estimates:
## mean of x mean of y
## -5.600000 -4.909416
Did the sentiment scores of all, positive, and negative tweets sent to the right differ significantly from those of the parallel frames for the top tweet recipients (top_rep_frames_list)?
t.test(right_frames_list$df$sentiment, top_rep_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: right_frames_list$df$sentiment and top_rep_frames_list$df$sentiment
## t = -6.4472, df = 223.89, p-value = 6.917e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.791815 -4.675490
## sample estimates:
## mean of x mean of y
## 3.513043 10.246696
t.test(right_frames_list$pos_df$sentiment, top_rep_frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: right_frames_list$pos_df$sentiment and top_rep_frames_list$pos_df$sentiment
## t = -5.1514, df = 155.63, p-value = 7.724e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.536805 -2.913199
## sample estimates:
## mean of x mean of y
## 8.486486 13.211488
t.test(right_frames_list$neg_df$sentiment, top_rep_frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: right_frames_list$neg_df$sentiment and top_rep_frames_list$neg_df$sentiment
## t = 1.0649, df = 99.221, p-value = 0.2895
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.311684 4.350815
## sample estimates:
## mean of x mean of y
## -5.600000 -7.119565
Did positive or negative right-aimed tweets differ significantly in length from the null, df?
t.test(right_frames_list$pos_df$num_words, right_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$pos_df$num_words and right_frames_list$df$num_words
## t = 0.74754, df = 158.12, p-value = 0.4558
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6730471 1.4927886
## sample estimates:
## mean of x mean of y
## 11.78378 11.37391
t.test(right_frames_list$neg_df$num_words, right_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$neg_df$num_words and right_frames_list$df$num_words
## t = -1.0021, df = 66.474, p-value = 0.3199
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.0912802 0.6934541
## sample estimates:
## mean of x mean of y
## 10.67500 11.37391
How did right-aimed tweets’ lengths compare to those of parallel tweet frames in frames_list?
t.test(right_frames_list$df$num_words, frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$df$num_words and frames_list$df$num_words
## t = 8.2547, df = 115.25, p-value = 2.846e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.184611 3.564038
## sample estimates:
## mean of x mean of y
## 11.373913 8.499589
t.test(right_frames_list$pos_df$num_words, frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$pos_df$num_words and frames_list$pos_df$num_words
## t = 7.5469, df = 73.662, p-value = 9.503e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.362010 4.056845
## sample estimates:
## mean of x mean of y
## 11.783784 8.574357
t.test(right_frames_list$neg_df$num_words, frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$neg_df$num_words and frames_list$neg_df$num_words
## t = 3.9073, df = 39.918, p-value = 0.0003519
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.147544 3.606986
## sample estimates:
## mean of x mean of y
## 10.675000 8.297735
How did right-aimed tweets’ lengths compare to those of parallel tweet frames in top_rep_frames_list?
t.test(right_frames_list$df$num_words, top_rep_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$df$num_words and top_rep_frames_list$df$num_words
## t = 4.3173, df = 219.8, p-value = 2.39e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.970597 2.601018
## sample estimates:
## mean of x mean of y
## 11.373913 9.588106
t.test(right_frames_list$pos_df$num_words, top_rep_frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$pos_df$num_words and top_rep_frames_list$pos_df$num_words
## t = 5.7724, df = 130.12, p-value = 5.437e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.870881 3.822013
## sample estimates:
## mean of x mean of y
## 11.783784 8.937337
t.test(right_frames_list$neg_df$num_words, top_rep_frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$neg_df$num_words and top_rep_frames_list$neg_df$num_words
## t = -0.98377, df = 71.959, p-value = 0.3285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.1349100 0.7240404
## sample estimates:
## mean of x mean of y
## 10.67500 11.38043
How did right-aimed tweets’ lengths compare to those in the broader political tweet set?
t.test(right_frames_list$df$num_words, political_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$df$num_words and political_frames_list$df$num_words
## t = -0.58731, df = 244.28, p-value = 0.5575
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1521958 0.6229135
## sample estimates:
## mean of x mean of y
## 11.37391 11.63855
t.test(right_frames_list$pos_df$num_words, political_frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$pos_df$num_words and political_frames_list$pos_df$num_words
## t = -0.40417, df = 158.16, p-value = 0.6866
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3305190 0.8784788
## sample estimates:
## mean of x mean of y
## 11.78378 12.00980
t.test(right_frames_list$neg_df$num_words, political_frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: right_frames_list$neg_df$num_words and political_frames_list$neg_df$num_words
## t = -0.54749, df = 82.572, p-value = 0.5855
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.961500 1.114778
## sample estimates:
## mean of x mean of y
## 10.67500 11.09836
How did sentiment scores compare between the left-aimed and right-aimed tweet lists?
t.test(left_frames_list$df$sentiment, right_frames_list$df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$sentiment and right_frames_list$df$sentiment
## t = -1.3234, df = 78.425, p-value = 0.1896
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.244131 1.257260
## sample estimates:
## mean of x mean of y
## 1.019608 3.513043
t.test(left_frames_list$pos_df$sentiment, right_frames_list$pos_df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$sentiment and right_frames_list$pos_df$sentiment
## t = -0.17141, df = 51.121, p-value = 0.8646
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.006133 2.533160
## sample estimates:
## mean of x mean of y
## 8.250000 8.486486
t.test(left_frames_list$neg_df$sentiment, right_frames_list$neg_df$sentiment)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$sentiment and right_frames_list$neg_df$sentiment
## t = -1.0599, df = 27.164, p-value = 0.2985
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.582306 2.734687
## sample estimates:
## mean of x mean of y
## -8.52381 -5.60000
How did tweet length compare between the left-aimed and right-aimed tweet lists?
t.test(left_frames_list$df$num_words, right_frames_list$df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$df$num_words and right_frames_list$df$num_words
## t = 1.4048, df = 98.717, p-value = 0.1632
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3552837 2.0780458
## sample estimates:
## mean of x mean of y
## 12.23529 11.37391
t.test(left_frames_list$pos_df$num_words, right_frames_list$pos_df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$pos_df$num_words and right_frames_list$pos_df$num_words
## t = 0.99585, df = 47.512, p-value = 0.3244
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8394487 2.4861668
## sample estimates:
## mean of x mean of y
## 12.60714 11.78378
t.test(left_frames_list$neg_df$num_words, right_frames_list$neg_df$num_words)
##
## Welch Two Sample t-test
##
## data: left_frames_list$neg_df$num_words and right_frames_list$neg_df$num_words
## t = 1.2424, df = 43.096, p-value = 0.2208
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.766241 3.225765
## sample estimates:
## mean of x mean of y
## 11.90476 10.67500
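The comparisons in this section all follow one pattern: the same column is tested across the parallel `df`, `pos_df`, and `neg_df` frames of two lists. A minimal sketch of how those repeated calls could be collected into one table is shown below. The helper `compare_frames()`, the generator `make_frames()`, and the toy lists `toy_left` and `toy_right` are hypothetical names introduced only for illustration, and the rule used here to split positive and negative frames (sentiment above or below zero) is an assumption, not necessarily the rule used to build the real frame lists.

```r
# Hypothetical stand-ins for two frame lists. The real lists hold tweet
# data frames with `sentiment` and `num_words` columns.
set.seed(1)
make_frames <- function(n, shift) {
  df <- data.frame(sentiment = round(rnorm(n, mean = shift, sd = 5)),
                   num_words = rpois(n, lambda = 11))
  # Assumed split rule: positive > 0, negative < 0.
  list(df = df,
       pos_df = df[df$sentiment > 0, ],
       neg_df = df[df$sentiment < 0, ])
}
toy_left  <- make_frames(60, 1)
toy_right <- make_frames(120, 3)

# Run a Welch t-test on one column for each pair of parallel frames and
# keep the key numbers from each result as one row of a data frame.
compare_frames <- function(a, b, column,
                           frames = c("df", "pos_df", "neg_df")) {
  rows <- lapply(frames, function(f) {
    fit <- t.test(a[[f]][[column]], b[[f]][[column]])
    data.frame(frame   = f,
               mean_x  = unname(fit$estimate[1]),
               mean_y  = unname(fit$estimate[2]),
               t       = unname(fit$statistic),
               df      = unname(fit$parameter),
               p_value = fit$p.value)
  })
  do.call(rbind, rows)
}

compare_frames(toy_left, toy_right, "num_words")
```

Called with the real lists, for example `compare_frames(left_frames_list, right_frames_list, "sentiment")`, the same helper would return the three results of one question as a single table rather than three separate output blocks.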
agstudy. 2014. In r use gsub to remove all punctuation except period. Stack Overflow. Retrieved from http://stackoverflow.com/questions/21533899/in-r-use-gsub-to-remove-all-punctuation-except-period on 8 Dec 2016.
Andrie. 2013. R: add title to wordcloud graphics / png. Stack Overflow. Retrieved from http://stackoverflow.com/questions/15224913/r-add-title-to-wordcloud-graphics-png on 6 Dec 2016.
chappers. 2015. Counting the total number of words in of [sic] rows of a data frame. Stack Overflow. Retrieved from http://stackoverflow.com/questions/31398077/counting-the-total-number-of-words-in-of-rows-of-a-dataframe on 8 Dec 2016.
Huffington Post. 2012. Bullying on Twitter: Researchers Find 15,000 Bully-Related Tweets Sent Daily (STUDY). Retrieved from http://www.huffingtonpost.com/2012/08/02/bullying-on-twitter_n_1732952.html on 6 Dec 2016.
MHN. 2014. Big text corpus breaks tm_map. Stack Overflow. Retrieved from http://stackoverflow.com/questions/26834576/big-text-corpus-breaks-tm-map on 9 Dec 2016.
MrFlick. 2015. tm custom removePunctuation except hashtag. Stack Overflow. Retrieved from http://stackoverflow.com/questions/27951377/tm-custom-removepunctuation-except-hashtag on 26 Nov 2016.
Nielsen FA. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv:1103.2903v1 (cs.IR).
Phys Org. 2012. Learning machines scour Twitter in service of bullying research. Retrieved from http://phys.org/news/2012-08-machines-scour-twitter-bullying.html on 6 Dec 2016.
Soergel A. 2016. Divided We Stand. US News. Retrieved from http://www.usnews.com/news/articles/2016-07-19/political-polarization-drives-presidential-race-to-the-bottom on 6 Dec 2016.
Sonego P. 2011. Word Cloud in R. R-bloggers. Retrieved from https://www.r-bloggers.com/word-cloud-in-r/ on 20 Nov 2016.
Stanek B. 2016. Donald Trump really really hates CNN. The Week. Retrieved from http://theweek.com/speedreads/640333/donald-trump-really-really-hates-cnn on 14 Dec 2016.
West W. 2014. Using Twitter’s Timestamp in R. WillWest.Nyc. Retrieved from http://willwest.nyc/using-twitters-timestamp-in-r.html on 19 Dec 2016.
Xu et al. 2012. Learning from bullying traces in social media. Assoc. for Comp. Linguistics: Human Lang. Tech: 656-666. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.374.1862&rep=rep1&type=pdf.
Yadav S. 2015. R tm package invalid input in ‘utf8towcs’. Stack Overflow. Retrieved from http://stackoverflow.com/questions/9637278/r-tm-package-invalid-input-in-utf8towcs on 8 Dec 2016.