Twitter is a great tool to analyze the public interactions of political actors. For this assignment, I want you to use the information about who follows whom on Twitter as well as past tweets of the current U.S. Senate members to analyze how they interact and what they tweet about.
Twitter does not allow us to search for past tweets based on keywords, location, or topics (hashtags). However, we are able to obtain the past tweets of users if we specify their Twitter handle. The file senators_twitter.csv contains the Twitter handles of the current U.S. Senate members (obtained from SocialSeer). We will focus on the Senators’ official Twitter accounts (as opposed to campaign or staff members). I have also added information on the party affiliation of the Senators from here.
The file senators_follow.csv contains an edge list of connections between each pair of senators who are connected through a follower relationship (this information was obtained using the function rtweet::lookup_friendships). The file is encoded such that the source is a follower of the target. You will need to use the subset of following = TRUE to identify the connections for which the source follows the target.
To make your life a bit easier, I have also already downloaded all available tweets for these Twitter accounts using the following code. You do not need to repeat this step. Simply rely on the file senator_tweets.RDS in the exercise folder.
packages <- c("devtools","knitr","widgetframe","readr",
"ggnetwork","GGally","network","sna","ggplot2",
"svglite","rsvg","tidyverse",
"ggraph","igraph","tidygraph",
"gganimate","randomNames","threejs","visNetwork",
"ergm","tweenr","rtweet","twitteR", "kableExtra",
"ggthemes","DT")
packages <- lapply(packages, FUN = function(x) {
if(!require(x, character.only = TRUE)) {
install.packages(x)
library(x, character.only = TRUE)
}
}
)
(senators_follow <- list.files(pattern="senators_follow.csv", recursive=T, full.names=T) %>% .[[1]] %>% read.csv()) %>%
summarytools::dfSummary(., plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp")
Dimensions: 10000 x 4
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|---|
1 |
source |
1. amyklobuchar |
100 ( 1.0%) |
0 |
|
2 |
target |
1. amyklobuchar |
100 ( 1.0%) |
0 |
|
3 |
following |
1. FALSE |
4303 (43.0%) |
0 |
|
4 |
followed_by |
1. FALSE |
4303 (43.0%) |
0 |
# summary(senators_follow <- read.csv("https://raw.githubusercontent.com/QMSS-GR5063-2018/DV_CU_course_material/master/Exercises/10_twitter_senate/senators_follow.csv?token=AZbwLk-mJMvg4I1vSugJqEVH0D2ESBvCks5a1V8vwA%3D%3D"))
(senators_twitter <- senate <- list.files(pattern="senators_twitter.csv", recursive=T, full.names=T) %>% .[[1]] %>% read_csv()) %>%
summarytools::dfSummary(., plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp")
Dimensions: 100 x 7
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|---|
1 |
State |
1. Alabama |
2 ( 2.0%) |
0 |
|
2 |
Senator |
1. Alexander, Lamar |
1 ( 1.0%) |
0 |
|
3 |
Official Twitter |
1. amyklobuchar |
1 ( 1.0%) |
0 |
|
4 |
Staff Twitter |
1. GrassleyPress |
1 ( 5.3%) |
81 |
|
5 |
Campaign Twitter |
1. AngusKing2018 |
1 ( 1.2%) |
17 |
|
6 |
Party affiliation |
1. Democratic Party |
47 (47.0%) |
0 |
|
7 |
label |
1. Alexander(R) |
1 ( 1.0%) |
0 |
# summary(senators_twitter <- senate <- read_csv("https://raw.githubusercontent.com/QMSS-GR5063-2018/DV_CU_course_material/master/Exercises/10_twitter_senate/senators_twitter.csv?token=AZbwLhUXFxAPgSBFORs5ioJPw5jMM5bMks5a1V93wA%3D%3D"))
library(tidyverse)
library(lubridate)
library(rtweet)
library(plyr)
library(reshape2)
# Read in the Senator Data
# senate <- read_csv("senators_twitter.csv")
# Get Tweets
# senator_tweets <- get_timeline(user = senate$`Official Twitter`,
# n = 3200, ## number of tweets to download (max is 3,200)
# )
urls <- url("https://github.com/QMSS-GR5063-2018/DV_CU_course_material/blob/master/Exercises/10_twitter_senate/senator_tweets.RDS?raw=true")
# readRDS(urls)
urls <- url("https://github.com/QMSS-GR5063-2018/DV_CU_course_material/raw/master/Exercises/10_twitter_senate/senator_tweets.RDS")
# saveRDS(senator_tweets, "senator_tweets.RDS")
# Read in the Tweets
# senator_tweets <- readRDS("senator_tweets.RDS")
# senator_tweets <- readRDS("~/downloads/senator_tweets.RDS")
urls <- list.files(pattern="senator_tweets.RDS", recursive=T, full.names=T) %>% .[[1]]
senator_tweets <- readRDS(urls)
# save(senator_tweets, file = "senator_tweets.rda")
# How limiting is the API limit?
#senator_tweets %>%
# group_by(screen_name) %>%
#summarize(n_tweet = n(),
# oldest_tweet = min(created_at)) %>%
# arrange(desc(oldest_tweet))
The data contains about 170k tweets and about 40 variables. Please note, that the API limit of 3,200 tweets per twitter handle actually cuts down the time period we can observe the most prolific Twitter users in the Senate down to only about one year into the past.
senators_follow.csv.igraph to calculate the number of in and out connections: indegree = igraph::degree(g, mode = "in").]Below I’m organizing the edgelist so it encompasses all outgoing ties. It looks like this was unnecessary, but I did it anyway just in case. I’m making this a “pseudo weighted” network just because it’s easier for now.
senate$screen_name <- senate$`Official Twitter`
set.seed(10)
dat <- senators_follow
# head(out.edges <- dat %>% filter(following == "TRUE"))
dat$outs <- ifelse(dat$following == "TRUE", 1, 0)
dat$ins <- ifelse(dat$followed_by == "TRUE", 1, 0)
edges.1 <- dat %>% select(from = source, to = target, weight = outs)
edges.2 <- dat %>% select(from = target, to = source, weight = ins)
edges.all <- bind_rows(edges.1, edges.2)
edges.unique <- unique(edges.all) # oops looks like a duplication from before
Below is a tie-reciprocation function my lab and I wrote.
tieReciprocated <- function(edgelist) {
tierecip <- rep(NA,nrow(edgelist))
for (i in 1:nrow(edgelist)){
if(is.na(edgelist[i, ]$to) | is.na(tierecip[i])==FALSE) next
match <- subset(edgelist, to==edgelist[i, ]$from & from==edgelist[i, ]$to)
tierecip[i] <- ifelse(nrow(match)==1, 1, 0)
}
tierecip
}
tieReciprocatedComplete <- function(edgelist) { # function that only looks at those who actually filled out SNA questionnaire
completed <- unique(edgelist$from)
tierecip <- rep(NA,nrow(edgelist))
for (i in 1:nrow(edgelist)){
if(is.na(edgelist[i, ]$to) | is.na(tierecip[i])==FALSE) next
if(!(edgelist[i, ]$to %in% completed)) next
match <- subset(edgelist, to==edgelist[i, ]$from & from==edgelist[i, ]$to)
tierecip[i] <- ifelse(nrow(match)==1, 1, 0)
}
tierecip
}
Finding tie reciprocity
# edges.unique$tierecip <- tieReciprocated(edges.unique)
# edges.unique$tierecipcomplete <- tieReciprocatedComplete(edges.unique)
# head(edges.unique)
sen.att1 <- senate %>% mutate(from = `Official Twitter`)
sen.att.edges <- merge(sen.att1, edges.unique, all = T)
Making adjacency graph
mat1 <- reshape2::acast(edges.unique, from~to, value.var="weight", fill=0, drop=F)
g <- igraph::graph.adjacency(mat1, mode=c("directed"), weighted=T, diag=F)
igraph::graph.density(g)
[1] 0.5754545
V(g)$degree <- degree(g)
V(g)$indeg <- degree(g, mode="in")
V(g)$outdeg <- degree(g, mode="out")
hi <- (vgin <- V(g)$indeg)
hi <- (vgout <- V(g)$outdeg)
hi <- (vgdeg <- V(g)$degree)
hi <- (vgname <- V(g)$name)
vgdf <- data.frame(cbind(screen_name = vgname, indegree = vgin, outdegree = vgout, degree = vgdeg))
sen.att2 <- merge(sen.att1, vgdf, all = T)
# sen.att2$`Party affiliation`
sen.att <- sen.att2 %>% filter(`Party affiliation` == "Democratic Party" | `Party affiliation` == "Republican Party" | `Party affiliation` == "Independent")
V(g)$from = as.character(sen.att$from[match(V(g)$name,sen.att$from)])
V(g)$party = as.character(sen.att$`Party affiliation`[match(V(g)$name,sen.att$from)])
V(g)$indegree = as.character(sen.att$indegree[match(V(g)$name,sen.att$indegree)])
V(g)$outdegree = as.character(sen.att$outdegree[match(V(g)$name,sen.att$outdegree)])
V(g)$degree1 = as.character(sen.att$degree[match(V(g)$name,sen.att$degree)])
In my network visualization below, senators are arranged by node color as each party. Dems = blue, GOP = red, and independents = green. The node size varies by network degree centrality (indegree + degree). Indegree is visualized as the alpha of the nodes as well. If a node is more transparent, then that means it has a lower indegree.
igraph_layouts <- c('star', 'circle', 'gem', 'dh', 'graphopt', 'grid', 'mds',
'randomly', 'fr', 'kk', 'drl', 'lgl')
library(igraph)
library(ggraph)
library(tweenr)
igraph_layouts <- sample(igraph_layouts)
# layouts <- lapply(igraph_layouts, create_layout, graph = g)
# layouts_tween <- tween_states(c(layouts, layouts[1]), tweenlength = 1,
# statelength = 1, ease = 'cubic-in-out',
# nframes = length(igraph_layouts) * 16 + 8)
# title_transp <- tween_t(c(0, 1, 0, 0, 0), 16, 'cubic-in-out')[[1]]
# for (i in seq_len(length(igraph_layouts) * 16)) {
# tmp_layout <- layouts_tween[layouts_tween$.frame == i, ]
# layout <- igraph_layouts[ceiling(i / 16)]
# title_alpha <- title_transp[i %% 16]
# p <- ggraph(g, 'manual', node.position = tmp_layout) +
# geom_edge_fan(aes(alpha = ..index.., colour = factor("party")), n = 15) +
# geom_node_point(aes(size = degree, color = party)) + scale_color_manual(values=c("blue", "green", "red")) + scale_fill_manual(values=c("blue", "green", "red")) +
# scale_edge_color_brewer(palette = 'Dark2') +
# ggtitle(paste0('Layout: ', layout)) +
# theme_void() +
# theme(legend.position = 'none',
# plot.title = element_text(colour = alpha('black', title_alpha)))
# plot(p)
# }
p <- ggraph(g) +
geom_edge_fan(alpha = .5) +
geom_node_point(aes(size = degree, color = party, alpha = indeg)) +
scale_color_manual(values=c("blue", "green", "red")) +
scale_fill_manual(values=c("blue", "green", "red")) +
scale_edge_color_brewer(palette = 'Dark2', aes(alpha = outdeg)) +
ggtitle("Senators and their Parties and Their Twitter Ties") +
theme_void() + labs(color = "Party Affiliation", alpha = "Out-Degree", size = "In-Degree") +
theme(legend.position = 'right',
plot.title = element_text(colour = alpha('black')))
plot(p)
Calculating in-degree
in.degree <- as.data.frame(igraph::degree(g, mode="in"))
in.degree$PPID <- row.names(in.degree)
names(in.degree) <- c("indeg","PPID")
top3.in.degree <- head(in.degree.rank <- in.degree[order(- in.degree$indeg), 1:2], n = 5)
names(top3.in.degree) <- c("In-Degree (Total Twitter Followers)", "Senator Surname")
top3.in.degree$`Senator Surname` <- top3.in.degree$`Senator Surname` %>% gsub("Sen|Jeff|John|Senator", "", .) %>%
gsub("MarkW", "W", .)
rownames(top3.in.degree) <- NULL
# frameWidget(datatable(top3.in.degree))
(datatable(top3.in.degree))
Calculating out-degree
out.degree <- as.data.frame(igraph::degree(g, mode="out"))
out.degree$PPID <- row.names(out.degree)
names(out.degree) <- c("outdeg","PPID")
top3.out.degree <- head(out.degree.rank <- out.degree[order(- out.degree$outdeg), 1:2], n = 3)
names(top3.out.degree) <- c("Out-Degree (Total Following on Twitter)", "Senator Surname")
top3.out.degree$`Senator Surname` <- top3.out.degree$`Senator Surname` %>% gsub("Sen|lisa|John|Senator", "", .) %>%
gsub("murk", "Murk", .)
rownames(top3.out.degree) <- NULL
# frameWidget(datatable(top3.out.degree))
(datatable(top3.out.degree))
Now let’s see whether party identification is also recovered by an automated mechanism of cluster identification. Use the cluster_walktrap command in the igraph package to find densely connected subgraphs.
# Sample Code for a graph object "g"
wc <- cluster_walktrap(g) # find "communities"
members <- membership(wc)
members.df <- cbind(names(members), data.frame(as.matrix(members)))
names(members.df) <- c("Handle", "Group")
rownames(members.df) <- NULL
hu <- head(walkt.cluster <- members.df %>% mutate(Method = "Cluster Walktrap"))
# Here, I'm running RANDOM WALK partitioning as if I'm making only 2 clusters (potentially for republican versus democrat)
g.walktrap <- walktrap.community(g,
steps = 2,modularity=TRUE) # Here, I'm running the random walk partitioning algorithm
# g.walktrap #tells you how many groups there are in the network
#g.walktrap$membership #this is the column in front of each node. it tells us the group each node is in
#g.walktrap$names #names refers to each node. when read with membership, it tells us each node next to its group. Gotta read it and mmbrship as a data frame though!
rwalkdf <-as.data.frame(cbind(g.walktrap$membership, g.walktrap$names)) #making a dataframe out of the R-W membership stuff!
colnames(rwalkdf) <- c("Group", "Handle") #telling it to turn the second column into Handles and first into group number!!
rownames(rwalkdf) <- NULL
# frameWidget(datatable(head(walkt.comm <- rwalkdf %>% select(Handle, Group) %>% mutate(Method = "Walktrap Community Algorithm"))))
(datatable(head(walkt.comm <- rwalkdf %>% select(Handle, Group) %>% mutate(Method = "Walktrap Community Algorithm"))))
In the next couple lines, I’m seeing if the different ways I found walktrap clusters ended up with the same 2 groups! It looks like they’re identical!
walkt.community <- walkt.comm %>% select(everything()) %>% mutate(Group = ifelse(Group == 2, 1, 2))
# frameWidget(datatable(summary(walkt.cluster[ , 1:2] == walkt.community[ , 1:2])))
(datatable(summary(walkt.cluster[ , 1:2] == walkt.community[ , 1:2])))
After joining the senate file with all of the attribute/party details with the walktrap cluster detail, I made a column evaluating whether or not the party mapped onto the community. If it did, the new column indicates “match.” If the mapping failed, then the column indicates “no match.” Everything was a match except for 2 people!! I filtered the data to see who didn’t correctly map onto a community and found that Senator Sanders and Senator Angus King were mis-classified. This is because they’re the only 2 independents in the network! I didn’t have a 3rd group option; silly me. So there was no way I could correctly classify them with my walktrap clustering unless I allowed for more clusters.
senate$Handle <- senate$`Official Twitter`
walkt.cluster$Handle <- as.character(walkt.cluster$Handle)
walkt.cluster$Group <- as.factor(walkt.cluster$Group)
senate.party.clust <- left_join(walkt.cluster, senate)
# frameWidget(datatable(head(senate.party.clust %>% select(`Party affiliation`, Group))))
(datatable(head(senate.party.clust %>% select(`Party affiliation`, Group))))
nooo <- head(summary(senate.party.cluster.match <- senate.party.clust %>% select(everything()) %>%
mutate(`Cluster Party x Group Match` = ifelse(`Party affiliation` == "Democratic Party" & Group == 1, "match",
ifelse(`Party affiliation` == "Republican Party" & Group == 2, "match", "no match")))))
# frameWidget(datatable(data.frame(table(senate.party.cluster.match$`Cluster Party x Group Match`))))
(datatable(data.frame(table(senate.party.cluster.match$`Cluster Party x Group Match`))))
# frameWidget(datatable(data.frame(senate.party.cluster.match %>% select(everything()) %>%
(datatable(data.frame(senate.party.cluster.match %>% select(everything()) %>%
filter(senate.party.cluster.match$`Cluster Party x Group Match` == "no match") %>% select(1, 2, 3, 4, 5, 9))))
Based on the results, visualize how well this automated community detection mechanism recovers the party affiliation of senators. This visualization need not be a network graph. Comment briefly.
library(ggplot2)
match.vis <- senate.party.cluster.match %>% filter(!is.na(`Party affiliation`))
ggplot(match.vis, aes(`Cluster Party x Group Match`)) + geom_bar(aes(color = `Party affiliation`, fill = `Party affiliation`)) + ylab("Number of Senators") + xlab("Classification Result") + labs(title = "Walktrap Clustering: Party Affiliation x Group Matching") + theme_minimal() + scale_color_manual(values=c("blue", "green", "red")) + scale_fill_manual(values=c("blue", "green", "red"))
From now on, rely on the information from the tweets stored in senator_tweets.RDS.
Remove all tweets that are re-tweets (is_retweet) and identify which topics the senators tweet about. Rather than a full text analysis, just use the variable hashtags and identify the most common hashtags over time. Provide a visual summary.
In this visual representation (comparison cloud) we can see that the most common hashtag is Obamacare, both in the years 2013 and 2017. I used these 2 years to represent similarities over time because they’re both the years right after a general presidential election in the US, so a ripe time for politics. Some other common words are mepolitics, benghol, benghazi, etc. You can see below!
library(tm)
original.tweets <- senator_tweets %>% filter(is_retweet == FALSE)
original.tweets$date <- lubridate::date(original.tweets$created_at)
original.tweets$year <- lubridate::year(original.tweets$date)
# summary(original.tweets$year)
original.tweets.2013 <- original.tweets %>% filter(year == "2013")
hash.2013 <- na.omit(unlist(original.tweets.2013$hashtags))
original.tweets.2017 <- original.tweets %>% filter(year == "2017")
hash.2017 <- na.omit(unlist(original.tweets.2017$hashtags))
hash.yrs <- data.frame(cbind(hash.2013[1:6543], hash.2017[1:6543]))
# length(hash.2017)
# length(hash.2013)
hashsh <- head(hash.yrs$X1 <- as.character(hash.yrs$X1))
hashah <- head(hash.yrs$X2 <- as.character(hash.yrs$X2))
corpus = Corpus(VectorSource(hash.yrs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = c("Tweets of '08", "Tweets of '17")
library(wordcloud)
comparison.cloud(tdm,
title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))
# original.tweets %>% select(everything()) %>%
# group_by(as.vector(hashtags)) %>%
# summarize(count = n(), status_id, user_id, source, favorite_count, retweet_count)
# original.tweets[order(-original.tweets$hashtags)]
# names(original.tweets)
# head(original.tweets.rank <- original.tweets[order(-original.tweets$hashtags)], n = 5)
# hi <- data.frame(original.tweets)
# hash <- unlist(original.tweets)
# hash
# data.frame(original.tweets) %>% select(everything()) %>%
# group_by(hashtags) %>%
# summarize(total = n(), status_id, user_id, source, favorite_count, retweet_count)
# original.tweets$hashtags
Some tweets are as old as 10 years but for some prolific users we observe a much shorter time span of Twitter activity. Feel free to subset the data to only include more recent tweets. Using the party ID variable (Party affiliation), identify how the choice of topics tweeted about (again using using hashtags) differs by party and visualize that information.
It seems as though the republicans are speaking more about the southern states and about conservative policy than are the democrats– makes sense. Also, each party insults the other by saying things like “GOPtaxscam” as a hashtag.
recent.twts <- original.tweets %>% filter(year > 2014)
# head(recent.twts)
parties <- senate %>% mutate(screen_name = `Official Twitter`)
parties.twts <- left_join(recent.twts, parties)
# head(parties.twts)
gop.twts <- parties.twts %>% filter(`Party affiliation` == "Republican Party")
dem.twts <- parties.twts %>% filter(`Party affiliation` == "Democratic Party")
vey <- length(gop.hash <- na.omit(as.character(gop.twts$hashtags)))
oy <- length(dem.hash <- na.omit(as.character(dem.twts$hashtags)))
party.hash <- data.frame(cbind(dem.hash[1:37853], gop.hash[1:37853]))
party.hash$X1 <- as.character(party.hash$X1)
party.hash$X2 <- as.character(party.hash$X2)
corpus = Corpus(VectorSource(party.hash))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = c("Democrats", "Republicans")
comparison.cloud(tdm,
title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))
The democratic party seems broadly more supportive of gun control legislation. Try to identify a set of 5-10 hashtags that signal support for gun control legislation (e.g. “NeverAgain”, #guncontrol, #guncontrolnow, #Enough) and others that are expressing support for the right to own guns (e.g. #2ndamendment, #NRA, #liberals). The site ritetag.com can help with that task. Using the subset of senator tweets that included these hashtags, show whether and how senators from different parties talk differently about the issue of gun legislation.
In terms of guns, it looks like the democrats and republicans both talk about them quite a bit. In the comparison cloud below, I display any time one of those 2 parties mentions “gun” or any of the hashtags mentioned in the question above (except for liberals because that’s too broad a category for my interests.) I Used the grep call to find the relevant rows. The democrats reference tags like “endgunviolence” while the republicans reference “2ndamendment” and such about gun rights.
mew <- length(gop.guns <- na.omit(gop.hash[(grep("gun|NeverAgain|Enough|NRA|2ndamendment", gop.hash, ignore.case = T))]))
wem <- length(dem.guns <- na.omit(dem.hash[(grep("gun|NeverAgain|Enough|NRA|2ndamendment", dem.hash, ignore.case = T))]))
party.guns <- data.frame(cbind(dem.guns[1:89], gop.guns[1:89]))
party.guns$X1 <- as.character(party.guns$X1)
party.guns$X2 <- as.character(party.guns$X2)
corpus.guns = Corpus(VectorSource(party.guns))
tdm.guns = TermDocumentMatrix(corpus.guns)
tdm.guns = as.matrix(tdm.guns)
colnames(tdm.guns) = c("Democrats", "Republicans")
comparison.cloud(tdm.guns,
title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))
On February 14, 2018, a mass shooting occurred at Marjory Stoneman Douglas High School in Parkland, Florida. Provide some visualization of how senators responded to the event in their Twitter communication.
Based on my comparison cloud, democrats seem much more social-justice oriented and concerned about the Parkland shooting than do Republicans. Democrats are speaking out against a terrible thing, but republicans are largely ignoring the actual implications.
fie <- length(gop.parkland <- na.omit(gop.hash[(grep("Marjory|Stoneman|Douglas High|Parkland|February18|Feb18", gop.hash, ignore.case = T))]))
vye <- length(dem.parkland <- na.omit(dem.hash[(grep("Marjory|Stoneman|Douglas High|Parkland|February18|Feb18", dem.hash, ignore.case = T))]))
party.parkland <- data.frame(cbind(dem.parkland[1:31], gop.parkland[1:31]))
party.parkland$X1 <- as.character(party.parkland$X1)
party.parkland$X2 <- as.character(party.parkland$X2)
corpus.parkland = Corpus(VectorSource(party.parkland))
tdm.parkland = TermDocumentMatrix(corpus.parkland)
tdm.parkland = as.matrix(tdm.parkland)
colnames(tdm.parkland) = c("Democrats", "Republicans")
comparison.cloud(tdm.parkland,
title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))
Often tweets are simply public statements without addressing a specific audience. However, it is possible to interact with a specific person by adding them as a friend, becoming their follower, re-tweeting their messages, and/or mentioning them in a tweet using the @ symbol.
Select the set of re-tweeted messages from other senators and identify the source of the originating message. Calculate by senator the amount of re-tweets they received and from which party these re-tweets came. Essentially, I would like to visualize whether senators largely re-tweet their own party colleagues’ messages or whether there are some senators that get re-tweeted on both sides of the aisle. Visualize the result.
According to my table and my ggpot visualization, the people/parties who are most retweeted are primarily the democrats, except for Republican John McCain, who’s retweeted more than any of them. But out of the top 5 people with number of times they were retweeted, McCain is the only Republican. The rest are democrats.
retweets <- senator_tweets %>% filter(is_retweet == TRUE)
parties.retwts <- left_join(retweets, parties)
retweets.directed <- parties.retwts %>% separate(text, c("source.of.original.twt", "text"), ": ", extra = "drop")
hihd <- head(retweets.directed$source.of.original.twt <- gsub("RT @", "", retweets.directed$source.of.original.twt))
# retweets.directed$retweeted.from
# retweets.directed$source.of.original.twt <- gsub("_", "", retweets.directed$source.of.original.twt)
parties$source.of.original.twt <- parties$screen_name
rtwts.net <- left_join(retweets.directed, parties, by = "source.of.original.twt")
rtwts.net$rtwt.party.match <- ifelse(rtwts.net$`Party affiliation.x` == "Democratic Party" & rtwts.net$`Party affiliation.y` == "Democratic Party", "Democrat Retweeted Democrat",
ifelse(rtwts.net$`Party affiliation.x` == "Democratic Party" & rtwts.net$`Party affiliation.y` == "Republican Party", "Republican Retweeted Republican",
ifelse(rtwts.net$`Party affiliation.x` == "Democratic Party" & rtwts.net$`Party affiliation.y` == "Republican Party", "Democrat Retweeted Republican",
ifelse(rtwts.net$`Party affiliation.x` == "Republican Party" & rtwts.net$`Party affiliation.y` == "Democratic Party", "Republican Retweeted Democrat", "Other"))))
# summary(rtwts.net$rtwt.party.match)
rt.net <- rtwts.net
rt.net$from <- rt.net$screen_name.x
rt.net$to <- rt.net$source.of.original.twt
rt.net$weight <- as.numeric(rt.net$retweet_count)
# rt.net$from.to <- paste(rt.net$`Party affiliation.x`, "Retweeted", rt.net$`Party affiliation.y`)
who.retweers <- rt.net %>% dplyr::group_by(to, `Party affiliation.y`) %>%
dplyr::summarize(total = sum(retweet_count))
who.rtw <- who.retweers %>% filter(!is.na(`Party affiliation.y`))
top.retweeted <- who.rtw %>% dplyr::rename(`Original Tweeters' Party` = `Party affiliation.y`,
`Original Tweeters' Handle` = to,
`# of Times their Tweet was Retweeted` = total)
top5.retweeted <- head(top.retweeted[order( - top.retweeted$`# of Times their Tweet was Retweeted`), 1:3], n = 5)
# frameWidget(datatable(top5.retweeted))
(datatable(top5.retweeted))
Democrats are retweeted so much! Whoa! They also do a ton of the retweeting. The other parties just don’t tweet or retweet so much. Also interestingly, The democrats seem to pretty often retweet republicans, but not vice versa. Indepenents don’t tweet much, but when they retweet, they retweet among themselves most, but also retweet republicans and democrats to some degree.
pgrp <- rt.net %>% filter(`Party affiliation.y` == "Democratic Party" | `Party affiliation.y` == "Republican Party" | `Party affiliation.y` == "Independent" | `Party affiliation.y` == "Other") %>% filter(!is.na(`Party affiliation.x`))
pgrp$from.to <- paste(pgrp$`Party affiliation.x`, "Retweeted", pgrp$`Party affiliation.y`)
# summary(pgrp$from.to)
#ugh <- rt.net %>% filter(`Party affiliation.y` == "Democratic Party" | `Party affiliation.y` == "Republican Party" | `Party affiliation.y` == "Independent" | `Party affiliation.y` == #"Other" | !is.na(`Party affiliation.x`) | !is.na(`Party affiliation.x`))
#ugh <- pgrp %>% filter(!is.na(`Party affiliation.x`))
ggplot(pgrp, aes(`Party affiliation.y`, retweet_count)) + geom_boxplot(aes(fill = `Party affiliation.y`)) + geom_jitter(aes(color = `Party affiliation.x`)) + coord_flip() + xlab("Original Tweeters") + labs(color='Retweeters', fill = "Original Tweets") + ylab("Number of Retweets") + labs(title = "Who Retweets from Who? And How Much?") + theme_minimal()
rt1 <- rt.net %>% select(from, to, weight)
# length(rt1$from)
dfxy <- data.frame(froms = c(1:(length(rt1$from)*2)), tos = c(1:(length(rt1$to)*2)), weights = 0)
dfxy$weights[1:(length(rt1$from))] <- rt1$weight
dfxy$froms[1:(length(rt1$from))] <- rt1$from
dfxy$froms[(length(rt1$from) + 1): (length(rt1$from) * 2)] <- rt1$to
dfxy$tos[1:(length(rt1$to))] <- rt1$to
dfxy$tos[(length(rt1$to) + 1): (length(rt1$to) * 2)] <- rt1$from
mews <- na.omit(dfxy)
# length(tos <- rbind(rt1$to, rt1$from))
# rt1$weight0 <- 0
# length(weights <- rbind(rt1$weight, rt1$weight0))
# meow <- data.frame(from = c(froms), to = c(tos), weight = c(weights))
# summary(meow)
# dim(mew <- na.omit(meow))
# mews$to <- gsub("_", "", mews$tos)
mewe <- mews %>% dplyr::group_by(froms) %>%
dplyr::summarise(weight = n())
mews$from <- mews$froms
mews$weight <- mews$weights
mews$to <- mews$tos
mewe.retweeters <- unique(na.omit(mews %>% select(from, to, weight)))
# summary(mewe$from <- as.character(mewe$from))
# summary(mewe$to <- as.character(mewe$tos))
# meow <- unique(mewe %>% select(from, to, weight))
# rt.mat <- reshape2::acast(mewe.retweeters, from ~ to, value.var="weight", fill=0, drop=F)
# g <- igraph::graph.adjacency(mews, mode=c("directed"), weighted=T, diag=F)
# igraph::graph.density(g)
retwtrs <- rt.net %>% select(retweet_count, to) %>%
group_by(to) #%>%
# summarize(retweet_count)
#length(retwtrs$`Party affiliation`) %>%
# summarize(nooo = (retweet_count))
# top5.retwtrs <- head(toprtw.rank <- in.degree[order(- in.degree$indeg), 1:2], n = 5)
For my last little fun with retweeters and retweetees, I made a table spelling out the frequency at which each party has retweeted each party. The results are in and democrats just love to retweet everyone, more than anyone else even likes to retweet themselves!
who.retweers <- rt.net %>% dplyr::group_by(to, `Party affiliation.y`, `Party affiliation.y`) %>%
dplyr::summarize(total = sum(retweet_count))
who.retweers <- pgrp %>% dplyr::group_by(from.to) %>%
dplyr::summarize(total = sum(retweet_count))
top5.retwtrs.rtwtees <- head(rtw.rank <- who.retweers[order(- who.retweers$total), 1:2], n = 5)
names(top5.retwtrs.rtwtees) <- c("Party Tweets Party", "Times this Happened!")
# frameWidget(datatable(top5.retwtrs.rtwtees))
(datatable(top5.retwtrs.rtwtees))
Identify the tweets in which one senator mentions another senator directly (the variable is mentions_screen_name). For this example, please remove simple re-tweets (is_retweet == FALSE). Calculate who re-tweets whom among the senate members. Convert the information to an undirected graph object in which the number of mentions is the strength of the relationship between senators. Visualize the network graph using the party identification of the senators as a group variable (use blue for Democrats and red for Republicans) and some graph centrality measure to size the nodes. Comment on what you can see from the visualization.
First, I made a wordcloud to visualize who people from the democratic and republican parties are mentioning. It looks like those of the independent party don’t exactly mention anyone.
# is_retweet is false here
join.originals <- left_join(original.tweets, senate)
# we need a source and a target
# join.originals$mentions_screen_name tis a df with lists fml
# retweets.directed$is_retweet
gops <- join.originals %>% filter(`Party affiliation` == "Republican Party")
dems <- join.originals %>% filter(`Party affiliation` == "Democratic Party")
indeps <- join.originals %>% filter(`Party affiliation` == "Independent Party")
oh <- length(gop.mention <- na.omit(as.character(gops$mentions_screen_name)))
em <- length(dem.mention <- na.omit(as.character(dems$mentions_screen_name)))
gee <- length(indep.mention <- na.omit(as.character(indeps$mentions_screen_name)))# no mentions lol
party.mention <- data.frame(cbind(dem.mention[1:41128], gop.mention[1:41128]))
party.mention$X1 <- as.character(party.mention$X1)
party.mention$X2 <- as.character(party.mention$X2)
corpus.mention = Corpus(VectorSource(party.mention))
tdm.mention = TermDocumentMatrix(corpus.mention)
tdm.mention = as.matrix(tdm.mention)
colnames(tdm.mention) = c("Democrats", "Republicans")
comparison.cloud(tdm.mention,
title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))
dems <- join.originals %>% filter(`Party affiliation` == "Democratic Party")
gops <- join.originals %>% filter(`Party affiliation` == "Republican Party")
oh <- length(dem.mention <- na.omit(as.character(dems$mentions_screen_name)))
emge <- length(gop.mention <- na.omit(as.character(gops$mentions_screen_name)))
list1 <- join.originals$mentions_screen_name
MAX = max(sapply(list1, length))
func1 <- function(x, MAX) {
vec <- c(x, rep(NA, MAX-length(x)))
return(vec)
}
list2 <- lapply(list1, func1, MAX = MAX)
df3.1 <- plyr::ldply(list2)
# dim(df3.1)
orig.spread <- cbind(join.originals, df3.1)
# names(orig.spread)
orig.gath <- orig.spread %>% gather("cases","mentioned_name", V1:V19)
og <- unique(orig.gath %>% filter(year > 2017)) %>%
select(-(15:36))
#dim(og)
#names(og)
og2 <- og
parties$mentioned_name <- parties$screen_name
ment.part <- left_join(og2, parties, by = "mentioned_name")
mentions.small <- ment.part %>% dplyr::group_by(mentioned_name) %>%
dplyr::summarise(Mentions = n())
mentions <- na.omit(ment.part %>% dplyr::group_by(mentioned_name, screen_name = screen_name.x, `Party affiliation.x`, `Party affiliation.y`) %>%
dplyr::summarise(Mentions = n()))
mention.edgelist <- mentions %>%
dplyr::rename(target = screen_name, source = mentioned_name, weight = Mentions)
mg <- head(medge <- mention.edgelist[order(- mention.edgelist$weight), ], n = 200)
medge100 <- head(mention.edgelist, n = 200)
medge$weight0 <- 0
medge$party2 <- medge$`Party affiliation.y`
medge <- data.frame(medge)
ohhh <- head(edges.1 <- data.frame(medge %>% dplyr::select(from = source, to = target, weight, party = Party.affiliation.x)))
noooo <- head(edges.2 <- data.frame(medge %>% dplyr::select(from = target, to = source, weight = weight0, party = Party.affiliation.y)))
edges.all <- bind_rows(edges.1, edges.2)
mst <- as.data.frame(medge100 %>% select(source, target))
network1 = graph_from_data_frame(d = mg, directed = F)
plot(network1)
edges <- unique(mg %>% dplyr::rename(from = source, to = target))
routes_network <- network(edges, matrix.type = "edgelist", ignore.eval = FALSE)
plot(routes_network)
# edges.all
mat1 <- reshape2::acast(edges.all, from~to, value.var="weight", fill=0, drop=F)
g2 <- igraph::graph.adjacency(mat1, mode=c("undirected"), weighted=T, diag=)
Below is the density of the network (out of a maximum 1)!
print(igraph::graph.density(g2))
[1] 0.1882864
V(g2)$degree <- degree(g2)
V(g2)$indeg <- degree(g2, mode="in")
V(g2)$outdeg <- degree(g2, mode="out")
V(g2)$from = as.character(sen.att$from[match(V(g2)$name,sen.att$from)])
V(g2)$party = as.character(edges.all$party[match(V(g2)$name, edges.all$from)])
# V(g)$indegree = as.character(sen.att$indegree[match(V(g)$name,sen.att$indegree)])
# V(g)$outdegree = as.character(sen.att$outdegree[match(V(g)$name,sen.att$outdegree)])
# V(g)$degree1 = as.character(sen.att$degree[match(V(g)$name,sen.att$degree)])
To some degree, it looks like everyone mentions everyone. I don’t see much evidence that only one party is mentioning another party. In contrast to my network visualization at the beginning of this journey, where we looked at parties following parties, the mentions are very diverse with mentions. I suppose that could be because mentioning someone doesn’t have to be a nice thing. You can mention others to disparage them. Senators from different parties could be doing that.
plot(g2)
p2 <- ggraph(g2) +
geom_edge_fan(aes(alpha = weight)) +
geom_node_point(aes(size = degree, color = party)) +
scale_color_manual(values=c("blue", "green", "red")) +
scale_fill_manual(values=c("blue", "green", "red")) +
scale_edge_color_brewer(palette = 'Dark2', aes(alpha = outdeg)) +
ggtitle("Who Mentions Who? 2018 Twitter Mentions Network") +
theme_void() + labs(color = "Party Affiliation", alpha = "Out-Degree", size = "Degree") +
theme(legend.position = 'right',
plot.title = element_text(colour = alpha('black')))
plot(p2)
Using the twitter handles, access the user information of the senators to identify the number of followers they have (obviously, this will require to actually connect to the Twitter server). Re-do the previous graph object but now use the number of followers (or some transformation of that info) to size the nodes. Comment how graph degree centrality (via mentions) and the number of followers are related.
I might be crazy, but it doesn’t look like there’s a super clear relationship between degree centrality (by number of mentions) and follower count. The bigger nodes have more followers, but some of the biggest nodes are pretty transparent, meaning they aren’t mentioned a ton. The is some correlation though. Lots of big nodes are also very solidly colored.
names <- senator_tweets$screen_name
# unique(names)
users <- lookupUsers(unique(names))
fc <- sapply(users, function(x) x$followersCount)
fc <- data.frame(fc)
# fc
ufufuf <- (fc$from <- rownames(fc))
V(g2)$fc = as.character(fc$fc[match(V(g2)$name, fc$from)])
p2 <- ggraph(g2) +
geom_edge_fan(aes(alpha = weight)) +
geom_node_point(aes(alpha = as.numeric(degree), color = party, size = fc)) + guides(size = FALSE) +
scale_color_manual(values=c("blue", "green", "red")) +
scale_fill_manual(values=c("blue", "green", "red")) +
scale_edge_color_brewer(palette = 'Dark2', aes(alpha = outdeg)) + labs(title = "Who Mentions Who? 2018 Twitter Mentions Network and Follower Count", subtitle = "Node Size is Network Centrality via # of Mentions") +
theme_void() + labs(color = "Party Affiliation", alpha = "Degree Centrality (# of Mentions)", size = "Degree") +
theme(legend.position = 'right',
plot.title = element_text(colour = alpha('black')))
plot(p2)
# Setting your Access Token
## Please follow the guide in the previous lecture
## For twitteR
# options(httr_oauth_cache=T)
#twitteR::setup_twitter_oauth(Sys.getenv("4ihvg6jBxFT9RnXN8OT7SScGO"),
# Sys.getenv("ht1qDUMC59Jtj5drr9OUFEUJxzwBSYzrOuYf3KdWy9HzsOEC6W"),
# Sys.getenv("912015600339255298-FBK0AWZD2FQymGJgT6aUYWs8W8SXFoU"),
# Sys.getenv("pAUDrRtrJ6KbSIvOmllpkxQrgLjJs9iuORKV6bNROP04z"))
## For rtweet
#twitter_token <- rtweet::create_token(
# app = "qmss",
#consumer_key = Sys.getenv("4ihvg6jBxFT9RnXN8OT7SScGO"),
#consumer_secret = Sys.getenv("ht1qDUMC59Jtj5drr9OUFEUJxzwBSYzrOuYf3KdWy9HzsOEC6W"))
# library(rtweet)
# Let's get the current trending topcis in the U.S.
# us <- get_trends("united states")
# frameWidget(datatable(us[,c("trend","tweet_volume")]))
We want to capture a stream of public tweets in real-time, optionally filtering by select screen names or keywords in the text of the tweet.
# Capture 60 seconds of tweets in the U.S.
#stream_tweets(
# lookup_coords("usa"), # handy helper function in rtweet
#verbose = FALSE,
#timeout = (60 * 1),
#) -> usa
Please follow the instructions to submit your homework. The homework is due on Thursday, April 12.
If you do come across something online that provides part of the analysis / code etc., please no wholesale copying of other ideas. We are trying to evaluate your abilities to visualized data not the ability to do internet searches. Also, this is an individually assigned exercise – please keep your solution to yourself.