Network Analysis of U.S. Senate Tweets

Overview

Twitter is a great tool to analyze the public interactions of political actors. For this assignment, I want you to use the information about who follows whom on Twitter as well as past tweets of the current U.S. Senate members to analyze how they interact and what they tweet about.

Data

Twitter Handles of Senators

Twitter does not allow us to search for past tweets based on keywords, location, or topics (hashtags). However, we are able to obtain the past tweets of users if we specify their Twitter handle. The file senators_twitter.csv contains the Twitter handles of the current U.S. Senate members (obtained from SocialSeer). We will focus on the Senators’ official Twitter accounts (as opposed to campaign or staff members). I have also added information on the party affiliation of the Senators from here.

Followers

The file senators_follow.csv contains an edge list of connections between each pair of senators who are connected through a follower relationship (this information was obtained using the function rtweet::lookup_friendships). The file is encoded such that the source is a follower of the target. You will need to use the subset of following = TRUE to identify the connections for which the source follows the target.

Tweets by Senators

To make your life a bit easier, I have also already downloaded all available tweets for these Twitter accounts using the following code. You do not need to repeat this step. Simply rely on the file senator_tweets.RDS in the exercise folder.

packages <- c("devtools","knitr","widgetframe","readr",
              "ggnetwork","GGally","network","sna","ggplot2",
              "svglite","rsvg","tidyverse",
              "ggraph","igraph","tidygraph",
              "gganimate","randomNames","threejs","visNetwork",
              "ergm","tweenr","rtweet","twitteR", "kableExtra", 
              "ggthemes","DT")

packages <- lapply(packages, FUN = function(x) {
  if(!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
}
)

(senators_follow <- list.files(pattern="senators_follow.csv", recursive=T, full.names=T) %>% .[[1]] %>% read.csv()) %>%
  summarytools::dfSummary(., plain.ascii = FALSE, 
                          style = "grid", 
                          graph.magnif = 0.75, 
                          valid.col = FALSE, 
                          tmp.img.dir = "/tmp")

Data Frame Summary

Dimensions: 10000 x 4
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)
1	source [factor]	1. amyklobuchar 2. BenSasse 3. BillCassidy 4. BobCorker 5. brianschatz 6. ChrisCoons 7. ChrisMurphyCT 8. ChrisVanHollen 9. ChuckGrassley 10. clairecmc [ 90 others ]	100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 9000 (90.0%)
2	target [factor]	1. amyklobuchar 2. BenSasse 3. BillCassidy 4. BobCorker 5. brianschatz 6. ChrisCoons 7. ChrisMurphyCT 8. ChrisVanHollen 9. ChuckGrassley 10. clairecmc [ 90 others ]	100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 100 ( 1.0%) 9000 (90.0%)
3	following [logical]	1. FALSE 2. TRUE	4303 (43.0%) 5697 (57.0%)
4	followed_by [logical]	1. FALSE 2. TRUE	4303 (43.0%) 5697 (57.0%)

# summary(senators_follow <- read.csv("https://raw.githubusercontent.com/QMSS-GR5063-2018/DV_CU_course_material/master/Exercises/10_twitter_senate/senators_follow.csv?token=AZbwLk-mJMvg4I1vSugJqEVH0D2ESBvCks5a1V8vwA%3D%3D"))

(senators_twitter <- senate <- list.files(pattern="senators_twitter.csv", recursive=T, full.names=T) %>% .[[1]] %>% read_csv()) %>%
  summarytools::dfSummary(., plain.ascii = FALSE, 
                          style = "grid", 
                          graph.magnif = 0.75, 
                          valid.col = FALSE, 
                          tmp.img.dir = "/tmp")

Data Frame Summary

Dimensions: 100 x 7
Duplicates: 0

No	Variable	Stats / Values	Freqs (% of Valid)	Missing
1	State [character]	1. Alabama 2. Alaska 3. Arizona 4. Arkansas 5. California 6. Colorado 7. Connecticut 8. Delaware 9. Florida 10. Georgia [ 40 others ]	2 ( 2.0%) 2 ( 2.0%) 2 ( 2.0%) 2 ( 2.0%) 2 ( 2.0%) 2 ( 2.0%) 2 ( 2.0%) 2 ( 2.0%) 2 ( 2.0%) 2 ( 2.0%) 80 (80.0%)	0 (0%)
2	Senator [character]	1. Alexander, Lamar 2. Baldwin, Tammy 3. Barrasso, John 4. Bennet, Michael 5. Blumenthal, Richard 6. Blunt, Roy 7. Booker, Cory 8. Boozman, John 9. Brown, Sherrod 10. Burr, Richard [ 90 others ]	1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 90 (90.0%)	0 (0%)
3	Official Twitter [character]	1. amyklobuchar 2. BenSasse 3. BillCassidy 4. BobCorker 5. brianschatz 6. ChrisCoons 7. ChrisMurphyCT 8. ChrisVanHollen 9. ChuckGrassley 10. clairecmc [ 90 others ]	1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 90 (90.0%)	0 (0%)
4	Staff Twitter [character]	1. GrassleyPress 2. InhofePress 3. KlobucharOffice 4. LindseyGrahamSC 5. McCaskillOffice 6. McConnellPress 7. PatRoberts 8. PortmanPress 9. RoyBluntPress 10. SenBobCorker [ 9 others ]	1 ( 5.3%) 1 ( 5.3%) 1 ( 5.3%) 1 ( 5.3%) 1 ( 5.3%) 1 ( 5.3%) 1 ( 5.3%) 1 ( 5.3%) 1 ( 5.3%) 1 ( 5.3%) 9 (47.4%)	81 (81%)
5	Campaign Twitter [character]	1. AngusKing2018 2. barrassoforwyo 3. BenCardinforMD 4. BennetForCO 5. Bob_Casey 6. Booker4Senate 7. Boozman4AR 8. Burrforsenate 9. CapitoforWV 10. CatherineForNV [ 73 others ]	1 ( 1.2%) 1 ( 1.2%) 1 ( 1.2%) 1 ( 1.2%) 1 ( 1.2%) 1 ( 1.2%) 1 ( 1.2%) 1 ( 1.2%) 1 ( 1.2%) 1 ( 1.2%) 73 (87.9%)	17 (17%)
6	Party affiliation [character]	1. Democratic Party 2. Independent 3. Republican Party	47 (47.0%) 2 ( 2.0%) 51 (51.0%)	0 (0%)
7	label [character]	1. Alexander(R) 2. Baldwin(D) 3. Barrasso(R) 4. Bennet(D) 5. Blumenthal(D) 6. Blunt(R) 7. Booker(D) 8. Boozman(R) 9. Brown(D) 10. Burr(R) [ 90 others ]	1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 1 ( 1.0%) 90 (90.0%)	0 (0%)

# summary(senators_twitter <- senate <- read_csv("https://raw.githubusercontent.com/QMSS-GR5063-2018/DV_CU_course_material/master/Exercises/10_twitter_senate/senators_twitter.csv?token=AZbwLhUXFxAPgSBFORs5ioJPw5jMM5bMks5a1V93wA%3D%3D"))

library(tidyverse)
library(lubridate)
library(rtweet)
library(plyr)
library(reshape2)

# Read in the Senator Data
# senate <- read_csv("senators_twitter.csv")

# Get Tweets
# senator_tweets <- get_timeline(user = senate$`Official Twitter`,
#  n = 3200, ## number of tweets to download (max is 3,200)
#   )
urls <- url("https://github.com/QMSS-GR5063-2018/DV_CU_course_material/blob/master/Exercises/10_twitter_senate/senator_tweets.RDS?raw=true")
# readRDS(urls)
urls <- url("https://github.com/QMSS-GR5063-2018/DV_CU_course_material/raw/master/Exercises/10_twitter_senate/senator_tweets.RDS")

# saveRDS(senator_tweets, "senator_tweets.RDS")

# Read in the Tweets
# senator_tweets <- readRDS("senator_tweets.RDS")
# senator_tweets <- readRDS("~/downloads/senator_tweets.RDS")
urls <- list.files(pattern="senator_tweets.RDS", recursive=T, full.names=T) %>% .[[1]] 
senator_tweets <- readRDS(urls)
# save(senator_tweets, file = "senator_tweets.rda")
# How limiting is the API limit?
#senator_tweets %>% 
# group_by(screen_name) %>% 
#summarize(n_tweet = n(),
#         oldest_tweet = min(created_at)) %>%
# arrange(desc(oldest_tweet))

The data contains about 170k tweets and about 40 variables. Please note, that the API limit of 3,200 tweets per twitter handle actually cuts down the time period we can observe the most prolific Twitter users in the Senate down to only about one year into the past.

Tasks for the Assignment

1. Who follows whom?

a) Network of Followers

Read in the edgelist of follower relationships from the file senators_follow.csv.
Create a directed network graph.
Identify the three senators who are followed by the most of their colleagues (i.e. the highest “in-degree”) and the
three senators who follow the most of their colleagues (i.e. the highest “out-degree”). [Hint: You can get this information simply from the data frame or use igraph to calculate the number of in and out connections: indegree = igraph::degree(g, mode = "in").]
Visualize the network of senators. In the visualization,
highlight the party ID of the senator nodes with an appropriate color (blue = Democrat, red = Republican) and
size the nodes by the centrality of the nodes to the network.
Briefly comment.

Below I’m organizing the edgelist so it encompasses all outgoing ties. It looks like this was unnecessary, but I did it anyway just in case. I’m making this a “pseudo weighted” network just because it’s easier for now.

senate$screen_name <- senate$`Official Twitter`

set.seed(10)
dat <- senators_follow
# head(out.edges <- dat %>% filter(following == "TRUE"))

dat$outs <- ifelse(dat$following == "TRUE", 1, 0)
dat$ins <- ifelse(dat$followed_by == "TRUE", 1, 0)

edges.1 <- dat %>% select(from = source, to = target, weight = outs)

edges.2 <- dat %>% select(from = target, to = source, weight = ins)

edges.all <- bind_rows(edges.1, edges.2)

edges.unique <- unique(edges.all) # oops looks like a duplication from before

Below is a tie-reciprocation function my lab and I wrote.

tieReciprocated <- function(edgelist) {
  tierecip <- rep(NA,nrow(edgelist))
  for (i in 1:nrow(edgelist)){
    
    if(is.na(edgelist[i, ]$to) | is.na(tierecip[i])==FALSE) next
    
    match <- subset(edgelist, to==edgelist[i, ]$from & from==edgelist[i, ]$to)
    
    tierecip[i] <- ifelse(nrow(match)==1, 1, 0)
    
  }
  tierecip
}

tieReciprocatedComplete <- function(edgelist) {  # function that only looks at those who actually filled out SNA questionnaire
  completed <- unique(edgelist$from)
  tierecip <- rep(NA,nrow(edgelist))
  for (i in 1:nrow(edgelist)){
    
    if(is.na(edgelist[i, ]$to) | is.na(tierecip[i])==FALSE) next
    if(!(edgelist[i, ]$to %in% completed)) next
    
    match <- subset(edgelist, to==edgelist[i, ]$from & from==edgelist[i, ]$to)
    
    tierecip[i] <- ifelse(nrow(match)==1, 1, 0)
    
  }
  tierecip
}

Finding tie reciprocity

# edges.unique$tierecip <- tieReciprocated(edges.unique)
# edges.unique$tierecipcomplete <- tieReciprocatedComplete(edges.unique)
# head(edges.unique)
sen.att1 <- senate %>% mutate(from = `Official Twitter`)
sen.att.edges <- merge(sen.att1, edges.unique, all = T)

Making adjacency graph

mat1 <- reshape2::acast(edges.unique, from~to, value.var="weight", fill=0, drop=F)
g <- igraph::graph.adjacency(mat1, mode=c("directed"), weighted=T, diag=F)
igraph::graph.density(g)

[1] 0.5754545

V(g)$degree <- degree(g)
V(g)$indeg <- degree(g, mode="in")
V(g)$outdeg <- degree(g, mode="out")
hi <- (vgin <- V(g)$indeg)
hi <- (vgout <- V(g)$outdeg)
hi <- (vgdeg <- V(g)$degree)
hi <- (vgname <- V(g)$name)
vgdf <- data.frame(cbind(screen_name = vgname, indegree = vgin, outdegree = vgout, degree = vgdeg))
sen.att2 <- merge(sen.att1, vgdf, all = T)
# sen.att2$`Party affiliation`
sen.att <- sen.att2 %>% filter(`Party affiliation` == "Democratic Party" | `Party affiliation` == "Republican Party" | `Party affiliation` == "Independent")

V(g)$from = as.character(sen.att$from[match(V(g)$name,sen.att$from)])
V(g)$party = as.character(sen.att$`Party affiliation`[match(V(g)$name,sen.att$from)])
V(g)$indegree = as.character(sen.att$indegree[match(V(g)$name,sen.att$indegree)])
V(g)$outdegree = as.character(sen.att$outdegree[match(V(g)$name,sen.att$outdegree)])
V(g)$degree1 = as.character(sen.att$degree[match(V(g)$name,sen.att$degree)])

In my network visualization below, senators are arranged by node color as each party. Dems = blue, GOP = red, and independents = green. The node size varies by network degree centrality (indegree + degree). Indegree is visualized as the alpha of the nodes as well. If a node is more transparent, then that means it has a lower indegree.

igraph_layouts <- c('star', 'circle', 'gem', 'dh', 'graphopt', 'grid', 'mds', 
                    'randomly', 'fr', 'kk', 'drl', 'lgl')
library(igraph)
library(ggraph)
library(tweenr)
igraph_layouts <- sample(igraph_layouts)
# layouts <- lapply(igraph_layouts, create_layout, graph = g)

# layouts_tween <- tween_states(c(layouts, layouts[1]), tweenlength = 1, 
#                            statelength = 1, ease = 'cubic-in-out', 
#                           nframes = length(igraph_layouts) * 16 + 8)

# title_transp <- tween_t(c(0, 1, 0, 0, 0), 16, 'cubic-in-out')[[1]]
# for (i in seq_len(length(igraph_layouts) * 16)) {
#  tmp_layout <- layouts_tween[layouts_tween$.frame == i, ]
# layout <- igraph_layouts[ceiling(i / 16)]
# title_alpha <- title_transp[i %% 16]

#      p <- ggraph(g, 'manual', node.position = tmp_layout) + 
#       geom_edge_fan(aes(alpha = ..index.., colour = factor("party")), n = 15) +
#      geom_node_point(aes(size = degree, color = party))  + scale_color_manual(values=c("blue", "green", "red")) + scale_fill_manual(values=c("blue", "green", "red")) + 
#        scale_edge_color_brewer(palette = 'Dark2') + 
#       ggtitle(paste0('Layout: ', layout)) + 
#      theme_void() + 
#     theme(legend.position = 'none', 
#          plot.title = element_text(colour = alpha('black', title_alpha)))
#    plot(p)
# }  
p <- ggraph(g) + 
  geom_edge_fan(alpha = .5) +
  geom_node_point(aes(size = degree, color = party, alpha = indeg))  + 
  scale_color_manual(values=c("blue", "green", "red")) + 
  scale_fill_manual(values=c("blue", "green", "red")) + 
  scale_edge_color_brewer(palette = 'Dark2', aes(alpha = outdeg)) + 
  ggtitle("Senators and their Parties and Their Twitter Ties") + 
  theme_void() + labs(color = "Party Affiliation", alpha = "Out-Degree", size = "In-Degree") + 
  theme(legend.position = 'right', 
        plot.title = element_text(colour = alpha('black')))
plot(p)

Calculating in-degree

in.degree <- as.data.frame(igraph::degree(g, mode="in"))
in.degree$PPID <- row.names(in.degree)
names(in.degree) <- c("indeg","PPID")
top3.in.degree <- head(in.degree.rank <- in.degree[order(- in.degree$indeg), 1:2], n = 5)
names(top3.in.degree) <- c("In-Degree (Total Twitter Followers)", "Senator Surname")
top3.in.degree$`Senator Surname` <- top3.in.degree$`Senator Surname` %>% gsub("Sen|Jeff|John|Senator", "", .) %>%
  gsub("MarkW", "W", .)
rownames(top3.in.degree) <- NULL

# frameWidget(datatable(top3.in.degree))
(datatable(top3.in.degree))

The top three senators with the highest in-degree (most people following them) are SenJeffMerkley with 97 followers, and a tie between MarkWarner with 96 followers, SenJohnKennedy with 96 followers, and SenMarkey with 96 followers. If they all tie for 2nd place in number of followers, then SenatorEnzi takes 3rd place with 95 followers. The table above shows first through 3rd place ranked!

Calculating out-degree

out.degree <- as.data.frame(igraph::degree(g, mode="out"))
out.degree$PPID <- row.names(out.degree)
names(out.degree) <- c("outdeg","PPID")
top3.out.degree <- head(out.degree.rank <- out.degree[order(- out.degree$outdeg), 1:2], n = 3)
names(top3.out.degree) <- c("Out-Degree (Total Following on Twitter)", "Senator Surname")
top3.out.degree$`Senator Surname` <- top3.out.degree$`Senator Surname` %>% gsub("Sen|lisa|John|Senator", "", .) %>%
  gsub("murk", "Murk", .)
rownames(top3.out.degree) <- NULL


# frameWidget(datatable(top3.out.degree))
(datatable(top3.out.degree))

The top three senators with the highest out-degree (following the most people) are SenatorCollins with 82 senators that he follows, SenJohnMcCain with 79 senators he follows followers, and then lisamurkowski with 76 senators that she follows. The table above shows the top 3 ranked!

b) Communities

Now let’s see whether party identification is also recovered by an automated mechanism of cluster identification. Use the cluster_walktrap command in the igraph package to find densely connected subgraphs.

# Sample Code for a graph object "g"
wc <- cluster_walktrap(g)  # find "communities"
members <- membership(wc)
members.df <- cbind(names(members), data.frame(as.matrix(members)))
names(members.df) <- c("Handle", "Group")
rownames(members.df) <- NULL
hu <- head(walkt.cluster <- members.df %>% mutate(Method = "Cluster Walktrap"))

# Here, I'm running RANDOM WALK partitioning as if I'm making only 2 clusters (potentially for republican versus democrat)
g.walktrap <- walktrap.community(g, 
                                 steps = 2,modularity=TRUE) # Here, I'm running the random walk partitioning algorithm
# g.walktrap #tells you how many groups there are in the network
#g.walktrap$membership #this is the column in front of each node. it tells us the group each node is in
#g.walktrap$names #names refers to each node. when read with membership, it tells us each node next to its group. Gotta read it and mmbrship as a data frame though!

rwalkdf <-as.data.frame(cbind(g.walktrap$membership, g.walktrap$names)) #making a dataframe out of the R-W membership stuff!
colnames(rwalkdf) <- c("Group", "Handle") #telling it to turn the second column into Handles and first into group number!!
rownames(rwalkdf) <- NULL
# frameWidget(datatable(head(walkt.comm <- rwalkdf %>% select(Handle, Group) %>% mutate(Method = "Walktrap Community Algorithm"))))
(datatable(head(walkt.comm <- rwalkdf %>% select(Handle, Group) %>% mutate(Method = "Walktrap Community Algorithm"))))

In the next couple lines, I’m seeing if the different ways I found walktrap clusters ended up with the same 2 groups! It looks like they’re identical!

walkt.community <- walkt.comm %>% select(everything()) %>% mutate(Group = ifelse(Group == 2, 1, 2))
# frameWidget(datatable(summary(walkt.cluster[ , 1:2] == walkt.community[ , 1:2])))
(datatable(summary(walkt.cluster[ , 1:2] == walkt.community[ , 1:2])))

After joining the senate file with all of the attribute/party details with the walktrap cluster detail, I made a column evaluating whether or not the party mapped onto the community. If it did, the new column indicates “match.” If the mapping failed, then the column indicates “no match.” Everything was a match except for 2 people!! I filtered the data to see who didn’t correctly map onto a community and found that Senator Sanders and Senator Angus King were mis-classified. This is because they’re the only 2 independents in the network! I didn’t have a 3rd group option; silly me. So there was no way I could correctly classify them with my walktrap clustering unless I allowed for more clusters.

senate$Handle <- senate$`Official Twitter`
walkt.cluster$Handle <- as.character(walkt.cluster$Handle)
walkt.cluster$Group <- as.factor(walkt.cluster$Group)
senate.party.clust <- left_join(walkt.cluster, senate)
# frameWidget(datatable(head(senate.party.clust %>% select(`Party affiliation`, Group))))
(datatable(head(senate.party.clust %>% select(`Party affiliation`, Group))))

nooo <- head(summary(senate.party.cluster.match <- senate.party.clust %>% select(everything()) %>%
                       mutate(`Cluster Party x Group Match` = ifelse(`Party affiliation` == "Democratic Party" & Group == 1, "match", 
                                                                     ifelse(`Party affiliation` == "Republican Party" & Group == 2, "match", "no match")))))

# frameWidget(datatable(data.frame(table(senate.party.cluster.match$`Cluster Party x Group Match`))))
(datatable(data.frame(table(senate.party.cluster.match$`Cluster Party x Group Match`))))

# frameWidget(datatable(data.frame(senate.party.cluster.match %>% select(everything()) %>%
(datatable(data.frame(senate.party.cluster.match %>% select(everything()) %>%
                        filter(senate.party.cluster.match$`Cluster Party x Group Match` == "no match") %>% select(1, 2, 3, 4, 5, 9))))

Based on the results, visualize how well this automated community detection mechanism recovers the party affiliation of senators. This visualization need not be a network graph. Comment briefly.

library(ggplot2)

match.vis <- senate.party.cluster.match %>% filter(!is.na(`Party affiliation`))
ggplot(match.vis, aes(`Cluster Party x Group Match`)) + geom_bar(aes(color = `Party affiliation`, fill = `Party affiliation`)) + ylab("Number of Senators") + xlab("Classification Result") + labs(title = "Walktrap Clustering: Party Affiliation x Group Matching") + theme_minimal() + scale_color_manual(values=c("blue", "green", "red")) + scale_fill_manual(values=c("blue", "green", "red"))

2. What are they tweeting about?

From now on, rely on the information from the tweets stored in senator_tweets.RDS.

a) Most Common Topics over Time

Remove all tweets that are re-tweets (is_retweet) and identify which topics the senators tweet about. Rather than a full text analysis, just use the variable hashtags and identify the most common hashtags over time. Provide a visual summary.

In this visual representation (comparison cloud) we can see that the most common hashtag is Obamacare, both in the years 2013 and 2017. I used these 2 years to represent similarities over time because they’re both the years right after a general presidential election in the US, so a ripe time for politics. Some other common words are mepolitics, benghol, benghazi, etc. You can see below!

library(tm)
original.tweets <- senator_tweets %>% filter(is_retweet == FALSE)
original.tweets$date <- lubridate::date(original.tweets$created_at)
original.tweets$year <- lubridate::year(original.tweets$date) 
# summary(original.tweets$year)
original.tweets.2013 <- original.tweets %>% filter(year == "2013")
hash.2013 <- na.omit(unlist(original.tweets.2013$hashtags))
original.tweets.2017 <- original.tweets %>% filter(year == "2017")
hash.2017 <- na.omit(unlist(original.tweets.2017$hashtags))
hash.yrs <- data.frame(cbind(hash.2013[1:6543], hash.2017[1:6543]))
# length(hash.2017)
# length(hash.2013)
hashsh <- head(hash.yrs$X1 <- as.character(hash.yrs$X1))
hashah <- head(hash.yrs$X2 <- as.character(hash.yrs$X2))
corpus = Corpus(VectorSource(hash.yrs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = c("Tweets of '08", "Tweets of '17")
library(wordcloud)
comparison.cloud(tdm,  
                 title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))

# original.tweets %>% select(everything()) %>%
#   group_by(as.vector(hashtags)) %>%
#   summarize(count = n(), status_id, user_id, source, favorite_count, retweet_count)
# original.tweets[order(-original.tweets$hashtags)]
# names(original.tweets)
# head(original.tweets.rank <- original.tweets[order(-original.tweets$hashtags)], n = 5)
# hi <- data.frame(original.tweets) 
# hash <- unlist(original.tweets)
# hash
# data.frame(original.tweets) %>% select(everything()) %>%
#   group_by(hashtags) %>%
#   summarize(total = n(), status_id, user_id, source, favorite_count, retweet_count)
# original.tweets$hashtags

b) Democrats vs. Republicans

Some tweets are as old as 10 years but for some prolific users we observe a much shorter time span of Twitter activity. Feel free to subset the data to only include more recent tweets. Using the party ID variable (Party affiliation), identify how the choice of topics tweeted about (again using using hashtags) differs by party and visualize that information.

It seems as though the republicans are speaking more about the southern states and about conservative policy than are the democrats– makes sense. Also, each party insults the other by saying things like “GOPtaxscam” as a hashtag.

recent.twts <- original.tweets %>% filter(year > 2014)
# head(recent.twts)
parties <- senate %>% mutate(screen_name = `Official Twitter`)

parties.twts <- left_join(recent.twts, parties)
# head(parties.twts)
gop.twts <- parties.twts %>% filter(`Party affiliation` == "Republican Party")
dem.twts <- parties.twts %>% filter(`Party affiliation` == "Democratic Party")
vey <- length(gop.hash <- na.omit(as.character(gop.twts$hashtags)))
oy <- length(dem.hash <- na.omit(as.character(dem.twts$hashtags)))
party.hash <- data.frame(cbind(dem.hash[1:37853], gop.hash[1:37853]))
party.hash$X1 <- as.character(party.hash$X1)
party.hash$X2 <- as.character(party.hash$X2)
corpus = Corpus(VectorSource(party.hash))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = c("Democrats", "Republicans")
comparison.cloud(tdm,  
                 title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))

c) Gun Control I - Dems vs. Reps

The democratic party seems broadly more supportive of gun control legislation. Try to identify a set of 5-10 hashtags that signal support for gun control legislation (e.g. “NeverAgain”, #guncontrol, #guncontrolnow, #Enough) and others that are expressing support for the right to own guns (e.g. #2ndamendment, #NRA, #liberals). The site ritetag.com can help with that task. Using the subset of senator tweets that included these hashtags, show whether and how senators from different parties talk differently about the issue of gun legislation.

In terms of guns, it looks like the democrats and republicans both talk about them quite a bit. In the comparison cloud below, I display any time one of those 2 parties mentions “gun” or any of the hashtags mentioned in the question above (except for liberals because that’s too broad a category for my interests.) I Used the grep call to find the relevant rows. The democrats reference tags like “endgunviolence” while the republicans reference “2ndamendment” and such about gun rights.

mew <- length(gop.guns <- na.omit(gop.hash[(grep("gun|NeverAgain|Enough|NRA|2ndamendment", gop.hash, ignore.case = T))]))
wem <- length(dem.guns <- na.omit(dem.hash[(grep("gun|NeverAgain|Enough|NRA|2ndamendment", dem.hash, ignore.case = T))]))

party.guns <- data.frame(cbind(dem.guns[1:89], gop.guns[1:89]))
party.guns$X1 <- as.character(party.guns$X1)
party.guns$X2 <- as.character(party.guns$X2)
corpus.guns = Corpus(VectorSource(party.guns))
tdm.guns = TermDocumentMatrix(corpus.guns)
tdm.guns = as.matrix(tdm.guns)
colnames(tdm.guns) = c("Democrats", "Republicans")
comparison.cloud(tdm.guns,  
                 title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))

d) Gun Control II - Parkland Shooting

On February 14, 2018, a mass shooting occurred at Marjory Stoneman Douglas High School in Parkland, Florida. Provide some visualization of how senators responded to the event in their Twitter communication.

Based on my comparison cloud, democrats seem much more social-justice oriented and concerned about the Parkland shooting than do Republicans. Democrats are speaking out against a terrible thing, but republicans are largely ignoring the actual implications.

fie <- length(gop.parkland <- na.omit(gop.hash[(grep("Marjory|Stoneman|Douglas High|Parkland|February18|Feb18", gop.hash, ignore.case = T))]))
vye <- length(dem.parkland <- na.omit(dem.hash[(grep("Marjory|Stoneman|Douglas High|Parkland|February18|Feb18", dem.hash, ignore.case = T))]))

party.parkland <- data.frame(cbind(dem.parkland[1:31], gop.parkland[1:31]))
party.parkland$X1 <- as.character(party.parkland$X1)
party.parkland$X2 <- as.character(party.parkland$X2)
corpus.parkland = Corpus(VectorSource(party.parkland))
tdm.parkland = TermDocumentMatrix(corpus.parkland)
tdm.parkland = as.matrix(tdm.parkland)
colnames(tdm.parkland) = c("Democrats", "Republicans")
comparison.cloud(tdm.parkland,  
                 title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))

3. Are you talking to me?

Often tweets are simply public statements without addressing a specific audience. However, it is possible to interact with a specific person by adding them as a friend, becoming their follower, re-tweeting their messages, and/or mentioning them in a tweet using the @ symbol.

a) Identifying Re-Tweets

Select the set of re-tweeted messages from other senators and identify the source of the originating message. Calculate by senator the amount of re-tweets they received and from which party these re-tweets came. Essentially, I would like to visualize whether senators largely re-tweet their own party colleagues’ messages or whether there are some senators that get re-tweeted on both sides of the aisle. Visualize the result.

According to my table and my ggpot visualization, the people/parties who are most retweeted are primarily the democrats, except for Republican John McCain, who’s retweeted more than any of them. But out of the top 5 people with number of times they were retweeted, McCain is the only Republican. The rest are democrats.

retweets <- senator_tweets %>% filter(is_retweet == TRUE)
parties.retwts <- left_join(retweets, parties)
retweets.directed <- parties.retwts %>% separate(text, c("source.of.original.twt", "text"), ": ", extra = "drop")
hihd <- head(retweets.directed$source.of.original.twt <- gsub("RT @", "", retweets.directed$source.of.original.twt))
# retweets.directed$retweeted.from

# retweets.directed$source.of.original.twt <- gsub("_", "", retweets.directed$source.of.original.twt)

parties$source.of.original.twt <- parties$screen_name

rtwts.net <- left_join(retweets.directed, parties, by = "source.of.original.twt")

rtwts.net$rtwt.party.match <- ifelse(rtwts.net$`Party affiliation.x` == "Democratic Party" & rtwts.net$`Party affiliation.y` == "Democratic Party", "Democrat Retweeted Democrat", 
                                     ifelse(rtwts.net$`Party affiliation.x` == "Democratic Party" & rtwts.net$`Party affiliation.y` == "Republican Party", "Republican Retweeted Republican", 
                                            ifelse(rtwts.net$`Party affiliation.x` == "Democratic Party" & rtwts.net$`Party affiliation.y` == "Republican Party", "Democrat Retweeted Republican", 
                                                   ifelse(rtwts.net$`Party affiliation.x` == "Republican Party" & rtwts.net$`Party affiliation.y` == "Democratic Party", "Republican Retweeted Democrat", "Other"))))

# summary(rtwts.net$rtwt.party.match)

rt.net <- rtwts.net
rt.net$from <- rt.net$screen_name.x
rt.net$to <- rt.net$source.of.original.twt
rt.net$weight <- as.numeric(rt.net$retweet_count)

# rt.net$from.to <- paste(rt.net$`Party affiliation.x`, "Retweeted", rt.net$`Party affiliation.y`)

who.retweers <- rt.net %>% dplyr::group_by(to, `Party affiliation.y`) %>%
  dplyr::summarize(total = sum(retweet_count))

who.rtw <- who.retweers %>% filter(!is.na(`Party affiliation.y`))

top.retweeted <- who.rtw %>% dplyr::rename(`Original Tweeters' Party` = `Party affiliation.y`, 
                                           `Original Tweeters' Handle` = to,
                                           `# of Times their Tweet was Retweeted` = total)

top5.retweeted <- head(top.retweeted[order( - top.retweeted$`# of Times their Tweet was Retweeted`), 1:3], n = 5)

# frameWidget(datatable(top5.retweeted))
(datatable(top5.retweeted))

Democrats are retweeted so much! Whoa! They also do a ton of the retweeting. The other parties just don’t tweet or retweet so much. Also interestingly, The democrats seem to pretty often retweet republicans, but not vice versa. Indepenents don’t tweet much, but when they retweet, they retweet among themselves most, but also retweet republicans and democrats to some degree.

pgrp <- rt.net %>% filter(`Party affiliation.y` == "Democratic Party" | `Party affiliation.y` == "Republican Party" | `Party affiliation.y` == "Independent" | `Party affiliation.y` == "Other") %>% filter(!is.na(`Party affiliation.x`))

pgrp$from.to <- paste(pgrp$`Party affiliation.x`, "Retweeted", pgrp$`Party affiliation.y`)

# summary(pgrp$from.to)
#ugh <- rt.net %>% filter(`Party affiliation.y` == "Democratic Party" | `Party affiliation.y` == "Republican Party" | `Party affiliation.y` == "Independent" | `Party affiliation.y` == #"Other" | !is.na(`Party affiliation.x`) | !is.na(`Party affiliation.x`))
#ugh <- pgrp %>% filter(!is.na(`Party affiliation.x`))
ggplot(pgrp, aes(`Party affiliation.y`, retweet_count)) + geom_boxplot(aes(fill = `Party affiliation.y`)) + geom_jitter(aes(color = `Party affiliation.x`)) + coord_flip() + xlab("Original Tweeters") +  labs(color='Retweeters', fill = "Original Tweets") + ylab("Number of Retweets") + labs(title = "Who Retweets from Who? And How Much?") + theme_minimal()

rt1 <- rt.net %>% select(from, to, weight)
# length(rt1$from)
dfxy <- data.frame(froms = c(1:(length(rt1$from)*2)), tos = c(1:(length(rt1$to)*2)), weights = 0)


dfxy$weights[1:(length(rt1$from))] <- rt1$weight

dfxy$froms[1:(length(rt1$from))] <- rt1$from
dfxy$froms[(length(rt1$from) + 1): (length(rt1$from) * 2)] <- rt1$to

dfxy$tos[1:(length(rt1$to))] <- rt1$to
dfxy$tos[(length(rt1$to) + 1): (length(rt1$to) * 2)] <- rt1$from

mews <- na.omit(dfxy)

# length(tos <- rbind(rt1$to, rt1$from))
# rt1$weight0 <- 0
# length(weights <- rbind(rt1$weight, rt1$weight0))
# meow <- data.frame(from = c(froms), to = c(tos), weight = c(weights))
# summary(meow)
# dim(mew <- na.omit(meow))
# mews$to <- gsub("_", "", mews$tos)

mewe <- mews %>% dplyr::group_by(froms) %>%
  dplyr::summarise(weight = n())
mews$from <- mews$froms
mews$weight <- mews$weights
mews$to <- mews$tos
mewe.retweeters <- unique(na.omit(mews %>% select(from, to, weight)))
# summary(mewe$from <- as.character(mewe$from))
# summary(mewe$to <- as.character(mewe$tos))
# meow <- unique(mewe %>% select(from, to, weight))
# rt.mat <- reshape2::acast(mewe.retweeters, from ~ to, value.var="weight", fill=0, drop=F)
# g <- igraph::graph.adjacency(mews, mode=c("directed"), weighted=T, diag=F)
# igraph::graph.density(g)


retwtrs <- rt.net %>% select(retweet_count, to) %>% 
  group_by(to) #%>%
#  summarize(retweet_count)
#length(retwtrs$`Party affiliation`) %>%
# summarize(nooo = (retweet_count))

# top5.retwtrs <- head(toprtw.rank <- in.degree[order(- in.degree$indeg), 1:2], n = 5)

For my last little fun with retweeters and retweetees, I made a table spelling out the frequency at which each party has retweeted each party. The results are in and democrats just love to retweet everyone, more than anyone else even likes to retweet themselves!

who.retweers <- rt.net %>% dplyr::group_by(to, `Party affiliation.y`, `Party affiliation.y`) %>%
  dplyr::summarize(total = sum(retweet_count))

who.retweers <- pgrp %>% dplyr::group_by(from.to) %>%
  dplyr::summarize(total = sum(retweet_count))

top5.retwtrs.rtwtees <- head(rtw.rank <- who.retweers[order(- who.retweers$total), 1:2], n = 5)
names(top5.retwtrs.rtwtees) <- c("Party Tweets Party", "Times this Happened!")


# frameWidget(datatable(top5.retwtrs.rtwtees))
(datatable(top5.retwtrs.rtwtees))

b) Identifying Mentions

Identify the tweets in which one senator mentions another senator directly (the variable is mentions_screen_name). For this example, please remove simple re-tweets (is_retweet == FALSE). Calculate who re-tweets whom among the senate members. Convert the information to an undirected graph object in which the number of mentions is the strength of the relationship between senators. Visualize the network graph using the party identification of the senators as a group variable (use blue for Democrats and red for Republicans) and some graph centrality measure to size the nodes. Comment on what you can see from the visualization.

First, I made a wordcloud to visualize who people from the democratic and republican parties are mentioning. It looks like those of the independent party don’t exactly mention anyone.

# is_retweet is false here
join.originals <- left_join(original.tweets, senate)
# we need a source and a target
# join.originals$mentions_screen_name tis a df with lists fml
# retweets.directed$is_retweet

gops <- join.originals %>% filter(`Party affiliation` == "Republican Party")
dems <- join.originals %>% filter(`Party affiliation` == "Democratic Party")
indeps <- join.originals %>% filter(`Party affiliation` == "Independent Party")

oh <- length(gop.mention <- na.omit(as.character(gops$mentions_screen_name)))
em <- length(dem.mention <- na.omit(as.character(dems$mentions_screen_name)))
gee <- length(indep.mention <- na.omit(as.character(indeps$mentions_screen_name)))# no mentions lol

party.mention <- data.frame(cbind(dem.mention[1:41128], gop.mention[1:41128]))
party.mention$X1 <- as.character(party.mention$X1)
party.mention$X2 <- as.character(party.mention$X2)
corpus.mention = Corpus(VectorSource(party.mention))
tdm.mention = TermDocumentMatrix(corpus.mention)
tdm.mention = as.matrix(tdm.mention)
colnames(tdm.mention) = c("Democrats", "Republicans")

comparison.cloud(tdm.mention,  
                 title.size=1.5, max.words = 200, color = c("Blue", "Red", "Grey"))

dems <- join.originals %>% filter(`Party affiliation` == "Democratic Party")
gops <- join.originals %>% filter(`Party affiliation` == "Republican Party")

oh <- length(dem.mention <- na.omit(as.character(dems$mentions_screen_name)))
emge <- length(gop.mention <- na.omit(as.character(gops$mentions_screen_name)))

list1 <- join.originals$mentions_screen_name
MAX = max(sapply(list1, length))

func1 <- function(x, MAX) {
  vec <- c(x, rep(NA, MAX-length(x)))
  return(vec)    
}

list2 <- lapply(list1, func1, MAX = MAX)

df3.1 <- plyr::ldply(list2)
# dim(df3.1)

orig.spread <- cbind(join.originals, df3.1)
# names(orig.spread)
orig.gath <- orig.spread %>% gather("cases","mentioned_name", V1:V19)

og <- unique(orig.gath %>% filter(year > 2017)) %>%
  select(-(15:36))
#dim(og)
#names(og)
og2 <- og
parties$mentioned_name <- parties$screen_name
ment.part <- left_join(og2, parties, by = "mentioned_name")

mentions.small <- ment.part %>% dplyr::group_by(mentioned_name) %>% 
  dplyr::summarise(Mentions = n())

mentions <- na.omit(ment.part %>% dplyr::group_by(mentioned_name, screen_name = screen_name.x, `Party affiliation.x`, `Party affiliation.y`) %>% 
                      dplyr::summarise(Mentions = n()))
mention.edgelist <- mentions %>% 
  dplyr::rename(target = screen_name, source = mentioned_name, weight = Mentions)

mg <- head(medge <- mention.edgelist[order(- mention.edgelist$weight), ], n = 200)

medge100 <- head(mention.edgelist, n = 200)

medge$weight0 <- 0

medge$party2 <- medge$`Party affiliation.y`
medge <- data.frame(medge)
ohhh <- head(edges.1 <- data.frame(medge %>% dplyr::select(from = source, to = target, weight, party = Party.affiliation.x)))
noooo <- head(edges.2 <- data.frame(medge %>% dplyr::select(from = target, to = source, weight = weight0, party = Party.affiliation.y)))

edges.all <- bind_rows(edges.1, edges.2)

mst <- as.data.frame(medge100 %>% select(source, target))

network1 = graph_from_data_frame(d = mg, directed = F) 

plot(network1)

edges <- unique(mg %>% dplyr::rename(from = source, to = target))
routes_network <- network(edges, matrix.type = "edgelist", ignore.eval = FALSE)
plot(routes_network)

# edges.all

mat1 <- reshape2::acast(edges.all, from~to, value.var="weight", fill=0, drop=F)
g2 <- igraph::graph.adjacency(mat1, mode=c("undirected"), weighted=T, diag=)

Below is the density of the network (out of a maximum 1)!

print(igraph::graph.density(g2))

[1] 0.1882864

V(g2)$degree <- degree(g2)

V(g2)$indeg <- degree(g2, mode="in")

V(g2)$outdeg <- degree(g2, mode="out")

V(g2)$from = as.character(sen.att$from[match(V(g2)$name,sen.att$from)])
V(g2)$party = as.character(edges.all$party[match(V(g2)$name, edges.all$from)])

# V(g)$indegree = as.character(sen.att$indegree[match(V(g)$name,sen.att$indegree)])
# V(g)$outdegree = as.character(sen.att$outdegree[match(V(g)$name,sen.att$outdegree)])
# V(g)$degree1 = as.character(sen.att$degree[match(V(g)$name,sen.att$degree)])

To some degree, it looks like everyone mentions everyone. I don’t see much evidence that only one party is mentioning another party. In contrast to my network visualization at the beginning of this journey, where we looked at parties following parties, the mentions are very diverse with mentions. I suppose that could be because mentioning someone doesn’t have to be a nice thing. You can mention others to disparage them. Senators from different parties could be doing that.

plot(g2)

p2 <- ggraph(g2) + 
  geom_edge_fan(aes(alpha = weight)) +
  geom_node_point(aes(size = degree, color = party))  + 
  scale_color_manual(values=c("blue", "green", "red")) + 
  scale_fill_manual(values=c("blue", "green", "red")) + 
  scale_edge_color_brewer(palette = 'Dark2', aes(alpha = outdeg)) + 
  ggtitle("Who Mentions Who? 2018 Twitter Mentions Network") + 
  theme_void() + labs(color = "Party Affiliation", alpha = "Out-Degree", size = "Degree") + 
  theme(legend.position = 'right', 
        plot.title = element_text(colour = alpha('black')))
plot(p2)

c) BONUS ONLY: Who is popular on Twitter?

Using the twitter handles, access the user information of the senators to identify the number of followers they have (obviously, this will require to actually connect to the Twitter server). Re-do the previous graph object but now use the number of followers (or some transformation of that info) to size the nodes. Comment how graph degree centrality (via mentions) and the number of followers are related.

I might be crazy, but it doesn’t look like there’s a super clear relationship between degree centrality (by number of mentions) and follower count. The bigger nodes have more followers, but some of the biggest nodes are pretty transparent, meaning they aren’t mentioned a ton. The is some correlation though. Lots of big nodes are also very solidly colored.

names <- senator_tweets$screen_name
# unique(names)
users <- lookupUsers(unique(names))
fc <- sapply(users, function(x) x$followersCount)
fc <- data.frame(fc)
# fc
ufufuf <- (fc$from <- rownames(fc))

V(g2)$fc = as.character(fc$fc[match(V(g2)$name, fc$from)])


p2 <- ggraph(g2) + 
  geom_edge_fan(aes(alpha = weight)) +
  geom_node_point(aes(alpha = as.numeric(degree), color = party, size = fc))  + guides(size = FALSE) +
  scale_color_manual(values=c("blue", "green", "red")) + 
  scale_fill_manual(values=c("blue", "green", "red")) + 
  scale_edge_color_brewer(palette = 'Dark2', aes(alpha = outdeg)) + labs(title = "Who Mentions Who? 2018 Twitter Mentions Network and Follower Count", subtitle = "Node Size is Network Centrality via # of Mentions") + 
  theme_void() + labs(color = "Party Affiliation", alpha = "Degree Centrality (# of Mentions)", size = "Degree") + 
  theme(legend.position = 'right', 
        plot.title = element_text(colour = alpha('black')))
plot(p2)

# Setting your Access Token
## Please follow the guide in the previous lecture
## For twitteR
# options(httr_oauth_cache=T)  

#twitteR::setup_twitter_oauth(Sys.getenv("4ihvg6jBxFT9RnXN8OT7SScGO"), 
#                   Sys.getenv("ht1qDUMC59Jtj5drr9OUFEUJxzwBSYzrOuYf3KdWy9HzsOEC6W"),
#                  Sys.getenv("912015600339255298-FBK0AWZD2FQymGJgT6aUYWs8W8SXFoU"),
#                 Sys.getenv("pAUDrRtrJ6KbSIvOmllpkxQrgLjJs9iuORKV6bNROP04z"))
## For rtweet
#twitter_token <- rtweet::create_token(
# app = "qmss",
#consumer_key = Sys.getenv("4ihvg6jBxFT9RnXN8OT7SScGO"),
#consumer_secret = Sys.getenv("ht1qDUMC59Jtj5drr9OUFEUJxzwBSYzrOuYf3KdWy9HzsOEC6W"))

# library(rtweet)
# Let's get the current trending topcis in the U.S.
# us <- get_trends("united states")

# frameWidget(datatable(us[,c("trend","tweet_volume")]))

Capturing Tweets in Real-time with Streaming API

We want to capture a stream of public tweets in real-time, optionally filtering by select screen names or keywords in the text of the tweet.

# Capture 60 seconds of tweets in the U.S.
#stream_tweets(
# lookup_coords("usa"), # handy helper function in rtweet
#verbose = FALSE,
#timeout = (60 * 1),
#) -> usa

Submission

Please follow the instructions to submit your homework. The homework is due on Thursday, April 12.

Please stay honest!

If you do come across something online that provides part of the analysis / code etc., please no wholesale copying of other ideas. We are trying to evaluate your abilities to visualized data not the ability to do internet searches. Also, this is an individually assigned exercise – please keep your solution to yourself.

Data Visualization: U.S. Senators on Twitter

Spring 2018 | Assignment 4 | Social Networks

Samantha Rhoads

2018-04-06

Network Analysis of U.S. Senate Tweets

Overview

Data

Twitter Handles of Senators

Followers

Tweets by Senators

Data Frame Summary

Data Frame Summary

Tasks for the Assignment

1. Who follows whom?

a) Network of Followers

The top three senators with the highest out-degree (following the most people) are SenatorCollins with 82 senators that he follows, SenJohnMcCain with 79 senators he follows followers, and then lisamurkowski with 76 senators that she follows. The table above shows the top 3 ranked!

b) Communities

2. What are they tweeting about?

a) Most Common Topics over Time

b) Democrats vs. Republicans

c) Gun Control I - Dems vs. Reps

d) Gun Control II - Parkland Shooting

3. Are you talking to me?

a) Identifying Re-Tweets

b) Identifying Mentions

c) BONUS ONLY: Who is popular on Twitter?

Capturing Tweets in Real-time with Streaming API

Submission

Please stay honest!