I’ve been inspired by how HuwFulcher analyzed some social network data in Python to replicate it in R. The fun thing about this is that it’s much, much easier than doing real analysis :) Since it’s late in the evening and I’m going to bed soon…
I encourage everyone to read the full analysis before proceeding; the author has a lot of excellent exposition about the dataset. I won’t rehash a lot of that here, except for some initial background.
Fifth Tribe, a digital agency, scraped over 17,000 tweets posted by ISIS supporters since the November 2015 Paris attacks. They produced a nice infographic and article; it’s rather nice of them to publish the full dataset. Good move, and good publicity.
The analysis that HuwFulcher performed was to create a graph of how ISIS members interact.
They have provided one file (tweets.csv) for us to analyze. First, let’s load some libraries.
library(needs)
## Load `package:needs` in an interactive session to set auto-load flag
needs(readr, dplyr, tidyr, ggplot2, stringr,
igraph, intergraph, ggrepel,
ggnetwork,
pander, formattable)
## installing packages:
## intergraph
## ggrepel
## ggnetwork
## pander
## also installing the dependencies 'network', 'sna'
## package 'network' successfully unpacked and MD5 sums checked
## package 'sna' successfully unpacked and MD5 sums checked
## package 'intergraph' successfully unpacked and MD5 sums checked
## package 'ggrepel' successfully unpacked and MD5 sums checked
## package 'ggnetwork' successfully unpacked and MD5 sums checked
## package 'pander' successfully unpacked and MD5 sums checked
Now let’s read the data in and look at the columns.
data <- read_csv("../input/tweets.csv")
glimpse(data)
## Observations: 17,410
## Variables: 8
## $ name (chr) "GunsandCoffee", "GunsandCoffee", "GunsandCoffe...
## $ username (chr) "GunsandCoffee70", "GunsandCoffee70", "GunsandC...
## $ description (chr) "ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews",...
## $ location (chr) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ followers (int) 640, 640, 640, 640, 640, 640, 640, 640, 640, 64...
## $ numberstatuses (int) 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49,...
## $ time (chr) "1/6/2015 21:07", "1/6/2015 21:27", "1/6/2015 2...
## $ tweets (chr) "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFU...
data %>%
head %>%
formattable

| name | username | description | location | followers | numberstatuses | time | tweets |
|---|---|---|---|---|---|---|---|
| GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NA | 640 | 49 | 1/6/2015 21:07 | ENGLISH TRANSLATION: ‘A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ABU MUHAMMED AL MAQDISI: http://t.co/73xFszsjvr http://t.co/x8BZcscXzq |
| GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NA | 640 | 49 | 1/6/2015 21:27 | ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI ’FOR THE PEOPLE OF INTEGRITY, SACRIFICE IS EASY’ http://t.co/uqqzXGgVTz http://t.co/A7nbjwyHBr |
| GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NA | 640 | 49 | 1/6/2015 21:29 | ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWLANI (HA): http://t.co/TgXT1GdGw7 http://t.co/ZuE8eisze6 |
| GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NA | 640 | 49 | 1/6/2015 21:37 | ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP: ‘THE PROMISE OF VICTORY’: http://t.co/3qg5dKlIwr http://t.co/7bqk1wJAzC |
| GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NA | 640 | 49 | 1/6/2015 21:45 | ENGLISH TRANSLATION: AQAP: ‘RESPONSE TO SHEIKH BAGHDADIS STATEMENT ’ALTHOUGH THE DISBELIEVERS DISLIKE IT.’ http://t.co/2EYm9EymTe |
| GunsandCoffee | GunsandCoffee70 | ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews | NA | 640 | 49 | 1/6/2015 21:51 | THE SECOND CLIP IN A DA’WAH SERIES BY A SOLDIER OF JN: Video Link :http://t.co/EPaPRlph5W http://t.co/4VUYszairt |
Data Scientist Vincent Su (Violinbeats) has a nice explanation of what each of the columns is; I won’t go into that further here.
First we’ll replicate the counts - there are a variety of ways, but let’s use dplyr because it prints more nicely with formattable.
data %>%
summarise(`Unique Tweets` = n_distinct(tweets),
`Total Tweets` = n()) %>%
mutate_each(funs(comma)) %>%
formattable(align = 'l')

| Unique Tweets | Total Tweets |
|---|---|
| 17,410.00 | 17,410.00 |
So the two counts are equal: every tweet in the dataset is unique.
Let’s examine retweets.
data %>%
mutate(is_retweet = grepl("^\\bRT\\b", tweets),
is_retweet = ifelse(is_retweet, "Retweets", "Actual Tweets")) -> data
data %>%
ggplot(aes(is_retweet)) +
geom_bar()
What’s interesting is the high percentage of retweets in the dataset; in the general Twitter population, the share of retweets is far lower. So the filtering process used to collect these tweets probably relied, at least in part, on identifying these retweets.
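To see what the retweet pattern actually catches, here is a small sketch on a few made-up tweets (the sample strings are invented for illustration and are not from tweets.csv). It only uses base R, so it should run without loading anything.

```r
# Hypothetical sample tweets, invented for illustration (not from tweets.csv)
sample_tweets <- c(
  "RT @someone: breaking news",
  "Reading an ART review today",
  "RTs are not endorsements",
  "Just my own thoughts, no RT here"
)

# Same pattern as above: "RT" as a whole word at the very start of the text
is_retweet <- grepl("^\\bRT\\b", sample_tweets)

print(data.frame(tweet = sample_tweets, is_retweet = is_retweet))
```

Only the first string is flagged: the pattern is anchored at the start of the tweet, and the trailing word boundary keeps "RTs" from matching. A retweet quoted further into a tweet would be missed by this check.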
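So, now let’s look more at the social network. For that, we’ll pull out the users who are mentioned (@ed) in each tweet and check whether each of them is also an author in the dataset.
I’ll also add an indicator variable for retweets, since we may want to exclude those to avoid double counting.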
authors = unique(data$username)
data %>%
mutate(mentioned_users = str_extract_all(tweets, '(?<=@)\\w+'),
mentioned_users = ifelse(lengths(mentioned_users) == 0, c(""), mentioned_users)) %>%
unnest(mentioned_users) %>%
mutate(is_author = mentioned_users %in% authors) ->
exploded_data
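The extraction step above can be sketched on toy data. This is a rough base R equivalent, swapping in `regmatches()` and `gregexpr()` for `str_extract_all()` and `unnest()` so it runs without extra packages. The tweets and the author list are made up for illustration.

```r
# Hypothetical tweets and author list, invented for illustration
tweets <- c(
  "Thanks @alice and @bob_99 for the update",
  "No mentions in this one",
  "Email me at test@example.com"
)
authors <- c("alice", "carol")

# The lookbehind (?<=@) needs the PCRE engine, hence perl = TRUE
mentions <- regmatches(tweets, gregexpr("(?<=@)\\w+", tweets, perl = TRUE))

# Tweets without mentions get an empty string so they keep one row
mentions[lengths(mentions) == 0] <- ""

# One row per (tweet, mention) pair
exploded <- data.frame(
  tweet_id = rep(seq_along(tweets), lengths(mentions)),
  mentioned_user = unlist(mentions),
  stringsAsFactors = FALSE
)
exploded$is_author <- exploded$mentioned_user %in% authors
print(exploded)
```

Note that the third toy tweet yields a "mention" of `example`: the pattern takes whatever word characters follow an `@`, so email addresses are picked up too. That is probably rare in tweets, but worth keeping in mind.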
exploded_data %>%
filter(is_retweet == "Actual Tweets") %>%
mutate(is_author = ifelse(is_author, "In", "Not in")) %>%
group_by(is_author) %>%
summarise(users = n_distinct(mentioned_users)) %>%
ggplot(aes(is_author, users)) +
geom_bar(stat = "identity") +
ggtitle("Users in vs. not in tweets.csv")
exploded_data %>%
mutate(is_author = ifelse(is_author, "Mentioned", "Total")) %>%
group_by(is_author) %>%
summarise(users = n_distinct(username)) %>%
ggplot(aes(is_author, users)) +
geom_bar(stat = "identity") +
ggtitle("Mentioned vs. Total in tweets")
OK, so now we want to find the users who are mentioned the most.
exploded_data %>%
filter(is_retweet == "Actual Tweets",
mentioned_users != "",
is_author) %>%
group_by(mentioned_users) %>%
summarise(volume = n()) ->
user_counts
user_counts %>%
top_n(5, volume) ->
top_5_mentioned
top_5_mentioned %>%
arrange(desc(volume)) %>%
formattable(list(volume = color_bar("orange")), align = 'l')

| mentioned_users | volume |
|---|---|
| WarReporter1 | 131 |
| RamiAlLolah | 53 |
| Uncle_SamCoco | 48 |
| Nidalgazaui | 34 |
| ismailmahsud | 30 |
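For anyone who wants to see the counting logic without dplyr, here is a minimal base R sketch of the same idea on an invented vector of mentions (the names are placeholders, not real accounts).

```r
# Hypothetical mention log, invented for illustration
mentioned_users <- c("a", "b", "a", "c", "a", "b", "d")

# Tally mentions per user and sort from most to least mentioned
volume <- sort(table(mentioned_users), decreasing = TRUE)

# Keep the first two entries; unlike top_n(), this does not keep extra rows on ties
top_2 <- volume[seq_len(2)]
print(top_2)
```

The tie handling is the main difference to be aware of: `top_n()` can return more than the requested number of rows when several users share the cut-off count, while plain indexing cuts at exactly two.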
OK! Let’s take a peek at how these people describe themselves.
exploded_data %>%
inner_join(top_5_mentioned, by = c('username' = 'mentioned_users')) %>%
select(username, description) %>%
distinct %>%
filter(!is.na(description)) %>%
formattable(align = 'l')

| username | description |
|---|---|
| ismailmahsud | Listen! No affiliations, Final year research on conflict studies.Don’t suspend Free Thinker\| Time Travelling a dream\| Against Injustice, Corporations\| Fight CO2 |
| RamiAlLolah | Real-Time News, Exclusives, Intelligence & Classified Information/Reports from the ME. Forecasted many Israeli strikes in Syria/Lebanon. Graphic content. |
| Uncle_SamCoco | Here to defend the American freedom and also the freedom of coconut . Cat Lover or Hater. Kebab Fan . We’re all living in America, America ist wunderbar #USA |
| WarReporter1 | Reporting on conflicts in the MENA and Asia regions. |
| WarReporter1 | Reporting on conflicts in the MENA and Asia regions. Not affiliated to any group or movement. |
| Nidalgazaui | 17yr. old Freedom Activist /Correspondence of NGNA /Terror Expert/Middle East Expert. Daily News about Syria/Iraq/Yemen/Russia/Middle East |
As HuwFulcher noted: all very unbiased, hey?
Now let’s make a network.
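What we want is a matrix with one row per tweeting user and one column per mentioned user, where each cell counts how often the first mentions the second.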
To start off with, we’ll restrict it to just the authors. It’s a much smaller number.
Or, to put it in code form -
exploded_data %>%
filter(is_retweet == "Actual Tweets",
mentioned_users != "") %>%
select(username, mentioned_users) %>%
xtabs(~ username + mentioned_users, data = .) ->
network_data
authors_mentioned <- intersect(authors, colnames(network_data))
unmentioned_authors <- setdiff(rownames(network_data), authors_mentioned)
missing_authors <- matrix(0, ncol = length(unmentioned_authors),
nrow = nrow(network_data), dimnames = list(rownames(network_data), unmentioned_authors))
graph_data <- network_data[, authors_mentioned]
graph_data <- cbind(graph_data, missing_authors)
remove = setdiff(colnames(graph_data), rownames(graph_data))
graph_data <- graph_data[, !(colnames(graph_data) %in% remove)]
dim(graph_data)
## [1] 100 100
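The matrix juggling above is easier to follow on a tiny example. The sketch below is not the exact code from this post: it takes a slightly different route (start from an all-zero authors-by-authors matrix and fill in the counts), and the edge list is invented for illustration.

```r
# Hypothetical edge list (who mentions whom), invented for illustration
edges <- data.frame(
  username        = c("a", "a", "b", "c", "c"),
  mentioned_users = c("b", "b", "x", "a", "b"),
  stringsAsFactors = FALSE
)
authors <- unique(edges$username)   # "a", "b", "c"

# Cross-tabulate: rows are tweeting users, columns are mentioned users
counts <- xtabs(~ username + mentioned_users, data = edges)

# Square authors-by-authors matrix; pairs that never occur stay at zero
adj <- matrix(0, nrow = length(authors), ncol = length(authors),
              dimnames = list(authors, authors))

# Copy over only the columns for mentioned users who are also authors
shared <- intersect(authors, colnames(counts))
adj[rownames(counts), shared] <- counts[, shared]
print(adj)
```

Here `x` is mentioned but never tweets, so it is dropped, and `c` tweets but is never mentioned, so it gets an all-zero column. The result is square, which is what an adjacency matrix needs to be.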
graph <- graph_from_adjacency_matrix(graph_data,
mode = "directed",
weighted = T,
add.colnames = T,
add.rownames = T)
## Warning in graph_from_adjacency_matrix(graph_data, mode = "directed",
## weighted = T, : Same attribute for columns and rows, row names are ignored
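Conceptually, a weighted directed graph built from an adjacency matrix has one edge per nonzero cell, with the cell value as the edge weight. The base R sketch below spells that out on an invented 3x3 matrix; it illustrates the idea and is not a description of how igraph implements it internally.

```r
# Hypothetical 3x3 adjacency matrix of mention counts, invented for illustration
adj <- matrix(c(0, 2, 0,
                0, 0, 0,
                1, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Each nonzero cell [i, j] is a directed edge i -> j with weight adj[i, j]
idx <- which(adj > 0, arr.ind = TRUE)
edge_list <- data.frame(
  from   = rownames(adj)[idx[, "row"]],
  to     = colnames(adj)[idx[, "col"]],
  weight = adj[idx],
  stringsAsFactors = FALSE
)
print(edge_list)
```

Note that the weight lives on the edges, not on the nodes. User `b` has no outgoing edges at all here, yet still exists as a node in the graph.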
graph %>%
ggplot(aes(x = x, y = y, xend = xend, yend = yend, label = `TRUE.`)) +
geom_nodes(aes(size = weight)) +
geom_edges() +
theme_blank()
## Loading required package: sna
## Warning: package 'sna' was built under R version 3.3.1
## sna: Tools for Social Network Analysis
## Version 2.3-2 created on 2014-01-13.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## For citation information, type citation("sna").
## Type help(package="sna") to get started.
##
## Attaching package: 'sna'
## The following objects are masked from 'package:igraph':
##
## %c%, betweenness, bonpow, closeness, components, degree,
## dyad.census, evcent, hierarchy, is.connected, neighborhood,
## triad.census
## Loading required package: network
## Warning: package 'network' was built under R version 3.3.1
## network: Classes for Relational Data
## Version 1.13.0 created on 2015-08-31.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## Mark S. Handcock, University of California -- Los Angeles
## David R. Hunter, Penn State University
## Martina Morris, University of Washington
## Skye Bender-deMoll, University of Washington
## For citation information, type citation("network").
## Type help("network-package") to get started.
##
## Attaching package: 'network'
## The following object is masked from 'package:sna':
##
## %c%
## The following objects are masked from 'package:igraph':
##
## %c%, %s%, add.edges, add.vertices, delete.edges,
## delete.vertices, get.edge.attribute, get.edges,
## get.vertex.attribute, is.bipartite, is.directed,
## list.edge.attributes, list.vertex.attributes,
## set.edge.attribute, set.vertex.attribute
## Warning: Removed 100 rows containing missing values (geom_point).