I’ve been inspired by how HuwFulcher analyzed some social network data in Python to replicate it in R. The fun thing about this is that it’s much, much easier than doing real analysis :) Since it’s late in the evening and I’m going to bed soon…

I encourage everyone to read the full analysis before proceeding; the author has a lot of excellent exposition about the dataset. I won’t rehash much of that here, except for some initial background.

Background

Fifth Tribe, a digital agency, scraped over 17,000 tweets from pro-ISIS accounts, covering the period from the November 2015 Paris attacks up to some more recent date. They produced a nice infographic and article, and it’s good of them to publish the full dataset. Good move, and good publicity.

The analysis that HuwFulcher performed was to create a graph of how ISIS members interact.

Data

They have provided one file (tweets.csv) for us to analyze. First, let’s load some libraries.

library(needs)

## 
## Load `package:needs` in an interactive session to set auto-load flag

needs(readr, dplyr, tidyr, ggplot2, stringr,
      igraph, intergraph, ggrepel,
      ggnetwork,
      pander, formattable)

## installing packages:
## intergraph
## ggrepel
## ggnetwork
## pander

## also installing the dependencies 'network', 'sna'

## package 'network' successfully unpacked and MD5 sums checked
## package 'sna' successfully unpacked and MD5 sums checked
## package 'intergraph' successfully unpacked and MD5 sums checked
## package 'ggrepel' successfully unpacked and MD5 sums checked
## package 'ggnetwork' successfully unpacked and MD5 sums checked
## package 'pander' successfully unpacked and MD5 sums checked

Now let’s read the data in and look at the columns.

data <- read_csv("../input/tweets.csv")
glimpse(data)

## Observations: 17,410
## Variables: 8
## $ name           (chr) "GunsandCoffee", "GunsandCoffee", "GunsandCoffe...
## $ username       (chr) "GunsandCoffee70", "GunsandCoffee70", "GunsandC...
## $ description    (chr) "ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews",...
## $ location       (chr) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ followers      (int) 640, 640, 640, 640, 640, 640, 640, 640, 640, 64...
## $ numberstatuses (int) 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49, 49,...
## $ time           (chr) "1/6/2015 21:07", "1/6/2015 21:27", "1/6/2015 2...
## $ tweets         (chr) "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFU...

data %>%
    head %>%
    formattable

name	username	description	location	followers	numberstatuses	time	tweets
GunsandCoffee	GunsandCoffee70	ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews	NA	640	49	1/6/2015 21:07	ENGLISH TRANSLATION: ‘A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ABU MUHAMMED AL MAQDISI: http://t.co/73xFszsjvr http://t.co/x8BZcscXzq
GunsandCoffee	GunsandCoffee70	ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews	NA	640	49	1/6/2015 21:27	ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI ’FOR THE PEOPLE OF INTEGRITY, SACRIFICE IS EASY’ http://t.co/uqqzXGgVTz http://t.co/A7nbjwyHBr
GunsandCoffee	GunsandCoffee70	ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews	NA	640	49	1/6/2015 21:29	ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWLANI (HA): http://t.co/TgXT1GdGw7 http://t.co/ZuE8eisze6
GunsandCoffee	GunsandCoffee70	ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews	NA	640	49	1/6/2015 21:37	ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP: ‘THE PROMISE OF VICTORY’: http://t.co/3qg5dKlIwr http://t.co/7bqk1wJAzC
GunsandCoffee	GunsandCoffee70	ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews	NA	640	49	1/6/2015 21:45	ENGLISH TRANSLATION: AQAP: ‘RESPONSE TO SHEIKH BAGHDADIS STATEMENT ’ALTHOUGH THE DISBELIEVERS DISLIKE IT.’ http://t.co/2EYm9EymTe
GunsandCoffee	GunsandCoffee70	ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews	NA	640	49	1/6/2015 21:51	THE SECOND CLIP IN A DA’WAH SERIES BY A SOLDIER OF JN: Video Link :http://t.co/EPaPRlph5W http://t.co/4VUYszairt

Data Scientist Vincent Su (Violinbeats) has a nice explanation of what each of the columns means; I won’t go into that further.

Overall Counts

First we’ll replicate the counts - there are a variety of ways, but let’s use dplyr because it prints more nicely with formattable.

data %>%
    summarise(`Unique Tweets` = n_distinct(tweets),
              `Total Tweets` = n()) %>%
    mutate_each(funs(comma)) %>%
    formattable(align = 'l')

Unique Tweets	Total Tweets
17,410.00	17,410.00

So the two counts match: every tweet in the dataset is unique.
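
As a quick illustration of what the two counts measure, here is a minimal base R sketch on made-up tweets (the `toy_tweets` vector is hypothetical and not taken from tweets.csv). When duplicates exist, the unique count falls below the total.

```r
# Made-up tweets (hypothetical); the first and third are identical
toy_tweets <- c("first tweet", "second tweet", "first tweet")

# Total number of tweets, duplicates included
total_tweets <- length(toy_tweets)

# Number of distinct tweet texts
unique_tweets <- length(unique(toy_tweets))

print(c(unique = unique_tweets, total = total_tweets))
```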

Retweets

Let’s examine retweets.

data %>%
    mutate(is_retweet = grepl("^\\bRT\\b", tweets),
           is_retweet = ifelse(is_retweet, "Retweets", "Actual Tweets")) -> data

data %>%
    ggplot(aes(is_retweet)) +
    geom_bar()

What’s interesting is the high percentage of retweets in the dataset: in the general Twitter population, retweets make up a far smaller share of tweets. So the filtering process used to collect these tweets probably relied, at least in part, on identifying these retweets.
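
To see what the retweet pattern does and does not catch, here is a small sketch on made-up tweets (the `toy_tweets` vector is hypothetical). It uses the same regular expression as above, and also computes the retweet share as a number rather than reading it off the bar chart.

```r
# Made-up example tweets (hypothetical, not from tweets.csv)
toy_tweets <- c("RT @someone: breaking news",
                "Just my own thoughts today",
                "RTs are not endorsements",
                "ART of the day: RT this")

# Same pattern as above: "RT" as a whole word at the very start of the tweet
is_retweet <- grepl("^\\bRT\\b", toy_tweets)

# Share of tweets flagged as retweets
retweet_share <- mean(is_retweet)

print(data.frame(tweet = toy_tweets, is_retweet = is_retweet))
print(retweet_share)
```

Only the first tweet is flagged: "RTs" fails the trailing word boundary, and an "RT" in the middle of a tweet is not anchored at the start.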

Who talks about whom?

So, now let’s look more at the social network. For that, we’ll:

Take each tweet
Extract the people involved (who are @ed)
Count them up
Add an indicator for whether the mentioned user is also an “author” of one of these 17,000 tweets.

I’ll also keep the retweet indicator from above, since we may want to exclude retweets to avoid double counting.

authors = unique(data$username)

data %>%
    mutate(mentioned_users = str_extract_all(tweets, '(?<=@)\\w+'),
           mentioned_users = ifelse(lengths(mentioned_users) == 0, c(""), mentioned_users)) %>%
    unnest(mentioned_users) %>%
    mutate(is_author = mentioned_users %in% authors) ->
    exploded_data
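
The same steps can be tried out on a few made-up tweets. The sketch below is a base R approximation of the pipeline above, not the author's code: the `toy` data frame and `toy_authors` are hypothetical, and `regmatches()`/`gregexpr()` stand in for `str_extract_all()`.

```r
# Made-up tweets and authors (hypothetical, not from tweets.csv)
toy <- data.frame(
  username = c("alice", "bob", "carol"),
  tweets   = c("hello @bob and @dave", "no mentions here", "@alice see this"),
  stringsAsFactors = FALSE
)
toy_authors <- unique(toy$username)

# Extract every handle that follows an "@"; the look-behind needs perl = TRUE
mentions <- regmatches(toy$tweets,
                       gregexpr("(?<=@)\\w+", toy$tweets, perl = TRUE))

# Keep tweets without mentions by giving them a single empty string
mentions[lengths(mentions) == 0] <- ""

# One row per (tweet author, mentioned user) pair
toy_exploded <- data.frame(
  username        = rep(toy$username, lengths(mentions)),
  mentioned_users = unlist(mentions),
  stringsAsFactors = FALSE
)

# Flag mentions of users who are themselves authors in the toy data
toy_exploded$is_author <- toy_exploded$mentioned_users %in% toy_authors

print(toy_exploded)
```

Here "dave" is mentioned but is not an author, and the tweet without mentions survives as a row with an empty `mentioned_users` value, which is why later steps filter on `mentioned_users != ""`.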

exploded_data %>% 
    filter(is_retweet == "Actual Tweets") %>% 
    mutate(is_author = ifelse(is_author, "In", "Not in")) %>%
    group_by(is_author) %>%
    summarise(users = n_distinct(mentioned_users)) %>%
    ggplot(aes(is_author, users)) +
    geom_bar(stat = "identity") +
    ggtitle("Users in vs. not in tweets.csv")

exploded_data %>% 
    mutate(is_author = ifelse(is_author, "Mentioned", "Total")) %>%
    group_by(is_author) %>%
    summarise(users = n_distinct(username)) %>%
    ggplot(aes(is_author, users)) +
    geom_bar(stat = "identity") +
    ggtitle("Mentioned vs. Total in tweets")

OK, so now we want to find the users who are mentioned the most.

Top 5

exploded_data %>%
    filter(is_retweet == "Actual Tweets", 
           mentioned_users != "",
           is_author) %>% 
    group_by(mentioned_users) %>%
    summarise(volume = n()) ->
    user_counts

user_counts %>%
    top_n(5, volume) ->
    top_5_mentioned

top_5_mentioned %>%
    arrange(desc(volume)) %>%
    formattable(list(volume = color_bar("orange")), align = 'l')

mentioned_users	volume
WarReporter1	131
RamiAlLolah	53
Uncle_SamCoco	48
Nidalgazaui	34
ismailmahsud	30

OK! Let’s take a peek at how these people describe themselves.

exploded_data %>%
    inner_join(top_5_mentioned, by = c('username' = 'mentioned_users')) %>%
    select(username, description) %>%
    distinct %>%
    filter(!is.na(description)) %>% 
    formattable(align = 'l')

username	description
ismailmahsud	Listen! No affiliations, Final year research on conflict studies.Don’t suspend Free Thinker| Time Travelling a dream| Against Injustice, Corporations| Fight CO2
RamiAlLolah	Real-Time News, Exclusives, Intelligence & Classified Information/Reports from the ME. Forecasted many Israeli strikes in Syria/Lebanon. Graphic content.
Uncle_SamCoco	Here to defend the American freedom and also the freedom of coconut . Cat Lover or Hater. Kebab Fan . We’re all living in America, America ist wunderbar #USA
WarReporter1	Reporting on conflicts in the MENA and Asia regions.
WarReporter1	Reporting on conflicts in the MENA and Asia regions. Not affiliated to any group or movement.
Nidalgazaui	17yr. old Freedom Activist /Correspondence of NGNA /Terror Expert/Middle East Expert. Daily News about Syria/Iraq/Yemen/Russia/Middle East

As HuwFulcher noted: all very unbiased, hey?

Network Graph

Now let’s make a network.

What we want is:

For each real tweet (not retweets)
For each author
The number of times an author mentions another author

To start off with, we’ll restrict it to just the authors. It’s a much smaller number.

Or, to put it in code form -

exploded_data %>%
    filter(is_retweet == "Actual Tweets",
           mentioned_users != "") %>%
    select(username, mentioned_users) %>%
    xtabs(~ username + mentioned_users, data = .) ->
    network_data

authors_mentioned <- intersect(authors, colnames(network_data))
unmentioned_authors <- setdiff(rownames(network_data), authors_mentioned)
missing_authors <- matrix(0, ncol = length(unmentioned_authors),
                          nrow = nrow(network_data), dimnames = list(rownames(network_data), unmentioned_authors))

graph_data <- network_data[, authors_mentioned]
graph_data <- cbind(graph_data, missing_authors)

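# Drop mentioned authors who never appear as a row (they never mention anyone)
remove = setdiff(colnames(graph_data), rownames(graph_data))

graph_data <- graph_data[, !(colnames(graph_data) %in% remove)]

# Put the columns in the same order as the rows, so that row i and
# column i of the adjacency matrix refer to the same user
graph_data <- graph_data[, rownames(graph_data)]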

dim(graph_data)

## [1] 100 100
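
For comparison, here is a different way to arrive at a square matrix, sketched on made-up data. It is not the approach used above: it tabulates both columns as factors sharing one set of levels, which yields matching row and column order without any padding. The `edges` data frame and `toy_authors` are hypothetical.

```r
# Made-up mention pairs (hypothetical): who mentions whom
edges <- data.frame(
  username        = c("carol", "alice", "alice", "bob", "carol"),
  mentioned_users = c("alice", "bob",   "bob",   "zed", "bob"),
  stringsAsFactors = FALSE
)
toy_authors <- c("carol", "alice", "bob")

# Keep only mentions of other authors ("zed" is not an author)
edges <- edges[edges$mentioned_users %in% toy_authors, ]

# Using the same factor levels for rows and columns guarantees a square
# matrix whose i-th row and i-th column describe the same user
adjacency <- table(
  username        = factor(edges$username,        levels = toy_authors),
  mentioned_users = factor(edges$mentioned_users, levels = toy_authors)
)
adjacency <- unclass(adjacency)  # plain integer matrix with dimnames

print(adjacency)
```

Authors who never mention anyone (here "bob") still get an all-zero row, and authors who are never mentioned get an all-zero column.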

graph <- graph_from_adjacency_matrix(graph_data, 
                                     mode = "directed", 
                                     weighted = T, 
                                     add.colnames = T, 
                                     add.rownames = T)

## Warning in graph_from_adjacency_matrix(graph_data, mode = "directed",
## weighted = T, : Same attribute for columns and rows, row names are ignored

graph %>%
    ggplot(aes(x = x, y = y, xend = xend, yend = yend, label = `TRUE.`)) +
    geom_nodes(aes(size = weight)) +
    geom_edges() +
    theme_blank()

## Loading required package: sna

## Warning: package 'sna' was built under R version 3.3.1

## sna: Tools for Social Network Analysis
## Version 2.3-2 created on 2014-01-13.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##  For citation information, type citation("sna").
##  Type help(package="sna") to get started.

## 
## Attaching package: 'sna'

## The following objects are masked from 'package:igraph':
## 
##     %c%, betweenness, bonpow, closeness, components, degree,
##     dyad.census, evcent, hierarchy, is.connected, neighborhood,
##     triad.census

## Loading required package: network

## Warning: package 'network' was built under R version 3.3.1

## network: Classes for Relational Data
## Version 1.13.0 created on 2015-08-31.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##                     Mark S. Handcock, University of California -- Los Angeles
##                     David R. Hunter, Penn State University
##                     Martina Morris, University of Washington
##                     Skye Bender-deMoll, University of Washington
##  For citation information, type citation("network").
##  Type help("network-package") to get started.

## 
## Attaching package: 'network'

## The following object is masked from 'package:sna':
## 
##     %c%

## The following objects are masked from 'package:igraph':
## 
##     %c%, %s%, add.edges, add.vertices, delete.edges,
##     delete.vertices, get.edge.attribute, get.edges,
##     get.vertex.attribute, is.bipartite, is.directed,
##     list.edge.attributes, list.vertex.attributes,
##     set.edge.attribute, set.vertex.attribute

## Warning: Removed 100 rows containing missing values (geom_point).
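
The last warning suggests that all 100 nodes were dropped from the plot. A likely reason, assuming `weight` only exists as an edge attribute (which is where igraph stores the values of a weighted adjacency matrix), is that it is missing for every node row. One possible node-level quantity to size by instead is the number of mentions each user receives. The sketch below computes it in base R on a made-up matrix; `toy_adjacency` and the user names are hypothetical.

```r
# Made-up weighted adjacency matrix (hypothetical): rows mention columns
users <- c("alice", "bob", "carol")
toy_adjacency <- matrix(c(0, 2, 0,
                          0, 0, 0,
                          1, 1, 0),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(users, users))

# Mentions received by each user (weighted in-degree)
mentions_received <- colSums(toy_adjacency)

# Mentions made by each user (weighted out-degree)
mentions_made <- rowSums(toy_adjacency)

print(mentions_received)
print(mentions_made)
```

A vector like `mentions_received` could then be attached to the graph as a vertex attribute and mapped to node size.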

Social Cluster Analysis in R

Michael Griffiths

May 19, 2016