1. Introduction

This document sets out the process to identify the patients for the post-Brexit cyberhate experiment.

The first step is to scrape users from the archive account. The archive account was created to make it easier to build an initial list of manually identified hateful Twitter users from Britain (patients). While scrolling Twitter, it is relatively easy to spot hateful tweets and their authors. However, storing the usernames or user IDs of potential patients is impractical because of the laborious (and boring) copy-pasting involved. It is also hard to keep the patient list synchronised across multiple researchers and devices. We therefore set up an archive account to simplify building the initial patient list. The process was as follows:

1. Log in to Twitter with the archive account.
2. Search for potentially sensitive keywords (Muslims, Islam, etc.).
3. Identify a hateful tweet.
4. Check the author's profile for clues that the user is from the UK (such as the location field, the description, or a picture of the British flag).
   - If there are such clues, continue below.
   - If there are none, go back to step 3.
5. Like the hateful tweet (as a bookmark recording why the user was included) and follow the user.
6. Repeat as many times as necessary.
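
The profile check in the process above was done by hand. As a rough illustration only, a first-pass screen of the free-text location field could be sketched as follows. The helper `looks_british()` and its keyword list are hypothetical and not part of the original workflow, and clues such as a flag in the profile picture would still need manual inspection.

```r
# Hypothetical helper: flag profiles whose free-text location hints at the UK.
# The keyword list is an illustrative assumption, not the authors' criteria.
uk_pattern <- "\\b(uk|united kingdom|britain|england|scotland|wales|northern ireland|london|yorkshire)\\b"

looks_british <- function(location) {
  grepl(uk_pattern, location, ignore.case = TRUE, perl = TRUE)
}

# Made-up location strings for demonstration
locations <- c("United Kingdom", "wales", "Norway", "", "London, England")
uk_flags <- data.frame(location = locations, uk_clue = looks_british(locations))
print(uk_flags)
```

A screen like this can only prioritise profiles for review; an empty or joke location says nothing either way.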

2. Identifying the larger patient network

Using the initial patient list as a starting point, we identified the larger patient network.

The first step was to import the initial list into R, which was done by scraping the users followed by the archive account.

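# Load packages
library(rtweet)
library(tidyverse)

# Screen name of the archive account
archive_user <- 'Archive_Data_E' %>% 
  as_screenname()

# Accounts followed by the archive account form the initial patient list
initial_patients_list <- rtweet::get_friends(users = archive_user[1], retryonratelimit = TRUE) %>% 
  pull(user_id) %>%   # plain vector of user IDs for lookup_users()
  lookup_users()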

Here is a quick peek at the potential patient list without exposing their identities.

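# Keep only non-identifying columns, selected by name rather than position
initial_patients_list %>% 
  select(location, followers_count, friends_count, statuses_count,
         favourites_count, account_created_at, verified, profile_url) %>% 
  print(n=nrow(.))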

## # A tibble: 36 x 8
##    location  followers_count friends_count statuses_count favourites_count
##    <chr>               <int>         <int>          <int>            <int>
##  1 United K…            6631          6429          95129            40434
##  2 Yorkshir…           26916           133           2739             2287
##  3 Manchest…             748           595           5241             8512
##  4 "Stockpo…             102           384           2124             2414
##  5 ""                   1369          2283          14103             7218
##  6 North Ea…            9730          2067          10734            18492
##  7 London, …             282           318           1559             3358
##  8 wales                 186           330           8642               52
##  9 Burnie -…              96           323           2769             1649
## 10 Portsmou…             206           762            432             1157
## 11 Edinburgh            1147           805          27492            21016
## 12 "Luton E…          393844         12520          99615            58974
## 13 "Scotlan…            4027          4255          51880            77249
## 14 United K…            2498          2385           7912             5019
## 15 ""                  17986           976           5031             5104
## 16 England,…           37109         37884          27798            17071
## 17 Corrupt …            6465          2041          20771            53786
## 18 ⬇️  My Y…           14412           703           3517            53351
## 19 New York…            2832          2673          23439             5294
## 20 Scotland             3308           477          22293            28359
## 21 Greenwic…            1209          1150           2685             3654
## 22 ""                    617           127           7536            72070
## 23 UK                   4939          4629           2061             2335
## 24 uk                    527           414          13135             7278
## 25 UK                    711           685           4648              728
## 26 Norway              88791           353          10096              924
## 27 France .…            7672          7288         166627            47494
## 28 In a Gul…              81           394           1578              127
## 29 Yorkshire             862           304            962            41109
## 30 Yorkshir…            2210          2728          45629            17187
## 31 Anywhere…            1935          1656          33507            54947
## 32 ""                   2626          1915          41142            16281
## 33 Milton K…              17            29            327               10
## 34 ""                   1294          1017          35074            19203
## 35 Milton K…              56            29           3075               42
## 36 Newport,…             140           251           2624             1889
## # ... with 3 more variables: account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>

We have 36 distinct patients with varying follower and following counts. The total number of Twitter users following the initial patient list is 643581, and the total number of users followed by the initial patient list is 101312. Note that these totals do not account for duplicates, so the number of unique users is bound to be lower.
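
As a brief illustration of the point about duplicates: the follower total is simply the sum of the `followers_count` column, whereas the number of unique followers can only be known once the follower IDs themselves are collected. In the sketch below, the counts are copied from the printed table, while the edge list and all IDs in it are made up for demonstration.

```r
library(dplyr)

# followers_count values copied from the printed table above (36 patients)
followers_count <- c(6631, 26916, 748, 102, 1369, 9730, 282, 186, 96, 206,
                     1147, 393844, 4027, 2498, 17986, 37109, 6465, 14412,
                     2832, 3308, 1209, 617, 4939, 527, 711, 88791, 7672, 81,
                     862, 2210, 1935, 2626, 17, 1294, 56, 140)
total_followers <- sum(followers_count)  # the total quoted in the text

# Toy edge list (made-up IDs): each row means follower_id follows patient_id
toy_edges <- data.frame(
  patient_id  = c("p1", "p1", "p1", "p2", "p2", "p3"),
  follower_id = c("a",  "b",  "c",  "b",  "c",  "c")
)

overlap_summary <- toy_edges %>%
  summarise(
    summed_counts    = n(),                     # analogous to summing followers_count
    unique_followers = n_distinct(follower_id)  # what is left after de-duplication
  )
print(overlap_summary)
```

In the toy data, six follow relationships collapse to three unique followers because `b` and `c` follow more than one patient.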

Let's take a look at the histograms.

options(scipen=10000)

# Histogram of follower counts
p1 <- initial_patients_list %>% 
  ggplot( aes(followers_count))+
  geom_histogram(bins = 100)+
  theme_bw()+
  labs(title="Histogram of Follower Counts")

# Histogram of friend counts
p2 <- initial_patients_list %>% 
  ggplot( aes(friends_count))+
  geom_histogram(bins = 100)+
  theme_bw()+
  labs(title="Histogram of Friend Counts")

gridExtra::grid.arrange(p1, p2, nrow=2)

The histograms of follower and friend counts indicate that there are some outliers in the initial list, such as the patient with nearly 400000 followers (you know who) and the patient following more than 35000 users.

Before moving on to scraping the followers and friends of the initial list of patients, let's take a look at the scatterplot of follower and friend counts.

p3 <- initial_patients_list %>% 
  ggplot( aes(x=followers_count, y=friends_count))+
  geom_point()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  # xlim(0,30000)+
  # ylim(0,10000)+
  labs(title="Outliers shown")

p4 <- initial_patients_list %>% 
  ggplot( aes(x=followers_count, y=friends_count))+
  geom_point()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlim(0,30000)+
  ylim(0,7500)+
  labs(title="Outliers hidden")

gridExtra::grid.arrange(p3, p4, nrow=1, 
                        top= grid::textGrob("Scatterplots of Follower vs Friend counts", gp=grid::gpar(fontsize=18)),
                        widths= c(0.55,0.45))

## Warning: Removed 3 rows containing missing values (geom_point).
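
The warning arises because `xlim()` and `ylim()` treat points outside the limits as missing and drop them before plotting. The sketch below counts the points that fall outside the limits used for the "Outliers hidden" panel, using a subset of rows copied from the printed table, and shows `coord_cartesian()` as one alternative that zooms in without discarding data. The object names are illustrative.

```r
library(ggplot2)

# A subset of (followers_count, friends_count) pairs from the printed table
counts <- data.frame(
  followers_count = c(6631, 26916, 748, 393844, 37109, 88791, 7672),
  friends_count   = c(6429, 133, 595, 12520, 37884, 353, 7288)
)

x_max <- 30000
y_max <- 7500

# Points beyond either limit are the ones xlim()/ylim() would remove
n_dropped <- sum(counts$followers_count > x_max | counts$friends_count > y_max)
print(n_dropped)

# coord_cartesian() restricts the visible window only, so no rows are dropped
p_zoom <- ggplot(counts, aes(x = followers_count, y = friends_count)) +
  geom_point() +
  theme_bw() +
  coord_cartesian(xlim = c(0, x_max), ylim = c(0, y_max))
```

For a scatterplot the visual result is the same either way; the difference matters mainly when statistics such as smoothers are computed from the data.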

Now, let’s scrape followers and friends of our initial patient list.

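# Re-import the initial patient list (same call as above)
initial_patients_list <- rtweet::get_friends(users = archive_user[1], retryonratelimit = TRUE) %>% 
  pull(user_id) %>% 
  lookup_users()
?get_friends  # help page: paging and rate-limit arguments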
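
The scraping step itself is not shown here. A minimal sketch of one way to do it, assuming the followers of each patient are fetched one account at a time and stacked into a single edge list, might look like this. `collect_followers()` and `fake_fetch()` are hypothetical helpers, not part of the original analysis, and the stub stands in for the API so that the example runs offline.

```r
library(dplyr)

# Hypothetical helper: fetch the followers of each patient and stack them into
# one edge list. `fetch` defaults to rtweet::get_followers but can be replaced
# by a stub; extra arguments in `...` are passed on to `fetch`.
collect_followers <- function(patient_ids, fetch = rtweet::get_followers, ...) {
  lapply(patient_ids, function(id) {
    followers <- fetch(id, ...)
    mutate(followers, patient_id = id)  # record whose followers these are
  }) %>%
    bind_rows()
}

# Offline stub standing in for the API; returns made-up follower IDs
fake_fetch <- function(user, ...) {
  data.frame(user_id = paste0(user, "_follower_", 1:2), stringsAsFactors = FALSE)
}

follower_edges <- collect_followers(c("111", "222"), fetch = fake_fetch)
print(follower_edges)
```

With valid API credentials, a call such as `collect_followers(initial_patients_list$user_id, n = 75000, retryonratelimit = TRUE)` would be the real counterpart, and passing `fetch = rtweet::get_friends` should follow the same pattern for friends. The column names returned depend on the rtweet version, so the `user_id` column in the stub is an assumption.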

Identifying Patients

Sefa Ozalp

23/01/2018
