Scraping data from offensive accounts on Twitter

1. Introduction

In this document, I’ll walk you through scraping data from users who have published threatening and/or abusive content while mentioning MP Anna Soubry on Twitter.

The threatening messages were shared with us as screenshots. Using these screenshots, I have manually created a list of abusive users. The list contains both active and suspended accounts (see below for more info).

For the accounts that are still active, I will scrape their timelines (i.e. up to the 3,200 most recent tweets published by each user) by querying Twitter's API. I will also scrape each user's metadata and join it with their tweets.

It is not possible to do the same for the suspended accounts, as their tweets are automatically removed by Twitter on suspension. Instead of scraping tweets published by suspended accounts, I will search their user handles with Twitter Advanced Search and scrape the tweets that mention them. For each account, I will scrape mentions from the two months preceding the offending tweet shared with us. This is a rather indirect method, but at this point it is the only feasible way to collect data on suspended users.

2. Loading Abusive Users

Loading necessary libraries.

library(tidyverse)
library(rtweet)

Loading the offender list created manually from screenshots of abusive and threatening messages directed at MP Anna Soubry on Twitter.

offender_list <- read_csv("/Users/sefaozalp/Documents/Work/Anna_Soubry/offensive_accounts_list.csv")
## Parsed with column specification:
## cols(
##   user_handle = col_character(),
##   Twitter_link = col_character(),
##   account_is_active = col_integer(),
##   offensive_tweet_date = col_character()
## )
offender_list %>% print(n = Inf)
## # A tibble: 38 x 4
##    user_handle      Twitter_link        account_is_active offensive_tweet…
##    <chr>            <chr>                           <int> <chr>           
##  1 @Mos__Maiorum    suspended                           0 16-Sep-17       
##  2 @IrateBrit       suspended                           0 06-Oct-17       
##  3 @TudorRashoff    suspended                           0 04-Nov-17       
##  4 @frottroilism    suspended                           0 10-Nov-17       
##  5 @edge1959        suspended                           0 14-Nov-17       
##  6 @simonfield68    suspended                           0 14-Dec-17       
##  7 @TerryNOTA60     suspended                           0 17-Dec-17       
##  8 @Km21M           Suspended                           0 18-Dec-17       
##  9 @caraamora       https://twitter.co…                 1 05-Oct-17       
## 10 @ismisnt         https://twitter.co…                 1 08-Nov-17       
## 11 @nickyscourgeon  https://twitter.co…                 1 14-Nov-17       
## 12 @AnkersDave      https://twitter.co…                 1 14-Nov-17       
## 13 @sandy_gujral    https://twitter.co…                 1 15-Nov-17       
## 14 @CugerBrant      https://twitter.co…                 1 15-Nov-17       
## 15 @sailerboy77     https://twitter.co…                 1 15-Nov-17       
## 16 @Brown97M        https://twitter.co…                 1 16-Nov-17       
## 17 @ReederSimon     https://twitter.co…                 1 16-Nov-17       
## 18 @NO10000CONFUSED https://twitter.co…                 1 14-Dec-17       
## 19 @F1CWT           https://twitter.co…                 1 14-Dec-17       
## 20 @sweettouth75    https://twitter.co…                 1 14-Dec-17       
## 21 @PetersladWY     https://twitter.co…                 1 14-Dec-17       
## 22 @OstendGudgeon   https://twitter.co…                 1 15-Dec-17       
## 23 @Yorky37852200   https://twitter.co…                 1 15-Dec-17       
## 24 @AmpersUK        https://twitter.co…                 1 16-Dec-17       
## 25 @brinkley_roy    https://twitter.co…                 1 16-Dec-17       
## 26 @plott22         https://twitter.co…                 1 16-Dec-17       
## 27 @IAMPaulHamilton https://twitter.co…                 1 16-Dec-17       
## 28 @07539_677127    https://twitter.co…                 1 17-Dec-17       
## 29 @martinpaulam    https://twitter.co…                 1 18-Dec-17       
## 30 @McDeLLaware     https://twitter.co…                 1 18-Dec-17       
## 31 @Rosslyncakessnk https://twitter.co…                 1 30-Dec-17       
## 32 @edwardburgess1  https://twitter.co…                 1 15-Jan-18       
## 33 @trevlac1980     https://twitter.co…                 1 15-Jan-18       
## 34 @Macaw121        https://twitter.co…                 1 16-Jan-18       
## 35 @EU_Be_Gone      https://twitter.co…                 1 16-Jan-18       
## 36 @Journo_list     https://twitter.co…                 1 17-Jan-18       
## 37 @euroedm         https://twitter.co…                 1 21-Jan-18       
## 38 @will_uncensored https://twitter.co…                 1 06-Feb-18

We have identified 38 abusers of Anna Soubry on Twitter. Of these, 8 have been suspended since posting the abusive messages, while the remaining 30 are still active.
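These figures can be double-checked directly from the offender list (a quick tally on the account_is_active flag):

# tally suspended (0) vs still active (1) accounts
offender_list %>% 
  count(account_is_active)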

3. Active Abusive Accounts

First, let's scrape the last 3,200 tweets from each of the 30 accounts that are still active as of today.
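Pulling up to 3,200 tweets for 30 accounts can bump into the API's rate limits, so it may be worth checking the remaining quota first (an optional check using rtweet's rate_limit()):

# optional: inspect the current rate limits before requesting 30 timelines
rtweet::rate_limit()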

active_offenders <- offender_list %>% 
  filter(account_is_active==1) %>% 
  select(user_handle) 

active_offenders_timelines <- rtweet::get_timelines(user = active_offenders$user_handle, n=3200)

active_offenders_timelines
## # A tibble: 76,543 x 42
##    status_id  created_at          user_id screen_name text          source
##  * <chr>      <dttm>              <chr>   <chr>       <chr>         <chr> 
##  1 968799895… 2018-02-28 10:47:49 218727… caraamora   @laurenamber… Twitt…
##  2 968579997… 2018-02-27 20:14:01 218727… caraamora   @Bartolo7230… Twitt…
##  3 968579855… 2018-02-27 20:13:27 218727… caraamora   @TRobinsonNe… Twitt…
##  4 968542980… 2018-02-27 17:46:55 218727… caraamora   @OliverMcGee… Twitt…
##  5 968040129… 2018-02-26 08:28:46 218727… caraamora   @BABYJ78 @gu… Twitt…
##  6 968039688… 2018-02-26 08:27:01 218727… caraamora   @BABYJ78 @gu… Twitt…
##  7 968039126… 2018-02-26 08:24:47 218727… caraamora   @guskenworth… Twitt…
##  8 967805247… 2018-02-25 16:55:26 218727… caraamora   @michaelgove… Twitt…
##  9 967747683… 2018-02-25 13:06:42 218727… caraamora   @DavidTurner… Twitt…
## 10 967456714… 2018-02-24 17:50:29 218727… caraamora   @VauxhallLab… Twitt…
## # ... with 76,533 more rows, and 36 more variables:
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, hashtags <list>,
## #   symbols <list>, urls_url <list>, urls_t.co <list>,
## #   urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## #   media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## #   ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <lgl>, mentions_user_id <list>,
## #   mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## #   quoted_text <chr>, retweet_status_id <chr>, retweet_text <chr>,
## #   place_url <chr>, place_name <chr>, place_full_name <chr>,
## #   place_type <chr>, country <chr>, country_code <chr>,
## #   geo_coords <list>, coords_coords <list>, bbox_coords <list>

We have grabbed 76,543 tweets from the 30 abusers who are still active. Apparently, some of these users have posted fewer than 3,200 tweets to date. This is normal, given that some users tweet less frequently than others and some accounts were created relatively recently. Let's check the tweet count for each account.

active_offenders_timelines %>% 
  group_by(screen_name) %>% 
  summarise(n()) %>% 
  arrange(`n()`) %>% 
  print(n = Inf)
## # A tibble: 30 x 2
##    screen_name     `n()`
##    <chr>           <int>
##  1 sailerboy77       421
##  2 sweettouth75      436
##  3 caraamora         657
##  4 Macaw121          788
##  5 PetersladWY       869
##  6 trevlac1980      1469
##  7 brinkley_roy     1655
##  8 Rosslyncakessnk  1881
##  9 Yorky37852200    2438
## 10 EU_Be_Gone       2553
## 11 euroedm          2932
## 12 will_uncensored  2946
## 13 Brown97M         3131
## 14 07539_677127     3150
## 15 ismisnt          3155
## 16 plott22          3164
## 17 AmpersUK         3169
## 18 Journo_list      3189
## 19 AnkersDave       3190
## 20 sandy_gujral     3198
## 21 CugerBrant       3200
## 22 edwardburgess1   3205
## 23 OstendGudgeon    3209
## 24 F1CWT            3212
## 25 IAMPaulHamilton  3214
## 26 nickyscourgeon   3214
## 27 NO10000CONFUSED  3215
## 28 McDeLLaware      3222
## 29 martinpaulam     3223
## 30 ReederSimon      3238
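
To confirm that the shortfall reflects low-volume or newer accounts rather than a collection problem, the scraped counts can be compared against each account's lifetime tweet count (a quick sketch using the statuses_count and account_created_at fields returned by rtweet):

# compare the number of tweets scraped with each account's lifetime tweet count
users_data(active_offenders_timelines) %>% 
  distinct(user_id, .keep_all = TRUE) %>% 
  select(screen_name, statuses_count, account_created_at) %>% 
  left_join(count(active_offenders_timelines, screen_name), by = "screen_name") %>% 
  arrange(n)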

Now, let's grab the abusive users' metadata from the Twitter API and join it to their tweets.

# extract the user metadata embedded in the timelines object, keeping one row per user
user_data <- users_data(active_offenders_timelines) %>% 
  distinct(., user_id, .keep_all = T)


# join the user metadata back onto the tweets and tidy up the duplicated screen_name columns
joined <- left_join(active_offenders_timelines, user_data, by="user_id") %>% 
  mutate(screen_name=screen_name.x) %>% 
  select(screen_name, everything(), -c(screen_name.x, screen_name.y)) %>% 
  select(screen_name, text, everything())

Now we have the tweets published by abusive accounts, together with the user and tweet metadata, in a single tidy data frame. Let's take a quick peek.

sample_n(joined,10) %>% print(n = 5, width = Inf)
## # A tibble: 10 x 60
##   screen_name   
##   <chr>         
## 1 caraamora     
## 2 PetersladWY   
## 3 edwardburgess1
## 4 martinpaulam  
## 5 plott22       
##   text                                                                    
##   <chr>                                                                   
## 1 @realDonaldTrump How many people will you kill this time? Sad little man
## 2 @theotheradel @Ianhwatkins @LSLofficial @llatchfordevans @_ClaireRichar…
## 3 Woman conned out of £5k by 'sugar daddy' conman who claimed to earn £50…
## 4 Make Europe Great Again! Polish MEP warns Polexit could END the ‘sick’ …
## 5 @RupaHuq @OwenJones84 @AngelaRayner Vox populi                          
##   status_id          created_at          user_id           
##   <chr>              <dttm>              <chr>             
## 1 919267584020709376 2017-10-14 18:24:05 2187274937        
## 2 941038879829786625 2017-12-13 20:15:27 783075877722681344
## 3 958804943233847297 2018-01-31 20:51:26 302622742         
## 4 956058809184768000 2018-01-24 06:59:17 830033420327723008
## 5 876095215282663424 2017-06-17 15:12:31 131565923         
##   source              reply_to_status_id reply_to_user_id
##   <chr>               <chr>              <chr>           
## 1 Twitter for iPhone  919162619889704961 25073877        
## 2 Twitter for Android 941026194472034304 22546741        
## 3 Twitter Web Client  <NA>               <NA>            
## 4 Twitter Web Client  <NA>               <NA>            
## 5 Twitter Lite        876092375319359489 706747004       
##   reply_to_screen_name is_quote is_retweet favorite_count retweet_count
##   <chr>                <lgl>    <lgl>               <int>         <int>
## 1 realDonaldTrump      F        F                       0             0
## 2 theotheradel         F        F                       1             0
## 3 <NA>                 F        F                       0             0
## 4 <NA>                 F        F                       0             1
## 5 RupaHuq              F        F                       0             0
##   hashtags  symbols   urls_url  urls_t.co urls_expanded_url media_url
##   <list>    <list>    <list>    <list>    <list>            <list>   
## 1 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>         <chr [1]>
## 2 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>         <chr [1]>
## 3 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>         <chr [1]>
## 4 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>         <chr [1]>
## 5 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>         <chr [1]>
##   media_t.co media_expanded_url media_type ext_media_url ext_media_t.co
##   <list>     <list>             <list>     <list>        <list>        
## 1 <chr [1]>  <chr [1]>          <chr [1]>  <chr [1]>     <chr [1]>     
## 2 <chr [1]>  <chr [1]>          <chr [1]>  <chr [1]>     <chr [1]>     
## 3 <chr [1]>  <chr [1]>          <chr [1]>  <chr [1]>     <chr [1]>     
## 4 <chr [1]>  <chr [1]>          <chr [1]>  <chr [1]>     <chr [1]>     
## 5 <chr [1]>  <chr [1]>          <chr [1]>  <chr [1]>     <chr [1]>     
##   ext_media_expanded_url ext_media_type mentions_user_id
##   <list>                 <lgl>          <list>          
## 1 <chr [1]>              NA             <chr [1]>       
## 2 <chr [1]>              NA             <chr [6]>       
## 3 <chr [1]>              NA             <chr [1]>       
## 4 <chr [1]>              NA             <chr [1]>       
## 5 <chr [1]>              NA             <chr [3]>       
##   mentions_screen_name lang  quoted_status_id quoted_text
##   <list>               <chr> <chr>            <chr>      
## 1 <chr [1]>            en    <NA>             <NA>       
## 2 <chr [6]>            en    <NA>             <NA>       
## 3 <chr [1]>            en    <NA>             <NA>       
## 4 <chr [1]>            en    <NA>             <NA>       
## 5 <chr [3]>            pl    <NA>             <NA>       
##   retweet_status_id retweet_text place_url place_name place_full_name
##   <chr>             <chr>        <chr>     <chr>      <chr>          
## 1 <NA>              <NA>         <NA>      <NA>       <NA>           
## 2 <NA>              <NA>         <NA>      <NA>       <NA>           
## 3 <NA>              <NA>         <NA>      <NA>       <NA>           
## 4 <NA>              <NA>         <NA>      <NA>       <NA>           
## 5 <NA>              <NA>         <NA>      <NA>       <NA>           
##   place_type country country_code geo_coords coords_coords bbox_coords
##   <chr>      <chr>   <chr>        <list>     <list>        <list>     
## 1 <NA>       <NA>    <NA>         <dbl [2]>  <dbl [2]>     <dbl [8]>  
## 2 <NA>       <NA>    <NA>         <dbl [2]>  <dbl [2]>     <dbl [8]>  
## 3 <NA>       <NA>    <NA>         <dbl [2]>  <dbl [2]>     <dbl [8]>  
## 4 <NA>       <NA>    <NA>         <dbl [2]>  <dbl [2]>     <dbl [8]>  
## 5 <NA>       <NA>    <NA>         <dbl [2]>  <dbl [2]>     <dbl [8]>  
##   name            location  
##   <chr>           <chr>     
## 1 ste evans       ""        
## 2 Life's Dancer   Vegetarian
## 3 EJB - aka - EJ  wigan     
## 4 paul martin     ""        
## 5 Monkey face liz " Nirvana"
##   description                                                             
##   <chr>                                                                   
## 1 ""                                                                      
## 2 Here's the thing about life, You only get 1, so get out there and be AM…
## 3 supporter of 2013 fa cup winners                                        
## 4 hi im paul and i live in england                                        
## 5 "Tory hater,capitalism has failed.\nRepublican"                         
##   url   protected followers_count friends_count listed_count
##   <chr> <lgl>               <int>         <int>        <int>
## 1 <NA>  F                       6            43            0
## 2 <NA>  F                      34           200            0
## 3 <NA>  F                     132           148           14
## 4 <NA>  F                       4             0            0
## 5 <NA>  F                     105           123            4
##   statuses_count favourites_count account_created_at  verified profile_url
##            <int>            <int> <dttm>              <lgl>    <chr>      
## 1            657              562 2013-11-19 11:49:19 F        <NA>       
## 2            872              711 2016-10-03 22:46:53 F        <NA>       
## 3          15283             2598 2011-05-21 13:55:20 F        <NA>       
## 4           7117               86 2017-02-10 12:39:23 F        <NA>       
## 5           3754             4464 2010-04-10 17:14:53 F        <NA>       
##   profile_expanded_url account_lang
##   <chr>                <chr>       
## 1 <NA>                 en          
## 2 <NA>                 en          
## 3 <NA>                 en          
## 4 <NA>                 en-gb       
## 5 <NA>                 en          
##   profile_banner_url                                                 
##   <chr>                                                              
## 1 <NA>                                                               
## 2 https://pbs.twimg.com/profile_banners/783075877722681344/1512331941
## 3 https://pbs.twimg.com/profile_banners/302622742/1519392472         
## 4 <NA>                                                               
## 5 https://pbs.twimg.com/profile_banners/131565923/1517939910         
##   profile_background_url                          
##   <chr>                                           
## 1 http://abs.twimg.com/images/themes/theme1/bg.png
## 2 <NA>                                            
## 3 http://abs.twimg.com/images/themes/theme1/bg.png
## 4 http://abs.twimg.com/images/themes/theme1/bg.png
## 5 http://abs.twimg.com/images/themes/theme1/bg.png
##   profile_image_url                                                       
##   <chr>                                                                   
## 1 http://pbs.twimg.com/profile_images/764164769444163584/4OlkhKDf_normal.…
## 2 http://pbs.twimg.com/profile_images/934939396624797697/WhnkfXTo_normal.…
## 3 http://pbs.twimg.com/profile_images/959870146184704000/URo4ixYA_normal.…
## 4 http://abs.twimg.com/sticky/default_profile_images/default_profile_norm…
## 5 http://pbs.twimg.com/profile_images/929642940099715072/-x9O_ksc_normal.…
## # ... with 5 more rows

As a last step, I'll export this data frame in both CSV and JSON formats.

rtweet::save_as_csv(joined,
                    file_name = "/Users/sefaozalp/Documents/Work/Anna_Soubry/active-accounts/active_accounts_timelines.csv" ,
                    prepend_ids = F)

joined %>% jsonlite::toJSON( digits = NA) %>% 
  write(file = "/Users/sefaozalp/Documents/Work/Anna_Soubry/active-accounts/active_accounts_timelines.json")

4. Suspended Abusive Accounts

The next step is to scrape mentions of the suspended accounts from the two months preceding the offending tweet date.

To do that, we will need to use Twitter Advanced Search and scrape the tweets manually for each suspended account.
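
For reference, the search query and date window for each suspended account can be generated from the offender list. Something along these lines (a rough sketch using Twitter's since:/until: search operators and base R date parsing, approximating two months as 60 days; the resulting result pages were then saved manually):

# build an Advanced Search query covering the two months before each offending tweet
suspended_queries <- offender_list %>% 
  filter(account_is_active == 0) %>% 
  mutate(offence_date = as.Date(offensive_tweet_date, format = "%d-%b-%y"),
         window_start = offence_date - 60,
         search_query = paste0(user_handle, " since:", window_start, " until:", offence_date))

suspended_queries %>% select(user_handle, search_query)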

# Loading necessary packages
library(stringr)
library(reshape2)

I will not go into detail here, but the code below is fairly self-explanatory.

# List the html body files in the directory as a list (necessary for the map function)
suspended_accounts_html_body_files <- list.files("/Users/sefaozalp/Documents/Work/Anna_Soubry/suspended_accounts", full.names = T) %>% as.list()

# Define a function to extract tweet ids from the html body files, look them up via the Twitter API
# to scrape tweet and user metadata, and save the results as both .json and .csv files.

get_tweets <- function(x){
  txt_source <- readLines(x) %>% as.character() 
  # quick check: how many lines of the html body contain a tweet id
  str_detect(txt_source, "data-tweet-id") %>% table() 
  
  # pull the ids out of the data-tweet-id attributes
  tweet_ids <- str_match_all(string = txt_source, pattern = "data-tweet-id=\"(.*?)\"") %>% 
    melt() %>% 
    dcast(formula = value ~ Var2, drop = T) %>% 
    select("2") %>% 
    drop_na()
  
  # look up the full tweet objects for these ids
  timelines <- rtweet::lookup_statuses(tweet_ids)
  
  # keep one row per user; duplicates are expected as the same user can appear in several tweets
  users <- rtweet::users_data(timelines) %>% 
    distinct(., user_id, .keep_all = T)
  
  # join the user metadata onto the tweets and tidy the duplicated screen_name columns
  joined_get_tweets <- left_join(timelines, users, by = "user_id") %>% 
    mutate(screen_name = screen_name.x) %>% 
    select(screen_name, everything(), -c(screen_name.x, screen_name.y))
  
  joined_get_tweets
  
  # save next to the source file, replacing the .txt extension
  rtweet::save_as_csv(joined_get_tweets,
                      file_name = str_replace(string = x, pattern = ".txt", replacement = ".csv"),
                      prepend_ids = F)
  
  joined_get_tweets %>% jsonlite::toJSON(digits = NA) %>% 
    write(file = str_replace(string = x, pattern = ".txt", replacement = ".json"))
}

# Map the function defined above over all files in the directory (i.e. repeat the scraping for each suspended account)

map(suspended_accounts_html_body_files, get_tweets)
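
A small side note: since get_tweets is called purely for its side effects (writing the .csv and .json files), purrr's walk() would do the same job while keeping the console quiet:

# equivalent, but discards the return values instead of printing them
walk(suspended_accounts_html_body_files, get_tweets)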

As a result, we have 7 CSV files with different row counts (the 8th account did not return any mentions within the selected time frame). We can use them separately or join them into a single data frame.

I will bind them into a single CSV file just in case.

suspended_mentions_list <- list.files("/Users/sefaozalp/Documents/Work/Anna_Soubry/suspended_accounts/csv_files", full.names = T) %>% as.list()

map(suspended_mentions_list, read_csv, col_types = cols(`quoted_status_id` = "c")) %>% # careful here: before setting the col_types argument, map threw an error saying `quoted_status_id` cannot be converted from character to numeric
  bind_rows() %>% 
  write_csv("/Users/sefaozalp/Documents/Work/Anna_Soubry/suspended_accounts/suspended_accounts_tweets_binded.csv")

Worked like a charm!
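
As a final sanity check, the binded file can be read back to confirm the total number of scraped mentions (a quick sketch; the count should match the figure reported in the conclusion below):

# read the combined file back and count the scraped mentions
read_csv("/Users/sefaozalp/Documents/Work/Anna_Soubry/suspended_accounts/suspended_accounts_tweets_binded.csv") %>% 
  nrow()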

5. Conclusion

In this document, I have collected data on both active and suspended abusive accounts on Twitter. The end product is two CSV files. For the active accounts, I have scraped 76,543 tweets from 30 accounts. For the suspended accounts, I have scraped 2,300 mentions of 7 accounts. Only one suspended account yielded no data.

I will upload both CSV files to Google Drive and share the links via email.

The end!