In this document, I’ll walk you through scraping data from users who have published threatening and/or abusive content while mentioning MP Anna Soubry on Twitter.
The threatening messages were shared with us as screenshots. From these screenshots, I manually compiled a list of abusive users. The list contains both active and suspended accounts (see below for more info).
For the accounts that are still active, I will scrape their timelines (i.e. up to 3200 of each user’s most recent tweets) by querying Twitter’s API. I will also collect each user’s metadata and join it with the tweets.
It is not possible to do the same for the suspended accounts, as Twitter automatically removes their tweets on suspension. Instead, I will search for their user handles on Twitter Advanced Search and scrape the tweets that mention them. For each account, I will scrape tweets from the two months preceding the offending tweet shared with us. This is a rather indirect method, but at this point it’s the only data collection method available for suspended users.
Loading necessary libraries.
library(tidyverse)
library(rtweet)
Loading offender list created manually from screenshots of abusive and threatening messages directed at MP Anna Soubry on Twitter.
offender_list <- read_csv("/Users/sefaozalp/Documents/Work/Anna_Soubry/offensive_accounts_list.csv")
## Parsed with column specification:
## cols(
## user_handle = col_character(),
## Twitter_link = col_character(),
## account_is_active = col_integer(),
## offensive_tweet_date = col_character()
## )
offender_list %>% print(n = Inf)
## # A tibble: 38 x 4
## user_handle Twitter_link account_is_active offensive_tweet…
## <chr> <chr> <int> <chr>
## 1 @Mos__Maiorum suspended 0 16-Sep-17
## 2 @IrateBrit suspended 0 06-Oct-17
## 3 @TudorRashoff suspended 0 04-Nov-17
## 4 @frottroilism suspended 0 10-Nov-17
## 5 @edge1959 suspended 0 14-Nov-17
## 6 @simonfield68 suspended 0 14-Dec-17
## 7 @TerryNOTA60 suspended 0 17-Dec-17
## 8 @Km21M Suspended 0 18-Dec-17
## 9 @caraamora https://twitter.co… 1 05-Oct-17
## 10 @ismisnt https://twitter.co… 1 08-Nov-17
## 11 @nickyscourgeon https://twitter.co… 1 14-Nov-17
## 12 @AnkersDave https://twitter.co… 1 14-Nov-17
## 13 @sandy_gujral https://twitter.co… 1 15-Nov-17
## 14 @CugerBrant https://twitter.co… 1 15-Nov-17
## 15 @sailerboy77 https://twitter.co… 1 15-Nov-17
## 16 @Brown97M https://twitter.co… 1 16-Nov-17
## 17 @ReederSimon https://twitter.co… 1 16-Nov-17
## 18 @NO10000CONFUSED https://twitter.co… 1 14-Dec-17
## 19 @F1CWT https://twitter.co… 1 14-Dec-17
## 20 @sweettouth75 https://twitter.co… 1 14-Dec-17
## 21 @PetersladWY https://twitter.co… 1 14-Dec-17
## 22 @OstendGudgeon https://twitter.co… 1 15-Dec-17
## 23 @Yorky37852200 https://twitter.co… 1 15-Dec-17
## 24 @AmpersUK https://twitter.co… 1 16-Dec-17
## 25 @brinkley_roy https://twitter.co… 1 16-Dec-17
## 26 @plott22 https://twitter.co… 1 16-Dec-17
## 27 @IAMPaulHamilton https://twitter.co… 1 16-Dec-17
## 28 @07539_677127 https://twitter.co… 1 17-Dec-17
## 29 @martinpaulam https://twitter.co… 1 18-Dec-17
## 30 @McDeLLaware https://twitter.co… 1 18-Dec-17
## 31 @Rosslyncakessnk https://twitter.co… 1 30-Dec-17
## 32 @edwardburgess1 https://twitter.co… 1 15-Jan-18
## 33 @trevlac1980 https://twitter.co… 1 15-Jan-18
## 34 @Macaw121 https://twitter.co… 1 16-Jan-18
## 35 @EU_Be_Gone https://twitter.co… 1 16-Jan-18
## 36 @Journo_list https://twitter.co… 1 17-Jan-18
## 37 @euroedm https://twitter.co… 1 21-Jan-18
## 38 @will_uncensored https://twitter.co… 1 06-Feb-18
We have identified 38 abusers of Anna Soubry on Twitter. Of these, 8 have been suspended since they posted abusive messages; whereas 30 have not been suspended yet and are still active.
First, let’s scrape the last 3200 tweets from the 30 accounts that are still active as of today.
active_offenders <- offender_list %>%
filter(account_is_active==1) %>%
select(user_handle)
active_offenders_timelines <- rtweet::get_timelines(user = active_offenders$user_handle, n=3200)
active_offenders_timelines
## # A tibble: 76,543 x 42
## status_id created_at user_id screen_name text source
## * <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 968799895… 2018-02-28 10:47:49 218727… caraamora @laurenamber… Twitt…
## 2 968579997… 2018-02-27 20:14:01 218727… caraamora @Bartolo7230… Twitt…
## 3 968579855… 2018-02-27 20:13:27 218727… caraamora @TRobinsonNe… Twitt…
## 4 968542980… 2018-02-27 17:46:55 218727… caraamora @OliverMcGee… Twitt…
## 5 968040129… 2018-02-26 08:28:46 218727… caraamora @BABYJ78 @gu… Twitt…
## 6 968039688… 2018-02-26 08:27:01 218727… caraamora @BABYJ78 @gu… Twitt…
## 7 968039126… 2018-02-26 08:24:47 218727… caraamora @guskenworth… Twitt…
## 8 967805247… 2018-02-25 16:55:26 218727… caraamora @michaelgove… Twitt…
## 9 967747683… 2018-02-25 13:06:42 218727… caraamora @DavidTurner… Twitt…
## 10 967456714… 2018-02-24 17:50:29 218727… caraamora @VauxhallLab… Twitt…
## # ... with 76,533 more rows, and 36 more variables:
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, hashtags <list>,
## # symbols <list>, urls_url <list>, urls_t.co <list>,
## # urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## # media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## # ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <lgl>, mentions_user_id <list>,
## # mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## # quoted_text <chr>, retweet_status_id <chr>, retweet_text <chr>,
## # place_url <chr>, place_name <chr>, place_full_name <chr>,
## # place_type <chr>, country <chr>, country_code <chr>,
## # geo_coords <list>, coords_coords <list>, bbox_coords <list>
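As a rough sanity check on the call above, the number of API requests it implies can be worked out by hand. The sketch below assumes the limits of the Twitter REST API v1.1 timeline endpoint (200 tweets per request, 3200 tweets per user, 900 requests per 15-minute window under user authentication). None of these figures is stated in this document, so treat them as assumptions that may differ for other access levels.

```r
# Assumed limits of the v1.1 statuses/user_timeline endpoint (not from this document)
tweets_per_request  <- 200   # assumed maximum tweets returned by one request
max_tweets_per_user <- 3200  # depth requested with n = 3200
requests_per_window <- 900   # assumed user-auth limit per 15-minute window

n_accounts <- 30             # active accounts in the offender list

# Requests needed to page through one full timeline, then through all accounts
requests_per_user <- ceiling(max_tweets_per_user / tweets_per_request)
total_requests    <- n_accounts * requests_per_user

total_requests
# TRUE means the whole job fits in a single rate-limit window under these assumptions
total_requests <= requests_per_window
```

Under these assumptions the whole job fits into one rate-limit window, so no waiting between accounts should be needed.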
We have grabbed 76543 tweets from the 30 abusers who are still active. Apparently, some of these users have posted fewer than 3200 tweets to date. This is normal, given that some users tweet less frequently than others and some accounts might have been created rather recently. Let’s check the tweet counts of each account.
active_offenders_timelines %>%
group_by(screen_name) %>%
summarise(n()) %>%
arrange(`n()`) %>%
print(n = Inf)
## # A tibble: 30 x 2
## screen_name `n()`
## <chr> <int>
## 1 sailerboy77 421
## 2 sweettouth75 436
## 3 caraamora 657
## 4 Macaw121 788
## 5 PetersladWY 869
## 6 trevlac1980 1469
## 7 brinkley_roy 1655
## 8 Rosslyncakessnk 1881
## 9 Yorky37852200 2438
## 10 EU_Be_Gone 2553
## 11 euroedm 2932
## 12 will_uncensored 2946
## 13 Brown97M 3131
## 14 07539_677127 3150
## 15 ismisnt 3155
## 16 plott22 3164
## 17 AmpersUK 3169
## 18 Journo_list 3189
## 19 AnkersDave 3190
## 20 sandy_gujral 3198
## 21 CugerBrant 3200
## 22 edwardburgess1 3205
## 23 OstendGudgeon 3209
## 24 F1CWT 3212
## 25 IAMPaulHamilton 3214
## 26 nickyscourgeon 3214
## 27 NO10000CONFUSED 3215
## 28 McDeLLaware 3222
## 29 martinpaulam 3223
## 30 ReederSimon 3238
Now, let’s extract the metadata of the abusive users from the data returned by the Twitter API and add it to the tweets.
user_data <- users_data(active_offenders_timelines) %>%
distinct(., user_id, .keep_all = T)
joined <- left_join(active_offenders_timelines, user_data, by="user_id") %>%
mutate(screen_name=screen_name.x) %>%
select(screen_name, everything(), -c(screen_name.x, screen_name.y)) %>%
select(screen_name, text, everything())
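The `screen_name.x` and `screen_name.y` columns handled above appear because both tables contain a `screen_name` column that is not part of the join key. A minimal sketch with made-up data (the `tweets` and `users` tables and their values are hypothetical) shows the effect, along with one alternative: dropping the duplicate column from the user table before joining.

```r
library(dplyr)

# Hypothetical stand-ins for the tweet table and the user metadata table
tweets <- data.frame(user_id = c("1", "1", "2"),
                     screen_name = c("alice", "alice", "bob"),
                     text = c("a", "b", "c"),
                     stringsAsFactors = FALSE)
users <- data.frame(user_id = c("1", "2"),
                    screen_name = c("alice", "bob"),
                    followers_count = c(10L, 20L),
                    stringsAsFactors = FALSE)

# Default join: the shared non-key column is duplicated with .x / .y suffixes
names(left_join(tweets, users, by = "user_id"))

# Alternative: drop the duplicate column first, so no suffixes are created
joined_toy <- left_join(tweets, select(users, -screen_name), by = "user_id")
names(joined_toy)
```

Either approach gives one row per tweet; the second just avoids the rename and drop steps afterwards.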
Now we have the tweets published by abusive accounts, together with tweet and user metadata, in a single tidy data frame. Let’s take a quick peek.
sample_n(joined,10) %>% print(n = 5, width = Inf)
## # A tibble: 10 x 60
## screen_name
## <chr>
## 1 caraamora
## 2 PetersladWY
## 3 edwardburgess1
## 4 martinpaulam
## 5 plott22
## text
## <chr>
## 1 @realDonaldTrump How many people will you kill this time? Sad little man
## 2 @theotheradel @Ianhwatkins @LSLofficial @llatchfordevans @_ClaireRichar…
## 3 Woman conned out of £5k by 'sugar daddy' conman who claimed to earn £50…
## 4 Make Europe Great Again! Polish MEP warns Polexit could END the ‘sick’ …
## 5 @RupaHuq @OwenJones84 @AngelaRayner Vox populi
## status_id created_at user_id
## <chr> <dttm> <chr>
## 1 919267584020709376 2017-10-14 18:24:05 2187274937
## 2 941038879829786625 2017-12-13 20:15:27 783075877722681344
## 3 958804943233847297 2018-01-31 20:51:26 302622742
## 4 956058809184768000 2018-01-24 06:59:17 830033420327723008
## 5 876095215282663424 2017-06-17 15:12:31 131565923
## source reply_to_status_id reply_to_user_id
## <chr> <chr> <chr>
## 1 Twitter for iPhone 919162619889704961 25073877
## 2 Twitter for Android 941026194472034304 22546741
## 3 Twitter Web Client <NA> <NA>
## 4 Twitter Web Client <NA> <NA>
## 5 Twitter Lite 876092375319359489 706747004
## reply_to_screen_name is_quote is_retweet favorite_count retweet_count
## <chr> <lgl> <lgl> <int> <int>
## 1 realDonaldTrump F F 0 0
## 2 theotheradel F F 1 0
## 3 <NA> F F 0 0
## 4 <NA> F F 0 1
## 5 RupaHuq F F 0 0
## hashtags symbols urls_url urls_t.co urls_expanded_url media_url
## <list> <list> <list> <list> <list> <list>
## 1 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 2 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 3 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 4 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 5 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## media_t.co media_expanded_url media_type ext_media_url ext_media_t.co
## <list> <list> <list> <list> <list>
## 1 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 2 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 3 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 4 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 5 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## ext_media_expanded_url ext_media_type mentions_user_id
## <list> <lgl> <list>
## 1 <chr [1]> NA <chr [1]>
## 2 <chr [1]> NA <chr [6]>
## 3 <chr [1]> NA <chr [1]>
## 4 <chr [1]> NA <chr [1]>
## 5 <chr [1]> NA <chr [3]>
## mentions_screen_name lang quoted_status_id quoted_text
## <list> <chr> <chr> <chr>
## 1 <chr [1]> en <NA> <NA>
## 2 <chr [6]> en <NA> <NA>
## 3 <chr [1]> en <NA> <NA>
## 4 <chr [1]> en <NA> <NA>
## 5 <chr [3]> pl <NA> <NA>
## retweet_status_id retweet_text place_url place_name place_full_name
## <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## place_type country country_code geo_coords coords_coords bbox_coords
## <chr> <chr> <chr> <list> <list> <list>
## 1 <NA> <NA> <NA> <dbl [2]> <dbl [2]> <dbl [8]>
## 2 <NA> <NA> <NA> <dbl [2]> <dbl [2]> <dbl [8]>
## 3 <NA> <NA> <NA> <dbl [2]> <dbl [2]> <dbl [8]>
## 4 <NA> <NA> <NA> <dbl [2]> <dbl [2]> <dbl [8]>
## 5 <NA> <NA> <NA> <dbl [2]> <dbl [2]> <dbl [8]>
## name location
## <chr> <chr>
## 1 ste evans ""
## 2 Life's Dancer Vegetarian
## 3 EJB - aka - EJ wigan
## 4 paul martin ""
## 5 Monkey face liz " Nirvana"
## description
## <chr>
## 1 ""
## 2 Here's the thing about life, You only get 1, so get out there and be AM…
## 3 supporter of 2013 fa cup winners
## 4 hi im paul and i live in england
## 5 "Tory hater,capitalism has failed.\nRepublican"
## url protected followers_count friends_count listed_count
## <chr> <lgl> <int> <int> <int>
## 1 <NA> F 6 43 0
## 2 <NA> F 34 200 0
## 3 <NA> F 132 148 14
## 4 <NA> F 4 0 0
## 5 <NA> F 105 123 4
## statuses_count favourites_count account_created_at verified profile_url
## <int> <int> <dttm> <lgl> <chr>
## 1 657 562 2013-11-19 11:49:19 F <NA>
## 2 872 711 2016-10-03 22:46:53 F <NA>
## 3 15283 2598 2011-05-21 13:55:20 F <NA>
## 4 7117 86 2017-02-10 12:39:23 F <NA>
## 5 3754 4464 2010-04-10 17:14:53 F <NA>
## profile_expanded_url account_lang
## <chr> <chr>
## 1 <NA> en
## 2 <NA> en
## 3 <NA> en
## 4 <NA> en-gb
## 5 <NA> en
## profile_banner_url
## <chr>
## 1 <NA>
## 2 https://pbs.twimg.com/profile_banners/783075877722681344/1512331941
## 3 https://pbs.twimg.com/profile_banners/302622742/1519392472
## 4 <NA>
## 5 https://pbs.twimg.com/profile_banners/131565923/1517939910
## profile_background_url
## <chr>
## 1 http://abs.twimg.com/images/themes/theme1/bg.png
## 2 <NA>
## 3 http://abs.twimg.com/images/themes/theme1/bg.png
## 4 http://abs.twimg.com/images/themes/theme1/bg.png
## 5 http://abs.twimg.com/images/themes/theme1/bg.png
## profile_image_url
## <chr>
## 1 http://pbs.twimg.com/profile_images/764164769444163584/4OlkhKDf_normal.…
## 2 http://pbs.twimg.com/profile_images/934939396624797697/WhnkfXTo_normal.…
## 3 http://pbs.twimg.com/profile_images/959870146184704000/URo4ixYA_normal.…
## 4 http://abs.twimg.com/sticky/default_profile_images/default_profile_norm…
## 5 http://pbs.twimg.com/profile_images/929642940099715072/-x9O_ksc_normal.…
## # ... with 5 more rows
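The `statuses_count` column in the sample above makes it possible to check the earlier observation about accounts with fewer than 3200 tweets. The sketch below copies the returned tweet counts and `statuses_count` values for the five sampled accounts from the outputs in this document and computes what share of each account’s tweets was retrieved. The column names `n_returned` and `coverage` are my own, and the interpretation is only approximate, since counts can shift between API calls.

```r
# Values copied from the outputs printed in this document
coverage_check <- data.frame(
  screen_name    = c("caraamora", "PetersladWY", "plott22", "edwardburgess1", "martinpaulam"),
  n_returned     = c(657, 869, 3164, 3205, 3223),    # rows per account in the timelines
  statuses_count = c(657, 872, 3754, 15283, 7117),   # lifetime tweets reported by Twitter
  stringsAsFactors = FALSE
)

# Share of each account's lifetime tweets present in the scraped data
coverage_check$coverage <- round(coverage_check$n_returned / coverage_check$statuses_count, 2)

# Lowest coverage first: these are the accounts where the timeline cap bites hardest
coverage_check[order(coverage_check$coverage), ]
```

Accounts with coverage near 1 are fully captured, while low values indicate prolific accounts whose older tweets lie beyond the timeline limit.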
As a last step, I’ll export this data frame in both csv and json formats.
rtweet::save_as_csv(joined,
file_name = "/Users/sefaozalp/Documents/Work/Anna_Soubry/active-accounts/active_accounts_timelines.csv" ,
prepend_ids = F)
joined %>% jsonlite::toJSON( digits = NA) %>%
write(file = "/Users/sefaozalp/Documents/Work/Anna_Soubry/active-accounts/active_accounts_timelines.json")
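The JSON export is useful alongside the CSV because the data frame contains list columns (hashtags, mentions, and so on) that JSON can represent natively. A small round-trip sketch with a made-up two-row data frame (the IDs and hashtags are hypothetical) illustrates this, and shows that tweet IDs stored as character survive unchanged. `digits = NA` asks jsonlite for maximum precision on numeric values.

```r
library(jsonlite)

# Hypothetical two-row stand-in for the joined data frame
toy <- data.frame(status_id = c("900000000000000001", "900000000000000002"),
                  stringsAsFactors = FALSE)
toy$hashtags <- list(c("brexit", "eu"), "soubry")   # list column, as rtweet returns

# Serialise to JSON and read it back
json_txt <- toJSON(toy, digits = NA)
restored <- fromJSON(json_txt)

# Character IDs come back exactly as written
identical(restored$status_id, toy$status_id)
restored$hashtags
```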
The next step is to scrape mentions of the suspended accounts from the two months preceding the offending tweet date.
To do that, we need to use Twitter Advanced Search and scrape tweets manually for each suspended account.
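The two-month window for each account can be derived from the `offensive_tweet_date` column. The sketch below is one way to do it, not necessarily how the searches were actually run. The helper names (`parse_offence_date`, `search_window`, `build_query`) and the handle `@example_user` are hypothetical, and I am assuming the standard `since:` and `until:` search operators with the offending date used as the upper bound.

```r
# Parse dates such as "16-Sep-17" without relying on the system locale
parse_offence_date <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)[[1]]
  month <- match(parts[2], month.abb)           # "Sep" -> 9
  as.Date(sprintf("20%s-%02d-%s", parts[3], month, parts[1]))
}

# Start and end of the two-month window before the offending tweet
search_window <- function(offence_date) {
  end   <- parse_offence_date(offence_date)
  start <- seq(end, by = "-2 months", length.out = 2)[2]
  list(start = start, end = end)
}

# Assemble a search string for one handle
build_query <- function(handle, offence_date) {
  w <- search_window(offence_date)
  sprintf("%s since:%s until:%s", handle, format(w$start), format(w$end))
}

build_query("@example_user", "16-Sep-17")
# "@example_user since:2017-07-16 until:2017-09-16"
```

Whether the `until:` date itself is included in the results is worth checking against the search page before relying on the boundary.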
# Loading necessary packages
library(stringr)
library(reshape2)
I will not go into detail here, but the code below is fairly self-explanatory.
#List html body files in the directory in a list (necessary for map function)
suspended_accounts_html_body_files <- list.files("/Users/sefaozalp/Documents/Work/Anna_Soubry/suspended_accounts",full.names = T) %>% as.list()
# Define a function that extracts tweet ids from a saved HTML body file, looks them up via the Twitter API to get tweet and user metadata, and saves the results as both .json and .csv files.
get_tweets <- function(x){
  txt_source <- readLines(x) %>% as.character()
  # keep the captured group (column 2) of every data-tweet-id="..." match
  tweet_ids <- str_match_all(string = txt_source, pattern = "data-tweet-id=\"(.*?)\"") %>%
    map(~ .x[, 2]) %>%
    unlist() %>%
    unique()
  # look up the tweets by id
  timelines <- rtweet::lookup_statuses(tweet_ids)
  # keep one row of user metadata per account
  users <- rtweet::users_data(timelines) %>%
    distinct(user_id, .keep_all = TRUE)
  joined_get_tweets <- left_join(timelines, users, by = "user_id") %>%
    mutate(screen_name = screen_name.x) %>%
    select(screen_name, everything(), -c(screen_name.x, screen_name.y))
  # write the outputs next to the input file, swapping the .txt extension
  rtweet::save_as_csv(joined_get_tweets,
                      file_name = str_replace(string = x, pattern = "\\.txt$", replacement = ".csv"),
                      prepend_ids = FALSE)
  joined_get_tweets %>% jsonlite::toJSON(digits = NA) %>%
    write(file = str_replace(string = x, pattern = "\\.txt$", replacement = ".json"))
}
# Map the function defined above over all files in the directory (i.e. repeat it for each suspended account).
map( suspended_accounts_html_body_files, get_tweets)
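The ID extraction step can be tried in isolation on a few made-up lines. The HTML snippets and IDs below are hypothetical and only mimic the `data-tweet-id` attribute the pattern looks for; the real search result pages are much larger.

```r
library(stringr)

# Hypothetical lines resembling a saved search-results page
html_lines <- c(
  '<li class="stream-item" data-tweet-id="1001" data-item-type="tweet">',
  '<div class="content">no id on this line</div>',
  '<li class="stream-item" data-tweet-id="1002" data-item-type="tweet">',
  '<div data-tweet-id="1001"></div>'
)

# One match matrix per line; column 2 holds the captured id
matches <- str_match_all(html_lines, 'data-tweet-id="(.*?)"')

# Flatten to a character vector and drop repeated ids
tweet_ids <- unique(unlist(lapply(matches, function(m) m[, 2])))
tweet_ids
```

Lines without a match contribute nothing, and an id that appears more than once is only looked up once.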
As a result, we have 7 csv files with different numbers of rows (the 8th account did not return any mentions within the selected time frame). We can use them separately or combine them into a single data frame.
I will bind them in a single csv file just in case.
suspended_mentions_list <- list.files("/Users/sefaozalp/Documents/Work/Anna_Soubry/suspended_accounts/csv_files", full.names = T) %>% as.list()
# Careful here: `quoted_status_id` is read as character. Before the col_types argument was set, this pipeline threw an error saying `quoted_status_id` cannot be converted from character to numeric.
map(suspended_mentions_list, read_csv, col_types = cols( `quoted_status_id` = "c")) %>%
bind_rows() %>%
write_csv("/Users/sefaozalp/Documents/Work/Anna_Soubry/suspended_accounts/suspended_accounts_tweets_binded.csv")
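The type conflict mentioned in the comment above arises when the same column is guessed differently across files. Declaring the ID columns as character up front avoids this. The sketch below uses two tiny temporary files with made-up IDs; the file names and values are hypothetical, and declaring `status_id` as character too is my own suggestion rather than something done above.

```r
library(readr)
library(dplyr)

# Two hypothetical CSV files: one with a missing quoted id, one with a long id
tmp_dir <- tempfile("mentions_")
dir.create(tmp_dir)
files <- file.path(tmp_dir, c("a.csv", "b.csv"))
writeLines(c("status_id,quoted_status_id", "1001,NA"), files[1])
writeLines(c("status_id,quoted_status_id", "1002,900000000000000001"), files[2])

# Read every id column as character so all files agree on the type
id_cols  <- cols(status_id = col_character(), quoted_status_id = col_character())
combined <- bind_rows(lapply(files, read_csv, col_types = id_cols))

combined
```

Reading IDs as character also avoids any loss of precision that could occur if long numeric IDs were stored as doubles.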
Worked like a charm!
In this document, I have collected data on both active and suspended abusive accounts on Twitter. The end product is two csv files. For active accounts, I managed to scrape 76543 tweets from 30 accounts. For suspended accounts, I managed to scrape 2300 mentions for 7 accounts. Only one suspended account yielded no data.
I will upload both csv files to Google drive and share the links via email.
The end!