Background

It is often useful to distinguish between personal and organisational tweets in document or discourse analysis of Twitter.

This note sets out some ideas for doing this.

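We use rtweet to collect tweets, tidyverse functions (dplyr, stringr, tidyr) to manipulate them, textfeatures and spacyr to derive features, and gt to print tables.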

Extract tweets

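# account whose timeline we collect
q <- "@PublicHealthBot"

# search term (defined here but not used below)
search <- "social media"

# the timeline endpoint returns at most roughly the 3,200 most recent tweets, so n is only an upper bound
results <- get_timeline(q, n = 18000, retry_on_ratelimit = TRUE)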

We are primarily interested in the user descriptions and user names, in this case the retweet_description and retweet_name columns.

Idea 1 - disclaimers

views <- results %>%
  filter(str_detect(retweet_description, "[Vv]iew"))

This keeps tweets where the retweeter's description contains "view" or "View" anywhere in the text, so it matches views and Views but also words such as reviews. Of 3172 tweets, 252 match.

We can see how successful this strategy is by reviewing a sample of the descriptions.

disclaimer_sample <- views %>%
  sample_n(10) %>%
  select(retweet_name, name, retweet_description)

disclaimer_sample %>%
  gt::gt()
retweet_name | name | retweet_description
Vivian OTA WANG | Public Health Data Science | Deputy Director, Office of Data Sharing, Center for Biomedical Informatics & IT, National Cancer Institute- Empowerment Through FAIR Data Sharing- Views be mine
Brianna Lindsay | Public Health Data Science | Infectious Disease Epidemiologist, Runner, Data Data Data. Views expressed here are my own. I generally retweet things I intend to read later but never get to.
Michelle Fenner | Public Health Data Science | Michelle Fenner - Marketing and Project Manager. Perpetual Mediator. Views are 100% my own!
Rob McCargow | Public Health Data Science | Director of #AI @PwC_UK • #ResponsibleAI @PwC • Advisor @APPG_AI @IEEESA @techUK • @TEDx Speaker • Fellow @theRSAorg • #HeForShe • #Vegan🌱🌍 • all views my own
Melisa Valverde | Public Health Data Science | Mom. Wife. Communicator in health care. Social scientist. Public health student. Quasi-digital native. Sci-fi fan. *Views are my own. RTs not an endorsement.*
Muin J. Khoury | Public Health Data Science | Physician, Epidemiologist, Geneticist, Public Health Professional. Passionate About Using Science to Improve Health for All. Views= My own. Links≠ Endorsements
IEEE | Public Health Data Science | Advancing technological innovation and excellence for the benefit of humanity. View the IEEE social media terms and conditions: https://t.co/xzhVUOgVE8
Brianna Lindsay | Public Health Data Science | Infectious Disease Epidemiologist, Runner, Data Data Data. Views expressed here are my own. I generally retweet things I intend to read later but never get to.
Biotech Mergers & Acquisitions M&A | Public Health Data Science | BioPortfolio brings you the latest news, reviews, bogs and reports on #Mergers & #Acquisitions #MandA in the #biotech-pharma industry.
The Parliament Magazine | Public Health Data Science | By MEPs, for MEPs - in-depth news, views and analysis on the latest EU issues from The Parliament Magazine.

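In the sample we can see that retweeters with a matching description are often recognisable individuals, though not always: IEEE, the biotech news account and The Parliament Magazine match only because their descriptions contain "View", "reviews" or "views" in another sense.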
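A tighter pattern can reduce false positives of this kind. The sketch below is an assumption rather than the method used in this note: it uses base R, a handful of descriptions adapted from the sample, and a hypothetical pattern that requires the word view or views to be followed shortly by my own, mine or our own.

```r
# Toy descriptions adapted from the sample (shortened)
descriptions <- c(
  "Views expressed here are my own.",
  "latest news, reviews and reports on #Mergers",
  "View the IEEE social media terms and conditions",
  "all views my own",
  "in-depth news, views and analysis on the latest EU issues"
)

# Original pattern: matches "view"/"View" anywhere, including inside "reviews"
loose <- grepl("[Vv]iew", descriptions)

# Hypothetical stricter pattern: the whole word view/views, followed within
# 40 characters by an ownership phrase
strict_pattern <- "\\bviews?\\b.{0,40}\\b(my own|mine|our own)\\b"
strict <- grepl(strict_pattern, descriptions, ignore.case = TRUE, perl = TRUE)

print(data.frame(loose, strict, descriptions))
```

A stricter pattern will miss disclaimers that are worded differently (for example "opinions are mine"), so it trades recall for precision.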

We will add a new column to the tweet table (add a feature).

results <- results %>%
  mutate(views = ifelse(str_detect(retweet_description, "[Vv]iew"), 1, 0))

This approach is unlikely to capture all personal users, so we will need additional strategies.

Idea 2 - detect use of personal pronouns

The idea here is that individuals are more likely than organisations to use first-person singular pronouns such as I and me.

Detecting this can be done with full NLP tools, but a much quicker method is to use the textfeatures package, which counts first-person pronouns among its features.

features <- textfeatures(results$retweet_description, normalize = FALSE)
## ↪ Counting features in text...
## ↪ Sentiment analysis...
## ↪ Parts of speech...
## ↪ Word dimensions started
## ✔ Job's done!
features
## # A tibble: 3,172 x 134
##    n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
##     <int>     <int>      <int>         <int>      <int>         <int>   <int>
##  1      0         0          5             5          0             0     127
##  2      0         0          1             1          0             0     104
##  3      0         0         11            11          0             0     134
##  4      0         0          0             0          0             0       0
##  5      0         0          0             0          1             1      80
##  6      0         0          0             0          0             0     109
##  7      0         0          4             4          0             0     129
##  8      0         0          4             4          0             0     129
##  9      1         1          8             8          0             0     115
## 10      0         0          0             0          0             0      58
## # … with 3,162 more rows, and 127 more variables: n_uq_chars <int>,
## #   n_commas <int>, n_digits <int>, n_exclaims <int>, n_extraspaces <int>,
## #   n_lowers <int>, n_lowersp <dbl>, n_periods <int>, n_words <int>,
## #   n_uq_words <int>, n_caps <int>, n_nonasciis <int>, n_puncts <int>,
## #   n_capsp <dbl>, n_charsperword <dbl>, sent_afinn <dbl>, sent_bing <dbl>,
## #   sent_syuzhet <dbl>, sent_vader <dbl>, n_polite <dbl>, n_first_person <int>,
## #   n_first_personp <int>, n_second_person <int>, n_second_personp <int>,
## #   n_third_person <int>, n_tobe <int>, n_prepositions <int>, w1 <dbl>,
## #   w2 <dbl>, w3 <dbl>, w4 <dbl>, w5 <dbl>, w6 <dbl>, w7 <dbl>, w8 <dbl>,
## #   w9 <dbl>, w10 <dbl>, w11 <dbl>, w12 <dbl>, w13 <dbl>, w14 <dbl>, w15 <dbl>,
## #   w16 <dbl>, w17 <dbl>, w18 <dbl>, w19 <dbl>, w20 <dbl>, w21 <dbl>,
## #   w22 <dbl>, w23 <dbl>, w24 <dbl>, w25 <dbl>, w26 <dbl>, w27 <dbl>,
## #   w28 <dbl>, w29 <dbl>, w30 <dbl>, w31 <dbl>, w32 <dbl>, w33 <dbl>,
## #   w34 <dbl>, w35 <dbl>, w36 <dbl>, w37 <dbl>, w38 <dbl>, w39 <dbl>,
## #   w40 <dbl>, w41 <dbl>, w42 <dbl>, w43 <dbl>, w44 <dbl>, w45 <dbl>,
## #   w46 <dbl>, w47 <dbl>, w48 <dbl>, w49 <dbl>, w50 <dbl>, w51 <dbl>,
## #   w52 <dbl>, w53 <dbl>, w54 <dbl>, w55 <dbl>, w56 <dbl>, w57 <dbl>,
## #   w58 <dbl>, w59 <dbl>, w60 <dbl>, w61 <dbl>, w62 <dbl>, w63 <dbl>,
## #   w64 <dbl>, w65 <dbl>, w66 <dbl>, w67 <dbl>, w68 <dbl>, w69 <dbl>,
## #   w70 <dbl>, w71 <dbl>, w72 <dbl>, w73 <dbl>, …

textfeatures creates a range of other text features for each description which might be useful. It also attaches sentiment scores from several sentiment algorithms.

We’ll combine the features generated this way with the original tweet data and add a new column flagging descriptions that contain at least one first-person singular pronoun.

results1 <- data.frame(cbind(results, features))

results1 <- results1 %>%
  mutate(sing_pron = ifelse(n_first_person > 0, 1, 0))
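
As a sanity check on the n_first_person feature, first-person singular pronouns can also be counted by hand. The sketch below uses base R only. The function name count_first_person and its word list are assumptions made for illustration and may not match the list textfeatures uses internally.

```r
# Count whole-word first-person singular pronouns in each string.
# The word list is an illustrative assumption.
count_first_person <- function(text) {
  pattern <- "\\b(i|me|my|mine|myself)\\b"
  matches <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions
  vapply(matches, function(m) sum(m > 0), integer(1))
}

descs <- c(
  "Views expressed here are my own. I generally retweet things",
  "By MEPs, for MEPs - in-depth news, views and analysis",
  "Making innovation happen at UHCW. Our own views and interests."
)

print(count_first_person(descs))
```

The third description uses the plural "our", which this list deliberately leaves out, so team accounts of that kind are not flagged.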

We can summarise the overlap between tweets with disclaimers and those which have first-person singular references.

res1_count <- results1 %>%
  count(views, sing_pron) 

res1_count %>%
  gt::gt()
views sing_pron n
0 0 2597
0 1 323
1 0 73
1 1 179
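
The overlap can be quantified directly from the four counts above. The sketch below re-enters them by hand in base R (the object names are illustrative) and computes how often a disclaimer and a first-person pronoun occur together.

```r
# Counts copied from the table above
overlap <- data.frame(
  views     = c(0, 0, 1, 1),
  sing_pron = c(0, 1, 0, 1),
  n         = c(2597, 323, 73, 179)
)

total       <- sum(overlap$n)
disclaimers <- sum(overlap$n[overlap$views == 1])
pronouns    <- sum(overlap$n[overlap$sing_pron == 1])
both        <- sum(overlap$n[overlap$views == 1 & overlap$sing_pron == 1])
either      <- sum(overlap$n[overlap$views == 1 | overlap$sing_pron == 1])

# Share of disclaimer rows that also use a first-person pronoun
print(both / disclaimers)
# Share of pronoun rows that also carry a disclaimer
print(both / pronouns)
# Share of all rows flagged by at least one of the two ideas
print(either / total)
```

With these counts, 179 of the 252 disclaimer rows (about 71%) also contain a first-person pronoun, while only 179 of the 502 pronoun rows (about 36%) carry a disclaimer, so the two signals complement each other rather than duplicate each other.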

Idea 3 - named entity recognition

Using Natural Language Processing techniques…

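This idea makes use of the spacyr package, an R interface to spaCy. spaCy's language models are trained on large annotated text collections to recognise entity categories including people and organisations. We can then label each retweeter's name and description according to whether it mentions a person, an organisation or both, and combine this with the annotations from the first two ideas. This gives a number of categories, summarised in the table below; a value of 2 in both the ORG and PERSON columns marks descriptions where both entity types were found.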

library(spacyr)

results1 <- results1 %>%
  mutate(desc1 = paste(retweet_name, retweet_description), 
         row = row_number(), doc_id = paste0("text", row)
  )

parsed <- spacy_parse(results1$desc1, tag = TRUE, entity = TRUE, nounphrase = TRUE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.0.16, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
head(parsed)
##   doc_id sentence_id token_id   token   lemma   pos tag   entity nounphrase
## 1  text1           1        1  Darryl  darryl PROPN NNP PERSON_B        beg
## 2  text1           1        2 Pieroni pieroni PROPN NNP PERSON_I        mid
## 3  text1           1        3 Husband husband PROPN NNP PERSON_I   end_root
## 4  text1           1        4       ,       , PUNCT   ,                    
## 5  text1           1        5  Father  father PROPN NNP            beg_root
## 6  text1           1        6       ,       , PUNCT   ,                    
##   whitespace
## 1       TRUE
## 2       TRUE
## 3      FALSE
## 4       TRUE
## 5      FALSE
## 6       TRUE
person <- parsed %>%
  #filter(str_detect(entity, "^P")) %>%
  select(doc_id, entity) %>%
  distinct() %>%
  full_join(results1)
## Joining, by = "doc_id"
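person_extract <- person %>%
  select(doc_id, desc1, entity, views, sing_pron) %>%
  # keep tokens whose entity tag is in this list (organisation and person tags)
  filter(entity %in% c("ORG_I", "ORG_B", "PERSON_I", "PERSON_B", "      ")) %>%
  # collapse the begin/inside tags to the entity type
  mutate(ent = str_extract(entity, "ORG|PERSON")) %>%
  select(-entity) %>%
  distinct() %>%
  group_by(doc_id) %>%
  # n = number of distinct entity types found in the description (1 or 2)
  mutate(n = n()) %>%
  ungroup() %>%
  # one column per entity type; a description with both types gets 2 in each
  spread(ent, n, fill = 0)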
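
Because n counts the distinct entity types per description, the spread() step codes descriptions with both a person and an organisation as 2 in both columns. If independent 0/1 flags are preferred, one alternative can be sketched as follows. This is an assumption, not the approach used in this note: it runs in base R on a small hand-made data frame (parsed_toy) that imitates the doc_id and entity columns returned by spacy_parse().

```r
# Hand-made stand-in for the doc_id and entity columns of spacy_parse() output
parsed_toy <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2", "text3"),
  entity = c("PERSON_B", "PERSON_I", "", "ORG_B", "PERSON_B", ""),
  stringsAsFactors = FALSE
)

# Strip the _B (begin) / _I (inside) suffix to get the entity type
parsed_toy$ent <- sub("_[BI]$", "", parsed_toy$entity)

# One row per document, with a separate 0/1 flag for each entity type
flags <- data.frame(doc_id = unique(parsed_toy$doc_id), stringsAsFactors = FALSE)
flags$PERSON <- as.integer(flags$doc_id %in% parsed_toy$doc_id[parsed_toy$ent == "PERSON"])
flags$ORG    <- as.integer(flags$doc_id %in% parsed_toy$doc_id[parsed_toy$ent == "ORG"])

print(flags)
```

Unlike the filter() above, this keeps documents with neither entity type (text3 here) as a row of zeros, so they are not silently dropped from later counts.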

counts <- person_extract %>%
  count(ORG, PERSON, sing_pron, views) %>%
  mutate(rowid = row_number()) 

counts %>%
  gt::gt()
ORG PERSON sing_pron views n rowid
0 1 0 0 379 1
0 1 0 1 6 2
0 1 1 0 95 3
0 1 1 1 40 4
1 0 0 0 1065 5
1 0 0 1 40 6
1 0 1 0 68 7
1 0 1 1 40 8
2 2 0 0 776 9
2 2 0 1 18 10
2 2 1 0 136 11
2 2 1 1 96 12
prob <- person_extract %>%
  filter(views == 0, sing_pron == 0, PERSON != 0) %>%
  distinct()

Finalising



person_tweet <- person_extract %>%
  mutate(person = ifelse(PERSON >0, 1,
                          ifelse(views ==1, 1,
                                 ifelse(sing_pron == 1, 1, 0))))
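
To see what this rule does in aggregate, it can be applied to the counts table from Idea 3. The sketch below re-enters that table by hand in base R (object names are illustrative) and writes the nested ifelse() as a single logical OR, which gives the same result when no values are missing.

```r
# Counts copied from the Idea 3 table
counts <- data.frame(
  ORG       = c(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2),
  PERSON    = c(1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2),
  sing_pron = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
  views     = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1),
  n         = c(379, 6, 95, 40, 1065, 40, 68, 40, 776, 18, 136, 96)
)

# Same rule as the nested ifelse(): any one signal is enough
counts$person <- as.integer(
  counts$PERSON > 0 | counts$views == 1 | counts$sing_pron == 1
)

flagged <- sum(counts$n[counts$person == 1])
total   <- sum(counts$n)
# Rows flagged only because a PERSON entity was found
person_only <- sum(
  counts$n[counts$PERSON > 0 & counts$views == 0 & counts$sing_pron == 0]
)

print(c(flagged = flagged, total = total, person_only = person_only))
```

On these counts the rule labels 1,694 of 2,759 rows as personal, and 1,155 of those rest on the PERSON entity alone, which is the subset selected into prob above. Of these, 776 also contain an ORG entity, so that group deserves a manual check.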

person_tweet %>%
  filter(person == 1) %>%
  select(desc1) %>%
  head(10) %>%
  gt::gt()
desc1
Darryl Pieroni Husband, Father, Entrepreneur, Internet Scholar, Amateur Chef and avid Tennis player. #olemiss #digitaltransformation #CX #DX #digitalstrategy
AI•Enhanced! Get more followers (automatically) with AI-Enhanced (for Twitter)
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins
UHCW Innovation Team Making innovation happen at UHCW 💡💭🏥 Our own views and interests.
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi