Distinguishing personal from organisational tweets: proposed algorithm

This note sets out an algorithm for distinguishing tweets posted by individuals from tweets posted by organisations or companies.

It uses natural language processing to parse and annotate the account name and description attached to each tweet, and named entity recognition (NER) to label each one with person or organisation tags.

The algorithm is as follows:

  1. Download relevant tweets
  2. Join user name and user description fields
  3. Parse the description field using spacy_parse with the lemma, tag, entity and nounphrase arguments set to TRUE.
  4. Create a new field pasting the entity to the token
  5. Group by doc_id and collapse back into sentences
  6. Add a flag where lemma = -PRON- and token in I, my, mine, am (first person singular), or where the lemma contains "view" or "personal"
  7. Add a classification field = person if sentence contains a [PERSON] entity tag, or if flag = 1, else = “org”
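As a rough illustration of steps 4 to 7, the sketch below applies the same flag and classification rules to a small hand-made data frame. The data frame `toy` and its values are invented stand-ins for `spacy_parse` output, not real results, and the collapse is done per description in one step for brevity, so neither Twitter access nor spaCy is needed to run it.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for spacy_parse() output (illustrative values only)
toy <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2", "text3", "text3", "text3"),
  token  = c("Jane", "Doe", "nurse", "Acme", "Health", "I", "tweet", "data"),
  lemma  = c("jane", "doe", "nurse", "acme", "health", "-PRON-", "tweet", "datum"),
  entity = c("PERSON_B", "PERSON_I", "", "ORG_B", "ORG_I", "", "", ""),
  stringsAsFactors = FALSE
)

classified <- toy %>%
  mutate(
    # step 6: first-person flag
    pspn = ifelse(lemma == "-PRON-" & token %in% c("I", "my", "am", "mine"), 1,
                  ifelse(str_detect(lemma, "[Vv]iew|[Pp]ersonal"), 1, 0)),
    # step 4: paste the entity tag onto the token
    annotated = ifelse(nchar(entity) > 0, paste(token, "[", entity, "]"), token)
  ) %>%
  # step 5: collapse tokens back into one string per description
  group_by(doc_id) %>%
  summarise(annotated_description = paste(annotated, collapse = " "),
            sumpsp = sum(pspn), .groups = "drop") %>%
  # step 7: person if a PERSON tag or a first-person flag is present
  mutate(person = ifelse(str_detect(annotated_description, "PERSON") | sumpsp > 0,
                         "person", "org"))

classified
```

In this toy input, text1 is labelled as a person through its PERSON tags, text3 through the first-person flag, and text2 falls through to "org".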

Code as follows - using descriptions of people who retweet @PublicHealthBot.

Get tweets

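library(rtweet)   # Twitter API client
library(dplyr)    # %>%, mutate, select
library(stringr)  # str_detect
library(spacyr)   # spacy_parse
q <- "@PublicHealthBot"
results <- get_timeline(q, n = 18000, retryonratelimit = TRUE)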

Join user name and retweet description text and add doc id field

results1 <- results %>%
  mutate(desc1 = paste(retweet_name, retweet_description), 
         row = row_number(), doc_id = paste0("text", row)
  )

Reduce variables; parse text

## slim down variables 

results2 <- results1 %>%
  select(created_at, text, row, doc_id, desc1)

results2 %>%
  head() %>%
  knitr::kable()
| created_at | text | row | doc_id | desc1 |
|---|---|---|---|---|
| 2020-01-03 12:02:55 | CLOSING SOON: PhD position (Oxford, United Kingdom) Bridge epidemiology, machine learning, and fieldwork to control schistosomiasis in Uganda with Goylette Chami and @aiden1doherty at @bdi_oxford @oxford_ndph. #wearables #ml #modelling More details: https://t.co/meJ89xrHu6 | 1 | text1 | IDDjobs: infectious disease dynamics jobs Find & advertise jobs in infectious disease dynamics. PhDs, postdocs, research, industry. Add your jobs at https://t.co/Xvf58SH16b. Made by @rozeggo. |
| 2020-01-02 23:42:41 | Manager, Epidemiology Analytics Buckinghamshire, United Kingdom #Epijobs https://t.co/X4nIl0LapT … … … | 2 | text2 | Epi Job Openings Your “ONE STOP” for the latest Job Opportunities in Epidemiology and Public Health. #AcademicTwitter #EpiTwitter #EpiJobs #PublicHealthTwitter |
| 2020-01-02 23:23:01 | @VPrasadMDMPH Unless you integrate the public health issue of 2 and 4 overdiagnosis in the underlying model. I mean this paper should be taken for what it is: comparative performance for a precise task between AI vs human, neither more, nor less. | 3 | text3 | Barrit Sami 🕊 neurosurgical trainee @ULBruxelles |
| 2020-01-02 21:03:10 | With the growing fears of antimicrobial resistance, #AI could be used to identify & predict which genes cause infectious bacteria to become resistant to antibiotics: #PublicHealth #ArtificialIntelligence #ML #HealthTech https://t.co/xX5lIVUVUj | 4 | text4 | Steph S. ♻️ #Innovation #Disruption \| Marketing Manager \| @Accenture Health & Public Service \| Photography, Emergency Preparedness/Response Volunteer \| Opinions my own. |
| 2020-01-02 20:03:24 | Public health should strive to attract professionals with broadly applicable data-science skills." | 5 | text5 | ClinEpiDB A global Clinical Epidemiology data resource from collaborators at University of Pennsylvania, University of Georgia & University of Liverpool |
| 2020-01-02 16:22:32 | A computer algorithm has been shown to be as effective as human radiologists in spotting breast cancer from x-ray images. https://t.co/PNAlhVNsNc via @imperialcollege #GlobalHealth | 6 | text6 | USAID STAR Project @USAID-supported #globalhealth project, implemented by @phidotorg. We engage talented professionals in capacity-building projects at GH orgs. https://t.co/cZvg3KKB8N |

## parse 

parsed <- spacy_parse(results2$desc1, lemma = TRUE, tag = TRUE, entity = TRUE, nounphrase = TRUE)
## Found 'spacy_condaenv'. spacyr will use this environment

## successfully initialized (spaCy Version: 2.0.16, language model: en)

## (python options: type = "condaenv", value = "spacy_condaenv")
head(parsed)
##   doc_id sentence_id token_id      token      lemma   pos tag entity nounphrase
## 1  text1           1        1    IDDjobs     iddjob  NOUN NNS  ORG_B   beg_root
## 2  text1           1        2          :          : PUNCT   :                  
## 3  text1           1        3 infectious infectious   ADJ  JJ               beg
## 4  text1           1        4    disease    disease  NOUN  NN               mid
## 5  text1           1        5   dynamics    dynamic  NOUN NNS               mid
## 6  text1           1        6       jobs        job  NOUN NNS          end_root
##   whitespace
## 1      FALSE
## 2       TRUE
## 3       TRUE
## 4       TRUE
## 5       TRUE
## 6       TRUE

Flag first-person markers and paste entity to token

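parsed1 <- parsed %>% 
  mutate(
    # flag first-person pronouns, plus lemmas containing "view" or "personal"
    # (note: spaCy 2.x lemmatises "am" to "be", so "am" never passes the -PRON- test)
    pspn = ifelse(lemma == "-PRON-" & token %in% c("I", "my", "am", "mine"), 1, 
                  ifelse(str_detect(lemma, "[Vv]iew|[Pp]ersonal"), 1, 0)), 
    # append the entity tag to any token that has one
    annotated = ifelse(nchar(entity) > 0, paste(token, "[", entity, "]"), token))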

head(parsed1)
##   doc_id sentence_id token_id      token      lemma   pos tag entity nounphrase
## 1  text1           1        1    IDDjobs     iddjob  NOUN NNS  ORG_B   beg_root
## 2  text1           1        2          :          : PUNCT   :                  
## 3  text1           1        3 infectious infectious   ADJ  JJ               beg
## 4  text1           1        4    disease    disease  NOUN  NN               mid
## 5  text1           1        5   dynamics    dynamic  NOUN NNS               mid
## 6  text1           1        6       jobs        job  NOUN NNS          end_root
##   whitespace pspn         annotated
## 1      FALSE    0 IDDjobs [ ORG_B ]
## 2       TRUE    0                 :
## 3       TRUE    0        infectious
## 4       TRUE    0           disease
## 5       TRUE    0          dynamics
## 6       TRUE    0              jobs

Collapse into sentences

sentences <- parsed1 %>% 
  group_by(doc_id, sentence_id) %>%
  mutate(annotated_sentence = paste(annotated, collapse = " ")) %>%
  ungroup() %>%
  select(doc_id, annotated_sentence, pspn) %>%
  distinct() %>%
  group_by(doc_id) %>%
  mutate(sumpsp = sum(pspn), 
         annotated_description = paste(annotated_sentence, collapse = ". ")) %>%
  ungroup() %>%
  select(doc_id, annotated_description, sumpsp) %>%
  distinct()
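One side effect of the `mutate` + `distinct` pattern is worth checking: because `pspn` varies token by token, a sentence containing both flagged and unflagged tokens survives `distinct()` as two rows, so it is pasted into the description twice. The sketch below reproduces this on a hand-made, hypothetical one-sentence input and shows a `summarise`-based alternative that keeps one row per sentence. It is a suggestion, not the code that produced the output shown in this note.

```r
library(dplyr)

# Hypothetical stand-in for parsed1: one sentence in which only "mine" is flagged
toy <- data.frame(
  doc_id      = "text1",
  sentence_id = 1L,
  annotated   = c("Opinions", "are", "mine"),
  pspn        = c(0, 0, 1),
  stringsAsFactors = FALSE
)

# Pattern used above: one row survives per distinct (sentence, pspn) pair
via_distinct <- toy %>%
  group_by(doc_id, sentence_id) %>%
  mutate(annotated_sentence = paste(annotated, collapse = " ")) %>%
  ungroup() %>%
  select(doc_id, annotated_sentence, pspn) %>%
  distinct()

# Alternative: summarise returns exactly one row per sentence
via_summarise <- toy %>%
  group_by(doc_id, sentence_id) %>%
  summarise(annotated_sentence = paste(annotated, collapse = " "),
            pspn = max(pspn), .groups = "drop")

nrow(via_distinct)   # 2: the sentence appears once with pspn 0 and once with pspn 1
nrow(via_summarise)  # 1
```

Taking `max(pspn)` per sentence and then summing per `doc_id` should give the same `sumpsp` as the original pattern, since the original keeps at most one flagged row per sentence.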


head(sentences)
## # A tibble: 6 x 3
##   doc_id   annotated_description                                          sumpsp
##   <chr>    <chr>                                                           <dbl>
## 1 text1    "IDDjobs [ ORG_B ] : infectious disease dynamics jobs Find [ …      0
## 2 text10   "rikkert Eenhoorn [ ORG_B ] Corporate [ ORG_I ] Accountmanage…      0
## 3 text100  "Joel [ PERSON_B ] Lindsey [ PERSON_I ] Sr [ PERSON_I ] . Exe…      1
## 4 text1000 "Sven Awege Sven AWEGE. \" Dot Connector \" # healthcare # in…      0
## 5 text1001 "jwindz. Sow the wind and reap the whirlwind"                       0
## 6 text1002 "English [ LANGUAGE_B ] @. ECU. At the [ ORG_B ] ECU [ ORG_I …      0

Categorise

sentences <- sentences %>%
  mutate(person = ifelse(str_detect(annotated_description, "PERSON")|sumpsp > 0, "person", "org"))
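The pattern "PERSON" matches anywhere in the annotated text, so a description that happens to contain those letters in capitals would also be classed as a person. A possible tightening, sketched here on two made-up strings, is to match the opening of the pasted tag itself. The `fixed("[ PERSON_")` pattern assumes the `paste(token, "[", entity, "]")` format used above.

```r
library(stringr)

# Made-up annotated descriptions (illustrative only)
x <- c("Jane [ PERSON_B ] Doe [ PERSON_I ] nurse",
       "PERSONAL TRAINING STUDIO")

loose  <- str_detect(x, "PERSON")             # matches both strings
strict <- str_detect(x, fixed("[ PERSON_"))   # matches only the tagged one

data.frame(x, loose, strict)
```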


sample <- sample_n(sentences, 20)

sample %>%
  knitr::kable()
| doc_id | annotated_description | sumpsp | person |
|---|---|---|---|
| text2078 | CLEPH The [ ORG_B ] Centre [ ORG_I ] for [ ORG_I ] Law [ ORG_I ] Enforcement [ ORG_I ] and [ ORG_I ] Public [ ORG_I ] Health [ ORG_I ] is committed to advancing knowledge , building partnerships and undertaking projects in the joint various fields | 0 | org |
| text678 | Martin [ PERSON_B ] Anderson [ PERSON_I ] Thanos [ PERSON_I ] 🌈 [ PERSON_I ] Champion [ PERSON_I ] # Alchemist # Industry40 [ ORG_B ] Clinical [ ORG_I ] # Partitioner [ GPE_B ] # [ MONEY_B ] edu [ MONEY_I ] # [ MONEY_B ] tech [ MONEY_I ] # [ MONEY_I ] startup ER [ ORG_B ] Medic [ ORG_I ] , @PulseMedic @symbol7th [ PRODUCT_B ] , [ PRODUCT_I ] # Developer # digitalhealth # Evangelist [ MONEY_B ] , # bi | 0 | person |
| text536 | Artificial Intelligence | 0 | org |
| text1347 | Public Health Wales. We are the national public health agency in Wales [ GPE_B ] .. Cymraeg [ PERSON_B ] : @IechydCyhoeddus [ CARDINAL_B ] | 0 | person |
| text423 | Global [ ORG_B ] Capital [ ORG_I ] Guide [ ORG_I ]. The Global Capital Guide | 0 | org |
| text1435 | F. [ PERSON_B ] Perry [ PERSON_I ] Wilson [ PERSON_I ] , MD [ PERSON_B ] MSCE [ PERSON_I ] # Physician [ NORP_B ] / # researcher @Yale .. Columnist [ ORG_B ] @medscape [ ORG_I ] , @huffpost , @journalsentinel [ NORP_B ] .. Evidence - based .. Science [ ORG_B ] - based . | 0 | person |
| text321 | Hennepin [ GPE_B ] County [ GPE_I ] Jobs [ GPE_I ]. The official Twitter [ GPE_B ] account for Hennepin [ GPE_B ] County [ GPE_I ] careers & internships . [ GPE_B ]. Tweets by HC [ ORG_B ] Recruiters [ ORG_I ] . . Follow us to learn more about careers at Hennepin [ GPE_B ] County [ GPE_I ] . | 0 | org |
| text1630 | Rob [ PERSON_B ] Butler [ PERSON_I ] @TBS_Canada Bridgebuilder [ GPE_B ] inside @DigiEnablement [ ORG_B ] • @LeadersGC • # GCDigital [ MONEY_B ] • # GC [ ORG_B ] 🇨 🇦 Event Manager & Media Specialist • @Disney [ ORG_B ] Rep [ ORG_I ]. •. @GOFOBO [ ORG_B ]. Host •. View [ ORG_B ] 💯 mine. View [ ORG_B ] 💯 mine | 1 | person |
| text487 | Mycroft [ ORG_B ] AI [ ORG_I ] Open [ ORG_I ] Source [ ORG_I ] Voice [ ORG_I ] Assistant An [ ORG_B ] AI [ ORG_I ]. For Everyone # artificialintelligence # [ CARDINAL_B ] ai # [ MONEY_B ] opensource [ MONEY_I ] # [ CARDINAL_B ] ownyourdata # useragency | 0 | |
| text2397 | Wellco [ ORG_B ] Wellco [ ORG_I ] ™ [ ORG_I ] is an award - winning health & wellness firm that helps employers measurably improve health care engagement , analytics and high - value care . | 0 | org |
| text2425 | FMI [ PERSON_B ] Representing [ PERSON_I ] the voice of food retail across government relations ; food protection ; industry relations ; health and wellness ; sustainability ; and communications . | 0 | person |
| text3136 | MedeAnalytics [ ORG_B ] We help healthcare make even smarter decisions by rapidly orchestrating client data sources into our intelligent , healthcare - specific analytics platform . | 0 | org |
| text2648 | SAIL LABS Technology SAIL LABS Technology. GmbH is a leading provider of Open [ ORG_B ] Source [ ORG_I ] Intelligence [ ORG_I ] ( # OSINT [ MONEY_B ] ) systems , Automatic [ ORG_B ] Speech [ ORG_I ] Recognition [ ORG_I ] ( ASR [ ORG_B ] ) , media monitoring and analysis . | 0 | org |
| text61 | Go Blue [ ORG_B ] Enjoy [ ORG_I ] factual debates .. If you hate data , do n’t tweet at me [ PERSON_B ]. Warning :. bad MSU [ ORG_B ] accounts love to block me | 0 | person |
| text3175 | Susan [ PERSON_B ] Shriner [ PERSON_I ] Wildlife [ PERSON_I ] Epidemiologist [ PERSON_I ] / [ PERSON_I ] Disease [ PERSON_I ] Ecologist [ PERSON_I ] .. Avian [ NORP_B ] # influenza , avian blood parasites , AMR [ ORG_B ] , and spatial ecology .. Birds rule .. # ornithology she / hers | 0 | person |
| text2631 | HealthViewAI | 1 | person |
| text1504 | IAPHS Improving the Health of Populations Through Science and Innovation .. # pophealth # populationhealth # [ MONEY_B ] science [ MONEY_I ] # [ MONEY_I ] research. # [ ORG_B ] sdoh # pophealth2019 # sociology | 0 | org |
| text2213 | Saša Mitrović Principal Sales Engineer # MarkLogic # NoSQL [ GPE_B ] # DACH [ GPE_B ] , plays # basketball and goes # jogging | 0 | org |
| text3008 | Raj [ PERSON_B ] Bhandari [ PERSON_I ] # DigitalTransformation [ ORG_B ] Services [ ORG_I ] Sales [ ORG_I ] , # Industrial [ ORG_B ] and [ ORG_I ] # [ ORG_I ] Manufacturing [ ORG_I ] .. Always # learning , Always in # Vismaad [ MONEY_B ] . | 0 | person |
| text2165 | Freedman [ PERSON_B ] HealthCare [ PERSON_I ] Established in 2005 [ DATE_B ] , FHC [ ORG_B ] is a national consulting firm of # [ MONEY_B ] APCD [ MONEY_I ] experts who work with clients to put # [ MONEY_B ] health [ MONEY_I ] # data to work to solve complex # healthcare problems | 0 | person |