This note sets out an algorithm for distinguishing tweets posted by individuals from organisational or corporate tweets.
It uses natural language processing to parse and annotate the account name and description attached to each tweet, and named entity recognition (NER) to label them with person or organisation tags.
The algorithm is as follows:
Parse the text with spacy_parse, with the lemma, tag, entity and nounphrase arguments set to TRUE.
The code is as follows, using the descriptions of people who retweet @PublicHealthBot.
library(dplyr)
library(stringr)

## combine account name and description into one text field and add a document id
results1 <- results %>%
  mutate(
    desc1 = paste(retweet_name, retweet_description),
    row = row_number(),
    doc_id = paste0("text", row)
  )

## slim down variables
results2 <- results1 %>%
  select(created_at, text, row, doc_id, desc1)

results2 %>%
  head() %>%
  knitr::kable()
created_at | text | row | doc_id | desc1 |
---|---|---|---|---|
2020-01-03 12:02:55 | CLOSING SOON: PhD position (Oxford, United Kingdom) Bridge epidemiology, machine learning, and fieldwork to control schistosomiasis in Uganda with Goylette Chami and @aiden1doherty at @bdi_oxford @oxford_ndph. #wearables #ml #modelling More details: https://t.co/meJ89xrHu6 | 1 | text1 | IDDjobs: infectious disease dynamics jobs Find & advertise jobs in infectious disease dynamics. PhDs, postdocs, research, industry. Add your jobs at https://t.co/Xvf58SH16b. Made by @rozeggo. |
2020-01-02 23:42:41 | Manager, Epidemiology Analytics Buckinghamshire, United Kingdom #Epijobs https://t.co/X4nIl0LapT … … … | 2 | text2 | Epi Job Openings Your “ONE STOP” for the latest Job Opportunities in Epidemiology and Public Health. #AcademicTwitter \| #Epi \| #EpiJobs \| #PublicHealthTwitter |
2020-01-02 23:23:01 | @VPrasadMDMPH Unless you integrate the public health issue of 2 and 4 overdiagnosis in the underlying model. I mean this paper should be taken for what it is: comparative performance for a precise task between AI vs human, neither more, nor less. | 3 | text3 | Barrit Sami 🕊 neurosurgical trainee @ULBruxelles |
2020-01-02 21:03:10 | With the growing fears of antimicrobial resistance, #AI could be used to identify & predict which genes cause infectious bacteria to become resistant to antibiotics: #PublicHealth #ArtificialIntelligence #ML #HealthTech https://t.co/xX5lIVUVUj | 4 | text4 | Steph S. ♻️ #Innovation #Disruption \| Marketing Manager \| @Accenture Health & Public Service \| Photography, Emergency Preparedness/Response Volunteer \| Opinions my own. |
2020-01-02 20:03:24 | Public health should strive to attract professionals with broadly applicable data-science skills." | 5 | text5 | ClinEpiDB A global Clinical Epidemiology data resource from collaborators at University of Pennsylvania, University of Georgia & University of Liverpool |
2020-01-02 16:22:32 | A computer algorithm has been shown to be as effective as human radiologists in spotting breast cancer from x-ray images. https://t.co/PNAlhVNsNc via @imperialcollege #GlobalHealth | 6 | text6 | USAID STAR Project @USAID-supported #globalhealth project, implemented by @phidotorg. We engage talented professionals in capacity-building projects at GH orgs. https://t.co/cZvg3KKB8N |
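
The parsing call itself is not shown in this note. A minimal sketch of what it might look like follows, assuming that spaCy and the spacyr package are installed and that the descriptions in results2$desc1 are what gets parsed. The helper names named_descriptions and parse_descriptions are hypothetical and only serve the illustration.

```r
# Sketch only: assumes `results2` has the columns doc_id and desc1 built above.

# Name each description by its doc_id so the parsed output reports
# "text1", "text2", ... in its doc_id column.
named_descriptions <- function(df) {
  stats::setNames(as.character(df$desc1), df$doc_id)
}

# Parse the descriptions with the arguments named in the note.
# Requires a working spaCy installation; nothing runs until this is called.
parse_descriptions <- function(df) {
  spacyr::spacy_initialize()
  spacyr::spacy_parse(
    named_descriptions(df),
    lemma = TRUE,
    tag = TRUE,
    entity = TRUE,
    nounphrase = TRUE
  )
}

# Usage (not run here):
# parsed <- parse_descriptions(results2)
# head(parsed)
```

Passing a named character vector keeps the document ids under the caller's control rather than relying on automatic numbering.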
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.0.16, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
##   doc_id sentence_id token_id      token      lemma   pos tag entity nounphrase
## 1  text1           1        1    IDDjobs     iddjob  NOUN NNS  ORG_B   beg_root
## 2  text1           1        2          :          : PUNCT   :
## 3  text1           1        3 infectious infectious   ADJ  JJ               beg
## 4  text1           1        4    disease    disease  NOUN  NN               mid
## 5  text1           1        5   dynamics    dynamic  NOUN NNS               mid
## 6  text1           1        6       jobs        job  NOUN NNS          end_root
##   whitespace
## 1      FALSE
## 2       TRUE
## 3       TRUE
## 4       TRUE
## 5       TRUE
## 6       TRUE
## flag first-person pronouns and "view"/"personal" lemmas; tag tokens with their entity
parsed1 <- parsed %>%
  mutate(
    pspn = ifelse(lemma == "-PRON-" & token %in% c("I", "my", "am", "mine"), 1,
           ifelse(str_detect(lemma, "[Vv]iew|[Pp]ersonal"), 1, 0)),
    annotated = ifelse(nchar(entity) > 0, paste(token, "[", entity, "]"), token)
  )
head(parsed1)
##   doc_id sentence_id token_id      token      lemma   pos tag entity nounphrase
## 1  text1           1        1    IDDjobs     iddjob  NOUN NNS  ORG_B   beg_root
## 2  text1           1        2          :          : PUNCT   :
## 3  text1           1        3 infectious infectious   ADJ  JJ               beg
## 4  text1           1        4    disease    disease  NOUN  NN               mid
## 5  text1           1        5   dynamics    dynamic  NOUN NNS               mid
## 6  text1           1        6       jobs        job  NOUN NNS          end_root
##   whitespace pspn         annotated
## 1      FALSE    0 IDDjobs [ ORG_B ]
## 2       TRUE    0                 :
## 3       TRUE    0        infectious
## 4       TRUE    0           disease
## 5       TRUE    0          dynamics
## 6       TRUE    0              jobs
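## paste tokens into sentences, then sentences into one description per account
sentences <- parsed1 %>%
  group_by(doc_id, sentence_id) %>%
  mutate(annotated_sentence = paste(annotated, collapse = " ")) %>%
  ungroup() %>%
  select(doc_id, annotated_sentence, pspn) %>%
  ## distinct() keeps one row per sentence and flag value, so a sentence with
  ## both flagged and unflagged tokens appears twice in the description
  distinct() %>%
  group_by(doc_id) %>%
  mutate(
    sumpsp = sum(pspn),
    annotated_description = paste(annotated_sentence, collapse = ". ")
  ) %>%
  ungroup() %>%
  select(doc_id, annotated_description, sumpsp) %>%
  distinct()

head(sentences)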
## # A tibble: 6 x 3
## doc_id annotated_description sumpsp
## <chr> <chr> <dbl>
## 1 text1 "IDDjobs [ ORG_B ] : infectious disease dynamics jobs Find [ … 0
## 2 text10 "rikkert Eenhoorn [ ORG_B ] Corporate [ ORG_I ] Accountmanage… 0
## 3 text100 "Joel [ PERSON_B ] Lindsey [ PERSON_I ] Sr [ PERSON_I ] . Exe… 1
## 4 text1000 "Sven Awege Sven AWEGE. \" Dot Connector \" # healthcare # in… 0
## 5 text1001 "jwindz. Sow the wind and reap the whirlwind" 0
## 6 text1002 "English [ LANGUAGE_B ] @. ECU. At the [ ORG_B ] ECU [ ORG_I … 0
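
To make the flagging and aggregation steps concrete, here is a small base R sketch on hand-written tokens. The token table is invented for illustration and is not real spaCy output. It is also a simplification of the pipeline above: it counts flagged tokens per account directly and does not join sentences with a full stop.

```r
# Hand-written tokens standing in for spacy_parse() output (illustrative only).
tokens <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2", "text2", "text2"),
  token  = c("Acme", "Health", "Ltd", "Jane", "Doe", "Views", "mine"),
  lemma  = c("acme", "health", "ltd", "jane", "doe", "view", "-PRON-"),
  entity = c("ORG_B", "ORG_I", "ORG_I", "PERSON_B", "PERSON_I", "", ""),
  stringsAsFactors = FALSE
)

# Flag first-person pronouns and "view"/"personal" lemmas.
is_pronoun <- tokens$lemma == "-PRON-" & tokens$token %in% c("I", "my", "am", "mine")
is_keyword <- grepl("[Vv]iew|[Pp]ersonal", tokens$lemma)
tokens$pspn <- as.integer(is_pronoun | is_keyword)

# Append the entity tag to each tagged token.
tokens$annotated <- ifelse(
  nchar(tokens$entity) > 0,
  paste(tokens$token, "[", tokens$entity, "]"),
  tokens$token
)

# One row per account: annotated tokens pasted together and flags summed.
descriptions <- data.frame(
  doc_id = sort(unique(tokens$doc_id)),
  stringsAsFactors = FALSE
)
descriptions$annotated_description <- as.vector(
  tapply(tokens$annotated, tokens$doc_id, paste, collapse = " ")
)
descriptions$sumpsp <- as.vector(tapply(tokens$pspn, tokens$doc_id, sum))

print(descriptions)
```

The second account ends up with a PERSON tag and a non-zero flag count, which are the two signals the labelling step relies on.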
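## label an account "person" if it has a PERSON entity or a first-person flag, otherwise "org"
sentences <- sentences %>%
  mutate(person = ifelse(str_detect(annotated_description, "PERSON") | sumpsp > 0, "person", "org"))

## inspect a random sample of 20 accounts
sample <- sample_n(sentences, 20)
sample %>%
  knitr::kable()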
doc_id | annotated_description | sumpsp | person |
---|---|---|---|
text2078 | CLEPH The [ ORG_B ] Centre [ ORG_I ] for [ ORG_I ] Law [ ORG_I ] Enforcement [ ORG_I ] and [ ORG_I ] Public [ ORG_I ] Health [ ORG_I ] is committed to advancing knowledge , building partnerships and undertaking projects in the joint various fields | 0 | org |
text678 | Martin [ PERSON_B ] Anderson [ PERSON_I ] Thanos [ PERSON_I ] 🌈 [ PERSON_I ] Champion [ PERSON_I ] # Alchemist # Industry40 [ ORG_B ] Clinical [ ORG_I ] # Partitioner [ GPE_B ] # [ MONEY_B ] edu [ MONEY_I ] # [ MONEY_B ] tech [ MONEY_I ] # [ MONEY_I ] startup ER [ ORG_B ] Medic [ ORG_I ] , @PulseMedic @symbol7th [ PRODUCT_B ] , [ PRODUCT_I ] # Developer # digitalhealth # Evangelist [ MONEY_B ] , # bi | 0 | person |
text536 | Artificial Intelligence | 0 | org |
text1347 | Public Health Wales. We are the national public health agency in Wales [ GPE_B ] .. Cymraeg [ PERSON_B ] : @IechydCyhoeddus [ CARDINAL_B ] | 0 | person |
text423 | Global [ ORG_B ] Capital [ ORG_I ] Guide [ ORG_I ]. The Global Capital Guide | 0 | org |
text1435 | F. [ PERSON_B ] Perry [ PERSON_I ] Wilson [ PERSON_I ] , MD [ PERSON_B ] MSCE [ PERSON_I ] # Physician [ NORP_B ] / # researcher @Yale .. Columnist [ ORG_B ] @medscape [ ORG_I ] , @huffpost , @journalsentinel [ NORP_B ] .. Evidence - based .. Science [ ORG_B ] - based . | 0 | person |
text321 | Hennepin [ GPE_B ] County [ GPE_I ] Jobs [ GPE_I ]. The official Twitter [ GPE_B ] account for Hennepin [ GPE_B ] County [ GPE_I ] careers & internships . [ GPE_B ]. Tweets by HC [ ORG_B ] Recruiters [ ORG_I ] . . Follow us to learn more about careers at Hennepin [ GPE_B ] County [ GPE_I ] . | 0 | org |
text1630 | Rob [ PERSON_B ] Butler [ PERSON_I ] @TBS_Canada Bridgebuilder [ GPE_B ] inside @DigiEnablement [ ORG_B ] • @LeadersGC • # GCDigital [ MONEY_B ] • # GC [ ORG_B ] 🇨 🇦 Event Manager & Media Specialist • @Disney [ ORG_B ] Rep [ ORG_I ]. •. @GOFOBO [ ORG_B ]. Host •. View [ ORG_B ] 💯 mine. View [ ORG_B ] 💯 mine | 1 | person |
text487 | Mycroft [ ORG_B ] AI [ ORG_I ] Open [ ORG_I ] Source [ ORG_I ] Voice [ ORG_I ] Assistant \| An [ ORG_B ] AI [ ORG_I ]. For Everyone # artificialintelligence # [ CARDINAL_B ] ai # [ MONEY_B ] opensource [ MONEY_I ] # [ CARDINAL_B ] ownyourdata # useragency | 0 | org |
text2397 | Wellco [ ORG_B ] Wellco [ ORG_I ] ™ [ ORG_I ] is an award - winning health & wellness firm that helps employers measurably improve health care engagement , analytics and high - value care . | 0 | org |
text2425 | FMI [ PERSON_B ] Representing [ PERSON_I ] the voice of food retail across government relations ; food protection ; industry relations ; health and wellness ; sustainability ; and communications . | 0 | person |
text3136 | MedeAnalytics [ ORG_B ] We help healthcare make even smarter decisions by rapidly orchestrating client data sources into our intelligent , healthcare - specific analytics platform . | 0 | org |
text2648 | SAIL LABS Technology SAIL LABS Technology. GmbH is a leading provider of Open [ ORG_B ] Source [ ORG_I ] Intelligence [ ORG_I ] ( # OSINT [ MONEY_B ] ) systems , Automatic [ ORG_B ] Speech [ ORG_I ] Recognition [ ORG_I ] ( ASR [ ORG_B ] ) , media monitoring and analysis . | 0 | org |
text61 | Go Blue [ ORG_B ] Enjoy [ ORG_I ] factual debates .. If you hate data , do n’t tweet at me [ PERSON_B ]. Warning :. bad MSU [ ORG_B ] accounts love to block me | 0 | person |
text3175 | Susan [ PERSON_B ] Shriner [ PERSON_I ] Wildlife [ PERSON_I ] Epidemiologist [ PERSON_I ] / [ PERSON_I ] Disease [ PERSON_I ] Ecologist [ PERSON_I ] .. Avian [ NORP_B ] # influenza , avian blood parasites , AMR [ ORG_B ] , and spatial ecology .. Birds rule .. # ornithology she / hers | 0 | person |
text2631 | HealthViewAI | 1 | person |
text1504 | IAPHS Improving the Health of Populations Through Science and Innovation .. # pophealth # populationhealth # [ MONEY_B ] science [ MONEY_I ] # [ MONEY_I ] research. # [ ORG_B ] sdoh # pophealth2019 # sociology | 0 | org |
text2213 | Saša Mitrović Principal Sales Engineer # MarkLogic # NoSQL [ GPE_B ] # DACH [ GPE_B ] , plays # basketball and goes # jogging | 0 | org |
text3008 | Raj [ PERSON_B ] Bhandari [ PERSON_I ] # DigitalTransformation [ ORG_B ] Services [ ORG_I ] Sales [ ORG_I ] , # Industrial [ ORG_B ] and [ ORG_I ] # [ ORG_I ] Manufacturing [ ORG_I ] .. Always # learning , Always in # Vismaad [ MONEY_B ] . | 0 | person |
text2165 | Freedman [ PERSON_B ] HealthCare [ PERSON_I ] Established in 2005 [ DATE_B ] , FHC [ ORG_B ] is a national consulting firm of # [ MONEY_B ] APCD [ MONEY_I ] experts who work with clients to put # [ MONEY_B ] health [ MONEY_I ] # data to work to solve complex # healthcare problems | 0 | person |
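
The labelling rule can be tried in isolation with the base R sketch below. The function name label_account and the example strings are hypothetical, loosely modelled on the sample above. The third example shows how a single stray PERSON tag is enough to label an organisation as a person, as happens with text1347 in the sample.

```r
# Label a description "person" if it carries a PERSON entity tag or at least
# one first-person flag, otherwise "org" (same rule as in the note).
label_account <- function(annotated_description, sumpsp) {
  has_person_tag <- grepl("PERSON", annotated_description, fixed = TRUE)
  ifelse(has_person_tag | sumpsp > 0, "person", "org")
}

# Hand-written inputs (illustrative only, not taken from the data).
examples <- data.frame(
  annotated_description = c(
    "Jane [ PERSON_B ] Doe [ PERSON_I ] epidemiologist",
    "Acme [ ORG_B ] Health [ ORG_I ] analytics firm",
    "Acme [ ORG_B ] Health [ ORG_I ] Cymraeg [ PERSON_B ]",
    "data nerd , views mine"
  ),
  sumpsp = c(0, 0, 0, 1),
  stringsAsFactors = FALSE
)
examples$person <- label_account(examples$annotated_description, examples$sumpsp)

# Count how many examples received each label.
print(table(examples$person))
```

Because the rule is a simple OR over two signals, any false PERSON entity or any "view"-like lemma in an organisation's name pushes the account into the person group, so a manual check of a sample remains useful.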