Can we distinguish personal from organisational tweets?

Background

It is often useful to distinguish between personal and organisational tweets in document or discourse analysis of Twitter.

This note sets out some ideas for doing this.

We use

3000+ tweets from @PublicHealthBot. This is a retweet engine, so we are interested in the characteristics of retweeters rather than primary tweeters.
Some R packages, notably textfeatures and spacyr.

Extract tweets

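library(rtweet)        # get_timeline()
library(tidyverse)     # dplyr, stringr and tidyr functions used below
library(textfeatures)  # textfeatures(), used in idea 2

q <- "@PublicHealthBot"

search <- "social media"  # not used in the rest of this note

results <- get_timeline(q, n = 18000, retry_on_ratelimit = TRUE)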

We are primarily interested in the user descriptions and user names, in this case the retweet_description and retweet_name columns.

Idea 1 - disclaimers

Personal users who belong to organisations will often add disclaimers to their description - these often take the form “views my own…”

views <- results %>%
  filter(str_detect(retweet_description, "[Vv]iew"))

This keeps tweets where the retweet description contains the string view or View. That covers views and Views, but also longer words such as reviews. Of 3172 tweets, 252 match.
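Because the pattern is unanchored, it also matches inside longer words. The sketch below uses base R's grepl() on invented descriptions to compare the pattern above with a stricter one; the stricter pattern is only an assumption about what a disclaimer looks like, not a tested rule.

```r
# Toy descriptions, invented for illustration (not taken from the data)
descriptions <- c(
  "Epidemiologist. Views are my own.",
  "Latest news, reviews and reports on biotech",
  "View our terms and conditions",
  "All opinions mine"
)

# The pattern used above: "view" or "View" anywhere, even inside a word
loose <- grepl("[Vv]iew", descriptions)

# Hypothetical stricter pattern: the whole word view/views, followed in the
# same sentence by "own" or "mine"
strict <- grepl("\\bviews?\\b[^.]*\\b(own|mine)\\b", descriptions,
                ignore.case = TRUE, perl = TRUE)

data.frame(descriptions, loose, strict)
```

Only the first description passes the stricter pattern, while the loose pattern also flags the second and third. The stricter pattern would still miss disclaimers worded differently, such as "opinions mine".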

We can see how successful this strategy is by reviewing a sample of the descriptions.

disclaimer_sample <- views %>%
  sample_n(10) %>%
  select(retweet_name, name, retweet_description)

disclaimer_sample %>%
  gt::gt()

retweet_name	name	retweet_description
Vivian OTA WANG	Public Health Data Science	Deputy Director, Office of Data Sharing, Center for Biomedical Informatics & IT, National Cancer Institute- Empowerment Through FAIR Data Sharing- Views be mine
Brianna Lindsay	Public Health Data Science	Infectious Disease Epidemiologist, Runner, Data Data Data. Views expressed here are my own. I generally retweet things I intend to read later but never get to.
Michelle Fenner	Public Health Data Science	Michelle Fenner - Marketing and Project Manager. Perpetual Mediator. Views are 100% my own!
Rob McCargow	Public Health Data Science	Director of #AI @PwC_UK • #ResponsibleAI @PwC • Advisor @APPG_AI @IEEESA @techUK • @TEDx Speaker • Fellow @theRSAorg • #HeForShe • #Vegan🌱🌍 • all views my own
Melisa Valverde	Public Health Data Science	Mom. Wife. Communicator in health care. Social scientist. Public health student. Quasi-digital native. Sci-fi fan. Views are my own. RTs not an endorsement.
Muin J. Khoury	Public Health Data Science	Physician, Epidemiologist, Geneticist, Public Health Professional. Passionate About Using Science to Improve Health for All. Views= My own. Links≠ Endorsements
IEEE	Public Health Data Science	Advancing technological innovation and excellence for the benefit of humanity. View the IEEE social media terms and conditions: https://t.co/xzhVUOgVE8
Brianna Lindsay	Public Health Data Science	Infectious Disease Epidemiologist, Runner, Data Data Data. Views expressed here are my own. I generally retweet things I intend to read later but never get to.
Biotech Mergers & Acquisitions M&A	Public Health Data Science	BioPortfolio brings you the latest news, reviews, bogs and reports on #Mergers & #Acquisitions #MandA in the #biotech-pharma industry.
The Parliament Magazine	Public Health Data Science	By MEPs, for MEPs - in-depth news, views and analysis on the latest EU issues from The Parliament Magazine.

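In the sample, most retweeters are recognisable individuals with a disclaimer, though not all: three of the ten rows are organisations (IEEE, Biotech Mergers & Acquisitions M&A, The Parliament Magazine) whose descriptions match on View, reviews and views.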

We will add a new column to the tweet table (add a feature).

results <- results %>%
  mutate(views = ifelse(str_detect(retweet_description, "[Vv]iew"), 1, 0))

This approach is unlikely to capture all personal users, so we will need additional strategies.

Idea 2 - detect use of personal pronouns

The idea here is that individuals are more likely to use singular personal pronouns like I and me than organisations.

Detecting this can be done with powerful NLP tools, but a much quicker method is to use the textfeatures package, which can count personal pronouns.

features <- textfeatures(results$retweet_description, normalize = FALSE)

## ↪ Counting features in text...
## ↪ Sentiment analysis...
## ↪ Parts of speech...
## ↪ Word dimensions started
## ✔ Job's done!

features

## # A tibble: 3,172 x 134
##    n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
##     <int>     <int>      <int>         <int>      <int>         <int>   <int>
##  1      0         0          5             5          0             0     127
##  2      0         0          1             1          0             0     104
##  3      0         0         11            11          0             0     134
##  4      0         0          0             0          0             0       0
##  5      0         0          0             0          1             1      80
##  6      0         0          0             0          0             0     109
##  7      0         0          4             4          0             0     129
##  8      0         0          4             4          0             0     129
##  9      1         1          8             8          0             0     115
## 10      0         0          0             0          0             0      58
## # … with 3,162 more rows, and 127 more variables: n_uq_chars <int>,
## #   n_commas <int>, n_digits <int>, n_exclaims <int>, n_extraspaces <int>,
## #   n_lowers <int>, n_lowersp <dbl>, n_periods <int>, n_words <int>,
## #   n_uq_words <int>, n_caps <int>, n_nonasciis <int>, n_puncts <int>,
## #   n_capsp <dbl>, n_charsperword <dbl>, sent_afinn <dbl>, sent_bing <dbl>,
## #   sent_syuzhet <dbl>, sent_vader <dbl>, n_polite <dbl>, n_first_person <int>,
## #   n_first_personp <int>, n_second_person <int>, n_second_personp <int>,
## #   n_third_person <int>, n_tobe <int>, n_prepositions <int>, w1 <dbl>,
## #   w2 <dbl>, w3 <dbl>, w4 <dbl>, w5 <dbl>, w6 <dbl>, w7 <dbl>, w8 <dbl>,
## #   w9 <dbl>, w10 <dbl>, w11 <dbl>, w12 <dbl>, w13 <dbl>, w14 <dbl>, w15 <dbl>,
## #   w16 <dbl>, w17 <dbl>, w18 <dbl>, w19 <dbl>, w20 <dbl>, w21 <dbl>,
## #   w22 <dbl>, w23 <dbl>, w24 <dbl>, w25 <dbl>, w26 <dbl>, w27 <dbl>,
## #   w28 <dbl>, w29 <dbl>, w30 <dbl>, w31 <dbl>, w32 <dbl>, w33 <dbl>,
## #   w34 <dbl>, w35 <dbl>, w36 <dbl>, w37 <dbl>, w38 <dbl>, w39 <dbl>,
## #   w40 <dbl>, w41 <dbl>, w42 <dbl>, w43 <dbl>, w44 <dbl>, w45 <dbl>,
## #   w46 <dbl>, w47 <dbl>, w48 <dbl>, w49 <dbl>, w50 <dbl>, w51 <dbl>,
## #   w52 <dbl>, w53 <dbl>, w54 <dbl>, w55 <dbl>, w56 <dbl>, w57 <dbl>,
## #   w58 <dbl>, w59 <dbl>, w60 <dbl>, w61 <dbl>, w62 <dbl>, w63 <dbl>,
## #   w64 <dbl>, w65 <dbl>, w66 <dbl>, w67 <dbl>, w68 <dbl>, w69 <dbl>,
## #   w70 <dbl>, w71 <dbl>, w72 <dbl>, w73 <dbl>, …

textfeatures creates a range of other text features for each description which might be useful. It also attaches sentiment scores from a variety of sentiment algorithms (the sent_afinn, sent_bing, sent_syuzhet and sent_vader columns).

We’ll combine the features generated this way with the original tweet data and add a new column flagging descriptions which have at least one first person singular pronoun.

results1 <- data.frame(cbind(results, features))

results1 <- results1 %>%
  mutate(sing_pron = ifelse(n_first_person > 0, 1, 0))
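For intuition about what a first person count measures, here is a minimal base R sketch. It is not how textfeatures computes n_first_person; the word list and the helper count_first_person() are assumptions made for illustration.

```r
# Hypothetical word list; textfeatures may use a different one
first_person <- c("i", "me", "my", "mine", "myself")

count_first_person <- function(text) {
  # lower-case, then split on anything that is not a letter or apostrophe
  tokens <- strsplit(tolower(text), "[^a-z']+")
  # count the tokens in each description that appear in the word list
  vapply(tokens, function(tok) sum(tok %in% first_person), integer(1))
}

# Toy descriptions, invented for illustration
bios <- c(
  "Runner. I retweet things I intend to read. Views my own.",
  "Our goal is to increase awareness in our community."
)

count_first_person(bios)
```

The first description contains "I" twice and "my" once, so it gets a count of 3; the second uses only plural forms and gets 0. Contractions such as "I'm" are not counted by this simple tokeniser.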

We can summarise the overlap between tweets with disclaimers and those which have first person singular references.

res1_count <- results1 %>%
  count(views, sing_pron) 

res1_count %>%
  gt::gt()

views	sing_pron	n
0	0	2597
0	1	323
1	0	73
1	1	179
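The table can be turned into simple overlap shares. The sketch below re-enters the four counts by hand and uses base R only; the variable names are illustrative.

```r
# Counts copied from the table above
overlap <- data.frame(
  views     = c(0, 0, 1, 1),
  sing_pron = c(0, 1, 0, 1),
  n         = c(2597, 323, 73, 179)
)

n_disclaimer <- sum(overlap$n[overlap$views == 1])      # 252
n_pronoun    <- sum(overlap$n[overlap$sing_pron == 1])  # 502
n_both       <- overlap$n[overlap$views == 1 & overlap$sing_pron == 1]  # 179

# Share of disclaimer descriptions that also use a first person pronoun
round(n_both / n_disclaimer, 2)  # 0.71

# Share of first person descriptions that also carry a disclaimer
round(n_both / n_pronoun, 2)     # 0.36
```

So most descriptions with a disclaimer also use a first person singular pronoun, but only about a third of descriptions with such a pronoun carry a disclaimer.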

Idea 3 - named entity recognition

Using Natural Language Processing techniques…

This idea makes use of the spacyr package, an R wrapper for spaCy. spaCy's models are trained on large text corpora to recognise categories including the names of people and organisations. We can then annotate each retweeter's name and description with labels showing whether it mentions a person, an organisation (or both), and combine these annotations with those from the first two ideas. This gives a number of tweet categories, summarised in the table.

library(spacyr)

results1 <- results1 %>%
  mutate(desc1 = paste(retweet_name, retweet_description), 
         row = row_number(), doc_id = paste0("text", row)
  )

parsed <- spacy_parse(results1$desc1, tag = TRUE, entity = TRUE, nounphrase = TRUE)

## Found 'spacy_condaenv'. spacyr will use this environment

## successfully initialized (spaCy Version: 2.0.16, language model: en)

## (python options: type = "condaenv", value = "spacy_condaenv")

head(parsed)

##   doc_id sentence_id token_id   token   lemma   pos tag   entity nounphrase
## 1  text1           1        1  Darryl  darryl PROPN NNP PERSON_B        beg
## 2  text1           1        2 Pieroni pieroni PROPN NNP PERSON_I        mid
## 3  text1           1        3 Husband husband PROPN NNP PERSON_I   end_root
## 4  text1           1        4       ,       , PUNCT   ,                    
## 5  text1           1        5  Father  father PROPN NNP            beg_root
## 6  text1           1        6       ,       , PUNCT   ,                    
##   whitespace
## 1       TRUE
## 2       TRUE
## 3      FALSE
## 4       TRUE
## 5      FALSE
## 6       TRUE

person <- parsed %>%
  #filter(str_detect(entity, "^P")) %>%
  select(doc_id, entity) %>%
  distinct() %>%
  full_join(results1)

## Joining, by = "doc_id"

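person_extract <- person %>%
  select(doc_id, desc1, entity, views, sing_pron) %>%
  # keep only tokens tagged as part of an organisation or person entity
  filter(entity %in% c("ORG_I", "ORG_B", "PERSON_I", "PERSON_B", "      ")) %>%
  # collapse the _B / _I position tags to the entity type
  mutate(ent = str_extract(entity, "ORG|PERSON")) %>%
  select(-entity) %>%
  distinct() %>%
  # n = number of distinct entity types (1 or 2) found in each description
  group_by(doc_id) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  # one row per description; ORG and PERSON hold n where found, 0 otherwise
  spread(ent, n, fill = 0)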

counts <- person_extract %>%
  count(ORG, PERSON, sing_pron, views) %>%
  mutate(rowid = row_number()) 

counts %>%
  gt::gt()

ORG	PERSON	sing_pron	views	n	rowid
0	1	0	0	379	1
0	1	0	1	6	2
0	1	1	0	95	3
0	1	1	1	40	4
1	0	0	0	1065	5
1	0	0	1	40	6
1	0	1	0	68	7
1	0	1	1	40	8
2	2	0	0	776	9
2	2	0	1	18	10
2	2	1	0	136	11
2	2	1	1	96	12
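In the table above, a value of 2 in both ORG and PERSON marks descriptions where both entity types were found, because n = n() counts the entity types per description before spread() fills the columns. A 0/1 encoding may be easier to read. The following is a minimal base R sketch on invented toy data; parsed_toy only mimics two columns of the spacy_parse() output.

```r
# Toy stand-in for the token-level output of spacy_parse() (values invented)
parsed_toy <- data.frame(
  doc_id = c("text1", "text1", "text2", "text3", "text3", "text3"),
  entity = c("PERSON_B", "PERSON_I", "ORG_B", "PERSON_B", "ORG_B", "ORG_I"),
  stringsAsFactors = FALSE
)

# Strip the _B / _I position suffix to get the entity type
parsed_toy$ent <- sub("_[BI]$", "", parsed_toy$entity)

# Cross-tabulate documents against entity types, then reduce to 0/1 flags
tab <- table(parsed_toy$doc_id, parsed_toy$ent)
flags <- data.frame(
  doc_id = rownames(tab),
  ORG    = as.integer(tab[, "ORG"] > 0),
  PERSON = as.integer(tab[, "PERSON"] > 0)
)

flags
```

Here text1 is flagged as PERSON only, text2 as ORG only, and text3 as both.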

prob <- person_extract %>%
  filter(views == 0, sing_pron == 0, PERSON != 0) %>%
  distinct()

Finalising

NER appears to identify many individuals that the first two ideas miss. Of the 2597 tweets whose description contains neither the word view(s) nor a first person singular pronoun, NER identifies 1155 as pertaining to a person.
By contrast, where NER does not identify a tweet as originating from a person, ideas 1 and 2 identify 148 as originating from a person.
If we classify a tweeter as an individual when NER identifies a person, or the description has a disclaimer, or it contains a first person singular pronoun, the counts table above gives 1694 of the 2759 tweets in which spaCy found a person or organisation (1546 flagged by NER, plus the 148 flagged only by ideas 1 and 2).
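The figures 1155 and 148 can be checked against the counts table. The sketch below re-enters that table by hand in base R; the names ner_only and ideas_only are illustrative.

```r
# Counts copied from the ORG / PERSON / sing_pron / views table above
counts <- data.frame(
  ORG       = c(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2),
  PERSON    = c(1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2),
  sing_pron = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1),
  views     = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1),
  n         = c(379, 6, 95, 40, 1065, 40, 68, 40, 776, 18, 136, 96)
)

# Person found by NER, but no disclaimer and no first person pronoun
ner_only <- with(counts, sum(n[PERSON > 0 & views == 0 & sing_pron == 0]))
ner_only    # 1155

# No person found by NER, but a disclaimer or a first person pronoun
ideas_only <- with(counts, sum(n[PERSON == 0 & (views == 1 | sing_pron == 1)]))
ideas_only  # 148
```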

person_tweet <- person_extract %>%
  mutate(person = ifelse(PERSON >0, 1,
                          ifelse(views ==1, 1,
                                 ifelse(sing_pron == 1, 1, 0))))
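The nested ifelse() above amounts to a logical OR of the three signals. As a minimal sketch on invented toy rows, the two forms can be compared directly; toy, nested and flat are illustrative names.

```r
# Toy rows covering each signal on its own, plus a row with none (invented)
toy <- data.frame(
  PERSON    = c(2, 1, 0, 0, 0),
  views     = c(0, 0, 1, 0, 0),
  sing_pron = c(0, 1, 0, 1, 0)
)

# The nested form used above
nested <- ifelse(toy$PERSON > 0, 1,
                 ifelse(toy$views == 1, 1,
                        ifelse(toy$sing_pron == 1, 1, 0)))

# Equivalent flat form: any one of the three signals is enough
flat <- as.numeric(toy$PERSON > 0 | toy$views == 1 | toy$sing_pron == 1)

identical(nested, flat)
```

The flat form makes it easier to add or drop a signal later. The two forms can differ when a column contains NA, which this toy example does not cover.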

person_tweet %>%
  filter(person == 1) %>%
  select(desc1) %>%
  head(10) %>%
  gt::gt()

desc1
Darryl Pieroni Husband, Father, Entrepreneur, Internet Scholar, Amateur Chef and avid Tennis player. #olemiss #digitaltransformation #CX #DX #digitalstrategy
AI•Enhanced! Get more followers (automatically) with AI-Enhanced (for Twitter)
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins
UHCW Innovation Team Making innovation happen at UHCW 💡💭🏥 Our own views and interests.
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi