It is often useful to distinguish between personal and organisational tweets in document or discourse analysis of Twitter.
This note sets out some ideas for doing this.
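We use the rtweet, textfeatures and spacyr packages, with the tidyverse for data manipulation.
library(tidyverse)  # dplyr, stringr and tidyr are used below
library(rtweet)
library(textfeatures)
q <- "@PublicHealthBot"
search <- "social media"
results <- get_timeline(q, n = 18000, retry_on_ratelimit = TRUE)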
We are primarily interested in the user descriptions (in this case retweet_description) and usernames.
views <- results %>%
  filter(str_detect(retweet_description, "[Vv]iew"))
This keeps tweets whose retweet description contains the string view or View. The match is on substrings, so it picks up views and Views but also longer words such as reviews. Of 3172 tweets, 252 match.
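Because the pattern is a plain substring match, it can also fire on words that merely contain "view". Here is a minimal base R sketch of the difference a word boundary makes, using fragments of descriptions from this dataset. The stricter pattern is an illustrative assumption, not the pattern used in this note.

```r
# Fragments of retweet descriptions taken from the sample in this note
descriptions <- c(
  "Views expressed here are my own.",
  "all views my own",
  "the latest news, reviews, bogs and reports",
  "View the IEEE social media terms and conditions"
)

# Pattern used in this note: matches any occurrence of "view" or "View"
loose <- grepl("[Vv]iew", descriptions)

# Assumed stricter alternative: "view(s)" as a whole word only
strict <- grepl("\\b[Vv]iews?\\b", descriptions, perl = TRUE)

print(data.frame(descriptions, loose, strict))
```

The stricter pattern drops "reviews" but still matches an organisational description that starts with "View the ...", so tuning the pattern alone will not separate the two groups.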
We can see how successful this strategy is by reviewing a sample of the descriptions.
disclaimer_sample <- views %>%
  sample_n(10) %>%
  select(retweet_name, name, retweet_description)
disclaimer_sample %>%
  gt::gt()
retweet_name | name | retweet_description |
---|---|---|
Vivian OTA WANG | Public Health Data Science | Deputy Director, Office of Data Sharing, Center for Biomedical Informatics & IT, National Cancer Institute- Empowerment Through FAIR Data Sharing- Views be mine |
Brianna Lindsay | Public Health Data Science | Infectious Disease Epidemiologist, Runner, Data Data Data. Views expressed here are my own. I generally retweet things I intend to read later but never get to. |
Michelle Fenner | Public Health Data Science | Michelle Fenner - Marketing and Project Manager. Perpetual Mediator. Views are 100% my own! |
Rob McCargow | Public Health Data Science | Director of #AI @PwC_UK • #ResponsibleAI @PwC • Advisor @APPG_AI @IEEESA @techUK • @TEDx Speaker • Fellow @theRSAorg • #HeForShe • #Vegan🌱🌍 • all views my own |
Melisa Valverde | Public Health Data Science | Mom. Wife. Communicator in health care. Social scientist. Public health student. Quasi-digital native. Sci-fi fan. *Views are my own. RTs not an endorsement.* |
Muin J. Khoury | Public Health Data Science | Physician, Epidemiologist, Geneticist, Public Health Professional. Passionate About Using Science to Improve Health for All. Views= My own. Links≠ Endorsements |
IEEE | Public Health Data Science | Advancing technological innovation and excellence for the benefit of humanity. View the IEEE social media terms and conditions: https://t.co/xzhVUOgVE8 |
Brianna Lindsay | Public Health Data Science | Infectious Disease Epidemiologist, Runner, Data Data Data. Views expressed here are my own. I generally retweet things I intend to read later but never get to. |
Biotech Mergers & Acquisitions M&A | Public Health Data Science | BioPortfolio brings you the latest news, reviews, bogs and reports on #Mergers & #Acquisitions #MandA in the #biotech-pharma industry. |
The Parliament Magazine | Public Health Data Science | By MEPs, for MEPs - in-depth news, views and analysis on the latest EU issues from The Parliament Magazine. |
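In the sample, most retweeters have recognisable individual names, though not all do: IEEE, a mergers and acquisitions news feed and The Parliament Magazine are organisations whose descriptions match through words such as View, reviews and views.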
We will add a new column to the tweet table (add a feature).
results <- results %>%
  mutate(views = ifelse(str_detect(retweet_description, "[Vv]iew"), 1, 0))
This approach is unlikely to identify all personal users, so we will need additional strategies.
The idea here is that individuals are more likely than organisations to use singular personal pronouns like I and me.
Detecting this can be done with powerful NLP tools, but a much quicker method is to use the textfeatures package, which can count personal pronouns.
features <- textfeatures(results$retweet_description, normalize = FALSE)
## ↪ Counting features in text...
## ↪ Sentiment analysis...
## ↪ Parts of speech...
## ↪ Word dimensions started
## ✔ Job's done!
features
## # A tibble: 3,172 x 134
## n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 5 5 0 0 127
## 2 0 0 1 1 0 0 104
## 3 0 0 11 11 0 0 134
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 1 1 80
## 6 0 0 0 0 0 0 109
## 7 0 0 4 4 0 0 129
## 8 0 0 4 4 0 0 129
## 9 1 1 8 8 0 0 115
## 10 0 0 0 0 0 0 58
## # … with 3,162 more rows, and 127 more variables: n_uq_chars <int>,
## # n_commas <int>, n_digits <int>, n_exclaims <int>, n_extraspaces <int>,
## # n_lowers <int>, n_lowersp <dbl>, n_periods <int>, n_words <int>,
## # n_uq_words <int>, n_caps <int>, n_nonasciis <int>, n_puncts <int>,
## # n_capsp <dbl>, n_charsperword <dbl>, sent_afinn <dbl>, sent_bing <dbl>,
## # sent_syuzhet <dbl>, sent_vader <dbl>, n_polite <dbl>, n_first_person <int>,
## # n_first_personp <int>, n_second_person <int>, n_second_personp <int>,
## # n_third_person <int>, n_tobe <int>, n_prepositions <int>, w1 <dbl>,
## # w2 <dbl>, w3 <dbl>, w4 <dbl>, w5 <dbl>, w6 <dbl>, w7 <dbl>, w8 <dbl>,
## # w9 <dbl>, w10 <dbl>, w11 <dbl>, w12 <dbl>, w13 <dbl>, w14 <dbl>, w15 <dbl>,
## # w16 <dbl>, w17 <dbl>, w18 <dbl>, w19 <dbl>, w20 <dbl>, w21 <dbl>,
## # w22 <dbl>, w23 <dbl>, w24 <dbl>, w25 <dbl>, w26 <dbl>, w27 <dbl>,
## # w28 <dbl>, w29 <dbl>, w30 <dbl>, w31 <dbl>, w32 <dbl>, w33 <dbl>,
## # w34 <dbl>, w35 <dbl>, w36 <dbl>, w37 <dbl>, w38 <dbl>, w39 <dbl>,
## # w40 <dbl>, w41 <dbl>, w42 <dbl>, w43 <dbl>, w44 <dbl>, w45 <dbl>,
## # w46 <dbl>, w47 <dbl>, w48 <dbl>, w49 <dbl>, w50 <dbl>, w51 <dbl>,
## # w52 <dbl>, w53 <dbl>, w54 <dbl>, w55 <dbl>, w56 <dbl>, w57 <dbl>,
## # w58 <dbl>, w59 <dbl>, w60 <dbl>, w61 <dbl>, w62 <dbl>, w63 <dbl>,
## # w64 <dbl>, w65 <dbl>, w66 <dbl>, w67 <dbl>, w68 <dbl>, w69 <dbl>,
## # w70 <dbl>, w71 <dbl>, w72 <dbl>, w73 <dbl>, …
textfeatures creates a range of other text features for each description which might be useful. It also attaches sentiment scores using a variety of sentiment algorithms.
We'll combine the features generated this way with the original tweet data and add a new column flagging tweets whose description contains at least one first person singular pronoun.
results1 <- data.frame(cbind(results, features))
results1 <- results1 %>%
mutate(sing_pron = ifelse(n_first_person > 0, 1, 0))
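To make the pronoun flag concrete, here is a minimal base R sketch that counts first person singular pronouns directly. The function name `count_first_person` and the word list are assumptions for illustration. The list textfeatures uses internally may differ.

```r
# Hypothetical helper: count first person singular pronouns in each text
count_first_person <- function(text) {
  pronouns <- c("i", "me", "my", "mine", "myself")  # assumed word list
  # Lower-case, then split on anything that is not a letter or apostrophe
  tokens <- strsplit(tolower(text), "[^a-z']+")
  vapply(tokens, function(tok) sum(tok %in% pronouns), integer(1))
}

descriptions <- c(
  "Views expressed here are my own. I generally retweet things I intend to read later.",
  "Advancing technological innovation and excellence for the benefit of humanity."
)

n_first <- count_first_person(descriptions)
sing_pron <- ifelse(n_first > 0, 1, 0)  # same flag logic as in the note
print(data.frame(n_first, sing_pron))
```

Contractions such as "I'm" are kept as single tokens here and so are not counted. A fuller word list would need to handle them.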
We can summarise the overlap between tweets with disclaimers and those with first person singular references.
res1_count <- results1 %>%
count(views, sing_pron)
res1_count %>%
gt::gt()
views | sing_pron | n |
---|---|---|
0 | 0 | 2597 |
0 | 1 | 323 |
1 | 0 | 73 |
1 | 1 | 179 |
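The counts in this table can be turned into shares to make the overlap easier to read. This is a small base R sketch, with the four counts copied from the table. The object names are illustrative.

```r
# Counts copied from the views x sing_pron table
overlap <- data.frame(
  views     = c(0, 0, 1, 1),
  sing_pron = c(0, 1, 0, 1),
  n         = c(2597, 323, 73, 179)
)

# Share of descriptions with a singular pronoun, with and without a disclaimer
pron_share <- with(overlap, c(
  no_disclaimer = n[views == 0 & sing_pron == 1] / sum(n[views == 0]),
  disclaimer    = n[views == 1 & sing_pron == 1] / sum(n[views == 1])
))

print(round(100 * pron_share, 1))
```

On these counts, roughly 71% of descriptions with a disclaimer also contain a first person singular pronoun, compared with about 11% of those without one.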
Using Natural Language Processing techniques…
This makes use of the spacyr package, an R interface to spaCy. spaCy's language models are trained on very large text datasets to recognise categories of entity, including names of people and organisations. We can then annotate each retweeter's name and description with entity labels indicating whether it refers to a person, an organisation, or both, and combine these annotations with the features from the first two ideas. This gives a number of tweet categories, summarised in the table.
library(spacyr)
results1 <- results1 %>%
mutate(desc1 = paste(retweet_name, retweet_description),
row = row_number(), doc_id = paste0("text", row)
)
parsed <- spacy_parse(results1$desc1, tag = TRUE, entity = TRUE, nounphrase = TRUE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.0.16, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
head(parsed)
## doc_id sentence_id token_id token lemma pos tag entity nounphrase
## 1 text1 1 1 Darryl darryl PROPN NNP PERSON_B beg
## 2 text1 1 2 Pieroni pieroni PROPN NNP PERSON_I mid
## 3 text1 1 3 Husband husband PROPN NNP PERSON_I end_root
## 4 text1 1 4 , , PUNCT ,
## 5 text1 1 5 Father father PROPN NNP beg_root
## 6 text1 1 6 , , PUNCT ,
## whitespace
## 1 TRUE
## 2 TRUE
## 3 FALSE
## 4 TRUE
## 5 FALSE
## 6 TRUE
person <- parsed %>%
#filter(str_detect(entity, "^P")) %>%
select(doc_id, entity) %>%
distinct() %>%
full_join(results1)
## Joining, by = "doc_id"
person_extract <- person %>%
  select(doc_id, desc1, entity, views, sing_pron) %>%
  filter(entity %in% c("ORG_I", "ORG_B", "PERSON_I", "PERSON_B", " ")) %>%
  # collapse begin/inside labels (e.g. PERSON_B, PERSON_I) to the entity type
  mutate(ent = str_extract(entity, "ORG|PERSON")) %>%
  select(-entity) %>%
  distinct() %>%
  group_by(doc_id) %>%
  # n = number of distinct entity types found in the description (1 or 2)
  mutate(n = n()) %>%
  ungroup() %>%
  spread(ent, n, fill = 0)
counts <- person_extract %>%
count(ORG, PERSON, sing_pron, views) %>%
mutate(rowid = row_number())
counts %>%
gt::gt()
ORG | PERSON | sing_pron | views | n | rowid |
---|---|---|---|---|---|
0 | 1 | 0 | 0 | 379 | 1 |
0 | 1 | 0 | 1 | 6 | 2 |
0 | 1 | 1 | 0 | 95 | 3 |
0 | 1 | 1 | 1 | 40 | 4 |
1 | 0 | 0 | 0 | 1065 | 5 |
1 | 0 | 0 | 1 | 40 | 6 |
1 | 0 | 1 | 0 | 68 | 7 |
1 | 0 | 1 | 1 | 40 | 8 |
2 | 2 | 0 | 0 | 776 | 9 |
2 | 2 | 0 | 1 | 18 | 10 |
2 | 2 | 1 | 0 | 136 | 11 |
2 | 2 | 1 | 1 | 96 | 12 |
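In this table, ORG and PERSON are both 2 where a description contains both entity types, because n counts the distinct types found per description and is written into both columns. An alternative sketch, not the code used in this note, gives plain 0/1 flags per entity type. The toy `parsed` data frame below is invented for illustration and only mimics the doc_id and entity columns of the spacy_parse output.

```r
# Toy stand-in for the spacy_parse() output (values invented)
parsed <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text3", "text3", "text3"),
  entity = c("PERSON_B", "PERSON_I", "", "ORG_B", "PERSON_B", "ORG_B", "ORG_I"),
  stringsAsFactors = FALSE
)

# Collapse begin/inside labels to the entity type
ent <- sub("_[BI]$", "", parsed$entity)
keep <- ent %in% c("ORG", "PERSON")

# Cross-tabulate documents by entity type, then reduce counts to 0/1 flags
flags <- as.data.frame.matrix(table(parsed$doc_id[keep], ent[keep]))
flags[] <- lapply(flags, function(x) as.integer(x > 0))

print(flags)
```

With this encoding, a description mentioning both a person and an organisation has ORG = 1 and PERSON = 1, which may be easier to read than the 2/2 coding.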
prob <- person_extract %>%
  filter(views == 0, sing_pron == 0, PERSON != 0) %>%
  distinct()
person_tweet <- person_extract %>%
  mutate(person = ifelse(PERSON > 0, 1,
                  ifelse(views == 1, 1,
                  ifelse(sing_pron == 1, 1, 0))))
person_tweet %>%
filter(person == 1) %>%
select(desc1) %>%
head(10) %>%
gt::gt()
desc1 |
---|
Darryl Pieroni Husband, Father, Entrepreneur, Internet Scholar, Amateur Chef and avid Tennis player. #olemiss #digitaltransformation #CX #DX #digitalstrategy |
AI•Enhanced! Get more followers (automatically) with AI-Enhanced (for Twitter) |
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi |
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins |
UHCW Innovation Team Making innovation happen at UHCW 💡💭🏥 Our own views and interests. |
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi |
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins |
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi |
Pulse Medic Our goal is to increase awareness #AED #first #aid and #CPR in our #community. #digitalhealth #training #course in #SocBiz - #sustainability Mod by 5 admins |
Martin Anderson Thanos🌈 Champion #Alchemist #Industry40 Clinical #Partitioner #edu #tech #startup ER Medic, @PulseMedic @symbol7th , #Developer #digitalhealth #Evangelist, #bi |
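The rule behind the person column flags a description as personal if any one of the three signals is present. Here is a minimal base R sketch of that rule on hypothetical rows shaped like person_extract (the values are invented, and missing values are assumed absent). It shows that the nested ifelse is equivalent to a single logical OR.

```r
# Hypothetical rows in the shape of person_extract (values invented)
toy <- data.frame(
  PERSON    = c(1, 0, 0, 0, 2),
  views     = c(0, 1, 0, 0, 0),
  sing_pron = c(0, 0, 1, 0, 0)
)

# Nested form, as written in the note
nested <- with(toy, ifelse(PERSON > 0, 1,
                    ifelse(views == 1, 1,
                    ifelse(sing_pron == 1, 1, 0))))

# Equivalent flat form: any one signal is enough
flat <- with(toy, as.numeric(PERSON > 0 | views == 1 | sing_pron == 1))

print(data.frame(toy, nested, flat))
```

Because any single signal is sufficient, a description in which spaCy finds both a person and an organisation (the last toy row) is still flagged as personal. This is consistent with organisational accounts appearing among the flagged descriptions.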