
Topic Modelling
Bài toán đặt ra ở trên được gọi là Topic Modelling.
Chúng ta có thể sử dụng Latent
Dirichlet Allocation - LDA để xử lí vấn đề này. Dưới đây là R codes
cho hai tình huống: (1) căn cứ vào title của các papers, và (2) căn cứ
vào abstract của các papers.
Để mô phỏng dữ liệu của hội thảo trên trước hết chúng ta lấy tối đa
500 papers về “Digital Transformation for Social Impacts and ESG
Performance” từ 2024 đến 2025:
#===================================
# Stage 1: Collect paper info
#===================================
# Lấy thông tin về 500 papers:
library(rcrossref)
library(dplyr)
library(stringr)
library(tidytext)
library(topicmodels)
library(SnowballC)
res <- cr_works(query = "Digital Transformation for Social Impacts and ESG Performance",
filter = c(from_pub_date = "2024-01-01", until_pub_date = "2025-12-31"),
limit = 500)
# Select journal-article + abstract:
res$data %>%
filter(type == "journal-article") %>%
filter(!is.na(abstract)) %>%
mutate(id = 1:nrow(.)) -> papersAbstracts
Căn cứ vào Title
Giả sử chúng ta chia 199 papers này thành 6 sub-topics dựa trên LDA.
Dưới đây là R codes:
#---------------------------------
# Topic Modelling based on title
#---------------------------------
papersAbstracts %>%
select(title, doi, id) -> dfTitles
tidy_titles <- dfTitles %>%
unnest_tokens(word, title) %>%
anti_join(get_stopwords())
# Create a document-term matrix:
dtmTitles <- tidy_titles %>%
count(id, word) %>%
cast_dtm(document = id, term = word, value = n)
# Fit LDA with 6 topics:
n_topics <- 6
lda6 <- LDA(dtmTitles,
k = n_topics,
method = "Gibbs",
control = list(burnin = 1000, iter = 2000, seed = 29))
# Extract the per-document topic probabilities:
doc_topics <- tidy(lda6, matrix = "gamma")
# For each document, choose the topic with highest probability:
assigned_topics <- doc_topics %>%
group_by(document) %>%
slice(which.max(gamma)) %>%
ungroup() %>%
mutate(subTopic = paste0("Topic ", topic)) %>%
select(document, subTopic)
# Inspect top terms per topic (top 5):
terms(lda6, 5) %>%
t() %>%
as.data.frame() %>%
mutate(top5Words = paste(V1, V2, V3, V4, V5, sep = ", ")) %>%
mutate(subTopic = row.names(.)) -> dfKeyWords
assigned_topics %>%
full_join(dfKeyWords %>% select(subTopic, top5Words)) %>%
left_join(papersAbstracts %>% mutate(id = as.character(id)), by = c("document" = "id")) -> paperByTitle
Số lượng các bài thuộc về các sub-topic và các từ khóa nổi bật tương
ứng (Table 1):
library(kableExtra) # For presenting table.
paperByTitle %>%
group_by(subTopic, top5Words) %>%
count() %>%
kbl(caption = "Table 1: Sub-topic by LDA", escape = TRUE) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")
Table 1: Sub-topic by LDA
|
subTopic
|
top5Words
|
n
|
|
Topic 1
|
esg, performance, transformation, china, analysis
|
46
|
|
Topic 2
|
performance, impact, enterprise, evidence, role
|
40
|
|
Topic 3
|
esg, digital, performance, financial, governance
|
35
|
|
Topic 4
|
transformation, performance, impacts, mediating, technology
|
26
|
|
Topic 5
|
digital, evidence, transformation, impact, can
|
19
|
|
Topic 6
|
corporate, innovation, green, social, based
|
33
|
Kết Luận
Thay vì sử dụng title chúng ta có thê sử dụng abstract để phân loại
199 papers này vào 6 sub-topic. Thông thường thì abstract sẽ giàu thông
tin hơn và do vậy việc phân loại có thể chính xác hơn, thể hiện đúng hơn
nội dung của từng nhóm. R codes là hoàn toàn tương tự như trên.
Tài Liệu Tham Khảo
Blei D.M., Ng A.Y., Jordan M.I. (2003). Latent Dirichlet Allocation.
Journal of Machine Learning Research, 3, 993–1022.
Phan X.H., Nguyen L.M., Horguchi S. (2008). Learning to Classify
Short and Sparse Text & Web with Hidden Topics from Large-scale Data
Collections. In Proceedings of the 17th International World Wide Web
Conference (WWW 2008), pages 91–100, Beijing, China.
Lu, B., Ott, M., Cardie, C., Tsou, B.K. (2011). Multi-aspect
Sentiment Analysis with Topic Models. In Proceedings of the 2011 IEEE
11th International Conference on Data Mining Workshops, pages 81–88.
LS0tDQp0aXRsZTogIlRvcGljIE1vZGVsbGluZzogQSBSZWFsLVdvcmxkIEFwcGxpY2F0aW9uIiAgDQphdXRob3I6ICJOZ3V5ZW4gQ2hpIER1bmciDQpzdWJ0aXRsZTogIlIgRGF0YSBTY2llbmNlIFNlcmllcyINCm91dHB1dDoNCiAgaHRtbF9kb2N1bWVudDogDQogICAgY29kZV9kb3dubG9hZDogdHJ1ZQ0KICAgICMgY29kZV9mb2xkaW5nOiBoaWRlDQogICAgaGlnaGxpZ2h0OiB6ZW5idXJuDQogICAgIyBudW1iZXJfc2VjdGlvbnM6IHllcw0KICAgIHRoZW1lOiAiZmxhdGx5Ig0KICAgIHRvYzogVFJVRQ0KICAgIHRvY19mbG9hdDogVFJVRQ0KLS0tDQoNCmBgYHtyIHNldHVwLGluY2x1ZGU9RkFMU0V9DQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUsIHdhcm5pbmcgPSBGQUxTRSwgbWVzc2FnZSA9IEZBTFNFKQ0KYGBgDQoNCiFbXShDOlxcVXNlcnNcXEFkbWluXFxEb3dubG9hZHNcXHBpYzIuanBnKQ0KDQojIEdp4bubaSBUaGnhu4d1IFbhuqVuIMSQ4buBDQoNCltDb25mZXJlbmNlIG9uIERpZ2l0YWwgVHJhbnNmb3JtYXRpb24gZm9yIFNvY2lhbCBJbXBhY3RzIGFuZCBFU0cgUGVyZm9ybWFuY2VdKGh0dHBzOi8vd3d3LmVtZXJhbGRncm91cHB1Ymxpc2hpbmcuY29tL2pvdXJuYWwvamllbS9jb25mZXJlbmNlLWRpZ2l0YWwtdHJhbnNmb3JtYXRpb24tc29jaWFsLWltcGFjdHMtYW5kLWVzZy1wZXJmb3JtYW5jZSkgY8OzIE4gYsOgaSBiw6FvIMSRxrDhu6NjIGNo4buNbiDEkeG7gyB0csOsbmggYsOgeSB04bqhaSBo4buZaSB0aOG6o28gbsOgeS4gSMOjeSBwaMOibiBsb+G6oWkgTiBiw6BpIGLDoW8gbsOgeSB2w6BvIEsgbmjDs20gY2jhu6cgxJHhu4EgKFN1Yi10b3BpYykgc2FvIGNobyBu4buZaSBkdW5nIGPhu6dhIGPDoWMgcGFwZXJzIHRyb25nIG3hu5dpIG5ow7NtIGNo4bunIMSR4buBIGNvbiBsw6AgZ2nhu5FuZyBuaGF1IG5o4bqldC4gDQoNCiMgVG9waWMgTW9kZWxsaW5nDQoNCkLDoGkgdG/DoW4gxJHhurd0IHJhIOG7nyB0csOqbiDEkcaw4bujYyBn4buNaSBsw6AgW1RvcGljIE1vZGVsbGluZ10oaHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvVG9waWNfbW9kZWwpLiBDaMO6bmcgdGEgY8OzIHRo4buDIHPhu60gZOG7pW5nIFtMYXRlbnQgRGlyaWNobGV0IEFsbG9jYXRpb24gLSBMREFdKGh0dHBzOi8vZW4ud2lraXBlZGlhLm9yZy93aWtpL0xhdGVudF9EaXJpY2hsZXRfYWxsb2NhdGlvbikgxJHhu4MgeOG7rSBsw60gduG6pW4gxJHhu4EgbsOgeS4gRMaw4bubaSDEkcOieSBsw6AgUiBjb2RlcyBjaG8gaGFpIHTDrG5oIGh14buRbmc6ICgxKSBjxINuIGPhu6kgdsOgbyB0aXRsZSBj4bunYSBjw6FjIHBhcGVycywgdsOgICgyKSBjxINuIGPhu6kgdsOgbyBhYnN0cmFjdCBj4bunYSBjw6FjIHBhcGVycy4gDQoNCsSQ4buDIG3DtCBwaOG7j25nIGThu68gbGnhu4d1IGPhu6dhIGjhu5lpIHRo4bqjbyB0csOqbiB0csaw4bubYyBo4bq/dCBjaMO6bmcgdGEgbOG6pXkgdOG7kWkgxJFhIDUwMCBwYXBlcnMgduG7gSAiRGlnaXRhbCBUcmFuc2Zvcm1hdGlvbiBmb3IgU29jaWFsIEltcGFjdHMgYW5kIEVTRyBQZXJmb3JtYW5jZSIgdOG7qyAyMDI0IMSR4bq/biAyMDI1OiANCg0KDQpgYGB7cn0NCiM9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQ0KIyAgIFN0YWdlIDE6IENvbGxlY3QgcGFwZXIgaW5mbyANCiM9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQ0KDQoNCiMgTOG6pXkgdGjDtG5nIHRpbiB24buBIDUwMCBwYXBlcnM6IA0KDQpsaWJyYXJ5KHJjcm9zc3JlZikNCmxpYnJhcnkoZHBseXIpDQpsaWJyYXJ5KHN0cmluZ3IpDQpsaWJyYXJ5KHRpZHl0ZXh0KQ0KbGlicmFyeSh0b3BpY21vZGVscykNCmxpYnJhcnkoU25vd2JhbGxDKQ0KDQoNCnJlcyA8LSBjcl93b3JrcyhxdWVyeSA9ICJEaWdpdGFsIFRyYW5zZm9ybWF0aW9uIGZvciBTb2NpYWwgSW1wYWN0cyBhbmQgRVNHIFBlcmZvcm1hbmNlIiwgDQogICAgICAgICAgICAgICAgZmlsdGVyID0gYyhmcm9tX3B1Yl9kYXRlID0gIjIwMjQtMDEtMDEiLCB1bnRpbF9wdWJfZGF0ZSA9ICIyMDI1LTEyLTMxIiksDQogICAgICAgICAgICAgICAgbGltaXQgPSA1MDApDQoNCiMgU2VsZWN0IGpvdXJuYWwtYXJ0aWNsZSArIGFic3RyYWN0OiANCg0KcmVzJGRhdGEgJT4lIA0KICBmaWx0ZXIodHlwZSA9PSAiam91cm5hbC1hcnRpY2xlIikgJT4lIA0KICBmaWx0ZXIoIWlzLm5hKGFic3RyYWN0KSkgJT4lIA0KICBtdXRhdGUoaWQgPSAxOm5yb3coLikpIC0+IHBhcGVyc0Fic3RyYWN0cw0KYGBgDQoNCg0KIyBDxINuIGPhu6kgdsOgbyBUaXRsZQ0KDQpHaeG6oyBz4butIGNow7puZyB0YSBjaGlhIDE5OSBwYXBlcnMgbsOgeSB0aMOgbmggNiBzdWItdG9waWNzIGThu7FhIHRyw6puIExEQS4gRMaw4bubaSDEkcOieSBsw6AgUiBjb2RlczogDQoNCmBgYHtyfQ0KIy0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KIyBUb3BpYyBNb2RlbGxpbmcgYmFzZWQgb24gdGl0bGUNCiMtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0NCg0KcGFwZXJzQWJzdHJhY3RzICU+JSANCiAgc2VsZWN0KHRpdGxlLCBkb2ksIGlkKSAtPiBkZlRpdGxlcw0KDQp0aWR5X3RpdGxlcyA8LSBkZlRpdGxlcyAlPiUNCiAgdW5uZXN0X3Rva2Vucyh3b3JkLCB0aXRsZSkgJT4lDQogIGFudGlfam9pbihnZXRfc3RvcHdvcmRzKCkpICAgDQogDQojIENyZWF0ZSBhIGRvY3VtZW50LXRlcm0gbWF0cml4OiANCg0KZHRtVGl0bGVzIDwtIHRpZHlfdGl0bGVzICU+JQ0KICBjb3VudChpZCwgd29yZCkgJT4lDQogIGNhc3RfZHRtKGRvY3VtZW50ID0gaWQsIHRlcm0gPSB3b3JkLCB2YWx1ZSA9IG4pDQoNCiMgRml0IExEQSB3aXRoIDYgdG9waWNzOiANCg0Kbl90b3BpY3MgPC0gNg0KDQpsZGE2IDwtIExEQShkdG1UaXRsZXMsIA0KICAgICAgICAgICAgayA9IG5fdG9waWNzLCANCiAgICAgICAgICAgIG1ldGhvZCA9ICJHaWJicyIsDQogICAgICAgICAgICBjb250cm9sID0gbGlzdChidXJuaW4gPSAxMDAwLCBpdGVyID0gMjAwMCwgc2VlZCA9IDI5KSkNCg0KIyBFeHRyYWN0IHRoZSBwZXItZG9jdW1lbnQgdG9waWMgcHJvYmFiaWxpdGllczogDQoNCmRvY190b3BpY3MgPC0gdGlkeShsZGE2LCBtYXRyaXggPSAiZ2FtbWEiKSAgDQoNCg0KIyBGb3IgZWFjaCBkb2N1bWVudCwgY2hvb3NlIHRoZSB0b3BpYyB3aXRoIGhpZ2hlc3QgcHJvYmFiaWxpdHk6IA0KDQphc3NpZ25lZF90b3BpY3MgPC0gZG9jX3RvcGljcyAlPiUNCiAgZ3JvdXBfYnkoZG9jdW1lbnQpICU+JQ0KICBzbGljZSh3aGljaC5tYXgoZ2FtbWEpKSAlPiUgDQogIHVuZ3JvdXAoKSAlPiUNCiAgbXV0YXRlKHN1YlRvcGljID0gcGFzdGUwKCJUb3BpYyAiLCB0b3BpYykpICU+JQ0KICBzZWxlY3QoZG9jdW1lbnQsIHN1YlRvcGljKQ0KDQoNCiMgSW5zcGVjdCB0b3AgdGVybXMgcGVyIHRvcGljICh0b3AgNSk6IA0KDQp0ZXJtcyhsZGE2LCA1KSAlPiUNCiAgdCgpICU+JSANCiAgYXMuZGF0YS5mcmFtZSgpICU+JSANCiAgbXV0YXRlKHRvcDVXb3JkcyA9IHBhc3RlKFYxLCBWMiwgVjMsIFY0LCBWNSwgc2VwID0gIiwgIikpICU+JSANCiAgbXV0YXRlKHN1YlRvcGljID0gcm93Lm5hbWVzKC4pKSAtPiBkZktleVdvcmRzDQoNCg0KYXNzaWduZWRfdG9waWNzICU+JSANCiAgZnVsbF9qb2luKGRmS2V5V29yZHMgJT4lIHNlbGVjdChzdWJUb3BpYywgdG9wNVdvcmRzKSkgJT4lIA0KICBsZWZ0X2pvaW4ocGFwZXJzQWJzdHJhY3RzICU+JSBtdXRhdGUoaWQgPSBhcy5jaGFyYWN0ZXIoaWQpKSwgYnkgPSBjKCJkb2N1bWVudCIgPSAiaWQiKSkgLT4gcGFwZXJCeVRpdGxlDQpgYGANCg0KDQpT4buRIGzGsOG7o25nIGPDoWMgYsOgaSB0aHXhu5ljIHbhu4EgY8OhYyBzdWItdG9waWMgdsOgIGPDoWMgdOG7qyBraMOzYSBu4buVaSBi4bqtdCB0xrDGoW5nIOG7qW5nIChUYWJsZSAxKTogDQoNCmBgYHtyfQ0KDQpsaWJyYXJ5KGthYmxlRXh0cmEpICMgRm9yIHByZXNlbnRpbmcgdGFibGUuIA0KDQpwYXBlckJ5VGl0bGUgJT4lIA0KICBncm91cF9ieShzdWJUb3BpYywgdG9wNVdvcmRzKSAlPiUgDQogIGNvdW50KCkgJT4lIA0KICBrYmwoY2FwdGlvbiA9ICJUYWJsZSAxOiBTdWItdG9waWMgYnkgTERBIiwgZXNjYXBlID0gVFJVRSkgJT4lDQogIGthYmxlX2NsYXNzaWMoZnVsbF93aWR0aCA9IEZBTFNFLCBodG1sX2ZvbnQgPSAiQ2FtYnJpYSIpDQoNCmBgYA0KDQoNCiMgS+G6v3QgTHXhuq1uDQoNClRoYXkgdsOsIHPhu60gZOG7pW5nIHRpdGxlIGNow7puZyB0YSBjw7MgdGjDqiBz4butIGThu6VuZyBhYnN0cmFjdCDEkeG7gyBwaMOibiBsb+G6oWkgMTk5IHBhcGVycyBuw6B5IHbDoG8gNiBzdWItdG9waWMuIFRow7RuZyB0aMaw4budbmcgdGjDrCBhYnN0cmFjdCBz4bq9IGdpw6B1IHRow7RuZyB0aW4gaMahbiB2w6AgZG8gduG6rXkgdmnhu4djIHBow6JuIGxv4bqhaSBjw7MgdGjhu4MgY2jDrW5oIHjDoWMgaMahbiwgdGjhu4MgaGnhu4duIMSRw7puZyBoxqFuIG7hu5lpIGR1bmcgY+G7p2EgdOG7q25nIG5ow7NtLiBSIGNvZGVzIGzDoCBob8OgbiB0b8OgbiB0xrDGoW5nIHThu7EgbmjGsCB0csOqbi4gDQoNCiMgVMOgaSBMaeG7h3UgVGhhbSBLaOG6o28NCg0KQmxlaSBELk0uLCBOZyBBLlkuLCBKb3JkYW4gTS5JLiAoMjAwMykuIExhdGVudCBEaXJpY2hsZXQgQWxsb2NhdGlvbi4gKkpvdXJuYWwgb2YgTWFjaGluZSBMZWFybmluZyBSZXNlYXJjaCosIDMsIDk5M+KAkzEwMjIuDQoNClBoYW4gWC5ILiwgTmd1eWVuIEwuTS4sIEhvcmd1Y2hpIFMuICgyMDA4KS4gTGVhcm5pbmcgdG8gQ2xhc3NpZnkgU2hvcnQgYW5kIFNwYXJzZSBUZXh0ICYgV2ViIHdpdGggSGlkZGVuIFRvcGljcyBmcm9tIExhcmdlLXNjYWxlIERhdGEgQ29sbGVjdGlvbnMuIEluIFByb2NlZWRpbmdzIG9mIHRoZSAxN3RoIEludGVybmF0aW9uYWwgV29ybGQgV2lkZSBXZWIgQ29uZmVyZW5jZSAoV1dXIDIwMDgpLCBwYWdlcyA5MeKAkzEwMCwgQmVpamluZywgQ2hpbmEuDQoNCkx1LCBCLiwgT3R0LCBNLiwgQ2FyZGllLCBDLiwgVHNvdSwgQi5LLiAoMjAxMSkuIE11bHRpLWFzcGVjdCBTZW50aW1lbnQgQW5hbHlzaXMgd2l0aCBUb3BpYyBNb2RlbHMuIEluIFByb2NlZWRpbmdzIG9mIHRoZSAyMDExIElFRUUgMTF0aCBJbnRlcm5hdGlvbmFsIENvbmZlcmVuY2Ugb24gRGF0YSBNaW5pbmcgV29ya3Nob3BzLCBwYWdlcyA4MeKAkzg4Lg0KDQoNCg0KDQoNCg0KDQoNCg0KDQoNCg==