Giới Thiệu Vấn Đề

Conference on Digital Transformation for Social Impacts and ESG Performance có N bài báo được chọn để trình bày tại hội thảo này. Hãy phân loại N bài báo này vào K nhóm chủ đề (Sub-topic) sao cho nội dung của các papers trong mỗi nhóm chủ đề con là giống nhau nhất.

Topic Modelling

Bài toán đặt ra ở trên được gọi là Topic Modelling. Chúng ta có thể sử dụng Latent Dirichlet Allocation - LDA để xử lí vấn đề này. Dưới đây là R codes cho hai tình huống: (1) căn cứ vào title của các papers, và (2) căn cứ vào abstract của các papers.

Để mô phỏng dữ liệu của hội thảo trên trước hết chúng ta lấy tối đa 500 papers về “Digital Transformation for Social Impacts and ESG Performance” từ 2024 đến 2025:

#===================================
#   Stage 1: Collect paper info 
#===================================


# Lấy thông tin về 500 papers: 

library(rcrossref)
library(dplyr)
library(stringr)
library(tidytext)
library(topicmodels)
library(SnowballC)


res <- cr_works(query = "Digital Transformation for Social Impacts and ESG Performance", 
                filter = c(from_pub_date = "2024-01-01", until_pub_date = "2025-12-31"),
                limit = 500)

# Select journal-article + abstract: 

res$data %>% 
  filter(type == "journal-article") %>% 
  filter(!is.na(abstract)) %>% 
  mutate(id = 1:nrow(.)) -> papersAbstracts

Căn cứ vào Title

Giả sử chúng ta chia 199 papers này thành 6 sub-topics dựa trên LDA. Dưới đây là R codes:

#---------------------------------
# Topic Modelling based on title
#---------------------------------

papersAbstracts %>% 
  select(title, doi, id) -> dfTitles

tidy_titles <- dfTitles %>%
  unnest_tokens(word, title) %>%
  anti_join(get_stopwords())   
 
# Create a document-term matrix: 

dtmTitles <- tidy_titles %>%
  count(id, word) %>%
  cast_dtm(document = id, term = word, value = n)

# Fit LDA with 6 topics: 

n_topics <- 6

lda6 <- LDA(dtmTitles, 
            k = n_topics, 
            method = "Gibbs",
            control = list(burnin = 1000, iter = 2000, seed = 29))

# Extract the per-document topic probabilities: 

doc_topics <- tidy(lda6, matrix = "gamma")  


# For each document, choose the topic with highest probability: 

assigned_topics <- doc_topics %>%
  group_by(document) %>%
  slice(which.max(gamma)) %>% 
  ungroup() %>%
  mutate(subTopic = paste0("Topic ", topic)) %>%
  select(document, subTopic)


# Inspect top terms per topic (top 5): 

terms(lda6, 5) %>%
  t() %>% 
  as.data.frame() %>% 
  mutate(top5Words = paste(V1, V2, V3, V4, V5, sep = ", ")) %>% 
  mutate(subTopic = row.names(.)) -> dfKeyWords


assigned_topics %>% 
  full_join(dfKeyWords %>% select(subTopic, top5Words)) %>% 
  left_join(papersAbstracts %>% mutate(id = as.character(id)), by = c("document" = "id")) -> paperByTitle

Số lượng các bài thuộc về các sub-topic và các từ khóa nổi bật tương ứng (Table 1):

library(kableExtra) # For presenting table. 

paperByTitle %>% 
  group_by(subTopic, top5Words) %>% 
  count() %>% 
  kbl(caption = "Table 1: Sub-topic by LDA", escape = TRUE) %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")
Table 1: Sub-topic by LDA
subTopic top5Words n
Topic 1 esg, performance, transformation, china, analysis 46
Topic 2 performance, impact, enterprise, evidence, role 40
Topic 3 esg, digital, performance, financial, governance 35
Topic 4 transformation, performance, impacts, mediating, technology 26
Topic 5 digital, evidence, transformation, impact, can 19
Topic 6 corporate, innovation, green, social, based 33

Kết Luận

Thay vì sử dụng title chúng ta có thê sử dụng abstract để phân loại 199 papers này vào 6 sub-topic. Thông thường thì abstract sẽ giàu thông tin hơn và do vậy việc phân loại có thể chính xác hơn, thể hiện đúng hơn nội dung của từng nhóm. R codes là hoàn toàn tương tự như trên.

Tài Liệu Tham Khảo

Blei D.M., Ng A.Y., Jordan M.I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.

Phan X.H., Nguyen L.M., Horguchi S. (2008). Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. In Proceedings of the 17th International World Wide Web Conference (WWW 2008), pages 91–100, Beijing, China.

Lu, B., Ott, M., Cardie, C., Tsou, B.K. (2011). Multi-aspect Sentiment Analysis with Topic Models. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, pages 81–88.

LS0tDQp0aXRsZTogIlRvcGljIE1vZGVsbGluZzogQSBSZWFsLVdvcmxkIEFwcGxpY2F0aW9uIiAgDQphdXRob3I6ICJOZ3V5ZW4gQ2hpIER1bmciDQpzdWJ0aXRsZTogIlIgRGF0YSBTY2llbmNlIFNlcmllcyINCm91dHB1dDoNCiAgaHRtbF9kb2N1bWVudDogDQogICAgY29kZV9kb3dubG9hZDogdHJ1ZQ0KICAgICMgY29kZV9mb2xkaW5nOiBoaWRlDQogICAgaGlnaGxpZ2h0OiB6ZW5idXJuDQogICAgIyBudW1iZXJfc2VjdGlvbnM6IHllcw0KICAgIHRoZW1lOiAiZmxhdGx5Ig0KICAgIHRvYzogVFJVRQ0KICAgIHRvY19mbG9hdDogVFJVRQ0KLS0tDQoNCmBgYHtyIHNldHVwLGluY2x1ZGU9RkFMU0V9DQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUsIHdhcm5pbmcgPSBGQUxTRSwgbWVzc2FnZSA9IEZBTFNFKQ0KYGBgDQoNCiFbXShDOlxcVXNlcnNcXEFkbWluXFxEb3dubG9hZHNcXHBpYzIuanBnKQ0KDQojIEdp4bubaSBUaGnhu4d1IFbhuqVuIMSQ4buBDQoNCltDb25mZXJlbmNlIG9uIERpZ2l0YWwgVHJhbnNmb3JtYXRpb24gZm9yIFNvY2lhbCBJbXBhY3RzIGFuZCBFU0cgUGVyZm9ybWFuY2VdKGh0dHBzOi8vd3d3LmVtZXJhbGRncm91cHB1Ymxpc2hpbmcuY29tL2pvdXJuYWwvamllbS9jb25mZXJlbmNlLWRpZ2l0YWwtdHJhbnNmb3JtYXRpb24tc29jaWFsLWltcGFjdHMtYW5kLWVzZy1wZXJmb3JtYW5jZSkgY8OzIE4gYsOgaSBiw6FvIMSRxrDhu6NjIGNo4buNbiDEkeG7gyB0csOsbmggYsOgeSB04bqhaSBo4buZaSB0aOG6o28gbsOgeS4gSMOjeSBwaMOibiBsb+G6oWkgTiBiw6BpIGLDoW8gbsOgeSB2w6BvIEsgbmjDs20gY2jhu6cgxJHhu4EgKFN1Yi10b3BpYykgc2FvIGNobyBu4buZaSBkdW5nIGPhu6dhIGPDoWMgcGFwZXJzIHRyb25nIG3hu5dpIG5ow7NtIGNo4bunIMSR4buBIGNvbiBsw6AgZ2nhu5FuZyBuaGF1IG5o4bqldC4gDQoNCiMgVG9waWMgTW9kZWxsaW5nDQoNCkLDoGkgdG/DoW4gxJHhurd0IHJhIOG7nyB0csOqbiDEkcaw4bujYyBn4buNaSBsw6AgW1RvcGljIE1vZGVsbGluZ10oaHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvVG9waWNfbW9kZWwpLiBDaMO6bmcgdGEgY8OzIHRo4buDIHPhu60gZOG7pW5nIFtMYXRlbnQgRGlyaWNobGV0IEFsbG9jYXRpb24gLSBMREFdKGh0dHBzOi8vZW4ud2lraXBlZGlhLm9yZy93aWtpL0xhdGVudF9EaXJpY2hsZXRfYWxsb2NhdGlvbikgxJHhu4MgeOG7rSBsw60gduG6pW4gxJHhu4EgbsOgeS4gRMaw4bubaSDEkcOieSBsw6AgUiBjb2RlcyBjaG8gaGFpIHTDrG5oIGh14buRbmc6ICgxKSBjxINuIGPhu6kgdsOgbyB0aXRsZSBj4bunYSBjw6FjIHBhcGVycywgdsOgICgyKSBjxINuIGPhu6kgdsOgbyBhYnN0cmFjdCBj4bunYSBjw6FjIHBhcGVycy4gDQoNCsSQ4buDIG3DtCBwaOG7j25nIGThu68gbGnhu4d1IGPhu6dhIGjhu5lpIHRo4bqjbyB0csOqbiB0csaw4bubYyBo4bq/dCBjaMO6bmcgdGEgbOG6pXkgdOG7kWkgxJFhIDUwMCBwYXBlcnMgduG7gSAiRGlnaXRhbCBUcmFuc2Zvcm1hdGlvbiBmb3IgU29jaWFsIEltcGFjdHMgYW5kIEVTRyBQZXJmb3JtYW5jZSIgdOG7qyAyMDI0IMSR4bq/biAyMDI1OiANCg0KDQpgYGB7cn0NCiM9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQ0KIyAgIFN0YWdlIDE6IENvbGxlY3QgcGFwZXIgaW5mbyANCiM9PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PQ0KDQoNCiMgTOG6pXkgdGjDtG5nIHRpbiB24buBIDUwMCBwYXBlcnM6IA0KDQpsaWJyYXJ5KHJjcm9zc3JlZikNCmxpYnJhcnkoZHBseXIpDQpsaWJyYXJ5KHN0cmluZ3IpDQpsaWJyYXJ5KHRpZHl0ZXh0KQ0KbGlicmFyeSh0b3BpY21vZGVscykNCmxpYnJhcnkoU25vd2JhbGxDKQ0KDQoNCnJlcyA8LSBjcl93b3JrcyhxdWVyeSA9ICJEaWdpdGFsIFRyYW5zZm9ybWF0aW9uIGZvciBTb2NpYWwgSW1wYWN0cyBhbmQgRVNHIFBlcmZvcm1hbmNlIiwgDQogICAgICAgICAgICAgICAgZmlsdGVyID0gYyhmcm9tX3B1Yl9kYXRlID0gIjIwMjQtMDEtMDEiLCB1bnRpbF9wdWJfZGF0ZSA9ICIyMDI1LTEyLTMxIiksDQogICAgICAgICAgICAgICAgbGltaXQgPSA1MDApDQoNCiMgU2VsZWN0IGpvdXJuYWwtYXJ0aWNsZSArIGFic3RyYWN0OiANCg0KcmVzJGRhdGEgJT4lIA0KICBmaWx0ZXIodHlwZSA9PSAiam91cm5hbC1hcnRpY2xlIikgJT4lIA0KICBmaWx0ZXIoIWlzLm5hKGFic3RyYWN0KSkgJT4lIA0KICBtdXRhdGUoaWQgPSAxOm5yb3coLikpIC0+IHBhcGVyc0Fic3RyYWN0cw0KYGBgDQoNCg0KIyBDxINuIGPhu6kgdsOgbyBUaXRsZQ0KDQpHaeG6oyBz4butIGNow7puZyB0YSBjaGlhIDE5OSBwYXBlcnMgbsOgeSB0aMOgbmggNiBzdWItdG9waWNzIGThu7FhIHRyw6puIExEQS4gRMaw4bubaSDEkcOieSBsw6AgUiBjb2RlczogDQoNCmBgYHtyfQ0KIy0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQ0KIyBUb3BpYyBNb2RlbGxpbmcgYmFzZWQgb24gdGl0bGUNCiMtLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0NCg0KcGFwZXJzQWJzdHJhY3RzICU+JSANCiAgc2VsZWN0KHRpdGxlLCBkb2ksIGlkKSAtPiBkZlRpdGxlcw0KDQp0aWR5X3RpdGxlcyA8LSBkZlRpdGxlcyAlPiUNCiAgdW5uZXN0X3Rva2Vucyh3b3JkLCB0aXRsZSkgJT4lDQogIGFudGlfam9pbihnZXRfc3RvcHdvcmRzKCkpICAgDQogDQojIENyZWF0ZSBhIGRvY3VtZW50LXRlcm0gbWF0cml4OiANCg0KZHRtVGl0bGVzIDwtIHRpZHlfdGl0bGVzICU+JQ0KICBjb3VudChpZCwgd29yZCkgJT4lDQogIGNhc3RfZHRtKGRvY3VtZW50ID0gaWQsIHRlcm0gPSB3b3JkLCB2YWx1ZSA9IG4pDQoNCiMgRml0IExEQSB3aXRoIDYgdG9waWNzOiANCg0Kbl90b3BpY3MgPC0gNg0KDQpsZGE2IDwtIExEQShkdG1UaXRsZXMsIA0KICAgICAgICAgICAgayA9IG5fdG9waWNzLCANCiAgICAgICAgICAgIG1ldGhvZCA9ICJHaWJicyIsDQogICAgICAgICAgICBjb250cm9sID0gbGlzdChidXJuaW4gPSAxMDAwLCBpdGVyID0gMjAwMCwgc2VlZCA9IDI5KSkNCg0KIyBFeHRyYWN0IHRoZSBwZXItZG9jdW1lbnQgdG9waWMgcHJvYmFiaWxpdGllczogDQoNCmRvY190b3BpY3MgPC0gdGlkeShsZGE2LCBtYXRyaXggPSAiZ2FtbWEiKSAgDQoNCg0KIyBGb3IgZWFjaCBkb2N1bWVudCwgY2hvb3NlIHRoZSB0b3BpYyB3aXRoIGhpZ2hlc3QgcHJvYmFiaWxpdHk6IA0KDQphc3NpZ25lZF90b3BpY3MgPC0gZG9jX3RvcGljcyAlPiUNCiAgZ3JvdXBfYnkoZG9jdW1lbnQpICU+JQ0KICBzbGljZSh3aGljaC5tYXgoZ2FtbWEpKSAlPiUgDQogIHVuZ3JvdXAoKSAlPiUNCiAgbXV0YXRlKHN1YlRvcGljID0gcGFzdGUwKCJUb3BpYyAiLCB0b3BpYykpICU+JQ0KICBzZWxlY3QoZG9jdW1lbnQsIHN1YlRvcGljKQ0KDQoNCiMgSW5zcGVjdCB0b3AgdGVybXMgcGVyIHRvcGljICh0b3AgNSk6IA0KDQp0ZXJtcyhsZGE2LCA1KSAlPiUNCiAgdCgpICU+JSANCiAgYXMuZGF0YS5mcmFtZSgpICU+JSANCiAgbXV0YXRlKHRvcDVXb3JkcyA9IHBhc3RlKFYxLCBWMiwgVjMsIFY0LCBWNSwgc2VwID0gIiwgIikpICU+JSANCiAgbXV0YXRlKHN1YlRvcGljID0gcm93Lm5hbWVzKC4pKSAtPiBkZktleVdvcmRzDQoNCg0KYXNzaWduZWRfdG9waWNzICU+JSANCiAgZnVsbF9qb2luKGRmS2V5V29yZHMgJT4lIHNlbGVjdChzdWJUb3BpYywgdG9wNVdvcmRzKSkgJT4lIA0KICBsZWZ0X2pvaW4ocGFwZXJzQWJzdHJhY3RzICU+JSBtdXRhdGUoaWQgPSBhcy5jaGFyYWN0ZXIoaWQpKSwgYnkgPSBjKCJkb2N1bWVudCIgPSAiaWQiKSkgLT4gcGFwZXJCeVRpdGxlDQpgYGANCg0KDQpT4buRIGzGsOG7o25nIGPDoWMgYsOgaSB0aHXhu5ljIHbhu4EgY8OhYyBzdWItdG9waWMgdsOgIGPDoWMgdOG7qyBraMOzYSBu4buVaSBi4bqtdCB0xrDGoW5nIOG7qW5nIChUYWJsZSAxKTogDQoNCmBgYHtyfQ0KDQpsaWJyYXJ5KGthYmxlRXh0cmEpICMgRm9yIHByZXNlbnRpbmcgdGFibGUuIA0KDQpwYXBlckJ5VGl0bGUgJT4lIA0KICBncm91cF9ieShzdWJUb3BpYywgdG9wNVdvcmRzKSAlPiUgDQogIGNvdW50KCkgJT4lIA0KICBrYmwoY2FwdGlvbiA9ICJUYWJsZSAxOiBTdWItdG9waWMgYnkgTERBIiwgZXNjYXBlID0gVFJVRSkgJT4lDQogIGthYmxlX2NsYXNzaWMoZnVsbF93aWR0aCA9IEZBTFNFLCBodG1sX2ZvbnQgPSAiQ2FtYnJpYSIpDQoNCmBgYA0KDQoNCiMgS+G6v3QgTHXhuq1uDQoNClRoYXkgdsOsIHPhu60gZOG7pW5nIHRpdGxlIGNow7puZyB0YSBjw7MgdGjDqiBz4butIGThu6VuZyBhYnN0cmFjdCDEkeG7gyBwaMOibiBsb+G6oWkgMTk5IHBhcGVycyBuw6B5IHbDoG8gNiBzdWItdG9waWMuIFRow7RuZyB0aMaw4budbmcgdGjDrCBhYnN0cmFjdCBz4bq9IGdpw6B1IHRow7RuZyB0aW4gaMahbiB2w6AgZG8gduG6rXkgdmnhu4djIHBow6JuIGxv4bqhaSBjw7MgdGjhu4MgY2jDrW5oIHjDoWMgaMahbiwgdGjhu4MgaGnhu4duIMSRw7puZyBoxqFuIG7hu5lpIGR1bmcgY+G7p2EgdOG7q25nIG5ow7NtLiBSIGNvZGVzIGzDoCBob8OgbiB0b8OgbiB0xrDGoW5nIHThu7EgbmjGsCB0csOqbi4gDQoNCiMgVMOgaSBMaeG7h3UgVGhhbSBLaOG6o28NCg0KQmxlaSBELk0uLCBOZyBBLlkuLCBKb3JkYW4gTS5JLiAoMjAwMykuIExhdGVudCBEaXJpY2hsZXQgQWxsb2NhdGlvbi4gKkpvdXJuYWwgb2YgTWFjaGluZSBMZWFybmluZyBSZXNlYXJjaCosIDMsIDk5M+KAkzEwMjIuDQoNClBoYW4gWC5ILiwgTmd1eWVuIEwuTS4sIEhvcmd1Y2hpIFMuICgyMDA4KS4gTGVhcm5pbmcgdG8gQ2xhc3NpZnkgU2hvcnQgYW5kIFNwYXJzZSBUZXh0ICYgV2ViIHdpdGggSGlkZGVuIFRvcGljcyBmcm9tIExhcmdlLXNjYWxlIERhdGEgQ29sbGVjdGlvbnMuIEluIFByb2NlZWRpbmdzIG9mIHRoZSAxN3RoIEludGVybmF0aW9uYWwgV29ybGQgV2lkZSBXZWIgQ29uZmVyZW5jZSAoV1dXIDIwMDgpLCBwYWdlcyA5MeKAkzEwMCwgQmVpamluZywgQ2hpbmEuDQoNCkx1LCBCLiwgT3R0LCBNLiwgQ2FyZGllLCBDLiwgVHNvdSwgQi5LLiAoMjAxMSkuIE11bHRpLWFzcGVjdCBTZW50aW1lbnQgQW5hbHlzaXMgd2l0aCBUb3BpYyBNb2RlbHMuIEluIFByb2NlZWRpbmdzIG9mIHRoZSAyMDExIElFRUUgMTF0aCBJbnRlcm5hdGlvbmFsIENvbmZlcmVuY2Ugb24gRGF0YSBNaW5pbmcgV29ya3Nob3BzLCBwYWdlcyA4MeKAkzg4Lg0KDQoNCg0KDQoNCg0KDQoNCg0KDQoNCg==