텍스트 빅데이터 기반의 언어분석 방법과 이론

토픽 모델을 중심으로

이신행

11/29/2019

What is Topic Modeling?

Automated procedures for coding collections of texts
By meaningful categories that represent the main topics discussed in the texts
Assuming 1) that meanings are relational (Saussure, 1959)
And 2) that the meanings associated with a topic is understood as a set of word clusters
So, topic modeling aims at capturing word co-occurrences within a text
A topic as the cluster of words that tend to co-occur more frequently than otherwise

Topic Modeling Workflow

Collecting a pile of relevant documents to research interest
Transforming them into a co-occurrence matrix (e.g., Document-Term Matrix)
Specifying the parameters, e.g., \(k\), the number of topics as core themes of the corpus
Using unsupervised learning algorithms (e.g., LDA) to discover latent themes on text

Fig. 1 Topic Modeling Workflow

Logic behind Latent Dirichlet Allocation (LDA)

A corpus consists of latent topics
Each document consists of multiple topics
Each topic consists of multiple tokens (e.g., words or phrases)

Fig. 2 LDA assumptions

More specifically, in LDA…

Topics: Represented as probability that each of a given set of terms occurs: \(\beta\)
Documents: Probability mixture of these topics: \(\theta\)

Fig. 3 Blei (2012) ‘Probabilistic topic models’

Fig. 4 Docs clustering with words

Fig. 5 Docs clustering with topics

Fig. 6 LDA assumes a generative process

As a result of LDA,

Each topic \(t\) has a topic-word vector \(\beta_t\), which is multinominal distribution over words.
- We can decide if a specific word is part of a particular topic among other topics.
Each document \(d\) has a topic vector \(\theta_d\), which is multinominal distribution over topics.
- We can determine how common a particular topic within a document.

To reiterate, LDA algorithm observes 1) word distributions that define \(k\) latent topics and 2) \(k\) topic distributions across documents.

Still confusing? Let’s assume…

서울시의 병원에 입원 중인 환자들의 질병 종류에 따라 10대 주요 질환을 조사

서울 = 코퍼스
병원 = 문서
환자 = 단어
토픽 = 질환 (여기선, 질병의 상위개념)
1. 전수조사 대신, 확률 분포를 이용한 통계 추론
2. 서울 시내 각 병원마다 임의 방식으로 환자의 질병 조사
3. 기존 질병 분류 정보 이용 X
4. 실제 자료에 기반해 질병의 유사도(병원 특성에 따라 비슷한 질병 환자 입원 가능성) 및 종류에 따라 질환 분류
  - 각 환자가 각 주요 질환에 포함될 확률
  - 각 병원의 입원 환자들이 각 주요 질환에 해당하는 확률

결국, LDA는 10대(\(k\)) 질환 선정 및 분류를 위한 다음의 두가지 정보를 제공:

서울시 병원에 입원 중인 환자들의 10대 주요 질환을 구성하는 질병들의 분포: \(\beta\)
- 다시 말해, 10대 주요 질환을 구성하는 질병들을 파악
- 예: 1 topic: 위암 (30%), 간암 (15%), 폐암 (5%) 등의 질병 (token; word).. -> 암 (topic)
  2 topic: 뇌경색 (45%), 뇌출혈 (10%), 뇌종양 (3%) -> 뇌질환
각 주요 질환 환자들이 입원 중인 서울시 병원들의 분포: \(\theta\)
- 입원 환자들의 주요 질환 분포에 따라 병원들의 특성 분류
- 예: 1. A 병원 (document) - 암 (60%), 뇌질환 (30%), 심장질환 (10%) 등..
  2. B 병원 - 심장질환 (70%), 뇌질환 (25%), 암 (5%)

For LDA, \(k\) (# of topics) should be pre-defined

The word frequency distribution among all documents to define “k” topics
Documents are given a probability for each topic
LDA algorithm assigns a probabilitty from all observed topics to each document
With \(k\)=10, 각 병원은 10대 질환(topic) 환자 입원 확률을 가짐
\(k\) 값 설정에 따라 LDA 결과 좌우

How to define \(k\)

코퍼스의 의미와 특징을 가장 잘(?) represent할 수 있는 \(k\) 값 선정

어떻게? 두가지 기준:

Semantic coherence: 각 토픽을 구성하는 어휘들의 의미론적 유기성
- \(k\) 값이 작을수록, 즉 토픽의 수가 적을수록 coherence는 증가 (interpretability)
- 하지만, 다른 주제를 의미하는 어휘들로 같은 토픽이 구성될 가능성 증가
Exclusivity: 각 토픽을 구성하는 어휘들간의 구별성
- \(k\) 값이 커질수록, 즉 토픽의 수가 많을수록 exclusivity는 증가 (differences)
- 하지만, 이 경우 비슷한 의미의 어휘들로 구성된 토픽들이 많아질 가능성 증가

최근 많은 토픽 모델링 연구에서 사용하는 방법은 perplexity 지수 이용
- perplexity: A measure of the statistical goodness of fit of a topic model (Blei et al., 2003)
- This measure esimtates how well a model produced for the major part of the corpus predicts a held-out smaller portion of the documents.

Case Study: LDA using R

Topic modeling workflow
1. Problem definition & specific goals 2. Identify the text that needs to be collected 3. Text organization 4. Feature extraction 5. Analysis 6. Interpretation & discussion

Fig. 7 Flowchart of topic modeling by Julia Silge

Loading the organized text of news articles posted on Twitter

-조선일보, 중앙일보, 한겨레 -2019.02 ~ 2019.06

# Loading pre-processed text (headline + lead paragraph) including only nouns
knews_df = knews_td %>% 
  group_by(document) %>% 
  summarise(text = paste(word, collapse= " ")) %>% 
  ungroup()
knews_df %>% arrange(document) %>% slice(1:5)

## # A tibble: 5 x 2
##   document    text                                                         
##   <chr>       <chr>                                                        
## 1 Chosun_6266 현지 시각 도널드 트럼프 미국 대통령 새해 의회 국정 연설 제목 선택 연방 정부 셧다운 일시 업무 정지 정치…
## 2 Chosun_6267 연휴 첫날 전국 고속 도로 곳곳 정체 시작 한국 도로공사 이날 서울 요금소 출발 기준 소요 시간 부산 시간 …
## 3 Chosun_6268 일요일 주일 중국 연휴 중국인 해외여행 예상 중국 관영 글로벌 타임스 여행사 예약 자료 인용 보도 올해 태국…
## 4 Chosun_6269 정부 발표 예비 조사 면제 사업 투입 새만금 국제공항 사업 빈약 여객 수요 인근 공항 중복 새만금 개발 진척…
## 5 Chosun_6270 국무부 올해 적용 주한 미군 방위비 분담금 협정 양국 원칙 합의 현지 시각 도널드 트럼프 대통령 합의 서명 …

dim(knews_df)

## [1] 7837    2

Extracting features from text

Genearing Document-Feature Matrix

knews_dfm_full = dfm(knews_df$text)
knews_dfm_full

## Document-feature matrix of: 7,837 documents, 26,001 features (99.9% sparse).

# Too many features, so we may want to trim down the size of vocabulary
korea_dfm = dfm_trim(knews_dfm_full, min_docfreq = 0.001, max_docfreq = 0.99, docfreq_type = "prop")
korea_dfm[1:5,1:10]

## Document-feature matrix of: 5 documents, 10 features (70.0% sparse).
## 5 x 10 sparse Matrix of class "dfm"
##        features
## docs    현지 시각 도널드 트럼프 미국 대통령 새해 의회 국정 연설
##   text1    1    1      1      4    1      3    1    2    3    4
##   text2    0    0      0      0    0      0    0    0    0    0
##   text3    0    0      0      0    0      0    0    0    0    0
##   text4    0    0      0      0    0      0    0    0    0    0
##   text5    1    1      1      2    0      2    0    0    0    0

Selecting the optimal \(k\) value

perplexity can be used by 1) splitting corpus into training and testing data, 2) fitting the former, and 3) finding the optimal number of \(k\) to predict the latter
But today, we determine \(k\) by semantic coherence and exclusivity
And using STM algorithm as a successor of LDA

Training models with different \(k\) and evaluating their goodness-of-fit

load("korea_many_models_1106.RData")
korea_heldout <- make.heldout(korea_dfm)

korea_k_result <- korea_many_models %>% 
  mutate(semantic_coherence = future_map(topic_model, semanticCoherence, korea_dfm),
         exclusivity = future_map(topic_model, exclusivity),
         eval_heldout = future_map(topic_model, eval.heldout, korea_heldout$missing),
         residual = future_map(topic_model, checkResiduals, korea_dfm))

Diagnostics by # of topics

korea_k_result %>%
  transmute(K,
            `Semantic coherence` = map_dbl(semantic_coherence, mean),
            Exclusivity = map_dbl(exclusivity, mean),
            `Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout"),
            Residuals = map_dbl(residual, "dispersion")) %>%
  gather(Metric, Value, -K) %>%
  ggplot(aes(K, Value, color = Metric)) +
  geom_line(size = 1.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  labs(x = "K (number of topics)",
       y = NULL,
       title = "Model diagnostics by number of topics",
       subtitle = "These diagnostics indicate that a good number of topics would be around 60")

Comparing exclusivity and semantic coherence

korea_k_result %>%
  dplyr::select(K, exclusivity, semantic_coherence) %>%
  unnest() %>%
  mutate(K = as.factor(K)) %>%
  ggplot(aes(semantic_coherence, exclusivity, color = K)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(x = "Semantic coherence",
       y = "Exclusivity",
       title = "Comparing exclusivity and semantic coherence",
       subtitle = "Models with fewer topics have higher semantic coherence\nfor more topics, but lower exclusivity")

## Warning: `cols` is now required.
## Please use `cols = c(exclusivity, semantic_coherence)`

Fitting topic model with \(k\)=60

korea_topic_model <- korea_k_result %>% 
  filter(K==60) %>% 
  pull(topic_model) %>% 
  .[[1]]
korea_topic_model

## A topic model with 60 topics, 7837 documents and a 5130 word dictionary.

names(korea_topic_model)

##  [1] "mu"          "sigma"       "beta"        "settings"    "vocab"      
##  [6] "convergence" "theta"       "eta"         "invsigma"    "time"       
## [11] "version"

alpha: parameter that controls the prior distribution over topic weights in each document
eta: parameter for the prior distribution over word weights in each topic
beta: posterior probability of a topic \(k_1\) contraining the word \(w_1\)
theta (gamma): posterior probability of a document \(M_1\) containing the topic \(k_1\)

Exploring the topic model

library(tidytext)
kt_beta <- tidy(korea_topic_model)
kt_beta

## # A tibble: 307,800 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 가게  7.12e-104
##  2     2 가게  7.35e-100
##  3     3 가게  1.37e-171
##  4     4 가게  1.91e-135
##  5     5 가게  2.20e-117
##  6     6 가게  1.74e- 70
##  7     7 가게  1.99e-138
##  8     8 가게  9.26e-143
##  9     9 가게  1.01e- 62
## 10    10 가게  4.38e-153
## # … with 307,790 more rows

kt_beta %>%
  filter(topic < 6) %>% 
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal(base_family = "Nanum Pen Script") +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities for topics 1 ~ 5",
       subtitle = "Different words are associated with different topics")

kt_theta <- tidy(korea_topic_model, matrix="gamma",
                 document_names = rownames(korea_dfm)) # In LDA algorithm, the matrix 'gamma' includes $\theta$ information
kt_theta

## # A tibble: 470,220 x 3
##    document topic    gamma
##    <chr>    <int>    <dbl>
##  1 text1        1 0.00163 
##  2 text2        1 0.00206 
##  3 text3        1 0.00899 
##  4 text4        1 0.000348
##  5 text5        1 0.00192 
##  6 text6        1 0.00446 
##  7 text7        1 0.00177 
##  8 text8        1 0.00240 
##  9 text9        1 0.00300 
## 10 text10       1 0.00114 
## # … with 470,210 more rows

kt_theta %>% 
  filter(topic < 6 & gamma > 0.5) %>% 
  ggplot(aes(gamma, fill = as.factor(topic))) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, ncol = 3) +
  labs(title = "Distribution of document probabilities for each topic",
       subtitle = "Topics 1 and 2 do not have strongly associated story",
       y = "Number of articles", x = expression(gamma))

Visualization of the result

library(ggthemes)
library(extrafont)

top_terms <- kt_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest()

theta_terms <- kt_theta %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

library(scales)
theta_terms %>%
  top_n(20, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0, nudge_y = 0.0005, size = 8, family="Nanum Pen Script") +
  coord_flip() +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 0.06),
                     labels = percent_format()) +
  theme_tufte(ticks = FALSE) +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 13)) +
  theme_minimal(base_family = "Nanum Pen Script", base_size = 20) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 20 topics by prevalence in the Korean News Tweet corpus",
       subtitle = "With the top words that contribute to each topic")