이신행
11/29/2019
Fig. 1 Topic Modeling Workflow
Fig. 2 LDA assumptions
Fig. 3 Blei (2012) ‘Probabilistic topic models’
Fig. 4 Docs clustering with words
Fig. 5 Docs clustering with topics
Fig. 6 LDA assumes a generative process
To reiterate, the LDA algorithm infers 1) the word distributions that define the \(k\) latent topics and 2) the distribution of those \(k\) topics across documents.
An analogy: identify the ten major disease groups among patients admitted to hospitals in Seoul, based on the kinds of illnesses they have.
Topic = disease group (here, a concept one level above an individual illness)
In the end, LDA provides the following two pieces of information for selecting and classifying the ten (\(k\)) major disease groups:
the distribution of illnesses that make up the ten major disease groups among patients admitted to Seoul hospitals: \(\beta\)
the distribution of each major disease group's patients across Seoul hospitals: \(\theta\)
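In standard notation (added here for clarity; \(t\) indexes one of the \(k\) topics, \(v\) a word in the vocabulary, and \(d\) a document), these two quantities are

\[
\beta_{t,v} = p(w = v \mid \text{topic } t), \qquad \theta_{d,t} = p(\text{topic } t \mid \text{document } d), \qquad t = 1, \dots, k.
\]

In the hospital analogy, \(\beta\) answers "which illnesses characterize disease group \(t\)?" and \(\theta\) answers "how prevalent is disease group \(t\) among the patients of hospital \(d\)?"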
Selecting the value of \(k\) that best (?) represents the meaning and characteristics of the corpus
How? Two criteria:
Fig. 7 Flowchart of topic modeling by Julia Silge
- 조선일보, 중앙일보, 한겨레
- 2019.02 ~ 2019.06
# Loading pre-processed text (headline + lead paragraph) including only nouns
knews_df = knews_td %>%
  group_by(document) %>%
  summarise(text = paste(word, collapse = " ")) %>%
  ungroup()
knews_df %>% arrange(document) %>% slice(1:5)
## # A tibble: 5 x 2
## document text
## <chr> <chr>
## 1 Chosun_6266 현지 시각 도널드 트럼프 미국 대통령 새해 의회 국정 연설 제목 선택 연방 정부 셧다운 일시 업무 정지 정치…
## 2 Chosun_6267 연휴 첫날 전국 고속 도로 곳곳 정체 시작 한국 도로공사 이날 서울 요금소 출발 기준 소요 시간 부산 시간 …
## 3 Chosun_6268 일요일 주일 중국 연휴 중국인 해외여행 예상 중국 관영 글로벌 타임스 여행사 예약 자료 인용 보도 올해 태국…
## 4 Chosun_6269 정부 발표 예비 조사 면제 사업 투입 새만금 국제공항 사업 빈약 여객 수요 인근 공항 중복 새만금 개발 진척…
## 5 Chosun_6270 국무부 올해 적용 주한 미군 방위비 분담금 협정 양국 원칙 합의 현지 시각 도널드 트럼프 대통령 합의 서명 …
dim(knews_df)
## [1] 7837 2
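The code that builds `knews_dfm_full` is not shown here; a minimal quanteda sketch (an assumption, not the original chunk) would tokenize the already-cleaned text on whitespace:

library(quanteda)
# The pre-processed text is a space-separated string of nouns, so simple
# whitespace tokenization suffices; the default document names
# (text1, text2, ...) match the dfm printed below
knews_dfm_full = dfm(tokens(knews_df$text, what = "fastestword"))
knews_dfm_full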
## Document-feature matrix of: 7,837 documents, 26,001 features (99.9% sparse).
# Too many features, so we may want to trim down the size of vocabulary
korea_dfm = dfm_trim(knews_dfm_full, min_docfreq = 0.001, max_docfreq = 0.99, docfreq_type = "prop")
korea_dfm[1:5,1:10]
## Document-feature matrix of: 5 documents, 10 features (70.0% sparse).
## 5 x 10 sparse Matrix of class "dfm"
## features
## docs 현지 시각 도널드 트럼프 미국 대통령 새해 의회 국정 연설
## text1 1 1 1 4 1 3 1 2 3 4
## text2 0 0 0 0 0 0 0 0 0 0
## text3 0 0 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 1 1 1 2 0 2 0 0 0 0
Perplexity can be used by 1) splitting the corpus into training and test sets, 2) fitting candidate models on the former, and 3) choosing the value of \(k\) that best predicts the latter (i.e., the lowest perplexity); see the sketch below.
Alternatively, the STM algorithm, a successor to LDA, provides its own diagnostics (held-out likelihood, semantic coherence, exclusivity, residuals); this is the route taken below.
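A minimal sketch of the perplexity route, assuming the `topicmodels` package (the 80/20 split, the seed, and the candidate grid of \(k\) values are illustrative choices, not from the original):

library(topicmodels)

# Convert the quanteda dfm and hold out 20% of documents as a test set
korea_dtm = convert(korea_dfm, to = "topicmodels")
set.seed(123)
train_idx = sample(nrow(korea_dtm), floor(0.8 * nrow(korea_dtm)))

# Fit one LDA model per candidate k on the training documents, then score
# each on the held-out documents; lower perplexity = better predictive fit
candidate_k = c(20, 40, 60, 80)
perp = sapply(candidate_k, function(k) {
  fit = LDA(korea_dtm[train_idx, ], k = k, control = list(seed = 123))
  perplexity(fit, newdata = korea_dtm[-train_idx, ])
})
candidate_k[which.min(perp)]  # the k with the lowest held-out perplexity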
load("korea_many_models_1106.RData")
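`korea_many_models` is loaded from disk above; a plausible sketch of how it was fit, following the Julia Silge workflow in Fig. 7 (the grid of K values is an assumption):

library(stm)
library(furrr)
plan(multiprocess)  # fit the candidate models in parallel

# stm() accepts a quanteda dfm directly; one model per candidate K
korea_many_models = tibble(K = c(20, 40, 60, 80, 100)) %>%
  mutate(topic_model = future_map(K, ~ stm(korea_dfm, K = .x, verbose = FALSE)))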
korea_heldout <- make.heldout(korea_dfm)
korea_k_result <- korea_many_models %>%
  mutate(semantic_coherence = future_map(topic_model, semanticCoherence, korea_dfm),
         exclusivity = future_map(topic_model, exclusivity),
         eval_heldout = future_map(topic_model, eval.heldout, korea_heldout$missing),
         residual = future_map(topic_model, checkResiduals, korea_dfm))
korea_k_result %>%
  transmute(K,
            `Semantic coherence` = map_dbl(semantic_coherence, mean),
            Exclusivity = map_dbl(exclusivity, mean),
            `Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout"),
            Residuals = map_dbl(residual, "dispersion")) %>%
  gather(Metric, Value, -K) %>%
  ggplot(aes(K, Value, color = Metric)) +
  geom_line(size = 1.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  labs(x = "K (number of topics)",
       y = NULL,
       title = "Model diagnostics by number of topics",
       subtitle = "These diagnostics indicate that a good number of topics would be around 60")
korea_k_result %>%
  dplyr::select(K, exclusivity, semantic_coherence) %>%
  unnest(cols = c(exclusivity, semantic_coherence)) %>%
  mutate(K = as.factor(K)) %>%
  ggplot(aes(semantic_coherence, exclusivity, color = K)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(x = "Semantic coherence",
       y = "Exclusivity",
       title = "Comparing exclusivity and semantic coherence",
       subtitle = "Models with fewer topics have higher semantic coherence\nfor more terms, but lower exclusivity")
korea_topic_model <- korea_k_result %>%
  filter(K == 60) %>%
  pull(topic_model) %>%
  .[[1]]
korea_topic_model
## A topic model with 60 topics, 7837 documents and a 5130 word dictionary.
## [1] "mu" "sigma" "beta" "settings" "vocab"
## [6] "convergence" "theta" "eta" "invsigma" "time"
## [11] "version"
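The tibble printed below, `kt_beta`, holds the per-topic word probabilities \(\beta\). Its creation is not shown; it is presumably produced with tidytext's tidy() verb, for which the beta matrix is the default:

library(tidytext)
kt_beta <- tidy(korea_topic_model)  # matrix = "beta" is the default
kt_beta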
## # A tibble: 307,800 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 가게 7.12e-104
## 2 2 가게 7.35e-100
## 3 3 가게 1.37e-171
## 4 4 가게 1.91e-135
## 5 5 가게 2.20e-117
## 6 6 가게 1.74e- 70
## 7 7 가게 1.99e-138
## 8 8 가게 9.26e-143
## 9 9 가게 1.01e- 62
## 10 10 가게 4.38e-153
## # … with 307,790 more rows
kt_beta %>%
  filter(topic < 6) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  theme_minimal(base_family = "Nanum Pen Script") +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities for topics 1 ~ 5",
       subtitle = "Different words are associated with different topics")
kt_theta <- tidy(korea_topic_model, matrix = "gamma",
                 document_names = rownames(korea_dfm)) # what tidytext calls "gamma" is the \(\theta\) matrix of per-document topic proportions
kt_theta
## # A tibble: 470,220 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 text1 1 0.00163
## 2 text2 1 0.00206
## 3 text3 1 0.00899
## 4 text4 1 0.000348
## 5 text5 1 0.00192
## 6 text6 1 0.00446
## 7 text7 1 0.00177
## 8 text8 1 0.00240
## 9 text9 1 0.00300
## 10 text10 1 0.00114
## # … with 470,210 more rows
kt_theta %>%
  filter(topic < 6 & gamma > 0.5) %>%
  ggplot(aes(gamma, fill = as.factor(topic))) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, ncol = 3) +
  labs(title = "Distribution of document probabilities for each topic",
       subtitle = "Topics 1 and 2 do not have strongly associated stories",
       y = "Number of articles", x = expression(gamma))
library(scales)
theta_terms %>%
  top_n(20, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0, nudge_y = 0.0005, size = 8, family = "Nanum Pen Script") +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 0.06),
                     labels = percent_format()) +
  theme_minimal(base_family = "Nanum Pen Script", base_size = 20) +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 13)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 20 topics by prevalence in the Korean news corpus",
       subtitle = "With the top words that contribute to each topic")