You are tasked to use Topic Modelling techniques (e.g. Latent Dirichlet Allocation) to categorise the Budget 2019 Statement into 4-6 key topics. You will also need to provide a description of each key topic based on the key words derived.
Topic Modelling is a method used for unsupervised classification of documents, such as blog posts or news articles, that we would like to divide into natural groups so that we can understand them separately.
Latent Dirichlet Allocation (LDA) is a common method used for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to overlap one another in terms of content, in a way that mirrors the typical use of natural language.
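The "mixture" idea can be made concrete with a small simulation. The following is only a toy sketch in base R: the vocabulary, the topic-word probabilities (`beta`) and the document-topic proportions (`theta`) are made-up values chosen for illustration, not estimates from the Budget Statement.

```r
# Toy illustration of the LDA generative story (all numbers are made up).
set.seed(1)

vocab <- c("support", "worker", "care", "build")

# Each topic is a probability distribution over the vocabulary (rows sum to 1).
beta <- rbind(
  topic1 = c(0.60, 0.30, 0.05, 0.05),
  topic2 = c(0.05, 0.05, 0.50, 0.40)
)
colnames(beta) <- vocab

# Each document is a probability distribution over topics.
theta <- c(topic1 = 0.7, topic2 = 0.3)

# Generate a 20-word document: pick a topic for each word position, then
# pick a word from that topic's distribution.
n_words <- 20
z <- sample(names(theta), n_words, replace = TRUE, prob = theta)
words <- vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))

table(words)
```

Fitting an LDA model runs this story in reverse: given only the observed words, it estimates the topic-word and document-topic distributions that could plausibly have generated them.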
# Loading Libraries
library(tm)
## Loading required package: NLP
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.6.3
library(tidytext)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(tidyr)
# Loading Data
corp <- Corpus(URISource("./fy2019_budget_statement.pdf"),
               readerControl = list(reader = readPDF))
# Cleaning Data
dtm <- DocumentTermMatrix(corp, control = list(
  removePunctuation = TRUE,
  stopwords = TRUE,
  tolower = TRUE,
  stemming = TRUE,
  removeNumbers = TRUE))
inspect(dtm)
## <<DocumentTermMatrix (documents: 1, terms: 1864)>>
## Non-/sparse entries: 1864/0
## Sparsity : 0%
## Maximal term length: 16
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs build continu help provid singapor singaporean
## fy2019_budget_statement.pdf 64 57 65 68 84 77
## Terms
## Docs support will worker year
## fy2019_budget_statement.pdf 110 268 65 111
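The matrix above contains a single document, so a topic model has no variation between documents to learn from. One possible workaround, sketched here in base R under the assumption that the extracted text separates paragraphs with blank lines, is to treat each paragraph as its own pseudo-document. The `full_text` string is a hypothetical stand-in, not the actual statement text.

```r
# Hypothetical stand-in for the extracted statement text; in practice this
# would come from the PDF reader.
full_text <- paste(
  "We will support our workers through technology.",
  "",
  "We will help enterprises build new capabilities.",
  "",
  "We will provide care for Singaporeans in need.",
  sep = "\n"
)

# Split on blank lines so that each paragraph becomes its own pseudo-document.
chunks <- strsplit(full_text, "\n[[:space:]]*\n")[[1]]
chunks <- trimws(chunks)
chunks <- chunks[nzchar(chunks)]

length(chunks)
```

The resulting character vector could then be turned into a corpus with `VCorpus(VectorSource(chunks))` and passed to `DocumentTermMatrix()` with the same control list. Whether paragraphs, pages or sections make the best unit is a judgment call that depends on how the PDF text is laid out.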
# Finding Frequently Occurring Terms
ft <- findFreqTerms(dtm, lowfreq = 50, highfreq = Inf)
ft.matrix <- as.matrix(dtm[, ft])
print(ft.matrix)
## Terms
## Docs also build care continu help need provid singapor
## fy2019_budget_statement.pdf 55 64 54 57 65 50 68 84
## Terms
## Docs singaporean support technolog will worker year
## fy2019_budget_statement.pdf 77 110 50 268 65 111
From the summary shown above, there are 14 frequently occurring terms, each with a word count of at least 50 (the stems "need" and "technolog" occur exactly 50 times).
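The `lowfreq = 50` bound is inclusive, which is why terms with a count of exactly 50 appear in the result. A small base R sketch, using a subset of the counts printed above, shows the difference between the inclusive and the strict comparison:

```r
# Term counts copied from the output above (a subset, for illustration).
counts <- c(also = 55, care = 54, need = 50, technolog = 50, will = 268)

# Inclusive lower bound: terms with exactly 50 occurrences are kept.
at_least_50 <- names(counts)[counts >= 50]

# Strict lower bound: terms with exactly 50 occurrences are dropped.
above_50 <- names(counts)[counts > 50]

at_least_50
above_50
```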
# Creating a Four-Topic LDA Model
Bud_lda <- LDA(dtm, k = 4, control = list(seed = 1234))
print(Bud_lda)
## A LDA_VEM topic model with 4 topics.
# Examining Per-Topic-Per-Word Probabilities (Beta) from the Model
Bud_topics <- tidy(Bud_lda, matrix = "beta")
print(Bud_topics)
## # A tibble: 7,456 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 ‘old 0.000121
## 2 2 ‘old 0.000118
## 3 3 ‘old 0.000114
## 4 4 ‘old 0.0000829
## 5 1 “asia 0.0000818
## 6 2 “asia 0.0000441
## 7 3 “asia 0.000181
## 8 4 “asia 0.000123
## 9 1 “global 0.0000825
## 10 2 “global 0.000125
## # ... with 7,446 more rows
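The first terms in the table (‘old, “asia, “global) still carry curly quotation marks, which suggests that the punctuation removal step did not catch these characters. A minimal sketch of one way to strip them is shown below; `strip_curly_quotes` is a hypothetical helper name, and the list of characters is an assumption that may need extending for other documents.

```r
# Terms as they appear in the beta table above, with stray curly quotes.
terms_raw <- c("\u2018old", "\u201Casia", "\u201Cglobal")

# Curly single and double quotation marks to remove.
curly_quotes <- c("\u2018", "\u2019", "\u201C", "\u201D")

# Remove each curly quote character using literal (non-regex) matching.
strip_curly_quotes <- function(x) {
  for (q in curly_quotes) {
    x <- gsub(q, "", x, fixed = TRUE)
  }
  x
}

terms_clean <- strip_curly_quotes(terms_raw)
terms_clean
```

In the pipeline above, a helper like this would most likely be applied to the text before the document-term matrix is built, for example through `tm_map()` with `content_transformer()`, so that “asia and asia are counted as the same term.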
# Plotting Common Words in the Four Topics Extracted
Bud_top_terms <- Bud_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
Bud_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
The visualisation shown above allows us to understand the four topics that were extracted from the Budget 2019 Statement. Because stemming was enabled when the matrix was built, the terms in the plot are word stems (for example, "technolog" for "technology"); the full words are used in the descriptions below for readability.
The most common words in Topic 1 include “support”, “worker”, and “technology”, which may reflect how the government intends to use technological advances to support workers.
The most common words in Topic 2 include “companies”, “care”, and “innovation”, which may reflect how the government intends to promote innovation in companies that are facing local and global challenges.
The most common words in Topic 3 include “help”, “build”, and “enterprise”, which may reflect how the government intends to help enterprises build up resilience against local and global challenges.
The most common words in Topic 4 include “worker”, “care”, and “need”, which may reflect how the government intends to provide care and support for workers and those in need.
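The ranking behind these descriptions is simply a sort of each topic's word probabilities. As a minimal sketch in base R, using a hypothetical two-topic `beta` matrix rather than the fitted model, the top terms per topic can be read off like this:

```r
# Hypothetical topic-by-term probabilities (rows are topics, columns are terms).
beta <- rbind(
  c(0.40, 0.30, 0.20, 0.10),
  c(0.10, 0.15, 0.25, 0.50)
)
dimnames(beta) <- list(
  c("topic1", "topic2"),
  c("support", "worker", "care", "need")
)

# For each topic (row), sort the terms by probability and keep the first n.
top_terms <- function(beta, n) {
  apply(beta, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)])
}

top <- top_terms(beta, 2)
top
```

For the fitted model itself, the `topicmodels` package provides `terms()`, so `terms(Bud_lda, 10)` should give a comparable listing without going through the tidy table.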
# Examining Per-Document-Per-Topic Probabilities (Gamma) from the Model
Bud_document <- tidy(Bud_lda, matrix = "gamma")
print(Bud_document)
## # A tibble: 4 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 fy2019_budget_statement.pdf 1 0.254
## 2 fy2019_budget_statement.pdf 2 0.237
## 3 fy2019_budget_statement.pdf 3 0.245
## 4 fy2019_budget_statement.pdf 4 0.264
From the summary shown above, each gamma value is the estimated proportion of words in the document (the Budget 2019 Statement) that were generated from that particular topic. For example, the model estimates that about 25% of the words in the document were generated from Topic 1.
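Two quick checks on these numbers are sketched below in base R, using the gamma values copied from the output above: the proportions for a document should sum to 1, and their distance from an even split of 1/4 per topic shows how strongly the model separates the topics.

```r
# Gamma values copied from the output above.
gamma <- c(0.254, 0.237, 0.245, 0.264)

# The topic proportions for one document should sum to (approximately) 1.
total <- sum(gamma)

# Largest departure from an even split of 1/k across k = 4 topics.
max_dev <- max(abs(gamma - 1 / length(gamma)))

total
max_dev
```

The largest departure from 0.25 is small, so the four topics share the document almost evenly. With only one document in the corpus this is not surprising, and it suggests the topic descriptions above should be read with some caution.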