Overview

You are tasked with using Topic Modelling techniques (e.g. Latent Dirichlet Allocation) to categorise the Budget 2019 Statement into 4-6 key topics, and with providing a description of each key topic based on the keywords derived.

Topic Modelling is a method used for unsupervised classification of documents, such as blog posts or news articles, that we would like to divide into natural groups so that we can understand them separately.

Latent Dirichlet Allocation (LDA) is a common method used for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to overlap one another in terms of content, in a way that mirrors the typical use of natural language.
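
As a minimal illustration of these two mixtures, the sketch below fits a two-topic model on the AssociatedPress corpus that ships with the topicmodels package; the beta matrix holds each topic's word mixture and the gamma matrix each document's topic mixture. This is an illustration only and is not part of the analysis that follows.

# Illustrative sketch: LDA on the AssociatedPress corpus bundled with topicmodels
library(topicmodels)
library(tidytext)
data("AssociatedPress", package = "topicmodels")
toy_lda <- LDA(AssociatedPress[1:50, ], k = 2, control = list(seed = 1))
tidy(toy_lda, matrix = "beta")   # each topic as a mixture of words
tidy(toy_lda, matrix = "gamma")  # each document as a mixture of topics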

Loading Libraries

# Loading Libraries
library(tm)
## Loading required package: NLP
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.6.3
library(tidytext)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(tidyr)

Loading and Cleaning Data

# Loading Data (readPDF uses the pdftools engine by default,
# so the pdftools package must be installed)
corp <- Corpus(URISource("./fy2019_budget_statement.pdf"),
               readerControl = list(reader = readPDF))

# Cleaning Data: build a document-term matrix, normalising the text as we go
dtm <- DocumentTermMatrix(corp, control = list(
  removePunctuation = TRUE,   # strip (ASCII) punctuation
  stopwords = TRUE,           # drop common English stopwords
  tolower = TRUE,             # lowercase all terms
  stemming = TRUE,            # reduce words to their stems
  removeNumbers = TRUE))      # drop numeric tokens
inspect(dtm)
## <<DocumentTermMatrix (documents: 1, terms: 1864)>>
## Non-/sparse entries: 1864/0
## Sparsity           : 0%
## Maximal term length: 16
## Weighting          : term frequency (tf)
## Sample             :
##                              Terms
## Docs                          build continu help provid singapor singaporean
##   fy2019_budget_statement.pdf    64      57   65     68       84          77
##                              Terms
## Docs                          support will worker year
##   fy2019_budget_statement.pdf     110  268     65  111
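
Before modelling, it is worth a quick sanity check on the vocabulary that the cleaning step produced. A small sketch using tm's accessor functions (illustrative, not part of the original analysis):

# Sanity checks on the cleaned document-term matrix (illustrative)
nTerms(dtm)           # vocabulary size after cleaning
head(Terms(dtm), 20)  # a peek at the (stemmed) terms
sum(as.matrix(dtm))   # total number of tokens retained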

Finding Frequently Occurring Terms

# Finding Frequently Occurring Terms
ft <- findFreqTerms(dtm, lowfreq = 50, highfreq = Inf)
ft.matrix <- as.matrix(dtm[, ft])
print(ft.matrix)
##                              Terms
## Docs                          also build care continu help need provid singapor
##   fy2019_budget_statement.pdf   55    64   54      57   65   50     68       84
##                              Terms
## Docs                          singaporean support technolog will worker year
##   fy2019_budget_statement.pdf          77     110        50  268     65  111

From the output shown above, there are 14 frequently occurring terms, each appearing at least 50 times. Note that the terms are stems (e.g. “singapor”, “technolog”), since stemming was applied when building the document-term matrix.
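
A simple frequency plot can make this table easier to read. A sketch, with the data frame construction being one possible arrangement rather than part of the original analysis:

# Plotting the frequently occurring terms (illustrative sketch)
freq_df <- data.frame(term = colnames(ft.matrix), count = ft.matrix[1, ])
ggplot(freq_df, aes(reorder(term, count), count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term (stemmed)", y = "Frequency")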

Creating a Four-Topic LDA Model

# Creating a Four-Topic LDA Model
Bud_lda <- LDA(dtm, k = 4, control = list(seed = 1234))  # fixed seed for reproducibility
print(Bud_lda)
## A LDA_VEM topic model with 4 topics.
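
The task allows between 4 and 6 topics, and k = 4 is one choice within that range. One hedged way to compare candidate values is the perplexity() function from topicmodels (lower is better), although with a corpus of a single document such comparisons are indicative at best:

# Comparing candidate topic counts by perplexity (illustrative sketch)
sapply(4:6, function(k) {
  perplexity(LDA(dtm, k = k, control = list(seed = 1234)))
})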

Topic-Word Probabilities

# Examining Per-Topic-Per-Word Probabilities (Beta) from the Model
Bud_topics <- tidy(Bud_lda, matrix = "beta")
print(Bud_topics)
## # A tibble: 7,456 x 3
##    topic term         beta
##    <int> <chr>       <dbl>
##  1     1 ‘old    0.000121 
##  2     2 ‘old    0.000118 
##  3     3 ‘old    0.000114 
##  4     4 ‘old    0.0000829
##  5     1 “asia   0.0000818
##  6     2 “asia   0.0000441
##  7     3 “asia   0.000181 
##  8     4 “asia   0.000123 
##  9     1 “global 0.0000825
## 10     2 “global 0.000125 
## # ... with 7,446 more rows
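
Terms such as ‘old and “asia show that the curly quotation marks in the PDF survived cleaning: removePunctuation strips only ASCII punctuation by default. Assuming a version of tm in which removePunctuation() accepts the ucp argument (0.7 or later), one possible fix is to pass it through the control list and rebuild the matrix:

# Rebuilding the matrix with Unicode punctuation removed (assumes tm >= 0.7)
dtm <- DocumentTermMatrix(corp, control = list(
  removePunctuation = list(ucp = TRUE),  # also strips curly quotes and the like
  stopwords = TRUE,
  tolower = TRUE,
  stemming = TRUE,
  removeNumbers = TRUE))
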
# Plotting Common Words in the Four Topics Extracted
Bud_top_terms <- Bud_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

Bud_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()

The faceted bar chart above shows the ten terms with the highest beta in each of the four topics extracted from the Budget 2019 Statement.

The most common words in Topic 1 include “support”, “worker”, and “technology”, which may represent measures through which the government intends to utilise technological advances to support our workers.

The most common words in Topic 2 include “companies”, “care”, and “innovation”, which may represent measures through which the government intends to promote innovation in companies facing local and global challenges.

The most common words in Topic 3 include “help”, “build”, and “enterprise”, which may represent measures through which the government intends to help enterprises build up resilience against local and global challenges.

The most common words in Topic 4 include “worker”, “care”, and “need”, which may represent measures through which the government intends to provide care and support for our workers and those in need.

Document-Topic Probabilities

# Examining Per-Document-Per-Topic Probabilities (Gamma) from the Model 
Bud_document <- tidy(Bud_lda, matrix = "gamma")
print(Bud_document)
## # A tibble: 4 x 3
##   document                    topic gamma
##   <chr>                       <int> <dbl>
## 1 fy2019_budget_statement.pdf     1 0.254
## 2 fy2019_budget_statement.pdf     2 0.237
## 3 fy2019_budget_statement.pdf     3 0.245
## 4 fy2019_budget_statement.pdf     4 0.264

From the output shown above, each gamma value is the estimated proportion of words in the document (the Budget 2019 Statement) generated from that particular topic. For example, the model estimates that about 25% of the words in the document come from Topic 1. As all four values sit close to 0.25, this single document appears to draw roughly evenly on all four topics.
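
Beyond these document-level proportions, tidytext's augment() can attach the most likely topic to each term count in the matrix, which gives a finer view of which words drive each topic:

# Assigning each term occurrence to its most likely topic (illustrative)
assignments <- augment(Bud_lda, data = dtm)
assignments  # columns: document, term, count, .topic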