Introduction

Every school is a good school. Level the playing field. Data analytics. We have all heard these phrases used again and again, not just from the same people, but from leaders of different ministries. When I heard for the umpteenth time the ever-so-famous line “it’s not a question of if, but when” at a community event, I wondered just how many times keywords and keyphrases were being recycled by different speakers. Just for fun, I actually tried searching. As I looked through the large bank of speeches on ministries’ websites, it occurred to me that petabytes of textual data are created every day - it’s impossible to keep up on our own. We would have to either specialise (and rely on one another) or exploit technology. The first option is feasible, but the second is much more intriguing. What if we had access to customised applications that could process large volumes of textual data, and do our reading for us, in real time? What if we had bots that could perform sensemaking on our behalf, that could zip through content to deliver useful, actionable insights? Yes, I’m sure these are already employed by big corporations, but what if such tools were available to the layman? How every individual would be a well-read individual; how quickly we could level the playing field; much data analytics.

In this post, I explore three techniques for automatic summarisation of text: word counting, Latent Dirichlet Allocation and a graph method inspired by Google’s PageRank. I apply these algorithms to speeches by MPs in nine ministries from January 2016 to August 2017.

Data

I love it when data comes in a neat, clean Excel spreadsheet. However, the bulk of data in our modern age is unstructured. We don’t need to, and cannot, wait for interesting datasets to appear on sites like Data.Gov.Sg to generate insights on politics, economics, or society. The internet is a data trove that’s just waiting to be explored. For this post, I scrape speeches from ministries’ websites and process them using the tm (text mining) package in R. The scraping and cleaning process is highly technical, so I have omitted the full scraping code from the main post. Instead, I provide a brief description of my methodology below.

Scraping Speeches

Once I obtained the links to the dedicated pages for speeches, I employed the following broad approach to obtain a clean string of text (a rough code sketch follows the list):

  1. Download URL and parse HTML
  2. Save HTML and text as a character string
  3. Inspect page source to identify where to extract article (based on the HTML tags like <div>)
  4. Perform text cleaning
  5. Assign date tag
  6. Assign ministry tag
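
Before going through these steps in detail, here is a minimal sketch of steps 1 to 3, assuming the rvest package; the URL and the CSS selector are purely hypothetical, since every site structures its pages differently:

# Sketch of steps 1-3 (illustrative only; URL and selector are hypothetical)
library(rvest)

# Step 1: download and parse the HTML of a speech page
speech_url <- "https://www.example.gov.sg/speeches/sample-speech"
page <- read_html(speech_url)

# Steps 2-3: pull out the chunk of HTML that holds the speech
# (the selector is identified by inspecting the page source)
speech_text <- page %>%
    html_nodes("div.speech-content") %>%
    html_text()

# Collapse into a single character string, ready for cleaning (step 4)
speech_text <- paste(speech_text, collapse = " ")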

In step 1, I downloaded the source code for the dedicated speech pages in R and cleaned up the HTML. In step 2, I converted this into a mix of HTML code and useful text in string format, which is easy to work with. This allowed me to do the Microsoft Word equivalent of “Find and Replace” to remove unwanted HTML code. In step 3, I identified which specific chunk of HTML code and text contained the speech. Step 4 is the tough bit, where we have to write “Find and Replace” rules to clean up the text. Here are some examples of the rules I wrote to clean the speeches, in no particular order of importance (a code sketch follows the list):

  1. Convert all headings to spaces. Transcribed speeches typically have headers that are separated from the main paragraph only by heading tags. For example, if we were to simply remove HTML tags from the string <h3>Section Header</h3>Start of Content, we would get Section HeaderStart of Content, which would trip up our text mining functions.
  2. Remove paragraph indices. Some transcripts have numbered paragraphs, and writers may use a number followed by either a tab, or several spaces. Either way, it is important to identify which is being used, and cater for both possibilities to ensure that these indices are removed. The numbers may cause the preceding words to become non-words. For example, ...end of sentence.2 Next sentence.... If we aren’t careful, the first occurrence of “sentence” becomes “sentence2” after we remove punctuation.
  3. Remove unnecessary spaces. This involved removing extra spaces before and after the text, and consolidating multiple spaces between text to ensure that the text mining functions do not pick up null keywords.
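
To make these rules concrete, here is a minimal sketch of the three rules above as gsub() calls on a toy string; the patterns are deliberately simplified and purely illustrative:

# Toy example combining all three problems
text <- "<h3>Section Header</h3>Start of content.2   Next sentence   here. "

# Rule 1: convert heading tags to spaces so headers do not fuse with the next word
text <- gsub("</?h[1-6][^>]*>", " ", text)

# Rule 2: remove paragraph indices (a number glued to the previous full stop,
# followed by a tab or spaces)
text <- gsub("([.!?])[0-9]+[\t ]+", "\\1 ", text)

# Rule 3: trim leading/trailing spaces and collapse runs of spaces
text <- gsub("^\\s+|\\s+$", "", text)
text <- gsub("\\s+", " ", text)

text
# [1] "Section Header Start of content. Next sentence here."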

As you can see, text cleaning is complex. It’s also tedious, because we never know if a “Find and Replace” rule is needed until we spot problems in the text. As such, I went through multiple iterations to create rules for cleaning text. The outcome of my data scraping effort was 836 speeches from nine ministries, from Jan 2016 to Aug 2017.

On Summarisation

There are two broad approaches to summarising texts: extraction and abstraction. Extraction builds a summary by lifting words directly out of a text. Abstraction builds a summary from the concepts in a text - this is much more difficult because it requires natural language processing (NLP) techniques. In this post, we start off with keyword counting - an extraction technique. Then, we go into two abstraction techniques: Latent Dirichlet Allocation (LDA) and a graph-based method.

Top Keywords

Let’s start off simple: individual keywords. Essentially, we want to know which keywords are used most often across the various ministerial speeches. To do so, we tap on the tm package to help us extract the keywords. We use the following process (below the list, I provide the code for a function that performs all of these steps):

  1. Put the speeches into a corpus, the data format that the tm package uses to store text documents.
  2. Convert all words to lower case. This standardises the format of the words, thereby enabling us to identify “Text” and “text” as the same thing.
  3. Remove punctuation, for reasons similar to those in step 2. For simple analyses like these, we are not interested in sentence structure, so we need not worry about losing punctuation.
  4. Remove stopwords. Stopwords are unimportant words that do not carry much meaning. See this page for a comprehensive list of stopwords.
  5. Create a document-term matrix (DTM). A DTM is a table that tells us how many times each term appears in each document. In our context, the Excel equivalent of a DTM would be a sheet with the speech numbers as the row names and the words as the column names. Each cell would contain the number of times the word in that column appears in the corresponding speech (row).
  6. Remove sparse terms. Sparse terms are words that appear in fewer than a given percentage of speeches. For this analysis, I set the threshold at 0.5%, so terms that appear in fewer than 0.5% of the speeches are dropped.
  7. Calculate the total frequency of words using the DTM with sparse words removed. This gives us our summary of the most frequently-used keywords.

# Load the text mining package
library(tm)

# Create function to summarise top keywords
summarise_keywords <- function(x){
    
    # Convert dashes to spaces - to ensure that words like "data-focused" are separated appropriately
    x <- gsub("-", " ", x)
    
    # Put into corpus
    corpus <- Corpus(VectorSource(x))
    
    # Convert to lower case (wrapped in content_transformer so tm keeps the corpus structure)
    corpus <- tm_map(corpus, content_transformer(tolower))
    
    # Remove punctuation
    corpus <- tm_map(corpus, removePunctuation)
    
    # Remove stopwords
    corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "singapore", "also", "can", "will", "singaporeans", "singapores"))
    
    # Create Document Term Matrix
    frequencies <- DocumentTermMatrix(corpus)
    
    # Remove sparse terms - keep terms that appear in 0.5% or more of the speeches
    sparse <- removeSparseTerms(frequencies, 0.995)
    
    # Convert to dataframe
    speechSparse <- as.data.frame(as.matrix(sparse))
    
    # Make the column names syntactically valid R names
    colnames(speechSparse) <- make.names(colnames(speechSparse))
    
    # Calculate word frequencies
    top_keywords <- colSums(speechSparse)
    top_keywords <- data.frame(
        word = names(top_keywords),
        freq = top_keywords
    )
    
    # Sort
    top_keywords <- top_keywords[order(top_keywords$freq, decreasing = TRUE), ]
    rownames(top_keywords) <- NULL
    
    # Output
    return(top_keywords)
}

Now, let’s perform a test using the MCI dataset.

# Test on MCI
mci_keywords <- summarise_keywords(mci_speeches$text)
mci_keywords_display <- mci_keywords

# Keep the top 25 keywords for plotting
mci_keywords_display <- mci_keywords_display[1:25, ]

mci_keywords_display$word <- factor(mci_keywords_display$word, levels = rev(as.character(mci_keywords_display$word)))

# Graph
ggplot(mci_keywords_display, aes(x = word, y = freq)) + geom_bar(stat = "identity") +
    coord_flip() + ggtitle("Top 25 Keywords - MCI") + theme(plot.title = element_text(lineheight=.8, face="bold")) +
    xlab("Terms") + ylab("Frequency")

As shown, the top keywords are typical descriptors of MCI - nothing special here. We can also express this in a more visually appealing way - a word cloud:

# Truncate dataset
mci_keywords <- mci_keywords[c(1:100), ]

# Word Cloud
perc_cloud <- round(mci_keywords$freq/sum(mci_keywords$freq)*100, 2)

# Create the word cloud palette (brewer.pal is from the RColorBrewer package)
colors <- brewer.pal(9, "PuBu")
colors <- colors[-c(1:4)]

# Generate the word cloud (wordcloud package)
set.seed(1)
wordcloud(mci_keywords$word,
          perc_cloud,
          scale = c(6, 0.15),
          vfont=c("sans serif","plain"),
          colors=colors,
          max.words = Inf,
          random.order = FALSE)

I’ve created word clouds for the remaining eight ministries that I collected data on. We see pretty much the same thing that we found for MCI: the top keywords are what would typically come to mind when we think of these ministries. This is the limitation of using the keyword count technique on a large number of speeches: generic keywords come to the surface while deeper, subtler relationships between words and concepts remain hidden. Still, the word clouds tell us roughly what each ministry is concerned with.

Topics

While keywords are interesting, they can only tell you so much about what is being said. For the lazy layman like me who always lags behind on the latest developments, perhaps the topics raised in speeches would be of interest. Fortunately, there are abstraction algorithms out there that can help.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is an unsupervised machine learning technique for grouping textual data into topics. LDA can tell us what mix of topics each document contains, and what mix of words makes up each topic. (Unsupervised models are fed only inputs and aim to find structure in the data, whereas supervised methods aim to learn the most accurate mapping from inputs to outputs.) The math behind LDA is incredibly complex - I don’t understand it completely myself. What’s important is that we know how to use the model, tune it and generate results of value.

Let’s use MTI data as an example. I’ve chosen the speech by Mr S Iswaran, Minister for Trade & Industry (Industry), at the Committee of Supply Debate 2017. We begin with some text cleaning:

# Pull data
mti_topics <- mti_speeches$text[67]

# Convert dashes to spaces - to ensure that words like "data-focused" are separated appropriately
mti_topics <- gsub("-", " ", mti_topics)

# Put into corpus
corpus <- Corpus(VectorSource(mti_topics))

# Convert to lower case (wrapped in content_transformer so tm keeps the corpus structure)
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# Remove stopwords
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "singapore", "also", "can", "will", "singaporeans", "singapores"))

# Create Document Term Matrix
mti_dtm <- DocumentTermMatrix(corpus)

Next, we need to choose k, the number of topics. The LDA algorithm then finds the keywords that best describe each of these k topics. Unfortunately, there is no hard and fast rule for choosing k. We could arbitrarily posit some number of topics and fit an LDA model with it. Alternatively, we can choose k programmatically: fit models for a range of k and pick a value based on model diagnostics - this is what the FindTopicsNumber function from the ldatuning package does. For this experiment, I tested a relatively wide range of k: 4 to 50.

# Find k
mti_find_k <- FindTopicsNumber(
    mti_dtm, topics = 4:50,
    metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
    control = list(seed = 123),
    mc.cores = 2L,
    verbose = TRUE
)

# Plot the result
FindTopicsNumber_plot(mti_find_k)

The plot shows three different metrics, which together suggest that the optimal k lies between 10 and 12 topics. Let’s go with 11.

# Run the LDA model (LDA is from the topicmodels package)
mti_lda <- LDA(mti_dtm, 11, control = list(seed = 123))

# Extract top 10 terms
mti_terms <- terms(mti_lda, 10)
mti_terms <- data.frame(t(mti_terms))
colnames(mti_terms) <- paste("Keyword", c(1:10))

# Print the top terms per topic as a table (kable is from the knitr package)
kable(mti_terms)
|          | Keyword 1  | Keyword 2  | Keyword 3    | Keyword 4     | Keyword 5  | Keyword 6      | Keyword 7  | Keyword 8     | Keyword 9     | Keyword 10     |
|----------|------------|------------|--------------|---------------|------------|----------------|------------|---------------|---------------|----------------|
| Topic 1  | companies  | government | industry     | smes          | food       | sectors        | innovation | overseas      | today         | needs          |
| Topic 2  | industry   | innovation | smes         | well          | new        | asked          | programme  | companies     | sector        | food           |
| Topic 3  | companies  | sector     | solar        | industry      | growth     | smes           | value      | opportunities | build         | transformation |
| Topic 4  | industry   | solar      | sector       | energy        | projects   | well           | new        | key           | programme     | want           |
| Topic 5  | companies  | industry   | energy       | innovation    | asked      | smes           | new        | needs         | growth        | manufacturing  |
| Topic 6  | companies  | markets    | well         | opportunities | smes       | build          | projects   | research      | business      | needs          |
| Topic 7  | companies  | innovation | capabilities | value         | programme  | demand         | government | itms          | manufacturing | help           |
| Topic 8  | government | smes       | solar        | business      | overseas   | well           | companies  | capabilities  | projects      | sector         |
| Topic 9  | companies  | industry   | capabilities | well          | food       | manufacturing  | build      | sector        | government    | solar          |
| Topic 10 | industry   | projects   | growth       | food          | innovation | demand         | new        | smes          | needs         | itms           |
| Topic 11 | solar      | overseas   | needs        | opportunities | ensure     | transformation | growth     | itms          | innovation    | food           |

As you can see, the keywords given seem random, and there is substantial overlap among the topics. You might have your own interpretations, but I find these results uninformative. Overall, the LDA model is quick and easy, but not too useful in helping us to discern what topics there are.
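
That said, a fitted LDA object does expose the two distributions mentioned earlier, should you wish to dig further. A minimal sketch using the posterior() function from the topicmodels package on the mti_lda object fitted above:

# Per-topic word distributions: one row per topic, one column per term
topic_word <- posterior(mti_lda)$terms

# Per-document topic distributions: one row per document, one column per topic
doc_topic <- posterior(mti_lda)$topics

# For example, the ten most probable words in topic 1
head(sort(topic_word[1, ], decreasing = TRUE), 10)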

Graphs

Another method by which we can identify key topics in a text is through graphs. Take TextRank for example, a “graph-based ranking model for text processing”.1 Imagine that words are nodes in a network. Nodes (words/concepts) that are highly connected are effective intermediaries for a collection of nodes. They may thus be interpreted as key concepts that link different ideas. Hence, we aim to look for these “connector” nodes in a text - these are our abstract concepts. I use a much simpler approach than TextRank: I create a graph of word relations, and find the nodes that are most connected using a measure called betweenness. Node betweenness “measures the extent to which a vertex (a node) lies on paths between other vertices (nodes)”. This ties in with our interpretation above: nodes with high betweenness are effective intermediaries and therefore, key concepts.

To implement this graph approach, I adapted code from Ivan Berlocher’s script, with a few modifications. I used his concept of assigning Part-of-Speech (POS) tags to filter nouns and adjectives, and added other noun and adjective types to be more inclusive in keyword selection. Instead of identifying bigrams (two-word phrases) at the document level where sentences are all connected, I modified the algorithm to ensure that bigrams come from the same sentence, without pulling words from the subsequent sentence. In addition, I wrote my own (arguably more efficient) algorithm for creating a graph, and I calculate the betweenness for nodes (words) and edges (bigrams). In general, the algorithm goes like so:

  1. Clean the text as we did earlier.
  2. Assign POS tags to identify nouns, adjectives and so on. We do this within sentences rather than across the whole document, which does affect the POS tags assigned.
  3. Extract nouns and adjectives. Specifically, we’re looking at nouns (singular and plural), proper nouns (singular and plural), adjectives, comparative adjectives and superlative adjectives. For simplicity, let’s call them nouns and adjectives.
  4. For each noun and adjective, we identify the adjacent word and the next adjacent word. That is, in the sentence “Singapore is great” with “Singapore” as the keyword, we would extract “is” and “great” to form the bigrams “Singapore is” and “Singapore great”. (A code sketch of this step follows the list.)
  5. Put all bigrams into a graph. Each word forms a node, and each two-word-relation forms an edge.
  6. Calculate node (word) betweenness and edge (bigram) betweenness, and rank them.
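
To make step 4 concrete, here is a minimal sketch of the bigram-forming logic for a single tokenised sentence, assuming the POS filter has already shortlisted the keywords (the real implementation loops over every sentence in the speech):

# Toy sentence, tokenised, with the words that passed the POS filter
sentence <- c("singapore", "is", "great")
keywords <- c("singapore", "great")

# For each keyword, pair it with the adjacent word and the next adjacent word,
# without crossing the end of the sentence
bigrams <- do.call(rbind, lapply(which(sentence %in% keywords), function(i) {
    targets <- sentence[intersect(c(i + 1, i + 2), seq_along(sentence))]
    if (length(targets) == 0) return(NULL)
    data.frame(from = sentence[i], to = targets, stringsAsFactors = FALSE)
}))

bigrams
#        from    to
# 1 singapore    is
# 2 singapore great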

To illustrate what a graph looks like, I’ve created a graph using our “Singapore is great” example with an additional phrase, “you are great”. For simplicity, let’s bend the rules of grammar for a moment and assume all the words in these two phrases are either nouns or adjectives so that we can shortlist them for inclusion in the graph:

# Create graph (graphNEL is from the Bioconductor graph package; plotting it needs Rgraphviz)
example_graph <- new("graphNEL")

# Add nodes
example_graph <- addNode("singapore", example_graph)
example_graph <- addNode("is", example_graph)
example_graph <- addNode("great", example_graph)
example_graph <- addNode("you", example_graph)
example_graph <- addNode("are", example_graph)

# Add edges
example_graph <- addEdge("singapore", "is", example_graph)
example_graph <- addEdge("is", "great", example_graph)
example_graph <- addEdge("singapore", "great", example_graph)
example_graph <- addEdge("you", "are", example_graph)
example_graph <- addEdge("you", "great", example_graph)
example_graph <- addEdge("are", "great", example_graph)

# Plot
plot(example_graph)
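
Before interpreting the plot, we can also compute betweenness on this toy graph to check which word is the strongest connector. Here is a minimal sketch that rebuilds the same graph with the igraph package - an alternative to graphNEL, used here purely for illustration:

# Same five words and six relations, as an undirected igraph graph
library(igraph)

toy <- graph_from_data_frame(
    data.frame(
        from = c("singapore", "is",    "singapore", "you", "you",   "are"),
        to   = c("is",        "great", "great",     "are", "great", "great")
    ),
    directed = FALSE
)

# Node betweenness: "great" lies on every shortest path between the two phrases
betweenness(toy)

# Edge betweenness: the edges touching "great" carry the most traffic
edge_betweenness(toy)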

The encircled words in the graph are nodes, and the lines are edges representing word relations. From this graph, we see that the word linking the two phrases is “great”. Therefore, we could infer that this text comprising two phrases is about “greatness” (or, this example is pure greatness). Applying this concept to the full speech by Mr S Iswaran, we are able to obtain the top keywords:

The graph above tells us which keywords are the best “connectors” for other words - equivalently, these are our key concepts. Let’s look at the top three keywords. The word “industry” has the highest score. Separately, we can verify that it appears a total of 38 times throughout the speech, which is expected, given that the speaker is, after all, the Minister for Trade & Industry (Industry) talking about industry. This means that “industry” is able to connect many different words throughout the speech, hence its status as the top connector and a key concept. The word “solar” has the next highest score, and appears 21 times in the speech, mostly in the Clean Energy Sector Opportunities section, while the third word “food” appears 18 times, mainly in the section ITMs are Key Mechanisms to Build Capabilities and Transform Sectors. Although “solar” and “food” are not spread out throughout the speech, they are connected to sufficiently many other words in their respective sections. Hence, they too are key concepts in the text. We can calculate the top keywords’ reach to the 2nd degree (the equivalent of unique friends of friends):

Generally, the greater the reach, the better the keyword works as a connector of words and ideas. Separately, notice that even words with low reach make it into the top 25. A possible reason for this is that these words eventually connect to keywords that themselves have sufficiently many connections. The graph method is able to identify such relationships; the word count method cannot. The difference is obvious when we compare our results against those from the word count method: several keywords identified by the graph method were never shortlisted by simple counting, namely spring, transformation, development, technology, rd (R&D), internationalisation, water, cost and key.
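
For reference, this 2nd-degree reach can be computed straight from a graph. A minimal sketch with igraph, assuming the speech’s word graph has been rebuilt as an igraph object named speech_graph (a hypothetical name):

# Unique nodes within two hops of each word, excluding the word itself
reach2 <- ego_size(speech_graph, order = 2, mindist = 1)
names(reach2) <- V(speech_graph)$name

# The 25 keywords with the greatest 2nd-degree reach
head(sort(reach2, decreasing = TRUE), 25)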

This is not to say that the word count method is completely unreliable. After all, if a word appears more frequently in a text, especially if it’s spread out, it is likely to have higher reach. Hence, word counting is a good start, but graph methods are still needed to account for the relationships between seemingly disconnected words. It is worth pointing out here that the effectiveness of the graph method is limited because it is biased in favour of keywords with more connections.2 What is ideal is a collection of concepts that provides us with a balanced view of the key ideas in the text. A concept can be important, but also esoteric to the point where it relates only to another esoteric word which has few connections. This concept would not be reflected in our top 20 list of concepts.3

Now that we’ve seen which keywords are important, what about keyphrases? To obtain the top bigrams, we use a similar approach to the one we used for keywords: identify the keyword-to-keyword relationships (edges) that best connect the other nodes.

Here we have the top bigrams: “solar industry”, “key markets”, “food industry”, “business opportunities”, “needs (of) industry”, “projects (e.g.) water”. These bigrams appear to be combinations of the top keywords identified earlier. Naturally, most traffic through the concept network would pass through these connections. Hence, there may not be much value in bigrams after all.

Conclusion

We started off simple by counting words and created word clouds. These were useful to the extent that they gave us an overview of the content in multiple speeches, and a broad idea of what the various ministries are concerned about. Next, we moved on to abstraction, where we applied LDA in an attempt to identify topics (topic modelling), and a graph method to tease out the relationships between words and thereby, the importance of words in a text network. The results from the LDA model were rather confusing, as the keyword groupings did not help much in identifying the topics. The graph method was more promising, as it enabled us to at least identify words that were more well-connected throughout the text. Yet, the bias in favour of greater connectivity limits the value of the graph method. A collection of words with high connectivity is not necessarily a good summary.

I think it’s useful to put machines and algorithms in our shoes. We typically digest and consolidate concepts by paragraphs, and then by sections, and then by chapters. Perhaps, employing a multi-layered approach to identify keywords within paragraphs, then sections, and then chapters/documents would enable us to generate more accurate summaries of texts. We also generate different interpretations of things that we read. To emulate this, we could employ several summarisation algorithms to shortlist keywords and keyphrases, and top it off with a voting algorithm to merge the resulting summaries. More fundamentally, we need to be patient. In the same way that it took us lots of reading to understand the nuances of the written word and go from children’s books to academic papers, it will take machines lots of development and training to be able to read text the way we do, or the way we want them to. However, at the rate technology is advancing, I believe machines that can perform personalised sensemaking for us are well within our reach. It’s no longer a question of if, but when. Until then, get reading!


  1. Original paper by Mihalcea and Tarau

  2. Not that I think this way, but oddly, fortune (and perhaps life) is also biased in favour of people with more connections, yet we see that as a blessing instead of a limitation.

  3. Replace “concept/word” with “people” and “it” with “he/she” in the preceding two sentences. And yes, I deliberately wrote it this way. #wordplay