Single Document Keyword Extraction

Introduction

Keywords form a set of expressive words in a document that gives high-level information of its contents to the document readers. Finding Keywords from a large amount of news data online is very effective in that it can generate a short summary of the articles in the online news. As documents available online rapidly increase in size with the advancement of WWW, extraction of keywords has become a ground of several text mining applications such as text summarization, categorization, topic detection and search engine optimization.

Manual Keyword extraction process is very time consuming and extremely difficult task; in fact, it is practically impossible to manually extract keywords even in case of news articles published in a single day due to its ample size. For an active use of keywords, we need to design an automated process that extracts significant words from the news articles. The term “Keyword extraction” is used in text mining context, for example. A comparable research topic is called “automatic indexing” or “automatic keyword extraction” in information retrieval research and automatic term recognition in computational linguistics context.

Many different methods have been used over the years, and new solutions are constantly being proposed to solve this complex problem. In this tutorial, some of the statistical and graph-based keyword extraction methods are explored.

Graph Based Methods

A graph-based ranking algorithm is an approach of determining the vertex importance within a graph, by taking into consideration the global information recursively enumerated from the entire text graph, instead of relying only on local vertex-specific information. Employing an identical line of thinking to semantic or lexical graphs obtained from natural language text documents, results in a graph-based ranking model which can be implemented to a variety of natural language processing applications, where knowledge drawn from the entire document is used in finding out local ranking selection decisions.

TextRank

Introduction

Some ranking algorithms that are graph based like Kleinberg’s HITS algorithm (Kleinberg, 1999) or Google’s PageRank (Brin and Page,1998) have been used successfully in social-network analysis, citation analysis and the analysis of the link-structure of the Worldwide Web. Arguably,these graph based ranking algorithms can be singled out as key aspect of the dramatic change trig- gered in the field of Web Search technology by bringing up a Web page ranking mechanism that relies on the aggregate knowledge of Web architects instead of individual content analysis of web pages.

TextRank Model

Graph based Ranking algorithms like PageRank computes prestige or centrality of nodes in the context of Web search using random surfing assumption. Now Rank is defined to be the probability of a random web surfer landing at that page. PageRank of a node \(v\) recursively depends on the PageRank of other nodes that point to it. Formally, let \(G = (V,E)\) be a directed graph with \(V\) defined as the set of vertices and \(E\) as set of edges, where \(E\) is a subset of \(V\). For a given vertex \(V\), let \(Out(V)\) be the collection of vertices that vertex \(V\) points to (successors), let \(In(V)\) be the set of vertices that point to it (predecessors). The score of a vertex \(S(V_i)\) is defined as follow:

\[ S(V_i) = (1-d) + d * \sum_{j \in In(V_i)} \frac{1}{Out(V_j)} S(V_j) \]

where damping factor d can be assigned value between 0 and 1, which has the role of integrating into the model the probability of jumping from a given vertex to any other random vertex in the graph.

AuthorTR

Mihalcea, Rada, and Paul Tarau. “Textrank: Bringing order into text.” Proceedings of the 2004 conference on empirical methods in natural language processing. 2004.

Reading Input

Using Corpus {tm}

library(tm)
doc <- c("Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered  types systems and systems of mixed types.")

#Creating corpus         
corp <- Corpus(VectorSource(doc))

Reading a Single text file

Suppose, if you want to read files from a directory. Then, this is how you can import a series of documents stored in local folder into R using DirSource

# Filepath
filepath= ("/Users/addy93/Desktop/Tutorial_BDA/DatasetSample/")
corp_sf = Corpus(DirSource(filepath,pattern="sample_file.txt"))

Reading multiple files

# When you are in current directory
filepath = getwd()
corp_mf = Corpus(DirSource(filepath))

Using list.files {base}

#Fetch set of files from current directory with .txt extension
filenames <- list.files(pattern=".txt", full.names=TRUE)
 #-----Pre-processing-----
for (counter in seq_along(filenames)) {

  #Reading file form directory
  doc<- scan(filenames[counter], character(0)) # reading file as series of tokens NOT plaintext
  
 
} # end of for loop

Utility Functions

Following are some user defined functions to be used while extracting keywords using TextRank and Kcore alogirthms.

Tokenization & Other

# Text Tokeization
SplitText <- function(Phrase) { 
  unlist(strsplit(Phrase," "))
}

# To check if word is present in word list
IsSelectedWord <- function(Word) {
  ifelse(length(which(selected_words == Word))>0, TRUE, FALSE)
}

POS Tagging

library(NLP)
library(tm)
library(openNLP)
library(graph)
library(knitr)

# Utility Functions

# POS Tagging
tagPOS <-  function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator() #from openNLP package
  a2 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))  # annotating the sentences in a text.
  a2 <- NLP::annotate(s, word_token_annotator, a2) # annotate each word in a sentence.
  a3 <- NLP::annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]   #taking annotated "word" 
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}
# Illustrate usage of tagPOS
str <- "this is a confrence tutorial session."
tagged_str <-  tagPOS(str)
tagged_str

## $POStagged
## [1] "this/DT is/VBZ a/DT confrence/NN tutorial/JJ session/NN ./."
## 
## $POStags
## [1] "DT"  "VBZ" "DT"  "NN"  "JJ"  "NN"  "."

# Select words with given tagID
SelectTaggedWords <- function(Words,tagID) {
  Words[ grep(tagID,Words) ]
}
# Remove Pos Tags 
RemoveTags <- function(Words) {
  sub("/[A-Z]{2,3}","",Words)
}

Graph Construction

# To get all words in its window size W
GetWordLinks <- function(position,scope) {
  scope <- ifelse(position+scope>length(words),length(words),position+scope)
  links <- ""
  for (i in (position+1):scope) {
    if ( IsSelectedWord(words[i]) ) links <- c(links,words[i]) # add to word link list if word is present in wordlist-"word"
  }
  
  if (length(links)>1) {
    links[2:length(links)]
  }
  else {
    links <- ""
  }
}

#Contructing Graph-of-Word Representation from Input Text
ConstructTextGraph <- function(n) { 
  word_graph <- new("graphNEL") # graph object (text graph to be constructed)
  i <- 1
  while (i < length(words) ) {   #here "words"- list of tokens after pre-processing
    if ( IsSelectedWord(words[i]) ) {                                   
      links <- GetWordLinks(i,n) # getting words falling within window W for word[i]        
      if (links[1] != "") {                                     
        cat(i," ",words[i]," - ",paste(c(links),collapse=" "),"\n") # outputting words falling within window size of words[i]
        if ( length(which(nodes(word_graph)==words[i]))==0  ) {     
          word_graph <- addNode(words[i],word_graph)  # if word is not a node in  word_graph then add it.
        }                                               
        # connect this new node to words falling in its co-occurence window.
        for (j in 1:length(links)) {
          if ( length(which(graph::nodes(word_graph)==links[j]))==0 ) {
            word_graph <- addNode(links[j],word_graph)
            word_graph <- addEdge(words[i],links[j],word_graph,1)
          } 
          else { # if words linked to this newly added words are already connected to this word then increment the edge weight.
            if ( (length(which(graph::edges(word_graph,links[j])[[1]]==words[i]))>0 ) ) { 
              prev_edge_weight <- as.numeric(edgeData(word_graph,words[i],links[j],"weight"))
              edgeData(word_graph,words[i],links[j],"weight") <- prev_edge_weight+1
            }
            else {
              word_graph <- addEdge(words[i],links[j],word_graph,1)
            }
          } 
        }
      }
    }
    i <- i+1
  }
  word_graph
}

Pre-Processing

Author of the algorithm applied the following preprocessing steps on the input text:

Tokenization, process of splitting the stream of document text up into words, symbols, phrases or other expressive elements called tokens.
PoS Tagging & Selection that involves marking up a word in a text as resembling to a particular POS, based on both its definition as well as the context i.e. relationship with related and adjacent words in a sentence or phrase.
Stopwords Removal i.e.filtering out the most common words in a natural language text.

# Pre-Processing
corp <- tm_map(corp, stripWhitespace) # removing extra spaces (keeping only single space)
corp <- tm_map(corp, tolower) # transforming to lower case
words_with_punctuation <- SplitText(as.character(corp[[1]])) 
corp <- tm_map(corp, removePunctuation)

## [1] "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered  types systems and systems of mixed types."

Graph Construction

# GRAPH CONSTRUCTION
words <- SplitText(as.character(corp[[1]])) # tokenization
tagged_text <- tagPOS(corp[[1]])
tagged_words <- SplitText(as.character(tagged_text))

# Keep only NN (Nouns) & JJ (Adjectives) tagged words 

tagged_words <-c(SelectTaggedWords(tagged_words,"/NN"),SelectTaggedWords(tagged_words,"/JJ"))
#Remove tags

tagged_words <- RemoveTags(tagged_words)

selected_words <- unique(tagged_words)  # Keeping unique words.      

text_graph <- ConstructTextGraph(4)  # co-occurrence of window size 4 (can range from 2-10)

## 1   compatibility  -  systems linear 
## 3   systems  -  linear constraints 
## 5   linear  -  constraints set 
## 6   constraints  -  set 
## 9   set  -  natural numbers criteria 
## 11   natural  -  numbers criteria compatibility 
## 12   numbers  -  criteria compatibility 
## 13   criteria  -  compatibility 
## 15   compatibility  -  system 
## 18   system  -  linear diophantine equations 
## 20   linear  -  diophantine equations strict inequations 
## 21   diophantine  -  equations strict inequations 
## 22   equations  -  strict inequations nonstrict 
## 23   strict  -  inequations nonstrict inequations 
## 24   inequations  -  nonstrict inequations 
## 26   nonstrict  -  inequations upper 
## 27   inequations  -  upper bounds 
## 30   upper  -  bounds components 
## 31   bounds  -  components 
## 33   components  -  minimal set 
## 36   minimal  -  set solutions 
## 37   set  -  solutions algorithms 
## 39   solutions  -  algorithms construction 
## 41   algorithms  -  construction minimal 
## 43   construction  -  minimal sets 
## 45   minimal  -  sets solutions 
## 47   sets  -  solutions 
## 49   solutions  -  types 
## 52   types  -  systems 
## 54   systems  -  criteria 
## 58   criteria  -  corresponding algorithms 
## 61   corresponding  -  algorithms 
## 62   algorithms  -  minimal 
## 66   minimal  -  supporting set solutions 
## 67   supporting  -  set solutions 
## 68   set  -  solutions 
## 79   types  -  systems systems 
## 80   systems  -  systems mixed 
## 82   systems  -  mixed types 
## 84   mixed  -  types

Interactive Igraph plot {visNetwork}

#Interactive Igraph plot
library(visNetwork)
library(igraph)
textIgraph<-igraph.from.graphNEL(text_graph, name = TRUE, weight = TRUE,unlist.attrs = TRUE)
textIgraph<-simplify(textIgraph,remove.loops=TRUE)

visIgraph(textIgraph) %>%
    visNodes(size = 35, shape = "circle",font=list(size=28)) %>%
    visOptions(highlightNearest = TRUE, 
               nodesIdSelection = TRUE) %>%
    visInteraction(keyboard = TRUE)

Simple plot function {graphics}

#plot Text Graph 
library("Rgraphviz")
#source("http://bioconductor.org/biocLite.R")
#biocLite("Rgraphviz")

plot(text_graph, attrs = list(node = list(fillcolor = "lightblue", fontsize = 40),edge = list(arrowsize=0.5)))

Applying Text Rank

# TEXT RANK
# Computing vertex score asin PageRank (igraph package)

keywords_list<- page.rank(textIgraph, algo="prpack", directed=FALSE)$vector # function from igraph package

Keywords

# POST-PROCESSING
nodes_num <- length(nodes(text_graph))
keywords_num <- round(nodes_num/3)  # a third of the number of vertices in the graph.

ordered_keywords<- keywords_list[order(keywords_list,decreasing=TRUE)]
final_Keywords<- head(ordered_keywords,keywords_num)
(final_Keywords)

##           set       minimal       systems     solutions   inequations 
##    0.07362828    0.06728524    0.06295689    0.06102517    0.05870818 
##        linear      criteria    algorithms compatibility 
##    0.05675332    0.04683790    0.04540482    0.04024657

Kcore Retention

Introduction

Here the idea of Kcore on the graph-of-words representation of text for single document keyword extraction is enforced that reserve only the nodes from the final main core as keywords. It will take improved proximity between the significant terms (keywords) and variability in the number of final extracted keywords by selecting the group of more cohesive subsets of nodes than with existing graph-based ranking approaches based on centrality measures.

A graph of words is syntactic structure that conceal the co-occurrences of words as against the traditional bag of words and state of the art approaches in keyword extraction designed to apply PageRank and HITS on it to select its most prominent nodes. Rousseau & Vazirgiannis propose a approach that take better proximity and variability in the number of significant terms by selecting cohesive group of vertices.

This K-core approach presents some powerful advantage:

It is fully unsupervised.
It does not depend upon any collection-wide statistics and thus can be applied on any text out of the box .i.e. corpus-independent.
It is scalable to any document length since the algorithm is linearithmic in the number of unique words as contrary to complex community detection methods and
It is parameter-free as the number of keyword extracted conforms to the structure of the graph through the principle of K-core.

Author

Rousseau, François, and Michalis Vazirgiannis. “Main core retention on graph-of-words for single-document keyword extraction.” European Conference on Information Retrieval. Springer, Cham, 2015.

Pre-Processing

Authors of both TextRank & Kcore algorithm have separate prescription for preprocessing steps. TextRank does not performs stemming on the original text as it aims for finding the top multi-word key- words by using ranking based selection decisions, whereas k-core performs stemming on the text document.

Stemming is the process of chopping of inflected words (or some derived words) to their word base, stem or root forms generally a written word form, e.g. “solution” and “solute” are reduced to their root form “solut”. Use of stemming as one of the preprocessing step for text document not only reduced the graph size but also improves the qualitative performance of the algorithm. The utility functions used in this pre-processing and graph contruction are same as defined in Utility Function for TextRank algorithm.

# Pre-processing
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, tolower)
corp <- tm_map(corp, removePunctuation)

Graph Construction

# GRAPH CONSTRUCTION
words <- SplitText(as.character(corp[[1]])) # tokenization

# Applying Stemming
words<- stemDocument(words,language = "english")
# Applying POS Tagging
tagged_text <- tagPOS(corp[[1]])
tagged_words <- SplitText(as.character(tagged_text))

# Keep only NN (Nouns) & JJ (adjectives) tagged words
tagged_words <-c(SelectTaggedWords(tagged_words,"/NN"),SelectTaggedWords(tagged_words,"/JJ"))
# Remove un-used tag POS
tagged_words <- RemoveTags(tagged_words)

tagged_words<- stemDocument(tagged_words,language = "english")
selected_words <- unique(tagged_words)   

text_graph_kcore <- ConstructTextGraph(4)  # co-occurrence of window size 4

## 1   compat  -  system linear 
## 3   system  -  linear constraint 
## 5   linear  -  constraint set 
## 6   constraint  -  set 
## 9   set  -  natur number criteria 
## 11   natur  -  number criteria compat 
## 12   number  -  criteria compat 
## 13   criteria  -  compat 
## 15   compat  -  system 
## 18   system  -  linear diophantin equat 
## 20   linear  -  diophantin equat strict inequ 
## 21   diophantin  -  equat strict inequ 
## 22   equat  -  strict inequ nonstrict 
## 23   strict  -  inequ nonstrict inequ 
## 24   inequ  -  nonstrict inequ 
## 26   nonstrict  -  inequ upper 
## 27   inequ  -  upper bound 
## 30   upper  -  bound compon 
## 31   bound  -  compon 
## 33   compon  -  minim set 
## 36   minim  -  set solut 
## 37   set  -  solut algorithm 
## 39   solut  -  algorithm construct 
## 41   algorithm  -  construct minim 
## 43   construct  -  minim set 
## 45   minim  -  set solut 
## 47   set  -  solut 
## 49   solut  -  type 
## 52   type  -  system 
## 54   system  -  criteria 
## 58   criteria  -  correspond algorithm 
## 61   correspond  -  algorithm construct 
## 62   algorithm  -  construct minim 
## 64   construct  -  minim support set 
## 66   minim  -  support set solut 
## 67   support  -  set solut 
## 68   set  -  solut 
## 79   type  -  system system 
## 80   system  -  system mix 
## 82   system  -  mix type 
## 84   mix  -  type

# Interactive Igraph plot
library(visNetwork)
library(igraph)
textIgraph_kcore<-igraph.from.graphNEL(text_graph_kcore, name = TRUE, weight = TRUE,unlist.attrs = TRUE)
textIgraph_kcore<-simplify(textIgraph_kcore,remove.loops=TRUE)

visIgraph(textIgraph_kcore) %>%
    visNodes(size = 25, shape = "circle") %>%
    visOptions(highlightNearest = TRUE, 
               nodesIdSelection = TRUE) %>%
    visInteraction(keyboard = TRUE)

Coreness & Keword Extraction

# Coreness
coreness <- graph.coreness(textIgraph_kcore)
maxCoreness <- max(coreness)
verticesHavingMaxCoreness <- which(coreness == maxCoreness) 
  
# Finding Kcore of the text graph
kcore <- induced.subgraph(graph=textIgraph_kcore,vids=verticesHavingMaxCoreness)
  
# Plotting Kcore Graph
visIgraph(kcore) %>%
  visNodes(size = 25, shape = "circle") %>%
  visOptions(highlightNearest = TRUE, 
             nodesIdSelection = TRUE) %>%
  visInteraction(keyboard = TRUE)

Keywords extracted using Kcore Algorithm

# Keyword List
verticesHavingMaxCoreness

##     compat     system     linear        set      natur     number 
##          1          2          3          5          6          7 
##   criteria diophantin      equat     strict      inequ      minim 
##          8          9         10         11         12         17 
##      solut  algorithm  construct    support 
##         18         19         20         23