Text Mining: Introduction

Overview

According to some reports from 2018, around 2.5 quintillion bytes of data are created every day, and the volume is expected to grow every year. This data includes video, audio, images, text and more. As the volume grows, businesses need proper tools and techniques to extract useful information about their products and customers from it. Data is key to a business, and using it well adds value to the organization. It is collected from customers every day through social media, emails, text messages and other channels, so every organization must learn how to analyse it correctly. One technique for analysing such data efficiently is text mining. Text mining is the process of transforming unstructured text into a structured format so that new conclusions, patterns and insights can be drawn from it. Because much of this data, such as emails, customer feedback and text messages, consists of text, text mining is becoming extremely important. Many large organizations use text mining to find meaningful patterns in their data and so understand the needs of their customers more effectively.

Text Mining Process

The text mining process combines several techniques that let us derive information from unstructured data. The general steps are:

The first step in text mining is the collection of data, i.e. text. The data can be collected from different sources such as websites, emails, social media and blogs. All the available data is gathered together in this initial step.
After the data is collected, text pre-processing is carried out. This is the main step in text mining and requires a lot of time and effort. The collected data may be structured, semi-structured or unstructured. In text pre-processing all the available data is cleansed to create structured data. This step consists of methods such as text cleanup, tokenization, filtering, stemming, lemmatization, linguistic processing, part-of-speech tagging and word sense disambiguation.
After pre-processing, various techniques are used to analyse the data. The analysis is carried out on the structured data, which gives more efficient results. The common methods used in this step are information extraction, information retrieval, categorization, clustering, visualization and summarization.
Finally, the results are evaluated and stored for future reference. In this way, data obtained from different sources is turned into meaningful patterns using text mining.
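
The steps above can be sketched in a few lines of base R. This is only a minimal illustration, not the pipeline used later in this article: the two sentences, the `preprocess` helper and the small stop word list are all made up for the example.

```r
# Toy documents standing in for collected text (hypothetical data).
docs <- c("Text mining turns unstructured text into structured data.",
          "Structured data is easier to analyse than raw text!")

# A tiny, hand-picked stop word list (an assumption for this sketch).
stop_words <- c("is", "to", "than", "into", "the", "a")

preprocess <- function(x) {
  x <- tolower(x)                      # normalise case
  x <- gsub("[^a-z ]", " ", x)         # text cleanup: keep letters and spaces
  tokens <- strsplit(x, " +")[[1]]     # tokenization on runs of spaces
  tokens <- tokens[tokens != ""]       # drop empty tokens
  tokens[!tokens %in% stop_words]      # filtering: remove stop words
}

# Analysis: count how often each remaining term occurs across all documents.
tokens <- unlist(lapply(docs, preprocess))
term_counts <- sort(table(tokens), decreasing = TRUE)
print(term_counts)
```

Even on two sentences the structured result (a table of term counts) is much easier to analyse than the raw strings.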

Text Mining Techniques

Text mining techniques are used to discover insights from the structured text produced by pre-processing. They rely on different tools, methods and applications for their execution. The main text mining techniques are:

Information Extraction
Information extraction is one of the best known text mining techniques. It focuses on identifying and extracting entities, attributes and their relationships from the available textual data. The extracted information is then stored in a database for future access and retrieval. The quality and relevance of the results are evaluated using precision and recall.
Information Retrieval
Information Retrieval is the process of extracting relevant and associated patterns based on a specific set of words or phrases. In this text mining technique, information retrieval systems make use of different algorithms to track and monitor user behaviors and discover relevant data accordingly.
Categorization
Categorization is one of the most popular supervised learning techniques. Natural language texts are assigned to a predefined set of classes or topics depending on their content. The text documents are gathered, processed and analysed to find the right topics or index terms for each document. Naive Bayes classifiers, decision trees, nearest neighbour classifiers and support vector machines are commonly used to categorize texts.
Clustering
Clustering is one of the most popular unsupervised learning methods, used when the data points are neither classified nor labeled. Text documents with similar content are grouped into the same cluster. The K-means algorithm is one of the most widely used clustering techniques; it divides the data into a chosen number of clusters by assigning each point to the cluster with the nearest mean.
Visualization
Visualization simplifies and enhances the discovery of useful information. It uses visual cues such as text flags to indicate individual documents or document categories, and colours to indicate the density of a category, entity or phrase. It is used to place large sources of textual data in a visual hierarchy.
Summarization
Summarization reduces the length of a document by stating its details in brief. It determines the most important points in a lengthy document or set of documents and replaces the full text with those points. Summarization involves three steps: pre-processing, processing and development. The pre-processing step builds a structured representation of the text, the processing step applies algorithms to derive a summary, and the development step produces the final text summary.
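
To make the clustering technique from the list above more concrete, here is a minimal sketch using the `kmeans` function from base R. The term counts, the document names and the choice of two clusters are assumptions made up for this illustration.

```r
# Toy term-count matrix: rows are documents, columns are terms (hypothetical data).
counts <- matrix(c(5, 4, 0, 0,
                   4, 5, 1, 0,
                   0, 1, 5, 4,
                   0, 0, 4, 5),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("loan", "credit", "patient", "clinic")))

set.seed(42)                                    # k-means starts from random centres
fit <- kmeans(counts, centers = 2, nstart = 10) # keep the best of 10 random starts

# Cluster label for each document; the label numbers themselves are arbitrary.
print(fit$cluster)
# Each centre is the mean term count of the documents assigned to that cluster.
print(fit$centers)
```

The two finance-like documents should end up in one cluster and the two healthcare-like documents in the other, without any labels being supplied.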

Text Mining Methods

Many methods have been developed for the core text mining problem, which is retrieving information relevant to the user's requirements. From an information retrieval point of view, four main methods are used in text mining.

Term Based Method (TBM)
A term is a word with semantic meaning. In term based methods a document is analyzed on the basis of its terms, which has the advantage of efficient computation and of mature theories for term weighting. Term based methods struggle with polysemy and synonymy. Polysemy means a word has multiple meanings, and synonymy means multiple words share the same meaning. The semantic meaning of many extracted terms is therefore uncertain and does not fully capture what the user wants.
Phrase Based Method (PBM)
Phrase based methods may have advantages over term based methods because a phrase carries more semantic information and is less ambiguous. The document is analyzed phrase by phrase, since phrases are more discriminative than individual terms.
Concept Based Method (CBM)
In this method terms are analyzed at the sentence and document level. Concept based methods can effectively discriminate between unimportant terms and meaningful terms that describe the meaning of a sentence. This model normally relies on natural language processing techniques. Feature selection is used to optimize the representation and to remove noise and ambiguity from the document.
Pattern Taxonomy Method (PTM)
In this method documents are analyzed on the basis of patterns. A taxonomy is a hierarchical arrangement, and patterns are structured into one using the is-a relationship. Patterns can be discovered using techniques such as association rule mining, frequent itemset mining, sequential pattern mining and closed pattern mining. This method refines the patterns discovered in text documents and is reported to perform better than concept based and term based methods.
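
The difference between the term based and phrase based views can be seen in a short base R sketch. The example sentence is made up, and adjacent word pairs (bigrams) are used here as a simple stand-in for phrases.

```r
sentence <- "the bank approved the loan near the river bank"
tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# Term based view: each word is counted on its own, so the two senses of
# "bank" (polysemy) collapse into a single term.
term_counts <- table(tokens)

# Phrase based view: adjacent word pairs (bigrams) keep some local context.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
phrase_counts <- table(bigrams)

print(term_counts["bank"])
print(phrase_counts[c("river bank", "the bank")])
```

The term count cannot tell the financial "bank" from the "river bank", whereas the bigrams keep the two uses apart, at the cost of a larger and sparser vocabulary.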

Text Mining Applications

Text mining has improved user experiences and business decisions. Many companies use text mining tools to add value to their organization and products. Some application areas of text mining include:

Risk Management
One of the main causes of business failure is inadequate risk analysis. Integrating risk management software powered by text mining can help businesses stay up to date with current market trends and improve their ability to mitigate potential risks.
Customer Care Service
Integrating text analytics tools with feedback systems, chatbots, online reviews, support tickets and social media profiles lets a business improve the customer experience quickly. Text mining and sentiment analysis provide a way to prioritize the key issues for customers, so that urgent issues can be answered in real time and customer satisfaction increases.
Healthcare
Healthcare is one of the major application areas of text mining, as it provides valuable information to researchers. Manual investigation of medical research can be very costly and time consuming. Because text mining automates the extraction of valuable information from the medical literature, it is becoming extremely popular in the medical field as well.
Spam Filtering
Text mining is used to filter spam emails out of inboxes, which improves the overall user experience. It also reduces the risk of cyber attacks on end users.
Fraud Detection
By combining the outcomes of text analysis with relevant structured data, we can process user profiles and claims efficiently and detect and prevent fraud.

Text Mining and Natural Language Processing

Natural language processing (NLP) helps machines read and process text or speech by simulating the human ability to understand a natural language such as English. NLP is a part of artificial intelligence (AI) and is concerned with giving computers the ability to understand natural languages in the same way human beings do.

NLP is used to understand the concepts within documents; it helps resolve the ambiguities of language, extract key facts and relationships, and produce summaries. Text mining uses NLP to derive high quality information from text. In short, text mining is the overall process of deriving useful information from unstructured data, and NLP is one of the methods for doing so. NLP combines computational linguistics, i.e. rule based modeling of human language, with statistical, machine learning and deep learning models. Together these enable computers to process human language in the form of text or voice data and to understand its full meaning, including the writer's intent and sentiment.

With such a vast amount of data it is impossible for us to read everything ourselves and identify what is most important, and analysing ambiguous data by hand could take forever. Text mining with the help of NLP does this work accurately and at high speed. NLP may be only one part of text mining and artificial intelligence, but it has many applications of its own. Its use cases include sentiment analysis, chatbots, speech recognition and machine translation, which makes it one of the most important methods in the overall information retrieval process.

Text Mining and Machine Learning

Machine learning (ML) is a part of artificial intelligence (AI) that gives systems the ability to learn automatically from experience without explicit programming. ML can help computers solve complex non-deterministic problems with speed and accuracy. It is also one of the most important parts of NLP and text mining.

Machine learning for NLP and text mining uses machine learning algorithms and artificial intelligence to understand the meaning of text documents. The documents may be anything that contains text: social media queries, online reviews, survey responses and even financial, medical and legal documents. ML in natural language processing and text analytics helps to improve, accelerate and automate the underlying text analytics functions and NLP features that convert unstructured text into usable data and outcomes.

Machine learning in NLP and text analytics involves a set of statistical techniques that identify parts of speech, entities, sentiment and other aspects of the text. When the techniques are expressed as a model that is trained on labeled text and then applied to other texts, this is known as supervised machine learning. Supervised learning algorithms such as support vector machines and Bayesian networks are used to build and improve core text analytics functions and NLP features. The other approach, unsupervised machine learning, is used when large sets of unlabeled data are available. Unsupervised methods such as clustering are extremely useful for extracting NLP features from large data sets.
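
As a rough illustration of supervised learning on text, the sketch below trains a model on a few labeled sentences and applies it to new ones. It uses a hand-written multinomial naive Bayes classifier, chosen because it fits in a few lines of base R; it is not one of the algorithms named in this section. The training sentences, labels and helper names are all invented for the example.

```r
# Tiny labelled training set (hypothetical data for illustration).
train_text  <- c("great product fast delivery",
                 "love this great service",
                 "terrible product slow delivery",
                 "bad service never again")
train_label <- c("pos", "pos", "neg", "neg")

tokenize <- function(x) strsplit(tolower(x), " +")[[1]]

vocab   <- unique(unlist(lapply(train_text, tokenize)))
classes <- unique(train_label)

# Per-class term counts: one row per class, one column per vocabulary term.
counts <- t(sapply(classes, function(cl) {
  toks <- unlist(lapply(train_text[train_label == cl], tokenize))
  as.vector(table(factor(toks, levels = vocab)))
}))
dimnames(counts) <- list(classes, vocab)

# Laplace (add-one) smoothing avoids zero probabilities for unseen pairs.
log_lik   <- log((counts + 1) / (rowSums(counts) + length(vocab)))
log_prior <- log(as.vector(table(train_label)[classes]) / length(train_label))

predict_label <- function(text) {
  toks <- tokenize(text)
  toks <- toks[toks %in% vocab]            # ignore out-of-vocabulary words
  scores <- log_prior + rowSums(log_lik[, toks, drop = FALSE])
  classes[which.max(scores)]
}

print(predict_label("great fast service"))
print(predict_label("slow terrible service"))
```

The model is learned once from the labeled examples and then reused on unseen text, which is the defining property of the supervised approach described above.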

Text Mining and Artificial Intelligence

Artificial Intelligence (AI) is the ability of a machine to mimic the problem solving and decision making capabilities of the human brain. Integrating AI into text mining or text analytics can lead to the development of broad applications such as competitive intelligence, human resource management and market analysis. AI consists of different elements such as machine learning and natural language processing, and both play an important role in enhancing AI systems. Without natural language processing, an AI system can only handle language at a surface level and answer simple questions; it cannot grasp the actual meaning of the words. Natural language processing therefore allows users to communicate with a computer in natural language. Similarly, without machine learning, AI can only perform predefined tasks, i.e. it cannot solve non-deterministic problems on its own.

AI with both natural language processing (NLP) and machine learning (ML) is used in text mining, helping us find valuable insights in large amounts of data. AI has a large set of real-world applications such as speech recognition, customer service, computer vision, recommendation engines and many more. With this breadth of functionality, AI is one of the most important aspects of text mining, as it binds together useful methods such as natural language processing and machine learning.

Text Mining: Implementation

Now we will implement a simple example of text mining using the tm package in R. For the implementation, the text above is divided into four parts: text mining, text mining and NLP, text mining and ML, and text mining and AI. Each part is saved as a separate text file, and the files are stored together in one directory of the project. After this data preparation, we create a corpus containing all the text documents and then preprocess it. The resulting clean data is used for analysis and finally for visualization.

Creating Corpus

# loading the required library
library(tm)
# creating corpus
docs <- Corpus(DirSource("~/Documents/TextMining/texts", encoding = "UTF-8"))
print(docs)

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 4

# inspecting docs
inspect(docs[[1]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1487
## 
## Text Mining and Artificial Intelligence:
## Artificial Intelligence or AI is the ability of a machine to mimic the problem solving and decision
## making capabilities of the human brain. Integrating AI in the text mining or text analytics can
## lead to the development of broad applications such as competitive intelligence, human resource
## management and market analysis. AI consists of different elements such as Machine learning
## and Natural language processing. Both Natural language processing and Machine learning
## plays an important role in enhancing AI systems. WIthout Natural language processing, AI can
## only understand the meaning of language and answer simple questions, but it won’t be able to
## know the actual meaning of the words. Thus Natural language processing in AI allows users to
## communicate with a computer in natural language. Similarly without machine learning, AI can
## only perform the defined tasks i.e. it can not perform non deterministic problems on it’s own.
## AI with both Natural language processing (NLP) and Machine learning (ML) is used in text
## mining thus helping us to find the valuable insights from large amounts of data. AI provides a
## large set of real-world applications such as speech recognition, customer service, computer
## vision, recommendation engine and many more. Thus having a large set of functionality, AI in
## text mining is one of the most important aspects as it binds together useful methods such as
## Natural language processing and Machine learning.

Data Preprocessing

# creating a helper that replaces every match of a pattern with a space
to_space <- content_transformer(function(x, pattern) {
  gsub(pattern, " ", x)
})
# removing unwanted symbols
docs <- tm_map(docs, to_space, ":")
docs <- tm_map(docs, to_space, "-")
docs <- tm_map(docs, to_space, "'")
docs <- tm_map(docs, to_space, "’")
docs <- tm_map(docs, to_space, '"')
docs <- tm_map(docs, to_space, ";")
# removing punctuation
docs <- tm_map(docs, removePunctuation)
# transforming to lower case
docs <- tm_map(docs, content_transformer(tolower))
# removing numbers
docs <- tm_map(docs, removeNumbers)
# removing stop words
docs <- tm_map(docs, removeWords, stopwords())
# removing white spaces
docs <- tm_map(docs, stripWhitespace)
# inspecting
inspect(docs[[1]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1176
## 
## text mining artificial intelligence artificial intelligence ai ability machine mimic problem solving decision making capabilities human brain integrating ai text mining text analytics can lead development broad applications competitive intelligence human resource management market analysis ai consists different elements machine learning natural language processing natural language processing machine learning plays important role enhancing ai systems without natural language processing ai can understand meaning language answer simple questions won t able know actual meaning words thus natural language processing ai allows users communicate computer natural language similarly without machine learning ai can perform defined tasks ie can perform non deterministic problems s ai natural language processing nlp machine learning ml used text mining thus helping us find valuable insights large amounts data ai provides large set real world applications speech recognition customer service computer vision recommendation engine many thus large set functionality ai text mining one important aspects binds together useful methods natural language processing machine learning

Document Stemming

# stemming into a separate variable, so the later steps still use the unstemmed docs
library(SnowballC)
stem_docs <- tm_map(docs, stemDocument)
inspect(stem_docs[[1]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 960
## 
## text mine artifici intellig artifici intellig ai abil machin mimic problem solv decis make capabl human brain integr ai text mine text analyt can lead develop broad applic competit intellig human resourc manag market analysi ai consist differ element machin learn natur languag process natur languag process machin learn play import role enhanc ai system without natur languag process ai can understand mean languag answer simpl question won t abl know actual mean word thus natur languag process ai allow user communic comput natur languag similar without machin learn ai can perform defin task ie can perform non determinist problem s ai natur languag process nlp machin learn ml use text mine thus help us find valuabl insight larg amount data ai provid larg set real world applic speech recognit custom servic comput vision recommend engin mani thus larg set function ai text mine one import aspect bind togeth use method natur languag process machin learn

Removing Unwanted Words

# removing unwanted words; each pattern is padded with spaces so that
# only whole words are matched
unwanted <- c("can", "part", "important", "meaning", "thus", "understand",
              "set", "one", "provides", "useful", "used", "help", "may",
              "ie", "us", "ing")
for (word in unwanted) {
  docs <- tm_map(docs, content_transformer(gsub),
                 pattern = paste0(" ", word, " "), replacement = " ")
}
# inspecting
inspect(docs[[1]])

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1059
## 
## text mining artificial intelligence artificial intelligence ai ability machine mimic problem solving decision making capabilities human brain integrating ai text mining text analytics lead development broad applications competitive intelligence human resource management market analysis ai consists different elements machine learning natural language processing natural language processing machine learning plays role enhancing ai systems without natural language processing ai language answer simple questions won t able know actual words natural language processing ai allows users communicate computer natural language similarly without machine learning ai perform defined tasks perform non deterministic problems s ai natural language processing nlp machine learning ml text mining helping find valuable insights large amounts data ai large real world applications speech recognition customer service computer vision recommendation engine many large functionality ai text mining aspects binds together methods natural language processing machine learning

Creating Document Term Matrix

dtm <- DocumentTermMatrix(docs)
inspect(dtm)

## <<DocumentTermMatrix (documents: 4, terms: 537)>>
## Non-/sparse entries: 674/1474
## Sparsity           : 69%
## Maximal term length: 15
## Weighting          : term frequency (tf)
## Sample             :
##                          Terms
## Docs                      data information language learning machine methods
##   text_mining_and_ai.txt     1           0        8        5       6       1
##   text_mining_and_ml.txt     2           0        1       10       7       1
##   text_mining_and_nlp.txt    4           4        6        2       2       2
##   text_mining.txt           32          16        2        2       0      11
##                          Terms
## Docs                      mining nlp processing text
##   text_mining_and_ai.txt       4   1          6    5
##   text_mining_and_ml.txt       3   6          1   11
##   text_mining_and_nlp.txt      5   9          2    8
##   text_mining.txt             33   0          9   47
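
To show what the matrix above contains, here is a minimal base R sketch that builds a document term matrix by hand. The two short documents and the name `dtm_manual` are assumptions for this illustration; the tm package stores the same information in a sparse format.

```r
# Two toy documents that are assumed to be cleaned already (hypothetical data).
texts <- c(doc_a = "text mining finds patterns text",
           doc_b = "machine learning finds patterns")

tokens <- strsplit(texts, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# One row per document, one column per term, each cell a term frequency.
dtm_manual <- t(sapply(tokens, function(tk) {
  as.vector(table(factor(tk, levels = vocab)))
}))
dimnames(dtm_manual) <- list(names(texts), vocab)

print(dtm_manual)

# Share of cells that are zero, comparable to the sparsity reported above.
sparsity <- mean(dtm_manual == 0)
print(round(sparsity, 2))
```

In the real output above, 1474 of the 2148 cells are zero, which is where the reported sparsity of 69% comes from.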

Performing Text Analysis

freq <- colSums(as.matrix(dtm))
length(freq)

## [1] 537

# creating sorted order according to the freq
ord <- order(freq, decreasing = TRUE)
# inspecting most frequently occurring terms
freq[head(ord)]

##        text      mining        data information    learning  processing 
##          71          45          39          20          19          18

# inspecting less frequently occurring terms
freq[tail(ord)]

##     wants  websites weighting  whatever   whereas      year 
##         1         1         1         1         1         1

# keeping terms of 2 to 20 characters that appear in at least 3 of the 4 documents
dtmr <- DocumentTermMatrix(docs,
                           control = list(wordLengths = c(2, 20),
                                          bounds = list(global = c(3, Inf))))
inspect(dtmr)

## <<DocumentTermMatrix (documents: 4, terms: 28)>>
## Non-/sparse entries: 94/18
## Sparsity           : 16%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##                          Terms
## Docs                      ai data language learning machine methods mining nlp
##   text_mining_and_ai.txt  10    1        8        5       6       1      4   1
##   text_mining_and_ml.txt   2    2        1       10       7       1      3   6
##   text_mining_and_nlp.txt  1    4        6        2       2       2      5   9
##   text_mining.txt          0   32        2        2       0      11     33   0
##                          Terms
## Docs                      processing text
##   text_mining_and_ai.txt           6    5
##   text_mining_and_ml.txt           1   11
##   text_mining_and_nlp.txt          2    8
##   text_mining.txt                  9   47
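
The global bound filters on the number of documents a term appears in, not on its total count. The base R sketch below mimics that rule on three columns copied from the first matrix sample; the short row names and the variables `m`, `doc_freq` and `keep` are introduced only for this illustration.

```r
# Counts copied from the first sample output; rows are the four documents.
m <- cbind(data        = c(1, 2, 4, 32),
           information = c(0, 0, 4, 16),
           machine     = c(6, 7, 2, 0))
rownames(m) <- c("ai", "ml", "nlp", "text_mining")

doc_freq <- colSums(m > 0)          # number of documents containing each term
keep     <- doc_freq >= 3           # mirrors bounds = list(global = c(3, Inf))

print(doc_freq)
print(m[, keep, drop = FALSE])
```

This is why "information", with 20 occurrences overall, is missing from the reduced matrix: it occurs in only two of the four documents. The lower word length of 2 also explains why "ai" only shows up now, since the first matrix was built with tm's default minimum length of 3.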

# frequency after removal
freqr <- colSums(as.matrix(dtmr))
# length after removal
length(freqr)

## [1] 28

# creating sorted order according to the freq
ordr <- order(freqr, decreasing = TRUE)
# inspecting most frequently occurring terms
freqr[head(ordr)]

##       text     mining       data   learning processing   language 
##         71         45         39         19         18         17

# inspecting less frequently occurring terms
freqr[tail(ordr)]

##         non recognition        case    machines   sentiment       speed 
##           3           3           3           3           3           3

# terms that occur at least 5 times in total
findFreqTerms(dtmr, lowfreq = 5)

##  [1] "ai"           "analysis"     "applications" "artificial"   "data"        
##  [6] "intelligence" "language"     "large"        "learning"     "machine"     
## [11] "methods"      "mining"       "natural"      "nlp"          "processing"  
## [16] "speech"       "text"         "documents"    "unstructured"

# finding terms whose correlation with a given term is at least 0.6
findAssocs(dtmr, "text", 0.6)

## $text
##         data      methods       mining unstructured     analysis    documents 
##         0.99         0.99         0.99         0.96         0.93         0.93 
##      systems   processing applications 
##         0.81         0.74         0.69

findAssocs(dtmr, "mining", 0.6)

## $mining
##         data      methods         text     analysis unstructured    documents 
##         1.00         1.00         0.99         0.97         0.93         0.86 
##   processing applications      systems 
##         0.82         0.79         0.79

findAssocs(dtmr, "nlp", 0.6)

## $nlp
##    ability     speech      helps artificial 
##       0.87       0.79       0.77       0.62

findAssocs(dtmr, "machine", 0.6)

## $machine
## intelligence     learning   artificial        large           ai 
##         0.93         0.88         0.86         0.82         0.60

findAssocs(dtmr, "ai", 0.6)

## $ai
##        large      natural     language intelligence      machine 
##         0.93         0.86         0.72         0.62         0.60
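
As far as we understand the tm documentation, the values reported by `findAssocs` are correlations between term columns, computed across documents. The base R sketch below recomputes three of them from the per-document counts shown in the matrix samples; the variable names are introduced only for this check.

```r
# Per-document counts copied from the matrix samples shown earlier
# (order: ai, ml, nlp, text_mining).
text_n   <- c(5, 11, 8, 47)
mining_n <- c(4, 3, 5, 33)
data_n   <- c(1, 2, 4, 32)

# Pearson correlation of the counts across the four documents.
print(round(cor(text_n, mining_n), 2))
print(round(cor(text_n, data_n), 2))
print(round(cor(mining_n, data_n), 2))
```

The rounded values agree with the 0.99 and 1.00 reported above. With only four documents, and one of them much longer than the rest, these correlations mostly reflect which terms are concentrated in that long document, so they should be read with caution.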

Visualization

Histogram

# bar chart of the terms that occur more than 5 times
wf <- data.frame(term = names(freqr), occurrences = freqr)
library(ggplot2)
histo <- ggplot(subset(wf, occurrences > 5), aes(term, occurrences)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(histo)

Word Cloud

# word cloud
library(wordcloud)
wordcloud(names(freqr), freqr, min.freq = 5, colors = brewer.pal(4, "Dark2"))

Text Mining: Conclusion

Text mining combines several steps to extract useful patterns and information from text data. In the example above, we first created a corpus of the text documents and preprocessed it. The resulting clean data was then analysed with simple statistical methods such as finding frequently occurring terms and calculating correlations between terms. Finally we created a histogram and a word cloud to visualize what we learned. Hope this example helps to clarify text mining. Happy Learning!

Introduction to Text Mining with R

Pankaj Bhattarai, Master’s in Data Sciences (TU)

25 January, 2022 05:15:50 PM