Processing text with R


Introduction

One of the fastest growing areas in the field of analytics and machine learning is text processing and analytics. Why would you think that’s the case? Well, it turns out that free text is being generated at an exponential pace. New technological advances are generating humongous amounts of text data. The internet today contains countless blogs, reviews, comments, notes, and other text-based content. Social media generates millions of messages every day in the form of posts, tweets, hashtags, and references. Computer software generates log messages and audit trails that need to be looked at. Emails are another form of text data. In addition, other media like audio and video are being transcribed into text. The need to analyze and understand text data is growing every day. Businesses want to automatically mine insights from text data and use them for business actions. But processing text poses unique challenges. Text data is often many times larger in volume than numeric data. Also, text does not have a fixed structure or schema, which makes understanding it difficult. In this course, we will look at the tools and techniques offered in R for processing text data.

Document

When you consider text mining, you have to start somewhere. That somewhere is a document. Text processing software libraries receive and produce documents, but what exactly is a document? Essentially, a document is a collection of sentences that represent a specific fact or entity. Documents can be big or small, but every document contains text about that specific context. A product review, a log file, a blog entry, a tweet: these are all examples of documents that can be used in text mining. Just like the English language, a document contains paragraphs, sentences, and words. For comparison’s sake, a document can be said to be the equivalent of a row or record in a database. Similar to how a record contains relevant information about an entity, a document contains relevant text. Maybe now you are wondering, how much information, or in this case text, can a document contain? Well, it turns out that the scope of a document can vary. For example, an individual tweet can be considered a document, or a set of tweets containing a specific hashtag can be considered a document. A data architect will decide the scope based on the problem being solved.

Corpus

Now that we understand what a document is, the next concept we need to look at is a corpus. The plural of corpus is corpora. In text mining, a corpus is a collection of documents. These documents are typically linked by a common entity or time period. For example, a corpus may contain all reviews for a given product in a month, all log files generated in a day by a software process, or all tweets by a particular Twitter user. If a document is analogous to a record in a database, a corpus is equivalent to a table. Of course, a table contains multiple records. Hence, a corpus contains multiple documents. What makes up a corpus may vary depending upon the specific use case. For example, all reviews by a user, all reviews for a product, or the global list of reviews in a system can all be examples of corpora. Text mining libraries work with the corpus. Hence, converting text data to a corpus and understanding its structure are important capabilities when analyzing text.

Text libraries in R

One of the key advantages, and sometimes challenges, of R is the suite of different packages available for almost any kind of work. While the options are many, it becomes important to compare them and choose the right one for your use case. R is a free software environment that supports a number of text processing packages, each with its own advantages and shortcomings. Let’s go through a few of the popular ones that are currently available.

The tm package is one of the most popular text processing packages available. It supports reading data from multiple sources and maintaining them in in-memory or persistent corpora. It also has support for a number of text cleansing capabilities out of the box, which we will discuss later in the course. The openNLP package provides an R interface to Apache OpenNLP, a machine learning toolkit. It is strong in areas like tokenization and machine learning. We will discuss these topics in detail in later chapters. RWeka provides an R interface to Weka, a popular open-source machine learning toolkit. RWeka is strong in machine learning, tokenization, and n-grams. languageR provides capabilities for statistical analysis, along with some machine learning capabilities. koRpus is another text mining package that provides a suite of functions similar to the tm package.

Which of these packages should you use? It depends upon what you’re trying to do. In general, my recommendation is to explore multiple packages to find the best fit for your use case based on simplicity, capability, and performance. In this course I will mostly use the tm package, since it is the most popular and simplest to use of the available options. RWeka provides excellent support for n-grams, so I will use it in the n-grams chapter.
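If you want to follow along, a minimal setup sketch might look like the following (package names as published on CRAN; SnowballC is needed by tm for stemming, and RWeka also requires a working Java installation):

#Install the main packages used in this course (run once)
install.packages(c("tm", "SnowballC", "RWeka", "ggplot2"))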

Corpus in R

VCorpus vs PCorpus

Now that we have set up the environment in which we will be analyzing our data, we will explore reading data into a corpus and exploring that data. Recall that text data is handled as a corpus in text processing libraries. R’s tm package supports two corpus types: VCorpus, which stands for volatile corpus, and PCorpus, which stands for permanent corpus.

VCorpus is created from data sources, stored in memory, and fully managed in memory. It does not have an external data source to persist. Since it’s always stored in memory, it provides quick access to data, and hence results in faster text processing times. However, the amount of data that can be stored in a VCorpus is limited by the amount of memory available. VCorpus structures are lost when a program terminates, as there is no persistence of data. This means that if you are working on a structure that you need to keep in the future, VCorpus is probably not the best way to read your data.

A PCorpus, unlike a VCorpus, is created and managed through a persistent store, like a directory or a database. R objects for the PCorpus are merely pointers to the data store. R operations on these objects are performed on the underlying persistent data structures. Since data is stored externally, on disk, rather than in memory, access and processing are slower than with a VCorpus. However, it can support much larger corpora than a VCorpus, as it is not limited by the memory available for processing. This means that when you’re working with big data, or a large corpus, it is best to use a PCorpus instead of a VCorpus. Additionally, a PCorpus is not lost when the processing program terminates, as it is always persisted to disk. You can come back to it and keep working from where you left off. Other than how the data is stored, there are no differences between the two types of corpora in terms of the processing that can be done on them. In this course, we will focus on using VCorpus for our examples.
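For illustration, a PCorpus is created much like a VCorpus. This is only a sketch, assuming the tm and filehash packages are installed (tm uses filehash as the backing store) and using a placeholder directory path and database file name:

#Create a Permanent Corpus backed by a filehash database on disk
perm_corpus <- PCorpus(DirSource("C:\\path\\to\\text_files"),
                       dbControl = list(dbName = "perm_corpus.db", dbType = "DB1"))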

Reading files with CorpusReader

With the following code we create a corpus by reading a directory of data files.

#Load the tm text mining package
library(tm)

#Read a directory into a Source object
source_data <- DirSource("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\data")

#Create a Volatile Corpus from the source object
course_corpus <- VCorpus(source_data) #creates one document in the corpus for each file in the source directory

Exploring the corpus

Let’s now explore the corpus we created above.

#Inspect the corpus to learn about its data
inspect(course_corpus) #returns summary information about each document in the corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 568
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 640
#Inspect the contents of a specific document in the corpus
inspect(course_corpus[[1]])  #we must give the index of the document in the list
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 568
## 
## Real-time systems have guaranteed response times that can be sub-seconds from the trigger. Meaning that when a user clicks a button, your app better respond and fast. Architecting applications under real-time constraints is an even bigger challenge when you're dealing with big data. Luckily, big data technology and efficient architecture can provide the real-time responsiveness your business needs. In this course, you can learn about use cases and best practices for architecting real-time applications with technologies such as Kafka, Hazelcast, and Apache Spark.
inspect(course_corpus[[2]])  #shows the content of the document
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 640
## 
## In order to construct data pipelines and networks that stream, process, and store data, data engineers and data-science DevOps specialists must understand how to combine multiple big data technologies. In this course, discover how to build big data pipelines around Apache Spark. Join Kumaran Ponnambalam as he takes you through how to make Apache Spark work with other big data technologies. He covers the basics of Apache Kafka Connect and how to integrate it with Spark for real-time streaming. In addition, he demonstrates how to use the various technologies to construct an end-to-end project that solves a real-world business problem.
#Inspect meta data about a document
meta(course_corpus[[1]]) #returns the name and value of each metadata attribute
##   author       : character(0)
##   datetimestamp: 2020-03-23 08:28:29
##   description  : character(0)
##   heading      : character(0)
##   id           : Architecture-Course-Description.txt
##   language     : en
##   origin       : character(0)
meta(course_corpus[[2]])
##   author       : character(0)
##   datetimestamp: 2020-03-23 08:28:29
##   description  : character(0)
##   heading      : character(0)
##   id           : Spark-Course-Description.txt
##   language     : en
##   origin       : character(0)
#Access a specific attribute about a document
course_corpus[[1]]$meta$id
## [1] "Architecture-Course-Description.txt"
#Set a value for an attribute
course_corpus[[1]]$meta$author <- "Andres Lopez"
#Create a new metadata attribute
course_corpus[[1]]$meta$type <- "Courses"

#Display the attributes again
meta(course_corpus[[1]])
##   author       : Andres Lopez
##   datetimestamp: 2020-03-23 08:28:29
##   description  : character(0)
##   heading      : character(0)
##   id           : Architecture-Course-Description.txt
##   language     : en
##   origin       : character(0)
##   type         : Courses

Persisting the corpus

Let’s see how to save a corpus so that it is not lost when we close the program, and we can continue the analysis later.

#Change ID for each document - without the .txt extension.
#we remove the extension because the writeCorpus function adds the .txt extension to each document
for(i_doc in 1:length(course_corpus) ) {
    course_corpus[[i_doc]]$meta$id = 
        sub('.txt','',course_corpus[[i_doc]]$meta$id)
}

#The destination directory should pre-exist.
writeCorpus(course_corpus, "C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\saved_corpus" )

Text cleaning and extraction

Let’s create a function that prints information about the corpus, such as the number of words and characters in the text content.

#Load the corpus
course_corpus <- VCorpus(DirSource("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\data"))

#Function that prints data about the corpus
#-----input : processing_step <- Description of the step
#-----input : corpus : The corpus to analyze.
analyze_corpus <- function(processing_step, corpus) {
  
  print("***************************************************")
  print(processing_step)
  print("---------------------------------------------------")
  
  #Count number of words and characters in the corpus and print.
  for(i_doc in 1:length(corpus) ) {
    print(paste(corpus[[i_doc]]$meta$id,
                " words =", 
                lengths(gregexpr("\\W+", corpus[[i_doc]])) + 1,
                " chars =",
                nchar(corpus[[i_doc]]$content)
    ))
  }
  
  #Print the first document in the corpus
  print("---------------------------------------------------")
  print(corpus[[1]]$content)
  
}

#Print the raw Corpus first
analyze_corpus("Raw input data", course_corpus)
## [1] "***************************************************"
## [1] "Raw input data"
## [1] "---------------------------------------------------"
## [1] "Architecture-Course-Description.txt  words = 90  chars = 568"
## [1] "Spark-Course-Description.txt  words = 106  chars = 640"
## [1] "---------------------------------------------------"
## [1] "Real-time systems have guaranteed response times that can be sub-seconds from the trigger. Meaning that when a user clicks a button, your app better respond and fast. Architecting applications under real-time constraints is an even bigger challenge when you're dealing with big data. Luckily, big data technology and efficient architecture can provide the real-time responsiveness your business needs. In this course, you can learn about use cases and best practices for architecting real-time applications with technologies such as Kafka, Hazelcast, and Apache Spark."

Text cleansing

The most common steps in cleansing text and preparing it for analytics and machine learning are:

  • Formatting and standardization (for example, dates)
  • Removing punctuation marks
  • Removing abbreviations
  • Converting all letters to upper or lower case (case conversion)
  • Removing elements such as hashtags
  • If the text contains several languages, adding a translation step so that only one language remains in the text to be analyzed

Let’s look at two cleansing tasks that are performed routinely: converting all letters to lowercase and removing punctuation.

#Convert to lower case
#tm_map - function to apply a transformation to the whole corpus
course_corpus2 <- tm_map(course_corpus, content_transformer(tolower))
analyze_corpus("Converted to lower case",course_corpus2)
## [1] "***************************************************"
## [1] "Converted to lower case"
## [1] "---------------------------------------------------"
## [1] "Architecture-Course-Description.txt  words = 90  chars = 568"
## [1] "Spark-Course-Description.txt  words = 106  chars = 640"
## [1] "---------------------------------------------------"
## [1] "real-time systems have guaranteed response times that can be sub-seconds from the trigger. meaning that when a user clicks a button, your app better respond and fast. architecting applications under real-time constraints is an even bigger challenge when you're dealing with big data. luckily, big data technology and efficient architecture can provide the real-time responsiveness your business needs. in this course, you can learn about use cases and best practices for architecting real-time applications with technologies such as kafka, hazelcast, and apache spark."
#Remove punctuations
course_corpus3 <- tm_map(course_corpus2, removePunctuation)
analyze_corpus("Removed punctuations",course_corpus3)
## [1] "***************************************************"
## [1] "Removed punctuations"
## [1] "---------------------------------------------------"
## [1] "Architecture-Course-Description.txt  words = 83  chars = 552"
## [1] "Spark-Course-Description.txt  words = 100  chars = 625"
## [1] "---------------------------------------------------"
## [1] "realtime systems have guaranteed response times that can be subseconds from the trigger meaning that when a user clicks a button your app better respond and fast architecting applications under realtime constraints is an even bigger challenge when youre dealing with big data luckily big data technology and efficient architecture can provide the realtime responsiveness your business needs in this course you can learn about use cases and best practices for architecting realtime applications with technologies such as kafka hazelcast and apache spark"

Stop Word removal

Stop words are words that carry no meaning by themselves, for example "in", "and", "which", "the". These words consume resources and are not needed for analysis and prediction. You should use a stop word dictionary (standard or custom) for the language of the text you want to analyze.

#Remove stopwords
course_corpus4 <- tm_map(course_corpus3, removeWords, stopwords())
analyze_corpus("Removed Stopwords",course_corpus4)
## [1] "***************************************************"
## [1] "Removed Stopwords"
## [1] "---------------------------------------------------"
## [1] "Architecture-Course-Description.txt  words = 54  chars = 458"
## [1] "Spark-Course-Description.txt  words = 63  chars = 519"
## [1] "---------------------------------------------------"
## [1] "realtime systems  guaranteed response times  can  subseconds   trigger meaning    user clicks  button  app better respond  fast architecting applications  realtime constraints   even bigger challenge  youre dealing  big data luckily big data technology  efficient architecture can provide  realtime responsiveness  business needs   course  can learn  use cases  best practices  architecting realtime applications  technologies   kafka hazelcast  apache spark"

Stemming

The stem of a word is the part that carries its meaning, to which affixes (prefixes and suffixes) are added to complement it. Stemming keeps only this stem, reducing the number of characters: the word is simply truncated to its root, so the resulting word may not appear in a dictionary, but it keeps the meaning and analytical programs can still work with it.

course_corpus5 <- tm_map(course_corpus4, stemDocument)
analyze_corpus("Stemmed documents",course_corpus5)
## [1] "***************************************************"
## [1] "Stemmed documents"
## [1] "---------------------------------------------------"
## [1] "Architecture-Course-Description.txt  words = 54  chars = 365"
## [1] "Spark-Course-Description.txt  words = 62  chars = 426"
## [1] "---------------------------------------------------"
## [1] "realtim system guarante respons time can subsecond trigger mean user click button app better respond fast architect applic realtim constraint even bigger challeng your deal big data luckili big data technolog effici architectur can provid realtim respons busi need cours can learn use case best practic architect realtim applic technolog kafka hazelcast apach spark"

Metadata management

We have to decide which metadata to keep and which to add to the corpus, since decisions can later be made based on that metadata.

#Change ID for each document - without the .txt extension.
for(i_doc in 1:length(course_corpus5) ) {
  
    #Remove the .txt extension in the ID
    course_corpus5[[i_doc]]$meta$id <-
        sub('.txt','',course_corpus5[[i_doc]]$meta$id)
    
    #Add no. of words
    course_corpus5[[i_doc]]$meta$words <-
          lengths(gregexpr("\\W+", course_corpus5[[i_doc]])) + 1
    
    #add a new attribute for status.
    course_corpus5[[i_doc]]$meta$status <-'Cleaned'
}

#Print modified meta data
course_corpus5[[1]]$meta
##   author       : character(0)
##   datetimestamp: 2020-03-23 08:28:30
##   description  : character(0)
##   heading      : character(0)
##   id           : Architecture-Course-Description
##   language     : en
##   origin       : character(0)
##   words        : 54
##   status       : Cleaned
course_corpus5[[2]]$meta
##   author       : character(0)
##   datetimestamp: 2020-03-23 08:28:30
##   description  : character(0)
##   heading      : character(0)
##   id           : Spark-Course-Description
##   language     : en
##   origin       : character(0)
##   words        : 62
##   status       : Cleaned
#Convert to dataframe
df_metadata <- 
  data.frame(status=sapply(course_corpus5, meta, "status"),
             words=sapply(course_corpus5, meta, "words"),
             stringsAsFactors=FALSE)

#Print the data frame
df_metadata
##                                  status words
## Architecture-Course-Description Cleaned    54
## Spark-Course-Description        Cleaned    62
#Save corpus for future use in next chapters. 
#Note: This won't save metadata. If required,
#the metadata data frame should be persisted separately

writeCorpus(course_corpus5, "C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\clean_corpus" )

TF-IDF

In the last chapter, we set up and cleaned the text corpus, and now we are ready to start analyzing the data. There are a number of text mining techniques that you can choose from depending on what you want to accomplish. One of the most popular techniques is called term frequency-inverse document frequency, or TF-IDF. TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document. Even though we have a set of clean data values, a number of machine learning algorithms do not work on text values, only numeric features. So if we want to use these algorithms, the text needs to be converted to an equivalent numeric representation. TF-IDF is a technique to convert text to a numeric representation. It outputs a table of numeric values we can then use to perform other analysis. In this table, each row represents a document in the corpus, and each column represents a word in the corpus. Each cell in the table provides a value that indicates the relative strength of the word with respect to the document. A higher value indicates a stronger correlation between the word and the document. This information is used for a variety of analyses, including machine learning.

So that’s what TF-IDF is, but how do we compute it? Let’s say we have a corpus with three documents, each a simple sentence. We apply the text cleansing described in the previous chapter to arrive at a clean corpus. First, we create a term frequency table. In this table, each document is a row and each word is a column. The count indicates the number of times the word appeared in the document. Next, we find the relative term frequency. To find this, we divide each cell by the total number of words in the document. For example, if document one has three words, each word count in the document is divided by three, and we get 0.33 for each word in the document. Next, we find the inverse document frequency. IDF is computed for each word across all documents in the corpus. For this, we use the formula: the natural log of the total number of documents in the corpus divided by the number of documents containing the word. The purpose of IDF is to find words that are unique and prevalent in only a few documents. The fewer the documents that contain a word, the higher its IDF. Finally, we find TF-IDF by multiplying the TF, or term frequency, value in each cell by the IDF value for that word.

Remember, this technique is useful because it converts text data into a numeric representation, which is required for machine learning, as most machine learning algorithms require numeric data only. So that is the conceptual overview of TF-IDF.
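The same calculation can be sketched in a few lines of base R. This is a minimal illustration using a hypothetical three-document toy corpus, not the course data:

#Toy corpus: three short, already-cleaned documents
docs <- c("dogs favorite pets",
          "cats favorite pets",
          "dogs chase cats")
tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

#Term frequency table: one row per document, one column per word
tf_counts <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))

#Relative term frequency: divide each count by the number of words in its document
tf <- tf_counts / lengths(tokens)

#Inverse document frequency: log(total documents / documents containing the word)
idf <- log(nrow(tf_counts) / colSums(tf_counts > 0))

#TF-IDF: multiply each relative frequency by the word's IDF
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)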

Generating a Term frequency Matrix

rm(list=ls())
#Load the cleaned corpus saved in Chapter 3
course_desc <- VCorpus(DirSource("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\clean_corpus"))
inspect(course_desc[[1]]) #check that it loaded correctly
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 365
## 
## realtim system guarante respons time can subsecond trigger mean user click button app better respond fast architect applic realtim constraint even bigger challeng your deal big data luckili big data technolog effici architectur can provid realtim respons busi need cours can learn use case best practic architect realtim applic technolog kafka hazelcast apach spark
#Generate the Document Term matrix
course_dtm <- DocumentTermMatrix(course_desc) #with the default settings the terms are simply counted

#Inspect the Document Term Matrix
inspect(course_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 79)>>
## Non-/sparse entries: 89/69
## Sparsity           : 44%
## Maximal term length: 11
## Weighting          : term frequency (tf)
## Sample             :
##                                      Terms
## Docs                                  apach applic architect big busi can
##   Architecture-Course-Description.txt     1      2         2   2    1   3
##   Spark-Course-Description.txt            3      0         0   3    1   0
##                                      Terms
## Docs                                  data realtim spark technolog
##   Architecture-Course-Description.txt    2       4     1         2
##   Spark-Course-Description.txt           6       1     3         3
#These functions give us information about some properties of the matrix
#List of docs in the matrix
Docs(course_dtm)
## [1] "Architecture-Course-Description.txt"
## [2] "Spark-Course-Description.txt"
#No. of docs in the matrix
nDocs(course_dtm)
## [1] 2
#List of terms in the matrix
Terms(course_dtm)
##  [1] "addit"       "apach"       "app"         "applic"      "architect"  
##  [6] "architectur" "around"      "basic"       "best"        "better"     
## [11] "big"         "bigger"      "build"       "busi"        "button"     
## [16] "can"         "case"        "challeng"    "click"       "combin"     
## [21] "connect"     "constraint"  "construct"   "cours"       "cover"      
## [26] "data"        "datasci"     "deal"        "demonstr"    "devop"      
## [31] "discov"      "effici"      "endtoend"    "engin"       "even"       
## [36] "fast"        "guarante"    "hazelcast"   "integr"      "join"       
## [41] "kafka"       "kumaran"     "learn"       "luckili"     "make"       
## [46] "mean"        "multipl"     "must"        "need"        "network"    
## [51] "order"       "pipelin"     "ponnambalam" "practic"     "problem"    
## [56] "process"     "project"     "provid"      "realtim"     "realworld"  
## [61] "respond"     "respons"     "solv"        "spark"       "specialist" 
## [66] "store"       "stream"      "subsecond"   "system"      "take"       
## [71] "technolog"   "time"        "trigger"     "understand"  "use"        
## [76] "user"        "various"     "work"        "your"
#No. of terms in the matrix
nTerms(course_dtm)
## [1] 79
#Convert to a matrix
#converting it to a matrix lets us use all the standard matrix access and manipulation tools
course_dtm_matrix <- as.matrix(course_dtm)

#Inspect a specific term
course_dtm_matrix[, 'kafka']
## Architecture-Course-Description.txt        Spark-Course-Description.txt 
##                                   1                                   1
course_dtm_matrix[, 'apach']
## Architecture-Course-Description.txt        Spark-Course-Description.txt 
##                                   1                                   3

Improving Term Frequency Matrix

#Find terms that have occurred at least 5 times
findFreqTerms(course_dtm,5)
## [1] "big"       "data"      "realtim"   "technolog"
#Remove sparse terms - Terms not there in 50% of the documents
#Given that we have only 2 documents, this will give terms that
#are there in both the documents
#this function drops the sparsest (least frequent) terms; note that the Sparsity value drops to 0%.
#This matters for ML algorithms, since it focuses them on the relevant terms.
dense_course_dtm <- removeSparseTerms(course_dtm, 0.5)

inspect(dense_course_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 10)>>
## Non-/sparse entries: 20/0
## Sparsity           : 0%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##                                      Terms
## Docs                                  apach big busi cours data kafka
##   Architecture-Course-Description.txt     1   2    1     1    2     1
##   Spark-Course-Description.txt            3   3    1     1    6     1
##                                      Terms
## Docs                                  realtim spark technolog use
##   Architecture-Course-Description.txt       4     1         2   1
##   Spark-Course-Description.txt              1     3         3   1

Plotting Frequency data

#Generate a frequency table
course_dtm_frequency <- sort(colSums(as.matrix(dense_course_dtm)), 
                          decreasing=TRUE)
#Print the table (vector)
course_dtm_frequency
##      data       big   realtim technolog     apach     spark      busi 
##         8         5         5         5         4         4         2 
##     cours     kafka       use 
##         2         2         2
#Convert frequency table to a data frame
course_dtm_df <- data.frame(word=names(course_dtm_frequency), 
                            freq=course_dtm_frequency)
#print the data frame
course_dtm_df
##                word freq
## data           data    8
## big             big    5
## realtim     realtim    5
## technolog technolog    5
## apach         apach    4
## spark         spark    4
## busi           busi    2
## cours         cours    2
## kafka         kafka    2
## use             use    2
#Create a frequency plot (ggplot comes from the ggplot2 package)
library(ggplot2)
frequency_plot <- ggplot(subset(course_dtm_df, freq>1), 
                    aes(x = reorder(word, -freq), y = freq)) +
                geom_bar(stat = "identity", fill = "#FF6666") + 
                theme(axis.text.x=element_text(angle=45, hjust=1))

#display the frequency plot
frequency_plot

Generating TF-IDF

#Generate the TF-IDF 
#this function computes the matrix for every term, but the sparse-term reduction techniques shown earlier can also be applied
course_tfidf <- DocumentTermMatrix(course_desc, 
                      control= list(weighting = weightTfIdf)) #the control argument tells the function to compute TF-IDF instead of the simple term counts it produced before

#Inspect the TF-IDF matrix
inspect(course_tfidf)
## <<DocumentTermMatrix (documents: 2, terms: 79)>>
## Non-/sparse entries: 69/89
## Sparsity           : 56%
## Maximal term length: 11
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##                                      Terms
## Docs                                         app     applic  architect
##   Architecture-Course-Description.txt 0.01851852 0.03703704 0.03703704
##   Spark-Course-Description.txt        0.00000000 0.00000000 0.00000000
##                                      Terms
## Docs                                  architectur       best        can
##   Architecture-Course-Description.txt  0.01851852 0.01851852 0.05555556
##   Spark-Course-Description.txt         0.00000000 0.00000000 0.00000000
##                                      Terms
## Docs                                   construct    pipelin    respons
##   Architecture-Course-Description.txt 0.00000000 0.00000000 0.03703704
##   Spark-Course-Description.txt        0.03225806 0.03225806 0.00000000
##                                      Terms
## Docs                                      stream
##   Architecture-Course-Description.txt 0.00000000
##   Spark-Course-Description.txt        0.03225806

N-grams

So, now we have the data cleaned and prepped, the TF-IDF matrix made, and we are ready to start text mining. One of the most useful and common techniques in text mining is called N-grams. You might be asking yourself, what are N-grams? Well, an N-gram is a sequence of N items in a sample of text, where N can be any number. Depending on N, they’re called bigrams, trigrams, four-grams, et cetera. For example, let’s take a sentence: Dogs are favorite pets. If we do a bigram conversion of this, we end up with the following three bigrams: Dogs and are, are and favorite, favorite and pets. If we do trigrams, we end up with: Dogs, are, and favorite, then are, favorite, and pets. N-grams are used for building predictive text systems that predict the next word in a sequence, like typeahead systems.
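To make the idea concrete before turning to RWeka, here is a rough base-R sketch of the sliding window that produces n-grams, using the example sentence above:

#Split the sentence into words
words <- strsplit("Dogs are favorite pets", " ")[[1]]

#Set n = 2 for bigrams, n = 3 for trigrams, and so on
n <- 2

#Slide a window of length n over the word sequence
sapply(1:(length(words) - n + 1),
       function(i) paste(words[i:(i + n - 1)], collapse = " "))
#With n = 2 this yields "Dogs are", "are favorite", "favorite pets"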

Using the RWeka NGramTokenizer

#NGramTokenizer and Weka_control come from the RWeka package
library(RWeka)

demo_string <- "This is a demo for ngrams"

#Bigrams
print("Bigrams extraction : ")
## [1] "Bigrams extraction : "
NGramTokenizer( demo_string, Weka_control(min=2,max=2))
## [1] "This is"    "is a"       "a demo"     "demo for"   "for ngrams"
#Trigrams
print("Trigrams extraction : ")
## [1] "Trigrams extraction : "
NGramTokenizer( demo_string, Weka_control(min=3,max=3))
## [1] "This is a"       "is a demo"       "a demo for"      "demo for ngrams"

Creating N-gram Text Frequency Matrix

#Load the corpus
course_desc <- VCorpus(DirSource("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\data"))
inspect(course_desc[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 568
## 
## Real-time systems have guaranteed response times that can be sub-seconds from the trigger. Meaning that when a user clicks a button, your app better respond and fast. Architecting applications under real-time constraints is an even bigger challenge when you're dealing with big data. Luckily, big data technology and efficient architecture can provide the real-time responsiveness your business needs. In this course, you can learn about use cases and best practices for architecting real-time applications with technologies such as Kafka, Hazelcast, and Apache Spark.
#Function to generate Bigrams
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

#Generate Document Term matrix from Bigrams
dtm_bigrams = DocumentTermMatrix(course_desc,
                        control = list(tokenize = BigramTokenizer))
#Inspect the Bigrams DTM created
inspect(dtm_bigrams)
## <<DocumentTermMatrix (documents: 2, terms: 167)>>
## Non-/sparse entries: 171/163
## Sparsity           : 49%
## Maximal term length: 25
## Weighting          : term frequency (tf)
## Sample             :
##                                      Terms
## Docs                                  a button a real-world apache spark
##   Architecture-Course-Description.txt        1            0            1
##   Spark-Course-Description.txt               0            1            2
##                                      Terms
## Docs                                  big data data pipelines
##   Architecture-Course-Description.txt        2              0
##   Spark-Course-Description.txt               3              2
##                                      Terms
## Docs                                  data technologies how to in this
##   Architecture-Course-Description.txt                 0      0       1
##   Spark-Course-Description.txt                        2      5       1
##                                      Terms
## Docs                                  this course to construct
##   Architecture-Course-Description.txt           1            0
##   Spark-Course-Description.txt                  1            2
#Most frequent terms in the corpus that occurred at least 3 times
findFreqTerms(dtm_bigrams,3)
## [1] "apache spark" "big data"     "how to"

Extracting N-gram pairs

#Remove sparse bigrams
dense_bigrams <- removeSparseTerms(dtm_bigrams , 0.5)
inspect(dense_bigrams)
## <<DocumentTermMatrix (documents: 2, terms: 4)>>
## Non-/sparse entries: 8/0
## Sparsity           : 0%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##                                      Terms
## Docs                                  apache spark big data in this
##   Architecture-Course-Description.txt            1        2       1
##   Spark-Course-Description.txt                   2        3       1
##                                      Terms
## Docs                                  this course
##   Architecture-Course-Description.txt           1
##   Spark-Course-Description.txt                  1
#Generate a frequency table
bigrams_frequency <- sort(colSums(as.matrix(dense_bigrams)),
                          decreasing=TRUE)
bigrams_frequency
##     big data apache spark      in this  this course 
##            5            3            2            2
#Convert to data frame
bigrams_df <- data.frame(first_word=character(), 
                               second_word=character(), 
                               count=integer())

#Iterate through the frequency table to extract data
for ( i in 1:length(bigrams_frequency)) {
  
    #Extract the bigram name
    bigram <- names(bigrams_frequency)[[i]]
    #Split bigram into words
    bigram_words<-unlist(strsplit(bigram," "))
    #Extract count
    count=bigrams_frequency[[i]]
    
    #Create a row for the dataframe
    bigram_row<-list(first_word = bigram_words[[1]],
                      second_word=bigram_words[[2]],
                      count=count)
    #Add the row to the dataframe
    bigrams_df<-rbind(bigrams_df, bigram_row, stringsAsFactors=FALSE)
}

print("Bigrams dataframe :")
## [1] "Bigrams dataframe :"
bigrams_df
##   first_word second_word count
## 1        big        data     5
## 2     apache       spark     3
## 3         in        this     2
## 4       this      course     2

Best practices

Storing text

So, now you know how to set up R for text mining. However, once you have set it up, you will probably want to return to your work at some point, and the R session will eventually close. Therefore, storing the text data is important. So, what are the best practices for storing text data?

First, don’t try to cram text data into an RDBMS. Rather, use a suitable big data storage, like HDFS, S3, or Google Cloud Storage to store text data. These data stores can scale better, especially with unstructured data. References to the storage can be stored inside RDBMS records.

Next, it’s important to be able to query and filter text data in these object stores. Create indexes on key data elements or words, either in a document database, like MongoDB, or a search engine, like Elasticsearch.

Finally, another option is to store processed text data, like tokens or TF-IDF arrays, for future consumption. This reduces the need to process raw text again, while also saving on storage costs. Again, with a small data set like this it’s less important, but when you begin working on large data this becomes a huge time saver.
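As a sketch of that last point, assuming the TF-IDF matrix built earlier (course_tfidf) and a hypothetical file name:

#Save the processed TF-IDF matrix so the raw text does not need to be re-processed
saveRDS(as.matrix(course_tfidf), "course_tfidf.rds")

#Reload it in a later session
course_tfidf_matrix <- readRDS("course_tfidf.rds")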

Processing text data

So, now we know how to store the data, but what are some of the key practices to consider while processing text?

First, it’s important to filter data as early as possible in the process. Text data is heavy, and the lighter we can make it ahead of time, the more performant the analysis will be later in the pipeline.

Second, you should use an exhaustive and context-specific stop word list to eliminate stop words. Stop words do not carry any insights, so eliminating most of them is important for efficiency.
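As a sketch, a context-specific stop word list can simply extend the standard one; the extra words and the corpus object below are illustrative assumptions:

#Combine the standard English stop words with domain-specific ones
custom_stopwords <- c(stopwords("english"), "course", "chapter", "learn")

#Apply the combined list to a corpus, e.g. the cleaned corpus from the earlier chapter
course_corpus_nostop <- tm_map(course_corpus3, removeWords, custom_stopwords)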

Next, identify domain-specific data for special use. These special words signal a specific purpose for the text and can be used to index and classify it. Examples of such strings would be product names, which occur in text data specific to an organization.
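One way to track such domain-specific terms with tm is to build a document-term matrix restricted to a dictionary of those words. A sketch, assuming the course_desc corpus loaded earlier and using terms that occur in the sample course descriptions:

#Count only the occurrences of a fixed dictionary of domain-specific terms
domain_terms <- c("kafka", "spark", "hazelcast")
domain_dtm <- DocumentTermMatrix(course_desc,
                                 control = list(dictionary = domain_terms))
inspect(domain_dtm)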

While building TF-IDF matrices, it’s important to eliminate tokens that rarely occur. They usually are not useful in classification or analysis.

Finally, you should try and build a clean and indexed corpus based on the language and business context. Persist it for future use. Persistence can be achieved either with a file, a database, or in a memory cache based on the use case.

Scalability

Throughout this course, we have been dealing with very small datasets. However, the power in text mining comes from being able to analyze big data. So the question is, how do we process large quantities of text data in a scalable manner?

First, when working with big data, use technologies that allow for parallel access and storage of data. Technologies like Kafka, HDFS, and MongoDB support a number of nodes and channels to allow for parallel access, movement, and storage of data.

Next, process each document independently with a map function. Activities like cleansing and tokenization can be done this way. This allows for multiple nodes to process documents in parallel, and hence, speed up the pipeline.
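In R itself, a small-scale version of this map step can be sketched with the parallel package; the cleaning function, worker count, and corpus object below are illustrative assumptions:

#Set up a small cluster of worker processes
library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, library(tm))

#Map: clean each document independently on a worker process
texts <- lapply(content(course_corpus), as.character)
cleaned_texts <- parLapply(cl, texts,
                           function(doc) tolower(removePunctuation(doc)))

stopCluster(cl)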

Finally, use reduce functions late in the processing, after all the filtering and cleansing is done. Reduce functions, like aggregations, create choke points, so we want them to operate on datasets that are as small as possible, which helps processing speed.

Next steps

Now you have the information you need to get started with text mining on a basic level. Some ways to further your learning on this topic might be: learn more in depth about analytics and machine learning techniques for text data, explore text processing at scale with the big data technologies we discussed in the previous section, and build an end-to-end live project for text analytics in your organization. Data always intrigues me. I bet it intrigues you too. So, let’s keep exploring it and find better ways of extracting knowledge out of it to generate insights for our businesses.

Andres Lopez

March 22, 2020