Text Analytics and Predictions with R Essential Training


Introduction

Text processing and analytics is one of the fastest-growing areas in the field of machine learning. Why? Because more and more of the data generated today is text. The Internet is full of blogs, reviews, comments, notes, and other text-based content. Social media generates data every day in the form of messages, tweets, hashtags, and references. Computer software generates log messages and audit trails. There is much more, including emails and audio or video that gets transcribed into text. With so much free-text data out there, businesses can capitalize on it through text analytics and then use these insights to drive strategic business actions. But analyzing text poses some unique challenges. Text data is typically several times larger than equivalent numeric data. Also, text data does not have a fixed structure or schema, which makes it difficult to understand. In this course, I will show you some of the tools and techniques available in R that can help you with these issues and aid in generating insights.

This course is about text analytics and predictions using R. It focuses on analytics and machine learning techniques specific to text. The examples are in R, and we use RStudio, so it’s good to have some familiarity with these tools. You will need to download the latest versions of R and RStudio to follow along; RStudio will not run without a compatible version of R installed. You can download R from the cran.r-project.org website and RStudio from the rstudio.com website. The examples in this course also pre-process text data before using it for analytics. The techniques used include stopword removal, stemming, n-grams, and tf-idf. If you are not familiar with these techniques, I recommend taking my other course on LinkedIn Learning called Text Processing with R. While this course uses machine learning techniques, like clustering and classification, for text mining, it does not delve deep into those concepts. Rather, it focuses on applying these techniques to text-specific data.

Word Cloud

Word Cloud concepts

Let’s start off with something you may have used in the past: a word cloud. Basically, it highlights which words are used, and with what frequency, in a body of text, or corpus, and then arranges those words into a shape. The size of each word in the word cloud is based on the number of occurrences of that word in the corpus; the more occurrences, the bigger the size.

A word cloud can also limit the number of words shown to just the top, most popular ones. A word cloud can be used to show the popularity of keywords visually. For instance, you can show the popularity of athletes in a sports league by showing the names of the players in a word cloud. In this specific example, we will analyze the popular words used in technology course descriptions on LinkedIn. Let’s get started with prepping the data so we can make that happen.
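To make the idea concrete before we prepare real data, here is a minimal sketch (not part of the course files) that draws a word cloud from a small made-up frequency table. The words and counts are hypothetical, and it assumes the wordcloud package is installed.

#A toy word cloud: word size is driven by the frequency value
library(wordcloud)

toy_words <- c("data", "spark", "cloud", "python", "java")  #hypothetical words
toy_freq  <- c(10, 7, 5, 3, 2)                              #hypothetical counts

#min.freq = 1 keeps even the rare words visible in this tiny example
wordcloud(toy_words, toy_freq, min.freq = 1)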

Preparing the data

#Load the tm package for corpus handling
library(tm)

#Load up the corpus
course_corpus <- VCorpus(DirSource("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\courses"))

#cleansing activities
#Convert to lower case
course_corpus2 <- tm_map(course_corpus, content_transformer(tolower))
#Remove punctuations
course_corpus3 <- tm_map(course_corpus2, removePunctuation)
#Remove stopwords
course_corpus4 <- tm_map(course_corpus3, removeWords, stopwords())
inspect(course_corpus4)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 458
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 519
#Generate the document-term matrix (term frequency weighting)
course_dtm <- DocumentTermMatrix(course_corpus4)

#Inspect the document-term matrix
inspect(course_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 82)>>
## Non-/sparse entries: 92/72
## Sparsity           : 44%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## Sample             :
##                                      Terms
## Docs                                  apache applications architecting big
##   Architecture-Course-Description.txt      1            2            2   2
##   Spark-Course-Description.txt             3            0            0   3
##                                      Terms
## Docs                                  business can data realtime spark
##   Architecture-Course-Description.txt        1   3    2        4     1
##   Spark-Course-Description.txt               1   0    6        1     3
##                                      Terms
## Docs                                  technologies
##   Architecture-Course-Description.txt            1
##   Spark-Course-Description.txt                   3
#Generate a frequency data frame
word_frequency <- sort(colSums(as.matrix(course_dtm)),
                       decreasing=TRUE)
df_frequency<- data.frame(word = names(word_frequency),
                          freq=word_frequency)

head(df_frequency)
##                      word freq
## data                 data    8
## big                   big    5
## realtime         realtime    5
## apache             apache    4
## spark               spark    4
## technologies technologies    4

Displaying the word cloud

#Load the wordcloud package for plotting word clouds
library(wordcloud)

#Simple wordcloud
wordcloud(df_frequency$word,df_frequency$freq) #by default, this function only shows words with a frequency of 3 or more

#Top 10 words
wordcloud(df_frequency$word,
          df_frequency$freq,
          max.words=10, min.freq = 1) #control the number of words shown as well as the minimum frequency for a word to appear in the plot

Enhance the word cloud

#Choose a specific font and order
wordcloud(df_frequency$word,
          df_frequency$freq,
          max.words=10, min.freq = 1, #control the number of words and the minimum frequency
          random.order=FALSE,  #place the most frequent words at the center of the cloud
          family = "Helvetica", font = 3) #font family and face for the words

#Using a color palette
library(RColorBrewer)

word_pal <- brewer.pal(10,"Dark2") #create a color palette to use; try exploring other palettes

wordcloud(df_frequency$word,
          df_frequency$freq,
          max.words=20, min.freq = 1,
          random.order=FALSE,
          colors=word_pal, #apply the color palette we created
          family= "Arial", font = 3)

Sentiment analysis

Sentiment analysis concepts

One of the most popular analyses performed on text is identifying the sentiment expressed by its author. Sentiment, in this case, is the overall emotion expressed in the words people write. This might simply be positive or negative, but it can include other emotions as well.

Organizations try to understand the sentiment of their customers and users based on their communications and social media posts. This is referred to as sentiment analysis.

Sentiment analysis is a text mining technique used to identify the intent or opinion in text data. Users communicate their opinions and needs through text in the form of reviews, emails, chats, et cetera. Sentiment analysis looks at a corpus of text, possibly spanning multiple sentences, to understand the overall sentiment of the author.

There are multiple techniques used for this, the most popular being the bag-of-words approach. In this technique, we look for specific words in the text and conclude on the overall sentiment. This technique is simple and straightforward but may not handle complex sentiments effectively.

In sentiment analysis, we determine polarity. Polarity is a score from minus one to plus one that indicates whether the sentiment is positive, negative, or neutral. The closer the score is to minus one, the more negative the sentiment; the closer it is to plus one, the more positive; and the closer it is to zero, the more neutral. After determining polarity, we can also try to understand the emotions in the text, like happy, sad, and angry.
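To see polarity in action, here is a minimal sketch (not from the course) that scores two made-up sentences with the sentimentr package; the review text is hypothetical.

#Score a clearly positive and a clearly negative toy sentence
library(sentimentr)

toy_reviews <- c("This movie was absolutely wonderful.",
                 "The plot was boring and the acting was terrible.")

#sentiment_by() returns an average polarity per element:
#above zero leans positive, below zero leans negative, near zero is neutral
sentiment_by(get_sentences(toy_reviews))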

Let’s take a look at what we need to do to prepare data for sentiment analysis.

Finding sentiment

#Load the sentimentr package for sentence-level sentiment scoring
library(sentimentr)

#Load the movie reviews file and convert it into sentences
movie_reviews <- readLines(file("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\reviews\\Movie-Reviews.txt"))
movie_reviews
##  [1] "When your main character in a superhero movie is unwatchable, you already have a problem. In addition, Captain Marvel has no weaknesses, which kills the tension immediately."                                  
##  [2] "Her performance was forced, uninspiring and flat! Not looking forward to the next movie with her in it...."                                                                                                     
##  [3] "I couldn't believe how boring this movie was. The acting is horrible, the action is terrible, and Captain Marvel herself is super cheesy. This is the worst Marvel movie for me, alongside Ant man and the Wasp"
##  [4] "Nothing beats a good marvel movie, and this is definitely a good marvel movie"                                                                                                                                  
##  [5] "This movie did for Marvel what Wonder Woman did for DC. Captain Marvel is a great role model for young children. Great to see Colson and Fury as well. Loved Goose."                                            
##  [6] "Captain Marvel just became my favorite superhero of all time. This movie was funnier than I expected and all-around great. Go see it!"                                                                          
##  [7] "This is a very controversial Marvel film. Which seems to be a running trend with Disney films recently."                                                                                                        
##  [8] "Unfortunately, despite carrying many elements of previous Marvel installments, it fails to embody their success due to its questionable ambition. "                                                             
##  [9] "It's worth watching just for the fact that this character will appear in The Avengers infinity war part 2. "                                                                                                    
## [10] "It's great for a first time viewing. Would you watch it again? Nah. A good motivational for little kids with comical characters. First time viewing is good but I won't look back at it."
review_text <- get_sentences(movie_reviews) #extract the individual sentences from each review
review_text
## [[1]]
## [1] "When your main character in a superhero movie is unwatchable, you already have a problem."
## [2] "In addition, Captain Marvel has no weaknesses, which kills the tension immediately."      
## 
## [[2]]
## [1] "Her performance was forced, uninspiring and flat!"       
## [2] "Not looking forward to the next movie with her in it...."
## 
## [[3]]
## [1] "I couldn't believe how boring this movie was."                                              
## [2] "The acting is horrible, the action is terrible, and Captain Marvel herself is super cheesy."
## [3] "This is the worst Marvel movie for me, alongside Ant man and the Wasp"                      
## 
## [[4]]
## [1] "Nothing beats a good marvel movie, and this is definitely a good marvel movie"
## 
## [[5]]
## [1] "This movie did for Marvel what Wonder Woman did for DC." 
## [2] "Captain Marvel is a great role model for young children."
## [3] "Great to see Colson and Fury as well."                   
## [4] "Loved Goose."                                            
## 
## [[6]]
## [1] "Captain Marvel just became my favorite superhero of all time."
## [2] "This movie was funnier than I expected and all-around great." 
## [3] "Go see it!"                                                   
## 
## [[7]]
## [1] "This is a very controversial Marvel film."                    
## [2] "Which seems to be a running trend with Disney films recently."
## 
## [[8]]
## [1] "Unfortunately, despite carrying many elements of previous Marvel installments, it fails to embody their success due to its questionable ambition."
## 
## [[9]]
## [1] "It's worth watching just for the fact that this character will appear in The Avengers infinity war part 2."
## 
## [[10]]
## [1] "It's great for a first time viewing."                        
## [2] "Would you watch it again?"                                   
## [3] "Nah."                                                        
## [4] "A good motivational for little kids with comical characters."
## [5] "First time viewing is good but I won't look back at it."     
## 
## attr(,"class")
## [1] "get_sentences"           "get_sentences_character"
## [3] "list"
#See the sentiment score for each sentence
sentiment(review_text)
##     element_id sentence_id word_count     sentiment
##  1:          1           1         15 -0.3485685012
##  2:          1           2         12 -0.2886751346
##  3:          2           1          7 -0.4157609203
##  4:          2           2         11 -0.3015113446
##  5:          3           1          8  0.3535533906
##  6:          3           2         15  0.1290994449
##  7:          3           3         14  0.0668153105
##  8:          4           1         14  1.1224972160
##  9:          5           1         11  0.5276448530
## 10:          5           2         10  0.6482669203
## 11:          5           3          8  0.2828427125
## 12:          5           4          2  0.3535533906
## 13:          6           1         10  0.4743416490
## 14:          6           2         11  0.6030226892
## 15:          6           3          3  0.0000000000
## 16:          7           1          7  0.0000000000
## 17:          7           2         11  0.1206045378
## 18:          8           1         20 -0.0335410197
## 19:          9           1         18  0.2003469213
## 20:         10           1          7  0.1889822365
## 21:         10           2          5  0.0000000000
## 22:         10           3          1  0.0000000000
## 23:         10           4          9  0.2833333333
## 24:         10           5         12  0.0002165064
##     element_id sentence_id word_count     sentiment
#Sentiment by each review
sentiments <- sentiment_by(review_text)
sentiments
##     element_id word_count         sd ave_sentiment
##  1:          1         27 0.04235101   -0.31862182
##  2:          2         18 0.08078665   -0.35863613
##  3:          3         37 0.15081866    0.18315605
##  4:          4         14         NA    1.12249722
##  5:          5         31 0.16587559    0.45307697
##  6:          6         24 0.31759386    0.38035077
##  7:          7         18 0.08528029    0.06581225
##  8:          8         20         NA   -0.03354102
##  9:          9         18         NA    0.20034692
## 10:         10         34 0.13354287    0.11672799

Summarizing sentiments

#Convert the sentiment data.table to a data frame (setDF comes from the data.table package)
library(data.table)
sentiment_df <- setDF(sentiments)

#Function that generates a sentiment class based on sentiment score
get_sentiment_class <- function(sentiment_score) {
  
  sentiment_class = "Positive"
  
  if ( sentiment_score < -0.3) {
    sentiment_class = "Negative"
  } 
  
  else if (sentiment_score < 0.3) {
    sentiment_class = "Neutral"
  }
  
  sentiment_class
}

#add a sentiment_class attribute
sentiment_df$sentiment_class <- 
        sapply(sentiment_df$ave_sentiment,get_sentiment_class)

#Print resulting sentiment
sentiment_df[,4:5]
##    ave_sentiment sentiment_class
## 1    -0.31862182        Negative
## 2    -0.35863613        Negative
## 3     0.18315605         Neutral
## 4     1.12249722        Positive
## 5     0.45307697        Positive
## 6     0.38035077        Positive
## 7     0.06581225         Neutral
## 8    -0.03354102         Neutral
## 9     0.20034692         Neutral
## 10    0.11672799         Neutral
#Summarize the sentiment classes and draw a pie chart (count comes from the dplyr package)
library(dplyr)
sentiment_summary <- count(sentiment_df, sentiment_class)

pie(sentiment_summary$n, 
    sentiment_summary$sentiment_class,
    col=c("Red","Blue","Green"))

Analyzing emotions

#Create a data frame of emotions by review
#emotion_by() gives the emotion that each word represents and a count of how often words expressing that emotion appear in the text
emotion_df <- setDF(emotion_by(review_text))
head(emotion_df)
##   element_id emotion_type word_count emotion_count         sd ave_emotion
## 1          1        anger         27             1 0.05892557  0.03703704
## 2          1 anticipation         27             1 0.05892557  0.03703704
## 3          1      disgust         27             0 0.00000000  0.00000000
## 4          1         fear         27             1 0.04714045  0.03703704
## 5          1          joy         27             0 0.00000000  0.00000000
## 6          1      sadness         27             1 0.04714045  0.03703704
#aggregate by emotion types and remove 0 values
emotion_summary=subset(
                  aggregate(emotion_count  ~ emotion_type , 
                                 emotion_df, sum),
                   emotion_count > 0 )
emotion_summary
##       emotion_type emotion_count
## 1            anger             5
## 2     anticipation            14
## 3          disgust             3
## 4             fear             7
## 5              joy             9
## 6          sadness             3
## 7         surprise            14
## 8 surprise_negated             1
## 9            trust             8
#Draw a pie chart for emotion summary
pie(emotion_summary$emotion_count, emotion_summary$emotion_type,
    col= c("Red","Green","Blue","Orange","Brown","Purple") )

Clustering

Clustering concepts

There may be times when you run into a really large dataset with different attributes and you need to find similarities. For instance, you may want to find similar customers based on their demographics. In this situation, you can use something called clustering, which is a machine learning technique that helps group similar elements based on their attributes.

Clustering is a great candidate for unsupervised learning. In unsupervised learning, there is no training dataset with prior classification. Instead, we group elements based on the similarity of their attributes. There are a number of techniques available for clustering, like k-means clustering and k-nearest neighbors.

You might be asking what this has to do with text mining. Well, when working with text, the words in a document become features. Documents with similar words get grouped together. Clustering algorithms use only numeric data, so text data needs to be converted to a numeric representation. Term frequency-inverse document frequency, or tf-idf, is the most popular technique used for this purpose. It converts a corpus of documents into a numeric matrix, with documents representing rows and words representing columns. Clustering for text can be used to group documents like reviews, news articles, and tweets based on the words used in those documents.
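As a rough sketch of that idea (the documents below are made up, and it assumes the tm package), the following code builds a tf-idf weighted document-term matrix for three toy documents and clusters them with k-means. Documents that share words end up in the same cluster.

#Toy documents with two obvious topics
library(tm)

toy_docs <- c("big data spark hadoop",
              "spark hadoop big data pipelines",
              "java programming patterns")

toy_corpus <- VCorpus(VectorSource(toy_docs))

#weightTfIdf replaces raw term counts with tf-idf weights
toy_dtm <- DocumentTermMatrix(toy_corpus,
                              control = list(weighting = weightTfIdf))

#Ask k-means for 2 clusters and look at the cluster assigned to each document
set.seed(100)
kmeans(as.matrix(toy_dtm), 2)$cluster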

Preparing data for clustering

#Read the course hashtags into a data frame
movie_hashtags <- read.csv("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\hashtags\\Course-Hashtags.csv")
movie_hashtags
##                                   Course
## 1        Apache Spark Essential Training
## 2                 Java Memory Management
## 3          Python Automation and Testing
## 4                    Python for Graphics
## 5    Machine Learning and AI Foundations
## 6   Java : Database Integration and JDBC
## 7                          R Programming
## 8                 Python Design Patterns
## 9                Hadoop for Data Science
## 10                     Java IDE Overview
## 11 Data Science on Google Cloud Platform
## 12                Scala for Data Science
## 13        Kubernetes for Java Developers
## 14                      Python Scripting
##                                    HashTags
## 1       BigData,DataScience,MachineLearning
## 2                 Java,Advanced,Programming
## 3               Python,Automation,Scripting
## 4                 Python,Graphics,Scripting
## 5  DataScience,MachineLearning,Intermediate
## 6                     Java,JDBC,Programming
## 7             R,Programming,MachineLearning
## 8                    Python,Design,Patterns
## 9                Hadoop,DataScience,BigData
## 10                     Java,Programming,IDE
## 11             DataScience,GCP,Intermediate
## 12                Scala,DataScience,BigData
## 13              Java,Kubernetes,Programming
## 14               Python,Scripting,Developer
#Load hashtags into a corpus
hashtags <- VCorpus(VectorSource(movie_hashtags$HashTags))

#replace comma with spaces
clean_hashtags <- tm_map(hashtags, 
                         content_transformer(
                            function(x) gsub(","," ",x)
                            )
                         )

inspect(clean_hashtags[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 35
## 
## BigData DataScience MachineLearning
#Generate the Document Term matrix
hashtags_dtm <- DocumentTermMatrix(clean_hashtags)
hashtags_dtm
## <<DocumentTermMatrix (documents: 14, terms: 20)>>
## Non-/sparse entries: 41/239
## Sparsity           : 85%
## Maximal term length: 15
## Weighting          : term frequency (tf)
#Inspect the Document Term matrix
inspect(hashtags_dtm)
## <<DocumentTermMatrix (documents: 14, terms: 20)>>
## Non-/sparse entries: 41/239
## Sparsity           : 85%
## Maximal term length: 15
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs advanced automation bigdata datascience intermediate java
##   1         0          0       1           1            0    0
##   10        0          0       0           0            0    1
##   11        0          0       0           1            1    0
##   2         1          0       0           0            0    1
##   3         0          1       0           0            0    0
##   4         0          0       0           0            0    0
##   5         0          0       0           1            1    0
##   6         0          0       0           0            0    1
##   8         0          0       0           0            0    0
##   9         0          0       1           1            0    0
##     Terms
## Docs machinelearning programming python scripting
##   1                1           0      0         0
##   10               0           1      0         0
##   11               0           0      0         0
##   2                0           1      0         0
##   3                0           0      1         1
##   4                0           0      1         1
##   5                1           0      0         0
##   6                0           1      0         0
##   8                0           0      1         0
##   9                0           0      0         0

Clustering hashtags

#Setting the seed ensures repeatable results
set.seed(100)

#Create 3 clusters
movie_clusters <-  kmeans(hashtags_dtm, 3)

#Inspect the results
movie_clusters$cluster #this vector holds the cluster number assigned to each document
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 
##  1  3  2  2  1  3  3  2  1  3  1  1  3  2
#Add cluster information to the original data frame 
for ( movie in 1:nrow(movie_hashtags)) {
  movie_hashtags$Cluster[movie] <- movie_clusters$cluster[movie]
}

#Sort by cluster and review results
print(movie_hashtags[order(movie_hashtags$Cluster),c(1,3)]  )
##                                   Course Cluster
## 1        Apache Spark Essential Training       1
## 5    Machine Learning and AI Foundations       1
## 9                Hadoop for Data Science       1
## 11 Data Science on Google Cloud Platform       1
## 12                Scala for Data Science       1
## 3          Python Automation and Testing       2
## 4                    Python for Graphics       2
## 8                 Python Design Patterns       2
## 14                      Python Scripting       2
## 2                 Java Memory Management       3
## 6   Java : Database Integration and JDBC       3
## 7                          R Programming       3
## 10                     Java IDE Overview       3
## 13        Kubernetes for Java Developers       3

Finding optimal cluster size

#Function to find the optimum no. of clusters
optimal_cluster_plot <- function(data, iterations=10, seed=1000){
  
  #Set within-sum-of-squares for a single cluster
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  
  #Iterate from 2 up to 'iterations' clusters and measure wss
  for (i in 2:iterations){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  
  #Plot wss for each value of k and find the elbow
  plot(1:iterations, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares", col="red")
}

#Execute the function
optimal_cluster_plot(hashtags_dtm) #the optimum is where the curve shows an elbow; in this case, k=3

Classification

Classification concepts

Not to be confused with clustering, classification is another use case for text mining. Classification is a machine learning technique for supervised learning; recall that clustering is used for unsupervised learning. Its goal is to use entities whose classes are already known to build a model that can identify the class of a new entity.

Classification algorithms build models based on a target variable in the dataset, using the other feature variables available in the dataset. The model is then used to predict the class of new data: it predicts the target variable based on the other feature variables available in the new data. We split the source data into training data and test data; training data is used to build the model, and test data is used to measure its accuracy.

How can we use classification for text mining? In text mining, the words in a document become feature variables. For the purposes of training models, each document needs to be tagged with a specific class, which is then used as the target variable to build the model. Most classification algorithms require feature and target variables to have numeric values, so text documents need to be converted into numeric matrices, such as tf-idf matrices, before they can be used for classification.
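As a rough illustration of that shape (not from the course files; the documents and labels below are made up), this sketch shows how tagged documents become a feature matrix plus a target factor, which is what a classification algorithm expects.

#Toy tagged documents: each description has a known class (the target variable)
library(tm)

toy_docs   <- c("data science with python", "cloud computing on aws",
                "machine learning with data", "deploy apps to the cloud")
toy_labels <- factor(c("Data-Science", "Cloud-Computing",
                       "Data-Science", "Cloud-Computing"))

#Words become feature variables: one row per document, one column per term
toy_dtm      <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))
toy_features <- as.data.frame(as.matrix(toy_dtm))

#toy_features holds the features and toy_labels the target used to train a model
str(toy_features)
toy_labels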

Prepare the data

#Load up the corpus

course_raw = scan("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\classification\\Course-Descriptions.txt",
                  what="", sep="\n")

course_corpus <- VCorpus(VectorSource(course_raw))
inspect(course_corpus[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 130
## 
## In this practical, hands-on course, learn how to do data preparation, data munging, data visualization, and predictive analytics.
#Data cleansing
#Convert to lower case
course_corpus2 <- tm_map(course_corpus, content_transformer(tolower))

#Remove punctuations
course_corpus3 <- tm_map(course_corpus2, removePunctuation)

#Remove stopwords
course_corpus4 <- tm_map(course_corpus3, removeWords, stopwords())

inspect(course_corpus4[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 107
## 
##   practical handson course learn    data preparation data munging data visualization  predictive analytics
#Generate the document-term matrix
course_dtm <- DocumentTermMatrix(course_corpus4)
course_dtm
## <<DocumentTermMatrix (documents: 20, terms: 245)>>
## Non-/sparse entries: 328/4572
## Sparsity           : 93%
## Maximal term length: 19
## Weighting          : term frequency (tf)
findFreqTerms(course_dtm,5) #show only the terms that appear at least 5 times
## [1] "can"       "cloud"     "computing" "course"    "data"      "many"     
## [7] "python"    "using"
#Remove sparse terms: with a sparsity threshold of 0.8, keep only terms
#that appear in at least roughly 20% of the documents
dense_course_dtm <- removeSparseTerms(course_dtm, .8)

#Inspect the reduced document-term matrix
inspect(dense_course_dtm)
## <<DocumentTermMatrix (documents: 20, terms: 9)>>
## Non-/sparse entries: 45/135
## Sparsity           : 75%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs can cloud code course data many python science want
##   1    0     0    0      1    3    0      0       0    0
##   10   0     0    0      1    1    1      0       1    0
##   11   1     0    0      0    1    0      1       1    0
##   14   1     2    0      1    0    0      0       0    1
##   16   1     0    1      0    0    1      0       0    0
##   4    1     3    0      1    0    0      0       0    0
##   5    0     0    0      0    3    1      0       1    1
##   7    0     0    0      0    5    0      0       1    0
##   8    0     0    0      0    2    0      2       0    0
##   9    0     2    0      0    0    1      0       0    0
#Convert counts to classes {Yes, No} / {1, 0}, since classification algorithms work better with categorical variables
conv_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

class_dtm <- apply(dense_course_dtm, MARGIN = 2, conv_counts)
class_dtm
##     Terms
## Docs can   cloud code  course data  many  python science want 
##   1  "No"  "No"  "No"  "Yes"  "Yes" "No"  "No"   "No"    "No" 
##   2  "No"  "No"  "No"  "No"   "No"  "No"  "No"   "No"    "No" 
##   3  "No"  "No"  "No"  "No"   "No"  "Yes" "Yes"  "No"    "No" 
##   4  "Yes" "Yes" "No"  "Yes"  "No"  "No"  "No"   "No"    "No" 
##   5  "No"  "No"  "No"  "No"   "Yes" "Yes" "No"   "Yes"   "Yes"
##   6  "Yes" "No"  "No"  "No"   "No"  "No"  "No"   "No"    "No" 
##   7  "No"  "No"  "No"  "No"   "Yes" "No"  "No"   "Yes"   "No" 
##   8  "No"  "No"  "No"  "No"   "Yes" "No"  "Yes"  "No"    "No" 
##   9  "No"  "Yes" "No"  "No"   "No"  "Yes" "No"   "No"    "No" 
##   10 "No"  "No"  "No"  "Yes"  "Yes" "Yes" "No"   "Yes"   "No" 
##   11 "Yes" "No"  "No"  "No"   "Yes" "No"  "Yes"  "Yes"   "No" 
##   12 "No"  "No"  "No"  "No"   "No"  "No"  "No"   "No"    "No" 
##   13 "No"  "No"  "Yes" "No"   "No"  "No"  "No"   "No"    "Yes"
##   14 "Yes" "Yes" "No"  "Yes"  "No"  "No"  "No"   "No"    "Yes"
##   15 "No"  "No"  "Yes" "No"   "No"  "No"  "Yes"  "No"    "No" 
##   16 "Yes" "No"  "Yes" "No"   "No"  "Yes" "No"   "No"    "No" 
##   17 "No"  "Yes" "No"  "No"   "No"  "No"  "No"   "No"    "No" 
##   18 "Yes" "Yes" "No"  "Yes"  "No"  "No"  "No"   "No"    "No" 
##   19 "No"  "No"  "No"  "No"   "No"  "No"  "No"   "No"    "Yes"
##   20 "No"  "No"  "Yes" "Yes"  "No"  "No"  "Yes"  "No"    "No"

Building a model

#Load the caret package for data partitioning, model training, and evaluation
library(caret)

#Load the classifications for the descriptions
course_classes = scan("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\classification\\Course-Classification.txt", what="", sep="\n")

#Random split of training and testing sets
train_set <- createDataPartition(y=course_classes, p=.7,list=FALSE)

#splitting the dtm
train_dtm <- class_dtm[train_set,]
test_dtm <-class_dtm[-train_set,]

#split the course_classes
train_classes <- course_classes[train_set]
test_classes <- course_classes[-train_set]

#Train the model using Naive Bayes (the "nb" method in caret uses the klaR package, which must be installed)
course_model <- train( data.frame(train_dtm), train_classes, method="nb")
course_model
## Naive Bayes 
## 
## 16 samples
##  9 predictor
##  3 classes: 'Cloud-Computing', 'Data-Science', 'Programming' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 16, 16, 16, 16, 16, 16, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.6210317  0.4246917
##    TRUE      0.6210317  0.4246917
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
##  and adjust = 1.

Running predictions

#Predict for the test data
course_predictions <- predict(course_model,test_dtm)

#Analyze prediction accuracy
confusionMatrix(table(course_predictions , test_classes))
## Confusion Matrix and Statistics
## 
##                   test_classes
## course_predictions Cloud-Computing Data-Science Programming
##    Cloud-Computing               1            0           0
##    Data-Science                  0            1           0
##    Programming                   0            0           2
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.3976, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : 0.0625     
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Cloud-Computing Class: Data-Science
## Sensitivity                            1.00                1.00
## Specificity                            1.00                1.00
## Pos Pred Value                         1.00                1.00
## Neg Pred Value                         1.00                1.00
## Prevalence                             0.25                0.25
## Detection Rate                         0.25                0.25
## Detection Prevalence                   0.25                0.25
## Balanced Accuracy                      1.00                1.00
##                      Class: Programming
## Sensitivity                         1.0
## Specificity                         1.0
## Pos Pred Value                      1.0
## Neg Pred Value                      1.0
## Prevalence                          0.5
## Detection Rate                      0.5
## Detection Prevalence                0.5
## Balanced Accuracy                   1.0

Predictive Text

Predictive Text concepts

Predictive text is a popular application for text mining and analytics. When you compose a text on your smartphone or type a search term in Google, you see recommendations for the current or the next word. That is predictive text at work. The machine is trying to figure out what you will say next. When it works correctly, it saves you time and effort.

So how exactly does predictive text work? Through something called n-grams. N-grams are basically sets of co-occurring words within a given window, and they are used to identify word-sequence patterns. So we start with a corpus of sentences collected from usage specific to the context. We then use n-gram techniques to build a database of previous words and possible next words. This n-gram database is then queried to predict the next possible word. In order to build an accurate database, it is recommended to build a custom corpus based on the context, which may be a specific user or an application.
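To see what these n-grams look like before building the full database, here is a minimal sketch (the sentence is made up) that uses the RWeka tokenizer to list the bigrams, that is, the two-word windows, in a single sentence.

#List the bigrams in one toy sentence
library(RWeka)

toy_sentence <- "learn apache spark for big data processing"

#Each returned element is a pair of co-occurring words within a window of two
NGramTokenizer(toy_sentence, Weka_control(min = 2, max = 2))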

Preparing the data

#Load text files into the VCorpus
course_corpus <- VCorpus(DirSource("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\courses"))

#Data cleansing
#Convert to lower case
course_corpus2 <- tm_map(course_corpus, content_transformer(tolower))

#Remove punctuations
course_corpus3 <- tm_map(course_corpus2, removePunctuation)

#Convert to a Document Term Matrix with bigrams (NGramTokenizer comes from the RWeka package)
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

course_bigrams <- DocumentTermMatrix(course_corpus3, 
                                 control = list(tokenize = BigramTokenizer))
inspect(course_bigrams)
## <<DocumentTermMatrix (documents: 2, terms: 166)>>
## Non-/sparse entries: 170/162
## Sparsity           : 49%
## Maximal term length: 25
## Weighting          : term frequency (tf)
## Sample             :
##                                      Terms
## Docs                                  a button a realworld apache spark
##   Architecture-Course-Description.txt        1           0            1
##   Spark-Course-Description.txt               0           1            2
##                                      Terms
## Docs                                  big data data pipelines
##   Architecture-Course-Description.txt        2              0
##   Spark-Course-Description.txt               3              2
##                                      Terms
## Docs                                  data technologies how to in this
##   Architecture-Course-Description.txt                 0      0       1
##   Spark-Course-Description.txt                        2      5       1
##                                      Terms
## Docs                                  this course to construct
##   Architecture-Course-Description.txt           1            0
##   Spark-Course-Description.txt                  1            2
#Compute frequency of bigrams
bigram_frequency <- sort(colSums(as.matrix(course_bigrams)), 
                             decreasing=TRUE)

#Convert frequency table to a data frame
bigram_df <- data.frame(bigrams=names(bigram_frequency), 
                            freq=bigram_frequency)
#print the data frame
bigram_df[1:10,]
##                             bigrams freq
## big data                   big data    5
## how to                       how to    5
## apache spark           apache spark    3
## data pipelines       data pipelines    2
## data technologies data technologies    2
## in this                     in this    2
## this course             this course    2
## to construct           to construct    2
## a button                   a button    1
## a realworld             a realworld    1

Building the n-grams database

#Split each bigram into the first and second words and store them back
#into the same data frame

for ( irow in 1:nrow(bigram_df)) {

    grams = unlist(strsplit(as.character(bigram_df$bigrams[irow])," "))

    bigram_df$first[irow]= grams[1]
    bigram_df$second[irow]= grams[2]
}

#Review the bigrams data frame
bigram_df[1:10,]
##                             bigrams freq  first       second
## big data                   big data    5    big         data
## how to                       how to    5    how           to
## apache spark           apache spark    3 apache        spark
## data pipelines       data pipelines    2   data    pipelines
## data technologies data technologies    2   data technologies
## in this                     in this    2     in         this
## this course             this course    2   this       course
## to construct           to construct    2     to    construct
## a button                   a button    1      a       button
## a realworld             a realworld    1      a    realworld
#Query for the second words and their frequencies where the first word is "data"; we can then suggest the most frequent second words if the user types "data"
bigram_df[bigram_df$first == "data", c("second", "freq")]
##                         second freq
## data pipelines       pipelines    2
## data technologies technologies    2
## data data                 data    1
## data engineers       engineers    1
## data luckily           luckily    1
## data technology     technology    1

Predicting text

###### Auto-complete for the word "ap" - this will show the possible first words that start with "ap" so they can be auto-completed.

#filter data frame for rows where column first starts with "ap"
autocomplete_filtered = bigram_df[
                            startsWith(
                              as.character(bigram_df$first), "ap"), 
                            c("first", "freq")]

#Aggregate across duplicate rows
autocomplete_summary =aggregate(freq ~ first, autocomplete_filtered, sum)

#Order in descending order of frequency
autocomplete_ordered = autocomplete_summary[
                          with(autocomplete_summary, order(-freq)), ]

#The predictive auto complete list.
autocomplete_ordered$first
## [1] "apache"       "applications" "app"
###### Find the next word for "apache" - the most likely words when someone has typed "apache", so we can offer these options or auto-complete.

#Filter data frame where first word is "apache"
nextword_filtered = bigram_df[
                           bigram_df$first == "apache", 
                          c("freq", "second")]

#Order in descending order of frequency
nextword_ordered = nextword_filtered[
                          with(nextword_filtered, order(-freq)), ]

#The predicted next words
nextword_ordered$second
## [1] "spark" "kafka"

Next steps

Now that you have taken this course, you can take your learning even further.

Learn in-depth about text pre-processing techniques, like stopword removal, lemmatization, n-grams, and tf-idf. To do this, you can refer to my other courses on LinkedIn Learning.

Explore text machine learning at scale with big data technologies, and build an end-to-end live project for text analytics in your organization. This will give you the hands-on experience that can help build your skill set.

Andres Lopez

March 22, 2020