Text Analytics and Predictions with R Essential Training
LinkedIn course
Introduction
Text processing and analytics is one of the fastest-growing areas in the field of machine learning. Why? Well, the truth is that more and more of the data being generated today is text. The Internet contains countless blogs, reviews, comments, notes, and other text-based content. Social media generates data every day in the form of messages, tweets, hashtags, and references. Computer software generates log messages and audit trails. There is much more, including emails and audio or video that gets transcribed into text. With so much free-text data out there, businesses can capitalize on it through text analytics and use the resulting insights to drive strategic business actions. But analyzing text poses various unique challenges. Text data is several times as large as numeric data, and it does not have a fixed structure or schema, which makes understanding it difficult. In this course, I will show you some tools and techniques offered in R that can help you with these particular issues and aid in generating insights.
This course is about text analytics and predictions using R. It focuses on analytics and machine learning techniques specific to text. The examples are in R, and we use RStudio, so it's good to have some familiarity with these tools. You will need to download the latest versions of R and RStudio to follow along; RStudio will not run without a compatible version of R installed. You can download R from the cran.r-project.org website and RStudio from the rstudio.com website. The examples in this course also pre-process text data before using it for analytics. Techniques used include stopword removal, stemming, n-grams, and tf-idf. If you are not familiar with these techniques, I recommend taking my other course on LinkedIn Learning called Text Processing with R. While this course uses machine learning techniques like clustering and classification for text mining, it does not delve deep into those concepts; rather, it focuses on applying them to text-specific data.
Word Cloud
Word Cloud concepts
Let's start off with something you may have used in the past: a word cloud. Basically, it highlights which words are used, and with what frequency, in a body of text or corpus, and then arranges those words into a shape. The size of each word in the word cloud is based on the number of occurrences of that word in the corpus: the more occurrences, the bigger the word.
A word cloud can also limit the number of words shown to just the top, most popular ones. It can be used to show the popularity of keywords visually. For instance, you can show the popularity of athletes in a sports league by displaying the names of the players in a word cloud. In this specific example, we will analyze the popular words used in technology course descriptions on LinkedIn. Let's get started with prepping the data so we can make that happen.
Preparing the data
#Load the tm package, which provides VCorpus, tm_map, and DocumentTermMatrix
library(tm)
#Load up the corpus
course_corpus <- VCorpus(DirSource("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\courses"))
#cleansing activities
#Convert to lower case
course_corpus2 <- tm_map(course_corpus, content_transformer(tolower))
#Remove punctuations
course_corpus3 <- tm_map(course_corpus2, removePunctuation)
#Remove stopwords
course_corpus4 <- tm_map(course_corpus3, removeWords, stopwords())
inspect(course_corpus4)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 458
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 519
#Generate the document term matrix (term frequency weighting)
course_dtm <- DocumentTermMatrix(course_corpus4)
#Inspect the document term matrix
inspect(course_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 82)>>
## Non-/sparse entries: 92/72
## Sparsity : 44%
## Maximal term length: 14
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs apache applications architecting big
## Architecture-Course-Description.txt 1 2 2 2
## Spark-Course-Description.txt 3 0 0 3
## Terms
## Docs business can data realtime spark
## Architecture-Course-Description.txt 1 3 2 4 1
## Spark-Course-Description.txt 1 0 6 1 3
## Terms
## Docs technologies
## Architecture-Course-Description.txt 1
## Spark-Course-Description.txt 3
#Generate a frequency data frame
word_frequency <- sort(colSums(as.matrix(course_dtm)),
decreasing=TRUE)
df_frequency<- data.frame(word = names(word_frequency),
freq=word_frequency)
head(df_frequency)
## word freq
## data data 8
## big big 5
## realtime realtime 5
## apache apache 4
## spark spark 4
## technologies technologies 4
Displaying the word cloud
#Load the wordcloud package, which provides the wordcloud() function
library(wordcloud)
#Simple wordcloud
wordcloud(df_frequency$word,df_frequency$freq) #by default, this function only shows words with a frequency of 3 or more
#Top 10 words
wordcloud(df_frequency$word,
df_frequency$freq,
max.words=10, min.freq = 1) #control the number of words shown as well as the minimum frequency for a word to appear in the plot
Enhance the word cloud
#Choose a specific font and order
wordcloud(df_frequency$word,
df_frequency$freq,
max.words=10, min.freq = 1, #control the number of words and the minimum frequency
random.order=FALSE, #place the most frequent words at the center of the cloud
family = "Helvetica", font = 3) #font used for the words
#Using a color palette
library(RColorBrewer) #provides brewer.pal()
word_pal <- brewer.pal(10,"Dark2") #create a color palette to use (Dark2 has at most 8 colors, so R returns 8 with a warning); explore other palettes as well
wordcloud(df_frequency$word,
df_frequency$freq,
max.words=20, min.freq = 1,
random.order=FALSE,
colors=word_pal, #apply the color palette created above
family= "Arial", font = 3)
Sentiment analysis
Sentiment analysis concepts
One of the most popular analyses done on text is identifying the sentiment expressed by its author. Sentiment, in this case, is the overall emotion expressed in the words people write. This might be simply positive or negative, but it can cover other emotions as well.
Organizations try to understand the sentiment of their customers and users based on their communications and social media posts. This is referred to as sentiment analysis.
Sentiment analysis is a text mining technique used to identify the intent or opinion in text data. Users communicate their opinions and needs through text in the form of reviews, emails, chats, et cetera. Sentiment analysis looks at a corpus of text, possibly multiple sentences, to understand the overall sentiment of the author.
There are multiple techniques used for this, the most popular being the bag-of-words approach. In this technique, we look for specific words in the text and conclude on the overall sentiment from them. This technique is simple and straightforward but may not handle complex sentiments effectively. A toy sketch of the idea follows.
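Here is a minimal, self-contained sketch of the bag-of-words idea in base R. The word lists are made-up examples, and this is not the sentimentr approach used later in these notes; it only illustrates the counting logic.
#Toy bag-of-words sentiment scorer: count hits against small, assumed word lists
positive_words <- c("good", "great", "love", "fantastic", "funny")
negative_words <- c("bad", "boring", "terrible", "horrible", "worst")
score_bag_of_words <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  sum(words %in% positive_words) - sum(words %in% negative_words)
}
score_bag_of_words("The acting was great but the plot was boring") #returns 0: one positive word and one negative word cancel out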
In sentiment analysis, we determine polarity. Polarity is a score from minus one to plus one that indicates whether the sentiment is positive, negative, or neutral: the closer it is to minus one, the more negative; the closer it is to plus one, the more positive; and the closer it is to zero, the more neutral. After determining polarity, we can also try to understand the emotions in the text, like happy, sad, and angry.
Let’s take a look at what we need to do to prepare data for sentiment analysis.
Finding sentiment
#Load the sentimentr package, which provides get_sentences(), sentiment(), sentiment_by(), and emotion_by()
library(sentimentr)
#Load the movie reviews file and convert it into sentences
movie_reviews <- readLines(file("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\reviews\\Movie-Reviews.txt"))
movie_reviews
## [1] "When your main character in a superhero movie is unwatchable, you already have a problem. In addition, Captain Marvel has no weaknesses, which kills the tension immediately."
## [2] "Her performance was forced, uninspiring and flat! Not looking forward to the next movie with her in it...."
## [3] "I couldn't believe how boring this movie was. The acting is horrible, the action is terrible, and Captain Marvel herself is super cheesy. This is the worst Marvel movie for me, alongside Ant man and the Wasp"
## [4] "Nothing beats a good marvel movie, and this is definitely a good marvel movie"
## [5] "This movie did for Marvel what Wonder Woman did for DC. Captain Marvel is a great role model for young children. Great to see Colson and Fury as well. Loved Goose."
## [6] "Captain Marvel just became my favorite superhero of all time. This movie was funnier than I expected and all-around great. Go see it!"
## [7] "This is a very controversial Marvel film. Which seems to be a running trend with Disney films recently."
## [8] "Unfortunately, despite carrying many elements of previous Marvel installments, it fails to embody their success due to its questionable ambition. "
## [9] "It's worth watching just for the fact that this character will appear in The Avengers infinity war part 2. "
## [10] "It's great for a first time viewing. Would you watch it again? Nah. A good motivational for little kids with comical characters. First time viewing is good but I won't look back at it."
review_text <- get_sentences(movie_reviews) #split each review into its individual sentences
review_text
## [[1]]
## [1] "When your main character in a superhero movie is unwatchable, you already have a problem."
## [2] "In addition, Captain Marvel has no weaknesses, which kills the tension immediately."
##
## [[2]]
## [1] "Her performance was forced, uninspiring and flat!"
## [2] "Not looking forward to the next movie with her in it...."
##
## [[3]]
## [1] "I couldn't believe how boring this movie was."
## [2] "The acting is horrible, the action is terrible, and Captain Marvel herself is super cheesy."
## [3] "This is the worst Marvel movie for me, alongside Ant man and the Wasp"
##
## [[4]]
## [1] "Nothing beats a good marvel movie, and this is definitely a good marvel movie"
##
## [[5]]
## [1] "This movie did for Marvel what Wonder Woman did for DC."
## [2] "Captain Marvel is a great role model for young children."
## [3] "Great to see Colson and Fury as well."
## [4] "Loved Goose."
##
## [[6]]
## [1] "Captain Marvel just became my favorite superhero of all time."
## [2] "This movie was funnier than I expected and all-around great."
## [3] "Go see it!"
##
## [[7]]
## [1] "This is a very controversial Marvel film."
## [2] "Which seems to be a running trend with Disney films recently."
##
## [[8]]
## [1] "Unfortunately, despite carrying many elements of previous Marvel installments, it fails to embody their success due to its questionable ambition."
##
## [[9]]
## [1] "It's worth watching just for the fact that this character will appear in The Avengers infinity war part 2."
##
## [[10]]
## [1] "It's great for a first time viewing."
## [2] "Would you watch it again?"
## [3] "Nah."
## [4] "A good motivational for little kids with comical characters."
## [5] "First time viewing is good but I won't look back at it."
##
## attr(,"class")
## [1] "get_sentences" "get_sentences_character"
## [3] "list"
#See sentiments for each line
sentiment(review_text)
## element_id sentence_id word_count sentiment
## 1: 1 1 15 -0.3485685012
## 2: 1 2 12 -0.2886751346
## 3: 2 1 7 -0.4157609203
## 4: 2 2 11 -0.3015113446
## 5: 3 1 8 0.3535533906
## 6: 3 2 15 0.1290994449
## 7: 3 3 14 0.0668153105
## 8: 4 1 14 1.1224972160
## 9: 5 1 11 0.5276448530
## 10: 5 2 10 0.6482669203
## 11: 5 3 8 0.2828427125
## 12: 5 4 2 0.3535533906
## 13: 6 1 10 0.4743416490
## 14: 6 2 11 0.6030226892
## 15: 6 3 3 0.0000000000
## 16: 7 1 7 0.0000000000
## 17: 7 2 11 0.1206045378
## 18: 8 1 20 -0.0335410197
## 19: 9 1 18 0.2003469213
## 20: 10 1 7 0.1889822365
## 21: 10 2 5 0.0000000000
## 22: 10 3 1 0.0000000000
## 23: 10 4 9 0.2833333333
## 24: 10 5 12 0.0002165064
## element_id sentence_id word_count sentiment
#Sentiment by each review
sentiments <- sentiment_by(review_text)
sentiments
## element_id word_count sd ave_sentiment
## 1: 1 27 0.04235101 -0.31862182
## 2: 2 18 0.08078665 -0.35863613
## 3: 3 37 0.15081866 0.18315605
## 4: 4 14 NA 1.12249722
## 5: 5 31 0.16587559 0.45307697
## 6: 6 24 0.31759386 0.38035077
## 7: 7 18 0.08528029 0.06581225
## 8: 8 20 NA -0.03354102
## 9: 9 18 NA 0.20034692
## 10: 10 34 0.13354287 0.11672799
Summarizing sentiments
library(data.table) #provides setDF() for converting a data.table to a data frame
#Convert sentiment data.table to a data frame
sentiment_df <- setDF(sentiments)
#Function that generates a sentiment class based on sentiment score
get_sentiment_class <- function(sentiment_score) {
sentiment_class = "Positive"
if ( sentiment_score < -0.3) {
sentiment_class = "Negative"
}
else if (sentiment_score < 0.3) {
sentiment_class = "Neutral"
}
sentiment_class
}
#add a sentiment_class attribute
sentiment_df$sentiment_class <-
sapply(sentiment_df$ave_sentiment,get_sentiment_class)
#Print resulting sentiment
sentiment_df[,4:5]
## ave_sentiment sentiment_class
## 1 -0.31862182 Negative
## 2 -0.35863613 Negative
## 3 0.18315605 Neutral
## 4 1.12249722 Positive
## 5 0.45307697 Positive
## 6 0.38035077 Positive
## 7 0.06581225 Neutral
## 8 -0.03354102 Neutral
## 9 0.20034692 Neutral
## 10 0.11672799 Neutral
#Draw a pie chart of the sentiment class counts
library(dplyr) #provides count()
sentiment_summary <- count(sentiment_df, sentiment_class)
pie(sentiment_summary$n,
sentiment_summary$sentiment_class,
col=c("Red","Blue","Green"))
Analyzing emotions
#Create a dataframe for emotions by review
#emotion_by() gives the emotion each word represents and a count of how often words carrying that emotion appear in the analyzed text
emotion_df <- setDF(emotion_by(review_text))
head(emotion_df)
## element_id emotion_type word_count emotion_count sd ave_emotion
## 1 1 anger 27 1 0.05892557 0.03703704
## 2 1 anticipation 27 1 0.05892557 0.03703704
## 3 1 disgust 27 0 0.00000000 0.00000000
## 4 1 fear 27 1 0.04714045 0.03703704
## 5 1 joy 27 0 0.00000000 0.00000000
## 6 1 sadness 27 1 0.04714045 0.03703704
#aggregate by emotion types and remove 0 values
emotion_summary=subset(
aggregate(emotion_count ~ emotion_type ,
emotion_df, sum),
emotion_count > 0 )
emotion_summary
## emotion_type emotion_count
## 1 anger 5
## 2 anticipation 14
## 3 disgust 3
## 4 fear 7
## 5 joy 9
## 6 sadness 3
## 7 surprise 14
## 8 surprise_negated 1
## 9 trust 8
#Draw a pie chart for emotion summary
pie(emotion_summary$emotion_count, emotion_summary$emotion_type,
col= c("Red","Green","Blue","Orange","Brown","Purple") )
Clustering
Clustering concepts
There may be times when you run into a really large dataset with different attributes and you need to find similarities. For instance, you may want to find similar customers based on their demographics. In this situation, you can use something called clustering, which is a machine learning technique that helps group similar elements based on their attributes.
Clustering is a great candidate for unsupervised learning. In unsupervised learning, there is no training dataset with prior classification. Instead, we group elements based on the similarity of their attributes. There are a number of clustering techniques available, like k-means clustering and k-nearest neighbors.
You might be asking what this has to do with text mining. Well, when working with written text, the words in a document become features, and documents with similar words get grouped together. Clustering algorithms use only numeric data, so text data needs to be converted to a numeric representation. Term frequency-inverse document frequency, or tf-idf, is the most popular technique used for this purpose. It converts a corpus of documents into a numeric matrix, with documents representing rows and words representing columns. A minimal sketch of tf-idf weighting with the tm package follows this paragraph. Clustering for text can be used to group documents like reviews, news articles, and tweets based on the words used in them.
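As a quick sketch of what that conversion looks like with the tm package: the three toy documents below are made up for illustration, and the clustering example that follows actually uses plain term-frequency counts rather than tf-idf.
#Build a tiny corpus and compare term-frequency weighting with tf-idf weighting
library(tm)
toy_docs <- VCorpus(VectorSource(c("big data spark",
                                   "java programming",
                                   "python data science")))
tf_dtm <- DocumentTermMatrix(toy_docs) #raw term counts
tfidf_dtm <- DocumentTermMatrix(toy_docs, control = list(weighting = weightTfIdf)) #tf-idf weights
inspect(tfidf_dtm) #documents are rows, words are columns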
Preparing data for clustering
#Read the course hashtags into a data frame
movie_hashtags <- read.csv("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\hashtags\\Course-Hashtags.csv")
movie_hashtags
## Course
## 1 Apache Spark Essential Training
## 2 Java Memory Management
## 3 Python Automation and Testing
## 4 Python for Graphics
## 5 Machine Learning and AI Foundations
## 6 Java : Database Integration and JDBC
## 7 R Programming
## 8 Python Design Patterns
## 9 Hadoop for Data Science
## 10 Java IDE Overview
## 11 Data Science on Google Cloud Platform
## 12 Scala for Data Science
## 13 Kubernetes for Java Developers
## 14 Python Scripting
## HashTags
## 1 BigData,DataScience,MachineLearning
## 2 Java,Advanced,Programming
## 3 Python,Automation,Scripting
## 4 Python,Graphics,Scripting
## 5 DataScience,MachineLearning,Intermediate
## 6 Java,JDBC,Programming
## 7 R,Programming,MachineLearning
## 8 Python,Design,Patterns
## 9 Hadoop,DataScience,BigData
## 10 Java,Programming,IDE
## 11 DataScience,GCP,Intermediate
## 12 Scala,DataScience,BigData
## 13 Java,Kubernetes,Programming
## 14 Python,Scripting,Developer
#Load hashtags into a corpus
hashtags <- VCorpus(VectorSource(movie_hashtags$HashTags))
#replace comma with spaces
clean_hashtags <- tm_map(hashtags,
content_transformer(
function(x) gsub(","," ",x)
)
)
inspect(clean_hashtags[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 35
##
## BigData DataScience MachineLearning
#Generate the Document Term matrix
hashtags_dtm <- DocumentTermMatrix(clean_hashtags)
hashtags_dtm
## <<DocumentTermMatrix (documents: 14, terms: 20)>>
## Non-/sparse entries: 41/239
## Sparsity : 85%
## Maximal term length: 15
## Weighting : term frequency (tf)
#Inspect the Document Term matrix
inspect(hashtags_dtm)
## <<DocumentTermMatrix (documents: 14, terms: 20)>>
## Non-/sparse entries: 41/239
## Sparsity : 85%
## Maximal term length: 15
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs advanced automation bigdata datascience intermediate java
## 1 0 0 1 1 0 0
## 10 0 0 0 0 0 1
## 11 0 0 0 1 1 0
## 2 1 0 0 0 0 1
## 3 0 1 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 1 1 0
## 6 0 0 0 0 0 1
## 8 0 0 0 0 0 0
## 9 0 0 1 1 0 0
## Terms
## Docs machinelearning programming python scripting
## 1 1 0 0 0
## 10 0 1 0 0
## 11 0 0 0 0
## 2 0 1 0 0
## 3 0 0 1 1
## 4 0 0 1 1
## 5 1 0 0 0
## 6 0 1 0 0
## 8 0 0 1 0
## 9 0 0 0 0
Finding optimal cluster size
#Function to find the optimum no. of clusters
optimal_cluster_plot <- function(data, iterations=10, seed=1000){
#Set within-sum-of-squares for a single cluster
wss <- (nrow(data)-1)*sum(apply(data,2,var))
#Iterate up to 10 clusters and measure wss.
for (i in 2:iterations){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)
}
#Plot wss for each value of k and find the elbow
plot(1:iterations, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", col="red")
}
#Execute the function
optimal_cluster_plot(hashtags_dtm) #the optimum is where the elbow appears; in this case k = 3
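The notes stop at the elbow plot. As a minimal follow-up sketch (not part of the course code), k-means could be run with the suggested k = 3 and the cluster membership of each course listed:
#Run k-means with k = 3 (the elbow from the plot above) and view cluster membership
set.seed(1000)
hashtag_clusters <- kmeans(as.matrix(hashtags_dtm), centers = 3)
data.frame(Course = movie_hashtags$Course,
           Cluster = hashtag_clusters$cluster)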
Classification
Classification concepts
Not to be confused with clustering, classification is another use case for text mining. Classification is a machine learning technique for supervised learning; recall that clustering is used for unsupervised learning. Its goal is to build a model from entities with known classes so that the model can identify the class of a specific new entity.
Classification algorithms build models based on a target variable in the dataset, using the other feature variables available in the dataset. The model is then used to predict the class of new data: it predicts the target variable based on the other feature variables available in the new data. We split the source data into training data and test data; training data is used to build the model, and test data is used to test its accuracy.
How can we use classification for text mining? In text mining, the words in a document become the feature variables. For the purposes of training models, each document needs to be tagged with a specific class, which is then used as the target variable to build the model. Most classification algorithms require feature and target variables to be numeric, so text documents need to be converted to document-term (or tf-idf) matrices before they can be used for classification.
Prepare the data
#Load up the corpus
course_raw = scan("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\classification\\Course-Descriptions.txt",
what="", sep="\n")
course_corpus <- VCorpus(VectorSource(course_raw))
inspect(course_corpus[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 130
##
## In this practical, hands-on course, learn how to do data preparation, data munging, data visualization, and predictive analytics.
#Data cleansing
#Convert to lower case
course_corpus2 <- tm_map(course_corpus, content_transformer(tolower))
#Remove punctuations
course_corpus3 <- tm_map(course_corpus2, removePunctuation)
#Remove stopwords
course_corpus4 <- tm_map(course_corpus3, removeWords, stopwords())
inspect(course_corpus4[[1]])
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 107
##
## practical handson course learn data preparation data munging data visualization predictive analytics
#Generate the document term matrix (term frequency weighting)
course_dtm <- DocumentTermMatrix(course_corpus4)
course_dtm
## <<DocumentTermMatrix (documents: 20, terms: 245)>>
## Non-/sparse entries: 328/4572
## Sparsity : 93%
## Maximal term length: 19
## Weighting : term frequency (tf)
findFreqTerms(course_dtm,5) #show only those terms that appear with a frequency of 5 or more
## [1] "can" "cloud" "computing" "course" "data" "many"
## [7] "python" "using"
#Remove sparse terms: with sparse = 0.8, terms whose sparsity exceeds 80% are dropped,
#so only terms appearing in roughly 20% or more of the documents are kept
dense_course_dtm <- removeSparseTerms(course_dtm, .8)
#Inspect the reduced document term matrix
inspect(dense_course_dtm)
## <<DocumentTermMatrix (documents: 20, terms: 9)>>
## Non-/sparse entries: 45/135
## Sparsity : 75%
## Maximal term length: 7
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can cloud code course data many python science want
## 1 0 0 0 1 3 0 0 0 0
## 10 0 0 0 1 1 1 0 1 0
## 11 1 0 0 0 1 0 1 1 0
## 14 1 2 0 1 0 0 0 0 1
## 16 1 0 1 0 0 1 0 0 0
## 4 1 3 0 1 0 0 0 0 0
## 5 0 0 0 0 3 1 0 1 1
## 7 0 0 0 0 5 0 0 1 0
## 8 0 0 0 0 2 0 2 0 0
## 9 0 2 0 0 0 1 0 0 0
#Convert counts to categorical classes {Yes, No} = {1, 0}, since classification algorithms work better with categorical variables
conv_counts <- function(x) {
x <- ifelse(x > 0, 1, 0)
x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}
class_dtm <- apply(dense_course_dtm, MARGIN = 2, conv_counts)
class_dtm
## Terms
## Docs can cloud code course data many python science want
## 1 "No" "No" "No" "Yes" "Yes" "No" "No" "No" "No"
## 2 "No" "No" "No" "No" "No" "No" "No" "No" "No"
## 3 "No" "No" "No" "No" "No" "Yes" "Yes" "No" "No"
## 4 "Yes" "Yes" "No" "Yes" "No" "No" "No" "No" "No"
## 5 "No" "No" "No" "No" "Yes" "Yes" "No" "Yes" "Yes"
## 6 "Yes" "No" "No" "No" "No" "No" "No" "No" "No"
## 7 "No" "No" "No" "No" "Yes" "No" "No" "Yes" "No"
## 8 "No" "No" "No" "No" "Yes" "No" "Yes" "No" "No"
## 9 "No" "Yes" "No" "No" "No" "Yes" "No" "No" "No"
## 10 "No" "No" "No" "Yes" "Yes" "Yes" "No" "Yes" "No"
## 11 "Yes" "No" "No" "No" "Yes" "No" "Yes" "Yes" "No"
## 12 "No" "No" "No" "No" "No" "No" "No" "No" "No"
## 13 "No" "No" "Yes" "No" "No" "No" "No" "No" "Yes"
## 14 "Yes" "Yes" "No" "Yes" "No" "No" "No" "No" "Yes"
## 15 "No" "No" "Yes" "No" "No" "No" "Yes" "No" "No"
## 16 "Yes" "No" "Yes" "No" "No" "Yes" "No" "No" "No"
## 17 "No" "Yes" "No" "No" "No" "No" "No" "No" "No"
## 18 "Yes" "Yes" "No" "Yes" "No" "No" "No" "No" "No"
## 19 "No" "No" "No" "No" "No" "No" "No" "No" "Yes"
## 20 "No" "No" "Yes" "Yes" "No" "No" "Yes" "No" "No"
Building a model
#Load the caret package, which provides createDataPartition(), train(), and confusionMatrix()
library(caret)
#Load the classifications for the descriptions
course_classes = scan("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\classification\\Course-Classification.txt", what="", sep="\n")
#Random split of training and testing sets
train_set <- createDataPartition(y=course_classes, p=.7,list=FALSE)
#splitting the dtm
train_dtm <- class_dtm[train_set,]
test_dtm <-class_dtm[-train_set,]
#split the course_classes
train_classes <- course_classes[train_set]
test_classes <- course_classes[-train_set]
#train the model using Naive Bayes (caret's "nb" method requires the klaR package)
course_model <- train( data.frame(train_dtm), train_classes, method="nb")
course_model
## Naive Bayes
##
## 16 samples
## 9 predictor
## 3 classes: 'Cloud-Computing', 'Data-Science', 'Programming'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 16, 16, 16, 16, 16, 16, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.6210317 0.4246917
## TRUE 0.6210317 0.4246917
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
## and adjust = 1.
Running predictions
#Predict for the test data
course_predictions <- predict(course_model,test_dtm)
#Analyze prediction accuracy
confusionMatrix(table(course_predictions , test_classes))
## Confusion Matrix and Statistics
##
## test_classes
## course_predictions Cloud-Computing Data-Science Programming
## Cloud-Computing 1 0 0
## Data-Science 0 1 0
## Programming 0 0 2
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.3976, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.0625
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Cloud-Computing Class: Data-Science
## Sensitivity 1.00 1.00
## Specificity 1.00 1.00
## Pos Pred Value 1.00 1.00
## Neg Pred Value 1.00 1.00
## Prevalence 0.25 0.25
## Detection Rate 0.25 0.25
## Detection Prevalence 0.25 0.25
## Balanced Accuracy 1.00 1.00
## Class: Programming
## Sensitivity 1.0
## Specificity 1.0
## Pos Pred Value 1.0
## Neg Pred Value 1.0
## Prevalence 0.5
## Detection Rate 0.5
## Detection Prevalence 0.5
## Balanced Accuracy 1.0
Predictive Text
Predictive Text concepts
Predictive text is a popular application for text mining and analytics. When you compose a text on your smartphone or type a search term in Google, you see recommendations for the current or the next word. That is predictive text at work. The machine is trying to figure out what you will say next. When it works correctly, it saves you time and effort.
So how exactly does predictive text work? Through something called n-grams. N-grams are basically sets of co-occurring words within a given window, and they are used to identify word-sequence patterns. We start with a corpus of sentences collected from usage specific to the context. We then use n-gram techniques to build a database of previous words and possible next words. This n-gram database is then queried to predict the next possible word. In order to build an accurate database, it is recommended to build a custom corpus based on the context, which could be a specific user or an application.
Preparing the data
#Load text files into the VCorpus
course_corpus <- VCorpus(DirSource("C:\\Users\\LOPEZANW\\OneDrive - Novartis Pharma AG\\01 daily work\\Rprojects\\text_mining\\text analytics exercise files\\courses"))
#Data cleansing
#Convert to lower case
course_corpus2 <- tm_map(course_corpus, content_transformer(tolower))
#Remove punctuations
course_corpus3 <- tm_map(course_corpus2, removePunctuation)
#Convert to a Document Term Matrix with bigrams
library(RWeka) #provides NGramTokenizer() and Weka_control()
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
course_bigrams <- DocumentTermMatrix(course_corpus3,
control = list(tokenize = BigramTokenizer))
inspect(course_bigrams)
## <<DocumentTermMatrix (documents: 2, terms: 166)>>
## Non-/sparse entries: 170/162
## Sparsity : 49%
## Maximal term length: 25
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs a button a realworld apache spark
## Architecture-Course-Description.txt 1 0 1
## Spark-Course-Description.txt 0 1 2
## Terms
## Docs big data data pipelines
## Architecture-Course-Description.txt 2 0
## Spark-Course-Description.txt 3 2
## Terms
## Docs data technologies how to in this
## Architecture-Course-Description.txt 0 0 1
## Spark-Course-Description.txt 2 5 1
## Terms
## Docs this course to construct
## Architecture-Course-Description.txt 1 0
## Spark-Course-Description.txt 1 2
#Compute frequency of bigrams
bigram_frequency <- sort(colSums(as.matrix(course_bigrams)),
decreasing=TRUE)
#Convert frequency table to a data frame
bigram_df <- data.frame(bigrams=names(bigram_frequency),
freq=bigram_frequency)
#print the data frame
bigram_df[1:10,]
## bigrams freq
## big data big data 5
## how to how to 5
## apache spark apache spark 3
## data pipelines data pipelines 2
## data technologies data technologies 2
## in this in this 2
## this course this course 2
## to construct to construct 2
## a button a button 1
## a realworld a realworld 1
Building the n-grams database
#Split each bigram into the first and second words and store them back
#into the same data frame
for ( irow in 1:nrow(bigram_df)) {
grams = unlist(strsplit(as.character(bigram_df$bigrams[irow])," "))
bigram_df$first[irow]= grams[1]
bigram_df$second[irow]= grams[2]
}
#Review the bigrams data frame
bigram_df[1:10,]
## bigrams freq first second
## big data big data 5 big data
## how to how to 5 how to
## apache spark apache spark 3 apache spark
## data pipelines data pipelines 2 data pipelines
## data technologies data technologies 2 data technologies
## in this in this 2 in this
## this course this course 2 this course
## to construct to construct 2 to construct
## a button a button 1 a button
## a realworld a realworld 1 a realworld
#Query for the second words and their frequency where the first word is "data"; we can suggest the most frequent second words if the user types "data"
bigram_df[bigram_df$first == "data", c("second", "freq")]
## second freq
## data pipelines pipelines 2
## data technologies technologies 2
## data data data 1
## data engineers engineers 1
## data luckily luckily 1
## data technology technology 1
Predicting text
###### Auto-complete for the word "ap" - show the possible first words that start with "ap" so they can be auto-completed
#filter data frame for rows where column first starts with "ap"
autocomplete_filtered = bigram_df[
startsWith(
as.character(bigram_df$first), "ap"),
c("first", "freq")]
#Aggregate across duplicate rows
autocomplete_summary =aggregate(freq ~ first, autocomplete_filtered, sum)
#Order in descending order of frequency
autocomplete_ordered = autocomplete_summary[
with(autocomplete_summary, order(-freq)), ]
#The predictive auto complete list.
autocomplete_ordered$first
## [1] "apache" "applications" "app"
###### Find the next word for "apache" - the most likely words once someone has typed "apache", so these options can be suggested or auto-completed
#Filter data frame where first word is "apache"
nextword_filtered = bigram_df[
bigram_df$first == "apache",
c("freq", "second")]
#Order in descending order of frequency
nextword_ordered = nextword_filtered[
with(nextword_filtered, order(-freq)), ]
#The predicted next words
nextword_ordered$second
## [1] "spark" "kafka"
Next steps
Now that you have taken this course, you can take your learning even further.
Learn in-depth about text pre-processing techniques, like stopword removal, lemmatization, n-grams, and tf-idf. To do this, you can refer to my other courses on LinkedIn Learning.
Explore text machine learning at scale with big data technologies, and build an end-to-end live project for text analytics in your organization. This will give you the hands-on experience that can help build your skill set.