Introduction
This is the milestone report assignment for week 2 of the Coursera Data Science Capstone Project.
The purpose of this report is to demonstrate text mining with the bag-of-words method: loading and cleaning the dataset, performing some exploratory analysis, and describing a strategy for building a predictive model, based on natural language processing, to be deployed in a Shiny application.
The dataset contains corpora in four different languages (German, English, Finnish and Russian), each with three data sources: blogs, news and Twitter.
We will use only the English corpus, combining the three text sources into a single training corpus.
Purpose of Text Mining
Here we find the first challenge: choosing the best package for exploratory analysis, text filtering, and conversion of the data sample into a matrix.
This programming aspect is at the heart of the project, as there are differences in the criteria used in each package to perform clustering and tokenization.
A clear example of this problem is the history of search engines since 1995. Engines such as Excite, SAPO, AltaVista and Yahoo, among others, competed to become users' preferred search engine. Predicting what you are thinking or looking for from the two or three words you type is the work of artificial intelligence applied to natural language. Tech giants such as Google and Netflix have some of the best predictive algorithms for anticipating what people are thinking or wanting, and can therefore offer targeted products and services.
There is no perfect predictive algorithm based on natural language, but there is an aspect relevant to our research in Franz Brentano's work as applied to Artificial Intelligence: the psychic phenomena of representation, judgment, approval (love) and disapproval (hatred) provoke engagement on social media. His posthumous work on sentient and noetic consciousness, published in 1928, which deals with the spiritual dimension of consciousness, is considered the main dilemma of human and artificial intelligence: what is the difference between a virtual mind and a real mind? What is the difference between thinking and imitating thought? AI predictive algorithms draw on Brentano's philosophical work.
Thus, a predictive algorithm can learn from human behavior and imitate it, generating a cycle between being influenced by people’s behavior and, at the same time, influencing how people will act.
Choice of R Package and Strategies
We tested the model with three different text mining packages: qdap, RWeka and quanteda. Each of these packages has positive and negative aspects that can influence the final choice. In any case, given the obvious limitations of a home computer, the results would be very similar with any of them.
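As a quick illustration of how tokenization criteria differ between packages, here is a minimal sketch comparing bigram output from RWeka and quanteda (the example sentence is made up, and RWeka requires a working Java installation):

```r
library(RWeka)
library(quanteda)

sentence <- "I can't wait to see you tomorrow"

# RWeka/Weka n-gram tokenizer: returns bigrams as space-separated strings
NGramTokenizer(sentence, Weka_control(min = 2, max = 2))

# quanteda: tokenize first, then form bigrams (joined with "_" by default)
tokens_ngrams(tokens(sentence), n = 2)

# note that each tokenizer handles punctuation and contractions according to its own rules,
# so the resulting n-grams are not necessarily identical
```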
A major issue is having enough RAM to build a dense matrix from the TDM (TermDocumentMatrix) or DTM (DocumentTermMatrix) objects. There are two main strategies, sketched below: 1) reduce the sample size without compromising the reliability of the training set, so that it still reflects the corpus; 2) remove sparse terms when constructing the matrix.
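A minimal sketch of both strategies, assuming a character vector `blogs` of blog lines and a tm DocumentTermMatrix `sampdtm` like the ones built in the appendices:

```r
library(tm)

# strategy 1: keep only a small random fraction of the lines (1% here)
set.seed(4518)
blogs_sample <- blogs[sample(length(blogs), round(length(blogs) * 0.01))]

# strategy 2: drop very sparse terms before converting the DTM to a dense matrix
small_dtm <- removeSparseTerms(sampdtm, 0.99)  # keeps terms appearing in at least ~1% of documents
unigram_matrix <- as.matrix(small_dtm)
```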
Data Summary
Below we have a data summary of the three text corpora to evaluate the size, number of lines and number of words for each source text.
| Source | Lines | Words | Size in Mb |
|---|---|---|---|
| Blogs | 899288 | 37570839 | 200.4242 |
| News | 1010242 | 34494539 | 196.2775 |
| Twitter | 2360148 | 30451170 | 159.3641 |
| Total | 4269678 | 102516548 | 556.0658 |
The dataset contains 4,269,678 lines and 102,516,548 words. The code for the data summary is shown in Appendix 1.
Tokenization and stemming the training data
We initially tried to work with a 5% and then a 3% sample, but with these sample sizes we could not generate a matrix from the DTM due to the limitations of a home PC.
There are a couple of configuration tricks that reduce memory pressure, such as removing sparse terms and raising the limit with memory.limit() (on Windows), but they cost a lot of processing time and increase the risk of a crash, which is frustrating.
After a few days and a lot of testing, the final choice was to use a 1% sample and a combination of functions from several packages.
We will convert the files into a single corpus, removing punctuation, numbers, symbols, separators, and stopwords. Next, we will create DTM and TDM objects in order to get a frequency matrix.
The code for building the matrix and for tokenization is shown in Appendix 2.
Frequency Plots
Now we can build the frequency plots, listing the most frequently used words in the corpus for unigrams, bigrams and trigrams.
We will plot the top 10 and the top 20 unigrams in the sample dataset by two different methods to compare the results. Then we will plot the top 20 bigrams and the top 20 trigrams.
The code for the frequency barplots is shown in Appendix 3.
Wordclouds
Wordclouds can help decision makers visualize the most relevant data. We will show wordclouds for unigrams, bigrams and trigrams.
The code for the wordclouds is shown in Appendix 4.
Next Steps and Conclusions
To build the predictive algorithm we will need to take into account the processing and memory limitations of a home computer. Even so, the training dataset shows that it is possible to predict the next word to be typed using an n-gram model with a word-frequency lookup similar to the one performed in the exploratory analysis. On the other hand, the strategy of eliminating stopwords can decrease the accuracy of the model, since next-word prediction is based on natural language as people actually write it. Good predictive accuracy will depend on the model's ability to learn from people's behavior and thus increase its effectiveness.
For the purpose of this project, the model will be delivered as a demo Shiny application in R.
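As an illustration of the n-gram frequency lookup (a sketch of the idea, not the final model), the following hypothetical helper uses the `trigramfreq` data frame from Appendix 2 (space-separated trigrams in `word`, counts in `freq`) to suggest the next word from the last two words typed:

```r
# hypothetical helper: suggest the most frequent continuations of the last two typed words
predict_next <- function(w1, w2, trigrams = trigramfreq, n = 3) {
  prefix <- paste(w1, w2, "")   # e.g. "happy new "
  hits <- trigrams[startsWith(as.character(trigrams$word), prefix), ]
  hits <- hits[order(-hits$freq), ]
  # return the third word of the top-n matching trigrams
  sapply(strsplit(as.character(head(hits$word, n)), " "), tail, 1)
}

# usage example; results depend on the 1% sample and on the cleaning steps
# (stemming and stopword removal), which is why stopword removal may hurt accuracy
predict_next("happy", "new")
```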
Appendixes
Appendix 1
# set working directory
setwd("C:/Users/Marcos/Desktop/Projetos_R/Coursera-SwiftKey/final/en_US")
# setting files
blogstxt <- "en_US.blogs.txt"
newstxt <- "en_US.news.txt"
twittertxt <- "en_US.twitter.txt"
# reading files
columns <- file("en_US.blogs.txt", "r")
blogs <- readLines(columns, encoding = "UTF-8", skipNul = TRUE)
close(columns)
# used "rb" parameter here due there is an incomplete final line
columns <- file("en_US.news.txt", "rb")
news <- readLines(columns, encoding = "UTF-8", skipNul = TRUE)
close(columns)
columns <- file("en_US.twitter.txt", "r")
twitter <- readLines(columns, encoding = "UTF-8", skipNul = TRUE)
close(columns)
# making a table
library(stringi)
library(kableExtra)
Blogs <- c(stri_stats_general(blogs)[1], stri_stats_latex(blogs)[4], file.info(blogstxt)$size/(2^20))
News <- c(stri_stats_general(news)[1], stri_stats_latex(news)[4], file.info(newstxt)$size/(2^20))
Twitter<- c(stri_stats_general(twitter)[1], stri_stats_latex(twitter)[4], file.info(twittertxt)$size/(2^20))
Total <- Blogs+News+Twitter
table <- as.data.frame(rbind(Blogs, News, Twitter, Total))
# cleaning RAM
rm(Blogs, News, Twitter, Total)
colnames(table)[3] <- "Size in Mb"
table %>%
kbl(caption = "Basic Corpora Analysis") %>%
kable_classic(full_width = FALSE, html_font = "Cambria")
Appendix 2
# sampling and tokenizing
set.seed(4518)
size <- 0.01
bsample <- sample(length(blogs),length(blogs)*size)
blsample <- blogs[bsample]
nsample <- sample(length(news),length(news)*size)
nwsample <- news[nsample]
tsample <- sample(length(twitter),length(twitter)*size)
twsample <- twitter[tsample]
# saving samples (the "sample" subdirectory must exist; create it if necessary)
if (!dir.exists("sample")) dir.create("sample")
writeLines(blsample,"sample/blogs.txt")
writeLines(nwsample,"sample/news.txt")
writeLines(twsample,"sample/twitter.txt")
# cleaning some RAM data
rm(blogs, news, twitter)
# transforming the data and making a corpus
library(qdap)
library(readtext)
library(tm)
library(filehash)
# making a corpus
samples05 <- c(blsample, nwsample, twsample)
samp05 <- VectorSource(samples05)
sampcorpus <- VCorpus(samp05) ## this is a volatile corpus (VCorpus); we could use a permanent corpus instead:
## sampcorpus <- PCorpus(samp05, dbControl = list(dbName = "pcorpus.db", dbType = "DB1"))
## however, the VCorpus itself is not a big deal for RAM; the real bottleneck comes later, when building dense matrices
# cleaning the corpus
sampcorpus <- tm_map(sampcorpus, content_transformer(tolower))
sampcorpus <- tm_map(sampcorpus, content_transformer(removePunctuation))
sampcorpus <- tm_map(sampcorpus, stripWhitespace)
sampcorpus <- tm_map(sampcorpus, removeWords, stopwords("english"))
sampcorpus <- tm_map(sampcorpus, removeNumbers)
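# custom transformations to strip URLs and Twitter @user handles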
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
remove.users <-function(x) gsub("@[[:alnum:][:punct:]]*","",x)
sampcorpus <- tm_map(sampcorpus, content_transformer(removeURL))
sampcorpus <- tm_map(sampcorpus, content_transformer(remove.users))
sampcorpus <- tm_map(sampcorpus, PlainTextDocument)
sampcorpus <- tm_map(sampcorpus, stemDocument)
# creating a DTM
sampdtm <- DocumentTermMatrix(sampcorpus) ## it could be a TDM instead; the only difference is whether terms are in rows or columns
# creating a matrix from DTM - method 1
samp_unigram <- as.matrix(removeSparseTerms(sampdtm, 0.99)) # Maybe you prefer method 2 instead.
# tokenization with RWeka package
library(RWeka)
library(rJava)
# functions for tokenization
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# method 2
# creating TDMs
unigramtdm <- TermDocumentMatrix(sampcorpus, control = list(tokenize = UnigramTokenizer))
bigramtdm <- TermDocumentMatrix(sampcorpus, control = list(tokenize = BigramTokenizer))
trigramtdm <- TermDocumentMatrix(sampcorpus, control = list(tokenize = TrigramTokenizer))
# creating DTMs
## building both TDMs and DTMs is not necessary; choose one and adjust the plots to use row or column frequencies
unigramdtm <- DocumentTermMatrix(sampcorpus, control = list(tokenize = UnigramTokenizer))
bigramdtm <- DocumentTermMatrix(sampcorpus, control = list(tokenize = BigramTokenizer))
trigramdtm <- DocumentTermMatrix(sampcorpus, control = list(tokenize = TrigramTokenizer))
# creating matrices from the DTMs, removing sparse terms
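# (sparsity thresholds are loosened for bigrams and trigrams, since longer n-grams appear in fewer documents)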
unigram <- as.matrix(removeSparseTerms(unigramdtm, 0.99))
bigram <- as.matrix(removeSparseTerms(bigramdtm, 0.999))
trigram <- as.matrix(removeSparseTerms(trigramdtm, 0.9999))
# sorting frequency terms
unigramfreq <- sort(colSums(as.matrix(unigram)), decreasing = TRUE)
bigramfreq <- sort(colSums(as.matrix(bigram)), decreasing = TRUE)
trigramfreq <- sort(colSums(as.matrix(trigram)), decreasing = TRUE)
# making data frames
unigramfreq <- data.frame(word = names(unigramfreq), freq = unigramfreq)
bigramfreq <- data.frame(word = names(bigramfreq), freq = bigramfreq)
trigramfreq <- data.frame(word = names(trigramfreq), freq = trigramfreq)
Appendix 3
# Sum columns and sorting by frequency - unigram
term_frequency <- colSums(samp_unigram)
term_frequency <- sort(term_frequency,
decreasing = TRUE)
# creating a barplot - method 1
barplot(term_frequency[1:10],
col = "blue",
las = 2, xlab = "words", ylab ="frequency")
# creating barplots - method 2
library(ggplot2)
## unigram
gguni <- ggplot(unigramfreq[1:20,], aes(x = reorder(word, -freq), y = freq)) + theme_bw() +
geom_bar(stat = "identity", fill = 'darkblue', alpha=0.3) +
geom_text(aes(label = freq ), vjust = -0.20, size = 3) +
xlab("") + ylab("Frequency") +
theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5)) +
ggtitle("Top 20 Unigrams")
print(gguni)
## bigram
ggbi <- ggplot(bigramfreq[1:20,], aes(x = reorder(word, -freq), y = freq)) + theme_bw() +
geom_bar(stat = "identity", fill = 'darkblue', alpha=0.3) +
geom_text(aes(label = freq ), vjust = -0.20, size = 3) +
xlab("") + ylab("Frequency") +
theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5)) +
ggtitle("Top 20 Bigrams")
print(ggbi)
## trigram
ggtri <- ggplot(trigramfreq[1:20,], aes(x = reorder(word, -freq), y = freq)) + theme_bw() +
geom_bar(stat = "identity", fill = 'darkblue', alpha=0.3) +
geom_text(aes(label = freq ), vjust = -0.20, size = 3) +
xlab("") + ylab("Frequency") +
theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5)) +
ggtitle("Top 20 Trigrams")
print(ggtri)
Appendix 4
library(wordcloud)
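library(RColorBrewer)  # brewer.pal() is used below; loaded explicitly in case it is not attached by wordcloud
# set.seed before each wordcloud so the random layout is reproducible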
set.seed(4518)
wordcloud(unigramfreq$word, unigramfreq$freq, max.words = 100, random.order = FALSE, scale=c(3.5,0.01), rot.per=0.5,
use.r.layout=FALSE,colors=brewer.pal(8, "Blues"),
main="Top Unigrams Wordcloud")
set.seed(4518)
wordcloud(bigramfreq$word, bigramfreq$freq, max.words = 100, random.order = FALSE, scale=c(3.5,0.01), rot.per=0.5,
use.r.layout=FALSE,colors=brewer.pal(8, "Blues"),
main="Top Bigrams Wordcloud")
set.seed(4518)
wordcloud(trigramfreq$word, trigramfreq$freq, max.words = 100, random.order = FALSE, scale=c(3.5,0.01), rot.per=0.5,
use.r.layout=FALSE,colors=brewer.pal(8, "Blues"),
main="Top Trigrams Wordcloud")