This milestone report is part of the capstone project of the Johns Hopkins University Data Science Specialization on Coursera, developed in collaboration with SwiftKey. The challenge is to apply the tools and knowledge of the specialization to build a predictive text model that suggests three words that could logically follow the words a user has already typed, for instance on a mobile keyboard. To accomplish this, the project instructions suggest using three large text corpora, drawn from blogs, news articles, and Twitter messages.
library(tm); library(ggplot2); library(dplyr); library(kableExtra); library(knitr)
library(lda); library(RWeka); library(SnowballC); library(wordcloud)
library(RColorBrewer); library(stringi)
set.seed(8888)
Because of the large size of the three sources, the files must be downloaded beforehand and saved in local storage. To filter out profanity, an updated list of words banned by Google was also downloaded.
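For reference, a minimal sketch of how the corpora could be obtained the first time; the download URL is the course-provided link and is included here as an assumption, since the original script expects the files to be already on disk:
#Downloading and unzipping the SwiftKey corpora only if they are not already on local storage
zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!dir.exists("final/en_US")) {
download.file(zipUrl, destfile = zipFile, mode = "wb")
unzip(zipFile)
}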
#Setting the local directory that contains the data files
localdir <- getwd()
usDir <- "/final/en_US/"
fileBlogs <- paste(localdir, usDir, "en_US.blogs.txt", sep = "")
fileNews <- paste(localdir, usDir, "en_US.news.txt", sep = "")
fileTwitter <- paste(localdir, usDir, "en_US.twitter.txt", sep = "")
FileBadWords <- paste(localdir, usDir,"full-list-of-bad-words_2018_03_26.txt", sep = "")
#Reading information from the text sources
options(mc.cores=4)
blogs <- readLines(fileBlogs, encoding="UTF-8", skipNul = TRUE, warn = TRUE)
news <- readLines(fileNews, encoding="UTF-8", skipNul = TRUE, warn = TRUE)
twitter <- readLines(fileTwitter, encoding="UTF-8", skipNul = TRUE, warn = TRUE)
dfBadWords <-tolower(readLines(FileBadWords))
#The profanity filter uses an updated list of words banned by Google
The table below summarizes some basic indicators of each text corpus.
#Calculating some parameters of text sources
#Measuring size of source files
size1MB <-1048576
blogsSize <- round(file.size(fileBlogs)/size1MB, 2)
newsSize <- round(file.size(fileNews)/size1MB, 2)
twitterSize <- round(file.size(fileTwitter)/size1MB, 2)
#Measuring the number of lines in each source file
blogsLines <- format(length(blogs), big.mark = ",")
newsLines <- format(length(news), big.mark = ",")
twitterLines <- format(length(twitter), big.mark = ",")
#Calculating the number of characters of longest line
MaxLineBlogs <- format(max(nchar(blogs)), big.mark = ",")
MaxLineNews <- format(max(nchar(news)), big.mark = ",")
MaxLineTwitter <- format(max(nchar(twitter)), big.mark = ",")
#Counting the number of words
numWordsBlogs <- format(sum(stri_count_words(blogs)), big.mark = ",")
numWordsNews <- format(sum(stri_count_words(news)), big.mark = ",")
numWordsTwitter <- format(sum(stri_count_words(twitter)), big.mark = ",")
dfIndicators <- data.frame("Size (MB)" = c(blogsSize, newsSize, twitterSize),
"Lines" = c(blogsLines, newsLines, twitterLines),
"Total Words" = c(numWordsBlogs, numWordsNews, numWordsTwitter),
"Longest Line" = c(MaxLineBlogs, MaxLineNews, MaxLineTwitter),
row.names = c("Blogs", "News", "Twitter"),
check.names = FALSE)
kable(dfIndicators) # %>% kable_styling("striped")
| | Size (MB) | Lines | Total Words | Longest Line |
|---|---|---|---|---|
| Blogs | 200.42 | 899,288 | 37,546,246 | 40,833 |
| News | 196.28 | 77,259 | 2,674,536 | 5,760 |
| Twitter | 159.33 | 2,360,148 | 30,093,327 | 140 |
Because each source contains a large number of lines, processing the full files would be slow. It is therefore advisable to take a small sample of lines (for instance, 5,000) from each source and combine them into a single training set. However, since Twitter lines are the shortest, it is reasonable to double the size of its sample.
set.seed(8888)
rowsTraining <-5000
trainDataBlog <- sample(blogs, rowsTraining)
trainDataNews <- sample(news, rowsTraining)
trainDataTwitter <- sample(twitter, 2*rowsTraining)
Several functions from the Text Mining (tm) package are applied to the training dataset to clean the data and prepare it for tokenization. These functions are:
ConvtoSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x, perl = TRUE))
CleanCorpus <- function(pcorpus) {
pcorpus <- Corpus(VectorSource(pcorpus))# Converting to a corpus object
#That should make sure all data is in PlainTextDocument
pcorpus <- tm_map(pcorpus, PlainTextDocument)
pcorpus <- tm_map(pcorpus, removeNumbers) # Removing numbers
pcorpus <- tm_map(pcorpus, ConvtoSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+") # Removing URLs
pcorpus <- tm_map(pcorpus, ConvtoSpace, "<.+?>") # Removing HTML/XML tags
pcorpus <- tm_map(pcorpus, ConvtoSpace, "[^(a-zA-Z | \\'| \\’)]+") # Keeping only letters and apostrophes
pcorpus <- tm_map(pcorpus, stripWhitespace) # Removing extra whitespace
pcorpus <- tm_map(pcorpus, content_transformer(tolower)) # Converting to lower case
pcorpus <- tm_map(pcorpus, removeWords, stopwords("english")) # Removing common English stop words
pcorpus <- tm_map(pcorpus, removeWords, dfBadWords) # Removing profanity
return(pcorpus)
}
trainDataBlog <- CleanCorpus(trainDataBlog)
trainDataNews <- CleanCorpus(trainDataNews)
trainDataTwitter <- CleanCorpus(trainDataTwitter)
For the prediction algorithm of the Shiny app, an effective strategy is to use n-grams from Natural Language Processing (NLP). An n-gram is a contiguous sequence of n words extracted from a sentence in the text source. Depending on the value of n, it is called a 1-gram (unigram), 2-gram (bigram), 3-gram (trigram), and so on.
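As a simple illustration (not part of the original analysis), the RWeka tokenizer used later in this report can be applied to a short, hypothetical sentence to show the n-grams it produces:
library(RWeka)
example <- "thanks for the follow" # hypothetical sentence
NGramTokenizer(example, Weka_control(min = 2, max = 2)) # bigrams
# expected: "thanks for" "for the" "the follow"
NGramTokenizer(example, Weka_control(min = 3, max = 3)) # trigrams
# expected: "thanks for the" "for the follow"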
A simple explanation of how n-grams are used, understandable for a non-data-scientist manager, is the following: when the user types a word on the touch keyboard, the Shiny app searches the bigram dataset for entries that start with the typed word and shows the words most likely to follow, taken from the most frequent matching bigrams, as sketched below.
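A minimal sketch of that lookup, assuming a bigram frequency table such as biGramTok (built later in this report) with columns Token and Frequency; the helper name predictNextWord is hypothetical and the input word is assumed to be already cleaned and lower-cased:
#Returning the top candidate next words for a given previous word,
#based on the most frequent bigrams that start with that word
predictNextWord <- function(prevWord, bigrams, top = 3) {
hits <- bigrams[grepl(paste0("^", prevWord, " "), bigrams$Token), ]
hits <- hits[order(hits$Frequency, decreasing = TRUE), ]
head(sub("^\\S+\\s+", "", hits$Token), top) #Keeping only the second word of each matching bigram
}
#Example call (hypothetical): predictNextWord("happy", biGramTok)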
Before running the full tokenization process (supported by the RWeka package), which splits the sentences of the cleaned corpus into unigrams, bigrams, trigrams, and quadgrams, it is convenient to first extract the unigrams of the three text sources separately, in order to compare their results.
Because tokenization is a heavy process that demands CPU and RAM, it is advisable to set up parallel processing that spreads the work across the available processor cores, as sketched below.
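A minimal sketch of such a setup, assuming the base parallel package; the fixed value of 4 used in the next chunk could instead be derived from the machine:
library(parallel)
#Using all but one of the available cores so the machine stays responsive
options(mc.cores = max(1, detectCores() - 1))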
options(mc.cores=4)
#Counting the number of words
numWordsTrainBlogs <- sum(stri_count_words(trainDataBlog))
numWordsTrainNews <- sum(stri_count_words(trainDataNews))
numWordsTrainTwitter <- sum(stri_count_words(trainDataTwitter))
#Identifying tokens and calculating frequency
uniGramTokBlogs <- data.frame(table(NGramTokenizer(trainDataBlog, Weka_control(min = 1, max = 1))))
uniGramTokNews <- data.frame(table(NGramTokenizer(trainDataNews, Weka_control(min = 1, max = 1))))
uniGramTokTwitter <- data.frame(table(NGramTokenizer(trainDataTwitter, Weka_control(min = 1, max = 1))))
names(uniGramTokBlogs) <- c("Token", "Freq_Blogs")
names(uniGramTokNews) <- c("Token", "Freq_News")
names(uniGramTokTwitter) <- c("Token", "Freq_Twitter")
#Calculating number of total records
nrowsuniGramTokBlogs <- as.numeric(nrow(uniGramTokBlogs))
nrowsuniGramTokNews <- as.numeric(nrow(uniGramTokNews))
nrowsuniGramTokTwitter <- as.numeric(nrow(uniGramTokTwitter))
#Counting tokens that appear only once (single-frequency tokens)
singleuniGramTokBlogs <- sum(uniGramTokBlogs$Freq_Blogs == 1)
singleuniGramTokNews <- sum(uniGramTokNews$Freq_News == 1)
singleuniGramTokTwitter <- sum(uniGramTokTwitter$Freq_Twitter == 1)
#Calculating the percentage of single frequencies
persingleuniGramTokBlogs <- round((singleuniGramTokBlogs/nrowsuniGramTokBlogs)*100, digits = 1)
persingleuniGramTokNews<- round((singleuniGramTokNews/nrowsuniGramTokNews)*100, digits = 1)
persingleuniGramTokTwitter<- round((singleuniGramTokTwitter/nrowsuniGramTokTwitter)*100, digits = 1)
#Calculating the number of effective tokens for prediction
effectiveuniGramTokBlogs <- nrowsuniGramTokBlogs - singleuniGramTokBlogs
effectiveuniGramTokNews <- nrowsuniGramTokNews - singleuniGramTokNews
effectiveuniGramTokTwitter <- nrowsuniGramTokTwitter - singleuniGramTokTwitter
#Calculating the ratio of effective tokens versus the original words of the sample
ratioEffectivenessBlogs <- round((effectiveuniGramTokBlogs/as.numeric(numWordsTrainBlogs))*100, digits = 1)
ratioEffectivenessNews <- round((effectiveuniGramTokNews/as.numeric(numWordsTrainNews))*100, digits = 1)
ratioEffectivenessTwitter <- round((effectiveuniGramTokTwitter/as.numeric(numWordsTrainTwitter))*100, digits = 1)
dfResults <- data.frame("Words" = c(numWordsTrainBlogs, numWordsTrainNews, numWordsTrainTwitter),
"Tokens" = c(nrowsuniGramTokBlogs, nrowsuniGramTokNews, nrowsuniGramTokTwitter),
"Single-Freq Tokens" = c(singleuniGramTokBlogs, singleuniGramTokNews, singleuniGramTokTwitter),
"% Single-Freq Tokens" = c(persingleuniGramTokBlogs, persingleuniGramTokNews, persingleuniGramTokTwitter),
"Effective Tokens" = c(effectiveuniGramTokBlogs, effectiveuniGramTokNews, effectiveuniGramTokTwitter),
"Effective Tokens vs Words (%)" = c(ratioEffectivenessBlogs, ratioEffectivenessNews, ratioEffectivenessTwitter),
row.names = c("Blogs", "News", "Twitter"),
check.names = FALSE)
kable(dfResults) %>% kable_styling("striped")
| | Words | Tokens | Single-Freq Tokens | % Single-Freq Tokens | Effective Tokens | Effective Tokens vs Words (%) |
|---|---|---|---|---|---|---|
| Blogs | 111825 | 19860 | 10066 | 50.7 | 9794 | 8.8 |
| News | 98245 | 19222 | 9867 | 51.3 | 9355 | 9.5 |
| Twitter | 71428 | 14406 | 8421 | 58.5 | 5985 | 8.4 |
The dataset with the most single-occurrence unigrams is Twitter, at 58.5%. As a consequence, out of a total of 14,406 tokens, only 5,985 could be useful for the prediction algorithm, because they have a frequency greater than one.
The dataset with the greatest number of effective unigrams is Blogs, with 9,794 tokens from 111,825 original words, an effectiveness of 8.8%.
The dataset with the highest effectiveness is News, at 9.5%: its 98,245 original words provide 9,355 tokens with a frequency greater than one.
These results suggest that the Twitter dataset may not be an appropriate text source for collecting sentences for the prediction algorithm, so a more thorough review of Twitter's unigrams is warranted.
Checking which unigrams appear in Twitter but not in News reveals the following:
#Selecting the unigrams with a frequency greater than 1
effuniGramTokBlogs <-uniGramTokBlogs[uniGramTokBlogs$Freq_Blogs>1,]
effuniGramTokNews <-uniGramTokNews[uniGramTokNews$Freq_News>1,]
effuniGramTokTwitter <-uniGramTokTwitter[uniGramTokTwitter$Freq_Twitter>1,]
#Identifying unigrams that are in the Twitter training dataset but not in the News training dataset
DifTokTwitter_News <- as.data.frame(setdiff(effuniGramTokTwitter$Token, effuniGramTokNews$Token))
names(DifTokTwitter_News) <- c("Token")
numDifTokTwitter_News <- nrow(DifTokTwitter_News)
#Extracting the frequencies of the unigrams that are in the Twitter training dataset but not in the News training dataset
corpDifTokTwitter_News <- effuniGramTokTwitter[(effuniGramTokTwitter$Token %in% DifTokTwitter_News$Token),]
corpDifTokTwitter_News <- corpDifTokTwitter_News[order(corpDifTokTwitter_News$Freq_Twitter, decreasing = TRUE),]
barplot(corpDifTokTwitter_News[1:25,]$Freq_Twitter, las = 2, names.arg = corpDifTokTwitter_News[1:25,]$Token,
col ="lightblue", main ="Top 25 Most Frequent Twitter Words that are not in News",
ylab = "Word frequencies")
2,012 of Twitter’s unigrams do not appear in the News dataset. Reviewing the top 25 most frequent of this group shows abbreviations and slang that belong to the Twitter vocabulary, which evolved to make it easier to write messages within 140 characters.
In conclusion, to meet the challenge of building a prediction algorithm that assists a user while writing an English message, it is not convenient to use the Twitter source. In contrast, News and Blogs are written in more formal English, which makes them appropriate sources for collecting sentences to support the prediction algorithm.
set.seed(8888)
perTraining <- 0.01
trainingData <- c(sample(blogs, length(blogs)*perTraining), sample(news, length(news)*perTraining))
corpustrainingData <- CleanCorpus(trainingData)
To facilitate the analysis of the most frequent words, a reasonable strategy is to reduce the different variations of a word to its root; this process is called stemming. The following 'wordcloud' graphic depicts the top 100 most frequent root (stem) words, from the center (where the most frequent words sit, sized according to their relevance) out to the edge of the cloud (where the less frequent words appear).
pcorpusStem <-tm_map(corpustrainingData, stemDocument)
wordcloud(pcorpusStem, max.words = 100, random.order = FALSE,
colors=brewer.pal(8, "Dark2"))
The next step is to complete the tokenization process over the combined training corpus, producing unigrams, bigrams, trigrams, and quadgrams.
options(mc.cores=4)
uniGramTok <- data.frame(table(NGramTokenizer(corpustrainingData, Weka_control(min = 1, max = 1))))
biGramTok <- data.frame(table(NGramTokenizer(corpustrainingData, Weka_control(min = 2, max = 2))))
triGramTok <- data.frame(table(NGramTokenizer(corpustrainingData, Weka_control(min = 3, max = 3))))
quadGramTok <- data.frame(table(NGramTokenizer(corpustrainingData, Weka_control(min = 4, max = 4))))
names(uniGramTok) <- c("Token", "Frequency")
names(biGramTok) <- c("Token", "Frequency")
names(triGramTok) <- c("Token", "Frequency")
names(quadGramTok) <- c("Token", "Frequency")
uniGramTok <- uniGramTok[order(uniGramTok$Frequency,decreasing = TRUE),]
biGramTok <- biGramTok[order(biGramTok$Frequency,decreasing = TRUE),]
triGramTok <- triGramTok[order(triGramTok$Frequency,decreasing = TRUE),]
quadGramTok <- quadGramTok[order(quadGramTok$Frequency,decreasing = TRUE),]
#Counting the number of corpus words
numWordsCorpus <- sum(stri_count_words(corpustrainingData))
#Calculating the total number of tokens
nrowsuniGramTokCorpus <- as.numeric(nrow(uniGramTok))
#Counting tokens that appear only once (single-frequency tokens)
singleuniGramTok <- sum(uniGramTok$Frequency == 1)
#Calculating the percentage of single frequencies
persingleuniGramCorpus <- round((singleuniGramTok/nrowsuniGramTokCorpus)*100, digits = 1)
#Calculating the number of effective tokens for prediction
effectiveuniGramTokCorpus <-as.numeric(nrowsuniGramTokCorpus - singleuniGramTok)
#Calculating ratio of effective tokens versus original words of sample
ratioEffectivenessCorpus <- round((effectiveuniGramTokCorpus/as.numeric(numWordsCorpus))*100, digits = 1)
dfResults <- data.frame("Words" = c(numWordsCorpus),
"Tokens" = c(nrowsuniGramTokCorpus),
"Single-Freq Tokens" = c(singleuniGramTok),
"Effective Tokens" = c(effectiveuniGramTokCorpus),
"Effective Tokens vs Words (%)" = c(ratioEffectivenessCorpus),
row.names = c("Corpus"),
check.names = FALSE)
kable(dfResults) #%>% kable_styling("striped")
| | Words | Tokens | Single-Freq Tokens | Effective Tokens | Effective Tokens vs Words (%) |
|---|---|---|---|---|---|
| Corpus | 213256 | 28697 | 13995 | 14702 | 6.9 |
After drawing new samples from the Blogs and News sources and combining them into a single corpus, the results shown above are obtained.
Once the tokenization process is complete, it is advisable to plot histograms of the unigrams and bigrams in order to check the results and plan the next steps in the development of the prediction algorithm.
plotuniGramTok <- ggplot(uniGramTok[1:15,], aes(x=reorder(Token, Frequency), y=Frequency)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Unigram") + ylab("Frequency") +
labs(title = "Top Unigrams by frequency")
plotbiGramTok <- ggplot(biGramTok[1:15,], aes(x=reorder(Token,Frequency), y=Frequency)) +
geom_bar(stat = "identity") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("Bigram") + ylab("Frequency") +
labs(title = "Top Bigrams by frequency")
# Note: par(mfrow = c(1, 2)) has no effect on ggplot2 objects, so the plots are printed one after the other
print(plotuniGramTok)
print(plotbiGramTok)
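As a possible next step (a sketch, not part of this analysis), the n-gram frequency tables could be saved to disk so the Shiny app can load them without repeating the tokenization; the file names are hypothetical:
#Persisting the n-gram tables for later use by the Shiny prediction app
saveRDS(uniGramTok, "uniGramTok.rds")
saveRDS(biGramTok, "biGramTok.rds")
saveRDS(triGramTok, "triGramTok.rds")
saveRDS(quadGramTok, "quadGramTok.rds")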