Milestone report of Coursera Data Science Capstone Project

Introduction

This milestone report is an assignment for the capstone project of the Coursera Data Science Specialization from Johns Hopkins University, developed in collaboration with SwiftKey. The project challenge is to apply tools and knowledge from the specialization to build a predictive text model that suggests three candidate words that could logically follow the words a user has already typed, for instance while writing on a mobile keyboard. To accomplish this, the project instructions suggest using three large text corpora drawn from blogs, news articles, and Twitter messages.

Environment and variables for reproducibility

library(tm); library(ggplot2); library(dplyr); library(kableExtra); library(knitr)
library(lda); library(RWeka); library(SnowballC); library(wordcloud)
library(RColorBrewer); library(stringi)
set.seed(8888)

Loading Data from Designated Sources

Because of the large size of the three sources, the files must be downloaded in advance and saved to local storage. To filter out profanity, an updated list of such words (downloaded from Google) is used as well.
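As a minimal sketch of how the files could be obtained (the zipUrl value below is a placeholder assumption, not part of this report), the dataset can be downloaded and unzipped into the working directory:

#Minimal sketch: fetch and unzip the corpora if they are not already on local storage.
#NOTE: zipUrl is a placeholder -- replace it with the download location given in the course instructions.
zipUrl  <- "https://example.com/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!dir.exists(file.path(getwd(), "final", "en_US"))) {
    download.file(zipUrl, destfile = zipFile, mode = "wb")
    unzip(zipFile, exdir = getwd()) #creates the final/en_US/ folder used below
}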

#Setting local directory
localdir <- getwd()
setwd(localdir)
usDir <- "/final/en_US/"
fileBlogs <- paste(localdir, usDir, "en_US.blogs.txt", sep = "")
fileNews <- paste(localdir, usDir, "en_US.news.txt", sep = "")
fileTwitter <- paste(localdir, usDir, "en_US.twitter.txt", sep = "")
FileBadWords <- paste(localdir, usDir, "full-list-of-bad-words_2018_03_26.txt", sep = "")

#Reading text lines from each source
options(mc.cores=4)
blogs <- readLines(fileBlogs, encoding="UTF-8", skipNul = TRUE, warn = TRUE)
news <- readLines(fileNews, encoding="UTF-8", skipNul = TRUE, warn = TRUE)
twitter <- readLines(fileTwitter, encoding="UTF-8", skipNul = TRUE, warn = TRUE)
dfBadWords <- tolower(readLines(FileBadWords))
#Profanity is removed later using an updated list of such words downloaded from Google

Below is a summary of some indicators for each designated text corpus.

#Calculating some parameters of text sources
#Measuring size of source files
size1MB <-1048576

blogsSize <- round(file.size(fileBlogs)/size1MB, 2)
newsSize <- round(file.size(fileNews)/size1MB, 2)
twitterSize <- round(file.size(fileTwitter)/size1MB, 2)

#Measuring  number of lines per each source files
blogsLines <- format(length(blogs), big.mark   = ",")
newsLines <- format(length(news), big.mark   = ",")
twitterLines <- format(length(twitter), big.mark   = ",")

#Calculating the number of characters of longest line
MaxLineBlogs <- format(max(nchar(blogs)), big.mark   = ",")
MaxLineNews <- format(max(nchar(news)), big.mark   = ",")
MaxLineTwitter <- format(max(nchar(twitter)), big.mark   = ",")

#Counting the number of words
numWordsBlogs <- format(sum(stri_count_words(blogs)), big.mark   = ",")
numWordsNews <- format(sum(stri_count_words(news)), big.mark   = ",")
numWordsTwitter <- format(sum(stri_count_words(twitter)), big.mark   = ",")

dfIndicators <- data.frame("Size (MB)" = c(blogsSize, newsSize, twitterSize),
                           "Lines" = c(blogsLines, newsLines, twitterLines),
                           "Total Words" = c(numWordsBlogs, numWordsNews, numWordsTwitter),
                           "Longest Line" = c(MaxLineBlogs, MaxLineNews, MaxLineTwitter),
                           row.names = c("Blogs", "News", "Twitter"),
                           check.names = FALSE)
kable(dfIndicators) # %>% kable_styling("striped")
          Size (MB)   Lines       Total Words   Longest Line
Blogs     200.42      899,288     37,546,246    40,833
News      196.28      77,259      2,674,536     5,760
Twitter   159.33      2,360,148   30,093,327    140
As shown in the table, each row corresponds to one of the designated text sources. The columns are:

  • File size in megabytes: Blogs is the largest source, at just over 200 MB.
  • Number of lines (paragraphs): Twitter has more than 2.3 million lines.
  • Total words: Blogs contains more than 37 million words.
  • Longest line: Blogs has the longest paragraph, with more than 40 thousand characters.

Sampling Text Corpus

Because the large number of lines in each text source would slow down processing, it is advisable to draw a small random subset of lines (for instance 5,000) from each source and combine them into one training set. However, since Twitter has the shortest lines (limited to 140 characters), a 5,000-line sample would contain far fewer words than the other sources, so it is reasonable to double the size of its sample.

set.seed(8888)
rowsTraining <-5000
trainDataBlog <- sample(blogs, rowsTraining)
trainDataNews <- sample(news, rowsTraining)
trainDataTwitter <- sample(twitter, 2*rowsTraining)

Cleaning Text Corpus

Several functions from the Text Mining (tm) package are applied to the training dataset in order to clean the data and prepare it for tokenization. These steps are:

  • Converting the text data to a corpus object.
  • Transforming the corpus content to plain text.
  • Removing numbers.
  • Removing URLs.
  • Converting any character that is not a letter or apostrophe to a blank space.
  • Removing extra blank spaces.
  • Converting the corpus to lowercase so that English stop words can be removed.
  • Removing English stop words, which do not add predictive value.
  • Removing profanity.
ConvtoSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x, perl = TRUE))

CleanCorpus <- function(pcorpus) {
                pcorpus <- Corpus(VectorSource(pcorpus))# Converting to a corpus object
                #That should make sure all data is in PlainTextDocument
                pcorpus <- tm_map(pcorpus, PlainTextDocument) 
                pcorpus <- tm_map(pcorpus, removeNumbers) # Removing numbers
                pcorpus <- tm_map(pcorpus, ConvtoSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+") #Removing URLs
                pcorpus <- tm_map(pcorpus, ConvtoSpace, "<.+?>") #Removing HTML-like tags
                pcorpus <- tm_map(pcorpus, ConvtoSpace, "[^(a-zA-Z |  \\'| \\’)]+") #Converting any non-letter (except apostrophes) to a space
                pcorpus <- tm_map(pcorpus, stripWhitespace) # get rid of extra spaces
                pcorpus <- tm_map(pcorpus, content_transformer(tolower))
                #Removing common words of english language
                pcorpus <- tm_map(pcorpus, removeWords,stopwords("english")) 
                pcorpus <- tm_map(pcorpus, removeWords, dfBadWords) #Removing profanity
                return(pcorpus)
}
trainDataBlog <- CleanCorpus(trainDataBlog)
trainDataNews <- CleanCorpus(trainDataNews)
trainDataTwitter <- CleanCorpus(trainDataTwitter)

Tokenization of Unigrams and analysis of results

For the prediction algorithm of the Shiny app, an effective strategy is to use n-grams from Natural Language Processing (NLP). An n-gram is a contiguous sequence of n words extracted from a sentence in the text source. Depending on the value of n, it is called a 1-gram (unigram), 2-gram (bigram), 3-gram (trigram), and so on.
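As a quick illustration (the sentence below is a made-up example, not drawn from the corpora), RWeka's NGramTokenizer splits a short phrase into n-grams of the requested length:

#Illustration only: extracting bigrams and trigrams from a toy sentence
examplePhrase <- "thanks for the follow back"
NGramTokenizer(examplePhrase, Weka_control(min = 2, max = 2))
# e.g. "thanks for"  "for the"  "the follow"  "follow back"
NGramTokenizer(examplePhrase, Weka_control(min = 3, max = 3))
# e.g. "thanks for the"  "for the follow"  "the follow back"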

A simple explanation of how the n-grams are used, understandable for a non-data-scientist manager, could be: when the user types a word on the touch keyboard, the Shiny app searches the bigram dataset for entries that start with the typed word and displays some predictions of the word most likely to follow, taken from the most frequent matching bigrams.
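A minimal sketch of that lookup is shown below, assuming a bigram table with the same structure as the biGramTok data frame built later in this report (the helper name predictNextWord and the choice of three suggestions are illustrative assumptions, not the final algorithm):

#Minimal sketch of a bigram lookup (illustrative only, not the final prediction algorithm)
#Assumes a data frame with columns Token ("word1 word2") and Frequency
predictNextWord <- function(word, bigrams, n = 3) {
    word <- tolower(word)
    hits <- bigrams[grepl(paste0("^", word, " "), bigrams$Token), ]
    hits <- hits[order(hits$Frequency, decreasing = TRUE), ]
    #Return the second word of the n most frequent bigrams that start with 'word'
    sapply(strsplit(as.character(head(hits$Token, n)), " "), `[`, 2)
}
#Example usage once biGramTok has been built: predictNextWord("new", biGramTok)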

Before running the full tokenization process (supported by the RWeka package), which splits the sentences of the cleaned corpus into unigrams, bigrams, trigrams, and quadgrams, it is convenient to first extract the unigrams of the three text sources and compare the results.

Because tokenization is a heavy process that demands both CPU and RAM, it is advisable to set up parallel processing that is split across the available cores of the processor.
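This report fixes the number of cores at four via options(mc.cores = 4); as a minimal alternative sketch, the available cores could be detected at run time with the parallel package (leaving one core free is just a convention, not a requirement):

#Sketch: choose mc.cores from the machine instead of hard-coding a value
library(parallel)
nCores <- max(1, detectCores() - 1) #keep one core free for the operating system
options(mc.cores = nCores)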

options(mc.cores=4)
#Counting the number of words
numWordsTrainBlogs <- sum(stri_count_words(trainDataBlog))
numWordsTrainNews <- sum(stri_count_words(trainDataNews))
numWordsTrainTwitter <- sum(stri_count_words(trainDataTwitter))

#Identifying tokens and calculating frequency
uniGramTokBlogs <- data.frame(table(NGramTokenizer(trainDataBlog, Weka_control(min = 1, max = 1))))
uniGramTokNews <- data.frame(table(NGramTokenizer(trainDataNews, Weka_control(min = 1, max = 1))))
uniGramTokTwitter <- data.frame(table(NGramTokenizer(trainDataTwitter, Weka_control(min = 1, max = 1))))

names(uniGramTokBlogs) <- c("Token", "Freq_Blogs")
names(uniGramTokNews) <- c("Token", "Freq_News")
names(uniGramTokTwitter) <- c("Token", "Freq_Twitter")

#Calculating  number of total records
nrowsuniGramTokBlogs <- as.numeric(nrow(uniGramTokBlogs))
nrowsuniGramTokNews <- as.numeric(nrow(uniGramTokNews))
nrowsuniGramTokTwitter <- as.numeric(nrow(uniGramTokTwitter))

#Calculating single frequencies
singleuniGramTokBlogs <-as.numeric(count(uniGramTokBlogs[uniGramTokBlogs$Freq_Blogs==1,]))
singleuniGramTokNews<- as.numeric(count(uniGramTokNews[uniGramTokNews$Freq_News==1,]))
singleuniGramTokTwitter<- as.numeric(count(uniGramTokTwitter[uniGramTokTwitter$Freq_Twitter==1,]))

#Calculating the percentage of single frequencies
persingleuniGramTokBlogs <- round((singleuniGramTokBlogs/nrowsuniGramTokBlogs)*100, digits = 1)
persingleuniGramTokNews<- round((singleuniGramTokNews/nrowsuniGramTokNews)*100, digits = 1)
persingleuniGramTokTwitter<- round((singleuniGramTokTwitter/nrowsuniGramTokTwitter)*100, digits = 1)

#Calculating the number of effective tokens for prediction
effectiveuniGramTokBlogs <-as.numeric(nrowsuniGramTokBlogs - singleuniGramTokBlogs)
effectiveuniGramTokNews <- as.numeric(nrowsuniGramTokNews - singleuniGramTokNews)
effectiveuniGramTokTwitter <- as.numeric(nrowsuniGramTokTwitter - singleuniGramTokTwitter)

#Calculating ratio of effective tokens versus original words of sample
ratioEffectivenessBlogs <- round((effectiveuniGramTokBlogs/as.numeric(numWordsTrainBlogs))*100, digits = 1)
ratioEffectivenessNews <- round((effectiveuniGramTokNews/as.numeric(numWordsTrainNews))*100, digits = 1)
ratioEffectivenessTwitter <-round((effectiveuniGramTokTwitter/as.numeric(numWordsTrainTwitter))*100, digits = 1)

dfResults <- data.frame("Words" = c(numWordsTrainBlogs, numWordsTrainNews, numWordsTrainTwitter),
                        "Tokens" = c(nrowsuniGramTokBlogs, nrowsuniGramTokNews, nrowsuniGramTokTwitter),
                        "Single Freq Tokens" = c(singleuniGramTokBlogs, singleuniGramTokNews, singleuniGramTokTwitter),
                        "% Single Freq Tokens" = c(persingleuniGramTokBlogs, persingleuniGramTokNews, persingleuniGramTokTwitter),
                        "Effective Tokens" = c(effectiveuniGramTokBlogs, effectiveuniGramTokNews, effectiveuniGramTokTwitter),
                        "Ratio Effective Tokens vs Words (%)" = c(ratioEffectivenessBlogs, ratioEffectivenessNews, ratioEffectivenessTwitter),
                        row.names = c("Blogs", "News", "Twitter"),
                        check.names = FALSE)
kable(dfResults) %>% kable_styling("striped")
          Words     Tokens   Single Freq Tokens   % Single Freq Tokens   Effective Tokens   Ratio Effective Tokens vs Words (%)
Blogs     111,825   19,860   10,066               50.7                   9,794              8.8
News      98,245    19,222   9,867                51.3                   9,355              9.5
Twitter   71,428    14,406   8,421                58.5                   5,985              8.4
The results suggest the following conclusions about the sample dataset from each text source:

  • The dataset with the highest share of single-occurrence unigrams is Twitter, at 58.5%. As a result, out of 14,406 tokens only 5,985 could be useful for the prediction algorithm, because only those have a frequency greater than 1.

  • The dataset with the greatest number of effective unigrams is Blogs, with 9,794 tokens out of 111,825 original words, an effectiveness of 8.8%.

  • The dataset with the highest effectiveness is News, at 9.5%: its 98,245 original words yield 9,355 tokens with a frequency greater than 1.

These conclusions might suggest that the Twitter dataset is not an appropriate text source for collecting sentences for the prediction algorithm, so a deeper review of Twitter's unigrams is warranted.

Comparing the unigrams of Twitter against those of News reveals the following:

#Selecting the unigrams with a frequency greather than 1
effuniGramTokBlogs <-uniGramTokBlogs[uniGramTokBlogs$Freq_Blogs>1,]
effuniGramTokNews <-uniGramTokNews[uniGramTokNews$Freq_News>1,]
effuniGramTokTwitter <-uniGramTokTwitter[uniGramTokTwitter$Freq_Twitter>1,]

#Identifying unigrams that are in the Twitter training dataset but not in the News training dataset
DifTokTwitter_News <- as.data.frame(setdiff(effuniGramTokTwitter$Token, effuniGramTokNews$Token))
names(DifTokTwitter_News) <- c("Token")
numDifTokTwitter_News <- nrow(DifTokTwitter_News)
#Extracting the frequency of those Twitter-only unigrams
corpDifTokTwitter_News <- effuniGramTokTwitter[(effuniGramTokTwitter$Token %in% DifTokTwitter_News$Token),]
corpDifTokTwitter_News <- corpDifTokTwitter_News[order(corpDifTokTwitter_News$Freq_Twitter, decreasing = TRUE),]
barplot(corpDifTokTwitter_News[1:25,]$Freq_Twitter, las = 2, names.arg = corpDifTokTwitter_News[1:25,]$Token,
        col = "lightblue", main = "Top 25 Most Frequent Twitter Words that are not in News",
        ylab = "Word frequencies")

2,012 of Twitter’s unigrams are not in the News dataset, and reviewing the top 25 most frequent of this group shows abbreviations and slang that belong to the Twitter vocabulary, which evolved to make it easier to write messages within the 140-character limit.

As a conclusion: since the challenge is to build a prediction algorithm that assists a user while writing an English message, it is not convenient to use the Twitter source. In contrast, News and Blogs are written in more formal English, which makes them appropriate sources for collecting sentences to support the prediction algorithm.

Rearranging the Sample with the Blogs and News Sources

set.seed(8888)
perTraining <-as.numeric(0.01)
trainingData <- c(sample(blogs, length(blogs)*perTraining),sample(news,length(news)*perTraining))
corpustrainingData <- CleanCorpus(trainingData)

Plotting Some Information about the Rearranged Corpus

In order to facilitate the analysis of the most frequent words, a reasonable strategy is to reduce the different variations of a word to its root; this process is called stemming. Below is a ‘wordcloud’ graphic that depicts the top 100 most frequent root (stem) words, from the center of the cloud (where the most frequent words appear, sized according to their relevance) out to its edge (where the less frequent words appear).

pcorpusStem <-tm_map(corpustrainingData, stemDocument)
wordcloud(pcorpusStem, max.words = 100, random.order = FALSE,
          colors=brewer.pal(8, "Dark2"))

Completing the Tokenization Process

Next, the tokenization process is completed by extracting unigrams, bigrams, trigrams, and quadgrams from the combined corpus.

options(mc.cores=4)
uniGramTok <- data.frame(table(NGramTokenizer(corpustrainingData, Weka_control(min = 1, max = 1))))
biGramTok <- data.frame(table(NGramTokenizer(corpustrainingData, Weka_control(min = 2, max = 2))))
triGramTok <- data.frame(table(NGramTokenizer(corpustrainingData, Weka_control(min = 3, max = 3))))
quadGramTok <- data.frame(table(NGramTokenizer(corpustrainingData, Weka_control(min = 4, max = 4))))

names(uniGramTok) <- c("Token", "Frequency")
names(biGramTok) <- c("Token", "Frequency")
names(triGramTok) <- c("Token", "Frequency")
names(quadGramTok) <- c("Token", "Frequency")

uniGramTok <- uniGramTok[order(uniGramTok$Frequency,decreasing = TRUE),]
biGramTok <- biGramTok[order(biGramTok$Frequency,decreasing = TRUE),]
triGramTok <- triGramTok[order(triGramTok$Frequency,decreasing = TRUE),]
quadGramTok <- quadGramTok[order(quadGramTok$Frequency,decreasing = TRUE),]


#Counting the number of corpus words
numWordsCorpus <- sum(stri_count_words(corpustrainingData))

#Calculating  number of total token
nrowsuniGramTokCorpus <- as.numeric(nrow(uniGramTok))

#Calculating single frequencies
singleuniGramTok <-as.numeric(count(uniGramTok[uniGramTok$Frequency==1,]))

#Calculating the percentage of single frequencies
persingleuniGramCorpus <- round((singleuniGramTok/nrowsuniGramTokCorpus)*100, digits = 1)

#Calculating the number of effective tokens for prediction
effectiveuniGramTokCorpus <-as.numeric(nrowsuniGramTokCorpus - singleuniGramTok)

#Calculating ratio of effective tokens versus original words of sample
ratioEffectivenessCorpus <- round((effectiveuniGramTokCorpus/as.numeric(numWordsCorpus))*100, digits = 1)

dfResults <- data.frame("Words" = c(numWordsCorpus),
                        "Tokens" = c(nrowsuniGramTokCorpus),
                        "Single Freq Tokens" = c(singleuniGramTok),
                        "Effective Tokens" = c(effectiveuniGramTokCorpus),
                        "Ratio Effective Tokens vs Words (%)" = c(ratioEffectivenessCorpus),
                        row.names = c("Corpus"),
                        check.names = FALSE)
kable(dfResults) #%>% kable_styling("striped")
          Words     Tokens   Single Freq Tokens   Effective Tokens   Ratio Effective Tokens vs Words (%)
Corpus    213,256   28,697   13,995               14,702             6.9

After rearranging the sample from the Blogs and News text sources and combining them into a single corpus, the results are:

  • The 213,256 corpus words produce 28,697 tokens. Of these, 13,995 are single-occurrence tokens (frequency of 1), which represents 48.8%; the remaining 14,702 tokens have a frequency greater than 1 and are therefore useful for the prediction algorithm.

Depicting plots

Once the tokenization process is complete, it is advisable to plot histograms of the unigrams and bigrams in order to check the results and plan the next steps in the development of the prediction algorithm.

plotuniGramTok <- ggplot(uniGramTok[1:15,], aes(x=reorder(Token, Frequency), y=Frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    labs(title = "Top Unigrams by frequency")
plotbiGramTok <- ggplot(biGramTok[1:15,], aes(x=reorder(Token,Frequency), y=Frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    labs(title = "Top Bigrams by frequency")
#Note: par(mfrow = c(1, 2)) does not apply to ggplot objects, so the plots are printed one after the other
print(plotuniGramTok)
print(plotbiGramTok)

Next Steps