Milestone Report of the Capstone project

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Review criteria:

-Does the link lead to an HTML page describing the exploratory analysis of the training data set?

-Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?

-Has the data scientist made basic plots, such as histograms to illustrate features of the data?

-Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

Getting and loading the data

We download the zip file from the given URL and unzip it.

setwd("D:/Data Science/Coursera/Final")

if (!file.exists("./data")) {
  dir.create("./data")
  fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(fileUrl, destfile = "./data/data.zip")
  unzip(zipfile = "./data/data.zip", exdir = "./data")
}

Then we read the English-language text files for blogs, news and Twitter.

##blogs

fileName <- "./data/final/en_US/en_US.blogs.txt"
con <- file(fileName, open = "r")
lineBlogs <- readLines(con)
close(con)

##news

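fileName <- "./data/final/en_US/en_US.news.txt"
## NOTE: the count below comes from this text-mode read. On Windows a
## text-mode connection can stop early at an embedded control character,
## so open = "rb" is the safer choice when the whole file is needed.
con <- file(fileName, open = "r")
lineNews <- readLines(con)
print(length(lineNews))
## [1] 77259
close(con)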

##twitter

fileName <- "./data/final/en_US/en_US.twitter.txt"
con <- file(fileName, open = "r")
lineTwit <- readLines(con)
print(length(lineTwit))
## [1] 2360148
close(con)

Now we can sample 2,000 lines from each of the three files and combine them into a single character vector that blends blogs, news and Twitter.

RandomBlogs <- sample(lineBlogs,2000)
RandomNews <- sample(lineNews,2000)
RandomTwit <- sample(lineTwit,2000)

##blending into a single character vector

random <- c(RandomBlogs,RandomNews,RandomTwit)

Basic summary

To get the number of words in each of the three files we use the stri_count_words function from the stringi package

library(stringi)
news.words <- stri_count_words(lineNews)
twit.words <- stri_count_words(lineTwit)
blogs.words <- stri_count_words(lineBlogs)

We can also use the stri_stats_general function to get more statistics for each file: the number of lines, the number of non-empty lines, the number of characters, and the number of non-whitespace characters.

news.stats <- stri_stats_general(lineNews)
twit.stat <- stri_stats_general(lineTwit)
blogs.stat <- stri_stats_general(lineBlogs)

Given these statistics we can build a data.frame that summarizes the three files

## stri_stats_general returns Lines, LinesNEmpty, Chars, CharsNWhite (in that order)
data.frame(Source = c("News", "Twitter", "Blogs"),
           Num_of_Words = c(sum(news.words), sum(twit.words), sum(blogs.words)),
           Num_of_Lines = c(news.stats[1], twit.stat[1], blogs.stat[1]),
           Lines_NonEmpty = c(news.stats[2], twit.stat[2], blogs.stat[2]),
           Num_of_Chars = c(news.stats[3], twit.stat[3], blogs.stat[3]),
           Chars_NonWhite = c(news.stats[4], twit.stat[4], blogs.stat[4]))
##    Source Num_of_Words Num_of_Lines Lines_NonEmpty Num_of_Chars Chars_NonWhite
## 1    News      2693898        77259          77259     15683765       13117038
## 2 Twitter     30218125      2360148        2360148    162384825      134370864
## 3   Blogs     38154238       899288         899288    208361438      171926076

Preprocessing

We break the blended text into sentences using the sent_detect function from the qdap package

library(qdap)
sent_doc <- sent_detect(random)

Now we can strip unnecessary information from the text by converting all letters to lower case and removing numbers, punctuation and extra whitespace. The cleaning functions come from the tm package.

library(tm)
sent_doc <- tolower(sent_doc)
sent_doc <- removeNumbers(sent_doc)
sent_doc <- removePunctuation(sent_doc)
sent_doc <- stripWhitespace(sent_doc)
sent_doc <- data.frame(sent_doc, stringsAsFactors = FALSE)
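To make the effect of the cleaning steps concrete, here is a minimal base-R sketch of the same pipeline applied to a single made-up sentence. The helper name clean_text is hypothetical, and the regular expressions are only an approximation of what the tm functions do, not their actual implementation.

```r
# Hypothetical helper: base-R approximation of the cleaning steps above.
clean_text <- function(x) {
  x <- tolower(x)                      # lower-case everything
  x <- gsub("[[:digit:]]+", "", x)     # drop numbers
  x <- gsub("[[:punct:]]+", "", x)     # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  x
}

# Made-up example sentence
cleaned <- clean_text("In 2015, the  Dog barked!")
print(cleaned)
```

Note that removing a number can leave two adjacent spaces behind, which is why the whitespace step runs last.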

Tokenization

The NGramTokenizer function (from the RWeka package) splits a string into n-grams, with control over the minimum and maximum n-gram length. By default the minimum is 1 and the maximum is 3.

library(RWeka)
grams <- NGramTokenizer(sent_doc)
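As an illustration of what n-gram tokenization produces, the following base-R sketch builds the n-grams of one short sentence by hand. The helper make_ngrams is hypothetical and is not part of RWeka; it only shows the idea of sliding a window of n words over the text.

```r
# Hypothetical helper (not part of RWeka): all n-grams of order `n`
# from a single sentence, using base R only.
make_ngrams <- function(sentence, n) {
  words <- strsplit(trimws(sentence), "\\s+")[[1]]
  if (length(words) < n) return(character(0))   # sentence too short
  starts <- seq_len(length(words) - n + 1)      # window start positions
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "the quick brown fox"   # made-up example
unigrams <- make_ngrams(sentence, 1)
bigrams  <- make_ngrams(sentence, 2)
trigrams <- make_ngrams(sentence, 3)
print(bigrams)
```

A sentence of k words yields k - n + 1 n-grams of order n, so longer n-grams are always fewer per sentence.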

We then split the result into single words, bigrams and trigrams

##counting the words in each element
wc <- word_count(grams)

##subsetting for words
the_Words <- grams[which(wc == 1)]
the_Words <- as.vector(the_Words)
my_Words <- table(the_Words)
my_Words <- my_Words[order(my_Words, decreasing = TRUE)]
##store in data frame
my_Words <- as.data.frame(my_Words)

##subsetting for bigrams
my_Bigrams <- grams[which(wc == 2)]
my_Bigrams <- table(my_Bigrams)
my_Bigrams <- my_Bigrams[order(my_Bigrams, decreasing = TRUE)]
##store in data frame
my_Bigrams <- as.data.frame(my_Bigrams)

##subsetting for trigrams
my_Trigrams <- grams[which(wc == 3)]
my_Trigrams <- table(my_Trigrams)
my_Trigrams <- my_Trigrams[order(my_Trigrams, decreasing = TRUE)]
##store in data frame
my_Trigrams <- as.data.frame(my_Trigrams)

Exploring Features of the Data

Now we can inspect the most frequent n-grams in the blended text. To draw a word cloud of the single words we do the following

library(wordcloud)
library(RColorBrewer)
wordcloud(the_Words, scale = c(5, 0.1), max.words = 100, random.order = FALSE,
          rot.per = 0.5, use.r.layout = FALSE, colors = brewer.pal(5, "Dark2"))

For the word cloud of the bigrams we get

wordcloud(words = my_Bigrams$my_Bigrams, freq = my_Bigrams$Freq, max.words=90, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(5, "Dark2"))

and for the trigrams

wordcloud(words = my_Trigrams$my_Trigrams, freq = my_Trigrams$Freq, max.words=60, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(5, "Dark2"))

Histograms

The histograms can also give useful insight. We show the frequencies of the 20 most common words, bigrams and trigrams.

library(ggplot2)

## The counts are precomputed, so geom_bar(stat = "identity") draws one bar per n-gram
ggplot(data = my_Words[1:20, ],
       aes(x = the_Words, y = Freq, fill = Freq)) +
       geom_bar(stat = "identity", col = "red") +
       ggtitle("Histogram for Words") +
       labs(x = "Word", y = "Frequency") +
       theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.position = "none")

ggplot(data = my_Bigrams[1:20, ],
       aes(x = my_Bigrams, y = Freq, fill = Freq)) +
       geom_bar(stat = "identity", col = "red") +
       ggtitle("Histogram for Bigrams") +
       labs(x = "Bigram", y = "Frequency") +
       theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.position = "none")

ggplot(data = my_Trigrams[1:20, ],
       aes(x = my_Trigrams, y = Freq, fill = Freq)) +
       geom_bar(stat = "identity", col = "red") +
       ggtitle("Histogram for Trigrams") +
       labs(x = "Trigram", y = "Frequency") +
       theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.position = "none")

Findings

From the above histograms it is evident that the frequency of a word declines steeply as we descend the frequency table. This is consistent with Zipf's law, which states that the frequency of any word is approximately inversely proportional to its rank in the frequency table (a power-law decline rather than an exponential one).
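One way to check how closely a frequency table follows Zipf's law is to regress log frequency on log rank: under the law the slope should be close to -1. The sketch below uses synthetic, made-up frequencies that follow the law exactly, so it only illustrates the method and does not report a result for our corpus.

```r
# Synthetic, made-up frequencies that follow f(r) = C / r exactly.
rank <- 1:1000
freq <- 50000 / rank

# On log-log axes Zipf's law is a straight line: log f = log C - s * log r
fit <- lm(log(freq) ~ log(rank))
slope <- unname(coef(fit)[2])
print(slope)
```

For the real data one could, as an assumption about how the table is stored, replace freq with my_Words$Freq and rank with seq_along(my_Words$Freq); a slope far from -1 would indicate a departure from the simple form of the law.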

As a consequence, we can cover most of the written language with a considerably smaller subset of words. For example, in our case we can cover 50% of the corpus with the following number of words

sumCover <- 0
sum_word_fr <- sum(my_Words$Freq)
for(i in 1:length(my_Words$Freq)) {
  sumCover <- sumCover + my_Words$Freq[i]
  if(sumCover >= 0.5*sum_word_fr){break}
}
##Number of words to cover 50% of the written language
print(i)
## [1] 139

while covering 90% of the corpus requires

sumCover <- 0
sum_word_fr <- sum(my_Words$Freq)
for(i in 1:length(my_Words$Freq)) {
  sumCover <- sumCover + my_Words$Freq[i]
  if(sumCover >= 0.9*sum_word_fr){break}
}
##Number of words to cover 90% of the written language
print(i)
## [1] 6551
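The two loops above can also be written as a single vectorized helper based on cumulative sums. This is only a sketch: the function name words_for_coverage is hypothetical and the counts are made up, so the numbers below are not those of our corpus.

```r
# Hypothetical helper: smallest number of top-ranked words whose
# cumulative frequency reaches the requested share of all tokens.
# `freq` must already be sorted in decreasing order.
words_for_coverage <- function(freq, share) {
  which(cumsum(freq) >= share * sum(freq))[1]
}

toy_freq <- c(40, 25, 15, 8, 7, 5)   # made-up counts, already sorted
n50 <- words_for_coverage(toy_freq, 0.5)
n90 <- words_for_coverage(toy_freq, 0.9)
print(c(n50, n90))
```

With the real table, words_for_coverage(my_Words$Freq, 0.5) should give the same answer as the first loop, since both stop at the first rank where the running total reaches the target.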

The total number of unique words in the corpus is

length(my_Words[,1])
## [1] 20001

which is considerably larger than the number of words needed to cover 90% of the corpus. This is a very important observation, since it will allow us to reduce the size of the n-gram frequency tables and speed up the app without sacrificing too much prediction accuracy.

Future plans

The computational efficiency of the various n-gram functions described in the literature must be evaluated further. Ideally, we want to produce the longest possible n-grams from the largest possible corpus. A few packages that claim efficient n-gram computation are tau, ngram and quanteda. There is also the possibility of building an efficient algorithm from scratch using Markov chains, but this would be time-consuming.

Ideally, we will produce frequency tables for each type of N-grams (word, bigram, trigram and potentially quadgram). Then, we will use these frequency tables to predict the next word, using the following algorithm.

-Take the last three words of the input text and look in the quadgram table for a quadgram whose first three words match them. If a match is found, return the last word of that quadgram as the predicted word.

-Else, repeat the same process with the last two words and compare with the trigram table.

-Else, repeat the same process with the last word and compare with the bigram table.

-Else, give the most frequently used word.
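The steps above can be sketched in base R as follows. Everything here is illustrative: the four frequency tables are tiny and made up, and the helper names lookup_next and predict_next are hypothetical, not part of the planned app.

```r
# Toy, made-up frequency tables (the real ones would come from the corpus).
quadgrams <- data.frame(ngram = c("thanks for the follow", "thanks for the help"),
                        Freq = c(5, 2), stringsAsFactors = FALSE)
trigrams  <- data.frame(ngram = c("one of the", "for the first"),
                        Freq = c(9, 4), stringsAsFactors = FALSE)
bigrams   <- data.frame(ngram = c("in a", "of the"),
                        Freq = c(20, 15), stringsAsFactors = FALSE)
unigrams  <- data.frame(ngram = c("the", "to"),
                        Freq = c(100, 60), stringsAsFactors = FALSE)

# Hypothetical helper: last word of the most frequent n-gram in `tbl`
# that starts with `prefix`, or NA when nothing matches.
lookup_next <- function(prefix, tbl) {
  hits <- tbl[startsWith(tbl$ngram, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$ngram[which.max(hits$Freq)]
  sub(".* ", "", best)   # keep only the text after the last space
}

# Hypothetical helper: back off from quadgrams to trigrams to bigrams,
# then fall back to the most frequent single word.
predict_next <- function(text) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  tables <- list(quadgrams, trigrams, bigrams)   # longest context first
  for (k in 3:1) {
    if (length(words) >= k) {
      prefix <- paste(tail(words, k), collapse = " ")
      guess <- lookup_next(prefix, tables[[4 - k]])
      if (!is.na(guess)) return(guess)
    }
  }
  unigrams$ngram[which.max(unigrams$Freq)]
}

print(predict_next("Thanks for the"))
```

This simple version always returns the single most frequent continuation at the longest matching context; it does not weight or combine evidence from the shorter n-grams.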

Finally, the look-up tables must also be evaluated in terms of efficiency. Since very rare n-grams are not useful for prediction, we must shorten the frequency tables to a point that balances computational cost against prediction accuracy.
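One simple way to shorten a frequency table is to drop every n-gram seen fewer than some minimum number of times and then check how much of the total count survives. The sketch below assumes a hypothetical helper prune_table and made-up counts; the right threshold for the real tables would have to be found by experiment.

```r
# Hypothetical helper: keep only n-grams seen at least `min_count` times.
prune_table <- function(tbl, min_count) {
  tbl[tbl$Freq >= min_count, , drop = FALSE]
}

# Made-up bigram counts
toy <- data.frame(ngram = c("of the", "in the", "purple zebra", "quantum llama"),
                  Freq = c(120, 95, 1, 1), stringsAsFactors = FALSE)

pruned <- prune_table(toy, min_count = 2)
kept_share <- sum(pruned$Freq) / sum(toy$Freq)   # share of occurrences retained
print(nrow(pruned))
```

In this toy case half of the rows are removed while almost all occurrences are kept, which is the kind of trade-off the pruning step is meant to exploit.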