The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
Review criteria:
-Does the link lead to an HTML page describing the exploratory analysis of the training data set?
-Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
-Has the data scientist made basic plots, such as histograms to illustrate features of the data?
-Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
We download the zip file from the given URL and unzip it.
setwd("D:/Data Science/Coursera/Final")
if (!file.exists("./data")) {
  dir.create("./data")
  fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  ## mode = "wb" keeps the zip archive intact when downloading on Windows
  download.file(fileUrl, destfile = "./data/data.zip", mode = "wb")
  unzip(zipfile = "./data/data.zip", exdir = "./data")
}
Then we read the English-language text files from blogs, news and Twitter.
## blogs
fileName <- "./data/final/en_US/en_US.blogs.txt"
con <- file(fileName, open = "r")
lineBlogs <- readLines(con)
close(con)
## news
## Note: the count below comes from a text-mode read. On Windows a text-mode read
## can stop early at an embedded end-of-file character, so it may be worth
## comparing the result with open = "rb".
fileName <- "./data/final/en_US/en_US.news.txt"
con <- file(fileName, open = "r")
lineNews <- readLines(con)
print(length(lineNews))
## [1] 77259
close(con)
## twitter
fileName <- "./data/final/en_US/en_US.twitter.txt"
con <- file(fileName, open = "r")
lineTwit <- readLines(con)
print(length(lineTwit))
## [1] 2360148
close(con)
Now we can sample 2000 lines from each of these three files and blend the blog, news and Twitter samples into a single character vector.
RandomBlogs <- sample(lineBlogs, 2000)
RandomNews <- sample(lineNews, 2000)
RandomTwit <- sample(lineTwit, 2000)
## blending the three samples into a single character vector
random <- c(RandomBlogs, RandomNews, RandomTwit)
To get the number of words in each of the three files we use stri_count_words from the stringi package.
library(stringi)
news.words <- stri_count_words(lineNews)
twit.words <- stri_count_words(lineTwit)
blogs.words <- stri_count_words(lineBlogs)
We can also use the stri_stats_general function to get more statistics for each file. It returns, in this order, the number of lines, the number of non-empty lines, the number of characters, and the number of non-white characters.
news.stats <- stri_stats_general(lineNews)
twit.stat <- stri_stats_general(lineTwit)
blogs.stat <- stri_stats_general(lineBlogs)
Given the previous statistics we can create a data.frame to inspect the basic summary of the given files. The second and fourth entries of stri_stats_general count non-empty lines and non-white characters, so the columns are named accordingly.
data.frame(Source = c("News", "Twitter", "Blogs"),
           Num_of_Words = c(sum(news.words), sum(twit.words), sum(blogs.words)),
           Num_of_Lines = c(news.stats[1], twit.stat[1], blogs.stat[1]),
           Lines_NonEmpty = c(news.stats[2], twit.stat[2], blogs.stat[2]),
           Num_of_Chars = c(news.stats[3], twit.stat[3], blogs.stat[3]),
           Chars_NonWhite = c(news.stats[4], twit.stat[4], blogs.stat[4]))
##    Source Num_of_Words Num_of_Lines Lines_NonEmpty Num_of_Chars Chars_NonWhite
## 1    News      2693898        77259          77259     15683765       13117038
## 2 Twitter     30218125      2360148        2360148    162384825      134370864
## 3   Blogs     38154238       899288         899288    208361438      171926076
We break the blended text into sentences using the sent_detect function from the qdap package.
library(qdap)
sent_doc <- sent_detect(random)
Now we can remove unnecessary information from the text by converting all letters to lowercase, removing numbers and punctuation, and collapsing extra whitespace. The cleaning functions come from the tm package.
library(tm)
sent_doc <- tolower(sent_doc)
sent_doc <- removeNumbers(sent_doc)
sent_doc <- removePunctuation(sent_doc)
sent_doc <- stripWhitespace(sent_doc)
sent_doc <- data.frame(sent_doc, stringsAsFactors = FALSE)
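The NGramTokenizer function from the RWeka package splits a string into n-grams, and we have control over the minimum and maximum n-gram length that will be produced. The default minimum is 1 and the default maximum is 3; we state them explicitly and pass the text column as a character vector.
library(RWeka)
grams <- NGramTokenizer(sent_doc$sent_doc, Weka_control(min = 1, max = 3))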
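To make concrete what an n-gram tokenizer produces, here is a minimal base-R sketch. The helper make_ngrams is hypothetical and is not part of RWeka; it only illustrates the idea on one cleaned sentence and is not meant to replace NGramTokenizer.

```r
# Hypothetical helper (not part of RWeka): build all n-grams of a single size
# from one already cleaned, space-separated sentence.
make_ngrams <- function(sentence, n) {
  tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]               # drop empty tokens
  if (length(tokens) < n) return(character(0))   # sentence too short
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

example_sentence <- "the cat sat on the mat"
unigrams <- make_ngrams(example_sentence, 1)
bigrams  <- make_ngrams(example_sentence, 2)
trigrams <- make_ngrams(example_sentence, 3)
print(bigrams)
```

A sentence of six words yields six words, five bigrams and four trigrams, which is why the combined output can then be separated again by counting the words in each element.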
We will split the result into words, bigrams and trigrams.
## count the words in each element
wc <- word_count(grams)
## subset the single words
the_Words <- grams[which(wc == 1)]
the_Words <- as.vector(the_Words)
my_Words <- table(the_Words)
my_Words <- my_Words[order(my_Words, decreasing = TRUE)]
## store in a data frame
my_Words <- as.data.frame(my_Words)
## subset the bigrams
my_Bigrams <- grams[which(wc == 2)]
my_Bigrams <- table(my_Bigrams)
my_Bigrams <- my_Bigrams[order(my_Bigrams, decreasing = TRUE)]
## store in a data frame
my_Bigrams <- as.data.frame(my_Bigrams)
## subset the trigrams
my_Trigrams <- grams[which(wc == 3)]
my_Trigrams <- table(my_Trigrams)
my_Trigrams <- my_Trigrams[order(my_Trigrams, decreasing = TRUE)]
## store in a data frame
my_Trigrams <- as.data.frame(my_Trigrams)
Now we can inspect the most used n-grams in the blended text. To draw the word cloud for single words we do the following.
library(wordcloud)
library(RColorBrewer)
wordcloud(the_Words, scale = c(5, 0.1), max.words = 100, random.order = FALSE,
          rot.per = 0.5, use.r.layout = FALSE, colors = brewer.pal(5, "Dark2"))
For the bigram word cloud we use the frequency table directly.
wordcloud(words = my_Bigrams$my_Bigrams, freq = my_Bigrams$Freq, max.words = 90, random.order = FALSE,
          rot.per = 0.5, use.r.layout = FALSE, colors = brewer.pal(5, "Dark2"))
The trigram word cloud is produced in the same way.
wordcloud(words = my_Trigrams$my_Trigrams, freq = my_Trigrams$Freq, max.words = 60, random.order = FALSE,
          rot.per = 0.5, use.r.layout = FALSE, colors = brewer.pal(5, "Dark2"))
Histograms of the frequencies also give useful insight. We plot the 20 most frequent words, bigrams and trigrams. Because the counts are already computed, geom_col draws one bar per n-gram.
library(ggplot2)
ggplot(data = my_Words[1:20, ],
       aes(x = the_Words, y = Freq, fill = Freq)) +
  geom_col(col = "red") +
  ggtitle("Histogram for Words") +
  labs(x = "Word", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.position = "none")
ggplot(data = my_Bigrams[1:20, ],
       aes(x = my_Bigrams, y = Freq, fill = Freq)) +
  geom_col(col = "red") +
  ggtitle("Histogram for Bigrams") +
  labs(x = "Bigram", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.position = "none")
ggplot(data = my_Trigrams[1:20, ],
       aes(x = my_Trigrams, y = Freq, fill = Freq)) +
  geom_col(col = "red") +
  ggtitle("Histogram for Trigrams") +
  labs(x = "Trigram", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.position = "none")
From the above histograms it is evident that the frequency of a word declines steeply as we descend the frequency table. This is consistent with Zipf's law, which states that the frequency of any word is approximately inversely proportional to its rank in the frequency table. Strictly speaking this is a power-law decline rather than an exponential one.
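One rough way to check how closely a frequency table follows Zipf's law is to regress log frequency on log rank: under the law the slope should be close to -1. The sketch below uses synthetic counts generated as 1000 / rank, which is an assumption made purely for illustration and is not the report's data.

```r
# Synthetic, illustrative counts (not the report's data): frequency = 1000 / rank.
rank <- 1:200
freq <- 1000 / rank

# Under Zipf's law, log(freq) is linear in log(rank) with a slope close to -1.
fit <- lm(log(freq) ~ log(rank))
zipf_slope <- unname(coef(fit)[2])
print(round(zipf_slope, 2))
```

With the tables built earlier, freq would be my_Words$Freq and rank would be seq_along(freq); a slope far from -1 would suggest the sample deviates from the idealised law.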
As a consequence, we can cover most of the written language with a considerably smaller subset of words. For example, in our case we can cover 50% of the corpus with the following number of words.
sumCover <- 0
sum_word_fr <- sum(my_Words$Freq)
for(i in 1:length(my_Words$Freq)) {
sumCover <- sumCover + my_Words$Freq[i]
if(sumCover >= 0.5*sum_word_fr){break}
}
##Number of words to cover 50% of the written language
print(i)
## [1] 139
while 90% of the corpus is covered with
sumCover <- 0
sum_word_fr <- sum(my_Words$Freq)
for(i in 1:length(my_Words$Freq)) {
sumCover <- sumCover + my_Words$Freq[i]
if(sumCover >= 0.9*sum_word_fr){break}
}
##Number of words to cover 90% of the written language
print(i)
## [1] 6551
The total number of unique words in the corpus is
length(my_Words[,1])
## [1] 20001
which is considerably larger than the number of words needed to cover 90% of the corpus. This is a very important observation, since it will allow us to reduce the size of the n-gram frequency tables in order to speed up the app without sacrificing too much prediction accuracy.
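The two coverage loops can also be expressed with a cumulative sum. The helper below, words_for_coverage, is a hypothetical name introduced here, and the counts are toy values rather than the report's data; it is a sketch of the same calculation, assuming the frequencies are sorted in decreasing order.

```r
# Hypothetical helper: smallest number of top-ranked words whose counts add up
# to at least `target` (a fraction between 0 and 1) of all word occurrences.
# `freq` must be sorted in decreasing order, as my_Words$Freq is.
words_for_coverage <- function(freq, target) {
  coverage <- cumsum(freq) / sum(freq)
  which(coverage >= target)[1]
}

# Toy counts for illustration only (not the report's data).
toy_freq <- c(40, 25, 15, 8, 6, 4, 2)
n_half   <- words_for_coverage(toy_freq, 0.5)
n_ninety <- words_for_coverage(toy_freq, 0.9)
print(c(n_half, n_ninety))
```

Applied to the report's table, words_for_coverage(my_Words$Freq, 0.5) and words_for_coverage(my_Words$Freq, 0.9) should reproduce the loop results, and the same function can be reused for any other coverage level.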
The computational efficiency of the various functions available for calculating n-grams must be evaluated further. Ideally, we want to produce the longest possible n-grams for the biggest possible corpus. A few packages that claim efficient computation of n-grams are tau, ngram and quanteda. There is also the possibility of building an efficient algorithm from scratch using Markov chains, but this would be time consuming.
Ideally, we will produce frequency tables for each type of n-gram (word, bigram, trigram and potentially quadgram). Then we will use these frequency tables to predict the next word, using the following algorithm.
-Take the last three words of the input text and compare them with the quadgram table to find a quadgram whose first three words match. If there is a match, give the last word of the most frequent matching quadgram as the predicted word.
-Else, repeat the same process with the last two words and compare with the trigram table.
-Else, repeat the same process with the last word and compare with the bigram table.
-Else, give the most frequently used word.
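The steps above can be sketched as a simple back-off lookup. Everything in this example is an assumption made for illustration: the tiny tables, the column names prefix and nxt, and the function name predict_next are hypothetical, and the real tables would be built from the corpus and sorted by decreasing frequency.

```r
# Toy lookup tables (illustrative only), each assumed sorted by decreasing frequency.
# `prefix` holds the leading words of the n-gram, `nxt` the word that follows them.
quadgrams <- data.frame(prefix = c("thanks for the"), nxt = c("follow"),
                        stringsAsFactors = FALSE)
trigrams  <- data.frame(prefix = c("one of", "a lot"), nxt = c("the", "of"),
                        stringsAsFactors = FALSE)
bigrams   <- data.frame(prefix = c("of", "happy"), nxt = c("course", "birthday"),
                        stringsAsFactors = FALSE)
top_word  <- "the"   # fallback: most frequent single word

predict_next <- function(text) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  tables <- list(quadgrams, trigrams, bigrams)   # longest context first
  for (k in 3:1) {
    if (length(words) < k) next                  # not enough words for this context
    key <- paste(tail(words, k), collapse = " ")
    tab <- tables[[4 - k]]
    hit <- tab$nxt[tab$prefix == key]
    if (length(hit) > 0) return(hit[1])          # first row = most frequent match
  }
  top_word                                       # nothing matched at any level
}

print(predict_next("Thanks for the"))
print(predict_next("this is one of"))
print(predict_next("sort of"))
print(predict_next("xyzzy"))
```

Storing the context and the predicted word in separate columns keeps each step a plain equality lookup, which is one possible design; a keyed structure would likely be faster on full-size tables.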
Finally, the look-up tables that will be produced must also be evaluated in terms of efficiency. Since very rare n-grams are not useful for prediction, we must reduce the length of the frequency tables to a point that balances computational complexity against prediction accuracy.
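One simple way to shorten a frequency table is to drop n-grams seen fewer than a chosen number of times. The sketch below uses a toy table and a hypothetical helper, prune_table; the threshold of 2 is an arbitrary assumption, and the right value would have to be chosen by measuring accuracy and speed on held-out text.

```r
# Toy bigram frequency table (illustrative only), sorted by decreasing count.
toy_table <- data.frame(
  ngram = c("of the", "in the", "to be", "whim of", "zebra crossing"),
  Freq  = c(120, 95, 40, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only n-grams seen at least `min_count` times.
prune_table <- function(tab, min_count) {
  tab[tab$Freq >= min_count, , drop = FALSE]
}

pruned <- prune_table(toy_table, min_count = 2)

# Share of all n-gram occurrences still represented after pruning.
kept_share <- sum(pruned$Freq) / sum(toy_table$Freq)
print(nrow(pruned))
print(round(kept_share, 3))
```

Comparing the number of rows removed with the share of occurrences retained gives a first indication of how much table size can be saved for a given loss of coverage.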