Milestone Report

Instructions for this report

The goal of this project is to:

Demonstrate that you've downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

Loading the data

SwiftKey provided a set of files obtained by mining text from Twitter, blogs, and news websites. The data is available in four languages, but I'm working with the English version only. The three corresponding files are loaded below.

# loading twitter data
twitter <- readLines(con <- file("final/en_US/en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

# loading blogs data
blogs <- readLines(con <- file("final/en_US/en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

# loading news data
news <- readLines(con <- file("final/en_US/en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

Summary statistics

The three datasets can be summarized in terms of file size (in MB), number of lines, number of words, number of characters, and maximum line length, as follows:

##         FileSize LineCount WordCount CharCount MaxLineLength
## twitter 159.3641   2360148  30373605 162096241           140
## blogs   200.4242    899288  37334149 206824382         40833
## news    196.2775   1010242  34372814 203223154         11384
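
The code that produced this table is not shown in the report. A minimal sketch of how such a summary could be computed in base R follows. The helper name summarize_lines and the whitespace-based word count are assumptions, so word counts obtained this way may differ from those in the table, depending on the tokenizer that was actually used.

```r
# Hypothetical helper: summarize a character vector of text lines.
# Words are counted with a simple whitespace split (an assumption).
summarize_lines <- function(lines) {
  chars <- nchar(lines)
  words <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    LineCount     = length(lines),
    WordCount     = sum(words),
    CharCount     = sum(chars),
    MaxLineLength = max(chars)
  )
}

# Toy input standing in for twitter, blogs, or news
toy <- c("hello world", "a slightly longer line of text")
summarize_lines(toy)

# The file size in MB can be read from disk, for example:
# file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
```

Applying the helper to twitter, blogs, and news and combining the results with rbind would give one row per dataset.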

Cleaning the data

Because the full datasets are too large to explore comfortably, I'm computing the basic exploratory statistics on a small sample (1,000 lines) of each dataset. For the modeling step, this issue will likely be solved by reading the data in blocks.

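### packages: qdap provides sent_detect; tm provides removeNumbers,
### removePunctuation and stripWhitespace
library(qdap)
library(tm)

### sampling data
### detect and split sentences on endmark boundaries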
set.seed(20200601)
sampleTwitter = sample(twitter, 1000)
sentTwitter = sent_detect(sampleTwitter)
set.seed(20200601)
sampleBlogs = sample(blogs, 1000)
sentBlogs = sent_detect(sampleBlogs)
set.seed(20200601)
sampleNews = sample(news, 1000)
sentNews = sent_detect(sampleNews)

## putting all data together
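## stringsAsFactors = FALSE keeps text as character, so nchar() also works in R < 4.0
df_twitter = data.frame(source = rep("twitter", length(sentTwitter)), 
                        text = sentTwitter, stringsAsFactors = FALSE)

df_blogs = data.frame(source = rep("blog", length(sentBlogs)), 
                        text = sentBlogs, stringsAsFactors = FALSE)

df_news = data.frame(source = rep("news", length(sentNews)), 
                        text = sentNews, stringsAsFactors = FALSE)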

df_all = rbind(df_twitter, df_blogs, df_news)
df_all$source = factor(df_all$source)
df_all$textLength = nchar(df_all$text)

# remove what we do not need
df_all$text = removeNumbers(df_all$text)
df_all$text = removePunctuation(df_all$text)
df_all$text = stripWhitespace(df_all$text)
df_all$text = tolower(df_all$text)
df_all = df_all[which(df_all$text!=""), ]

Interesting findings

The sampled sentences have the following distributions of sentence length:

These distributions show some patterns that are useful for the model-building step. Sentence length does not follow a normal distribution in any of the datasets. Blogs tend to have the longest sentences, followed by news and lastly by Twitter. The peak of sentence length is lower for Twitter than for blogs and news.
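
The plotting code is not shown in the report. As a rough sketch, the sentence-length distributions could be summarized and drawn in base R as below. It assumes a data frame with the source and textLength columns built for df_all in the cleaning step; the toy values are hypothetical and only stand in for the real sample.

```r
# Toy stand-in for df_all: one row per sentence, with its source and length
toy <- data.frame(
  source     = factor(c("blog", "blog", "news", "news", "twitter", "twitter")),
  textLength = c(180, 95, 120, 110, 40, 65)
)

# Numeric summary of sentence length per source
length_summary <- tapply(toy$textLength, toy$source, summary)
print(length_summary)

# One histogram per source, stacked vertically on a shared x-axis
op <- par(mfrow = c(nlevels(toy$source), 1))
for (s in levels(toy$source)) {
  hist(toy$textLength[toy$source == s],
       xlim = range(toy$textLength),
       main = s, xlab = "Sentence length (characters)")
}
par(op)
```

Sharing the x-axis across panels makes it easier to compare where each source peaks and how long its tail is.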

Additionally, building the predictive model requires the n-gram frequency distributions:

# Making ordered data frames of 1-grams, 2-grams, 3-grams
# WordTokenizer and NGramTokenizer come from the RWeka package
library(RWeka)
allData = c(sampleTwitter, sampleBlogs, sampleNews)
txt = sent_detect(allData)

txt <- removeNumbers(txt)
txt <- removePunctuation(txt)
txt <- stripWhitespace(txt)
txt <- tolower(txt)
txt <- txt[which(txt!="")]
txt <- data.frame(txt,stringsAsFactors = FALSE)

words = WordTokenizer(txt)
grams = NGramTokenizer(txt)

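# grams is assumed to list all 3-grams first, then 2-grams, then 1-grams:
# i = index of the first 2-gram, j = index of the first 1-gram
for(i in seq_along(grams))
    {if(length(WordTokenizer(grams[i]))==2) break}
for(j in seq_along(grams))
    {if(length(WordTokenizer(grams[j]))==1) break}

# how frequently do certain words or pairs of words appear in the dataset?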
onegrams <- data.frame(table(words))
onegrams <- onegrams[order(onegrams$Freq, decreasing = TRUE),]
bigrams <- data.frame(table(grams[i:(j-1)]))
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE),]
trigrams <- data.frame(table(grams[1:(i-1)]))
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE),]
remove(i,j,grams)

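The n-gram counts will be used to estimate the probability that a given word follows a preceding word or pair of words. The frequency quantiles show that most n-grams occur only once in the sample, while a few are very frequent.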

quantile(onegrams$Freq)

##   0%  25%  50%  75% 100% 
##    1    1    1    3 4066

quantile(bigrams$Freq)

##   0%  25%  50%  75% 100% 
##    1    1    1    1  382

quantile(trigrams$Freq)

##   0%  25%  50%  75% 100% 
##    1    1    1    1   36
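
As a sketch of how these counts could feed the model, the probability of a word given the preceding word can be estimated as the count of that bigram divided by the total count of all bigrams that start with the same word. The toy counts and the function name next_word_probs below are hypothetical; the data frame mirrors the Var1/Freq layout that data.frame(table(...)) produces for the bigrams.

```r
# Toy bigram counts in the Var1/Freq layout produced by data.frame(table(...))
toy_bigrams <- data.frame(
  Var1 = c("i am", "i have", "i was", "you are"),
  Freq = c(6, 3, 1, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: estimate P(next word | previous word) from bigram counts
next_word_probs <- function(bigrams, prev) {
  parts  <- strsplit(as.character(bigrams$Var1), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  keep   <- first == prev
  if (!any(keep)) return(numeric(0))
  # relative frequency among all bigrams starting with `prev`
  probs <- bigrams$Freq[keep] / sum(bigrams$Freq[keep])
  names(probs) <- second[keep]
  sort(probs, decreasing = TRUE)
}

next_word_probs(toy_bigrams, "i")
```

The same idea would extend to trigrams by conditioning on the first two words instead of the first one. Unseen contexts return an empty vector here, so a real model would need some fallback for them.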

The most frequent 1-grams, 2-grams, and 3-grams are highlighted in the following word clouds.

Plans for the prediction algorithm

The main goal of this project is to build a model that predicts the words following a user's input in a Shiny application. The challenge is to build an accurate model that is also fast and has low memory demands. Training and testing will therefore focus on tuning the model's parameters under these constraints, even at the cost of some predictive power.

I will explore other sample sizes and tools for reading the data in blocks. I also need to explore other tools for tokenization and for building dictionaries of multiple n-gram orders.
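
One possible way to read the data in blocks, sketched here in base R, is to keep a connection open and call readLines repeatedly with a fixed n. The helper name process_in_blocks and the block size are assumptions for illustration, not a settled design.

```r
# Hypothetical helper: apply `fun` to a text file in blocks of `block_size` lines
process_in_blocks <- function(path, fun, block_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  repeat {
    # on an open connection, readLines continues where the last call stopped
    block <- readLines(con, n = block_size, encoding = "UTF-8", skipNul = TRUE)
    if (length(block) == 0) break
    results[[length(results) + 1]] <- fun(block)
  }
  results
}

# Toy usage: count the lines in each block of a temporary 25-line file
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:25), tmp)
block_sizes <- unlist(process_in_blocks(tmp, length, block_size = 10))
block_sizes
```

In practice, fun could tokenize each block and return n-gram counts, which would then be merged across blocks so that the full file never has to sit in memory at once.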