The goal of this project is to:
Demonstrate that you've downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.
SwiftKey provided a set of files obtained by data mining Twitter, blogs, and news websites. The data is available in four languages, but I'm working with the English version only. The three corresponding files are loaded below.
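# packages assumed from the functions used below: qdap (sent_detect),
# tm (text cleaning), RWeka (tokenizers)
library(qdap)
library(tm)
library(RWeka)
# loading twitter data
con <- file("final/en_US/en_US.twitter.txt", open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# loading blogs data
con <- file("final/en_US/en_US.blogs.txt", open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# loading news data
con <- file("final/en_US/en_US.news.txt", open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)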
The three datasets can be summarized in terms of file size (in MB), number of lines, number of words, number of characters, and maximum line length, as follows:
##         FileSize LineCount WordCount CharCount MaxLineLength
## twitter 159.3641   2360148  30373605 162096241           140
## blogs   200.4242    899288  37334149 206824382         40833
## news    196.2775   1010242  34372814 203223154         11384
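The code that produced this table is not shown in the report. As a rough sketch of how such a summary could be computed in base R, the helper below assumes that a "word" is any run of non-whitespace characters and that file size is reported as bytes divided by 1024^2. The names `summarise_file`, `demo_path`, and `demo_summary` are hypothetical, and the word counts in the table above may have been obtained with a different definition.

```r
# Hypothetical helper (not the original code): summarise one text file.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  n_chars <- nchar(lines)
  # Assumption: a "word" is a run of non-whitespace characters.
  n_words <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    FileSize      = file.info(path)$size / 1024^2,  # bytes -> MB
    LineCount     = length(lines),
    WordCount     = sum(n_words),
    CharCount     = sum(n_chars),
    MaxLineLength = max(n_chars)
  )
}

# Tiny demonstration on a temporary file instead of the real corpus.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("hello world", "a b c", ""), demo_path)
demo_summary <- summarise_file(demo_path)
print(demo_summary)
```

Calling the helper once per file and combining the rows with `rbind` would give a table of the same shape as the one above.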
Because the full datasets are too large to process comfortably, I'm computing the basic exploratory statistics on small samples (1,000 lines from each dataset). For the modeling step, this issue will likely be solved by reading the data in blocks.
### sampling data
### detect and split sentences on endmark boundaries
set.seed(20200601)
sampleTwitter = sample(twitter, 1000)
sentTwitter = sent_detect(sampleTwitter)
set.seed(20200601)
sampleBlogs = sample(blogs, 1000)
sentBlogs = sent_detect(sampleBlogs)
set.seed(20200601)
sampleNews = sample(news, 1000)
sentNews = sent_detect(sampleNews)
## putting all data together
df_twitter = data.frame(source = rep("twitter", length(sentTwitter)),
                        text = sentTwitter, stringsAsFactors = FALSE)
df_blogs = data.frame(source = rep("blog", length(sentBlogs)),
                      text = sentBlogs, stringsAsFactors = FALSE)
df_news = data.frame(source = rep("news", length(sentNews)),
                     text = sentNews, stringsAsFactors = FALSE)
df_all = rbind(df_twitter, df_blogs, df_news)
df_all$source = factor(df_all$source)
# sentence length in characters, measured before cleaning
df_all$textLength = nchar(df_all$text)
# remove what we do not need: digits, punctuation, extra whitespace, capitalization
df_all$text = removeNumbers(df_all$text)
df_all$text = removePunctuation(df_all$text)
df_all$text = stripWhitespace(df_all$text)
df_all$text = tolower(df_all$text)
# drop sentences that are empty after cleaning
df_all = df_all[df_all$text != "", ]
The sampled data have the following distributions of sentence length:
These distributions show some interesting patterns that are useful for the model building step. Sentence length does not follow a normal distribution in any of the datasets. Blogs tend to have the longest sentences, followed by news and lastly by Twitter. The sentence-length distribution peaks at a lower value for Twitter than for blogs and news.
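The plotting code is not included in the report. A minimal base R sketch of how sentence length could be compared across sources is given here, run on a toy data frame with the same `source` and `textLength` columns as `df_all`. The values in `toy` are invented for illustration and are not taken from the real sample.

```r
# Toy stand-in for df_all (hypothetical values, not the real sample).
toy <- data.frame(
  source = factor(c("twitter", "twitter", "twitter",
                    "blog", "blog", "news", "news")),
  textLength = c(20, 35, 50, 120, 200, 90, 150)
)

# Median sentence length per source.
median_by_source <- tapply(toy$textLength, toy$source, median)
print(median_by_source)

# One histogram per source, stacked vertically on a shared x-axis
# so that the positions of the peaks can be compared directly.
old_par <- par(mfrow = c(nlevels(toy$source), 1))
for (s in levels(toy$source)) {
  hist(toy$textLength[toy$source == s],
       xlim = range(toy$textLength),
       main = s, xlab = "Sentence length (characters)")
}
par(old_par)
```

Replacing `toy` with `df_all` should give one panel per source for the sampled sentences.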
Additionally, for building the predictive model, it's necessary to obtain the n-gram frequency distributions:
# Making ordered data frames of 1-grams, 2-grams, 3-grams
allData = c(sampleTwitter, sampleBlogs, sampleNews)
txt = sent_detect(allData)
# same cleaning steps as for the sentence-length data
txt <- removeNumbers(txt)
txt <- removePunctuation(txt)
txt <- stripWhitespace(txt)
txt <- tolower(txt)
txt <- txt[which(txt != "")]
txt <- data.frame(txt, stringsAsFactors = FALSE)
words = WordTokenizer(txt)
grams = NGramTokenizer(txt)
# the indexing below assumes grams lists all 3-grams first, then 2-grams, then 1-grams
# i = index of the first 2-gram
for (i in seq_along(grams)) {
  if (length(WordTokenizer(grams[i])) == 2) break
}
# j = index of the first 1-gram
for (j in seq_along(grams)) {
  if (length(WordTokenizer(grams[j])) == 1) break
}
# how frequently do certain words or sequences of words appear in the dataset?
onegrams <- data.frame(table(words))
onegrams <- onegrams[order(onegrams$Freq, decreasing = TRUE), ]
bigrams <- data.frame(table(grams[i:(j - 1)]))
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE), ]
trigrams <- data.frame(table(grams[1:(i - 1)]))
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE), ]
remove(i, j, grams)
The n-gram frequencies will be used to estimate the probability that a given word follows a single word or a pair of words. The quantiles of these frequencies show that most n-grams occur only once in the sample:
quantile(onegrams$Freq)
##   0%  25%  50%  75% 100%
##    1    1    1    3 4066
quantile(bigrams$Freq)
##   0%  25%  50%  75% 100%
##    1    1    1    1  382
quantile(trigrams$Freq)
##   0%  25%  50%  75% 100%
##    1    1    1    1   36
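To illustrate how frequency tables like these could be turned into next-word probabilities, here is a minimal sketch using a maximum-likelihood estimate, P(second | first) = count(first second) / count(first followed by anything). This is one simple option and not necessarily the model I will end up using. The toy counts and the names `toy_grams`, `toy_bigrams`, and `predict_next` are hypothetical.

```r
# Toy bigram counts (hypothetical), built the same way as the bigrams table.
toy_grams <- c("i am", "i have", "i am", "you are", "i am", "i was")
toy_bigrams <- data.frame(table(toy_grams))
names(toy_bigrams) <- c("gram", "Freq")
toy_bigrams$gram <- as.character(toy_bigrams$gram)

# Split each bigram into its first word (context) and second word (prediction).
parts <- strsplit(toy_bigrams$gram, " ", fixed = TRUE)
toy_bigrams$first  <- vapply(parts, `[`, character(1), 1)
toy_bigrams$second <- vapply(parts, `[`, character(1), 2)

# Maximum-likelihood estimate: bigram count divided by the total count
# of all bigrams that start with the same first word.
context_total <- ave(toy_bigrams$Freq, toy_bigrams$first, FUN = sum)
toy_bigrams$prob <- toy_bigrams$Freq / context_total

# Hypothetical helper: the n most likely words after `word`.
predict_next <- function(word, model, n = 3) {
  hits <- model[model$first == word, ]
  head(hits[order(hits$prob, decreasing = TRUE), "second"], n)
}

print(toy_bigrams)
print(predict_next("i", toy_bigrams))
```

Because most n-grams in the sample occur only once, an estimate like this would be unreliable for rare contexts, which is one reason to consider smoothing or backing off to shorter n-grams.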
The most frequent 1-grams, 2-grams, and 3-grams are highlighted in the following word clouds.
Given that the main goal of this project is to build a model that predicts the words following an input in a Shiny application, the challenge is to build an accurate model that is also fast and has low memory requirements. Therefore, the training and test steps should focus on optimizing the model's parameters under these constraints, even at some cost in predictive power.
Next, I will explore other sample sizes and tools for reading the data in blocks. I also need to explore other tools for tokenization and for building dictionaries covering multiple n-gram orders.
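As a starting point for reading the data in blocks, the sketch below keeps a connection open and repeatedly calls `readLines` with a fixed `n`, so that only one block is in memory at a time. The helper name `process_in_blocks`, its arguments, and the block size are assumptions made for illustration. It is demonstrated on a small temporary file and not on the real corpus.

```r
# Hypothetical helper: apply `fun` to successive blocks of `block_size` lines.
process_in_blocks <- function(path, block_size, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))  # make sure the connection is closed on exit
  results <- list()
  repeat {
    block <- readLines(con, n = block_size, encoding = "UTF-8",
                       skipNul = TRUE, warn = FALSE)
    if (length(block) == 0) break  # end of file reached
    results[[length(results) + 1]] <- fun(block)
  }
  results
}

# Demonstration: a 10-line file read in blocks of 4 lines.
demo_path <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:10), demo_path)
block_sizes <- unlist(process_in_blocks(demo_path, block_size = 4, fun = length))
print(block_sizes)
```

In practice `fun` could tokenize each block and return its n-gram counts, which would then be summed across blocks.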