The goal of this project is to carry out text mining and exploratory analysis in preparation for the final data science capstone project, whose aim is to build a next-word prediction algorithm. This document gives an overview of the major features of the data and briefly summarizes the plans for creating the prediction algorithm and Shiny app.
There are three data sources, all in English: blogs, news and Twitter. The three files are read into one list and an overview of all files is produced. Twitter has the highest number of lines (over 2 million), while blogs have the highest number of words (over 37 million).
| file_name | file_size | number_of_lines | number_of_characters | number_of_words | longest_entry |
|---|---|---|---|---|---|
| blogs | 255.4 Mb | 899288 | 206824505 | 37546239 | 40833 |
| news | 257.3 Mb | 1010242 | 203223159 | 34762395 | 11384 |
| twitter | 319 Mb | 2360148 | 162096031 | 30093372 | 140 |
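The columns of this overview can be computed from a character vector holding one entry per line. The following is a minimal sketch in base R, not the code used for the table: `summarise_lines` is a hypothetical helper, and the word count is a simple whitespace split, which may differ from the method behind the figures above.

```r
# Hypothetical helper: summarise a character vector with one entry per line.
summarise_lines <- function(lines) {
  # Split on runs of whitespace to approximate word tokens (an assumption)
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  data.frame(
    number_of_lines      = length(lines),
    number_of_characters = sum(nchar(lines)),
    number_of_words      = sum(lengths(tokens)),
    longest_entry        = max(nchar(lines))
  )
}

# Toy input standing in for readLines() on one of the source files
toy_lines <- c("I cant wait to see you", "happy new year", "ok")
overview <- summarise_lines(toy_lines)
print(overview)
```

In practice the same helper would be applied to each of the three files and the results bound into one table.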
Because each dataset is very large, we use sampled data for the analysis and for building the prediction algorithm. A 1% sample is drawn, a corpus is created, and the corpus is then cleaned by removing non-ASCII characters, URLs, numbers, punctuation and extra white space, converting all letters to lower case and converting the result to plain text format.
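The sampling and cleaning steps can be illustrated with base R alone. This is a sketch under stated assumptions rather than the pipeline used in this report: `sample_lines` and `clean_text` are hypothetical helpers, the seed is arbitrary, and the regular expressions are one plausible way to implement each cleaning step.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
sample_lines <- function(lines, fraction = 0.01, seed = 123) {
  set.seed(seed)  # assumption: any fixed seed would do
  n <- max(1, floor(length(lines) * fraction))
  lines[sample(length(lines), n)]
}

# Hypothetical helper: apply the cleaning steps in the order listed above.
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")   # drop non-ASCII characters
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)    # remove URLs
  x <- gsub("[0-9]+", " ", x)                             # remove numbers
  x <- gsub("[[:punct:]]+", "", x)                        # remove punctuation
  x <- tolower(x)                                         # convert to lower case
  trimws(gsub("[[:space:]]+", " ", x))                    # collapse extra white space
}

# Toy data standing in for the real files
sampled <- sample_lines(as.character(1:1000), fraction = 0.01)
cleaned <- clean_text("Can't WAIT to see you!!  Visit http://example.com at 10am")
print(length(sampled))
print(cleaned)
```

Removing URLs before punctuation matters here: once the punctuation is stripped, a URL can no longer be recognised by its `://` or `www.` prefix.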
In natural language processing, an n-gram is a contiguous sequence of n items from a given sequence of text. We create one-word (unigram), two-word (bigram), three-word (trigram) and four-word (quadgram) combinations using the RWeka package. We create n-grams for one combined dictionary as well as for three separate ones (blogs, news and Twitter), as words and phrases might differ depending on the text source.
library(tm)     # TermDocumentMatrix, findFreqTerms
library(RWeka)  # NGramTokenizer, Weka_control

# Build a term-document matrix of n-grams of exactly n words
CreateNgram <- function(corp, n) {
  Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  TermDocumentMatrix(corp, control = list(tokenize = Tokenizer))
}

# N-grams for the combined corpus
nGram1_comb <- CreateNgram(corp_unlist, 1)
nGram2_comb <- CreateNgram(corp_unlist, 2)
nGram3_comb <- CreateNgram(corp_unlist, 3)
nGram4_comb <- CreateNgram(corp_unlist, 4)

# N-grams for each text source separately (blogs, news, twitter)
nGram1 <- list()
nGram2 <- list()
nGram3 <- list()
nGram4 <- list()
for (i in seq_along(corp)) {
  nGram1[[i]] <- CreateNgram(corp[i], 1)
  nGram2[[i]] <- CreateNgram(corp[i], 2)
  nGram3[[i]] <- CreateNgram(corp[i], 3)
  nGram4[[i]] <- CreateNgram(corp[i], 4)
}
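To make the definition of an n-gram concrete, here is a small base R illustration of what an n-gram tokenizer produces for a single cleaned sentence. It is only a sketch: `make_ngrams` is a hypothetical helper and is not part of RWeka or tm.

```r
# Hypothetical helper: return all contiguous n-word sequences in one string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (length(words) < n) return(character(0))  # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- make_ngrams("cant wait to see you", 2)
trigrams <- make_ngrams("cant wait to see you", 3)
print(bigrams)
print(trigrams)
```

A sentence of five words yields four bigrams and three trigrams, which is why longer n-grams are sparser and need lower frequency thresholds.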
We then find the frequency of terms in each of these n-grams and construct dataframes of these frequencies.
library(dplyr)  # provides the %>% pipe

# Frequencies of terms that occur at least `lowlimit` times, in decreasing order
TopFreq <- function(x, lowlimit = 10) {
  x[findFreqTerms(x, lowlimit), ] %>%
    as.matrix() %>%
    rowSums() %>%
    sort(decreasing = TRUE)
}

# Combined corpus; the threshold is lowered for longer n-grams
freq1 <- TopFreq(nGram1_comb, 100)
freq2 <- TopFreq(nGram2_comb, 10)
freq3 <- TopFreq(nGram3_comb, 3)
freq4 <- TopFreq(nGram4_comb, 2)

# Per-source frequencies; lapply always returns a list, one element per source
freq1_split <- lapply(nGram1, function(x) TopFreq(x, 100))
freq2_split <- lapply(nGram2, function(x) TopFreq(x, 10))
freq3_split <- lapply(nGram3, function(x) TopFreq(x, 3))
freq4_split <- lapply(nGram4, function(x) TopFreq(x, 2))
names(freq1_split) <- names(corp)
names(freq2_split) <- names(corp)
names(freq3_split) <- names(corp)
names(freq4_split) <- names(corp)
Wordclouds and histograms are plotted for the most common words and phrases among the unigrams, bigrams, trigrams and quadgrams.
Histograms are also plotted for the most common words and phrases among the unigrams, bigrams, trigrams and quadgrams by text source (blogs, news and Twitter). We can see that the most common words differ significantly depending on the text source.
The histograms show that the distribution of words is very skewed. We now look at how many of the most frequent words are needed to cover 50% and 90% of all word instances.
library(dplyr)    # mutate, arrange, filter, %>%
library(stringr)  # str_split

# Keep the most frequent n-grams whose cumulative share stays within cov_ratio
FreqMatrix <- function(x, ng, cov_ratio = 0.5) {
  myTdm <- as.matrix(x)
  FreqMat <- data.frame(word = rownames(myTdm),
                        Freq = rowSums(myTdm),
                        row.names = NULL)
  FreqMat <- FreqMat[order(FreqMat[, 2], decreasing = TRUE), ]
  cover <- FreqMat %>%
    mutate(proportion = Freq / sum(Freq)) %>%
    arrange(desc(proportion)) %>%
    mutate(coverage = cumsum(proportion)) %>%
    filter(coverage <= cov_ratio)
  # Number of distinct words that appear in the retained n-grams
  n_words <- length(unique(unlist(str_split(cover$word, " "))))
  cat(sprintf("We will need %d unique words and %d phrases to cover %1.0f%% of phrases in a %d-gram\n",
              n_words, nrow(cover), cov_ratio * 100, ng))
  cover
}
## We will need 170 unique words and 170 phrases to cover 50% of phrases in a 1-gram
## We will need 682 unique words and 682 phrases to cover 90% of phrases in a 1-gram
## We will need 264 unique words and 339 phrases to cover 50% of phrases in a 2-gram
## We will need 616 unique words and 1107 phrases to cover 90% of phrases in a 2-gram
## We will need 334 unique words and 261 phrases to cover 50% of phrases in a 3-gram
## We will need 685 unique words and 668 phrases to cover 90% of phrases in a 3-gram
## We will need 509 unique words and 246 phrases to cover 50% of phrases in a 4-gram
## We will need 892 unique words and 527 phrases to cover 90% of phrases in a 4-gram
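The coverage calculation can be followed by hand on a toy frequency vector. The sketch below uses made-up counts and a hypothetical helper, `terms_needed`. Note one difference from `FreqMatrix`: this variant counts the term that crosses the threshold, whereas the `coverage <= cov_ratio` filter keeps only the terms up to the threshold, so the two can differ by one term.

```r
# Hypothetical helper: number of most frequent terms needed to reach
# a target share of all term occurrences.
terms_needed <- function(freq, cov_ratio = 0.5) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)   # cumulative share of occurrences
  unname(which(coverage >= cov_ratio)[1])  # first position reaching the target
}

# Toy counts, made up for illustration only
toy_freq <- c(the = 40, to = 25, and = 15, see = 12, wait = 8)
n50 <- terms_needed(toy_freq, 0.5)
n90 <- terms_needed(toy_freq, 0.9)
print(c(n50, n90))
```

With these counts the cumulative shares are 0.40, 0.65, 0.80, 0.92 and 1.00, so two terms reach 50% coverage and four terms reach 90%.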
To save memory, we will use only the words needed to cover 90% of all word instances, together with the n-grams built from those words.
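One way to restrict the n-grams to a reduced vocabulary is to keep only those whose words all belong to it. This is a sketch of that idea, assuming n-grams are stored as space-separated strings; `keep_in_vocab` is a hypothetical helper and the vocabulary below is a toy example.

```r
# Hypothetical helper: keep only n-grams whose words are all in `vocab`.
keep_in_vocab <- function(ngrams, vocab) {
  word_lists <- strsplit(ngrams, " ", fixed = TRUE)
  in_vocab <- vapply(word_lists, function(w) all(w %in% vocab), logical(1))
  ngrams[in_vocab]
}

# Toy vocabulary and n-grams for illustration
vocab  <- c("happy", "new", "year", "york", "city")
ngrams <- c("happy new year", "new york city", "happy mothers day")
kept <- keep_in_vocab(ngrams, vocab)
print(kept)
```

Here "happy mothers day" is dropped because "mothers" and "day" are not in the toy vocabulary.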
We will use the n-grams for prediction. Given a word or phrase, the model finds the matching n-gram with the greatest frequency and returns its last word. The example below shows the most frequent trigrams, which would be used when two words are given. If there are no matches, we will back off to the (n-1)-gram or choose a random word. We will also consider taking the text source or type as an input, as the next word might vary depending on the text type.
## word1 word2 word3 Freq proportion coverage
## 1 cant wait see 45 0.013927577 0.01392758
## 2 new york city 23 0.007118539 0.02104612
## 3 happy new year 21 0.006499536 0.02754565
## 4 happy mothers day 20 0.006190034 0.03373569
## 5 two years ago 20 0.006190034 0.03992572
## 6 im pretty sure 19 0.005880532 0.04580625
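The lookup with back-off described above could be sketched as follows. This is an illustration under stated assumptions, not the final model: `predict_next` is a hypothetical helper, the trigram counts are taken from the table above, the bigram counts are made up, and a fixed fallback word stands in for the random word.

```r
# Trigram counts from the table above: history = first two words
trigrams <- data.frame(
  history  = c("cant wait", "new york", "happy new", "happy mothers"),
  nextword = c("see", "city", "year", "day"),
  Freq     = c(45, 23, 21, 20),
  stringsAsFactors = FALSE
)
# Toy bigram counts, made up for illustration only
bigrams <- data.frame(
  history  = c("new", "new", "happy"),
  nextword = c("year", "york", "birthday"),
  Freq     = c(30, 35, 40),
  stringsAsFactors = FALSE
)

# Hypothetical helper: try the longest history first, then back off.
# `tables` must be ordered from the highest-order n-gram to the lowest.
predict_next <- function(phrase, tables, fallback = "the") {
  words <- strsplit(tolower(trimws(phrase)), "[[:space:]]+")[[1]]
  for (tbl in tables) {
    # Number of history words this table expects
    k <- length(strsplit(tbl$history[1], " ", fixed = TRUE)[[1]])
    if (length(words) < k) next
    key  <- paste(tail(words, k), collapse = " ")
    hits <- tbl[tbl$history == key, ]
    if (nrow(hits) > 0) return(hits$nextword[which.max(hits$Freq)])
  }
  fallback  # assumption: fixed fallback instead of a random word
}

tables <- list(trigrams, bigrams)
p1 <- predict_next("happy new", tables)  # matched in the trigram table
p2 <- predict_next("brand new", tables)  # backs off to the bigram table
p3 <- predict_next("xyz", tables)        # no match at any level
print(c(p1, p2, p3))
```

Storing the history and the next word in separate columns keeps each lookup to a single equality filter, and the same function extends to quadgrams by adding a table at the front of the list.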
The next steps are to build a predictive algorithm and a Shiny app that suggests the most likely next word after a phrase is typed. We will use the n-grams we have created, as discussed above. To optimize the model and save memory, we will use only the words that cover 90% of all word instances, and we will consider having an input for the data source as well.