Introduction

The goal of this project is to carry out text mining and exploratory analysis in preparation for the final data science capstone project, whose aim is to create a next-word prediction algorithm. This document gives an overview of the major features of the data and briefly summarizes the plans for creating the prediction algorithm and Shiny app.

Reading and sampling data

There are three English-language data sources: blogs, news and twitter. The three files are read into one list and an overview of all files is produced. Twitter has the highest number of lines (over 2.3 million), while blogs have the highest number of words (over 37 million).

Overview of the US datasets

        file_name file_size number_of_lines number_of_characters number_of_words longest_entry
blogs       blogs  255.4 Mb          899288            206824505        37546239         40833
news         news  257.3 Mb         1010242            203223159        34762395         11384
twitter   twitter    319 Mb         2360148            162096031        30093372           140

Because each dataset is very large, we use sampled data for the analysis and for building the prediction algorithm. A 1% sample is drawn from each file, a corpus is created, and the corpus is then cleaned by removing non-ASCII characters, URLs, numbers, punctuation and extra white space, converting all letters to lower case and converting the result to plain text format.
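The sampling and cleaning steps described above can be sketched in base R. This is only an illustration under stated assumptions: `sample_lines` and `clean_text` are hypothetical helper names, the seed and the regular expressions are choices made for this sketch, and the report's own pipeline may instead rely on a text-mining package.

```r
# Hypothetical helpers (base R only); not necessarily the report's own pipeline.

# Draw a random fraction of the lines (at least one line).
sample_lines <- function(lines, fraction = 0.01, seed = 1234) {
  set.seed(seed)  # fixed seed (an assumption) so the sample is reproducible
  n <- max(1, round(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}

# Apply the cleaning steps listed above to a character vector.
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)    # drop URLs
  x <- gsub("[[:digit:]]+", " ", x)                      # drop numbers
  x <- gsub("[[:punct:]]+", "", x)                       # drop punctuation
  x <- tolower(x)                                        # convert to lower case
  x <- gsub("[[:space:]]+", " ", x)                      # collapse extra white space
  trimws(x)
}

raw <- c("Can't wait to see you at 5pm!!  http://example.com/x",
         "Happy New Year caf\u00e9 2015")
cleaned <- clean_text(raw)
print(cleaned)
```

URLs are removed before punctuation so that the pattern can still recognise the `://` part of each address.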

Building N-grams

In Natural Language Processing, an n-gram is a contiguous sequence of n items from a given sequence of text. We are going to create one-word (unigram), two-word (bigram), three-word (trigram) and four-word (quadgram) combinations using the RWeka package. We create n-grams for one combined dictionary as well as for three separate ones (blogs, news and twitter), as words and phrases might differ depending on the text source.
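To make the definition concrete, here is a minimal base-R sketch that lists the n-grams of a single cleaned sentence. The helper `make_ngrams` is hypothetical and is not part of RWeka; it only illustrates what a tokenizer with `min = n, max = n` produces conceptually.

```r
# Hypothetical helper (base R only) that lists the word n-grams of one string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (length(words) < n) return(character(0))  # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)     # index of each n-gram's first word
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "cant wait to see you"
print(make_ngrams(sentence, 1))  # unigrams
print(make_ngrams(sentence, 2))  # bigrams
print(make_ngrams(sentence, 3))  # trigrams
print(make_ngrams(sentence, 4))  # quadgrams
```

A sentence of k words yields k - n + 1 n-grams, so longer n-grams are rarer and their frequency tables are sparser.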

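library(tm)     # TermDocumentMatrix, findFreqTerms
library(RWeka)  # NGramTokenizer, Weka_control

# Build a term-document matrix of n-grams of exactly n words
CreateNgram <- function(corp, n){
    Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
    TermDocumentMatrix(corp, control = list(tokenize = Tokenizer))
}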

nGram1_comb <- CreateNgram(corp_unlist,1)
nGram2_comb <- CreateNgram(corp_unlist,2)
nGram3_comb <- CreateNgram(corp_unlist,3)
nGram4_comb <- CreateNgram(corp_unlist,4)

nGram1<-list()
nGram2<-list()
nGram3<-list()
nGram4<-list()
for (i in seq_along(corp)){
   nGram1[[i]] <- CreateNgram(corp[i],1)
   nGram2[[i]] <- CreateNgram(corp[i],2)
   nGram3[[i]] <- CreateNgram(corp[i],3)
   nGram4[[i]] <- CreateNgram(corp[i],4)
}

We then find the frequency of terms in each of these n-grams and construct dataframes of these frequencies.

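library(dplyr)  # provides %>%

# Keep terms seen at least `lowlimit` times; return their counts, most frequent first
TopFreq <- function(x, lowlimit=10){
    x[findFreqTerms(x,lowlimit),] %>%
        as.matrix() %>%
        rowSums() %>%
        sort(decreasing=TRUE)
}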

freq1<-TopFreq(nGram1_comb,100)
freq2<-TopFreq(nGram2_comb,10)
freq3<-TopFreq(nGram3_comb,3)
freq4<-TopFreq(nGram4_comb,2)

# lapply always returns a list, even if two sources happen to yield equally long results
freq1_split<-lapply(nGram1, function(x){TopFreq(x,100)})
freq2_split<-lapply(nGram2, function(x){TopFreq(x,10)})
freq3_split<-lapply(nGram3, function(x){TopFreq(x,3)})
freq4_split<-lapply(nGram4, function(x){TopFreq(x,2)})
names(freq1_split)<-names(corp)
names(freq2_split)<-names(corp)
names(freq3_split)<-names(corp)
names(freq4_split)<-names(corp)

Exploratory Analysis & Visualizations

Word clouds and histograms are plotted for the most common words and phrases among the unigrams, bigrams, trigrams and quadgrams.

Word Clouds

Histograms

Histograms are also plotted for the most common words and phrases among the unigrams, bigrams, trigrams and quadgrams by text source: blogs, news and twitter. We can see that the most common words differ significantly depending on the text source.

Frequencies

The histograms show that the distribution of words is very skewed. We now look at how many of the most frequent words we need in order to cover 50% and 90% of all word instances.

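library(dplyr)    # mutate, arrange, filter, %>%
library(stringr)  # str_split

# Return the most frequent n-grams whose cumulative share stays within cov_ratio
FreqMatrix <- function(x,ng,cov_ratio=0.5){
        myTdm <- as.matrix(x)
        FreqMat <- data.frame(word = rownames(myTdm), 
                      Freq = rowSums(myTdm), 
                      row.names = NULL)
        FreqMat <- FreqMat[order(FreqMat[,2], decreasing = TRUE),]
        # Share of each n-gram and running total, most frequent first
        cover <-  FreqMat %>% mutate(proportion = Freq / sum(Freq)) %>%
                        arrange(desc(proportion)) %>%  
                        mutate(coverage = cumsum(proportion)) %>%
                        filter(coverage <= cov_ratio)
        # Report the distinct words and distinct n-grams needed for this coverage
        cat(sprintf("We will need %d unique words and %d phrases to cover %1.0f%% of phrases in a %d-gram\n",length(unique(unlist(str_split(cover$word, ' ')))),nrow(cover),cov_ratio*100,ng))
        cover
}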
## We will need 170 unique words and 170 phrases to cover 50% of phrases in a 1-gram
## We will need 682 unique words and 682 phrases to cover 90% of phrases in a 1-gram
## We will need 264 unique words and 339 phrases to cover 50% of phrases in a 2-gram
## We will need 616 unique words and 1107 phrases to cover 90% of phrases in a 2-gram
## We will need 334 unique words and 261 phrases to cover 50% of phrases in a 3-gram
## We will need 685 unique words and 668 phrases to cover 90% of phrases in a 3-gram
## We will need 509 unique words and 246 phrases to cover 50% of phrases in a 4-gram
## We will need 892 unique words and 527 phrases to cover 90% of phrases in a 4-gram

To save memory, we are going to keep only the words needed to cover 90% of all word instances, together with the n-grams built from those words.
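The pruning idea above can be sketched in base R with toy counts. The helpers `terms_for_coverage` and `keep_ngrams` and all the counts below are hypothetical. Note one deliberate difference from `FreqMatrix`: this sketch includes the term that crosses the threshold, whereas `filter(coverage <= cov_ratio)` stops just short of it.

```r
# Hypothetical helpers with made-up counts (base R only).

# Smallest set of most-frequent terms whose cumulative share reaches `target`.
terms_for_coverage <- function(freq, target = 0.9) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  k <- which(coverage >= target)[1]  # first rank at which the target is met
  names(freq)[seq_len(k)]
}

# Keep only the n-grams whose every word is in the retained vocabulary.
keep_ngrams <- function(ngrams, vocab) {
  ok <- vapply(strsplit(ngrams, " ", fixed = TRUE),
               function(w) all(w %in% vocab), logical(1))
  ngrams[ok]
}

unigram_freq <- c(the = 50, to = 30, see = 12, wait = 5, zebra = 2, quark = 1)
vocab <- terms_for_coverage(unigram_freq, 0.9)
kept <- keep_ngrams(c("to see the", "see the zebra", "wait to see"), vocab)
print(vocab)
print(kept)
```

Dropping every n-gram that contains a rare word is the strictest option. A softer alternative, not explored here, would be to map rare words to a placeholder token.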

Prediction Model

We are going to use the n-grams for prediction. Given a word or phrase, the model will find the matching n-gram with the greatest frequency and suggest its last word. The table below shows the most frequent trigrams, which would be used when two words are given for prediction. If there are no matches, we will fall back to the (n-1)-gram or choose a random word. We will also consider adding an input for text source or type, as the next word might vary depending on the text type.

##   word1   word2 word3 Freq  proportion   coverage
## 1  cant    wait   see   45 0.013927577 0.01392758
## 2   new    york  city   23 0.007118539 0.02104612
## 3 happy     new  year   21 0.006499536 0.02754565
## 4 happy mothers   day   20 0.006190034 0.03373569
## 5   two   years   ago   20 0.006190034 0.03992572
## 6    im  pretty  sure   19 0.005880532 0.04580625
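A minimal base-R sketch of this back-off lookup follows. The function `predict_next` is hypothetical. The trigram counts are taken from the table above, while the bigram and unigram counts are made up purely for illustration. As a simplification, the final fallback returns the most frequent unigram rather than a random word.

```r
# Hypothetical lookup tables (base R only).
trigrams <- data.frame(
  word1 = c("cant", "new", "happy", "happy"),
  word2 = c("wait", "york", "new", "mothers"),
  word3 = c("see", "city", "year", "day"),
  Freq  = c(45, 23, 21, 20),   # counts from the trigram table above
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  word1 = c("new", "new", "last"),
  word2 = c("york", "year", "year"),
  Freq  = c(30, 25, 12),       # made-up counts
  stringsAsFactors = FALSE
)
unigrams <- c(the = 100, to = 80)  # made-up counts

predict_next <- function(phrase) {
  words <- strsplit(tolower(trimws(phrase)), "[[:space:]]+")[[1]]
  n <- length(words)
  if (n >= 2) {  # try the trigram table with the last two words
    hit <- trigrams[trigrams$word1 == words[n - 1] & trigrams$word2 == words[n], ]
    if (nrow(hit) > 0) return(hit$word3[which.max(hit$Freq)])
  }
  if (n >= 1) {  # back off to the bigram table with the last word
    hit <- bigrams[bigrams$word1 == words[n], ]
    if (nrow(hit) > 0) return(hit$word2[which.max(hit$Freq)])
  }
  names(unigrams)[which.max(unigrams)]  # final fallback: most frequent unigram
}

print(predict_next("Happy new"))    # matched in the trigram table
print(predict_next("brand new"))    # backs off to the bigram table
print(predict_next("hello there"))  # falls through to the unigram fallback
```

Only the last two words of the input are used by the trigram step, so longer phrases behave the same as their final two words.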

Conclusion & Next Steps

The next steps are to build a predictive algorithm and a Shiny app that suggests the most likely next word after a phrase is typed. We are going to use the n-grams we have created, as discussed in the Prediction Model section. To optimize the model and save memory, we will use only the words needed to cover 90% of all word instances, and we will consider adding an input for the data source as well.