Executive Summary

The Data Science Capstone concerns the area of Natural Language Processing, and the objective of the project is to support users while they write text by predicting the next word they will type. Predictive text models are typically built from a large corpus of documents used as training data; this project uses data from HC Corpora. The training dataset consists of three files containing texts extracted from blogs, news and twitter. This Exploratory Analysis summarizes the training data and the plans for creating the predictive model.

Data Summary

The dataset contains samples from multiple languages, but only the English files (over 550 MB of data) were loaded and processed.
The first step is to understand how many words and lines the raw dataset contains.

The following tables show summary information for the raw data.

As expected, 140 characters is the maximum twitter message length, and the average length of twitter messages is much shorter than that of blog and news sentences. It might be interesting to investigate language differences between short twitter messages and blog/news sentences, but the plan is to treat the three files as a single raw dataset, because the goal is a single general-purpose prediction system.

Raw Data Files Summary

Table 1. Number of chars, words and lines

file               lines     longest line (chars)   words
en_US.blogs.txt    899288    40833                  37334131
en_US.twitter.txt  2360148   140                    30373543
en_US.news.txt     1010242   11384                  34372530

Table 2. Summary of characters per line

file               Min   1st Qu.   Median   Mean     3rd Qu.   Max
en_US.blogs.txt    1     47        156      230.00   329       40830
en_US.twitter.txt  2     37        64       68.68    100       140
en_US.news.txt     1     110       185      201.20   268       11380

Creating Samples

In order to build the prediction model and to avoid bias in the data, the raw data are randomly sampled using a binomial function. Due to hardware resource limitations, the sample percentage is 1% of the raw data. After the sampling, the results are joined into a single small training dataset, as sketched after the sampling function below.

Sampling function

percentage <- 0.01  # sampling percentage
# Keep each line with probability `percent` (binomial/Bernoulli selection)
sampleFunct <- function(data, percent) {
  data[as.logical(rbinom(length(data), 1, percent))]
}
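As an illustration, the sampled subsets can be joined into the single training set mentioned above; the variable names blogs, news and twitter (character vectors read with readLines) are assumptions, not the original script.

# Hypothetical usage of the sampling function on the three raw files
set.seed(1234)                                   # make the sample reproducible
blogsSample   <- sampleFunct(blogs, percentage)
newsSample    <- sampleFunct(news, percentage)
twitterSample <- sampleFunct(twitter, percentage)
trainingData  <- c(blogsSample, newsSample, twitterSample)  # joined training set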

Cleaning Data and creating a Corpus

In order to proceed with further detailed statistics, it is necessary to create a text Corpus and clean the training data. The main classical text-mining steps to clean a corpus are: convert words to lower case, remove numbers, remove symbols and punctuation, and strip extra white space.

To create the Corpus and execute the main cleaning operations, several functions of the text mining package tm are used: Corpus, VectorSource, removeNumbers, removePunctuation, stripWhitespace and content_transformer(tolower).
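A minimal sketch of how this cleaning pipeline could look with tm (assuming trainingData is the sampled character vector; the original script may differ):

library(tm)
corpusData <- Corpus(VectorSource(trainingData))                 # build the corpus
corpusData <- tm_map(corpusData, content_transformer(tolower))   # lower-case words
# ... the custom filters described below (webFilter, repetitions, profanity)
#     are applied here, before removePunctuation ...
corpusData <- tm_map(corpusData, removeNumbers)                  # remove numbers
corpusData <- tm_map(corpusData, removePunctuation)              # remove symbols and punctuation
corpusData <- tm_map(corpusData, stripWhitespace)                # strip extra white spaces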

Some custom filters are developed to remove web elements (e.g. http, ftp, www), to fix character repetitions (e.g. goooooood, ahahahahah, ...) and to remove profanity. The profanity filter is based on the full list of bad words banned by Google.
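The webFilter function is shown further below; for the repetition and profanity filters, a possible sketch (the exact original rules and the badWords vector are assumptions) is:

# Collapse runs of 3+ identical letters (goooooood -> good); multi-letter
# repetitions such as "ahahahah" would need an additional rule.
repetitionFilter <- function(x) gsub("([[:alpha:]])\\1{2,}", "\\1\\1", x, perl = TRUE)

# badWords is assumed to be a character vector loaded from the banned-word list
corpusData <- tm_map(corpusData, content_transformer(repetitionFilter))
corpusData <- tm_map(corpusData, removeWords, badWords)          # profanity removal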

It is important to note that the execution order of the cleaning functions matters. For example, in this cleaning phase the removePunctuation function must be applied after the custom webFilter function, which recognizes web addresses by their dots and colons.

The approach taken in this project for twitter messages is to remove hashtags, because it is not easy to tell whether a tag is a single word or a concatenation of different words; moreover, hashtags are sometimes used only as "categorization" and not as words in the sentence.

Another important decision of this project is not to remove stopwords from the corpus, because the idea is to predict the next word, stopwords included. As a consequence of this decision, the expectation is that stopwords will be the most frequent words in the corpus.

webFilter function

# webFilter must run before removePunctuation (".", "/", "@")
webFilter <- function(x){
  x <- gsub("(http|HTTP)[^[:space:]]*", " ", x)
  x <- gsub("(ftp|FTP)[^[:space:]]*", " ", x)
  x <- gsub("(www|WWW)[^[:space:]]*", " ", x)
  x <- gsub("[^[:space:]]*\\.(com|COM|org|ORG)", " ", x) # common domains without http
  x <- gsub("\\S+@\\S+", " ", x) # email addresses
  x <- gsub("#[^[:space:]]*", " ", x) # hashtags
  x
}
corpusData <- tm_map(corpusData, content_transformer(webFilter))

Tokenization and NGram creation

After creating the corpus, the next step of the Exploratory Analysis is to tokenize it. Tokenization is the process of demarcating and possibly classifying sections of a string of input characters (Wikipedia). For tokenization the Ngrams_Tokenizer function is used; this function was kindly made public by Maciej Szymkiewicz. In this initial phase only OneGrams, TwoGrams and ThreeGrams are created.
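Since the Ngrams_Tokenizer source is not reproduced here, the following base-R fragment only illustrates the idea of sliding a window of n words over each line:

# Simple illustrative n-gram builder (not the original Ngrams_Tokenizer)
makeNGrams <- function(line, n) {
  words <- unlist(strsplit(line, "\\s+"))                # split the line into words
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

makeNGrams("the quick brown fox", 2)
# "the quick"   "quick brown"   "brown fox"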

NGram Summary

The following tables show a summary after the tokenization process. Tables 3 and 4 point out that there are a lot of n-grams with low frequency.

Table 3. Number of distinct n-grams

OneGram   TwoGram   ThreeGram
46926     457152    820749

Table 4. N-gram frequency summary

NGram       Min   1st Qu.   Median   Mean     3rd Qu.   Max
OneGram     1     1         2        20.250   5         46670
TwoGram     1     1         1        2.079    1         4182
ThreeGram   1     1         1        1.158    1         341

NGram plots

The OneGram table is a sort of project dictionary and contains 46670 words. As a comparison, it is interesting to note that the Oxford English Dictionary contains full entries for 171476 words in current use. Of course, the project dictionary also contains many other tokens such as names, cities and companies.

As Plot 1 confirms, there are a lot of words with low frequency; in particular, there are 22859 words with a frequency equal to one, more or less half of the project dictionary. As expected before tokenization, the following plots also confirm that the top OneGrams are all stopwords, and that TwoGrams/ThreeGrams are often combinations of, or contain, stopwords.

Plot 1. OneGram frequency distribution (top 500 words)

Plot 2. OneGram frequency (top 15)

Plot 3. TwoGram frequency (top 15)

Plot 4. ThreeGram frequency (top 15)

Memory Allocation

Important issues in developing a next-word prediction application are memory allocation and performance. In this first phase the n-grams are stored in 3 different data frames, and the memory allocation is already quite high considering that only 1% of the raw data was processed. Further steps must consider a strategy to decrease the memory allocation.

Table 5. Objects Size

Data frame   Object size
OneGram      3.2 Mb
TwoGram      33.6 Mb
ThreeGram    65.5 Mb
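For reference, object sizes like those in Table 5 can be checked with base R's object.size (oneGramDF is an assumed name for the OneGram data frame):

format(object.size(oneGramDF), units = "Mb")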

Plans and main decisions

Sparse items approach

The number of low-frequency terms is very high, and ignoring terms with a frequency lower than a given threshold can help generalization and prevent overfitting. So the first next step is to look for a minimum threshold that improves the generalization of the model; the sample percentage will also be increased accordingly. A possible frequency filter is sketched below.
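As an illustration, such a cut could be as simple as a frequency filter (threeGramDF and its freq column are assumed names; the threshold is hypothetical and would be tuned against a validation set):

minFreq <- 2                                               # hypothetical threshold
threeGramDF <- threeGramDF[threeGramDF$freq >= minFreq, ]  # drop sparse trigrams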

Memory Allocation and files dimensions

The information stored in the three data frames needs to be condensed, because other information (discounts, probabilities, ...) will be added. A possible strategy to save disk space and memory is to use a single table that exploits word positions within the ThreeGram. For example, "can i help" is a ThreeGram, but it contains two TwoGrams ("can i" and "i help") and of course three OneGrams ("can", "i" and "help"). A sketch of this idea follows.
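One possible shape for such a condensed structure, with words mapped to integer ids and each ThreeGram stored as id columns (all names below are illustrative only):

# dictionary: one row per distinct word
dictionary   <- data.frame(id = 1:3, word = c("can", "i", "help"))
# trigram table: three word ids plus the observed count
threeGramIdx <- data.frame(w1 = 1L, w2 = 2L, w3 = 3L, count = 5L)
# the TwoGrams "can i" (w1, w2) and "i help" (w2, w3) and the three OneGrams
# can be derived from the same row, so no separate text columns are needed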

Build basic model

The literature offers a lot of interesting models (e.g. backoff and smoothing techniques), so the idea is to start with the simplest n-gram model, evaluate the results and the alternatives to improve accuracy, and find a strategy to handle unseen n-grams.
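A minimal sketch of this simplest (maximum-likelihood) trigram model, assuming data frames threeGramDF (w1, w2, w3, freq) and twoGramDF (w1, w2, freq) that store the n-gram words and their counts:

# MLE estimate: P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
predictNext <- function(w1, w2, threeGramDF, twoGramDF) {
  cand <- threeGramDF[threeGramDF$w1 == w1 & threeGramDF$w2 == w2, ]
  ctx  <- twoGramDF$freq[twoGramDF$w1 == w1 & twoGramDF$w2 == w2]
  if (nrow(cand) == 0 || length(ctx) == 0) return(NULL)   # unseen context: needs backoff
  cand$prob <- cand$freq / ctx
  cand[order(-cand$prob), c("w3", "prob")]                # candidates ranked by probability
}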

Benchmark and validation

To evaluate the different alternatives it will be important to create a benchmarking/validation set to compare the results of the project choices.

Shiny app

The app may be the last point, but it is important because it is the part of the project in which performance and memory/file-system allocation will play a fundamental role. So building a first simple app mock-up is important to test the user experience.