The Data Science Capstone project concerns Natural Language Processing, and its objective is to support users in writing text by predicting the next word they will type. Predictive text models are typically trained on a large corpus of documents, and this project uses data from HC Corpora. The training dataset includes three files containing text extracted from blogs, news and Twitter. This Exploratory Analysis summarizes the training data and the plans for creating the predictive model.
The dataset contains samples from multiple languages, but only the English files (over 550 MB of data) were loaded and processed.
The first step is to understand how many words and lines there are in the raw dataset.
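A minimal sketch of how these counts can be computed in R (the file path and function name are illustrative):

summariseFile <- function(path) {
lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
c(lines = length(lines),
longest_line_chars = max(nchar(lines)),
words = sum(lengths(strsplit(lines, "\\s+"))))
}
summariseFile("en_US/en_US.blogs.txt")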
The following tables show the raw data summary information.
As expected, 140 characters is the maximum Twitter message length, and the average length of Twitter messages is much shorter than that of blog and news sentences. It might be interesting to investigate language differences between short Twitter messages and blog/news sentences, but the plan is to treat the three files as a single raw dataset, because the goal is a single general-purpose prediction system.
| file | Lines | Longest line (chars) | Words |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 40833 | 37334131 |
| en_US.twitter.txt | 2360148 | 140 | 30373543 |
| en_US.news.txt | 1010242 | 11384 | 34372530 |
| file (chars per line) | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 1 | 47 | 156 | 230.00 | 329 | 40830 |
| en_US.twitter.txt | 2 | 37 | 64 | 68.68 | 100 | 140 |
| en_US.news.txt | 1 | 110 | 185 | 201.20 | 268 | 11380 |
In order to build the prediction model and to avoid bias, the raw data are randomly sampled using a binomial function. Due to hardware resource limitations, the sample percentage is 1% of all raw data. After the sampling, the results are joined into a new, smaller training dataset.
percentage <- 0.01 # sampling percentage
sampleFunct <- function(data, percent) {
return(data[as.logical(rbinom(length(data), 1, percent))])
}
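An illustrative usage of the sampling function, assuming the three English files were read with readLines into blogs, news and twitter character vectors (the variable names and the seed are illustrative):

set.seed(1234) # for reproducible sampling
trainingData <- c(sampleFunct(blogs, percentage),
sampleFunct(news, percentage),
sampleFunct(twitter, percentage))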
In order to proceed with further detailed statistics, it is necessary to create a text Corpus and clean the training data. The main classical text mining steps to clean the corpus are: convert words to lower case, remove numbers, remove symbols and punctuation, and strip extra white space.
To create the Corpus and execute the main cleaning operations, some functions of the text mining package tm are used: Corpus, VectorSource, removeNumbers, removePunctuation, stripWhitespace and content_transformer(tolower).
Some custom filters are developed to remove web elements (e.g. http, ftp, www), to fix character repetitions (e.g. goooooood, ahahahahah, ...) and to remove profanity. The profanity filter is based on the full list of bad words banned by Google.
It is important to note that the execution order of the cleaning functions matters. For example, in this cleaning phase the removePunctuation function must be applied after the custom webFilter function, which recognizes web addresses by their dots and colons.
In this project the approach to managing Twitter messages is to remove hashtags, because it is not easy to tell whether a tag is a single word or a concatenation of different words; moreover, hashtags are sometimes used only as "categorization" and not as words in the sentence.
Another important decision of this project is not to remove stopwords from the corpus, because the goal is to predict the next word, stopwords included. As a consequence of this decision, the expectation is that stopwords will be the most frequent words in the corpus.
# webFilter must run before removePunctuation (it relies on ".", "/" and "@")
webFilter <- function(x){
x <- gsub("(http|HTTP)[^[:space:]]*", " ", x) # http/https addresses
x <- gsub("(ftp|FTP)[^[:space:]]*", " ", x) # ftp addresses
x <- gsub("(www|WWW)[^[:space:]]*", " ", x) # www addresses
x <- gsub("[^[:space:]]*\\.(com|COM|org|ORG)", " ", x) # common domains without http
x <- gsub("\\S+@\\S+", " ", x) # email addresses
x <- gsub("#[^[:space:]]*", " ", x) # hashtags
x # return the cleaned text
}
corpusData <- tm_map(corpusData, content_transformer(webFilter))
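For completeness, a minimal sketch of the full cleaning pipeline, assuming the corpus is built from the sampled training text and that profanityWords holds the Google bad-words list (trainingData and profanityWords are illustrative names); the ordering keeps webFilter ahead of removePunctuation as discussed above.

library(tm)
corpusData <- Corpus(VectorSource(trainingData)) # sampled training text
corpusData <- tm_map(corpusData, content_transformer(tolower))
corpusData <- tm_map(corpusData, content_transformer(webFilter)) # before removePunctuation
corpusData <- tm_map(corpusData, removeNumbers)
corpusData <- tm_map(corpusData, removePunctuation)
corpusData <- tm_map(corpusData, removeWords, profanityWords) # profanity filter
corpusData <- tm_map(corpusData, stripWhitespace)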
After the corpus creation, the next step of the Exploratory Analysis is to tokenize the corpus. Tokenization is the process of demarcating and possibly classifying sections of a string of input characters (Wikipedia). For tokenization the Ngrams_Tokenizer function is used; this function was kindly made public by Maciej Szymkiewicz. In this initial phase only OneGrams, TwoGrams and ThreeGrams are created.
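A sketch of how the n-gram frequency data frames can be built; the Ngrams_Tokenizer interface is assumed here to take the n-gram order and return a tokenizer usable by tm's TermDocumentMatrix, and the helper and column names are illustrative.

# Build a frequency data frame for a given n-gram order (tokenizer interface assumed)
makeFreqDF <- function(corpus, tokenizer) {
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
data.frame(ngram = names(freq), frequency = freq,
stringsAsFactors = FALSE, row.names = NULL)
}
oneGram <- makeFreqDF(corpusData, Ngrams_Tokenizer(1))
twoGram <- makeFreqDF(corpusData, Ngrams_Tokenizer(2))
threeGram <- makeFreqDF(corpusData, Ngrams_Tokenizer(3))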
The following tables summarize the results of the tokenization process. Tables 3 and 4 point out that there are a lot of n-grams with low frequency.
| Unique OneGrams | Unique TwoGrams | Unique ThreeGrams |
|---|---|---|
| 46926 | 457152 | 820749 |
| N-gram frequency | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
|---|---|---|---|---|---|---|
| OneGram | 1 | 1 | 2 | 20.250 | 5 | 46670 |
| TwoGram | 1 | 1 | 1 | 2.079 | 1 | 4182 |
| ThreeGram | 1 | 1 | 1 | 1.158 | 1 | 341 |
The OneGram table is a sort of project dictionary and contains 46926 words (see the table above). As a comparison, it is interesting to note that the Oxford English Dictionary contains full entries for 171476 words in current use. Obviously, the project dictionary also contains many other terms such as names, cities and companies.
As Plot 1 confirms, there are a lot of words with low frequency; in particular, there are 22859 words with a frequency of one, more or less half of the project dictionary. As expected before the tokenization, the following plots also confirm that the top OneGrams are all stopwords, and that the top TwoGrams/ThreeGrams are often combinations of, or contain, stopwords.
Important issues in developing a next-word prediction application are memory allocation and performance. In this first phase the n-grams are stored in three different data frames, and the memory footprint is already quite high considering that only 1% of the raw data was processed. Further steps must consider strategies to decrease the memory allocation.
| Data frame | Object size |
|---|---|
| OneGram | 3.2 Mb |
| TwoGram | 33.6 Mb |
| ThreeGram | 65.5 Mb |
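These sizes can be checked with base R's object.size (a small illustrative check, assuming the three data frames are named oneGram, twoGram and threeGram):

sapply(list(OneGram = oneGram, TwoGram = twoGram, ThreeGram = threeGram),
function(df) format(object.size(df), units = "Mb"))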
The number of low frequency terms is very high, and ignoring terms whose frequency is lower than a given threshold can help generalization and prevent overfitting. So the first next step is to look for a minimum threshold that improves the generalization of the model. Consequently, the sample percentage will also be increased.
The information stored in the three data frames needs to be condensed, because other information (discounts, probabilities, ...) will be added. A possible strategy to save disk space and memory is to use a single table based on the word positions in the ThreeGram. For example, "can i help" is a ThreeGram but contains two TwoGrams ("can i" and "i help") and, obviously, three OneGrams ("can", "i" and "help").
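A minimal sketch of this idea, assuming the oneGram data frame from the tokenization step (all names are illustrative): each distinct word gets an integer id, and a ThreeGram is stored as three ids from which the embedded TwoGrams and OneGrams can be recovered by position.

# Word index: one integer id per dictionary word
wordIndex <- data.frame(id = seq_len(nrow(oneGram)), word = oneGram$ngram,
stringsAsFactors = FALSE)
# "can i help" stored as three word ids plus its count; "can i" and "i help"
# are recoverable from positions (w1, w2) and (w2, w3) of the same row
ids <- match(c("can", "i", "help"), wordIndex$word)
threeGramIds <- data.frame(w1 = ids[1], w2 = ids[2], w3 = ids[3], frequency = 1L)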
In the literature there are a lot of interesting models to be used (e.g. backoff and smoothing techniques), so the idea is to start with the simplest n-gram model, evaluate results and alternatives to improve the accuracy, and find a strategy to handle unseen n-grams.
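As an illustration of the simplest approach, a hedged sketch of a frequency-based lookup with back-off; the word1/word2/word3 columns are assumed to come from splitting the n-gram strings, and all names are illustrative rather than the final design.

# Try the trigram table first, fall back to bigrams, then to the top unigram
predictNextWord <- function(w1, w2, threeGram, twoGram, oneGram) {
hit <- threeGram[threeGram$word1 == w1 & threeGram$word2 == w2, ]
if (nrow(hit) > 0) return(hit$word3[which.max(hit$frequency)])
hit <- twoGram[twoGram$word1 == w2, ]
if (nrow(hit) > 0) return(hit$word2[which.max(hit$frequency)])
oneGram$ngram[which.max(oneGram$frequency)] # most frequent word as last resort
}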
To evaluate the different alternatives, it will be important to create a benchmarking/validation set to compare the results of the project choices.
The App is perhaps the last point, but it is important because it is the part of the project in which performance and memory/file system allocation will play a fundamental role. So building a first simple App mock-up is important to test the user experience.