This is the start of the Data Science Specialisation Capstone Project to build a predictive text model. We are given three English text files (blog text, Twitter text and news text) from a corpus called HC Corpora, which will form the training dataset for our predictive text model. At this stage, we aim to understand the text data well enough to build an n-gram dictionary for the prediction model.
First, we seek to understand the distribution and relationship between the words, tokens, and phrases in the three text files, so as to prepare to build our first linguistic models. This includes the frequencies and variation in the frequencies of words and word pairs/phrases in the text files.
| TextSource | Object_Size_in_Bytes | Line_Count | Word_Count |
|---|---|---|---|
| Filetwit | 316037344 | 2360148 | 30373792 |
| Fileblog | 260564320 | 899288 | 37334441 |
| Filenews | 20111392 | 77259 | 2643972 |
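The summary figures above could be computed along the following lines. This is a minimal sketch in base R, assuming each file has already been read into a character vector (for example with readLines()). The helper name summarise_text and the whitespace-based word count are assumptions, so counts produced this way may differ slightly from those reported.

```r
# Hypothetical helper: summarise a character vector of text lines.
summarise_text <- function(lines, label) {
  # Split each line on runs of whitespace and count the non-empty tokens.
  tokens <- unlist(strsplit(trimws(lines), "\\s+"))
  data.frame(
    TextSource = label,
    Object_Size_in_Bytes = as.numeric(object.size(lines)),
    Line_Count = length(lines),
    Word_Count = sum(nzchar(tokens)),
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for the result of readLines() on a real file.
toy <- c("its water is so transparent", "that the fish can be seen")
toy_summary <- summarise_text(toy, "toy")
print(toy_summary)
```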
Next, given the large text size, we create a random sample from each of the three text files, with about 200,000 words per sample, for exploratory analysis. Altogether, the samples account for 0.9% of the original text files in terms of word count. This provides sufficient text to build an n-gram dictionary. The random samples are combined and loaded as a corpus for subsequent text preprocessing and cleaning.
| TextSource | Object_Size_in_Bytes | Line_Count | Word_Count |
|---|---|---|---|
| twitsample | 2170432 | 16000 | 206072 |
| blogsample | 1448800 | 5000 | 208047 |
| newsample | 1570176 | 6000 | 206915 |
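The sampling step and the 0.9% figure can be illustrated as follows. This is only a sketch: the helper name sample_lines and the seed value are assumptions, and the toy vector stands in for a full text file. The word counts are taken from the two tables above.

```r
# Hypothetical helper: draw n lines at random, without replacement.
sample_lines <- function(lines, n, seed = 1234) {
  set.seed(seed)  # fixed seed (an assumed value) so the sample is reproducible
  lines[sample(length(lines), min(n, length(lines)))]
}

# Toy input standing in for a full text file.
toy <- paste("line", 1:100)
toy_sample <- sample_lines(toy, 10)

# Share of the original word count covered by the three samples,
# using the word counts reported in the two tables.
original_words <- c(30373792, 37334441, 2643972)
sample_words <- c(206072, 208047, 206915)
sample_share <- 100 * sum(sample_words) / sum(original_words)
print(round(sample_share, 1))
```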
Upon loading the sample data as a corpus, we start “cleaning” the text. Text transformation is performed using the tm_map() function for the following:
Common English stopwords are, however, not removed, as they are plausible next words and therefore useful in our predictive text modelling. Text stemming is also not performed, as we want to capture all forms of a word rather than reduce words to their root form.
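A cleaning step of this kind might look roughly like the sketch below. The specific transformations are assumptions (lowercasing, removing numbers and punctuation, collapsing whitespace), and base R string functions are used here in place of tm_map() so that the example runs without extra packages. The name clean_text is hypothetical. In keeping with the text, stopwords are retained and no stemming is applied.

```r
# Hypothetical cleaning helper; the individual steps are assumptions.
clean_text <- function(x) {
  x <- tolower(x)                  # fold everything to lower case
  x <- gsub("[0-9]+", " ", x)      # drop numbers
  x <- gsub("[^a-z' ]+", " ", x)   # drop punctuation and symbols, keep apostrophes
  x <- gsub("\\s+", " ", x)        # collapse runs of whitespace
  trimws(x)
}

# Stopwords ("the", "were", "they") survive and "running" is not stemmed.
cleaned <- clean_text("The 2 dogs   weren't running, were they?")
print(cleaned)
```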
Subsequently, the text is converted into a document-term matrix for further computation. This approach results in a matrix with document IDs as rows and terms as columns. The matrix elements are term frequencies. The frequencies of unigrams (n-grams of size 1), bigrams (n-grams of size 2), trigrams (n-grams of size 3) and four-grams (n-grams of size 4) are displayed in the barplots and word clouds below.
Some interesting findings from this stage of the project are as follows:
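To make the n-gram frequencies concrete, the sketch below counts n-grams directly from cleaned text lines in base R. It is an illustration under assumptions rather than the matrix-based approach described above: the helper name count_ngrams is hypothetical, and tokens are assumed to be separated by single spaces.

```r
# Hypothetical helper: count n-grams of size n in cleaned text lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    # Join each window of n consecutive words into one n-gram string.
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # Frequency table, most frequent n-gram first.
  sort(table(grams), decreasing = TRUE)
}

toy <- c("one of the best", "one of the few", "part of a plan")
bigrams <- count_ngrams(toy, 2)
print(head(bigrams, 3))
```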
| Ngram_Dictionary | Unique_Words_Phrases | Freq_in_Sample | Percentage_of_Sample |
|---|---|---|---|
| Unigram | 43588 | 469414 | 75.58588 |
| Bigram | 289955 | 584274 | 94.08084 |
| Trigram | 477208 | 557455 | 89.76240 |
| Fourgram | 516606 | 531413 | 85.56907 |
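The Percentage_of_Sample column appears to be Freq_in_Sample divided by the total word count of the three samples (621,034 words). That reading is an inference from the numbers, not something stated in the text. The short check below reproduces the column under that assumption.

```r
# Total words across the three samples (from the sample table).
total_sample_words <- 206072 + 208047 + 206915

freq_in_sample <- c(Unigram = 469414, Bigram = 584274,
                    Trigram = 557455, Fourgram = 531413)

# Percentage of the sample word count accounted for by each dictionary.
pct <- 100 * freq_in_sample / total_sample_words
print(round(pct, 5))
```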
With the unigram, bigram, trigram and four-gram dictionaries created from the sample, we are ready to build the prediction algorithm. Based on the Markov assumption, which states that “the future is independent of the past given the present”, we rely on the last few words of the input, especially the last word.
In other words, for a bigram model, \(P(\text{the} \mid \text{its water is so transparent that})\) is approximately the same as \(P(\text{the} \mid \text{that})\).
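Written generally, the bigram approximation and its usual count-based estimate look as follows. The notation is an assumption introduced here for illustration: \(w_i\) denotes the \(i\)-th word and \(C(\cdot)\) the number of times a word or word sequence occurs in the sample.

\[
P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-1}),
\qquad
P(w_n \mid w_{n-1}) \approx \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}
\]

Higher-order models condition on more of the preceding words in the same way; for example, a trigram model uses \(P(w_n \mid w_{n-2}, w_{n-1})\).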
A bigram prediction model is possible, but it may not capture phrases effectively, as language has long-distance dependencies. We will therefore also rely on higher-order n-gram dictionaries when building the model. To make lookups more efficient, we will keep only those entries in the dictionaries with a frequency of at least four. This speeds up lookups and saves memory, since we are mainly interested in words and phrases with high frequencies.
The outline of the algorithm is as follows:
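The frequency cut-off can be sketched as a simple filter. The dictionary below is a made-up named vector used purely for illustration; only the threshold of four comes from the text.

```r
# Hypothetical n-gram frequency table; the counts are made up for illustration.
bigram_freq <- c("of the" = 12, "of a" = 5, "of my" = 4, "of zebra" = 1)

min_freq <- 4  # threshold stated in the text
pruned <- bigram_freq[bigram_freq >= min_freq]
print(pruned)
```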
```r
input <- "of"
```

```
## NextWord FrequencyNextWord
## [1,] "the" "2427"
## [2,] "a" "444"
## [3,] "my" "277"
## [4,] "his" "187"
## [5,] "our" "131"
```
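One way such a lookup could work is sketched below: try the longest usable context first, and back off to shorter contexts when nothing matches. This is an assumption about the approach, not a reproduction of the algorithm used for the output above. The function name predict_next, the dictionary layout and all counts are hypothetical.

```r
# Toy dictionaries (made-up counts): names are n-grams, values are frequencies.
dictionaries <- list(
  c("the" = 50, "of" = 30, "a" = 20),                      # unigrams
  c("of the" = 12, "of a" = 5, "of my" = 4, "a lot" = 6),  # bigrams
  c("a lot of" = 5, "one of the" = 4)                      # trigrams
)

# Hypothetical predictor: try the longest usable context first, then back off.
predict_next <- function(input, dictionaries, top = 5) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  max_context <- min(length(words), length(dictionaries) - 1)
  for (k in rev(seq_len(max_context))) {
    context <- paste(tail(words, k), collapse = " ")
    dict <- dictionaries[[k + 1]]
    # Keep n-grams whose first k words equal the context.
    hits <- dict[startsWith(names(dict), paste0(context, " "))]
    if (length(hits) > 0) {
      hits <- head(sort(hits, decreasing = TRUE), top)
      return(data.frame(NextWord = sub(".* ", "", names(hits)),
                        Frequency = unname(hits),
                        stringsAsFactors = FALSE))
    }
  }
  # No context matched: fall back to the most frequent unigrams.
  hits <- head(sort(dictionaries[[1]], decreasing = TRUE), top)
  data.frame(NextWord = names(hits), Frequency = unname(hits),
             stringsAsFactors = FALSE)
}

prediction <- predict_next("of", dictionaries)
print(prediction)
```

With real dictionaries, the same call would return the most frequent continuations of the input, in the spirit of the table printed above.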