The present report gives a brief introduction to the Capstone Project of the Coursera Data Science Specialization. The main goal of the project is to develop a Shiny app capable of predicting the next word, given some introduced text.
Three text files from different backgrounds (blogs, news and Twitter) are available for this task. This report addresses the initial project tasks of loading, cleaning and exploring the datasets, with the final objective of understanding the best way to create the N-grams that will constitute the model.
The first hurdle to overcome was the dataset size, which totals almost 600 MB. Although it is possible to read all the data into R, some of the subsequent steps required the data to be partitioned and processed in chunks.
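A minimal sketch of this chunked approach, assuming the raw files live under a final/en_US folder (the path and the chunk size of 50000 lines are illustrative, not taken from the project code):

con <- file("final/en_US/en_US.blogs.txt", "r")
repeat {
    chunk <- readLines(con, n = 50000, skipNul = TRUE)   # read the next block of lines
    if (length(chunk) == 0) break                        # stop at end of file
    # ... clean and process each chunk here ...
}
close(con)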
There is a considerable number of R packages to assist with natural language processing projects. Some of them have similar functions, and many of them, despite offering very powerful features, lack efficiency when applied to large amounts of data. In this stage of loading and cleaning the data, the following packages were used:
library(tm)
library(RWeka)
library(ggplot2)
library(wordcloud)
library(NLP)
library(SnowballC)
library(textcat)   # needed for the language detection step below
Before starting to work with the dataset, some pre-cleaning was performed, eliminating non-ASCII characters, foreign-language lines, and some special “words” such as web sites and email addresses.
Several types of abbreviations and contractions were found, which were manually converted to the expected full words.
# convert to ASCII, dropping characters that cannot be represented
data <- iconv(data, "UTF-8", "ASCII", sub="")
# remove stray control characters (\001)
data <- gsub("\1", "",data)
# keep only the lines detected as English (textcat package)
data <- data[textcat(data, p = textcat::ECIMCI_profiles) == "en"]
data <- paste(data, collapse = " ")
# strip email addresses and web site addresses
data <- gsub("[a-z]+@[a-z]+\\.", "",data)
data <- gsub("www\\.[a-z]+\\.[a-z]+", "",data)
# mark remaining non-printable characters with a placeholder
data <- gsub("[^[:print:]]", "###",data)
# expand contractions, treating ### as an apostrophe stand-in
data <- gsub("[Ii]t's ", "it is ",data)
data <- gsub("'s ", " ",data)
data <- gsub("###s ", " ",data)
data <- gsub("'ve ", " have ",data)
data <- gsub("'u ", " you ",data)
data <- gsub("'r ", " are ",data)
data <- gsub("###ve ", " have ",data)
data <- gsub("n't", " not",data)
data <- gsub("n###t", " not",data)
data <- gsub("[Nn]'t", " not",data)
data <- gsub("[Nn]###t", " not",data)
data <- gsub("'m ", " am ",data)
data <- gsub("###m ", " am ",data)
# remove any leftover placeholders and stray 'n
data <- gsub("###", "",data)
data <- gsub(" 'n ", " ",data)
# capitalize the standalone pronoun "i"
data <- gsub("^i | i ", " I ",data)
The last task of this loading process was to separate the text into sentences in order to form coherent N-grams. A start-of-sentence mark was introduced at the beginning of every sentence.
# split the text into sentences on sentence-ending punctuation
sent_vec <- unlist(strsplit(data, "[?.,!]+"))
# prepend the start-of-sentence marker to every sentence
start <- "<s>"
sent_vec <- gsub("^", paste0(start, " "), sent_vec)
The table below shows some basic statistics for the datasets. Despite having roughly double the number of lines of the other datasets, the Twitter dataset has a comparable number of sentences. This can be related to its lack of writing structure.
| file | lines | sentences | words |
|---|---|---|---|
| Blogs | 899288 | 4039408 | 37465856 |
| News | 1010242 | 4146647 | 34852897 |
| Twitter | 2360148 | 3913034 | 30823628 |
| All | 4269678 | 12099089 | 103142381 |
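A sketch of how such counts can be obtained for one file; the word count via a whitespace split is an approximation and not necessarily the exact method used for the table:

lines <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)
n_lines <- length(lines)
# sentences: split on the same punctuation marks used above
n_sentences <- length(unlist(strsplit(lines, "[?.,!]+")))
# words: approximate count by splitting on whitespace
n_words <- length(unlist(strsplit(lines, "\\s+")))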
For this exploration, 5000 sentences were sampled from each dataset and uni-grams, bi-grams and tri-grams were created from them. The table below presents the number of uni-grams, bi-grams and tri-grams created. Although there are some differences among the datasets, the numbers of n-grams created are similar.
| UniGram_Blogs | UniGram_News | UniGram_Twitter | BiGram_Blogs | BiGram_News | BiGram_Twitter | TriGram_Blogs | TriGram_News | TriGram_Twitter |
|---|---|---|---|---|---|---|---|---|
| 6543 | 6835 | 6597 | 28281 | 27450 | 29062 | 37034 | 33529 | 38259 |
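A minimal sketch of how such n-gram counts can be produced with RWeka's NGramTokenizer, reusing the sent_vec sentence vector created above (the sampling call and object names are illustrative):

set.seed(1234)
sample_sent <- sample(sent_vec, 5000)
# tokenize the sample into bi-grams and count the distinct ones
bigrams <- NGramTokenizer(paste(sample_sent, collapse = " "),
                          Weka_control(min = 2, max = 2))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
length(bigram_freq)   # number of distinct bi-grams

The same call with min = 1, max = 1 or min = 3, max = 3 produces the uni-gram and tri-gram counts.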
The following table presents the quantile distribution of the word frequencies. In all three datasets 50% of the words appear only once and 90% of them have a frequency of 9 or less. The bi-gram and tri-gram analyses show an escalation of this trend: the n-grams that appear only once represent more than 80% and 90% of the dictionary, respectively.
| Quantile | UniGram_Blogs | UniGram_News | UniGram_Twitter | BiGram_Blogs | BiGram_News | BiGram_Twitter | TriGram_Blogs | TriGram_News | TriGram_Twitter |
|---|---|---|---|---|---|---|---|---|---|
| 0% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 20% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 30% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 40% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 50% | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 60% | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 70% | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 80% | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 1 | 1 |
| 90% | 9 | 9 | 9 | 2 | 2 | 2 | 1 | 1 | 1 |
| 100% | 2311 | 2360 | 2375 | 439 | 383 | 499 | 49 | 57 | 52 |
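These deciles can be computed directly from the frequency tables, for example for the uni-grams (reusing the sample_sent vector sketched above):

unigrams <- NGramTokenizer(paste(sample_sent, collapse = " "),
                           Weka_control(min = 1, max = 1))
unigram_freq <- table(unigrams)
# decile distribution of the uni-gram frequencies
quantile(as.numeric(unigram_freq), probs = seq(0, 1, by = 0.1))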
The uni-gram distribution is similar in all datasets, with most of the top words common to the three of them. In all cases “the” is the most frequent word.
The bi-gram distribution presents the same pattern already seen in the uni-grams. The <s> symbol represents the beginning of a sentence.
The tri-gram plots keep the trend of short words dominating the highest frequencies; however, these plots already show some words that do not appear in the charts above.
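The kind of frequency plot described above can be produced with ggplot2, for example for the top uni-grams (a sketch, not the exact figure code; unigram_freq is the table sketched earlier):

library(ggplot2)
top20 <- sort(unigram_freq, decreasing = TRUE)[1:20]
df <- data.frame(term = names(top20), freq = as.numeric(top20))
ggplot(df, aes(x = reorder(term, freq), y = freq)) +
    geom_col() +
    coord_flip() +
    labs(x = "Uni-gram", y = "Frequency")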
The analysis revealed a dataset with many sparse terms: a sample of 5000 lines created more than 30000 tri-grams in each dataset, yet 90% of them appear only once. The data analyzed here had already been stemmed (inflected forms of the same root word, e.g. run, runs, running, were collapsed), which means that in the original dataset this issue would have been even more evident. As a further treatment, terms that refer to people's names, brands or places could be identified and eliminated, reducing the number of uncommon words/n-grams.
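One simple way to reduce this sparsity is to drop the n-grams that appear only once before building the model; a sketch using the frequency table from above (the threshold of 1 is illustrative):

# keep only the bi-grams observed more than once
pruned_bigrams <- bigram_freq[bigram_freq > 1]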
In the next steps of this project the three datasets will be joined and then split into a training set and a test set.
The training set will be used to build the models.
The models' performance will be evaluated by applying a perplexity function to the test set.
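As a rough illustration of that evaluation step, perplexity can be computed from the probabilities the model assigns to each test word; the function below is a sketch and does not come from the project code:

# perplexity of per-word probabilities p_1 ... p_N: exp(-(1/N) * sum(log(p_i)))
perplexity <- function(word_probs) {
    exp(-mean(log(word_probs)))
}

# hypothetical probabilities assigned by a model to four test words
perplexity(c(0.10, 0.05, 0.20, 0.01))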