1. Summary

This Natural Language Processing project is the final capstone of the Coursera Data Science Specialization by Johns Hopkins University. The project is to build a text prediction app that takes input words from the user. In this report, I summarize initial statistical findings from the data and outline a plan for building the prediction model.

The second report presents my prediction model, built as the next step after the work described here. You can find it here: http://rpubs.com/nhohung/NLP_prediction. The final app is published online at: https://nhohung.shinyapps.io/TextPrediction/. A short presentation of this project is posted at: http://rpubs.com/nhohung/NLP_summary.

library(quanteda)
library(ggplot2)

2. Data loading and summary

The data consists of 3 English text files collected from blogs, news and Twitter. First of all, the file size can be obtained, for example, by:

file.size("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt") / (1024^2)
## [1] 200.4242

where the result is the file size in MB.

I load the 3 files separately. The news file needs to be opened with the flag rb (read binary) because it contains a problematic end-of-file character. An example of loading these files is as follows:

data_blog <- readLines(file("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", open = 'r'))
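For the news file, the connection is opened in binary mode, roughly as sketched below (the skipNul flag is an optional extra of mine to ignore embedded NUL characters):

# open the news file in binary mode to get past the embedded end-of-file character
con_news <- file("./Coursera-SwiftKey/final/en_US/en_US.news.txt", open = 'rb')
data_news <- readLines(con_news, skipNul = TRUE)
close(con_news)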

We can then check the total number of lines and words in each file:

# number of lines:
length(data_blog)
## [1] 899288
# number of words:
sum(sapply(strsplit(data_blog, "\\s"), length))
## [1] 37334641
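A small helper makes it easy to repeat these counts for all three files (a sketch; the function name summarize_text and the object names are mine):

# count the lines and whitespace-separated words of a character vector of lines
summarize_text <- function(x) {
    c(lines = length(x), words = sum(sapply(strsplit(x, "\\s"), length)))
}
summarize_text(data_blog)
summarize_text(data_news)
summarize_text(data_twitter)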

Applying the same steps to the other 2 files, we get the following data summary:

File source   Size (MB)   Line count   Word count
blog          200.42      899,288      37,334,641
news          196.28      1,010,242    34,372,792
twitter       159.36      2,360,148    30,373,906

3. Exploratory data analysis

To investigate the data more thoroughly, I will merge the three sources into a single big dataset, remove foreign words, and calculate the frequencies of unigrams (single words), bigrams (pairs of 2 adjacent words) and trigrams (phrases of 3 consecutive words).

Processing such a huge file is not ideal for a practical app, but it is crucial for this report because it reveals the general characteristics of the full data. When building the actual prediction model, I will use only a sample of it (see the sampling sketch at the end of Section 3.1).

3.1. Merging files

Merging can be done by:

data <- c(data_blog, data_news, data_twitter)

This big file has 4,269,678 lines and 102,081,339 words.
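As noted above, only a sample of this merged data will be used when building the actual model. A minimal sketch with base R (the 10% rate is arbitrary, not the value used later):

# draw a random 10% of the lines from the merged data (illustrative rate only)
set.seed(1234)
data_sample <- sample(data, size = round(0.1 * length(data)))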

3.2. Handling foreign words

Removing the foreign words can easily be done by deleting the characters that are not in the English encoding (Latin/ASCII characters):

data <- iconv(data, from = "latin1", to = "ascii", sub="")

The number of words in this new dataset is 102,058,174, which means that 102,081,339 - 102,058,174 = 23,165 foreign words have been removed.
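This figure can be reproduced by recounting the words with the same approach as in Section 2:

# recount the words after removing the non-ASCII characters
sum(sapply(strsplit(data, "\\s"), length))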

3.3. n-gram frequency calculation with embedded data cleaning

3.3.1. Strategy

For this job I use the quanteda package. The routine can be viewed as 3 steps:

  • Step 1: Convert the merged plain text data to a text corpus. Basically this process (1) breaks all lines into words, (2) summarizes the word types and the number of words and sentences in each line, and (3) stores all this information in a data frame in preparation for the next step.

  • Step 2: Tokenize the corpus and perform data cleaning. Compared to building the corpus, tokenization is a higher-level process that can generate n-gram output (whereas the corpus only holds ‘separate words’ or ‘unigrams’). The quanteda package is very handy here as it supports useful data cleaning options during tokenization. In this project, I remove the following components from the data: numbers, punctuation, hyphens, symbols, URLs and Twitter tags.

  • Step 3: Create the document-feature matrix (dfm) of the n-grams. This matrix contains all the n-gram statistics (obtained with a single function call in quanteda) that we need to compute the n-gram frequencies.

3.3.2. Technical implementation

3.3.2.1. Step 1: Building corpus

qcorpus <- corpus(data)

3.3.2.2. Step 2: Tokenization

toks <- tokens(qcorpus, remove_punct = TRUE, remove_numbers = TRUE, remove_hyphens = TRUE, remove_symbols = TRUE, remove_url = TRUE, remove_twitter = TRUE, ngrams = 1)

3.3.2.3. Step 3: Creating DFM

data_dfm <- dfm(toks)

The code above is just the unigram example. Tokenization and dfm creation for the other n-grams can be done by changing the ngrams parameter.
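For example, bigram and trigram dfms could be built as sketched below (the object names toks2, toks3, data_dfm2 and data_dfm3 are mine). Note that in more recent quanteda releases the ngrams, remove_hyphens and remove_twitter arguments have been dropped from tokens(); there, n-grams are built from the cleaned unigram tokens with tokens_ngrams(), and textstat_frequency() has moved to the companion quanteda.textstats package.

# older quanteda API (as used in this report): change the ngrams argument
toks2 <- tokens(qcorpus, remove_punct = TRUE, remove_numbers = TRUE, remove_hyphens = TRUE, remove_symbols = TRUE, remove_url = TRUE, remove_twitter = TRUE, ngrams = 2)
data_dfm2 <- dfm(toks2)

# recent quanteda: build the n-grams from the cleaned unigram tokens instead
toks2 <- tokens_ngrams(toks, n = 2)
toks3 <- tokens_ngrams(toks, n = 3)
data_dfm2 <- dfm(toks2)
data_dfm3 <- dfm(toks3)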

3.3.3. Results and findings

From the dfm, I can extract the 10 most popular n-grams. An example for unigrams (frequencies of single words) is demonstrated below:

data_dfm_freq <- textstat_frequency(data_dfm, n = 10)

And of course, the plot:

ggplot(data_dfm_freq, aes(x = reorder(feature, frequency), y = frequency/(10^6)))+
    geom_bar(stat = "identity", width=.5, fill="tomato3") + 
    coord_flip() +
    labs(title="Unigram frequency",
         subtitle="Merged data",
         y="Appearance count (millions)",
         x="Unigrams (single word)",
         caption = "Top 10 unigrams by frequency") + 
    theme_minimal()

A similar process can be applied to 2-grams (pairs of adjacent words) and 3-grams (phrases of 3 consecutive words), as sketched below; their frequency plots are shown afterwards:
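The corresponding top-10 frequencies can be extracted in the same way (a sketch, reusing the hypothetical data_dfm2 and data_dfm3 objects from the earlier sketch); feeding them into the same ggplot code, with adjusted labels, produces the bigram and trigram plots:

# top-10 bigrams and trigrams, extracted exactly as for the unigrams
data_dfm2_freq <- textstat_frequency(data_dfm2, n = 10)
data_dfm3_freq <- textstat_frequency(data_dfm3, n = 10)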

At a quick glance, the results make sense to me. For example, the most popular everyday words are “the”, “to”, “and”, whereas the most frequent 3-word phrases are “one of the”, “a lot of”, “thank for the”. It seems that the initial data processing has been done properly.

The plots also illustrate that higher-order n-grams occur less frequently than lower-order ones, because many popular higher-order n-grams share the same lower-order components. For example, “the” appears in most of the top 2-grams, so its 1-gram frequency is much higher than that of any individual 2-gram containing it.

I have found that even on a Windows computer with 16 GB of RAM, I still had to set a 32 GB page file on the SSD and constantly remove objects from the workspace. Processing the full data is therefore very expensive, and a subset of it is recommended when implementing the real model.
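In practice that meant freeing intermediate objects between steps, roughly like:

# drop objects that are no longer needed and force garbage collection
rm(data_blog, data_news, data_twitter, qcorpus)
gc()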

In my experience, the most time consuming part of this routine is the tokenization (with the cleaning integrated). However, the part that eats up memory is the dfm calculation (which produced many insufficient-memory errors).

3.4. Plan for model building