This Natural Language Processing project is the final Capstone of the Coursera Data Science Specialization by Johns Hopkins University. The goal is to build a text prediction app that takes words as input from the user. In this report, I summarize initial statistical findings from the data and a plan for building the prediction model.
The second report presents my prediction model, the next step after the work in this report. You can find it here: http://rpubs.com/nhohung/NLP_prediction. The final app is published online at: https://nhohung.shinyapps.io/TextPrediction/. A short presentation of this project is posted at: http://rpubs.com/nhohung/NLP_summary.
library(quanteda)
library(ggplot2)
The data contains 3 English text files collected from blogs, news and Twitter. First of all, the file size can be found, for example, by:
file.size("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt") / (1024^2)
## [1] 200.4242
where the result is the file size in MB.
I load the 3 files separately; the news file needs the flag rb (read binary) because it has a problem with an end-of-file character. An example of loading these files is as follows:
data_blog <- readLines(file("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", open = 'r'))
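The news file can be loaded the same way with the connection opened in rb mode (a sketch assuming the standard file name of the dataset):
# the news file is opened in binary mode to avoid the end-of-file character issue
data_news <- readLines(file("./Coursera-SwiftKey/final/en_US/en_US.news.txt", open = 'rb'))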
We can then check the total number of lines and words in each file:
# number of lines:
length(data_blog)
## [1] 899288
# number of words:
sum(sapply(strsplit(data_blog, "\\s"), length))
## [1] 37334641
Applying the same steps to the other 2 files, we get the following data summary:
File source | Size (in MB) | Line count | Word count |
---|---|---|---|
blog | 200.42 | 899,288 | 37,334,641 |
news | 196.28 | 1,010,242 | 34,372,792 |
twitter | 159.36 | 2,360,148 | 30,373,906 |
To investigate the data more deeply, I will merge the three files into a single big file, remove foreign words, and calculate the frequencies of unigrams (single words), bigrams (pairs of 2 adjacent words) and trigrams (phrases of 3 consecutive words).
Processing such a huge file is not ideal for a practical app, but it is crucial for this report because it shows the general characteristics of the full data. When building the actual prediction model, I will use a sample of it.
Merging can be done by:
data <- c(data_blog, data_news, data_twitter)
This big file has 4,269,678 lines and 102,081,339 words.
Removing foreign words can easily be done by deleting characters that are not in the English (Latin) encoding:
data <- iconv(data, from = "latin1", to = "ascii", sub="")
The number of words in this new file is 102,058,174, which means that 102,081,339 - 102,058,174 = 23,165 foreign words were removed.
To compute the n-gram frequencies, I use the package quanteda. The routine can be viewed as 3 steps:
Step 1: Convert our merged regular text data to a text corpus. Basically this process (1) breaks all lines into words, (2) summarizes the word types and the number of words and sentences in each line, and (3) stores all this information in a data frame as preparation for the next step.
Step 2: Tokenize the corpus and perform data cleaning. Compared to building the corpus, tokenization is a higher-level process that can generate n-gram output (whereas the corpus only produces 'separate words', or 'unigrams'). The quanteda package is very handy as it supports useful data cleaning options during tokenization. In this project, I remove the following components from the data: numbers, punctuation, hyphens, symbols, URLs and Twitter tags.
Step 3: Create the document-feature matrix (dfm) of the n-grams. This matrix contains all the n-gram statistics we need to obtain the n-gram frequencies (with just a single function call in quanteda).
qcorpus <- corpus(data)
toks <- tokens(qcorpus, remove_punct = TRUE, remove_numbers = TRUE, remove_hyphens = TRUE, remove_symbols = TRUE, remove_url = TRUE, remove_twitter = TRUE, ngrams = 1)
data_dfm <- dfm(toks)
The code above is just an example for unigrams. Tokenization and dfm creation for other n-grams can be done by changing the ngrams parameter, as sketched below.
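For instance, bigram and trigram dfms can be built by mirroring the unigram call above with ngrams set to 2 or 3 (a sketch assuming the same quanteda version used here; newer releases provide tokens_ngrams() for this step):
# bigram tokens and dfm
toks2 <- tokens(qcorpus, remove_punct = TRUE, remove_numbers = TRUE, remove_hyphens = TRUE, remove_symbols = TRUE, remove_url = TRUE, remove_twitter = TRUE, ngrams = 2)
data_dfm2 <- dfm(toks2)
# trigram tokens and dfm
toks3 <- tokens(qcorpus, remove_punct = TRUE, remove_numbers = TRUE, remove_hyphens = TRUE, remove_symbols = TRUE, remove_url = TRUE, remove_twitter = TRUE, ngrams = 3)
data_dfm3 <- dfm(toks3)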
From the dfm, I can extract the 10 most frequent n-grams. An example for unigrams (frequencies of single words) is demonstrated below:
data_dfm_freq <- textstat_frequency(data_dfm, n = 10)
And of course, the plot:
ggplot(data_dfm_freq, aes(x = reorder(feature, frequency), y = frequency/(10^6)))+
geom_bar(stat = "identity", width=.5, fill="tomato3") +
coord_flip() +
labs(title="Unigram frequency",
subtitle="Merged data",
y="Appearance count (millions)",
x="Unigrams (single word)",
caption = "Most sorted frequency") +
theme_minimal()
A similar process can be applied to 2-grams (pairs of adjacent words) and 3-grams (phrases of 3 consecutive words). Below are their frequency plots:
At a quick glance, the results make sense to me. For example, the most popular everyday words are "the", "to" and "and", whereas the most popular 3-word phrases are "one of the", "a lot of" and "thank for the". It seems that the initial data processing was done properly.
The plots also show that higher-order n-grams occur less frequently than lower-order ones, because several popular higher-order n-grams may share the same lower-order n-gram. For example, "the" appears in most of the top 2-grams, so its unigram count is much higher than the count of any single 2-gram containing it.
I have found that on a Windows computer with 16 GB of RAM, I still had to set a 32 GB page file on the SSD and constantly remove objects from the workspace. Processing data of this size is therefore very expensive; a subset of the data is recommended when implementing the real model.
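As a small illustration of that workspace housekeeping (a sketch using the object names from the code above):
# once the merged object exists, the individual files can be dropped to free memory
rm(data_blog, data_news, data_twitter)
gc()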
In my experience, the most time-consuming part of this routine is the tokenization (with the cleaning integrated). The part that eats up memory, however, is the dfm calculation (lots of insufficient-memory errors).
As the object size (even with sampling) is huge, a data table should be used instead of a data frame, because data tables offer very fast keyed indexing and require much less memory. In addition, the n-grams can be created in a single run, since all of them will be stored in a data table anyway.
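For example, the frequency output from the previous step could be stored as a keyed data table (a minimal sketch; the actual model tables are built in the follow-up report):
library(data.table)
# convert the quanteda frequency output to a data table and key it by n-gram for fast lookup
freq_dt <- as.data.table(textstat_frequency(data_dfm))
setkey(freq_dt, feature)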
To make the model smaller and more efficient, I won't keep the full-size tokenization. Instead, I will keep only the n-grams that cover most of the data, perhaps 95%. The remaining 5% of coverage can correspond to a very large number of tokens with low counts, and removing them greatly reduces the model size. The model then needs only 2 fields: the n-grams and their occurrence counts.
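A rough sketch of that trimming, reusing the hypothetical freq_dt table from above:
# keep the most frequent n-grams that together cover about 95% of all occurrences
freq_dt <- freq_dt[order(-frequency)]
coverage <- cumsum(freq_dt$frequency) / sum(freq_dt$frequency)
model_dt <- freq_dt[coverage <= 0.95, .(feature, frequency)]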
I think a good approach for this project is to train the model with more data, to get broad coverage of language use, and then keep the top 90% or 95% of the trained model (depending on actual results that I don't know at this moment) for the prediction model. With more input data, one way to avoid memory overflow is to divide the data into smaller chunks and process them consecutively, as sketched below.
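A rough sketch of such chunked processing (the chunk size of 100,000 lines is an arbitrary assumption, and the cleaning flags are shortened for readability):
chunk_size <- 100000
n_chunks <- ceiling(length(data) / chunk_size)
dfm_list <- vector("list", n_chunks)
for (i in seq_len(n_chunks)) {
    idx <- ((i - 1) * chunk_size + 1):min(i * chunk_size, length(data))
    toks_i <- tokens(corpus(data[idx]), remove_punct = TRUE, remove_numbers = TRUE)
    dfm_list[[i]] <- dfm(toks_i)
    rm(toks_i)
    gc()
}
data_dfm <- do.call(rbind, dfm_list)  # combine the chunk-level dfms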
The model can be trained such that when the user inputs one word, it suggests the most popular 2-grams that start with the input; when the user inputs two words, it suggests the most popular 3-grams that start with the input, and so on. If the user inputs n words and no matching n-gram is found, the n in the suggestion is reduced until a match with the input is found (this is demonstrated in the later report as the back-off model).
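A rough sketch of that back-off lookup (not the final implementation; ngram_dt_list is a hypothetical list of data tables, one per n-gram order, each with columns feature and frequency, where words are joined by quanteda's default "_" separator):
predict_next <- function(input_words, ngram_dt_list) {
    n <- min(length(input_words), length(ngram_dt_list) - 1)
    while (n > 0) {
        prefix <- paste(tail(input_words, n), collapse = "_")
        # (n+1)-grams whose first n words match the input
        hits <- ngram_dt_list[[n + 1]][startsWith(feature, paste0(prefix, "_"))]
        if (nrow(hits) > 0) return(head(hits[order(-frequency)], 10))
        n <- n - 1  # back off to a shorter prefix
    }
    head(ngram_dt_list[[1]][order(-frequency)], 10)  # fall back to the most common unigrams
}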
To evaluate the model, I can randomly take samples of n words from each line of the test set as the input and the next word as the true value, then check whether the true value falls within the 10 or 15 highest-ranked predictions.
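One possible sketch of this check, reusing the hypothetical predict_next() above:
evaluate_line <- function(line, ngram_dt_list, n = 2, top_k = 10) {
    words <- unlist(strsplit(line, "\\s+"))
    if (length(words) <= n) return(NA)
    start <- sample(seq_len(length(words) - n), 1)   # random position within the line
    input <- words[start:(start + n - 1)]            # n consecutive input words
    truth <- words[start + n]                        # the actual next word
    preds <- head(predict_next(input, ngram_dt_list), top_k)
    truth %in% sub(".*_", "", preds$feature)         # last word of each predicted n-gram
}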
This last bullet is only for illustration purposes. I know that on RPubs the bottom status bar may sometimes hide part of the content, so I added this point to make sure you can read everything presented above; if RPubs does cover something, it is covering this part. Thanks for reading.