library(stringi)
library(quanteda)
library(readtext)
library(readr)
library(ggplot2)
library(kableExtra)
library(knitr)
The aim of this project is to write a word prediction algorithm based on the three US English data files. This milestone report presents the first steps taken to develop the prediction model.
In this capstone we are applying data science in the area of natural language processing. The idea is to develop an algorithm for predicting text. For this purpose, we have a dataset from a corpus called HC Corpora. The data is from various sources: news, blogs and Twitter.
The first step before developing the model is getting and cleaning the data. The corpus includes texts in four languages, but for this project we only use the data in the “English” folder.
As a first step, we need to download the data and load it into R. The blog and Twitter data can be imported without problems (for example, with the readtext package or readLines() in base R). The news data, however, contains embedded EOF characters, so with those commands only around 20 MB are loaded although the txt file is around 200 MB. We therefore read that file in binary mode to avoid the problem, as sketched below.
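A minimal sketch of the binary-mode read (en_US.news.txt is the usual file name inside the English folder of the download; adjust the path as needed):

con <- file("en_US.news.txt", open = "rb")                  # binary mode survives embedded EOF characters
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)   # drop embedded nulls while reading
close(con)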
| Source | File size (MB) | Nº lines | Nº char | Max. char/line |
|---|---|---|---|---|
| twitter | 159 | 2360148 | 162096031 | 140 |
| blogs | 200 | 899288 | 206824505 | 40833 |
| news | 196 | 1010242 | 203223158 | 11384 |
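As an illustration, the figures in the table can be obtained with base R once a file has been read into a character vector; the example below assumes the blog lines are already loaded into a (hypothetical) vector called blogs:

file.size("en_US.blogs.txt") / 1024^2   # file size in MB
length(blogs)                           # number of lines
sum(nchar(blogs))                       # total number of characters
max(nchar(blogs))                       # longest line, in characters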
For managing and analyzing the text we are going to use the quanteda package.
Due to the huge amount of data (and after running into some “out of memory” errors), we set a seed and randomly select 50% of the data for the exploratory analysis, as sketched below.
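A minimal sketch of the sampling step, assuming the three sources are already loaded into the character vectors blogs, news and twitter (the seed value is arbitrary; any fixed value makes the sample reproducible):

set.seed(1234)
all_lines <- c(blogs, news, twitter)                          # pool the three sources
sample_lines <- sample(all_lines, length(all_lines) * 0.5)    # keep a random 50%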
First of all, we create a corpus including the three sources available. Once the corpus is created, and before tokenizing the texts, we clean some “bad characters”. Since the texts are in English, converting them from UTF-8 to ASCII and removing the non-convertible characters gets rid of a large part of these bad characters. Moreover, regarding accent marks, forcing the text to ASCII should standardize names, so we do not keep two versions of the same word, one with an accent mark and one without.
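One way to do this conversion (a sketch; texts stands for the character vector behind the corpus):

texts_ascii <- iconv(texts, from = "UTF-8", to = "ASCII", sub = "")   # drop characters that cannot be represented in ASCII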
Once this is done we tokenize the text, removing numbers, punctuation, symbols, hyphens, Twitter characters and URLs. We also convert the tokens to lower case, preserving upper-case acronyms when detected.
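A sketch of the tokenization with quanteda (my_corpus stands for the corpus created above, and tk is the tokens object used in the rest of the code; the argument names follow the current tokens() interface, while older quanteda versions also offered remove_twitter and remove_hyphens):

tk <- tokens(my_corpus,
             remove_numbers = TRUE,
             remove_punct   = TRUE,
             remove_symbols = TRUE,
             remove_url     = TRUE)
tk <- tokens_tolower(tk, keep_acronyms = TRUE)   # lower-case, but keep upper-case acronyms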
We also use a “bad word” list to filter out profanity, if any. This list has been downloaded from a GitHub repo, but any other list could be used.
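The filtering can then be applied to the tokens object (a sketch; bad_words stands for the character vector read from the downloaded list):

tk <- tokens_remove(tk, pattern = bad_words)   # drop any token that matches the profanity list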
In statistical Natural Language Processing (NLP), an N-gram is a contiguous sequence of n items from a given sequence of text or speech. A document-feature matrix (dfm) has one row per document (here, per line of text) and one column per unique feature (word or N-gram) in the corpus.
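As a small illustration, quanteda's tokens_ngrams() turns a tokenized sentence into overlapping N-grams:

tokens_ngrams(tokens("the quick brown fox"), n = 3, concatenator = " ")
# yields the trigrams "the quick brown" and "quick brown fox"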
if(!file.exists("./trigram.rds")) {
my_dfm3 <- dfm(tk, tolower = FALSE, ngrams=3, concatenator = " ")
saveRDS(my_dfm3, file = "trigram.rds")
} else {
if(!exists("my_dfm3"))
my_dfm3 <- readRDS("./trigram.rds")
}
if(!file.exists("./bigram.rds")) {
my_dfm2 <- dfm(tk, tolower = FALSE, ngrams=2, concatenator = " ")
saveRDS(my_dfm2, file = "bigram.rds")
} else {
if(!exists("my_dfm2"))
my_dfm2 <- readRDS("./bigram.rds")
}
if(!file.exists("./unigram.rds")) {
my_dfm <- dfm(tk, tolower = FALSE, ngrams=1)
saveRDS(my_dfm, file = "unigram.rds")
} else {
if(!exists("my_dfm1"))
my_dfm <- readRDS("./unigram.rds")
}
The top 20 features in the unigram, bigram and trigram dfms are the following:
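These frequencies can be extracted with quanteda's topfeatures(), for example:

topfeatures(my_dfm, 20)    # 20 most frequent unigrams
topfeatures(my_dfm2, 20)   # 20 most frequent bigrams
topfeatures(my_dfm3, 20)   # 20 most frequent trigrams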
The following plots show the word clouds for the three types of N-grams we have created:
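A sketch of how the word clouds can be produced (textplot_wordcloud() lives in quanteda itself in older releases and in the quanteda.textplots package from quanteda 3 onwards):

textplot_wordcloud(my_dfm, max_words = 100)    # unigram cloud; repeat for my_dfm2 and my_dfm3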
Text prediction algorithms generally work by looking at the context in which words appear, quantifying the word tendencies with N-grams. Taking this into account and based on the exploratory analysis, the plans for the next steps are:
- Divide the data into training and testing sets.
- Build a model to predict the next word based on N-grams. I plan to use trigrams (according to the literature there is a big improvement when you move from 2-grams to 3-grams, but not so big from trigrams to 4-grams or higher).
- The predictive model will be based on Katz's back-off model. This N-gram language model predicts the next word from the conditional probability of a word given the previous words in the N-gram, backing off to shorter N-grams when the longer one has not been observed (a simplified sketch follows this list). To deal with words that do not appear in the training set, Kneser-Ney smoothing can be used.
- Apply the model to the testing set.
- Create a Shiny app.
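A highly simplified sketch of the back-off idea, not the full Katz model with discounting. The named numeric vectors counts3, counts2 and counts1 are hypothetical trigram, bigram and unigram frequency tables (N-gram strings as names, counts as values):

predict_next <- function(w1, w2, counts3, counts2, counts1) {
  # look for trigrams starting with "w1 w2 "
  prefix3 <- paste(w1, w2, "")
  hits3 <- counts3[startsWith(names(counts3), prefix3)]
  if (length(hits3) > 0)
    return(sub(prefix3, "", names(which.max(hits3)), fixed = TRUE))
  # back off to bigrams starting with "w2 "
  prefix2 <- paste(w2, "")
  hits2 <- counts2[startsWith(names(counts2), prefix2)]
  if (length(hits2) > 0)
    return(sub(prefix2, "", names(which.max(hits2)), fixed = TRUE))
  # last resort: the most frequent unigram overall
  names(which.max(counts1))
}

# toy example
counts3 <- c("at the end" = 5, "at the beach" = 2)
counts2 <- c("the end" = 7, "the cat" = 3)
counts1 <- c("the" = 50, "end" = 9)
predict_next("at", "the", counts3, counts2, counts1)   # returns "end"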
Possible “out of memory” errors: due to the large number of elements, this type of error may occur (the larger the “N” in the N-gram, the more likely it is).
If this happens, we should remove the low-frequency N-grams (those appearing fewer than 3 or 4 times), which, in fact, should not contribute much to the prediction algorithm. The first trials show that we could keep around 75% of the data with this option (see the sketch below).
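A sketch of the trimming step with quanteda (the threshold of 3 is just the value mentioned above; the argument is min_termfreq in recent quanteda versions, min_count in older ones):

my_dfm3 <- dfm_trim(my_dfm3, min_termfreq = 3)   # keep only trigrams seen at least 3 times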
We have not stemmed or lemmatized the tokens. Taking into account that we are trying to predict the next word, I am not sure we should do it: if the tokens are stemmed, the predicted word is not going to be accurate. Please correct me if I’m wrong.