library(stringi)
library(quanteda)
library(readtext)
library(readr)
library(ggplot2)
library(kableExtra)
library(knitr)

Summary

The aim of the project is to build a word prediction algorithm based on the three US English data files. This milestone report presents the first steps taken towards developing the prediction model.

Introduction

In this capstone we are applying data science in the area of natural language processing. The idea is to develop an algorithm for predicting text. For this purpose, we have a dataset from a corpus called HC Corpora. The data is from various sources: news, blogs and Twitter.

The first step before developing the model is getting and cleaning the data. The corpus includes texts in four languages, but for this project we only use the data in the “English” folder.

Exploratory analysis

Getting and loading the data

As a first step, we need to download the data and load it into R. Blog and Twitter data can be imported without problems (for example, with the readtext package or readLines() in base R). The news data has EOF problems: with those commands the loaded data only contains around 20 MB, while the txt file size is around 200 MB. We therefore decide to read the file in binary mode to avoid the problem.
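A minimal sketch of this binary-mode workaround (the file paths are assumptions based on the usual layout of the dataset):

con <- file("final/en_US/en_US.news.txt", open = "rb")      # binary connection avoids the EOF issue
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Blog and Twitter files load without issues in ordinary text mode
blogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)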

Summary Statistics

The following table shows the basic statistics of the three data sets. Note that, for the Twitter data set, the maximum number of characters per line is 140, as expected. The total number of words has not been included because it is preferable to count words once the data has been cleaned and tokenized.
Source     File size (MB)   Nº lines    Nº characters   Max. chars/line
twitter    159              2360148     162096031       140
blogs      200              899288      206824505       40833
news       196              1010242     203223158       11384
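These figures can be computed directly from the loaded character vectors, for example (object names as in the loading sketch above):

data_sets <- list(twitter = twitter, blogs = blogs, news = news)
t(sapply(data_sets, function(x) c(n_lines  = length(x),        # number of lines
                                  n_chars  = sum(nchar(x)),    # total characters
                                  max_line = max(nchar(x)))))  # longest line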

Constructing the Corpus, Cleaning, Tokenizing and Profanity filtering

For managing and analyzing text we are going to use the quanteda package.

Due to the huge amount of data (and after experiencing some “out of memory” errors), we decide to set a seed and randomly select 50% of the data for the exploratory analysis.
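A sketch of this sampling step (the seed value and object names are assumptions; the 50% fraction is the one stated above):

set.seed(1234)                       # assumed seed value
sample_frac <- 0.5
twitter_s <- sample(twitter, round(length(twitter) * sample_frac))
blogs_s   <- sample(blogs,   round(length(blogs)   * sample_frac))
news_s    <- sample(news,    round(length(news)    * sample_frac))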

First of all, we create a Corpus including the three sources available. Once the Corpus is created, and before tokenizing the texts, we clean some “bad characters”. Since the texts are in English, converting them from UTF-8 to ASCII and removing the non-convertible characters gets rid of a large part of these bad characters. Moreover, regarding accent marks, forcing the texts to ASCII might standardize names, so that we do not end up with two versions of the same word, one with an accent mark and one without.
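A sketch of the Corpus construction and character cleaning (object names are assumptions; iconv() with sub = "" simply drops the non-convertible characters):

all_text  <- iconv(c(twitter_s, blogs_s, news_s), from = "UTF-8", to = "ASCII", sub = "")
my_corpus <- corpus(all_text)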

Once this is done, we tokenize the text, removing numbers, punctuation, symbols, hyphens, Twitter characters and URLs. We also convert the tokens to lower case, preserving upper-case acronyms when detected.
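A tokenization sketch using quanteda (the remove_twitter and remove_hyphens arguments belong to the older tokens() interface that matches the dfm(..., ngrams = ) calls used below; recent versions have replaced them):

tk <- tokens(my_corpus,
             remove_numbers = TRUE,
             remove_punct   = TRUE,
             remove_symbols = TRUE,
             remove_twitter = TRUE,
             remove_hyphens = TRUE,
             remove_url     = TRUE)
tk <- tokens_tolower(tk, keep_acronyms = TRUE)   # lower case, keeping acronyms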

We also use a “bad word” list to filter out profanity, if any. This list has been downloaded from a GitHub repo, but any other list could be used.
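A sketch of the profanity filtering (the file name is an assumption; the actual list is the one downloaded from GitHub):

bad_words <- readLines("bad-words.txt", warn = FALSE)   # one profanity per line
tk <- tokens_remove(tk, pattern = bad_words)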

Creating N-grams and Document-Feature Matrix (dfm)

In statistical Natural Language Processing (NLP), an N-gram is a contiguous sequence of n items from a given sequence of text or speech. A dfm is a matrix with as many rows as there are documents (here, lines of text) and as many columns as there are unique features (words or N-grams) in the corpus.

if(!file.exists("./trigram.rds")) {
      my_dfm3 <- dfm(tk, tolower = FALSE, ngrams=3, concatenator = " ")
      saveRDS(my_dfm3, file = "trigram.rds")
} else {
      if(!exists("my_dfm3"))
      my_dfm3 <- readRDS("./trigram.rds")
}
if(!file.exists("./bigram.rds")) {
      my_dfm2 <- dfm(tk, tolower = FALSE, ngrams=2, concatenator = " ")
      saveRDS(my_dfm2, file = "bigram.rds")
} else {
      if(!exists("my_dfm2"))
      my_dfm2 <- readRDS("./bigram.rds")
}
if(!file.exists("./unigram.rds")) {
      my_dfm <- dfm(tk, tolower = FALSE, ngrams=1)
      saveRDS(my_dfm, file = "unigram.rds")
} else {
      if(!exists("my_dfm1"))
      my_dfm <- readRDS("./unigram.rds")
}

Visualizing data

Top features

The top 20 features in the unigram, bigram and trigram dfms are the following:
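These can be extracted with quanteda's topfeatures(), for example:

top_uni <- topfeatures(my_dfm,  20)   # top 20 unigrams
top_bi  <- topfeatures(my_dfm2, 20)   # top 20 bigrams
top_tri <- topfeatures(my_dfm3, 20)   # top 20 trigrams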

Wordclouds

The following plots show the wordclouds for the three types of N-grams we have created. A sketch of how they can be generated is given after the plot headings below.

Unigrams

2-grams

3-grams
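
One way to produce these wordclouds (in recent quanteda versions textplot_wordcloud() lives in the quanteda.textplots package; in older releases it is part of quanteda itself):

textplot_wordcloud(my_dfm,  max_words = 100)   # unigrams
textplot_wordcloud(my_dfm2, max_words = 100)   # 2-grams
textplot_wordcloud(my_dfm3, max_words = 100)   # 3-grams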

Plans for prediction algorithm

Text prediction algorithms generally work by looking at the context in which words appear, quantifying word tendencies with N-grams. Taking this into account, and based on the exploratory analysis, the plans for the next steps are:

  1. Divide the data into training and testing sets.

  2. Build a model to predict the next word based on N-grams. I plan to use trigrams (according to the bibliography, there is a big improvement when moving from 2-grams to 3-grams, but a much smaller one from 3-grams to 4-grams or higher).

  3. The predictive model will be based on Katz’s back-off model. This is an N-gram language model that predicts the next word from its conditional probability given the previous words in the N-gram, backing off to a lower-order N-gram when the higher-order one has not been observed (see the sketch after this list). In order to deal with words that do not appear in the training set, Kneser-Ney smoothing can be used.

  4. Apply the model to the testing set.

  5. Create a Shiny App.
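For reference, the standard trigram form of the Katz back-off estimate mentioned in step 3 (a general textbook formulation, not an implementation detail of this project) is roughly:

P_{\text{Katz}}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
d \, \dfrac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})} & \text{if } C(w_{i-2}\, w_{i-1}\, w_i) > 0 \\[1ex]
\alpha(w_{i-2}\, w_{i-1}) \, P_{\text{Katz}}(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}

where C(·) denotes the N-gram counts in the training data, d is a discount factor and α is the back-off weight that redistributes the discounted probability mass to the lower-order model.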

Potential problems and limitations

  • Possible “out of memory” errors. Due to the large number of elements, it is possible to run into this type of error (the larger the N in the N-grams, the more likely it is).

  • If this happens, we should remove the low-frequency N-grams (frequencies under 3 or 4), which, in fact, should not contribute much to the prediction algorithm… The first trials show that we could keep around 75% of the data with this option (see the trimming sketch after this list).

  • We have not stemmed or lemmatized the tokens. Taking into account that we are trying to predict the next word, I am not sure we should do it. If the tokens are stemmed, the predicted word is not going to be accurate… Please correct me if I’m wrong.
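A sketch of how the low-frequency trimming could be done with quanteda (the argument is min_termfreq in recent versions; older releases call it min_count):

my_dfm3_trimmed <- dfm_trim(my_dfm3, min_termfreq = 3)   # keep trigrams seen at least 3 times
my_dfm2_trimmed <- dfm_trim(my_dfm2, min_termfreq = 3)   # same threshold for bigrams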