Overview

The initial step in developing any prediction model is to understand what kind of data we are dealing with. In this particular case we are going to perform an exploratory analysis on 3 different natural language data sets (Blogs, News and Twitter) coming from the English HC Corpora. At the end, we should have a good understanding, amongst other things, of the most frequently used words and of what the different corpora are composed of.

Loading and cleaning the data

The data can be downloaded from here as a compressed .zip file with files in Dutch, English, Finnish and Russian. We’ll only use the files in English.

We’ll load 3 different files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt

Each one of these files will be loaded and then randomly sampled at a 10% rate, so we can work with them more easily.

# Creates an array with the names of the files to process
filesToProcess <- c("en_US.blogs.txt","en_US.news.txt", "en_US.twitter.txt")
noOfLines <- c()
sampleLines <- c()

for (fileIndex in 1:3) {
  # Creates 2 connections: one for the input file and one for the output sample file
  inputPath <- paste(getwd(),"/Data/final/en_US/",filesToProcess[fileIndex], sep = "")
  outputPath <- paste(getwd(),"/Data/final/en_US/sample_",filesToProcess[fileIndex], sep = "")
  inputFile <- file(inputPath, "r")
  outputFile <- file(outputPath, "w")
  
  # Reads the input file
  inputData <- readLines(inputFile)
  # Gets the number of lines in the file
  noOfLines[fileIndex] <- length(inputData)
  # Closes the connection
  close(inputFile)
  
  # Draws a vector of random binomial (0/1) values, with one element per line
  # previously read and roughly 10% of the values equal to 1
  linesToUse <- rbinom(noOfLines[fileIndex], 1, 0.1)
  # Writes to the output file only those lines selected (value 1) in the
  # random vector
  for (i in 1:noOfLines[fileIndex]) {
    if (linesToUse[i]==1) { cat(inputData[i], file=outputFile, sep="\n") }
  }
  # Keeps track of the number of lines to be used as sample
  sampleLines[fileIndex] <- sum(linesToUse)
  # Closes the connection
  close(outputFile)
}
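
Note that, since the sampling is random, setting a fixed seed with set.seed() before this loop would make the 10% sample, and therefore the counts in the table below, reproducible between runs.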

# Also loads a list of profanity words in English (used later to remove them from the corpora)
inputPath <- paste(getwd(),"/Data/final/en_US/bad-words.txt", sep = "")
inputFile <- file(inputPath, "r")

# Reads the input file
profaneWords <- readLines(inputFile)
close(inputFile)


File name           Number of original entries   Number of sample entries
en_US.blogs.txt     899,288                      89,398
en_US.news.txt      77,259                       7,769
en_US.twitter.txt   2,360,148                    236,184

Now, data for Natural Language Processing (NLP) normally has to go through several cleansing steps because of its very nature. The source of the data matters: data coming from news sites, for example, will be cleaner than data coming from blogs, because it is written by professionals and normally passes through an editorial filter, whereas a blog author generally tries to write without mistakes but, since the text is not edited, many errors can slip through. Tweets are the least controlled of all, since users can write in whatever way they feel like, adding acronyms or word contractions that are probably not fully accepted by the English-speaking community.

With this in mind, our next step is to clean the data by removing elements that don’t add value to our analysis; in doing so, we can also make our model smaller and faster.

This cleaning process will encompass: converting the text to lower case, removing numbers, removing punctuation, removing English stop words, removing profane words, removing non-English characters, stripping extra white space, and stemming the words.

Let’s start with the file that contains the text coming from the blog pages.

# Loads the text-mining packages used for the corpus work below
library(tm)          # Corpus creation and cleaning functions
library(SnowballC)   # Stemming (used by stemDocument)

# Loads the sample text file for BLOGS
inputPath <- paste(getwd(),"/Data/final/en_US/sample_en_US.blogs.txt", sep = "")
inputFile <- file(inputPath, "r")

# Reads the input file
blogsUS <- readLines(inputFile)
# Closes the connector
close(inputFile)

# Creates a Corpus with the text. This allows us to better work with it and mine it
blogsUSCorpus <- Corpus(VectorSource(blogsUS))

# Treats the Blogs file
# First, it converts the text to lower case
blogsUSCorpus <- tm_map(blogsUSCorpus, content_transformer(tolower))
# It removes numbers
blogsUSCorpus <- tm_map(blogsUSCorpus, removeNumbers)
# It removes punctuation
blogsUSCorpus <- tm_map(blogsUSCorpus, removePunctuation)
# It removes stop words (e.g. I, me, my, him)
blogsUSCorpus <- tm_map(blogsUSCorpus, removeWords, stopwords("english"))
# It removes profane words
blogsUSCorpus <- tm_map(blogsUSCorpus, removeWords, profaneWords)
# It removes all non-English (non a-z) characters, replacing them with a space
blogsUSCorpus <- tm_map(blogsUSCorpus, content_transformer(function(x) gsub("[^a-z]", " ", x)))
# Eliminates the white spaces
blogsUSCorpus <- tm_map(blogsUSCorpus, stripWhitespace)
# Finally, stems the words in the document. 
blogsUSCorpusStem <- tm_map(blogsUSCorpus, stemDocument)

We perform the same preparation for the other 2 text files, but in order to keep this document short we won’t show it here.

Exploratory Analysis

Word Frequency

Once the data is prepared we can have a first look at the most frequent words in each corpus. One of the many ways of looking at this is through the use of a word cloud, which is a visual representation of text data where the importance of each word is shown by font size or color. In our case, the “importance” of a word is measured by its frequency.
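
As a rough illustration, the word cloud for the Blogs sample could be generated along the following lines. This is only a minimal sketch: it assumes the wordcloud and RColorBrewer packages are installed, reuses the cleaned blogsUSCorpus from above, and the blogsTDM / blogsFreq names are just illustrative.

# Loads the plotting packages for the word cloud
library(wordcloud)
library(RColorBrewer)

# Builds a term-document matrix from the cleaned Blogs corpus and sums the
# occurrences of each term (slam::row_sums avoids creating a huge dense matrix)
blogsTDM <- TermDocumentMatrix(blogsUSCorpus)
blogsFreq <- sort(slam::row_sums(blogsTDM), decreasing = TRUE)

# Draws the 100 most frequent words, sized by frequency
wordcloud(names(blogsFreq), blogsFreq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))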


As we can easily see, the word used most frequently in the Blogs sample data is “one”. We can also see that there is a wide range of words appearing at quite different frequencies.

In the case of the News sample, it’s very clear that a few words are used very frequently, while many other words appear at a much lower frequency (shown here in green).

Not surprisingly, the Twitter sample data shows a much greater diversity of words and frequencies.

So, let’s try to quantify what we started to see in the word clouds. First, let’s generate a chart that shows these frequencies.
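
As a sketch of how such a chart could be produced (reusing the blogsFreq vector from the word-cloud example above, and assuming the ggplot2 package is available; topWords is just an illustrative name):

library(ggplot2)

# Puts the 20 most frequent terms of the Blogs sample into a data frame
topWords <- data.frame(word = names(blogsFreq)[1:20],
                       freq = as.numeric(blogsFreq[1:20]))

# Horizontal bar chart of the term frequencies
ggplot(topWords, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Most frequent terms - Blogs sample")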

Language coverage

Another useful piece of information is how dispersed or concentrated the vocabulary that makes up each of our corpora is. In other words, what coverage of the language is achieved by the most frequent words.

Corpus    Words needed for 50% coverage   % of total vocabulary   Words needed for 90% coverage   % of total vocabulary
Blogs     538                             0.7 %                   7,118                           8.9 %
News      643                             3.4 %                   6,405                           34.1 %
Twitter   339                             0.4 %                   6,494                           7.6 %

We can plot the cumulative percentage of coverage against the number of words needed to reach that coverage, in order to identify which of our 3 corpora has the most diversified language (i.e. uses the largest number of distinct words).
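
As a minimal sketch, the coverage figures and the plot for the Blogs sample could be obtained from the sorted frequency vector blogsFreq used earlier (coverage, wordsFor50 and wordsFor90 are illustrative names):

# Cumulative share of all word occurrences covered by the most frequent words
coverage <- cumsum(blogsFreq) / sum(blogsFreq)

# Number of distinct words needed to reach 50% and 90% coverage
wordsFor50 <- which(coverage >= 0.5)[1]
wordsFor90 <- which(coverage >= 0.9)[1]

# Plots the cumulative coverage against the number of words
plot(coverage, type = "l", xlab = "Number of words", ylab = "Cumulative coverage")
abline(h = c(0.5, 0.9), lty = 2)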

By looking at these plots we can easily see that the number of words needed to cover 100% of the News sample is much lower (fewer than 20,000 words) than in the other corpora (closer to 80,000 words). This could mean that, since news articles are written by professionals in a controlled environment and reviewed by an editorial team, the set of “valid” words is more restricted.

On the other hand, blogs and tweets are written more freely (especially tweets), which allows the authors to use as many different words as they feel they need. It’s a more informal and unstructured environment than news.

N-Gram Analysis

Another useful analysis to perform on our corpora is an n-gram analysis. Basically, we can do exactly the same analysis that we already did for single words, but on combinations of 2 or more words (n-grams). This is very important for our final goal, as it sets the basis for the model that will predict the next word a person will probably write after writing one or more words.

Let’s start by “tokenizing” our corpora into 2-, 3- and 4-grams and looking at the most frequent ones, again making use of word clouds.
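
One simple way of building these n-grams, without extra packages, is sketched below for bigrams on the Blogs corpus (makeNgrams, cleanText and blogsBigramFreq are illustrative names; the tokenization could equally be done with a package such as RWeka):

# Splits a line into words and pastes together every run of n consecutive words
makeNgrams <- function(line, n) {
  words <- unlist(strsplit(line, "\\s+"))
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  sapply(1:(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

# Extracts the cleaned text back out of the corpus and counts the bigrams
cleanText <- sapply(seq_along(blogsUSCorpus), function(i) as.character(blogsUSCorpus[[i]]))
blogsBigrams <- unlist(lapply(cleanText, makeNgrams, n = 2))
blogsBigramFreq <- sort(table(blogsBigrams), decreasing = TRUE)
# Using n = 3 and n = 4 gives the trigram and fourgram frequencies in the same way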

Blogs n-grams

News n-grams

Twitter n-grams

Prediction Model

So, now we have a better idea of what kind of data we are dealing with. In doing so, we have also created a base data set to use for predicting the next word.

The model that we are going to use will select the word with the highest probability of being written after the words that have already been typed. The probability will be based on frequency, and it will take into account, within realistic boundaries, the context of the phrase, calculated from the n words written before (where n takes values from 5 down to 1).

So, basically, we’ll create 5 models (bigrams, trigrams, fourgrams, fivegrams and sixgrams) for predicting the next word, depending on the number of words that we want to use as context. We’ll count the number of words in the phrase that’s being written and match it to the corresponding model. Then we search for that chain of words in that particular model and return the first k words with the highest probability (k will be a parameter passed to the prediction function). If the chain of words is not found in the model, we search for the same chain minus its first word in the next, smaller model, and we keep doing this until we find a next word to suggest.

As an example, let’s suppose that the user starts to write the phrase “I want to go to the movies”: after typing “I want to go to the”, we take the last 5 words (“want to go to the”) and look them up in the sixgram model; if they are found, we suggest the k most frequent next words (hopefully including “movies”); if not, we back off and search for “to go to the” in the fivegram model, and so on.

We continue the same process over and over until the user stops writing.
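
A minimal sketch of this backoff lookup is shown below. It assumes a hypothetical list ngramModels, where element 1 holds the overall word frequencies and elements 2 to 6 hold the bigram to sixgram tables, each as a data frame with the columns context, nextWord and freq; predictNextWord and these column names are only illustrative, not the final implementation.

# Sketch of the backoff lookup; ngramModels is a hypothetical list of data frames
# (element n = the n-gram model) with columns: context, nextWord, freq
predictNextWord <- function(phrase, ngramModels, k = 3) {
  # Splits the phrase into words and keeps at most the last 5 as context
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  words <- words[words != ""]
  context <- tail(words, 5)
  if (length(context) == 0) return(head(ngramModels[[1]]$nextWord, k))

  # Starts with the largest usable model and backs off one word at a time
  for (n in length(context):1) {
    model <- ngramModels[[n + 1]]            # an (n+1)-gram model predicts from n context words
    key <- paste(tail(context, n), collapse = " ")
    matches <- model[model$context == key, ]
    if (nrow(matches) > 0) {
      # Returns the k most frequent next words seen after this context
      matches <- matches[order(matches$freq, decreasing = TRUE), ]
      return(head(matches$nextWord, k))
    }
  }
  # Falls back to the overall most frequent words if no context matches
  head(ngramModels[[1]]$nextWord, k)
}

# Example call: predictNextWord("I want to go to the", ngramModels)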

We also have to consider the response time and the amount of memory that we can use. We’ll have to test various sample sizes and various data structures until we achieve a good balance between these 2 variables.

What do you think?