Course Project

This course starts with the basics of NLP (Natural Language Processing), analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, you will use the knowledge you gained in data products to build a predictive text product.

Milestone Report

The goal of this report is simply to demonstrate that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm.

The Data

The data is from a corpus called HC Corpora. This exercise uses the files named LOCALE.blogs.txt where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI.

I will be using only the English-language files.

Exploratory Analysis

The data are large text files, over 4 million lines combined. The Unix word count (wc) gives 102,081,616 individual words.
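As a rough check of these numbers, a minimal sketch in R; the file paths are assumptions about the local layout:

    library(stringi)

    # Assumed local paths to the English data files; adjust to your own layout.
    files <- c(blogs   = "final/en_US/en_US.blogs.txt",
               news    = "final/en_US/en_US.news.txt",
               twitter = "final/en_US/en_US.twitter.txt")

    summarise_file <- function(path) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      c(lineCount            = length(lines),
        wordCount            = sum(stri_count_words(lines)),
        medianCharacterCount = median(nchar(lines)))
    }

    sapply(files, summarise_file)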

Exploring the data visually, it is clear that the data are in random order: for example, the lines in the blogs file are not complete posts, and the lines are not in sequential order. For text prediction I decided to split the text into one sentence per line for the further transformations and modelling.
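A sketch of that split using quanteda’s corpus_reshape(), assuming the raw lines from the files are combined into a character vector raw_lines:

    library(quanteda)

    # raw_lines is assumed to hold the raw text lines read from the three files.
    corp <- corpus(raw_lines)

    # Reshape so that every document in the corpus is a single sentence;
    # these sentences are the units used for the later transformations and models.
    corp_sentences <- corpus_reshape(corp, to = "sentences")
    ndoc(corp_sentences)   # number of sentences after the split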

Summary of Raw Files

File        Line count    Median characters per line
Blogs          899,288                           156
News         1,010,242                           185
Twitter      2,360,148                            64
Combined     4,269,678                            88

Text Transformations

My first approach was to use the tm package to transform and analyse the corpora into useful units (tokenizing). It turned out to be quite slow with data of this magnitude, even when using samples of n/1000, and its data structures complicated the task. I am now working with the quanteda package. It seems to be faster and more user-friendly for the basic text transformations needed in this assignment, and it also comes recommended in the course forums.

Transformations are currently the most memory-intensive tasks. Once the Shiny app is running, I will have to see how long the code takes to run with different parameters and how well it performs. It may not be feasible to read the raw data into memory in Shiny; instead, I could do the transformations on my own computer, save the intermediate data to disk or a database, and use that with the prediction algorithm, as sketched below. I have to check the course pages to see whether that is allowed.
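One possible way to keep the app light is to build the n-gram tables offline and let Shiny load only the small result files; a sketch, where ngram_table stands for a lookup table built elsewhere:

    # Offline, on my own machine: build the lookup table once and save it.
    saveRDS(ngram_table, file = "data/ngram_table.rds", compress = "xz")

    # In the Shiny app: load the precomputed table instead of the raw corpus.
    ngram_table <- readRDS("data/ngram_table.rds")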

Current Transformations

  • All text to lower case
    • it might be best to ignore capital letters at the beginning of a sentence, but keep them elsewhere
  • Remove numbers
    • remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
  • Remove punctuation
    • could be useful with more advanced algorithms, but with a simple n-gram model it creates too many distinct sequences
  • Remove separators
    • spaces and variations of spaces, plus tabs, newlines, and anything else in the Unicode “separator” category
  • Remove Twitter characters
    • i.e. @ and #
  • Profanity filtering
  • Filtering out foreign-language tokens (a quanteda sketch of these transformations follows below)
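
A minimal quanteda sketch of the transformations listed above, assuming the sentence-level corpus from earlier and a hypothetical profanity_words vector; the Twitter and profanity steps are approximated with tokens_remove(), and foreign-language filtering is not shown:

    library(quanteda)

    # corp_sentences is the sentence-level corpus from the earlier step;
    # profanity_words is an assumed character vector of terms to filter out.
    toks <- tokens(corp_sentences,
                   remove_punct      = TRUE,   # punctuation
                   remove_numbers    = TRUE,   # tokens consisting only of digits
                   remove_symbols    = TRUE,
                   remove_separators = TRUE)   # spaces, tabs, and other Unicode separators

    toks <- tokens_tolower(toks)                              # all text to lower case
    toks <- tokens_remove(toks, pattern = c("@*", "#*"),      # Twitter handles and hashtags
                          valuetype = "glob")
    toks <- tokens_remove(toks, pattern = profanity_words)    # profanity filtering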

Ngrams

Ngrams are easy to build with the quanteda package. I have been using unigrams (which are basically useless for prediction), bigrams and trigrams, and I might add 4-grams. One open decision is what sparsity level to use to get the best coverage of normal language with the least amount of data.
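Building the n-grams from the cleaned tokens is then one call per order; a sketch using the toks object from the previous sketch:

    # Unigrams are just the cleaned tokens; higher orders come from tokens_ngrams().
    unigrams <- toks
    bigrams  <- tokens_ngrams(toks, n = 2, concatenator = " ")
    trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
    # A 4-gram version would simply use n = 4.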

Using a 1/100 random sample of the combined raw data, below are the top 20 sequences in each n-gram group. Comparing those counts to the total number of sequences in each group with the given parameters (a sequence has to occur at least five times) gives the relative frequencies. I ran the same analysis with different sample sizes, and the resulting sequences and their frequencies did not change much. The challenge is to make those frequencies larger, meaning more coverage with less data.
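The counts behind those tables can be read off a document-feature matrix; a sketch using the five-occurrence threshold mentioned above (the dfm_trim() argument name has changed across quanteda versions; the newer min_termfreq is used here):

    # Count every trigram, drop sequences seen fewer than five times,
    # and list the 20 most frequent ones.
    trigram_dfm <- dfm(trigrams)
    trigram_dfm <- dfm_trim(trigram_dfm, min_termfreq = 5)
    top20 <- topfeatures(trigram_dfm, n = 20)

    # Relative frequency of each top sequence: its count divided by the
    # total number of trigram occurrences kept after trimming.
    top20 / sum(colSums(trigram_dfm))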

Prediction Model

My first approach is to use tables with the tokenized words in their own columns and a probability associated with each word sequence. The search for the input phrase is performed using the “Stupid Backoff” method: first look for a match in the highest-order n-gram table, then in the next highest, ending with the unigram table, which always predicts the most common words (“the” being the most common). There are better algorithms that I might try to implement if time permits.
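A simplified sketch of that lookup; the table layout (columns w1, w2, prediction, score) is an assumption, and 0.4 is a commonly used Stupid Backoff discount factor:

    # Assumed layout: each table has the context words in columns w1 (and w2),
    # the candidate next word in `prediction`, and its relative frequency in `score`.
    predict_next <- function(phrase, trigram_tab, bigram_tab, unigram_tab,
                             lambda = 0.4, n_best = 5) {
      words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

      # 1. Look for the last two words of the input in the trigram table.
      if (length(words) == 2) {
        hits <- trigram_tab[trigram_tab$w1 == words[1] & trigram_tab$w2 == words[2], ]
        if (nrow(hits) > 0)
          return(head(hits[order(-hits$score), c("prediction", "score")], n_best))
      }

      # 2. Back off to the bigram table, discounting the score by lambda.
      last <- tail(words, 1)
      hits <- bigram_tab[bigram_tab$w1 == last, ]
      if (nrow(hits) > 0) {
        hits$score <- lambda * hits$score
        return(head(hits[order(-hits$score), c("prediction", "score")], n_best))
      }

      # 3. Final fallback: the most common unigrams ("the" ranks first).
      uni <- head(unigram_tab[order(-unigram_tab$score), ], n_best)
      data.frame(prediction = uni$prediction, score = lambda^2 * uni$score)
    }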

Shiny app

I would like the Shiny app to show the five best next-word predictions for a given phrase. A nice-to-have would be a chart comparing their probabilities. Selectors could include how many previous words to use in the prediction and perhaps some charting options.
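A rough sketch of such an app; the widget names are assumptions and predict_next() is the lookup function sketched above (the previous-words selector is shown but not yet wired into the prediction):

    library(shiny)

    ui <- fluidPage(
      titlePanel("Next word prediction"),
      textInput("phrase", "Type a phrase:", value = ""),
      sliderInput("context", "Previous words to use:", min = 1, max = 3, value = 2),
      tableOutput("predictions"),   # five best next-word candidates
      plotOutput("probPlot")        # nice-to-have: compare their scores
    )

    server <- function(input, output) {
      preds <- reactive(
        predict_next(input$phrase, trigram_tab, bigram_tab, unigram_tab, n_best = 5)
      )
      output$predictions <- renderTable(preds())
      output$probPlot    <- renderPlot(
        barplot(preds()$score, names.arg = preds()$prediction, ylab = "score")
      )
    }

    shinyApp(ui, server)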