Introduction

The aim of this Capstone project is simple: given one, two or more words, predict the next word based on the frequencies of word sequences found in large datasets of text.

This report summarises the exploratory data analysis performed on the three datasets supplied for building an N-gram Natural Language Processing prediction model. The three datasets are large samples of text from Twitter, blog posts and news articles. These samples are combined, tokenized and then analysed for sequences of words of length \(N\), known as N-grams.

The raw data can be found here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Loading the data

The blog, news and Twitter datasets contain around 900,000, 1,000,000 and 2,400,000 sentences of text respectively. Using all of this data would make the prediction model extremely slow, so I have used random sampling to reduce each dataset to around 10,000 sentences. We load these samples below.

library(readtext)

news <- readtext('./sample/news_sample.txt')
twitter <- readtext('./sample/twitter_sample.txt')
blog <- readtext('./sample/blog_sample.txt')
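
The sampling step itself is not shown above; a minimal sketch of how it could be reproduced is given below. The raw file paths, the line-by-line sampling with readLines and the fixed seed are assumptions of this sketch, not the exact code used to produce the sample files.

# Sketch of the sampling step (assumed raw file paths; not run here)
set.seed(1234) # fix the random sample for reproducibility
sample_lines <- function(infile, outfile, n = 10000) {
        lines <- readLines(infile, skipNul = TRUE, encoding = "UTF-8")
        writeLines(sample(lines, min(n, length(lines))), outfile) # keep ~10,000 random lines
}
sample_lines('./final/en_US/en_US.news.txt', './sample/news_sample.txt')
sample_lines('./final/en_US/en_US.twitter.txt', './sample/twitter_sample.txt')
sample_lines('./final/en_US/en_US.blogs.txt', './sample/blog_sample.txt')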

Tokenizing

We first create a single corpus combining text from each of the three sources; once this is done, the individual corpora and datasets can be removed from memory. We then tokenize the text (i.e. split it into individual features such as words) using the quanteda package, chosen for its speed and memory efficiency compared with equivalent packages. We remove punctuation, symbols, numbers, URLs and any tokens containing non-Latin characters, as we do not want to predict these. We also remove profanity using the list of profane words found here: https://github.com/RobertJGabriel/Google-profanity-words.

# Creating corpus
library(quanteda)
library(dplyr)

news_corpus <- corpus(news)
twitter_corpus <- corpus(twitter)
blog_corpus <- corpus(blog)

# Create big corpus

big_corpus <- news_corpus + twitter_corpus + blog_corpus

# Remove large corpora and datasets from memory

rm(news_corpus, twitter_corpus, blog_corpus)
rm(news, twitter, blog)

# Reading in list of profanity words

profanity <- read.table('./profanity_words/list.txt') 
profanity <- as.character(profanity[,1])

# Tokenizing

tokens <- big_corpus %>%
        tokens(remove_punct = TRUE,
               remove_symbols = TRUE,
               remove_numbers = TRUE,
               remove_url = TRUE,
               split_hyphens = TRUE) %>%
        tokens_remove(profanity) %>% # Remove profanity
        tokens_keep(pattern = "^[a-zA-Z]+$", valuetype = "regex") # Keep latin alphabet only
rm(profanity, big_corpus)
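
As a quick sanity check, we can look at how many documents the tokens object contains and inspect its first few tokens; the exact output will vary with the random sample.

# Sanity check on the tokens object (illustrative; output depends on the sample)
ndoc(tokens) # number of documents in the combined corpus
head(tokens[[1]], 10) # first ten tokens of the first document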

N-gram Generation

Now we turn our attention to generating N-grams for \(N=1,2,3\), named unigrams, bigrams and trigrams respectively. We will use these frequencies to predict words in our model. For each, we show a wordcloud of the most common N-grams, a bar chart of the most frequent N-grams, and a plot of the cumulative proportion of all feature instances covered as the number of distinct features increases.

Libraries

library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(dplyr)
library(ggplot2)
library(gridExtra)

Unigrams

# Unigrams
unigram_toks <- tokens_ngrams(tokens,n=1)
unigram_dfm  <- dfm(unigram_toks)
unigram_freq <- textstat_frequency(unigram_dfm) %>%
        mutate(proportion = frequency / sum(frequency)) %>%
        mutate(cumulative_proportion = cumsum(proportion))
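
The unigram wordcloud and coverage curve described above can be drawn from unigram_dfm and unigram_freq. The sketch below is one way of doing this; the 100-word limit and the axis labels are choices made here, not fixed requirements.

# Wordcloud of the most common unigrams
textplot_wordcloud(unigram_dfm, max_words = 100)

# Coverage: cumulative proportion of all unigram instances vs. number of features used
ggplot(unigram_freq, aes(x = rank, y = cumulative_proportion)) +
        geom_line() +
        labs(x = "Number of unigrams (by frequency rank)",
             y = "Cumulative proportion of instances covered")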

Bigrams

# Bigrams
bigram_toks <- tokens_ngrams(tokens,n=2)
bigram_dfm  <- dfm(bigram_toks)
bigram_freq <- textstat_frequency(bigram_dfm) %>%
        mutate(proportion = frequency / sum(frequency)) %>%
        mutate(cumulative_proportion = cumsum(proportion))

Trigrams

# Trigrams
trigram_toks <- tokens_ngrams(tokens,n=3)
trigram_dfm  <- dfm(trigram_toks)
trigram_freq <- textstat_frequency(trigram_dfm) %>%
        mutate(proportion = frequency / sum(frequency)) %>%
        mutate(cumulative_proportion = cumsum(proportion))
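
The bar charts of the most frequent N-grams can be produced from the three frequency tables and arranged side by side with gridExtra. The helper function and the top-15 cut-off below are illustrative choices.

# Helper: bar chart of the n most frequent features in a frequency table
plot_top_ngrams <- function(freq, n = 15, title = "") {
        freq %>%
                slice_head(n = n) %>% # textstat_frequency output is already sorted by frequency
                ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
                geom_col() +
                coord_flip() +
                labs(x = NULL, y = "Frequency", title = title)
}

grid.arrange(plot_top_ngrams(unigram_freq, title = "Unigrams"),
             plot_top_ngrams(bigram_freq, title = "Bigrams"),
             plot_top_ngrams(trigram_freq, title = "Trigrams"),
             ncol = 3)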

Object sizes

One thing to note is the increasing size of the N-gram data as \(N\) increases: the trigram_freq object alone is well over 100 megabytes. This will hurt the performance of our model, so we should take steps to reduce the size of these objects. One option is to discard any trigram with a frequency of one, since such trigrams occur with very low probability and dropping them should not significantly affect the quality of our predictions.

ngram_megabytes <- c(object.size(unigram_freq), object.size(bigram_freq), object.size(trigram_freq)) / 1E6
names(ngram_megabytes) <- c("Unigram", "Bigram", "Trigram")
ngram_features <- c(length(unigram_freq$feature), length(bigram_freq$feature), length(trigram_freq$feature))
ngram_size <- rbind(ngram_megabytes, ngram_features)
ngram_size
##                      Unigram     Bigram     Trigram
## ngram_megabytes     7.870584     74.241    138.8346
## ngram_features  43843.000000 400602.000 723027.0000
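
As a rough indication of the potential saving, the singleton trigrams could be dropped as follows; the exact reduction will depend on the sample.

# Drop trigrams seen only once and compare object sizes (in megabytes)
trigram_freq_trimmed <- trigram_freq %>% filter(frequency > 1)
c(before = object.size(trigram_freq), after = object.size(trigram_freq_trimmed)) / 1E6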

Prediction Algorithm

Following this analysis I plan to implement the Katz Back-Off model for predicting the next word given the preceding one or two words. This model uses N-gram frequencies to predict the next word and incorporates logic to assign probabilities to unseen N-grams. This is a further reason why it is sensible to remove trigrams with a frequency of one: the Katz Back-Off model will still assign these trigrams sensible probabilities, since they will simply be treated as “unseen”.
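
As a simplified illustration of the back-off idea only (this shows the lookup order, not the full Katz model with its discounting and back-off weights; the function name and the reliance on quanteda's underscore separator are choices made here):

# Simplified back-off lookup (no Katz discounting): try trigrams, then bigrams, then the top unigram
predict_next_word <- function(w1, w2) {
        # quanteda stores N-grams as "w1_w2_w3", so match on the "w1_w2_" prefix
        tri_hits <- trigram_freq %>% filter(startsWith(feature, paste0(w1, "_", w2, "_")))
        if (nrow(tri_hits) > 0) return(sub(".*_", "", tri_hits$feature[1])) # last word of best trigram
        bi_hits <- bigram_freq %>% filter(startsWith(feature, paste0(w2, "_")))
        if (nrow(bi_hits) > 0) return(sub(".*_", "", bi_hits$feature[1])) # last word of best bigram
        unigram_freq$feature[1] # fall back to the most frequent word overall
}

predict_next_word("one", "of") # e.g. might return "the"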