In this Capstone project the aim is simple: given one or more words, we want to predict the next word based on the frequencies of word sequences found in large datasets of text.
This report summarises the exploratory data analysis performed on the three supplied datasets in order to build an N-gram natural language processing prediction model. The three datasets are large samples of text from Twitter, blog posts and news articles. These samples are combined, tokenized, and then analysed for sequences of \(N\) consecutive words, known as N-grams.
The raw data can be found here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The blog, news and Twitter datasets contain roughly 900,000, 1,000,000 and 2,400,000 sentences of text respectively. A prediction model built on all of this data would run extremely slowly, so I have used random sampling to reduce each dataset to around 10,000 sentences. We load these samples below.
library(readtext)
# Load the pre-sampled text files
news <- readtext('./sample/news_sample.txt')
twitter <- readtext('./sample/twitter_sample.txt')
blog <- readtext('./sample/blog_sample.txt')
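For reference, the sampling step could be reproduced with something like the sketch below. The raw file paths, the seed and the sample size of 10,000 lines are assumptions for illustration; only the pre-sampled files loaded above are used in the rest of this report.
# Sketch of the sampling step (paths, seed and sample size are illustrative)
set.seed(1234)
sample_lines <- function(infile, outfile, n = 10000) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, min(n, length(lines))), outfile)
}
sample_lines('./final/en_US/en_US.news.txt', './sample/news_sample.txt')
sample_lines('./final/en_US/en_US.twitter.txt', './sample/twitter_sample.txt')
sample_lines('./final/en_US/en_US.blogs.txt', './sample/blog_sample.txt')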
We create a single corpus combining text from the three sources; once this is done we can remove the individual corpora and datasets from memory. We then tokenize the text (i.e. split it into individual features such as words) using the quanteda package, chosen for its speed and memory efficiency compared with equivalent packages. Here we remove punctuation, symbols, numbers, URLs and any tokens containing non-Latin characters, as we do not want to predict these. We also remove profanity using the word list found here: https://github.com/RobertJGabriel/Google-profanity-words.
# Creating corpus
library(quanteda)
library(dplyr)
news_corpus <- corpus(news)
twitter_corpus <- corpus(twitter)
blog_corpus <- corpus(blog)
# Create big corpus
big_corpus <- news_corpus + twitter_corpus + blog_corpus
# Remove large corpora and datasets from memory
rm(news_corpus, twitter_corpus, blog_corpus)
rm(news, twitter, blog)
# Reading in list of profanity words
profanity <- read.table('./profanity_words/list.txt')
profanity <- as.character(profanity[,1])
# Tokenizing
tokens <- big_corpus %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_numbers = TRUE,
         remove_url = TRUE,
         split_hyphens = TRUE) %>%
  tokens_remove(profanity) %>% # Remove profanity
  tokens_keep(pattern = "^[a-zA-Z]+$", valuetype = "regex") # Keep Latin alphabet only
rm(profanity, big_corpus)
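As a quick sanity check (purely illustrative), we can peek at the tokens of the first document and count the total number of tokens retained:
# Illustrative check of the tokenized output
head(tokens[[1]], 10) # first ten tokens of the first document
sum(ntoken(tokens)) # total number of tokens across the corpus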
Now we turn our attention to generating N-grams for \(N=1,2,3\), known as unigrams, bigrams and trigrams respectively. Our model will use their frequencies to predict the next word. For each we show a wordcloud of the most common N-grams, a bar chart of the most frequent N-grams and a plot of the cumulative proportion of all N-gram instances covered as the number of most frequent features used increases (plotting code is sketched after the frequency calculations below).
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(dplyr)
library(ggplot2)
library(gridExtra)
# Unigrams
unigram_toks <- tokens_ngrams(tokens, n = 1)
unigram_dfm <- dfm(unigram_toks)
unigram_freq <- textstat_frequency(unigram_dfm) %>%
  mutate(proportion = frequency / sum(frequency)) %>%
  mutate(cumulative_proportion = cumsum(proportion))
# Bigrams
bigram_toks <- tokens_ngrams(tokens, n = 2)
bigram_dfm <- dfm(bigram_toks)
bigram_freq <- textstat_frequency(bigram_dfm) %>%
  mutate(proportion = frequency / sum(frequency)) %>%
  mutate(cumulative_proportion = cumsum(proportion))
# Trigrams
trigram_toks <- tokens_ngrams(tokens, n = 3)
trigram_dfm <- dfm(trigram_toks)
trigram_freq <- textstat_frequency(trigram_dfm) %>%
  mutate(proportion = frequency / sum(frequency)) %>%
  mutate(cumulative_proportion = cumsum(proportion))
One thing to note is how quickly the size of the N-gram data grows with N: the trigram_freq object alone is around 140 megabytes. This will hurt the performance of our model, so we should take steps to reduce the size of these objects. One idea is to discard any trigram that occurs only once, since such trigrams have very low probability and removing them should not significantly harm the quality of our predictions.
ngram_megabytes <- c(object.size(unigram_freq), object.size(bigram_freq), object.size(trigram_freq)) / 1E6
names(ngram_megabytes) <- c("Unigram", "Bigram", "Trigram")
ngram_features <- c(length(unigram_freq$feature), length(bigram_freq$feature), length(trigram_freq$feature))
ngram_size <- rbind(ngram_megabytes, ngram_features)
ngram_size
## Unigram Bigram Trigram
## ngram_megabytes 7.870584 74.241 138.8346
## ngram_features 43843.000000 400602.000 723027.0000
Following this analysis I plan to implement the Katz Back-Off (KBO) model for predicting the next word given the preceding one or two words. This model uses N-gram frequencies to predict the next word and includes logic for assigning probabilities to unseen N-grams. This is also why it is sensible to remove trigrams with a frequency of 1: the KBO model will still assign these trigrams reasonable probabilities, since they will simply be treated as "unseen".
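As a first step in that direction, the sketch below prunes the singleton trigrams and shows a naive highest-frequency lookup on the remaining table. This is not the Katz Back-Off model itself, only an illustration of how the frequency tables will be queried; the function name and the cutoff are assumptions.
# Sketch: prune singleton trigrams and illustrate a naive frequency lookup
# (not the full Katz Back-Off model; cutoff and function name are illustrative)
trigram_freq_pruned <- trigram_freq %>% filter(frequency > 1)
print(object.size(trigram_freq_pruned), units = "MB") # size after pruning
predict_next_naive <- function(w1, w2, trigrams = trigram_freq_pruned) {
  prefix <- paste(w1, w2, sep = "_") # quanteda joins n-gram words with "_"
  hits <- trigrams %>%
    filter(startsWith(feature, paste0(prefix, "_"))) %>%
    arrange(desc(frequency))
  if (nrow(hits) == 0) return(NA_character_) # this is where the back-off logic will go
  sub(".*_", "", hits$feature[1]) # last word of the most frequent matching trigram
}
predict_next_naive("one", "of")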