The task we have been given is to build a text prediction app: the user inputs a string of words and the app predicts the next word in the sentence. We were provided with a corpus comprising three text files drawn from blogs, news and Twitter feeds, amounting to approximately 100 million words. I initially cleaned the entire corpus using bash commands to remove unwanted symbols, numbers and punctuation. After trial and error I concluded that a subset of 5% of the entire corpus, approximately 5 million words, was as much as my computer could process. I split this into training, cross-validation and test sets of 60%, 20% and 20% respectively. I then explored the 3 million word training corpus by calculating n-grams up to 4-grams. From the training corpus I found that 142 words accounted for >50% of the corpus, and about 8,000 words accounted for >90% of the corpus.

I then moved on to building a language model following the Hidden Markov Model approach, using the n-gram phrase tables to predict the next word from the probability of the final word in each n-gram given the preceding word(s). I intend to use smoothing to deal with unseen words, but haven't explored the best method yet. At present I have continued using the whole training corpus, but intend to compare models built with a reduced vocabulary to save memory for the Shiny app. I haven't investigated hashing yet (transforming the n-grams into vector locations and working with the indices instead of the n-grams directly), but this may be a strategy worth exploring to further save memory.
As shown in the Appendix, I downloaded the data, then cleaned and sampled it using a combination of bash commands and R packages. I chose to remove only the seven common swearwords, and left the apostrophes in. Numbers, non-UTF-8 characters and all other punctuation were removed, and everything was converted to lower case. Due to time limitations I chose not to worry about spelling, acronyms and foreign words for the purposes of this project, but I'm aware that handling them would improve the model.
I found that 5% of the total 100 million word corpus was as much as my computer would tokenise, so I have proceeded with that size of corpus. I split this further into 60%, 20% and 20% training, validation and test sets respectively using the sample command, as sketched below.
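A minimal sketch of that split, assuming the cleaned 5% sample from the appendix (clean_05.txt) is read in line by line; the output file names here are my own illustrative choices:

set.seed(1234)                                   # for reproducibility
lines <- readLines("clean_05.txt")

n   <- length(lines)
idx <- sample(seq_len(n))                        # random permutation of line indices

train      <- lines[idx[1:floor(0.6 * n)]]
validation <- lines[idx[(floor(0.6 * n) + 1):floor(0.8 * n)]]
test       <- lines[idx[(floor(0.8 * n) + 1):n]]

writeLines(train,      "train.txt")
writeLines(validation, "validation.txt")
writeLines(test,       "test.txt")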
To tokenise the training corpus into n-grams I used the ngram package. I chose it because it seems reasonably fast, and its get.phrasetable output gives each n-gram's frequency and the proportion of the corpus it accounts for, which is useful for the language model.
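For reference, a minimal sketch of this workflow (train.txt and the object names are my own illustrative choices):

library(ngram)

# Read the training set and collapse it into a single string,
# which is the input format ngram() expects
train_text <- concatenate(readLines("train.txt"))

# Tokenise into 2-grams and extract the phrase table
ng2 <- ngram(train_text, n = 2)
pt2 <- get.phrasetable(ng2)

# pt2 is a data frame with columns: ngrams, freq (count) and
# prop (the proportion of the corpus each n-gram accounts for)
head(pt2)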
I calculated n-grams from 1 to 4. This showed that the corpus has a unique vocabulary of 109,676 words, of which 142 words account for 50% of the corpus, and >90% of the corpus can be described by just 8,008 words. So a reduced vocabulary may be one way to proceed.
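These coverage figures can be read off the 1-gram phrase table by sorting on frequency and taking a cumulative sum of the proportions; a sketch, continuing from the objects in the previous block:

# 1-gram phrase table, continuing from train_text above
pt1 <- get.phrasetable(ngram(train_text, n = 1))

# Sort by frequency and accumulate the corpus proportion
pt1 <- pt1[order(pt1$freq, decreasing = TRUE), ]
coverage <- cumsum(pt1$prop)

nrow(pt1)                   # size of the unique vocabulary
which(coverage >= 0.5)[1]   # words needed to cover 50% of the corpus
which(coverage >= 0.9)[1]   # words needed to cover 90% of the corpus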
In moving from 2-grams to 3-grams and 4-grams, the number of unique combinations increases massively while their individual frequencies drop.
Figure 1 shows the top 25 n-grams and their frequencies for each n-gram size. As we'd expect, they comprise combinations of high-frequency words such as "and", "the", "of" and "i".
Figure 2 shows how many n-grams are needed to account for >90% of the small training corpus, for n-grams of different sizes. I found this instructive for considering the computational power needed to deal with a large corpus and tokenise large n-grams, something I'll need to address for the Shiny app.
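The counts behind Figure 2 follow the same cumulative-proportion idea applied to each n-gram order; a sketch, assuming pt1 to pt4 are the 1- to 4-gram phrase tables built as above:

# Number of n-grams needed to cover a target proportion of the corpus,
# assuming pt1..pt4 are phrase tables from get.phrasetable()
coverage_count <- function(pt, target = 0.9) {
  pt <- pt[order(pt$freq, decreasing = TRUE), ]
  which(cumsum(pt$prop) >= target)[1]
}

sapply(list(`1-gram` = pt1, `2-gram` = pt2,
            `3-gram` = pt3, `4-gram` = pt4), coverage_count)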
Figure 1: Top 25 n-grams by frequency, for n-grams of size 1 to 4.
Figure 2: Number of n-grams needed to account for >90% of the training corpus, by n-gram size.
The plan now is to build a language model following the Hidden Markov Model approach, using the n-grams I've processed and their proportions to predict the next word. I haven't yet explored the smoothing algorithms for dealing with unseen words, but since I have n-grams up to 4-grams I expect to use some form of back-off model. As well as reducing the vocabulary, I may look at hashing to save memory when it comes to creating the Shiny app.
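As a rough placeholder for what that lookup might look like (not the final model, and with no smoothing yet), the sketch below backs off from the 4-gram table to the 3- and 2-gram tables; pt2, pt3 and pt4 are assumed to be phrase tables from get.phrasetable() as above:

# Naive back-off next-word lookup (no smoothing yet): try the 4-gram
# table first, then fall back to shorter n-grams.
predict_next <- function(phrase, tables) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  for (n in 4:2) {
    if (length(words) < n - 1) next
    history <- paste(tail(words, n - 1), collapse = " ")
    pt      <- tables[[as.character(n)]]
    # Match n-grams whose first n-1 words equal the history
    # (trimws() guards against trailing spaces in the ngrams column)
    hits <- pt[startsWith(trimws(pt$ngrams), paste0(history, " ")), ]
    if (nrow(hits) > 0) {
      best <- trimws(hits$ngrams[which.max(hits$prop)])
      return(tail(strsplit(best, " ")[[1]], 1))   # last word of the best n-gram
    }
  }
  NA_character_                                   # no match; this is where smoothing is needed
}

# Example call (pt2/pt3/pt4 assumed to exist):
# predict_next("thanks for the", list(`2` = pt2, `3` = pt3, `4` = pt4))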
Downloading the dataset:
# Download URL for data ----
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download data if not already downloaded
if(!file.exists("Coursera-SwiftKey.zip")){
    download.file(fileUrl, destfile = "Coursera-SwiftKey.zip")
    dateDownloaded <- date()
    # Unzip data
    unzip(zipfile = "Coursera-SwiftKey.zip", files = NULL,
          unzip = "internal")
    # Create logfile recording the source URL, destination and download date
    log_con <- file("Coursera-SwiftKey_data_download.log")
    cat(fileUrl, "\n", "destfile = Coursera-SwiftKey.zip",
        "\n", "destdir =", getwd(), "\n", dateDownloaded,
        file = log_con)
    close(log_con)
}
Bash commands for cleaning and sampling the corpus:
#! /bin/bash
# Concatenate files into one file
cat en_US.blogs.txt en_US.news.txt en_US.twitter.txt > tmp2
# Remove swearing (the seven common swearwords)
# 'cocksucker', 'cunt', 'fuck', 'motherfucker', 'piss', 'shit', 'tits'
#sed "s/'[A-Za-z].//g" tmp > tmp2    # unused; apostrophes were kept in
sed 's/piss//g;s/shit//g;s/cocksucker//g;s/cunt//g;s/fuck//g;s/motherfucker//g;s/tits//g' < tmp2 > tmp3
# Remove numbers
tr -d '[:digit:]' < tmp3 > tmp4
# Keep only printable ASCII characters (plus tab, newline and carriage return)
tr -cd '\11\12\15\40-\176' < tmp4 > clean.txt
# Create random 10%, 5% and 2% subsets
perl -ne 'print if (rand() < .1)' clean.txt > clean_10.txt
perl -ne 'print if (rand() < .05)' clean.txt > clean_05.txt
perl -ne 'print if (rand() < .02)' clean.txt > clean_02.txt
rm tmp*