Predictive keyboards such as SwiftKey are now common on modern mobile devices. The goal of this project is to develop an algorithm for next-word prediction that strikes a good balance between three aspects: memory efficiency, speed and accuracy.
To train the algorithm, a corpus consisting of text taken from blogs, news feeds and Twitter was used. While the corpus is available in 4 different languages, this report examines only the features of the English (US) data.
Reading the news text into R was hampered by unrecognized symbols such as the arrow character typed with Alt + 26 (character code 26). Instances of the symbol were manually removed outside of R so that the full dataset could be used.
Warning messages were raised when reading in the Twitter file. However, they do not indicate major errors that prevent subsequent lines from being read into R, so they are ignored. The warning messages are not displayed, for tidiness.
blogs <- readLines(paste0(directory, "en_US.blogs.txt"))
news <- readLines(paste0(directory, "en_US.news.txt"))
twitter <- readLines(paste0(directory, "en_US.twitter.txt"))
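As an alternative to editing the files by hand, one possible approach is to read them through a binary-mode connection. The sketch below assumes that the offending symbol is character code 26, which some platforms may treat as an end-of-file marker when a file is opened in text mode. The helper `read_lines_binary` and the temporary demo file are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper (not from the original report): read a text file through a
# binary-mode connection so that control characters do not cut the read short.
read_lines_binary <- function(path, encoding = "UTF-8") {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul = TRUE skips embedded nul characters instead of truncating lines at them
  readLines(con, encoding = encoding, skipNul = TRUE, warn = FALSE)
}

# Small demonstration on a temporary file that contains character code 26
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\nbefore"), as.raw(26),
           charToRaw("after\nlast line\n")), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If this works on the platform in question, the three `readLines` calls above could be replaced by calls to such a helper, leaving the source files untouched.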
Each dataset’s size in memory and the number of lines it contains are displayed below.
library(data.table)
datasets <- c("blogs", "news", "twitter")
objSize <- sapply(datasets, function(x) {format(object.size(get(x)), units = "Mb")})
lines <- sapply(datasets, function(x) {length(get(x))})
overview <- data.table("Dataset" = datasets, "Object Size" = objSize, "Lines" = lines)
overview
##    Dataset Object Size   Lines
## 1:   blogs    248.5 Mb  899288
## 2:    news    249.6 Mb 1010242
## 3: twitter    301.4 Mb 2360148
The following steps were taken to clean each dataset:
1. Removal of characters that are not white space, letters, digits or punctuation
2. Removal of words on a list of profanities found here.
3. Removal of numbers, punctuation and website addresses
4. Conversion of words to lowercase
5. Splitting of the text into words on spaces and counting of word frequencies
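As a rough illustration of the steps above, the following is a minimal sketch in base R. It is not the report's actual cleaning code: the function name `clean_to_unigrams`, the regular expressions, the toy input lines and the one-word profanity list are all assumptions made for this example. The steps are applied in a slightly different order than listed (lowercasing first, profanity removal after splitting) to keep the sketch short.

```r
# Hypothetical stand-ins for the real inputs
raw_lines   <- c("Visit http://example.com NOW!!",
                 "I have 2 cats, and 2 DOGS.",
                 "darn cats")
profanities <- c("darn")

clean_to_unigrams <- function(lines, profanities) {
  # Step 1: keep only white space, letters, digits and punctuation
  lines <- gsub("[^[:space:][:alpha:][:digit:][:punct:]]", "", lines)
  # Step 4: convert to lowercase
  lines <- tolower(lines)
  # Step 3: remove website addresses first, then numbers and punctuation
  lines <- gsub("(https?://|www\\.)[^[:space:]]*", " ", lines)
  lines <- gsub("[[:digit:][:punct:]]", " ", lines)
  # Step 5 (first half): split into words on white space
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]
  # Step 2: drop words that appear on the profanity list
  words <- words[!words %in% profanities]
  # Step 5 (second half): count word frequencies
  counts <- table(words)
  data.frame(Word = names(counts), Count = as.integer(counts),
             stringsAsFactors = FALSE)
}

unigrams <- clean_to_unigrams(raw_lines, profanities)
unigrams
```

The result has one row per distinct word with `Word` and `Count` columns, matching the column names used in the rest of this report.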
The actual data cleaning code is not displayed here for brevity. Instead, the cleaned datasets are loaded and combined for exploratory data analysis.
blogs <- readRDS(paste0(directory, "Blogs Unigrams.RData"))
news <- readRDS(paste0(directory, "News Unigrams.RData"))
twitter <- readRDS(paste0(directory, "Twitter Unigrams.RData"))
data <- rbindlist(list(blogs, news, twitter))
data <- data[, list("Count" = sum(Count)), by = Word]
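To make the combining step concrete, here is a small sketch with toy tables. The table names and counts are made up for illustration; only the `Word` and `Count` column names and the two `data.table` calls are taken from the code above.

```r
library(data.table)

# Toy per-source unigram tables (hypothetical values)
toy_blogs   <- data.table(Word = c("the", "cat"),    Count = c(4L, 1L))
toy_news    <- data.table(Word = c("the", "market"), Count = c(6L, 2L))
toy_twitter <- data.table(Word = c("cat", "lol"),    Count = c(3L, 5L))

# Stack the three tables, then add up the counts of words that occur in several sources
toy_data <- rbindlist(list(toy_blogs, toy_news, toy_twitter))
toy_data <- toy_data[, list(Count = sum(Count)), by = Word]
toy_data
```

After aggregation each word appears once, so "the" carries the sum of its blog and news counts.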
A brief summary of the cleaned dataset is shown below.
summary(data)
##      Word               Count
##  Length:577831      Min.   :      1
##  Class :character   1st Qu.:      1
##  Mode  :character   Median :      1
##                     Mean   :    174
##                     3rd Qu.:      4
##                     Max.   :4759890
With the data cleaned, word clouds are used to visualize the most frequent terms in the entire dataset. Histograms are impractical here because the cleaned dataset contains 577,831 unique words.
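library(wordcloud)  # also attaches RColorBrewer, which provides brewer.pal()
palette <- brewer.pal(9, "Set1")
# Plot up to 1000 words that occur at least twice, in decreasing order of frequency
wordcloud(data[["Word"]], data[["Count"]], scale = c(10, .5), min.freq = 2, max.words = 1000,
          random.order = FALSE, random.color = FALSE, colors = palette)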
The word cloud is dominated by stop words. A list of stop words is therefore removed and the word cloud is generated again.
stopWords <- readLines(paste0(directory, "Stop Word List.txt"))
# Key the table by Word so that stop words can be dropped with a keyed anti-join
setkey(data, "Word")
dataWithoutStopWords <- data[!.(stopWords)]
# Re-key (and re-sort) the full table by Count
setkey(data, "Count")
wordcloud(dataWithoutStopWords[["Word"]], dataWithoutStopWords[["Count"]],
          scale = c(4, .5), min.freq = 2, max.words = 500,
          random.order = FALSE, random.color = FALSE, colors = palette)
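The keyed anti-join used to drop the stop words may be unfamiliar, so here is a minimal sketch of it on a toy table. The table contents and the three-word stop list are made up for illustration.

```r
library(data.table)

# Toy unigram table and stop word list (hypothetical values)
toy_counts     <- data.table(Word  = c("the", "cat", "and", "mat"),
                             Count = c(10L, 3L, 8L, 1L))
toy_stop_words <- c("the", "and", "of")

# With Word as the key, !.(x) keeps only the rows whose key does NOT match x
setkey(toy_counts, "Word")
without_stop <- toy_counts[!.(toy_stop_words)]
without_stop

# The same filter written without a key, for comparison
same_result <- toy_counts[!Word %in% toy_stop_words]
```

Stop words that never occur in the table (here "of") are simply ignored by the anti-join.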
The word clouds shown are word clouds of uni-grams, in which each word's size reflects its frequency and therefore its estimated probability. However, a next-word prediction model built on uni-grams alone is highly ineffective: because it ignores the preceding words, it will always predict the most frequent word.
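The weakness of a uni-gram model can be seen in a minimal sketch. The toy counts and the function `predict_unigram` are hypothetical; the point is only that the context argument plays no part in the prediction.

```r
# Toy unigram table (hypothetical counts)
toy_unigrams <- data.frame(Word  = c("the", "cat", "sat"),
                           Count = c(5L, 2L, 1L),
                           stringsAsFactors = FALSE)
# Relative frequency as the estimated probability of each word
toy_unigrams$Prob <- toy_unigrams$Count / sum(toy_unigrams$Count)

# A unigram "predictor": the context is accepted but never used
predict_unigram <- function(context, unigrams) {
  unigrams$Word[which.max(unigrams$Prob)]
}

first_guess  <- predict_unigram("i went to", toy_unigrams)
second_guess <- predict_unigram("thank you very", toy_unigrams)
```

Both calls return the same word, whatever the user has typed so far.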
The intention is to incorporate higher-order n-grams (bi-grams, tri-grams) in future development to achieve better next-word predictions. However, the higher the order, the more n-grams will go unobserved in the training data. More advanced statistical methods, such as modified Kneser-Ney smoothing, will be needed to deal with this issue.
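The problem of unobserved n-grams can be illustrated with a toy bi-gram predictor. This sketch does not implement modified Kneser-Ney smoothing; it uses a much cruder fallback (backing off to the most frequent uni-gram) purely to show where the difficulty arises. The token sequence and the function `predict_next` are assumptions made for this example.

```r
# Toy training text, already tokenized (hypothetical)
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran")

# Each bi-gram is a (first, second) pair of adjacent tokens
first  <- head(tokens, -1)
second <- tail(tokens, -1)
unigram_counts <- table(tokens)

predict_next <- function(previous_word) {
  followers <- second[first == previous_word]
  if (length(followers) == 0) {
    # No bi-gram starting with this word was observed:
    # fall back to the most frequent uni-gram
    return(names(which.max(unigram_counts)))
  }
  # Otherwise predict the most frequent follower of the previous word
  names(which.max(table(followers)))
}

seen_context   <- predict_next("the")  # "the" is followed by "cat" twice and "mat" once
unseen_context <- predict_next("dog")  # "dog" never occurs in the toy text
```

With real data the share of unseen contexts grows quickly with the n-gram order, which is why a principled smoothing method is preferable to a hard fallback of this kind.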
The corpus was read into R and cleaned, and word clouds of its uni-grams were generated. However, relying on uni-grams alone for next-word prediction is highly ineffective. In future development, higher-order n-grams and more advanced statistical methods will be incorporated to improve the prediction algorithm. It remains to be seen what the optimal n-gram order is and whether the training data should be augmented with additional corpora.