knitr::opts_chunk$set(echo = TRUE)
library(knitr)
library(pander)
library(tm)
library(RWeka)
library(ggplot2)
library(SnowballC)
library(wordcloud)
This capstone project is structured to allow students to create a usable, public data product by consolidating the skills learnt during the Data Science Specialisation. The capstone course is created by Johns Hopkins University in collaboration with SwiftKey and Coursera. The course page can be accessed at https://www.coursera.org/learn/data-science-project.
Johns Hopkins University is one of the top universities in the USA and is consistently ranked among the top 20 universities globally by various university ranking systems. It is most famous for its faculty of medicine and associated hospital system. SwiftKey is a Microsoft-owned company with offices in London, San Francisco and Seoul. It is best known for its predictive keyboard app, SwiftKey Keyboard for Android and iOS, which provides a swipe-based predictive text keyboard personalised to each individual user for more accurate typing.
The final aim of this project is to create a Shiny app that accurately predicts the next word as the user types in a word or phrase. The app should also run efficiently within a small memory footprint, so a reasonably optimised algorithm is required. This mimics current keyboard apps, which must run with relatively little memory on smartphones and other devices.
The text data used to build this Shiny app is provided by Johns Hopkins at the following URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The instructions for downloading the dataset state the following:
This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora.
For the purposes of building the final Shiny app, only the English text files available under the en_US folder will be used. The dataset consists of three plain text files drawn from blogs, news articles and Twitter messages.
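For reference, the dataset can be downloaded and extracted with a few lines of base R similar to the following (a sketch only; the destination file name is illustrative and the download is skipped if the zip file already exists locally).
# Sketch: download and extract the Coursera-SwiftKey dataset
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")
}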
Preliminary statistics such as word counts and file sizes for each of the three text files are calculated below.
# Basic summary tables for each of the files
setwd("C:/Users/tanj/Desktop/Coursera Data Science/Capstone/final/en_US")
# Read the three English files; skipNul avoids warnings caused by embedded
# nul characters (notably in the Twitter file)
cap_files <- list(blog = readLines("en_US.blogs.txt", skipNul = TRUE),
                  twit = readLines("en_US.twitter.txt", skipNul = TRUE),
                  news = readLines("en_US.news.txt", skipNul = TRUE))
cap_names <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")
# Approximate words per line by counting runs of non-word characters as separators
word_count_line <- sapply(cap_files, function(x) sapply(gregexpr("\\W+", x), length) + 1)
word_count_file <- sapply(word_count_line, sum)
word_summary_line <- t(sapply(word_count_line, summary))
colnames(word_summary_line) <- c("min_word_line",
                                 "first_quart_word_line",
                                 "median_word_line",
                                 "mean_word_line",
                                 "third_quart_word_line",
                                 "max_words_line")
# Alternative word count based on splitting at spaces (summed over all lines)
word_count <- sapply(cap_files, function(x) sum(lengths(strsplit(x, " "))))
line_count <- sapply(cap_files, length)
size_count_MB <- sapply(cap_names, function(x) file.info(x)$size/(10^6))
sum_stats <- data.frame(line_count,
                        word_count_file,
                        word_summary_line,
                        size_count_MB,
                        row.names = cap_names)
pander(sum_stats)
| | line_count | word_count_file | min_word_line | first_quart_word_line | median_word_line | mean_word_line | third_quart_word_line | max_words_line | size_count_MB |
|---|---|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 39386844 | 2 | 10 | 30 | 43.80 | 62 | 6852 | 210.2 |
| en_US.twitter.txt | 2360148 | 32874008 | 2 | 8 | 13 | 13.93 | 20 | 63 | 167.1 |
| en_US.news.txt | 77259 | 2837489 | 2 | 21 | 34 | 36.73 | 48 | 1522 | 205.8 |
# The summary statistics computed for all three files include the following:
# 1. Number of lines
# 2. Number of words
# 3. Minimum number of words per line
# 4. 1st quartile of number of words per line
# 5. Median number of words per line
# 6. Mean number of words per line
# 7. 3rd quartile of number of words per line
# 8. Maximum number of words per line
# 9. Size in MB
We can see that the blogs file has the longest lines (over 6000 words), but in general most lines in the dataset are relatively short (under 100 words), as can be observed from the third quartile of the words-per-line column. The news file has a significantly lower number of lines than the other two files yet is the second largest file, which indicates that it contains a relatively large number of long lines.
Preliminary file analysis was relatively time-consuming because of the large file sizes, so to increase processing speed and allow easier and faster exploratory data analysis, a sample of 20,000 lines from each file was used to create a corpus (text database) with the tm package. Basic preprocessing was then done using functions from the tm package to remove punctuation, numbers and stopwords (words that occur at high frequency, such as “at” or “this”, which would otherwise distort and hide unique words). The corpus is also converted to lowercase and whitespace is stripped out in order to bypass case sensitivity and remove unnecessary whitespace.
# Split initial 20,000 samples for each file to perform exploratory data analysis
set.seed(123)
cap_sample <- lapply(cap_files, function(x) sample(x, 20000))
cap_1 <- as.VCorpus(cap_sample)
# Preprocessing sequence #1 (punctuation, numbers, stop words, lowercase and whitespace)
cap_1 <- tm_map(cap_1, removePunctuation)
cap_1 <- tm_map(cap_1, removeNumbers)
cap_1 <- tm_map(cap_1, tolower)
cap_1 <- tm_map(cap_1, removeWords, stopwords("english"))
cap_1 <- tm_map(cap_1, stripWhitespace)
cap_1 <- tm_map(cap_1, PlainTextDocument)
The RWeka package is used to tokenise the corpus into unigrams, bigrams and trigrams for visualisation of the dataset. The top 30 n-grams in each category are plotted in order to better understand the dataset. The definition of n-grams can be found at https://en.wikipedia.org/wiki/N-gram.
# Creating n-grams for exploration
uni_toke <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bi_toke <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tri_toke <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
unigram <- DocumentTermMatrix(cap_1, control = list(tokenize = uni_toke))
bigram <- DocumentTermMatrix(cap_1, control = list(tokenize = bi_toke))
trigram <- DocumentTermMatrix(cap_1, control = list(tokenize = tri_toke))
# Plotting the top 30 unigrams
unifreq <- sort(colSums(as.matrix(unigram)), decreasing=TRUE)
uni_obj <- data.frame(term = names(unifreq[1:30]), freq = unifreq[1:30])
ggplot(uni_obj, aes(reorder(term, -freq),freq)) + geom_bar(stat="identity", fill = "lightblue") + ggtitle("Top 30 Unigrams") + xlab("Unigram words") + ylab("Frequency") + theme(axis.text.x=element_text(angle=45, hjust=1))
# Plotting the top 30 bigrams
bifreq <- sort(colSums(as.matrix(bigram)), decreasing=TRUE)
bi_obj <- data.frame(term = names(bifreq[1:30]), freq = bifreq[1:30])
ggplot(bi_obj, aes(reorder(term, -freq),freq)) + geom_bar(stat="identity", fill = "lightgreen") + ggtitle("Top 30 Bigrams") + xlab("Bigram words") + ylab("Frequency") + theme(axis.text.x=element_text(angle=45, hjust=1))
# Plotting the top 30 trigrams
trifreq <- sort(colSums(as.matrix(trigram)), decreasing=TRUE)
tri_obj <- data.frame(term = names(trifreq[1:30]), freq = trifreq[1:30])
ggplot(tri_obj, aes(reorder(term, -freq),freq)) + geom_bar(stat="identity", fill = "orange") + ggtitle("Top 30 Trigrams") + xlab("Trigram words") + ylab("Frequency") + theme(axis.text.x=element_text(angle=45, hjust=1))
We can see that some of the n-grams contain non-English symbols, and this should be addressed when preprocessing the actual training dataset. It could be done by fine-tuning the reader so that foreign symbols such as accented characters are removed, or by using grep to identify foreign symbols and remove all non-English words from the corpus. The spread between the most frequent n-gram and the rest widens as N increases (N being the n in n-grams), which suggests that higher weight should be given to trigrams in the final n-gram model, since a trigram match is more likely to be accurate than a unigram or bigram match. This can be seen in the trigram chart, where the maximum frequency is about 50 while the 10th is about 20 (a ratio of roughly 0.4), whereas for bigrams the top frequency is about 370 while the 10th is about 200 (a ratio of roughly 0.54).
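As a rough illustration of the grep-based option, non-ASCII characters could be stripped from the sampled text before the corpus is rebuilt, or the offending terms flagged in the frequency tables (a sketch only; the exact cleaning rules will be decided when preprocessing the full training set).
# Sketch: drop characters that cannot be represented in ASCII (sub = "" removes them)
cap_sample_ascii <- lapply(cap_sample, function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))
# Alternatively, flag terms in the unigram frequency table that still contain
# non-ASCII characters so they can be removed outright
foreign_terms <- grepl("[^ -~]", names(unifreq))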
In this section, the number of unique words required to cover different percentages of the whole sample corpus is calculated in order to check the size of the dictionary required for our n-gram model. A sparse corpus indicates there is much room for optimisation.
# Simple function to calculate the number of unique words in this sample dataset
# required to build a dictionary covering a certain percentage of all words
# (assumes x is a term-frequency vector sorted in decreasing order, as unifreq is)
nterms_perc <- function(x, perc){
    total = sum(x)
    n = 1
    y = x[n]
    while(perc > (y/total)){
        n = n + 1
        y = y + x[n]
    }
    return(n)
}
total_word_number <- sum(unifreq)
perc50 <- nterms_perc(unifreq, 0.5)
perc95 <- nterms_perc(unifreq, 0.95)
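An equivalent, vectorised way to obtain these counts (a sketch that relies on unifreq already being sorted in decreasing order, as it is above) uses cumulative sums.
# Sketch: cumulative share of all word occurrences covered by the top terms
coverage <- cumsum(unifreq) / sum(unifreq)
perc50_alt <- which(coverage >= 0.5)[1]    # first index reaching 50% coverage
perc95_alt <- which(coverage >= 0.95)[1]   # first index reaching 95% coverage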
The total number of words in the sample corpus is approximately 956,658, while only 1089 unique words are needed to cover 50% of the corpus and approximately 34,875 to cover 95%. This quite possibly indicates that a relatively small dictionary can provide sufficient accuracy when predicting standard terms.
The current plan for the remainder of the project is as follows:
At this stage, I will likely continue building on the data exploration conducted here and construct a simple (and slow) n-gram model with a back-off model for unseen text in order to test the methods first. From there, possible avenues of exploration include using language features such as sentence structure to provide additional accuracy checks.
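As a very rough sketch of what such a back-off lookup might look like (purely illustrative, not the final implementation; tri_table and bi_table are hypothetical named character vectors mapping a two- or one-word prefix to its most frequent following word, and are not built anywhere in this report):
# Hypothetical back-off lookup: try the trigram table, then the bigram table,
# then fall back to the single most frequent unigram
predict_next <- function(phrase, tri_table, bi_table, top_unigram){
    words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
    tri_key <- paste(words, collapse = " ")
    bi_key <- tail(words, 1)
    if (!is.na(tri_table[tri_key])) return(unname(tri_table[tri_key]))
    if (!is.na(bi_table[bi_key])) return(unname(bi_table[bi_key]))
    top_unigram
}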