Overview

The goal of this Data Science Capstone is to develop an algorithm that predicts the next word in a sequence, using the training data provided. This Milestone Report documents the initial findings from the data and my plans for the remainder of the project.

1. Data Acquisition

The data set may be downloaded from here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

For this project, we will be utilizing three documents found in the en_US folder.
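The download and extraction can be scripted along these lines (a sketch; the destination paths are assumptions chosen to match the file paths used later):

# Download the zipped data set and extract it into the home directory (paths assumed)
zip.url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip.file <- path.expand("~/Coursera-SwiftKey.zip")
if (!file.exists(zip.file)) download.file(zip.url, destfile = zip.file, mode = "wb")
unzip(zip.file, exdir = path.expand("~"))  # creates the final/en_US folder used below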

2. Load Libraries

After downloading our file and unzipping it, we load the necessary libraries to conduct our exploratory data analysis (EDA).

library(quanteda)
library(readtext)
library(tidyverse)
library(ggpubr)
library(knitr)
library(kableExtra)

3. Basic Document Info

Let’s first read the files and have a look at some basic document information.

twt <- readLines("~/final/en_US/en_US.twitter.txt")
blg <- readLines("~/final/en_US/en_US.blogs.txt")
nws <- readLines("~/final/en_US/en_US.news.txt")
##           File Name File Size (MB) Number of Lines Longest Line Length
## 1 en_US.twitter.txt       167.1053         2360148                 213
## 2   en_US.blogs.txt       210.1600          899288               40835
## 3    en_US.news.txt       205.8119           77259                5760
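The summary above was produced roughly as follows; this is a sketch that assumes the file paths used in the readLines() calls (kable() and kableExtra, loaded earlier, can render the same data frame as a formatted table):

# Summarise file size, line count, and longest line for each file (a sketch)
files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")
texts <- list(twt, blg, nws)
doc.info <- data.frame(
  "File Name"           = files,
  "File Size (MB)"      = file.size(file.path("~/final/en_US", files)) / 1024^2,
  "Number of Lines"     = sapply(texts, length),
  "Longest Line Length" = sapply(texts, function(x) max(nchar(x))),
  check.names = FALSE
)
doc.info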

We see that en_US.twitter.txt has the most lines, but en_US.blogs.txt contains the longest line; this is not surprising, since tweets are limited to 280 characters, while blog posts and news articles have no such restriction.

4. Subsetting samples

For the purpose of this EDA, I am sampling 10% of the lines from each text to be incorporated into a corpus. My initial (naive) intention was to sample 100% of each text; unfortunately, I ran into memory problems while generating the n-grams. The 10% figure was chosen arbitrarily; it will be the baseline from which I decide how much to sample later on.

set.seed(9875)
twt <- sample(twt, length(twt)*0.1 , replace = FALSE)
blg <- sample(blg, length(blg)*0.1, replace = FALSE)
nws <- sample(nws, length(nws)*0.1, replace = FALSE)
en_us.sample <- c(twt, blg, nws)
writeLines(en_us.sample, "~/final/en_US/en_us.ten.txt")

We have sampled 10% of the lines from each text and combined them into a single file named en_us.ten.txt. This will be the document from which we create a corpus in the next step.

5. Creating a Corpus

Let’s now load en_us.ten.txt and create our corpus.

en_us <- readtext("~/final/en_US/en_us.ten.txt")
en_us.corpus <- corpus(en_us)
summary(en_us.corpus)
## Corpus consisting of 1 document:
## 
##           Text  Types  Tokens Sentences
##  en_us.ten.txt 235589 8432284    471675
## 
## Source: C:/Users/PY Heng/Documents/* on x86-64 by PY Heng
## Created: Sat Dec 21 22:32:27 2019
## Notes:

Our corpus (en_us.corpus) is made up of 471,675 sentences, with a total of 8,432,284 tokens; of these, 235,589 are unique tokens.

6. Generating & Cleaning Tokens

en_us.tokens <- tokens(en_us.corpus) # generate tokens

en_us.tokens <- tokens(en_us.tokens, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) # remove punctuation, numbers, and symbols

en_us.tokens <- tokens_tolower(en_us.tokens) # convert remaining tokens to lowercase

7. Profanity Filter

One requirement for this project is to remove swear words from the texts. I obtained a list of swear words from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en and saved it as a .txt file named swearWords.txt.

profanity <- readLines("~/final/swearWords.txt")
en_us.tokens <- tokens_remove(en_us.tokens, profanity)

8. Creating N-grams & document feature matrices

unigram <- tokens_ngrams(en_us.tokens, n=1)
dfmuni <- dfm(unigram) 

bigram <- tokens_ngrams(en_us.tokens, n=2)
dfmbi <- dfm(bigram)

trigram <- tokens_ngrams(en_us.tokens, n=3)
dfmtri <- dfm(trigram)

quadgram <- tokens_ngrams(en_us.tokens, n=4)
dfmqua <- dfm(quadgram)

9. Visualizing Top N-grams

To visualize the most common uni-, bi-, tri-, and quadgrams, I will use the topfeatures() function.

topwordsuni <- topfeatures(dfmuni, 20)
topwordsuni <- data.frame(word=names(topwordsuni), count=topwordsuni)
unip <- ggplot(topwordsuni, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "#8BCBC8") +
  coord_flip() +
  xlab('Words (Unigram)') + ylab('Count')


topwordsbi <- topfeatures(dfmbi, 20)
topwordsbi <- data.frame(word=names(topwordsbi), count=topwordsbi)
bip <- ggplot(topwordsbi, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "#FDAE84") +
  coord_flip() +
  xlab('Words (Bigram)') + ylab('Count')


topwordstri <- topfeatures(dfmtri, 20)
topwordstri <- data.frame(word=names(topwordstri), count=topwordstri)
trip <- ggplot(topwordstri, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "#EF798E") +
  coord_flip() +
  xlab('Words (Trigram)') + ylab('Count')


topwordsqua <- topfeatures(dfmqua, 20)
topwordsqua <- data.frame(word=names(topwordsqua), count=topwordsqua)
quap <- ggplot(topwordsqua, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "grey") +
  coord_flip() +
  xlab('Words (Quadgram)') + ylab('Count')
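The four plots can then be arranged into a single panel, which is presumably what ggpubr was loaded for (a sketch):

ggarrange(unip, bip, trip, quap, ncol = 2, nrow = 2)  # 2 x 2 grid of the lollipop charts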

As I did not remove stopwords during the cleaning process, most of the frequently occurring n-grams contain stopwords. I’d like to have a look at the pattern with stopwords removed.

10. Top N-grams (stopwords removed)

en_us.tokens2 <- tokens_select(en_us.tokens, stopwords('english'), selection = 'remove')
en_us.tokens2 <- tokens_select(en_us.tokens2, min_nchar = 2L) # remove any one-letter tokens

I’ll repeat the process above to generate the top N-grams, this time excluding stopwords.
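As an illustration, the unigram step repeated on the stopword-free tokens would look something like this (a sketch; the same pattern applies to the bi-, tri-, and quadgrams):

dfmuni2 <- dfm(tokens_ngrams(en_us.tokens2, n = 1))
topwordsuni2 <- topfeatures(dfmuni2, 20)
topwordsuni2 <- data.frame(word = names(topwordsuni2), count = topwordsuni2)
ggplot(topwordsuni2, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "#8BCBC8") +
  coord_flip() +
  xlab('Words (Unigram, no stopwords)') + ylab('Count')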

11. Interesting Findings

  • There is a marked difference in the top N-grams once stopwords are removed. Nevertheless, I have decided to retain the stopwords, since removing them would hurt the prediction algorithm: stopwords are legitimate, and very common, next-word candidates.
  • I also see some foreign words among the top N-grams; I will remove non-ASCII characters before proceeding, to prevent inaccuracies during prediction (one possible approach is sketched below).
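One possible way to handle the second point is to drop every token that contains a non-ASCII character, e.g. with a regex-based tokens_remove(); this is a sketch and has not yet been applied to the tokens above:

# Drop any token containing a non-ASCII character (a sketch; not applied above)
en_us.tokens <- tokens_remove(en_us.tokens, pattern = "[^\\x01-\\x7F]", valuetype = "regex")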

12. Concerns & Plans Ahead

  • I did not encounter any memory problems sampling 10% of the lines from each text; I believe I can increase the percentage sampled from here on, but I still need to determine how much word coverage is required.
  • For the prediction algorithm, I will probably use an N-gram model (a rough sketch of the lookup idea follows this list).
  • I anticipate memory and runtime problems as this project moves along; I will need to find ways to address these later.
  • Another concern is how to handle an input word or phrase that does not appear in any of the three texts and so has never been seen by the algorithm; I also hope to address this as the project moves along.
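On the last two points, a common approach is to look up the final words of the input in the quadgram table and to back off to the trigram and bigram tables when no match is found, so that even an unseen context still returns a guess. Below is a minimal sketch of that idea, assuming hypothetical frequency tables freq4, freq3, and freq2, each a data frame with columns prefix, nextword, and count (none of these are built above):

# Back-off lookup sketch; the frequency tables and their columns are hypothetical
predict_next <- function(input, freq4, freq3, freq2) {
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)  # last 3 words of the input
  tables  <- list(freq4, freq3, freq2)  # longest context first
  context <- c(3, 2, 1)                 # number of words in each table's prefix
  for (i in seq_along(tables)) {
    if (length(words) < context[i]) next
    prefix <- paste(tail(words, context[i]), collapse = "_")
    hits <- tables[[i]][tables[[i]]$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$nextword[which.max(hits$count)])  # most frequent continuation
  }
  "the"  # fall back to the overall most frequent unigram when nothing matches
}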