The goal of this Data Science Capstone is to develop an algorithm that predicts the next word in a sequence, using the training data provided. This Milestone Report documents the initial findings from the data and my plans for the remainder of the project.
The data set may be downloaded from here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
For this project, we will be utilizing three documents found in the en_US folder.
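For reference, a minimal sketch of the download-and-unzip step is shown below; the destination paths are my own illustrative assumptions rather than part of the original write-up.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipfile <- path.expand("~/Coursera-SwiftKey.zip")
if (!file.exists(zipfile)) {
  download.file(url, destfile = zipfile, mode = "wb") # mode = "wb" keeps the zip intact on Windows
}
unzip(zipfile, exdir = path.expand("~")) # extracts the final/en_US/ folder under the home directory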
After downloading our file and unzipping it, we load the necessary libraries to conduct our exploratory data analysis (EDA).
library(quanteda)
library(readtext)
library(tidyverse)
library(ggpubr)
library(knitr)
library(kableExtra)
Let’s first read the files and have a look at some basic document information.
twt <- readLines("~/final/en_US/en_US.twitter.txt")
blg <- readLines("~/final/en_US/en_US.blogs.txt")
nws <- readLines("~/final/en_US/en_US.news.txt")
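The code that produced the file summary below is not shown in the original report; here is a plausible sketch using file.size() and nchar() on the vectors just read (the file_info object name is my own).

# Sketch: compute file size (MB), line count, and longest line length for each file
files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")
texts <- list(twt, blg, nws)
file_info <- data.frame(
  `File Name` = files,
  `File Size (MB)` = file.size(file.path("~/final/en_US", files)) / 1e6, # megabytes (10^6 bytes)
  `Number of Lines` = sapply(texts, length),
  `Longest Line Length` = sapply(texts, function(x) max(nchar(x))),
  check.names = FALSE
)
kable(file_info) %>% kable_styling()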
| File Name | File Size (MB) | Number of Lines | Longest Line Length |
|---|---|---|---|
| en_US.twitter.txt | 167.1053 | 2360148 | 213 |
| en_US.blogs.txt | 210.1600 | 899288 | 40835 |
| en_US.news.txt | 205.8119 | 77259 | 5760 |
We see that en_US.twitter.txt has the largest number of lines, but en_US.blogs.txt has the longest line; this is not surprising, as tweets have a 280-character limit, while blog posts and news articles have no such restriction.
For the purpose of this EDA, I am sampling 10% of the lines from each text to incorporate into a corpus. My initial (naive) intention was to use 100% of each text; unfortunately, I ran into memory problems during the generation of n-grams. The 10% figure was chosen arbitrarily; it will serve as the baseline from which I decide how much to sample later on.
set.seed(9875)
twt <- sample(twt, floor(length(twt) * 0.1), replace = FALSE) # 10% of the Twitter lines
blg <- sample(blg, floor(length(blg) * 0.1), replace = FALSE) # 10% of the blog lines
nws <- sample(nws, floor(length(nws) * 0.1), replace = FALSE) # 10% of the news lines
en_us.sample <- c(twt, blg, nws) # combine the three samples
writeLines(en_us.sample, "~/final/en_US/en_us.ten.txt") # save the combined sample
We have sampled 10% of the lines from each text and combined them into a single file named en_us.ten.txt. This is the document from which we will create a corpus in the following step.
Let’s now load en_us.ten.txt and create our corpus.
en_us <- readtext("~/final/en_US/en_us.ten.txt")
en_us.corpus <- corpus(en_us)
summary(en_us.corpus)
## Corpus consisting of 1 document:
##
## Text Types Tokens Sentences
## en_us.ten.txt 235589 8432284 471675
##
## Source: C:/Users/PY Heng/Documents/* on x86-64 by PY Heng
## Created: Sat Dec 21 22:32:27 2019
## Notes:
Our corpus (en_us.corpus) is made up of 471,675 sentences, with a total of 8,432,284 tokens; of these, 235,589 are unique tokens.
en_us.tokens <- tokens(en_us.corpus) #Generate tokens
en_us.tokens <- tokens(en_us.tokens, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) # Remove punctuation, numbers, and symbols
en_us.tokens <- tokens_tolower(en_us.tokens) # Convert remaining tokens to lowercase
One of the requirements for this project is to remove swear words from the texts. I obtained a list of swear words from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en and saved it as a .txt file named swearWords.txt.
profanity <- readLines("~/final/swearWords.txt") # list of words to filter out
en_us.tokens <- tokens_remove(en_us.tokens, profanity) # drop any token matching the list
# Generate n-grams (n = 1 to 4) and their document-feature matrices
unigram <- tokens_ngrams(en_us.tokens, n = 1)
dfmuni <- dfm(unigram)
bigram <- tokens_ngrams(en_us.tokens, n = 2)
dfmbi <- dfm(bigram)
trigram <- tokens_ngrams(en_us.tokens, n = 3)
dfmtri <- dfm(trigram)
quadgram <- tokens_ngrams(en_us.tokens, n = 4)
dfmqua <- dfm(quadgram)
To visualize the most common uni-, bi-, tri-, and quadgrams, I will use the topfeatures() function.
topwordsuni <- topfeatures(dfmuni, 20)
topwordsuni <- data.frame(word=names(topwordsuni), count=topwordsuni)
unip <- ggplot(topwordsuni, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "#8BCBC8") +
  coord_flip() +
  xlab('Words (Unigram)') + ylab('Count') # x is flipped onto the vertical axis
topwordsbi <- topfeatures(dfmbi, 20)
topwordsbi <- data.frame(word=names(topwordsbi), count=topwordsbi)
bip <- ggplot(topwordsbi, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "#FDAE84") +
  coord_flip() +
  xlab('Words (Bigram)') + ylab('Count')
topwordstri <- topfeatures(dfmtri, 20)
topwordstri <- data.frame(word=names(topwordstri), count=topwordstri)
trip <- ggplot(topwordstri, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "#EF798E") +
  coord_flip() +
  xlab('Words (Trigram)') + ylab('Count')
topwordsqua <- topfeatures(dfmqua, 20)
topwordsqua <- data.frame(word=names(topwordsqua), count=topwordsqua)
quap <- ggplot(topwordsqua, aes(x = reorder(word, count), y = count)) +
  geom_segment(aes(xend = reorder(word, count), y = 0, yend = count), color = "grey") +
  geom_point(size = 3, color = "grey") +
  coord_flip() +
  xlab('Words (Quadgram)') + ylab('Count')
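The four charts can then be displayed together; one option, using ggarrange() from the ggpubr package loaded earlier (the 2 x 2 layout is my own choice), is:

# Arrange the four n-gram frequency plots in a 2 x 2 grid
ggarrange(unip, bip, trip, quap, ncol = 2, nrow = 2)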
Since I did not remove stopwords during the cleaning process, most of the frequently occurring n-grams contain stopwords. I would like to look at the pattern with stopwords removed.
en_us.tokens2 <- tokens_select(en_us.tokens, stopwords('english'), selection = 'remove') # Remove English stopwords
en_us.tokens2 <- tokens_select(en_us.tokens2, min_nchar = 2L) # Remove any one-letter tokens
I’ll repeat the process above to generate the top n-grams, this time excluding stopwords.
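A sketch of that repetition follows; the object names ending in 2 are my own convention for the stopword-free versions.

# Regenerate n-grams and document-feature matrices from the stopword-free tokens
unigram2 <- tokens_ngrams(en_us.tokens2, n = 1)
dfmuni2 <- dfm(unigram2)
bigram2 <- tokens_ngrams(en_us.tokens2, n = 2)
dfmbi2 <- dfm(bigram2)
trigram2 <- tokens_ngrams(en_us.tokens2, n = 3)
dfmtri2 <- dfm(trigram2)
quadgram2 <- tokens_ngrams(en_us.tokens2, n = 4)
dfmqua2 <- dfm(quadgram2)
# The same topfeatures() + ggplot workflow as above is then applied to
# dfmuni2, dfmbi2, dfmtri2, and dfmqua2.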