This is the Week 2 milestone report for the Data Science Specialization Capstone course. The overall goal of the capstone project is to develop an algorithm that predicts the most likely next word in a sequence of words in a sentence.
The purpose of this report is to show how the data was downloaded, imported into R, and cleaned, and to present exploratory analyses of some features of the data.
The code below demonstrates downloading and importing the data (the English language version of blogs, news, and Twitter posts).
if(!file.exists("Coursera-SwiftKey.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
blogs <- readLines("final/en_US/en_US.blogs.txt", warn=FALSE, encoding="UTF-8")
news <- readLines("final/en_US/en_US.news.txt", warn=FALSE, encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", warn=FALSE, encoding="UTF-8")
Summary Statistics of the Data
Here, we will be computing the following summary statistics for each file:
- size (in megabytes)
- number of entries (rows)
- total characters
- length of the longest entry
data_summary <- data.frame("File" = c("Blogs","News","Twitter"),
"File Size" = sapply(list(blogs,news,twitter),function(x){format(object.size(x),"MB")}),
"NumEntries" = sapply(list(blogs,news,twitter),function(x){length(x)}),
"TotalCharacters" = sapply(list(blogs,news,twitter),function(x){sum(nchar(x))}),
"MaxCharacters" = sapply(list(blogs,news,twitter),function(x){max(unlist(lapply(x,function(y) nchar(y))))}))
data_summary
## File File.Size NumEntries TotalCharacters MaxCharacters
## 1 Blogs 255.4 Mb 899288 206824505 40833
## 2 News 19.8 Mb 77259 15639408 5760
## 3 Twitter 319 Mb 2360148 162096031 140
Data Cleaning and Selection of Corpus
Given the size of the data (as shown in the summary statistics above), we will proceed with a subset consisting of 5% of each file. From this subset, we will clean the data and create a corpus to be used for prediction.
set.seed(1015) # Setting the seed for reproducibility
samp_size <- 0.05 # Setting the subset to be 5% of each file
# Create indices for the sampling of the datasets.
blogs_ind <- sample(seq_len(length(blogs)),length(blogs)*samp_size)
news_ind <- sample(seq_len(length(news)),length(news)*samp_size)
twitter_ind <- sample(seq_len(length(twitter)),length(twitter)*samp_size)
# Now select the 5% of the sample for each file.
blogs_sub <- blogs[blogs_ind]
news_sub <- news[news_ind]
twitter_sub <- twitter[twitter_ind]
# Now load the text mining package from R
library(tm)
## Warning: package 'tm' was built under R version 3.5.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.5.2
# Create a corpus out of all 3 sampled datasets.
corpus <- Corpus(VectorSource(c(blogs_sub,news_sub,twitter_sub)),
readerControl=list(reader=readPlain,language="en"))
# Clean the corpus dataset by removing non-ASCII characters.
corpus <- Corpus(VectorSource(sapply(corpus, function(row) iconv(row,"latin1","ASCII",sub=""))))
# Clean the data further by removing punctuation, unnecessary white spaces, and numbers, and
# converting to lower case.
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation
## drops documents
corpus <- tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation
## drops documents
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
# corpus <- tm_map(corpus, PlainTextDocument)
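As a quick sanity check (this step is a suggestion and not part of the original report's output), the first few cleaned documents can be inspected with tm's inspect() to confirm that punctuation, numbers, and upper case were removed:
# Inspect a few cleaned documents to verify the transformations above.
inspect(corpus[1:3])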
Exploratory Data Analysis
Once we have cleaned the dataset, we need to convert the data into a format suitable for Natural Language Processing (NLP). One format we will be employing is N-grams stored in Term-Document Matrices (TDMs).
The N-gram representation of text lists all N-tuples of consecutive words that appear in a given selection of text. The simplest case is the unigram, which is based on individual words, followed by the bigram (which lists all pairs of adjacent words), and so on. The TDMs store the frequencies of these N-grams in the respective sources.
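As a brief illustration (a minimal sketch in base R, independent of the packages used below), the bigrams of a short sentence can be listed by pairing each word with its successor:
# Toy example: extract the bigrams of a short sentence using base R only.
sentence <- "the quick brown fox jumps"
words <- strsplit(sentence, " ")[[1]]
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
## [1] "the quick"   "quick brown" "brown fox"   "fox jumps"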
For this step, we will be using the RWeka package, an R interface to Weka, a collection of machine learning algorithms for data mining.
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.5.2
# Generate tokens for the N-grams.
TokenUnigram <- function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
TokenBigram <- function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
TokenTrigram <- function(x) NGramTokenizer(x,Weka_control(min=3,max=3))
TokenQuadgram <- function(x) NGramTokenizer(x,Weka_control(min=4,max=4))
Unigram <- TermDocumentMatrix(corpus, control=list(tokenize=TokenUnigram))
Bigram <- TermDocumentMatrix(corpus, control=list(tokenize=TokenBigram))
Trigram <- TermDocumentMatrix(corpus, control=list(tokenize=TokenTrigram))
Quadgram <- TermDocumentMatrix(corpus, control=list(tokenize=TokenQuadgram))
Unigram
## <<TermDocumentMatrix (terms: 113862, documents: 166833)>>
## Non-/sparse entries: 2341777/18993597269
## Sparsity : 100%
## Maximal term length: 164
## Weighting : term frequency (tf)
Bigram
## <<TermDocumentMatrix (terms: 113862, documents: 166833)>>
## Non-/sparse entries: 2341777/18993597269
## Sparsity : 100%
## Maximal term length: 164
## Weighting : term frequency (tf)
Trigram
## <<TermDocumentMatrix (terms: 113862, documents: 166833)>>
## Non-/sparse entries: 2341777/18993597269
## Sparsity : 100%
## Maximal term length: 164
## Weighting : term frequency (tf)
Quadgram
## <<TermDocumentMatrix (terms: 113862, documents: 166833)>>
## Non-/sparse entries: 2341777/18993597269
## Sparsity : 100%
## Maximal term length: 164
## Weighting : term frequency (tf)
It is evident from the above that the matrices are extremely sparse. To explore the data further, we remove rare N-grams to create denser matrices and then plot the frequencies of the remaining N-grams.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
# We write a function that sums up the rows and sorts by N-gram frequency.
freqframe <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
freqframe <- data.frame(word=names(freq), freq=freq)
return(freqframe)
}
# Create matrices that are denser, and then add up and sort the matrices.
UnigramDense <- removeSparseTerms(Unigram, 0.999)
UnigramDenseSort <- freqframe(UnigramDense)
BigramDense <- removeSparseTerms(Bigram, 0.999)
BigramDenseSort <- freqframe(BigramDense)
TrigramDense <- removeSparseTerms(Trigram, 0.999)
TrigramDenseSort <- freqframe(TrigramDense)
QuadgramDense <- removeSparseTerms(Quadgram, 0.999)
QuadgramDenseSort <- freqframe(QuadgramDense)
# Plot the frequencies of the Unigrams.
GGUni <- ggplot(data=UnigramDenseSort[1:50,],aes(x=reorder(word, -freq),y=freq)) + geom_bar(stat="identity")
GGUni <- GGUni + labs(x="N-gram", y="Frequency", title="Frequencies of the 50 Most Abundant Unigrams (individual words)")
GGUni <- GGUni + theme(axis.text.x=element_text(angle=90))
GGUni
# Plot the frequencies of the Bigrams.
GGBi <- ggplot(data=BigramDenseSort[1:50,],aes(x=reorder(word, -freq),y=freq)) + geom_bar(stat="identity")
GGBi <- GGBi + labs(x="N-gram", y="Frequency", title="Frequencies of the 50 Most Abundant Bigrams (pairs of words)")
GGBi <- GGBi + theme(axis.text.x=element_text(angle=90))
GGBi
# Plot the frequencies of the Trigrams.
GGTri <- ggplot(data=TrigramDenseSort[1:50,],aes(x=reorder(word, -freq),y=freq)) + geom_bar(stat="identity")
GGTri <- GGTri + labs(x="N-gram", y="Frequency", title="Frequencies of the 50 Most Abundant Trigrams (triplets of words)")
GGTri <- GGTri + theme(axis.text.x=element_text(angle=90))
GGTri
# Plot the frequencies of the Quadgrams.
GGQuad <- ggplot(data=QuadgramDenseSort[1:50,],aes(x=reorder(word, -freq),y=freq)) + geom_bar(stat="identity")
GGQuad <- GGQuad + labs(x="N-gram", y="Frequency", title="Frequencies of the 50 Most Abundant Quadgrams (quartets of words)")
GGQuad <- GGQuad + theme(axis.text.x=element_text(angle=90))
GGQuad
We have found that the longer the N-gram, the lower its frequency.
We have converted the corpus into N-grams stored in TDMs and then collapsed these into data frames of frequencies, which should be useful for predicting the next word in a sequence of words.
For example, when looking at a string of 3 words, the most likely next word can be guessed by finding all quadgrams that start with those 3 words and choosing the most frequent one. We expect the prediction to be fairly quick, as the TDMs themselves are not required for this purpose (only the overall frequencies matter). The data frames of summed frequencies (e.g. TrigramDenseSort) can be used directly, and the intermediate steps can be cached to reduce computing time, ensuring that the Shiny app will run in an acceptable time.
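For illustration, below is a minimal sketch of such a lookup, assuming the QuadgramDenseSort data frame built above; the helper function predict_next is hypothetical and only illustrates the idea, not the final implementation.
# Hypothetical helper: predict the next word from a 3-word prefix by looking up
# the most frequent quadgrams that start with that prefix.
predict_next <- function(prefix, quadgram_freq, n = 3) {
  prefix <- tolower(trimws(prefix))
  grams <- as.character(quadgram_freq$word)
  # Keep only quadgrams whose first three words match the prefix.
  matches <- quadgram_freq[startsWith(grams, paste0(prefix, " ")), ]
  if (nrow(matches) == 0) return(character(0))
  top <- head(matches[order(-matches$freq), ], n)
  # The candidate next words are the last word of each matching quadgram.
  sapply(strsplit(as.character(top$word), " "), function(w) w[length(w)])
}
# Example call (output depends on the sampled data):
# predict_next("thanks for the", QuadgramDenseSort)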