This report summarises the exploratory data analysis conducted on the dataset provided for the JHU Data Science Capstone Project. The raw data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzipped locally.
The data comprises three text files (news articles, tweets and blogs) in four languages (English, German, Finnish and Russian). The objective of the project is to produce a predictive text algorithm based on the dataset provided; the resulting algorithm will be embedded in a web app using Shiny. The analysis in this report focuses on the English text files.
The bulk of the text mining operations use the quanteda library which, after much experimentation, proved faster than the equivalent tm functions. The process for transforming the text into a usable dictionary in quanteda is as follows:
text -> corpus -> tokens -> ngrams -> dfm -> dictionary
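As a minimal illustration of this pipeline, the sketch below runs the standard quanteda calls on a toy text (illustrative only; the full preprocessing applied in this analysis is shown later in the document).
library(quanteda)
toy <- c(d1 = "the quick brown fox jumps over the lazy dog")
toyCorpus <- corpus(toy) ## text -> corpus
toyTokens <- tokens(toyCorpus, remove_punct = TRUE) ## corpus -> tokens
toyNgrams <- tokens_ngrams(toyTokens, n = 2L, concatenator = " ") ## tokens -> ngrams
toyDfm <- dfm(toyNgrams) ## ngrams -> dfm
topfeatures(toyDfm) ## most frequent features, the basis of the dictionary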
The process and analysis are described below. The full code for running this reproducible analysis can be found at the end of the document.
The following code loads the required libraries, sets the working directory and defines the initial vectors containing file information.
library(tm); library(readr); library(stringi); library(ggplot2)
library(SnowballC); library(quanteda); library(RColorBrewer)
setwd("~/DataScience/CapstoneJHU")
dirData <- "~/DataScience/CapstoneJHU/data/"
lstLanguage <- c("de_DE","en_US","fi_FI","ru_RU")
lstDoctype <- c(".blogs",".news",".twitter")
lstFileDE <- c("de_DE.blogs","de_DE.news","de_DE.twitter")
lstFileEN <- c("en_US.blogs","en_US.news","en_US.twitter")
lstFileFI <- c("fi_FI.blogs","fi_FI.news","fi_FI.twitter")
lstFileRU <- c("ru_RU.blogs","ru_RU.news","ru_RU.twitter")
For the English language analysis, there are over 3 million lines of text to be analysed. The size of each file and the contents of its first line are shown below.
## [1] "~/DataScience/CapstoneJHU/data/en_US/en_US.blogs.txt"
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 208361438 171926076
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200godsâ\200\235."
## [1] "~/DataScience/CapstoneJHU/data/en_US/en_US.news.txt"
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15683765 13117038
## [1] "He wasn't home alone, apparently."
## [1] "~/DataScience/CapstoneJHU/data/en_US/en_US.twitter.txt"
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162385035 134371036
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
A corpus is designed to be a library of documents that have been converted to standardised text; it is intended to be a static container of the original reference texts. In this step we also load the bad word list, saved locally from https://www.cs.cmu.edu/~biglou/resources/, which is used to remove profanity from the text during processing.
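The profanity list itself is loaded with a simple scan (these lines mirror the corresponding lines in the full code, using the getFilename helper defined there):
fileProfanity <- getFilename("data","","bad-words",".txt")
lstProfanity <- scan(file=fileProfanity,sep='\n',what='',skipNul=TRUE)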
To enable faster processing, a random sample of 20% of the text is used. The data is lightly preprocessed before being converted to a corpus. This initial preprocessing speeds up the later tokenisation and performs operations that are not available within the quanteda library, including:
* removing html style tags
* removing urls
* removing twitter hashtags (not handled well by quanteda)
* removing non-alphanumeric characters
preprocessText <- function(tData) {
tSample <- 0.20
tData <- sample(tData,length(tData)*tSample,replace=FALSE)
tData <- gsub("<.*?>", "", tData) ## remove html style tags
a <- iconv(tData,to="ASCII",sub="") ## convert to ASCII, dropping non-convertible characters
a <- gsub("(https?)?://[^[:blank:]]*", " ",a) ## remove urls
a <- gsub("[[:blank:]]#[^[:blank:]]*", " ",a) ## remove hashtags
a <- gsub("[[:digit:]]","",a)
a <- gsub("[[:punct:]]","",a)
a <- gsub("^[:alphanum:][:space:]"," ",a)
corpus(a) }
a1 <- preprocessText(a1)
a2 <- preprocessText(a2)
a3 <- preprocessText(a3)
x <- a1 + a2 + a3
saveRDS(x,getFilename(tDir,tLanguage,tDocName,".corpus.rds"))
The corpus is converted into a token dataset using quanteda and saved to an .rds file. Tokenising cleans the text and extracts the individual features (words or terms) of the documents. This second round of preprocessing:
* removes punctuation
* removes numbers
* removes symbols
* removes stopwords (eg “to”, “the”, “and”, etc)
* removes twitter hashtags
* removes hyphens
* removes any profanity
x <-tokens(x,remove_punct=TRUE, remove_numbers=TRUE,
remove_symbols=TRUE, remove_twitter=TRUE, remove_hyphens=TRUE,
include_docvars=FALSE, what = "word")
x <-tokens_remove(x,stopwords(substr(tLanguage,1,2)))
x <-tokens_remove(x,lstProfanity)
saveRDS(x,getFilename(tDir,tLanguage,tDocName,".token.rds"))
In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items from a given text. The n-grams are constructed from the tokenised text.
readTokenFile <- function(tDir,tLanguage) {
tDocName <- tLanguage
fileRDS <-getFilename(tDir,tLanguage,tDocName,".token.rds")
a <- readRDS(fileRDS)
a }
getNgram <- function(x,n,nskip,tDir,tLanguage,tExtension) {
xgram <- tokens_ngrams(x,n,nskip,concatenator=" ")
saveRDS(xgram,getFilename(tDir,tLanguage,"token",tExtension))
rm(xgram) }
x <- readTokenFile(tDir,tLanguage)
getNgram(x,n=1L,nskip=0L,tDir,tLanguage,".x1gram.rds")
getNgram(x,n=2L,nskip=0L,tDir,tLanguage,".x2gram.rds")
getNgram(x,n=3L,nskip=0L,tDir,tLanguage,".x3gram.rds")
getNgram(x,n=2L,nskip=1L,tDir,tLanguage,".x1skip.rds")
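As a quick illustration (not part of the saved analysis), the behaviour of tokens_ngrams on a toy tokens object is shown below; the last call produces the 1-skipgrams used in this report.
toy <- tokens("one two three four")
tokens_ngrams(toy, n = 2L, concatenator = " ") ## "one two" "two three" "three four"
tokens_ngrams(toy, n = 3L, concatenator = " ") ## "one two three" "two three four"
tokens_ngrams(toy, n = 2L, skip = 1L, concatenator = " ") ## "one three" "two four"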
The n-grams are then converted to a document-feature matrix (dfm). A document-feature matrix is a table whose rows are documents and whose columns are the unique tokens (features), with each cell containing the count of that feature in that document.
readNgramFile <- function(tDir,tLanguage,tExtension) {
tDocName <- tLanguage
fileRDS <-getFilename(tDir,tLanguage,"token",tExtension)
a <- readRDS(fileRDS)
a }
getdfm <- function(tDir,tLanguage,tExtension) {
x <- readNgramFile(tDir,tLanguage,tExtension)
xdfm <- dfm(x)
xdfm <- dfm_sort(xdfm,decreasing=TRUE,margin="features")
saveRDS(xdfm,getFilename(tDir,tLanguage,"dfm",tExtension))
xdfm }
x.1skip <- getdfm(tDir,tLanguage,".x1skip.rds")
x.1gram <- getdfm(tDir,tLanguage,".x1gram.rds")
x.2gram <- getdfm(tDir,tLanguage,".x2gram.rds")
x.3gram <- getdfm(tDir,tLanguage,".x3gram.rds")
The most common unigrams and their frequencies are shown below. When stopwords are retained, the top 15 unigrams are all stopwords; the analysis below therefore excludes stopwords.
## one just can like time get now im new also
## 15235 12251 11865 11674 11093 8915 7450 7449 7298 7233
## day know first people make said back us well little
## 6832 6809 6690 6571 6535 6508 6464 6453 6343 6231
## really much see good love
## 6117 6062 6023 5997 5892
The table below shows how many unique unigrams are needed to cover a given proportion of the words in the corpus, both in absolute terms and as a percentage of all unigrams. Fewer than 13% of the unigrams account for 90% of the words in the corpus.
## coverage cumfreq cumfreqpct
## 1 0.1 33 0.0002794905
## 2 0.2 109 0.0009231655
## 3 0.3 259 0.0021935768
## 4 0.4 515 0.0043617454
## 5 0.5 940 0.0079612440
## 6 0.6 1671 0.0141523816
## 7 0.7 3042 0.0257639406
## 8 0.8 6011 0.0509096145
## 9 0.9 15124 0.1280913341
## 10 1.0 118072 1.0000000000
The most common bigrams are shown below.
## [1] "dont know" "right now" "years ago" "last year"
## [5] "new york" "even though" "last week" "can see"
## [9] "make sure" "first time" "feel like" "im sure"
## [13] "im going" "last night" "can get" "every day"
## [17] "high school" "dont want" "one day" "little bit"
## [21] "dont think" "long time" "united states" "year old"
## [25] "let know"
Many bigrams (about 86%) occur only once in the text, as the table below shows. These low-frequency bigrams will be pruned out when creating the dictionary for the algorithm (a trimming sketch follows the table).
## freqcount featcount featfreq
## 1 1 1381707 0.8628814
## 2 2 124964 0.9409219
## 3 3 38558 0.9650016
## 4 4 18195 0.9763644
## 5 5 10031 0.9826288
## 6 6 6243 0.9865276
## 7 7 4269 0.9891936
## 8 8 2993 0.9910627
## 9 9 2232 0.9924566
## 10 10 1747 0.9935476
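As referenced above, a sketch of how this pruning could be done with quanteda's dfm_trim (not run as part of this EDA; note that the minimum-frequency argument is named min_count in quanteda v1 and min_termfreq in more recent versions):
x.2gram.trim <- dfm_trim(x.2gram, min_termfreq = 2) ## drop bigrams that occur only once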
The most common trigrams are shown below.
## [1] "vested interests vested" "interests vested interests"
## [3] "new york city" "im pretty sure"
## [5] "new york ny" "couple weeks ago"
## [7] "preheat oven degrees" "cant wait see"
## [9] "new york times" "two years ago"
## [11] "lets just say" "love love love"
## [13] "world war ii" "dont get wrong"
## [15] "im looking forward" "please let know"
## [17] "please feel free" "many years ago"
## [19] "three years ago" "two weeks ago"
As with bigrams, there is a high incidence of trigrams that occur fewer than 5 times in the corpus analysed. In addition, the number of observed features grows rapidly as n increases, resulting in much larger data sets.
## freqcount featcount featfreq
## 1 1 2040614 0.9841906
## 2 2 24686 0.9960967
## 3 3 4280 0.9981610
## 4 4 1704 0.9989828
## 5 5 766 0.9993523
## 6 6 420 0.9995548
## 7 7 247 0.9996740
## 8 8 149 0.9997458
## 9 9 127 0.9998071
## 10 10 98 0.9998543
The most common 1-skipgrams are shown below. A skipgram is an n-gram in which a fixed number of intervening words are skipped. For this analysis, the middle word of each trigram has been skipped to create the 1-skipgrams.
## [1] "vested interests vested" "interests vested interests"
## [3] "new york city" "im pretty sure"
## [5] "new york ny" "couple weeks ago"
## [7] "preheat oven degrees" "cant wait see"
## [9] "new york times" "two years ago"
## [11] "lets just say" "love love love"
## [13] "world war ii" "dont get wrong"
## [15] "im looking forward" "please let know"
## [17] "please feel free" "many years ago"
## [19] "three years ago" "two weeks ago"
## [21] "new years eve" "level mp cost"
## [23] "happy new year" "dont even know"
## [25] "couple years ago"
The exploratory data analysis (EDA) provides a framework for understanding the data. Because of the high memory usage, results were saved to disk at key points in the transformation of the data. In all, the EDA described above took around 10 minutes to run.
Considerations for building the dictionary for the next stage of the predictive text model include how to:
* trim the n-grams and determine the appropriate dictionary size which balances the tradeoff between prediction speed and accuracy based on the coverage and frequency of the n-grams in the text
* determine if 4+-grams add additional accuracy to the prediction
* determine the weight of contribution of news articles vs blogs vs tweets to the accuracy of the overall model, to match the informal text typing style expected for predictive text
* assess the impact on model accuracy of including stop words and of using stemming when creating the n-gram dictionary
After the dictionaries have been created, the likely approach to predicting the next word will be based on the following set of rules (a rough sketch of the lookup follows the list):
* if there are more than 2 words, prediction will be based on 3- and 4-gram windows
* if there are 2 words, prediction will be based on 2-, 3- and 4-gram windows
* if there is 1 word, prediction will be based on 2- and 3-gram windows
* during the typing of any word, the prediction and text correction will use unigrams
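A rough sketch of this rule set in R is shown below. It assumes hypothetical lookup tables dict2, dict3 and dict4 (data frames with columns prefix and prediction, built from the pruned 2-, 3- and 4-gram dfms); this is an outline of the intended design, not final code.
predictNext <- function(typed, dict2, dict3, dict4) {
  words <- tail(unlist(strsplit(tolower(typed), "\\s+")), 3)
  for (k in seq(from = min(length(words), 3), to = 1)) {
    ## try the longest available prefix first, then back off to shorter n-grams
    prefix <- paste(tail(words, k), collapse = " ")
    dict <- switch(as.character(k), "3" = dict4, "2" = dict3, "1" = dict2)
    hits <- dict$prediction[dict$prefix == prefix]
    if (length(hits) > 0) return(head(hits, 3)) ## return the top 3 candidates
  }
  character(0) ## no match in any dictionary
}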
Issues to consider when building the algorithm include how to:
* calculate the next-word probabilities for each n-gram in the dictionary, and investigate the Katz backoff model, Naive Bayes and other NLP algorithms (a minimal probability sketch follows this list)
* build training and test datasets to assess and validate the accuracy of next word prediction
* investigate likelihood of typing errors and how it affects next word prediction
* investigate using the tf-idf of each n-gram when modelling probabilities
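As a minimal illustration of the probability calculation referenced in the first item (maximum likelihood only; the backoff and smoothing mentioned above are not shown), the conditional probability of a bigram "w1 w2" can be estimated as count(w1 w2) / count(w1) from the dfms created earlier:
bigramFreq <- textstat_frequency(x.2gram) ## bigram counts
unigramFreq <- textstat_frequency(x.1gram) ## unigram counts
w1 <- sapply(strsplit(bigramFreq$feature, " "), `[`, 1) ## first word of each bigram
bigramFreq$prob <- bigramFreq$frequency /
  unigramFreq$frequency[match(w1, unigramFreq$feature)]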
I will also need to observe existing predictive text models (eg on mobile phones and Google) in order to design the front end of the Shiny app. For example, predictive text within the iPhone message app returns the top 3 choices.
The full reproducible code for running the above processes is as follows:
library(tm); library(readr); library(stringi); library(ggplot2)
library(SnowballC); library(quanteda); library(RColorBrewer)
setwd("~/DataScience/CapstoneJHU")
dirData <- "~/DataScience/CapstoneJHU/data/"
lstLanguage <- c("de_DE","en_US","fi_FI","ru_RU")
lstDoctype <- c(".blogs",".news",".twitter")
lstFileDE <- c("de_DE.blogs","de_DE.news","de_DE.twitter")
lstFileEN <- c("en_US.blogs","en_US.news","en_US.twitter")
lstFileFI <- c("fi_FI.blogs","fi_FI.news","fi_FI.twitter")
lstFileRU <- c("ru_RU.blogs","ru_RU.news","ru_RU.twitter")
## -------------------------------------------------------------------------
## FUNCTIONS TO READ FILES
## -------------------------------------------------------------------------
getFilename <- function(tDir,tLanguage,tDocName,tExtension) {
tDir <- paste(tDir,tLanguage,"/",sep="")
tFilename <-paste(tDir,tDocName,tExtension,sep="")
tFilename }
## read text File, print no of lines, return data
readTextFile <- function(tDir,tLanguage,tDocName) {
fileData <-getFilename(tDir,tLanguage,tDocName,".txt")
tData <- scan(file = fileData,sep = '\n', what = '', skipNul = TRUE)
print(fileData)
print(stri_stats_general(tData)) ## lines of data
tData }
## read Token file
readTokenFile <- function(tDir,tLanguage) {
tDocName <- tLanguage
fileRDS <-getFilename(tDir,tLanguage,tDocName,".token.rds")
a <- readRDS(fileRDS)
a }
## read ngram file
readNgramFile <- function(tDir,tLanguage,tExtension) {
tDocName <- tLanguage
fileRDS <-getFilename(tDir,tLanguage,"token",tExtension)
a <- readRDS(fileRDS)
a }
## --------------------------------------------------------------------------
## PROCESSING FUNCTIONS FOR TEXT TO CORPUS
## --------------------------------------------------------------------------
## convert text to corpus in quanteda and sample x% of data
preprocessText <- function(tData) {
tSample <- 0.20 ## sample 20% of the data
tData <- sample(tData,length(tData)*tSample,replace=FALSE)
tData <- gsub("<.*?>", "", tData) ## remove html style tags
a <- iconv(tData,to="ASCII",sub="") ## convert to ASCII, dropping non-convertible characters
a <- gsub("(https?)?://[^[:blank:]]*", " ",a) ## remove urls
a <- gsub("[[:blank:]]#[^[:blank:]]*", " ",a) ## remove hashtags
a <- gsub("[[:digit:]]","",a)
a <- gsub("[[:punct:]]","",a)
a <- gsub("^[:alphanum:][:space:]"," ",a)
corpus(a) }
## --------------------------------------------------------------------------
## PROCESSING FUNCTIONS FOR CORPUS TO TOKENS
## --------------------------------------------------------------------------
## reads corpus, preprocessing and tokenising
tokenText <- function(tDir,tLanguage,lstFile) {
## read text file and preprocess
tDocName <- tLanguage
a1 <- readTextFile(tDir,tLanguage,lstFile[1])
a2 <- readTextFile(tDir,tLanguage,lstFile[2])
a3 <- readTextFile(tDir,tLanguage,lstFile[3])
a1 <- preprocessText(a1)
a2 <- preprocessText(a2)
a3 <- preprocessText(a3)
x <- a1 + a2 + a3
saveRDS(x,getFilename(tDir,tLanguage,tDocName,".corpus.rds"))
## make tokens
x <-tokens(x,remove_punct=TRUE, remove_numbers=TRUE,
remove_symbols=TRUE, remove_twitter=TRUE, remove_hyphens=TRUE,
include_docvars=FALSE, what = "word")
x <-tokens_remove(x,stopwords(substr(tLanguage,1,2))) ## remove stopwords, as in the analysis above
x <-tokens(x,what="fasterword",remove_url=TRUE)
x <-tokens_remove(x,lstProfanity)
saveRDS(x,getFilename(tDir,tLanguage,tDocName,".token.rds"))
x }
## --------------------------------------------------------------------------
## PROCESSING FUNCTIONS FOR TOKENS TO NGRAMS
## --------------------------------------------------------------------------
## calculate n-grams from tokens
getNgram <- function(x,n,nskip,tDir,tLanguage,tExtension) {
xgram <- tokens_ngrams(x,n,nskip,concatenator=" ")
saveRDS(xgram,getFilename(tDir,tLanguage,"token",tExtension))
rm(xgram) }
## convert tokens to ngrams, creates x.ngram file
ngramToken <- function(tDir, tLanguage) {
x <- readTokenFile(tDir,tLanguage)
getNgram(x,n=1L,nskip=0L,tDir,tLanguage,".x1gram.rds")
getNgram(x,n=2L,nskip=0L,tDir,tLanguage,".x2gram.rds")
getNgram(x,n=3L,nskip=0L,tDir,tLanguage,".x3gram.rds")
getNgram(x,n=2L,nskip=1L,tDir,tLanguage,".x1skip.rds")
rm(x)
}
## --------------------------------------------------------------------------
## PROCESSING FUNCTIONS FOR NGRAMS TO DFM
## --------------------------------------------------------------------------
## coverage by number of features - only works on dfm objects
ngramCoverage <- function(xgram,n) {
## word cloud and top words
textplot_wordcloud(xgram,max_words=60/n,color=brewer.pal(6,"Dark2"))
print(topfeatures(xgram,n=25))
## word frequency table
tblgram <- textstat_frequency(xgram)
tblgram <- data.frame(tblgram,
cumfreq = cumsum(tblgram$frequency)/sum(tblgram$frequency),
length=nchar(tblgram$feature))
## print(mean(tblgram$length))
## for x% coverage of text, cumfreq of features as abs and % of corpus
tblgram2 <- data.frame(coverage = 1:10)
tblgram2$coverage <- tblgram2$coverage/10
cumfreq <- sapply(tblgram2$coverage, function(x)
length(which(tblgram$cumfreq<=x)))
cumfreqpct <- cumfreq/NROW(tblgram$frequency)
tblgram2 <- cbind(tblgram2,cumfreq,cumfreqpct)
print(tblgram2)
## ggplot(tblgram2,aes(cumfreqpct,coverage))+geom_line()
## count of features which appear x times in corpus
tblgram3 <- data.frame(freqcount = 1:10)
featcount <- sapply(tblgram3$freqcount, function(x)
length(which(tblgram$frequency==x)))
featfreq <- cumsum(featcount)/NROW(tblgram$frequency)
tblgram3 <- cbind(tblgram3,featcount,featfreq)
print(tblgram3)
rm(tblgram, tblgram2,tblgram3) }
## calculate dfm from ngrams
getdfm <- function(tDir,tLanguage,tExtension) {
x <- readNgramFile(tDir,tLanguage,tExtension)
xdfm <- dfm(x)
xdfm <- dfm_sort(xdfm,decreasing=TRUE,margin="features")
saveRDS(xdfm,getFilename(tDir,tLanguage,"dfm",tExtension))
xdfm }
## convert ngrams to dfm, trim and create dfm file
dfmNgram <- function(tDir, tLanguage) {
getdfm(tDir,tLanguage,".x1skip.rds")
getdfm(tDir,tLanguage,".x1gram.rds")
getdfm(tDir,tLanguage,".x2gram.rds")
getdfm(tDir,tLanguage,".x3gram.rds")
## --------------------------------------------------------------------------
## REAL CODE STARTS HERE
## --------------------------------------------------------------------------
## bad word list https://www.cs.cmu.edu/~biglou/resources/
fileProfanity <- getFilename("data","","bad-words",".txt")
lstProfanity <- scan(file=fileProfanity,sep ='\n',what='',skipNul=TRUE)
## en_US
## tokenText : convert text to tokens; creates .corpus and .token file
system.time(tokenEN <- tokenText(dirData,"en_US",lstFileEN))
## ngramToken : convert tokens to ngrams, creates x.ngram file
system.time(ngramEN <- ngramToken(dirData,"en_US"))
## dfmNgram : convert ngrams to dfm, create dfm file
system.time(dfmEN <- dfmNgram(dirData,"en_US"))
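## ngramCoverage : print top features, coverage and frequency-count tables.
## The tables shown in the body of this report can be reproduced by calling
## ngramCoverage on each saved dfm, for example:
ngramCoverage(readRDS(getFilename(dirData,"en_US","dfm",".x1gram.rds")),n=1)
ngramCoverage(readRDS(getFilename(dirData,"en_US","dfm",".x2gram.rds")),n=2)
ngramCoverage(readRDS(getFilename(dirData,"en_US","dfm",".x3gram.rds")),n=3)
ngramCoverage(readRDS(getFilename(dirData,"en_US","dfm",".x1skip.rds")),n=2)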