Background and Goal

Word Prediction is an important Natural Language Processing (NLP) task. The goal is to predict the next (correct) word in a given context.
Word prediction has commonly been used as a tool to enhance writing productivity and is therefore used in many applications, for example predictive text entry systems, word completion utilities, and writing aids (Al-Mubaid 2007).

The goal of the Data Science Capstone Project is to build such a word prediction model based on a training data set, which is a corpus, i.e. a collection of written texts. The data comes from a corpus called HC Corpora.

Word prediction systems are based on models that have normally been trained on a large corpus of data. The larger the corpora the models have been built from, the better they tend to perform, especially on less frequent words. It is, however, important to keep another property in mind: models can perform very poorly when they are applied to language that is substantially different from the training context. In this context it has been

shown that a large amount of dissimilar language is more useful for language model training than a much smaller amount of similar language… (Trnka and McCoy 2007)

The modeling approach used may also affect performance. A kind of standard (benchmark) approach uses n-grams, while other interesting approaches make use of techniques such as deep learning. I plan to use the n-gram approach for my prediction model.
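
To illustrate the basic idea, here is a minimal sketch in base R (not the final model; the toy sentences and the helper predict_next() are made up purely for illustration): it counts bigrams in a few toy sentences and predicts the most frequent continuation of a given word.

# Toy bigram model: count bigrams per sentence, then predict the most frequent continuation
toy <- c("i want to go", "i want to stay", "i want a coffee")
bigrams <- unlist(lapply(strsplit(toy, " "), function(w) paste(head(w, -1), tail(w, -1))))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

predict_next <- function(word) {
    # keep only bigrams that start with 'word' and return the most frequent second word
    cand <- bigram_counts[grepl(paste0("^", word, " "), names(bigram_counts))]
    if (length(cand) == 0) return(NA_character_)
    strsplit(names(cand)[1], " ")[[1]][2]
}

predict_next("want")   # returns "to"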

Against this background, I will do an exploratory analysis - the first step for my modeling task - in order to understand:

A very welcome side effect of the exploratory analysis will be gaining hands-on experience with the R packages used for Natural Language Processing, which is a new area for me.

I will finish the analysis by outlining conclusions and next steps for building the model.

Load Packages

Based on input from the Coursera course and some investigation of web forums, blogs, etc., I have decided to use quanteda as the NLP package. On top of that I will use dplyr for convenient data manipulation and ggplot2 for visualization.

library(quanteda)
library(dplyr)
library(ggplot2)

Get Data

In a first step we will download the data set required for the project; in a second step we will unzip it. (To avoid downloading and unzipping the data on every run, it is prudent to check first whether the files already exist.)

if (!file.exists("~/Dropbox/Natural Language Processing/Coursera-SwiftKey.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
        destfile = "~/Dropbox/Natural Language Processing/Coursera-SwiftKey.zip")
}
if (!file.exists("~/Dropbox/Natural Language Processing/final/en_US/en_US.blogs.txt")) {
    unzip("~/Dropbox/Natural Language Processing/Coursera-SwiftKey.zip", exdir = "~/Dropbox/Natural Language Processing/", list = FALSE) }

Load Data

As the project’s English (US) data is required for exploration and model building, we need to load the three files it consists of.

blogs <- readLines("~/Dropbox/Natural Language Processing/final/en_US/en_US.blogs.txt")
news <- readLines("~/Dropbox/Natural Language Processing/final/en_US/en_US.news.txt")
twitter <- readLines("~/Dropbox/Natural Language Processing/final/en_US/en_US.twitter.txt")

Build Corpora

In a next step we build three corpora, one per source.

cblogs <- corpus(blogs)
cnews<- corpus(news)
ctwitter<- corpus(twitter)

For grouping and similar operations we will add a document variable ‘Source’ that captures the source of each document.

docvars(cblogs,"Source")<-"blogs"
docvars(cnews,"Source")<-"news"
docvars(ctwitter,"Source")<-"twitter"

With quanteda we can easily merge the three corpora into one corpus.

total <- cblogs + cnews + ctwitter

# Remove all objects that are no longer needed and free up memory
rm(list=c("blogs","news","twitter","cblogs","cnews","ctwitter"))
gc()

Basic Overview

In a first step we create a short overview of what the three files contain. As a working hypothesis we may assume that blogs are more complex than news and news more complex than tweets, meaning that, for example, a blog has more words per document than a tweet (or a news article). We may also expect a blog to contain more distinct words than a tweet.

Let us begin by having a look at the file sizes (in MB) of the raw data.

Size_Blogs <- round(file.info("~/Dropbox/Natural Language Processing/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2)
Size_News <- round(file.info("~/Dropbox/Natural Language Processing/final/en_US/en_US.news.txt")$size / 1024 ^ 2)
Size_Twitter <- round(file.info("~/Dropbox/Natural Language Processing/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2)

data.frame(Source = c("blogs","news","twitter"),FileSize=c(Size_Blogs,Size_News,Size_Twitter))
##    Source FileSize
## 1   blogs      200
## 2    news      196
## 3 twitter      159

We see that the file sizes are substantial, which will drive processing time and requires mechanisms to cope with the sheer size, such as sampling and parallel processing.
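
One possible way to parallelize work on the raw text is the parallel package. The following is only an illustration under assumptions: 'lines' is a stand-in for a character vector of raw text lines (e.g. as returned by readLines()), and the per-chunk word count is a placeholder task.

library(parallel)
lines <- c("this is a line", "and another line", "one more line of text")    # stand-in data
cl <- makeCluster(max(1, detectCores() - 1))                                 # one worker per core, minus one
chunks <- split(lines, cut(seq_along(lines), length(cl), labels = FALSE))    # split the lines into chunks
word_counts <- parLapply(cl, chunks, function(x) sum(lengths(strsplit(x, "\\s+"))))
stopCluster(cl)
sum(unlist(word_counts))                                                     # total word count across chunks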

We continue by analyzing some other basic properties such as the number of documents, words, and sentences and the average number of words per document (the average number of words per sentence is added for the sample below). quanteda and dplyr offer handy functions for that.

totsum<-summary(total,n=ndoc(total),verbose=F)
data.frame(totsum %>% group_by(Source) %>% summarize(Documents=n(),Words=sum(Tokens),Sentences=sum(Sentences),AverageWordsPerDocument=round(Words/Documents)))
##    Source Documents    Words Sentences AverageWordsPerDocument
## 1   blogs    899288 43336854   2371272                      48
## 2    news   1010242 40948588   2011089                      41
## 3 twitter   2360148 36985902   3766203                      16

We will now have a look at the histogram of words per document.

ggplot(data=totsum,aes(x=Tokens))+geom_histogram()+facet_wrap(~Source) + xlab("Number of Words per Document") + ylab("Frequency") + ggtitle("Words per Document - Histogram")

The x-axis shows that the spread of the number of words per document seems to be substantial - so we will zoom in.

ggplot(data=totsum,aes(x=Tokens))+geom_histogram()+facet_wrap(~Source) + xlab("Number of Words per Document") + ylab("Frequency") + ggtitle("Words per Document - Histogram")+xlim(c(0,200))

As we see, our working hypotheses are by and large confirmed. Blogs contain the most words overall and the most words per document; Twitter has substantially fewer words and significantly fewer words per document, a fact driven by the nature of the medium. We also see that the distribution of words per document differs considerably between the three data sets: Twitter has a much more uniform spread than blogs and news, which have a long tail.

Draw Sample

For the exploratory analysis I choose to sample 10% of the documents and to check during modeling whether and how to implement parallel processing and how to increase the sample size.

csamp<-sample(total,ndoc(total)/10)
# Remove all objects that are no longer needed and free up memory
rm(list=c("total"))
gc()

Let us now check what we have got in the sample.

csampsum<-summary(csamp,n=ndoc(csamp),verbose=F)
data.frame(csampsum %>% group_by(Source) %>% summarize(Documents=n(),Words=sum(Tokens),Sentences=sum(Sentences),AverageWordsPerDocument=round(Words/Documents),AverageWordsPerSentence=round(Words/Sentences)))
##    Source Documents   Words Sentences AverageWordsPerDocument
## 1   blogs     89935 4298719    235169                      48
## 2    news    101059 4103294    201943                      41
## 3 twitter    235973 3692828    375801                      16
##   AverageWordsPerSentence
## 1                      18
## 2                      20
## 3                      10

We compare the words retained in our sample per source (blogs, news, twitter) with the initial summary created above and see that the balance between the three sources is very similar. Together with the substantial sample size, this is a strong indication that our sample is a good representation of the total corpus.
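
To make the comparison explicit, one could compute the share of words per source in the full corpus and in the sample. A small sketch based on the summary data frames created above (the exact numbers depend on the random sample):

round(prop.table(tapply(totsum$Tokens, totsum$Source, sum)), 3)      # share of words per source, full corpus
round(prop.table(tapply(csampsum$Tokens, csampsum$Source, sum)), 3)  # share of words per source, 10% sample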

Similarity

An interesting question is how much lexical similarity the three sources have. quanteda provides a handy function to check that. Before we can use it we need to create a document-feature (term) matrix for our sample corpus. While creating it we remove stopwords, numbers, punctuation, and Twitter-specific characters (#, @). Then we can measure similarity using the cosine measure.
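
For reference, the cosine measure of two term-frequency vectors is their dot product divided by the product of their norms. A toy example with made-up counts (not taken from the data):

a <- c(the = 10, cat = 3, sat = 1)   # toy term frequencies for one source
b <- c(the = 8,  dog = 4, sat = 2)   # toy term frequencies for another source
terms <- union(names(a), names(b))
av <- bv <- setNames(numeric(length(terms)), terms)
av[names(a)] <- a
bv[names(b)] <- b
sum(av * bv) / (sqrt(sum(av^2)) * sqrt(sum(bv^2)))   # cosine similarity, here about 0.85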

dfmsamp <- dfm(csamp,removeNumbers=TRUE,removePunct=TRUE,removeTwitter=TRUE,
               ignoredFeatures=stopwords("english"),groups="Source")
## Creating a dfm from a corpus ...
##    ... grouping texts by variable: Source
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 3 documents
##    ... indexing features: 210,511 feature types
##    ... removed 174 features, from 174 supplied (glob) feature types
##    ... created a 3 x 210337 sparse dfm
##    ... complete. 
## Elapsed time: 14.003 seconds.
similarity(dfmsamp,method="cosine")
## similarity Matrix:
## $blogs
## twitter    news 
##  0.8351  0.7646 
## 
## $news
##   blogs twitter 
##  0.7646  0.5949 
## 
## $twitter
##  blogs   news 
## 0.8351 0.5949

We recognize an interesting fact: regarding words per document and words per sentence, blogs and news were similar (see above). When analyzing the vocabulary itself the situation looks different: blogs and twitter show the highest similarity, news and twitter the lowest. Because a good amount of dissimilar language is beneficial for training a language model (see above), this is a strong indication that we need to keep all three sources and should not discard any of them. In other words, it confirms our random sampling approach across all sources.

Unigrams

Using our document-feature-matrix we can determine the frequencies of unigrams (words) in our sample.

uni_df <- data.frame(Unigram = as.character(dfmsamp@Dimnames$features), Frequency = colSums(dfmsamp))
uni_df <- data.frame(uni_df %>% arrange(desc(Frequency)) %>% 
                         mutate(Count = 1,
                                CumCount = cumsum(Count),
                                CumFreq = cumsum(Frequency),
                                CumCountPerc = CumCount / max(CumCount) * 100,
                                CumFreqPerc = CumFreq / max(CumFreq) * 100,
                                Rank = cumsum(Count)))
uni_df$Unigram <- factor(as.character(uni_df$Rank), levels = uni_df$Rank, labels = as.character(uni_df$Unigram))

Looking at the cumulative percentages of unigrams and their frequencies reveals that a small share of the words accounts for the bulk of all occurrences (usage).

ggplot(data=uni_df,aes(x=CumCountPerc,y=CumFreqPerc))+geom_line(color="red")+xlab("Cumulative % of Unigrams")+ylab("Cumulative Frequency - Coverage")+ggtitle("Unigram Coverage - Stopwords Excluded")
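
To quantify the coverage, one can read off how many distinct unigrams are needed to reach a given share of all word occurrences. A small sketch based on the uni_df data frame created above (the exact values depend on the sample):

min(which(uni_df$CumFreqPerc >= 50))   # number of unigrams needed for 50% coverage
min(which(uni_df$CumFreqPerc >= 90))   # number of unigrams needed for 90% coverage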

Let us have a look at the most frequent unigrams.

ggplot(data=uni_df[1:20,],aes(x=Unigram,y=Frequency))+geom_bar(stat="identity",fill="red") + ggtitle("Top 20 Unigrams - Stopwords Excluded")

Another way to look at it is a word cloud.

plot(dfmsamp,comparison=TRUE)

For the further analysis of bigrams and trigrams we need to recreate the corpus so that each document corresponds to a sentence rather than a full text. This prevents bigrams/trigrams from spanning sentence boundaries, i.e. combining two sentences.

csentence<-tokenize(csamp,what="sentence",simplify=TRUE)
csampsen<-corpus(csentence)

Bigrams

We will now create a document-feature matrix that contains bigrams as features.

dfmsampsen <- dfm(csampsen,removeNumbers=TRUE,removePunct=TRUE,removeTwitter=TRUE,
               ignoredFeatures=stopwords("english"),n=2)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 812,913 documents
##    ... indexing features: 2,483,068 feature types
##    ... removed 1,041,301 features, from 174 supplied (glob) feature types
##    ... created a 812913 x 1441767 sparse dfm
##    ... complete. 
## Elapsed time: 132.965 seconds.

The top 20 bigrams are:

topfeatures(dfmsampsen,20)
##       right_now        new_york       last_year      last_night 
##            2457            2002            1877            1573 
##     high_school       years_ago       feel_like      first_time 
##            1375            1358            1258            1239 
##       last_week looking_forward       make_sure      looks_like 
##            1224            1153            1067             993 
##     even_though  happy_birthday      new_jersey    good_morning 
##             974             894             888             852 
##         can_get       next_week        just_got       every_day 
##             832             807             785             760

As for unigrams, we analyze the coverage of bigrams.

bi_df <- data.frame(Bigram = as.character(dfmsampsen@Dimnames$features), Frequency = colSums(dfmsampsen))
bi_df <- data.frame(bi_df %>% arrange(desc(Frequency)) %>% 
                        mutate(Count = 1,
                               CumCount = cumsum(Count),
                               CumFreq = cumsum(Frequency),
                               CumCountPerc = CumCount / max(CumCount) * 100,
                               CumFreqPerc = CumFreq / max(CumFreq) * 100,
                               Rank = cumsum(Count)))
bi_df$Bigram <- factor(as.character(bi_df$Rank), levels = bi_df$Rank, labels = as.character(bi_df$Bigram))

Now we can visualize the result.

ggplot(data=bi_df,aes(x=CumCountPerc,y=CumFreqPerc))+geom_line(color="red")+xlab("Cumulative % of Bigrams")+ylab("Cumulative Frequency - Coverage")+ggtitle("Bigram Coverage - Stopwords Excluded")

As we see, the coverage differs significantly from the unigram case: it is more widely spread, i.e. less concentrated in the top bigrams.

Let us check whether the inclusion of stopwords changes the picture - keeping in mind that the users of our app will most likely want stopwords included, as they will use them frequently.

dfmsampsen <- dfm(csampsen,removeNumbers=TRUE,removePunct=TRUE,removeTwitter=TRUE,
               n=2)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 812,913 documents
##    ... indexing features: 2,483,068 feature types
##    ... created a 812913 x 2483068 sparse dfm
##    ... complete. 
## Elapsed time: 70.876 seconds.

The top 20 bigrams now look very different:

topfeatures(dfmsampsen,20)
##   of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the 
##    42532    40735    21219    20260    19241    16236    14362    12561 
##     in_a with_the     is_a   it_was    for_a from_the    i_was   i_have 
##    12024    10483     9901     9608     9399     8749     8550     8424 
##  will_be   with_a    and_i    it_is 
##     8204     8170     8135     8040

The coverage changes significantly as well: a smaller share of bigrams accounts for a significantly higher coverage.

bi_df <- data.frame(Bigram = as.character(dfmsampsen@Dimnames$features), Frequency = colSums(dfmsampsen))
bi_df <- data.frame(bi_df %>% arrange(desc(Frequency)) %>% 
                        mutate(Count = 1,
                               CumCount = cumsum(Count),
                               CumFreq = cumsum(Frequency),
                               CumCountPerc = CumCount / max(CumCount) * 100,
                               CumFreqPerc = CumFreq / max(CumFreq) * 100,
                               Rank = cumsum(Count)))
bi_df$Bigram <- factor(as.character(bi_df$Rank), levels = bi_df$Rank, labels = as.character(bi_df$Bigram))
ggplot(data=bi_df,aes(x=CumCountPerc,y=CumFreqPerc))+geom_line(color="red")+xlab("Cumulative % of Bigrams")+ylab("Cumulative Frequency - Coverage")+ggtitle("Bigram Coverage - Stopwords Included")

Trigrams

Let us finally check how the situation looks for trigrams; here we include stopwords right from the start.

dfmsampsen <- dfm(csampsen,removeNumbers=TRUE,removePunct=TRUE,removeTwitter=TRUE,
               n=3)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 812,913 documents
##    ... indexing features: 5,600,981 feature types
##    ... created a 812913 x 5600981 sparse dfm
##    ... complete. 
## Elapsed time: 95.952 seconds.

The top 20 trigrams are:

topfeatures(dfmsampsen,20)
##         one_of_the           a_lot_of     thanks_for_the 
##               3351               2997               2460 
##            to_be_a        going_to_be          i_want_to 
##               1807               1738               1584 
##         the_end_of         out_of_the        some_of_the 
##               1498               1476               1406 
##           it_was_a         as_well_as         be_able_to 
##               1391               1370               1300 
##        part_of_the           i_have_a        the_rest_of 
##               1293               1176               1141 
## looking_forward_to          i_have_to       i_don't_know 
##               1124               1120               1111 
##     the_first_time        is_going_to 
##               1043               1009

The coverage decreases significantly compared to bigrams: a given share of distinct trigrams covers a smaller share of all occurrences.

tri_df <- data.frame(Trigram = as.character(dfmsampsen@Dimnames$features), Frequency = colSums(dfmsampsen))
tri_df <- data.frame(tri_df %>% arrange(desc(Frequency)) %>% 
                         mutate(Count = 1,
                                CumCount = cumsum(Count),
                                CumFreq = cumsum(Frequency),
                                CumCountPerc = CumCount / max(CumCount) * 100,
                                CumFreqPerc = CumFreq / max(CumFreq) * 100,
                                Rank = cumsum(Count)))
tri_df$Trigram <- factor(as.character(tri_df$Rank), levels = tri_df$Rank, labels = as.character(tri_df$Trigram))
ggplot(data=tri_df,aes(x=CumCountPerc,y=CumFreqPerc))+geom_line(color="red")+xlab("Cumulative % of Trigrams")+ylab("Cumulative Frequency - Coverage")+ggtitle("Trigram Coverage - Stopwords Included")

Next Steps

As next steps in my project I will:

In the steps to follow I plan to:

If I then have time left, I would like to:

References

Al-Mubaid, Hisham. 2007. “A Learning-Classification Based Approach for Word Prediction.” https://pdfs.semanticscholar.org/05ab/08c1f3ddfcb645a53ea0c9c9c936e194c10d.pdf.

Trnka, Keith, and Kathleen F McCoy. 2007. “Corpus Studies in Word Prediction.” In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility, 195–202. ACM. https://www.eecis.udel.edu/~trnka/research/trnka07assets.pdf.