Word Prediction is an important Natural Language Processing (NLP) task. The goal is to predict the next (correct) word in a given context.
Word prediction is commonly used as a tool to enhance writing productivity and therefore appears in many applications, for example predictive text entry systems, word completion utilities, and writing aids (Al-Mubaid 2007).
The goal of the Data Science Capstone Project is to build such a word prediction model based on a training data set, which is a corpus, i.e. a collection of written texts. The data comes from a corpus called HC Corpora.
Word prediction systems are based on models that have normally been trained on a large corpus of data. The larger the corpora the models have been built from, the better they tend to perform, especially on words that are less frequent. It is, however, important to keep another property in mind: models can perform very poorly when they are applied to language that is substantially different from the training context. In this context it has been shown that a large amount of dissimilar language is more useful for language model training than a much smaller amount of similar language (Trnka and McCoy 2007).
Model performance is also affected by the modelling approach used. The de facto standard (and benchmark) approach uses n-grams, while other interesting approaches make use of techniques like deep learning. I plan to use the n-gram approach for my prediction model.
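To illustrate the basic idea of n-gram prediction (a minimal sketch with made-up counts and a hypothetical predict_word helper, not the final model): the predictor looks up the longest context it has seen and backs off to a shorter context when the longer one is unknown.
# Minimal sketch of next-word prediction with trigram-to-bigram backoff (illustrative counts only)
trigrams <- data.frame(context = c("thanks for", "one of"),
                       word    = c("the", "the"),
                       count   = c(25, 33),
                       stringsAsFactors = FALSE)
bigrams  <- data.frame(context = c("for", "of"),
                       word    = c("the", "the"),
                       count   = c(200, 425),
                       stringsAsFactors = FALSE)

predict_word <- function(context) {
  hit <- trigrams[trigrams$context == context, ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$count)])  # longest matching context wins
  last <- tail(strsplit(context, " ")[[1]], 1)               # back off to the last word only
  hit <- bigrams[bigrams$context == last, ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$count)])
  "the"                                                      # final fallback: a frequent unigram
}

predict_word("thanks for")  # returns "the"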
Against this background, I will do an exploratory analysis - the first step for my modeling task - in order to understand:
A very welcome side effect of the exploratory analysis will be hands-on experience with the R packages used for Natural Language Processing, which is a new area for me.
I will finish the analysis by outlining conclusions and next steps for building the model.
Based on input from the Coursera course and some research in web forums, blogs, etc., I have decided to use quanteda as the package for NLP. In addition, I will use dplyr for data manipulation and ggplot2 for visualization.
library(quanteda)
library(dplyr)
library(ggplot2)
In a first step we download the data set required for the project; in a second step we unzip it. (In order not to download the data each time, it is prudent to check whether it has already been downloaded and unzipped.)
if (!file.exists("~/Dropbox/Natural Language Processing/Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "~/Dropbox/Natural Language Processing/Coursera-SwiftKey.zip")
}
if (!file.exists("~/Dropbox/Natural Language Processing/final/en_US/en_US.blogs.txt")) {
  unzip("~/Dropbox/Natural Language Processing/Coursera-SwiftKey.zip",
        exdir = "~/Dropbox/Natural Language Processing/", list = FALSE)
}
As the project's US data is required for exploration and model building, we need to load the three files it consists of.
blogs <- readLines("~/Dropbox/Natural Language Processing/final/en_US/en_US.blogs.txt")
news <- readLines("~/Dropbox/Natural Language Processing/final/en_US/en_US.news.txt")
twitter <- readLines("~/Dropbox/Natural Language Processing/final/en_US/en_US.twitter.txt")
In a next step we build three corpora, one per source.
cblogs <- corpus(blogs)
cnews <- corpus(news)
ctwitter <- corpus(twitter)
For grouping and later comparisons we add a document variable 'Source' that captures the source of each document.
docvars(cblogs,"Source")<-"blogs"
docvars(cnews,"Source")<-"news"
docvars(ctwitter,"Source")<-"twitter"
With quanteda we can easily merge the three corpora into one corpus.
total <- cblogs + cnews + ctwitter
# Remove now all objects that are no longer used and free up storage
rm(list=c("blogs","news","twitter","cblogs","cnews","ctwitter"))
gc()
In a first step we create a short overview of what the three files contain. As a working hypothesis we may assume that a blog is more complex than a news article and a news article more complex than a tweet, which means, for example, that a blog has more words per document than a tweet (or a news article). We may also expect that a blog contains more distinct words than a tweet.
Let us begin by looking at the file sizes (in MB) of the raw data.
Size_Blogs <- round(file.info("~/Dropbox/Natural Language Processing/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2)
Size_News <- round(file.info("~/Dropbox/Natural Language Processing/final/en_US/en_US.news.txt")$size / 1024 ^ 2)
Size_Twitter <- round(file.info("~/Dropbox/Natural Language Processing/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2)
data.frame(Source = c("blogs","news","twitter"),FileSize=c(Size_Blogs,Size_News,Size_Twitter))
## Source FileSize
## 1 blogs 200
## 2 news 196
## 3 twitter 159
We see that the file sizes are substantial, which will drive the processing time and require mechanisms to cope with the sheer size, such as sampling and parallel processing.
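One way to speed up such I/O- and CPU-heavy steps (a sketch only; the analysis in this report read the files sequentially above) is the base parallel package, which can process the three files concurrently on platforms that support forking:
library(parallel)

# Sketch: read the three raw files concurrently via mclapply (forking, so not available on Windows)
files <- file.path("~/Dropbox/Natural Language Processing/final/en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
texts <- mclapply(files, readLines, mc.cores = 3)
names(texts) <- c("blogs", "news", "twitter")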
We continue by analyzing some other basic properties such as the number of documents, words and sentences, the average words per document and the average words per sentence. quanteda and dplyr offer handy functions for that.
totsum<-summary(total,n=ndoc(total),verbose=F)
data.frame(totsum %>% group_by(Source) %>% summarize(Documents=n(),Words=sum(Tokens),Sentences=sum(Sentences),AverageWordsPerDocument=round(Words/Documents)))
## Source Documents Words Sentences AverageWordsPerDocument
## 1 blogs 899288 43336854 2371272 48
## 2 news 1010242 40948588 2011089 41
## 3 twitter 2360148 36985902 3766203 16
We will now take a look at the histogram of the number of words per document.
ggplot(data=totsum,aes(x=Tokens))+geom_histogram()+facet_wrap(~Source) + xlab("Number of Words per Document") + ylab("Frequency") + ggtitle("Words per Document - Histogram")
The x-axis shows that the spread of the number of words per document is substantial, so we will zoom in.
ggplot(data=totsum,aes(x=Tokens))+geom_histogram()+facet_wrap(~Source) + xlab("Number of Words per Document") + ylab("Frequency") + ggtitle("Words per Document - Histogram")+xlim(c(0,200))
As we see, our working hypotheses are by and large confirmed. Blogs contain the most words in total and the most words per document, while twitter contains substantially fewer words and significantly fewer words per document, a fact driven by the nature of the medium. We also see that the distribution of words per document differs considerably between the three datasets: twitter has a much more uniform spread than blogs and news, which have a long tail.
For the exploratory analysis I choose to sample 10% and to check during modeling whether and how to implement parallel processing and how to increase the sample size.
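To make the 10% sample reproducible across runs, the random seed could be fixed before drawing it (the seed value below is arbitrary and was not part of the original run):
set.seed(1234)  # arbitrary seed, only to make the sample drawn below reproducible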
csamp<-sample(total,ndoc(total)/10)
# Remove now all objects that are no longer used and free up storage
rm(list=c("total"))
gc()
Let us now check what we have got in the sample.
csampsum<-summary(csamp,n=ndoc(csamp),verbose=F)
data.frame(csampsum %>% group_by(Source) %>% summarize(Documents=n(),Words=sum(Tokens),Sentences=sum(Sentences),AverageWordsPerDocument=round(Words/Documents),AverageWordsPerSentence=round(Words/Sentences)))
## Source Documents Words Sentences AverageWordsPerDocument
## 1 blogs 89935 4298719 235169 48
## 2 news 101059 4103294 201943 41
## 3 twitter 235973 3692828 375801 16
## AverageWordsPerSentence
## 1 18
## 2 20
## 3 10
We compare the share of words retained in our sample per source (blogs, news, twitter) with the initial summary created above and see that we have achieved a very similar balance between the three sources. This is a strong indication - together with the substantial sample size - that our sample is a good representation of the total corpus.
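This can be verified directly by comparing the share of documents per source in the full summary (totsum) with the share in the sample summary (csampsum); a quick sketch, output omitted:
# Share (in %) of documents per source: full corpus vs. 10% sample
round(prop.table(table(totsum$Source)) * 100, 1)
round(prop.table(table(csampsum$Source)) * 100, 1)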
An interesting question is how much lexical similarity the three sources have. quanteda provides a handy function to check that. Before we can use it we need to create a document-feature (term) matrix for our sample corpus. While creating it we remove stopwords, numbers, punctuation and twitter-specific characters (#, @). Then we can measure the similarity between the sources using the cosine measure.
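For intuition: the cosine similarity of two term-frequency vectors is their dot product divided by the product of their Euclidean norms, so a value of 1 means identical word distributions and values near 0 mean little overlap. A tiny sketch with made-up counts for three words:
# Toy example: cosine similarity of two invented term-frequency vectors
x <- c(time = 10, day = 5, love = 1)
y <- c(time = 8, day = 2, love = 6)
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))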
dfmsamp <- dfm(csamp,removeNumbers=TRUE,removePunct=TRUE,removeTwitter=TRUE,
ignoredFeatures=stopwords("english"),groups="Source")
## Creating a dfm from a corpus ...
## ... grouping texts by variable: Source
## ... lowercasing
## ... tokenizing
## ... indexing documents: 3 documents
## ... indexing features: 210,511 feature types
## ... removed 174 features, from 174 supplied (glob) feature types
## ... created a 3 x 210337 sparse dfm
## ... complete.
## Elapsed time: 14.003 seconds.
similarity(dfmsamp,method="cosine")
## similarity Matrix:
## $blogs
## twitter news
## 0.8351 0.7646
##
## $news
## blogs twitter
## 0.7646 0.5949
##
## $twitter
## blogs news
## 0.8351 0.5949
We recognize an interesting fact: regarding words per document and words per sentence, blogs and news were similar (see above). When analyzing the vocabulary itself the situation looks different: blogs and twitter show the highest similarity, news and twitter the lowest. Because a good amount of dissimilar language is beneficial for training a language model (see above), this is a strong indication that we need to keep all three sources and shouldn't discard one. In other words, it confirms our random sampling approach based on all sources.
Using our document-feature-matrix we can determine the frequencies of unigrams (words) in our sample.
uni_df <- data.frame(Unigram=as.character(dfmsamp@Dimnames$features),Frequency=colSums(dfmsamp))
uni_df <- data.frame(uni_df %>% arrange(desc(Frequency)) %>% mutate(Count=1, CumCount=cumsum(Count),CumFreq=cumsum(Frequency),
CumCountPerc=CumCount/max(CumCount)*100,
CumFreqPerc=CumFreq/max(CumFreq)*100,
Rank=cumsum(Count)))
uni_df$Unigram <- factor(as.character(uni_df$Rank),levels=uni_df$Rank,labels = as.character(uni_df$Unigram))
Looking at the cumulative percentages of words and frequencies reveals that only a small share of the distinct words accounts for the bulk of the occurrences (usage).
ggplot(data=uni_df,aes(x=CumCountPerc,y=CumFreqPerc))+geom_line(color="red")+xlab("Cumulative % of Unigrams")+ylab("Cumulative Frequency - Coverage")+ggtitle("Unigram Coverage - Stopwords Excluded")
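To put a number on this, one can look up at which share of the distinct unigrams a given coverage level is reached; a quick sketch using the columns of uni_df defined above (output omitted, as the exact values depend on the sample):
# Share of distinct unigrams (in %) needed to reach 50% and 90% of all word occurrences
uni_df$CumCountPerc[min(which(uni_df$CumFreqPerc >= 50))]
uni_df$CumCountPerc[min(which(uni_df$CumFreqPerc >= 90))]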
Let us have a look at the most frequent unigrams.
ggplot(data=uni_df[1:20,],aes(x=Unigram,y=Frequency))+geom_bar(stat="identity",fill="red") + ggtitle("Top 20 Unigrams - Stopwords Excluded")
Another way to look at it is a word cloud.
plot(dfmsamp,comparison=TRUE)
For the further analysis regarding bigrams and trigrams we need to recreate the corpus so that each line no longer stands for a document but for a sentence. This avoids creating bigrams/trigrams that span a sentence boundary, i.e. combine two sentences.
csentence<-tokenize(csamp,what="sentence",simplify=TRUE)
csampsen<-corpus(csentence)
We now create a document-feature matrix that contains bigrams as features.
dfmsampsen <- dfm(csampsen,removeNumbers=TRUE,removePunct=TRUE,removeTwitter=TRUE,
ignoredFeatures=stopwords("english"),n=2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 812,913 documents
## ... indexing features: 2,483,068 feature types
## ... removed 1,041,301 features, from 174 supplied (glob) feature types
## ... created a 812913 x 1441767 sparse dfm
## ... complete.
## Elapsed time: 132.965 seconds.
The top 20 bigrams are:
topfeatures(dfmsampsen,20)
## right_now new_york last_year last_night
## 2457 2002 1877 1573
## high_school years_ago feel_like first_time
## 1375 1358 1258 1239
## last_week looking_forward make_sure looks_like
## 1224 1153 1067 993
## even_though happy_birthday new_jersey good_morning
## 974 894 888 852
## can_get next_week just_got every_day
## 832 807 785 760
As for unigrams, we analyze the coverage of bigrams.
bi_df <- data.frame(Bigram=as.character(dfmsampsen@Dimnames$features),Frequency=colSums(dfmsampsen))
bi_df <- data.frame(bi_df %>% arrange(desc(Frequency)) %>% mutate(Count=1, CumCount=cumsum(Count),CumFreq=cumsum(Frequency),
CumCountPerc=CumCount/max(CumCount)*100,
CumFreqPerc=CumFreq/max(CumFreq)*100,
Rank=cumsum(Count)))
bi_df$Bigram <- factor(as.character(bi_df$Rank),levels=bi_df$Rank,labels = as.character(bi_df$Bigram))
Now we can visualize the result.
ggplot(data=bi_df,aes(x=CumCountPerc,y=CumFreqPerc))+geom_line(color="red")+xlab("Cumulative % of Bigrams")+ylab("Cumulative Frequency - Coverage")+ggtitle("Bigram Coverage - Stopwords Excluded")
As we see, the coverage is significantly different (wider spread, lower concentration) than for unigrams.
Let us check whether the inclusion of stopwords changes the picture, keeping in mind that the users of our app will most likely want stopwords included, as they will use them frequently.
dfmsampsen <- dfm(csampsen,removeNumbers=TRUE,removePunct=TRUE,removeTwitter=TRUE,
n=2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 812,913 documents
## ... indexing features: 2,483,068 feature types
## ... created a 812913 x 2483068 sparse dfm
## ... complete.
## Elapsed time: 70.876 seconds.
The top 20 bigrams now look very different:
topfeatures(dfmsampsen,20)
## of_the in_the to_the for_the on_the to_be at_the and_the
## 42532 40735 21219 20260 19241 16236 14362 12561
## in_a with_the is_a it_was for_a from_the i_was i_have
## 12024 10483 9901 9608 9399 8749 8550 8424
## will_be with_a and_i it_is
## 8204 8170 8135 8040
The coverage also changes significantly: a smaller share of bigrams accounts for a significantly higher coverage.
bi_df <- data.frame(Bigram=as.character(dfmsampsen@Dimnames$features),Frequency=colSums(dfmsampsen))
bi_df <- data.frame(bi_df %>% arrange(desc(Frequency)) %>% mutate(Count=1, CumCount=cumsum(Count),CumFreq=cumsum(Frequency),
CumCountPerc=CumCount/max(CumCount)*100,
CumFreqPerc=CumFreq/max(CumFreq)*100,
Rank=cumsum(Count)))
bi_df$Bigram <- factor(as.character(bi_df$Rank),levels=bi_df$Rank,labels = as.character(bi_df$Bigram))
ggplot(data=bi_df,aes(x=CumCountPerc,y=CumFreqPerc))+geom_line(color="red")+xlab("Cumulative % of Bigrams")+ylab("Cumulative Frequency - Coverage")+ggtitle("Bigram Coverage - Stopwords Included")
Let us finally check how the situation looks for trigrams; here we will include stopwords right from the start.
dfmsampsen <- dfm(csampsen,removeNumbers=TRUE,removePunct=TRUE,removeTwitter=TRUE,
n=3)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 812,913 documents
## ... indexing features: 5,600,981 feature types
## ... created a 812913 x 5600981 sparse dfm
## ... complete.
## Elapsed time: 95.952 seconds.
The top 20 trigrams are:
topfeatures(dfmsampsen,20)
## one_of_the a_lot_of thanks_for_the
## 3351 2997 2460
## to_be_a going_to_be i_want_to
## 1807 1738 1584
## the_end_of out_of_the some_of_the
## 1498 1476 1406
## it_was_a as_well_as be_able_to
## 1391 1370 1300
## part_of_the i_have_a the_rest_of
## 1293 1176 1141
## looking_forward_to i_have_to i_don't_know
## 1124 1120 1111
## the_first_time is_going_to
## 1043 1009
The coverage decreases significantly compared to bigrams.
tri_df <- data.frame(Trigram=as.character(dfmsampsen@Dimnames$features),Frequency=colSums(dfmsampsen))
tri_df <- data.frame(tri_df %>% arrange(desc(Frequency)) %>% mutate(Count=1, CumCount=cumsum(Count),CumFreq=cumsum(Frequency),
CumCountPerc=CumCount/max(CumCount)*100,
CumFreqPerc=CumFreq/max(CumFreq)*100,
Rank=cumsum(Count)))
tri_df$Trigram <- factor(as.character(tri_df$Rank),levels=tri_df$Rank,labels = as.character(tri_df$Trigram))
ggplot(data=tri_df,aes(x=CumCountPerc,y=CumFreqPerc))+geom_line(color="red")+xlab("Cumulative % of Trigrams")+ylab("Cumulative Frequency - Coverage")+ggtitle("Trigram Coverage - Stopwords Included")
As next steps in my project I will
In the steps to follow I plan to:
If I then have time left I would like to:
Al-Mubaid, Hisham. 2007. “A Learning-Classification Based Approach for Word Prediction.” https://pdfs.semanticscholar.org/05ab/08c1f3ddfcb645a53ea0c9c9c936e194c10d.pdf.
Trnka, Keith, and Kathleen F McCoy. 2007. “Corpus Studies in Word Prediction.” In Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility, 195–202. ACM. https://www.eecis.udel.edu/~trnka/research/trnka07assets.pdf.