Introduction

This report summarises the first steps of building a model of word usage and word relations, using the SwiftKey dataset downloaded in November 2016. The report is split into several parts:

  • Exploratory analysis
  • Word frequencies
  • Next step

Most of the steps took several iterations to get workable results. In some cases the libraries used would fail (or crash), partly because of mangled text, i.e. unusable characters.

Exploratory analysis

The downloaded set is unzipped and explored. It consists of 4 directories, each containing 3 files (only the files for en_US are shown):

  • de_DE
  • en_US
    • en_US.blogs
    • en_US.news
    • en_US.twitter
  • fi_FI
  • ru_RU

In this case we looked at the en_US part of the dataset. Since the files are rather big, only a portion (1 in 100 lines) is used for a more in-depth analysis. First we take a simplistic, low-level approach to get some insight into the size of the data: we simply count the number of lines and words, as sketched below.
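
A minimal sketch, in R, of how these counts can be obtained (the file path and the simple whitespace-based word count are assumptions; the same numbers can also be produced with command-line tools such as ‘wc’):

# assumption: read the raw file into a character vector, one element per line
result.blogs <- readLines("final/en_US/en_US.blogs.txt", skipNul = TRUE)

nroflines     <- length(result.blogs)                                 # number of lines
maxlinelength <- max(nchar(result.blogs))                             # length of the longest line
nrofwords     <- sum(sapply(strsplit(result.blogs, "\\s+"), length))  # rough whitespace-based word count

# the same is done for news and twitter; the counts are combined into the data frame tdf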

Putting the resulting counts in a table results in the overview:

> tdf
   source nroflines maxlinelength nrofwords
1   blogs    899288         40835  37334441
2    news     77259          5760   2643972
3 twitter   2360148           213  30373792
> 

It does not tell us a lot, except that we have a lot of words, most of them in blogs. This will influence the predictive capabilities of the model.

A quick look at the words per line shows that they depend heavily on the medium. Note that this information can also be extracted simply with Linux text-handling commands, such as ‘wc’.

To get an idea of the differences between the media used, we plot the line count per source type:

library(ggplot2)
fig1 <- ggplot(tdf, aes(x = factor(source), y = nroflines/1e+06))
fig1 <- fig1 + geom_bar(stat = "identity") +
  labs(y = "Nr of lines (million)", x = "text file", title = "Number of lines per text") 

fig1

Looking at the ratio (below), it is clear that it differs a lot and depends on the medium; Twitter is expected to show a low ratio.

> tdf$wordsperline <- (tdf$nrofwords / tdf$nroflines)
> tdf
   source nroflines maxlinelength nrofwords wordsperline
1   blogs    899288         40835  37334441     41.51556
2    news     77259          5760   2643972     34.22219
3 twitter   2360148           213  30373792     12.86944
> 

So, in general, we see that blogs contain the most words per line, indicating more complex text. News seems to be more concise. Tweets coming through Twitter contain few words per sentence, which seems logical since the medium is designed for small, short messages. For a more in-depth analysis we use the text-mining library, so we don't need to write all the code at a low level of detail (and hopefully we get better performance).

First we create a subset for analysis and training; the subsets are created by taking a sample of 1 in 100 lines, for example:

# try 1 in 100 lines; would that be enough?
result.blogs.sample_size <- floor(0.01 * length(result.blogs))    # 1% of all lines
train_ind                <- sample(seq_len(length(result.blogs)), size = result.blogs.sample_size)
result.blogs.train       <- result.blogs[train_ind]               # the sampled training subset

The samples are cleaned and stored as part of the training set; only after several iterations did a workable text result. Cleaning the text was done using global substitutions, since using only the “tm” package did not seem sufficient.
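
A minimal sketch of the kind of global substitutions used (the exact patterns shown here are assumptions, not the original ones):

# assumption: illustrative substitutions only
result.blogs.train <- iconv(result.blogs.train, "UTF-8", "ASCII", sub = " ")  # drop unusable (non-ASCII) characters
result.blogs.train <- gsub("[^a-zA-Z ]", " ", result.blogs.train)             # replace anything but letters and spaces
result.blogs.train <- gsub("\\s+", " ", result.blogs.train)                   # collapse repeated whitespace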

After the substitutions the tm library was used for further processing, so that the dataset becomes a more generic set. Besides the clean-up the words are also simplified and stemmed, after which unnecessary spaces are removed (a sketch of the stemming step is shown below). The same process was used for the files related to ‘news’ and ‘twitter’.
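
A minimal sketch of the stemming step as described above, assuming the cleaned samples have been loaded into a tm corpus named docs (the full corpus construction is shown in the next section):

library(tm)
library(SnowballC)

docs <- tm_map(docs, stemDocument)     # reduce words to their stems, e.g. "running" -> "run"
docs <- tm_map(docs, stripWhitespace)  # remove unnecessary spaces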

Word frequencies

Using SnowballC we build a document-term matrix to get a better idea of the documents' contents. Note that the removal of ‘bad words’ is based on a list compiled from several websites, such as Google. This will impact the word prediction, which may be positive for some people and negative for others; it's a bit like censorship (we leave that for another discussion).

library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(readr)
library(SnowballC)

setwd("final/en_US")
cname<-getwd()
docs <- Corpus(DirSource(cname, pattern="train", mode="text"))#, encoding="latin1"))
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english")) 
badwords <- read_lines("../bad-words.txt") 
docs <- tm_map(docs, removeWords, badwords)
docs <- tm_map(docs, removeWords, badwords)

docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument) 

dtm <- DocumentTermMatrix(docs)   

So which words occur at least 750 times in the subset of the data?

findFreqTerms(dtm, lowfreq=750)
##  [1] "also"      "back"      "best"      "can"       "come"     
##  [6] "day"       "don"       "even"      "first"     "get"      
## [11] "going"     "good"      "got"       "great"     "home"     
## [16] "just"      "know"      "last"      "life"      "like"     
## [21] "little"    "love"      "made"      "make"      "many"     
## [26] "may"       "much"      "need"      "never"     "new"      
## [31] "next"      "night"     "now"       "one"       "people"   
## [36] "really"    "right"     "said"      "say"       "see"      
## [41] "something" "still"     "take"      "thanks"    "think"    
## [46] "time"      "today"     "two"       "want"      "way"      
## [51] "week"      "well"      "will"      "work"      "year"     
## [56] "years"

Note that this threshold was determined iteratively. When we sort the words we see that verbs like ‘can’ or ‘will’ are used often.

termFreq <- colSums(as.matrix(dtm))                          # total frequency of each term
tf  <- data.frame(term = names(termFreq), freq = termFreq)
stf <- subset(tf, freq > 750)                                # keep only the frequent terms
stf <- stf[order(-stf$freq), ]                               # sort, most frequent first

# top 40
p2 <- ggplot(stf[1:40, ], aes(x = reorder(term, -freq), y = freq))
p2 <- p2 + geom_bar(stat = "identity") + ggtitle("Top 40 most frequent words found in texts") 
p2 <- p2 + theme(axis.text.x = element_text(angle = 45, hjust = 1))   
p2

Just using single-word frequencies will not do a lot for word prediction; we would always suggest ‘can’ since it is used most. We need some kind of relation between several words, so let's inspect combinations of words, the so-called N-grams.

Looking for 2-grams and 3-grams.

First we build the N-gram word lists to see which word pairs (and triplets) are used most.

library(tm)
library(RWeka)
# tokenizers that split the text into 2-word and 3-word sequences
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdmn2 <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))   # 2-gram counts
tdmn3 <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer))  # 3-gram counts

Using the matrices we built, we can have a peek at the most-used combinations.

findFreqTerms(tdmn2, lowfreq = 200)
##  [1] "can t"     "couldn t"  "didn t"    "doesn t"   "don t"    
##  [6] "haven t"   "isn t"     "last year" "let s"     "new york" 
## [11] "p m"       "right now" "t know"    "u s"       "wasn t"   
## [16] "won t"     "year old"
findFreqTerms(tdmn3, lowfreq = 25)
##  [1] "can t believe" "can t get"     "can t wait"    "didn t get"   
##  [5] "didn t know"   "didn t want"   "doesn t mean"  "don t care"   
##  [9] "don t even"    "don t feel"    "don t forget"  "don t get"    
## [13] "don t know"    "don t like"    "don t need"    "don t see"    
## [17] "don t think"   "don t want"    "just can t"    "just don t"   
## [21] "let s go"      "m p m"         "mother s day"  "new york city"
## [25] "people don t"  "t wait see"

So it seems a lot of pairs are the result of pruning the apostrophe from words like “don’t”, resulting in “don t”. The same effect shows up in the 3-gram word sets. It doesn't look particularly useful, but for now we leave this as it is.
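
A small illustration of how this happens, assuming the clean-up replaced punctuation with spaces (as in the substitution sketch earlier):

gsub("[^a-zA-Z ]", " ", "don't")   # yields "don t", which later tokenizes as the pair "don" "t"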

Next step

The next step is to determine how to use the resulting n-grams properly, enhancing the model and testing it against a test set.

Prediction

Using the n-grams, words can be predicted: we need a lookup table so that we can search for a word (or words) and find, via the 2-, 3- and 4-gram frequencies, the probability of the next word. The combination with the highest probability will be shown first. A sketch of such a lookup is given below.
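
A minimal sketch of such a lookup table built from the 2-gram matrix; the helper name predict_next and the use of raw frequency counts as a stand-in for probabilities are assumptions:

# assumption: bigram counts derived from the tdmn2 matrix built earlier
bigramFreq <- rowSums(as.matrix(tdmn2))                       # total count per 2-gram
parts      <- strsplit(names(bigramFreq), " ")
lookup     <- data.frame(first  = sapply(parts, `[`, 1),
                         second = sapply(parts, `[`, 2),
                         freq   = bigramFreq,
                         stringsAsFactors = FALSE)

# hypothetical helper: return candidate next words, most frequent first
predict_next <- function(word, n = 3) {
  cand <- lookup[lookup$first == word, ]
  head(cand[order(-cand$freq), "second"], n)
}

predict_next("last")    # e.g. expected to suggest "year", given the 2-gram output above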

Unseen n-grams

So far, handling unseen n-grams seems a bit difficult: one could simply fall back to the most often used word, but this prediction can be expected to fail often.
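
A minimal sketch of that simple fallback, reusing the hypothetical predict_next helper from the previous section and the single-word frequencies (termFreq) computed earlier:

# fall back to the overall most frequent word when the 2-gram lookup finds nothing
predict_with_fallback <- function(word) {
  cand <- predict_next(word, n = 1)
  if (length(cand) == 0) {
    names(termFreq)[which.max(termFreq)]   # most used single word, e.g. "can"
  } else {
    cand
  }
}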