0 Introduction

Here I report on my first steps towards building a Shiny application that predicts the next word. The full analysis is available in my GitHub repo; this report briefly sketches what I have done there.

In this capstone I have applied data science in the area of natural language processing. As a first step toward working on this project, I have familiarized myself with Natural Language Processing, Text Mining, and the associated tools in R. Two sources were of particular importance for me:

This report covers the three tasks I have completed so far, followed by my plans for future work:

  1. Getting and cleaning text data
  2. Exploratory data analysis
  3. Building basic n-gram model
  4. Future plans

1 Getting and cleaning data

The goal of this task is to get familiar with the data and do the necessary cleaning. The data is from a corpus called HC Corpora (www.corpora.heliohost.org).

As a first step I created a directory to store data:

if (!dir.exists('data')) {
    dir.create('data')
}

After that I downloaded the data:

# 'link' must hold the URL of the capstone dataset archive (not shown in this report)
download.file(url = link,destfile = 'data/capstonedataset.zip')
unzip('data/capstonedataset.zip',exdir = 'data')

As a result I got four large datasets: de_DE, en_US, fi_FI and ru_RU. For this project I use the English dataset en_US. It contains three sources of text: blogs, tweets and news. The entire dataset is relatively large (~500 MB), and there is no need to load and use all of it. A relatively small random sample of lines is often enough to get an accurate approximation of the results that would be obtained from all the data.

For the purpose of this project I used a smaller subset of the data, roughly 10% of the entire dataset. For example, the following code snippet loads blogs data:

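set.seed(1234)  # arbitrary seed so that the sample is reproducible
blogs <- readLines('data/final/en_US/en_US.blogs.txt',skipNul = TRUE)
# draw 10% of the line indices uniformly at random, without replacement
indx  <- sample(length(blogs),size = floor(length(blogs)*0.1))
blogs <- blogs[indx]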

The same procedure was repeated for news and tweets.

The sampled data was stored on the hard disk:

dir.create('data/samples',showWarnings = FALSE)  # make sure the target folder exists
writeLines(blogs,'data/samples/blogs.txt')
writeLines(tweets,'data/samples/tweets.txt')
writeLines(news,'data/samples/news.txt')

The next step is to clean the data. For this purpose I used the R text mining framework tm:

library(tm)

To facilitate the analysis, I created a corpus from the sampled data:

data_corpus <- Corpus(DirSource('data/samples/'), readerControl = list(language='en-US'))
summary(data_corpus)
##            Length Class             Mode
## blogs.txt  2      PlainTextDocument list
## news.txt   2      PlainTextDocument list
## tweets.txt 2      PlainTextDocument list

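Now we can start cleaning the data in the corpus using tm functionality. The code below removes URLs, hashtags and user mentions, numbers and punctuation, collapses extra whitespace and converts the text to lower case. A copy of the corpus with stop words retained is saved as data_corpus0 before English stop words are removed from data_corpus: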

pattern = '(http|ftp|https)://[[:alnum:][:punct:]]*'
removeURL <- function(x) gsub(pattern,"",x)
removeTags <- function(x) gsub("(#|@)\\S+","",x)

data_corpus <- tm_map(data_corpus,content_transformer(removeURL))
data_corpus <- tm_map(data_corpus,content_transformer(removeTags))
data_corpus <- tm_map(data_corpus,removeNumbers)
data_corpus <- tm_map(data_corpus,removePunctuation)
data_corpus <- tm_map(data_corpus,stripWhitespace)
data_corpus <- tm_map(data_corpus,content_transformer(tolower))
data_corpus0 <- data_corpus
data_corpus <- tm_map(data_corpus,removeWords,stopwords("english"))
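
To see what these transformations do to a single entry, here is a minimal sketch applied to one made-up string. The removeURL and removeTags helpers are copied from the code above; the remaining steps use base R regular expressions as stand-ins for the tm functions (an assumption, so results may differ from tm in edge cases), and the sample text is hypothetical.

```r
# Helpers copied from the report
pattern    <- '(http|ftp|https)://[[:alnum:][:punct:]]*'
removeURL  <- function(x) gsub(pattern, "", x)
removeTags <- function(x) gsub("(#|@)\\S+", "", x)

# Base R stand-ins for removeNumbers, removePunctuation and stripWhitespace
# (assumed to be close to, but not identical with, the tm versions)
dropNumbers <- function(x) gsub("[[:digit:]]+", "", x)
dropPunct   <- function(x) gsub("[[:punct:]]+", "", x)
squeeze     <- function(x) gsub("\\s+", " ", x)

# Hypothetical sample entry
raw <- "Check THIS out: http://t.co/abc123 #fun @bob 2 cool!"

# Apply the steps in the same order as in the report
clean <- tolower(squeeze(dropPunct(dropNumbers(removeTags(removeURL(raw))))))
print(clean)
```

Note that URLs and tags have to be removed before punctuation, otherwise the characters that identify them ("://", "#", "@") would already be gone.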

It is also important to perform profanity filtering by removing ‘bad’ words we do not want to predict. I used the Offensive/Profane Word List from Luis von Ahn’s research group:

link = 'https://www.cs.cmu.edu/~biglou/resources/bad-words.txt'
download.file(url = link,destfile = 'data/bad-words.txt')
bad_words <- readLines('data/bad-words.txt')
bad_words <- bad_words[-1]  # drop the first line of the file
data_corpus  <- tm_map(data_corpus,removeWords,bad_words)
data_corpus0 <- tm_map(data_corpus0,removeWords,bad_words)

2 Exploratory data analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text.

Interesting facts about the corpus:

I created a term document matrix for each of the three sources in the dataset:

tdm.blogs  <- TermDocumentMatrix(data_corpus[1])
tdm.news   <- TermDocumentMatrix(data_corpus[2])
tdm.tweets <- TermDocumentMatrix(data_corpus[3])

With these objects at my disposal, it is easy to perform some interesting investigations. Since the data is only a sample, it does not make sense to count lines and words. Instead, I focused on the content of the dataset. In particular, I was interested in the fraction of unique words and the average number of words per line. For the former, I created a vocabulary of all unique words across the dataset as well as a separate vocabulary for each of the three sources, and calculated the three fractions. For the latter, I tokenized each source and simply counted the tokens in each entry (which I call a ‘line’). The results are shown below:

It is interesting that tweets are much shorter and less sophisticated, in terms of unique words used, than the other two sources. Blogs and news appear to be similar.
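
The two statistics discussed above can be computed along the following lines. This is a minimal sketch in base R on made-up data, not the code from the full analysis: the whitespace tokenizer is a simplification, and the "fraction of unique words" is assumed here to mean the size of a source's vocabulary divided by the size of the vocabulary of the whole dataset.

```r
# Toy stand-ins for the sampled sources (hypothetical data, not the real corpus)
sources <- list(
  blogs  = c("the cat sat on the mat", "a dog barked at the cat"),
  tweets = c("so tired", "love this song so much")
)

# Split each line into lowercase word tokens on whitespace
tokenize <- function(lines) strsplit(tolower(lines), "\\s+")

# Vocabulary of the whole dataset
all_vocab <- unique(unlist(lapply(sources, tokenize)))

# Per-source statistics: share of the overall vocabulary and mean line length
stats <- sapply(sources, function(lines) {
  tokens <- tokenize(lines)
  c(unique_fraction = length(unique(unlist(tokens))) / length(all_vocab),
    words_per_line  = mean(lengths(tokens)))
})
print(stats)
```

The result is a small matrix with one column per source, which is convenient for printing as a table or passing to a bar plot.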

Some words are more frequent than others. The distributions of word frequencies are shown below:

The frequency distribution of three-grams:

To proceed I combined all sources into one:

tdm.combined <- c(tdm.blogs,tdm.news,tdm.tweets)

We do not need to use all the words in the resulting term-document matrix: a fraction of the unique words covers most of the word occurrences in the text. The following figure shows the coverage versus the number of unique words (sorted by their frequencies):

To cover 99% of all word occurrences it is enough to use 14623 words out of the total of 20584. The difference is the number of words that are rarely used.
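
The coverage calculation itself is short. The sketch below uses a made-up frequency vector; in the actual analysis the counts would come from the combined term-document matrix (for example via row sums), and the helper name words_needed is introduced here only for illustration.

```r
# Hypothetical word counts (not taken from the real corpus)
freq <- c(the = 50, of = 20, cat = 15, mat = 10, zebra = 4, quokka = 1)

# Sort by decreasing frequency and accumulate the share of all occurrences
freq_sorted <- sort(freq, decreasing = TRUE)
coverage    <- cumsum(freq_sorted) / sum(freq_sorted)

# Smallest number of top-ranked words whose cumulative share reaches the target
words_needed <- function(coverage, target) unname(which(coverage >= target)[1])

print(words_needed(coverage, 0.90))
```

Plotting coverage against seq_along(coverage) gives a curve of the kind described above.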

3 Building basic n-gram model

I have built a simple n-gram model, which does the following:

It works like this:

predict_next_word('I love')
## [1] "i love you"  "i love the"  "i love your"
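
To illustrate the general idea, here is a minimal base R sketch of a trigram lookup. It is not the predict_next_word implementation from the full analysis: the training text is made up, the function names ngrams and predict_sketch are hypothetical, and the model simply returns the most frequent trigrams that start with the last two words of the input.

```r
# Hypothetical toy training text (not the report's corpus)
train <- c("i love you", "i love the sea", "i love you so much",
           "i love your smile", "we love the sea")

# Build all n-grams of order n from one line of text
ngrams <- function(line, n) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Trigram frequency table, sorted from most to least frequent
trigram_counts <- sort(table(unlist(lapply(train, ngrams, n = 3))),
                       decreasing = TRUE)

# Return up to k of the most frequent trigrams that start with the
# last two words of the input phrase
predict_sketch <- function(phrase, k = 3) {
  words  <- strsplit(tolower(phrase), "\\s+")[[1]]
  prefix <- paste0(paste(tail(words, 2), collapse = " "), " ")
  hits   <- trigram_counts[startsWith(names(trigram_counts), prefix)]
  head(names(hits), k)
}

print(predict_sketch("I love"))
```

A fuller model would also need a fallback to bigrams or single words when no trigram matches the input.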

4 Future plans