Aim of the report

The aim of this report is to provide a text-mining analysis of sample texts collected from the internet by a web crawler, as a first stage in the development of a next-word prediction system.

Loading required packages

To carry out the tasks of the project (data acquisition, preprocessing and exploratory analysis prior to building a predictive model), the following R packages were used:

## Loading required package: knitr
## Loading required package: tm
## Loading required package: NLP
## Loading required package: SnowballC
## Loading required package: RWeka
## Loading required package: stringi
## Loading required package: wordcloud
## Loading required package: RColorBrewer
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate

Data acquisition and cleaning

Capstone Dataset

Raw data files were obtained from the Coursera site as a single archive (Coursera-SwiftKey.zip) and unzipped with the unzip function.
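
For reproducibility, a minimal sketch of this step is shown below; the archive URL is the one commonly used for the capstone dataset and is an assumption here, as is the destination file name.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"  # assumed URL
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")  # download the archive once
}
unzip("Coursera-SwiftKey.zip")  # extract the text files into the working directory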

After unzipping the Coursera-SwiftKey.zip file, a set of text files was obtained. These texts were collected in English, German, French and Russian from publicly available sources (Twitter, newspapers and personal blogs) by a web crawler (see HC Corpora). Only the English documents were explored for this project.

Basic summary of the data

File name            File size (in Mb)   Number of lines   Number of words
en_US.twitter.txt             159.3641           2360148          30451128
en_US.news.txt                196.2775             77259           2651432
en_US.blogs.txt               200.4242            899288          37570839
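
The figures in the table can be reproduced along the following lines; this is a sketch only, and the helper function summarise_file as well as the use of stri_count_words from the stringi package are assumptions, not the exact code of the report.

# Hypothetical helper: size, line count and word count for one file
summarise_file <- function(path) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    data.frame(file    = basename(path),
               size_mb = file.info(path)$size / 1024^2,
               lines   = length(lines),
               words   = sum(stri_count_words(lines)))
}
do.call(rbind, lapply(c("en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt"),
                      summarise_file))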

Data sampling

As we can see from the table above, the size of the data is huge. Data samples were therefore generated (using the sample function), each containing only 1% of the lines of the corresponding original text file.

Samples were stored as text files (blog.sample.txt, news.sample.txt and twitter.sample.txt).
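
A minimal sketch of the sampling step is given below, assuming the 1% fraction mentioned above; the seed and the helper name sample_lines are assumptions added for illustration.

set.seed(1234)                                   # assumed seed, for reproducibility only
sample_lines <- function(infile, outfile, fraction = 0.01) {
    lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
    writeLines(sample(lines, round(length(lines) * fraction)), outfile)  # keep ~1% of lines
}
sample_lines("en_US.blogs.txt",   "blog.sample.txt")
sample_lines("en_US.news.txt",    "news.sample.txt")
sample_lines("en_US.twitter.txt", "twitter.sample.txt")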

A corpus of the sample data was created with the Corpus and DirSource functions of the tm package.
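
A sketch of that call, assuming the three sample files are kept in a directory named sample (the directory name is an assumption):

en.corpus <- Corpus(DirSource("sample", encoding = "UTF-8"),      # one document per sample file
                    readerControl = list(language = "en"))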

Summary of the corpus

                     Length   Class               Mode
blog.sample.txt      2        PlainTextDocument   list
news.sample.txt      2        PlainTextDocument   list
twitter.sample.txt   2        PlainTextDocument   list

Data cleaning

The following were removed from the created corpus:

  • numbers;
  • “bad” words (the list of these words was obtained here);
  • common stopwords;
  • punctuation;
  • excess whitespaces.

The content of the corpus was transformed to lower case.

en.corpus <- tm_map(en.corpus, content_transformer(tolower))     # lower-case first so stopword matching works
en.corpus <- tm_map(en.corpus, removeNumbers)
badwords  <- readLines('bad_words.txt')                          # removeWords expects a plain character vector
en.corpus <- tm_map(en.corpus, removeWords, badwords)
en.corpus <- tm_map(en.corpus, removeWords, stopwords('english'))
en.corpus <- tm_map(en.corpus, removePunctuation)
en.corpus <- tm_map(en.corpus, stripWhitespace)                  # collapse whitespace left by the removals
en.corpus <- tm_map(en.corpus, stemDocument)                     # Snowball stemming via tm's PlainTextDocument method

Term Document Matrix

A term-document matrix (TDM) is a matrix describing the frequency of terms occurring in a collection of documents. For this report it was created using the TermDocumentMatrix function from the tm package.

As a first look at the data, a word-cloud visualization of the TDM was produced from the word frequencies using the wordcloud package.
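
A sketch of these two steps is given below; the 100-word limit and the colour palette are assumptions for illustration.

tdm  <- TermDocumentMatrix(en.corpus)                      # terms x documents counts
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # total frequency of each term
wordcloud(names(freq), freq, max.words = 100,
          colors = brewer.pal(8, "Dark2"))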

## <<TermDocumentMatrix (terms: 2, documents: 2)>>
## Non-/sparse entries: 0/4
## Sparsity           : 100%
## Maximal term length: 17
## Weighting          : term frequency (tf)
## 
##                    Docs
## Terms               blog.sample.txt news.sample.txt
##   aaaaaaaaaaaaaaaay               0               0
##   aaaahhh                         0               0

Exploratory analysis of the data

N-Gram analysis

To predict the next likely word, the sentences were broken into n-grams (uni-, bi- and trigrams) using the NGramTokenizer function from the RWeka package.
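
A sketch of the bigram case is shown below; the trigram case only changes min and max to 3, and the tokenizer name is an assumption.

bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm <- TermDocumentMatrix(en.corpus,
                                 control = list(tokenize = bigram_tokenizer))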

Top 10 n-grams visualization

After unigram, bigram and trigram tokenization had been carried out for the exploratory analysis of the data, the frequencies of occurrence of the n-grams were plotted.
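
A minimal sketch of how each top-10 plot can be produced, shown for the bigram matrix from the sketch above; the plot aesthetics are assumptions.

bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
top10 <- data.frame(ngram = names(bigram_freq)[1:10],
                    count = unname(bigram_freq[1:10]))
ggplot(top10, aes(x = reorder(ngram, count), y = count)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(x = "Bigram", y = "Frequency", title = "10 most common bigrams")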

10 most common unigrams

10 most common bigrams

10 most common trigrams

Next Step

The next step is to develop and train a Markov model for predicting the next word using bi- and trigrams and thereafter to develop a Shiny application.