to provide an exploratory text-mining analysis of sample texts collected from the internet by a web crawler, as a first stage in developing a next-word prediction system.
To carry out the project tasks (data acquisition, preprocessing, and exploratory analysis prior to building a predictive model), the following set of R packages was used:
## Loading required package: knitr
## Loading required package: tm
## Loading required package: NLP
## Loading required package: SnowballC
## Loading required package: RWeka
## Loading required package: stringi
## Loading required package: wordcloud
## Loading required package: RColorBrewer
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
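The loading messages above suggest the packages were attached with require(); a minimal sketch of the corresponding calls (the exact code chunk is not shown in the report):

require(knitr)       # report generation
require(tm)          # text-mining framework (attaches NLP)
require(SnowballC)   # word stemming
require(RWeka)       # n-gram tokenization
require(stringi)     # fast string and word counting
require(wordcloud)   # word-cloud plots (attaches RColorBrewer)
require(ggplot2)     # frequency plots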
Raw data files were obtained from the Coursera site as a single archive, using the download link provided there, and unzipped with the unzip function.
After unzipping the Coursera-SwiftKey.zip file, a set of text files was obtained. These texts were collected by a web crawler from publicly available sources (Twitter, newspapers and personal blogs) in English, German, French and Russian (see HC Corpora). For this project we explored the English documents only.
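A sketch of the acquisition step; data.url is a placeholder for the Coursera download link, not a value taken from the report:

data.url <- '...'                                            # placeholder for the download link
download.file(data.url, destfile = 'Coursera-SwiftKey.zip', mode = 'wb')
unzip('Coursera-SwiftKey.zip')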
Basic summary of the data
| File name | File size (in Mb) | Number of lines | Number of words |
|---|---|---|---|
| en_US.twitter.txt | 159.3641 | 2360148 | 30451128 |
| en_US.news.txt | 196.2775 | 77259 | 2651432 |
| en_US.blogs.txt | 200.4242 | 899288 | 37570839 |
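The figures in the table can be reproduced roughly as follows; the file paths are assumed to match the layout of the unzipped archive:

en.files <- c('final/en_US/en_US.twitter.txt',
              'final/en_US/en_US.news.txt',
              'final/en_US/en_US.blogs.txt')
for (f in en.files) {
    txt <- readLines(f, encoding = 'UTF-8', skipNul = TRUE)
    cat(f,
        round(file.info(f)$size / 1024^2, 4), 'Mb,',         # file size in Mb
        length(txt), 'lines,',                               # number of lines
        sum(stri_count_words(txt)), 'words\n')               # number of words
}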
As the table above shows, the files are very large. Therefore data samples were generated (using the sample function) that include only 1% of the lines of the original text files.
Samples were stored as text files (blog.sample.txt, news.sample.txt and twitter.sample.txt).
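A sketch of the sampling step for one file (the seed is an assumption added for reproducibility; the other two files are handled the same way):

set.seed(1234)                                               # assumed seed, not from the report
blog.lines  <- readLines('final/en_US/en_US.blogs.txt', encoding = 'UTF-8', skipNul = TRUE)
blog.sample <- sample(blog.lines, round(length(blog.lines) * 0.01))
writeLines(blog.sample, 'blog.sample.txt')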
A corpus of the sample data was created with the Corpus and DirSource functions of the tm package.
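A minimal sketch, assuming the three sample files were placed in a directory named 'samples':

en.corpus <- Corpus(DirSource('samples', pattern = 'sample'),
                    readerControl = list(language = 'en'))
summary(en.corpus)                                           # produces the summary table below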
Summary of the corpus
| | Length | Class | Mode |
|---|---|---|---|
| blog.sample.txt | 2 | PlainTextDocument | list |
| news.sample.txt | 2 | PlainTextDocument | list |
| twitter.sample.txt | 2 | PlainTextDocument | list |
The following transformations were then applied to the corpus: the content was converted to lower case; numbers, punctuation, profanity (read from a bad-words list) and English stop words were removed; extra whitespace was stripped; and the remaining words were stemmed.
# Convert to lower case first so that stop-word and profanity removal match all forms
en.corpus <- tm_map(en.corpus, content_transformer(tolower))
en.corpus <- tm_map(en.corpus, removeNumbers)
en.corpus <- tm_map(en.corpus, removePunctuation)
# Remove profanity listed in bad_words.txt and standard English stop words
badwords <- readLines('bad_words.txt')
en.corpus <- tm_map(en.corpus, removeWords, badwords)
en.corpus <- tm_map(en.corpus, removeWords, stopwords('english'))
# Stem the remaining words and collapse the extra whitespace left by the removals
en.corpus <- tm_map(en.corpus, stemDocument)
en.corpus <- tm_map(en.corpus, stripWhitespace)
A term-document matrix (TDM) is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. For this report it was created using the TermDocumentMatrix function from the tm package.
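A minimal sketch of this step; the small slice inspected here corresponds to the output shown further below:

en.tdm <- TermDocumentMatrix(en.corpus)
inspect(en.tdm[1:2, 1:2])                                    # first two terms in the first two documents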
As a first look at the data, word frequencies from the TDM were visualized as a word cloud using the wordcloud package.
## <<TermDocumentMatrix (terms: 2, documents: 2)>>
## Non-/sparse entries: 0/4
## Sparsity : 100%
## Maximal term length: 17
## Weighting : term frequency (tf)
##
## Docs
## Terms blog.sample.txt news.sample.txt
## aaaaaaaaaaaaaaaay 0 0
## aaaahhh 0 0
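A sketch of the word-cloud step, assuming the word frequencies are taken from the row sums of the TDM:

term.freq <- sort(rowSums(as.matrix(en.tdm)), decreasing = TRUE)
wordcloud(names(term.freq), term.freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, 'Dark2'))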
To predict the next likely word, the sentences were broken into n-grams (uni-, bi- and trigrams) using the NGramTokenizer function from the RWeka package.
After carrying out the unigram, bigram and trigram tokenization, the frequencies of occurrence of the most common n-grams were plotted for the exploratory analysis.
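A sketch of the tokenization and frequency counting for bigrams (unigrams and trigrams are handled analogously; the exact plotting code is not shown in the report):

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram.tdm  <- TermDocumentMatrix(en.corpus, control = list(tokenize = BigramTokenizer))
bigram.freq <- sort(rowSums(as.matrix(bigram.tdm)), decreasing = TRUE)
top10 <- data.frame(ngram = names(head(bigram.freq, 10)),
                    freq  = head(bigram.freq, 10))
ggplot(top10, aes(x = reorder(ngram, freq), y = freq)) +     # 10 most common bigrams
    geom_bar(stat = 'identity') + coord_flip() +
    labs(x = 'Bigram', y = 'Frequency')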
10 most common unigrams
10 most common bigrams
10 most common trigrams
The next step is to develop and train a Markov model for predicting the next word using bi- and trigrams and thereafter to develop a Shiny application.