Synopsis

Information technology not only drives our workday but also shapes our personal lives. For many people, it is difficult to imagine functioning without e-mail or cell phones. Nonetheless, typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.
The goal of this report is to demonstrate that we have become comfortable working with the data and that we are on track to create our prediction algorithm.

Data acquisition

I loaded all three datasets using the readLines function, specifying UTF-8 encoding to avoid problems with special characters. We can see in the table below that the blogs dataset has the largest size (210160014 bytes), the fewest lines (899288), but the largest number of words (206824505).

##                        Size   Lines     Words
## en_US.blogs.txt   210160014  899288 206824505
## en_US.news.txt    205811888 1010242 203223156
## en_US.twitter.txt 167105339 2360148 162385043
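
A rough sketch of this loading and summary step (the exact code is not echoed in this report; object and helper names are illustrative):

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# file size in bytes, number of lines, and a whitespace-based word count
summarize <- function(path, lines) {
  c(Size  = file.size(path),
    Lines = length(lines),
    Words = sum(lengths(strsplit(lines, "\\s+"))))
}
rbind(en_US.blogs.txt   = summarize("en_US.blogs.txt",   blogs),
      en_US.news.txt    = summarize("en_US.news.txt",    news),
      en_US.twitter.txt = summarize("en_US.twitter.txt", twitter))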

I read a random subset of the original data using the sample function and saved the samples as an RData file and as text files in the working directory.

##                Size Lines   Words
## blogs.txt   2114082  8992 2076120
## news.txt    2078384 10102 2029562
## twitter.txt 1721720 23601 1624479
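
The sampling step can be sketched as follows; the seed and the 1% rate are assumptions chosen to match the line counts above:

set.seed(1234)   # seed value is an assumption
sample_lines <- function(x, rate = 0.01) x[sample(length(x), floor(length(x) * rate))]

blogs_s   <- sample_lines(blogs)
news_s    <- sample_lines(news)
twitter_s <- sample_lines(twitter)

writeLines(blogs_s,   "blogs.txt")
writeLines(news_s,    "news.txt")
writeLines(twitter_s, "twitter.txt")
save(blogs_s, news_s, twitter_s, file = "sample.RData")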

Preprocessing

Punctuation marks look like extra words to the computer and to R. Rather than using the standard removePunctuation function, I created my own function because I do not want to remove apostrophes. I used the pattern [^[:alnum:][:space:]']|[^a-z]'|'$|' +?|^' to remove everything except apostrophes that sit between two characters. I chose not to remove apostrophes between words because they are very important to English language structure. This operation is performed on the text datasets.
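
Such a function can be sketched with gsub, applied to the sampled text vectors (the function name is mine):

# drop punctuation but keep apostrophes that sit between two characters
keep_inner_apostrophes <- function(x) {
  gsub("[^[:alnum:][:space:]']|[^a-z]'|'$|' +?|^'", " ", x)
}
blogs_s   <- keep_inner_apostrophes(blogs_s)
news_s    <- keep_inner_apostrophes(news_s)
twitter_s <- keep_inner_apostrophes(twitter_s)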

I combined the three text datasets into a corpus, a collection of the three documents from which I will perform my exploratory analysis. This entity is similar to a database holding text documents in a generic way.

##             Length Class             Mode
## blogs.txt   2      PlainTextDocument list
## news.txt    2      PlainTextDocument list
## twitter.txt 2      PlainTextDocument list
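
One way to build such a corpus with the tm package, assuming the three sampled files sit in the working directory as described above:

library(tm)
# each sampled file becomes one PlainTextDocument in the corpus
corpus <- VCorpus(DirSource(".", pattern = "^(blogs|news|twitter)\\.txt$",
                            encoding = "UTF-8"),
                  readerControl = list(language = "en"))
summary(corpus)   # produces the Length/Class/Mode table shown above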

Cleaning

I performed a basic cleaning process by removing numbers, capitalization, profanity and remaining punctuation, and by stripping white space.

I removed meaningless character sequences inside words, such as www and com, using the pattern www\\.?|\\.?com. I also removed accented characters, which are of no use for English, and replaced runs of a repeated useless character with a single character.

I want a word to appear exactly the same every time it occurs, so I converted everything to lowercase.

The regular expression [^a-z]'|'$|' +?|^' was used to remove the remaining irrelevant apostrophes found in the corpus datasets. The remaining irrelevant apostrophes are those that begin or end a sequence of characters.

The preprocessing above leaves the documents with a lot of “white space”, the result of all the leftover spaces from the words and characters that were removed. This white space is stripped using the stripWhitespace function.
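
The cleaning pipeline can be sketched with tm_map; the profanity word list is assumed to be a character vector loaded separately:

corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, profanity)   # profanity: assumed character vector of banned words
corpus <- tm_map(corpus, content_transformer(function(x) gsub("www\\.?|\\.?com", " ", x)))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^a-z]'|'$|' +?|^'", " ", x)))
corpus <- tm_map(corpus, stripWhitespace)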

Now that the text data has been pre-processed, text mining can be performed to understand word counts, line counts, term frequencies, and common n-grams in our sample data. I tokenized the dataset so that each word in a line of text is segmented into its own token. I then created a document-term matrix and its transpose using the DocumentTermMatrix and TermDocumentMatrix functions.

## <<TermDocumentMatrix (terms: 57652, documents: 3)>>
## Non-/sparse entries: 86389/86567
## Sparsity           : 50%
## Maximal term length: 110
## Weighting          : term frequency (tf)
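
The two matrices can be built directly from the cleaned corpus:

dtm <- DocumentTermMatrix(corpus)   # documents in rows, terms in columns
tdm <- TermDocumentMatrix(corpus)   # the transpose: terms in rows, documents in columns
tdm                                 # printing gives the summary shown above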

Exploratory data analysis

The first step in building a predictive model for text is understanding the distribution of and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the dataset and to prepare to build our first linguistic models. I started by ordering terms by their frequency and selected the 20 most frequent words.

##  said about  your  just  will  they   all  from   not   are   but  have 
##  2920  2978  3027  3107  3129  3137  3325  3770  3964  4859  4880  5246 
##  this   was  with   you  that   for   and   the 
##  5592  6328  7155  9212 10261 10961 23992 47948
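
These counts come from summing each term over the three documents and sorting, roughly:

freq <- sort(colSums(as.matrix(dtm)))   # total count of each term across the three documents
tail(freq, 20)                          # the 20 most frequent terms, in increasing order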

I created an output with two rows of numbers. The top row is a frequency value and the bottom row is the number of terms that appear with that frequency. Here, considering only the 20 lowest frequencies, we can see that 31946 terms appear only once. There are also many others that appear very infrequently.

## freq
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 31946  7398  3589  2286  1601  1170   955   673   555   534   427   399 
##    13    14    15    16    17    18    19    20 
##   374   311   255   220   221   207   194   164
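
This frequency-of-frequencies table is simply a table of the term counts, roughly:

head(table(freq), 20)   # how many terms occur once, twice, ..., up to 20 times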

An alternative view of term frequency is to identify all terms that appear frequently, in this case two thousand (2000) or more times. I identified all terms that appear 2000 or more times and printed them.

##  [1] "about" "all"   "and"   "are"   "but"   "can"   "for"   "from" 
##  [9] "get"   "had"   "has"   "have"  "her"   "his"   "just"  "like" 
## [17] "more"  "new"   "not"   "one"   "out"   "said"  "some"  "that" 
## [25] "the"   "their" "there" "they"  "this"  "time"  "was"   "what" 
## [33] "when"  "who"   "will"  "with"  "would" "you"   "your"
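
With tm this can be done directly with findFreqTerms:

findFreqTerms(dtm, lowfreq = 2000)   # all terms that occur 2000 or more times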

We can also reshape the matrix by specifying the words as the id variable and blogs, news and twitter as the measure variables. I made a panel plot to look at the separate samples together: I split the three samples into a panel of three separate plots, giving three pointrange plots of the top twenty (20) words versus frequency.
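
A sketch of the reshaping and the panel plot, assuming the reshape2 and ggplot2 packages (object names are illustrative):

library(reshape2)
library(ggplot2)

m <- as.matrix(tdm)
top20 <- m[order(rowSums(m), decreasing = TRUE)[1:20], ]
df <- melt(data.frame(word = rownames(top20), top20, check.names = FALSE),
           id.vars = "word", variable.name = "source", value.name = "freq")

ggplot(df, aes(x = reorder(word, freq), y = freq)) +
  geom_pointrange(aes(ymin = 0, ymax = freq)) +
  coord_flip() +
  facet_wrap(~ source)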

We can also represent the top forty (40) words as word clouds, showing their relative sizes, split out by sample source. The top forty (40) words by frequency for the Blogs, News and Twitter samples are shown in the following clouds, respectively. Note: the relative size of a word indicates how often the term occurs in the document relative to the others. We can see that the, and, you, for, and that are the most frequent words.
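
A sketch of one cloud per source with the wordcloud package, reusing the term-document matrix built earlier:

library(wordcloud)
m <- as.matrix(tdm)
for (src in colnames(m)) {   # one cloud per sample source
  wordcloud(words = rownames(m), freq = m[, src],
            max.words = 40, random.order = FALSE)
}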

We can cluster similar terms to understand how words are grouped together. This type of exploration is more important than exploring standalone words, because the purpose of this capstone project is to predict the next word based on the previous words. We can use the N-gram approach, a very powerful technique for understanding clustered words. An N-gram is simply a contiguous sequence of words, and these n-grams are the foundation for building predictive algorithms. I performed this operation for the twenty (20) most frequent words across all three samples.
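
A sketch of bigram and trigram term-document matrices; the RWeka tokenizer is one common choice and is an assumption here:

library(RWeka)
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

head(sort(rowSums(as.matrix(tdm2)), decreasing = TRUE), 20)   # most frequent bigrams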

Interesting findings

An interesting finding is the long run-together words found in the dataset. English noun compounds are generally segmented, but in this dataset surprising compounds appear because of typing errors or laziness. For example, replacebooktitleswithbacon should be replace book titles with bacon. I will consider compound splitting using a maximum matching algorithm in order to do more cleaning and to improve our prediction model.
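
A minimal sketch of greedy maximum matching against a word list (the dictionary contents and the function name are illustrative):

# greedily take the longest dictionary word that prefixes the remaining string
max_match <- function(compound, dictionary) {
  out <- character(0)
  while (nchar(compound) > 0) {
    found <- FALSE
    for (len in nchar(compound):1) {
      prefix <- substr(compound, 1, len)
      if (prefix %in% dictionary) {
        out <- c(out, prefix)
        compound <- substring(compound, len + 1)
        found <- TRUE
        break
      }
    }
    if (!found) {   # no dictionary word starts here: keep the character and move on
      out <- c(out, substr(compound, 1, 1))
      compound <- substring(compound, 2)
    }
  }
  paste(out, collapse = " ")
}

max_match("replacebooktitleswithbacon", c("replace", "book", "titles", "with", "bacon"))
# "replace book titles with bacon"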
Another interesting finding is misspelled words, which distort the word frequencies. To deal with this issue, I will create a dictionary of corrected words and use a pattern to replace them in the dataset.

Plans for creating a prediction algorithm and Shiny app

  1. Divide the processed corpus dataset into Training (60%), Cross Validation (20%) and Testing (20%) sets
  2. Create 1-gram, 2-gram, 3-gram, and 4-gram data sets
  3. Remove n-grams that occur only once after tokenization, for all but the 1-gram set
  4. Compute the Maximum Likelihood expected probability value
  5. Build a prediction algorithm which, given a text string, uses the n-grams to predict the next word based on Katz smoothing of the probabilities
  6. Use a back-off strategy from the highest-order n-gram down to the unigram, based on matching, when the Katz smoothed back-off does not apply (a minimal sketch of this back-off lookup follows this list)
  7. Re-preprocess the dataset based on the resulting predictions
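
A minimal sketch of the back-off lookup in step 6, using simple relative frequencies rather than full Katz smoothing; the n-gram tables are assumed to be data frames with prefix, word and count columns built in step 2:

# ngrams: a list of data frames, e.g. ngrams[[3]] holds trigrams as (prefix, word, count)
predict_next <- function(phrase, ngrams, max_n = 4) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  for (n in max_n:2) {                             # try the highest-order model first
    if (length(tokens) >= n - 1) {
      prefix <- paste(tail(tokens, n - 1), collapse = " ")
      hits <- ngrams[[n]][ngrams[[n]]$prefix == prefix, ]
      if (nrow(hits) > 0) {
        return(hits$word[which.max(hits$count)])   # most frequent continuation
      }
    }
  }
  ngrams[[1]]$word[which.max(ngrams[[1]]$count)]   # back off all the way to the top unigram
}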