Introduction

This project is about creating a Shiny application that makes text predictions. A smart keyboard makes it easier for people to type on their devices, and one cornerstone of such a keyboard is a predictive text model. To build one, we will use the HC Corpora dataset, perform an exploratory analysis, and implement a predictive model that suggests the next word.

 

This Milestone Report presents our understanding of the data, our exploratory analysis, and our plans for implementing a predictive algorithm.

 

Data

HC Corpora is a collection of corpora for various languages, freely available to download. The corpora have been collected from numerous webpages by a web crawler.
More details on the corpora can be found here: http://www.corpora.heliohost.org/aboutcorpus.html In this dataset, we are provided with 3 text files (.txt) available in 4 different languages: German, English, Finnish, Russian.
We will focus here only on the English files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Here is the link to download the dataset : https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Data characteristics

Once we have loaded the data into R, we can observe the following characteristics for each file:

Data      Size (MB)   Number of lines   Min words per item   Max words per item   Total word count
Blogs     200.42      899,288           1                    40,833               206,824,505
News      196.28      1,010,242         1                    11,384               203,223,159
Twitter   159.36      2,360,148         2                    140                  162,096,031
  • Twitter is a set of short entries (constrained by Twitter's 140-character limit) written in an informal manner.
  • News is a set of entries written in a formal manner but focused on specific topics.
  • Blogs falls somewhere between News and Twitter: there is less informal noise in the texts than in Twitter's entries, and a wider range of topics is discussed than in News' entries.

Data subsetting

Each of the 3 files discussed above is very large, which may cause problems in terms of computation time. To avoid these issues, we have subsetted the 3 files to obtain much smaller objects that remain representative of the original ones. Here are the characteristics of those subsets:

Data      Size (MB)   Number of lines   Min words per item   Max words per item   Total word count
Blogs     37.24       134,734           1                    37,191               30,990,754
News      37.35       151,267           2                    3,555                30,398,789
Twitter   45.58       354,075           3                    140                  24,321,281
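As a minimal sketch, such subsets could be drawn by random sampling, assuming the full files have already been read into character vectors named blogs, news, and twitter (hypothetical names; the sampling rate of about 15% is an assumption roughly consistent with the subset sizes above):

set.seed(1234)

# Keep roughly 15% of the lines of each file
sample_lines <- function(x, rate = 0.15) x[rbinom(length(x), 1, rate) == 1]

blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)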

Data Preprocessing

In order to properly analyse these datasets, we need to perform some transformations that "clean" the data: removing punctuation, removing numbers, converting everything to lowercase, stemming, and stripping extra whitespace.
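A minimal sketch of these transformations using the tm package, assuming the subsetted character vectors blogs_sample, news_sample, and twitter_sample from above:

library(tm)
library(SnowballC)  # stemming backend used by stemDocument

# Build a corpus from the subsetted text
corpus <- VCorpus(VectorSource(c(blogs_sample, news_sample, twitter_sample)))

# Apply the cleaning transformations described above
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)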

 

Let’s look at this example to see how those transformations perform:

Before Preprocessing

## [1] "Origin: Middle English: from Old French joie, based on Latin gaudium, from gaudere ‘rejoice’"

After Preprocessing

## [1] "origin middl english from old french joie base on latin gaudium from gauder rejoic"

 

Exploratory Analysis

Tokenization

Now that we have preprocessed the data, we can build a term-document matrix that will allow us to perform some exploratory analysis and observe word frequencies and n-gram frequencies.
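As a sketch, the term-document matrix and the overall word frequencies could be obtained as follows (assuming the corpus object built above):

# Term-document matrix from the preprocessed corpus
tdm <- TermDocumentMatrix(corpus)

# Overall word frequencies, sorted in decreasing order
# (for very large matrices, slam::row_sums(tdm) avoids densifying)
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(word_freq, 10)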

 

Word frequencies

As we can see in this graph, the words that occur most frequently in the 3 texts are mainly connecting words, as we could have expected; for example, the most frequent word here is “the”. This shows that word frequencies alone will not be enough for an efficient predictive model, and that we will have to consider other patterns, namely n-gram frequencies.

 

N-gram frequencies

An n-gram is a contiguous sequence of n items from a given sequence of text.
So in this analysis, we split all the strings into n-grams (n being the number of contiguous words) and look at their frequencies across the 3 text files.
For the purpose of this analysis, we look at 1-grams, 2-grams, and 3-grams.
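A sketch of how the n-gram frequencies could be computed, using RWeka's NGramTokenizer together with tm (shown here for 2-grams, assuming the corpus object built earlier):

library(RWeka)

# Tokenizer producing 2-grams; change min/max for 1-grams or 3-grams
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Term-document matrix of 2-grams
bigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# Most frequent 2-grams
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
head(bigram_freq, 10)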

Here are the n-gram characteristics of the subsetted corpus defined above:

  • 1-grams: 96,257 distinct 1-grams; max 1-gram frequency = 359,869
  • 2-grams: 1,337,085 distinct 2-grams; max 2-gram frequency = 36,175
  • 3-grams: 3,503,532 distinct 3-grams; max 3-gram frequency = 2,796

Looking at the graphs above, we can observe that the frequencies are very skewed: the most frequent n-grams (for n = 1, 2, or 3) occur much more often than the others. Furthermore, connecting words such as “the” or “and” have much higher frequencies than other words.

 

Other steps

Profanity filtering: there are some words that we don’t want to predict and that we don’t want to use to build our prediction model. Those words must be removed. However, we can’t just remove the words themselves; we have to remove the entire sentences in which they appear, because those sentences become unusable once such words are present.
To perform this task, we can find a list of profane words, for example at this link: https://gist.github.com/jamiew/1112488.
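A possible sketch of this filtering step, assuming profanity is a character vector loaded from a list such as the one linked above and corpus_lines holds the subsetted text entries (both names are hypothetical):

# Build a single regular expression matching any profane word as a whole word
profanity_pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")

# Drop every entry that contains at least one profane word
clean_lines <- corpus_lines[!grepl(profanity_pattern, corpus_lines, ignore.case = TRUE)]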

 

Covering 90% of word instances: in order to cover 90% of word instances in the subset defined above, we only need to consider the 4,044 most frequent words, out of 96,257 distinct words in the subsetted corpus, i.e. around 4% of the vocabulary.
This gives us a large margin to reduce the size of the vocabulary used in our predictive model, and therefore an opportunity to reduce computation time.
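As a sketch, this coverage threshold can be computed from the sorted word-frequency vector built earlier (word_freq, an assumed name from the sketch above):

# Cumulative share of word instances covered by the most frequent words
coverage <- cumsum(word_freq) / sum(word_freq)

# Number of words needed to cover 90% of all word instances
n_words_90 <- which(coverage >= 0.9)[1]
n_words_90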

 

Plan

Here are some directions we will explore to build our predictive model:

  • Markov chains: the Markov assumption states that the probability of a word depends only on the previous word, and it can be generalized so that the probability depends only on the few previous words. We can therefore use the n-grams defined above to estimate the probability of occurrence of a word.
  • Maximum Likelihood Estimation (MLE): this consists of taking the count of a word in a corpus and normalizing it. For example, to calculate the probability of a word Y given that its previous word is X, we compute the count of the 2-gram XY and divide it by the sum of the counts of all 2-grams starting with the word X (see the sketch after this list).
  • Logarithms: since probabilities will be combined by multiplication, we could use log probabilities instead of raw probabilities, so as to avoid working with extremely small numbers.
  • Backoff models: we will encounter many n-grams that are unobserved in the corpus. One way to tackle this issue is to use backoff models: if an n-gram is unobserved, we look at the corresponding (n-1)-gram, and so on, until a match is found in the corpus.
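A minimal sketch of the MLE estimate with a simple backoff to the unigram frequency, assuming word_freq and bigram_freq are the named frequency vectors computed during the exploratory analysis (with bigram names such as "of the"):

# P(Y | X) estimated by count(X Y) / count(X), backing off to the unigram
# relative frequency of Y when the 2-gram has never been observed
next_word_prob <- function(prev_word, word) {
  count_xy <- bigram_freq[paste(prev_word, word)]
  count_x  <- word_freq[prev_word]
  if (!is.na(count_xy) && !is.na(count_x) && count_x > 0) {
    return(unname(count_xy / count_x))
  }
  count_y <- word_freq[word]
  if (is.na(count_y)) return(0)
  unname(count_y / sum(word_freq))
}

next_word_prob("of", "the")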

 


To access the version of this report with code, here is the link: http://rpubs.com/rdsn/138890