This project consists in creating a Shiny application that makes text predictions. A smart keyboard makes it easier for people to type on their devices, and one cornerstone of such a keyboard is a predictive text model. To build one, we will use the HC Corpora dataset, carry out an exploratory analysis, and implement a predictive model that suggests the next word.
This Milestone Report presents our understanding of the data, our exploratory analysis, and our plans for implementing the predictive algorithm.
HC Corpora is a collection of corpora for various languages, freely available for download. The corpora were collected from numerous webpages by a web crawler.
More details on the corpora can be found here: http://www.corpora.heliohost.org/aboutcorpus.html. In this dataset, we are provided with three text files (.txt), each available in four languages: German, English, Finnish, and Russian.
We will focus here only on the English files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
Here is the link to download the dataset: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Once we have loaded the data into R, we can observe the following characteristics for each file (a sketch of how these figures can be computed follows the table):
| Data | Size (MB) | Number of lines | Min words per item | Max words per item | Total word count |
|---|---|---|---|---|---|
| Blogs | 200.42 | 899,288 | 1 | 40,833 | 206,824,505 |
| News | 196.28 | 1,010,242 | 1 | 11,384 | 203,223,159 |
| Twitter | 159.36 | 2,360,148 | 2 | 140 | 162,096,031 |
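As an illustration, summary statistics like those in the table can be computed along the following lines. This is only a sketch: the report does not state which tools were used, and the choice of the stringi package for word counting is an assumption.

```r
library(stringi)

# Read the English files (assumed to be in the working directory)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Summarise one file: size on disk, line count, and word counts per line
summarise_corpus <- function(lines, path) {
  words <- stri_count_words(lines)
  data.frame(
    size_MB     = file.info(path)$size / 1024^2,
    n_lines     = length(lines),
    min_words   = min(words),
    max_words   = max(words),
    total_words = sum(words)
  )
}

rbind(
  Blogs   = summarise_corpus(blogs,   "en_US.blogs.txt"),
  News    = summarise_corpus(news,    "en_US.news.txt"),
  Twitter = summarise_corpus(twitter, "en_US.twitter.txt")
)
```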
Each of these three files is quite large, which may lead to long computation times. To avoid this issue, we subsetted the three files to obtain much smaller objects that remain representative of the originals. Here are the characteristics of those subsets (a sketch of the sampling step follows the table):
| Data | Size (MB) | Number of lines | Min words per item | Max words per item | Total word count |
|---|---|---|---|---|---|
| Blogs | 37.24 | 134,734 | 1 | 37,191 | 30,990,754 |
| News | 37.35 | 151,267 | 2 | 3,555 | 30,398,789 |
| Twitter | 45.58 | 354,075 | 3 | 140 | 24,321,281 |
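The report does not specify how the subsets were drawn. One simple approach that is consistent with the line counts above (roughly 15% of each file) is to sample lines at random; the fraction and the seed below are illustrative assumptions.

```r
set.seed(1234)  # illustrative seed, for reproducibility

# Keep roughly 15% of the lines of each file; rbinom() gives every line an
# independent chance of being selected, which keeps the sample representative
sample_lines <- function(lines, fraction = 0.15) {
  lines[as.logical(rbinom(length(lines), size = 1, prob = fraction))]
}

blogs_sub   <- sample_lines(blogs)
news_sub    <- sample_lines(news)
twitter_sub <- sample_lines(twitter)
```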
To analyse these datasets properly, we need to apply some transformations that “clean” the data: removing punctuation, removing numbers, converting everything to lowercase, stemming, and removing extra whitespace.
Let’s look at an example to see the effect of these transformations:
Before Preprocessing
## [1] "Origin: Middle English: from Old French joie, based on Latin gaudium, from gaudere ‘rejoice’"
After Preprocessing
## [1] "origin middl english from old french joie base on latin gaudium from gauder rejoic"
Now that we have preprocessed the data, we can build a term-document matrix that will allow us to perform some exploratory analysis and observe word frequencies and n-gram frequencies.
As we can see in this graph, the words that occur most frequently in the three texts are mainly connecting words, as we could have expected; for example, the most frequent word is “the”. This shows that to build an efficient predictive model we will have to go further and consider patterns other than single-word frequencies, namely n-gram frequencies.
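The word frequencies behind a plot like the one described above can be obtained from the term-document matrix roughly as follows (a sketch that continues the tm pipeline above and uses the slam package, which tm depends on):

```r
# Term-document matrix over the cleaned corpus
tdm <- TermDocumentMatrix(corpus)

# Overall frequency of each term, sorted in decreasing order
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freq, 10)   # the top of this list is dominated by words such as "the"

# Simple frequency plot of the 20 most common words
barplot(freq[1:20], las = 2, main = "Most frequent words")
```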
An n-gram is a contiguous sequence of n items from a given sequence of text.
So in this analysis, we split all the strings into n-grams (n being the number of contiguous words) and look at their frequencies across the three text files.
For the purpose of this analysis, we look at 1-grams, 2-grams, and 3-grams.
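One way to obtain these n-gram frequencies is to plug an RWeka tokenizer into tm's TermDocumentMatrix; the choice of tokenizer is an assumption, as the report does not specify one.

```r
library(RWeka)

# Tokenizers for 2-grams and 3-grams (1-grams use the default tokenizer)
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm_2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
tdm_3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

# Most frequent 2-grams and 3-grams
head(sort(slam::row_sums(tdm_2), decreasing = TRUE), 10)
head(sort(slam::row_sums(tdm_3), decreasing = TRUE), 10)
```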
Looking at the n-gram frequency graphs for each corpus (included in the version of this report with code, linked at the end), we can observe that the frequencies are very skewed: the most frequent n-grams (for n = 1, 2, or 3) occur far more often than the rest. Furthermore, connecting words such as “the” or “and” have much higher frequencies than other words.
Profanity filtering: there are some words that we do not want to predict, nor want to use when building our prediction model, so they must be removed. However, we cannot simply delete those words on their own: we remove the entire sentences in which they appear, because the presence of such words makes the whole sentence unsuitable for our purposes.
To perform this task, we can use a list of profane words, for example the one available at https://gist.github.com/jamiew/1112488.
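A possible sketch of this filtering step, assuming the list has been downloaded from the link above and saved locally as a plain-text file with one word per line (the file name and variable names are illustrative):

```r
# Profanity list saved locally, one word per line (illustrative file name)
profanity <- readLines("profanity.txt", warn = FALSE)

# Drop every line that contains at least one profane word, rather than
# masking only the word itself; in practice the words may need regex-escaping
pattern   <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
all_sub   <- c(blogs_sub, news_sub, twitter_sub)
clean_sub <- all_sub[!grepl(pattern, all_sub, ignore.case = TRUE)]
```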
Covering 90% of word instances: to cover 90% of all word instances in the subset defined above, we only need to consider the 4,044 most frequent words out of the 96,257 distinct words in the subsetted corpus, i.e. around 4% of the vocabulary.
This gives us a large margin to reduce the size of the corpus used by our predictive model, and thereby to reduce computation time.
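The coverage figure quoted above can be computed from the word frequencies as follows (a sketch reusing the term-document matrix built earlier):

```r
# Term frequencies from the term-document matrix built earlier
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)

coverage <- cumsum(freq) / sum(freq)      # cumulative share of word instances
n_needed <- which(coverage >= 0.90)[1]    # words needed to reach 90% coverage
n_needed                                  # about 4,044 words in our subset
n_needed / length(freq)                   # roughly 4% of the 96,257 distinct words
```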
Here are some avenues we will explore to build our predictive model:
To access the version of this report with the code, follow this link: http://rpubs.com/rdsn/138890