Introduction
The main goal of the capstone project is to create a Shiny application that can predict the next word while the user is typing text. This document describes the data that will be used to train the application's model.
Data Summary
The training data includes packages of text files for the DE, EN, FI and RU languages. According to the requirements, we will use only EN. Each package consists of 3 files containing tweets, blog posts and news articles. Let’s go deeper…
| Dataset | File size (bytes) | Lines |
|---|---|---|
| Twitter | 167105338 | 2360148 |
| News | 205811889 | 1010242 |
| Blogs | 210160014 | 899288 |
Exploratory Analysis
Because the full datasets are too large to process comfortably, we will take a random sample of only 0.5% of the lines of each dataset.
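The sampling step can be sketched as follows. This is only an illustration in Python, not the code used for this report: the function name `sample_lines`, the fixed seed and the synthetic input are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.005, seed=42):
    """Return a random sample of about `fraction` of the lines, in original order.

    `fraction` defaults to 0.5%; `seed` is a hypothetical value used only
    to make the sample reproducible.
    """
    if not lines:
        return []
    rng = random.Random(seed)
    # Sample at least one line, and never more lines than exist.
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    chosen = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in chosen]

# Synthetic stand-in for one of the text files.
lines = [f"line {i}" for i in range(10_000)]
sample = sample_lines(lines)
print(len(sample))
```

Sampling whole lines without replacement keeps each tweet, post or article intact, so the n-grams extracted later never span two unrelated texts.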
Each set of plots is divided into 2 columns:
1. Corpus with original words
2. Corpus with stop words removed and the remaining words stemmed
News dataset
Original file summary
| | num.lines | num.words | line.words.min | line.words.max | line.words.mean |
|---|---|---|---|---|---|
| original | 5051 | 1027373 | 2 | 1507 | 203.40 |
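A summary like the one above could be produced along these lines. The sketch assumes that words are separated by whitespace; the report does not state how words were counted, so the function `line_word_stats` is a hypothetical helper and its numbers will not necessarily match the table.

```python
def line_word_stats(lines):
    """Per-file summary: line count, word count and words-per-line statistics.

    Assumes whitespace tokenization and a non-empty list of lines.
    """
    counts = [len(line.split()) for line in lines]
    return {
        "num.lines": len(counts),
        "num.words": sum(counts),
        "line.words.min": min(counts),
        "line.words.max": max(counts),
        "line.words.mean": round(sum(counts) / len(counts), 2),
    }

stats = line_word_stats(["the quick brown fox", "hello world", "a b c d e f"])
print(stats)
```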
Corpus summary
| | unique.words.num | bigram.words.num | trigram.words.num |
|---|---|---|---|
| original data | 20888 | 105281 | 148208 |
| stemmed data | 14784 | 85044 | 88627 |
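For illustration, counts of unique words, bigrams and trigrams can be obtained roughly as shown here. This is a minimal Python sketch under assumptions: the tokenizer (lowercase letters and apostrophes only) and the names `tokenize` and `ngram_counts` are hypothetical, and the report does not say whether its n-gram figures count distinct n-grams or something else.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (an assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(lines, n):
    """Count n-grams line by line, so no n-gram spans two lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        # zip over n shifted copies of the token list yields consecutive n-tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

corpus = ["the cat sat on the mat", "the cat ate the fish"]
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)
print(len(unigrams), len(bigrams), len(trigrams))
```

Running the same functions on stemmed tokens would give the second row of such a table; stemming merges word forms, which is why the stemmed counts are lower.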
Number of words needed to cover a given percentage of all words
| | 50% | 90% |
|---|---|---|
| Number of original words | 11958 | 19445 |
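One common way to compute such coverage figures is to sort words by frequency and count how many of the most frequent ones are needed to reach the target share of all word occurrences. The sketch below shows that reading only; the report does not describe how its own figures were computed, and `words_to_cover` is a hypothetical helper.

```python
from collections import Counter

def words_to_cover(word_counts, share):
    """Smallest number of most-frequent words whose occurrences reach `share` of the total."""
    total = sum(word_counts.values())
    covered = 0
    for rank, (_, count) in enumerate(word_counts.most_common(), start=1):
        covered += count
        if covered / total >= share:
            return rank
    return len(word_counts)

counts = Counter("a a a a a b b b c d".split())
print(words_to_cover(counts, 0.5), words_to_cover(counts, 0.9))
```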

Blogs dataset
Original file summary
| | num.lines | num.words | line.words.min | line.words.max | line.words.mean |
|---|---|---|---|---|---|
| original | 4496 | 1040228 | 2 | 2690 | 231.37 |
Corpus summary
| | unique.words.num | bigram.words.num | trigram.words.num |
|---|---|---|---|
| original data | 19575 | 104752 | 160458 |
| stemmed data | 13540 | 83417 | 88091 |
Number of words needed to cover a given percentage of all words
| | 50% | 90% |
|---|---|---|
| Number of original words | 11233 | 18797 |

Brief summary
As you can see, every dataset's corpus of original words has a clearly pronounced long tail, whereas the stemmed data is smoother and has a smaller dispersion.
Shiny application implementation notes
The user interface should be as simple as possible and consist of only a text box for typing a phrase. The application will predict the next word after a short pause in typing.
The simplest prediction strategy would consist of 3 steps:
1. Try to find the typed text in the trigrams and get the most frequent variants. If nothing is found, go to the next step.
2. Try to find it in the bigrams and get the most frequent variants. If nothing is found, go to the next step.
3. If a next word has been found, just offer it to the user.
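The steps above can be sketched as a simple back-off lookup. This is a minimal Python illustration under assumptions, not the application's actual code: `build_model` and `predict_next` are hypothetical names, tokens are split on whitespace, and the tiny corpus is made up for the example.

```python
from collections import Counter, defaultdict

def build_model(lines, max_n=3):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for n in range(2, max_n + 1):  # bigrams and trigrams
            for i in range(len(tokens) - n + 1):
                *context, nxt = tokens[i:i + n]
                model[tuple(context)][nxt] += 1
    return model

def predict_next(model, text, top=3):
    """Try the trigram context first, then fall back to the bigram context."""
    tokens = text.lower().split()
    for size in (2, 1):
        context = tuple(tokens[-size:])
        if len(context) == size and context in model:
            return [word for word, _ in model[context].most_common(top)]
    return []  # nothing found: offer no suggestion

corpus = ["i love new york", "i love new shoes", "i love you", "new york is big"]
model = build_model(corpus)
print(predict_next(model, "I love"))     # found in trigrams
print(predict_next(model, "brand new"))  # falls back to bigrams
```

Storing the next-word counts per context makes each lookup a dictionary access, which suits an interface that has to respond while the user is typing.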
During implementation I am going to experiment with the original and stemmed data to find the most accurate model.
Appendices
According to the Coursera rules, I can’t publish the source code. Sorry :)