Introduction

The main goal of the capstone project is to create a Shiny application that can predict the next word while the user is typing. This document describes the data that will be used to train the application's model.

Data Summary

The training data includes sets of text files for the DE, EN, FI and RU languages. According to the requirements, we will use only EN. Each set consists of 3 files containing tweets, blog posts and news articles. Let’s go deeper…

Dataset   File size (bytes)   Lines
Twitter           167105338   2360148
News              205811889   1010242
Blogs             210160014    899288
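The figures above can be reproduced with base R. A minimal sketch (the file paths below are placeholders for the actual locations of the EN files):

```r
files <- c(Twitter = "en_US/en_US.twitter.txt",
           News    = "en_US/en_US.news.txt",
           Blogs   = "en_US/en_US.blogs.txt")

# file size in bytes and number of lines for each dataset
sizes <- sapply(files, file.size)
lines <- sapply(files, function(f) length(readLines(f, skipNul = TRUE)))
data.frame(file.size = sizes, lines = lines)
```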

Exploratory Analysis

Because the full datasets are too large to process in a reasonable time, we take only a 0.5% random sample of each dataset.
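One possible way to draw such a sample in base R (the seed is an arbitrary choice made only for reproducibility):

```r
set.seed(12345)   # fixed seed so the sample is reproducible
lines <- readLines("en_US/en_US.twitter.txt", skipNul = TRUE)

# keep a 0.5% random sample of the lines
sample_lines <- sample(lines, size = ceiling(0.005 * length(lines)))
```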

Each set of plots is divided into 2 columns:
1. Corpus with the original words
2. Corpus with stop-words removed and the remaining words stemmed (see the preprocessing sketch after this list)
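A minimal sketch of how the two corpus variants could be built with the tm package, assuming `sample_lines` holds the 0.5% sample drawn above (the actual cleaning steps used for the report may differ):

```r
library(tm)   # text-mining framework; stemDocument also requires the SnowballC package

corpus <- VCorpus(VectorSource(sample_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# second variant: stop-words removed and the remaining words stemmed
corpus_stemmed <- tm_map(corpus, removeWords, stopwords("english"))
corpus_stemmed <- tm_map(corpus_stemmed, stemDocument)
```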

Twitter dataset

Original file summary
           num.lines  num.words  line.words.min  line.words.max  line.words.mean
original       11800     808272               4             140            68.50
Corpus summary
               unique.words.num  bigram.words.num  trigram.words.num
original data             16818             80845             111731
stemmed data              12914             63868              62917
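The unigram, bigram and trigram counts above could be obtained, for example, with a small base-R tokenizer applied to the sampled text (a rough sketch, not necessarily the code used for this report):

```r
# split the sampled lines into lower-case word tokens (very rough tokenizer)
tokens <- unlist(strsplit(tolower(sample_lines), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# build n-grams by joining n consecutive tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

length(unique(tokens))                   # unique words
length(unique(make_ngrams(tokens, 2)))   # unique bigrams
length(unique(make_ngrams(tokens, 3)))   # unique trigrams
```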
Number of unique words needed to cover a given share of all word occurrences:
                             50%     90%
Number of original words    9232   16089
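These coverage figures follow from the sorted word-frequency table: sort the unique words by frequency, take the cumulative sum, and find how many of the most frequent words are needed before the cumulative sum reaches 50% (or 90%) of all occurrences. A sketch under the same assumptions as above:

```r
# word frequencies, most frequent first
freqs <- sort(table(tokens), decreasing = TRUE)

# how many of the most frequent words are needed to cover a given share
words_for_coverage <- function(freqs, share) {
  covered <- cumsum(freqs) / sum(freqs)
  which(covered >= share)[1]
}

words_for_coverage(freqs, 0.5)   # coverage of 50% of all word occurrences
words_for_coverage(freqs, 0.9)   # coverage of 90% of all word occurrences
```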

News dataset

Original file summary
           num.lines  num.words  line.words.min  line.words.max  line.words.mean
original        5051    1027373               2            1507           203.40
Corpus summary
               unique.words.num  bigram.words.num  trigram.words.num
original data             20888            105281             148208
stemmed data              14784             85044              88627
Number of unique words needed to cover a given share of all word occurrences:
                             50%     90%
Number of original words   11958   19445

Blogs dataset

Original file summary
           num.lines  num.words  line.words.min  line.words.max  line.words.mean
original        4496    1040228               2            2690           231.37
Corpus summary
               unique.words.num  bigram.words.num  trigram.words.num
original data             19575            104752             160458
stemmed data              13540             83417              88091
Number of unique words needed to cover a given share of all word occurrences:
                             50%     90%
Number of original words   11233   18797

Brief summary

As you can see, all of the original-word corpora have a clearly expressed long tail, while the stemmed data is smoother and has a smaller dispersion.

Shiny application implementation notes

The user interface should be as simple as possible and consist of only a textbox for typing a phrase. The application will predict the next word after a short delay in typing.
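A minimal sketch of such an interface (the `predict_next_word()` function is a placeholder for the model described below, and the 500 ms delay is an arbitrary choice):

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # re-evaluate the typed phrase only after a short pause in typing
  typed <- debounce(reactive(input$phrase), millis = 500)

  output$prediction <- renderText({
    predict_next_word(typed())   # placeholder for the n-gram model
  })
}

shinyApp(ui, server)
```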

The simplest prediction strategy would consist of 3 steps (a sketch of this back-off lookup follows the list):
1. Try to find the last typed words in the trigrams and take the most frequent continuations. If nothing is found, go to the next step.
2. Try to find the last typed word in the bigrams and take the most frequent continuations. If nothing is found, go to the next step.
3. If a next word has been found, offer it to the user.
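A minimal sketch of this back-off lookup, assuming two pre-computed frequency tables `trigrams` and `bigrams` (data frames with columns `prefix`, `next_word` and `freq`; the names and structure are assumptions, not the final implementation):

```r
predict_next_word <- function(phrase, trigrams, bigrams) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  n <- length(words)

  # step 1: look up the last two words in the trigram table
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams[trigrams$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])
  }

  # step 2: fall back to the bigram table keyed by the last word
  if (n >= 1) {
    hits <- bigrams[bigrams$prefix == words[n], ]
    if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])
  }

  # step 3: nothing found
  NA_character_
}
```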

During the implementation I’m going to experiment with both the original and the stemmed data to find the most accurate model.

Appendices

According to the Coursera rules, I can’t publish the source code. Sorry :)