Coursera Data Science Capstone: Milestone Report

Introduction

The main goal of the capstone project is to create a Shiny application that predicts the next word as the user types. This document describes the data that will be used to train the model.

Data Summary

The training data includes packages of text files for the DE, EN, FI and RU languages. According to the requirements, we will use only EN. Each package consists of 3 files: tweets, blog posts and news articles. Let’s go deeper…

Dataset	File size (bytes)	Lines
Twitter	167105338	2360148
News	205811889	1010242
Blogs	210160014	899288

Exploratory Analysis

Because the full datasets are too large to process comfortably, we take a 0.5% random sample of each one.
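A minimal sketch of this kind of sampling is shown below. It is written in Python for illustration only and is not the report's own code: the function name `sample_lines`, the fixed seed, and the choice to sample without replacement and round the sample size down are all assumptions.

```python
import random

def sample_lines(lines, fraction=0.005, seed=42):
    """Return a random sample of lines, drawn without replacement.

    The sample size is len(lines) * fraction, rounded down (an assumption).
    A fixed seed keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)

# Toy stand-in for the lines of one of the text files.
lines = [f"line {i}" for i in range(10_000)]
sample = sample_lines(lines)
print(len(sample))
```

With a real file, `lines` would be the list of lines read from disk; for the largest file it may be preferable to sample line indices first and read only those lines.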

Each set of plots is divided into 2 columns:
1. Corpus with original words
2. Corpus with stop words removed and the remaining words stemmed

Twitter dataset

Original file summary
	num.lines	num.words	line.words.min	line.words.max	line.words.mean
original	11800	808272	4	140	68.50

Corpus summary
	unique.words.num	bigram.words.num	trigram.words.num
original data	16818	80845	111731
stemmed data	12914	63868	62917
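As an illustration of what the corpus summary counts, the following Python sketch computes the number of unique words, bigrams and trigrams for a toy corpus. The tokenizer (lowercase, letters and apostrophes only) and the helper names `tokenize` and `ngrams` are assumptions; the report does not say how its corpus was tokenized, and stemming is left out here.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy corpus; n-grams are built per line so they never span two lines.
docs = ["the cat sat on the mat", "the cat ate the fish"]

unique = {n: set() for n in (1, 2, 3)}
for doc in docs:
    tokens = tokenize(doc)
    for n in unique:
        unique[n].update(ngrams(tokens, n))

counts = {n: len(items) for n, items in unique.items()}
print(counts)  # {1: 7, 2: 8, 3: 7}
```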

Number of unique words needed to cover a given percentage of all words
	50%	90%
Number of original words	9232	16089
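One common way to compute a coverage figure of this kind is to sort the words by frequency and count how many of the most frequent ones are needed before their combined occurrences reach the target share. The Python sketch below shows that approach on toy data; the function name `words_to_cover` is hypothetical, and the report does not state whether its figures were obtained this way.

```python
from collections import Counter

def words_to_cover(tokens, percent):
    """Return how many of the most frequent distinct words are needed
    so that their occurrences make up at least `percent` % of all tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= percent * total:
            return rank
    return 0  # only reached for an empty token list

tokens = "a a a a b b c d e f".split()
print(words_to_cover(tokens, 50))  # 2
print(words_to_cover(tokens, 90))  # 5
```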

News dataset

Original file summary
	num.lines	num.words	line.words.min	line.words.max	line.words.mean
original	5051	1027373	2	1507	203.40

Corpus summary
	unique.words.num	bigram.words.num	trigram.words.num
original data	20888	105281	148208
stemmed data	14784	85044	88627

Number of unique words needed to cover a given percentage of all words
	50%	90%
Number of original words	11958	19445

Blogs dataset

Original file summary
	num.lines	num.words	line.words.min	line.words.max	line.words.mean
original	4496	1040228	2	2690	231.37

Corpus summary
	unique.words.num	bigram.words.num	trigram.words.num
original data	19575	104752	160458
stemmed data	13540	83417	88091

Number of unique words needed to cover a given percentage of all words
	50%	90%
Number of original words	11233	18797

Brief summary

As you can see, the original-word corpus of every dataset has a clearly pronounced long tail, while the stemmed data is smoother and has smaller dispersion.

Shiny application implementation notes

The user interface should be as simple as possible and consist of only a text box for typing a phrase. The application will predict the next word after a short pause in typing.

The simplest prediction strategy would consist of 3 steps:
1. Try to find the last typed words in the trigrams and get the most frequent variants. If nothing is found, go to the next step.
2. Try to find the last typed word in the bigrams and get the most frequent variants.
3. If a next word has been found, just offer it to the user.
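The fallback from trigrams to bigrams described above can be sketched as follows. This is an illustrative Python sketch, not the application's code: `build_model` and `predict` are hypothetical names, tokenization is a plain whitespace split, and raw counts are used without any smoothing.

```python
from collections import Counter, defaultdict

def build_model(sentences):
    """Count which word follows each single word (bigrams)
    and each pair of words (trigrams)."""
    bigrams = defaultdict(Counter)
    trigrams = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[w1][w2] += 1
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            trigrams[(w1, w2)][w3] += 1
    return bigrams, trigrams

def predict(text, bigrams, trigrams, k=3):
    """Return up to k candidate next words, most frequent first."""
    tokens = text.lower().split()
    # Step 1: look up the last two typed words in the trigrams.
    if len(tokens) >= 2:
        key = tuple(tokens[-2:])
        if key in trigrams:
            return [w for w, _ in trigrams[key].most_common(k)]
    # Step 2: fall back to the last typed word in the bigrams.
    if tokens and tokens[-1] in bigrams:
        return [w for w, _ in bigrams[tokens[-1]].most_common(k)]
    # Nothing found: offer no suggestion.
    return []

sentences = [
    "i love new york",
    "i love new shoes",
    "i love new york pizza",
    "we love you",
]
bigrams, trigrams = build_model(sentences)
print(predict("i love new", bigrams, trigrams))  # ['york', 'shoes']
print(predict("they love", bigrams, trigrams))   # ['new', 'you']
```

In the second call the pair ("they", "love") never occurs in the toy data, so the lookup falls back to the bigrams that start with "love".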

During implementation I’m going to experiment with the original and stemmed data to find the most accurate model.

Appendices

According to the Coursera rules, I can’t publish the source code. Sorry :)