Milestone Report-Capstone

Synopsis

This project is based on the HC Corpora dataset.
The goal of this project is to create a Shiny app that predicts the next word based on the user’s input.
The data is provided by SwiftKey. It contains text from news, Twitter and blogs, and is available in multiple languages such as English, French and Russian.
This report presents the exploratory data analysis of the English-language data set.
It also covers the different n-gram models and their performance.

The following summary describes the data from the English news, blogs and Twitter files.

f_names	f_size	f_lines	n_char	n_words	pct_n_char	pct_lines	pct_words
blogs	200.4242	899288	208361438	37334131	0.54	0.27	0.53
news	196.2775	77259	15683765	2643969	0.04	0.02	0.04
twitter	159.3641	2360148	162384825	30373543	0.42	0.71	0.43

The files are too large to analyze in full with the limited computing resources available.
Instead, I sampled data from each file, then cleaned and tidied the sample before performing the analysis.
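
The sampling and cleaning step can be illustrated with a minimal Python sketch. The report does not state the sampling fraction or the exact cleaning rules, so the function names `sample_lines` and `clean_line`, the seed, and the rule of keeping only letters and apostrophes are all assumptions made for illustration.

```python
import random
import re

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction` (assumed scheme)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

def clean_line(line):
    """Lowercase the text and keep only letters, apostrophes and single spaces."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)  # digits and punctuation become spaces
    return " ".join(line.split())           # collapse repeated whitespace

# Toy input standing in for lines read from one of the text files.
raw = ["Hello, World!  It's 2024.", "Second line...", "THIRD line here"]
sampled = sample_lines(raw, fraction=1.0)   # fraction=1.0 keeps every line
cleaned = [clean_line(line) for line in sampled]
print(cleaned)
```

In practice the same two functions would be applied to each of the three files with a much smaller fraction, so that the combined sample fits in memory.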

A unigram model can be treated as the combination of several one-state finite automata.
In this model, the probability of each word only depends on that word’s own probability in the document, so we only have one-state finite automata as units. The automaton itself has a probability distribution over the entire vocabulary of the model, summing to 1.
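
As a rough illustration of the unigram model described above, the sketch below estimates each word's probability as its relative frequency in a token list. The function name `unigram_model` and the toy sentence are hypothetical and not taken from the project code.

```python
from collections import Counter

def unigram_model(tokens):
    """Return a dict mapping each word to its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy corpus: 8 tokens, of which 3 are "the".
tokens = "the cat sat on the mat the end".split()
model = unigram_model(tokens)

print(model["the"])          # probability of "the" in this toy corpus
print(sum(model.values()))   # the distribution sums to 1 (up to rounding)
```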

The three sources are news, blogs and Twitter.

Unigram distributions are plotted based on relative frequency. The same kind of plot is produced for each set of n-grams.

The predictions are based on the n-gram tables.
In the bigram model, the next word is predicted from the last word: the candidate with the highest relative frequency is chosen.
In the trigram model, the next word is predicted from the last two words and their relative frequency.
In the quadgram model, the next word is predicted from the last three words and their relative frequency.
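
The lookup described above can be sketched in Python as follows. This is an assumed, simplified version: the names `build_table` and `predict_next` are hypothetical, the toy sentence is invented, and no backoff between model orders is implemented because the report does not describe one.

```python
from collections import Counter, defaultdict

def build_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def predict_next(table, words, n):
    """Return the most frequent follower of the last n-1 words, or None if unseen."""
    context = tuple(words[-(n - 1):])
    followers = table.get(context)
    if not followers:
        return None
    # Within one context, the highest count is also the highest relative frequency.
    return followers.most_common(1)[0][0]

tokens = ("at the end of the day at the end of the road "
          "at the end of the day").split()
bigrams = build_table(tokens, 2)
trigrams = build_table(tokens, 3)
quadgrams = build_table(tokens, 4)

typed = ["end", "of", "the"]
print(predict_next(bigrams, typed, 2))    # uses only "the"
print(predict_next(trigrams, typed, 3))   # uses "of the"
print(predict_next(quadgrams, typed, 4))  # uses "end of the"
```

In this toy corpus the bigram model suggests "end" after "the", while the trigram and quadgram models suggest "day", which shows how a longer context can change the prediction.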

word1	word2	word3	word4	n	proportion	coverage
the	end	of	the	497	8.00e-05	0.0000800
the	rest	of	the	454	7.31e-05	0.0001531
at	the	end	of	405	6.52e-05	0.0002183
for	the	first	time	397	6.39e-05	0.0002822
thank	you	for	the	359	5.78e-05	0.0003401
is	going	to	be	358	5.76e-05	0.0003977
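
In the table above, the coverage column appears to be the running total of the proportion column (for example, 0.0000800 + 0.0000731 = 0.0001531). The sketch below shows how such columns could be derived from raw counts. It uses small made-up counts rather than the project's data, and the function name `frequency_table` is hypothetical.

```python
from itertools import accumulate

def frequency_table(counts):
    """Return rows of (ngram, n, proportion, coverage), sorted by count descending."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    proportions = [n / total for _, n in ordered]
    coverage = list(accumulate(proportions))  # cumulative sum of proportions
    return [(ngram, n, p, c)
            for (ngram, n), p, c in zip(ordered, proportions, coverage)]

# Made-up counts for illustration only.
toy_counts = {
    ("the", "end", "of", "the"): 5,
    ("the", "rest", "of", "the"): 3,
    ("at", "the", "end", "of"): 2,
}
rows = frequency_table(toy_counts)
for ngram, n, proportion, coverage in rows:
    print(" ".join(ngram), n, round(proportion, 3), round(coverage, 3))
```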