1. Introduction

In this document we explore the data for the Milestone Assignment of the Data Science Capstone Project. The data can be retrieved here and consist of a set of text files taken from blogs, news and Twitter, written in four languages: German, English, Finnish and Russian.

We provide a basic Exploratory Data Analysis of the text files, with a special focus on the English language, and report ideas for building a model that predicts the next typed word.

2. Data Set File Description

The data set consists of four folders, one per language (country): German (DE), English (US), Finnish (FI) and Russian (RU). Each folder contains three text files, one per type of communication: blogs, news and Twitter. To give an idea of the amount and type of data available, we report below the number of information elements per file and the distribution of the number of words per information element. An information element is a complete news item, blog post or tweet, which in the analyzed files corresponds to one complete line.

The graph above shows the number of samples (lines) for each text type and each country. For example, the US data contain over 2 million tweet samples but fewer than 250 thousand news samples.

The following graph shows the distribution of the number of words per sample (line).

The graph above shows how the length of tweets, blogs and news items differs for each language: as expected, tweets are shorter than news items. Interestingly, Russian lines are longer than those of the other languages for all three kinds of text.
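The two quantities discussed in this section, the number of lines per file and the number of words per line, can be sketched as follows. This is a minimal illustration in Python, not the code used to produce the graphs; the function names `words_per_line` and `summarize` are hypothetical, and a "word" is assumed to be any whitespace-separated token.

```python
from statistics import mean, median


def words_per_line(lines):
    """Return the number of whitespace-separated tokens in each non-empty line."""
    return [len(line.split()) for line in lines if line.strip()]


def summarize(lines):
    """Summarize one text file given as an iterable of lines (one sample per line)."""
    counts = words_per_line(lines)
    if not counts:
        # Empty input: report zeros instead of raising on mean/median.
        return {"lines": 0, "mean_words": 0, "median_words": 0, "max_words": 0}
    return {
        "lines": len(counts),
        "mean_words": mean(counts),
        "median_words": median(counts),
        "max_words": max(counts),
    }


# Tiny in-memory stand-in for one file; a real run would pass an open file handle.
sample = [
    "Thanks for the follow!",
    "",
    "The city council approved the new budget on Tuesday after a long debate.",
]
print(summarize(sample))
```

Because `summarize` accepts any iterable of lines, it can be given an open file object directly, so a large file is read line by line rather than loaded into memory at once.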

3. Findings and Ideas for Prediction

In this section we focus on the English language and analyze the data sets with the goal of creating a prediction model. Since the modeling task is to predict the next typed word, we start by analyzing n-grams. An n-gram is an ordered sequence of “words”. More precisely, according to Wikipedia, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs.

The “n” in the term n-gram identifies the length of the sequence. Below we analyze n=1, 2 and 3.

The following graph shows the counts of the most frequent n-grams for n=1 (single words), n=2 and n=3. We limit the display to 20 n-grams per text type.
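As an illustration of how such counts can be obtained, the sketch below extracts word n-grams and keeps the most frequent ones. It is written in Python under simple assumptions: the tokenizer (lowercasing, keeping only letters and apostrophes) and the names `tokenize`, `ngrams` and `top_ngrams` are hypothetical choices, not the preprocessing used for the graph.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(lines, n, k=20):
    """Count n-grams line by line (never across lines) and return the k most common."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts.most_common(k)


tweets = ["Thanks for the follow!", "Thanks for the RT", "Happy new year"]
for n in (1, 2, 3):
    print(n, top_ngrams(tweets, n, k=3))
```

Counting per line keeps n-grams from spanning two unrelated samples, which matches the convention that each line is one complete tweet, blog post or news item.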

It is interesting to notice that, for n=2 and n=3, some expressions are typical of a particular text type. For example, expressions used to thank a person are typical of the colloquial Twitter style.

This analysis is the basis for creating a prediction algorithm. The steps we will take to create the prediction model are:

  1. Split the English data sets into Training and Verification sets.
  2. Train three different models, one per text type: tweets, news and blogs. Start with a Markov model, then try other methods from the R caret library.
  3. Select the “best model” by the score obtained with Cross Validation.
  4. Use the Verification Set to run practical example predictions with the “best model” and estimate its accuracy.
  5. Finally, build a Shiny App that accepts a sentence and returns the most probable next “word” together with 5 other candidates in order of decreasing probability.
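
The Markov starting point of step 2 and the accuracy estimate of step 4 could look roughly like the sketch below. It is only an illustration under stated assumptions: it is written in Python rather than R, it uses a simple back-off from two words of context to one and then to plain word frequencies, and the names `NextWordModel` and `top1_accuracy` are hypothetical. It does not cover caret, Cross Validation or the Shiny App.

```python
from collections import Counter, defaultdict


class NextWordModel:
    """Back-off Markov model: try the longest known context first (assumed design)."""

    def __init__(self, max_order=2):
        self.max_order = max_order  # maximum number of context words
        # tables[k] maps a k-word context tuple to a Counter of following words.
        self.tables = {k: defaultdict(Counter) for k in range(max_order + 1)}

    def fit(self, sentences):
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, word in enumerate(tokens):
                for k in range(self.max_order + 1):
                    if i - k < 0:
                        break  # not enough preceding words for this order
                    self.tables[k][tuple(tokens[i - k:i])][word] += 1
        return self

    def predict(self, text, top=6):
        """Return up to `top` (word, relative frequency) pairs, most probable first."""
        tokens = text.lower().split()
        for k in range(min(self.max_order, len(tokens)), -1, -1):
            context = tuple(tokens[len(tokens) - k:])
            candidates = self.tables[k].get(context)
            if candidates:
                total = sum(candidates.values())
                return [(w, c / total) for w, c in candidates.most_common(top)]
        return []


def top1_accuracy(model, sentences):
    """Fraction of words in held-out sentences matched by the first prediction."""
    hits = total = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            guess = model.predict(" ".join(tokens[:i]), top=1)
            if guess and guess[0][0] == tokens[i]:
                hits += 1
            total += 1
    return hits / total if total else 0.0


training = ["thanks for the follow", "thanks for the rt", "thanks for the follow"]
verification = ["thanks for the follow"]
model = NextWordModel(max_order=2).fit(training)
print(model.predict("thanks for the"))
print(top1_accuracy(model, verification))
```

The default `top=6` mirrors step 5: the most probable word plus 5 other candidates. The probabilities are raw relative frequencies within the matched context; a real model would likely need smoothing and a proper tokenizer.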