1. Introduction

In this document we explore the data for the Milestone Assignment of the Data Science Capstone Project. The data can be retrieved here and consists of a set of text files from blogs, news and Twitter, written in 4 different languages: German, English, Finnish and Russian.

We provide a basic Exploratory Data Analysis of the text files, with a special focus on the English language, and report ideas for creating a model to predict the next typed word.

2. Description of the Data Set Files

The data set to be analyzed consists of 4 folders, one per language (country): German (DE), English (US), Finnish (FI) and Russian (RU). Each folder includes three text files from different types of communication, namely blogs, news and Twitter. To get an idea of the amount and type of “data” available, we report below the number of information elements per file and the distribution of the number of words per information element. Note that an information element is a complete news item, blog post or tweet, which in the analyzed files corresponds to one complete line.

The graph above shows the number of samples (lines) for each text type and each country. For example, the US data contains over 2 million tweet samples, while the number of US news samples is below 250 thousand.
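The per-file counts described above can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the code used to produce the report. The function names `words_per_line` and `summarize_file` are hypothetical, and the sketch assumes that each file is UTF-8 encoded and that a "word" is any whitespace-separated token.

```python
from pathlib import Path


def words_per_line(path, encoding="utf-8"):
    """Return the number of whitespace-separated words on each line of a file.

    Assumption: one line is one sample (news item, blog post or tweet).
    """
    counts = []
    # errors="replace" keeps the scan going if a file contains invalid bytes.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            counts.append(len(line.split()))
    return counts


def summarize_file(path):
    """Return the file name, the number of lines (samples) and the total word count."""
    counts = words_per_line(path)
    return {"file": Path(path).name, "lines": len(counts), "words": sum(counts)}
```

Calling `summarize_file` on each of the three files in a language folder would give the sample counts per text type. The actual file names and paths depend on how the archive is unpacked and are not assumed here.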

The following graph shows the distribution of the number of words per sample (line).

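A distribution of words per sample can also be summarized numerically. The sketch below is an illustration in Python under stated assumptions, not the method used in this report. It takes a list of per-line word counts as input, and the helper names `word_count_histogram` and `describe` as well as the default bin width of 10 words are hypothetical choices.

```python
from collections import Counter
from statistics import mean, median


def word_count_histogram(counts, bin_width=10):
    """Group per-line word counts into bins of bin_width words.

    Returns a dict mapping the lower edge of each bin to the number
    of lines whose word count falls in that bin.
    """
    bins = Counter((c // bin_width) * bin_width for c in counts)
    return dict(sorted(bins.items()))


def describe(counts):
    """Return basic summary statistics for a non-empty list of word counts."""
    return {
        "min": min(counts),
        "median": median(counts),
        "mean": mean(counts),
        "max": max(counts),
    }
```

For example, `word_count_histogram([3, 12, 15, 27, 9])` places two lines in the 0 to 9 bin, two in the 10 to 19 bin and one in the 20 to 29 bin. A narrower bin width may suit tweets, which are short, while a wider one may suit blog posts.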