Overview

The first step in building a predictive text model is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this report is to explore these basic relationships in the data and to prepare for building a first linguistic model.

Tasks to accomplish

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.

Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

In the interest of keeping this report simple and concise, all R code chunks have been hidden.

Summary Data

The training data is downloaded from here and unzipped manually to speed up the process. It consists of three files containing sample text from Twitter, blogs, and news articles. A summary is shown below.

##   File_Name Size_in_MB Line_Count Word_Count
## 1   Twitter      159.4    2360148   30218166
## 2     Blogs      200.4     899288   38154238
## 3      News      196.3    1010242   35010782
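For reference, here is a minimal sketch of how these summary statistics could be computed, using base R plus the `stringi` package for word counting. The file paths are assumptions based on the standard names in the dataset; adjust them to wherever the archive was unzipped.

```r
library(stringi)  # stri_count_words() for fast word counting

# Assumed local paths after unzipping; adjust as needed
files <- c(Twitter = "final/en_US/en_US.twitter.txt",
           Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt")

summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(Size_in_MB = round(file.size(path) / 1024^2, 1),
             Line_Count = length(lines),
             Word_Count = sum(stri_count_words(lines)))
}

summary_df <- cbind(File_Name = names(files),
                    do.call(rbind, lapply(files, summarize_file)))
print(summary_df, row.names = FALSE)
```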

Cleaning up the data

We load all three files in their entirety. After collecting summary statistics, we draw a random sample of 1% of the lines from each file (using the `runif` function) and combine them into one corpus, for performance reasons. The resulting sample is then cleaned with simple R functions to remove hashtags and non-English words before further processing.
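The sampling and clean-up step might look like the sketch below. The variable names `twitter`, `blogs`, and `news` are hypothetical stand-ins for the full character vectors read in earlier, and the hashtag and non-ASCII filters are illustrative simple R functions rather than the exact ones used in the hidden chunks.

```r
set.seed(1234)  # make the random sample reproducible

# Keep roughly 1% of the lines from each file, using runif as described above
sample_lines <- function(lines, frac = 0.01) {
  lines[runif(length(lines)) < frac]
}

corpus <- c(sample_lines(twitter), sample_lines(blogs), sample_lines(news))

# Simple clean-up: drop hashtags, strip non-ASCII characters, lowercase
corpus <- gsub("#\\S+", "", corpus)
corpus <- iconv(corpus, from = "UTF-8", to = "ASCII", sub = "")
corpus <- tolower(corpus)
```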

Features of the data

We perform an n-gram analysis. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus.
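To make the definition concrete, here is a small base-R illustration of extracting word-level n-grams from one sentence; the `ngrams` helper is written for this example and is not part of the report's hidden code.

```r
# Toy example: word-level n-grams from a single sentence
text  <- "the quick brown fox jumps"
words <- strsplit(text, "\\s+")[[1]]

ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

ngrams(words, 2)
## [1] "the quick"   "quick brown" "brown fox"   "fox jumps"
ngrams(words, 3)
## [1] "the quick brown" "quick brown fox" "brown fox jumps"
```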

In this report, we present frequency plots for unigrams, bigrams, and trigrams, shown below.
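As an indication of how such plots could be generated, the sketch below counts unigram frequencies over the sampled `corpus` from the previous section and charts the top 20 with `ggplot2`; the same pattern extends to bigrams and trigrams using the `ngrams` helper above. This is an assumed approach, not necessarily the one used in the hidden chunks.

```r
library(ggplot2)

# Unigram frequencies over the cleaned 1% sample (corpus from above)
tokens <- unlist(strsplit(corpus, "\\s+"))
tokens <- tokens[nzchar(tokens)]   # drop empty strings
freq   <- sort(table(tokens), decreasing = TRUE)

top20 <- data.frame(word  = names(freq)[1:20],
                    count = as.integer(freq[1:20]))

ggplot(top20, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 20 unigrams in the 1% sample")
```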