The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this report is to describe the basic relationships we observe in the data and to prepare for building our first linguistic model.
Tasks to accomplish

- Exploratory analysis: perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
- Understand frequencies of words and word pairs: build figures and tables to show the variation in the frequencies of words and word pairs in the data.
In the interest of keeping this report simple and concise, all R code chunks have been hidden.
The training data is downloaded from here and unzipped manually to speed up the process. It consists of three files containing sample text from Twitter, blogs, and news. A summary is shown below.
```
##   File_Name Size_in_MB Line_Count Word_Count
## 1   Twitter      159.4    2360148   30218166
## 2     Blogs      200.4     899288   38154238
## 3      News      196.3    1010242   35010782
```
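For reference, the sketch below shows one way such a summary can be computed in base R. It is a minimal illustration, not the hidden chunk itself; the file paths are assumptions based on the usual layout of the unzipped dataset and may differ from the ones actually used.

```r
# File paths are assumed from the standard layout of the unzipped dataset.
files <- c(Twitter = "final/en_US/en_US.twitter.txt",
           Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(Size_in_MB = round(file.size(path) / 1024^2, 1),
             Line_Count = length(lines),
             Word_Count = sum(lengths(strsplit(lines, "\\s+"))))
}

summary_df <- cbind(File_Name = names(files),
                    do.call(rbind, lapply(files, summarise_file)))
print(summary_df, row.names = FALSE)
```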
We load all three files in their entirety. After collecting summary statistics, we take a random 1% sample of each file (using the `runif` function) and combine the samples into one corpus, since processing the full data set is too slow. The resulting sample is then cleaned with simple R functions to remove hashtags and non-English words before further processing.
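The following is a minimal sketch of this sampling and clean-up step, reusing the `files` vector from the sketch above. The regular expressions and the ASCII-based filter for non-English text are illustrative assumptions, not the exact functions used in the report.

```r
set.seed(1234)  # make the sample reproducible (the seed is an assumption)

corpus <- unlist(lapply(files, function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[runif(length(lines)) < 0.01]           # keep ~1% of lines at random
}))

corpus <- gsub("#\\S+", "", corpus)                  # strip hashtags
corpus <- iconv(corpus, "UTF-8", "ASCII", sub = "")  # drop non-ASCII ("non-English") characters
corpus <- tolower(trimws(corpus))
corpus <- corpus[nzchar(corpus)]                     # discard lines emptied by cleaning
```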
We then perform an n-gram analysis. In computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application; here the items are words. N-grams are typically collected from a text or speech corpus.
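To make this concrete, below is an illustrative base-R word n-gram tokenizer; a package such as `tokenizers` or `quanteda` would work equally well and is likely what a full analysis would use.

```r
# Return all n-grams (as space-joined strings) from a single piece of text.
ngrams <- function(text, n) {
  words <- unlist(strsplit(text, "\\s+"))
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams("the quick brown fox", 2)
## [1] "the quick"   "quick brown" "brown fox"
```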
In this report, we present unigram, bigram, and trigram frequency plots, shown below.
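The sketch below shows one way such plots can be generated with `ggplot2`, reusing `corpus` and `ngrams()` from the sketches above. The top-20 cutoff and the horizontal bar layout are assumptions about presentation, not necessarily the settings used for the figures.

```r
library(ggplot2)

plot_top_ngrams <- function(corpus, n, top = 20) {
  grams <- unlist(lapply(corpus, ngrams, n = n))        # n-grams per line
  freq  <- head(sort(table(grams), decreasing = TRUE), top)
  df    <- data.frame(ngram = names(freq), count = as.integer(freq))
  ggplot(df, aes(x = reorder(ngram, count), y = count)) +
    geom_col() +
    coord_flip() +                                      # horizontal bars for readability
    labs(x = sprintf("%d-gram", n), y = "Frequency")
}

plot_top_ngrams(corpus, 1)   # unigrams
plot_top_ngrams(corpus, 2)   # bigrams
plot_top_ngrams(corpus, 3)   # trigrams
```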