Understanding the problem

The goal is to analyze a large corpus of text documents to discover the structure in the data and how words are put together. The corpora were collected from publicly available sources by a web crawler, which checks the language so that the texts consist mainly of the desired language. The first step in building a predictive text model is understanding the distribution of, and the relationships between, the words, tokens, and phrases in the text.

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.

Understand frequencies of words and word pairs - build figures and tables to show the variation in the frequencies of words and word pairs in the data.

Data loading, cleaning & initial analysis

Getting to know the general structure of the data (number of lines, words, characters, etc.) will help in optimizing the model and the program.
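As a hedged sketch of how these summary statistics could be computed (the file names below are assumptions; the actual corpus files may be named differently):

```r
# Minimal sketch: per-file line, word, and character statistics.
# File names are illustrative assumptions.
files <- c(twitter = "en_US.twitter.txt",
           blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt")

summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(strsplit(lines, "\\s+"))
  data.frame(
    fileSizeMB   = round(file.info(path)$size / 1024^2),
    totLines     = length(lines),
    wordsPerLine = round(length(words) / length(lines)),
    charsPerLine = round(sum(nchar(lines)) / length(lines)),
    charsPerWord = round(sum(nchar(words)) / length(words))
  )
}

do.call(rbind, lapply(files, summarize_file))
```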

##   fileName fileSizeMB totLines wordsPerLine charsPerLine charsPerWord
## 1  twitter        159  2360148           13           69            5
## 2    blogs        200   899288           42          232            6
## 3     news        196    77259           35          203            6

Exploratory analysis

Some words are more frequent than others - what are the distributions of word frequencies?
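A minimal sketch of how the word-frequency distribution could be tabulated and plotted, assuming a sample of the corpus is read into `lines` (the file name and sample size below are illustrative):

```r
library(ggplot2)

# Read and tokenize a sample of the corpus (file name is an assumption).
lines  <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
lines  <- sample(lines, 50000)                        # sample for speed
tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]                      # drop empty strings

# Frequency table, sorted from most to least common.
freq <- sort(table(tokens), decreasing = TRUE)
top  <- data.frame(word = names(freq)[1:30], count = as.integer(freq[1:30]))

ggplot(top, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "word", y = "frequency", title = "Most frequent words in the sample")
```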


What are the frequencies of 2-grams and 3-grams in the dataset?
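As a sketch of how 2-gram and 3-gram frequencies could be counted with the NLP package loaded above (this reuses the `tokens` vector from the previous sketch):

```r
library(NLP)

# Count the most frequent n-grams in a token vector.
ngram_freq <- function(tokens, n, top = 10) {
  grams <- vapply(ngrams(tokens, n), paste, character(1), collapse = " ")
  head(sort(table(grams), decreasing = TRUE), top)
}

ngram_freq(tokens, 2)   # most common 2-grams
ngram_freq(tokens, 3)   # most common 3-grams
```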

While the n-grams are not exactly the same, there are repeating words. Two observations from this exercise: (1) the follow-on words differ depending on the type of entry (blog vs. twitter); (2) the most common words are the same, but their frequency of use differs based on the context of the text. Ultimately, the context of the text (i.e., twitter vs. news vs. blogs) plays a significant role in predicting the next word in a sentence. But let’s not jump to conclusions and continue with the exploration…

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

## [1] "Number of unique words for coverage @ 50%"
## [1] 193
## [1] "Number of unique words for coverage $ 90%"
## [1] 10641

As the coverage target increases, the number of unique words needed grows very rapidly, which is not surprising. If we further broke the analysis down by the context of the data, for example blog posts about a sporting event, the set of words needed for a given coverage would be significantly different.
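The coverage numbers above can be derived from the sorted frequency table; a minimal sketch, reusing `freq` from the earlier sketch:

```r
# Number of unique words needed to cover a target share of all word instances.
coverage <- function(freq, target) {
  share <- cumsum(as.numeric(freq)) / sum(freq)
  which(share >= target)[1]
}

coverage(freq, 0.5)   # unique words for 50% coverage
coverage(freq, 0.9)   # unique words for 90% coverage
```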

How do you evaluate how many of the words come from foreign languages?

A simple method would be to use a word dictionary to filter out foreign words. Another would be to maintain a custom dictionary of frequently repeated words, even if a word is not a valid dictionary entry for the language. For example, in technical blogs the word infra usually refers to infrastructure; it is fine to use it in a blog, and removing it might hurt the keyboard's prediction quality for the user.
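A hedged sketch of such dictionary-based filtering; the word-list file and the custom additions below are purely illustrative:

```r
# Hypothetical English word list, one word per line, plus custom domain terms.
dictionary   <- tolower(readLines("english_words.txt"))
custom_words <- c("infra")                     # illustrative custom addition

keep            <- tokens %in% c(dictionary, custom_words)
foreign_share   <- mean(!keep)                 # rough share of out-of-dictionary tokens
filtered_tokens <- tokens[keep]
```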

Can you think of a way to increase the coverage?

Understanding the context of the text is extremely useful. A personalized local dictionary of frequently used phrases would be highly effective; for example, I have used “For example” multiple times in the past three paragraphs. Most users repeat their favorite phrases much more often than we expect. Understanding the lexical structure of the language will help predict the next word even before the user types a single letter after the current word.

Future plans

Based on this preliminary analysis, the Shiny app should focus on the following: