Analyze a large corpus of text documents to discover the structure in the data and how words are put together. The corpora are collected from publicly available sources by a web crawler, which checks the language so that the collected texts consist mainly of the desired language. The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text.
- Exploratory analysis: perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora.
- Understand frequencies of words and word pairs: build figures and tables to understand variation in the frequencies of words and word pairs in the data.
Getting to know the general structure of the data (lines, words, characters, etc.) will help in optimizing both the model and the program.
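As a rough illustration (not the exact code used to produce the summary below), such per-file statistics could be computed along the following lines; the file paths and the helper name are assumptions.

```r
# Sketch of the corpus summary step. Paths are illustrative, assuming the
# corpora sit in "final/en_US/" as the usual en_US.* text files.
files <- c(twitter = "final/en_US/en_US.twitter.txt",
           blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(lengths(strsplit(lines, "\\s+")))
  chars <- sum(nchar(lines))
  data.frame(fileSizeMB   = round(file.size(path) / 1024^2),
             totLines     = length(lines),
             wordsPerLine = round(words / length(lines)),
             charsPerLine = round(chars / length(lines)),
             charsPerWord = round(chars / words))
}

stats <- do.call(rbind, lapply(files, summarise_file))
stats <- cbind(fileName = names(files), stats)
print(stats, row.names = FALSE)
```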
| File    | Size (MB) | Total lines | Words/line | Chars/line | Chars/word |
|---------|-----------|-------------|------------|------------|------------|
| twitter | 159       | 2360148     | 13         | 69         | 5          |
| blogs   | 200       | 899288      | 42         | 232        | 6          |
| news    | 196       | 77259       | 35         | 203        | 6          |
While the n-grams are not exactly the same across sources, there are repeating words. Two observations from this exercise: (1) the follow-on words differ depending on the type of entry (blog vs. twitter); (2) the most common words are the same, but their frequency of use differs based on the context of the text. Ultimately, the context of the text (i.e. twitter vs. news vs. blogs) plays a significant role in predicting the next word in a sentence. But let’s not jump to conclusions and continue with the exploration…
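The n-gram counts behind these observations could be produced along the lines of the sketch below, using base R plus `NLP::ngrams`; the `sample_lines` vector and the cleanup rules are assumptions for illustration, not the report's exact pre-processing.

```r
library(NLP)

# Count the most frequent n-grams in a sample of raw corpus lines.
# `sample_lines` is assumed to hold lines from one corpus (e.g. twitter).
top_ngrams <- function(lines, n = 2, k = 10) {
  tokens <- unlist(strsplit(gsub("[^a-z' ]+", " ", tolower(lines)), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  grams  <- vapply(ngrams(tokens, n), paste, character(1), collapse = " ")
  head(sort(table(grams), decreasing = TRUE), k)
}

top_ngrams(sample_lines, n = 2)   # most frequent word pairs
top_ngrams(sample_lines, n = 3)   # most frequent word triples
```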
## [1] "Number of unique words for coverage @ 50%"
## [1] 193
## [1] "Number of unique words for coverage @ 90%"
## [1] 10641
As the coverage target increases, the number of unique words needed grows very rapidly, which is not surprising given the long-tailed frequency distribution of words. If we further conditioned the analysis on the context of the data (for example, blog posts about a sporting event), the vocabulary needed for a given coverage would be significantly different.
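As a minimal sketch, the coverage numbers above could be computed like this, assuming `tokens` is a cleaned word vector (e.g. as produced in the earlier sketch); the function name is illustrative.

```r
# How many of the most frequent unique words are needed to cover a given
# share of all word occurrences in the sample?
coverage <- function(tokens, target = 0.5) {
  freq <- sort(table(tokens), decreasing = TRUE)
  unname(which(cumsum(freq) / sum(freq) >= target)[1])
}

coverage(tokens, 0.5)   # unique words needed for 50% coverage
coverage(tokens, 0.9)   # unique words needed for 90% coverage
```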
A simple method for handling foreign words would be to filter tokens against a word dictionary. Another method would be to maintain a custom dictionary that adds frequently repeated words, even if they are not valid dictionary entries for the language. For example, in technical blogs the word “infra” usually refers to infrastructure; it might be fine to keep it for blog text, and removing it might hurt the keyboard’s usefulness for such users.
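A brief sketch of such a filter, assuming a hypothetical word-list file `"en_words.txt"` (one word per line) and a small set of custom additions:

```r
# Keep only tokens found in the base dictionary or the custom additions.
dictionary      <- readLines("en_words.txt")
custom_words    <- c("infra")   # domain terms we still want to keep
filtered_tokens <- tokens[tokens %in% c(dictionary, custom_words)]
```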
Understanding the context of the text is extremely useful. A personalized local dictionary of frequently used phrases would be highly effective; for example, I have used “For example” multiple times in the past three paragraphs. Most users repeat their frequent phrases far more often than we might expect. Understanding the lexical structure of the language will help predict the next word even before the user types a single letter after the current word.
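As an illustration of this idea (not the report's model), a simple bigram lookup can already suggest follow-on words before any letter is typed; `tokens` is assumed as before and `predict_next` is a hypothetical helper name.

```r
library(NLP)

# Build bigram counts once, then suggest the most frequent follow-on words.
bigrams     <- vapply(ngrams(tokens, 2L), paste, character(1), collapse = " ")
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

predict_next <- function(word, k = 3) {
  # bigram_freq is already sorted, so the first matches are the most frequent.
  # Assumes a plain word with no regex metacharacters.
  hits <- grep(paste0("^", word, " "), names(bigram_freq), value = TRUE)
  sub("^\\S+ ", "", head(hits, k))
}

predict_next("for")   # top candidate follow-on words after "for"
```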
Based on this preliminary analysis, the Shiny app should focus on the following:
- Keep memory and CPU requirements low for any model (extremely critical for responsive behavior)
- An n-gram based model is certainly worth considering
- Data pre-processing techniques should be expanded further to reduce the app's startup time
- The ability to predict the next 3 words would be a nice addition