Building a Predictive Model - Exploratory Analysis

Jas Sohi
November 16, 2014

Methodology

I quickly discovered that loading the entire corpus at once into the text mining package was not the right approach, as the files were far too large.

I decided to work with a subset of the corpora instead. I wrote a function that reads each line of each source file, randomly selects 10% of the lines, and writes the selection out as 3 new text documents.
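A minimal sketch of this sampling step, written in Python for illustration rather than the R used for the report. The function names, the seed, and the file-handling details are assumptions. Each line is kept independently with probability 0.10, so the output is approximately (not exactly) 10% of the input.

```python
import random


def sample_lines(lines, fraction=0.10, seed=2014):
    """Yield each line independently with probability `fraction`.

    `lines` can be any iterable of strings, including an open file,
    so the full text never has to be held in memory at once.
    """
    rng = random.Random(seed)  # fixed seed (assumed) for a reproducible sample
    for line in lines:
        if rng.random() < fraction:
            yield line


def write_sample(src_path, dst_path, fraction=0.10, seed=2014):
    """Stream src_path and write the sampled lines to dst_path."""
    with open(src_path, encoding="utf-8", errors="ignore") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(sample_lines(src, fraction, seed))
```

Calling `write_sample` once per source file (blogs, news, Twitter) would produce the three smaller documents described above.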

Then I read them back into R and created a corpus object. All summaries (except for line counts) are from this smaller corpus (575 MB). This new corpus is called ovid.

Summary Statistics

Word counts (sample texts)

Blogs: 6,384,564
News: 9,523,789
Twitter: 15,320,493

Line counts (full texts)

Blogs: 899,289
News: 1,010,243
Twitter: 2,360,149
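Counts like these can be produced in a single pass over each file. The sketch below is a Python illustration, not the R code behind the figures above; it assumes a word is any whitespace-separated token, which may differ from how the text mining package tokenizes.

```python
def summarize(lines):
    """Return line and word counts for an iterable of text lines."""
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())  # whitespace tokenization (assumed)
    return {"lines": line_count, "words": word_count}
```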

Interesting Findings

Most of the words are considered sparse; that is, they occur very rarely in the texts.
The corpus is very large (several hundred MB), so I will definitely need to reduce its size so that a user, on a mobile phone for example, can get a prediction in a reasonable amount of time. This will cause a slight reduction in accuracy in exchange for improved performance.
The ordering of cleansing operations is very important. For example, I decided to remove profanity before removing punctuation (since some swear words contain punctuation).
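The effect of cleaning order can be seen in a small Python sketch. The blocklist here is a hypothetical two-word placeholder, and the matching rule (whole tokens bounded by non-word characters) is an assumption; the report's actual word list and R cleaning functions are not shown.

```python
import re
import string

# Hypothetical placeholder blocklist; one entry contains punctuation.
BLOCKLIST = ["darn", "d@rn"]

# Match a blocked word only when it is not embedded in a longer word.
_BLOCKED = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(w) for w in BLOCKLIST) + r")(?!\w)"
)
_PUNCT = str.maketrans("", "", string.punctuation)


def remove_blocked(text):
    return _BLOCKED.sub(" ", text)


def remove_punctuation(text):
    return text.translate(_PUNCT)


def clean(text, blocked_first=True):
    """Lowercase, then apply the two removal steps in the chosen order."""
    text = text.lower()
    steps = [remove_blocked, remove_punctuation]
    if not blocked_first:
        steps.reverse()
    for step in steps:
        text = step(text)
    return " ".join(text.split())  # collapse leftover whitespace
```

With the blocklist applied first, `clean("Well, d@rn! That is darn good.")` gives `"well that is good"`. With `blocked_first=False`, the punctuation step turns `d@rn` into `drn`, which no longer matches the list and survives in the output.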

User GUI

I will use a counter to keep track of the number of spaces the user enters.
A word will be considered anything preceding a space.
Once a user has typed three spaces (three words), the Shiny app will suggest a word, with a dropdown to select other options sorted by probability.
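The trigger rule above can be sketched as follows. This is a Python illustration, not Shiny code, and the function names are assumptions. It also ignores empty pieces produced by consecutive spaces, a small refinement over a raw space counter.

```python
def completed_words(text):
    """Return the words that are already followed by a space."""
    pieces = text.split(" ")
    # The last piece is whatever is still being typed (possibly empty).
    return [p for p in pieces[:-1] if p]


def ready_to_predict(text, min_words=3):
    """True once at least `min_words` words have been completed."""
    return len(completed_words(text)) >= min_words
```

For example, `ready_to_predict("I jump for")` is false because "for" has not been followed by a space yet, while `ready_to_predict("I jump for ")` is true.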

Example

Enter Text: I jump for ________

Example: for 'I jump for', the predicted word should be: joy

Markov-Chain Based Model

These predicted words will be based on a combined probability.
The first part is the probability of different candidate words, such as “joy”, appearing after the phrase “jump for” in the provided corpora - this is a 3-gram model.
The second part is the probability of a fourth word following the three words “I”, “jump”, “for” together, in a 4-gram model. If the training corpus contains these 3 words together, then this model will give higher weight to the word that appears next.
However, my intuition is that there will be a lot of cases where the 4-gram model has no matching case and the 3-gram model will be the only one that gives predictions. This is the strength of using a combination model: it is more robust.
To simplify things I will treat lowercase and uppercase words as the same, so “We” and “we” will be considered the same word. I will convert the user's text to lowercase on the backend only to take care of this.
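One way to combine the two models is a weighted sum of their probabilities, falling back to the 3-gram model alone when the 4-gram model has no match. The sketch below is in Python for illustration; the weighted-sum rule, the weight `w4=0.7`, and all function names are assumptions, since the report does not specify how the probabilities are combined.

```python
from collections import Counter, defaultdict


def train(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()  # case-insensitive, as described above
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])
            table[context][words[i + n - 1]] += 1
    return table


def probabilities(table, context):
    """Relative frequency of each next word after `context` (empty if unseen)."""
    counts = table.get(tuple(context), Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def predict(text, tri, four, w4=0.7, top=3):
    """Rank candidate next words by a weighted sum of 4-gram and 3-gram scores."""
    words = text.lower().split()
    p3 = probabilities(tri, words[-2:])
    p4 = probabilities(four, words[-3:])
    if not p4:
        w4 = 0.0  # no 4-gram match: rely on the 3-gram model only
    scores = {
        w: w4 * p4.get(w, 0.0) + (1 - w4) * p3.get(w, 0.0)
        for w in set(p3) | set(p4)
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]
```

Trained on a toy corpus such as `["I jump for joy", "We jump for joy", "they jump for cover", "you jump for fun"]`, calling `predict("I jump for", tri, four)` ranks "joy" first, and an unseen opening like "Kangaroos jump for" still gets predictions from the 3-gram table.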