Jas Sohi
November 16, 2014
I quickly discovered that working with the entire corpus at once in the text mining package was not the right approach, as the files were far too large.
I decided to work with a subset of the corpora instead. I wrote a function that reads each line of each corpus, randomly selects 10% of the lines, and writes the result to 3 new text documents.
Then I read them back into R and created a corpus object. All summaries (except for the line counts) are from this smaller corpus (575 MB), which I called ovid.
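The sampling step can be sketched roughly as follows. This is a Python illustration, not the R function I actually used; the name `sample_lines`, the seed, and the file paths are hypothetical. It keeps each line independently with probability 0.10, so the output is approximately (not exactly) 10% of the input.

```python
import random

def sample_lines(src_path, dst_path, fraction=0.10, seed=1234):
    """Copy roughly `fraction` of the lines in src_path to dst_path.

    Lines are streamed one at a time, so the full file never has to
    fit in memory. Returns the number of lines written.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = 0
    with open(src_path, encoding="utf-8", errors="ignore") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Keep this line with probability `fraction`.
            if rng.random() < fraction:
                dst.write(line)
                kept += 1
    return kept
```

Calling it once per source file (for example `sample_lines("blogs.txt", "blogs_sample.txt")`, with placeholder file names) would produce the three smaller documents.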
Blogs, News, Twitter (respectively)
Word counts (sample texts)
Summary Statistics Line counts (full texts)
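For reference, line and word counts of this kind can be computed with a single pass over each file. The sketch below is a Python illustration, not the R code behind the figures above; `count_lines_and_words` is a hypothetical helper, and it assumes that a "word" is any whitespace-separated token.

```python
def count_lines_and_words(path):
    """Return (line_count, word_count) for a text file.

    A word is taken to be any whitespace-separated token, which is an
    assumption; a real tokenizer may count differently.
    """
    lines = 0
    words = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            lines += 1
            words += len(line.split())
    return lines, words
```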
Enter Text: I jump for joy ________
*Example: for the input 'I jump for', the predicted word should be 'joy'.
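One simple way to get this behaviour is a frequency lookup on the last few words of the input. The sketch below is only an illustration in Python under that assumption; it is not a description of the final model, and `build_model` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_model(lines, n=3):
    """Map each n-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n):
            context = tuple(tokens[i:i + n])
            model[context][tokens[i + n]] += 1
    return model

def predict_next(model, text, n=3):
    """Return the most frequent word seen after the last n words of text,
    or None if that context never occurred in the training lines."""
    context = tuple(text.lower().split()[-n:])
    counts = model.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

With training lines in which 'I jump for' is most often followed by 'joy', `predict_next(model, "I jump for")` returns that word. A practical predictor would also need a fallback to shorter contexts for inputs it has never seen.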