Milestone Report

File Summary

File               Size    Lines     Words
en_US.blogs.txt    201 MB  899288    37334131
en_US.news.txt     197 MB  1010242   34372530
en_US.twitter.txt  160 MB  2360148   30373583

Bigram Analysis

N-gram statistics were generated using the Perl N-gram Statistics Package (NSP). [1] NSP tokenizes the text and counts the occurrences of n-grams as well as of the individual elements of those n-grams. Below is an example from the Twitter corpus.

[Figure: example plot of n-gram statistics from the Twitter corpus]
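The tokenize-and-count step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not NSP itself and not its output format: the tokenization rule (lowercased runs of letters and apostrophes), the function names `tokenize` and `count_ngrams`, and the three sample lines are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed, simplified rule: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def count_ngrams(lines, n=2):
    """Count n-grams and individual tokens; n-grams never span two lines."""
    ngram_counts = Counter()
    token_counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        token_counts.update(tokens)
        # zip over n shifted copies of the token list yields consecutive n-tuples.
        ngram_counts.update(zip(*(tokens[i:] for i in range(n))))
    return ngram_counts, token_counts

# Hypothetical sample lines standing in for the corpus files.
sample = ["I love this", "I love that", "you love this"]
bigrams, unigrams = count_ngrams(sample, n=2)
print(bigrams.most_common(3))
print(unigrams["love"])
```

Counting per line keeps an n-gram from joining the end of one tweet or post to the start of the next, which seems a reasonable choice for line-oriented corpora like these.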

Strategy for Final Analysis and App

In addition to the bigram analysis, a 3-gram frequency analysis will be used to generate frequency tables and probabilities.
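One possible way to turn 3-gram counts into probabilities is a maximum-likelihood estimate, where the probability of a third word given the two before it is the 3-gram count divided by the count of that two-word context. The report does not say which estimator will be used, so the sketch below is an assumption, as are the helper names `trigram_table` and `next_word_probs` and the toy corpus.

```python
from collections import Counter

def trigram_table(token_lists):
    """Build a 3-gram frequency table plus counts of each two-word context."""
    trigram_counts = Counter()
    context_counts = Counter()
    for tokens in token_lists:
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            trigram_counts[(w1, w2, w3)] += 1
            context_counts[(w1, w2)] += 1
    return trigram_counts, context_counts

def next_word_probs(trigram_counts, context_counts, w1, w2):
    """Assumed estimator: P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)."""
    total = context_counts[(w1, w2)]
    if total == 0:
        return {}  # unseen context: no estimate without smoothing or backoff
    return {
        w3: count / total
        for (a, b, w3), count in trigram_counts.items()
        if (a, b) == (w1, w2)
    }

# Hypothetical pre-tokenized lines.
corpus = [["i", "love", "this"], ["i", "love", "that"], ["i", "love", "this"]]
tri, ctx = trigram_table(corpus)
probs = next_word_probs(tri, ctx, "i", "love")
print(probs)
```

An unseen context returns an empty table here. A real application would likely need smoothing or a fallback to bigram counts, which this sketch leaves out.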

Footnotes

[1]: Satanjeev Banerjee and Ted Pedersen http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf