Milestone Report & Basic Summary

The en_US folder currently contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Word and Line Counts

en_US.blogs.txt has a word count total of 37334131

en_US.news.txt has a word count total of 34372530

en_US.twitter.txt has a word count total of 30373583

en_US.blogs.txt has a line count total of 899288

en_US.news.txt has a line count total of 1010242

en_US.twitter.txt has a line count total of 2360148
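
Counts like these can be reproduced with a few lines of base R. The sketch below is only an illustration: count_file() is a hypothetical helper, and it assumes words are separated by whitespace, so its word totals may differ slightly from the figures above depending on the tokenizer used. The file is opened in binary mode so that stray control characters do not cut the read short.

```r
# count_file() is a hypothetical helper, not taken from the report.
# It returns the number of lines and whitespace-separated words in a file.
count_file <- function(path) {
  con <- file(path, open = "rb")          # binary mode avoids early end-of-file
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split every line on runs of whitespace and drop empty tokens
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  data.frame(file = basename(path),
             lines = length(lines),
             words = length(tokens))
}

# Small self-contained demonstration on a temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "the lazy dog"), demo_path)
demo_counts <- count_file(demo_path)
print(demo_counts)
```

Calling count_file("en_US/en_US.blogs.txt") would apply the same helper to one of the real files, assuming that relative path exists.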

Sample data is taken from 10,000 lines of each file.
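
One way to draw such a sample is sketched below. It assumes the 10,000 lines are chosen at random with a fixed seed for reproducibility; sample_lines() and the seed value are hypothetical. Reading only the first 10,000 lines with readLines(con, n = 10000) would be a simpler alternative.

```r
# sample_lines() is a hypothetical helper; the seed value is an assumption.
sample_lines <- function(lines, n = 10000, seed = 1234) {
  set.seed(seed)                          # make the sample reproducible
  size <- min(n, length(lines))           # never request more lines than exist
  lines[sample.int(length(lines), size)]  # sample without replacement
}

# Demonstration on a small character vector instead of a real file
demo_lines <- sprintf("line %d", 1:50)
demo_sample <- sample_lines(demo_lines, n = 10)
length(demo_sample)
```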

Remove:

Converting:
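
A cleaning step of this kind can be sketched in base R. The specific choices here are assumptions made for illustration only: digits and punctuation are removed, whitespace is collapsed, and text is converted to lower case. clean_text() is a hypothetical helper.

```r
# clean_text() is a hypothetical helper; the rules below are assumed,
# not taken from the report.
clean_text <- function(x) {
  x <- tolower(x)                        # convert to lower case
  x <- gsub("[[:digit:]]+", " ", x)      # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)      # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  trimws(x)                              # drop leading and trailing spaces
}

demo_clean <- clean_text("Hello, World! 42 times.")
print(demo_clean)
```

Note that removing punctuation this way also splits contractions such as "it's" into two tokens, which may or may not be desirable for n-gram counts.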

Ngram analysis:

library(ggplot2)  # each chart below plots word frequencies for one sample

ggplot(one20Blogs, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.2)

ggplot(one20News, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "#ffdead") + geom_text(aes(label = Frequency), vjust = -0.2)

ggplot(one20Twitter, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "#f28500") + geom_text(aes(label = Frequency), vjust = -0.2)
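
The plots above expect data frames with Word and Frequency columns. A minimal sketch of how such a table could be built in base R follows. It assumes the text has already been cleaned and that tokens are separated by whitespace; top_ngrams() is a hypothetical helper, and the top-20 cutoff is only inferred from names such as one20Blogs.

```r
# top_ngrams() is a hypothetical helper that builds a Word/Frequency table.
top_ngrams <- function(text, n = 1, top = 20) {
  grams <- unlist(lapply(strsplit(text, "[[:space:]]+"), function(tokens) {
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    # Paste each run of n consecutive tokens into one n-gram
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  counts <- counts[seq_len(min(top, length(counts)))]  # keep the most frequent
  data.frame(Word = names(counts),
             Frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

demo_text <- c("the cat sat", "the cat ran", "a dog ran")
demo_unigrams <- top_ngrams(demo_text, n = 1)
demo_bigrams  <- top_ngrams(demo_text, n = 2)
print(demo_bigrams)
```

Because n-grams are built line by line, they never span two lines of the sample. To keep bars in frequency order rather than alphabetical order, aes(x = reorder(Word, -Frequency), y = Frequency) can be used in the ggplot call.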