There are currently three files in the en_US folder:
File                 Words       Lines
en_US.blogs.txt      37334131    899288
en_US.news.txt       34372530    1010242
en_US.twitter.txt    30373583    2360148
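The report does not show how these totals were obtained. As a minimal sketch, assuming words are counted as whitespace-separated tokens, line and word counts could be computed in base R as follows. The helper name `count_lines_words` and the small temporary file are hypothetical and only there so the example runs without the real corpus.

```r
# Hypothetical helper: count lines and whitespace-separated words in a text file.
count_lines_words <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split each line on runs of whitespace; empty tokens are ignored below.
  tokens <- unlist(strsplit(trimws(lines), "\\s+"))
  c(lines = length(lines), words = sum(nzchar(tokens)))
}

# Tiny stand-in file (assumption: not part of the real data set).
demo <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "the lazy dog"), demo)

counts <- count_lines_words(demo)
print(counts)
```

For the real data the call would look something like `count_lines_words("en_US/en_US.blogs.txt")`, assuming the en_US folder sits in the working directory. A different definition of "word" would give different totals from the ones reported above.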
A sample of 10,000 lines is taken from each file.
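The text does not say whether the 10,000 lines are the first lines of each file or a random draw. The sketch below assumes a reproducible random sample. The helper `sample_lines`, the seed value, and the synthetic input are all hypothetical.

```r
# Hypothetical helper: draw up to n lines at random from a character vector.
sample_lines <- function(lines, n = 10000, seed = 1234) {
  set.seed(seed)  # fixed seed (assumed value) so the sample is reproducible
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Synthetic stand-in for the lines of one corpus file.
demo_lines <- sprintf("line %d", 1:50000)

blogs_sample <- sample_lines(demo_lines)
length(blogs_sample)
```

If the original sample was simply the first 10,000 lines, `head(lines, 10000)` would be the equivalent step.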
Remove:
Converting:
N-gram analysis:
ggplot(one20Blogs, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.2)
ggplot(one20News, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "#ffdead") + geom_text(aes(label = Frequency), vjust = -0.2)
ggplot(one20Twitter, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "#f28500") + geom_text(aes(label = Frequency), vjust = -0.2)
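The plots above expect data frames with a `Word` column and a `Frequency` column, but the report does not show how `one20Blogs`, `one20News`, and `one20Twitter` were built. The following is a rough base R sketch of one way to produce such a table for the most frequent n-grams. The helper `top_ngrams` and its tokenisation rule (lower-casing, keeping only ASCII letters and apostrophes) are assumptions, not the author's method.

```r
# Hypothetical helper: the `top` most frequent n-grams as a Word/Frequency data frame.
top_ngrams <- function(lines, n = 1, top = 20) {
  grams <- unlist(lapply(lines, function(line) {
    # Assumed tokenisation: lower-case, split on anything but a-z and apostrophes.
    words <- strsplit(tolower(line), "[^a-z']+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  keep <- seq_len(min(top, length(freq)))
  data.frame(Word = names(freq)[keep],
             Frequency = as.vector(freq)[keep],
             stringsAsFactors = FALSE)
}

# Two toy lines stand in for a 10,000-line sample.
demo_lines <- c("the cat sat on the mat", "the dog sat on the log")
unigrams <- top_ngrams(demo_lines, n = 1, top = 3)
bigrams  <- top_ngrams(demo_lines, n = 2, top = 20)
print(unigrams)
```

A data frame returned by this helper can be passed to `ggplot()` in place of `one20Blogs`. Note that ggplot2 orders a character `Word` axis alphabetically, so `aes(x = reorder(Word, -Frequency), ...)` may be preferable when the bars should appear in order of frequency.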