Milestone Report & Basic Summary

The en_US folder currently contains three files:

  1. en_US.blogs.txt
  2. en_US.news.txt
  3. en_US.twitter.txt

Word Counts

en_US.blogs.txt has a word count total of 37334131

en_US.news.txt has a word count total of 34372530

en_US.twitter.txt has a word count total of 30373583

Line Counts

en_US.blogs.txt has a line count total of 899288

en_US.news.txt has a line count total of 1010242

en_US.twitter.txt has a line count total of 2360148
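Counts like those above could be reproduced roughly as follows. This is a minimal sketch, not necessarily how the report's figures were computed: `count_lines_words` is a hypothetical helper, words are assumed to be whitespace-separated tokens, and a small temporary file stands in for the real corpus files.

```r
# Hypothetical helper: counts lines and whitespace-separated words in a text file.
count_lines_words <- function(path) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE, skipNul = TRUE)
  # Split each line on runs of whitespace and ignore empty tokens
  tokens <- unlist(strsplit(lines, "\\s+"))
  c(lines = length(lines), words = sum(nzchar(tokens)))
}

# Demonstration on a small temporary file rather than the real corpus
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "the lazy dog"), demo_path)
demo_counts <- count_lines_words(demo_path)
print(demo_counts)
```

A different definition of a word (for example, splitting on punctuation as well) would give different totals, so the exact numbers depend on the tokenisation chosen.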

Analysing 10000 lines of each file

A sample of 10000 lines is taken from each file:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt
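One way to draw such a sample is sketched below. The report does not say whether the 10000 lines are the first lines or a random draw, so random sampling with a fixed seed is an assumption here; `sample_lines` and the seed value are hypothetical, and synthetic lines stand in for a corpus file.

```r
# Hypothetical helper: draws up to n lines at random from a character vector.
sample_lines <- function(lines, n = 10000, seed = 1234) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Demonstration with synthetic lines standing in for one corpus file
fake_lines <- paste("line", seq_len(25000))
blogs_sample <- sample_lines(fake_lines)
print(length(blogs_sample))
```

If the first 10000 lines are wanted instead, `readLines(path, n = 10000)` reads just those without loading the whole file.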

Cleaning data

Remove:

  • common stopwords like “the”
  • punctuation
  • numbers
  • bad words from this list

Convert:

  • to UTF-8
  • to lowercase
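The cleaning steps above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the report's actual pipeline: `clean_text` is a hypothetical helper, `stop_words` and `bad_words` are tiny placeholder lists (the real lists are not reproduced here), and the input is assumed to be mostly valid UTF-8 already.

```r
# Hypothetical stand-ins for the real stopword and bad-word lists
stop_words <- c("the", "a", "an", "and", "of", "to", "in", "is")
bad_words  <- c("darn")

clean_text <- function(lines, drop = c(stop_words, bad_words)) {
  x <- iconv(lines, from = "UTF-8", to = "UTF-8", sub = "")  # drop invalid bytes
  x <- tolower(x)
  x <- gsub("[[:digit:]]+", " ", x)   # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)   # remove punctuation
  # Tokenise on whitespace, then drop unwanted words and empty tokens
  tokens <- strsplit(x, "\\s+")
  lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% drop)])
}

cleaned <- clean_text(c("The 3 dogs ran, and the cat slept!",
                        "Darn: 42 is THE answer."))
print(cleaned)
```

Replacing punctuation with a space splits contractions such as "don't" into two tokens; deleting punctuation outright is an equally reasonable choice and changes the resulting counts slightly.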

Ngram analysis:

  • unigram
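A unigram frequency table can be built with `table()`, as in the sketch below. The column names `Word` and `Frequency` match those used in the plotting calls in this report; `top_unigrams` and `one20Demo` are hypothetical names, and the token vector is a toy example.

```r
# Hypothetical helper: returns the n most frequent tokens as a data frame.
top_unigrams <- function(tokens, n = 20) {
  counts <- sort(table(tokens), decreasing = TRUE)
  counts <- head(counts, n)
  df <- data.frame(Word = names(counts), Frequency = as.integer(counts),
                   stringsAsFactors = FALSE)
  # Keep bars in frequency order instead of alphabetical order when plotted
  df$Word <- factor(df$Word, levels = df$Word)
  df
}

tokens <- c("time", "day", "time", "love", "day", "time")
one20Demo <- top_unigrams(tokens)
print(one20Demo)
```

Storing `Word` as a factor with levels in frequency order means a bar chart of the result is sorted from most to least frequent.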

Frequencies of top 20 words in the sampled files

Histogram of top 20 words in Blogs

library(ggplot2)
ggplot(one20Blogs, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.2)

Histogram of top 20 words in News

ggplot(one20News, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "#ffdead") + geom_text(aes(label = Frequency), vjust = -0.2)

Histogram of top 20 words in Twitter

ggplot(one20Twitter, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "#f28500") + geom_text(aes(label = Frequency), vjust = -0.2)