This page presents an exploratory data analysis of the en_US data set, which consists of three files: twitter, blogs and news. It shows the major features of the data and illustrates important summaries of the data set in the form of tables and plots. It also briefly summarizes the plan for creating the prediction algorithm.
## [1] ".\\final/de_DE" ".\\final/de_DE/de_DE.blogs.txt"
## [3] ".\\final/de_DE/de_DE.news.txt" ".\\final/de_DE/de_DE.twitter.txt"
## [5] ".\\final/en_US" ".\\final/en_US/en_US.blogs.txt"
## [7] ".\\final/en_US/en_US.news.txt" ".\\final/en_US/en_US.twitter.txt"
## [9] ".\\final/fi_FI" ".\\final/fi_FI/fi_FI.blogs.txt"
## [11] ".\\final/fi_FI/fi_FI.news.txt" ".\\final/fi_FI/fi_FI.twitter.txt"
## [13] ".\\final/ru_RU" ".\\final/ru_RU/ru_RU.blogs.txt"
## [15] ".\\final/ru_RU/ru_RU.news.txt" ".\\final/ru_RU/ru_RU.twitter.txt"
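A listing like the one above can be produced by walking the data directory recursively. The following is a minimal Python sketch, not the code that generated this page; the helper name `list_corpus_files` is hypothetical, and the `final` directory name is taken from the output above.

```python
from pathlib import Path


def list_corpus_files(root):
    """Return every directory and file below `root`, sorted, as POSIX-style strings.

    Hypothetical helper for illustration; the original listing was
    produced by other means.
    """
    root = Path(root)
    # rglob("*") yields directories as well as files, matching the
    # mixed folder/file entries in the listing above.
    return sorted(p.as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    # Assumes the data set was unpacked into ./final
    for path in list_corpus_files("final"):
        print(path)
```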
##                       en_US.twitter en_US.blogs en_US.news
## total_number_of_lines       2360148      899288      77259
##                       en_US.twitter en_US.blogs en_US.news
## total_number_of_words      30513860    38487556    2760230
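Totals of this kind can be computed in a single pass over each file. The sketch below is illustrative only: it assumes that a "word" is any whitespace-separated token, which may differ from the tokenization used for the figures above, and the function name `count_lines_and_words` is hypothetical.

```python
def count_lines_and_words(lines):
    """Count lines and whitespace-separated words in an iterable of strings.

    Assumption: a word is any run of non-whitespace characters.
    """
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
    return n_lines, n_words


if __name__ == "__main__":
    # Path taken from the file listing; adjust to the local layout.
    path = "final/en_US/en_US.blogs.txt"
    with open(path, encoding="utf-8", errors="replace") as handle:
        print(count_lines_and_words(handle))
```

Because the function accepts any iterable of strings, it can be applied to an open file handle without loading the whole file into memory.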
Summary statistics of the number of words per line:
##         en_US.twitter en_US.blogs en_US.news
## Mean        12.928791    42.79781   35.72697
## Std Dev      7.185126    47.80498   24.06795
## Min          1.000000     1.00000    1.00000
## Q1           7.000000     9.00000   20.00000
## Median      12.000000    29.00000   33.00000
## Q3          19.000000    61.00000   47.00000
## Max         62.000000  6851.00000 1521.00000
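The same seven statistics can be derived from a list of per-line word counts. This is a rough sketch using only the Python standard library; the function name `summarize` is hypothetical, and the choice of the sample standard deviation and of the "inclusive" quartile method are assumptions, so the quartiles may not match the table above exactly.

```python
import statistics


def summarize(word_counts):
    """Return mean, sample std dev, min, quartiles and max of word counts.

    Assumptions: sample standard deviation (n - 1 denominator) and
    quartiles interpolated with the "inclusive" method.
    """
    q1, median, q3 = statistics.quantiles(word_counts, n=4, method="inclusive")
    return {
        "Mean": statistics.mean(word_counts),
        "Std Dev": statistics.stdev(word_counts),
        "Min": min(word_counts),
        "Q1": q1,
        "Median": median,
        "Q3": q3,
        "Max": max(word_counts),
    }


if __name__ == "__main__":
    # Toy input: words per line for five short lines.
    for name, value in summarize([1, 2, 3, 4, 5]).items():
        print(name, value)
```

As a consistency check on the table itself, each mean equals the total word count divided by the total line count for that file, for example 30513860 / 2360148 ≈ 12.93 for twitter.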
Distribution of words per line for blog entries of up to 200 words
Distribution of words per line for news entries of up to 150 words
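The cutoffs matter because a few very long entries (up to 6851 words for blogs) would otherwise stretch the horizontal axis and hide the shape of the bulk of the data. A minimal sketch of the filtering and binning step follows; the function name `binned_distribution` and the bin width of 10 are assumptions made for illustration, not values from the original plots.

```python
from collections import Counter


def binned_distribution(word_counts, max_words, bin_width=10):
    """Count entries per bin, keeping only entries with at most `max_words` words.

    Each key is the lower edge of a bin, so with bin_width=10 the bins
    are 0-9, 10-19, 20-29, and so on. The bin width is an assumed value.
    """
    bins = Counter()
    for n in word_counts:
        if n <= max_words:
            bins[(n // bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


if __name__ == "__main__":
    # Toy data: two outliers (201 and 6851) are dropped by the 200-word cutoff.
    counts = [1, 5, 12, 19, 200, 201, 6851]
    for lower_edge, freq in binned_distribution(counts, max_words=200).items():
        print(f"{lower_edge:>4} {'#' * freq}")
```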
----------END----------