I have already downloaded the files to the Capstone Project directory.
setwd("~/RDIR/Capstone Project")
list.files("en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
blogs <- readLines("en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("en_US/en_US.twitter.txt", encoding="UTF-8", skipNul = TRUE)
news <- readLines("en_US/en_US.news.txt", encoding="UTF-8", skipNul = TRUE)
## Warning: package 'RWeka' was built under R version 3.1.3
files_summary
## files memory lines words
## 1 blogs 260564320 899288 38308421
## 2 twitter 302322752 2360148 29354795
## 3 news 261759048 1010242 35624448
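The code that builds `files_summary` is not shown. A minimal sketch of how such a table could be assembled follows, assuming memory is measured with `object.size()` (bytes) and words are counted as whitespace-separated tokens. The `*_demo` vectors, `count_words` and `files_summary_demo` are hypothetical names used only for illustration; they are not the report's own code or data.

```r
# Toy stand-ins for the real corpora (assumption: one element per line)
blogs_demo   <- c("First blog post here.", "Another, somewhat longer, blog line.")
twitter_demo <- c("short tweet", "another tweet!", "ok")
news_demo    <- c("A news sentence with six words.")

# Count whitespace-separated, non-empty tokens in a character vector
count_words <- function(x) {
  tokens <- unlist(strsplit(x, "\\s+"))
  sum(nzchar(tokens))
}

corpora <- list(blogs = blogs_demo, twitter = twitter_demo, news = news_demo)

files_summary_demo <- data.frame(
  files  = names(corpora),
  memory = sapply(corpora, function(x) as.numeric(object.size(x))),  # bytes
  lines  = sapply(corpora, length),
  words  = sapply(corpora, count_words),
  row.names = NULL
)
files_summary_demo
```

A different tokenizer (for example one from RWeka) would give somewhat different word counts, so the figures in the table above need not match this counting rule exactly.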
Number of characters per line in each file:
## [1] "Blogs"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40830
## [1] "Twitter"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
## [1] "News"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 110.0 185.0 201.2 268.0 11380.0
Number of words per line in each file:
## [1] "Blogs"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 9.0 29.0 42.6 61.0 6851.0
## [1] "Twitter"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 13.14 19.00 47.00
## [1] "News"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 35.26 47.00 1928.00
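The two groups of summaries above appear to describe characters per line (the Twitter maximum is 140) and words per line. The code is not shown, so this is only a sketch of one way to get such summaries, assuming `nchar()` for line length and a whitespace split for word counts. `lines_demo` is a hypothetical toy vector.

```r
# Toy input (assumption: one element per line of a corpus)
lines_demo <- c("a short line", "a somewhat longer line of text", "x")

# Characters per line
chars_per_line <- nchar(lines_demo)

# Words per line: split on whitespace and count the pieces
words_per_line <- sapply(strsplit(lines_demo, "\\s+"), length)

summary(chars_per_line)
summary(words_per_line)
```

With this counting rule an empty line yields zero words, which would be consistent with the minimum of 0 reported for the blogs.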
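# Keep only ASCII letters and spaces
blogs <- gsub(x = blogs, pattern = "[^A-Za-z ]", replacement = "")
# Lowercase, then collapse runs of two or more spaces into one
blogs <- gsub(x = tolower(blogs), pattern = " {2,}", replacement = " ")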
The same cleaning is applied to the other two files (code not shown).
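One way to avoid repeating the two `gsub()` calls for each file is to wrap them in a helper. This is a sketch only: `clean_text` and the `*_demo` vectors are hypothetical names, and the hidden code in the report may differ.

```r
# Hypothetical helper: same steps as the blogs cleaning, packaged for reuse
clean_text <- function(x) {
  x <- gsub(pattern = "[^A-Za-z ]", replacement = "", x = x)  # keep letters and spaces
  x <- tolower(x)                                             # normalise case
  gsub(pattern = " {2,}", replacement = " ", x = x)           # collapse runs of spaces
}

# Toy stand-ins for the real twitter and news vectors
twitter_demo <- c("Hello,   World!!", "So  MANY    spaces")
news_demo    <- c("Prices rose 3% today")

twitter_demo <- clean_text(twitter_demo)
news_demo    <- clean_text(news_demo)
twitter_demo
news_demo
```

Note that removing digits and punctuation can leave a stray double space that the last step collapses, but leading or trailing spaces are not trimmed by this helper.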
Now look at the distribution of word frequencies in the blogs.
## 25% 50% 75% 95% 99%
## 1 2 6 108 964
Interesting fact: 99% of the distinct words in the blogs occur fewer than 1,000 times each.
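The quantiles above look like quantiles of per-word occurrence counts. A minimal sketch of that computation, assuming the cleaned text is split on spaces and tabulated with `table()`; `blogs_demo`, `words` and `word_freq` are hypothetical names on toy data, not the report's code.

```r
# Toy cleaned text (assumption: lowercase letters and single spaces only)
blogs_demo <- c("the cat sat on the mat", "the dog sat", "a cat")

# Split into words and drop empty tokens
words <- unlist(strsplit(blogs_demo, " +"))
words <- words[nzchar(words)]

# How many times each distinct word occurs
word_freq <- table(words)

# Quantiles of the occurrence counts across distinct words
quantile(as.vector(word_freq), probs = c(0.25, 0.5, 0.75, 0.95, 0.99))
```

Read this way, a 50% quantile of 2 means that half of the distinct words appear at most twice, which is typical of natural-language text.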