This report describes the main features of the corpora that will be used to build a web app based on word frequency.
# Load the three English (en_US) corpora
Blogs <- readLines("./en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
News <- readLines("./en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
Twitter <- readLines("./en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Draw 10,000 random lines from each source
# (the seed value is arbitrary; it is set only so the sample can be reproduced)
set.seed(1234)
twitter <- sample(Twitter, 10000)
news <- sample(News, 10000)
blogs <- sample(Blogs, 10000)
# Combine the three samples into a single character vector
data0 <- c(twitter, news, blogs)
# Saving Sampled data
writeLines(data0, "./data0.txt")
SampleCon <- file("./data0.txt")
Sample <- readLines(SampleCon)
close(SampleCon)
| File | File Size (MB) | Lines | Words |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37334131 |
| News | 196.28 | 77259 | 2643969 |
| Twitter | 159.36 | 2360148 | 128995 |
| Aggregated Sample | 4.78 | 30000 | 30000 |
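The report does not show how the figures in the summary table were obtained. The sketch below is one plausible way to compute them in base R, not necessarily the method used here. The helper `corpus_summary()` and the temporary demo file are hypothetical names introduced for illustration, and words are assumed to be whitespace-separated tokens.

```r
# Hypothetical helper (not part of the original report):
# summarise one text file as size in MB, line count, and word count.
corpus_summary <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # Assumption: a "word" is any run of non-whitespace characters.
  tokens <- strsplit(trimws(lines), "\\s+")
  data.frame(
    File = basename(path),
    SizeMB = round(file.info(path)$size / 1024^2, 2),
    Lines = length(lines),
    Words = sum(lengths(tokens)),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a small temporary file so the block runs on its own.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over the lazy dog"), demo_path)
demo_summary <- corpus_summary(demo_path)
print(demo_summary)
```

Applied to the real corpora, the same helper could be called once per file and the resulting rows combined with `rbind()`.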
Sampled data:
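library(qdap)  # provides freq_terms()
# Extract the 25 most frequent terms in the aggregated sample
frequent_terms_sample <- freq_terms(Sample, 25)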
plot(frequent_terms_sample)
knitr::kable(head(frequent_terms_sample,10))
| | WORD | FREQ |
|---|---|---|
| 49720 | the | 43187 |
| 50435 | to | 23725 |
| 3109 | and | 22587 |
| 1464 | a | 21030 |
| 34990 | of | 18410 |
| 24939 | in | 14673 |
| 24468 | i | 12944 |
| 49684 | that | 9417 |
| 19453 | for | 9051 |
| 26031 | is | 8864 |
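To make the counting behind a table like this concrete, here is a minimal base R sketch of a word-frequency tally. It is an illustration under simple assumptions (lowercasing, splitting on anything that is not a letter or apostrophe), not a reproduction of how `freq_terms()` works internally. The `demo_text` vector stands in for the `Sample` vector used in the report.

```r
# Illustrative input; in the report this would be the `Sample` vector.
demo_text <- c("The cat and the dog", "the dog and a bird")

# Lowercase, then split on any run of characters other than letters or apostrophes.
words <- unlist(strsplit(tolower(demo_text), "[^a-z']+"))
words <- words[nzchar(words)]  # drop empty strings left by the split

# Tally each distinct word and order from most to least frequent.
counts <- sort(table(words), decreasing = TRUE)
top_terms <- data.frame(
  WORD = names(counts),
  FREQ = as.integer(counts),
  stringsAsFactors = FALSE
)
print(head(top_terms, 3))
```

As in the table above, short function words dominate the top of the ranking, which is why stop-word handling usually matters for a frequency-based app.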