This report describes the main features of the corpora that will be used to build a web app based on word frequency.

Loading and Sampling the Data

Blogs <- readLines("./en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
News <- readLines("./en_US/en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
Twitter <- readLines("./en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)

# Draw 10,000 random lines from each source.
# The seed value is arbitrary; fixing it makes the sample reproducible.
set.seed(1234)
twitter <- sample(Twitter, 10000)
news <- sample(News, 10000)
blogs <- sample(Blogs, 10000)
data0 <- c(twitter, news, blogs)

# Save the sampled data to disk
writeLines(data0, "./data0.txt")

# Read the saved sample back in
SampleCon <- file("./data0.txt")
Sample <- readLines(SampleCon)
close(SampleCon)

Corpus Composition

File               File Size (MB)   Lines     Words
Blogs              200.42           899288    37334131
News               196.28           77259     2643969
Twitter            159.36           2360148   128995
Aggregated Sample  4.78             30000     30000
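
The report does not show how the figures in this table were computed. The following is a minimal base R sketch of one way to obtain the same kind of summary, not the code actually used here. The helper name `corpus_stats` is hypothetical, and words are assumed to be whitespace-separated tokens, so the counts may differ from those produced by other tokenizers.

```r
# Hypothetical helper: size, line count and word count for one text file.
corpus_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Assumption: a "word" is any run of non-whitespace characters.
  tokens <- unlist(strsplit(trimws(lines), "\\s+"))
  data.frame(
    File   = basename(path),
    SizeMB = round(file.size(path) / 1024^2, 2),
    Lines  = length(lines),
    Words  = sum(nzchar(tokens)),
    stringsAsFactors = FALSE
  )
}

# Small self-contained demonstration on a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "the lazy dog"), tmp)
stats <- corpus_stats(tmp)
print(stats)
```

Applied to the real corpora, the same helper could be called on each path and the results stacked with `rbind()`, for example `do.call(rbind, lapply(paths, corpus_stats))`, where `paths` is a character vector of file paths.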

Most Frequent Words

Sampled data:

# freq_terms() is assumed to come from the qdap package
library(qdap)
frequent_terms_sample <- freq_terms(Sample, 25)
plot(frequent_terms_sample)

knitr::kable(head(frequent_terms_sample,10))
        WORD   FREQ
49720   the    43187
50435   to     23725
3109    and    22587
1464    a      21030
34990   of     18410
24939   in     14673
24468   i      12944
49684   that   9417
19453   for    9051
26031   is     8864
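
To make the counting step behind this table concrete, here is a rough base R approximation of a word-frequency count. It is a sketch only and not how `freq_terms()` is implemented: the helper name `top_terms` is hypothetical, and the tokenization rule (lowercase everything, split on anything other than letters and apostrophes) is an assumption, so the counts may not match the table above exactly.

```r
# Hypothetical helper: the n most frequent words in a character vector.
top_terms <- function(text, n = 10) {
  # Assumption: lowercase the text and split on any run of characters
  # that is not a letter or an apostrophe.
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  counts <- sort(table(tokens), decreasing = TRUE)
  top <- head(counts, n)
  data.frame(WORD = names(top), FREQ = as.integer(top),
             stringsAsFactors = FALSE)
}

# Small self-contained demonstration.
demo_text <- c("The cat and the dog", "the end")
top <- top_terms(demo_text, 3)
print(top)
```

On the sampled data, `top_terms(Sample, 10)` would return a comparable ranking. Very common function words dominate such a list, which is consistent with the table above.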