For this report i conducted an exploratory analysis of the provided documents. As training data i randomly selected 15% of the lines of each document.
Twitter has the lowest total word count, but has the most unique word count and the most lines. This is not surprising, because twitter has usually short texts that contain spelling mistakes or slang words. The blog data is the exact opposite, it has the most words, but the least amount of lines and unique words and the news data is right in the middle.
For data exploration i first load the data from the files and select 15% of the lines of each document as training data.
Computing line and word counts. For the word counts i count the occurence of the ” ” character and add 1 for each line.
docStats <- tibble(name = c("blogs", "news", "twitter"), lineCount = c(0L, 0L, 0L), wordCount = c(0L, 0L, 0L));
docStats[1,]$lineCount <- length(linesBlogs)
docStats[2,]$lineCount <- length(linesNews)
docStats[3,]$lineCount <- length(linesTwitter)
docStats[1,]$wordCount <- sum(str_count(linesBlogs, " ")) + length(linesBlogs);
docStats[2,]$wordCount <- sum(str_count(linesNews, " ")) + length(linesNews);
docStats[3,]$wordCount <- sum(str_count(linesTwitter, " ")) + length(linesTwitter);
totalLineCount <- sum(docStats$lineCount);
totalWordCount <- sum(docStats$wordCount);
docStats$lineCountPercent <- sapply(docStats$lineCount, function(x) x / totalLineCount);
docStats$wordCountPercent <- sapply(docStats$wordCount, function(x) x / totalWordCount);
docStats;
## # A tibble: 3 × 5
## name lineCount wordCount lineCountPercent wordCountPercent
## <chr> <int> <int> <dbl> <dbl>
## 1 blogs 134669 5579095 0.210 0.364
## 2 news 151841 5164160 0.237 0.337
## 3 twitter 354854 4566442 0.553 0.298
While Twitter has by far the most lines it has the lowest word count.
(ggplot(docStats, aes(x = name, y = lineCountPercent, fill = name))+ geom_col()) +
(ggplot(docStats, aes(x = name, y = wordCountPercent, fill = name)) + geom_col()) + plot_layout(ncol = 1)
To further analyze the texts i flatten the lines into one large string for each document, create a corpus and tokenize it.
linesBlogs <- paste(linesBlogs, collapse = '');
linesNews <- paste(linesNews, collapse = '');
linesTwitter <- paste(linesTwitter, collapse = '');
groupNames <- c("blogs", "news", "twitter");
selectedLines <- tibble(id = groupNames, data = c(linesBlogs, linesNews, linesTwitter));
swearWords <- read_lines("swearWords.txt");
corp <- corpus(selectedLines, docid_field = "id", text_field = "data");
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
remove_url = TRUE, remove_separators = TRUE, padding = FALSE);
toks <- tokens_tolower(toks);
toks <- tokens_select(toks, pattern = swearWords, selection = "remove");
Creating a document-feature matrix i compute the word frequencies for each document.
dfTokens <- dfm(toks) %>% dfm_group;
statsTokens <- textstat_frequency(dfTokens, groups = groupNames, ties_method = "first") %>% as_tibble;
Taking a look at the first and last 10 entries sorted by frequency, it seems that twitter has a lot of low frequency words.
head(statsTokens, 10);
## # A tibble: 10 × 5
## feature frequency rank docfreq group
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 the 270690 1 1 blogs
## 2 and 161358 2 1 blogs
## 3 to 158607 3 1 blogs
## 4 a 132950 4 1 blogs
## 5 of 130217 5 1 blogs
## 6 i 108169 6 1 blogs
## 7 in 86797 7 1 blogs
## 8 that 67532 8 1 blogs
## 9 is 64127 9 1 blogs
## 10 it 57828 10 1 blogs
tail(statsTokens, 10);
## # A tibble: 10 × 5
## feature frequency rank docfreq group
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 ocm 1 271424 1 twitter
## 2 bocas 1 271425 1 twitter
## 3 byemamp 1 271426 1 twitter
## 4 center.baby 1 271427 1 twitter
## 5 now.nodaysoff_atc 1 271428 1 twitter
## 6 amazing.not 1 271429 1 twitter
## 7 nandina.roll 1 271430 1 twitter
## 8 kidzui 1 271431 1 twitter
## 9 kidzui.anyone 1 271432 1 twitter
## 10 updyke-type 1 271433 1 twitter
The twitter data has the most tokens with a frequency of 1.
statsTokens %>% group_by(group) %>% filter(frequency == 1) %>% summarize(n = n())
## # A tibble: 3 × 2
## group n
## <chr> <int>
## 1 blogs 135681
## 2 news 141313
## 3 twitter 206978
Now comparing the unique word count against the total word count Twitter has the most unique words.
uniqueWordsPerGroup <- statsTokens %>% group_by(group) %>% summarize(n = n());
uniqueWordsPerGroup$wordCount <- docStats$wordCount;
tibble(name = uniqueWordsPerGroup$group, percent = uniqueWordsPerGroup$n / uniqueWordsPerGroup$wordCount * 100)
## # A tibble: 3 × 2
## name percent
## <chr> <dbl>
## 1 blogs 3.62
## 2 news 4.10
## 3 twitter 5.94
Plotting the word cloud of the most frequent 200 words for each document.
textplot_wordcloud(dfTokens, comparison = TRUE, max_words = 200);
Function loadLineFile:
loadLineFile <- function(path, srcPath, p = 0.15)
{
if(file.exists(path))
{
lines <- read_lines(path);
return(lines);
}
lines <- read_lines(srcPath, skip_empty_rows = TRUE);
selected <- rbinom(length(lines), 1, p) == 1;
lines <- lines[selected];
write_lines(lines, path);
lines;
}
Session info
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-redhat-linux-gnu
## Running under: Fedora Linux 42 (Workstation Edition)
##
## Matrix products: default
## BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Berlin
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] patchwork_1.3.1 caret_7.0-1
## [3] ggplot2_3.5.2 quanteda.textmodels_0.9.10
## [5] quanteda.textplots_0.95 quanteda.textstats_0.97.2
## [7] quanteda_4.3.1 readtext_0.91
## [9] dplyr_1.1.4 readr_2.1.5
## [11] lattice_0.22-7 stringr_1.5.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 farver_2.1.2
## [4] fastmap_1.2.0 pROC_1.18.5 digest_0.6.37
## [7] rpart_4.1.24 timechange_0.3.0 lifecycle_1.0.4
## [10] survival_3.8-3 magrittr_2.0.3 compiler_4.5.1
## [13] rlang_1.1.6 sass_0.4.10 tools_4.5.1
## [16] utf8_1.2.6 yaml_2.3.10 data.table_1.17.8
## [19] knitr_1.50 labeling_0.4.3 stopwords_2.3
## [22] bit_4.6.0 plyr_1.8.9 RColorBrewer_1.1-3
## [25] withr_3.0.2 purrr_1.1.0 nnet_7.3-20
## [28] grid_4.5.1 stats4_4.5.1 future_1.58.0
## [31] globals_0.18.0 scales_1.4.0 iterators_1.0.14
## [34] MASS_7.3-65 cli_3.6.5 crayon_1.5.3
## [37] rmarkdown_2.29 generics_0.1.4 rstudioapi_0.17.1
## [40] future.apply_1.20.0 httr_1.4.7 reshape2_1.4.4
## [43] tzdb_0.5.0 cachem_1.1.0 splines_4.5.1
## [46] parallel_4.5.1 CoprManager_0.5.7 vctrs_0.6.5
## [49] hardhat_1.4.1 glmnet_4.1-10 Matrix_1.7-3
## [52] jsonlite_2.0.0 hms_1.1.3 bit64_4.6.0-1
## [55] listenv_0.9.1 foreach_1.5.2 gower_1.0.2
## [58] jquerylib_0.1.4 recipes_1.3.1 glue_1.8.0
## [61] parallelly_1.45.0 codetools_0.2-20 lubridate_1.9.4
## [64] stringi_1.8.7 shape_1.4.6.1 gtable_0.3.6
## [67] tibble_3.3.0 pillar_1.11.0 htmltools_0.5.8.1
## [70] ipred_0.9-15 lava_1.8.1 R6_2.6.1
## [73] vroom_1.6.5 evaluate_1.0.4 bslib_0.9.0
## [76] class_7.3-23 Rcpp_1.1.0 fastmatch_1.1-6
## [79] nsyllable_1.0.1 nlme_3.1-168 prodlim_2025.04.28
## [82] xfun_0.52 pkgconfig_2.0.3 ModelMetrics_1.2.2.2