Data Science Capstone Milestone Report 1

Summary

For this report i conducted an exploratory analysis of the provided documents. As training data i randomly selected 15% of the lines of each document.

Twitter has the lowest total word count, but has the most unique word count and the most lines. This is not surprising, because twitter has usually short texts that contain spelling mistakes or slang words. The blog data is the exact opposite, it has the most words, but the least amount of lines and unique words and the news data is right in the middle.

Data Exploration

For data exploration i first load the data from the files and select 15% of the lines of each document as training data.

Computing line and word counts. For the word counts i count the occurence of the ” ” character and add 1 for each line.

docStats <- tibble(name = c("blogs", "news", "twitter"), lineCount = c(0L, 0L, 0L), wordCount = c(0L, 0L, 0L));
docStats[1,]$lineCount <- length(linesBlogs)
docStats[2,]$lineCount <- length(linesNews)
docStats[3,]$lineCount <- length(linesTwitter)

docStats[1,]$wordCount <- sum(str_count(linesBlogs, " ")) + length(linesBlogs);
docStats[2,]$wordCount <- sum(str_count(linesNews, " ")) + length(linesNews);
docStats[3,]$wordCount <- sum(str_count(linesTwitter, " ")) + length(linesTwitter);

totalLineCount <- sum(docStats$lineCount);
totalWordCount <- sum(docStats$wordCount);

docStats$lineCountPercent <- sapply(docStats$lineCount, function(x) x / totalLineCount);
docStats$wordCountPercent <- sapply(docStats$wordCount, function(x) x / totalWordCount);

docStats;

## # A tibble: 3 × 5
##   name    lineCount wordCount lineCountPercent wordCountPercent
##   <chr>       <int>     <int>            <dbl>            <dbl>
## 1 blogs      134669   5579095            0.210            0.364
## 2 news       151841   5164160            0.237            0.337
## 3 twitter    354854   4566442            0.553            0.298

While Twitter has by far the most lines it has the lowest word count.

(ggplot(docStats, aes(x = name, y = lineCountPercent, fill = name))+ geom_col()) + 
  (ggplot(docStats, aes(x = name, y = wordCountPercent, fill = name)) + geom_col()) + plot_layout(ncol = 1)

To further analyze the texts i flatten the lines into one large string for each document, create a corpus and tokenize it.

linesBlogs <- paste(linesBlogs, collapse = '');
linesNews <- paste(linesNews, collapse = '');
linesTwitter <- paste(linesTwitter, collapse = '');

groupNames <- c("blogs", "news", "twitter");
selectedLines <- tibble(id = groupNames, data = c(linesBlogs, linesNews, linesTwitter));

swearWords <- read_lines("swearWords.txt");
  
corp <- corpus(selectedLines, docid_field = "id", text_field = "data");
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, 
                 remove_url = TRUE, remove_separators = TRUE, padding = FALSE);
toks <- tokens_tolower(toks);
toks <- tokens_select(toks, pattern = swearWords, selection = "remove");

Creating a document-feature matrix i compute the word frequencies for each document.

dfTokens <- dfm(toks) %>% dfm_group;
statsTokens <- textstat_frequency(dfTokens, groups = groupNames, ties_method = "first") %>% as_tibble;

Taking a look at the first and last 10 entries sorted by frequency, it seems that twitter has a lot of low frequency words.

head(statsTokens, 10);

## # A tibble: 10 × 5
##    feature frequency  rank docfreq group
##    <chr>       <dbl> <dbl>   <dbl> <chr>
##  1 the        270690     1       1 blogs
##  2 and        161358     2       1 blogs
##  3 to         158607     3       1 blogs
##  4 a          132950     4       1 blogs
##  5 of         130217     5       1 blogs
##  6 i          108169     6       1 blogs
##  7 in          86797     7       1 blogs
##  8 that        67532     8       1 blogs
##  9 is          64127     9       1 blogs
## 10 it          57828    10       1 blogs

tail(statsTokens, 10);

## # A tibble: 10 × 5
##    feature           frequency   rank docfreq group  
##    <chr>                 <dbl>  <dbl>   <dbl> <chr>  
##  1 ocm                       1 271424       1 twitter
##  2 bocas                     1 271425       1 twitter
##  3 byemamp                   1 271426       1 twitter
##  4 center.baby               1 271427       1 twitter
##  5 now.nodaysoff_atc         1 271428       1 twitter
##  6 amazing.not               1 271429       1 twitter
##  7 nandina.roll              1 271430       1 twitter
##  8 kidzui                    1 271431       1 twitter
##  9 kidzui.anyone             1 271432       1 twitter
## 10 updyke-type               1 271433       1 twitter

The twitter data has the most tokens with a frequency of 1.

statsTokens %>% group_by(group) %>% filter(frequency == 1) %>% summarize(n = n())

## # A tibble: 3 × 2
##   group        n
##   <chr>    <int>
## 1 blogs   135681
## 2 news    141313
## 3 twitter 206978

Now comparing the unique word count against the total word count Twitter has the most unique words.

uniqueWordsPerGroup <- statsTokens %>% group_by(group) %>% summarize(n = n());
uniqueWordsPerGroup$wordCount <- docStats$wordCount;
tibble(name = uniqueWordsPerGroup$group, percent = uniqueWordsPerGroup$n / uniqueWordsPerGroup$wordCount * 100)

## # A tibble: 3 × 2
##   name    percent
##   <chr>     <dbl>
## 1 blogs      3.62
## 2 news       4.10
## 3 twitter    5.94

Plotting the word cloud of the most frequent 200 words for each document.

textplot_wordcloud(dfTokens, comparison = TRUE, max_words = 200);

Appendix

Function loadLineFile:

loadLineFile <- function(path, srcPath, p = 0.15)
{
  if(file.exists(path))
  {
    lines <- read_lines(path);
    return(lines);
  }
  
  lines <- read_lines(srcPath, skip_empty_rows = TRUE);
  selected <- rbinom(length(lines), 1, p) == 1;
  lines <- lines[selected];
  
  write_lines(lines, path);
  
  lines;
}

Session info

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: x86_64-redhat-linux-gnu
## Running under: Fedora Linux 42 (Workstation Edition)
## 
## Matrix products: default
## BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Berlin
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] patchwork_1.3.1            caret_7.0-1               
##  [3] ggplot2_3.5.2              quanteda.textmodels_0.9.10
##  [5] quanteda.textplots_0.95    quanteda.textstats_0.97.2 
##  [7] quanteda_4.3.1             readtext_0.91             
##  [9] dplyr_1.1.4                readr_2.1.5               
## [11] lattice_0.22-7             stringr_1.5.1             
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4041.110    farver_2.1.2        
##  [4] fastmap_1.2.0        pROC_1.18.5          digest_0.6.37       
##  [7] rpart_4.1.24         timechange_0.3.0     lifecycle_1.0.4     
## [10] survival_3.8-3       magrittr_2.0.3       compiler_4.5.1      
## [13] rlang_1.1.6          sass_0.4.10          tools_4.5.1         
## [16] utf8_1.2.6           yaml_2.3.10          data.table_1.17.8   
## [19] knitr_1.50           labeling_0.4.3       stopwords_2.3       
## [22] bit_4.6.0            plyr_1.8.9           RColorBrewer_1.1-3  
## [25] withr_3.0.2          purrr_1.1.0          nnet_7.3-20         
## [28] grid_4.5.1           stats4_4.5.1         future_1.58.0       
## [31] globals_0.18.0       scales_1.4.0         iterators_1.0.14    
## [34] MASS_7.3-65          cli_3.6.5            crayon_1.5.3        
## [37] rmarkdown_2.29       generics_0.1.4       rstudioapi_0.17.1   
## [40] future.apply_1.20.0  httr_1.4.7           reshape2_1.4.4      
## [43] tzdb_0.5.0           cachem_1.1.0         splines_4.5.1       
## [46] parallel_4.5.1       CoprManager_0.5.7    vctrs_0.6.5         
## [49] hardhat_1.4.1        glmnet_4.1-10        Matrix_1.7-3        
## [52] jsonlite_2.0.0       hms_1.1.3            bit64_4.6.0-1       
## [55] listenv_0.9.1        foreach_1.5.2        gower_1.0.2         
## [58] jquerylib_0.1.4      recipes_1.3.1        glue_1.8.0          
## [61] parallelly_1.45.0    codetools_0.2-20     lubridate_1.9.4     
## [64] stringi_1.8.7        shape_1.4.6.1        gtable_0.3.6        
## [67] tibble_3.3.0         pillar_1.11.0        htmltools_0.5.8.1   
## [70] ipred_0.9-15         lava_1.8.1           R6_2.6.1            
## [73] vroom_1.6.5          evaluate_1.0.4       bslib_0.9.0         
## [76] class_7.3-23         Rcpp_1.1.0           fastmatch_1.1-6     
## [79] nsyllable_1.0.1      nlme_3.1-168         prodlim_2025.04.28  
## [82] xfun_0.52            pkgconfig_2.0.3      ModelMetrics_1.2.2.2

Data Science Capstone Milestone Report 1

Dennis Wandschura

2025-07-25

Summary

Data Exploration

Appendix