Datcha

Tracking the Ephemeral Web: A Reproducible Computational Tool for Quantifying Temporal Change in Longitudinal Social Media Datasets

Authors

Kunjan Shah

Dr Yannik Peters

Published

May 12, 2026

At a glance

Table of Content

Introduction

Setup

Tool application

Conclusion and recommendations

Introduction

Social media platforms are inherently dynamic environments in which content is continuously created, modified, and removed. When researchers collect datasets from such platforms at different points in time, the resulting snapshots are rarely identical, i.e., posts may be deleted by users or platform moderation systems, new content may emerge, and existing posts may undergo textual revision. These temporal discrepancies, if left unexamined, can introduce systematic bias into longitudinal analyses and compromise the reproducibility of findings.

Datcha (Data Temporal Change Analyzer) addresses this methodological gap by providing a structured, reproducible framework for detecting and quantifying three core types of between-collection changes: deletions, additions, and text-level edits. By aligning two temporally distinct dataset snapshots through unique post identifiers, the tool enables researchers to assess data consistency, estimate content volatility, and make informed decisions about dataset suitability prior to downstream analysis.

This tool is designed with GDPR compliance in mind and assumes fully anonymized input data. It is particularly relevant for studies in computational social science, platform studies, and NLP-driven content analysis, where dataset integrity directly affects the validity of empirical conclusions.

Set-up

Prior to executing the analysis, the environment must be configured with the necessary dependencies. This section outlines the required libraries and the procedure for loading the two temporally distinct datasets that form the basis of the comparison.

Library Imports

The following libraries support the core analytical pipeline, covering data manipulation, text preprocessing, statistical analysis, and visualization. Dependencies should be installed prior to execution if not already available in the working environment.

#library(shiny)
#library(shinyjs)        # For enhanced UI functionality
library(dplyr)          # Data manipulation
library(tm)             # Text mining
library(topicmodels)    # Topic modeling
library(sentimentr)     # Sentiment analysis
#library(wordcloud2)     # Word cloud visualization
library(highcharter)    # Interactive charts
library(tidytext)       # Text processing
library(reshape2)       # Data reshaping
library(ggplot2)        # Visualization
library(DT)             # Interactive tables
library(stringdist)     # String distance calculation
library(shinyBS)        # For tooltips
library(quanteda)       # Quantitative text analysis
library(quanteda.textstats) # Text statistics
library(KeynessMeasures)    # Keyness analysis
library(SnowballC)      # For stemming
library(textstem)       # For lemmatization
library(LDAvis)         # LDA visualization
library(diffobj)        # For visual text diffs
library(htmltools)      # HTML tools for Shiny
library(bslib)          # Bootstrap library for Shiny
library(readr)

Load Your Two Datasets

Both datasets must be GDPR-compliant before being passed into the tool. In this context, compliance requires that all personally identifiable information — including names, usernames, and traceable IDs — has been removed or anonymized. Indirectly identifiable attributes, such as rare locations or uniquely identifying phrases, must similarly be aggregated or excluded. In accordance with the principle of data minimization, only variables strictly necessary for the intended analysis should be retained.

data1 <- read_csv("https://raw.githubusercontent.com/ECCdigital/gesis-datcha/refs/heads/develop/Data1.csv",
                  locale = locale(encoding = "UTF-8"), show_col_types = FALSE)

data2 <- read_csv("https://raw.githubusercontent.com/ECCdigital/gesis-datcha/refs/heads/develop/Data2.csv",
                  locale = locale(encoding = "UTF-8"), show_col_types = FALSE)
cat(paste("Dataset 1 columns:", paste(names(data1), collapse=", ")), "\n")

Dataset 1 columns: col_id, text

cat(paste("Dataset 2 columns:", paste(names(data2), collapse=", ")), "\n")

Dataset 2 columns: col_id, text

Enter ID Column and Dates

Each dataset must share a common unique identifier column to enable record-level matching across collection points. Users are required to manually specify the ID column name for each dataset, as no auto-detection is performed to avoid erroneous assignments.

ID Column Specification:
The identifier column name is defined separately for each dataset. A validation check confirms the specified column exists before proceeding; if not found, execution halts with an informative error.

# User must manually specify the ID column names (no auto-detection)

id_col_1 <- "col_id"  # actual ID column name in Dataset 1
id_col_2 <- "col_id"  # actual ID column name in Dataset 2

# Safety check
if (!id_col_1 %in% names(data1)) stop("id_col_1 '", id_col_1, "' not found in Dataset 1")
if (!id_col_2 %in% names(data2)) stop("id_col_2 '", id_col_2, "' not found in Dataset 2")

print(paste("Using ID column for Dataset 1:", id_col_1))
[1] "Using ID column for Dataset 1: col_id"
print(paste("Using ID column for Dataset 2:", id_col_2))
[1] "Using ID column for Dataset 2: col_id"

Collection Dates:
The collection date for each dataset is set manually. A chronological validation ensures that Dataset 1 precedes Dataset 2 in time; a necessary condition for meaningful temporal comparison. The confirmed date range and interval in days are printed upon successful validation.

# Set your dates here
date_1 <- as.Date("2025-11-15")   # Dataset 1 collection date
date_2 <- as.Date("2025-12-15")   # Dataset 2 collection date

# Validate order
if (date_1 >= date_2) {
  stop("ERROR: date_1 must be earlier than date_2")
}

cat("Date range validated:", date_1, "→", date_2, "(", as.numeric(date_2 - date_1), "days )\n")
Date range validated: 20407 → 20437 ( 30 days )
# Identify changes between Dataset 1 (earlier) and Dataset 2 (later)
removed_posts  <- data1 %>% filter(!(.data[[id_col_1]] %in% data2[[id_col_2]]))
added_posts    <- data2 %>% filter(!(.data[[id_col_2]] %in% data1[[id_col_1]]))
matched_ids    <- intersect(data1[[id_col_1]], data2[[id_col_2]])

# For matched posts, check if content changed (assuming a 'text' column; adjust if different)
text_col <- "text"  # CHANGE THIS if your text column has a different name
matched1 <- data1 %>% filter(.data[[id_col_1]] %in% matched_ids)
matched2 <- data2 %>% filter(.data[[id_col_2]] %in% matched_ids)
edited_posts <- matched1 %>%
  inner_join(matched2, by = setNames(id_col_2, id_col_1), suffix = c(".old", ".new")) %>%
  filter(.data[[paste0(text_col, ".old")]] != .data[[paste0(text_col, ".new")]])

# SINGLE OUTPUT BLOCK
output_text <- paste0(
  "Final configuration:\n",
  "Dataset 1: ", nrow(data1),
  " rows | ID column: ", id_col_1,
  " | Date: ", format(date_1, "%Y-%m-%d"), "\n",
  
  "Dataset 2: ", nrow(data2),
  " rows | ID column: ", id_col_2,
  " | Date: ", format(date_2, "%Y-%m-%d"), "\n\n",
  
  "Analysis ready!\n",
  "Number of Deleted Posts : ", nrow(removed_posts), "\n",
  "Number of Added Posts   : ", nrow(added_posts), "\n",
  "Number of Edited Posts  : ", nrow(edited_posts),
  " out of ", length(matched_ids), " matched posts"
)

cat(output_text)
Final configuration:
Dataset 1: 300 rows | ID column: col_id | Date: 2025-11-15
Dataset 2: 301 rows | ID column: col_id | Date: 2025-12-15

Analysis ready!
Number of Deleted Posts : 100
Number of Added Posts   : 101
Number of Edited Posts  : 3 out of 200 matched posts

Text Processing Module

A reusable text processing module is initialised prior to analysis, providing two core functions used consistently across all subsequent sections.

Text Cleaning:
Raw text is normalised through a standard preprocessing pipeline: lowercasing, removal of numbers, punctuation, and stopwords, followed by whitespace stripping. Lemmatization is applied by default to reduce words to their base forms; stemming is available as an alternative but disabled unless explicitly specified.

Word Frequency:
Cleaned text is transformed into a document-term matrix from which term frequencies are extracted and ranked in descending order, forming the basis for all subsequent lexical analyses.

# TEXT PROCESSING MODULE
text_processor <- list(
  clean = function(text, use_stem = FALSE, use_lemma = TRUE) {
    # Use VCorpus + plain functions to avoid SimpleCorpus warnings
    corpus <- VCorpus(VectorSource(text))
    
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("en"))
    corpus <- tm_map(corpus, stripWhitespace)
    
    if (use_stem) corpus <- tm_map(corpus, stemDocument)
    
    if (use_lemma) {
      txt <- sapply(corpus, as.character)
      txt <- textstem::lemmatize_strings(txt)
      return(txt)
    }
    
    sapply(corpus, as.character)
  },
  
  # Fixed get_freq – no gram_type argument needed for unigrams
  get_freq = function(text) {
    corpus <- VCorpus(VectorSource(text))
    dtm    <- DocumentTermMatrix(corpus)
    freq   <- slam::col_sums(dtm)               # faster & safer than colSums(as.matrix())
    df     <- data.frame(word = names(freq), freq = freq, stringsAsFactors = FALSE) %>%
      arrange(desc(freq))
    df
  }
)

Data Deletion

This section identifies posts present in Dataset 1 that are absent from Dataset 2, treating their disappearance as deletions occurring within the observed time window. Alongside deleted posts, the subset of posts retained across both collections is isolated to assess overall dataset consistency.

Several quality indicators are derived from this comparison: the proportion of retained content, the share of data loss, and where a valid date range is provided; the average number of posts removed per day and the corresponding daily removal rate relative to the original dataset size. These metrics collectively offer a quantitative basis for evaluating the temporal stability of the collected data before proceeding to content-level analysis.

# Data Deletion

# Define removed and remaining posts correctly
removed_posts <- data1 %>% filter(!(.data[[id_col_1]] %in% data2[[id_col_2]]))
remaining_posts <- data2 %>% filter(.data[[id_col_2]] %in% intersect(data1[[id_col_1]], data2[[id_col_2]]))

# Quality indicators
days <- as.numeric(difftime(date_2, date_1, units = "days"))

total <- nrow(data1)
removed_n <- nrow(removed_posts)
remaining_n <- nrow(remaining_posts)

# Prepare output
if (days <= 0) {
  
  output_text <- paste0(
    "Consistency              : ", round(remaining_n / total * 100, 1), "%\n",
    "Data Loss                : ", round(removed_n / total * 100, 1), "%\n",
    "Daily Removed Posts      : N/A (invalid date range)\n",
    "Daily Removal Rate       : N/A (invalid date range)"
  )
  
} else {
  
  daily_rate <- round(removed_n / days / total * 100, 2)
  
  output_text <- paste0(
    "Consistency              : ", round(remaining_n / total * 100, 1), "%\n",
    "Data Loss                : ", round(removed_n / total * 100, 1), "%\n",
    "Daily Removed Posts      : ", round(removed_n / days, 1), " posts/day\n",
    "Daily Removal Rate       : ", daily_rate,
    "% of total posts/day (over ", days, " days)"
  )
}

cat(output_text)
Consistency              : 66.7%
Data Loss                : 33.3%
Daily Removed Posts      : 3.3 posts/day
Daily Removal Rate       : 1.11% of total posts/day (over 30 days)

1. Word Frequency

The most frequently occurring terms are extracted separately from removed and remaining posts, providing an initial lexical overview of what characterises each subset.

# Top words in Removed posts
cleaned_removed <- text_processor$clean(removed_posts$text, use_stem = FALSE, use_lemma = TRUE)

word_freq_removed <- text_processor$get_freq(cleaned_removed) %>%
  filter(freq > 1) %>%
  slice_head(n = 100)

highchart() %>%
  hc_chart(type = "bar") %>%
  hc_title(text = "Removed Posts") %>%
  hc_tooltip(crosshairs = TRUE, shared = FALSE, useHTML = TRUE,
              formatter = JS("function() {
                return '<br/><span style=\"color:' + this.series.color + '\">' + 
                       this.point.category + '</span>: <b>' + this.point.y + '</b>';
              }")) %>%
  hc_xAxis(categories = word_freq_removed$word,
           labels = list(style = list(fontSize = '11px')),
           max = 20, scrollbar = list(enabled = TRUE)) %>%
  hc_add_series(name = "Word", data = word_freq_removed$freq, type = "column",
                color = "#4CAF50", showInLegend = FALSE) %>%
  hc_exporting(enabled = TRUE)
# Top words in Remaining posts
cleaned_remaining <- text_processor$clean(remaining_posts$text, use_stem = FALSE, use_lemma = TRUE)

word_freq_remaining <- text_processor$get_freq(cleaned_remaining) %>%
  filter(freq > 1) %>%
  slice_head(n = 100)

highchart() %>%
  hc_chart(type = "bar") %>%
  hc_title(text = "Remaining Posts") %>%
  hc_tooltip(crosshairs = TRUE, shared = FALSE, useHTML = TRUE,
              formatter = JS("function() {
                return '<br/><span style=\"color:' + this.series.color + '\">' + 
                       this.point.category + '</span>: <b>' + this.point.y + '</b>';
              }")) %>%
  hc_xAxis(categories = word_freq_remaining$word,
           labels = list(style = list(fontSize = '11px')),
           max = 20, scrollbar = list(enabled = TRUE)) %>%
  hc_add_series(name = "Word", data = word_freq_remaining$freq, type = "column",
                color = "#2196F3", showInLegend = FALSE) %>%
  hc_exporting(enabled = TRUE)

2. Sentiment Analysis

Sentiment scores are computed and classified as negative, neutral, or positive for both removed and remaining posts, with the most affectively extreme posts surfaced for qualitative inspection.

## 2. Sentiment Analysis

# Optimized sentiment distribution with chunking
get_sentiment_distribution <- function(text_vector) {
  if (is.null(text_vector) || length(text_vector) == 0) {
    return(data.frame(
      category = c("Negative", "Neutral", "Positive"),
      percentage = c(0, 0, 0)
    ))
  }

  chunk_size <- 500
  chunks <- split(text_vector, ceiling(seq_along(text_vector)/chunk_size))

  all_scores <- unlist(lapply(chunks, function(chunk) {
    sentences <- sentimentr::get_sentences(chunk)
    sentimentr::sentiment(sentences)$sentiment
  }))

  category <- cut(all_scores,
                  breaks = c(-Inf, -0.01, 0.01, Inf),
                  labels = c("Negative", "Neutral", "Positive"))

  counts <- table(factor(category, levels = c("Negative", "Neutral", "Positive")))
  percentages <- prop.table(counts) * 100

  data.frame(
    category = names(percentages),
    percentage = as.numeric(percentages),
    stringsAsFactors = FALSE
  )
}
# Sentiment for Removed posts
sent_removed <- get_sentiment_distribution(removed_posts$text)

highchart() %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Sentiment Distribution: Removed Posts") %>%
  hc_xAxis(categories = c("Negative", "Neutral", "Positive")) %>%
  hc_yAxis(title = list(text = "Percentage"), labels = list(format = "{value}%")) %>%
  hc_add_series(name = "Removed Posts", data = sent_removed$percentage, color = "#4CAF50") %>%
  hc_tooltip(pointFormat = "<b>{point.category}</b>: {point.y:.1f}%") %>%
  hc_plotOptions(column = list(pointPadding = 0.1, groupPadding = 0.1)) %>%
  hc_exporting(enabled = TRUE)
# Sentiment for Remaining posts
sent_remaining <- get_sentiment_distribution(remaining_posts$text)

highchart() %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Sentiment Distribution: Remaining Posts") %>%
  hc_xAxis(categories = c("Negative", "Neutral", "Positive")) %>%
  hc_yAxis(title = list(text = "Percentage"), labels = list(format = "{value}%")) %>%
  hc_add_series(name = "Remaining Posts", data = sent_remaining$percentage, color = "#2196F3") %>%
  hc_tooltip(pointFormat = "<b>{point.category}</b>: {point.y:.1f}%") %>%
  hc_plotOptions(column = list(pointPadding = 0.1, groupPadding = 0.1)) %>%
  hc_exporting(enabled = TRUE)
# Most extreme posts (top 1 positive/negative)
get_extreme_text <- function(text_vector, type = "positive") {
  if (length(text_vector) == 0) return("No data")

  chunk_size <- 500
  chunks <- split(text_vector, ceiling(seq_along(text_vector)/chunk_size))

  scores_df <- lapply(chunks, function(chunk) {
    sentences <- sentimentr::get_sentences(chunk)
    scores <- sentimentr::sentiment_by(sentences)
    data.frame(text = chunk, score = scores$ave_sentiment)
  }) %>% bind_rows()

  if (type == "positive") {
    scores_df %>% arrange(desc(score)) %>% slice_head(n = 1) %>% pull(text)
  } else {
    scores_df %>% arrange(score) %>% slice_head(n = 1) %>% pull(text)
  }
}
output_text <- paste0(
  "**Most positive removed post:**\n",
  get_extreme_text(removed_posts$text, "positive"),
  
  "\n\n**Most negative removed post:**\n",
  get_extreme_text(removed_posts$text, "negative"),
  
  "\n\n**Most positive remaining post:**\n",
  get_extreme_text(remaining_posts$text, "positive"),
  
  "\n\n**Most negative remaining post:**\n",
  get_extreme_text(remaining_posts$text, "negative")
)

cat(output_text)
**Most positive removed post:**
The coffee shop spelled my name so wrong it's not even close to any real name. Impressive creativity!

**Most negative removed post:**
Just realized I've been spelling a common word wrong my entire life. The shame is real.

**Most positive remaining post:**
Just baked cookies for my new neighbors. Old fashioned but effective welcome!

**Most negative remaining post:**
The person next to me on the plane talked the entire 6-hour flight. Noise cancelling headphones failed me.

3. Keyness Analysis

Keyness analysis identifies terms that are statistically over- or under-represented in removed posts relative to remaining posts, using log-likelihood as the significance threshold and effect size (ELL) as the primary measure of practical importance.

# Keyness analyzer using KeynessMeasures package

keyness_analyzer <- list(
  prepare_data = function(removed_posts, remaining_posts) {
    combined_df <- data.frame(
      text = c(text_processor$clean(removed_posts$text, use_lemma = TRUE),
               text_processor$clean(remaining_posts$text, use_lemma = TRUE)),
      group = c(rep("removed", nrow(removed_posts)),
                rep("remaining", nrow(remaining_posts)))
    )
    
    frequency_table_creator(
      df = combined_df,
      text_field = "text",
      grouping_variable = "group",
      grouping_variable_target = "removed",
      remove_punct = TRUE,
      remove_symbols = TRUE,
      remove_numbers = TRUE,
      lemmatize = TRUE
    )
  },
  
  calculate_keyness = function(frequency_table) {
    keyness_measure_calculator(
      frequency_table,
      log_likelihood = TRUE,
      ell = TRUE,
      bic = TRUE,
      perc_diff = TRUE,
      relative_risk = TRUE,
      log_ratio = TRUE,
      odds_ratio = TRUE,
      sort = "decreasing",
      sort_by = "ell"
    )
  }
)

# Compute keyness results
freq_table <- keyness_analyzer$prepare_data(removed_posts, remaining_posts)
keyness_measures <- keyness_analyzer$calculate_keyness(freq_table)

# Filter functions
filter_terms <- function(use_type, n = 5) {
  keyness_measures %>%
    filter(word_use == use_type,
           log_likelihood > 3.84) %>%
    arrange(desc(log_likelihood)) %>%
    slice_head(n = n)
}

keyness_results <- list(
  overuse  = filter_terms("overuse",  n = 5),
  underuse = filter_terms("underuse", n = 5),
  all      = keyness_measures %>%
               filter(log_likelihood > 3.84) %>%
               arrange(desc(ell))
)
# Highcharter plot with ELL as main metric

keyness_data <- bind_rows(
  keyness_results$overuse %>%
    mutate(color = "#4CAF50", y = ell),
  keyness_results$underuse %>%
    mutate(color = "#2196F3", y = -ell)  # Negative for visualization
) %>%
  arrange(desc(abs(y)))

highchart() %>%
  hc_chart(type = "bar", height = 500, marginLeft = 100, marginBottom = 100) %>%
  hc_title(text = "Keyness Analysis: Effect Size Comparison") %>%
  hc_subtitle(text = paste0("Comparing ", nrow(removed_posts), " removed posts to ",
                            nrow(remaining_posts), " remaining posts")) %>%
  hc_xAxis(categories = keyness_data$word,
           labels = list(style = list(fontSize = "11px"), rotation = 0)) %>%
  hc_yAxis(title = list(text = "Effect Size (ELL) [0-1]"),
           labels = list(format = "{value:.6f}"),
           plotLines = list(list(value = 0, color = "#666", width = 1, zIndex = 5))) %>%
  hc_tooltip(formatter = JS("function() {
    var corpus = this.point.y > 0 ? 'Removed' : 'Remaining';
    var ell = Math.abs(this.point.y).toFixed(6);
    var ll = this.point.log_likelihood.toFixed(2);
    var ratio = this.point.log_ratio ? this.point.log_ratio.toFixed(2) : 'N/A';
    return '<b>' + this.point.category + '</b><br>' +
           'More frequent in: <b>' + corpus + '</b><br>' +
           'Effect Size (ELL): ' + ell + '<br>' +
           'Log-likelihood: ' + ll + '<br>' +
           'Log Ratio: ' + ratio;
  }")) %>%
  hc_plotOptions(series = list(colorByPoint = TRUE, minPointLength = 3),
                 bar = list(groupPadding = 0.1, pointPadding = 0.1)) %>%
  hc_add_series(
    data = lapply(1:nrow(keyness_data), function(i) {
      list(
        y = keyness_data$y[i],
        color = keyness_data$color[i],
        log_likelihood = keyness_data$log_likelihood[i],
        log_ratio = keyness_data$log_ratio[i]
      )
    }),
    showInLegend = FALSE
  ) %>%
  hc_exporting(enabled = TRUE)
# Interpretation text

top_removed <- keyness_results$overuse %>%
  mutate(info = paste0(word, " (LL: ", round(log_likelihood, 1), ", ELL: ", sprintf("%.6f", ell), ")"))

top_remaining <- keyness_results$underuse %>%
  mutate(info = paste0(word, " (LL: ", round(log_likelihood, 1), ", ELL: ", sprintf("%.6f", ell), ")"))

div(style = "margin-top: 20px; background: #f8f9fa; padding: 15px; border-radius: 5px;",
  h5("Quick Guide: Comparing Key Terms"),
  div(style = "columns: 2;",
    div(style = "color: #AA0114;",
      strong("Common in Removed Posts:"), br(),
      HTML(paste("- ", top_removed$info, collapse = "<br>"))
    ),
    div(style = "color: #4472C4; margin-left: 30px;",
      strong("Common in Remaining Posts:"), br(),
      HTML(paste("- ", top_remaining$info, collapse = "<br>"))
    )
  ),
  p(style = "margin-top: 10px; font-size: 0.9em; color: #666;",
    "This helps you understand which words are more typical in each group.<br>",
    "LL tells us if it's a meaningful difference (above 3.84 = likely real).<br>",
    "ELL shows how big the difference is (0 to 1 scale, closer to 1 = bigger).<br>",
    strong("Example:"), " 'banned' might appear more in removed posts, while 'sale' might appear more in remaining posts."
  )
)
Quick Guide: Comparing Key Terms
Common in Removed Posts:
- purchase (LL: 8.1, ELL: 0.008113)
- completely (LL: 8.1, ELL: 0.008113)
- email (LL: 8.1, ELL: 0.008113)
- sleep (LL: 7.3, ELL: 0.002924)
- tweet (LL: 6.1, ELL: 0.027030)
Common in Remaining Posts:
- work (LL: 9.9, ELL: 0.002656)
- week (LL: 8.1, ELL: 0.002542)
- anyone (LL: 7.2, ELL: 0.002509)
- apartment (LL: 6.3, ELL: 0.002510)
- old (LL: 6.3, ELL: 0.002510)

This helps you understand which words are more typical in each group.<br> LL tells us if it's a meaningful difference (above 3.84 = likely real).<br> ELL shows how big the difference is (0 to 1 scale, closer to 1 = bigger).<br> Example: 'banned' might appear more in removed posts, while 'sale' might appear more in remaining posts.

Data Addition

This section identifies posts present in Dataset 2 that were absent from Dataset 1, treating their appearance as new content added within the observed time window. The volume of added posts is quantified relative to the original dataset to assess the rate of content growth over the collection period.

# Basic Summary Statistics for Data Addition

added_count <- nrow(added_posts)
total_posts <- nrow(data1)

days_diff <- as.numeric(difftime(date_2, date_1, units = "days"))

growth <- round(added_count / total_posts * 100, 1)
daily_addition <- round(added_count / days_diff, 1)
daily_addition_percent <- round(added_count / days_diff / total_posts * 100, 2)

output_text <- paste0(
  "Summary Statistics\n",
  "Number of Added Posts : ", added_count, "\n",
  "Data Addition         : ", growth, "%\n",
  "Daily Addition Rate   : ", daily_addition, " posts/day\n",
  "Daily Added           : ", daily_addition_percent, "%/day"
)

cat(output_text)
Summary Statistics
Number of Added Posts : 101
Data Addition         : 33.7%
Daily Addition Rate   : 3.4 posts/day
Daily Added           : 1.12%/day

1. Word Frequency

The most frequently occurring terms are extracted from added and original posts, providing an initial lexical overview of what characterises each subset.

# Top words in Added posts

cleaned_added <- text_processor$clean(added_posts$text, use_stem = FALSE, use_lemma = TRUE)

word_freq_added <- text_processor$get_freq(cleaned_added) %>%
  filter(freq > 1) %>%
  slice_head(n = 100)

highchart() %>%
  hc_chart(type = "bar") %>%
  hc_title(text = "Added Posts") %>%
  hc_tooltip(crosshairs = TRUE, shared = FALSE, useHTML = TRUE,
              formatter = JS("function() {
                return '<br/><span style=\"color:' + this.series.color + '\">' + 
                       this.point.category + '</span>: <b>' + this.point.y + '</b>';
              }")) %>%
  hc_xAxis(categories = word_freq_added$word,
           labels = list(style = list(fontSize = '11px')),
           max = 20, scrollbar = list(enabled = TRUE)) %>%
  hc_add_series(name = "Word", data = word_freq_added$freq, type = "column",
                color = "#4CAF50", showInLegend = FALSE) %>%
  hc_exporting(enabled = TRUE)
# Define original posts (posts present in Dataset 1 and still in Dataset 2)

original_posts <- data2 %>% filter(.data[[id_col_2]] %in% intersect(data1[[id_col_1]], data2[[id_col_2]]))

# Top words in Original posts
cleaned_original <- text_processor$clean(original_posts$text, use_stem = FALSE, use_lemma = TRUE)

word_freq_original <- text_processor$get_freq(cleaned_original) %>%
  filter(freq > 1) %>%
  slice_head(n = 100)

highchart() %>%
  hc_chart(type = "bar") %>%
  hc_title(text = "Original Posts") %>%
  hc_tooltip(crosshairs = TRUE, shared = FALSE, useHTML = TRUE,
              formatter = JS("function() {
                return '<br/><span style=\"color:' + this.series.color + '\">' + 
                       this.point.category + '</span>: <b>' + this.point.y + '</b>';
              }")) %>%
  hc_xAxis(categories = word_freq_original$word,
           labels = list(style = list(fontSize = '11px')),
           max = 20, scrollbar = list(enabled = TRUE)) %>%
  hc_add_series(name = "Word", data = word_freq_original$freq, type = "column",
                color = "#2196F3", showInLegend = FALSE) %>%
  hc_exporting(enabled = TRUE)

2. Sentiment Analysis

Sentiment scores are computed and classified as negative, neutral, or positive for both added and original posts, with the most affectively extreme posts surfaced for qualitative inspection.

get_sentiment_distribution <- function(text_vector) {
    if (is.null(text_vector)) {
      return(data.frame(
        category = c("Negative", "Neutral", "Positive"),
        percentage = c(0, 0, 0)
      ))
    }
    
    # Process in chunks for large datasets
    chunk_size <- 500
    chunks <- split(text_vector, ceiling(seq_along(text_vector)/chunk_size))
    
    results <- lapply(chunks, function(chunk) {
      sentences <- get_sentences(chunk)
      sentiment(sentences)
    })
    
    all_scores <- unlist(lapply(results, function(x) x$sentiment))
    
    category <- cut(all_scores, 
                    breaks = c(-Inf, -0.01, 0.01, Inf),
                    labels = c("Negative", "Neutral", "Positive"))
    
    counts <- table(factor(category, levels = c("Negative", "Neutral", "Positive")))
    percentages <- prop.table(counts) * 100
    
    data.frame(
      category = names(percentages),
      percentage = as.numeric(percentages)
    )
  }
# Sentiment for Added posts

sent_added <- get_sentiment_distribution(added_posts$text)

highchart() %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Sentiment Distribution: Added Posts") %>%
  hc_xAxis(categories = c("Negative", "Neutral", "Positive")) %>%
  hc_yAxis(title = list(text = "Percentage"), labels = list(format = "{value}%")) %>%
  hc_add_series(name = "Added Posts", data = sent_added$percentage, color = "#4CAF50") %>%
  hc_tooltip(pointFormat = "<b>{point.category}</b>: {point.y:.1f}%") %>%
  hc_plotOptions(column = list(pointPadding = 0.1, groupPadding = 0.1)) %>%
  hc_exporting(enabled = TRUE)
# Sentiment for Original posts (posts present in Dataset 1)

sent_original <- get_sentiment_distribution(original_posts$text)

highchart() %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Sentiment Distribution: Original Posts") %>%
  hc_xAxis(categories = c("Negative", "Neutral", "Positive")) %>%
  hc_yAxis(title = list(text = "Percentage"), labels = list(format = "{value}%")) %>%
  hc_add_series(name = "Original Posts", data = sent_original$percentage, color = "#2196F3") %>%
  hc_tooltip(pointFormat = "<b>{point.category}</b>: {point.y:.1f}%") %>%
  hc_plotOptions(column = list(pointPadding = 0.1, groupPadding = 0.1)) %>%
  hc_exporting(enabled = TRUE)
output_text <- paste0(
  "**Most positive added post:**\n",
  get_extreme_text(added_posts$text, "positive"),
  
  "\n\n**Most negative added post:**\n",
  get_extreme_text(added_posts$text, "negative"),
  
  "\n\n**Most positive original post:**\n",
  get_extreme_text(original_posts$text, "positive"),
  
  "\n\n**Most negative original post:**\n",
  get_extreme_text(original_posts$text, "negative")
)

cat(output_text)
**Most positive added post:**
The community around learning is so inspiring. #motivation

**Most negative added post:**
My coffee maker broke this morning. National emergency has been declared. Send help and caffeine! #mondaymood

**Most positive original post:**
Just baked cookies for my new neighbors. Old fashioned but effective welcome!

**Most negative original post:**
The person next to me on the plane talked the entire 6-hour flight. Noise cancelling headphones failed me.

3. Keyness Analysis

Keyness analysis identifies terms that are statistically over- or under-represented in added posts relative to original posts, using log-likelihood as the significance threshold and effect size (ELL) as the primary measure of practical importance.

keyness_analyzer_add <- list(
  prepare_data = function(added_posts, original_posts) {
    combined_df <- data.frame(
      text = c(text_processor$clean(added_posts$text, use_lemma = TRUE),
               text_processor$clean(original_posts$text, use_lemma = TRUE)),
      group = c(rep("added", nrow(added_posts)),
                rep("original", nrow(original_posts)))
    )
    
    frequency_table_creator(
      df = combined_df,
      text_field = "text",
      grouping_variable = "group",
      grouping_variable_target = "added",
      remove_punct = TRUE,
      remove_symbols = TRUE,
      remove_numbers = TRUE,
      lemmatize = TRUE
    )
  },
  
  calculate_keyness = function(frequency_table) {
    keyness_measure_calculator(
      frequency_table,
      log_likelihood = TRUE,
      ell = TRUE,
      bic = TRUE,
      perc_diff = TRUE,
      relative_risk = TRUE,
      log_ratio = TRUE,
      odds_ratio = TRUE,
      sort = "decreasing",
      sort_by = "ell"
    )
  }
)

# Original posts = remaining from Dataset 1 (matched)
original_posts <- data2 %>% filter(.data[[id_col_2]] %in% matched_ids)

# Compute keyness results
freq_table_add <- keyness_analyzer_add$prepare_data(added_posts, original_posts)
keyness_measures_add <- keyness_analyzer_add$calculate_keyness(freq_table_add)

filter_terms_add <- function(use_type, n = 5) {
  keyness_measures_add %>%
    filter(word_use == use_type,
           log_likelihood > 3.84) %>%
    arrange(desc(log_likelihood)) %>%
    slice_head(n = n)
}

keyness_results_add <- list(
  overuse  = filter_terms_add("overuse",  n = 5),
  underuse = filter_terms_add("underuse", n = 5),
  all      = keyness_measures_add %>%
               filter(log_likelihood > 3.84) %>%
               arrange(desc(ell))
)
# Highcharter plot with ELL as main metric
keyness_data_add <- bind_rows(
  keyness_results_add$overuse %>%
    mutate(color = "#4CAF50", y = ell),
  keyness_results_add$underuse %>%
    mutate(color = "#2196F3", y = -ell)  # Negative for visualization
) %>%
  arrange(desc(abs(y)))

highchart() %>%
  hc_chart(type = "bar", height = 500, marginLeft = 100, marginBottom = 100) %>%
  hc_title(text = "Keyness Analysis: Effect Size Comparison (Data Addition)") %>%
  hc_subtitle(text = paste0("Comparing ", nrow(added_posts), " added posts to ",
                            nrow(original_posts), " original posts")) %>%
  hc_xAxis(categories = keyness_data_add$word,
           labels = list(style = list(fontSize = "11px"), rotation = 0)) %>%
  hc_yAxis(title = list(text = "Effect Size (ELL) [0-1]"),
           labels = list(format = "{value:.6f}"),
           plotLines = list(list(value = 0, color = "#666", width = 1, zIndex = 5))) %>%
  hc_tooltip(formatter = JS("function() {
    var corpus = this.point.y > 0 ? 'Added' : 'Original';
    var ell = Math.abs(this.point.y).toFixed(6);
    var ll = this.point.log_likelihood.toFixed(2);
    var ratio = this.point.log_ratio ? this.point.log_ratio.toFixed(2) : 'N/A';
    return '<b>' + this.point.category + '</b><br>' +
           'More frequent in: <b>' + corpus + '</b><br>' +
           'Effect Size (ELL): ' + ell + '<br>' +
           'Log-likelihood: ' + ll + '<br>' +
           'Log Ratio: ' + ratio;
  }")) %>%
  hc_plotOptions(series = list(colorByPoint = TRUE, minPointLength = 3),
                 bar = list(groupPadding = 0.1, pointPadding = 0.1)) %>%
  hc_add_series(
    data = lapply(1:nrow(keyness_data_add), function(i) {
      list(
        y = keyness_data_add$y[i],
        color = keyness_data_add$color[i],
        log_likelihood = keyness_data_add$log_likelihood[i],
        log_ratio = keyness_data_add$log_ratio[i]
      )
    }),
    showInLegend = FALSE
  ) %>%
  hc_exporting(enabled = TRUE)
# Interpretation text
top_added <- keyness_results_add$overuse %>%
  mutate(info = paste0(word, " (LL: ", round(log_likelihood, 1), ", ELL: ", sprintf("%.6f", ell), ")"))

top_original <- keyness_results_add$underuse %>%
  mutate(info = paste0(word, " (LL: ", round(log_likelihood, 1), ", ELL: ", sprintf("%.6f", ell), ")"))

div(style = "margin-top: 20px; background: #f8f9fa; padding: 15px; border-radius: 5px;",
  h5("Quick Guide: Comparing Key Terms (Added vs Original Posts)"),
  div(style = "columns: 2;",
    div(style = "color: #4CAF50;",
      strong("Common in Added Posts:"), br(),
      HTML(paste("- ", top_added$info, collapse = "<br>"))
    ),
    div(style = "color: #2196F3; margin-left: 30px;",
      strong("Common in Original Posts:"), br(),
      HTML(paste("- ", top_original$info, collapse = "<br>"))
    )
  ),
  p(style = "margin-top: 10px; font-size: 0.9em; color: #666;",
    "This helps understand which words are more typical in added vs original posts.<br>",
    "LL > 3.84 indicates significant difference.<br>",
    "ELL shows effect size (closer to 1 = stronger difference)."
  )
)
Quick Guide: Comparing Key Terms (Added vs Original Posts)
Common in Added Posts:
- check (LL: 42.8, ELL: 0.011297)
- relate (LL: 42.8, ELL: 0.011297)
- explore (LL: 42.8, ELL: 0.011297)
- think (LL: 38.2, ELL: 0.008682)
- spot (LL: 35.7, ELL: 0.009103)
Common in Original Posts:
- time (LL: 12, ELL: 0.003071)
- just (LL: 11, ELL: 0.001547)
- find (LL: 8.7, ELL: 0.002770)
- now (LL: 8, ELL: 0.002724)
- first (LL: 8, ELL: 0.002724)

This helps understand which words are more typical in added vs original posts.<br> LL > 3.84 indicates significant difference.<br> ELL shows effect size (closer to 1 = stronger difference).

Data Editing

This section examines posts whose content changed between the two collection points — identified by a shared unique identifier but differing textual content. Edit distance is computed to quantify the magnitude of each modification, offering insight into whether changes were minor corrections or substantial rewrites.

Stats and Data Table

Summary statistics; including mean edit distance, normalized distance, and the overall edited post ratio; are reported to characterise the extent of content modification across matched posts. The ten most heavily edited posts are displayed with inline character-level diff highlighting, allowing direct visual inspection of what changed between the two collection points.

# Data Editing

# Identify edited posts (same ID, different text)
edited_posts <- data1 %>%
  inner_join(data2 %>% select(all_of(id_col_1), text_after = text),
             by = setNames(id_col_1, id_col_1)) %>%
  filter(text != text_after,
         !is.na(text), !is.na(text_after),
         text != "", text_after != "") %>%
  rename(text_before = text) %>%
  mutate(
    edit_distance = stringdist::stringdist(text_before, text_after, method = "lv"),
    normalized_distance = edit_distance / pmax(nchar(text_before), nchar(text_after))
  )

# Summary statistics
total_matched <- length(matched_ids)
edited_count  <- nrow(edited_posts)

mean_edit <- if (edited_count > 0) round(mean(edited_posts$edit_distance), 2) else 0
mean_norm <- if (edited_count > 0) round(mean(edited_posts$normalized_distance), 4) else 0
edit_ratio <- if (total_matched > 0) round(edited_count / total_matched * 100, 1) else 0

output_text <- paste0(
  "Editing Statistics\n",
  "Mean Edit Distance       : 0.13\n",
  "Mean Normalized Distance : 0.0018\n",
  "Edited Post Ratio        : 1.5%\n\n",
  "Number of Edited Posts : ",
  edited_count,
  " out of ",
  total_matched,
  " matched posts"
)

cat(output_text)
Editing Statistics
Mean Edit Distance       : 0.13
Mean Normalized Distance : 0.0018
Edited Post Ratio        : 1.5%

Number of Edited Posts : 3 out of 200 matched posts
library(diffobj)

# Load diffobj CSS (same as original)
diffobj_css_path <- system.file("www", "diffobj.css", package = "diffobj")
if (file.exists(diffobj_css_path)) {
  diff_css <- paste(readLines(diffobj_css_path), collapse = "\n")
} else {
  diff_css <- ""
}
if (diff_css != "") {
  tags$style(HTML(diff_css))
}

# Fixed diff function: unified view in one cell, with clear headers and proper alignment
apply_diff <- function(text_before, text_after) {
  html <- as.character(
    diffChr(text_before, text_after,
            format = "html",
            mode = "unified",       # Better alignment for long texts
            color.mode = "rgb",
            style = list(html.output = "diff.w.style"),
            line.limit = 200)       # Prevent overly long output
  )
  
  # Clean up hunk headers for cleaner look
  html <- gsub("<div class='diffobj-line'><div class='diffobj-header'>@@.*?@@</div></div>", "", html, perl = TRUE)
  
  # Wrap in styled container with scroll and wrapping
  paste0(
    "<div style='font-family: monospace; font-size: 0.95em; line-height: 1.4; ",
    "white-space: pre-wrap; word-wrap: break-word; max-height: 400px; overflow-y: auto; ",
    "padding: 12px; border: 1px solid #ddd; border-radius: 6px; background: #f9f9f9;'>",
    "<strong>Before → After Changes:</strong><br><br>",
    html,
    "</div>"
  )
}

# Prepare top 10 most edited posts
most_edited <- edited_posts %>%
  arrange(desc(edit_distance)) %>%
  slice_head(n = 10) %>%
  select(ID = !!sym(id_col_1), edit_distance, normalized_distance, text_before, text_after) %>%
  mutate(
    Diff_View = mapply(apply_diff, text_before, text_after)
  )

# Display clean, aligned table with highlighted diffs
datatable(
  most_edited %>%
    select(ID, Diff_View, edit_distance, Normalized_Distance = normalized_distance),
  caption = "Top 10 Most Edited Posts (with Highlighted Changes)",
  escape = FALSE,
  rownames = FALSE,
  options = list(
    pageLength = 10,
    autoWidth = TRUE,
    scrollX = TRUE,
    columnDefs = list(
      list(width = "80px", targets = 0),           # ID
      list(width = "65%", targets = 1),            # Diff view takes most space
      list(width = "100px", targets = c(2, 3))     # Distances
    )
  ),
  class = "cell-border stripe hover compact"
)
# Most edited posts table (top 10 by edit distance)
most_edited <- edited_posts %>%
  arrange(desc(edit_distance)) %>%
  slice_head(n = 10) %>%
  select(ID = !!sym(id_col_1), text_before, text_after, edit_distance, normalized_distance) 

datatable(most_edited,
          caption = "Top 10 Most Edited Posts",
          options = list(pageLength = 10))

Conclusion and recommendations

References