Applying statistical and causal modeling to interpret thermal alteration in observational lithic data

Marcelo Cardillo and Eugenia Carranza (2025) Applying statistical and causal modeling to interpret thermal alteration in observational lithic data, Peer Community Journal, 5: e88. https://doi.org/10.24072/pcjournal.609 Dataset and R code are both available on zenodo: https://zenodo.org/records/16335811

Abstract: Wildfire events are increasingly recognized as agents of taphonomic alteration in archaeological contexts. This study applies causal and statistical modeling to assess the impact of thermal alteration on lithic assemblages in the San Matías Gulf region of Northern Patagonia, Argentina. Utilizing data derived from naturalistic sampling after recent wildfires, we developed a Directed Acyclic Graph (DAG) to identify key mediators and potential confounders affecting lithic fragmentation. A minimal adjustment set was determined to isolate the indirect effect of fire on mechanical alterations, and Random Forest classifiers were trained to evaluate the predictive power of variables such as pebble shape, fissures, and rock type. Comparison between models based on original and synthetically balanced datasets revealed significant improvements in classification accuracy, particularly for unfractured samples. The results emphasize the central role of shape in fracture likelihood, and demonstrate the utility of synthetic resampling and causal inference for interpreting complex taphonomic processes in observational archaeological data. Experimental findings corroborate the observed patterns, supporting the potential generalizability of the proposed causal model across fire-affected contexts.

#general packages
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

library(reshape2)
library(dplyr)

## 
## Adjuntando el paquete: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(vcdExtra)

## Cargando paquete requerido: vcd

## Cargando paquete requerido: grid

## Cargando paquete requerido: gnm

## 
## Adjuntando el paquete: 'vcdExtra'

## The following object is masked from 'package:dplyr':
## 
##     summarise

library(tidyr)

## 
## Adjuntando el paquete: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

library(tidyverse)

## Warning: package 'tibble' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.3     ✔ stringr   1.5.1
## ✔ purrr     1.0.2     ✔ tibble    3.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()       masks stats::filter()
## ✖ dplyr::lag()          masks stats::lag()
## ✖ vcdExtra::summarise() masks dplyr::summarise()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales)

## Warning: package 'scales' was built under R version 4.4.3

## 
## Adjuntando el paquete: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

#Random over sampling
library(ROSE)

## Warning: package 'ROSE' was built under R version 4.4.2

## Loaded ROSE 0.0-4

#classification 
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.4.2

## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Adjuntando el paquete: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)

## Cargando paquete requerido: lattice
## 
## Adjuntando el paquete: 'lattice'
## 
## The following object is masked from 'package:gnm':
## 
##     barley
## 
## 
## Adjuntando el paquete: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

##DAG´S
library(ggdag)

## 
## Adjuntando el paquete: 'ggdag'
## 
## The following object is masked from 'package:stats':
## 
##     filter

library(dagitty)

##DAG for the naturalist model of causal effect

Nat_DAG <- dagify(
  Mech_Alt ~Fire+Rock_Type + Shape + Fissures + Trampling,
  Rock_Type ~ Fire + Knapping + Fissures,
  Shape ~ Fire + Knapping,
  Fissures ~ Fire,
  Fire ~ Biomass,
  exposure = "Fire",
  outcome = "Mech_Alt",
  latent="Trampling",
  labels = c(
    "Fissures"="Fissures",
    "Fire" = "Fire",
    "Mech_Alt" = "Mech_Alterations",
    "Knapping" = "Knapping",
    "Trampling" = "Trampling",
    "Shape" = "Shape",
    "Rock_Type" = "Rock_Type",
    "Biomass" = "Biomass"), coords = list(
    x = c(Biomass = 0.124, Fire = 0.255, Fissures= 0.306, Knapping = 0.478, Rock_Type = 0.479, Shape = 0.628, Mech_Alt = 0.737, Trampling = 0.733),
    y = c(Biomass = 0.650, Fire = 0.652, Fissures = 0.362, Knapping = 0.158, Rock_Type = 0.390, Shape = 0.346, Mech_Alt = 0.659, Trampling = 0.163)
  )
)

#Note:The causal model posits that fire exerts both a direct influence on the probability of rock fracture and an indirect influence mediated through Rock_Type, Shape, and Fissures. Furthermore, although Knapping and Trampling may contribute to mechanical damage, they are not directly observable in the available data and are therefore modeled as latent variables.

node_status(Nat_DAG, as_factor = TRUE)

## # A DAG with 8 nodes and 12 edges
## #
## # Exposure: Fire
## # Outcome: Mech_Alt
## # Latent Variable: Trampling
## #
## # A tibble: 13 × 10
##    name          x     y direction to          xend   yend circular label status
##    <chr>     <dbl> <dbl> <fct>     <chr>      <dbl>  <dbl> <lgl>    <chr> <fct> 
##  1 Biomass   0.124 0.65  ->        Fire       0.255  0.652 FALSE    Biom… <NA>  
##  2 Fire      0.255 0.652 ->        Fissures   0.306  0.362 FALSE    Fire  expos…
##  3 Fire      0.255 0.652 ->        Mech_Alt   0.737  0.659 FALSE    Fire  expos…
##  4 Fire      0.255 0.652 ->        Rock_Type  0.479  0.39  FALSE    Fire  expos…
##  5 Fire      0.255 0.652 ->        Shape      0.628  0.346 FALSE    Fire  expos…
##  6 Fissures  0.306 0.362 ->        Mech_Alt   0.737  0.659 FALSE    Fiss… <NA>  
##  7 Fissures  0.306 0.362 ->        Rock_Type  0.479  0.39  FALSE    Fiss… <NA>  
##  8 Knapping  0.478 0.158 ->        Rock_Type  0.479  0.39  FALSE    Knap… <NA>  
##  9 Knapping  0.478 0.158 ->        Shape      0.628  0.346 FALSE    Knap… <NA>  
## 10 Mech_Alt  0.737 0.659 <NA>      <NA>      NA     NA     FALSE    Mech… outco…
## 11 Rock_Type 0.479 0.39  ->        Mech_Alt   0.737  0.659 FALSE    Rock… <NA>  
## 12 Shape     0.628 0.346 ->        Mech_Alt   0.737  0.659 FALSE    Shape <NA>  
## 13 Trampling 0.733 0.163 ->        Mech_Alt   0.737  0.659 FALSE    Tram… latent

# Transform DAG into tidy format

Nat_DAG_tidy <- Nat_DAG %>% 
  tidy_dagitty() %>% 
  mutate(colour = case_when(
    name %in% c("Fire", "Rock_Type", "Fissures", "Shape") ~ "lightcoral", 
    name == "Mech_Alt" ~ "lightgreen",  
    TRUE ~ "darkgray"             
  ))

# Plot DAG with node colors

ggdag(Nat_DAG_tidy) + 
  geom_dag_point(aes(colour = colour)) +
  geom_dag_edges() +
  geom_dag_text(colour = "black") +
  scale_colour_manual(values = c("lightcoral" = "lightcoral", 
                                 "lightgreen" = "lightgreen", 
                                 "gray" = "gray")) +
  theme_dag()+
   guides(colour = "none")

#Causal paths from Fire to Mechanical alterations in the naturalistic set

theme_set(theme_dag())
ggdag_paths(Nat_DAG, from = "Fire", to = "Mech_Alt", shadow = TRUE , stylized = TRUE, node_size = 7, text_size = 2, text_col="Black")

# Identify the minimum set of variables that need to be adjusted to estimate the causal effect of Fire in the natural sample on Mechanical Alterations, ensuring that all backdoor paths (potential confounding paths) are blocked

adjustmentSets(Nat_DAG,
  exposure = "Fire", 
  outcome = "Mech_Alt",
  type = "minimal",
  effect = "direct")

## { Fissures, Rock_Type, Shape }

#Datasets
Nat_sample=read.table("Nat_sample.txt", T)

head(Nat_sample)

#Table of all categorical data
Nat_sample <- Nat_sample %>% mutate(across(where(is.character), as.factor))

summary(Nat_sample)

##  Rock_Type Fissures         Condition      Shape   
##  AV :25    N:32     Fractured    :33   Rounded:30  
##  IBV:17    Y:18     Not_Fractured:17   Tabular:20  
##  Sed: 4                                            
##  Sil: 4

#Percentages by combination

percentage_table <- Nat_sample %>%
  # Count observations for each combination
  count(Rock_Type, Shape, Condition, Fissures, name = "n") %>%
  
  # Calculate percentages within each Condition
  group_by(Condition) %>%
  mutate(
    percentage = n / sum(n) * 100) %>%
  ungroup() %>%
  
  # Sort results
  arrange(Condition, Rock_Type, Shape, Fissures)

print(percentage_table)

## # A tibble: 17 × 6
##    Rock_Type Shape   Condition     Fissures     n percentage
##    <fct>     <fct>   <fct>         <fct>    <int>      <dbl>
##  1 AV        Rounded Fractured     N            9      27.3 
##  2 AV        Tabular Fractured     N            3       9.09
##  3 AV        Tabular Fractured     Y            4      12.1 
##  4 IBV       Rounded Fractured     N           10      30.3 
##  5 IBV       Rounded Fractured     Y            3       9.09
##  6 IBV       Tabular Fractured     N            1       3.03
##  7 Sed       Rounded Fractured     N            2       6.06
##  8 Sil       Rounded Fractured     Y            1       3.03
##  9 AV        Rounded Not_Fractured Y            1       5.88
## 10 AV        Tabular Not_Fractured N            4      23.5 
## 11 AV        Tabular Not_Fractured Y            4      23.5 
## 12 IBV       Rounded Not_Fractured Y            1       5.88
## 13 IBV       Tabular Not_Fractured N            1       5.88
## 14 IBV       Tabular Not_Fractured Y            1       5.88
## 15 Sed       Rounded Not_Fractured Y            2      11.8 
## 16 Sil       Rounded Not_Fractured Y            1       5.88
## 17 Sil       Tabular Not_Fractured N            2      11.8

# Percentages by Shape 
graph_data <- percentage_table %>%
  mutate(
    Combination = interaction(Rock_Type, Fissures, sep = " - ") %>% 
      fct_reorder(percentage, .fun = max)
  )

# Stacked bar chart by Shape
ggplot(graph_data, aes(x = Combination, y = percentage, fill = Shape)) +
  geom_col(position = "stack") +
  geom_text(
    aes(label = if_else(percentage >= 5, sprintf("%.1f%%", percentage), "")),
    position = position_stack(vjust = 0.5),
    size = 3,
    color = "white"  
  ) +
  facet_wrap(~ Condition, scales = "free_x") +
  labs(
    title = "Shape Distribution by Rock Type and Fissure Status",
    x = "Rock Type - Fissure Status",
    y = "Percentage (%)",
    fill = "Shape"
  ) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  scale_fill_brewer(palette = "Set2") +  
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    legend.position = "top",
    strip.text = element_text(face = "bold"),
    panel.spacing = unit(1, "lines")
  ) +
  coord_cartesian(ylim = c(0, 50))

## First classification model

set.seed(1973)
model1 <- randomForest(Condition ~ Shape + Rock_Type+Fissures, mtry=1, data = Nat_sample, importance=TRUE, ntree=1000)

model1

## 
## Call:
##  randomForest(formula = Condition ~ Shape + Rock_Type + Fissures,      data = Nat_sample, mtry = 1, importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 48%
## Confusion matrix:
##               Fractured Not_Fractured class.error
## Fractured            24             9   0.2727273
## Not_Fractured        15             2   0.8823529

#Note: We use mtry=1, that in this case is the default value (square root of the number of predictors)

importance_values=importance(model1)
print(importance_values)

##           Fractured Not_Fractured MeanDecreaseAccuracy MeanDecreaseGini
## Shape     19.323431     15.150422           20.6110286         3.562321
## Rock_Type  2.534765     -1.765154            0.1086124         2.419630
## Fissures  13.627506      7.280955           12.8589498         2.491965

#Plot  rf variable importance measures
varImpPlot(model1, sort=T,main="Variable Importance Plot")

#Create synthetic data 
bal_sample <- ROSE(Condition ~ Shape + Rock_Type+Fissures, data =Nat_sample, seed = 123)$data

# Check synthetic data
summary(bal_sample)

##  Rock_Type Fissures         Condition      Shape   
##  AV :25    N:28     Fractured    :25   Rounded:25  
##  IBV:17    Y:22     Not_Fractured:25   Tabular:25  
##  Sed: 2                                            
##  Sil: 6

# Create a distribution of prediction probabilities based on random forest results.
# 1. Create all possible combination of factors levels
combinations <- expand.grid(
  Rock_Type = c("AV", "Sil", "Sed", "IBV"),  # Rock_Type
  Shape = c("Rounded", "Tabular"),           # Shape
  Fissures = c("Y", "N")                     # Fissures
)

# 2: Predict probabilities based on the trained model (model1)
combinations$Prob_Fractured <- predict(model1, combinations, type = "prob")[, "Fractured"]

# Sort combinations
combinations <- combinations[order(-combinations$Prob_Fractured), ]

# Print probabilities in decreasing order
print(combinations)

##    Rock_Type   Shape Fissures Prob_Fractured
## 12       IBV Rounded        N          1.000
## 9         AV Rounded        N          0.991
## 11       Sed Rounded        N          0.862
## 4        IBV Rounded        Y          0.731
## 10       Sil Rounded        N          0.698
## 16       IBV Tabular        N          0.546
## 1         AV Rounded        Y          0.543
## 2        Sil Rounded        Y          0.491
## 13        AV Tabular        N          0.490
## 5         AV Tabular        Y          0.421
## 15       Sed Tabular        N          0.398
## 8        IBV Tabular        Y          0.394
## 3        Sed Rounded        Y          0.374
## 6        Sil Tabular        Y          0.263
## 7        Sed Tabular        Y          0.253
## 14       Sil Tabular        N          0.227

#Note:This script creates a predictive table to explore how different combinations of lithic attributes (rock type, shape, and presence of fissures) affect the predicted probability of fracture, based on a previously trained random forest model (model1). First, it generates all possible combinations of these categorical factors using expand.grid function, in order to cover the entire space of hypothetical observed trait combinations. Then, it uses the predict() function to estimate the probability of being classified as “Fractured” for each combination and storing it in a new column of the “combinations” data. Finally, it sorts the combinations in descending order of fracture probability.

# 3. Plot fracture probabilities by rock type
ggplot(combinations, aes(x = Shape, y = Prob_Fractured, fill = Rock_Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Fissures) +
  labs(
    title = "Fracture probability for naturalistic sample",
    x = "Shape",
    y = "Probability of fracture",
    fill = "Rock_Type"
  ) +
  theme_minimal()

## Second classification model (balanced sample)

set.seed(1973)
model2 <- randomForest(Condition ~ Shape + Rock_Type+Fissures, mtry=1, data = bal_sample, importance=TRUE, ntree=1000)

model2

## 
## Call:
##  randomForest(formula = Condition ~ Shape + Rock_Type + Fissures,      data = bal_sample, mtry = 1, importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 16%
## Confusion matrix:
##               Fractured Not_Fractured class.error
## Fractured            19             6        0.24
## Not_Fractured         2            23        0.08

importance_values2=importance(model2)
print(importance_values2)

##           Fractured Not_Fractured MeanDecreaseAccuracy MeanDecreaseGini
## Shape      26.01564      29.19392             31.34526         4.886198
## Rock_Type  17.31656      20.82375             23.44170         4.390060
## Fissures   16.15591      12.53004             16.38496         2.188747

#Plot  rf variable importance measures
varImpPlot(model2, sort=T,main="Variable Importance for balanced model")

# Create a distribution of prediction probabilities based on random forest results for the synthetic data.
# 1. Create all possible combination of factors levels
combinations2 <- expand.grid(
  Rock_Type = c("AV", "Sil", "Sed", "IBV"),  # Rock_Type
  Shape = c("Rounded", "Tabular"),           # Shape
  Fissures = c("Y", "N")                     # Fissures
)

# 2: Predict probabilities based on the trained model (model2)
combinations2$Prob_Fractured <- predict(model2, combinations2, type = "prob")[, "Fractured"]

# Sort combinations
combinations2 <- combinations2[order(-combinations2$Prob_Fractured), ]

# Sort probabilites in decreasing order
print(combinations2)

##    Rock_Type   Shape Fissures Prob_Fractured
## 12       IBV Rounded        N          0.997
## 9         AV Rounded        N          0.917
## 4        IBV Rounded        Y          0.721
## 10       Sil Rounded        N          0.547
## 11       Sed Rounded        N          0.547
## 16       IBV Tabular        N          0.499
## 1         AV Rounded        Y          0.414
## 8        IBV Tabular        Y          0.291
## 13        AV Tabular        N          0.229
## 2        Sil Rounded        Y          0.170
## 3        Sed Rounded        Y          0.170
## 14       Sil Tabular        N          0.119
## 15       Sed Tabular        N          0.119
## 5         AV Tabular        Y          0.043
## 6        Sil Tabular        Y          0.014
## 7        Sed Tabular        Y          0.014

# 3. Plot fracture probabilities by rock type for the synthetic model
ggplot(combinations2, aes(x = Shape, y = Prob_Fractured, fill = Rock_Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Fissures) +
  labs(
    title = "Fracture probability for balanced sample",
    x = "Shape",
    y = "Probability of fracture",
    fill = "Rock_Type"
  ) +
  theme_minimal()

Published by: Marcelo Cardillo and Eugenia Carranza.