Marcelo Cardillo and Eugenia Carranza (2025) Applying statistical and causal modeling to interpret thermal alteration in observational lithic data, Peer Community Journal, 5: e88. https://doi.org/10.24072/pcjournal.609 Dataset and R code are both available on zenodo: https://zenodo.org/records/16335811
Abstract: Wildfire events are increasingly recognized as agents of taphonomic alteration in archaeological contexts. This study applies causal and statistical modeling to assess the impact of thermal alteration on lithic assemblages in the San Matías Gulf region of Northern Patagonia, Argentina. Utilizing data derived from naturalistic sampling after recent wildfires, we developed a Directed Acyclic Graph (DAG) to identify key mediators and potential confounders affecting lithic fragmentation. A minimal adjustment set was determined to isolate the indirect effect of fire on mechanical alterations, and Random Forest classifiers were trained to evaluate the predictive power of variables such as pebble shape, fissures, and rock type. Comparison between models based on original and synthetically balanced datasets revealed significant improvements in classification accuracy, particularly for unfractured samples. The results emphasize the central role of shape in fracture likelihood, and demonstrate the utility of synthetic resampling and causal inference for interpreting complex taphonomic processes in observational archaeological data. Experimental findings corroborate the observed patterns, supporting the potential generalizability of the proposed causal model across fire-affected contexts.
#general packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(reshape2)
library(dplyr)
##
## Adjuntando el paquete: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(vcdExtra)
## Cargando paquete requerido: vcd
## Cargando paquete requerido: grid
## Cargando paquete requerido: gnm
##
## Adjuntando el paquete: 'vcdExtra'
## The following object is masked from 'package:dplyr':
##
## summarise
library(tidyr)
##
## Adjuntando el paquete: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(tidyverse)
## Warning: package 'tibble' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ vcdExtra::summarise() masks dplyr::summarise()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## Warning: package 'scales' was built under R version 4.4.3
##
## Adjuntando el paquete: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
#Random over sampling
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.4.2
## Loaded ROSE 0.0-4
#classification
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Adjuntando el paquete: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Cargando paquete requerido: lattice
##
## Adjuntando el paquete: 'lattice'
##
## The following object is masked from 'package:gnm':
##
## barley
##
##
## Adjuntando el paquete: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
##DAG´S
library(ggdag)
##
## Adjuntando el paquete: 'ggdag'
##
## The following object is masked from 'package:stats':
##
## filter
library(dagitty)
##DAG for the naturalist model of causal effect
Nat_DAG <- dagify(
Mech_Alt ~Fire+Rock_Type + Shape + Fissures + Trampling,
Rock_Type ~ Fire + Knapping + Fissures,
Shape ~ Fire + Knapping,
Fissures ~ Fire,
Fire ~ Biomass,
exposure = "Fire",
outcome = "Mech_Alt",
latent="Trampling",
labels = c(
"Fissures"="Fissures",
"Fire" = "Fire",
"Mech_Alt" = "Mech_Alterations",
"Knapping" = "Knapping",
"Trampling" = "Trampling",
"Shape" = "Shape",
"Rock_Type" = "Rock_Type",
"Biomass" = "Biomass"), coords = list(
x = c(Biomass = 0.124, Fire = 0.255, Fissures= 0.306, Knapping = 0.478, Rock_Type = 0.479, Shape = 0.628, Mech_Alt = 0.737, Trampling = 0.733),
y = c(Biomass = 0.650, Fire = 0.652, Fissures = 0.362, Knapping = 0.158, Rock_Type = 0.390, Shape = 0.346, Mech_Alt = 0.659, Trampling = 0.163)
)
)
#Note:The causal model posits that fire exerts both a direct influence on the probability of rock fracture and an indirect influence mediated through Rock_Type, Shape, and Fissures. Furthermore, although Knapping and Trampling may contribute to mechanical damage, they are not directly observable in the available data and are therefore modeled as latent variables.
node_status(Nat_DAG, as_factor = TRUE)
## # A DAG with 8 nodes and 12 edges
## #
## # Exposure: Fire
## # Outcome: Mech_Alt
## # Latent Variable: Trampling
## #
## # A tibble: 13 × 10
## name x y direction to xend yend circular label status
## <chr> <dbl> <dbl> <fct> <chr> <dbl> <dbl> <lgl> <chr> <fct>
## 1 Biomass 0.124 0.65 -> Fire 0.255 0.652 FALSE Biom… <NA>
## 2 Fire 0.255 0.652 -> Fissures 0.306 0.362 FALSE Fire expos…
## 3 Fire 0.255 0.652 -> Mech_Alt 0.737 0.659 FALSE Fire expos…
## 4 Fire 0.255 0.652 -> Rock_Type 0.479 0.39 FALSE Fire expos…
## 5 Fire 0.255 0.652 -> Shape 0.628 0.346 FALSE Fire expos…
## 6 Fissures 0.306 0.362 -> Mech_Alt 0.737 0.659 FALSE Fiss… <NA>
## 7 Fissures 0.306 0.362 -> Rock_Type 0.479 0.39 FALSE Fiss… <NA>
## 8 Knapping 0.478 0.158 -> Rock_Type 0.479 0.39 FALSE Knap… <NA>
## 9 Knapping 0.478 0.158 -> Shape 0.628 0.346 FALSE Knap… <NA>
## 10 Mech_Alt 0.737 0.659 <NA> <NA> NA NA FALSE Mech… outco…
## 11 Rock_Type 0.479 0.39 -> Mech_Alt 0.737 0.659 FALSE Rock… <NA>
## 12 Shape 0.628 0.346 -> Mech_Alt 0.737 0.659 FALSE Shape <NA>
## 13 Trampling 0.733 0.163 -> Mech_Alt 0.737 0.659 FALSE Tram… latent
# Transform DAG into tidy format
Nat_DAG_tidy <- Nat_DAG %>%
tidy_dagitty() %>%
mutate(colour = case_when(
name %in% c("Fire", "Rock_Type", "Fissures", "Shape") ~ "lightcoral",
name == "Mech_Alt" ~ "lightgreen",
TRUE ~ "darkgray"
))
# Plot DAG with node colors
ggdag(Nat_DAG_tidy) +
geom_dag_point(aes(colour = colour)) +
geom_dag_edges() +
geom_dag_text(colour = "black") +
scale_colour_manual(values = c("lightcoral" = "lightcoral",
"lightgreen" = "lightgreen",
"gray" = "gray")) +
theme_dag()+
guides(colour = "none")
#Causal paths from Fire to Mechanical alterations in the naturalistic set
theme_set(theme_dag())
ggdag_paths(Nat_DAG, from = "Fire", to = "Mech_Alt", shadow = TRUE , stylized = TRUE, node_size = 7, text_size = 2, text_col="Black")
# Identify the minimum set of variables that need to be adjusted to estimate the causal effect of Fire in the natural sample on Mechanical Alterations, ensuring that all backdoor paths (potential confounding paths) are blocked
adjustmentSets(Nat_DAG,
exposure = "Fire",
outcome = "Mech_Alt",
type = "minimal",
effect = "direct")
## { Fissures, Rock_Type, Shape }
#Datasets
Nat_sample=read.table("Nat_sample.txt", T)
head(Nat_sample)
#Table of all categorical data
Nat_sample <- Nat_sample %>% mutate(across(where(is.character), as.factor))
summary(Nat_sample)
## Rock_Type Fissures Condition Shape
## AV :25 N:32 Fractured :33 Rounded:30
## IBV:17 Y:18 Not_Fractured:17 Tabular:20
## Sed: 4
## Sil: 4
#Percentages by combination
percentage_table <- Nat_sample %>%
# Count observations for each combination
count(Rock_Type, Shape, Condition, Fissures, name = "n") %>%
# Calculate percentages within each Condition
group_by(Condition) %>%
mutate(
percentage = n / sum(n) * 100) %>%
ungroup() %>%
# Sort results
arrange(Condition, Rock_Type, Shape, Fissures)
print(percentage_table)
## # A tibble: 17 × 6
## Rock_Type Shape Condition Fissures n percentage
## <fct> <fct> <fct> <fct> <int> <dbl>
## 1 AV Rounded Fractured N 9 27.3
## 2 AV Tabular Fractured N 3 9.09
## 3 AV Tabular Fractured Y 4 12.1
## 4 IBV Rounded Fractured N 10 30.3
## 5 IBV Rounded Fractured Y 3 9.09
## 6 IBV Tabular Fractured N 1 3.03
## 7 Sed Rounded Fractured N 2 6.06
## 8 Sil Rounded Fractured Y 1 3.03
## 9 AV Rounded Not_Fractured Y 1 5.88
## 10 AV Tabular Not_Fractured N 4 23.5
## 11 AV Tabular Not_Fractured Y 4 23.5
## 12 IBV Rounded Not_Fractured Y 1 5.88
## 13 IBV Tabular Not_Fractured N 1 5.88
## 14 IBV Tabular Not_Fractured Y 1 5.88
## 15 Sed Rounded Not_Fractured Y 2 11.8
## 16 Sil Rounded Not_Fractured Y 1 5.88
## 17 Sil Tabular Not_Fractured N 2 11.8
# Percentages by Shape
graph_data <- percentage_table %>%
mutate(
Combination = interaction(Rock_Type, Fissures, sep = " - ") %>%
fct_reorder(percentage, .fun = max)
)
# Stacked bar chart by Shape
ggplot(graph_data, aes(x = Combination, y = percentage, fill = Shape)) +
geom_col(position = "stack") +
geom_text(
aes(label = if_else(percentage >= 5, sprintf("%.1f%%", percentage), "")),
position = position_stack(vjust = 0.5),
size = 3,
color = "white"
) +
facet_wrap(~ Condition, scales = "free_x") +
labs(
title = "Shape Distribution by Rock Type and Fissure Status",
x = "Rock Type - Fissure Status",
y = "Percentage (%)",
fill = "Shape"
) +
scale_y_continuous(labels = percent_format(scale = 1)) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
legend.position = "top",
strip.text = element_text(face = "bold"),
panel.spacing = unit(1, "lines")
) +
coord_cartesian(ylim = c(0, 50))
## First classification model
set.seed(1973)
model1 <- randomForest(Condition ~ Shape + Rock_Type+Fissures, mtry=1, data = Nat_sample, importance=TRUE, ntree=1000)
model1
##
## Call:
## randomForest(formula = Condition ~ Shape + Rock_Type + Fissures, data = Nat_sample, mtry = 1, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 48%
## Confusion matrix:
## Fractured Not_Fractured class.error
## Fractured 24 9 0.2727273
## Not_Fractured 15 2 0.8823529
#Note: We use mtry=1, that in this case is the default value (square root of the number of predictors)
importance_values=importance(model1)
print(importance_values)
## Fractured Not_Fractured MeanDecreaseAccuracy MeanDecreaseGini
## Shape 19.323431 15.150422 20.6110286 3.562321
## Rock_Type 2.534765 -1.765154 0.1086124 2.419630
## Fissures 13.627506 7.280955 12.8589498 2.491965
#Plot rf variable importance measures
varImpPlot(model1, sort=T,main="Variable Importance Plot")
#Create synthetic data
bal_sample <- ROSE(Condition ~ Shape + Rock_Type+Fissures, data =Nat_sample, seed = 123)$data
# Check synthetic data
summary(bal_sample)
## Rock_Type Fissures Condition Shape
## AV :25 N:28 Fractured :25 Rounded:25
## IBV:17 Y:22 Not_Fractured:25 Tabular:25
## Sed: 2
## Sil: 6
# Create a distribution of prediction probabilities based on random forest results.
# 1. Create all possible combination of factors levels
combinations <- expand.grid(
Rock_Type = c("AV", "Sil", "Sed", "IBV"), # Rock_Type
Shape = c("Rounded", "Tabular"), # Shape
Fissures = c("Y", "N") # Fissures
)
# 2: Predict probabilities based on the trained model (model1)
combinations$Prob_Fractured <- predict(model1, combinations, type = "prob")[, "Fractured"]
# Sort combinations
combinations <- combinations[order(-combinations$Prob_Fractured), ]
# Print probabilities in decreasing order
print(combinations)
## Rock_Type Shape Fissures Prob_Fractured
## 12 IBV Rounded N 1.000
## 9 AV Rounded N 0.991
## 11 Sed Rounded N 0.862
## 4 IBV Rounded Y 0.731
## 10 Sil Rounded N 0.698
## 16 IBV Tabular N 0.546
## 1 AV Rounded Y 0.543
## 2 Sil Rounded Y 0.491
## 13 AV Tabular N 0.490
## 5 AV Tabular Y 0.421
## 15 Sed Tabular N 0.398
## 8 IBV Tabular Y 0.394
## 3 Sed Rounded Y 0.374
## 6 Sil Tabular Y 0.263
## 7 Sed Tabular Y 0.253
## 14 Sil Tabular N 0.227
#Note:This script creates a predictive table to explore how different combinations of lithic attributes (rock type, shape, and presence of fissures) affect the predicted probability of fracture, based on a previously trained random forest model (model1). First, it generates all possible combinations of these categorical factors using expand.grid function, in order to cover the entire space of hypothetical observed trait combinations. Then, it uses the predict() function to estimate the probability of being classified as “Fractured” for each combination and storing it in a new column of the “combinations” data. Finally, it sorts the combinations in descending order of fracture probability.
# 3. Plot fracture probabilities by rock type
ggplot(combinations, aes(x = Shape, y = Prob_Fractured, fill = Rock_Type)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ Fissures) +
labs(
title = "Fracture probability for naturalistic sample",
x = "Shape",
y = "Probability of fracture",
fill = "Rock_Type"
) +
theme_minimal()
## Second classification model (balanced sample)
set.seed(1973)
model2 <- randomForest(Condition ~ Shape + Rock_Type+Fissures, mtry=1, data = bal_sample, importance=TRUE, ntree=1000)
model2
##
## Call:
## randomForest(formula = Condition ~ Shape + Rock_Type + Fissures, data = bal_sample, mtry = 1, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 16%
## Confusion matrix:
## Fractured Not_Fractured class.error
## Fractured 19 6 0.24
## Not_Fractured 2 23 0.08
importance_values2=importance(model2)
print(importance_values2)
## Fractured Not_Fractured MeanDecreaseAccuracy MeanDecreaseGini
## Shape 26.01564 29.19392 31.34526 4.886198
## Rock_Type 17.31656 20.82375 23.44170 4.390060
## Fissures 16.15591 12.53004 16.38496 2.188747
#Plot rf variable importance measures
varImpPlot(model2, sort=T,main="Variable Importance for balanced model")
# Create a distribution of prediction probabilities based on random forest results for the synthetic data.
# 1. Create all possible combination of factors levels
combinations2 <- expand.grid(
Rock_Type = c("AV", "Sil", "Sed", "IBV"), # Rock_Type
Shape = c("Rounded", "Tabular"), # Shape
Fissures = c("Y", "N") # Fissures
)
# 2: Predict probabilities based on the trained model (model2)
combinations2$Prob_Fractured <- predict(model2, combinations2, type = "prob")[, "Fractured"]
# Sort combinations
combinations2 <- combinations2[order(-combinations2$Prob_Fractured), ]
# Sort probabilites in decreasing order
print(combinations2)
## Rock_Type Shape Fissures Prob_Fractured
## 12 IBV Rounded N 0.997
## 9 AV Rounded N 0.917
## 4 IBV Rounded Y 0.721
## 10 Sil Rounded N 0.547
## 11 Sed Rounded N 0.547
## 16 IBV Tabular N 0.499
## 1 AV Rounded Y 0.414
## 8 IBV Tabular Y 0.291
## 13 AV Tabular N 0.229
## 2 Sil Rounded Y 0.170
## 3 Sed Rounded Y 0.170
## 14 Sil Tabular N 0.119
## 15 Sed Tabular N 0.119
## 5 AV Tabular Y 0.043
## 6 Sil Tabular Y 0.014
## 7 Sed Tabular Y 0.014
# 3. Plot fracture probabilities by rock type for the synthetic model
ggplot(combinations2, aes(x = Shape, y = Prob_Fractured, fill = Rock_Type)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ Fissures) +
labs(
title = "Fracture probability for balanced sample",
x = "Shape",
y = "Probability of fracture",
fill = "Rock_Type"
) +
theme_minimal()
Published by: Marcelo Cardillo and Eugenia Carranza.