Partitioning Metric Space Data in Science

2026-06-07

Data difficulties

Problem statement: To what extent does the reliability of metric space data affect the accuracy of scientific observations and modeling?

Challenges

Very limited data
Indirect measurements
High noise data
False positives

Growing techniques

New strategies for validating data are continually being developed, such as:

String aggregation
Failure Mode & Effect Analysis (FMEA)
Linear regression
Analysis of Variance (ANOVA)

These rely on a distance metric, e.g. Euclidean distance, one of the most common:

d² = (x₂ - x₁)² + (y₂ - y₁)²
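As a minimal sketch of the formula above, the distance can be computed by hand and compared with base R's `dist()`. The helper name `euclidean_distance` and the two sample points are illustrative assumptions, not part of the original slides.

```r
# Euclidean distance between two points given as numeric vectors.
euclidean_distance <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((q - p)^2))  # square root of the summed squared differences
}

p1 <- c(1, 2)  # (x1, y1), illustrative values
p2 <- c(4, 6)  # (x2, y2), illustrative values

d_manual <- euclidean_distance(p1, p2)        # 3^2 + 4^2 = 25, so d = 5
d_builtin <- as.numeric(dist(rbind(p1, p2)))  # same value via stats::dist
```

The same helper works unchanged for points with three or more coordinates, since it sums over all components.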

Developing a statistical method

SP determines whether data is significant or random noise

Separation/segregation power - SP

\(SP = {MD_{inter} \over MD_{intra}}\)

MD_inter: mean inter-cluster distance (between points in different clusters)

MD_intra: mean intra-cluster distance (between points in the same cluster)
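One way to compute SP from the definition above is sketched here. It assumes Euclidean distances, counts each unordered pair of points once, and requires at least one cluster with two or more points. The function name `separation_power` and the toy data are hypothetical.

```r
# Separation power: mean inter-cluster distance / mean intra-cluster distance.
# `coords` is a numeric matrix (one row per point); `labels` gives each row's cluster.
separation_power <- function(coords, labels) {
  d <- as.matrix(dist(coords))         # pairwise Euclidean distances
  pair <- upper.tri(d)                 # count each unordered pair once
  same <- outer(labels, labels, "==")  # TRUE where both points share a cluster
  md_intra <- mean(d[pair & same])
  md_inter <- mean(d[pair & !same])
  md_inter / md_intra
}

# Hypothetical toy data: two tight, well-separated clusters on a line.
coords <- matrix(c(0, 1, 10, 11), ncol = 1)
labels <- c("A", "A", "B", "B")
sp <- separation_power(coords, labels)  # intra mean is 1, inter mean is 10
```

For this toy data SP is 10. Values close to 1 would mean that points in different clusters are no farther apart, on average, than points in the same cluster.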

Complications

False negatives (highest SP)
- The highest SP does not guarantee significant results
Small data sets for the SP distribution
- Error increases
- No comparisons are possible
Dependence on simulations
- The null hypothesis must be reliable
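The dependence on simulations can be illustrated with a label-shuffling simulation. This is an assumed scheme, not necessarily the one used in the study: cluster labels are randomly permuted to build a null distribution of SP, and the observed SP is compared against it. The names `separation_power` and `sp_permutation_test` and the synthetic data are hypothetical.

```r
# SP as defined in the text: mean inter-cluster / mean intra-cluster distance.
separation_power <- function(coords, labels) {
  d <- as.matrix(dist(coords))
  pair <- upper.tri(d)
  same <- outer(labels, labels, "==")
  mean(d[pair & !same]) / mean(d[pair & same])
}

# Null distribution of SP obtained by shuffling the cluster labels.
sp_permutation_test <- function(coords, labels, n_perm = 999) {
  observed <- separation_power(coords, labels)
  null_sp <- replicate(n_perm, separation_power(coords, sample(labels)))
  # Share of SP values (including the observed one) at least as large as observed.
  p_value <- (1 + sum(null_sp >= observed)) / (n_perm + 1)
  list(observed = observed, null_sp = null_sp, p_value = p_value)
}

# Synthetic example: two 2D clusters of 10 points with different means.
set.seed(1)
coords <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                matrix(rnorm(20, mean = 5), ncol = 2))
labels <- rep(c("A", "B"), each = 10)
result <- sp_permutation_test(coords, labels, n_perm = 199)
```

With few points the null distribution is coarse, so the resulting p-value is itself noisy, which matches the small-data-set complication listed above.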

SP - Separation Power

Methodological study

Applied statistical approach introducing separation/segregation power (SP)

Sum of distances (intra): SD_intra = d(2,3) + d(1,4) + d(1,5) + d(4,5)

Sum of distances (inter): SD_inter = d(2,1) + d(2,4) + d(2,5) + d(3,1) + d(3,4) + d(3,5)
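A worked sketch of these sums for five points split into clusters {2, 3} and {1, 4, 5}: the intra-cluster sum covers every pair within a cluster, and the inter-cluster sum covers the six cross-cluster pairs. The coordinates are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical 2D coordinates for points 1..5 (not from the source).
coords <- rbind(c(5, 5), c(0, 0), c(1, 0), c(5, 6), c(6, 5))
cluster <- c("B", "A", "A", "B", "B")  # points 2, 3 together; points 1, 4, 5 together

d <- as.matrix(dist(coords))           # pairwise Euclidean distances
pair <- upper.tri(d)                   # each unordered pair once
same <- outer(cluster, cluster, "==")  # TRUE for pairs within one cluster

sd_intra <- sum(d[pair & same])   # pairs (2,3), (1,4), (1,5), (4,5)
sd_inter <- sum(d[pair & !same])  # the six cross-cluster pairs
n_intra <- sum(pair & same)       # 4 pairs
n_inter <- sum(pair & !same)      # 6 pairs

# Dividing each sum by its pair count gives the mean distances used in SP.
sp <- (sd_inter / n_inter) / (sd_intra / n_intra)
```

Because the two sums cover different numbers of pairs (4 and 6 here), they are converted to means before taking the ratio.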

Methodological approach

Data science approach

Code for a plotly 3D scatter plot of partitioned metric space data, using randomly generated coordinates:

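library(plotly)  # provides plot_ly()

set.seed(123)  # reproducible random coordinates
# Five random points: each coordinate is drawn from 1..10 without replacement
xDist <- sample(1:10, 5)
yDist <- sample(1:10, 5)
zDist <- sample(1:10, 5)

# Cluster membership and a text label for each point
clusters <- c("cluster1", "cluster1", "cluster2",
              "cluster2", "cluster2")
points <- c("a", "b", "c", "d", "e")

# 3D scatter plot, colored by cluster and labeled by point
fig1 <- plot_ly(x = xDist, y = yDist, z = zDist,
                type = "scatter3d", mode = "markers+text",
                color = clusters,
                text = points)
fig1  # note: plotly output does not print in ioslides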

Analysing data

Normal distribution calculations

The normal distribution offers a simple way of analyzing discrepancies in a data set.

f(d): probability density function (PDF) of the distance d

\(f(d) = {1 \over {\sigma}\sqrt{\pi}} e^{-\left({d \over 2{\sigma}}\right)^{2}}\) \((d \geq 0)\)
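The density above can be checked numerically. It matches the distribution of the absolute difference between two independent normal values that share a standard deviation \(\sigma\). Treating d this way is an interpretation, not something the slides state. The sketch confirms that the density integrates to 1 and compares its mean, \(2\sigma / \sqrt{\pi}\), with simulated distances. The function name `distance_pdf` and the value of `sigma` are illustrative.

```r
# PDF of the distance d as written above, defined for d >= 0.
distance_pdf <- function(d, sigma) {
  exp(-(d / (2 * sigma))^2) / (sigma * sqrt(pi))
}

sigma <- 2  # illustrative value

# A valid PDF on d >= 0 must integrate to 1.
total <- integrate(distance_pdf, lower = 0, upper = Inf, sigma = sigma)$value

# Simulation check: distances between pairs of independent N(0, sigma^2) draws.
set.seed(42)
d_sim <- abs(rnorm(1e5, sd = sigma) - rnorm(1e5, sd = sigma))
mean_theory <- 2 * sigma / sqrt(pi)  # mean of this density
mean_sim <- mean(d_sim)
```

Overlaying `distance_pdf` on a histogram of `d_sim` is a quick visual way to see whether observed distances follow this model.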

Analysing data

The normal distribution shows that the majority of the distances are 8 bits apart. Plotting more points will result in a narrower distribution.
- For meaningful data, this can be used to show whether patterns in the data fit the criteria of interest
- SP increases accuracy, which strengthens assertions and helps validate the data