2026-06-07

Data difficulties

Problem statement: To what extend does the reliability of metric space data affect the accuracy of scientific observations and modeling in science?

Challenges

  • Very limited data
  • Indirect measurements
  • High noise data
  • False positives

Growing technqies

New strategies on validating data are always developing such as:

  • String aggregation
  • Failure Mode & Effect Analysis (FMEA)
  • Linear regression
  • Analysis of Variance (ANOVA)
i.e. Euclidean distance being one of the most common:

d2 = (x2 - x1)2 + (y2 - y1)2

Developing a statistical method

  • SP determines whether data is significant or random noise

Separation/segregation power - SP

SP = \({MD~inter~ \over MD~intra~}\)

MDinter: Inter mean distance

MDintra: Intra mean distance

Complications

  • False negatives (Highest SP)
    • Highest SP ≠ significant results
  • Small SP distribution data sets
    • Error increases
    • No comparisons
  • Dependence on simulations
    • Null hypothesis reliable

SP - Separation Power

Methodological study

  • Applied statistical approach introducing separation/segregation power (SP)


Sum of Distances: Intra{(SD)intra = d(2,3) + d(1,4,5)}

Sum of Distances: Inter{(SD)inter = d(2,1) + d(2,5) + d(2,4) + d(3,1) + d(3,4) + d(3,5)}

Methodological approach

Data science approach

Code for plotting a plotly plot based on partitioning metric space data with randomized numbers

set.seed(123)
xDist = sample(1:10, 5)
yDist = sample(1:10, 5)
zDist = sample(1:10, 5)

clusters = c("cluster1", "cluster1", "cluster2", 
             "cluster2", "cluster2")
points = c("a", "b", "c", "d", "e")

fig1 <- plot_ly(x = xDist, y = yDist, z = zDist,
        type = "scatter3d", mode = "markers+text",
        color = clusters,
        text = points)
fig1 ## plotly not printing for ioslides

Analysing data

Normal distribution calculations

  • Easy way of analyzing discrepancies for a data set.

f(d) = probability density function (PDF)

\(f(d) = {1 \over {\sigma}\sqrt{\pi}} e^{-\left({d \over 2{\sigma}}\right)^{2}}\) \((d \geq 0)\)

Analysing data

  • Normal distribution is able to show that the majority of the distances are 8 bits apart. Plotting more points will result in a narrower distribution.
    • For meaningful data, this can be used to show patterns whether the data fits the criteria of focus
    • SP increases accuracy thus improves assertion and validates the data