Understanding Standard Deviation in Large Data Models

2025-11-09

Data Variance in Datasets

Randomness in collected data is impossible to remove, so we use standard deviation as a metric to track this variability.

Related summary statistics include:

Mean (average)
Range
Interquartile Range
Variance

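Mathematical Formula

Standard deviation measures the typical distance of data values from the mean. More precisely, it is the square root of the average squared distance from the mean. For a full population of \(N\) values: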

\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \]

Where:

\(x_i\) = individual data points
\(\mu\) = population mean
\(N\) = number of observations
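
As a minimal sketch, the population formula can be evaluated directly in R. The data vector `x` below is made up for illustration. Note that base R's `sd()` divides by \(n - 1\) rather than \(N\), so the population version is written out by hand here.

```r
# Illustrative data (hypothetical values, not from a real dataset)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mu <- mean(x)                     # population mean
sigma <- sqrt(mean((x - mu)^2))   # square root of the mean squared deviation

sigma    # 2
sd(x)    # about 2.14, because sd() uses the n - 1 denominator
```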

Uses and Limitations

Standard deviation is useful for:

Quantifying variability in datasets
Detecting outliers or unusually dispersed data
Serving as a clear statistic that can be used in probability estimates

However, standard deviation has limitations. It:

Assumes access to the entire population, not just a sample, when computed with the population formula
Is sensitive to outliers and sampling bias
Can underestimate variability in small samples
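
The sensitivity to outliers can be illustrated with a small sketch. The measurements below are hypothetical and chosen only to make the effect visible: a single extreme value is appended to an otherwise tightly clustered sample.

```r
# Hypothetical measurements, chosen only for illustration
x <- c(10, 12, 11, 13, 12, 11, 12)
x_out <- c(x, 100)   # the same data plus one extreme value

sd_clean <- sd(x)      # spread of the original sample
sd_out <- sd(x_out)    # spread after adding the outlier

c(sd_clean = sd_clean, sd_out = sd_out)
```

In this example one added point inflates the standard deviation many times over, even though the other seven values are unchanged.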

Bessel’s Correction

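When only a sample of \(n\) observations is available, the sample standard deviation replaces \(\mu\) with the sample mean \(\bar{x}\) and divides by \(n - 1\) instead of \(N\):

\[ s = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 } \]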

Why use \(n - 1\)?

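The sample mean \(\bar{x}\) is itself an estimate computed from the same data, so deviations measured from it are, on average, smaller than deviations from the true mean \(\mu\)
Using \(n\) in the denominator therefore systematically underestimates the true variability in the data
Dividing by \(n - 1\) applies Bessel’s correction, producing an unbiased estimator of the population variance (its square root \(s\) is still slightly biased for \(\sigma\), but less so than the uncorrected version)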
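
The effect of the correction can be checked with a small simulation. This is a sketch under assumed settings: normally distributed data, a true variance of 4, samples of size 5, and 20,000 repetitions. None of these values come from the text above.

```r
set.seed(42)

true_var <- 4    # assumed population variance (sigma = 2)
n <- 5           # small sample size, where the bias is most visible
reps <- 20000    # number of simulated samples

# For each simulated sample, compute the variance with both denominators
estimates <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = sqrt(true_var))
  ss <- sum((x - mean(x))^2)   # sum of squared deviations from the sample mean
  c(divide_by_n = ss / n, divide_by_n_minus_1 = ss / (n - 1))
})

# Average each estimator over all simulated samples
avg <- rowMeans(estimates)
avg
```

Under these assumptions, the average of the \(n\)-denominator estimate should land near \(\frac{n - 1}{n} \cdot 4 = 3.2\), while the \(n - 1\) version should land near the true value of 4.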

Visualizing Variability in 3D Data

Below is a simulated 3D scatterplot illustrating spread across three features:

Deviation in a Multivariable Data Set

The graph below uses 100 variables, each observed 50 times, to showcase non-uniform variability.
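
The data behind the figure are not reproduced here. As a rough sketch, data with this shape could be simulated as follows. The range of true standard deviations (0.5 to 3) and the names `sim`, `true_sd`, and `col_sd` are assumptions made for illustration.

```r
set.seed(123)

n_obs <- 50     # observations per variable
n_vars <- 100   # number of variables

# Assumption: each variable has its own true standard deviation in [0.5, 3]
true_sd <- runif(n_vars, min = 0.5, max = 3)

# Build a 50 x 100 matrix with one column per variable
sim <- sapply(true_sd, function(s) rnorm(n_obs, mean = 0, sd = s))

# Sample standard deviation of each column (sd() uses the n - 1 denominator)
col_sd <- apply(sim, 2, sd)

summary(col_sd)
```

Plotting `col_sd` against the variable index would show the uneven spread across variables.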

Comparing Deviation in Models

Different datasets also show different amounts of variation, as shown below.

Sample 3D Model Generation

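library(plotly)  # provides plot_ly()

set.seed(123)  # make the simulated data reproducible

## Simulate Data
## Three independent standard-normal features
n <- 1000
df <- data.frame(
  x = rnorm(n),
  y = rnorm(n),
  z = rnorm(n)
)

## Plot Data
## Interactive 3D scatterplot, with points colored by their z value
plot_ly(df, x = ~x, y = ~y, z = ~z,
        type = "scatter3d",
        mode = "markers",
        marker = list(size = 3, color = ~z))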