2023-02-12

The Kolmogorov-Smirnov (KS) Test

The test is used to determine whether observed data is close to a theoretical or expected distribution. It is useful in situations where the data may not be normal, which makes it an important and powerful test because not all data is normal. Part of its power lies in the ability to perform the test without grouping the data.

The Kolmogorov-Smirnov (KS) Test Math

The KS test is based on the idea of the cumulative distribution function (c.d.f.):

\(F(x)=P(X\leq x) \text{ for } -\infty <x< \infty\)

In this case we consider a random variable X, and x ranges over the observed values of X. The c.d.f. defined above exists for every random variable X, whether its distribution is discrete, continuous, or mixed. The c.d.f. also allows us to calculate all interval probabilities, even when it is a step function. More importantly, it describes a common characteristic of the distribution of a random variable X, which lets us compare against expected distributions with a level of confidence. The KS test uses both the c.d.f. and the empirical c.d.f. (e.c.d.f.), defined in later slides.

Three Main Properties of the Cumulative Distribution Function (c.d.f.)

  • Non-decreasing: the function F(x) is non-decreasing as x increases, that is, if \(x_1<x_2\) then \(F(x_1) \leq F(x_2)\)

  • Limits at \(\pm \infty\): \(\lim_{x\to-\infty} F(x)=0\) and \(\lim_{x\to+\infty}F(x)= 1\)

  • Continuity from the right: a c.d.f. is always continuous from the right, that is, \(F(x)=F(x^+)\) at every point x

The concept of a c.d.f. has many uses across a variety of applications, from blood work to voltage measurements.
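
As a quick numerical illustration (not from the original slides), the R snippet below checks the three properties using the Poisson(1) c.d.f. ppois(), which is a step function; the evaluation points are arbitrary choices:

ppois(1, lambda = 1) <= ppois(2, lambda = 1)     ## non-decreasing in x: TRUE
ppois(-10, lambda = 1); ppois(100, lambda = 1)   ## limits: 0 and (essentially) 1
ppois(2, lambda = 1) == ppois(2.0001, lambda = 1) ## right-continuous at the jump at x = 2: TRUE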

The Empirical Cumulative Distribution, Defined

Similar to the general c.d.f., a sample distribution exists for the n observations that we take from the random sample of X.

Those values determine \(F_n(x)\). It turns out we can take this further and view \(F_n(x)\) as the c.d.f. of a discrete distribution that assigns probability 1/n to each observation we take. The benefit is that, by the law of large numbers, the more observations we take, the closer we come to the distribution from which the sample was drawn.

Empirical Cumulative Distribution Function

Said mathematically, we simply have \(F_n(x) \Rightarrow F(x) \text{ for } -\infty<x<\infty\)

If we expand the definition in terms of proportions, we have the expression below:

\(F_n(x)=P_n(X\leq x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i\leq x)\) for all x \((-\infty<x<\infty)\), or \(F_n(x)=P_n(X\leq x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i\leq x) \Rightarrow \mathbb{E}\, I = F(x)\)

\(\mathbb{E}\) is the expected value, so by the law of large numbers \(F_n(x)\) converges to \(F(x)\). In contrast to the general case, we can then compare the sample against some specific expected distribution.
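
As a minimal sketch (not from the original slides), the snippet below builds \(F_n\) with R's ecdf() and shows it approaching the true standard normal c.d.f., in line with the law of large numbers; the seed and sample size are arbitrary assumptions:

set.seed(1)                        ## assumed seed, for reproducibility only
Fn <- ecdf(rnorm(1000))            ## e.c.d.f. from 1,000 standard normal draws
c(Fn(0), pnorm(0))                 ## both are close to 0.5
grid <- seq(-3, 3, by = 0.01)
max(abs(Fn(grid) - pnorm(grid)))   ## the largest gap is small and shrinks as n grows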

Two Graphs

The next two slides show the ECDF in two different representations: one base plot and one ggplot. The base plot gives you an idea of the probability with tails. The important part is to recognize the length of the tails and the distance between each point. The second plot shows a similar graph with tails; notice how closely the approximation follows the normal cumulative distribution function shown in earlier slides. The data is from USJudgeRatings.

Graph 1

Graph 2
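
As a rough sketch (not the original plotting code), the two graphs could be produced along these lines; the specific column of USJudgeRatings used here (the integrity ratings, INTG) is an assumption:

library(ggplot2)
r <- USJudgeRatings$INTG                 ## assumed column from the built-in dataset
plot(ecdf(r), main = "Base plot ECDF")   ## Graph 1: base R step plot of F_n
ggplot(data.frame(r = r), aes(x = r)) +  ## Graph 2: ggplot version via stat_ecdf()
  stat_ecdf() +
  labs(title = "GG Plot ECDF", x = "Rating", y = "F_n(x)")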

The Kolmogorov-Smirnov (KS) Test Math (Final)

Since we have jump discontinuities, we may not be able to completely estimate \(F(x)\). However, we can still test based on the largest distance to the distribution in question. So we choose an expected distribution, and then compare based on the largest distance between our expected distribution and our sample.
Mathematically, this is shown below:

\(D_n = \max_x|F_n(x)-F(x)|\) (one-sample test)

\(D_{n_1,n_2} = \max_x|F_{X,n_1}(x)-F_{Y,n_2}(x)|\) (two-sample test)

This is the greatest vertical distance between the two empirical c.d.f.s, based on samples of sizes \(n_1\) and \(n_2\).
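
As a small sketch (not from the original slides), the one-sample \(D_n\) can be computed directly in R; the toy sample, seed, and hypothesized N(0.5, 1) distribution below are assumptions for illustration. Because \(F_n\) jumps at each observation, we check the vertical gap on both sides of every jump and take the largest one:

set.seed(2)                             ## assumed seed, illustration only
s  <- sort(rnorm(25, mean = 0.5, sd = 1))
n  <- length(s)
Fx <- pnorm(s, mean = 0.5, sd = 1)      ## hypothesized c.d.f. at the sample points
Dn <- max(pmax((1:n)/n - Fx, Fx - (0:(n - 1))/n))
Dn                                      ## matches ks.test(s, "pnorm", 0.5, 1)$statistic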

The Kolmogorov-Smirnov (KS) Test Math (Con’t)

Then we'd set up the following hypothesis test for one or two samples.

\(H_0: F_X(z) \equiv F_Y(z)\ \forall z \in \mathbb{R}\)

\(H_1: F_X(z) > F_Y(z) \text{ for some } z \in \mathbb{R}\)

If you want to know more, please look up the Glivenko-Cantelli theorem.
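
As a hedged aside (not part of the original example), the two-sample version of this hypothesis test can be run in R by giving ks.test() two numeric vectors, which compares their empirical c.d.f.s directly; the seed and sample sizes below are arbitrary assumptions for illustration:

set.seed(3)                  ## assumed seed, illustration only
x1 <- rnorm(30)              ## sample from F_X
x2 <- rnorm(40, mean = 1)    ## sample from F_Y, with a shifted mean
ks.test(x1, x2)              ## two-sided by default; a small p-value rejects H_0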

Example of KS Test in R

Question: Test the hypothesis that the following sample comes from a normal distribution with mean 0.5 and variance 1. Use alpha = 0.05.

library(plotly)  ## needed for plot_ly(), add_trace(), layout(), and config()

xax <- list(title = "Random Variable")
yax <- list(title = "ID")
## Given sample to test against N(0.5, 1)
x <- c(.36,.92,-.56,1.86,1.74,.56,-.95,.24,-.15,-.74,.32,.82,.70,-.10,-1.06,.15,.55,-.48,-.49)
## Reference sample drawn from the hypothesized N(0.5, 1) distribution
test <- rnorm(20, .5, 1)
fig <- plot_ly(x = x, type = "scatter", mode = "markers", name = "Given Data", width = 600, height = 207) %>%
  add_trace(x = test, type = 'scatter', mode = 'markers', name = 'Normal Sample') %>%
  layout(xaxis = xax, yaxis = yax) %>%
  layout(margin = list(l = 150, r = 50, b = 20, t = 40))
config(fig, displaylogo = FALSE)

Then we run the KS test to check our hypothesis.

## We can make our conclusion based on either the D or the p statistic;
## both suggest the sample isn't significantly different from the hypothesized normal.
ks.test(x,"pnorm", mean=.5, sd=1)
## 
##  Exact one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.23198, p-value = 0.221
## alternative hypothesis: two-sided
