Problem 1

Prove the statement

\[ \sup_{X_k \le x < X_{k+1}} \left|\tfrac{k}{N} -F(x)\right| =\max\left(F(X_{k+1})-\tfrac{k}{N},\ \tfrac{k}{N}-F(X_k)\right), \qquad k=1,2,\ldots, N-1,\]

which was used in the derivation of \(D_N\).

Proof:

If we have a sorted sample \(X_1 \le X_2 \le X_3 \le \ldots \le X_N\), then the Kolmogorov-Smirnov test statistic \(D_N\) is defined as

\[ D_N = \sup_{-\infty < x < \infty} |F_N(x) - F(x)|, \]

where \[ F_N(x) = \begin{cases} 0 & \text{if $x<X_1$} \\ \frac{k}{N} & \text{if $X_k \le x < X_{k+1}$, $k=1,\ldots,N-1$} \\ 1 & \text{if $x \ge X_N$} \end{cases} \]

Applying the conditions \(x<X_1\), \(X_1\le x<X_2\), \(X_2\le x<X_3\), \(\ldots\), \(x\ge X_N\), we get

\[ D_N=\max\Bigl(\sup_{x<X_1} |0-F(x)|,\ \sup_{X_1 \le x < X_{2}} \bigl|\tfrac{1}{N} -F(x)\bigr|,\ \sup_{X_2 \le x < X_{3}} \bigl|\tfrac{2}{N} -F(x)\bigr|,\ \ldots,\ \sup_{x \ge X_N} |1-F(x)| \Bigr).\]

Since \(F\) is non-decreasing (and we take it to be continuous), \[\sup_{x<X_1} |0-F(x)|=F(X_1)\] and \[\sup_{x \ge X_N} |1-F(x)|=1-F(X_N).\]

We know from the properties of absolute values that \[\left|\frac{k}{N}-F(x)\right|=\begin{cases} \frac{k}{N}-F(x) & \text{if $\frac{k}{N}\ge F(x)$}\\ F(x)-\frac{k}{N} & \text{if $\frac{k}{N}< F(x)$.}\\ \end{cases}\]

Since the cumulative distribution function (CDF) is non-decreasing and right-continuous, for any \(X_k \le X_{k+1}\) we have \(F(X_k) \le F(X_{k+1})\). Thus

\[\sup_{X_k \le x < X_{k+1}} \left(F(x)-\tfrac{k}{N}\right) = F(X_{k+1})-\tfrac{k}{N}\]

and

\[\sup_{X_k \le x < X_{k+1}} \left(\tfrac{k}{N}-F(x)\right) = \tfrac{k}{N}-F(X_k).\]

Each of these is a signed distance between the empirical and hypothesized CDFs. Since their sum is \(F(X_{k+1})-F(X_k)\ge 0\), at least one of \(F(X_{k+1})-\frac{k}{N}\) and \(\frac{k}{N}-F(X_k)\) is non-negative, so the larger of the two equals the supremum of the absolute value. Thus,

\[ \sup_{X_k \le x < X_{k+1}} \left|\tfrac{k}{N} -F(x)\right| =\max\left(F(X_{k+1})-\tfrac{k}{N},\ \tfrac{k}{N}-F(X_k)\right), \qquad k=1,2,\ldots, N-1.\]
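
As a quick numerical sanity check (not part of the proof), the identity can be verified in R on a dense grid; the sample values and the choice of \(F\) as the \(U(0,1)\) CDF below are illustrative assumptions.

# Numerical sanity check of the identity (illustrative choices of sample and F)
X <- c(0.2, 0.6, 0.7); N <- length(X); k <- 1       # check the interval [X_1, X_2)
grid <- seq(X[k], X[k+1] - 1e-9, length.out = 1e5)  # dense grid on [X_k, X_{k+1})
sup_lhs <- max(abs(k/N - punif(grid)))              # sup |k/N - F(x)| on the grid
rhs <- max(punif(X[k+1]) - k/N, k/N - punif(X[k]))
c(sup_lhs, rhs)                                     # both are approximately 0.2667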

Problem 2

Compute \(D_N=\max(D^{+},D^{-})\) for the data set \(x_1=0.2, x_2=0.6, x_3=0.7\). Take \(F\) to be the CDF of \(U(0,1)\), the uniform distribution on \((0,1)\). What do you think \(D^+, D^-, D_N\) measure, intuitively?

Solution:

We know

\[D_N = \sup_{-\infty < x < \infty}|F_N(x) - F(x)|\]

where \[ F_N(x) = \begin{cases} 0 & \text{if $x<X_1$} \\ \frac{k}{N} & \text{if $X_k \le x < X_{k+1}$, $k=1,\ldots,N-1$} \\ 1 & \text{if $x \ge X_N$} \end{cases} \]

So,
\[F_{N}(x)=\begin{cases} 0&\text{ if }x<0.2\\ \frac{1}{3}&\text{ if }x\in[0.2,0.6)\\ \frac{2}{3}&\text{ if }x\in[0.6,0.7)\\ 1&\text{ if }x\geq0.7\\ \end{cases}\]

Now \(D_N\) can be written as

\(\begin{aligned}D_{N}&=\max\left[F(X_{1}),\ \max_{k=1,\ldots,N-1}\left(F(X_{k+1})-\tfrac{k}{N},\ \tfrac{k}{N}-F(X_{k})\right),\ 1-F(X_{N})\right]\\ &= \max\left[\underbrace{\max_{k=1,\ldots,N}\left(\tfrac{k}{N}-F(X_{k})\right)}_{D^{+}},\ \underbrace{\max_{k=1,\ldots,N}\left(F(X_{k})-\tfrac{k-1}{N}\right)}_{D^{-}}\right]\\&=\max\{D^{+},D^{-}\}\end{aligned}\)

Accordingly,

\(\begin{aligned} D^{+}&=\max\{\tfrac{1}{3}-0.2,\ \tfrac{2}{3}-0.6,\ \tfrac{3}{3}-0.7\}\approx\max\{0.133,\ 0.067,\ 0.3\}=0.3\\ D^{-}&=\max\{0.2-\tfrac{0}{3},\ 0.6-\tfrac{1}{3},\ 0.7-\tfrac{2}{3}\}\approx\max\{0.2,\ 0.267,\ 0.033\}\approx 0.267\\ \end{aligned}\)

And thus, \(D_N=\max(0.3,\ 0.267)=0.3\).

Intuitively, \(D^{+}\) measures the largest amount by which the empirical CDF \(F_N\) rises above the hypothesized CDF \(F\), \(D^{-}\) measures the largest amount by which \(F_N\) dips below \(F\), and \(D_N\) is the largest vertical distance between the two curves.
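
These values can be reproduced in R; the ks.test call below is only a cross-check against the manual computation, with \(F\) again the \(U(0,1)\) CDF.

# Cross-check of D+, D-, and D_N in R for this data set
x <- c(0.2, 0.6, 0.7); N <- length(x); k <- 1:N
D_plus  <- max(k/N - x)                       # F(x) = x on (0,1) for U(0,1)
D_minus <- max(x - (k-1)/N)
c(D_plus, D_minus, max(D_plus, D_minus))      # 0.3, 0.2667, 0.3
ks.test(x, "punif")$statistic                 # agrees: D = 0.3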

Problem 3

Prove

\(\begin{aligned}\max\left[F(X_{1}),\ \max_{k=1,\ldots,N-1}\left(F(X_{k+1})-\tfrac{k}{N},\ \tfrac{k}{N}-F(X_{k})\right),\ 1-F(X_{N})\right] &= \max\left[\max_{k=1,\ldots,N}\left(\tfrac{k}{N}-F(X_{k})\right),\ \max_{k=1,\ldots,N}\left(F(X_{k})-\tfrac{k-1}{N}\right)\right],\end{aligned}\)

which was used in the derivation of \(D_N\).

Proof:

If we have a sorted sample \(X_1 \le X_2 \le X_3 \le \ldots \le X_N\), then the Kolmogorov-Smirnov test statistic \(D_N\) is defined as

\[ D_N = \sup_{-\infty < x < \infty} |F_N(x) - F(x)|, \]

where \[ F_N(x) = \begin{cases} 0 & \text{if $x<X_1$} \\ \frac{k}{N} & \text{if $X_k \le x < X_{k+1}$, $k=1,\ldots,N-1$} \\ 1 & \text{if $x \ge X_N$} \end{cases} \]

Applying the conditions \(x<X_1\), \(X_1\le x<X_2\), \(X_2\le x<X_3\), \(\ldots\), \(x\ge X_N\), we get

\[ D_N=\max\Bigl(\sup_{x<X_1} |0-F(x)|,\ \sup_{X_1 \le x < X_{2}} \bigl|\tfrac{1}{N} -F(x)\bigr|,\ \sup_{X_2 \le x < X_{3}} \bigl|\tfrac{2}{N} -F(x)\bigr|,\ \ldots,\ \sup_{x \ge X_N} |1-F(x)| \Bigr).\]

The first and last terms can be written as
\[\sup_{x<X_1} |0-F(x)|=F(X_1)\] and \[\sup_{x \ge X_N} |1-F(x)|=1-F(X_N).\]

For the other terms, by Problem 1 we can write

\[\sup_{X_k\leq x < X_{k+1}}\left|\frac{k}{N} - F(x)\right| = \max\left(F(X_{k+1}) - \frac{k}{N},\ \frac{k}{N} - F(X_k)\right), \qquad k = 1, \ldots, N - 1.\]

So,

\(\begin{aligned}D_{N}&=\max\left[F(X_{1}),\ \max_{k=1,\ldots,N-1}\left(F(X_{k+1})-\tfrac{k}{N},\ \tfrac{k}{N}-F(X_{k})\right),\ 1-F(X_{N})\right]\\ &= \max\left[\underbrace{\max_{k=1,\ldots,N}\left(\tfrac{k}{N}-F(X_{k})\right)}_{D^{+}},\ \underbrace{\max_{k=1,\ldots,N}\left(F(X_{k})-\tfrac{k-1}{N}\right)}_{D^{-}}\right]\\&=\max\{D^{+},D^{-}\}\end{aligned}\)

The second equality holds by regrouping: the terms \(\tfrac{k}{N}-F(X_k)\), \(k=1,\ldots,N-1\), together with \(1-F(X_N)=\tfrac{N}{N}-F(X_N)\) form \(D^{+}\), while re-indexing \(F(X_{k+1})-\tfrac{k}{N}\) as \(F(X_k)-\tfrac{k-1}{N}\) for \(k=2,\ldots,N\) and adjoining \(F(X_1)=F(X_1)-\tfrac{0}{N}\) forms \(D^{-}\).
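
A numerical spot-check of this regrouping in R, with an arbitrary sorted sample and \(F=U(0,1)\) (both illustrative assumptions):

# Spot-check: both groupings of the same terms give the same value
set.seed(1); N <- 10
Fx <- sort(runif(N))                          # plays the role of F(X_1),...,F(X_N)
k <- 1:(N-1)
lhs <- max(Fx[1], Fx[k+1] - k/N, k/N - Fx[k], 1 - Fx[N])
rhs <- max(max((1:N)/N - Fx), max(Fx - (0:(N-1))/N))
c(lhs, rhs)                                   # identical values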

Problem 4

Consider the MCG with parameters \(a=23\), \(M=10^8+1\) and let the seed be \(47594118\). Apply the KS test to the first 1000 random numbers (including the seed) from this generator. Compute the KS statistic and find its p-value. What is your conclusion about the generator?

Solution:

N=1000;                                       # Number of total observations
a=23;                                         # Multiplier
M=(10^8)+1;                                   # Modulo M
A<-matrix(0, ncol=1, nrow=N)                  # Zero matrix of N by 1
seed<-as.double(47594118)                     # Initial seed
A[1,1]<-seed/M                                # First number: the normalized seed
mcg<-function(){                              # mcg returns the next number in (0,1),
  seed<<-(a*seed)%% M                         # updating the seed via
  round(seed/M,6)                             # seed <- (a*seed) mod M
}

for (i in 2:N) {
  A[i,]<-mcg()
}

# CALCULATION OF KS-STATISTIC #######################################################
## STEP 1: Obtain the sample x and sort it ##########################################
x<-sort(A[,1])
## STEP 2: Hypothesis H_0: x follows F(x) vs H_a: x does not follow F(x) ############
## STEP 3: Compute D_N ##############################################################
k<-1:N
D_N<-max(max((k/N)-x),
         max(x-((k-1)/N)))
D_N
## [1] 0.036956
## STEP 4: Run the test #############################################################
ks.test(x,"punif")  
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.036956, p-value = 0.1302
## alternative hypothesis: two-sided

We see that the test statistic calculated manually and the one produced by the built-in KS test are the same, so the calculation is correct. The p-value is not significant, so we do not have enough evidence against the null hypothesis. Hence the values obtained from the generator are consistent with the uniform distribution.
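
As an additional check, the p-value can be approximated directly from the asymptotic Kolmogorov distribution, \(P(\sqrt{N}\,D_N > \lambda)\approx 2\sum_{j\ge 1}(-1)^{j-1}e^{-2j^2\lambda^2}\); this is only a sketch, and the truncation at 100 terms below is an arbitrary (more than sufficient) choice.

# Asymptotic p-value from the Kolmogorov limit distribution (sketch)
lambda <- sqrt(N) * D_N                       # here sqrt(1000) * 0.036956
j <- 1:100                                    # truncate the alternating series
p_asym <- 2 * sum((-1)^(j-1) * exp(-2 * j^2 * lambda^2))
p_asym                                        # approximately 0.130, close to ks.test's 0.1302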

Problem 5

Implement an obviously “bad” random number generator of your choice and explain why it is bad. Then apply the \(\chi^2\)-test together with the KS statistic as explained in Remark 1. Take \(M=10\), \(N=1000\), and \(k=10\). What are your conclusions?

Solution:

N=1000;                                         # Total number of observations
M=10;                                           # Modulus M
k=10;                                           # Number of cells for the chi-squared test

# INTENTIONALLY SELECTED BAD GENERATOR 
a=1.5;                                         # Multiplier a=1.5
x_n <- as.double(5)                            # initial value/seed=5
rangen <- function() {                         # rangen is the definition of the
  x_n <<- (a* x_n+.5) %% M                     # generator function 
}                                              # x_(i+1)=1.5x_i +(1/2) Mod 10

A<-matrix(0,ncol = 1,nrow = N)                 # A is the zero matrix of dim N by 1
for(i in 1:N) {
  A[i,] <- rangen()
}
X<-A[,1]                                       # X is the N random numbers
head(sort(round(unique(X),2)))                 # Here is how X looks like (rounded)
## [1] 0.03 0.05 0.10 0.10 0.11 0.15
plot(X[-N]/M, X[-1]/M)                         # plot of successive pairs (x_i, x_(i+1)), normalized

# The reason we claim our generator is bad is this graph: the successive pairs
# (x_i, x_(i+1)) do not fill the unit square randomly but fall on a few parallel
# straight lines of slope 1.5, an entirely deterministic linear pattern. For a
# good RNG we would expect a structureless random scatter, which is absent here.
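
# A direct check of the lattice structure (a quick sketch, not part of the
# assignment): by construction x_(i+1) - 1.5*x_i = 0.5 (mod 10), so successive
# pairs can only fall on lines of slope 1.5 with two possible intercepts.
unique(round((X[-1] - 1.5*X[-N]) %% 10, 6))    # should contain only the value 0.5
                                               # (up to floating-point rounding)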

# CHI-SQUARED TEST ##################################################################
H<-hist(X)                                    # Histogram 

Y<-H$counts                                   # Y holds the observed counts in each
Y                                             # of the k cells
##  [1]  76 107 140 129 128 106  90  85  87  52
p<-1/k;                                       # Under H_0 the distribution is uniform,
                                              # so each cell has probability p=1/k
Exp<-N*p;                                     # Exp is the expected count per cell
## Chi-squared test statistic ########################################################
sm<-0;
for (i in 1:k) {
  sm<-sm+(((Y[i]-Exp)^2)/Exp)                 # accumulate (O-E)^2/E over the k cells
}
sprintf("Chi-square Statistic %f", sm)
## [1] "Chi-square Statistic 66.840000"
## Testing the hypothesis ###########################################################
critical_value<-qchisq(.95, df=(k-1))         # 95th percentile of chi-squared, df = k-1
critical_value
## [1] 16.91898
options(warn=-1)
if(sm>critical_value){
  sprintf("We reject the null hypothesis")
}else{
  sprintf("We do not reject the null hypothesis")
}
## [1] "We reject the null hypothesis"
# KS test ###########################################################################
x<-sort(unique(X/M))                          # sorted, normalized sample values
n<-length(x)                                  # use the actual length, since unique()
k<-1:n                                        # may have removed duplicates
D_N<-max(max((k/n)-x),
         max(x-((k-1)/n)))
D_N
## [1] 0.108436
ks.test(x,"punif")
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.10844, p-value = 1.224e-10
## alternative hypothesis: two-sided
# CONCLUSION #######################################################################
# Both tests reject the null hypothesis: in the chi-squared test the statistic
# far exceeds the critical value, and in the KS test the p-value is extremely
# small. So the numbers are not uniformly distributed, confirming that this is
# a bad generator.