Principles guiding the investigation
• Can we say with confidence that we can do better than simply resampling from the existing data?
• Can we say something useful as to how we can extend beyond max(X1,…Xn)?
• The Glivenko–Cantelli theorem probably plays a role in a controlled setting, where a mystery distribution is tested against various candidates: lognormal distribution, Gamma, …
• If we generate data from a lognormal and then attempt to model them with a lognormal, I would expect this to be better than simply resampling, at least between 0 and max(X1,…Xn), and when the number n of data points is below a certain threshold.
Is that so? What is this threshold? (A small simulation sketch follows this list.)
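As a first, hedged look at that intuition, here is a minimal R sketch (not part of the original analysis; the lognormal parameters, the sample size and the 99th percentile are illustrative assumptions). It shows that plain resampling can never go beyond max(X1,…Xn), while a fitted lognormal can.
library(MASS)
set.seed(1)
n_sim <- 30
y <- rlnorm(n_sim, meanlog = 2, sdlog = 0.5)          # "mystery" data, truly lognormal
fit <- fitdistr(y, "lognormal")                        # parametric fit
resampled <- sample(y, 10000, replace = TRUE)          # nonparametric resampling
simulated <- rlnorm(10000, fit$estimate["meanlog"], fit$estimate["sdlog"])
c(sample_max = max(y), resample_max = max(resampled),  # resampling cannot exceed max(y)
  parametric_max = max(simulated),                     # the fitted model can
  true_q99 = qlnorm(0.99, 2, 0.5))                     # true 99th percentile, for reference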
The first step is to examine the shape of the distribution, calculate the confidence interval (CI), and then verify whether the data can be re-sampled in a way that goes beyond simply reusing the same values.
The assumptions of:
1 the traditional Confidence Interval
2 the Nonparametric Bootstrap Confidence Interval
3 the Parametric Bootstrap Confidence Interval methods
will be explored, and the three methods will be contrasted by demonstrating how they use the same collection of random variates and reach similar conclusions under different amounts of assumptions about the underlying distribution.
Type of dataset: the differences in heights of young men, measured in the morning and in the evening
The distribution of these random variables is unknown:
x is a vector of 41 elements, each containing the difference between a participant's morning and evening height, in millimeters:
x = c(8.50, 9.75, 9.75, 6.00, 4.00, 10.75, 9.25, 13.25, 10.50,
12.00, 11.25, 14.50, 12.75, 9.25, 11.00, 11.00, 8.75, 5.75,
9.25, 11.50, 11.75, 7.75, 7.25, 10.75, 7.00, 8.00, 13.75,
5.50, 8.25, 8.75, 10.25, 12.50, 4.50, 10.75, 6.75, 13.25,
14.75, 9.00, 6.25, 11.75, 6.25)
n = length(x); #the length of the vector
avg_x = mean(x); #the mean
sd_x = sd(x) #the standard deviation
se = sd_x/sqrt(n) #the standard error (SE)
round(as.table(c(Avg=avg_x, StDev=sd_x,StErr=se)),3)
## Avg StDev StErr
## 9.598 2.735 0.427
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 7.750 9.750 9.598 11.500 14.750
A first check of the assumption of Normality for the distribution of the differences clearly shows a bell-shaped distribution:
par(mfrow=c(1,3))
hist(x)
hist(x, breaks = 50) # breaks of the main data into a larger number of smaller intervals
hist(log(x), breaks = 50)
The density of the original data is bell-shaped; so is the density of log(x), which is slightly left-skewed.
par(mfrow=c(1,2))
plot(density(x))
plot(density(log(x)))
### Kernel Density Estimate (KDE)
The idea is to fit a Kernel Density Estimate (KDE) to a “training set” of real observations, generate simulated data from the KDE, and then perform a two-sample permutation test against a “test set” of real observations: the null hypothesis of this test is that the test set and the simulated data come from the same distribution.
The challenge is the quality and size of the real observations, and appropriately selecting the kernel function and bandwidth for the KDE.
The test-set data can also be compared graphically with a QQ-plot in which the reference distribution is the KDE.
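A minimal sketch of that idea (illustrative, not the original workflow: the 30/11 train/test split, the smoothed-bootstrap sampling from the Gaussian KDE and the choice of the mean as test statistic are all assumptions made for illustration):
set.seed(1)
train <- x[1:30]; test <- x[31:41]                      # illustrative split of the 41 observations
kde <- density(train)                                   # Gaussian kernel, default bandwidth
sim <- sample(train, 500, replace = TRUE) + rnorm(500, 0, kde$bw)  # draws from the KDE
obs <- mean(test) - mean(sim)                           # observed difference in means
pooled <- c(test, sim)
perm <- replicate(2000, {                               # permutation distribution of the statistic
  idx <- sample(length(pooled), length(test))
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))                             # two-sided permutation p-value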
Another important step before drawing conclusions about the adequacy of a Normal model for F(x) is to compare the data against different types of distributions and see the differences.
Test of the data against three distributions:
The procedure for this first part follows the material released by CAS in Workshop-4_5 (file attached): file:///Users/federica/Desktop/Loss%20Distributions_Workshop-4_5.pdf
library(MASS)
fitGamma <- fitdistr(x, "gamma") #the first try with the fit of the Gamma produces NaN values
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
fitLognormal <- fitdistr(x, "lognormal") #this doesn't happen with the other two distributions
fitWeibull <- fitdistr(x, "Weibull")
The next step is to build a vector of probabilities at which the quantiles of the three fitted distributions are calculated directly and then plotted.
probabilities = (1:n)/(n+1) # plotting positions
weibullQ <- qweibull(probabilities, coef(fitWeibull)[1], coef(fitWeibull)[2])
lnQ <- qlnorm(probabilities, coef(fitLognormal)[1], coef(fitLognormal)[2])
gammaQ <- qgamma(probabilities, coef(fitGamma)[1], coef(fitGamma)[2])
sorted_x <- sort(x)
oldPar <- par(mfrow = c(1,3))
plot(sort(weibullQ), sorted_x, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Weibull Fit")
abline(0,1)
plot(sort(lnQ), sorted_x, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Lognormal Fit")
abline(0,1)
plot(sort(gammaQ), sorted_x, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Gamma Fit")
abline(0,1)
The fit of the Lognormal distribution needs to be revisited and analysed separately: its deviation from the data is larger than for the other two distributions, with points moving away from the reference line mostly in the right tail.
sampleLogMean <- fitLognormal$estimate[1] #meanlog: first parameter of the LogNormal distribution
sampleLogSd <- fitLognormal$estimate[2] #sdlog: second parameter of the LogNormal distribution
sampleShape <- fitGamma$estimate[1] #shape: first parameter of the Gamma distribution
sampleRate <- fitGamma$estimate[2] #rate: second parameter of the Gamma distribution
sampleShapeW <- fitWeibull$estimate[1] #shape: first parameter of the Weibull distribution
sampleScaleW <- fitWeibull$estimate[2] #scale: second parameter of the Weibull distribution
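As an optional numeric companion to the QQ-plots (a sketch; the p-values from ks.test would only be approximate because the parameters were estimated from the same data, so only the distances are compared here):
ks_dist <- c(
  lognormal = unname(ks.test(x, "plnorm", sampleLogMean, sampleLogSd)$statistic),
  gamma     = unname(ks.test(x, "pgamma", sampleShape, sampleRate)$statistic),
  weibull   = unname(ks.test(x, "pweibull", sampleShapeW, sampleScaleW)$statistic))
round(ks_dist, 3)  # smaller distance = ECDF closer to the fitted CDF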
Having estimated the main parameters of the three distributions, we can plot the three fitted densities. The Gamma and Lognormal densities are quite similar; the Weibull is the one that best fits in terms of density.
x_samp <- seq(0, max(x), length.out=500)
yLN <- dlnorm(x_samp, sampleLogMean, sampleLogSd)
yGamma <- dgamma(x_samp, sampleShape, sampleRate)
yWeibull <- dweibull(x_samp, sampleShapeW, sampleScaleW)
hist(x, freq=FALSE, ylim=range(yLN, yGamma, yWeibull))
lines(x_samp, yLN, col="blue")
lines(x_samp, yGamma, col="red")
lines(x_samp, yWeibull, col="green")
Now that the shape is clearly evidenced, we can proceed with the calculations for the CI. The source of this part of the study is Michael Archibeque, https://rpubs.com/statscripter/322818, with all the other references given in the original demonstrations of the methods summarised here.
The Traditional Confidence Interval method, applied to an unknown distribution where all that is known are the paired differences between the two samples, starts with a check of Normality of the sample, even though at this point we already know the data are approximately Normally distributed.
qqnorm(x, datax=TRUE) #plot the data and add a line
qqline(x, datax=TRUE)
shapiro.test(x) # Shapiro-Wilk test: a p-value below the chosen level (e.g. 0.1) would be evidence against Normality
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.98383, p-value = 0.817
The test checks whether the assumption of Normality is reasonably supported; with p = 0.817 there is no evidence against it.
An alternative to the estimations above, whether or not the mean and/or the standard deviation of the underlying population are known, is to use the t distribution in place of the Normal distribution.
Since the population mean and standard deviation are unknown but the data behave similarly to a Normal distribution, we calculate the Traditional Confidence Interval of the mean difference.
Lower and Upper Limits of the Confidence Interval (95% CI)
avg_x - qt(c(.975,.025), n-1) * se
## [1] 8.734251 10.460871
Check of the results using the t distribution with the MLEs inserted:
t.test(x) # T-Test with CI
##
## One Sample t-test
##
## data: x
## t = 22.469, df = 40, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.734251 10.460871
## sample estimates:
## mean of x
## 9.597561
Results: we are 95% confident that the true mean difference in height is between 8.73 and 10.46 millimeters.
What if the data is not Normally Distributed or any other known distribution?
Traditionally, the T Distribution based Confidence Interval can continue to be used if the data is continuous and the population is thought to be mound-shaped.
To actually see how the MLE, together with the sample maximum (the upper bound of the observed data), can be matched against the real distribution after replicating a certain number of samples from the same unknown distribution, we apply the Glivenko–Cantelli theorem.
If the distribution is heavy-tailed, skewed, or has no recognizable shape, then non-parametric methods need to be employed. That is not the case here, but we walk through the procedure anyway.
Re-Sampling from the original dataset produces a larger number of replications from which the sampling distribution can be approximated.
Data sampled at random from an unknown population have an Empirical Cumulative Distribution Function (ECDF), Fn(x): the distribution that puts probability 1/n on each of the actual data points, with no additional information, and that can be used to estimate the population from the sample.
Fn(x) is defined as:
Fn(x) = 1/n * ∑ I(Xi ≤ x)
where I(Xi ≤ x) is an indicator function:
I(Xi ≤ x) = 1 if Xi ≤ x
I(Xi ≤ x) = 0 otherwise
‖Fn − F‖∞ = sup_{x∈ℝ} |Fn(x) − F(x)| → 0 almost surely
The supremum of the difference between Fn and F will, almost surely, approach zero as n → ∞.
There will be no difference between the ECDF and the underlying CDF as n increases towards infinity.
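A small sketch of this convergence (illustrative; the Normal(0,1) reference and the sample sizes are arbitrary assumptions): the Kolmogorov–Smirnov statistic sup|Fn − F| shrinks as n grows.
set.seed(1)
sapply(c(20, 200, 2000, 20000), function(m) {
  y <- rnorm(m)                                  # sample from a known F
  unname(ks.test(y, "pnorm")$statistic)          # sup-distance between the ECDF and F
})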
par(mfrow=c(1,3))
plot.ecdf(x)
curve(pnorm(x, avg_x, sd_x))
plot.ecdf(x, main="Empirical distribution ECDF(x)")
curve(pnorm(x, avg_x, sd_x), add = TRUE, col="red")
This makes the ECDF a powerful tool in practice when working with small data sets and limited background information.
set.seed(42) # Setting the random generator seed
n = length(x) # The original sample
avg_x = mean(x) # The mean of the original sample.
B = 10000 # A larger number of resamples
npar_x = sample(x, B*n, repl=TRUE) # B times re-sample of x Non Parametric bootstrap
npar_matrix_x = matrix(npar_x, nrow=B) # B x n matrix of resamples (B.Rows.N.Cols.X.Resampled)
npar_rows_x = rowMeans(npar_matrix_x) # vector of the B resample means
confidence.np_x = npar_rows_x - avg_x # centered resample means (resample mean - sample mean)
quantiles.np_x = quantile(confidence.np_x, c(.975,.025)) # quantiles of the centered means
avg_x - quantiles.np_x # Non-parametric Bootstrap Confidence Interval
## 97.5% 2.5%
## 8.762195 10.420732
Visualising the Non Parametric confidence interval with a histogram:
par(mfrow=c(1,3))
hist(npar_x); hist(npar_rows_x); hist(confidence.np_x)
The outcome is similar to the Traditional Confidence Interval, and it can be concluded that the true mean difference is between approximately 8.76 and 10.42 millimeters.
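As an optional cross-check (a sketch; it assumes the boot package, which ships with R, and simply recomputes the same kind of interval for the mean):
library(boot)
set.seed(42)
bt <- boot(x, statistic = function(d, i) mean(d[i]), R = 10000)  # bootstrap the mean
boot.ci(bt, type = c("basic", "perc"))                           # basic and percentile CIs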
The Kolmogorov–Smirnov test statistic can also be used.
The Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous (or discontinuous) one-dimensional probability distributions used to compare a sample with a reference probability distribution (one-sample K–S test).
The Kolmogorov–Smirnov statistic quantifies the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution (or between the empirical distribution functions of two samples, as in the two-sample test used below).
### Two sample Kolmogorov-Smirnov test
# test if x is stochastically larger than x2
require(graphics)
x2 <- rnorm(n) # n simulated standard Normal values for comparison
plot(ecdf(x), xlim = range(c(x, x2)))
#curve(x2 , add = TRUE, col="blue")
plot(ecdf(x2), add = TRUE, lty = "dashed")
curve( pnorm( x, avg_x, sd_x), add = TRUE, col="red")
t.test(x, x2, alternative = "g")
##
## Welch Two Sample t-test
##
## data: x and x2
## t = 20.585, df = 53.697, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 8.762134 Inf
## sample estimates:
## mean of x mean of y
## 9.59756098 0.05994177
wilcox.test(x, x2, alternative = "g")
## Warning in wilcox.test.default(x, x2, alternative = "g"): cannot compute exact
## p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: x and x2
## W = 1681, p-value = 3.33e-15
## alternative hypothesis: true location shift is greater than 0
ks.test(x, x2, alternative = "l")
## Warning in ks.test(x, x2, alternative = "l"): cannot compute exact p-value with
## ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and x2
## D^- = 1, p-value < 2.2e-16
## alternative hypothesis: the CDF of x lies below that of y
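A one-sample variant can also be run (a sketch: because the Normal parameters are estimated from the same data the nominal p-value is only approximate, and the ties in x will trigger a warning):
ks.test(x, "pnorm", mean = avg_x, sd = sd_x)  # ECDF of x vs the fitted Normal CDF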
### Parametric Bootstrap Confidence Interval
set.seed(42)
n = length(x)
avg_x = mean(x) # The MLE of the mean
sd_x = sd(x) # The MLE of the standard deviation
B = 10000 # The number of bootstrap samples
par_x = rnorm(B*n, avg_x, sd_x) # B*n draws from the fitted Normal(avg_x, sd_x)
par_matrix_x = matrix(par_x, nrow=B) # B x n matrix of samples
par_rows_x = rowMeans(par_matrix_x) # Vector of the B sample means
confidence.p_x = par_rows_x - avg_x # centered sample means
quantiles.p_x = quantile(confidence.p_x, c(.975,.025)) # estimated quantiles
avg_x - quantiles.p_x # parametric bootstrap CI of the mean
## 97.5% 2.5%
## 8.752424 10.432419
Visualising the Parametric confidence interval with a histogram:
par(mfrow=c(1,3))
hist(par_x); hist(par_rows_x); hist(confidence.p_x)
A sample of 41 original data points is replaced by draws from the estimated Normal CDF with parameters avg_x and sd_x:
rather than filling the vector and matrix with resampled copies of the exact values from the original data set, they are filled with values drawn from anywhere within the scope of the estimated CDF.
Results: similar to those produced by the traditional Confidence Interval as well as NPB, with the true mean difference in height being between approximately 8.75 and 10.43 millimeters.
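For convenience, the three intervals computed above can be placed side by side using the objects already in the workspace (this is only a summary of earlier results, not a new method):
rbind(
  traditional   = avg_x - qt(c(.975, .025), n - 1) * se,
  nonparametric = avg_x - quantiles.np_x,
  parametric    = avg_x - quantiles.p_x)  # lower / upper 95% limits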
par(mfrow=c(1,4))
qqplot(x,npar_x)
qqplot(x,par_x)
abline(0,1, col="red")
qqplot(x,-log(x), type="l")
#qqplot(x,lognormal_x)
#qqnorm(quant_lognormal_x)
#abline(0,1)
### What happens if we use the Log Normal distribution?
The log normal distribution has density
f(x) = 1/(√(2 π) σ x) e^-((log x - μ)^2 / (2 σ^2))
where μ and σ are the mean and standard deviation of the logarithm.
The mean is:
E(X) = exp(μ + 1/2 σ^2)
the median is:
med(X) = exp(μ)
and the variance:
Var(X) = exp(2*μ + σ^2)*(exp(σ^2) - 1)
and hence the coefficient of variation is:
sqrt(exp(σ^2) - 1)
which is approximately σ when that is small (e.g., σ < 1/2).
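Plugging the fitted parameters sampleLogMean and sampleLogSd (estimated earlier with fitdistr) into these formulas gives, as a quick sketch:
mu <- unname(sampleLogMean); sigma <- unname(sampleLogSd)
c(mean   = exp(mu + sigma^2 / 2),                      # E(X)
  median = exp(mu),                                    # med(X)
  var    = exp(2 * mu + sigma^2) * (exp(sigma^2) - 1), # Var(X)
  cv     = sqrt(exp(sigma^2) - 1))                     # coefficient of variation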
Density
lognormal_x = dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE) # standard lognormal density (meanlog = 0, sdlog = 1), not the fitted parameters
hist(lognormal_x)
lines(density(lognormal_x), col = "red")
par(mfrow=c(1,3))
qqplot(x,lognormal_x,type="o",col="red")
qqplot(npar_x,lognormal_x,type="o",col="blue")
qqplot(par_x,lognormal_x,type="o",col="green")
The distribution function, the quantile function and random generation for the log normal distribution:
plnorm(quantiles.p_x, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
## 97.5% 2.5%
## 0.4331907 0.0000000
quant_lognormal_x = qlnorm(confidence.p_x, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE) # values of confidence.p_x outside [0, 1] produce NaN (see warning)
## Warning in qlnorm(confidence.p_x, meanlog = 0, sdlog = 1, lower.tail = TRUE, :
## NaNs produced
plot(quant_lognormal_x)
qqnorm(quant_lognormal_x)
rand_lognormal_x = rlnorm(B*n, meanlog = 0, sdlog = 1)
qqnorm(rand_lognormal_x)
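Finally, as a hedged extension (not in the original code), the parametric bootstrap of the mean can be repeated under the fitted lognormal, using sampleLogMean and sampleLogSd instead of the standard lognormal parameters, for comparison with the intervals above:
set.seed(42)
ln_x        <- rlnorm(B * n, sampleLogMean, sampleLogSd)   # draws from the fitted lognormal
ln_matrix_x <- matrix(ln_x, nrow = B)
ln_means    <- rowMeans(ln_matrix_x)                       # B bootstrap means
avg_x - quantile(ln_means - avg_x, c(.975, .025))          # lognormal parametric bootstrap CI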