Principles guiding the investigation
• Can we say with confidence that we can do better than simply resampling from the existing data?
• Can we say something useful as to how we can extend beyond max(X1,…Xn)?
• The Glivenko–Cantelli theorem probably plays a role in a controlled setting, where a mystery distribution is tested against various candidates: lognormal distribution, Gamma, …
• If we generate data from a lognormal and then attempt to model them with a lognormal, I would expect this to be better than simply resampling, at least between 0 and max(X1,…Xn), and when the number n of data points is below a certain threshold.
Is that so? What is this threshold? (A small simulation sketch follows this list.)
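As a first, hedged look at that intuition, here is a minimal R sketch (not part of the original analysis; the lognormal parameters, the sample size and the 99th percentile are illustrative assumptions). It shows that plain resampling can never go beyond max(X1,…Xn), while a fitted lognormal can.
library(MASS)
set.seed(1)
n_sim <- 30
y <- rlnorm(n_sim, meanlog = 2, sdlog = 0.5)          # "mystery" data, truly lognormal
fit <- fitdistr(y, "lognormal")                        # parametric fit
resampled <- sample(y, 10000, replace = TRUE)          # nonparametric resampling
simulated <- rlnorm(10000, fit$estimate["meanlog"], fit$estimate["sdlog"])
c(sample_max = max(y), resample_max = max(resampled),  # resampling cannot exceed max(y)
  parametric_max = max(simulated),                     # the fitted model can
  true_q99 = qlnorm(0.99, 2, 0.5))                     # true 99th percentile, for reference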
The first step is to examine the shape of the distribution, calculate the confidence interval (CI), and then verify whether the data can be re-sampled in a way that goes beyond simply reusing the same values.
The assumptions of:
1 the traditional Confidence Interval
2 the Nonparametric Bootstrap Confidence Interval
3 the Parametric Bootstrap Confidence Interval methods
will be explored, and the three methods will be contrasted by demonstrating how they use the same collection of random variates and reach similar conclusions under different amounts of assumptions about the underlying distribution.
Type of dataset: the differences in heights of young men, measured in the morning and in the evening
The distribution of these random variables is unknown:
x is a vector of 41 elements, each containing the difference between a participant's morning and evening height, in millimeters:
x = c(8.50, 9.75, 9.75, 6.00, 4.00, 10.75, 9.25, 13.25, 10.50,
12.00, 11.25, 14.50, 12.75, 9.25, 11.00, 11.00, 8.75, 5.75,
9.25, 11.50, 11.75, 7.75, 7.25, 10.75, 7.00, 8.00, 13.75,
5.50, 8.25, 8.75, 10.25, 12.50, 4.50, 10.75, 6.75, 13.25,
14.75, 9.00, 6.25, 11.75, 6.25)
n = length(x); #the length of the vector
avg_x = mean(x); #the mean
sd_x = sd(x) #the standard deviation
se = sd_x/sqrt(n) #the standard error (SE)
round(as.table(c(Avg=avg_x, StDev=sd_x,StErr=se)),3)
## Avg StDev StErr
## 9.598 2.735 0.427
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 7.750 9.750 9.598 11.500 14.750
A first check of the assumption of Normality for the distribution of the differences clearly shows a bell-shaped distribution:
par(mfrow=c(1,3))
hist(x)
hist(x, breaks = 50) # breaks of the main data into a larger number of smaller intervals
hist(log(x), breaks = 50)
The density of the original data is bell-shaped; so is the density of log(x), which is slightly left-skewed.
par(mfrow=c(1,2))
plot(density(x))
plot(density(log(x)))
### Kernel Density Estimate (KDE)
The idea is to fit a Kernel Density Estimate (KDE) to a “training set” of real observations, generate simulated data from the KDE, and then perform a two-sample permutation test against a “test set” of real observations: the null hypothesis of this test is that the test set and the simulated data come from the same distribution.
The challenge is the quality and size of the real observations, and appropriately selecting the kernel function and bandwidth for the KDE.
The test-set data can also be compared graphically with a QQ-plot in which the reference distribution is the KDE.
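A minimal sketch of that idea (illustrative, not the original workflow: the 30/11 train/test split, the smoothed-bootstrap sampling from the Gaussian KDE and the choice of the mean as test statistic are all assumptions made for illustration):
set.seed(1)
train <- x[1:30]; test <- x[31:41]                      # illustrative split of the 41 observations
kde <- density(train)                                   # Gaussian kernel, default bandwidth
sim <- sample(train, 500, replace = TRUE) + rnorm(500, 0, kde$bw)  # draws from the KDE
obs <- mean(test) - mean(sim)                           # observed difference in means
pooled <- c(test, sim)
perm <- replicate(2000, {                               # permutation distribution of the statistic
  idx <- sample(length(pooled), length(test))
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))                             # two-sided permutation p-value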
Another important step before drawing conclusions about the adequacy of a Normal model for F(x) is to compare the data against different types of distributions and see the differences.
Test of the data against three distributions:
The procedure for this first part follows the material released by CAS in Workshop-4_5 (file attached): file:///Users/federica/Desktop/Loss%20Distributions_Workshop-4_5.pdf
library(MASS)
fitGamma <- fitdistr(x, "gamma") #the first try with the fit of the Gamma produces NaN values
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
fitLognormal <- fitdistr(x, "lognormal") #this doesn't happen with the other two distributions
fitWeibull <- fitdistr(x, "Weibull")
The next step is to build a vector of probabilities at which the quantiles of the three fitted distributions are calculated directly and then plotted.
probabilities = (1:n)/(n+1) # plotting positions
weibullQ <- qweibull(probabilities, coef(fitWeibull)[1], coef(fitWeibull)[2])
lnQ <- qlnorm(probabilities, coef(fitLognormal)[1], coef(fitLognormal)[2])
gammaQ <- qgamma(probabilities, coef(fitGamma)[1], coef(fitGamma)[2])
sorted_x <- sort(x)
oldPar <- par(mfrow = c(1,3))
plot(sort(weibullQ), sorted_x, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Weibull Fit")
abline(0,1)
plot(sort(lnQ), sorted_x, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Lognormal Fit")
abline(0,1)
plot(sort(gammaQ), sorted_x, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Gamma Fit")
abline(0,1)
The fit of the Lognormal distribution needs to be revisited and analysed separately: its deviation from the data is larger than for the other two distributions, with points moving away from the reference line mostly in the right tail.
sampleLogMean <- fitLognormal$estimate[1] #meanlog: first parameter of the LogNormal distribution
sampleLogSd <- fitLognormal$estimate[2] #sdlog: second parameter of the LogNormal distribution
sampleShape <- fitGamma$estimate[1] #shape: first parameter of the Gamma distribution
sampleRate <- fitGamma$estimate[2] #rate: second parameter of the Gamma distribution
sampleShapeW <- fitWeibull$estimate[1] #shape: first parameter of the Weibull distribution
sampleScaleW <- fitWeibull$estimate[2] #scale: second parameter of the Weibull distribution
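As an optional numeric companion to the QQ-plots (a sketch; the p-values from ks.test would only be approximate because the parameters were estimated from the same data, so only the distances are compared here):
ks_dist <- c(
  lognormal = unname(ks.test(x, "plnorm", sampleLogMean, sampleLogSd)$statistic),
  gamma     = unname(ks.test(x, "pgamma", sampleShape, sampleRate)$statistic),
  weibull   = unname(ks.test(x, "pweibull", sampleShapeW, sampleScaleW)$statistic))
round(ks_dist, 3)  # smaller distance = ECDF closer to the fitted CDF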
Having estimated the main parameters of the three distributions, we can plot the three fitted densities. The Gamma and Lognormal densities are quite similar; the Weibull is the one that best fits in terms of density.
x_samp <- seq(0, max(x), length.out=500)
yLN <- dlnorm(x_samp, sampleLogMean, sampleLogSd)
yGamma <- dgamma(x_samp, sampleShape, sampleRate)
yWeibull <- dweibull(x_samp, sampleShapeW, sampleScaleW)
hist(x, freq=FALSE, ylim=range(yLN, yGamma, yWeibull))
lines(x_samp, yLN, col="blue")
lines(x_samp, yGamma, col="red")
lines(x_samp, yWeibull, col="green")
Now that the shape is clearly evidenced, we can proceed with the calculations for the CI. The source of this part of the study is Michael Archibeque, https://rpubs.com/statscripter/322818, with all the other references given in the original demonstrations of the methods summarised here.
The Traditional Confidence Interval method, applied to an unknown distribution where all that is known are the paired differences between the two samples, starts with a check of Normality of the sample, even though at this point we already know the data are approximately Normally distributed.
qqnorm(x, datax=TRUE) #plot the data and add a line
qqline(x, datax=TRUE)
shapiro.test(x) # Shapiro-Wilk test: a p-value below the chosen level (e.g. 0.1) would be evidence against Normality
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.98383, p-value = 0.817
The test checks whether the assumption of Normality is reasonably supported; with p = 0.817 there is no evidence against it.
An alternative to the estimations above, whether or not the mean and/or the standard deviation of the underlying population are known, is to use the t distribution in place of the Normal distribution.
Since the population mean and standard deviation are unknown but the data behave similarly to a Normal distribution, we calculate the Traditional Confidence Interval of the mean difference.
Lower and Upper Limits of the Confidence Interval (95% CI)
avg_x - qt(c(.975,.025), n-1) * se
## [1] 8.734251 10.460871
Check of the results using the t distribution with the MLEs inserted:
t.test(x) # T-Test with CI
##
## One Sample t-test
##
## data: x
## t = 22.469, df = 40, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.734251 10.460871
## sample estimates:
## mean of x
## 9.597561
Results: we are 95% confident that the true mean difference in height is between 8.73 and 10.46 millimeters.
What if the data is not Normally Distributed or any other known distribution?
Traditionally, the T Distribution based Confidence Interval can continue to be used if the data is continuous and the population is thought to be mound-shaped.
To actually see how the MLE, together with the sample maximum (the upper bound of the observed data), can be matched against the real distribution after replicating a certain number of samples from the same unknown distribution, we apply the Glivenko–Cantelli theorem.
If the distribution is heavy-tailed, skewed, or has no recognizable shape, then non-parametric methods need to be employed. That is not the case here, but we walk through the procedure anyway.
Re-Sampling from the original dataset produces a larger number of replications from which the sampling distribution can be approximated.
Data sampled at random from an unknown population have an Empirical Cumulative Distribution Function (ECDF), Fn(x): the distribution that puts probability 1/n on each of the actual data points, with no additional information, and that can be used to estimate the population from the sample.
Fn(x) is defined as:
Fn(x) = 1/n * ∑ I(Xi ≤ x)
where I(Xi ≤ x) is an indicator function:
I(Xi ≤ x) = 1 if Xi ≤ x
I(Xi ≤ x) = 0 otherwise
‖Fn − F‖∞ = sup_{x∈ℝ} |Fn(x) − F(x)| → 0 almost surely
The supremum of the difference between Fn and F will, almost surely, approach zero as n → ∞.
There will be no difference between the ECDF and the underlying CDF as n increases towards infinity.
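A small sketch of this convergence (illustrative; the Normal(0,1) reference and the sample sizes are arbitrary assumptions): the Kolmogorov–Smirnov statistic sup|Fn − F| shrinks as n grows.
set.seed(1)
sapply(c(20, 200, 2000, 20000), function(m) {
  y <- rnorm(m)                                  # sample from a known F
  unname(ks.test(y, "pnorm")$statistic)          # sup-distance between the ECDF and F
})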
par(mfrow=c(1,3))
plot.ecdf(x)
curve(pnorm(x, avg_x, sd_x))
plot.ecdf(x, main="Empirical distribution ECDF(x)")
curve(pnorm(x, avg_x, sd_x), add = TRUE, col="red")
This makes the ECDF a powerful tool in practice when working with small data sets and limited background information.
set.seed(42) # Setting the random generator seed
n = length(x) # The original sample
avg_x = mean(x) # The mean of the original sample.
B = 10000 # A larger number of resamples
npar_x = sample(x, B*n, repl=TRUE) # B times re-sample of x Non Parametric bootstrap
npar_matrix_x = matrix(npar_x, nrow=B) # B x n matrix of resamples (B.Rows.N.Cols.X.Resampled)
npar_rows_x = rowMeans(npar_matrix_x) # vector of the B resample means
confidence.np_x = npar_rows_x - avg_x # centered resample means (resample mean - sample mean)
quantiles.np_x = quantile(confidence.np_x, c(.975,.025)) # quantiles of the centered means
avg_x - quantiles.np_x # Non-parametric Bootstrap Confidence Interval
## 97.5% 2.5%
## 8.762195 10.420732
Visualising the Non Parametric confidence interval with a histogram:
par(mfrow=c(1,3))
hist(npar_x); hist(npar_rows_x); hist(confidence.np_x)
The outcome is similar to the Traditional Confidence Interval, and it can be concluded that the true mean difference is between approximately 8.76 and 10.42 millimeters.
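As an optional cross-check (a sketch; it assumes the boot package, which ships with R, and simply recomputes the same kind of interval for the mean):
library(boot)
set.seed(42)
bt <- boot(x, statistic = function(d, i) mean(d[i]), R = 10000)  # bootstrap the mean
boot.ci(bt, type = c("basic", "perc"))                           # basic and percentile CIs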
The Kolmogorov–Smirnov test statistic can also be used.
The Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous (or discontinuous) one-dimensional probability distributions used to compare a sample with a reference probability distribution (one-sample K–S test).
The Kolmogorov–Smirnov statistic quantifies the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution (or between the empirical distribution functions of two samples, as in the two-sample test used below).
### Two sample Kolmogorov-Smirnov test
# test if x is stochastically larger than x2
require(graphics)
x2 <- rnorm(n) # n simulated standard Normal values for comparison
plot(ecdf(x), xlim = range(c(x, x2)))
#curve(x2 , add = TRUE, col="blue")
plot(ecdf(x2), add = TRUE, lty = "dashed")
curve( pnorm( x, avg_x, sd_x), add = TRUE, col="red")
t.test(x, x2, alternative = "g")
##
## Welch Two Sample t-test
##
## data: x and x2
## t = 20.585, df = 53.697, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 8.762134 Inf
## sample estimates:
## mean of x mean of y
## 9.59756098 0.05994177
wilcox.test(x, x2, alternative = "g")
## Warning in wilcox.test.default(x, x2, alternative = "g"): cannot compute exact
## p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: x and x2
## W = 1681, p-value = 3.33e-15
## alternative hypothesis: true location shift is greater than 0
ks.test(x, x2, alternative = "l")
## Warning in ks.test(x, x2, alternative = "l"): cannot compute exact p-value with
## ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and x2
## D^- = 1, p-value < 2.2e-16
## alternative hypothesis: the CDF of x lies below that of y
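A one-sample variant can also be run (a sketch: because the Normal parameters are estimated from the same data the nominal p-value is only approximate, and the ties in x will trigger a warning):
ks.test(x, "pnorm", mean = avg_x, sd = sd_x)  # ECDF of x vs the fitted Normal CDF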
### Parametric Bootstrap Confidence Interval
set.seed(42)
n = length(x)
avg_x = mean(x) # The MLE of the mean
sd_x = sd(x) # The MLE of the standard deviation
B = 10000 # The number of bootstrap samples
par_x = rnorm(B*n, avg_x, sd_x) # B*n draws from the fitted Normal(avg_x, sd_x)
par_matrix_x = matrix(par_x, nrow=B) # B x n matrix of samples
par_rows_x = rowMeans(par_matrix_x) # Vector of the B sample means
confidence.p_x = par_rows_x - avg_x # centered sample means
quantiles.p_x = quantile(confidence.p_x, c(.975,.025)) # estimated quantiles
avg_x - quantiles.p_x # parametric bootstrap CI of the mean
## 97.5% 2.5%
## 8.752424 10.432419
Visualising the Parametric confidence interval with a histogram:
par(mfrow=c(1,3))
hist(par_x); hist(par_rows_x); hist(confidence.p_x)
A sample of 41 original data points is replaced by draws from the estimated Normal CDF with parameters avg_x and sd_x:
rather than filling the vector and matrix with resampled copies of the exact values from the original data set, they are filled with values drawn from anywhere within the scope of the estimated CDF.
Results: similar to those produced by the traditional Confidence Interval as well as NPB, with the true mean difference in height being between approximately 8.75 and 10.43 millimeters.
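For convenience, the three intervals computed above can be placed side by side using the objects already in the workspace (this is only a summary of earlier results, not a new method):
rbind(
  traditional   = avg_x - qt(c(.975, .025), n - 1) * se,
  nonparametric = avg_x - quantiles.np_x,
  parametric    = avg_x - quantiles.p_x)  # lower / upper 95% limits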
par(mfrow=c(1,4))
qqplot(x,npar_x)
qqplot(x,par_x)
abline(0,1, col="red")
qqplot(x,-log(x), type="l")
#qqplot(x,lognormal_x)
#qqnorm(quant_lognormal_x)
#abline(0,1)
### What happens if we use the Log Normal distribution?
The log normal distribution has density
f(x) = 1/(√(2 π) σ x) e^-((log x - μ)^2 / (2 σ^2))
where μ and σ are the mean and standard deviation of the logarithm.
The mean is:
E(X) = exp(μ + 1/2 σ^2)
the median is:
med(X) = exp(μ)
and the variance:
Var(X) = exp(2*μ + σ^2)*(exp(σ^2) - 1)
and hence the coefficient of variation is:
sqrt(exp(σ^2) - 1)
which is approximately σ when that is small (e.g., σ < 1/2).
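Plugging the fitted parameters sampleLogMean and sampleLogSd (estimated earlier with fitdistr) into these formulas gives, as a quick sketch:
mu <- unname(sampleLogMean); sigma <- unname(sampleLogSd)
c(mean   = exp(mu + sigma^2 / 2),                      # E(X)
  median = exp(mu),                                    # med(X)
  var    = exp(2 * mu + sigma^2) * (exp(sigma^2) - 1), # Var(X)
  cv     = sqrt(exp(sigma^2) - 1))                     # coefficient of variation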
Density
lognormal_x = dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE) # standard lognormal density (meanlog = 0, sdlog = 1), not the fitted parameters
hist(lognormal_x)
lines(density(lognormal_x), col = "red")
par(mfrow=c(1,3))
qqplot(x,lognormal_x,type="o",col="red")
qqplot(npar_x,lognormal_x,type="o",col="blue")
qqplot(par_x,lognormal_x,type="o",col="green")
The distribution function, the quantile function and random generation for the log normal distribution:
plnorm(quantiles.p_x, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
## 97.5% 2.5%
## 0.4331907 0.0000000
quant_lognormal_x = qlnorm(confidence.p_x, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE) # values of confidence.p_x outside [0, 1] produce NaN (see warning)
## Warning in qlnorm(confidence.p_x, meanlog = 0, sdlog = 1, lower.tail = TRUE, :
## NaNs produced
plot(quant_lognormal_x)
qqnorm(quant_lognormal_x)
rand_lognormal_x = rlnorm(B*n, meanlog = 0, sdlog = 1)
qqnorm(rand_lognormal_x)
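Finally, as a hedged extension (not in the original code), the parametric bootstrap of the mean can be repeated under the fitted lognormal, using sampleLogMean and sampleLogSd instead of the standard lognormal parameters, for comparison with the intervals above:
set.seed(42)
ln_x        <- rlnorm(B * n, sampleLogMean, sampleLogSd)   # draws from the fitted lognormal
ln_matrix_x <- matrix(ln_x, nrow = B)
ln_means    <- rowMeans(ln_matrix_x)                       # B bootstrap means
avg_x - quantile(ln_means - avg_x, c(.975, .025))          # lognormal parametric bootstrap CI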