Statistical Inference, Peer Assessment: Course Project.

Overview

The purpose of this project is to investigate the Exponential Distribution using R and compare it with the Central Limit Theorem. The Exponential Distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. For the Exponential Distribution, both the mean and the standard deviation are 1/lambda. We will set lambda = 0.2 for all of the simulations and will investigate the distribution of averages of 40 exponentials. We will need to do a thousand simulations. We will illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. We will:

  1. Show the sample mean and compare it to the theoretical mean of the distribution.
  2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
  3. Show that the distribution is approximately normal.

A Brief Summary on The Central Limit Theorem

“The normal distribution is used to help measure the accuracy of many statistics, including the sample mean, using …the Central Limit Theorem. This theorem gives you the ability to measure how much the means of various samples will vary, without having to take any other sample means to compare it with.” The Central Limit Theorem “basically says that for non-normal data, the distribution of the sample means has an approximate normal distribution, no matter what the distribution of the original data looks like, as long as the sample size is large enough (usually at least 30) and all samples have the same size. And it doesn’t just apply to the sample mean; the CLT is also true for other sample statistics, such as the sample proportion. Because statisticians know so much about the normal distribution, these analyses are much easier.” [1]
“A tricky aspect of statistics is that results like the central limit theorem come with caveats, such as ‘for sufficiently large n…’.” [2] Let us remember that a distribution describes the shape of a batch of numbers which might be the result of an experiment, like a poll or a science study. The characteristics of a distribution can sometimes be defined using a small number of numeric descriptors called parameters (example: mean, variance, standard deviation, etc.). These parameters provide a basis for standardized comparison of empirical distributions, help us estimate confidence intervals for inferential statistics, and form a basis for more advanced statistical methods. The fit between observed distributions and certain theoretical distributions is an assumption of many statistical procedures.

The point is that, given a study, we don’t know a priori the shape of the distribution for that sample. It might not be a normal distribution; however, we can still make use of some properties of the normal distribution. That is where the Central Limit Theorem appears and plays a crucial role: the distribution of means derived from sets of random samples taken from any population will tend toward normality; the conformity to a normal distribution increases with the size of the samples, and these means will be distributed around the population mean (that is, the average of the individual sample averages will approach the mean of the whole population). We usually have just one of those samples. We can’t know where it falls relative to the population mean, but we can estimate how far it is likely to be; this depends on the sample size and on an estimate of the population variance. The smaller the sample and the more dispersed the population, the more likely it is that our sample mean falls far from the population mean.

Let’s try a final summary: thanks to the Central Limit Theorem and the sampling distribution of the mean \(\mu\), we can use the standard normal curve on samples drawn from populations that are not normally distributed. There are three steps in generating a theoretical sampling distribution (a small sketch follows this list):

  a. every sample generating the sample statistic is drawn randomly from the same population;
  b. the sample size, n, is the same for all the samples (example: 40);
  c. the number of samples is very large (example: 1000).
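As a minimal illustration of those three steps, here is a sketch that uses a uniform population purely as an example (the project itself uses the exponential distribution): 1000 samples of size 40 are drawn, the mean of each sample is kept, and the histogram of those means already looks roughly normal.

set.seed(1)
unif_means <- replicate(1000, mean(runif(40)))   # steps a and b: 1000 random samples of size 40 from Unif(0, 1)
hist(unif_means, prob = TRUE, col = "gray", main = "Means of 1000 uniform samples (n = 40)")   # step c: the sampling distribution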

Let’s see what a normal distribution of 40 sampled values looks like, with mean = standard deviation = 5.

set.seed(159)
test <- rnorm(40, 5, 5)   # 40 values drawn from N(mean = 5, sd = 5)
hist(test, prob=TRUE, col="gray", main=expression(paste("40 Sampled values, ", mu,"=5, ", sigma,"=5")))
curve(dnorm(x, 5, 5), lty=1, add=TRUE)   # overlay the theoretical density

We can compare that against the following normal distribution, this time of 1000 values:

set.seed(123)
samp <- rnorm(1000, 5, 5)   # 1000 values drawn from N(mean = 5, sd = 5)
hist(samp, prob=TRUE, col="orange", main=expression(paste("1000 Sampled values, ", mu,"=5, ", sigma,"=5")))
curve(dnorm(x, 5, 5), lty=1, add=TRUE)   # overlay the theoretical density

Among the many possible theoretical distributions, the Normal or Gaussian distribution is relevant because it is a continuous distribution with tails that stretch symmetrically and infinitely in both directions around the mean \(\mu\), the point at which the distribution reaches its maximum; the standard deviation \(\sigma\) is located at the points of inflection of the curve. A specific, single normal curve exists for any combination of \(\mu\) and \(\sigma\); these are the parameters of the distribution and define it completely. Lots of natural phenomena in the real world approximate normal distributions closely enough that we can use it as a model, for example the height of a population sample. Also, phenomena that emerge from a large number of uncorrelated, random events will usually approximate a normal distribution.
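For reference, those two parameters completely specify the density of the normal distribution:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]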

Q1: Show the sample mean and compare it to the theoretical mean of the distribution.

For our study, we first simulate the exponential distribution. What is the behavior of the average if we consider sample sizes of 40 and 200, with two different numbers of simulations, 1000 and 5000? Let’s plot and examine:

library(ggplot2)
lambda <- 0.2
theo_mean <- theo_sd <- 1 / lambda   # for Exp(lambda), mean and sd are both 1/lambda = 5
snum1 <- 1000    # number of simulations, case 1
snum2 <- 5000    # number of simulations, case 2
ssize1 <- 40     # sample size, case 1
ssize2 <- 200    # sample size, case 2
set.seed(278)
avg1 <- avg2 <- avg3 <- avg4 <- NULL
par(mfrow = c(2, 2))

# averages of 40 exponentials, 1000 simulations
for (i in 1:snum1) avg1 <- c(avg1, mean(rexp(ssize1, lambda)))
plot(avg1, type = "l", col = "gray", ylab = "Average 1", xlab = "40 Samples, 1000 Simulations")
lines(x = seq(1, snum1, 1), y = rep(theo_mean, snum1), col = "red")

# averages of 200 exponentials, 1000 simulations
for (i in 1:snum1) avg2 <- c(avg2, mean(rexp(ssize2, lambda)))
plot(avg2, type = "l", col = "gray", ylab = "Average 2", xlab = "200 Samples, 1000 Simulations")
lines(x = seq(1, snum1, 1), y = rep(theo_mean, snum1), col = "black")

# averages of 40 exponentials, 5000 simulations
for (i in 1:snum2) avg3 <- c(avg3, mean(rexp(ssize1, lambda)))
plot(avg3, type = "l", col = "gray", ylab = "Average 3", xlab = "40 Samples, 5000 Simulations")
lines(x = seq(1, snum2, 1), y = rep(theo_mean, snum2), col = "orange")

# averages of 200 exponentials, 5000 simulations
for (i in 1:snum2) avg4 <- c(avg4, mean(rexp(ssize2, lambda)))
plot(avg4, type = "l", col = "gray", ylab = "Average 4", xlab = "200 Samples, 5000 Simulations")
lines(x = seq(1, snum2, 1), y = rep(theo_mean, snum2), col = "green")
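As a side note, the same sampling distributions can be built without growing vectors inside a loop, for example from a matrix of draws (a sketch only; the exact values will differ from avg1 above unless the seed and the order of the draws are matched):

avg1_alt <- rowMeans(matrix(rexp(snum1 * ssize1, lambda), nrow = snum1))   # 1000 means of 40 exponentials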

How do the histograms of those averages behave?

par(mfrow=c(2,2))

hist(avg1)
hist(avg2)
hist(avg3)
hist(avg4)

And how do the averages and standard deviations compare?

mean(avg1)
## [1] 5.021335
mean(avg2)
## [1] 4.99923
mean(avg3)
## [1] 5.011971
mean(avg4)
## [1] 4.99302
mean(c(avg1, sd(rexp(ssize1, 0.2))))
## [1] 5.022172
sd(c(avg1, sd(rexp(ssize1, 0.2))))
## [1] 0.7920146
mean(c(avg2, sd(rexp(ssize2, 0.2))))
## [1] 4.999147
sd(c(avg2, sd(rexp(ssize2, 0.2))))
## [1] 0.3503007
mean(c(avg3, sd(rexp(ssize1, 0.2))))
## [1] 5.011735
sd(c(avg3, sd(rexp(ssize1, 0.2))))
## [1] 0.7908204
mean(c(avg4, sd(rexp(ssize2, 0.2))))
## [1] 4.992986
sd(c(avg4, sd(rexp(ssize2, 0.2))))
## [1] 0.3493377
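To make the comparison with the theoretical mean \(1/\lambda = 5\) explicit, we can also look at how far each set of simulated averages is from it (a small check that reuses the vectors and theo_mean defined above):

sapply(list(avg1, avg2, avg3, avg4), function(a) abs(mean(a) - theo_mean))   # absolute distance from 5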

We have just confirmed what we explained in the introduction: a bigger sample is generally better, but that is not always practical. It may be expensive to obtain a bigger sample, and the trick is that, if the sampling distribution can be approximated by the normal one, we can get the most out of the sample by applying all the normal-distribution tools to it. The ‘sd’ values above can be read as: the bigger the sample size and the number of simulations, the smaller the standard deviation of the vector of simulated averages.

Q2: Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

For the exponential distribution the theoretical variance is given by \(\sigma^2\) = 1/\(\lambda^2\) = (in our case) = 1/\(0.2^2\) = 25. According to the Central Limit Theorem, the distribution of the mean for our sample(s) ‘could’ follow a normal distribution centered around (in our case) 5, as in fact we have seen in the previous plots, with a variance given by \(\sigma^2\)/n = 25/40 = 0.625, where n is the size of the sample; in our study n = 40. We can compare this theoretical (‘expected’) result with the variance we obtain for our averages ‘avg1’ and ‘avg3’, which came from running 1000 and 5000 simulations:

var(avg1)
## [1] 0.6266318
var(avg3)
## [1] 0.625478
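For reference, the theoretical value can be computed directly from the quantities defined earlier:

(1 / lambda)^2 / ssize1
## [1] 0.625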

And, ‘as expected’, the bigger the number of simulations (5000 versus 1000), the closer our variance is to the theoretical (expected) value.

Q3: Show that the distribution is approximately normal.

From Wikipedia: “In statistics, normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed. More precisely, the tests are a form of model selection, and can be interpreted several ways, depending on one’s interpretations of probability:
- In descriptive statistics terms, one measures a goodness of fit of a normal model to the data - if the fit is poor then the data are not well modeled in that respect by a normal distribution, without making a judgment on any underlying variable.
- In frequentist statistics statistical hypothesis testing, data are tested against the null hypothesis that it is normally distributed.
- In Bayesian statistics, one does not “test normality” per se, but rather computes the likelihood that the data come from a normal distribution with given parameters \(\mu\), \(\sigma\) (for all \(\mu\), \(\sigma\)), and compares that with the likelihood that the data come from other distributions under consideration, most simply using a Bayes factor (giving the relative likelihood of seeing the data given different models), or more finely taking a prior distribution on possible models and parameters and computing a posterior distribution given the computed likelihoods.”

The simplest, fastest way to answer this is using R’s qqnorm and qqline functions:

From the R documentation (qqnorm {stats}, Quantile-Quantile Plots): “qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a ‘theoretical’, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.”

Again, from Wikipedia:
“Normality tests assess the likelihood that the given data set {x1, …, xn} comes from a normal distribution. Typically the null hypothesis H0 is that the observations are distributed normally with unspecified mean \(\mu\) and variance \(\sigma^2\), versus the alternative Ha that the distribution is arbitrary. Many tests (over 40) have been devised for this problem.” Visual tests are more intuitively appealing but subjective at the same time, as they rely on informal human judgement to accept or reject the null hypothesis. The more prominent ones include:
- Q-Q plot: a plot of the sorted values from the data set against the expected values of the corresponding quantiles from the standard normal distribution. If the null hypothesis is true, the plotted points should approximately lie on a straight line.
- P-P plot: similar to the Q-Q plot, but used much less frequently.
- Shapiro-Wilk test: employs the fact that the line in the Q-Q plot has the slope of \(\sigma\).
- Normal probability plot (rankit plot).
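Since the Shapiro-Wilk test is mentioned above, a formal check is also easy to run in R (a sketch on the 1000 averages from Q1; note that shapiro.test accepts at most 5000 observations, so avg3 and avg4 sit exactly at that limit):

shapiro.test(avg1)   # null hypothesis: the simulated averages come from a normal distribution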

Finally, using the averages we obtained earlier (avg1, avg2, avg3 and avg4), we can plot:

par(mfrow=c(2,2))
qqnorm(avg1, main = "40 Samples, 1000 Simulations"); qqline(avg1)
qqnorm(avg2, main = "200 Samples, 1000 Simulations"); qqline(avg2)
qqnorm(avg3, main = "40 Samples, 5000 Simulations"); qqline(avg3)
qqnorm(avg4, main = "200 Samples, 5000 Simulations"); qqline(avg4)  

From the Q-Q plots above, and in agreement with our ‘theoretical’ introduction, case 4 (200 samples, 5000 simulations; bottom-right panel) is the distribution that looks most ‘normal’. The 40-sample / 1000-simulation distribution also behaves approximately ‘normally’, but the larger case follows the reference line more closely.
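As a complementary visual check, in the same spirit as the introductory plots, we can overlay the normal density implied by the CLT on a histogram of the simulated averages (a sketch using avg1, lambda and ssize1 from above):

hist(avg1, prob = TRUE, col = "gray", main = "Averages of 40 Exponentials, 1000 Simulations")
curve(dnorm(x, mean = 1 / lambda, sd = (1 / lambda) / sqrt(ssize1)), add = TRUE)   # N(5, 5/sqrt(40))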
These notes are based largely on Wikipedia and on some older notes of mine; this is a very interesting subject and project.

Some references I used:

  1. https://en.wikipedia.org/wiki/Normal_distribution
  2. http://www.ats.ucla.edu/stat/r/pages/greek_letters.htm
  3. http://www.stats4stem.org/r-normal-distribution.html
  4. https://rstudio-pubs-static.s3.amazonaws.com/18858_0c289c260a574ea08c0f10b944abc883.html
  5. http://www.statpower.net/Content/310/R%20Stuff/SampleMarkdown.html
  6. http://www.calvin.edu/~rpruim/courses/m343/F12/RStudio/LatexExamples.html