Sampling Distributions

When we repeatedly sample and calculate the mean, a strikingly similar distribution of means appears even though we are sampling from very different populations. One that is symmetric at its center, meaning that roughly half of the values appear on the left and right of the mean - and that the mean and the median are close to each other. Coincidentally, these are all also properties of the normal distribution!

Central Limit Theorem in Action 1) In the first row of graphs, we see the distributions of three populations we worked with until now. These include the guesses from the Danish newspaper’s contest, the guesses naive players would make, and the theoretical distribution of Roulette winnings. All three are very different in shape.
2) In the second row of graphs, we see three typical samples from each population, of sample size 10. Again the samples all look very different, and are influenced by the shape of the population they are sampled from.
3) In the third and final row, we see the distribution of means that we create by the repeated sampling, and calculation of sample means, from the populations. Despite their previous difference, the distributions now look very alike in shape.

setwd("~/OneDrive - Universiteit Leiden/Quantitative Research Skills/R Tutorials")
sampDist <- read.csv("samplingDistributions.csv")

# checking the 1,2,3!
dim(sampDist)
## [1] 500000      5
summary(sampDist)
##        X                N            Danish          Naive      
##  Min.   :     1   Min.   :   5   Min.   : 2.20   Min.   :13.47  
##  1st Qu.:125001   1st Qu.:  15   1st Qu.:30.62   1st Qu.:47.90  
##  Median :250000   Median :  30   Median :32.37   Median :50.00  
##  Mean   :250000   Mean   : 670   Mean   :32.40   Mean   :50.00  
##  3rd Qu.:375000   3rd Qu.: 300   3rd Qu.:33.90   3rd Qu.:52.11  
##  Max.   :500000   Max.   :3000   Max.   :85.00   Max.   :86.01  
##     Roulette     
##  Min.   :-866.0  
##  1st Qu.:-146.0  
##  Median :-105.4  
##  Mean   :-105.3  
##  3rd Qu.: -65.6  
##  Max.   : 700.0
head(sampDist, 3)
##   X N Danish    Naive Roulette
## 1 1 5   29.4 53.17620       34
## 2 2 5   23.8 48.92205      106
## 3 3 5   41.8 50.72502     -218
# some graphic parameters to make a crisp plot
par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5, font.lab = 2, 
    cex.axis = 1.3, bty = "n", las = 1)
#       las -> style of axis labels (0: parallel to axis; 1: always horizontal; 2: always perpendicular,
#               4: always vertical)
#       mar -> number of lines of margin on four sides of plot c(bottom, left, top, right)
#       mgp -> margin line for axis title, axis labels and axis line c(title,axis1,axis2)
#       cex.main -> magnification for main titles relative to current setting
#       cex.lab -> magnification to be used for x and y labels relative to current setting
#       font.lab -> font to be used for x and y labels

# using expression function to plot symbols
curve(dnorm(x), from=(-4), to=4, col="black", lwd=2 , xlab=expression(sigma), 
      ylab= "Probability Density")

# using segments function to create lines on the plot
#       segments(x0, y0, x1=?, x2=?, col=?, ...)
#           x0, y0 -> coordinates of points from which to draw
#           x1, y1 -> coordinates of points to which to draw (at least one must be supplied)

# first, adding the 1 and -1 standard deviations lines
segments(1, 0, 1, dnorm(1), lty=2, lwd=2)
segments(-1, 0, -1, dnorm(-1), lty=2, lwd=2)
# calculating are between ±1 standard deviations -> pnorm is the integral of dnorm!
pnorm(1) # -> probability of obtaining value at 1 standard deviation from the mean
## [1] 0.8413447
area <- pnorm(1) - pnorm(-1)
text(0, 0.2, paste0(round(area, 3)*100, "%"))

# second, adding the 1.96 and -1.96 standard deviations lines
segments(1.96, 0,1.96, dnorm(1.96), lty=2, lwd=2, col="red")
segments(-1.96, 0, -1.96, dnorm(-1.96), lty=2, lwd=2, col="red")

# calculating area between ±2 standard deviations
area <- pnorm(2)-pnorm(-2)
text(-3, 0.1, paste0(round(area, 3)*100, "%"), col="red")
text(3, 0.1, paste0(round(area, 3)*100, "%"), col="red")


Standardizing Data Values

To test for normality, sampling distributions may be compared to the simplest normal distribution possible: the standard normal model. Also known as the “z-curve”. For this, we need to subtract the mean from all values in these datasets and divide by their standard error, that is, convert them into z-scores. This is a process called standardization of data. Data is standardized specifically to compare it to the standard normal. Take care to do this for each sample size separately!

# subsetting data to the distributions that came from a sample size of n=5
sampN5 <- subset(sampDist, N==5)

# standardizing each column and creating a new column for resulting values
sampN5$stDanish <- (sampN5$Danish - mean(sampN5$Danish)) / sd(sampN5$Danish)
sampN5$stNaive <- (sampN5$Naive - mean(sampN5$Naive)) / sd(sampN5$Naive)
sampN5$stRoulette <- (sampN5$Roulette - mean(sampN5$Roulette)) / sd(sampN5$Roulette)

# summarizing each column again
summary(sampN5)
##        X                N         Danish          Naive          Roulette   
##  Min.   :     1   Min.   :5   Min.   : 2.20   Min.   :13.47   Min.   :-866  
##  1st Qu.: 25001   1st Qu.:5   1st Qu.:25.60   1st Qu.:43.79   1st Qu.:-236  
##  Median : 50000   Median :5   Median :31.80   Median :50.03   Median :-110  
##  Mean   : 50000   Mean   :5   Mean   :32.41   Mean   :50.05   Mean   :-105  
##  3rd Qu.: 75000   3rd Qu.:5   3rd Qu.:38.60   3rd Qu.:56.28   3rd Qu.:  16  
##  Max.   :100000   Max.   :5   Max.   :85.00   Max.   :86.01   Max.   : 700  
##     stDanish           stNaive            stRoulette      
##  Min.   :-3.15786   Min.   :-3.997764   Min.   :-4.17714  
##  1st Qu.:-0.71168   1st Qu.:-0.684361   1st Qu.:-0.71904  
##  Median :-0.06354   Median :-0.001705   Median :-0.02742  
##  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.64731   3rd Qu.: 0.681083   3rd Qu.: 0.66420  
##  Max.   : 5.49786   Max.   : 3.930763   Max.   : 4.41871
# plotting data and comparing graphs to the standard normal
# first, the graphical settings
par(mfrow=c(3,1),mar=c(4,4,1,1),cex.main=0.8,las=1)

# 1.sampling distributions of the Danish contest
hist(sampN5$stDanish, xlab="", main="Danish, n = 5", freq=FALSE, ylab="Probability Density", 
     ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col="black", lwd=2,add=TRUE)

# 2. sampling distributions of the naive guesses
hist(sampN5$stNaive, xlab="", main="Naive, n = 5", freq=FALSE, ylab = "Probability Density", 
     ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col ="black", lwd=2,add=TRUE)

# 3. sampling distributions of Roulette winnings
hist(sampN5$stRoulette, xlab="Mean", main="Roulette, n = 5", freq=FALSE, ylab = "Probability Density", ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col="black", lwd=2, add=TRUE)

The sampling distributions all show a remarkable likeness to the standard normal distribution! However, how close are they really? We can find out by looking at the values that comprise the outermost 5% of the data.

# 1.sampling distributions of the Danish contest
quantile(sampN5$stDanish,prob=c(0.025,0.975))
##      2.5%     97.5% 
## -1.757054  2.131749
# 2. sampling distributions of the naive guesses
quantile(sampN5$stNaive,prob=c(0.025,0.975))
##      2.5%     97.5% 
## -1.945551  1.957428
# 3. sampling distributions of Roulette winnings
quantile(sampN5$stRoulette,prob=c(0.025,0.975))
##      2.5%     97.5% 
## -1.904671  1.948639
# we are close but not exactly there..
#       ...is this because of the low sample size?
#       ...or due to the shape of the population we sampled from?
#       -> important questions to ask, as they influence choice of appropriate statistical test

Exercise 14

Adapt the above code to find out whether you get closer to the 95% cutoffs of a standard normal (-1.96 and 1.96) with increasing sample size. Do this for each sample size in the dataset seperately. Tip: use a new R-script to do this systematically. In addition, the function tapply combined with a custom function can do this for all samples sizes in one go - can you figure out how to do it with tapply? This isn’t needed - but it is more efficient. Ask an instructor for help if you feel you are lost! Save your answer.

unique(sampDist$N) # to find what sample sizes exist in the dataset
## [1]    5   15   30  300 3000
# subsetting data to the distributions that came from a sample size of n=3,000
sampN3000 <- subset(sampDist, N==3000)

# standardizing each column and creating a new column for resulting values
sampN3000$stDanish <- (sampN3000$Danish - mean(sampN3000$Danish)) / sd(sampN3000$Danish)
sampN3000$stNaive <- (sampN3000$Naive - mean(sampN3000$Naive)) / sd(sampN3000$Naive)
sampN3000$stRoulette <- (sampN3000$Roulette - mean(sampN3000$Roulette)) / sd(sampN3000$Roulette)

# summarizing each column again
summary(sampN3000)
##        X                N            Danish          Naive      
##  Min.   :400001   Min.   :3000   Min.   :30.85   Min.   :47.70  
##  1st Qu.:425001   1st Qu.:3000   1st Qu.:32.16   1st Qu.:49.64  
##  Median :450000   Median :3000   Median :32.40   Median :50.00  
##  Mean   :450000   Mean   :3000   Mean   :32.40   Mean   :50.00  
##  3rd Qu.:475000   3rd Qu.:3000   3rd Qu.:32.65   3rd Qu.:50.35  
##  Max.   :500000   Max.   :3000   Max.   :33.99   Max.   :52.28  
##     Roulette          stDanish            stNaive           stRoulette       
##  Min.   :-155.48   Min.   :-4.342786   Min.   :-4.36614   Min.   :-4.778993  
##  1st Qu.:-112.34   1st Qu.:-0.677617   1st Qu.:-0.67677   1st Qu.:-0.675996  
##  Median :-105.26   Median : 0.001066   Median : 0.00136   Median :-0.002625  
##  Mean   :-105.23   Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: -98.12   3rd Qu.: 0.675104   3rd Qu.: 0.67548   3rd Qu.: 0.676452  
##  Max.   : -59.90   Max.   : 4.421101   Max.   : 4.34703   Max.   : 4.311514
# plotting data and comparing graphs to the standard normal
# first, the graphical settings
par(mfrow=c(3,1),mar=c(4,4,1,1),cex.main=0.8,las=1)

# 1.sampling distributions of the Danish contest
hist(sampN3000$stDanish, xlab="", main="Danish, n = 3,000", freq=FALSE, ylab="Probability Density", 
     ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col="black", lwd=2,add=TRUE)

# 2. sampling distributions of the naive guesses
hist(sampN3000$stNaive, xlab="", main="Naive, n = 3,000", freq=FALSE, ylab = "Probability Density", 
     ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col ="black", lwd=2,add=TRUE)

# 3. sampling distributions of Roulette winnings
hist(sampN3000$stRoulette, xlab="Mean", main="Roulette, n = 3,000", freq=FALSE, ylab = "Probability 
     Density", ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col="black", lwd=2, add=TRUE)

# how close are the three sampling distributions to the standard normal?
# 1.sampling distributions of the Danish contest
quantile(sampN3000$stDanish,prob=c(0.025,0.975))
##      2.5%     97.5% 
## -1.955084  1.963720
# 2. sampling distributions of the naive guesses
quantile(sampN3000$stNaive,prob=c(0.025,0.975))
##      2.5%     97.5% 
## -1.959750  1.961674
# 3. sampling distributions of Roulette winnings
quantile(sampN3000$stRoulette,prob=c(0.025,0.975))
##      2.5%     97.5% 
## -1.959966  1.960422

This can be done systematically using the “tapply” function, which is perfect for this because it applies a function to a ragged array, i.e. different groups of data (defined by N) that you want to apply the same operation to. The typical set-up is as follows: tapply(x, index, function).

sampDist <- read.csv("samplingDistributions.csv")
# to use tapply, first necessary to make a custom function -> function(arguments){ instructions }
QF <- function(x){
  # standardize
  zscores <- (x - mean(x))/sd(x)
  # quantiles
  quantilesST <- quantile(zscores, prob=c(0.025,0.975))
}
# now using tapply -> tapply(dataset, index/grouping, function)
danishCI <- tapply(sampDist$Danish, sampDist$N, QF)

# using lapply to use tapply function above on each dataset systematically
# for this, need to create another function tapplyF to systematically use tapply on datasets
datasets <- sampDist[c("Danish", "Naive", "Roulette")]
tapplyF <- function(datasets){
  tapply(datasets, sampDist$N, QF)
} 
# now using lapply to apply tapplyF function (which in turn applies QF function) to each dataset
allCIs <- lapply(datasets, tapplyF)
allCIs
## $Danish
## $Danish$`5`
##      2.5%     97.5% 
## -1.757054  2.131749 
## 
## $Danish$`15`
##      2.5%     97.5% 
## -1.861561  2.049589 
## 
## $Danish$`30`
##      2.5%     97.5% 
## -1.887132  2.027894 
## 
## $Danish$`300`
##      2.5%     97.5% 
## -1.939360  1.976974 
## 
## $Danish$`3000`
##      2.5%     97.5% 
## -1.955084  1.963720 
## 
## 
## $Naive
## $Naive$`5`
##      2.5%     97.5% 
## -1.945551  1.957428 
## 
## $Naive$`15`
##      2.5%     97.5% 
## -1.944215  1.963825 
## 
## $Naive$`30`
##      2.5%     97.5% 
## -1.956644  1.956305 
## 
## $Naive$`300`
##     2.5%    97.5% 
## -1.95664  1.96075 
## 
## $Naive$`3000`
##      2.5%     97.5% 
## -1.959750  1.961674 
## 
## 
## $Roulette
## $Roulette$`5`
##      2.5%     97.5% 
## -1.904671  1.948639 
## 
## $Roulette$`15`
##      2.5%     97.5% 
## -1.922995  2.018682 
## 
## $Roulette$`30`
##      2.5%     97.5% 
## -1.922802  1.957193 
## 
## $Roulette$`300`
##      2.5%     97.5% 
## -1.944370  1.975033 
## 
## $Roulette$`3000`
##      2.5%     97.5% 
## -1.959966  1.960422

AThe above shows the Central Limit Theorem in action! All distributions eventually come close to the situation where 95% of all points are contained between ±1.96. This emonstrates that larger samples make the empricial sampling distribution align more closely with the standard normal distribution, just as the CLT predicts. As the sample size increases, the estimated confidence interval thresholds approach the theoretical cutoffs of ±1.96. However, it seems that low sample size is more problematic for the Danish distribution, underlining that not all distributions converge to the standard normal st the same “speed” (with sample size).