When we repeatedly sample and calculate the mean, a strikingly similar distribution of means appears even though we are sampling from very different populations. One that is symmetric at its center, meaning that roughly half of the values appear on the left and right of the mean - and that the mean and the median are close to each other. Coincidentally, these are all also properties of the normal distribution!
1)
In the first row of graphs, we see the distributions of three
populations we worked with until now. These include the guesses
from the Danish newspaper’s contest, the
guesses naive
players would make, and the theoretical
distribution of Roulette winnings. All three
are very different in shape.
2) In the second row of graphs, we see three typical samples
from each population, of sample size 10. Again the samples all
look very different, and are influenced by the shape of the
population they are sampled from.
3) In the third and final row, we see the distribution of
means that we create by the repeated
sampling, and calculation of sample means,
from the populations. Despite their previous difference, the
distributions now look very alike in
shape.
setwd("~/OneDrive - Universiteit Leiden/Quantitative Research Skills/R Tutorials")
sampDist <- read.csv("samplingDistributions.csv")
# checking the 1,2,3!
dim(sampDist)
## [1] 500000 5
summary(sampDist)
## X N Danish Naive
## Min. : 1 Min. : 5 Min. : 2.20 Min. :13.47
## 1st Qu.:125001 1st Qu.: 15 1st Qu.:30.62 1st Qu.:47.90
## Median :250000 Median : 30 Median :32.37 Median :50.00
## Mean :250000 Mean : 670 Mean :32.40 Mean :50.00
## 3rd Qu.:375000 3rd Qu.: 300 3rd Qu.:33.90 3rd Qu.:52.11
## Max. :500000 Max. :3000 Max. :85.00 Max. :86.01
## Roulette
## Min. :-866.0
## 1st Qu.:-146.0
## Median :-105.4
## Mean :-105.3
## 3rd Qu.: -65.6
## Max. : 700.0
head(sampDist, 3)
## X N Danish Naive Roulette
## 1 1 5 29.4 53.17620 34
## 2 2 5 23.8 48.92205 106
## 3 3 5 41.8 50.72502 -218
# some graphic parameters to make a crisp plot
par(cex.main = 1.5, mar = c(5, 6, 4, 5) + 0.1, mgp = c(3.5, 1, 0), cex.lab = 1.5, font.lab = 2,
cex.axis = 1.3, bty = "n", las = 1)
# las -> style of axis labels (0: parallel to axis; 1: always horizontal; 2: always perpendicular,
# 4: always vertical)
# mar -> number of lines of margin on four sides of plot c(bottom, left, top, right)
# mgp -> margin line for axis title, axis labels and axis line c(title,axis1,axis2)
# cex.main -> magnification for main titles relative to current setting
# cex.lab -> magnification to be used for x and y labels relative to current setting
# font.lab -> font to be used for x and y labels
# using expression function to plot symbols
curve(dnorm(x), from=(-4), to=4, col="black", lwd=2 , xlab=expression(sigma),
ylab= "Probability Density")
# using segments function to create lines on the plot
# segments(x0, y0, x1=?, x2=?, col=?, ...)
# x0, y0 -> coordinates of points from which to draw
# x1, y1 -> coordinates of points to which to draw (at least one must be supplied)
# first, adding the 1 and -1 standard deviations lines
segments(1, 0, 1, dnorm(1), lty=2, lwd=2)
segments(-1, 0, -1, dnorm(-1), lty=2, lwd=2)
# calculating are between ±1 standard deviations -> pnorm is the integral of dnorm!
pnorm(1) # -> probability of obtaining value at 1 standard deviation from the mean
## [1] 0.8413447
area <- pnorm(1) - pnorm(-1)
text(0, 0.2, paste0(round(area, 3)*100, "%"))
# second, adding the 1.96 and -1.96 standard deviations lines
segments(1.96, 0,1.96, dnorm(1.96), lty=2, lwd=2, col="red")
segments(-1.96, 0, -1.96, dnorm(-1.96), lty=2, lwd=2, col="red")
# calculating area between ±2 standard deviations
area <- pnorm(2)-pnorm(-2)
text(-3, 0.1, paste0(round(area, 3)*100, "%"), col="red")
text(3, 0.1, paste0(round(area, 3)*100, "%"), col="red")
To test for normality, sampling distributions may be compared to the simplest normal distribution possible: the standard normal model. Also known as the “z-curve”. For this, we need to subtract the mean from all values in these datasets and divide by their standard error, that is, convert them into z-scores. This is a process called standardization of data. Data is standardized specifically to compare it to the standard normal. Take care to do this for each sample size separately!
# subsetting data to the distributions that came from a sample size of n=5
sampN5 <- subset(sampDist, N==5)
# standardizing each column and creating a new column for resulting values
sampN5$stDanish <- (sampN5$Danish - mean(sampN5$Danish)) / sd(sampN5$Danish)
sampN5$stNaive <- (sampN5$Naive - mean(sampN5$Naive)) / sd(sampN5$Naive)
sampN5$stRoulette <- (sampN5$Roulette - mean(sampN5$Roulette)) / sd(sampN5$Roulette)
# summarizing each column again
summary(sampN5)
## X N Danish Naive Roulette
## Min. : 1 Min. :5 Min. : 2.20 Min. :13.47 Min. :-866
## 1st Qu.: 25001 1st Qu.:5 1st Qu.:25.60 1st Qu.:43.79 1st Qu.:-236
## Median : 50000 Median :5 Median :31.80 Median :50.03 Median :-110
## Mean : 50000 Mean :5 Mean :32.41 Mean :50.05 Mean :-105
## 3rd Qu.: 75000 3rd Qu.:5 3rd Qu.:38.60 3rd Qu.:56.28 3rd Qu.: 16
## Max. :100000 Max. :5 Max. :85.00 Max. :86.01 Max. : 700
## stDanish stNaive stRoulette
## Min. :-3.15786 Min. :-3.997764 Min. :-4.17714
## 1st Qu.:-0.71168 1st Qu.:-0.684361 1st Qu.:-0.71904
## Median :-0.06354 Median :-0.001705 Median :-0.02742
## Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.64731 3rd Qu.: 0.681083 3rd Qu.: 0.66420
## Max. : 5.49786 Max. : 3.930763 Max. : 4.41871
# plotting data and comparing graphs to the standard normal
# first, the graphical settings
par(mfrow=c(3,1),mar=c(4,4,1,1),cex.main=0.8,las=1)
# 1.sampling distributions of the Danish contest
hist(sampN5$stDanish, xlab="", main="Danish, n = 5", freq=FALSE, ylab="Probability Density",
ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col="black", lwd=2,add=TRUE)
# 2. sampling distributions of the naive guesses
hist(sampN5$stNaive, xlab="", main="Naive, n = 5", freq=FALSE, ylab = "Probability Density",
ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col ="black", lwd=2,add=TRUE)
# 3. sampling distributions of Roulette winnings
hist(sampN5$stRoulette, xlab="Mean", main="Roulette, n = 5", freq=FALSE, ylab = "Probability Density", ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col="black", lwd=2, add=TRUE)
The sampling distributions all show a
remarkable likeness to the standard
normal distribution! However,
how close are they really? We can find out by looking
at the values that comprise the outermost
5% of the data.
# 1.sampling distributions of the Danish contest
quantile(sampN5$stDanish,prob=c(0.025,0.975))
## 2.5% 97.5%
## -1.757054 2.131749
# 2. sampling distributions of the naive guesses
quantile(sampN5$stNaive,prob=c(0.025,0.975))
## 2.5% 97.5%
## -1.945551 1.957428
# 3. sampling distributions of Roulette winnings
quantile(sampN5$stRoulette,prob=c(0.025,0.975))
## 2.5% 97.5%
## -1.904671 1.948639
# we are close but not exactly there..
# ...is this because of the low sample size?
# ...or due to the shape of the population we sampled from?
# -> important questions to ask, as they influence choice of appropriate statistical test
Adapt the above code to find out whether you get closer to the 95% cutoffs of a standard normal (-1.96 and 1.96) with increasing sample size. Do this for each sample size in the dataset seperately. Tip: use a new R-script to do this systematically. In addition, the function tapply combined with a custom function can do this for all samples sizes in one go - can you figure out how to do it with tapply? This isn’t needed - but it is more efficient. Ask an instructor for help if you feel you are lost! Save your answer.
unique(sampDist$N) # to find what sample sizes exist in the dataset
## [1] 5 15 30 300 3000
# subsetting data to the distributions that came from a sample size of n=3,000
sampN3000 <- subset(sampDist, N==3000)
# standardizing each column and creating a new column for resulting values
sampN3000$stDanish <- (sampN3000$Danish - mean(sampN3000$Danish)) / sd(sampN3000$Danish)
sampN3000$stNaive <- (sampN3000$Naive - mean(sampN3000$Naive)) / sd(sampN3000$Naive)
sampN3000$stRoulette <- (sampN3000$Roulette - mean(sampN3000$Roulette)) / sd(sampN3000$Roulette)
# summarizing each column again
summary(sampN3000)
## X N Danish Naive
## Min. :400001 Min. :3000 Min. :30.85 Min. :47.70
## 1st Qu.:425001 1st Qu.:3000 1st Qu.:32.16 1st Qu.:49.64
## Median :450000 Median :3000 Median :32.40 Median :50.00
## Mean :450000 Mean :3000 Mean :32.40 Mean :50.00
## 3rd Qu.:475000 3rd Qu.:3000 3rd Qu.:32.65 3rd Qu.:50.35
## Max. :500000 Max. :3000 Max. :33.99 Max. :52.28
## Roulette stDanish stNaive stRoulette
## Min. :-155.48 Min. :-4.342786 Min. :-4.36614 Min. :-4.778993
## 1st Qu.:-112.34 1st Qu.:-0.677617 1st Qu.:-0.67677 1st Qu.:-0.675996
## Median :-105.26 Median : 0.001066 Median : 0.00136 Median :-0.002625
## Mean :-105.23 Mean : 0.000000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: -98.12 3rd Qu.: 0.675104 3rd Qu.: 0.67548 3rd Qu.: 0.676452
## Max. : -59.90 Max. : 4.421101 Max. : 4.34703 Max. : 4.311514
# plotting data and comparing graphs to the standard normal
# first, the graphical settings
par(mfrow=c(3,1),mar=c(4,4,1,1),cex.main=0.8,las=1)
# 1.sampling distributions of the Danish contest
hist(sampN3000$stDanish, xlab="", main="Danish, n = 3,000", freq=FALSE, ylab="Probability Density",
ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col="black", lwd=2,add=TRUE)
# 2. sampling distributions of the naive guesses
hist(sampN3000$stNaive, xlab="", main="Naive, n = 3,000", freq=FALSE, ylab = "Probability Density",
ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col ="black", lwd=2,add=TRUE)
# 3. sampling distributions of Roulette winnings
hist(sampN3000$stRoulette, xlab="Mean", main="Roulette, n = 3,000", freq=FALSE, ylab = "Probability
Density", ylim=c(0,0.5), xlim=c(-4,4))
# adding standard normal distribution
curve(dnorm(x), col="black", lwd=2, add=TRUE)
# how close are the three sampling distributions to the standard normal?
# 1.sampling distributions of the Danish contest
quantile(sampN3000$stDanish,prob=c(0.025,0.975))
## 2.5% 97.5%
## -1.955084 1.963720
# 2. sampling distributions of the naive guesses
quantile(sampN3000$stNaive,prob=c(0.025,0.975))
## 2.5% 97.5%
## -1.959750 1.961674
# 3. sampling distributions of Roulette winnings
quantile(sampN3000$stRoulette,prob=c(0.025,0.975))
## 2.5% 97.5%
## -1.959966 1.960422
This can be done systematically using the “tapply” function, which is perfect for this because it applies a function to a ragged array, i.e. different groups of data (defined by N) that you want to apply the same operation to. The typical set-up is as follows: tapply(x, index, function).
sampDist <- read.csv("samplingDistributions.csv")
# to use tapply, first necessary to make a custom function -> function(arguments){ instructions }
QF <- function(x){
# standardize
zscores <- (x - mean(x))/sd(x)
# quantiles
quantilesST <- quantile(zscores, prob=c(0.025,0.975))
}
# now using tapply -> tapply(dataset, index/grouping, function)
danishCI <- tapply(sampDist$Danish, sampDist$N, QF)
# using lapply to use tapply function above on each dataset systematically
# for this, need to create another function tapplyF to systematically use tapply on datasets
datasets <- sampDist[c("Danish", "Naive", "Roulette")]
tapplyF <- function(datasets){
tapply(datasets, sampDist$N, QF)
}
# now using lapply to apply tapplyF function (which in turn applies QF function) to each dataset
allCIs <- lapply(datasets, tapplyF)
allCIs
## $Danish
## $Danish$`5`
## 2.5% 97.5%
## -1.757054 2.131749
##
## $Danish$`15`
## 2.5% 97.5%
## -1.861561 2.049589
##
## $Danish$`30`
## 2.5% 97.5%
## -1.887132 2.027894
##
## $Danish$`300`
## 2.5% 97.5%
## -1.939360 1.976974
##
## $Danish$`3000`
## 2.5% 97.5%
## -1.955084 1.963720
##
##
## $Naive
## $Naive$`5`
## 2.5% 97.5%
## -1.945551 1.957428
##
## $Naive$`15`
## 2.5% 97.5%
## -1.944215 1.963825
##
## $Naive$`30`
## 2.5% 97.5%
## -1.956644 1.956305
##
## $Naive$`300`
## 2.5% 97.5%
## -1.95664 1.96075
##
## $Naive$`3000`
## 2.5% 97.5%
## -1.959750 1.961674
##
##
## $Roulette
## $Roulette$`5`
## 2.5% 97.5%
## -1.904671 1.948639
##
## $Roulette$`15`
## 2.5% 97.5%
## -1.922995 2.018682
##
## $Roulette$`30`
## 2.5% 97.5%
## -1.922802 1.957193
##
## $Roulette$`300`
## 2.5% 97.5%
## -1.944370 1.975033
##
## $Roulette$`3000`
## 2.5% 97.5%
## -1.959966 1.960422
AThe above shows the Central Limit Theorem in action! All distributions eventually come close to the situation where 95% of all points are contained between ±1.96. This emonstrates that larger samples make the empricial sampling distribution align more closely with the standard normal distribution, just as the CLT predicts. As the sample size increases, the estimated confidence interval thresholds approach the theoretical cutoffs of ±1.96. However, it seems that low sample size is more problematic for the Danish distribution, underlining that not all distributions converge to the standard normal st the same “speed” (with sample size).