Central Limit Theorem Example

To illustrate Central Limit Theorem, let’s use salary data for Major League Baseball players. Data is from 2011.

library(readxl)

Baseball_Salaries <- read_excel("G:/My Drive/MBA Statistics/Fall 2019/Data Files/Baseball Salaries.xlsx", sheet = "2011 Salaries") 

View(Baseball_Salaries)

First, let’s examine the salaries to see if they approximate normal distribution

library(psych)
## Warning: package 'psych' was built under R version 4.2.2
describe(Baseball_Salaries$Salary)
##    vars   n    mean      sd  median trimmed     mad    min     max    range
## X1    1 843 3305055 4534742 1175000 2250999 1126776 414000 3.2e+07 31586000
##    skew kurtosis       se
## X1 2.25     5.66 156184.8

With mean of $3,305,055 and median of $1,175,000 we suspect data has a significant positive skew (as also indicated by skew of 2.25). Let’s look at the histogram (mean shown as a vertical red line).

hist(Baseball_Salaries$Salary)
abline(v=mean(Baseball_Salaries$Salary), col="red")

OK, so data is definitely not normally distributed as it seems that most players are actaully at some sort of league minimum, and then we have an outlier at approxmiately $32M (A-Rod?).

Out of sheer curiosity, let’s look at number o players earning more than $5M, $10M and $20M. We already know from the median that close to 50% of players make less than a $1M. Also, let’s look at how many players are at the minimum, which appears to be $414,000 and also how many earn less than $500,000.

sum(Baseball_Salaries$Salary == 414000)
## [1] 57
sum(Baseball_Salaries$Salary<500000)
## [1] 305
sum(Baseball_Salaries$Salary > 5000000)
## [1] 185
sum(Baseball_Salaries$Salary > 10000000)
## [1] 77
sum(Baseball_Salaries$Salary > 20000000)
## [1] 7

OK, I guess we can confidently say that the data is not normally distributed. Let’s create some samples of means. To illustrate CLT, I will generate variables of sample means in hopes of showing both the effects of sample size and number of samples. First one is pretty straight forward, larger sample size will generally be more likely to follow the CLT and should have lower standard deviation. The repeat sampling is my attempt at demonstrating the asymptotic nature of statistics, meaning that the more often we do something, closer we will get to the underlying statistical outcome (in the spirit of “if we flip a coin twice, we can often observe p(Head)=1 vs. if we do it 1M times, we should observe P(Head) much closer to 0.5).

Structure of the command below: replicate tells us how many times to repeat the command and save output, mean tells us what we are calculating, and sample tells us what kind of sample to obtain based on data, sample size, and with or without replacement. In this case, we are looking at Salary data, sample of 10 or 100, replace=FALSE meaning that for each sample we are sampling without replacement. Structure of the variable names is SampleofN_K means sample of N taken K times.

Sampleof10_10<-replicate(10, mean(sample(Baseball_Salaries$Salary, 10, replace = FALSE)))

Sampleof10_100<-replicate(100, mean(sample(Baseball_Salaries$Salary, 10, replace = FALSE)))

Sampleof10_1000<-replicate(1000, mean(sample(Baseball_Salaries$Salary, 10, replace = FALSE)))

Sampleof100_10<-replicate(10, mean(sample(Baseball_Salaries$Salary, 100, replace = FALSE)))

Sampleof100_100<-replicate(100, mean(sample(Baseball_Salaries$Salary, 100, replace = FALSE)))

Sampleof100_1000<-replicate(1000, mean(sample(Baseball_Salaries$Salary, 100, replace = FALSE)))

Now, let’s look at our variables

AllSamples<-data.frame(Sampleof10_10, Sampleof10_100, Sampleof10_1000, Sampleof100_10, Sampleof100_100, Sampleof100_1000)

describe(AllSamples)
##                  vars    n    mean        sd  median trimmed       mad     min
## Sampleof10_10       1 1000 2819446 1337956.7 3045397 2745814 1585547.7  890724
## Sampleof10_100      2 1000 3057507 1316872.6 2897900 2964895 1102090.7  702470
## Sampleof10_1000     3 1000 3342836 1416785.8 3106445 3227978 1292767.9  700300
## Sampleof100_10      4 1000 3607126  467371.6 3672875 3626869  359959.3 2695059
## Sampleof100_100     5 1000 3253760  364661.2 3251897 3251998  416881.9 2393571
## Sampleof100_1000    6 1000 3311910  417605.0 3291990 3301734  398270.2 2253150
##                       max    range  skew kurtosis       se
## Sampleof10_10     5337232  4446508  0.21    -0.89 42309.91
## Sampleof10_100    8106971  7404501  0.82     1.21 41643.17
## Sampleof10_1000  10826787 10126487  0.97     1.60 44802.70
## Sampleof100_10    4361247  1666189 -0.26    -0.41 14779.59
## Sampleof100_100   4107633  1714062  0.07    -0.52 11531.60
## Sampleof100_1000  4741268  2488118  0.29     0.19 13205.83

Static output - necessary as output will change every time you run the script as samples are very unlikely to be the same.

                   range  skew kurtosis       se    mean
Sampleof10_10    6291045  0.74    -0.49 58862.52   2924946
Sampleof10_100   8021940  0.63     0.39 48460.86   3523447
Sampleof10_1000  8314404  0.81     0.73 46215.72   3200685
Sampleof100_10   1373287 -0.47    -0.74 13509.60   3478721
Sampleof100_100  1677038  0.10    -0.58 11923.23   3321206
Sampleof100_1000 3169670  0.17     0.18 13739.59   3300891

Let’s concentrate on the skew for a second, just notice all skews are well under 1 (as opposed to 2.25 for underlying data). As you can see, the skew for samples of 100 is always lower than the skew for samples of 10 (e.g. skew for sample of 100 done 1000 times is 0.17 vs. 0.81 for a sample of 10 done 1000 times).

Now, let’s take a quick look at two histograms to compare effect of sample size: sample of 10 taken 10 times and sample of 100 taken 10 times

hist(Sampleof10_100,)
abline(v=c((mean(Baseball_Salaries$Salary)), (mean(Sampleof10_100))), col=c("red", "blue"))

Static Output - blue line - sample mean, red line - population mean

hist(Sampleof100_100,)
abline(v=c((mean(Baseball_Salaries$Salary)), (mean(Sampleof100_100))), col=c("red", "blue"))

Let’s quickly take a look at the effect of more sampling

hist(Sampleof100_1000,)
abline(v=c((mean(Baseball_Salaries$Salary)), (mean(Sampleof100_1000))), col=c("red", "blue"))

Static output from earlier run:

Now, let’s set up a 95% confidence interval for the mean of the baseball players salaries, based on the mean of a sample of 100 players and standard deviation of sd(Salaries)/sqrt(n).

meanCL<-mean(sample(Baseball_Salaries$Salary, 100, replace=FALSE))
SDCL<-sd(Baseball_Salaries$Salary)/sqrt(100)

Zcrit<-qnorm(.975)

upperCL<-meanCL+Zcrit*SDCL
lowerCL<-meanCL-Zcrit*SDCL

upperCL
## [1] 4404826
meanCL
## [1] 3516033
lowerCL
## [1] 2627240
upperCL - lowerCL
## [1] 1777586

Let’s repeat the exercise for a sample of 10

meanCL10<-mean(sample(Baseball_Salaries$Salary, 10, replace=FALSE))
SDCL10<-sd(Baseball_Salaries$Salary)/sqrt(10)

Zcrit<-qnorm(.975)

upperCL10<-meanCL10+Zcrit*SDCL10
lowerCL10<-meanCL10-Zcrit*SDCL10

upperCL10
## [1] 6112227
meanCL10
## [1] 3301616
lowerCL10
## [1] 491005.3
upperCL10-lowerCL10
## [1] 5621221