To illustrate Central Limit Theorem, let’s use salary data for Major League Baseball players. Data is from 2011.
library(readxl)
Baseball_Salaries <- read_excel("G:/My Drive/MBA Statistics/Fall 2019/Data Files/Baseball Salaries.xlsx", sheet = "2011 Salaries")
View(Baseball_Salaries)
First, let’s examine the salaries to see if they approximate normal distribution
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
describe(Baseball_Salaries$Salary)
## vars n mean sd median trimmed mad min max range
## X1 1 843 3305055 4534742 1175000 2250999 1126776 414000 3.2e+07 31586000
## skew kurtosis se
## X1 2.25 5.66 156184.8
With mean of $3,305,055 and median of $1,175,000 we suspect data has a significant positive skew (as also indicated by skew of 2.25). Let’s look at the histogram (mean shown as a vertical red line).
hist(Baseball_Salaries$Salary)
abline(v=mean(Baseball_Salaries$Salary), col="red")
OK, so data is definitely not normally distributed as it seems that most players are actaully at some sort of league minimum, and then we have an outlier at approxmiately $32M (A-Rod?).
Out of sheer curiosity, let’s look at number o players earning more than $5M, $10M and $20M. We already know from the median that close to 50% of players make less than a $1M. Also, let’s look at how many players are at the minimum, which appears to be $414,000 and also how many earn less than $500,000.
sum(Baseball_Salaries$Salary == 414000)
## [1] 57
sum(Baseball_Salaries$Salary<500000)
## [1] 305
sum(Baseball_Salaries$Salary > 5000000)
## [1] 185
sum(Baseball_Salaries$Salary > 10000000)
## [1] 77
sum(Baseball_Salaries$Salary > 20000000)
## [1] 7
OK, I guess we can confidently say that the data is not normally distributed. Let’s create some samples of means. To illustrate CLT, I will generate variables of sample means in hopes of showing both the effects of sample size and number of samples. First one is pretty straight forward, larger sample size will generally be more likely to follow the CLT and should have lower standard deviation. The repeat sampling is my attempt at demonstrating the asymptotic nature of statistics, meaning that the more often we do something, closer we will get to the underlying statistical outcome (in the spirit of “if we flip a coin twice, we can often observe p(Head)=1 vs. if we do it 1M times, we should observe P(Head) much closer to 0.5).
Structure of the command below: replicate tells us how many times to repeat the command and save output, mean tells us what we are calculating, and sample tells us what kind of sample to obtain based on data, sample size, and with or without replacement. In this case, we are looking at Salary data, sample of 10 or 100, replace=FALSE meaning that for each sample we are sampling without replacement. Structure of the variable names is SampleofN_K means sample of N taken K times.
Sampleof10_10<-replicate(10, mean(sample(Baseball_Salaries$Salary, 10, replace = FALSE)))
Sampleof10_100<-replicate(100, mean(sample(Baseball_Salaries$Salary, 10, replace = FALSE)))
Sampleof10_1000<-replicate(1000, mean(sample(Baseball_Salaries$Salary, 10, replace = FALSE)))
Sampleof100_10<-replicate(10, mean(sample(Baseball_Salaries$Salary, 100, replace = FALSE)))
Sampleof100_100<-replicate(100, mean(sample(Baseball_Salaries$Salary, 100, replace = FALSE)))
Sampleof100_1000<-replicate(1000, mean(sample(Baseball_Salaries$Salary, 100, replace = FALSE)))
Now, let’s look at our variables
AllSamples<-data.frame(Sampleof10_10, Sampleof10_100, Sampleof10_1000, Sampleof100_10, Sampleof100_100, Sampleof100_1000)
describe(AllSamples)
## vars n mean sd median trimmed mad min
## Sampleof10_10 1 1000 2819446 1337956.7 3045397 2745814 1585547.7 890724
## Sampleof10_100 2 1000 3057507 1316872.6 2897900 2964895 1102090.7 702470
## Sampleof10_1000 3 1000 3342836 1416785.8 3106445 3227978 1292767.9 700300
## Sampleof100_10 4 1000 3607126 467371.6 3672875 3626869 359959.3 2695059
## Sampleof100_100 5 1000 3253760 364661.2 3251897 3251998 416881.9 2393571
## Sampleof100_1000 6 1000 3311910 417605.0 3291990 3301734 398270.2 2253150
## max range skew kurtosis se
## Sampleof10_10 5337232 4446508 0.21 -0.89 42309.91
## Sampleof10_100 8106971 7404501 0.82 1.21 41643.17
## Sampleof10_1000 10826787 10126487 0.97 1.60 44802.70
## Sampleof100_10 4361247 1666189 -0.26 -0.41 14779.59
## Sampleof100_100 4107633 1714062 0.07 -0.52 11531.60
## Sampleof100_1000 4741268 2488118 0.29 0.19 13205.83
Static output - necessary as output will change every time you run the script as samples are very unlikely to be the same.
range skew kurtosis se mean
Sampleof10_10 6291045 0.74 -0.49 58862.52 2924946
Sampleof10_100 8021940 0.63 0.39 48460.86 3523447
Sampleof10_1000 8314404 0.81 0.73 46215.72 3200685
Sampleof100_10 1373287 -0.47 -0.74 13509.60 3478721
Sampleof100_100 1677038 0.10 -0.58 11923.23 3321206
Sampleof100_1000 3169670 0.17 0.18 13739.59 3300891
Let’s concentrate on the skew for a second, just notice all skews are well under 1 (as opposed to 2.25 for underlying data). As you can see, the skew for samples of 100 is always lower than the skew for samples of 10 (e.g. skew for sample of 100 done 1000 times is 0.17 vs. 0.81 for a sample of 10 done 1000 times).
Now, let’s take a quick look at two histograms to compare effect of sample size: sample of 10 taken 10 times and sample of 100 taken 10 times
hist(Sampleof10_100,)
abline(v=c((mean(Baseball_Salaries$Salary)), (mean(Sampleof10_100))), col=c("red", "blue"))
Static Output - blue line - sample mean, red line - population mean
hist(Sampleof100_100,)
abline(v=c((mean(Baseball_Salaries$Salary)), (mean(Sampleof100_100))), col=c("red", "blue"))
Let’s quickly take a look at the effect of more sampling
hist(Sampleof100_1000,)
abline(v=c((mean(Baseball_Salaries$Salary)), (mean(Sampleof100_1000))), col=c("red", "blue"))
Static output from earlier run:
Now, let’s set up a 95% confidence interval for the mean of the baseball players salaries, based on the mean of a sample of 100 players and standard deviation of sd(Salaries)/sqrt(n).
meanCL<-mean(sample(Baseball_Salaries$Salary, 100, replace=FALSE))
SDCL<-sd(Baseball_Salaries$Salary)/sqrt(100)
Zcrit<-qnorm(.975)
upperCL<-meanCL+Zcrit*SDCL
lowerCL<-meanCL-Zcrit*SDCL
upperCL
## [1] 4404826
meanCL
## [1] 3516033
lowerCL
## [1] 2627240
upperCL - lowerCL
## [1] 1777586
Let’s repeat the exercise for a sample of 10
meanCL10<-mean(sample(Baseball_Salaries$Salary, 10, replace=FALSE))
SDCL10<-sd(Baseball_Salaries$Salary)/sqrt(10)
Zcrit<-qnorm(.975)
upperCL10<-meanCL10+Zcrit*SDCL10
lowerCL10<-meanCL10-Zcrit*SDCL10
upperCL10
## [1] 6112227
meanCL10
## [1] 3301616
lowerCL10
## [1] 491005.3
upperCL10-lowerCL10
## [1] 5621221