To demonstrate the central limit theorem, I found a cool package, Lahman, which covers Major League Baseball from 1871 to 2015 and includes players’ salaries (the salary records start in 1985). As you might expect, the salary distribution is highly right skewed. However, if we repeatedly draw a sufficiently large subsample of players and calculate the mean of each one, the distribution of those means will be approximately normal! Magic~
Let’s first load the package and subset the salaries from just one year. Note that because the salary data is inside the Lahman package, we have to use data() again to fetch it.
install.packages("Lahman")  # only needed the first time
library(Lahman)
data("Salaries")            # load the Salaries data frame from the package
Now let’s subset the salaries from one single year (say 2015). Do you remember how to do it?
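In case you don't remember, here is one way to do it. This is only a sketch: the name money15 is taken from the code that follows, and I am assuming it is simply the rows of Salaries whose yearID equals 2015.

```r
library(Lahman)
data("Salaries")

# Keep only the rows from 2015 (assumption: this is how money15 was built)
money15 <- subset(Salaries, yearID == 2015)

# Equivalent base R indexing:
# money15 <- Salaries[Salaries$yearID == 2015, ]
```

Swapping 2015 for another year gives you that year's subset.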
head(money15)
##       yearID teamID lgID  playerID  salary
## 24759   2015    ARI   NL ahmedni01  508500
## 24760   2015    ARI   NL anderch01  512500
## 24761   2015    ARI   NL chafian01  507500
## 24762   2015    ARI   NL collmjo01 1400000
## 24763   2015    ARI   NL corbipa01  524000
## 24764   2015    ARI   NL delarru01  516000
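# Draw `times` subsamples of `size` values (with replacement) from
# column `varb` of data frame `dat`, and return the mean of each subsample.
meansVector <- function(times, size, dat, varb) {
  a <- as.numeric(times)
  b <- as.numeric(size)
  v <- numeric(a)             # one slot per subsample mean
  for (i in seq_len(a)) {     # seq_len() also behaves when a is 0
    y <- sample(dat[, varb], b, replace = TRUE)
    v[i] <- mean(y)
  }
  v
}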
The small function above, meansVector, returns the mean of each subsample as a vector. It takes four inputs: (1) times, how many subsamples you want to draw; (2) size, how many values you want to take in each subsample (how many players’ salaries to draw); (3) dat, the data frame you want to subsample from (in the following case it’s the 2015 subset, but you can pass the subset of another year); and (4) varb, the name of the variable you want to subsample (in the following case, it’s “salary”). The values are drawn with replacement.
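To make the four inputs concrete, here is a small call you can run on its own. It is just a sketch: the data frame toy is a made-up stand-in with a right-skewed "salary" column (not real Lahman data), and the function body is a compact copy of meansVector so the block does not depend on anything else.

```r
# Compact copy of meansVector so this block runs by itself
meansVector <- function(times, size, dat, varb) {
  v <- numeric(times)
  for (i in seq_len(times)) {
    v[i] <- mean(sample(dat[, varb], size, replace = TRUE))
  }
  v
}

# Hypothetical toy data: 500 right-skewed values standing in for salaries
set.seed(1)
toy <- data.frame(salary = rexp(500, rate = 1 / 4e6))

# 10 subsamples, 100 values in each, drawn from the "salary" column
m <- meansVector(times = 10, size = 100, dat = toy, varb = "salary")
length(m)   # one mean per subsample
```

With the real data you would pass money15 instead of toy.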
Let’s take a look at the salary distribution in year 2015.
hist(money15$salary, main = "distribution of salary")
avg <- mean(money15$salary)
SD <- sd(money15$salary)
abline(v = avg, col = "blue")  # vertical line at the mean
# round() keeps the legend readable
legend("topright", legend = c(paste0("mean=", round(avg)), paste0("SD=", round(SD))), text.col = c("blue", "dark green"))
Obviously, it’s highly right skewed.
Now let’s plot the histogram of 10 subsamples with 100 values in each subsample.
Let’s gradually increase the number of subsamples but keep the number of values in each subsample fixed for now.
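The plotting code is not shown here, so this is only a sketch of how such a series could be drawn. I am assuming that money15 is the 2015 subset of Salaries and that the number of subsamples doubles from 10 up to 1280; the doubling is a guess, and any increasing sequence works. The function is a compact copy of meansVector so the block runs by itself.

```r
library(Lahman)
data("Salaries")
money15 <- subset(Salaries, yearID == 2015)  # assumption: how money15 was built

# Compact copy of meansVector so this block runs by itself
meansVector <- function(times, size, dat, varb) {
  v <- numeric(times)
  for (i in seq_len(times)) {
    v[i] <- mean(sample(dat[, varb], size, replace = TRUE))
  }
  v
}

set.seed(2015)  # assumption: fixed seed so the plots are reproducible
counts <- c(10, 20, 40, 80, 160, 320, 640, 1280)  # assumed doubling sequence

par(mfrow = c(2, 4))  # 2 x 4 grid of plots
for (n in counts) {
  hist(meansVector(n, 100, money15, "salary"),
       main = paste(n, "subsamples"), xlab = "mean salary")
}
par(mfrow = c(1, 1))  # reset the plotting grid
```

If the plotting window is too small for eight panels, shorten counts and use a smaller grid.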
How about we fix the number of subsamples but increase the number of values taken in each subsample?
We start from 1280 subsamples with 10 values in each subsample.
Let’s gradually increase the number of values taken in each subsample.
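Again, only a sketch under assumptions: money15 is taken to be the 2015 subset of Salaries, and the sizes 10, 20, 40, 80 are illustrative choices of mine. All four histograms share the same x-axis so you can see the distribution of means getting narrower as each subsample gets larger.

```r
library(Lahman)
data("Salaries")
money15 <- subset(Salaries, yearID == 2015)  # assumption: how money15 was built

# Compact copy of meansVector so this block runs by itself
meansVector <- function(times, size, dat, varb) {
  v <- numeric(times)
  for (i in seq_len(times)) {
    v[i] <- mean(sample(dat[, varb], size, replace = TRUE))
  }
  v
}

set.seed(2015)  # assumption: fixed seed so the plots are reproducible
sizes <- c(10, 20, 40, 80)  # illustrative subsample sizes

# 1280 subsample means for each size
means_by_size <- lapply(sizes, function(s) meansVector(1280, s, money15, "salary"))
xr <- range(unlist(means_by_size))  # common x-axis for all panels

par(mfrow = c(2, 2))
for (k in seq_along(sizes)) {
  hist(means_by_size[[k]], xlim = xr,
       main = paste(sizes[k], "values per subsample"), xlab = "mean salary")
}
par(mfrow = c(1, 1))  # reset the plotting grid
```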
Try to create the same series of histograms with players’ salaries in 2014.
Now demonstrate the central limit theorem with the “mpg (miles/gallon)” variable in the mtcars (Motor Trend Car Road Tests) data set. mtcars is another built-in data set that ships with R (in the datasets package). You can use the meansVector function I wrote for you to generate a vector of means.
Make sure that you are able to
(1) read in the data with data(),
(2) plot the histogram of the “mpg” variable,
(3) create a series of histograms with gradually increasing numbers of subsamples but a fixed number of values taken in each subsample,
(4) create a series of histograms with a fixed number of subsamples but gradually increasing numbers of values taken in each subsample.
data("mtcars")
hist(mtcars$mpg)  # quick look with default settings
hist(mtcars$mpg, main = "distribution of mpg", breaks=10)  # finer bins
avg <- mean(mtcars$mpg)
SD <- sd(mtcars$mpg)
abline(v = avg, col = "blue")  # vertical line at the mean
legend("topright", legend = c(paste0("mean=", round(avg, 2)), paste0("SD=", round(SD, 2))), text.col = c("blue", "dark green"))
# It's a fairly flat distribution (loosely uniform), with no sharp peak
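For steps (3) and (4), here is one possible sketch. The subsample counts and sizes below are illustrative choices of mine, not prescribed values, and the function is a compact copy of meansVector so the block runs by itself.

```r
# Compact copy of meansVector so this block runs by itself
meansVector <- function(times, size, dat, varb) {
  v <- numeric(times)
  for (i in seq_len(times)) {
    v[i] <- mean(sample(dat[, varb], size, replace = TRUE))
  }
  v
}

data("mtcars")
set.seed(42)  # assumption: fixed seed so the plots are reproducible

# (3) more and more subsamples, 10 values in each
n_subsamples <- c(10, 40, 160, 640)  # illustrative counts
par(mfrow = c(2, 2))
for (n in n_subsamples) {
  hist(meansVector(n, 10, mtcars, "mpg"),
       main = paste(n, "subsamples of 10 values"), xlab = "mean mpg")
}

# (4) 640 subsamples, more and more values in each
sizes <- c(2, 5, 10, 30)  # illustrative sizes
for (s in sizes) {
  hist(meansVector(640, s, mtcars, "mpg"),
       main = paste("640 subsamples of", s, "values"), xlab = "mean mpg")
}
par(mfrow = c(1, 1))  # reset the plotting grid
```

The first set of plots should look smoother as the number of subsamples grows, and the second should get narrower and more bell-shaped as each subsample gets larger.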