Central limit theorem

To demonstrate the central limit theorem, I found a cool package, Lahman, which contains the salaries of Major League Baseball players. (The Lahman database itself covers 1871 to 2015, but its Salaries table only starts in 1985.) As you might expect, the salary distribution is highly right-skewed. However, if we repeatedly draw a sufficiently large subsample of players and calculate the mean of each one, the distribution of those means will be approximately normal! Magic~

Let’s first load the package. Note that because the salary data lives inside the Lahman package, we have to use data() again to fetch it.

install.packages("Lahman")
library(Lahman)
data("Salaries")

Now let’s subset the salaries from one single year (say 2015). Do you remember how to do it?

money15 <- Salaries[Salaries$yearID == 2015, ]  # keep only the 2015 rows
head(money15)
##       yearID teamID lgID  playerID  salary
## 24759   2015    ARI   NL ahmedni01  508500
## 24760   2015    ARI   NL anderch01  512500
## 24761   2015    ARI   NL chafian01  507500
## 24762   2015    ARI   NL collmjo01 1400000
## 24763   2015    ARI   NL corbipa01  524000
## 24764   2015    ARI   NL delarru01  516000
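
# Return the means of `times` subsamples, each made of `size` values drawn
# with replacement from column `varb` of the data frame `dat`.
meansVector <- function(times, size, dat, varb){
  a <- as.numeric(times)
  b <- as.numeric(size)
  v <- numeric(a)            # preallocate the result vector
  for(i in seq_len(a)){      # seq_len() also behaves when times is 0
    y <- sample(dat[, varb], b, replace = TRUE)
    v[i] <- mean(y)
  }
  v
}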

Above I wrote a small function, meansVector, which returns the mean of each subsample as a vector. Its inputs are (1) how many subsamples you want to draw, (2) how many values you want in each subsample (how many players’ salaries to draw; values are sampled with replacement), (3) the data frame you want to sample from (in the following case it’s money15, the 2015 salaries, but you can subset another year), and (4) which variable you want to subsample (in the following case, it’s “salary”).

Let’s take a look at the salary distribution in 2015.

hist(money15$salary, main = "distribution of salary")
avg <- mean(money15$salary)
SD <- sd(money15$salary)
abline(v = avg, col = "blue")  # vertical line at the mean
legend("topright", legend = c(paste0("mean=", round(avg)), paste0("SD=", round(SD))), text.col = c("blue", "dark green"))

Obviously, it’s highly right-skewed.

Now let’s plot the histogram of the means of 10 subsamples, with 100 values in each subsample.
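A minimal sketch of this step. It rebuilds money15 so the block runs on its own, and it uses replicate() as a compact stand-in for the meansVector loop; the fixed seed and the plot labels are my own assumptions, added only so the picture is reproducible.

```r
library(Lahman)
data("Salaries")
money15 <- Salaries[Salaries$yearID == 2015, ]

set.seed(1)  # assumption: fixed seed for reproducibility
# Draw 10 subsamples of 100 salaries (with replacement) and keep each mean
means <- replicate(10, mean(sample(money15$salary, 100, replace = TRUE)))
hist(means, main = "10 subsamples, 100 values each", xlab = "mean salary")
```

With only 10 means the histogram is still very rough, which is why the next steps draw many more subsamples.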

Let’s gradually increase the number of subsamples, but keep the number of values in each subsample fixed for now.
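One way to sketch this series is shown here. The doubling schedule (10, 20, 40, ..., 1280 subsamples) and the 2 x 4 panel layout are assumptions of mine, not something the lesson prescribes; each subsample keeps 100 values, and money15 is rebuilt so the block runs on its own.

```r
library(Lahman)
data("Salaries")
money15 <- Salaries[Salaries$yearID == 2015, ]

set.seed(1)
n_subsamples <- 10 * 2^(0:7)       # 10, 20, 40, ..., 1280 (assumed schedule)
old_par <- par(mfrow = c(2, 4))    # 2 x 4 grid, one panel per setting
means_by_count <- lapply(n_subsamples, function(k) {
  # k subsamples of 100 salaries each, drawn with replacement
  m <- replicate(k, mean(sample(money15$salary, 100, replace = TRUE)))
  hist(m, main = paste(k, "subsamples"), xlab = "mean salary")
  m
})
par(old_par)                       # restore the previous plot layout
```

As the number of subsamples grows, the histogram fills in and its bell shape becomes easier to see, while its center and width stay roughly the same.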

How about we fix the number of subsamples but increase the number of values taken in each subsample?
We start with 1280 subsamples with 10 values in each subsample.

Let’s gradually increase the number of values taken in each subsample.
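A sketch of this second series, starting from 1280 subsamples of 10 values. The particular sizes (10, 40, 160, 640) are an assumption for illustration, and money15 is rebuilt so the block runs on its own. The last lines compare the spread of the means with the usual standard error of the mean, sd / sqrt(n), which is the amount of narrowing we would expect as the subsample size n grows.

```r
library(Lahman)
data("Salaries")
money15 <- Salaries[Salaries$yearID == 2015, ]

set.seed(1)
sizes <- c(10, 40, 160, 640)       # assumed schedule of subsample sizes
old_par <- par(mfrow = c(2, 2))
sd_of_means <- sapply(sizes, function(n) {
  # 1280 subsamples of n salaries each, drawn with replacement
  m <- replicate(1280, mean(sample(money15$salary, n, replace = TRUE)))
  hist(m, main = paste(n, "values per subsample"), xlab = "mean salary")
  sd(m)                            # spread of the subsample means
})
par(old_par)

# Observed spread of the means next to the expected sd / sqrt(n)
data.frame(size = sizes,
           observed = sd_of_means,
           expected = sd(money15$salary) / sqrt(sizes))
```

With small subsamples the histogram of means is still visibly right-skewed; with larger ones it becomes narrower and more symmetric.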


Exercise 1

Try to create the same series of histograms with players’ salaries in 2014.

Exercise 2

Now demonstrate the central limit theorem with the “mpg” (miles per gallon) variable in the mtcars (Motor Trend Car Road Tests) data set. mtcars is another built-in data set; it ships with R in the datasets package. You can use the meansVector function I wrote for you to generate a vector of means.

Make sure that you are able to
(1) read in the data (data()),
(2) plot the histogram of the “mpg” variable,
(3) create a series of histograms with a gradually increasing number of subsamples but a fixed number of values in each subsample,
(4) create a series of histograms with a fixed number of subsamples but a gradually increasing number of values in each subsample.

data("mtcars")
hist(mtcars$mpg)
hist(mtcars$mpg, main = "distribution of mpg", breaks = 10)
avg <- mean(mtcars$mpg)
SD <- sd(mtcars$mpg)
abline(v = avg, col = "blue")  # vertical line at the mean
legend("topright", legend = c(paste0("mean=", round(avg, 2)), paste0("SD=", round(SD, 2))), text.col = c("blue", "dark green"))
# It's a fairly flat distribution (closer to uniform than to a bell shape)
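
Parts (3) and (4) could be sketched as follows. The helper mpg_means is a hypothetical name introduced only for this example (it does the same job as meansVector, using replicate()), and the particular counts and sizes are assumptions.

```r
data("mtcars")
set.seed(1)

# Hypothetical helper: means of `times` subsamples of `size` mpg values,
# drawn with replacement
mpg_means <- function(times, size) {
  replicate(times, mean(sample(mtcars$mpg, size, replace = TRUE)))
}

old_par <- par(mfrow = c(2, 4))
# (3) more and more subsamples, 10 values in each
for (k in c(10, 40, 160, 640)) {
  hist(mpg_means(k, 10), main = paste(k, "subsamples"), xlab = "mean mpg")
}
# (4) 640 subsamples, more and more values in each
for (n in c(2, 5, 10, 30)) {
  hist(mpg_means(640, n), main = paste(n, "values each"), xlab = "mean mpg")
}
par(old_par)  # restore the previous plot layout
```

The top row should fill in toward a bell shape as the number of subsamples grows, and the bottom row should get narrower around the mean mpg as each subsample gets larger.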