Central limit theorem

To demonstrate the central limit theorem, I found a cool package, Lahman, containing the salaries of each Major League Baseball player from 1871 to 2015. As you might expect, the salary distribution is highly right-skewed. However, if we repeatedly draw sufficiently large subsamples of players and calculate each subsample’s mean, the distribution of those means will follow a normal distribution! Magic~

Let’s first load the package and subset the salaries from just one year. Note that, because the salary data is inside the Lahman package, we have to use data() again to fetch it.

install.packages("Lahman")  # only needs to be run once
library(Lahman)             # attach the package
data("Salaries")            # fetch the Salaries data frame

Now let’s subset the salaries from one single year (say 2015). Do you remember how to do it?
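If not, here is one way to do it with subset() (I name the result money15 to match the output below):

money15 = subset(Salaries, yearID == 2015)  # keep only the rows from 2015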

head(money15)
##       yearID teamID lgID  playerID  salary
## 24759   2015    ARI   NL ahmedni01  508500
## 24760   2015    ARI   NL anderch01  512500
## 24761   2015    ARI   NL chafian01  507500
## 24762   2015    ARI   NL collmjo01 1400000
## 24763   2015    ARI   NL corbipa01  524000
## 24764   2015    ARI   NL delarru01  516000
meansVector = function(times, size, dat, varb){
  a <- as.numeric(times)  # number of subsamples to draw
  b <- as.numeric(size)   # number of values in each subsample
  v <- c()                # collects the subsample means
  for(i in 1:a){
    y <- sample(dat[, varb], b, replace = TRUE)  # draw one subsample with replacement
    m <- mean(y)                                 # mean of this subsample
    v <- c(v, m)                                 # append it to the result
  }
  v  # return the vector of means
}

Here I write a small function, which returns the mean of each subsample as a vector. The inputs of this function are (1) how many times you want to subsample, (2) how many values to take in each subsample (how many players’ salaries you want to draw), (3) which data set to subsample from (in the following case, it’s year 2015, but you can subset another year), and (4) which variable to subsample (in the following case, it’s "salary").

Let’s take a look at the salary distribution in 2015.

hist(money15$salary, main = "distribution of salary")
avg = mean(money15$salary)
SD = sd(money15$salary)
abline(v = avg, col = "blue")  # mark the sample mean
legend("topright", legend = c(paste0("mean=", avg), paste0("SD=", SD)), text.col = c("blue", "dark green"))

Obviously, it’s highly right-skewed.

Now let’s plot the histogram of the means of 10 subsamples, with 100 values in each subsample.
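A minimal call using the meansVector function defined above might look like this (the plot title and axis label are just one sensible choice):

hist(meansVector(10, 100, money15, "salary"),
     main = "10 subsamples, size 100", xlab = "subsample mean")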

Let’s gradually increase the number of subsamples but fix the number of values in each subsample for now.
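As a sketch, we can keep the subsample size at 100 and increase the number of subsamples (the counts here are my own choice, ending at the 1280 used below):

par(mfrow = c(2, 2))  # four histograms on one page
for (n in c(10, 80, 320, 1280)) {
  hist(meansVector(n, 100, money15, "salary"),
       main = paste(n, "subsamples, size 100"), xlab = "subsample mean")
}
par(mfrow = c(1, 1))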

How about we fix the number of subsamples but increase the number of values taken in each subsample?
We start from 1280 subsamples with 10 values in each subsample.

Let’s gradually increase the number of values taken in each subsample.
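Again as a sketch, keeping 1280 subsamples and growing the subsample size (the sizes beyond the starting value of 10 are my own choice):

par(mfrow = c(2, 2))
for (s in c(10, 40, 160, 640)) {
  hist(meansVector(1280, s, money15, "salary"),
       main = paste("1280 subsamples, size", s), xlab = "subsample mean")
}
par(mfrow = c(1, 1))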


Exercise 1

Try to create the same series of histograms with players’ salaries in 2014.

Exercise 2

Now demonstrate the central limit theorem with the "mpg" (miles per gallon) variable in the mtcars (Motor Trend Car Road Tests) data set. mtcars is another data set built into R. You can use the meansVector function I wrote for you to generate a vector of means.

Make sure that you are able to
(1) read in the data with data(),
(2) plot the histogram of the "mpg" variable,
(3) create a series of histograms with a gradually increasing number of subsamples but a fixed number of values taken in each subsample, and
(4) create a series of histograms with a fixed number of subsamples but a gradually increasing number of values taken in each subsample.

data("mtcars", eval=FALSE)
hist(mtcars$mpg)
hist(mtcars$mpg, main = "distribution of mpg", breaks=10)
  avg=mean(mtcars$mpg)
  SD=sd(mtcars$mpg)
  abline(v=avg, col="blue")
  legend("topright", legend = c(paste0("mean=", avg), paste0("SD=", SD)),text.col=c("blue", "dark green"))
# It's a pretty flat distribution (uniform distribution)
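A sketch for steps (3) and (4), reusing meansVector (the particular counts and sizes are my own choice):

# (3) increasing number of subsamples, fixed size of 20
par(mfrow = c(2, 2))
for (n in c(10, 80, 320, 1280)) {
  hist(meansVector(n, 20, mtcars, "mpg"),
       main = paste(n, "subsamples, size 20"), xlab = "subsample mean")
}
# (4) fixed 1280 subsamples, increasing size
for (s in c(5, 10, 20, 40)) {
  hist(meansVector(1280, s, mtcars, "mpg"),
       main = paste("1280 subsamples, size", s), xlab = "subsample mean")
}
par(mfrow = c(1, 1))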

Student’s t-test

One sample t-test

The one-sample t-test tests the null hypothesis that the population mean is not different from a specified value. This can be done using t.test() in R.

To learn how to perform a t-test in R, we will use another data set that describes the violent crime rates and concealed-carry laws for the 50 US states plus DC. This is a balanced panel of data on the 50 US states plus the District of Columbia (for a total of 51 'states') by year for 1977–1999. Each observation is a given state in a given year, for a total of 51 states × 23 years = 1173 observations. These data were provided by Prof. John Donohue of Stanford University and were used in his paper with Ian Ayres, "Shooting Down the 'More Guns, Less Crime' Hypothesis" (Stanford Law Review, 2003).

At this stage, you should be able to read in data and subset part of a whole data set. For example, we can use a one-sample t-test to test whether the murder rate in Michigan significantly differs from a certain value, say 9 incidents per 10^5 people. The syntax is t.test(sample, mu = specified value), where mu takes a number.

setwd("D:/Courses/UM/2016_WN/NRE538_GSRA/Labs/NRE538_Lab2")
  # set working directory
guns = read.table("guns.csv",header=T,fill=T,sep=",")
  # read in data
guns.MI = subset(guns, state=="Michigan")
t.test(guns.MI[,"murder"], mu=9)
## 
##  One Sample t-test
## 
## data:  guns.MI[, "murder"]
## t = 2.3794, df = 22, p-value = 0.02644
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
##   9.085973 10.253157
## sample estimates:
## mean of x 
##  9.669565

What do the statistical results tell you?
Let’s plot the histogram of the murder rate in Michigan to visualize the results.

hist(guns.MI[,"murder"], breaks=20)
abline(v=mean(guns.MI[,"murder"]), col="blue")
abline(v=9, col="dark green")
  legend("topleft", legend = c("sample mean", "specified value=9"),text.col=c("blue", "dark green"))


Exercise

Perform t-tests on other statistics for Michigan and see whether they significantly differ from values that you specify yourself.

Two sample t-test

The two-sample t-test tests the null hypothesis that two population means are not significantly different from each other. Two-sample t-tests can be categorized into paired and unpaired tests.
Do you remember the difference between the two?

Let’s compare the murder rates of Washington and Minnesota. The syntax is t.test(sample1, sample2, paired = TRUE or FALSE).

guns.WAS = subset(guns, state=="Washington")
guns.MIN = subset(guns, state=="Minnesota")
t.test(guns.WAS[,"murder"], guns.MIN[,"murder"], paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  guns.WAS[, "murder"] and guns.MIN[, "murder"]
## t = 12.104, df = 43.616, p-value = 1.546e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.753891 2.454804
## sample estimates:
## mean of x mean of y 
##  4.773913  2.669565
  • What do the results tell you?
  • Why do I use paired=FALSE here?

Let’s visualize the results by plotting the two distributions.

hist(guns.WAS[,"murder"], breaks=15, col="light blue", xlim=c(0, 7), ylim=c(0,5), xlab="murder rate", main="")
abline(v=mean(guns.WAS[,"murder"]), col="blue")
par(new=TRUE)
hist(guns.MIN[,"murder"], breaks=15, col="light green", xlim=c(0, 7), ylim=c(0,5), xlab="", main="")
abline(v=mean(guns.MIN[,"murder"]), col="dark green")
  legend("topright", legend = c("Washington mean", "Minnesota mean"),text.col=c("blue", "dark green"))

Revisit the assumptions of the t-test

Three critical assumptions of the t-test are:
1. samples are normally distributed
2. samples have equal variance
3. each observation is sampled independently
Among the three, the third is difficult to test, but the first two are testable. Let’s see if the murder rates of Washington and Minnesota meet the first two assumptions.

To test for normality, we can just use shapiro.test(), and to test for equal variance, we can use var.test(), which is essentially an F-test.

shapiro.test(guns.WAS[,"murder"])
## 
##  Shapiro-Wilk normality test
## 
## data:  guns.WAS[, "murder"]
## W = 0.94148, p-value = 0.1933
shapiro.test(guns.MIN[,"murder"])
## 
##  Shapiro-Wilk normality test
## 
## data:  guns.MIN[, "murder"]
## W = 0.98013, p-value = 0.9083
var.test(guns.WAS[,"murder"], guns.MIN[,"murder"])
## 
##  F test to compare two variances
## 
## data:  guns.WAS[, "murder"] and guns.MIN[, "murder"]
## F = 1.2072, num df = 22, denom df = 22, p-value = 0.6626
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5119867 2.8464432
## sample estimates:
## ratio of variances 
##           1.207204

All assumptions are met, so the results from this unpaired t-test should be robust. But what if the assumptions are violated?

Let’s first see an example in which the assumptions are satisfied.
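As a sketch, we can simulate two normal populations with equal variance and run the checks and the t-test (the seed, sample sizes, and means below are my own choices):

set.seed(1)                                  # arbitrary seed for reproducibility
pop.blue = rnorm(100, mean = 5, sd = 1)      # normal, sd = 1
pop.green = rnorm(100, mean = 5.5, sd = 1)   # normal, same sd
shapiro.test(pop.blue)                       # normality check
var.test(pop.blue, pop.green)                # equal-variance check
t.test(pop.blue, pop.green, paired = FALSE)  # unpaired t-test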

As you can see, when the assumptions hold, the t-test behaves as expected.

Now we deliberately violate the assumptions by making one population right-skewed. By doing so, we violate the first assumption, because the blue population is no longer normally distributed, and the second as well: the variance of the blue population (~2.71) is significantly larger than that of the green population (~1.03).
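The exact populations behind those numbers aren’t shown here, but a right-skewed draw such as a chi-squared sample illustrates the same idea (the distributions and parameters below are assumptions on my part):

set.seed(1)                                 # arbitrary seed
pop.blue = rchisq(100, df = 1.5)            # right-skewed; population variance 2*df = 3
pop.green = rnorm(100, mean = 1.5, sd = 1)  # normal; population variance 1
shapiro.test(pop.blue)                      # normality should be rejected here
var.test(pop.blue, pop.green)               # variances should differ significantly
t.test(pop.blue, pop.green, paired = FALSE)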