To demonstrate the Central Limit Theorem, I found a cool package, Lahman, containing Major League Baseball data from 1871 to 2015, including player salaries. As you might expect, the salary distribution is highly right skewed. However, if we repeatedly subsample a sufficiently large number of players and calculate the mean of each subsample, the distribution of those means will follow a normal distribution. Magic!
Let's first load the package and subset the salaries from just one year. Note that, because the salary data lives inside the Lahman package, we have to use data() again to fetch it.
install.packages("Lahman")  # only needed the first time
library(Lahman)
data("Salaries")  # load the Salaries data frame into the workspace
Now let's subset the salaries from one single year (say 2015). Do you remember how to do it?
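In case you need a hint, one way is subset() on the yearID column (the object name money15 matches the output below):
money15 = subset(Salaries, yearID == 2015)  # keep only the 2015 rows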
head(money15)
## yearID teamID lgID playerID salary
## 24759 2015 ARI NL ahmedni01 508500
## 24760 2015 ARI NL anderch01 512500
## 24761 2015 ARI NL chafian01 507500
## 24762 2015 ARI NL collmjo01 1400000
## 24763 2015 ARI NL corbipa01 524000
## 24764 2015 ARI NL delarru01 516000
meansVector = function(times, size, dat, varb){
  a <- as.numeric(times)  # number of subsamples to draw
  b <- as.numeric(size)   # number of values in each subsample
  v <- c()                # collects the subsample means
  for(i in 1:a){
    y <- sample(dat[, varb], b, replace = TRUE)  # draw one subsample with replacement
    m <- mean(y)
    v <- c(v, m)          # store its mean
  }
  v
}
Here I wrote a small function that returns the mean of each subsample as a vector. The inputs of this function are (1) how many subsamples you want to draw, (2) how many values you want to take in each subsample (how many players' salaries you want to subsample out), (3) which data set you want to subsample from (in the following case, the 2015 salaries, but you can subset another year), and (4) which variable you want to subsample (in the following case, "salary").
Let's take a look at the salary distribution in year 2015.
hist(money15$salary, main = "distribution of salary")
avg=mean(money15$salary)
SD=sd(money15$salary)
abline(v=avg, col="blue")
legend("topright", legend = c(paste0("mean=", avg), paste0("SD=", SD)),text.col=c("blue", "dark green"))
Obviously, it's highly right skewed.
Now let's plot the histogram of the means of 10 subsamples with 100 values in each subsample.
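A minimal way to do it with the meansVector() function from above:
hist(meansVector(10, 100, money15, "salary"), main = "means of 10 subsamples of 100 salaries")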
Let's gradually increase the number of subsamples but fix the number of values in each subsample for now, as in the sketch below.
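One way to build the series (the subsample counts are arbitrary choices of mine):
par(mfrow = c(2, 2))  # arrange the four histograms in a 2 x 2 grid
for(t in c(10, 100, 500, 1000)){
  hist(meansVector(t, 100, money15, "salary"), main = paste0(t, " subsamples of 100"))
}
par(mfrow = c(1, 1))  # reset the plotting layout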
How about we fix the number of subsamples but increase the number of values taken in each subsample? We start from 1280 subsamples with 10 values in each subsample, and gradually increase the values taken in each subsample, as in the sketch below.
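A sketch along the same lines (the subsample sizes beyond the starting 10 are my own choices):
par(mfrow = c(2, 2))
for(s in c(10, 40, 160, 640)){
  hist(meansVector(1280, s, money15, "salary"), main = paste0("1280 subsamples of size ", s))
}
par(mfrow = c(1, 1))  # reset the plotting layout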
Try to create the same series of histograms with players' salaries in 2014.
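A starting point (the object name money14 is just a suggestion):
money14 = subset(Salaries, yearID == 2014)  # then rerun the series above with money14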
Now demonstrate the Central Limit Theorem with the "mpg" (miles per gallon) variable in the mtcars (Motor Trend Car Road Tests) data set. mtcars is another data set built into R. You can use the meansVector() function I wrote for you to generate a vector of means.
Make sure that you are able to
(1) read in the data with data(),
(2) plot the histogram of the "mpg" variable,
(3) create a series of histograms with gradually increasing numbers of subsamples but a fixed number of values taken in each subsample, and
(4) create a series of histograms with a fixed number of subsamples but gradually increasing numbers of values taken in each subsample.
A sketch for tasks (3) and (4) appears after the code below.
data("mtcars")
hist(mtcars$mpg)
hist(mtcars$mpg, main = "distribution of mpg", breaks=10)
avg=mean(mtcars$mpg)
SD=sd(mtcars$mpg)
abline(v=avg, col="blue")
legend("topright", legend = c(paste0("mean=", avg), paste0("SD=", SD)),text.col=c("blue", "dark green"))
# It's a pretty flat, spread-out distribution, nothing like a normal curve
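A sketch for tasks (3) and (4), reusing meansVector() (the subsample counts and sizes are arbitrary choices of mine):
# task (3): more and more subsamples, 10 values each
par(mfrow = c(2, 2))
for(t in c(10, 100, 500, 1000)){
  hist(meansVector(t, 10, mtcars, "mpg"), main = paste0(t, " subsamples of 10"))
}
# task (4): 1000 subsamples, with larger and larger subsamples
for(s in c(5, 10, 20, 50)){
  hist(meansVector(1000, s, mtcars, "mpg"), main = paste0("1000 subsamples of size ", s))
}
par(mfrow = c(1, 1))  # reset the plotting layout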
A one-sample t-test tests the null hypothesis that the population mean does not differ from a specified value. This can be done with t.test() in R.
To learn how to perform t-tests in R, we will use another data set that describes the violent crime rates and concealed-carry laws for the 50 US states plus DC. This is a balanced panel of data on the 50 US states plus the District of Columbia (for a total of 51 "states") by year for 1977-1999. Each observation is a given state in a given year, for a total of 51 states times 23 years = 1173 observations. These data were provided by Prof. John Donohue of Stanford University and were used in his paper with Ian Ayres, "Shooting Down the 'More Guns, Less Crime' Hypothesis" (Stanford Law Review 2003).
At this stage, you should be able to read in data and subset part of a whole data set. For example, we can use a one-sample t-test to test whether the murder rate in Michigan significantly differs from a certain value, say 9 incidents per 10^5 people. The syntax is t.test(sample, mu = value), where value is the number you test against.
setwd("D:/Courses/UM/2016_WN/NRE538_GSRA/Labs/NRE538_Lab2")  # set working directory
guns = read.table("guns.csv", header = TRUE, fill = TRUE, sep = ",")  # read in data
guns.MI = subset(guns, state == "Michigan")  # keep only the Michigan rows
t.test(guns.MI[,"murder"], mu = 9)  # one-sample t-test against mu = 9
##
## One Sample t-test
##
## data: guns.MI[, "murder"]
## t = 2.3794, df = 22, p-value = 0.02644
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
## 9.085973 10.253157
## sample estimates:
## mean of x
## 9.669565
What do the statistical results tell you?
Let's plot the histogram of the murder rate of Michigan to visualize the results.
hist(guns.MI[,"murder"], breaks=20)
abline(v=mean(guns.MI[,"murder"]), col="blue")
abline(v=9, col="dark green")
legend("topleft", legend = c("sample mean", "specified value=9"),text.col=c("blue", "dark green"))
Perform t-tests on other statistics of Michigan and see whether those statistics significantly differ from values you specify yourself.
A two-sample t-test tests the hypothesis that two population means are not significantly different from each other. Two-sample t-tests can be categorized into paired and unpaired tests.
Do you remember the difference between the two?
Let's compare the murder rates of Washington and Minnesota. The syntax is t.test(sample1, sample2, paired = TRUE or FALSE).
guns.WAS = subset(guns, state=="Washington")
guns.MIN = subset(guns, state=="Minnesota")
t.test(guns.WAS[,"murder"], guns.MIN[,"murder"], paired=FALSE)
##
## Welch Two Sample t-test
##
## data: guns.WAS[, "murder"] and guns.MIN[, "murder"]
## t = 12.104, df = 43.616, p-value = 1.546e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.753891 2.454804
## sample estimates:
## mean of x mean of y
## 4.773913 2.669565
Why do we use paired=FALSE here? Let's visualize the results by plotting the two distributions.
hist(guns.WAS[,"murder"], breaks=15, col="light blue", xlim=c(0, 7), ylim=c(0,5), xlab="murder rate", main="")
abline(v=mean(guns.WAS[,"murder"]), col="blue")
par(new=TRUE)  # overlay the next histogram on the same plot
hist(guns.MIN[,"murder"], breaks=15, col="light green", xlim=c(0, 7), ylim=c(0,5), xlab="", main="")
abline(v=mean(guns.MIN[,"murder"]), col="dark green")
legend("topright", legend = c("Washington mean", "Minnesota mean"),text.col=c("blue", "dark green"))
Three critical assumptions of t-test are:
1. samples are normally distributed
2. samples have equal variance
3. each observation is sampled independently
Among the three, the third is hard to test directly, but the first two are testable. Let's see if the murder rates of Washington and Minnesota meet the first two assumptions.
To test for normality, we can just use shapiro.test(), and to test for equal variance, we can use var.test(), which is essentially an F-test.
shapiro.test(guns.WAS[,"murder"])
##
## Shapiro-Wilk normality test
##
## data: guns.WAS[, "murder"]
## W = 0.94148, p-value = 0.1933
shapiro.test(guns.MIN[,"murder"])
##
## Shapiro-Wilk normality test
##
## data: guns.MIN[, "murder"]
## W = 0.98013, p-value = 0.9083
var.test(guns.WAS[,"murder"], guns.MIN[,"murder"])
##
## F test to compare two variances
##
## data: guns.WAS[, "murder"] and guns.MIN[, "murder"]
## F = 1.2072, num df = 22, denom df = 22, p-value = 0.6626
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5119867 2.8464432
## sample estimates:
## ratio of variances
## 1.207204
All assumptions are met, so the results from this unpaired t-test should be robust. But what if the assumptions are violated?
Let's first see an example where the assumptions are satisfied.
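A minimal simulation sketch (the sample sizes and parameters are my own arbitrary choices): draw two normal populations with the same variance and run the same battery of tests.
set.seed(1)  # for reproducibility
green = rnorm(23, mean = 3, sd = 1)  # normal population, sd = 1
blue = rnorm(23, mean = 5, sd = 1)   # normal population, same sd
shapiro.test(blue)  # should not reject normality
var.test(blue, green)  # should not reject equal variances
t.test(blue, green, paired = FALSE)  # should detect the difference in means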
As you can see, when the populations really are normal with equal variance, the diagnostic tests raise no flags and the t-test result can be trusted.
Now we deliberately violate the assumptions by making one population right skewed. By doing so, we violate the first assumption, because the blue population is not a normal distribution, and the second as well: the variance of the blue population (~2.71) is significantly larger than that of the green population (~1.03).
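The original simulation code isn't shown here, but one way to produce such a right-skewed population is a chi-square draw (a sketch; the exact variances won't match the ~2.71 and ~1.03 quoted above):
set.seed(1)
blue = rchisq(23, df = 2)  # right skewed, variance larger than the normal sample's
green = rnorm(23, mean = 2, sd = 1)  # normal, variance around 1
shapiro.test(blue)  # likely rejects normality for blue
var.test(blue, green)  # likely flags the unequal variances
t.test(blue, green)  # with violated assumptions, treat this result with caution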