Name(s): Anh Ha & Anh Nguyen
#install.packages("mosaic") #take the comment symbol off and run the first time
library(mosaic) #package used to make some of the programming easier.
The central limit theorem says that, no matter what the distribution of the overall population is, sample means (and proportions) approximately follow a normal distribution if the sample size is large enough (provided the population has a finite variance).
In this lab we will simulate sampling from the different distributions that we have seen in this course at different sample sizes and see whether the central limit theorem holds at each sample size.
First an example using the exponential.
Sample from the exponential distribution with lambda=.1, which is not normal. In R, sampling from a distribution is generally done with the r<distribution> command (here, rexp).
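As a brief aside, each R sampling function has companions for the density, the cumulative probability, and the quantiles that share the same suffix. A minimal sketch for the exponential with the rate used here (the variable names are illustrative and not part of the lab):

```r
rate  <- 0.1               # same lambda as in this example
draws <- rexp(5, rate)     # r prefix: five random draws
dens0 <- dexp(0, rate)     # d prefix: density at 0, which equals the rate
p10   <- pexp(10, rate)    # p prefix: P(X <= 10) = 1 - exp(-1)
med   <- qexp(0.5, rate)   # q prefix: the median, log(2) / rate
```

Only the r version is needed for the simulations in this lab.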
So here is the population
expsamplePop=rexp(1000, .1) #one set of samples: 1000 draws with rate lambda=.1
hist(expsamplePop) #not so normal population
qqnorm(expsamplePop) #normal quantile plot. The closer it is to a line the more normal the data.
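Judging "close to a line" is easier with a reference line on the plot. A small sketch, assuming the same exponential population; the seed and the name pop are arbitrary choices for illustration:

```r
set.seed(1)                     # arbitrary seed so the sketch is repeatable
pop <- rexp(1000, rate = 0.1)   # same kind of population as above
qq <- qqnorm(pop)               # normal quantile plot; also returns the plotted points
qqline(pop)                     # reference line through the first and third quartiles
```

Points that curve away from the line, especially in the tails, indicate a departure from normality.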
Now, to use the central limit theorem, we look at the average value of samples of different sizes from this population. Start with samples of size 5, taking the mean of each sample.
#a sample
mean(rexp(5, .1)) #note if you rerun this command it should vary
## [1] 8.601455
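If a repeatable result is wanted (for example, so that the printed value matches the write-up), the random number generator can be seeded first. A minimal sketch; the seed value 42 is an arbitrary assumption, not something the lab specifies:

```r
set.seed(42)                  # 42 is an arbitrary choice
first <- mean(rexp(5, 0.1))
set.seed(42)                  # resetting the seed replays the same random draws
second <- mean(rexp(5, 0.1))
identical(first, second)      # TRUE
```

Without set.seed, each run gives a different mean, which is the behavior the comment above describes.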
expsample5=do(1000)*mean(rexp(5, .1)) # now 1000 samples collecting a mean from each
hist(expsample5$mean)
qqnorm(expsample5$mean)
Not very normal, but sample size was pretty small. What about 20?
expsample20=do(1000)*mean(rexp(20, .1)) # now 1000 samples collecting a mean from each
hist(expsample20$mean)
qqnorm(expsample20$mean)
Not bad. How about 30?
expsample30=do(1000)*mean(rexp(30, .1)) # now 1000 samples collecting a mean from each
hist(expsample30$mean)
qqnorm(expsample30$mean)
Pretty good.
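Beyond the shape, the central limit theorem also says where the sample means should be centered and how spread out they should be: the population mean, and the population standard deviation divided by sqrt(n). A rough numeric check for n=30, using base R's replicate as a stand-in for do(1000)* (the seed and variable names are assumptions for illustration):

```r
set.seed(1)                                     # arbitrary seed
rate <- 0.1
n <- 30
means <- replicate(1000, mean(rexp(n, rate)))   # 1000 sample means, each from n draws
theory_mean <- 1 / rate                # mean of an exponential population
theory_se   <- (1 / rate) / sqrt(n)    # population sd (also 1/rate) divided by sqrt(n)
print(c(simulated = mean(means), theory = theory_mean))
print(c(simulated = sd(means), theory = theory_se))
```

The simulated values should land close to the theoretical ones, though not exactly on them, since they come from a finite number of random samples.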
Next, an example using the normal.
Sample from the normal distribution with mean=1.2 and sd=3.4, which is already normal. In R, sampling from a distribution is generally done with the r<distribution> command (here, rnorm).
So here is the population
nsamplePop=rnorm(1000,1.2, 3.4) #one set of samples: 1000 draws with mean=1.2, sd=3.4
hist(nsamplePop) #the population is already normal here
qqnorm(nsamplePop) #normal quantile plot. The closer it is to a line the more normal the data.
Now, to use the central limit theorem, we look at the average value of samples of different sizes from this population. Start with samples of size 5, taking the mean of each sample.
#a sample
mean(rnorm(5,1.2,3.4)) #note if you rerun this command it should vary
## [1] 4.253085
nsample5=do(1000)*mean(rnorm(5,1.2,3.4)) # now 1000 samples collecting a mean from each
hist(nsample5$mean)
qqnorm(nsample5$mean)
Already normal even at this small sample size, since the mean of normal draws is itself normally distributed. What about 20?
nsample20=do(1000)*mean(rnorm(20,1.2,3.4)) # now 1000 samples collecting a mean from each
hist(nsample20$mean)
qqnorm(nsample20$mean)
Not bad. How about 30?
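The n=30 case for the normal follows the same pattern as the smaller sizes. A sketch using base R's replicate in place of do(1000)*, so nsample30 here is a plain numeric vector rather than a data frame with a mean column; the seed is an arbitrary assumption:

```r
set.seed(1)                                               # arbitrary seed
nsample30 <- replicate(1000, mean(rnorm(30, 1.2, 3.4)))   # 1000 means of samples of size 30
hist(nsample30)
qqnorm(nsample30)
```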
Next, an example using the Poisson.
Sample from the Poisson distribution with lambda=.5, which is not normal. In R, sampling from a distribution is generally done with the r<distribution> command (here, rpois).
So here is the population
psamplePop=rpois(1000, .5) #one set of samples: 1000 draws with lambda=.5
hist(psamplePop) #not so normal population
qqnorm(psamplePop) #normal quantile plot. The closer it is to a line the more normal the data.
Now, to use the central limit theorem, we look at the average value of samples of different sizes from this population. Start with samples of size 5, taking the mean of each sample.
#a sample
mean(rpois(5, .5)) #note if you rerun this command it should vary
## [1] 0.6
psample5=do(1000)*mean(rpois(5, .5)) # now 1000 samples collecting a mean from each
hist(psample5$mean)
qqnorm(psample5$mean)
Not very normal, but sample size was pretty small. What about 20?
psample20=do(1000)*mean(rpois(20, .5)) # now 1000 samples collecting a mean from each
hist(psample20$mean)
qqnorm(psample20$mean)
Not bad. How about 30?
psample30=do(1000)*mean(rpois(30, .5)) # now 1000 samples collecting a mean from each
hist(psample30$mean)
qqnorm(psample30$mean)
psample100=do(1000)*mean(rpois(100, .5)) # now 1000 samples collecting a mean from each
hist(psample100$mean)
qqnorm(psample100$mean)
Next, an example using the geometric.
Sample from the geometric distribution with p=.1, which is not normal. In R, sampling from a distribution is generally done with the r<distribution> command (here, rgeom).
So here is the population
gsamplePop=rgeom(1000,.1) #one set of samples: 1000 draws with p=.1
hist(gsamplePop) #not so normal population
qqnorm(gsamplePop) #normal quantile plot. The closer it is to a line the more normal the data.
Now, to use the central limit theorem, we look at the average value of samples of different sizes from this population. Start with samples of size 5, taking the mean of each sample.
#a sample
mean(rgeom(5,.1)) #note if you rerun this command it should vary
## [1] 8.8
gsample5=do(1000)*mean(rgeom(5,.1)) # now 1000 samples collecting a mean from each
hist(gsample5$mean)
qqnorm(gsample5$mean)
Not very normal, but sample size was pretty small. What about 20?
gsample20=do(1000)*mean(rgeom(20,.1)) # now 1000 samples collecting a mean from each
hist(gsample20$mean)
qqnorm(gsample20$mean)
Not bad. How about 30?
gsample30=do(1000)*mean(rgeom(30,.1)) # now 1000 samples collecting a mean from each
hist(gsample30$mean)
qqnorm(gsample30$mean)
gsample200=do(1000)*mean(rgeom(200,.1)) # now 1000 samples collecting a mean from each
hist(gsample200$mean)
qqnorm(gsample200$mean)
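One way to see why a strongly skewed population such as this geometric may need a sample size well beyond 30 is to look at the skewness of the sample mean, which is the population skewness divided by sqrt(n). A small worked calculation, assuming the standard skewness formula for the geometric, (2 - p) / sqrt(1 - p); the variable names are illustrative:

```r
p <- 0.1
n <- c(5, 20, 30, 200)               # the sample sizes tried above
pop_skew  <- (2 - p) / sqrt(1 - p)   # skewness of the geometric population
mean_skew <- pop_skew / sqrt(n)      # skewness of the mean of n independent draws
print(data.frame(n = n, skewness_of_mean = round(mean_skew, 3)))
```

A normal distribution has skewness 0, so the closer these values are to 0, the more symmetric the sampling distribution should look.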
Next, an example using the hypergeometric.
Sample from the hypergeometric distribution with 30 white balls, 50 black balls, and k=20 balls drawn, which is not normal. In R, sampling from a distribution is generally done with the r<distribution> command (here, rhyper).
So here is the population
hsamplePop=rhyper(1000,30,50,20) #one set of samples: 1000 draws with 30 white balls, 50 black balls, 20 drawn
hist(hsamplePop) #not so normal population
qqnorm(hsamplePop) #normal quantile plot. The closer it is to a line the more normal the data.
Now, to use the central limit theorem, we look at the average value of samples of different sizes from this population. Start with samples of size 5, taking the mean of each sample.
#a sample
mean(rhyper(5, 30,50,20)) #note if you rerun this command it should vary
## [1] 7.8
hsample5=do(1000)*mean(rhyper(5, 30,50,20)) # now 1000 samples collecting a mean from each
hist(hsample5$mean)
qqnorm(hsample5$mean)
Not very normal, but sample size was pretty small. What about 20?
hsample20=do(1000)*mean(rhyper(20,30,50,20)) # now 1000 samples collecting a mean from each
hist(hsample20$mean)
qqnorm(hsample20$mean)
Not bad. How about 30?
hsample30=do(1000)*mean(rhyper(30,30,50,20)) # now 1000 samples collecting a mean from each
hist(hsample30$mean)
qqnorm(hsample30$mean)
hsample100=do(1000)*mean(rhyper(100,30,50,20)) # now 1000 samples collecting a mean from each
hist(hsample100$mean)
qqnorm(hsample100$mean)
Next, an example using the negative binomial.
Sample from the negative binomial distribution with size=10, p=.1, which is not normal. In R, sampling from a distribution is generally done with the r<distribution> command (here, rnbinom).
So here is the population
nbsamplePop=rnbinom(1000,10,.1) #one set of samples: 1000 draws with size=10, p=.1
hist(nbsamplePop) #not so normal population
qqnorm(nbsamplePop) #normal quantile plot. The closer it is to a line the more normal the data.
Now, to use the central limit theorem, we look at the average value of samples of different sizes from this population. Start with samples of size 5, taking the mean of each sample.
#a sample
mean(rnbinom(5, 10,.1)) #note if you rerun this command it should vary
## [1] 88.8
nbsample5=do(1000)*mean(rnbinom(5, 10,.1)) # now 1000 samples collecting a mean from each
hist(nbsample5$mean)
qqnorm(nbsample5$mean)
Not very normal, but sample size was pretty small. What about 20?
nbsample20=do(1000)*mean(rnbinom(20,10,.1)) # now 1000 samples collecting a mean from each
hist(nbsample20$mean)
qqnorm(nbsample20$mean)
Not bad. How about 30?
nbsample30=do(1000)*mean(rnbinom(30,10,.1)) # now 1000 samples collecting a mean from each
hist(nbsample30$mean)
qqnorm(nbsample30$mean)
nbsample100=do(1000)*mean(rnbinom(100,10,.1)) # now 1000 samples collecting a mean from each
hist(nbsample100$mean)
qqnorm(nbsample100$mean)
Goal for the rest of the lab: What does the population look like? What n is large enough for the central limit theorem to work? Use sample sizes n=5, n=20, n=30, and higher until the sampling distribution of the mean is close to normal (that is, you can stop at 30 if it looks normal enough; otherwise keep going). The commands should be similar to the examples above.
Do this for the following distributions with the following sampling commands. You get to pick the parameters. Make sure to list them before starting.
exponential distribution, use rexp
normal distribution, use rnorm
the Poisson distribution, use rpois
the geometric distribution, use rgeom
the hypergeometric distribution, use rhyper
the negative binomial distribution, use rnbinom
Another distribution of your choice. Describe the distribution and your choice of parameters before you begin. You can find candidates either by typing distributions into the help viewer on the right or by using the distribution viewer from class linked on Moodle.
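Because the same steps repeat for every distribution and sample size, they can be wrapped in a small helper. The function below is a hypothetical sketch in base R (clt_panels and its arguments are not part of the lab or of mosaic); it takes any sampling function of n and shows all sample sizes side by side:

```r
# Hypothetical helper (not from the lab): for each sample size in `sizes`,
# draw `reps` sample means, then plot a histogram and a normal quantile plot.
clt_panels <- function(rdist, sizes = c(5, 20, 30, 100), reps = 1000) {
  old_par <- par(mfrow = c(2, length(sizes)))   # top row: histograms, bottom row: quantile plots
  on.exit(par(old_par))                         # restore the plot layout afterwards
  means <- lapply(sizes, function(n) replicate(reps, mean(rdist(n))))
  names(means) <- paste0("n=", sizes)
  for (nm in names(means)) hist(means[[nm]], main = nm, xlab = "sample mean")
  for (nm in names(means)) {
    qqnorm(means[[nm]], main = nm)
    qqline(means[[nm]])
  }
  invisible(means)                              # return the simulated means for further checks
}

set.seed(1)                                     # arbitrary seed
pois_means <- clt_panels(function(n) rpois(n, 0.5))
```

Swapping in another population only means changing the function passed in, for example function(n) rgeom(n, 0.1).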
What do you think of the n>=30 rule of thumb for a large enough sample? Based on what you did above, for what sorts of populations does it work okay, and when does it not?
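To back up a visual judgment with a number, one option is to compare the sample skewness of the simulated means at n=30 for a symmetric population and a skewed one. This is only a sketch: the skew function below is a simple hand-written estimate (an assumption, not a function from the lab or from mosaic), and the seed is arbitrary.

```r
set.seed(1)                                                # arbitrary seed
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3        # simple sample skewness estimate
norm_means <- replicate(1000, mean(rnorm(30, 1.2, 3.4)))   # symmetric population
exp_means  <- replicate(1000, mean(rexp(30, 0.1)))         # right-skewed population
print(c(normal = skew(norm_means), exponential = skew(exp_means)))
```

A value near 0 suggests the sampling distribution is roughly symmetric at that n; a value clearly above 0 suggests a right tail is still present and a larger n may be needed.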