loading Packages
require(latticeExtra)
require(mosaic)
require(Hmisc)
Simulating Sampling Distribution of Mean for three different sample sizes, n=5, n=25 and n=100 from expoential distribution with mean 5 and visualising it using boxplots
lambda<-2
numsim<-100
mean5<-rep(0,numsim)
mean25<-mean5
mean100<-mean5
for (i in 1:numsim){
mean5[i]<-mean(rexp(5,lambda))
mean25[i]<-mean(rexp(25,lambda))
mean100[i]<-mean(rexp(100,lambda))
}
boxplot(mean5,mean25,mean100,names=c("n=5","n=25","n=100"))
title("Distribution of Mean of Exponentials")
Look how the variance of the estimator is getting smaller with increase in sample size. By CLT the mean should have normal distribution or at least approximate normal, for large sample sizes, regardless of the orignating distribution from which samples came.
Compare three estimators of the mean (\(\mu\)) of a standard normal distribution. Comparing sample mean, trimmed mean and medianas estimators of population mean.If the distirbution of data is symmetric (as Standard Normal distribution of our example) all three estimators indeed estimate population mean but if the population from which sample is coming from is skewed then they do not.
set.seed(1234)
# generate a matrix of size sXn ( s rows, n colmns)
# each row is a random normal sample of size n
s<-10# declaring # of rows
n<-3 #declaring sample size
mu<-0
x<-rnorm(s*n,mean=mu,sd=1) #generating random normal sample
#x
# filling up the matrix with random normal sample
mat<-matrix(x,nrow=s,ncol=n)
mat
## [,1] [,2] [,3]
## [1,] -1.2070657 -0.47719270 0.1340882
## [2,] 0.2774292 -0.99838644 -0.4906859
## [3,] 1.0844412 -0.77625389 -0.4405479
## [4,] -2.3456977 0.06445882 0.4595894
## [5,] 0.4291247 0.95949406 -0.6937202
## [6,] 0.5060559 -0.11028549 -1.4482049
## [7,] -0.5747400 -0.51100951 0.5747557
## [8,] -0.5466319 -0.91119542 -1.0236557
## [9,] -0.5644520 -0.83717168 -0.0151383
## [10,] -0.8900378 2.41583518 -0.9359486
#adding three columns to theright of the matrix to store sample mean, trimmed mean, median
#before doing that you need to allocate space in R's memory
theta1<-numeric(s) # a vector of size s where we will store mean of each sample
theta2<-numeric(s) # a vector of size s where we will store 20% trimmed mean of each sample
theta3<-numeric(s) # a vector of size s where we will store median of each sample
# Notice there are three estimators ( mean, trimmed mean and median) whose properties we are comparing
matnew<-cbind(mat,theta1,theta2,theta3)
#matnew # notice at this point the theta1,theta2,theta3 columns have 0s
# now we need to fill up the last three columns
# for that we will use apply function ( check out help on apply)
# Reminder ? apply will provide you details
# a generic form of doing things in r is
# goal( ), goal is the generic function name and within brackets you give arguments,
# arguments are the things R need to perform that fucntion
matnew[,n+1]<-apply(matnew[,1:n],1,mean) # finding mean for each row/sample
#writing a small function to call trim mean in apply
trimmean<-function(Y)
{
mean(Y,0.2)
}
matnew[,n+2]<-apply(matnew[,1:n],1,trimmean) # finding trim mean for each sample/row
matnew[,n+3]<-apply(matnew[,1:n],1,median) # finding median for each sample/row
matnew_round<-round(matnew,4)
#matnew_round
# Calculating bias for each estimator
#theroatically we all know sample mean is an unbiased estimator of population mean
#what about the other two estimators?
# finding expected value of each estimator ( E(thetahat) = overall mean)
expected.value.theta1<-mean(matnew[,n+1])
expected.value.theta1
## [1] -0.296425
# similarly you can do it for the other two estimators individually
# or you can write a for loop to do the same
expected.value.theta<-numeric(3) # 3 is the # of estimators we are comparing
bias<-numeric(3)
for (i in 1:3)
{
expected.value.theta[i]<-mean(matnew[,n+i])
bias[i]<- expected.value.theta[i]-mu
}
#expected.value.theta
#bias
#colnames(expected.value.theta) <- c("theta1","theta2","theta3")
m2 <- cbind(c("theta1","theta2","theta3"),round(expected.value.theta,4),round(bias,4) )
m2
## [,1] [,2] [,3]
## [1,] "theta1" "-0.2964" "-0.2964"
## [2,] "theta2" "-0.2964" "-0.2964"
## [3,] "theta3" "-0.3902" "-0.3902"
Simulation to compare sampling properties of three different estimators for the mean of a distribution based on an iid sample of size n * sample mean * 20% trimmed mean (may work better with heavy tails) * median (may not work well for asymmetric distributions) * Note you can generate samples from gammma, normal, t to see the difference in estimators. A possible “fair” comparison would be to generate data from each of these distributions with the same mean and variance and see how the three methods perform on a relative basis under each condition
The code below has some funtions, which will come in handy for later coding while you are working on your assignment and final exam and are good basci building blocks for any simulation related project that you taken on. It has
simsum: Function to calculate summary statistics across the S data sets, using the geenric function apply. Please read its documentation for details.
generate.normal: Function to generate S data sets of size n from normal distribution with mean mu and variance sigma^2, using matrix approach, instead of for loop. Highly reccomended.
generate.gamma: Function to generate S data sets of size n from gamma distribution with mean mu, variance sigma^2 mu^2
generate.t: Function to generate S data sets of size n from a t distribution with df degrees of freedom centered at the # value mu (a t distribution has mean 0 and variance df/(df-2) for df>2)
trimmean: Function to compute the 20% trimmed mean. You can always chnge the trimming percentage.
Also, it calculates coverage probability. When we calculate confidence itnerval using certain method, in the begining we as certian confidence co-efficient, (example 99% or 95%), to check in practice, does it really attain that confidence level, we can calculate attainded confidence level, also known as covergae probability. Basically, we simulate S sampels, find CI for each sample and find the proportion of CIs which covered the true value of the mean.
# function to view the first k lines of a data frame
view <- function(dat,k){
message <- paste("First",k,"rows")
krows <- dat[1:k,]
cat(message,"\n","\n")
print(krows)
}
# function to calculate summary statistics across the S data sets
simsum <- function(dat,trueval){
S <- nrow(dat)
MCmean <- apply(dat,2,mean)
MCbias <- MCmean-trueval
MCrelbias <- MCbias/trueval
MCstddev <- sqrt(apply(dat,2,var))
MCMSE <- apply((dat-trueval)^2,2,mean)
# MCMSE <- MCbias^2 + MCstddev^2 # alternative lazy calculation
###########################
#textbook approach to calculate estimate of MSE and SE of MSE
#mse.est <- mean(tmean^2)
#se.mse <- sqrt(mean((tmean-mean(tmean))^2)) / sqrt(m)
###########################
MCRE <- MCMSE[1]/MCMSE
sumdat <- rbind(rep(trueval,3),S,MCmean,MCbias,MCrelbias,MCstddev,MCMSE,
MCRE)
names <- c("true value","# sims","MC mean","MC bias","MC relative bias",
"MC standard deviation","MC MSE","MC relative efficiency")
ests <- c("Sample mean","Trimmed mean","Median")
dimnames(sumdat) <- list(names,ests)
round(sumdat,5)
}
# function to generate S data sets of size n from normal distribution with mean mu and variance sigma^2
generate.normal <- function(S,n,mu,sigma){
dat <- matrix(rnorm(n*S,mu,sigma),ncol=n,byrow=T)
# Note for Normal data generation, we can get the data in one step like this, which requires no looping.
# In more complex statistical models, looping is often required to set up each data set, because the scenario #is much more complicated. Below is a sample loop to get the same data as above; try running the program and
# see how much longer it takes!
# dat <- NULL
#
# for (i in 1:S){
#
# Y <- rnorm(n,mu,sigma)
# dat <- rbind(dat,Y)
#
# }
out <- list(dat=dat)
return(out)
}
# function to generate S data sets of size n from gamma distribution with mean mu, variance sigma^2 mu^2
generate.gamma <- function(S,n,mu,sigma){
a <- 1/(sigma^2)
s <- mu/a
dat <- matrix(rgamma(n*S,shape=a,scale=s),ncol=n,byrow=T)
# Alternative loop
# dat <- NULL
#
# for (i in 1:S){
#
# Y <- rgamma(n,shape=a,scale=s)
# dat <- rbind(dat,Y)
#
# }
out <- list(dat=dat)
return(out)
}
# function to generate S data sets of size n from a t distribution with df degrees of freedom centered at the # value mu (a t distribution has mean 0 and variance df/(df-2) for df>2)
generate.t <- function(S,n,mu,df){
dat <- matrix(mu + rt(n*S,df),ncol=n,byrow=T)
# Alternative loop
# dat <- NULL
#
# for (i in 1:S){
#
# Y <- mu + rt(n,df)
# dat <- rbind(dat,Y)
#
# }
out <- list(dat=dat)
return(out)
}
# function to compute the 20% trimmed mean
trimmean <- function(Y){mean(Y,0.2)}
# set the seed for the simulation
set.seed(3)
# set number of simulated data sets and sample size
S <- 1000
n <- 15
# generate data --Distribution choices are normal with mu,sigma
# (rnorm), gamma (rgamma) and student t (rt)
# a possible "fair" comparison would be to generate data from each of these distributions with the same mean and variance and see how the three methods perform on a relative basis under each condition
mu <- 1
sigma <- sqrt(5/3)
# out <- generate.normal(S,n,mu,sigma) # generate normal samples
# out <- generate.gamma(S,n,mu,sigma) # generate gamma samples
out <- generate.t(S,n,mu,5) # generate t_5 samples
# get the sample means from each data set
outsampmean <- apply(out$dat,1,mean)
# get the sample trimmed mean from each data set
outtrimmean <- apply(out$dat,1,trimmean)
# get the sample median from each data set
outmedian <- apply(out$dat,1,median)
# now we can summarize, remember, mu is the true value of the mean for the true distribution of the data
# get the S=1000 estimates for each method
summary.sim <- data.frame(mean=outsampmean,trim=outtrimmean,
median=outmedian)
# print out the first 5 of them (round to 4 decimal places)
#view(round(summary.sim,4),5)
# get summary statistics for each estimator
results <- simsum(summary.sim,mu)
#############################################################
# Some additional calculation for the sample mean
# average of estimated standard errors for sample mean
# usual standard error for sample mean from each data set
sampmean.ses <- sqrt(apply(out$dat,1,var)/n)
# take the average
ave.sampmeanses <- mean(sampmean.ses)
# coverage of usual confidence interval based on sample mean
t05 <- qt(0.975,n-1)
coverage <- sum((outsampmean-t05*sampmean.ses <= mu) &
(outsampmean+t05*sampmean.ses >= mu))/S
coverage.probability<-coverage
results
## Sample mean Trimmed mean Median
## true value 1.00000 1.00000 1.00000
## # sims 1000.00000 1000.00000 1000.00000
## MC mean 0.98304 0.98499 0.98474
## MC bias -0.01696 -0.01501 -0.01526
## MC relative bias -0.01696 -0.01501 -0.01526
## MC standard deviation 0.33067 0.31198 0.35016
## MC MSE 0.10952 0.09746 0.12273
## MC relative efficiency 1.00000 1.12370 0.89238
coverage.probability
## [1] 0.948
This is an illustration of basic Monte Carlo estimation. We generate a sample of size 2 from normal distribution and target parameter is the absolute value of difference in means.
m <- 1000
g <- numeric(m)
MSE.Num<-numeric(m)
for (i in 1:m) {
x <- rnorm(2)
g[i] <- abs(x[1] - x[2])
MSE.Num[i]<-(g[i]-mu)^2 # MSE calculation of the numerator
}
est <- mean(g)
est
## [1] 1.152332
bias.g<-est-0 # Monte Carlo Bias, note mu ==0
bias.g
## [1] 1.152332
MC.SE<-sd(g) # MC Standard error
var.theor<-2-(4/pi) # theroetical variance
MC.SE
## [1] 0.8613968
var.theor
## [1] 0.7267605
MC.MSE<-sum(MSE.Num)/m # Monte Carlo MSE
MC.MSE.shortcut<-(bias.g)^2+(MC.SE)^2 # Short cut way to calculate MSE
MC.MSE
## [1] 0.7644676
MC.MSE.shortcut
## [1] 2.069875
n <- 20
alpha <- .05
x <- rnorm(n, mean=0, sd=2)
UCL <- (n-1) * var(x) / qchisq(alpha, df=n-1)
Sampled from Normal. Uses techniqe from example 6.4 to build CI. Also, known as coverage probability for the interval estimate. It is one of the two main measures of goodness of an interval estimate. Other being the lenght of the interval, shorter the better.
n <- 20
alpha <- .05
UCL <- replicate(1000, expr = {
x <- rnorm(n, mean = 0, sd = 2)
(n-1) * var(x) / qchisq(alpha, df = n-1)
} )
#count the number of intervals that contain sigma^2=4
sum(UCL > 4)
## [1] 943
#or compute the mean to get the confidence level
mean(UCL > 4)
## [1] 0.943
CI for variance, when sample is from Chi-square, i.e., positively skewed population instead of symmetric mount shaped. Notice how drastically the covergae probability decreases. Its because such technique of building CI is good for samples coming from normal population. So you claimed 95% CI but it doesnt acheive what it claimed for!
n <- 20
alpha <- .05
UCL <- replicate(1000, expr = {
x <- rchisq(n, df = 2)
(n-1) * var(x) / qchisq(alpha, df = n-1)
} )
sum(UCL > 4)
## [1] 784
mean(UCL > 4)
## [1] 0.784
This example calcualtes acheived alpha. Its a way of testing goodness of a test. We want ot check if it actually acheives the alpha we claim it has! Here we draw samples from normal distribution. Sample size is small (smaller than 30) and sigma is unkown, hence we use t-test for testing for a particular value of mean. Its a right tailed hypothesis/test. It calculates empirically (via simulation) proportion of times the hypothesis was rejected, assuming H0 is true. Recall, alpha= P(type I error)== P(reject Ho|Ho is true). As can be seen from the test, it is able ot acheive claimed alpha.
n <- 20
alpha <- .05
mu0 <- 500
sigma <- 100
m <- 10000 #number of replicates
p <- numeric(m) #storage for p-values
for (j in 1:m) {
x <- rnorm(n, mu0, sigma)
ttest <- t.test(x, alternative = "greater", mu = mu0)
p[j] <- ttest$p.value
}
p.hat <- mean(p < alpha)
se.hat <- sqrt(p.hat * (1 - p.hat) / m)
print(c(p.hat, se.hat))
## [1] 0.050200000 0.002183574
This example calculates empirical power for various values of mu in alternate space. It also,plots a power curve for variaos values of mu in alternate space. Recall, power = 1-beta=1-P(type II error) = P(fail to reject Ho|Ho is false). This means we calclate Power under the assumption Ha is true, but we really dont know its value in alternate space.
PS: We are still testing for mean using t-test, as we are workign with small sample from normal distribution with unknown variance.
Note: Adding SE barrs to the curve is optional. But since you have the code now, it will make your output power curve look more illustrative, so why NOT!
n <- 20
m <- 1000
mu0 <- 500
sigma <- 100
mu <- c(seq(450, 650, 10)) #alternatives
M <- length(mu)
power <- numeric(M)
for (i in 1:M) {
mu1 <- mu[i]
pvalues <- replicate(m, expr = {
#simulate under alternative mu1
x <- rnorm(n, mean = mu1, sd = sigma)
ttest <- t.test(x,
alternative = "greater", mu = mu0)
ttest$p.value } )
power[i] <- mean(pvalues <= .05)
}
par(ask = TRUE)
library(Hmisc) #for errbar
plot(mu, power)
abline(v = mu0, lty = 1)
abline(h = .05, lty = 1)
#add standard errors
se <- sqrt(power * (1-power) / m)
errbar(mu, power, yplus = power+se, yminus = power-se,
xlab = bquote(theta))
lines(mu, power, lty=3)
detach(package:Hmisc)
par(ask = FALSE)
It claculates width of the confidence interval for proportion ( threedifferent hypothesised values, 0.1, 0.25 and 0.5) for samples ofvarious sizes ( i.e., n=100, n=300, n=500 and n=1500). As sample size increase the width decreases, which is pretty expected.
PS: It is little advanced code, it saves output in a file name myfile in your workign directory. Its a guiding code for graduate students and those intrested in improving their R skill for various reasons.