Ch6GuidedNotesCode

loading Packages

require(latticeExtra)
require(mosaic)
require(Hmisc)

Example1:

Simulating Sampling Distribution of Mean for three different sample sizes, n=5, n=25 and n=100 from expoential distribution with mean 5 and visualising it using boxplots

lambda<-2
numsim<-100
mean5<-rep(0,numsim)
mean25<-mean5
mean100<-mean5

for (i in 1:numsim){
  mean5[i]<-mean(rexp(5,lambda))
  mean25[i]<-mean(rexp(25,lambda))
  mean100[i]<-mean(rexp(100,lambda))
  
}
boxplot(mean5,mean25,mean100,names=c("n=5","n=25","n=100"))
title("Distribution of Mean of Exponentials")

Look how the variance of the estimator is getting smaller with increase in sample size. By CLT the mean should have normal distribution or at least approximate normal, for large sample sizes, regardless of the orignating distribution from which samples came.

Example 2

Compare three estimators of the mean (\(\mu\)) of a standard normal distribution. Comparing sample mean, trimmed mean and medianas estimators of population mean.If the distirbution of data is symmetric (as Standard Normal distribution of our example) all three estimators indeed estimate population mean but if the population from which sample is coming from is skewed then they do not.

Step by step coversion of algorithm from guided notes into code

set.seed(1234)
# generate a matrix of size sXn ( s rows, n colmns)
# each row is a random normal sample of size n

s<-10# declaring # of rows

n<-3 #declaring sample size
mu<-0
x<-rnorm(s*n,mean=mu,sd=1) #generating random normal sample
#x

# filling up the matrix with random normal sample
mat<-matrix(x,nrow=s,ncol=n)
mat

##             [,1]        [,2]       [,3]
##  [1,] -1.2070657 -0.47719270  0.1340882
##  [2,]  0.2774292 -0.99838644 -0.4906859
##  [3,]  1.0844412 -0.77625389 -0.4405479
##  [4,] -2.3456977  0.06445882  0.4595894
##  [5,]  0.4291247  0.95949406 -0.6937202
##  [6,]  0.5060559 -0.11028549 -1.4482049
##  [7,] -0.5747400 -0.51100951  0.5747557
##  [8,] -0.5466319 -0.91119542 -1.0236557
##  [9,] -0.5644520 -0.83717168 -0.0151383
## [10,] -0.8900378  2.41583518 -0.9359486

#adding three columns to theright of the matrix to store sample mean, trimmed mean, median

#before doing that you need to allocate space in R's memory


theta1<-numeric(s) # a vector of size s where we will store mean of each sample
theta2<-numeric(s) # a vector of size s where we will store 20% trimmed mean of each sample
theta3<-numeric(s) # a vector of size s where we will store median of each sample

# Notice there are three estimators ( mean, trimmed mean and median) whose properties we are comparing
matnew<-cbind(mat,theta1,theta2,theta3)
#matnew # notice at this point the theta1,theta2,theta3 columns have 0s

# now we need to fill up the last three columns
# for that we will use apply function ( check out help on apply)

# Reminder ? apply will provide you details
# a generic form of doing things in r is 
# goal( ), goal is the generic function name and within brackets you give arguments, 
# arguments are the things R need to   perform that fucntion

matnew[,n+1]<-apply(matnew[,1:n],1,mean) # finding mean for each row/sample

#writing a small function to call trim mean in apply

trimmean<-function(Y)
{
mean(Y,0.2)
}

matnew[,n+2]<-apply(matnew[,1:n],1,trimmean) # finding trim mean for each sample/row

matnew[,n+3]<-apply(matnew[,1:n],1,median) # finding median for each sample/row

matnew_round<-round(matnew,4)
#matnew_round

# Calculating bias for each estimator 
#theroatically we all know sample mean is an unbiased estimator of population mean
#what about the other two estimators?

# finding expected value of each estimator ( E(thetahat) = overall mean)
expected.value.theta1<-mean(matnew[,n+1])
expected.value.theta1

## [1] -0.296425

# similarly you can do it for the other two estimators individually

# or you can write a for loop to do the same

expected.value.theta<-numeric(3) # 3 is the # of estimators we are comparing
bias<-numeric(3)

for (i in 1:3)
{
  expected.value.theta[i]<-mean(matnew[,n+i])
  bias[i]<- expected.value.theta[i]-mu 

}
#expected.value.theta
#bias
#colnames(expected.value.theta) <- c("theta1","theta2","theta3")

m2 <- cbind(c("theta1","theta2","theta3"),round(expected.value.theta,4),round(bias,4) )
m2

##      [,1]     [,2]      [,3]     
## [1,] "theta1" "-0.2964" "-0.2964"
## [2,] "theta2" "-0.2964" "-0.2964"
## [3,] "theta3" "-0.3902" "-0.3902"

Example 2 Using functions

Complete Code for example2, usign functions

Simulation to compare sampling properties of three different estimators for the mean of a distribution based on an iid sample of size n * sample mean * 20% trimmed mean (may work better with heavy tails) * median (may not work well for asymmetric distributions) * Note you can generate samples from gammma, normal, t to see the difference in estimators. A possible “fair” comparison would be to generate data from each of these distributions with the same mean and variance and see how the three methods perform on a relative basis under each condition

The code below has some funtions, which will come in handy for later coding while you are working on your assignment and final exam and are good basci building blocks for any simulation related project that you taken on. It has

simsum: Function to calculate summary statistics across the S data sets, using the geenric function apply. Please read its documentation for details.
generate.normal: Function to generate S data sets of size n from normal distribution with mean mu and variance sigma^2, using matrix approach, instead of for loop. Highly reccomended.
generate.gamma: Function to generate S data sets of size n from gamma distribution with mean mu, variance sigma^2 mu^2
generate.t: Function to generate S data sets of size n from a t distribution with df degrees of freedom centered at the # value mu (a t distribution has mean 0 and variance df/(df-2) for df>2)
trimmean: Function to compute the 20% trimmed mean. You can always chnge the trimming percentage.
Also, it calculates coverage probability. When we calculate confidence itnerval using certain method, in the begining we as certian confidence co-efficient, (example 99% or 95%), to check in practice, does it really attain that confidence level, we can calculate attainded confidence level, also known as covergae probability. Basically, we simulate S sampels, find CI for each sample and find the proportion of CIs which covered the true value of the mean.

#  function to view the first k lines of a data frame

view <- function(dat,k){
  
  message <- paste("First",k,"rows")
  krows <- dat[1:k,]
  cat(message,"\n","\n")
  print(krows)
  
}

#  function to calculate summary statistics across the S data sets 

simsum <- function(dat,trueval){
  
  S <- nrow(dat)
  
  MCmean <- apply(dat,2,mean)
  MCbias <- MCmean-trueval
  MCrelbias <- MCbias/trueval
  MCstddev <- sqrt(apply(dat,2,var))
  MCMSE <- apply((dat-trueval)^2,2,mean)  
  #   MCMSE <- MCbias^2 + MCstddev^2   # alternative lazy calculation

###########################
#textbook approach to calculate estimate of MSE and SE of MSE 
#mse.est <- mean(tmean^2) 
#se.mse <- sqrt(mean((tmean-mean(tmean))^2)) / sqrt(m)
###########################  
  
  
  MCRE <- MCMSE[1]/MCMSE
  
  sumdat <- rbind(rep(trueval,3),S,MCmean,MCbias,MCrelbias,MCstddev,MCMSE,
                  MCRE)
  names <- c("true value","# sims","MC mean","MC bias","MC relative bias",
             "MC standard deviation","MC MSE","MC relative efficiency")
  ests <- c("Sample mean","Trimmed mean","Median")
  
  dimnames(sumdat) <- list(names,ests)
  round(sumdat,5)
}

#  function to generate S data sets of size n from normal distribution with mean mu and variance sigma^2

generate.normal <- function(S,n,mu,sigma){
  
dat <- matrix(rnorm(n*S,mu,sigma),ncol=n,byrow=T) 
# Note for Normal data generation, we can get the data in one step like this, which requires no looping.  
# In more complex statistical models, looping is often required to set up each data set, because the scenario #is much more complicated. Below is a sample loop to get the same data as above; try running the program and 
# see how much longer it takes!
  
  #  dat <- NULL
  #
  #   for (i in 1:S){
  #
  #      Y <- rnorm(n,mu,sigma)
  #      dat <- rbind(dat,Y)
  #
  #   }
  
  out <- list(dat=dat)
  return(out)
}

#  function to generate S data sets of size n from gamma distribution with mean mu, variance sigma^2 mu^2

generate.gamma <- function(S,n,mu,sigma){
  
  a <- 1/(sigma^2)
  s <- mu/a
  
  dat <- matrix(rgamma(n*S,shape=a,scale=s),ncol=n,byrow=T) 
  
  #  Alternative loop
  
  #   dat <- NULL
  #
  #   for (i in 1:S){
  #
  #      Y <- rgamma(n,shape=a,scale=s)
  #      dat <- rbind(dat,Y)
  #
  #   }
  
  out <- list(dat=dat)
  return(out)
}

# function to generate S data sets of size n from a t distribution with df degrees of freedom centered at the # value mu (a t distribution has mean 0 and variance df/(df-2) for df>2)

generate.t <- function(S,n,mu,df){
  
  dat <- matrix(mu + rt(n*S,df),ncol=n,byrow=T) 
  
  #   Alternative loop
  
  #   dat <- NULL
  #
  #   for (i in 1:S){
  #
  #      Y <- mu + rt(n,df)
  #      dat <- rbind(dat,Y)
  #
  #   }
  
  out <- list(dat=dat)
  return(out)
}


#  function to compute the 20% trimmed mean

trimmean <- function(Y){mean(Y,0.2)}

#  set the seed for the simulation

set.seed(3)

#  set number of simulated data sets and sample size

S <- 1000 

n <- 15

#  generate data  --Distribution choices are normal with mu,sigma
#  (rnorm), gamma (rgamma) and student t (rt)

#  a possible "fair" comparison would be to generate data from each of these distributions with the same mean and variance and see how the three methods perform on a relative basis under each condition

mu <- 1
sigma <- sqrt(5/3)

#   out <- generate.normal(S,n,mu,sigma)  # generate normal samples
#  out <- generate.gamma(S,n,mu,sigma)  # generate gamma samples 
out <- generate.t(S,n,mu,5)   # generate t_5 samples

#  get the sample means from each data set

outsampmean <- apply(out$dat,1,mean)

#  get the sample trimmed mean from each data set

outtrimmean <- apply(out$dat,1,trimmean)

#  get the sample median from each data set

outmedian <- apply(out$dat,1,median)

# now we can summarize, remember, mu is the true value of the mean for the true distribution of the data

# get the S=1000 estimates for each method

summary.sim <- data.frame(mean=outsampmean,trim=outtrimmean,
                          median=outmedian)

#  print out the first 5 of them (round to 4 decimal places)

#view(round(summary.sim,4),5)

#  get summary statistics for each estimator

results <- simsum(summary.sim,mu)

#############################################################

#  Some additional calculation for the sample mean

#  average of estimated standard errors for sample mean

#  usual standard error for sample mean from each data set

sampmean.ses <- sqrt(apply(out$dat,1,var)/n)

#  take the average

ave.sampmeanses <- mean(sampmean.ses)

#  coverage of usual confidence interval based on sample mean

t05 <- qt(0.975,n-1)

coverage <- sum((outsampmean-t05*sampmean.ses <= mu) & 
                  (outsampmean+t05*sampmean.ses >= mu))/S

coverage.probability<-coverage
results

##                        Sample mean Trimmed mean     Median
## true value                 1.00000      1.00000    1.00000
## # sims                  1000.00000   1000.00000 1000.00000
## MC mean                    0.98304      0.98499    0.98474
## MC bias                   -0.01696     -0.01501   -0.01526
## MC relative bias          -0.01696     -0.01501   -0.01526
## MC standard deviation      0.33067      0.31198    0.35016
## MC MSE                     0.10952      0.09746    0.12273
## MC relative efficiency     1.00000      1.12370    0.89238

coverage.probability

## [1] 0.948

Example 6.1 From textbook

This is an illustration of basic Monte Carlo estimation. We generate a sample of size 2 from normal distribution and target parameter is the absolute value of difference in means.

    m <- 1000
    g <- numeric(m)
    MSE.Num<-numeric(m)
    
    for (i in 1:m) {
        x <- rnorm(2)
        g[i] <- abs(x[1] - x[2])
        MSE.Num[i]<-(g[i]-mu)^2 # MSE calculation of the numerator
       
    }
    est <- mean(g)
    est

## [1] 1.152332

    bias.g<-est-0 # Monte Carlo Bias, note mu ==0

 bias.g

## [1] 1.152332

MC.SE<-sd(g) # MC Standard error
var.theor<-2-(4/pi) # theroetical variance
MC.SE

## [1] 0.8613968

var.theor

## [1] 0.7267605

MC.MSE<-sum(MSE.Num)/m # Monte Carlo MSE
MC.MSE.shortcut<-(bias.g)^2+(MC.SE)^2 # Short cut way to calculate MSE

MC.MSE

## [1] 0.7644676

MC.MSE.shortcut

## [1] 2.069875

Example 6.4 Confidence interval for variance

    n <- 20
    alpha <- .05
    x <- rnorm(n, mean=0, sd=2)
    UCL <- (n-1) * var(x) / qchisq(alpha, df=n-1)

Example 6.5 MC estimate of confidence level for variance.

Sampled from Normal. Uses techniqe from example 6.4 to build CI. Also, known as coverage probability for the interval estimate. It is one of the two main measures of goodness of an interval estimate. Other being the lenght of the interval, shorter the better.

    n <- 20
    alpha <- .05
    UCL <- replicate(1000, expr = {
        x <- rnorm(n, mean = 0, sd = 2)
        (n-1) * var(x) / qchisq(alpha, df = n-1)
        } )
    #count the number of intervals that contain sigma^2=4
    sum(UCL > 4)

## [1] 943

    #or compute the mean to get the confidence level
    mean(UCL > 4)

## [1] 0.943

Example 6.6 Empirical confidence level

CI for variance, when sample is from Chi-square, i.e., positively skewed population instead of symmetric mount shaped. Notice how drastically the covergae probability decreases. Its because such technique of building CI is good for samples coming from normal population. So you claimed 95% CI but it doesnt acheive what it claimed for!

    n <- 20
    alpha <- .05
    UCL <- replicate(1000, expr = {
        x <- rchisq(n, df = 2)
        (n-1) * var(x) / qchisq(alpha, df = n-1)
        } )
    sum(UCL > 4)

## [1] 784

    mean(UCL > 4)

## [1] 0.784

Example 6.7 (Empirical Type I error rate)

This example calcualtes acheived alpha. Its a way of testing goodness of a test. We want ot check if it actually acheives the alpha we claim it has! Here we draw samples from normal distribution. Sample size is small (smaller than 30) and sigma is unkown, hence we use t-test for testing for a particular value of mean. Its a right tailed hypothesis/test. It calculates empirically (via simulation) proportion of times the hypothesis was rejected, assuming H0 is true. Recall, alpha= P(type I error)== P(reject Ho|Ho is true). As can be seen from the test, it is able ot acheive claimed alpha.

    n <- 20
    alpha <- .05
    mu0 <- 500
    sigma <- 100

    m <- 10000          #number of replicates
    p <- numeric(m)     #storage for p-values
    for (j in 1:m) {
        x <- rnorm(n, mu0, sigma)
        ttest <- t.test(x, alternative = "greater", mu = mu0)
        p[j] <- ttest$p.value
        }

    p.hat <- mean(p < alpha)
    se.hat <- sqrt(p.hat * (1 - p.hat) / m)
    print(c(p.hat, se.hat))

## [1] 0.050200000 0.002183574

Example 6.9 (Empirical power)

This example calculates empirical power for various values of mu in alternate space. It also,plots a power curve for variaos values of mu in alternate space. Recall, power = 1-beta=1-P(type II error) = P(fail to reject Ho|Ho is false). This means we calclate Power under the assumption Ha is true, but we really dont know its value in alternate space.

PS: We are still testing for mean using t-test, as we are workign with small sample from normal distribution with unknown variance.

Note: Adding SE barrs to the curve is optional. But since you have the code now, it will make your output power curve look more illustrative, so why NOT!

    n <- 20
    m <- 1000
    mu0 <- 500
    sigma <- 100
    mu <- c(seq(450, 650, 10))  #alternatives
    M <- length(mu)
    power <- numeric(M)
    for (i in 1:M) {
        mu1 <- mu[i]
        pvalues <- replicate(m, expr = {
            #simulate under alternative mu1
            x <- rnorm(n, mean = mu1, sd = sigma)
            ttest <- t.test(x,
                     alternative = "greater", mu = mu0)
            ttest$p.value  } )
        power[i] <- mean(pvalues <= .05)
    }

    par(ask = TRUE)
    library(Hmisc)  #for errbar
    plot(mu, power)
    abline(v = mu0, lty = 1)
    abline(h = .05, lty = 1)

    #add standard errors
    se <- sqrt(power * (1-power) / m)
    errbar(mu, power, yplus = power+se, yminus = power-se,
        xlab = bquote(theta))
    lines(mu, power, lty=3)

    detach(package:Hmisc)
    par(ask = FALSE)

Example, when population proportion is the parameter of intrest!

It claculates width of the confidence interval for proportion ( threedifferent hypothesised values, 0.1, 0.25 and 0.5) for samples ofvarious sizes ( i.e., n=100, n=300, n=500 and n=1500). As sample size increase the width decreases, which is pretty expected.

PS: It is little advanced code, it saves output in a file name myfile in your workign directory. Its a guiding code for graduate students and those intrested in improving their R skill for various reasons.