STAT540 Seminar 02: Playing with probability distributions and simulations

Rebecca Johnston 15/01/2014

Generate random numbers from a normal distribution
Generate lots of data, the R way
Generate lots of data, the awkward ways
Compute sample means and explore
Compare probabilities and observed relative frequencies
Explore the distribution of sample means and the CLT

Generate random numbers from a normal distribution

Set seed to regenerate the same random numbers

set.seed(540)  # choose any positive integer
rnorm(10)  # random numbers from a normal distribution, default mean = 0, sd = 1

##  [1] -1.6145  0.7340  2.0947  0.5009  1.7355 -2.0584 -1.0339  1.4914
##  [9] -0.5068 -0.7001

Generate lots of data, the R way

Generate a matrix: specify 2 dimensions, each named under a variable. Use these 2 variables within the rnorm arg together with matrix arg

n <- 10
B <- 4
(x <- matrix(rnorm(n * B), nrow = n))

##            [,1]    [,2]    [,3]     [,4]
##  [1,] -0.008744 -0.8413 -1.2752  1.13066
##  [2,]  0.108303 -1.0581  0.3885 -1.46676
##  [3,]  0.806000 -0.2552 -0.1950  0.07621
##  [4,] -1.504498  0.3213  1.4259 -0.11602
##  [5,] -0.417628  0.7739 -0.6323 -0.73436
##  [6,]  0.480787 -0.1932  0.6684  0.55222
##  [7,] -0.168417 -1.7138  0.5225 -1.35783
##  [8,]  0.760428  1.2975 -0.2879  0.66815
##  [9,]  0.624589  1.0531 -0.3869  0.34295
## [10,]  0.580386  0.3417  0.3060  0.05395

str(x)

##  num [1:10, 1:4] -0.00874 0.1083 0.806 -1.5045 -0.41763 ...

head(x)

##           [,1]    [,2]    [,3]     [,4]
## [1,] -0.008744 -0.8413 -1.2752  1.13066
## [2,]  0.108303 -1.0581  0.3885 -1.46676
## [3,]  0.806000 -0.2552 -0.1950  0.07621
## [4,] -1.504498  0.3213  1.4259 -0.11602
## [5,] -0.417628  0.7739 -0.6323 -0.73436
## [6,]  0.480787 -0.1932  0.6684  0.55222

tail(x)

##          [,1]    [,2]    [,3]     [,4]
## [5,]  -0.4176  0.7739 -0.6323 -0.73436
## [6,]   0.4808 -0.1932  0.6684  0.55222
## [7,]  -0.1684 -1.7138  0.5225 -1.35783
## [8,]   0.7604  1.2975 -0.2879  0.66815
## [9,]   0.6246  1.0531 -0.3869  0.34295
## [10,]  0.5804  0.3417  0.3060  0.05395

summary(x)

##        V1               V2                V3                V4         
##  Min.   :-1.504   Min.   :-1.7138   Min.   :-1.2752   Min.   :-1.4668  
##  1st Qu.:-0.128   1st Qu.:-0.6948   1st Qu.:-0.3622   1st Qu.:-0.5798  
##  Median : 0.294   Median : 0.0640   Median : 0.0555   Median : 0.0651  
##  Mean   : 0.126   Mean   :-0.0274   Mean   : 0.0534   Mean   :-0.0851  
##  3rd Qu.: 0.614   3rd Qu.: 0.6659   3rd Qu.: 0.4890   3rd Qu.: 0.4999  
##  Max.   : 0.806   Max.   : 1.2975   Max.   : 1.4259   Max.   : 1.1307

Generate lots of data, the awkward ways

For loop:

x <- matrix(0, nrow = n, ncol = B)
for (j in 1:B) {
    x[, j] <- rnorm(n)
}

Gluing stuff together: Repetitive code is a sure sign you're attacking from the wrong angle. But the lack of scalability is the main problem here.

sample1 <- rnorm(n)
sample2 <- rnorm(n)
sample3 <- rnorm(n)
sample4 <- rnorm(n)
x <- cbind(sample1, sample2, sample3, sample4)

Gluing stuff together inside a for loop:

x <- rnorm(n)
for (j in 1:(B - 1)) {
    x <- cbind(x, rnorm(n))
}

Compute sample means and explore

Create a matrix:

n <- 10
B <- 4
x <- matrix(rnorm(n * B), nrow = n)

Create matrix row and column names using sprintf arg.

sprintf = C-style command, returns a character vector containing a formatted combination of text and variable values, e.g. %02d = format number with up to 2 leading zeroes

rownames(x) <- sprintf("obs%02d", 1:n)
colnames(x) <- sprintf("samp%02d", 1:B)
x

##        samp01   samp02   samp03    samp04
## obs01 -0.6625  3.42818 -1.31936 -0.100429
## obs02  0.7743 -0.52823  0.70759  0.001006
## obs03 -1.5737  0.47803 -0.42372 -0.552930
## obs04 -0.2075  1.34923 -0.75977 -0.209284
## obs05 -0.6732 -0.06336  0.12161  0.292004
## obs06 -0.4987 -0.23567 -1.33881  0.240990
## obs07  0.3075 -0.67465 -0.06423  0.952912
## obs08  0.2241  0.47812  1.54898  0.643902
## obs09 -0.5773  0.30075  0.51883 -1.265136
## obs10  1.1534  0.57059  2.36283 -0.779013

dim(x)

## [1] 10  4

Compute sample means:

mean(x[, 2])  # sample mean for 2nd sample

## [1] 0.5103

colMeans(x)

##  samp01  samp02  samp03  samp04 
## -0.1734  0.5103  0.1354 -0.0776

apply(x, 2, mean)  # same answer as colMeans, as 2 in 'apply' denotes by column

##  samp01  samp02  samp03  samp04 
## -0.1734  0.5103  0.1354 -0.0776

Exercise: Recall the claim that the expected value of the sample mean is the true mean. Compute the average of the 4 sample means we have. Is it (sort of) close to the true mean? Feel free to change n or B at any point.

mean(colMeans(x))

## [1] 0.09868

Yes, the average of the sample means is close to the true mean (since generated from rnorm with defaults, the true mean is 0).

Let's try other n and b values:

n <- 100
B <- 100
x <- matrix(rnorm(n * B), nrow = n)
mean(colMeans(x))

## [1] -0.01853


n <- 500
B <- 500
x <- matrix(rnorm(n * B), nrow = n)
mean(colMeans(x))

## [1] 0.0001131


n <- 1000
B <- 1000
x <- matrix(rnorm(n * B), nrow = n)
mean(colMeans(x))

## [1] 0.001404

Takes a while to compute, but the larger the sample size, the closer the sample mean to the true mean.

Exercise: Recall the Weak Law of Large Numbers said that, as the sample size gets bigger, the distribution of the sample means gets more concentrated around the true mean. Recall also that the variance of the sample mean is equal to the true data-generating variance divided by the sample size n. Explore these probability facts empirically. Don't go crazy – just pick a few different sample sizes, compute sample means, and explore the variability of the sample means as a function of sample size.

Sample size defined by number of rows, sample means calculated by column - keep this value constant:

B <- 1000
x10 <- matrix(rnorm(10 * B), nrow = 10)
x100 <- matrix(rnorm(100 * B), nrow = 100)
x1000 <- matrix(rnorm(1000 * B), nrow = 1000)
x10000 <- matrix(rnorm(10000 * B), nrow = 10000)
x100000 <- matrix(rnorm(1e+05 * B), nrow = 1e+05)

Calculate sample means using colMeans:

xBar10 <- colMeans(x10)
xBar100 <- colMeans(x100)
xBar1000 <- colMeans(x1000)
xBar10000 <- colMeans(x10000)
xBar100000 <- colMeans(x100000)

Calculate SD of colMeans using sd(colMeans)

xBarSd10 <- sd(colMeans(x10))
xBarSd100 <- sd(colMeans(x100))
xBarSd1000 <- sd(colMeans(x1000))
xBarSd10000 <- sd(colMeans(x10000))
xBarSd100000 <- sd(colMeans(x100000))

Generate dataframe to summarise findings:

cbind(sampSize = c(10, 100, 1000, 10000, 1e+05), trueSEM = 1/sqrt(c(10, 100, 
    1000, 10000, 1e+05)), obsSEM = c(xBarSd10, xBarSd100, xBarSd1000, xBarSd10000, 
    xBarSd100000))

##      sampSize  trueSEM   obsSEM
## [1,]    1e+01 0.316228 0.307788
## [2,]    1e+02 0.100000 0.100917
## [3,]    1e+03 0.031623 0.033060
## [4,]    1e+04 0.010000 0.010212
## [5,]    1e+05 0.003162 0.003256

Jenny's more sophisticated solution, using more advanced R!!!

B <- 1000
n <- 10^(1:4)
names(n) <- paste0("n", n)
getSampleMeans <- function(n, B) colMeans(matrix(rnorm(n * B), nrow = n))
x <- sapply(n, getSampleMeans, B, simplify = FALSE)
cbind(sampSize = n,
      trueSEM = 1 / sqrt(n),
      obsSEM = sapply(x, sd),
      sampMeanIQR = sapply(x, IQR), # IQR = inter-quartile range
      sampMeanMad = sapply(x, mad)) # MAD = median absolute deviation

##        sampSize trueSEM   obsSEM sampMeanIQR sampMeanMad
## n10          10 0.31623 0.316264     0.43921     0.32625
## n100        100 0.10000 0.101797     0.14236     0.10578
## n1000      1000 0.03162 0.031464     0.04434     0.03279
## n10000    10000 0.01000 0.009741     0.01303     0.00957

Exercise: Repeat the above for a different distribution I have chosen the exponential distribution. Jenny's code is super clean and generalisable: only need to change “rnorm” to “rexp”

B <- 1000
n <- 10^(1:4)
names(n) <- paste0("n", n)
getSampleMeans <- function(n, B) colMeans(matrix(rexp(n * B), nrow = n))
x <- sapply(n, getSampleMeans, B, simplify = FALSE)
cbind(sampSize = n,
      trueSEM = 1 / sqrt(n),
      obsSEM = sapply(x, sd),
      sampMeanIQR = sapply(x, IQR), # IQR = inter-quartile range
      sampMeanMad = sapply(x, mad)) # MAD = median absolute deviation

##        sampSize trueSEM   obsSEM sampMeanIQR sampMeanMad
## n10          10 0.31623 0.324958     0.40783     0.30025
## n100        100 0.10000 0.102322     0.12980     0.09641
## n1000      1000 0.03162 0.030318     0.04121     0.03057
## n10000    10000 0.01000 0.009602     0.01237     0.00921

Compare probabilities and observed relative frequencies

Exercise: Generate a reasonably large sample from some normal distribution (it need not be standard normal!). Pick a threshold. What is the CDF at that threshold, i.e. what's the true probability of seeing an observation less than or equal to the threshold? Use your large sample to compute the observed proportion of observations that are less than threshold. Are the two numbers sort of close? Hint: If x is a numeric vector, then mean(x <= threshold) computes the proportion of values less than or equal to threshold

# Define parameters
distSize <- 1000
distMean <- 100
distSd <- 50

# Generate large sample:
set.seed(10)
x <- rnorm(distSize, distMean, distSd)
head(x)

## [1] 100.94  90.79  31.43  70.04 114.73 119.49

# Choose a threshhold:
thresh <- 138

# Expected CDF: use pnorm
(expCdf <- pnorm(thresh, distMean, distSd))

## [1] 0.7764

# Observed CDF: use less than or equal to symbol
(obsCdf <- mean(x <= thresh))

## [1] 0.766

obsCdf - expCdf

## [1] -0.01037

Yes the observed proportion and true probability are sort of close!

Exercise: Do the same for a different distribution. I've chosen the Binomial distribution. The parameters for rbinom and pbinom were confusing to figure out :S I think I got there in the end!

nObs <- 1000
sizeTrial <- 100
probTrial <- 0.8
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial))

## [1] 0.1314

(obsCdf <- mean(x <= thresh))

## [1] 0.163

obsCdf - expCdf

## [1] 0.03165

Exercise: Do the same for a variety of sample sizes. Do the two numbers tend to be closer for larger samples?

# n = 100
nObs <- 100
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial))

## [1] 0.1314

(obsCdf <- mean(x <= thresh))

## [1] 0.13

obsCdf - expCdf

## [1] -0.001353


# n = 10000
nObs <- 10000
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial))

## [1] 0.1314

(obsCdf <- mean(x <= thresh))

## [1] 0.1341

obsCdf - expCdf

## [1] 0.002747


# n = 1000000
nObs <- 1e+06
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial))

## [1] 0.1314

(obsCdf <- mean(x <= thresh))

## [1] 0.1311

obsCdf - expCdf

## [1] -0.0002722

Yes, for larger sample sizes, the observed CDF approaches the expected CDF

Exercise: Instead of focusing on values less than the threshold, focus on values greater than the threshold.

# Within rbinom and pbinom args use lower.tail = FALSE
nObs <- 10000
sizeTrial <- 100
probTrial <- 0.8
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial, lower.tail = FALSE))

## [1] 0.8686

(obsCdf <- mean(x >= thresh))

## [1] 0.9117

obsCdf - expCdf

## [1] 0.04305

Exercise: Instead of focusing on tail probabilities, focus on the probability of the observed values falling in an interval

nObs <- 10000
sizeTrial <- 100
probTrial <- 0.8
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
lowThresh <- 73
uppThresh <- 79
mean(x >= lowThresh & x <= uppThresh)

## [1] 0.4032

pbinom(uppThresh, sizeTrial, probTrial) - pbinom(lowThresh, sizeTrial, probTrial)

## [1] 0.3847

Explore the distribution of sample means and the CLT

CLT: central limit theorem. Let's use ggplot2 instead of lattice.

library("ggplot2")

Empirical distribution of sample means for various sample sizes:

B <- 1000
n <- round(10^(seq(from = 1, to = 2.5, length = 4)), 0)
names(n) <- paste0("n", n)
getSampleMeans <- function(n, B) colMeans(matrix(rnorm(n * B), nrow = n))
x <- data.frame(sapply(n, getSampleMeans, B))

Put data in tall format for ggplot2. I normally use reshape function but Jenny does it manually:

xTallSkinny <- stack(x)
names(xTallSkinny) <- c("x", "n")
xTallSkinny$n <- factor(xTallSkinny$n, levels = colnames(x))
head(xTallSkinny)

##          x   n
## 1 -0.20109 n10
## 2  0.33789 n10
## 3 -0.17150 n10
## 4  0.52432 n10
## 5 -0.28495 n10
## 6  0.06774 n10

tail(xTallSkinny)

##             x    n
## 3995  0.01229 n316
## 3996 -0.04354 n316
## 3997  0.01176 n316
## 3998 -0.02778 n316
## 3999 -0.01335 n316
## 4000 -0.03478 n316

Make ggplot:

# Add histogram layer
p <- ggplot(xTallSkinny, aes(x, group = n, colour = n)) + geom_density()

# Add strip plot layer
p <- p + geom_point(aes(x, y = -0.5, group = n, colour = n), position = position_jitter(height = 0.1), 
    alpha = 0.3)

# Add labels
p <- p + xlab("Sample means") + ylab("Density") + ggtitle("Exploring the Central Limit Theorem") + 
    theme(plot.title = element_text(face = "bold"))
p

plot of chunk unnamed-chunk-25