Rebecca Johnston 15/01/2014
Set the seed so that the same random numbers can be regenerated:
set.seed(540) # choose any positive integer
rnorm(10) # random numbers from a normal distribution, default mean = 0, sd = 1
## [1] -1.6145 0.7340 2.0947 0.5009 1.7355 -2.0584 -1.0339 1.4914
## [9] -0.5068 -0.7001
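To see what "regenerate" means, resetting the same seed before a fresh draw reproduces it exactly (a quick sketch; firstDraw and secondDraw are throwaway names):
set.seed(540)
firstDraw <- rnorm(5)
set.seed(540)  # reset to the same seed...
secondDraw <- rnorm(5)  # ...and the same numbers come back
identical(firstDraw, secondDraw)  # TRUE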
Generate a matrix: store the two dimensions in variables, then use them in the call to rnorm() inside matrix().
n <- 10
B <- 4
(x <- matrix(rnorm(n * B), nrow = n))
## [,1] [,2] [,3] [,4]
## [1,] -0.008744 -0.8413 -1.2752 1.13066
## [2,] 0.108303 -1.0581 0.3885 -1.46676
## [3,] 0.806000 -0.2552 -0.1950 0.07621
## [4,] -1.504498 0.3213 1.4259 -0.11602
## [5,] -0.417628 0.7739 -0.6323 -0.73436
## [6,] 0.480787 -0.1932 0.6684 0.55222
## [7,] -0.168417 -1.7138 0.5225 -1.35783
## [8,] 0.760428 1.2975 -0.2879 0.66815
## [9,] 0.624589 1.0531 -0.3869 0.34295
## [10,] 0.580386 0.3417 0.3060 0.05395
str(x)
## num [1:10, 1:4] -0.00874 0.1083 0.806 -1.5045 -0.41763 ...
head(x)
## [,1] [,2] [,3] [,4]
## [1,] -0.008744 -0.8413 -1.2752 1.13066
## [2,] 0.108303 -1.0581 0.3885 -1.46676
## [3,] 0.806000 -0.2552 -0.1950 0.07621
## [4,] -1.504498 0.3213 1.4259 -0.11602
## [5,] -0.417628 0.7739 -0.6323 -0.73436
## [6,] 0.480787 -0.1932 0.6684 0.55222
tail(x)
## [,1] [,2] [,3] [,4]
## [5,] -0.4176 0.7739 -0.6323 -0.73436
## [6,] 0.4808 -0.1932 0.6684 0.55222
## [7,] -0.1684 -1.7138 0.5225 -1.35783
## [8,] 0.7604 1.2975 -0.2879 0.66815
## [9,] 0.6246 1.0531 -0.3869 0.34295
## [10,] 0.5804 0.3417 0.3060 0.05395
summary(x)
## V1 V2 V3 V4
## Min. :-1.504 Min. :-1.7138 Min. :-1.2752 Min. :-1.4668
## 1st Qu.:-0.128 1st Qu.:-0.6948 1st Qu.:-0.3622 1st Qu.:-0.5798
## Median : 0.294 Median : 0.0640 Median : 0.0555 Median : 0.0651
## Mean : 0.126 Mean :-0.0274 Mean : 0.0534 Mean :-0.0851
## 3rd Qu.: 0.614 3rd Qu.: 0.6659 3rd Qu.: 0.4890 3rd Qu.: 0.4999
## Max. : 0.806 Max. : 1.2975 Max. : 1.4259 Max. : 1.1307
The same matrix can be filled using a for loop:
x <- matrix(0, nrow = n, ncol = B)  # pre-allocate the matrix
for (j in 1:B) {
    x[, j] <- rnorm(n)  # fill one column (sample) per iteration
}
Gluing stuff together: Repetitive code is a sure sign you're attacking from the wrong angle. But the lack of scalability is the main problem here.
sample1 <- rnorm(n)
sample2 <- rnorm(n)
sample3 <- rnorm(n)
sample4 <- rnorm(n)
x <- cbind(sample1, sample2, sample3, sample4)
Gluing stuff together inside a for loop:
x <- rnorm(n)  # first sample
for (j in 1:(B - 1)) {
    x <- cbind(x, rnorm(n))  # grow x by one column per iteration
}
Create a matrix:
n <- 10
B <- 4
x <- matrix(rnorm(n * B), nrow = n)
Create matrix row and column names using sprintf():
rownames(x) <- sprintf("obs%02d", 1:n)
colnames(x) <- sprintf("samp%02d", 1:B)
x
## samp01 samp02 samp03 samp04
## obs01 -0.6625 3.42818 -1.31936 -0.100429
## obs02 0.7743 -0.52823 0.70759 0.001006
## obs03 -1.5737 0.47803 -0.42372 -0.552930
## obs04 -0.2075 1.34923 -0.75977 -0.209284
## obs05 -0.6732 -0.06336 0.12161 0.292004
## obs06 -0.4987 -0.23567 -1.33881 0.240990
## obs07 0.3075 -0.67465 -0.06423 0.952912
## obs08 0.2241 0.47812 1.54898 0.643902
## obs09 -0.5773 0.30075 0.51883 -1.265136
## obs10 1.1534 0.57059 2.36283 -0.779013
dim(x)
## [1] 10 4
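A side note on the format string: "%02d" zero-pads integers to two digits, which keeps the row and column names sorting in order. For example:
sprintf("obs%02d", c(1, 10))  # "obs01" "obs10"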
Compute sample means:
mean(x[, 2]) # sample mean for 2nd sample
## [1] 0.5103
colMeans(x)
## samp01 samp02 samp03 samp04
## -0.1734 0.5103 0.1354 -0.0776
apply(x, 2, mean) # same answer as colMeans: MARGIN = 2 means operate on columns
## samp01 samp02 samp03 samp04
## -0.1734 0.5103 0.1354 -0.0776
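The same apply() pattern works for any per-column summary, e.g. the standard deviation of each sample (a sketch; the values depend on the current x):
apply(x, 2, sd)  # per-sample standard deviations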
Exercise: Recall the claim that the expected value of the sample mean is the true mean. Compute the average of the 4 sample means we have. Is it (sort of) close to the true mean? Feel free to change n or B at any point.
mean(colMeans(x))
## [1] 0.09868
Yes, the average of the sample means is close to the true mean (the data were generated from rnorm with default arguments, so the true mean is 0).
Let's try other n and B values:
n <- 100
B <- 100
x <- matrix(rnorm(n * B), nrow = n)
mean(colMeans(x))
## [1] -0.01853
n <- 500
B <- 500
x <- matrix(rnorm(n * B), nrow = n)
mean(colMeans(x))
## [1] 0.0001131
n <- 1000
B <- 1000
x <- matrix(rnorm(n * B), nrow = n)
mean(colMeans(x))
## [1] 0.001404
It takes a while to compute, but the larger n and B get, the closer the average of the sample means gets to the true mean of 0.
Exercise: Recall the Weak Law of Large Numbers said that, as the sample size gets bigger, the distribution of the sample means gets more concentrated around the true mean. Recall also that the variance of the sample mean is equal to the true data-generating variance divided by the sample size n. Explore these probability facts empirically. Don't go crazy – just pick a few different sample sizes, compute sample means, and explore the variability of the sample means as a function of sample size.
The sample size is the number of rows and sample means are calculated by column; keep the number of samples B constant:
B <- 1000
x10 <- matrix(rnorm(10 * B), nrow = 10)
x100 <- matrix(rnorm(100 * B), nrow = 100)
x1000 <- matrix(rnorm(1000 * B), nrow = 1000)
x10000 <- matrix(rnorm(10000 * B), nrow = 10000)
x100000 <- matrix(rnorm(1e+05 * B), nrow = 1e+05)
Calculate sample means using colMeans:
xBar10 <- colMeans(x10)
xBar100 <- colMeans(x100)
xBar1000 <- colMeans(x1000)
xBar10000 <- colMeans(x10000)
xBar100000 <- colMeans(x100000)
Calculate the SD of the sample means using sd():
xBarSd10 <- sd(xBar10)
xBarSd100 <- sd(xBar100)
xBarSd1000 <- sd(xBar1000)
xBarSd10000 <- sd(xBar10000)
xBarSd100000 <- sd(xBar100000)
Combine into a table to summarise the findings. The true SEM is the true SD divided by sqrt(n); since the data-generating SD is 1, this is 1/sqrt(n):
cbind(sampSize = c(10, 100, 1000, 10000, 1e+05),
      trueSEM = 1 / sqrt(c(10, 100, 1000, 10000, 1e+05)),
      obsSEM = c(xBarSd10, xBarSd100, xBarSd1000, xBarSd10000, xBarSd100000))
## sampSize trueSEM obsSEM
## [1,] 1e+01 0.316228 0.307788
## [2,] 1e+02 0.100000 0.100917
## [3,] 1e+03 0.031623 0.033060
## [4,] 1e+04 0.010000 0.010212
## [5,] 1e+05 0.003162 0.003256
Jenny's more sophisticated solution, using more advanced R!!!
B <- 1000
n <- 10^(1:4)
names(n) <- paste0("n", n)
getSampleMeans <- function(n, B) colMeans(matrix(rnorm(n * B), nrow = n))
x <- sapply(n, getSampleMeans, B, simplify = FALSE)
cbind(sampSize = n,
trueSEM = 1 / sqrt(n),
obsSEM = sapply(x, sd),
sampMeanIQR = sapply(x, IQR), # IQR = inter-quartile range
sampMeanMad = sapply(x, mad)) # MAD = median absolute deviation
## sampSize trueSEM obsSEM sampMeanIQR sampMeanMad
## n10 10 0.31623 0.316264 0.43921 0.32625
## n100 100 0.10000 0.101797 0.14236 0.10578
## n1000 1000 0.03162 0.031464 0.04434 0.03279
## n10000 10000 0.01000 0.009741 0.01303 0.00957
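A note on simplify = FALSE: it keeps x as a named list with one length-B vector of sample means per sample size, which is what lets sapply(x, sd), sapply(x, IQR), and sapply(x, mad) return one value per sample size. A quick check of that structure:
sapply(x, length)  # each element holds the B = 1000 sample means for one sample size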
Exercise: Repeat the above for a different distribution. I have chosen the exponential distribution. Jenny's code is super clean and generalisable: we only need to change "rnorm" to "rexp".
B <- 1000
n <- 10^(1:4)
names(n) <- paste0("n", n)
getSampleMeans <- function(n, B) colMeans(matrix(rexp(n * B), nrow = n))
x <- sapply(n, getSampleMeans, B, simplify = FALSE)
cbind(sampSize = n,
trueSEM = 1 / sqrt(n),
obsSEM = sapply(x, sd),
sampMeanIQR = sapply(x, IQR), # IQR = inter-quartile range
sampMeanMad = sapply(x, mad)) # MAD = median absolute deviation
## sampSize trueSEM obsSEM sampMeanIQR sampMeanMad
## n10 10 0.31623 0.324958 0.40783 0.30025
## n100 100 0.10000 0.102322 0.12980 0.09641
## n1000 1000 0.03162 0.030318 0.04121 0.03057
## n10000 10000 0.01000 0.009602 0.01237 0.00921
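Note that the trueSEM column still uses 1/sqrt(n): rexp() defaults to rate = 1, for which the true standard deviation is 1/rate = 1. A quick empirical check (a sketch; the exact value varies between runs):
sd(rexp(1e+06))  # should be close to the true SD of 1/rate = 1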
Exercise: Generate a reasonably large sample from some normal distribution (it need not be standard normal!). Pick a threshold. What is the CDF at that threshold, i.e. what's the true probability of seeing an observation less than or equal to the threshold? Use your large sample to compute the observed proportion of observations that are less than or equal to the threshold. Are the two numbers sort of close? Hint: If x is a numeric vector, then mean(x <= threshold) computes the proportion of values less than or equal to the threshold.
# Define parameters
distSize <- 1000
distMean <- 100
distSd <- 50
# Generate large sample:
set.seed(10)
x <- rnorm(distSize, distMean, distSd)
head(x)
## [1] 100.94 90.79 31.43 70.04 114.73 119.49
# Choose a threshold:
thresh <- 138
# Expected CDF: use pnorm
(expCdf <- pnorm(thresh, distMean, distSd))
## [1] 0.7764
# Observed CDF: use the <= operator
(obsCdf <- mean(x <= thresh))
## [1] 0.766
obsCdf - expCdf
## [1] -0.01037
Yes the observed proportion and true probability are sort of close!
Exercise: Do the same for a different distribution. I've chosen the Binomial distribution. The parameters for rbinom and pbinom were confusing to figure out :S I think I got there in the end!
nObs <- 1000
sizeTrial <- 100
probTrial <- 0.8
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial))
## [1] 0.1314
(obsCdf <- mean(x <= thresh))
## [1] 0.163
obsCdf - expCdf
## [1] 0.03165
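For the record: in rbinom(), n is the number of observations to draw, while size (trials per observation) and prob (per-trial success probability) are shared with pbinom(). Each observation therefore has theoretical mean size * prob, which gives a quick sanity check on the sample (a sketch; the sample mean varies by draw):
sizeTrial * probTrial  # theoretical mean per observation: 100 * 0.8 = 80
mean(x)                # sample mean of the draws, should be close to 80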
Exercise: Do the same for a variety of sample sizes. Do the two numbers tend to be closer for larger samples?
# n = 100
nObs <- 100
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial))
## [1] 0.1314
(obsCdf <- mean(x <= thresh))
## [1] 0.13
obsCdf - expCdf
## [1] -0.001353
# n = 10000
nObs <- 10000
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial))
## [1] 0.1314
(obsCdf <- mean(x <= thresh))
## [1] 0.1341
obsCdf - expCdf
## [1] 0.002747
# n = 1000000
nObs <- 1e+06
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial))
## [1] 0.1314
(obsCdf <- mean(x <= thresh))
## [1] 0.1311
obsCdf - expCdf
## [1] -0.0002722
Yes, for larger sample sizes, the observed CDF approaches the expected CDF.
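The three near-identical blocks above are the kind of repetition flagged earlier; the same comparison could be collapsed with sapply() (a sketch, reusing sizeTrial, probTrial and thresh from above; nObsVec is a new name introduced here):
nObsVec <- c(100, 10000, 1e+06)
sapply(nObsVec, function(nObs) {
    x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
    mean(x <= thresh) - pbinom(thresh, sizeTrial, probTrial)  # obsCdf - expCdf
})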
Exercise: Instead of focusing on values less than the threshold, focus on values greater than the threshold.
# Within pbinom, use lower.tail = FALSE to get the upper tail probability
nObs <- 10000
sizeTrial <- 100
probTrial <- 0.8
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
thresh <- 75
(expCdf <- pbinom(thresh, sizeTrial, probTrial, lower.tail = FALSE))
## [1] 0.8686
(obsCdf <- mean(x >= thresh))
## [1] 0.9117
obsCdf - expCdf
## [1] 0.04305
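Note that pbinom(thresh, ..., lower.tail = FALSE) returns P(X > thresh), a strict inequality, whereas mean(x >= thresh) includes the threshold itself; since P(X = 75) is not negligible here, that accounts for most of the gap above. To compare like with like, use a strict inequality on the observed side too (a sketch):
(obsCdf <- mean(x > thresh))  # strictly greater, to match lower.tail = FALSE
obsCdf - expCdf               # should now be much closer to zero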
Exercise: Instead of focusing on tail probabilities, focus on the probability of the observed values falling in an interval.
nObs <- 10000
sizeTrial <- 100
probTrial <- 0.8
x <- rbinom(n = nObs, size = sizeTrial, prob = probTrial)
lowThresh <- 73
uppThresh <- 79
mean(x >= lowThresh & x <= uppThresh)
## [1] 0.4032
pbinom(uppThresh, sizeTrial, probTrial) - pbinom(lowThresh, sizeTrial, probTrial)
## [1] 0.3847
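Careful with the endpoints: mean(x >= lowThresh & x <= uppThresh) estimates P(73 <= X <= 79), while pbinom(uppThresh, ...) - pbinom(lowThresh, ...) is P(73 < X <= 79) and drops P(X = 73). Subtracting the CDF at lowThresh - 1 instead gives the matching theoretical probability (a sketch):
pbinom(uppThresh, sizeTrial, probTrial) - pbinom(lowThresh - 1, sizeTrial, probTrial)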
CLT: central limit theorem. Let's use ggplot2 instead of lattice.
library("ggplot2")
Empirical distribution of sample means for various sample sizes:
B <- 1000
n <- round(10^(seq(from = 1, to = 2.5, length = 4)), 0)
names(n) <- paste0("n", n)
getSampleMeans <- function(n, B) colMeans(matrix(rnorm(n * B), nrow = n))
x <- data.frame(sapply(n, getSampleMeans, B))
Put the data in tall format for ggplot2. I normally use the reshape function, but Jenny does it manually (see the reshape2 sketch after the output below):
xTallSkinny <- stack(x)
names(xTallSkinny) <- c("x", "n")
xTallSkinny$n <- factor(xTallSkinny$n, levels = colnames(x))
head(xTallSkinny)
## x n
## 1 -0.20109 n10
## 2 0.33789 n10
## 3 -0.17150 n10
## 4 0.52432 n10
## 5 -0.28495 n10
## 6 0.06774 n10
tail(xTallSkinny)
## x n
## 3995 0.01229 n316
## 3996 -0.04354 n316
## 3997 0.01176 n316
## 3998 -0.02778 n316
## 3999 -0.01335 n316
## 4000 -0.03478 n316
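For reference, the same tall format can also be produced with melt() from the reshape2 package (a sketch, assuming reshape2 is installed; xTall2 is a throwaway name):
library("reshape2")
xTall2 <- melt(x, variable.name = "n", value.name = "x")  # all columns become measure variables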
Make ggplot:
# Add density layer
p <- ggplot(xTallSkinny, aes(x, group = n, colour = n)) + geom_density()
# Add strip plot layer
p <- p + geom_point(aes(x, y = -0.5, group = n, colour = n),
                    position = position_jitter(height = 0.1), alpha = 0.3)
# Add labels
p <- p + xlab("Sample means") + ylab("Density") +
    ggtitle("Exploring the Central Limit Theorem") +
    theme(plot.title = element_text(face = "bold"))
p