C6 Wk2 L02 Variance

Population variance

By definition, population variance indicates a distribution of the centeredness (the mean) of all population data.

Population variance, sigma^2, or Var(X)
= E((X - mu)^2) = E(X^2) - (E(X)^2)
= sum(X^2 * P(X)) - (sum(X * P(X))^2)
= 1 / n * sum((Xi - Xbar)^2)

where n = population size

# Population variance of dice-throw outcome.

Pdice <- 1/6
dice_outcome <- 1:6 # Xi's
# Compute population mean, mu = sum(X * P(X))
mu <- Pdice * sum(dice_outcome)
var_pop <- sum(dice_outcome^2 * Pdice) - mu^2
var_pop

## [1] 2.916667

The population variance is 2.9166667.

Note: you need to know the probability distribution of all the values the variable can take, in order to calculate the true population variance.

Sample variance

Sample variance is a measure of the distribution of the centeredness (the mean) of the collected sample data.

Sample variance, s2
= 1/(n-1) * sum((Xi - Xbar)^2)

where n = sample size

# Generate 10 random outcomes between 1 and 6 from 10 throws.
# Note: runif(10, 1, 6) generates 10 random numbers that are NOT integers.
# Note: sample(x, n, replace = F), for n greater than the number of choices possible in x, replace must = T. Otherwise you would've picked all the balls from the bag, and need to pick n-length(x) more, but can't.
samplesize <- 10
dice_outcome_sample <- sample(1:6, samplesize, replace = T)
dice_outcome_sample

##  [1] 4 2 6 4 1 4 6 5 1 2

Xbar = mean(dice_outcome_sample)
Xbar

## [1] 3.5

var_sample = 1/(samplesize-1) * sum((dice_outcome_sample - Xbar)^2)
var_sample

## [1] 3.611111

The sample variance for this 10-throw experiment is 3.6111111.

Note: sample variance is calculated from the data you collect. You don’t have to know how likely it is to get these collected values in the true population. The purpose of sampling is to enable you to estimate those values.

Sample variance: Larger sample sizes

samplesize <- 50
dice_outcome_sample <- sample(1:6, samplesize, replace = T)
dice_outcome_sample

##  [1] 5 5 2 1 6 2 3 1 3 2 5 1 2 1 3 4 4 5 6 1 2 5 5 6 5 3 3 3 4 6 1 2 1 5 1
## [36] 3 4 3 6 4 3 3 4 3 6 1 1 6 3 6

Xbar = mean(dice_outcome_sample)
Xbar

## [1] 3.4

var_sample = 1/(samplesize-1) * sum((dice_outcome_sample - Xbar)^2)
var_sample

## [1] 3.020408

The sample variance for this 50-throw experiment is 3.0204082.

A single, larger sample does not affect the variance of its data.

Relationship between multiple samples and population variance

The variance of a single 50-throw experiment is the property of the sample data. Therefore, from sampling to sampling, the variance will change (3.21, 3.28, 2.84…), and will follow its own distribution.

If you repeat this 50-throw experiment many times, and calculate the sample variance every time, you will find the average of those sample converges to 2.9166667. This is the true distribution of all data from the population.

exptrpts <- 100
samplesize <- 10
var_sample2 <- numeric(exptrpts)

for(i in 1:exptrpts) {
    dice_outcome_sample <- sample(1:6,
                                  samplesize,
                                  replace = T)
    Xbar = mean(dice_outcome_sample)
    var_sample[i] = 1/(samplesize-1) * sum((dice_outcome_sample - Xbar)^2)
}
var_sample

##   [1] 0.6777778 1.8222222 2.4888889 2.2333333 2.9000000 2.6777778 2.7666667
##   [8] 4.1777778 3.4333333 3.2888889 2.9333333 2.3222222 3.7888889 3.3777778
##  [15] 1.8222222 2.6777778 2.2333333 2.2666667 4.0444444 4.5000000 2.3222222
##  [22] 2.1777778 1.8777778 3.7777778 2.4555556 5.2111111 3.7333333 2.7111111
##  [29] 5.0666667 4.2777778 3.5666667 2.9888889 2.2333333 1.5111111 3.2111111
##  [36] 4.1777778 3.1222222 3.5666667 2.9000000 4.4555556 4.2222222 2.3222222
##  [43] 3.4333333 2.4888889 2.5444444 2.1000000 1.7333333 2.8444444 4.2666667
##  [50] 4.2666667 4.4000000 3.3777778 3.8333333 2.0111111 3.1555556 3.7777778
##  [57] 2.8444444 2.0444444 2.0111111 3.2111111 4.4000000 2.4888889 2.5444444
##  [64] 3.4333333 3.6000000 3.1555556 5.2888889 2.0444444 2.6222222 4.9000000
##  [71] 2.2333333 4.1777778 3.6000000 2.2666667 3.2888889 4.4000000 2.9888889
##  [78] 3.5111111 2.7111111 2.4000000 2.1000000 4.9000000 2.5444444 1.5555556
##  [85] 3.2888889 2.4555556 4.0444444 4.5444444 3.4333333 3.4333333 4.9333333
##  [92] 2.9333333 2.7111111 3.5111111 2.7111111 2.0000000 2.7222222 3.1222222
##  [99] 2.6666667 3.5666667

mean(var_sample)

## [1] 3.119

The mean variance for 100 such 10-throw experiments is 3.119.

This value is much closer to 2.9166667, the true population data variance.

var_var_sample <- 1/(exptrpts-1) * sum((var_sample - mean(var_sample))^2)
var_var_sample

## [1] 0.8762247

And the variance of sample variance for 100 such 10-throw experiments is 0.8762247.

Relationship between multiple samples and population variance: Larger sample sizes

exptrpts <- 100
samplesize <- 50
var_sample2 <- numeric(exptrpts)

for(i in 1:exptrpts) {
    dice_outcome_sample <- sample(1:6,
                                  samplesize,
                                  replace = T)
    Xbar = mean(dice_outcome_sample)
    var_sample[i] = 1/(samplesize-1) * sum((dice_outcome_sample - Xbar)^2)
}
mean(var_sample)

## [1] 2.874698

The mean variance for 100 such 50-throw experiments is 2.874698.

This value is also much closer to 2.9166667, the true population data variance.

var_var_sample <- 1/(exptrpts-1) * sum((var_sample - mean(var_sample))^2)
var_var_sample

## [1] 0.1494661

And the variance of sample variance for 100 such 50-throw experiments is 0.1494661. It’s smaller than a 10-throw experiment.

Note: As detailed below mathematically, the larger the sample size each time sampling is done, the narrower (more “accurate”) is your estimation for the true population data spread (i.e. population variance).

Relationship between multiple samples and population variance: Larger experimental repeats (more sampling)

exptrpts <- 1000
samplesize <- 10
var_sample2 <- numeric(exptrpts)

for(i in 1:exptrpts) {
    dice_outcome_sample <- sample(1:6,
                                  samplesize,
                                  replace = T)
    Xbar = mean(dice_outcome_sample)
    var_sample[i] = 1/(samplesize-1) * sum((dice_outcome_sample - Xbar)^2)
}
mean(var_sample)

## [1] 2.908244

The mean variance for 1000 such 10-throw experiments is 2.9082444.

var_var_sample <- 1/(exptrpts-1) * sum((var_sample - mean(var_sample))^2)
var_var_sample

## [1] 0.7872578

And the variance of sample variance for 1000 such 10-throw experiments is 0.7872578. It’s not smaller. The variance from sampling to sampling depends on sample size, not on the number of times you sample!

Variance of sample mean

We learnt how to use the variance (width) of respective samples to estimate true overall population data spread. I.e.,
- Even when each sampling gives rise to wide or narrow samples, over many samples, the average width will be close to the true overall data spread.
- Importantly, the width of each sample is measured by 1/(n-1)(distance from center)^2, not 1/n. - From sample to sample, the data spread will be more similar to each other, ie sample variances will have a small variance, if each sample has a large size.

Now we want to get a feeling for how to use the sample variance to measure a different property of the population - where its center lies, or its true mean, mu.
- Even when each sampling gives rise to different means (centers), over many samples, the average mean will be close to the true population mean: E(Xbar) = mu
- From sample to sample, the center will be more similar to each other, ie sample means will have a small variance, if each sample has a large size.
- Specifically, the means we calculate from samples will be closer to each other in value, more narrowly than the natural population spread (sigma^2) by a factor of sample size. - Therefore, in the absence of probability information for each value that the variable can take (e.g. 1/6 for each value a dice-throw can take), we can use the variance of samples’ means to estimate the true population spread: Var(Xbar) = s^2 = var_pop / n = sigma^2 / n

# Calculate a mean of variance like above.
exptrpts <- 1000
samplesize <- 10
var_sample2 <- numeric(exptrpts)
Xbar2 <- numeric(exptrpts)

for(i in 1:exptrpts) {
    dice_outcome_sample <- sample(1:6,
                                  samplesize,
                                  replace = T)
    Xbar2[i] = mean(dice_outcome_sample)
    var_sample[i] = 1/(samplesize-1) * sum((dice_outcome_sample - Xbar2[i])^2)
}
# This is a value close to population variance.
mean(var_sample)

## [1] 2.894511

# This is a value close to population mean.
mean(Xbar2)

## [1] 3.528

# Compute the variation (distribution, width) of the sample means.
var_mean_sample <- 1/(exptrpts-1) * sum((Xbar2 - mean(Xbar2))^2)
var_mean_sample

## [1] 0.3014575

# Compare with true variation of the population mean divided by sample size (10):
var_pop / samplesize

## [1] 0.2916667

From the above, you can tell, the sample mean values are narrower than the population data distribution (NOT population mean distribution, cuz population only has 1 mean by definition) by a factor of sample size. I.e.,

Var(Xbar) = var_pop / n

In this case,

0.3014575 approximately = 2.9166667 / 10

Try a different sample size:

# Calculate a mean of variance like above.
exptrpts <- 1000
samplesize <- 50
var_sample2 <- numeric(exptrpts)
Xbar2 <- numeric(exptrpts)

for(i in 1:exptrpts) {
    dice_outcome_sample <- sample(1:6,
                                  samplesize,
                                  replace = T)
    Xbar2[i] = mean(dice_outcome_sample)
    var_sample[i] = 1/(samplesize-1) * sum((dice_outcome_sample - Xbar2[i])^2)
}
# This is a value close to population variance.
mean(var_sample)

## [1] 2.91856

# This is a value close to population mean.
mean(Xbar2)

## [1] 3.5135

# Compute the variation (distribution, width) of the sample means.
var_mean_sample <- 1/(exptrpts-1) * sum((Xbar2 - mean(Xbar2))^2)
var_mean_sample

## [1] 0.05472407

# Compare with true variation of the population mean divided by sample size (50):
var_pop / samplesize

## [1] 0.05833333

Again,
Var(Xbar) = var_pop / n
0.0547241 approximately = 2.9166667 / 50

Standard errors

The standard error of mean is basically sqrt of the variance within the samples’ means:

Standard error SE
= sqrt(s2) = sqrt(var_mean_sample)
= sqrt(var_pop / n) = sigma / sqrt(n)

where n = sample size

Eg 1: Standard normal distribution

If the population is a population of standard normal numbers, then
- mu = 0 - sigma = 1 (var_pop = 1)

nosim <- 1000
n <- 10
# Make a matrix of 1000 rows, with 10000 entries (i.e. 10 cols).
# Apply the function mean() across the matrix's rows. So we get 1000 means.
# Calculates the standard deviation of the 1000 numbers. Uses (n-1) as the denominator, just like calculating sample variance.
sd(apply(matrix(rnorm(nosim * n), nosim), 1, mean))

## [1] 0.3230137

# This is the same as:
exptrpts <- nosim
Xbar3 <- apply(matrix(rnorm(nosim * n), nosim), 1, mean)
sd(Xbar3)

## [1] 0.3148778

# Which is the same as:
var_mean_sample <- 1/(exptrpts-1) * sum((Xbar3 - mean(Xbar3))^2)
sqrt(var_mean_sample)

## [1] 0.3148778

# Compared to what you'd expect for how 10-size-samples' means are like for a standard normal distribution (Var(Xbar) = sigma / n = 1 / 10)
sqrt(1/n)

## [1] 0.3162278

Eg 2: Standard uniform distribution

If the population is a population of standard uniforms,
mean = (a + b) / 2
variance = (b - a)^2 / 12 sample variance, s^2 = (b - a)^2 / 12 / n

nosim <- 1000
n <- 10
# Use runif() to generate a sample of 10 of uniform distribution data points.
# In this case, runif(n, a, b) sets a, b to 0, 1 by default.
# Therefore, mean = 1/2, var = 1/12; var_mean_sample = 1/12 / n
Xbar4 <- apply(matrix(runif(nosim * n), nosim), 1, mean)

# Use back my convention:
exptrpts <- nosim

# Method 1
sd(Xbar4)

## [1] 0.09150116

# Method 2
sqrt(1/(exptrpts-1) * sum((Xbar4 - mean(Xbar4))^2))

## [1] 0.09150116

# Method 3
sqrt(1/12 / n)

## [1] 0.09128709

Eg 3: Poisson distribution

If the population follows a Poisson distribution, mean = variance = lambda

# Simulate Poisson 4. Generate 1000 10-sized-samples, and get each of these 1000 averages.
# rpois(n, lambda) generates n numbers that follow Poisson-4 distribution
nosim <- 1000
n <- 10
lambda <- 4
Xbar5 <- apply(matrix(rpois(nosim * n, lambda), nosim), 1, mean)

# Use back my convention:
exptrpts <- nosim

# Method 1
sd(Xbar5)

## [1] 0.6058243

# Method 2
sqrt(1/(exptrpts-1) * sum((Xbar5 - mean(Xbar5))^2))

## [1] 0.6058243

# Method 3
sqrt(lambda / n)

## [1] 0.6324555

Eg 4: Binomial distribution

If the population is binomial (0 or 1 like a coin flip), with probability p for generating either 0 or 1, mu = p or 1-p variance = p(1 - p)

# Assume fair coin; p = 0.5
nosim <- 1000
n <- 10
p <- 0.5
Xbar6 <- apply(matrix(sample(0:1, nosim * n, replace = T), nosim), 1, mean)

# Use back my convention:
exptrpts <- nosim

# Method 1
sd(Xbar6)

## [1] 0.1610374

# Method 2
sqrt(1/(exptrpts-1) * sum((Xbar6 - mean(Xbar6))^2))

## [1] 0.1610374

# Method 3
sqrt(p * (1-p) / n)

## [1] 0.1581139

More eg

library(UsingR); data(father.son);

## Loading required package: MASS

## Warning: package 'MASS' was built under R version 3.1.3

## Loading required package: HistData
## Loading required package: Hmisc

## Warning: package 'Hmisc' was built under R version 3.1.3

## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula

## Warning: package 'Formula' was built under R version 3.1.3

## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## 
## 
## Attaching package: 'UsingR'
## 
## The following object is masked from 'package:ggplot2':
## 
##     movies
## 
## The following object is masked from 'package:survival':
## 
##     cancer

x <- father.son$sheight # grab son's height
n <- length(x) # n is the number of observations
hist(x) # histogram plots distribution of sons' heights

round(c(var(x), var(x) / n, sd(x), sd(x) / sqrt(n)), 2) # Calculate the pop and sample variances

## [1] 7.92 0.01 2.81 0.09

2.8147016 is the variability in this single sample of sons’ heights. 0.0857281 is the variability in n-sized-samples of sons’ heights drawn from the population. It’s the distribution of heights you would expect if you take n-sons from the population at a time, and measured the n-averages. Notice that we used the data from just ONE n-sample, not the average data from many n-samples!