Xing Su
February 19, 2015
Variance is a statistical measure of the spread of a given distribution.
For a discrete variable \( X \), the variance is calculated by
\[ \sigma^2 =\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 \]
where \( X_i \) represents the \( i \)-th observation, \( \bar X \) represents the mean of the observations, and \( n \) represents the number of observations.
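As a minimal sketch of the formula, the Python snippet below computes the variance by dividing the summed squared deviations by \( n \). The function name `population_variance` and the data values are illustrative assumptions, not taken from the Shiny app (which is written in R); the standard library's `statistics.pvariance` is used only as a cross-check.

```python
from statistics import pvariance

def population_variance(values):
    """Mean squared deviation from the mean, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

# Illustrative data (an assumption, not from the original app).
population = [2, 4, 4, 4, 5, 5, 7, 9]

sigma2 = population_variance(population)
reference = pvariance(population)  # standard-library equivalent
print(sigma2, reference)
```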
Since we rarely know the population statistics and are often given only a sample, we estimate the population variance from the sample statistics.
There are two ways of estimating the population variance from a sample of size \( n \) with sample mean \( \bar X \):
\[ S^2_{unbiased} = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1} ~~~\mbox{and}~~~ S^2_{biased} = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n} \]
The unbiased estimator is more commonly used because its expected value equals the population variance, whereas the biased estimator underestimates it on average. The only difference between the two calculations is the denominator, so why does dividing by \( n-1 \) make the estimator unbiased and better?
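A short derivation sketches the analytical answer. It assumes, beyond what the text states, that \( X_1, \dots, X_n \) are independent draws from a population with mean \( \mu \) and variance \( \sigma^2 \) (the symbol \( \mu \) is introduced here for illustration). Expanding around \( \mu \) gives
\[ \sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\bar X - \mu)^2 \]
Taking expectations, and using \( E[(X_i - \mu)^2] = \sigma^2 \) and \( E[(\bar X - \mu)^2] = \sigma^2 / n \),
\[ E\left[\sum_{i=1}^n (X_i - \bar X)^2\right] = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2 \]
so that
\[ E[S^2_{unbiased}] = \sigma^2 ~~~\mbox{and}~~~ E[S^2_{biased}] = \frac{n-1}{n}\sigma^2 \]
Measuring deviations from \( \bar X \) rather than \( \mu \) removes one \( \sigma^2 \) from the expected sum of squares, and dividing by \( n-1 \) compensates for exactly that.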
To show this empirically, we will use a Shiny application to simulate and analyze the variance estimates.
The simulation experiment performs the following steps:
1. Create a population distribution by drawing a number of observations from the values 1 to 20.
2. Draw a number of samples of a specified size from the population.
3. Compare the individual sample variances with the true population variance.
4. Show the effect of sample size on the accuracy of the estimated variance.
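The steps above can be sketched outside Shiny as follows. This is a Python illustration under stated assumptions, not the app's actual R code: the function name `simulate`, its parameter names and defaults, the fixed seed, and sampling with replacement are all hypothetical choices made for the example.

```python
import random
from statistics import pvariance, variance

def simulate(n_obs=1000, n_samples=2000, sample_size=5, seed=1):
    """Return (true variance, mean unbiased estimate, mean biased estimate)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable

    # Step 1: build a population of n_obs observations from the values 1..20.
    population = [rng.randint(1, 20) for _ in range(n_obs)]
    true_var = pvariance(population)  # divides by n_obs

    unbiased, biased = [], []
    for _ in range(n_samples):
        # Step 2: draw one sample (with replacement: an assumption here).
        sample = rng.choices(population, k=sample_size)
        s2 = variance(sample)  # divides by sample_size - 1
        unbiased.append(s2)
        # Rescale to the estimator that divides by sample_size.
        biased.append(s2 * (sample_size - 1) / sample_size)

    # Step 3: average the estimates so they can be compared with true_var.
    return true_var, sum(unbiased) / n_samples, sum(biased) / n_samples

# Step 4: repeat for several sample sizes to see how the gap changes.
results = {size: simulate(sample_size=size) for size in (2, 5, 20, 50)}
for size, (true_var, mean_unbiased, mean_biased) in results.items():
    print(f"n={size:2d}  true={true_var:.2f}  "
          f"unbiased={mean_unbiased:.2f}  biased={mean_biased:.2f}")
```

Averaged over many samples, the \( n-1 \) estimate should land near the true variance, while the \( n \) estimate is smaller by the factor \( (n-1)/n \), a gap that narrows as the sample size grows.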
The user will be able to control the number of observations, the number of samples, and the sample size to generate relevant plots using ggplot2 and Google visualizations.
Google Visualization Plot Example from Shiny App:
ggplot2 Plot Example from Shiny App: