In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53,940 round-cut diamonds. Here are the first 6 rows of this data set:
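The listing below can be reproduced with head(), assuming the ggplot2 package (which provides the diamonds data) is loaded:

library(ggplot2)  # provides the diamonds data set
head(diamonds)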
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
The distribution of a variable tells what values the variable can take and how often it takes these values. In this case, the variable is price.
A quantitative variable is a variable that is measured numerically. You would choose a quantitative variable when making a report about standard deviation because numerical values can be added and averaged. A categorical variable, by contrast, is simply a label, whereas a quantitative variable records the actual measured values.
A histogram is a graph that shows how often a variable takes values in each of a set of ranges (bins). The graph below shows, for each price range, how many diamonds fall into it.
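As a sketch of how such a histogram could be drawn with ggplot2 (the bin width of 500 is an arbitrary choice):

# histogram of diamond prices; each bar counts diamonds within a 500-wide price bin
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)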
A violin plot shows an estimate of the variable’s probability density, so it conveys information similar to a histogram.
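A violin plot of price could be sketched in a similar way; the empty string for x is just a placeholder, since only one variable is plotted:

# violin plot of price: a density estimate mirrored around a vertical axis
ggplot(diamonds, aes(x = "", y = price)) +
  geom_violin()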
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     326     950    2401    3933    5324   18823
The min is the smallest data point, the 1st quartile is the 25th percentile of the data, the median is the middle value of the data set, the 3rd quartile is the 75th percentile of the data, and the max is the largest data point.
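The min, quartiles, and max can also be obtained directly with quantile():

# 0th, 25th, 50th, 75th and 100th percentiles of price
quantile(diamonds$price, probs = c(0, 0.25, 0.5, 0.75, 1))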
In a standard box plot, the whiskers extend to the min and the max of the data. The line inside the box is the median; one edge of the box is the 1st quartile and the other edge is the 3rd quartile. A modified box plot marks outliers as individual dots and extends the whiskers only to the most extreme points that are not outliers, while a standard box plot does not single out the outliers. Outliers are data points that are drastically different from the rest of the data.
The mean is the red dot.
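A box plot like the one described could be sketched as follows (drawn vertically here); marking the mean as a red dot is done with stat_summary(), which is one of several possible approaches:

# box plot of price, with the sample mean added as a red dot
ggplot(diamonds, aes(x = "", y = price)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3)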
The first formula is the standard deviation with Bessel’s correction, indicated by the division by (n - 1). The second formula is the standard deviation without the correction, which divides by n.
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? R has no built-in function for this, but we can write one:
# standard deviation without Bessel's correction: divide by n instead of n - 1
sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
These two estimates are very close, with the version using Bessel’s correction being slightly larger.
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. A population is the entire group you want to learn about, while a sample is a smaller subset of that population used to represent the whole. A parameter is a numerical summary of the population as a whole, while a statistic is the corresponding summary computed from a sample; a statistic is generally a less exact description of the population than the parameter it estimates. The sample mean is the mean of a sample drawn from the population, while the population mean is the mean of the entire population, and the two are not always equal. Likewise, the sample standard deviation is the standard deviation computed from a sample, while the population standard deviation is computed from the population as a whole.
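As a small illustration, treating the full diamonds data set as the population, the population mean of price is a parameter, while the mean of a random sample is a statistic that estimates it (the sample size of 100 here is an arbitrary choice):

mean(diamonds$price)               # parameter: the population mean
mean(sample(diamonds$price, 100))  # statistic: the mean of one random sample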
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\), which is far smaller than one would use in practice but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. The seed of a random number generator is a value that initializes the generator: if you use the same seed each time, you get the same “random” numbers each time, and different seeds give different numbers.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
This command created a random sample of the larger data set.
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Each time you use a different seed, you get a different set of values to work with.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
Setting the seed once makes the whole sequence of samples reproducible: the three samples drawn after set.seed(1) differ from one another, but repeating set.seed(1) reproduces exactly the same three samples. Without setting a seed at all, you would get a different set of samples each run and would be unable to replicate them.
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
Here the mean has been computed for each of the three samples drawn after setting the seed. Other statistics we could use to describe a sample include any summary computed from it, such as a median, a proportion, or a standard deviation.
For example, the standard deviation with Bessel’s correction:
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
A sampling distribution describes every value a statistic can take across all possible samples and how often each value occurs. The standard deviations computed above are individual draws from the sampling distribution of the sample standard deviation. Tools we have to describe this distribution include the mean, median, and mode.
The plot below shows the sampling distribution of the sample mean for different values of the sample size.
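A sketch of how such a sampling distribution could be simulated and plotted (1000 replications and these particular sample sizes are arbitrary choices):

# simulate the sampling distribution of the sample mean for several sample sizes
sizes <- c(4, 16, 64, 256)
sim <- do.call(rbind, lapply(sizes, function(n) {
  data.frame(size = n,
             sample.mean = replicate(1000, mean(sample(diamonds$price, n))))
}))
# one histogram of sample means per sample size
ggplot(sim, aes(x = sample.mean)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ size)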
The spread of the sampling distribution of the sample mean decreases as the sample size increases: with larger samples, extreme sample means become rarer.
An estimator is the rule used to calculate an estimate of a quantity from a data set. The sample mean is an estimator of the mean of the entire population, and it becomes more accurate as the sample size gets larger.
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
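A sketch of how this sampling distribution could be simulated (1000 replications is an arbitrary choice; the sample size of 4 is the one set above):

# 1000 sample standard deviations with Bessel's correction, each from a sample of size 4
sd.samples <- replicate(1000, sd(sample(diamonds$price, sample.size)))
ggplot(data.frame(value = sd.samples), aes(x = value)) +
  geom_histogram(bins = 30)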
Some people argue that it is appropriate to drop Bessel’s correction when you have the entire population, but when the number of observations is large, as it is here, the two versions barely differ anyway. The sample standard deviation is used to estimate the population standard deviation, and it does so more reliably when the sample is not dominated by outliers or an extremely large spread, since then the sample mean and standard deviation are more stable.
Now let’s try without Bessel’s correction:
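The same simulation, using the sdn() function defined above, gives the version without Bessel’s correction:

# 1000 sample standard deviations without Bessel's correction
sdn.samples <- replicate(1000, sdn(sample(diamonds$price, sample.size)))
ggplot(data.frame(value = sdn.samples), aes(x = value)) +
  geom_histogram(bins = 30)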
The standard deviation with Bessel’s correction divides the sum of squared deviations by (n - 1), which corrects for the bias introduced by estimating the mean from the same sample, whereas the version without Bessel’s correction divides by n and does not correct for this bias. The version with Bessel’s correction is preferred because it accounts for this bias.
Sampling error is the difference between a sample statistic and the corresponding population parameter that arises simply because only part of the population is observed; it is not a mistake in data collection. Sampling bias, by contrast, arises when the sampling procedure systematically favors some parts of the population over others. The bias of an estimator is the difference between its expected value and the true value, and an unbiased estimator is an estimator with zero bias.
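As a rough check on this bias (a sketch, stated in terms of the variance, since Bessel’s correction makes the sample variance exactly unbiased), one can compare the average of many sample variances, with and without the correction, to the population variance:

# population variance of price (dividing by N)
pop.var <- sdn(diamonds$price)^2
pop.var

# average sample variance over many samples of size 4:
# var() divides by n - 1 (Bessel's correction); sdn()^2 divides by n
mean(replicate(10000, var(sample(diamonds$price, sample.size))))    # roughly pop.var
mean(replicate(10000, sdn(sample(diamonds$price, sample.size))^2))  # tends to be smaller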