title: “Bessel’s Correction & Sampling Distributions”

author: “Sarah Shelson”

output: html_document

In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.


For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

The distribution of a variable tells you what values the variable takes and how often it takes those values.

Type of variable chosen

A quantitative variable is a number, and you can apply arithmetic operations to it. The standard deviation formula needs quantitative values to plug in, and price is a quantitative variable. A categorical variable, by contrast, may be coded with numbers, but those numbers do not represent true numerical values. They serve as labels, whereas quantitative data record amounts.

Histogram of diamonds price.

A histogram is a graph that shows how often the values of a variable fall into each of a set of intervals (bins). The histogram below shows how many diamonds fall into each price range. The distribution is unimodal and skewed to the right.

Violin plot

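A histogram and a violin plot both show how the data are distributed. The histogram shows counts within bins, while the violin plot shows a smoothed density estimate mirrored about its axis, so it is widest where values are most common.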

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

The min is the smallest data point and the max is the largest. Quartile 1 (Q1) is the 25th percentile: 25% of the data lie below it. Quartile 3 (Q3) is the 75th percentile: 75% of the data lie below it. The median is the middle value of the sorted data, and the mean is the arithmetic average of all the data. Here the mean (3933) is well above the median (2401), which is what we expect for a right-skewed distribution.

Modified Box Plots

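In a basic box plot, the min and max are found at the ends of the “whiskers”. Q1 and Q3 are the ends of the box, and the median is the line inside the box. A modified box plot instead plots suspected outliers as individual points and extends each whisker only to the most extreme data point that is not a suspected outlier. An outlier is an observation that lies unusually far from the bulk of the data. In a modified box plot, a point is a suspected outlier if it lies more than 1.5 × IQR below Q1 or above Q3, where IQR = Q3 − Q1 is the interquartile range.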
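
One common convention for flagging suspected outliers, and the default in R’s boxplot(), is the 1.5 × IQR rule. As a rough sketch, the cut-offs (often called fences) can be computed from the rounded quartiles printed by summary(diamonds$price) above. The variable names are illustrative.

```r
# Rounded quartiles of diamonds$price, copied from the summary() output above.
q1 <- 950
q3 <- 5324

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # points below this are suspected outliers
upper_fence <- q3 + 1.5 * iqr  # points above this are suspected outliers

c(lower = lower_fence, upper = upper_fence)
```

Because the lower fence is negative and the cheapest diamond costs 326, only unusually expensive diamonds (above roughly 11885) can be flagged under this rule.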

Adding the mean to the plot

The mean is the red dot on the violin plot.

Standard Deviation: Formulas

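The first formula uses Bessel’s correction: it divides by n-1 rather than n. The n-1 reflects the degrees of freedom: because the deviations from the sample mean always sum to zero, only n-1 of them are free to vary.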

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)

## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)

## [1] 3989.403

The two values differ by only about .037, with the Bessel-corrected value being the larger. With 53940 observations, dividing by n-1 instead of n changes almost nothing.
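
The size of this gap can be predicted without recomputing anything. Comparing the two formulas gives \(s_n = s\sqrt{(n-1)/n}\), so the two values differ only by a factor that depends on \(n\). A quick check, using the rounded value of sd(diamonds$price) printed above (variable names are illustrative):

```r
n <- 53940    # number of diamonds in the data set
s <- 3989.44  # sd(diamonds$price), rounded, from the output above

shrink <- sqrt((n - 1) / n)  # factor that converts s into s_n
s_n <- s * shrink

s - s_n  # about 0.037, matching the gap between the two printed values
```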

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. A population is the entire group you wish to learn about. A sample is a subset of the population, used to draw conclusions about the population. Parameters are numbers that describe the whole population. Statistics are numbers computed from a sample that can be used to estimate parameters. The sample mean is an example of a statistic, and the population mean is an example of a parameter.

The sample mean is the mean computed from the sample, whereas the population mean is the mean of the whole population. Likewise, the sample standard deviation is the standard deviation computed from the sample, whereas the population standard deviation is the standard deviation of the population as a whole.

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\), which is very low in practice but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. The seed is the value that initializes the pseudorandom number generator and so determines the sequence of “random” numbers it produces. If you use the same seed, you will get the same sequence and therefore the same samples. If you use different seeds, you will get different sequences.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

This command drew a random sample of four prices from the data set.

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)

## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)

## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Each seed starts the generator at a different point, which is why the sample changes when the seed changes, and why returning to seed 1 reproduces the first sample exactly.

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

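When you don’t reset the seed between calls, each call continues the generator’s sequence, so consecutive samples differ. The same three samples appear again in the second half only because set.seed(1) was called again, which restarts the sequence from the same point.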
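
A minimal sketch of the same behaviour, using a stand-in population of the integers 1 to 100 (an assumption made so the snippet runs without the diamonds data): resetting the seed reproduces a draw exactly, while consecutive draws without a reset continue the generator’s sequence.

```r
population <- 1:100  # hypothetical stand-in for diamonds$price

set.seed(1)
first <- sample(population, 4)
second <- sample(population, 4)  # no reset: the generator keeps going

set.seed(1)
first_again <- sample(population, 4)  # reset: same starting state as `first`

identical(first, first_again)  # TRUE: same seed, same sample
identical(first, second)       # compare the first draw with the one after it
```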

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))

## [1] 3908

mean(sample(diamonds$price,sample.size))

## [1] 2521

mean(sample(diamonds$price,sample.size))

## [1] 3346.75

Each call drew a new sample (the seed was set only once), so each call produced a different sample mean. Other statistics used to describe samples include the sample standard deviation.

For example, the standard deviation with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))

## [1] 3936.586

sd(sample(diamonds$price,sample.size))

## [1] 1683.428

sd(sample(diamonds$price,sample.size))

## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))

## [1] 3409.183

sdn(sample(diamonds$price,sample.size))

## [1] 1457.891

sdn(sample(diamonds$price,sample.size))

## [1] 1763.582
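
For a fixed sample, the two statistics always differ by the same factor, \(\sqrt{(n-1)/n}\), which for \(n = 4\) is about 0.866. A short check using the first sample drawn above (the sdn definition is repeated so the snippet is self-contained):

```r
# Standard deviation without Bessel's correction (divides by n).
sdn <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

x <- c(5801, 8549, 744, 538)  # the sample drawn with set.seed(1) above
n <- length(x)

sd(x)              # with Bessel's correction: 3936.586
sdn(x)             # without: 3409.183
sdn(x) / sd(x)     # ratio of the two
sqrt((n - 1) / n)  # the same value, predicted from n alone
```

With only four observations the correction raises the estimate by about 15%, whereas on the full data set the change was about 0.037.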

Sampling Distributions of Statistics

A sampling distribution is the distribution of the values a statistic takes across all possible samples of a given size from the population. Like any distribution, it can be described using the min, Q1, median, Q3, max, and IQR.

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows the sampling distribution of the sample mean for different sample sizes.

The horizontal line is the population mean of the prices of all diamonds in the data set. The dots beyond the whiskers represent suspected outliers. The ends of the box represent Q1 and Q3. The line in the middle of the box is the median.

An estimator is a statistic used to estimate a population parameter. Here the sample mean estimates the population mean, and it does a better job when the sample size is larger, because its sampling distribution is more tightly concentrated around the population mean.
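
One way to approximate such a sampling distribution is by simulation: draw many samples of a given size and record the mean of each. The sketch below assumes ggplot2 is installed. The helper name sample_means, the number of repetitions, and the sample sizes are illustrative choices, not necessarily those behind the plot.

```r
library(ggplot2)  # provides the diamonds data set

# Draw `reps` samples of size `n` from `x` and return the mean of each.
sample_means <- function(x, n, reps = 2000) {
  replicate(reps, mean(sample(x, n)))
}

set.seed(1)
sizes <- c(4, 16, 64)  # illustrative sample sizes
means_by_size <- lapply(sizes, function(n) sample_means(diamonds$price, n))
names(means_by_size) <- paste0("n=", sizes)

# Spread of the sample means at each size; it should shrink as n grows.
sapply(means_by_size, sd)

# Centre of each sampling distribution, next to the population mean.
sapply(means_by_size, mean)
mean(diamonds$price)
```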

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction when computing the standard deviation of an entire population, but if the population is large, as it is here, it doesn’t matter much. The sample standard deviation estimates the standard deviation of the population. It is a better estimator when the sample size is larger.

Now let’s try without Bessel’s correction:

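There is not much difference, except that with Bessel’s correction the estimates approach the population standard deviation at smaller sample sizes. Bessel’s correction helps because deviations are measured from the sample mean rather than the population mean, which makes them systematically too small, and dividing by n-1 (the degrees of freedom) instead of n compensates for this.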
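
The difference between the two estimators can also be checked numerically. The sketch below draws many samples of size 4, applies both formulas to the same samples, and compares the average of each statistic with the standard deviation of the full data set. It assumes ggplot2 is installed, and the repetition count is an arbitrary choice.

```r
library(ggplot2)  # provides the diamonds data set

# Standard deviation without Bessel's correction (divides by n).
sdn <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

set.seed(1)
reps <- 5000
n <- 4

# Each column of `samples` is one sample of n prices.
samples <- replicate(reps, sample(diamonds$price, n))

sd_draws  <- apply(samples, 2, sd)   # with Bessel's correction
sdn_draws <- apply(samples, 2, sdn)  # without

# Average of each statistic across samples, next to the population value.
c(with_bessel = mean(sd_draws),
  without_bessel = mean(sdn_draws),
  population = sd(diamonds$price))
```

Both averages should fall below the population value, with the uncorrected average lower still by the factor \(\sqrt{3/4}\). This is the sense in which Bessel’s correction reduces the bias without removing it.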

Sampling error and sampling bias

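Sampling error is the difference between a sample statistic and the population parameter it estimates, which arises by chance because only part of the population is observed. It is not a mistake, and it tends to shrink as the sample size grows. Sampling bias occurs when the sampling method systematically favors some members of the population over others, so the statistic is off in a consistent direction no matter how large the sample is.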