Bessel’s Correction & Sampling Distributions

In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.

For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?

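The distribution of a variable describes which values the variable takes and how often it takes each of them. A distribution can be summarized by its center, its spread, and its shape (for example, whether it is skewed). For price, the distribution tells us how the 53940 diamond prices are spread between the cheapest and the most expensive diamond, and which price ranges are most common.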

Type of variable chosen

Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.

Quantitative variables hold numbers that measure or count something, so arithmetic on them is meaningful. It is necessary to choose such a variable in order to calculate the mean and standard deviation. Numerical and quantitative variables are related but not the same: every quantitative variable is numerical, but a numerical variable is quantitative only if arithmetic operations on it make sense. A ZIP code or a jersey number, for example, is stored as a number, yet its mean has no meaning. Price is quantitative, which is why the mean and standard deviation of price are meaningful.

Histogram of diamonds price.

What is a histogram? Explain graph below.

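A histogram is a graph of a distribution: the range of the variable is divided into bins, and the height of each bar shows the number of observations that fall in that bin. This graph shows that low prices are the most common and that the count drops off as price increases. The bars are tallest at the left and decline toward the right, so the distribution is skewed to the right.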

Violin plot

Explain the relationship between a histogram and a violin plot.

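Both the histogram and the violin plot display the distribution of the data. A violin plot shows a smoothed estimate of the density, drawn symmetrically on both sides of a central axis, so it looks much like a smoothed histogram turned on its side and mirrored. Wide sections of the violin correspond to tall bars in the histogram.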

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

Describe what each of these numbers means.

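The minimum (326) is the lowest price in the data set and the maximum (18823) is the highest. The first quartile (950) is the value below which about 25% of the prices fall, the median or second quartile (2401) is the middle value with about half of the prices on each side, and the third quartile (5324) is the value below which about 75% of the prices fall. The mean (3933) is the arithmetic average of all the prices.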

Modified Box Plots

Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.

In both kinds of box plot, the box runs from the first quartile (950) to the third quartile (5324), and the line inside the box marks the median (2401). The difference is in the whiskers. In a regular box plot the whiskers extend to the minimum and the maximum. In a modified box plot the whiskers extend only to the most extreme data points that lie inside the fences, and any points beyond the fences are drawn individually as suspected outliers. The fences sit 1.5 times the interquartile range (IQR) below the first quartile and above the third quartile. Here the IQR is about 5324 - 950 = 4374, so the lower fence is about 950 - 1.5 × 4374 = -5611 and the upper fence is about 5324 + 1.5 × 4374 = 11885. No price falls below the lower fence, since the minimum is 326, but prices above roughly 11885 are flagged as suspected outliers. An outlier is an observation that lies unusually far from the rest of the data; such values pull the mean and the standard deviation toward themselves.
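The 1.5 × IQR rule can be sketched in a few lines of R. This is a minimal illustration, not the code behind the plot: `tukey_fences` is a hypothetical helper name and the data vector is made up. R’s own box plot functions build the box from hinges, which can differ slightly from the quartiles returned by `quantile`.

```r
# Flag suspected outliers with the 1.5 * IQR rule used by modified box plots.
# `tukey_fences` is a hypothetical helper name, not a base R function.
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartiles
  iqr <- q[2] - q[1]                              # interquartile range
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  list(lower = lower, upper = upper, outliers = x[x < lower | x > upper])
}

# Illustrative data (not from the diamonds data set): one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
fences <- tukey_fences(x)
print(fences)
```

Only the value 100 lies outside the fences, so it is the only point a modified box plot of this vector would draw separately.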

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot.

The mean (3933) lies above the median (2401), which is what we expect when a distribution is skewed to the right.

Standard Deviation: Formulas

Explain the formulas below, say which uses Bessel’s correction.

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

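This formula uses Bessel’s correction: it divides the sum of squared deviations by n-1 instead of n. Dividing by n-1 makes the sample variance $s^2$ an unbiased estimate of the population variance, and it reduces (but does not eliminate) the bias of $s$ as an estimate of the population standard deviation.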

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]

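This formula does not use Bessel’s correction. Dividing by n gives the exact standard deviation when the data are the entire population, but when the data are a sample it tends to underestimate the population standard deviation.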

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)

## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)

## [1] 3989.403

How close are these estimates? Which is larger?

These estimates are very close: they differ by about 0.04. The standard deviation with Bessel’s correction is the larger of the two, by a factor of $\sqrt{n/(n-1)}$. With $n = 53940$ that factor is barely above 1, but for a small data set the difference is substantial.
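The relationship between the two formulas can be checked numerically. The sketch below uses the four prices that appear later in this report as the first sample drawn with seed 1; the variable names are only for illustration.

```r
# Standard deviation without Bessel's correction (divide by n).
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# The four prices from the first sample drawn with set.seed(1) in this report.
x <- c(5801, 8549, 744, 538)
n <- length(x)

with_bessel <- sd(x)       # divides by n - 1
without_bessel <- sdn(x)   # divides by n
ratio <- with_bessel / without_bessel

print(c(with_bessel = with_bessel, without_bessel = without_bessel))

# The ratio of the two equals sqrt(n / (n - 1)).
print(c(ratio = ratio, expected = sqrt(n / (n - 1))))

# For the full data set, n = 53940, so the factor is barely above 1.
print(sqrt(53940 / 53939))
```

With only four values the corrected standard deviation is about 15% larger, while for the full data set the two agree to several significant digits.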

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.

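Sampling is the process of selecting a subset of individuals from a population, ideally at random, in order to learn about the whole population. A population is the complete set of individuals or objects of interest; a sample is the part of the population that is actually observed. A parameter is a number that describes the population, such as the population mean or the population standard deviation, while a statistic is a number computed from a sample, such as the sample mean or the sample standard deviation. The population mean is the average over every member of the population and is a fixed number. The sample mean is the average over the sample only; it varies from sample to sample, and it is an unbiased estimate of the population mean. Likewise, the population standard deviation measures the spread of the whole population around the population mean, while the sample standard deviation measures the spread of the sample around the sample mean and is used to estimate the population standard deviation. In this report, the 53940 diamonds play the role of the population.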

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, $n$.

First, we need to choose a sample size, $n$. We choose $n=4$ which is very low in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.

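The seed of a pseudorandom number generator is the value that initializes it. The generator produces a fixed sequence of numbers that only appear random, and the seed determines where that sequence starts. Using the same seed reproduces exactly the same sequence, and therefore the same samples; using a different seed gives a different sequence and different samples.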

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Explain what this command did.

This command drew a random sample of four prices, without replacement, from the 53940 prices in the data set. The choice appears random, but it is determined by the pseudorandom number generator and therefore by the seed.

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)

## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)

## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Explain these results.

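Seeds 2 and 3 each produced a different sample, because each seed starts the generator at a different point. Setting the seed back to 1 reproduced exactly the sample we drew the first time.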

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

Explain these results.

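When we do not reset the seed between calls, the generator continues along its sequence, so each call to sample returns a different sample. After resetting the seed to 1, the same three samples appear again in the same order. The seed therefore fixes the entire sequence of samples, not just the first one.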

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))

## [1] 3908

mean(sample(diamonds$price,sample.size))

## [1] 2521

mean(sample(diamonds$price,sample.size))

## [1] 3346.75

Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?

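We have drawn three samples of four diamonds and computed the mean price of each one. Each sample mean is a statistic: a single number that describes a sample. The three means differ because the samples differ. Other statistics that we could use to describe samples include the median, the minimum and maximum, the quartiles, the interquartile range, and the standard deviation.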

For example, the standard deviation with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))

## [1] 3936.586

sd(sample(diamonds$price,sample.size))

## [1] 1683.428

sd(sample(diamonds$price,sample.size))

## [1] 2036.409

And the standard deviation without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))

## [1] 3409.183

sdn(sample(diamonds$price,sample.size))

## [1] 1457.891

sdn(sample(diamonds$price,sample.size))

## [1] 1763.582

Sampling Distributions of Statistics

Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?

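The sampling distribution of a statistic is the distribution of the values that the statistic takes over all possible samples of a given size from the population. Each mean and standard deviation computed above is a single draw from the corresponding sampling distribution, which is why the numbers change from sample to sample. Because a sampling distribution is a distribution like any other, we can describe it with the same tools: histograms, violin plots, and box plots, together with numerical summaries such as the mean, median, quartiles, and standard deviation.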
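For a very small population, a sampling distribution can be written out in full by listing every possible sample. The sketch below assumes a made-up population of four values and samples of size 2 drawn with replacement; it is only an illustration and does not use the diamonds data.

```r
# Tiny hypothetical population, small enough to list every possible sample.
population <- c(2, 4, 6, 8)
n <- 2

# All ordered samples of size n drawn with replacement (4^2 = 16 rows).
samples <- expand.grid(rep(list(population), n))

# The statistic (here the mean) computed on every possible sample.
sample_means <- rowMeans(samples)

# The sampling distribution: each possible value of the statistic and its probability.
sampling_dist <- table(sample_means) / length(sample_means)
print(sampling_dist)

# The center of the sampling distribution matches the population mean.
print(mean(sample_means))
print(mean(population))
```

With 53940 diamonds there are far too many possible samples to list, which is why the plots in this report rely on repeated random sampling instead.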

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.

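The horizontal line marks the population mean of price. Each box plot shows the sampling distribution of the sample mean for one sample size. The box plots cluster around the horizontal line, and as the sample size gets bigger the boxes get shorter, which means that the sample means vary less and fall closer to the population mean.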

Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?

An estimator is a statistic used to estimate a population parameter from a random sample. The sample mean estimates the population mean: it uses a small part of the population to describe the whole. It does a better job as the sample size increases, because the sampling distribution of the sample mean becomes narrower, so a typical sample mean lands closer to the population mean. The standard deviation of the sample mean is approximately the population standard deviation divided by $\sqrt{n}$.
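The narrowing of the sampling distribution can be seen in a short simulation. This is a sketch under stated assumptions: the population is a made-up right-skewed set of 10000 values standing in for prices, and the sample sizes and the number of repetitions are arbitrary choices.

```r
set.seed(1)

# Hypothetical right-skewed population standing in for a price variable.
population <- rgamma(10000, shape = 2, scale = 2000)
sigma <- sqrt(mean((population - mean(population))^2))  # population SD (divide by N)

sample_sizes <- c(4, 16, 64)

# For each sample size, draw 2000 samples and record the spread of their means.
spread_of_means <- sapply(sample_sizes, function(size) {
  means <- replicate(2000, mean(sample(population, size)))
  sd(means)
})

# Compare the simulated spread with sigma / sqrt(n).
print(data.frame(n = sample_sizes,
                 simulated = spread_of_means,
                 theory = sigma / sqrt(sample_sizes)))
```

Each time the sample size is multiplied by 4, the spread of the sample means is roughly halved.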

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?

It doesn’t matter much whether or not you drop Bessel’s correction when the data set is large, because the two formulas differ only by the factor $\sqrt{n/(n-1)}$, which gets closer and closer to 1 as $n$ grows. The sample standard deviation is estimating the population standard deviation. It is a better estimate when the sample size is larger, and Bessel’s correction makes the most difference when the sample is small.

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?

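The two differ only in the divisor: with Bessel’s correction we divide by n-1, and without it we divide by n, so the corrected value is slightly larger. When the data are a sample and the goal is to estimate the population standard deviation, Bessel’s correction is the better choice because it reduces the bias, and it matters most for small samples. When the data are the entire population, dividing by n gives the population standard deviation exactly, and for large data sets the two values are nearly identical.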
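The size of the bias can be made precise for the variance. Assuming the sample consists of $n$ independent draws from a population with variance $\sigma^2$ (the symbol $\sigma$ is introduced here and is not used elsewhere in this report), a standard result gives:

\[E\left[s_n^2\right] = \frac{n-1}{n}\,\sigma^2, \qquad E\left[s^2\right] = E\left[\frac{n}{n-1}\,s_n^2\right] = \sigma^2\]

For $n = 4$ the uncorrected variance is on average only $3/4$ of the population variance, while for $n = 53940$ the factor $(n-1)/n$ is about 0.99998, which is why the correction is barely visible for the full data set.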

Sampling error and sampling bias

Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.

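Sampling error is the difference between a statistic and the parameter it estimates for one particular sample; it arises by chance and changes from sample to sample. Sampling bias is a systematic error caused by a sampling method that favors some members of the population over others, so that the samples are not representative. A biased estimator is one whose average sampling error over all possible samples is not zero, meaning that it tends to overestimate or underestimate the parameter. An unbiased estimator has an average sampling error of zero: the mean of its sampling distribution equals the parameter. The sample mean is an unbiased estimator of the population mean, while the standard deviation without Bessel’s correction tends to underestimate the population standard deviation.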
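The bias of an estimator, understood as its average sampling error, can be computed exactly for a tiny example. The sketch below assumes a made-up population of four values and lists all 16 equally likely samples of size 2 drawn with replacement; the names are only for illustration.

```r
# Tiny hypothetical population; sigma is its true standard deviation (divide by N).
population <- c(2, 4, 6, 8)
sigma <- sqrt(mean((population - mean(population))^2))

sdn <- function(x) sqrt(mean((x - mean(x))^2))  # no Bessel's correction

# Every ordered sample of size 2 drawn with replacement (16 equally likely samples).
samples <- as.matrix(expand.grid(population, population))

s_values <- apply(samples, 1, sd)    # with Bessel's correction
sn_values <- apply(samples, 1, sdn)  # without Bessel's correction

# Sampling error: one value per sample (statistic minus parameter).
sampling_error <- s_values - sigma
print(range(sampling_error))

# Bias of an estimator: the average sampling error over all possible samples.
bias <- c(
  variance_with_bessel = mean(s_values^2) - sigma^2,
  variance_without_bessel = mean(sn_values^2) - sigma^2,
  sd_with_bessel = mean(s_values) - sigma,
  sd_without_bessel = mean(sn_values) - sigma
)
print(round(bias, 3))
```

In this example the corrected variance has zero bias, the uncorrected variance is too small on average, and both standard deviations underestimate the population value, the corrected one by less. This matches the statement at the start of this report.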