Bessel’s Correction & Sampling Distributions

In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.

For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

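Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price? Answer: The distribution of a variable describes which values the variable takes and how often it takes them. For price, the distribution describes how the prices of the 53940 diamonds are spread over the range of observed values.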

Type of variable chosen

Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related. Answer: A quantitative variable is one measured on a numeric scale, so that arithmetic on its values is meaningful. The standard deviation can only be computed for quantitative variables, because it is built from differences and averages of the values. Every quantitative variable is numerical, but not every numerical variable is quantitative: a ZIP code, for example, is written with digits but is really a label.

Histogram of diamonds price.

What is a histogram? Explain graph below.

Answer: A histogram is a graph made of rectangles, each with a width equal to a class interval and an area proportional to the frequency of values in that interval.
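As a rough illustration of what a histogram counts, the sketch below bins a small made-up vector (not the diamonds data) with base R's hist(). The name toy_values and the break points are assumptions chosen only for illustration.

```r
# Made-up values for illustration only; not taken from the diamonds data.
toy_values <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# Bin into class intervals of width 2; plot = FALSE returns the counts
# instead of drawing the rectangles.
toy_hist <- hist(toy_values, breaks = seq(0, 10, by = 2), plot = FALSE)

toy_hist$breaks  # edges of the class intervals
toy_hist$counts  # number of values falling in each interval
```

Dropping plot = FALSE draws the rectangles whose heights are these counts.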

Violin plot

Explain the relationship between a histogram and a violin plot.

Answer: A histogram shows the frequency of values in each class interval, while a violin plot shows a smoothed estimate of the density of the same data, mirrored about a central axis.

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

Describe what each of these numbers means. Answer: The minimum price in the data set is 326. The first quartile is 950: about a quarter of the prices fall at or below it. The median is 2401, the middle value of the sorted prices. The mean (average) price is 3933. The third quartile is 5324: about three quarters of the prices fall at or below it. The maximum price is 18823.

Modified Box Plots

Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.

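Answer: The box spans the first quartile to the third quartile, with a line at the median. In a standard boxplot the whiskers extend to the minimum and maximum. In a modified boxplot the whiskers stop at the most extreme values that are not suspected outliers, and the suspected outliers are drawn as individual points. An outlier is an observation that lies unusually far from the bulk of the data. A common rule flags a value as a suspected outlier if it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile.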
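A common convention for flagging suspected outliers is the 1.5 × IQR rule. The sketch below assumes that convention and applies it to the rounded quartiles printed by summary(diamonds$price), so the fences are only approximate. It then uses boxplot.stats(), which applies the same rule by default, on a small made-up vector. The names lower_fence, upper_fence and toy_values are illustrative.

```r
# Quartiles as printed (rounded) by summary(diamonds$price).
q1 <- 950
q3 <- 5324

iqr <- q3 - q1                   # interquartile range: 4374
lower_fence <- q1 - 1.5 * iqr    # -5611, below the minimum price of 326
upper_fence <- q3 + 1.5 * iqr    # 11885
c(lower = lower_fence, upper = upper_fence)

# The same rule on made-up data: 50 lies beyond the upper fence.
toy_values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
boxplot.stats(toy_values)$out
```

Under this rule, prices above roughly 11885 would be drawn as individual points, and no price can be flagged on the low side because the lower fence is negative.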

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot.

Answer: The mean is where the red dot lies within the boxplot.

Standard Deviation: Formulas

Explain the formulas below, and say which uses Bessel’s correction.

Answer: Both formulas compute a standard deviation. Only the first divides by \(n-1\), so it is the only one that uses Bessel’s correction.

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
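To make the two formulas concrete, here is a small sketch on a made-up vector. The helper name sd_divide_by_n is hypothetical and used only for illustration. The last line checks the identity \(s = s_n\sqrt{n/(n-1)}\), which follows directly from the two formulas.

```r
# Made-up data with mean 5 and squared deviations summing to 32.
toy_values <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(toy_values)

# Hypothetical helper: standard deviation dividing by n (no Bessel's correction).
sd_divide_by_n <- function(x) sqrt(mean((x - mean(x))^2))

s   <- sd(toy_values)              # divides by n - 1: sqrt(32 / 7)
s_n <- sd_divide_by_n(toy_values)  # divides by n:     sqrt(32 / 8) = 2

c(with_bessel = s, without_bessel = s_n)
all.equal(s, s_n * sqrt(n / (n - 1)))
```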

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)

## [1] 3989.44

How about without Bessel’s correction? Base R doesn’t seem to provide a function for this, but we can add one:

# Standard deviation without Bessel's correction: divide by n, not n - 1.
sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)

## [1] 3989.403

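How close are these estimates? Which is larger? Answer: The two estimates are very close (3989.44 versus 3989.403), and the one with Bessel’s correction is slightly larger. With \(n = 53940\), dividing by \(n-1\) instead of \(n\) changes almost nothing.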
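The size of the gap can be checked by hand, since the two formulas differ only by the factor \(\sqrt{(n-1)/n}\). The sketch below uses the printed value of sd(diamonds$price) and \(n = 53940\), both copied from the output above, so the result is only as precise as that rounding. The names sd_bessel and shrink_factor are illustrative.

```r
n <- 53940             # number of diamonds in the data set
sd_bessel <- 3989.44   # printed output of sd(diamonds$price)

shrink_factor <- sqrt((n - 1) / n)  # very close to 1 for large n
sd_bessel * shrink_factor           # approximately 3989.403

# For comparison, the same factor for a sample of size 4.
sqrt((4 - 1) / 4)                   # approximately 0.866
```

For the full data set the correction changes the fifth significant digit at most; for a sample of four it changes the result by about 13%.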

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.

Answer: A population is the entire group of interest, here all 53940 diamonds in the data set, and a sample is a subset drawn from it. A parameter is a number that describes the entire population, while a statistic is a number that describes a sample. The population mean and population standard deviation are examples of parameters; the sample mean and sample standard deviation are examples of statistics. The sample mean is the mean of the values in a sample, whereas the population mean is the mean of every value in the population. The same distinction applies to the standard deviation.

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.

Answer: The seed is the starting value of the pseudorandom number generator. Using the same seed reproduces the same sequence of random numbers, and therefore the same samples. Different seeds generally give different sequences.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Explain what this command did. Answer: It drew a random sample of sample.size (here 4) prices from diamonds$price, without replacement.
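A small sketch of what sample() does by default, on a made-up population of five prices (the values are borrowed from the summary output earlier, and the names toy_prices and drawn are illustrative). Because replace = FALSE is the default, no element is drawn twice.

```r
toy_prices <- c(326, 950, 2401, 5324, 18823)  # made-up population of five prices

set.seed(1)
drawn <- sample(toy_prices, 4)  # four of the five values, in random order

drawn
all(drawn %in% toy_prices)      # every drawn value comes from the population
anyDuplicated(drawn) == 0       # no element is drawn twice
```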

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)

## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)

## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Explain these results.

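Answer: The samples drawn with seeds 1, 2, and 3 differ because each seed starts a different sequence of random numbers. Setting the seed back to 1 reproduces the first sample exactly.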

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

Explain these results.

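Answer: When the seed is not reset between calls, the generator continues from its current state, so successive samples differ. Resetting the seed to 1 restarts the sequence, so the same three samples appear again in the same order.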

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))

## [1] 3908

mean(sample(diamonds$price,sample.size))

## [1] 2521

mean(sample(diamonds$price,sample.size))

## [1] 3346.75

Explain what we have done here. Answer the following question: what other statistics could we use to describe samples? Answer: Each command drew a new random sample of four prices and computed its mean, so each result is a sample mean. Other statistics, such as the median, the mode, or the standard deviation, could also describe a sample.

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))

## [1] 3936.586

sd(sample(diamonds$price,sample.size))

## [1] 1683.428

sd(sample(diamonds$price,sample.size))

## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))

## [1] 3409.183

sdn(sample(diamonds$price,sample.size))

## [1] 1457.891

sdn(sample(diamonds$price,sample.size))

## [1] 1763.582

Sampling Distributions of Statistics

Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?

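Answer: The sampling distribution of a statistic is the distribution of the values that statistic takes across all possible random samples of a given size from the population. Each number computed above is a single draw from such a distribution: the means, and the standard deviations with and without Bessel’s correction, of samples of size 4. To describe these distributions we have the same tools as for any quantitative variable: histograms, violin plots, modified box plots, and numerical summaries such as the mean, median, and standard deviation.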
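One way to approximate a sampling distribution is by simulation: draw many samples, compute the statistic on each, and look at the resulting collection. The sketch below does this for the sample mean with \(n = 4\). To stay self-contained it uses a simulated right-skewed stand-in population rather than diamonds$price, and the names population, n_replicates and sample_means are illustrative assumptions.

```r
set.seed(1)

# Stand-in population (assumption): 50000 right-skewed values, not the diamonds data.
population <- rlnorm(50000, meanlog = 8, sdlog = 1)

sample_size  <- 4
n_replicates <- 5000

# Each replicate draws one sample and records its mean.
sample_means <- replicate(n_replicates, mean(sample(population, sample_size)))

summary(sample_means)   # numerical description of the simulated sampling distribution
mean(population)        # the parameter the sample mean is estimating
# hist(sample_means)    # uncomment to draw the distribution
```

Replacing mean with sd, or with a divide-by-n version, gives the corresponding simulated distribution for the standard deviation.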

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.

Answer: Each modified box plot shows the sampling distribution of the sample mean for one sample size, and the sample sizes are laid out along the x axis. The horizontal line shows the population mean of the entire data set. The red dot on each box plot marks the mean of the sample means for that sample size, and the black dots outside the box plots are suspected outliers.

Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?

Answer: An estimator is a statistic used to estimate a parameter. The sample mean estimates the population mean, just as the sample standard deviation estimates the population standard deviation. The sample mean does a better job when the sample size is larger, because its values cluster more tightly around the population mean.
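The effect of sample size on the sample mean can be sketched by simulation. The code below measures how widely sample means scatter for three sample sizes, again using a simulated stand-in population rather than the diamonds data. The helper spread_of_sample_mean and the chosen sample sizes are hypothetical.

```r
set.seed(1)

# Stand-in population (assumption), not the diamonds data.
population <- rlnorm(50000, meanlog = 8, sdlog = 1)

# Hypothetical helper: spread of the sample mean over many samples of size n.
spread_of_sample_mean <- function(n, n_replicates = 2000) {
  sd(replicate(n_replicates, mean(sample(population, n))))
}

sample_sizes <- c(4, 16, 64)
spreads <- sapply(sample_sizes, spread_of_sample_mean)
data.frame(sample_size = sample_sizes, sd_of_sample_means = spreads)
```

The spread should roughly halve each time the sample size is quadrupled, which is what makes larger samples better estimators of the population mean.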

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

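Answer the following questions: what is the sample standard deviation estimating? In what situations is it a better estimate? Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here, it doesn’t matter much. Why?

Answer: The sample standard deviation estimates the population standard deviation, that is, the spread of prices across all diamonds in the data set. It is a better estimate when the sample size is larger. For a large population the correction hardly matters because the two formulas differ only by the factor \(\sqrt{n/(n-1)}\), which is very close to 1 when \(n\) is large.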

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?

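Answer: With Bessel’s correction the sum of squared deviations is divided by \(n-1\) instead of \(n\), so the resulting standard deviation is larger, by the factor \(\sqrt{n/(n-1)}\). The corrected version is better as an estimate of the population standard deviation because it is less biased. The difference matters most for small samples and becomes negligible as the sample size grows.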
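A simulation sketch of the comparison, under stated assumptions: the population is a simulated, normally distributed stand-in (not the diamonds data), the sample size is 4, and the names population_sd and both_sds are illustrative. Each sample's standard deviation is computed both ways and the results are averaged.

```r
set.seed(1)

# Stand-in population (assumption): normally distributed values, not the diamonds data.
population <- rnorm(50000, mean = 4000, sd = 1000)
population_sd <- sqrt(mean((population - mean(population))^2))

sample_size  <- 4
n_replicates <- 5000

# For each sample, compute the standard deviation both ways.
both_sds <- replicate(n_replicates, {
  x <- sample(population, sample_size)
  c(with_bessel    = sd(x),
    without_bessel = sqrt(mean((x - mean(x))^2)))
})

rowMeans(both_sds)  # average estimate under each formula
population_sd       # the parameter being estimated
```

Both averages should fall below the population standard deviation, with the corrected one closer to it: for normal data and samples of four, about 92% of the true value with the correction versus about 80% without.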

Sampling error and sampling bias

Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators. Answer: Sampling error is the random difference between a statistic and the parameter it estimates, which arises because only a sample, not the whole population, is observed. Sampling bias is a systematic error caused by a sampling method that favors some members of the population over others, so that the sample no longer represents the population. An estimator is biased if its expected value differs from the parameter it estimates; the bias is the expected value of the estimator minus the true value of the parameter. An estimator is unbiased if its expected value equals the parameter.
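
In symbols, as a sketch under the assumption of independent draws from a population with variance \(\sigma^2\) (the symbols \(\theta\), \(\hat\theta\) and \(\sigma^2\) are introduced here only for illustration), the bias of an estimator \(\hat\theta\) of a parameter \(\theta\) is

\[\operatorname{Bias}(\hat\theta) = E[\hat\theta] - \theta\]

and for the two variance formulas

\[E\left[s_n^2\right] = \frac{n-1}{n}\,\sigma^2, \qquad E\left[s^2\right] = \sigma^2\]

So the variance computed with Bessel’s correction is unbiased, while dividing by \(n\) underestimates \(\sigma^2\) on average, by a quarter when \(n = 4\). Taking the square root reintroduces some bias, which is why \(s\) is still a biased estimate of \(\sigma\), only less so than \(s_n\).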