Bessel’s Correction & Sampling Distributions

In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.


For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price? The distribution of a variable tells us what values the variable takes and how often it takes those values. For “price”, the distribution tells us which prices occur among the diamonds in the data set and how often each price occurs.

Type of variable chosen

Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related. A quantitative variable holds numbers for which it makes sense to add or average the values. Standard deviation is based on the mean, so it only makes sense for quantitative variables. A numerical variable is any variable whose values are recorded as numbers. Not everything numerical is quantitative (a ZIP code is a number, but averaging ZIP codes is meaningless), but everything quantitative is numerical.

Histogram of diamonds price.

What is a histogram? Explain graph below. A histogram is a graph of the distribution of a quantitative variable, similar to a stem plot or a violin plot. The range of values is divided into “bins”, and the height of each bar shows how many observations fall in that bin. The graph below shows how many diamonds in the data set fall in each price range.
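As a small illustration of binning, the sketch below counts the six prices shown in the first rows of the data set into bins of width 5. The bin edges are an arbitrary choice made here for illustration; they are not the bins used in the plot.

```r
# Prices from the first six rows of the diamonds data shown above
prices <- c(326, 326, 327, 334, 335, 336)

# Bin edges chosen only for illustration (width 5)
edges <- seq(325, 340, by = 5)

# plot = FALSE returns the bin counts instead of drawing the bars
h <- hist(prices, breaks = edges, plot = FALSE)
bin_counts <- h$counts
print(bin_counts)
```

Each count is the height of one bar in the corresponding histogram.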

Violin plot

Explain the relationship between a histogram and a violin plot. A violin plot is like a mirrored, smoothed-out histogram: the violin is widest at the values that occur most often.

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

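Describe what each of these numbers means. The min (326) is the lowest price and the max (18823) is the highest price in the data set. The 1st quartile is the 25th percentile, the median is the 50th percentile, and the 3rd quartile is the 75th percentile. The mean (3933) is the arithmetic average of all the prices. The min, max, and mean are sensitive to outliers, while the 1st quartile, median, and 3rd quartile are resistant to outliers.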
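To illustrate what “resistant” means, the sketch below takes the six prices from the first rows of the data set and replaces the largest one with the maximum price in the data set. The six-value vector is only a toy example, not a summary of the full data.

```r
# Prices from the first six rows of the diamonds data shown above
prices <- c(326, 326, 327, 334, 335, 336)

# Same values, but with the largest replaced by the data set's maximum price
with_outlier <- c(326, 326, 327, 334, 335, 18823)

# The median does not move; the mean is pulled far upward
median_before <- median(prices)
median_after  <- median(with_outlier)
mean_before   <- mean(prices)
mean_after    <- mean(with_outlier)

print(c(median_before, median_after))
print(c(mean_before, mean_after))
```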

Modified Box Plots

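Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot. The min, 1st quartile, median, 3rd quartile, and max are the five numbers drawn by a box plot: the box spans the quartiles, the line inside it marks the median, and the whiskers reach toward the min and max. In a standard box plot the whiskers extend all the way to the min and max. In a modified box plot the whiskers stop at the most extreme observations that lie inside the fences (1.5 times the interquartile range beyond the quartiles), and any observation beyond a fence is drawn as a separate dot. An outlier is an observation that lies unusually far from the bulk of the data; the dots beyond the fences are the suspected outliers. The median and quartiles are resistant to outliers, so the box itself barely changes when outliers are present.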
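As a rough check, the fences can be computed from the quartiles printed by summary() above. This sketch assumes the common 1.5 × IQR rule and uses the rounded quartiles from that output, so the exact fences may differ slightly.

```r
# Quartiles as printed (rounded) in the summary() output above
q1 <- 950
q3 <- 5324
max_price <- 18823

# Interquartile range and the two fences
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(lower_fence, upper_fence))
# Prices above the upper fence are flagged as suspected outliers
print(max_price > upper_fence)
```

The lower fence is negative, so no price can fall below it; all suspected outliers in price are on the high side.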

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot. The mean is the red dot.

Standard Deviation: Formulas

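Explain the formulas below, say which uses Bessel’s correction. Both formulas below give a standard deviation: each takes the square root of an average of the squared deviations of the values $x_i$ from the sample mean $\bar x$. The first formula uses Bessel’s correction: it divides the sum of squared deviations by $n-1$ instead of by $n$.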

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)

## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)

## [1] 3989.403

How close are these estimates? Which is larger? The estimates are very close. The standard deviation with Bessel’s correction is larger, by about 0.04, because it divides by $n-1$ rather than by $n$.
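The size of the gap can be checked from the two formulas, which differ only by the factor $\sqrt{n/(n-1)}$. The sketch below plugs in the values printed above; the variable names are chosen here for illustration.

```r
# Values printed above
n <- 53940          # number of diamonds in the data set
s_n <- 3989.403     # standard deviation without Bessel's correction

# The two formulas differ only by the factor sqrt(n / (n - 1))
ratio <- sqrt(n / (n - 1))
s <- s_n * ratio

print(ratio)
print(s)
```

With $n$ this large the factor is barely above 1, which is why the two results agree to several digits.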

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation. A population is the entire group we want to learn about; a sample is a smaller group drawn from the population. A parameter is a number that describes the population, while a statistic is a number that describes a sample. For example, if the population is all students at American University, the mean of some quantity over all students is a parameter, and the mean of the same quantity over a sample of athletes who attend American is a statistic. The population mean is the average over the whole population; the sample mean is the average over a sample, and for a random sample it is an unbiased estimate of the population mean. The population standard deviation measures the typical distance of the values from the mean across the whole population; the sample standard deviation measures the same thing within a given sample and is used to estimate the population standard deviation.
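The distinction between a parameter and a statistic can be sketched in a few lines. The population here is simulated and purely hypothetical (it is not the diamonds data), and the variable names are chosen for illustration.

```r
set.seed(1)

# Hypothetical population of 1000 prices (simulated, not the diamonds data)
population <- round(rexp(1000, rate = 1 / 4000))

# Parameter: computed from the whole population
population_mean <- mean(population)

# Statistic: computed from a sample drawn from the population
one_sample <- sample(population, 4)
sample_mean <- mean(one_sample)

print(population_mean)
print(sample_mean)
```

The parameter is a single fixed number, while the statistic changes each time a different sample is drawn.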

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, $n$.

First, we need to choose a sample size, $n$. We choose $n=4$, which is very small in practice but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you. A pseudorandom number generator produces a fixed sequence of numbers that looks random, and the seed chooses the starting point in that sequence. If you use the same seed, you get the same numbers every time. If you use different seeds, you will get different sets of numbers.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Explain what this command did. The command drew a random sample of 4 prices from the 53940 diamond prices, without replacement. The sample looks random, but it is determined by the seed.

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)

## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)

## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Explain these results. Each new seed starts the generator at a different point, so each seed gives a different sample of prices. Setting the seed back to 1 reproduces the first sample exactly.

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

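Explain these results. Without a new call to set.seed, each call to sample continues from where the generator left off, so the three samples differ from one another. Resetting the seed to 1 restarts the sequence, so the same three samples appear again in the same order. A given seed therefore generates the same numbers every time.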

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))

## [1] 3908

mean(sample(diamonds$price,sample.size))

## [1] 2521

mean(sample(diamonds$price,sample.size))

## [1] 3346.75

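Explain what we have done here. Answer the following question: what other statistics could we use to describe samples? Here, each call draws a new sample of 4 prices and computes its mean. The first value, 3908, is the mean of the sample 5801, 8549, 744, 538 drawn with seed 1; the next two values are the means of the next two samples in the sequence. Besides the mean, we could use the median, the standard deviation, the minimum, or the maximum to describe samples.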

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))

## [1] 3936.586

sd(sample(diamonds$price,sample.size))

## [1] 1683.428

sd(sample(diamonds$price,sample.size))

## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))

## [1] 3409.183

sdn(sample(diamonds$price,sample.size))

## [1] 1457.891

sdn(sample(diamonds$price,sample.size))

## [1] 1763.582
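The first value in each group above can be reproduced by hand from the sample 5801, 8549, 744, 538 that set.seed(1) produced earlier. The sketch below spells out both formulas; the variable names are chosen here for illustration.

```r
# The sample drawn with set.seed(1) earlier in this report
x <- c(5801, 8549, 744, 538)
n <- length(x)

# Sum of squared deviations from the sample mean
ss <- sum((x - mean(x))^2)

with_bessel    <- sqrt(ss / (n - 1))  # divides by n - 1
without_bessel <- sqrt(ss / n)        # divides by n

print(with_bessel)
print(without_bessel)
print(without_bessel / with_bessel)   # equals sqrt((n - 1) / n)
```

With $n=4$ the ratio is $\sqrt{3/4}$, so dropping Bessel’s correction shrinks the result by more than 13 percent.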

Sampling Distributions of Statistics

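Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions? The sampling distribution of a statistic tells you what values the statistic takes over all possible samples of a given size, and how often it takes those values. Each number computed above is one draw from such a distribution. The tools we have to describe these distributions are the same ones we used for data: histograms, violin plots, box plots, and numerical summaries such as the mean, median, and standard deviation.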
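A sampling distribution can be approximated by drawing many samples and recording the statistic each time. The sketch below does this for the sample mean, using a simulated right-skewed population as a stand-in (an assumption made here so the example runs without ggplot2); the number of repetitions is an arbitrary choice.

```r
set.seed(1)

# Hypothetical right-skewed population (simulated, not the diamonds data)
population <- rexp(50000, rate = 1 / 4000)

sample_size <- 4
n_samples <- 10000

# Draw many samples and record the mean of each one
sample_means <- replicate(n_samples, mean(sample(population, sample_size)))

# Describe the sampling distribution with the same tools used for data
print(summary(sample_means))
print(sd(sample_means))
```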

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.

The graphs are modified box plots, one for each sample size. The red dots represent the mean of each sampling distribution. The horizontal line marks the population mean price, about $3,933. As the sample size increases, the sample means cluster more tightly around the population mean.

Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job? An estimator is a rule for calculating an estimate of a population parameter from sample data. For example, the sample mean is an estimator for the population mean; here it is estimating the mean price of all diamonds in the data set. It does a better job as the sample size gets larger: the sample means vary less from sample to sample, and the skewness of the sampling distribution goes down.
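The effect of sample size on the sample mean can be sketched with a simulation. The population below is simulated and hypothetical (not the diamonds data), and `spread_of_mean` is a helper defined here for illustration.

```r
set.seed(1)

# Hypothetical right-skewed population (simulated, not the diamonds data)
population <- rexp(50000, rate = 1 / 4000)

# Spread of the sample mean over many samples of a given size
spread_of_mean <- function(n, n_samples = 5000) {
  sd(replicate(n_samples, mean(sample(population, n))))
}

spread_small <- spread_of_mean(4)
spread_large <- spread_of_mean(64)

print(c(spread_small, spread_large))
```

The smaller spread at the larger sample size is what makes the sample mean a better estimate there.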

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here, it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate? Bessel’s correction matters little here because the population is so large (53940 diamonds) that dividing by $n-1$ instead of $n$ barely changes the result. The sample standard deviation is estimating the population standard deviation. It is a better estimate when the sample size is larger.

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter? The standard deviation with Bessel’s correction divides by $n-1$, so it is always slightly larger than the standard deviation without it, which divides by $n$. Both tend to underestimate the population standard deviation, but the bias is smaller with Bessel’s correction. I think the version with Bessel’s correction is better for samples. It matters most with small sample sizes; with large sample sizes the two are nearly identical.
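The comparison can be sketched with a simulation that applies both estimators to the same samples of size 4. The population is simulated and hypothetical (not the diamonds data), and `sdn` is redefined here so the block runs on its own.

```r
set.seed(1)

# Hypothetical right-skewed population (simulated, not the diamonds data)
population <- rexp(50000, rate = 1 / 4000)

# Standard deviation without Bessel's correction, as defined earlier
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Treat the simulated values as the whole population
population_sd <- sdn(population)

# Both estimators are applied to the same 10000 samples of size 4
samples <- replicate(10000, sample(population, 4))
avg_with_bessel    <- mean(apply(samples, 2, sd))
avg_without_bessel <- mean(apply(samples, 2, sdn))

print(c(population_sd, avg_with_bessel, avg_without_bessel))
```

In this sketch both averages should fall below the population standard deviation, with the uncorrected version further away.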

Sampling error and sampling bias

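Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators. Sampling error is the difference between a statistic and the parameter it estimates, caused by looking at a sample rather than the whole population. Sampling bias is the average sampling error over many samples. An estimator is unbiased if its sampling errors average out to zero over many samples, so that its average value equals the parameter; it is biased if its average value differs from the parameter.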
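In symbols, for an estimator $\hat\theta$ of a parameter $\theta$, the bias can be written as below. The second line assumes independent draws from a population with variance $\sigma^2$, which is an idealization; sampling without replacement from a finite population differs slightly.

\[\operatorname{bias}(\hat\theta) = E[\hat\theta] - \theta\]

\[E[s^2] = \sigma^2, \qquad E[s_n^2] = \frac{n-1}{n}\,\sigma^2\]

So Bessel’s correction makes the sample variance unbiased, while the sample standard deviation $s$ itself remains slightly biased, only less so than $s_n$.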