In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53,940 round-cut diamonds. Here are the first 6 rows of this data set:
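The listing below can be reproduced with head(), assuming the ggplot2 package (which provides the diamonds data) is loaded:

library(ggplot2)  # provides the diamonds data set
head(diamonds)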
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
The distribution of a variable tells what values the variable can take and how often it takes these values. In this case, the variable is price.
A quantitative variable is a variable that is measured numerically. You would choose a quantitative variable when making a report about standard deviation because numerical values can be added and averaged. A categorical variable, by contrast, is simply a label, whereas a quantitative variable records the actual measured values.
A histogram is a graph that shows how often a variable takes values in each of a set of ranges (bins). The graph below shows, for each price range, how many diamonds fall into it.
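As a sketch of how such a histogram could be drawn with ggplot2 (the bin width of 500 is an arbitrary choice):

# histogram of diamond prices; each bar counts diamonds within a 500-wide price bin
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)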
A violin plot shows an estimate of the variable’s probability density, so it conveys information similar to a histogram.
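A violin plot of price could be sketched in a similar way; the empty string for x is just a placeholder, since only one variable is plotted:

# violin plot of price: a density estimate mirrored around a vertical axis
ggplot(diamonds, aes(x = "", y = price)) +
  geom_violin()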
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     326     950    2401    3933    5324   18823
The min is the smallest data point, the 1st quartile is the 25th percentile of the data, the median is the middle value of the data set, the 3rd quartile is the 75th percentile of the data, and the max is the largest data point.
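The min, quartiles, and max can also be obtained directly with quantile():

# 0th, 25th, 50th, 75th and 100th percentiles of price
quantile(diamonds$price, probs = c(0, 0.25, 0.5, 0.75, 1))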
In a standard box plot, the whiskers extend to the min and the max of the data. The line inside the box is the median; one edge of the box is the 1st quartile and the other edge is the 3rd quartile. A modified box plot marks outliers as individual dots and extends the whiskers only to the most extreme points that are not outliers, while a standard box plot does not single out the outliers. Outliers are data points that are drastically different from the rest of the data.
The mean is the red dot.
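A box plot like the one described could be sketched as follows (drawn vertically here); marking the mean as a red dot is done with stat_summary(), which is one of several possible approaches:

# box plot of price, with the sample mean added as a red dot
ggplot(diamonds, aes(x = "", y = price)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", colour = "red", size = 3)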
The first formula is the standard deviation with Bessel’s correction, indicated by the division by (n - 1). The second formula is the standard deviation without the correction, which divides by n.
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? R has no built-in function for this, but we can write one:
# standard deviation without Bessel's correction: divide by n instead of n - 1
sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
These two estimates are very close, with the version using Bessel’s correction being slightly larger.
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. A population is the entire group you want to learn about, while a sample is a smaller subset of that population used to represent the whole. A parameter is a numerical summary of the population as a whole, while a statistic is the corresponding summary computed from a sample; a statistic is generally a less exact description of the population than the parameter it estimates. The sample mean is the mean of a sample drawn from the population, while the population mean is the mean of the entire population, and the two are not always equal. Likewise, the sample standard deviation is the standard deviation computed from a sample, while the population standard deviation is computed from the population as a whole.
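As a small illustration, treating the full diamonds data set as the population, the population mean of price is a parameter, while the mean of a random sample is a statistic that estimates it (the sample size of 100 here is an arbitrary choice):

mean(diamonds$price)               # parameter: the population mean
mean(sample(diamonds$price, 100))  # statistic: the mean of one random sample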
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\), which is far smaller than one would use in practice but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. The seed of a random number generator is a value that initializes the generator: if you use the same seed each time, you get the same “random” numbers each time, and different seeds give different numbers.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
This command created a random sample of the larger data set.
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Each time you use a different seed, you get a different set of values to work with.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
Setting the seed once makes the whole sequence of samples reproducible: the three samples drawn after set.seed(1) differ from one another, but repeating set.seed(1) reproduces exactly the same three samples. Without setting a seed at all, you would get a different set of samples each run and would be unable to replicate them.
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
Here the mean has been computed for each of the three samples drawn after setting the seed. Other statistics we could use to describe a sample include any summary computed from it, such as a median, a proportion, or a standard deviation.
For example, the standard deviation with Bessel’s correction:
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
A sampling distribution describes every value a statistic can take across all possible samples and how often each value occurs. The standard deviations computed above are individual draws from the sampling distribution of the sample standard deviation. Tools we have to describe this distribution include the mean, median, and mode.
The plot below shows the sampling distribution of the sample mean for different values of the sample size.
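A sketch of how such a sampling distribution could be simulated and plotted (1000 replications and these particular sample sizes are arbitrary choices):

# simulate the sampling distribution of the sample mean for several sample sizes
sizes <- c(4, 16, 64, 256)
sim <- do.call(rbind, lapply(sizes, function(n) {
  data.frame(size = n,
             sample.mean = replicate(1000, mean(sample(diamonds$price, n))))
}))
# one histogram of sample means per sample size
ggplot(sim, aes(x = sample.mean)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ size)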
The spread of the sampling distribution of the sample mean decreases as the sample size increases: with larger samples, extreme sample means become rarer.
An estimator is the rule used to calculate an estimate of a quantity from a data set. The sample mean is an estimator of the mean of the entire population, and it becomes more accurate as the sample size gets larger.
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
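A sketch of how this sampling distribution could be simulated (1000 replications is an arbitrary choice; the sample size of 4 is the one set above):

# 1000 sample standard deviations with Bessel's correction, each from a sample of size 4
sd.samples <- replicate(1000, sd(sample(diamonds$price, sample.size)))
ggplot(data.frame(value = sd.samples), aes(x = value)) +
  geom_histogram(bins = 30)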
Some people argue that it is appropriate to drop Bessel’s correction when you have the entire population, but when the number of observations is large, as it is here, the two versions barely differ anyway. The sample standard deviation is used to estimate the population standard deviation, and it does so more reliably when the sample is not dominated by outliers or an extremely large spread, since then the sample mean and standard deviation are more stable.
Now let’s try without Bessel’s correction:
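The same simulation, using the sdn() function defined above, gives the version without Bessel’s correction:

# 1000 sample standard deviations without Bessel's correction
sdn.samples <- replicate(1000, sdn(sample(diamonds$price, sample.size)))
ggplot(data.frame(value = sdn.samples), aes(x = value)) +
  geom_histogram(bins = 30)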
The standard deviation with Bessel’s correction divides the sum of squared deviations by (n - 1), which corrects for the bias introduced by estimating the mean from the same sample, whereas the version without Bessel’s correction divides by n and does not correct for this bias. The version with Bessel’s correction is preferred because it accounts for this bias.
Sampling error is the difference between a sample statistic and the corresponding population parameter that arises simply because only part of the population is observed; it is not a mistake in data collection. Sampling bias, by contrast, arises when the sampling procedure systematically favors some parts of the population over others. The bias of an estimator is the difference between its expected value and the true value, and an unbiased estimator is an estimator with zero bias.
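As a rough check on this bias (a sketch, stated in terms of the variance, since Bessel’s correction makes the sample variance exactly unbiased), one can compare the average of many sample variances, with and without the correction, to the population variance:

# population variance of price (dividing by N)
pop.var <- sdn(diamonds$price)^2
pop.var

# average sample variance over many samples of size 4:
# var() divides by n - 1 (Bessel's correction); sdn()^2 divides by n
mean(replicate(10000, var(sample(diamonds$price, sample.size))))    # roughly pop.var
mean(replicate(10000, sdn(sample(diamonds$price, sample.size))^2))  # tends to be smaller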