---
title: "Bessel's Correction & Sampling Distributions"
author: "Sarah Shelson"
output: html_document
---
In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53,940 round-cut diamonds. Here are the first 6 rows of this data set:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
The distribution of a variable tells you what values the variable takes and how often it takes those values.
A quantitative variable takes numerical values that measure an amount, so arithmetic operations on it are meaningful. You need a quantitative variable in order to have values that can be plugged into the standard deviation formula. A categorical variable may also be recorded as a number, but that number does not represent a true numerical value. It serves as a label for a group, whereas quantitative data records actual amounts.
A histogram is a graph that shows how often the values of a variable fall within each of a series of intervals (bins). The histogram below shows the number of diamonds in each price range. This histogram is unimodal and skewed to the right.
A histogram and a violin plot both show how the data are distributed: a histogram uses binned counts, while a violin plot uses a smoothed density estimate.
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
The min is the smallest data point and the max is the largest. Quartile 1 (Q1) is the 25th percentile, meaning that 25% of the data lies below it. Quartile 3 (Q3) is the 75th percentile, meaning that 75% of the data lies below it. The median is the middle value of the sorted data (the 50th percentile), and the mean is the arithmetic average of all the data.
In a basic box plot, the min and max are found at the ends of the “whiskers”. Q1 and Q3 are the ends of the box, and the median is the line inside the box. A basic box plot extends the whiskers out to include every data point, whereas a modified box plot ends each whisker at the most extreme point that is not a suspected outlier and plots the suspected outliers individually. An outlier is an observation that lies unusually far from the bulk of the data. In a modified box plot, a point is a suspected outlier if it lies more than 1.5 × IQR below Q1 or above Q3, where IQR = Q3 - Q1.
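The 1.5 × IQR convention for flagging suspected outliers can be sketched in base R. This is only an illustration: the vector `x` is made-up toy data, not the diamonds prices, and the names `lower`, `upper`, and `suspected` are chosen here for the example.

```r
# Hypothetical toy data: mostly small values plus one very large one.
x <- c(2, 3, 4, 5, 6, 7, 8, 50)

q <- quantile(x, c(0.25, 0.75))  # Q1 and Q3
iqr <- q[2] - q[1]               # interquartile range
lower <- q[1] - 1.5 * iqr        # lower fence
upper <- q[2] + 1.5 * iqr        # upper fence

# Values outside the fences are the suspected outliers.
suspected <- x[x < lower | x > upper]
suspected
```

Base R's `boxplot.stats(x)$out` applies a similar rule, although it computes the box ends as hinges, which can differ slightly from the quartiles returned by `quantile()`.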
The mean is the red dot on the violin plot.
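The first formula below uses Bessel’s correction: it divides by n-1 instead of n, because once the sample mean \(\bar x\) is known, only n-1 of the deviations \(x_i - \bar x\) are free to vary (the degrees of freedom). The second formula divides by n.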
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
#### Standard Deviation of Diamonds Price
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? Base R does not provide a built-in function for this, but we can write one:
sdn <- function(x) {
return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
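The two values differ by about 0.037, with the value computed using Bessel’s correction being the larger. With \(n = 53940\) observations, dividing by \(n-1\) instead of \(n\) barely changes the result.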
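The size of that gap follows directly from the two formulas, which differ only by the factor \(\sqrt{n/(n-1)}\). The sketch below checks this on a small made-up vector (the values of `x` are hypothetical toy data) and then evaluates the factor for \(n = 53940\).

```r
# Same definition as sdn() in the text, repeated so this block runs on its own.
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Hypothetical toy data; any numeric vector with at least two values works.
x <- c(326, 950, 2401, 5324, 18823)
n <- length(x)

# The two formulas differ only by the factor sqrt(n / (n - 1)).
all.equal(sd(x), sdn(x) * sqrt(n / (n - 1)))

# For the diamonds data, n = 53940, so the factor is barely above 1.
sqrt(53940 / 53939)
```

The factor for the full data set is roughly 1.000009, which is why the two results agree to within a few hundredths. For a sample of size 4 the factor is \(\sqrt{4/3}\), about 1.15, so the choice matters much more for small samples.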
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. A population is the entire group you wish to learn about. A sample is a subset of the population that is used to draw conclusions about the population. Parameters are numbers that describe the whole population. Statistics are numbers computed from a sample that can be used to estimate parameters. An example of a statistic is the sample mean. An example of a parameter is the population mean. The sample mean is the mean computed from the sample, whereas the population mean is the mean of the whole population. Likewise, the sample standard deviation is the standard deviation computed from the sample, whereas the population standard deviation is the standard deviation of the population as a whole.
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. The seed is the value that initializes the random number generator and so determines which sequence of “random” numbers it produces. If you use the same seed, you will get the same sequence of random numbers. If you use different seeds, you will get different sequences.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
This command drew a random sample of 4 prices from the data set.
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Each seed determines its own sequence of random numbers, which explains why the sample changes when the seed changes and why resetting the seed to 1 reproduces the first sample.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
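When you don’t reset the seed between samples, the generator continues from where it left off, so each call returns a different sample. Resetting the seed to 1 restarts the sequence, which is why the same three samples appear again in the same order.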
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
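We have found the mean of each of three successive samples drawn after setting the seed once. The sample mean is a statistic, and it varies from sample to sample. Other statistics used to describe samples include the sample standard deviation.
For example, the standard deviation with Bessel’s correction: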
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
A sampling distribution is the distribution of a statistic over all possible samples of a given size from the population. Like any distribution, it can be described using the min, Q1, median, Q3, max, and IQR.
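A sampling distribution can be approximated by drawing many samples and recording the statistic each time. The sketch below is one way to do that, under stated assumptions: `population` is a simulated right-skewed stand-in rather than the diamonds prices, and the number of repetitions (2000) and the seed are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Hypothetical right-skewed population standing in for the diamond prices.
population <- rexp(10000, rate = 1 / 4000)

sample.size <- 4

# Draw 2000 samples of size 4 and record the mean of each one.
sample.means <- replicate(2000, mean(sample(population, sample.size)))

# Numerical description of the simulated sampling distribution.
summary(sample.means)
IQR(sample.means)
```

Replacing `mean` with `sd` inside `replicate()` would give the simulated sampling distribution of the sample standard deviation instead.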
The plot below shows the sampling distribution of the sample mean for different sample sizes.
The horizontal line is the population mean of the prices of all diamonds in the data set. The dots beyond the whiskers represent suspected outliers. The ends of each box represent Q1 and Q3, and the line in the middle of the box is the median.
An estimator is a statistic used to estimate a population parameter. The sample mean is an estimator of the population mean, and it does a better job when the sample size is larger, because its sampling distribution is more tightly concentrated around the population mean.
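The effect of sample size on the sample mean can be checked with a small simulation. This is a sketch under assumptions, not the code behind the plot: `population` is simulated, `spread.of.mean` is a helper name invented for this example, and the sample sizes 4, 16, and 64 are arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Hypothetical right-skewed population standing in for the diamond prices.
population <- rexp(10000, rate = 1 / 4000)

# Standard deviation of the sample mean over 2000 simulated samples of size n.
spread.of.mean <- function(n) {
  sd(replicate(2000, mean(sample(population, n))))
}

sizes <- c(4, 16, 64)
spreads <- sapply(sizes, spread.of.mean)
data.frame(n = sizes, sd.of.sample.mean = spreads)
```

The spread shrinks as the sample size grows. Each fourfold increase in n should roughly halve it, since the standard deviation of the sample mean is approximately the population standard deviation divided by \(\sqrt{n}\).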
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
Some people argue that it is appropriate to drop Bessel’s correction when the data are an entire population, but when n is large, as shown here, it doesn’t matter much. The sample standard deviation is an estimator of the population standard deviation, and it is a better estimator when the sample size is larger.
Now let’s try without Bessel’s correction:
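There is not much difference, except that with Bessel’s correction the estimates get close to the true population standard deviation at smaller sample sizes. Bessel’s correction is preferred for samples because dividing by n-1 accounts for the degree of freedom used up by estimating the mean from the same sample.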
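The effect of the correction at a small sample size can be checked by simulation. The sketch below is an illustration under assumptions, not the code behind the plots: `population` is a simulated stand-in for the prices, and the number of repetitions (5000) and the seed are arbitrary.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Same definition as sdn() in the text, repeated so this block runs on its own.
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Hypothetical right-skewed population standing in for the diamond prices.
population <- rexp(10000, rate = 1 / 4000)
true.sd <- sdn(population)  # the vector is the whole population, so divide by n

n <- 4
# Each column of this 4 x 5000 matrix is one sample of size 4.
samples <- replicate(5000, sample(population, n))

mean.sd <- mean(apply(samples, 2, sd))    # average estimate with Bessel's correction
mean.sdn <- mean(apply(samples, 2, sdn))  # average estimate without it

c(population = true.sd, with.bessel = mean.sd, without.bessel = mean.sdn)
```

Both averages should fall below the population value, with the Bessel-corrected average the closer of the two. That matches the opening claim: the sample standard deviation is biased either way, but the bias is smaller with the correction.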
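Sampling error is the difference between a sample statistic and the population parameter it estimates, which arises by chance because only part of the population is observed. It is not a mistake made while sampling. Sampling bias occurs when the sampling method systematically favors some members of the population over others, so that the sample is not representative. Increasing the sample size reduces sampling error but does not remove sampling bias.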
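The difference between the two ideas can be illustrated with a simulation. Everything in this sketch is an assumption made for illustration: `population` is simulated, and the biased scheme (sampling only from the cheaper half) is a made-up example of a method that systematically excludes part of the population.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Hypothetical right-skewed population standing in for the diamond prices.
population <- rexp(10000, rate = 1 / 4000)
n <- 100

# Random sampling: individual sample means miss the population mean by chance
# (sampling error), but they are centered on it.
random.means <- replicate(1000, mean(sample(population, n)))

# Biased sampling: only the cheaper half of the population can be selected,
# so the sample means are systematically too low.
cheaper.half <- population[population <= median(population)]
biased.means <- replicate(1000, mean(sample(cheaper.half, n)))

c(population = mean(population),
  random = mean(random.means),
  biased = mean(biased.means))
```

Averaged over many samples, the random scheme lands near the population mean while the biased scheme stays well below it, and taking larger samples from `cheaper.half` would not close that gap.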