In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
The distribution of a variable tells you what values the variable takes and how often it takes those values. Price is one variable recorded for each diamond, so its distribution describes which prices occur in the data set and how often each occurs.
Quantitative variables take numerical values for which arithmetic operations (adding, dividing, averaging) make sense. Not every numerical variable is quantitative: numbers can also serve as labels for categorical data (ex: Student = 1, Teacher = 2), and arithmetic on such codes is meaningless. Standard deviation only applies to quantitative variables because it is built from arithmetic operations (differences, squares, averages) on the data.
A histogram is a graph that sorts the values of a variable into groups (bins) and shows the number or percent of data points that fall in each bin. The graph below displays the range of diamond prices and how many diamonds fall in each price range. The histogram shows that the most common prices are at the low end, at roughly $1000 and below, and that fewer diamonds hold each price as price increases. The distribution is skewed right, with a major peak near $1000.
A histogram groups the values of a variable into bins and shows only the number or percent of data points in each bin. A violin plot displays the smoothed density of the same distribution as its width, mirrored about a central axis for visual effect. Violin plots have smooth edges and no bins, while histograms have defined bins. The horizontal scale of a histogram is the vertical scale of a violin plot. A histogram and a violin plot of the same data should therefore show the same shape, with the violin plot rotated and mirrored.
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
Together, these numbers make up the 5-number summary plus the mean. The “Min.” (326) is the lowest value in the data set of diamond prices, while the “Max.” (18823) is the greatest value. The 1st Quartile (950) is the median of the data to the left of the Median (the lower half of the data), so it marks the 25th percentile. The 3rd Quartile (5324) is the median of the data to the right of the Median (the upper half of the data), so it marks the 75th percentile. The Median (2401) is the middle value in the overall set of data. The Mean (3933) is the average value of the data; it is reported by summary() but is not one of the five numbers.
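The “median of each half” description of the quartiles can be checked on a small example. The sketch below uses made-up values (not diamond prices) and compares fivenum(), which returns Tukey’s hinges, with quantile(), which summary() relies on and which interpolates between data points by default. In small data sets the two can disagree slightly; in a data set as large as diamonds the difference is usually negligible.

```r
# Illustrative values only (an assumption, not taken from the diamonds data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
# The lower hinge is the median of the lower half (1, 2, 3, 4), i.e. 2.5.
fivenum(x)         # 1.0 2.5 4.5 6.5 8.0

# Default quantile() interpolates, so the 25th percentile is 2.75 here
quantile(x, 0.25)
```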
Essentially, the 5-number summary listed above is what a box plot draws. The box is formed using the 1st Quartile (950) and the 3rd Quartile (5324). The horizontal line in the box represents the Median (2401). The Mean (3933) is not part of the 5-number summary, but it can be added as a dot in the box. The lines that extend from the box reach the Min (326) and the Max (18823) in the data set.
A box plot is a graph of the 5-number summary: the box spans the 1st Quartile to the 3rd Quartile, the line in the box marks the Median, and the “whiskers” (lines) extending from the box stop at the Minimum and Maximum data points. A modified box plot, on the other hand, uses the 1.5xIQR rule to draw the whiskers (IQR is the Interquartile Range: the distance between the 1st Quartile and the 3rd Quartile). Each whisker stops at the most extreme data point that lies within 1.5xIQR of the box, rather than at the minimum or maximum. Points beyond the whiskers are plotted individually and labeled as outliers. An outlier is an individual observation that falls outside the overall pattern of the data.
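As a worked example of the 1.5xIQR rule, the sketch below plugs in the quartiles printed by summary(diamonds$price) above. Those printed values are rounded, so the fences are approximate; the variable names are only illustrative.

```r
# Quartiles as printed (rounded) by summary(diamonds$price)
q1 <- 950
q3 <- 5324

iqr <- q3 - q1                  # 4374
lower_fence <- q1 - 1.5 * iqr   # -5611
upper_fence <- q3 + 1.5 * iqr   # 11885

c(lower = lower_fence, upper = upper_fence)
```

The minimum price (326) is above the lower fence, so the lower whisker stops at the minimum. The maximum price (18823) is above the upper fence, so the most expensive diamonds would be plotted individually as outliers.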
The mean of the box plot below is indicated by the red dot. It is at approximately $3,900.
The first formula gives the standard deviation (s) with Bessel’s correction (n-1). It is the square root of the variance: the sum of the squared deviations of the data from their mean, divided by n-1, and then square-rooted. Here \(x_i\) is a data point, \(\bar x\) is the mean, \(\sum\) is the “sum of,” and \(n\) is the number of data points (the sample size). Dividing by n-1 instead of n improves the sample standard deviation as an estimator of the population standard deviation by reducing bias.
The second formula also gives a standard deviation, but without Bessel’s correction: it divides by n instead of n-1.
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:
sdn <- function(x) {
return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
The two estimates, 3989.44 (standard deviation with Bessel’s correction) and 3989.403 (standard deviation without Bessel’s correction) have a difference of 0.037. The standard deviation with Bessel’s correction, 3989.44, is larger.
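The two formulas differ only in the divisor, so the two results are always related by a fixed factor: \(s_n = s\sqrt{(n-1)/n}\). The sketch below checks this on a small made-up vector (an assumption for illustration, not diamond prices), redefining sdn so the snippet runs on its own.

```r
# Standard deviation without Bessel's correction (divides by n)
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Illustrative values only: mean 5, squared deviations sum to 32
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sd(x)    # sqrt(32 / 7), with Bessel's correction
sdn(x)   # sqrt(32 / 8) = 2, without

# The two differ by exactly the factor sqrt((n - 1) / n)
all.equal(sdn(x), sd(x) * sqrt((n - 1) / n))
```

For the diamonds data, n = 53940, so the factor is about 0.99999, which is why the two results above differ by only about 0.037.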
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. A population is all of the observations of interest, while a sample is just a portion of the population (one or more observations, but not all). A parameter is any measurement of the population (ex: mean, median, standard deviation, variance), while a statistic is the same kind of measurement computed from a sample.
The mean is the average value of a set of data. Thus, the population mean is the average over the entire population, while the sample mean is the average over a sample (only a portion of the population). The population standard deviation is the square root of the variance, that is, the square root of the average of the squared deviations of the data from their mean, computed over every member of the population. The sample standard deviation is computed the same way, but using only the observations in the sample. Sample standard deviations are used to estimate population standard deviations when the full population is not available.
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. The seed of a random number generator is the value that determines which sequence of “random” numbers the computer produces. These numbers are not truly random (they are pseudo-random), since a deterministic algorithm generates them from the seed. When you use the same seed, you get the same “random” numbers. When you use different seeds, you get different “random” numbers.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
The command asked the computer to draw a random sample of four diamond prices from the population, using the seed set above (1). The computer did just that, yielding 5801, 8549, 744, and 538.
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
The previous three commands all asked the computer to draw a sample of four diamond prices from the population. However, the first used a seed of 2, the second a seed of 3, and the last a seed of 1, so they yielded different results. The seed of 2 gave us 4702, 1006, 745, and 4516. The seed of 3 gave us 4516, 1429, 9002, and 7127. The seed of 1 gave us 5801, 8549, 744, and 538, exactly the same sample as the first time we used a seed of 1. This demonstrates that the “random” numbers are fully determined by the seed: the same seed reproduces the same sample, and different seeds give different samples.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
When we don’t reset the seed between samples, each call to sample() continues from where the previous call left off in the sequence of “random” numbers started by the last seed. That is why both runs, which begin with a seed of 1, produce the same three samples in the same order: the computer is simply working through the sequence for seed 1 from the start each time.
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
The code above computed the mean price for three different samples of size four, starting from a seed of 1. The three sample means were 3908, 2521, and 3346.75. One can also use standard deviation to describe samples.
For example, standard deviation with Bessel’s correction:
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
The sampling distribution of a statistic shows all of the different values the statistic could take, and how often it would take each value, across every possible sample of a given size from a population. The examples above give us six values of the sample standard deviation: three from the sampling distribution with Bessel’s correction and three from the sampling distribution without it.
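For a very small population, “every possible sample” can be listed directly. The sketch below uses a made-up population of four values (an assumption for illustration) and the base R function combn() to compute the sample mean for every possible sample of size 2.

```r
# Tiny illustrative population (not diamond prices)
population <- c(2, 4, 6, 8)

# Sample mean for every possible sample of size 2, drawn without replacement
all_means <- combn(population, 2, FUN = mean)
all_means          # 3 4 5 5 6 7

# The sampling distribution: each possible value and how often it occurs
table(all_means)

# Its mean equals the population mean
mean(all_means)    # 5
mean(population)   # 5
```

With 53940 diamonds, listing every sample is not practical, which is why sampling distributions for larger populations are usually approximated by drawing many random samples instead.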
We can describe these distributions using measures of center (mean, median), measures of spread (variance, standard deviation), etc.
The plot below shows the sampling distribution of the sample mean for different values of sample size.
In the graph below, the horizontal axis represents the sample size and the vertical axis gives the sample mean of diamond prices. Six sample sizes are given, and the sampling distributions for all of them have the same mean, which equals the population mean. Each sample size is represented by what appears to be a modified box plot, with a red dot representing the mean. The boxes span the 1st to 3rd Quartiles, the line in each box is the median, and the whiskers extend to the most extreme values within the 1.5xIQR rule. The black dots beyond the whiskers are outliers.
The graph shows that as sample size increases, the sample means cluster more tightly around the population mean. The sampling distributions for larger sample sizes are less skewed and less spread out, so a sample mean from a large sample is likely to be close to the population mean. The sampling distributions for smaller sample sizes tend to be skewed and have more outliers, so small samples give less accurate estimates of the population mean.
An estimator is simply a statistic that is computed from a sample and used to estimate the corresponding population parameter. Examples include the sample mean and the sample standard deviation. The sample mean estimates the population mean. The population mean never changes, but the sample mean differs from sample to sample. The sample mean is a more accurate estimator when the distribution is roughly symmetrical, as the mean is not a resistant measure of center. Thus, the more skewed a distribution is, or the more outliers a sample has, the farther the sample mean is likely to be from the population mean.
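The narrowing of the sampling distribution can be reproduced with a short simulation. The sketch below does not use the diamonds data: it assumes a simulated right-skewed population (drawn with rexp(), purely as a stand-in) and hypothetical sample sizes of 4, 16, and 64, then measures the spread of 2000 sample means at each size.

```r
set.seed(1)

# Hypothetical right-skewed population, standing in for diamond prices
population <- rexp(10000, rate = 1 / 4000)

sample_sizes <- c(4, 16, 64)

# For each sample size, draw 2000 samples and record the spread of their means
spread <- sapply(sample_sizes, function(n) {
  sd(replicate(2000, mean(sample(population, n))))
})
names(spread) <- sample_sizes

spread  # spread of the sample means shrinks as the sample size grows
```

Each fourfold increase in sample size should roughly halve the spread, since the standard deviation of the sample mean is approximately the population standard deviation divided by \(\sqrt{n}\).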
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here, it doesn’t matter much. Bessel’s correction has little impact for very large sample sizes, since dividing by n-1 and dividing by n give nearly the same result, and large samples tend to be more accurate representations of the population anyway. The sample standard deviation estimates the population standard deviation. It is a better estimator for roughly normal distributions, where the mean is an accurate measure of center. Because it is built from the mean, it is not resistant, so it is less reliable for distributions that have outliers or are skewed.
Now let’s try without Bessel’s correction:
Standard deviation with Bessel’s correction divides the sum of the squared deviations of the data from their mean by n-1, whereas standard deviation without Bessel’s correction divides the same sum by n (the number of observations). The result with Bessel’s correction is a larger standard deviation, which gives it less bias and makes it a more accurate estimate of the population standard deviation. I think that standard deviations with Bessel’s correction are more accurate and thus better estimators. This matters because accuracy in statistics is important. Samples can be poor representatives of their population, so it is important to reduce bias as much as possible in order to better describe the population.
Sampling error is the difference between a statistic found in a sample and the true value of the corresponding parameter in the population. Statistics vary from sample to sample, so they are usually somewhat inaccurate as estimates. The parameters of the population, however, are fixed and do not change.
Bias is the average (mean) sampling error over all possible samples from a population. Sampling bias in particular arises when the sample is not random, so that the sample makes certain characteristics of the population appear more prevalent than they really are.
An unbiased estimator is a statistic whose sampling distribution has a mean equal to the population parameter it estimates. The sample mean is an example: averaged over all possible samples, it equals the population mean.
A biased estimator is a statistic whose sampling distribution has a mean that differs from the population parameter, so on average it overestimates or underestimates the true value. The standard deviation without Bessel’s correction is an example: on average it underestimates the population standard deviation.
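The opening claim, that both versions of the sample standard deviation are biased but the bias is smaller with Bessel’s correction, can be checked by simulation. The sketch below is a rough illustration under stated assumptions: it uses a simulated right-skewed population (rexp(), as a stand-in for diamond prices), a sample size of 4, and 5000 repeated samples; none of these choices come from the original analysis.

```r
# Standard deviation without Bessel's correction (divides by n)
sdn <- function(x) sqrt(mean((x - mean(x))^2))

set.seed(1)

# Hypothetical right-skewed population, standing in for diamond prices
population <- rexp(10000, rate = 1 / 4000)
sigma <- sdn(population)   # population standard deviation (divide by N)

# 5000 samples of size 4, one per column of a 4 x 5000 matrix
samples <- replicate(5000, sample(population, 4))

s_bessel <- apply(samples, 2, sd)    # with Bessel's correction
s_plain  <- apply(samples, 2, sdn)   # without Bessel's correction

# Estimated bias: average estimate minus the true population value
bias_bessel <- mean(s_bessel) - sigma
bias_plain  <- mean(s_plain) - sigma

c(bessel = bias_bessel, plain = bias_plain)
```

Both estimated biases should come out negative (the population standard deviation is underestimated on average), with the uncorrected version further from zero.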