Bessel’s Correction & Sampling Distributions

In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.

For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

The distribution of a variable describes how often each possible outcome occurs, relative to the others, over a number of trials. Here, the distribution of the price variable tells us which prices are most common among the diamonds in the data set.

Type of variable chosen

Quantitative variables are those measured on a numeric scale. They are the variables we will use to calculate the standard deviation in the data set. Quantitative variables must be numerical. However, some numerical variables are not quantitative, such as social security numbers and ID numbers.

Histogram of diamonds price.

A histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.

Diamonds priced between roughly $250 and $1,250 are relatively common. However, as the price increases, the number of diamonds decreases.
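The binning behind a histogram can be sketched in base R. The six prices and the class interval of 5000 below are illustrative assumptions, not the full diamonds data; calling hist() with plot = FALSE returns the counts per class interval instead of drawing the rectangles.

```r
# Illustrative prices only; not the full diamonds data set.
prices <- c(326, 950, 2401, 3933, 5324, 18820)

# Class intervals of width 5000 (an assumed bin width).
breaks <- seq(0, 20000, by = 5000)

# plot = FALSE returns the bin counts instead of drawing the histogram.
h <- hist(prices, breaks = breaks, plot = FALSE)
h$counts
```

Each count becomes the height of one rectangle, so most of the area sits in the lowest price interval.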

Violin plot

Both the histogram and the violin plot show how densely the data are concentrated at each price. However, the histogram reads frequencies off an axis, whereas the violin plot conveys density only through its width.

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18820

‘Min.’ is the smallest number in the data set. ‘1st Qu.’ (Q1) is the first quartile: about a quarter of the values lie below it. ‘Median’ is the middle number in the data set. ‘Mean’ is the average of all the numbers in the data set. ‘3rd Qu.’ (Q3) is the third quartile: about three quarters of the values lie below it. ‘Max.’ is the largest number in the data set.
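To make the quartiles concrete, here is a small sketch on made-up values (the vector x is an assumption for illustration, not part of the diamonds data). quantile() returns the same quartiles that summary() reports.

```r
# Nine made-up values, already sorted, so the quartiles are easy to read off.
x <- 1:9

# First quartile, median, and third quartile.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q

# summary() reports the same three values plus Min., Mean, and Max.
summary(x)
```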

Modified Box Plots

‘Min’ is located at the lowest point of the plot. ‘Q1’ is the bottom edge of the orange rectangle. ‘Median’ is the black line inside the orange rectangle. ‘Q3’ is the top edge of the orange rectangle. ‘Max’ is located at the highest point of the plot.

A box plot can tell us where ‘Min’, ‘Q1’, ‘Median’, ‘Q3’, and ‘Max’ are. However, the modified box plot also draws the outliers as individual points, which gives a sense of how many extreme values there are.

An outlier is an observation point that is distant from other observations. As the modified box plot shows, the outliers are so numerous that they merge into the black line at the upper end of the plot.
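The text above does not say how distant a point must be to count as an outlier. A common convention, and the default in R's boxplot(), flags points that lie more than 1.5 times the interquartile range beyond the box. A minimal sketch with made-up values (the vector x is an assumption for illustration):

```r
# Made-up values with one obviously distant point.
x <- c(1:9, 50)

# coef = 1.5 is the default rule used by boxplot().
stats <- boxplot.stats(x, coef = 1.5)

stats$out    # points drawn individually in a modified box plot
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
```

Note that the whiskers stop at the most extreme values that are not outliers, so the upper whisker ends at 9 rather than at the maximum.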

Adding the mean to the plot

The mean is the red point in the diagram.

Standard Deviation: Formulas

The first formula computes the standard deviation with Bessel’s correction: it divides by $n-1$.

The second formula also computes the standard deviation, but without Bessel’s correction: it divides by $n$.

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)

## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)

## [1] 3989.403

The difference between the two estimates is 0.037, and the first estimate (with Bessel’s correction) is larger.
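The size of this gap can be checked by hand, because the two formulas differ only in the divisor. A short worked derivation, using $n = 53940$ and the value of $s_n$ computed above (the approximation step assumes $n$ is large):

\[s = s_n\sqrt{\frac{n}{n-1}} \quad\Rightarrow\quad s - s_n = s_n\left(\sqrt{\frac{n}{n-1}} - 1\right) \approx \frac{s_n}{2(n-1)} = \frac{3989.403}{2 \times 53939} \approx 0.037\]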

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. A parameter is any number calculated from a population, which is the entire group of interest. A statistic is any number calculated from a sample, which is a part of the population chosen at random.

For example, consider the question: are the grades of college students inflated? The population is 7 million students, and the parameter is the population mean, 2.7. The sample is 100 students, and the statistic is the sample mean, 2.9.

The sample mean is the mean of a sample drawn at random from the whole population. The population mean is the average of the entire group. The standard deviation of a population measures the dispersion of the data for the entire population. The standard deviation of a sample estimates the standard deviation of the population based on a random sample.

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, $n$.

First, we need to choose a sample size, $n$. We choose $n=4$, which is very low in practice but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. set.seed(x) sets the starting value (the seed) for generating a sequence of pseudorandom numbers. Functions that rely on the random number generator, such as sample(), then produce the same sequence each time the same seed is set.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

This command asks R to choose four prices at random from the data set, without replacement.

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)

## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)

## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Once we set the seed, R gives us a random sample. If the seed is different, the results will generally be different as well. If we set the seed to the same number, R returns the same random sample.

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

If we set the seed to 1 and then call sample(diamonds$price, sample.size) several times, each call returns a different sample. However, resetting the seed to 1 reproduces the same sequence of samples.
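This behaviour can be checked programmatically rather than by eye. The sketch below uses a short made-up vector of prices (an assumption, so that it runs without ggplot2) and compares two runs that start from the same seed.

```r
# Made-up prices standing in for diamonds$price.
prices <- c(326, 950, 2401, 3933, 5324, 18820, 744, 538, 4702, 1006)

# First run: set the seed once, then draw two samples in a row.
set.seed(1)
first_run <- list(sample(prices, 4), sample(prices, 4))

# Second run: reset the same seed and repeat the same two draws.
set.seed(1)
second_run <- list(sample(prices, 4), sample(prices, 4))

# TRUE: the same seed reproduces the same sequence of samples.
identical(first_run, second_run)
```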

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))

## [1] 3908

mean(sample(diamonds$price,sample.size))

## [1] 2521

mean(sample(diamonds$price,sample.size))

## [1] 3346.75

We set the seed to 1, draw several samples, and calculate the mean of each. We could also use the mode, the most frequently observed value in the data set, or the median, the value that divides a data set into two equal halves.

Another example is the standard deviation, with Bessel’s correction:
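The median has a built-in function in R, but the statistical mode does not (R's own mode() reports the storage type of an object instead). A minimal sketch follows; the helper name stat_mode and the sample values are assumptions made for illustration.

```r
# Hypothetical helper: most frequently observed value of a numeric vector.
# If several values tie, the smallest of them is returned.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Made-up sample of four prices with one repeated value.
x <- c(538, 744, 744, 5801)

median(x)     # middle of the sorted values
stat_mode(x)  # most frequent value
```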

set.seed(1)
sd(sample(diamonds$price,sample.size))

## [1] 3936.586

sd(sample(diamonds$price,sample.size))

## [1] 1683.428

sd(sample(diamonds$price,sample.size))

## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))

## [1] 3409.183

sdn(sample(diamonds$price,sample.size))

## [1] 1457.891

sdn(sample(diamonds$price,sample.size))

## [1] 1763.582
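The two sets of results are linked by a fixed factor, because the formulas differ only in dividing by $n-1$ or $n$. The sketch below checks this on the four prices drawn with seed 1 earlier in this document; sdn is redefined so that the block runs on its own.

```r
# Standard deviation without Bessel's correction (divides by n).
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# The first sample shown above (seed 1, n = 4).
x <- c(5801, 8549, 744, 538)
n <- length(x)

sd(x)   # with Bessel's correction
sdn(x)  # without Bessel's correction

# The uncorrected value is the corrected one scaled by sqrt((n - 1) / n).
all.equal(sdn(x), sd(x) * sqrt((n - 1) / n))
```

For $n=4$ the factor is about 0.866, so the correction changes the estimate by roughly 13%.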

Sampling Distributions of Statistics

The sampling distribution of a statistic is the probability distribution of that statistic over random samples of a given size. Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. To describe these distributions, we can use software such as StatCrunch to draw graphs, such as a histogram, violin plot, or modified box plot, that help us analyze the distribution of the data.
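A sampling distribution can be approximated by simulation: draw many samples, compute the statistic for each, and look at the resulting values. The sketch below is one way to do this in base R. The right-skewed population from rexp() and the number of repetitions are assumptions chosen for illustration; they stand in for diamonds$price so that the block runs without ggplot2.

```r
set.seed(1)

# Hypothetical right-skewed population standing in for the prices.
population <- rexp(10000, rate = 1 / 4000)

# Draw 2000 samples of size 4 and record the mean of each one.
sample_means <- replicate(2000, mean(sample(population, 4)))

# The centre of the sampling distribution sits near the population mean.
mean(population)
mean(sample_means)
```

The vector sample_means can then be passed to hist() or boxplot() to draw the distribution.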

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

The means of these sampling distributions are the same. However, their spread and shape differ with the sample size. As the sample size increases, the distributions become more concentrated and the number and range of outliers decrease.

An estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand), and its result (the estimate) are distinguished. The sample mean from a group of observations is an estimate of the population mean. When the sample size is larger, the sample mean does a better job of estimating the population mean.
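The effect of sample size on the sample mean can be sketched by repeating the simulation for several values of $n$ and measuring the spread of the resulting means. The population and the sample sizes below are assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical right-skewed population standing in for the prices.
population <- rexp(10000, rate = 1 / 4000)

sample_sizes <- c(4, 16, 64)

# For each n: simulate 2000 sample means and take their standard deviation.
spread <- sapply(sample_sizes, function(n) {
  sd(replicate(2000, mean(sample(population, n))))
})
names(spread) <- sample_sizes

# The spread shrinks as n grows, so larger samples estimate the mean better.
spread
```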

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here, it doesn’t matter much. When the sample size is large enough, Bessel’s correction has little effect. We can estimate the population standard deviation from a sample standard deviation, and the larger the sample, the better the estimate.

Now let’s try without Bessel’s correction:

On average, the standard deviation with Bessel’s correction is closer to the population standard deviation than the standard deviation without it. I’d say both are usable, because it depends on the case. When the sample size is small, as with $n=4$, Bessel’s correction makes a noticeable difference. When the sample size is large, Bessel’s correction doesn’t really matter.
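The comparison can also be made numerically by averaging each estimator over many samples of size 4. The sketch below uses a hypothetical population from rexp() in place of the diamond prices, so its numbers will not match the plots; only the ordering of the three values is the point.

```r
# Standard deviation without Bessel's correction (divides by n).
sdn <- function(x) sqrt(mean((x - mean(x))^2))

set.seed(1)

# Hypothetical right-skewed population standing in for the prices.
population <- rexp(10000, rate = 1 / 4000)
pop_sd <- sdn(population)  # population value: no correction needed

# 5000 samples of size 4, stored as the columns of a 4 x 5000 matrix.
samples <- replicate(5000, sample(population, 4))

mean_sd  <- mean(apply(samples, 2, sd))   # with Bessel's correction
mean_sdn <- mean(apply(samples, 2, sdn))  # without Bessel's correction

c(population = pop_sd, with_bessel = mean_sd, without_bessel = mean_sdn)
```

Both averages fall below the population value, and the uncorrected one falls further, which is the smaller bias described at the start of this document.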

Sampling error and sampling bias

Sampling error is the error caused by observing a sample instead of the whole population. The sampling error is the difference between a sample statistic used to estimate a population parameter and the actual but unknown value of the parameter. If the estimator systematically overestimates or underestimates, the mean of the difference is called the bias. In more mathematical terms, an estimator is unbiased if its expected value equals the parameter. That’s just saying that if the mean of the sampling distribution of the estimator (the sample mean) equals the parameter (the population mean), then it’s an unbiased estimator.
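In symbols, writing $\hat\theta$ for an estimator of a parameter $\theta$ and $\mu$ for the population mean (notation introduced here for illustration, not taken from the text above), the bias and the unbiasedness of the sample mean can be written as:

\[\operatorname{Bias}(\hat\theta) = E[\hat\theta] - \theta, \qquad E[\bar x] = \mu \;\Rightarrow\; \operatorname{Bias}(\bar x) = 0\]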