Bessel’s Correction & Sampling Distributions

In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.


For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?

The distribution of a variable describes which values the variable can take and how often each value (or range of values) occurs. For price, the distribution tells us how many diamonds fall in each price range: low prices are the most common, and the number of diamonds decreases as the price increases.

Type of variable chosen

Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.

A quantitative variable is a variable measured on a numeric scale, where the numbers represent amounts, so that differences and averages are meaningful. This choice matters in a report about standard deviation because the standard deviation is computed from the mean and from distances to the mean, which only make sense for quantitative data. A numerical variable is any variable recorded as numbers. Quantitative variables are numerical and can be further classified as discrete or continuous, but not every numerical variable is quantitative: a zip code is recorded as a number, yet it is a label rather than a quantity. The two concepts are related because both use numbers, but only quantitative variables describe quantities in the data.

Histogram of diamonds price.

What is a histogram? Explain graph below.

A histogram is a diagram of adjacent rectangles, one for each class interval: the width of each rectangle equals the width of the interval, and its area is proportional to the frequency of values in that interval. This graph is skewed to the right: as the price of diamonds increases, the count of diamonds decreases.
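As a rough sketch of what a histogram computes, the counts behind the bars can be reproduced by hand with cut() and table(). The prices and bin edges below are made up for illustration; they are not taken from the diamonds data.

```r
# Hypothetical prices, chosen only to illustrate binning
prices <- c(350, 400, 900, 1200, 1500, 2600, 7400)

# Equal-width class intervals: (0, 2000], (2000, 4000], (4000, 6000], (6000, 8000]
breaks <- seq(0, 8000, by = 2000)

# cut() assigns each price to an interval; table() counts prices per interval
counts <- table(cut(prices, breaks = breaks))
print(counts)

# hist() performs the same binning; plot = FALSE returns the counts without drawing
h <- hist(prices, breaks = breaks, plot = FALSE)
print(h$counts)
```

Most of these made-up prices land in the first interval and few in the last, which is the same right-skewed shape described above.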

Violin plot

Explain the relationship between a histogram and a violin plot.

A violin plot is a method of plotting numeric data that shows the estimated probability density of the data at different values. It often includes a marker for the median and a box indicating the interquartile range, and it is commonly used to compare the distribution of a variable across categories. Like a histogram, it shows a distribution: it is essentially a histogram turned on its side, smoothed out, and mirrored. A histogram uses rectangles to show the frequency of data in successive numerical intervals of equal size, with the values of the variable along the horizontal axis and the counts along the vertical axis. When a box plot is drawn inside the violin, the plot also displays the five-number summary.

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

Describe what each of these numbers means.

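These numbers are the minimum, first quartile, median, mean, third quartile, and maximum of the price of diamonds. The minimum (Q0) is the lowest value in the data. The first quartile (1st Qu.) is the value below which about 25% of the data fall, roughly the median of the lower half of the data. The median is the middle value: half of the prices are below it and half are above. The mean is the arithmetic average of all the prices. The third quartile (3rd Qu.) is the value below which about 75% of the data fall, roughly the median of the upper half. The maximum is the highest value in the data. Here the mean (3933) is well above the median (2401), which is typical of right-skewed data.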
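The median and the mean are different numbers and can be far apart. A minimal sketch on toy data (the vector below is made up, with one extreme value on purpose):

```r
# Toy data with one extreme value, chosen to separate the mean from the median
x <- c(1, 2, 3, 4, 100)

s <- summary(x)
print(s)

# The median is the middle value; the mean is the arithmetic average
median(x)  # 3
mean(x)    # 22
```

The single large value pulls the mean up to 22 while the median stays at 3, the same pattern seen in the diamond prices on a smaller scale.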

Modified Box Plots

Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.

Five of the numbers above appear in a box plot: the box runs from the 1st Qu. to the 3rd Qu., the line inside the box is the median, and the whiskers stretch outward from the quartiles. The mean is not part of a box plot. In a standard box plot, the whiskers extend all the way to the min and the max, so the plot shows exactly the five-number summary. In a modified box plot, each whisker stops at the most extreme data point that lies within 1.5 times the interquartile range (IQR = 3rd Qu. - 1st Qu.) of the box, and any point beyond that is drawn individually as a suspected outlier. An outlier is an observation that is distant from the other observations. The graph shows that diamonds at lower prices are more common than diamonds at higher prices.
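To make the outlier rule concrete, here is a small sketch that applies the common 1.5 × IQR rule to the quartiles printed by summary() above. It assumes the plot uses that rule, and it uses the rounded values as printed, so the fences are approximate.

```r
# Quartiles as printed by summary(diamonds$price) in this report (rounded)
q1 <- 950
q3 <- 5324
price_min <- 326
price_max <- 18823

iqr <- q3 - q1                 # 4374
lower_fence <- q1 - 1.5 * iqr  # -5611
upper_fence <- q3 + 1.5 * iqr  # 11885

# Points outside the fences are flagged as suspected outliers
price_min < lower_fence  # FALSE: no suspected outliers on the low side
price_max > upper_fence  # TRUE: the highest prices are suspected outliers
```

Under this rule every diamond priced above roughly 11885 would be drawn as an individual point beyond the upper whisker.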

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot.

The red dot indicates the mean.

Standard Deviation: Formulas

Explain the formulas below, say which uses Bessel’s correction.

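Both formulas compute a standard deviation from the squared deviations from the sample mean \(\bar x\). The first one uses Bessel’s correction: it divides by \(n-1\) instead of \(n\). The correction matters most for small samples, where it reduces the bias of the estimate.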

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
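
The two formulas differ only in the divisor, so each can be obtained from the other. A short derivation, using the same symbols as above:

\[s_n^2 = \frac{1}{n}\sum\left(x_i - \bar x\right)^2 = \frac{n-1}{n}\cdot\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2 = \frac{n-1}{n}\,s^2\]

\[s_n = s\,\sqrt{\frac{n-1}{n}}\]

Since \(\sqrt{(n-1)/n} < 1\), \(s_n\) is smaller than \(s\) whenever the data are not all equal, and the two values approach each other as \(n\) grows.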

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)

## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

# Standard deviation without Bessel's correction: divide by n, not n - 1
sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)

## [1] 3989.403

How close are these estimates? Which is larger?

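These estimates are very close: 3989.44 with Bessel’s correction and 3989.403 without. The one with Bessel’s correction is slightly larger, because it divides by \(n-1\) rather than \(n\).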
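The size of the gap can be checked by hand: dividing by \(n\) rather than \(n-1\) multiplies the standard deviation by \(\sqrt{(n-1)/n}\). A minimal sketch using only numbers printed in this report (the printed standard deviation is rounded, so the result is approximate):

```r
# Values taken from this report
n <- 53940    # number of diamonds
s <- 3989.44  # sd(diamonds$price), as printed (rounded)

# s_n = s * sqrt((n - 1) / n)
ratio <- sqrt((n - 1) / n)
s_n <- s * ratio

print(ratio)  # very close to 1 for large n
print(s_n)    # close to the 3989.403 printed by sdn(diamonds$price)
```

With more than fifty thousand observations the ratio is so close to 1 that the correction changes the result only in the second decimal place.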

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.

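A population is the total set of observations that can be made. A sample is a subset of the population for which data are actually collected. A parameter is a numerical measure that describes a population; a statistic is a numerical measure that describes a sample. Example of a parameter: 10% of US senators voted for a certain bill (all 100 senators are counted, so the whole population is described). Example of a statistic: 60% of the US citizens in a survey agree with the new health policy (US citizens are a large population, so only a sample is asked). The population mean is the average over every member of the population and is a parameter; the sample mean is the average over the sample, is a statistic, and changes from sample to sample. Likewise, the population standard deviation is the square root of the population variance and measures dispersion around the population mean, while the sample standard deviation is a statistic that measures the dispersion of the sample data around the sample mean and is used to estimate the population standard deviation.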
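The distinction can be sketched in a few lines of R. The population below is hypothetical (ten made-up ages), small enough that its parameters can be computed exactly and compared with the statistics from one random sample.

```r
# Hypothetical population: the ages of all 10 members of a small club
population <- c(21, 25, 30, 34, 38, 41, 47, 52, 58, 64)

# Parameters describe the whole population
pop_mean <- mean(population)
pop_sd <- sqrt(mean((population - pop_mean)^2))  # divides by N

# A sample is a randomly chosen subset of the population
set.seed(1)
smp <- sample(population, 4)

# Statistics describe the sample and estimate the parameters
smp_mean <- mean(smp)
smp_sd <- sd(smp)  # divides by n - 1 (Bessel's correction)

print(c(pop_mean = pop_mean, smp_mean = smp_mean))
print(c(pop_sd = pop_sd, smp_sd = smp_sd))
```

The parameters are fixed numbers, while the statistics would change if a different sample were drawn.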

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.

A seed is the starting point of the sequence of numbers produced by a pseudorandom number generator. If you use the same seed you’ll get the same sequence, and therefore the same sample (and the same mean); if you use different seeds you’ll get different samples.
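A minimal sketch of this behaviour, independent of the diamonds data; the seed values 42 and 43 are arbitrary choices for illustration:

```r
set.seed(42)
a <- runif(3)

set.seed(42)  # same seed: the generator restarts at the same point
b <- runif(3)

set.seed(43)  # different seed: a different starting point
d <- runif(3)

identical(a, b)  # TRUE: same seed, same numbers
identical(a, d)  # FALSE: different seed, different numbers
```

This is why a report that sets its seed gives the same "random" results every time it is knitted.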

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Explain what this command did.

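This command drew a random sample of 4 prices, without replacement, from the 53940 prices in the data set. It returns the sampled prices themselves, not a summary such as the sample mean.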
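The behaviour of sample() can be checked on a small vector. The prices below are a hypothetical stand-in for diamonds$price, used so that the sketch runs without ggplot2.

```r
# Hypothetical price vector standing in for diamonds$price
prices <- c(326, 334, 335, 336, 554, 2757, 2759, 18823)

set.seed(1)
drawn <- sample(prices, 4)  # 4 values picked at random, without replacement
print(drawn)

# The result is a vector of raw values, not a summary such as the mean
length(drawn)
all(drawn %in% prices)

# Without replacement, no position in `prices` can be picked twice
idx <- sample(length(prices), 4)
anyDuplicated(idx) == 0
```

Passing replace = TRUE would instead allow the same position to be drawn more than once.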

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)

## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)

## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

Explain these results.

Each seed gives a different sample. When the seed is set back to 1, the first sample repeats.

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

set.seed(1)
sample(diamonds$price, sample.size)

## [1] 5801 8549  744  538

sample(diamonds$price, sample.size)

## [1] 4879 1976 2322  907

sample(diamonds$price, sample.size)

## [1]  463 3376 4932 4616

Explain these results.

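After the seed is set to 1, each call to sample() continues from where the previous call left off in the pseudorandom sequence, so three consecutive calls give three different samples. Setting the seed to 1 again restarts the sequence, and the same three samples repeat in the same order.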

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))

## [1] 3908

mean(sample(diamonds$price,sample.size))

## [1] 2521

mean(sample(diamonds$price,sample.size))

## [1] 3346.75

Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?

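We set the seed to 1, drew three samples of size 4, and took the mean of each sample. Each sample mean is a statistic: one number that describes a sample, and it varies from sample to sample. Other statistics we could use include the median, the minimum and maximum, the quartiles, and the standard deviation.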

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))

## [1] 3936.586

sd(sample(diamonds$price,sample.size))

## [1] 1683.428

sd(sample(diamonds$price,sample.size))

## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))

## [1] 3409.183

sdn(sample(diamonds$price,sample.size))

## [1] 1457.891

sdn(sample(diamonds$price,sample.size))

## [1] 1763.582
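
For a sample of size 4 the two versions differ noticeably. A short sketch, using the first sample printed in this report and redefining sdn() so the block runs on its own:

```r
# The first sample of size 4 shown in this report (seed 1)
x <- c(5801, 8549, 744, 538)
n <- length(x)

# Standard deviation without Bessel's correction (same definition as sdn above)
sdn <- function(x) sqrt(mean((x - mean(x))^2))

with_bessel <- sd(x)      # about 3936.586, as printed above
without_bessel <- sdn(x)  # about 3409.183, as printed above

# For any sample of size n, the ratio is sqrt((n - 1) / n)
without_bessel / with_bessel
sqrt((n - 1) / n)  # about 0.866 for n = 4
```

The same ratio of about 0.866 holds for each of the three samples above, so at this sample size dropping the correction shrinks every estimate by roughly 13%.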

Sampling Distributions of Statistics

Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?

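It is the distribution of the values a statistic takes over all possible random samples of a given size. Each number computed above, such as the sample means 3908, 2521, and 3346.75, is one draw from the sampling distribution of the corresponding statistic for \(n=4\). We can describe these distributions with the same tools used for any distribution: histograms, violin plots, box plots, and numerical summaries such as the mean and standard deviation.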
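A sampling distribution can be approximated by simulation: draw many samples and record the statistic each time. The sketch below assumes a made-up, right-skewed population as a stand-in for diamonds$price; the population, the seed, and the number of repetitions are illustrative choices.

```r
# Hypothetical right-skewed population standing in for diamonds$price
set.seed(1)
population <- round(rexp(5000, rate = 1 / 4000))

# Draw 1000 samples of size 4 and record the mean of each
sample_means <- replicate(1000, mean(sample(population, 4)))

# Describe the simulated sampling distribution numerically
summary(sample_means)
sd(sample_means)
```

The vector sample_means can then be passed to hist() or drawn as a violin plot, just like any other variable.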

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.

The line is the population mean.

Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?

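An estimator is a rule for computing, from sample data, an estimate of a population parameter. The sample mean is an estimator of the population mean. It does a better job when the sample size is large, because the sample means then vary less around the population mean.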
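The effect of sample size on the sample mean can be sketched by simulation. The population below is hypothetical, and the helper spread_of_mean() is a name introduced here for illustration only.

```r
set.seed(1)
population <- round(rexp(5000, rate = 1 / 4000))  # hypothetical prices
pop_mean <- mean(population)

# Typical distance between the sample mean and the population mean
# for samples of size n (root mean squared error over many samples)
spread_of_mean <- function(n, reps = 2000) {
  means <- replicate(reps, mean(sample(population, n)))
  sqrt(mean((means - pop_mean)^2))
}

spread_small <- spread_of_mean(4)
spread_large <- spread_of_mean(64)
print(c(n4 = spread_small, n64 = spread_large))
```

The sample means for n = 64 cluster much more tightly around the population mean than those for n = 4.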

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here, it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?

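It is estimating the population standard deviation. Dropping the correction matters little for large \(n\) because dividing by \(n\) or by \(n-1\) gives nearly the same result. The estimate is better for larger sample sizes; of those shown, the best estimate occurs at the 1024 sample size.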

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?

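The only difference is the divisor: with Bessel’s correction the sum of squared deviations is divided by \(n-1\), without it by \(n\), so the corrected value is always slightly larger. Bessel’s correction is better when estimating the population standard deviation from a sample, because it reduces the bias of the estimate. It matters most for small samples; for large samples the two values are nearly equal.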
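The claim that the bias is smaller with Bessel’s correction can be checked by simulation. This is a sketch under stated assumptions: the population is made up, samples are drawn with replacement so that the draws are independent, and sdn() is redefined so the block runs on its own.

```r
set.seed(1)
population <- round(rexp(5000, rate = 1 / 4000))  # hypothetical prices

sdn <- function(x) sqrt(mean((x - mean(x))^2))  # no Bessel's correction
pop_sd <- sdn(population)                       # population standard deviation

# Many small samples (n = 4); each column of the matrix is one sample
samples <- replicate(5000, sample(population, 4, replace = TRUE))

mean_sd <- mean(apply(samples, 2, sd))    # average estimate with Bessel's correction
mean_sdn <- mean(apply(samples, 2, sdn))  # average estimate without it

print(c(population = pop_sd, with_bessel = mean_sd, without_bessel = mean_sdn))
```

On average both versions tend to fall below the population standard deviation at this sample size, but the corrected one lands closer to it.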

Sampling error and sampling bias

Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.

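Sampling error is the difference between a sample statistic and the population parameter it estimates, for example the sample mean minus the population mean; it varies from sample to sample. Sampling bias is the mean sampling error: a systematic error that does not average out to zero over repeated samples. The bias of an estimator is the difference between the estimator’s expected value and the true value of the parameter being estimated. An unbiased estimator has a bias equal to 0, so its sampling errors average out to zero, although any single sample can still have a nonzero sampling error. A biased estimator has a nonzero bias.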
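As a small worked example, the sampling errors of the three sample means computed earlier in this report can be found by subtraction. The population mean is taken from the rounded summary() output, so the errors are approximate.

```r
# Values printed earlier in this report
pop_mean <- 3933                        # mean of all diamond prices (rounded)
sample_means <- c(3908, 2521, 3346.75)  # means of three samples of size 4

# Sampling error of each sample mean: statistic minus parameter
sampling_errors <- sample_means - pop_mean
print(sampling_errors)  # -25, -1412, -586.25

# Three samples are far too few to judge bias; this only shows the arithmetic
mean(sampling_errors)
```

Judging bias requires averaging the errors over a very large number of samples, not just three.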