In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.

For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53,940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

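Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price? The distribution of a variable describes which values the variable takes and how often it takes each of them. For price, the distribution tells us which prices occur among the 53,940 diamonds and how frequently each price (or range of prices) occurs.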

Type of variable chosen

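Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related. A quantitative variable takes numerical values for which arithmetic such as adding and averaging makes sense, so it can be summarized with means and displayed with stem plots and histograms. This choice matters here because the standard deviation is built from differences and averages, so it is only meaningful for a quantitative variable. Every quantitative variable is numerical, but a numerical variable is quantitative only if arithmetic on its values is meaningful; a zip code, for example, is a number but not a quantity.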

Histogram of diamonds price.

What is a histogram? Explain graph below. A histogram is a graph that uses bars to show the distribution of a single quantitative variable. The range of values is divided into intervals called bins, and the height of each bar shows how many diamonds have a price in that bin. It is similar to a stem plot, except that it works well for large data sets. The graph below shows that the distribution of price is skewed to the right: there are many more inexpensive diamonds than expensive ones.
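
The binning behind a histogram can be sketched in base R as follows. This is only an illustration, assuming ggplot2 is installed so that diamonds is available. The number of breaks (30) is an arbitrary choice, not necessarily the one used for the plot in this report, and `bins` is a name introduced here.

```r
library(ggplot2)  # provides the diamonds data set

# Bin the prices without drawing a plot; breaks = 30 is only a suggestion to hist()
h <- hist(diamonds$price, breaks = 30, plot = FALSE)

# One row per bin: left edge, right edge, and number of diamonds in that bin
bins <- data.frame(from  = head(h$breaks, -1),
                   to    = tail(h$breaks, -1),
                   count = h$counts)
head(bins)
```

Every diamond falls in exactly one bin, so the counts add up to the number of rows in the data set.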

Violin plot

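Explain the relationship between a histogram and a violin plot. A violin plot shows the same information as a histogram, but as a smoothed density estimate that is turned on its side and mirrored about a central axis. The width of the violin at a given price shows how common diamonds near that price are.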

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

Describe what each of these numbers means. The minimum price of a diamond is $326. The first quartile is $950: about 25% of the diamonds cost $950 or less. The median, $2401, is the middle price: half of the diamonds cost less and half cost more. The mean (average) price is $3933. The third quartile is $5324: about 75% of the diamonds cost $5324 or less. The maximum price is $18823.
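
A quartile is a cutoff value, not an average of part of the data. The following sketch, which assumes ggplot2 is installed, computes the quartiles with quantile() and then checks what share of the diamonds is priced at or below each one. The names `price`, `q`, and `share_below` are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds data set

price <- diamonds$price

# First quartile, median, and third quartile
q <- quantile(price, c(0.25, 0.50, 0.75))
q

# Share of diamonds priced at or below each cutoff (roughly 25%, 50%, 75%)
share_below <- sapply(q, function(cutoff) mean(price <= cutoff))
share_below
```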

Modified Box Plots

Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot. The box plot is a visual representation of the five-number summary: the box runs from the first quartile to the third quartile, with a line at the median, and in a standard box plot the whiskers extend to the minimum and the maximum. The mean is not part of the box plot. In a modified box plot, the whiskers stop at the most extreme observations that lie within the fences, which are placed 1.5 times the interquartile range (IQR) below the first quartile and above the third quartile. An outlier is an observation that lies unusually far from the rest of the data; any observation beyond a fence is flagged as a suspected outlier and plotted as an individual point.
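
As a small worked example, the fences for price can be computed from the quartiles printed by summary() above. The quartiles are used as rounded in that output, so the fences are approximate; the variable names are introduced here.

```r
# Quartiles as printed (rounded) by summary(diamonds$price)
q1 <- 950
q3 <- 5324

iqr <- q3 - q1                    # interquartile range: 4374
lower_fence <- q1 - 1.5 * iqr     # -5611
upper_fence <- q3 + 1.5 * iqr     # 11885

c(lower = lower_fence, upper = upper_fence)
```

No price can fall below the lower fence, since prices are positive, so under this rule all suspected outliers are diamonds priced above about $11885.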

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot. The mean is the red dot on the plot. It lies above the median ($3933 versus $2401).

Standard Deviation: Formulas

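Explain the formulas below, say which uses Bessel’s correction. Both formulas take the square root of an average of squared deviations from the sample mean. The first formula uses Bessel’s correction: it divides the sum of squared deviations by \(n-1\), whereas the second divides by \(n\).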

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
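
The two formulas differ only by a constant factor. A short derivation, using the same symbols as above, makes this explicit:

\[s^2 = \frac{1}{n-1}\sum\left(x_i - \bar x\right)^2 = \frac{n}{n-1}\cdot\frac{1}{n}\sum\left(x_i - \bar x\right)^2 = \frac{n}{n-1}\,s_n^2 \qquad\text{so}\qquad s = s_n\sqrt{\frac{n}{n-1}}\]

The factor is greater than 1 for every \(n \ge 2\), so \(s\) is never smaller than \(s_n\), and the factor approaches 1 as \(n\) grows.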

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)
## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403

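How close are these estimates? Which is larger? The two values are very close: they differ by about 0.04 out of roughly 3989. The standard deviation with Bessel’s correction is larger, as it always is, because it divides by the smaller number \(n-1\).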
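
The size of the gap can be checked by hand, since the two versions differ only by the factor sqrt(n / (n - 1)). The sketch below uses the value printed by sdn() above and the number of diamonds; the variable names are introduced here.

```r
n   <- 53940      # number of diamonds in the data set
s_n <- 3989.403   # standard deviation without Bessel's correction, as printed above

# Rescaling by sqrt(n / (n - 1)) gives the Bessel-corrected value
s <- s_n * sqrt(n / (n - 1))
s        # about 3989.44, matching sd(diamonds$price)
s - s_n  # about 0.037
```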

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation. A population is the entire group of individuals we want to learn about, while a sample is the part of the population that we actually observe. A parameter is a number that describes a population, while a statistic is a number computed from a sample. An example of a parameter is the average age of everyone in a class, when the class is the whole population of interest. An example of a statistic is the percentage of voters in a poll of Pennsylvania residents who say they voted for the Democratic party. The population mean is a fixed number, usually unknown, while the sample mean changes from sample to sample and is used to estimate it. Likewise, the population standard deviation is a fixed number that measures the spread of all the values in the population, while the sample standard deviation measures the spread of the values in one sample and is used to estimate the population standard deviation.

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\), which is very small in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you. A pseudorandom number generator produces a long sequence of numbers that look random but are completely determined by its starting state. The seed sets that starting state, and so determines the sequence that follows. Using the same seed gives the same sequence of numbers every time, and using a different seed gives a different sequence.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

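Explain what this command did. It drew 4 prices at random, without replacement, from the 53,940 prices in the data set, using the random numbers produced by the generator after it was seeded with 1.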

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

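Explain these results. Each seed produces a different sample of 4 prices, and setting the seed back to 1 reproduces the first sample exactly. Any of these is a possible sample from the population.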
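
The reproducibility can also be checked directly rather than by eye. This minimal sketch uses the six prices from the first rows of the data set shown earlier as a stand-in for the full price column; `prices`, `first_draw`, and `second_draw` are names introduced here.

```r
# Prices from the first six rows of the data set shown earlier
prices <- c(326, 326, 327, 334, 335, 336)

set.seed(1)
first_draw <- sample(prices, 4)

set.seed(1)
second_draw <- sample(prices, 4)

# Same seed, same call, so the two samples are identical
identical(first_draw, second_draw)  # TRUE
```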

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616

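Explain these results. When we do not reset the seed between samples, the generator continues along its sequence, so each call to sample returns a different sample. Resetting the seed to 1 restarts the sequence, so the same three samples appear again in the same order.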

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75

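Explain what we have done here. Answer the following question: what other statistics could we use to describe samples? We drew three random samples of size 4 and computed the mean of each one. The sample mean is a statistic, and its value changes from sample to sample (3908, 2521, and 3346.75 here). Other statistics that could describe a sample include the median, the quartiles, the minimum and maximum, the range, and the standard deviation.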

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
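
For a single sample, the two versions of the standard deviation always differ by the factor sqrt(n / (n - 1)). The sketch below checks this on the first sample drawn with seed 1 (5801, 8549, 744, 538), typed in by hand so that it runs without the data set. It repeats the definition of sdn from above.

```r
# First sample of size 4 drawn above with set.seed(1)
x <- c(5801, 8549, 744, 538)

# Standard deviation without Bessel's correction, as defined earlier
sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}

sd(x)            # with Bessel's correction, about 3936.586
sdn(x)           # without Bessel's correction, about 3409.183
sd(x) / sdn(x)   # equals sqrt(4 / 3), about 1.1547
```

With \(n = 4\) the corrected value is about 15% larger, whereas for all 53,940 prices the two values agreed to within about 0.04.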

Sampling Distributions of Statistics

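Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions? A sampling distribution of a statistic describes which values the statistic takes over all possible samples of a given size, and how often it takes them. Each number computed above is a single draw from the sampling distribution of the corresponding statistic. We can describe these distributions with the same tools used for any quantitative variable: histograms, violin plots, and box plots, along with numerical summaries such as the mean, median, quartiles, and standard deviation.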
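
A sampling distribution can be approximated by repeating the sampling step many times and collecting the statistic each time. The following is a minimal sketch, assuming ggplot2 is installed; the number of repetitions (1000) and the name `sample_means` are choices made here, not taken from the plots in this report.

```r
library(ggplot2)  # provides the diamonds data set

set.seed(1)
sample.size <- 4

# Draw 1000 samples of size 4 and keep the mean of each one
sample_means <- replicate(1000, mean(sample(diamonds$price, sample.size)))

# Describe the approximate sampling distribution of the sample mean
summary(sample_means)
sd(sample_means)
```

Replacing mean with sd or sdn in the replicate call gives the corresponding approximation for the standard deviation.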

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set. The graph below shows the sampling distribution of the sample mean of diamond prices for several sample sizes. As the sample size gets larger, the sample means cluster more tightly around the population mean, extreme values become rarer, and the sample mean becomes a more accurate estimate.

Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job? An estimator is a rule for calculating an estimate of a population quantity from sample data; the sample mean is one example. The sample mean estimates the population mean. It does a better job when the sample size is larger, because its values then vary less from sample to sample.
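
How quickly the sample mean settles down can be illustrated with the standard error formula, sigma / sqrt(n), which holds for independent draws and is a close approximation here because the samples are small relative to the 53,940 diamonds. The sketch treats the data set as the population and uses the value printed by sdn() above; the sample sizes and variable names are chosen only for illustration.

```r
sigma <- 3989.403          # population standard deviation of price, as printed above
n <- c(4, 16, 64, 256)     # illustrative sample sizes

# Approximate standard deviation of the sample mean for each sample size
standard_error <- sigma / sqrt(n)
data.frame(n = n, standard_error = standard_error)
```

Each fourfold increase in sample size cuts the typical error of the sample mean in half.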

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate? For a large population, \(n\) and \(n-1\) are nearly equal, so dividing by one or the other gives almost the same value, as the two results for all 53,940 prices showed. The sample standard deviation estimates the population standard deviation. It is a better estimate when the sample is larger, and with Bessel’s correction its bias is smaller.

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter? With Bessel’s correction the sum of squared deviations is divided by \(n-1\); without it, the sum is divided by \(n\), which gives a smaller value. Bessel’s correction is the better choice when estimating from a sample because it reduces the bias, and it matters most when the sample size is small.
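
The effect on bias can be explored with a small simulation. This is a sketch under stated assumptions: ggplot2 is installed, the data set is treated as the population, and the number of repetitions (2000) and all variable names are chosen here. One would expect both averages to fall below the population value, with the Bessel-corrected average closer to it.

```r
library(ggplot2)  # provides the diamonds data set

price <- diamonds$price
sdn <- function(x) sqrt(mean((x - mean(x))^2))  # no Bessel's correction

set.seed(1)
sample.size <- 4

# Each column of this 4 x 2000 matrix is one random sample of prices
samples <- replicate(2000, sample(price, sample.size))

with_bessel    <- apply(samples, 2, sd)
without_bessel <- apply(samples, 2, sdn)

# Compare the average of each estimator with the population value
c(population     = sdn(price),
  with_bessel    = mean(with_bessel),
  without_bessel = mean(without_bessel))
```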

Sampling error and sampling bias

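Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators. Sampling error is the chance difference between a statistic and the parameter it estimates, which arises because a sample includes only part of the population. Sampling bias is a systematic error that arises when the sample is selected in a way that makes it unrepresentative of the population. A biased estimator has an expected value that differs from the parameter it estimates, so it is off on average. An unbiased estimator has an expected value equal to the parameter, so its bias is zero.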
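
In symbols, the bias of an estimator can be written as follows. The notation is introduced here and does not appear elsewhere in this report: \(\hat\theta\) is the estimator, \(\theta\) is the parameter it estimates, and \(E\) denotes the average over all possible samples.

\[\mathrm{Bias}(\hat\theta) = E[\hat\theta] - \theta\]

An estimator is unbiased when this quantity is zero. For independent random draws from a population with variance \(\sigma^2\), the squared versions of the two formulas above satisfy

\[E[s^2] = \sigma^2 \qquad\text{and}\qquad E[s_n^2] = \frac{n-1}{n}\,\sigma^2\]

so Bessel’s correction removes the bias in the variance. Taking the square root reintroduces some bias, which is why \(s\) is described as less biased than \(s_n\) rather than unbiased.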