In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.

For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53,940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?

The distribution of a variable tells us two things: what values the variable takes and how often it takes each of those values. For price, the distribution shows which prices occur among the diamonds and how many diamonds sell at each price.

Type of variable chosen

Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.

A quantitative variable holds numbers for which arithmetic operations are meaningful. If arithmetic on the numbers is not meaningful, the variable is categorical rather than quantitative, even though it is stored as numbers. For that reason, numerical is not an appropriate substitute for quantitative: everything that is quantitative is numerical, but not everything that is numerical is quantitative. A zip code, for example, is numerical, but the average of two zip codes means nothing. It is important to use quantitative data in a report about standard deviation because the values will be averaged (among other arithmetic operations), and categorical values cannot be used in arithmetic operations.
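The distinction can be sketched in R. The values below are made up for illustration and are not taken from the diamonds data; the names `price`, `zip`, and `zip_factor` are hypothetical.

```r
# Hypothetical example values, not taken from the diamonds data
price <- c(326, 334, 335)        # quantitative: arithmetic is meaningful
zip   <- c(10001, 94105, 60601)  # numerical labels: arithmetic is not meaningful

mean(price)  # a sensible "typical price"
mean(zip)    # R will compute this, but the result describes no real place

# Storing the labels as a factor tells R to treat them as categorical
zip_factor <- factor(zip)
is.numeric(price)       # TRUE
is.numeric(zip_factor)  # FALSE
```

R cannot tell on its own whether arithmetic is meaningful, so the choice of storage type is up to the analyst.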

Histogram of diamonds price.

What is a histogram? Explain graph below.

A histogram is a diagram that divides the range of a variable into intervals (bins) and draws one bar per bin, with the height of each bar equal to the number of values that fall in that bin. The graph below shows how many diamonds fall in each price range.
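As a small sketch of the counting behind a histogram, base R's `hist()` can return the bin counts without drawing anything. The prices and bin edges below are made up for illustration.

```r
# Made-up prices, used only to show how values are counted into bins
prices <- c(350, 420, 480, 900, 1200, 1250, 2600, 4100)

# Bin edges every 1000; plot = FALSE returns the counts instead of drawing
h <- hist(prices, breaks = c(0, 1000, 2000, 3000, 4000, 5000), plot = FALSE)

h$breaks  # the bin edges
h$counts  # number of prices in each bin; these are the bar heights
```

Every value lands in exactly one bin, so the counts add up to the number of values.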

Violin plot

Explain the relationship between a histogram and a violin plot.

A violin plot captures how densely the values of a variable are packed along its range. Both plots describe the same distribution: a histogram is a bar graph whose bar heights are the frequencies of values in each bin, while a violin plot draws a smoothed density curve, mirrored about a central axis, and is often combined with a box plot. In the simplest terms, a violin plot can be thought of as a smoothed histogram turned on its side. A histogram can also display density instead of frequency, but it is less practical for this because the y-axis has to be rescaled.

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

Describe what each of these numbers means.

The output displays the five-number summary plus the mean. The minimum is the lowest value in the data. The 1st quartile is the middle value between the minimum and the median, so about a quarter of the values lie at or below it. The median is the middle value of the data set. The mean is the sum of all the values in the data set divided by n, the number of values in the data set. The 3rd quartile is the middle value between the median and the maximum, so about three quarters of the values lie at or below it. The maximum is the highest value in the data set.

Modified Box Plots

Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.

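The box in the plot spans the 1st quartile to the 3rd quartile, with the median drawn as a line inside it. In a regular box plot the whiskers extend all the way to the minimum and the maximum. In a modified box plot the whiskers stop at the most extreme values that lie inside the fences, and any value beyond a fence is drawn individually; here there are so many of them that they appear as a thick line. An outlier is a value that lies far outside the main group of values. The fences are found from the IQR (interquartile range), which is the distance between the first quartile and the third quartile: multiply the IQR by 1.5, then place the lower fence that far below the first quartile and the upper fence that far above the third quartile. Anything outside the fences is identified as a suspected outlier.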
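The fence rule can be sketched in a few lines of R. The values are made up, and the quartiles come from `quantile()` with its default method; other quartile conventions (including the hinges used by some box plot functions) can give slightly different fences.

```r
# Made-up values with one unusually large value
x <- c(2, 3, 4, 5, 6, 7, 8, 30)

q1  <- quantile(x, 0.25, names = FALSE)  # first quartile
q3  <- quantile(x, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                           # interquartile range

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are the suspected outliers
suspected <- x[x < lower_fence | x > upper_fence]
suspected
```

Only the value 30 falls outside the fences in this example.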

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot.

The median is the black horizontal line and the mean is the red dot; the mean is higher than the median because the high-priced outliers pull the average up.

Standard Deviation: Formulas

Explain the formulas below, say which uses Bessel’s correction.

Both formulas compute a standard deviation, and they differ only in the divisor. The first one uses Bessel’s correction because it divides by \(n-1\) instead of \(n\). The second one is less common, although most people think it is easier to understand. The first is used because it gives a better estimate of the population standard deviation when the data are a sample.

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
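As a worked sketch of the two formulas, the code below computes both by hand for a small made-up vector and compares the first with R's built-in `sd()`. The variable names `s` and `sn` mirror the symbols in the formulas.

```r
# Made-up data: the mean is 5 and the squared deviations sum to 32
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

squared_dev <- (x - mean(x))^2

s  <- sqrt(sum(squared_dev) / (n - 1))  # with Bessel's correction
sn <- sqrt(sum(squared_dev) / n)        # without Bessel's correction

s      # matches sd(x)
sn     # equals 2 for this data
sd(x)  # R's built-in function uses the n - 1 divisor
```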

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)
## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403

How close are these estimates? Which is larger?

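The estimates are very close together: 3989.44 with Bessel’s correction and 3989.403 without. The estimate found with Bessel’s correction is higher, because dividing by \(n-1\) instead of \(n\) always gives a larger result. The two differ by a factor of \(\sqrt{n/(n-1)}\), which is almost exactly 1 when \(n = 53940\), so Bessel’s correction only matters when the number of values being averaged is small.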
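The size of the gap depends only on \(n\), since \(s / s_n = \sqrt{n/(n-1)}\). A minimal sketch, where `bessel_ratio` is a hypothetical helper name introduced here:

```r
# Ratio of the corrected to the uncorrected standard deviation
bessel_ratio <- function(n) sqrt(n / (n - 1))

bessel_ratio(4)      # noticeably above 1 for a tiny sample
bessel_ratio(53940)  # almost exactly 1 for the full diamonds data

# Check the identity on a small made-up vector
x  <- c(3, 8, 1, 12)
sn <- sqrt(mean((x - mean(x))^2))  # uncorrected standard deviation
sd(x) / sn                         # same as bessel_ratio(length(x))
```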

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.

We can sample the diamonds data set and display the prices of the diamonds.

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.

The seed determines where you start in the sequence of pseudorandom numbers. If you use the same seed every time, you get the same numbers every time, even across more than one computation. If you change the seed, you start at a different place in the sequence and get different random numbers.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

Explain what this command did.

This command drew a random sample of 4 prices (the value of sample.size) from the price column of the diamonds data set. The seed set above determines which 4 prices are drawn.

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

Explain these results.

These results show us that each seed produces a different random sample of prices from the data set, and that going back to the first seed reproduces the first sample exactly.

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616

Explain these results.

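These results tell us that when we do not set a seed between samples, each call to sample continues along the same pseudorandom sequence, so the three samples differ from one another. When we reset the seed to 1, the sequence starts over and the same three samples come back in the same order.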

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75

Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?

Here we drew three random samples of 4 diamond prices and computed the mean of each, so each sample is described by a single number. Other statistics we could use to describe samples include the median, the minimum, the maximum, and the standard deviation.

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582

Sampling Distributions of Statistics

Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?

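The sampling distribution of a statistic shows us the values the statistic could take across all possible samples of a given size from the population, and how often it would take each of these values. Each number computed above is a single draw from such a distribution: the means are draws from the sampling distribution of the sample mean, and the standard deviations are draws from the sampling distribution of the sample standard deviation. We can describe these distributions with the same tools used for any distribution: plots such as histograms and violin plots, and numerical summaries such as the mean, median, standard deviation, and variance.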
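A sampling distribution can be approximated by repeating the sampling many times. The sketch below uses a simulated right-skewed population as a stand-in for the diamond prices, so it runs without ggplot2; the population, the number of repetitions, and the variable names are assumptions made for illustration.

```r
set.seed(1)

# Simulated right-skewed "population"; a stand-in, not the diamonds prices
population <- rexp(10000, rate = 1 / 4000)

# Draw 1000 samples of size 4 and record the mean of each one
sample_means <- replicate(1000, mean(sample(population, 4)))

# Describe the approximate sampling distribution of the sample mean
summary(sample_means)
sd(sample_means)
```

Passing `sample_means` to `hist()` would display the same approximate distribution as a plot.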

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.

Every time a different sample is drawn, a different sample mean results. This graph shows the distribution of those sample means, with one grouping for each sample size. As sample size increases, so does the quality of the estimate: the larger the sample, the closer the sample means tend to lie to the population mean line.

Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?

An estimator is a statistic that is computed from a sample and used to estimate the corresponding population parameter. The sample mean estimates the population mean. It does a better job when the sample size is larger, because the sample means then cluster more tightly around the population mean. For a fixed sample size it also does a better job when the data are more symmetric: the more the data are skewed or contain outliers, the farther an individual sample mean is likely to be from the population mean.
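The effect of sample size on the quality of the estimate can be sketched by comparing the spread of sample means at two sample sizes. The simulated population and the helper `spread_of_means` are assumptions introduced for illustration.

```r
set.seed(1)

# Simulated right-skewed "population"; a stand-in, not the diamonds prices
population <- rexp(10000, rate = 1 / 4000)

# Standard deviation of the sample means for a given sample size
spread_of_means <- function(n, reps = 2000) {
  sd(replicate(reps, mean(sample(population, n))))
}

spread_small <- spread_of_means(4)   # samples of size 4
spread_large <- spread_of_means(64)  # samples of size 64

spread_small
spread_large  # smaller: larger samples give means closer to the population mean
```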

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here, it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?

Bessel’s correction does not have as large an impact on very large populations because \(n\) and \(n-1\) are then nearly equal, so dividing by one or the other gives almost the same result. The sample standard deviation estimates the population standard deviation. It is a better estimate when the sample size is larger, and for roughly normal distributions where the mean is an accurate measure of the center. So, again, the sample standard deviation is not as accurate for small samples, for samples with many outliers, or where the data are extremely skewed.

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?

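The only difference is the divisor: with Bessel’s correction the sum of squared deviations is divided by \(n-1\), and without it by \(n\), so the corrected value is always slightly larger. When using Bessel’s correction, the mean sampling error is closer to zero, suggesting that Bessel’s correction provides the better estimate. The difference matters most for small samples; with larger sample sizes both versions become more accurate and the gap between them shrinks.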
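The comparison can be sketched with a simulation. The population below is simulated rather than taken from the diamonds data, and its standard deviation is computed with the \(n\) divisor because it is treated as a complete population; both choices are assumptions made for illustration.

```r
# Uncorrected standard deviation (divides by n), as defined earlier
sdn <- function(x) sqrt(mean((x - mean(x))^2))

set.seed(1)
# Simulated right-skewed "population"; a stand-in, not the diamonds prices
population <- rexp(10000, rate = 1 / 4000)
pop_sd <- sdn(population)  # the parameter being estimated

# Each column of this matrix is one random sample of size 4
samples <- replicate(5000, sample(population, 4))

with_bessel    <- apply(samples, 2, sd)
without_bessel <- apply(samples, 2, sdn)

# Mean sampling error (estimated bias) of each estimator
bias_with    <- mean(with_bessel) - pop_sd
bias_without <- mean(without_bessel) - pop_sd

bias_with     # negative: still biased
bias_without  # more negative: the bias is larger without the correction
```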

Sampling error and sampling bias

Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.

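Sampling error is the difference between a statistic, such as the sample mean, and the parameter it estimates, such as the population mean. Bias is the mean sampling error, taken over all possible samples. An unbiased estimator has a mean sampling error of zero, and a biased estimator does not. The sample mean is an unbiased estimate of the population mean. The sample standard deviation is a biased estimate of the population standard deviation, although the bias is smaller with Bessel’s correction.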
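For a very small population, every possible sample can be listed, so the bias can be computed exactly instead of estimated. The five-value population below is made up for illustration, and its standard deviation uses the \(n\) divisor because it is a complete population.

```r
# Made-up population of five values
population <- c(1, 2, 4, 7, 11)

# Every possible sample of size 2; each column is one sample
samples <- combn(population, 2)

# Sampling error of the sample mean for each possible sample
mean_errors <- colMeans(samples) - mean(population)
mean_bias   <- mean(mean_errors)  # zero: the sample mean is unbiased

# Sampling error of the sample standard deviation (with Bessel's correction)
pop_sd    <- sqrt(mean((population - mean(population))^2))
sd_errors <- apply(samples, 2, sd) - pop_sd
sd_bias   <- mean(sd_errors)      # not zero: the sample sd is biased

mean_bias
sd_bias
```

Individual sampling errors can be large in either direction; bias only describes their average.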