In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.

For data, we will use the diamonds data set in the R-Package ggplot2, which contains data from 53940 round cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?

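The distribution of a variable describes which values the variable takes and how often it takes each of them. Here the variable is price, and its distribution is the pattern formed by the 53940 diamond prices in the data set: where they are centered, how spread out they are, and what shape they form.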

Type of variable chosen

Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.

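A quantitative variable takes numerical values that measure an amount, so arithmetic such as averaging is meaningful. This matters here because the standard deviation, which measures how far values typically fall from the mean, can only be computed for a quantitative variable. A numerical variable is any variable recorded as numbers. Every quantitative variable is numerical, but not all numerical data is quantitative: a zip code, for example, is a number that acts as a category.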

Histogram of diamonds price.

What is a histogram? Explain graph below.

A histogram is a type of plot that divides the range of a variable into bins and draws a bar over each bin whose height is the frequency (count) of values falling in that bin.
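To make the binning concrete, here is a minimal sketch in base R. The vector `x` holds made-up values (it is not taken from the diamonds data), and the bin width of 1000 is an arbitrary choice for illustration. Passing `plot = FALSE` asks `hist()` to return the bin counts instead of drawing the bars.

```r
# Hypothetical toy prices, invented for this sketch (not diamonds data).
x <- c(330, 350, 420, 480, 510, 900, 950, 1200, 2500, 4100)

# Bins of width 1000; by default each bin is (lower, upper].
h <- hist(x, breaks = seq(0, 5000, by = 1000), plot = FALSE)

h$breaks  # the bin edges
h$counts  # how many values fall in each bin (the bar heights)
```

Calling `hist(x, breaks = seq(0, 5000, by = 1000))` without `plot = FALSE` would draw the corresponding bars.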

Violin plot

Explain the relationship between a histogram and a violin plot.

The relationship between a histogram and a violin plot is that both show the density of a distribution of values. A violin plot draws a smoothed version of that shape, mirrored about a central axis.

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18820

Describe what each of these numbers means.

The minimum (326) is the smallest value in the sample. The maximum (18820) is the largest value in the sample. The median (2401) is the middle number of the sample. It splits the set of numbers in half: one half is below the median and the other is above the median. The median of the lower half is the 1st quartile (950) and the median of the upper half is the 3rd quartile (5324). The mean (3933) is the average of all of the numbers in the sample.
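As a small check of these definitions, the sketch below computes the same summaries on a made-up vector (hypothetical values, not diamond prices). One caveat worth knowing: R's default quartile rule interpolates between data points, so its quartiles can differ slightly from the "median of each half" description.

```r
# Hypothetical toy data, already sorted for readability.
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

summary(x)  # Min, 1st Qu., Median, Mean, 3rd Qu., Max in one call

# The same quantities computed one at a time.
x_min    <- min(x)
x_q1     <- unname(quantile(x, 0.25))  # R's default (interpolating) rule
x_median <- median(x)
x_mean   <- mean(x)
x_q3     <- unname(quantile(x, 0.75))
x_max    <- max(x)

# "Median of each half" version of the quartiles, for comparison.
lower_half_median <- median(x[1:4])
upper_half_median <- median(x[5:8])
```

On this vector the two quartile rules agree for the 1st quartile but give slightly different values for the 3rd quartile.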

Modified Box Plots

Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.

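The median is the horizontal line in the center of the box, and the bottom and top of the box are the 1st and 3rd quartiles. In an ordinary boxplot the whiskers extend to the minimum and the maximum. In a modified box plot the whiskers extend only to the most extreme values that lie within 1.5 times the IQR of the quartiles (the IQR is the 3rd quartile minus the 1st quartile). Any value beyond that range is drawn as an individual point and treated as a suspected outlier. An outlier is a value that lies unusually far from the rest of the data. The mean is usually close to the median. If the mean is above the median, the distribution is likely skewed to the right, but if the mean is below the median, it is likely skewed to the left.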
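The 1.5 times IQR rule can be sketched in a few lines of base R. The data are made up for illustration, and the quartiles use R's default `quantile()` rule, which is an assumption here: plotting functions may compute the box edges slightly differently.

```r
# Hypothetical toy data with one unusually large value.
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 40)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: values outside them are suspected outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers end at the most extreme values that are still inside the fences.
whisker_ends <- range(x[x >= lower_fence & x <= upper_fence])
```

Here the value 40 is the only point beyond the upper fence, so it would be drawn as a separate point while the upper whisker stops at 12.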

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot.

The mean is the red dot above the median.

Standard Deviation: Formulas

Explain the formulas below, say which uses Bessel’s correction.

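\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]

Both formulas take the square root of an average of the squared deviations of each value \(x_i\) from the mean \(\bar x\). The first formula, which divides by \(n-1\), uses Bessel’s correction. The second divides by \(n\) and does not.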
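The two formulas can be written out step by step in R. This is a sketch on a made-up vector (the values are hypothetical), with the names `s` and `s_n` chosen to mirror the symbols in the formulas. R's built-in `sd()` should agree with the \(n-1\) version.

```r
# Hypothetical toy data.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sum of squared deviations from the mean, shared by both formulas.
ss <- sum((x - mean(x))^2)

s   <- sqrt(ss / (n - 1))  # with Bessel's correction
s_n <- sqrt(ss / n)        # without Bessel's correction

# Compare the hand-computed s with R's built-in sd().
c(s = s, s_n = s_n, sd = sd(x))
```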

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)
## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403

How close are these estimates? Which is larger? The estimates are very similar, but the one with Bessel’s correction is larger. The two differ by a factor of \(\sqrt{n/(n-1)}\), which is very close to 1 when \(n = 53940\).
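Comparing the two formulas shows that \(s = s_n\sqrt{n/(n-1)}\), so the size of the correction depends only on \(n\). The sketch below tabulates that factor for a few sample sizes and applies it to the uncorrected value reported above. The helper name `bessel_factor` is made up for this illustration.

```r
# Ratio s / s_n implied by the two formulas (hypothetical helper name).
bessel_factor <- function(n) sqrt(n / (n - 1))

# The factor shrinks toward 1 as n grows.
n <- c(2, 4, 10, 100, 53940)
data.frame(n = n, factor = bessel_factor(n))

# Apply the factor to the uncorrected value printed in this report.
sdn_reported <- 3989.403
sd_implied   <- sdn_reported * bessel_factor(53940)
sd_implied
```

For the full data set the factor is so close to 1 that the two values differ only in the second decimal place, while for \(n = 4\) the corrected value is roughly 15% larger.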

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.

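A population is the entire set of values of interest, while a sample is a smaller portion drawn from it. A parameter is a number that describes the population, such as the population mean or the population standard deviation. A statistic is a number computed from a sample, such as the sample mean or the sample standard deviation. The population mean is the mean of every value in the population, but a sample mean is the mean of only the values in the sample, so it changes from sample to sample. The same holds for the standard deviation. A random sample tends to be representative of the entire population.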
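The distinction can be illustrated with a small simulation. The population below is simulated (hypothetical right-skewed "prices" from `rexp()`), not the diamonds data, and the variable names are chosen for this sketch only.

```r
set.seed(1)

# Hypothetical population of 1000 right-skewed "prices".
population <- round(rexp(1000, rate = 1 / 4000))

# Parameters: computed from the whole population.
mu    <- mean(population)
sigma <- sqrt(mean((population - mu)^2))

# Statistics: computed from one random sample of size 4.
smp   <- sample(population, 4)
x_bar <- mean(smp)
s     <- sd(smp)

c(mu = mu, x_bar = x_bar, sigma = sigma, s = s)
```

The parameters stay fixed, while the statistics would change if a different sample were drawn.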

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.

A seed is the starting value of a pseudorandom number generator. It determines the sequence of numbers the generator produces. When you use the same seed, the same sequence of random numbers appears. When different seeds are used, different sequences of random numbers appear.
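A quick way to see this is to draw a few numbers after setting the seed and compare the results. The sketch uses `runif()` rather than the diamonds data so that it runs in plain R.

```r
# Same seed twice: the draws are identical.
set.seed(1)
a <- runif(3)
set.seed(1)
b <- runif(3)

# A different seed: the draws differ.
set.seed(2)
d <- runif(3)

identical(a, b)  # TRUE
identical(a, d)  # FALSE
```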

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

Explain what this command did.

The command drew a random sample of 4 prices (the value of sample.size), without replacement, from the prices in the diamonds data.

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

Explain these results.

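Each seed produces a different random sample of 4 prices from the diamonds data. Setting the seed back to 1 reproduces exactly the first sample, because the same seed restarts the same sequence of random numbers.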

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616

Explain these results.

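Without resetting the seed, each call continues the random sequence from where the previous call stopped, so the three samples differ from one another. Resetting the seed to 1 restarts the sequence, so the same three samples appear again in the same order.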
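The idea that the generator keeps advancing between calls can be checked directly. This sketch uses `runif()`, where drawing six numbers at once consumes the same stream as two consecutive draws of three. That equivalence is specific to simple generators like `runif()` and is not claimed here for `sample()`.

```r
# Six uniform draws in one call.
set.seed(1)
all_at_once <- runif(6)

# The same six draws, split across two calls with no reseeding in between.
set.seed(1)
first  <- runif(3)
second <- runif(3)

identical(all_at_once, c(first, second))  # TRUE
identical(first, second)                  # FALSE
```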

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75

Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?

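Each command drew a new random sample of 4 prices and computed its mean, which is a statistic. The value changes from sample to sample. Some examples of other statistics that can be used are other measures of central tendency (such as the median), the variance, and the standard deviation.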

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582

Sampling Distributions of Statistics

Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?

A sampling distribution is the distribution of a statistic over all possible samples of a given size. Each sample mean and sample standard deviation computed above is one value from the corresponding sampling distribution. The tools for describing these distributions are the same as for any distribution: the mean, the standard deviation, the variance, the five-number summary, and the IQR.

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line that goes through all of the graphs is the population mean of the prices of all diamonds in the data set.

Each boxplot summarizes the sample means obtained from many samples of one sample size. The red dots on the horizontal line (the population mean) are the means of those sample means. The horizontal line inside each box is the median of the sample means for that sample size.

Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?

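An estimator is a statistic used to estimate a population parameter. The sample mean is estimating the population mean. It is an unbiased estimator at every sample size, but its variability decreases as the sample size increases, so larger samples tend to give estimates closer to the population mean.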
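A sampling distribution of the sample mean can be approximated by simulation. The sketch below uses a simulated right-skewed population (an assumption standing in for the diamond prices) and a made-up helper, `sample_means()`, that draws many samples of size `n` and records each mean.

```r
set.seed(1)

# Hypothetical right-skewed population, not the diamonds data.
population <- rexp(10000, rate = 1 / 4000)

# Draw `reps` samples of size n and return their means (hypothetical helper).
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

means_n4  <- sample_means(4)
means_n64 <- sample_means(64)

# Centers: both should sit near the population mean.
c(population = mean(population), n4 = mean(means_n4), n64 = mean(means_n64))

# Spreads: the larger sample size should vary much less.
c(n4 = sd(means_n4), n64 = sd(means_n64))
```

Calling `boxplot(means_n4, means_n64)` would give a rough picture of how the spread narrows as the sample size grows.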

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?

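When \(n\) is large, dividing by \(n\) or by \(n-1\) gives nearly the same result, so the correction matters little. The sample standard deviation is estimating the population standard deviation. It is a better estimate for larger sample sizes, where both its bias and its variability are smaller. Within a single sample, the sample standard deviation measures how far the values are spread from the sample mean.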

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?

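The standard deviation with Bessel’s correction divides by \(n-1\), while the one without Bessel’s correction divides by \(n\), so the corrected value is always slightly larger. The one with Bessel’s correction is less biased as an estimate of the population standard deviation. The difference matters for small samples and becomes negligible for large ones.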
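The effect of the correction on bias can be checked by simulation. This is a sketch under stated assumptions: the population is simulated from a normal distribution rather than taken from the diamonds data, the sample size is 4, and `sdn()` is redefined here so the block runs on its own.

```r
set.seed(1)

# Standard deviation without Bessel's correction.
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Hypothetical population; sigma is its standard deviation (the parameter).
population <- rnorm(10000, mean = 4000, sd = 1000)
sigma <- sdn(population)

# Draw 5000 samples of size 4; each column of `samples` is one sample.
n <- 4
samples <- replicate(5000, sample(population, n))

s_bessel <- apply(samples, 2, sd)   # with Bessel's correction
s_plain  <- apply(samples, 2, sdn)  # without Bessel's correction

# Average estimate from each formula, next to the true value.
c(sigma = sigma, with_bessel = mean(s_bessel), without_bessel = mean(s_plain))
```

In runs like this both averages fall below `sigma`, with the corrected version closer to it, which matches the opening statement that the bias is smaller with Bessel's correction rather than removed.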

Sampling error and sampling bias

Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.

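Sampling error is the difference between a statistic and the parameter it estimates (for example, between the sample standard deviation and the population standard deviation) that arises by chance because only a sample is observed. Sampling bias is a systematic error caused by a sampling method that favors some members of the population over others. A biased estimator is one whose sampling distribution is not centered on the parameter, so on average it misses in the same direction. An unbiased estimator has a sampling distribution whose mean equals the parameter, so its bias is 0.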
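The difference between the error of one sample and the bias of an estimator can be seen numerically. The sketch below uses a simulated population (hypothetical, not the diamonds data) and the sample mean as the estimator: individual sampling errors are often large, yet they average out close to zero.

```r
set.seed(1)

# Hypothetical population and its mean (the parameter).
population <- rnorm(10000, mean = 4000, sd = 1000)
mu <- mean(population)

# Sample means from 5000 random samples of size 4.
x_bars <- replicate(5000, mean(sample(population, 4)))

# Sampling error of each sample: statistic minus parameter.
errors <- x_bars - mu

head(errors)   # individual errors vary in sign and size
mean(errors)   # estimated bias of the sample mean
```

An average error near zero is what an unbiased estimator looks like in a simulation. A biased estimator would show an average error that stays clearly away from zero no matter how many samples are drawn.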