In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53,940 round-cut diamonds. Here are the first 6 rows of this data set:
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.20  4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?
A distribution tells what values a variable may take and how often it takes those values. For price, the distribution gives the prices that occur in the data and shows how many times each specific price appears in the data set.
Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.
A quantitative variable is one that holds numerical data that can be averaged in a way that makes sense, e.g. global temperature changes or the national debt. In the case of data on diamonds, carat, depth, table, and price would be quantitative variables. It is important to consider quantitative data in a report about standard deviation because the standard deviation is computed from deviations around the mean, so it is only meaningful for variables that can be sensibly averaged. Despite sounding similar, numerical and quantitative data are also different. While it is true that quantitative data are numerical, binary codes (in which the only options are 0 and 1) and ordinal categorical rating systems (which use 1-5 to show degrees of agreement or disagreement) are also numerical, but their numbers do not carry the same kind of numerical information. In other words, it would not be sensible to average a binary code or a 1-5 rating in the way it would be sensible to average the carat or price of diamonds.
What is a histogram? Explain the graph below.
A histogram is a graph that groups the values of a variable into intervals (bins) and displays the frequency of the data in each bin. The graph below shows that cheaper diamonds are the most common in the data set, and that the frequency decreases as the price of the diamond increases.
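As a minimal sketch of what a histogram computes, the block below bins a small, made-up vector of prices and counts how many values fall into each bin. The prices and the bin edges are illustrative assumptions, not values taken from the diamonds data.

```r
# Hypothetical prices, invented for illustration only (not diamonds data)
prices <- c(350, 420, 480, 900, 1200, 1500, 2600, 4100, 7800)

# Bin edges chosen by hand; each bin covers (lower, upper]
breaks <- c(0, 1000, 2000, 5000, 10000)

# With plot = FALSE, hist() returns the bin counts without drawing anything
h <- hist(prices, breaks = breaks, plot = FALSE)
h$counts
```

Here four of the nine prices fall in the cheapest bin, so a bar chart of these counts would be tallest on the left, the same shape described for the diamond prices.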
Explain the relationship between a histogram and a violin plot.
A violin plot shows the same distribution as the histogram, drawn vertically as a smoothed outline that is mirrored to give a symmetrical shape. The two are similar because both graphs show the distribution of a particular set of data.
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     326     950    2401    3933    5324   18823
Describe what each of these numbers means.
The minimum is the lowest value in the data set. The 1st quartile lies below the median and is the median of the lower half of the data. The median is the midpoint of the data set. The mean is the arithmetic average of all the values. The 3rd quartile lies above the median and is the median of the upper half of the data. The maximum is the highest value in the data set.
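As a rough sketch of how these summary numbers are obtained, the block below computes each one separately for a small, made-up vector of prices. The values and the name `six_numbers` are illustrative assumptions, not taken from the diamonds data.

```r
# Hypothetical prices, invented for illustration only (not diamonds data)
prices <- c(300, 500, 700, 900, 1100, 1300, 1500, 1700, 10000)

six_numbers <- c(
  min    = min(prices),                      # smallest value
  q1     = unname(quantile(prices, 0.25)),   # first quartile
  median = median(prices),                   # middle value
  mean   = mean(prices),                     # arithmetic average
  q3     = unname(quantile(prices, 0.75)),   # third quartile
  max    = max(prices)                       # largest value
)
six_numbers

# summary() reports the same six numbers in one call
summary(prices)
```

In this toy vector the single large value pulls the mean (2000) above the median (1100), the same pattern as in the price summary, where the mean 3933 exceeds the median 2401. Note that `quantile()` interpolates by default, so on other data its quartiles can differ slightly from the "median of each half" description.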
Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.
The minimum of the values above is the bottom of the violin plot, and the maximum is the top of the violin plot. The Q1 value is represented in the modified box plot as the bottom line of the orange box, the median is the bold line extending through it, and the Q3 value is the top line of the orange box. This graph is a modified box plot because suspected outliers in the data set are shown as individual data points above the orange box (they look like a large bold line because they are so close together), whereas a standard box plot would simply extend its whiskers to the minimum and maximum. Outliers are values that are out of step with the bulk of the data. In a modified box plot, a value is usually flagged as a suspected outlier when it lies more than 1.5 times the interquartile range (Q3 minus Q1) below Q1 or above Q3, and it is drawn as an individual point beyond the whisker.
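As a rough sketch of how a modified box plot flags suspected outliers, the block below applies the usual 1.5 × IQR rule to a small, made-up vector of prices. The values and the variable names are illustrative assumptions; plotting functions may compute the quartiles slightly differently.

```r
# Hypothetical prices, invented for illustration only (not diamonds data)
prices <- c(300, 500, 700, 900, 1100, 1300, 1500, 1700, 10000)

q1  <- unname(quantile(prices, 0.25))
q3  <- unname(quantile(prices, 0.75))
iqr <- q3 - q1                     # interquartile range

# Fences of the 1.5 x IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are suspected outliers
suspected <- prices[prices < lower_fence | prices > upper_fence]
suspected
```

With these numbers the fences are -500 and 2700, so only the value 10000 is flagged.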
Add one sentence to indicate where the mean is on this plot.
The mean is the red dot within the box in this modified box plot.
Explain the formulas below, and say which uses Bessel’s correction.
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:
sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))  # mean() divides by n, not n - 1
}
sdn(diamonds$price)
## [1] 3989.403
How close are these estimates? Which is larger?
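One way to approach this is to note that the two formulas above differ only in dividing by \(n-1\) or by \(n\), so their ratio is a constant that depends only on \(n\). As a short derivation from those two formulas:

\[\frac{s}{s_n} = \sqrt{\frac{n}{n-1}}, \qquad \text{so} \qquad s = s_n\sqrt{\frac{n}{n-1}}\]

For \(n = 53940\) the factor is about \(1.0000093\), which is consistent with the two printed values, 3989.403 and 3989.44. The corrected value is always the slightly larger one, and the gap shrinks as \(n\) grows.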
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\), which is very small in practice, but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain what this command did.
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain these results.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
Explain these results.
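As a minimal sketch of the seed behaviour seen in these runs, the block below draws from a made-up population with base R only. The population, the seed value 42, and the variable names are illustrative assumptions.

```r
# Hypothetical population, invented for illustration only (not diamonds data)
population <- seq(100, 10000, by = 100)

set.seed(42)
first  <- sample(population, 4)
second <- sample(population, 4)   # generator has advanced, so this is a separate draw

set.seed(42)                      # reset the generator to the same starting state
first_again <- sample(population, 4)

identical(first, first_again)     # TRUE: same seed, same sequence of draws
```

Setting the seed fixes the starting state of the generator, so the whole sequence of draws that follows is reproducible, not just the first one.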
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?
For example, the standard deviation with Bessel’s correction:
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?
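A sampling distribution is the distribution of a statistic over repeated samples of the same size. In other words, it describes the values the statistic takes from sample to sample and how often it takes those values. Each number computed above is a single draw from such a distribution.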
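As a rough sketch of how a sampling distribution can be approximated by simulation, the block below draws many samples of size 4 from a made-up, right-skewed population and records the mean of each. The population, the number of samples, and the variable names are illustrative assumptions, not the code used for the plots in this document.

```r
# Hypothetical right-skewed population, invented for illustration only
set.seed(1)
population <- rexp(10000, rate = 1 / 4000)

sample.size <- 4
n.samples   <- 5000

# Draw many samples and record the mean of each one
sample.means <- replicate(n.samples, mean(sample(population, sample.size)))

# Describe the sampling distribution with the usual numerical summaries
summary(sample.means)
sd(sample.means)
```

The same tools used for the price variable (numerical summaries, histograms, violin plots, box plots) apply to `sample.means`, because it is just another set of numbers with a distribution.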
The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.
Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.
Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here, it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?
Now let’s try without Bessel’s correction:
Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?
Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and an unbiased estimator.
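As a rough sketch of how the two estimators could be compared, the block below computes both on the same simulated samples of size 4 and averages each over many samples. The made-up population, the number of samples, and the names `estimates`, `with.bessel`, and `without.bessel` are illustrative assumptions; `sdn` is redefined here only so the block runs on its own.

```r
# Hypothetical right-skewed population, invented for illustration only
set.seed(1)
population <- rexp(10000, rate = 1 / 4000)

# Standard deviation without Bessel's correction (divides by n)
sdn <- function(x) sqrt(mean((x - mean(x))^2))

sample.size <- 4
n.samples   <- 5000

# Compute both estimates on the same samples
estimates <- replicate(n.samples, {
  s <- sample(population, sample.size)
  c(with.bessel = sd(s), without.bessel = sdn(s))
})

sdn(population)       # population standard deviation (the target)
rowMeans(estimates)   # average of each estimator over all samples
```

The average without the correction is necessarily the smaller of the two, since each sample’s `sdn` value is its `sd` value times \(\sqrt{(n-1)/n}\). Both averages would typically fall below the population value, in line with the opening statement that the sample standard deviation is biased either way and that the bias is smaller with Bessel’s correction.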