In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R-Package ggplot2, which contains data from 53940 round cut diamonds. Here are the first 6 rows of this data set:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?
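The distribution of a variable describes what values the variable takes and how often it takes each of them. For price, the distribution tells us which diamond prices occur in the data set and how frequently they occur.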
Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.
Quantitative variables hold numbers on which arithmetic is meaningful, such as the price of a diamond. A numerical variable is any variable recorded as numbers, but the arithmetic does not always make sense: a ZIP code is numerical without being quantitative. Because both are stored as numbers, it can be tricky to differentiate between the two types. Choosing a quantitative variable matters here because computing a standard deviation requires subtracting, squaring, and averaging the values.
What is a histogram? Explain the graph below.
A histogram is a kind of bar graph whose bars touch each other; each bar covers a range of values, and its height shows how often values in that range occur. The graph below portrays the frequency of diamond prices. It is unimodal and skewed to the right, indicating that the majority of diamonds are relatively low in price.
Explain the relationship between a histogram and a violin plot.
They both show the distribution, that is, how dense the data are at each point in the range of the data. The histogram shows this with bars along a horizontal axis, while the violin plot shows it as a smooth curve drawn vertically.
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
Describe what each of these numbers means.
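Min. (326) and Max. (18823) are the lowest and highest prices in the data set. 1st Qu. (950) is the first quartile: about a quarter of the prices are at or below it. Median (2401) is the middle price, with half of the prices on either side. Mean (3933) is the arithmetic average of all prices. 3rd Qu. (5324) is the third quartile: about three quarters of the prices are at or below it. Apart from the mean, these are the numbers drawn in a box plot; the difference between the third and first quartiles is the IQR, which helps describe the spread of the data and reveal whether there are any outliers.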
Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.
Within the modified box plot, the minimum is located at the bottom of the plot; Q1 is the bottom edge of the orange box; the median is the bolded line inside the box, separating the quartiles; Q3 is the top edge of the orange box; the maximum is shown at the very top of the violin plot. An outlier is a data point that falls far from the bulk of the data, meaning that it does not follow the general pattern and can distort non-resistant statistics such as the mean. Suspected outliers are identified by finding the IQR and multiplying it by 1.5: anything below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR is a suspected outlier. In a modified box plot the whiskers stop at the most extreme points that are not suspected outliers, and the suspected outliers are plotted individually; in a regular box plot the whiskers extend all the way to the minimum and maximum.
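As a rough check of the 1.5 x IQR rule, the fences can be computed from the quartiles printed by summary() above. This is only a sketch: the values 950 and 5324 are the rounded quartiles from that output, and the variable names are chosen here for illustration.

```r
# Quartiles of price as printed (rounded) by summary(diamonds$price)
q1 <- 950
q3 <- 5324

iqr <- q3 - q1                    # interquartile range
lower_fence <- q1 - 1.5 * iqr     # below this, a point is a suspected outlier
upper_fence <- q3 + 1.5 * iqr     # above this, a point is a suspected outlier

c(IQR = iqr, lower = lower_fence, upper = upper_fence)
```

With these inputs the IQR is 4374 and the fences are -5611 and 11885. Since no price is negative, only prices above the upper fence would be flagged as suspected outliers.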
Add one sentence to indicate where the mean is on this plot.
The mean is represented by the red dot.
Explain the formulas below, say which uses Bessel’s correction.
The first formula uses Bessel’s correction: it divides by n - 1 rather than n. The first formula is the sample standard deviation, while the second, which divides by n, is the formula used for the standard deviation of a whole population.
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:
sdn <- function(x) {
return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
How close are these estimates? Which is larger?
There is a 0.037 difference between the estimates. The first estimate with Bessel’s correction is larger.
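The size of this gap follows from the two formulas: they differ only in the divisor, so the two results are related by a fixed factor. A short worked version, using \(n = 53940\) and the value of \(s\) reported above (the last step is a first-order approximation, not an exact identity):

\[s_n = s\sqrt{\frac{n-1}{n}}, \qquad s - s_n = s\left(1 - \sqrt{1 - \frac{1}{n}}\right) \approx \frac{s}{2n} = \frac{3989.44}{2 \times 53940} \approx 0.037\]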
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.
A population is the entire group we want to know about, while a sample is a portion of that population. A statistic describes a sample, while a parameter describes an entire population. For instance, the claim that 99% of all Americans love dogs would be a parameter (not a real one). An example of a statistic, again not real, is that 100% of the Americans in a survey sample love dogs. The sample mean and sample standard deviation describe one sample and change from sample to sample. The population mean and population standard deviation describe the population as a whole and are fixed numbers.
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.
The seed is a label for the starting point in the sequence of numbers produced by the random number generator. The sequence looks random but is fully determined by the seed, and it will eventually repeat. There are more numbers in the sequence than seeds. If you use the same seed, you will always get the same four prices, whereas a different seed gives a different four.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain what this command did.
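It drew four diamond prices at random, without replacement, from the 53940 prices in the data set. Which four prices come out is determined by the seed set just before.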
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain these results.
Each seed gives its own set of four prices. Going back to seed 1 reproduces exactly the first sample, so the results are repeatable.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
Explain these results.
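After set.seed(1), each call to sample continues from where the previous call left off in the random number sequence, so the three samples differ from one another. Resetting the seed to 1 restarts the sequence, so the same three samples appear again in the same order.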
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?
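We set the seed to 1, drew three samples of four prices (the same three samples as above), and computed the mean of each. Other statistics that could describe a sample include the sample median, the sample standard deviation, the quartiles, and the minimum and maximum.
For example, the standard deviation with Bessel’s correction: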
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
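The two sets of results come from the same three samples, so each pair differs only by the factor \(\sqrt{(n-1)/n}\). The sketch below checks this on the first sample printed above; the vector x is typed in by hand from that output, and sdn is redefined so the snippet runs on its own.

```r
# First sample of four prices, copied from the output after set.seed(1)
x <- c(5801, 8549, 744, 538)
n <- length(x)

# Standard deviation without Bessel's correction (divides by n)
sdn <- function(x) sqrt(mean((x - mean(x))^2))

sd(x)                    # with Bessel's correction (divides by n - 1)
sdn(x)                   # without the correction
sdn(x) / sd(x)           # ratio of the two
sqrt((n - 1) / n)        # the same ratio, taken from the formulas
```

For n = 4 the factor is sqrt(3/4), about 0.866, so dropping the correction shrinks the estimate by roughly 13%.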
Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?
The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, computed from a random sample of size n. Each mean and standard deviation computed above is a single draw from the sampling distribution of that statistic for n = 4, which corresponds to the ‘4’ on the plot. Changing sample.size in the R code changes the sample size and therefore the sampling distribution. We can describe these distributions with the same tools used for any distribution: the mean and standard deviation, numerical summaries, and plots such as histograms, box plots, and violin plots.
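One way to see a sampling distribution is to simulate it. The sketch below is an illustration under assumptions, not the code behind the plots in this report: it uses a synthetic right-skewed population (drawn with rexp) in place of diamonds$price so that it runs without ggplot2, and the names population, n_reps and sample_means are made up for the example.

```r
set.seed(1)

# Hypothetical stand-in population: 10000 right-skewed "prices"
population <- rexp(10000, rate = 1 / 4000)

sample.size <- 4
n_reps <- 2000

# Draw many samples of size 4 and keep the mean of each one
sample_means <- replicate(n_reps, mean(sample(population, sample.size)))

# Describe the sampling distribution of the sample mean
summary(sample_means)
sd(sample_means)
mean(population)         # the parameter the sample means scatter around
```

A histogram or box plot of sample_means (for example hist(sample_means)) would show the same distribution graphically.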
The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.
Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.
The points at the top represent outliers. The white boxes span the first to third quartiles, and the horizontal line for the population mean of the prices runs through them. The small black line inside each box, separating the quartiles, is the median.
Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?
An estimator is a rule for estimating a given quantity based on collected data. In this case, the sample mean is an unbiased estimator of the population mean. It does a better job when the sample size is larger, because the sample means are then more tightly concentrated around the population mean.
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?
When n is large, the correction still works, but dividing by n - 1 or by n gives almost the same result, so there isn’t much of a difference. The sample standard deviation is estimating the standard deviation of the population. Even with Bessel’s correction it is a slightly biased estimate, but the bias is smaller than without it. The correction matters most for small sample sizes; for large sample sizes the two versions are nearly identical, so the choice makes little difference.
Now let’s try without Bessel’s correction:
Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?
On average, the standard deviation is less accurate without Bessel’s correction. With the correction, the mean sampling error is closer to zero, so the estimate is closer to being unbiased. The difference matters most when the sample size is small.
Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.
The sampling error is the difference between a statistic (such as the sample mean) and the parameter it estimates (the population mean). The sampling bias is the mean sampling error, taken over all possible samples. The sample mean is an unbiased estimator of the population mean, meaning its mean sampling error is zero. More generally, the bias of an estimator is the difference between the estimator’s expected value and the actual value of the parameter: a biased estimator has nonzero bias, and an unbiased estimator has zero bias.
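The idea of bias as the mean sampling error can be sketched with a small simulation. As an assumption for illustration, a synthetic right-skewed population stands in for the diamond prices, and the helper names (population, sigma, draws, sampling_error) are hypothetical; the numbers it prints will not match the plots in this report.

```r
set.seed(1)

# Standard deviation without Bessel's correction (divides by n)
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Hypothetical stand-in population; its standard deviation is the parameter
population <- rexp(10000, rate = 1 / 4000)
sigma <- sdn(population)

sample.size <- 4

# Each column holds both estimates computed from one random sample
draws <- replicate(5000, {
  x <- sample(population, sample.size)
  c(with_bessel = sd(x), without_bessel = sdn(x))
})

# Sampling error = statistic - parameter; bias = mean sampling error
sampling_error <- draws - sigma
rowMeans(sampling_error)
```

Both mean errors are expected to come out negative, with the uncorrected version further from zero, which agrees with the claim that Bessel’s correction reduces the bias without removing it.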