In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?
The distribution of a variable describes which values the variable takes and how often it takes each of them. For price, the distribution tells us how many diamonds sell at each price level. In this data set, prices run from $326 to $18,823, so the histogram of price spans roughly $0 to $20,000.
Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.
A quantitative variable is one whose values are measured on a numeric scale, so that arithmetic on them is meaningful. A variable that is not measured on a numeric scale is qualitative. Choosing a quantitative variable matters in a report on standard deviation because the standard deviation is built from the mean: it measures roughly how far a typical data point lies from the mean, and a mean can only be computed for a quantitative variable.
Numerical and quantitative variables are related but not identical. A social security number, for example, is numerical, yet averaging a list of social security numbers is meaningless. So every quantitative variable is numerical, but not every numerical variable is quantitative; a variable is quantitative only when adding and averaging its values makes sense.
What is a histogram? Explain the graph below.
A histogram is a diagram of adjacent rectangles (bins), each covering an interval of values, whose heights show how many observations fall in that interval. The histogram below shows how many diamonds fall into each price range.
Explain the relationship between a histogram and a violin plot.
A violin plot is essentially a histogram that has been smoothed out, turned on its side, and mirrored to make it symmetrical. Its width at a given value shows the density of data points there, just as the height of a histogram bar does.
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
Describe what each of these numbers means.
These six numbers are the five-number summary plus the mean. Min. is the smallest value in the data set (326) and Max. is the largest (18823). Mean is the average of all values (3933), and Median is the middle value of the sorted data (2401). 1st Qu. (Q1, the first quartile) is the value below which about a quarter of the data lie, roughly the median of the lower half of the data (950). 3rd Qu. (Q3, the third quartile) is the value below which about three quarters of the data lie, roughly the median of the upper half (5324).
Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.
The five-number summary above determines how the box plot is drawn: the box runs from Q1 to Q3, with a line at the median. In a plain box plot the whiskers extend to the minimum and maximum. In a modified box plot the whiskers stop at the most extreme data points that lie inside the "fences", and any points beyond the fences are plotted individually as suspected outliers. An outlier is a data point that lies far from the bulk of the data. To find the fences, multiply the interquartile range (IQR = Q3 - Q1) by 1.5, then add that amount to Q3 and subtract it from Q1; any point above the upper fence or below the lower fence is a suspected outlier.
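As a rough check, the fence rule described above can be applied to the quartiles printed by summary. This is a sketch that assumes the rounded values 950 and 5324 shown in the output; quartiles computed from the full data could differ slightly, and the variable names are only illustrative.

```r
# Quartiles as printed (rounded) by summary(diamonds$price) above
q1 <- 950
q3 <- 5324

iqr <- q3 - q1             # interquartile range
step <- 1.5 * iqr          # distance from the box to each fence
lower_fence <- q1 - step   # points below this are suspected outliers
upper_fence <- q3 + step   # points above this are suspected outliers

c(lower = lower_fence, upper = upper_fence)
```

Under these assumptions the lower fence is negative, so no price can fall below it, and every suspected outlier lies on the expensive side, above 11885.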
Add one sentence to indicate where the mean is on this plot.
The mean is the red dot inside the box. It lies above the median because the distribution of price is skewed to the right: a relatively small number of very expensive diamonds pull the mean upward, while the median is barely affected by them.
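A small made-up example may help show why a few large values pull the mean above the median. The vector below is hypothetical and is not taken from the diamonds data.

```r
# Hypothetical right-skewed data: four small values and one very large one
x <- c(1, 2, 3, 4, 100)

mean(x)    # the single large value pulls the mean up to 22
median(x)  # the median is still the middle value, 3
```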
Explain the formulas below, say which uses Bessel’s correction.
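Both formulas compute a standard deviation, which measures how far the data points typically lie from the mean. The first formula, with \(n-1\) in the denominator, uses Bessel’s correction; the second, with \(n\) in the denominator, does not. Bessel’s correction makes the sample variance \(s^2\) an unbiased estimator of the population variance. The sample standard deviation \(s\) is still biased, but less so than \(s_n\).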
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:
sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
How close are these estimates? Which is larger?
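These estimates are very close: they differ by less than 0.04. The one that uses Bessel’s correction is slightly larger, because dividing by \(n-1\) rather than \(n\) always gives a larger result.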
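The size of the gap can be worked out from the two formulas, which share the same sum of squared deviations and differ only in the denominator. Taking \(n = 53940\) and the rounded value of \(s\) printed above:

\[s_n = s\sqrt{\frac{n-1}{n}} \approx 3989.44 \times \sqrt{\frac{53939}{53940}} \approx 3989.44 \times 0.9999907 \approx 3989.403\]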
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.
A population is the entire group we want to draw conclusions about, whereas a sample is the subset of that group that we actually observe. Statistics and parameters are both numerical descriptions of a group; the difference is that a statistic describes a sample, and a parameter describes a population. For example, if you randomly poll voters in an election and find that 49% of those polled plan to vote for candidate B, that 49% is a statistic, because you asked only a portion of the population. If instead the population of interest is a freshman class, and you ask every member whether they like chocolate ice cream and 80% say yes, that 80% is a parameter, because you asked the whole population. In the same way, the sample mean and the sample standard deviation are statistics computed from a sample, and they estimate the population mean and the population standard deviation, which are parameters computed from every member of the population.
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.
The seed of a random number generator is the starting value that initializes the generator. The numbers it produces look random but are completely determined by the seed. When you use the same seed, you get the same sequence of numbers. When you use a different seed, you get a different sequence.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain what this command did.
This command drew 4 prices at random, without replacement, from the 53940 prices in the diamonds data set.
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain these results.
Each seed produced its own sample: seeds 1, 2, and 3 gave three different sets of four prices. When we set the seed back to 1, we got exactly the same four prices as the first time, because the same seed always restarts the generator at the same point.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
Explain these results.
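When we don’t reset the seed between samples, each call to sample continues the generator’s sequence, so the three samples differ from one another. After setting the seed back to 1, the same three samples appear again in the same order, so the whole sequence is reproducible.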
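The same behaviour can be reproduced without the diamonds data. The sketch below uses a short, made-up vector named prices (an assumption for illustration) and checks that resetting the seed replays the same sequence of samples.

```r
# Hypothetical stand-in for diamonds$price
prices <- c(326, 950, 2401, 3933, 5324, 18823)

set.seed(1)
first_run <- list(sample(prices, 4), sample(prices, 4))

set.seed(1)  # restart the generator at the same point
second_run <- list(sample(prices, 4), sample(prices, 4))

# TRUE: the same seed replays the same samples in the same order
identical(first_run, second_run)
```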
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?
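After setting the seed to 1, we drew three random samples of size 4 and computed the mean of each. The three sample means differ because each sample contains different diamonds. Other statistics we could use to describe samples include the median, the mode, the standard deviation, and the interquartile range.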
For example, the standard deviation, with Bessel’s correction:
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
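The first pair of results can be reproduced from the four prices drawn with seed 1 earlier in this document. In this sketch, sdn is the same function defined above, repeated so that the block runs on its own.

```r
# The four prices sampled with set.seed(1) above
x <- c(5801, 8549, 744, 538)
n <- length(x)

# Standard deviation without Bessel's correction (as defined earlier)
sdn <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

sd(x)                       # with Bessel's correction
sdn(x)                      # without Bessel's correction
sd(x) * sqrt((n - 1) / n)   # same value as sdn(x)
```

With \(n=4\) the factor \(\sqrt{3/4}\) is about 0.866, so dropping the correction shrinks the estimate by roughly 13%, far more than in the full data set.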
Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?
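The sampling distribution of a statistic is the distribution of the values that statistic takes across all possible samples of a given size. Each number computed above (three sample means, three standard deviations with Bessel’s correction, and three without) is one draw from the corresponding sampling distribution for \(n=4\). To describe these distributions we have the same tools as for any other distribution: histograms, violin plots, and box plots, along with numerical summaries such as the mean and the standard deviation. A high standard deviation tells us that the values of the statistic are spread out from sample to sample.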
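A sampling distribution can be approximated by simulation. The sketch below is an illustration under stated assumptions: population is a made-up, right-skewed set of values generated with rexp, standing in for diamonds$price, and 2000 repetitions is an arbitrary choice.

```r
set.seed(1)

# Hypothetical right-skewed population (not the diamonds data)
population <- rexp(10000, rate = 1 / 4000)
sample.size <- 4

# Draw many samples and record the mean of each one
sample_means <- replicate(2000, mean(sample(population, sample.size)))

# Numerical summary of the approximate sampling distribution
summary(sample_means)
mean(population)  # the parameter that the sample means scatter around
```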
The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.
Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.
Each violin in the graph represents the sampling distribution of the sample mean for one sample size. As the sample size grows, the sample means cluster more and more tightly around the horizontal line, the population mean.
Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?
An estimator is a rule for computing, from sample data, an estimate of an unknown population parameter. The sample mean is an estimator of the population mean. It does a better job when the sample size is large or the population standard deviation is small, because in both cases the sample means vary less from sample to sample and so fall closer to the population mean.
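The effect of sample size on the sample mean can be sketched by simulation. The population here is hypothetical (generated with rexp, not the diamonds data), and the helper name spread_of_means is made up for this example.

```r
set.seed(1)

# Hypothetical right-skewed population (not the diamonds data)
population <- rexp(10000, rate = 1 / 4000)

# Standard deviation of 2000 simulated sample means for a given sample size
spread_of_means <- function(n) {
  sd(replicate(2000, mean(sample(population, n))))
}

sizes <- c(4, 16, 64)
spreads <- sapply(sizes, spread_of_means)
names(spreads) <- sizes
spreads  # the spread shrinks as the sample size grows
```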
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?
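The sample standard deviation estimates the population standard deviation, which measures how varied the prices of all diamonds are. For a population as large as this one (53940 diamonds), dividing by \(n\) or by \(n-1\) gives almost the same answer, which is why dropping the correction doesn’t matter much. The estimate is better when the sample is large and when the population has few extreme values, since a single outlier in a small sample can change the estimate a great deal.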
Now let’s try without Bessel’s correction:
Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?
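The only difference is the denominator: Bessel’s correction divides by \(n-1\) instead of \(n\), which makes the standard deviation slightly larger and less biased as an estimate of the population standard deviation. I think the standard deviation with Bessel’s correction is better when working with a sample, because on average it falls closer to the population value, so conclusions drawn from it are less affected by bias. The choice matters most for small samples, such as \(n=4\) above; for large samples the two versions are nearly identical.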
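The bias of the two versions at \(n=4\) can be compared by simulation. This is a sketch on a hypothetical population generated with rexp, not on the diamonds data; sdn repeats the definition given earlier so that the block is self-contained.

```r
set.seed(1)

# Hypothetical right-skewed population (not the diamonds data)
population <- rexp(10000, rate = 1 / 4000)

# Standard deviation without Bessel's correction (as defined earlier)
sdn <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

sigma <- sdn(population)  # population standard deviation (the parameter)

# One matrix column per simulated sample of size 4
samples <- replicate(5000, sample(population, 4))
with_bessel <- apply(samples, 2, sd)
without_bessel <- apply(samples, 2, sdn)

# Average estimate minus the parameter: an estimate of each bias
c(bessel = mean(with_bessel) - sigma, no_bessel = mean(without_bessel) - sigma)
```

Both averages should fall below the population value, with the version that uses Bessel’s correction falling less far below it.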
Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.
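Sampling error is the difference between a statistic computed from one sample and the population parameter it estimates. Sampling bias is the average of the sampling errors over all possible samples.
An estimator is unbiased when that average is zero, which appears in the plots as the red dot lying on the horizontal line. An estimator is biased when the red dot lies away from the line.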