In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?
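The distribution of a variable tells us what values the variable takes and how often it takes each value. For price, the distribution describes which diamond prices occur in the data set and how frequently each one occurs.

#### Type of variable chosen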
Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.
A quantitative variable is a variable whose values are numbers for which arithmetic is meaningful. It is important to use a quantitative variable when reporting standard deviation because quantitative variables are distinguished by the ability to add and average their values, and averaging is central to the standard deviation. Numerical data differs from quantitative data: it is also represented by numbers, but it would not make sense to add or average the values. For example, Social Security numbers are made of digits but are considered numerical, not quantitative, because you would never add or average them. The standard deviation is the square root of the variance, which is the average of the squared differences from the mean. To compute the variance, subtract the mean from each value, square each result, and average the squares. Choosing a quantitative variable is what makes computing that mean, and therefore the standard deviation, meaningful.
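The steps in that description can be followed by hand on a very small example. The sketch below uses the six prices shown in the first rows of the data set; the variable names are chosen here for illustration only.

```r
# First six prices from the head of the diamonds data shown earlier.
prices <- c(326, 326, 327, 334, 335, 336)

deviations <- prices - mean(prices)  # difference of each price from the mean
squared    <- deviations^2           # squared differences
variance_n <- mean(squared)          # average squared difference (divide by n)
sd_n       <- sqrt(variance_n)       # standard deviation without Bessel's correction

# With Bessel's correction the sum of squares is divided by n - 1 instead of n.
n         <- length(prices)
sd_bessel <- sqrt(sum(squared) / (n - 1))

print(c(sd_n = sd_n, sd_bessel = sd_bessel))
```

The value `sd_bessel` should agree with R's built-in `sd(prices)`, which divides by n - 1.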
What is a histogram? Explain graph below.
A histogram is a graph made of rectangles that shows the distribution of a single quantitative variable. The area of each rectangle is proportional to the frequency with which values fall in its interval, and the width of each rectangle is the width of that interval. In the graph below, each rectangle or “bin” is approximately $750 wide, and its height shows the number of diamonds in that price range.
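To see what binning does, here is a minimal sketch that counts values in $750-wide bins without drawing anything. It uses a handful of prices that appear in this report's output rather than the full data set, and the bin edges starting at 0 are an assumption made for the illustration.

```r
# A few prices taken from output printed in this report (not the full data set).
prices <- c(326, 538, 744, 745, 907, 1006, 1429, 1976, 2322, 3376, 4516, 4702, 5801)

bin_width <- 750
# Bin edges from 0 up to the first multiple of 750 that covers the largest price.
breaks <- seq(0, ceiling(max(prices) / bin_width) * bin_width, by = bin_width)

# plot = FALSE returns the bin counts instead of drawing the histogram.
h <- hist(prices, breaks = breaks, plot = FALSE)

bins <- data.frame(lower = head(h$breaks, -1),
                   upper = h$breaks[-1],
                   count = h$counts)
print(bins)
```

Each row of `bins` corresponds to one rectangle of the histogram, and the counts add up to the number of prices.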
Explain the relationship between a histogram and a violin plot.
Violin plots and histograms both show the distribution of a variable. A violin plot can be thought of as a histogram that is turned on its side and smoothed out by a computer algorithm. The smoothed shape is then mirrored, so the width of the violin reflects the density of the data at each value.
The wider the violin, the more diamonds there are at that price. It becomes narrow at the top because there are fewer diamonds priced around $15,000.
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
Describe what each of these numbers means.
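Min. (326) is the smallest price in the data set. 1st Qu. (950) is the first quartile, the median of the lower half of the data, so about a quarter of the prices are at or below it. Median (2401) is the median of the entire data set. Mean (3933) is the mean of the entire data set. 3rd Qu. (5324) is the third quartile, the median of the upper half of the data, so about three quarters of the prices are at or below it. Max. (18823) is the largest price in the data set.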
Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.
The numbers Min., 1st Qu., Median, 3rd Qu., and Max. are what a box plot draws: the box runs from Q1 to Q3 with a line at the median, and in a regular box plot the whiskers extend to the minimum and maximum, so outliers are not shown separately. Outliers are values that are distinctly separated from the rest of the data. A modified box plot places limits, called fences, beyond the box, commonly 1.5 times the interquartile range (Q3 minus Q1) below Q1 and above Q3. Its whiskers stop at the most extreme values inside the fences, and any points outside the fences are drawn as individual dots and treated as suspected outliers. In the plot here, the dots above the upper whisker (the dark streak) are the suspected outliers.
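As a rough worked example, the fences can be computed from the quartiles printed by `summary()` above. This sketch assumes the common 1.5 × IQR rule and uses the rounded quartiles as printed, so the exact fences for the plot may differ slightly.

```r
# Quartiles as printed by summary(diamonds$price) above (rounded values).
q1 <- 950
q3 <- 5324

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # assumed 1.5 x IQR rule
upper_fence <- q3 + 1.5 * iqr

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
```

Because prices cannot be negative, only the upper fence matters here: under this rule, prices above roughly $11,885 would be flagged as suspected outliers.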
Add one sentence to indicate where the mean is on this plot.
The red dot is the mean. It is above the median, which indicates that the distribution is skewed to the right.
Explain the formulas below, say which uses Bessel’s correction.
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\] This formula uses Bessel’s correction: the sum of squared deviations is divided by \(n-1\) instead of \(n\). Dividing by \(n\) gives a value that tends to be too small as an estimate of the population standard deviation. Dividing by \(n-1\) gives a larger value that is, on average, closer to the population standard deviation, although it is still a biased estimate.
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\] Without Bessel’s correction, the standard deviation is the square root of the mean of the squared deviations from the mean. When this formula is applied to a sample, the result tends to underestimate the population standard deviation.
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:
sdn <- function(x) {
return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
How close are these estimates? Which is larger?
The two estimates are very close: they differ by less than 0.04 and agree when rounded to the nearest whole number. The estimate computed with Bessel’s correction is the larger of the two.
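The size of the gap follows from the two formulas: the uncorrected value equals the corrected value multiplied by sqrt((n - 1) / n). The sketch below checks this against the printed numbers; `bessel_factor` is a helper name introduced here for illustration, and the input is the rounded value printed by `sd()` above.

```r
# Ratio of the uncorrected to the corrected standard deviation for sample size n.
bessel_factor <- function(n) sqrt((n - 1) / n)

n_full  <- 53940    # number of diamonds in the data set
sd_with <- 3989.44  # value printed by sd(diamonds$price) above (rounded)

sd_without <- sd_with * bessel_factor(n_full)
print(round(sd_without, 3))

# The factor moves toward 1 as n grows, so the correction matters less and less.
print(bessel_factor(c(4, 30, 53940)))
```

The first printed value should match the output of `sdn(diamonds$price)` to three decimal places.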
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.
A population is the complete set of observations that could be made, while a sample is a subset of the population that is selected for study. A parameter is a number that summarizes the entire population, so it has a single fixed value. A statistic is a number that summarizes a sample, so its value varies from sample to sample.

Here the whole diamonds data set is the population. Adding up the prices of all 53940 diamonds and dividing by 53940 gives the population mean, which is a parameter; the population standard deviation is another parameter. The mean and standard deviation of the prices of a few sampled diamonds are statistics, called the sample mean and the sample standard deviation. Sampling is used when it is too expensive or impractical to study the whole population: for example, a pollster would ask a few thousand voters, rather than every voter, about their voting choices. The sample mean is used as an estimate of the population mean, and the sample standard deviation as an estimate of the population standard deviation. The sampling distribution describes what values a statistic takes across samples and how often it takes them.
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.
A seed is the starting point of a random number generator: it works like a code that determines which sequence of random numbers the computer produces. Here, the seed determines which diamonds are selected for the sample. The selection behaves randomly, but if you choose the same seed you get the same diamonds. This is helpful when you want to replicate your results. Different seeds generally give different samples.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain what this command did.
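This command drew four prices at random, without replacement, from the 53940 prices in the diamonds data set. Which four prices were drawn was determined by the seed set with set.seed(1).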
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain these results.
Each time a different seed is set, a different sample of prices is drawn. When the seed was set back to 1, the command produced the same four prices as the first time.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
Explain these results.
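If you don’t set a seed between samples, each call produces different numbers, because the generator continues from where the previous call left off instead of restarting. The numbers still belong to the same seeded sequence, so resetting the seed to 1 reproduces the same three samples in the same order. They are all different samples from the same population.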
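This behaviour can be checked directly in a small sketch that does not need the diamonds data. The pool of values `1:1000` is an arbitrary stand-in, and the exact numbers drawn depend on the R version, so only the comparisons are shown.

```r
pool <- 1:1000  # arbitrary stand-in for a column of prices

set.seed(1)
first_a  <- sample(pool, 4)  # first draw after seeding
second_a <- sample(pool, 4)  # generator continues; no re-seeding

set.seed(1)
first_b  <- sample(pool, 4)  # same seed again: repeats the first draw
second_b <- sample(pool, 4)  # and then repeats the second draw

print(identical(first_a, first_b))    # same seed, same position in the sequence
print(identical(second_a, second_b))  # the whole sequence is reproduced
print(identical(first_a, second_a))   # consecutive draws are compared here
```

The first two comparisons are `TRUE` because re-seeding restarts the sequence from the same point.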
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?
Above, we computed the mean of three different samples drawn from the same population. We could also use the standard deviation, the median, and the quartiles to describe samples.
For example standard deviation, with Bessel’s correction:
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
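The first pair of values can be reproduced without the data set, because the four prices in the first sample after `set.seed(1)` were printed earlier. The sketch below redefines `sdn` so that it is self-contained.

```r
# The four prices drawn in the first sample after set.seed(1), as printed earlier.
prices <- c(5801, 8549, 744, 538)

# Standard deviation without Bessel's correction (divide by n).
sdn <- function(x) sqrt(mean((x - mean(x))^2))

with_bessel    <- sd(prices)   # divides by n - 1
without_bessel <- sdn(prices)  # divides by n

print(round(c(with_bessel, without_bessel), 3))
print(without_bessel / with_bessel)  # equals sqrt(3/4) when n = 4
```

With only four observations the two values differ by about 13%, in contrast to the nearly identical values obtained for the full data set.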
Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?
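The sampling distribution of a statistic is the distribution of the values that the statistic takes over all possible samples of a given size from the population. Each number computed above (a sample mean or a sample standard deviation from one sample of four diamonds) is a single value drawn from such a distribution. We can describe these distributions with the same tools we use for any distribution: histograms, violin plots, box plots, and numerical summaries such as the mean, median, quartiles, and standard deviation.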
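A sampling distribution can be approximated by drawing many samples and recording the statistic each time. The sketch below does this for the sample mean with samples of size 4. The population is a simulated right-skewed stand-in generated with `rexp()`, not the diamonds data, and the number of repetitions is an arbitrary choice.

```r
set.seed(1)

# Stand-in population: right-skewed simulated values, NOT the diamonds prices.
population <- rexp(50000, rate = 1 / 4000)

sample_size <- 4
n_samples   <- 10000  # number of repeated samples (arbitrary choice)

# Draw many samples and keep the mean of each one.
sample_means <- replicate(n_samples, mean(sample(population, sample_size)))

print(summary(sample_means))  # numerical summary of the approximate sampling distribution
print(mean(population))       # the parameter the sample means are scattered around
```

The vector `sample_means` can then be summarized or plotted like any other quantitative variable.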
The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.
Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.
Each box plot shows the median, quartiles, whiskers, and suspected outliers of the sample means for one sample size, with all samples drawn from the same population. As the sample size increases, the sample means cluster more tightly around the population mean, so the sample mean becomes a more accurate estimate. For example, the second figure shows the distribution of the average price of 4 diamonds. Sample means below the horizontal line have negative sampling errors, and those above it have positive sampling errors.
Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?
An estimator is a function of the data that is used to infer the value of an unknown parameter (for example, the population mean) in a statistical model. It is a rule for calculating an estimate of a given quantity from observed data. The sample mean estimates the population mean, and it does a better job when the sample size is larger.
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?
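The sample standard deviation estimates the population standard deviation of diamond prices, and it is a better estimate when the sample size is large. An estimator is unbiased when the mean of its sampling distribution equals the population parameter; the sample standard deviation is not exactly unbiased, even with Bessel’s correction, but its bias shrinks as the sample size grows. Dropping the correction matters little for a large population because dividing by \(n\) instead of \(n-1\) changes the result by a factor of \(\sqrt{(n-1)/n}\), which is very close to 1 when \(n\) is large.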
Now let’s try without Bessel’s correction:
Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?
The standard deviation calculated with Bessel’s correction is larger than the one calculated without it, because the sum of squared deviations is divided by \(n-1\) instead of \(n\). For a sample, the corrected value is, on average, closer to the population standard deviation, so it is the better estimate. The difference matters most when the sample size is small; when the data set is large, as with the full diamonds data, the two values are nearly identical.
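A small simulation can illustrate this comparison. The sketch below uses a simulated right-skewed population as a stand-in for the diamond prices (an assumption made so that it runs without ggplot2), draws many samples of size 4, and averages each version of the standard deviation.

```r
set.seed(1)

# Stand-in population: simulated right-skewed values, NOT the diamonds prices.
population <- rexp(50000, rate = 1 / 4000)

# Standard deviation without Bessel's correction (divide by n).
sdn <- function(x) sqrt(mean((x - mean(x))^2))

pop_sd <- sdn(population)  # population standard deviation

# Each column of this 4 x 10000 matrix is one sample of size 4.
samples <- replicate(10000, sample(population, 4))

sd_bessel <- apply(samples, 2, sd)   # with Bessel's correction
sd_plain  <- apply(samples, 2, sdn)  # without Bessel's correction

print(c(population        = pop_sd,
        mean_with_bessel  = mean(sd_bessel),
        mean_without      = mean(sd_plain)))
```

In this setup both averages fall below the population value, and the corrected version falls less far below it, which is the sense in which Bessel's correction reduces the bias without removing it.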
Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.
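The sampling error is the difference between a statistic (for example, a sample mean) and the parameter it estimates (the population mean). The sampling bias is the mean of the sampling errors over all possible samples. An unbiased estimator has a sampling bias of 0, so on average it neither overestimates nor underestimates the parameter; a biased estimator has a nonzero sampling bias.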
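As a concrete illustration, the sampling errors of the three sample means printed earlier can be computed against the population mean reported by `summary()`. The population mean used here is the rounded printed value, so the errors are approximate.

```r
population_mean <- 3933                  # Mean printed by summary(diamonds$price), rounded
sample_means    <- c(3908, 2521, 3346.75)  # the three sample means printed earlier

# Sampling error: statistic minus parameter, one value per sample.
sampling_errors <- sample_means - population_mean
print(sampling_errors)

# Average error of these three samples only.
print(mean(sampling_errors))
```

Three samples are far too few to judge bias: the sampling bias is the average error over all possible samples, and a negative average here only reflects which three samples happened to be drawn.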