In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.

For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?

The distribution of a variable tells you (1) what values the variable takes and (2) how often it takes each of those values. In this example the variable is price, so its distribution describes which prices occur among the diamonds and how frequently each occurs.

Type of variable chosen

Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.

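A quantitative variable is one measured on a numeric scale for which arithmetic operations, such as adding and averaging, are meaningful. This matters in a report about standard deviation because the standard deviation is built from the mean and from squared distances to the mean, so it can only be calculated for a quantitative variable. Both quantitative and numerical variables are recorded as numbers; however, a numerical variable is only quantitative if arithmetic on it makes sense. A zip code, for example, is numerical but not quantitative.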

Histogram of diamonds price.

What is a histogram? Explain graph below.

A histogram is a display of data that uses rectangles to show the frequency of data items in successive numerical intervals of equal size. The values of the variable are plotted along the horizontal axis and the count of observations in each interval is plotted along the vertical axis. This histogram shows the distribution of diamond prices: each bar covers a range of prices, and its height is the number of diamonds whose price falls in that range.
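The counts behind a histogram can be inspected directly. The sketch below uses base R's hist() with plot = FALSE on a small made-up vector of prices (an assumption standing in for diamonds$price) to list the intervals and how many values fall in each.

```r
# Hypothetical prices, standing in for diamonds$price
prices <- c(326, 340, 554, 2401, 2757, 3933, 5324, 9002, 18823)

# Intervals of equal width 5000; plot = FALSE returns the counts without drawing
h <- hist(prices, breaks = seq(0, 20000, by = 5000), plot = FALSE)

h$breaks  # interval edges: 0, 5000, 10000, 15000, 20000
h$counts  # number of prices in each interval: 6, 2, 0, 1
```

The counts always add up to the number of observations, which is a quick check that no value fell outside the chosen intervals.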

Violin plot

Explain the relationship between a histogram and a violin plot.

A histogram and a violin plot both show what values a variable takes and how often it takes them. A histogram shows this with bars over intervals, while a violin plot shows a smoothed version of the same shape, mirrored about a central axis, so that the plot is widest where values are most common. For price, the violin narrows to a point at the top, where very few diamonds are that expensive.

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

Describe what each of these numbers means.

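The min (326) is the smallest data point.

The first quartile (950) is the value that about 25% of the data fall at or below.

The median (2401) is the midpoint of the data: about half of the prices are below it and half are above.

The mean (3933) is the arithmetic average of all the prices.

The third quartile (5324) is the value that about 75% of the data fall at or below.

The max (18823) is the largest data point.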

Modified Box Plots

Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.

A box plot displays the five-number summary: the min, first quartile, median, third quartile, and max from the numbers above (the mean is not part of it). In an ordinary box plot the whiskers stretch from the quartiles all the way to the min and the max. In a modified box plot, the whiskers stretch outward from the first and third quartiles only as far as the most extreme data points within 1.5 times the interquartile range (IQR) of the box. Points beyond that are marked separately.

An outlier is a data point that is distinctly separate from the rest of the data. In a modified box plot, a data point more than 1.5 IQRs below the first quartile or above the third quartile is a suspected outlier.
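As a worked example of the 1.5 IQR rule, the sketch below computes the outlier cutoffs (often called fences) from the rounded quartiles printed by summary(diamonds$price) above. The helper name is_suspected_outlier is made up for illustration.

```r
# Quartiles as printed (rounded) by summary(diamonds$price)
q1 <- 950
q3 <- 5324

iqr <- q3 - q1                 # interquartile range: 4374
lower_fence <- q1 - 1.5 * iqr  # -5611
upper_fence <- q3 + 1.5 * iqr  # 11885

# A price is a suspected outlier if it lies outside the fences
is_suspected_outlier <- function(price) {
  price < lower_fence | price > upper_fence
}

is_suspected_outlier(c(326, 2401, 18823))
# FALSE FALSE  TRUE
```

Because the lower fence is negative and the smallest price is 326, every suspected outlier in this variable is at the expensive end.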

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot.

The red circle is the mean. It is higher than the median, which indicates that the distribution is skewed to the right.

Standard Deviation: Formulas

Explain the formulas below, say which uses Bessel’s correction.

Both formulas calculate a standard deviation. The first, \(s\), divides by \(n-1\) and so uses Bessel’s correction; the second, \(s_n\), divides by \(n\) and does not.

When we divide by \(n-1\) when calculating the sample variance, it turns out that the average of the sample variances over all possible samples is equal to the population variance. So this sample variance is what we call an unbiased estimate of the population variance.

If instead we were to divide by \(n\) (rather than \(n-1\)) when calculating the sample variance, then the average over all possible samples would NOT equal the population variance. Dividing by \(n\) does not give an “unbiased” estimate of the population variance.

Dividing by \(n-1\) satisfies this property of being “unbiased”, but dividing by \(n\) does not. Therefore we prefer to divide by \(n-1\) when calculating the sample variance. Taking the square root to get a standard deviation reintroduces some bias, but the bias is smaller with Bessel’s correction.

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
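To make the two formulas concrete, here is a minimal sketch that evaluates both by hand on a small made-up vector (not the diamonds data) and compares the first with R's built-in sd().

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data; the mean is 5
n <- length(x)
ss <- sum((x - mean(x))^2)       # sum of squared deviations: 32

s   <- sqrt(ss / (n - 1))        # with Bessel's correction
s_n <- sqrt(ss / n)              # without Bessel's correction: exactly 2

all.equal(s, sd(x))              # TRUE: R's sd() divides by n - 1
```

The two results share the same sum of squared deviations; only the divisor changes, so s is always at least as large as s_n.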

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)
## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

# Standard deviation without Bessel's correction: divides by n, not n - 1
sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403

How close are these estimates? Which is larger?

They are very close because n, the number of values being averaged, is large (53940 diamonds). The standard deviation with Bessel’s correction is slightly larger, because it divides the same sum of squared deviations by n−1 instead of n. The larger the value of n, the less difference there is between the results given by the two formulas. The difference between dividing by 6 or dividing by 5 is much greater than the difference between dividing by 1000 or dividing by 999.
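Since the two formulas differ only in the divisor, the corrected standard deviation is always the uncorrected one multiplied by \(\sqrt{n/(n-1)}\). The short sketch below evaluates that factor for a few sample sizes; bessel_ratio is a made-up helper name, not a base R function.

```r
# Ratio of the corrected to the uncorrected standard deviation for sample size n
bessel_ratio <- function(n) sqrt(n / (n - 1))

bessel_ratio(c(6, 1000, 53940))
# approximately 1.0954, 1.0005, 1.00001
```

For n = 53940 the factor is about 1.00001, which is consistent with the two values printed above (3989.44 and 3989.403).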

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.

Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population.

First, your sample is the group of individuals who actually participate in your study. On the other hand, your population is the broader group of people to whom you intend to generalize the results of your study. Your sample will always be a subset of your population. Your exact population will depend on the scope of your study.

The difference between a statistic and a parameter is that a statistic describes a sample, while a parameter describes an entire population. Example of a statistic (large population): 60% of US residents agree with the latest health care proposal. It’s not possible to actually ask hundreds of millions of people whether they agree, so researchers have to survey a sample and use the result to estimate the figure for everyone. Example of a parameter (small population): 10% of US senators voted for a particular measure. There are only 100 US senators, so you can count how every single one of them voted.

The sample mean is the average of the values in a sample drawn at random from the population; it changes from sample to sample. The population mean is the average over the entire population; it is a single fixed number.

The standard deviation is a measure of the spread of values within a set of data. Because we often have data from a sample only, we estimate the population standard deviation from the sample standard deviation. The two are calculated differently: the population standard deviation divides by the number of values in the population, while the sample standard deviation divides by \(n-1\).

The population standard deviation measures the dispersion of the data for the entire population. It is a parameter, not a statistic; the sample standard deviation is a statistic.
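The distinction between parameters and statistics can be sketched in a few lines of R. The population here is simulated (an assumption for illustration, not the diamonds data), and the sample size of 25 is arbitrary.

```r
set.seed(1)
# Hypothetical population of 1000 values
population <- round(runif(1000, min = 300, max = 19000))

# Parameters: computed from every member of the population
pop_mean <- mean(population)
pop_sd   <- sqrt(mean((population - pop_mean)^2))  # divides by N

# Statistics: computed from a sample of 25 members
smp      <- sample(population, 25)
smp_mean <- mean(smp)
smp_sd   <- sd(smp)                                # divides by n - 1

c(pop_mean = pop_mean, smp_mean = smp_mean, pop_sd = pop_sd, smp_sd = smp_sd)
```

The parameters stay fixed, while rerunning the sampling lines with a different seed gives different statistics.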

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.

What is normally called a random number sequence in reality is a “pseudo-random” number sequence because the values are computed using a deterministic algorithm and probability plays no real role. The “seed” is a starting point for the sequence and the guarantee is that if you start from the same seed you will get the same sequence of numbers. Seed values are integers that define the exact sequence of pseudo-random numbers, but there’s no way of knowing ahead of time what sequence it will be and there’s no way of tweaking a sequence by slightly changing the seed. Even the tiniest change in seed value will result in a radically different random sequence.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

Explain what this command did.

This command drew 4 prices at random, without replacement, from the 53940 prices in the data set. These are the results for seed 1.

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

Explain these results.

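Each seed gives a different sample of 4 prices, because each seed starts the pseudo-random number generator at a different point. Setting the seed back to 1 reproduces exactly the first sample.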

Finally, what happens when we don’t set a seed between samples?

If you don’t set the seed, R draws from the current state of the random number generator (RNG). On startup R may set a random seed to initialize the RNG, and each call to sample() then continues from wherever the previous call left off in the RNG stream.

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616

Explain these results.
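In the output above, the three samples drawn after set.seed(1) differ from one another, and resetting the seed to 1 replays the same three samples in the same order. A minimal sketch that checks this behaviour programmatically, using a made-up vector in place of diamonds$price:

```r
# Hypothetical stand-in for diamonds$price
prices <- seq(300, 19000, by = 50)

set.seed(1)
first_run <- list(sample(prices, 4), sample(prices, 4))

set.seed(1)
second_run <- list(sample(prices, 4), sample(prices, 4))

# Same seed, same sequence of samples
identical(first_run, second_run)   # TRUE

# Within one run, the second call continues the stream instead of restarting it,
# so the two samples are different draws
first_run
```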

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75

Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?

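We drew three random samples of 4 diamonds and computed the mean price of each one. The sample mean is a statistic: a single number that describes a sample, and it changes from sample to sample. Other statistics we could use to describe a sample include the median, the quartiles, the min and max, the range, and the standard deviation.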

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582

Sampling Distributions of Statistics

Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?

A sampling distribution of a statistic is the distribution of the values the statistic takes over all possible samples of a given size from the population. Each mean or standard deviation computed above is one value drawn from such a distribution, which is why the three numbers in each group differ. The tools we have to describe these distributions are the same ones used for any quantitative variable: histograms, violin plots, box plots and modified box plots, and numerical summaries.
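A sampling distribution can be approximated by repeating the sampling many times and keeping the statistic from each repetition. The sketch below does this for the sample mean with samples of size 4. The right-skewed population is simulated (an assumption standing in for diamonds$price), and 2000 repetitions is an arbitrary choice.

```r
set.seed(1)
# Hypothetical right-skewed population
population <- rexp(10000, rate = 1 / 4000)

# 2000 samples of size 4, keeping the mean of each
sample_means <- replicate(2000, mean(sample(population, 4)))

summary(sample_means)   # numerical summary of the approximate sampling distribution
# hist(sample_means)    # uncomment to see its shape
```

Taking every possible sample would give the exact sampling distribution; a few thousand random samples is usually enough to see its centre and spread.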

Sampling distribution for the mean of price of a sample of diamonds.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.

The plot below shows the sampling distribution of the sample mean for different values of the sample size.

Additional note: The values on the x axis are the sample sizes. A sample size of 4 means that you pick 4 diamonds at random and find the mean of their prices. The idea is that every time you pick a new seed, you get a different sample mean.

Explain the concept of an estimator. What is the sample mean estimating, and it what situation does it do a better job?

An estimator is a statistic, computed from a sample, that is used to estimate a population parameter. We need estimators because we usually can’t measure every member of the population. The sample mean is estimating the population mean, and it does a better job with larger sample sizes.

As you increase the sample size, the sample means cluster more tightly around the population mean, so any one of them is a better estimate of it. The average of the sample means theoretically will lie on the line (if you took all of the possible samples). The line is the population mean: the sum of the prices of all the diamonds in the data set divided by the number of diamonds.
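The effect of sample size can be checked with a small simulation. This sketch measures how widely the sample means are spread for several sample sizes, using a simulated population (an assumption, not the diamonds data). The helper spread_of_means and the chosen sizes are made up for illustration.

```r
set.seed(1)
population <- rexp(10000, rate = 1 / 4000)   # hypothetical stand-in population

# Standard deviation of 2000 sample means, for a given sample size n
spread_of_means <- function(n) {
  sd(replicate(2000, mean(sample(population, n))))
}

sizes   <- c(4, 16, 64, 256)
spreads <- sapply(sizes, spread_of_means)
data.frame(sample_size = sizes, sd_of_sample_means = spreads)
```

The spread should shrink as the sample size grows, roughly halving each time the sample size is multiplied by 4.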

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?

The standard deviation is a measure of the spread of values within a set of data. Usually, we are interested in the standard deviation of a population, because the population contains all the values we are interested in. You would calculate the population standard deviation directly if (1) you have the entire population or (2) you have a sample of a larger population, but you are only interested in this sample and do not wish to generalize your findings to the population. With a data set as large as this one, dropping Bessel’s correction changes the result very little, because dividing by n and dividing by n−1 are nearly the same when n is 53940.

However, in statistics, we are usually presented with a sample from which we wish to estimate (generalize to) a population, and the standard deviation is no exception. If all you have is a sample, but you wish to make a statement about the standard deviation of the population from which the sample is drawn, you need to use the sample standard deviation. It is estimating the population standard deviation, and it is a better estimate when the sample size is larger.

Confusion can arise as to which standard deviation to use, because the name “sample” standard deviation is often incorrectly interpreted as meaning the standard deviation of the sample itself rather than the estimate of the population standard deviation based on the sample.

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?

With Bessel’s correction we divide by n−1; without it we simply divide by n. Bessel’s correction corrects for bias. The 1/n variance formula is systematically biased: because deviations are measured from the sample mean rather than the population mean, it gives, in nearly every case, a lower estimate than you would make if you had the population mean available. This is bad, because averaging the estimates from more and more samples of the same size does not remove the error. So we use Bessel’s correction to produce an unbiased estimator of the variance. The mean of the corrected sample variances is equal to the population variance, so taking a bunch of samples and averaging the results improves your estimate. It matters most when the sample size is small.
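The claim about bias in the variance can be checked by simulation. The sketch below draws many samples of size 4 from a simulated population (an assumption, not the diamonds data) and averages the two kinds of variance estimate. Samples are drawn with replacement here so that the draws are independent, which is the setting in which the n−1 result holds exactly.

```r
set.seed(1)
population <- rnorm(10000, mean = 4000, sd = 1000)   # hypothetical population
pop_var <- mean((population - mean(population))^2)   # population variance (divides by N)

n <- 4
# 20000 samples of size n, one per column of an n x 20000 matrix
samples <- replicate(20000, sample(population, n, replace = TRUE))

var_bessel    <- apply(samples, 2, var)              # divides by n - 1
var_no_bessel <- var_bessel * (n - 1) / n            # same sum of squares, divided by n

mean(var_bessel) / pop_var      # close to 1
mean(var_no_bessel) / pop_var   # close to (n - 1) / n = 0.75
```

Averaging more samples brings the first ratio closer to 1, while the second settles near 0.75 however many samples are taken.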

Sampling error and sampling bias

Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.

Sampling error is the difference between a statistic (for example, the sample mean) and the parameter it estimates (the population mean). It is incurred whenever the characteristics of a population are estimated from a subset, or sample, of that population, and it varies by chance from sample to sample. Sampling bias is a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others, so the error is systematic rather than due to chance.

Bias is the tendency of a statistic to overestimate or underestimate a parameter: it is the mean sampling error over all possible samples. An unbiased estimator is a statistic whose average over all possible samples equals the population parameter; a biased estimator is one whose average does not.
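Bias can be estimated as the average sampling error over many simulated samples. The sketch below does this for the standard deviation with and without Bessel’s correction, again using a simulated population and samples of size 4 (both assumptions for illustration).

```r
set.seed(1)
population <- rnorm(10000, mean = 4000, sd = 1000)   # hypothetical population
pop_sd <- sqrt(mean((population - mean(population))^2))

sdn <- function(x) sqrt(mean((x - mean(x))^2))       # no Bessel's correction

# 20000 samples of size 4, one per column
samples <- replicate(20000, sample(population, 4, replace = TRUE))

# Estimated bias = average sampling error over the simulated samples
bias_sd  <- mean(apply(samples, 2, sd))  - pop_sd
bias_sdn <- mean(apply(samples, 2, sdn)) - pop_sd

c(with_bessel = bias_sd, without_bessel = bias_sdn)  # both negative; the first is closer to 0
```

Both estimators underestimate the population standard deviation on average, but the underestimate is smaller with Bessel’s correction, in line with the statement that opens this document.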

Additional note: The sample mean is an unbiased estimate of the population mean.