In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.

For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?

The distribution of a variable is a description of how often each possible value occurs, relative to the others, over a number of observations. It relates to price because the distribution of price describes how many diamonds in the data set fall at each price, or within each range of prices.

Type of variable chosen

Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.

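Quantitative variables are variables that are measured on a numeric scale. Choosing one is important in a report about standard deviation because the standard deviation formula does arithmetic on the values (differences from the mean, squares, and sums), which only makes sense when the numbers represent amounts. Numerical and quantitative variables are related because both are recorded as numbers and can be used in graphs and plots, but they are not the same thing: a numerical variable is any variable whose values are numbers, while a quantitative variable is a numerical variable whose numbers measure a quantity. A zip code, for example, is numerical but not quantitative. Price is quantitative, so it can be put into the standard deviation formula.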

Histogram of diamonds price.

What is a histogram? Explain graph below.

A histogram is a display of statistical information that uses rectangles to show the frequency of data items in successive numerical intervals of equal size. The graph below shows the price of diamonds along the horizontal axis and the number of diamonds in each price interval along the vertical axis. There are almost 10,000 diamonds costing 500 to 1,500 dollars and a little less than 8,750 diamonds costing between 0 and 1,000 dollars.
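As a rough illustration of how a histogram counts values per interval, here is a minimal sketch in base R. The `prices` vector is a small made-up example (not the diamonds data), and the 1,000-dollar bin width is an assumption chosen for readability.

```r
# Hypothetical prices in dollars (illustration only, not the diamonds data).
prices <- c(326, 600, 700, 950, 1200, 1499, 2401, 3933, 5324, 18820)

# Equal-width bins of 1,000 dollars covering the whole range of prices.
breaks <- seq(0, 19000, by = 1000)

# plot = FALSE returns the bin counts instead of drawing the rectangles.
h <- hist(prices, breaks = breaks, plot = FALSE)

# Number of prices in each interval (0, 1000], (1000, 2000], and so on.
h$counts
```

Each count is the height of one rectangle, so the counts always add up to the number of observations.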

Violin plot

Explain the relationship between a histogram and a violin plot.

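Both the violin plot and the histogram show the distribution of a variable. The width of the violin at a given value shows how dense the data points are there, much like the height of a bar in a histogram. A violin plot can also display more information about the data, such as a box plot drawn inside it.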

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18820

Describe what each of these numbers means.

The Min. is the minimum value: the number in the data set that is less than (or equal to) all the other values. The 1st Qu. is the first quartile, the median of the numbers between the minimum and the median. The median is the middle number when the values are placed in numerical order. The mean is the sum of all the numbers divided by how many numbers there are. The 3rd Qu. is the third quartile, the median of the numbers between the median and the maximum. The Max. is the maximum value: the number in the data set that is greater than (or equal to) all the other values.
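The same six numbers can be computed one at a time, which may make their meaning more concrete. This is a minimal sketch on a made-up vector `x` (an assumption, not the diamonds data). Note that R's `quantile()` interpolates between data points by default, so for some data sets its quartiles differ slightly from the "median of each half" description.

```r
# Hypothetical data with one very large value (illustration only).
x <- c(1, 2, 3, 4, 100)

# The six numbers reported by summary(), computed individually.
six_numbers <- c(
  min    = min(x),
  q1     = unname(quantile(x, 0.25)),
  median = median(x),
  mean   = mean(x),
  q3     = unname(quantile(x, 0.75)),
  max    = max(x)
)
six_numbers
```

Here the single large value pulls the mean (22) well above the median (3), the same pattern seen in the price summary, where the mean is larger than the median.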

Modified Box Plots

Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.

The points above the box plot are suspected outliers. Outliers are values that generally do not fit with the rest of a data set: they are either unusually high or unusually low and do not represent typical values. In a modified box plot, suspected outliers are commonly identified with the 1.5 × IQR rule: any value more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile is flagged. The point where the outliers begin (the top of the thick black vertical bar) is the modified maximum, and the true maximum is at the very top of the violin plot, its highest point. A modified box plot does not include the outliers in the whiskers; instead they are plotted as individual points beyond the whisker, which gives a more accurate picture of the spread of the data. A standard box plot extends its whiskers to cover all data points, outliers included, and does not show individual outliers.
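A minimal sketch of how suspected outliers can be flagged, assuming the common 1.5 × IQR convention for modified box plots. The data vector `x` and the names `lower_fence`, `upper_fence`, and `whisker_top` are made up for illustration.

```r
# Hypothetical data with one unusually large value (illustration only).
x <- c(1, 2, 3, 4, 100)

# Quartiles and interquartile range.
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: values beyond these are treated as suspected outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Suspected outliers are plotted as individual points.
suspected <- x[x < lower_fence | x > upper_fence]

# The upper whisker stops at the largest value that is not an outlier.
whisker_top <- max(x[x <= upper_fence])

suspected
whisker_top
```

In this example the upper whisker ends at 4 (the "modified maximum") while the true maximum, 100, is drawn as a separate point.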

Adding the mean to the plot

Add one sentence to indicate where the mean is on this plot.

The mean is the red dot.

Standard Deviation: Formulas

Explain the formulas below, say which uses Bessel’s correction.

The formulas below calculate the standard deviation of a set of values, where \(n\) is the number of observations in the sample and \(\bar x\) is the sample mean. Bessel’s correction replaces \(n\) with \(n-1\) in the denominator, so the first formula, \(s\), uses Bessel’s correction and the second, \(s_n\), does not. Bessel’s correction partially corrects “the bias in the estimation of the population standard deviation” (Wikipedia).

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)
## [1] 3989.44

How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403

How close are these estimates? Which is larger?

The estimates are very close: they differ by only 0.037. The one with Bessel’s correction is larger, because dividing by \(n-1\) instead of \(n\) gives a slightly larger result.
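One way to see why the two values are so close: the two formulas differ only in the divisor, so they are related by a factor that depends only on \(n\). A short worked check, using \(n = 53940\) from the description of the data set and the value of \(s\) printed above:

\[s_n = s\sqrt{\frac{n-1}{n}}\]

\[s_n \approx 3989.44 \times \sqrt{\frac{53939}{53940}} \approx 3989.44 \times 0.9999907 \approx 3989.403\]

The factor is always below 1, so the value without Bessel’s correction is always the smaller of the two, and the factor gets closer to 1 as \(n\) grows.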

So what is the big deal about Bessel’s correction? See below.

Sampling

The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.

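Sampling is taking data from a smaller subset of a population. The population is the whole group we are interested in, and a sample is the part of it that we actually observe. A parameter is a number that describes the population as a whole, while a statistic is a number that describes a sample. For example, the mean price of all 53940 diamonds is a parameter, and the mean price of 4 diamonds sampled from the data set is a statistic. In the same way, the population mean is the mean of every value in the population, and the sample mean is the mean of only the values in the sample. The sample standard deviation is an estimate of the population standard deviation based on a random sample of the population. The population standard deviation is for the entire population, not just a sample of it.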

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.

A seed of a random number generator is a starting point for a sequence that produces random numbers. If you use the same seed it will give the same sequence every time. If you use different seeds it will give you a different sequence of random numbers each time.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

Explain what this command did.

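The command drew 4 prices at random, without replacement, from the price column of the diamonds data set. The seed set the starting point of the random sequence, so it determined which 4 prices were drawn.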

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

Explain these results.

When you changed the seed from 1 to 2 to 3, different sequences of random numbers came out. And when you went back to set.seed(1) it produced the same random numbers as the first seed.

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616

Explain these results.

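Because we did not reset the seed between samples, each call continued the same random sequence and produced a different sample. When the seed was set back to 1, the sequence started over, so the same 3 samples came out again in the same order.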

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75

Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?

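Here we have taken the mean of each of the three random samples drawn above. The first mean corresponds to the first sample, and so on. Other statistics that could describe a sample include the median, the minimum, the maximum, the quartiles, and the standard deviation.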

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
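The relationship between the two sets of numbers can be checked directly. The sketch below assumes the first sample printed earlier (5801, 8549, 744, 538) and redefines `sdn` so that it runs on its own. With \(n = 4\) the two statistics should differ by the factor \(\sqrt{3/4}\), roughly 0.866.

```r
# First sample of four prices, copied from the output shown earlier.
x <- c(5801, 8549, 744, 538)
n <- length(x)

# Standard deviation without Bessel's correction (divides by n).
sdn <- function(x) sqrt(mean((x - mean(x))^2))

with_bessel    <- sd(x)   # sd() divides by n - 1
without_bessel <- sdn(x)  # sdn() divides by n

# The two values differ only by the factor sqrt((n - 1) / n).
ratio <- without_bessel / with_bessel
c(with_bessel = with_bessel, without_bessel = without_bessel, ratio = ratio)
```

For a sample this small the gap between the two values is large, unlike the gap of about 0.037 seen for the full data set.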

Sampling Distributions of Statistics

Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?

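A sampling distribution of a statistic is the probability distribution of that statistic, obtained through a large number of samples of the same size taken from a specific population. The numbers above are the means and standard deviations computed from three random samples, so each one is a single value drawn from the sampling distribution of that statistic. We can describe these distributions with the same tools used for any distribution: histograms, violin plots, box plots, and numerical summaries such as the mean and standard deviation.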

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.

The horizontal line is the population mean of the prices of all diamonds in the data set. The thick black dots are the outliers and the red dot is the mean. The short line just below the population mean line is the median, the line below that is Q1, and the line above it is Q3. The line below Q1 is the minimum, and the very top of the outliers, at the highest point of each plot, is the maximum.

Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?

“An estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule, the quantity of interest and its result are distinguished.” (Google). The sample mean is an estimator of the population mean. It does a better job with larger sample sizes. For example, with a sample size of 1,024 diamonds the sample mean stays a lot closer to the population mean: the IQR is narrow, sits near the center, and follows the population mean line pretty closely. With a sample size of 1 diamond the sample mean is all over the place: its outliers, median, Q1, and Q3 are all farther away from the population mean.
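The effect of sample size on the sample mean can be sketched with a small simulation. Everything here is an assumption for illustration: the "population" is a made-up right-skewed set of 10,000 values rather than the diamonds data, and `sample_means` is a hypothetical helper.

```r
set.seed(1)

# Hypothetical right-skewed population of prices (not the diamonds data).
population <- rexp(10000, rate = 1 / 4000)

# Draw many samples of size n and record the mean of each one.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

# Spread of the sample mean around the population mean for small and large n.
spread_small <- sd(sample_means(4))
spread_large <- sd(sample_means(256))

c(population_mean = mean(population),
  spread_n4       = spread_small,
  spread_n256     = spread_large)
```

The spread for samples of 256 should come out much smaller than for samples of 4, which is the narrowing visible in the plot as the sample size grows.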

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?

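Dividing by 1,023 instead of 1,024 isn’t going to make a major difference, so in a large sample the correction doesn’t matter much. The sample standard deviation is estimating the population standard deviation, and it is a better estimate when the sample size is larger. Bessel’s correction matters most when the sample size is small.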

Now let’s try without Bessel’s correction:

Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?

Without Bessel’s correction, the mean of the sample standard deviations is a lot farther from the population standard deviation. I think Bessel’s correction is better: there isn’t much of a difference at the larger sample sizes, but at the smaller sample sizes the version with Bessel’s correction is more accurate.
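A small simulation can make this comparison concrete. This is a sketch under stated assumptions: the population is a made-up right-skewed set of 10,000 values (not the diamonds data), the sample size is 4, and `sdn` is redefined so the code runs on its own.

```r
set.seed(1)

# Hypothetical right-skewed population of prices (not the diamonds data).
population <- rexp(10000, rate = 1 / 4000)

# Standard deviation without Bessel's correction (divides by n).
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# The whole population is known here, so its standard deviation divides by N.
population_sd <- sdn(population)

# Draw 5000 samples of size 4; each column of the matrix is one sample.
n <- 4
reps <- 5000
samples <- replicate(reps, sample(population, n))

# Sampling distributions of the two statistics.
sd_with    <- apply(samples, 2, sd)
sd_without <- apply(samples, 2, sdn)

c(population_sd       = population_sd,
  mean_with_bessel    = mean(sd_with),
  mean_without_bessel = mean(sd_without))
```

Both averages should fall below the population standard deviation, with the Bessel-corrected one closer to it. That matches the opening claim that the bias is smaller, not zero, with Bessel’s correction.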

Sampling error and sampling bias

Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.

Sampling error is the error that is caused by observing a sample instead of the whole population. It shows how much the result from the sample missed its mark. Sampling bias refers to error that is systematic in nature: some members of a population are less likely to be included in the sample than others, so the sample is no longer random. The bias of an estimator is the difference between the estimator’s expected value and the true value of the parameter being estimated. A biased estimator is one where this difference is not zero, and an unbiased estimator is one where it is zero.
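In symbols, the bias of an estimator can be written as follows. The notation is introduced here for illustration: \(\hat\theta\) is the estimator, \(\theta\) is the parameter being estimated, and \(\sigma\) is the population standard deviation.

\[\operatorname{bias}(\hat\theta) = E[\hat\theta] - \theta\]

For independent draws from a population with variance \(\sigma^2\), the squared versions of the two formulas for standard deviation behave differently:

\[E\left[s^2\right] = \sigma^2, \qquad E\left[s_n^2\right] = \frac{n-1}{n}\sigma^2\]

So \(s^2\) is an unbiased estimator of the population variance and \(s_n^2\) is not. Taking the square root does not preserve this property, which is why \(s\) is still a biased estimate of \(\sigma\), only less so than \(s_n\).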