In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.


For data, we will use the diamonds data set from the R package ggplot2, which contains data on 53940 round-cut diamonds. Here are the first 6 rows of this data set:

## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Describing the distribution of the “price” variable

The distribution of a variable tells us what values it takes and how often it takes these values. For price, it shows how many diamonds fall within each range of prices.

Type of variable chosen

A quantitative variable takes numerical values on which arithmetic operations, such as finding the average, make sense. Being able to do arithmetic on the values is what makes it possible to compute a standard deviation. Not every variable written as a number is quantitative: a Social Security number cannot meaningfully be operated on, so it acts as a label. Price is a quantitative variable.

#### Histogram of diamonds price.

A histogram breaks the range of values of a variable into “bins” and displays only the count or percentage of the observations in each bin. Histograms show the modes clearly, which makes it easy to see whether a distribution is bimodal. The histogram below shows the number of diamonds in each price range.
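The binning step can be sketched in base R without drawing anything. This is only an illustration: it uses twelve diamond prices that appear in this document's samples rather than the full data set, and the bin width of 2500 is an arbitrary choice made for the example.

```r
# Twelve prices taken from the samples shown in this document.
prices <- c(5801, 8549, 744, 538, 4879, 1976, 2322, 907, 463, 3376, 4932, 4616)

# Bin edges are an illustrative choice: four bins of width 2500.
breaks <- seq(0, 10000, by = 2500)

# plot = FALSE returns the bin counts instead of drawing the histogram.
h <- hist(prices, breaks = breaks, plot = FALSE)

# One row per bin: lower edge, upper edge, number of prices in the bin.
bins <- data.frame(from = head(h$breaks, -1),
                   to = tail(h$breaks, -1),
                   count = h$counts)
print(bins)
```

Each price lands in exactly one bin, so the counts add up to the number of observations.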

Violin plot

Violin plots are not as precise as histograms, since they show neither bins nor exact data points. A violin plot is a smoothed version of roughly the same shape. The wider the violin at a given price, the more densely the observations are concentrated around that price.

Numerical Summaries

R has a function that returns numerical summaries of data. For example:

summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2401    3933    5324   18823

The output shows the five-number summary (minimum, first quartile, median, third quartile, maximum) together with the mean.

#### Modified Box Plots

Outliers are observations that lie outside the overall pattern of a distribution. A box plot is a graph of the five-number summary, with whiskers that run to the minimum and maximum. In a modified box plot, the whiskers extend only to the most extreme observations within 1.5 × IQR of the quartiles, and any observation beyond that is plotted individually as a suspected outlier. The black line in the violin plot shows the outliers. Many suspected outliers on one side are common when the distribution is skewed.
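The 1.5 × IQR rule can be sketched as follows. The data are an assumption made for illustration: twelve prices from this document's samples plus the maximum price from the summary above, not the full data set. The variable names are hypothetical.

```r
# Twelve prices from the samples in this document, plus the maximum price (18823).
prices <- c(5801, 8549, 744, 538, 4879, 1976, 2322, 907, 463, 3376, 4932, 4616,
            18823)

# Quartiles and interquartile range.
q1 <- quantile(prices, 0.25, names = FALSE)
q3 <- quantile(prices, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: anything outside them is flagged as a suspected outlier.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- prices[prices < lower_fence | prices > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

R's own `boxplot.stats()` computes the box from hinges rather than `quantile()`, so its fences can differ slightly for some sample sizes.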

Adding the mean to the plot

The red dot is the mean.

Standard Deviation: Formulas

The first equation below uses Bessel’s correction: it divides by \(n-1\) instead of \(n\). Both formulas measure standard deviation. With Bessel’s correction, the estimate computed from a sample is less biased, although the average sampling error is still not 0.

\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]

\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
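
One way to see where the \(n-1\) comes from is sketched below. It assumes that the \(n\) observations are independent draws from a population with mean \(\mu\) and variance \(\sigma^2\); these symbols and the independence assumption are not part of the text above, and sampling without replacement from a finite data set satisfies the assumption only approximately. Deviations from the sample mean are related to deviations from the population mean by

\[\sum\left(x_i - \bar x\right)^2 = \sum\left(x_i - \mu\right)^2 - n\left(\bar x - \mu\right)^2\]

Taking expected values, each term of the first sum contributes \(\sigma^2\), and \(\left(\bar x - \mu\right)^2\) contributes \(\sigma^2/n\):

\[E\left[\sum\left(x_i - \bar x\right)^2\right] = n\sigma^2 - \sigma^2 = (n-1)\sigma^2\]

Under these assumptions, dividing by \(n-1\) gives a variance estimate \(s^2\) whose average is exactly \(\sigma^2\), while dividing by \(n\) gives an average of \(\frac{n-1}{n}\sigma^2\). Taking the square root brings some bias back, so \(s\) still tends to underestimate \(\sigma\), only by less than \(s_n\) does.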

Standard Deviation of Diamonds Price

We compute the standard deviation (with Bessel’s correction) of the price variable:

sd(diamonds$price)
## [1] 3989.44

How about without Bessel’s correction? Base R has no built-in function for this, but we can add one:

sdn <- function(x) {
  return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403

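The two values are close, but the one with Bessel’s correction is slightly larger, as it always is, because it divides by \(n-1\) rather than \(n\).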
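The size of the gap follows from the two formulas: the uncorrected value is the corrected one multiplied by \(\sqrt{(n-1)/n}\). A minimal check, using twelve prices from this document's samples as stand-in data (an assumption made so the sketch runs without ggplot2):

```r
# Standard deviation without Bessel's correction (same definition as sdn above).
sdn <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

# Twelve prices from the samples shown in this document.
prices <- c(5801, 8549, 744, 538, 4879, 1976, 2322, 907, 463, 3376, 4932, 4616)
n <- length(prices)

# Rescaling sd() by sqrt((n - 1) / n) should reproduce sdn().
rescaled <- sd(prices) * sqrt((n - 1) / n)
print(all.equal(sdn(prices), rescaled))

# The factor approaches 1 as n grows, which is why the two values are so close
# for the full data set of 53940 prices.
sizes <- c(4, 12, 53940)
print(sqrt((sizes - 1) / sizes))
```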

So what is the big deal about Bessel’s correction? See below.

Sampling

A population is the entire group of individuals that a study wants information about. A sample is the part of the population that is actually examined in order to gain information about the whole. A parameter is a number that describes some characteristic of the population; its value is generally not known. A statistic is a number that describes some characteristic of the sample and is computed directly from the sample data. Both parameters and statistics are numbers.

Population: all U.S. adults

Parameter: proportion of all U.S. adults who believe in ghosts

Sample: 640 U.S. adults

Statistic: 240/640 = 0.375, the proportion of the sampled adults who believe in ghosts

The sample mean changes from sample to sample and is an unbiased estimator of the population mean. The population mean does not change. The sample standard deviation is a statistic that measures the dispersion of the data around the sample mean; unlike the sample mean, it is a biased estimator of the population standard deviation, with or without Bessel’s correction. The population standard deviation is a parameter, not a statistic.

We can sample from the diamonds data set and display the price of the diamonds in the sample.

Sample size, \(n\).

First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.

sample.size <- 4

Set the seed of the pseudorandom number generator.

Sampling is random, so next we set the seed. A seed is the starting point of a sequence of pseudorandom numbers. When you use the same seed, the same sequence of numbers, and therefore the same sample, is produced again. When you use a different seed, the sample will generally be different.

set.seed(1)

Sample once and repeat.

Now let’s try sampling, once.

sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

The four prices drawn are determined by the seed that was set.

Let’s try it with another seed:

set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006  745 4516

And another:

set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127

And back to the first one:

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538

If you use the same seed, the same numbers are drawn again. With a different seed, the numbers drawn are generally different.

Finally, what happens when we don’t set a seed between samples?

set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549  744  538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322  907
sample(diamonds$price, sample.size)
## [1]  463 3376 4932 4616

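Without resetting the seed, each call to sample() continues along the same pseudorandom sequence, so the three samples differ from one another. Resetting the seed to 1 restarts the sequence, so the same three samples are generated again in the same order.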
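The same behaviour can be checked programmatically rather than by eye. The sketch below is illustrative only: `prices` is a made-up stand-in for diamonds$price so that it runs without ggplot2, and `draw_samples` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical stand-in for diamonds$price, so the sketch runs without ggplot2.
prices <- seq(300, 19000, by = 50)

# Set the seed once, then draw k samples of size n in a row.
draw_samples <- function(seed, k = 3, n = 4) {
  set.seed(seed)
  replicate(k, sample(prices, n), simplify = FALSE)
}

first_run  <- draw_samples(seed = 1)
second_run <- draw_samples(seed = 1)

# Same seed: the whole sequence of samples is reproduced.
print(identical(first_run, second_run))

# Within one run, successive samples come from later parts of the
# pseudorandom sequence, so they are compared here rather than assumed equal.
print(identical(first_run[[1]], first_run[[2]]))
```

The exact numbers drawn depend on the R version, so only the reproducibility is checked here, not particular values.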

Describing samples with one number: a statistic

set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75

Resetting the seed reproduces the same three samples as above, so these are their means. The mean is one statistic that describes a sample; the standard deviation is another.

For example standard deviation, with Bessel’s correction:

set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409

And standard deviation, without Bessel’s correction:

set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
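
An alternative to resetting the seed before each statistic is to store the sample once and compute everything from the stored values. The sketch below assumes the first sample shown in this document, typed in by hand so that it does not depend on the random number generator; the name `stats` is hypothetical.

```r
# Standard deviation without Bessel's correction, as defined earlier.
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# The first sample shown in this document (seed 1, n = 4), entered by hand.
s <- c(5801, 8549, 744, 538)

# All three statistics describe the same stored sample.
stats <- c(mean = mean(s), sd = sd(s), sdn = sdn(s))
print(round(stats, 3))
```

The results should agree with the first value reported in each of the three runs above.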

Sampling Distributions of Statistics

The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when it is computed from a random sample of size \(n\). The statistic can be a mean, a proportion, or a standard deviation.

Sampling distribution for the mean of price of a sample of diamonds.

The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.

For each sample size, the graph shows a box-and-whisker plot of the sample means, with the mean, median, quartiles, and outliers.

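Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?

An estimator is a rule for calculating an estimate of a given quantity based on observed data. The sample mean is estimating the population mean, here the average price of all the diamonds in the data set. It does a better job when the sample size is larger, because the sample means then vary less around the population mean.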
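
The effect of sample size on the sample mean can be sketched by simulation. Everything here is an assumption made for illustration: `population` is a simulated right-skewed stand-in for diamonds$price (so the code runs without ggplot2), and the sample sizes, the number of repetitions, and the helper `sample_means` are hypothetical choices.

```r
set.seed(1)

# Hypothetical right-skewed population standing in for diamonds$price.
population <- round(rlnorm(50000, meanlog = 7.8, sdlog = 1))

# Means of `reps` random samples of size n drawn from the population.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

sizes <- c(4, 16, 64, 256)

# Spread of the sample mean for each sample size.
spread <- sapply(sizes, function(n) sd(sample_means(n)))

print(data.frame(n = sizes, sd_of_sample_means = spread))
print(mean(population))
```

The spread of the sample means should shrink as the sample size grows, which is the sense in which a larger sample makes the sample mean a better estimate.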

Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:

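The larger the sample, the more closely the sample standard deviations cluster around the population standard deviation. With a small sample, a single sample standard deviation tells us little about how close it is to the population standard deviation. The sample standard deviation is estimating the population standard deviation, that is, how spread out the prices are around their mean.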

Now let’s try without Bessel’s correction:

The outliers are different from those with Bessel’s correction. I think the first version, with Bessel’s correction, is better, and the difference matters most when the sample is small.

#### Sampling error and sampling bias

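Sampling error is the difference between a statistic and the parameter it estimates; it arises because only a sample of the population is observed. Sampling bias is the average of the sampling error over all possible samples.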
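
The average sampling error of the two standard deviation formulas can be estimated by simulation. This is a sketch under stated assumptions, not the analysis behind the plots above: `population` is a simulated right-skewed stand-in for diamonds$price, and the sample size of 4 and the 5000 repetitions are illustrative choices.

```r
set.seed(1)

# Standard deviation without Bessel's correction, as defined earlier.
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Hypothetical right-skewed population standing in for diamonds$price.
population <- round(rlnorm(50000, meanlog = 7.8, sdlog = 1))
sigma <- sdn(population)  # population standard deviation (the parameter)

n <- 4
reps <- 5000
samples <- replicate(reps, sample(population, n))  # n x reps matrix

# Sampling error of each estimate, then its average (the estimated bias).
err_bessel <- apply(samples, 2, sd) - sigma
err_plain  <- apply(samples, 2, sdn) - sigma
bias <- c(with_bessel = mean(err_bessel), without_bessel = mean(err_plain))
print(bias)
```

Both averages should come out negative, meaning both formulas tend to underestimate the population standard deviation at this sample size, with the uncorrected formula further from 0.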