In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53,940 round-cut diamonds. Here are the first 6 rows of this data set:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
The distribution of a variable describes how often each possible outcome occurs, relative to the others, over a number of trials. In this graph, the distribution of price tells us at which prices diamonds are most common.
Quantitative variables are those measured on a numeric scale. They are the variables used to calculate the standard deviation in this data set. Every quantitative variable is numerical, but not every numerical variable is quantitative: social security numbers and ID numbers, for example, are numerical labels rather than measurements.
A histogram is a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.
Diamonds priced between $250 and $1,250 are relatively common. As the price increases, the number of diamonds decreases.
Both the histogram and the violin plot show the density of the data set. However, the histogram has an x-axis, which the violin plot does not include.
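The binning behind a histogram can be sketched in base R. This is only an illustration: the prices vector is made up rather than taken from the diamonds data, and the class width of 1000 is an arbitrary choice.

```r
# Made-up prices for illustration (not taken from the diamonds data).
prices <- c(300, 450, 800, 1200, 2600, 5100)

# Class intervals of width 1000; plot = FALSE returns the bins without drawing.
h <- hist(prices, breaks = seq(0, 6000, by = 1000), plot = FALSE)

h$counts   # number of observations in each class interval
h$density  # bar heights scaled so that the bar areas sum to 1
```

With equal class widths, as here, bar height and bar area carry the same information.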
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18820
‘Min.’ is the smallest value in the data set. ‘1st Qu.’ (Q1) is the first quartile: a quarter of the values lie at or below it. ‘Median’ is the middle value of the sorted data. ‘Mean’ is the average of all the values. ‘3rd Qu.’ (Q3) is the third quartile: three quarters of the values lie at or below it. ‘Max.’ is the largest value in the data set.
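As a small check on these definitions, the same five positions can be requested directly from quantile(). The vector below is made up for illustration; the quartiles use R's default interpolation rule, which is also what summary() uses.

```r
# Made-up prices for illustration (not taken from the diamonds data).
x <- c(326, 538, 744, 2401, 5801, 8549, 18820)

# Min, Q1, median, Q3, max in one call.
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
q

# summary() reports the same five values plus the mean.
summary(x)
```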
‘Min’ is located at the lowest point of the plot. ‘Q1’ is the bottom edge of the orange rectangle. ‘Median’ is the black line inside the orange rectangle. ‘Q3’ is the top edge of the orange rectangle. ‘Max’ is located at the highest point of the plot.
A box plot shows where ‘Min’, ‘Q1’, ‘Median’, ‘Q3’, and ‘Max’ are. However, the modified box plot can also show the density of the data set and frequency.
An outlier is an observation that is distant from the other observations. As the modified box plot shows, the outliers are located along the black line in the upper part of the plot.
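The text does not say which rule the plot uses to flag outliers. A common convention for modified box plots, assumed here, marks any point more than 1.5 times the interquartile range beyond Q1 or Q3. A minimal sketch on made-up values:

```r
# Made-up prices for illustration (not taken from the diamonds data).
x <- c(326, 538, 744, 2401, 5801, 8549, 18820)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences under the assumed 1.5 * IQR convention.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```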
The mean is the red point in the diagram.
The first formula below calculates the sample standard deviation using Bessel’s correction: the sum of squared deviations is divided by \(n-1\).
The second formula also calculates the standard deviation, but without Bessel’s correction: the sum is divided by \(n\).
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:
sdn <- function(x) {
return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
The difference between the two estimates is 0.037, and the first estimate (with Bessel’s correction) is larger.
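The size of this gap follows from the two formulas, which differ only in the divisor. As a rough check, taking \(n = 53940\) from the description of the data set and rounding along the way:

\[\frac{s}{s_n} = \sqrt{\frac{n}{n-1}} = \sqrt{\frac{53940}{53939}} \approx 1.0000093, \qquad s - s_n \approx 3989.403 \times 0.0000093 \approx 0.037\]

For a sample of size 4, the same ratio is \(\sqrt{4/3} \approx 1.155\), so the correction is far from negligible when \(n\) is small.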
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. A parameter is any number calculated from a population, that is, from the entire group of interest. A statistic is any number calculated from a sample, which is a part of the population chosen at random.
As an example, consider the question: are the grades of college students inflated? The population is 7 million students, and the population mean of 2.7 is a parameter. The sample is 100 students, and the sample mean of 2.9 is a statistic. The sample mean is the mean of a sample drawn at random from the population; the population mean is the average of the entire group. The standard deviation of a population measures the dispersion of the data for the entire population. The standard deviation of a sample estimates the standard deviation of the population based on a random sample.
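The distinction between a parameter and a statistic can be sketched in a few lines of R. The population here is simulated and purely hypothetical (10,000 grade-like values rather than 7 million real students), so the numbers will not match the example above.

```r
# Hypothetical population of 10,000 grade-like values (illustration only).
set.seed(1)
population <- round(runif(10000, min = 1, max = 4), 1)

# Parameter: computed from the whole population.
population_mean <- mean(population)

# Statistic: computed from a random sample of 100 members.
grades_sample <- sample(population, 100)
sample_mean   <- mean(grades_sample)

c(parameter = population_mean, statistic = sample_mean)
```

In practice only the statistic is available, and it is used to estimate the parameter.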
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. set.seed(x) sets the starting value (the seed) of the pseudo-random number generator that R functions such as sample() rely on, so that the same sequence of random values can be reproduced.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
This command asks R to choose four values at random from the price variable.
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Once we set the seed to some number, R gives us a random sample. If the seed is different, the results are also different. If we set the seed to the same number again, R returns the same sample.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
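If we set the seed to 1 and then call sample(diamonds$price, sample.size) several times, each call returns a different sample. Resetting the seed to 1 restarts the sequence, so the same three samples appear again in the same order.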
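The same behaviour can be checked programmatically rather than by eye. In this sketch, prices is a made-up stand-in for diamonds$price so that it runs without ggplot2, and draw_three() is a hypothetical helper introduced only for illustration.

```r
# Made-up stand-in for diamonds$price (illustration only).
prices <- seq(300, 18000, by = 100)
sample.size <- 4

# Hypothetical helper: set the seed once, then draw three samples in a row.
draw_three <- function(seed) {
  set.seed(seed)
  replicate(3, sample(prices, sample.size), simplify = FALSE)
}

first_run  <- draw_three(1)
second_run <- draw_three(1)

identical(first_run, second_run)           # same seed: same sequence of samples
identical(first_run[[1]], first_run[[2]])  # successive draws within a run differ
```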
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
After setting the seed to 1, we draw several samples and calculate the mean of each one. We could also use the mode, the most frequently observed value in the data set, or the median, the value that divides the sorted data set into two equal halves.
For example, the standard deviation with Bessel’s correction:
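One caveat when trying this in R: median() is built in, but the built-in mode() reports an object's storage mode rather than its most frequent value. A small helper, here called stat_mode (a hypothetical name, not part of base R), is one way to fill the gap.

```r
# Hypothetical helper: returns the most frequent value in x.
# If several values tie, the one that appears first in x is returned.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c(538, 744, 744, 5801, 8549)  # made-up prices for illustration
stat_mode(x)
median(x)
```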
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
The sampling distribution of a statistic is the probability distribution of that statistic over repeated random samples. Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. To describe these distributions, we use StatCrunch to make graphs, such as a violin plot or a modified box plot, that help us analyze the distribution of the data.
The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.
The means of the sampling distributions are the same for every sample size. However, their spread and shape differ. As the sample size increases, the sample means cluster more tightly, and the number and range of outliers decrease.
An estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand), and its result (the estimate) are distinguished. The sample mean from a group of observations is an estimate of the population mean. The larger the sample size, the better the sample mean does at estimating the population mean.
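A sampling distribution of the sample mean can be approximated by drawing many samples and recording each mean. The sketch below makes several assumptions: a simulated right-skewed population stands in for the diamond prices, the sample sizes 4, 16 and 64 are arbitrary, and sample_means() is a hypothetical helper.

```r
# Hypothetical right-skewed population standing in for diamond prices.
set.seed(1)
population <- rexp(10000, rate = 1 / 4000)

# Approximate the sampling distribution of the mean for one sample size.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

sizes <- c(4, 16, 64)
dists <- lapply(sizes, sample_means)

# Centre and spread of each approximate sampling distribution.
data.frame(
  n      = sizes,
  centre = sapply(dists, mean),
  spread = sapply(dists, sd)
)
mean(population)
```

The centre column should stay close to the population mean for every sample size, while the spread column shrinks as the sample size grows.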
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
Some people argue that it is appropriate to drop Bessel’s correction for populations, but when the population is large, as shown here, it doesn’t matter much. Likewise, when the sample size is large enough, Bessel’s correction makes little difference. We can estimate the population standard deviation from a sample standard deviation, and the larger the sample, the better that estimate becomes.
Now let’s try without Bessel’s correction:
The standard deviation with Bessel’s correction gives a less biased estimate of the population standard deviation than the standard deviation without it. How much this matters depends on the case. When the sample size is small, as with \(n=4\) here, the correction makes a noticeable difference. When the sample size is large, Bessel’s correction doesn’t really matter.
Sampling error is the error caused by observing a sample instead of the whole population. It is the difference between a sample statistic used to estimate a population parameter and the actual but unknown value of the parameter. If an estimator tends to overestimate or underestimate, the mean of that difference is called the bias. In more mathematical terms, an estimator is unbiased when its bias is zero. That’s just saying that if the mean of the estimator’s sampling distribution (for example, that of the sample mean) equals the parameter (the population mean), then it’s an unbiased estimator.
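The definition of bias can be turned into a small simulation: draw many samples, compute an estimate from each, and compare the mean of the estimates with the parameter. As a sketch only, this uses a simulated right-skewed population in place of the diamond prices, a sample size of 4, and 5000 repetitions, all of which are assumptions made for illustration.

```r
# Standard deviation without Bessel's correction: divide by n, not n - 1.
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Hypothetical right-skewed population standing in for diamond prices.
set.seed(1)
population <- rexp(10000, rate = 1 / 4000)
sigma <- sdn(population)  # treated as the population parameter

# Draw many samples of size 4 and record both estimates for each.
reps <- 5000
estimates <- replicate(reps, {
  x <- sample(population, 4)
  c(with_bessel = sd(x), without_bessel = sdn(x))
})

# Bias = mean of the estimates minus the parameter.
bias <- rowMeans(estimates) - sigma
bias
```

Under these assumptions both biases should come out negative, with the uncorrected estimate further from zero. That matches the opening claim: the sample standard deviation is biased either way, but less so with Bessel’s correction.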