In the context of sampling, Bessel’s Correction improves the estimate of standard deviation: specifically, while the sample standard deviation is a biased estimate of the population standard deviation, the bias is smaller with Bessel’s Correction.
For data, we will use the diamonds data set in the R package ggplot2, which contains data on 53940 round cut diamonds. Here are the first 6 rows of this data set:
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Answer this question: what is the meaning of a distribution of a variable, and how does it relate to price?
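The distribution of a variable tells us what values the variable takes and how often it takes those values. For price, the distribution describes which prices occur among the diamonds and how often each range of prices occurs.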
Explain what a quantitative variable is, and why it was important to make such a choice in a report about standard deviation. Explain how the concepts of numerical and quantitative variables are different, though related.
A quantitative variable is a variable that holds numbers for which arithmetic is meaningful. It is important to use quantitative variables when reporting standard deviation because quantitative variables are distinguished by your ability to add and average the data, and averaging is a central part of standard deviation. Numerical data is different from quantitative data because, even though it is represented by numbers, it may not make sense to add or average it. For example, social security numbers are made of digits but are considered numerical, not quantitative, because you would never add or average social security numbers.
What is a histogram? Explain the graph below.
A histogram is a graph made of rectangles that shows the distribution of a single variable. The area of each rectangle is proportional to the frequency with which values fall in its interval, and the width of each rectangle represents that interval. In the graph below, each rectangle or “bin” is about $750 wide and shows the number of diamonds in that price range.
Explain the relationship between a histogram and a violin plot.
Violin plots and histograms both show the distribution of a variable. A violin plot is essentially a histogram that is turned on its side and smoothed out by a computer algorithm (a density estimate). The image is then mirrored, so the width of the violin at each value reflects the density of the data there.
R has a function that returns numerical summaries of data. For example:
summary(diamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
Describe what each of these numbers means.
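Min. (326) is the smallest price in the data set. 1st Qu. (950) is the first quartile, the median of the lower half of the data. Median (2401) is the median of the entire data set. Mean (3933) is the arithmetic average of all the prices. 3rd Qu. (5324) is the third quartile, the median of the upper half of the data. Max. (18823) is the largest price in the data set.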
Describe the relationship of the numbers above to the modified box plot, here drawn inside the violin plot. Explain the difference between a boxplot and a modified box plot. Explain what an outlier is, and how suspected outliers are identified in a modified box plot.
A box plot shows the minimum, Q1, median, Q3 and maximum, which are the numbers returned by summary() apart from the mean. In a modified box plot, limits (fences) are placed on how far the whiskers may extend, commonly 1.5 times the interquartile range below Q1 and above Q3, and any points outside those limits are drawn as individual dots. Outliers are values that are distinctly separate from the rest of the data; the dots above the modified box plot are the suspected outliers.
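One common convention for the limits in a modified box plot places the fences 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a small hypothetical vector of prices (the values and variable names are made up for illustration and are not taken from diamonds); plotting functions may compute the quartiles slightly differently.

```r
# Hypothetical prices, chosen only for illustration (not from diamonds)
prices <- c(300, 450, 500, 620, 700, 810, 950, 1200, 5000)

q1 <- quantile(prices, 0.25, names = FALSE)  # first quartile
q3 <- quantile(prices, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                               # interquartile range

# Fences under the 1.5 * IQR convention
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as suspected outliers
suspected <- prices[prices < lower_fence | prices > upper_fence]
suspected
```

With these made-up values, only the largest price falls outside the fences.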
Add one sentence to indicate where the mean is on this plot.
The mean is represented on the graph below as a red dot.
Explain the formulas below and say which uses Bessel’s correction.
The formulas below show the standard deviation with and without Bessel’s correction. The first formula, \(s\), uses Bessel’s correction: it divides by \(n-1\) instead of \(n\). Dividing by \(n\) gives an answer that tends to be too small as an estimate of the population standard deviation. Dividing by \(n-1\) gives a larger number, which is less biased but still not exactly correct on average.
\[s = \sqrt{\frac{1}{n-1}\sum\left(x_i - \bar x\right)^2}\]
\[s_n = \sqrt{\frac{1}{n}\sum\left(x_i - \bar x\right)^2}\]
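Comparing the two formulas term by term, they share the same sum of squared deviations and differ only in the factor in front of it, so one is a fixed multiple of the other:
\[s = \sqrt{\frac{n}{n-1}}\; s_n\]
Because \(n/(n-1) > 1\), \(s\) is never smaller than \(s_n\). The factor is about 1.155 when \(n=4\) and approaches 1 as \(n\) grows.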
We compute the standard deviation (with Bessel’s correction) of the price variable:
sd(diamonds$price)
## [1] 3989.44
How about without Bessel’s correction? Well, R doesn’t seem to have this function, but we can add it:
sdn <- function(x) {
return(sqrt(mean((x - mean(x))^2)))
}
sdn(diamonds$price)
## [1] 3989.403
How close are these estimates? Which is larger?
These two estimates are extremely close together. They are so close that, rounded to the nearest dollar, they are the same number. The number calculated using Bessel’s correction is slightly larger.
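The size of the gap depends only on the number of observations \(n\), because the two formulas differ by the factor \(\sqrt{n/(n-1)}\). The following is a minimal sketch that checks this on a small hypothetical vector (the data in `x` are invented for illustration) and then evaluates the factor for a sample of 4 and for all 53940 diamonds.

```r
# Standard deviation without Bessel's correction (divide by n)
sdn <- function(x) sqrt(mean((x - mean(x))^2))

# Hypothetical data, used only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sd(x)                       # with Bessel's correction (divide by n - 1)
sdn(x)                      # without the correction
sdn(x) * sqrt(n / (n - 1))  # rescaled value; should agree with sd(x)

# The correction factor shrinks toward 1 as n grows
factor_small <- sqrt(4 / (4 - 1))
factor_large <- sqrt(53940 / (53940 - 1))
c(factor_small, factor_large)
```

For a very large \(n\) the factor is so close to 1 that the two standard deviations are nearly indistinguishable, while for \(n=4\) the corrected value is noticeably larger.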
So what is the big deal about Bessel’s correction? See below.
The statement that began this document asserted that Bessel’s correction is important in the context of sampling. Explain sampling here: explain the differences between a population, and a sample, and between a parameter and a statistic. Give examples of parameters and give examples of statistics. Explain the difference between the sample mean and the population mean. Explain the difference between the sample standard deviation and the population standard deviation.
The “population” in statistics includes all members of a defined group that we are studying or collecting information on for data-driven decisions. A part of the population is called a “sample.” The difference between a statistic and a parameter is that a statistic describes a sample, while a parameter describes an entire population. Examples of statistics are the sample mean and the sample standard deviation; examples of parameters are the population mean and the population maximum. The sample mean is the average of the values in one sample, while the population mean is the average over every member of the population. The population standard deviation measures how spread out the data are for the entire population. The sample standard deviation measures the spread of the data within a specific sample of that population.
We can sample from the diamonds data set and display the price of the diamonds in the sample.
First, we need to choose a sample size, \(n\). We choose \(n=4\) which is very low in practice, but will serve to make a point.
sample.size <- 4
Sampling is random, so next we set the seed. Explain what a seed of a random number generator is. Explain what happens when you use the same seed and what happens when you use different seeds. The simulations below may help you.
A seed in a random number generator is like a code that tells the computer which sequence of numbers to produce. The numbers give the illusion of being random, but they are fully determined by the seed. Every time you enter the same seed, the same numbers reappear; a different seed produces a different sequence.
set.seed(1)
Now let’s try sampling, once.
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain what this command did.
This command randomly drew four prices from the price column of the diamonds data set. Which four prices were drawn was determined by the seed (1).
Let’s try it with another seed:
set.seed(2)
sample(diamonds$price, sample.size)
## [1] 4702 1006 745 4516
And another:
set.seed(3)
sample(diamonds$price, sample.size)
## [1] 4516 1429 9002 7127
And back to the first one:
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
Explain these results.
Each time a different seed is set, a different sample is drawn from the same data set. When seed 1 was set again, the command produced the same four numbers as the first time.
Finally, what happens when we don’t set a seed between samples?
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
set.seed(1)
sample(diamonds$price, sample.size)
## [1] 5801 8549 744 538
sample(diamonds$price, sample.size)
## [1] 4879 1976 2322 907
sample(diamonds$price, sample.size)
## [1] 463 3376 4932 4616
Explain these results.
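If you don’t set a seed between samples, each call to sample() produces different numbers, because the generator continues along the same seeded sequence from where it left off. Resetting the seed to 1 restarts that sequence, so the same three samples appear again in the same order. They are all different samples from the same population.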
set.seed(1)
mean(sample(diamonds$price,sample.size))
## [1] 3908
mean(sample(diamonds$price,sample.size))
## [1] 2521
mean(sample(diamonds$price,sample.size))
## [1] 3346.75
Explain what we have done here. Answer the following question: what other statistics could we use to describe samples?
Above we computed the mean of three different samples, each of size 4, drawn from the same population. We could also use the standard deviation, the median, and the quartiles to describe samples.
For example, standard deviation, with Bessel’s correction:
set.seed(1)
sd(sample(diamonds$price,sample.size))
## [1] 3936.586
sd(sample(diamonds$price,sample.size))
## [1] 1683.428
sd(sample(diamonds$price,sample.size))
## [1] 2036.409
And standard deviation, without Bessel’s correction:
set.seed(1)
sdn(sample(diamonds$price,sample.size))
## [1] 3409.183
sdn(sample(diamonds$price,sample.size))
## [1] 1457.891
sdn(sample(diamonds$price,sample.size))
## [1] 1763.582
Explain what a sampling distribution of a statistic is and how it relates to the numbers computed above. Answer the following question: what tools do we have to describe these distributions?
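A sampling distribution of a statistic is the distribution of the values that statistic takes over all possible samples of a given size from the population. Each number computed above (a sample mean or a sample standard deviation) is one draw from the corresponding sampling distribution. We can describe these distributions with the same tools used for any distribution: histograms, box plots, violin plots, and numerical summaries such as the mean, median, quartiles and standard deviation.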
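A sampling distribution can be approximated by drawing many samples and computing the statistic for each one. The sketch below does this for the sample mean, assuming a small hypothetical population; the vector `population`, the count `n_samples`, and the seed are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population, standing in for a full column of prices
population <- c(326, 700, 950, 1200, 2401, 3933, 4100, 5324, 8800, 18823)

sample.size <- 4
n_samples <- 1000  # number of repeated samples (an arbitrary choice)

# One sample mean per repeated sample; together these values
# approximate the sampling distribution of the sample mean
sample_means <- replicate(n_samples, mean(sample(population, sample.size)))

summary(sample_means)  # numerical summary of the approximate distribution
sd(sample_means)       # its spread
```

The same pattern works for any other statistic: replace `mean` inside `replicate()` with `median`, `sd`, or a user-defined function.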
The plot below shows images of the sampling distribution for the sample mean, for different values of sample size.
Answer the following questions: what do the features of the graph below represent? One hint: the horizontal line is the population mean of the prices of all diamonds in the data set.
Each box plot shows the minimum, maximum, outliers, quartiles and median of the sample means for one sample size, with all samples drawn from the same population.
Explain the concept of an estimator. What is the sample mean estimating, and in what situation does it do a better job?
An estimator is a rule for calculating an estimate of a given quantity based on observed data. The sample mean estimates the population mean, and it does a better job as the sample size increases, because the sample means then cluster more tightly around the population mean.
Let’s try describing the sampling distribution of the sample standard deviation with Bessel’s Correction. Again the samples are of diamonds, and the variable considered is the price of diamonds:
Some people argue that it is appropriate to drop Bessel’s correction for populations, but if the population size is large, as shown here, it doesn’t matter much. Why? What is the sample standard deviation estimating? In what situations is it a better estimate?
Now let’s try without Bessel’s correction:
Answer the following questions: what is the difference between the standard deviation with Bessel’s correction and the standard deviation without Bessel’s correction? Which do you think is better and when does it matter?
The standard deviation calculated with Bessel’s correction is larger than the one calculated without it, because it divides by \(n-1\) instead of \(n\). Since the uncorrected version tends to underestimate the population standard deviation, the corrected version is, on average, closer to it. I believe Bessel’s correction gives a more accurate estimate of the population standard deviation more often than not, so I would prefer it whenever a sample is used to estimate a population value. It matters most when the sample size is small, as with \(n=4\) above; for very large data sets the two values are nearly identical.
Describe the difference between sampling error and sampling bias. Describe the difference between a biased estimator and unbiased estimators.
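Sampling error is the difference between a statistic (for example, the sample mean) and the parameter it estimates (the population mean); it arises by chance and varies from sample to sample. Sampling bias, on the other hand, is a systematic error caused by a sampling method that favors some members of the population over others. An estimator is unbiased when its mean sampling error, taken over all possible samples, comes to 0, and biased when it does not.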
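The bias of an estimator, meaning its average error over many samples, can be approximated by simulation. The sketch below assumes a hypothetical normal population with a known standard deviation `sigma` (an illustrative choice, not the diamonds data) and compares the standard deviation with and without Bessel’s correction for samples of size 4.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Standard deviation without Bessel's correction (divide by n)
sdn <- function(x) sqrt(mean((x - mean(x))^2))

sigma <- 10         # population standard deviation (chosen for illustration)
sample.size <- 4
n_samples <- 10000  # number of simulated samples (arbitrary)

# Draw repeated samples from a hypothetical normal population;
# replicate() returns a matrix with one sample per column
samples <- replicate(n_samples, rnorm(sample.size, mean = 0, sd = sigma))

sd_values <- apply(samples, 2, sd)    # with Bessel's correction
sdn_values <- apply(samples, 2, sdn)  # without Bessel's correction

# Estimated bias: average estimate minus the true parameter
mean(sd_values) - sigma
mean(sdn_values) - sigma
```

Under these assumptions both estimated biases should come out negative, with the corrected version closer to 0, which matches the claim that Bessel’s correction reduces the bias without removing it.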