Elementary descriptive statistics of populations and samples
Any collection of numerical data on one or more variables can be described using a number of common statistical concepts. Let \(x = x_1, x_2, \dots, x_n\) be a sample of \(n\) observations of a variable drawn from a population.
| Category | Measure | Population | Sample |
|---|---|---|---|
| | What is it? | Reality | A small fraction of reality (inference) |
| | Characteristics described by | Parameters | Statistics |
| Central Tendency | Mean | \(\mu = E(Y)\) | \(\hat{\mu} = \overline{y}\) |
| Central Tendency | Median | 50th percentile | \(y_{(\frac{n+1}{2})}\) |
| Dispersion | Variance | \(\sigma^2 = \mathrm{var}(Y) = E(Y-\mu)^2\) | \(s^2 = \frac{1}{n-1} \sum_{i = 1}^{n} (y_i-\overline{y})^2 = \frac{1}{n-1} \left( \sum_{i = 1}^{n} y_i^2 - n\overline{y}^2 \right)\) |
| Dispersion | Coefficient of Variation | \(\frac{\sigma}{\mu}\) | \(\frac{s}{\overline{y}}\) |
| Dispersion | Interquartile Range | Difference between the 25th and 75th percentiles; robust to outliers | |
| Shape | Skewness (standardized 3rd central moment, unitless) | \(g_1 = \frac{\mu_3}{\mu_2^{3/2}}\) | \(\hat{g}_1 = \frac{m_3}{m_2^{3/2}}\) |
| Shape | Central moments | \(\mu = E(Y)\), \(\mu_2 = \sigma^2 = E(Y-\mu)^2\), \(\mu_3 = E(Y-\mu)^3\), \(\mu_4 = E(Y-\mu)^4\) | \(m_2 = \frac{1}{n}\sum_{i=1}^{n}(y_i-\overline{y})^2\), \(m_3 = \frac{1}{n}\sum_{i=1}^{n}(y_i-\overline{y})^3\), \(m_4 = \frac{1}{n}\sum_{i=1}^{n}(y_i-\overline{y})^4\) |
| Shape | Kurtosis (peakedness and tail thickness; standardized 4th central moment) | \(g_2^* = \frac{E(Y-\mu)^4}{\sigma^4}\) | \(\hat{g}_2 = \frac{m_4}{m_2^2} - 3\) (excess kurtosis, so the normal distribution has value 0) |
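To see the shape measures in action, here is a minimal base-R sketch (the vector y below is made up purely for illustration) that computes the sample central moments and, from them, the skewness \(\hat{g}_1\) and excess kurtosis \(\hat{g}_2\) defined in the table:
# made-up example data, purely for illustration
y <- c(2, 3, 3, 4, 5, 5, 5, 6, 9, 12)
n <- length(y)
# sample central moments (divide by n, as in the table)
m2 <- sum((y - mean(y))^2) / n
m3 <- sum((y - mean(y))^3) / n
m4 <- sum((y - mean(y))^4) / n
# standardized 3rd and 4th central moments
(g1_hat <- m3 / m2^(3/2))   # skewness
(g2_hat <- m4 / m2^2 - 3)   # excess kurtosis
A positive \(\hat{g}_1\) indicates a right-skewed sample; packages such as moments or e1071 wrap similar skewness and kurtosis calculations.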
Mean, variance and standard deviation
- The mean is the average value of the observations and is defined by adding up all the values and dividing by the number of observations. The mean \(\bar{x}\) of our sample \(x\) is defined as:
\[ \bar{x} = \frac{1}{n}\sum_{i = 1}^{n}x_i \] While the mean of a sample \(x\) is denoted by \(\bar{x}\), the mean of an entire population is usually denoted by \(\mu\). The mean can have a different interpretation depending on the type of data being studied.
Let's create a fake dataset here:
# Defining a vector of study hours
study_hours <- c(10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0,
                 8.5, 14.5, 15.5, 13.75, 9.0, 8.0, 15.5, 8.0,
                 9.0, 6.0, 10.0, 12.0, 12.5, 12.0)
mean(study_hours, na.rm = TRUE)
## [1] 10.52273
This matches intuition: it is the average number of study hours across the individuals in the sample data.
Other common statistical summary measures include the median, which is the middle value when the values are ranked in order, and the mode, which is the most frequently occurring value.
R does not have a built-in function for the mode, however, so we use the statip package.
library(statip)
# median
(med_val <- median(study_hours))
## [1] 10
# mode
(mod_val <- mfv(study_hours))
## [1] 9
- The variance is a measure of how much the data varies around its mean. There are two different definitions of variance.
- The population variance assumes that we are working with the entire population and is defined as the average squared difference from the mean:
\[ \mathrm{Var}_p(x) = \frac{1}{n}\sum_{i = 1}^{n}(x_i - \bar{x})^2 \]
- The sample variance assumes that we are working with a sample and attempts to estimate the variance of a larger population by applying Bessel’s correction to account for potential sampling error. The sample variance is:
\[ \mathrm{Var}_s(x) = \frac{1}{n-1}\sum_{i = 1}^{n}(x_i - \bar{x})^2 \]
You can see that
\[ \mathrm{Var}_p(x) = \frac{n - 1}{n}\mathrm{Var}_s(x) \] So as the data set gets larger, the sample variance and the population variance become less and less distinguishable, which intuitively makes sense.
Because we rarely work with full populations, the sample variance is calculated by default in R and in many other statistical software packages.
# sample variance
(sample_variance_study_hours <- var(study_hours, na.rm = TRUE))
## [1] 12.16017
So where necessary, we need to apply a transformation to get the population variance.
# population variance (need length of non-NA data)
n <- length(na.omit(study_hours))
(population_variance_study_hours <- ((n - 1) / n) * sample_variance_study_hours)
## [1] 11.60744
Variance is not on an intuitive scale relative to the data being studied, because it uses a squared distance metric. We can therefore take its square root to get a measure of deviation on the same scale as the data.
- We call this the standard deviation \(\sigma(x)\), where \(\mathrm{Var}(x) = \sigma(x)^2\). As with variance, standard deviation has both population and sample versions, and the sample version is calculated by default. Conversion between the two takes the form
\[ \sigma_p(x) = \sqrt{\frac{n-1}{n}}\sigma_s(x) \]
# sample standard deviation
(sample_sd_study_hours <- sd(study_hours, na.rm = TRUE))
## [1] 3.487144
# verify that sample sd is sqrt(sample var)
sample_sd_study_hours == sqrt(sample_variance_study_hours)
## [1] TRUE
# calculate population standard deviation
(population_study_hours <- sqrt((n - 1) / n) * sample_sd_study_hours)
## [1] 3.406969
Given that the range of study_hours is [1, 16] and the mean is about 10.5, we see that the standard deviation gives a more intuitive sense of the 'spread' of the data relative to its inherent scale.
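The coefficient of variation from the summary table (the sample standard deviation relative to the sample mean) makes this scale-free comparison explicit; for these data it comes out to roughly 0.33:
# coefficient of variation: sample sd relative to the sample mean
sample_sd_study_hours / mean(study_hours, na.rm = TRUE)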
Hypothesis testing
Testing allows us to:
- Make inferences (interpretations) about the true parameter value based on our estimator/estimate.
- Test whether our underlying assumptions (about the true population parameters, random variables, or model specification) hold true.
However, testing does not:
- Confirm with 100% certainty that a hypothesis is true.
- Confirm with 100% certainty that a hypothesis is false.
- Tell you how to interpret the estimated value (economic vs. practical vs. statistical significance).
Hypothesis: translates an objective (better understanding the results) into a specified value, or set of values, in which our population parameters should or should not lie.
- Null hypothesis (\(H_0\)): A statement about the population parameter that we take to be true unless the data provide substantial evidence against it.
- Can be either a single value (ex: \(H_0: \beta=0\)) or a set of values (ex: \(H_0: \beta_1 \ge 0\))
- Will generally be the value you would not like the population parameter to be (subjective)
- \(H_0: \beta_1=0\) means you would like to see a non-zero coefficient
- \(H_0: \beta_1 \ge 0\) means you would like to see a negative effect
- “Test of Significance” refers to the two-sided test: \(H_0: \beta_j=0\)
- Alternative hypothesis (\(H_a\) or \(H_1\)) (Research Hypothesis): All other possible values that the population parameter may be if the null hypothesis does not hold.
Type I Error
Error made when \(H_0\) is rejected when, in fact, \(H_0\) is true.
Type I error level (\(\alpha\)): the probability of rejecting \(H_0\) when it is true, known as the level of significance of the test.
Type II Error
Error made when \(H_0\) is not rejected when, in fact, \(H_1\) is true.
Type II error level (\(\beta\)): the probability of failing to reject \(H_0\) when it is false; the power of the test is \(1 - \beta\).
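The two error rates trade off against one another: for a fixed sample size, lowering \(\alpha\) raises \(\beta\). As a rough sketch (the effect size, standard deviation, and group size below are made up), base R's power.t.test() shows how \(\alpha\), the effect size, and \(n\) determine the power \(1-\beta\):
# hypothetical scenario: detect a mean difference of 2 with sd = 3.5,
# alpha = 0.05 and 20 observations per group
power.t.test(n = 20, delta = 2, sd = 3.5, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")
Supplying power = 0.8 and leaving n unspecified instead returns the sample size needed to hold \(\beta\) at 0.2.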
Estimators
Point Estimator
\(\hat{\theta}\) is a statistic used to approximate a population parameter \(\theta\)
Point estimate
The numerical value assumed by \(\hat{\theta}\) when evaluated for a given sample
Unbiased estimator
If \(E(\hat{\theta}) = \theta\), then \(\hat{\theta}\) is an unbiased estimator for \(\theta\)
- \(\bar{X}\) is an unbiased estimator for \(\mu\)
- \(S^2\) is an unbiased estimator for \(\sigma^2\)
- \(\hat{p}\) is an unbiased estimator for p
- \(\widehat{p_1-p_2}\) is an unbiased estimator for \(p_1 - p_2\)
- \(\bar{X_1} - \bar{X_2}\) is an unbiased estimator for \(\mu_1 - \mu_2\)
Note: \(S\) is a biased estimator for \(\sigma\)
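A small simulation (parameters made up for illustration) makes the last two points concrete: across many repeated samples the average of \(S^2\) matches \(\sigma^2\) closely, while the average of \(S\) falls slightly below \(\sigma\):
set.seed(1)
sigma <- 2   # true population standard deviation (made up)
sims <- replicate(10000, {
  y <- rnorm(10, mean = 5, sd = sigma)   # repeated small samples of size 10
  c(var_hat = var(y), sd_hat = sd(y))
})
rowMeans(sims)   # var_hat averages near sigma^2 = 4; sd_hat averages below sigma = 2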
Confidence intervals (C.I.)
When \(\sigma^2\) is estimated by \(s^2\), then
\[ \frac{\bar{y}-\mu}{s/\sqrt{n}} \sim t_{n-1} \]
Then, a \(100(1-\alpha) \%\) confidence interval for \(\mu\) is obtained from:
\[ 1 - \alpha = P(-t_{\alpha/2;n-1} \le \frac{\bar{y}-\mu}{s/\sqrt{n}} \le t_{\alpha/2;n-1}) \\ = P(\bar{y} - (t_{\alpha/2;n-1})s/\sqrt{n} \le \mu \le \bar{y} + (t_{\alpha/2;n-1})s/\sqrt{n}) \]
And the interval is
\[ \bar{y} \pm (t_{\alpha/2;n-1})s/\sqrt{n} \]
where \(s/\sqrt{n}\) is the standard error of \(\bar{y}\).
If the experiment were repeated many times, \(100(1-\alpha) \%\) of these intervals would contain \(\mu\)
| | \(100(1-\alpha)\%\) Confidence Interval | Sample Size (confidence \(\alpha\), error \(d\)) | Hypothesis Test Statistic |
|---|---|---|---|
| When \(\sigma^2\) is known, \(X\) is normal (or \(n \ge 25\)) | \(\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\) | \(n \approx \frac{z_{\alpha/2}^2 \sigma^2}{d^2}\) | \(z = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}\) |
| When \(\sigma^2\) is unknown, \(X\) is normal (or \(n \ge 25\)) | \(\bar{X} \pm t_{\alpha/2}\frac{s}{\sqrt{n}}\) | \(n \approx \frac{z_{\alpha/2}^2 s^2}{d^2}\) | \(t = \frac{\bar{X}-\mu_0}{s/\sqrt{n}}\) |
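Applying the \(\sigma^2\)-unknown row to the study_hours data from earlier, the interval can be obtained either from t.test() or directly from the formula (a quick sketch; 95% is just the default confidence level, and n here is 22 rather than 25):
# 95% t-based confidence interval for the mean of study_hours
t.test(study_hours)$conf.int
# the same interval from the formula
n  <- length(na.omit(study_hours))
se <- sd(study_hours, na.rm = TRUE) / sqrt(n)
mean(study_hours, na.rm = TRUE) + c(-1, 1) * qt(0.975, df = n - 1) * se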
For Difference of Means (\(\mu_1-\mu_2\)), Independent Samples
| | \(100(1-\alpha)\%\) Confidence Interval | Hypothesis Test Statistic | Notes |
|---|---|---|---|
| When \(\sigma^2_1\) and \(\sigma^2_2\) are known | \(\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\) | \(z= \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}}\) | |
| When the variances are unknown and assumed EQUAL | \(\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}\) | \(t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\) | Pooled variance: \(s_p^2 = \frac{(n_1-1)s^2_1 + (n_2-1)s^2_2}{n_1 + n_2 - 2}\); degrees of freedom: \(\gamma = n_1 + n_2 - 2\) |
| When the variances are unknown and assumed UNEQUAL | \(\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}\) | \(t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}\) | Degrees of freedom: \(\gamma = \frac{(\frac{s_1^2}{n_1}+\frac{s^2_2}{n_2})^2}{\frac{(\frac{s_1^2}{n_1})^2}{n_1-1}+\frac{(\frac{s_2^2}{n_2})^2}{n_2-1}}\) |
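As a sketch of the last two rows (the two groups below are simulated, not real data), t.test() covers both cases: var.equal = TRUE gives the pooled-variance interval and test with \(n_1 + n_2 - 2\) degrees of freedom, while the default var.equal = FALSE gives the Welch test with the Satterthwaite degrees of freedom shown above:
set.seed(2)
group1 <- rnorm(15, mean = 10, sd = 3)   # hypothetical sample 1
group2 <- rnorm(20, mean = 12, sd = 4)   # hypothetical sample 2
# pooled-variance t-test and CI (variances assumed equal)
t.test(group1, group2, var.equal = TRUE)
# Welch t-test and CI (variances assumed unequal, the default)
t.test(group1, group2)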