Alban Guillaumet, Troy University
“While nothing is more uncertain than a single life, nothing is more certain than the average duration of a thousand lives.”
- Elizur Wright
Measures of central tendency
Measures of variability
Definition: The population mean \( \mu \) is the sum of all the observations in the population divided by \( N \), the number of observations in the population.
\[ \mu = \frac{1}{N}\sum_{i=1}^{N}Y_{i}\, \]
Definition: The sample mean \( \overline{Y} \) is the sum of all the observations in the sample divided by \( n \), the number of sample observations.
\[ \overline{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_{i}\, \]
Question: Is the population mean \( \mu \) a parameter or an estimate? What about the sample mean?
Note that every observation carries equal weight (each counts exactly once in the sum), so a single outlier can strongly affect the mean. It is a very democratic statistic: equal representation!
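As a quick illustration of this sensitivity, the sketch below computes the sample mean directly from its definition and then swaps one value for an outlier. The vectors `y` and `y_out` are made-up data, not taken from the course.

```r
# Made-up data for illustration (not from the slides)
y <- c(2, 3, 4, 5, 6)

# Sample mean from the definition: sum of the observations divided by n
n <- length(y)
ybar <- sum(y) / n
ybar          # same value as mean(y)

# Replace the largest value with an extreme observation
y_out <- c(2, 3, 4, 5, 600)
mean(y_out)   # pulled far toward the single outlier
```

Only one of the five values changed, yet the mean moves well outside the range of the other four observations.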
Definition: The population median is the middle measurement of the set of all observations in the population, the measurement that partitions the ordered measurements into two halves.
Definition: The sample median is the middle measurement of the set of all observations in the sample.
How do you compute the median? W&S version:
Look at special cases such as \( n=4 \) and \( n=5 \)
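The two special cases can be worked through with a minimal sketch of the usual odd/even rule: sort the data, take the middle value when \( n \) is odd, and average the two middle values when \( n \) is even. The helper name `median_by_hand` is hypothetical and is only there to expose the steps; in practice `median()` does the same job.

```r
# Hypothetical helper (not from the slides): median by the usual odd/even rule
median_by_hand <- function(y) {
  y <- sort(y)                      # order the measurements
  n <- length(y)
  if (n %% 2 == 1) {
    y[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    (y[n / 2] + y[n / 2 + 1]) / 2   # even n: average of the two middle values
  }
}

median_by_hand(c(-100, -99, 0, 2, 15))   # n = 5: the 3rd ordered value
median_by_hand(c(-100, -99, 0, 2))       # n = 4: average of the 2nd and 3rd
```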
The median is the middle measurement of the distribution (a color for each half of the distribution). The mean is the center of gravity, the point at which the frequency distribution would be balanced (if observations had weight).
Note: The mean and median have the same units as the variable.
v = c(-100,-99, 0, 2, 15)
median(v)
[1] 0
v = c(-100,-99, 0, 2, 15)
(v[1]+v[2])
[1] -199
(v[4]+v[5])
[1] 17
v = c(-100,-99, 0, 2, 15)
(m = mean(v))
[1] -36.4
(d = v-m)
[1] -63.6 -62.6 36.4 38.4 51.4
v = c(-100,-99, 0, 2, 15)
(d[1]+d[2])
[1] -126.2
(d[3]+d[4]+d[5])
[1] 126.2
Definition: The population variance \( \sigma^{2} \) is the average of the squared deviations of all observations from the population mean, calculated as:
\[ \sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\mu)^2 \]
Definition: The sample variance \( s^{2} \) is the sum of the squared deviations of the sample observations from the sample mean, divided by \( n-1 \):
\[ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^2 \]
Question: Why \( n-1 \)?
Answer: Dividing by \( n-1 \) rather than \( n \) makes \( s^{2} \) an unbiased estimator of \( \sigma^{2} \).
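One way to see this without any algebra is to enumerate every possible sample from a tiny population and average the resulting sample variances. The sketch below assumes a made-up population of three values and samples of size \( n = 2 \) drawn with replacement; none of these numbers come from the course.

```r
# Tiny made-up population (an assumption for illustration)
pop <- c(1, 2, 3)
sigma2 <- mean((pop - mean(pop))^2)   # population variance, divides by N

# Every possible ordered sample of size n = 2, drawn with replacement
samples <- as.matrix(expand.grid(y1 = pop, y2 = pop))

# Sample variance with the n - 1 divisor (this is what var() uses)
s2_n_minus_1 <- apply(samples, 1, var)

# Same sum of squares, but divided by n instead
s2_n <- apply(samples, 1, function(y) mean((y - mean(y))^2))

sigma2               # 2/3
mean(s2_n_minus_1)   # equals sigma2: right on average
mean(s2_n)           # 1/3: too small on average
```

Averaged over all nine equally likely samples, the \( n-1 \) version recovers \( \sigma^{2} \) exactly, while the \( n \) version underestimates it.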
Definition: The population standard deviation \( \sigma \) is the square root of the population variance:
\[ \sigma = \sqrt{\sigma^{2}} \]
Definition: The sample standard deviation \( s \) is the square root of the sample variance:
\[ s = \sqrt{s^{2}} \]
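The definitions of \( s^{2} \) and \( s \) can be checked step by step in R. This is a minimal sketch that reuses the vector `v` from the earlier mean and median example and compares the hand computation with the built-in `var()` and `sd()`.

```r
v <- c(-100, -99, 0, 2, 15)   # same vector as in the earlier example
n <- length(v)

d <- v - mean(v)              # deviations from the sample mean
s2 <- sum(d^2) / (n - 1)      # sample variance from the definition
s <- sqrt(s2)                 # sample standard deviation

s2   # agrees with var(v)
s    # agrees with sd(v)
```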
Note #1: \( s \) is in general a biased estimator of \( \sigma \). The bias gets smaller as the sample size gets larger.
Note #2: \( s \) and \( \sigma \) have the same units as the random variable.
Note #3: If the frequency distribution is bell shaped, then about two-thirds (68%) of the observations will lie within one standard deviation of the mean, and about 95% will lie within two standard deviations of the mean.
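These percentages can be recovered from the normal distribution, assuming "bell shaped" is taken to mean exactly normal (real data will only match approximately). The helper name `within_k_sd` is hypothetical.

```r
# Proportion of a normal distribution within k standard deviations of the mean
# (hypothetical helper; assumes an exactly normal distribution)
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

within_k_sd(1)      # about 0.68
within_k_sd(2)      # slightly more than 0.95
within_k_sd(1.96)   # about 0.95
```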
Definition: The interquartile range \( IQR \) is the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.
Spiders with huge pedipalps, copulatory organs that make up about 10% of a male's mass.
*If a whisker would extend past the maximum or minimum of the data, the whisker is drawn at the maximum or minimum of the data instead.
Heuristic #1: The measures of location (mean and median) and of spread (standard deviation and interquartile range) tend to give similar information when the frequency distribution is symmetric and unimodal (i.e. bell shaped).
Heuristic #2: The information starts to differ when the distribution is strongly skewed or there are extreme observations. The median and IQR tend to be better descriptors of the ‘typical’ observation and of the spread of the main part of the distribution. However, the mean and standard deviation still reflect the distribution of the data as a whole and have better mathematical properties.
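A small comparison makes the second heuristic concrete. The sketch below uses made-up data (`y`, and the same data with one extreme value, `y_out`) and is only meant to show the direction of the effect.

```r
y <- c(2, 3, 4, 5, 6)          # made-up, symmetric data
y_out <- c(2, 3, 4, 5, 600)    # same data with one extreme observation

# Location: the mean moves a lot, the median does not
c(mean = mean(y), median = median(y))
c(mean = mean(y_out), median = median(y_out))

# Spread: the standard deviation inflates, the IQR does not
c(sd = sd(y), IQR = IQR(y))
c(sd = sd(y_out), IQR = IQR(y_out))
```

For the symmetric data the mean and median coincide; with the extreme observation only the mean and standard deviation react.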
Because the standard deviation often scales with the mean in biological data, it can be more informative to look at the coefficient of variation.
Definition: The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean:
\[ CV = \frac{s}{\overline{Y}}\times 100 \% \]
In other words, the CV answers the question “How much variation is there relative to the mean?”
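As a rough sketch, the CV can be computed from `sd()` and `mean()`. Base R has no built-in CV function, so the helper `cv` below is hypothetical, and the mass values are made up for illustration.

```r
# Made-up measurements expressed in two different units (illustrative only)
mass_g <- c(10, 12, 15, 18, 20)
mass_mg <- mass_g * 1000

# Hypothetical helper: standard deviation as a percentage of the mean
cv <- function(y) sd(y) / mean(y) * 100

sd(mass_g)
sd(mass_mg)    # 1000 times larger, purely because of the units
cv(mass_g)
cv(mass_mg)    # identical: the CV does not depend on the units
```

Because the CV is a ratio of two quantities with the same units, it allows variation to be compared between variables measured on very different scales.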
| Measures | R commands |
|---|---|
| \( \overline{Y} \) | mean |
| \( s^2 \) | var |
| \( s \) | sd |
| \( IQR \) | IQR\( ^* \) |
| Multiple | summary |
\( ^* \) Note that R offers several algorithms for the IQR, because there are several ways to calculate quantiles (for curious souls, see ?quantile). The default in R is type=7; to match the algorithm in W&S, use IQR(___, type=2). For the HW, either version is acceptable.
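The difference between the two algorithms is easy to see on a small data set. The vector `y` below is made up and was chosen so that the two types disagree.

```r
y <- 1:8   # made-up data where the two algorithms disagree

quantile(y, c(0.25, 0.75))             # default, type = 7
quantile(y, c(0.25, 0.75), type = 2)

IQR(y)             # 3.5 with the default type = 7
IQR(y, type = 2)   # 4 with type = 2
```

With larger samples the two versions usually differ only slightly.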
summary(mydata)
breadth
Min. : 1.00
1st Qu.: 3.00
Median : 8.00
Mean :11.88
3rd Qu.:17.00
Max. :62.00
From this output, the IQR is the 3rd quartile minus the 1st quartile: \( 17-3 = 14 \).