Descriptive Statistics

Alban Guillaumet, Troy University

“While nothing is more uncertain than a single life, nothing is more certain than the average duration of a thousand lives.”

- Elizur Wright

Objectives

  • Measures of central tendency

  • Measures of variability

Measures of central tendency - Arithmetic mean

Definition: The population mean \( \mu \) is the sum of all the observations in the population divided by \( N \), the number of observations in the population.
\[ \mu = \frac{1}{N}\sum_{i=1}^{N}Y_{i}\, \]

Measures of central tendency - Arithmetic mean

Definition: The sample mean \( \overline{Y} \) is the sum of all the observations in the sample divided by \( n \), the number of sample observations.
\[ \overline{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_{i}\, \]
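
As a minimal sketch of the formula above, the sample mean can be computed by hand and compared with R's built-in mean. The vector y is illustrative data only (the same five values used in the mean vs. median slides).

```r
# Illustrative sample (assumed data, not a real data set)
y <- c(-100, -99, 0, 2, 15)

n <- length(y)        # number of sample observations
y_bar <- sum(y) / n   # sum of the observations divided by n

y_bar                 # sample mean from the formula
mean(y)               # built-in function, should agree
```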

Measures of central tendency - Arithmetic mean

Question: Is the population mean \( \mu \) a parameter or an estimate? What about the sample mean?

Note that every observation has equal weight (i.e. 1), so any outliers can strongly affect the mean. It is a very democratic statistic - equal representation!

Measures of central tendency - Median

Definition: The population median is the middle measurement of the set of all observations in the population, the measurement that partitions the ordered measurements into two halves.

Definition: The sample median is the middle measurement of the set of all observations in the sample.

Measures of central tendency - Median

How do you compute the median? W&S version:

  • First, sort the data from smallest to largest.
  • We then have two conditions:
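    • If the number of observations \( n \) is odd, the median is the single middle value of the sorted data: \[ \mathrm{Median} = Y_{(n+1)/2} \]
    • If \( n \) is even, the median is the average of the two middle values of the sorted data: \[ \mathrm{Median} = \left[Y_{n/2} + Y_{(n/2)+1}\right]/2 \]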

Look at special cases such as \( n=4 \) and \( n=5 \)
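
The two-case rule can be sketched as a small R function and tried on \( n=5 \) and \( n=4 \). The function name median_by_rule and the two vectors are hypothetical, chosen only for illustration; this is not how R implements median internally.

```r
# Median following the sort-then-pick rule (illustrative sketch)
median_by_rule <- function(y) {
  y <- sort(y)                        # first, sort from smallest to largest
  n <- length(y)
  if (n %% 2 == 1) {
    y[(n + 1) / 2]                    # odd n: the single middle value
  } else {
    (y[n / 2] + y[n / 2 + 1]) / 2     # even n: average of the two middle values
  }
}

odd  <- c(15, -100, 2, 0, -99)        # n = 5, sorted: -100 -99 0 2 15
even <- c(15, -100, 2, 0)             # n = 4, sorted: -100 0 2 15

median_by_rule(odd)                   # 0, the 3rd sorted value
median_by_rule(even)                  # 1, the average of 0 and 2
```

Both results agree with median(odd) and median(even).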

Measures of central tendency - Mean vs. Median

The median is the middle measurement of the distribution (a color for each half of the distribution). The mean is the center of gravity, the point at which the frequency distribution would be balanced (if observations had weight).

Note: The mean and median have the same units as the variable.

Measures of central tendency - Mean vs. Median

v = c(-100,-99, 0, 2, 15)
median(v)
[1] 0

Measures of central tendency - Mean vs. Median

v = c(-100,-99, 0, 2, 15)
(v[1]+v[2])   # sum of the two observations below the median
[1] -199
(v[4]+v[5])   # sum of the two observations above the median
[1] 17

Measures of central tendency - Mean vs. Median

v = c(-100,-99, 0, 2, 15)
(m = mean(v))   # sample mean
[1] -36.4
(d = v-m)       # deviation of each observation from the mean
[1] -63.6 -62.6  36.4  38.4  51.4

Measures of central tendency - Mean vs. Median

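v = c(-100,-99, 0, 2, 15)
d = v - mean(v)    # deviations from the mean
(d[1]+d[2])        # total of the negative deviations
[1] -126.2
(d[3]+d[4]+d[5])   # total of the positive deviations
[1] 126.2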
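
The balance-point idea can be checked directly: deviations from the mean cancel out, while deviations from the median generally do not. A short sketch, where d_mean and d_median are names introduced here for illustration:

```r
v <- c(-100, -99, 0, 2, 15)

d_mean   <- v - mean(v)     # deviations from the mean
d_median <- v - median(v)   # deviations from the median

sum(d_mean)     # zero, up to floating-point rounding
sum(d_median)   # -182: the two sides do not balance around the median
```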

Measures of variability - Variance

Definition: The population variance \( \sigma^{2} \) is the average of the squared deviations of all observations from the population mean, calculated as:
\[ \sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\mu)^2 \]

Measures of variability - Variance

Definition: The sample variance \( s^{2} \) is the sum of the squared deviations from the sample mean, divided by \( n-1 \):
\[ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^2 \]

Question: Why \( n-1 \)?

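Answer: Dividing by \( n-1 \) instead of \( n \) makes \( s^{2} \) an unbiased estimate of \( \sigma^{2} \). The deviations are taken from \( \overline{Y} \), which is itself computed from the sample, so dividing their squared sum by \( n \) would underestimate \( \sigma^{2} \) on average.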
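
One way to see the effect of \( n-1 \) is to list every possible sample from a tiny population. The sketch below assumes a made-up population of four values and ordered samples of size 2 drawn with replacement; all object names are illustrative.

```r
# Tiny hypothetical population, so every sample can be enumerated
pop <- c(1, 2, 3, 4)
sigma2 <- mean((pop - mean(pop))^2)    # population variance (divide by N)

# All 16 ordered samples of size 2, drawn with replacement
samples <- expand.grid(y1 = pop, y2 = pop)

# Sample variance with n - 1 (what var() uses) and with n
s2_n_minus_1 <- apply(samples, 1, var)
s2_n <- apply(samples, 1, function(y) mean((y - mean(y))^2))

sigma2               # 1.25
mean(s2_n_minus_1)   # 1.25: correct on average
mean(s2_n)           # 0.625: too small on average
```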

Measures of variability - Standard deviation

Definition: The population standard deviation \( \sigma \) is the square root of the population variance,
\[ \sigma = \sqrt{\sigma^{2}} \]

Definition: The sample standard deviation \( s \) is the square root of the sample variance,
\[ s = \sqrt{s^{2}} \]

Note #1: \( s \) is in general a biased estimator of \( \sigma \). The bias gets smaller as the sample size gets larger.

Note #2: \( s \) and \( \sigma \) have the same units as the random variable.

Measures of variability - Standard deviation

Note #3: If the frequency distribution is bell shaped, then about two-thirds (68%) of the observations will lie within one standard deviation of the mean, and about 95% will lie within two standard deviations of the mean.
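
For an exactly normal (bell-shaped) distribution, these proportions can be computed with pnorm. This is a sketch under that assumption; real data will only approximately follow it.

```r
# Proportion of a normal distribution within k standard deviations of the mean
within_1sd <- pnorm(1) - pnorm(-1)
within_2sd <- pnorm(2) - pnorm(-2)

round(within_1sd, 3)    # 0.683
round(within_2sd, 3)    # 0.954

# Number of standard deviations that captures exactly 95%
round(qnorm(0.975), 2)  # 1.96, hence "about two"
```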

Measures of variability - Interquartile range

Definition: The interquartile range \( IQR \) is the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.
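
A short sketch of the definition, using R's default quantile algorithm and the same illustrative five values as earlier:

```r
v <- c(-100, -99, 0, 2, 15)

q <- quantile(v, c(0.25, 0.75))   # first and third quartiles (default type = 7)
q                                 # 25%: -99, 75%: 2

iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand                       # 101
IQR(v)                            # same value from the built-in function
```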

Measures of variability - Interquartile range

Spiders with huge pedipalps, copulatory organs that make up about 10% of a male's mass.

Box plot

Measures of variability - Interquartile range

  • Middle bar of box is median
  • Bottom of box is first quartile
  • Top of box is third quartile
  • Whiskers extend up to \( 1.5\times IQR \) above and below the box*
  • Data outside the whiskers (extreme values) are plotted as dots

*A whisker never extends past the data: if \( 1.5\times IQR \) would reach beyond the max or min of the data, the whisker stops at the max or min.
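
The numbers behind a box plot can be inspected with boxplot.stats. The vector x is made up for illustration. Note that R's box plot uses "hinges" for the box edges, which can differ slightly from the quartiles returned by quantile.

```r
# Hypothetical data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

b <- boxplot.stats(x)   # default whisker coefficient is 1.5

b$stats   # lower whisker, box bottom, median, box top, upper whisker
          # 1.0 3.0 5.5 8.0 9.0
b$out     # 30: beyond 8 + 1.5 * 5 = 15.5, so it is drawn as a dot

# boxplot(x) draws the corresponding figure
```

Here the upper whisker stops at 9, the largest observation within 1.5 box lengths of the top of the box, and the lower whisker stops at the minimum.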

Which one to use?

Heuristic #1: The measures of location (mean and median) and spread (standard deviation and interquartile range) tend to give similar information when the frequency distribution is symmetric and unimodal (i.e. bell shaped).

Heuristic #2: The information starts to differ when the distribution is strongly skewed or there are extreme observations. The median and IQR tend to be better descriptors of the ‘typical’ observation and spread of the main part of the distribution. However, the mean and standard deviation still reflect the distribution of the data as a whole and have better mathematical properties.
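
A small sketch of Heuristic #2: replacing one value with an extreme observation moves the mean and standard deviation a lot, but leaves the median and IQR unchanged. Both vectors are made up for illustration.

```r
x     <- c(1, 2, 3, 4, 5)    # symmetric, no extreme values
x_out <- c(1, 2, 3, 4, 50)   # same data with one extreme observation

mean(x);   mean(x_out)       # 3 and 12
median(x); median(x_out)     # 3 and 3
sd(x);     sd(x_out)         # about 1.58 and about 21.27
IQR(x);    IQR(x_out)        # 2 and 2
```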

Coefficient of variation

Because the standard deviation often scales with the mean in biology, it can be more informative to look at the coefficient of variation.

Definition: The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean:
\[ CV = \frac{s}{\overline{Y}}\times 100\% \]

In other words, the CV answers the question “How much variation is there relative to the mean?”
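
A minimal sketch of the CV in R. The helper cv and the two sets of measurements are hypothetical: the second group is simply the first multiplied by 100, so its standard deviation is 100 times larger while its relative variation is the same.

```r
# Coefficient of variation as a percentage (illustrative helper, not a base R function)
cv <- function(y) sd(y) / mean(y) * 100

small <- c(10, 12, 14)         # made-up measurements
large <- c(1000, 1200, 1400)   # same values, 100 times larger

sd(small); sd(large)           # 2 and 200
cv(small); cv(large)           # both about 16.7 (%)
```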

Describing data in R

Measures              R commands
\( \overline{Y} \)    mean
\( s^2 \)             var
\( s \)               sd
\( IQR \)             IQR\( ^* \)
Multiple              summary

\( ^* \) Note that R can compute the IQR with different algorithms, because there are different ways to calculate quantiles (for curious souls, see ?quantile). The default in R is type=7. To match the algorithm in W&S, you should use IQR(___, type=2). For the HW, either version is acceptable.
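
The two algorithms can give different answers on the same data. A short sketch, with x chosen only for illustration:

```r
x <- 1:10   # illustrative data

iqr_default <- IQR(x)            # type = 7, the R default
iqr_type2   <- IQR(x, type = 2)  # the type suggested above to match W&S

iqr_default   # 4.5 (quartiles 3.25 and 7.75)
iqr_type2     # 5   (quartiles 3 and 8)
```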

Describing data in R

summary(mydata)
    breadth     
 Min.   : 1.00  
 1st Qu.: 3.00  
 Median : 8.00  
 Mean   :11.88  
 3rd Qu.:17.00  
 Max.   :62.00  

From this output, the IQR is the third quartile minus the first quartile: \( 17-3 = 14 \).