M. Drew LaMar
September 17, 2018
“Absolute certainty is a privilege of uneducated minds and fanatics. It is, for scientific folk, an unattainable ideal.”
- Cassius J. Keyser
John Pullinger
Quote: Statisticians can now amass more data more quickly than ever. This could help us to make decisions based on real numbers, not prejudice.
Quote: Of course decisions are made on the basis of emotions and beliefs as well as science. Those of us who work in the world of data need some humility in what we claim. But good evidence does matter.
Quote: The models being used today are opaque, unregulated, and uncontestable, even when they’re wrong. Most troubling, they reinforce discrimination: If a poor student can’t get a loan because a lending model deems him too risky (by virtue of his zip code), he’s then cut off from the kind of education that could pull him out of poverty, and a vicious spiral ensues.
apply family
The median is the middle measurement of the distribution (different colors represent the two halves of the distribution). The mean is the center of gravity, the point at which the frequency distribution would be balanced (if observations had weight).
Note: The mean and median have the same units as the variable!!!
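As a minimal sketch of these two measures of location, the R snippet below computes both on a small made-up vector. The values in y are hypothetical and chosen only to show how one large observation pulls the mean above the median.

```r
# Hypothetical measurements (made up for illustration), e.g. lengths in mm
y <- c(2, 3, 3, 4, 5, 7, 25)

y_mean   <- mean(y)    # center of gravity: sum(y) / length(y)
y_median <- median(y)  # middle value of the sorted measurements

y_mean
y_median
```

Both results are in the same units as y.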
Definition: The population variance \( \sigma^{2} \) is the average of the squared deviations of all observations from the population mean. For a finite population of size \( N \),
\[ \sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\mu)^2 \]
Definition: The sample variance \( s^{2} \) is the sum of the squared deviations from the sample mean, divided by \( n-1 \),
\[ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^2 \]
Question: Why \( n-1 \)?
Answer: Dividing by \( n-1 \) instead of \( n \) makes \( s^{2} \) an unbiased estimate of \( \sigma^{2} \).
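One way to see the unbiasedness claim without any simulation noise is to enumerate every possible sample from a tiny population. The sketch below is an illustration under stated assumptions, not part of the original slides: the three-value population and the choice of samples of size 2 drawn with replacement are both hypothetical.

```r
# Hypothetical finite population (made up for illustration)
pop <- c(1, 2, 3)
N <- length(pop)

# Population variance: divide by N
sigma2 <- sum((pop - mean(pop))^2) / N

# Every ordered sample of size n = 2 drawn with replacement (9 samples)
samples <- expand.grid(y1 = pop, y2 = pop)

# Sample variance of each row, dividing by n - 1 (this is what var() does)
s2 <- apply(samples, 1, var)

# Same sum of squares, but dividing by n instead
s2_n <- apply(samples, 1, function(y) sum((y - mean(y))^2) / length(y))

sigma2
mean(s2)    # average of s^2 over all possible samples
mean(s2_n)  # average of the divide-by-n version
```

Averaged over all nine samples, the \( n-1 \) version matches \( \sigma^{2} \), while the divide-by-\( n \) version comes out too small.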
Definition: The population standard deviation \( \sigma \) is the square root of the population variance,
\[ \sigma = \sqrt{\sigma^{2}} \]
Definition: The sample standard deviation \( s \) is the square root of the sample variance,
\[ s = \sqrt{s^{2}} \]
Note #1: \( s \) is in general a biased estimator of \( \sigma \). The bias gets smaller as the sample size gets larger.
Note #2: \( s \) and \( \sigma \) have the same units as the random variable!!!
Note #3: If the frequency distribution is bell shaped, then about two-thirds (67%) of the observations will lie within one standard deviation of the mean, and about 95% of the observations will lie within two standard deviations of the mean.
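The rule of thumb in Note #3 can be checked on simulated data. This is only a sketch: the seed, sample size, mean, and standard deviation passed to rnorm are arbitrary assumptions, and the proportions will vary slightly from run to run if the seed is changed.

```r
set.seed(1)                            # arbitrary seed so the run is repeatable
y <- rnorm(10000, mean = 50, sd = 10)  # simulated bell-shaped data (hypothetical)

# Distance of each observation from the sample mean, in standard deviations
z <- abs(y - mean(y)) / sd(y)

within_1sd <- mean(z <= 1)  # proportion within one standard deviation
within_2sd <- mean(z <= 2)  # proportion within two standard deviations

within_1sd
within_2sd
```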
Definition: The interquartile range \( IQR \) is the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.
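The definition translates directly into R: take the first and third quartiles and subtract. The data below are hypothetical, and the sketch uses R's default quantile algorithm.

```r
# Hypothetical measurements (made up for illustration)
y <- c(1, 3, 4, 6, 8, 9, 12, 15)

# First and third quartiles (R's default quantile algorithm)
q <- quantile(y, probs = c(0.25, 0.75))

iqr_manual  <- unname(q[2] - q[1])  # third quartile minus first quartile
iqr_builtin <- IQR(y)               # same calculation via the built-in

iqr_manual
iqr_builtin
```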
Spiders with huge pedipalps, copulatory organs that make up about 10% of a male's mass.
\( ^{*} \) If a whisker would extend past the maximum or minimum of the data, the whisker stops at the maximum or minimum of the data.
Heuristic #1: The location (mean and median) and spread (interquartile range and standard deviation) give similar information when the frequency distribution is symmetric and unimodal (i.e. bell shaped).
Heuristic #2: The mean and standard deviation become less informative when the distribution is strongly skewed or there are extreme observations.
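To illustrate Heuristic #2, the sketch below adds a single extreme observation to a small symmetric data set and compares how much each summary moves. All values are hypothetical.

```r
# Symmetric hypothetical data, and the same data plus one extreme observation
y     <- c(2, 3, 4, 5, 6, 7, 8)
y_out <- c(y, 100)

# Location: the mean shifts a lot, the median barely moves
mean_shift   <- mean(y_out) - mean(y)
median_shift <- median(y_out) - median(y)

# Spread: the standard deviation inflates far more than the IQR
sd_ratio  <- sd(y_out) / sd(y)
iqr_ratio <- IQR(y_out) / IQR(y)

c(mean_shift = mean_shift, median_shift = median_shift)
c(sd_ratio = sd_ratio, iqr_ratio = iqr_ratio)
```

The median and interquartile range change little, which is why they are usually the more informative summaries for skewed data or data with extreme observations.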
In biology the standard deviation often scales with the mean, so it can be more informative to look at the coefficient of variation.
Definition: The coefficient of variation (CV) calculates the standard deviation as a percentage of the mean: \[ CV = \frac{s}{\bar{Y}}\times 100\% \]
In other words, the CV answers the question “How much variation is there relative to the mean?”
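Base R has no built-in CV function, so the sketch below defines a small helper. The name cv is hypothetical, and the data are made up. The second data set is the first multiplied by 10, to mimic a standard deviation that scales with the mean.

```r
# Hypothetical helper (not a base R function): CV as a percentage of the mean
cv <- function(y) sd(y) / mean(y) * 100

# Made-up measurements, and the same measurements on a 10x larger scale
small <- c(10, 12, 14, 16, 18)
large <- 10 * small

sd(small)
sd(large)   # ten times larger than sd(small)

cv(small)
cv(large)   # unchanged: variation relative to the mean is the same
```

Unlike \( s \), the CV is unitless, so it can be compared across groups whose means differ greatly.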
Make sure you read the book for the following discussions
Question: Why is this important to know?
My point here is that you are responsible for all book material, even if we don't cover it in lecture!
Measures | R commands |
---|---|
\( \overline{Y} \) | mean |
\( s^2 \) | var |
\( s \) | sd |
\( IQR \) | IQR \( ^* \) |
Multiple | summary |
\( ^* \) Note that IQR has different algorithms, because there are different ways to calculate quantiles (for curious souls, see ?quantile). To match the algorithm in W&S, you should use IQR(___, type=2). The default in R is type=7. For the HW, either version is acceptable.
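To see how the choice of algorithm can matter, the sketch below runs the commands from the table on a small hypothetical vector and compares the two IQR types mentioned in the footnote. The data are made up. On other data sets the two types may agree.

```r
# Hypothetical measurements (made up for illustration)
y <- c(1, 3, 4, 6, 8, 9, 12, 15)

mean(y)     # sample mean
var(y)      # sample variance (divides by n - 1)
sd(y)       # sample standard deviation
summary(y)  # several summaries at once (quartiles use type 7)

iqr_type7 <- IQR(y)            # R's default quantile algorithm
iqr_type2 <- IQR(y, type = 2)  # the type the footnote recommends for W&S

iqr_type7
iqr_type2
```

For these eight values the two types give different answers, because they place the first and third quartiles at different points between neighboring observations.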
summary(mydata)
breadth
Min. : 1.00
1st Qu.: 3.00
Median : 8.00
Mean :11.88
3rd Qu.:17.00
Max. :62.00
From the quartiles in this output, the IQR would be \( 17-3 = 14 \).