“While nothing is more uncertain than a single life, nothing is more certain than the average duration of a thousand lives.”
- Elizur Wright
Distributions
Definition: The frequency distribution of a variable is the number of occurrences of all values of that variable in the data.
Definition: The relative frequency distribution of a variable is the fraction of occurrences of all values of that variable in the data or population.
These definitions apply to both continuous and discrete variables.
Frequency = Number
Relative frequency = Fraction (proportion)
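As a minimal sketch of the two definitions above, the following R snippet tabulates a small made-up categorical variable. The vector `x` is hypothetical and is not part of the lecture data.

```r
# Hypothetical categorical observations (not from the lecture data)
x <- c("a", "b", "a", "c", "a", "b")

freq <- table(x)               # frequency = number of occurrences of each value
rel_freq <- prop.table(freq)   # relative frequency = fraction of occurrences

freq
rel_freq
sum(rel_freq)                  # the fractions add up to 1
```

Here `table` counts each distinct value, and `prop.table` divides those counts by the total number of observations.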
Distributions
Question:What type of plot represents the frequency (relative frequency) distribution for a discrete variable?
Answer:Bar plot
Definition: A bar plot uses the height of rectangular bars to display the frequency distribution (or relative frequency distribution) of a categorical variable.
i.e. height of bars = number or proportion
Distributions - Bar plot
Death by tiger
Distributions - Bar plot
Question: What type of plot represents the frequency distribution for a continuous variable?
Answer: Histogram (which is still a bar plot, actually)
Definition: A histogram for a frequency distribution uses the height of rectangular bars to display the frequency distribution of a numerical variable.
Definition: A histogram for a relative frequency distribution uses the area of rectangular bars to display the relative frequency distribution of a numerical variable.
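To see the difference between height as a count and area as a proportion, here is a small sketch using made-up masses. The vector `mass` is hypothetical, and `plot = FALSE` is used only so that the binning can be inspected without drawing.

```r
# Hypothetical body masses in kg (not the salmon data from the lecture)
mass <- c(1.2, 1.7, 1.8, 2.1, 2.4, 2.6, 2.9, 3.3)

# plot = FALSE returns the binning without drawing anything
h <- hist(mass, breaks = seq(1, 4, by = 0.5), right = FALSE, plot = FALSE)

h$counts                                  # bar heights for a frequency histogram
h$density                                 # bar heights for a relative frequency histogram
bar_area <- h$density * diff(h$breaks)    # area = height x width
sum(bar_area)                             # the bar areas add up to 1
```

Each element of `bar_area` is the proportion of observations falling in that bin.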
Distributions
Three different histograms that depict the body mass of 228 female sockeye salmon
Question: What are the explanatory and response variables?
histObj <- hist(salmonSizeData$massKg, right = FALSE, breaks = seq(1, 4, by = 0.5), col = "firebrick")
seq(1, 4, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
Distributions - Histogram
Plot in a histogram:
Distributions - Histogram
Question: What would the height of the second bar from the left be for a relative frequency distribution, given that its current height (frequency) is 136 and we have 228 fish?
Distributions - Histogram
\[ \text{Area} = \text{Proportion} \]
\[ \text{Area} = \text{Height} \times \text{Width} \]
\[ \text{Proportion} = \text{Height} \times 0.5 \]
\[ 136/228 = \text{Height} \times 0.5 \]
\[ \text{Height} = 2\times 136/228 \]
\[ \text{Height} \approx 1.193 \]
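The same arithmetic can be checked in R. The variable names below are made up for illustration; the numbers (136 fish in the bin, 228 fish in total, bin width 0.5) come from the slides.

```r
count_in_bin <- 136    # frequency of the second bar
n_fish       <- 228    # total number of fish
bin_width    <- 0.5    # width of each bin in kg

proportion <- count_in_bin / n_fish    # area of the bar
height     <- proportion / bin_width   # height = area / width
height                                 # about 1.193
```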
Distributions - Histogram
Question: What happens with a smaller bin width (say, a width of 0.1)?
hist(salmonSizeData$massKg, right = FALSE, breaks = seq(1, 4, by = 0.1), col = "firebrick", freq = FALSE)
Measures of central tendency - Arithmetic mean
Definition: The population mean \(\mu\) is the sum of all the observations in the population divided by \(N\), the number of observations in the population (assuming it is finite, for now). \[\mu = \frac{1}{N}\sum_{i=1}^{N}Y_{i}\]
Measures of central tendency - Arithmetic mean
Definition: The sample mean \(\overline{Y}\) is the sum of all the observations in the sample divided by \(n\), the number of sample observations. \[\overline{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_{i}\]
Measures of central tendency - Arithmetic mean
Question: Is the population mean \(\mu\) a parameter or an estimate? What about the sample mean?
Note that every observation has equal weight (i.e. \(\frac{1}{n}\)), so any outliers can strongly affect the mean. It is a very democratic statistic - equal representation!
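A quick sketch of how one outlier pulls the mean, using a made-up sample. The vector `y` and the outlier value of 100 are hypothetical.

```r
y <- c(2, 4, 4, 6, 9)            # hypothetical sample

y_bar <- sum(y) / length(y)      # sample mean from the definition
y_bar                            # 5
mean(y)                          # built-in version gives the same value

y_out <- c(y, 100)               # add one hypothetical outlier
mean(y_out)                      # 125 / 6, about 20.8
```

One extreme observation out of six moves the mean from 5 to about 20.8, because every observation carries the same weight.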
Measures of central tendency - Median
Definition: The population median is the middle measurement of the set of all observations in the population (again, assuming for now that the population is finite).
Definition: The sample median is the middle measurement of the set of all observations in the sample.
Measures of central tendency - Median
How do you compute the median? W&S version:
First, sort the data from smallest to largest.
We then have two conditions:
If the number of observations is odd, then we have \[ Median = Y_{(n+1)/2} \]
If the number of observations is even, then we have \[ Median = \left[Y_{n/2} + Y_{(n/2)+1}\right]/2 \]
Look at special cases of \(n=3\) and \(n=4\)!!!
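The two cases above can be sketched as a small R function. The name `median_ws` is made up for this illustration; in practice the built-in `median` does the same job.

```r
# Median following the odd/even rule above (function name is hypothetical)
median_ws <- function(y) {
  y <- sort(y)                        # sort from smallest to largest
  n <- length(y)
  if (n %% 2 == 1) {
    y[(n + 1) / 2]                    # odd n: the middle observation
  } else {
    (y[n / 2] + y[n / 2 + 1]) / 2     # even n: average of the two middle observations
  }
}

median_ws(c(5, 1, 3))       # n = 3: sorted 1, 3, 5, so the median is 3
median_ws(c(5, 1, 3, 9))    # n = 4: sorted 1, 3, 5, 9, so the median is (3 + 5) / 2 = 4
```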
Mean vs. Median
The median is the middle measurement of the distribution (different colors represent the two halves of the distribution). The mean is the center of gravity, the point at which the frequency distribution would be balanced (if observations had weight).
Note: The mean and median have the same units as the variable!!!
Measures of variability - Variance
Definition: The population variance \(\sigma^{2}\) is the average of the squared deviations of all observations from the population mean, and assuming a finite population, we have \[\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\mu)^2\]
Measures of variability - Variance
Definition: The sample variance \(s^{2}\) is the average of the squared deviations from the sample mean, \[s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^2\]
Question: Why \(n-1\)??
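Answer: Dividing by \(n-1\) instead of \(n\) makes \(s^{2}\) an unbiased estimate of \(\sigma^{2}\)! Deviations are measured from \(\overline{Y}\) rather than from \(\mu\), so dividing by \(n\) would underestimate the variance on average.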
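As a small check of the sample variance formula, the sketch below computes \(s^{2}\) by hand and compares it with R's `var`, which also divides by \(n-1\). The sample `y` is hypothetical.

```r
y <- c(2, 4, 4, 6, 9)                 # hypothetical sample

n     <- length(y)
y_bar <- mean(y)                      # 5
ss    <- sum((y - y_bar)^2)           # sum of squared deviations: 9 + 1 + 1 + 1 + 16 = 28

s2_manual <- ss / (n - 1)             # 28 / 4 = 7
s2_manual
var(y)                                # same value: var() uses the n - 1 denominator
```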
Standard deviation
Definition: The population standard deviation \(\sigma\) is the square root of the population variance, \[\sigma = \sqrt{\sigma^{2}}\]
Definition: The sample standard deviation \(s\) is the square root of the sample variance, \[s = \sqrt{s^{2}}\]
Note #1: \(s\) is in general a biased estimator of \(\sigma\). The bias gets smaller as the sample size gets larger.
Note #2: \(s\) and \(\sigma\) have the same units as the random variable!!!
Standard deviation
Note #3: If the frequency distribution is bell-shaped, then about two-thirds (67%) of the observations will lie within one standard deviation of the mean, and about 95% of the observations will lie within two standard deviations of the mean.
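This rule of thumb can be checked with a quick simulation. The sketch below draws hypothetical bell-shaped data with `rnorm`; the seed and sample size are arbitrary choices, and the exact proportions will vary slightly from sample to sample.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
y <- rnorm(10000)                    # hypothetical bell-shaped sample

y_bar <- mean(y)
s     <- sd(y)                       # same as sqrt(var(y))

within_1sd <- mean(abs(y - y_bar) <= s)       # fraction within one sd of the mean
within_2sd <- mean(abs(y - y_bar) <= 2 * s)   # fraction within two sd of the mean

within_1sd                           # roughly two-thirds
within_2sd                           # roughly 95%
```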
Measures of variability - Interquartile range
Definition: The interquartile range \(IQR\) is the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.
Measures of variability - Interquartile range
Spiders with huge pedipalps, copulatory organs that make up about 10% of a male’s mass.
Interquartile range
Middle bar of box is median
Bottom of box is first quartile
Top of box is third quartile
Whiskers extend to the most extreme observations that lie within \(1.5\times IQR\) of the box\(^{*}\)
Data outside whiskers (extreme values) are plotted as dots
\(^{*}\) A whisker never extends past the max or min of the data; if there are no extreme values on that side, the whisker ends at the max or min of the data
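The numbers behind a box plot can be inspected without drawing it. The sketch below uses `boxplot.stats` on a made-up sample; note that this function places the box at the "hinges" from `fivenum`, which can differ slightly from quartiles computed by `quantile`.

```r
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # hypothetical sample with one extreme value

b <- boxplot.stats(y)                   # default coef = 1.5, i.e. 1.5 x box length

b$stats   # lower whisker, bottom of box, median, top of box, upper whisker
b$out     # extreme values that would be plotted as dots
```

For this sample the box runs from 3 to 8, so the whiskers may reach at most 7.5 beyond it. The upper whisker stops at 9, the largest observation inside that limit, and 30 is flagged as an extreme value.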
Standard deviation or interquartile range?
Heuristic #1: The location (mean and median) and spread (interquartile range and standard deviation) give similar information when the frequency distribution is symmetric and unimodal (i.e. bell shaped).
Heuristic #2: The mean and standard deviation become less informative when the distribution is strongly skewed or there are extreme observations.
Coefficient of variation
In biology, the standard deviation often scales with the mean, so it can be more informative to look at the coefficient of variation.
Definition: The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean: \[CV = \frac{s}{\overline{Y}}\times 100\%\]
In other words, the CV answers the question “How much variation is there relative to the mean?”
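R has no built-in CV function in its base packages, so a minimal sketch computes it directly from `sd` and `mean`. The sample `y` is hypothetical.

```r
y <- c(2, 4, 4, 6, 9)              # hypothetical sample

cv <- sd(y) / mean(y) * 100        # standard deviation as a percentage of the mean
cv                                 # about 52.9 (percent)
```

Because the units of `sd(y)` and `mean(y)` cancel, the CV is unitless and can be compared across variables measured on different scales.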
Moving on…
Make sure you read the book for the following discussions:
How to compute a mean and standard deviation from a frequency table (Why is this important to know?)
Rounding rules for displaying tables and statistics
Effect of changing measurement scale
Cumulative frequency distributions (we will cover this later as well)
My point here is that you are responsible for all book material, even if we don’t cover it in lecture!
Describing data in R
Measure | R command
\(\overline{Y}\) | mean
\(s^2\) | var
\(s\) | sd
\(IQR\) | IQR\(^*\)
Multiple | summary
\(^*\) Note that R offers several algorithms for IQR, because there are different ways to calculate quantiles (for curious souls, see ?quantile). The default in R is type=7. To match the algorithm in W&S, you should use IQR(___, type=5). For the HW, either version is acceptable.
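To illustrate how the quantile algorithm changes the answer, here is a small sketch comparing the default (type 7) with type 5 on a made-up vector. The data are hypothetical.

```r
y <- 1:10                          # hypothetical data

iqr_default <- IQR(y)              # type = 7 is R's default
iqr_type5   <- IQR(y, type = 5)    # the type suggested above for matching W&S

iqr_default                        # 7.75 - 3.25 = 4.5
iqr_type5                          # 8 - 3 = 5

quantile(y, c(0.25, 0.75), type = 5)   # the quartiles behind the type 5 answer
```

The two versions differ only in how they interpolate between ordered observations, so the gap shrinks as the sample gets larger.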
Describing data in R
summary(mydata)
breadth
Min. : 1.00
1st Qu.: 3.00
Median : 8.00
Mean :11.88
3rd Qu.:17.00
Max. :62.00