Descriptive Statistics

M. Drew LaMar
September 10, 2018

“While nothing is more uncertain than a single life, nothing is more certain than the average duration of a thousand lives.”

- Elizur Wright

Course Announcements

  • Reading Assignment for Wednesday - W&S, Chapter 4 (QUIZ)
  • Grab links from Blackboard for datasets for Ch. 2, #33 and #35
    • Chapter 2, #33: Bad csv file
    • Chapter 2, #35: Easier format for plotting (combined two categoricals into one)
  • Quick RStudio Tips & Tricks

Project Teaser (10% of grade)

Distributions

Definition: The frequency distribution of a variable is the number of occurrences of each value of that variable in the data.

Definition: The relative frequency distribution of a variable is the fraction of occurrences of each value of that variable in the data or population.

  • These definitions apply to both continuous and discrete variables.
  • Frequency = Number
  • Relative frequency = Fraction (proportion)
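As a minimal sketch of the two definitions, the snippet below tabulates a small categorical variable. The `activity` values are hypothetical and chosen only for illustration; they are not one of the course datasets.

```r
# Hypothetical observations of a categorical variable (illustration only)
activity <- c("walking", "fishing", "walking", "grass cutting", "walking", "fishing")

# Frequency distribution: number of occurrences of each value
freq <- table(activity)

# Relative frequency distribution: fraction of occurrences of each value
relFreq <- prop.table(freq)

print(freq)
print(relFreq)
```

The frequencies add up to the number of observations, and the relative frequencies add up to 1.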

Distributions

Question: What type of plot represents the frequency (relative frequency) distribution for a discrete variable?

Answer: Bar plot

Definition: A bar plot uses the height of rectangular bars to display the frequency distribution (or relative frequency distribution) of a categorical variable.

  • i.e. height of bars = number or proportion
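A bar plot of this kind could be drawn in base R roughly as follows. The `activity` vector is again hypothetical, and sorting the bars by frequency is a common convention rather than a requirement.

```r
# Hypothetical categorical data (illustration only)
activity <- c("walking", "fishing", "walking", "grass cutting", "walking", "fishing")

# Tabulate, then order categories from most to least frequent
freq <- sort(table(activity), decreasing = TRUE)

# Height of bars = number of occurrences
barMidpoints <- barplot(freq, ylab = "Frequency", col = "firebrick")

# Height of bars = proportion of occurrences
barplot(prop.table(freq), ylab = "Relative frequency", col = "firebrick")
```

`barplot()` returns the x-positions of the bar midpoints, which is handy if labels need to be added later.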

Distributions - Bar plot

Death by tiger

Distributions - Histogram

Question: What type of plot represents the frequency distribution for a continuous variable?

Answer: Histogram (which is still a bar plot, actually)

Definition: A histogram for a frequency distribution uses the height of rectangular bars to display the frequency distribution of a numerical variable.

Definition: A histogram for a relative frequency distribution uses the area of rectangular bars to display the relative frequency distribution of a numerical variable.
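The difference between "height = frequency" and "area = relative frequency" can be checked numerically. The sketch below uses simulated values (not the salmon data) and asks `hist()` for its computed bins without drawing anything.

```r
set.seed(1)
mass <- runif(200, min = 1, max = 4)  # simulated values, illustration only

# Compute the bins without plotting
h <- hist(mass, breaks = seq(1, 4, by = 0.5), right = FALSE, plot = FALSE)
binWidth <- diff(h$breaks)

# Frequency histogram: bar height = count in the bin
print(h$counts)

# Relative frequency histogram: bar area = proportion in the bin,
# so bar height = proportion / bin width
manualDensity <- h$counts / (sum(h$counts) * binWidth)
print(all.equal(h$density, manualDensity))

# The bar areas add up to 1
totalArea <- sum(h$density * binWidth)
print(totalArea)
```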

Distributions

Three different histograms that depict the body mass of 228 female sockeye salmon

Question: Which is the explanatory variable and which is the response variable?

Answer: Neither. A histogram shows the distribution of a single variable.

Distributions

Load and show the data:

salmonSizeData <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f2_5SalmonBodySize.csv"))
head(salmonSizeData)
  year   sex oceanAgeYears lengthMm massKg
1 1996 FALSE             3      513  3.090
2 1996 FALSE             3      513  2.909
3 1996 FALSE             3      525  3.056
4 1996 FALSE             3      501  2.690
5 1996 FALSE             3      513  2.876
6 1996 FALSE             3      501  2.978

Distributions - Histogram

Plot in a histogram:

histObj <- hist(salmonSizeData$massKg, 
                right = FALSE, 
                breaks = seq(1,4,by=0.5), 
                col = "firebrick")
seq(1,4,by=0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Distributions - Histogram

Question: What would the height of the second bar from the left be for a relative frequency distribution? (Note: its current height is 136, and we have 228 fish.)

Distributions - Histogram

\[ \text{Area} = \text{Proportion} \]

\[ \text{Area} = \text{Height} \times \text{Width} \]

\[ \text{Proportion} = \text{Height} \times 0.5 \]

\[ 136/228 = \text{Height} \times 0.5 \]

\[ \text{Height} = 2 \times 136/228 \]

\[ \text{Height} \approx 1.193 \]
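The same arithmetic can be reproduced in R. The variable names below are made up for this sketch; the numbers are the ones from the slide.

```r
binCount <- 136   # frequency of the second bin
nFish    <- 228   # total number of fish
binWidth <- 0.5   # width of each bin in kg

# Area = proportion, and area = height * width
proportion <- binCount / nFish
barHeight  <- proportion / binWidth
print(barHeight)
```

The printed value is about 1.193, so a bar height in a relative frequency histogram can exceed 1 as long as the area stays at or below 1.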

Distributions - Histogram

Question: What happens with smaller bin width (say width of 0.1)?

hist(salmonSizeData$massKg, 
     right = FALSE, 
     breaks = seq(1,4,by=0.1), 
     col = "firebrick", 
     freq=FALSE)
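To explore the question without the plot, the sketch below compares the two bin widths on simulated masses (an assumption for illustration, not the salmon data) and checks what changes and what does not.

```r
set.seed(42)
# Simulated body masses, clamped so every value falls inside [1, 4)
mass <- rnorm(228, mean = 2.5, sd = 0.3)
mass <- pmin(pmax(mass, 1), 3.99)

wide   <- hist(mass, breaks = seq(1, 4, by = 0.5), right = FALSE, plot = FALSE)
narrow <- hist(mass, breaks = seq(1, 4, by = 0.1), right = FALSE, plot = FALSE)

# Number of bins goes up as the bin width goes down
print(length(wide$counts))
print(length(narrow$counts))

# Total area of the relative frequency histogram stays at 1 in both cases
wideArea   <- sum(wide$density * diff(wide$breaks))
narrowArea <- sum(narrow$density * diff(narrow$breaks))
print(c(wideArea, narrowArea))
```

Narrower bins show more detail in the shape of the distribution, but each bin holds fewer observations, so the picture also gets noisier.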

Measures of central tendency - Arithmetic mean

Definition: The population mean \( \mu \) is the sum of all the observations in the population divided by \( N \), the number of observations in the population (assuming, for now, that the population is finite).
\[ \mu = \frac{1}{N}\sum_{i=1}^{N}Y_{i} \]

Measures of central tendency - Arithmetic mean

Definition: The sample mean \( \overline{Y} \) is the sum of all the observations in the sample divided by \( n \), the number of sample observations.
\[ \overline{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_{i} \]
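As a quick check of the formula, the sketch below applies it to the six `massKg` values printed by `head(salmonSizeData)` earlier and compares the result with R's built-in `mean()`. Treating these six rows as "the sample" is purely for illustration.

```r
# The six massKg values shown in the head() output
y <- c(3.090, 2.909, 3.056, 2.690, 2.876, 2.978)

n <- length(y)

# Sample mean from the definition: sum of observations divided by n
manualMean <- sum(y) / n
print(manualMean)

# Same result from the built-in function
print(all.equal(manualMean, mean(y)))
```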

Measures of central tendency - Arithmetic mean

Question: Is the population mean \( \mu \) a parameter or an estimate? What about the sample mean?

Note that every observation has equal weight (i.e. \( \frac{1}{n} \)), so any outliers can strongly affect the mean. It is a very democratic statistic - equal representation!
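The effect of a single outlier is easy to see with a small made-up sample (the values below are hypothetical):

```r
# Hypothetical sample with no unusual values
y <- c(2.7, 2.9, 3.0, 3.1, 3.3)
meanWithout <- mean(y)
print(meanWithout)

# Add one extreme observation; it gets the same weight (1/n) as every other value
yWithOutlier <- c(y, 30)
meanWith <- mean(yWithOutlier)
print(meanWith)
```

One extreme value out of six moves the mean from about 3 to about 7.5, well away from where most of the observations sit.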

Measures of central tendency - Median

Definition: The population median is the middle measurement of the set of all observations in the population (again, assume for now that the population is finite).

Definition: The sample median is the middle measurement of the set of all observations in the sample.

Measures of central tendency - Median

How do you compute the median? W&S version:

  • First, sort the data from smallest to largest.
  • We then have two conditions:
    • If the number of observations is odd, then we have \( \text{Median} = Y_{(n+1)/2} \)
    • If the number of observations is even, then we have \( \text{Median} = \left[Y_{n/2} + Y_{(n/2)+1}\right]/2 \)

Look at special cases of \( n=3 \) and \( n=4 \)!!!
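The two cases can be sketched as a small R function and tried on \( n=3 \) and \( n=4 \). The name `wsMedian` and the example values are made up for this illustration; in practice R's built-in `median()` does the same job.

```r
# Median following the sort-then-pick recipe above (hypothetical helper)
wsMedian <- function(y) {
  ySorted <- sort(y)          # sort from smallest to largest
  n <- length(ySorted)
  if (n %% 2 == 1) {
    # Odd n: the single middle observation
    ySorted[(n + 1) / 2]
  } else {
    # Even n: average of the two middle observations
    (ySorted[n / 2] + ySorted[n / 2 + 1]) / 2
  }
}

oddCase  <- wsMedian(c(5, 1, 3))     # n = 3: sorted 1, 3, 5, middle value is 3
evenCase <- wsMedian(c(5, 1, 3, 9))  # n = 4: sorted 1, 3, 5, 9, (3 + 5) / 2 = 4
print(c(oddCase, evenCase))
```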