Descriptive Statistics

M. Drew LaMar
September 10, 2021

“While nothing is more uncertain than a single life, nothing is more certain than the average duration of a thousand lives.”

- Elizur Wright

Practice Problem #5

A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.

Practice Problem #5 - Main questions

Question: Experimental or observational?

Answer: Observational

Question: Explanatory variables with type?

Answer: Fry source, which is a categorical variable with two levels, “hatchery” and “wild”.

Question: Response variables with type?

Answer: “Survival”, which is a categorical variable with two levels, “survived” and “not caught”.

Practice Problem #5

In R: First load the data and get some quick info on the data using the head and str commands.

Load the data:

troutfry <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02q05FrySurvival.csv"))

head(troutfry)
  frySource survival
1      wild survived
2      wild survived
3      wild survived
4      wild survived
5      wild survived
6      wild survived

str(troutfry)
'data.frame':   8000 obs. of  2 variables:
 $ frySource: chr  "wild" "wild" "wild" "wild" ...
 $ survival : chr  "survived" "survived" "survived" "survived" ...

Looks like our data is in raw form, i.e. each row is an observation and each column is a measurement/variable.

Practice Problem #5

In R: Now make a table using the table command.


(troutfryTable <- table(troutfry$frySource, troutfry$survival))

           not caught survived
  hatchery       3973       27
  wild           3949       51

Notice the parentheses around the assignment, which tell R to print the result.

Question: Anything off with this format?

Answer: Explanatory variable should be in the horizontal dimension!

Question: So how do we fix it?

Practice Problem #5

In R: Now make a table using the table command, putting the explanatory variable in the correct dimension.


(troutfryTable <- table(troutfry$survival, troutfry$frySource))

             hatchery wild
  not caught     3973 3949
  survived         27   51

Question: Any other changes?

Answer: Maybe “survived” should come before “not caught”, as that might be the most interesting.

Question: How do we change the order of the levels for “survival”?

Practice Problem #5

In R: Reorder the levels of the “survival” factor…


str(troutfry$survival)
 chr [1:8000] "survived" "survived" "survived" "survived" "survived" ...
troutfry$survival <- factor(troutfry$survival, levels = c("survived", "not caught"))
str(troutfry$survival)
 Factor w/ 2 levels "survived","not caught": 1 1 1 1 1 1 1 1 1 1 ...
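
As a minimal sketch of why the reordering is needed, assuming a short made-up vector rather than the trout data: factor sorts the levels alphabetically unless told otherwise, which is why “not caught” came first in the earlier tables.

```r
# Toy vector for illustration only (not the trout data)
x <- c("survived", "not caught", "survived")

# Default behaviour: levels are sorted alphabetically
defaultLevels <- levels(factor(x))
print(defaultLevels)

# Explicit order: put "survived" first
f <- factor(x, levels = c("survived", "not caught"))
orderedLevels <- levels(f)
print(orderedLevels)

# table() lists categories in the order of the factor levels
print(table(f))
```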

Practice Problem #5

In R: …and remake the table.


(troutfryTable <- table(troutfry$survival, troutfry$frySource))

             hatchery wild
  survived         27   51
  not caught     3973 3949

Practice Problem #5

In R: Finally, let's add some margins using the addmargins command.


addmargins(troutfryTable)

             hatchery wild  Sum
  survived         27   51   78
  not caught     3973 3949 7922
  Sum            4000 4000 8000
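
A possible follow-up, sketched here under the assumption that we rebuild the table by hand from the counts in the problem statement (so the CSV file is not needed): prop.table with margin = 2 turns the counts into proportions within each fry source. The name fryTable is made up for this sketch.

```r
# Counts copied from the problem statement; fryTable is a hypothetical name
fryTable <- as.table(matrix(
  c(27, 3973, 51, 3949),
  nrow = 2,
  dimnames = list(survival  = c("survived", "not caught"),
                  frySource = c("hatchery", "wild"))
))

# margin = 2 divides each column by its column total,
# giving the proportion of each fry source in each survival category
colProps <- prop.table(fryTable, margin = 2)
print(round(colProps, 5))
```

Each column of the result sums to 1, so the “survived” row can be read directly as the survival proportion for each fry source.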

Practice Problem #5: Shearing sheep

There's more than one way to shear a sheep…

t(addmargins(table(troutfry)))
            frySource
survival     hatchery wild  Sum
  survived         27   51   78
  not caught     3973 3949 7922
  Sum            4000 4000 8000

The t command transposes the table (i.e. switches the horizontal and vertical variable placement).

Practice Problem #5: Mosaic plot

In R: Alright, let’s do some data viz. Draw a mosaic plot.


mosaicplot(troutfryTable)

[Figure: mosaic plot of troutfryTable]

Question: Explanatory variable along vertical axis. How to fix?

Answer: Transpose the table!

Practice Problem #5: Transpose table

In R: Same command as before, but plot transposed table with the t command.


mosaicplot(t(troutfryTable))

[Figure: mosaic plot of the transposed table]

Data Visualization Tidbit

Practice Problem #5: Add axes

In R: Final to-do: label your axes! Bonus: Remove the title.


mosaicplot(t(troutfryTable), 
           xlab="Fry source", 
           ylab="Relative frequency", 
           main="", 
           cex = 1.5, 
           cex.sub = 1.5, 
           col = c("forestgreen", "goldenrod1"))

[Figure: mosaic plot of the transposed table with labeled axes and no title]

Assignment Problem #33

The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?

Assignment Problem #33 - Questions

Question: Experimental or observational?

Answer: Observational

Question: Explanatory variables with type?

Answer: Birth month - categorical with levels “Jan” and “Dec”.

Question: Response variables with type?

Answer: ADHD diagnosis - categorical with levels “ADHD” and “no ADHD”.

Assignment Problem #33: ADHD - Load data

In R: Load the dataset and look at the structure of the data.


adhd <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02q33BirthMonthADHD.csv"))
str(adhd)
'data.frame':   4 obs. of  3 variables:
 $ birthMonth : chr  "January" "January" "December" "December"
 $ diagnosis  : chr  "ADHD" "no ADHD" "ADHD" "no ADHD"
 $ frequencies: int  2219 36917 2870 36107

Assignment Problem #33: ADHD - Look at data

Question: Is this processed or raw data?


  birthMonth diagnosis frequencies
1    January      ADHD        2219
2    January   no ADHD       36917
3   December      ADHD        2870
4   December   no ADHD       36107

Answer: Processed! They have already counted the data!

Assignment Problem #33: ADHD - Transform data

In R: Transform this into a more appropriate form.


(adhdMatrix <- matrix(adhd$frequencies, 
                      nrow = 2, 
                      ncol = 2, 
                      dimnames = list(c("Diagnosed ADHD", "Not diagnosed"), 
                                      c("Jan","Dec"))))
                 Jan   Dec
Diagnosed ADHD  2219  2870
Not diagnosed  36917 36107

Assignment Problem #33: ADHD - Hint for HW

Look at the Chapter 2 R Markdown, Example 2.3A (Bird malaria), to see how to plot data of this type. In particular, note that grouped bar plots (using the barplot command) and mosaic plots (using the mosaicplot command) can work on matrices or tables.
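
To illustrate the hint, here is a rough sketch (not the Example 2.3A code) that calls both plotting commands on a matrix of counts. The matrix is rebuilt by hand from the frequencies shown earlier, and the name adhdCounts and the axis labels are choices made for this sketch.

```r
# Counts copied from the adhd data frame shown earlier; adhdCounts is a
# hypothetical name. Rows = response, columns = explanatory variable.
adhdCounts <- matrix(
  c(2219, 36917, 2870, 36107),
  nrow = 2,
  dimnames = list(c("Diagnosed ADHD", "Not diagnosed"),
                  c("Jan", "Dec"))
)

# Grouped bar plot: one group of bars per column (birth month)
barplot(adhdCounts, beside = TRUE, legend.text = TRUE,
        xlab = "Birth month", ylab = "Frequency")

# Mosaic plot: transpose so the explanatory variable is on the horizontal axis
mosaicplot(t(adhdCounts), main = "",
           xlab = "Birth month", ylab = "Relative frequency")
```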

Assignment Problem #35

The following data are from Mattison et al. (2012), who carried out an experiment with rhesus monkeys to test whether a reduction in food intake extends life span (as measured in years). The data are the life spans of 19 male and 15 female monkeys who were randomly assigned a normal nutritious diet or a similar diet reduced in amount by 30%. All monkeys were adults at the start of the study.

  1. Graph the results, using the most appropriate method and following the four principles of good graph design.
  2. According to your graph, which difference in life span is greater: that between the sexes, or that between diet groups?

Other questions: Experimental or observational? Explanatory and response variables?

Answer: This is an experimental study, since the diets were randomly assigned.

Answer: Explanatory variables are sex and food intake, both of which are categorical and have 2 levels.

Answer: Response variable is life span, which is numerical and continuous.

Assignment Problem #35 - Look at data

'data.frame':   34 obs. of  3 variables:
 $ sex          : chr  "female" "female" "female" "female" ...
 $ foodTreatment: chr  "reduced" "reduced" "reduced" "reduced" ...
 $ lifespan     : num  16.5 18.9 22.6 27.8 30.2 30.7 35.9 23.7 24.5 24.7 ...

Categories of the sex variable:

[1] "female" "male"  

Categories of the foodTreatment variable:

[1] "control" "reduced"

Assignment Problem #35 - Look at data

head(foodforlife)
     sex foodTreatment lifespan
1 female       reduced     16.5
2 female       reduced     18.9
3 female       reduced     22.6
4 female       reduced     27.8
5 female       reduced     30.2
6 female       reduced     30.7

Question: Is this data in the right form for graphing?

Assignment Problem #35 - Trick

Depending on the graph that you choose, the first two arguments to the function should be

functionname(lifespan ~ foodTreatment * sex, data=mydata, …)

where functionname is the name of the plotting function in R that you chose, and mydata is the name of your data frame.
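
As one possible instance of this pattern, the sketch below uses boxplot as the plotting function. The data frame toy and its lifespan values are invented purely for illustration (they are not the Mattison et al. data); only the column names match the homework data.

```r
# Invented toy data for illustration only (NOT the monkey data)
toy <- data.frame(
  sex           = rep(c("female", "male"), each = 4),
  foodTreatment = rep(c("control", "reduced"), times = 4),
  lifespan      = c(25, 27, 30, 31, 24, 29, 33, 35)
)

# The formula crosses the two explanatory variables, so the plot
# gets one box per combination of foodTreatment and sex
bp <- boxplot(lifespan ~ foodTreatment * sex, data = toy,
              ylab = "Life span (years)")

# Group labels that boxplot generated for the combinations
print(bp$names)
```

Other plotting functions that accept a formula, such as stripchart, can be called with the same first two arguments.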

Assignment Problem #35 - Another way

You can combine two columns into one column using the unite function in the tidyr package (we'll learn more later about this package).

foodforlife <- tidyr::unite(foodforlife, category, c("sex", "foodTreatment"), sep="-")
str(foodforlife)
'data.frame':   34 obs. of  2 variables:
 $ category: chr  "female-reduced" "female-reduced" "female-reduced" "female-reduced" ...
 $ lifespan: num  16.5 18.9 22.6 27.8 30.2 30.7 35.9 23.7 24.5 24.7 ...

The values of the new category variable are:

[1] "female-control" "female-reduced" "male-control"   "male-reduced"  

Distributions

Definition: The frequency distribution of a variable is the number of occurrences of all values of that variable in the data.

Definition: The relative frequency distribution of a variable is the fraction of occurrences of all values of that variable in the data or population.

  • These definitions apply to both continuous and discrete variables.
  • Frequency = Number
  • Relative frequency = Fraction (proportion)
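
A small sketch of the two definitions in R, using a made-up vector of category labels: table gives the frequency distribution and prop.table converts it to the relative frequency distribution.

```r
# Made-up observations for illustration
x <- c("a", "b", "a", "c", "a", "b")

# Frequency distribution: number of occurrences of each value
freq <- table(x)
print(freq)

# Relative frequency distribution: fraction of occurrences of each value
relFreq <- prop.table(freq)
print(relFreq)
```

The relative frequencies always sum to 1, which is a quick check that nothing was dropped.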

Distributions

Question: What type of plot represents the frequency (relative frequency) distribution for a discrete variable?

Answer: Bar plot

Definition: A bar plot uses the height of rectangular bars to display the frequency distribution (or relative frequency distribution) of a categorical variable.

  • i.e. height of bars = number or proportion

Distributions - Bar plot

Death by tiger

Distributions - Bar plot

Question: What type of plot represents the frequency distribution for a continuous variable?

Answer: Histogram (which is still a bar plot, actually)

Definition: A histogram for a frequency distribution uses the height of rectangular bars to display the frequency distribution of a numerical variable.

Definition: A histogram for a relative frequency distribution uses the area of rectangular bars to display the relative frequency distribution of a numerical variable.

Distributions

Three different histograms that depict the body mass of 228 female sockeye salmon

Question: What’s the explanatory and response variable?

Answer: Neither

Distributions

Load and show the data:

salmonSizeData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter02/chap02f2_5SalmonBodySize.csv")
head(salmonSizeData)
  year   sex oceanAgeYears lengthMm massKg
1 1996 FALSE             3      513  3.090
2 1996 FALSE             3      513  2.909
3 1996 FALSE             3      525  3.056
4 1996 FALSE             3      501  2.690
5 1996 FALSE             3      513  2.876
6 1996 FALSE             3      501  2.978

Distributions - Histogram

Plot in a histogram:

histObj <- hist(salmonSizeData$massKg, 
                right = FALSE, 
                breaks = seq(1,4,by=0.5), 
                col = "firebrick")
seq(1,4,by=0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0

[Figure: frequency histogram of salmon body mass]

Distributions - Histogram

[Figure: frequency histogram of salmon body mass]

Question: What would the height of the second bar from the left be for a relative frequency distribution, given that we have 228 fish? (note: current height is 136)

Distributions - Histogram

[Figure: frequency histogram of salmon body mass]

\[ Area = Proportion \]

\[ Area = Height \times width \]

\[ Proportion = Height \times 0.5 \]

\[ 136/228 = Height \times 0.5 \]

\[ Height = 2\times 136/228 \]

\[ Height = 1.1929825 \]
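
The arithmetic above can be checked in R. The sketch below also confirms the area rule on a histogram object, assuming the built-in faithful dataset as a stand-in because the salmon file may not be at hand; the density component of a hist object holds the bar heights for the relative frequency version.

```r
# Height of the second bar: proportion divided by bin width
height <- (136 / 228) / 0.5
print(height)

# Stand-in data: eruption durations from the built-in faithful dataset
x <- faithful$eruptions
h <- hist(x, breaks = seq(1.5, 5.5, by = 0.5), right = FALSE, plot = FALSE)

# Density by hand: (count / total) / bin width
byHand <- h$counts / sum(h$counts) / diff(h$breaks)
print(all.equal(h$density, byHand))

# Total area of the bars (height x width, summed) is 1
totalArea <- sum(h$density * diff(h$breaks))
print(totalArea)
```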

Distributions - Histogram

Question: What happens with smaller bin width (say width of 0.1)?

hist(salmonSizeData$massKg, 
     right = FALSE, 
     breaks = seq(1,4,by=0.1), 
     col = "firebrick", 
     freq=FALSE)

plot of chunk unnamed-chunk-35

plot of chunk unnamed-chunk-36

Measures of central tendency - Arithmetic mean

Definition: The population mean \( \mu \) is the sum of all the observations in the population divided by \( N \), the number of observations in the population (assuming it is finite - for now).
\[ \mu = \frac{1}{N}\sum_{i=1}^{N}Y_{i}\, \]

Measures of central tendency - Arithmetic mean

Definition: The sample mean \( \overline{Y} \) is the sum of all the observations in the sample divided by \( n \), the number of sample observations.
\[ \overline{Y} = \frac{1}{n}\sum_{i=1}^{n}Y_{i}\, \]

Measures of central tendency - Arithmetic mean

Question: Is the population mean \( \mu \) a parameter or an estimate? What about the sample mean?

Note that every observation has equal weight (i.e. \( \frac{1}{n} \)), so any outliers can strongly affect the mean. It is a very democratic statistic - equal representation!
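
A quick sketch of that sensitivity, using made-up numbers: changing a single observation into an outlier shifts the mean a lot, while the median does not move.

```r
# Made-up data for illustration
y        <- c(1, 2, 3, 4, 5)
yOutlier <- c(1, 2, 3, 4, 50)   # same data, but the largest value is now extreme

meanBefore   <- mean(y)           # 15 / 5
meanAfter    <- mean(yOutlier)    # 60 / 5
medianBefore <- median(y)
medianAfter  <- median(yOutlier)

print(c(meanBefore = meanBefore, meanAfter = meanAfter))
print(c(medianBefore = medianBefore, medianAfter = medianAfter))
```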

Measures of central tendency - Median

Definition: The population median is the middle measurement of the set of all observations in the population (again, assume population finite for now).

Definition: The sample median is the middle measurement of the set of all observations in the sample.

Measures of central tendency - Median

How do you compute the median? W&S version:

  • First, sort the data from smallest to largest.
  • We then have two conditions:
    • If the number of observations is odd, then we have \[ Median = Y_{(n+1)/2} \]
    • If the number of observations is even, then we have \[ Median = \left[Y_{n/2} + Y_{(n/2)+1}\right]/2 \]

Look at special cases of \( n=3 \) and \( n=4 \)!!!
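
The recipe above can be written out as a short function and tried on the two special cases. The function name wsMedian is made up for this sketch; R's built-in median gives the same answers.

```r
# Hypothetical helper implementing the sort-then-pick-the-middle recipe
wsMedian <- function(y) {
  y <- sort(y)          # step 1: sort from smallest to largest
  n <- length(y)
  if (n %% 2 == 1) {
    y[(n + 1) / 2]                    # odd n: the single middle value
  } else {
    (y[n / 2] + y[n / 2 + 1]) / 2     # even n: average of the two middle values
  }
}

m3 <- wsMedian(c(7, 1, 4))       # n = 3: sorted 1, 4, 7, so the median is Y_2
m4 <- wsMedian(c(7, 1, 4, 10))   # n = 4: sorted 1, 4, 7, 10, so average Y_2 and Y_3
print(c(m3, m4))
```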

Measures of central tendency - Mean vs. Median

The median is the middle measurement of the distribution (different colors represent the two halves of the distribution). The mean is the center of gravity, the point at which the frequency distribution would be balanced (if observations had weight).

Note: The mean and median have the same units as the variable!!!

Measures of variability - Variance

Definition: The population variance \( \sigma^{2} \) is the average of the squared deviations of all observations from the population mean, and assuming a finite population, we have
\[ \sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\mu)^2 \]

Measures of variability - Variance

Definition: The sample variance \( s^{2} \) is the sum of the squared deviations from the sample mean divided by \( n-1 \),
\[ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_{i}-\overline{Y})^2 \]

Question: Why \( n-1 \)??

Answer: Dividing by \( n-1 \) is needed for \( s^{2} \) to be an unbiased estimate of \( \sigma^{2} \)!!
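
A small worked check with made-up numbers: computing the sample variance by hand with the \( n-1 \) denominator reproduces R's var, whereas dividing by \( n \) gives a smaller value.

```r
# Made-up sample for illustration
y <- c(2, 4, 4, 6)
n <- length(y)

# Squared deviations from the sample mean: 4, 0, 0, 4
sqDev <- (y - mean(y))^2

s2ByHand  <- sum(sqDev) / (n - 1)   # 8 / 3
dividedByN <- sum(sqDev) / n        # 8 / 4, systematically too small on average

print(c(s2ByHand = s2ByHand, var = var(y), dividedByN = dividedByN))
```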

Measures of variability - Standard deviation

Definition: The population standard deviation \( \sigma \) is the square root of the population variance,
\[ \sigma = \sqrt{\sigma^{2}} \]

Definition: The sample standard deviation \( s \) is the square root of the sample variance,
\[ s = \sqrt{s^{2}} \]

Note #1: \( s \) is in general a biased estimator of \( \sigma \). The bias gets smaller as the sample size gets larger.

Note #2: \( s \) and \( \sigma \) have the same units as the random variable!!!

Measures of variability - Standard deviation

Note #3: If the frequency distribution is bell shaped, then about two-thirds (roughly 68% for a normal distribution) of the observations will lie within one standard deviation of the mean, and about 95% of the observations will lie within two standard deviations of the mean.
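
For an exactly normal distribution, these fractions can be computed with pnorm, the normal cumulative distribution function. This is a sketch for the idealized bell curve; real data will only match approximately.

```r
# Probability of falling within k standard deviations of the mean
# for a normal distribution
withinOneSd <- pnorm(1) - pnorm(-1)
withinTwoSd <- pnorm(2) - pnorm(-2)

print(round(c(withinOneSd = withinOneSd, withinTwoSd = withinTwoSd), 4))
```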

Measures of variability - Interquartile range

Definition: The interquartile range \( IQR \) is the difference between the third and first quartiles of the data. It is the span of the middle 50% of the data.

Measures of variability - Interquartile range

Spiders with huge pedipalps, copulatory organs that make up about 10% of a male's mass.

Measures of variability - Interquartile range

  • Middle bar of box is median
  • Bottom of box is first quartile
  • Top of box is third quartile
  • Whiskers extend up to \( 1.5\times IQR \) above and below box\( ^{*} \)
  • Data outside whiskers (extreme values) are plotted as dots

\( ^{*} \) Each whisker stops at the most extreme data point within \( 1.5\times IQR \) of the box, so it never extends past the max or min of the data.
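
The numbers behind a box plot can be inspected with boxplot.stats. The sketch below uses made-up data with one extreme value; note that R computes the box edges as hinges, which can differ slightly from the quartiles given by other algorithms.

```r
# Made-up data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

bs <- boxplot.stats(x)

# stats: lower whisker end, lower hinge, median, upper hinge, upper whisker end
print(bs$stats)

# out: points beyond the whiskers, drawn as dots by boxplot
print(bs$out)
```

Here the box runs from 3 to 8, so the upper whisker may reach at most 8 + 1.5 × 5 = 15.5; it stops at 9, the largest observation inside that limit, and 30 is plotted as a dot.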

Standard deviation or interquartile range?

Heuristic #1: The location (mean and median) and spread (interquartile range and standard deviation) give similar information when the frequency distribution is symmetric and unimodal (i.e. bell shaped).

Heuristic #2: The mean and standard deviation become less informative when the distribution is strongly skewed or there are extreme observations.

Coefficient of variation

Since in biology the standard deviation often scales with the mean, it can be more informative to look at the coefficient of variation.

Definition: The coefficient of variation (CV) calculates the standard deviation as a percentage of the mean: \[ CV = \frac{s}{\bar{Y}}\times 100\% \]

In other words, the CV answers the question “How much variation is there relative to the mean?”
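
Base R has no built-in CV command, so a one-line helper is a common approach. The function name cv and the sample values below are made up for this sketch.

```r
# Hypothetical helper: standard deviation as a percentage of the mean
cv <- function(y) sd(y) / mean(y) * 100

# Made-up sample for illustration
y <- c(2, 4, 4, 6)
cvY <- cv(y)
print(cvY)

# Changing the measurement scale (e.g. multiplying by 10) leaves the CV unchanged
cvScaled <- cv(10 * y)
print(cvScaled)
```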

Moving on...

Make sure you read the book for the following discussions:

  • How to compute a mean and standard deviation from a frequency table

Question: Why is this important to know?

  • Rounding rules for displaying tables and statistics
  • Effect of changing measurement scale
  • Cumulative frequency distributions (we will cover this later as well)

My point here is that you are responsible for all book material, even if we don't cover it in lecture!

Describing data in R

Measure                R command
\( \overline{Y} \)     mean
\( s^2 \)              var
\( s \)                sd
\( IQR \)              IQR\( ^* \)
Multiple               summary

\( ^* \) Note that there are several algorithms for IQR, because there are different ways to calculate quantiles (for curious souls, see ?quantile). To match the algorithm in W&S, you should use IQR(___, type=5); the default in R is type=7. For the HW, either version is acceptable.
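
A tiny made-up example shows that the two quantile algorithms can disagree, especially for small samples.

```r
# Made-up sample for illustration
x <- c(1, 3, 5, 7)

iqrDefault <- IQR(x)            # type = 7, the R default
iqrType5   <- IQR(x, type = 5)  # the type suggested above for matching W&S

print(c(iqrDefault = iqrDefault, iqrType5 = iqrType5))

# The quartiles behind each answer
print(quantile(x, c(0.25, 0.75), type = 7))
print(quantile(x, c(0.25, 0.75), type = 5))
```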

Describing data in R

summary(mydata)
    breadth     
 Min.   : 1.00  
 1st Qu.: 3.00  
 Median : 8.00  
 Mean   :11.88  
 3rd Qu.:17.00  
 Max.   :62.00  

IQR would be \( 17-3 = 14 \).