Data Visualization

M. Drew LaMar
August 24, 2020

“Maturity of mind is the capacity to endure uncertainty.”

- John Finley

Course Announcements

  • Homework #1 (due Monday, August 31, 11:59 pm):
    • Whitlock & Schluter: Chapter 1
      • Practice Problems (do NOT turn these in): #1, 3, 4, 9, 12
      • Assignment Problems (do turn these in): #14, 16-20 (all), 24
    • Whitlock & Schluter: Chapter 2
      • Practice Problems (do NOT turn these in): #3, 8, 11-16 (all)
      • Assignment Problems (do turn these in): #20, 23, 29, 32-36

Course Announcements

  • Reading Quiz: Whitlock & Schluter, Chapter 3 (due Wednesday, August 26, 2 pm)
  • Lab #2 (due before lab during the week of August 31-September 4)

What is statistics?

Statistics is a technology that describes and measures aspects of nature from samples.

Statistics lets us quantify the uncertainty of these measures.

Statistics makes it possible to determine the likely magnitude of a measurement's departure from the “truth”.

Statistics is about estimation, the process of inferring an unknown quantity of a target population using sample data.

What is statistics?

The two sides of the statistical coin:

  • Parameter estimation
  • Hypothesis testing
Definition: A statistical hypothesis is a specific claim regarding a population parameter.
Definition: Hypothesis testing uses data to evaluate evidence for or against statistical hypotheses.

What is statistics? Parameter estimation

The two sides of the statistical coin:

  • Parameter estimation
  • Hypothesis testing

Example: A trapping study measures the rate of fruit fall in forest clear-cuts.

What is statistics? Hypothesis testing

The two sides of the statistical coin:

  • Parameter estimation
  • Hypothesis testing

Example: A clinical trial is carried out to determine whether taking large doses of vitamin C benefits health of advanced cancer patients.
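In R: The two sides of the coin can be sketched on simulated data. The numbers below are invented for illustration and do not come from either study; the group names, sample sizes, and the use of a two-sample t-test are all assumptions.

```r
# Simulated data only: these values are invented for illustration.
set.seed(1)
control   <- rnorm(50, mean = 10, sd = 2)  # hypothetical outcome, untreated group
treatment <- rnorm(50, mean = 11, sd = 2)  # hypothetical outcome, treated group

# Parameter estimation: estimate the difference between the population means
estimate <- mean(treatment) - mean(control)

# Hypothesis testing: evaluate evidence against "no difference in means"
test <- t.test(treatment, control)

estimate
test$p.value
```

The estimate answers "how big is the difference?", while the test answers "is there evidence of any difference at all?".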

What is probability?

Probability comes first!

…well, most of the time.

  • Many statistical techniques require assumptions about where your data is coming from (i.e. properties of the population)
  • In other words, an assumed probability model describes the population
  • Statistical techniques that are based on probability models are called parametric techniques, while those that are not are called non-parametric techniques.
Quote: “Huh?”
- Student

Data as Information

“Modern statisticians are familiar with the notion that any finite body of data contains only a limited amount of information on any point under examination; that this limit is set by the nature of the data themselves, and cannot be increased by any amount of ingenuity expended in their statistical examination: that the statistician's task, in fact, is limited to the extraction of the whole of the available information on any particular issue.”

- R. A. Fisher (biologist!)

Data as Information

There is desired and undesired information in data.

Goals:

  • Get accurate information by reducing bias (do we have the right signal?)

  • Get precise information by reducing sampling error due to random variation (increase signal-to-noise ratio)

    Definition: Bias is a systematic discrepancy between the estimates we would obtain, if we could sample a population again and again, and the true population characteristic.

    Definition: Sampling error is the difference between an estimate and the population parameter being estimated caused by chance.
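In R: A minimal simulation sketch of the two definitions, assuming a made-up population with a known mean. The population, the sample size of 20, and the "upper half only" sampling scheme are hypothetical choices made to produce a visible bias.

```r
set.seed(42)
population <- rnorm(100000, mean = 50, sd = 10)  # hypothetical population
true_mean  <- mean(population)

# Sample the population again and again, recording each estimate
random_estimates <- replicate(2000, mean(sample(population, 20)))

# A biased scheme: only the upper half of the population can be sampled
upper_half       <- population[population > median(population)]
biased_estimates <- replicate(2000, mean(sample(upper_half, 20)))

# Bias: systematic discrepancy between the average estimate and the truth
mean(random_estimates) - true_mean   # close to zero
mean(biased_estimates) - true_mean   # clearly positive

# Sampling error: chance scatter of individual estimates around their average
sd(random_estimates)
```

Random sampling leaves the scatter (sampling error) in place but removes the systematic shift (bias).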

Precision vs Accuracy

Data as Information

There is desired and undesired information in data.

Goals:

  • Get accurate information by reducing bias (do we have the right signal?)

  • Get precise information by reducing sampling error due to random variation (increase signal-to-noise ratio)

    “An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.”

    - John Tukey

Data as Information

For your question, there is desired (signal) and undesired (noise) information in your data.

Goals:

  • Isolate desired information by reducing or controlling for confounding factors (i.e. undesired information)

“The aim … is to provide a clear and rigorous basis for determining when a causal ordering can be said to hold between two variables or groups of variables in a model…”

- H. Simon

Random sampling

The main assumption of all statistical techniques is that your data come from a random sample.

Definition: In a random sample, each member of a population has an equal and independent chance of being selected.


Random sampling

  1. minimizes bias (equal) and
  2. makes it possible to measure the amount of sampling error, i.e. to quantify precision (independent)
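In R: One way to draw a random sample is the sample command. This is a small sketch, assuming a hypothetical population of 1000 individuals identified by number.

```r
set.seed(7)
population_ids <- 1:1000   # hypothetical population of 1000 individuals

# Each individual has the same chance of ending up in the sample
random_sample <- sample(population_ids, size = 25)  # without replacement
random_sample
```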

Random sampling (Class discussion)

In a recent study, researchers took electrophysiological measurements from the brains of two rhesus macaques (monkeys). Forty neurons were tested in each monkey, yielding a total of 80 measurements.

  1. Do the 80 neurons constitute a random sample? Why or why not?

    Lack of independence

  2. If the 80 measurements were analyzed as though they constituted a random sample, what consequences would this have for the estimate of the measurement in the monkey population?

    Incorrect precision of estimate (most likely underestimated)
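In R: A simulation sketch of why the precision comes out wrong. It assumes, purely for illustration, that monkeys differ from one another (sd_between) and that neurons vary within a monkey (sd_within); the function name and both standard deviations are hypothetical.

```r
set.seed(123)

# Simulate one study: 2 monkeys, 40 neurons each
simulate_study <- function(n_monkeys = 2, n_neurons = 40,
                           sd_between = 1, sd_within = 1) {
  monkey_means <- rnorm(n_monkeys, mean = 0, sd = sd_between)
  values <- rnorm(n_monkeys * n_neurons,
                  mean = rep(monkey_means, each = n_neurons),
                  sd = sd_within)
  c(estimate = mean(values),
    naive_se = sd(values) / sqrt(length(values)))  # treats all 80 as independent
}

results <- replicate(2000, simulate_study())

actual_spread    <- sd(results["estimate", ])   # how much estimates really vary
typical_naive_se <- mean(results["naive_se", ]) # what the naive analysis reports

actual_spread
typical_naive_se
```

Under these assumptions the naive standard error is much smaller than the actual spread of the estimate across repeated studies, so the estimate looks more precise than it is.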

The Degradation of Information

Experimental Design, Data & Statistics

“Designing experiments is as much about learning to think scientifically as it is about the mechanics of the statistics that we use to analyse the data once we have it. It is about having confidence in your data, and knowing that you are measuring what you think you are measuring. It is about knowing what can be concluded from a particular type of experiment and what cannot.”

- Ruxton & Colegrave

Experimental Design, Data & Statistics

Design your experiment so that:

  • Measurements lead to useful data.
  • Useful data has information addressing your hypothesis.
  • Statistics are tailored to your data and powerful enough to separate out signal from noise.
  • Results of statistics can be properly interpreted as evidence for or against your original hypothesis.

Two key concepts of experimental design

“It might be said that the two major goals of designing experiments are to minimize random variation and account for confounding factors.”

- Ruxton & Colegrave

Definition: Random variation is the difference between measured values of the same variable taken from different experimental subjects.

Good experiments minimize or control for “unwanted” random variation, so that any variation due to the factors of interest can be detected more easily.

Definition: If we want to study the effect of variable A on variable B, but variable C also affects B, then C is a confounding factor.
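In R: A simulated sketch of confounding. It assumes a made-up situation in which C influences both A and B, while A has no effect on B at all; the sample size, effect sizes, and the use of lm to adjust for C are illustrative assumptions.

```r
set.seed(2020)
n <- 500
C <- rnorm(n)           # confounding factor
A <- C + rnorm(n)       # A is associated with C
B <- 2 * C + rnorm(n)   # B depends on C only; A has no effect on B

# Ignoring C, A appears to affect B
naive <- coef(lm(B ~ A))["A"]

# Accounting for C, the apparent effect of A disappears
adjusted <- coef(lm(B ~ A + C))["A"]

naive
adjusted
```

The naive slope is well away from zero even though A does nothing to B, which is the danger a good design guards against.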

The importance of experimental design

“Designing effective experiments needs thinking about biology more than it does mathematical calculations.”

“Experimental design is about the biology of the system, and that is why the best people to devise biological experiments are biologists themselves.”

- Ruxton & Colegrave

Data Visualization


Allison Horst

“Numerical quantities focus on expected values, graphical summaries on unexpected values.”

- John Tukey

Communicating with data visualization

Data is beautiful!

Data is ugly!

What is data?

Definition: Variables are characteristics that differ among objects of interest.

Definition: Data are the measurements of one or more variables made on a sample of objects of interest.

Data, essentially, is any measurement of the real world since

  • \( n=1 \) counts as a sample,
  • variables can technically have only one possible value (i.e. no variation)

Types of data

  • Categorical variable (qualitative)

    • Nominal (levels have no inherent ordering)
    • Ordinal (levels have an inherent ordering)

    Remember the factor data type in R?

  • Numerical variable (quantitative)

    • Continuous
    • Discrete

    Remember the numeric data type in R?
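In R: A short sketch of how these four kinds of variable might be stored. All variable names and values are hypothetical.

```r
# Nominal: levels have no inherent ordering
species <- factor(c("finch", "sparrow", "finch", "wren"))

# Ordinal: levels have an inherent ordering
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

# Numerical: continuous measurements and discrete counts
mass_g    <- c(12.4, 25.1, 13.0, 9.8)  # continuous
egg_count <- c(3L, 4L, 2L, 5L)         # discrete (stored as integer)

levels(species)
size < "large"          # comparisons only make sense for ordered factors
is.numeric(mass_g)
is.numeric(egg_count)
```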

Types of data (Class discussion)

Discuss: Would the fraction of birds in a large sample infected with avian flu virus be a discrete or continuous numerical variable?

Answer: Neither! The variable of interest here is actually categorical (nominal). Why?

Ask yourself the following questions:

  • What is the population of interest?
  • What measurement is being taken on objects in the population?
  • What are the characteristics of this measurement (i.e. data type)?
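In R: A sketch of the distinction, using five hypothetical birds. The measurement taken on each bird is categorical (infected or healthy); the fraction is a summary computed from those measurements, not a measurement made on any one bird.

```r
# Hypothetical sample: one categorical measurement per bird
status <- factor(c("infected", "healthy", "healthy", "infected", "healthy"),
                 levels = c("healthy", "infected"))

# Frequency table of the categorical variable
counts <- table(status)
counts

# The fraction infected summarizes the sample
fraction_infected <- mean(status == "infected")
fraction_infected
```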

Plots and data types

Frequency distributions of univariate data

Type of data    Graphical method
Categorical     Bar graph
Numerical       Histogram
Plots and data types

Showing association of bivariate data

Type of data       Graphical method
Two numerical      Scatter plot
                   Line plot
                   Map
Two categorical    Grouped bar graph
                   Mosaic plot
Mixed              Strip chart
                   Box plot
                   Multiple histograms
                   Cumulative frequency distributions

Plots and data types

Visualize before you analyze!!!

Data visualization is one step in exploratory data analysis.

Quote: …the first step in any data analysis or statistical procedure is to graph the data and look at it. Humans are a visual species, with brains evolved to process visual information. Take advantage of millions of years of evolution, and look at visual representations of your data before doing anything else.
- Whitlock & Schluter

Transform before visualize!

If you want to graph some data, you most likely will need to manipulate the data first to put it in the right form.

Figure 2.1-2. Locust serotonin

Strip chart of serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. (left panel)

Figure 2.1-2. Locust serotonin - Load data

Read the data and store it in a data frame (here named locustData). The following command uses read.csv to read the file from the Datasets folder under the project root, which here::here() locates.

locustData <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02f1_2locustSerotonin.csv"))

The read.csv command reads a CSV (comma-separated values) file. Its argument can be a path to a file on your computer, as in this case, or a URL pointing to a file on the web.
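In R: A self-contained sketch of read.csv that does not depend on the course files. It writes a small toy data frame (invented values, not the locust data) to a temporary file and reads it back; the URL in the final comment is a placeholder, not a real location.

```r
# Toy data: invented values, only the column names mimic the locust file
toy <- data.frame(serotoninLevel = c(1.5, 2.5, 3.5),
                  treatmentTime  = c(0L, 0L, 1L))

# Write to a temporary CSV file, then read it back by path
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toyData <- read.csv(csv_path)
str(toyData)

# The same command accepts a URL (placeholder address, not run):
# read.csv("https://example.org/path/to/file.csv")
```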

Question: So where is the data for the book?

Figure 2.1-2. Locust serotonin - Look at data

Show the first few lines of the data, to ensure it read correctly. Determine the number of cases in the data.

head(locustData)
  serotoninLevel treatmentTime
1            5.3             0
2            4.6             0
3            4.5             0
4            4.3             0
5            4.2             0
6            3.6             0

nrow(locustData)
[1] 30

Figure 2.1-2. Locust serotonin - Look at data

Check the object type of the variables using str.

str(locustData)
'data.frame':   30 obs. of  2 variables:
 $ serotoninLevel: num  5.3 4.6 4.5 4.3 4.2 3.6 3.7 3.3 12.1 18 ...
 $ treatmentTime : int  0 0 0 0 0 0 0 0 0 0 ...

Figure 2.1-2. Locust serotonin - Graph data

Draw a stripchart (the tilde “~” means that the first argument below is a formula, relating one variable to the other).

stripchart(serotoninLevel ~ treatmentTime, 
           data = locustData, 
           method = "jitter", 
           vertical = TRUE, 
           xlab="Treatment time (hours)", 
           ylab="Serotonin (pmoles)", 
           cex.lab = 1.5)

Figure 2.1-2. Locust serotonin - Fancier graph

A fancier strip chart, closer to that shown in Figure 2.1-2, by including more options.

# Stripchart with options
par(bty = "l") # plot x and y axes only, not a complete box
stripchart(serotoninLevel ~ treatmentTime, 
           data = locustData, 
           vertical = TRUE, 
           method = "jitter", 
           pch = 16, 
           col = "firebrick", 
           cex = 1.5, 
           cex.lab = 1.5, 
           las = 1,
           ylab = "Serotonin (pmoles)", 
           xlab = "Treatment time (hours)",
           ylim = c(0, max(locustData$serotoninLevel)))

Figure 2.1-2. Locust serotonin - Graphing parameters

In R: Look up any arguments you don’t understand in the help page for the par command.


?par
# Stripchart with options
par(bty = "l") # plot x and y axes only, not a complete box
stripchart(serotoninLevel ~ treatmentTime, 
           data = locustData, 
           vertical = TRUE, 
           method = "jitter", 
           pch = 16, 
           col = "firebrick", 
           cex.lab = 1.5, 
           las = 1,
           ylab = "Serotonin (pmoles)", 
           xlab = "Treatment time (hours)",
           ylim = c(0, max(locustData$serotoninLevel)))
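In R: Settings changed with par stay in effect for later plots on the same device. A common pattern, sketched here with a throwaway plot, is to save the old settings and restore them afterwards; the name old_par is arbitrary.

```r
# par() returns the previous values of the settings it changes
old_par <- par(bty = "l")

plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")  # drawn with axes only

# Restore the previous settings so later plots are unaffected
par(old_par)
```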

Practice Problem #5

A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.

Practice Problem #5 - Main questions

Question: Experimental or observational?

Answer: Observational

Question: Explanatory variables with type?

Answer: Fry source, which is a categorical variable with two levels, “hatchery” and “wild”.

Question: Response variables with type?

Answer: “Survival”, which is a categorical variable with two levels, “survived” and “not caught”.

Practice Problem #5

In R: First load the data and get some quick info on the data using the head and str commands.

Load the data:

troutfry <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02q05FrySurvival.csv"))

head(troutfry)
  frySource survival
1      wild survived
2      wild survived
3      wild survived
4      wild survived
5      wild survived
6      wild survived

str(troutfry)
'data.frame':   8000 obs. of  2 variables:
 $ frySource: chr  "wild" "wild" "wild" "wild" ...
 $ survival : chr  "survived" "survived" "survived" "survived" ...

Looks like our data is in raw form, i.e. each row is an observation and each column is a measurement/variable.

Practice Problem #5

In R: Now make a table using the table command.


(troutfryTable <- table(troutfry$frySource, troutfry$survival))

           not caught survived
  hatchery       3973       27
  wild           3949       51

Notice the parentheses around the assignment, which tell R to print the result.
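In R: A tiny sketch of the parentheses trick, using throwaway variables x and y.

```r
x <- c(2, 4, 6)    # a plain assignment prints nothing
(y <- mean(x))     # wrapped in parentheses: assigns and also prints [1] 4
```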

Question: Anything off with this format?

Answer: Explanatory variable should be in the horizontal dimension!

Question: So how do we fix it?

Practice Problem #5

In R: Now make a table using the table command, putting the explanatory variable in the correct dimension.


(troutfryTable <- table(troutfry$survival, troutfry$frySource))

             hatchery wild
  not caught     3973 3949
  survived         27   51

Question: Any other changes?

Answer: Maybe “survived” should come before “not caught”, as that might be the most interesting.

Question: How do we change the order of the levels for “survival”?

Practice Problem #5

In R: Reorder the levels of the “survival” factor…


str(troutfry$survival)
 chr [1:8000] "survived" "survived" "survived" "survived" "survived" ...
troutfry$survival <- factor(troutfry$survival, levels = c("survived", "not caught"))
str(troutfry$survival)
 Factor w/ 2 levels "survived","not caught": 1 1 1 1 1 1 1 1 1 1 ...
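In R: Why did “not caught” come first originally? By default, factor sorts the levels alphabetically. A small sketch with a toy vector (the name outcome is hypothetical):

```r
outcome <- c("survived", "not caught", "survived")

# Default: levels are sorted alphabetically
levels(factor(outcome))

# Explicit levels control the order used by table() and plots
reordered <- factor(outcome, levels = c("survived", "not caught"))
levels(reordered)
table(reordered)
```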

Practice Problem #5

In R: …and remake the table.


(troutfryTable <- table(troutfry$survival, troutfry$frySource))

             hatchery wild
  survived         27   51
  not caught     3973 3949

Practice Problem #5

In R: Finally, let's add some margins using the addmargins command.


addmargins(troutfryTable)

             hatchery wild  Sum
  survived         27   51   78
  not caught     3973 3949 7922
  Sum            4000 4000 8000
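In R: Because both groups started with 4000 fry, the counts compare directly, but proportions within each fry source are the more general summary. This sketch rebuilds the table from the counts shown above (so it runs without the data file) and uses prop.table; the object names fryCounts and colProportions are hypothetical.

```r
# Counts copied from the contingency table above
fryCounts <- as.table(matrix(c(27, 3973, 51, 3949), nrow = 2,
                             dimnames = list(survival  = c("survived", "not caught"),
                                             frySource = c("hatchery", "wild"))))

# margin = 2: proportions within each column (each fry source)
colProportions <- prop.table(fryCounts, margin = 2)
colProportions
```

Each column of colProportions sums to 1, which is the scale a mosaic plot uses for its vertical axis.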

Practice Problem #5: Shearing sheep

There's more than one way to shear a sheep…

t(addmargins(table(troutfry)))
            frySource
survival     hatchery wild  Sum
  survived         27   51   78
  not caught     3973 3949 7922
  Sum            4000 4000 8000

The t command transposes the table (i.e. it switches the horizontal and vertical variable placement).

Practice Problem #5: Mosaic plot

In R: Alright, let’s do some data viz. Draw a mosaic plot.


mosaicplot(troutfryTable)

Question: Explanatory variable along vertical axis. How to fix?

Answer: Transpose the table!

Practice Problem #5: Transpose table

In R: Same command as before, but plot transposed table with the t command.


mosaicplot(t(troutfryTable))

Data Visualization Tidbit

Practice Problem #5: Add axes

In R: Final to-do: label your axes! Bonus: Remove the title.


mosaicplot(t(troutfryTable), 
           xlab="Fry source", 
           ylab="Relative frequency", 
           main="", 
           cex = 1.5, 
           cex.sub = 1.5, 
           col = c("forestgreen", "goldenrod1"))
