Data Visualization (Part 2)

M. Drew LaMar
August 26, 2020

“Maturity of mind is the capacity to endure uncertainty.”

- John Finley

Course Announcements

  • Reading Assignment #3: Whitlock & Schluter, Chapter 3: Describing data (due Friday, August 28, 2:00 pm)

Random sampling and pseudoreplication

Reminder: Whether a sample is a random sample depends on:

    (a) What is the population of interest?
    (b) What is considered a replicate?

Definition: A replicate is an independent unit of a sample from a population.

Treating replicates as if they were independent when they are not is called pseudoreplication.

Pseudoreplication ends up (1) giving a false sense of increased precision, and (2) reduces the generality of the results.

Replication: Example

Assignment Problem 21

In a study of heart rate in ocean-diving birds, researchers harnessed 10 randomly sampled, wild-caught cormorants to a laboratory contraption that monitored vital signs. Each cormorant was subjected to six artificial “dives” over the following week (one dive per day). A dive consisted of rapidly immersing the head of the bird in water by tipping the harness. In this way, a sample of 60 measurements of heart rate in diving birds was obtained. Do these 60 measurements represent a random sample? Why or why not?

Data Visualization


Allison Horst

“Numerical quantities focus on expected values, graphical summaries on unexpected values.”

- John Tukey

Communicating with data visualization

Data is beautiful!

Data is ugly!

What is data?

Definition: Variables are characteristics that differ among objects of interest.

Definition: Data are the measurements of one or more variables made on a sample of objects of interest.

Data, essentially, is any measurement of the real world since

  • \( n=1 \) counts as a sample,
  • variables can technically have only one possible value (i.e. no variation)

Types of data

  • Categorical variable (qualitative)

    • Nominal (levels have no inherent ordering)
    • Ordinal (levels have an inherent ordering)

    Remember the factor data type in R?

  • Numerical variable (quantitative)

    • Continuous
    • Discrete

    Remember the numeric data type in R?

Types of data (Class discussion)

Discuss: Would the fraction of birds in a large sample infected with avian flu virus be a discrete or continuous numerical variable?

Answer: Neither! The variable of interest here is actually categorical (nominal). Why?

Ask yourself the following questions:

  • What is the population of interest?
  • What measurement is being taken on objects in population?
  • What are the characteristics of this measurement (i.e. data type)?

Plots and data types

Frequency distributions of univariate data

Type of data Graphical method
Categorical Bar graph
Numerical Histogram

Plots and data types

Showing association of bivariate data

Type of data Graphical method
Two numerical Scatter plot
Line plot
Map
Two categorical Grouped bar graph
Mosaic plot
Mixed Strip chart
Box plot
Multiple histograms
Cumulative frequency distributions

Plots and data types

Visualize before you analyze!!!

Data visualization is one step in exploratory data analysis.

Quote: …the first step in any data analysis or statistical procedure is to graph the data and look at it. Humans are a visual species, with brains evolved to process visual information. Take advantage of millions of years of evolution, and look at visual representations of your data before doing anything else.
- Whitlock & Schluter

Visualize before you analyze!!!

Data visualization is one step in exploratory data analysis.

Four Rules of Graphing

  • Show the data.
  • Make patterns in the data easy to see.
  • Represent magnitudes honestly.
  • Draw graphical elements clearly.

Transform before visualize!

If you want to graph some data, you most likely will need to manipulate the data first to put it in the right form.

Figure 2.1-2. Locust serotonin

Strip chart of serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. (left panel)

alt text

Figure 2.1-2. Locust serotonin - Load data

Read the data and store in data frame (here named locustData). The following command uses read.csv to grab the data from a file on the internet (on the current web site).

locustData <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02f1_2locustSerotonin.csv"))

The read.csv command reads a CSV (comma-separated value) file. It's argument can be a file on your computer, or in this case, a location to the file on the web via a URL.

Question: So where is the data for the book?

Figure 2.1-2. Locust serotonin - Look at data

Show the first few lines of the data, to ensure it read correctly. Determine the number of cases in the data.

head(locustData)
  serotoninLevel treatmentTime
1            5.3             0
2            4.6             0
3            4.5             0
4            4.3             0
5            4.2             0
6            3.6             0

Figure 2.1-2. Locust serotonin - Look at data

Show the first few lines of the data, to ensure it read correctly. Determine the number of cases in the data.

nrow(locustData)
[1] 30

Figure 2.1-2. Locust serotonin - Look at data

Check the object type of the variables using str.

str(locustData)
'data.frame':   30 obs. of  2 variables:
 $ serotoninLevel: num  5.3 4.6 4.5 4.3 4.2 3.6 3.7 3.3 12.1 18 ...
 $ treatmentTime : int  0 0 0 0 0 0 0 0 0 0 ...

Figure 2.1-2. Locust serotonin - Graph data

Draw a stripchart (the tilde “~” means that the first argument below is a formula, relating one variable to the other).

stripchart(serotoninLevel ~ treatmentTime, 
           data = locustData, 
           method = "jitter", 
           vertical = TRUE, 
           xlab="Treatment time (hours)", 
           ylab="Serotonin (pmoles)", 
           cex.lab = 1.5)

Figure 2.1-2. Locust serotonin - Graph data

Draw a stripchart (the tilde “~” means that the first argument below is a formula, relating one variable to the other).

plot of chunk unnamed-chunk-5

Figure 2.1-2. Locust serotonin - Fancier graph

A fancier strip chart, closer to that shown in Figure 2.1-2, by including more options.

# Stripchart with options
par(bty = "l") # plot x and y axes only, not a complete box
stripchart(serotoninLevel ~ treatmentTime, 
           data = locustData, 
           vertical = TRUE, 
           method = "jitter", 
           pch = 16, 
           col = "firebrick", 
           cex = 1.5, 
           cex.lab = 1.5, 
           las = 1,
           ylab = "Serotonin (pmoles)", 
           xlab = "Treatment time (hours)",
           ylim = c(0, max(locustData$serotoninLevel)))

Figure 2.1-2. Locust serotonin - Fancier graph

A fancier strip chart, closer to that shown in Figure 2.1-2, by including more options.

plot of chunk unnamed-chunk-6

Figure 2.1-2. Locust serotonin - Graphing parameters

In R: Search arguments you don’t understand in the help for the command par.


?par
# Stripchart with options
par(bty = "l") # plot x and y axes only, not a complete box
stripchart(serotoninLevel ~ treatmentTime, 
           data = locustData, 
           vertical = TRUE, 
           method = "jitter", 
           pch = 16, 
           col = "firebrick", 
           cex.lab = 1.5, 
           las = 1,
           ylab = "Serotonin (pmoles)", 
           xlab = "Treatment time (hours)",
           ylim = c(0, max(locustData$serotoninLevel)))

Practice Problem #5

A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.

Practice Problem #5 - Main questions

A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.

Question: Experimental or observational?

Answer: Observational

Practice Problem #5

A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.

Question: Explanatory variables with type?

Answer: Fry source, which is categorical variable with two levels, “hatchery” and “wild”.

Practice Problem #5

A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.

Question: Response variables with type?

Answer: “Survival”, which is a categorical variable with two levels, “caught” and “not caught”.

Practice Problem #5

In R: First load the data and get some quick info on the data using the head and str commands.

Load the data:

troutfry <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02q05FrySurvival.csv"))

Practice Problem #5

In R: First load the data and get some quick info on the data using the head and str commands.


head(troutfry)
  frySource survival
1      wild survived
2      wild survived
3      wild survived
4      wild survived
5      wild survived
6      wild survived

Practice Problem #5

In R: First load the data and get some quick info on the data using the head and str commands.


str(troutfry)
'data.frame':   8000 obs. of  2 variables:
 $ frySource: chr  "wild" "wild" "wild" "wild" ...
 $ survival : chr  "survived" "survived" "survived" "survived" ...

Looks like our data is in raw form, i.e. each row is an observation and each column is a measurement/variable.

Practice Problem #5

In R: Now make a table using the table command.


(troutfryTable <- table(troutfry$frySource, troutfry$survival))

           not caught survived
  hatchery       3973       27
  wild           3949       51

Notice the parenthesis around the assignment, which says to output the result.

Question: Anything off with this format?

Practice Problem #5

In R: Now make a table using the table command.


(troutfryTable <- table(troutfry$frySource, troutfry$survival))

           not caught survived
  hatchery       3973       27
  wild           3949       51

Notice the parenthesis around the assignment, which says to output the result.

Answer: Explanatory variable should be in the horizontal dimension!

Practice Problem #5

In R: Now make a table using the table command.


(troutfryTable <- table(troutfry$frySource, troutfry$survival))

           not caught survived
  hatchery       3973       27
  wild           3949       51

Notice the parenthesis around the assignment, which says to output the result.

Question: So how do we fix it?

Practice Problem #5

In R: Now make a table using the table command, putting the explanatory variable in the correct dimension.


(troutfryTable <- table(troutfry$survival, troutfry$frySource))

             hatchery wild
  not caught     3973 3949
  survived         27   51

Question: Any other changes?

Practice Problem #5

In R: Now make a table using the table command, putting the explanatory variable in the correct dimension.


(troutfryTable <- table(troutfry$survival, troutfry$frySource))

             hatchery wild
  not caught     3973 3949
  survived         27   51

Answer: Maybe “survived” should come before “not caught”, as that might be the most interesting.

Practice Problem #5

In R: Now make a table using the table command, putting the explanatory variable in the correct dimension.


(troutfryTable <- table(troutfry$survival, troutfry$frySource))

             hatchery wild
  not caught     3973 3949
  survived         27   51

Question: How do we change the order of the levels for “survival”?

Practice Problem #5

In R: Reorder the levels of the “survival” factor…


str(troutfry$survival)
 chr [1:8000] "survived" "survived" "survived" "survived" "survived" ...
troutfry$survival <- factor(troutfry$survival, levels = c("survived", "not caught"))
str(troutfry$survival)
 Factor w/ 2 levels "survived","not caught": 1 1 1 1 1 1 1 1 1 1 ...

Practice Problem #5

In R: …and remake the table.


(troutfryTable <- table(troutfry$survival, troutfry$frySource))

             hatchery wild
  survived         27   51
  not caught     3973 3949

Practice Problem #5

In R: Finally, lets add some margins using the addmargins command.


addmargins(troutfryTable)

             hatchery wild  Sum
  survived         27   51   78
  not caught     3973 3949 7922
  Sum            4000 4000 8000

Practice Problem #5: Shearing sheep

There's more than one way to shear a sheep…

t(addmargins(table(troutfry)))
            frySource
survival     hatchery wild  Sum
  survived         27   51   78
  not caught     3973 3949 7922
  Sum            4000 4000 8000

The t command transposes the table (i.e. switch horizontal and vertical variable placement).

Practice Problem #5: Mosaic plot

In R: Alright, let’s do some data viz. Draw a mosaic plot.


mosaicplot(troutfryTable)

plot of chunk unnamed-chunk-23

Practice Problem #5: Mosaic plot

In R: Alright, let’s do some data viz. Draw a mosaic plot.


plot of chunk unnamed-chunk-24

Question: Explanatory variable along vertical axis. How to fix?

Practice Problem #5: Mosaic plot

In R: Alright, let’s do some data viz. Draw a mosaic plot.


plot of chunk unnamed-chunk-25

Answer: Transpose the table!

Practice Problem #5: Transpose table

In R: Same command as before, but plot transposed table with the t command.


mosaicplot(t(troutfryTable))

plot of chunk unnamed-chunk-26

Data Visualization Tidbit

Practice Problem #5: Add axes

In R: Final to-do: label your axes! Bonus: Remove the title.


mosaicplot(t(troutfryTable), 
           xlab="Fry source", 
           ylab="Relative frequency", 
           main="", 
           cex = 1.5, 
           cex.sub = 1.5, 
           col = c("forestgreen", "goldenrod1"))

Practice Problem #5: Add axes

In R: Final to-do: label your axes! Bonus: Remove the title.


plot of chunk unnamed-chunk-27

Assignment Problem #33

The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?

Assignment Problem #33 - Questions

The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?

Question: Experimental or observational?

Answer: Observational

Assignment Problem #33 - Questions

The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?

Question: Explanatory variables with type?

Answer: Birth month - categorical with levels “Jan” and “Dec”.

Assignment Problem #33 - Questions

The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?

Question: Response variables with type?

Answer: ADHD diagnosis - categorical with levels “ADHD” and “no ADHD”.

Assignment Problem #33: ADHD - Load data

In R: Load the dataset and look at the structure of the data.


adhd <- read.csv(paste0(here::here(), "/Homeworks/Homework2/chap02q33BirthMonthADHD.csv"))
str(adhd)
'data.frame':   4 obs. of  3 variables:
 $ birthMonth : chr  "January" "January" "December" "December"
 $ diagnosis  : chr  "ADHD" "no ADHD" "ADHD" "no ADHD"
 $ frequencies: int  2219 36917 2870 36107

Assignment Problem #33: ADHD - Look at data

Question: Is this processed or raw data?


  birthMonth diagnosis frequencies
1    January      ADHD        2219
2    January   no ADHD       36917
3   December      ADHD        2870
4   December   no ADHD       36107

Answer: Processed! They have already counted the data!

In R: Transform this into a more appropriate form.

Assignment Problem #33: ADHD - Transform data

In R: Transform this into a more appropriate form.


(adhdMatrix <- matrix(adhd$frequencies, 
                      nrow = 2, 
                      ncol = 2, 
                      dimnames = list(c("Diagnosed ADHD", "Not diagnosed"), 
                                      c("Jan","Dec"))))
                 Jan   Dec
Diagnosed ADHD  2219  2870
Not diagnosed  36917 36107
  birthMonth diagnosis frequencies
1    January      ADHD        2219
2    January   no ADHD       36917
3   December      ADHD        2870
4   December   no ADHD       36107

Assignment Problem #33: ADHD - Hint for HW

Look at Chapter 2 R-Markdown, Example 2.3A. Bird malaria to see how to plot with data of this type. In particular, note that grouped barplot (using the barplot command) and mosaic plots (using the mosaicplot command) can work on matrices or tables.

Assignment Problem #35

The following data are from Mattison et al. (2012), who carried out an experiment with rhesus monkeys to test whether a reduction in food intake extends life span (as measured in years). The data are the life spans of 19 male and 15 female monkeys who were randomly assigned a normal nutritious diet or a similar diet reduced in amount by 30%. All monkeys were adults at the start of the study.

  1. Graph the results, using the most appropriate method and following the four principles of good graph design.
  2. According to your graph, which difference in life span is greater: that between the sexes, or that between diet groups?

Other questions: Experimental or observational? Explanatory and response variables?

Assignment Problem #35

The following data are from Mattison et al. (2012), who carried out an experiment with rhesus monkeys to test whether a reduction in food intake extends life span (as measured in years). The data are the life spans of 19 male and 15 female monkeys who were randomly assigned a normal nutritious diet or a similar diet reduced in amount by 30%. All monkeys were adults at the start of the study.

Answer: This is an experimental procedure.

Answer: Explanatory variables are sex and food intake, both of which are categorical and have 2 levels.

Answer: Response variable is life span, which is numerical and continuous.

Assignment Problem #35 - Look at data

'data.frame':   34 obs. of  3 variables:
 $ sex          : chr  "female" "female" "female" "female" ...
 $ foodTreatment: chr  "reduced" "reduced" "reduced" "reduced" ...
 $ lifespan     : num  16.5 18.9 22.6 27.8 30.2 30.7 35.9 23.7 24.5 24.7 ...

Categories of the sex variable:

[1] "female" "male"  

Categories of the foodTreatment variable:

[1] "control" "reduced"

Assignment Problem #35 - Look at data

head(foodforlife)
     sex foodTreatment lifespan
1 female       reduced     16.5
2 female       reduced     18.9
3 female       reduced     22.6
4 female       reduced     27.8
5 female       reduced     30.2
6 female       reduced     30.7

Question: Is this data in the right form for graphing?

Assignment Problem #35 - Trick

Depending on the graph that you choose, the first two arguments to the function should be

functionname(lifespan ~ foodTreatment * sex, data=mydata, …)

where functionname is the name of the plotting function in R that you chose, and mydata is the name of your data frame.

Assignment Problem #35 - Another way

You can combine two columns into one column using the unite function in the tidyr package (we'll learn more later about this package).

foodforlife <- tidyr::unite(foodforlife, category, c("sex", "foodTreatment"), sep="-")
str(foodforlife)
'data.frame':   34 obs. of  2 variables:
 $ category: chr  "female-reduced" "female-reduced" "female-reduced" "female-reduced" ...
 $ lifespan: num  16.5 18.9 22.6 27.8 30.2 30.7 35.9 23.7 24.5 24.7 ...

The values of the new category variable are:

[1] "female-control" "female-reduced" "male-control"   "male-reduced"