M. Drew LaMar
August 26, 2020
“Maturity of mind is the capacity to endure uncertainty.”
- John Finley
Reminder: Whether a sample is a
random sample depends on:
(a) What is the population of interest?
(b) What is considered areplicate ?
Definition: A
replicate is an independent unit of a sample from a population.
Treating replicates as if they were independent when they are not is called pseudoreplication.
Pseudoreplication ends up (1) giving a false sense of increased precision, and (2) reduces the generality of the results.
Assignment Problem 21
In a study of heart rate in ocean-diving birds, researchers harnessed 10 randomly sampled, wild-caught cormorants to a laboratory contraption that monitored vital signs. Each cormorant was subjected to six artificial “dives” over the following week (one dive per day). A dive consisted of rapidly immersing the head of the bird in water by tipping the harness. In this way, a sample of 60 measurements of heart rate in diving birds was obtained. Do these 60 measurements represent a random sample? Why or why not?
Definition:
Variables are characteristics that differ among objects of interest.
Definition:
Data are the measurements of one or more variables made on a sample of objects of interest.
Data, essentially, is any measurement of the real world since
Categorical variable (qualitative)
Remember the factor
data type in R?
Numerical variable (quantitative)
Remember the numeric
data type in R?
Discuss: Would the fraction of birds in a large sample infected with avian flu virus be a discrete or continuous numerical variable?
Answer: Neither! The variable of interest here is actually categorical (nominal). Why?
Ask yourself the following questions:
Frequency distributions of univariate data
Type of data | Graphical method |
---|---|
Categorical | Bar graph |
Numerical | Histogram |
Showing association of bivariate data
Type of data | Graphical method |
---|---|
Two numerical | Scatter plot |
“ | Line plot |
” | Map |
Two categorical | Grouped bar graph |
“ | Mosaic plot |
Mixed | Strip chart |
” | Box plot |
“ | Multiple histograms |
” | Cumulative frequency distributions |
Data visualization is one step in exploratory data analysis.
Quote: …the first step in any data analysis or statistical procedure is to graph the data and look at it. Humans are a visual species, with brains evolved to process visual information. Take advantage of millions of years of evolution, and look at visual representations of your data before doing anything else.
- Whitlock & Schluter
Data visualization is one step in exploratory data analysis.
If you want to graph some data, you most likely will need to manipulate the data first to put it in the right form.
Strip chart of serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. (left panel)
Read the data and store in data frame (here named locustData). The following command uses read.csv
to grab the data from a file on the internet (on the current web site).
locustData <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02f1_2locustSerotonin.csv"))
The read.csv
command reads a CSV (comma-separated value) file. It's argument can be a file on your computer, or in this case, a location to the file on the web via a URL.
Question: So where is the data for the book?
Show the first few lines of the data, to ensure it read correctly. Determine the number of cases in the data.
head(locustData)
serotoninLevel treatmentTime
1 5.3 0
2 4.6 0
3 4.5 0
4 4.3 0
5 4.2 0
6 3.6 0
Show the first few lines of the data, to ensure it read correctly. Determine the number of cases in the data.
nrow(locustData)
[1] 30
Check the object type of the variables using str
.
str(locustData)
'data.frame': 30 obs. of 2 variables:
$ serotoninLevel: num 5.3 4.6 4.5 4.3 4.2 3.6 3.7 3.3 12.1 18 ...
$ treatmentTime : int 0 0 0 0 0 0 0 0 0 0 ...
Draw a stripchart (the tilde “~” means that the first argument below is a formula, relating one variable to the other).
stripchart(serotoninLevel ~ treatmentTime,
data = locustData,
method = "jitter",
vertical = TRUE,
xlab="Treatment time (hours)",
ylab="Serotonin (pmoles)",
cex.lab = 1.5)
Draw a stripchart (the tilde “~” means that the first argument below is a formula, relating one variable to the other).
A fancier strip chart, closer to that shown in Figure 2.1-2, by including more options.
# Stripchart with options
par(bty = "l") # plot x and y axes only, not a complete box
stripchart(serotoninLevel ~ treatmentTime,
data = locustData,
vertical = TRUE,
method = "jitter",
pch = 16,
col = "firebrick",
cex = 1.5,
cex.lab = 1.5,
las = 1,
ylab = "Serotonin (pmoles)",
xlab = "Treatment time (hours)",
ylim = c(0, max(locustData$serotoninLevel)))
A fancier strip chart, closer to that shown in Figure 2.1-2, by including more options.
?par
# Stripchart with options
par(bty = "l") # plot x and y axes only, not a complete box
stripchart(serotoninLevel ~ treatmentTime,
data = locustData,
vertical = TRUE,
method = "jitter",
pch = 16,
col = "firebrick",
cex.lab = 1.5,
las = 1,
ylab = "Serotonin (pmoles)",
xlab = "Treatment time (hours)",
ylim = c(0, max(locustData$serotoninLevel)))
A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.
A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.
Question: Experimental or observational?
Answer: Observational
A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.
Question: Explanatory variables with type?
Answer: Fry source, which is categorical variable with two levels, “hatchery” and “wild”.
A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.
Question: Response variables with type?
Answer: “Survival”, which is a categorical variable with two levels, “caught” and “not caught”.
Load the data:
troutfry <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02q05FrySurvival.csv"))
head(troutfry)
frySource survival
1 wild survived
2 wild survived
3 wild survived
4 wild survived
5 wild survived
6 wild survived
str(troutfry)
'data.frame': 8000 obs. of 2 variables:
$ frySource: chr "wild" "wild" "wild" "wild" ...
$ survival : chr "survived" "survived" "survived" "survived" ...
Looks like our data is in raw form, i.e. each row is an observation and each column is a measurement/variable.
(troutfryTable <- table(troutfry$frySource, troutfry$survival))
not caught survived
hatchery 3973 27
wild 3949 51
Notice the parenthesis around the assignment, which says to output the result.
Question: Anything off with this format?
(troutfryTable <- table(troutfry$frySource, troutfry$survival))
not caught survived
hatchery 3973 27
wild 3949 51
Notice the parenthesis around the assignment, which says to output the result.
Answer: Explanatory variable should be in the horizontal dimension!
(troutfryTable <- table(troutfry$frySource, troutfry$survival))
not caught survived
hatchery 3973 27
wild 3949 51
Notice the parenthesis around the assignment, which says to output the result.
(troutfryTable <- table(troutfry$survival, troutfry$frySource))
hatchery wild
not caught 3973 3949
survived 27 51
Question: Any other changes?
(troutfryTable <- table(troutfry$survival, troutfry$frySource))
hatchery wild
not caught 3973 3949
survived 27 51
Answer: Maybe “survived” should come before “not caught”, as that might be the most interesting.
(troutfryTable <- table(troutfry$survival, troutfry$frySource))
hatchery wild
not caught 3973 3949
survived 27 51
Question: How do we change the order of the levels for “survival”?
str(troutfry$survival)
chr [1:8000] "survived" "survived" "survived" "survived" "survived" ...
troutfry$survival <- factor(troutfry$survival, levels = c("survived", "not caught"))
str(troutfry$survival)
Factor w/ 2 levels "survived","not caught": 1 1 1 1 1 1 1 1 1 1 ...
(troutfryTable <- table(troutfry$survival, troutfry$frySource))
hatchery wild
survived 27 51
not caught 3973 3949
addmargins(troutfryTable)
hatchery wild Sum
survived 27 51 78
not caught 3973 3949 7922
Sum 4000 4000 8000
There's more than one way to shear a sheep…
t(addmargins(table(troutfry)))
frySource
survival hatchery wild Sum
survived 27 51 78
not caught 3973 3949 7922
Sum 4000 4000 8000
The t
command transposes the table (i.e. switch horizontal and vertical variable placement).
mosaicplot(troutfryTable)
Question: Explanatory variable along vertical axis. How to fix?
Answer: Transpose the table!
mosaicplot(t(troutfryTable))
mosaicplot(t(troutfryTable),
xlab="Fry source",
ylab="Relative frequency",
main="",
cex = 1.5,
cex.sub = 1.5,
col = c("forestgreen", "goldenrod1"))
The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?
The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?
Question: Experimental or observational?
Answer: Observational
The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?
Question: Explanatory variables with type?
Answer: Birth month - categorical with levels “Jan” and “Dec”.
The cutoff birth date for school entry in British Columbia, Canada, is December 31. As a result, children born in December tend to be the youngest in their grade, whereas those born in January tend to be the oldest. Morrow et al. (2012) examined how this relative age difference influenced diagnosis and treatment of attention deficit/hyperactivity disorder (ADHD). A total of 39,136 boys aged 6 to 12 years and registered in school in 1997 - 1998 had January birth dates. Of these, 2219 were diagnosed with ADHD in that year. A total of 38,977 boys had December birth dates, of which 2870 were diagnosed with ADHD in that year. Display the association between birth month and ADHD diagnosis using a table or graphical method from this chapter. Is there an association?
Question: Response variables with type?
Answer: ADHD diagnosis - categorical with levels “ADHD” and “no ADHD”.
adhd <- read.csv(paste0(here::here(), "/Homeworks/Homework2/chap02q33BirthMonthADHD.csv"))
str(adhd)
'data.frame': 4 obs. of 3 variables:
$ birthMonth : chr "January" "January" "December" "December"
$ diagnosis : chr "ADHD" "no ADHD" "ADHD" "no ADHD"
$ frequencies: int 2219 36917 2870 36107
birthMonth diagnosis frequencies
1 January ADHD 2219
2 January no ADHD 36917
3 December ADHD 2870
4 December no ADHD 36107
Answer: Processed! They have already counted the data!
In R: Transform this into a more appropriate form.
(adhdMatrix <- matrix(adhd$frequencies,
nrow = 2,
ncol = 2,
dimnames = list(c("Diagnosed ADHD", "Not diagnosed"),
c("Jan","Dec"))))
Jan Dec
Diagnosed ADHD 2219 2870
Not diagnosed 36917 36107
birthMonth diagnosis frequencies
1 January ADHD 2219
2 January no ADHD 36917
3 December ADHD 2870
4 December no ADHD 36107
Look at Chapter 2 R-Markdown, Example 2.3A. Bird malaria to see how to plot with data of this type. In particular, note that grouped barplot (using the barplot
command) and mosaic plots (using the mosaicplot
command) can work on matrices or tables.
The following data are from Mattison et al. (2012), who carried out an experiment with rhesus monkeys to test whether a reduction in food intake extends life span (as measured in years). The data are the life spans of 19 male and 15 female monkeys who were randomly assigned a normal nutritious diet or a similar diet reduced in amount by 30%. All monkeys were adults at the start of the study.
Other questions: Experimental or observational? Explanatory and response variables?
The following data are from Mattison et al. (2012), who carried out an experiment with rhesus monkeys to test whether a reduction in food intake extends life span (as measured in years). The data are the life spans of 19 male and 15 female monkeys who were randomly assigned a normal nutritious diet or a similar diet reduced in amount by 30%. All monkeys were adults at the start of the study.
Answer: This is an experimental procedure.
Answer: Explanatory variables are sex and food intake, both of which are categorical and have 2 levels.
Answer: Response variable is life span, which is numerical and continuous.
'data.frame': 34 obs. of 3 variables:
$ sex : chr "female" "female" "female" "female" ...
$ foodTreatment: chr "reduced" "reduced" "reduced" "reduced" ...
$ lifespan : num 16.5 18.9 22.6 27.8 30.2 30.7 35.9 23.7 24.5 24.7 ...
Categories of the sex
variable:
[1] "female" "male"
Categories of the foodTreatment
variable:
[1] "control" "reduced"
head(foodforlife)
sex foodTreatment lifespan
1 female reduced 16.5
2 female reduced 18.9
3 female reduced 22.6
4 female reduced 27.8
5 female reduced 30.2
6 female reduced 30.7
Question: Is this data in the right form for graphing?
Depending on the graph that you choose, the first two arguments to the function should be
functionname(lifespan ~ foodTreatment * sex, data=mydata, …)
where functionname
is the name of the plotting function in R that you chose, and mydata
is the name of your data frame.
You can combine two columns into one column using the unite
function in the tidyr
package (we'll learn more later about this package).
foodforlife <- tidyr::unite(foodforlife, category, c("sex", "foodTreatment"), sep="-")
str(foodforlife)
'data.frame': 34 obs. of 2 variables:
$ category: chr "female-reduced" "female-reduced" "female-reduced" "female-reduced" ...
$ lifespan: num 16.5 18.9 22.6 27.8 30.2 30.7 35.9 23.7 24.5 24.7 ...
The values of the new category
variable are:
[1] "female-control" "female-reduced" "male-control" "male-reduced"