M. Drew LaMar
September 6, 2021
“Maturity of mind is the capacity to endure uncertainty.”
- John Finley
Definition: A parameter is a quantity describing a population, whereas an estimate or statistic is a related quantity calculated from a sample.
Parameter examples: Averages, proportions, measures of variation, and measures of relationship
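The distinction can be made concrete with a small simulation. This is a minimal sketch, not part of the original slides: the population below is simulated, and the names population, param_mean, samp, and est_mean are hypothetical.

```r
set.seed(1)

# Hypothetical population: body lengths (mm) of 10,000 individuals (simulated)
population <- rnorm(10000, mean = 50, sd = 8)

# Parameter: computed from the entire population (rarely possible in practice)
param_mean <- mean(population)

# Estimate: computed from a random sample of 25 individuals
samp <- sample(population, size = 25)
est_mean <- mean(samp)

c(parameter = param_mean, estimate = est_mean)
```

The two numbers will generally differ a little; that difference is what the rest of the lecture is about.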
The two sides of the statistical coin:
Definition: A statistical hypothesis is a specific claim regarding a population parameter.
Definition: Hypothesis testing uses data to evaluate evidence for or against statistical hypotheses.
Example: A trapping study measures the rate of fruit fall in forest clear-cuts.
Example: A clinical trial is carried out to determine whether taking large doses of vitamin C benefits health of advanced cancer patients.
…well, most of the time.
Quote: “Huh?”
- Student
“Modern statisticians are familiar with the notion that any finite body of data contains only a limited amount of information on any point under examination; that this limit is set by the nature of the data themselves, and cannot be increased by any amount of ingenuity expended in their statistical examination: that the statistician's task, in fact, is limited to the extraction of the whole of the available information on any particular issue.”
- R. A. Fisher (biologist!)
There is desired and undesired information in data.
Goals:
Get accurate information by reducing bias (do we have the right signal?)
Get precise information by reducing sampling error due to random variation (increase signal-to-noise ratio)
Definition: Bias is a systematic discrepancy between the estimates we would obtain, if we could sample a population again and again, and the true population characteristic.
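A sketch of what "again and again" means, under stated assumptions: the population is simulated, and the biased scheme (larger individuals are more likely to be sampled) is a hypothetical example chosen only for illustration.

```r
set.seed(11)

# Hypothetical population of body lengths (mm), simulated
population <- runif(5000, min = 20, max = 80)
true_mean <- mean(population)

# Sample the population again and again under two sampling schemes
random_est <- replicate(1000, mean(sample(population, size = 25)))
biased_est <- replicate(1000,
                        mean(sample(population, size = 25, prob = population)))

# Average estimate under each scheme, compared with the true value
c(true = true_mean,
  random = mean(random_est),
  size_biased = mean(biased_est))
```

Random sampling gives estimates that center on the true mean, whereas the size-biased scheme is systematically too high no matter how many times it is repeated.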
Definition: Sampling error is the difference between an estimate and the population parameter being estimated caused by chance.
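Sampling error can be seen by drawing many random samples from the same population and watching the estimates scatter. The following is a rough sketch with a simulated population; the sample sizes 10 and 100 are arbitrary choices for illustration.

```r
set.seed(12)

# Hypothetical population (simulated)
population <- rnorm(10000, mean = 50, sd = 8)

# Many random samples at two sample sizes; each sample yields one estimate
est_n10  <- replicate(1000, mean(sample(population, size = 10)))
est_n100 <- replicate(1000, mean(sample(population, size = 100)))

# Spread of the estimates around the parameter: smaller spread = more precise
c(sd_n10 = sd(est_n10), sd_n100 = sd(est_n100))
```

Both sets of estimates are unbiased, but the larger samples are more precise, which is the signal-to-noise idea in the goals above.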
“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.”
- John Tukey
For your question, there is desired (signal) and undesired (noise) information in your data.
Goals:
“The aim … is to provide a clear and rigorous basis for determining when a causal ordering can be said to hold between two variables or groups of variables in a model…”
- H. Simon
The main assumption of all statistical techniques is that your data come from a random sample.
Definition: In a random sample, each member of a population has an equal and independent chance of being selected.
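In R, one way to draw such a sample from a list of population members is the sample function. A minimal sketch, assuming a hypothetical sampling frame of 200 numbered individuals:

```r
set.seed(42)

# Hypothetical sampling frame: ID numbers for every member of the population
ids <- 1:200

# Each ID has the same chance of selection; sampling is without replacement
chosen <- sample(ids, size = 20)
sort(chosen)
```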
Random sampling
In a recent study, researchers took electrophysiological measurements from the brains of two rhesus macaques (monkeys). Forty neurons were tested in each monkey, yielding a total of 80 measurements.
Do the 80 neurons constitute a random sample? Why or why not?
Lack of independence
If the 80 measurements were analyzed as though they constituted a random sample, what consequences would this have for the estimate of the measurement in the monkey population?
Incorrect precision of estimate (most likely underestimated)
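The consequence can be illustrated by simulation. This is a sketch under assumed values, not an analysis of the actual study: the mean of 10, the between-monkey standard deviation of 2, and the within-monkey standard deviation of 1 are all hypothetical.

```r
set.seed(7)

n_monkeys  <- 2
n_neurons  <- 40
between_sd <- 2   # assumed variation among monkeys
within_sd  <- 1   # assumed variation among neurons within a monkey

# Simulate one study: 40 neurons from each of 2 monkeys
one_study <- function() {
  monkey_means <- rnorm(n_monkeys, mean = 10, sd = between_sd)
  neurons <- rnorm(n_monkeys * n_neurons,
                   mean = rep(monkey_means, each = n_neurons),
                   sd = within_sd)
  c(estimate = mean(neurons),
    naive_se = sd(neurons) / sqrt(length(neurons)))  # treats 80 values as independent
}

sims <- replicate(2000, one_study())

# How much the estimate really varies from study to study,
# versus the standard error claimed by the naive analysis
actual_sd     <- sd(sims["estimate", ])
mean_naive_se <- mean(sims["naive_se", ])
c(actual = actual_sd, naive = mean_naive_se)
```

Under these assumptions the naive standard error is far smaller than the real study-to-study variation, because neurons from the same monkey are not independent.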
“Designing experiments is as much about learning to think scientifically as it is about the mechanics of the statistics that we use to analyse the data once we have it. It is about having confidence in your data, and knowing that you are measuring what you think you are measuring. It is about knowing what can be concluded from a particular type of experiment and what cannot.”
- Ruxton & Colegrave
Design your experiment so that:
“It might be said that the two major goals of designing experiments are to minimize random variation and account for confounding factors.”
- Ruxton & Colegrave
Definition: Random variation refers to the differences between measured values of the same variable taken from different experimental subjects.
Good experiments minimize or control for “unwanted” random variation, so that any variation due to the factors of interest can be detected more easily.
Definition: If we want to study the effect of variable A on variable B, but variable C also affects B, then C is a confounding factor.
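A simulated sketch of why this matters. It adds one assumption beyond the definition: C is also associated with A, which is the situation in which a confounder misleads. All variables and effect sizes here are hypothetical, and in this simulation A has no effect on B at all.

```r
set.seed(3)
n <- 5000

var_c <- rnorm(n)               # confounding factor C
var_a <- var_c + rnorm(n)       # A is associated with C
var_b <- 2 * var_c + rnorm(n)   # B depends on C only, not on A

# Ignoring C: A appears to affect B
naive_slope <- coef(lm(var_b ~ var_a))["var_a"]

# Accounting for C: the apparent effect of A largely disappears
adjusted_slope <- coef(lm(var_b ~ var_a + var_c))["var_a"]

c(naive = unname(naive_slope), adjusted = unname(adjusted_slope))
```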
“Designing effective experiments needs thinking about biology more than it does mathematical calculations.”
“Experimental design is about the biology of the system, and that is why the best people to devise biological experiments are biologists themselves.”
- Ruxton & Colegrave
Definition: Variables are characteristics that differ among objects of interest.
Definition: Data are the measurements of one or more variables made on a sample of objects of interest.
Data, essentially, is any measurement of the real world since
Categorical variable (qualitative)
Remember the factor data type in R?
Numerical variable (quantitative)
Remember the numeric data type in R?
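As a quick reminder, here is a small sketch of the two data types. The values and the names habitat and mass_g are made up for illustration.

```r
# Categorical variable stored as a factor (hypothetical values)
habitat <- factor(c("forest", "meadow", "forest", "wetland"))
class(habitat)
levels(habitat)    # the categories, in alphabetical order by default

# Numerical variable stored as numeric (hypothetical values)
mass_g <- c(12.4, 15.1, 11.8, 14.0)
class(mass_g)
mean(mass_g)       # arithmetic is meaningful for numerical variables
```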
Discuss: Would the fraction of birds in a large sample infected with avian flu virus be a discrete or continuous numerical variable?
Answer: Neither! The variable of interest here is actually categorical (nominal). Why?
Ask yourself the following questions:
Frequency distributions of univariate data
Type of data | Graphical method |
---|---|
Categorical | Bar graph |
Numerical | Histogram |
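Both graph types are available in base R. A minimal sketch using simulated data; the variable names and values are hypothetical.

```r
set.seed(10)

# Hypothetical data: one categorical and one numerical variable
species  <- sample(c("oak", "maple", "pine"), size = 60, replace = TRUE)
height_m <- rnorm(60, mean = 12, sd = 3)

# Categorical: tabulate the frequencies, then draw a bar graph
species_counts <- table(species)
barplot(species_counts, xlab = "Species", ylab = "Frequency")

# Numerical: a histogram bins the values and plots the frequency per bin
height_hist <- hist(height_m, xlab = "Height (m)", main = "")
```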
Showing association of bivariate data
Type of data | Graphical method |
---|---|
Two numerical | Scatter plot |
Two numerical | Line plot |
Two numerical | Map |
Two categorical | Grouped bar graph |
Two categorical | Mosaic plot |
Mixed | Strip chart |
Mixed | Box plot |
Mixed | Multiple histograms |
Mixed | Cumulative frequency distributions |
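One base R function for each row type of the table, sketched with R's built-in mtcars data set. That data set is a stand-in chosen only because it ships with R; it is not used in the lecture.

```r
# Two numerical variables: scatter plot
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Two categorical variables: grouped bar graph from a contingency table
am_by_cyl <- table(mtcars$am, mtcars$cyl)
barplot(am_by_cyl, beside = TRUE, legend.text = TRUE,
        xlab = "Number of cylinders", ylab = "Frequency")

# Mixed (numerical response, categorical explanatory): box plot
mpg_box <- boxplot(mpg ~ cyl, data = mtcars,
                   xlab = "Number of cylinders", ylab = "Miles per gallon")
```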
Data visualization is one step in exploratory data analysis.
Quote: …the first step in any data analysis or statistical procedure is to graph the data and look at it. Humans are a visual species, with brains evolved to process visual information. Take advantage of millions of years of evolution, and look at visual representations of your data before doing anything else.
- Whitlock & Schluter
If you want to graph some data, you most likely will need to manipulate the data first to put it in the right form.
Strip chart of serotonin levels in the central nervous system of desert locusts that were experimentally crowded for 0 (the control group), 1, and 2 hours. (left panel)
Read the data and store it in a data frame (here named locustData). The following command uses read.csv to read the data from a file; here the path is built with here::here(), which points at the project's root directory.
locustData <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02f1_2locustSerotonin.csv"))
The read.csv command reads a CSV (comma-separated values) file. Its argument can be the path to a file on your computer or a URL pointing to a file on the web.
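To try read.csv without the course files, a small round trip through a temporary file works. This is a sketch; the data frame demo and its columns are made up, and the URL in the final comment is a placeholder, not a real location.

```r
# Write a tiny, made-up data frame to a temporary CSV file
tmp  <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:3, group = c("a", "b", "a"))
write.csv(demo, tmp, row.names = FALSE)

# Read it back from the local path
demo_back <- read.csv(tmp)
str(demo_back)

# A URL is passed the same way (placeholder address, for illustration only):
# remote <- read.csv("https://example.org/path/to/file.csv")
```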
Question: So where is the data for the book?
Show the first few lines of the data, to ensure it read correctly. Determine the number of cases in the data.
head(locustData)
serotoninLevel treatmentTime
1 5.3 0
2 4.6 0
3 4.5 0
4 4.3 0
5 4.2 0
6 3.6 0
nrow(locustData)
[1] 30
Check the object type of the variables using str.
str(locustData)
'data.frame': 30 obs. of 2 variables:
$ serotoninLevel: num 5.3 4.6 4.5 4.3 4.2 3.6 3.7 3.3 12.1 18 ...
$ treatmentTime : int 0 0 0 0 0 0 0 0 0 0 ...
Draw a stripchart (the tilde “~” means that the first argument below is a formula, relating one variable to the other).
stripchart(serotoninLevel ~ treatmentTime,
data = locustData,
method = "jitter",
vertical = TRUE,
xlab="Treatment time (hours)",
ylab="Serotonin (pmoles)",
cex.lab = 1.5)
A fancier strip chart, closer to that shown in Figure 2.1-2, by including more options.
# Stripchart with options
par(bty = "l") # plot x and y axes only, not a complete box
stripchart(serotoninLevel ~ treatmentTime,
data = locustData,
vertical = TRUE,
method = "jitter",
pch = 16,
col = "firebrick",
cex = 1.5,
cex.lab = 1.5,
las = 1,
ylab = "Serotonin (pmoles)",
xlab = "Treatment time (hours)",
ylim = c(0, max(locustData$serotoninLevel)))
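The same formula interface can be tried on made-up data when the locust file is not at hand. In this sketch the data frame toy is simulated and is not the serotonin data; overlaying group means with tapply and points is an optional addition, not part of the figure in the book.

```r
set.seed(2)

# Simulated stand-in data: a numerical response measured in three groups
toy <- data.frame(
  response = c(rnorm(10, mean = 5, sd = 1),
               rnorm(10, mean = 7, sd = 2),
               rnorm(10, mean = 9, sd = 3)),
  hours = rep(c(0, 1, 2), each = 10)
)

stripchart(response ~ hours, data = toy,
           vertical = TRUE, method = "jitter", pch = 16,
           xlab = "Hours", ylab = "Response")

# Group means; stripchart draws the groups at x = 1, 2, 3
group_means <- tapply(toy$response, toy$hours, mean)
points(seq_along(group_means), group_means, pch = 3, cex = 2)
```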
?par
A study by Miller et al. (2004) compared the survival of two kinds of Lake Superior rainbow trout fry (babies). Four thousand fry were from a government hatchery on the lake, whereas 4000 more fry came from wild trout. All 8000 fry were released into a stream flowing into the lake, where they remained for one year. After one year, the researchers found 78 survivors. Of these, 27 were hatchery fish and 51 were wild. Display these results in the most appropriate table. Identify the type of table you used.
Question: Experimental or observational?
Answer: Observational
Question: Explanatory variables with type?
Answer: Fry source, which is a categorical variable with two levels, “hatchery” and “wild”.
Question: Response variables with type?
Answer: “Survival”, which is a categorical variable with two levels, “survived” and “not caught”.
Load the data:
troutfry <- read.csv(paste0(here::here(), "/Datasets/chapter02/chap02q05FrySurvival.csv"))
head(troutfry)
frySource survival
1 wild survived
2 wild survived
3 wild survived
4 wild survived
5 wild survived
6 wild survived
str(troutfry)
'data.frame': 8000 obs. of 2 variables:
$ frySource: chr "wild" "wild" "wild" "wild" ...
$ survival : chr "survived" "survived" "survived" "survived" ...
Looks like our data is in raw form, i.e. each row is an observation and each column is a measurement/variable.
(troutfryTable <- table(troutfry$frySource, troutfry$survival))
not caught survived
hatchery 3973 27
wild 3949 51
Notice the parentheses around the assignment, which tell R to print the result.
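A minimal sketch of the difference, using throwaway variables. The withVisible function is used here only to show whether R would print a result.

```r
x <- c(2, 4, 6)      # assignment alone: nothing is printed
(y <- c(2, 4, 6))    # wrapped in parentheses: assigned and printed

# withVisible() reports whether a result would be auto-printed
quiet <- withVisible(z <- 1)$visible      # FALSE
shown <- withVisible((z <- 1))$visible    # TRUE
c(quiet = quiet, shown = shown)
```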
Question: Anything off with this format?
Answer: Explanatory variable should be in the horizontal dimension!
(troutfryTable <- table(troutfry$survival, troutfry$frySource))
hatchery wild
not caught 3973 3949
survived 27 51
Question: Any other changes?
Answer: Maybe “survived” should come before “not caught”, as that might be the most interesting.
Question: How do we change the order of the levels for “survival”?
str(troutfry$survival)
chr [1:8000] "survived" "survived" "survived" "survived" "survived" ...
troutfry$survival <- factor(troutfry$survival, levels = c("survived", "not caught"))
str(troutfry$survival)
Factor w/ 2 levels "survived","not caught": 1 1 1 1 1 1 1 1 1 1 ...
(troutfryTable <- table(troutfry$survival, troutfry$frySource))
hatchery wild
survived 27 51
not caught 3973 3949
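The effect of the levels argument can be checked without the CSV file by rebuilding the raw data from the counts stated in the problem. In this sketch the data frame fry is a reconstruction, not the course file, although its column names follow the output above.

```r
# Rebuild one row per fish from the reported counts
fry <- data.frame(
  frySource = rep(c("wild", "hatchery"), each = 4000),
  survival  = c(rep(c("survived", "not caught"), times = c(51, 3949)),
                rep(c("survived", "not caught"), times = c(27, 3973)))
)

# Default: categories are ordered alphabetically
default_tab <- table(fry$survival, fry$frySource)

# Explicit level order: "survived" first
fry$survival <- factor(fry$survival, levels = c("survived", "not caught"))
ordered_tab  <- table(fry$survival, fry$frySource)

rownames(default_tab)
rownames(ordered_tab)
```

Only the order of the rows changes; the counts in each cell stay the same.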
addmargins(troutfryTable)
hatchery wild Sum
survived 27 51 78
not caught 3973 3949 7922
Sum 4000 4000 8000
There's more than one way to shear a sheep…
t(addmargins(table(troutfry)))
frySource
survival hatchery wild Sum
survived 27 51 78
not caught 3973 3949 7922
Sum 4000 4000 8000
The t command transposes the table (i.e., it switches the horizontal and vertical placement of the variables).
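Because both groups started with 4000 fry, the counts can be compared directly, but relative frequencies within each fry source are often easier to read. A sketch using prop.table, with the table rebuilt from the counts above; the name fry_tab is hypothetical.

```r
# Contingency table entered directly from the counts shown above
fry_tab <- as.table(matrix(
  c(27, 3973,    # hatchery: survived, not caught
    51, 3949),   # wild:     survived, not caught
  nrow = 2,
  dimnames = list(survival  = c("survived", "not caught"),
                  frySource = c("hatchery", "wild"))
))

# Proportions within each column (margin = 2), i.e. within each fry source
surv_prop <- prop.table(fry_tab, margin = 2)
round(surv_prop, 4)
```

Each column of surv_prop sums to 1, so the "survived" row gives the proportion of each group that was recovered: 27/4000 for hatchery fry and 51/4000 for wild fry.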
mosaicplot(troutfryTable)
Question: Explanatory variable along vertical axis. How to fix?
Answer: Transpose the table!
mosaicplot(t(troutfryTable))
mosaicplot(t(troutfryTable),
xlab="Fry source",
ylab="Relative frequency",
main="",
cex = 1.5,
cex.sub = 1.5,
col = c("forestgreen", "goldenrod1"))