Statistics and samples

Alban Guillaumet, Troy University

“Maturity of mind is the capacity to endure uncertainty.”

- John Finley

Objectives

Definition of statistics
- Parameter estimation
- Hypothesis testing
Sampling populations
Types of data, variables and studies

What is statistics?

Statistics is the study of methods:

for measuring aspects of populations from samples; and
for quantifying the uncertainty of the measurements.
- Statistics allows us to determine the likely magnitude of departure from the truth.

Statistical methods are essential in almost every area of biology.

Statistics has two main components

1 - Estimation, the process of inferring an unknown quantity of a population using sample data.
- Note that a parameter is a quantity describing a population (e.g., mean age at first reproduction), whereas an estimate, or statistic, is a related quantity calculated from a sample.

Statistics has two main components

1 - Estimation
- Statistical methods tell us how to best estimate parameters using our sample.
- The parameter is the truth, and the estimate an approximation of the truth, subject to error.
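The distinction between parameter and estimate can be sketched with a small simulation. Everything in it is an assumption made for illustration: the population is simulated (10,000 hypothetical ages at first reproduction), so its mean is known here, which is rarely the case in a real study.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: age at first reproduction (years) of 10,000 individuals.
population = [rng.gauss(3.0, 0.5) for _ in range(10_000)]

# The parameter describes the whole population (usually unknown in practice).
parameter = statistics.mean(population)

# The estimate (statistic) is computed from a random sample of 25 individuals.
sample = rng.sample(population, 25)
estimate = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.3f}")
print(f"estimate (sample mean):      {estimate:.3f}")
print(f"error of the estimate:       {estimate - parameter:+.3f}")
```

Changing the seed draws a different sample and gives a different estimate, while the parameter stays the same.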

Statistics has two main components

2 - Hypothesis testing, a method to determine how well a hypothesis about a population parameter fits the sample data

Definition: A statistical hypothesis is a specific claim regarding a population parameter.

Definition: Hypothesis testing uses data to evaluate evidence for or against statistical hypotheses.

Hypothesis testing

Example of hypothesis: in the peregrine falcon, females are bigger than males

Other examples of hypotheses:

Male red deer with larger antlers have greater reproductive success
Drug B increases the chance of recovery compared with Drug A
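To make the peregrine falcon hypothesis above concrete, here is a minimal sketch of one possible way to weigh evidence with data: a randomization test, a technique chosen for this illustration and not one named in these slides. The body masses are invented and are not real measurements.

```python
import random
import statistics

rng = random.Random(42)

# Invented body masses (g), for illustration only.
females = [1010, 980, 1105, 1060, 995, 1120, 1040, 1075]
males = [690, 720, 655, 705, 740, 670, 715, 700]

# Observed difference in mean mass between the two groups.
observed = statistics.mean(females) - statistics.mean(males)

def shuffled_difference(values, n_first, rng):
    """Mean difference after randomly relabelling the pooled values."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    return statistics.mean(shuffled[:n_first]) - statistics.mean(shuffled[n_first:])

pooled = females + males
n_perm = 2000
as_extreme = sum(
    shuffled_difference(pooled, len(females), rng) >= observed
    for _ in range(n_perm)
)

# Proportion of random relabellings with a difference at least as large as observed.
p_value = (as_extreme + 1) / (n_perm + 1)
print(f"observed difference: {observed:.1f} g, approximate p-value: {p_value:.4f}")
```

A small p-value means that a difference this large would rarely arise if sex labels were unrelated to body mass, which counts as evidence for the hypothesis in this made-up data set.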

Sampling populations

The image below shows the density of bullet holes on different sections of British Royal Air Force planes that returned to base from aerial sorties during WWII.

Question: In your opinion, which plane sections should be reinforced?

Sampling populations

Answer: On the engines and cockpit!

Remember: the data are a sample of planes, and it is biased. The whole population also includes all the planes that did NOT make it back to the base.
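The bias in the returning planes can be illustrated with a rough simulation. All numbers are assumptions made up for this sketch: hits are spread evenly over five sections, and a hit on the engine or cockpit is far more likely to bring the plane down.

```python
import random
from collections import Counter

rng = random.Random(7)

SECTIONS = ["engine", "cockpit", "fuselage", "wings", "tail"]
# Assumed probability that a single hit in a section downs the plane (invented values).
LOSS_PROB = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.05, "wings": 0.05, "tail": 0.05}

all_hits = Counter()       # hits on every plane (the whole population)
returned_hits = Counter()  # hits on planes that made it back (the biased sample)

for _ in range(5000):
    # Each plane takes 3 hits, spread uniformly over the sections.
    hits = [rng.choice(SECTIONS) for _ in range(3)]
    survived = all(rng.random() > LOSS_PROB[s] for s in hits)
    all_hits.update(hits)
    if survived:
        returned_hits.update(hits)

for s in SECTIONS:
    share_all = all_hits[s] / sum(all_hits.values())
    share_ret = returned_hits[s] / sum(returned_hits.values())
    print(f"{s:9s} share of hits, all planes: {share_all:.2f}  returned planes: {share_ret:.2f}")
```

Under these assumptions, engines and cockpits receive as many hits as any other section, yet they look almost untouched among the planes that return.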

Sampling populations

A population is the entire collection of individuals of interest. For example:
- All flying snakes in Borneo
- All the genes in the human genome
A sample is a smaller set of individuals selected from the population. For example:
- 10 flying snakes caught in the field
- 20 genes in the human genome
Note that the basic unit of sampling is NOT always an actual individual: it can be a gene, a single family, a species, etc.

Sampling populations

The researcher uses a sample to draw conclusions that, hopefully, will apply to the whole population.
Good samples are therefore a foundation of good science.

Properties of good samples

Sampling error is the difference, caused by chance, between an estimate and the population parameter being estimated.

The lower the sampling error, the higher the precision
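Sampling error and precision can be explored by drawing many random samples from the same population and looking at how much the estimates vary. This is a sketch under stated assumptions: the population is simulated, and sample size is used here as one well-known way to change precision, although the slide itself does not mention it.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical population of 20,000 trait values.
population = [rng.gauss(50.0, 10.0) for _ in range(20_000)]

def spread_of_estimates(n, n_repeats=500):
    """Standard deviation of the sample mean across repeated random samples of size n."""
    estimates = [statistics.mean(rng.sample(population, n)) for _ in range(n_repeats)]
    return statistics.stdev(estimates)

spread_small = spread_of_estimates(10)
spread_large = spread_of_estimates(100)

print(f"spread of estimates, n = 10:  {spread_small:.2f}")
print(f"spread of estimates, n = 100: {spread_large:.2f}")
```

The estimates from the larger samples cluster more tightly around the population mean: less sampling error, higher precision.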

Properties of good samples

Ideally, our estimate is also accurate (or unbiased), meaning that the average of estimates that we might obtain is centered on the true population value.

A second type of error, called bias, occurs when the sample systematically underestimates, or overestimates, the population parameter.

Properties of good samples

The major goal of sampling is to minimize sampling error and bias in estimates.

Random sampling

A key assumption of most standard statistical techniques is that your data come from a random sample.

Definition: In a random sample, each member of a population has an equal and independent chance of being selected.
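In practice, a random sample is often drawn by listing every unit of the population and letting a random number generator choose among them. The sketch below assumes such a list exists; the snake IDs are hypothetical.

```python
import random

rng = random.Random(11)

# Hypothetical sampling frame: an ID for every individual in the population.
population_ids = [f"snake_{i:03d}" for i in range(1, 201)]

# Draw 10 distinct IDs; every ID has the same chance of being selected,
# and every possible set of 10 IDs is equally likely.
sample_ids = rng.sample(population_ids, 10)
print(sample_ids)
```

The hard part in the field is usually building the list, not drawing from it.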

Random sample

1 - Every unit in the population must have an equal chance of being included in the sample.

Question: Give a few examples of sampling that does not satisfy this condition

Examples: picking the tallest flowers, capturing the birds that are easiest to catch (territorial males), surveying only the people who answer the phone.

Individuals that are harder to sample might differ in their characteristics from the rest of the population, so underrepresenting them would lead to bias.
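The "tallest flowers" example can be sketched as follows, using simulated flower heights (an assumption; no real data are involved). A random sample is compared with a sample made only of the most conspicuous plants.

```python
import random
import statistics

rng = random.Random(5)

# Hypothetical population: heights (cm) of 1,000 flowers.
heights = [rng.gauss(30.0, 5.0) for _ in range(1000)]
true_mean = statistics.mean(heights)

# Random sample: every flower has the same chance of being picked.
random_sample = rng.sample(heights, 20)
random_estimate = statistics.mean(random_sample)

# Biased sample: only the 20 tallest flowers are picked.
tallest_sample = sorted(heights, reverse=True)[:20]
biased_estimate = statistics.mean(tallest_sample)

print(f"population mean:           {true_mean:.1f} cm")
print(f"random-sample estimate:    {random_estimate:.1f} cm")
print(f"tallest-flowers estimate:  {biased_estimate:.1f} cm")
```

The random-sample estimate misses the population mean by chance alone, whereas the tallest-flowers estimate is pushed in one direction no matter how often the sampling is repeated.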

Random sample

2 - The selection of units must be independent, i.e. the selection of any one member of the population cannot be influenced by the selection of any other member.

Question: Give a few examples of sampling that does not satisfy this condition

Examples: surveying several members of the same household, testing the response to a treatment on members of the same family, selecting closely related species to investigate how body size changes with aridity.

With non-independent sampling, our sample size is effectively smaller than we think. This will cause us to miscalculate (typically overstate) the precision of the estimates.
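The loss of effective sample size can be illustrated with a simulation in which members of a household share part of their value. The model and its numbers are assumptions chosen for this sketch: both designs measure 40 individuals, but one takes them from 40 different households and the other from only 10.

```python
import random
import statistics

rng = random.Random(9)

def individual(household_effect):
    """One measurement: overall mean + shared household effect + individual noise."""
    return 10.0 + household_effect + rng.gauss(0.0, 0.5)

def independent_sample_mean(n=40):
    # Each individual comes from a different household.
    return statistics.mean([individual(rng.gauss(0.0, 1.0)) for _ in range(n)])

def clustered_sample_mean(n_households=10, per_household=4):
    values = []
    for _ in range(n_households):
        effect = rng.gauss(0.0, 1.0)  # shared by all members of the household
        values.extend(individual(effect) for _ in range(per_household))
    return statistics.mean(values)

spread_independent = statistics.stdev([independent_sample_mean() for _ in range(1000)])
spread_clustered = statistics.stdev([clustered_sample_mean() for _ in range(1000)])

print(f"spread of estimates, 40 independent individuals:    {spread_independent:.3f}")
print(f"spread of estimates, 10 households x 4 individuals: {spread_clustered:.3f}")
```

Both designs report n = 40, yet the clustered estimates vary much more, so treating them as 40 independent measurements would overstate the precision.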

Random sample (Class discussion)

In a recent study, researchers took electrophysiological measurements from the brains of two rhesus macaques (monkeys). Forty neurons were tested in each monkey, yielding a total of 80 measurements.

Do the 80 neurons constitute a random sample? Why or why not?

Sample of convenience

A sample of convenience is a collection of individuals easily available to the researcher.

What is the main problem associated with samples of convenience?

Response: bias.

Random samples are difficult to obtain, and researchers must sometimes deal with suboptimal samples. It is important to acknowledge the limitations of the study, and attempt to correct for them in future work.

Examples of problems due to Samples of convenience

Bullet holes in planes!
Collapse of the North Atlantic cod fishery: stock estimates relied on catch rates from fishing boats, which concentrate where fish are most abundant, so allowable catches were set too high.
Volunteer bias in human studies (e.g., low-income groups are overrepresented if participants are paid; older people are more likely to be at home and answer the phone).

Data & Variables

Now that we have a sample, we can start our study and measure variables.

Definition: Variables are characteristics that differ among objects of interest, such as individuals.

Definition: Data are the measurements of one or more variables made on a sample of objects of interest.

Variable

Numerical variable (quantitative: magnitude on a numerical scale); numeric data type in R
- Continuous
- Discrete
Categorical variable (qualitative); factor data type in R
- Nominal (levels have no inherent ordering)
- Ordinal (levels can be ordered)

Variable

Ex. of numerical variables:
- number of amino acids in a protein (discrete)
- body temperature (continuous)
Ex. of categorical variables:
- school attended (nominal)
- life stage; e.g., egg, larva, juvenile, subadult or adult (ordinal); note that life-stage values can be ordered, but the magnitude of the difference between consecutive values is not known.
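The four kinds of variables above can also be written down in code. The slides refer to R's numeric and factor types; the sketch below uses rough Python analogues instead, which is a choice made for this illustration, and all values are invented.

```python
from enum import IntEnum

# Numerical variables
n_amino_acids = 146       # discrete: a count (hypothetical value)
body_temperature = 37.2   # continuous: any value on a scale (hypothetical, in degrees C)

# Categorical variable, nominal: labels with no inherent order
school = "Troy University"

# Categorical variable, ordinal: the integer codes record order only,
# not the size of the difference between consecutive stages.
class LifeStage(IntEnum):
    EGG = 1
    LARVA = 2
    JUVENILE = 3
    SUBADULT = 4
    ADULT = 5

observed_stages = [LifeStage.ADULT, LifeStage.EGG, LifeStage.JUVENILE]
print([stage.name for stage in sorted(observed_stages)])
```

Sorting works for the ordinal variable because its levels are ordered, whereas sorting school names alphabetically would carry no biological or statistical meaning.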

Explanatory versus response variables

A frequent goal in statistics is to test how well one variable, or a set of variables (called explanatory variables), affects or predicts another variable of interest, called the response variable.
Which is which?
- blood pressure vs. risk of stroke
- survival rate vs. venom dose
- color of the ground vs. color of the bird

Types of studies

Definition: A study is experimental if the researcher assigns treatments randomly to individuals, whereas a study is observational if the assignment of treatments is not made by the researcher.
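Random assignment of treatments, the defining feature of an experimental study, can be sketched as follows. The subject names, the group labels and the equal split into two groups are assumptions made for this example.

```python
import random

rng = random.Random(21)

# Hypothetical list of 20 study subjects.
individuals = [f"subject_{i:02d}" for i in range(1, 21)]

# Shuffle a copy, then give the first half the treatment and the rest the control.
shuffled = individuals[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2

assignment = {name: "treatment" for name in shuffled[:half]}
assignment.update({name: "control" for name in shuffled[half:]})

for name in individuals:
    print(name, assignment[name])
```

Because chance alone decides who receives the treatment, the two groups should not differ systematically in any other characteristic.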

Reading Quiz (Class Discussion)

Q. Is the following study observational or experimental?

Psychologists tested whether the amount of illegal drug use differs between people suffering from schizophrenia and those not having the disease. They measured drug use in a group of schizophrenia patients and compared it with that in a similar-sized group of randomly chosen people.

Sub-Q:

What are the explanatory and response variables?
Are they categorical or numerical?

Reading Quiz (Class Discussion)

The explanatory variable takes the values “schizophrenic” and “not schizophrenic”, so it is a categorical variable.

The response variable, “drug use”, is a numerical variable.

The study is observational because the treatment groups (i.e., the values of the explanatory variable) were not assigned by the researchers!

In the news - Cervical cancer

Quote: We don’t include men in our calculation because they are not at risk for cervical cancer and by the same measure, we shouldn’t include women who don’t have a cervix
- Anne F. Rositch

What happens if you include women who do not have a cervix in your sample?
The sample is biased! -> Loss of accuracy (the mortality rate is underestimated)
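The direction of this bias follows from simple arithmetic on the denominator of the rate. The figures below are hypothetical and chosen only to keep the numbers round; they are not taken from the study.

```python
# Hypothetical figures, for illustration only.
deaths = 500                  # cervical cancer deaths in one year
women_total = 10_000_000      # all women, with or without a cervix
share_without_cervix = 0.20   # assumed share of women who are not at risk

# Only women who have a cervix are at risk.
women_at_risk = women_total * (1 - share_without_cervix)

rate_uncorrected = deaths / women_total * 100_000
rate_corrected = deaths / women_at_risk * 100_000

print(f"rate per 100,000, all women:         {rate_uncorrected:.2f}")
print(f"rate per 100,000, women at risk only: {rate_corrected:.2f}")
```

Counting women who cannot develop the disease inflates the denominator, so the uncorrected rate is lower than the rate among women actually at risk.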

In the news - Cervical cancer

Quote: The researchers found that black women have a mortality rate of 10.1 per 100,000. For white women, the rate is 4.7 per 100,000. Past estimates had those rates at 5.7 and 3.2, respectively. The new death rate for black women in the US is on par with that of developing countries.

Discuss: Does this warrant the “much deadlier” headline?

Discuss: Why do you think the death rate for black women in the US is higher?
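As a starting point for the headline discussion, the four rates quoted above can be compared directly. The sketch uses only the numbers in the quote; which comparisons are the relevant ones is a judgment call left to the discussion.

```python
# Mortality rates per 100,000 women, as quoted above.
new_black, new_white = 10.1, 4.7   # corrected estimates
old_black, old_white = 5.7, 3.2    # past estimates

# Ratio of the black to the white mortality rate, before and after correction.
ratio_old = old_black / old_white
ratio_new = new_black / new_white

# Relative increase of each rate after correction.
increase_black = new_black / old_black - 1
increase_white = new_white / old_white - 1

print(f"black/white rate ratio, past estimates: {ratio_old:.2f}")
print(f"black/white rate ratio, new estimates:  {ratio_new:.2f}")
print(f"increase after correction, black women: {increase_black:.0%}")
print(f"increase after correction, white women: {increase_white:.0%}")
```

Both rates go up after the correction, and the gap between the two groups widens in relative as well as absolute terms.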