Learning targets
Understand the different types and natures of variables.
Understand the terms ‘probability distribution’ and ‘histogram’.
Know examples of common data distributions and be able to simulate such data.
Be able to calculate probabilities and quantiles given a certain distribution.
Be able to explore a data set using a few simple commands.
Main variable types
The choice of a statistical test largely depends on the nature of the response variable (dependent or outcome variable) and on the type of the explanatory variable(s) (independent variables, predictors) (Tables 1 and 2). That means, first and foremost, we need to understand what type of variables we are dealing with and what their characteristics are. Generally speaking, variables represent measurements on some sort of scale, which, broadly speaking, can be continuous or discrete (Table 1). Variables on a continuous scale can take on any value between their minimum and maximum value and can be ordered sequentially. For example, biomass, height, age, fluxes and concentrations are continuous variables.

‘Discrete variable’ is an umbrella term for variables that represent information on a scale with distinct units such as whole numbers or categories. Discrete variables encompass categorical variables, which are also referred to as factors or nominal variables. A categorical variable comprises two or more categories without intrinsic ordering of the categories. Examples are gender, colour and species identity. The values which a categorical variable can take on are called levels, and the number of levels must not be confused with the number of factors in an experiment. For instance, in a plant fertiliser efficacy study with two different types of fertiliser and a control, we would speak of one treatment: fertiliser administration. Fertiliser administration is a factor with three levels: control, fertiliser 1 and fertiliser 2. A categorical variable that can take on exactly two levels is called a binary or binomial variable.

Ordinal variables are another type of discrete variable, similar to categorical variables but with a clear ordering of the levels. For example, suppose you are investigating invasive species whose dispersal potential is described as low, medium or high. You may not be able to put a number on these levels, but they can be ordered from low to high. Other examples are rating scales such as the classical 5-point Likert scale (e.g. Strongly Disagree - Disagree - Undecided - Agree - Strongly Agree). A short R sketch of these variable types follows Table 1.
Table 1. Main variable types.

| Type | Definition | Examples |
|------|------------|----------|
| Continuous | A variable that can take on any value between its minimum and maximum | Height, distance, mass, temperature, speed, … |
| Discrete: categorical (syn. factor, nominal variable; if only 2 levels: binary/binomial) | A variable that has two or more categories without intrinsic ordering | Habitat, colour, species identity, presence/absence, dead/alive, genotype, gender, type of rock, … |
| Discrete: ordinal | Similar to categorical but with intrinsic ordering | Fertiliser application (low, medium, high), age classes, rating scales, … |
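In R, these variable types map onto numeric vectors, factors and ordered factors. A minimal sketch with made-up values:

## Continuous: a numeric vector
height <- c(1.62, 1.75, 1.58)

## Categorical: a factor; the fertiliser example is one factor with three levels
fert <- factor(c("control", "fert1", "fert2", "control"),
               levels = c("control", "fert1", "fert2"))
nlevels(fert) # 3 levels, but still a single factor

## Ordinal: an ordered factor with an intrinsic ordering of its levels
dispersal <- factor(c("low", "high", "medium"),
                    levels = c("low", "medium", "high"), ordered = TRUE)
dispersal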
Looking at data - distributions
Knowing about the distribution of your data is important and one of the first things you do in data analysis. Before you calculate summary statistics like the mean, the median or the standard deviation, you should simply look at your data. It is best to start off looking at each variable separately, and then look at the relationships between the variables. Generally, we check whether a distribution looks more or less normal. ‘Normal’ here does not have its usual meaning, but refers to the normal or Gaussian distribution. It is important to know that the normal distribution (1) is of central importance among distributions (if you are interested, look up the central limit theorem), and (2) is part of some fundamental assumptions in all basic statistical tests. Even more importantly, you should understand very well what is meant by the word ‘distribution’, be able to draw and interpret a histogram or density plot for a given variable, and be able to simulate random numbers that follow a prescribed distribution. In histograms, the data are binned, and then the frequencies (number of observations per bin) are plotted. In density plots, we look at relative frequencies, scaled so that the area under the curve is equal to one.

Let’s quickly collect a small data set in the form of our heights given in metres. The data collected in class will simply be called heights.
heights <- c(1.6, 1.63, 1.60, 1.61, 1.65, 1.67, 1.62, 1.55, 1.60, 1.7, 1.45, 1.75, 1.76, 1.62,
1.67, 1.65, 1.72, 1.65, 1.55, 1.62, 1.73, 1.53, 1.55, 1.68, 1.60, 1.66, 1.90)
## Frequency histogram (bins data and counts the number of observations within each bin)
hist(heights) # data looks normally distributed

## Density histogram (setting the freq argument to FALSE or F returns the probability density)
## Read the hist() help file.
hist(heights, freq = F)

A kernel density estimate describes the probability density across the range of observed values (in this example, our heights). The density itself is not a probability; probabilities are obtained cumulatively, i.e. by successive addition (integration) of the density over a range of values, and accumulating over the entire data range yields a total area under the curve of 1 (100%). In other words, each possible height value within the data range has a certain probability of being drawn if we take a random sample from our data: height values close to the mean have higher probabilities, and the very small and very large values have lower ones. Let’s derive such a kernel density estimate for our heights data:
## Derive a kernel density estimate and plot it
dens <- density(x = heights)
## Add curve to an existing histogram
hist(heights, freq = F, main = NA, ylim = c(0, 6)) # main = NA suppresses title
lines(dens)

## Or plot it separately
plot(dens)

We can check that the area under the curve is indeed (approximately) one by calculating it. To achieve this, we need to install and subsequently load a new package called flux. We then feed the x- and y-values computed by the density function into the auc (area under curve) function. We can do this by using the $ operator, which allows us to extract single variables (columns) from a data frame or list object.
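## install.packages("flux") # run this once if the package is not yet installed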
library(flux)
## Loading required package: caTools
## This is flux 0.3-0
auc(x = dens$x, y = dens$y)
## [1] 1.000877
Now, let’s try the same thing using simulated numbers that follow a normal distribution. R has built-in functions to generate random numbers from various distributions. Continuous variables often follow a normal (Gaussian) distribution, and the rnorm function can be used to produce normally distributed random numbers. Type ?rnorm to learn more…
## Without specifying a mean or standard deviation (sd), the 'rnorm' function
## returns values drawn from the standard normal distribution, i.e., mean = 0, sd = 1
pop <- rnorm(n = 1000)
hist(pop, main = NA)

## Let's specify a mean of 10. Since we omitted the sd, R assumes sd = 1
pop <- rnorm(n = 1000, mean = 10)
hist(pop)

dens <- density(pop, bw = 1)
plot(dens, main = NA)

auc(x = dens$x, y = dens$y)
## [1] 1.000971
In order to get a better grasp of the cumulative distribution concept, let’s run a little test of our understanding… If the area under the curve sums to one, then the cumulative probability of a random sample having a value that is smaller than or equal to the mean (here 10) should be 50% (0.5). The mean is the summit of the curve, hence we are talking about the left half of the curve and therefore half the area under the curve, which is 0.5. If this holds and we randomly sample a bunch of values from our pop data, then we would expect around half of those sampled values to be smaller than or equal to 10, right? We can easily check this by drawing a random sample of, say, 100 values, around 50 of which we would expect to be <= 10. The sample function allows us to draw a random sample from a data set. Type ?sample for more information.
x <- sample(x = pop, size = 100, replace = F)
length(x[x <= 10])
## [1] 51
## Divided by the sample size (100) we obtain the proportion
length(x[x <= 10])/100
## [1] 0.51
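Equivalently, since R treats TRUE as 1 and FALSE as 0, the proportion can be obtained directly by averaging the logical vector:

mean(x <= 10) # proportion of sampled values <= 10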
Each time you run the above sampling procedure, you will get a slightly different result because we draw a random sample of 100 values out of our 1000 normally distributed numbers stored in pop. However, most times the result will be reasonably close to 50.
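You can see this for yourself by repeating the sampling many times, for example with the replicate function (a small sketch; the exact numbers will differ from run to run):

## Repeat the sampling 1000 times and inspect the spread of the counts
counts <- replicate(1000, sum(sample(x = pop, size = 100) <= 10))
summary(counts) # centred around 50
hist(counts, main = NA)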
Calculating probabilities and quantiles
Frequentist statistics relies on distributions. We need to ask questions like ‘What is the probability of observing a value equal to or smaller than a given threshold?’ In other words: how rare are certain observations? All simple statistical tests work like this: you set yourself a threshold beyond which an observation is considered ‘unusually rare’. The term ‘statistically significant’ rests on this principle. It is therefore worthwhile having a look at how to calculate probabilities and quantiles for the most common distribution of all, the normal distribution. Explain the use of pnorm() and qnorm() as shown during the lab, and illustrate with examples.
Let’s start with quantiles. What are quantiles? Quantiles are cutpoints that divide the data into equally sized parts. The value that divides the data into two equal halves is the 50% quantile, which is commonly called the median.
## Dummy data
y <- 0:10
## Derive the 50% quantile
quantile(y, probs = 0.5)
## 50%
## 5
##...which is the median
median(y)
## [1] 5
## Derive multiple quantiles simultaneously (25%, 50%, 75%).
quantile(y, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75%
## 2.5 5.0 7.5
## The above quantiles make up a boxplot.
bp <- boxplot(y)

bp
## $stats
## [,1]
## [1,] 0.0
## [2,] 2.5
## [3,] 5.0
## [4,] 7.5
## [5,] 10.0
## attr(,"class")
## 1
## "integer"
##
## $n
## [1] 11
##
## $conf
## [,1]
## [1,] 2.61806
## [2,] 7.38194
##
## $out
## numeric(0)
##
## $group
## numeric(0)
##
## $names
## [1] "1"
We can easily compute the numeric value (quantile) of a normal distribution associated with a given probability using the qnorm function, or, the other way around, we can get the probability (P-value) associated with a given quantile (a given numeric value following this distribution) using the pnorm function. In other words, the qnorm function asks: ‘Which normally distributed random number matches the given probability?’ The pnorm function answers the question: ‘What is the probability of obtaining a value smaller than x?’, where x is the normally distributed random number of interest. The standard normal distribution has a mean of 0 and a standard deviation of 1. Let’s focus on the standard normal distribution for a little while to better understand the pnorm and qnorm functions. If the area under a density curve is 1, then the probability of drawing a random number that is smaller than or equal to the mean from a standard normal distribution should be 0.5, since the normal distribution is symmetric around its mean.
plot(density(rnorm(1000), bw = 1), main = NA, xlab = NA)
abline(v = 0, lty = 3) # vertical line indicating the mean

## Which value (or smaller) has a probability of 50% (0.5) to be sampled?
qnorm(p = 0.5) # no mean or sd specified, i.e., standard normal distribution
## [1] 0
## Same question but the mean is 10, so the value we're looking for should also be 10.
qnorm(p = 0.5, mean = 10) # Here we specified a normal distribution with a mean of 10.
## [1] 10
## For the standard normal distribution, 95% of the numbers lie between
## -1.96 and 1.96 => computation of confidence intervals relies on this fact
## We can see this by removing the lower and upper 2.5% of the probabilites
## (leaving 95%)...
qnorm(p = c(0.025, 0.975))
## [1] -1.959964 1.959964
## What is the probability of sampling a value smaller than or equal to zero?
pnorm(q = 0)
## [1] 0.5
## What is the probability of sampling a value smaller than or equal to 10
## given a normal distribution with a mean of 10?
pnorm(q = 10, mean = 10)
## [1] 0.5
pnorm(q = 9.5, mean = 10)
## [1] 0.3085375
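Note that pnorm and qnorm are inverse functions of each other, which makes for a handy sanity check:

## qnorm undoes pnorm and vice versa
qnorm(p = pnorm(q = 1.5)) # returns 1.5
pnorm(q = qnorm(p = 0.3)) # returns 0.3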
We can now ask probability questions, for instance: what is the probability of randomly drawing a value of 365 or higher from a data set that has a mean of 300 and a standard deviation of 50? By default, the pnorm function computes the cumulative probability starting from the lower tail, i.e. the left-hand side of the normal curve. So, if we ask about the probability of a certain value or higher, we need to say so by setting the argument lower.tail to FALSE, which returns the cumulative probability in the upper tail, i.e. starting from the provided value (here asking for the probability of drawing a value as high as 365 or higher).
hist(rnorm(n = 1000, mean = 300, sd = 50), main = NA)

pnorm(q = 365, mean = 300, sd = 50, lower.tail = F)
## [1] 0.09680048
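Because the total area under the curve is one, the same upper-tail probability can also be obtained as the complement of the lower-tail probability:

## Equivalent: one minus the lower-tail probability
1 - pnorm(q = 365, mean = 300, sd = 50)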
rn <- rnorm(n = 1000, mean = 100, sd = 10)
hist(rn, freq = F, main = NA, ylim = c(0, 0.045))
dens <- density(rn)
lines(dens)

With our new tools we can also ask questions such as: what is the probability of randomly drawing a number between 85 and 95 from a normally distributed data set with a mean of 100 and a standard deviation of 10 (see the orange area in the figure below)? This can easily be computed by subtracting the probability associated with the smaller number from the probability associated with the larger number, like this:
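The objects xv and yv used below (and xval and yval further down) are not defined anywhere in the text; a plausible reconstruction uses the theoretical normal density between 85 and 95 to shade the region:

## Hypothetical reconstruction of the shaded region (object names assumed)
xv <- xval <- seq(85, 95, length.out = 100)
yv <- yval <- dnorm(xval, mean = 100, sd = 10)
polygon(c(85, xv, 95), c(0, yv, 0), col = "orange", border = NA)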
lines(xv, yv)

pnorm(q = 95, mean = 100, sd = 10) - pnorm(q = 85, mean = 100, sd = 10)
## [1] 0.2417303
## The pnorm approach gives the same result as an integration
integrate(dnorm, lower = 85, upper = 95, mean = 100, sd = 10)
## 0.2417303 with absolute error < 2.7e-15
## ...or the area under the curve
auc(x = xval, y = yval)
## [1] 0.2417265
Answer: the probability of randomly drawing a number between 85 and 95 from data following the above distribution is about 24.2 %.
Example question for the exam
Make up a short data frame that contains a continuous response variable (e.g. size) and a grouping variable (e.g. groups A, B, and C). The dimensions of the data frame should be 30 rows by 2 columns. Use a few functions to explore the data set and draw a box plot that shows the differences between groups. Then create a second data frame (give it a name) that contains only the data for groups B and C.
Now plot histograms of samples of different sizes (10, 100, 1000, …) of normally distributed random numbers. Also, use different means and standard deviations. You can show 4 plots on one panel if you insert the command par(mfrow = c(2, 2)) before you start plotting.
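A possible sketch of such an answer (the example data are made up; any similar data frame will do):

## Example data: 30 observations of size, 10 per group
set.seed(1) # for reproducibility
dat <- data.frame(size = rnorm(n = 30, mean = 20, sd = 4),
                  group = rep(c("A", "B", "C"), each = 10))
dim(dat); str(dat); summary(dat) # explore the data set
boxplot(size ~ group, data = dat) # differences between groups

## Second data frame containing only groups B and C
datBC <- dat[dat$group %in% c("B", "C"), ]

## Histograms of normal samples of different sizes, means and sds
par(mfrow = c(2, 2))
hist(rnorm(n = 10)); hist(rnorm(n = 100, mean = 5, sd = 2))
hist(rnorm(n = 1000, mean = 50, sd = 10)); hist(rnorm(n = 10000))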
Other distributions
Apart from the normal distribution, let us introduce two more distributions: the uniform distribution and the Poisson distribution. (Later, we will also use the t-distribution, the Chi-squared distribution, and the F-distribution.) Each of these distributions can be simulated in R. In the uniform distribution, data are equally likely to fall anywhere between a lower and an upper margin. For example, if you arrive at a bus stop with a service every 15 minutes, your waiting time (if you catch that bus, say, 30 times and ignore the timetable or don’t have a watch) will be uniformly distributed. The key characteristic of the uniform distribution is that it is truncated on both sides (it is impossible to wait a negative number of minutes, and it is equally impossible to wait more than 15 minutes). Simulating uniformly distributed numbers may be useful in designing a study, for example if you want to sample randomly along a transect:
runif(n = 10, min = 0, max = 100) # 10 random samples along a 100 m transect
[1] 73.38038 86.17605 47.94090 96.64878 88.27526 37.59530 21.26616
[8] 64.15704 45.54379 61.37383
The Poisson distribution applies to count data, e.g. if you count the number of crabs per square metre, or the number of cars passing a road per time unit. In this distribution, the variance equals the mean: if the mean increases, so does the variance. Therefore, when simulating count data, you only have to specify \(n\), the number of samples, and \(\lambda\), the mean (which is the same as the variance).
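We can quickly verify the mean-variance relationship with simulated counts (a small sketch; lambda = 4 is an arbitrary choice):

## Simulated counts, e.g. crabs per square metre
crabs <- rpois(n = 1000, lambda = 4)
mean(crabs); var(crabs) # both close to 4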
Use the commands runif() and rpois() to simulate and plot a uniform and a Poisson distribution each. Come up with an example each to illustrate what the numbers could represent.
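A possible sketch (the scenarios are made up): waiting times at the bus stop and counts of birds visiting a feeder per hour.

par(mfrow = c(1, 2))
hist(runif(n = 200, min = 0, max = 15), main = "Uniform") # waiting time (min)
hist(rpois(n = 200, lambda = 6), main = "Poisson") # birds per hour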
Exploring a data set
As an example of how you could start exploring a data set, we look at the ‘swiss’ data set that is built into R. For an explanation of the variables, type ?swiss. We first plot all the variables individually.
data(swiss) # load the built-in data set
names(swiss) # show the variable names
[1] "Fertility" "Agriculture" "Examination"
[4] "Education" "Catholic" "Infant.Mortality"
str(swiss)
'data.frame': 47 obs. of 6 variables:
$ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
$ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
$ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
$ Education : int 12 9 5 7 15 7 7 8 7 13 ...
$ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
$ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
summary(swiss)
Fertility Agriculture Examination Education
Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
Median :70.40 Median :54.10 Median :16.00 Median : 8.00
Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
Catholic Infant.Mortality
Min. : 2.150 Min. :10.80
1st Qu.: 5.195 1st Qu.:18.15
Median : 15.140 Median :20.00
Mean : 41.144 Mean :19.94
3rd Qu.: 93.125 3rd Qu.:21.70
Max. :100.000 Max. :26.60
par(mfrow = c(3, 2)) # divide plotting area up (3 x 2)
hist(swiss[, "Fertility"]); hist(swiss[, "Agriculture"]); hist(swiss[, "Examination"]); hist(swiss[, "Education"]); hist(swiss[, "Catholic"]); hist(swiss[, "Infant.Mortality"])

From this, we can see that the variable ‘Education’ is skewed to the right (tail to the right), and the variable ‘Catholic’ is bimodal, with most of the values close to zero or close to 100%. The other variables look reasonably normally distributed (although with percentage data this is never quite the case, because percentages are truncated at zero and 100). Last, we can look at the relationships between the variables:
pairs(swiss) # note that plot(swiss) does the same

This gives us a fair idea of how the variables are related; for instance, we can see that ‘Fertility’ is more or less linearly correlated with ‘Agriculture’ and ‘Infant.Mortality’. In summary, we have learnt what kind of data our data frame contains (continuous variables), about the individual variables’ distributions, and how the variables relate to each other. This is the prerequisite for anything that follows in the interpretation and analysis of data. What questions/hypotheses could we come up with that would require statistical tests? To practise, explore the data set ‘iris’ (use data(iris)) in a similar way.
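A possible starting point, along the lines of the swiss example above:

data(iris) # load the built-in data set
str(iris); summary(iris) # four continuous variables plus the factor Species
par(mfrow = c(2, 2))
hist(iris$Sepal.Length); hist(iris$Sepal.Width)
hist(iris$Petal.Length); hist(iris$Petal.Width)
pairs(iris[, 1:4], col = iris$Species) # relationships, coloured by species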