# Crucial settings
# Load libraries
library(knitr)
library(tidyverse)
library(readxl)
Example: Instead of saying “bananas are yellow fruits” we can say “bananas are fruits that absorb all the light except for 550-590 nm”. It sounds a bit awkward for such a silly example, but yellow is a kind of social/cultural agreement, while the range of ca 550-590 nm of the light spectrum (or the hex code #FFFF00) is an objective statement.
Example: Examining two groups of students consisting of people of various heights, we may want to say in which group people are taller/shorter. However useless that is, we should not make our “judgment” based on a random comparison of each pair of people from the two groups. To be objective, we first need to quantify a parameter (or parameters) for each group that somehow grasps an essential group characteristic (e.g. calculate the mean height in the group) and then compare this parameter between the two groups (i.e. using a proper statistical test).
Before anything else, it is important to say that many natural phenomena, behaviours, etc., can be described with a variable that exhibits a normal distribution (Fig. 1) - the distribution of the variable in the population.
Even if the variable in question does not exhibit a normal distribution, very often the sampling distribution of the mean is normally distributed, owing to the central limit theorem, CLT (more on the CLT here).
What’s the sampling distribution of the mean? Typically, you perform a study once, and you might calculate the mean of that one sample. Now, imagine that you repeat the study many times and collect the same sample size for each one. Then, you calculate the mean for each of these samples and graph them on a histogram. The histogram displays the distribution of sample means, which statisticians refer to as the sampling distribution of the mean.
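To see the CLT in action, here is a minimal sketch (an illustrative addition, using simulated data and the libraries loaded above): we repeatedly draw samples from a clearly non-normal (exponential) distribution and plot the distribution of the sample means - it comes out roughly bell-shaped.
# illustrative sketch: sampling distribution of the mean from a non-normal (exponential) population
set.seed(1) # for reproducibility
sample_means <- replicate(1000, mean(rexp(n = 30, rate = 1))) # 1000 samples of size 30, one mean per sample
ggplot(data = data.frame(x = sample_means), aes(x = x)) +
geom_density() +
theme_classic() +
labs(x = "Sample mean", y = "Density",
title = "Sampling distribution of the mean (n = 30, exponential population)")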
df <- data.frame(x = rnorm(100000)) # random data sampled from a normal distribution
ggplot(data = df, aes(x = x)) +
geom_density() +
theme_classic() +
labs(x = "A variable", y = "Density",
title = "Fig.1 A normal distribution")
Because of its properties!
It is symmetric -> so we can easily interpret values in regard to each other and determine the proportion of values that falls within certain distances from the mean.
The mean, median, and mode are all equal -> this simplifies life a lot.
Half of the population is less than the mean and half is greater than the mean -> again, it is easy to interpret the values in regard to each other and calculate the probability of occurrence of particular values.
These properties allow us to easily (in the sense, intuitively) and objectively quantify the reality (phenomena, behaviours, etc.) that we examine.
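For instance, thanks to the symmetry we know exactly what proportion of values falls within 1, 2 or 3 standard deviations of the mean. A quick check in R (a small illustrative sketch, not part of the bobry analysis below):
# proportion of values within 1, 2 and 3 standard deviations of the mean of a normal distribution
pnorm(1) - pnorm(-1) # ~0.68
pnorm(2) - pnorm(-2) # ~0.95
pnorm(3) - pnorm(-3) # ~0.997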
The best way to describe reality is to examine all the individuals/items of the population. But this is rarely possible, so we sample these individuals/items instead (i.e. we examine just a part of it, usually of unknown proportion), and based on these samples we draw a conclusion about the whole population. This whole process (called inferential statistics) is based on hypothesis testing.
The previous lecture was about hypothesis testing. If that was insufficient, watch it again and/or study the following materials:
–
When we perform a statistical test, what we actually do is compare two mutually exclusive statements about a population/groups/proportions/relationships - the null hypothesis and the alternative hypothesis.
There are two ways of doing so: parametric and nonparametric tests.
If our data are normally distributed (or our sample size is big enough) we can use parametric statistics/tests. What does a big enough sample size mean? Well, it depends on the analysis:
1-sample t-test: >20 data points
2-sample t-test: >15 data points in each group
One-way ANOVA:
For 2-9 groups, each group should have more than 15 observations.
For 10-12 groups, each group should have more than 20 observations.
A big advantage of parametric tests is that they have greater statistical power than non-parametric ones.
If our data are not normally distributed and/or the sample size is small, non-parametric tests are the solution. Also, for some phenomena/behaviours (ordinal, ranked data, multiple outliers) the median - the parameter used to describe a non-normal distribution - is a better descriptor than the mean.
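In practice, before choosing between the two families of tests it is worth looking at the data first. A minimal sketch of the usual informal checks (on a simulated vector x, just for illustration):
# quick, informal normality checks on a simulated numeric vector x
x <- rnorm(50, mean = 10, sd = 2)
hist(x) # does the histogram look roughly bell-shaped?
qqnorm(x); qqline(x) # do the points follow the straight line?
shapiro.test(x) # formal test; H0: the data come from a normal distribution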
Comparing groups - continuous vs discrete (grouping) variable/s
Test | Parametric tests of means | Parametric function in R | Nonparametric tests of medians | Nonparametric function in R |
---|---|---|---|---|
Analysis of a single value | 1-sample t-test | t.test(x, mu = …) | 1-sample Sign test, 1-sample Wilcoxon test | wilcox.test(x, mu = …) |
Comparison of two groups | 2-sample t-test | t.test(x, y, paired = FALSE/TRUE) | Mann-Whitney test | wilcox.test(x, y, paired = FALSE/TRUE) |
Comparison of >2 groups | One-way ANOVA | aov(y ~ A, data = …) | Kruskal-Wallis test | kruskal.test(y ~ A) |
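The >2-group row of this table is not covered by the worked questions below, so here is a minimal sketch on simulated data (group and variable names are made up for illustration):
# comparing >2 groups: one-way ANOVA (parametric) and Kruskal-Wallis (nonparametric)
sim <- data.frame(y = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 11)),
A = rep(c("g1", "g2", "g3"), each = 20))
summary(aov(y ~ A, data = sim)) # ANOVA table with the F test
kruskal.test(y ~ A, data = sim) # rank-based alternative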
Correlation analysis - continuous vs continuous variables
Test | Parametric test | Parametric function in R | Nonparametric test | Nonparametric function in R |
---|---|---|---|---|
Correlation | Pearson | cor.test(x, y, method = "pearson") | Spearman | cor.test(x, y, method = "spearman") |
Analysis of frequencies and proportions - frequency and contingency tables from categorical variables
Test | Function in R |
---|---|
Frequencies (cols vs rows, both >= 2) | chisq.test(table) |
Frequencies, table 2x2 | fisher.test(x) |
Proportions | prop.test() |
Obviously, there are more tests (and alternatives), and these are just the basics that everyone has to know to move forward ;)
A very nice tutorial on comparing means is here
A great tutorial on correlation analysis is here
And if you are interested in proportions, nice materials are here
Question 1
Let’s say someone said that beavers cut the most precious trees, like those of 50 cm or even wider diameter. Considering the “bobry” data we can verify this statement by asking the question: is the average diameter of trees (SR) cut by the beaver (regardless of the species) equal to or wider than 50 cm? So, this is a question about a single sample being tested against a hypothesized value.
Step 1 - prepare data
bobry <- read_excel("C:/Users/KWJ/Dropbox/Working files/Dydaktyka/Biostatistics/bobry.xlsx") # read data
bobry <- bobry %>%
mutate(SR = as.numeric(SR)) %>% # fix the variable (change from character to numeric)
filter(USZKODZ != "0") # we want to filter only cut trees, right!
Step 2 - check normality - density (or histogram) plot
ggplot(data = bobry) +
geom_density(aes(x = SR)) +
# geom_histogram(aes(x = SR)) +
theme_classic()
It does not really look normally distributed…
Step 3 - check the sample size
…even if the data are not normally distributed, owing to the CLT we could use a parametric approach if only the sample size is big enough. Is it?
tot <- sum(!is.na(bobry$SR))
# a cool trick was used here to calculate the number of valid cases (non-NA values) - sum(!is.na(...)). First we ask about the content of each cell with is.na(). The output of this function is a logical: if the value is NA the output is TRUE, if not it is FALSE. Since we are interested in non-NA values, we negate is.na() with the exclamation mark (!is.na()). This way, if the value in a cell is NA the output is FALSE, and if it is non-NA (so it is what we are interested in) the output is TRUE. Then we take advantage of the fact that TRUE is coded in R as 1, and FALSE as 0. So if we sum up all TRUEs, which in our case (with !is.na()) denote non-NA values, we get the desired sample size. Cool, right?
tot
[1] 467
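If the trick is still not clear, here is a toy illustration on a made-up vector (not part of the bobry data):
# toy example: counting non-NA values
v <- c(1.2, NA, 3.5, NA, 7)
is.na(v) # FALSE TRUE FALSE TRUE FALSE
!is.na(v) # TRUE FALSE TRUE FALSE TRUE
sum(!is.na(v)) # 3, because TRUE counts as 1 and FALSE as 0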
Well, 467 is not a bad number, let’s try the parametric test then.
Step 4a - test of the mean
Thus, we simply compare our value in question (mu), which is 50, with the mean SR calculated from the bobry data. Since the test is hypothesis testing, we simply check the null hypothesis (H0): if it is true we keep it, and if it is not true we reject it and accept the alternative hypothesis (H1).
Importantly, we need to consider which test we want to use here: a one-sided test or a two-sided test. With a one-sided test, our hypotheses sound a bit different.
H0 (one-sided - greater/less): the mean of SR IS NOT GREATER/LESS than mu = 50
H1 (one-sided - greater/less): the mean of SR IS GREATER/LESS than mu = 50
H0 (two-sided): the mean of SR IS equal to mu = 50
H1 (two-sided): the mean of SR IS NOT equal to mu = 50
If the concept of a one-sided/two-sided (or one-tailed/two-tailed) test is not clear to you, read more here
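As a small illustrative sketch (simulated data, not the bobry set), note how the choice of the alternative argument changes the p-value of the very same test:
# how the "alternative" argument affects the p-value (illustrative, simulated data)
set.seed(2)
z <- rnorm(30, mean = 48, sd = 5)
t.test(z, mu = 50, alternative = "less")$p.value # one-sided: is the mean less than 50?
t.test(z, mu = 50, alternative = "two.sided")$p.value # two-sided: is the mean different from 50?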
Let’s say we want to test the one-sided hypothesis. After all, we want to show that the mean diameter of trees cut by the beaver is not as wide as suggested (mu = 50).
ttest <- t.test(x = bobry$SR, alternative = "less", mu = 50)
# Note the argument "alternative", where we specify the side of the distribution we want to test (other possible options are "greater" and "two.sided"). The value we test against is given in the argument "mu", and the distribution we consider is x
ttest
One Sample t-test
data: bobry$SR
t = -102.86, df = 466, p-value < 2.2e-16
alternative hypothesis: true mean is less than 50
95 percent confidence interval:
-Inf 9.989759
sample estimates:
mean of x
9.338255
INTERPRETATION. The mean diameter of trees cut by the beaver is 9.34; in the worst-case scenario it is 9.99 (as specified by the confidence interval). This is way below mu = 50, thus we can reject the null hypothesis, as the probability of making a type I error is practically 0 (the p-value). Rejecting H0, we accept H1: the true mean is less than 50.
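Side note: the numbers quoted above are stored in the returned object (of class "htest"), so they can also be pulled out programmatically instead of being read off the printout:
# elements of the stored test result
ttest$estimate # the sample mean
ttest$conf.int # the confidence interval
ttest$p.value # the p-value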
Step 4b - test of the medians
If we are very, very conservative, and we do not want to use a parametric test (because of the skewed data distribution), we can perform the non-parametric alternative.
For the hypotheses, the only difference refers to the parameter being compared (here the location of the distribution, tested via rank sums):
H0 (one-sided - greater/less): the location of SR IS NOT GREATER/LESS than mu = 50
H1 (one-sided - greater/less): the location of SR IS GREATER/LESS than mu = 50
wtest <- wilcox.test(x = bobry$SR, alternative = "less", mu = 50)
wtest
Wilcoxon signed rank test with continuity correction
data: bobry$SR
V = 5, p-value < 2.2e-16
alternative hypothesis: true location is less than 50
INTERPRETATION: We can also reject H0 here, as the chance of a type I error is practically 0 (the p-value). Again, rejecting H0 we accept H1.
Final conclusion: The diameter of trees cut by the beaver is much narrower than the suggested 50 cm.
Question 2
Considering the bobry data we can ask the question, for example: is the diameter of trees (SR) of the “Ag” and “Pa” species (GAT) similar or different? So, this is a question about two groups (species) which are described by a continuous variable (diameter).
Step 1 - prepare data
bobry <- read_excel("C:/Users/KWJ/Dropbox/Working files/Dydaktyka/Biostatistics/bobry.xlsx") # read data
bobry <- bobry %>%
mutate(SR = as.numeric(SR)) %>% # fix the variable (change from character to numeric)
filter(USZKODZ != "0") %>% # we want to filter only cut trees, right!
filter(GAT == "Ag" | GAT == "Pa") %>% # we also want to filter the two species now in question
select(GAT, SR) # select valid columns (optional set)
Step 2 - check normality - density (or histogram) plot
ggplot(data = bobry) +
geom_density(aes(x = SR)) +
# geom_histogram(aes(x = SR)) +
facet_wrap(~GAT) +
theme_classic()
It does not really look normally distributed…
Step 3 - check the sample size
…even if the data are not normally distributed, owing to the CLT we could use a parametric approach if only the sample size is big enough. Is it?
tb <- bobry %>%
group_by(GAT) %>%
summarise(n = sum(!is.na(SR)))
# technical stuff: two new functions have been used here - group_by(), which allows you to group the data set by a given categorical variable (here species, GAT), and summarise(), which follows group_by() and calculates the summaries specified inside the function (here the number of cases with numeric values).
# the same sum(!is.na(...)) trick as in Question 1 is used here to count the non-NA values (see the explanation there).
kable(tb) # the kable() function adjusts the table layout for the markdown document; working in a regular R script you just need to type tb to see the results of the above calculations.
GAT | n |
---|---|
Ag | 53 |
Pa | 116 |
Yes, the sample size is definitely >15 for each group! So we can go for a parametric test.
Step 4a - test of the means
We simply compare the means of the two groups. Since the test is hypothesis testing, we simply check the null hypothesis (H0): if it is true we keep it, and if it is not true we reject it and accept the alternative hypothesis (H1).
H0: the difference between the means of the two groups IS NOT significantly different from 0
H1: the difference between the means of the two groups IS significantly different from 0
ttest2 <- t.test(bobry$SR ~ bobry$GAT)
ttest2
Welch Two Sample t-test
data: bobry$SR by bobry$GAT
t = -0.75761, df = 110.45, p-value = 0.4503
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.093335 1.829137
sample estimates:
mean in group Ag mean in group Pa
8.764151 9.896250
INTERPRETATION: The mean of Ag is 8.76 and the mean of Pa is 9.9. The difference between the two (given the variability of the sampled sets) is somewhere between -4.09 and 1.83 (as indicated by the confidence interval in the test output). Although we cannot be 100% sure about the exact value of the difference between the two means, the range does cover 0! So the difference between the means of the two species groups is not significantly different from 0. We cannot reject H0, as the chance of a type I error is pretty high, 0.45 (the p-value). Thus we keep it and say there is no significant difference in the diameter of the two species.
Step 4b - test of the medians
If we are very, very conservative, and we do not want to use a parametric test (because of the skewed data distribution), we can perform the non-parametric alternative.
For the hypotheses, the only difference refers to the parameter being compared (here the ranks):
H0: the ranks for the two groups ARE NOT different
H1: the ranks for the two groups ARE different
wtest2 <- wilcox.test(SR ~ GAT, data = bobry)
wtest2
Wilcoxon rank sum test with continuity correction
data: SR by GAT
W = 2662.5, p-value = 0.1624
alternative hypothesis: true location shift is not equal to 0
INTERPRETATION: Although we cannot say as much as for the parametric test, looking at the p-value, which is 0.162, we cannot reject H0, as the chance of an error is pretty high.
Final conclusion: The diameter of the examined Ag and Pa trees does not differ significantly.
Question 3
Considering the bobry data, let’s say we want to check whether there is any relationship between the diameter of trees cut by the beaver and the distance from the river. So we now focus on two continuous variables - the diameter (SR) and the distance (TRANS).
Step 1 - prepare data
bobry <- read_excel("C:/Users/KWJ/Dropbox/Working files/Dydaktyka/Biostatistics/bobry.xlsx") # read data (set you path appropriately!)
bobry <- bobry %>%
filter(USZKODZ != "0") %>% # we want to filter only cut trees
mutate(SR = as.numeric(SR)) %>% # fix the variable (change from character to numeric)
select(TRANS, SR) # optional
Step 2 and 3 - check normality and sample size
We already know how things look for the distribution and sample size (except for the transect variable), so we will not repeat this check now, but of course you can practice it on your own. If you do not remember how to do this, consult the relevant code above.
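If you do want to repeat it here, a sketch along the lines of the code above could look like this (assuming TRANS is read in as a numeric column):
# distribution of the distance variable and the number of valid cases, analogous to Steps 2-3 above
ggplot(data = bobry) +
geom_density(aes(x = TRANS)) +
theme_classic()
sum(!is.na(bobry$TRANS)) # sample size for TRANS
sum(!is.na(bobry$SR)) # sample size for SR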
Step 4 - parametric correlation
Let’s try the parametric approach first. We calculate the correlation coefficient and test whether it is different from 0 (as 0 means no correlation at all). So,
H0 - correlation coefficient IS equal to 0
H1 - correlation coefficient IS NOT equal to 0
cor_par <- cor.test(x = bobry$TRANS, y = bobry$SR, method = "pearson")
cor_par
Pearson's product-moment correlation
data: bobry$TRANS and bobry$SR
t = -5.8412, df = 287, p-value = 1.403e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4253444 -0.2188127
sample estimates:
cor
-0.3259623
INTERPRETATION: The correlation coefficient is -0.33, which is a moderate and negative value, meaning a moderate and negative relationship between the two tested variables. We can say that the larger the distance (the higher the TRANS value), the narrower the trees cut by the beaver (the lower the SR value). Importantly, this coefficient, although given as the precise number -0.33, is as a matter of fact something between -0.43 and -0.22, as specified by the confidence interval. Thus it is never 0, so we can reject the null hypothesis: our coefficient is significantly different from zero. Rejecting H0, we are likely to commit an error with a very, very small probability, practically 0 (the p-value).
Step 5 - non-parametric correlation
Again, if we are conservative, we can apply a non-parametric approach. As a matter of fact, in this particular case, given the nature of the variable (the distance), it is better to use the non-parametric correlation (but this is a bit different story, to be told another time!).
cor_spe <- cor.test(x = bobry$TRANS, y = bobry$SR, method = "spearman") # note it is childishly simple in R, we just need to change the argument "method" to "spearman"
cor_spe
Spearman's rank correlation rho
data: bobry$TRANS and bobry$SR
S = 5441631, p-value = 6.883e-10
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.3526705
INTERPRETATION: Now we have a slightly different coefficient (rho), which is understandable as this is a rank correlation. But the conclusion is the same: the coefficient is moderate and negative (-0.35), and significantly different from 0. We then reject H0, and doing so we are likely to commit an error with a very, very small probability, practically 0 (the p-value).
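It is also worth looking at the relationship itself; a simple sketch of a scatter plot with a fitted straight line (an illustrative addition, not part of the original analysis):
# visualizing the relationship between distance and diameter
ggplot(data = bobry, aes(x = TRANS, y = SR)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
theme_classic() +
labs(x = "Distance from the river (TRANS)", y = "Tree diameter (SR)")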
Question 4
Considering the bobry data, when we visualize the number of trees being cut by the beaver (thus USZKODZ > 0) we can see that it seems to vary among the species (perhaps indicating a beaver preference?). Are these frequencies indeed different?
# read data
bobry <- read_excel("C:/Users/KWJ/Dropbox/Working files/Dydaktyka/Biostatistics/bobry.xlsx") # read data (set you path appropriately!)
# tidy-up a bit
bobry_count <- bobry %>%
mutate(USZKODZ = as.numeric(USZKODZ)) %>%
filter(USZKODZ > 0 & GAT != "?")
# plot
ggplot(data = bobry_count) +
geom_bar(aes(x = GAT)) +
theme_classic() +
labs(title = "Fig. 2. Frequency of trees cut by the beaver with respect to the tree species")
To test it we form the null hypothesis, which can sound like this:
H0: Trees of particular species ARE EQUALLY common
H1: Trees of particular species ARE NOT EQUALLY common
We test H0 with the chi-squared goodness-of-fit test. This test compares the observed distribution to an expected distribution (and here the expected distribution assumes all species are equally common).
# a chi-square test is a bit tricky in R as it needs some data preparation: creating separate vectors with observed (x) and expected (p) frequencies.
# observed distribution
bobry_count_test <- bobry_count %>%
group_by(GAT) %>%
summarise(n = length(GAT))
x <- bobry_count_test$n
# technical stuff: two new functions have been used here - group_by(), which allows you to group the data set by a given categorical variable (here species, GAT), and summarise(), which follows group_by() and calculates the summaries specified inside the function (here the number of cases, using the function length(), which simply counts the number of elements in a vector).
# expected distribution
# it has to sum up to 1 and contain as many values as there are groups/classes in the observed distribution
# so we first need to divide 1 by the number of species groups/classes
p_temp <- 1/nrow(bobry_count_test)
# nrow(bobry_count_test) - the number of groups/classes is conveniently the number of rows in just prepared data frame
# then we need to repeat this value (p_temp) as many times as we have the number of groups/classes
p <- rep(p_temp, nrow(bobry_count_test)) # we do this using rep() - see help to figure out how the function works
# then the test is straightforward
chtest <- chisq.test(x = x, p = p)
chtest
Chi-squared test for given probabilities
data: x
X-squared = 853.79, df = 5, p-value < 2.2e-16
INTERPRETATION: The most important (in the sense informative) part of the test output is the p-value, which is practically 0 - super low. If so, we can reject the null hypothesis. Doing so, we accept the alternative hypothesis saying that the observed frequencies of the particular tree species are not equal.
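To see where the discrepancy comes from, the stored test object also keeps the observed and expected counts and the Pearson residuals (large positive residuals point to over-represented species):
# components of the chi-squared test result
chtest$observed # observed counts per species
chtest$expected # counts expected under H0 (all species equally common)
chtest$residuals # Pearson residuals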
Question 5
Considering the bobry data, when we focus on, let’s say, the “Ag” species, we can notice that the proportion of trees cut (USZKODZ > 0) and not cut (USZKODZ = 0) seems to differ from being equal. Is this really the case?
# read data
bobry <- read_excel("C:/Users/KWJ/Dropbox/Working files/Dydaktyka/Biostatistics/bobry.xlsx") # read data (set you path appropriately!)
# prepare
bobry_prop <- bobry %>%
filter(GAT == "Ag") %>%
mutate(USZKODZ = as.numeric(USZKODZ),
USZKODZ_cat2 = if_else(USZKODZ == 0, 0, 1)) %>%
group_by(USZKODZ_cat2) %>%
summarise(n = n())
# Technical stuff: For the purpose of the present question we need just two categories for the trees: 0 - not cut and 1 - cut. With the 0 category the issue is simple, as it is already specified like that in our data frame, but for the 1 category we have multiple levels (1, 2, 3... depending on the degree to which the tree was "treated" by the beaver). We are not really interested in all these degrees, and it would be good if all the trees ever treated by the beaver could be classified as 1. So we need to unify the variable USZKODZ, and we do that using the function if_else(). (Of course, if_else() is "embedded" in mutate(), as operating in the pipe we need to use mutate() to create the new variable.)
# if_else() works similarly to IF in Excel - it checks a value against a specified condition (here the condition is: USZKODZ == 0), and if the condition is met it does what we want it to do (here it assigns 0), and if the condition is not met it does something else (here we want it to assign 1).
kable(bobry_prop)
USZKODZ_cat2 | n |
---|---|
0 | 94 |
1 | 54 |
To test this assumption we form the null hypothesis, which can sound like this:
H0: The observed proportions of cut and not-cut trees ARE similar (50:50)
H1: The observed proportions of cut and not-cut trees ARE NOT similar
We test H0 with a proportion test. This test compares the observed proportion to an expected proportion (and here the expected one is 50:50).
# prop.test() may look a bit tricky in R as it needs some data preparation: creating the x value - the number of counts of "success" (here cut trees), and n - the total number of trials (here the total number of trees considered)
x <- bobry_prop %>% filter(USZKODZ_cat2 == 1) %>% select(n)
x <- x$n
n <- sum(bobry_prop$n)
prtest <- prop.test(x = x, n = n)
prtest
1-sample proportions test with continuity correction
data: x out of n, null probability 0.5
X-squared = 10.277, df = 1, p-value = 0.001347
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.2884945 0.4483666
sample estimates:
p
0.3648649
INTERPRETATION: The most important (in the sense informative) part of the test output is the p-value, which is 0.001 - quite low. If so, we can reject the null hypothesis. Doing so, we accept the alternative hypothesis saying that the observed proportion is not equal to 0.5. It is something between 0.29 and 0.45, with the point estimate being 0.36. As such, it does not really cover the value 0.5. So, concluding about the beaver’s preferences we can say that, well, something is going on here… the beaver does not cut Ag randomly, and as a matter of fact Ag does not look like its favourite species.
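As a side note, for a single proportion like this one the exact binomial test is a common alternative to prop.test(); a minimal sketch using the same x and n as above:
# exact binomial test of the same proportion (H0: p = 0.5)
binom.test(x = x, n = n, p = 0.5)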