<br> -> page break!
See also BBR Sections 7.2 & 7.3 link
###Kruskal-Wallis Rank Sum Test: kruskal.test() - extension of the Wilcoxon rank sum test for more than two groups
- test for a difference in location of a continuous variable between multiple groups
- the Wilcoxon rank sum test is a special case of the Kruskal-Wallis rank sum test
###Other tests for continuous data - Kolmogorov-Smirnov Test: ks.test() = tests if two samples are drawn from the same continuous distribution
- Shapiro-Wilk Normality Test: shapiro.test()
- Friedman Rank Sum Test: friedman.test() = non-parametric test for two or more related samples
- There is more!
This is not part of the demo. It just allows the output to be wider (to make the html look nicer)
options(width = 105)
set.seed(2020)
# make data:
x <- rnorm(250, mean = 20.5, sd = 4)
# Test if the mean of `x` is equal to 0:
t.test(x)
# Test if the mean of `x` is equal to 21:
t.test(x, mu = 21)
#' By default, a two-sided test is performed. To do a one-sided test, the argument
#' `alternative` can be set to `less` or `greater`:
t.test(x, mu = 21, alternative = 'less')
#' Note that the lower limit of the confidence interval became `-Inf`, because
#' we look at a one-sided test, and that the upper bound reduced compared to the
#' two sided-test.
#' The output we see is produced by the `print()` function that is automatically
#' called.
#' To access the elements of the test results separately:
testres <- t.test(x, mu = 21, alternative = 'less', conf.level = 0.975)
names(testres)
class(testres) #all are called htest
is.list(testres)
#' This means we can treat `testres` as a `list`:
testres$p.value
testres$conf.int
y <- rnorm(250, mean = 20, sd = 2)
#' By default, the test assumes that the two samples have different variances
#' and that the samples are independent.
#' How can you know this?
#' It is written in the help file! -> ?t.test
t.test(x, y)
t.test(x, y, var.equal = TRUE)
# you make the assumption on whether it is equal, this is not a statistical test!
#' A t-test for a paired sample can be performed by setting the argument
#' `paired = TRUE`:
t.test(x, y, paired = TRUE) #specify paired if its paired!
#' Again, we can change `mean` to test if the difference is different from that
#' value instead of testing for a difference equal to zero.
#' This is equivalent to performing a one-sampe t-test of the differences
#' `x - y`:
t.test(x - y) #because you check whether the difference between two groups differ from zero?
#' It is also possible to specify the test using a formula. This can be more convenient when we have the data in a `data.frame`
dat <- data.frame(value = c(x, y),
group = rep(c('x', 'y'), c(length(x), length(y))))
summary(dat)
t.test(value ~ group, data = dat)
# use ~ if data is in data frame
#' or
t.test(value ~ group, data = dat, paired = TRUE)
wilcox.test(x, mu = 20)
#' The function `wilcox.test()` has similar arguments as `t.test()`
#' (`x`, `y`, `alternative`, `mu`, `paired`, `conf.level`).
#' Note that conidence intervals are only returned if `conf.int = TRUE`:
wilcox.test(x, conf.int = TRUE, mu = 20)
#' The additional argument `exact` controls if exact p-values and confidence intervals are calculated or if the normal approximation is used. In the latter case, the argument `correct` determins if a continuity correction is applied.
wilcox.test(x, conf.int = TRUE, exact = TRUE, mu = 20)
# exact = TRUE has something to do with p-values
# you correct for the fact you don't have a lot of data! so with a lot of data you don't need to use it.
#' ### Wilcoxon Rank Sum Test
wilcox.test(x, y, correct = TRUE, conf.int = TRUE)
#' Like for the `t.test()`, we can also use a formula specification and the
#' test output is of the same class, i.e., the p-value etc. can be accessed
#' like elements of a list.
This test is an extension to the Wilcoxon rank sum test to more than two groups
We can povide the data as a list():
x <- list(x1 = rnorm(100, mean = 12, sd = 1.5),
x2 = rnorm(100, mean = 11, sd = 2),
x3 = rnorm(100, mean = 11.5, sd = 1.8))
kruskal.test(x) #chi square for when you have more than two groups
#' or as a `data.frame` with one variable per column:
kruskal.test(as.data.frame(x))
#' or using the formula specification
dat <- data.frame(x = c(x$x1, x$x2, x$x3),
group = rep(c('x1', 'x2', 'x3'), each = 100))
head(dat)
kruskal.test(x ~ group, data = dat)
While we can calculate a correlation matrix for a whole set of variables, the function cor.test() can only handle two variables at a time: cor.test(x = swiss\(Fertility, y = swiss\)Agriculture)
Using the argument method we can choose between
* pearson
* kendall
* spearman
Other arguments we know already from other tests are
* alternative
* conf.level
* exact (only for kendall and spearman)
We additionally have the argument continuity that allows us to use continuity correction for method = "kendall" and method = "spearman".
cor.test(x = swiss$Fertility, y = swiss$Agriculture, method = "kendall", exact = FALSE, continuity = TRUE)
#' It is possible to use a `formula` specification:
cor.test(~ Fertility + Agriculture, data = swiss)
#' This allows us to use the argument `subset` to select only part of the data:
cor.test(~ Fertility + Agriculture, data = swiss, subset = Infant.Mortality > 18) # only use the data for which ifm is higher than 18
Testing multiple hypotheses at a significance level will result in an overall chance of at least one false positive result that is larger than that significance level. We then need to correct for multiple testing.
This can be done easily using the function p.adjust():
pval1 <- t.test(swiss$Fertility, mu = 50)$p.value
pval2 <- t.test(swiss$Agriculture, mu = 50)$p.value
pval3 <- t.test(swiss$Examination, mu = 30)$p.value
# combine the unadjusted p-values in one vector
(pvals <- c(Fert = pval1, Agri = pval2, Exam = pval3))
# Adjustment using the Bonferroni method:
p_adj_Bonf <- p.adjust(pvals, method = 'bonferroni') # correct for the number of p-values
# Adjustment using the Benjamini-Hochberg method
p_adj_BH <- p.adjust(pvals, method = "BH") #better dan bonf
cbind(pvals, p_adj_Bonf, p_adj_BH)
The available correction methods are available in the vector p.adjust.methods:
p.adjust.methods
###One-sample Proportion Test - tests if the proportion in one sample is equal to a fixed value p
- prop.test() and binom.test()
###One-sample Proportion Test - tests if the proportion in one sample is equal to a fixed value p
- prop.test() and binom.test()
###Tests for Proportions in Multiple (independent) Groups - tests if the proportions in several samples are equal
- chisq.test() and fisher.test() (when there are cells with 0)
See also BBR Sections 5.7 & 6 link
###Related Samples: McNemar Test Smokers before and after interventions - Tests for symmetry in a 2 times 2 table
- mcnemar.test()
# create some data:
(x <- rbinom(50, size = 1, prob = 0.3)) #number of heads with an uneven coin with prob .3
#' For a proportion test, the data can be provided as the number of
#' successes (`x`) and the number of trials (`n`)
prop.test(x = sum(x), n = length(x), p = 0.4)
# tests whether the proportion of heads is different from .4. So, is 40% one?
#' or as a matrix containing the number of sucesses and failures:
M <- matrix(data = c(sum(x), sum(1 - x)), #sum x is number of ones, second argument is the reversed
nrow = 1, ncol = 2, #make a matrix with one row and two columns
dimnames = list(c(), c('success', 'failure'))) #names of dims (row = empty, columns = succes and failure)
M
prop.test(M, p = 0.4) # works the same as providing sum(x) with length (x)
If the argument p is unspecified, p = 0.5 is used. This is because when you compare two things, .5 is default chance.
Like in the t.test() we can choose a one- or two-sided null hypothesisusing the argument alternative, and the confidence level using conf.level.
When correct = TRUE Yate’s continuity correction is used.
The function binom.test() works the same as prop.test(), but performs an exact test.
binom.test(x = sum(x), n = length(x), p = 0.4)
Categorical data in mutiple categories are usually displayed in a table:
X <- matrix(data = sample(5:50, size = 6),
nrow = 2, ncol = 3,
dimnames = list(c('exposed', 'non-exposed'),
c('none', 'mild', 'severe'))
)
X
The function chisq.test() performs Pearson’s Chi-squared test.
For this test (or for Fisher’s Exact test) it does not matter which variable goes into the rows and which into the columns:
chisq.test(X)
chisq.test(t(X)) #t transposes the matrix -> rows to columns. But results are the same
By default, p-values are calculated from the asymptotic chi-squared distribution(are good when you have a lot of data, not when you have not that much data) of the test statistic.This can be changed to calculation via Monte Carlo simuation when simulate.p.value = TRUE (something like bootstrapping/permutations). Then, the argument B specifies the number of simulations used to calculate the p-value.
chisq.test(X, simulate.p.value = TRUE, B = 1e5)
#' **Note** that simulation can result in different p-values every time, especially
#' when `B` is small:
set.seed(1234)
chisq.test(X, simulate.p.value = TRUE, B = 200)
chisq.test(X, simulate.p.value = TRUE, B = 200)
#' Specification is also possible via two factors:
x <- factor(sample(c('a', 'b'), size = 100, replace = TRUE))
y <- factor(sample(c('yes', 'no'), size = 100, replace = TRUE,
prob = c(0.3, 0.7)))
table(x, y)
chisq.test(x, y, correct = FALSE)
chisq.test(table(x, y), correct = FALSE)
# correct argument = continuity correction for small samples
fisher.test() takes similar arguments as chisq.test() and can be used when there are combinations with no observations:X[2, 3] <- 0
X
fisher.test(X)
#' * Arguments `simulate.p.value` and `B` work like for `chisq.test()`.
#' * Confidence intervals for the odds ratio and the specification of an `alternative` are only available for 2x2 tables:
fisher.test(X[, -3], conf.int = TRUE, alternative = 'less')
M <- matrix(data = sample(1:50, size = 4),
nrow = 2, ncol = 2,
dimnames = list(before = c('no', 'yes'), after = c('no', 'yes')))
M
mcnemar.test(M)
#' The `mcnemar.test()` also has the option to switch off the continuity
#' correction by setting `correct = FALSE`.
#' Specification for the test is also possible via two factors:
x <- factor(sample(c('yes', 'no'), size = 100, replace = TRUE))
y <- factor(sample(c('yes', 'no'), size = 100, replace = TRUE,
prob = c(0.3, 0.7)))
#' **Note** that in this example the data are independent, but in a real case we would have **paired observations**. X and Y would be before and after
mcnemar.test(x, y)
table(x, y)
mcnemar.test(table(x,y))
For contingency tables
x <- factor(sample(c('exposed', 'not exposed'), size = 100, replace = TRUE))
y <- factor(sample(c('yes', 'no'), size = 100, replace = TRUE))
stratum <- factor(sample(c('A', 'B', 'C'), size = 100, replace = TRUE))
table(x, y, stratum)
mantelhaen.test(x = x, y = y, z = stratum)
mantelhaen.test(table(x, y, stratum))
#' The arguments `alternative`, `correct`, `exact` and `conf.level` can be used
#' like for the tests before, but only in the case of a $2 \times 2 \times K$ table.
library(MASS)
survey <- MASS::survey
str(survey)
First, we want to test if there is a difference in Age between Female and Male students.
Which statistical test do you think is appropriate to test this?
- first check whether it is parametric or not
par(mfrow = c(1, 2)) # arrange plots in a grid with one row and two columns
hist(survey$Age[which(survey$Sex == 'Female')], nclass = 50, main = 'Female', xlab = 'age')
hist(survey$Age[which(survey$Sex == 'Male')], nclass = 50, main = 'Male', xlab = 'age')
# nclass = how small your bars should be
# use [which(survey$Sex == 'Female')] -> otherswise you will also take in the NAs
So we need to perform a non-parametric test!
Perform a non-parametric test to investigate whether males are older than females. It is recommended to use the exact p-value calculation when possible (= when there are no ties = exact calculation is not possible when certain number are equal), and to use continuity correction if a exact calculation is not possble.
One sided: males > females -> alternative = ‘greater’
- Setting alternative = “greater” would, hence, test the alternative hypothesis
- H1:Female>MaleH1:Female>Male. Since we want to test H1:Male>Female - H0:Male???FemaleH0:Male???Female) we need to set alternative = “less”
The one-sided alternative “greater” is that x is shifted to the right of y. When using the formula specification, x refers to the first factor level (Female in our case) and y refers to the second level (Male in our case).
female = 0, male is 1, we want to know whether the first group is less than the second, so we pick less
# Wilcoxon Rank Sum Test WRONG
wilcox.test(Age~Sex, data=survey, conf.int = TRUE, exact = TRUE, alternative = 'less')
# exact is not possible, observations are equal to each other
# Wilcoxon Rank Sum Test
wc.test <- wilcox.test(Age~Sex, data=survey, conf.int = TRUE, exact = FALSE, correct = TRUE, alternative = 'less')
# use correction for small sample size
Because there are ties in the data, the exact calculation of the p-value could not be used and the normal approximation was used instead. Continuity correction was used (by default).
The output shows that we can reject the null hypothesis that males are younger than (or equally as old as) females, i.e., males are likely older.
###Testing Categorical Variables ####Left vs right We now want to test if the writing hand (W.Hnd) is associated with which hand was on top when the students clapped their hands (Clap).
str(survey)
M <- table(survey$W.Hnd, survey$Clap)
M
cs.test <- chisq.test(M, correct = T, simulate.p.value = T)
Simulate p-value is not necessary - and the warning you get about expected frequencies is also nonsense.
A common recommendation is / used to be that Fisher’s Exact test should be used instead of the ??2??2-test whenever one of the expected frequencies is <5<5, however this “cut-off” is arbitrary link and a ??2??2-test can be used.
Check if the conclusion differs when we take into account the students’ Sex.So we have three variables.
mh.test <- mantelhaen.test(table(survey$W.Hnd, survey$Clap, survey$Sex))
###Multiple Testing Obtain adjusted p-values for these test and perform a multiplicity correction: - Wilcoxon rank sum test for Age by Sex
- t-test for the difference between the Wr.Hnd and NW.Hnd
- Mantel-Haenszel test for W.Hnd, Clap and Sex
pval1 <- wc.test$p.value
pval2 <- cs.test$p.value
pval3 <- mh.test$p.value
# combine the unadjusted p-values in one vector
(pvals <- c(wc.test = pval1, cs.test = pval2, mh.test = pval3))
# Adjustment using the Bonferroni method:
p_adj_Bonf <- p.adjust(pvals, method = 'bonferroni') # correct for the number of p-values
# Adjustment using the Benjamini-Hochberg method
p_adj_BH <- p.adjust(pvals, method = "BH") #better dan bonf
padj <- cbind(p_adj_Bonf, p_adj_BH)
#make a better graph
round(padj, 3)
# doesn't give <.001
# does give <.001
padj <- p.adjust(pvals, method = "BH")
data.frame('p-value' = format.pval(pvals, eps = 0.001, digits = 2),
'adj.p-value' = format.pval(padj, eps = 0.001, digits = 2),
row.names = names(pvals), check.names = FALSE)
Linear model: Evaluate the assumptions of a linear regression model visually, for example:\ - Histogram of residuals
- Normal QQ-plot of residuals
- Scatterplot residuals vs fitted values
Comparison using a likelihood ratio (LR) test, for example:\ - anova(modelA, modelB)
- anova(modelA, modelB, test = “LRT”) # for a glm
mod <- lm(Infant.Mortality ~ Fertility + Agriculture + Education + Catholic,
data = swiss)
To evaluate how well the model fits the data we can check the diagnosticplots that are obtained when plotting the model: + fig.width = 8, fig.height = 6
par(mfrow = c(2, 2), mar = c(4, 4.2, 2, 1.3)) # graphical settings
plot(mod)
We can re-create some of these plots ourselves.
Residuals and fitted values can be obtained from a linear regression model in two ways:
mod$residuals
residuals(mod)
mod$fitted.values
fitted(mod) # also: fitted.values(mod)
# scatter plot with smooth (loess) fit:
scatter.smooth(mod$fitted.values, mod$residuals, lpars = list(col = 'red'))
abline(h = 0, lty = 2, col = 'grey') # grey horizontal line
To create a normal QQ-plot we plot the quantiles of the standardized residuals against the quantiles of a normal distribution:
qqnorm(rstandard(mod))
abline(a = 0, b = 1, lty = 1, col = 'grey')
hist(mod$residuals)
hist(mod$residuals, nclass = 50)
To check if education is needed in the model, we can use a likelihood ratio test to compare the models with and without education. This is implemented in the function anova().
mod2 <- glm(case ~ age + education + spontaneous, data = infert, family = binomial())
summary(mod2)
To re-fit the model with a small change, the function update() is useful:
mod2b <- update(mod2, formula = . ~ . - education)
#update changes the existing model -> everything before the tilde must remain the same, same after tilde - however, minus education
anova(mod2, mod2b, test = "LRT")
# one is a nested model, one doesn't had edu in it
To compare non-nested models (where one model is not a special case of the other model) we can use the AIC() or BIC():
mod3 <- update(mod2b, formula = . ~ . - age + induced)
# i want to change the formula, but i want to exchange age for induced. Models are not nested when you exchange a var!
AIC(mod3, mod2b) #which one has a better fit?
BIC(mod3, mod2b) #aic is a bit more strict, so they do not always agree - smaller is better!
#when you need the vars of a formula:
formula(mod3) #cannot work with it
all.vars(formula(mod3))
editor_options: chunk_output_type: console — ### Logistic Regression We also would like to use this function for output from logistic regression models. Does the current version of the function work with logistic regression models?
summarylm(modlog)
#We get an error message “subscript out of bounds”. This means we are trying to select an element that does not exist.
#To find the problem, we check each part of the function body, replacing model by test_logit:
decimals <- 3
dat <- data.frame(estimate = coef(test_logit),
confint(test_logit),
check.names = FALSE)
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(test_logit)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
summary(test_logit)$coefficients[, 'Pr(>|t|)']
#Same error, so probably the column 'Pr(>|t|)' does not exist.
summary(test_logit)$coefficients
Indeed, for a logistic regression model, the column is called Pr(>|z|)!
Adjust the function so that it works with both linear and logistic regression models.
summarylm <- function(lmmodel, decimals = 3){
dat <- data.frame(estimate = coef(lmmodel),
confint(lmmodel),
check.names = FALSE)
dat <- round(dat, decimals)
dat$'p-value' <- Hmisc::format.pval((summary(lmmodel)$coefficients[, 4]), eps = 10^(-decimals), digits = decimals)
dat
}
summarylm(modlog)
fun4 <- function(model, decimals = 3, modeltype) {
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
dat <- round(dat, decimals)
if (modeltype == 'lm') {
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
}
dat
}
fun4(modlog, modeltype = 'logit')
Now that the function “works”, i.e., runs for both types of models, we can come back to the other question: For a logistic model: is this the output that we are interested in?
For a logistic regression model, the effects are usually reported as odds ratios. In our current version, we present the effects on the log odds scale.
Adjust the function so that for logistic regression models the summary is given as odds ratios. Do not forget to change the column name(s).
fun5 <- function(model, decimals = 3, modeltype) {
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat <- round(exp(dat), decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat
}
# heel het model moet exp!
# change name with names(dat)[1] <- 'OR'
fun5(modlog, modeltype = 'logit')
To make the function more convenient to use, we want to automatically identify the model type, so that the user does not need to provide the modeltype.
fun6 <- function(model, decimals = 3) {
# extract the family and link
fam <- family(model)$family
link <- family(model)$link
modeltype <- if (fam == 'binomial' & link == 'logit') {
'logit'
} else {
if (fam == 'gaussian' & link == 'identity') {
'lm'
}
}
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat <- round(exp(dat), decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat
}
# specifies LINK and FAMILY!!!!
fun6(modlog, modeltype = 'logit')
What would happen if we would apply our function to a model that is not a standard linear regression (with identity link) or logistic regression model?
Try to figure this out by looking at the syntax without trying it out!
If we provide a model that does not fit either of the criteria fam == ‘binomial’ & link == ‘logit’ or fam == ‘gaussian’ & link == ‘identity’ the value of modeltype will be NULL.
Why? The first condition (fam == ‘binomial’ & link == ‘logit’) would be FALSE so we would end up in the else part. There, the condition (fam == ‘gaussian’ & link == ‘identity’) would also be FALSE. Since this if() does not provide any alternative, it will return NULL.
The question is what happens when we try to evaluate if (modeltype == ‘lm’) when modeltype = NULL.
The comparison of NULL with ‘lm’ will result in a logical vector of length zero.
Using a logical vector of length zero as condition to if() results in an error:## Error in if (NULL == “lm”) {: argument is of length zero –> ## Error in if (modeltype == “lm”) {: argument is of length zero
Prevent this error from occuring by stopping the evaluation of the function before it happens and provide an informative error message.
# see below for a better option!
fun7 <- function(model, decimals = 3) {
# extract the family and link
fam <- family(model)$family
link <- family(model)$link
modeltype <- if (fam == 'binomial' & link == 'logit') {
'logit'
} else {
if (fam == 'gaussian' & link == 'identity') {
'lm'
}else{
stop('If model is bionomial, it must be logit or gaussian')
}
}
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat <- round(exp(dat), decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat
}
test_probit <- glm(ybin ~ ., family = binomial(link = 'probit'), data = testdat)
fun7(test_probit)
#OR - better option!!
fun7 <- function(model, decimals = 3) {
# extract the family and link
fam <- family(model)$family
link <- family(model)$link
modeltype <- if (fam == 'binomial' & link == 'logit') {
'logit'
} else {
if (fam == 'gaussian' & link == 'identity') {
'lm'
}
}
if (is.null(modeltype)) {
stop("I do not know how to create the summary for the type of model you have provided.")
}
dat <- data.frame(estimate = coef(model),
confint(model),
check.names = FALSE)
if (modeltype == 'lm') {
dat <- round(dat, decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
eps = 10^(-decimals), digits = decimals)
} else {
dat <- round(exp(dat), decimals)
dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
eps = 10^(-decimals), digits = decimals)
names(dat)[1] <- 'OR'
}
dat
}