BST02: Using R for Statistics in Medical Research

Part D: Statistics with R


Common statistical tests

for continuous data

t-test: t.test()

  • One-sample t-test = compares the mean of one sample with a fixed value \(\mu\)
  • Two-sample / independent-samples t-test = compares the difference between the means of two independent samples (measurements from two different persons) with a fixed value \(\mu\)
  • Paired / related-samples t-test = compares the mean of the differences between related observations (two measurements of the same person) with a fixed value \(\mu\); this is the same as a one-sample t-test of the differences (see the call sketch below)
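
A minimal sketch of the three calls (x and y here are just simulated example vectors; the demo below goes through each variant in more detail):

x <- rnorm(50, mean = 10)
y <- rnorm(50, mean = 10.5)
t.test(x, mu = 10)           # one-sample: mean of x vs mu
t.test(x, y)                 # two independent samples (Welch test by default)
t.test(x, y, paired = TRUE)  # related samples: mean of the differences x - y vs mu (default 0)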


Wilcoxon Test: wilcox.test()

  • Wilcoxon Signed Rank Test = tests if one sample (or the difference between two paired samples) is symmetric about \(\mu\)
  • Wilcoxon Rank Sum Test / Mann-Whitney test = tests for a location shift between the distributions of two independent samples

See also BBR Sections 7.2 & 7.3

Kruskal-Wallis Rank Sum Test: kruskal.test()

  • extension of the Wilcoxon rank sum test to more than two groups
  • tests for a difference in location of a continuous variable between multiple groups
  • the Wilcoxon rank sum test is a special case of the Kruskal-Wallis rank sum test

Other tests for continuous data

  • Kolmogorov-Smirnov Test: ks.test() = tests if two samples are drawn from the same continuous distribution
  • Shapiro-Wilk Normality Test: shapiro.test() = tests if a sample is drawn from a normal distribution
  • Friedman Rank Sum Test: friedman.test() = non-parametric test for two or more related samples
  • There is more! (a sketch of the first three calls follows below)
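
These tests are not covered in the demo below; a minimal sketch of how they are called (on simulated data, so the results themselves are not meaningful):

x1 <- rnorm(50)
x2 <- rnorm(50, mean = 0.5)
ks.test(x1, x2)      # are x1 and x2 drawn from the same continuous distribution?
shapiro.test(x1)     # is x1 drawn from a normal distribution?
# Friedman test: a matrix with one row per subject (block) and one column per related measurement
y <- matrix(rnorm(60), nrow = 20, ncol = 3)
friedman.test(y)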

DEMO TESTS FOR CONTINUOUS DATA

This is not part of the demo. It just allows the output to be wider (to make the html look nicer)

options(width = 105)
set.seed(2020)

t-test

One-sample t-test

# make data:
x <- rnorm(250, mean = 20.5, sd = 4)

# Test if the mean of `x` is equal to 0:
t.test(x)

# Test if the mean of `x` is equal to 21:
t.test(x, mu = 21)

#' By default, a two-sided test is performed. To do a one-sided test, the argument
#' `alternative` can be set to `less` or `greater`:
t.test(x, mu = 21, alternative = 'less')
#' Note that the lower limit of the confidence interval became `-Inf` because
#' we perform a one-sided test, and that the upper limit is lower than in the
#' two-sided test.

#' The output we see is produced by the `print()` function that is automatically
#' called.
#' To access the elements of the test results separately:
testres <- t.test(x, mu = 21, alternative = 'less', conf.level = 0.975)
names(testres)
class(testres) # all of these test results have class 'htest'
is.list(testres)
#' This means we can treat `testres` as a `list`:
testres$p.value
testres$conf.int

Two-samples t-test

y <- rnorm(250, mean = 20, sd = 2)

#' By default, the test assumes that the two samples have different variances
#' and that the samples are independent.
#' How can you know this?
#' It is written in the help file! -> ?t.test
t.test(x, y)
t.test(x, y, var.equal = TRUE)
# whether the variances are equal is an assumption you make; setting var.equal = TRUE is not itself a statistical test!

Paired-samples t-test

#' A t-test for a paired sample can be performed by setting the argument 
#' `paired = TRUE`:
t.test(x, y, paired = TRUE) # set paired = TRUE for paired data!
#' Again, we can change `mu` to test if the mean difference is equal to that
#' value instead of testing for a difference equal to zero.

#' This is equivalent to performing a one-sample t-test of the differences 
#' `x - y`:
t.test(x - y) # because the paired test checks whether the mean of the differences x - y is zero

Using the formula specification

#' It is also possible to specify the test using a formula. This can be more convenient when we have the data in a `data.frame`:
dat <- data.frame(value = c(x, y),
                  group  = rep(c('x', 'y'), c(length(x), length(y))))

summary(dat)
t.test(value ~ group, data = dat)

# use the formula interface (with ~) when the data are in a data.frame

#' or
t.test(value ~ group, data = dat, paired = TRUE)

Wilcoxon Test

Wilcoxon Signed Rank Test

wilcox.test(x, mu = 20)

#' The function `wilcox.test()` has similar arguments as `t.test()`
#' (`x`, `y`, `alternative`, `mu`, `paired`, `conf.level`).

#' Note that confidence intervals are only returned if `conf.int = TRUE`:
wilcox.test(x, conf.int = TRUE, mu = 20)

#' The additional argument `exact` controls whether exact p-values and confidence intervals are calculated or whether the normal approximation is used. In the latter case, the argument `correct` determines whether a continuity correction is applied.
wilcox.test(x, conf.int = TRUE, exact = TRUE, mu = 20) 
# exact = TRUE requests exact p-values and confidence intervals instead of the normal approximation
# the continuity correction mainly matters in small samples; with a lot of data it makes little difference

#' ### Wilcoxon Rank Sum Test
wilcox.test(x, y, correct = TRUE, conf.int = TRUE)

#' Like for the `t.test()`, we can also use a formula specification and the 
#' test output is of the same class, i.e., the p-value etc. can be accessed
#' like elements of a list, for example:
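wilcox.test(value ~ group, data = dat, conf.int = TRUE)
wilcox.test(value ~ group, data = dat)$p.value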

Kruskal-Wallis Rank Sum Test

This test is an extension of the Wilcoxon rank sum test to more than two groups.

We can provide the data as a list():

x <- list(x1 = rnorm(100, mean = 12, sd = 1.5),
          x2 = rnorm(100, mean = 11, sd = 2),
          x3 = rnorm(100, mean = 11.5, sd = 1.8))

kruskal.test(x) # chi-squared based test for when you have more than two groups

#' or as a `data.frame` with one variable per column:
kruskal.test(as.data.frame(x))

#' or using the formula specification
dat <- data.frame(x = c(x$x1, x$x2, x$x3),
                  group = rep(c('x1', 'x2', 'x3'), each = 100))
head(dat)
kruskal.test(x ~ group, data = dat)

Testing the Correlation

While we can calculate a correlation matrix for a whole set of variables, the function cor.test() can only handle two variables at a time:

cor.test(x = swiss$Fertility, y = swiss$Agriculture)

Using the argument method we can choose between
* pearson
* kendall
* spearman

Other arguments we know already from other tests are
* alternative
* conf.level
* exact (only for kendall and spearman)

We additionally have the argument continuity that allows us to use continuity correction for method = "kendall" and method = "spearman".

cor.test(x = swiss$Fertility, y = swiss$Agriculture, method = "kendall", exact = FALSE, continuity = TRUE)

#' It is possible to use a `formula` specification:
cor.test(~ Fertility + Agriculture, data = swiss)

#' This allows us to use the argument `subset` to select only part of the data:
cor.test(~ Fertility + Agriculture, data = swiss, subset = Infant.Mortality > 18) # only use observations with Infant.Mortality > 18

Multiple Testing Adjustment

Testing multiple hypotheses, each at a given significance level, results in an overall probability of at least one false-positive result that is larger than that significance level. We therefore need to correct for multiple testing.

This can be done easily using the function p.adjust():

pval1 <- t.test(swiss$Fertility, mu = 50)$p.value
pval2 <- t.test(swiss$Agriculture, mu = 50)$p.value
pval3 <- t.test(swiss$Examination, mu = 30)$p.value

# combine the unadjusted p-values in one vector
(pvals <- c(Fert = pval1, Agri = pval2, Exam = pval3))

# Adjustment using the Bonferroni method:
p_adj_Bonf <- p.adjust(pvals, method = 'bonferroni') # correct for the number of p-values

# Adjustment using the Benjamini-Hochberg method
p_adj_BH <- p.adjust(pvals, method = "BH") # Benjamini-Hochberg is less conservative than Bonferroni

cbind(pvals, p_adj_Bonf, p_adj_BH)

The available correction methods are listed in the vector p.adjust.methods:

p.adjust.methods

for categorical data

One-sample Proportion Test

  • tests if the proportion in one sample is equal to a fixed value p
  • prop.test() and binom.test()


Tests for Proportions in Multiple (Independent) Groups

  • tests if the proportions in several samples are equal
  • chisq.test() and fisher.test() (the latter when there are cells with 0)

See also BBR Sections 5.7 & 6


Tests for Categorical Data / Proportions

  • Related Samples: McNemar Test
    • e.g., smokers before and after an intervention
    • tests for symmetry in a \(2 \times 2\) table
    • mcnemar.test()
  • 3-Dimensional Contingency Table
    • Cochran-Mantel-Haenszel Test
    • \(\chi^2\) test for independence of two nominal variables within each stratum
    • mantelhaen.test()

DEMO TESTS FOR CATEGORICAL DATA

One-sample proportion test

# create some data:
(x <- rbinom(50, size = 1, prob = 0.3)) # 50 tosses of a biased coin with probability 0.3 of showing 1 ('heads')

#' For a proportion test, the data can be provided as the number of 
#' successes (`x`) and the number of trials (`n`)
prop.test(x = sum(x), n = length(x), p = 0.4)
# tests whether the proportion of 1s ('heads') is different from 0.4

#' or as a matrix containing the number of successes and failures:
M <- matrix(data = c(sum(x), sum(1 - x)), # sum(x) = number of successes, sum(1 - x) = number of failures
            nrow = 1, ncol = 2,           # a matrix with one row and two columns
            dimnames = list(c(), c('success', 'failure'))) # dimension names (rows: none, columns: success/failure)
M

prop.test(M, p = 0.4) # gives the same result as providing sum(x) and length(x)

If the argument p is unspecified, p = 0.5 is used, i.e., the null hypothesis is that successes and failures are equally likely.

Like in t.test(), we can choose a one- or two-sided alternative hypothesis using the argument alternative, and the confidence level using conf.level.

When correct = TRUE, Yates' continuity correction is used.
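
For example, using the simulated data from above (a small illustration of these arguments, not part of the original demo):

prop.test(x = sum(x), n = length(x), alternative = 'greater', conf.level = 0.9)
prop.test(x = sum(x), n = length(x), p = 0.4, correct = FALSE) # without Yates' continuity correction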

Exact one-sample proportion test

The function binom.test() works the same as prop.test(), but performs an exact test.

binom.test(x = sum(x), n = length(x), p = 0.4)

Pearson’s \(\chi^2\)-Test

Categorical data in multiple categories are usually displayed in a table:

X <- matrix(data = sample(5:50, size = 6),
            nrow = 2, ncol = 3,
            dimnames = list(c('exposed', 'non-exposed'),
                            c('none', 'mild', 'severe'))
)
X

The function chisq.test() performs Pearson’s Chi-squared test.

For this test (or for Fisher’s Exact test) it does not matter which variable goes into the rows and which into the columns:

chisq.test(X)
chisq.test(t(X)) # t() transposes the matrix (rows become columns), but the test result is the same

By default, p-values are calculated from the asymptotic chi-squared distribution of the test statistic (a good approximation with a lot of data, less so with little data). This can be changed to calculation via Monte Carlo simulation (similar in spirit to bootstrapping/permutation) by setting simulate.p.value = TRUE. The argument B then specifies the number of simulations used to calculate the p-value.

chisq.test(X, simulate.p.value = TRUE, B = 1e5)

#' **Note** that simulation can result in different p-values every time, especially 
#' when `B` is small:
set.seed(1234)
chisq.test(X, simulate.p.value = TRUE, B = 200)
chisq.test(X, simulate.p.value = TRUE, B = 200)

#' Specification is also possible via two factors:
x <- factor(sample(c('a', 'b'), size = 100, replace = TRUE))
y <- factor(sample(c('yes', 'no'), size = 100, replace = TRUE,
                   prob = c(0.3, 0.7)))

table(x, y)
chisq.test(x, y, correct = FALSE)
chisq.test(table(x, y), correct = FALSE)
# correct: apply a continuity correction (only used for 2 x 2 tables; relevant for small samples)

Fisher’s Exact Test

  • fisher.test() takes similar arguments as chisq.test() and can be used when there are combinations with no observations:
X[2, 3] <- 0
X
fisher.test(X)

#' * Arguments `simulate.p.value` and `B` work like for `chisq.test()`.
#' * Confidence intervals for the odds ratio and the specification of an `alternative` are only available for 2x2 tables:
fisher.test(X[, -3], conf.int = TRUE, alternative = 'less')

McNemar Test

M <- matrix(data = sample(1:50, size = 4),
            nrow = 2, ncol = 2, 
            dimnames = list(before = c('no', 'yes'), after = c('no', 'yes')))
M
mcnemar.test(M)

#' The `mcnemar.test()` also has the option to switch off the continuity 
#' correction by setting `correct = FALSE`.
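#' For example:
mcnemar.test(M, correct = FALSE)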

#' Specification for the test is also possible via two factors:
x <- factor(sample(c('yes', 'no'), size = 100, replace = TRUE))
y <- factor(sample(c('yes', 'no'), size = 100, replace = TRUE,
                   prob = c(0.3, 0.7)))

#' **Note** that in this example the data are independent, but in a real case we would have **paired observations**, e.g., x and y would be measurements before and after an intervention.

mcnemar.test(x, y)

table(x, y)
mcnemar.test(table(x,y))

Cochran-Mantel-Haenszel Chi-Squared Test

For three-dimensional (stratified) contingency tables:

x <- factor(sample(c('exposed', 'not exposed'), size = 100, replace = TRUE))
y <- factor(sample(c('yes', 'no'), size = 100, replace = TRUE))
stratum <- factor(sample(c('A', 'B', 'C'), size = 100, replace = TRUE))
table(x, y, stratum)

mantelhaen.test(x = x, y = y, z = stratum)
mantelhaen.test(table(x, y, stratum))

#' The arguments `alternative`, `correct`, `exact` and `conf.level` can be used
#' like for the tests before, but only in the case of a $2 \times 2 \times K$ table.

PRACTICAL STATISTICAL TESTS

library(MASS)
survey <- MASS::survey
str(survey)

Tests for continuous data

Age differences

First, we want to test if there is a difference in Age between Female and Male students.

Which statistical test do you think is appropriate to test this?
- first check whether a parametric test is appropriate, i.e., whether the data can be assumed to be normally distributed

par(mfrow = c(1, 2)) # arrange plots in a grid with one row and two columns
hist(survey$Age[which(survey$Sex == 'Female')], nclass = 50, main = 'Female', xlab = 'age')
hist(survey$Age[which(survey$Sex == 'Male')], nclass = 50, main = 'Male', xlab = 'age')
# nclass controls the (approximate) number of histogram bars
# subsetting with which(survey$Sex == 'Female') avoids also including rows where Sex is NA

So we need to perform a non-parametric test!

Perform a non-parametric test to investigate whether males are older than females. It is recommended to use the exact p-value calculation when possible (i.e., when there are no ties; the exact calculation is not possible when some values are equal) and to use the continuity correction if an exact calculation is not possible.

We want a one-sided test: are males older than females?

The one-sided alternative "greater" means that x is shifted to the right of y. When using the formula specification, x refers to the first factor level (Female in our case) and y refers to the second level (Male in our case). Setting alternative = "greater" would, hence, test the alternative hypothesis \(H_1\): Female > Male. Since we want to test \(H_1\): Male > Female (i.e., \(H_0\): Male \(\leq\) Female), we need to set alternative = "less".

In short: Female is the first group and Male the second; we want to know whether the first group is shifted to the left of (is smaller than) the second, so we pick "less".

# Wilcoxon Rank Sum Test: first attempt, requesting an exact p-value (does not work here)
wilcox.test(Age ~ Sex, data = survey, conf.int = TRUE, exact = TRUE, alternative = 'less')

# the exact calculation is not possible because there are ties (equal observations);
# R falls back to the normal approximation and warns about it

# Wilcoxon Rank Sum Test
wc.test <- wilcox.test(Age~Sex, data=survey, conf.int = TRUE, exact = FALSE, correct = TRUE, alternative = 'less')
# the continuity correction is used together with the normal approximation

Because there are ties in the data, the exact calculation of the p-value could not be used and the normal approximation was used instead. Continuity correction was used (by default).

The output shows that we can reject the null hypothesis that males are younger than (or equally as old as) females, i.e., males are likely older.

Testing Categorical Variables

Left vs right

We now want to test if the writing hand (W.Hnd) is associated with which hand was on top when the students clapped their hands (Clap).

str(survey)

M <- table(survey$W.Hnd, survey$Clap)
M

cs.test <- chisq.test(M, correct = T, simulate.p.value = T)

Simulating the p-value is not necessary here, and the warning about small expected frequencies can be ignored (see below).

A common recommendation is (or used to be) that Fisher's Exact test should be used instead of the \(\chi^2\)-test whenever one of the expected frequencies is < 5; however, this cut-off is arbitrary and a \(\chi^2\)-test can still be used.

Check if the conclusion differs when we take into account the students' Sex. We then have three variables.

mh.test <- mantelhaen.test(table(survey$W.Hnd, survey$Clap, survey$Sex))

Multiple Testing

Obtain the p-values of these tests and perform a multiplicity correction:
- Wilcoxon rank sum test for Age by Sex
- \(\chi^2\)-test for the association between W.Hnd and Clap
- Mantel-Haenszel test for W.Hnd, Clap and Sex

pval1 <- wc.test$p.value
pval2 <- cs.test$p.value
pval3 <- mh.test$p.value

# combine the unadjusted p-values in one vector
(pvals <- c(wc.test = pval1, cs.test = pval2, mh.test = pval3))

# Adjustment using the Bonferroni method:
p_adj_Bonf <- p.adjust(pvals, method = 'bonferroni') # correct for the number of p-values

# Adjustment using the Benjamini-Hochberg method
p_adj_BH <- p.adjust(pvals, method = "BH") # Benjamini-Hochberg is less conservative than Bonferroni

padj <- cbind(p_adj_Bonf, p_adj_BH)

# a nicer presentation of the results:
round(padj, 3)
# but rounding does not display very small p-values as < 0.001

# format.pval() does display them as < 0.001
padj <- p.adjust(pvals, method = "BH")
data.frame('p-value' = format.pval(pvals, eps = 0.001, digits = 2),
           'adj.p-value' = format.pval(padj, eps = 0.001, digits = 2),
           row.names = names(pvals), check.names = FALSE)

Useful functions for regression models

SLIDES Model Evaluation

Model evaluation

Linear model: evaluate the assumptions of a linear regression model visually, for example:

  • Histogram of the residuals
  • Normal QQ-plot of the residuals
  • Scatterplot of residuals vs fitted values

Model Comparison

Nested models:

  • one model is a special case of the other, i.e.,
  • model B is a special case of model A when B can be obtained by setting some of the regression coefficients in A to zero

Comparison using a likelihood ratio (LR) test, for example:

  • anova(modelA, modelB)
  • anova(modelA, modelB, test = "LRT") # for a glm

Non-nested models:

  • using information criteria, e.g.
    • AIC(modelA, modelB)
    • BIC(modelA, modelB)
  • The model with the smaller AIC (or BIC) has the better fit (see the sketch below)
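
A small sketch of both approaches, using the swiss data that also appears in the demo below (the model formulas here are chosen only for illustration):

modA <- lm(Fertility ~ Agriculture + Education + Catholic, data = swiss)
modB <- lm(Fertility ~ Agriculture + Education, data = swiss) # nested in modA

anova(modB, modA) # compares the nested linear models (F-test for lm; use test = "LRT" for a glm)
AIC(modA, modB)   # information criteria also work for non-nested models
BIC(modA, modB)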

DEMO MODEL EVALUATION

Evaluating linear regression models

mod <- lm(Infant.Mortality ~ Fertility + Agriculture + Education + Catholic,
          data = swiss)

To evaluate how well the model fits the data, we can check the diagnostic plots that are obtained when plotting the model:

par(mfrow = c(2, 2), mar = c(4, 4.2, 2, 1.3)) # graphical settings
plot(mod)

We can re-create some of these plots ourselves.

Residuals and fitted values can be obtained from a linear regression model in two ways:

mod$residuals
residuals(mod)

mod$fitted.values
fitted(mod) # also: fitted.values(mod)

# scatter plot with smooth (loess) fit:
scatter.smooth(mod$fitted.values, mod$residuals, lpars = list(col = 'red'))
abline(h = 0, lty = 2, col = 'grey') # grey horizontal line

QQ-plot

To create a normal QQ-plot we plot the quantiles of the standardized residuals against the quantiles of a normal distribution:

qqnorm(rstandard(mod))
abline(a = 0, b = 1, lty = 1, col = 'grey')

Histogram of the residuals

hist(mod$residuals)
hist(mod$residuals, nclass = 50)

Comparing Models

To check if education is needed in the model, we can use a likelihood ratio test to compare the models with and without education. This is implemented in the function anova().

mod2 <- glm(case ~ age + education + spontaneous, data = infert, family = binomial())
summary(mod2)

To re-fit the model with a small change, the function update() is useful:

mod2b <- update(mod2, formula = . ~ . - education)
# update() re-fits the existing model with a modified formula: the '.' keeps everything that was there before, and '- education' removes education

anova(mod2, mod2b, test = "LRT")
# mod2b is nested in mod2: it is the same model, just without education

To compare non-nested models (where one model is not a special case of the other model) we can use the AIC() or BIC():

mod3 <- update(mod2b, formula = . ~ . - age + induced)
# exchange age for induced in the formula; the models are no longer nested because a variable was swapped, not just removed

AIC(mod3, mod2b) #which one has a better fit?
BIC(mod3, mod2b) # BIC penalizes model complexity more strongly than AIC, so the two do not always agree; smaller is better!

# when you need the variables of a formula:
formula(mod3)            # returns the formula object itself
all.vars(formula(mod3))  # returns the variable names as a character vector

Logistic Regression

We also would like to use this function for output from logistic regression models. Does the current version of the function work with logistic regression models?

  • If it does: does it give us the summary that we are interested in?
  • If it does not: investigate what the problem is.
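
The objects used below (the function summarylm() and the models modlog and test_logit) were created earlier in the course and are not shown in this section. To make the following steps reproducible, here is a minimal sketch of what they could look like; the exact definition of summarylm(), the data set testdat and its variable names are assumptions for illustration only:

# assumed earlier version of summarylm(), written with only linear models in mind:
summarylm <- function(lmmodel, decimals = 3) {
  dat <- data.frame(estimate = coef(lmmodel),
                    confint(lmmodel),
                    check.names = FALSE)
  dat <- round(dat, decimals)
  dat$`p-value` <- format.pval(summary(lmmodel)$coefficients[, 'Pr(>|t|)'],
                               eps = 10^(-decimals), digits = decimals)
  dat
}

# assumed example data with a binary outcome, and a logistic regression model fitted to it:
set.seed(2020)
testdat <- data.frame(ybin = rbinom(200, size = 1, prob = 0.4),
                      x1 = rnorm(200),
                      x2 = rnorm(200))
test_logit <- glm(ybin ~ x1 + x2, family = binomial(), data = testdat)
modlog <- test_logit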
summarylm(modlog)
# We get the error message "subscript out of bounds". This means we are trying to select an element that does not exist.

# To find the problem, we check each part of the function body, replacing the model argument by test_logit:

decimals <- 3

dat <- data.frame(estimate = coef(test_logit),
                  confint(test_logit),
                  check.names = FALSE)

dat <- round(dat, decimals)

dat$`p-value` <- format.pval(summary(test_logit)$coefficients[, 'Pr(>|t|)'],
                             eps = 10^(-decimals), digits = decimals)

summary(test_logit)$coefficients[, 'Pr(>|t|)']

#Same error, so probably the column 'Pr(>|t|)' does not exist.

summary(test_logit)$coefficients

Indeed, for a logistic regression model, the column is called Pr(>|z|)!

Adjust the function so that it works with both linear and logistic regression models.

summarylm <- function(lmmodel, decimals = 3){
  dat <- data.frame(estimate = coef(lmmodel),
                    confint(lmmodel),
                    check.names = FALSE)
  dat <- round(dat, decimals)
  # column 4 holds the p-values for both lm ('Pr(>|t|)') and glm ('Pr(>|z|)') summaries
  dat$'p-value' <- Hmisc::format.pval(summary(lmmodel)$coefficients[, 4],
                                      eps = 10^(-decimals), digits = decimals)
  dat
}


summarylm(modlog)

fun4 <- function(model, decimals = 3, modeltype) {
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  dat <- round(dat, decimals)
  
  if (modeltype == 'lm') {
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
  }
  
  dat
}

fun4(modlog, modeltype = 'logit')

Now that the function “works”, i.e., runs for both types of models, we can come back to the other question: For a logistic model: is this the output that we are interested in?

For a logistic regression model, the effects are usually reported as odds ratios. In our current version, we present the effects on the log odds scale.

Adjust the function so that for logistic regression models the summary is given as odds ratios. Do not forget to change the column name(s).

fun5 <- function(model, decimals = 3, modeltype) {
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  if (modeltype == 'lm') {
    dat <- round(dat, decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat <- round(exp(dat), decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
    
    names(dat)[1] <- 'OR'
  }
  
  dat
}

# the whole data.frame of estimates and confidence limits has to be exponentiated!
# the first column is renamed with names(dat)[1] <- 'OR'

fun5(modlog, modeltype = 'logit')

To make the function more convenient to use, we want to automatically identify the model type, so that the user does not need to provide the modeltype.

fun6 <- function(model, decimals = 3) {

  # extract the family and link  
  fam <- family(model)$family
  link <- family(model)$link
  
  modeltype <- if (fam == 'binomial' & link == 'logit') {
    'logit'
  } else {
    if (fam == 'gaussian' & link == 'identity') {
      'lm'
    }
  }
  
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  if (modeltype == 'lm') {
    dat <- round(dat, decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat <- round(exp(dat), decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
    
    names(dat)[1] <- 'OR'
  }
  
  dat
}

# the model type is determined from the FAMILY and LINK of the fitted model

fun6(modlog) # no modeltype argument needed anymore

What would happen if we would apply our function to a model that is not a standard linear regression (with identity link) or logistic regression model?

Try to figure this out by looking at the syntax without trying it out!

If we provide a model that does not fit either of the criteria fam == 'binomial' & link == 'logit' or fam == 'gaussian' & link == 'identity', the value of modeltype will be NULL.

Why? The first condition (fam == 'binomial' & link == 'logit') would be FALSE, so we would end up in the else part. There, the condition (fam == 'gaussian' & link == 'identity') would also be FALSE. Since this if() does not provide an else alternative, it returns NULL.

The question is what happens when we try to evaluate if (modeltype == 'lm') when modeltype is NULL.

The comparison of NULL with 'lm' results in a logical vector of length zero.

Using a logical vector of length zero as the condition of if() results in an error:

## Error in if (modeltype == "lm") { : argument is of length zero
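
A quick way to see this behaviour (a small illustration, not part of the original code):

modeltype <- NULL
modeltype == 'lm'           # logical(0): a logical vector of length zero
length(modeltype == 'lm')   # 0
# if (modeltype == 'lm') 'yes'   # would fail with: argument is of length zero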

Prevent this error from occurring by stopping the evaluation of the function before it happens, and provide an informative error message.

# see below for a better option! 
fun7 <- function(model, decimals = 3) {

  # extract the family and link  
  fam <- family(model)$family
  link <- family(model)$link
  
  modeltype <- if (fam == 'binomial' & link == 'logit') {
    'logit'
  } else {
    if (fam == 'gaussian' & link == 'identity') {
      'lm'
    } else {
      stop('This function only works for linear models (gaussian family, identity link) and logistic regression models (binomial family, logit link).')
    }
  }
  
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  if (modeltype == 'lm') {
    dat <- round(dat, decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat <- round(exp(dat), decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
    
    names(dat)[1] <- 'OR'
  }
  
  dat
}

test_probit <- glm(ybin ~ ., family = binomial(link = 'probit'), data = testdat)


fun7(test_probit) # now stops with the informative error message

# OR: a better option is to check explicitly whether modeltype is NULL

fun7 <- function(model, decimals = 3) {

  # extract the family and link  
  fam <- family(model)$family
  link <- family(model)$link
  
  modeltype <- if (fam == 'binomial' & link == 'logit') {
    'logit'
  } else {
    if (fam == 'gaussian' & link == 'identity') {
      'lm'
    }
  }
  
  if (is.null(modeltype)) {
    stop("I do not know how to create the summary for the type of model you have provided.")
  }
  
  dat <- data.frame(estimate = coef(model),
                    confint(model),
                    check.names = FALSE)
  
  if (modeltype == 'lm') {
    dat <- round(dat, decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|t|)'],
                                 eps = 10^(-decimals), digits = decimals)
  } else {
    dat <- round(exp(dat), decimals)
    
    dat$`p-value` <- format.pval(summary(model)$coefficients[, 'Pr(>|z|)'],
                                 eps = 10^(-decimals), digits = decimals)
    
    names(dat)[1] <- 'OR'
  }
  
  dat
}