All of the tests we’ve seen so far were developed for variables that conform to certain distributions, like the normal or binomial. If your sample doesn’t conform, your chance of making a Type I error increases; you’d be “cheating” the test and reaching a conclusion that is meaningless. Of course, this defeats the purpose of statistics.
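To see what I mean by “cheating,” here is a quick simulation sketch: draw small samples from a strongly skewed (exponential) distribution whose true mean really is 50, run a one-sample t test against 50 on each one, and count how often the test rejects. If the normality assumption were met, that rate would be about 5%; run it and see how far it drifts.
# Simulate the Type I error rate when the data are skewed instead of normal
set.seed(42)
pvals <- replicate(10000, t.test(rexp(10, rate = 1/50), mu = 50)$p.value)
mean(pvals < 0.05)   # H0 is true every time, so compare this to the nominal 0.05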
Here is a sample of random numbers from http://random.org/
demo <- c(14, 1, 34, 93, 20, 90, 88, 15, 34, 98)
hist(demo, col = "pink", main = "", breaks = seq(0,100, by = 4))
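(By the way, you don’t need random.org for this; base R’s sample() will draw uniform random integers just as happily.)
# One way to draw 10 random integers between 1 and 100 in R
sample(1:100, size = 10, replace = TRUE)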
So those are truly random numbers between 1 and 100—we would expect their mean to be 50. Next I had R choose 40 random numbers drawn from a normal distribution, and I got this:
demo2 <- rnorm(40, mean = 50, sd = 10)
hist(demo2, col = "pink", main = "", breaks = seq(0,100, by = 4))
The second histogram clearly looks more normally distributed, and that’s not surprising: I drew a larger sample and specified that it should come from a normal distribution, rather than from a flat, uniform spread of values.
Graphical plots like histograms are good for visually assessing normality. R has another built-in graphical method, the normal quantile (Q-Q) plot, which plots your sample’s quantiles against those of a hypothetical normal distribution; if the data are normal, the points fall along a straight line. Compare the two samples from above:
# I did these as separate functions but am putting the plots side by side.
# datax = TRUE specifies that the sample data should go on the X axis
qqnorm(demo, main = "demo", datax = TRUE)
qqnorm(demo2, main = "demo2", datax = TRUE)
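If it helps to judge how straight the points are, qqline() will add a reference line through the first and third quartiles of each sample:
# Add reference lines to the normal quantile plots
qqnorm(demo, main = "demo", datax = TRUE)
qqline(demo, datax = TRUE)
qqnorm(demo2, main = "demo2", datax = TRUE)
qqline(demo2, datax = TRUE)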
The Q-Q plots show that the demo2 sample falls much closer to a straight line than the smaller demo sample does, but demo2 isn’t exactly on a straight line either. Can we hang some numbers on this? Yes, we can, with the Shapiro-Wilk test, which tests the null hypothesis that the data come from a normal distribution.
shapiro.test(demo)
##
## Shapiro-Wilk normality test
##
## data: demo
## W = 0.83245, p-value = 0.03581
shapiro.test(demo2)
##
## Shapiro-Wilk normality test
##
## data: demo2
## W = 0.96576, p-value = 0.262
So, here we see that demo is unlikely to have been drawn from a normal distribution (\(P = 0.036\)), while demo2 fails to reject \(H_0\) (\(P = 0.26\)), so it is consistent with a normal distribution.
What’s the worst that could happen if we run a t test anyway? It’s not going to break your computer. The mean of my sample is 48.7 and the SD is 38.75 (which is huge). Let’s see if my non-normal dataset is different from my expected mean of 50.
t.test(demo, mu = 50)
##
## One Sample t-test
##
## data: demo
## t = -0.10609, df = 9, p-value = 0.9178
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 20.97988 76.42012
## sample estimates:
## mean of x
## 48.7
Ugh. We can’t reject the \(H_0\) that the mean equals 50 (\(P = 0.92\)); we are nowhere close to rejecting it. So, with this sample, even though I’m cheating the test, I still can’t reject \(H_0\), and that’s a sign that there really is no difference from the population mean of 50.
Sometimes we can transform our data to try to make them normal. This is frequently a problem for biologists when we look at a variable that has values across many orders of magnitude. Consider the example of brain size in mammals. Because mammals differ enormously in body size and cognitive ability, brain mass ranges from at least 0.176 g in a shrew to about 1500 g in a human to about 7000 g in a blue whale. Even for the sample of animals shown below, there is a wide range of variability.
The histogram of these masses would look like this:
brains <- c(0.176, 0.347, 0.416, 1.02, 0.802, 1.802, 0.999, 3.759, 7.78, 18.365, 10.15, 15.73, 1600, 30.22, 53.21, 87.36, 1508)
hist(brains, col = "gray50", breaks = seq(0, 1600, by = 5), xlim = c(0, 1600), main = "", xlab = "Brain mass (g)")
Not normal at all: there are a bunch of small values and a few large ones. This violates the normality assumption of a t test. But seeing how non-normal our data are because of the large range of values, we should try a transformation before giving up.
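If you want a number on just how non-normal this is before transforming, the same Shapiro-Wilk test from above applies; with data this skewed, expect a very small p-value.
# Quantify the non-normality of the raw brain masses
shapiro.test(brains)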
There are lots of transformations that are possible, but the most common one in biology (particularly in morphometrics) to flatten out curved distributions is the natural-log transformation.
# Transform array into a new object...
lnbrains <- log(brains)
# and then compare the normal quantile plots again.
qqnorm(brains, main ="Linear brain masses")
qqnorm(lnbrains, main = "Transformed brain masses")
While it’s not perfectly linear, the transformed sample I called lnbrains looks a lot more normal. A Shapiro-Wilk test confirms this:
shapiro.test(lnbrains)
##
## Shapiro-Wilk normality test
##
## data: lnbrains
## W = 0.93694, p-value = 0.2836
So, the transformation is not perfect, but it’s not so bad that it would make a t test choke, so let’s do it. The raw linear sample gives a mean of about 200 g for brain size, but we know that the mean is highly susceptible to outliers. If we want to see if this transformed sample has a mean of 200 g, we first need to transform 200 g into 5.3 ln g. This is an awkward unit, but it’s only needed while doing the stats.
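As a quick check, R will do that conversion for you:
# Natural log of the hypothesized mean of 200 g
log(200)
## [1] 5.298317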
\(H_0\): The sample mean is 5.3 ln g
\(H_a\): The sample mean is not 5.3 ln g
t.test(lnbrains, mu = 5.3)
##
## One Sample t-test
##
## data: lnbrains
## t = -4.9636, df = 16, p-value = 0.0001408
## alternative hypothesis: true mean is not equal to 5.3
## 95 percent confidence interval:
## 0.6428663 3.4303768
## sample estimates:
## mean of x
## 2.036622
So, from this we conclude that the parametric mean of brain size for these animals is between 0.64 and 3.43 ln g, or 1.9 to 30.9 g. (We “un-transform” these values with the e^x function on a calculator or in R.) It’s always necessary to return your data to the original units when stating your conclusion.
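In R, that back-transformation is just exp() applied to the confidence limits from the output above:
# "Un-transform" the CI limits from ln g back to g
exp(c(0.6428663, 3.4303768))   # roughly 1.9 and 30.9 g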
Almost every test has a non-parametric equivalent that gives you a chance to answer questions with your rotten data. The sign test, for example, is an alternative to either the one-sample or paired t test, but it works with the median. It tests these hypotheses:
\(H_0\): The sample median is not different from the hypothesized value (\(\mu_0\))
\(H_a\): The sample median differs from \(\mu_0\)
\(\mu_0\) is usually zero. This test is identical to the binomial test, using p = 0.5 as the null probability of a hit. Use this test to calculate a probability for the observed outcome and all more extreme outcome combinations. Its only assumption is that you’ve sampled randomly from the population.
The idea with this test is that we are doing the over/under to compare two values. For a one-sample comparison like my first example (demo) above, we would count the number of times that an observation was either above (+) or below (-) the hypothesized mean of 50:
Observation:  14   1  34  93  20  90  88  15  34  98
Sign:          -   -   -   +   -   +   +   -   -   +
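R can do the over/under count for you:
# Count observations below and above the hypothesized mean of 50
sum(demo < 50)   # minuses
## [1] 6
sum(demo > 50)   # pluses
## [1] 4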
So, there are 6 – and 4 +. Fit those 6 hits into the binomial function, using p = 0.5:
binom.test(6, n = 10, p = 0.5, alternative = "two.sided")
##
## Exact binomial test
##
## data: 6 and 10
## number of successes = 6, number of trials = 10, p-value = 0.7539
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.2623781 0.8784477
## sample estimates:
## probability of success
## 0.6
So we fail to reject the null hypothesis that the median equals 50: the p-value is 0.75, and the 95% CI for the probability of success includes 0.5. Why not just use the t test result we got above? Because the sign test isn’t being cheated when we do it this way; this conclusion is actually justified.
Consider a drug that is being tested to see if it decreases the frequency of repetitive behaviors. Here is a sample of 10 subjects measured before and after one week on the drug.
\(H_0\): Decreases are no more frequent than increases (\(p \le 0.5\))
\(H_a\): Decreases are more frequent than increases (\(p > 0.5\))
R can do these comparisons for you.
# Bring in the data
repdrug <- read.csv(url("https://raw.githubusercontent.com/nmccurtin/CSVfilesbiostats/master/rep-drug.csv"))
# Count the frequencies of differences
repdrug$result <- "equal" ## make this column first, and set all rows as "equal"
repdrug$result[repdrug$difference > 0] <- "above" ## register the increases
repdrug$result[repdrug$difference < 0] <- "below" ## register the decreases
repdrug$result <- factor(repdrug$result, levels = c("equal", "above", "below")) ## make those results into factors
table(repdrug$result) ## see the results
##
## equal above below
## 8 0 0
In case you’re thinking that this is a lot of hassle for 8 observations, I would agree, but this workflow will work better than doing it by hand if you have 80 or 8000 observations. Okay, now do the sign/binomial test to see if you have more decreases than increases:
binom.test(6, n = 8, p = 0.5, alternative = "greater")
##
## Exact binomial test
##
## data: 6 and 8
## number of successes = 6, number of trials = 8, p-value = 0.1445
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.4003106 1.0000000
## sample estimates:
## probability of success
## 0.75
The probability, then, of our observed outcome having more successes than failures, plus all the more extreme outcomes, is 0.14, so we fail to reject \(H_0\). Notice that the observed probability (\(\widehat{p} = 0.75\)) is within the 95% CI.
Understand this: If we had just done a t test on those data, we would have rejected \(H_0\) (t = 2.06, df = 7, P = 0.039), but that result would have been meaningless because we would have cheated that test. Any nonparametric test has lower power than its “regular” parametric counterpart, so a difference you could legitimately detect with a parametric test will not always be detected by the nonparametric version. On the other hand, nonparametric alternatives give us some chance to work with data that otherwise would be untestable without cheating.
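If you want to convince yourself about the power difference, here is a rough simulation sketch: normal data where \(H_0\) really is false (true mean 60, tested against a hypothesized mean of 50), so we can count how often each test manages to reject.
# Compare rejection rates of the t test and the sign test on the same samples
set.seed(1)
reps <- 5000
t_rej <- sign_rej <- logical(reps)
for (i in 1:reps) {
  x <- rnorm(10, mean = 60, sd = 10)   # H0 (mu = 50) is false here
  t_rej[i] <- t.test(x, mu = 50)$p.value < 0.05
  sign_rej[i] <- binom.test(sum(x > 50), n = 10, p = 0.5)$p.value < 0.05
}
mean(t_rej)      # rejection rate (power) for the t test
mean(sign_rej)   # rejection rate for the sign test, which should come out lower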
The Mann-Whitney U test (called the Wilcoxon rank-sum test in R) compares two groups by ranking all of the observations and comparing the ranks in one sample vs. the other. If there were a difference in median between two samples, you would expect most of the small values to be in the sample with the lower median and most of the large values to be in the sample with the higher median.
Consider this example of CD4 cell counts in immunocompromised people. CD4 cells are a type of white blood cell; their levels in the blood are a measure of the competency of the immune system. We would expect that patients would have lower CD4 counts than control subjects.
\(H_0\): The median of the patient sample is equal to or larger than the median of the control sample.
\(H_a\): The patient sample has a lower median than the control sample.
# Bring in the data
cd4 <- read.csv(url("https://raw.githubusercontent.com/nmccurtin/CSVfilesbiostats/master/cd4.csv"))
# Let's compare the medians to see what we'd expect
# The argument na.rm = TRUE is there so that any missing values are ignored.
tapply(cd4$count, cd4$group, FUN = median, na.rm = TRUE)
## controls patients
## 771 437
Based on this, we expect that most of the smaller values should be in the patients group because it has the smaller median. Let’s make another grouped histogram with the lattice package:
library("lattice")
histogram( ~ count | group, data = cd4,
layout = c(1,2), col = "firebrick", breaks = seq(0, 1400, by = 100),
type = "count", xlab = "Count of CD4 cells", ylab = "Frequency")
Let’s have R run the test now.
wilcox.test(count ~ group, data = cd4, alternative = "greater")
## Warning in wilcox.test.default(x = c(710L, 1260L, 717L, 590L, 930L, 995L, :
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: count by group
## W = 90, p-value = 0.005806
## alternative hypothesis: true location shift is greater than 0
Don’t sweat that warning. It’s just telling you that there are tied values (two of the controls had a count of 710). The R help tells us that “by default (if exact is not specified), an exact P-value is computed if the samples contain less than 50 values and there are no ties. Otherwise, a normal approximation is used”… and that’s when you get this warning message.
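If the warning bothers you, you can tell R up front to use the normal approximation and it goes away quietly:
# Ask for the normal approximation explicitly so R doesn't warn about ties
wilcox.test(count ~ group, data = cd4, alternative = "greater", exact = FALSE)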
Okay! This covered a lot of ground. We’ll unpack this in class as needed.