Tests of Significance, Version 2

Remember the violin/box plot of the price of diamonds in the diamonds data set?


The violin/box plot, shown above, illustrates the distribution of the price variable across the population represented in the data set. Later, we will call this population of 53940 diamonds the reference or null population of diamonds. But, while keeping the reference population in mind, we are going to work with a sample of diamonds from a possibly different population (for example, diamonds from another store, different from the one that housed the original population of 53940 diamonds). We won’t know the size of this new population, and for the moment, we will even leave the size of this sample unspecified. Maybe the sample is only 4 diamonds, or maybe it’s 400 diamonds (but actually, below, we will start with a sample of just a single diamond, before upping our sample size in a future lesson).

Our main question: “Are the diamonds from this new population, from which the sample was drawn, more expensive, on average, than diamonds from our original, reference, or null population?” In other words, for example, does Zales Jewelers sell more expensive diamonds, on average, than Helzberg Diamonds? We would need data from these two stores to address this question.

Question: If we had data sets that involved prices of all of the diamonds from Zales Jewelers, and prices of all of the diamonds from Helzberg Diamonds, describe a procedure, with formulas, for determining precisely, beyond a shadow of a doubt, which jeweler sells more expensive diamonds, on average. This question should be easy, but our problem is harder. We have all of the diamonds from one store, but only a sample from the other (requiring a one-sample test). A related question asks what we would do if we only had two samples, one from each store.

The most direct procedure would be to compute the mean price of each complete data set: add up all the prices in the data set and divide by the number of diamonds. Whichever store has the higher mean is the jeweler that sells more expensive diamonds, on average. Because both data sets cover every diamond, there is no sampling uncertainty in this comparison.
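One way to write this procedure as formulas is sketched below. The symbols are introduced here only for illustration: $x_1, \dots, x_{N_Z}$ are the prices of all $N_Z$ diamonds at Zales, and $y_1, \dots, y_{N_H}$ are the prices of all $N_H$ diamonds at Helzberg.

\[\mu_Z = \frac{1}{N_Z}\sum_{i=1}^{N_Z} x_i, \qquad \mu_H = \frac{1}{N_H}\sum_{j=1}^{N_H} y_j\]

Zales sells more expensive diamonds, on average, exactly when $\mu_Z > \mu_H$. Both quantities are population means computed from complete data, so no test of significance is involved.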

I am not going to tell you how I got the samples I use below; that information is totally irrelevant for the present purposes. Suffice to say, I only have data from one store. But I will say that it is possible to sample from other populations, given just the data we have. Specifically, you can create new populations by simply restricting the reference population using other characteristics of diamonds described by the data set: color, cut, clarity, carat, etc.

For example, we could ask: are diamonds with the D-color designation more expensive than diamonds overall? If you know anything about diamonds, you might know that diamonds with this color designation are considered more valuable, all other things being equal, than diamonds of any other color designation. So this question sounds like a no-brainer, except for the “all other things being equal” clause. Without knowing anything about diamonds, it would be entirely conceivable to wonder if D-color diamonds might tend to be a lot smaller than other diamonds, so their price might tend to be less, on average, despite their superior color, simply due to their inferior weight (carat). Of course, with the whole data set you could answer this question definitively, and without sampling. And if you know anything about diamonds, you might know the answer already. Question: Describe a procedure in words, not code, for determining the answer to this question definitively, without sampling, using the whole diamonds data set.

Similar to the above question, the procedure would be to compute two means from the whole data set: the mean price of only the D-color diamonds, and the mean price of all 53940 diamonds. Comparing the two means tells you definitively whether D-color diamonds are more expensive, on average, than diamonds overall.
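Although the question asks for words, the same procedure can be sketched in a few lines of Python. The rows below are hypothetical stand-ins for the real data set (which has 53940 rows and more columns), and the function name `mean_price` is made up for this illustration.

```python
from statistics import mean

# Hypothetical (color, price) rows standing in for the diamonds data set.
rows = [("D", 3200), ("E", 2800), ("D", 4100), ("J", 5300), ("G", 3900), ("D", 2500)]

def mean_price(rows, color=None):
    """Mean price of all rows, or of only the rows with the given color."""
    prices = [price for c, price in rows if color is None or c == color]
    return mean(prices)

overall = mean_price(rows)           # mean over the whole (toy) population
d_only = mean_price(rows, color="D") # mean over the D-color subpopulation

print(f"overall mean: {overall:.2f}, D-color mean: {d_only:.2f}")
print("D-color more expensive on average:", d_only > overall)
```

With these made-up rows the D-color mean happens to be lower than the overall mean; the real data set may of course give a different answer.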

But the question I am asking here is different. Here, you are given a sample of diamonds, and their prices. The sample comes from a larger population, and you know that it is drawn as a simple random sample from that population. You don’t know anything about this population other than the prices of the diamonds in that simple random sample. The sampled-from population may be from another store, or it may be from a subset of diamonds from the reference population. You don’t know, and if you do know, these details don’t matter for the problem at hand. We address this question: are the diamonds in the new sampled-from population (the whole population, including the diamonds whose prices you don’t know) more expensive, on average, than the diamonds in your reference population (the one that you do know about)? Question: Why is it impossible to answer this question with the information you have, definitively, and beyond a shadow of a doubt?

It is impossible to answer this question definitively because you lack data that is necessary to answer it: the prices of the diamonds that were not sampled. Even a simple random sample contains only some of the population, so its mean can differ from the population mean by chance. There will always be some doubt without the entire population’s data.

Note that while these questions may be criticized as being contrived, they remain very similar to important questions in other domains, such as: “Does a new drug work better, on average, than an old drug?” or: “Do students perform better with a new teaching method than with an old teaching method?” For these questions, sampling is not always so contrived. You can’t know how all the people in the world with the disease you are studying (your population of interest) would respond to your drug. You can only know how the people in your sample, the ones to whom you gave the drug, respond to it. For these problems, a two-sample test might be more relevant. A two-sample test would involve, for example, both a sample of people using the new drug, and a sample of people using an old drug (or a placebo). We will get to two-sample tests soon enough, but we will start with a one-sample test. Here we compare one sample from an unknown distribution to a reference or null population that is fully known.

Our reference population consists of the entire data set of 53940 diamonds. The population mean of the price of these diamonds (labeled in the plot with the red dot) is $3932.80. We look for evidence that the sampled-from population consists of diamonds that are more expensive, on average, than the reference population. In other words, we look for evidence that the population mean of the new population (which we only know of from the sample) is larger than $3932.80, the population mean of the reference population. The approach we take uses the data we have.

We can identify the following hypotheses:

The null hypothesis: the statement that there is no effect:

\[H_0 : \mbox{The sample was drawn from the reference population}\]

Restated, the null hypothesis posits that the new population equals the reference population, and we know exactly what the reference population is. Thus the null hypothesis is a specific statement. There is only one way for it to be correct. Making such a specific statement allows us to quantify the evidence that it is false.

The alternative hypothesis: the statement that the effect we want is present:

\[H_a: \mbox{The sample was drawn from a population with a population mean greater than \$3932.80}\]

Question: Is the alternative hypothesis a specific statement? We said there was only one way for the null hypothesis to be correct. Is there more than one way that the alternative hypothesis could be correct?

The alternative hypothesis is not a specific statement: it only says that the population mean is greater than $3932.80, and many different populations satisfy that condition. Because of this, there are many different ways for the alternative hypothesis to be correct.

Here is an example sample of prices of 4 diamonds:

## [1] 5801 8549  744  538

Is there evidence for our alternative hypothesis? We are going to use the sample mean to make this decision. In this case, the sample mean is the sum of these four prices, the only prices we know from the new population, divided by 4. This sample mean is what we call our test statistic. Question: If our null hypothesis is correct, what are the likely values of our test statistic? If our null hypothesis is false, and instead our alternative hypothesis is correct, in what way(s) will the likely values of the test statistic shift? To answer these questions, refer to the plot, below. Why is this plot relevant? What part is relevant to the sample size of 4? What values for the test statistic would suggest that the null hypothesis is false and instead that the alternative hypothesis is correct? How does this change with different sample sizes?

If our null hypothesis is correct, then the likely values of our test statistic would be located within the box of the sample-size-four plot (so from about 2500 to about 5000). If our null hypothesis is false and the alternative hypothesis is correct, then the likely values of the test statistic would shift up, toward higher prices. The plot is relevant because it shows how the sample mean is distributed when samples are drawn from the reference population; the part for sample size 4 is the one that applies to our sample. Sample means well above the values that are likely under the null hypothesis would suggest that the null hypothesis is false and that the alternative hypothesis is correct. This changes with different sample sizes because the sample mean varies less as the sample grows: its likely values under the null hypothesis cluster more tightly around $3932.80, so a smaller excess counts as evidence.
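As a small worked example, the test statistic for the four-diamond sample shown earlier can be computed directly. The variable names are chosen here for illustration; the prices and the reference mean are the ones given in the text.

```python
# Prices of the four sampled diamonds, as printed above.
sample = [5801, 8549, 744, 538]

# Population mean of the reference (null) population, from the text.
reference_mean = 3932.80

# The test statistic is the sample mean: sum of prices divided by sample size.
sample_mean = sum(sample) / len(sample)

print(f"sample mean: {sample_mean:.2f}")
print("sample mean exceeds reference mean:", sample_mean > reference_mean)
```

For this particular sample the mean works out to 3908, slightly below the reference mean, so these four prices taken together do not point toward the alternative hypothesis.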

If you answered the above questions, you may understand why we interpret large sample means as evidence against our null hypothesis and for our alternative hypothesis. At some point, when the sample mean is large enough, we deem the evidence for the alternative hypothesis significant. But how large does the sample mean need to be to get this conclusion? The threshold, where we say the sample mean is large enough, is called the critical value of the test statistic.

To understand how to set this critical value, let’s tackle a simpler problem: our sample will consist of a single diamond: just the first one selected in the sample above, with price: $5801. Based on just this diamond, should we accept our alternative hypothesis? It can be said that we do have some evidence: $5801 > 3932.80.$ But do we have enough evidence? And how much evidence is enough? It depends on our critical value. If 5801 is bigger than our critical value, then we will say we have enough evidence to conclude that diamond was sampled from a population with a greater mean. If not, we won’t be able to say that. Question: If we had set our critical value for the test statistic to $5600, what would we deem true about our null hypothesis and our alternative hypothesis? How about if we had set our critical value for the test statistic to $6600? Turns out, there is a principled way of setting this critical value. How do we do it?

The higher we set our critical value, the more evidence we demand. With a critical value of $5600, the observed price of $5801 exceeds the critical value, so we would reject the null hypothesis and accept the alternative hypothesis that the diamond came from a more expensive population. With a critical value of $6600, the observed price falls short, so we would not have enough evidence to reject the null hypothesis. In general, if the test statistic is above the critical value, there is enough evidence to accept the alternative hypothesis; if it is below the critical value, there is not.
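The decision rule for this one-sided test is simple enough to write down as code. This is a minimal sketch; the function name `decide` and the wording of its return values are assumptions made for this example.

```python
def decide(test_statistic, critical_value):
    """One-sided rule: the result is significant only if the test
    statistic exceeds the critical value."""
    if test_statistic > critical_value:
        return "accept alternative"
    return "fail to reject null"

observed_price = 5801  # the single sampled diamond

for critical in (5600, 6600):
    print(f"critical value {critical}: {decide(observed_price, critical)}")
```

The same observed price leads to opposite conclusions under the two critical values, which is why the critical value has to be set by a principled rule rather than by convenience.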

We don’t know the distribution of the diamonds from the population from which the sample was drawn; however, we do know the distribution of diamonds under the null hypothesis. Therefore, we can answer the following question: If we make the null hypothesis correct, by drawing a sample from the null population (call it a null-sample), and we use our observed test statistic, $5801, as the critical value for deeming the alternative hypothesis correct, what is the probability that we will erroneously deem the alternative hypothesis correct? (Note: we know the alternative hypothesis is incorrect for null-samples.) This probability is the p-value for our original sample. Said a slightly different way, the p-value is the probability of erroneously concluding that a null-sample comes from a more expensive population, assuming the original sample determines our threshold for making this decision. Because the p-value is a probability, it is always a number between 0 and 1. The closer the p-value is to 0, the more surprised we would be to see our data (the original sample) if the null hypothesis were correct. Low p-values are interpreted as strong evidence against the null hypothesis and for the alternative hypothesis.

The black horizontal line is drawn at the price of the single sample diamond: $5801. What if, as suggested above, we used this line as the critical value for assessing the truth of the alternative hypothesis? We have no way of knowing what would happen if the alternative hypothesis were correct. There are many populations that satisfy the alternative hypothesis. But we do know what would happen if the null hypothesis is correct: there is only one population that satisfies the null hypothesis, the one depicted in the violin/box plot, above. If the black line represented our critical value, and if the price of a null-sampled diamond fell above the black line, we would incorrectly reject the null hypothesis. Question: Compared to the single diamond price seen in our original sample, $5801, in what way could the originally sampled price of the diamond have been different such that it would have led to a lower p-value? Think carefully about the definition given above and also refer to the immediately previous figure. Test statistics that lead to lower p-values are said to be more extreme (that is, further into the region where we deem the alternative hypothesis correct). The p-value is also described as the probability of observing a null-sample with a more extreme test statistic than seen with the original sample.

The originally sampled price could have been higher than $5801, which would place the horizontal black line higher on the graph. Fewer diamonds in the reference population would then fall above the line, so the p-value would be lower. The smaller the p-value, the more surprised we would be to see our data if the null hypothesis were correct.

The number of diamonds in the reference population whose prices fall above this line is 12077, which is 22.39% of the diamonds in the data set. Since each diamond has an equal chance of being selected as a null-sample, the probability of an error, in this situation, is:

\[\mbox{p-value} = \frac{12077}{53940} = 0.2239\]
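The counting behind this p-value can be sketched as follows. The function name `p_value` and the ten-price toy population are hypothetical, introduced only to show the mechanics; the last lines simply redo the arithmetic with the counts quoted in the text.

```python
def p_value(observed, null_prices):
    """Fraction of the null population priced strictly above the
    observed price (one-sided test, sample of a single diamond)."""
    above = sum(1 for price in null_prices if price > observed)
    return above / len(null_prices)

# Hypothetical toy null population (NOT the real diamonds data).
toy_null = [500, 900, 1500, 2400, 3300, 4700, 5801, 6200, 9000, 15000]
toy_p = p_value(5801, toy_null)
print("toy p-value:", toy_p)

# Arithmetic with the counts quoted in the text for the real data set.
text_p = 12077 / 53940
print("p-value from the text's counts:", round(text_p, 4))
```

Because every diamond in the null population is equally likely to be drawn, counting the diamonds above the line and dividing by the population size gives the probability directly.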

Erroneously accepting the alternative hypothesis is called a false positive, because we erroneously find the positive effect: the null-sample comes from a more expensive population, but the correct answer is negative: we drew our samples from the null distribution. The following are examples of positive effects: the new store sells more expensive diamonds than the old store, the new drug works better than placebo, my new teaching methods work better than the old ones, etc. If we come to these conclusions in error, we have made a false-positive error (type I error). The other type of error is a false negative error (type II error). Negating the positive statements yields negative statements (e.g. the new drug works the same as placebo), and if such a conclusion is in error (we conclude that the new drug works the same as placebo, when in fact it works better), our conclusion is a false negative (type II) error. Tests of significance are designed to control for type I errors (false positives), though we can study false negatives as well (mostly saved for future lessons, but see below). Question: Why is it easier to control for type I errors than to control for type II errors? Hint: remember that the null hypothesis is a specific statement about a single distribution or population—there is only one way the null hypothesis can be correct—whereas the alternative hypothesis isn’t—there are many ways the alternative hypothesis can be correct.

It is easier to control for type I errors because only one population fits the null hypothesis, and we know it completely. That lets us compute the probability of a false positive exactly for any critical value we choose. A type II error happens when the alternative hypothesis is correct, and many different populations fit the alternative hypothesis. The probability of a false negative depends on which of those populations the sample actually came from, and we do not know that.

The p-value is the probability of making a false-positive error (incorrectly accepting the alternative hypothesis) if the null hypothesis is correct and if we use the test statistic seen in our original sample as the critical value for accepting the alternative hypothesis. However, the critical value should never be chosen based on the data. It should be chosen, before collecting data, based on what probability of a type I error you find acceptable.

So what’s the answer: do we have evidence to support the alternative hypothesis? Actually, it depends on you. If you are O.K. with being wrong 22.39% of the time when you make the null hypothesis correct, then yes, you do have enough evidence.

The level of significance of a test, also called the alpha-level or $\alpha$-level of a test, is the probability of making a false-positive error, assuming the null hypothesis is correct. The alpha-level of a test is typically decided on in advance of an experiment, and it sets the critical value for accepting the alternative hypothesis, regardless of the sample.

The alpha-level of the test, the acceptable probability of a false-positive error, is usually much lower than 0.2239. In fact, the traditional level, accepted by most statisticians, is 0.05, though other levels are sometimes used. In other words, many statisticians consider it acceptable to have a 0.05 probability of making a false-positive error, under the condition where the null hypothesis is correct.

So if we set the alpha-level to its traditional level, 0.05, what is the critical value for the test statistic? That’s easy: it’s the value such that 5% of the diamonds have a larger price, and thus 95% of the diamonds have a lower price. That value has a name: the 95th percentile of the price of diamonds.

The 95th percentile of diamond price in our data set is 13107.1. In other words, if a single diamond, sampled at random from a different data set, such as from a different store, has a price that is greater than 13107.1, then statisticians would, based on tradition, consider the evidence sufficient to conclude that the diamonds from the new store are more expensive than the diamonds from the old store. However, if we trick the statisticians, and give them a diamond from the old store, so that a positive conclusion (diamonds are more expensive from the population sampled) would be wrong, the statisticians would be wrong 5% of the time. They would know that fact, but consider this rate of false-positive error acceptable.
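A minimal sketch of turning an alpha-level into a critical value is shown below. The function `critical_value` and the twenty-price toy population are assumptions for illustration. The rule used here (the smallest price with at most a fraction alpha of the population strictly above it) is one simple percentile convention; software that interpolates between prices can give slightly different numbers, so this sketch need not reproduce 13107.1 exactly on the real data.

```python
def critical_value(null_prices, alpha):
    """Smallest price in the null population such that at most a
    fraction `alpha` of the population is priced strictly above it."""
    ordered = sorted(null_prices)
    n = len(ordered)
    for candidate in ordered:
        fraction_above = sum(1 for p in ordered if p > candidate) / n
        if fraction_above <= alpha:
            return candidate
    raise ValueError("need a non-empty population and alpha >= 0")

# Hypothetical toy null population: 1000, 2000, ..., 20000 (20 prices).
toy_null = list(range(1000, 21000, 1000))

cv_05 = critical_value(toy_null, 0.05)  # at most 1 of 20 prices above
cv_10 = critical_value(toy_null, 0.10)  # at most 2 of 20 prices above
print("critical value at alpha = 0.05:", cv_05)
print("critical value at alpha = 0.10:", cv_10)
```

Raising alpha lowers the critical value, which is the mechanical reason a larger alpha makes a significant result easier to obtain.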

The black line (at $5801) indicates the price of the diamond sampled. The p-value is the fraction of diamonds in the reference (null) distribution that fall above the black line. But the location of the black line depends on which diamond is chosen for the sample, and this diamond is chosen from a completely unknown distribution. In fact, we have no idea where this line is likely to fall. But we interpret its location based on the null distribution. If the null hypothesis were correct, how surprised would we be to find the black line where we find it? Our surprise is quantified by the p-value.

The red line (at $13107.1) indicates the critical value of the test statistic, determined by the alpha-level of the test (chosen in accordance with tradition to be 0.05), so that 5% of the diamonds in the null distribution lie above the line. Since the black line is below the red line (the p-value is greater than 0.05), we do not deem the evidence sufficient to conclude our alternative hypothesis that the diamonds are more expensive, on average, at the new store. This is said to be a non-significant result; equivalently, the data are said to be not significant.

On the other hand if the black line were higher than the red line, the p-value would be less than 0.05, and we would accept the alternative hypothesis. In this case we would find a significant result, and the data would be significant. Question: If we set the alpha-level higher, say to 0.1, would it be easier or harder to make a type I error? Would it be easier or harder to find a significant result?

It would be easier to make a type I error if you were to set the alpha-level higher. It would be easier to find a significant result because your burden of proof is lower.

If the null hypothesis is correct, then we make a type I error 5% of the time, with a 0.05 level of significance. If the alternative hypothesis is correct (the mean of the sampled-from distribution is larger than the mean of the null distribution) then we would usually expect to find a significant effect more than 5% of the time. If we knew the sampled-from distribution, we could compute this number exactly: it is called the power of the test.

The power of a test is the probability that a sample will be significant, assuming the alternative hypothesis is correct. To compute the power, you must assume you know the sampled-from distribution, and that it satisfies the alternative hypothesis.

Actually, to be precise, the alternative hypothesis is framed in terms of the mean of the sampled-from distribution, whereas the power of the test is the fraction of the sampled-from distribution that lies above the critical value of the test statistic. That fraction may or may not be bigger than 5%, but usually is substantially bigger than 5%. A power of 80% is considered good, although many times statisticians settle for much lower power.

The probability of a type I error, when the null hypothesis is correct, is always the alpha-level of the test, often 0.05. The probability of a type II error is $1 - \mbox{power}$. If the power is 80% (or 0.8), the probability of a type II error is 0.2. The power depends on many things, notably the sampled-from distribution. But the best way to control power is to tweak your sample size. Question: If we set the alpha-level higher, say to 0.1, would the power go up or down? Would it be easier or harder to find a significant result?

If you were to set the alpha-level higher, the probability of a type I error would go up, but the probability of a type II error would go down, so the power would go up. It would be easier to find a significant result.
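The effect of the critical value on power can be illustrated with a small sketch. Everything here besides the critical value 13107.1 from the text is hypothetical: the function name `power`, the ten-price alternative population, and the lower critical value of 9500 that stands in for what a larger alpha-level might produce.

```python
def power(alt_prices, critical_value):
    """Fraction of the assumed sampled-from population priced strictly
    above the critical value (single-diamond, one-sided test)."""
    above = sum(1 for price in alt_prices if price > critical_value)
    return above / len(alt_prices)

# Hypothetical sampled-from population that satisfies the alternative.
alt_prices = [9000, 12000, 14000, 15000, 16000,
              18000, 20000, 22000, 25000, 30000]

power_strict = power(alt_prices, 13107.1)  # critical value from the text
power_loose = power(alt_prices, 9500)      # hypothetical lower critical value

print("power at 13107.1:", power_strict)
print("type II error probability:", round(1 - power_strict, 2))
print("power at 9500:", power_loose)
```

Lowering the critical value moves more of the assumed alternative population above the line, so the power rises and the type II error probability falls, at the cost of a larger type I error probability.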

Next up: we will start thinking about what happens to power as we increase the sample size, and why. I saw this written on the wall of a booth in the library a few years ago: “I heard you upped your sample size. More power to you!” We want to understand this statement, and why it’s true.

Ellie Kight