Tests of Significance, Version 2

Remember the violin/box plot of the price of diamonds in the diamonds data set?


The violin/box plot, shown above, illustrates the distribution of the price variable across the population represented in the data set. Later, we will call this population of 53940 diamonds the reference or null population of diamonds. But, while keeping the reference population in mind, we are going to work with a sample of diamonds from a possibly different population (for example, diamonds from another store, different from the one that housed the original population of 53940 diamonds). We won’t know the size of this new population, and for the moment, we will even leave the size of this sample unspecified. Maybe the sample is only 4 diamonds, or maybe it’s 400 diamonds (but actually, below, we will start with a sample of just 1 diamond, before upping our sample size in a future lesson).

Our main question: “Are the diamonds from this new population, from which the sample was drawn, more expensive, on average, than diamonds from our original, reference, or null population?” In other words, for example, does Zales Jewelers sell more expensive diamonds, on average, than Helzberg Diamonds? We would need data from these two stores to address this question.
Answer: One could simply compute the population means of both populations, and then compare the means. If Zales has a higher average diamond price than Helzberg, this would confirm that Zales sells more expensive diamonds, on average.

If we had data sets that contained the prices of all of the diamonds from Zales Jewelers, and the prices of all of the diamonds from Helzberg Diamonds, this comparison of population means would determine precisely, beyond a shadow of a doubt, which jeweler sells more expensive diamonds, on average. That question would be easy, but our problem is harder. We have all of the diamonds from one store, but only a sample from the other (requiring a one-sample test). A related question asks what we would do if we only had two samples, one from each store.

I am not going to tell you how I got the samples I use below; that information is totally irrelevant for the present purposes. Suffice it to say, I only have data from one store. But I will say that it is possible to sample from other populations, given just the data we have. Specifically, you can create new populations by simply restricting the reference population using other characteristics of diamonds described by the data set: color, cut, clarity, carat, etc.

For example, we could ask: are diamonds with the D-color designation more expensive than diamonds overall? If you know anything about diamonds, you might know that diamonds with this color designation are considered more valuable, all other things being equal, than diamonds of any other color designation. So this question sounds like a no-brainer, except for the “all other things being equal” clause. Without knowing anything about diamonds, it would be entirely reasonable to wonder if D-color diamonds might tend to be a lot smaller than other diamonds, so their price might tend to be less, on average, despite their superior color, simply due to their inferior weight (carat). Of course, with the whole data set you could answer this question definitively, and without sampling. And if you know anything about diamonds, you might know the answer already. To compute the mean, or average, price of diamonds in each of the two populations, add all of the diamond prices in that population and divide by the number of individual prices in the population. This can easily be computed in applications such as StatCrunch.

But the question I am asking here is different. Here, you are given a sample of diamonds, and their prices. The sample comes from a larger population, and you know that it is drawn as a simple random sample from that population. You don’t know anything about this population other than the prices of the diamonds in that simple random sample. The sampled-from population may be from another store, or it may be a subset of diamonds from the reference population. You don’t know, and if you do know, these details don’t matter for the problem at hand. We address this question: are the diamonds in the new sampled-from population (the whole population, including the diamonds whose prices you don’t know) more expensive, on average, than the diamonds in your reference population (the one that you do know about)? It is impossible to answer this question beyond a shadow of a doubt, simply because we do not have the complete list of data points from the new population, and thus we cannot compute its population mean, the true parameter that we would compare to the reference population mean. We only have a sample from the new population, and samples yield statistics that do not necessarily reflect the true parameter of a population.

Note that while these questions may be criticized as being contrived, they remain very similar to important questions in other domains, such as: “Does a new drug work better, on average, than an old drug?” or: “Do students perform better with a new teaching method than with an old teaching method?” For these questions, sampling is not always so contrived. You can’t know how all the people in the world with the disease you are studying (your population of interest) would respond to your drug. You can only know how the people in your sample, the ones to whom you gave the drug, respond to it. For these problems, a two-sample test might be more relevant. A two-sample test would involve, for example, both a sample of people using the new drug, and a sample of people using an old drug (or a placebo). We will get to two-sample tests soon enough, but we will start with a one-sample test. Here we compare one sample from an unknown distribution to a reference or null population that is fully known.

Our reference population consists of the entire data set of 53940 diamonds. The population mean of the price of these diamonds (labeled in the plot with the red dot) is $3932.80. We look for evidence that the sampled-from population consists of diamonds that are more expensive, on average, than the reference population. In other words, we look for evidence that the population mean of the new population (which we only know of from the sample) is larger than $3932.80, the population mean of the reference population. The approach we take uses the data we have.

We can identify the following hypotheses:

The null hypothesis: the statement that there is no effect:

\[H_0 : \mbox{The sample was drawn from the reference population}\] Restated, the null hypothesis posits that the new population equals the reference population—and we know exactly what the reference population is. Thus the null hypothesis is a specific statement. There is only one way for it to be correct. Making such a specific statement allows us to quantify the evidence that it is false.

The alternative hypothesis: the statement that the effect we want is present:

\[H_a: \mbox{The sample was drawn from a population with a population mean greater than \$3932.80}\] The alternative hypothesis is not a specific statement. It is a family of distributions/populations that satisfies some criterion, so there is more than one way for the alternative hypothesis to be correct. There is also more than one way for a statistician to define the alternative hypothesis (e.g. it could be limited to populations with a mean above the null hypothesis benchmark, or to populations with a mean below it). Meanwhile, the null hypothesis specifies a single distribution or population, and that one benchmark lets us compute the probability of error when the alternative hypothesis is not correct.

Here is an example sample of prices of 4 diamonds:

## [1] 5801 8549  744  538

Is there evidence for our alternative hypothesis? We are going to use the sample mean to make this decision. In this case, the sample mean is the sum of these four prices, the only prices we know from the new population, divided by 4. This sample mean is what we call our test statistic. If the null hypothesis is correct, then the sample was drawn from the reference population, so the sample mean follows the distribution of means of samples drawn from the reference population. That distribution is centered at the reference population mean, $3932.80, but an individual sample mean will vary around that value. If the null hypothesis is false, the mean of the sampled-from population could in principle be greater or less than the mean of the reference population. Our alternative hypothesis is that the new population has a mean greater than $3932.80, in which case the test statistic (the sample mean) would tend to be larger than it would be under the null hypothesis. The plot below is relevant because it displays the distribution of the sample mean of price for different sample sizes. At sample size 4, the distribution is still wide, with many unusually high values, so a sample mean that differs from $3932.80 does not, by itself, show that the null hypothesis is false: some variation is expected even when the null hypothesis is correct. This changes with sample size, because as the sample size increases, the sample mean is more likely to fall close to the true population mean. If the null hypothesis is correct, then the sample mean will be described by these distributions.
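
As a quick check of the arithmetic, the test statistic for the four prices listed above can be computed directly. This is a minimal sketch in Python, not the code used to produce this lesson; the variable names are assumptions chosen for readability.

```python
# Prices of the four sampled diamonds, as listed in the text.
sample_prices = [5801, 8549, 744, 538]

# Mean price of the reference (null) population, as stated in the text.
reference_mean = 3932.80

# The test statistic: the sample mean (sum of the prices divided by 4).
sample_mean = sum(sample_prices) / len(sample_prices)

print(f"sample mean: {sample_mean:.2f}")
print(f"exceeds reference mean: {sample_mean > reference_mean}")
```

For this particular sample the mean works out to $3908, slightly below the reference mean of $3932.80, so this sample of four offers no evidence in the direction of the alternative hypothesis.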

If you answered the above questions, you may understand why we interpret large sample means as evidence against our null hypothesis and for our alternative hypothesis. At some point, when the sample mean is large enough, we deem the evidence for the alternative hypothesis significant. But how large does the sample mean need to be to get this conclusion? The threshold, where we say the sample mean is large enough, is called the critical value of the test statistic.

To understand how to set this critical value, let’s tackle a simpler problem: our sample will consist of a single diamond, just the first one selected in the sample above, with price $5801. Based on just this diamond, should we accept our alternative hypothesis? It can be said that we do have some evidence: $5801 > 3932.80$. But do we have enough evidence? And how much evidence is enough? It depends on our critical value. If 5801 is bigger than our critical value, then we will say we have enough evidence to conclude that the diamond was sampled from a population with a greater mean. If not, we won’t be able to say that. For example, with a critical value of $5600, our diamond’s price would exceed the threshold and we would accept the alternative hypothesis; with a critical value of $6600, it would not, and we would not reject the null hypothesis.

We don’t know the distribution of the diamonds from the population from which the sample was drawn; however, we do know the distribution of diamonds under the null hypothesis. Therefore, we can answer the following question: If we make the null hypothesis correct, by drawing a sample from the null population (call it a null-sample), and we use our observed test statistic, $5801, as the critical value for deeming the alternative hypothesis correct, what is the probability that we will erroneously deem the alternative hypothesis correct? (Note: we know the alternative hypothesis is incorrect for null-samples.) This probability is the p-value for our original sample. Said a slightly different way, the p-value is the probability of erroneously concluding that a null-sample comes from a more expensive population, assuming the original sample determines our threshold for making this decision. Because the p-value is a probability, it is always a number between 0 and 1. The closer the p-value is to 0, the more surprised we would be to see our data (the original sample) if the null hypothesis were correct. Low p-values are interpreted as strong evidence against the null hypothesis and for the alternative hypothesis.

The black horizontal line is drawn at the price of the single sample diamond: $5801. What if, as suggested above, we used this line as the critical value for assessing the truth of the alternative hypothesis? We have no way of knowing what would happen if the alternative hypothesis were correct, because there are many populations that satisfy the alternative hypothesis. But we do know what would happen if the null hypothesis is correct: there is only one population that satisfies the null hypothesis, the one depicted in the violin/box plot, above. If the black line represented our critical value, and if the price of a null-sampled diamond fell above the black line, we would incorrectly reject the null hypothesis. The higher the price of the sampled diamond, the lower the p-value. Thus, to get a lower p-value, we need a higher price. A lower p-value means that the sample provides stronger evidence for the alternative hypothesis. Test statistics that lead to lower p-values are said to be more extreme (that is, further into the region where we deem the alternative hypothesis correct). The p-value is also described as the probability of observing a null-sample with a test statistic more extreme than the one seen in the original sample.

The number of diamonds in the reference population whose prices fall above this line is 12077, which is 22.39% of the diamonds in the data set. Since each diamond has an equal chance of being selected as our new sample, the probability of an error, in this situation, is:

\[\mbox{p-value} = \frac{12077}{53940} = 0.2239\]
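
The same calculation can be sketched in a few lines of Python. The helper name `empirical_p_value` and the ten-price toy population are assumptions made for illustration only; they are not the diamonds data. The last step simply reproduces the division shown above from the counts given in the text.

```python
def empirical_p_value(population, observed):
    """Fraction of population values strictly greater than the observed value."""
    above = sum(1 for price in population if price > observed)
    return above / len(population)

# Hypothetical toy reference population (NOT the real diamonds data).
toy_population = [400, 900, 1500, 2400, 3100, 4700, 5801, 6200, 9800, 15000]
toy_p = empirical_p_value(toy_population, 5801)

# Counts reported in the text for the real data set.
n_above, n_total = 12077, 53940
p_value = n_above / n_total

print(f"toy p-value: {toy_p}")
print(f"p-value from the counts in the text: {p_value:.4f}")
```

In the toy population, three of the ten prices lie above $5801, so the toy p-value is 0.3; the counts from the text give 0.2239.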

Erroneously accepting the alternative hypothesis is called a false positive, because we erroneously find the positive effect (the null-sample comes from a more expensive population) when the correct answer is negative (we drew our sample from the null distribution). The following are examples of positive effects: the new store sells more expensive diamonds than the old store, the new drug works better than placebo, my new teaching methods work better than the old ones, etc. If we come to these conclusions in error, we have made a false-positive error (type I error). The other type of error is a false-negative error (type II error). Negating the positive statements yields negative statements (e.g. the new drug works the same as placebo), and if such a conclusion is in error (we conclude that the new drug works the same as placebo, when in fact it works better), our conclusion is a false-negative (type II) error. Tests of significance are designed to control type I errors (false positives), though we can study false negatives as well (mostly saved for future lessons, but see below). In summary, a type I error occurs when one incorrectly rejects a true null hypothesis (false positive), while a type II error occurs when one fails to reject a false null hypothesis (false negative). It’s easier to control type I errors than type II errors because the probability of a type II error is 1 - power, and power can depend on many features of the sampled-from distribution. Meanwhile, the probability of a type I error is just the alpha-level of the test, which is easily controlled.

The p-value is the probability of making a false-positive error (incorrectly accepting the alternative hypothesis) if the null hypothesis is correct and if we use the test statistic seen in our original sample as the critical value for accepting the alternative hypothesis. However, the critical value should never be chosen based on the data. It should be chosen, before collecting data, based on what probability of a type I error you find acceptable.

So what’s the answer: do we have evidence to support the alternative hypothesis? Actually, it depends on you. If you are O.K. with being wrong 22.39% of the time when you make the null hypothesis correct, then yes, you do have enough evidence.

The level of significance of a test, alpha-level of a test or $\alpha$-level of a test, is the probability of making a false-positive error, assuming the null hypothesis is correct. The alpha-level of a test is typically decided on in advance of an experiment, and it sets the critical value for accepting the alternative hypothesis that is actually used, regardless of the sample.

The alpha-level of the test, the acceptable probability of a false-positive error, is usually much lower than 0.2239. In fact, the traditional level, accepted by most statisticians, is 0.05, though other levels are sometimes used. In other words, many statisticians consider it acceptable to have a 0.05 probability of making a false-positive error, under the condition where the null hypothesis is correct.

So if we set the alpha-level to its traditional level, 0.05, what is the critical value for the test statistic? That’s easy: it’s the value such that 5% of the diamonds have a larger price, and thus 95% of the diamonds have a lower price. That value has a name: the 95th percentile of the price of diamonds.

The 95th percentile of diamond price in our data set is 13107.1. In other words, if a single diamond, sampled at random from a different data set, such as from a different store, has a price that is greater than 13107.1, then statisticians would, based on tradition, consider the evidence sufficient to conclude that the diamonds from the new store are more expensive than the diamonds from the old store. However, if we trick the statisticians, and give them a diamond from the old store, so that a positive conclusion (diamonds are more expensive from the population sampled) would be wrong, the statisticians would be wrong 5% of the time. They would know that fact, but consider this rate of false-positive error acceptable.
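
To make the link between the alpha-level and the critical value concrete, here is a small sketch using a hypothetical toy population of 100 prices (the integers 1 to 100), not the diamonds data. The function name `nearest_rank_percentile` is an assumption, and it uses a simple nearest-rank rule. Statistical software often interpolates between neighboring values instead, so the exact number reported for a real data set can differ slightly depending on the convention.

```python
import math

def nearest_rank_percentile(population, percent):
    """Percentile by the nearest-rank rule: the smallest value such that
    at least `percent` percent of the population is at or below it."""
    ordered = sorted(population)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical toy null population: prices 1, 2, ..., 100.
toy_population = list(range(1, 101))

# Critical value for a 0.05 alpha-level: the 95th percentile.
critical = nearest_rank_percentile(toy_population, 95)

# Fraction of null-sampled single diamonds that would (wrongly) be significant.
false_positive_rate = sum(1 for x in toy_population if x > critical) / len(toy_population)

print(f"critical value: {critical}")
print(f"false-positive rate under the null: {false_positive_rate}")
```

With this toy population, exactly 5 of the 100 prices lie above the critical value, which is the 5% false-positive rate described in the text.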

The black line (at $5801) indicates the price of the diamond sampled. The p-value is the fraction of diamonds in the reference (null) distribution that fall above the black line. But the location of the black line depends on which diamond is chosen for the sample, and this diamond is chosen from a completely unknown distribution. In fact, we have no idea where this line is likely to fall. But we interpret its location based on the null distribution. If the null distribution were correct, how surprised would we be to find the black line where we find it? Our surprise is quantified by the p-value.

The red line (at $13107.1) indicates the critical value of the test statistic, determined by the alpha-level of the test (chosen in accordance with tradition to be 0.05), so that 5% of the diamonds in the null distribution lie above the line. Since the black line is below the red line (the p-value is greater than 0.05), we do not deem the evidence sufficient to conclude our alternative hypothesis, that the diamonds are more expensive, on average, at the new store. This is said to be a non-significant result, and the data are said to be not significant.
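
The decision can be phrased two equivalent ways: compare the observed price to the critical value, or compare the p-value to the alpha-level. The sketch below uses the numbers quoted in the text; the function name `is_significant` is an assumption made for illustration.

```python
def is_significant(p_value, alpha=0.05):
    """One-sided decision rule: significant when the p-value is below alpha."""
    return p_value < alpha

observed_price = 5801       # price of the single sampled diamond (black line)
critical_price = 13107.1    # 95th percentile of the reference prices (red line)
p_value = 12077 / 53940     # fraction of reference diamonds priced above $5801

# Both comparisons lead to the same decision.
significant_by_p_value = is_significant(p_value)
significant_by_critical_value = observed_price > critical_price

print(significant_by_p_value, significant_by_critical_value)
```

Both checks come out false here, matching the non-significant result described above.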

On the other hand, if the black line were higher than the red line, the p-value would be less than 0.05, and we would accept the alternative hypothesis. In this case we would find a significant result, and the data would be significant. The alpha-level, or significance level, is the probability of making a type I error (incorrectly rejecting a true null hypothesis). Thus, setting the alpha-level to 0.1 would make a type I error more likely, because the test would demand less evidence: we would be accepting a 10% probability of a type I error, as compared to the traditional 5%. It would also be easier to find a significant result.
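
The effect of the alpha-level on the critical value can be seen with a hypothetical toy population of 100 prices (the integers 1 to 100), not the diamonds data. The helper `critical_value` and its nearest-rank rule are assumptions made for this sketch.

```python
import math

def critical_value(population, alpha_percent):
    """Nearest-rank critical value: about alpha_percent percent of the
    population lies strictly above the returned value."""
    ordered = sorted(population)
    rank = math.ceil((100 - alpha_percent) * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical toy null population: prices 1, 2, ..., 100.
toy_population = list(range(1, 101))

critical_at_05 = critical_value(toy_population, 5)    # alpha-level 0.05
critical_at_10 = critical_value(toy_population, 10)   # alpha-level 0.10

# A hypothetical observed price that falls between the two thresholds.
observed = 93

print(f"alpha 0.05: critical value {critical_at_05}, significant: {observed > critical_at_05}")
print(f"alpha 0.10: critical value {critical_at_10}, significant: {observed > critical_at_10}")
```

Raising the alpha-level from 0.05 to 0.10 lowers the critical value, so an observation that is not significant at 0.05 can be significant at 0.10.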

If the null hypothesis is correct, then we make a type I error 5% of the time, with a 0.05 level of significance. If the alternative hypothesis is correct (the mean of the sampled-from distribution is larger than the mean of the null distribution) then we would usually expect to find a significant effect more than 5% of the time. If we knew the sampled-from distribution, we could compute this number exactly: it is called the power of the test.

The power of a test is the probability that a sample will be significant, assuming the alternative hypothesis is correct. To compute the power, you must assume you know the sampled-from distribution, and that it satisfies the alternative hypothesis.

Actually, to be precise, the alternative hypothesis is framed in terms of the mean of the sampled-from distribution, whereas the power of the test is the fraction of the sampled-from distribution that lies above the critical value of the test statistic. That fraction may or may not be bigger than 5%, but usually is substantially bigger than 5%. A power of 80% is considered good, although many times statisticians settle for much lower power.

The probability of a type I error is always the alpha-level of the test, often 0.05. The probability of a type II error is $1 - \mbox{power}$. If the power is 80% (or 0.8), the probability of a type II error is 0.2. The power depends on many things, notably the sampled-from distribution. But the best way to control power is to tweak your sample size. Power is the probability of rejecting the null hypothesis when the alternative hypothesis is true. Setting the alpha-level higher would increase the chance of a type I error, decrease the chance of a type II error, and increase power, making it easier to find a significant result, at the cost of more false positives.
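
For a sample of one diamond, both error rates can be computed exactly once a sampled-from population is assumed. In the sketch below, both populations are hypothetical toy examples (the alternative is the null population shifted up by 50), and the helper `fraction_above` is an assumption made for illustration; none of this is the diamonds data.

```python
def fraction_above(population, threshold):
    """Fraction of population values strictly greater than the threshold."""
    return sum(1 for x in population if x > threshold) / len(population)

# Hypothetical toy populations.
null_population = list(range(1, 101))          # prices 1..100
alternative_population = list(range(51, 151))  # prices 51..150 (shifted up by 50)

# Critical value chosen so that 5% of the null population lies above it.
critical = 95

alpha = fraction_above(null_population, critical)         # type I error rate
power = fraction_above(alternative_population, critical)  # chance of a significant result
type_ii_rate = 1 - power                                  # type II error rate

print(f"alpha: {alpha:.2f}, power: {power:.2f}, type II rate: {type_ii_rate:.2f}")
```

Under these assumptions, 55 of the 100 prices in the alternative population exceed the critical value, so the power is 0.55 and the type II error rate is 0.45. A different assumed alternative population would give a different power, which is why power is harder to pin down than the alpha-level.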

Next up: we will start thinking about what happens to power as we increase the sample size, and why. I saw this written on the wall of a booth in the library a few years ago: “I heard you upped your sample size. More power to you!” We want to understand this statement, and why it’s true.


Sarah Bellatti