Statistical processes are based on probability models. This means some (hopefully small) chance of error will always be present due to randomness. We don’t question the validity of every statistical result, but one should wonder if some way exists to quantify how certain we are that our results are accurate.

Spoiler Alert: there is a way!

Initializing RStudio

Please install the package pwr before initializing RStudio for this session. Use Tools: Install Packages to do so. The code block below uses the library function to ensure that the Mosaic and pwr packages are loaded and will also import the data frame used in this module: Data3350

library(mosaic)
library(pwr)
library(readxl)
Data3350 = read_excel("Data3350.xlsx")

I. Statistical Power

The power of a statistical test is the percentage chance that a specific test will reject the null hypothesis when, in fact, the null is actually false. Seems simple, right? Yet, we don’t know if the null is false without statistical tests. Some chance of error is always lurking, but we rarely know when a statistical error has actually occurred.

The goal is having enough power each time we perform a statistics-based investigation. Three ways exist to improve power:

  1. Increase the sample size.
  2. Increase the effect size.
  3. Increase \(\alpha\).

The first two ideas seem straightforward. Increasing the sample size invokes the Law of Large Numbers which says that estimates of parameters become more accurate as sample size increases. With increased accuracy of estimates, we will correctly reject the null more often. For the second, we face the ubiquitous scientific problem of separating signal from noise. In a study with an experimental design, we test the main effect of the treatment. If we can magnify the main effect, we have a better chance of picking up on difference between the treatment and control groups. The main effect can be made larger, for example, by increasing the duration or intensity of the treatment.

The first two options for improving power are not always possible. Increasing the intensity of a treatment may threaten participant safety. For some research, it is hard to find enough participants. What terminally ill patients will sign up for a randomized treatment vs. control group design where they have a 50-50 chance of being assigned to the placebo (control) group? Also, some current cancer researchers have found that certain treatments work well only if specific genetic markers are present in the patient. Every study would have a huge sample size and adequate power if data were inexpensive and easy to collect, but ethical standards, privacy protections and many other concerns mean the opposite. High quality, ethical data sets are expensive and time-consuming to generate.

What about the third way: increasing \(\alpha\) to improve power? To understand how to set \(\alpha\) requires an investigation of statistical error.

II. Statistical Error

In statistics, we have two forms of randomness-based errors:

  1. Type I Error. Falsely rejecting the null hypothesis.
  2. Type II Error. Falsely failing to reject the null hypothesis.

The level of significance \(\alpha\) is the percent chance of making a Type I error. Why is this true?

Consider a hypothesis test of Perfectionism where we know the population follows exactly a \(N(82,18)\) distribution (measured using the APS-R). Perhaps we’ve measured the Perfectionism level for every single undergraduate student currently enrolled at a certain university. Let’s test a hypothesis: \[H_0 : \mu = 82\] \[H_a : \mu > 82\] The population is known to be exactly 82, so rejecting the null would be incorrect. Yet, let’s test the above hypothesis repeatedly by drawing samples of size 40 from this exact \(N(82,18)\) distribution. Over thousands of samples, we will find a mean far enough above \(\mu = 82\) to reject the null 5% of the time. Yet, it would be an error. We’ll falsely reject the null approximately 5% of the times.

When we set \(\alpha\), we are choosing the probability of making a Type I error. This is not a researcher error. Type I errors occur due to randomness because we use probability models in hypothesis testing. Type I and Type II errors are inescapable realities for any statistical test.

# library(mosaic)
tests = do(1000) * t.test(rnorm(40,82,18),mu = 82)
tests
histogram(tests$p.value, width = 0.05 )

When we enter a logical statement into the sum function, it will count instead of adding.

sum(tests$p.value < 0.05)
[1] 43

While results will vary if we rerun the code blocks, we see that in this example 3 of samples turned up with \(p\)-values less than 0.05. The actual Type I error in our randomized experiment was 4.3%.

No matter what, we will always have a Type I error rate. While we can set \(\alpha\) lower, say \(\alpha = 0.01\), when we do then Type II error will increase and power will decrease.

1. Power = \(1-\beta\)

The percentage chance of a Type II error is defined to be \(\beta\). There is no exact formula for the relationship between \(\alpha\) and \(\beta\), but we do know they have an inverse relationship. As Type I error increases (with everything else held constant), Type II error will decrease, and vice versa.

If \(\beta\) is the probability of a Type II error occurring, what is it’s opposite? Think of it this way: \(\beta\) is the probability of an event occurring. What event? Let \(T2\) represent the event that a Type II error occurs, and we’re simply saying \[P(T2)=\beta\] We know that \(P(\overline{T2})= 1 - \beta\), so what is the event \(\overline{T2}\)? The logical opposite of “falsely failing to reject the null” is “correctly rejecting the null,” which is the definition of power.

Thus, as we set \(\alpha\), we are making a choice that affects \(\beta\) and therefore power. If we increase \(\alpha\), \(\beta\) will decrease and power will increase.

2. Adequate Power

Jacob Cohen was the statistics pioneer who investigated power, and his work still underpins our understanding. According to Cohen,

  • Power = 0.6 is poor
  • Power = 0.7 is adequate
  • Power = 0.8 is good
  • Power = 0.9 is excellent

3. Calculating Power

Recall the matched pairs example from Module S5 where we measured Stress in a pretest vs. post test format. We created the gain scores (increase in Stress) as follows

Pre = Data3350$Stress1
Post = Data3350$Stress2
Gain = Post - Pre

To estimate power, we first must calculate effect size which, for matched pairs, will be the average gain score divided by the standard deviation. Let’s store the summary stats in a variable called \(\text{g}\) so we can access them as needed.

g = favstats(Gain)
g

We refer to the various objects in the \(\text{favstats}\) output using dollar signs. For example, let’s list the mean, standard deviation and sample size for the gain scores.

g$mean
[1] 0.7682927
g$sd
[1] 3.190472
g$n
[1] 164

The power guru Jacob Cohen used the variable \(d\) to indicate effect size which he defined to be, for this case, the average Gain score divided by the sample standard deviation.

d = g$mean / g$sd
d
[1] 0.2408085

Cohen defined different levels of effect sizes. \[\begin{array}{ccl}d &&\text{Effect Size}\\ \hline 0.2 && \text{Small}\\ 0.5 && \text{Medium}\\ 0.8 && \text{Large} \end{array}\] In our example, we have a small effect size since it’s close to \(0.2\). We loaded the \(\textbf{pwr}\) package which implements the recommendations in the seminal textbook on power by Cohen (1988). Through the pwr package, we now have access to the function pwr.t.test which will pretty much do what its name suggests: estimate the power of a \(t\)-test.

library(pwr)
pwr.t.test(n = g$n, d = g$mean/g$sd, sig.level = .05, type = "paired")

     Paired t test power calculation 

              n = 164
              d = 0.2408085
      sig.level = 0.05
          power = 0.8655655
    alternative = two.sided

NOTE: n is number of *pairs*

The default hypothesis is a two-tailed test, but if you recall we were running a one-tailed test that the Stress measured at midterm would be greater than Stress at the beginning of the semester. Let’s add the alternative hypothesis and see if changes the power which, in this case, was approximately .87 which is good (see chart above).

pwr.t.test(n = g$n, d = g$mean/g$sd, sig.level = .05, type = "paired", alternative = "greater")

     Paired t test power calculation 

              n = 164
              d = 0.2408085
      sig.level = 0.05
          power = 0.9230891
    alternative = greater

NOTE: n is number of *pairs*

While this is excellent power, notice that the for this small effect size, we needed a very large sample to generate excellent power. What if our sample size had only been 50 yet the effect size was d = 0.25?

pwr.t.test(n = 50, d = 0.25, sig.level = .05, type = "paired", alternative = "greater")

     Paired t test power calculation 

              n = 50
              d = 0.25
      sig.level = 0.05
          power = 0.5392054
    alternative = greater

NOTE: n is number of *pairs*

For sample size of 50, we don’t even have poor power, and similar sample sizes are pretty typical for many research situations.

III. Power Charts

Both charts below clearly show the impact of increasing sample sizes on power. As sample size increases, power increases. The two charts below focus on the impact of the other two ways for increasing power: increasing the effect size and increasing the level of significance \(\alpha\).

1. Effect Size and Power

The chart below summarizes the power for different levels of effect size (small, medium, large) using 2-Tailed, 1-Sample \(t\)-tests with \(\alpha = 0.05\). Notice how low the power is for small effect sizes (left column) even for sample sizes of 100. On the other hand, consider that we have excellent power for samples of only 20 if the effect size is large (right column).

2. Level of Significance \((\alpha)\) and Power

The chart below summarizes the power for different levels of \(\alpha\) using 2-Tailed, 1-Sample \(t\)-tests with effect sizes \(d = 0.30\). Notice how low the power is for small levels of significance (left column) even for sample sizes of 100. The power is much better for higher levels of significance.

3. Overpowering Statistical Test

Oddly, you can have too much power. When sample sizes are ridiculously large, almost any miniscule effect size will be enough to reject the null. Yet, the results will be meaningless because no real-world difference between groups exists.

Consider a two-sample comparison of test scores where the average Treatment group score is an 80.1, and the average Control group score is an 80.0. If the standard deviations are both 10, then Cohen’s effect size is \[d = \frac{80.1-80.0}{10}=\frac{.1}{10}=0.01\] We can all agree that with test scores out of 100 points, there is no real-world difference between the Treatment and Control groups. Yet, consider the following sample sizes

Power when \(n = 1000\)

pwr.t.test(n = 1000, d = 0.01, type = "two.sample")$power
[1] 0.05574171

A very large sample size, but not much power. This seems fine as we shouldn’t be rejecting the null.

Power when \(n = 10,000\)

pwr.t.test(n = 10000, d = 0.01, type = "two.sample")$power
[1] 0.1089488

A huge sample size but still hardly any power. Good so far.

Power when \(n = 100,000\)

pwr.t.test(n = 100000, d = 0.01, type = "two.sample")$power
[1] 0.6087754

Whoa - now we have a reasonable chance of rejecting the null.

Power when \(n = 1,000,000\)

pwr.t.test(n = 1000000, d = 0.01, type = "two.sample")$power
[1] 0.9999998

With a ridiculous sample size of one million, we will always be rejecting the null despite the fact that there is no real-world difference between the group means.

Word of Caution

Generally, the issue of overpowering a test isn’t one we’re worried about. As mentioned above, data is often both difficult and expensive to collect. However, in the era of big data, we can often find massive data sets readily available online. While interesting results can certainly be derived from studying huge data sets, researchers do need to ask themselves the question: “Is this finding significant, not statistically, but in the real world context of the study?”

---
title: "Error and Power"
output: html_notebook
---

Statistical processes are based on probability models. This means some (hopefully small) chance of error will always be present due to randomness. We don't question the validity of every statistical result, but one should wonder if some way exists to quantify how certain we are that our results are accurate.

Spoiler Alert: there is a way!

<div style="float:center; margin: 8px; border:2px black solid; padding: 0px 10px 5px">
### <span style="color: red;">Initializing RStudio</span>
Please <span style="color: blue;">install the package **pwr**</span> before initializing RStudio for this session. Use **Tools: Install Packages** to do so. The code block below uses the **library** function to ensure that the **Mosaic** and **pwr** packages are loaded and will also import the data frame used in this module: **Data3350**

```{r}
library(mosaic)
library(pwr)
library(readxl)
Data3350 = read_excel("Data3350.xlsx")
```
</div>

# <span style="color: blue;">I. Statistical Power</span>

The power of a statistical test is the percentage chance that a specific test will reject the null hypothesis when, in fact, the null is actually false. Seems simple, right? Yet, we don't know if the null is false without statistical tests. Some chance of error is always lurking, but we rarely know when a statistical error has actually occurred.

The goal is having enough power each time we perform a statistics-based investigation. Three ways exist to improve power:

1. Increase the sample size.
2. Increase the effect size.
3. Increase $\alpha$.

The first two ideas seem straightforward. Increasing the sample size invokes the Law of Large Numbers which says that estimates of parameters become more accurate as sample size increases. With increased accuracy of estimates, we will correctly reject the null more often. For the second, we face the ubiquitous scientific problem of separating signal from noise. In a study with an experimental design, we test the main effect of the treatment. If we can magnify the main effect, we have a better chance of picking up on difference between the treatment and control groups. The main effect can be made larger, for example, by increasing the duration or intensity of the treatment. 

The first two options for improving power are not always possible. Increasing the intensity of a treatment may threaten participant safety. For some research, it is hard to find enough participants. What terminally ill patients will sign up for a randomized treatment vs. control group design where they have a 50-50 chance of being assigned to the placebo (control) group? Also, some current cancer researchers have found that certain treatments work well only if specific genetic markers are present in the patient. Every study would have a huge sample size and adequate power if data were inexpensive and easy to collect, but ethical standards, privacy protections and many other concerns mean the opposite. High quality, ethical data sets are expensive and time-consuming to generate.

What about the third way: increasing $\alpha$ to improve power? To understand how to set $\alpha$ requires an investigation of **statistical error**.

# <span style="color: blue;">II. Statistical Error</span>

In statistics, we have two forms of randomness-based errors:

1. **Type I Error.** Falsely rejecting the null hypothesis.
2. **Type II Error.** Falsely failing to reject the null hypothesis.

The level of significance $\alpha$ is the percent chance of making a Type I error. Why is this true?

Consider a hypothesis test of Perfectionism where we know the population follows exactly a $N(82,18)$ distribution (measured using the <a href = http://kennethwang.com/apsr/scales/APS-R_96.pdf>APS-R</a>). Perhaps we've measured the Perfectionism level for every single undergraduate student currently enrolled at a certain university. Let's test a hypothesis:
$$H_0 : \mu = 82$$
$$H_a : \mu > 82$$
The population is known to be exactly 82, so rejecting the null would be incorrect. Yet, let's test the above hypothesis repeatedly by drawing samples of size 40 from this exact $N(82,18)$ distribution. Over thousands of samples, we will find a mean far enough above $\mu = 82$ to reject the null 5% of the time. Yet, it would be an error. We'll falsely reject the null approximately 5% of the times.

When we set $\alpha$, we are choosing the probability of making a Type I error. This is not a researcher error. Type I errors occur due to randomness because we use probability models in hypothesis testing. Type I and Type II errors are inescapable realities for any statistical test.

```{r}
# library(mosaic)
tests = do(1000) * t.test(rnorm(40,82,18),mu = 82)
tests
```

```{r}
histogram(tests$p.value, width = 0.05 )

```

When we enter a logical statement into the **sum** function, it will count instead of adding.

```{r}
sum(tests$p.value < 0.05)
```

While results will vary if we rerun the code blocks, we see that in this example 3 of samples turned up with $p$-values less than 0.05. The actual Type I error in our randomized experiment was `r paste(sum(tests$p.value < 0.05)/10)`%.

No matter what, we will always have a Type I error rate. While we can set $\alpha$ lower, say $\alpha = 0.01$, when we do then Type II error will increase and power will decrease.

## 1. Power = $1-\beta$

The percentage chance of a Type II error is defined to be $\beta$. There is no exact formula for the relationship between $\alpha$ and $\beta$, but we do know they have an inverse relationship. As Type I error increases (with everything else held constant), Type II error will decrease, and vice versa.

If $\beta$ is the probability of a Type II error occurring, what is it's opposite? Think of it this way: $\beta$ is the probability of an event occurring. What event? Let $T2$ represent the event that a Type II error occurs, and we're simply saying $$P(T2)=\beta$$
We know that $P(\overline{T2})= 1 - \beta$, so what is the event $\overline{T2}$? The logical opposite of "falsely failing to reject the null" is "correctly rejecting the null," which is the definition of power.

Thus, as we set $\alpha$, we are making a choice that affects $\beta$ and therefore power. If we increase $\alpha$, $\beta$ will decrease and power will increase.

## 2. Adequate Power

Jacob Cohen was the statistics pioneer who investigated power, and his work still underpins our understanding. According to Cohen,

* Power = 0.6 is poor
* Power = 0.7 is adequate
* Power = 0.8 is good
* Power = 0.9 is excellent

## 3. Calculating Power

Recall the matched pairs example from Module S5 where we measured Stress in a pretest vs. post test format. We created the gain scores (increase in Stress) as follows

```{r}
Pre = Data3350$Stress1
Post = Data3350$Stress2
Gain = Post - Pre
```

To estimate power, we first must calculate effect size which, for matched pairs, will be the average gain score divided by the standard deviation. Let's store the summary stats in a variable called $\text{g}$ so we can access them as needed.

```{r}
g = favstats(Gain)
g
```

We refer to the various objects in the $\text{favstats}$ output using dollar signs. For example, let's list the mean, standard deviation and sample size for the gain scores.

```{r}
g$mean
g$sd
g$n
```

The power guru Jacob Cohen used the variable $d$ to indicate effect size which he defined to be, for this case, the average Gain score divided by the sample standard deviation.

```{r}
d = g$mean / g$sd
d
```

Cohen defined different levels of effect sizes.
$$\begin{array}{ccl}d &&\text{Effect Size}\\ \hline
0.2 && \text{Small}\\ 0.5 && \text{Medium}\\ 0.8 && \text{Large}
\end{array}$$
In our example, we have a small effect size since it's close to $0.2$. We loaded the $\textbf{pwr}$ package which implements the recommendations in the seminal textbook on power by Cohen (1988). Through the **pwr** package, we now have access to the function **pwr.t.test** which will pretty much do what its name suggests: estimate the power of a $t$-test.

```{r}
pwr.t.test(n = g$n, d = g$mean/g$sd, sig.level = .05, type = "paired")
```
The default hypothesis is a two-tailed test, but if you recall we were running a one-tailed test that the Stress measured at midterm would be greater than Stress at the beginning of the semester. Let's add the alternative hypothesis and see if changes the power which, in this case, was approximately .87 which is good (see chart above).

```{r}
pwr.t.test(n = g$n, d = g$mean/g$sd, sig.level = .05, type = "paired", alternative = "greater")
```

While this is excellent power, notice that the for this small effect size, we needed a very large sample to generate excellent power. What if our sample size had only been 50 yet the effect size was d = 0.25?


```{r}
pwr.t.test(n = 50, d = 0.25, sig.level = .05, type = "paired", alternative = "greater")
```

For sample size of 50, we don't even have poor power, and similar sample sizes are pretty typical for many research situations.

# <span style="color: blue;">III. Power Charts</span>

Both charts below clearly show the impact of increasing sample sizes on power. As sample size increases, power increases. The two charts below focus on the impact of the other two ways for increasing power: increasing the effect size and increasing the level of significance $\alpha$.

## 1. Effect Size and Power
The chart below summarizes the power for different levels of effect size (small, medium, large) using 2-Tailed, 1-Sample $t$-tests with $\alpha = 0.05$. Notice how low the power is for small effect sizes (left column) even for sample sizes of 100. On the other hand, consider that we have excellent power for samples of only 20 if the effect size is large (right column).

```{r echo = FALSE}
# Create a 10x4 data frame called P and fill it with zeros
#
N = 10
temp = N*4
P = data.frame(matrix(1:temp,ncol = 4))   # Create 10x4 table
P = P *0L
#
# Create labels for the columns of P
#
colnames(P) = c("n", "Pwr if d = 0.2", "Pwr if d = 0.5", "Pwr if d = 0.8")
#
# Run Cohen's power calculations (2nd, 3rd and 4th columns in P)
#
P$n = 10*1:N
for (k in 1:length(P$n)) {
  P[k,2] = round(pwr.t.test(n = k*10, d = 0.2, type = "one.sample")$power,2)
}
for (k in 1:length(P$n)) {
  P[k,3] = round(pwr.t.test(n = k*10, d = 0.5, type = "one.sample")$power,2)
}
for (k in 1:length(P$n)) {
  P[k,4] = round(pwr.t.test(n = k*10, d = 0.8, type = "one.sample")$power,2)
}
P
```

## 2. Level of Significance $(\alpha)$ and Power

The chart below summarizes the power for different levels of $\alpha$ using 2-Tailed, 1-Sample $t$-tests with effect sizes $d = 0.30$. Notice how low the power is for small levels of significance (left column) even for sample sizes of 100. The power is much better for higher levels of significance.

```{r echo = FALSE}
# Create a 10x4 data frame called P and fill it with zeros
#
N = 10
temp = N*4
P = data.frame(matrix(1:temp,ncol = 4))   # Create 10x4 table
P = P *0L
#
# Create labels for the columns of P
#
colnames(P) = c("n", "Pwr if alpha = 0.01", "Pwr if alpha = 0.05", "Pwr if alpha = 0.10")
#
# Run Cohen's power calculations (2nd, 3rd and 4th columns in P)
#
P$n = 10*1:N
for (k in 1:length(P$n)) {
  P[k,2] = round(pwr.t.test(n = k*10, d = 0.3, type = "one.sample", sig.level = 0.01)$power,2)
}
for (k in 1:length(P$n)) {
  P[k,3] = round(pwr.t.test(n = k*10, d = 0.3, type = "one.sample", sig.level = 0.05)$power,2)
}
for (k in 1:length(P$n)) {
  P[k,4] = round(pwr.t.test(n = k*10, d = 0.3, type = "one.sample", sig.level = 0.10)$power,2)
}
P
```


## 3. Overpowering Statistical Test

Oddly, you can have too much power. When sample sizes are ridiculously large, almost any miniscule effect size will be enough to reject the null. Yet, the results will be meaningless because no real-world difference between groups exists.

Consider a two-sample comparison of test scores where the average Treatment group score is an 80.1, and the average Control group score is an 80.0. If the standard deviations are both 10, then Cohen's effect size is 
$$d = \frac{80.1-80.0}{10}=\frac{.1}{10}=0.01$$
We can all agree that with test scores out of 100 points, there is no real-world difference between the Treatment and Control groups. Yet, consider the following sample sizes

### Power when $n = 1000$

```{r}
pwr.t.test(n = 1000, d = 0.01, type = "two.sample")$power
```
A very large sample size, but not much power. This seems fine as we shouldn't be rejecting the null.

### Power when $n = 10,000$

```{r}
pwr.t.test(n = 10000, d = 0.01, type = "two.sample")$power
```
A huge sample size but still hardly any power. Good so far.

### Power when $n = 100,000$

```{r}
pwr.t.test(n = 100000, d = 0.01, type = "two.sample")$power
```
Whoa - now we have a reasonable chance of rejecting the null.

### Power when $n = 1,000,000$

```{r}
pwr.t.test(n = 1000000, d = 0.01, type = "two.sample")$power
```
With a ridiculous sample size of one million, we will always be rejecting the null despite the fact that there is no real-world difference between the group means.

### Word of Caution

Generally, the issue of overpowering a test isn't one we're worried about. As mentioned above, data is often both difficult and expensive to collect. However, in the era of big data, we can often find massive data sets readily available online. While interesting results can certainly be derived from studying huge data sets, researchers do need to ask themselves the question: "Is this finding significant, not statistically, but in the real world context of the study?" 
