I am starting an exploration of statistical significance and power to help communicate these concepts to business partners. This first iteration focuses on a two-sample t-test. I am providing code to run simulations and visualizations; you can easily adjust the parameter settings for your own exploration.
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(ggthemes) # additional ggplot themes
library(pwr)      # power analysis functions (used in a cross-check below)
Here you define the inputs used to create the two population distributions.
# population size of groups; here I'm assuming equal size.
n = 100000
# mean and standard deviation of population 1
mean1 = 0
sd1 = 1.0
# mean and standard deviation of population 2
mean2 = 0.5
sd2 = 1.1
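For reference, these settings imply a standardized effect size (Cohen's d) of roughly 0.48. A minimal sketch, assuming the equal-n pooled standard deviation:
# implied standardized effect size (Cohen's d), assuming equal group sizes
d.assumed = (mean2 - mean1) / sqrt((sd1^2 + sd2^2) / 2)
d.assumed # ~0.48 with the defaults above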
Creating the distributions and a common dataframe.
population1 <- data.frame(rnorm(n = n, mean = mean1, sd = sd1))
names(population1) <- 'measure'
population1$pop <- 'one'
population2 <- data.frame(rnorm(n = n, mean = mean2, sd = sd2))
names(population2) <- 'measure'
population2$pop <- 'two'
distdata <- bind_rows(population1, population2)
head(distdata)
## measure pop
## 1 1.0738174 one
## 2 0.3757792 one
## 3 1.2811745 one
## 4 -1.3284043 one
## 5 -2.0046181 one
## 6 -0.3457713 one
Visualizing the density distributions.
ggplot(distdata, aes(x = measure, color = pop)) +
geom_density() +
theme_hc()
Let’s define the inputs for the experiment. Change them as you so desire.
trials = 1000 # number of simulation trials
sampsize = 75 # assuming equal sample sizes for both groups
sample.pvalue = rep(NA, trials)
sample.power = rep(NA, trials)
pooled.sd = rep(NA, trials) # R's default t-test does not use pooled variance/sd
sample.effectSize = rep(NA, trials)
We will be testing the null hypothesis of no difference in the means of the two populations at an alpha of 0.05. The American Statistical Association defines a p-value as “the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value”.
http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.WdESGGhSzIU
Another way to look at it: suppose the significance test of an experiment produced a p-value of 0.03. If the null hypothesis of no difference in the means were true, you would expect a result at least as extreme in about 3% of studies due to random sampling error alone. In other words, the p-value tells you how likely your data are, assuming a true null hypothesis.
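To see that “assuming a true null hypothesis” part in action, here is a minimal sketch: draw both samples from population one, so the null is true by construction. The p-values should be roughly uniform, with about 5% falling below 0.05.
null.trials = 1000
null.pvalue = rep(NA, null.trials)
for(i in 1:null.trials){
  a = filter(distdata, pop == 'one') %>% sample_n(sampsize, replace = F)
  b = filter(distdata, pop == 'one') %>% sample_n(sampsize, replace = F)
  null.pvalue[i] = t.test(a$measure, b$measure, var.equal = T)$p.value
}
mean(null.pvalue < 0.05) # should be near alpha = 0.05
hist(null.pvalue)        # roughly flat when the null is true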
For each trial, the simulation draws a random sample from each population, conducts the t-test, and stores the p-value, the pooled standard deviation (used to calculate power), the statistical power, and the effect size.
delta = mean(population2$measure) - mean(population1$measure) # or make an assumption
for(i in 1:trials){
  sample.1 = filter(distdata, pop == 'one') %>% sample_n(sampsize, replace = F)
  sample.2 = filter(distdata, pop == 'two') %>% sample_n(sampsize, replace = F)
  results = t.test(sample.1$measure, sample.2$measure, var.equal = T) # assume equal variance T or F
  sample.pvalue[i] = results$p.value
  pooled.sd[i] = sqrt(((sampsize - 1) * var(sample.1$measure) + (sampsize - 1) *
                         var(sample.2$measure)) / (sampsize + sampsize - 2))
  sample.power[i] = power.t.test(n = sampsize, delta = delta, sd = pooled.sd[i],
                                 sig.level = 0.05, power = NULL,
                                 type = 'two.sample',
                                 alternative = 'two.sided')$power
  sample.effectSize[i] = mean(sample.2$measure) - mean(sample.1$measure)
}
The first thing to explore is the distribution of the p-values.
mean(sample.pvalue)
## [1] 0.03839203
median(sample.pvalue)
## [1] 0.004619976
sd(sample.pvalue)
## [1] 0.1051715
hist(sample.pvalue)
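As a side check, the share of trials with p below 0.05 is the simulation's empirical power; it should land close to the analytic power summarized next.
mean(sample.pvalue < 0.05) # empirical power: proportion of significant trials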
Now we explore the power.
mean(sample.power)
## [1] 0.8186528
median(sample.power)
## [1] 0.8212127
sd(sample.power)
## [1] 0.045288
hist(sample.power)
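As a cross-check on the simulated power, the pwr package loaded above can estimate the per-group sample size needed to reach 80% power for a given standardized effect size. A minimal sketch, using the d.assumed computed earlier from the population settings:
pwr.t.test(d = d.assumed, sig.level = 0.05, power = 0.80,
           type = 'two.sample', alternative = 'two.sided')
# with the default settings this should return n of roughly 70 per group,
# consistent with sampsize = 75 producing simulated power a bit above 0.8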
This creates a table to explore the relationship between p-values and power.
table(sample.pvalue < 0.05, sample.power > 0.8)
##
## FALSE TRUE
## FALSE 57 121
## TRUE 261 561
In the table, the rows are for the p-values and the columns for power. Since power is the probability of correctly rejecting the null hypothesis when the alternative is true, the probability of a false negative (a Type II error) is 1 minus power, so it falls as power rises. The upper-right value in the table is the number of false negatives among trials where power exceeded the recommended 0.8. In your simulations, simply divide that number by the column total and it should be close to 1 - mean(sample.power), as sketched below.
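Here is that check as code, a minimal sketch assuming both TRUE and FALSE outcomes occur in your run (with extreme parameter settings a row or column could be missing):
tab = table(sample.pvalue < 0.05, sample.power > 0.8)
# false negatives among high-power trials, divided by the column total
fn.rate = tab['FALSE', 'TRUE'] / sum(tab[, 'TRUE'])
fn.rate
1 - mean(sample.power) # should be close to fn.rate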
Adjust the simulation parameters and notice how the outcome changes and what drives those changes. Create distributions that violate the t-test assumptions and observe what happens! As you explore the various factors remember that a p-value is a random variable and should be used with caution.
sdf <- data.frame(sample.pvalue, sample.power, pooled.sd, sample.effectSize)
ggplot(sdf, aes(sample.effectSize, sample.power)) +
geom_point() +
theme_stata()
ggplot(sdf, aes(pooled.sd, sample.power)) +
geom_point() +
theme_economist_white()
ggplot(sdf, aes(sample.effectSize, sample.pvalue)) +
geom_point() +
theme_bw()
I hope you find this helpful. Please let me know your thoughts on how to improve, simplify, or extend this exercise.
Thank You.
Cory