Disclaimer: The content of this RMarkdown note came from a course called Foundations of Inference in datacamp.
Inference, or the process of drawing conclusions about a larger population from a sample of data is one of the foundational aspects of statistical analysis. This course covers such topics as:
In this chapter, you will investigate how repeated samples taken from a population can vary. It is the variability in samples that allow you to make claims about the population of interest. It is important to remember that the research claims of interest focus on the population while the information available comes only from the sample data.
Suppose a pharmaceutical company is trying to get FDA approval for a new diabetes treatment called drug A. Currently, most doctors prescribe drug B to treat diabetes.
The alternative hypothesis corresponds to the research question of interest, in this case whether drug A is more effective at treating diabetes than drug B.
Briefly explore the NHANES data, which contains body shape and related measurements from the US National Health and Nutrition Examination Survey (NHANES, 1999-2004). For more information about the NHANES package and its data set, see its manual.
Definition of variables
HomeOwn
One of Home, Rent, or Other indicating whether the home of study participant or someone in their family is owned, rented or occupied by some other arrangement.SleepHrsNight
Self-reported number of hours study participant usually gets at night on weekdays or workdays. Reported for participants aged 16 years and older.HealthGen
Self-reported rating of participant’s health in general Reported for participants aged 12 years or older. One of Excellent, Vgood, Good, Fair, or Poor.# Load packages
library(NHANES)
library(dplyr)
library(ggplot2)
# What are the variables in the NHANES dataset?
names(NHANES)
## [1] "ID" "SurveyYr" "Gender"
## [4] "Age" "AgeDecade" "AgeMonths"
## [7] "Race1" "Race3" "Education"
## [10] "MaritalStatus" "HHIncome" "HHIncomeMid"
## [13] "Poverty" "HomeRooms" "HomeOwn"
## [16] "Work" "Weight" "Length"
## [19] "HeadCirc" "Height" "BMI"
## [22] "BMICatUnder20yrs" "BMI_WHO" "Pulse"
## [25] "BPSysAve" "BPDiaAve" "BPSys1"
## [28] "BPDia1" "BPSys2" "BPDia2"
## [31] "BPSys3" "BPDia3" "Testosterone"
## [34] "DirectChol" "TotChol" "UrineVol1"
## [37] "UrineFlow1" "UrineVol2" "UrineFlow2"
## [40] "Diabetes" "DiabetesAge" "HealthGen"
## [43] "DaysPhysHlthBad" "DaysMentHlthBad" "LittleInterest"
## [46] "Depressed" "nPregnancies" "nBabies"
## [49] "Age1stBaby" "SleepHrsNight" "SleepTrouble"
## [52] "PhysActive" "PhysActiveDays" "TVHrsDay"
## [55] "CompHrsDay" "TVHrsDayChild" "CompHrsDayChild"
## [58] "Alcohol12PlusYr" "AlcoholDay" "AlcoholYear"
## [61] "SmokeNow" "Smoke100" "Smoke100n"
## [64] "SmokeAge" "Marijuana" "AgeFirstMarij"
## [67] "RegularMarij" "AgeRegMarij" "HardDrugs"
## [70] "SexEver" "SexAge" "SexNumPartnLife"
## [73] "SexNumPartYear" "SameSex" "SexOrientation"
## [76] "PregnantNow"
# Create bar plot for Home Ownership by Gender
ggplot(NHANES, aes(x = Gender, fill = HomeOwn)) +
geom_bar(position = "fill") + # "fill" to have 100% bars to compare relative frequencies
ylab("Relative frequencies")
# Density for SleepHrsNight colored by SleepTrouble, faceted by HealthGen
ggplot(NHANES, aes(x = SleepHrsNight, color = SleepTrouble)) +
geom_density(adjust = 2) + # adjust = 2 to smooth the density
facet_wrap(~ HealthGen)
Interpretation
Investigate the relationship between gender and home ownership.
homes <- NHANES %>%
# Select Gender and HomeOwn
select(Gender, HomeOwn) %>%
# Filter for HomeOwn equal to "Own" or "Rent"
filter(HomeOwn %in% c("Own", "Rent"))
homes
## # A tibble: 9,712 x 2
## Gender HomeOwn
## <fct> <fct>
## 1 male Own
## 2 male Own
## 3 male Own
## 4 male Own
## 5 female Rent
## 6 male Rent
## 7 male Own
## 8 female Own
## 9 female Own
## 10 female Own
## # ... with 9,702 more rows
diff_orig <- homes %>%
# Group by gender
group_by(Gender) %>%
# Summarize proportion of homeowners
summarise(prop_own = mean(HomeOwn == "Own")) %>%
# Summarize difference in proportion of homeowners
summarise(obs_diff_prop = diff(prop_own)) # male - female
diff_orig
## # A tibble: 1 x 1
## obs_diff_prop
## <dbl>
## 1 -0.00783
Interpretation
The infer
package will allow you to model a particular null hypothesis and then randomize the data to calculate permuted statistics. In this exercise, after specifying your null hypothesis you will permute the home ownership variable 10 times. By doing so, you will ensure that there is no relationship between home ownership and gender, so any difference in home ownership proportion for female versus male will be due only to natural variability.
This exercise will demonstrate the first three steps from the infer
package:
specify
will specify the response and explanatory variables.hypothesize
will declare the null hypothesis.generate
will generate resamples, permutations, or simulations.library(infer)
# Specify variables
homeown_perm <- homes %>%
specify(HomeOwn ~ Gender, success = "Own")
# Print results to console
homeown_perm
## Response: HomeOwn (factor)
## Explanatory: Gender (factor)
## # A tibble: 9,712 x 2
## HomeOwn Gender
## <fct> <fct>
## 1 Own male
## 2 Own male
## 3 Own male
## 4 Own male
## 5 Rent female
## 6 Rent male
## 7 Own male
## 8 Own female
## 9 Own female
## 10 Own female
## # ... with 9,702 more rows
# Hypothesize independence
homeown_perm <- homes %>%
specify(HomeOwn ~ Gender, success = "Own") %>%
hypothesize(null = "independence")
# Print results to console
homeown_perm
## Response: HomeOwn (factor)
## Explanatory: Gender (factor)
## Null Hypothesis: independence
## # A tibble: 9,712 x 2
## HomeOwn Gender
## <fct> <fct>
## 1 Own male
## 2 Own male
## 3 Own male
## 4 Own male
## 5 Rent female
## 6 Rent male
## 7 Own male
## 8 Own female
## 9 Own female
## 10 Own female
## # ... with 9,702 more rows
# Perform 10 permutations
homeown_perm <- homes %>%
specify(HomeOwn ~ Gender, success = "Own") %>%
hypothesize(null = "independence") %>%
# Shuffle the response variable, HomeOwn, ten times
generate(reps = 10, type = "permute")
# Print results to console
homeown_perm
## Response: HomeOwn (factor)
## Explanatory: Gender (factor)
## Null Hypothesis: independence
## # A tibble: 97,120 x 3
## # Groups: replicate [10]
## HomeOwn Gender replicate
## <fct> <fct> <int>
## 1 Own male 1
## 2 Rent male 1
## 3 Rent male 1
## 4 Own male 1
## 5 Own female 1
## 6 Own male 1
## 7 Rent male 1
## 8 Own female 1
## 9 Own female 1
## 10 Own female 1
## # ... with 97,110 more rows
By permuting the home ownership variable multiple times, you generate differences in proportions that are consistent with the assumption that the variables are unrelated. The statistic of interest is the difference in proportions given by stat = "diff in props"
.
This exercise shows all four steps from the infer package:
specify
will specify the response and explanatory variables.hypothesize
will declare the null hypothesis.generate
will generate resamples, permutations, or simulations.calculate
will calculate summary statistics.# Perform 100 permutations
homeown_perm <- homes %>%
specify(HomeOwn ~ Gender, success = "Own") %>%
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
calculate(stat = "diff in props", order = c("male", "female"))
# Print results to console
homeown_perm
## # A tibble: 100 x 2
## replicate stat
## <int> <dbl>
## 1 1 -0.0218
## 2 2 0.0144
## 3 3 0.00370
## 4 4 0.0144
## 5 5 0.00618
## 6 6 -0.00742
## 7 7 0.00453
## 8 8 -0.00000297
## 9 9 -0.00948
## 10 10 -0.0136
## # ... with 90 more rows
# Dotplot of 100 permuted differences in proportions
ggplot(homeown_perm, aes(x = stat)) +
geom_dotplot(binwidth = .001)
Repeat the process 1000 times to get a sense for the complete distribution of null differences in proportions. Using 100 repetitions allows you to understand the mechanism of permuting. However, 100 is not enough to observe the full range of likely values for the null differences in proportions.
# Perform 1000 permutations
homeown_perm <- homes %>%
# Specify HomeOwn vs. Gender, with `"Own" as success
specify(HomeOwn ~ Gender, success = "Own") %>%
# Use a null hypothesis of independence
hypothesize(null = "independence") %>%
# Generate 1000 repetitions (by permutation)
generate(reps = 1000, type = "permute") %>%
# Calculate the difference in proportions (male then female)
calculate(stat = "diff in props", order = c("male", "female"))
# Density plot of 1000 permuted differences in proportions
ggplot(homeown_perm, aes(x = stat)) +
geom_density()
Interpretation
Recall that the observed difference (i.e. the difference in proportions in the homes
dataset, shown as the red vertical line) was around -0.0078, which seems to fall below the bulk of the density of shuffled differences. It is important to know, however, whether any of the randomly permuted differences were as extreme as the observed difference.
In this exercise, you’ll re-create this dotplot as a density plot and count the number of permuted differences that were to the left of the observed difference.
# Redefinie homeown_perm
homeown_perm <-
homeown_perm %>%
rename(diff_perm = stat) %>%
mutate(diff_orig = -0.00783)
# Plot permuted differences, diff_perm
ggplot(homeown_perm, aes(x = diff_perm)) +
# Add a density layer
geom_density() +
# Add a vline layer with intercept diff_orig
geom_vline(aes(xintercept = diff_orig), color = "red")
# Compare permuted differences to observed difference
homeown_perm %>%
summarize(sum(diff_perm <= diff_orig))
## # A tibble: 1 x 1
## `sum(diff_perm <= diff_orig)`
## <int>
## 1 212
Interpretation
In this chapter, you will gain the tools and knowledge to complete a full hypothesis test. That is, given a dataset, you will know whether or not is appropriate to reject the null hypothesis in favor of the research claim of interest.
Null hypothesis: gender and promotion are unrelated variables. Alternative hypothesis: : men are more likely to be promoted.
Summarize the proportions of women who were promoted. Two possible proportions are:
Do the former.
disc %>%
# Count the rows by promote and sex
count(sex, promote)
## # A tibble: 4 x 3
## sex promote n
## <fct> <fct> <int>
## 1 female not_promoted 10
## 2 female promoted 14
## 3 male not_promoted 3
## 4 male promoted 21
# Find proportion of each sex who were promoted
disc %>%
# Group by sex
group_by(sex) %>%
# Calculate proportion promoted summary stat
summarise(promoted_prop = mean(promote == "promoted"))
## # A tibble: 2 x 2
## sex promoted_prop
## <fct> <dbl>
## 1 female 0.583
## 2 male 0.875
Interpretation
# Replicate the entire data frame, permuting the promote variable
disc_perm <- disc %>%
specify(promote ~ sex, success = "promoted") %>%
hypothesize(null = "independence") %>%
generate(reps = 5, type = "permute")
disc_perm %>%
# Group by replicate
group_by(replicate) %>%
# Count per group
count(promote, sex)
## # A tibble: 20 x 4
## # Groups: replicate [5]
## replicate promote sex n
## <int> <fct> <fct> <int>
## 1 1 not_promoted female 8
## 2 1 not_promoted male 5
## 3 1 promoted female 16
## 4 1 promoted male 19
## 5 2 not_promoted female 6
## 6 2 not_promoted male 7
## 7 2 promoted female 18
## 8 2 promoted male 17
## 9 3 not_promoted female 4
## 10 3 not_promoted male 9
## 11 3 promoted female 20
## 12 3 promoted male 15
## 13 4 not_promoted female 6
## 14 4 not_promoted male 7
## 15 4 promoted female 18
## 16 4 promoted male 17
## 17 5 not_promoted female 5
## 18 5 not_promoted male 8
## 19 5 promoted female 19
## 20 5 promoted male 16
disc_perm %>%
# Calculate difference in proportion, male then female
calculate(stat = "diff in props", order = c("male", "female")) # male - female
## # A tibble: 5 x 2
## replicate stat
## <int> <dbl>
## 1 1 0.125
## 2 2 -0.0417
## 3 3 -0.208
## 4 4 -0.0417
## 5 5 -0.125
Interpretation
# Calculate the observed difference in promotion rate
diff_orig <- disc %>%
# Group by sex
group_by(sex) %>%
# Summarize to calculate fraction promoted
summarise(prop_prom = mean(promote == "promoted")) %>%
# Summarize to calculate difference
summarise(stat = diff(prop_prom)) %>%
pull()
# See the result
diff_orig # male - female
## [1] 0.2916667
# Create data frame of permuted differences in promotion rates
disc_perm <- disc %>%
# Specify promote vs. sex
specify(promote ~ sex, success = "promoted") %>%
# Set null hypothesis as independence
hypothesize(null = "independence") %>%
# Generate 1000 permutations
generate(reps = 1000, type = "permute") %>%
# Calculate difference in proportions
calculate(stat = "diff in props", order = c("male", "female"))
# Using permutation data, plot stat
ggplot(disc_perm, aes(x = stat)) +
# Add a histogram layer
geom_histogram(binwidth = 0.01) +
# Using original data, add a vertical line at stat
geom_vline(aes(xintercept = diff_orig), color = "red")
It seems as though the statistic—a difference in promotion rates of 0.2917—is on the extreme end of the permutation distribution. That is, there are very few permuted differences which are as extreme as the observed difference.
disc_perm %>%
summarize(
# Find the 0.9 quantile of diff_perm's stat
q.90 = quantile(stat, p = 0.9),
# ... and the 0.95 quantile
q.95 = quantile(stat, p = 0.95),
# ... and the 0.99 quantile
q.99 = quantile(stat, p = 0.99)
)
## # A tibble: 1 x 3
## q.90 q.95 q.99
## <dbl> <dbl> <dbl>
## 1 0.208 0.208 0.292
Interpretation
For the discrimination data, the question at hand is whether or not women were promoted less often than men. However, there are often scenarios where the research question centers around a difference without directionality. For example, you might be interested in whether the rate of promotion for men and women is different.
Two-sided critical region is to test:
Two-sided critical region is used when the research question centers around a difference without directionality.
# Find the 0.01, 0.05, and 0.10 quantiles of diff_perm
disc_perm %>%
summarise(q.01 = quantile(stat, p = 0.01),
q.05 = quantile(stat, p = 0.05),
q.10 = quantile(stat, p = 0.10))
## # A tibble: 1 x 3
## q.01 q.05 q.10
## <dbl> <dbl> <dbl>
## 1 -0.292 -0.208 -0.208
Interpretation
Notice that the observed difference of 0.2917 is in the extreme right tail of the permuted differences. If the sample was ten times larger but the sample statistic was exactly the same (i.e. 0.2917), how would the distribution of permuted differences change?
The statistic of 0.2917 would be much farther to the right of the permuted differences (completely off of the distribution).
Create two new distributions to get a sense for how the differences vary given widely different sample sizes. One of the datasets (disc_small) is one third the size of the original dataset and the other (disc_big) is 10 times larger than the original dataset.
# Tabulate the small dataset
disc_small %>%
# Select sex and promote
count(sex, promote)
## # A tibble: 4 x 3
## sex promote n
## <fct> <fct> <int>
## 1 female not_promoted 3
## 2 female promoted 5
## 3 male not_promoted 1
## 4 male promoted 7
# Do the same for disc_big
disc_big %>%
# Select sex and promote
count(sex, promote)
## # A tibble: 4 x 3
## sex promote n
## <fct> <fct> <int>
## 1 female not_promoted 100
## 2 female promoted 140
## 3 male not_promoted 30
## 4 male promoted 210
# Using disc_perm_small, plot stat
ggplot(disc_perm_small, aes(x = stat)) +
# Add a histogram layer with binwidth 0.01
geom_histogram(binwidth = 0.01) +
# Add a vline layer, crossing x-axis at diff_orig_small
geom_vline(aes(xintercept = diff_orig_small), color = "red")
# Swap the dataset to disc_perm_big
ggplot(disc_perm_big, aes(x = stat)) +
geom_histogram(binwidth = 0.01) +
# Change the x-axis intercept to diff_orig_big
geom_vline(aes(xintercept = diff_orig_big), color = "red")
Interpretation
Using the randomization distributions with the small and big datasets, calculate different cutoffs for significance.
calc_upper_quantiles <- function(dataset) {
dataset %>%
summarize(
q.90 = quantile(stat, p = 0.90),
q.95 = quantile(stat, p = 0.95),
q.99 = quantile(stat, p = 0.99)
)
}
# Recall the quantiles associated with the original dataset
calc_upper_quantiles(disc_perm)
## # A tibble: 1 x 3
## q.90 q.95 q.99
## <dbl> <dbl> <dbl>
## 1 0.208 0.208 0.292
# Calculate the quantiles associated with the small dataset
calc_upper_quantiles(disc_perm_small)
## # A tibble: 1 x 3
## q.90 q.95 q.99
## <dbl> <dbl> <dbl>
## 1 0.250 0.250 0.500
# Calculate the quantiles associated with the big dataset
calc_upper_quantiles(disc_perm_big)
## # A tibble: 1 x 3
## q.90 q.95 q.99
## <dbl> <dbl> <dbl>
## 1 0.0500 0.0667 0.0917
Interpretation
Calculate the p-value for the original discrimination dataset as well as disc_small
and disc_big
. Remember that a p-value measures the degree of disagreement between the data and the null hypothesis.
library(infer)
# Visualize and calculate the p-value for the original dataset
disc_perm %>%
visualize(obs_stat = diff_orig, direction = "greater")
disc_perm %>%
get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0260
# Visualize and calculate the p-value for the small dataset
disc_perm_small %>%
visualize(obs_stat = diff_orig_small, direction = "greater")
disc_perm_small %>%
get_p_value(obs_stat = diff_orig_small, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.298
# Visualize and calculate the p-value for the big dataset
disc_perm_big %>%
visualize(obs_stat = diff_orig_big, direction = "greater")
disc_perm_big %>%
get_p_value(obs_stat = diff_orig_big, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.
Interpretation
Consider a situation where there are 24 men, 24 women, and 35 people are still promoted. But in this new scenario, 75% of the men are promoted and 70.8% of the women are promoted. Does the difference in promotion rates still appear to be statistically significant? That is, could this difference in promotion rates have come from random chance?
In the original dataset, 87.5% of the men were promoted and 58.3% of the women were promoted.
# Recall the original data
disc %>%
count(sex, promote)
## # A tibble: 4 x 3
## sex promote n
## <fct> <fct> <int>
## 1 female not_promoted 10
## 2 female promoted 14
## 3 male not_promoted 3
## 4 male promoted 21
# Tabulate the new data
disc_new %>%
count(sex, promote)
## # A tibble: 4 x 3
## sex promote n
## <fct> <fct> <int>
## 1 female not_promoted 7
## 2 female promoted 17
## 3 male not_promoted 6
## 4 male promoted 18
# Recall the distribution of the original permuted differences
ggplot(disc_perm, aes(x = stat)) +
geom_histogram() +
geom_vline(aes(xintercept = diff_orig), color = "red")
# Plot the distribution of the new permuted differences
ggplot(disc_perm_new, aes(x = stat)) +
geom_histogram() +
geom_vline(aes(xintercept = diff_orig_new), color = "red")
# Recall the p-value from the original data
disc_perm %>%
summarize(p_value = mean(diff_orig <= stat))
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0260
# Find the p-value from the new data
disc_perm_new %>%
summarize(p_value = mean(diff_orig_new <= stat))
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.507
Interpreation
When there is no directionality to the alternative hypothesis, the hypothesis and p-value are considered to be two-sided. In a two-sided setting, the p-value is double the one-sided p-value. In this case, a difference like the one observed would occur twice as often (by chance) because sometimes the difference would be positive and sometimes it would be negative.
# Calculate the two-sided p-value
disc_perm %>%
summarize(p_value = mean(diff_orig <= stat)*2)
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0520
Interpretation
You will continue learning about hypothesis testing with a new example and the same structure of randomization tests. In this chapter, however, the focus will be on different errors (type I and type II), how they are made, when one is worse than another, and how things like sample size and effect size impact the error rates.
Research questions: whether reminding students of saving money would make it less likely for them to spend money on entertaining video
Students were ramdomly assigned to two different groups.
Do not reject Null Hypothesis | Reject Null Hypothesis | |
---|---|---|
Null true | Type 1 error | |
Alt true | Type 2 error |
Find the sample statistics (here: proportions) that are needed for the analysis.
# Import data
opportunity <- read.csv("/resources/rstudio/Bus Statistics/data/Foundations of Inference/opportunity.csv")
head(opportunity)
## decision group
## 1 buyDVD control
## 2 buyDVD control
## 3 buyDVD control
## 4 buyDVD control
## 5 buyDVD control
## 6 buyDVD control
# Tabulate the data
opportunity %>%
count(decision, group)
## # A tibble: 4 x 3
## decision group n
## <fct> <fct> <int>
## 1 buyDVD control 56
## 2 buyDVD treatment 41
## 3 nobuyDVD control 19
## 4 nobuyDVD treatment 34
# Find the proportion who bought the DVD in each group
opportunity %>%
group_by(group) %>%
summarize(buy_prop = mean(decision == "buyDVD"))
## # A tibble: 2 x 2
## group buy_prop
## <fct> <dbl>
## 1 control 0.747
## 2 treatment 0.547
Interpretation
Create a barplot to visualize the difference in proportions between the treatment and control groups.
# Create a barplot
ggplot(opportunity, aes(x = group, fill = decision)) +
geom_bar(position = "fill")
Permute the data to generate a distribution of differences as if the null hypothesis were true.
Assumptions
# Calculate the observed difference in purchase rate
diff_obs <- opportunity %>%
# Group by group
group_by(group) %>%
# Calculate proportion deciding to buy a DVD
summarise(prop_buy = mean(decision == "buyDVD")) %>%
# Calculate difference between groups
summarise(stat = diff(prop_buy)) %>%
pull()
# Create data frame of permuted differences in purchase rates
opp_perm <- opportunity %>%
# Specify decision vs. group, where success is buying a DVD
specify(decision ~ group, success = "buyDVD") %>%
# Set the null hypothesis to independence
hypothesize(null = "independence") %>%
# Generate 1000 reps of type permute
generate(reps = 1000, type = "permute") %>%
# Calculate the summary stat difference in proportions
calculate(stat = "diff in props", order = c("treatment", "control"))
# Review the result
opp_perm
## # A tibble: 1,000 x 2
## replicate stat
## <int> <dbl>
## 1 1 0.120
## 2 2 -0.0400
## 3 3 -0.0133
## 4 4 0.0667
## 5 5 0.0400
## 6 6 -0.173
## 7 7 -0.0133
## 8 8 -0.0933
## 9 9 -0.0400
## 10 10 -0.120
## # ... with 990 more rows
# Using the permuation data, plot stat
ggplot(opp_perm, aes(x = stat)) +
# Add a histogram layer with binwidth 0.005
geom_histogram(binwidth = 0.005) +
# Add a vline layer with intercept diff_obs
geom_vline(aes(xintercept = diff_obs), color = "red")
Compute the p-value, or the proportion of permuted differences less than or equal to the observed difference. In other words, assess whether the observed difference in proportions is consistent with the null difference.
# Visualize the statistic
opp_perm %>%
visualize(obs_stat = diff_orig, direction = "less")
# Calculate the p-value using `get_p_value`
opp_perm %>%
get_p_value(obs_stat = diff_orig, direction = "less")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0120
# Calculate the p-value using `summarize`
opp_perm %>%
summarize(p_value = mean(stat <= diff_orig))
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0120
Interpretation
Consider again a situation where the task is to differentiate the proportion of successes across two different groups. What decision should be made if the goal is to never make a type II error (false negative)?
Find a two-sided p-value by simply doubling the one sided p-value.
# Calculate the two-sided p-value
opp_perm %>%
summarize(p_value = 2 * mean(stat <= diff_orig))
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0240
Interpretation
As a complement to hypothesis testing, confidence intervals allow you to estimate a population parameter. Recall that your interest is always in some characteristic of the population, but you only have incomplete information to estimate the parameter using sample data. Here, the parameter is the true proportion of successes in a population. Bootstrapping is used to estimate the variability needed to form the confidence interval.
hypothesis testing versus confidence interval
Set up two sampling experiments.
It’s important to realize that the first experiment relies on knowing the population and is typically impossible in practice.
# Compute p-hat for each poll
ex1_props <- all_polls %>%
# Group by poll
group_by(poll) %>%
# Calculate proportion of yes votes
summarise(stat = mean(vote == "yes"))
# Review the result
ex1_props
## # A tibble: 1,000 x 2
## poll stat
## <int> <dbl>
## 1 1 0.700
## 2 2 0.667
## 3 3 0.633
## 4 4 0.633
## 5 5 0.400
## 6 6 0.600
## 7 7 0.500
## 8 8 0.533
## 9 9 0.567
## 10 10 0.567
## # ... with 990 more rows
# Select one poll from which to resample
one_poll <- all_polls %>%
# Filter for the first poll
filter(poll == 1) %>%
# Select vote
select(vote)
# Compute p-hat* for each resampled poll
ex2_props <- one_poll %>%
# Specify vote as the response, where yes means success
specify(response = vote, success = "yes") %>%
# Generate 1000 reps of type bootstrap
generate(reps = 1000, type = "bootstrap") %>%
# Calculate the summary stat "prop"
calculate(stat = "prop")
# Calculate variability of p-hat
ex1_props %>%
summarize(variability = sd(stat))
## # A tibble: 1 x 1
## variability
## <dbl>
## 1 0.0868
# Calculate variability of p-hat*
ex2_props %>%
summarize(variability = sd(stat)) # Note that this number will change everytime the code is run b/c the data is resampled from the original sample
## # A tibble: 1 x 1
## variability
## <dbl>
## 1 0.0846
Interpretation
# Combine data from both experiments
both_ex_props <- bind_rows(ex1_props, ex2_props, .id = "experiment")
both_ex_props
## # A tibble: 2,000 x 4
## experiment poll stat replicate
## <chr> <int> <dbl> <int>
## 1 1 1 0.700 NA
## 2 1 2 0.667 NA
## 3 1 3 0.633 NA
## 4 1 4 0.633 NA
## 5 1 5 0.400 NA
## 6 1 6 0.600 NA
## 7 1 7 0.500 NA
## 8 1 8 0.533 NA
## 9 1 9 0.567 NA
## 10 1 10 0.567 NA
## # ... with 1,990 more rows
# Using both_ex_props, plot stat colored by experiment
ggplot(both_ex_props, aes(stat, color = experiment)) +
# Add a density layer with bandwidth 0.1
geom_density(bw = 0.1)
Interpretation
If the variability of the sample proportion (called the standard error, or SE) is known, then approximately 95% of p^ values (from different samples) will be within 2SE of the true population proportion.
Check whether that holds in the polls generated by taking many samples from the same population.
# Proportion of yes votes by poll
props <- all_polls %>%
group_by(poll) %>%
summarize(prop_yes = mean(vote == "yes"))
# The true population proportion of yes votes
true_prop_yes <- 0.6
# Proportion of polls within 2SE
props %>%
# Add column: is prop_yes in 2SE of 0.6
mutate(is_in_conf_int = abs(prop_yes - true_prop_yes) < 2 * sd(prop_yes)) %>%
# Calculate proportion in conf int
summarize(prop_in_conf_int = mean(is_in_conf_int))
## # A tibble: 1 x 1
## prop_in_conf_int
## <dbl>
## 1 0.966
Interpretation
The previous exercises told you two things:
# From previous exercises
one_poll <- all_polls %>%
filter(poll == 1) %>%
select(vote)
one_poll_boot <- one_poll %>%
specify(response = vote, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop")
p_hat <- one_poll %>%
# Calculate proportion of yes votes
summarize(stat = mean(vote == "yes")) %>%
pull()
# Create an interval of plausible values
one_poll_boot %>%
summarize(
# Lower bound is p_hat minus 2 std errs
lower = p_hat - 2 * sd(stat),
# Upper bound is p_hat plus 2 std errs
upper = p_hat + 2 * sd(stat)
)
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 0.532 0.868
Interpretation
There is the second method of constructing bootstrap intervals. Instead of using ±2SE as a way to measure the middle 95% of the sampled p^ values, you can find the middle of the resampled p^ values by removing the upper and lower 2.5%.
# From previous exercise: bootstrap t-confidence interval
one_poll_boot %>%
summarize(
lower = p_hat - 2 * sd(stat),
upper = p_hat + 2 * sd(stat)
)
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 0.532 0.868
# Manually calculate a 95% percentile interval
one_poll_boot %>%
summarize(
lower = quantile(stat, p = 0.025),
upper = quantile(stat, p = 0.975)
)
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 0.533 0.867
# Calculate the same interval, more conveniently
percentile_ci <- one_poll_boot %>%
get_confidence_interval(level = 0.95)
# Review the value
percentile_ci
## # A tibble: 1 x 2
## `2.5%` `97.5%`
## <dbl> <dbl>
## 1 0.533 0.867
# From previous step
percentile_ci <- one_poll_boot %>%
get_confidence_interval(level = 0.95)
one_poll_boot %>%
# Visualize in-between the endpoints given by percentile_ci
visualize(endpoints = percentile_ci,
direction = "between")
Interpretation
There are two things that affect the width of the confidence interval:
Use the incorrect standard error (based on the incorrect sample size) to create a confidence interval. The idea is that when the standard error is off, the interval is not particularly useful, nor is it correct.
Note that if you resampled the data with the wrong size (e.g. 300 or 3 instead of 30), the standard error of the sample proportions was off.
calc_t_conf_int <- function(resampled_dataset) {
resampled_dataset %>%
summarize(
lower = p_hat - 2 * sd(stat),
upper = p_hat + 2 * sd(stat)
)
}
# Find the bootstrap t-confidence interval for 30 resamples
calc_t_conf_int(one_poll_boot)
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 0.532 0.868
# ... and for 300 resamples
calc_t_conf_int(one_poll_boot_300)
## lower upper
## 1 0.6468069 0.7531931
# ... and for 3 resamples
calc_t_conf_int(one_poll_boot_3)
## lower upper
## 1 0.1613891 1.238611
Interpretation
One additional element that changes the width of the confidence interval is the true parameter value.
When the true parameter is close to 0.5, the standard error of p^ is larger than when the true parameter is closer to 0 or 1. When calculating a bootstrap t-confidence interval, the standard error controls the width of the CI, and here the width will be narrower.
calc_p_hat <- function(dataset) {
dataset %>%
summarize(stat = mean(vote == "yes")) %>%
pull()
}
calc_t_conf_int <- function(resampled_dataset, p_hat) {
resampled_dataset %>%
summarize(
lower = p_hat - 2 * sd(stat),
upper = p_hat + 2 * sd(stat)
)
}
# Find proportion of yes votes from original population
p_hat <- calc_p_hat(one_poll)
# Review the value
p_hat
## [1] 0.7
# Calculate bootstrap t-confidence interval (original 0.6 param)
calc_t_conf_int(one_poll_boot, p_hat)
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 0.532 0.868
# Find proportion of yes votes from new population
p_hat_0.8 <- calc_p_hat(one_poll_0.8)
# Review the value
p_hat_0.8
## [1] 0.8333333
# Calculate the bootstrap t-confidence interval (new 0.8 param)
calc_t_conf_int(one_poll_boot_0.8, p_hat_0.8)
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 0.695 0.972
Interpretation
Most scientists use 95% intervals to quantify their uncertainty about an estimate. That is, they understand that over a lifetime of creating confidence intervals, only 95% of them will actually contain the parameter that they set out to estimate.
# Calculate a 95% bootstrap percentile interval
one_poll_boot %>%
get_confidence_interval(level = 0.95)
## # A tibble: 1 x 2
## `2.5%` `97.5%`
## <dbl> <dbl>
## 1 0.533 0.867
# Calculate a 99% bootstrap percentile interval
one_poll_boot %>%
get_confidence_interval(level = 0.99)
## # A tibble: 1 x 2
## `0.5%` `99.5%`
## <dbl> <dbl>
## 1 0.467 0.900
# Calculate a 90% bootstrap percentile interval
one_poll_boot %>%
get_confidence_interval(level = 0.9)
## # A tibble: 1 x 2
## `5%` `95%`
## <dbl> <dbl>
## 1 0.567 0.833
# Plot ci_endpoints vs. ci_percent to compare the intervals
ggplot(conf_int_data, aes(ci_percent, ci_endpoints)) +
# Add a line layer
geom_line()