Disclaimer: The content of this RMarkdown note came from a course called Foundations of Inference in datacamp.

Foundations of Inference

Inference, or the process of drawing conclusions about a larger population from a sample of data is one of the foundational aspects of statistical analysis. This course covers such topics as:

Chapter 1: Introduction to ideas of inference

In this chapter, you will investigate how repeated samples taken from a population can vary. It is the variability in samples that allow you to make claims about the population of interest. It is important to remember that the research claims of interest focus on the population while the information available comes only from the sample data.

1.1 Hypotheses

Suppose a pharmaceutical company is trying to get FDA approval for a new diabetes treatment called drug A. Currently, most doctors prescribe drug B to treat diabetes.

  • Null hypothesis: Drug A is the same as drug B at treating diabetes.
  • Alternative hyposthesis: Drug A is better than drug B at treating diabetes.

The alternative hypothesis corresponds to the research question of interest, in this case whether drug A is more effective at treating diabetes than drug B.

1.2 Working with the NHANES data

Briefly explore the NHANES data, which contains body shape and related measurements from the US National Health and Nutrition Examination Survey (NHANES, 1999-2004). For more information about the NHANES package and its data set, see its manual.

Definition of variables

  • HomeOwn One of Home, Rent, or Other indicating whether the home of study participant or someone in their family is owned, rented or occupied by some other arrangement.
  • SleepHrsNight Self-reported number of hours study participant usually gets at night on weekdays or workdays. Reported for participants aged 16 years and older.
  • HealthGen Self-reported rating of participant’s health in general Reported for participants aged 12 years or older. One of Excellent, Vgood, Good, Fair, or Poor.
# Load packages
library(NHANES)
library(dplyr)
library(ggplot2)

# What are the variables in the NHANES dataset?
names(NHANES)
##  [1] "ID"               "SurveyYr"         "Gender"          
##  [4] "Age"              "AgeDecade"        "AgeMonths"       
##  [7] "Race1"            "Race3"            "Education"       
## [10] "MaritalStatus"    "HHIncome"         "HHIncomeMid"     
## [13] "Poverty"          "HomeRooms"        "HomeOwn"         
## [16] "Work"             "Weight"           "Length"          
## [19] "HeadCirc"         "Height"           "BMI"             
## [22] "BMICatUnder20yrs" "BMI_WHO"          "Pulse"           
## [25] "BPSysAve"         "BPDiaAve"         "BPSys1"          
## [28] "BPDia1"           "BPSys2"           "BPDia2"          
## [31] "BPSys3"           "BPDia3"           "Testosterone"    
## [34] "DirectChol"       "TotChol"          "UrineVol1"       
## [37] "UrineFlow1"       "UrineVol2"        "UrineFlow2"      
## [40] "Diabetes"         "DiabetesAge"      "HealthGen"       
## [43] "DaysPhysHlthBad"  "DaysMentHlthBad"  "LittleInterest"  
## [46] "Depressed"        "nPregnancies"     "nBabies"         
## [49] "Age1stBaby"       "SleepHrsNight"    "SleepTrouble"    
## [52] "PhysActive"       "PhysActiveDays"   "TVHrsDay"        
## [55] "CompHrsDay"       "TVHrsDayChild"    "CompHrsDayChild" 
## [58] "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"     
## [61] "SmokeNow"         "Smoke100"         "Smoke100n"       
## [64] "SmokeAge"         "Marijuana"        "AgeFirstMarij"   
## [67] "RegularMarij"     "AgeRegMarij"      "HardDrugs"       
## [70] "SexEver"          "SexAge"           "SexNumPartnLife" 
## [73] "SexNumPartYear"   "SameSex"          "SexOrientation"  
## [76] "PregnantNow"

# Create bar plot for Home Ownership by Gender
ggplot(NHANES, aes(x = Gender, fill = HomeOwn)) + 
  geom_bar(position = "fill") +   # "fill" to have 100% bars to compare relative frequencies 
  ylab("Relative frequencies")


# Density for SleepHrsNight colored by SleepTrouble, faceted by HealthGen
ggplot(NHANES, aes(x = SleepHrsNight, color = SleepTrouble)) + 
  geom_density(adjust = 2) +   # adjust = 2 to smooth the density
  facet_wrap(~ HealthGen)

Interpretation

  • The warning message for the densities and the grey portions of the bars both indicate a large number of missing observations in the dataset. If this were your dataset, it would be important to stop here and consider the cause of the missingness.

1.3 Calculating statistic of interest

Investigate the relationship between gender and home ownership.

homes <- NHANES %>%
  # Select Gender and HomeOwn
  select(Gender, HomeOwn) %>%
  # Filter for HomeOwn equal to "Own" or "Rent"
  filter(HomeOwn %in% c("Own", "Rent"))

homes
## # A tibble: 9,712 x 2
##    Gender HomeOwn
##    <fct>  <fct>  
##  1 male   Own    
##  2 male   Own    
##  3 male   Own    
##  4 male   Own    
##  5 female Rent   
##  6 male   Rent   
##  7 male   Own    
##  8 female Own    
##  9 female Own    
## 10 female Own    
## # ... with 9,702 more rows

diff_orig <- homes %>%   
  # Group by gender
  group_by(Gender) %>%
  # Summarize proportion of homeowners
  summarise(prop_own = mean(HomeOwn == "Own")) %>%
  # Summarize difference in proportion of homeowners
  summarise(obs_diff_prop = diff(prop_own)) # male - female

diff_orig
## # A tibble: 1 x 1
##   obs_diff_prop
##           <dbl>
## 1      -0.00783

Interpretation

  • Now that you have the observed difference in proportions, you’ll move on to calculating the difference in proportions for randomized datasets.

1.4 Randomized data under null model of independence

The infer package will allow you to model a particular null hypothesis and then randomize the data to calculate permuted statistics. In this exercise, after specifying your null hypothesis you will permute the home ownership variable 10 times. By doing so, you will ensure that there is no relationship between home ownership and gender, so any difference in home ownership proportion for female versus male will be due only to natural variability.

This exercise will demonstrate the first three steps from the infer package:

  1. specify will specify the response and explanatory variables.
  2. hypothesize will declare the null hypothesis.
  3. generate will generate resamples, permutations, or simulations.
library(infer)

# Specify variables
homeown_perm <- homes %>%
  specify(HomeOwn ~ Gender, success = "Own")

# Print results to console
homeown_perm
## Response: HomeOwn (factor)
## Explanatory: Gender (factor)
## # A tibble: 9,712 x 2
##    HomeOwn Gender
##    <fct>   <fct> 
##  1 Own     male  
##  2 Own     male  
##  3 Own     male  
##  4 Own     male  
##  5 Rent    female
##  6 Rent    male  
##  7 Own     male  
##  8 Own     female
##  9 Own     female
## 10 Own     female
## # ... with 9,702 more rows

# Hypothesize independence
homeown_perm <- homes %>%
  specify(HomeOwn ~ Gender, success = "Own") %>%
  hypothesize(null = "independence")  

# Print results to console
homeown_perm
## Response: HomeOwn (factor)
## Explanatory: Gender (factor)
## Null Hypothesis: independence
## # A tibble: 9,712 x 2
##    HomeOwn Gender
##    <fct>   <fct> 
##  1 Own     male  
##  2 Own     male  
##  3 Own     male  
##  4 Own     male  
##  5 Rent    female
##  6 Rent    male  
##  7 Own     male  
##  8 Own     female
##  9 Own     female
## 10 Own     female
## # ... with 9,702 more rows

# Perform 10 permutations
homeown_perm <- homes %>%
  specify(HomeOwn ~ Gender, success = "Own") %>%
  hypothesize(null = "independence") %>% 
  # Shuffle the response variable, HomeOwn, ten times
  generate(reps = 10, type = "permute") 

# Print results to console
homeown_perm
## Response: HomeOwn (factor)
## Explanatory: Gender (factor)
## Null Hypothesis: independence
## # A tibble: 97,120 x 3
## # Groups:   replicate [10]
##    HomeOwn Gender replicate
##    <fct>   <fct>      <int>
##  1 Own     male           1
##  2 Rent    male           1
##  3 Rent    male           1
##  4 Own     male           1
##  5 Own     female         1
##  6 Own     male           1
##  7 Rent    male           1
##  8 Own     female         1
##  9 Own     female         1
## 10 Own     female         1
## # ... with 97,110 more rows

1.5 Randomized statistics and dotplot

By permuting the home ownership variable multiple times, you generate differences in proportions that are consistent with the assumption that the variables are unrelated. The statistic of interest is the difference in proportions given by stat = "diff in props".

This exercise shows all four steps from the infer package:

  1. specify will specify the response and explanatory variables.
  2. hypothesize will declare the null hypothesis.
  3. generate will generate resamples, permutations, or simulations.
  4. calculate will calculate summary statistics.
# Perform 100 permutations
homeown_perm <- homes %>%
  specify(HomeOwn ~ Gender, success = "Own") %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 100, type = "permute") %>% 
  calculate(stat = "diff in props", order = c("male", "female"))
  
# Print results to console
homeown_perm
## # A tibble: 100 x 2
##    replicate        stat
##        <int>       <dbl>
##  1         1 -0.0218    
##  2         2  0.0144    
##  3         3  0.00370   
##  4         4  0.0144    
##  5         5  0.00618   
##  6         6 -0.00742   
##  7         7  0.00453   
##  8         8 -0.00000297
##  9         9 -0.00948   
## 10        10 -0.0136    
## # ... with 90 more rows

# Dotplot of 100 permuted differences in proportions
ggplot(homeown_perm, aes(x = stat)) + 
  geom_dotplot(binwidth = .001)

1.6 Randomization density

Repeat the process 1000 times to get a sense for the complete distribution of null differences in proportions. Using 100 repetitions allows you to understand the mechanism of permuting. However, 100 is not enough to observe the full range of likely values for the null differences in proportions.

# Perform 1000 permutations
homeown_perm <- homes %>%
  # Specify HomeOwn vs. Gender, with `"Own" as success
  specify(HomeOwn ~ Gender, success = "Own") %>%
  # Use a null hypothesis of independence
  hypothesize(null = "independence") %>% 
  # Generate 1000 repetitions (by permutation)
  generate(reps = 1000, type = "permute") %>% 
  # Calculate the difference in proportions (male then female)
  calculate(stat = "diff in props", order = c("male", "female"))

# Density plot of 1000 permuted differences in proportions
ggplot(homeown_perm, aes(x = stat)) + 
  geom_density()

Interpretation

  • You can now see that the distribution is approximately normally distributed around -0.01, but what can we conclude from it?

1.7 Do the data come from the population?

Recall that the observed difference (i.e. the difference in proportions in the homes dataset, shown as the red vertical line) was around -0.0078, which seems to fall below the bulk of the density of shuffled differences. It is important to know, however, whether any of the randomly permuted differences were as extreme as the observed difference.

In this exercise, you’ll re-create this dotplot as a density plot and count the number of permuted differences that were to the left of the observed difference.

# Redefinie homeown_perm
homeown_perm <-
  homeown_perm %>%
  rename(diff_perm = stat) %>%
  mutate(diff_orig = -0.00783)

# Plot permuted differences, diff_perm
ggplot(homeown_perm, aes(x = diff_perm)) + 
  # Add a density layer
  geom_density() +
  # Add a vline layer with intercept diff_orig
  geom_vline(aes(xintercept = diff_orig), color = "red")


# Compare permuted differences to observed difference
homeown_perm %>%
  summarize(sum(diff_perm <= diff_orig))
## # A tibble: 1 x 1
##   `sum(diff_perm <= diff_orig)`
##                           <int>
## 1                           212

Interpretation

  • 212 permuted differences are more extreme than the observed difference. This represents 21.2% of the null statistics, so you can conclude that the observed difference is consistent with the permuted distribution. In other words, there is no difference in homeownership across gender.

Chapter 2: Completing a randomization test: gender discrimination

In this chapter, you will gain the tools and knowledge to complete a full hypothesis test. That is, given a dataset, you will know whether or not is appropriate to reject the null hypothesis in favor of the research claim of interest.

2.1 Gender discrimination hypotheses

Null hypothesis: gender and promotion are unrelated variables. Alternative hypothesis: : men are more likely to be promoted.

2.2 Summarizing gender discrimination

Summarize the proportions of women who were promoted. Two possible proportions are:

  • the proportion of women who were promoted
  • the proportion of promoted individuals who were women

Do the former.

disc %>%
  # Count the rows by promote and sex
  count(sex, promote)
## # A tibble: 4 x 3
##   sex    promote          n
##   <fct>  <fct>        <int>
## 1 female not_promoted    10
## 2 female promoted        14
## 3 male   not_promoted     3
## 4 male   promoted        21

# Find proportion of each sex who were promoted
disc %>%
  # Group by sex
  group_by(sex) %>%
  # Calculate proportion promoted summary stat
  summarise(promoted_prop = mean(promote == "promoted"))
## # A tibble: 2 x 2
##   sex    promoted_prop
##   <fct>          <dbl>
## 1 female         0.583
## 2 male           0.875

Interpretation

  • The difference in proportions promoted is almost 0.3.

2.3 Step-by-step through the permutation

# Replicate the entire data frame, permuting the promote variable
disc_perm <- disc %>%
  specify(promote ~ sex, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5, type = "permute")

disc_perm %>%
  # Group by replicate
  group_by(replicate) %>%
  # Count per group
  count(promote, sex)
## # A tibble: 20 x 4
## # Groups:   replicate [5]
##    replicate promote      sex        n
##        <int> <fct>        <fct>  <int>
##  1         1 not_promoted female     8
##  2         1 not_promoted male       5
##  3         1 promoted     female    16
##  4         1 promoted     male      19
##  5         2 not_promoted female     6
##  6         2 not_promoted male       7
##  7         2 promoted     female    18
##  8         2 promoted     male      17
##  9         3 not_promoted female     4
## 10         3 not_promoted male       9
## 11         3 promoted     female    20
## 12         3 promoted     male      15
## 13         4 not_promoted female     6
## 14         4 not_promoted male       7
## 15         4 promoted     female    18
## 16         4 promoted     male      17
## 17         5 not_promoted female     5
## 18         5 not_promoted male       8
## 19         5 promoted     female    19
## 20         5 promoted     male      16

disc_perm %>%
  # Calculate difference in proportion, male then female
  calculate(stat = "diff in props", order = c("male", "female")) # male - female
## # A tibble: 5 x 2
##   replicate    stat
##       <int>   <dbl>
## 1         1  0.125 
## 2         2 -0.0417
## 3         3 -0.208 
## 4         4 -0.0417
## 5         5 -0.125

Interpretation

  • Each replicate had slightly different counts of promotion and sex, which led to slightly different statistics being calculated for each replicate.

2.4 Randomizing gender discrimination

# Calculate the observed difference in promotion rate
diff_orig <- disc %>%
  # Group by sex
  group_by(sex) %>%
  # Summarize to calculate fraction promoted
  summarise(prop_prom = mean(promote == "promoted")) %>%
  # Summarize to calculate difference
  summarise(stat = diff(prop_prom)) %>% 
  pull()
    
# See the result
diff_orig # male - female
## [1] 0.2916667

# Create data frame of permuted differences in promotion rates
disc_perm <- disc %>%
  # Specify promote vs. sex
  specify(promote ~ sex, success = "promoted") %>%
  # Set null hypothesis as independence
  hypothesize(null = "independence") %>%
  # Generate 1000 permutations
  generate(reps = 1000, type = "permute") %>%
  # Calculate difference in proportions
  calculate(stat = "diff in props", order = c("male", "female"))
  
# Using permutation data, plot stat
ggplot(disc_perm, aes(x = stat)) + 
  # Add a histogram layer
  geom_histogram(binwidth = 0.01) +
  # Using original data, add a vertical line at stat
  geom_vline(aes(xintercept = diff_orig), color = "red")

2.5 Critical region

It seems as though the statistic—a difference in promotion rates of 0.2917—is on the extreme end of the permutation distribution. That is, there are very few permuted differences which are as extreme as the observed difference.

disc_perm %>% 
  summarize(
    # Find the 0.9 quantile of diff_perm's stat
    q.90 = quantile(stat, p = 0.9),
    # ... and the 0.95 quantile
    q.95 = quantile(stat, p = 0.95),
    # ... and the 0.99 quantile
    q.99 = quantile(stat, p = 0.99)
  )
## # A tibble: 1 x 3
##    q.90  q.95  q.99
##   <dbl> <dbl> <dbl>
## 1 0.208 0.208 0.292

Interpretation

  • You were able to figure out the upper bound of the permuted statistics. Note, different people will define upper differently. Different options are to use the top 10% or the top 5% or the top 1%.

2.6 Two-sided critical region

For the discrimination data, the question at hand is whether or not women were promoted less often than men. However, there are often scenarios where the research question centers around a difference without directionality. For example, you might be interested in whether the rate of promotion for men and women is different.

Two-sided critical region is to test:

  • Null hypothesis: there are no differences in the rate of promotion across gender.
  • Alternative hypothesis in two-sided critical region: there are differences across gender.
  • Alternative hypothesis in one-sided critical region: women were promoted less often than men.

Two-sided critical region is used when the research question centers around a difference without directionality.

# Find the 0.01, 0.05, and 0.10 quantiles of diff_perm
disc_perm %>% 
  summarise(q.01 = quantile(stat, p = 0.01),
            q.05 = quantile(stat, p = 0.05),
            q.10 = quantile(stat, p = 0.10))
## # A tibble: 1 x 3
##     q.01   q.05   q.10
##    <dbl>  <dbl>  <dbl>
## 1 -0.292 -0.208 -0.208

Interpretation

  • Now you’ve seen both the upper and lower values of the permuted statistics. The information will help you know whether the value calculated from the data (the observed statistic) is large or small.

2.7 How does sample size affect results?

Notice that the observed difference of 0.2917 is in the extreme right tail of the permuted differences. If the sample was ten times larger but the sample statistic was exactly the same (i.e. 0.2917), how would the distribution of permuted differences change?

The statistic of 0.2917 would be much farther to the right of the permuted differences (completely off of the distribution).

2.8 Sample size in randomization distribution

Create two new distributions to get a sense for how the differences vary given widely different sample sizes. One of the datasets (disc_small) is one third the size of the original dataset and the other (disc_big) is 10 times larger than the original dataset.

# Tabulate the small dataset
disc_small %>% 
  # Select sex and promote
  count(sex, promote)
## # A tibble: 4 x 3
##   sex    promote          n
##   <fct>  <fct>        <int>
## 1 female not_promoted     3
## 2 female promoted         5
## 3 male   not_promoted     1
## 4 male   promoted         7
  
# Do the same for disc_big
disc_big %>% 
  # Select sex and promote
  count(sex, promote)
## # A tibble: 4 x 3
##   sex    promote          n
##   <fct>  <fct>        <int>
## 1 female not_promoted   100
## 2 female promoted       140
## 3 male   not_promoted    30
## 4 male   promoted       210

# Using disc_perm_small, plot stat
ggplot(disc_perm_small, aes(x = stat)) + 
  # Add a histogram layer with binwidth 0.01
  geom_histogram(binwidth = 0.01) +
  # Add a vline layer, crossing x-axis at diff_orig_small
  geom_vline(aes(xintercept = diff_orig_small), color = "red")


# Swap the dataset to disc_perm_big
ggplot(disc_perm_big, aes(x = stat)) + 
  geom_histogram(binwidth = 0.01) +
  # Change the x-axis intercept to diff_orig_big
  geom_vline(aes(xintercept = diff_orig_big), color = "red")

Interpretation

  • The observed difference is consistent with differences you would see by chance if the sample size was small. The observed difference would virtually never be observed by chance if the sample size was big.

2.9 Sample size for critical region

Using the randomization distributions with the small and big datasets, calculate different cutoffs for significance.

calc_upper_quantiles <- function(dataset) {
  dataset %>% 
    summarize(
      q.90 = quantile(stat, p = 0.90),
      q.95 = quantile(stat, p = 0.95),
      q.99 = quantile(stat, p = 0.99)
    )
}

# Recall the quantiles associated with the original dataset
calc_upper_quantiles(disc_perm)
## # A tibble: 1 x 3
##    q.90  q.95  q.99
##   <dbl> <dbl> <dbl>
## 1 0.208 0.208 0.292

# Calculate the quantiles associated with the small dataset
calc_upper_quantiles(disc_perm_small)
## # A tibble: 1 x 3
##    q.90  q.95  q.99
##   <dbl> <dbl> <dbl>
## 1 0.250 0.250 0.500

# Calculate the quantiles associated with the big dataset
calc_upper_quantiles(disc_perm_big)
## # A tibble: 1 x 3
##     q.90   q.95   q.99
##    <dbl>  <dbl>  <dbl>
## 1 0.0500 0.0667 0.0917

Interpretation

  • Notice how the differences in proportions must be much larger to be significant if the sample size is small. With a big sample size, a small difference in proportions can be significant.

2.10 Calculating the p-values

Calculate the p-value for the original discrimination dataset as well as disc_small and disc_big. Remember that a p-value measures the degree of disagreement between the data and the null hypothesis.

  • Null hypothesis: “There are no differences in the rate of promotion across gender.”
  • Alternative hypothesis: “Are men more likely to be promoted than women?”
library(infer)
# Visualize and calculate the p-value for the original dataset
disc_perm %>%
  visualize(obs_stat = diff_orig, direction = "greater")


disc_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0260

# Visualize and calculate the p-value for the small dataset
disc_perm_small %>%
  visualize(obs_stat = diff_orig_small, direction = "greater")


disc_perm_small %>%
  get_p_value(obs_stat = diff_orig_small, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.298

# Visualize and calculate the p-value for the big dataset
disc_perm_big %>%
  visualize(obs_stat = diff_orig_big, direction = "greater")


disc_perm_big %>%
  get_p_value(obs_stat = diff_orig_big, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1      0.

Interpretation

  • The percentages vary each time you run the program because disc_perm is randomly permuted.
  • A value of 0.02, for example, indicates that only a 2% of the permuted distribution is more extreme than the observed difference. In other wods, it would be highly unlikely to see the observed difference by chance if there was no difference across gender.

2.11 Practice calculating p-values

Consider a situation where there are 24 men, 24 women, and 35 people are still promoted. But in this new scenario, 75% of the men are promoted and 70.8% of the women are promoted. Does the difference in promotion rates still appear to be statistically significant? That is, could this difference in promotion rates have come from random chance?

In the original dataset, 87.5% of the men were promoted and 58.3% of the women were promoted.

# Recall the original data
disc %>% 
  count(sex, promote)
## # A tibble: 4 x 3
##   sex    promote          n
##   <fct>  <fct>        <int>
## 1 female not_promoted    10
## 2 female promoted        14
## 3 male   not_promoted     3
## 4 male   promoted        21

# Tabulate the new data
disc_new %>% 
  count(sex, promote)
## # A tibble: 4 x 3
##   sex    promote          n
##   <fct>  <fct>        <int>
## 1 female not_promoted     7
## 2 female promoted        17
## 3 male   not_promoted     6
## 4 male   promoted        18

# Recall the distribution of the original permuted differences
ggplot(disc_perm, aes(x = stat)) + 
  geom_histogram() +
  geom_vline(aes(xintercept = diff_orig), color = "red")


# Plot the distribution of the new permuted differences
ggplot(disc_perm_new, aes(x = stat)) + 
  geom_histogram() +
  geom_vline(aes(xintercept = diff_orig_new), color = "red")


# Recall the p-value from the original data
disc_perm %>%
  summarize(p_value = mean(diff_orig <= stat))
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0260

# Find the p-value from the new data
disc_perm_new %>%
  summarize(p_value = mean(diff_orig_new <= stat))
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.507

Interpreation

  • Original data: the vertical line is on the far right side of the randomly permuted distribution, which indicates that it would be highly unlikely to see the observed difference in the rate of promotion across gender if there were no gender discrimination in the population.
  • New data: the vertical line is at around the center of the distribution of the permuted data, which indicates that it would be highly likely to see the observed difference in the rate of promotion across gender if there were no gender discrimination. In other words, it likely that there is no gender discrimination.

2.12 Calculating two-sided p-values

  • Null hypothesis: “There are no differences in the rate of promotion across gender.”
  • Alternative hypothesis for a one-tail test: “Are men more likely to be promoted than women?”
  • Alternative hypothesis for a two-tail test: “Is there any difference in promotion rates between men and women?”

When there is no directionality to the alternative hypothesis, the hypothesis and p-value are considered to be two-sided. In a two-sided setting, the p-value is double the one-sided p-value. In this case, a difference like the one observed would occur twice as often (by chance) because sometimes the difference would be positive and sometimes it would be negative.

# Calculate the two-sided p-value
disc_perm %>%
  summarize(p_value = mean(diff_orig <= stat)*2)
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0520

Interpretation

  • Notice, from the calculation, that the two-sided p-value is twice as big as the one-sided p-value. Often, two-sided p-values are preferred as a way of avoiding making false significance claims.

Chapter 3: Hypothesis testing errors: opportunity cost

You will continue learning about hypothesis testing with a new example and the same structure of randomization tests. In this chapter, however, the focus will be on different errors (type I and type II), how they are made, when one is worse than another, and how things like sample size and effect size impact the error rates.

Research questions: whether reminding students of saving money would make it less likely for them to spend money on entertaining video

Students were ramdomly assigned to two different groups.

  • Control group (75 students) presented with two options:
      1. Buy this entertaining video
      1. Not buy
  • Treatment group (75 students) presented with a slightly modified option (B):
      1. Buy this entertaining video
      1. Not buy. Keep the $14.99 for other purchases
  • Null hypothesis: Reminding students will have no impact on their spending decisions
  • Alternative hypothesis: Reminding students will reduce the chance they spend
Do not reject Null Hypothesis Reject Null Hypothesis
Null true Type 1 error
Alt true Type 2 error

3.1 Summarizing opportunity cost (1)

Find the sample statistics (here: proportions) that are needed for the analysis.

# Import data
opportunity <- read.csv("/resources/rstudio/Bus Statistics/data/Foundations of Inference/opportunity.csv")

head(opportunity)
##   decision   group
## 1   buyDVD control
## 2   buyDVD control
## 3   buyDVD control
## 4   buyDVD control
## 5   buyDVD control
## 6   buyDVD control
# Tabulate the data
opportunity %>%
  count(decision, group)
## # A tibble: 4 x 3
##   decision group         n
##   <fct>    <fct>     <int>
## 1 buyDVD   control      56
## 2 buyDVD   treatment    41
## 3 nobuyDVD control      19
## 4 nobuyDVD treatment    34

# Find the proportion who bought the DVD in each group
opportunity %>%
  group_by(group) %>%
  summarize(buy_prop = mean(decision == "buyDVD"))
## # A tibble: 2 x 2
##   group     buy_prop
##   <fct>        <dbl>
## 1 control      0.747
## 2 treatment    0.547

Interpretation

  • The treatment seems like it might have had an effect!
  • About 75% of the control group bought the DVD.
  • About 55% of the treatment group (i.e. The group that was reminded that the money could be saved) bought the DVD.

3.2 Plotting opportunity cost

Create a barplot to visualize the difference in proportions between the treatment and control groups.

# Create a barplot
ggplot(opportunity, aes(x = group, fill = decision)) + 
  geom_bar(position = "fill")

3.3 Randomizing opportunity cost

Permute the data to generate a distribution of differences as if the null hypothesis were true.

Assumptions

  • The number of individuals in each of the control and treatment groups is fixed.
  • The number of individuals who would buy a DVD is also fixed. That is, 97 people were going to buy a DVD regardless of which treatment group they were in. This is reasonable when you assume that the null hypothesis is true—that is, the experiment had no effect on the outcome of buying a DVD.
# Calculate the observed difference in purchase rate
diff_obs <- opportunity %>%
  # Group by group
  group_by(group) %>%
  # Calculate proportion deciding to buy a DVD
  summarise(prop_buy = mean(decision == "buyDVD")) %>%
  # Calculate difference between groups
  summarise(stat = diff(prop_buy)) %>% 
  pull()

# Create data frame of permuted differences in purchase rates
opp_perm <- opportunity %>%
  # Specify decision vs. group, where success is buying a DVD
  specify(decision ~ group, success = "buyDVD") %>%
  # Set the null hypothesis to independence
  hypothesize(null = "independence") %>%
  # Generate 1000 reps of type permute
  generate(reps = 1000, type = "permute") %>%
  # Calculate the summary stat difference in proportions
  calculate(stat = "diff in props", order = c("treatment", "control"))
    
# Review the result
opp_perm
## # A tibble: 1,000 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  0.120 
##  2         2 -0.0400
##  3         3 -0.0133
##  4         4  0.0667
##  5         5  0.0400
##  6         6 -0.173 
##  7         7 -0.0133
##  8         8 -0.0933
##  9         9 -0.0400
## 10        10 -0.120 
## # ... with 990 more rows
  
# Using the permuation data, plot stat
ggplot(opp_perm, aes(x = stat)) + 
  # Add a histogram layer with binwidth 0.005
  geom_histogram(binwidth = 0.005) +
  # Add a vline layer with intercept diff_obs
  geom_vline(aes(xintercept = diff_obs), color = "red")

3.4 Summarizing opportunity cost (2)

Compute the p-value, or the proportion of permuted differences less than or equal to the observed difference. In other words, assess whether the observed difference in proportions is consistent with the null difference.

# Visualize the statistic 
opp_perm %>%
  visualize(obs_stat = diff_orig, direction = "less")


# Calculate the p-value using `get_p_value`
opp_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "less")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0120

# Calculate the p-value using `summarize`
opp_perm %>%
  summarize(p_value = mean(stat <= diff_orig))
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0120

Interpretation

  • There is little chance to see the observed difference if reminding didn’t make a difference in students’ spending behavior.
  • In other words, reminding them causes them to be less likely to buy the DVD.

3.5 Different choice of error rate

Consider again a situation where the task is to differentiate the proportion of successes across two different groups. What decision should be made if the goal is to never make a type II error (false negative)?

  • Always claim there is a difference in proportions.

3.6 p-value for two-sided hypotheses: opportunity costs

Find a two-sided p-value by simply doubling the one sided p-value.

  • For a one-sided alternative hypothesis: the p-value measures the likelihood of data as or more extreme than the observed data, given the null hypothesis is true.
  • For a two-sided alternative hypothesis: the p-value is a two-sided p-value, or two times as large as the one-sided p-value.
# Calculate the two-sided p-value
opp_perm %>%
  summarize(p_value = 2 * mean(stat <= diff_orig))
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0240

Interpretation

  • Note that the p-value is the proportion of permuted statistics that are smaller than (instead of bigger than) the observed value. Look at the histogram of permuted statistics in Section 3.3.

Chapter 4: Confidence intervals

As a complement to hypothesis testing, confidence intervals allow you to estimate a population parameter. Recall that your interest is always in some characteristic of the population, but you only have incomplete information to estimate the parameter using sample data. Here, the parameter is the true proportion of successes in a population. Bootstrapping is used to estimate the variability needed to form the confidence interval.

hypothesis testing versus confidence interval

  • hypothesis testing (question of comparison): Under which diet plan will participant lose more weight on average?
  • confidence interval (question of estimation): How much should participants expect to lose on average?

4.1 Resampling from a sample

Set up two sampling experiments.

  • Experiment 1: Assume the true proportion of people who will vote for Candidate X is 0.6. Repeatedly sample 30 people from the population and measure the variability of p^ (the sample proportion).
  • Experiment 2: Take one sample of size 30 from the same population. Repeatedly sample 30 people (with replacement!) from the original sample and measure the variability of p̂∗ (the resample proportion).

It’s important to realize that the first experiment relies on knowing the population and is typically impossible in practice.

# Compute p-hat for each poll
ex1_props <- all_polls %>% 
  # Group by poll
  group_by(poll) %>% 
  # Calculate proportion of yes votes
  summarise(stat = mean(vote == "yes"))
  
# Review the result
ex1_props
## # A tibble: 1,000 x 2
##     poll  stat
##    <int> <dbl>
##  1     1 0.700
##  2     2 0.667
##  3     3 0.633
##  4     4 0.633
##  5     5 0.400
##  6     6 0.600
##  7     7 0.500
##  8     8 0.533
##  9     9 0.567
## 10    10 0.567
## # ... with 990 more rows

# Select one poll from which to resample
one_poll <- all_polls %>%
  # Filter for the first poll
  filter(poll == 1) %>%
  # Select vote
  select(vote)
  
# Compute p-hat* for each resampled poll
ex2_props <- one_poll %>%
  # Specify vote as the response, where yes means success
  specify(response = vote, success = "yes") %>%
  # Generate 1000 reps of type bootstrap
  generate(reps = 1000, type = "bootstrap") %>% 
  # Calculate the summary stat "prop"
  calculate(stat = "prop")

# Calculate variability of p-hat
ex1_props %>% 
  summarize(variability = sd(stat))
## # A tibble: 1 x 1
##   variability
##         <dbl>
## 1      0.0868
  
# Calculate variability of p-hat*
ex2_props %>% 
  summarize(variability = sd(stat)) # Note that this number will change everytime the code is run b/c the data is resampled from the original sample
## # A tibble: 1 x 1
##   variability
##         <dbl>
## 1      0.0846

Interpretation

  • The variability in the proportion of “successes” in a sample is approximately the same whether we sample from the population or resample from a sample.

4.2 Visualizing the variability of p-hat

# Combine data from both experiments
both_ex_props <- bind_rows(ex1_props, ex2_props, .id = "experiment")
both_ex_props
## # A tibble: 2,000 x 4
##    experiment  poll  stat replicate
##    <chr>      <int> <dbl>     <int>
##  1 1              1 0.700        NA
##  2 1              2 0.667        NA
##  3 1              3 0.633        NA
##  4 1              4 0.633        NA
##  5 1              5 0.400        NA
##  6 1              6 0.600        NA
##  7 1              7 0.500        NA
##  8 1              8 0.533        NA
##  9 1              9 0.567        NA
## 10 1             10 0.567        NA
## # ... with 1,990 more rows

# Using both_ex_props, plot stat colored by experiment
ggplot(both_ex_props, aes(stat, color = experiment)) + 
  # Add a density layer with bandwidth 0.1
  geom_density(bw = 0.1)

Interpretation

  • Resamples with replacement are an excellent model for the process of taking the original sample from the population. Remember, in research problems, you don’t have an ability to take more than one original sample, but you can take as many resamples as you like.
  • Bootstraping (resampling from an original sample with replacement) the same number of observations as in the original sample = sampling repeatedly from the pouplation in providing the standard error
  • The standard error measures the variability of statistic around the parameter.

4.4 Empirical Rule

If the variability of the sample proportion (called the standard error, or SE) is known, then approximately 95% of p^ values (from different samples) will be within 2SE of the true population proportion.

Check whether that holds in the polls generated by taking many samples from the same population.

# Proportion of yes votes by poll
props <- all_polls %>% 
  group_by(poll) %>% 
  summarize(prop_yes = mean(vote == "yes"))

# The true population proportion of yes votes
true_prop_yes <- 0.6

# Proportion of polls within 2SE
props %>%
  # Add column: is prop_yes in 2SE of 0.6
  mutate(is_in_conf_int = abs(prop_yes - true_prop_yes) < 2 * sd(prop_yes)) %>%
  # Calculate  proportion in conf int
  summarize(prop_in_conf_int = mean(is_in_conf_int))
## # A tibble: 1 x 1
##   prop_in_conf_int
##              <dbl>
## 1            0.966

Interpretation

  • It looks like 96.6% are within 2 standard errors of the true population parameter.

4.5 Bootstrap t-confidence interval

The previous exercises told you two things:

  • You can measure the variability associated with p^ by resampling from the original sample.
  • Once you know the variability of p^, you can use it as a way to measure how far away the true proportion is.
  • In other words, the distance between the original sample p^ and the resampled (or bootstrapped) p^ values gives a measure for how far the original p^ is from the true population proportion.
# From previous exercises
one_poll <- all_polls %>%
  filter(poll == 1) %>%
  select(vote)
one_poll_boot <- one_poll %>%
  specify(response = vote, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "prop")
  
p_hat <- one_poll %>%
  # Calculate proportion of yes votes
  summarize(stat = mean(vote == "yes")) %>%
  pull()

# Create an interval of plausible values
one_poll_boot %>%
  summarize(
    # Lower bound is p_hat minus 2 std errs
    lower = p_hat - 2 * sd(stat),
    # Upper bound is p_hat plus 2 std errs
    upper = p_hat + 2 * sd(stat)
  )
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.532 0.868

Interpretation

  • Remember that a confidence level describes how likely you are to have gotten a sample that was close enough to the true parameter. Indeed, with the sample at hand, the confidence interval of plausible values does contain the true population parameter of 0.6. Nice job!

4.6 Bootstrap percentile interval

There is the second method of constructing bootstrap intervals. Instead of using ±2SE as a way to measure the middle 95% of the sampled p^ values, you can find the middle of the resampled p^ values by removing the upper and lower 2.5%.

# From previous exercise: bootstrap t-confidence interval
one_poll_boot %>%
  summarize(
    lower = p_hat - 2 * sd(stat),
    upper = p_hat + 2 * sd(stat)
  )
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.532 0.868
  
# Manually calculate a 95% percentile interval
one_poll_boot %>%
  summarize(
    lower = quantile(stat, p = 0.025),
    upper = quantile(stat, p = 0.975)
  )
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.533 0.867

# Calculate the same interval, more conveniently
percentile_ci <- one_poll_boot %>% 
  get_confidence_interval(level = 0.95)
  
# Review the value
percentile_ci
## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1  0.533   0.867

# From previous step
percentile_ci <- one_poll_boot %>% 
  get_confidence_interval(level = 0.95)
  
one_poll_boot %>% 
  # Visualize in-between the endpoints given by percentile_ci
  visualize(endpoints = percentile_ci,
  direction = "between")

Interpretation

  • Note that the two intervals were created using different methods. Because the methods are different, the intervals are expected to be a bit different as well. In the long run, however, the intervals should provide the same information.

4.7 Sample size effects on bootstrap CIs

There are two things that affect the width of the confidence interval:

  1. sample size
  2. true parameter

Use the incorrect standard error (based on the incorrect sample size) to create a confidence interval. The idea is that when the standard error is off, the interval is not particularly useful, nor is it correct.

Note that if you resampled the data with the wrong size (e.g. 300 or 3 instead of 30), the standard error of the sample proportions was off.

calc_t_conf_int <- function(resampled_dataset) {
  resampled_dataset %>%
    summarize(
      lower = p_hat - 2 * sd(stat),
      upper = p_hat + 2 * sd(stat)
    )
}

# Find the bootstrap t-confidence interval for 30 resamples
calc_t_conf_int(one_poll_boot)
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.532 0.868

# ... and for 300 resamples
calc_t_conf_int(one_poll_boot_300)
##       lower     upper
## 1 0.6468069 0.7531931

# ... and for 3 resamples
calc_t_conf_int(one_poll_boot_3)
##       lower    upper
## 1 0.1613891 1.238611

Interpretation

  • Notice how the resampled interval with size 300 was way too small and the resampled interval with size 3 was way too big.

4.8 Sample proportion value effects on bootstrap CIs

One additional element that changes the width of the confidence interval is the true parameter value.

When the true parameter is close to 0.5, the standard error of p^ is larger than when the true parameter is closer to 0 or 1. When calculating a bootstrap t-confidence interval, the standard error controls the width of the CI, and here the width will be narrower.

calc_p_hat <- function(dataset) {
  dataset %>%
    summarize(stat = mean(vote == "yes")) %>%
    pull()
}
calc_t_conf_int <- function(resampled_dataset, p_hat) {
  resampled_dataset %>%
    summarize(
      lower = p_hat - 2 * sd(stat),
      upper = p_hat + 2 * sd(stat)
    )
}

# Find proportion of yes votes from original population
p_hat <- calc_p_hat(one_poll)

# Review the value
p_hat  
## [1] 0.7

# Calculate bootstrap t-confidence interval (original 0.6 param)
calc_t_conf_int(one_poll_boot, p_hat)
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.532 0.868

# Find proportion of yes votes from new population
p_hat_0.8 <- calc_p_hat(one_poll_0.8)
  
# Review the value
p_hat_0.8  
## [1] 0.8333333
  
# Calculate the bootstrap t-confidence interval (new 0.8 param)
calc_t_conf_int(one_poll_boot_0.8, p_hat_0.8)
## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1 0.695 0.972

Interpretation

  • Note that it’s narrower than previously calculated.

4.9 Percentile effects on bootstrap CIs

Most scientists use 95% intervals to quantify their uncertainty about an estimate. That is, they understand that over a lifetime of creating confidence intervals, only 95% of them will actually contain the parameter that they set out to estimate.

# Calculate a 95% bootstrap percentile interval
one_poll_boot %>% 
  get_confidence_interval(level = 0.95) 
## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1  0.533   0.867

# Calculate a 99% bootstrap percentile interval
one_poll_boot %>% 
  get_confidence_interval(level = 0.99) 
## # A tibble: 1 x 2
##   `0.5%` `99.5%`
##    <dbl>   <dbl>
## 1  0.467   0.900

# Calculate a 90% bootstrap percentile interval
one_poll_boot %>% 
  get_confidence_interval(level = 0.9) 
## # A tibble: 1 x 2
##    `5%` `95%`
##   <dbl> <dbl>
## 1 0.567 0.833

# Plot ci_endpoints vs. ci_percent to compare the intervals
ggplot(conf_int_data, aes(ci_percent, ci_endpoints)) +
  # Add a line layer
  geom_line()