Ch. 1 - Introduction to ideas of inference

Welcome to the course!

[Video]

Vocabulary:

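  • Null Hypothesis (H0): The status quo claim, usually that there is no effect or no difference (the "not interesting" claim)
  • Alternative Hypothesis (H1): The claim corresponding to the research hypothesis

The goal is to find evidence against the null hypothesis, i.e. to show that the observed data would be unlikely if H0 were true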

Hypotheses (1)

Suppose a pharmaceutical company is trying to get FDA approval for a new diabetes treatment called drug A. Currently, most doctors prescribe drug B to treat diabetes.

Which would be a good null hypothesis for the FDA statistician examining the drug company’s data?

  • Drug A is better than drug B at treating diabetes.
  • Drug A is worse than drug B at treating diabetes.
  • Drug A is different than drug B at treating diabetes (but you don’t know if it is better or worse).
  • [*] Drug A is the same as drug B at treating diabetes.

Hypotheses (2)

Consider the same situation as in the last exercise. A pharmaceutical company is trying to pass drug A for diabetes through the FDA, but most doctors currently prescribe drug B.

Which would be a good alternative hypothesis?

  • [*] Drug A is better than drug B at treating diabetes.
  • Drug A is worse than drug B at treating diabetes.
  • Drug A is different than drug B at treating diabetes (but you don’t know if it is better or worse).
  • Drug A is the same as drug B at treating diabetes.
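
The two chosen answers can also be written symbolically. This is only a sketch: the symbols p_A and p_B are an assumption not found in the course text, standing for the proportion of patients successfully treated with drug A and drug B respectively.

```latex
H_0 : p_A = p_B \qquad \text{versus} \qquad H_1 : p_A > p_B
```

The alternative is one-sided because the company's research claim is specifically that drug A is better, not merely different.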

Randomized distributions

[Video]

One random permutation

# soda %>%
#   group_by(location) %>%
#   summarize(prop_cola = mean(drink == "cola")) %>%
#   summarize(diff(prop_cola))
# 
# library(infer)
# soda %>%
#   specify(drink ~ location, success = "cola") %>%
#   hypothesize(null = "independence") %>%
#   generate(reps = 1, type = "permute") %>%
#   calculate(stat = "diff in props", order = c("west", "east"))
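
A single permutation can also be sketched in base R, which may help show what the infer pipeline above is doing. The `soda` data frame below is a hypothetical stand-in with made-up counts, and `diff_in_props()` is a helper named here for illustration only.

```r
set.seed(470)

# Hypothetical stand-in for the `soda` data; the counts are made up
soda <- data.frame(
  location = rep(c("east", "west"), each = 50),
  drink    = rep(c("cola", "orange", "cola", "orange"),
                 times = c(28, 22, 19, 31))
)

# Illustrative helper: difference in cola proportions, west minus east
diff_in_props <- function(drink, location) {
  mean(drink[location == "west"] == "cola") -
    mean(drink[location == "east"] == "cola")
}

# Observed statistic on the original pairing of drink and location
obs_diff <- diff_in_props(soda$drink, soda$location)

# One permutation: shuffle the drink labels, keep locations fixed
perm_diff <- diff_in_props(sample(soda$drink), soda$location)

obs_diff
perm_diff
```

Shuffling `drink` breaks any link between drink and location, which is what the null hypothesis of independence asserts.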

Many random permutations

# soda %>%
#   specify(drink ~ location, success = "cola") %>%
#   hypothesize(null = "independence") %>%
#   generate(reps = 5, type = "permute") %>%
#   calculate(stat = "diff in props", order = c("west", "east"))
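
Repeating the shuffle gives one statistic per replicate. The following base R sketch mirrors `reps = 5` with `replicate()`; the data are hypothetical and `one_perm()` is an illustrative helper, not part of infer.

```r
set.seed(470)

# Hypothetical stand-in for the `soda` data; the counts are made up
soda <- data.frame(
  location = rep(c("east", "west"), each = 50),
  drink    = rep(c("cola", "orange", "cola", "orange"),
                 times = c(28, 22, 19, 31))
)

# Illustrative helper: one permuted difference in cola proportions (west - east)
one_perm <- function(data) {
  shuffled <- sample(data$drink)
  mean(shuffled[data$location == "west"] == "cola") -
    mean(shuffled[data$location == "east"] == "cola")
}

# Five independent permutations, one statistic each
stats <- replicate(5, one_perm(soda))
perm_stats <- data.frame(replicate = 1:5, stat = stats)
perm_stats
```

Each row is a difference that could arise by chance alone if drink choice were unrelated to location.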

Random distribution

Working with the NHANES data

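# Load packages
library(ggplot2)
library(NHANES)
library(dplyr)  # %>%, select(), filter(), group_by(), summarize()
library(infer)  # specify(), hypothesize(), generate(), calculate()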


# What are the variables in the NHANES dataset?
colnames(NHANES)
##  [1] "ID"               "SurveyYr"         "Gender"           "Age"             
##  [5] "AgeDecade"        "AgeMonths"        "Race1"            "Race3"           
##  [9] "Education"        "MaritalStatus"    "HHIncome"         "HHIncomeMid"     
## [13] "Poverty"          "HomeRooms"        "HomeOwn"          "Work"            
## [17] "Weight"           "Length"           "HeadCirc"         "Height"          
## [21] "BMI"              "BMICatUnder20yrs" "BMI_WHO"          "Pulse"           
## [25] "BPSysAve"         "BPDiaAve"         "BPSys1"           "BPDia1"          
## [29] "BPSys2"           "BPDia2"           "BPSys3"           "BPDia3"          
## [33] "Testosterone"     "DirectChol"       "TotChol"          "UrineVol1"       
## [37] "UrineFlow1"       "UrineVol2"        "UrineFlow2"       "Diabetes"        
## [41] "DiabetesAge"      "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad" 
## [45] "LittleInterest"   "Depressed"        "nPregnancies"     "nBabies"         
## [49] "Age1stBaby"       "SleepHrsNight"    "SleepTrouble"     "PhysActive"      
## [53] "PhysActiveDays"   "TVHrsDay"         "CompHrsDay"       "TVHrsDayChild"   
## [57] "CompHrsDayChild"  "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"     
## [61] "SmokeNow"         "Smoke100"         "Smoke100n"        "SmokeAge"        
## [65] "Marijuana"        "AgeFirstMarij"    "RegularMarij"     "AgeRegMarij"     
## [69] "HardDrugs"        "SexEver"          "SexAge"           "SexNumPartnLife" 
## [73] "SexNumPartYear"   "SameSex"          "SexOrientation"   "PregnantNow"
# # Create bar plot for Home Ownership by Gender
# ggplot(NHANES, aes(x = Gender, fill = HomeOwn)) + 
#   # Set the position to fill
#   geom_bar(position = "fill") +
#   ylab("Relative frequencies")
# 
# # Density plot of SleepHrsNight colored by SleepTrouble
# ggplot(NHANES, aes(x = SleepHrsNight, color = SleepTrouble)) + 
#   # Adjust by 2
#   geom_density(adjust = 2) + 
#   # Facet by HealthGen
#   facet_wrap(~ HealthGen)

Calculating statistic of interest

# homes <- NHANES %>%
#   # Select Gender and HomeOwn
#   select(Gender, HomeOwn) %>%
#   # Filter for HomeOwn equal to "Own" or "Rent"
#   filter(HomeOwn %in% c("Own", "Rent"))
# 
# diff_orig <- homes %>%   
#   # Group by gender
#   group_by(Gender) %>%
#   # Summarize proportion of homeowners
#   summarize(prop_own = mean(HomeOwn == "Own")) %>%
#   # Summarize difference in proportion of homeowners
#   summarize(obs_diff_prop = diff(prop_own)) # male - female
#   
# # See the result
# diff_orig
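
The `# male - female` comment above depends on group order: groups come out in factor-level (alphabetical) order, and `diff()` subtracts the earlier value from the later one. A small base R sketch on hypothetical data (the counts are made up) illustrates this.

```r
# Hypothetical stand-in for `homes`; the counts are made up
homes <- data.frame(
  Gender  = factor(rep(c("female", "male"), each = 10)),
  HomeOwn = rep(c("Own", "Rent", "Own", "Rent"), times = c(7, 3, 6, 4))
)

# Proportion of owners per gender, in factor-level order (female, male)
prop_own <- tapply(homes$HomeOwn == "Own", homes$Gender, mean)
prop_own

# diff() returns later minus earlier, i.e. male - female
obs_diff_prop <- diff(prop_own)
obs_diff_prop
```

Here the female proportion is 0.7 and the male proportion is 0.6, so the difference is negative. Reversing the subtraction order flips the sign, so the same order should be used for the observed and permuted statistics.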

Randomized data under null model of independence

# # Specify variables
# homeown_perm <- homes %>%
#   specify(HomeOwn ~ Gender, success = "Own")
# 
# # Print results to console
# homeown_perm
# 
# # Hypothesize independence
# homeown_perm <- homes %>%
#   specify(HomeOwn ~ Gender, success = "Own") %>%
#   hypothesize(null = "independence")  
# 
# # Print results to console
# homeown_perm
# 
# # Perform 10 permutations
# homeown_perm <- homes %>%
#   specify(HomeOwn ~ Gender, success = "Own") %>%
#   hypothesize(null = "independence") %>% 
#   generate(reps = 10, type = "permute") 
# 
# # Print results to console
# homeown_perm
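
To see what a permutation under independence does to the data, the base R sketch below shuffles the response column of a hypothetical `homes` data frame (made-up counts) and compares tables before and after.

```r
set.seed(1)

# Hypothetical stand-in for `homes`; the counts are made up
homes <- data.frame(
  Gender  = rep(c("female", "male"), each = 10),
  HomeOwn = rep(c("Own", "Rent", "Own", "Rent"), times = c(7, 3, 6, 4))
)

# Permute the response only; the explanatory column is left untouched
homes_perm <- transform(homes, HomeOwn = sample(HomeOwn))

# The joint counts may change from one shuffle to the next...
table(homes$Gender, homes$HomeOwn)
table(homes_perm$Gender, homes_perm$HomeOwn)

# ...but the overall numbers of owners and renters stay the same
table(homes$HomeOwn)
table(homes_perm$HomeOwn)
```

Because only the pairing between the two columns is shuffled, each permuted data set has the same number of owners, renters, females, and males as the original.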

Randomized statistics and dotplot

# # Perform 100 permutations
# homeown_perm <- homes %>%
#   specify(HomeOwn ~ Gender, success = "Own") %>%
#   hypothesize(null = "independence") %>% 
#   generate(reps = 100, type = "permute") %>% 
#   calculate(stat = "diff in props", order = c("male", "female"))
#   
# # Print results to console
# homeown_perm
# 
# # Dotplot of 100 permuted differences in proportions
# ggplot(homeown_perm, aes(x = stat)) + 
#   geom_dotplot(binwidth = 0.001)

Randomization density

# # Perform 1000 permutations
# homeown_perm <- homes %>%
#   # Specify HomeOwn vs. Gender, with "Own" as success
#   specify(HomeOwn ~ Gender, success = "Own") %>%
#   # Use a null hypothesis of independence
#   hypothesize(null = "independence") %>% 
#   # Generate 1000 repetitions (by permutation)
#   generate(reps = 1000, type = "permute") %>% 
#   # Calculate the difference in proportions (male then female)
#   calculate(stat = "diff in props", order = c("male", "female"))
# 
# # Density plot of 1000 permuted differences in proportions
# ggplot(homeown_perm, aes(x = stat)) + 
#   geom_density()

Using the randomization distribution

[Video]

# set.seed(470)
# soda_perm <- soda %>%
#   rep_sample_n(size = nrow(soda), reps = 100) %>%
#   mutate(drink_perm = sample(drink)) %>%
#   group_by(replicate, location) %>%
#   summarize(prop_cola_perm = mean(drink_perm == "cola"),
#             prop_cola = mean(drink == "cola")) %>%
#   summarize(diff_perm = diff(prop_cola_perm),
#             diff_orig = diff(prop_cola))
# 
# soda_perm %>%
#   summarize(count = sum(diff_orig >= diff_perm),
#             proportion = mean(diff_orig >= diff_perm))
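
The final comparison can be sketched in base R as well. The data are hypothetical (made-up counts), `diff_in_props()` is an illustrative helper, and 1000 replicates are used here instead of 100 purely to smooth the estimate.

```r
set.seed(470)

# Hypothetical stand-in for the `soda` data; the counts are made up
soda <- data.frame(
  location = rep(c("east", "west"), each = 50),
  drink    = rep(c("cola", "orange", "cola", "orange"),
                 times = c(28, 22, 19, 31))
)

# Illustrative helper: difference in cola proportions, west minus east
diff_in_props <- function(drink, location) {
  mean(drink[location == "west"] == "cola") -
    mean(drink[location == "east"] == "cola")
}

# Observed difference and 1000 permuted differences
diff_orig <- diff_in_props(soda$drink, soda$location)
diff_perm <- replicate(1000, diff_in_props(sample(soda$drink), soda$location))

# How many permuted differences are at or below the observed one?
count      <- sum(diff_perm <= diff_orig)
proportion <- mean(diff_perm <= diff_orig)
c(count = count, proportion = proportion)
```

The proportion is a one-sided (lower-tail) summary: a small value suggests the observed difference would be unusual if drink and location were independent.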

Do the data come from the population?

# # Plot permuted differences; calculate() stores them in the `stat` column
# ggplot(homeown_perm, aes(x = stat)) + 
#   # Add a density layer
#   geom_density() +
#   # Add a vline layer at the observed difference (diff_orig is a one-row tibble)
#   geom_vline(xintercept = diff_orig$obs_diff_prop, color = "red")
# 
# # Compare permuted differences to observed difference
# homeown_perm %>%
#   summarize(n_perm_le_obs = sum(stat <= diff_orig$obs_diff_prop))

What can you conclude?

What can you conclude from the analysis?

  • We have learned that being female causes people to buy houses.
  • [*] We have learned that our data is consistent with the hypothesis of no difference in home ownership across gender.
  • We have learned that the observed difference (from the data) in proportion of home ownership across gender is due to something other than random variation.

Study conclusions

[Video]


Ch. 2 - Completing a randomization test: gender discrimination

Example: gender discrimination

[Video]

Gender discrimination hypotheses

Summarizing gender discrimination

Step-by-step through the permutation

Randomizing gender discrimination

Distribution of statistics

Reflecting on analysis

Critical region

Two-sided critical region

Why 0.05?

How does sample size affect results?

Sample size in randomization distribution

Sample size for critical region

What is a p-value?

Calculating the p-values

Practice calculating p-values

Calculating two-sided p-values

Summary of gender discrimination


Ch. 3 - Hypothesis testing errors: opportunity cost

Example: opportunity cost

Summarizing opportunity cost (1)

Plotting opportunity cost

Randomizing opportunity cost

Summarizing opportunity cost (2)

Opportunity cost conclusion

Errors and their consequences

Different choice of error rate

Errors for two-sided hypotheses

p-value for two-sided hypotheses: opportunity costs

Summary of opportunity costs


Ch. 4 - Confidence intervals

Parameters and confidence intervals

What is the parameter?

Hypothesis test or confidence interval?

Bootstrapping

Resampling from a sample

Visualizing the variability of p-hat

Always resample the original number of observations

Variability in p-hat

Empirical Rule

Bootstrap t-confidence interval

Bootstrap percentile interval

Interpreting CIs and technical conditions

Sample size effects on bootstrap CIs

Sample proportion value effects on bootstrap CIs

Percentile effects on bootstrap CIs

Summary of statistical inference


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470