Vocabulary:
The goal of a hypothesis test is to find enough evidence in the data to reject the null hypothesis.
Suppose a pharmaceutical company is trying to get FDA approval for a new diabetes treatment called drug A. Currently, most doctors prescribe drug B to treat diabetes.
Which would be a good null hypothesis for the FDA statistician examining the drug company’s data?
Consider the same situation as in the last exercise. A pharmaceutical company is trying to pass drug A for diabetes through the FDA, but most doctors currently prescribe drug B.
Which would be a good alternative hypothesis?
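One way to write the two competing claims symbolically is sketched below. It assumes (the exercise does not say this) that each drug is judged by a single success rate, with the hypothetical symbols p_A and p_B standing for the proportion of patients successfully treated by drug A and drug B. The null hypothesis is the "nothing new" claim, and the alternative is the claim the drug company hopes to support.

```latex
H_0 : p_A = p_B
\qquad \text{versus} \qquad
H_A : p_A > p_B
```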
One random permutation
# soda %>%
# group_by(location) %>%
# summarize(prop_cola = mean(drink == "cola")) %>%
# summarize(diff(prop_cola))
#
# library(infer)
# soda %>%
# specify(drink ~ location, success = "cola") %>%
# hypothesize(null = "independence") %>%
# generate(reps = 1, type = "permute") %>%
# calculate(stat = "diff in props", order = c("west", "east"))
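The `soda` data is not shown in these notes, so the following is only a minimal base R sketch of what a single permutation does, under stated assumptions: the data frame and its counts are made up for illustration, and `diff_in_props()` is a hypothetical helper, not part of `infer`. Shuffling the `drink` column breaks any link between drink and location, which is what the independence null hypothesis claims.

```r
# Hypothetical stand-in for `soda`: 50 people per coast (made-up counts)
soda <- data.frame(
  location = rep(c("east", "west"), each = 50),
  drink = c(rep(c("cola", "orange"), times = c(28, 22)),  # east
            rep(c("cola", "orange"), times = c(19, 31)))  # west
)

# Hypothetical helper: difference in cola proportions, west minus east
diff_in_props <- function(drink, location) {
  props <- tapply(drink == "cola", location, mean)
  as.numeric(props["west"] - props["east"])
}

# Observed difference in the original (unshuffled) data
obs_diff <- diff_in_props(soda$drink, soda$location)

# One random permutation: shuffle the drink labels, keep locations fixed,
# then recompute the same statistic
set.seed(470)
drink_perm <- sample(soda$drink)
perm_diff <- diff_in_props(drink_perm, soda$location)

print(c(observed = obs_diff, permuted = perm_diff))
```

The shuffle keeps the overall number of cola drinkers unchanged; only their assignment to locations varies.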
Many random permutations
# soda %>%
# specify(drink ~ location, success = "cola") %>%
# hypothesize(null = "independence") %>%
# generate(reps = 5, type = "permute") %>%
# calculate(stat = "diff in props", order = c("west", "east"))
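Repeating the shuffle several times gives several permuted differences, one per repetition. The sketch below does this in base R with `replicate()`, again on made-up counts standing in for the `soda` data; `diff_in_props()` is a hypothetical helper introduced only for this illustration.

```r
# Made-up data standing in for `soda` (the real data is not shown here)
location <- rep(c("east", "west"), each = 50)
drink <- c(rep(c("cola", "orange"), times = c(28, 22)),  # east
           rep(c("cola", "orange"), times = c(19, 31)))  # west

# Hypothetical helper: difference in cola proportions, west minus east
diff_in_props <- function(drink, location) {
  props <- tapply(drink == "cola", location, mean)
  as.numeric(props["west"] - props["east"])
}

# Five permutations: each call to sample() reshuffles the drink labels
set.seed(470)
perm_diffs <- replicate(5, diff_in_props(sample(drink), location))

print(perm_diffs)
```

Each value is one draw from the distribution of the statistic when drink and location are unrelated; increasing the number of repetitions fills in that distribution.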
Random distribution
# Load packages
library(ggplot2)
library(NHANES)
# What are the variables in the NHANES dataset?
colnames(NHANES)
## [1] "ID" "SurveyYr" "Gender" "Age"
## [5] "AgeDecade" "AgeMonths" "Race1" "Race3"
## [9] "Education" "MaritalStatus" "HHIncome" "HHIncomeMid"
## [13] "Poverty" "HomeRooms" "HomeOwn" "Work"
## [17] "Weight" "Length" "HeadCirc" "Height"
## [21] "BMI" "BMICatUnder20yrs" "BMI_WHO" "Pulse"
## [25] "BPSysAve" "BPDiaAve" "BPSys1" "BPDia1"
## [29] "BPSys2" "BPDia2" "BPSys3" "BPDia3"
## [33] "Testosterone" "DirectChol" "TotChol" "UrineVol1"
## [37] "UrineFlow1" "UrineVol2" "UrineFlow2" "Diabetes"
## [41] "DiabetesAge" "HealthGen" "DaysPhysHlthBad" "DaysMentHlthBad"
## [45] "LittleInterest" "Depressed" "nPregnancies" "nBabies"
## [49] "Age1stBaby" "SleepHrsNight" "SleepTrouble" "PhysActive"
## [53] "PhysActiveDays" "TVHrsDay" "CompHrsDay" "TVHrsDayChild"
## [57] "CompHrsDayChild" "Alcohol12PlusYr" "AlcoholDay" "AlcoholYear"
## [61] "SmokeNow" "Smoke100" "Smoke100n" "SmokeAge"
## [65] "Marijuana" "AgeFirstMarij" "RegularMarij" "AgeRegMarij"
## [69] "HardDrugs" "SexEver" "SexAge" "SexNumPartnLife"
## [73] "SexNumPartYear" "SameSex" "SexOrientation" "PregnantNow"
# # Create bar plot for Home Ownership by Gender
# ggplot(NHANES, aes(x = Gender, fill = HomeOwn)) +
# # Set the position to fill
# geom_bar(position = "fill") +
# ylab("Relative frequencies")
# # Density plot of SleepHrsNight colored by SleepTrouble
# ggplot(NHANES, aes(x = SleepHrsNight, color = SleepTrouble)) +
# # Adjust by 2
# geom_density(adjust = 2) +
# # Facet by HealthGen
# facet_wrap(~ HealthGen)
# homes <- NHANES %>%
# # Select Gender and HomeOwn
# select(Gender, HomeOwn) %>%
# # Filter for HomeOwn equal to "Own" or "Rent"
# filter(HomeOwn %in% c("Own", "Rent"))
# diff_orig <- homes %>%
# # Group by gender
# group_by(Gender) %>%
# # Summarize proportion of homeowners
# summarize(prop_own = mean(HomeOwn == "Own")) %>%
# # Summarize difference in proportion of homeowners
# summarize(obs_diff_prop = diff(prop_own)) # male - female
#
# # See the result
# diff_orig
# # Specify variables
# homeown_perm <- homes %>%
# specify(HomeOwn ~ Gender, success = "Own")
#
# # Print results to console
# homeown_perm
# # Hypothesize independence
# homeown_perm <- homes %>%
# specify(HomeOwn ~ Gender, success = "Own") %>%
# hypothesize(null = "independence")
#
# # Print results to console
# homeown_perm
# # Perform 10 permutations
# homeown_perm <- homes %>%
# specify(HomeOwn ~ Gender, success = "Own") %>%
# hypothesize(null = "independence") %>%
# generate(reps = 10, type = "permute")
#
# # Print results to console
# homeown_perm
# # Perform 100 permutations
# homeown_perm <- homes %>%
# specify(HomeOwn ~ Gender, success = "Own") %>%
# hypothesize(null = "independence") %>%
# generate(reps = 100, type = "permute") %>%
# calculate(stat = "diff in props", order = c("male", "female"))
#
# # Print results to console
# homeown_perm
# # Perform 100 permutations
# homeown_perm <- homes %>%
# specify(HomeOwn ~ Gender, success = "Own") %>%
# hypothesize(null = "independence") %>%
# generate(reps = 100, type = "permute") %>%
# calculate(stat = "diff in props", order = c("male", "female"))
#
# # Dotplot of 100 permuted differences in proportions
# ggplot(homeown_perm, aes(x = stat)) +
# geom_dotplot(binwidth = 0.001)
# # Perform 1000 permutations
# homeown_perm <- homes %>%
# # Specify HomeOwn vs. Gender, with "Own" as success
# specify(HomeOwn ~ Gender, success = "Own") %>%
# # Use a null hypothesis of independence
# hypothesize(null = "independence") %>%
# # Generate 1000 repetitions (by permutation)
# generate(reps = 1000, type = "permute") %>%
# # Calculate the difference in proportions (male then female)
# calculate(stat = "diff in props", order = c("male", "female"))
#
# # Density plot of 1000 permuted differences in proportions
# ggplot(homeown_perm, aes(x = stat)) +
# geom_density()
# set.seed(470)
# soda_perm <- soda %>%
# rep_sample_n(size = nrow(soda), reps = 100) %>%
# mutate(drink_perm = sample(drink)) %>%
# group_by(replicate, location) %>%
# summarize(prop_cola_perm = mean(drink_perm == "cola"),
# prop_cola = mean(drink == "cola")) %>%
# summarize(diff_perm = diff(prop_cola_perm),
# diff_orig = diff(prop_cola))
#
# soda_perm %>%
# summarize(count = sum(diff_orig >= diff_perm),
# proportion = mean(diff_orig >= diff_perm))
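The count and proportion above measure how often a permuted difference is at least as small as the observed one. A minimal base R sketch of the same comparison follows, assuming made-up counts in place of the `soda` data and a hypothetical helper `diff_in_props()`. The resulting proportion is a one-sided (lower-tail) estimate of how unusual the observed difference would be if drink and location were independent.

```r
# Made-up data standing in for `soda` (the real data is not shown here)
location <- rep(c("east", "west"), each = 50)
drink <- c(rep(c("cola", "orange"), times = c(28, 22)),  # east
           rep(c("cola", "orange"), times = c(19, 31)))  # west

# Hypothetical helper: difference in cola proportions, west minus east
diff_in_props <- function(drink, location) {
  props <- tapply(drink == "cola", location, mean)
  as.numeric(props["west"] - props["east"])
}

# Observed statistic and 1000 permuted statistics
diff_orig <- diff_in_props(drink, location)
set.seed(470)
diff_perm <- replicate(1000, diff_in_props(sample(drink), location))

# How many permuted differences are less than or equal to the observed one
count <- sum(diff_perm <= diff_orig)
proportion <- mean(diff_perm <= diff_orig)

print(c(count = count, proportion = proportion))
```

A small proportion suggests the observed difference would be rare under independence; a large one suggests the data are consistent with the null hypothesis.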
# # Plot permuted differences, stored in the stat column by calculate()
# ggplot(homeown_perm, aes(x = stat)) +
# # Add a density layer
# geom_density() +
# # Add a vline layer at the observed difference from diff_orig
# geom_vline(xintercept = diff_orig$obs_diff_prop, color = "red")
#
# # Compare permuted differences to observed difference
# homeown_perm %>%
# summarize(n_perm_le_obs = sum(stat <= diff_orig$obs_diff_prop))
What can you conclude from the analysis?
Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.
Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.
LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470