Suppose that your boss is interested in the relationship between diabetes and insomnia. Are those who have trouble sleeping more likely to be diabetic than those who don’t?
Use the NHANES data set in the NHANES R package. You may find more about the data set in the NHANES package’s manual.
Definition of variables
Null Hypothesis: There is no link between diabetes and insomnia. Alternative Hypothesis: Those who have trouble sleeping are more likely to be diabetic than those who do not.
# Load packages
library(NHANES)
library(tidyverse)
library(infer)
NHANES
## # A tibble: 10,000 x 76
## ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
## <int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
## 1 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch~
## 2 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch~
## 3 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch~
## 4 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA>
## 5 51630 2009_10 female 49 " 40-49" 596 White <NA> Some Col~
## 6 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA>
## 7 51646 2009_10 male 8 " 0-9" 101 White <NA> <NA>
## 8 51647 2009_10 female 45 " 40-49" 541 White <NA> College ~
## 9 51647 2009_10 female 45 " 40-49" 541 White <NA> College ~
## 10 51647 2009_10 female 45 " 40-49" 541 White <NA> College ~
## # ... with 9,990 more rows, and 67 more variables: MaritalStatus <fct>,
## # HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
## # HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
## # Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
## # Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>,
## # BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## # Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>,
## # UrineFlow1 <dbl>, UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>,
## # DiabetesAge <int>, HealthGen <fct>, DaysPhysHlthBad <int>,
## # DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>,
## # nPregnancies <int>, nBabies <int>, Age1stBaby <int>,
## # SleepHrsNight <int>, SleepTrouble <fct>, PhysActive <fct>,
## # PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
## # TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## # AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
## # Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
## # RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
## # SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>,
## # SameSex <fct>, SexOrientation <fct>, PregnantNow <fct>
The first observation is not a diabetic but has sleep trouble.
NHANES %>%
count(SleepTrouble == "Yes")
## # A tibble: 3 x 2
## `SleepTrouble == "Yes"` n
## <lgl> <int>
## 1 FALSE 5799
## 2 TRUE 1973
## 3 NA 2228
NHANES %>%
count(SleepTrouble == "Yes", Diabetes == "Yes")
## # A tibble: 8 x 3
## `SleepTrouble == "Yes"` `Diabetes == "Yes"` n
## <lgl> <lgl> <int>
## 1 FALSE FALSE 5326
## 2 FALSE TRUE 473
## 3 TRUE FALSE 1700
## 4 TRUE TRUE 271
## 5 TRUE NA 2
## 6 NA FALSE 2072
## 7 NA TRUE 16
## 8 NA NA 140
Out of 10,000 survey participants, 1,973 reported having trouble sleeping. Of those 1,973, 271 also reported having diabetes.
# Find the proportion with diabetes within each SleepTrouble group
NHANES %>%
filter(!is.na(SleepTrouble), !is.na(Diabetes)) %>%
# Group by SleepTrouble
group_by(SleepTrouble) %>%
# Calculate proportion Yes summary stat
summarise(Yes_prop = mean(Diabetes == "Yes"))
## # A tibble: 2 x 2
## SleepTrouble Yes_prop
## <fct> <dbl>
## 1 No 0.0816
## 2 Yes 0.137
# Calculate the observed difference in diabetes rate between the SleepTrouble groups
diff_orig <- NHANES %>%
filter(!is.na(SleepTrouble), !is.na(Diabetes)) %>%
# Group by SleepTrouble
group_by(SleepTrouble) %>%
# Summarize to calculate fraction Yes
summarise(prop_prom = mean(Diabetes == "Yes")) %>%
# Summarize to calculate difference
summarise(stat = diff(prop_prom)) %>%
pull()
# See the result
diff_orig # SleepTrouble Yes - No
## [1] 0.05592787
13.75% of participants who had trouble sleeping also reported having diabetes, compared with 8.16% of those who did not.
Those who have trouble sleeping are therefore 5.6 percentage points more likely to be diabetic than those who do not.
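The gap between the two groups can be stated either as an absolute difference or as a ratio, and the two read quite differently. The sketch below recomputes both in base R from the four complete-case counts in the count() output above; the counts are typed in by hand and the variable names are illustrative, not part of the original analysis.

```r
# Counts copied by hand from the count() output above (complete cases only)
no_trouble  <- c(No = 5326, Yes = 473)  # Diabetes status when SleepTrouble == "No"
yes_trouble <- c(No = 1700, Yes = 271)  # Diabetes status when SleepTrouble == "Yes"

# Proportion with diabetes in each group
p_no  <- no_trouble[["Yes"]] / sum(no_trouble)
p_yes <- yes_trouble[["Yes"]] / sum(yes_trouble)

# Absolute difference in proportions (multiply by 100 for percentage points)
abs_diff <- p_yes - p_no

# Relative difference, expressed as a ratio of the two proportions
risk_ratio <- p_yes / p_no

round(c(p_no = p_no, p_yes = p_yes, abs_diff = abs_diff, risk_ratio = risk_ratio), 4)
```

The absolute difference reproduces diff_orig (about 0.056), while the ratio shows that the diabetes rate is roughly 1.7 times as high among those with sleep trouble.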
# Set the seed of R's random number generator so that the permutation results are reproducible.
set.seed(2019)
# Create data frame of permuted differences in diabetes rates
NHANES_perm <- NHANES %>%
# Specify variables: Diabetes (response variable) and SleepTrouble (explanatory variable)
specify(Diabetes ~ SleepTrouble, success = "Yes") %>%
# Set null hypothesis as independence: diabetes is unrelated to sleep trouble
hypothesize(null = "independence") %>%
# Shuffle the response variable, Diabetes, one thousand times
generate(reps = 1000, type = "permute") %>%
# Calculate difference in proportion, Yes then No
calculate(stat = "diff in props", order = c("Yes", "No")) # Yes - No
NHANES_perm
## # A tibble: 1,000 x 2
## replicate stat
## <int> <dbl>
## 1 1 0.00154
## 2 2 -0.0107
## 3 3 0.00698
## 4 4 -0.00865
## 5 5 0.00766
## 6 6 0.00154
## 7 7 -0.00253
## 8 8 0.00766
## 9 9 -0.0107
## 10 10 0.000184
## # ... with 990 more rows
# Using permutation data, plot stat
ggplot(NHANES_perm, aes(x = stat)) +
# Add a histogram layer
geom_histogram(binwidth = 0.01) +
# Using original data, add a vertical line at stat
geom_vline(aes(xintercept = diff_orig), color = "red")
When the Diabetes values are shuffled at random across participants, so that diabetes is unrelated to sleep trouble by construction, the difference in diabetes rates between those with and without sleep trouble is typically very small. The permuted differences cluster around zero, well to the left of the observed difference marked by the red line.
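To make the shuffling step concrete, here is a minimal base R sketch of the same idea without the infer package. It rebuilds the complete-case data from the counts in the count() output above rather than loading NHANES, and the helper diff_in_props() is a hypothetical name introduced only for this illustration. Because it uses a different random stream from generate(), the individual permuted values will not match the NHANES_perm output.

```r
# Rebuild the complete-case data from the counts in the count() output above
sleep_trouble <- rep(c("No", "Yes"), times = c(5799, 1971))
diabetes <- c(rep(c("No", "Yes"), times = c(5326, 473)),  # SleepTrouble == "No"
              rep(c("No", "Yes"), times = c(1700, 271)))  # SleepTrouble == "Yes"

# Hypothetical helper: difference in diabetes proportion, sleep trouble Yes - No
diff_in_props <- function(response, group) {
  mean(response[group == "Yes"] == "Yes") - mean(response[group == "No"] == "Yes")
}

obs_diff <- diff_in_props(diabetes, sleep_trouble)

set.seed(2019)
# Shuffle the diabetes labels 1000 times, recomputing the difference each time
perm_diffs <- replicate(1000, diff_in_props(sample(diabetes), sleep_trouble))

summary(perm_diffs)           # permuted differences are centred near zero
mean(perm_diffs >= obs_diff)  # share of shuffles at least as large as observed
```

Shuffling the response breaks any real link between the two variables, so the spread of perm_diffs shows how large a difference chance alone tends to produce.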
# Find the 0.90, 0.95, and 0.99 quantiles of stat
NHANES_perm %>%
summarize(q.90 = quantile(stat, p = 0.9),
q.95 = quantile(stat, p = 0.95),
q.99 = quantile(stat, p = 0.99))
## # A tibble: 1 x 3
## q.90 q.95 q.99
## <dbl> <dbl> <dbl>
## 1 0.00970 0.0124 0.0185
We reject the null hypothesis at the 1% level and conclude that those who have trouble sleeping are more likely to be diabetic. Under the null hypothesis, only 1% of the permuted differences exceed 1.85 percentage points, whereas the observed difference is 5.6 percentage points.
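As a rough cross-check on the permutation quantiles, the spread of the null distribution can be approximated with the pooled standard error for a difference in two proportions. This normal approximation is a substitute technique, not part of the permutation analysis above, and the counts are again typed in by hand from the count() output.

```r
# Group sizes and diabetes counts (complete cases, from the count() output)
n_no  <- 5326 + 473   # SleepTrouble == "No"
n_yes <- 1700 + 271   # SleepTrouble == "Yes"

# Pooled diabetes proportion under the null hypothesis of independence
p_pool <- (473 + 271) / (n_no + n_yes)

# Approximate standard error of the difference in proportions under the null
se_null <- sqrt(p_pool * (1 - p_pool) * (1 / n_no + 1 / n_yes))

# Approximate 99th percentile of the null distribution
q99_approx <- qnorm(0.99) * se_null

# Observed difference expressed in standard errors
z_obs <- (271 / n_yes - 473 / n_no) / se_null

round(c(se_null = se_null, q99_approx = q99_approx, z_obs = z_obs), 4)
```

The approximate 99th percentile comes out near 0.018, in the same range as the permutation value of 0.0185, and the observed difference sits more than seven standard errors above zero.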
# Calculate the p-value for the original dataset
NHANES_perm %>%
get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
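The p-value of 0 means that none of the 1,000 permuted differences was as large as the observed one. It does not mean that our conclusion cannot be wrong: with 1,000 permutations we can only say that the p-value is below roughly 0.001, which is strong evidence against the null hypothesis of independence.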
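A simulated p-value can be no finer than the number of permutations allows, so a reported 0 is better read as "smaller than the simulation can resolve". One commonly used convention counts the observed statistic as one of the permutations, which keeps the estimate strictly above zero. The helper below, perm_p_value(), is a hypothetical name for this sketch and is not what get_p_value() computed above.

```r
# Hypothetical helper: permutation p-value with the "+1" correction,
# treating the observed statistic as one additional permutation
perm_p_value <- function(n_extreme, n_reps) {
  (n_extreme + 1) / (n_reps + 1)
}

# None of the 1000 permuted differences reached the observed difference
perm_p_value(n_extreme = 0, n_reps = 1000)
```

With zero extreme permutations out of 1,000, the corrected estimate is 1/1001, or just under 0.001.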