Hypothesis Testing

Q1 State both null and alternative hypotheses.
Q2 Load the three packages: NHANES, tidyverse and infer.
Q3 Describe the first observation, using only two variables: Diabetes and SleepTrouble.
Q4 How many of the survey participants reported to have trouble sleeping? And how many of those who have trouble sleeping reported to have diabetes?
Q5 What percentage of those who have trouble sleeping are diabetic?
Q6 Which of the two groups is more likely to be diabetic? Those who have trouble sleeping or those who don’t? By what percentage?
Q7 The distribution of randomized differences below shows that the difference of zero is most likely seen by chance. What does this mean?
Q8 Would you reject the null hypothesis at 1% and conclude that those who have trouble sleeping are more likely to be diabetic?
Q9 According to the computed p-value below, how likely is it that you would be wrong if you concluded that those who have trouble sleeping are more likely to be diabetic?
Q10.a. Display both the code and the results of the code on the webpage.
Q10.b. Display the title and your name correctly at the top of the webpage.
Q10.c. Use the correct slug.

Suppose that your boss is interested in the relationship between diabetes and insomnia. Are those who have trouble sleeping more likely to be diabetic than those who don’t?

Use the NHANES data set in the NHANES R package. You may find more about the data set in the NHANES package’s manual.

Definition of variables

Diabetes: Study participant told by a doctor or health professional that they have diabetes. Reported for participants aged 1 year or older as Yes or No.
SleepTrouble: Participant has told a doctor or other health professional that they had trouble sleeping. Reported for participants aged 16 years and older. Coded as Yes or No.

Q1 State both null and alternative hypotheses.

Null hypothesis: There is no association between having trouble sleeping and being diabetic; the proportion of diabetics is the same in both groups.

Alternative hypothesis: There is an association between sleep trouble and diabetes; specifically, those who have trouble sleeping are more likely to be diabetic than those who don’t.

Q2 Load the three packages: NHANES, tidyverse and infer.

# Load packages
library(NHANES)
library(tidyverse)
library(infer)

NHANES
## # A tibble: 10,000 x 76
##       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
##    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
##  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch~
##  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch~
##  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch~
##  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
##  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col~
##  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
##  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
##  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College ~
##  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College ~
## 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College ~
## # ... with 9,990 more rows, and 67 more variables: MaritalStatus <fct>,
## #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
## #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
## #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
## #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>,
## #   BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## #   Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>,
## #   UrineFlow1 <dbl>, UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>,
## #   DiabetesAge <int>, HealthGen <fct>, DaysPhysHlthBad <int>,
## #   DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>,
## #   nPregnancies <int>, nBabies <int>, Age1stBaby <int>,
## #   SleepHrsNight <int>, SleepTrouble <fct>, PhysActive <fct>,
## #   PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
## #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
## #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
## #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
## #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>,
## #   SameSex <fct>, SexOrientation <fct>, PregnantNow <fct>

Q3 Describe the first observation, using only two variables: Diabetes and SleepTrouble.

The first observation is an individual who reported having diabetes and trouble sleeping.
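The printed tibble above hides both columns, so the first observation is easiest to read by pulling the first row and keeping only the two variables of interest. This is a minimal sketch, assuming the NHANES and dplyr packages are installed; `first_obs` is a name introduced here for illustration.

```r
library(NHANES)
library(dplyr)

# Keep only the first row and the two variables of interest
# (first_obs is an illustrative name, not part of the original analysis)
first_obs <- NHANES %>%
  slice(1) %>%
  select(Diabetes, SleepTrouble)

first_obs
```

The one-row result shows the Diabetes and SleepTrouble values to describe.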

Q4 How many of the survey participants reported to have trouble sleeping? And how many of those who have trouble sleeping reported to have diabetes?

NHANES %>%
  # Count the rows by SleepTrouble and Diabetes
  count(SleepTrouble, Diabetes)
## # A tibble: 8 x 3
##   SleepTrouble Diabetes     n
##   <fct>        <fct>    <int>
## 1 No           No        5326
## 2 No           Yes        473
## 3 Yes          No        1700
## 4 Yes          Yes        271
## 5 Yes          <NA>         2
## 6 <NA>         No        2072
## 7 <NA>         Yes         16
## 8 <NA>         <NA>       140

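According to the survey, 1,973 participants reported trouble sleeping (1,700 without diabetes, 271 with diabetes, and 2 with a missing diabetes response). Of those who have trouble sleeping, 271 reported having diabetes.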
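The totals can be checked by summing the relevant rows of the count table. The sketch below re-enters the printed counts by hand so that it runs without the NHANES package; `sleep_counts`, `n_trouble` and `n_trouble_diabetic` are illustrative names.

```r
# Counts copied by hand from the count(SleepTrouble, Diabetes) output above
sleep_counts <- data.frame(
  SleepTrouble = c("No", "No", "Yes", "Yes", "Yes", NA, NA, NA),
  Diabetes     = c("No", "Yes", "No", "Yes", NA, "No", "Yes", NA),
  n            = c(5326, 473, 1700, 271, 2, 2072, 16, 140)
)

# Rows for participants who reported trouble sleeping (subset() drops NA matches)
trouble <- subset(sleep_counts, SleepTrouble == "Yes")

# Everyone with trouble sleeping, including the 2 with a missing Diabetes value
n_trouble <- sum(trouble$n)

# Of those, the ones who also reported diabetes (%in% treats NA as no match)
n_trouble_diabetic <- sum(trouble$n[trouble$Diabetes %in% "Yes"])

n_trouble
n_trouble_diabetic
```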

Q5 What percentage of those who have trouble sleeping are diabetic?

# Find the proportion in each SleepTrouble group who reported diabetes
NHANES %>%
  filter(!is.na(SleepTrouble), !is.na(Diabetes)) %>%
  # Group by SleepTrouble
  group_by(SleepTrouble) %>%
  # Calculate proportion Yes summary stat
  summarise(Yes_prop = mean(Diabetes == "Yes"))
## # A tibble: 2 x 2
##   SleepTrouble Yes_prop
##   <fct>           <dbl>
## 1 No             0.0816
## 2 Yes            0.137

# Calculate the observed difference in diabetes rates
diff_orig <- NHANES %>%
  filter(!is.na(SleepTrouble), !is.na(Diabetes)) %>%
  # Group by SleepTrouble
  group_by(SleepTrouble) %>%
  # Summarize to calculate the fraction with diabetes
  summarise(prop_yes = mean(Diabetes == "Yes")) %>%
  # Summarize to calculate the difference (Yes - No)
  summarise(stat = diff(prop_yes)) %>% 
  pull()
    
# See the result
diff_orig # trouble sleeping (Yes) - no trouble sleeping (No)
## [1] 0.05592787

About 13.7 percent of those who have trouble sleeping are diabetic.

Q6 Which of the two groups is more likely to be diabetic? Those who have trouble sleeping or those who don’t? By what percentage?

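Those who have trouble sleeping are more likely to be diabetic: about 13.7% of them report diabetes versus about 8.2% of those who don’t, a difference of about 5.6 percentage points.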
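The "by what percentage" question can be read two ways: as an absolute gap in percentage points, or as a ratio of the two rates. The sketch below works out both from the complete-case counts printed earlier; the variable names are illustrative, and the ratio is an additional way of reading the comparison, not something computed in the original analysis.

```r
# Complete-case counts copied from the count table above
diabetic_trouble    <- 271
total_trouble       <- 1700 + 271
diabetic_no_trouble <- 473
total_no_trouble    <- 5326 + 473

# Proportion diabetic within each group
prop_trouble    <- diabetic_trouble / total_trouble
prop_no_trouble <- diabetic_no_trouble / total_no_trouble

# Absolute difference in proportions (times 100 gives percentage points)
abs_diff <- prop_trouble - prop_no_trouble

# Ratio of the two proportions (how many times as likely)
rel_ratio <- prop_trouble / prop_no_trouble

round(100 * abs_diff, 1)
round(rel_ratio, 2)
```

The absolute difference reproduces `diff_orig`; the ratio expresses the same gap in relative terms.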

Q7 The distribution of randomized differences below shows that the difference of zero is most likely seen by chance. What does this mean?

# Set the seed of R's random number generator so that the random numbers would continue to be the same.
set.seed(2019)

# Create data frame of permuted differences in diabetes rates
NHANES_perm <- NHANES %>%
  # Specify variables: Diabetes (response variable) and SleepTrouble (explanatory variable)
  specify(Diabetes ~ SleepTrouble, success = "Yes") %>%
  # Set null hypothesis as independence: diabetes is unrelated to sleep trouble
  hypothesize(null = "independence") %>%
  # Shuffle the response variable, Diabetes, one thousand times
  generate(reps = 1000, type = "permute") %>% 
  # Calculate difference in proportion, Yes then No
  calculate(stat = "diff in props", order = c("Yes", "No")) # Yes - No
  
NHANES_perm
## # A tibble: 1,000 x 2
##    replicate      stat
##        <int>     <dbl>
##  1         1  0.00154 
##  2         2 -0.0107  
##  3         3  0.00698 
##  4         4 -0.00865 
##  5         5  0.00766 
##  6         6  0.00154 
##  7         7 -0.00253 
##  8         8  0.00766 
##  9         9 -0.0107  
## 10        10  0.000184
## # ... with 990 more rows

# Using permutation data, plot stat
ggplot(NHANES_perm, aes(x = stat)) + 
  # Add a histogram layer
  geom_histogram(binwidth = 0.01) +
  # Using original data, add a vertical line at stat
  geom_vline(aes(xintercept = diff_orig), color = "red")

The permuted differences are centered at 0: the histogram is tallest near 0.00. This means that if insomnia and diabetes were unrelated (the null hypothesis), a difference of about zero is the outcome most likely to be seen by chance alone. The observed difference of about 0.056 (the red line) lies far to the right of this distribution.
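To make the idea of permuting concrete, the following base R sketch shuffles the diabetes labels by hand. It rebuilds the complete cases from the counts printed earlier rather than using the NHANES data frame, and `diff_in_props` is a hypothetical helper written for this illustration. It approximates the idea behind the infer pipeline above; it is not infer's implementation, and its random draws will not match the output shown.

```r
set.seed(2019)  # illustrative seed; draws will differ from the infer results

# Rebuild the complete cases from the counts printed earlier
sleep_trouble <- rep(c("No", "Yes"), times = c(5326 + 473, 1700 + 271))
diabetes <- c(rep(c("No", "Yes"), times = c(5326, 473)),
              rep(c("No", "Yes"), times = c(1700, 271)))

# Hypothetical helper: proportion diabetic among "Yes" minus among "No"
diff_in_props <- function(response, group) {
  mean(response[group == "Yes"] == "Yes") - mean(response[group == "No"] == "Yes")
}

observed <- diff_in_props(diabetes, sleep_trouble)

# Shuffle the diabetes labels 1,000 times, breaking any link with sleep trouble
perm_stats <- replicate(1000, diff_in_props(sample(diabetes), sleep_trouble))

mean(perm_stats)              # average permuted difference
mean(perm_stats >= observed)  # share of shuffles at least as large as observed
```

Because shuffling removes any real relationship, the permuted differences scatter around zero, which is why the histogram peaks there.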

Q8 Would you reject the null hypothesis at 1% and conclude that those who have trouble sleeping are more likely to be diabetic?

# Find the 0.90, 0.95, and 0.99 quantiles of stat
NHANES_perm %>% 
  summarize(q.90 = quantile(stat, probs = 0.9),
            q.95 = quantile(stat, probs = 0.95),
            q.99 = quantile(stat, probs = 0.99))
## # A tibble: 1 x 3
##      q.90   q.95   q.99
##     <dbl>  <dbl>  <dbl>
## 1 0.00970 0.0124 0.0185

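The 0.99 quantile is about 0.0185 (roughly 1.85 percentage points), which means that 99% of the null statistics are 0.0185 or below. The observed difference of about 0.056 is well above this threshold, so we would reject the null hypothesis at the 1% level and conclude that those who have trouble sleeping are more likely to be diabetic.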
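The decision at the 1% level comes down to a single comparison. The sketch below hard-codes the two numbers from the outputs above so that it runs on its own; `q_99` and `reject_null` are illustrative names.

```r
# Values copied from the outputs above
diff_orig <- 0.05592787  # observed difference in proportions
q_99      <- 0.0185      # 0.99 quantile of the permuted differences

# One-sided test at the 1% level: reject when the observed
# difference exceeds the 0.99 quantile of the null distribution
reject_null <- diff_orig > q_99
reject_null
```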

Q9 According to the computed p-value below, how likely is it that you would be wrong if you concluded that those who have trouble sleeping are more likely to be diabetic?

# Calculate the p-value for the original dataset
NHANES_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

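The p-value is reported as 0, meaning that none of the 1,000 permuted differences was as large as the observed difference. It would be extremely unlikely (less than about 1 in 1,000, rather than literally zero) to see the observed difference if there were no association between diabetes and sleep trouble, so the chance of being wrong in concluding that those who have trouble sleeping are more likely to be diabetic is very small.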
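A p-value of exactly 0 reflects the finite number of permutations, not a true zero probability. One commonly used adjustment adds one to both the count and the number of replicates, treating the observed data as one more arrangement. The sketch below applies it to the numbers above; this adjustment is an assumption added for illustration and was not part of the original analysis.

```r
reps      <- 1000  # number of permutations generated above
n_extreme <- 0     # permuted differences >= the observed one (p-value of 0 above)

# Raw permutation p-value, as reported
p_raw <- n_extreme / reps

# Adjusted p-value: counts the observed data as one arrangement
p_adjusted <- (n_extreme + 1) / (reps + 1)

p_raw
p_adjusted
```

The adjusted value is small but nonzero, which matches the "less than about 1 in 1,000" reading.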

Niti bista