The code below examines relationships between diabetes and insomnia. In doing so, we use the NHANES data set in the NHANES R package. You may find more about the data set in the NHANES package’s manual. To find the manual, Google the package’s name or click [here] (https://cran.r-project.org/web/packages/NHANES/NHANES.pdf).

Def of variables

Given the data, it is shown below whether or not is appropriate to reject the null hypothesis in favor of the research claim of interest.

Your assignment

Suppose that your boss is more interested in relationships between smoking marijuana and diabetes. Are those who regularly smoke marijuana are more likely to be diabetic than those who don’t? Run the program after replacing SleepTrouble by RegularMarij.

Would you reject the null hypothesis? In other words, would you conclude that those who smoking marijuana are more likely to be diabetic than those who don’t. At what significance level? Write your conclusion by revising the interpretation given in the Ex_Inference file.

Q1 How many survey participants are there in the data set?

Hint: The row represents survey participants.

There are 10,000 participants.

Q2 How many of the survey participants reported that they regularly smoke marijuana?

Hint: Check the result of the summary() function.

1366 survey participants reported regularly soming marijuana.

# Load packages
library(NHANES)
library(tidyverse)

str(NHANES)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10000 obs. of  76 variables:
##  $ ID              : int  51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
##  $ SurveyYr        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender          : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Age             : int  34 34 34 4 49 9 8 45 45 45 ...
##  $ AgeDecade       : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
##  $ AgeMonths       : int  409 409 409 49 596 115 101 541 541 541 ...
##  $ Race1           : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
##  $ Race3           : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Education       : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
##  $ MaritalStatus   : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
##  $ HHIncome        : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
##  $ HHIncomeMid     : int  30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
##  $ Poverty         : num  1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
##  $ HomeRooms       : int  6 6 6 9 5 6 7 6 6 6 ...
##  $ HomeOwn         : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Work            : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
##  $ Weight          : num  87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
##  $ Length          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ HeadCirc        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Height          : num  165 165 165 105 168 ...
##  $ BMI             : num  32.2 32.2 32.2 15.3 30.6 ...
##  $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BMI_WHO         : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
##  $ Pulse           : int  70 70 70 NA 86 82 72 62 62 62 ...
##  $ BPSysAve        : int  113 113 113 NA 112 86 107 118 118 118 ...
##  $ BPDiaAve        : int  85 85 85 NA 75 47 37 64 64 64 ...
##  $ BPSys1          : int  114 114 114 NA 118 84 114 106 106 106 ...
##  $ BPDia1          : int  88 88 88 NA 82 50 46 62 62 62 ...
##  $ BPSys2          : int  114 114 114 NA 108 84 108 118 118 118 ...
##  $ BPDia2          : int  88 88 88 NA 74 50 36 68 68 68 ...
##  $ BPSys3          : int  112 112 112 NA 116 88 106 118 118 118 ...
##  $ BPDia3          : int  82 82 82 NA 76 44 38 60 60 60 ...
##  $ Testosterone    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ DirectChol      : num  1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
##  $ TotChol         : num  3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
##  $ UrineVol1       : int  352 352 352 NA 77 123 238 106 106 106 ...
##  $ UrineFlow1      : num  NA NA NA NA 0.094 ...
##  $ UrineVol2       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ UrineFlow2      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Diabetes        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DiabetesAge     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HealthGen       : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
##  $ DaysPhysHlthBad : int  0 0 0 NA 0 NA NA 0 0 0 ...
##  $ DaysMentHlthBad : int  15 15 15 NA 10 NA NA 3 3 3 ...
##  $ LittleInterest  : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
##  $ Depressed       : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ nPregnancies    : int  NA NA NA NA 2 NA NA 1 1 1 ...
##  $ nBabies         : int  NA NA NA NA 2 NA NA NA NA NA ...
##  $ Age1stBaby      : int  NA NA NA NA 27 NA NA NA NA NA ...
##  $ SleepHrsNight   : int  4 4 4 NA 8 NA NA 8 8 8 ...
##  $ SleepTrouble    : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PhysActive      : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
##  $ PhysActiveDays  : int  NA NA NA NA NA NA NA 5 5 5 ...
##  $ TVHrsDay        : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ CompHrsDay      : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ TVHrsDayChild   : int  NA NA NA 4 NA 5 1 NA NA NA ...
##  $ CompHrsDayChild : int  NA NA NA 1 NA 0 6 NA NA NA ...
##  $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AlcoholDay      : int  NA NA NA NA 2 NA NA 3 3 3 ...
##  $ AlcoholYear     : int  0 0 0 NA 20 NA NA 52 52 52 ...
##  $ SmokeNow        : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
##  $ Smoke100        : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ Smoke100n       : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SmokeAge        : int  18 18 18 NA 38 NA NA NA NA NA ...
##  $ Marijuana       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ AgeFirstMarij   : int  17 17 17 NA 18 NA NA 13 13 13 ...
##  $ RegularMarij    : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
##  $ AgeRegMarij     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HardDrugs       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ SexEver         : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
##  $ SexAge          : int  16 16 16 NA 12 NA NA 13 13 13 ...
##  $ SexNumPartnLife : int  8 8 8 NA 10 NA NA 20 20 20 ...
##  $ SexNumPartYear  : int  1 1 1 NA 1 NA NA 0 0 0 ...
##  $ SameSex         : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
##  $ SexOrientation  : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
##  $ PregnantNow     : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...
summary(NHANES)
##        ID           SurveyYr       Gender          Age       
##  Min.   :51624   2009_10:5000   female:5020   Min.   : 0.00  
##  1st Qu.:56904   2011_12:5000   male  :4980   1st Qu.:17.00  
##  Median :62160                                Median :36.00  
##  Mean   :61945                                Mean   :36.74  
##  3rd Qu.:67039                                3rd Qu.:54.00  
##  Max.   :71915                                Max.   :80.00  
##                                                              
##    AgeDecade      AgeMonths          Race1           Race3     
##   40-49 :1398   Min.   :  0.0   Black   :1197   Asian   : 288  
##   0-9   :1391   1st Qu.:199.0   Hispanic: 610   Black   : 589  
##   10-19 :1374   Median :418.0   Mexican :1015   Hispanic: 350  
##   20-29 :1356   Mean   :420.1   White   :6372   Mexican : 480  
##   30-39 :1338   3rd Qu.:624.0   Other   : 806   White   :3135  
##  (Other):2810   Max.   :959.0                   Other   : 158  
##  NA's   : 333   NA's   :5038                    NA's    :5000  
##           Education         MaritalStatus         HHIncome   
##  8th Grade     : 451   Divorced    : 707   more 99999 :2220  
##  9 - 11th Grade: 888   LivePartner : 560   75000-99999:1084  
##  High School   :1517   Married     :3945   25000-34999: 958  
##  Some College  :2267   NeverMarried:1380   35000-44999: 863  
##  College Grad  :2098   Separated   : 183   45000-54999: 784  
##  NA's          :2779   Widowed     : 456   (Other)    :3280  
##                        NA's        :2769   NA's       : 811  
##   HHIncomeMid        Poverty        HomeRooms       HomeOwn    
##  Min.   :  2500   Min.   :0.000   Min.   : 1.000   Own  :6425  
##  1st Qu.: 30000   1st Qu.:1.240   1st Qu.: 5.000   Rent :3287  
##  Median : 50000   Median :2.700   Median : 6.000   Other: 225  
##  Mean   : 57206   Mean   :2.802   Mean   : 6.249   NA's :  63  
##  3rd Qu.: 87500   3rd Qu.:4.710   3rd Qu.: 8.000               
##  Max.   :100000   Max.   :5.000   Max.   :13.000               
##  NA's   :811      NA's   :726     NA's   :69                   
##          Work          Weight           Length          HeadCirc    
##  Looking   : 311   Min.   :  2.80   Min.   : 47.10   Min.   :34.20  
##  NotWorking:2847   1st Qu.: 56.10   1st Qu.: 75.70   1st Qu.:39.58  
##  Working   :4613   Median : 72.70   Median : 87.00   Median :41.45  
##  NA's      :2229   Mean   : 70.98   Mean   : 85.02   Mean   :41.18  
##                    3rd Qu.: 88.90   3rd Qu.: 96.10   3rd Qu.:42.92  
##                    Max.   :230.70   Max.   :112.20   Max.   :45.40  
##                    NA's   :78       NA's   :9457     NA's   :9912   
##      Height           BMI           BMICatUnder20yrs         BMI_WHO    
##  Min.   : 83.6   Min.   :12.88   UnderWeight:  55    12.0_18.5   :1277  
##  1st Qu.:156.8   1st Qu.:21.58   NormWeight : 805    18.5_to_24.9:2911  
##  Median :166.0   Median :25.98   OverWeight : 193    25.0_to_29.9:2664  
##  Mean   :161.9   Mean   :26.66   Obese      : 221    30.0_plus   :2751  
##  3rd Qu.:174.5   3rd Qu.:30.89   NA's       :8726    NA's        : 397  
##  Max.   :200.4   Max.   :81.25                                          
##  NA's   :353     NA's   :366                                            
##      Pulse           BPSysAve        BPDiaAve          BPSys1     
##  Min.   : 40.00   Min.   : 76.0   Min.   :  0.00   Min.   : 72.0  
##  1st Qu.: 64.00   1st Qu.:106.0   1st Qu.: 61.00   1st Qu.:106.0  
##  Median : 72.00   Median :116.0   Median : 69.00   Median :116.0  
##  Mean   : 73.56   Mean   :118.2   Mean   : 67.48   Mean   :119.1  
##  3rd Qu.: 82.00   3rd Qu.:127.0   3rd Qu.: 76.00   3rd Qu.:128.0  
##  Max.   :136.00   Max.   :226.0   Max.   :116.00   Max.   :232.0  
##  NA's   :1437     NA's   :1449    NA's   :1449     NA's   :1763   
##      BPDia1           BPSys2          BPDia2           BPSys3     
##  Min.   :  0.00   Min.   : 76.0   Min.   :  0.00   Min.   : 76.0  
##  1st Qu.: 62.00   1st Qu.:106.0   1st Qu.: 60.00   1st Qu.:106.0  
##  Median : 70.00   Median :116.0   Median : 68.00   Median :116.0  
##  Mean   : 68.28   Mean   :118.5   Mean   : 67.66   Mean   :117.9  
##  3rd Qu.: 76.00   3rd Qu.:128.0   3rd Qu.: 76.00   3rd Qu.:126.0  
##  Max.   :118.00   Max.   :226.0   Max.   :118.00   Max.   :226.0  
##  NA's   :1763     NA's   :1647    NA's   :1647     NA's   :1635   
##      BPDia3       Testosterone       DirectChol       TotChol      
##  Min.   :  0.0   Min.   :   0.25   Min.   :0.390   Min.   : 1.530  
##  1st Qu.: 60.0   1st Qu.:  17.70   1st Qu.:1.090   1st Qu.: 4.110  
##  Median : 68.0   Median :  43.82   Median :1.290   Median : 4.780  
##  Mean   : 67.3   Mean   : 197.90   Mean   :1.365   Mean   : 4.879  
##  3rd Qu.: 76.0   3rd Qu.: 362.41   3rd Qu.:1.580   3rd Qu.: 5.530  
##  Max.   :116.0   Max.   :1795.60   Max.   :4.030   Max.   :13.650  
##  NA's   :1635    NA's   :5874      NA's   :1526    NA's   :1526    
##    UrineVol1       UrineFlow1        UrineVol2       UrineFlow2    
##  Min.   :  0.0   Min.   : 0.0000   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 50.0   1st Qu.: 0.4030   1st Qu.: 52.0   1st Qu.: 0.475  
##  Median : 94.0   Median : 0.6990   Median : 95.0   Median : 0.760  
##  Mean   :118.5   Mean   : 0.9793   Mean   :119.7   Mean   : 1.149  
##  3rd Qu.:164.0   3rd Qu.: 1.2210   3rd Qu.:171.8   3rd Qu.: 1.513  
##  Max.   :510.0   Max.   :17.1670   Max.   :409.0   Max.   :13.692  
##  NA's   :987     NA's   :1603      NA's   :8522    NA's   :8524    
##  Diabetes     DiabetesAge        HealthGen    DaysPhysHlthBad 
##  No  :9098   Min.   : 1.00   Excellent: 878   Min.   : 0.000  
##  Yes : 760   1st Qu.:40.00   Vgood    :2508   1st Qu.: 0.000  
##  NA's: 142   Median :50.00   Good     :2956   Median : 0.000  
##              Mean   :48.42   Fair     :1010   Mean   : 3.335  
##              3rd Qu.:58.00   Poor     : 187   3rd Qu.: 3.000  
##              Max.   :80.00   NA's     :2461   Max.   :30.000  
##              NA's   :9371                     NA's   :2468    
##  DaysMentHlthBad  LittleInterest   Depressed     nPregnancies   
##  Min.   : 0.000   None   :5103   None   :5246   Min.   : 1.000  
##  1st Qu.: 0.000   Several:1130   Several:1009   1st Qu.: 2.000  
##  Median : 0.000   Most   : 434   Most   : 418   Median : 3.000  
##  Mean   : 4.127   NA's   :3333   NA's   :3327   Mean   : 3.027  
##  3rd Qu.: 4.000                                 3rd Qu.: 4.000  
##  Max.   :30.000                                 Max.   :32.000  
##  NA's   :2466                                   NA's   :7396    
##     nBabies         Age1stBaby    SleepHrsNight    SleepTrouble
##  Min.   : 0.000   Min.   :14.00   Min.   : 2.000   No  :5799   
##  1st Qu.: 2.000   1st Qu.:19.00   1st Qu.: 6.000   Yes :1973   
##  Median : 2.000   Median :22.00   Median : 7.000   NA's:2228   
##  Mean   : 2.457   Mean   :22.65   Mean   : 6.928               
##  3rd Qu.: 3.000   3rd Qu.:26.00   3rd Qu.: 8.000               
##  Max.   :12.000   Max.   :39.00   Max.   :12.000               
##  NA's   :7584     NA's   :8116    NA's   :2245                 
##  PhysActive  PhysActiveDays       TVHrsDay        CompHrsDay  
##  No  :3677   Min.   :1.000   2_hr     :1275   0_to_1_hr:1409  
##  Yes :4649   1st Qu.:2.000   1_hr     : 884   0_hrs    :1073  
##  NA's:1674   Median :3.000   3_hr     : 836   1_hr     :1030  
##              Mean   :3.744   0_to_1_hr: 638   2_hr     : 589  
##              3rd Qu.:5.000   More_4_hr: 615   3_hr     : 347  
##              Max.   :7.000   (Other)  : 611   (Other)  : 415  
##              NA's   :5337    NA's     :5141   NA's     :5137  
##  TVHrsDayChild   CompHrsDayChild Alcohol12PlusYr   AlcoholDay    
##  Min.   :0.000   Min.   :0.000   No  :1368       Min.   : 1.000  
##  1st Qu.:1.000   1st Qu.:0.000   Yes :5212       1st Qu.: 1.000  
##  Median :2.000   Median :1.000   NA's:3420       Median : 2.000  
##  Mean   :1.939   Mean   :2.198                   Mean   : 2.914  
##  3rd Qu.:3.000   3rd Qu.:6.000                   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :6.000                   Max.   :82.000  
##  NA's   :9347    NA's   :9347                    NA's   :5086    
##   AlcoholYear    SmokeNow    Smoke100         Smoke100n       SmokeAge    
##  Min.   :  0.0   No  :1745   No  :4024   Non-Smoker:4024   Min.   : 6.00  
##  1st Qu.:  3.0   Yes :1466   Yes :3211   Smoker    :3211   1st Qu.:15.00  
##  Median : 24.0   NA's:6789   NA's:2765   NA's      :2765   Median :17.00  
##  Mean   : 75.1                                             Mean   :17.83  
##  3rd Qu.:104.0                                             3rd Qu.:19.00  
##  Max.   :364.0                                             Max.   :72.00  
##  NA's   :4078                                              NA's   :6920   
##  Marijuana   AgeFirstMarij   RegularMarij  AgeRegMarij    HardDrugs  
##  No  :2049   Min.   : 1.00   No  :3575    Min.   : 5.00   No  :4700  
##  Yes :2892   1st Qu.:15.00   Yes :1366    1st Qu.:15.00   Yes :1065  
##  NA's:5059   Median :16.00   NA's:5059    Median :17.00   NA's:4235  
##              Mean   :17.02                Mean   :17.69              
##              3rd Qu.:19.00                3rd Qu.:19.00              
##              Max.   :48.00                Max.   :52.00              
##              NA's   :7109                 NA's   :8634               
##  SexEver         SexAge      SexNumPartnLife   SexNumPartYear  
##  No  : 223   Min.   : 9.00   Min.   :   0.00   Min.   : 0.000  
##  Yes :5544   1st Qu.:15.00   1st Qu.:   2.00   1st Qu.: 1.000  
##  NA's:4233   Median :17.00   Median :   5.00   Median : 1.000  
##              Mean   :17.43   Mean   :  15.09   Mean   : 1.342  
##              3rd Qu.:19.00   3rd Qu.:  12.00   3rd Qu.: 1.000  
##              Max.   :50.00   Max.   :2000.00   Max.   :69.000  
##              NA's   :4460    NA's   :4275      NA's   :5072    
##  SameSex          SexOrientation  PregnantNow  
##  No  :5353   Bisexual    : 119   Yes    :  72  
##  Yes : 415   Heterosexual:4638   No     :1573  
##  NA's:4232   Homosexual  :  85   Unknown:  51  
##              NA's        :5158   NA's   :8304  
##                                                
##                                                
## 

Q3 What is the percentage of the regular smokers who are diabetic?

The percentage of regular smokers who are diabetic is 0.06881406%.

Q4 What is the difference in the percentages between those who are regular smokers and those who are not?

The difference in the percentages is 0.01398888%.

# What are the variables in the NHANES_Ph dataset?
NHANES_Ph <- 
  NHANES %>%
  select(RegularMarij, Diabetes) %>%
  filter(!is.na(RegularMarij))
NHANES_Ph
## # A tibble: 4,941 x 2
##    RegularMarij Diabetes
##    <fct>        <fct>   
##  1 No           No      
##  2 No           No      
##  3 No           No      
##  4 No           No      
##  5 No           No      
##  6 No           No      
##  7 No           No      
##  8 Yes          No      
##  9 Yes          No      
## 10 No           No      
## # ... with 4,931 more rows

# Create a contingency table summarizing the data
table(NHANES_Ph)
##             Diabetes
## RegularMarij   No  Yes
##          No  3379  196
##          Yes 1272   94

# Find proportion of each SleepTrouble who were Diabetes
NHANES_Ph %>%
group_by(RegularMarij) %>%
  summarise(Diabetes_prop = mean(Diabetes == "Yes", na.rm = TRUE))
## # A tibble: 2 x 2
##   RegularMarij Diabetes_prop
##   <fct>                <dbl>
## 1 No                  0.0548
## 2 Yes                 0.0688

# Find the difference in proporations between SleepTrouble
NHANES_Ph %>%
group_by(RegularMarij) %>%
  summarise(Diabetes_prop = mean(Diabetes == "Yes", na.rm = TRUE)) %>%
  summarise(Diabetes_prop_diff = diff(Diabetes_prop)) # Yes - No
## # A tibble: 1 x 1
##   Diabetes_prop_diff
##                <dbl>
## 1             0.0140

Interpretation

The code below creates 1,000 sets of permuted dataset (which represents no relationships between diabetes and insomnia because they were created by random shuffling) and then compare the difference between observed dataset and permuted dataset.

Q5 Revise the code below to create 2,000 sets of permuted data, instead of 1,000.

Q6 How different is the original difference in percentages from the permuted differences? What is this likely suggest regarding the relationship between diabetes and smoking marijuana?

Hint: Answer this question by describing the histogram below. In addition, read the interpretation below for the case of diabetes and insomnia.

# The oilabs package is failed to be installed b/c it's not available for this R version.
# Define function
rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1) { 
     n <- nrow(tbl) 
     i <- unlist(replicate(reps, sample.int(n, size, replace = replace), simplify = FALSE)) 
 
 
     rep_tbl <- cbind(replicate = rep(1:reps,rep(size,reps)), tbl[i,]) 
 
 
     dplyr::group_by(rep_tbl, replicate) 
}

# Create a data frame of differences in diabetic rates
# Create 1,000 replicate sets
NHANES_Ph %>%
  rep_sample_n(size = nrow(NHANES_Ph), reps = 2000)
## # A tibble: 9,882,000 x 3
## # Groups:   replicate [2,000]
##    replicate RegularMarij Diabetes
##        <int> <fct>        <fct>   
##  1         1 No           No      
##  2         1 No           No      
##  3         1 No           No      
##  4         1 No           No      
##  5         1 Yes          No      
##  6         1 No           No      
##  7         1 Yes          No      
##  8         1 No           No      
##  9         1 Yes          No      
## 10         1 No           No      
## # ... with 9,881,990 more rows

# Create a new variable, Diabetes_perm, which represents no relationships between diabetes and insomonia becasue they were created by random shuffling
NHANES_Ph %>%
  rep_sample_n(size = nrow(NHANES_Ph), reps = 2000) %>%
  mutate(Diabetes_perm = sample(Diabetes))
## # A tibble: 9,882,000 x 4
## # Groups:   replicate [2,000]
##    replicate RegularMarij Diabetes Diabetes_perm
##        <int> <fct>        <fct>    <fct>        
##  1         1 No           No       No           
##  2         1 No           No       No           
##  3         1 No           No       No           
##  4         1 Yes          Yes      No           
##  5         1 No           Yes      Yes          
##  6         1 No           No       No           
##  7         1 No           No       No           
##  8         1 No           No       No           
##  9         1 No           No       No           
## 10         1 No           No       No           
## # ... with 9,881,990 more rows

# Calculate percentages of diabetes
NHANES_Ph %>%
  rep_sample_n(size = nrow(NHANES_Ph), reps = 2000) %>%
  mutate(Diabetes_perm = sample(Diabetes)) %>%
  group_by(replicate, RegularMarij) %>%
  summarize(prop_Diabetes = mean(Diabetes == "Yes", na.rm = TRUE),
            prop_Diabetes_perm = mean(Diabetes_perm == "Yes", na.rm = TRUE))
## # A tibble: 4,000 x 4
## # Groups:   replicate [?]
##    replicate RegularMarij prop_Diabetes prop_Diabetes_perm
##        <int> <fct>                <dbl>              <dbl>
##  1         1 No                  0.0548             0.0579
##  2         1 Yes                 0.0688             0.0608
##  3         2 No                  0.0548             0.0557
##  4         2 Yes                 0.0688             0.0666
##  5         3 No                  0.0548             0.0601
##  6         3 Yes                 0.0688             0.0549
##  7         4 No                  0.0548             0.0568
##  8         4 Yes                 0.0688             0.0637
##  9         5 No                  0.0548             0.0601
## 10         5 Yes                 0.0688             0.0549
## # ... with 3,990 more rows

# Calculate the difference in the percentages between those who have sleep trouble and those who are not
NHANES_Ph %>%
  rep_sample_n(size = nrow(NHANES_Ph), reps = 2000) %>%
  mutate(Diabetes_perm = sample(Diabetes)) %>%
  group_by(replicate, RegularMarij) %>%
  summarize(prop_Diabetes = mean(Diabetes == "Yes", na.rm = TRUE),
            prop_Diabetes_perm = mean(Diabetes_perm == "Yes", na.rm = TRUE)) %>%
  summarize(diff_orig = diff(prop_Diabetes),
            diff_perm = diff(prop_Diabetes_perm))
## # A tibble: 2,000 x 3
##    replicate diff_orig diff_perm
##        <int>     <dbl>     <dbl>
##  1         1    0.0140  -0.00422
##  2         2    0.0140   0.00488
##  3         3    0.0140   0.0110 
##  4         4    0.0140   0.00691
##  5         5    0.0140  -0.00928
##  6         6    0.0140   0.00286
##  7         7    0.0140   0.00792
##  8         8    0.0140   0.00893
##  9         9    0.0140   0.0120 
## 10        10    0.0140   0.00286
## # ... with 1,990 more rows

# Save the above results under the new object, NHANES_Ph_perm
NHANES_Ph_perm <- 
  NHANES_Ph %>%
  rep_sample_n(size = nrow(NHANES_Ph), reps = 2000) %>%
  mutate(Diabetes_perm = sample(Diabetes)) %>%
  group_by(replicate, RegularMarij) %>%
  summarize(prop_Diabetes = mean(Diabetes == "Yes", na.rm = TRUE),
            prop_Diabetes_perm = mean(Diabetes_perm == "Yes", na.rm = TRUE)) %>%
  summarize(diff_orig = diff(prop_Diabetes),
            diff_perm = diff(prop_Diabetes_perm))  

# Histogram of permuted differences
ggplot(NHANES_Ph_perm, aes(x = diff_perm)) + 
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = diff_orig), col = "red")

Interpretation

Q7 Would you reject the null hypothesis: Diabetes and smoking marijuana are unrelated?

Hint: Read the interpretation below for the case of diabetes and insomnia.

0.014 > the critical value at 95% (0.012). I am 95% confident that those who smoke marijuana are more likely to be diabetic than those who dont.

# Find the 0.90, 0.95, and 0.99 quantiles of diff_perm
NHANES_Ph_perm %>% 
  summarize(q.90 = quantile(diff_perm, p = 0.9),
            q.95 = quantile(diff_perm, p = 0.95),
            q.99 = quantile(diff_perm, p = 0.99))
## # A tibble: 1 x 3
##      q.90   q.95   q.99
##     <dbl>  <dbl>  <dbl>
## 1 0.00994 0.0130 0.0180

Interpretation