Introduction

The NHANES dataset contains survey data collected by the US National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960s. Since 1999, approximately 5,000 individuals of all ages have been interviewed in their homes each year and have completed the health examination component of the survey.

The data ship with the NHANES R package. Install the package once, then load it:

#install.packages("NHANES")
library(NHANES)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
data(NHANES)
?NHANES
## starting httpd help server ... done

Using the NHANES dataset:

colnames(NHANES)
##  [1] "ID"               "SurveyYr"         "Gender"           "Age"             
##  [5] "AgeDecade"        "AgeMonths"        "Race1"            "Race3"           
##  [9] "Education"        "MaritalStatus"    "HHIncome"         "HHIncomeMid"     
## [13] "Poverty"          "HomeRooms"        "HomeOwn"          "Work"            
## [17] "Weight"           "Length"           "HeadCirc"         "Height"          
## [21] "BMI"              "BMICatUnder20yrs" "BMI_WHO"          "Pulse"           
## [25] "BPSysAve"         "BPDiaAve"         "BPSys1"           "BPDia1"          
## [29] "BPSys2"           "BPDia2"           "BPSys3"           "BPDia3"          
## [33] "Testosterone"     "DirectChol"       "TotChol"          "UrineVol1"       
## [37] "UrineFlow1"       "UrineVol2"        "UrineFlow2"       "Diabetes"        
## [41] "DiabetesAge"      "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad" 
## [45] "LittleInterest"   "Depressed"        "nPregnancies"     "nBabies"         
## [49] "Age1stBaby"       "SleepHrsNight"    "SleepTrouble"     "PhysActive"      
## [53] "PhysActiveDays"   "TVHrsDay"         "CompHrsDay"       "TVHrsDayChild"   
## [57] "CompHrsDayChild"  "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"     
## [61] "SmokeNow"         "Smoke100"         "Smoke100n"        "SmokeAge"        
## [65] "Marijuana"        "AgeFirstMarij"    "RegularMarij"     "AgeRegMarij"     
## [69] "HardDrugs"        "SexEver"          "SexAge"           "SexNumPartnLife" 
## [73] "SexNumPartYear"   "SameSex"          "SexOrientation"   "PregnantNow"

1.

Provide descriptive statistics of this dataset, focusing on Age, BMI, and Height.

nhanes_df <- NHANES %>%
  select("Height", "BMI", "Age", "Gender")
head(nhanes_df)
## # A tibble: 6 × 4
##   Height   BMI   Age Gender
##    <dbl> <dbl> <int> <fct> 
## 1   165.  32.2    34 male  
## 2   165.  32.2    34 male  
## 3   165.  32.2    34 male  
## 4   105.  15.3     4 male  
## 5   168.  30.6    49 female
## 6   133.  16.8     9 male

What type of data is it?

str(nhanes_df)
## tibble [10,000 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Height: num [1:10000] 165 165 165 105 168 ...
##  $ BMI   : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
##  $ Age   : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
##  $ Gender: Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...

Are there any missing values in the data?

sum(is.na(nhanes_df))
## [1] 719

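The selected columns contain 719 missing values (NAs) in total. Functions such as mean() and sd() return NA when their input contains an NA, so each call below passes na.rm = TRUE to drop the missing values before computing.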
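The total alone does not say which columns the NAs sit in. A minimal sketch of a per-column count, using a small toy data frame with hypothetical values (not NHANES rows) so that it runs without the package:

```r
# Toy data frame with hypothetical values, for illustration only
toy_df <- data.frame(
  Height = c(165.0, NA, 168.4, 133.1),
  BMI    = c(32.2, 15.3, NA, NA),
  Age    = c(34L, 4L, 49L, 9L)
)

# Number of NAs in each column
na_per_column <- colSums(is.na(toy_df))
na_per_column

# Number of rows with no NA in any column
complete_rows <- sum(complete.cases(toy_df))
complete_rows
```

Applied to nhanes_df, the same colSums(is.na(...)) call should break the 719 down by column. The ggplot2 warnings later in this document report 353 rows dropped for Height and 366 for BMI, which together account for all 719.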

Mean:

mean_age <- mean(nhanes_df$Age, na.rm=TRUE)
mean_age
## [1] 36.7421
mean_height <- mean(nhanes_df$Height, na.rm=TRUE)
mean_height
## [1] 161.8778
mean_bmi <- mean(nhanes_df$BMI, na.rm=TRUE)
mean_bmi
## [1] 26.66014

mean age = 36.7 years; mean height = 161.9 cm; mean BMI = 26.7 kg/m^2

Median

median(nhanes_df$Age, na.rm=TRUE)
## [1] 36
median(nhanes_df$Height, na.rm=TRUE)
## [1] 166
median(nhanes_df$BMI, na.rm=TRUE)
## [1] 25.98

median age = 36 years; median height = 166 cm; median BMI = 26.0 kg/m^2

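The mean (36.7) and median (36) of age are close in value, which suggests that the age distribution is roughly symmetric. Closeness alone does not show that a variable is normally distributed or free of outliers. For height, the mean (161.9 cm) falls below the median (166 cm), which points to a negative skew. For BMI, the mean (26.7) sits above the median (26.0), which points to a positive skew.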
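The mean-versus-median comparison can be turned into a rough number. The sketch below uses Pearson's second (median) skewness coefficient, 3 * (mean - median) / sd, which is a swapped-in quick check and not the moment-based skew that psych::describe() reports. The inputs are the summary values printed in this document; the data frame and column names are illustrative.

```r
# Summary values copied from the printed output in this document
stats <- data.frame(
  variable = c("Age", "Height", "BMI"),
  mean     = c(36.7421, 161.8778, 26.66014),
  median   = c(36, 166, 25.98),
  sd       = c(22.39757, 20.18657, 7.376579)
)

# Pearson's second skewness coefficient: sign follows (mean - median)
stats$pearson_skew <- 3 * (stats$mean - stats$median) / stats$sd
stats
```

With these inputs the coefficient is slightly positive for Age, negative for Height, and positive for BMI. The signs agree with the skew column of the describe() output further down, although the magnitudes differ because the two statistics are defined differently.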

Range

range(nhanes_df$Age, na.rm=TRUE)
## [1]  0 80
max(nhanes_df$Age, na.rm=TRUE) - min(nhanes_df$Age, na.rm=TRUE)
## [1] 80
range(nhanes_df$Height, na.rm=TRUE)
## [1]  83.6 200.4
max(nhanes_df$Height, na.rm=TRUE) - min(nhanes_df$Height, na.rm=TRUE)
## [1] 116.8
range(nhanes_df$BMI, na.rm=TRUE)
## [1] 12.88 81.25
max(nhanes_df$BMI, na.rm=TRUE) - min(nhanes_df$BMI, na.rm=TRUE)
## [1] 68.37

Age ranged from 0 to 80 years (subjects 80 years or older were recorded as 80). Height ranged from 83.6 cm to 200.4 cm, a range of 116.8 cm. BMI ranged from 12.9 kg/m^2 to 81.3 kg/m^2, a range of 68.4 kg/m^2.
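The max() minus min() pattern above can be written more compactly with diff(range()). A small sketch using only the reported Height extremes plus one NA to mimic missing data (the vector name is illustrative):

```r
# Reported Height extremes in cm, with an NA standing in for missing values
height_extremes <- c(83.6, NA, 200.4)

# range() returns c(min, max); diff() subtracts the first from the second
height_span <- diff(range(height_extremes, na.rm = TRUE))
height_span
```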

Variance and Standard Deviation

var(nhanes_df$Age, na.rm=TRUE)
## [1] 501.651
sd(nhanes_df$Age, na.rm=TRUE)
## [1] 22.39757
var(nhanes_df$Height, na.rm=TRUE)
## [1] 407.4975
sd(nhanes_df$Height, na.rm=TRUE)
## [1] 20.18657
var(nhanes_df$BMI, na.rm=TRUE)
## [1] 54.41392
sd(nhanes_df$BMI, na.rm=TRUE)
## [1] 7.376579

Age variance is 501.7 years^2 and the standard deviation is 22.4 years. Height variance is 407.5 cm^2 and the standard deviation is 20.2 cm. BMI variance is 54.4 (kg/m^2)^2 and the standard deviation is 7.4 kg/m^2.
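The standard deviation is the square root of the variance, which is why it comes back in the original units while the variance is in squared units. A quick check against the numbers printed above (the vector names are illustrative):

```r
# Variances reported by var() above
variances <- c(Age = 501.651, Height = 407.4975, BMI = 54.41392)

# Taking the square root should reproduce the sd() results
sds <- sqrt(variances)
round(sds, 2)
```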

Frequency Distribution

Age Histogram

ggplot(nhanes_df, aes(x = Age)) +
  geom_histogram(binwidth = 10)

The age histogram reflects the large standard deviation: the distribution is platykurtic, meaning flatter than a normal curve. I selected a binwidth of 10 because ages are often grouped by decade.

Height Histogram

ggplot(nhanes_df, aes(x = Height)) +
  geom_histogram(binwidth = 30)
## Warning: Removed 353 rows containing non-finite values (`stat_bin()`).

I selected the binwidth of 30 cm because 30 cm is approximately 1 foot. Height has a negative skew.
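The binwidth controls how many bars the histogram can show. A rough sketch of the arithmetic, dividing the reported Height range by a few candidate binwidths (the candidates other than 30 are my own picks for comparison, and ggplot2's bin alignment may add one bin at an edge):

```r
height_range <- 200.4 - 83.6      # 116.8 cm, from the range output above
binwidths    <- c(30, 10, 5)      # candidate binwidths in cm

# Approximate number of bins needed to cover the range
approx_bins <- ceiling(height_range / binwidths)
setNames(approx_bins, paste0(binwidths, " cm"))
```

A 30 cm binwidth leaves only about four bars, which can hide the shape of the distribution. The histogram in question 3 uses 10 cm, which gives about twelve.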

BMI Histogram

ggplot(nhanes_df, aes(x = BMI)) +
  geom_histogram(binwidth = 2.5)
## Warning: Removed 366 rows containing non-finite values (`stat_bin()`).

BMI has a positive skew.

describe function (just for fun)

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(NHANES)
##                   vars     n     mean       sd   median  trimmed      mad
## ID                   1 10000 61944.64  5871.17 62159.50 61983.84  7477.49
## SurveyYr*            2 10000     1.50     0.50     1.50     1.50     0.74
## Gender*              3 10000     1.50     0.50     1.00     1.50     0.00
## Age                  4 10000    36.74    22.40    36.00    36.11    26.69
## AgeDecade*           5  9667     4.09     2.12     4.00     4.03     2.97
## AgeMonths            6  4962   420.12   259.04   418.00   414.35   314.31
## Race1*               7 10000     3.50     1.12     4.00     3.65     0.00
## Race3*               8  5000     4.21     1.33     5.00     4.42     0.00
## Education*           9  7221     3.65     1.20     4.00     3.76     1.48
## MaritalStatus*      10  7231     3.16     1.14     3.00     3.13     0.00
## HHIncome*           11  9189     8.16     3.28     8.00     8.41     4.45
## HHIncomeMid         12  9189 57206.17 33020.28 50000.00 57878.76 48184.50
## Poverty             13  9274     2.80     1.68     2.70     2.83     2.39
## HomeRooms           14  9931     6.25     2.28     6.00     6.10     1.48
## HomeOwn*            15  9937     1.38     0.53     1.00     1.32     0.00
## Work*               16  7771     2.55     0.57     3.00     2.62     0.00
## Weight              17  9922    70.98    29.13    72.70    71.66    24.31
## Length              18   543    85.02    13.71    87.00    85.82    14.83
## HeadCirc            19    88    41.18     2.31    41.45    41.29     2.52
## Height              20  9647   161.88    20.19   166.00   165.20    13.05
## BMI                 21  9634    26.66     7.38    25.98    26.18     6.85
## BMICatUnder20yrs*   22  1274     2.46     0.83     2.00     2.37     0.00
## BMI_WHO*            23  9603     2.72     1.02     3.00     2.77     1.48
## Pulse               24  8563    73.56    12.16    72.00    73.07    11.86
## BPSysAve            25  8551   118.15    17.25   116.00   116.80    14.83
## BPDiaAve            26  8551    67.48    14.35    69.00    68.37    11.86
## BPSys1              27  8237   119.09    17.50   116.00   117.71    14.83
## BPDia1              28  8237    68.28    13.78    70.00    68.86    11.86
## BPSys2              29  8353   118.48    17.49   116.00   117.11    14.83
## BPDia2              30  8353    67.66    14.42    68.00    68.49    11.86
## BPSys3              31  8365   117.93    17.18   116.00   116.61    14.83
## BPDia3              32  8365    67.30    14.96    68.00    68.32    11.86
## Testosterone        33  4126   197.90   226.50    43.82   164.03    59.33
## DirectChol          34  8474     1.36     0.40     1.29     1.33     0.36
## TotChol             35  8474     4.88     1.08     4.78     4.82     1.07
## UrineVol1           36  9013   118.52    90.34    94.00   106.41    77.10
## UrineFlow1          37  8397     0.98     0.95     0.70     0.81     0.53
## UrineVol2           38  1478   119.68    90.16    95.00   107.99    77.10
## UrineFlow2          39  1476     1.15     1.07     0.76     0.96     0.56
## Diabetes*           40  9858     1.08     0.27     1.00     1.00     0.00
## DiabetesAge         41   629    48.42    15.68    50.00    49.26    13.34
## HealthGen*          42  7539     2.62     0.94     3.00     2.62     1.48
## DaysPhysHlthBad     43  7532     3.33     7.40     0.00     1.22     0.00
## DaysMentHlthBad     44  7534     4.13     7.83     0.00     2.04     0.00
## LittleInterest*     45  6667     1.30     0.58     1.00     1.17     0.00
## Depressed*          46  6673     1.28     0.57     1.00     1.14     0.00
## nPregnancies        47  2604     3.03     1.80     3.00     2.82     1.48
## nBabies             48  2416     2.46     1.32     2.00     2.29     1.48
## Age1stBaby          49  1884    22.65     4.77    22.00    22.30     4.45
## SleepHrsNight       50  7755     6.93     1.35     7.00     6.97     1.48
## SleepTrouble*       51  7772     1.25     0.44     1.00     1.19     0.00
## PhysActive*         52  8326     1.56     0.50     2.00     1.57     0.00
## PhysActiveDays      53  4663     3.74     1.84     3.00     3.68     1.48
## TVHrsDay*           54  4859     4.25     1.60     4.00     4.22     1.48
## CompHrsDay*         55  4863     2.84     1.64     2.00     2.63     1.48
## TVHrsDayChild       56   653     1.94     1.43     2.00     1.81     1.48
## CompHrsDayChild     57   653     2.20     2.52     1.00     2.00     1.48
## Alcohol12PlusYr*    58  6580     1.79     0.41     2.00     1.87     0.00
## AlcoholDay          59  4914     2.91     3.18     2.00     2.34     1.48
## AlcoholYear         60  5922    75.10   103.03    24.00    53.20    35.58
## SmokeNow*           61  3211     1.46     0.50     1.00     1.45     0.00
## Smoke100*           62  7235     1.44     0.50     1.00     1.43     0.00
## Smoke100n*          63  7235     1.44     0.50     1.00     1.43     0.00
## SmokeAge            64  3080    17.83     5.33    17.00    17.15     2.97
## Marijuana*          65  4941     1.59     0.49     2.00     1.61     0.00
## AgeFirstMarij       66  2891    17.02     3.90    16.00    16.68     2.97
## RegularMarij*       67  4941     1.28     0.45     1.00     1.22     0.00
## AgeRegMarij         68  1366    17.69     4.81    17.00    17.10     2.97
## HardDrugs*          69  5765     1.18     0.39     1.00     1.11     0.00
## SexEver*            70  5767     1.96     0.19     2.00     2.00     0.00
## SexAge              71  5540    17.43     3.72    17.00    17.07     2.97
## SexNumPartnLife     72  5725    15.09    57.85     5.00     7.55     5.93
## SexNumPartYear      73  4928     1.34     2.78     1.00     0.98     0.00
## SameSex*            74  5768     1.07     0.26     1.00     1.00     0.00
## SexOrientation*     75  4842     1.99     0.21     2.00     2.00     0.00
## PregnantNow*        76  1696     1.99     0.27     2.00     2.00     0.00
##                        min       max    range  skew kurtosis     se
## ID                51624.00  71915.00 20291.00 -0.05    -1.20  58.71
## SurveyYr*             1.00      2.00     1.00  0.00    -2.00   0.01
## Gender*               1.00      2.00     1.00  0.01    -2.00   0.01
## Age                   0.00     80.00    80.00  0.16    -1.01   0.22
## AgeDecade*            1.00      8.00     7.00  0.14    -1.08   0.02
## AgeMonths             0.00    959.00   959.00  0.13    -1.03   3.68
## Race1*                1.00      5.00     4.00 -1.24     0.46   0.01
## Race3*                1.00      6.00     5.00 -1.20     0.09   0.02
## Education*            1.00      5.00     4.00 -0.60    -0.56   0.01
## MaritalStatus*        1.00      6.00     5.00  0.41     0.98   0.01
## HHIncome*             1.00     12.00    11.00 -0.37    -1.05   0.03
## HHIncomeMid        2500.00 100000.00 97500.00  0.03    -1.48 344.47
## Poverty               0.00      5.00     5.00  0.05    -1.47   0.02
## HomeRooms             1.00     13.00    12.00  0.55     0.18   0.02
## HomeOwn*              1.00      3.00     2.00  0.96    -0.19   0.01
## Work*                 1.00      3.00     2.00 -0.85    -0.28   0.01
## Weight                2.80    230.70   227.90 -0.03     0.56   0.29
## Length               47.10    112.20    65.10 -0.46    -0.60   0.59
## HeadCirc             34.20     45.40    11.20 -0.43    -0.31   0.25
## Height               83.60    200.40   116.80 -1.66     2.89   0.21
## BMI                  12.88     81.25    68.37  0.90     2.20   0.08
## BMICatUnder20yrs*     1.00      4.00     3.00  0.84    -0.37   0.02
## BMI_WHO*              1.00      4.00     3.00 -0.16    -1.14   0.01
## Pulse                40.00    136.00    96.00  0.45     0.46   0.13
## BPSysAve             76.00    226.00   150.00  1.05     2.41   0.19
## BPDiaAve              0.00    116.00   116.00 -1.13     3.91   0.16
## BPSys1               72.00    232.00   160.00  1.00     2.15   0.19
## BPDia1                0.00    118.00   118.00 -0.92     3.74   0.15
## BPSys2               76.00    226.00   150.00  1.06     2.53   0.19
## BPDia2                0.00    118.00   118.00 -1.17     4.38   0.16
## BPSys3               76.00    226.00   150.00  1.02     2.29   0.19
## BPDia3                0.00    116.00   116.00 -1.35     4.89   0.16
## Testosterone          0.25   1795.60  1795.35  1.10     0.82   3.53
## DirectChol            0.39      4.03     3.64  1.02     2.07   0.00
## TotChol               1.53     13.65    12.12  0.70     1.39   0.01
## UrineVol1             0.00    510.00   510.00  1.12     0.75   0.95
## UrineFlow1            0.00     17.17    17.17  3.47    24.62   0.01
## UrineVol2             0.00    409.00   409.00  1.06     0.50   2.35
## UrineFlow2            0.00     13.69    13.69  2.78    16.23   0.03
## Diabetes*             1.00      2.00     1.00  3.17     8.05   0.00
## DiabetesAge           1.00     80.00    79.00 -0.61     0.68   0.63
## HealthGen*            1.00      5.00     4.00  0.17    -0.27   0.01
## DaysPhysHlthBad       0.00     30.00    30.00  2.71     6.46   0.09
## DaysMentHlthBad       0.00     30.00    30.00  2.31     4.42   0.09
## LittleInterest*       1.00      3.00     2.00  1.80     2.11   0.01
## Depressed*            1.00      3.00     2.00  1.95     2.66   0.01
## nPregnancies          1.00     32.00    31.00  2.74    27.74   0.04
## nBabies               0.00     12.00    12.00  1.57     4.54   0.03
## Age1stBaby           14.00     39.00    25.00  0.62    -0.11   0.11
## SleepHrsNight         2.00     12.00    10.00 -0.18     0.79   0.02
## SleepTrouble*         1.00      2.00     1.00  1.13    -0.72   0.00
## PhysActive*           1.00      2.00     1.00 -0.24    -1.94   0.01
## PhysActiveDays        1.00      7.00     6.00  0.32    -0.90   0.03
## TVHrsDay*             1.00      7.00     6.00  0.16    -0.80   0.02
## CompHrsDay*           1.00      7.00     6.00  0.94     0.23   0.02
## TVHrsDayChild         0.00      6.00     6.00  0.73     0.07   0.06
## CompHrsDayChild       0.00      6.00     6.00  0.69    -1.30   0.10
## Alcohol12PlusYr*      1.00      2.00     1.00 -1.44     0.07   0.01
## AlcoholDay            1.00     82.00    81.00  8.22   144.17   0.05
## AlcoholYear           0.00    364.00   364.00  1.58     1.48   1.34
## SmokeNow*             1.00      2.00     1.00  0.17    -1.97   0.01
## Smoke100*             1.00      2.00     1.00  0.23    -1.95   0.01
## Smoke100n*            1.00      2.00     1.00  0.23    -1.95   0.01
## SmokeAge              6.00     72.00    66.00  2.78    14.96   0.10
## Marijuana*            1.00      2.00     1.00 -0.35    -1.88   0.01
## AgeFirstMarij         1.00     48.00    47.00  2.10    11.37   0.07
## RegularMarij*         1.00      2.00     1.00  1.00    -1.00   0.01
## AgeRegMarij           5.00     52.00    47.00  2.82    13.33   0.13
## HardDrugs*            1.00      2.00     1.00  1.62     0.64   0.01
## SexEver*              1.00      2.00     1.00 -4.78    20.89   0.00
## SexAge                9.00     50.00    41.00  1.88     8.53   0.05
## SexNumPartnLife       0.00   2000.00  2000.00 16.82   384.81   0.76
## SexNumPartYear        0.00     69.00    69.00 13.58   252.27   0.04
## SameSex*              1.00      2.00     1.00  3.31     8.97   0.00
## SexOrientation*       1.00      3.00     2.00 -0.71    20.68   0.00
## PregnantNow*          1.00      3.00     2.00 -0.50    10.73   0.01

table() reports the number of observations at each level of Gender, a categorical variable stored as a factor.

table(NHANES$Gender)
## 
## female   male 
##   5020   4980
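Counts are often easier to read as proportions. A minimal sketch using prop.table() on the two counts printed above (the vector names are illustrative):

```r
# Counts copied from the table() output above
gender_counts <- c(female = 5020, male = 4980)

# Share of each level, as a fraction and as a percentage
gender_props <- prop.table(gender_counts)
round(100 * gender_props, 1)
```

On the real data, prop.table(table(NHANES$Gender)) should give the same split without retyping the counts.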

z-scores for BMI

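Suppose a BMI is 30, the mean BMI is 26.7, and the standard deviation is 7.4. What is the z-score, and what proportion of values falls below a BMI of 30? Because 30 is larger than the mean of 26.7, the z-score should be positive and the proportion below 30 should be larger than 50%.

pnorm(30, 26.7, 7.4)
## [1] 0.6721819

pnorm() returns a cumulative probability rather than a z-score. The z-score is (30 - 26.7) / 7.4 ≈ 0.45. Under a normal approximation, about 67.2% of the data lies to the left of a BMI of 30; because BMI has a positive skew, this is only a rough estimate.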
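The relationship between a z-score and pnorm() can be checked directly. A minimal sketch using the rounded mean and standard deviation quoted above (the variable names are illustrative):

```r
bmi_value <- 30
bmi_mean  <- 26.7   # rounded mean BMI from the text
bmi_sd    <- 7.4    # rounded standard deviation from the text

# z-score: distance from the mean in standard-deviation units
z <- (bmi_value - bmi_mean) / bmi_sd
z

# Cumulative probability under a normal model, computed two equivalent ways
p_from_z   <- pnorm(z)
p_from_raw <- pnorm(bmi_value, mean = bmi_mean, sd = bmi_sd)
c(p_from_z, p_from_raw)
```

The z-score is a distance in standard deviations; pnorm() converts it into the share of a normal distribution lying below that point.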

2.

Create a scatter plot of age (x-axis) vs. BMI (y-axis). Make the size of the points = 0.5. Indicate women and men in different colors. Include a title, “BMI by Age for Americans”. Put a line on the graph for BMI = 30 (obesity level).

nhanes_df %>%
  ggplot(aes(Age, BMI)) +
    geom_point(aes(col = Gender), size = 0.5) +
    labs(title = "BMI by Age for Americans") +
    # horizontal reference line at the obesity threshold
    geom_hline(yintercept = 30) +
    # annotate() draws the label and pointer once, not once per data row
    annotate("label", x = 10, y = 70, label = "30 BMI (obesity level)") +
    annotate("segment", x = 10, y = 67, xend = 5, yend = 30)
## Warning: Removed 366 rows containing missing values (`geom_point()`).

3.

Create a histogram of heights (use an appropriate binwidth(), fill = "blue", x label = "Heights in cm")

nhanes_df %>%
  ggplot(aes(x = Height)) +
    geom_histogram(binwidth = 10, fill = "blue") +
    labs(x = "Heights in cm")
## Warning: Removed 353 rows containing non-finite values (`stat_bin()`).

4.

Create a smooth density plot for male & female heights (use different fill colors so you can see both distributions and adjust alpha so you can see both distributions where they overlap)

nhanes_df %>%
  ggplot(aes(x = Height, fill=Gender)) +
  geom_density(alpha = 0.4)
## Warning: Removed 353 rows containing non-finite values (`stat_density()`).

5.

Create a histogram of female heights and overlay a standard normal curve.

female_heights<- nhanes_df$Height[nhanes_df$Gender=="female"]
mean(female_heights, na.rm=TRUE)
## [1] 156.6159
sd(female_heights, na.rm=TRUE)
## [1] 16.79195
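nhanes_df %>%
  filter(Gender == "female") %>%
  ggplot(aes(Height)) +
    geom_density(fill = "red") +
    # overlay a normal curve using the sample mean and sd computed above
    stat_function(fun = dnorm,
                  args = list(mean = mean(female_heights, na.rm = TRUE),
                              sd = sd(female_heights, na.rm = TRUE))) +
    labs(x = "Female Heights (cm)")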
## Warning: Removed 173 rows containing non-finite values (`stat_density()`).
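The task asks for a histogram, while the code above draws a density curve. To overlay a normal curve on a histogram, the bars have to be on the density scale so that both share the same y-axis. A minimal base R sketch of that idea, using simulated heights with hypothetical parameters (not NHANES values) so that it runs on its own:

```r
set.seed(42)
# Hypothetical heights in cm, simulated for illustration only
heights <- rnorm(500, mean = 160, sd = 7)

# freq = FALSE puts the bars on the density scale (total bar area = 1)
h <- hist(heights, freq = FALSE, breaks = 20, col = "red",
          xlab = "Female Heights (cm)", main = "Histogram with normal overlay")

# Normal curve with the sample mean and sd, drawn on top of the histogram
curve(dnorm(x, mean = mean(heights), sd = sd(heights)), add = TRUE, lwd = 2)
```

In ggplot2 the equivalent step is mapping y to after_stat(density) inside geom_histogram() before adding stat_function().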

Are female heights representative of a standard normal curve in this dataset? Why or why not?

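The female heights in this dataset do not follow a normal curve. The density is leptokurtic, or “too pointy”, and has a negative skew, with a long tail toward shorter heights. Strictly speaking, a standard normal curve has mean 0 and standard deviation 1; the overlay here is a normal curve with the sample mean (156.6 cm) and standard deviation (16.8 cm).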
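One plausible contributor to the negative skew, not confirmed by the analysis above, is that the female subset mixes children and adults (Age runs from 0 to 80 in this dataset). The simulation below uses entirely hypothetical numbers, not NHANES values, to sketch how such a mixture produces a long left tail. moment_skew() is a small helper defined here, not a base R function.

```r
set.seed(1)
# Hypothetical groups: adults roughly normal, children spread over shorter heights
adults   <- rnorm(800, mean = 162, sd = 7)
children <- runif(200, min = 85, max = 160)
mixed    <- c(adults, children)

# Moment-based skewness: third central moment over the cubed standard deviation
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_adults <- moment_skew(adults)
skew_mixed  <- moment_skew(mixed)
c(adults = skew_adults, mixed = skew_mixed)
```

In this simulation the adults-only sample has skewness near zero, while the mixed sample is clearly negative. Repeating the density plot for adults only, for example after filter(Age >= 20), would be one way to test whether the same effect explains the real data.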