The NHANES dataset is survey data collected by the US National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960s. Since 1999, approximately 5,000 individuals of all ages have been interviewed in their homes every year and have completed the health examination component of the survey.
You can load it by installing the NHANES package and attaching it:
#install.packages("NHANES")
library(NHANES)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
data(NHANES)
?NHANES
## starting httpd help server ... done
colnames(NHANES)
## [1] "ID" "SurveyYr" "Gender" "Age"
## [5] "AgeDecade" "AgeMonths" "Race1" "Race3"
## [9] "Education" "MaritalStatus" "HHIncome" "HHIncomeMid"
## [13] "Poverty" "HomeRooms" "HomeOwn" "Work"
## [17] "Weight" "Length" "HeadCirc" "Height"
## [21] "BMI" "BMICatUnder20yrs" "BMI_WHO" "Pulse"
## [25] "BPSysAve" "BPDiaAve" "BPSys1" "BPDia1"
## [29] "BPSys2" "BPDia2" "BPSys3" "BPDia3"
## [33] "Testosterone" "DirectChol" "TotChol" "UrineVol1"
## [37] "UrineFlow1" "UrineVol2" "UrineFlow2" "Diabetes"
## [41] "DiabetesAge" "HealthGen" "DaysPhysHlthBad" "DaysMentHlthBad"
## [45] "LittleInterest" "Depressed" "nPregnancies" "nBabies"
## [49] "Age1stBaby" "SleepHrsNight" "SleepTrouble" "PhysActive"
## [53] "PhysActiveDays" "TVHrsDay" "CompHrsDay" "TVHrsDayChild"
## [57] "CompHrsDayChild" "Alcohol12PlusYr" "AlcoholDay" "AlcoholYear"
## [61] "SmokeNow" "Smoke100" "Smoke100n" "SmokeAge"
## [65] "Marijuana" "AgeFirstMarij" "RegularMarij" "AgeRegMarij"
## [69] "HardDrugs" "SexEver" "SexAge" "SexNumPartnLife"
## [73] "SexNumPartYear" "SameSex" "SexOrientation" "PregnantNow"
Provide descriptive statistics of this dataset, focusing on
Age, BMI, and Height.
nhanes_df <- NHANES %>%
  select("Height", "BMI", "Age", "Gender")
head(nhanes_df)
## # A tibble: 6 × 4
## Height BMI Age Gender
## <dbl> <dbl> <int> <fct>
## 1 165. 32.2 34 male
## 2 165. 32.2 34 male
## 3 165. 32.2 34 male
## 4 105. 15.3 4 male
## 5 168. 30.6 49 female
## 6 133. 16.8 9 male
str(nhanes_df)
## tibble [10,000 × 4] (S3: tbl_df/tbl/data.frame)
## $ Height: num [1:10000] 165 165 165 105 168 ...
## $ BMI : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
## $ Age : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
## $ Gender: Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
sum(is.na(nhanes_df))
## [1] 719
There are 719 NAs in nhanes_df. I will pass na.rm = TRUE so that the summary functions drop the missing values; without it they return NA.
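A single total does not show which columns the missing values sit in. A minimal sketch of a per-column count follows; the data frame `toy` is made up for illustration and is not part of NHANES.

```r
# Toy data frame (made up for illustration) with missing values in two columns
toy <- data.frame(
  Height = c(165.1, NA, 168.4, 133.0),
  BMI    = c(32.2, 15.3, NA, NA),
  Age    = c(34L, 4L, 49L, 9L)
)

# Count missing values column by column
na_per_column <- colSums(is.na(toy))
na_per_column

# The per-column counts add up to the overall total
sum(na_per_column) == sum(is.na(toy))
```

Applied to the data here, the same call would be colSums(is.na(nhanes_df)).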
mean_age <- mean(nhanes_df$Age, na.rm=TRUE)
mean_age
## [1] 36.7421
mean_height <- mean(nhanes_df$Height, na.rm=TRUE)
mean_height
## [1] 161.8778
mean_bmi <- mean(nhanes_df$BMI, na.rm=TRUE)
mean_bmi
## [1] 26.66014
Mean age = 36.7 years, mean height = 161.9 cm, mean BMI = 26.7 kg/m^2.
median(nhanes_df$Age, na.rm=TRUE)
## [1] 36
median(nhanes_df$Height, na.rm=TRUE)
## [1] 166
median(nhanes_df$BMI, na.rm=TRUE)
## [1] 25.98
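Median age = 36 years, median height = 166 cm, median BMI = 26.0 kg/m^2.
The mean (36.7) and median (36) of age are close in value, which suggests that the age distribution is roughly symmetric and not dominated by outliers. Closeness of the mean and median does not by itself show that a variable is normally distributed. For height, the mean (161.9 cm) is below the median (166 cm), which points to a negative skew.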
range(nhanes_df$Age, na.rm=TRUE)
## [1] 0 80
max(nhanes_df$Age, na.rm=TRUE) - min(nhanes_df$Age, na.rm=TRUE)
## [1] 80
range(nhanes_df$Height, na.rm=TRUE)
## [1] 83.6 200.4
max(nhanes_df$Height, na.rm=TRUE) - min(nhanes_df$Height, na.rm=TRUE)
## [1] 116.8
range(nhanes_df$BMI, na.rm=TRUE)
## [1] 12.88 81.25
max(nhanes_df$BMI, na.rm=TRUE) - min(nhanes_df$BMI, na.rm=TRUE)
## [1] 68.37
Age ranged from 0 to 80. (Subjects 80 years or older were recorded as 80.) Height ranged from 83.6 cm to 200.4 cm which is a 116.8 cm range. BMI ranged from 12.9 kg/m^2 to 81.3 kg/m^2 which is a 68.4 kg/m^2 range.
var(nhanes_df$Age, na.rm=TRUE)
## [1] 501.651
sd(nhanes_df$Age, na.rm=TRUE)
## [1] 22.39757
var(nhanes_df$Height, na.rm=TRUE)
## [1] 407.4975
sd(nhanes_df$Height, na.rm=TRUE)
## [1] 20.18657
var(nhanes_df$BMI, na.rm=TRUE)
## [1] 54.41392
sd(nhanes_df$BMI, na.rm=TRUE)
## [1] 7.376579
Age variance is 501.7 and the standard deviation is 22.4. Height variance is 407.5 and the standard deviation is 20.2. BMI variance is 54.4 and the standard deviation is 7.4.
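The variance and standard deviation calls above can be collected into one table, which also makes it easy to confirm that each standard deviation is the square root of the corresponding variance. This is a sketch only; `toy` and `spread` are names made up for illustration.

```r
# Toy numeric columns (made up for illustration), including a missing value
toy <- data.frame(
  Age    = c(34, 4, 49, 9, 45),
  Height = c(165.1, 105.4, 168.4, NA, 166.7)
)

# One row per variable, one column per statistic
spread <- t(sapply(toy, function(x) {
  c(variance = var(x, na.rm = TRUE),
    sd       = sd(x, na.rm = TRUE))
}))
spread

# The standard deviation is the square root of the variance
all.equal(sqrt(spread[, "variance"]), spread[, "sd"])
```

The same relationship holds for the values reported above: the square root of the Age variance, 501.651, is about 22.4.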
ggplot(nhanes_df, aes(x = Age)) +
  geom_histogram(binwidth = 10)
The age histogram reflects the large standard deviation: it is platykurtic, or "too flat". I selected a binwidth of 10 because age is often grouped into 10-year bands.
ggplot(nhanes_df, aes(x = Height)) +
  geom_histogram(binwidth = 30)
## Warning: Removed 353 rows containing non-finite values (`stat_bin()`).
I selected a binwidth of 30 cm because 30 cm is approximately 1 foot. Height has a negative skew.
ggplot(nhanes_df, aes(x = BMI)) +
  geom_histogram(binwidth = 2.5)
## Warning: Removed 366 rows containing non-finite values (`stat_bin()`).
BMI has a positive skew.
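The direction of skew read off a histogram can be checked numerically. The sketch below uses a simple moment-based definition of skewness; the function name `skewness` and both vectors are made up for illustration, and packages such as psych may apply slightly different small-sample corrections, so values need not match theirs exactly.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2 (NAs dropped first)
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up vectors for illustration
left_tail  <- c(1, 8, 9, 9, 10, 10, 10, NA)  # long tail towards small values
right_tail <- c(1, 1, 1, 2, 2, 3, 10)        # long tail towards large values

skewness(left_tail)   # negative
skewness(right_tail)  # positive
```

A negative value corresponds to a longer left tail (as described for Height) and a positive value to a longer right tail (as described for BMI).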
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(NHANES)
## vars n mean sd median trimmed mad
## ID 1 10000 61944.64 5871.17 62159.50 61983.84 7477.49
## SurveyYr* 2 10000 1.50 0.50 1.50 1.50 0.74
## Gender* 3 10000 1.50 0.50 1.00 1.50 0.00
## Age 4 10000 36.74 22.40 36.00 36.11 26.69
## AgeDecade* 5 9667 4.09 2.12 4.00 4.03 2.97
## AgeMonths 6 4962 420.12 259.04 418.00 414.35 314.31
## Race1* 7 10000 3.50 1.12 4.00 3.65 0.00
## Race3* 8 5000 4.21 1.33 5.00 4.42 0.00
## Education* 9 7221 3.65 1.20 4.00 3.76 1.48
## MaritalStatus* 10 7231 3.16 1.14 3.00 3.13 0.00
## HHIncome* 11 9189 8.16 3.28 8.00 8.41 4.45
## HHIncomeMid 12 9189 57206.17 33020.28 50000.00 57878.76 48184.50
## Poverty 13 9274 2.80 1.68 2.70 2.83 2.39
## HomeRooms 14 9931 6.25 2.28 6.00 6.10 1.48
## HomeOwn* 15 9937 1.38 0.53 1.00 1.32 0.00
## Work* 16 7771 2.55 0.57 3.00 2.62 0.00
## Weight 17 9922 70.98 29.13 72.70 71.66 24.31
## Length 18 543 85.02 13.71 87.00 85.82 14.83
## HeadCirc 19 88 41.18 2.31 41.45 41.29 2.52
## Height 20 9647 161.88 20.19 166.00 165.20 13.05
## BMI 21 9634 26.66 7.38 25.98 26.18 6.85
## BMICatUnder20yrs* 22 1274 2.46 0.83 2.00 2.37 0.00
## BMI_WHO* 23 9603 2.72 1.02 3.00 2.77 1.48
## Pulse 24 8563 73.56 12.16 72.00 73.07 11.86
## BPSysAve 25 8551 118.15 17.25 116.00 116.80 14.83
## BPDiaAve 26 8551 67.48 14.35 69.00 68.37 11.86
## BPSys1 27 8237 119.09 17.50 116.00 117.71 14.83
## BPDia1 28 8237 68.28 13.78 70.00 68.86 11.86
## BPSys2 29 8353 118.48 17.49 116.00 117.11 14.83
## BPDia2 30 8353 67.66 14.42 68.00 68.49 11.86
## BPSys3 31 8365 117.93 17.18 116.00 116.61 14.83
## BPDia3 32 8365 67.30 14.96 68.00 68.32 11.86
## Testosterone 33 4126 197.90 226.50 43.82 164.03 59.33
## DirectChol 34 8474 1.36 0.40 1.29 1.33 0.36
## TotChol 35 8474 4.88 1.08 4.78 4.82 1.07
## UrineVol1 36 9013 118.52 90.34 94.00 106.41 77.10
## UrineFlow1 37 8397 0.98 0.95 0.70 0.81 0.53
## UrineVol2 38 1478 119.68 90.16 95.00 107.99 77.10
## UrineFlow2 39 1476 1.15 1.07 0.76 0.96 0.56
## Diabetes* 40 9858 1.08 0.27 1.00 1.00 0.00
## DiabetesAge 41 629 48.42 15.68 50.00 49.26 13.34
## HealthGen* 42 7539 2.62 0.94 3.00 2.62 1.48
## DaysPhysHlthBad 43 7532 3.33 7.40 0.00 1.22 0.00
## DaysMentHlthBad 44 7534 4.13 7.83 0.00 2.04 0.00
## LittleInterest* 45 6667 1.30 0.58 1.00 1.17 0.00
## Depressed* 46 6673 1.28 0.57 1.00 1.14 0.00
## nPregnancies 47 2604 3.03 1.80 3.00 2.82 1.48
## nBabies 48 2416 2.46 1.32 2.00 2.29 1.48
## Age1stBaby 49 1884 22.65 4.77 22.00 22.30 4.45
## SleepHrsNight 50 7755 6.93 1.35 7.00 6.97 1.48
## SleepTrouble* 51 7772 1.25 0.44 1.00 1.19 0.00
## PhysActive* 52 8326 1.56 0.50 2.00 1.57 0.00
## PhysActiveDays 53 4663 3.74 1.84 3.00 3.68 1.48
## TVHrsDay* 54 4859 4.25 1.60 4.00 4.22 1.48
## CompHrsDay* 55 4863 2.84 1.64 2.00 2.63 1.48
## TVHrsDayChild 56 653 1.94 1.43 2.00 1.81 1.48
## CompHrsDayChild 57 653 2.20 2.52 1.00 2.00 1.48
## Alcohol12PlusYr* 58 6580 1.79 0.41 2.00 1.87 0.00
## AlcoholDay 59 4914 2.91 3.18 2.00 2.34 1.48
## AlcoholYear 60 5922 75.10 103.03 24.00 53.20 35.58
## SmokeNow* 61 3211 1.46 0.50 1.00 1.45 0.00
## Smoke100* 62 7235 1.44 0.50 1.00 1.43 0.00
## Smoke100n* 63 7235 1.44 0.50 1.00 1.43 0.00
## SmokeAge 64 3080 17.83 5.33 17.00 17.15 2.97
## Marijuana* 65 4941 1.59 0.49 2.00 1.61 0.00
## AgeFirstMarij 66 2891 17.02 3.90 16.00 16.68 2.97
## RegularMarij* 67 4941 1.28 0.45 1.00 1.22 0.00
## AgeRegMarij 68 1366 17.69 4.81 17.00 17.10 2.97
## HardDrugs* 69 5765 1.18 0.39 1.00 1.11 0.00
## SexEver* 70 5767 1.96 0.19 2.00 2.00 0.00
## SexAge 71 5540 17.43 3.72 17.00 17.07 2.97
## SexNumPartnLife 72 5725 15.09 57.85 5.00 7.55 5.93
## SexNumPartYear 73 4928 1.34 2.78 1.00 0.98 0.00
## SameSex* 74 5768 1.07 0.26 1.00 1.00 0.00
## SexOrientation* 75 4842 1.99 0.21 2.00 2.00 0.00
## PregnantNow* 76 1696 1.99 0.27 2.00 2.00 0.00
## min max range skew kurtosis se
## ID 51624.00 71915.00 20291.00 -0.05 -1.20 58.71
## SurveyYr* 1.00 2.00 1.00 0.00 -2.00 0.01
## Gender* 1.00 2.00 1.00 0.01 -2.00 0.01
## Age 0.00 80.00 80.00 0.16 -1.01 0.22
## AgeDecade* 1.00 8.00 7.00 0.14 -1.08 0.02
## AgeMonths 0.00 959.00 959.00 0.13 -1.03 3.68
## Race1* 1.00 5.00 4.00 -1.24 0.46 0.01
## Race3* 1.00 6.00 5.00 -1.20 0.09 0.02
## Education* 1.00 5.00 4.00 -0.60 -0.56 0.01
## MaritalStatus* 1.00 6.00 5.00 0.41 0.98 0.01
## HHIncome* 1.00 12.00 11.00 -0.37 -1.05 0.03
## HHIncomeMid 2500.00 100000.00 97500.00 0.03 -1.48 344.47
## Poverty 0.00 5.00 5.00 0.05 -1.47 0.02
## HomeRooms 1.00 13.00 12.00 0.55 0.18 0.02
## HomeOwn* 1.00 3.00 2.00 0.96 -0.19 0.01
## Work* 1.00 3.00 2.00 -0.85 -0.28 0.01
## Weight 2.80 230.70 227.90 -0.03 0.56 0.29
## Length 47.10 112.20 65.10 -0.46 -0.60 0.59
## HeadCirc 34.20 45.40 11.20 -0.43 -0.31 0.25
## Height 83.60 200.40 116.80 -1.66 2.89 0.21
## BMI 12.88 81.25 68.37 0.90 2.20 0.08
## BMICatUnder20yrs* 1.00 4.00 3.00 0.84 -0.37 0.02
## BMI_WHO* 1.00 4.00 3.00 -0.16 -1.14 0.01
## Pulse 40.00 136.00 96.00 0.45 0.46 0.13
## BPSysAve 76.00 226.00 150.00 1.05 2.41 0.19
## BPDiaAve 0.00 116.00 116.00 -1.13 3.91 0.16
## BPSys1 72.00 232.00 160.00 1.00 2.15 0.19
## BPDia1 0.00 118.00 118.00 -0.92 3.74 0.15
## BPSys2 76.00 226.00 150.00 1.06 2.53 0.19
## BPDia2 0.00 118.00 118.00 -1.17 4.38 0.16
## BPSys3 76.00 226.00 150.00 1.02 2.29 0.19
## BPDia3 0.00 116.00 116.00 -1.35 4.89 0.16
## Testosterone 0.25 1795.60 1795.35 1.10 0.82 3.53
## DirectChol 0.39 4.03 3.64 1.02 2.07 0.00
## TotChol 1.53 13.65 12.12 0.70 1.39 0.01
## UrineVol1 0.00 510.00 510.00 1.12 0.75 0.95
## UrineFlow1 0.00 17.17 17.17 3.47 24.62 0.01
## UrineVol2 0.00 409.00 409.00 1.06 0.50 2.35
## UrineFlow2 0.00 13.69 13.69 2.78 16.23 0.03
## Diabetes* 1.00 2.00 1.00 3.17 8.05 0.00
## DiabetesAge 1.00 80.00 79.00 -0.61 0.68 0.63
## HealthGen* 1.00 5.00 4.00 0.17 -0.27 0.01
## DaysPhysHlthBad 0.00 30.00 30.00 2.71 6.46 0.09
## DaysMentHlthBad 0.00 30.00 30.00 2.31 4.42 0.09
## LittleInterest* 1.00 3.00 2.00 1.80 2.11 0.01
## Depressed* 1.00 3.00 2.00 1.95 2.66 0.01
## nPregnancies 1.00 32.00 31.00 2.74 27.74 0.04
## nBabies 0.00 12.00 12.00 1.57 4.54 0.03
## Age1stBaby 14.00 39.00 25.00 0.62 -0.11 0.11
## SleepHrsNight 2.00 12.00 10.00 -0.18 0.79 0.02
## SleepTrouble* 1.00 2.00 1.00 1.13 -0.72 0.00
## PhysActive* 1.00 2.00 1.00 -0.24 -1.94 0.01
## PhysActiveDays 1.00 7.00 6.00 0.32 -0.90 0.03
## TVHrsDay* 1.00 7.00 6.00 0.16 -0.80 0.02
## CompHrsDay* 1.00 7.00 6.00 0.94 0.23 0.02
## TVHrsDayChild 0.00 6.00 6.00 0.73 0.07 0.06
## CompHrsDayChild 0.00 6.00 6.00 0.69 -1.30 0.10
## Alcohol12PlusYr* 1.00 2.00 1.00 -1.44 0.07 0.01
## AlcoholDay 1.00 82.00 81.00 8.22 144.17 0.05
## AlcoholYear 0.00 364.00 364.00 1.58 1.48 1.34
## SmokeNow* 1.00 2.00 1.00 0.17 -1.97 0.01
## Smoke100* 1.00 2.00 1.00 0.23 -1.95 0.01
## Smoke100n* 1.00 2.00 1.00 0.23 -1.95 0.01
## SmokeAge 6.00 72.00 66.00 2.78 14.96 0.10
## Marijuana* 1.00 2.00 1.00 -0.35 -1.88 0.01
## AgeFirstMarij 1.00 48.00 47.00 2.10 11.37 0.07
## RegularMarij* 1.00 2.00 1.00 1.00 -1.00 0.01
## AgeRegMarij 5.00 52.00 47.00 2.82 13.33 0.13
## HardDrugs* 1.00 2.00 1.00 1.62 0.64 0.01
## SexEver* 1.00 2.00 1.00 -4.78 20.89 0.00
## SexAge 9.00 50.00 41.00 1.88 8.53 0.05
## SexNumPartnLife 0.00 2000.00 2000.00 16.82 384.81 0.76
## SexNumPartYear 0.00 69.00 69.00 13.58 252.27 0.04
## SameSex* 1.00 2.00 1.00 3.31 8.97 0.00
## SexOrientation* 1.00 3.00 2.00 -0.71 20.68 0.00
## PregnantNow* 1.00 3.00 2.00 -0.50 10.73 0.01
A table gives the number of observations in each level of Gender, a categorical variable (stored as a factor).
table(NHANES$Gender)
##
## female male
## 5020 4980
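Counts are often easier to compare as proportions. A minimal sketch with a made-up factor (`gender` is illustrative, not the NHANES column):

```r
# Made-up factor for illustration
gender <- factor(c("female", "male", "female", "female", "male"))

counts <- table(gender)
proportions <- prop.table(counts)  # each count divided by the total
round(proportions, 2)
```

With the counts above, 5020 of 10,000 observations is 50.2% female and 49.8% male.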
If a BMI is 30, the mean BMI is 26.7, and the standard deviation is 7.4, what is the z-score, and what proportion of the data lies below it? 30 is larger than 26.7, so the z-score should be positive and the proportion should be larger than 50%.
pnorm(30, 26.7, 7.4)
## [1] 0.6721819
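The z-score is (30 - 26.7) / 7.4, or about 0.45. The value returned by pnorm(), 0.672, is the cumulative probability rather than the z-score: under a normal model, about 67.2% of the data is to the left of a BMI of 30.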
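The z-score and the value returned by pnorm() are related but distinct quantities. A short sketch of both, using the rounded mean and standard deviation quoted above (the variable names are made up for illustration):

```r
bmi_value <- 30
bmi_mean  <- 26.7  # rounded mean BMI from above
bmi_sd    <- 7.4   # rounded standard deviation from above

# z-score: distance from the mean in standard deviation units
z <- (bmi_value - bmi_mean) / bmi_sd
z

# Cumulative probability under a normal model; both calls agree
pnorm(z)
pnorm(bmi_value, mean = bmi_mean, sd = bmi_sd)
```

Because BMI is positively skewed in this sample, the normal model is only an approximation; the observed proportion of BMI values below 30 may differ from it.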
Create a scatter plot of age (x-axis) vs. BMI (y-axis). Make the size of the points = 0.5. Indicate women and men in different colors. Include a title, "BMI by Age for Americans". Put a line on the graph for BMI = 30 (obesity level).
nhanes_df %>%
  ggplot(aes(Age, BMI)) +
  geom_point(aes(col = Gender), size = 0.5) +
  labs(title = "BMI by Age for Americans") +
  # horizontal reference line at the obesity threshold
  geom_hline(yintercept = 30) +
  # annotate() draws the label and pointer once instead of once per row
  annotate("label", x = 10, y = 70, label = "30 BMI (obesity level)") +
  annotate("segment", x = 10, y = 67, xend = 5, yend = 30)
## Warning: Removed 366 rows containing missing values (`geom_point()`).
Create a histogram of heights (use an appropriate binwidth, fill = "blue", x label = "Heights in cm").
nhanes_df %>%
  ggplot(aes(x = Height)) +
  geom_histogram(binwidth = 10, fill = "blue") +
  labs(x = "Heights in cm")
## Warning: Removed 353 rows containing non-finite values (`stat_bin()`).
Create a smooth density plot for male & female heights (use different fill colors so you can see both distributions and adjust alpha so you can see both distributions where they overlap)
nhanes_df %>%
  ggplot(aes(x = Height, fill = Gender)) +
  geom_density(alpha = 0.4)
## Warning: Removed 353 rows containing non-finite values (`stat_density()`).
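Create a histogram of female heights and overlay a normal curve with the same mean and standard deviation. (The plot further down uses a smoothed density in place of the histogram.)
female_heights <- nhanes_df$Height[nhanes_df$Gender == "female"]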
mean(female_heights, na.rm=TRUE)
## [1] 156.6159
sd(female_heights, na.rm=TRUE)
## [1] 16.79195
nhanes_df %>%
  filter(Gender == "female") %>%
  ggplot(aes(Height)) +
  geom_density(fill = "red") +
  stat_function(fun = dnorm, args = list(mean = 156.6, sd = 16.8)) +
  labs(x = "Female Heights (cm)")
## Warning: Removed 173 rows containing non-finite values (`stat_density()`).
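The distribution of female heights does not follow the overlaid normal curve. It is leptokurtic, or "too pointy", and has a negative skew. The long left tail is consistent with the sample including children, since ages in the data start at 0.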
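For an actual histogram with a normal overlay, the bars need to be drawn on the density scale so that they share a y-axis with dnorm. The sketch below shows one way to do this with ggplot2; it uses simulated heights in a made-up data frame `sim` (an assumption for illustration, not NHANES data) and computes the mean and standard deviation from the data instead of hard-coding them.

```r
library(ggplot2)

# Simulated heights (an assumption for illustration, not NHANES data)
set.seed(1)
sim <- data.frame(Height = rnorm(500, mean = 160, sd = 7))

p <- ggplot(sim, aes(x = Height)) +
  # Plot densities instead of counts so the bars and the curve share a scale
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 fill = "red", alpha = 0.5) +
  # Normal curve with the sample mean and standard deviation
  stat_function(fun = dnorm,
                args = list(mean = mean(sim$Height), sd = sd(sim$Height))) +
  labs(x = "Height (cm)", y = "Density")
p
```

To apply this to the data above, `sim` could be replaced by the female rows of nhanes_df, with na.rm = TRUE added to the mean() and sd() calls; rows with a missing Height would be dropped from the histogram with a warning.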