1 Setup and Data Ingest
1.1 Initial Setup and Package Loads
library(knitr); library(rmdformats)
library(janitor); library(naniar)
library(broom); library(patchwork)
library(readxl)
library(Epi)
library(Hmisc)
library(tidyverse)
library(patchwork)
library(car)
library(equatiomatic)
library(GGally)
## Load Love-boost
source("/Users/quynhnguyen/Documents/Study Materials/boost.r")
opts_chunk$set(comment=NA)
opts_knit$set(width=75)
theme_set(theme_bw())
1.2 Loading the Raw Data into R
For this study, I would like to use some datasets that I obtained
from the National Health and Nutrition Examinaition Survey (NHANES) by
National Center for Health Statistics by using nhanesA
package. I would like to use the following datasets:
- Blood Pressure - Oscillometric Measurement (P_BPXO)
- Smoking - Cigarette Use (P_SMQ)
- Weight History (P_WHQ)
- Income (P_INQ)
- Sleep Disorder (P_SLQ)
- Diabetes (P_DIQ)
- Blood Pressure & Cholesterol (P_BPQ)
library(nhanesA)
#Blood pressure measurement
bp_raw <- nhanes('P_BPXO') |> tibble()
saveRDS(bp_raw, "BPX_J.Rds")
bp_raw <- readRDS("BPX_J.Rds")
# Smoking
smoke_raw <- nhanes('P_SMQ') |> tibble()
saveRDS(smoke_raw, "P_SMQ.Rds")
smoke_raw <- readRDS("P_SMQ.Rds")
# Weight History
weight_raw <- nhanes('P_WHQ') |> tibble()
saveRDS(weight_raw, "P_WHQ.Rds")
weight_raw <- readRDS("P_WHQ.Rds")
# Income
income_raw <- nhanes('P_INQ') |> tibble()
saveRDS(income_raw, "P_INQ.Rds")
income_raw <- readRDS("P_INQ.Rds")
# Sleep Disorder
sleep_raw <- nhanes('P_SLQ') |> tibble()
saveRDS(sleep_raw, "P_SLQ.Rds")
sleep_raw <- readRDS("P_SLQ.Rds")
# Diabetes
diabetes_raw <- nhanes('P_DIQ') |> tibble()
saveRDS(diabetes_raw, "P_DIQ.Rds")
diabetes_raw <- readRDS("P_DIQ.Rds")
# Cholesterol level
chol_raw <- nhanes('P_BPQ') |> tibble()
saveRDS(chol_raw, "P_BPQ.Rds")
chol_raw <- readRDS("P_BPQ.Rds")
1.3 Content of the Raw Tibbles
- For the first tible,
bp_raw, it contains 12 variables with 11656 raws, including the unique identifier, i.eSEQNas respondent sequence, which is unique for each patient for each participant.
dim(bp_raw)
[1] 11656 12
- For
smoke_raw, it contains 16 variables with 11137 rows, including the unique identifier, i.eSEQNas respondent sequence, which is unique for each patient for each participant.
dim(smoke_raw)
[1] 11137 16
- For
weight_raw, it contains 35 vairables with 10195 rows, including the unique identifier, i.eSEQNas respondent sequence, which is unique for each patient for each participant.
dim(weight_raw)
[1] 10195 35
- For
income_raw, it contains 3 variables with 15560 rows,including the unique identifier, i.eSEQNas respondent sequence, which is unique for each patient for each participant.
dim(income_raw)
[1] 15560 3
- For
sleep_raw, it contains 11 variables with 10195 rows, including the unique identifier, i.eSEQNas respondent sequence, which is unique for each patient for each participant.
dim(sleep_raw)
[1] 10195 11
- For
diabetes_raw, it contains 28 variables with 14986 rows, including the unique identifier, i.eSEQNas respondent sequence, which is unique for each patient for each participant.
dim(diabetes_raw)
[1] 14986 28
- For
chol_raw, it contains 11 variables with 10195 rows, including the unique identifier, i.eSEQNas respondent sequence, which is unique for each patient for each participant.
dim(chol_raw)
[1] 10195 11
1.4 Merging Steps
I would like to joining columns of these tables togehter by using
inner join on SEQN. It should contains 110 columns after
the merging step.
df_raw <- inner_join(bp_raw, smoke_raw, by = "SEQN")
df_raw <- inner_join(df_raw, weight_raw, by = "SEQN")
df_raw <- inner_join(df_raw, sleep_raw, by = "SEQN")
df_raw <- inner_join(df_raw, diabetes_raw, by = "SEQN")
df_raw <- inner_join(df_raw, income_raw, by = "SEQN")
df_raw <- inner_join(df_raw, chol_raw, by = "SEQN")
dim(df_raw)
[1] 9445 110
1.5 Selecting Variables:
I would like to use these variables for my study:
SEQNas the respondent sequence, which is unique for each patient.BPXOSY1,BPXOSY2,BPXOSY3determine the 1st, 2nd, and third systolic value of oscillometric reading, respectively, orignially from the Blood Pressure - Oscillometric Measurement (P_BPXO) tibble (bp_raw).BPXODI1,BPXODI2,BPXODI3determine the 1st, 2nd, and third systolic value of oscillometric reading, respectively, originally from the Blood Pressure - Oscillometric Measurement (P_BPXO) or (bp_raw) tibble.
SMQ020from Smoking - Cigarette Use (P_SMQ) orsmoke_rawtibble as if the participant smoke about 100 cigarettes in their lives.WHD010from Weight History (P_WHQ) orweight_rawtibble as the self-report height of the participants.WHD0120from Weight History (P_WHQ) orweight_rawtibble as the self-report weight of the participants.INDFMMPCfrom Income (P_INQ) orincome_rawtibble as family monthly poverty level categoryDIQ010from Diabetes (P_DIQ) ordiabetes_rawtibble as if the person was told by doctor that they have diabetes.SLD012from Sleep Disorder (P_SLQ) orsleep_rawtibble as hours of sleep during weekdays of the participantsSLD013from Sleep Disorder (P_SLQ) orsleep_rawtibble as hours of sleep during weekends of the participantsBPQ080from Blood Pressure & Cholesterol (P_BPQ) orchol_rawtibble as being told to have high cholesterol level
df <- df_raw[, c("SEQN", "BPXOSY1", "BPXOSY2", "BPXOSY3", "BPXODI1", "BPXODI2", "BPXODI3", "SMQ020", 'WHD020', 'WHD010', 'INDFMMPC', 'DIQ010', 'SLD012', 'SLD013', 'BPQ080')]
2 Cleaning the Data
2.1 Pre-processing Data
I would like to check the missing values from my current tibble.
miss_var_summary(df) |> kable()
| variable | n_miss | pct_miss |
|---|---|---|
| BPXOSY3 | 1021 | 10.8099524 |
| BPXODI3 | 1021 | 10.8099524 |
| BPXOSY2 | 999 | 10.5770249 |
| BPXODI2 | 999 | 10.5770249 |
| BPXOSY1 | 988 | 10.4605611 |
| BPXODI1 | 988 | 10.4605611 |
| INDFMMPC | 755 | 7.9936474 |
| SMQ020 | 480 | 5.0820540 |
| SLD013 | 85 | 0.8999471 |
| SLD012 | 81 | 0.8575966 |
| WHD010 | 52 | 0.5505558 |
| SEQN | 0 | 0.0000000 |
| WHD020 | 0 | 0.0000000 |
| DIQ010 | 0 | 0.0000000 |
| BPQ080 | 0 | 0.0000000 |
The missing percentage for each variable is quite small. Therefore, I
would like to pre-process my data by filtering out those
missing value, don'know value, and
refused value. These values are specified in each variable
where it originally come from.
This following snippet would to filter out those missing values.
df <- df[complete.cases(df),]
Then, I would like to filter out the value 7777 and
9999 from WHD010 and WHD020.
These values are stated for the refused (7777) and don’t know (9999)
value.
df <- df[df$WHD010 < 7777,]
df <- df[df$WHD020 < 7777,]
I would like to filter out values 7 and 9
from DIQ0101. These values are stated for the refused (7)
and don’t know (9) value.
df <- df[df$DIQ010 < 7,]
I would like to filter out values 7 and 9
from SMQ020. These values are stated for the refused (7)
and don’t know (9) value.
df <-df[df$SMQ020 < 7,]
I would like to filter out values 7 and 9
from INDFMMPC. These values are stated for the refused (7)
and don’t know (9) value.
df <- df[df$INDFMMPC < 7,]
Finally, I would like to filter out values 7 and
9 from BPQ080. These values are stated for the
refused (7) and don’t know (9) value.
df <- df[df$BPQ080 < 7,]
After all, this is my current tibble with 6837 rows and 15 columns.
dim(df)
[1] 6837 15
2.2 My Survey Items:
2.2.1 Quantitative Variables:
BPXOSY1,BPXOSY2,BPXOSY3:1st, 2nd, and third systolic value of oscillometric reading, respectively.BPXODI1,BPXODI2,BPXODI3: 1st, 2nd, and third systolic value of oscillometric reading, respectively.SLD012: hours of sleep during weekdays of the participantsSLD013: hours of sleep during weekends of the participantsWHD010: the self-report height of the participants.WHD0120: the self-report weight of the participants.
2.2.2 Binary Variables:
SMQ020: the participant smoke about 100 cigarettes in their lives (Yes/No)BPQ080: being told to have high cholesterol level (Yes/No)
2.2.3 Multi-Category Variables:
INDFMMPCfrom Income (P_INQ) orincome_rawtibble as family monthly poverty level category (1 as Monthly poverty level index =1.30 , 2 as 1.30 < Monthly poverty level index = 1.85 , 3 as Monthly poverty level index >1.85)DIQ010: the person was told by doctor that they have diabetes (Yes/No/Borderline)
glimpse(df)
Rows: 6,837
Columns: 15
$ SEQN <dbl> 109266, 109271, 109273, 109274, 109282, 109290, 109292, 10929…
$ BPXOSY1 <dbl> 99, 102, 116, 138, 141, 126, 143, 126, 158, 105, 94, 155, 103…
$ BPXOSY2 <dbl> 99, 108, 110, 132, 137, 116, 138, 130, 161, 105, 91, 165, 104…
$ BPXOSY3 <dbl> 99, 111, 115, 132, 140, 122, 133, 130, 158, 102, 92, 167, 96,…
$ BPXODI1 <dbl> 56, 65, 68, 70, 77, 62, 96, 83, 96, 72, 51, 95, 65, 77, 71, 6…
$ BPXODI2 <dbl> 55, 68, 66, 69, 71, 60, 98, 84, 92, 69, 51, 92, 66, 73, 67, 6…
$ BPXODI3 <dbl> 52, 68, 68, 71, 70, 59, 97, 84, 92, 68, 49, 95, 63, 75, 67, 6…
$ SMQ020 <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2…
$ WHD020 <dbl> 210, 222, 165, 219, 185, 155, 185, 230, 137, 120, 205, 130, 1…
$ WHD010 <dbl> 64, 72, 72, 75, 70, 63, 66, 70, 63, 60, 66, 59, 62, 69, 66, 6…
$ INDFMMPC <dbl> 3, 1, 1, 1, 3, 3, 2, 1, 3, 3, 2, 3, 1, 3, 3, 3, 3, 1, 1, 3, 1…
$ DIQ010 <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
$ SLD012 <dbl> 7.5, 10.0, 6.5, 9.5, 7.0, 4.0, 4.5, 7.5, 7.0, 8.0, 6.0, 4.0, …
$ SLD013 <dbl> 8.0, 13.0, 8.0, 9.5, 8.0, 4.0, 4.5, 6.5, 7.0, 8.0, 7.0, 14.0,…
$ BPQ080 <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
My analytic tibble will be called as df that includes
the following items:
- This tibble will need to contain information developed from the
variables listed above with the subject identifying number
SEQN - For those quantitative variables, I would also like to check as if they have reasonable range.
2.3 Checking our Quantitative Variables
In this study, I have 10 quantitative variables.I want to check the range for each of them to ensure there are no abnormal value.
df |>
select(BPXOSY1, BPXOSY2, BPXOSY3, BPXODI1, BPXODI2, BPXODI3, WHD010, WHD020, SLD012, SLD013) |>
mosaic::inspect()
quantitative variables:
name class min Q1 median Q3 max mean sd n missing
1 BPXOSY1 numeric 66 110.0 121.0 135.0 220 124.049144 19.518912 6837 0
2 BPXOSY2 numeric 54 110.0 121.0 134.0 222 123.870265 19.438437 6837 0
3 BPXOSY3 numeric 60 110.0 121.0 134.0 220 123.798157 19.260963 6837 0
4 BPXODI1 numeric 36 67.0 74.0 82.0 141 74.751645 11.755247 6837 0
5 BPXODI2 numeric 31 66.0 73.0 81.0 146 74.220711 11.769410 6837 0
6 BPXODI3 numeric 36 66.0 73.0 81.0 145 73.931110 11.833857 6837 0
7 WHD010 numeric 49 63.0 66.0 70.0 82 66.467018 4.160892 6837 0
8 WHD020 numeric 80 150.0 175.0 210.0 457 183.057335 48.703957 6837 0
9 SLD012 numeric 2 6.5 7.5 8.5 14 7.571449 1.649106 6837 0
10 SLD013 numeric 2 7.0 8.0 9.0 14 8.256180 1.786721 6837 0
We can see that the range for each variable regarding to each meaning are quite reasonable. Therefore, I would like to proceed into next step.
2.3.1 Processing Height and Weight Value to Calculate the Body Mass Index (BMI)
In this study,instead of using weight and height, I would like to use them for calculating BMI. The BMI formula for inches and pounds is available at http://www.bmi-calculator.net/bmi-formula.php. Normally, the normal range for BMI would be reasonable from 15 to 50. However, since NHANES’s purpose was to collect data prevalence of chronic conditions in the population in the U.S. Due to this, the range can be go over or under the normal reasonable range. Therefore, getting values of BMI above 50 or slightly under 15 as below would be considered as acceptable. However, in order to confirm it, I would like to check the range value for the height and weight.
df['BMI'] <-df$WHD020*703/(df$WHD010)**2
describe(~ BMI, data = df)
BMI
1 Variables 6837 Observations
--------------------------------------------------------------------------------
BMI
n missing distinct Info Mean Gmd .05 .10
6837 0 2056 1 29.05 7.521 20.17 21.45
.25 .50 .75 .90 .95
24.21 27.81 32.48 38.17 42.28
lowest : 14.63059 14.74719 14.87748 14.99636 15.10352
highest: 70.47227 72.25278 72.62397 74.87574 76.87125
--------------------------------------------------------------------------------
As we can see in here, the range of weight are quite big as there are people go up to 457 lbs. The tallest person also have the height of 82 inches (6’ 8’’). These are quite rare in normal, but if we consider as sampling people in the population, there would be some exceptional cases. Therefore, I would leave these values to be intact.
df|>
select(WHD020, WHD010) |>
describe()
select(df, WHD020, WHD010)
2 Variables 6837 Observations
--------------------------------------------------------------------------------
WHD020
n missing distinct Info Mean Gmd .05 .10
6837 0 265 1 183.1 53 118.0 128.0
.25 .50 .75 .90 .95
150.0 175.0 210.0 248.0 271.2
lowest : 80 82 86 88 89, highest: 415 416 434 450 457
--------------------------------------------------------------------------------
WHD010
n missing distinct Info Mean Gmd .05 .10
6837 0 31 0.995 66.47 4.72 60 61
.25 .50 .75 .90 .95
63 66 70 72 73
lowest : 49 50 53 54 55, highest: 77 78 79 81 82
--------------------------------------------------------------------------------
2.3.2 Average the Systolic and Diastolic Values:
Instead of getting 3 values for each systolic and diastolic reading, I would like to average them out.
df['systolic'] <- as.numeric(format(round(rowMeans(df[,c("BPXOSY3","BPXOSY2","BPXOSY1")]), 1), nsmall = 1))
df['diastolic'] <- as.numeric(format(round(rowMeans(df[,c("BPXODI1","BPXODI2","BPXODI3")]), 1), nsmall = 1))
Then, I would like to see their range. The normal people should have systolic/diastolic as 90/60. However, for some people who have abnormal blood pressure (i.e hypotension, high blood pressure, hypertension), the range can be out of scope. As there are cases that people with hypertension crisis can have systolic to be over 200/diastolic over 120 or hypotension with systolic to be under 90/diastolic to be under 60 (but over 40), the range describes below are reasonable to be keep intact.
df |>
select(systolic, diastolic) |>
describe()
select(df, systolic, diastolic)
2 Variables 6837 Observations
--------------------------------------------------------------------------------
systolic
n missing distinct Info Mean Gmd .05 .10
6837 0 344 1 123.9 20.91 98.3 102.7
.25 .50 .75 .90 .95
110.0 121.0 134.3 149.7 160.3
lowest : 78.7 79.7 80.0 80.3 82.7, highest: 210.3 213.3 214.0 217.0 218.7
--------------------------------------------------------------------------------
diastolic
n missing distinct Info Mean Gmd .05 .10
6837 0 229 1 74.3 12.78 57.3 60.3
.25 .50 .75 .90 .95
66.3 73.3 81.3 89.0 94.0
lowest : 41.3 41.7 42.0 43.0 43.7, highest: 126.0 128.0 135.0 137.7 143.7
--------------------------------------------------------------------------------
2.3.3 Average Hours of Sleep
Instead of having hours of sleep during the weekdays and weekends separately, I would like to keep one value by average these two hours.
df['SleepHours'] <- as.numeric(format(round(rowMeans(df[,c('SLD012', 'SLD013')]), 1), nsmall = 1))
Then, I would like to check the range of this values. In the original study, the hours of sleep going from 2 to 14. Then, the average range, if normal, should be within this scope.
df |>
select(SleepHours) |>
describe()
select(df, SleepHours)
1 Variables 6837 Observations
--------------------------------------------------------------------------------
SleepHours
n missing distinct Info Mean Gmd .05 .10
6837 0 47 0.994 7.914 1.653 5.5 6.0
.25 .50 .75 .90 .95
7.0 8.0 9.0 9.8 10.5
lowest : 2.0 2.5 2.8 3.0 3.2, highest: 12.8 13.0 13.5 13.8 14.0
--------------------------------------------------------------------------------
As we can see that the range is falling within the scope that we anticipated. Therefore, this value is normal.
2.4 Checking Binary Variables
My binary vairables including SMQ020 and
BPQ080 ### For SMQ020 Variables
df |> select(SMQ020) |> glimpse()
Rows: 6,837
Columns: 1
$ SMQ020 <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, …
From the original tibble, 1 denotes Yes, and
2 denotes No. Therefore, I would like to change it into
Yes/No value and factor it.
df <- df %>% mutate(smoking = fct_recode(factor(SMQ020), "Yes" = "1", "No" = "2"))
df |> select(smoking) |> summary()
smoking
Yes:2783
No :4054
There are no values that out of range.
2.4.1 For
BPQ080 Variables
df |> select(BPQ080) |> glimpse()
Rows: 6,837
Columns: 1
$ BPQ080 <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, …
From the original tibble, 1 denotes Yes, and
2 denotes No. Therefore, I would like to change it into
Yes/No value and factor it.
df <- df %>% mutate(highchol = fct_recode(factor(BPQ080), "Yes" = "1", "No" = "2"))
df |> select(highchol) |> summary()
highchol
Yes:2409
No :4428
There are no values that out of range.
2.5 Checking Multi-Category Variables
I have two multi-category variables: INDFMMPC, and
DIQ010. I will first check them if they have any surprising
values and they will rename and factor it to mirror what we need in our
analyses.
2.5.1 The
INDFMMPC vairable
First, I would take a look at this variable.
df |> tabyl(INDFMMPC) |> kable()
| INDFMMPC | n | percent |
|---|---|---|
| 1 | 2141 | 0.3131490 |
| 2 | 1051 | 0.1537224 |
| 3 | 3645 | 0.5331286 |
For this variable, I would like to create a new factor called
eco_statuswhich is a factor which has specially level
names: 1 as
Lower,2asMiddle, and3forUpper`.
df <- df %>% mutate(eco_status = fct_recode(factor(INDFMMPC), "Lower" = "1", "Middle" = "2", "Upper" = "3"))
After that, I would check if I factor them correctly.
df |> count(INDFMMPC, eco_status) |> kable()
| INDFMMPC | eco_status | n |
|---|---|---|
| 1 | Lower | 2141 |
| 2 | Middle | 1051 |
| 3 | Upper | 3645 |
2.5.2 The
DIQ010 Vairable
First, I would take a look at this variable.
df |> tabyl(DIQ010) |> kable()
| DIQ010 | n | percent |
|---|---|---|
| 1 | 986 | 0.1442153 |
| 2 | 5664 | 0.8284335 |
| 3 | 187 | 0.0273512 |
For this variable, I would like to create a new factor called
diabeteswhich is a factor which has specially level names:
1 as Yes, 2 as No,
and 3 for Borderline.
df <- df %>% mutate(diabetes = fct_recode(factor(DIQ010), "Yes" = "1", "No" = "2", "Borderline" = "3"))
After that, I would check if I factor them correctly.
df |> count(DIQ010, diabetes) |> kable()
| DIQ010 | diabetes | n |
|---|---|---|
| 1 | Yes | 986 |
| 2 | No | 5664 |
| 3 | Borderline | 187 |
2.6 Creating Analytic Tibbles
So our analytic tibble, which I will call as df should
contains 7 values after processing.
df <- df %>% select(SEQN,systolic, diastolic, BMI, SleepHours, smoking, eco_status, diabetes, highchol)
2.7 List of Missing Values
As I already filter those don’t know/missing values above, our analytic tibble should not contain any of these values
miss_var_summary(df) |> kable()
| variable | n_miss | pct_miss |
|---|---|---|
| SEQN | 0 | 0 |
| systolic | 0 | 0 |
| diastolic | 0 | 0 |
| BMI | 0 | 0 |
| SleepHours | 0 | 0 |
| smoking | 0 | 0 |
| eco_status | 0 | 0 |
| diabetes | 0 | 0 |
| highchol | 0 | 0 |
3 Codebook and Data Description
3.1 Codebook
The 9 variables of my analytic tibble would be described below. The Type column indicates the number of levels in each categorical (factor) variable. As for the Type information, I’m using Quant to indicate quantitative variables, and Cat-x indicates a categorical variable (factor) with x levels.
| Variable | Type | Description/Levels |
|---|---|---|
| SEQN | id | respondent sequence, unique for each participants. |
| systolic | Quant | Average of systolic readings |
| diastolic | Quant | Average of diastolic readings |
| BMI | Quant | Body Mass Index |
| SleepHours | Quant | Average sleeping hours a week |
| smoking | Cat - 2 | Yes, No: Did you smoke 100 cigarettes in life? |
| eco_status | Cat - 3 | Lower Class, Middle Class, Upper Class: |
| diabetes | Cat - 3 | Yes, No, Borderline: Have doctor told you that you have diabetes? |
| highchol | Cat - 2 | Yes, No: Have doctor told you that you have high cholesterol? |
3.2 Analytic Tibble
df
# A tibble: 6,837 × 9
SEQN systolic diastolic BMI SleepHours smoking eco_status diabe…¹ highc…²
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 109266 99 54.3 36.0 7.8 No Upper No Yes
2 109271 107 67 30.1 11.5 Yes Lower No Yes
3 109273 114. 67.3 22.4 7.2 Yes Lower No No
4 109274 134 70 27.4 9.5 No Lower Yes Yes
5 109282 139. 72.7 26.5 7.5 Yes Upper No Yes
6 109290 121. 60.3 27.5 4 No Upper Yes No
7 109292 138 97 29.9 4.5 No Middle Yes Yes
8 109293 129. 83.7 33.0 7 No Lower No No
9 109295 159 93.3 24.3 7 No Upper No No
10 109297 104 69.7 23.4 8 No Upper No No
# … with 6,827 more rows, and abbreviated variable names ¹diabetes, ²highchol
3.3 Data Summary
Hmisc::describe(df |> select(-SEQN))
select(df, -SEQN)
8 Variables 6837 Observations
--------------------------------------------------------------------------------
systolic
n missing distinct Info Mean Gmd .05 .10
6837 0 344 1 123.9 20.91 98.3 102.7
.25 .50 .75 .90 .95
110.0 121.0 134.3 149.7 160.3
lowest : 78.7 79.7 80.0 80.3 82.7, highest: 210.3 213.3 214.0 217.0 218.7
--------------------------------------------------------------------------------
diastolic
n missing distinct Info Mean Gmd .05 .10
6837 0 229 1 74.3 12.78 57.3 60.3
.25 .50 .75 .90 .95
66.3 73.3 81.3 89.0 94.0
lowest : 41.3 41.7 42.0 43.0 43.7, highest: 126.0 128.0 135.0 137.7 143.7
--------------------------------------------------------------------------------
BMI
n missing distinct Info Mean Gmd .05 .10
6837 0 2056 1 29.05 7.521 20.17 21.45
.25 .50 .75 .90 .95
24.21 27.81 32.48 38.17 42.28
lowest : 14.63059 14.74719 14.87748 14.99636 15.10352
highest: 70.47227 72.25278 72.62397 74.87574 76.87125
--------------------------------------------------------------------------------
SleepHours
n missing distinct Info Mean Gmd .05 .10
6837 0 47 0.994 7.914 1.653 5.5 6.0
.25 .50 .75 .90 .95
7.0 8.0 9.0 9.8 10.5
lowest : 2.0 2.5 2.8 3.0 3.2, highest: 12.8 13.0 13.5 13.8 14.0
--------------------------------------------------------------------------------
smoking
n missing distinct
6837 0 2
Value Yes No
Frequency 2783 4054
Proportion 0.407 0.593
--------------------------------------------------------------------------------
eco_status
n missing distinct
6837 0 3
Value Lower Middle Upper
Frequency 2141 1051 3645
Proportion 0.313 0.154 0.533
--------------------------------------------------------------------------------
diabetes
n missing distinct
6837 0 3
Value Yes No Borderline
Frequency 986 5664 187
Proportion 0.144 0.828 0.027
--------------------------------------------------------------------------------
highchol
n missing distinct
6837 0 2
Value Yes No
Frequency 2409 4428
Proportion 0.352 0.648
--------------------------------------------------------------------------------
4 Analysis B:Comparing 2 Means with Independent Samples
4.1 The Questions
We’ll compare BMI by cholesterol level in this analysis
using independent samples. We’re comparing the mean BMI of
the population represented by respondents who have high cholesterol to
the mean BMI of the population represented by the respondents who don’t
have cholesterol. Since the SEQN are unique for each
respondents, and each respondent only got to answer Yes or No during the
survey, so the samples are not either paired or matched any way. In
addition, there are difference in people who have high-cholesterol and
people without it, so there is not way their BMI values could be paired.
As a result, we’re going to be interested in looking at the two samples
separately to help us understand issues related to hypothesis testing
assumptions. For this analysis, I would like to use 90% confidence
level.
Our research question is: Did people who have cholesterol would be associated of having higher body mass index values than people who don’t have high cholesterol?
4.2 Describing the Data
I’ll start by looking at the range of the BMI data within each cholesterol group
mosaic::favstats(BMI ~ highchol, data = df) |>
kable(digits = 2)
| highchol | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Yes | 16.18 | 25.40 | 28.73 | 33.20 | 72.62 | 29.89 | 6.55 | 2409 | 0 |
| No | 14.63 | 23.49 | 27.26 | 32.12 | 76.87 | 28.60 | 7.23 | 4428 | 0 |
There are no missing values for each group of cholesterol level as I already filtered it out at the beginning. ### Graphical Summaries
ggplot(df, aes(x = highchol, y = BMI, fill = highchol)) +
geom_violin(alpha = 0.3) +
geom_boxplot(width = 0.3, notch = TRUE) +
guides(fill = FALSE) +
labs(title = "BMI data somewhat right skewed in each group",
x = "Having high cholesterol?", y = "Body Mass Index") +
theme_bw()
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
"none")` instead.
From the boxplot (with notches and violin), we could see that I have a
lot of outliners at the high end, which suggested that I may have a
right-skewed. Therefore, I would like to double-check it with Q-Q
Plot.
ggplot(df, aes(sample = BMI, col = highchol)) +
geom_qq() + geom_qq_line() +
facet_wrap(~ highchol, labeller = "label_both") +
guides(col = FALSE) +
theme_bw() +
labs(y = "Observed BMI values",
title = "BMI isn't well fit by a Normal model in either group")
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
"none")` instead.
We can see that for both Cholesterol group, they have a very clear
right-skewed. Therefore, it is not reasonable to assume the Normality
here, or even symmetry.
4.2.1 Numerical Summaries
mosaic::favstats(BMI ~ highchol, data = df) |>
kable()
| highchol | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Yes | 16.17695 | 25.40137 | 28.72681 | 33.20027 | 72.62397 | 29.88527 | 6.552644 | 2409 | 0 |
| No | 14.63059 | 23.49076 | 27.25918 | 32.11771 | 76.87125 | 28.59715 | 7.228255 | 4428 | 0 |
df |> group_by(highchol) |>
summarize(skew1 = round_half_up((mean(BMI) - median(BMI))/sd(BMI), 3))
# A tibble: 2 × 2
highchol skew1
<fct> <dbl>
1 Yes 0.177
2 No 0.185
Even the skew1 values for both of cholesterol group are not fairly substantial right-skewed as they are not reached 0.2, but they were both really close to this value. Therefore, with the Q-Q Plot and those skew1 value, it is not reasonable to assume Normality here.
#df <- df %>% mutate(highchol_group = as.character(highchol))
df |> group_by(highchol) %>% summarize(n = n(), mean = mean(BMI), variance = var(BMI))
# A tibble: 2 × 4
highchol n mean variance
<fct> <int> <dbl> <dbl>
1 Yes 2409 29.9 42.9
2 No 4428 28.6 52.2
When calculate the variance for both of the cholesterol group, we could see that they are also different. Therefore, we cannot assume that they both have equal variance.
4.3 Main analysis
Since we cannot assume that the distribution of BMI for each group
are normally as well as the variance are equal, we are now using
Bootstrap with 90% confidence interval for comparing BMI
for people who answered Yes and No to the question about being told to
have high cholesterol.
Here is a 90% confidence interval for the difference between the cholesterol and non-cholesterol population BMI distributions based on the bootstrap using a seed of 42.
set.seed(42)
bootdif(df$BMI, df$highchol, conf.level = 0.90)
Mean Difference 0.05 0.95
-1.288114 -1.567482 -1.009058
- The population mean BMI in those who said No is estimated to be about 1.29 points lower than the population mean BMI for those who said Yes, based on our samples. So the mean differences’ point estimate is -1.29
- Our 90% confidence interval for the difference (No - Yes) of the population means is (-1.57, -1.01).
- Here, I’ve assumed a two-sided confidence interval procedure. We conclude from the confidence interval that the difference between the true means of the cholesterol and non-cholesterol bmi levels is negative, based on our analysis.
- The assumptions of this bootstrap procedure are:
- that the samples in each group are drawn independently of each other,
- and that the samples in each group represent a random sample of the population of interest.
4.4 Conclusions
We find a range of possible values for the difference between the population mean BMI for those who were told to have cholesterol and those who weren’t, based on our sample of respondents with complete data on BMI. Since this confidence interval doesn’t include 0, we could say that it is significant. Therefore, we are 90% confidence that the difference between the population mean BMI for those who were told to have cholesterol and those who weren’t is 1.29.This conclusion is motivated by a bootstrap estimate to compare the two groups (cholesterol and non-cholesterol) with complete data on BMI. One of the limitation I believe that leading to the use of Bootstrap is that we can neither assume normality or equal variance. This could be a side-effect of filtering the missing values or don’t know value from the table at the beginning. In addition to that, when we merged different columns by inner join on SEQN, some of the data was also be lost during this process, which can lead to the skewness of our data.
5 Analysis C: Comparing 3 Means with Independent Samples
5.1 The Questions
I’ll compare BMI by diabetes in this
analysis, using the analysis of variance, and related tools. We’re
comparing the mean BMI of the population represented by the
respondents who got diabetes, to the population represented by the
respondents who don’t have diabetes, to the population represented by
the respondents who were told that they are at borderline to have
diabetes. There is no link between subjects across the three groups as
each person has different unique identifier number, so the samples are
independent. Plus, as we’ll see, there are different numbers of subjects
in the three diabetes groups, so there’s no way their BMI
values could be matched. As a result, we’re going to be interested in
looking at the three samples separately to help us understand issues
related to hypothesis testing assumptions. I would like to use 90%
confidence interval. This analysis would help to answer the research
question that if the person who has diabetes would likely to have less
hours of sleep?
5.2 Describing the Data
I’ll start by looking at the range of the BMI data
within each diabetes group.
mosaic::favstats(BMI ~ diabetes, data = df)
diabetes min Q1 median Q3 max mean sd
1 Yes 15.49311 26.86241 30.84990 35.92549 69.47902 32.02010 7.412293
2 No 14.63059 23.73167 27.27690 31.88209 76.87125 28.47078 6.742670
3 Borderline 17.13822 25.17888 28.71701 33.80094 70.47227 30.97049 8.803573
n missing
1 986 0
2 5664 0
3 187 0
The distribution for the diabetes group are not fairly as the people who don’t have diabetes outweighted other groups.
describe(df$diabetes)
df$diabetes
n missing distinct
6837 0 3
Value Yes No Borderline
Frequency 986 5664 187
Proportion 0.144 0.828 0.027
As we can see, there are no missing values for this variable as I already filter it out before this.
5.2.1 Graphical Summaries
I would like to investigate the distribution of three independent samples as I will plot each of the groups in a comparison boxplot.
ggplot(df, aes(x = diabetes, y = BMI, fill = diabetes)) +
geom_violin(alpha = 0.3) +
geom_boxplot(width = 0.3) +
coord_flip() +
guides(fill = FALSE) +
theme_bw() +
labs(title = "BMI by diabetes",
x = "Diabetes Indicator",
y = "BMI")
We can see that our groups have a lot of outliners. Moreover, it seems
like the variance across three groups are not equally to each other.
However, it seems like all three groups are right-skewed.
Even though we have such a large sample size, but because of the numbers of respondents in each group are not equally to each other, the histogram would look odds in this case. However, from the histogram, we can also see that it looks right-skewed.
ggplot(df, aes(x = BMI)) +
geom_histogram(aes(fill = diabetes), bins = 10, col = "white") +
theme_bw() +
facet_wrap(~ diabetes, labeller = "label_both") +
guides(fill = FALSE) +
labs(title = "BMI that associated with diabetes",
y = "BMI",
x = "diabetes group")
Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
"none")` instead.
## Main analysis ### Comparing BMI across three diabetes group
by(df$BMI, df$diabetes, mosaic::favstats)
df$diabetes: Yes
min Q1 median Q3 max mean sd n missing
15.49311 26.86241 30.8499 35.92549 69.47902 32.0201 7.412293 986 0
------------------------------------------------------------
df$diabetes: No
min Q1 median Q3 max mean sd n missing
14.63059 23.73167 27.2769 31.88209 76.87125 28.47078 6.74267 5664 0
------------------------------------------------------------
df$diabetes: Borderline
min Q1 median Q3 max mean sd n missing
17.13822 25.17888 28.71701 33.80094 70.47227 30.97049 8.803573 187 0
When comparing the quotient between three groups of diabetes, we could see that the first quotient of yes is the highest, following by the borderline, and then diabetes. This applies the same for other quotient, which gives us a sense that people who have higher BMI is more likely having diabetes.
5.2.2 Kruskal-Wallis Test
Since the distributions of three groups of diabetes are not normally distributed, Kruskal-Wallis Test would likely to be helpful as it doesn’t have any assumption about Normality.
kruskal.test(BMI ~ diabetes, data = df)
Kruskal-Wallis rank sum test
data: BMI by diabetes
Kruskal-Wallis chi-squared = 243.92, df = 2, p-value < 2.2e-16
This result suggests only that the separation we observe between the BMI for the three diabetes categories is consistent with some true differences between those groups as the p-value is below 0.1.
In my study, even distribution of BMI are not normally, and the variance/sample size of these groups are also very different from each other, I would also want to run the ANOVA test
5.2.3 Analysis of Variance
The Analysis of Variance compares the means of BMI in the three diabetes group. We can run the analysis using either of two approaches, each of which we’ll show in what follows.
lm(BMI ~ diabetes, data = df) |>
anova()
Analysis of Variance Table
Response: BMI
Df Sum Sq Mean Sq F value Pr(>F)
diabetes 2 11288 5644.0 118.32 < 2.2e-16 ***
Residuals 6834 325994 47.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the analysis, it is consistent with the fact that there are a true difference between the BMI mean for 3 diabetes categories.
5.2.4 Tukey’s Honestly Significant Differences approach to Pairwise Comparisons of Means
aov(df$BMI ~ df$diabetes) |> summary()
Df Sum Sq Mean Sq F value Pr(>F)
df$diabetes 2 11288 5644 118.3 <2e-16 ***
Residuals 6834 325994 48
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now, we run the Tukey HSD comparisons, both in a plot and table of results. As specified previously, we’ll use a 90% confidence level across the set of comparisons.
TukeyHSD(aov(df$BMI ~ df$diabetes), conf.level = 0.90)
Tukey multiple comparisons of means
90% family-wise confidence level
Fit: aov(formula = df$BMI ~ df$diabetes)
$`df$diabetes`
diff lwr upr p adj
No-Yes -3.549318 -4.038522 -3.06011357 0.0000000
Borderline-Yes -1.049611 -2.180368 0.08114622 0.1372062
Borderline-No 2.499707 1.446019 3.55339548 0.0000034
The confidence interval suggests that the group that has no diabetes would have lower BMI comparing to the group has diabetes or at borderline. These difference are significant. The group at the borderline would have lower BMI comparing to the group has diabetes. However, this different is not significant at all as the p-value is larger than 0.1
mar.default <- c(5,6,4,2) + 0.1
par(mar = mar.default + c(0, 4, 0, 0))
plot(TukeyHSD(aov(df$BMI ~ df$diabetes),
conf.level = 0.90), las = 1)
par(mar = mar.default)
5.3 Conclusions
Since I can only use Kruskal-Wallis Test for this study, and it returned that there are some true differences between those groups as the p-value is below 0.1. Even our data for this analysis doesn’t qualify all the ANOVA assumptions, but based on the analysis of ANOVA, it also agrees with the Kruskal-Wallis Tes as there are some true difference between those group, i.e between group without diabetes with group has diabetes and at borderline. This helps to answer our research question that the people who has diabetes would likely to have higher BMI.
Since we use ANOVA test without caring about its assumption, I believe that this result is not reliable. In order to improving it, I should better do some transformation, or looking back at my filtering step in order to get more values as some of these value may be filtered out at the beginning.
6 Analysis D: Analyzing a 2x2 table
6.1 The Questions
I’ll look at the association between two categorical factors that I
created earlier: smoking and highchol in this
analysis. I am interested in whether there is an association between
smoking and high cholesterol level. Both of these values have 2 levels.
For the analysis, I would like to use 90% confidence level.
6.2 Describing the Data
table(df$smoking, df$highchol) |> kable()
| Yes | No | |
|---|---|---|
| Yes | 1124 | 1659 |
| No | 1285 | 2769 |
df_D <- df |>
select(SEQN, smoking, highchol) |>
mutate(smoking_r = fct_recode(factor(smoking),
"No Smoking" = "No",
"Smoking" = "Yes"),
smoking_r = fct_relevel(smoking_r, "Smoking"),
highchol_r = fct_recode(factor(highchol),
"No High Cholesterol" = "No",
"High Cholesterol" = "Yes"),
highchol_r = fct_relevel(highchol_r, "High Cholesterol"))
df_D |> tabyl(smoking_r,highchol_r) |> kable()
| smoking_r | High Cholesterol | No High Cholesterol |
|---|---|---|
| Smoking | 1124 | 1659 |
| No Smoking | 1285 | 2769 |
6.3 Main analysis
t1 <- table(df_D$smoking_r, df_D$highchol_r)
twoby2(t1 , conf.level = 0.90)
2 by 2 table analysis:
------------------------------------------------------
Outcome : High Cholesterol
Comparing : Smoking vs. No Smoking
High Cholesterol No High Cholesterol P(High Cholesterol)
Smoking 1124 1659 0.4039
No Smoking 1285 2769 0.3170
90% conf. interval
Smoking 0.3887 0.4193
No Smoking 0.3051 0.3291
90% conf. interval
Relative Risk: 1.2742 1.2077 1.3443
Sample Odds Ratio: 1.4600 1.3418 1.5885
Conditional MLE Odds Ratio: 1.4598 1.3400 1.5904
Probability difference: 0.0869 0.0675 0.1064
Exact P-value: 0.0000
Asymptotic P-value: 0.0000
------------------------------------------------------
6.4 Conclusions
From the table, we could see that both the exact p-value and asymptotic p-value from our 2x2 analysis that are below 0.1, which determines that there is a significant difference between non-smoking and smoking version when it comes to the high cholesterol. In this case, people who smoking turn out having a higher chance of being high cholesterol.
7 Analysis E: Two-way(3 x 3) Contingency Table
7.1 The Questions
I’ll look at the association between two categorical factors that I
created earlier: diabetes and eco_status in
this analysis. I am interested in whether there is an association
between diabetes and economic status. Both of these values have 3
levels. For the analysis, I would like to use 90% confidence level.
7.2 Describing the Data
Let’s store this initial table of interest as
table_E1
table_E1 <- table(df$diabetes, df$eco_status)
table_E1 |> kable()
| Lower | Middle | Upper | |
|---|---|---|---|
| Yes | 311 | 175 | 500 |
| No | 1781 | 847 | 3036 |
| Borderline | 49 | 29 | 109 |
I could add the marginal totals:
df |>
tabyl(diabetes, eco_status) |>
adorn_totals(where = c("row", "col"))
diabetes Lower Middle Upper Total
Yes 311 175 500 986
No 1781 847 3036 5664
Borderline 49 29 109 187
Total 2141 1051 3645 6837
The research question is wheter the lower class you are, the higher chance that you have diabetes. ## Main analysis
7.2.1 Running the Pearson χ2 Test
I’ll run the Pearson χ2 test using:
chisq.test(table_E1)
Pearson's Chi-squared test
data: table_E1
X-squared = 8.0851, df = 4, p-value = 0.08851
7.2.2 Checking Assumptions - The Cochran Conditions
The “Cochran conditions”, which require that we have:
- no cells with 0 counts
- at least 80% of the cells in our table with counts of 5 or higher
- expected counts in each cell of the table should be 5 or more
My table_E1 meets all the Cochran conditions above
7.2.3 An Association Plot for the 3x3 Table
assocplot(table_E1)
##Conclusions Using the
table_E1 here, we can see that the
p-value is 0.06655 < 0.1. In addition, from the assocplot, we could
see that people at upper level, they tend to be at no and borderline of
diabetes, and less likely to have diabetes. For people in the lower
class, they are less likely to be at the borderline, but they have quite
the same amount of having diabetes and not. For people at the middle
class, they tend to have diabetes, and borderline, but less likely to do
not have diabetes. From this, we can see that, if we go from the upper
class to the lower class, the chance of having diabetes increases, and
the chance of not having diabetes decreases. This alongs with the
p-value, it helps to determine that we are 90% confidence that the lower
class you are, the higher chance you have diabetes.
One of the main reason why it behaves a little bit weirdly in the middle class is that this class has much less number of respondents comparing to the lower and upper class. This may be due to our filtering out the missing values and don’t know value as well as the merging step that wiped out the respondents in this class.
8 Session Information
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mosaicData_0.20.3 ggformula_0.10.2 Matrix_1.5-3
[4] nhanesA_0.7.1 GGally_2.1.2 equatiomatic_0.3.1
[7] car_3.1-1 carData_3.0-5 forcats_0.5.2
[10] stringr_1.4.1 dplyr_1.0.10 purrr_0.3.5
[13] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
[16] tidyverse_1.3.2 Hmisc_4.7-1 ggplot2_3.3.6
[19] Formula_1.2-4 survival_3.3-1 lattice_0.20-45
[22] Epi_2.47 readxl_1.4.1 patchwork_1.1.2.9000
[25] broom_1.0.1 naniar_0.6.1 janitor_2.1.0
[28] rmdformats_1.0.4 knitr_1.40
loaded via a namespace (and not attached):
[1] googledrive_2.0.0 colorspace_2.0-3 deldir_1.0-6
[4] ggridges_0.5.4 ellipsis_0.3.2 visdat_0.5.3
[7] ggstance_0.3.5 snakecase_0.11.0 htmlTable_2.4.1
[10] base64enc_0.1-3 fs_1.5.2 rstudioapi_0.14
[13] farver_2.1.1 fansi_1.0.3 lubridate_1.8.0
[16] xml2_1.3.3 mosaic_1.8.4.2 splines_4.2.1
[19] cachem_1.0.6 polyclip_1.10-0 jsonlite_1.8.2
[22] cluster_2.1.3 dbplyr_2.2.1 png_0.1-7
[25] ggforce_0.4.1 shiny_1.7.2 compiler_4.2.1
[28] httr_1.4.4 backports_1.4.1 assertthat_0.2.1
[31] fastmap_1.1.0 gargle_1.2.1 cli_3.4.1
[34] tweenr_2.0.2 later_1.3.0 htmltools_0.5.3
[37] tools_4.2.1 gtable_0.3.1 glue_1.6.2
[40] Rcpp_1.0.9 cellranger_1.1.0 jquerylib_0.1.4
[43] vctrs_0.5.0 nlme_3.1-157 xfun_0.33
[46] rvest_1.0.3 mosaicCore_0.9.2.1 mime_0.12
[49] lifecycle_1.0.3 googlesheets4_1.0.1 MASS_7.3-57
[52] zoo_1.8-11 scales_1.2.1 hms_1.1.2
[55] promises_1.2.0.1 parallel_4.2.1 RColorBrewer_1.1-3
[58] yaml_2.3.5 gridExtra_2.3 sass_0.4.2
[61] labelled_2.10.0 rpart_4.1.16 reshape_0.8.9
[64] latticeExtra_0.6-30 stringi_1.7.8 highr_0.9
[67] checkmate_2.1.0 rlang_1.0.6 pkgconfig_2.0.3
[70] evaluate_0.16 labeling_0.4.2 htmlwidgets_1.5.4
[73] cmprsk_2.2-11 tidyselect_1.1.2 plyr_1.8.7
[76] magrittr_2.0.3 bookdown_0.29 R6_2.5.1
[79] generics_0.1.3 DBI_1.1.3 pillar_1.8.1
[82] haven_2.5.1 foreign_0.8-82 withr_2.5.0
[85] mgcv_1.8-40 abind_1.4-5 nnet_7.3-17
[88] etm_1.1.1 modelr_0.1.9 crayon_1.5.2
[91] interp_1.1-3 utf8_1.2.2 tzdb_0.3.0
[94] rmarkdown_2.16 jpeg_0.1-9 grid_4.2.1
[97] data.table_1.14.2 reprex_2.0.2 digest_0.6.30
[100] xtable_1.8-4 httpuv_1.6.6 numDeriv_2016.8-1.1
[103] munsell_0.5.0 bslib_0.4.0