1 Setup and Data Ingest
1.1 Initial Setup and Package Loads
1.2 Loading the Raw Data into R
For this study, I use datasets obtained from the National Health and Nutrition Examination Survey (NHANES), published by the National Center for Health Statistics, via the nhanesA package. I use the following datasets:
- Blood Pressure - Oscillometric Measurement (P_BPXO)
- Smoking - Cigarette Use (P_SMQ)
- Weight History (P_WHQ)
- Income (P_INQ)
- Sleep Disorder (P_SLQ)
- Diabetes (P_DIQ)
- Blood Pressure & Cholesterol (P_BPQ)
library(nhanesA)
# Blood pressure measurement
bp_raw <- nhanes('P_BPXO') |> as_tibble()
saveRDS(bp_raw, "P_BPXO.Rds")
bp_raw <- readRDS("P_BPXO.Rds")
# Smoking
smoke_raw <- nhanes('P_SMQ') |> as_tibble()
saveRDS(smoke_raw, "P_SMQ.Rds")
smoke_raw <- readRDS("P_SMQ.Rds")
# Weight history
weight_raw <- nhanes('P_WHQ') |> as_tibble()
saveRDS(weight_raw, "P_WHQ.Rds")
weight_raw <- readRDS("P_WHQ.Rds")
# Income
income_raw <- nhanes('P_INQ') |> as_tibble()
saveRDS(income_raw, "P_INQ.Rds")
income_raw <- readRDS("P_INQ.Rds")
# Sleep disorders
sleep_raw <- nhanes('P_SLQ') |> as_tibble()
saveRDS(sleep_raw, "P_SLQ.Rds")
sleep_raw <- readRDS("P_SLQ.Rds")
# Diabetes
diabetes_raw <- nhanes('P_DIQ') |> as_tibble()
saveRDS(diabetes_raw, "P_DIQ.Rds")
diabetes_raw <- readRDS("P_DIQ.Rds")
# Blood pressure & cholesterol
chol_raw <- nhanes('P_BPQ') |> as_tibble()
saveRDS(chol_raw, "P_BPQ.Rds")
chol_raw <- readRDS("P_BPQ.Rds")
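The download-and-cache pattern above repeats once per table, so it could be wrapped in a small helper, sketched below. get_nhanes is a hypothetical name, not a function in the nhanesA package.
# Hypothetical helper: download an NHANES table once, then reuse the cached copy
get_nhanes <- function(table_name) {
  rds <- paste0(table_name, ".Rds")
  if (!file.exists(rds)) saveRDS(nhanes(table_name) |> as_tibble(), rds)
  readRDS(rds)
}
bp_raw <- get_nhanes("P_BPXO")  # e.g., the blood pressure table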
1.3 Content of the Raw Tibbles
- The first tibble, bp_raw, contains 12 variables and 11,656 rows, including the unique identifier SEQN (respondent sequence number), which is unique for each participant.
dim(bp_raw)
[1] 11656 12
- smoke_raw contains 16 variables and 11,137 rows, including the unique identifier SEQN.
dim(smoke_raw)
[1] 11137 16
- weight_raw contains 35 variables and 10,195 rows, including the unique identifier SEQN.
dim(weight_raw)
[1] 10195 35
- income_raw contains 3 variables and 15,560 rows, including the unique identifier SEQN.
dim(income_raw)
[1] 15560 3
- sleep_raw contains 11 variables and 10,195 rows, including the unique identifier SEQN.
dim(sleep_raw)
[1] 10195 11
- diabetes_raw contains 28 variables and 14,986 rows, including the unique identifier SEQN.
dim(diabetes_raw)
[1] 14986 28
- chol_raw contains 11 variables and 10,195 rows, including the unique identifier SEQN.
dim(chol_raw)
[1] 10195 11
1.4 Merging Steps
I join the columns of these tables together using inner joins on SEQN. The merged data should contain 110 columns.
df_raw <- inner_join(bp_raw, smoke_raw, by = "SEQN")
df_raw <- inner_join(df_raw, weight_raw, by = "SEQN")
df_raw <- inner_join(df_raw, sleep_raw, by = "SEQN")
df_raw <- inner_join(df_raw, diabetes_raw, by = "SEQN")
df_raw <- inner_join(df_raw, income_raw, by = "SEQN")
df_raw <- inner_join(df_raw, chol_raw, by = "SEQN")
dim(df_raw)
[1] 9445 110
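As an aside, the chain of joins above can also be written as a fold over a list of tables, a sketch using base R's Reduce:
# Fold the seven raw tibbles together with successive inner joins on SEQN
tables <- list(bp_raw, smoke_raw, weight_raw, sleep_raw,
               diabetes_raw, income_raw, chol_raw)
df_raw <- Reduce(function(x, y) inner_join(x, y, by = "SEQN"), tables)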
1.5 Selecting Variables
I would like to use these variables for my study:
- SEQN: the respondent sequence number, unique for each participant.
- BPXOSY1, BPXOSY2, BPXOSY3: the 1st, 2nd, and 3rd systolic oscillometric readings, from the Blood Pressure - Oscillometric Measurement (P_BPXO) tibble (bp_raw).
- BPXODI1, BPXODI2, BPXODI3: the 1st, 2nd, and 3rd diastolic oscillometric readings, also from the P_BPXO (bp_raw) tibble.
- SMQ020 from Smoking - Cigarette Use (P_SMQ, smoke_raw): whether the participant has smoked at least 100 cigarettes in their life.
- WHD010 from Weight History (P_WHQ, weight_raw): the participant's self-reported height.
- WHD020 from Weight History (P_WHQ, weight_raw): the participant's self-reported weight.
- INDFMMPC from Income (P_INQ, income_raw): family monthly poverty level category.
- DIQ010 from Diabetes (P_DIQ, diabetes_raw): whether a doctor has told the participant they have diabetes.
- SLD012 from Sleep Disorders (P_SLQ, sleep_raw): the participant's hours of sleep on weekdays.
- SLD013 from Sleep Disorders (P_SLQ, sleep_raw): the participant's hours of sleep on weekends.
- BPQ080 from Blood Pressure & Cholesterol (P_BPQ, chol_raw): whether the participant has been told they have high cholesterol.
df <- df_raw[, c("SEQN", "BPXOSY1", "BPXOSY2", "BPXOSY3", "BPXODI1", "BPXODI2", "BPXODI3", "SMQ020", 'WHD020', 'WHD010', 'INDFMMPC', 'DIQ010', 'SLD012', 'SLD013', 'BPQ080')]
2 Cleaning the Data
2.1 Pre-processing Data
I would like to check the missing values from my current tibble.
miss_var_summary(df) |> kable()
| variable | n_miss | pct_miss |
|---|---|---|
| BPXOSY3 | 1021 | 10.8099524 |
| BPXODI3 | 1021 | 10.8099524 |
| BPXOSY2 | 999 | 10.5770249 |
| BPXODI2 | 999 | 10.5770249 |
| BPXOSY1 | 988 | 10.4605611 |
| BPXODI1 | 988 | 10.4605611 |
| INDFMMPC | 755 | 7.9936474 |
| SMQ020 | 480 | 5.0820540 |
| SLD013 | 85 | 0.8999471 |
| SLD012 | 81 | 0.8575966 |
| WHD010 | 52 | 0.5505558 |
| SEQN | 0 | 0.0000000 |
| WHD020 | 0 | 0.0000000 |
| DIQ010 | 0 | 0.0000000 |
| BPQ080 | 0 | 0.0000000 |
The missing percentage for each variable is quite small. Therefore, I pre-process the data by filtering out the missing, don't-know, and refused values. These codes are specified in the documentation of each variable's source table.
The following snippet filters out the missing values.
df <- df[complete.cases(df),]
Then, I filter out the values 7777 and 9999 from WHD010 and WHD020. These codes denote refused (7777) and don't know (9999).
df <- df[df$WHD010 < 7777,]
df <- df[df$WHD020 < 7777,]
I filter out the values 7 and 9 from DIQ010. These codes denote refused (7) and don't know (9).
df <- df[df$DIQ010 < 7,]
I filter out the values 7 and 9 from SMQ020. These codes denote refused (7) and don't know (9).
df <- df[df$SMQ020 < 7,]
I filter out the values 7 and 9 from INDFMMPC. These codes denote refused (7) and don't know (9).
df <- df[df$INDFMMPC < 7,]
Finally, I filter out the values 7 and 9 from BPQ080. These codes denote refused (7) and don't know (9).
df <- df[df$BPQ080 < 7,]
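For reference, the same cleaning can be expressed as a single pipeline; this sketch (assuming dplyr and tidyr are available) is equivalent to the step-by-step filters above:
# Drop incomplete rows, then remove the NHANES "refused"/"don't know" codes
df <- df |>
  tidyr::drop_na() |>
  dplyr::filter(WHD010 < 7777, WHD020 < 7777,
                DIQ010 < 7, SMQ020 < 7,
                INDFMMPC < 7, BPQ080 < 7)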
After these steps, my tibble contains 6,837 rows and 15 columns.
dim(df)
[1] 6837 15
2.2 Structure of the Tibble
df %>% select(BPXOSY1, BPXOSY2, BPXOSY3, BPXODI1, BPXODI2, BPXODI3, WHD010, WHD020, SLD012, SLD013) %>% mosaic::inspect()
Registered S3 method overwritten by 'mosaic':
method from
fortify.SpatialPolygonsDataFrame ggplot2
quantitative variables:
name class min Q1 median Q3 max mean sd n missing
1 BPXOSY1 numeric 66 110.0 121.0 135.0 220 124.049144 19.518912 6837 0
2 BPXOSY2 numeric 54 110.0 121.0 134.0 222 123.870265 19.438437 6837 0
3 BPXOSY3 numeric 60 110.0 121.0 134.0 220 123.798157 19.260963 6837 0
4 BPXODI1 numeric 36 67.0 74.0 82.0 141 74.751645 11.755247 6837 0
5 BPXODI2 numeric 31 66.0 73.0 81.0 146 74.220711 11.769410 6837 0
6 BPXODI3 numeric 36 66.0 73.0 81.0 145 73.931110 11.833857 6837 0
7 WHD010 numeric 49 63.0 66.0 70.0 82 66.467018 4.160892 6837 0
8 WHD020 numeric 80 150.0 175.0 210.0 457 183.057335 48.703957 6837 0
9 SLD012 numeric 2 6.5 7.5 8.5 14 7.571449 1.649106 6837 0
10 SLD013 numeric 2 7.0 8.0 9.0 14 8.256180 1.786721 6837 0
glimpse(df)
Rows: 6,837
Columns: 15
$ SEQN <dbl> 109266, 109271, 109273, 109274, 109282, 109290, 109292, 10929…
$ BPXOSY1 <dbl> 99, 102, 116, 138, 141, 126, 143, 126, 158, 105, 94, 155, 103…
$ BPXOSY2 <dbl> 99, 108, 110, 132, 137, 116, 138, 130, 161, 105, 91, 165, 104…
$ BPXOSY3 <dbl> 99, 111, 115, 132, 140, 122, 133, 130, 158, 102, 92, 167, 96,…
$ BPXODI1 <dbl> 56, 65, 68, 70, 77, 62, 96, 83, 96, 72, 51, 95, 65, 77, 71, 6…
$ BPXODI2 <dbl> 55, 68, 66, 69, 71, 60, 98, 84, 92, 69, 51, 92, 66, 73, 67, 6…
$ BPXODI3 <dbl> 52, 68, 68, 71, 70, 59, 97, 84, 92, 68, 49, 95, 63, 75, 67, 6…
$ SMQ020 <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2…
$ WHD020 <dbl> 210, 222, 165, 219, 185, 155, 185, 230, 137, 120, 205, 130, 1…
$ WHD010 <dbl> 64, 72, 72, 75, 70, 63, 66, 70, 63, 60, 66, 59, 62, 69, 66, 6…
$ INDFMMPC <dbl> 3, 1, 1, 1, 3, 3, 2, 1, 3, 3, 2, 3, 1, 3, 3, 3, 3, 1, 1, 3, 1…
$ DIQ010 <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
$ SLD012 <dbl> 7.5, 10.0, 6.5, 9.5, 7.0, 4.0, 4.5, 7.5, 7.0, 8.0, 6.0, 4.0, …
$ SLD013 <dbl> 8.0, 13.0, 8.0, 9.5, 8.0, 4.0, 4.5, 6.5, 7.0, 8.0, 7.0, 14.0,…
$ BPQ080 <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
2.3 Checking our Quantitative Variables
In this study, I have 10 quantitative variables. I want to check the range of each to ensure there are no abnormal values.
df |>
select(BPXOSY1, BPXOSY2, BPXOSY3, BPXODI1, BPXODI2, BPXODI3, WHD010, WHD020, SLD012, SLD013) |>
mosaic::inspect()
quantitative variables:
name class min Q1 median Q3 max mean sd n missing
1 BPXOSY1 numeric 66 110.0 121.0 135.0 220 124.049144 19.518912 6837 0
2 BPXOSY2 numeric 54 110.0 121.0 134.0 222 123.870265 19.438437 6837 0
3 BPXOSY3 numeric 60 110.0 121.0 134.0 220 123.798157 19.260963 6837 0
4 BPXODI1 numeric 36 67.0 74.0 82.0 141 74.751645 11.755247 6837 0
5 BPXODI2 numeric 31 66.0 73.0 81.0 146 74.220711 11.769410 6837 0
6 BPXODI3 numeric 36 66.0 73.0 81.0 145 73.931110 11.833857 6837 0
7 WHD010 numeric 49 63.0 66.0 70.0 82 66.467018 4.160892 6837 0
8 WHD020 numeric 80 150.0 175.0 210.0 457 183.057335 48.703957 6837 0
9 SLD012 numeric 2 6.5 7.5 8.5 14 7.571449 1.649106 6837 0
10 SLD013 numeric 2 7.0 8.0 9.0 14 8.256180 1.786721 6837 0
The range of each variable is quite reasonable given its meaning, so I proceed to the next step.
Processing Height and Weight Values to Calculate the Body Mass Index (BMI)
In this study, instead of using weight and height directly, I use them to calculate BMI. The BMI formula for inches and pounds is available at http://www.bmi-calculator.net/bmi-formula.php. A reasonable range for BMI is normally about 15 to 50. However, since NHANES was designed to measure the prevalence of chronic conditions in the U.S. population, values can fall above or below this range. Therefore, BMI values above 50, or slightly under 15 as seen below, are acceptable here. To confirm this, I also check the range of the height and weight values.
df['BMI'] <-df$WHD020*703/(df$WHD010)**2
describe(~ BMI, data = df)
BMI
1 Variables 6837 Observations
--------------------------------------------------------------------------------
BMI
n missing distinct Info Mean Gmd .05 .10
6837 0 2056 1 29.05 7.521 20.17 21.45
.25 .50 .75 .90 .95
24.21 27.81 32.48 38.17 42.28
lowest : 14.63059 14.74719 14.87748 14.99636 15.10352
highest: 70.47227 72.25278 72.62397 74.87574 76.87125
--------------------------------------------------------------------------------
As we can see here, the range of weights is quite wide, reaching up to 457 lbs, and the tallest person is 82 inches (6'8''). Such values are rare, but when sampling from the general population some exceptional cases are expected. Therefore, I leave these values intact.
df|>
select(WHD020, WHD010) |>
describe()
select(df, WHD020, WHD010)
2 Variables 6837 Observations
--------------------------------------------------------------------------------
WHD020
n missing distinct Info Mean Gmd .05 .10
6837 0 265 1 183.1 53 118.0 128.0
.25 .50 .75 .90 .95
150.0 175.0 210.0 248.0 271.2
lowest : 80 82 86 88 89, highest: 415 416 434 450 457
--------------------------------------------------------------------------------
WHD010
n missing distinct Info Mean Gmd .05 .10
6837 0 31 0.995 66.47 4.72 60 61
.25 .50 .75 .90 .95
63 66 70 72 73
lowest : 49 50 53 54 55, highest: 77 78 79 81 82
--------------------------------------------------------------------------------
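As a quick sanity check on the BMI formula, we can plug in the median self-reported weight (175 lb) and height (66 in) from the summaries above:
# BMI = weight (lb) * 703 / height (in)^2
175 * 703 / 66^2  # about 28.2, in line with the reported median BMI of 27.81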
2.3.1 Averaging the Systolic and Diastolic Values
Instead of keeping 3 separate values for each of the systolic and diastolic readings, I average them.
df['systolic'] <- as.numeric(format(round(rowMeans(df[,c("BPXOSY3","BPXOSY2","BPXOSY1")]), 1), nsmall = 1))
df['diastolic'] <- as.numeric(format(round(rowMeans(df[,c("BPXODI1","BPXODI2","BPXODI3")]), 1), nsmall = 1))
Then, I check their ranges. Normal blood pressure is above roughly 90/60 (systolic/diastolic). However, for people with abnormal blood pressure (hypotension or hypertension), readings can fall outside this scope: in a hypertensive crisis, systolic can exceed 200 and diastolic 120, while in hypotension systolic can drop below 90 and diastolic below 60 (but above 40). Given this, the ranges described below are reasonable to keep intact.
df |>
select(systolic, diastolic) |>
describe()
select(df, systolic, diastolic)
2 Variables 6837 Observations
--------------------------------------------------------------------------------
systolic
n missing distinct Info Mean Gmd .05 .10
6837 0 344 1 123.9 20.91 98.3 102.7
.25 .50 .75 .90 .95
110.0 121.0 134.3 149.7 160.3
lowest : 78.7 79.7 80.0 80.3 82.7, highest: 210.3 213.3 214.0 217.0 218.7
--------------------------------------------------------------------------------
diastolic
n missing distinct Info Mean Gmd .05 .10
6837 0 229 1 74.3 12.78 57.3 60.3
.25 .50 .75 .90 .95
66.3 73.3 81.3 89.0 94.0
lowest : 41.3 41.7 42.0 43.0 43.7, highest: 126.0 128.0 135.0 137.7 143.7
--------------------------------------------------------------------------------
2.3.2 Average Hours of Sleep
Instead of keeping hours of sleep during weekdays and weekends separately, I keep one value by averaging the two.
df['SleepHours'] <- as.numeric(format(round(rowMeans(df[,c('SLD012', 'SLD013')]), 1), nsmall = 1))
Then, I check the range of this variable. In the original data, hours of sleep run from 2 to 14, so the average should fall within that scope.
df |>
select(SleepHours) |>
describe()
select(df, SleepHours)
1 Variables 6837 Observations
--------------------------------------------------------------------------------
SleepHours
n missing distinct Info Mean Gmd .05 .10
6837 0 47 0.994 7.914 1.653 5.5 6.0
.25 .50 .75 .90 .95
7.0 8.0 9.0 9.8 10.5
lowest : 2.0 2.5 2.8 3.0 3.2, highest: 12.8 13.0 13.5 13.8 14.0
--------------------------------------------------------------------------------
2.4 Checking Binary Variables
My binary variables are SMQ020 and BPQ080.
For the SMQ020 Variable
df |> select(SMQ020) |> glimpse()
Rows: 6,837
Columns: 1
$ SMQ020 <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, …
In the original tibble, 1 denotes Yes and 2 denotes No. Therefore, I recode it into a Yes/No factor.
df <- df %>% mutate(smoking = fct_recode(factor(SMQ020), "Yes" = "1", "No" = "2"))
df |> select(smoking) |> summary()
smoking
Yes:2783
No :4054
There are no out-of-range values.
2.4.1 For the BPQ080 Variable
df |> select(BPQ080) |> glimpse()
Rows: 6,837
Columns: 1
$ BPQ080 <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, …
In the original tibble, 1 denotes Yes and 2 denotes No. Therefore, I recode it into a Yes/No factor.
df <- df %>% mutate(highchol = fct_recode(factor(BPQ080), "Yes" = "1", "No" = "2"))
df |> select(highchol) |> summary()
highchol
Yes:2409
No :4428
There are no out-of-range values.
2.5 Checking Multi-Category Variables
I have two multi-category variables: INDFMMPC and DIQ010. I first check them for any surprising values, then rename and factor them to mirror what we need in our analyses.
2.6 The INDFMMPC Variable
First, I would take a look at this variable.
df |> tabyl(INDFMMPC) |> kable()
| INDFMMPC | n | percent |
|---|---|---|
| 1 | 2141 | 0.3131490 |
| 2 | 1051 | 0.1537224 |
| 3 | 3645 | 0.5331286 |
For this variable, I create a new factor called eco_status with the level names: 1 as Lower, 2 as Middle, and 3 as Upper.
df <- df %>% mutate(eco_status = fct_recode(factor(INDFMMPC), "Lower" = "1", "Middle" = "2", "Upper" = "3"))
After that, I check that the factoring is correct.
df |> count(INDFMMPC, eco_status) |> kable()
| INDFMMPC | eco_status | n |
|---|---|---|
| 1 | Lower | 2141 |
| 2 | Middle | 1051 |
| 3 | Upper | 3645 |
2.7 The DIQ010 Variable
First, I would take a look at this variable.
df |> tabyl(DIQ010) |> kable()
| DIQ010 | n | percent |
|---|---|---|
| 1 | 986 | 0.1442153 |
| 2 | 5664 | 0.8284335 |
| 3 | 187 | 0.0273512 |
For this variable, I create a new factor called diabetes with the level names: 1 as Yes, 2 as No, and 3 as Borderline.
df <- df %>% mutate(diabetes = fct_recode(factor(DIQ010), "Yes" = "1", "No" = "2", "Borderline" = "3"))
After that, I check that the factoring is correct.
df |> count(DIQ010, diabetes) |> kable()
| DIQ010 | diabetes | n |
|---|---|---|
| 1 | Yes | 986 |
| 2 | No | 5664 |
| 3 | Borderline | 187 |
2.8 Creating the Analytic Tibble
Our analytic tibble, which I call df, contains 9 variables after this selection step.
df <- df %>% select(SEQN,systolic, diastolic, BMI, SleepHours, smoking, eco_status, diabetes, highchol)
2.9 List of Missing Values
As I already filtered the don't-know and missing values above, our analytic tibble should not contain any such values.
miss_var_summary(df) |> kable()
| variable | n_miss | pct_miss |
|---|---|---|
| SEQN | 0 | 0 |
| systolic | 0 | 0 |
| diastolic | 0 | 0 |
| BMI | 0 | 0 |
| SleepHours | 0 | 0 |
| smoking | 0 | 0 |
| eco_status | 0 | 0 |
| diabetes | 0 | 0 |
| highchol | 0 | 0 |
3 Codebook and Data Description
3.1 Codebook
The 9 variables of my analytic tibble are described below. In the Type column, Quant indicates a quantitative variable and Cat-x indicates a categorical variable (factor) with x levels.
| Variable | Type | Description/Levels |
|---|---|---|
| SEQN | id | Respondent sequence number, unique for each participant |
| systolic | Quant | Outcome: average of the three systolic readings |
| diastolic | Quant | Average of the three diastolic readings |
| BMI | Quant | Key predictor: Body Mass Index |
| SleepHours | Quant | Average of weekday and weekend sleep hours |
| smoking | Cat-2 | Yes, No: has smoked at least 100 cigarettes in life |
| eco_status | Cat-3 | Lower, Middle, Upper: family monthly poverty level category |
| diabetes | Cat-3 | Yes, No, Borderline: told by a doctor they have diabetes |
| highchol | Cat-2 | Yes, No: told by a doctor they have high cholesterol |
3.2 Analytic Tibble
df
# A tibble: 6,837 × 9
SEQN systolic diastolic BMI SleepHours smoking eco_status diabe…¹ highc…²
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <fct> <fct>
1 109266 99 54.3 36.0 7.8 No Upper No Yes
2 109271 107 67 30.1 11.5 Yes Lower No Yes
3 109273 114. 67.3 22.4 7.2 Yes Lower No No
4 109274 134 70 27.4 9.5 No Lower Yes Yes
5 109282 139. 72.7 26.5 7.5 Yes Upper No Yes
6 109290 121. 60.3 27.5 4 No Upper Yes No
7 109292 138 97 29.9 4.5 No Middle Yes Yes
8 109293 129. 83.7 33.0 7 No Lower No No
9 109295 159 93.3 24.3 7 No Upper No No
10 109297 104 69.7 23.4 8 No Upper No No
# … with 6,827 more rows, and abbreviated variable names ¹diabetes, ²highchol
is_tibble(df)
[1] TRUE
3.3 Data Summary
Hmisc::describe(df |> select(-SEQN))
select(df, -SEQN)
8 Variables 6837 Observations
--------------------------------------------------------------------------------
systolic
n missing distinct Info Mean Gmd .05 .10
6837 0 344 1 123.9 20.91 98.3 102.7
.25 .50 .75 .90 .95
110.0 121.0 134.3 149.7 160.3
lowest : 78.7 79.7 80.0 80.3 82.7, highest: 210.3 213.3 214.0 217.0 218.7
--------------------------------------------------------------------------------
diastolic
n missing distinct Info Mean Gmd .05 .10
6837 0 229 1 74.3 12.78 57.3 60.3
.25 .50 .75 .90 .95
66.3 73.3 81.3 89.0 94.0
lowest : 41.3 41.7 42.0 43.0 43.7, highest: 126.0 128.0 135.0 137.7 143.7
--------------------------------------------------------------------------------
BMI
n missing distinct Info Mean Gmd .05 .10
6837 0 2056 1 29.05 7.521 20.17 21.45
.25 .50 .75 .90 .95
24.21 27.81 32.48 38.17 42.28
lowest : 14.63059 14.74719 14.87748 14.99636 15.10352
highest: 70.47227 72.25278 72.62397 74.87574 76.87125
--------------------------------------------------------------------------------
SleepHours
n missing distinct Info Mean Gmd .05 .10
6837 0 47 0.994 7.914 1.653 5.5 6.0
.25 .50 .75 .90 .95
7.0 8.0 9.0 9.8 10.5
lowest : 2.0 2.5 2.8 3.0 3.2, highest: 12.8 13.0 13.5 13.8 14.0
--------------------------------------------------------------------------------
smoking
n missing distinct
6837 0 2
Value Yes No
Frequency 2783 4054
Proportion 0.407 0.593
--------------------------------------------------------------------------------
eco_status
n missing distinct
6837 0 3
Value Lower Middle Upper
Frequency 2141 1051 3645
Proportion 0.313 0.154 0.533
--------------------------------------------------------------------------------
diabetes
n missing distinct
6837 0 3
Value Yes No Borderline
Frequency 986 5664 187
Proportion 0.144 0.828 0.027
--------------------------------------------------------------------------------
highchol
n missing distinct
6837 0 2
Value Yes No
Frequency 2409 4428
Proportion 0.352 0.648
--------------------------------------------------------------------------------
Since I already filtered out all missing and invalid values, no variable shows any missing values.
4 My Research Question
According to the Centers for Disease Control and Prevention (CDC), nearly half of adults in the U.S. (47%, or 116 million people) have hypertension, defined as systolic blood pressure greater than 130 mm Hg or diastolic blood pressure greater than 80 mm Hg. Hypertension raises the risk of heart disease and stroke, which are among the leading causes of death in the U.S.
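As a quick illustration, that CDC threshold can be applied directly to the analytic tibble; this is a minimal sketch, not part of the modeling pipeline:
# Flag participants meeting the CDC hypertension definition
# (systolic > 130 mm Hg or diastolic > 80 mm Hg)
df |>
  dplyr::mutate(hypertensive = systolic > 130 | diastolic > 80) |>
  dplyr::count(hypertensive)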
Various factors can lead to high blood pressure, such as age, race, family history, physical activity, weight, smoking, stress, certain chronic conditions, and sleep disorders. Therefore, in this study I choose several risk factors to investigate blood pressure: weight, smoking (cigarettes), income, sleep, and diabetes, using data from the National Center for Health Statistics collected from 2017 to 2020, before the pandemic.
In practice, healthcare professionals assess blood pressure using both the systolic and diastolic readings. However, since I can only use one outcome, I choose systolic over diastolic for my study. Both values are important, but some studies have shown that people at higher risk of heart attack and stroke tend to have higher systolic readings. In addition, people in the elevated blood pressure stage have higher systolic, but not diastolic, readings compared to people with normal blood pressure. For these reasons, systolic is the more useful and more informative outcome for this study.
This leads to the research question of my study:
How effectively can we predict high blood pressure via the systolic reading using BMI, and does the quality of prediction improve meaningfully when I adjust for other predictors (sleep hours, diabetes, income, smoking, and cholesterol) in my data?
5 Partitioning the Data
Here, I obtain a training sample with a randomly selected 70% of the data and place the remaining 30% in a test sample, properly labeled, using set.seed so that the results can be replicated later.
I will call the training sample df_training and the test
sample df_test.
- The slice_sample function samples the specified proportion of the data.
- The anti_join function returns all rows in the first data frame (here df_analytic) that are not in the second data frame (here df_training), as assessed by the row-specific identification code (here SEQN).
set.seed(42)
df_analytic <- df
df_training <- df_analytic |> slice_sample(prop = .70)
df_test <- anti_join(df_analytic, df_training, by = "SEQN")
dim(df_analytic)
[1] 6837 9
dim(df_training)
[1] 4785 9
dim(df_test)
[1] 2052 9
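The split sizes can also be verified programmatically, as in this one-line sketch:
# Check that the 70/30 split reassembles the full analytic sample
stopifnot(nrow(df_training) + nrow(df_test) == nrow(df_analytic))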
Since 4785 + 2052 = 6837, we should be fine.
Transforming the Outcome
5.1 Visualizing the Outcome Distribution
To assess the distribution of my outcome (systolic), I plot a Normal Q-Q plot, a histogram with a Normal density overlay, and a boxplot.
p1 <- ggplot(df_training, aes(sample = systolic)) +
geom_qq() + # plot the points
geom_qq_line(col = "blue") + # plot the Y = X line
theme(aspect.ratio = 1) + # make the plot square
labs(title = "Normal Q-Q plot: Untransformed Systolic")
p2 <- ggplot(df_training, aes(x=systolic)) +
geom_histogram(aes(y=stat(density)), bins = 20, fill ="royalblue", col ="white") +
stat_function(fun = dnorm, args = list(mean = mean(df_training$systolic), sd = sd(df_training$systolic)), col ="red", lwd = 1.5) +
labs(title="Density Function: Untransformed Systolic")
p3 <- ggplot(df_training, aes(x = systolic, y = "")) +
geom_boxplot(fill = "royalblue",
outlier.color = "royalblue") +
labs(title = "Boxplot: Untransformed Systolic", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
We can see that the distribution of our outcome is clearly right-skewed.
Therefore, it is better to transform the outcome. To do this, I use the boxCox function.
5.2 Using boxCox to Assess the Need for Transformation of the Outcome
To use the boxCox function, I need to ensure that my outcome, systolic, takes strictly positive values. As shown below, the minimum value of the outcome in the training sample is 79.7, so we are good to proceed.
mosaic::favstats(~ systolic, data = df_training) |> kable()
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 79.7 | 110 | 120.7 | 133.3 | 218.7 | 123.3937 | 18.64711 | 4785 | 0 |
mod_0 <- lm(systolic ~ BMI + SleepHours + smoking + diabetes + eco_status + highchol, data = df_training)
boxCox(mod_0)
powerTransform(mod_0)
Estimated transformation parameter
Y1
-0.9637329
The boxCox plot and the estimated transformation parameter suggest a power of nearly -1, which corresponds to the inverse of systolic. Therefore, I re-examine the distribution under this transformation of the outcome.
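As a quick numeric check of the skewness reduction, here is a minimal sketch using a hand-rolled moment-based skewness (no dedicated skewness helper is attached in this session):
# Moment-based sample skewness: mean of cubed z-scores
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)
skew(df_training$systolic)     # positive => right-skewed
skew(1 / df_training$systolic) # expected to be much closer to 0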
p1 <- ggplot(df_training, aes(sample = 1/systolic)) +
geom_qq() + # plot the points
geom_qq_line(col = "blue") + # plot the Y = X line
theme(aspect.ratio = 1) + # make the plot square
labs(title = "Normal Q-Q plot: Inverse Systolic")
p2 <- ggplot(df_training, aes(x=1/systolic)) +
geom_histogram(aes(y=stat(density)), bins = 20, fill ="royalblue", col ="white") +
stat_function(fun = dnorm, args = list(mean = mean(1/df_training$systolic), sd = sd(1/df_training$systolic)), col ="red", lwd = 1.5) +
labs(title="Density Function: Inverse Systolic")
p3 <- ggplot(df_training, aes(x = 1/systolic, y = "")) +
geom_boxplot(fill = "royalblue",
outlier.color = "royalblue") +
labs(title = "Boxplot: Inverse Systolic", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
From these graphs, after the transformation, the outcome looks approximately normally distributed. Therefore, I apply this transformation to the outcome in my study.
5.3 Numerical Summary of the Outcome
This is the numerical summary of the outcome before the transformation
mosaic::favstats(~ systolic, data = df_training) |> kable()
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 79.7 | 110 | 120.7 | 133.3 | 218.7 | 123.3937 | 18.64711 | 4785 | 0 |
And this is the numerical summary of the outcome after the transformation
mosaic::favstats(~ 1/systolic, data = df_training) |> kable()
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0.0045725 | 0.0075019 | 0.008285 | 0.0090909 | 0.0125471 | 0.008278 | 0.0011759 | 4785 | 0 |
5.4 Numerical Summaries of the Predictors
I examine numerical summaries for my predictor variables in the training data.
df_training |> select(-SEQN, -systolic, -diastolic) |>
mosaic::inspect()
categorical variables:
name class levels n missing
1 smoking factor 2 4785 0
2 eco_status factor 3 4785 0
3 diabetes factor 3 4785 0
4 highchol factor 2 4785 0
distribution
1 No (59.5%), Yes (40.5%)
2 Upper (53.9%), Lower (31.2%) ...
3 No (83.1%), Yes (14.3%) ...
4 No (64.5%), Yes (35.5%)
quantitative variables:
name class min Q1 median Q3 max mean
1 BMI numeric 14.63059 24.12663 27.75849 32.60009 72.62397 29.039051
2 SleepHours numeric 2.00000 7.00000 8.00000 9.00000 14.00000 7.924305
sd n missing
1 7.051636 4785 0
2 1.511098 4785 0
5.5 Scatterplot Matrix
Then, I build scatterplot matrices to investigate the relationships between our outcome and the predictors. The first matrix includes the quantitative variables, while the second includes the categorical variables.
This plot shows our outcome against its quantitative predictors. The correlations between the outcome (1/systolic) and the predictors are relatively low, at only -0.048 and 0.080. Still, compared with the values I obtained for my presentation, this is already an improvement. In addition, the distribution of my key predictor is right-skewed.
df4_training <- mutate(df_training, systolic = 1/systolic)
temp <- df4_training |>
select(systolic, BMI, SleepHours)
ggpairs(temp, title = "Scatterplot Matrix",
lower = list(combo = wrap("facethist", bins = 20)))
From this scatterplot matrix, we can see many outliers at both ends for the outcome within our categorical predictors. In addition, the group sizes within a variable are unequal, e.g., for diabetes. However, the outcome's distribution within each group appears roughly normal.
df4_training <- mutate(df_training, systolic = 1/systolic)
temp <- df4_training |>
select(systolic, smoking, diabetes, eco_status, highchol)
ggpairs(temp, title = "Scatterplot Matrix",
lower = list(combo = wrap("facethist", bins = 20)))
Since the inverse of systolic yields relatively small numeric values, the difference between the two smoking groups is not visually noticeable. However, the numerical summary below shows the difference.
mosaic::favstats(systolic ~ smoking, data = df4_training) |> kable()
| smoking | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Yes | 0.0046729 | 0.0073153 | 0.0081301 | 0.0089286 | 0.0125471 | 0.0081269 | 0.0011751 | 1939 | 0 |
| No | 0.0045725 | 0.0076161 | 0.0084034 | 0.0091743 | 0.0125000 | 0.0083809 | 0.0011655 | 2846 | 0 |
5.6 Collinearity Checking
None of the numeric candidate predictors show any substantial
correlation with each other. The largest Pearson correlation (in
absolute value) between predictors is (-0.054) for BMI and
SleepHours, and that’s not strong. If we did see signs of
meaningful collinearity, we might rethink our selected set of
predictors.
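For reference, this check can be run with a simple correlation matrix of the quantitative candidate predictors in the training sample (a minimal sketch):
# Pearson correlations among the quantitative candidate predictors
df_training |>
  dplyr::select(BMI, SleepHours) |>
  cor() |>
  round(3)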
6 The Big Model
The “kitchen sink” linear regression model describes the relationship between our outcome (1/systolic) and the main effects of each of our predictors.
6.1 Fitting/Summarizing the Kitchen Sink model
Our “kitchen sink” model, mod_1, predicts the inverse of the outcome using the predictors BMI, SleepHours, smoking, diabetes, eco_status, and highchol.
mod_1 <- lm(1/systolic ~ BMI + SleepHours + smoking + diabetes + eco_status + highchol, data = df_training)
summary(mod_1)
Call:
lm(formula = 1/systolic ~ BMI + SleepHours + smoking + diabetes +
eco_status + highchol, data = df_training)
Residuals:
Min 1Q Median 3Q Max
-0.0040245 -0.0007491 -0.0000029 0.0007505 0.0049285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.347e-03 1.298e-04 56.613 < 2e-16 ***
BMI -2.017e-06 2.378e-06 -0.848 0.3965
SleepHours 4.940e-05 1.096e-05 4.506 6.77e-06 ***
smokingNo 1.911e-04 3.409e-05 5.607 2.17e-08 ***
diabetesNo 3.603e-04 4.939e-05 7.296 3.46e-13 ***
diabetesBorderline 4.387e-05 1.106e-04 0.397 0.6917
eco_statusMiddle -9.760e-05 5.186e-05 -1.882 0.0599 .
eco_statusUpper -8.097e-05 3.748e-05 -2.160 0.0308 *
highcholNo 3.745e-04 3.597e-05 10.412 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.001139 on 4776 degrees of freedom
Multiple R-squared: 0.06357, Adjusted R-squared: 0.062
F-statistic: 40.53 on 8 and 4776 DF, p-value: < 2.2e-16
6.2 Effect Sizes: Coefficient Estimates
Here I specify the size and direction of all coefficients, providing estimated effect sizes with 90% confidence intervals. Since my coefficients are very small, I display them in full decimal form. From the analysis below, we can see that our key predictor, BMI, is not significant, as its p-value exceeds 0.1.
tidy(mod_1, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |>
kable()
| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 0.0073475 | 0.0001298 | 0.0071340 | 0.0075610 | 0.0000000 |
| BMI | -0.0000020 | 0.0000024 | -0.0000059 | 0.0000019 | 0.3964848 |
| SleepHours | 0.0000494 | 0.0000110 | 0.0000314 | 0.0000674 | 0.0000068 |
| smokingNo | 0.0001911 | 0.0000341 | 0.0001351 | 0.0002472 | 0.0000000 |
| diabetesNo | 0.0003603 | 0.0000494 | 0.0002791 | 0.0004416 | 0.0000000 |
| diabetesBorderline | 0.0000439 | 0.0001106 | -0.0001381 | 0.0002258 | 0.6917092 |
| eco_statusMiddle | -0.0000976 | 0.0000519 | -0.0001829 | -0.0000123 | 0.0598946 |
| eco_statusUpper | -0.0000810 | 0.0000375 | -0.0001426 | -0.0000193 | 0.0308103 |
| highcholNo | 0.0003745 | 0.0000360 | 0.0003153 | 0.0004337 | 0.0000000 |
6.3 Describing the Equation
extract_eq(mod_1, use_coefs = TRUE, coef_digits = 8,
terms_per_line = 3, wrap = TRUE, ital_vars = TRUE)
\[ \begin{aligned} \widehat{1/systolic} &= 0.0073475 - 2.02e-06(BMI) + 4.94e-05(SleepHours)\ + \\ &\quad 0.00019114(smoking_{No}) + 0.00036035(diabetes_{No}) + 4.387e-05(diabetes_{Borderline})\ - \\ &\quad 9.76e-05(eco\_status_{Middle}) - 8.097e-05(eco\_status_{Upper}) + 0.0003745(highchol_{No}) \end{aligned} \]
- Holding the other predictors constant, an increase in BMI or a decrease in sleep hours is associated with a decrease in 1/systolic, i.e., an increase in systolic.
- Not smoking, not having diabetes (or having only borderline diabetes), and not having high cholesterol are each associated with an increase in 1/systolic, i.e., a decrease in systolic.
- Relative to the Lower economic status group, being in the Middle or Upper group is associated with a decrease in 1/systolic, i.e., an increase in systolic.
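To make these directions concrete, here is a small sketch predicting systolic for a hypothetical participant; all covariate values below are chosen purely for illustration:
# Hypothetical participant (illustrative values only)
new_obs <- data.frame(BMI = 29, SleepHours = 8, smoking = "No",
                      diabetes = "No", eco_status = "Middle",
                      highchol = "No")
# mod_1 predicts 1/systolic, so invert the prediction to recover systolic
1 / predict(mod_1, newdata = new_obs)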
7 The Smaller Model
7.1 Backwards Stepwise Elimination
Rather than relying on the step function for backwards stepwise elimination, I investigate a few models that I selected myself. This is because the step function would eventually remove my key predictor from the model, which is not what I want.
step(mod_1)
Start: AIC=-64853.86
1/systolic ~ BMI + SleepHours + smoking + diabetes + eco_status +
highchol
Df Sum of Sq RSS AIC
- BMI 1 9.3300e-07 0.0061954 -64855
<none> 0.0061945 -64854
- eco_status 2 7.3720e-06 0.0062018 -64852
- SleepHours 1 2.6330e-05 0.0062208 -64836
- smoking 1 4.0777e-05 0.0062353 -64824
- diabetes 2 7.6476e-05 0.0062710 -64799
- highchol 1 1.4061e-04 0.0063351 -64748
Step: AIC=-64855.14
1/systolic ~ SleepHours + smoking + diabetes + eco_status + highchol
Df Sum of Sq RSS AIC
<none> 0.0061954 -64855
- eco_status 2 7.1900e-06 0.0062026 -64854
- SleepHours 1 2.6898e-05 0.0062223 -64836
- smoking 1 4.0834e-05 0.0062362 -64826
- diabetes 2 8.1007e-05 0.0062764 -64797
- highchol 1 1.4171e-04 0.0063371 -64749
Call:
lm(formula = 1/systolic ~ SleepHours + smoking + diabetes + eco_status +
highchol, data = df_training)
Coefficients:
(Intercept) SleepHours smokingNo diabetesNo
7.278e-03 4.987e-05 1.913e-04 3.668e-04
diabetesBorderline eco_statusMiddle eco_statusUpper highcholNo
4.669e-05 -9.713e-05 -7.950e-05 3.757e-04
Therefore, besides my big model, I investigate three other models:
- Model 2: 1/systolic ~ BMI
- Model 3: 1/systolic ~ BMI + SleepHours
- Model 4: 1/systolic ~ BMI + SleepHours + diabetes
7.2 Fitting the “small” model
This is my first small model, which investigates the relationship between the outcome and the key predictor.
mod_2 <- lm(1/systolic ~ BMI, data = df_training)
summary(mod_2)
Call:
lm(formula = 1/systolic ~ BMI, data = df_training)
Residuals:
Min 1Q Median 3Q Max
-0.0036536 -0.0007741 0.0000151 0.0008147 0.0045362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.512e-03 7.197e-05 118.276 < 2e-16 ***
BMI -8.064e-06 2.408e-06 -3.348 0.000819 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.001175 on 4783 degrees of freedom
Multiple R-squared: 0.002339, Adjusted R-squared: 0.00213
F-statistic: 11.21 on 1 and 4783 DF, p-value: 0.0008189
This is my second small model, which investigates the relationship between the outcome and the key predictor along with the average hours of sleep.
mod_3 <- lm(1/systolic ~ BMI + SleepHours, data = df_training)
summary(mod_3)
Call:
lm(formula = 1/systolic ~ BMI + SleepHours, data = df_training)
Residuals:
Min 1Q Median 3Q Max
-0.0036914 -0.0007896 0.0000078 0.0008064 0.0044072
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.016e-03 1.172e-04 68.388 < 2e-16 ***
BMI -7.374e-06 2.405e-06 -3.066 0.00218 **
SleepHours 6.010e-05 1.122e-05 5.355 8.94e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.001171 on 4782 degrees of freedom
Multiple R-squared: 0.008287, Adjusted R-squared: 0.007872
F-statistic: 19.98 on 2 and 4782 DF, p-value: 2.288e-09
This is my third small model, which investigates the relationship between the outcome and the key predictor along with the average hours of sleep and the diabetes groups.
mod_4 <- lm(1/systolic ~ BMI + SleepHours + diabetes, data = df_training)
summary(mod_4)
Call:
lm(formula = 1/systolic ~ BMI + SleepHours + diabetes, data = df_training)
Residuals:
Min 1Q Median 3Q Max
-0.0037745 -0.0007651 -0.0000047 0.0007947 0.0046675
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.477e-03 1.273e-04 58.749 < 2e-16 ***
BMI -2.877e-06 2.412e-06 -1.193 0.233
SleepHours 5.816e-05 1.109e-05 5.242 1.65e-07 ***
diabetesNo 5.072e-04 4.858e-05 10.441 < 2e-16 ***
diabetesBorderline 9.720e-05 1.123e-04 0.866 0.387
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.001157 on 4780 degrees of freedom
Multiple R-squared: 0.03236, Adjusted R-squared: 0.03155
F-statistic: 39.96 on 4 and 4780 DF, p-value: < 2.2e-16
7.3 Effect Sizes: Coefficient Estimates
tidy(mod_2, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |> kable()
| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 0.0085122 | 7.2e-05 | 0.0083938 | 0.0086306 | 0.0000000 |
| BMI | -0.0000081 | 2.4e-06 | -0.0000120 | -0.0000041 | 0.0008189 |
tidy(mod_3, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |> kable()
| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 0.0080159 | 0.0001172 | 0.0078230 | 0.0082087 | 0.0000000 |
| BMI | -0.0000074 | 0.0000024 | -0.0000113 | -0.0000034 | 0.0021808 |
| SleepHours | 0.0000601 | 0.0000112 | 0.0000416 | 0.0000786 | 0.0000001 |
tidy(mod_4, conf.int = TRUE, conf.level = 0.90) |>
select(term, estimate, std.error, conf.low, conf.high, p.value) |> kable()
| term | estimate | std.error | conf.low | conf.high | p.value |
|---|---|---|---|---|---|
| (Intercept) | 0.0074769 | 0.0001273 | 0.0072675 | 0.0076862 | 0.0000000 |
| BMI | -0.0000029 | 0.0000024 | -0.0000068 | 0.0000011 | 0.2329745 |
| SleepHours | 0.0000582 | 0.0000111 | 0.0000399 | 0.0000764 | 0.0000002 |
| diabetesNo | 0.0005072 | 0.0000486 | 0.0004273 | 0.0005871 | 0.0000000 |
| diabetesBorderline | 0.0000972 | 0.0001123 | -0.0000875 | 0.0002819 | 0.3866541 |
7.4 Small Model Regression Equation
extract_eq(mod_2, use_coefs = TRUE, coef_digits = 8,
terms_per_line = 3, wrap = TRUE, ital_vars = TRUE)
\[ \begin{aligned} \widehat{1/systolic} &= 0.00851217 - 8.06e-06(BMI) \end{aligned} \]
- This model implies that an increase in BMI is associated with a decrease in 1/systolic, i.e., an increase in systolic.
extract_eq(mod_3, use_coefs = TRUE, coef_digits = 8,
terms_per_line = 3, wrap = TRUE, ital_vars = TRUE)
\[ \begin{aligned} \widehat{1/systolic} &= 0.00801586 - 7.37e-06(BMI) + 6.01e-05(SleepHours) \end{aligned} \]
- This model implies that an increase in BMI or a decrease in sleep hours is associated with a decrease in 1/systolic, i.e., an increase in systolic.
extract_eq(mod_4, use_coefs = TRUE, coef_digits = 8,
terms_per_line = 3, wrap = TRUE, ital_vars = TRUE)
\[ \begin{aligned} \widehat{1/systolic} &= 0.00747686 - 2.88e-06(BMI) + 5.816e-05(SleepHours)\ + \\ &\quad 0.00050721(diabetes_{No}) + 9.72e-05(diabetes_{Borderline}) \end{aligned} \]
- This model implies that an increase in BMI or a decrease in sleep hours is associated with a decrease in 1/systolic, i.e., an increase in systolic.
- Not having diabetes, or having only borderline diabetes, is associated with an increase in 1/systolic, i.e., a decrease in systolic.
8 In-Sample Comparison
8.1 Quality of Fit
For the in-sample comparison, I compare each of my small models to the big model in the training sample using adjusted R-squared, the residual standard error (sigma), AIC, and BIC.
bind_rows(glance(mod_1), glance(mod_2), glance(mod_3), glance(mod_4)) %>%
mutate(model_name = c("Model 1", "Model 2", "Model 3", "Model 4"))%>%
select(model_name, r.squared, adj.r.squared, sigma, AIC, BIC, nobs) %>%
kable()
| model_name | r.squared | adj.r.squared | sigma | AIC | BIC | nobs |
|---|---|---|---|---|---|---|
| Model 1 | 0.0635718 | 0.0620032 | 0.0011389 | -51272.62 | -51207.89 | 4785 |
| Model 2 | 0.0023387 | 0.0021301 | 0.0011746 | -50983.54 | -50964.12 | 4785 |
| Model 3 | 0.0082865 | 0.0078717 | 0.0011713 | -51010.15 | -50984.25 | 4785 |
| Model 4 | 0.0323557 | 0.0315460 | 0.0011572 | -51123.71 | -51084.87 | 4785 |
By every measure here (AIC, BIC, sigma, and R-squared), Model 1 is the best, followed by Model 4, Model 3, and Model 2, respectively.
Assessing Assumptions
Here I run the residual plots for every model.
Residual Plots for the Big Model
par(mfrow = c(2,2)); plot(mod_1); par(mfrow = c(1,1))
I see no serious problems with the assumptions of linearity, Normality, and constant variance, nor do I see any highly influential points in our big model.
Residual Plots for the Small Models
Model 2
par(mfrow = c(2,2)); plot(mod_2); par(mfrow = c(1,1))
I see no serious problems with the assumptions of linearity, Normality, and constant variance, nor do I see any highly influential points in this model.
Model 3
par(mfrow = c(2,2)); plot(mod_3); par(mfrow = c(1,1))
I see no serious problems with the assumptions of linearity, Normality, and constant variance, nor do I see any highly influential points in this model.
Model 4
par(mfrow = c(2,2)); plot(mod_4); par(mfrow = c(1,1))
I see no serious problems with the assumptions of linearity, Normality, and constant variance, nor do I see any highly influential points in this model. However, it is odd that the residual plots form two clusters. I believe this is driven by the diabetes variable, as the clusters only appear after adding it in Model 4, compared to Models 2 and 3. The main reason may be that the diabetes groups are far from equal in size.
8.1.1 Does collinearity have a meaningful impact?
car::vif(mod_1)
GVIF Df GVIF^(1/(2*Df))
BMI 1.037236 1 1.018448
SleepHours 1.012586 1 1.006273
smoking 1.033282 1 1.016505
diabetes 1.106591 2 1.025644
eco_status 1.024753 2 1.006132
highchol 1.092392 1 1.045175
car::vif(mod_4)
GVIF Df GVIF^(1/(2*Df))
BMI 1.033453 1 1.016589
SleepHours 1.004043 1 1.002020
diabetes 1.031996 2 1.007905
The generalized variance inflation factors are under 5 for both models, indicating no meaningful impact of collinearity.
Comparing the Models
Based on the training sample, my conclusion so far supports the biggest model, Model 1. It is the best model on every fit quality measure, and every model shows no serious problems with the regression assumptions.
Model Validation
Since I transformed the outcome earlier, I need to back-transform the predictions for model validation.
Calculating Prediction Errors
8.1.2 Big Model: Back-Transformation and Calculating Fits/Residuals
We’ll use the augment function from the broom package to help us here, creating fit_systolic to hold the fitted values on the original systolic scale after back-transformation (taking the inverse of the fitted values) and res_systolic to hold the residuals (prediction errors) we observe when using the big model on the NHANES test sample.
test_m1 <- augment(mod_1, newdata = df_test) %>% mutate(name = "mod_1", fit_systolic = 1 / .fitted, res_systolic = systolic - fit_systolic) %>% select(SEQN, name, systolic, fit_systolic, res_systolic, everything())
head(test_m1,3)
# A tibble: 3 × 14
SEQN name systolic fit_syst…¹ res_s…² diast…³ BMI Sleep…⁴ smoking eco_s…⁵
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 109266 mod_1 99 123. -24.0 54.3 36.0 7.8 No Upper
2 109282 mod_1 139. 126. 13.4 72.7 26.5 7.5 Yes Upper
3 109290 mod_1 121. 125. -4.10 60.3 27.5 4 No Upper
# … with 4 more variables: diabetes <fct>, highchol <fct>, .fitted <dbl>,
# .resid <dbl>, and abbreviated variable names ¹fit_systolic, ²res_systolic,
# ³diastolic, ⁴SleepHours, ⁵eco_status
8.1.3 Small Model: Back-Transformation and Calculating Fits/Residuals
We’ll do the same thing, but using the small models on the NHANES test sample.
Model 2
test_m2 <- augment(mod_2, newdata = df_test) %>% mutate(name = "mod_2", fit_systolic = 1 / .fitted, res_systolic = systolic - fit_systolic) %>% select(SEQN, name, systolic, fit_systolic, res_systolic, everything())
head(test_m2,3)
# A tibble: 3 × 14
SEQN name systolic fit_syst…¹ res_s…² diast…³ BMI Sleep…⁴ smoking eco_s…⁵
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 109266 mod_2 99 122. -22.6 54.3 36.0 7.8 No Upper
2 109282 mod_2 139. 121. 18.8 72.7 26.5 7.5 Yes Upper
3 109290 mod_2 121. 121. 0.684 60.3 27.5 4 No Upper
# … with 4 more variables: diabetes <fct>, highchol <fct>, .fitted <dbl>,
# .resid <dbl>, and abbreviated variable names ¹fit_systolic, ²res_systolic,
# ³diastolic, ⁴SleepHours, ⁵eco_status
Model 3
test_m3 <- augment(mod_3, newdata = df_test) %>% mutate(name = "mod_3", fit_systolic = 1 / .fitted, res_systolic = systolic - fit_systolic) %>% select(SEQN, name, systolic, fit_systolic, res_systolic, everything())
head(test_m3,3)
# A tibble: 3 × 14
SEQN name systolic fit_syst…¹ res_s…² diast…³ BMI Sleep…⁴ smoking eco_s…⁵
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 109266 mod_3 99 122. -22.7 54.3 36.0 7.8 No Upper
2 109282 mod_3 139. 121. 18.4 72.7 26.5 7.5 Yes Upper
3 109290 mod_3 121. 124. -2.86 60.3 27.5 4 No Upper
# … with 4 more variables: diabetes <fct>, highchol <fct>, .fitted <dbl>,
# .resid <dbl>, and abbreviated variable names ¹fit_systolic, ²res_systolic,
# ³diastolic, ⁴SleepHours, ⁵eco_status
Model 4
test_m4 <- augment(mod_4, newdata = df_test) %>% mutate(name = "mod_4", fit_systolic = 1 / .fitted, res_systolic = systolic - fit_systolic) %>% select(SEQN, name, systolic, fit_systolic, res_systolic, everything())
head(test_m4,3)
# A tibble: 3 × 14
SEQN name systolic fit_syst…¹ res_s…² diast…³ BMI Sleep…⁴ smoking eco_s…⁵
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 109266 mod_4 99 120. -21.0 54.3 36.0 7.8 No Upper
2 109282 mod_4 139. 120. 19.5 72.7 26.5 7.5 Yes Upper
3 109290 mod_4 121. 131. -9.75 60.3 27.5 4 No Upper
# … with 4 more variables: diabetes <fct>, highchol <fct>, .fitted <dbl>,
# .resid <dbl>, and abbreviated variable names ¹fit_systolic, ²res_systolic,
# ³diastolic, ⁴SleepHours, ⁵eco_status
8.1.4 Combining the Results
The test_comp tibble includes the predictions and residuals from all 4 models. It lets us visualize the predictions, summarize the errors, identify the largest errors, and compute validated R-squared values.
test_comp <- bind_rows(test_m1, test_m2, test_m3, test_m4) |> arrange(SEQN, name)
test_comp |> head()
# A tibble: 6 × 14
SEQN name systolic fit_syst…¹ res_s…² diast…³ BMI Sleep…⁴ smoking eco_s…⁵
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 109266 mod_1 99 123. -24.0 54.3 36.0 7.8 No Upper
2 109266 mod_2 99 122. -22.6 54.3 36.0 7.8 No Upper
3 109266 mod_3 99 122. -22.7 54.3 36.0 7.8 No Upper
4 109266 mod_4 99 120. -21.0 54.3 36.0 7.8 No Upper
5 109282 mod_1 139. 126. 13.4 72.7 26.5 7.5 Yes Upper
6 109282 mod_2 139. 121. 18.8 72.7 26.5 7.5 Yes Upper
# … with 4 more variables: diabetes <fct>, highchol <fct>, .fitted <dbl>,
# .resid <dbl>, and abbreviated variable names ¹fit_systolic, ²res_systolic,
# ³diastolic, ⁴SleepHours, ⁵eco_status
8.2 Visualizing the Predictions
ggplot(test_comp, aes(x = fit_systolic, y = systolic)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, lty = "dashed") +
geom_smooth(method = "loess", col = "blue", se = FALSE, formula = y ~ x) +
facet_wrap( ~ name, labeller = "label_both") +
labs(x = "Predicted systolic",
y = "Observed systolic",
title = "Observed vs. Predicted systolic",
subtitle = "Comparing Big to 3 Small Modesl in Test Sample",
caption = "Dashed line is where Observed = Predicted")
Across the four panels, Model 1 appears the most reasonable. In the other three models the points cluster in one place, except for Model 4, where they split into two clusters. Still, there is not much difference relative to the dashed line, which is expected since the coefficients are very small.
8.3 Summarizing the Errors
Here I calculate the mean absolute prediction error (MAPE), the root mean squared prediction error (RMSPE), and the maximum absolute error for the predictions made by each model.
test_comp |>
group_by(name) |>
dplyr::summarise(n = n(),
MAPE = mean(abs(res_systolic)),
RMSPE = sqrt(mean(res_systolic**2)),
max_error = max(abs(res_systolic)))
# A tibble: 4 × 5
name n MAPE RMSPE max_error
<chr> <int> <dbl> <dbl> <dbl>
1 mod_1 2052 14.9 19.9 91.8
2 mod_2 2052 15.4 20.4 94.8
3 mod_3 2052 15.4 20.3 92.7
4 mod_4 2052 15.3 20.2 94.8
Model 1 has the lowest MAPE, RMSPE, and maximum absolute error, and Model 4 has the second-lowest values on all three. This again suggests that Model 1 is the best model so far.
8.3.1 Identify the largest errors
temp1 <- test_m1 %>%
filter(abs(res_systolic) == max(abs(res_systolic)))
temp2 <- test_m2 %>%
filter(abs(res_systolic) == max(abs(res_systolic)))
temp3 <- test_m3 %>%
filter(abs(res_systolic) == max(abs(res_systolic)))
temp4 <- test_m4 %>%
filter(abs(res_systolic) == max(abs(res_systolic)))
bind_rows(temp1, temp2, temp3, temp4) %>%
select(SEQN, name, systolic, fit_systolic, res_systolic)
# A tibble: 4 × 5
SEQN name systolic fit_systolic res_systolic
<dbl> <chr> <dbl> <dbl> <dbl>
1 112898 mod_1 217 125. 91.8
2 112898 mod_2 217 122. 94.8
3 112898 mod_3 217 124. 92.7
4 112898 mod_4 217 122. 94.8
This identifies the same outlying participant in all 4 models. We can re-examine the errors after removing this point.
test_comp %>% filter(SEQN != "112898") %>%
group_by(name) |>
dplyr::summarise(n = n(),
MAPE = mean(abs(res_systolic)),
RMSPE = sqrt(mean(res_systolic**2)),
max_error = max(abs(res_systolic)))
# A tibble: 4 × 5
name n MAPE RMSPE max_error
<chr> <int> <dbl> <dbl> <dbl>
1 mod_1 2051 14.9 19.8 82.3
2 mod_2 2051 15.4 20.3 80.8
3 mod_3 2051 15.3 20.2 81.1
4 mod_4 2051 15.2 20.1 82.0
After removing that outlier, Model 1 still has the lowest MAPE and RMSPE, though it now has the largest maximum absolute error. Overall, it still outperforms the other models.
8.3.2 Validated R-square values
cor(test_m1$systolic, test_m1$fit_systolic)**2
[1] 0.04779547
cor(test_m2$systolic, test_m2$fit_systolic)**2
[1] 0.000361166
cor(test_m3$systolic, test_m3$fit_systolic)**2
[1] 0.002475879
cor(test_m4$systolic, test_m4$fit_systolic)**2
[1] 0.01602802
From this analysis, we can see that Model 1 has the largest validated R-squared.
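Equivalently, the four validated R-squared values can be computed in one pass over test_comp, as in this sketch:
# Validated R-squared for all four models at once
test_comp |>
  dplyr::group_by(name) |>
  dplyr::summarise(val_r2 = cor(systolic, fit_systolic)^2)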
8.4 Comparing the Models
After the analysis, I choose Model 1, the big model, because it has the best fit quality and the smallest prediction errors compared to the other models.
9 Discussion
9.1 Chosen Model
After the analysis, I choose Model 1, the big model, because it performs best on fit quality (lowest AIC, BIC, and sigma; highest R-squared) and has the smallest prediction errors (mean absolute prediction error, root mean squared prediction error, and maximum absolute error) compared to the other models.
9.2 Answering My Question
This helps answer my question: higher BMI and less sleep are associated with higher systolic readings, which indicate higher blood pressure. Not having diabetes, or having only borderline diabetes, is associated with lower systolic readings, as are not having high cholesterol and not smoking. Relative to the lower class, people in the middle and upper classes showed higher systolic readings. Even though the coefficients are very small, the results seem reasonable, as each risk factor increases the systolic reading when it moves in the harmful direction. As for economic status, one possible explanation is that people in the lower class have less access to healthcare and pay less attention to their diet, while people in the upper class have access to more food and may overconsume it.
9.3 Next Steps
In this analysis, I considered only 4 models without following the step function. That being said, there could be other models that fit better than my big model. Further work should investigate different combinations of predictors for this outcome.
In addition, I used only the systolic reading in this analysis, without considering the diastolic reading. Further work should examine the diastolic reading to see what factors affect it; both values are essential for indicating whether a person is at risk of abnormal blood pressure, and relying on only one could introduce bias.
Reflection
I changed my approach to this study after the presentation, following my professor's advice. Even so, my coefficients remain very small. If I could redo the study, I would investigate the datasets further and make better choices for my predictors.
9.4 Session Information
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mosaicData_0.20.3 ggformula_0.10.2 Matrix_1.5-3
[4] nhanesA_0.7.1 GGally_2.1.2 equatiomatic_0.3.1
[7] car_3.1-1 carData_3.0-5 forcats_0.5.2
[10] stringr_1.4.1 dplyr_1.0.10 purrr_0.3.5
[13] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
[16] tidyverse_1.3.2 Hmisc_4.7-1 ggplot2_3.3.6
[19] Formula_1.2-4 survival_3.3-1 lattice_0.20-45
[22] Epi_2.47 readxl_1.4.1 patchwork_1.1.2.9000
[25] broom_1.0.1 naniar_0.6.1 janitor_2.1.0
[28] rmdformats_1.0.4 knitr_1.40
loaded via a namespace (and not attached):
[1] googledrive_2.0.0 colorspace_2.0-3 deldir_1.0-6
[4] ggridges_0.5.4 ellipsis_0.3.2 visdat_0.5.3
[7] ggstance_0.3.5 snakecase_0.11.0 htmlTable_2.4.1
[10] base64enc_0.1-3 fs_1.5.2 rstudioapi_0.14
[13] farver_2.1.1 fansi_1.0.3 lubridate_1.8.0
[16] xml2_1.3.3 mosaic_1.8.4.2 splines_4.2.1
[19] cachem_1.0.6 polyclip_1.10-0 jsonlite_1.8.2
[22] cluster_2.1.3 dbplyr_2.2.1 png_0.1-7
[25] ggforce_0.4.1 shiny_1.7.2 compiler_4.2.1
[28] httr_1.4.4 backports_1.4.1 assertthat_0.2.1
[31] fastmap_1.1.0 gargle_1.2.1 cli_3.4.1
[34] tweenr_2.0.2 later_1.3.0 htmltools_0.5.3
[37] tools_4.2.1 gtable_0.3.1 glue_1.6.2
[40] Rcpp_1.0.9 cellranger_1.1.0 jquerylib_0.1.4
[43] vctrs_0.5.0 nlme_3.1-157 xfun_0.33
[46] rvest_1.0.3 mosaicCore_0.9.2.1 mime_0.12
[49] lifecycle_1.0.3 googlesheets4_1.0.1 MASS_7.3-57
[52] zoo_1.8-11 scales_1.2.1 hms_1.1.2
[55] promises_1.2.0.1 parallel_4.2.1 RColorBrewer_1.1-3
[58] yaml_2.3.5 gridExtra_2.3 sass_0.4.2
[61] labelled_2.10.0 rpart_4.1.16 reshape_0.8.9
[64] latticeExtra_0.6-30 stringi_1.7.8 highr_0.9
[67] checkmate_2.1.0 rlang_1.0.6 pkgconfig_2.0.3
[70] evaluate_0.16 labeling_0.4.2 htmlwidgets_1.5.4
[73] cmprsk_2.2-11 tidyselect_1.1.2 plyr_1.8.7
[76] magrittr_2.0.3 bookdown_0.29 R6_2.5.1
[79] generics_0.1.3 DBI_1.1.3 pillar_1.8.1
[82] haven_2.5.1 foreign_0.8-82 withr_2.5.0
[85] mgcv_1.8-40 abind_1.4-5 nnet_7.3-17
[88] etm_1.1.1 modelr_0.1.9 crayon_1.5.2
[91] interp_1.1-3 utf8_1.2.2 tzdb_0.3.0
[94] rmarkdown_2.16 jpeg_0.1-9 grid_4.2.1
[97] data.table_1.14.2 reprex_2.0.2 digest_0.6.30
[100] xtable_1.8-4 httpuv_1.6.6 numDeriv_2016.8-1.1
[103] munsell_0.5.0 bslib_0.4.0