This analysis aims to examine the relationship between cholesterol level and key variables include demographic characteristics (age, sex, race/ethnicity), socioeconomic status, behavioral factors (smoking, alcohol use, physical activity, diet), and clinical measures (HDL, triglycerides, blood pressure, glucose, C-reactive protein) using the National Health and Nutrition Examination Survey (NHANES) 2017-2018 cross-sectional data. The sample of individuals aged 18-80 years will be used to determine which variable most strongly influences cholesterol while accounting for complex population-level variations.
Since the 2017–2018 NHANES cycle contains some of the most complete and up-to-date data, we selected this cycle for our analysis. Within this cycle, we downloaded a set of key modules that capture demographic, behavioral, and clinical information relevant to predicting cholesterol and BMI outcomes. These modules include demographics, smoking status, alcohol use, physical activity, dietary intake, body measurements, blood pressure, glucose, inflammation (CRP), HDL cholesterol, triglycerides, and total cholesterol. Each dataset was imported individually and prepared for merging so that a comprehensive, unified analytic file could be created for building our regression models.
load_nhanes <- function(files, cycle = "2017") {
base_url <- paste0("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/", cycle, "/DataFiles/")
result <- list()
for (f in files) {
file_url <- paste0(base_url, f, ".xpt")
local_name <- paste0(f, ".XPT")
download.file(file_url, local_name, mode = "wb")
df <- read_xpt(local_name)
result[[f]] <- df
}
return(result)
}
files_to_load <- c(
"DEMO_J", # Demographics
"SMQ_J", # Smoking
"ALQ_J", # Alcohol
"PAQ_J", # Physical activity
"DR1TOT_J", # Diet (calories, fat, sodium)
"BMX_J", # Body measures (BMI)
"BPX_J", # Blood pressure
"GLU_J", # Glucose
"HSCRP_J", # CRP
"HDL_J", # HDL cholesterol
"TRIGLY_J", # Triglycerides
"TCHOL_J" # Total Cholesterol
)
nhanes_list <- load_nhanes(files_to_load)
demo <- nhanes_list$DEMO_J
smoking <- nhanes_list$SMQ_J
alcohol <- nhanes_list$ALQ_J
physact <- nhanes_list$PAQ_J
diet <- nhanes_list$DR1TOT_J
bmx <- nhanes_list$BMX_J
bp <- nhanes_list$BPX_J
glucose <- nhanes_list$GLU_J
crp <- nhanes_list$HSCRP_J
hdl <- nhanes_list$HDL_J
trig <- nhanes_list$TRIGLY_J
chol <- nhanes_list$TCHOL_J
After loading the individual NHANES datasets, we merged them into a single analytic dataset using the respondent identifier SEQN, which is shared across all NHANES modules. Once all modules were combined, we restricted the sample to adults aged 18 years and older and created derived physical activity variables representing vigorous, moderate, and sedentary minutes per day. Finally, we selected the key demographic, behavioral, dietary, and clinical variables needed for modeling, including predictors such as age, gender, race/ethnicity, socioeconomic indicators, lifestyle behaviors, and biomarkers, along with total cholesterol, lbxtc, as the primary outcome.
Features:
# Merge data
demo <- nhanes_list$DEMO_J %>%
janitor::clean_names()
# Start with demographics
nhanes_master <- demo
# Merge BMI
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$BMX_J), by = "seqn")
# Merge Blood Pressure
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$BPX_J), by = "seqn")
# Merge Glucose
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$GLU_J), by = "seqn")
# Merge CRP
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$HSCRP_J), by = "seqn")
# Merge HDL
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$HDL_J), by = "seqn")
# Merge Triglycerides
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$TRIGLY_J), by = "seqn")
# Merge Smoking
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$SMQ_J), by = "seqn")
# Merge Alcohol
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$ALQ_J), by = "seqn")
# Merge Physical Activity
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$PAQ_J), by = "seqn")
# Merge Diet (calories, fat, sodium)
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$DR1TOT_J), by = "seqn")
# Merge Diet (calories, fat, sodium)
nhanes_master <- nhanes_master %>%
left_join(janitor::clean_names(nhanes_list$TCHOL_J), by = "seqn")
# Filter for adults only
nhanes_model <- nhanes_master %>%
filter(ridageyr >= 18) %>%
mutate(
vigorous_minutes = coalesce(pad615, 0) + coalesce(pad660, 0), # Vigorous activity (work + recreation)
moderate_minutes = coalesce(pad630, 0) + coalesce(pad675, 0) + coalesce(pad645, 0), # Moderate activity (work + recreation + transportation)
sedentary_minutes = coalesce(pad680, 0) # Sedentary minutes
)
# Select Variables for Modeling
nhanes_model2 <- nhanes_model %>%
dplyr::select(
seqn,
ridageyr, # Age
riagendr, # Gender
ridreth3, # Race/Ethnicity
dmdeduc2, # Education
indfmpir, # Income/Poverty ratio
smq020, # Smoking status
alq111, # Alcohol use
vigorous_minutes,
moderate_minutes,
sedentary_minutes,
dr1tkcal, # Total calories
dr1ttfat, # Total fat
dr1tsodi, # Sodium
bmxbmi, # BMI
bpxsy1, bpxdi1, # Systolic & Diastolic BP
lbxglu, # Glucose
lbxhscrp, # CRP
lbdhdd, # HDL
lbxtr, # Triglycerides
lbxtc # Total Cholesterol (target)
)
To assess the structure and distribution of our variables, we began by generating summary statistics for all predictors in the dataset. We then visualized numerical features using histograms to examine normality, skewness, and potential outliers. Boxplots were created for key categorical variables to show how total cholesterol differs across demographic and behavioral groups. Finally, we plotted the distribution of the target variable—total cholesterol—to evaluate its overall shape before modeling. These exploratory steps provide a clear understanding of the data and help verify assumptions for regression analysis.
Several continuous variables in the dataset exhibit skewed distributions that may affect regression modeling:
These variables may benefit from log-transformations or other normalization techniques prior to modeling.
# Summary statistics
nhanes_model2 %>%
summary() %>%
kable(caption = "Descriptive Statistics of Predictor Variables") %>%
kable_styling()
| seqn | ridageyr | riagendr | ridreth3 | dmdeduc2 | indfmpir | smq020 | alq111 | vigorous_minutes | moderate_minutes | sedentary_minutes | dr1tkcal | dr1ttfat | dr1tsodi | bmxbmi | bpxsy1 | bpxdi1 | lbxglu | lbxhscrp | lbdhdd | lbxtr | lbxtc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 93705 | Min. :18.00 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :0.000 | Min. :1.000 | Min. :1.000 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0 | Min. : 0.00 | Min. : 0 | Min. :14.20 | Min. : 72 | Min. : 0.00 | Min. : 47.0 | Min. : 0.11 | Min. : 10.00 | Min. : 10.0 | Min. : 76.0 | |
| 1st Qu.: 95955 | 1st Qu.:33.00 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.:3.000 | 1st Qu.:1.180 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 180.0 | 1st Qu.: 1406 | 1st Qu.: 50.74 | 1st Qu.: 2137 | 1st Qu.:24.60 | 1st Qu.:112 | 1st Qu.: 64.00 | 1st Qu.: 96.0 | 1st Qu.: 0.88 | 1st Qu.: 42.00 | 1st Qu.: 61.0 | 1st Qu.:158.0 | |
| Median : 98263 | Median :51.00 | Median :2.000 | Median :3.000 | Median :4.000 | Median :2.080 | Median :2.000 | Median :1.000 | Median : 0.0 | Median : 50.0 | Median : 300.0 | Median : 1942 | Median : 76.18 | Median : 3102 | Median :28.50 | Median :124 | Median : 72.00 | Median :104.0 | Median : 1.94 | Median : 51.00 | Median : 92.0 | Median :183.0 | |
| Mean : 98280 | Mean :49.89 | Mean :1.515 | Mean :3.504 | Mean :3.526 | Mean :2.521 | Mean :1.597 | Mean :1.114 | Mean : 77.6 | Mean : 141.8 | Mean : 388.9 | Mean : 2110 | Mean : 85.07 | Mean : 3456 | Mean :29.69 | Mean :126 | Mean : 71.65 | Mean :113.7 | Mean : 4.17 | Mean : 53.19 | Mean : 112.3 | Mean :186.9 | |
| 3rd Qu.:100595 | 3rd Qu.:65.00 | 3rd Qu.:2.000 | 3rd Qu.:4.000 | 3rd Qu.:4.000 | 3rd Qu.:4.060 | 3rd Qu.:2.000 | 3rd Qu.:1.000 | 3rd Qu.: 60.0 | 3rd Qu.: 180.0 | 3rd Qu.: 480.0 | 3rd Qu.: 2628 | 3rd Qu.:108.28 | 3rd Qu.: 4311 | 3rd Qu.:33.50 | 3rd Qu.:136 | 3rd Qu.: 80.00 | 3rd Qu.:115.0 | 3rd Qu.: 4.48 | 3rd Qu.: 61.00 | 3rd Qu.: 134.0 | 3rd Qu.:212.0 | |
| Max. :102956 | Max. :80.00 | Max. :2.000 | Max. :7.000 | Max. :9.000 | Max. :5.000 | Max. :2.000 | Max. :2.000 | Max. :10029.0 | Max. :19998.0 | Max. :9999.0 | Max. :12501 | Max. :567.96 | Max. :25949 | Max. :86.20 | Max. :228 | Max. :136.00 | Max. :451.0 | Max. :182.82 | Max. :189.00 | Max. :2684.0 | Max. :446.0 | |
| NA | NA | NA | NA | NA’s :287 | NA’s :835 | NA | NA’s :726 | NA | NA | NA | NA’s :873 | NA’s :873 | NA’s :873 | NA’s :422 | NA’s :957 | NA’s :957 | NA’s :3319 | NA’s :710 | NA’s :680 | NA’s :3363 | NA’s :680 |
# Plot histograms of all numeric variables in data_train
nhanes_model2 |>
dplyr::select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
theme_minimal()
# Box plot of categorical variables
categorical_vars <- c("riagendr", "ridreth3", "dmdeduc2", "smq020", "alq111")
for (var in categorical_vars) {
print(
ggplot(nhanes_model2, aes_string(x = var, y = "lbxtc")) +
geom_boxplot(fill = "steelblue", color = "black") +
labs(
title = paste("Total Cholesterol by", var),
x = var,
y = "Total Cholesterol (mg/dL)"
) +
theme_minimal()
)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Removed 287 rows containing missing values or values outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Removed 726 rows containing missing values or values outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 293 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Target variable distribution (Total Cholesterol)
ggplot(nhanes_model2, aes(x = lbxtc)) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = "Distribution of Total Cholesterol (mg/dL)", x = "Total Cholesterol", y = "Count") +
theme_minimal()
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_bin()`).
To prepare the dataset for modeling, we first removed any observations with missing values in the target variable, total cholesterol. Next, we addressed missingness in the predictors. Variables with moderate missingness were replaced with their median values, while categorical variables were imputed using the most frequent category. For variables with higher levels of missingness, we applied MICE using predictive mean matching, allowing more accurate estimation based on relationships among variables.
## Look at data
# Structure of the data
str(nhanes_model2)
## tibble [5,856 × 22] (S3: tbl_df/tbl/data.frame)
## $ seqn : num [1:5856] 93705 93706 93708 93709 93711 ...
## ..- attr(*, "label")= chr "Respondent sequence number"
## $ ridageyr : num [1:5856] 66 18 66 75 56 18 67 54 71 61 ...
## ..- attr(*, "label")= chr "Age in years at screening"
## $ riagendr : num [1:5856] 2 1 2 2 1 1 1 2 1 1 ...
## ..- attr(*, "label")= chr "Gender"
## $ ridreth3 : num [1:5856] 4 6 6 4 6 1 3 4 7 6 ...
## ..- attr(*, "label")= chr "Race/Hispanic origin w/ NH Asian"
## $ dmdeduc2 : num [1:5856] 2 NA 1 4 5 NA 3 4 3 5 ...
## ..- attr(*, "label")= chr "Education level - Adults 20+"
## $ indfmpir : num [1:5856] 0.82 NA 1.63 0.41 5 0.76 2.65 1.86 1.56 5 ...
## ..- attr(*, "label")= chr "Ratio of family income to poverty"
## $ smq020 : num [1:5856] 1 2 2 1 2 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Smoked at least 100 cigarettes in life"
## $ alq111 : num [1:5856] 1 2 2 NA 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Ever had a drink of any kind of alcohol"
## $ vigorous_minutes : num [1:5856] 0 0 0 0 60 465 0 0 660 180 ...
## $ moderate_minutes : num [1:5856] 60 75 30 180 90 270 90 0 260 240 ...
## $ sedentary_minutes: num [1:5856] 300 240 120 600 420 120 120 360 180 300 ...
## $ dr1tkcal : num [1:5856] 1202 1987 1251 NA 2840 ...
## ..- attr(*, "label")= chr "Energy (kcal)"
## $ dr1ttfat : num [1:5856] 57 137.4 65.5 NA 124.2 ...
## ..- attr(*, "label")= chr "Total fat (gm)"
## $ dr1tsodi : num [1:5856] 3574 3657 2135 NA 4382 ...
## ..- attr(*, "label")= chr "Sodium (mg)"
## $ bmxbmi : num [1:5856] 31.7 21.5 23.7 38.9 21.3 19.7 23.5 39.9 22.5 30.7 ...
## ..- attr(*, "label")= chr "Body Mass Index (kg/m**2)"
## $ bpxsy1 : num [1:5856] NA 112 NA 120 108 112 104 NA 112 120 ...
## ..- attr(*, "label")= chr "Systolic: Blood pres (1st rdg) mm Hg"
## $ bpxdi1 : num [1:5856] NA 74 NA 66 68 68 70 NA 60 72 ...
## ..- attr(*, "label")= chr "Diastolic: Blood pres (1st rdg) mm Hg"
## $ lbxglu : num [1:5856] NA NA 122 NA 107 NA NA NA NA NA ...
## ..- attr(*, "label")= chr "Fasting Glucose (mg/dL)"
## $ lbxhscrp : num [1:5856] 2.72 0.74 1.83 6.94 0.82 0.37 1.66 5.71 6.1 1.75 ...
## ..- attr(*, "label")= chr "HS C-Reactive Protein (mg/L)"
## $ lbdhdd : num [1:5856] 60 47 88 65 72 48 48 42 57 39 ...
## ..- attr(*, "label")= chr "Direct HDL-Cholesterol (mg/dL)"
## $ lbxtr : num [1:5856] NA NA 58 NA 48 NA NA NA NA NA ...
## ..- attr(*, "label")= chr "Triglyceride (mg/dL)"
## $ lbxtc : num [1:5856] 157 148 209 176 238 182 184 230 180 225 ...
## ..- attr(*, "label")= chr "Total Cholesterol (mg/dL)"
# Glimpse of the data
head(nhanes_model2)
## # A tibble: 6 × 22
## seqn ridageyr riagendr ridreth3 dmdeduc2 indfmpir smq020 alq111
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 93705 66 2 4 2 0.82 1 1
## 2 93706 18 1 6 NA NA 2 2
## 3 93708 66 2 6 1 1.63 2 2
## 4 93709 75 2 4 4 0.41 1 NA
## 5 93711 56 1 6 5 5 2 1
## 6 93712 18 1 1 NA 0.76 1 1
## # ℹ 14 more variables: vigorous_minutes <dbl>, moderate_minutes <dbl>,
## # sedentary_minutes <dbl>, dr1tkcal <dbl>, dr1ttfat <dbl>, dr1tsodi <dbl>,
## # bmxbmi <dbl>, bpxsy1 <dbl>, bpxdi1 <dbl>, lbxglu <dbl>, lbxhscrp <dbl>,
## # lbdhdd <dbl>, lbxtr <dbl>, lbxtc <dbl>
# Count missing values per Variable
nhanes_model2 %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 14 × 2
## variable missing_count
## <chr> <int>
## 1 lbxtr 3363
## 2 lbxglu 3319
## 3 bpxsy1 957
## 4 bpxdi1 957
## 5 dr1tkcal 873
## 6 dr1ttfat 873
## 7 dr1tsodi 873
## 8 indfmpir 835
## 9 alq111 726
## 10 lbxhscrp 710
## 11 lbdhdd 680
## 12 lbxtc 680
## 13 bmxbmi 422
## 14 dmdeduc2 287
colSums(is.na(nhanes_model2))
## seqn ridageyr riagendr ridreth3
## 0 0 0 0
## dmdeduc2 indfmpir smq020 alq111
## 287 835 0 726
## vigorous_minutes moderate_minutes sedentary_minutes dr1tkcal
## 0 0 0 873
## dr1ttfat dr1tsodi bmxbmi bpxsy1
## 873 873 422 957
## bpxdi1 lbxglu lbxhscrp lbdhdd
## 957 3319 710 680
## lbxtr lbxtc
## 3363 680
# Keep only rows where Total Cholesterol is observed
nhanes_model2 <- nhanes_model2 %>%
filter(!is.na(lbxtc))
medium_numeric <- c("indfmpir","dr1tkcal","dr1ttfat","dr1tsodi",
"bmxbmi","bpxsy1","bpxdi1","lbxhscrp","lbdhdd")
medium_categorical <- c("dmdeduc2","alq111")
high_missing <- c("lbxglu","lbxtr")
## Median Imputations
# Numeric
for(v in medium_numeric){
nhanes_model2[[v]][is.na(nhanes_model2[[v]])] <- median(nhanes_model2[[v]], na.rm = TRUE)
}
# Categorical
for(v in medium_categorical){
mode_val <- as.numeric(names(sort(table(nhanes_model2[[v]]), decreasing = TRUE)[1]))
nhanes_model2[[v]][is.na(nhanes_model2[[v]])] <- mode_val
}
# Mice Imputations
# Select only high missing numeric variables + all predictors
mice_vars <- c(high_missing, medium_numeric, "vigorous_minutes", "moderate_minutes", "sedentary_minutes")
# Run MICE
mice_data <- mice(nhanes_model2[mice_vars], m = 5, method = "pmm", seed = 123)
##
## iter imp variable
## 1 1 lbxglu lbxtr
## 1 2 lbxglu lbxtr
## 1 3 lbxglu lbxtr
## 1 4 lbxglu lbxtr
## 1 5 lbxglu lbxtr
## 2 1 lbxglu lbxtr
## 2 2 lbxglu lbxtr
## 2 3 lbxglu lbxtr
## 2 4 lbxglu lbxtr
## 2 5 lbxglu lbxtr
## 3 1 lbxglu lbxtr
## 3 2 lbxglu lbxtr
## 3 3 lbxglu lbxtr
## 3 4 lbxglu lbxtr
## 3 5 lbxglu lbxtr
## 4 1 lbxglu lbxtr
## 4 2 lbxglu lbxtr
## 4 3 lbxglu lbxtr
## 4 4 lbxglu lbxtr
## 4 5 lbxglu lbxtr
## 5 1 lbxglu lbxtr
## 5 2 lbxglu lbxtr
## 5 3 lbxglu lbxtr
## 5 4 lbxglu lbxtr
## 5 5 lbxglu lbxtr
# Complete the dataset using first imputed dataset
nhanes_model2[mice_vars] <- complete(mice_data, 1)
colSums(is.na(nhanes_model2))
## seqn ridageyr riagendr ridreth3
## 0 0 0 0
## dmdeduc2 indfmpir smq020 alq111
## 0 0 0 0
## vigorous_minutes moderate_minutes sedentary_minutes dr1tkcal
## 0 0 0 0
## dr1ttfat dr1tsodi bmxbmi bpxsy1
## 0 0 0 0
## bpxdi1 lbxglu lbxhscrp lbdhdd
## 0 0 0 0
## lbxtr lbxtc
## 0 0
The boxplots reveal that several variables, particularly total cholesterol (lbxtc), triglycerides (lbxtr), LDL cholesterol (lbdldl), and nutrient/weight totals, exhibit pronounced high-end outliers with long upper whiskers and many points far above the main distribution. These patterns reflect right-skewed distributions and natural physiological or dietary variability in the population rather than errors. Other variables, such as blood pressure, hemoglobin, and BMI, show moderate high outliers, while some nutrient or BMI measures have only a few scattered extremes. Overall, the most prominent outliers are concentrated in biomarkers and nutrient totals, which could influence regression models if unaddressed. These findings suggest the need for careful handling, such as log transformation or robust modeling approaches, rather than outright removal, to ensure valid and generalizable analytical results.
# Boxplot of lbxtc
nhanes_model2 |>
dplyr::select(lbxtc) |>
ggplot(aes(x = "", y = lbxtc)) +
geom_boxplot(fill="pink") +
labs(title = "Distribution of Total Cholesterol", y = "Total Cholesterol mg/dL") +
theme_minimal()
# Boxplots of all predictors
ggplot(stack(nhanes_model2[,-1]), aes(x = ind, y = values)) +
geom_boxplot(fill = "darkgrey") +
coord_cartesian(ylim = c(0, 1000)) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
panel.background = element_rect(fill = 'grey96')) +
labs(title = "Boxplots of Predictor Variables", x="Predictors")
## Warning in stack.data.frame(nhanes_model2[, -1]): non-vector columns will be
## ignored
## Warning: <ggplot> %+% x was deprecated in ggplot2 4.0.0.
## ℹ Please use <ggplot> + x instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
nhanes_model2 %>%
dplyr::select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Variable, y = Value)) +
geom_boxplot(fill = "grey70") +
coord_cartesian(ylim = c(0, 500)) + # Adjust depending on variable range
labs(
title = "Boxplots of Numeric Predictor Variables",
x = "Predictors",
y = "Value"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
To address skewness and high-end outliers in the NHANES dataset, we applied log transformations to several right-skewed variables. Specifically, CRP (lbxhscrp), triglycerides (lbxtr), total cholesterol (lbxtc), HDL cholesterol (lbdhdd), glucose (lbxglu), BMI (bmxbmi), and dietary intake variables such as total calories (dr1tkcal), total fat (dr1ttfat), and sodium (dr1tsodi) were log-transformed. A small constant was added to each variable to avoid taking the logarithm of zero. These transformations reduce the influence of extreme values, improve the symmetry of distributions, and stabilize variance, thereby helping to meet the assumptions of linear regression and supporting more reliable model estimation.
In addition to transforming skewed numeric variables, we converted key categorical variables into factor type for modeling. Specifically, gender (riagendr) was labeled as “Male” and “Female,” while race/ethnicity (ridreth3), education level (dmdeduc2), smoking status (smq020), and alcohol use (alq111) were converted to factor variables without changing their original coding. This ensures that these categorical predictors are properly treated in regression models, allowing R to handle them as discrete groups rather than continuous numeric values, which improves interpretation and model performance.
# tranform skewed variables
# Log-transform skewed variables, adding a small constant to avoid log(0)
nhanes_model3 <- nhanes_model2 %>%
mutate(
log_lbxhscrp = log(lbxhscrp + 0.01),
log_lbxtr = log(lbxtr + 0.01),
log_dr1tkcal = log(dr1tkcal + 1),
log_dr1tsodi= log(dr1tsodi + 1),
log_dr1ttfat = log(dr1ttfat + 1),
log_lbxtc = log(lbxtc + 0.01),
log_lbdhdd = log(lbdhdd + 0.01),
log_lbxglu = log(lbxglu + 0.01),
log_bmxbmi = log(bmxbmi + 0.01)
)
log_vars <- c("log_lbxhscrp","log_lbxtr","log_dr1tkcal","log_dr1tsodi","log_dr1ttfat",
"log_lbxtc","log_lbdhdd","log_lbxglu","log_bmxbmi")
nhanes_model3 %>%
dplyr::select(all_of(log_vars)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Variable, scales = "free") +
labs(title = "Histograms of Log-Transformed Variables", x = NULL, y = "Frequency") +
theme_minimal()
# Factor categorical variables
nhanes_model4 <- nhanes_model3 %>%
mutate(
riagendr = factor(riagendr, labels = c("Male", "Female")),
ridreth3 = factor(ridreth3),
dmdeduc2 = factor(dmdeduc2),
smq020 = factor(smq020),
alq111 = factor(alq111)
)
To evaluate the suitability of linear regression for predicting total cholesterol, we examined the relationships between each numeric predictor and the target variable (lbxtc) using scatterplots and Pearson correlation coefficients. Variables with stronger correlations may be more informative for the regression model, while variables with very weak correlations may contribute less predictive value or require further transformation.
Across the scatterplots, most numeric predictors exhibit very weak or negligible linear relationships with total cholesterol, as reflected by near-zero Pearson correlation coefficients. Variables such as physical activity (vigorous, moderate, sedentary minutes), dietary intakes (calories, fat, sodium), and demographic characteristics show little predictive power individually. A few clinical variables—particularly systolic and diastolic blood pressure (bpxsy1, bpxdi1) and log-transformed triglycerides (log_trig)—display mild positive associations with cholesterol, but these relationships are weak and accompanied by substantial scatter. Overall, the plots suggest that total cholesterol is largely uncorrelated with most individual predictors, with only modest positive trends observed for a few clinical measures.
# Get all numeric predictor variables (excluding total cholesterol)
predictors <- names(nhanes_model4)[sapply(nhanes_model4, is.numeric) & names(nhanes_model4) != "lbxtc"]
# Set up plotting parameters for multiple plots
par(mfrow = c(4, 4), mar = c(4, 4, 3, 2))
# Plot each predictor vs total cholesterol
for(i in 1:length(predictors)) {
predictor <- predictors[i]
correlation <- cor(nhanes_model4[[predictor]], nhanes_model4$lbxtc, use = "complete.obs")
plot(nhanes_model4[[predictor]], nhanes_model4$lbxtc,
xlab = predictor,
ylab = "Total Cholesterol",
main = paste(predictor, "\n(r =", round(correlation, 3), ")"),
pch = 16,
col = ifelse(correlation > 0, "darkblue", "darkred"),
cex = 0.6)
# Add regression line
abline(lm(nhanes_model4$lbxtc ~ nhanes_model4[[predictor]]),
col = "red", lwd = 2)
# Add correlation text
text(x = quantile(nhanes_model4[[predictor]], 0.05, na.rm = TRUE),
y = quantile(nhanes_model4$lbxtc, 0.95, na.rm = TRUE),
labels = paste("r =", round(correlation, 3)),
cex = 0.8, col = "red")
}
# Reset plotting parameters
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2))
The correlations between total cholesterol (lbxtc) and other variables in the dataset are generally modest.
Strong Positive Correlations:
Negative Correlations:
Moderate Correlations:
Weak or Near-Zero Correlations:
Most other lifestyle and dietary variables, including BMI, physical activity, total calories, fat, and sodium intake, have correlations near zero or slightly negative, indicating minimal linear association with total cholesterol in this dataset. Overall, these results suggest that cholesterol levels are more strongly associated with clinical biomarkers (HDL, triglycerides, blood pressure) than with demographic, lifestyle, or dietary factors.
# ORIGINAL PREDICTORS DATASET
# log_vars is "log_crp","log_trig","log_calories","log_sodium","log_fat", "log_chol","log_hdl", "log_glucose", "log_bmi"
nhanes_model4_original <- nhanes_model4 |>
dplyr::select(-any_of(log_vars))
df_character_wide <- nhanes_model4_original %>%
select_if(function(col) !is.numeric(col) | all(col == .$lbxtc)) %>%
pivot_longer(cols = -lbxtc, names_to = "variable", values_to = "value")
df_character_wide %>%
ggplot(aes(x = value, y = lbxtc)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
numeric_vars <- nhanes_model4_original %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
# Correlation with total cholesterol
corr_target <- cor_matrix[, "lbxtc"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)
kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Total Cholesterol"), digits = 2)
| Correlation with Total Cholesterol | |
|---|---|
| lbxtc | 1.00 |
| lbdhdd | 0.19 |
| bpxdi1 | 0.13 |
| bpxsy1 | 0.13 |
| lbxtr | 0.11 |
| ridageyr | 0.10 |
| indfmpir | 0.05 |
| bmxbmi | 0.01 |
| vigorous_minutes | 0.00 |
| seqn | 0.00 |
| sedentary_minutes | 0.00 |
| lbxhscrp | 0.00 |
| moderate_minutes | 0.00 |
| lbxglu | -0.02 |
| dr1tkcal | -0.03 |
| dr1ttfat | -0.03 |
| dr1tsodi | -0.03 |
# corrplot::corrplot(cor_matrix, method = "color", type = "upper",
# tl.col = "black", tl.srt = 45, addCoef.col = "black",
# number.cex = 0.7, diag = FALSE)
corrplot::corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.cex = 0.5,
tl.srt = 45) # Rotate text diagonally
# Remove self-correlation
corr_target_no_self <- corr_target[names(corr_target) != "lbxtc"]
# Top 3 positive and negative correlations
top_pos <- sort(corr_target_no_self, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target_no_self, decreasing = FALSE)[1:3]
combined_df <- data.frame(
Feature = c(names(top_pos), names(top_neg)),
Correlation = c(top_pos, top_neg)
) %>%
mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))
ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
labs(title = "Top Features Correlated with Total Cholesterol",
x = "Feature",
y = "Correlation with Total Cholesterol",
fill = "Direction") +
theme_minimal()
# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
filter(Var1 != "SalePrice") |>
filter(Var2 != "SalePrice")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
kable(as.data.frame(sorted_cor), digits = 2)
| Var1 | Var2 | value | |
|---|---|---|---|
| 126 | dr1tkcal | dr1ttfat | 0.89 |
| 143 | dr1tkcal | dr1tsodi | 0.80 |
| 144 | dr1ttfat | dr1tsodi | 0.76 |
| 172 | ridageyr | bpxsy1 | 0.47 |
| 198 | bpxsy1 | bpxdi1 | 0.35 |
| 270 | lbdhdd | lbxtr | -0.31 |
| 248 | bmxbmi | lbdhdd | -0.27 |
| 268 | lbxglu | lbxtr | 0.26 |
| 231 | bmxbmi | lbxhscrp | 0.26 |
| 287 | lbdhdd | lbxtc | 0.19 |
| 214 | bmxbmi | lbxglu | 0.18 |
| 215 | bpxsy1 | lbxglu | 0.18 |
| 251 | lbxglu | lbdhdd | -0.18 |
| 206 | ridageyr | lbxglu | 0.16 |
| 265 | bmxbmi | lbxtr | 0.15 |
| 138 | ridageyr | dr1tsodi | -0.14 |
| 104 | ridageyr | dr1tkcal | -0.13 |
| 284 | bpxdi1 | lbxtc | 0.13 |
| 72 | vigorous_minutes | moderate_minutes | 0.13 |
| 283 | bpxsy1 | lbxtc | 0.13 |
| 180 | bmxbmi | bpxsy1 | 0.12 |
| 288 | lbxtr | lbxtc | 0.11 |
| 121 | ridageyr | dr1ttfat | -0.10 |
| 274 | ridageyr | lbxtc | 0.10 |
| 267 | bpxdi1 | lbxtr | 0.10 |
| 234 | lbxglu | lbxhscrp | 0.09 |
| 241 | indfmpir | lbdhdd | 0.09 |
| 197 | bmxbmi | bpxdi1 | 0.09 |
| 252 | lbxhscrp | lbdhdd | -0.08 |
| 266 | bpxsy1 | lbxtr | 0.08 |
| 53 | ridageyr | vigorous_minutes | -0.08 |
| 106 | vigorous_minutes | dr1tkcal | 0.08 |
| 36 | ridageyr | indfmpir | 0.08 |
| 210 | sedentary_minutes | lbxglu | 0.07 |
| 240 | ridageyr | lbdhdd | 0.07 |
| 245 | dr1tkcal | lbdhdd | -0.07 |
| 247 | dr1tsodi | lbdhdd | -0.07 |
| 140 | vigorous_minutes | dr1tsodi | 0.07 |
| 262 | dr1tkcal | lbxtr | 0.07 |
| 224 | indfmpir | lbxhscrp | -0.07 |
| 264 | dr1tsodi | lbxtr | 0.07 |
| 209 | moderate_minutes | lbxglu | 0.06 |
| 260 | moderate_minutes | lbxtr | 0.06 |
| 250 | bpxdi1 | lbdhdd | -0.06 |
| 194 | dr1tkcal | bpxdi1 | 0.05 |
| 123 | vigorous_minutes | dr1ttfat | 0.05 |
| 228 | dr1tkcal | lbxhscrp | -0.05 |
| 230 | dr1tsodi | lbxhscrp | -0.05 |
| 161 | dr1ttfat | bmxbmi | 0.05 |
| 156 | indfmpir | bmxbmi | -0.05 |
| 275 | indfmpir | lbxtc | 0.05 |
| 261 | sedentary_minutes | lbxtr | 0.05 |
| 196 | dr1tsodi | bpxdi1 | 0.05 |
| 179 | dr1tsodi | bpxsy1 | -0.04 |
| 263 | dr1ttfat | lbxtr | 0.04 |
| 223 | ridageyr | lbxhscrp | 0.04 |
| 246 | dr1ttfat | lbdhdd | -0.04 |
| 87 | ridageyr | sedentary_minutes | 0.04 |
| 227 | sedentary_minutes | lbxhscrp | 0.04 |
| 229 | dr1ttfat | lbxhscrp | -0.04 |
| 159 | sedentary_minutes | bmxbmi | 0.04 |
| 208 | vigorous_minutes | lbxglu | -0.04 |
| 178 | dr1ttfat | bpxsy1 | -0.04 |
| 122 | indfmpir | dr1ttfat | 0.04 |
| 257 | ridageyr | lbxtr | 0.04 |
| 176 | sedentary_minutes | bpxsy1 | 0.04 |
| 174 | vigorous_minutes | bpxsy1 | -0.04 |
| 177 | dr1tkcal | bpxsy1 | -0.04 |
| 139 | indfmpir | dr1tsodi | 0.03 |
| 281 | dr1tsodi | lbxtc | -0.03 |
| 70 | ridageyr | moderate_minutes | -0.03 |
| 280 | dr1ttfat | lbxtc | -0.03 |
| 195 | dr1ttfat | bpxdi1 | 0.03 |
| 190 | indfmpir | bpxdi1 | 0.03 |
| 192 | moderate_minutes | bpxdi1 | 0.03 |
| 89 | vigorous_minutes | sedentary_minutes | -0.03 |
| 173 | indfmpir | bpxsy1 | -0.03 |
| 88 | indfmpir | sedentary_minutes | 0.03 |
| 90 | moderate_minutes | sedentary_minutes | -0.03 |
| 279 | dr1tkcal | lbxtc | -0.03 |
| 162 | dr1tsodi | bmxbmi | 0.03 |
| 207 | indfmpir | lbxglu | 0.03 |
| 232 | bpxsy1 | lbxhscrp | 0.03 |
| 155 | ridageyr | bmxbmi | 0.02 |
| 269 | lbxhscrp | lbxtr | 0.02 |
| 285 | lbxglu | lbxtc | -0.02 |
| 52 | seqn | vigorous_minutes | -0.02 |
| 244 | sedentary_minutes | lbdhdd | -0.02 |
| 243 | moderate_minutes | lbdhdd | 0.02 |
| 205 | seqn | lbxglu | -0.02 |
| 71 | indfmpir | moderate_minutes | -0.02 |
| 189 | ridageyr | bpxdi1 | 0.02 |
| 225 | vigorous_minutes | lbxhscrp | -0.02 |
| 259 | vigorous_minutes | lbxtr | -0.01 |
| 157 | vigorous_minutes | bmxbmi | -0.01 |
| 86 | seqn | sedentary_minutes | 0.01 |
| 211 | dr1tkcal | lbxglu | -0.01 |
| 188 | seqn | bpxdi1 | 0.01 |
| 258 | indfmpir | lbxtr | 0.01 |
| 193 | sedentary_minutes | bpxdi1 | 0.01 |
| 105 | indfmpir | dr1tkcal | 0.01 |
| 158 | moderate_minutes | bmxbmi | 0.01 |
| 239 | seqn | lbdhdd | 0.01 |
| 160 | dr1tkcal | bmxbmi | 0.01 |
| 54 | indfmpir | vigorous_minutes | -0.01 |
| 282 | bmxbmi | lbxtc | 0.01 |
| 107 | moderate_minutes | dr1tkcal | 0.01 |
| 137 | seqn | dr1tsodi | 0.01 |
| 171 | seqn | bpxsy1 | 0.01 |
| 249 | bpxsy1 | lbdhdd | 0.01 |
| 256 | seqn | lbxtr | 0.01 |
| 142 | sedentary_minutes | dr1tsodi | -0.01 |
| 226 | moderate_minutes | lbxhscrp | -0.01 |
| 18 | seqn | ridageyr | 0.01 |
| 212 | dr1ttfat | lbxglu | 0.01 |
| 108 | sedentary_minutes | dr1tkcal | 0.00 |
| 277 | moderate_minutes | lbxtc | 0.00 |
| 222 | seqn | lbxhscrp | 0.00 |
| 175 | moderate_minutes | bpxsy1 | 0.00 |
| 124 | moderate_minutes | dr1ttfat | 0.00 |
| 286 | lbxhscrp | lbxtc | 0.00 |
| 103 | seqn | dr1tkcal | 0.00 |
| 120 | seqn | dr1ttfat | 0.00 |
| 125 | sedentary_minutes | dr1ttfat | 0.00 |
| 69 | seqn | moderate_minutes | 0.00 |
| 233 | bpxdi1 | lbxhscrp | 0.00 |
| 213 | dr1tsodi | lbxglu | 0.00 |
| 154 | seqn | bmxbmi | 0.00 |
| 242 | vigorous_minutes | lbdhdd | 0.00 |
| 216 | bpxdi1 | lbxglu | 0.00 |
| 141 | moderate_minutes | dr1tsodi | 0.00 |
| 35 | seqn | indfmpir | 0.00 |
| 278 | sedentary_minutes | lbxtc | 0.00 |
| 273 | seqn | lbxtc | 0.00 |
| 191 | vigorous_minutes | bpxdi1 | 0.00 |
| 276 | vigorous_minutes | lbxtc | 0.00 |
numeric_predictors <- c("ridageyr", "indfmpir", "vigorous_minutes",
"moderate_minutes", "sedentary_minutes",
"bmxbmi", "bpxsy1", "bpxdi1", "lbxglu",
"lbxhscrp", "lbdhdd", "lbxtr")
par(mfrow = c(3, 4))
for (var in numeric_predictors) {
boxplot(nhanes_model4[[var]] ~ nhanes_model4$lbxtc,
main = paste(var, "vs Total Cholesterol"),
xlab = "Total Cholesterol", ylab = var, col = "lightblue")
}
continuous_vars <- nhanes_model4 %>%
dplyr::select(
lbxtc, ridageyr, indfmpir,
vigorous_minutes, moderate_minutes, sedentary_minutes,
dr1tkcal, dr1ttfat, dr1tsodi,
bmxbmi, bpxsy1, bpxdi1,
lbxglu, lbxhscrp, lbdhdd, lbxtr
)
# relationships between predictors
ggpairs(continuous_vars,
progress = FALSE,
upper = list(continuous = wrap("cor", size = 3)))
# TRANSFORMED PREDICTORS
nhanes_model4_transformed <- nhanes_model4 |>
dplyr::select(-c(lbxhscrp, lbxtr, dr1tsodi, dr1tkcal, dr1ttfat, lbdhdd, lbxglu, bmxbmi))
df_character_wide <- nhanes_model4_transformed %>%
select_if(function(col) !is.numeric(col) | all(col == .$lbxtc)) %>%
pivot_longer(cols = -lbxtc, names_to = "variable", values_to = "value")
df_character_wide %>%
ggplot(aes(x = value, y = lbxtc)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
numeric_vars <- nhanes_model4_transformed %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
# Correlation with total cholesterol
corr_target <- cor_matrix[, "lbxtc"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)
kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Total Cholesterol"), digits = 2)
| Correlation with Total Cholesterol | |
|---|---|
| lbxtc | 1.00 |
| log_lbxtc | 0.99 |
| log_lbdhdd | 0.17 |
| log_lbxtr | 0.13 |
| bpxdi1 | 0.13 |
| bpxsy1 | 0.13 |
| ridageyr | 0.10 |
| log_lbxhscrp | 0.06 |
| indfmpir | 0.05 |
| log_bmxbmi | 0.02 |
| vigorous_minutes | 0.00 |
| seqn | 0.00 |
| sedentary_minutes | 0.00 |
| moderate_minutes | 0.00 |
| log_dr1tkcal | -0.02 |
| log_lbxglu | -0.02 |
| log_dr1ttfat | -0.03 |
| log_dr1tsodi | -0.04 |
# corrplot::corrplot(cor_matrix, method = "color", type = "upper",
# tl.col = "black", tl.srt = 45, addCoef.col = "black",
# number.cex = 0.7, diag = FALSE)
corrplot::corrplot(cor_matrix,
method = "circle",
type = "upper",
tl.cex = 0.5,
tl.srt = 45) # Rotate text diagonally
# Remove self-correlation
corr_target_no_self <- corr_target[names(corr_target) != "lbxtc"]
# Top 3 positive and negative correlations
top_pos <- sort(corr_target_no_self, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target_no_self, decreasing = FALSE)[1:3]
combined_df <- data.frame(
Feature = c(names(top_pos), names(top_neg)),
Correlation = c(top_pos, top_neg)
) %>%
mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))
ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
labs(title = "Top Features Correlated with Total Cholesterol",
x = "Feature",
y = "Correlation with Total Cholesterol",
fill = "Direction") +
theme_minimal()
# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
filter(Var1 != "SalePrice") |>
filter(Var2 != "SalePrice")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
kable(as.data.frame(sorted_cor), digits = 2)
| Var1 | Var2 | value | |
|---|---|---|---|
| 261 | lbxtc | log_lbxtc | 0.99 |
| 246 | log_dr1tkcal | log_dr1ttfat | 0.88 |
| 228 | log_dr1tkcal | log_dr1tsodi | 0.82 |
| 247 | log_dr1tsodi | log_dr1ttfat | 0.79 |
| 281 | log_lbxtr | log_lbdhdd | -0.48 |
| 316 | log_lbxhscrp | log_bmxbmi | 0.48 |
| 110 | ridageyr | bpxsy1 | 0.47 |
| 133 | bpxsy1 | bpxdi1 | 0.35 |
| 299 | log_lbxtr | log_lbxglu | 0.31 |
| 322 | log_lbdhdd | log_bmxbmi | -0.30 |
| 317 | log_lbxtr | log_bmxbmi | 0.25 |
| 323 | log_lbxglu | log_bmxbmi | 0.22 |
| 304 | log_lbdhdd | log_lbxglu | -0.22 |
| 295 | bpxsy1 | log_lbxglu | 0.21 |
| 280 | log_lbxhscrp | log_lbdhdd | -0.19 |
| 290 | ridageyr | log_lbxglu | 0.19 |
| 285 | log_lbxtc | log_lbdhdd | 0.18 |
| 279 | lbxtc | log_lbdhdd | 0.17 |
| 298 | log_lbxhscrp | log_lbxglu | 0.16 |
| 190 | log_lbxhscrp | log_lbxtr | 0.15 |
| 188 | bpxdi1 | log_lbxtr | 0.14 |
| 260 | bpxdi1 | log_lbxtc | 0.14 |
| 189 | lbxtc | log_lbxtr | 0.13 |
| 313 | bpxsy1 | log_bmxbmi | 0.13 |
| 152 | bpxdi1 | lbxtc | 0.13 |
| 263 | log_lbxtr | log_lbxtc | 0.13 |
| 76 | vigorous_minutes | moderate_minutes | 0.13 |
| 187 | bpxsy1 | log_lbxtr | 0.13 |
| 151 | bpxsy1 | lbxtc | 0.13 |
| 218 | ridageyr | log_dr1tsodi | -0.12 |
| 259 | bpxsy1 | log_lbxtc | 0.12 |
| 200 | ridageyr | log_dr1tkcal | -0.11 |
| 146 | ridageyr | lbxtc | 0.10 |
| 165 | indfmpir | log_lbxhscrp | -0.09 |
| 273 | indfmpir | log_lbdhdd | 0.09 |
| 314 | bpxdi1 | log_bmxbmi | 0.09 |
| 226 | log_lbxhscrp | log_dr1tsodi | -0.09 |
| 164 | ridageyr | log_lbxhscrp | 0.09 |
| 254 | ridageyr | log_lbxtc | 0.09 |
| 208 | log_lbxhscrp | log_dr1tkcal | -0.08 |
| 169 | bpxsy1 | log_lbxhscrp | 0.08 |
| 294 | sedentary_minutes | log_lbxglu | 0.08 |
| 56 | ridageyr | vigorous_minutes | -0.08 |
| 38 | ridageyr | indfmpir | 0.08 |
| 236 | ridageyr | log_dr1ttfat | -0.07 |
| 182 | ridageyr | log_lbxtr | 0.07 |
| 272 | ridageyr | log_lbdhdd | 0.07 |
| 282 | log_dr1tkcal | log_lbdhdd | -0.07 |
| 262 | log_lbxhscrp | log_lbxtc | 0.07 |
| 171 | lbxtc | log_lbxhscrp | 0.06 |
| 283 | log_dr1tsodi | log_lbdhdd | -0.06 |
| 278 | bpxdi1 | log_lbdhdd | -0.06 |
| 219 | indfmpir | log_dr1tsodi | 0.06 |
| 202 | vigorous_minutes | log_dr1tkcal | 0.06 |
| 244 | log_lbxhscrp | log_dr1ttfat | -0.06 |
| 237 | indfmpir | log_dr1ttfat | 0.06 |
| 227 | log_lbxtr | log_dr1tsodi | 0.06 |
| 209 | log_lbxtr | log_dr1tkcal | 0.06 |
| 293 | moderate_minutes | log_lbxglu | 0.05 |
| 220 | vigorous_minutes | log_dr1tsodi | 0.05 |
| 206 | bpxdi1 | log_dr1tkcal | 0.05 |
| 308 | ridageyr | log_bmxbmi | 0.05 |
| 186 | sedentary_minutes | log_lbxtr | 0.05 |
| 255 | indfmpir | log_lbxtc | 0.05 |
| 147 | indfmpir | lbxtc | 0.05 |
| 170 | bpxdi1 | log_lbxhscrp | 0.05 |
| 224 | bpxdi1 | log_dr1tsodi | 0.05 |
| 320 | log_dr1ttfat | log_bmxbmi | 0.04 |
| 292 | vigorous_minutes | log_lbxglu | -0.04 |
| 92 | ridageyr | sedentary_minutes | 0.04 |
| 312 | sedentary_minutes | log_bmxbmi | 0.04 |
| 238 | vigorous_minutes | log_dr1ttfat | 0.04 |
| 309 | indfmpir | log_bmxbmi | -0.04 |
| 114 | sedentary_minutes | bpxsy1 | 0.04 |
| 201 | indfmpir | log_dr1tkcal | 0.04 |
| 223 | bpxsy1 | log_dr1tsodi | -0.04 |
| 185 | moderate_minutes | log_lbxtr | 0.04 |
| 112 | vigorous_minutes | bpxsy1 | -0.04 |
| 225 | lbxtc | log_dr1tsodi | -0.04 |
| 265 | log_dr1tsodi | log_lbxtc | -0.03 |
| 291 | indfmpir | log_lbxglu | 0.03 |
| 266 | log_dr1ttfat | log_lbxtc | -0.03 |
| 74 | ridageyr | moderate_minutes | -0.03 |
| 303 | log_lbxtc | log_lbxglu | -0.03 |
| 243 | lbxtc | log_dr1ttfat | -0.03 |
| 129 | indfmpir | bpxdi1 | 0.03 |
| 131 | moderate_minutes | bpxdi1 | 0.03 |
| 284 | log_dr1ttfat | log_lbdhdd | -0.03 |
| 205 | bpxsy1 | log_dr1tkcal | -0.03 |
| 94 | vigorous_minutes | sedentary_minutes | -0.03 |
| 111 | indfmpir | bpxsy1 | -0.03 |
| 321 | log_lbxtc | log_bmxbmi | 0.03 |
| 93 | indfmpir | sedentary_minutes | 0.03 |
| 166 | vigorous_minutes | log_lbxhscrp | -0.03 |
| 168 | sedentary_minutes | log_lbxhscrp | 0.03 |
| 95 | moderate_minutes | sedentary_minutes | -0.03 |
| 242 | bpxdi1 | log_dr1ttfat | 0.03 |
| 184 | vigorous_minutes | log_lbxtr | -0.03 |
| 319 | log_dr1tsodi | log_bmxbmi | 0.02 |
| 241 | bpxsy1 | log_dr1ttfat | -0.02 |
| 297 | lbxtc | log_lbxglu | -0.02 |
| 315 | lbxtc | log_bmxbmi | 0.02 |
| 245 | log_lbxtr | log_dr1ttfat | 0.02 |
| 207 | lbxtc | log_dr1tkcal | -0.02 |
| 264 | log_dr1tkcal | log_lbxtc | -0.02 |
| 55 | seqn | vigorous_minutes | -0.02 |
| 276 | sedentary_minutes | log_lbdhdd | -0.02 |
| 75 | indfmpir | moderate_minutes | -0.02 |
| 128 | ridageyr | bpxdi1 | 0.02 |
| 289 | seqn | log_lbxglu | -0.02 |
| 163 | seqn | log_lbxhscrp | 0.02 |
| 221 | moderate_minutes | log_dr1tsodi | -0.01 |
| 91 | seqn | sedentary_minutes | 0.01 |
| 311 | moderate_minutes | log_bmxbmi | 0.01 |
| 127 | seqn | bpxdi1 | 0.01 |
| 132 | sedentary_minutes | bpxdi1 | 0.01 |
| 310 | vigorous_minutes | log_bmxbmi | -0.01 |
| 239 | moderate_minutes | log_dr1ttfat | -0.01 |
| 271 | seqn | log_lbdhdd | 0.01 |
| 300 | log_dr1tkcal | log_lbxglu | -0.01 |
| 235 | seqn | log_dr1ttfat | -0.01 |
| 57 | indfmpir | vigorous_minutes | -0.01 |
| 275 | moderate_minutes | log_lbdhdd | 0.01 |
| 240 | sedentary_minutes | log_dr1ttfat | 0.01 |
| 302 | log_dr1ttfat | log_lbxglu | 0.01 |
| 203 | moderate_minutes | log_dr1tkcal | -0.01 |
| 222 | sedentary_minutes | log_dr1tsodi | -0.01 |
| 109 | seqn | bpxsy1 | 0.01 |
| 19 | seqn | ridageyr | 0.01 |
| 296 | bpxdi1 | log_lbxglu | 0.01 |
| 318 | log_dr1tkcal | log_bmxbmi | 0.01 |
| 301 | log_dr1tsodi | log_lbxglu | 0.00 |
| 149 | moderate_minutes | lbxtc | 0.00 |
| 217 | seqn | log_dr1tsodi | 0.00 |
| 113 | moderate_minutes | bpxsy1 | 0.00 |
| 204 | sedentary_minutes | log_dr1tkcal | 0.00 |
| 183 | indfmpir | log_lbxtr | 0.00 |
| 181 | seqn | log_lbxtr | 0.00 |
| 73 | seqn | moderate_minutes | 0.00 |
| 258 | sedentary_minutes | log_lbxtc | 0.00 |
| 256 | vigorous_minutes | log_lbxtc | 0.00 |
| 257 | moderate_minutes | log_lbxtc | 0.00 |
| 277 | bpxsy1 | log_lbdhdd | 0.00 |
| 167 | moderate_minutes | log_lbxhscrp | 0.00 |
| 37 | seqn | indfmpir | 0.00 |
| 199 | seqn | log_dr1tkcal | 0.00 |
| 274 | vigorous_minutes | log_lbdhdd | 0.00 |
| 307 | seqn | log_bmxbmi | 0.00 |
| 150 | sedentary_minutes | lbxtc | 0.00 |
| 145 | seqn | lbxtc | 0.00 |
| 253 | seqn | log_lbxtc | 0.00 |
| 130 | vigorous_minutes | bpxdi1 | 0.00 |
| 148 | vigorous_minutes | lbxtc | 0.00 |
numeric_predictors <- c("ridageyr", "indfmpir", "vigorous_minutes",
"moderate_minutes", "sedentary_minutes",
"bmxbmi", "bpxsy1", "bpxdi1", "lbxglu",
"lbxhscrp", "lbdhdd", "lbxtr")
par(mfrow = c(3, 4))
for (var in numeric_predictors) {
boxplot(nhanes_model4[[var]] ~ nhanes_model4$lbxtc,
main = paste(var, "vs Total Cholesterol"),
xlab = "Total Cholesterol", ylab = var, col = "lightblue")
}
# relationships between predictors
#ggpairs(nhanes_model4_transformed, progress = FALSE)
To enhance the predictive modeling of total cholesterol, several new features were derived from the raw NHANES data. Age was categorized into young (<30), middle-aged (30–65), and older (>65) groups, while BMI was classified into underweight, normal, overweight, and obese categories. Physical activity variables were combined into a total activity measure, and a binary indicator was created to flag participants meeting recommended activity levels. Dietary intake was normalized by creating fat-to-calorie and sodium-to-calorie ratios. Finally, skewed laboratory and dietary variables, including triglycerides, CRP, total calories, fat, and sodium, were log-transformed to reduce skewness and improve model stability.
nhanes_model5 <- nhanes_model4 %>%
# Age groups
mutate(
AGE_YOUNG = as.numeric(ridageyr < 30),
AGE_MID = as.numeric(ridageyr >= 30 & ridageyr <= 65),
AGE_OLD = as.numeric(ridageyr > 65)
) %>%
# BMI categories
mutate(
BMI_UNDER = as.numeric(bmxbmi < 18.5),
BMI_NORMAL = as.numeric(bmxbmi >= 18.5 & bmxbmi < 25),
BMI_OVER = as.numeric(bmxbmi >= 25 & bmxbmi < 30),
BMI_OBESE = as.numeric(bmxbmi >= 30)
) %>%
# Physical activity
mutate(
TOTAL_ACTIVITY = vigorous_minutes + moderate_minutes,
ACTIVE_FLAG = as.numeric(TOTAL_ACTIVITY >= 150) # Meets 150 min/week guideline
) %>%
# Dietary ratios
mutate(
# FAT_RATIO = ifelse(dr1tkcal > 0, dr1ttfat / dr1tkcal, NA),
# SODIUM_RATIO = ifelse(dr1tkcal > 0, dr1tsodi / dr1tkcal, NA)
HIGH_FAT_DIET = as.numeric((exp(log_dr1ttfat) * 9 / exp(log_dr1tkcal)) > 0.35),
HIGH_SODIUM = as.numeric(exp(log_dr1tsodi) > 2300)
)
To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.
set.seed(123)
train_index <- createDataPartition(nhanes_model4$lbxtc, p = 0.8, list = FALSE)
# Create train/test splits for each dataset
# Removed seqn for each dataset
# Original predictors - removed transformed predictors
nhanes_model_original <- nhanes_model4 |>
dplyr::select(-c(log_lbxhscrp, log_lbxtr, log_dr1tkcal, log_dr1tsodi, log_dr1ttfat, log_lbxtc, log_lbdhdd, log_lbxglu, log_bmxbmi, seqn))
train_data_original <- nhanes_model_original[train_index, ]
test_data_original <- nhanes_model_original[-train_index, ]
# transformed predictors - ONLY the transformed predictors (remove transformed target variable log_lbxtc)
nhanes_model_transformed <- nhanes_model4 |>
dplyr::select(-c(lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, log_lbxtc, lbdhdd, lbxglu, bmxbmi, seqn))
train_data_transformed <- nhanes_model_transformed[train_index, ]
test_data_transformed <- nhanes_model_transformed[-train_index, ]
# transformed predictors + log transformed target variable log_lbxtc
nhanes_model_transformed_target <- nhanes_model4 |>
dplyr::select(-c(lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, lbxtc, lbdhdd, lbxglu, bmxbmi, seqn))
train_data_transformed_target <- nhanes_model_transformed_target[train_index, ]
test_data_transformed_target <- nhanes_model_transformed_target[-train_index, ]
# new features
nhanes_model_transformed_newfeatures <- nhanes_model5 |>
dplyr::select(
# Target
log_lbxtc,
# Demographics (NO seqn)
ridageyr, riagendr, ridreth3, dmdeduc2, indfmpir,
# Lifestyle
smq020, alq111, sedentary_minutes,
# NEW FEATURES (activity & diet)
TOTAL_ACTIVITY, ACTIVE_FLAG, HIGH_FAT_DIET, HIGH_SODIUM,
# Age/BMI categories
AGE_YOUNG, AGE_MID, # AGE_OLD is reference
BMI_UNDER, BMI_NORMAL, BMI_OVER, # BMI_OBESE is reference
# Blood pressure (raw values)
bpxsy1, bpxdi1,
# Biomarkers (LOG versions ONLY - no raw)
log_lbxhscrp, log_lbxtr, log_lbdhdd, log_lbxglu
)
train_data_transformed_newfeatures <- nhanes_model_transformed_newfeatures[train_index, ]
test_data_transformed_newfeatures <- nhanes_model_transformed_newfeatures[-train_index, ]
# Model 1: original predictors + original target variable lbxtc
set.seed(123)
model_full <- lm(lbxtc ~ ., data = train_data_original)
summary(model_full)
##
## Call:
## lm(formula = lbxtc ~ ., data = train_data_original)
##
## Residuals:
## Min 1Q Median 3Q Max
## -205.386 -26.880 -3.902 22.995 246.556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.859e+01 6.803e+00 14.493 < 2e-16 ***
## ridageyr 6.551e-02 4.048e-02 1.618 0.105684
## riagendrFemale 4.976e+00 1.400e+00 3.554 0.000384 ***
## ridreth32 2.054e+00 2.584e+00 0.795 0.426837
## ridreth33 -2.334e+00 2.092e+00 -1.116 0.264487
## ridreth34 -6.533e+00 2.219e+00 -2.943 0.003264 **
## ridreth36 1.608e+00 2.514e+00 0.640 0.522303
## ridreth37 -2.719e+00 3.223e+00 -0.844 0.398850
## dmdeduc22 -2.821e+00 2.924e+00 -0.965 0.334658
## dmdeduc23 -1.927e+00 2.661e+00 -0.724 0.468872
## dmdeduc24 -3.381e+00 2.612e+00 -1.294 0.195707
## dmdeduc25 -2.496e-01 2.852e+00 -0.088 0.930271
## dmdeduc27 5.277e+01 2.790e+01 1.891 0.058689 .
## dmdeduc29 -6.112e+00 1.410e+01 -0.434 0.664619
## indfmpir 4.399e-01 4.597e-01 0.957 0.338673
## smq0202 -3.037e+00 1.356e+00 -2.240 0.025171 *
## alq1112 1.032e+00 2.149e+00 0.480 0.630904
## vigorous_minutes 5.556e-05 1.868e-03 0.030 0.976276
## moderate_minutes -3.243e-04 1.122e-03 -0.289 0.772516
## sedentary_minutes 7.255e-05 8.792e-04 0.083 0.934240
## dr1tkcal 1.876e-03 1.545e-03 1.214 0.224706
## dr1ttfat -3.107e-02 2.910e-02 -1.068 0.285671
## dr1tsodi -4.430e-04 5.512e-04 -0.804 0.421583
## bmxbmi 2.602e-01 9.389e-02 2.771 0.005609 **
## bpxsy1 1.552e-01 4.148e-02 3.741 0.000186 ***
## bpxdi1 3.544e-01 5.282e-02 6.710 2.21e-11 ***
## lbxglu -5.781e-02 1.749e-02 -3.306 0.000956 ***
## lbxhscrp -1.797e-02 8.319e-02 -0.216 0.828978
## lbdhdd 6.508e-01 4.593e-02 14.167 < 2e-16 ***
## lbxtr 5.909e-02 5.548e-03 10.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.14 on 4113 degrees of freedom
## Multiple R-squared: 0.1082, Adjusted R-squared: 0.1019
## F-statistic: 17.21 on 29 and 4113 DF, p-value: < 2.2e-16
Using log transformations for lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, lbdhdd, lbxglu, and bmxbmi.
# Model 2: LM transformed predictors + original target variable
set.seed(123)
# Full linear model using log-transformed predictors
# model_full_log <- lm(lbxtc ~ log_chol + log_hdl + log_bmi + bpxsy1 + bpxdi1 +
# log_glucose + vigorous_minutes + moderate_minutes +
# sedentary_minutes + indfmpir + ridageyr +
# log_crp + log_trig + log_sodium + log_calories + log_fat +
# riagendr + ridreth3 + dmdeduc2 + smq020 + alq111,
# data = train_full_log)
model_full_transformed <- lm(lbxtc ~ .,
data = train_data_transformed)
summary(model_full_transformed)
##
## Call:
## lm(formula = lbxtc ~ ., data = train_data_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.489 -26.461 -3.879 22.866 240.852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.413e+01 2.547e+01 -2.517 0.011862 *
## ridageyr 3.486e-02 4.000e-02 0.871 0.383550
## riagendrFemale 3.096e+00 1.388e+00 2.231 0.025761 *
## ridreth32 2.335e+00 2.546e+00 0.917 0.359228
## ridreth33 -1.478e+00 2.061e+00 -0.717 0.473536
## ridreth34 -4.676e+00 2.190e+00 -2.135 0.032800 *
## ridreth36 3.294e+00 2.486e+00 1.325 0.185150
## ridreth37 -1.835e+00 3.173e+00 -0.578 0.563126
## dmdeduc22 -3.932e+00 2.883e+00 -1.364 0.172719
## dmdeduc23 -2.363e+00 2.621e+00 -0.901 0.367508
## dmdeduc24 -3.947e+00 2.576e+00 -1.532 0.125509
## dmdeduc25 -9.525e-01 2.811e+00 -0.339 0.734730
## dmdeduc27 5.492e+01 2.748e+01 1.999 0.045727 *
## dmdeduc29 -7.819e+00 1.389e+01 -0.563 0.573419
## indfmpir 6.082e-01 4.531e-01 1.342 0.179577
## smq0202 -2.473e+00 1.337e+00 -1.850 0.064415 .
## alq1112 7.991e-01 2.117e+00 0.377 0.705910
## vigorous_minutes 5.994e-04 1.841e-03 0.326 0.744700
## moderate_minutes -4.654e-04 1.105e-03 -0.421 0.673633
## sedentary_minutes -4.768e-05 8.658e-04 -0.055 0.956086
## bpxsy1 1.488e-01 4.091e-02 3.638 0.000278 ***
## bpxdi1 2.856e-01 5.227e-02 5.464 4.94e-08 ***
## log_lbxhscrp 3.224e+00 6.367e-01 5.064 4.29e-07 ***
## log_lbxtr 1.826e+01 1.186e+00 15.397 < 2e-16 ***
## log_dr1tkcal 4.812e+00 2.980e+00 1.615 0.106441
## log_dr1tsodi -2.074e+00 1.969e+00 -1.053 0.292402
## log_dr1ttfat -2.845e+00 2.219e+00 -1.282 0.199821
## log_lbdhdd 4.536e+01 2.728e+00 16.627 < 2e-16 ***
## log_lbxglu -1.505e+01 2.734e+00 -5.505 3.91e-08 ***
## log_bmxbmi 3.525e+00 3.164e+00 1.114 0.265378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.55 on 4113 degrees of freedom
## Multiple R-squared: 0.1351, Adjusted R-squared: 0.129
## F-statistic: 22.15 on 29 and 4113 DF, p-value: < 2.2e-16
# Model 3: LM transformed predictors + transformed target variable
set.seed(123)
# Full linear model using log-transformed predictors and target
model_full_transformed_target <- lm(log_lbxtc ~ .,
data = train_data_transformed_target)
summary(model_full_transformed_target)
##
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_target)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79528 -0.13458 -0.00201 0.13493 0.85593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.781e+00 1.351e-01 27.974 < 2e-16 ***
## ridageyr 9.798e-05 2.122e-04 0.462 0.64428
## riagendrFemale 1.522e-02 7.363e-03 2.067 0.03879 *
## ridreth32 1.328e-02 1.351e-02 0.983 0.32557
## ridreth33 -9.215e-03 1.094e-02 -0.843 0.39954
## ridreth34 -2.878e-02 1.162e-02 -2.477 0.01327 *
## ridreth36 1.696e-02 1.319e-02 1.286 0.19841
## ridreth37 -1.084e-02 1.684e-02 -0.644 0.51950
## dmdeduc22 -1.849e-02 1.530e-02 -1.209 0.22677
## dmdeduc23 -1.390e-02 1.391e-02 -0.999 0.31767
## dmdeduc24 -2.196e-02 1.366e-02 -1.607 0.10810
## dmdeduc25 -3.279e-03 1.491e-02 -0.220 0.82598
## dmdeduc27 2.616e-01 1.458e-01 1.794 0.07290 .
## dmdeduc29 -3.810e-02 7.367e-02 -0.517 0.60510
## indfmpir 2.862e-03 2.404e-03 1.190 0.23395
## smq0202 -1.135e-02 7.093e-03 -1.600 0.10964
## alq1112 4.982e-03 1.123e-02 0.443 0.65746
## vigorous_minutes 3.891e-06 9.765e-06 0.398 0.69033
## moderate_minutes -1.840e-06 5.862e-06 -0.314 0.75363
## sedentary_minutes -3.211e-07 4.593e-06 -0.070 0.94427
## bpxsy1 6.533e-04 2.170e-04 3.010 0.00263 **
## bpxdi1 1.742e-03 2.773e-04 6.280 3.74e-10 ***
## log_lbxhscrp 1.809e-02 3.378e-03 5.357 8.94e-08 ***
## log_lbxtr 9.672e-02 6.292e-03 15.371 < 2e-16 ***
## log_dr1tkcal 2.774e-02 1.581e-02 1.754 0.07943 .
## log_dr1tsodi -9.894e-03 1.045e-02 -0.947 0.34371
## log_dr1ttfat -1.821e-02 1.177e-02 -1.547 0.12185
## log_lbdhdd 2.564e-01 1.447e-02 17.712 < 2e-16 ***
## log_lbxglu -8.268e-02 1.451e-02 -5.700 1.28e-08 ***
## log_bmxbmi 3.078e-02 1.679e-02 1.834 0.06677 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2045 on 4113 degrees of freedom
## Multiple R-squared: 0.1424, Adjusted R-squared: 0.1364
## F-statistic: 23.55 on 29 and 4113 DF, p-value: < 2.2e-16
AIC=30412.97
# Model 4
set.seed(123)
step_model_full <- stats::step(model_full, direction = "both")
## Start: AIC=30416.75
## lbxtc ~ ridageyr + riagendr + ridreth3 + dmdeduc2 + indfmpir +
## smq020 + alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes +
## dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 +
## lbxglu + lbxhscrp + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - dmdeduc2 6 12506 6314741 30413
## - vigorous_minutes 1 1 6302237 30415
## - sedentary_minutes 1 10 6302246 30415
## - lbxhscrp 1 72 6302307 30415
## - moderate_minutes 1 128 6302363 30415
## - alq111 1 354 6302589 30415
## - dr1tsodi 1 990 6303225 30415
## - indfmpir 1 1403 6303638 30416
## - dr1ttfat 1 1747 6303982 30416
## - dr1tkcal 1 2259 6304495 30416
## <none> 6302235 30417
## - ridageyr 1 4013 6306248 30417
## - smq020 1 7686 6309921 30420
## - bmxbmi 1 11768 6314003 30423
## - lbxglu 1 16743 6318979 30426
## - ridreth3 5 30282 6332517 30427
## - riagendr 1 19350 6321585 30428
## - bpxsy1 1 21447 6323682 30429
## - bpxdi1 1 68996 6371232 30460
## - lbxtr 1 173850 6476085 30528
## - lbdhdd 1 307535 6609770 30612
##
## Step: AIC=30412.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes +
## dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 +
## lbxglu + lbxhscrp + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - vigorous_minutes 1 1 6314743 30411
## - sedentary_minutes 1 4 6314745 30411
## - lbxhscrp 1 127 6314868 30411
## - moderate_minutes 1 160 6314902 30411
## - alq111 1 422 6315164 30411
## - dr1tsodi 1 1116 6315857 30412
## - dr1ttfat 1 2051 6316793 30412
## - dr1tkcal 1 2686 6317427 30413
## - indfmpir 1 2697 6317439 30413
## <none> 6314741 30413
## - ridageyr 1 5813 6320554 30415
## - smq020 1 6883 6321624 30416
## + dmdeduc2 6 12506 6302235 30417
## - bmxbmi 1 11978 6326720 30419
## - lbxglu 1 16906 6331647 30422
## - riagendr 1 19002 6333743 30423
## - bpxsy1 1 21650 6336391 30425
## - ridreth3 5 37589 6352330 30428
## - bpxdi1 1 67805 6382546 30455
## - lbxtr 1 173060 6487802 30523
## - lbdhdd 1 310092 6624833 30610
##
## Step: AIC=30410.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + moderate_minutes + sedentary_minutes + dr1tkcal +
## dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + lbxglu +
## lbxhscrp + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - sedentary_minutes 1 4 6314746 30409
## - lbxhscrp 1 126 6314869 30409
## - moderate_minutes 1 170 6314913 30409
## - alq111 1 423 6315165 30409
## - dr1tsodi 1 1118 6315860 30410
## - dr1ttfat 1 2051 6316794 30410
## - dr1tkcal 1 2685 6317428 30411
## - indfmpir 1 2697 6317439 30411
## <none> 6314743 30411
## - ridageyr 1 5842 6320585 30413
## + vigorous_minutes 1 1 6314741 30413
## - smq020 1 6882 6321624 30414
## + dmdeduc2 6 12506 6302237 30415
## - bmxbmi 1 11983 6326725 30417
## - lbxglu 1 16911 6331654 30420
## - riagendr 1 19052 6333795 30421
## - bpxsy1 1 21649 6336392 30423
## - ridreth3 5 37617 6352359 30426
## - bpxdi1 1 67807 6382549 30453
## - lbxtr 1 173089 6487832 30521
## - lbdhdd 1 310091 6624834 30608
##
## Step: AIC=30408.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + moderate_minutes + dr1tkcal + dr1ttfat + dr1tsodi +
## bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbxhscrp + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - lbxhscrp 1 126 6314872 30407
## - moderate_minutes 1 172 6314918 30407
## - alq111 1 422 6315168 30407
## - dr1tsodi 1 1119 6315865 30408
## - dr1ttfat 1 2052 6316798 30408
## - dr1tkcal 1 2689 6317435 30409
## - indfmpir 1 2703 6317449 30409
## <none> 6314746 30409
## - ridageyr 1 5845 6320591 30411
## + sedentary_minutes 1 4 6314743 30411
## + vigorous_minutes 1 1 6314745 30411
## - smq020 1 6882 6321628 30412
## + dmdeduc2 6 12499 6302247 30413
## - bmxbmi 1 11986 6326732 30415
## - lbxglu 1 16929 6331676 30418
## - riagendr 1 19092 6333839 30420
## - bpxsy1 1 21673 6336419 30421
## - ridreth3 5 37646 6352392 30424
## - bpxdi1 1 67814 6382560 30451
## - lbxtr 1 173420 6488166 30519
## - lbdhdd 1 310089 6624835 30606
##
## Step: AIC=30407.05
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + moderate_minutes + dr1tkcal + dr1ttfat + dr1tsodi +
## bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - moderate_minutes 1 171 6315043 30405
## - alq111 1 409 6315282 30405
## - dr1tsodi 1 1106 6315979 30406
## - dr1ttfat 1 2037 6316909 30406
## - dr1tkcal 1 2682 6317554 30407
## - indfmpir 1 2746 6317618 30407
## <none> 6314872 30407
## - ridageyr 1 5814 6320686 30409
## + lbxhscrp 1 126 6314746 30409
## + sedentary_minutes 1 3 6314869 30409
## + vigorous_minutes 1 1 6314871 30409
## - smq020 1 6819 6321691 30410
## + dmdeduc2 6 12554 6302318 30411
## - bmxbmi 1 12075 6326948 30413
## - lbxglu 1 17092 6331964 30416
## - riagendr 1 18970 6333842 30418
## - bpxsy1 1 21764 6336637 30419
## - ridreth3 5 38012 6352885 30422
## - bpxdi1 1 67836 6382708 30449
## - lbxtr 1 173768 6488640 30518
## - lbdhdd 1 311113 6625986 30604
##
## Step: AIC=30405.16
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 +
## bpxdi1 + lbxglu + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - alq111 1 416 6315459 30403
## - dr1tsodi 1 1095 6316139 30404
## - dr1ttfat 1 2030 6317074 30405
## - dr1tkcal 1 2671 6317715 30405
## - indfmpir 1 2773 6317816 30405
## <none> 6315043 30405
## - ridageyr 1 5909 6320952 30407
## + moderate_minutes 1 171 6314872 30407
## + lbxhscrp 1 125 6314918 30407
## + vigorous_minutes 1 11 6315032 30407
## + sedentary_minutes 1 5 6315038 30407
## - smq020 1 6789 6321833 30408
## + dmdeduc2 6 12597 6302447 30409
## - bmxbmi 1 12027 6327071 30411
## - lbxglu 1 17307 6332351 30415
## - riagendr 1 19111 6334155 30416
## - bpxsy1 1 21846 6336889 30418
## - ridreth3 5 38111 6353154 30420
## - bpxdi1 1 67695 6382738 30447
## - lbxtr 1 173608 6488652 30516
## - lbdhdd 1 311739 6626782 30603
##
## Step: AIC=30403.44
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 +
## lbxglu + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - dr1tsodi 1 1113 6316572 30402
## - dr1ttfat 1 2033 6317492 30403
## - indfmpir 1 2638 6318097 30403
## - dr1tkcal 1 2656 6318115 30403
## <none> 6315459 30403
## + alq111 1 416 6315043 30405
## + moderate_minutes 1 178 6315282 30405
## - ridageyr 1 5959 6321418 30405
## + lbxhscrp 1 112 6315347 30405
## + vigorous_minutes 1 13 6315446 30405
## + sedentary_minutes 1 4 6315455 30405
## - smq020 1 6400 6321859 30406
## + dmdeduc2 6 12664 6302795 30407
## - bmxbmi 1 11982 6327442 30409
## - lbxglu 1 17484 6332943 30413
## - riagendr 1 19527 6334986 30414
## - bpxsy1 1 22015 6337474 30416
## - ridreth3 5 38933 6354392 30419
## - bpxdi1 1 67308 6382767 30445
## - lbxtr 1 173551 6489011 30514
## - lbdhdd 1 311396 6626855 30601
##
## Step: AIC=30402.17
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## dr1tkcal + dr1ttfat + bmxbmi + bpxsy1 + bpxdi1 + lbxglu +
## lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - dr1tkcal 1 1763 6318336 30401
## - indfmpir 1 2593 6319165 30402
## - dr1ttfat 1 2684 6319256 30402
## <none> 6316572 30402
## + dr1tsodi 1 1113 6315459 30403
## + alq111 1 434 6316139 30404
## + moderate_minutes 1 166 6316406 30404
## + lbxhscrp 1 101 6316472 30404
## + vigorous_minutes 1 17 6316555 30404
## + sedentary_minutes 1 5 6316568 30404
## - ridageyr 1 6262 6322834 30404
## - smq020 1 6489 6323062 30404
## + dmdeduc2 6 12788 6303785 30406
## - bmxbmi 1 11673 6328245 30408
## - lbxglu 1 17493 6334065 30412
## - riagendr 1 20116 6336689 30413
## - bpxsy1 1 21965 6338537 30415
## - ridreth3 5 38115 6354687 30417
## - bpxdi1 1 67451 6384023 30444
## - lbxtr 1 173143 6489716 30512
## - lbdhdd 1 311010 6627583 30599
##
## Step: AIC=30401.32
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## dr1ttfat + bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - dr1ttfat 1 997 6319333 30400
## - indfmpir 1 2434 6320770 30401
## <none> 6318336 30401
## + dr1tkcal 1 1763 6316572 30402
## + alq111 1 410 6317925 30403
## - ridageyr 1 5762 6324098 30403
## + dr1tsodi 1 221 6318115 30403
## + moderate_minutes 1 163 6318173 30403
## + lbxhscrp 1 101 6318234 30403
## + vigorous_minutes 1 13 6318323 30403
## + sedentary_minutes 1 8 6318328 30403
## - smq020 1 6573 6324908 30404
## + dmdeduc2 6 13094 6305242 30405
## - bmxbmi 1 11238 6329574 30407
## - lbxglu 1 17995 6336330 30411
## - riagendr 1 18728 6337063 30412
## - bpxsy1 1 22276 6340611 30414
## - ridreth3 5 39916 6358252 30417
## - bpxdi1 1 68062 6386397 30444
## - lbxtr 1 175897 6494233 30513
## - lbdhdd 1 310869 6629204 30598
##
## Step: AIC=30399.98
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## - indfmpir 1 2293 6321625 30400
## <none> 6319333 30400
## + dr1tsodi 1 1112 6318221 30401
## + dr1ttfat 1 997 6318336 30401
## + alq111 1 454 6318879 30402
## + moderate_minutes 1 155 6319178 30402
## + dr1tkcal 1 76 6319256 30402
## + lbxhscrp 1 76 6319257 30402
## + vigorous_minutes 1 18 6319315 30402
## + sedentary_minutes 1 5 6319328 30402
## - ridageyr 1 6417 6325750 30402
## - smq020 1 6494 6325826 30402
## + dmdeduc2 6 13170 6306162 30403
## - bmxbmi 1 10892 6330224 30405
## - lbxglu 1 18021 6337354 30410
## - riagendr 1 21858 6341190 30412
## - bpxsy1 1 22385 6341717 30413
## - ridreth3 5 41077 6360410 30417
## - bpxdi1 1 67820 6387153 30442
## - lbxtr 1 175262 6494595 30511
## - lbdhdd 1 309874 6629207 30596
##
## Step: AIC=30399.48
## lbxtc ~ ridageyr + riagendr + ridreth3 + smq020 + bmxbmi + bpxsy1 +
## bpxdi1 + lbxglu + lbdhdd + lbxtr
##
## Df Sum of Sq RSS AIC
## <none> 6321625 30400
## + indfmpir 1 2293 6319333 30400
## + dr1tsodi 1 1006 6320620 30401
## + dr1ttfat 1 856 6320770 30401
## - smq020 1 5649 6327274 30401
## + alq111 1 319 6321307 30401
## + moderate_minutes 1 178 6321448 30401
## + lbxhscrp 1 111 6321514 30401
## + dr1tkcal 1 57 6321568 30401
## + vigorous_minutes 1 16 6321609 30402
## + sedentary_minutes 1 12 6321613 30402
## + dmdeduc2 6 14303 6307323 30402
## - ridageyr 1 7235 6328861 30402
## - bmxbmi 1 11074 6332699 30405
## - lbxglu 1 17574 6339200 30409
## - riagendr 1 20983 6342608 30411
## - bpxsy1 1 21323 6342949 30411
## - ridreth3 5 42534 6364160 30417
## - bpxdi1 1 69815 6391440 30443
## - lbxtr 1 176892 6498518 30512
## - lbdhdd 1 319754 6641379 30602
summary(step_model_full)
##
## Call:
## lm(formula = lbxtc ~ ridageyr + riagendr + ridreth3 + smq020 +
## bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr, data = train_data_original)
##
## Residuals:
## Min 1Q Median 3Q Max
## -205.991 -26.687 -3.703 22.741 247.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.05147 6.24730 15.535 < 2e-16 ***
## ridageyr 0.08447 0.03886 2.174 0.029790 *
## riagendrFemale 4.92693 1.33103 3.702 0.000217 ***
## ridreth32 2.23916 2.57090 0.871 0.383825
## ridreth33 -2.63478 1.96970 -1.338 0.181082
## ridreth34 -6.95561 2.11946 -3.282 0.001040 **
## ridreth36 2.76851 2.33896 1.184 0.236620
## ridreth37 -3.10601 3.15328 -0.985 0.324677
## smq0202 -2.52044 1.31232 -1.921 0.054852 .
## bmxbmi 0.24385 0.09068 2.689 0.007193 **
## bpxsy1 0.15324 0.04107 3.732 0.000193 ***
## bpxdi1 0.35326 0.05232 6.752 1.66e-11 ***
## lbxglu -0.05884 0.01737 -3.388 0.000712 ***
## lbdhdd 0.65403 0.04526 14.450 < 2e-16 ***
## lbxtr 0.05933 0.00552 10.748 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.13 on 4128 degrees of freedom
## Multiple R-squared: 0.1055, Adjusted R-squared: 0.1024
## F-statistic: 34.76 on 14 and 4128 DF, p-value: < 2.2e-16
# Model 5
set.seed(123)
step_model_full_transformed_target <- stats::step(model_full_transformed_target, direction = "both")
## Start: AIC=-13120.82
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + dmdeduc2 + indfmpir +
## smq020 + alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes +
## bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal +
## log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
##
## Df Sum of Sq RSS AIC
## - dmdeduc2 6 0.3961 172.43 -13123
## - sedentary_minutes 1 0.0002 172.04 -13123
## - moderate_minutes 1 0.0041 172.04 -13123
## - vigorous_minutes 1 0.0066 172.04 -13123
## - alq111 1 0.0082 172.04 -13123
## - ridageyr 1 0.0089 172.04 -13123
## - log_dr1tsodi 1 0.0375 172.07 -13122
## - indfmpir 1 0.0593 172.10 -13121
## <none> 172.04 -13121
## - log_dr1ttfat 1 0.1002 172.14 -13120
## - smq020 1 0.1071 172.14 -13120
## - log_dr1tkcal 1 0.1287 172.16 -13120
## - log_bmxbmi 1 0.1406 172.18 -13119
## - riagendr 1 0.1787 172.22 -13118
## - bpxsy1 1 0.3790 172.42 -13114
## - ridreth3 5 0.8144 172.85 -13111
## - log_lbxhscrp 1 1.2002 173.24 -13094
## - log_lbxglu 1 1.3590 173.40 -13090
## - bpxdi1 1 1.6495 173.69 -13083
## - log_lbxtr 1 9.8824 181.92 -12891
## - log_lbdhdd 1 13.1221 185.16 -12818
##
## Step: AIC=-13123.3
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes +
## bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal +
## log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
##
## Df Sum of Sq RSS AIC
## - sedentary_minutes 1 0.0005 172.43 -13125
## - vigorous_minutes 1 0.0046 172.44 -13125
## - moderate_minutes 1 0.0053 172.44 -13125
## - alq111 1 0.0103 172.44 -13125
## - ridageyr 1 0.0288 172.46 -13125
## - log_dr1tsodi 1 0.0408 172.47 -13124
## <none> 172.43 -13123
## - smq020 1 0.0894 172.52 -13123
## - indfmpir 1 0.1082 172.54 -13123
## - log_dr1ttfat 1 0.1181 172.55 -13122
## - log_bmxbmi 1 0.1489 172.58 -13122
## - log_dr1tkcal 1 0.1495 172.58 -13122
## - riagendr 1 0.1725 172.60 -13121
## + dmdeduc2 6 0.3961 172.04 -13121
## - bpxsy1 1 0.3792 172.81 -13116
## - ridreth3 5 1.0580 173.49 -13108
## - log_lbxhscrp 1 1.1580 173.59 -13098
## - log_lbxglu 1 1.3590 173.79 -13093
## - bpxdi1 1 1.6288 174.06 -13086
## - log_lbxtr 1 9.8568 182.29 -12895
## - log_lbdhdd 1 13.2179 185.65 -12819
##
## Step: AIC=-13125.29
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + vigorous_minutes + moderate_minutes + bpxsy1 + bpxdi1 +
## log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1tsodi +
## log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
##
## Df Sum of Sq RSS AIC
## - vigorous_minutes 1 0.0046 172.44 -13127
## - moderate_minutes 1 0.0052 172.44 -13127
## - alq111 1 0.0103 172.44 -13127
## - ridageyr 1 0.0288 172.46 -13127
## - log_dr1tsodi 1 0.0407 172.47 -13126
## <none> 172.43 -13125
## - smq020 1 0.0894 172.52 -13125
## - indfmpir 1 0.1080 172.54 -13125
## - log_dr1ttfat 1 0.1181 172.55 -13124
## - log_bmxbmi 1 0.1489 172.58 -13124
## - log_dr1tkcal 1 0.1494 172.58 -13124
## + sedentary_minutes 1 0.0005 172.43 -13123
## - riagendr 1 0.1721 172.60 -13123
## + dmdeduc2 6 0.3963 172.04 -13123
## - bpxsy1 1 0.3788 172.81 -13118
## - ridreth3 5 1.0616 173.50 -13110
## - log_lbxhscrp 1 1.1577 173.59 -13100
## - log_lbxglu 1 1.3656 173.80 -13095
## - bpxdi1 1 1.6286 174.06 -13088
## - log_lbxtr 1 9.8634 182.30 -12897
## - log_lbdhdd 1 13.2175 185.65 -12821
##
## Step: AIC=-13127.17
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + moderate_minutes + bpxsy1 + bpxdi1 + log_lbxhscrp +
## log_lbxtr + log_dr1tkcal + log_dr1tsodi + log_dr1ttfat +
## log_lbdhdd + log_lbxglu + log_bmxbmi
##
## Df Sum of Sq RSS AIC
## - moderate_minutes 1 0.0037 172.44 -13129
## - alq111 1 0.0102 172.45 -13129
## - ridageyr 1 0.0277 172.47 -13128
## - log_dr1tsodi 1 0.0401 172.48 -13128
## <none> 172.44 -13127
## - smq020 1 0.0902 172.53 -13127
## - indfmpir 1 0.1082 172.55 -13127
## - log_dr1ttfat 1 0.1184 172.56 -13126
## - log_bmxbmi 1 0.1488 172.59 -13126
## - log_dr1tkcal 1 0.1495 172.59 -13126
## + vigorous_minutes 1 0.0046 172.43 -13125
## + sedentary_minutes 1 0.0005 172.44 -13125
## - riagendr 1 0.1700 172.61 -13125
## + dmdeduc2 6 0.3943 172.04 -13125
## - bpxsy1 1 0.3791 172.82 -13120
## - ridreth3 5 1.0589 173.50 -13112
## - log_lbxhscrp 1 1.1560 173.59 -13102
## - log_lbxglu 1 1.3711 173.81 -13096
## - bpxdi1 1 1.6285 174.07 -13090
## - log_lbxtr 1 9.8588 182.30 -12899
## - log_lbdhdd 1 13.2164 185.65 -12823
##
## Step: AIC=-13129.08
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## alq111 + bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal +
## log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
##
## Df Sum of Sq RSS AIC
## - alq111 1 0.0104 172.45 -13131
## - ridageyr 1 0.0287 172.47 -13130
## - log_dr1tsodi 1 0.0398 172.48 -13130
## <none> 172.44 -13129
## - smq020 1 0.0898 172.53 -13129
## - indfmpir 1 0.1089 172.55 -13128
## - log_dr1ttfat 1 0.1181 172.56 -13128
## - log_bmxbmi 1 0.1482 172.59 -13128
## - log_dr1tkcal 1 0.1493 172.59 -13128
## + moderate_minutes 1 0.0037 172.44 -13127
## + vigorous_minutes 1 0.0032 172.44 -13127
## + sedentary_minutes 1 0.0004 172.44 -13127
## - riagendr 1 0.1721 172.61 -13127
## + dmdeduc2 6 0.3955 172.05 -13127
## - bpxsy1 1 0.3808 172.82 -13122
## - ridreth3 5 1.0622 173.50 -13114
## - log_lbxhscrp 1 1.1551 173.60 -13103
## - log_lbxglu 1 1.3809 173.82 -13098
## - bpxdi1 1 1.6255 174.07 -13092
## - log_lbxtr 1 9.8564 182.30 -12901
## - log_lbdhdd 1 13.2504 185.69 -12824
##
## Step: AIC=-13130.83
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 +
## bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal +
## log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
##
## Df Sum of Sq RSS AIC
## - ridageyr 1 0.0292 172.48 -13132
## - log_dr1tsodi 1 0.0401 172.49 -13132
## - smq020 1 0.0815 172.53 -13131
## <none> 172.45 -13131
## - indfmpir 1 0.1050 172.56 -13130
## - log_dr1ttfat 1 0.1194 172.57 -13130
## - log_bmxbmi 1 0.1469 172.60 -13129
## - log_dr1tkcal 1 0.1491 172.60 -13129
## + alq111 1 0.0104 172.44 -13129
## + moderate_minutes 1 0.0039 172.45 -13129
## + vigorous_minutes 1 0.0031 172.45 -13129
## + sedentary_minutes 1 0.0005 172.45 -13129
## - riagendr 1 0.1772 172.63 -13129
## + dmdeduc2 6 0.3976 172.05 -13128
## - bpxsy1 1 0.3845 172.84 -13124
## - ridreth3 5 1.0917 173.54 -13115
## - log_lbxhscrp 1 1.1626 173.61 -13105
## - log_lbxglu 1 1.3877 173.84 -13100
## - bpxdi1 1 1.6160 174.07 -13094
## - log_lbxtr 1 9.8481 182.30 -12903
## - log_lbdhdd 1 13.2450 185.70 -12826
##
## Step: AIC=-13132.13
## log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 + bpxsy1 +
## bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1tsodi +
## log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
##
## Df Sum of Sq RSS AIC
## - log_dr1tsodi 1 0.0448 172.53 -13133
## <none> 172.48 -13132
## - smq020 1 0.0965 172.58 -13132
## - log_dr1ttfat 1 0.1156 172.60 -13131
## - indfmpir 1 0.1173 172.60 -13131
## + ridageyr 1 0.0292 172.45 -13131
## - log_dr1tkcal 1 0.1439 172.62 -13131
## - log_bmxbmi 1 0.1449 172.63 -13131
## + alq111 1 0.0109 172.47 -13130
## + moderate_minutes 1 0.0050 172.48 -13130
## + dmdeduc2 6 0.4181 172.06 -13130
## + vigorous_minutes 1 0.0020 172.48 -13130
## + sedentary_minutes 1 0.0004 172.48 -13130
## - riagendr 1 0.1703 172.65 -13130
## - bpxsy1 1 0.6265 173.11 -13119
## - ridreth3 5 1.0913 173.57 -13116
## - log_lbxhscrp 1 1.1836 173.66 -13106
## - log_lbxglu 1 1.3627 173.84 -13102
## - bpxdi1 1 1.5877 174.07 -13096
## - log_lbxtr 1 9.9258 182.41 -12902
## - log_lbdhdd 1 13.5534 186.03 -12821
##
## Step: AIC=-13133.06
## log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 + bpxsy1 +
## bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1ttfat +
## log_lbdhdd + log_lbxglu + log_bmxbmi
##
## Df Sum of Sq RSS AIC
## <none> 172.53 -13133
## - smq020 1 0.1010 172.63 -13133
## - log_dr1tkcal 1 0.1028 172.63 -13133
## - indfmpir 1 0.1139 172.64 -13132
## + log_dr1tsodi 1 0.0448 172.48 -13132
## + ridageyr 1 0.0339 172.49 -13132
## - log_bmxbmi 1 0.1367 172.66 -13132
## + alq111 1 0.0113 172.51 -13131
## + dmdeduc2 6 0.4237 172.10 -13131
## - log_dr1ttfat 1 0.1608 172.69 -13131
## + moderate_minutes 1 0.0046 172.52 -13131
## + vigorous_minutes 1 0.0016 172.52 -13131
## + sedentary_minutes 1 0.0004 172.53 -13131
## - riagendr 1 0.1793 172.71 -13131
## - bpxsy1 1 0.6367 173.16 -13120
## - ridreth3 5 1.0666 173.59 -13118
## - log_lbxhscrp 1 1.2063 173.73 -13106
## - log_lbxglu 1 1.3656 173.89 -13102
## - bpxdi1 1 1.5851 174.11 -13097
## - log_lbxtr 1 9.9074 182.43 -12904
## - log_lbdhdd 1 13.5512 186.08 -12822
summary(step_model_full_transformed_target)
##
## Call:
## lm(formula = log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 +
## bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal +
## log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi, data = train_data_transformed_target)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78399 -0.13374 -0.00143 0.13398 0.86598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7387534 0.1324538 28.227 < 2e-16 ***
## riagendrFemale 0.0151442 0.0073136 2.071 0.03845 *
## ridreth32 0.0137406 0.0134444 1.022 0.30683
## ridreth33 -0.0110760 0.0103169 -1.074 0.28308
## ridreth34 -0.0313803 0.0111297 -2.820 0.00483 **
## ridreth36 0.0197936 0.0124451 1.590 0.11181
## ridreth37 -0.0139308 0.0165009 -0.844 0.39858
## indfmpir 0.0036243 0.0021962 1.650 0.09897 .
## smq0202 -0.0106687 0.0068649 -1.554 0.12024
## bpxsy1 0.0007444 0.0001908 3.902 9.70e-05 ***
## bpxdi1 0.0016752 0.0002721 6.156 8.17e-10 ***
## log_lbxhscrp 0.0180565 0.0033622 5.370 8.29e-08 ***
## log_lbxtr 0.0965329 0.0062721 15.391 < 2e-16 ***
## log_dr1tkcal 0.0224983 0.0143528 1.568 0.11707
## log_dr1ttfat -0.0223556 0.0114010 -1.961 0.04996 *
## log_lbdhdd 0.2574592 0.0143033 18.000 < 2e-16 ***
## log_lbxglu -0.0822295 0.0143904 -5.714 1.18e-08 ***
## log_bmxbmi 0.0302211 0.0167136 1.808 0.07065 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2045 on 4125 degrees of freedom
## Multiple R-squared: 0.14, Adjusted R-squared: 0.1364
## F-statistic: 39.5 on 17 and 4125 DF, p-value: < 2.2e-16
Good for multicollinearity - adds a penalty (L2 regularization)
# Model 6
set.seed(123)
ridge_model <- train(log_lbxtc ~ ., data = train_data_transformed_target,
method = "glmnet",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(alpha = 0, lambda = seq(0, 1, 0.1)))
summary(ridge_model)
## Length Class Mode
## a0 100 -none- numeric
## beta 2900 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
## lambdaOpt 1 -none- numeric
## xNames 29 -none- character
## problemType 1 -none- character
## tuneValue 2 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
testX <- test_data_transformed_target %>%
dplyr::select(-log_lbxtc)
testY <- test_data_transformed_target$log_lbxtc
ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_resample
## RMSE Rsquared MAE
## 0.2054136 0.1103974 0.1596422
# Model 7
set.seed(123)
# Elastic net
testX <- test_data_transformed_target %>%
dplyr::select(-log_lbxtc)
testY <- test_data_transformed_target$log_lbxtc
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enet_model <- train(log_lbxtc ~ ., data = train_data_transformed_target, method = "enet",
tuneGrid = enetGrid,
trControl = trainControl(method = "cv", number = 10),
preProcess = c("center", "scale"))
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_resample
## RMSE Rsquared MAE
## 0.2062352 0.1030591 0.1606479
# Model 8
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_model_full <- rlm(lbxtc ~ ., data = train_data_original, psi = psi.huber)
summary(robust_model_full)
##
## Call: rlm(formula = lbxtc ~ ., data = train_data_original, psi = psi.huber)
## Residuals:
## Min 1Q Median 3Q Max
## -249.339 -24.778 -1.536 24.560 250.206
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 93.9388 6.5811 14.2740
## ridageyr 0.0931 0.0392 2.3770
## riagendrFemale 3.8709 1.3546 2.8576
## ridreth32 2.4554 2.5001 0.9821
## ridreth33 -3.0352 2.0235 -1.4999
## ridreth34 -6.3734 2.1471 -2.9683
## ridreth36 1.5197 2.4317 0.6250
## ridreth37 -1.2667 3.1176 -0.4063
## dmdeduc22 -0.9823 2.8286 -0.3473
## dmdeduc23 -0.7882 2.5741 -0.3062
## dmdeduc24 -2.2230 2.5274 -0.8796
## dmdeduc25 1.3845 2.7593 0.5018
## dmdeduc27 59.5691 26.9946 2.2067
## dmdeduc29 -6.3412 13.6374 -0.4650
## indfmpir 0.3149 0.4447 0.7080
## smq0202 -2.5989 1.3119 -1.9810
## alq1112 0.5368 2.0787 0.2582
## vigorous_minutes 0.0002 0.0018 0.1261
## moderate_minutes -0.0003 0.0011 -0.2554
## sedentary_minutes 0.0004 0.0009 0.4564
## dr1tkcal 0.0013 0.0015 0.8648
## dr1ttfat -0.0289 0.0282 -1.0282
## dr1tsodi -0.0003 0.0005 -0.5100
## bmxbmi 0.3318 0.0908 3.6527
## bpxsy1 0.1089 0.0401 2.7127
## bpxdi1 0.4065 0.0511 7.9548
## lbxglu -0.0769 0.0169 -4.5432
## lbxhscrp -0.0558 0.0805 -0.6935
## lbdhdd 0.6806 0.0444 15.3166
## lbxtr 0.0765 0.0054 14.2568
##
## Residual standard error: 36.66 on 4113 degrees of freedom
# Model 9
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_model_full_transformed_target <- rlm(log_lbxtc ~ ., data = train_data_transformed_target, psi = psi.huber)
summary(robust_model_full_transformed_target)
##
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target,
## psi = psi.huber)
## Residuals:
## Min 1Q Median 3Q Max
## -0.80890 -0.13423 -0.00364 0.13459 0.85464
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 3.7498 0.1343 27.9279
## ridageyr 0.0003 0.0002 1.4544
## riagendrFemale 0.0114 0.0073 1.5641
## ridreth32 0.0136 0.0134 1.0170
## ridreth33 -0.0116 0.0109 -1.0668
## ridreth34 -0.0272 0.0115 -2.3562
## ridreth36 0.0196 0.0131 1.4938
## ridreth37 -0.0045 0.0167 -0.2677
## dmdeduc22 -0.0124 0.0152 -0.8175
## dmdeduc23 -0.0109 0.0138 -0.7924
## dmdeduc24 -0.0186 0.0136 -1.3708
## dmdeduc25 0.0008 0.0148 0.0542
## dmdeduc27 0.2756 0.1449 1.9023
## dmdeduc29 -0.0331 0.0732 -0.4526
## indfmpir 0.0029 0.0024 1.1974
## smq0202 -0.0136 0.0070 -1.9270
## alq1112 0.0007 0.0112 0.0627
## vigorous_minutes 0.0000 0.0000 0.4452
## moderate_minutes 0.0000 0.0000 -0.4649
## sedentary_minutes 0.0000 0.0000 0.2177
## bpxsy1 0.0005 0.0002 2.4705
## bpxdi1 0.0019 0.0003 6.9050
## log_lbxhscrp 0.0160 0.0034 4.7700
## log_lbxtr 0.1012 0.0063 16.1821
## log_dr1tkcal 0.0280 0.0157 1.7821
## log_dr1tsodi -0.0112 0.0104 -1.0836
## log_dr1ttfat -0.0190 0.0117 -1.6273
## log_lbdhdd 0.2615 0.0144 18.1882
## log_lbxglu -0.0926 0.0144 -6.4237
## log_bmxbmi 0.0439 0.0167 2.6306
##
## Residual standard error: 0.1993 on 4113 degrees of freedom
# Model 10
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_bisquare_model_full_transformed_target <- rlm(log_lbxtc ~ ., data = train_data_transformed_target, psi = psi.bisquare)
summary(robust_bisquare_model_full_transformed_target)
##
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target,
## psi = psi.bisquare)
## Residuals:
## Min 1Q Median 3Q Max
## -0.810378 -0.134024 -0.003544 0.134312 0.855685
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 3.7382 0.1352 27.6441
## ridageyr 0.0003 0.0002 1.5380
## riagendrFemale 0.0112 0.0074 1.5156
## ridreth32 0.0121 0.0135 0.8933
## ridreth33 -0.0126 0.0109 -1.1558
## ridreth34 -0.0290 0.0116 -2.4912
## ridreth36 0.0181 0.0132 1.3685
## ridreth37 -0.0038 0.0168 -0.2252
## dmdeduc22 -0.0115 0.0153 -0.7500
## dmdeduc23 -0.0094 0.0139 -0.6776
## dmdeduc24 -0.0172 0.0137 -1.2556
## dmdeduc25 0.0024 0.0149 0.1585
## dmdeduc27 0.2773 0.1459 1.9006
## dmdeduc29 -0.0297 0.0737 -0.4026
## indfmpir 0.0025 0.0024 1.0586
## smq0202 -0.0134 0.0071 -1.8892
## alq1112 0.0016 0.0112 0.1413
## vigorous_minutes 0.0000 0.0000 0.4244
## moderate_minutes 0.0000 0.0000 -0.4410
## sedentary_minutes 0.0000 0.0000 0.2513
## bpxsy1 0.0005 0.0002 2.4135
## bpxdi1 0.0019 0.0003 6.9122
## log_lbxhscrp 0.0159 0.0034 4.7084
## log_lbxtr 0.1016 0.0063 16.1453
## log_dr1tkcal 0.0290 0.0158 1.8305
## log_dr1tsodi -0.0120 0.0105 -1.1503
## log_dr1ttfat -0.0195 0.0118 -1.6588
## log_lbdhdd 0.2631 0.0145 18.1656
## log_lbxglu -0.0925 0.0145 -6.3756
## log_bmxbmi 0.0450 0.0168 2.6791
##
## Residual standard error: 0.1989 on 4113 degrees of freedom
Age Categories Revealed Non-Linear Pattern as Cholesterol peaks in middle age, then declines in elderly . BMI Categories are more informative than Log BMI. High Sodium Flag is Significant as HIGH_SODIUM (>2300 mg/day) → -1.93% cholesterol (p = 0.009) and hence the effect is in unexpected direction (negative).
Physical activity and high-fat diet does not have detectable direct effect on cholesterol.
# Model 12
set.seed(123)
model_transformed_newfeatures <- lm(log_lbxtc ~ ., data =
train_data_transformed_newfeatures)
summary(model_transformed_newfeatures)
##
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_newfeatures)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84349 -0.13022 0.00021 0.13048 0.92606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.939e+00 9.713e-02 40.557 < 2e-16 ***
## ridageyr 2.610e-04 3.910e-04 0.668 0.50447
## riagendrFemale 9.479e-03 7.208e-03 1.315 0.18854
## ridreth32 1.294e-02 1.325e-02 0.977 0.32849
## ridreth33 3.639e-03 1.079e-02 0.337 0.73587
## ridreth34 -2.564e-02 1.141e-02 -2.247 0.02472 *
## ridreth36 2.124e-02 1.300e-02 1.634 0.10242
## ridreth37 -5.423e-03 1.654e-02 -0.328 0.74299
## dmdeduc22 -1.153e-02 1.501e-02 -0.768 0.44268
## dmdeduc23 -8.491e-03 1.368e-02 -0.621 0.53493
## dmdeduc24 -1.219e-02 1.342e-02 -0.908 0.36384
## dmdeduc25 -3.099e-04 1.462e-02 -0.021 0.98309
## dmdeduc27 2.315e-01 1.430e-01 1.619 0.10556
## dmdeduc29 -1.392e-02 7.229e-02 -0.193 0.84730
## indfmpir 4.088e-04 2.367e-03 0.173 0.86288
## smq0202 -6.101e-03 6.986e-03 -0.873 0.38257
## alq1112 8.724e-03 1.104e-02 0.790 0.42945
## sedentary_minutes 1.529e-06 4.523e-06 0.338 0.73543
## TOTAL_ACTIVITY -1.197e-06 4.763e-06 -0.251 0.80162
## ACTIVE_FLAG 8.684e-03 7.269e-03 1.195 0.23230
## HIGH_FAT_DIET -6.568e-03 6.590e-03 -0.997 0.31892
## HIGH_SODIUM -2.001e-02 7.416e-03 -2.699 0.00699 **
## AGE_YOUNG 1.239e-02 2.112e-02 0.586 0.55763
## AGE_MID 8.465e-02 1.216e-02 6.962 3.90e-12 ***
## BMI_UNDER -7.372e-02 2.690e-02 -2.741 0.00616 **
## BMI_NORMAL -2.728e-02 9.366e-03 -2.912 0.00361 **
## BMI_OVER 1.256e-02 7.745e-03 1.621 0.10500
## bpxsy1 1.006e-03 2.150e-04 4.680 2.96e-06 ***
## bpxdi1 7.897e-04 2.844e-04 2.777 0.00552 **
## log_lbxhscrp 1.460e-02 3.227e-03 4.524 6.23e-06 ***
## log_lbxtr 9.078e-02 6.181e-03 14.686 < 2e-16 ***
## log_lbdhdd 2.580e-01 1.421e-02 18.160 < 2e-16 ***
## log_lbxglu -8.354e-02 1.418e-02 -5.889 4.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2005 on 4110 degrees of freedom
## Multiple R-squared: 0.1762, Adjusted R-squared: 0.1698
## F-statistic: 27.47 on 32 and 4110 DF, p-value: < 2.2e-16
We know we have some low correlated features, so let’s address this potential issue by looking at the VIF for the models and removing predictors.
VIF for LM1: Basic Model had dr1tkcal and dr1ttfat whose VIF > 5.Hence these predictos should be removed.
For the logistic regression models, luckily no predictor has a VIF > 5 (for Df > 1 (categorical), must look at GVIF^(1/(2*Df)) value), so we don’t need to remove any predictors.
# Multiple linear models
cat("=== VIF for LM1: Basic Model ===\n")
## === VIF for LM1: Basic Model ===
print(car::vif(model_full))
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.521880 1 1.233645
## riagendr 1.323334 1 1.150363
## ridreth3 1.644105 5 1.050976
## dmdeduc2 1.742917 6 1.047385
## indfmpir 1.298192 1 1.139382
## smq020 1.199062 1 1.095017
## alq111 1.103425 1 1.050440
## vigorous_minutes 1.049989 1 1.024690
## moderate_minutes 1.045875 1 1.022680
## sedentary_minutes 1.022078 1 1.010979
## dr1tkcal 6.267934 1 2.503584
## dr1ttfat 5.378923 1 2.319251
## dr1tsodi 2.854178 1 1.689431
## bmxbmi 1.304856 1 1.142303
## bpxsy1 1.591836 1 1.261680
## bpxdi1 1.232550 1 1.110203
## lbxglu 1.156394 1 1.075357
## lbxhscrp 1.110765 1 1.053928
## lbdhdd 1.374965 1 1.172589
## lbxtr 1.173039 1 1.083069
cat("=== VIF for LM2: Transformed predictors + original target variable ===\n")
## === VIF for LM2: Transformed predictors + original target variable ===
print(car::vif(model_full_transformed))
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.531744 1 1.237636
## riagendr 1.340638 1 1.157859
## ridreth3 1.674581 5 1.052908
## dmdeduc2 1.752564 6 1.047867
## indfmpir 1.300326 1 1.140318
## smq020 1.201889 1 1.096307
## alq111 1.104838 1 1.051113
## vigorous_minutes 1.050999 1 1.025182
## moderate_minutes 1.046303 1 1.022889
## sedentary_minutes 1.021970 1 1.010925
## bpxsy1 1.596436 1 1.263502
## bpxdi1 1.244769 1 1.115692
## log_lbxhscrp 1.389545 1 1.178790
## log_lbxtr 1.443205 1 1.201335
## log_dr1tkcal 5.771872 1 2.402472
## log_dr1tsodi 3.265137 1 1.806969
## log_dr1ttfat 4.965713 1 2.228388
## log_lbdhdd 1.614277 1 1.270542
## log_lbxglu 1.217585 1 1.103442
## log_bmxbmi 1.510988 1 1.229222
cat("=== VIF for LM3: Transformed predictors + transformed target variable ===\n")
## === VIF for LM3: Transformed predictors + transformed target variable ===
print(car::vif(model_full_transformed_target))
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.531744 1 1.237636
## riagendr 1.340638 1 1.157859
## ridreth3 1.674581 5 1.052908
## dmdeduc2 1.752564 6 1.047867
## indfmpir 1.300326 1 1.140318
## smq020 1.201889 1 1.096307
## alq111 1.104838 1 1.051113
## vigorous_minutes 1.050999 1 1.025182
## moderate_minutes 1.046303 1 1.022889
## sedentary_minutes 1.021970 1 1.010925
## bpxsy1 1.596436 1 1.263502
## bpxdi1 1.244769 1 1.115692
## log_lbxhscrp 1.389545 1 1.178790
## log_lbxtr 1.443205 1 1.201335
## log_dr1tkcal 5.771872 1 2.402472
## log_dr1tsodi 3.265137 1 1.806969
## log_dr1ttfat 4.965713 1 2.228388
## log_lbdhdd 1.614277 1 1.270542
## log_lbxglu 1.217585 1 1.103442
## log_bmxbmi 1.510988 1 1.229222
cat("=== VIF for LM4: Stepwise on Model 1 ===\n")
## === VIF for LM4: Stepwise on Model 1 ===
print(car::vif(step_model_full))
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.403331 1 1.184623
## riagendr 1.196522 1 1.093856
## ridreth3 1.199283 5 1.018339
## smq020 1.123607 1 1.060003
## bmxbmi 1.217974 1 1.103619
## bpxsy1 1.561195 1 1.249478
## bpxdi1 1.210080 1 1.100036
## lbxglu 1.141326 1 1.068329
## lbdhdd 1.335759 1 1.155750
## lbxtr 1.162166 1 1.078038
cat("=== VIF for LM5: Stepwise on Model 3 ===\n")
## === VIF for LM5: Stepwise on Model 3 ===
print(car::vif(step_model_full_transformed_target))
## GVIF Df GVIF^(1/(2*Df))
## riagendr 1.322698 1 1.150086
## ridreth3 1.280753 5 1.025053
## indfmpir 1.085509 1 1.041878
## smq020 1.125802 1 1.061038
## bpxsy1 1.233886 1 1.110804
## bpxdi1 1.198605 1 1.094808
## log_lbxhscrp 1.376780 1 1.173363
## log_lbxtr 1.434107 1 1.197542
## log_dr1tkcal 4.757132 1 2.181085
## log_dr1ttfat 4.659189 1 2.158515
## log_lbdhdd 1.576608 1 1.255631
## log_lbxglu 1.198388 1 1.094709
## log_bmxbmi 1.497774 1 1.223836
cat("=== VIF for LM8: Robust: Original ===\n")
## === VIF for LM8: Robust: Original ===
print(car::vif(robust_model_full))
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.521880 1 1.233645
## riagendr 1.323334 1 1.150363
## ridreth3 1.644105 5 1.050976
## dmdeduc2 1.742917 6 1.047385
## indfmpir 1.298192 1 1.139382
## smq020 1.199062 1 1.095017
## alq111 1.103425 1 1.050440
## vigorous_minutes 1.049989 1 1.024690
## moderate_minutes 1.045875 1 1.022680
## sedentary_minutes 1.022078 1 1.010979
## dr1tkcal 6.267934 1 2.503584
## dr1ttfat 5.378923 1 2.319251
## dr1tsodi 2.854178 1 1.689431
## bmxbmi 1.304856 1 1.142303
## bpxsy1 1.591836 1 1.261680
## bpxdi1 1.232550 1 1.110203
## lbxglu 1.156394 1 1.075357
## lbxhscrp 1.110765 1 1.053928
## lbdhdd 1.374965 1 1.172589
## lbxtr 1.173039 1 1.083069
cat("=== VIF for LM9: Robust: Transformed ===\n")
## === VIF for LM9: Robust: Transformed ===
print(car::vif(robust_model_full_transformed_target))
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.531744 1 1.237636
## riagendr 1.340638 1 1.157859
## ridreth3 1.674581 5 1.052908
## dmdeduc2 1.752564 6 1.047867
## indfmpir 1.300326 1 1.140318
## smq020 1.201889 1 1.096307
## alq111 1.104838 1 1.051113
## vigorous_minutes 1.050999 1 1.025182
## moderate_minutes 1.046303 1 1.022889
## sedentary_minutes 1.021970 1 1.010925
## bpxsy1 1.596436 1 1.263502
## bpxdi1 1.244769 1 1.115692
## log_lbxhscrp 1.389545 1 1.178790
## log_lbxtr 1.443205 1 1.201335
## log_dr1tkcal 5.771872 1 2.402472
## log_dr1tsodi 3.265137 1 1.806969
## log_dr1ttfat 4.965713 1 2.228388
## log_lbdhdd 1.614277 1 1.270542
## log_lbxglu 1.217585 1 1.103442
## log_bmxbmi 1.510988 1 1.229222
cat("=== VIF for LM10: Robust: Transformed: Bisquare ===\n")
## === VIF for LM10: Robust: Transformed: Bisquare ===
print(car::vif(robust_bisquare_model_full_transformed_target))
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.531744 1 1.237636
## riagendr 1.340638 1 1.157859
## ridreth3 1.674581 5 1.052908
## dmdeduc2 1.752564 6 1.047867
## indfmpir 1.300326 1 1.140318
## smq020 1.201889 1 1.096307
## alq111 1.104838 1 1.051113
## vigorous_minutes 1.050999 1 1.025182
## moderate_minutes 1.046303 1 1.022889
## sedentary_minutes 1.021970 1 1.010925
## bpxsy1 1.596436 1 1.263502
## bpxdi1 1.244769 1 1.115692
## log_lbxhscrp 1.389545 1 1.178790
## log_lbxtr 1.443205 1 1.201335
## log_dr1tkcal 5.771872 1 2.402472
## log_dr1tsodi 3.265137 1 1.806969
## log_dr1ttfat 4.965713 1 2.228388
## log_lbdhdd 1.614277 1 1.270542
## log_lbxglu 1.217585 1 1.103442
## log_bmxbmi 1.510988 1 1.229222
cat("=== VIF for LM12: Transformed predictors and target + new features ===\n")
## === VIF for LM12: Transformed predictors and target + new features ===
print(car::vif(model_transformed_newfeatures))
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 5.409444 1 2.325821
## riagendr 1.336220 1 1.155950
## ridreth3 1.724996 5 1.056036
## dmdeduc2 1.789852 6 1.049707
## indfmpir 1.311627 1 1.145263
## smq020 1.212757 1 1.101253
## alq111 1.110053 1 1.053590
## sedentary_minutes 1.030810 1 1.015288
## TOTAL_ACTIVITY 1.131137 1 1.063549
## ACTIVE_FLAG 1.277647 1 1.130331
## HIGH_FAT_DIET 1.077105 1 1.037837
## HIGH_SODIUM 1.110074 1 1.053600
## AGE_YOUNG 6.836026 1 2.614580
## AGE_MID 3.713059 1 1.926930
## BMI_UNDER 1.081259 1 1.039836
## BMI_NORMAL 1.667455 1 1.291300
## BMI_OVER 1.378407 1 1.174056
## bpxsy1 1.629344 1 1.276458
## bpxdi1 1.361794 1 1.166959
## log_lbxhscrp 1.318844 1 1.148410
## log_lbxtr 1.448880 1 1.203694
## log_lbdhdd 1.617624 1 1.271859
## log_lbxglu 1.211128 1 1.100513
# Model 1 (original predictors + original target variable lbxtc)
# Remove dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_v2 <- lm(lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal,dr1ttfat)))
summary(model_full_v2)
##
## Call:
## lm(formula = lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal,
## dr1ttfat)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -205.782 -26.737 -3.807 22.935 245.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.989e+01 6.707e+00 14.893 < 2e-16 ***
## ridageyr 6.249e-02 4.040e-02 1.547 0.121953
## riagendrFemale 4.786e+00 1.386e+00 3.453 0.000560 ***
## ridreth32 2.132e+00 2.583e+00 0.825 0.409304
## ridreth33 -2.441e+00 2.088e+00 -1.169 0.242457
## ridreth34 -6.643e+00 2.216e+00 -2.998 0.002737 **
## ridreth36 1.590e+00 2.503e+00 0.635 0.525434
## ridreth37 -2.773e+00 3.222e+00 -0.861 0.389422
## dmdeduc22 -2.895e+00 2.921e+00 -0.991 0.321690
## dmdeduc23 -2.048e+00 2.657e+00 -0.771 0.440889
## dmdeduc24 -3.519e+00 2.608e+00 -1.349 0.177390
## dmdeduc25 -2.827e-01 2.849e+00 -0.099 0.920965
## dmdeduc27 5.267e+01 2.790e+01 1.888 0.059112 .
## dmdeduc29 -6.016e+00 1.409e+01 -0.427 0.669497
## indfmpir 4.117e-01 4.591e-01 0.897 0.369903
## smq0202 -3.061e+00 1.356e+00 -2.258 0.023977 *
## alq1112 1.015e+00 2.148e+00 0.473 0.636504
## vigorous_minutes 6.989e-05 1.868e-03 0.037 0.970155
## moderate_minutes -3.135e-04 1.122e-03 -0.280 0.779859
## sedentary_minutes 8.496e-05 8.790e-04 0.097 0.923008
## dr1tsodi -2.817e-04 3.438e-04 -0.819 0.412633
## bmxbmi 2.524e-01 9.366e-02 2.694 0.007078 **
## bpxsy1 1.565e-01 4.146e-02 3.775 0.000163 ***
## bpxdi1 3.559e-01 5.280e-02 6.741 1.79e-11 ***
## lbxglu -5.873e-02 1.747e-02 -3.361 0.000783 ***
## lbxhscrp -1.686e-02 8.317e-02 -0.203 0.839347
## lbdhdd 6.500e-01 4.592e-02 14.157 < 2e-16 ***
## lbxtr 5.943e-02 5.540e-03 10.727 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.14 on 4115 degrees of freedom
## Multiple R-squared: 0.1079, Adjusted R-squared: 0.102
## F-statistic: 18.43 on 27 and 4115 DF, p-value: < 2.2e-16
car::vif(model_full_v2) # looks good
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.515673 1 1.231127
## riagendr 1.296892 1 1.138812
## ridreth3 1.596730 5 1.047908
## dmdeduc2 1.732063 6 1.046840
## indfmpir 1.294865 1 1.137921
## smq020 1.198310 1 1.094673
## alq111 1.103324 1 1.050392
## vigorous_minutes 1.049948 1 1.024670
## moderate_minutes 1.045805 1 1.022646
## sedentary_minutes 1.021712 1 1.010798
## dr1tsodi 1.110621 1 1.053860
## bmxbmi 1.298835 1 1.139665
## bpxsy1 1.590519 1 1.261158
## bpxdi1 1.231716 1 1.109827
## lbxglu 1.154277 1 1.074373
## lbxhscrp 1.110360 1 1.053736
## lbdhdd 1.373973 1 1.172166
## lbxtr 1.170124 1 1.081723
# Model 2 (transformed predictors + original target variable lbxtc)
# Remove log_dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_transformed_v2 <- lm(lbxtc ~ ., data = train_data_transformed %>% dplyr::select(-c(log_dr1tkcal)))
summary(model_full_transformed_v2)
##
## Call:
## lm(formula = lbxtc ~ ., data = train_data_transformed %>% dplyr::select(-c(log_dr1tkcal)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -109.70 -26.43 -3.92 22.77 240.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.642e+01 2.300e+01 -2.019 0.043596 *
## ridageyr 3.173e-02 3.996e-02 0.794 0.427180
## riagendrFemale 2.833e+00 1.379e+00 2.055 0.039917 *
## ridreth32 2.444e+00 2.546e+00 0.960 0.337078
## ridreth33 -1.638e+00 2.059e+00 -0.796 0.426344
## ridreth34 -4.824e+00 2.188e+00 -2.204 0.027560 *
## ridreth36 3.186e+00 2.485e+00 1.282 0.199955
## ridreth37 -1.959e+00 3.173e+00 -0.617 0.536996
## dmdeduc22 -4.018e+00 2.883e+00 -1.394 0.163527
## dmdeduc23 -2.452e+00 2.621e+00 -0.935 0.349722
## dmdeduc24 -4.068e+00 2.575e+00 -1.580 0.114252
## dmdeduc25 -9.185e-01 2.811e+00 -0.327 0.743923
## dmdeduc27 5.486e+01 2.749e+01 1.996 0.046010 *
## dmdeduc29 -7.750e+00 1.389e+01 -0.558 0.576885
## indfmpir 5.811e-01 4.529e-01 1.283 0.199543
## smq0202 -2.501e+00 1.337e+00 -1.870 0.061538 .
## alq1112 7.900e-01 2.118e+00 0.373 0.709153
## vigorous_minutes 6.152e-04 1.841e-03 0.334 0.738280
## moderate_minutes -4.563e-04 1.105e-03 -0.413 0.679729
## sedentary_minutes -3.619e-05 8.659e-04 -0.042 0.966662
## bpxsy1 1.495e-01 4.091e-02 3.655 0.000261 ***
## bpxdi1 2.872e-01 5.227e-02 5.494 4.18e-08 ***
## log_lbxhscrp 3.230e+00 6.368e-01 5.072 4.11e-07 ***
## log_lbxtr 1.835e+01 1.185e+00 15.485 < 2e-16 ***
## log_dr1tsodi -7.707e-01 1.797e+00 -0.429 0.668013
## log_dr1ttfat -4.083e-01 1.627e+00 -0.251 0.801853
## log_lbdhdd 4.529e+01 2.728e+00 16.601 < 2e-16 ***
## log_lbxglu -1.527e+01 2.731e+00 -5.592 2.39e-08 ***
## log_bmxbmi 3.214e+00 3.159e+00 1.017 0.309091
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.56 on 4114 degrees of freedom
## Multiple R-squared: 0.1346, Adjusted R-squared: 0.1287
## F-statistic: 22.84 on 28 and 4114 DF, p-value: < 2.2e-16
car::vif(model_full_transformed_v2) # looks good
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.528157 1 1.236186
## riagendr 1.322256 1 1.149894
## ridreth3 1.663460 5 1.052207
## dmdeduc2 1.746420 6 1.047560
## indfmpir 1.298539 1 1.139535
## smq020 1.201694 1 1.096218
## alq111 1.104830 1 1.051109
## vigorous_minutes 1.050970 1 1.025168
## moderate_minutes 1.046275 1 1.022876
## sedentary_minutes 1.021901 1 1.010891
## bpxsy1 1.596263 1 1.263433
## bpxdi1 1.244341 1 1.115500
## log_lbxhscrp 1.389502 1 1.178771
## log_lbxtr 1.440165 1 1.200069
## log_dr1tsodi 2.716875 1 1.648295
## log_dr1ttfat 2.668648 1 1.633600
## log_lbdhdd 1.613914 1 1.270399
## log_lbxglu 1.214539 1 1.102061
## log_bmxbmi 1.505382 1 1.226940
# Model 3 (transformed predictors + transformed target variable log_lbxtc)
# Remove log_dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_transformed_target_v2 <- lm(log_lbxtc ~ ., data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(model_full_transformed_target_v2)
##
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>%
## dplyr::select(-c(log_dr1tkcal)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.80059 -0.13504 -0.00158 0.13451 0.85592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.883e+00 1.220e-01 31.819 < 2e-16 ***
## ridageyr 7.997e-05 2.120e-04 0.377 0.70604
## riagendrFemale 1.371e-02 7.314e-03 1.874 0.06099 .
## ridreth32 1.391e-02 1.351e-02 1.030 0.30307
## ridreth33 -1.014e-02 1.093e-02 -0.928 0.35341
## ridreth34 -2.963e-02 1.161e-02 -2.552 0.01073 *
## ridreth36 1.634e-02 1.319e-02 1.239 0.21539
## ridreth37 -1.156e-02 1.683e-02 -0.687 0.49232
## dmdeduc22 -1.899e-02 1.530e-02 -1.241 0.21459
## dmdeduc23 -1.441e-02 1.391e-02 -1.036 0.30015
## dmdeduc24 -2.266e-02 1.366e-02 -1.658 0.09730 .
## dmdeduc25 -3.083e-03 1.492e-02 -0.207 0.83629
## dmdeduc27 2.612e-01 1.458e-01 1.791 0.07336 .
## dmdeduc29 -3.770e-02 7.369e-02 -0.512 0.60897
## indfmpir 2.705e-03 2.403e-03 1.126 0.26028
## smq0202 -1.151e-02 7.095e-03 -1.622 0.10484
## alq1112 4.929e-03 1.124e-02 0.439 0.66091
## vigorous_minutes 3.982e-06 9.767e-06 0.408 0.68356
## moderate_minutes -1.787e-06 5.863e-06 -0.305 0.76051
## sedentary_minutes -2.549e-07 4.594e-06 -0.055 0.95576
## bpxsy1 6.573e-04 2.171e-04 3.028 0.00248 **
## bpxdi1 1.751e-03 2.773e-04 6.312 3.05e-10 ***
## log_lbxhscrp 1.813e-02 3.379e-03 5.365 8.53e-08 ***
## log_lbxtr 9.722e-02 6.287e-03 15.464 < 2e-16 ***
## log_dr1tsodi -2.383e-03 9.533e-03 -0.250 0.80266
## log_dr1ttfat -4.168e-03 8.631e-03 -0.483 0.62916
## log_lbdhdd 2.560e-01 1.448e-02 17.683 < 2e-16 ***
## log_lbxglu -8.396e-02 1.449e-02 -5.794 7.40e-09 ***
## log_bmxbmi 2.899e-02 1.676e-02 1.730 0.08378 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2046 on 4114 degrees of freedom
## Multiple R-squared: 0.1418, Adjusted R-squared: 0.1359
## F-statistic: 24.27 on 28 and 4114 DF, p-value: < 2.2e-16
car::vif(model_full_transformed_target_v2) # looks good
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.528157 1 1.236186
## riagendr 1.322256 1 1.149894
## ridreth3 1.663460 5 1.052207
## dmdeduc2 1.746420 6 1.047560
## indfmpir 1.298539 1 1.139535
## smq020 1.201694 1 1.096218
## alq111 1.104830 1 1.051109
## vigorous_minutes 1.050970 1 1.025168
## moderate_minutes 1.046275 1 1.022876
## sedentary_minutes 1.021901 1 1.010891
## bpxsy1 1.596263 1 1.263433
## bpxdi1 1.244341 1 1.115500
## log_lbxhscrp 1.389502 1 1.178771
## log_lbxtr 1.440165 1 1.200069
## log_dr1tsodi 2.716875 1 1.648295
## log_dr1ttfat 2.668648 1 1.633600
## log_lbdhdd 1.613914 1 1.270399
## log_lbxglu 1.214539 1 1.102061
## log_bmxbmi 1.505382 1 1.226940
# Model 8 robust regression - original
# Remove dr1tkcal which has highest VIF (above 5)
set.seed(123)
robust_model_full_v2 <- rlm(lbxtc ~ ., psi = psi.huber, data = train_data_original %>% dplyr::select(-c(dr1tkcal)))
summary(robust_model_full_v2)
##
## Call: rlm(formula = lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal)),
## psi = psi.huber)
## Residuals:
## Min 1Q Median 3Q Max
## -249.964 -24.745 -1.669 24.520 249.598
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 94.7462 6.4933 14.5915
## ridageyr 0.0911 0.0391 2.3301
## riagendrFemale 3.7152 1.3442 2.7639
## ridreth32 2.5257 2.4997 1.0104
## ridreth33 -3.0861 2.0228 -1.5256
## ridreth34 -6.4170 2.1463 -2.9897
## ridreth36 1.4823 2.4314 0.6097
## ridreth37 -1.2720 3.1180 -0.4079
## dmdeduc22 -1.0018 2.8288 -0.3541
## dmdeduc23 -0.8358 2.5734 -0.3248
## dmdeduc24 -2.3021 2.5263 -0.9112
## dmdeduc25 1.3924 2.7598 0.5045
## dmdeduc27 59.5554 26.9989 2.2058
## dmdeduc29 -6.2907 13.6396 -0.4612
## indfmpir 0.2940 0.4443 0.6617
## smq0202 -2.6139 1.3119 -1.9925
## alq1112 0.5186 2.0790 0.2495
## vigorous_minutes 0.0002 0.0018 0.1301
## moderate_minutes -0.0003 0.0011 -0.2469
## sedentary_minutes 0.0004 0.0009 0.4543
## dr1ttfat -0.0106 0.0189 -0.5604
## dr1tsodi -0.0001 0.0005 -0.2006
## bmxbmi 0.3272 0.0907 3.6096
## bpxsy1 0.1098 0.0401 2.7371
## bpxdi1 0.4078 0.0511 7.9811
## lbxglu -0.0774 0.0169 -4.5796
## lbxhscrp -0.0561 0.0805 -0.6966
## lbdhdd 0.6811 0.0444 15.3239
## lbxtr 0.0768 0.0054 14.3251
##
## Residual standard error: 36.63 on 4114 degrees of freedom
car::vif(robust_model_full_v2) # looks good
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.515794 1 1.231176
## riagendr 1.302600 1 1.141315
## ridreth3 1.637276 5 1.050539
## dmdeduc2 1.736752 6 1.047076
## indfmpir 1.295408 1 1.138160
## smq020 1.198662 1 1.094834
## alq111 1.103360 1 1.050409
## vigorous_minutes 1.049948 1 1.024670
## moderate_minutes 1.045806 1 1.022647
## sedentary_minutes 1.021872 1 1.010877
## dr1ttfat 2.424637 1 1.557125
## dr1tsodi 2.438760 1 1.561653
## bmxbmi 1.299592 1 1.139996
## bpxsy1 1.591051 1 1.261369
## bpxdi1 1.231797 1 1.109863
## lbxglu 1.154521 1 1.074487
## lbxhscrp 1.110704 1 1.053900
## lbdhdd 1.374888 1 1.172556
## lbxtr 1.170204 1 1.081760
set.seed(123)
# Model 9 robust regression - transformed
# Remove log_dr1tkcal which has highest VIF (above 5)
robust_model_full_transformed_target_v2 <- rlm(log_lbxtc ~ ., psi = psi.huber, data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(robust_model_full_transformed_target_v2)
##
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>%
## dplyr::select(-c(log_dr1tkcal)), psi = psi.huber)
## Residuals:
## Min 1Q Median 3Q Max
## -0.814008 -0.133808 -0.002436 0.133927 0.854640
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 3.8504 0.1209 31.8572
## ridageyr 0.0003 0.0002 1.3796
## riagendrFemale 0.0099 0.0072 1.3654
## ridreth32 0.0143 0.0134 1.0714
## ridreth33 -0.0126 0.0108 -1.1667
## ridreth34 -0.0280 0.0115 -2.4370
## ridreth36 0.0189 0.0131 1.4450
## ridreth37 -0.0053 0.0167 -0.3183
## dmdeduc22 -0.0128 0.0152 -0.8445
## dmdeduc23 -0.0111 0.0138 -0.8065
## dmdeduc24 -0.0192 0.0135 -1.4188
## dmdeduc25 0.0013 0.0148 0.0853
## dmdeduc27 0.2753 0.1445 1.9061
## dmdeduc29 -0.0330 0.0730 -0.4522
## indfmpir 0.0026 0.0024 1.1095
## smq0202 -0.0137 0.0070 -1.9459
## alq1112 0.0007 0.0111 0.0665
## vigorous_minutes 0.0000 0.0000 0.4513
## moderate_minutes 0.0000 0.0000 -0.4396
## sedentary_minutes 0.0000 0.0000 0.2255
## bpxsy1 0.0005 0.0002 2.5041
## bpxdi1 0.0019 0.0003 6.9660
## log_lbxhscrp 0.0160 0.0033 4.7728
## log_lbxtr 0.1018 0.0062 16.3397
## log_dr1tsodi -0.0037 0.0094 -0.3921
## log_dr1ttfat -0.0047 0.0085 -0.5543
## log_lbdhdd 0.2613 0.0143 18.2221
## log_lbxglu -0.0939 0.0144 -6.5413
## log_bmxbmi 0.0423 0.0166 2.5495
##
## Residual standard error: 0.1984 on 4114 degrees of freedom
car::vif(robust_model_full_transformed_target_v2) # looks good
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.528157 1 1.236186
## riagendr 1.322256 1 1.149894
## ridreth3 1.663460 5 1.052207
## dmdeduc2 1.746420 6 1.047560
## indfmpir 1.298539 1 1.139535
## smq020 1.201694 1 1.096218
## alq111 1.104830 1 1.051109
## vigorous_minutes 1.050970 1 1.025168
## moderate_minutes 1.046275 1 1.022876
## sedentary_minutes 1.021901 1 1.010891
## bpxsy1 1.596263 1 1.263433
## bpxdi1 1.244341 1 1.115500
## log_lbxhscrp 1.389502 1 1.178771
## log_lbxtr 1.440165 1 1.200069
## log_dr1tsodi 2.716875 1 1.648295
## log_dr1ttfat 2.668648 1 1.633600
## log_lbdhdd 1.613914 1 1.270399
## log_lbxglu 1.214539 1 1.102061
## log_bmxbmi 1.505382 1 1.226940
# Model 10
set.seed(123)
robust_bisquare_model_full_transformed_target_v2 <- rlm(log_lbxtc ~ ., psi = psi.bisquare, data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(robust_bisquare_model_full_transformed_target_v2)
##
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>%
## dplyr::select(-c(log_dr1tkcal)), psi = psi.bisquare)
## Residuals:
## Min 1Q Median 3Q Max
## -0.815502 -0.134634 -0.002864 0.134037 0.855650
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 3.8413 0.1221 31.4612
## ridageyr 0.0003 0.0002 1.4454
## riagendrFemale 0.0096 0.0073 1.3112
## ridreth32 0.0128 0.0135 0.9457
## ridreth33 -0.0137 0.0109 -1.2512
## ridreth34 -0.0299 0.0116 -2.5746
## ridreth36 0.0173 0.0132 1.3124
## ridreth37 -0.0047 0.0168 -0.2768
## dmdeduc22 -0.0119 0.0153 -0.7775
## dmdeduc23 -0.0096 0.0139 -0.6903
## dmdeduc24 -0.0178 0.0137 -1.2993
## dmdeduc25 0.0028 0.0149 0.1882
## dmdeduc27 0.2768 0.1459 1.8970
## dmdeduc29 -0.0292 0.0737 -0.3964
## indfmpir 0.0024 0.0024 0.9786
## smq0202 -0.0135 0.0071 -1.9000
## alq1112 0.0016 0.0112 0.1413
## vigorous_minutes 0.0000 0.0000 0.4303
## moderate_minutes 0.0000 0.0000 -0.4248
## sedentary_minutes 0.0000 0.0000 0.2559
## bpxsy1 0.0005 0.0002 2.4532
## bpxdi1 0.0019 0.0003 6.9362
## log_lbxhscrp 0.0159 0.0034 4.7017
## log_lbxtr 0.1023 0.0063 16.2614
## log_dr1tsodi -0.0043 0.0095 -0.4467
## log_dr1ttfat -0.0047 0.0086 -0.5497
## log_lbdhdd 0.2630 0.0145 18.1569
## log_lbxglu -0.0938 0.0145 -6.4655
## log_bmxbmi 0.0433 0.0168 2.5844
##
## Residual standard error: 0.1991 on 4114 degrees of freedom
car::vif(robust_bisquare_model_full_transformed_target_v2) # looks good
## GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.528157 1 1.236186
## riagendr 1.322256 1 1.149894
## ridreth3 1.663460 5 1.052207
## dmdeduc2 1.746420 6 1.047560
## indfmpir 1.298539 1 1.139535
## smq020 1.201694 1 1.096218
## alq111 1.104830 1 1.051109
## vigorous_minutes 1.050970 1 1.025168
## moderate_minutes 1.046275 1 1.022876
## sedentary_minutes 1.021901 1 1.010891
## bpxsy1 1.596263 1 1.263433
## bpxdi1 1.244341 1 1.115500
## log_lbxhscrp 1.389502 1 1.178771
## log_lbxtr 1.440165 1 1.200069
## log_dr1tsodi 2.716875 1 1.648295
## log_dr1ttfat 2.668648 1 1.633600
## log_lbdhdd 1.613914 1 1.270399
## log_lbxglu 1.214539 1 1.102061
## log_bmxbmi 1.505382 1 1.226940
# Model 12
set.seed(123)
model_transformed_newfeatures_v2 <- glm(log_lbxtc ~ ., data = train_data_transformed_newfeatures%>% dplyr::select(-c(AGE_YOUNG, ridageyr)))
summary(model_transformed_newfeatures_v2)
##
## Call:
## glm(formula = log_lbxtc ~ ., data = train_data_transformed_newfeatures %>%
## dplyr::select(-c(AGE_YOUNG, ridageyr)))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.948e+00 9.430e-02 41.863 < 2e-16 ***
## riagendrFemale 9.152e-03 7.187e-03 1.274 0.20290
## ridreth32 1.311e-02 1.324e-02 0.990 0.32204
## ridreth33 3.808e-03 1.065e-02 0.358 0.72066
## ridreth34 -2.550e-02 1.139e-02 -2.238 0.02525 *
## ridreth36 2.115e-02 1.299e-02 1.628 0.10352
## ridreth37 -5.674e-03 1.651e-02 -0.344 0.73112
## dmdeduc22 -1.187e-02 1.497e-02 -0.793 0.42811
## dmdeduc23 -8.620e-03 1.361e-02 -0.633 0.52662
## dmdeduc24 -1.273e-02 1.324e-02 -0.962 0.33626
## dmdeduc25 -9.041e-04 1.455e-02 -0.062 0.95047
## dmdeduc27 2.293e-01 1.429e-01 1.605 0.10867
## dmdeduc29 -1.271e-02 7.220e-02 -0.176 0.86027
## indfmpir 5.189e-04 2.352e-03 0.221 0.82540
## smq0202 -6.363e-03 6.913e-03 -0.920 0.35743
## alq1112 8.854e-03 1.104e-02 0.802 0.42241
## sedentary_minutes 1.535e-06 4.521e-06 0.339 0.73433
## TOTAL_ACTIVITY -1.259e-06 4.761e-06 -0.264 0.79145
## ACTIVE_FLAG 8.184e-03 7.113e-03 1.151 0.24997
## HIGH_FAT_DIET -6.507e-03 6.587e-03 -0.988 0.32328
## HIGH_SODIUM -2.018e-02 7.391e-03 -2.731 0.00635 **
## AGE_MID 7.868e-02 6.869e-03 11.455 < 2e-16 ***
## BMI_UNDER -7.340e-02 2.685e-02 -2.734 0.00629 **
## BMI_NORMAL -2.704e-02 9.342e-03 -2.894 0.00382 **
## BMI_OVER 1.273e-02 7.730e-03 1.646 0.09975 .
## bpxsy1 1.041e-03 1.938e-04 5.372 8.22e-08 ***
## bpxdi1 7.765e-04 2.819e-04 2.754 0.00591 **
## log_lbxhscrp 1.473e-02 3.219e-03 4.575 4.90e-06 ***
## log_lbxtr 9.098e-02 6.172e-03 14.741 < 2e-16 ***
## log_lbdhdd 2.590e-01 1.405e-02 18.431 < 2e-16 ***
## log_lbxglu -8.303e-02 1.413e-02 -5.877 4.51e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.04019509)
##
## Null deviance: 200.61 on 4142 degrees of freedom
## Residual deviance: 165.28 on 4112 degrees of freedom
## AIC: -1525.4
##
## Number of Fisher Scoring iterations: 2
car::vif(model_transformed_newfeatures_v2)
## GVIF Df GVIF^(1/(2*Df))
## riagendr 1.328951 1 1.152801
## ridreth3 1.661164 5 1.052062
## dmdeduc2 1.707273 6 1.045583
## indfmpir 1.295341 1 1.138130
## smq020 1.187965 1 1.089938
## alq111 1.109649 1 1.053399
## sedentary_minutes 1.030418 1 1.015095
## TOTAL_ACTIVITY 1.130412 1 1.063208
## ACTIVE_FLAG 1.223972 1 1.106333
## HIGH_FAT_DIET 1.076678 1 1.037631
## HIGH_SODIUM 1.103043 1 1.050258
## AGE_MID 1.185378 1 1.088751
## BMI_UNDER 1.077733 1 1.038139
## BMI_NORMAL 1.659556 1 1.288237
## BMI_OVER 1.373428 1 1.171934
## bpxsy1 1.324607 1 1.150916
## bpxdi1 1.338908 1 1.157112
## log_lbxhscrp 1.313531 1 1.146094
## log_lbxtr 1.444867 1 1.202026
## log_lbdhdd 1.584001 1 1.258571
## log_lbxglu 1.201714 1 1.096227
Coefficients are presented as multiplicative effects (exponentiated from log-log model), meaning they represent proportional changes in cholesterol. * Biomarkers Dominating the prediction are HDL Cholesterol (log_lbdhdd), Triglycerides (log_lbxtr), Glucose (log_lbxglu),CRP/Inflammation (log_lbxhscrp). * HDL is a component of total cholesterol, so this positive relationship is biologically expected and is the strongest predictor in your model. * Triglycerides contribute to VLDL cholesterol, another component of total cholesterol. * Interestingly, Higher glucose associated with LOWER cholesterol. * Inflammation and cholesterol are linked . * Cholesterol follows an inverted-U trajectory across lifespan as it peaks in middge age(AGE_MID (40-64 years)) and then declines. * Overweight individuals(BMI_OVER (25-29.9)) have 1.28% HIGHER cholesterol than obese and hence Lower BMI leads to Lower cholesterol . * High sodium diet associated with 2% LOWER cholesterol(HIGH_SODIUM (>2300 mg/day): 0.9800)). * Lifestyle changes like Physical Activity, Diet Quality, behaviors like Smoking,Alcohol and Income had no significant effect.
# Logistic Regression Models
# Interpretation of coefficients (Model 1)
cat("\n=== Coefficient Interpretation (Model 1) ===\n")
##
## === Coefficient Interpretation (Model 1) ===
coef_exp <- exp(coef(model_full_v2))
print(round(coef_exp, 4))
## (Intercept) ridageyr riagendrFemale ridreth32
## 2.412504e+43 1.064500e+00 1.198091e+02 8.429400e+00
## ridreth33 ridreth34 ridreth36 ridreth37
## 8.710000e-02 1.300000e-03 4.902800e+00 6.250000e-02
## dmdeduc22 dmdeduc23 dmdeduc24 dmdeduc25
## 5.530000e-02 1.290000e-01 2.960000e-02 7.537000e-01
## dmdeduc27 dmdeduc29 indfmpir smq0202
## 7.500096e+22 2.400000e-03 1.509400e+00 4.680000e-02
## alq1112 vigorous_minutes moderate_minutes sedentary_minutes
## 2.760500e+00 1.000100e+00 9.997000e-01 1.000100e+00
## dr1tsodi bmxbmi bpxsy1 bpxdi1
## 9.997000e-01 1.287100e+00 1.169400e+00 1.427500e+00
## lbxglu lbxhscrp lbdhdd lbxtr
## 9.430000e-01 9.833000e-01 1.915600e+00 1.061200e+00
# Interpretation of coefficients (Model 2)
cat("\n=== Coefficient Interpretation (Model 2) ===\n")
##
## === Coefficient Interpretation (Model 2) ===
coef_exp <- exp(coef(model_full_transformed_v2))
print(round(coef_exp, 4))
## (Intercept) ridageyr riagendrFemale ridreth32
## 0.000000e+00 1.032200e+00 1.700340e+01 1.152050e+01
## ridreth33 ridreth34 ridreth36 ridreth37
## 1.943000e-01 8.000000e-03 2.418640e+01 1.410000e-01
## dmdeduc22 dmdeduc23 dmdeduc24 dmdeduc25
## 1.800000e-02 8.620000e-02 1.710000e-02 3.991000e-01
## dmdeduc27 dmdeduc29 indfmpir smq0202
## 6.708678e+23 4.000000e-04 1.787900e+00 8.200000e-02
## alq1112 vigorous_minutes moderate_minutes sedentary_minutes
## 2.203300e+00 1.000600e+00 9.995000e-01 1.000000e+00
## bpxsy1 bpxdi1 log_lbxhscrp log_lbxtr
## 1.161300e+00 1.332600e+00 2.527480e+01 9.309804e+07
## log_dr1tsodi log_dr1ttfat log_lbdhdd log_lbxglu
## 4.627000e-01 6.648000e-01 4.685730e+19 0.000000e+00
## log_bmxbmi
## 2.486760e+01
# Interpretation of coefficients (Model 3)
cat("\n=== Coefficient Interpretation (Model 3) ===\n")
##
## === Coefficient Interpretation (Model 3) ===
coef_exp <- exp(coef(model_full_transformed_target_v2))
print(round(coef_exp, 4))
## (Intercept) ridageyr riagendrFemale ridreth32
## 48.5508 1.0001 1.0138 1.0140
## ridreth33 ridreth34 ridreth36 ridreth37
## 0.9899 0.9708 1.0165 0.9885
## dmdeduc22 dmdeduc23 dmdeduc24 dmdeduc25
## 0.9812 0.9857 0.9776 0.9969
## dmdeduc27 dmdeduc29 indfmpir smq0202
## 1.2985 0.9630 1.0027 0.9886
## alq1112 vigorous_minutes moderate_minutes sedentary_minutes
## 1.0049 1.0000 1.0000 1.0000
## bpxsy1 bpxdi1 log_lbxhscrp log_lbxtr
## 1.0007 1.0018 1.0183 1.1021
## log_dr1tsodi log_dr1ttfat log_lbdhdd log_lbxglu
## 0.9976 0.9958 1.2917 0.9195
## log_bmxbmi
## 1.0294
# Interpretation of coefficients (Model 10)
cat("\n=== Coefficient Interpretation (Model 10) ===\n")
##
## === Coefficient Interpretation (Model 10) ===
coef_exp <- exp(coef(robust_bisquare_model_full_transformed_target_v2))
print(round(coef_exp, 4))
## (Intercept) ridageyr riagendrFemale ridreth32
## 46.5847 1.0003 1.0096 1.0129
## ridreth33 ridreth34 ridreth36 ridreth37
## 0.9864 0.9705 1.0175 0.9953
## dmdeduc22 dmdeduc23 dmdeduc24 dmdeduc25
## 0.9882 0.9904 0.9824 1.0028
## dmdeduc27 dmdeduc29 indfmpir smq0202
## 1.3189 0.9712 1.0024 0.9866
## alq1112 vigorous_minutes moderate_minutes sedentary_minutes
## 1.0016 1.0000 1.0000 1.0000
## bpxsy1 bpxdi1 log_lbxhscrp log_lbxtr
## 1.0005 1.0019 1.0160 1.1077
## log_dr1tsodi log_dr1ttfat log_lbdhdd log_lbxglu
## 0.9957 0.9953 1.3008 0.9105
## log_bmxbmi
## 1.0443
# Interpretation of coefficients (Model 12)
cat("\n=== Coefficient Interpretation (Model 12) ===\n")
##
## === Coefficient Interpretation (Model 12) ===
coef_exp <- exp(coef(model_transformed_newfeatures_v2))
print(round(coef_exp, 4))
## (Intercept) riagendrFemale ridreth32 ridreth33
## 51.8199 1.0092 1.0132 1.0038
## ridreth34 ridreth36 ridreth37 dmdeduc22
## 0.9748 1.0214 0.9943 0.9882
## dmdeduc23 dmdeduc24 dmdeduc25 dmdeduc27
## 0.9914 0.9874 0.9991 1.2578
## dmdeduc29 indfmpir smq0202 alq1112
## 0.9874 1.0005 0.9937 1.0089
## sedentary_minutes TOTAL_ACTIVITY ACTIVE_FLAG HIGH_FAT_DIET
## 1.0000 1.0000 1.0082 0.9935
## HIGH_SODIUM AGE_MID BMI_UNDER BMI_NORMAL
## 0.9800 1.0819 0.9292 0.9733
## BMI_OVER bpxsy1 bpxdi1 log_lbxhscrp
## 1.0128 1.0010 1.0008 1.0148
## log_lbxtr log_lbdhdd log_lbxglu
## 1.0952 1.2957 0.9203
Best performing model is LM12( Multiple Linear regression with transformed predictors and target+ new features). ** It has lowest RMSE (0.2041 on log scale) and highest R² = 0.1215 ** Has the best predictive performance. ** Provides interpretable coefficients on the log scale. ** Feature engineering is better than Regularization in this case. ** Ridge’s shrinkage couldn’t compensate for missing non-linear patterns
Second best performing Model is LM6 (Ridge with transformed predictors and target) because: ** It has RMSE (0.2054) and R² (0.1104). ** Has the better predictive performance. ** Handles multicollinearity among dietary/lipid variables. ** Is more stable than OLS for correlated predictors.
Other comparable models are LM10 (Robust bisquare) as its close second with RMSE 0.2057 and R² 0.1078; LM3 (Full transformed) with RMSE 0.2057, R² 0.1070.
The log transformation of the target (Total Cholesterol) dramatically improved model performance. ** Original scale models (LM1, LM2): RMSE ~39, R² ~0.09 ** Log-transformed models (LM3+): RMSE ~0.21, R² ~0.10-0.11 ** This makes sense given that cholesterol and many biomarkers follow log-normal distributions.
Ridge regression (LM6) slightly outperformed OLS, suggesting multicollinearity among predictors (especially the correlated dietary variables).Ridge’s shrinkage helps with the correlation among dr1tkcal, dr1ttfat, dr1tsodi, lbdhdd, lbxtr.
Stepwise models (LM4, LM5) performed worse than full models, indicating important predictors were likely removed. The modest R² suggests complex interactions that stepwise doesn’t capture well.
Robust regression (LM9, LM10) performed similarly to OLS (LM3), suggesting Outliers aren’t significantly affecting our results.
The low R² (~0.11) across all models indicates that our predictors explain only about 10-11% of cholesterol variation. ** Looks like the Cholesterol is influenced by many factors not in our dataset (maybe genetics, medications, detailed dietary patterns) ** The relationship may be non-linear or involve complex interactions. ** This is actually typical for epidemiological data.
# Model 1: Multiple Linear Regression - Full model (post mutilcollinearity updates)
# Check assumptions
# Model Evaluation on Test Set
print_evaluate_lm_model <- function(model, target_variable, test_data, model_name) {
preds <- predict(model, newdata = test_data)
if (target_variable == "lbxtc") {
actual <- test_data$lbxtc
} else { # log
actual <- test_data$log_lbxtc
}
rmse <- sqrt(mean((preds - actual)^2))
mae <- mean(abs(preds - actual))
r2 <- 1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
aic <- AIC(model)
output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== ", model_name, " Evaluation ===\n",
"RMSE:", round(rmse, 4),
"| MAE:", round(mae, 4),
"| R²:", round(r2, 4), "\n",
"AIC:", round(aic, 4), "\n\n",
sep = " "
)
cat(output)
}
print_lm_model_diagnostics <- function(model, train_data, model_name) {
cat("\n=== Diagnostics for", model_name, "===\n")
# Histogram of residuals
ggplot(data = as.data.frame(residuals(model)), aes(x = residuals(model))) +
geom_histogram(bins = 25, fill = "skyblue", color = "black") +
xlab("Residuals") +
ggtitle(paste("Residual Histogram -", model_name))
# Residuals vs Fitted
plot(model, which = 1, main = paste("Residuals vs Fitted -", model_name))
# Q-Q Plot
plot(model, which = 2, main = paste("Normal Q-Q Plot -", model_name))
# Scale-Location Plot
plot(model, which = 3, main = paste("Scale-Location Plot -", model_name))
# Residuals vs Leverage
plot(model, which = 5, main = paste("Residuals vs Leverage -", model_name))
# Cook's Distance
try(plot(model, which = 4, main = paste("Cook's Distance -", model_name)), silent = TRUE)
# plot(model, which = 4, main = paste("Cook's Distance -", model_name))
# Heteroscedasticity test
print("Heteroscedasticity test:")
print(bptest(model))
# High leverage cutoff
p <- length(coef(model))
n <- nrow(train_data)
high_lev <- (2 * (p+1)) / n
cat("High leverage threshold:", round(high_lev, 5), "\n")
# Half-normal plot of leverage values
try(halfnorm(hatvalues(model), main = paste("Half-Normal Plot of Leverage -", model_name)), silent = TRUE)
}
# === LM1 Evaluation ===
print_evaluate_lm_model(model_full_v2, "lbxtc", test_data_original, "LM1: Full model")
##
## === Model Selection and Evaluation ===
##
## === LM1: Full model Evaluation ===
## RMSE: 39.1062 | MAE: 29.9116 | R²: 0.0893
## AIC: 42173.6259
print_lm_model_diagnostics(model_full_v2, train_data_original, "LM1: Full model")
##
## === Diagnostics for LM1: Full model ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 181.84, df = 27, p-value < 2.2e-16
##
## High leverage threshold: 0.014
# Model 2: Multiple Linear Regression: Full model with Log transformation of the skewed predictors
# Check assumptions
# === LM2 Evaluation ===
print_evaluate_lm_model(model_full_transformed_v2, "lbxtc", test_data_transformed, "LM2: Full model (Transformed Predictors)")
##
## === Model Selection and Evaluation ===
##
## === LM2: Full model (Transformed Predictors) Evaluation ===
## RMSE: 39.0595 | MAE: 29.6559 | R²: 0.0914
## AIC: 42049.7957
print_lm_model_diagnostics(model_full_transformed_v2, train_data_transformed, "LM2: Full model (Transformed Predictors)")
##
## === Diagnostics for LM2: Full model (Transformed Predictors) ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 96.186, df = 28, p-value = 2.084e-09
##
## High leverage threshold: 0.01448
# Model 3: Multiple Linear Regression: Full model with Log transformation of the skewed predictors + Log target
# Check assumptions
# === LM3 Evaluation ===
print_evaluate_lm_model(model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)")
##
## === Model Selection and Evaluation ===
##
## === LM3: Full model (Transformed Predictors + Target) Evaluation ===
## RMSE: 0.2057 | MAE: 0.1597 | R²: 0.107
## AIC: -1360.3999
print_lm_model_diagnostics(model_full_transformed_target_v2, train_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)")
##
## === Diagnostics for LM3: Full model (Transformed Predictors + Target) ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 141.07, df = 28, p-value < 2.2e-16
##
## High leverage threshold: 0.01448
# Model 4: Stepwise on Model 1
# Check assumptions
# === LM4 Evaluation ===
print_evaluate_lm_model(step_model_full, "lbxtc", test_data_original, "LM4: Stepwise Model 1")
##
## === Model Selection and Evaluation ===
##
## === LM4: Stepwise Model 1 Evaluation ===
## RMSE: 39.1801 | MAE: 29.9975 | R²: 0.0858
## AIC: 42158.8052
print_lm_model_diagnostics(step_model_full, train_data_original, "LM4: Stepwise Model 1")
##
## === Diagnostics for LM4: Stepwise Model 1 ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 170.4, df = 14, p-value < 2.2e-16
##
## High leverage threshold: 0.00772
# Model 5: Stepwise on Model 5
# Check assumptions
# === LM5 Evaluation ===
print_evaluate_lm_model(step_model_full_transformed_target, "log_lbxtc", test_data_transformed_target, "LM5: Stepwise Model 5")
##
## === Model Selection and Evaluation ===
##
## === LM5: Stepwise Model 5 Evaluation ===
## RMSE: 0.2063 | MAE: 0.1605 | R²: 0.1026
## AIC: -1373.733
print_lm_model_diagnostics(step_model_full_transformed_target, train_data_transformed_target, "LM5: Stepwise Model 5")
##
## === Diagnostics for LM5: Stepwise Model 5 ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 116, df = 17, p-value < 2.2e-16
##
## High leverage threshold: 0.00917
# Model 6 - Ridge
testX <- test_data_transformed_target %>%
dplyr::select(-log_lbxtc)
testY <- test_data_transformed_target$log_lbxtc
ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_rmse <- ridge_resample["RMSE"]
ridge_Rsquared <- ridge_resample["Rsquared"]
ridge_MAE <- ridge_resample["MAE"]
ridge_output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== LM6: Ridge Evaluation ===\n",
"RMSE:", round(ridge_rmse, 4),
"| R²:", round(ridge_Rsquared, 4),
"| MAE:", round(ridge_MAE, 4),
"\n\n",
sep = " "
)
cat(ridge_output)
##
## === Model Selection and Evaluation ===
##
## === LM6: Ridge Evaluation ===
## RMSE: 0.2054 | R²: 0.1104 | MAE: 0.1596
# Model 7 - Elastic Net
testX <- test_data_transformed_target %>%
dplyr::select(-log_lbxtc)
testY <- test_data_transformed_target$log_lbxtc
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_rmse <- enet_resample["RMSE"]
enet_Rsquared <- enet_resample["Rsquared"]
enet_MAE <- enet_resample["MAE"]
# enet_aic <- AIC(enet_model)
enet_output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== LM7: Elastic Net Evaluation ===\n",
"RMSE:", round(enet_rmse, 4),
"| R²:", round(enet_Rsquared, 4),
"| MAE:", round(enet_MAE, 4),
"\n\n",
sep = " "
)
cat(enet_output)
##
## === Model Selection and Evaluation ===
##
## === LM7: Elastic Net Evaluation ===
## RMSE: 0.2062 | R²: 0.1031 | MAE: 0.1606
# Model 8: Robust regression - original
# Check assumptions
# === LM8 Evaluation ===
print_evaluate_lm_model(robust_model_full_v2, "lbxtc", test_data_original, "LM8: Robust model (Original)")
##
## === Model Selection and Evaluation ===
##
## === LM8: Robust model (Original) Evaluation ===
## RMSE: 39.1133 | MAE: 29.683 | R²: 0.0889
## AIC: 42200.5285
# reason for try: not every graph is for robust regression, so just print out the graphs R can print
try(print_lm_model_diagnostics(robust_model_full_v2, train_data_original, "LM8: Robust model (Original)"), silent=TRUE)
##
## === Diagnostics for LM8: Robust model (Original) ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 182.47, df = 28, p-value < 2.2e-16
##
## High leverage threshold: 0.01448
# Model 9: Robust regression - transformed
# Check assumptions
# === LM9 Evaluation ===
print_evaluate_lm_model(robust_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)")
##
## === Model Selection and Evaluation ===
##
## === LM9: Robust model (Transformed Predictors + Target) Evaluation ===
## RMSE: 0.2057 | MAE: 0.1595 | R²: 0.1076
## AIC: -1355.7682
try(print_lm_model_diagnostics(robust_model_full_transformed_target_v2, train_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)"), silent=TRUE)
##
## === Diagnostics for LM9: Robust model (Transformed Predictors + Target) ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 141.07, df = 28, p-value < 2.2e-16
##
## High leverage threshold: 0.01448
# Model 10: Robust regression - transformed, bisquare
# Check assumptions
# === LM10 Evaluation ===
print_evaluate_lm_model(robust_bisquare_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)")
##
## === Model Selection and Evaluation ===
##
## === LM10: Robust model (Transformed Predictors + Target, bisquare) Evaluation ===
## RMSE: 0.2057 | MAE: 0.1594 | R²: 0.1078
## AIC: -1355.2245
try(print_lm_model_diagnostics(robust_bisquare_model_full_transformed_target_v2, train_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)"), silent=TRUE)
##
## === Diagnostics for LM10: Robust model (Transformed Predictors + Target, bisquare) ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 141.07, df = 28, p-value < 2.2e-16
##
## High leverage threshold: 0.01448
# === LM12 Evaluation ===
print_evaluate_lm_model(model_transformed_newfeatures_v2, "log_lbxtc",test_data_transformed_newfeatures, "LM12: Transformed predictors and target + New features")
##
## === Model Selection and Evaluation ===
##
## === LM12: Transformed predictors and target + New features Evaluation ===
## RMSE: 0.2041 | MAE: 0.1585 | R²: 0.1215
## AIC: -1525.4366
print_lm_model_diagnostics(model_transformed_newfeatures_v2,train_data_transformed_newfeatures,"LM12: Transformed predictors and target + New features")
##
## === Diagnostics for LM12: Transformed predictors and target + New features ===
## [1] "Heteroscedasticity test:"
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 131.61, df = 30, p-value = 1.092e-14
##
## High leverage threshold: 0.01545
# Results table
evaluate_lm_model <- function(model, target_variable, test_data, model_name) {
preds <- predict(model, newdata = test_data)
if (target_variable == "lbxtc") {
actual <- test_data$lbxtc
} else { # log
actual <- test_data$log_lbxtc
}
rmse <- sqrt(mean((preds - actual)^2))
mae <- mean(abs(preds - actual))
r2 <- 1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
aic <- AIC(model)
return(data.frame(
Model = model_name,
RMSE = round(rmse, 4),
MAE = round(mae, 4),
R_squared = round(r2, 4),
AIC = round(aic, 4)
))
}
ridge_df <- data.frame(
Model = "LM6: Ridge (Predictors + Target)",
RMSE = round(ridge_rmse, 4),
MAE = round(ridge_MAE, 4),
R_squared = round(ridge_Rsquared, 4),
AIC = NA
)
enet_df <- data.frame(
Model = "LM7: Elastic Net (Predictors + Target)",
RMSE = round(enet_rmse, 4),
MAE = round(enet_MAE, 4),
R_squared = round(enet_Rsquared, 4),
AIC = NA
)
comparison <- rbind(
evaluate_lm_model(model_full_v2, "lbxtc", test_data_original, "LM1: Full model"),
evaluate_lm_model(model_full_transformed_v2, "lbxtc", test_data_transformed, "LM2: Full model (Transformed Predictors)"),
evaluate_lm_model(model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)"),
evaluate_lm_model(step_model_full, "lbxtc", test_data_original, "LM4: Stepwise Model 1"),
evaluate_lm_model(step_model_full_transformed_target, "log_lbxtc", test_data_transformed_target, "LM5: Stepwise Model 3"),
ridge_df,
enet_df,
evaluate_lm_model(robust_model_full_v2, "lbxtc", test_data_original, "LM8: Robust model (Original)"),
evaluate_lm_model(robust_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)"),
evaluate_lm_model(robust_bisquare_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)"),
evaluate_lm_model(model_transformed_newfeatures_v2,"log_lbxtc", test_data_transformed_newfeatures, "LM12: Transformed predictors and target + New features")
)
print(comparison)
## Model RMSE
## 1 LM1: Full model 39.1062
## 2 LM2: Full model (Transformed Predictors) 39.0595
## 3 LM3: Full model (Transformed Predictors + Target) 0.2057
## 4 LM4: Stepwise Model 1 39.1801
## 5 LM5: Stepwise Model 3 0.2063
## RMSE LM6: Ridge (Predictors + Target) 0.2054
## RMSE1 LM7: Elastic Net (Predictors + Target) 0.2062
## 11 LM8: Robust model (Original) 39.1133
## 12 LM9: Robust model (Transformed Predictors + Target) 0.2057
## 13 LM10: Robust model (Transformed Predictors + Target, bisquare) 0.2057
## 14 LM12: Transformed predictors and target + New features 0.2041
## MAE R_squared AIC
## 1 29.9116 0.0893 42173.626
## 2 29.6559 0.0914 42049.796
## 3 0.1597 0.1070 -1360.400
## 4 29.9975 0.0858 42158.805
## 5 0.1605 0.1026 -1373.733
## RMSE 0.1596 0.1104 NA
## RMSE1 0.1606 0.1031 NA
## 11 29.6830 0.0889 42200.529
## 12 0.1595 0.1076 -1355.768
## 13 0.1594 0.1078 -1355.225
## 14 0.1585 0.1215 -1525.437
kable(
comparison,
caption = "Multiple Linear Regression Models",
digits = 4,
align = "c"
)
| Model | RMSE | MAE | R_squared | AIC | |
|---|---|---|---|---|---|
| 1 | LM1: Full model | 39.1062 | 29.9116 | 0.0893 | 42173.626 |
| 2 | LM2: Full model (Transformed Predictors) | 39.0595 | 29.6559 | 0.0914 | 42049.796 |
| 3 | LM3: Full model (Transformed Predictors + Target) | 0.2057 | 0.1597 | 0.1070 | -1360.400 |
| 4 | LM4: Stepwise Model 1 | 39.1801 | 29.9975 | 0.0858 | 42158.805 |
| 5 | LM5: Stepwise Model 3 | 0.2063 | 0.1605 | 0.1026 | -1373.733 |
| RMSE | LM6: Ridge (Predictors + Target) | 0.2054 | 0.1596 | 0.1104 | NA |
| RMSE1 | LM7: Elastic Net (Predictors + Target) | 0.2062 | 0.1606 | 0.1031 | NA |
| 11 | LM8: Robust model (Original) | 39.1133 | 29.6830 | 0.0889 | 42200.529 |
| 12 | LM9: Robust model (Transformed Predictors + Target) | 0.2057 | 0.1595 | 0.1076 | -1355.768 |
| 13 | LM10: Robust model (Transformed Predictors + Target, bisquare) | 0.2057 | 0.1594 | 0.1078 | -1355.225 |
| 14 | LM12: Transformed predictors and target + New features | 0.2041 | 0.1585 | 0.1215 | -1525.437 |
We used the 2017–2018 NHANES cycle set of key modules that capture demographic, behavioral, and clinical information relevant to predicting cholesterol and BMI outcomes. These modules include demographics, smoking status, alcohol use, physical activity, dietary intake, body measurements, blood pressure, glucose, inflammation (CRP), HDL cholesterol, triglycerides, and total cholesterol.Each dataset was imported individually and prepared for merging so that a comprehensive, unified analytic file could be created for building our regression models.
Several continuous variables like Total calories(dr1tkcal),Total fat(dr1ttfat) ,Sodium intake(dr1tsodi), HDL(lbdhdd) and Triglycerides(lbxtr) in the dataset exhibit skewed distributions.
We addressed variables with moderate missingness by replacing with their median values, while categorical variables were imputed using the most frequent category. For variables with higher levels of missingness, we applied MICE using predictive mean matching, allowing more accurate estimation based on relationships among variables.
The most prominent outliers were concentrated in biomarkers and nutrient totals, which could influence regression models if unaddressed. These findings suggested the need for careful handling, such as log transformation or robust modeling approaches, rather than outright removal, to ensure valid and generalizable analytical results.
To address skewness and high-end outliers in the NHANES dataset, we applied log transformations to several right-skewed variables. Specifically, CRP (lbxhscrp), triglycerides (lbxtr), total cholesterol (lbxtc), HDL cholesterol (lbdhdd), glucose (lbxglu), BMI (bmxbmi), and dietary intake variables such as total calories. (dr1tkcal), total fat (dr1ttfat), and sodium (dr1tsodi) were log-transformed. In addition to transforming skewed numeric variables, we converted key categorical variables into factor type for modeling.
Checking for linear relationship across the scatterplots, we observed that the most numeric predictors exhibit very weak or negligible linear relationships with total cholesterol.This suggested that total cholesterol is largely uncorrelated with most individual predictors, with only modest positive trends observed for a few clinical measures.
Strong Positive Correlations: - dr1tkcal/dr1ttfat 0.89 (Calories and Total Fat Intake - more fat indicates more calories) - dr1tkcal/dr1tsodi 0.80 (Calories and Total Sodium Intake - more sodium indicates more calories) - dr1ttfat/dr1tsodi 0.76 (Total Fat and Sodium Intake -more fat indicates more sodium) - ridageyr/bpxsy1 0.47 (Higher age indicates higher systolic bp)
Negative Correlations: - lbdhdd/lbxtr -0.31 (Higher HDL, lower Triglycerides) - bmxbmi/lbdhdd -0.27 (Higher BMI, lower HDL) - lbxglu/lbdhdd -0.18 (Higher Glucose, lower HDL)
Several linear regression models were built and evaluated . Best Performing Models is LM12( Multiple Linear regression with transformed predictors and target+ new features). ** It has lowest RMSE (0.2041 on log scale) and highest R² = 0.1215 ** Has the best predictive performance. ** Provides interpretable coefficients on the log scale. ** Feature engineering is better than Regularization in this case. ** Ridge’s shrinkage couldn’t compensate for missing non-linear patterns