1 Background

This analysis aims to examine the relationship between cholesterol level and key variables include demographic characteristics (age, sex, race/ethnicity), socioeconomic status, behavioral factors (smoking, alcohol use, physical activity, diet), and clinical measures (HDL, triglycerides, blood pressure, glucose, C-reactive protein) using the National Health and Nutrition Examination Survey (NHANES) 2017-2018 cross-sectional data. The sample of individuals aged 18-80 years will be used to determine which variable most strongly influences cholesterol while accounting for complex population-level variations.

2 Data Exploration

2.1 Data Import

Since the 2017–2018 NHANES cycle contains some of the most complete and up-to-date data, we selected this cycle for our analysis. Within this cycle, we downloaded a set of key modules that capture demographic, behavioral, and clinical information relevant to predicting cholesterol and BMI outcomes. These modules include demographics, smoking status, alcohol use, physical activity, dietary intake, body measurements, blood pressure, glucose, inflammation (CRP), HDL cholesterol, triglycerides, and total cholesterol. Each dataset was imported individually and prepared for merging so that a comprehensive, unified analytic file could be created for building our regression models.

load_nhanes <- function(files, cycle = "2017") {
  
  base_url <- paste0("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/", cycle, "/DataFiles/")
  
  result <- list()
  
  for (f in files) {
    file_url <- paste0(base_url, f, ".xpt")
    local_name <- paste0(f, ".XPT")
    download.file(file_url, local_name, mode = "wb")
    df <- read_xpt(local_name)
    result[[f]] <- df
  }
  
  return(result)
}

files_to_load <- c(
  "DEMO_J",   # Demographics
  "SMQ_J",    # Smoking
  "ALQ_J",    # Alcohol
  "PAQ_J",    # Physical activity
  "DR1TOT_J", # Diet (calories, fat, sodium)
  "BMX_J",    # Body measures (BMI)
  "BPX_J",    # Blood pressure
  "GLU_J",    # Glucose
  "HSCRP_J",  # CRP
  "HDL_J",    # HDL cholesterol
  "TRIGLY_J",  # Triglycerides
  "TCHOL_J"   # Total Cholesterol
)

nhanes_list <- load_nhanes(files_to_load)

demo     <- nhanes_list$DEMO_J
smoking  <- nhanes_list$SMQ_J
alcohol  <- nhanes_list$ALQ_J
physact  <- nhanes_list$PAQ_J
diet     <- nhanes_list$DR1TOT_J
bmx      <- nhanes_list$BMX_J
bp       <- nhanes_list$BPX_J
glucose  <- nhanes_list$GLU_J
crp      <- nhanes_list$HSCRP_J
hdl      <- nhanes_list$HDL_J
trig     <- nhanes_list$TRIGLY_J
chol     <- nhanes_list$TCHOL_J

2.2 Data Cleaning

After loading the individual NHANES datasets, we merged them into a single analytic dataset using the respondent identifier SEQN, which is shared across all NHANES modules. Once all modules were combined, we restricted the sample to adults aged 18 years and older and created derived physical activity variables representing vigorous, moderate, and sedentary minutes per day. Finally, we selected the key demographic, behavioral, dietary, and clinical variables needed for modeling, including predictors such as age, gender, race/ethnicity, socioeconomic indicators, lifestyle behaviors, and biomarkers, along with total cholesterol, lbxtc, as the primary outcome.

Features:

  • ridageyr: Age
  • riagendr: Gender
  • ridreth3: Race/Ethnicity
  • dmdeduc2: Education
  • indfmpir: Income/Poverty ratio
  • smq020: Smoking status
  • alq111: Alcohol use
  • vigorous_minutes
  • moderate_minutes
  • sedentary_minutes
  • dr1tkcal: Total calories
  • dr1ttfat: Total fat
  • dr1tsodi: Sodium
  • bmxbmi: BMI
  • bpxsy1: Systolic BP
  • bpxdi1: Diastolic BP
  • lbxglu: Glucose
  • lbxhscrp: CRP
  • lbdhdd: HDL
  • lbxtr: Triglycerides
  • lbxtc: Total Cholesterol (target)
# Merge data

demo <- nhanes_list$DEMO_J %>% 
  janitor::clean_names()

# Start with demographics
nhanes_master <- demo

# Merge BMI
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$BMX_J), by = "seqn")

# Merge Blood Pressure
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$BPX_J), by = "seqn")

# Merge Glucose
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$GLU_J), by = "seqn")

# Merge CRP
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$HSCRP_J), by = "seqn")

# Merge HDL
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$HDL_J), by = "seqn")

# Merge Triglycerides
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$TRIGLY_J), by = "seqn")

# Merge Smoking
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$SMQ_J), by = "seqn")

# Merge Alcohol
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$ALQ_J), by = "seqn")

# Merge Physical Activity
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$PAQ_J), by = "seqn")

# Merge Diet (calories, fat, sodium)
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$DR1TOT_J), by = "seqn")

# Merge Diet (calories, fat, sodium)
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$TCHOL_J), by = "seqn")


# Filter for adults only
nhanes_model <- nhanes_master %>%
  filter(ridageyr >= 18) %>%
  mutate(
    vigorous_minutes = coalesce(pad615, 0) + coalesce(pad660, 0),  # Vigorous activity (work + recreation)
    moderate_minutes = coalesce(pad630, 0) + coalesce(pad675, 0) + coalesce(pad645, 0), # Moderate activity (work + recreation + transportation)
    sedentary_minutes = coalesce(pad680, 0) # Sedentary minutes
  )

# Select Variables for Modeling

nhanes_model2 <- nhanes_model %>%
  dplyr::select(
    seqn,
    ridageyr,        # Age
    riagendr,        # Gender
    ridreth3,        # Race/Ethnicity
    dmdeduc2,        # Education
    indfmpir,        # Income/Poverty ratio
    smq020,          # Smoking status
    alq111,          # Alcohol use
    vigorous_minutes,
    moderate_minutes,
    sedentary_minutes,
    dr1tkcal,        # Total calories
    dr1ttfat,        # Total fat
    dr1tsodi,        # Sodium
    bmxbmi,          # BMI
    bpxsy1, bpxdi1,  # Systolic & Diastolic BP
    lbxglu,          # Glucose
    lbxhscrp,         # CRP
    lbdhdd,          # HDL
    lbxtr,         # Triglycerides
    lbxtc            # Total Cholesterol (target)
  )

2.3 Data Distribution

To assess the structure and distribution of our variables, we began by generating summary statistics for all predictors in the dataset. We then visualized numerical features using histograms to examine normality, skewness, and potential outliers. Boxplots were created for key categorical variables to show how total cholesterol differs across demographic and behavioral groups. Finally, we plotted the distribution of the target variable—total cholesterol—to evaluate its overall shape before modeling. These exploratory steps provide a clear understanding of the data and help verify assumptions for regression analysis.

Several continuous variables in the dataset exhibit skewed distributions that may affect regression modeling:

  • C-reactive protein (lbxhscrp) shows extreme right skew, with a median of approximately 1.94 mg/L and a maximum of 182 mg/L.
  • Triglycerides (lbxtr) are also right-skewed, with a median of 94 mg/dL and a maximum of 2684 mg/dL.
  • Dietary variables such as total calories (dr1tkcal) and total fat intake (dr1ttfat) display mild to moderate skew due to some extreme high values.
  • Sodium intake (dr1tsodi) is highly right-skewed, with a maximum value of around 25,949 mg.

These variables may benefit from log-transformations or other normalization techniques prior to modeling.

# Summary statistics
nhanes_model2 %>%
  summary() %>%
  kable(caption = "Descriptive Statistics of Predictor Variables") %>%
  kable_styling()
Descriptive Statistics of Predictor Variables
seqn ridageyr riagendr ridreth3 dmdeduc2 indfmpir smq020 alq111 vigorous_minutes moderate_minutes sedentary_minutes dr1tkcal dr1ttfat dr1tsodi bmxbmi bpxsy1 bpxdi1 lbxglu lbxhscrp lbdhdd lbxtr lbxtc
Min. : 93705 Min. :18.00 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.000 Min. :1.000 Min. :1.000 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0 Min. :14.20 Min. : 72 Min. : 0.00 Min. : 47.0 Min. : 0.11 Min. : 10.00 Min. : 10.0 Min. : 76.0
1st Qu.: 95955 1st Qu.:33.00 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:1.180 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 180.0 1st Qu.: 1406 1st Qu.: 50.74 1st Qu.: 2137 1st Qu.:24.60 1st Qu.:112 1st Qu.: 64.00 1st Qu.: 96.0 1st Qu.: 0.88 1st Qu.: 42.00 1st Qu.: 61.0 1st Qu.:158.0
Median : 98263 Median :51.00 Median :2.000 Median :3.000 Median :4.000 Median :2.080 Median :2.000 Median :1.000 Median : 0.0 Median : 50.0 Median : 300.0 Median : 1942 Median : 76.18 Median : 3102 Median :28.50 Median :124 Median : 72.00 Median :104.0 Median : 1.94 Median : 51.00 Median : 92.0 Median :183.0
Mean : 98280 Mean :49.89 Mean :1.515 Mean :3.504 Mean :3.526 Mean :2.521 Mean :1.597 Mean :1.114 Mean : 77.6 Mean : 141.8 Mean : 388.9 Mean : 2110 Mean : 85.07 Mean : 3456 Mean :29.69 Mean :126 Mean : 71.65 Mean :113.7 Mean : 4.17 Mean : 53.19 Mean : 112.3 Mean :186.9
3rd Qu.:100595 3rd Qu.:65.00 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.060 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.: 60.0 3rd Qu.: 180.0 3rd Qu.: 480.0 3rd Qu.: 2628 3rd Qu.:108.28 3rd Qu.: 4311 3rd Qu.:33.50 3rd Qu.:136 3rd Qu.: 80.00 3rd Qu.:115.0 3rd Qu.: 4.48 3rd Qu.: 61.00 3rd Qu.: 134.0 3rd Qu.:212.0
Max. :102956 Max. :80.00 Max. :2.000 Max. :7.000 Max. :9.000 Max. :5.000 Max. :2.000 Max. :2.000 Max. :10029.0 Max. :19998.0 Max. :9999.0 Max. :12501 Max. :567.96 Max. :25949 Max. :86.20 Max. :228 Max. :136.00 Max. :451.0 Max. :182.82 Max. :189.00 Max. :2684.0 Max. :446.0
NA NA NA NA NA’s :287 NA’s :835 NA NA’s :726 NA NA NA NA’s :873 NA’s :873 NA’s :873 NA’s :422 NA’s :957 NA’s :957 NA’s :3319 NA’s :710 NA’s :680 NA’s :3363 NA’s :680
# Plot histograms of all numeric variables in data_train
nhanes_model2 |>
  dplyr::select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
  filter(!is.na(Value)) |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Feature, scales = "free") +
  labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
  theme_minimal()

# Box plot of categorical variables
categorical_vars <- c("riagendr", "ridreth3", "dmdeduc2", "smq020", "alq111")

for (var in categorical_vars) {
  print(
    ggplot(nhanes_model2, aes_string(x = var, y = "lbxtc")) +
      geom_boxplot(fill = "steelblue", color = "black") +
      labs(
        title = paste("Total Cholesterol by", var),
        x = var,
        y = "Total Cholesterol (mg/dL)"
      ) +
      theme_minimal()
  )
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Removed 287 rows containing missing values or values outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Warning: Removed 726 rows containing missing values or values outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 293 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Target variable distribution (Total Cholesterol)
ggplot(nhanes_model2, aes(x = lbxtc)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Total Cholesterol (mg/dL)", x = "Total Cholesterol", y = "Count") +
  theme_minimal()
## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_bin()`).

3. Data Preparation

3.1 Fix Missing Values

To prepare the dataset for modeling, we first removed any observations with missing values in the target variable, total cholesterol. Next, we addressed missingness in the predictors. Variables with moderate missingness were replaced with their median values, while categorical variables were imputed using the most frequent category. For variables with higher levels of missingness, we applied MICE using predictive mean matching, allowing more accurate estimation based on relationships among variables.

## Look at data

# Structure of the data 
str(nhanes_model2)
## tibble [5,856 × 22] (S3: tbl_df/tbl/data.frame)
##  $ seqn             : num [1:5856] 93705 93706 93708 93709 93711 ...
##   ..- attr(*, "label")= chr "Respondent sequence number"
##  $ ridageyr         : num [1:5856] 66 18 66 75 56 18 67 54 71 61 ...
##   ..- attr(*, "label")= chr "Age in years at screening"
##  $ riagendr         : num [1:5856] 2 1 2 2 1 1 1 2 1 1 ...
##   ..- attr(*, "label")= chr "Gender"
##  $ ridreth3         : num [1:5856] 4 6 6 4 6 1 3 4 7 6 ...
##   ..- attr(*, "label")= chr "Race/Hispanic origin w/ NH Asian"
##  $ dmdeduc2         : num [1:5856] 2 NA 1 4 5 NA 3 4 3 5 ...
##   ..- attr(*, "label")= chr "Education level - Adults 20+"
##  $ indfmpir         : num [1:5856] 0.82 NA 1.63 0.41 5 0.76 2.65 1.86 1.56 5 ...
##   ..- attr(*, "label")= chr "Ratio of family income to poverty"
##  $ smq020           : num [1:5856] 1 2 2 1 2 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Smoked at least 100 cigarettes in life"
##  $ alq111           : num [1:5856] 1 2 2 NA 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Ever had a drink of any kind of alcohol"
##  $ vigorous_minutes : num [1:5856] 0 0 0 0 60 465 0 0 660 180 ...
##  $ moderate_minutes : num [1:5856] 60 75 30 180 90 270 90 0 260 240 ...
##  $ sedentary_minutes: num [1:5856] 300 240 120 600 420 120 120 360 180 300 ...
##  $ dr1tkcal         : num [1:5856] 1202 1987 1251 NA 2840 ...
##   ..- attr(*, "label")= chr "Energy (kcal)"
##  $ dr1ttfat         : num [1:5856] 57 137.4 65.5 NA 124.2 ...
##   ..- attr(*, "label")= chr "Total fat (gm)"
##  $ dr1tsodi         : num [1:5856] 3574 3657 2135 NA 4382 ...
##   ..- attr(*, "label")= chr "Sodium (mg)"
##  $ bmxbmi           : num [1:5856] 31.7 21.5 23.7 38.9 21.3 19.7 23.5 39.9 22.5 30.7 ...
##   ..- attr(*, "label")= chr "Body Mass Index (kg/m**2)"
##  $ bpxsy1           : num [1:5856] NA 112 NA 120 108 112 104 NA 112 120 ...
##   ..- attr(*, "label")= chr "Systolic: Blood pres (1st rdg) mm Hg"
##  $ bpxdi1           : num [1:5856] NA 74 NA 66 68 68 70 NA 60 72 ...
##   ..- attr(*, "label")= chr "Diastolic: Blood pres (1st rdg) mm Hg"
##  $ lbxglu           : num [1:5856] NA NA 122 NA 107 NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Fasting Glucose (mg/dL)"
##  $ lbxhscrp         : num [1:5856] 2.72 0.74 1.83 6.94 0.82 0.37 1.66 5.71 6.1 1.75 ...
##   ..- attr(*, "label")= chr "HS C-Reactive Protein (mg/L)"
##  $ lbdhdd           : num [1:5856] 60 47 88 65 72 48 48 42 57 39 ...
##   ..- attr(*, "label")= chr "Direct HDL-Cholesterol (mg/dL)"
##  $ lbxtr            : num [1:5856] NA NA 58 NA 48 NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Triglyceride (mg/dL)"
##  $ lbxtc            : num [1:5856] 157 148 209 176 238 182 184 230 180 225 ...
##   ..- attr(*, "label")= chr "Total Cholesterol (mg/dL)"
# Glimpse of the data
head(nhanes_model2)
## # A tibble: 6 × 22
##    seqn ridageyr riagendr ridreth3 dmdeduc2 indfmpir smq020 alq111
##   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
## 1 93705       66        2        4        2     0.82      1      1
## 2 93706       18        1        6       NA    NA         2      2
## 3 93708       66        2        6        1     1.63      2      2
## 4 93709       75        2        4        4     0.41      1     NA
## 5 93711       56        1        6        5     5         2      1
## 6 93712       18        1        1       NA     0.76      1      1
## # ℹ 14 more variables: vigorous_minutes <dbl>, moderate_minutes <dbl>,
## #   sedentary_minutes <dbl>, dr1tkcal <dbl>, dr1ttfat <dbl>, dr1tsodi <dbl>,
## #   bmxbmi <dbl>, bpxsy1 <dbl>, bpxdi1 <dbl>, lbxglu <dbl>, lbxhscrp <dbl>,
## #   lbdhdd <dbl>, lbxtr <dbl>, lbxtc <dbl>
# Count missing values per Variable
nhanes_model2 %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  filter(missing_count != 0) %>%
  arrange(desc(missing_count)) 
## # A tibble: 14 × 2
##    variable missing_count
##    <chr>            <int>
##  1 lbxtr             3363
##  2 lbxglu            3319
##  3 bpxsy1             957
##  4 bpxdi1             957
##  5 dr1tkcal           873
##  6 dr1ttfat           873
##  7 dr1tsodi           873
##  8 indfmpir           835
##  9 alq111             726
## 10 lbxhscrp           710
## 11 lbdhdd             680
## 12 lbxtc              680
## 13 bmxbmi             422
## 14 dmdeduc2           287
colSums(is.na(nhanes_model2))
##              seqn          ridageyr          riagendr          ridreth3 
##                 0                 0                 0                 0 
##          dmdeduc2          indfmpir            smq020            alq111 
##               287               835                 0               726 
##  vigorous_minutes  moderate_minutes sedentary_minutes          dr1tkcal 
##                 0                 0                 0               873 
##          dr1ttfat          dr1tsodi            bmxbmi            bpxsy1 
##               873               873               422               957 
##            bpxdi1            lbxglu          lbxhscrp            lbdhdd 
##               957              3319               710               680 
##             lbxtr             lbxtc 
##              3363               680
# Keep only rows where Total Cholesterol is observed
nhanes_model2 <- nhanes_model2 %>% 
  filter(!is.na(lbxtc))


medium_numeric <- c("indfmpir","dr1tkcal","dr1ttfat","dr1tsodi",
                    "bmxbmi","bpxsy1","bpxdi1","lbxhscrp","lbdhdd")
medium_categorical <- c("dmdeduc2","alq111")

high_missing <- c("lbxglu","lbxtr")

## Median Imputations

# Numeric
for(v in medium_numeric){
  nhanes_model2[[v]][is.na(nhanes_model2[[v]])] <- median(nhanes_model2[[v]], na.rm = TRUE)
}

# Categorical
for(v in medium_categorical){
  mode_val <- as.numeric(names(sort(table(nhanes_model2[[v]]), decreasing = TRUE)[1]))
  nhanes_model2[[v]][is.na(nhanes_model2[[v]])] <- mode_val
}

# Mice Imputations

# Select only high missing numeric variables + all predictors
mice_vars <- c(high_missing, medium_numeric, "vigorous_minutes", "moderate_minutes", "sedentary_minutes")

# Run MICE
mice_data <- mice(nhanes_model2[mice_vars], m = 5, method = "pmm", seed = 123)
## 
##  iter imp variable
##   1   1  lbxglu  lbxtr
##   1   2  lbxglu  lbxtr
##   1   3  lbxglu  lbxtr
##   1   4  lbxglu  lbxtr
##   1   5  lbxglu  lbxtr
##   2   1  lbxglu  lbxtr
##   2   2  lbxglu  lbxtr
##   2   3  lbxglu  lbxtr
##   2   4  lbxglu  lbxtr
##   2   5  lbxglu  lbxtr
##   3   1  lbxglu  lbxtr
##   3   2  lbxglu  lbxtr
##   3   3  lbxglu  lbxtr
##   3   4  lbxglu  lbxtr
##   3   5  lbxglu  lbxtr
##   4   1  lbxglu  lbxtr
##   4   2  lbxglu  lbxtr
##   4   3  lbxglu  lbxtr
##   4   4  lbxglu  lbxtr
##   4   5  lbxglu  lbxtr
##   5   1  lbxglu  lbxtr
##   5   2  lbxglu  lbxtr
##   5   3  lbxglu  lbxtr
##   5   4  lbxglu  lbxtr
##   5   5  lbxglu  lbxtr
# Complete the dataset using first imputed dataset
nhanes_model2[mice_vars] <- complete(mice_data, 1)


colSums(is.na(nhanes_model2))
##              seqn          ridageyr          riagendr          ridreth3 
##                 0                 0                 0                 0 
##          dmdeduc2          indfmpir            smq020            alq111 
##                 0                 0                 0                 0 
##  vigorous_minutes  moderate_minutes sedentary_minutes          dr1tkcal 
##                 0                 0                 0                 0 
##          dr1ttfat          dr1tsodi            bmxbmi            bpxsy1 
##                 0                 0                 0                 0 
##            bpxdi1            lbxglu          lbxhscrp            lbdhdd 
##                 0                 0                 0                 0 
##             lbxtr             lbxtc 
##                 0                 0

3.3 Identifying Outliers

The boxplots reveal that several variables, particularly total cholesterol (lbxtc), triglycerides (lbxtr), LDL cholesterol (lbdldl), and nutrient/weight totals, exhibit pronounced high-end outliers with long upper whiskers and many points far above the main distribution. These patterns reflect right-skewed distributions and natural physiological or dietary variability in the population rather than errors. Other variables, such as blood pressure, hemoglobin, and BMI, show moderate high outliers, while some nutrient or BMI measures have only a few scattered extremes. Overall, the most prominent outliers are concentrated in biomarkers and nutrient totals, which could influence regression models if unaddressed. These findings suggest the need for careful handling, such as log transformation or robust modeling approaches, rather than outright removal, to ensure valid and generalizable analytical results.

# Boxplot of lbxtc

nhanes_model2 |>
  dplyr::select(lbxtc) |>
  ggplot(aes(x = "", y = lbxtc)) + 
  geom_boxplot(fill="pink") + 
  labs(title = "Distribution of Total Cholesterol", y = "Total Cholesterol mg/dL") +  
  theme_minimal()

# Boxplots of all predictors 

ggplot(stack(nhanes_model2[,-1]), aes(x = ind, y = values)) + 
  geom_boxplot(fill = "darkgrey") +
  coord_cartesian(ylim = c(0, 1000)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = 'grey96')) +
  labs(title = "Boxplots of Predictor Variables", x="Predictors")
## Warning in stack.data.frame(nhanes_model2[, -1]): non-vector columns will be
## ignored
## Warning: <ggplot> %+% x was deprecated in ggplot2 4.0.0.
## ℹ Please use <ggplot> + x instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

nhanes_model2 %>%
  dplyr::select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "grey70") +
  coord_cartesian(ylim = c(0, 500)) +   # Adjust depending on variable range
  labs(
    title = "Boxplots of Numeric Predictor Variables",
    x = "Predictors",
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

3.4 Tranformations and Factoring

To address skewness and high-end outliers in the NHANES dataset, we applied log transformations to several right-skewed variables. Specifically, CRP (lbxhscrp), triglycerides (lbxtr), total cholesterol (lbxtc), HDL cholesterol (lbdhdd), glucose (lbxglu), BMI (bmxbmi), and dietary intake variables such as total calories (dr1tkcal), total fat (dr1ttfat), and sodium (dr1tsodi) were log-transformed. A small constant was added to each variable to avoid taking the logarithm of zero. These transformations reduce the influence of extreme values, improve the symmetry of distributions, and stabilize variance, thereby helping to meet the assumptions of linear regression and supporting more reliable model estimation.

In addition to transforming skewed numeric variables, we converted key categorical variables into factor type for modeling. Specifically, gender (riagendr) was labeled as “Male” and “Female,” while race/ethnicity (ridreth3), education level (dmdeduc2), smoking status (smq020), and alcohol use (alq111) were converted to factor variables without changing their original coding. This ensures that these categorical predictors are properly treated in regression models, allowing R to handle them as discrete groups rather than continuous numeric values, which improves interpretation and model performance.

# tranform skewed variables

# Log-transform skewed variables, adding a small constant to avoid log(0)
nhanes_model3 <- nhanes_model2 %>%
  mutate(
    log_lbxhscrp = log(lbxhscrp + 0.01),
    log_lbxtr = log(lbxtr + 0.01),
    log_dr1tkcal = log(dr1tkcal + 1),
    log_dr1tsodi= log(dr1tsodi + 1),
    log_dr1ttfat = log(dr1ttfat + 1),
    log_lbxtc = log(lbxtc + 0.01),
    log_lbdhdd = log(lbdhdd + 0.01),
    log_lbxglu = log(lbxglu + 0.01),
    log_bmxbmi = log(bmxbmi + 0.01)
  )

log_vars <- c("log_lbxhscrp","log_lbxtr","log_dr1tkcal","log_dr1tsodi","log_dr1ttfat",
              "log_lbxtc","log_lbdhdd","log_lbxglu","log_bmxbmi")

nhanes_model3 %>%
  dplyr::select(all_of(log_vars)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Variable, scales = "free") +
  labs(title = "Histograms of Log-Transformed Variables", x = NULL, y = "Frequency") +
  theme_minimal()

# Factor categorical variables
nhanes_model4 <- nhanes_model3 %>%
  mutate(
    riagendr = factor(riagendr, labels = c("Male", "Female")),
    ridreth3 = factor(ridreth3),
    dmdeduc2 = factor(dmdeduc2),
    smq020   = factor(smq020),
    alq111   = factor(alq111)
  )

3.4 Checking for linear relationship

To evaluate the suitability of linear regression for predicting total cholesterol, we examined the relationships between each numeric predictor and the target variable (lbxtc) using scatterplots and Pearson correlation coefficients. Variables with stronger correlations may be more informative for the regression model, while variables with very weak correlations may contribute less predictive value or require further transformation.

Across the scatterplots, most numeric predictors exhibit very weak or negligible linear relationships with total cholesterol, as reflected by near-zero Pearson correlation coefficients. Variables such as physical activity (vigorous, moderate, sedentary minutes), dietary intakes (calories, fat, sodium), and demographic characteristics show little predictive power individually. A few clinical variables—particularly systolic and diastolic blood pressure (bpxsy1, bpxdi1) and log-transformed triglycerides (log_trig)—display mild positive associations with cholesterol, but these relationships are weak and accompanied by substantial scatter. Overall, the plots suggest that total cholesterol is largely uncorrelated with most individual predictors, with only modest positive trends observed for a few clinical measures.

# Get all numeric predictor variables (excluding total cholesterol)
predictors <- names(nhanes_model4)[sapply(nhanes_model4, is.numeric) & names(nhanes_model4) != "lbxtc"]

# Set up plotting parameters for multiple plots
par(mfrow = c(4, 4), mar = c(4, 4, 3, 2))

# Plot each predictor vs total cholesterol
for(i in 1:length(predictors)) {
  predictor <- predictors[i]
  correlation <- cor(nhanes_model4[[predictor]], nhanes_model4$lbxtc, use = "complete.obs")
  
  plot(nhanes_model4[[predictor]], nhanes_model4$lbxtc,
       xlab = predictor,
       ylab = "Total Cholesterol",
       main = paste(predictor, "\n(r =", round(correlation, 3), ")"),
       pch = 16,
       col = ifelse(correlation > 0, "darkblue", "darkred"),
       cex = 0.6)
  
  # Add regression line
  abline(lm(nhanes_model4$lbxtc ~ nhanes_model4[[predictor]]), 
         col = "red", lwd = 2)
  
  # Add correlation text
  text(x = quantile(nhanes_model4[[predictor]], 0.05, na.rm = TRUE), 
       y = quantile(nhanes_model4$lbxtc, 0.95, na.rm = TRUE),
       labels = paste("r =", round(correlation, 3)),
       cex = 0.8, col = "red")
}

# Reset plotting parameters
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2))

3.5 Identifying Correlations

The correlations between total cholesterol (lbxtc) and other variables in the dataset are generally modest.

  • HDL cholesterol (lbdhdd) shows the strongest positive correlation at 0.19
  • Followed by log-transformed triglycerides (log_trig) and blood pressure measures (bpxdi1 and bpxsy1), each around 0.13. Age (ridageyr).
  • Log-transformed CRP (log_crp), and income-to-poverty ratio (indfmpir) show weak positive correlations ranging from 0.05 to 0.10.

Strong Positive Correlations:

  • dr1tkcal/dr1ttfat 0.89 (Calories and Total Fat Intake - more fat indicates more calories)
  • dr1tkcal/dr1tsodi 0.80 (Calories and Total Sodium Intake - more sodium indicates more calories)
  • dr1ttfat/dr1tsodi 0.76 (Total Fat and Sodium Intake -more fat indicates more sodium)
  • ridageyr/bpxsy1 0.47 (Higher age indicates higher systolic bp)

Negative Correlations:

  • lbdhdd/lbxtr -0.31 (Higher HDL, lower Triglycerides)
  • bmxbmi/lbdhdd -0.27 (Higher BMI, lower HDL)
  • lbxglu/lbdhdd -0.18 (Higher Glucose, lower HDL)

Moderate Correlations:

  • BMI and CRP (bmxBMI ↔︎ lbxhscrp)
  • Systolic and Diastolic Blood Pressure

Weak or Near-Zero Correlations:

  • Physical activity measures (vigorous_minutes, moderate_minutes, sedentary_minutes) show very small correlations with most metabolic markers.
  • Demographic and lifestyle variables like age, gender, education, and alcohol use also show minimal linear correlation with total cholesterol.

Most other lifestyle and dietary variables, including BMI, physical activity, total calories, fat, and sodium intake, have correlations near zero or slightly negative, indicating minimal linear association with total cholesterol in this dataset. Overall, these results suggest that cholesterol levels are more strongly associated with clinical biomarkers (HDL, triglycerides, blood pressure) than with demographic, lifestyle, or dietary factors.

# ORIGINAL PREDICTORS DATASET

# log_vars is "log_crp","log_trig","log_calories","log_sodium","log_fat", "log_chol","log_hdl", "log_glucose", "log_bmi"
nhanes_model4_original <- nhanes_model4 |>
  dplyr::select(-any_of(log_vars))


df_character_wide <- nhanes_model4_original %>%
  select_if(function(col) !is.numeric(col) | all(col == .$lbxtc)) %>%
  pivot_longer(cols = -lbxtc, names_to = "variable", values_to = "value")

df_character_wide %>%
  ggplot(aes(x = value, y = lbxtc)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

numeric_vars <- nhanes_model4_original %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Correlation with total cholesterol
corr_target <- cor_matrix[, "lbxtc"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)

kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Total Cholesterol"), digits = 2)
Correlation with Total Cholesterol
lbxtc 1.00
lbdhdd 0.19
bpxdi1 0.13
bpxsy1 0.13
lbxtr 0.11
ridageyr 0.10
indfmpir 0.05
bmxbmi 0.01
vigorous_minutes 0.00
seqn 0.00
sedentary_minutes 0.00
lbxhscrp 0.00
moderate_minutes 0.00
lbxglu -0.02
dr1tkcal -0.03
dr1ttfat -0.03
dr1tsodi -0.03
# corrplot::corrplot(cor_matrix, method = "color", type = "upper",
#          tl.col = "black", tl.srt = 45, addCoef.col = "black",
#          number.cex = 0.7, diag = FALSE)
corrplot::corrplot(cor_matrix,
        method = "circle",
        type = "upper",
        tl.cex = 0.5,
        tl.srt = 45) # Rotate text diagonally

# Remove self-correlation
corr_target_no_self <- corr_target[names(corr_target) != "lbxtc"]

# Top 3 positive and negative correlations
top_pos <- sort(corr_target_no_self, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target_no_self, decreasing = FALSE)[1:3]

combined_df <- data.frame(
  Feature = c(names(top_pos), names(top_neg)),
  Correlation = c(top_pos, top_neg)
) %>% 
  mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Correlated with Total Cholesterol",
       x = "Feature",
       y = "Correlation with Total Cholesterol",
       fill = "Direction") +
  theme_minimal()

# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
  filter(Var1 != "SalePrice") |>
  filter(Var2 != "SalePrice")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
kable(as.data.frame(sorted_cor), digits = 2)
Var1 Var2 value
126 dr1tkcal dr1ttfat 0.89
143 dr1tkcal dr1tsodi 0.80
144 dr1ttfat dr1tsodi 0.76
172 ridageyr bpxsy1 0.47
198 bpxsy1 bpxdi1 0.35
270 lbdhdd lbxtr -0.31
248 bmxbmi lbdhdd -0.27
268 lbxglu lbxtr 0.26
231 bmxbmi lbxhscrp 0.26
287 lbdhdd lbxtc 0.19
214 bmxbmi lbxglu 0.18
215 bpxsy1 lbxglu 0.18
251 lbxglu lbdhdd -0.18
206 ridageyr lbxglu 0.16
265 bmxbmi lbxtr 0.15
138 ridageyr dr1tsodi -0.14
104 ridageyr dr1tkcal -0.13
284 bpxdi1 lbxtc 0.13
72 vigorous_minutes moderate_minutes 0.13
283 bpxsy1 lbxtc 0.13
180 bmxbmi bpxsy1 0.12
288 lbxtr lbxtc 0.11
121 ridageyr dr1ttfat -0.10
274 ridageyr lbxtc 0.10
267 bpxdi1 lbxtr 0.10
234 lbxglu lbxhscrp 0.09
241 indfmpir lbdhdd 0.09
197 bmxbmi bpxdi1 0.09
252 lbxhscrp lbdhdd -0.08
266 bpxsy1 lbxtr 0.08
53 ridageyr vigorous_minutes -0.08
106 vigorous_minutes dr1tkcal 0.08
36 ridageyr indfmpir 0.08
210 sedentary_minutes lbxglu 0.07
240 ridageyr lbdhdd 0.07
245 dr1tkcal lbdhdd -0.07
247 dr1tsodi lbdhdd -0.07
140 vigorous_minutes dr1tsodi 0.07
262 dr1tkcal lbxtr 0.07
224 indfmpir lbxhscrp -0.07
264 dr1tsodi lbxtr 0.07
209 moderate_minutes lbxglu 0.06
260 moderate_minutes lbxtr 0.06
250 bpxdi1 lbdhdd -0.06
194 dr1tkcal bpxdi1 0.05
123 vigorous_minutes dr1ttfat 0.05
228 dr1tkcal lbxhscrp -0.05
230 dr1tsodi lbxhscrp -0.05
161 dr1ttfat bmxbmi 0.05
156 indfmpir bmxbmi -0.05
275 indfmpir lbxtc 0.05
261 sedentary_minutes lbxtr 0.05
196 dr1tsodi bpxdi1 0.05
179 dr1tsodi bpxsy1 -0.04
263 dr1ttfat lbxtr 0.04
223 ridageyr lbxhscrp 0.04
246 dr1ttfat lbdhdd -0.04
87 ridageyr sedentary_minutes 0.04
227 sedentary_minutes lbxhscrp 0.04
229 dr1ttfat lbxhscrp -0.04
159 sedentary_minutes bmxbmi 0.04
208 vigorous_minutes lbxglu -0.04
178 dr1ttfat bpxsy1 -0.04
122 indfmpir dr1ttfat 0.04
257 ridageyr lbxtr 0.04
176 sedentary_minutes bpxsy1 0.04
174 vigorous_minutes bpxsy1 -0.04
177 dr1tkcal bpxsy1 -0.04
139 indfmpir dr1tsodi 0.03
281 dr1tsodi lbxtc -0.03
70 ridageyr moderate_minutes -0.03
280 dr1ttfat lbxtc -0.03
195 dr1ttfat bpxdi1 0.03
190 indfmpir bpxdi1 0.03
192 moderate_minutes bpxdi1 0.03
89 vigorous_minutes sedentary_minutes -0.03
173 indfmpir bpxsy1 -0.03
88 indfmpir sedentary_minutes 0.03
90 moderate_minutes sedentary_minutes -0.03
279 dr1tkcal lbxtc -0.03
162 dr1tsodi bmxbmi 0.03
207 indfmpir lbxglu 0.03
232 bpxsy1 lbxhscrp 0.03
155 ridageyr bmxbmi 0.02
269 lbxhscrp lbxtr 0.02
285 lbxglu lbxtc -0.02
52 seqn vigorous_minutes -0.02
244 sedentary_minutes lbdhdd -0.02
243 moderate_minutes lbdhdd 0.02
205 seqn lbxglu -0.02
71 indfmpir moderate_minutes -0.02
189 ridageyr bpxdi1 0.02
225 vigorous_minutes lbxhscrp -0.02
259 vigorous_minutes lbxtr -0.01
157 vigorous_minutes bmxbmi -0.01
86 seqn sedentary_minutes 0.01
211 dr1tkcal lbxglu -0.01
188 seqn bpxdi1 0.01
258 indfmpir lbxtr 0.01
193 sedentary_minutes bpxdi1 0.01
105 indfmpir dr1tkcal 0.01
158 moderate_minutes bmxbmi 0.01
239 seqn lbdhdd 0.01
160 dr1tkcal bmxbmi 0.01
54 indfmpir vigorous_minutes -0.01
282 bmxbmi lbxtc 0.01
107 moderate_minutes dr1tkcal 0.01
137 seqn dr1tsodi 0.01
171 seqn bpxsy1 0.01
249 bpxsy1 lbdhdd 0.01
256 seqn lbxtr 0.01
142 sedentary_minutes dr1tsodi -0.01
226 moderate_minutes lbxhscrp -0.01
18 seqn ridageyr 0.01
212 dr1ttfat lbxglu 0.01
108 sedentary_minutes dr1tkcal 0.00
277 moderate_minutes lbxtc 0.00
222 seqn lbxhscrp 0.00
175 moderate_minutes bpxsy1 0.00
124 moderate_minutes dr1ttfat 0.00
286 lbxhscrp lbxtc 0.00
103 seqn dr1tkcal 0.00
120 seqn dr1ttfat 0.00
125 sedentary_minutes dr1ttfat 0.00
69 seqn moderate_minutes 0.00
233 bpxdi1 lbxhscrp 0.00
213 dr1tsodi lbxglu 0.00
154 seqn bmxbmi 0.00
242 vigorous_minutes lbdhdd 0.00
216 bpxdi1 lbxglu 0.00
141 moderate_minutes dr1tsodi 0.00
35 seqn indfmpir 0.00
278 sedentary_minutes lbxtc 0.00
273 seqn lbxtc 0.00
191 vigorous_minutes bpxdi1 0.00
276 vigorous_minutes lbxtc 0.00
numeric_predictors <- c("ridageyr", "indfmpir", "vigorous_minutes",
                        "moderate_minutes", "sedentary_minutes", 
                        "bmxbmi", "bpxsy1", "bpxdi1", "lbxglu", 
                        "lbxhscrp", "lbdhdd", "lbxtr")

par(mfrow = c(3, 4))
for (var in numeric_predictors) {
  boxplot(nhanes_model4[[var]] ~ nhanes_model4$lbxtc, 
          main = paste(var, "vs Total Cholesterol"), 
          xlab = "Total Cholesterol", ylab = var, col = "lightblue")
}

continuous_vars <- nhanes_model4 %>%
  dplyr::select(
    lbxtc, ridageyr, indfmpir,
    vigorous_minutes, moderate_minutes, sedentary_minutes,
    dr1tkcal, dr1ttfat, dr1tsodi,
    bmxbmi, bpxsy1, bpxdi1,
    lbxglu, lbxhscrp, lbdhdd, lbxtr
  )

# relationships between predictors
ggpairs(continuous_vars, 
        progress = FALSE,
        upper = list(continuous = wrap("cor", size = 3)))

# TRANSFORMED PREDICTORS

nhanes_model4_transformed <- nhanes_model4 |>
  dplyr::select(-c(lbxhscrp, lbxtr, dr1tsodi, dr1tkcal, dr1ttfat, lbdhdd, lbxglu, bmxbmi))


df_character_wide <- nhanes_model4_transformed %>%
  select_if(function(col) !is.numeric(col) | all(col == .$lbxtc)) %>%
  pivot_longer(cols = -lbxtc, names_to = "variable", values_to = "value")

df_character_wide %>%
  ggplot(aes(x = value, y = lbxtc)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

numeric_vars <- nhanes_model4_transformed %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Correlation with total cholesterol
corr_target <- cor_matrix[, "lbxtc"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)

kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Total Cholesterol"), digits = 2)
Correlation with Total Cholesterol
lbxtc 1.00
log_lbxtc 0.99
log_lbdhdd 0.17
log_lbxtr 0.13
bpxdi1 0.13
bpxsy1 0.13
ridageyr 0.10
log_lbxhscrp 0.06
indfmpir 0.05
log_bmxbmi 0.02
vigorous_minutes 0.00
seqn 0.00
sedentary_minutes 0.00
moderate_minutes 0.00
log_dr1tkcal -0.02
log_lbxglu -0.02
log_dr1ttfat -0.03
log_dr1tsodi -0.04
# corrplot::corrplot(cor_matrix, method = "color", type = "upper",
#          tl.col = "black", tl.srt = 45, addCoef.col = "black",
#          number.cex = 0.7, diag = FALSE)
corrplot::corrplot(cor_matrix,
        method = "circle",
        type = "upper",
        tl.cex = 0.5,
        tl.srt = 45) # Rotate text diagonally

# Remove self-correlation
corr_target_no_self <- corr_target[names(corr_target) != "lbxtc"]

# Top 3 positive and negative correlations
top_pos <- sort(corr_target_no_self, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target_no_self, decreasing = FALSE)[1:3]

combined_df <- data.frame(
  Feature = c(names(top_pos), names(top_neg)),
  Correlation = c(top_pos, top_neg)
) %>% 
  mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Correlated with Total Cholesterol",
       x = "Feature",
       y = "Correlation with Total Cholesterol",
       fill = "Direction") +
  theme_minimal()

# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
  filter(Var1 != "SalePrice") |>
  filter(Var2 != "SalePrice")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
kable(as.data.frame(sorted_cor), digits = 2)
Var1 Var2 value
261 lbxtc log_lbxtc 0.99
246 log_dr1tkcal log_dr1ttfat 0.88
228 log_dr1tkcal log_dr1tsodi 0.82
247 log_dr1tsodi log_dr1ttfat 0.79
281 log_lbxtr log_lbdhdd -0.48
316 log_lbxhscrp log_bmxbmi 0.48
110 ridageyr bpxsy1 0.47
133 bpxsy1 bpxdi1 0.35
299 log_lbxtr log_lbxglu 0.31
322 log_lbdhdd log_bmxbmi -0.30
317 log_lbxtr log_bmxbmi 0.25
323 log_lbxglu log_bmxbmi 0.22
304 log_lbdhdd log_lbxglu -0.22
295 bpxsy1 log_lbxglu 0.21
280 log_lbxhscrp log_lbdhdd -0.19
290 ridageyr log_lbxglu 0.19
285 log_lbxtc log_lbdhdd 0.18
279 lbxtc log_lbdhdd 0.17
298 log_lbxhscrp log_lbxglu 0.16
190 log_lbxhscrp log_lbxtr 0.15
188 bpxdi1 log_lbxtr 0.14
260 bpxdi1 log_lbxtc 0.14
189 lbxtc log_lbxtr 0.13
313 bpxsy1 log_bmxbmi 0.13
152 bpxdi1 lbxtc 0.13
263 log_lbxtr log_lbxtc 0.13
76 vigorous_minutes moderate_minutes 0.13
187 bpxsy1 log_lbxtr 0.13
151 bpxsy1 lbxtc 0.13
218 ridageyr log_dr1tsodi -0.12
259 bpxsy1 log_lbxtc 0.12
200 ridageyr log_dr1tkcal -0.11
146 ridageyr lbxtc 0.10
165 indfmpir log_lbxhscrp -0.09
273 indfmpir log_lbdhdd 0.09
314 bpxdi1 log_bmxbmi 0.09
226 log_lbxhscrp log_dr1tsodi -0.09
164 ridageyr log_lbxhscrp 0.09
254 ridageyr log_lbxtc 0.09
208 log_lbxhscrp log_dr1tkcal -0.08
169 bpxsy1 log_lbxhscrp 0.08
294 sedentary_minutes log_lbxglu 0.08
56 ridageyr vigorous_minutes -0.08
38 ridageyr indfmpir 0.08
236 ridageyr log_dr1ttfat -0.07
182 ridageyr log_lbxtr 0.07
272 ridageyr log_lbdhdd 0.07
282 log_dr1tkcal log_lbdhdd -0.07
262 log_lbxhscrp log_lbxtc 0.07
171 lbxtc log_lbxhscrp 0.06
283 log_dr1tsodi log_lbdhdd -0.06
278 bpxdi1 log_lbdhdd -0.06
219 indfmpir log_dr1tsodi 0.06
202 vigorous_minutes log_dr1tkcal 0.06
244 log_lbxhscrp log_dr1ttfat -0.06
237 indfmpir log_dr1ttfat 0.06
227 log_lbxtr log_dr1tsodi 0.06
209 log_lbxtr log_dr1tkcal 0.06
293 moderate_minutes log_lbxglu 0.05
220 vigorous_minutes log_dr1tsodi 0.05
206 bpxdi1 log_dr1tkcal 0.05
308 ridageyr log_bmxbmi 0.05
186 sedentary_minutes log_lbxtr 0.05
255 indfmpir log_lbxtc 0.05
147 indfmpir lbxtc 0.05
170 bpxdi1 log_lbxhscrp 0.05
224 bpxdi1 log_dr1tsodi 0.05
320 log_dr1ttfat log_bmxbmi 0.04
292 vigorous_minutes log_lbxglu -0.04
92 ridageyr sedentary_minutes 0.04
312 sedentary_minutes log_bmxbmi 0.04
238 vigorous_minutes log_dr1ttfat 0.04
309 indfmpir log_bmxbmi -0.04
114 sedentary_minutes bpxsy1 0.04
201 indfmpir log_dr1tkcal 0.04
223 bpxsy1 log_dr1tsodi -0.04
185 moderate_minutes log_lbxtr 0.04
112 vigorous_minutes bpxsy1 -0.04
225 lbxtc log_dr1tsodi -0.04
265 log_dr1tsodi log_lbxtc -0.03
291 indfmpir log_lbxglu 0.03
266 log_dr1ttfat log_lbxtc -0.03
74 ridageyr moderate_minutes -0.03
303 log_lbxtc log_lbxglu -0.03
243 lbxtc log_dr1ttfat -0.03
129 indfmpir bpxdi1 0.03
131 moderate_minutes bpxdi1 0.03
284 log_dr1ttfat log_lbdhdd -0.03
205 bpxsy1 log_dr1tkcal -0.03
94 vigorous_minutes sedentary_minutes -0.03
111 indfmpir bpxsy1 -0.03
321 log_lbxtc log_bmxbmi 0.03
93 indfmpir sedentary_minutes 0.03
166 vigorous_minutes log_lbxhscrp -0.03
168 sedentary_minutes log_lbxhscrp 0.03
95 moderate_minutes sedentary_minutes -0.03
242 bpxdi1 log_dr1ttfat 0.03
184 vigorous_minutes log_lbxtr -0.03
319 log_dr1tsodi log_bmxbmi 0.02
241 bpxsy1 log_dr1ttfat -0.02
297 lbxtc log_lbxglu -0.02
315 lbxtc log_bmxbmi 0.02
245 log_lbxtr log_dr1ttfat 0.02
207 lbxtc log_dr1tkcal -0.02
264 log_dr1tkcal log_lbxtc -0.02
55 seqn vigorous_minutes -0.02
276 sedentary_minutes log_lbdhdd -0.02
75 indfmpir moderate_minutes -0.02
128 ridageyr bpxdi1 0.02
289 seqn log_lbxglu -0.02
163 seqn log_lbxhscrp 0.02
221 moderate_minutes log_dr1tsodi -0.01
91 seqn sedentary_minutes 0.01
311 moderate_minutes log_bmxbmi 0.01
127 seqn bpxdi1 0.01
132 sedentary_minutes bpxdi1 0.01
310 vigorous_minutes log_bmxbmi -0.01
239 moderate_minutes log_dr1ttfat -0.01
271 seqn log_lbdhdd 0.01
300 log_dr1tkcal log_lbxglu -0.01
235 seqn log_dr1ttfat -0.01
57 indfmpir vigorous_minutes -0.01
275 moderate_minutes log_lbdhdd 0.01
240 sedentary_minutes log_dr1ttfat 0.01
302 log_dr1ttfat log_lbxglu 0.01
203 moderate_minutes log_dr1tkcal -0.01
222 sedentary_minutes log_dr1tsodi -0.01
109 seqn bpxsy1 0.01
19 seqn ridageyr 0.01
296 bpxdi1 log_lbxglu 0.01
318 log_dr1tkcal log_bmxbmi 0.01
301 log_dr1tsodi log_lbxglu 0.00
149 moderate_minutes lbxtc 0.00
217 seqn log_dr1tsodi 0.00
113 moderate_minutes bpxsy1 0.00
204 sedentary_minutes log_dr1tkcal 0.00
183 indfmpir log_lbxtr 0.00
181 seqn log_lbxtr 0.00
73 seqn moderate_minutes 0.00
258 sedentary_minutes log_lbxtc 0.00
256 vigorous_minutes log_lbxtc 0.00
257 moderate_minutes log_lbxtc 0.00
277 bpxsy1 log_lbdhdd 0.00
167 moderate_minutes log_lbxhscrp 0.00
37 seqn indfmpir 0.00
199 seqn log_dr1tkcal 0.00
274 vigorous_minutes log_lbdhdd 0.00
307 seqn log_bmxbmi 0.00
150 sedentary_minutes lbxtc 0.00
145 seqn lbxtc 0.00
253 seqn log_lbxtc 0.00
130 vigorous_minutes bpxdi1 0.00
148 vigorous_minutes lbxtc 0.00
numeric_predictors <- c("ridageyr", "indfmpir", "vigorous_minutes",
                        "moderate_minutes", "sedentary_minutes", 
                        "bmxbmi", "bpxsy1", "bpxdi1", "lbxglu", 
                        "lbxhscrp", "lbdhdd", "lbxtr")

par(mfrow = c(3, 4))
for (var in numeric_predictors) {
  boxplot(nhanes_model4[[var]] ~ nhanes_model4$lbxtc, 
          main = paste(var, "vs Total Cholesterol"), 
          xlab = "Total Cholesterol", ylab = var, col = "lightblue")
}

# relationships between predictors
#ggpairs(nhanes_model4_transformed, progress = FALSE)

3.4 Adding new features

To enhance the predictive modeling of total cholesterol, several new features were derived from the raw NHANES data. Age was categorized into young (<30), middle-aged (30–65), and older (>65) groups, while BMI was classified into underweight, normal, overweight, and obese categories. Physical activity variables were combined into a total activity measure, and a binary indicator was created to flag participants meeting recommended activity levels. Dietary intake was normalized by creating fat-to-calorie and sodium-to-calorie ratios. Finally, skewed laboratory and dietary variables, including triglycerides, CRP, total calories, fat, and sodium, were log-transformed to reduce skewness and improve model stability.

nhanes_model5 <- nhanes_model4 %>%
  # Age groups
  mutate(
    AGE_YOUNG = as.numeric(ridageyr < 30),
    AGE_MID = as.numeric(ridageyr >= 30 & ridageyr <= 65),
    AGE_OLD = as.numeric(ridageyr > 65)
  ) %>%
  # BMI categories
  mutate(
    BMI_UNDER = as.numeric(bmxbmi < 18.5),
    BMI_NORMAL = as.numeric(bmxbmi >= 18.5 & bmxbmi < 25),
    BMI_OVER = as.numeric(bmxbmi >= 25 & bmxbmi < 30),
    BMI_OBESE = as.numeric(bmxbmi >= 30)
  ) %>%
  # Physical activity
  mutate(
    TOTAL_ACTIVITY = vigorous_minutes + moderate_minutes,
    ACTIVE_FLAG = as.numeric(TOTAL_ACTIVITY >= 150)  # Meets 150 min/week guideline
  ) %>%
  # Dietary ratios
  mutate(
#    FAT_RATIO = ifelse(dr1tkcal > 0, dr1ttfat / dr1tkcal, NA),
#    SODIUM_RATIO = ifelse(dr1tkcal > 0, dr1tsodi / dr1tkcal, NA)
    HIGH_FAT_DIET = as.numeric((exp(log_dr1ttfat) * 9 / exp(log_dr1tkcal)) > 0.35),
    HIGH_SODIUM = as.numeric(exp(log_dr1tsodi) > 2300)
  ) 

Split Data into Training and Test Sets

To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.

set.seed(123)

train_index <- createDataPartition(nhanes_model4$lbxtc, p = 0.8, list = FALSE)

# Create train/test splits for each dataset
# Removed seqn for each dataset

# Original predictors - removed transformed predictors
nhanes_model_original <- nhanes_model4 |>
  dplyr::select(-c(log_lbxhscrp, log_lbxtr, log_dr1tkcal, log_dr1tsodi, log_dr1ttfat, log_lbxtc, log_lbdhdd, log_lbxglu, log_bmxbmi, seqn))
train_data_original <- nhanes_model_original[train_index, ]
test_data_original <- nhanes_model_original[-train_index, ]

# transformed predictors - ONLY the transformed predictors (remove transformed target variable log_lbxtc)
nhanes_model_transformed <- nhanes_model4 |>
  dplyr::select(-c(lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, log_lbxtc, lbdhdd, lbxglu, bmxbmi, seqn))
train_data_transformed <- nhanes_model_transformed[train_index, ]
test_data_transformed <- nhanes_model_transformed[-train_index, ]

# transformed predictors + log transformed target variable log_lbxtc
nhanes_model_transformed_target <- nhanes_model4 |>
  dplyr::select(-c(lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, lbxtc, lbdhdd, lbxglu, bmxbmi, seqn))
train_data_transformed_target <- nhanes_model_transformed_target[train_index, ]
test_data_transformed_target <- nhanes_model_transformed_target[-train_index, ]

# new features

nhanes_model_transformed_newfeatures <- nhanes_model5 |>
    dplyr::select(
    # Target
    log_lbxtc,
    
    # Demographics (NO seqn)
    ridageyr, riagendr, ridreth3, dmdeduc2, indfmpir,
    
    # Lifestyle
    smq020, alq111, sedentary_minutes,
    
    # NEW FEATURES (activity & diet)
    TOTAL_ACTIVITY, ACTIVE_FLAG, HIGH_FAT_DIET, HIGH_SODIUM,
    
    # Age/BMI categories
    AGE_YOUNG, AGE_MID,  # AGE_OLD is reference
    BMI_UNDER, BMI_NORMAL, BMI_OVER,  # BMI_OBESE is reference
    
    # Blood pressure (raw values)
    bpxsy1, bpxdi1,
    
    # Biomarkers (LOG versions ONLY - no raw)
    log_lbxhscrp, log_lbxtr, log_lbdhdd, log_lbxglu
    
    )

train_data_transformed_newfeatures <- nhanes_model_transformed_newfeatures[train_index, ]
test_data_transformed_newfeatures <- nhanes_model_transformed_newfeatures[-train_index, ]

4.Build Models

Model 1: Multiple Linear Regression - Full model

# Model 1: original predictors + original target variable lbxtc
set.seed(123)
model_full <- lm(lbxtc ~ ., data = train_data_original)
summary(model_full)
## 
## Call:
## lm(formula = lbxtc ~ ., data = train_data_original)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -205.386  -26.880   -3.902   22.995  246.556 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.859e+01  6.803e+00  14.493  < 2e-16 ***
## ridageyr           6.551e-02  4.048e-02   1.618 0.105684    
## riagendrFemale     4.976e+00  1.400e+00   3.554 0.000384 ***
## ridreth32          2.054e+00  2.584e+00   0.795 0.426837    
## ridreth33         -2.334e+00  2.092e+00  -1.116 0.264487    
## ridreth34         -6.533e+00  2.219e+00  -2.943 0.003264 ** 
## ridreth36          1.608e+00  2.514e+00   0.640 0.522303    
## ridreth37         -2.719e+00  3.223e+00  -0.844 0.398850    
## dmdeduc22         -2.821e+00  2.924e+00  -0.965 0.334658    
## dmdeduc23         -1.927e+00  2.661e+00  -0.724 0.468872    
## dmdeduc24         -3.381e+00  2.612e+00  -1.294 0.195707    
## dmdeduc25         -2.496e-01  2.852e+00  -0.088 0.930271    
## dmdeduc27          5.277e+01  2.790e+01   1.891 0.058689 .  
## dmdeduc29         -6.112e+00  1.410e+01  -0.434 0.664619    
## indfmpir           4.399e-01  4.597e-01   0.957 0.338673    
## smq0202           -3.037e+00  1.356e+00  -2.240 0.025171 *  
## alq1112            1.032e+00  2.149e+00   0.480 0.630904    
## vigorous_minutes   5.556e-05  1.868e-03   0.030 0.976276    
## moderate_minutes  -3.243e-04  1.122e-03  -0.289 0.772516    
## sedentary_minutes  7.255e-05  8.792e-04   0.083 0.934240    
## dr1tkcal           1.876e-03  1.545e-03   1.214 0.224706    
## dr1ttfat          -3.107e-02  2.910e-02  -1.068 0.285671    
## dr1tsodi          -4.430e-04  5.512e-04  -0.804 0.421583    
## bmxbmi             2.602e-01  9.389e-02   2.771 0.005609 ** 
## bpxsy1             1.552e-01  4.148e-02   3.741 0.000186 ***
## bpxdi1             3.544e-01  5.282e-02   6.710 2.21e-11 ***
## lbxglu            -5.781e-02  1.749e-02  -3.306 0.000956 ***
## lbxhscrp          -1.797e-02  8.319e-02  -0.216 0.828978    
## lbdhdd             6.508e-01  4.593e-02  14.167  < 2e-16 ***
## lbxtr              5.909e-02  5.548e-03  10.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.14 on 4113 degrees of freedom
## Multiple R-squared:  0.1082, Adjusted R-squared:  0.1019 
## F-statistic: 17.21 on 29 and 4113 DF,  p-value: < 2.2e-16

Model 2: Multiple Linear Regression: Full model with Log transformation of the skewed predictors

Using log transformations for lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, lbdhdd, lbxglu, and bmxbmi.

# Model 2: LM transformed predictors + original target variable
set.seed(123)

# Full linear model using log-transformed predictors
# model_full_log <- lm(lbxtc ~ log_chol + log_hdl + log_bmi + bpxsy1 + bpxdi1 +
#                        log_glucose + vigorous_minutes + moderate_minutes +
#                        sedentary_minutes + indfmpir + ridageyr +
#                        log_crp + log_trig + log_sodium + log_calories + log_fat +
#                        riagendr + ridreth3 + dmdeduc2 + smq020 + alq111,
#                      data = train_full_log)
model_full_transformed <- lm(lbxtc ~ .,
                     data = train_data_transformed)
summary(model_full_transformed)
## 
## Call:
## lm(formula = lbxtc ~ ., data = train_data_transformed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.489  -26.461   -3.879   22.866  240.852 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6.413e+01  2.547e+01  -2.517 0.011862 *  
## ridageyr           3.486e-02  4.000e-02   0.871 0.383550    
## riagendrFemale     3.096e+00  1.388e+00   2.231 0.025761 *  
## ridreth32          2.335e+00  2.546e+00   0.917 0.359228    
## ridreth33         -1.478e+00  2.061e+00  -0.717 0.473536    
## ridreth34         -4.676e+00  2.190e+00  -2.135 0.032800 *  
## ridreth36          3.294e+00  2.486e+00   1.325 0.185150    
## ridreth37         -1.835e+00  3.173e+00  -0.578 0.563126    
## dmdeduc22         -3.932e+00  2.883e+00  -1.364 0.172719    
## dmdeduc23         -2.363e+00  2.621e+00  -0.901 0.367508    
## dmdeduc24         -3.947e+00  2.576e+00  -1.532 0.125509    
## dmdeduc25         -9.525e-01  2.811e+00  -0.339 0.734730    
## dmdeduc27          5.492e+01  2.748e+01   1.999 0.045727 *  
## dmdeduc29         -7.819e+00  1.389e+01  -0.563 0.573419    
## indfmpir           6.082e-01  4.531e-01   1.342 0.179577    
## smq0202           -2.473e+00  1.337e+00  -1.850 0.064415 .  
## alq1112            7.991e-01  2.117e+00   0.377 0.705910    
## vigorous_minutes   5.994e-04  1.841e-03   0.326 0.744700    
## moderate_minutes  -4.654e-04  1.105e-03  -0.421 0.673633    
## sedentary_minutes -4.768e-05  8.658e-04  -0.055 0.956086    
## bpxsy1             1.488e-01  4.091e-02   3.638 0.000278 ***
## bpxdi1             2.856e-01  5.227e-02   5.464 4.94e-08 ***
## log_lbxhscrp       3.224e+00  6.367e-01   5.064 4.29e-07 ***
## log_lbxtr          1.826e+01  1.186e+00  15.397  < 2e-16 ***
## log_dr1tkcal       4.812e+00  2.980e+00   1.615 0.106441    
## log_dr1tsodi      -2.074e+00  1.969e+00  -1.053 0.292402    
## log_dr1ttfat      -2.845e+00  2.219e+00  -1.282 0.199821    
## log_lbdhdd         4.536e+01  2.728e+00  16.627  < 2e-16 ***
## log_lbxglu        -1.505e+01  2.734e+00  -5.505 3.91e-08 ***
## log_bmxbmi         3.525e+00  3.164e+00   1.114 0.265378    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.55 on 4113 degrees of freedom
## Multiple R-squared:  0.1351, Adjusted R-squared:  0.129 
## F-statistic: 22.15 on 29 and 4113 DF,  p-value: < 2.2e-16

Model 3: Multiple Linear Regression: Full model with Log transformation of the skewed predictors + Log transformation of target variable

# Model 3: LM transformed predictors + transformed target variable
set.seed(123)

# Full linear model using log-transformed predictors and target
model_full_transformed_target <- lm(log_lbxtc ~ .,
                     data = train_data_transformed_target)

summary(model_full_transformed_target)
## 
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_target)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79528 -0.13458 -0.00201  0.13493  0.85593 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.781e+00  1.351e-01  27.974  < 2e-16 ***
## ridageyr           9.798e-05  2.122e-04   0.462  0.64428    
## riagendrFemale     1.522e-02  7.363e-03   2.067  0.03879 *  
## ridreth32          1.328e-02  1.351e-02   0.983  0.32557    
## ridreth33         -9.215e-03  1.094e-02  -0.843  0.39954    
## ridreth34         -2.878e-02  1.162e-02  -2.477  0.01327 *  
## ridreth36          1.696e-02  1.319e-02   1.286  0.19841    
## ridreth37         -1.084e-02  1.684e-02  -0.644  0.51950    
## dmdeduc22         -1.849e-02  1.530e-02  -1.209  0.22677    
## dmdeduc23         -1.390e-02  1.391e-02  -0.999  0.31767    
## dmdeduc24         -2.196e-02  1.366e-02  -1.607  0.10810    
## dmdeduc25         -3.279e-03  1.491e-02  -0.220  0.82598    
## dmdeduc27          2.616e-01  1.458e-01   1.794  0.07290 .  
## dmdeduc29         -3.810e-02  7.367e-02  -0.517  0.60510    
## indfmpir           2.862e-03  2.404e-03   1.190  0.23395    
## smq0202           -1.135e-02  7.093e-03  -1.600  0.10964    
## alq1112            4.982e-03  1.123e-02   0.443  0.65746    
## vigorous_minutes   3.891e-06  9.765e-06   0.398  0.69033    
## moderate_minutes  -1.840e-06  5.862e-06  -0.314  0.75363    
## sedentary_minutes -3.211e-07  4.593e-06  -0.070  0.94427    
## bpxsy1             6.533e-04  2.170e-04   3.010  0.00263 ** 
## bpxdi1             1.742e-03  2.773e-04   6.280 3.74e-10 ***
## log_lbxhscrp       1.809e-02  3.378e-03   5.357 8.94e-08 ***
## log_lbxtr          9.672e-02  6.292e-03  15.371  < 2e-16 ***
## log_dr1tkcal       2.774e-02  1.581e-02   1.754  0.07943 .  
## log_dr1tsodi      -9.894e-03  1.045e-02  -0.947  0.34371    
## log_dr1ttfat      -1.821e-02  1.177e-02  -1.547  0.12185    
## log_lbdhdd         2.564e-01  1.447e-02  17.712  < 2e-16 ***
## log_lbxglu        -8.268e-02  1.451e-02  -5.700 1.28e-08 ***
## log_bmxbmi         3.078e-02  1.679e-02   1.834  0.06677 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2045 on 4113 degrees of freedom
## Multiple R-squared:  0.1424, Adjusted R-squared:  0.1364 
## F-statistic: 23.55 on 29 and 4113 DF,  p-value: < 2.2e-16

Model 4: Stepwise AIC on Model 1 (Original data)

AIC=30412.97

# Model 4
set.seed(123)

step_model_full <- stats::step(model_full, direction = "both")
## Start:  AIC=30416.75
## lbxtc ~ ridageyr + riagendr + ridreth3 + dmdeduc2 + indfmpir + 
##     smq020 + alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes + 
##     dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + 
##     lbxglu + lbxhscrp + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - dmdeduc2           6     12506 6314741 30413
## - vigorous_minutes   1         1 6302237 30415
## - sedentary_minutes  1        10 6302246 30415
## - lbxhscrp           1        72 6302307 30415
## - moderate_minutes   1       128 6302363 30415
## - alq111             1       354 6302589 30415
## - dr1tsodi           1       990 6303225 30415
## - indfmpir           1      1403 6303638 30416
## - dr1ttfat           1      1747 6303982 30416
## - dr1tkcal           1      2259 6304495 30416
## <none>                           6302235 30417
## - ridageyr           1      4013 6306248 30417
## - smq020             1      7686 6309921 30420
## - bmxbmi             1     11768 6314003 30423
## - lbxglu             1     16743 6318979 30426
## - ridreth3           5     30282 6332517 30427
## - riagendr           1     19350 6321585 30428
## - bpxsy1             1     21447 6323682 30429
## - bpxdi1             1     68996 6371232 30460
## - lbxtr              1    173850 6476085 30528
## - lbdhdd             1    307535 6609770 30612
## 
## Step:  AIC=30412.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes + 
##     dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + 
##     lbxglu + lbxhscrp + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - vigorous_minutes   1         1 6314743 30411
## - sedentary_minutes  1         4 6314745 30411
## - lbxhscrp           1       127 6314868 30411
## - moderate_minutes   1       160 6314902 30411
## - alq111             1       422 6315164 30411
## - dr1tsodi           1      1116 6315857 30412
## - dr1ttfat           1      2051 6316793 30412
## - dr1tkcal           1      2686 6317427 30413
## - indfmpir           1      2697 6317439 30413
## <none>                           6314741 30413
## - ridageyr           1      5813 6320554 30415
## - smq020             1      6883 6321624 30416
## + dmdeduc2           6     12506 6302235 30417
## - bmxbmi             1     11978 6326720 30419
## - lbxglu             1     16906 6331647 30422
## - riagendr           1     19002 6333743 30423
## - bpxsy1             1     21650 6336391 30425
## - ridreth3           5     37589 6352330 30428
## - bpxdi1             1     67805 6382546 30455
## - lbxtr              1    173060 6487802 30523
## - lbdhdd             1    310092 6624833 30610
## 
## Step:  AIC=30410.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + moderate_minutes + sedentary_minutes + dr1tkcal + 
##     dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + lbxglu + 
##     lbxhscrp + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - sedentary_minutes  1         4 6314746 30409
## - lbxhscrp           1       126 6314869 30409
## - moderate_minutes   1       170 6314913 30409
## - alq111             1       423 6315165 30409
## - dr1tsodi           1      1118 6315860 30410
## - dr1ttfat           1      2051 6316794 30410
## - dr1tkcal           1      2685 6317428 30411
## - indfmpir           1      2697 6317439 30411
## <none>                           6314743 30411
## - ridageyr           1      5842 6320585 30413
## + vigorous_minutes   1         1 6314741 30413
## - smq020             1      6882 6321624 30414
## + dmdeduc2           6     12506 6302237 30415
## - bmxbmi             1     11983 6326725 30417
## - lbxglu             1     16911 6331654 30420
## - riagendr           1     19052 6333795 30421
## - bpxsy1             1     21649 6336392 30423
## - ridreth3           5     37617 6352359 30426
## - bpxdi1             1     67807 6382549 30453
## - lbxtr              1    173089 6487832 30521
## - lbdhdd             1    310091 6624834 30608
## 
## Step:  AIC=30408.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + moderate_minutes + dr1tkcal + dr1ttfat + dr1tsodi + 
##     bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbxhscrp + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - lbxhscrp           1       126 6314872 30407
## - moderate_minutes   1       172 6314918 30407
## - alq111             1       422 6315168 30407
## - dr1tsodi           1      1119 6315865 30408
## - dr1ttfat           1      2052 6316798 30408
## - dr1tkcal           1      2689 6317435 30409
## - indfmpir           1      2703 6317449 30409
## <none>                           6314746 30409
## - ridageyr           1      5845 6320591 30411
## + sedentary_minutes  1         4 6314743 30411
## + vigorous_minutes   1         1 6314745 30411
## - smq020             1      6882 6321628 30412
## + dmdeduc2           6     12499 6302247 30413
## - bmxbmi             1     11986 6326732 30415
## - lbxglu             1     16929 6331676 30418
## - riagendr           1     19092 6333839 30420
## - bpxsy1             1     21673 6336419 30421
## - ridreth3           5     37646 6352392 30424
## - bpxdi1             1     67814 6382560 30451
## - lbxtr              1    173420 6488166 30519
## - lbdhdd             1    310089 6624835 30606
## 
## Step:  AIC=30407.05
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + moderate_minutes + dr1tkcal + dr1ttfat + dr1tsodi + 
##     bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - moderate_minutes   1       171 6315043 30405
## - alq111             1       409 6315282 30405
## - dr1tsodi           1      1106 6315979 30406
## - dr1ttfat           1      2037 6316909 30406
## - dr1tkcal           1      2682 6317554 30407
## - indfmpir           1      2746 6317618 30407
## <none>                           6314872 30407
## - ridageyr           1      5814 6320686 30409
## + lbxhscrp           1       126 6314746 30409
## + sedentary_minutes  1         3 6314869 30409
## + vigorous_minutes   1         1 6314871 30409
## - smq020             1      6819 6321691 30410
## + dmdeduc2           6     12554 6302318 30411
## - bmxbmi             1     12075 6326948 30413
## - lbxglu             1     17092 6331964 30416
## - riagendr           1     18970 6333842 30418
## - bpxsy1             1     21764 6336637 30419
## - ridreth3           5     38012 6352885 30422
## - bpxdi1             1     67836 6382708 30449
## - lbxtr              1    173768 6488640 30518
## - lbdhdd             1    311113 6625986 30604
## 
## Step:  AIC=30405.16
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + 
##     bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - alq111             1       416 6315459 30403
## - dr1tsodi           1      1095 6316139 30404
## - dr1ttfat           1      2030 6317074 30405
## - dr1tkcal           1      2671 6317715 30405
## - indfmpir           1      2773 6317816 30405
## <none>                           6315043 30405
## - ridageyr           1      5909 6320952 30407
## + moderate_minutes   1       171 6314872 30407
## + lbxhscrp           1       125 6314918 30407
## + vigorous_minutes   1        11 6315032 30407
## + sedentary_minutes  1         5 6315038 30407
## - smq020             1      6789 6321833 30408
## + dmdeduc2           6     12597 6302447 30409
## - bmxbmi             1     12027 6327071 30411
## - lbxglu             1     17307 6332351 30415
## - riagendr           1     19111 6334155 30416
## - bpxsy1             1     21846 6336889 30418
## - ridreth3           5     38111 6353154 30420
## - bpxdi1             1     67695 6382738 30447
## - lbxtr              1    173608 6488652 30516
## - lbdhdd             1    311739 6626782 30603
## 
## Step:  AIC=30403.44
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + 
##     lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - dr1tsodi           1      1113 6316572 30402
## - dr1ttfat           1      2033 6317492 30403
## - indfmpir           1      2638 6318097 30403
## - dr1tkcal           1      2656 6318115 30403
## <none>                           6315459 30403
## + alq111             1       416 6315043 30405
## + moderate_minutes   1       178 6315282 30405
## - ridageyr           1      5959 6321418 30405
## + lbxhscrp           1       112 6315347 30405
## + vigorous_minutes   1        13 6315446 30405
## + sedentary_minutes  1         4 6315455 30405
## - smq020             1      6400 6321859 30406
## + dmdeduc2           6     12664 6302795 30407
## - bmxbmi             1     11982 6327442 30409
## - lbxglu             1     17484 6332943 30413
## - riagendr           1     19527 6334986 30414
## - bpxsy1             1     22015 6337474 30416
## - ridreth3           5     38933 6354392 30419
## - bpxdi1             1     67308 6382767 30445
## - lbxtr              1    173551 6489011 30514
## - lbdhdd             1    311396 6626855 30601
## 
## Step:  AIC=30402.17
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     dr1tkcal + dr1ttfat + bmxbmi + bpxsy1 + bpxdi1 + lbxglu + 
##     lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - dr1tkcal           1      1763 6318336 30401
## - indfmpir           1      2593 6319165 30402
## - dr1ttfat           1      2684 6319256 30402
## <none>                           6316572 30402
## + dr1tsodi           1      1113 6315459 30403
## + alq111             1       434 6316139 30404
## + moderate_minutes   1       166 6316406 30404
## + lbxhscrp           1       101 6316472 30404
## + vigorous_minutes   1        17 6316555 30404
## + sedentary_minutes  1         5 6316568 30404
## - ridageyr           1      6262 6322834 30404
## - smq020             1      6489 6323062 30404
## + dmdeduc2           6     12788 6303785 30406
## - bmxbmi             1     11673 6328245 30408
## - lbxglu             1     17493 6334065 30412
## - riagendr           1     20116 6336689 30413
## - bpxsy1             1     21965 6338537 30415
## - ridreth3           5     38115 6354687 30417
## - bpxdi1             1     67451 6384023 30444
## - lbxtr              1    173143 6489716 30512
## - lbdhdd             1    311010 6627583 30599
## 
## Step:  AIC=30401.32
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     dr1ttfat + bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - dr1ttfat           1       997 6319333 30400
## - indfmpir           1      2434 6320770 30401
## <none>                           6318336 30401
## + dr1tkcal           1      1763 6316572 30402
## + alq111             1       410 6317925 30403
## - ridageyr           1      5762 6324098 30403
## + dr1tsodi           1       221 6318115 30403
## + moderate_minutes   1       163 6318173 30403
## + lbxhscrp           1       101 6318234 30403
## + vigorous_minutes   1        13 6318323 30403
## + sedentary_minutes  1         8 6318328 30403
## - smq020             1      6573 6324908 30404
## + dmdeduc2           6     13094 6305242 30405
## - bmxbmi             1     11238 6329574 30407
## - lbxglu             1     17995 6336330 30411
## - riagendr           1     18728 6337063 30412
## - bpxsy1             1     22276 6340611 30414
## - ridreth3           5     39916 6358252 30417
## - bpxdi1             1     68062 6386397 30444
## - lbxtr              1    175897 6494233 30513
## - lbdhdd             1    310869 6629204 30598
## 
## Step:  AIC=30399.98
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - indfmpir           1      2293 6321625 30400
## <none>                           6319333 30400
## + dr1tsodi           1      1112 6318221 30401
## + dr1ttfat           1       997 6318336 30401
## + alq111             1       454 6318879 30402
## + moderate_minutes   1       155 6319178 30402
## + dr1tkcal           1        76 6319256 30402
## + lbxhscrp           1        76 6319257 30402
## + vigorous_minutes   1        18 6319315 30402
## + sedentary_minutes  1         5 6319328 30402
## - ridageyr           1      6417 6325750 30402
## - smq020             1      6494 6325826 30402
## + dmdeduc2           6     13170 6306162 30403
## - bmxbmi             1     10892 6330224 30405
## - lbxglu             1     18021 6337354 30410
## - riagendr           1     21858 6341190 30412
## - bpxsy1             1     22385 6341717 30413
## - ridreth3           5     41077 6360410 30417
## - bpxdi1             1     67820 6387153 30442
## - lbxtr              1    175262 6494595 30511
## - lbdhdd             1    309874 6629207 30596
## 
## Step:  AIC=30399.48
## lbxtc ~ ridageyr + riagendr + ridreth3 + smq020 + bmxbmi + bpxsy1 + 
##     bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## <none>                           6321625 30400
## + indfmpir           1      2293 6319333 30400
## + dr1tsodi           1      1006 6320620 30401
## + dr1ttfat           1       856 6320770 30401
## - smq020             1      5649 6327274 30401
## + alq111             1       319 6321307 30401
## + moderate_minutes   1       178 6321448 30401
## + lbxhscrp           1       111 6321514 30401
## + dr1tkcal           1        57 6321568 30401
## + vigorous_minutes   1        16 6321609 30402
## + sedentary_minutes  1        12 6321613 30402
## + dmdeduc2           6     14303 6307323 30402
## - ridageyr           1      7235 6328861 30402
## - bmxbmi             1     11074 6332699 30405
## - lbxglu             1     17574 6339200 30409
## - riagendr           1     20983 6342608 30411
## - bpxsy1             1     21323 6342949 30411
## - ridreth3           5     42534 6364160 30417
## - bpxdi1             1     69815 6391440 30443
## - lbxtr              1    176892 6498518 30512
## - lbdhdd             1    319754 6641379 30602
summary(step_model_full)
## 
## Call:
## lm(formula = lbxtc ~ ridageyr + riagendr + ridreth3 + smq020 + 
##     bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr, data = train_data_original)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -205.991  -26.687   -3.703   22.741  247.549 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    97.05147    6.24730  15.535  < 2e-16 ***
## ridageyr        0.08447    0.03886   2.174 0.029790 *  
## riagendrFemale  4.92693    1.33103   3.702 0.000217 ***
## ridreth32       2.23916    2.57090   0.871 0.383825    
## ridreth33      -2.63478    1.96970  -1.338 0.181082    
## ridreth34      -6.95561    2.11946  -3.282 0.001040 ** 
## ridreth36       2.76851    2.33896   1.184 0.236620    
## ridreth37      -3.10601    3.15328  -0.985 0.324677    
## smq0202        -2.52044    1.31232  -1.921 0.054852 .  
## bmxbmi          0.24385    0.09068   2.689 0.007193 ** 
## bpxsy1          0.15324    0.04107   3.732 0.000193 ***
## bpxdi1          0.35326    0.05232   6.752 1.66e-11 ***
## lbxglu         -0.05884    0.01737  -3.388 0.000712 ***
## lbdhdd          0.65403    0.04526  14.450  < 2e-16 ***
## lbxtr           0.05933    0.00552  10.748  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.13 on 4128 degrees of freedom
## Multiple R-squared:  0.1055, Adjusted R-squared:  0.1024 
## F-statistic: 34.76 on 14 and 4128 DF,  p-value: < 2.2e-16

Model 5: Stepwise AIC on Model 3 (Log transformed predictors + target)

# Model 5
set.seed(123)

step_model_full_transformed_target <- stats::step(model_full_transformed_target, direction = "both")
## Start:  AIC=-13120.82
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + dmdeduc2 + indfmpir + 
##     smq020 + alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes + 
##     bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - dmdeduc2           6    0.3961 172.43 -13123
## - sedentary_minutes  1    0.0002 172.04 -13123
## - moderate_minutes   1    0.0041 172.04 -13123
## - vigorous_minutes   1    0.0066 172.04 -13123
## - alq111             1    0.0082 172.04 -13123
## - ridageyr           1    0.0089 172.04 -13123
## - log_dr1tsodi       1    0.0375 172.07 -13122
## - indfmpir           1    0.0593 172.10 -13121
## <none>                           172.04 -13121
## - log_dr1ttfat       1    0.1002 172.14 -13120
## - smq020             1    0.1071 172.14 -13120
## - log_dr1tkcal       1    0.1287 172.16 -13120
## - log_bmxbmi         1    0.1406 172.18 -13119
## - riagendr           1    0.1787 172.22 -13118
## - bpxsy1             1    0.3790 172.42 -13114
## - ridreth3           5    0.8144 172.85 -13111
## - log_lbxhscrp       1    1.2002 173.24 -13094
## - log_lbxglu         1    1.3590 173.40 -13090
## - bpxdi1             1    1.6495 173.69 -13083
## - log_lbxtr          1    9.8824 181.92 -12891
## - log_lbdhdd         1   13.1221 185.16 -12818
## 
## Step:  AIC=-13123.3
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes + 
##     bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - sedentary_minutes  1    0.0005 172.43 -13125
## - vigorous_minutes   1    0.0046 172.44 -13125
## - moderate_minutes   1    0.0053 172.44 -13125
## - alq111             1    0.0103 172.44 -13125
## - ridageyr           1    0.0288 172.46 -13125
## - log_dr1tsodi       1    0.0408 172.47 -13124
## <none>                           172.43 -13123
## - smq020             1    0.0894 172.52 -13123
## - indfmpir           1    0.1082 172.54 -13123
## - log_dr1ttfat       1    0.1181 172.55 -13122
## - log_bmxbmi         1    0.1489 172.58 -13122
## - log_dr1tkcal       1    0.1495 172.58 -13122
## - riagendr           1    0.1725 172.60 -13121
## + dmdeduc2           6    0.3961 172.04 -13121
## - bpxsy1             1    0.3792 172.81 -13116
## - ridreth3           5    1.0580 173.49 -13108
## - log_lbxhscrp       1    1.1580 173.59 -13098
## - log_lbxglu         1    1.3590 173.79 -13093
## - bpxdi1             1    1.6288 174.06 -13086
## - log_lbxtr          1    9.8568 182.29 -12895
## - log_lbdhdd         1   13.2179 185.65 -12819
## 
## Step:  AIC=-13125.29
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + vigorous_minutes + moderate_minutes + bpxsy1 + bpxdi1 + 
##     log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1tsodi + 
##     log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - vigorous_minutes   1    0.0046 172.44 -13127
## - moderate_minutes   1    0.0052 172.44 -13127
## - alq111             1    0.0103 172.44 -13127
## - ridageyr           1    0.0288 172.46 -13127
## - log_dr1tsodi       1    0.0407 172.47 -13126
## <none>                           172.43 -13125
## - smq020             1    0.0894 172.52 -13125
## - indfmpir           1    0.1080 172.54 -13125
## - log_dr1ttfat       1    0.1181 172.55 -13124
## - log_bmxbmi         1    0.1489 172.58 -13124
## - log_dr1tkcal       1    0.1494 172.58 -13124
## + sedentary_minutes  1    0.0005 172.43 -13123
## - riagendr           1    0.1721 172.60 -13123
## + dmdeduc2           6    0.3963 172.04 -13123
## - bpxsy1             1    0.3788 172.81 -13118
## - ridreth3           5    1.0616 173.50 -13110
## - log_lbxhscrp       1    1.1577 173.59 -13100
## - log_lbxglu         1    1.3656 173.80 -13095
## - bpxdi1             1    1.6286 174.06 -13088
## - log_lbxtr          1    9.8634 182.30 -12897
## - log_lbdhdd         1   13.2175 185.65 -12821
## 
## Step:  AIC=-13127.17
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + moderate_minutes + bpxsy1 + bpxdi1 + log_lbxhscrp + 
##     log_lbxtr + log_dr1tkcal + log_dr1tsodi + log_dr1ttfat + 
##     log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - moderate_minutes   1    0.0037 172.44 -13129
## - alq111             1    0.0102 172.45 -13129
## - ridageyr           1    0.0277 172.47 -13128
## - log_dr1tsodi       1    0.0401 172.48 -13128
## <none>                           172.44 -13127
## - smq020             1    0.0902 172.53 -13127
## - indfmpir           1    0.1082 172.55 -13127
## - log_dr1ttfat       1    0.1184 172.56 -13126
## - log_bmxbmi         1    0.1488 172.59 -13126
## - log_dr1tkcal       1    0.1495 172.59 -13126
## + vigorous_minutes   1    0.0046 172.43 -13125
## + sedentary_minutes  1    0.0005 172.44 -13125
## - riagendr           1    0.1700 172.61 -13125
## + dmdeduc2           6    0.3943 172.04 -13125
## - bpxsy1             1    0.3791 172.82 -13120
## - ridreth3           5    1.0589 173.50 -13112
## - log_lbxhscrp       1    1.1560 173.59 -13102
## - log_lbxglu         1    1.3711 173.81 -13096
## - bpxdi1             1    1.6285 174.07 -13090
## - log_lbxtr          1    9.8588 182.30 -12899
## - log_lbdhdd         1   13.2164 185.65 -12823
## 
## Step:  AIC=-13129.08
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - alq111             1    0.0104 172.45 -13131
## - ridageyr           1    0.0287 172.47 -13130
## - log_dr1tsodi       1    0.0398 172.48 -13130
## <none>                           172.44 -13129
## - smq020             1    0.0898 172.53 -13129
## - indfmpir           1    0.1089 172.55 -13128
## - log_dr1ttfat       1    0.1181 172.56 -13128
## - log_bmxbmi         1    0.1482 172.59 -13128
## - log_dr1tkcal       1    0.1493 172.59 -13128
## + moderate_minutes   1    0.0037 172.44 -13127
## + vigorous_minutes   1    0.0032 172.44 -13127
## + sedentary_minutes  1    0.0004 172.44 -13127
## - riagendr           1    0.1721 172.61 -13127
## + dmdeduc2           6    0.3955 172.05 -13127
## - bpxsy1             1    0.3808 172.82 -13122
## - ridreth3           5    1.0622 173.50 -13114
## - log_lbxhscrp       1    1.1551 173.60 -13103
## - log_lbxglu         1    1.3809 173.82 -13098
## - bpxdi1             1    1.6255 174.07 -13092
## - log_lbxtr          1    9.8564 182.30 -12901
## - log_lbdhdd         1   13.2504 185.69 -12824
## 
## Step:  AIC=-13130.83
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - ridageyr           1    0.0292 172.48 -13132
## - log_dr1tsodi       1    0.0401 172.49 -13132
## - smq020             1    0.0815 172.53 -13131
## <none>                           172.45 -13131
## - indfmpir           1    0.1050 172.56 -13130
## - log_dr1ttfat       1    0.1194 172.57 -13130
## - log_bmxbmi         1    0.1469 172.60 -13129
## - log_dr1tkcal       1    0.1491 172.60 -13129
## + alq111             1    0.0104 172.44 -13129
## + moderate_minutes   1    0.0039 172.45 -13129
## + vigorous_minutes   1    0.0031 172.45 -13129
## + sedentary_minutes  1    0.0005 172.45 -13129
## - riagendr           1    0.1772 172.63 -13129
## + dmdeduc2           6    0.3976 172.05 -13128
## - bpxsy1             1    0.3845 172.84 -13124
## - ridreth3           5    1.0917 173.54 -13115
## - log_lbxhscrp       1    1.1626 173.61 -13105
## - log_lbxglu         1    1.3877 173.84 -13100
## - bpxdi1             1    1.6160 174.07 -13094
## - log_lbxtr          1    9.8481 182.30 -12903
## - log_lbdhdd         1   13.2450 185.70 -12826
## 
## Step:  AIC=-13132.13
## log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 + bpxsy1 + 
##     bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1tsodi + 
##     log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - log_dr1tsodi       1    0.0448 172.53 -13133
## <none>                           172.48 -13132
## - smq020             1    0.0965 172.58 -13132
## - log_dr1ttfat       1    0.1156 172.60 -13131
## - indfmpir           1    0.1173 172.60 -13131
## + ridageyr           1    0.0292 172.45 -13131
## - log_dr1tkcal       1    0.1439 172.62 -13131
## - log_bmxbmi         1    0.1449 172.63 -13131
## + alq111             1    0.0109 172.47 -13130
## + moderate_minutes   1    0.0050 172.48 -13130
## + dmdeduc2           6    0.4181 172.06 -13130
## + vigorous_minutes   1    0.0020 172.48 -13130
## + sedentary_minutes  1    0.0004 172.48 -13130
## - riagendr           1    0.1703 172.65 -13130
## - bpxsy1             1    0.6265 173.11 -13119
## - ridreth3           5    1.0913 173.57 -13116
## - log_lbxhscrp       1    1.1836 173.66 -13106
## - log_lbxglu         1    1.3627 173.84 -13102
## - bpxdi1             1    1.5877 174.07 -13096
## - log_lbxtr          1    9.9258 182.41 -12902
## - log_lbdhdd         1   13.5534 186.03 -12821
## 
## Step:  AIC=-13133.06
## log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 + bpxsy1 + 
##     bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1ttfat + 
##     log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## <none>                           172.53 -13133
## - smq020             1    0.1010 172.63 -13133
## - log_dr1tkcal       1    0.1028 172.63 -13133
## - indfmpir           1    0.1139 172.64 -13132
## + log_dr1tsodi       1    0.0448 172.48 -13132
## + ridageyr           1    0.0339 172.49 -13132
## - log_bmxbmi         1    0.1367 172.66 -13132
## + alq111             1    0.0113 172.51 -13131
## + dmdeduc2           6    0.4237 172.10 -13131
## - log_dr1ttfat       1    0.1608 172.69 -13131
## + moderate_minutes   1    0.0046 172.52 -13131
## + vigorous_minutes   1    0.0016 172.52 -13131
## + sedentary_minutes  1    0.0004 172.53 -13131
## - riagendr           1    0.1793 172.71 -13131
## - bpxsy1             1    0.6367 173.16 -13120
## - ridreth3           5    1.0666 173.59 -13118
## - log_lbxhscrp       1    1.2063 173.73 -13106
## - log_lbxglu         1    1.3656 173.89 -13102
## - bpxdi1             1    1.5851 174.11 -13097
## - log_lbxtr          1    9.9074 182.43 -12904
## - log_lbdhdd         1   13.5512 186.08 -12822
summary(step_model_full_transformed_target)
## 
## Call:
## lm(formula = log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 + 
##     bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi, data = train_data_transformed_target)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78399 -0.13374 -0.00143  0.13398  0.86598 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.7387534  0.1324538  28.227  < 2e-16 ***
## riagendrFemale  0.0151442  0.0073136   2.071  0.03845 *  
## ridreth32       0.0137406  0.0134444   1.022  0.30683    
## ridreth33      -0.0110760  0.0103169  -1.074  0.28308    
## ridreth34      -0.0313803  0.0111297  -2.820  0.00483 ** 
## ridreth36       0.0197936  0.0124451   1.590  0.11181    
## ridreth37      -0.0139308  0.0165009  -0.844  0.39858    
## indfmpir        0.0036243  0.0021962   1.650  0.09897 .  
## smq0202        -0.0106687  0.0068649  -1.554  0.12024    
## bpxsy1          0.0007444  0.0001908   3.902 9.70e-05 ***
## bpxdi1          0.0016752  0.0002721   6.156 8.17e-10 ***
## log_lbxhscrp    0.0180565  0.0033622   5.370 8.29e-08 ***
## log_lbxtr       0.0965329  0.0062721  15.391  < 2e-16 ***
## log_dr1tkcal    0.0224983  0.0143528   1.568  0.11707    
## log_dr1ttfat   -0.0223556  0.0114010  -1.961  0.04996 *  
## log_lbdhdd      0.2574592  0.0143033  18.000  < 2e-16 ***
## log_lbxglu     -0.0822295  0.0143904  -5.714 1.18e-08 ***
## log_bmxbmi      0.0302211  0.0167136   1.808  0.07065 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2045 on 4125 degrees of freedom
## Multiple R-squared:   0.14,  Adjusted R-squared:  0.1364 
## F-statistic:  39.5 on 17 and 4125 DF,  p-value: < 2.2e-16

Model 6: Ridge: Log transformed predictors + target

Good for multicollinearity - adds a penalty (L2 regularization)

# Model 6
set.seed(123)

ridge_model <- train(log_lbxtc ~ ., data = train_data_transformed_target,
                method = "glmnet",
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 10),
                tuneGrid = expand.grid(alpha = 0, lambda = seq(0, 1, 0.1)))
summary(ridge_model)
##             Length Class      Mode     
## a0           100   -none-     numeric  
## beta        2900   dgCMatrix  S4       
## df           100   -none-     numeric  
## dim            2   -none-     numeric  
## lambda       100   -none-     numeric  
## dev.ratio    100   -none-     numeric  
## nulldev        1   -none-     numeric  
## npasses        1   -none-     numeric  
## jerr           1   -none-     numeric  
## offset         1   -none-     logical  
## call           5   -none-     call     
## nobs           1   -none-     numeric  
## lambdaOpt      1   -none-     numeric  
## xNames        29   -none-     character
## problemType    1   -none-     character
## tuneValue      2   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list
testX <- test_data_transformed_target %>%
  dplyr::select(-log_lbxtc)

testY <- test_data_transformed_target$log_lbxtc

ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_resample
##      RMSE  Rsquared       MAE 
## 0.2054136 0.1103974 0.1596422

Model 7: Elastic Net: Log transformed predictors + target

# Model 7

set.seed(123)
# Elastic net
testX <- test_data_transformed_target %>%
  dplyr::select(-log_lbxtc)

testY <- test_data_transformed_target$log_lbxtc

enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enet_model <- train(log_lbxtc ~ ., data = train_data_transformed_target, method = "enet",
                 tuneGrid = enetGrid,
                 trControl = trainControl(method = "cv", number = 10),
                 preProcess = c("center", "scale"))
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_resample
##      RMSE  Rsquared       MAE 
## 0.2062352 0.1030591 0.1606479

Model 8: Robust Regression: Original Data

# Model 8
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_model_full <- rlm(lbxtc ~ ., data = train_data_original, psi = psi.huber)
summary(robust_model_full)
## 
## Call: rlm(formula = lbxtc ~ ., data = train_data_original, psi = psi.huber)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -249.339  -24.778   -1.536   24.560  250.206 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)       93.9388  6.5811    14.2740
## ridageyr           0.0931  0.0392     2.3770
## riagendrFemale     3.8709  1.3546     2.8576
## ridreth32          2.4554  2.5001     0.9821
## ridreth33         -3.0352  2.0235    -1.4999
## ridreth34         -6.3734  2.1471    -2.9683
## ridreth36          1.5197  2.4317     0.6250
## ridreth37         -1.2667  3.1176    -0.4063
## dmdeduc22         -0.9823  2.8286    -0.3473
## dmdeduc23         -0.7882  2.5741    -0.3062
## dmdeduc24         -2.2230  2.5274    -0.8796
## dmdeduc25          1.3845  2.7593     0.5018
## dmdeduc27         59.5691 26.9946     2.2067
## dmdeduc29         -6.3412 13.6374    -0.4650
## indfmpir           0.3149  0.4447     0.7080
## smq0202           -2.5989  1.3119    -1.9810
## alq1112            0.5368  2.0787     0.2582
## vigorous_minutes   0.0002  0.0018     0.1261
## moderate_minutes  -0.0003  0.0011    -0.2554
## sedentary_minutes  0.0004  0.0009     0.4564
## dr1tkcal           0.0013  0.0015     0.8648
## dr1ttfat          -0.0289  0.0282    -1.0282
## dr1tsodi          -0.0003  0.0005    -0.5100
## bmxbmi             0.3318  0.0908     3.6527
## bpxsy1             0.1089  0.0401     2.7127
## bpxdi1             0.4065  0.0511     7.9548
## lbxglu            -0.0769  0.0169    -4.5432
## lbxhscrp          -0.0558  0.0805    -0.6935
## lbdhdd             0.6806  0.0444    15.3166
## lbxtr              0.0765  0.0054    14.2568
## 
## Residual standard error: 36.66 on 4113 degrees of freedom

Model 9: Robust Regression: Transformed Data

# Model 9
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_model_full_transformed_target <- rlm(log_lbxtc ~ ., data = train_data_transformed_target, psi = psi.huber)
summary(robust_model_full_transformed_target)
## 
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target, 
##     psi = psi.huber)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80890 -0.13423 -0.00364  0.13459  0.85464 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)        3.7498  0.1343    27.9279
## ridageyr           0.0003  0.0002     1.4544
## riagendrFemale     0.0114  0.0073     1.5641
## ridreth32          0.0136  0.0134     1.0170
## ridreth33         -0.0116  0.0109    -1.0668
## ridreth34         -0.0272  0.0115    -2.3562
## ridreth36          0.0196  0.0131     1.4938
## ridreth37         -0.0045  0.0167    -0.2677
## dmdeduc22         -0.0124  0.0152    -0.8175
## dmdeduc23         -0.0109  0.0138    -0.7924
## dmdeduc24         -0.0186  0.0136    -1.3708
## dmdeduc25          0.0008  0.0148     0.0542
## dmdeduc27          0.2756  0.1449     1.9023
## dmdeduc29         -0.0331  0.0732    -0.4526
## indfmpir           0.0029  0.0024     1.1974
## smq0202           -0.0136  0.0070    -1.9270
## alq1112            0.0007  0.0112     0.0627
## vigorous_minutes   0.0000  0.0000     0.4452
## moderate_minutes   0.0000  0.0000    -0.4649
## sedentary_minutes  0.0000  0.0000     0.2177
## bpxsy1             0.0005  0.0002     2.4705
## bpxdi1             0.0019  0.0003     6.9050
## log_lbxhscrp       0.0160  0.0034     4.7700
## log_lbxtr          0.1012  0.0063    16.1821
## log_dr1tkcal       0.0280  0.0157     1.7821
## log_dr1tsodi      -0.0112  0.0104    -1.0836
## log_dr1ttfat      -0.0190  0.0117    -1.6273
## log_lbdhdd         0.2615  0.0144    18.1882
## log_lbxglu        -0.0926  0.0144    -6.4237
## log_bmxbmi         0.0439  0.0167     2.6306
## 
## Residual standard error: 0.1993 on 4113 degrees of freedom

Model 10: Robust Regression: Transformed Data: Bisquare weighting function

# Model 10
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_bisquare_model_full_transformed_target <- rlm(log_lbxtc ~ ., data = train_data_transformed_target, psi = psi.bisquare)
summary(robust_bisquare_model_full_transformed_target)
## 
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target, 
##     psi = psi.bisquare)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.810378 -0.134024 -0.003544  0.134312  0.855685 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)        3.7382  0.1352    27.6441
## ridageyr           0.0003  0.0002     1.5380
## riagendrFemale     0.0112  0.0074     1.5156
## ridreth32          0.0121  0.0135     0.8933
## ridreth33         -0.0126  0.0109    -1.1558
## ridreth34         -0.0290  0.0116    -2.4912
## ridreth36          0.0181  0.0132     1.3685
## ridreth37         -0.0038  0.0168    -0.2252
## dmdeduc22         -0.0115  0.0153    -0.7500
## dmdeduc23         -0.0094  0.0139    -0.6776
## dmdeduc24         -0.0172  0.0137    -1.2556
## dmdeduc25          0.0024  0.0149     0.1585
## dmdeduc27          0.2773  0.1459     1.9006
## dmdeduc29         -0.0297  0.0737    -0.4026
## indfmpir           0.0025  0.0024     1.0586
## smq0202           -0.0134  0.0071    -1.8892
## alq1112            0.0016  0.0112     0.1413
## vigorous_minutes   0.0000  0.0000     0.4244
## moderate_minutes   0.0000  0.0000    -0.4410
## sedentary_minutes  0.0000  0.0000     0.2513
## bpxsy1             0.0005  0.0002     2.4135
## bpxdi1             0.0019  0.0003     6.9122
## log_lbxhscrp       0.0159  0.0034     4.7084
## log_lbxtr          0.1016  0.0063    16.1453
## log_dr1tkcal       0.0290  0.0158     1.8305
## log_dr1tsodi      -0.0120  0.0105    -1.1503
## log_dr1ttfat      -0.0195  0.0118    -1.6588
## log_lbdhdd         0.2631  0.0145    18.1656
## log_lbxglu        -0.0925  0.0145    -6.3756
## log_bmxbmi         0.0450  0.0168     2.6791
## 
## Residual standard error: 0.1989 on 4113 degrees of freedom

Model 12: Multiple Linear Regression: Log transformation and New features

Age Categories Revealed Non-Linear Pattern as Cholesterol peaks in middle age, then declines in elderly . BMI Categories are more informative than Log BMI. High Sodium Flag is Significant as HIGH_SODIUM (>2300 mg/day) → -1.93% cholesterol (p = 0.009) and hence the effect is in unexpected direction (negative).

Physical activity and high-fat diet does not have detectable direct effect on cholesterol.

# Model 12
set.seed(123)
model_transformed_newfeatures <- lm(log_lbxtc ~ ., data = 
train_data_transformed_newfeatures)
summary(model_transformed_newfeatures)
## 
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_newfeatures)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84349 -0.13022  0.00021  0.13048  0.92606 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.939e+00  9.713e-02  40.557  < 2e-16 ***
## ridageyr           2.610e-04  3.910e-04   0.668  0.50447    
## riagendrFemale     9.479e-03  7.208e-03   1.315  0.18854    
## ridreth32          1.294e-02  1.325e-02   0.977  0.32849    
## ridreth33          3.639e-03  1.079e-02   0.337  0.73587    
## ridreth34         -2.564e-02  1.141e-02  -2.247  0.02472 *  
## ridreth36          2.124e-02  1.300e-02   1.634  0.10242    
## ridreth37         -5.423e-03  1.654e-02  -0.328  0.74299    
## dmdeduc22         -1.153e-02  1.501e-02  -0.768  0.44268    
## dmdeduc23         -8.491e-03  1.368e-02  -0.621  0.53493    
## dmdeduc24         -1.219e-02  1.342e-02  -0.908  0.36384    
## dmdeduc25         -3.099e-04  1.462e-02  -0.021  0.98309    
## dmdeduc27          2.315e-01  1.430e-01   1.619  0.10556    
## dmdeduc29         -1.392e-02  7.229e-02  -0.193  0.84730    
## indfmpir           4.088e-04  2.367e-03   0.173  0.86288    
## smq0202           -6.101e-03  6.986e-03  -0.873  0.38257    
## alq1112            8.724e-03  1.104e-02   0.790  0.42945    
## sedentary_minutes  1.529e-06  4.523e-06   0.338  0.73543    
## TOTAL_ACTIVITY    -1.197e-06  4.763e-06  -0.251  0.80162    
## ACTIVE_FLAG        8.684e-03  7.269e-03   1.195  0.23230    
## HIGH_FAT_DIET     -6.568e-03  6.590e-03  -0.997  0.31892    
## HIGH_SODIUM       -2.001e-02  7.416e-03  -2.699  0.00699 ** 
## AGE_YOUNG          1.239e-02  2.112e-02   0.586  0.55763    
## AGE_MID            8.465e-02  1.216e-02   6.962 3.90e-12 ***
## BMI_UNDER         -7.372e-02  2.690e-02  -2.741  0.00616 ** 
## BMI_NORMAL        -2.728e-02  9.366e-03  -2.912  0.00361 ** 
## BMI_OVER           1.256e-02  7.745e-03   1.621  0.10500    
## bpxsy1             1.006e-03  2.150e-04   4.680 2.96e-06 ***
## bpxdi1             7.897e-04  2.844e-04   2.777  0.00552 ** 
## log_lbxhscrp       1.460e-02  3.227e-03   4.524 6.23e-06 ***
## log_lbxtr          9.078e-02  6.181e-03  14.686  < 2e-16 ***
## log_lbdhdd         2.580e-01  1.421e-02  18.160  < 2e-16 ***
## log_lbxglu        -8.354e-02  1.418e-02  -5.889 4.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2005 on 4110 degrees of freedom
## Multiple R-squared:  0.1762, Adjusted R-squared:  0.1698 
## F-statistic: 27.47 on 32 and 4110 DF,  p-value: < 2.2e-16

Check for multicollinearity

We know we have some low correlated features, so let’s address this potential issue by looking at the VIF for the models and removing predictors.

VIF for LM1: Basic Model had dr1tkcal and dr1ttfat whose VIF > 5.Hence these predictos should be removed.

For the logistic regression models, luckily no predictor has a VIF > 5 (for Df > 1 (categorical), must look at GVIF^(1/(2*Df)) value), so we don’t need to remove any predictors.

# Multiple linear models

cat("=== VIF for LM1: Basic Model ===\n")
## === VIF for LM1: Basic Model ===
print(car::vif(model_full))
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.521880  1        1.233645
## riagendr          1.323334  1        1.150363
## ridreth3          1.644105  5        1.050976
## dmdeduc2          1.742917  6        1.047385
## indfmpir          1.298192  1        1.139382
## smq020            1.199062  1        1.095017
## alq111            1.103425  1        1.050440
## vigorous_minutes  1.049989  1        1.024690
## moderate_minutes  1.045875  1        1.022680
## sedentary_minutes 1.022078  1        1.010979
## dr1tkcal          6.267934  1        2.503584
## dr1ttfat          5.378923  1        2.319251
## dr1tsodi          2.854178  1        1.689431
## bmxbmi            1.304856  1        1.142303
## bpxsy1            1.591836  1        1.261680
## bpxdi1            1.232550  1        1.110203
## lbxglu            1.156394  1        1.075357
## lbxhscrp          1.110765  1        1.053928
## lbdhdd            1.374965  1        1.172589
## lbxtr             1.173039  1        1.083069
cat("=== VIF for LM2: Transformed predictors + original target variable ===\n")
## === VIF for LM2: Transformed predictors + original target variable ===
print(car::vif(model_full_transformed))
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.531744  1        1.237636
## riagendr          1.340638  1        1.157859
## ridreth3          1.674581  5        1.052908
## dmdeduc2          1.752564  6        1.047867
## indfmpir          1.300326  1        1.140318
## smq020            1.201889  1        1.096307
## alq111            1.104838  1        1.051113
## vigorous_minutes  1.050999  1        1.025182
## moderate_minutes  1.046303  1        1.022889
## sedentary_minutes 1.021970  1        1.010925
## bpxsy1            1.596436  1        1.263502
## bpxdi1            1.244769  1        1.115692
## log_lbxhscrp      1.389545  1        1.178790
## log_lbxtr         1.443205  1        1.201335
## log_dr1tkcal      5.771872  1        2.402472
## log_dr1tsodi      3.265137  1        1.806969
## log_dr1ttfat      4.965713  1        2.228388
## log_lbdhdd        1.614277  1        1.270542
## log_lbxglu        1.217585  1        1.103442
## log_bmxbmi        1.510988  1        1.229222
cat("=== VIF for LM3: Transformed predictors + transformed target variable ===\n")
## === VIF for LM3: Transformed predictors + transformed target variable ===
print(car::vif(model_full_transformed_target))
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.531744  1        1.237636
## riagendr          1.340638  1        1.157859
## ridreth3          1.674581  5        1.052908
## dmdeduc2          1.752564  6        1.047867
## indfmpir          1.300326  1        1.140318
## smq020            1.201889  1        1.096307
## alq111            1.104838  1        1.051113
## vigorous_minutes  1.050999  1        1.025182
## moderate_minutes  1.046303  1        1.022889
## sedentary_minutes 1.021970  1        1.010925
## bpxsy1            1.596436  1        1.263502
## bpxdi1            1.244769  1        1.115692
## log_lbxhscrp      1.389545  1        1.178790
## log_lbxtr         1.443205  1        1.201335
## log_dr1tkcal      5.771872  1        2.402472
## log_dr1tsodi      3.265137  1        1.806969
## log_dr1ttfat      4.965713  1        2.228388
## log_lbdhdd        1.614277  1        1.270542
## log_lbxglu        1.217585  1        1.103442
## log_bmxbmi        1.510988  1        1.229222
cat("=== VIF for LM4: Stepwise on Model 1 ===\n")
## === VIF for LM4: Stepwise on Model 1 ===
print(car::vif(step_model_full))
##              GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.403331  1        1.184623
## riagendr 1.196522  1        1.093856
## ridreth3 1.199283  5        1.018339
## smq020   1.123607  1        1.060003
## bmxbmi   1.217974  1        1.103619
## bpxsy1   1.561195  1        1.249478
## bpxdi1   1.210080  1        1.100036
## lbxglu   1.141326  1        1.068329
## lbdhdd   1.335759  1        1.155750
## lbxtr    1.162166  1        1.078038
cat("=== VIF for LM5: Stepwise on Model 3 ===\n")
## === VIF for LM5: Stepwise on Model 3 ===
print(car::vif(step_model_full_transformed_target))
##                  GVIF Df GVIF^(1/(2*Df))
## riagendr     1.322698  1        1.150086
## ridreth3     1.280753  5        1.025053
## indfmpir     1.085509  1        1.041878
## smq020       1.125802  1        1.061038
## bpxsy1       1.233886  1        1.110804
## bpxdi1       1.198605  1        1.094808
## log_lbxhscrp 1.376780  1        1.173363
## log_lbxtr    1.434107  1        1.197542
## log_dr1tkcal 4.757132  1        2.181085
## log_dr1ttfat 4.659189  1        2.158515
## log_lbdhdd   1.576608  1        1.255631
## log_lbxglu   1.198388  1        1.094709
## log_bmxbmi   1.497774  1        1.223836
cat("=== VIF for LM8: Robust: Original ===\n")
## === VIF for LM8: Robust: Original ===
print(car::vif(robust_model_full))
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.521880  1        1.233645
## riagendr          1.323334  1        1.150363
## ridreth3          1.644105  5        1.050976
## dmdeduc2          1.742917  6        1.047385
## indfmpir          1.298192  1        1.139382
## smq020            1.199062  1        1.095017
## alq111            1.103425  1        1.050440
## vigorous_minutes  1.049989  1        1.024690
## moderate_minutes  1.045875  1        1.022680
## sedentary_minutes 1.022078  1        1.010979
## dr1tkcal          6.267934  1        2.503584
## dr1ttfat          5.378923  1        2.319251
## dr1tsodi          2.854178  1        1.689431
## bmxbmi            1.304856  1        1.142303
## bpxsy1            1.591836  1        1.261680
## bpxdi1            1.232550  1        1.110203
## lbxglu            1.156394  1        1.075357
## lbxhscrp          1.110765  1        1.053928
## lbdhdd            1.374965  1        1.172589
## lbxtr             1.173039  1        1.083069
cat("=== VIF for LM9: Robust: Transformed ===\n")
## === VIF for LM9: Robust: Transformed ===
print(car::vif(robust_model_full_transformed_target))
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.531744  1        1.237636
## riagendr          1.340638  1        1.157859
## ridreth3          1.674581  5        1.052908
## dmdeduc2          1.752564  6        1.047867
## indfmpir          1.300326  1        1.140318
## smq020            1.201889  1        1.096307
## alq111            1.104838  1        1.051113
## vigorous_minutes  1.050999  1        1.025182
## moderate_minutes  1.046303  1        1.022889
## sedentary_minutes 1.021970  1        1.010925
## bpxsy1            1.596436  1        1.263502
## bpxdi1            1.244769  1        1.115692
## log_lbxhscrp      1.389545  1        1.178790
## log_lbxtr         1.443205  1        1.201335
## log_dr1tkcal      5.771872  1        2.402472
## log_dr1tsodi      3.265137  1        1.806969
## log_dr1ttfat      4.965713  1        2.228388
## log_lbdhdd        1.614277  1        1.270542
## log_lbxglu        1.217585  1        1.103442
## log_bmxbmi        1.510988  1        1.229222
cat("=== VIF for LM10: Robust: Transformed: Bisquare ===\n")
## === VIF for LM10: Robust: Transformed: Bisquare ===
print(car::vif(robust_bisquare_model_full_transformed_target))
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.531744  1        1.237636
## riagendr          1.340638  1        1.157859
## ridreth3          1.674581  5        1.052908
## dmdeduc2          1.752564  6        1.047867
## indfmpir          1.300326  1        1.140318
## smq020            1.201889  1        1.096307
## alq111            1.104838  1        1.051113
## vigorous_minutes  1.050999  1        1.025182
## moderate_minutes  1.046303  1        1.022889
## sedentary_minutes 1.021970  1        1.010925
## bpxsy1            1.596436  1        1.263502
## bpxdi1            1.244769  1        1.115692
## log_lbxhscrp      1.389545  1        1.178790
## log_lbxtr         1.443205  1        1.201335
## log_dr1tkcal      5.771872  1        2.402472
## log_dr1tsodi      3.265137  1        1.806969
## log_dr1ttfat      4.965713  1        2.228388
## log_lbdhdd        1.614277  1        1.270542
## log_lbxglu        1.217585  1        1.103442
## log_bmxbmi        1.510988  1        1.229222
cat("=== VIF for LM12: Transformed predictors and target + new features ===\n")
## === VIF for LM12: Transformed predictors and target + new features ===
print(car::vif(model_transformed_newfeatures))
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          5.409444  1        2.325821
## riagendr          1.336220  1        1.155950
## ridreth3          1.724996  5        1.056036
## dmdeduc2          1.789852  6        1.049707
## indfmpir          1.311627  1        1.145263
## smq020            1.212757  1        1.101253
## alq111            1.110053  1        1.053590
## sedentary_minutes 1.030810  1        1.015288
## TOTAL_ACTIVITY    1.131137  1        1.063549
## ACTIVE_FLAG       1.277647  1        1.130331
## HIGH_FAT_DIET     1.077105  1        1.037837
## HIGH_SODIUM       1.110074  1        1.053600
## AGE_YOUNG         6.836026  1        2.614580
## AGE_MID           3.713059  1        1.926930
## BMI_UNDER         1.081259  1        1.039836
## BMI_NORMAL        1.667455  1        1.291300
## BMI_OVER          1.378407  1        1.174056
## bpxsy1            1.629344  1        1.276458
## bpxdi1            1.361794  1        1.166959
## log_lbxhscrp      1.318844  1        1.148410
## log_lbxtr         1.448880  1        1.203694
## log_lbdhdd        1.617624  1        1.271859
## log_lbxglu        1.211128  1        1.100513
# Model 1 (original predictors + original target variable lbxtc)
# Remove dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_v2 <- lm(lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal,dr1ttfat)))
summary(model_full_v2)
## 
## Call:
## lm(formula = lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal, 
##     dr1ttfat)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -205.782  -26.737   -3.807   22.935  245.631 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.989e+01  6.707e+00  14.893  < 2e-16 ***
## ridageyr           6.249e-02  4.040e-02   1.547 0.121953    
## riagendrFemale     4.786e+00  1.386e+00   3.453 0.000560 ***
## ridreth32          2.132e+00  2.583e+00   0.825 0.409304    
## ridreth33         -2.441e+00  2.088e+00  -1.169 0.242457    
## ridreth34         -6.643e+00  2.216e+00  -2.998 0.002737 ** 
## ridreth36          1.590e+00  2.503e+00   0.635 0.525434    
## ridreth37         -2.773e+00  3.222e+00  -0.861 0.389422    
## dmdeduc22         -2.895e+00  2.921e+00  -0.991 0.321690    
## dmdeduc23         -2.048e+00  2.657e+00  -0.771 0.440889    
## dmdeduc24         -3.519e+00  2.608e+00  -1.349 0.177390    
## dmdeduc25         -2.827e-01  2.849e+00  -0.099 0.920965    
## dmdeduc27          5.267e+01  2.790e+01   1.888 0.059112 .  
## dmdeduc29         -6.016e+00  1.409e+01  -0.427 0.669497    
## indfmpir           4.117e-01  4.591e-01   0.897 0.369903    
## smq0202           -3.061e+00  1.356e+00  -2.258 0.023977 *  
## alq1112            1.015e+00  2.148e+00   0.473 0.636504    
## vigorous_minutes   6.989e-05  1.868e-03   0.037 0.970155    
## moderate_minutes  -3.135e-04  1.122e-03  -0.280 0.779859    
## sedentary_minutes  8.496e-05  8.790e-04   0.097 0.923008    
## dr1tsodi          -2.817e-04  3.438e-04  -0.819 0.412633    
## bmxbmi             2.524e-01  9.366e-02   2.694 0.007078 ** 
## bpxsy1             1.565e-01  4.146e-02   3.775 0.000163 ***
## bpxdi1             3.559e-01  5.280e-02   6.741 1.79e-11 ***
## lbxglu            -5.873e-02  1.747e-02  -3.361 0.000783 ***
## lbxhscrp          -1.686e-02  8.317e-02  -0.203 0.839347    
## lbdhdd             6.500e-01  4.592e-02  14.157  < 2e-16 ***
## lbxtr              5.943e-02  5.540e-03  10.727  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.14 on 4115 degrees of freedom
## Multiple R-squared:  0.1079, Adjusted R-squared:  0.102 
## F-statistic: 18.43 on 27 and 4115 DF,  p-value: < 2.2e-16
car::vif(model_full_v2) # looks good
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.515673  1        1.231127
## riagendr          1.296892  1        1.138812
## ridreth3          1.596730  5        1.047908
## dmdeduc2          1.732063  6        1.046840
## indfmpir          1.294865  1        1.137921
## smq020            1.198310  1        1.094673
## alq111            1.103324  1        1.050392
## vigorous_minutes  1.049948  1        1.024670
## moderate_minutes  1.045805  1        1.022646
## sedentary_minutes 1.021712  1        1.010798
## dr1tsodi          1.110621  1        1.053860
## bmxbmi            1.298835  1        1.139665
## bpxsy1            1.590519  1        1.261158
## bpxdi1            1.231716  1        1.109827
## lbxglu            1.154277  1        1.074373
## lbxhscrp          1.110360  1        1.053736
## lbdhdd            1.373973  1        1.172166
## lbxtr             1.170124  1        1.081723
# Model 2 (transformed predictors + original target variable lbxtc)
# Remove log_dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_transformed_v2 <- lm(lbxtc ~ ., data = train_data_transformed %>% dplyr::select(-c(log_dr1tkcal)))
summary(model_full_transformed_v2)
## 
## Call:
## lm(formula = lbxtc ~ ., data = train_data_transformed %>% dplyr::select(-c(log_dr1tkcal)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.70  -26.43   -3.92   22.77  240.85 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4.642e+01  2.300e+01  -2.019 0.043596 *  
## ridageyr           3.173e-02  3.996e-02   0.794 0.427180    
## riagendrFemale     2.833e+00  1.379e+00   2.055 0.039917 *  
## ridreth32          2.444e+00  2.546e+00   0.960 0.337078    
## ridreth33         -1.638e+00  2.059e+00  -0.796 0.426344    
## ridreth34         -4.824e+00  2.188e+00  -2.204 0.027560 *  
## ridreth36          3.186e+00  2.485e+00   1.282 0.199955    
## ridreth37         -1.959e+00  3.173e+00  -0.617 0.536996    
## dmdeduc22         -4.018e+00  2.883e+00  -1.394 0.163527    
## dmdeduc23         -2.452e+00  2.621e+00  -0.935 0.349722    
## dmdeduc24         -4.068e+00  2.575e+00  -1.580 0.114252    
## dmdeduc25         -9.185e-01  2.811e+00  -0.327 0.743923    
## dmdeduc27          5.486e+01  2.749e+01   1.996 0.046010 *  
## dmdeduc29         -7.750e+00  1.389e+01  -0.558 0.576885    
## indfmpir           5.811e-01  4.529e-01   1.283 0.199543    
## smq0202           -2.501e+00  1.337e+00  -1.870 0.061538 .  
## alq1112            7.900e-01  2.118e+00   0.373 0.709153    
## vigorous_minutes   6.152e-04  1.841e-03   0.334 0.738280    
## moderate_minutes  -4.563e-04  1.105e-03  -0.413 0.679729    
## sedentary_minutes -3.619e-05  8.659e-04  -0.042 0.966662    
## bpxsy1             1.495e-01  4.091e-02   3.655 0.000261 ***
## bpxdi1             2.872e-01  5.227e-02   5.494 4.18e-08 ***
## log_lbxhscrp       3.230e+00  6.368e-01   5.072 4.11e-07 ***
## log_lbxtr          1.835e+01  1.185e+00  15.485  < 2e-16 ***
## log_dr1tsodi      -7.707e-01  1.797e+00  -0.429 0.668013    
## log_dr1ttfat      -4.083e-01  1.627e+00  -0.251 0.801853    
## log_lbdhdd         4.529e+01  2.728e+00  16.601  < 2e-16 ***
## log_lbxglu        -1.527e+01  2.731e+00  -5.592 2.39e-08 ***
## log_bmxbmi         3.214e+00  3.159e+00   1.017 0.309091    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.56 on 4114 degrees of freedom
## Multiple R-squared:  0.1346, Adjusted R-squared:  0.1287 
## F-statistic: 22.84 on 28 and 4114 DF,  p-value: < 2.2e-16
car::vif(model_full_transformed_v2) # looks good
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.528157  1        1.236186
## riagendr          1.322256  1        1.149894
## ridreth3          1.663460  5        1.052207
## dmdeduc2          1.746420  6        1.047560
## indfmpir          1.298539  1        1.139535
## smq020            1.201694  1        1.096218
## alq111            1.104830  1        1.051109
## vigorous_minutes  1.050970  1        1.025168
## moderate_minutes  1.046275  1        1.022876
## sedentary_minutes 1.021901  1        1.010891
## bpxsy1            1.596263  1        1.263433
## bpxdi1            1.244341  1        1.115500
## log_lbxhscrp      1.389502  1        1.178771
## log_lbxtr         1.440165  1        1.200069
## log_dr1tsodi      2.716875  1        1.648295
## log_dr1ttfat      2.668648  1        1.633600
## log_lbdhdd        1.613914  1        1.270399
## log_lbxglu        1.214539  1        1.102061
## log_bmxbmi        1.505382  1        1.226940
# Model 3 (transformed predictors + transformed target variable log_lbxtc)
# Remove log_dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_transformed_target_v2 <- lm(log_lbxtc ~ ., data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(model_full_transformed_target_v2)
## 
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>% 
##     dplyr::select(-c(log_dr1tkcal)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80059 -0.13504 -0.00158  0.13451  0.85592 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.883e+00  1.220e-01  31.819  < 2e-16 ***
## ridageyr           7.997e-05  2.120e-04   0.377  0.70604    
## riagendrFemale     1.371e-02  7.314e-03   1.874  0.06099 .  
## ridreth32          1.391e-02  1.351e-02   1.030  0.30307    
## ridreth33         -1.014e-02  1.093e-02  -0.928  0.35341    
## ridreth34         -2.963e-02  1.161e-02  -2.552  0.01073 *  
## ridreth36          1.634e-02  1.319e-02   1.239  0.21539    
## ridreth37         -1.156e-02  1.683e-02  -0.687  0.49232    
## dmdeduc22         -1.899e-02  1.530e-02  -1.241  0.21459    
## dmdeduc23         -1.441e-02  1.391e-02  -1.036  0.30015    
## dmdeduc24         -2.266e-02  1.366e-02  -1.658  0.09730 .  
## dmdeduc25         -3.083e-03  1.492e-02  -0.207  0.83629    
## dmdeduc27          2.612e-01  1.458e-01   1.791  0.07336 .  
## dmdeduc29         -3.770e-02  7.369e-02  -0.512  0.60897    
## indfmpir           2.705e-03  2.403e-03   1.126  0.26028    
## smq0202           -1.151e-02  7.095e-03  -1.622  0.10484    
## alq1112            4.929e-03  1.124e-02   0.439  0.66091    
## vigorous_minutes   3.982e-06  9.767e-06   0.408  0.68356    
## moderate_minutes  -1.787e-06  5.863e-06  -0.305  0.76051    
## sedentary_minutes -2.549e-07  4.594e-06  -0.055  0.95576    
## bpxsy1             6.573e-04  2.171e-04   3.028  0.00248 ** 
## bpxdi1             1.751e-03  2.773e-04   6.312 3.05e-10 ***
## log_lbxhscrp       1.813e-02  3.379e-03   5.365 8.53e-08 ***
## log_lbxtr          9.722e-02  6.287e-03  15.464  < 2e-16 ***
## log_dr1tsodi      -2.383e-03  9.533e-03  -0.250  0.80266    
## log_dr1ttfat      -4.168e-03  8.631e-03  -0.483  0.62916    
## log_lbdhdd         2.560e-01  1.448e-02  17.683  < 2e-16 ***
## log_lbxglu        -8.396e-02  1.449e-02  -5.794 7.40e-09 ***
## log_bmxbmi         2.899e-02  1.676e-02   1.730  0.08378 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2046 on 4114 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1359 
## F-statistic: 24.27 on 28 and 4114 DF,  p-value: < 2.2e-16
car::vif(model_full_transformed_target_v2) # looks good
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.528157  1        1.236186
## riagendr          1.322256  1        1.149894
## ridreth3          1.663460  5        1.052207
## dmdeduc2          1.746420  6        1.047560
## indfmpir          1.298539  1        1.139535
## smq020            1.201694  1        1.096218
## alq111            1.104830  1        1.051109
## vigorous_minutes  1.050970  1        1.025168
## moderate_minutes  1.046275  1        1.022876
## sedentary_minutes 1.021901  1        1.010891
## bpxsy1            1.596263  1        1.263433
## bpxdi1            1.244341  1        1.115500
## log_lbxhscrp      1.389502  1        1.178771
## log_lbxtr         1.440165  1        1.200069
## log_dr1tsodi      2.716875  1        1.648295
## log_dr1ttfat      2.668648  1        1.633600
## log_lbdhdd        1.613914  1        1.270399
## log_lbxglu        1.214539  1        1.102061
## log_bmxbmi        1.505382  1        1.226940
# Model 8 robust regression - original
# Remove dr1tkcal which has highest VIF (above 5)
set.seed(123)
robust_model_full_v2 <- rlm(lbxtc ~ ., psi = psi.huber, data = train_data_original %>% dplyr::select(-c(dr1tkcal)))
summary(robust_model_full_v2)
## 
## Call: rlm(formula = lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal)), 
##     psi = psi.huber)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -249.964  -24.745   -1.669   24.520  249.598 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)       94.7462  6.4933    14.5915
## ridageyr           0.0911  0.0391     2.3301
## riagendrFemale     3.7152  1.3442     2.7639
## ridreth32          2.5257  2.4997     1.0104
## ridreth33         -3.0861  2.0228    -1.5256
## ridreth34         -6.4170  2.1463    -2.9897
## ridreth36          1.4823  2.4314     0.6097
## ridreth37         -1.2720  3.1180    -0.4079
## dmdeduc22         -1.0018  2.8288    -0.3541
## dmdeduc23         -0.8358  2.5734    -0.3248
## dmdeduc24         -2.3021  2.5263    -0.9112
## dmdeduc25          1.3924  2.7598     0.5045
## dmdeduc27         59.5554 26.9989     2.2058
## dmdeduc29         -6.2907 13.6396    -0.4612
## indfmpir           0.2940  0.4443     0.6617
## smq0202           -2.6139  1.3119    -1.9925
## alq1112            0.5186  2.0790     0.2495
## vigorous_minutes   0.0002  0.0018     0.1301
## moderate_minutes  -0.0003  0.0011    -0.2469
## sedentary_minutes  0.0004  0.0009     0.4543
## dr1ttfat          -0.0106  0.0189    -0.5604
## dr1tsodi          -0.0001  0.0005    -0.2006
## bmxbmi             0.3272  0.0907     3.6096
## bpxsy1             0.1098  0.0401     2.7371
## bpxdi1             0.4078  0.0511     7.9811
## lbxglu            -0.0774  0.0169    -4.5796
## lbxhscrp          -0.0561  0.0805    -0.6966
## lbdhdd             0.6811  0.0444    15.3239
## lbxtr              0.0768  0.0054    14.3251
## 
## Residual standard error: 36.63 on 4114 degrees of freedom
car::vif(robust_model_full_v2) # looks good
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.515794  1        1.231176
## riagendr          1.302600  1        1.141315
## ridreth3          1.637276  5        1.050539
## dmdeduc2          1.736752  6        1.047076
## indfmpir          1.295408  1        1.138160
## smq020            1.198662  1        1.094834
## alq111            1.103360  1        1.050409
## vigorous_minutes  1.049948  1        1.024670
## moderate_minutes  1.045806  1        1.022647
## sedentary_minutes 1.021872  1        1.010877
## dr1ttfat          2.424637  1        1.557125
## dr1tsodi          2.438760  1        1.561653
## bmxbmi            1.299592  1        1.139996
## bpxsy1            1.591051  1        1.261369
## bpxdi1            1.231797  1        1.109863
## lbxglu            1.154521  1        1.074487
## lbxhscrp          1.110704  1        1.053900
## lbdhdd            1.374888  1        1.172556
## lbxtr             1.170204  1        1.081760
set.seed(123)
# Model 9 robust regression - transformed
# Remove log_dr1tkcal which has highest VIF (above 5)

robust_model_full_transformed_target_v2 <- rlm(log_lbxtc ~ ., psi = psi.huber, data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(robust_model_full_transformed_target_v2)
## 
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>% 
##     dplyr::select(-c(log_dr1tkcal)), psi = psi.huber)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.814008 -0.133808 -0.002436  0.133927  0.854640 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)        3.8504  0.1209    31.8572
## ridageyr           0.0003  0.0002     1.3796
## riagendrFemale     0.0099  0.0072     1.3654
## ridreth32          0.0143  0.0134     1.0714
## ridreth33         -0.0126  0.0108    -1.1667
## ridreth34         -0.0280  0.0115    -2.4370
## ridreth36          0.0189  0.0131     1.4450
## ridreth37         -0.0053  0.0167    -0.3183
## dmdeduc22         -0.0128  0.0152    -0.8445
## dmdeduc23         -0.0111  0.0138    -0.8065
## dmdeduc24         -0.0192  0.0135    -1.4188
## dmdeduc25          0.0013  0.0148     0.0853
## dmdeduc27          0.2753  0.1445     1.9061
## dmdeduc29         -0.0330  0.0730    -0.4522
## indfmpir           0.0026  0.0024     1.1095
## smq0202           -0.0137  0.0070    -1.9459
## alq1112            0.0007  0.0111     0.0665
## vigorous_minutes   0.0000  0.0000     0.4513
## moderate_minutes   0.0000  0.0000    -0.4396
## sedentary_minutes  0.0000  0.0000     0.2255
## bpxsy1             0.0005  0.0002     2.5041
## bpxdi1             0.0019  0.0003     6.9660
## log_lbxhscrp       0.0160  0.0033     4.7728
## log_lbxtr          0.1018  0.0062    16.3397
## log_dr1tsodi      -0.0037  0.0094    -0.3921
## log_dr1ttfat      -0.0047  0.0085    -0.5543
## log_lbdhdd         0.2613  0.0143    18.2221
## log_lbxglu        -0.0939  0.0144    -6.5413
## log_bmxbmi         0.0423  0.0166     2.5495
## 
## Residual standard error: 0.1984 on 4114 degrees of freedom
car::vif(robust_model_full_transformed_target_v2) # looks good
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.528157  1        1.236186
## riagendr          1.322256  1        1.149894
## ridreth3          1.663460  5        1.052207
## dmdeduc2          1.746420  6        1.047560
## indfmpir          1.298539  1        1.139535
## smq020            1.201694  1        1.096218
## alq111            1.104830  1        1.051109
## vigorous_minutes  1.050970  1        1.025168
## moderate_minutes  1.046275  1        1.022876
## sedentary_minutes 1.021901  1        1.010891
## bpxsy1            1.596263  1        1.263433
## bpxdi1            1.244341  1        1.115500
## log_lbxhscrp      1.389502  1        1.178771
## log_lbxtr         1.440165  1        1.200069
## log_dr1tsodi      2.716875  1        1.648295
## log_dr1ttfat      2.668648  1        1.633600
## log_lbdhdd        1.613914  1        1.270399
## log_lbxglu        1.214539  1        1.102061
## log_bmxbmi        1.505382  1        1.226940
# Model 10
set.seed(123)
robust_bisquare_model_full_transformed_target_v2 <- rlm(log_lbxtc ~ ., psi = psi.bisquare, data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(robust_bisquare_model_full_transformed_target_v2)
## 
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>% 
##     dplyr::select(-c(log_dr1tkcal)), psi = psi.bisquare)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.815502 -0.134634 -0.002864  0.134037  0.855650 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)        3.8413  0.1221    31.4612
## ridageyr           0.0003  0.0002     1.4454
## riagendrFemale     0.0096  0.0073     1.3112
## ridreth32          0.0128  0.0135     0.9457
## ridreth33         -0.0137  0.0109    -1.2512
## ridreth34         -0.0299  0.0116    -2.5746
## ridreth36          0.0173  0.0132     1.3124
## ridreth37         -0.0047  0.0168    -0.2768
## dmdeduc22         -0.0119  0.0153    -0.7775
## dmdeduc23         -0.0096  0.0139    -0.6903
## dmdeduc24         -0.0178  0.0137    -1.2993
## dmdeduc25          0.0028  0.0149     0.1882
## dmdeduc27          0.2768  0.1459     1.8970
## dmdeduc29         -0.0292  0.0737    -0.3964
## indfmpir           0.0024  0.0024     0.9786
## smq0202           -0.0135  0.0071    -1.9000
## alq1112            0.0016  0.0112     0.1413
## vigorous_minutes   0.0000  0.0000     0.4303
## moderate_minutes   0.0000  0.0000    -0.4248
## sedentary_minutes  0.0000  0.0000     0.2559
## bpxsy1             0.0005  0.0002     2.4532
## bpxdi1             0.0019  0.0003     6.9362
## log_lbxhscrp       0.0159  0.0034     4.7017
## log_lbxtr          0.1023  0.0063    16.2614
## log_dr1tsodi      -0.0043  0.0095    -0.4467
## log_dr1ttfat      -0.0047  0.0086    -0.5497
## log_lbdhdd         0.2630  0.0145    18.1569
## log_lbxglu        -0.0938  0.0145    -6.4655
## log_bmxbmi         0.0433  0.0168     2.5844
## 
## Residual standard error: 0.1991 on 4114 degrees of freedom
car::vif(robust_bisquare_model_full_transformed_target_v2) # looks good
##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.528157  1        1.236186
## riagendr          1.322256  1        1.149894
## ridreth3          1.663460  5        1.052207
## dmdeduc2          1.746420  6        1.047560
## indfmpir          1.298539  1        1.139535
## smq020            1.201694  1        1.096218
## alq111            1.104830  1        1.051109
## vigorous_minutes  1.050970  1        1.025168
## moderate_minutes  1.046275  1        1.022876
## sedentary_minutes 1.021901  1        1.010891
## bpxsy1            1.596263  1        1.263433
## bpxdi1            1.244341  1        1.115500
## log_lbxhscrp      1.389502  1        1.178771
## log_lbxtr         1.440165  1        1.200069
## log_dr1tsodi      2.716875  1        1.648295
## log_dr1ttfat      2.668648  1        1.633600
## log_lbdhdd        1.613914  1        1.270399
## log_lbxglu        1.214539  1        1.102061
## log_bmxbmi        1.505382  1        1.226940
# Model 12
set.seed(123)
model_transformed_newfeatures_v2 <- glm(log_lbxtc ~ ., data = train_data_transformed_newfeatures%>% dplyr::select(-c(AGE_YOUNG, ridageyr)))

summary(model_transformed_newfeatures_v2)
## 
## Call:
## glm(formula = log_lbxtc ~ ., data = train_data_transformed_newfeatures %>% 
##     dplyr::select(-c(AGE_YOUNG, ridageyr)))
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.948e+00  9.430e-02  41.863  < 2e-16 ***
## riagendrFemale     9.152e-03  7.187e-03   1.274  0.20290    
## ridreth32          1.311e-02  1.324e-02   0.990  0.32204    
## ridreth33          3.808e-03  1.065e-02   0.358  0.72066    
## ridreth34         -2.550e-02  1.139e-02  -2.238  0.02525 *  
## ridreth36          2.115e-02  1.299e-02   1.628  0.10352    
## ridreth37         -5.674e-03  1.651e-02  -0.344  0.73112    
## dmdeduc22         -1.187e-02  1.497e-02  -0.793  0.42811    
## dmdeduc23         -8.620e-03  1.361e-02  -0.633  0.52662    
## dmdeduc24         -1.273e-02  1.324e-02  -0.962  0.33626    
## dmdeduc25         -9.041e-04  1.455e-02  -0.062  0.95047    
## dmdeduc27          2.293e-01  1.429e-01   1.605  0.10867    
## dmdeduc29         -1.271e-02  7.220e-02  -0.176  0.86027    
## indfmpir           5.189e-04  2.352e-03   0.221  0.82540    
## smq0202           -6.363e-03  6.913e-03  -0.920  0.35743    
## alq1112            8.854e-03  1.104e-02   0.802  0.42241    
## sedentary_minutes  1.535e-06  4.521e-06   0.339  0.73433    
## TOTAL_ACTIVITY    -1.259e-06  4.761e-06  -0.264  0.79145    
## ACTIVE_FLAG        8.184e-03  7.113e-03   1.151  0.24997    
## HIGH_FAT_DIET     -6.507e-03  6.587e-03  -0.988  0.32328    
## HIGH_SODIUM       -2.018e-02  7.391e-03  -2.731  0.00635 ** 
## AGE_MID            7.868e-02  6.869e-03  11.455  < 2e-16 ***
## BMI_UNDER         -7.340e-02  2.685e-02  -2.734  0.00629 ** 
## BMI_NORMAL        -2.704e-02  9.342e-03  -2.894  0.00382 ** 
## BMI_OVER           1.273e-02  7.730e-03   1.646  0.09975 .  
## bpxsy1             1.041e-03  1.938e-04   5.372 8.22e-08 ***
## bpxdi1             7.765e-04  2.819e-04   2.754  0.00591 ** 
## log_lbxhscrp       1.473e-02  3.219e-03   4.575 4.90e-06 ***
## log_lbxtr          9.098e-02  6.172e-03  14.741  < 2e-16 ***
## log_lbdhdd         2.590e-01  1.405e-02  18.431  < 2e-16 ***
## log_lbxglu        -8.303e-02  1.413e-02  -5.877 4.51e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.04019509)
## 
##     Null deviance: 200.61  on 4142  degrees of freedom
## Residual deviance: 165.28  on 4112  degrees of freedom
## AIC: -1525.4
## 
## Number of Fisher Scoring iterations: 2
car::vif(model_transformed_newfeatures_v2)
##                       GVIF Df GVIF^(1/(2*Df))
## riagendr          1.328951  1        1.152801
## ridreth3          1.661164  5        1.052062
## dmdeduc2          1.707273  6        1.045583
## indfmpir          1.295341  1        1.138130
## smq020            1.187965  1        1.089938
## alq111            1.109649  1        1.053399
## sedentary_minutes 1.030418  1        1.015095
## TOTAL_ACTIVITY    1.130412  1        1.063208
## ACTIVE_FLAG       1.223972  1        1.106333
## HIGH_FAT_DIET     1.076678  1        1.037631
## HIGH_SODIUM       1.103043  1        1.050258
## AGE_MID           1.185378  1        1.088751
## BMI_UNDER         1.077733  1        1.038139
## BMI_NORMAL        1.659556  1        1.288237
## BMI_OVER          1.373428  1        1.171934
## bpxsy1            1.324607  1        1.150916
## bpxdi1            1.338908  1        1.157112
## log_lbxhscrp      1.313531  1        1.146094
## log_lbxtr         1.444867  1        1.202026
## log_lbdhdd        1.584001  1        1.258571
## log_lbxglu        1.201714  1        1.096227

Interpretation of coefficients

Coefficients are presented as multiplicative effects (exponentiated from log-log model), meaning they represent proportional changes in cholesterol. * Biomarkers Dominating the prediction are HDL Cholesterol (log_lbdhdd), Triglycerides (log_lbxtr), Glucose (log_lbxglu),CRP/Inflammation (log_lbxhscrp). * HDL is a component of total cholesterol, so this positive relationship is biologically expected and is the strongest predictor in your model. * Triglycerides contribute to VLDL cholesterol, another component of total cholesterol. * Interestingly, Higher glucose associated with LOWER cholesterol. * Inflammation and cholesterol are linked . * Cholesterol follows an inverted-U trajectory across lifespan as it peaks in middge age(AGE_MID (40-64 years)) and then declines. * Overweight individuals(BMI_OVER (25-29.9)) have 1.28% HIGHER cholesterol than obese and hence Lower BMI leads to Lower cholesterol . * High sodium diet associated with 2% LOWER cholesterol(HIGH_SODIUM (>2300 mg/day): 0.9800)). * Lifestyle changes like Physical Activity, Diet Quality, behaviors like Smoking,Alcohol and Income had no significant effect.

# Logistic Regression Models

# Interpretation of coefficients (Model 1)
cat("\n=== Coefficient Interpretation (Model 1) ===\n")
## 
## === Coefficient Interpretation (Model 1) ===
coef_exp <- exp(coef(model_full_v2))
print(round(coef_exp, 4))
##       (Intercept)          ridageyr    riagendrFemale         ridreth32 
##      2.412504e+43      1.064500e+00      1.198091e+02      8.429400e+00 
##         ridreth33         ridreth34         ridreth36         ridreth37 
##      8.710000e-02      1.300000e-03      4.902800e+00      6.250000e-02 
##         dmdeduc22         dmdeduc23         dmdeduc24         dmdeduc25 
##      5.530000e-02      1.290000e-01      2.960000e-02      7.537000e-01 
##         dmdeduc27         dmdeduc29          indfmpir           smq0202 
##      7.500096e+22      2.400000e-03      1.509400e+00      4.680000e-02 
##           alq1112  vigorous_minutes  moderate_minutes sedentary_minutes 
##      2.760500e+00      1.000100e+00      9.997000e-01      1.000100e+00 
##          dr1tsodi            bmxbmi            bpxsy1            bpxdi1 
##      9.997000e-01      1.287100e+00      1.169400e+00      1.427500e+00 
##            lbxglu          lbxhscrp            lbdhdd             lbxtr 
##      9.430000e-01      9.833000e-01      1.915600e+00      1.061200e+00
# Interpretation of coefficients (Model 2)
cat("\n=== Coefficient Interpretation (Model 2) ===\n")
## 
## === Coefficient Interpretation (Model 2) ===
coef_exp <- exp(coef(model_full_transformed_v2))
print(round(coef_exp, 4))
##       (Intercept)          ridageyr    riagendrFemale         ridreth32 
##      0.000000e+00      1.032200e+00      1.700340e+01      1.152050e+01 
##         ridreth33         ridreth34         ridreth36         ridreth37 
##      1.943000e-01      8.000000e-03      2.418640e+01      1.410000e-01 
##         dmdeduc22         dmdeduc23         dmdeduc24         dmdeduc25 
##      1.800000e-02      8.620000e-02      1.710000e-02      3.991000e-01 
##         dmdeduc27         dmdeduc29          indfmpir           smq0202 
##      6.708678e+23      4.000000e-04      1.787900e+00      8.200000e-02 
##           alq1112  vigorous_minutes  moderate_minutes sedentary_minutes 
##      2.203300e+00      1.000600e+00      9.995000e-01      1.000000e+00 
##            bpxsy1            bpxdi1      log_lbxhscrp         log_lbxtr 
##      1.161300e+00      1.332600e+00      2.527480e+01      9.309804e+07 
##      log_dr1tsodi      log_dr1ttfat        log_lbdhdd        log_lbxglu 
##      4.627000e-01      6.648000e-01      4.685730e+19      0.000000e+00 
##        log_bmxbmi 
##      2.486760e+01
# Interpretation of coefficients (Model 3)
cat("\n=== Coefficient Interpretation (Model 3) ===\n")
## 
## === Coefficient Interpretation (Model 3) ===
coef_exp <- exp(coef(model_full_transformed_target_v2))
print(round(coef_exp, 4))
##       (Intercept)          ridageyr    riagendrFemale         ridreth32 
##           48.5508            1.0001            1.0138            1.0140 
##         ridreth33         ridreth34         ridreth36         ridreth37 
##            0.9899            0.9708            1.0165            0.9885 
##         dmdeduc22         dmdeduc23         dmdeduc24         dmdeduc25 
##            0.9812            0.9857            0.9776            0.9969 
##         dmdeduc27         dmdeduc29          indfmpir           smq0202 
##            1.2985            0.9630            1.0027            0.9886 
##           alq1112  vigorous_minutes  moderate_minutes sedentary_minutes 
##            1.0049            1.0000            1.0000            1.0000 
##            bpxsy1            bpxdi1      log_lbxhscrp         log_lbxtr 
##            1.0007            1.0018            1.0183            1.1021 
##      log_dr1tsodi      log_dr1ttfat        log_lbdhdd        log_lbxglu 
##            0.9976            0.9958            1.2917            0.9195 
##        log_bmxbmi 
##            1.0294
# Interpretation of coefficients (Model 10)
cat("\n=== Coefficient Interpretation (Model 10) ===\n")
## 
## === Coefficient Interpretation (Model 10) ===
coef_exp <- exp(coef(robust_bisquare_model_full_transformed_target_v2))
print(round(coef_exp, 4))
##       (Intercept)          ridageyr    riagendrFemale         ridreth32 
##           46.5847            1.0003            1.0096            1.0129 
##         ridreth33         ridreth34         ridreth36         ridreth37 
##            0.9864            0.9705            1.0175            0.9953 
##         dmdeduc22         dmdeduc23         dmdeduc24         dmdeduc25 
##            0.9882            0.9904            0.9824            1.0028 
##         dmdeduc27         dmdeduc29          indfmpir           smq0202 
##            1.3189            0.9712            1.0024            0.9866 
##           alq1112  vigorous_minutes  moderate_minutes sedentary_minutes 
##            1.0016            1.0000            1.0000            1.0000 
##            bpxsy1            bpxdi1      log_lbxhscrp         log_lbxtr 
##            1.0005            1.0019            1.0160            1.1077 
##      log_dr1tsodi      log_dr1ttfat        log_lbdhdd        log_lbxglu 
##            0.9957            0.9953            1.3008            0.9105 
##        log_bmxbmi 
##            1.0443
# Interpretation of coefficients (Model 12)
cat("\n=== Coefficient Interpretation (Model 12) ===\n")
## 
## === Coefficient Interpretation (Model 12) ===
coef_exp <- exp(coef(model_transformed_newfeatures_v2))
print(round(coef_exp, 4))
##       (Intercept)    riagendrFemale         ridreth32         ridreth33 
##           51.8199            1.0092            1.0132            1.0038 
##         ridreth34         ridreth36         ridreth37         dmdeduc22 
##            0.9748            1.0214            0.9943            0.9882 
##         dmdeduc23         dmdeduc24         dmdeduc25         dmdeduc27 
##            0.9914            0.9874            0.9991            1.2578 
##         dmdeduc29          indfmpir           smq0202           alq1112 
##            0.9874            1.0005            0.9937            1.0089 
## sedentary_minutes    TOTAL_ACTIVITY       ACTIVE_FLAG     HIGH_FAT_DIET 
##            1.0000            1.0000            1.0082            0.9935 
##       HIGH_SODIUM           AGE_MID         BMI_UNDER        BMI_NORMAL 
##            0.9800            1.0819            0.9292            0.9733 
##          BMI_OVER            bpxsy1            bpxdi1      log_lbxhscrp 
##            1.0128            1.0010            1.0008            1.0148 
##         log_lbxtr        log_lbdhdd        log_lbxglu 
##            1.0952            1.2957            0.9203

5 Select Models

# Model 1: Multiple Linear Regression - Full model (post mutilcollinearity updates)

# Check assumptions

# Model Evaluation on Test Set
print_evaluate_lm_model <- function(model, target_variable, test_data, model_name) {

  preds <- predict(model, newdata = test_data)
  if (target_variable == "lbxtc") {
    actual <- test_data$lbxtc
  } else { # log
    actual <- test_data$log_lbxtc
  }

  rmse <- sqrt(mean((preds - actual)^2))
  mae  <- mean(abs(preds - actual))
  r2   <- 1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
  aic  <- AIC(model)

  output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== ", model_name, " Evaluation ===\n",
    "RMSE:", round(rmse, 4), 
    "| MAE:", round(mae, 4),
    "| R²:", round(r2, 4), "\n",
    "AIC:", round(aic, 4), "\n\n",
    sep = " "
  )

  cat(output)
}

print_lm_model_diagnostics <- function(model, train_data, model_name) {

  cat("\n=== Diagnostics for", model_name, "===\n")
  
  # Histogram of residuals
  ggplot(data = as.data.frame(residuals(model)), aes(x = residuals(model))) +
  geom_histogram(bins = 25, fill = "skyblue", color = "black") +
  xlab("Residuals") +
  ggtitle(paste("Residual Histogram -", model_name))
  
  # Residuals vs Fitted
  plot(model, which = 1, main = paste("Residuals vs Fitted -", model_name))
  
  # Q-Q Plot
  plot(model, which = 2, main = paste("Normal Q-Q Plot -", model_name))
  
  # Scale-Location Plot
  plot(model, which = 3, main = paste("Scale-Location Plot -", model_name))
  
  # Residuals vs Leverage
  plot(model, which = 5, main = paste("Residuals vs Leverage -", model_name))
  
  # Cook's Distance
  try(plot(model, which = 4, main = paste("Cook's Distance -", model_name)), silent = TRUE)
  # plot(model, which = 4, main = paste("Cook's Distance -", model_name))
  
  # Heteroscedasticity test
  print("Heteroscedasticity test:")
  print(bptest(model))
  
  # High leverage cutoff
  p <- length(coef(model))
  n <- nrow(train_data)
  high_lev <- (2 * (p+1)) / n
  cat("High leverage threshold:", round(high_lev, 5), "\n")
  
  # Half-normal plot of leverage values
  try(halfnorm(hatvalues(model), main = paste("Half-Normal Plot of Leverage -", model_name)), silent = TRUE)
}

# === LM1 Evaluation ===
print_evaluate_lm_model(model_full_v2, "lbxtc", test_data_original, "LM1: Full model")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM1: Full model  Evaluation ===
##  RMSE: 39.1062 | MAE: 29.9116 | R²: 0.0893 
##  AIC: 42173.6259
print_lm_model_diagnostics(model_full_v2, train_data_original, "LM1: Full model")
## 
## === Diagnostics for LM1: Full model ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 181.84, df = 27, p-value < 2.2e-16
## 
## High leverage threshold: 0.014

# Model 2: Multiple Linear Regression: Full model with Log transformation of the skewed predictors

# Check assumptions

# === LM2 Evaluation ===
print_evaluate_lm_model(model_full_transformed_v2, "lbxtc", test_data_transformed, "LM2: Full model (Transformed Predictors)")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM2: Full model (Transformed Predictors)  Evaluation ===
##  RMSE: 39.0595 | MAE: 29.6559 | R²: 0.0914 
##  AIC: 42049.7957
print_lm_model_diagnostics(model_full_transformed_v2, train_data_transformed, "LM2: Full model (Transformed Predictors)")
## 
## === Diagnostics for LM2: Full model (Transformed Predictors) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 96.186, df = 28, p-value = 2.084e-09
## 
## High leverage threshold: 0.01448

# Model 3: Multiple Linear Regression: Full model with Log transformation of the skewed predictors + Log target

# Check assumptions

# === LM3 Evaluation ===
print_evaluate_lm_model(model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM3: Full model (Transformed Predictors + Target)  Evaluation ===
##  RMSE: 0.2057 | MAE: 0.1597 | R²: 0.107 
##  AIC: -1360.3999
print_lm_model_diagnostics(model_full_transformed_target_v2, train_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)")
## 
## === Diagnostics for LM3: Full model (Transformed Predictors + Target) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 141.07, df = 28, p-value < 2.2e-16
## 
## High leverage threshold: 0.01448

# Model 4: Stepwise on Model 1

# Check assumptions

# === LM4 Evaluation ===
print_evaluate_lm_model(step_model_full, "lbxtc", test_data_original, "LM4: Stepwise Model 1")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM4: Stepwise Model 1  Evaluation ===
##  RMSE: 39.1801 | MAE: 29.9975 | R²: 0.0858 
##  AIC: 42158.8052
print_lm_model_diagnostics(step_model_full, train_data_original, "LM4: Stepwise Model 1")
## 
## === Diagnostics for LM4: Stepwise Model 1 ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 170.4, df = 14, p-value < 2.2e-16
## 
## High leverage threshold: 0.00772

# Model 5: Stepwise on Model 5

# Check assumptions

# === LM5 Evaluation ===
print_evaluate_lm_model(step_model_full_transformed_target, "log_lbxtc", test_data_transformed_target, "LM5: Stepwise Model 5")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM5: Stepwise Model 5  Evaluation ===
##  RMSE: 0.2063 | MAE: 0.1605 | R²: 0.1026 
##  AIC: -1373.733
print_lm_model_diagnostics(step_model_full_transformed_target, train_data_transformed_target, "LM5: Stepwise Model 5")
## 
## === Diagnostics for LM5: Stepwise Model 5 ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 116, df = 17, p-value < 2.2e-16
## 
## High leverage threshold: 0.00917

# Model 6 - Ridge
testX <- test_data_transformed_target %>%
  dplyr::select(-log_lbxtc)
testY <- test_data_transformed_target$log_lbxtc
ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_rmse <- ridge_resample["RMSE"]
ridge_Rsquared <- ridge_resample["Rsquared"]
ridge_MAE <- ridge_resample["MAE"]

ridge_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== LM6: Ridge Evaluation ===\n",
    "RMSE:", round(ridge_rmse, 4), 
    "| R²:", round(ridge_Rsquared, 4),
    "| MAE:", round(ridge_MAE, 4),
    "\n\n",
    sep = " "
  )
cat(ridge_output)
## 
## === Model Selection and Evaluation ===
## 
##  === LM6: Ridge Evaluation ===
##  RMSE: 0.2054 | R²: 0.1104 | MAE: 0.1596
# Model 7 - Elastic Net
testX <- test_data_transformed_target %>%
  dplyr::select(-log_lbxtc)
testY <- test_data_transformed_target$log_lbxtc
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_rmse <- enet_resample["RMSE"]
enet_Rsquared <- enet_resample["Rsquared"]
enet_MAE <- enet_resample["MAE"]
# enet_aic  <- AIC(enet_model)

enet_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== LM7: Elastic Net Evaluation ===\n",
    "RMSE:", round(enet_rmse, 4), 
    "| R²:", round(enet_Rsquared, 4),
    "| MAE:", round(enet_MAE, 4),
    "\n\n",
    sep = " "
  )
cat(enet_output)
## 
## === Model Selection and Evaluation ===
## 
##  === LM7: Elastic Net Evaluation ===
##  RMSE: 0.2062 | R²: 0.1031 | MAE: 0.1606
# Model 8: Robust regression - original

# Check assumptions

# === LM8 Evaluation ===
print_evaluate_lm_model(robust_model_full_v2, "lbxtc", test_data_original, "LM8: Robust model (Original)")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM8: Robust model (Original)  Evaluation ===
##  RMSE: 39.1133 | MAE: 29.683 | R²: 0.0889 
##  AIC: 42200.5285
# reason for try: not every graph is for robust regression, so just print out the graphs R can print
try(print_lm_model_diagnostics(robust_model_full_v2, train_data_original, "LM8: Robust model (Original)"), silent=TRUE)
## 
## === Diagnostics for LM8: Robust model (Original) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 182.47, df = 28, p-value < 2.2e-16
## 
## High leverage threshold: 0.01448
# Model 9: Robust regression - transformed

# Check assumptions

# === LM9 Evaluation ===
print_evaluate_lm_model(robust_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM9: Robust model (Transformed Predictors + Target)  Evaluation ===
##  RMSE: 0.2057 | MAE: 0.1595 | R²: 0.1076 
##  AIC: -1355.7682
try(print_lm_model_diagnostics(robust_model_full_transformed_target_v2, train_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)"), silent=TRUE)
## 
## === Diagnostics for LM9: Robust model (Transformed Predictors + Target) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 141.07, df = 28, p-value < 2.2e-16
## 
## High leverage threshold: 0.01448
# Model 10: Robust regression - transformed, bisquare

# Check assumptions

# === LM10 Evaluation ===
print_evaluate_lm_model(robust_bisquare_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM10: Robust model (Transformed Predictors + Target, bisquare)  Evaluation ===
##  RMSE: 0.2057 | MAE: 0.1594 | R²: 0.1078 
##  AIC: -1355.2245
try(print_lm_model_diagnostics(robust_bisquare_model_full_transformed_target_v2, train_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)"), silent=TRUE)
## 
## === Diagnostics for LM10: Robust model (Transformed Predictors + Target, bisquare) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 141.07, df = 28, p-value < 2.2e-16
## 
## High leverage threshold: 0.01448
# === LM12 Evaluation ===
print_evaluate_lm_model(model_transformed_newfeatures_v2, "log_lbxtc",test_data_transformed_newfeatures, "LM12: Transformed predictors and target + New features")
## 
## === Model Selection and Evaluation ===
## 
##  ===  LM12: Transformed predictors and target + New features  Evaluation ===
##  RMSE: 0.2041 | MAE: 0.1585 | R²: 0.1215 
##  AIC: -1525.4366
print_lm_model_diagnostics(model_transformed_newfeatures_v2,train_data_transformed_newfeatures,"LM12: Transformed predictors and target + New features")
## 
## === Diagnostics for LM12: Transformed predictors and target + New features ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 131.61, df = 30, p-value = 1.092e-14
## 
## High leverage threshold: 0.01545

# Results table

evaluate_lm_model <- function(model, target_variable, test_data, model_name) {

  preds <- predict(model, newdata = test_data)
  if (target_variable == "lbxtc") {
    actual <- test_data$lbxtc
  } else { # log
    actual <- test_data$log_lbxtc
  }

  rmse <- sqrt(mean((preds - actual)^2))
  mae  <- mean(abs(preds - actual))
  r2   <- 1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
  aic  <- AIC(model)

  return(data.frame(
    Model = model_name,
    RMSE = round(rmse, 4),
    MAE = round(mae, 4),
    R_squared = round(r2, 4),
    AIC = round(aic, 4)
  ))
}

ridge_df <- data.frame(
    Model = "LM6: Ridge (Predictors + Target)",
    RMSE = round(ridge_rmse, 4),
    MAE = round(ridge_MAE, 4),
    R_squared = round(ridge_Rsquared, 4),
    AIC = NA
  )
enet_df <- data.frame(
    Model = "LM7: Elastic Net (Predictors + Target)",
    RMSE = round(enet_rmse, 4),
    MAE = round(enet_MAE, 4),
    R_squared = round(enet_Rsquared, 4),
    AIC = NA
  )

comparison <- rbind(
  evaluate_lm_model(model_full_v2, "lbxtc", test_data_original, "LM1: Full model"),
  evaluate_lm_model(model_full_transformed_v2, "lbxtc", test_data_transformed, "LM2: Full model (Transformed Predictors)"),
  evaluate_lm_model(model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)"),
  evaluate_lm_model(step_model_full, "lbxtc", test_data_original, "LM4: Stepwise Model 1"),
  evaluate_lm_model(step_model_full_transformed_target, "log_lbxtc", test_data_transformed_target, "LM5: Stepwise Model 3"),
  ridge_df,
  enet_df,
  evaluate_lm_model(robust_model_full_v2, "lbxtc", test_data_original, "LM8: Robust model (Original)"),
  evaluate_lm_model(robust_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)"),
  evaluate_lm_model(robust_bisquare_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)"),
 evaluate_lm_model(model_transformed_newfeatures_v2,"log_lbxtc", test_data_transformed_newfeatures, "LM12: Transformed predictors and target + New features")
)

print(comparison)
##                                                                Model    RMSE
## 1                                                    LM1: Full model 39.1062
## 2                           LM2: Full model (Transformed Predictors) 39.0595
## 3                  LM3: Full model (Transformed Predictors + Target)  0.2057
## 4                                              LM4: Stepwise Model 1 39.1801
## 5                                              LM5: Stepwise Model 3  0.2063
## RMSE                                LM6: Ridge (Predictors + Target)  0.2054
## RMSE1                         LM7: Elastic Net (Predictors + Target)  0.2062
## 11                                      LM8: Robust model (Original) 39.1133
## 12               LM9: Robust model (Transformed Predictors + Target)  0.2057
## 13    LM10: Robust model (Transformed Predictors + Target, bisquare)  0.2057
## 14            LM12: Transformed predictors and target + New features  0.2041
##           MAE R_squared       AIC
## 1     29.9116    0.0893 42173.626
## 2     29.6559    0.0914 42049.796
## 3      0.1597    0.1070 -1360.400
## 4     29.9975    0.0858 42158.805
## 5      0.1605    0.1026 -1373.733
## RMSE   0.1596    0.1104        NA
## RMSE1  0.1606    0.1031        NA
## 11    29.6830    0.0889 42200.529
## 12     0.1595    0.1076 -1355.768
## 13     0.1594    0.1078 -1355.225
## 14     0.1585    0.1215 -1525.437
kable(
  comparison,
  caption = "Multiple Linear Regression Models",
  digits = 4,
  align = "c"
)
Multiple Linear Regression Models
Model RMSE MAE R_squared AIC
1 LM1: Full model 39.1062 29.9116 0.0893 42173.626
2 LM2: Full model (Transformed Predictors) 39.0595 29.6559 0.0914 42049.796
3 LM3: Full model (Transformed Predictors + Target) 0.2057 0.1597 0.1070 -1360.400
4 LM4: Stepwise Model 1 39.1801 29.9975 0.0858 42158.805
5 LM5: Stepwise Model 3 0.2063 0.1605 0.1026 -1373.733
RMSE LM6: Ridge (Predictors + Target) 0.2054 0.1596 0.1104 NA
RMSE1 LM7: Elastic Net (Predictors + Target) 0.2062 0.1606 0.1031 NA
11 LM8: Robust model (Original) 39.1133 29.6830 0.0889 42200.529
12 LM9: Robust model (Transformed Predictors + Target) 0.2057 0.1595 0.1076 -1355.768
13 LM10: Robust model (Transformed Predictors + Target, bisquare) 0.2057 0.1594 0.1078 -1355.225
14 LM12: Transformed predictors and target + New features 0.2041 0.1585 0.1215 -1525.437

6 Summary and Conclusion

We used the 2017–2018 NHANES cycle set of key modules that capture demographic, behavioral, and clinical information relevant to predicting cholesterol and BMI outcomes. These modules include demographics, smoking status, alcohol use, physical activity, dietary intake, body measurements, blood pressure, glucose, inflammation (CRP), HDL cholesterol, triglycerides, and total cholesterol.Each dataset was imported individually and prepared for merging so that a comprehensive, unified analytic file could be created for building our regression models.

Several continuous variables like Total calories(dr1tkcal),Total fat(dr1ttfat) ,Sodium intake(dr1tsodi), HDL(lbdhdd) and Triglycerides(lbxtr) in the dataset exhibit skewed distributions.

We addressed variables with moderate missingness by replacing with their median values, while categorical variables were imputed using the most frequent category. For variables with higher levels of missingness, we applied MICE using predictive mean matching, allowing more accurate estimation based on relationships among variables.

The most prominent outliers were concentrated in biomarkers and nutrient totals, which could influence regression models if unaddressed. These findings suggested the need for careful handling, such as log transformation or robust modeling approaches, rather than outright removal, to ensure valid and generalizable analytical results.

To address skewness and high-end outliers in the NHANES dataset, we applied log transformations to several right-skewed variables. Specifically, CRP (lbxhscrp), triglycerides (lbxtr), total cholesterol (lbxtc), HDL cholesterol (lbdhdd), glucose (lbxglu), BMI (bmxbmi), and dietary intake variables such as total calories. (dr1tkcal), total fat (dr1ttfat), and sodium (dr1tsodi) were log-transformed. In addition to transforming skewed numeric variables, we converted key categorical variables into factor type for modeling.

Checking for linear relationship across the scatterplots, we observed that the most numeric predictors exhibit very weak or negligible linear relationships with total cholesterol.This suggested that total cholesterol is largely uncorrelated with most individual predictors, with only modest positive trends observed for a few clinical measures.

Strong Positive Correlations: - dr1tkcal/dr1ttfat 0.89 (Calories and Total Fat Intake - more fat indicates more calories) - dr1tkcal/dr1tsodi 0.80 (Calories and Total Sodium Intake - more sodium indicates more calories) - dr1ttfat/dr1tsodi 0.76 (Total Fat and Sodium Intake -more fat indicates more sodium) - ridageyr/bpxsy1 0.47 (Higher age indicates higher systolic bp)

Negative Correlations: - lbdhdd/lbxtr -0.31 (Higher HDL, lower Triglycerides) - bmxbmi/lbdhdd -0.27 (Higher BMI, lower HDL) - lbxglu/lbdhdd -0.18 (Higher Glucose, lower HDL)

Several linear regression models were built and evaluated . Best Performing Models is LM12( Multiple Linear regression with transformed predictors and target+ new features). ** It has lowest RMSE (0.2041 on log scale) and highest R² = 0.1215 ** Has the best predictive performance. ** Provides interpretable coefficients on the log scale. ** Feature engineering is better than Regularization in this case. ** Ridge’s shrinkage couldn’t compensate for missing non-linear patterns