Data 621: Final Project

1 Background

This analysis aims to examine the relationship between cholesterol level and key variables include demographic characteristics (age, sex, race/ethnicity), socioeconomic status, behavioral factors (smoking, alcohol use, physical activity, diet), and clinical measures (HDL, triglycerides, blood pressure, glucose, C-reactive protein) using the National Health and Nutrition Examination Survey (NHANES) 2017-2018 cross-sectional data. The sample of individuals aged 18-80 years will be used to determine which variable most strongly influences cholesterol while accounting for complex population-level variations.

2 Data Exploration

2.1 Data Import

Since the 2017–2018 NHANES cycle contains some of the most complete and up-to-date data, we selected this cycle for our analysis. Within this cycle, we downloaded a set of key modules that capture demographic, behavioral, and clinical information relevant to predicting cholesterol and BMI outcomes. These modules include demographics, smoking status, alcohol use, physical activity, dietary intake, body measurements, blood pressure, glucose, inflammation (CRP), HDL cholesterol, triglycerides, and total cholesterol. Each dataset was imported individually and prepared for merging so that a comprehensive, unified analytic file could be created for building our regression models.

load_nhanes <- function(files, cycle = "2017") {
  
  base_url <- paste0("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/", cycle, "/DataFiles/")
  
  result <- list()
  
  for (f in files) {
    file_url <- paste0(base_url, f, ".xpt")
    local_name <- paste0(f, ".XPT")
    download.file(file_url, local_name, mode = "wb")
    df <- read_xpt(local_name)
    result[[f]] <- df
  }
  
  return(result)
}

files_to_load <- c(
  "DEMO_J",   # Demographics
  "SMQ_J",    # Smoking
  "ALQ_J",    # Alcohol
  "PAQ_J",    # Physical activity
  "DR1TOT_J", # Diet (calories, fat, sodium)
  "BMX_J",    # Body measures (BMI)
  "BPX_J",    # Blood pressure
  "GLU_J",    # Glucose
  "HSCRP_J",  # CRP
  "HDL_J",    # HDL cholesterol
  "TRIGLY_J",  # Triglycerides
  "TCHOL_J"   # Total Cholesterol
)

nhanes_list <- load_nhanes(files_to_load)

demo     <- nhanes_list$DEMO_J
smoking  <- nhanes_list$SMQ_J
alcohol  <- nhanes_list$ALQ_J
physact  <- nhanes_list$PAQ_J
diet     <- nhanes_list$DR1TOT_J
bmx      <- nhanes_list$BMX_J
bp       <- nhanes_list$BPX_J
glucose  <- nhanes_list$GLU_J
crp      <- nhanes_list$HSCRP_J
hdl      <- nhanes_list$HDL_J
trig     <- nhanes_list$TRIGLY_J
chol     <- nhanes_list$TCHOL_J

2.2 Data Cleaning

After loading the individual NHANES datasets, we merged them into a single analytic dataset using the respondent identifier SEQN, which is shared across all NHANES modules. Once all modules were combined, we restricted the sample to adults aged 18 years and older and created derived physical activity variables representing vigorous, moderate, and sedentary minutes per day. Finally, we selected the key demographic, behavioral, dietary, and clinical variables needed for modeling, including predictors such as age, gender, race/ethnicity, socioeconomic indicators, lifestyle behaviors, and biomarkers, along with total cholesterol, lbxtc, as the primary outcome.

Features:

ridageyr: Age
riagendr: Gender
ridreth3: Race/Ethnicity
dmdeduc2: Education
indfmpir: Income/Poverty ratio
smq020: Smoking status
alq111: Alcohol use
vigorous_minutes
moderate_minutes
sedentary_minutes
dr1tkcal: Total calories
dr1ttfat: Total fat
dr1tsodi: Sodium
bmxbmi: BMI
bpxsy1: Systolic BP
bpxdi1: Diastolic BP
lbxglu: Glucose
lbxhscrp: CRP
lbdhdd: HDL
lbxtr: Triglycerides
lbxtc: Total Cholesterol (target)

# Merge data

demo <- nhanes_list$DEMO_J %>% 
  janitor::clean_names()

# Start with demographics
nhanes_master <- demo

# Merge BMI
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$BMX_J), by = "seqn")

# Merge Blood Pressure
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$BPX_J), by = "seqn")

# Merge Glucose
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$GLU_J), by = "seqn")

# Merge CRP
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$HSCRP_J), by = "seqn")

# Merge HDL
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$HDL_J), by = "seqn")

# Merge Triglycerides
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$TRIGLY_J), by = "seqn")

# Merge Smoking
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$SMQ_J), by = "seqn")

# Merge Alcohol
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$ALQ_J), by = "seqn")

# Merge Physical Activity
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$PAQ_J), by = "seqn")

# Merge Diet (calories, fat, sodium)
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$DR1TOT_J), by = "seqn")

# Merge Diet (calories, fat, sodium)
nhanes_master <- nhanes_master %>%
  left_join(janitor::clean_names(nhanes_list$TCHOL_J), by = "seqn")


# Filter for adults only
nhanes_model <- nhanes_master %>%
  filter(ridageyr >= 18) %>%
  mutate(
    vigorous_minutes = coalesce(pad615, 0) + coalesce(pad660, 0),  # Vigorous activity (work + recreation)
    moderate_minutes = coalesce(pad630, 0) + coalesce(pad675, 0) + coalesce(pad645, 0), # Moderate activity (work + recreation + transportation)
    sedentary_minutes = coalesce(pad680, 0) # Sedentary minutes
  )

# Select Variables for Modeling

nhanes_model2 <- nhanes_model %>%
  dplyr::select(
    seqn,
    ridageyr,        # Age
    riagendr,        # Gender
    ridreth3,        # Race/Ethnicity
    dmdeduc2,        # Education
    indfmpir,        # Income/Poverty ratio
    smq020,          # Smoking status
    alq111,          # Alcohol use
    vigorous_minutes,
    moderate_minutes,
    sedentary_minutes,
    dr1tkcal,        # Total calories
    dr1ttfat,        # Total fat
    dr1tsodi,        # Sodium
    bmxbmi,          # BMI
    bpxsy1, bpxdi1,  # Systolic & Diastolic BP
    lbxglu,          # Glucose
    lbxhscrp,         # CRP
    lbdhdd,          # HDL
    lbxtr,         # Triglycerides
    lbxtc            # Total Cholesterol (target)
  )

2.3 Data Distribution

To assess the structure and distribution of our variables, we began by generating summary statistics for all predictors in the dataset. We then visualized numerical features using histograms to examine normality, skewness, and potential outliers. Boxplots were created for key categorical variables to show how total cholesterol differs across demographic and behavioral groups. Finally, we plotted the distribution of the target variable—total cholesterol—to evaluate its overall shape before modeling. These exploratory steps provide a clear understanding of the data and help verify assumptions for regression analysis.

Several continuous variables in the dataset exhibit skewed distributions that may affect regression modeling:

C-reactive protein (lbxhscrp) shows extreme right skew, with a median of approximately 1.94 mg/L and a maximum of 182 mg/L.
Triglycerides (lbxtr) are also right-skewed, with a median of 94 mg/dL and a maximum of 2684 mg/dL.
Dietary variables such as total calories (dr1tkcal) and total fat intake (dr1ttfat) display mild to moderate skew due to some extreme high values.
Sodium intake (dr1tsodi) is highly right-skewed, with a maximum value of around 25,949 mg.

These variables may benefit from log-transformations or other normalization techniques prior to modeling.

# Summary statistics
nhanes_model2 %>%
  summary() %>%
  kable(caption = "Descriptive Statistics of Predictor Variables") %>%
  kable_styling()

Descriptive Statistics of Predictor Variables
seqn	ridageyr	riagendr	ridreth3	dmdeduc2	indfmpir	smq020	alq111	vigorous_minutes	moderate_minutes	sedentary_minutes	dr1tkcal	dr1ttfat	dr1tsodi	bmxbmi	bpxsy1	bpxdi1	lbxglu	lbxhscrp	lbdhdd	lbxtr	lbxtc
Min. : 93705	Min. :18.00	Min. :1.000	Min. :1.000	Min. :1.000	Min. :0.000	Min. :1.000	Min. :1.000	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0	Min. : 0.00	Min. : 0	Min. :14.20	Min. : 72	Min. : 0.00	Min. : 47.0	Min. : 0.11	Min. : 10.00	Min. : 10.0	Min. : 76.0
1st Qu.: 95955	1st Qu.:33.00	1st Qu.:1.000	1st Qu.:3.000	1st Qu.:3.000	1st Qu.:1.180	1st Qu.:1.000	1st Qu.:1.000	1st Qu.: 0.0	1st Qu.: 0.0	1st Qu.: 180.0	1st Qu.: 1406	1st Qu.: 50.74	1st Qu.: 2137	1st Qu.:24.60	1st Qu.:112	1st Qu.: 64.00	1st Qu.: 96.0	1st Qu.: 0.88	1st Qu.: 42.00	1st Qu.: 61.0	1st Qu.:158.0
Median : 98263	Median :51.00	Median :2.000	Median :3.000	Median :4.000	Median :2.080	Median :2.000	Median :1.000	Median : 0.0	Median : 50.0	Median : 300.0	Median : 1942	Median : 76.18	Median : 3102	Median :28.50	Median :124	Median : 72.00	Median :104.0	Median : 1.94	Median : 51.00	Median : 92.0	Median :183.0
Mean : 98280	Mean :49.89	Mean :1.515	Mean :3.504	Mean :3.526	Mean :2.521	Mean :1.597	Mean :1.114	Mean : 77.6	Mean : 141.8	Mean : 388.9	Mean : 2110	Mean : 85.07	Mean : 3456	Mean :29.69	Mean :126	Mean : 71.65	Mean :113.7	Mean : 4.17	Mean : 53.19	Mean : 112.3	Mean :186.9
3rd Qu.:100595	3rd Qu.:65.00	3rd Qu.:2.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:4.060	3rd Qu.:2.000	3rd Qu.:1.000	3rd Qu.: 60.0	3rd Qu.: 180.0	3rd Qu.: 480.0	3rd Qu.: 2628	3rd Qu.:108.28	3rd Qu.: 4311	3rd Qu.:33.50	3rd Qu.:136	3rd Qu.: 80.00	3rd Qu.:115.0	3rd Qu.: 4.48	3rd Qu.: 61.00	3rd Qu.: 134.0	3rd Qu.:212.0
Max. :102956	Max. :80.00	Max. :2.000	Max. :7.000	Max. :9.000	Max. :5.000	Max. :2.000	Max. :2.000	Max. :10029.0	Max. :19998.0	Max. :9999.0	Max. :12501	Max. :567.96	Max. :25949	Max. :86.20	Max. :228	Max. :136.00	Max. :451.0	Max. :182.82	Max. :189.00	Max. :2684.0	Max. :446.0
NA	NA	NA	NA	NA’s :287	NA’s :835	NA	NA’s :726	NA	NA	NA	NA’s :873	NA’s :873	NA’s :873	NA’s :422	NA’s :957	NA’s :957	NA’s :3319	NA’s :710	NA’s :680	NA’s :3363	NA’s :680

# Plot histograms of all numeric variables in data_train
nhanes_model2 |>
  dplyr::select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
  filter(!is.na(Value)) |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Feature, scales = "free") +
  labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
  theme_minimal()

# Box plot of categorical variables
categorical_vars <- c("riagendr", "ridreth3", "dmdeduc2", "smq020", "alq111")

for (var in categorical_vars) {
  print(
    ggplot(nhanes_model2, aes_string(x = var, y = "lbxtc")) +
      geom_boxplot(fill = "steelblue", color = "black") +
      labs(
        title = paste("Total Cholesterol by", var),
        x = var,
        y = "Total Cholesterol (mg/dL)"
      ) +
      theme_minimal()
  )
}

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
## Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

## Warning: Removed 287 rows containing missing values or values outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 632 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

## Warning: Removed 726 rows containing missing values or values outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 293 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Target variable distribution (Total Cholesterol)
ggplot(nhanes_model2, aes(x = lbxtc)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Total Cholesterol (mg/dL)", x = "Total Cholesterol", y = "Count") +
  theme_minimal()

## Warning: Removed 680 rows containing non-finite outside the scale range
## (`stat_bin()`).

3. Data Preparation

3.1 Fix Missing Values

To prepare the dataset for modeling, we first removed any observations with missing values in the target variable, total cholesterol. Next, we addressed missingness in the predictors. Variables with moderate missingness were replaced with their median values, while categorical variables were imputed using the most frequent category. For variables with higher levels of missingness, we applied MICE using predictive mean matching, allowing more accurate estimation based on relationships among variables.

## Look at data

# Structure of the data 
str(nhanes_model2)

## tibble [5,856 × 22] (S3: tbl_df/tbl/data.frame)
##  $ seqn             : num [1:5856] 93705 93706 93708 93709 93711 ...
##   ..- attr(*, "label")= chr "Respondent sequence number"
##  $ ridageyr         : num [1:5856] 66 18 66 75 56 18 67 54 71 61 ...
##   ..- attr(*, "label")= chr "Age in years at screening"
##  $ riagendr         : num [1:5856] 2 1 2 2 1 1 1 2 1 1 ...
##   ..- attr(*, "label")= chr "Gender"
##  $ ridreth3         : num [1:5856] 4 6 6 4 6 1 3 4 7 6 ...
##   ..- attr(*, "label")= chr "Race/Hispanic origin w/ NH Asian"
##  $ dmdeduc2         : num [1:5856] 2 NA 1 4 5 NA 3 4 3 5 ...
##   ..- attr(*, "label")= chr "Education level - Adults 20+"
##  $ indfmpir         : num [1:5856] 0.82 NA 1.63 0.41 5 0.76 2.65 1.86 1.56 5 ...
##   ..- attr(*, "label")= chr "Ratio of family income to poverty"
##  $ smq020           : num [1:5856] 1 2 2 1 2 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Smoked at least 100 cigarettes in life"
##  $ alq111           : num [1:5856] 1 2 2 NA 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Ever had a drink of any kind of alcohol"
##  $ vigorous_minutes : num [1:5856] 0 0 0 0 60 465 0 0 660 180 ...
##  $ moderate_minutes : num [1:5856] 60 75 30 180 90 270 90 0 260 240 ...
##  $ sedentary_minutes: num [1:5856] 300 240 120 600 420 120 120 360 180 300 ...
##  $ dr1tkcal         : num [1:5856] 1202 1987 1251 NA 2840 ...
##   ..- attr(*, "label")= chr "Energy (kcal)"
##  $ dr1ttfat         : num [1:5856] 57 137.4 65.5 NA 124.2 ...
##   ..- attr(*, "label")= chr "Total fat (gm)"
##  $ dr1tsodi         : num [1:5856] 3574 3657 2135 NA 4382 ...
##   ..- attr(*, "label")= chr "Sodium (mg)"
##  $ bmxbmi           : num [1:5856] 31.7 21.5 23.7 38.9 21.3 19.7 23.5 39.9 22.5 30.7 ...
##   ..- attr(*, "label")= chr "Body Mass Index (kg/m**2)"
##  $ bpxsy1           : num [1:5856] NA 112 NA 120 108 112 104 NA 112 120 ...
##   ..- attr(*, "label")= chr "Systolic: Blood pres (1st rdg) mm Hg"
##  $ bpxdi1           : num [1:5856] NA 74 NA 66 68 68 70 NA 60 72 ...
##   ..- attr(*, "label")= chr "Diastolic: Blood pres (1st rdg) mm Hg"
##  $ lbxglu           : num [1:5856] NA NA 122 NA 107 NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Fasting Glucose (mg/dL)"
##  $ lbxhscrp         : num [1:5856] 2.72 0.74 1.83 6.94 0.82 0.37 1.66 5.71 6.1 1.75 ...
##   ..- attr(*, "label")= chr "HS C-Reactive Protein (mg/L)"
##  $ lbdhdd           : num [1:5856] 60 47 88 65 72 48 48 42 57 39 ...
##   ..- attr(*, "label")= chr "Direct HDL-Cholesterol (mg/dL)"
##  $ lbxtr            : num [1:5856] NA NA 58 NA 48 NA NA NA NA NA ...
##   ..- attr(*, "label")= chr "Triglyceride (mg/dL)"
##  $ lbxtc            : num [1:5856] 157 148 209 176 238 182 184 230 180 225 ...
##   ..- attr(*, "label")= chr "Total Cholesterol (mg/dL)"

# Glimpse of the data
head(nhanes_model2)

## # A tibble: 6 × 22
##    seqn ridageyr riagendr ridreth3 dmdeduc2 indfmpir smq020 alq111
##   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
## 1 93705       66        2        4        2     0.82      1      1
## 2 93706       18        1        6       NA    NA         2      2
## 3 93708       66        2        6        1     1.63      2      2
## 4 93709       75        2        4        4     0.41      1     NA
## 5 93711       56        1        6        5     5         2      1
## 6 93712       18        1        1       NA     0.76      1      1
## # ℹ 14 more variables: vigorous_minutes <dbl>, moderate_minutes <dbl>,
## #   sedentary_minutes <dbl>, dr1tkcal <dbl>, dr1ttfat <dbl>, dr1tsodi <dbl>,
## #   bmxbmi <dbl>, bpxsy1 <dbl>, bpxdi1 <dbl>, lbxglu <dbl>, lbxhscrp <dbl>,
## #   lbdhdd <dbl>, lbxtr <dbl>, lbxtc <dbl>

# Count missing values per Variable
nhanes_model2 %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  filter(missing_count != 0) %>%
  arrange(desc(missing_count))

## # A tibble: 14 × 2
##    variable missing_count
##    <chr>            <int>
##  1 lbxtr             3363
##  2 lbxglu            3319
##  3 bpxsy1             957
##  4 bpxdi1             957
##  5 dr1tkcal           873
##  6 dr1ttfat           873
##  7 dr1tsodi           873
##  8 indfmpir           835
##  9 alq111             726
## 10 lbxhscrp           710
## 11 lbdhdd             680
## 12 lbxtc              680
## 13 bmxbmi             422
## 14 dmdeduc2           287

colSums(is.na(nhanes_model2))

##              seqn          ridageyr          riagendr          ridreth3 
##                 0                 0                 0                 0 
##          dmdeduc2          indfmpir            smq020            alq111 
##               287               835                 0               726 
##  vigorous_minutes  moderate_minutes sedentary_minutes          dr1tkcal 
##                 0                 0                 0               873 
##          dr1ttfat          dr1tsodi            bmxbmi            bpxsy1 
##               873               873               422               957 
##            bpxdi1            lbxglu          lbxhscrp            lbdhdd 
##               957              3319               710               680 
##             lbxtr             lbxtc 
##              3363               680

# Keep only rows where Total Cholesterol is observed
nhanes_model2 <- nhanes_model2 %>% 
  filter(!is.na(lbxtc))


medium_numeric <- c("indfmpir","dr1tkcal","dr1ttfat","dr1tsodi",
                    "bmxbmi","bpxsy1","bpxdi1","lbxhscrp","lbdhdd")
medium_categorical <- c("dmdeduc2","alq111")

high_missing <- c("lbxglu","lbxtr")

## Median Imputations

# Numeric
for(v in medium_numeric){
  nhanes_model2[[v]][is.na(nhanes_model2[[v]])] <- median(nhanes_model2[[v]], na.rm = TRUE)
}

# Categorical
for(v in medium_categorical){
  mode_val <- as.numeric(names(sort(table(nhanes_model2[[v]]), decreasing = TRUE)[1]))
  nhanes_model2[[v]][is.na(nhanes_model2[[v]])] <- mode_val
}

# Mice Imputations

# Select only high missing numeric variables + all predictors
mice_vars <- c(high_missing, medium_numeric, "vigorous_minutes", "moderate_minutes", "sedentary_minutes")

# Run MICE
mice_data <- mice(nhanes_model2[mice_vars], m = 5, method = "pmm", seed = 123)

## 
##  iter imp variable
##   1   1  lbxglu  lbxtr
##   1   2  lbxglu  lbxtr
##   1   3  lbxglu  lbxtr
##   1   4  lbxglu  lbxtr
##   1   5  lbxglu  lbxtr
##   2   1  lbxglu  lbxtr
##   2   2  lbxglu  lbxtr
##   2   3  lbxglu  lbxtr
##   2   4  lbxglu  lbxtr
##   2   5  lbxglu  lbxtr
##   3   1  lbxglu  lbxtr
##   3   2  lbxglu  lbxtr
##   3   3  lbxglu  lbxtr
##   3   4  lbxglu  lbxtr
##   3   5  lbxglu  lbxtr
##   4   1  lbxglu  lbxtr
##   4   2  lbxglu  lbxtr
##   4   3  lbxglu  lbxtr
##   4   4  lbxglu  lbxtr
##   4   5  lbxglu  lbxtr
##   5   1  lbxglu  lbxtr
##   5   2  lbxglu  lbxtr
##   5   3  lbxglu  lbxtr
##   5   4  lbxglu  lbxtr
##   5   5  lbxglu  lbxtr

# Complete the dataset using first imputed dataset
nhanes_model2[mice_vars] <- complete(mice_data, 1)


colSums(is.na(nhanes_model2))

##              seqn          ridageyr          riagendr          ridreth3 
##                 0                 0                 0                 0 
##          dmdeduc2          indfmpir            smq020            alq111 
##                 0                 0                 0                 0 
##  vigorous_minutes  moderate_minutes sedentary_minutes          dr1tkcal 
##                 0                 0                 0                 0 
##          dr1ttfat          dr1tsodi            bmxbmi            bpxsy1 
##                 0                 0                 0                 0 
##            bpxdi1            lbxglu          lbxhscrp            lbdhdd 
##                 0                 0                 0                 0 
##             lbxtr             lbxtc 
##                 0                 0

3.3 Identifying Outliers

The boxplots reveal that several variables, particularly total cholesterol (lbxtc), triglycerides (lbxtr), LDL cholesterol (lbdldl), and nutrient/weight totals, exhibit pronounced high-end outliers with long upper whiskers and many points far above the main distribution. These patterns reflect right-skewed distributions and natural physiological or dietary variability in the population rather than errors. Other variables, such as blood pressure, hemoglobin, and BMI, show moderate high outliers, while some nutrient or BMI measures have only a few scattered extremes. Overall, the most prominent outliers are concentrated in biomarkers and nutrient totals, which could influence regression models if unaddressed. These findings suggest the need for careful handling, such as log transformation or robust modeling approaches, rather than outright removal, to ensure valid and generalizable analytical results.

# Boxplot of lbxtc

nhanes_model2 |>
  dplyr::select(lbxtc) |>
  ggplot(aes(x = "", y = lbxtc)) + 
  geom_boxplot(fill="pink") + 
  labs(title = "Distribution of Total Cholesterol", y = "Total Cholesterol mg/dL") +  
  theme_minimal()

# Boxplots of all predictors 

ggplot(stack(nhanes_model2[,-1]), aes(x = ind, y = values)) + 
  geom_boxplot(fill = "darkgrey") +
  coord_cartesian(ylim = c(0, 1000)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = 'grey96')) +
  labs(title = "Boxplots of Predictor Variables", x="Predictors")

## Warning in stack.data.frame(nhanes_model2[, -1]): non-vector columns will be
## ignored

## Warning: <ggplot> %+% x was deprecated in ggplot2 4.0.0.
## ℹ Please use <ggplot> + x instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

nhanes_model2 %>%
  dplyr::select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "grey70") +
  coord_cartesian(ylim = c(0, 500)) +   # Adjust depending on variable range
  labs(
    title = "Boxplots of Numeric Predictor Variables",
    x = "Predictors",
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

3.4 Tranformations and Factoring

To address skewness and high-end outliers in the NHANES dataset, we applied log transformations to several right-skewed variables. Specifically, CRP (lbxhscrp), triglycerides (lbxtr), total cholesterol (lbxtc), HDL cholesterol (lbdhdd), glucose (lbxglu), BMI (bmxbmi), and dietary intake variables such as total calories (dr1tkcal), total fat (dr1ttfat), and sodium (dr1tsodi) were log-transformed. A small constant was added to each variable to avoid taking the logarithm of zero. These transformations reduce the influence of extreme values, improve the symmetry of distributions, and stabilize variance, thereby helping to meet the assumptions of linear regression and supporting more reliable model estimation.

In addition to transforming skewed numeric variables, we converted key categorical variables into factor type for modeling. Specifically, gender (riagendr) was labeled as “Male” and “Female,” while race/ethnicity (ridreth3), education level (dmdeduc2), smoking status (smq020), and alcohol use (alq111) were converted to factor variables without changing their original coding. This ensures that these categorical predictors are properly treated in regression models, allowing R to handle them as discrete groups rather than continuous numeric values, which improves interpretation and model performance.

# tranform skewed variables

# Log-transform skewed variables, adding a small constant to avoid log(0)
nhanes_model3 <- nhanes_model2 %>%
  mutate(
    log_lbxhscrp = log(lbxhscrp + 0.01),
    log_lbxtr = log(lbxtr + 0.01),
    log_dr1tkcal = log(dr1tkcal + 1),
    log_dr1tsodi= log(dr1tsodi + 1),
    log_dr1ttfat = log(dr1ttfat + 1),
    log_lbxtc = log(lbxtc + 0.01),
    log_lbdhdd = log(lbdhdd + 0.01),
    log_lbxglu = log(lbxglu + 0.01),
    log_bmxbmi = log(bmxbmi + 0.01)
  )

log_vars <- c("log_lbxhscrp","log_lbxtr","log_dr1tkcal","log_dr1tsodi","log_dr1ttfat",
              "log_lbxtc","log_lbdhdd","log_lbxglu","log_bmxbmi")

nhanes_model3 %>%
  dplyr::select(all_of(log_vars)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Variable, scales = "free") +
  labs(title = "Histograms of Log-Transformed Variables", x = NULL, y = "Frequency") +
  theme_minimal()

# Factor categorical variables
nhanes_model4 <- nhanes_model3 %>%
  mutate(
    riagendr = factor(riagendr, labels = c("Male", "Female")),
    ridreth3 = factor(ridreth3),
    dmdeduc2 = factor(dmdeduc2),
    smq020   = factor(smq020),
    alq111   = factor(alq111)
  )

3.4 Checking for linear relationship

To evaluate the suitability of linear regression for predicting total cholesterol, we examined the relationships between each numeric predictor and the target variable (lbxtc) using scatterplots and Pearson correlation coefficients. Variables with stronger correlations may be more informative for the regression model, while variables with very weak correlations may contribute less predictive value or require further transformation.

Across the scatterplots, most numeric predictors exhibit very weak or negligible linear relationships with total cholesterol, as reflected by near-zero Pearson correlation coefficients. Variables such as physical activity (vigorous, moderate, sedentary minutes), dietary intakes (calories, fat, sodium), and demographic characteristics show little predictive power individually. A few clinical variables—particularly systolic and diastolic blood pressure (bpxsy1, bpxdi1) and log-transformed triglycerides (log_trig)—display mild positive associations with cholesterol, but these relationships are weak and accompanied by substantial scatter. Overall, the plots suggest that total cholesterol is largely uncorrelated with most individual predictors, with only modest positive trends observed for a few clinical measures.

# Get all numeric predictor variables (excluding total cholesterol)
predictors <- names(nhanes_model4)[sapply(nhanes_model4, is.numeric) & names(nhanes_model4) != "lbxtc"]

# Set up plotting parameters for multiple plots
par(mfrow = c(4, 4), mar = c(4, 4, 3, 2))

# Plot each predictor vs total cholesterol
for(i in 1:length(predictors)) {
  predictor <- predictors[i]
  correlation <- cor(nhanes_model4[[predictor]], nhanes_model4$lbxtc, use = "complete.obs")
  
  plot(nhanes_model4[[predictor]], nhanes_model4$lbxtc,
       xlab = predictor,
       ylab = "Total Cholesterol",
       main = paste(predictor, "\n(r =", round(correlation, 3), ")"),
       pch = 16,
       col = ifelse(correlation > 0, "darkblue", "darkred"),
       cex = 0.6)
  
  # Add regression line
  abline(lm(nhanes_model4$lbxtc ~ nhanes_model4[[predictor]]), 
         col = "red", lwd = 2)
  
  # Add correlation text
  text(x = quantile(nhanes_model4[[predictor]], 0.05, na.rm = TRUE), 
       y = quantile(nhanes_model4$lbxtc, 0.95, na.rm = TRUE),
       labels = paste("r =", round(correlation, 3)),
       cex = 0.8, col = "red")
}

# Reset plotting parameters
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2))

3.5 Identifying Correlations

The correlations between total cholesterol (lbxtc) and other variables in the dataset are generally modest.

HDL cholesterol (lbdhdd) shows the strongest positive correlation at 0.19
Followed by log-transformed triglycerides (log_trig) and blood pressure measures (bpxdi1 and bpxsy1), each around 0.13. Age (ridageyr).
Log-transformed CRP (log_crp), and income-to-poverty ratio (indfmpir) show weak positive correlations ranging from 0.05 to 0.10.

Strong Positive Correlations:

dr1tkcal/dr1ttfat 0.89 (Calories and Total Fat Intake - more fat indicates more calories)
dr1tkcal/dr1tsodi 0.80 (Calories and Total Sodium Intake - more sodium indicates more calories)
dr1ttfat/dr1tsodi 0.76 (Total Fat and Sodium Intake -more fat indicates more sodium)
ridageyr/bpxsy1 0.47 (Higher age indicates higher systolic bp)

Negative Correlations:

lbdhdd/lbxtr -0.31 (Higher HDL, lower Triglycerides)
bmxbmi/lbdhdd -0.27 (Higher BMI, lower HDL)
lbxglu/lbdhdd -0.18 (Higher Glucose, lower HDL)

Moderate Correlations:

BMI and CRP (bmxBMI ↔︎ lbxhscrp)
Systolic and Diastolic Blood Pressure

Weak or Near-Zero Correlations:

Physical activity measures (vigorous_minutes, moderate_minutes, sedentary_minutes) show very small correlations with most metabolic markers.
Demographic and lifestyle variables like age, gender, education, and alcohol use also show minimal linear correlation with total cholesterol.

Most other lifestyle and dietary variables, including BMI, physical activity, total calories, fat, and sodium intake, have correlations near zero or slightly negative, indicating minimal linear association with total cholesterol in this dataset. Overall, these results suggest that cholesterol levels are more strongly associated with clinical biomarkers (HDL, triglycerides, blood pressure) than with demographic, lifestyle, or dietary factors.

# ORIGINAL PREDICTORS DATASET

# log_vars is "log_crp","log_trig","log_calories","log_sodium","log_fat", "log_chol","log_hdl", "log_glucose", "log_bmi"
nhanes_model4_original <- nhanes_model4 |>
  dplyr::select(-any_of(log_vars))


df_character_wide <- nhanes_model4_original %>%
  select_if(function(col) !is.numeric(col) | all(col == .$lbxtc)) %>%
  pivot_longer(cols = -lbxtc, names_to = "variable", values_to = "value")

df_character_wide %>%
  ggplot(aes(x = value, y = lbxtc)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

numeric_vars <- nhanes_model4_original %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Correlation with total cholesterol
corr_target <- cor_matrix[, "lbxtc"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)

kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Total Cholesterol"), digits = 2)

	Correlation with Total Cholesterol
lbxtc	1.00
lbdhdd	0.19
bpxdi1	0.13
bpxsy1	0.13
lbxtr	0.11
ridageyr	0.10
indfmpir	0.05
bmxbmi	0.01
vigorous_minutes	0.00
seqn	0.00
sedentary_minutes	0.00
lbxhscrp	0.00
moderate_minutes	0.00
lbxglu	-0.02
dr1tkcal	-0.03
dr1ttfat	-0.03
dr1tsodi	-0.03

# corrplot::corrplot(cor_matrix, method = "color", type = "upper",
#          tl.col = "black", tl.srt = 45, addCoef.col = "black",
#          number.cex = 0.7, diag = FALSE)
corrplot::corrplot(cor_matrix,
        method = "circle",
        type = "upper",
        tl.cex = 0.5,
        tl.srt = 45) # Rotate text diagonally

# Remove self-correlation
corr_target_no_self <- corr_target[names(corr_target) != "lbxtc"]

# Top 3 positive and negative correlations
top_pos <- sort(corr_target_no_self, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target_no_self, decreasing = FALSE)[1:3]

combined_df <- data.frame(
  Feature = c(names(top_pos), names(top_neg)),
  Correlation = c(top_pos, top_neg)
) %>% 
  mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Correlated with Total Cholesterol",
       x = "Feature",
       y = "Correlation with Total Cholesterol",
       fill = "Direction") +
  theme_minimal()

# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
  filter(Var1 != "SalePrice") |>
  filter(Var2 != "SalePrice")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
kable(as.data.frame(sorted_cor), digits = 2)

	Var1	Var2	value
126	dr1tkcal	dr1ttfat	0.89
143	dr1tkcal	dr1tsodi	0.80
144	dr1ttfat	dr1tsodi	0.76
172	ridageyr	bpxsy1	0.47
198	bpxsy1	bpxdi1	0.35
270	lbdhdd	lbxtr	-0.31
248	bmxbmi	lbdhdd	-0.27
268	lbxglu	lbxtr	0.26
231	bmxbmi	lbxhscrp	0.26
287	lbdhdd	lbxtc	0.19
214	bmxbmi	lbxglu	0.18
215	bpxsy1	lbxglu	0.18
251	lbxglu	lbdhdd	-0.18
206	ridageyr	lbxglu	0.16
265	bmxbmi	lbxtr	0.15
138	ridageyr	dr1tsodi	-0.14
104	ridageyr	dr1tkcal	-0.13
284	bpxdi1	lbxtc	0.13
72	vigorous_minutes	moderate_minutes	0.13
283	bpxsy1	lbxtc	0.13
180	bmxbmi	bpxsy1	0.12
288	lbxtr	lbxtc	0.11
121	ridageyr	dr1ttfat	-0.10
274	ridageyr	lbxtc	0.10
267	bpxdi1	lbxtr	0.10
234	lbxglu	lbxhscrp	0.09
241	indfmpir	lbdhdd	0.09
197	bmxbmi	bpxdi1	0.09
252	lbxhscrp	lbdhdd	-0.08
266	bpxsy1	lbxtr	0.08
53	ridageyr	vigorous_minutes	-0.08
106	vigorous_minutes	dr1tkcal	0.08
36	ridageyr	indfmpir	0.08
210	sedentary_minutes	lbxglu	0.07
240	ridageyr	lbdhdd	0.07
245	dr1tkcal	lbdhdd	-0.07
247	dr1tsodi	lbdhdd	-0.07
140	vigorous_minutes	dr1tsodi	0.07
262	dr1tkcal	lbxtr	0.07
224	indfmpir	lbxhscrp	-0.07
264	dr1tsodi	lbxtr	0.07
209	moderate_minutes	lbxglu	0.06
260	moderate_minutes	lbxtr	0.06
250	bpxdi1	lbdhdd	-0.06
194	dr1tkcal	bpxdi1	0.05
123	vigorous_minutes	dr1ttfat	0.05
228	dr1tkcal	lbxhscrp	-0.05
230	dr1tsodi	lbxhscrp	-0.05
161	dr1ttfat	bmxbmi	0.05
156	indfmpir	bmxbmi	-0.05
275	indfmpir	lbxtc	0.05
261	sedentary_minutes	lbxtr	0.05
196	dr1tsodi	bpxdi1	0.05
179	dr1tsodi	bpxsy1	-0.04
263	dr1ttfat	lbxtr	0.04
223	ridageyr	lbxhscrp	0.04
246	dr1ttfat	lbdhdd	-0.04
87	ridageyr	sedentary_minutes	0.04
227	sedentary_minutes	lbxhscrp	0.04
229	dr1ttfat	lbxhscrp	-0.04
159	sedentary_minutes	bmxbmi	0.04
208	vigorous_minutes	lbxglu	-0.04
178	dr1ttfat	bpxsy1	-0.04
122	indfmpir	dr1ttfat	0.04
257	ridageyr	lbxtr	0.04
176	sedentary_minutes	bpxsy1	0.04
174	vigorous_minutes	bpxsy1	-0.04
177	dr1tkcal	bpxsy1	-0.04
139	indfmpir	dr1tsodi	0.03
281	dr1tsodi	lbxtc	-0.03
70	ridageyr	moderate_minutes	-0.03
280	dr1ttfat	lbxtc	-0.03
195	dr1ttfat	bpxdi1	0.03
190	indfmpir	bpxdi1	0.03
192	moderate_minutes	bpxdi1	0.03
89	vigorous_minutes	sedentary_minutes	-0.03
173	indfmpir	bpxsy1	-0.03
88	indfmpir	sedentary_minutes	0.03
90	moderate_minutes	sedentary_minutes	-0.03
279	dr1tkcal	lbxtc	-0.03
162	dr1tsodi	bmxbmi	0.03
207	indfmpir	lbxglu	0.03
232	bpxsy1	lbxhscrp	0.03
155	ridageyr	bmxbmi	0.02
269	lbxhscrp	lbxtr	0.02
285	lbxglu	lbxtc	-0.02
52	seqn	vigorous_minutes	-0.02
244	sedentary_minutes	lbdhdd	-0.02
243	moderate_minutes	lbdhdd	0.02
205	seqn	lbxglu	-0.02
71	indfmpir	moderate_minutes	-0.02
189	ridageyr	bpxdi1	0.02
225	vigorous_minutes	lbxhscrp	-0.02
259	vigorous_minutes	lbxtr	-0.01
157	vigorous_minutes	bmxbmi	-0.01
86	seqn	sedentary_minutes	0.01
211	dr1tkcal	lbxglu	-0.01
188	seqn	bpxdi1	0.01
258	indfmpir	lbxtr	0.01
193	sedentary_minutes	bpxdi1	0.01
105	indfmpir	dr1tkcal	0.01
158	moderate_minutes	bmxbmi	0.01
239	seqn	lbdhdd	0.01
160	dr1tkcal	bmxbmi	0.01
54	indfmpir	vigorous_minutes	-0.01
282	bmxbmi	lbxtc	0.01
107	moderate_minutes	dr1tkcal	0.01
137	seqn	dr1tsodi	0.01
171	seqn	bpxsy1	0.01
249	bpxsy1	lbdhdd	0.01
256	seqn	lbxtr	0.01
142	sedentary_minutes	dr1tsodi	-0.01
226	moderate_minutes	lbxhscrp	-0.01
18	seqn	ridageyr	0.01
212	dr1ttfat	lbxglu	0.01
108	sedentary_minutes	dr1tkcal	0.00
277	moderate_minutes	lbxtc	0.00
222	seqn	lbxhscrp	0.00
175	moderate_minutes	bpxsy1	0.00
124	moderate_minutes	dr1ttfat	0.00
286	lbxhscrp	lbxtc	0.00
103	seqn	dr1tkcal	0.00
120	seqn	dr1ttfat	0.00
125	sedentary_minutes	dr1ttfat	0.00
69	seqn	moderate_minutes	0.00
233	bpxdi1	lbxhscrp	0.00
213	dr1tsodi	lbxglu	0.00
154	seqn	bmxbmi	0.00
242	vigorous_minutes	lbdhdd	0.00
216	bpxdi1	lbxglu	0.00
141	moderate_minutes	dr1tsodi	0.00
35	seqn	indfmpir	0.00
278	sedentary_minutes	lbxtc	0.00
273	seqn	lbxtc	0.00
191	vigorous_minutes	bpxdi1	0.00
276	vigorous_minutes	lbxtc	0.00

numeric_predictors <- c("ridageyr", "indfmpir", "vigorous_minutes",
                        "moderate_minutes", "sedentary_minutes", 
                        "bmxbmi", "bpxsy1", "bpxdi1", "lbxglu", 
                        "lbxhscrp", "lbdhdd", "lbxtr")

par(mfrow = c(3, 4))
for (var in numeric_predictors) {
  boxplot(nhanes_model4[[var]] ~ nhanes_model4$lbxtc, 
          main = paste(var, "vs Total Cholesterol"), 
          xlab = "Total Cholesterol", ylab = var, col = "lightblue")
}

continuous_vars <- nhanes_model4 %>%
  dplyr::select(
    lbxtc, ridageyr, indfmpir,
    vigorous_minutes, moderate_minutes, sedentary_minutes,
    dr1tkcal, dr1ttfat, dr1tsodi,
    bmxbmi, bpxsy1, bpxdi1,
    lbxglu, lbxhscrp, lbdhdd, lbxtr
  )

# relationships between predictors
ggpairs(continuous_vars, 
        progress = FALSE,
        upper = list(continuous = wrap("cor", size = 3)))

# TRANSFORMED PREDICTORS

nhanes_model4_transformed <- nhanes_model4 |>
  dplyr::select(-c(lbxhscrp, lbxtr, dr1tsodi, dr1tkcal, dr1ttfat, lbdhdd, lbxglu, bmxbmi))


df_character_wide <- nhanes_model4_transformed %>%
  select_if(function(col) !is.numeric(col) | all(col == .$lbxtc)) %>%
  pivot_longer(cols = -lbxtc, names_to = "variable", values_to = "value")

df_character_wide %>%
  ggplot(aes(x = value, y = lbxtc)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

numeric_vars <- nhanes_model4_transformed %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Correlation with total cholesterol
corr_target <- cor_matrix[, "lbxtc"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)

kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Total Cholesterol"), digits = 2)

	Correlation with Total Cholesterol
lbxtc	1.00
log_lbxtc	0.99
log_lbdhdd	0.17
log_lbxtr	0.13
bpxdi1	0.13
bpxsy1	0.13
ridageyr	0.10
log_lbxhscrp	0.06
indfmpir	0.05
log_bmxbmi	0.02
vigorous_minutes	0.00
seqn	0.00
sedentary_minutes	0.00
moderate_minutes	0.00
log_dr1tkcal	-0.02
log_lbxglu	-0.02
log_dr1ttfat	-0.03
log_dr1tsodi	-0.04

# corrplot::corrplot(cor_matrix, method = "color", type = "upper",
#          tl.col = "black", tl.srt = 45, addCoef.col = "black",
#          number.cex = 0.7, diag = FALSE)
corrplot::corrplot(cor_matrix,
        method = "circle",
        type = "upper",
        tl.cex = 0.5,
        tl.srt = 45) # Rotate text diagonally

# Remove self-correlation
corr_target_no_self <- corr_target[names(corr_target) != "lbxtc"]

# Top 3 positive and negative correlations
top_pos <- sort(corr_target_no_self, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target_no_self, decreasing = FALSE)[1:3]

combined_df <- data.frame(
  Feature = c(names(top_pos), names(top_neg)),
  Correlation = c(top_pos, top_neg)
) %>% 
  mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Correlated with Total Cholesterol",
       x = "Feature",
       y = "Correlation with Total Cholesterol",
       fill = "Direction") +
  theme_minimal()

# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
  filter(Var1 != "SalePrice") |>
  filter(Var2 != "SalePrice")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
kable(as.data.frame(sorted_cor), digits = 2)

	Var1	Var2	value
261	lbxtc	log_lbxtc	0.99
246	log_dr1tkcal	log_dr1ttfat	0.88
228	log_dr1tkcal	log_dr1tsodi	0.82
247	log_dr1tsodi	log_dr1ttfat	0.79
281	log_lbxtr	log_lbdhdd	-0.48
316	log_lbxhscrp	log_bmxbmi	0.48
110	ridageyr	bpxsy1	0.47
133	bpxsy1	bpxdi1	0.35
299	log_lbxtr	log_lbxglu	0.31
322	log_lbdhdd	log_bmxbmi	-0.30
317	log_lbxtr	log_bmxbmi	0.25
323	log_lbxglu	log_bmxbmi	0.22
304	log_lbdhdd	log_lbxglu	-0.22
295	bpxsy1	log_lbxglu	0.21
280	log_lbxhscrp	log_lbdhdd	-0.19
290	ridageyr	log_lbxglu	0.19
285	log_lbxtc	log_lbdhdd	0.18
279	lbxtc	log_lbdhdd	0.17
298	log_lbxhscrp	log_lbxglu	0.16
190	log_lbxhscrp	log_lbxtr	0.15
188	bpxdi1	log_lbxtr	0.14
260	bpxdi1	log_lbxtc	0.14
189	lbxtc	log_lbxtr	0.13
313	bpxsy1	log_bmxbmi	0.13
152	bpxdi1	lbxtc	0.13
263	log_lbxtr	log_lbxtc	0.13
76	vigorous_minutes	moderate_minutes	0.13
187	bpxsy1	log_lbxtr	0.13
151	bpxsy1	lbxtc	0.13
218	ridageyr	log_dr1tsodi	-0.12
259	bpxsy1	log_lbxtc	0.12
200	ridageyr	log_dr1tkcal	-0.11
146	ridageyr	lbxtc	0.10
165	indfmpir	log_lbxhscrp	-0.09
273	indfmpir	log_lbdhdd	0.09
314	bpxdi1	log_bmxbmi	0.09
226	log_lbxhscrp	log_dr1tsodi	-0.09
164	ridageyr	log_lbxhscrp	0.09
254	ridageyr	log_lbxtc	0.09
208	log_lbxhscrp	log_dr1tkcal	-0.08
169	bpxsy1	log_lbxhscrp	0.08
294	sedentary_minutes	log_lbxglu	0.08
56	ridageyr	vigorous_minutes	-0.08
38	ridageyr	indfmpir	0.08
236	ridageyr	log_dr1ttfat	-0.07
182	ridageyr	log_lbxtr	0.07
272	ridageyr	log_lbdhdd	0.07
282	log_dr1tkcal	log_lbdhdd	-0.07
262	log_lbxhscrp	log_lbxtc	0.07
171	lbxtc	log_lbxhscrp	0.06
283	log_dr1tsodi	log_lbdhdd	-0.06
278	bpxdi1	log_lbdhdd	-0.06
219	indfmpir	log_dr1tsodi	0.06
202	vigorous_minutes	log_dr1tkcal	0.06
244	log_lbxhscrp	log_dr1ttfat	-0.06
237	indfmpir	log_dr1ttfat	0.06
227	log_lbxtr	log_dr1tsodi	0.06
209	log_lbxtr	log_dr1tkcal	0.06
293	moderate_minutes	log_lbxglu	0.05
220	vigorous_minutes	log_dr1tsodi	0.05
206	bpxdi1	log_dr1tkcal	0.05
308	ridageyr	log_bmxbmi	0.05
186	sedentary_minutes	log_lbxtr	0.05
255	indfmpir	log_lbxtc	0.05
147	indfmpir	lbxtc	0.05
170	bpxdi1	log_lbxhscrp	0.05
224	bpxdi1	log_dr1tsodi	0.05
320	log_dr1ttfat	log_bmxbmi	0.04
292	vigorous_minutes	log_lbxglu	-0.04
92	ridageyr	sedentary_minutes	0.04
312	sedentary_minutes	log_bmxbmi	0.04
238	vigorous_minutes	log_dr1ttfat	0.04
309	indfmpir	log_bmxbmi	-0.04
114	sedentary_minutes	bpxsy1	0.04
201	indfmpir	log_dr1tkcal	0.04
223	bpxsy1	log_dr1tsodi	-0.04
185	moderate_minutes	log_lbxtr	0.04
112	vigorous_minutes	bpxsy1	-0.04
225	lbxtc	log_dr1tsodi	-0.04
265	log_dr1tsodi	log_lbxtc	-0.03
291	indfmpir	log_lbxglu	0.03
266	log_dr1ttfat	log_lbxtc	-0.03
74	ridageyr	moderate_minutes	-0.03
303	log_lbxtc	log_lbxglu	-0.03
243	lbxtc	log_dr1ttfat	-0.03
129	indfmpir	bpxdi1	0.03
131	moderate_minutes	bpxdi1	0.03
284	log_dr1ttfat	log_lbdhdd	-0.03
205	bpxsy1	log_dr1tkcal	-0.03
94	vigorous_minutes	sedentary_minutes	-0.03
111	indfmpir	bpxsy1	-0.03
321	log_lbxtc	log_bmxbmi	0.03
93	indfmpir	sedentary_minutes	0.03
166	vigorous_minutes	log_lbxhscrp	-0.03
168	sedentary_minutes	log_lbxhscrp	0.03
95	moderate_minutes	sedentary_minutes	-0.03
242	bpxdi1	log_dr1ttfat	0.03
184	vigorous_minutes	log_lbxtr	-0.03
319	log_dr1tsodi	log_bmxbmi	0.02
241	bpxsy1	log_dr1ttfat	-0.02
297	lbxtc	log_lbxglu	-0.02
315	lbxtc	log_bmxbmi	0.02
245	log_lbxtr	log_dr1ttfat	0.02
207	lbxtc	log_dr1tkcal	-0.02
264	log_dr1tkcal	log_lbxtc	-0.02
55	seqn	vigorous_minutes	-0.02
276	sedentary_minutes	log_lbdhdd	-0.02
75	indfmpir	moderate_minutes	-0.02
128	ridageyr	bpxdi1	0.02
289	seqn	log_lbxglu	-0.02
163	seqn	log_lbxhscrp	0.02
221	moderate_minutes	log_dr1tsodi	-0.01
91	seqn	sedentary_minutes	0.01
311	moderate_minutes	log_bmxbmi	0.01
127	seqn	bpxdi1	0.01
132	sedentary_minutes	bpxdi1	0.01
310	vigorous_minutes	log_bmxbmi	-0.01
239	moderate_minutes	log_dr1ttfat	-0.01
271	seqn	log_lbdhdd	0.01
300	log_dr1tkcal	log_lbxglu	-0.01
235	seqn	log_dr1ttfat	-0.01
57	indfmpir	vigorous_minutes	-0.01
275	moderate_minutes	log_lbdhdd	0.01
240	sedentary_minutes	log_dr1ttfat	0.01
302	log_dr1ttfat	log_lbxglu	0.01
203	moderate_minutes	log_dr1tkcal	-0.01
222	sedentary_minutes	log_dr1tsodi	-0.01
109	seqn	bpxsy1	0.01
19	seqn	ridageyr	0.01
296	bpxdi1	log_lbxglu	0.01
318	log_dr1tkcal	log_bmxbmi	0.01
301	log_dr1tsodi	log_lbxglu	0.00
149	moderate_minutes	lbxtc	0.00
217	seqn	log_dr1tsodi	0.00
113	moderate_minutes	bpxsy1	0.00
204	sedentary_minutes	log_dr1tkcal	0.00
183	indfmpir	log_lbxtr	0.00
181	seqn	log_lbxtr	0.00
73	seqn	moderate_minutes	0.00
258	sedentary_minutes	log_lbxtc	0.00
256	vigorous_minutes	log_lbxtc	0.00
257	moderate_minutes	log_lbxtc	0.00
277	bpxsy1	log_lbdhdd	0.00
167	moderate_minutes	log_lbxhscrp	0.00
37	seqn	indfmpir	0.00
199	seqn	log_dr1tkcal	0.00
274	vigorous_minutes	log_lbdhdd	0.00
307	seqn	log_bmxbmi	0.00
150	sedentary_minutes	lbxtc	0.00
145	seqn	lbxtc	0.00
253	seqn	log_lbxtc	0.00
130	vigorous_minutes	bpxdi1	0.00
148	vigorous_minutes	lbxtc	0.00

numeric_predictors <- c("ridageyr", "indfmpir", "vigorous_minutes",
                        "moderate_minutes", "sedentary_minutes", 
                        "bmxbmi", "bpxsy1", "bpxdi1", "lbxglu", 
                        "lbxhscrp", "lbdhdd", "lbxtr")

par(mfrow = c(3, 4))
for (var in numeric_predictors) {
  boxplot(nhanes_model4[[var]] ~ nhanes_model4$lbxtc, 
          main = paste(var, "vs Total Cholesterol"), 
          xlab = "Total Cholesterol", ylab = var, col = "lightblue")
}

# relationships between predictors
#ggpairs(nhanes_model4_transformed, progress = FALSE)

3.4 Adding new features

To enhance the predictive modeling of total cholesterol, several new features were derived from the raw NHANES data. Age was categorized into young (<30), middle-aged (30–65), and older (>65) groups, while BMI was classified into underweight, normal, overweight, and obese categories. Physical activity variables were combined into a total activity measure, and a binary indicator was created to flag participants meeting recommended activity levels. Dietary intake was normalized by creating fat-to-calorie and sodium-to-calorie ratios. Finally, skewed laboratory and dietary variables, including triglycerides, CRP, total calories, fat, and sodium, were log-transformed to reduce skewness and improve model stability.

nhanes_model5 <- nhanes_model4 %>%
  # Age groups
  mutate(
    AGE_YOUNG = as.numeric(ridageyr < 30),
    AGE_MID = as.numeric(ridageyr >= 30 & ridageyr <= 65),
    AGE_OLD = as.numeric(ridageyr > 65)
  ) %>%
  # BMI categories
  mutate(
    BMI_UNDER = as.numeric(bmxbmi < 18.5),
    BMI_NORMAL = as.numeric(bmxbmi >= 18.5 & bmxbmi < 25),
    BMI_OVER = as.numeric(bmxbmi >= 25 & bmxbmi < 30),
    BMI_OBESE = as.numeric(bmxbmi >= 30)
  ) %>%
  # Physical activity
  mutate(
    TOTAL_ACTIVITY = vigorous_minutes + moderate_minutes,
    ACTIVE_FLAG = as.numeric(TOTAL_ACTIVITY >= 150)  # Meets 150 min/week guideline
  ) %>%
  # Dietary ratios
  mutate(
#    FAT_RATIO = ifelse(dr1tkcal > 0, dr1ttfat / dr1tkcal, NA),
#    SODIUM_RATIO = ifelse(dr1tkcal > 0, dr1tsodi / dr1tkcal, NA)
    HIGH_FAT_DIET = as.numeric((exp(log_dr1ttfat) * 9 / exp(log_dr1tkcal)) > 0.35),
    HIGH_SODIUM = as.numeric(exp(log_dr1tsodi) > 2300)
  )

Split Data into Training and Test Sets

To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.

set.seed(123)

train_index <- createDataPartition(nhanes_model4$lbxtc, p = 0.8, list = FALSE)

# Create train/test splits for each dataset
# Removed seqn for each dataset

# Original predictors - removed transformed predictors
nhanes_model_original <- nhanes_model4 |>
  dplyr::select(-c(log_lbxhscrp, log_lbxtr, log_dr1tkcal, log_dr1tsodi, log_dr1ttfat, log_lbxtc, log_lbdhdd, log_lbxglu, log_bmxbmi, seqn))
train_data_original <- nhanes_model_original[train_index, ]
test_data_original <- nhanes_model_original[-train_index, ]

# transformed predictors - ONLY the transformed predictors (remove transformed target variable log_lbxtc)
nhanes_model_transformed <- nhanes_model4 |>
  dplyr::select(-c(lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, log_lbxtc, lbdhdd, lbxglu, bmxbmi, seqn))
train_data_transformed <- nhanes_model_transformed[train_index, ]
test_data_transformed <- nhanes_model_transformed[-train_index, ]

# transformed predictors + log transformed target variable log_lbxtc
nhanes_model_transformed_target <- nhanes_model4 |>
  dplyr::select(-c(lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, lbxtc, lbdhdd, lbxglu, bmxbmi, seqn))
train_data_transformed_target <- nhanes_model_transformed_target[train_index, ]
test_data_transformed_target <- nhanes_model_transformed_target[-train_index, ]

# new features

nhanes_model_transformed_newfeatures <- nhanes_model5 |>
    dplyr::select(
    # Target
    log_lbxtc,
    
    # Demographics (NO seqn)
    ridageyr, riagendr, ridreth3, dmdeduc2, indfmpir,
    
    # Lifestyle
    smq020, alq111, sedentary_minutes,
    
    # NEW FEATURES (activity & diet)
    TOTAL_ACTIVITY, ACTIVE_FLAG, HIGH_FAT_DIET, HIGH_SODIUM,
    
    # Age/BMI categories
    AGE_YOUNG, AGE_MID,  # AGE_OLD is reference
    BMI_UNDER, BMI_NORMAL, BMI_OVER,  # BMI_OBESE is reference
    
    # Blood pressure (raw values)
    bpxsy1, bpxdi1,
    
    # Biomarkers (LOG versions ONLY - no raw)
    log_lbxhscrp, log_lbxtr, log_lbdhdd, log_lbxglu
    
    )

train_data_transformed_newfeatures <- nhanes_model_transformed_newfeatures[train_index, ]
test_data_transformed_newfeatures <- nhanes_model_transformed_newfeatures[-train_index, ]

4.Build Models

Model 1: Multiple Linear Regression - Full model

# Model 1: original predictors + original target variable lbxtc
set.seed(123)
model_full <- lm(lbxtc ~ ., data = train_data_original)
summary(model_full)

## 
## Call:
## lm(formula = lbxtc ~ ., data = train_data_original)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -205.386  -26.880   -3.902   22.995  246.556 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.859e+01  6.803e+00  14.493  < 2e-16 ***
## ridageyr           6.551e-02  4.048e-02   1.618 0.105684    
## riagendrFemale     4.976e+00  1.400e+00   3.554 0.000384 ***
## ridreth32          2.054e+00  2.584e+00   0.795 0.426837    
## ridreth33         -2.334e+00  2.092e+00  -1.116 0.264487    
## ridreth34         -6.533e+00  2.219e+00  -2.943 0.003264 ** 
## ridreth36          1.608e+00  2.514e+00   0.640 0.522303    
## ridreth37         -2.719e+00  3.223e+00  -0.844 0.398850    
## dmdeduc22         -2.821e+00  2.924e+00  -0.965 0.334658    
## dmdeduc23         -1.927e+00  2.661e+00  -0.724 0.468872    
## dmdeduc24         -3.381e+00  2.612e+00  -1.294 0.195707    
## dmdeduc25         -2.496e-01  2.852e+00  -0.088 0.930271    
## dmdeduc27          5.277e+01  2.790e+01   1.891 0.058689 .  
## dmdeduc29         -6.112e+00  1.410e+01  -0.434 0.664619    
## indfmpir           4.399e-01  4.597e-01   0.957 0.338673    
## smq0202           -3.037e+00  1.356e+00  -2.240 0.025171 *  
## alq1112            1.032e+00  2.149e+00   0.480 0.630904    
## vigorous_minutes   5.556e-05  1.868e-03   0.030 0.976276    
## moderate_minutes  -3.243e-04  1.122e-03  -0.289 0.772516    
## sedentary_minutes  7.255e-05  8.792e-04   0.083 0.934240    
## dr1tkcal           1.876e-03  1.545e-03   1.214 0.224706    
## dr1ttfat          -3.107e-02  2.910e-02  -1.068 0.285671    
## dr1tsodi          -4.430e-04  5.512e-04  -0.804 0.421583    
## bmxbmi             2.602e-01  9.389e-02   2.771 0.005609 ** 
## bpxsy1             1.552e-01  4.148e-02   3.741 0.000186 ***
## bpxdi1             3.544e-01  5.282e-02   6.710 2.21e-11 ***
## lbxglu            -5.781e-02  1.749e-02  -3.306 0.000956 ***
## lbxhscrp          -1.797e-02  8.319e-02  -0.216 0.828978    
## lbdhdd             6.508e-01  4.593e-02  14.167  < 2e-16 ***
## lbxtr              5.909e-02  5.548e-03  10.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.14 on 4113 degrees of freedom
## Multiple R-squared:  0.1082, Adjusted R-squared:  0.1019 
## F-statistic: 17.21 on 29 and 4113 DF,  p-value: < 2.2e-16

Model 2: Multiple Linear Regression: Full model with Log transformation of the skewed predictors

Using log transformations for lbxhscrp, lbxtr, dr1tkcal, dr1tsodi, dr1ttfat, lbdhdd, lbxglu, and bmxbmi.

# Model 2: LM transformed predictors + original target variable
set.seed(123)

# Full linear model using log-transformed predictors
# model_full_log <- lm(lbxtc ~ log_chol + log_hdl + log_bmi + bpxsy1 + bpxdi1 +
#                        log_glucose + vigorous_minutes + moderate_minutes +
#                        sedentary_minutes + indfmpir + ridageyr +
#                        log_crp + log_trig + log_sodium + log_calories + log_fat +
#                        riagendr + ridreth3 + dmdeduc2 + smq020 + alq111,
#                      data = train_full_log)
model_full_transformed <- lm(lbxtc ~ .,
                     data = train_data_transformed)
summary(model_full_transformed)

## 
## Call:
## lm(formula = lbxtc ~ ., data = train_data_transformed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.489  -26.461   -3.879   22.866  240.852 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6.413e+01  2.547e+01  -2.517 0.011862 *  
## ridageyr           3.486e-02  4.000e-02   0.871 0.383550    
## riagendrFemale     3.096e+00  1.388e+00   2.231 0.025761 *  
## ridreth32          2.335e+00  2.546e+00   0.917 0.359228    
## ridreth33         -1.478e+00  2.061e+00  -0.717 0.473536    
## ridreth34         -4.676e+00  2.190e+00  -2.135 0.032800 *  
## ridreth36          3.294e+00  2.486e+00   1.325 0.185150    
## ridreth37         -1.835e+00  3.173e+00  -0.578 0.563126    
## dmdeduc22         -3.932e+00  2.883e+00  -1.364 0.172719    
## dmdeduc23         -2.363e+00  2.621e+00  -0.901 0.367508    
## dmdeduc24         -3.947e+00  2.576e+00  -1.532 0.125509    
## dmdeduc25         -9.525e-01  2.811e+00  -0.339 0.734730    
## dmdeduc27          5.492e+01  2.748e+01   1.999 0.045727 *  
## dmdeduc29         -7.819e+00  1.389e+01  -0.563 0.573419    
## indfmpir           6.082e-01  4.531e-01   1.342 0.179577    
## smq0202           -2.473e+00  1.337e+00  -1.850 0.064415 .  
## alq1112            7.991e-01  2.117e+00   0.377 0.705910    
## vigorous_minutes   5.994e-04  1.841e-03   0.326 0.744700    
## moderate_minutes  -4.654e-04  1.105e-03  -0.421 0.673633    
## sedentary_minutes -4.768e-05  8.658e-04  -0.055 0.956086    
## bpxsy1             1.488e-01  4.091e-02   3.638 0.000278 ***
## bpxdi1             2.856e-01  5.227e-02   5.464 4.94e-08 ***
## log_lbxhscrp       3.224e+00  6.367e-01   5.064 4.29e-07 ***
## log_lbxtr          1.826e+01  1.186e+00  15.397  < 2e-16 ***
## log_dr1tkcal       4.812e+00  2.980e+00   1.615 0.106441    
## log_dr1tsodi      -2.074e+00  1.969e+00  -1.053 0.292402    
## log_dr1ttfat      -2.845e+00  2.219e+00  -1.282 0.199821    
## log_lbdhdd         4.536e+01  2.728e+00  16.627  < 2e-16 ***
## log_lbxglu        -1.505e+01  2.734e+00  -5.505 3.91e-08 ***
## log_bmxbmi         3.525e+00  3.164e+00   1.114 0.265378    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.55 on 4113 degrees of freedom
## Multiple R-squared:  0.1351, Adjusted R-squared:  0.129 
## F-statistic: 22.15 on 29 and 4113 DF,  p-value: < 2.2e-16

Model 3: Multiple Linear Regression: Full model with Log transformation of the skewed predictors + Log transformation of target variable

# Model 3: LM transformed predictors + transformed target variable
set.seed(123)

# Full linear model using log-transformed predictors and target
model_full_transformed_target <- lm(log_lbxtc ~ .,
                     data = train_data_transformed_target)

summary(model_full_transformed_target)

## 
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_target)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79528 -0.13458 -0.00201  0.13493  0.85593 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.781e+00  1.351e-01  27.974  < 2e-16 ***
## ridageyr           9.798e-05  2.122e-04   0.462  0.64428    
## riagendrFemale     1.522e-02  7.363e-03   2.067  0.03879 *  
## ridreth32          1.328e-02  1.351e-02   0.983  0.32557    
## ridreth33         -9.215e-03  1.094e-02  -0.843  0.39954    
## ridreth34         -2.878e-02  1.162e-02  -2.477  0.01327 *  
## ridreth36          1.696e-02  1.319e-02   1.286  0.19841    
## ridreth37         -1.084e-02  1.684e-02  -0.644  0.51950    
## dmdeduc22         -1.849e-02  1.530e-02  -1.209  0.22677    
## dmdeduc23         -1.390e-02  1.391e-02  -0.999  0.31767    
## dmdeduc24         -2.196e-02  1.366e-02  -1.607  0.10810    
## dmdeduc25         -3.279e-03  1.491e-02  -0.220  0.82598    
## dmdeduc27          2.616e-01  1.458e-01   1.794  0.07290 .  
## dmdeduc29         -3.810e-02  7.367e-02  -0.517  0.60510    
## indfmpir           2.862e-03  2.404e-03   1.190  0.23395    
## smq0202           -1.135e-02  7.093e-03  -1.600  0.10964    
## alq1112            4.982e-03  1.123e-02   0.443  0.65746    
## vigorous_minutes   3.891e-06  9.765e-06   0.398  0.69033    
## moderate_minutes  -1.840e-06  5.862e-06  -0.314  0.75363    
## sedentary_minutes -3.211e-07  4.593e-06  -0.070  0.94427    
## bpxsy1             6.533e-04  2.170e-04   3.010  0.00263 ** 
## bpxdi1             1.742e-03  2.773e-04   6.280 3.74e-10 ***
## log_lbxhscrp       1.809e-02  3.378e-03   5.357 8.94e-08 ***
## log_lbxtr          9.672e-02  6.292e-03  15.371  < 2e-16 ***
## log_dr1tkcal       2.774e-02  1.581e-02   1.754  0.07943 .  
## log_dr1tsodi      -9.894e-03  1.045e-02  -0.947  0.34371    
## log_dr1ttfat      -1.821e-02  1.177e-02  -1.547  0.12185    
## log_lbdhdd         2.564e-01  1.447e-02  17.712  < 2e-16 ***
## log_lbxglu        -8.268e-02  1.451e-02  -5.700 1.28e-08 ***
## log_bmxbmi         3.078e-02  1.679e-02   1.834  0.06677 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2045 on 4113 degrees of freedom
## Multiple R-squared:  0.1424, Adjusted R-squared:  0.1364 
## F-statistic: 23.55 on 29 and 4113 DF,  p-value: < 2.2e-16

Model 4: Stepwise AIC on Model 1 (Original data)

AIC=30412.97

# Model 4
set.seed(123)

step_model_full <- stats::step(model_full, direction = "both")

## Start:  AIC=30416.75
## lbxtc ~ ridageyr + riagendr + ridreth3 + dmdeduc2 + indfmpir + 
##     smq020 + alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes + 
##     dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + 
##     lbxglu + lbxhscrp + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - dmdeduc2           6     12506 6314741 30413
## - vigorous_minutes   1         1 6302237 30415
## - sedentary_minutes  1        10 6302246 30415
## - lbxhscrp           1        72 6302307 30415
## - moderate_minutes   1       128 6302363 30415
## - alq111             1       354 6302589 30415
## - dr1tsodi           1       990 6303225 30415
## - indfmpir           1      1403 6303638 30416
## - dr1ttfat           1      1747 6303982 30416
## - dr1tkcal           1      2259 6304495 30416
## <none>                           6302235 30417
## - ridageyr           1      4013 6306248 30417
## - smq020             1      7686 6309921 30420
## - bmxbmi             1     11768 6314003 30423
## - lbxglu             1     16743 6318979 30426
## - ridreth3           5     30282 6332517 30427
## - riagendr           1     19350 6321585 30428
## - bpxsy1             1     21447 6323682 30429
## - bpxdi1             1     68996 6371232 30460
## - lbxtr              1    173850 6476085 30528
## - lbdhdd             1    307535 6609770 30612
## 
## Step:  AIC=30412.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes + 
##     dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + 
##     lbxglu + lbxhscrp + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - vigorous_minutes   1         1 6314743 30411
## - sedentary_minutes  1         4 6314745 30411
## - lbxhscrp           1       127 6314868 30411
## - moderate_minutes   1       160 6314902 30411
## - alq111             1       422 6315164 30411
## - dr1tsodi           1      1116 6315857 30412
## - dr1ttfat           1      2051 6316793 30412
## - dr1tkcal           1      2686 6317427 30413
## - indfmpir           1      2697 6317439 30413
## <none>                           6314741 30413
## - ridageyr           1      5813 6320554 30415
## - smq020             1      6883 6321624 30416
## + dmdeduc2           6     12506 6302235 30417
## - bmxbmi             1     11978 6326720 30419
## - lbxglu             1     16906 6331647 30422
## - riagendr           1     19002 6333743 30423
## - bpxsy1             1     21650 6336391 30425
## - ridreth3           5     37589 6352330 30428
## - bpxdi1             1     67805 6382546 30455
## - lbxtr              1    173060 6487802 30523
## - lbdhdd             1    310092 6624833 30610
## 
## Step:  AIC=30410.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + moderate_minutes + sedentary_minutes + dr1tkcal + 
##     dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + lbxglu + 
##     lbxhscrp + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - sedentary_minutes  1         4 6314746 30409
## - lbxhscrp           1       126 6314869 30409
## - moderate_minutes   1       170 6314913 30409
## - alq111             1       423 6315165 30409
## - dr1tsodi           1      1118 6315860 30410
## - dr1ttfat           1      2051 6316794 30410
## - dr1tkcal           1      2685 6317428 30411
## - indfmpir           1      2697 6317439 30411
## <none>                           6314743 30411
## - ridageyr           1      5842 6320585 30413
## + vigorous_minutes   1         1 6314741 30413
## - smq020             1      6882 6321624 30414
## + dmdeduc2           6     12506 6302237 30415
## - bmxbmi             1     11983 6326725 30417
## - lbxglu             1     16911 6331654 30420
## - riagendr           1     19052 6333795 30421
## - bpxsy1             1     21649 6336392 30423
## - ridreth3           5     37617 6352359 30426
## - bpxdi1             1     67807 6382549 30453
## - lbxtr              1    173089 6487832 30521
## - lbdhdd             1    310091 6624834 30608
## 
## Step:  AIC=30408.97
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + moderate_minutes + dr1tkcal + dr1ttfat + dr1tsodi + 
##     bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbxhscrp + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - lbxhscrp           1       126 6314872 30407
## - moderate_minutes   1       172 6314918 30407
## - alq111             1       422 6315168 30407
## - dr1tsodi           1      1119 6315865 30408
## - dr1ttfat           1      2052 6316798 30408
## - dr1tkcal           1      2689 6317435 30409
## - indfmpir           1      2703 6317449 30409
## <none>                           6314746 30409
## - ridageyr           1      5845 6320591 30411
## + sedentary_minutes  1         4 6314743 30411
## + vigorous_minutes   1         1 6314745 30411
## - smq020             1      6882 6321628 30412
## + dmdeduc2           6     12499 6302247 30413
## - bmxbmi             1     11986 6326732 30415
## - lbxglu             1     16929 6331676 30418
## - riagendr           1     19092 6333839 30420
## - bpxsy1             1     21673 6336419 30421
## - ridreth3           5     37646 6352392 30424
## - bpxdi1             1     67814 6382560 30451
## - lbxtr              1    173420 6488166 30519
## - lbdhdd             1    310089 6624835 30606
## 
## Step:  AIC=30407.05
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + moderate_minutes + dr1tkcal + dr1ttfat + dr1tsodi + 
##     bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - moderate_minutes   1       171 6315043 30405
## - alq111             1       409 6315282 30405
## - dr1tsodi           1      1106 6315979 30406
## - dr1ttfat           1      2037 6316909 30406
## - dr1tkcal           1      2682 6317554 30407
## - indfmpir           1      2746 6317618 30407
## <none>                           6314872 30407
## - ridageyr           1      5814 6320686 30409
## + lbxhscrp           1       126 6314746 30409
## + sedentary_minutes  1         3 6314869 30409
## + vigorous_minutes   1         1 6314871 30409
## - smq020             1      6819 6321691 30410
## + dmdeduc2           6     12554 6302318 30411
## - bmxbmi             1     12075 6326948 30413
## - lbxglu             1     17092 6331964 30416
## - riagendr           1     18970 6333842 30418
## - bpxsy1             1     21764 6336637 30419
## - ridreth3           5     38012 6352885 30422
## - bpxdi1             1     67836 6382708 30449
## - lbxtr              1    173768 6488640 30518
## - lbdhdd             1    311113 6625986 30604
## 
## Step:  AIC=30405.16
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + 
##     bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - alq111             1       416 6315459 30403
## - dr1tsodi           1      1095 6316139 30404
## - dr1ttfat           1      2030 6317074 30405
## - dr1tkcal           1      2671 6317715 30405
## - indfmpir           1      2773 6317816 30405
## <none>                           6315043 30405
## - ridageyr           1      5909 6320952 30407
## + moderate_minutes   1       171 6314872 30407
## + lbxhscrp           1       125 6314918 30407
## + vigorous_minutes   1        11 6315032 30407
## + sedentary_minutes  1         5 6315038 30407
## - smq020             1      6789 6321833 30408
## + dmdeduc2           6     12597 6302447 30409
## - bmxbmi             1     12027 6327071 30411
## - lbxglu             1     17307 6332351 30415
## - riagendr           1     19111 6334155 30416
## - bpxsy1             1     21846 6336889 30418
## - ridreth3           5     38111 6353154 30420
## - bpxdi1             1     67695 6382738 30447
## - lbxtr              1    173608 6488652 30516
## - lbdhdd             1    311739 6626782 30603
## 
## Step:  AIC=30403.44
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     dr1tkcal + dr1ttfat + dr1tsodi + bmxbmi + bpxsy1 + bpxdi1 + 
##     lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - dr1tsodi           1      1113 6316572 30402
## - dr1ttfat           1      2033 6317492 30403
## - indfmpir           1      2638 6318097 30403
## - dr1tkcal           1      2656 6318115 30403
## <none>                           6315459 30403
## + alq111             1       416 6315043 30405
## + moderate_minutes   1       178 6315282 30405
## - ridageyr           1      5959 6321418 30405
## + lbxhscrp           1       112 6315347 30405
## + vigorous_minutes   1        13 6315446 30405
## + sedentary_minutes  1         4 6315455 30405
## - smq020             1      6400 6321859 30406
## + dmdeduc2           6     12664 6302795 30407
## - bmxbmi             1     11982 6327442 30409
## - lbxglu             1     17484 6332943 30413
## - riagendr           1     19527 6334986 30414
## - bpxsy1             1     22015 6337474 30416
## - ridreth3           5     38933 6354392 30419
## - bpxdi1             1     67308 6382767 30445
## - lbxtr              1    173551 6489011 30514
## - lbdhdd             1    311396 6626855 30601
## 
## Step:  AIC=30402.17
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     dr1tkcal + dr1ttfat + bmxbmi + bpxsy1 + bpxdi1 + lbxglu + 
##     lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - dr1tkcal           1      1763 6318336 30401
## - indfmpir           1      2593 6319165 30402
## - dr1ttfat           1      2684 6319256 30402
## <none>                           6316572 30402
## + dr1tsodi           1      1113 6315459 30403
## + alq111             1       434 6316139 30404
## + moderate_minutes   1       166 6316406 30404
## + lbxhscrp           1       101 6316472 30404
## + vigorous_minutes   1        17 6316555 30404
## + sedentary_minutes  1         5 6316568 30404
## - ridageyr           1      6262 6322834 30404
## - smq020             1      6489 6323062 30404
## + dmdeduc2           6     12788 6303785 30406
## - bmxbmi             1     11673 6328245 30408
## - lbxglu             1     17493 6334065 30412
## - riagendr           1     20116 6336689 30413
## - bpxsy1             1     21965 6338537 30415
## - ridreth3           5     38115 6354687 30417
## - bpxdi1             1     67451 6384023 30444
## - lbxtr              1    173143 6489716 30512
## - lbdhdd             1    311010 6627583 30599
## 
## Step:  AIC=30401.32
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     dr1ttfat + bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - dr1ttfat           1       997 6319333 30400
## - indfmpir           1      2434 6320770 30401
## <none>                           6318336 30401
## + dr1tkcal           1      1763 6316572 30402
## + alq111             1       410 6317925 30403
## - ridageyr           1      5762 6324098 30403
## + dr1tsodi           1       221 6318115 30403
## + moderate_minutes   1       163 6318173 30403
## + lbxhscrp           1       101 6318234 30403
## + vigorous_minutes   1        13 6318323 30403
## + sedentary_minutes  1         8 6318328 30403
## - smq020             1      6573 6324908 30404
## + dmdeduc2           6     13094 6305242 30405
## - bmxbmi             1     11238 6329574 30407
## - lbxglu             1     17995 6336330 30411
## - riagendr           1     18728 6337063 30412
## - bpxsy1             1     22276 6340611 30414
## - ridreth3           5     39916 6358252 30417
## - bpxdi1             1     68062 6386397 30444
## - lbxtr              1    175897 6494233 30513
## - lbdhdd             1    310869 6629204 30598
## 
## Step:  AIC=30399.98
## lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## - indfmpir           1      2293 6321625 30400
## <none>                           6319333 30400
## + dr1tsodi           1      1112 6318221 30401
## + dr1ttfat           1       997 6318336 30401
## + alq111             1       454 6318879 30402
## + moderate_minutes   1       155 6319178 30402
## + dr1tkcal           1        76 6319256 30402
## + lbxhscrp           1        76 6319257 30402
## + vigorous_minutes   1        18 6319315 30402
## + sedentary_minutes  1         5 6319328 30402
## - ridageyr           1      6417 6325750 30402
## - smq020             1      6494 6325826 30402
## + dmdeduc2           6     13170 6306162 30403
## - bmxbmi             1     10892 6330224 30405
## - lbxglu             1     18021 6337354 30410
## - riagendr           1     21858 6341190 30412
## - bpxsy1             1     22385 6341717 30413
## - ridreth3           5     41077 6360410 30417
## - bpxdi1             1     67820 6387153 30442
## - lbxtr              1    175262 6494595 30511
## - lbdhdd             1    309874 6629207 30596
## 
## Step:  AIC=30399.48
## lbxtc ~ ridageyr + riagendr + ridreth3 + smq020 + bmxbmi + bpxsy1 + 
##     bpxdi1 + lbxglu + lbdhdd + lbxtr
## 
##                     Df Sum of Sq     RSS   AIC
## <none>                           6321625 30400
## + indfmpir           1      2293 6319333 30400
## + dr1tsodi           1      1006 6320620 30401
## + dr1ttfat           1       856 6320770 30401
## - smq020             1      5649 6327274 30401
## + alq111             1       319 6321307 30401
## + moderate_minutes   1       178 6321448 30401
## + lbxhscrp           1       111 6321514 30401
## + dr1tkcal           1        57 6321568 30401
## + vigorous_minutes   1        16 6321609 30402
## + sedentary_minutes  1        12 6321613 30402
## + dmdeduc2           6     14303 6307323 30402
## - ridageyr           1      7235 6328861 30402
## - bmxbmi             1     11074 6332699 30405
## - lbxglu             1     17574 6339200 30409
## - riagendr           1     20983 6342608 30411
## - bpxsy1             1     21323 6342949 30411
## - ridreth3           5     42534 6364160 30417
## - bpxdi1             1     69815 6391440 30443
## - lbxtr              1    176892 6498518 30512
## - lbdhdd             1    319754 6641379 30602

summary(step_model_full)

## 
## Call:
## lm(formula = lbxtc ~ ridageyr + riagendr + ridreth3 + smq020 + 
##     bmxbmi + bpxsy1 + bpxdi1 + lbxglu + lbdhdd + lbxtr, data = train_data_original)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -205.991  -26.687   -3.703   22.741  247.549 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    97.05147    6.24730  15.535  < 2e-16 ***
## ridageyr        0.08447    0.03886   2.174 0.029790 *  
## riagendrFemale  4.92693    1.33103   3.702 0.000217 ***
## ridreth32       2.23916    2.57090   0.871 0.383825    
## ridreth33      -2.63478    1.96970  -1.338 0.181082    
## ridreth34      -6.95561    2.11946  -3.282 0.001040 ** 
## ridreth36       2.76851    2.33896   1.184 0.236620    
## ridreth37      -3.10601    3.15328  -0.985 0.324677    
## smq0202        -2.52044    1.31232  -1.921 0.054852 .  
## bmxbmi          0.24385    0.09068   2.689 0.007193 ** 
## bpxsy1          0.15324    0.04107   3.732 0.000193 ***
## bpxdi1          0.35326    0.05232   6.752 1.66e-11 ***
## lbxglu         -0.05884    0.01737  -3.388 0.000712 ***
## lbdhdd          0.65403    0.04526  14.450  < 2e-16 ***
## lbxtr           0.05933    0.00552  10.748  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.13 on 4128 degrees of freedom
## Multiple R-squared:  0.1055, Adjusted R-squared:  0.1024 
## F-statistic: 34.76 on 14 and 4128 DF,  p-value: < 2.2e-16

Model 5: Stepwise AIC on Model 3 (Log transformed predictors + target)

# Model 5
set.seed(123)

step_model_full_transformed_target <- stats::step(model_full_transformed_target, direction = "both")

## Start:  AIC=-13120.82
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + dmdeduc2 + indfmpir + 
##     smq020 + alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes + 
##     bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - dmdeduc2           6    0.3961 172.43 -13123
## - sedentary_minutes  1    0.0002 172.04 -13123
## - moderate_minutes   1    0.0041 172.04 -13123
## - vigorous_minutes   1    0.0066 172.04 -13123
## - alq111             1    0.0082 172.04 -13123
## - ridageyr           1    0.0089 172.04 -13123
## - log_dr1tsodi       1    0.0375 172.07 -13122
## - indfmpir           1    0.0593 172.10 -13121
## <none>                           172.04 -13121
## - log_dr1ttfat       1    0.1002 172.14 -13120
## - smq020             1    0.1071 172.14 -13120
## - log_dr1tkcal       1    0.1287 172.16 -13120
## - log_bmxbmi         1    0.1406 172.18 -13119
## - riagendr           1    0.1787 172.22 -13118
## - bpxsy1             1    0.3790 172.42 -13114
## - ridreth3           5    0.8144 172.85 -13111
## - log_lbxhscrp       1    1.2002 173.24 -13094
## - log_lbxglu         1    1.3590 173.40 -13090
## - bpxdi1             1    1.6495 173.69 -13083
## - log_lbxtr          1    9.8824 181.92 -12891
## - log_lbdhdd         1   13.1221 185.16 -12818
## 
## Step:  AIC=-13123.3
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + vigorous_minutes + moderate_minutes + sedentary_minutes + 
##     bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - sedentary_minutes  1    0.0005 172.43 -13125
## - vigorous_minutes   1    0.0046 172.44 -13125
## - moderate_minutes   1    0.0053 172.44 -13125
## - alq111             1    0.0103 172.44 -13125
## - ridageyr           1    0.0288 172.46 -13125
## - log_dr1tsodi       1    0.0408 172.47 -13124
## <none>                           172.43 -13123
## - smq020             1    0.0894 172.52 -13123
## - indfmpir           1    0.1082 172.54 -13123
## - log_dr1ttfat       1    0.1181 172.55 -13122
## - log_bmxbmi         1    0.1489 172.58 -13122
## - log_dr1tkcal       1    0.1495 172.58 -13122
## - riagendr           1    0.1725 172.60 -13121
## + dmdeduc2           6    0.3961 172.04 -13121
## - bpxsy1             1    0.3792 172.81 -13116
## - ridreth3           5    1.0580 173.49 -13108
## - log_lbxhscrp       1    1.1580 173.59 -13098
## - log_lbxglu         1    1.3590 173.79 -13093
## - bpxdi1             1    1.6288 174.06 -13086
## - log_lbxtr          1    9.8568 182.29 -12895
## - log_lbdhdd         1   13.2179 185.65 -12819
## 
## Step:  AIC=-13125.29
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + vigorous_minutes + moderate_minutes + bpxsy1 + bpxdi1 + 
##     log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1tsodi + 
##     log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - vigorous_minutes   1    0.0046 172.44 -13127
## - moderate_minutes   1    0.0052 172.44 -13127
## - alq111             1    0.0103 172.44 -13127
## - ridageyr           1    0.0288 172.46 -13127
## - log_dr1tsodi       1    0.0407 172.47 -13126
## <none>                           172.43 -13125
## - smq020             1    0.0894 172.52 -13125
## - indfmpir           1    0.1080 172.54 -13125
## - log_dr1ttfat       1    0.1181 172.55 -13124
## - log_bmxbmi         1    0.1489 172.58 -13124
## - log_dr1tkcal       1    0.1494 172.58 -13124
## + sedentary_minutes  1    0.0005 172.43 -13123
## - riagendr           1    0.1721 172.60 -13123
## + dmdeduc2           6    0.3963 172.04 -13123
## - bpxsy1             1    0.3788 172.81 -13118
## - ridreth3           5    1.0616 173.50 -13110
## - log_lbxhscrp       1    1.1577 173.59 -13100
## - log_lbxglu         1    1.3656 173.80 -13095
## - bpxdi1             1    1.6286 174.06 -13088
## - log_lbxtr          1    9.8634 182.30 -12897
## - log_lbdhdd         1   13.2175 185.65 -12821
## 
## Step:  AIC=-13127.17
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + moderate_minutes + bpxsy1 + bpxdi1 + log_lbxhscrp + 
##     log_lbxtr + log_dr1tkcal + log_dr1tsodi + log_dr1ttfat + 
##     log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - moderate_minutes   1    0.0037 172.44 -13129
## - alq111             1    0.0102 172.45 -13129
## - ridageyr           1    0.0277 172.47 -13128
## - log_dr1tsodi       1    0.0401 172.48 -13128
## <none>                           172.44 -13127
## - smq020             1    0.0902 172.53 -13127
## - indfmpir           1    0.1082 172.55 -13127
## - log_dr1ttfat       1    0.1184 172.56 -13126
## - log_bmxbmi         1    0.1488 172.59 -13126
## - log_dr1tkcal       1    0.1495 172.59 -13126
## + vigorous_minutes   1    0.0046 172.43 -13125
## + sedentary_minutes  1    0.0005 172.44 -13125
## - riagendr           1    0.1700 172.61 -13125
## + dmdeduc2           6    0.3943 172.04 -13125
## - bpxsy1             1    0.3791 172.82 -13120
## - ridreth3           5    1.0589 173.50 -13112
## - log_lbxhscrp       1    1.1560 173.59 -13102
## - log_lbxglu         1    1.3711 173.81 -13096
## - bpxdi1             1    1.6285 174.07 -13090
## - log_lbxtr          1    9.8588 182.30 -12899
## - log_lbdhdd         1   13.2164 185.65 -12823
## 
## Step:  AIC=-13129.08
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     alq111 + bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - alq111             1    0.0104 172.45 -13131
## - ridageyr           1    0.0287 172.47 -13130
## - log_dr1tsodi       1    0.0398 172.48 -13130
## <none>                           172.44 -13129
## - smq020             1    0.0898 172.53 -13129
## - indfmpir           1    0.1089 172.55 -13128
## - log_dr1ttfat       1    0.1181 172.56 -13128
## - log_bmxbmi         1    0.1482 172.59 -13128
## - log_dr1tkcal       1    0.1493 172.59 -13128
## + moderate_minutes   1    0.0037 172.44 -13127
## + vigorous_minutes   1    0.0032 172.44 -13127
## + sedentary_minutes  1    0.0004 172.44 -13127
## - riagendr           1    0.1721 172.61 -13127
## + dmdeduc2           6    0.3955 172.05 -13127
## - bpxsy1             1    0.3808 172.82 -13122
## - ridreth3           5    1.0622 173.50 -13114
## - log_lbxhscrp       1    1.1551 173.60 -13103
## - log_lbxglu         1    1.3809 173.82 -13098
## - bpxdi1             1    1.6255 174.07 -13092
## - log_lbxtr          1    9.8564 182.30 -12901
## - log_lbdhdd         1   13.2504 185.69 -12824
## 
## Step:  AIC=-13130.83
## log_lbxtc ~ ridageyr + riagendr + ridreth3 + indfmpir + smq020 + 
##     bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1tsodi + log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - ridageyr           1    0.0292 172.48 -13132
## - log_dr1tsodi       1    0.0401 172.49 -13132
## - smq020             1    0.0815 172.53 -13131
## <none>                           172.45 -13131
## - indfmpir           1    0.1050 172.56 -13130
## - log_dr1ttfat       1    0.1194 172.57 -13130
## - log_bmxbmi         1    0.1469 172.60 -13129
## - log_dr1tkcal       1    0.1491 172.60 -13129
## + alq111             1    0.0104 172.44 -13129
## + moderate_minutes   1    0.0039 172.45 -13129
## + vigorous_minutes   1    0.0031 172.45 -13129
## + sedentary_minutes  1    0.0005 172.45 -13129
## - riagendr           1    0.1772 172.63 -13129
## + dmdeduc2           6    0.3976 172.05 -13128
## - bpxsy1             1    0.3845 172.84 -13124
## - ridreth3           5    1.0917 173.54 -13115
## - log_lbxhscrp       1    1.1626 173.61 -13105
## - log_lbxglu         1    1.3877 173.84 -13100
## - bpxdi1             1    1.6160 174.07 -13094
## - log_lbxtr          1    9.8481 182.30 -12903
## - log_lbdhdd         1   13.2450 185.70 -12826
## 
## Step:  AIC=-13132.13
## log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 + bpxsy1 + 
##     bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1tsodi + 
##     log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## - log_dr1tsodi       1    0.0448 172.53 -13133
## <none>                           172.48 -13132
## - smq020             1    0.0965 172.58 -13132
## - log_dr1ttfat       1    0.1156 172.60 -13131
## - indfmpir           1    0.1173 172.60 -13131
## + ridageyr           1    0.0292 172.45 -13131
## - log_dr1tkcal       1    0.1439 172.62 -13131
## - log_bmxbmi         1    0.1449 172.63 -13131
## + alq111             1    0.0109 172.47 -13130
## + moderate_minutes   1    0.0050 172.48 -13130
## + dmdeduc2           6    0.4181 172.06 -13130
## + vigorous_minutes   1    0.0020 172.48 -13130
## + sedentary_minutes  1    0.0004 172.48 -13130
## - riagendr           1    0.1703 172.65 -13130
## - bpxsy1             1    0.6265 173.11 -13119
## - ridreth3           5    1.0913 173.57 -13116
## - log_lbxhscrp       1    1.1836 173.66 -13106
## - log_lbxglu         1    1.3627 173.84 -13102
## - bpxdi1             1    1.5877 174.07 -13096
## - log_lbxtr          1    9.9258 182.41 -12902
## - log_lbdhdd         1   13.5534 186.03 -12821
## 
## Step:  AIC=-13133.06
## log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 + bpxsy1 + 
##     bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + log_dr1ttfat + 
##     log_lbdhdd + log_lbxglu + log_bmxbmi
## 
##                     Df Sum of Sq    RSS    AIC
## <none>                           172.53 -13133
## - smq020             1    0.1010 172.63 -13133
## - log_dr1tkcal       1    0.1028 172.63 -13133
## - indfmpir           1    0.1139 172.64 -13132
## + log_dr1tsodi       1    0.0448 172.48 -13132
## + ridageyr           1    0.0339 172.49 -13132
## - log_bmxbmi         1    0.1367 172.66 -13132
## + alq111             1    0.0113 172.51 -13131
## + dmdeduc2           6    0.4237 172.10 -13131
## - log_dr1ttfat       1    0.1608 172.69 -13131
## + moderate_minutes   1    0.0046 172.52 -13131
## + vigorous_minutes   1    0.0016 172.52 -13131
## + sedentary_minutes  1    0.0004 172.53 -13131
## - riagendr           1    0.1793 172.71 -13131
## - bpxsy1             1    0.6367 173.16 -13120
## - ridreth3           5    1.0666 173.59 -13118
## - log_lbxhscrp       1    1.2063 173.73 -13106
## - log_lbxglu         1    1.3656 173.89 -13102
## - bpxdi1             1    1.5851 174.11 -13097
## - log_lbxtr          1    9.9074 182.43 -12904
## - log_lbdhdd         1   13.5512 186.08 -12822

summary(step_model_full_transformed_target)

## 
## Call:
## lm(formula = log_lbxtc ~ riagendr + ridreth3 + indfmpir + smq020 + 
##     bpxsy1 + bpxdi1 + log_lbxhscrp + log_lbxtr + log_dr1tkcal + 
##     log_dr1ttfat + log_lbdhdd + log_lbxglu + log_bmxbmi, data = train_data_transformed_target)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78399 -0.13374 -0.00143  0.13398  0.86598 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.7387534  0.1324538  28.227  < 2e-16 ***
## riagendrFemale  0.0151442  0.0073136   2.071  0.03845 *  
## ridreth32       0.0137406  0.0134444   1.022  0.30683    
## ridreth33      -0.0110760  0.0103169  -1.074  0.28308    
## ridreth34      -0.0313803  0.0111297  -2.820  0.00483 ** 
## ridreth36       0.0197936  0.0124451   1.590  0.11181    
## ridreth37      -0.0139308  0.0165009  -0.844  0.39858    
## indfmpir        0.0036243  0.0021962   1.650  0.09897 .  
## smq0202        -0.0106687  0.0068649  -1.554  0.12024    
## bpxsy1          0.0007444  0.0001908   3.902 9.70e-05 ***
## bpxdi1          0.0016752  0.0002721   6.156 8.17e-10 ***
## log_lbxhscrp    0.0180565  0.0033622   5.370 8.29e-08 ***
## log_lbxtr       0.0965329  0.0062721  15.391  < 2e-16 ***
## log_dr1tkcal    0.0224983  0.0143528   1.568  0.11707    
## log_dr1ttfat   -0.0223556  0.0114010  -1.961  0.04996 *  
## log_lbdhdd      0.2574592  0.0143033  18.000  < 2e-16 ***
## log_lbxglu     -0.0822295  0.0143904  -5.714 1.18e-08 ***
## log_bmxbmi      0.0302211  0.0167136   1.808  0.07065 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2045 on 4125 degrees of freedom
## Multiple R-squared:   0.14,  Adjusted R-squared:  0.1364 
## F-statistic:  39.5 on 17 and 4125 DF,  p-value: < 2.2e-16

Model 6: Ridge: Log transformed predictors + target

Good for multicollinearity - adds a penalty (L2 regularization)

# Model 6
set.seed(123)

ridge_model <- train(log_lbxtc ~ ., data = train_data_transformed_target,
                method = "glmnet",
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 10),
                tuneGrid = expand.grid(alpha = 0, lambda = seq(0, 1, 0.1)))
summary(ridge_model)

##             Length Class      Mode     
## a0           100   -none-     numeric  
## beta        2900   dgCMatrix  S4       
## df           100   -none-     numeric  
## dim            2   -none-     numeric  
## lambda       100   -none-     numeric  
## dev.ratio    100   -none-     numeric  
## nulldev        1   -none-     numeric  
## npasses        1   -none-     numeric  
## jerr           1   -none-     numeric  
## offset         1   -none-     logical  
## call           5   -none-     call     
## nobs           1   -none-     numeric  
## lambdaOpt      1   -none-     numeric  
## xNames        29   -none-     character
## problemType    1   -none-     character
## tuneValue      2   data.frame list     
## obsLevels      1   -none-     logical  
## param          0   -none-     list

testX <- test_data_transformed_target %>%
  dplyr::select(-log_lbxtc)

testY <- test_data_transformed_target$log_lbxtc

ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_resample

##      RMSE  Rsquared       MAE 
## 0.2054136 0.1103974 0.1596422

Model 7: Elastic Net: Log transformed predictors + target

# Model 7

set.seed(123)
# Elastic net
testX <- test_data_transformed_target %>%
  dplyr::select(-log_lbxtc)

testY <- test_data_transformed_target$log_lbxtc

enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
enet_model <- train(log_lbxtc ~ ., data = train_data_transformed_target, method = "enet",
                 tuneGrid = enetGrid,
                 trControl = trainControl(method = "cv", number = 10),
                 preProcess = c("center", "scale"))
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_resample

##      RMSE  Rsquared       MAE 
## 0.2062352 0.1030591 0.1606479

Model 8: Robust Regression: Original Data

# Model 8
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_model_full <- rlm(lbxtc ~ ., data = train_data_original, psi = psi.huber)
summary(robust_model_full)

## 
## Call: rlm(formula = lbxtc ~ ., data = train_data_original, psi = psi.huber)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -249.339  -24.778   -1.536   24.560  250.206 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)       93.9388  6.5811    14.2740
## ridageyr           0.0931  0.0392     2.3770
## riagendrFemale     3.8709  1.3546     2.8576
## ridreth32          2.4554  2.5001     0.9821
## ridreth33         -3.0352  2.0235    -1.4999
## ridreth34         -6.3734  2.1471    -2.9683
## ridreth36          1.5197  2.4317     0.6250
## ridreth37         -1.2667  3.1176    -0.4063
## dmdeduc22         -0.9823  2.8286    -0.3473
## dmdeduc23         -0.7882  2.5741    -0.3062
## dmdeduc24         -2.2230  2.5274    -0.8796
## dmdeduc25          1.3845  2.7593     0.5018
## dmdeduc27         59.5691 26.9946     2.2067
## dmdeduc29         -6.3412 13.6374    -0.4650
## indfmpir           0.3149  0.4447     0.7080
## smq0202           -2.5989  1.3119    -1.9810
## alq1112            0.5368  2.0787     0.2582
## vigorous_minutes   0.0002  0.0018     0.1261
## moderate_minutes  -0.0003  0.0011    -0.2554
## sedentary_minutes  0.0004  0.0009     0.4564
## dr1tkcal           0.0013  0.0015     0.8648
## dr1ttfat          -0.0289  0.0282    -1.0282
## dr1tsodi          -0.0003  0.0005    -0.5100
## bmxbmi             0.3318  0.0908     3.6527
## bpxsy1             0.1089  0.0401     2.7127
## bpxdi1             0.4065  0.0511     7.9548
## lbxglu            -0.0769  0.0169    -4.5432
## lbxhscrp          -0.0558  0.0805    -0.6935
## lbdhdd             0.6806  0.0444    15.3166
## lbxtr              0.0765  0.0054    14.2568
## 
## Residual standard error: 36.66 on 4113 degrees of freedom

Model 9: Robust Regression: Transformed Data

# Model 9
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_model_full_transformed_target <- rlm(log_lbxtc ~ ., data = train_data_transformed_target, psi = psi.huber)
summary(robust_model_full_transformed_target)

## 
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target, 
##     psi = psi.huber)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80890 -0.13423 -0.00364  0.13459  0.85464 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)        3.7498  0.1343    27.9279
## ridageyr           0.0003  0.0002     1.4544
## riagendrFemale     0.0114  0.0073     1.5641
## ridreth32          0.0136  0.0134     1.0170
## ridreth33         -0.0116  0.0109    -1.0668
## ridreth34         -0.0272  0.0115    -2.3562
## ridreth36          0.0196  0.0131     1.4938
## ridreth37         -0.0045  0.0167    -0.2677
## dmdeduc22         -0.0124  0.0152    -0.8175
## dmdeduc23         -0.0109  0.0138    -0.7924
## dmdeduc24         -0.0186  0.0136    -1.3708
## dmdeduc25          0.0008  0.0148     0.0542
## dmdeduc27          0.2756  0.1449     1.9023
## dmdeduc29         -0.0331  0.0732    -0.4526
## indfmpir           0.0029  0.0024     1.1974
## smq0202           -0.0136  0.0070    -1.9270
## alq1112            0.0007  0.0112     0.0627
## vigorous_minutes   0.0000  0.0000     0.4452
## moderate_minutes   0.0000  0.0000    -0.4649
## sedentary_minutes  0.0000  0.0000     0.2177
## bpxsy1             0.0005  0.0002     2.4705
## bpxdi1             0.0019  0.0003     6.9050
## log_lbxhscrp       0.0160  0.0034     4.7700
## log_lbxtr          0.1012  0.0063    16.1821
## log_dr1tkcal       0.0280  0.0157     1.7821
## log_dr1tsodi      -0.0112  0.0104    -1.0836
## log_dr1ttfat      -0.0190  0.0117    -1.6273
## log_lbdhdd         0.2615  0.0144    18.1882
## log_lbxglu        -0.0926  0.0144    -6.4237
## log_bmxbmi         0.0439  0.0167     2.6306
## 
## Residual standard error: 0.1993 on 4113 degrees of freedom

Model 10: Robust Regression: Transformed Data: Bisquare weighting function

# Model 10
set.seed(123)
# weighting functions to down-weight influential observations
# default weighting function is Huber
# uses Iteratively Reweighted Least Squares (IRLS)
robust_bisquare_model_full_transformed_target <- rlm(log_lbxtc ~ ., data = train_data_transformed_target, psi = psi.bisquare)
summary(robust_bisquare_model_full_transformed_target)

## 
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target, 
##     psi = psi.bisquare)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.810378 -0.134024 -0.003544  0.134312  0.855685 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)        3.7382  0.1352    27.6441
## ridageyr           0.0003  0.0002     1.5380
## riagendrFemale     0.0112  0.0074     1.5156
## ridreth32          0.0121  0.0135     0.8933
## ridreth33         -0.0126  0.0109    -1.1558
## ridreth34         -0.0290  0.0116    -2.4912
## ridreth36          0.0181  0.0132     1.3685
## ridreth37         -0.0038  0.0168    -0.2252
## dmdeduc22         -0.0115  0.0153    -0.7500
## dmdeduc23         -0.0094  0.0139    -0.6776
## dmdeduc24         -0.0172  0.0137    -1.2556
## dmdeduc25          0.0024  0.0149     0.1585
## dmdeduc27          0.2773  0.1459     1.9006
## dmdeduc29         -0.0297  0.0737    -0.4026
## indfmpir           0.0025  0.0024     1.0586
## smq0202           -0.0134  0.0071    -1.8892
## alq1112            0.0016  0.0112     0.1413
## vigorous_minutes   0.0000  0.0000     0.4244
## moderate_minutes   0.0000  0.0000    -0.4410
## sedentary_minutes  0.0000  0.0000     0.2513
## bpxsy1             0.0005  0.0002     2.4135
## bpxdi1             0.0019  0.0003     6.9122
## log_lbxhscrp       0.0159  0.0034     4.7084
## log_lbxtr          0.1016  0.0063    16.1453
## log_dr1tkcal       0.0290  0.0158     1.8305
## log_dr1tsodi      -0.0120  0.0105    -1.1503
## log_dr1ttfat      -0.0195  0.0118    -1.6588
## log_lbdhdd         0.2631  0.0145    18.1656
## log_lbxglu        -0.0925  0.0145    -6.3756
## log_bmxbmi         0.0450  0.0168     2.6791
## 
## Residual standard error: 0.1989 on 4113 degrees of freedom

Model 12: Multiple Linear Regression: Log transformation and New features

Age Categories Revealed Non-Linear Pattern as Cholesterol peaks in middle age, then declines in elderly . BMI Categories are more informative than Log BMI. High Sodium Flag is Significant as HIGH_SODIUM (>2300 mg/day) → -1.93% cholesterol (p = 0.009) and hence the effect is in unexpected direction (negative).

Physical activity and high-fat diet does not have detectable direct effect on cholesterol.

# Model 12
set.seed(123)
model_transformed_newfeatures <- lm(log_lbxtc ~ ., data = 
train_data_transformed_newfeatures)
summary(model_transformed_newfeatures)

## 
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_newfeatures)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84349 -0.13022  0.00021  0.13048  0.92606 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.939e+00  9.713e-02  40.557  < 2e-16 ***
## ridageyr           2.610e-04  3.910e-04   0.668  0.50447    
## riagendrFemale     9.479e-03  7.208e-03   1.315  0.18854    
## ridreth32          1.294e-02  1.325e-02   0.977  0.32849    
## ridreth33          3.639e-03  1.079e-02   0.337  0.73587    
## ridreth34         -2.564e-02  1.141e-02  -2.247  0.02472 *  
## ridreth36          2.124e-02  1.300e-02   1.634  0.10242    
## ridreth37         -5.423e-03  1.654e-02  -0.328  0.74299    
## dmdeduc22         -1.153e-02  1.501e-02  -0.768  0.44268    
## dmdeduc23         -8.491e-03  1.368e-02  -0.621  0.53493    
## dmdeduc24         -1.219e-02  1.342e-02  -0.908  0.36384    
## dmdeduc25         -3.099e-04  1.462e-02  -0.021  0.98309    
## dmdeduc27          2.315e-01  1.430e-01   1.619  0.10556    
## dmdeduc29         -1.392e-02  7.229e-02  -0.193  0.84730    
## indfmpir           4.088e-04  2.367e-03   0.173  0.86288    
## smq0202           -6.101e-03  6.986e-03  -0.873  0.38257    
## alq1112            8.724e-03  1.104e-02   0.790  0.42945    
## sedentary_minutes  1.529e-06  4.523e-06   0.338  0.73543    
## TOTAL_ACTIVITY    -1.197e-06  4.763e-06  -0.251  0.80162    
## ACTIVE_FLAG        8.684e-03  7.269e-03   1.195  0.23230    
## HIGH_FAT_DIET     -6.568e-03  6.590e-03  -0.997  0.31892    
## HIGH_SODIUM       -2.001e-02  7.416e-03  -2.699  0.00699 ** 
## AGE_YOUNG          1.239e-02  2.112e-02   0.586  0.55763    
## AGE_MID            8.465e-02  1.216e-02   6.962 3.90e-12 ***
## BMI_UNDER         -7.372e-02  2.690e-02  -2.741  0.00616 ** 
## BMI_NORMAL        -2.728e-02  9.366e-03  -2.912  0.00361 ** 
## BMI_OVER           1.256e-02  7.745e-03   1.621  0.10500    
## bpxsy1             1.006e-03  2.150e-04   4.680 2.96e-06 ***
## bpxdi1             7.897e-04  2.844e-04   2.777  0.00552 ** 
## log_lbxhscrp       1.460e-02  3.227e-03   4.524 6.23e-06 ***
## log_lbxtr          9.078e-02  6.181e-03  14.686  < 2e-16 ***
## log_lbdhdd         2.580e-01  1.421e-02  18.160  < 2e-16 ***
## log_lbxglu        -8.354e-02  1.418e-02  -5.889 4.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2005 on 4110 degrees of freedom
## Multiple R-squared:  0.1762, Adjusted R-squared:  0.1698 
## F-statistic: 27.47 on 32 and 4110 DF,  p-value: < 2.2e-16

Check for multicollinearity

We know we have some low correlated features, so let’s address this potential issue by looking at the VIF for the models and removing predictors.

VIF for LM1: Basic Model had dr1tkcal and dr1ttfat whose VIF > 5.Hence these predictos should be removed.

For the logistic regression models, luckily no predictor has a VIF > 5 (for Df > 1 (categorical), must look at GVIF^(1/(2*Df)) value), so we don’t need to remove any predictors.

# Multiple linear models

cat("=== VIF for LM1: Basic Model ===\n")

## === VIF for LM1: Basic Model ===

print(car::vif(model_full))

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.521880  1        1.233645
## riagendr          1.323334  1        1.150363
## ridreth3          1.644105  5        1.050976
## dmdeduc2          1.742917  6        1.047385
## indfmpir          1.298192  1        1.139382
## smq020            1.199062  1        1.095017
## alq111            1.103425  1        1.050440
## vigorous_minutes  1.049989  1        1.024690
## moderate_minutes  1.045875  1        1.022680
## sedentary_minutes 1.022078  1        1.010979
## dr1tkcal          6.267934  1        2.503584
## dr1ttfat          5.378923  1        2.319251
## dr1tsodi          2.854178  1        1.689431
## bmxbmi            1.304856  1        1.142303
## bpxsy1            1.591836  1        1.261680
## bpxdi1            1.232550  1        1.110203
## lbxglu            1.156394  1        1.075357
## lbxhscrp          1.110765  1        1.053928
## lbdhdd            1.374965  1        1.172589
## lbxtr             1.173039  1        1.083069

cat("=== VIF for LM2: Transformed predictors + original target variable ===\n")

## === VIF for LM2: Transformed predictors + original target variable ===

print(car::vif(model_full_transformed))

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.531744  1        1.237636
## riagendr          1.340638  1        1.157859
## ridreth3          1.674581  5        1.052908
## dmdeduc2          1.752564  6        1.047867
## indfmpir          1.300326  1        1.140318
## smq020            1.201889  1        1.096307
## alq111            1.104838  1        1.051113
## vigorous_minutes  1.050999  1        1.025182
## moderate_minutes  1.046303  1        1.022889
## sedentary_minutes 1.021970  1        1.010925
## bpxsy1            1.596436  1        1.263502
## bpxdi1            1.244769  1        1.115692
## log_lbxhscrp      1.389545  1        1.178790
## log_lbxtr         1.443205  1        1.201335
## log_dr1tkcal      5.771872  1        2.402472
## log_dr1tsodi      3.265137  1        1.806969
## log_dr1ttfat      4.965713  1        2.228388
## log_lbdhdd        1.614277  1        1.270542
## log_lbxglu        1.217585  1        1.103442
## log_bmxbmi        1.510988  1        1.229222

cat("=== VIF for LM3: Transformed predictors + transformed target variable ===\n")

## === VIF for LM3: Transformed predictors + transformed target variable ===

print(car::vif(model_full_transformed_target))

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.531744  1        1.237636
## riagendr          1.340638  1        1.157859
## ridreth3          1.674581  5        1.052908
## dmdeduc2          1.752564  6        1.047867
## indfmpir          1.300326  1        1.140318
## smq020            1.201889  1        1.096307
## alq111            1.104838  1        1.051113
## vigorous_minutes  1.050999  1        1.025182
## moderate_minutes  1.046303  1        1.022889
## sedentary_minutes 1.021970  1        1.010925
## bpxsy1            1.596436  1        1.263502
## bpxdi1            1.244769  1        1.115692
## log_lbxhscrp      1.389545  1        1.178790
## log_lbxtr         1.443205  1        1.201335
## log_dr1tkcal      5.771872  1        2.402472
## log_dr1tsodi      3.265137  1        1.806969
## log_dr1ttfat      4.965713  1        2.228388
## log_lbdhdd        1.614277  1        1.270542
## log_lbxglu        1.217585  1        1.103442
## log_bmxbmi        1.510988  1        1.229222

cat("=== VIF for LM4: Stepwise on Model 1 ===\n")

## === VIF for LM4: Stepwise on Model 1 ===

print(car::vif(step_model_full))

##              GVIF Df GVIF^(1/(2*Df))
## ridageyr 1.403331  1        1.184623
## riagendr 1.196522  1        1.093856
## ridreth3 1.199283  5        1.018339
## smq020   1.123607  1        1.060003
## bmxbmi   1.217974  1        1.103619
## bpxsy1   1.561195  1        1.249478
## bpxdi1   1.210080  1        1.100036
## lbxglu   1.141326  1        1.068329
## lbdhdd   1.335759  1        1.155750
## lbxtr    1.162166  1        1.078038

cat("=== VIF for LM5: Stepwise on Model 3 ===\n")

## === VIF for LM5: Stepwise on Model 3 ===

print(car::vif(step_model_full_transformed_target))

##                  GVIF Df GVIF^(1/(2*Df))
## riagendr     1.322698  1        1.150086
## ridreth3     1.280753  5        1.025053
## indfmpir     1.085509  1        1.041878
## smq020       1.125802  1        1.061038
## bpxsy1       1.233886  1        1.110804
## bpxdi1       1.198605  1        1.094808
## log_lbxhscrp 1.376780  1        1.173363
## log_lbxtr    1.434107  1        1.197542
## log_dr1tkcal 4.757132  1        2.181085
## log_dr1ttfat 4.659189  1        2.158515
## log_lbdhdd   1.576608  1        1.255631
## log_lbxglu   1.198388  1        1.094709
## log_bmxbmi   1.497774  1        1.223836

cat("=== VIF for LM8: Robust: Original ===\n")

## === VIF for LM8: Robust: Original ===

print(car::vif(robust_model_full))

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.521880  1        1.233645
## riagendr          1.323334  1        1.150363
## ridreth3          1.644105  5        1.050976
## dmdeduc2          1.742917  6        1.047385
## indfmpir          1.298192  1        1.139382
## smq020            1.199062  1        1.095017
## alq111            1.103425  1        1.050440
## vigorous_minutes  1.049989  1        1.024690
## moderate_minutes  1.045875  1        1.022680
## sedentary_minutes 1.022078  1        1.010979
## dr1tkcal          6.267934  1        2.503584
## dr1ttfat          5.378923  1        2.319251
## dr1tsodi          2.854178  1        1.689431
## bmxbmi            1.304856  1        1.142303
## bpxsy1            1.591836  1        1.261680
## bpxdi1            1.232550  1        1.110203
## lbxglu            1.156394  1        1.075357
## lbxhscrp          1.110765  1        1.053928
## lbdhdd            1.374965  1        1.172589
## lbxtr             1.173039  1        1.083069

cat("=== VIF for LM9: Robust: Transformed ===\n")

## === VIF for LM9: Robust: Transformed ===

print(car::vif(robust_model_full_transformed_target))

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.531744  1        1.237636
## riagendr          1.340638  1        1.157859
## ridreth3          1.674581  5        1.052908
## dmdeduc2          1.752564  6        1.047867
## indfmpir          1.300326  1        1.140318
## smq020            1.201889  1        1.096307
## alq111            1.104838  1        1.051113
## vigorous_minutes  1.050999  1        1.025182
## moderate_minutes  1.046303  1        1.022889
## sedentary_minutes 1.021970  1        1.010925
## bpxsy1            1.596436  1        1.263502
## bpxdi1            1.244769  1        1.115692
## log_lbxhscrp      1.389545  1        1.178790
## log_lbxtr         1.443205  1        1.201335
## log_dr1tkcal      5.771872  1        2.402472
## log_dr1tsodi      3.265137  1        1.806969
## log_dr1ttfat      4.965713  1        2.228388
## log_lbdhdd        1.614277  1        1.270542
## log_lbxglu        1.217585  1        1.103442
## log_bmxbmi        1.510988  1        1.229222

cat("=== VIF for LM10: Robust: Transformed: Bisquare ===\n")

## === VIF for LM10: Robust: Transformed: Bisquare ===

print(car::vif(robust_bisquare_model_full_transformed_target))

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.531744  1        1.237636
## riagendr          1.340638  1        1.157859
## ridreth3          1.674581  5        1.052908
## dmdeduc2          1.752564  6        1.047867
## indfmpir          1.300326  1        1.140318
## smq020            1.201889  1        1.096307
## alq111            1.104838  1        1.051113
## vigorous_minutes  1.050999  1        1.025182
## moderate_minutes  1.046303  1        1.022889
## sedentary_minutes 1.021970  1        1.010925
## bpxsy1            1.596436  1        1.263502
## bpxdi1            1.244769  1        1.115692
## log_lbxhscrp      1.389545  1        1.178790
## log_lbxtr         1.443205  1        1.201335
## log_dr1tkcal      5.771872  1        2.402472
## log_dr1tsodi      3.265137  1        1.806969
## log_dr1ttfat      4.965713  1        2.228388
## log_lbdhdd        1.614277  1        1.270542
## log_lbxglu        1.217585  1        1.103442
## log_bmxbmi        1.510988  1        1.229222

cat("=== VIF for LM12: Transformed predictors and target + new features ===\n")

## === VIF for LM12: Transformed predictors and target + new features ===

print(car::vif(model_transformed_newfeatures))

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          5.409444  1        2.325821
## riagendr          1.336220  1        1.155950
## ridreth3          1.724996  5        1.056036
## dmdeduc2          1.789852  6        1.049707
## indfmpir          1.311627  1        1.145263
## smq020            1.212757  1        1.101253
## alq111            1.110053  1        1.053590
## sedentary_minutes 1.030810  1        1.015288
## TOTAL_ACTIVITY    1.131137  1        1.063549
## ACTIVE_FLAG       1.277647  1        1.130331
## HIGH_FAT_DIET     1.077105  1        1.037837
## HIGH_SODIUM       1.110074  1        1.053600
## AGE_YOUNG         6.836026  1        2.614580
## AGE_MID           3.713059  1        1.926930
## BMI_UNDER         1.081259  1        1.039836
## BMI_NORMAL        1.667455  1        1.291300
## BMI_OVER          1.378407  1        1.174056
## bpxsy1            1.629344  1        1.276458
## bpxdi1            1.361794  1        1.166959
## log_lbxhscrp      1.318844  1        1.148410
## log_lbxtr         1.448880  1        1.203694
## log_lbdhdd        1.617624  1        1.271859
## log_lbxglu        1.211128  1        1.100513

# Model 1 (original predictors + original target variable lbxtc)
# Remove dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_v2 <- lm(lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal,dr1ttfat)))
summary(model_full_v2)

## 
## Call:
## lm(formula = lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal, 
##     dr1ttfat)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -205.782  -26.737   -3.807   22.935  245.631 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.989e+01  6.707e+00  14.893  < 2e-16 ***
## ridageyr           6.249e-02  4.040e-02   1.547 0.121953    
## riagendrFemale     4.786e+00  1.386e+00   3.453 0.000560 ***
## ridreth32          2.132e+00  2.583e+00   0.825 0.409304    
## ridreth33         -2.441e+00  2.088e+00  -1.169 0.242457    
## ridreth34         -6.643e+00  2.216e+00  -2.998 0.002737 ** 
## ridreth36          1.590e+00  2.503e+00   0.635 0.525434    
## ridreth37         -2.773e+00  3.222e+00  -0.861 0.389422    
## dmdeduc22         -2.895e+00  2.921e+00  -0.991 0.321690    
## dmdeduc23         -2.048e+00  2.657e+00  -0.771 0.440889    
## dmdeduc24         -3.519e+00  2.608e+00  -1.349 0.177390    
## dmdeduc25         -2.827e-01  2.849e+00  -0.099 0.920965    
## dmdeduc27          5.267e+01  2.790e+01   1.888 0.059112 .  
## dmdeduc29         -6.016e+00  1.409e+01  -0.427 0.669497    
## indfmpir           4.117e-01  4.591e-01   0.897 0.369903    
## smq0202           -3.061e+00  1.356e+00  -2.258 0.023977 *  
## alq1112            1.015e+00  2.148e+00   0.473 0.636504    
## vigorous_minutes   6.989e-05  1.868e-03   0.037 0.970155    
## moderate_minutes  -3.135e-04  1.122e-03  -0.280 0.779859    
## sedentary_minutes  8.496e-05  8.790e-04   0.097 0.923008    
## dr1tsodi          -2.817e-04  3.438e-04  -0.819 0.412633    
## bmxbmi             2.524e-01  9.366e-02   2.694 0.007078 ** 
## bpxsy1             1.565e-01  4.146e-02   3.775 0.000163 ***
## bpxdi1             3.559e-01  5.280e-02   6.741 1.79e-11 ***
## lbxglu            -5.873e-02  1.747e-02  -3.361 0.000783 ***
## lbxhscrp          -1.686e-02  8.317e-02  -0.203 0.839347    
## lbdhdd             6.500e-01  4.592e-02  14.157  < 2e-16 ***
## lbxtr              5.943e-02  5.540e-03  10.727  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.14 on 4115 degrees of freedom
## Multiple R-squared:  0.1079, Adjusted R-squared:  0.102 
## F-statistic: 18.43 on 27 and 4115 DF,  p-value: < 2.2e-16

car::vif(model_full_v2) # looks good

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.515673  1        1.231127
## riagendr          1.296892  1        1.138812
## ridreth3          1.596730  5        1.047908
## dmdeduc2          1.732063  6        1.046840
## indfmpir          1.294865  1        1.137921
## smq020            1.198310  1        1.094673
## alq111            1.103324  1        1.050392
## vigorous_minutes  1.049948  1        1.024670
## moderate_minutes  1.045805  1        1.022646
## sedentary_minutes 1.021712  1        1.010798
## dr1tsodi          1.110621  1        1.053860
## bmxbmi            1.298835  1        1.139665
## bpxsy1            1.590519  1        1.261158
## bpxdi1            1.231716  1        1.109827
## lbxglu            1.154277  1        1.074373
## lbxhscrp          1.110360  1        1.053736
## lbdhdd            1.373973  1        1.172166
## lbxtr             1.170124  1        1.081723

# Model 2 (transformed predictors + original target variable lbxtc)
# Remove log_dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_transformed_v2 <- lm(lbxtc ~ ., data = train_data_transformed %>% dplyr::select(-c(log_dr1tkcal)))
summary(model_full_transformed_v2)

## 
## Call:
## lm(formula = lbxtc ~ ., data = train_data_transformed %>% dplyr::select(-c(log_dr1tkcal)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.70  -26.43   -3.92   22.77  240.85 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4.642e+01  2.300e+01  -2.019 0.043596 *  
## ridageyr           3.173e-02  3.996e-02   0.794 0.427180    
## riagendrFemale     2.833e+00  1.379e+00   2.055 0.039917 *  
## ridreth32          2.444e+00  2.546e+00   0.960 0.337078    
## ridreth33         -1.638e+00  2.059e+00  -0.796 0.426344    
## ridreth34         -4.824e+00  2.188e+00  -2.204 0.027560 *  
## ridreth36          3.186e+00  2.485e+00   1.282 0.199955    
## ridreth37         -1.959e+00  3.173e+00  -0.617 0.536996    
## dmdeduc22         -4.018e+00  2.883e+00  -1.394 0.163527    
## dmdeduc23         -2.452e+00  2.621e+00  -0.935 0.349722    
## dmdeduc24         -4.068e+00  2.575e+00  -1.580 0.114252    
## dmdeduc25         -9.185e-01  2.811e+00  -0.327 0.743923    
## dmdeduc27          5.486e+01  2.749e+01   1.996 0.046010 *  
## dmdeduc29         -7.750e+00  1.389e+01  -0.558 0.576885    
## indfmpir           5.811e-01  4.529e-01   1.283 0.199543    
## smq0202           -2.501e+00  1.337e+00  -1.870 0.061538 .  
## alq1112            7.900e-01  2.118e+00   0.373 0.709153    
## vigorous_minutes   6.152e-04  1.841e-03   0.334 0.738280    
## moderate_minutes  -4.563e-04  1.105e-03  -0.413 0.679729    
## sedentary_minutes -3.619e-05  8.659e-04  -0.042 0.966662    
## bpxsy1             1.495e-01  4.091e-02   3.655 0.000261 ***
## bpxdi1             2.872e-01  5.227e-02   5.494 4.18e-08 ***
## log_lbxhscrp       3.230e+00  6.368e-01   5.072 4.11e-07 ***
## log_lbxtr          1.835e+01  1.185e+00  15.485  < 2e-16 ***
## log_dr1tsodi      -7.707e-01  1.797e+00  -0.429 0.668013    
## log_dr1ttfat      -4.083e-01  1.627e+00  -0.251 0.801853    
## log_lbdhdd         4.529e+01  2.728e+00  16.601  < 2e-16 ***
## log_lbxglu        -1.527e+01  2.731e+00  -5.592 2.39e-08 ***
## log_bmxbmi         3.214e+00  3.159e+00   1.017 0.309091    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.56 on 4114 degrees of freedom
## Multiple R-squared:  0.1346, Adjusted R-squared:  0.1287 
## F-statistic: 22.84 on 28 and 4114 DF,  p-value: < 2.2e-16

car::vif(model_full_transformed_v2) # looks good

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.528157  1        1.236186
## riagendr          1.322256  1        1.149894
## ridreth3          1.663460  5        1.052207
## dmdeduc2          1.746420  6        1.047560
## indfmpir          1.298539  1        1.139535
## smq020            1.201694  1        1.096218
## alq111            1.104830  1        1.051109
## vigorous_minutes  1.050970  1        1.025168
## moderate_minutes  1.046275  1        1.022876
## sedentary_minutes 1.021901  1        1.010891
## bpxsy1            1.596263  1        1.263433
## bpxdi1            1.244341  1        1.115500
## log_lbxhscrp      1.389502  1        1.178771
## log_lbxtr         1.440165  1        1.200069
## log_dr1tsodi      2.716875  1        1.648295
## log_dr1ttfat      2.668648  1        1.633600
## log_lbdhdd        1.613914  1        1.270399
## log_lbxglu        1.214539  1        1.102061
## log_bmxbmi        1.505382  1        1.226940

# Model 3 (transformed predictors + transformed target variable log_lbxtc)
# Remove log_dr1tkcal which has highest VIF (above 5)
set.seed(123)
model_full_transformed_target_v2 <- lm(log_lbxtc ~ ., data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(model_full_transformed_target_v2)

## 
## Call:
## lm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>% 
##     dplyr::select(-c(log_dr1tkcal)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80059 -0.13504 -0.00158  0.13451  0.85592 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.883e+00  1.220e-01  31.819  < 2e-16 ***
## ridageyr           7.997e-05  2.120e-04   0.377  0.70604    
## riagendrFemale     1.371e-02  7.314e-03   1.874  0.06099 .  
## ridreth32          1.391e-02  1.351e-02   1.030  0.30307    
## ridreth33         -1.014e-02  1.093e-02  -0.928  0.35341    
## ridreth34         -2.963e-02  1.161e-02  -2.552  0.01073 *  
## ridreth36          1.634e-02  1.319e-02   1.239  0.21539    
## ridreth37         -1.156e-02  1.683e-02  -0.687  0.49232    
## dmdeduc22         -1.899e-02  1.530e-02  -1.241  0.21459    
## dmdeduc23         -1.441e-02  1.391e-02  -1.036  0.30015    
## dmdeduc24         -2.266e-02  1.366e-02  -1.658  0.09730 .  
## dmdeduc25         -3.083e-03  1.492e-02  -0.207  0.83629    
## dmdeduc27          2.612e-01  1.458e-01   1.791  0.07336 .  
## dmdeduc29         -3.770e-02  7.369e-02  -0.512  0.60897    
## indfmpir           2.705e-03  2.403e-03   1.126  0.26028    
## smq0202           -1.151e-02  7.095e-03  -1.622  0.10484    
## alq1112            4.929e-03  1.124e-02   0.439  0.66091    
## vigorous_minutes   3.982e-06  9.767e-06   0.408  0.68356    
## moderate_minutes  -1.787e-06  5.863e-06  -0.305  0.76051    
## sedentary_minutes -2.549e-07  4.594e-06  -0.055  0.95576    
## bpxsy1             6.573e-04  2.171e-04   3.028  0.00248 ** 
## bpxdi1             1.751e-03  2.773e-04   6.312 3.05e-10 ***
## log_lbxhscrp       1.813e-02  3.379e-03   5.365 8.53e-08 ***
## log_lbxtr          9.722e-02  6.287e-03  15.464  < 2e-16 ***
## log_dr1tsodi      -2.383e-03  9.533e-03  -0.250  0.80266    
## log_dr1ttfat      -4.168e-03  8.631e-03  -0.483  0.62916    
## log_lbdhdd         2.560e-01  1.448e-02  17.683  < 2e-16 ***
## log_lbxglu        -8.396e-02  1.449e-02  -5.794 7.40e-09 ***
## log_bmxbmi         2.899e-02  1.676e-02   1.730  0.08378 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2046 on 4114 degrees of freedom
## Multiple R-squared:  0.1418, Adjusted R-squared:  0.1359 
## F-statistic: 24.27 on 28 and 4114 DF,  p-value: < 2.2e-16

car::vif(model_full_transformed_target_v2) # looks good

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.528157  1        1.236186
## riagendr          1.322256  1        1.149894
## ridreth3          1.663460  5        1.052207
## dmdeduc2          1.746420  6        1.047560
## indfmpir          1.298539  1        1.139535
## smq020            1.201694  1        1.096218
## alq111            1.104830  1        1.051109
## vigorous_minutes  1.050970  1        1.025168
## moderate_minutes  1.046275  1        1.022876
## sedentary_minutes 1.021901  1        1.010891
## bpxsy1            1.596263  1        1.263433
## bpxdi1            1.244341  1        1.115500
## log_lbxhscrp      1.389502  1        1.178771
## log_lbxtr         1.440165  1        1.200069
## log_dr1tsodi      2.716875  1        1.648295
## log_dr1ttfat      2.668648  1        1.633600
## log_lbdhdd        1.613914  1        1.270399
## log_lbxglu        1.214539  1        1.102061
## log_bmxbmi        1.505382  1        1.226940

# Model 8 robust regression - original
# Remove dr1tkcal which has highest VIF (above 5)
set.seed(123)
robust_model_full_v2 <- rlm(lbxtc ~ ., psi = psi.huber, data = train_data_original %>% dplyr::select(-c(dr1tkcal)))
summary(robust_model_full_v2)

## 
## Call: rlm(formula = lbxtc ~ ., data = train_data_original %>% dplyr::select(-c(dr1tkcal)), 
##     psi = psi.huber)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -249.964  -24.745   -1.669   24.520  249.598 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)       94.7462  6.4933    14.5915
## ridageyr           0.0911  0.0391     2.3301
## riagendrFemale     3.7152  1.3442     2.7639
## ridreth32          2.5257  2.4997     1.0104
## ridreth33         -3.0861  2.0228    -1.5256
## ridreth34         -6.4170  2.1463    -2.9897
## ridreth36          1.4823  2.4314     0.6097
## ridreth37         -1.2720  3.1180    -0.4079
## dmdeduc22         -1.0018  2.8288    -0.3541
## dmdeduc23         -0.8358  2.5734    -0.3248
## dmdeduc24         -2.3021  2.5263    -0.9112
## dmdeduc25          1.3924  2.7598     0.5045
## dmdeduc27         59.5554 26.9989     2.2058
## dmdeduc29         -6.2907 13.6396    -0.4612
## indfmpir           0.2940  0.4443     0.6617
## smq0202           -2.6139  1.3119    -1.9925
## alq1112            0.5186  2.0790     0.2495
## vigorous_minutes   0.0002  0.0018     0.1301
## moderate_minutes  -0.0003  0.0011    -0.2469
## sedentary_minutes  0.0004  0.0009     0.4543
## dr1ttfat          -0.0106  0.0189    -0.5604
## dr1tsodi          -0.0001  0.0005    -0.2006
## bmxbmi             0.3272  0.0907     3.6096
## bpxsy1             0.1098  0.0401     2.7371
## bpxdi1             0.4078  0.0511     7.9811
## lbxglu            -0.0774  0.0169    -4.5796
## lbxhscrp          -0.0561  0.0805    -0.6966
## lbdhdd             0.6811  0.0444    15.3239
## lbxtr              0.0768  0.0054    14.3251
## 
## Residual standard error: 36.63 on 4114 degrees of freedom

car::vif(robust_model_full_v2) # looks good

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.515794  1        1.231176
## riagendr          1.302600  1        1.141315
## ridreth3          1.637276  5        1.050539
## dmdeduc2          1.736752  6        1.047076
## indfmpir          1.295408  1        1.138160
## smq020            1.198662  1        1.094834
## alq111            1.103360  1        1.050409
## vigorous_minutes  1.049948  1        1.024670
## moderate_minutes  1.045806  1        1.022647
## sedentary_minutes 1.021872  1        1.010877
## dr1ttfat          2.424637  1        1.557125
## dr1tsodi          2.438760  1        1.561653
## bmxbmi            1.299592  1        1.139996
## bpxsy1            1.591051  1        1.261369
## bpxdi1            1.231797  1        1.109863
## lbxglu            1.154521  1        1.074487
## lbxhscrp          1.110704  1        1.053900
## lbdhdd            1.374888  1        1.172556
## lbxtr             1.170204  1        1.081760

set.seed(123)
# Model 9 robust regression - transformed
# Remove log_dr1tkcal which has highest VIF (above 5)

robust_model_full_transformed_target_v2 <- rlm(log_lbxtc ~ ., psi = psi.huber, data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(robust_model_full_transformed_target_v2)

## 
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>% 
##     dplyr::select(-c(log_dr1tkcal)), psi = psi.huber)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.814008 -0.133808 -0.002436  0.133927  0.854640 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)        3.8504  0.1209    31.8572
## ridageyr           0.0003  0.0002     1.3796
## riagendrFemale     0.0099  0.0072     1.3654
## ridreth32          0.0143  0.0134     1.0714
## ridreth33         -0.0126  0.0108    -1.1667
## ridreth34         -0.0280  0.0115    -2.4370
## ridreth36          0.0189  0.0131     1.4450
## ridreth37         -0.0053  0.0167    -0.3183
## dmdeduc22         -0.0128  0.0152    -0.8445
## dmdeduc23         -0.0111  0.0138    -0.8065
## dmdeduc24         -0.0192  0.0135    -1.4188
## dmdeduc25          0.0013  0.0148     0.0853
## dmdeduc27          0.2753  0.1445     1.9061
## dmdeduc29         -0.0330  0.0730    -0.4522
## indfmpir           0.0026  0.0024     1.1095
## smq0202           -0.0137  0.0070    -1.9459
## alq1112            0.0007  0.0111     0.0665
## vigorous_minutes   0.0000  0.0000     0.4513
## moderate_minutes   0.0000  0.0000    -0.4396
## sedentary_minutes  0.0000  0.0000     0.2255
## bpxsy1             0.0005  0.0002     2.5041
## bpxdi1             0.0019  0.0003     6.9660
## log_lbxhscrp       0.0160  0.0033     4.7728
## log_lbxtr          0.1018  0.0062    16.3397
## log_dr1tsodi      -0.0037  0.0094    -0.3921
## log_dr1ttfat      -0.0047  0.0085    -0.5543
## log_lbdhdd         0.2613  0.0143    18.2221
## log_lbxglu        -0.0939  0.0144    -6.5413
## log_bmxbmi         0.0423  0.0166     2.5495
## 
## Residual standard error: 0.1984 on 4114 degrees of freedom

car::vif(robust_model_full_transformed_target_v2) # looks good

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.528157  1        1.236186
## riagendr          1.322256  1        1.149894
## ridreth3          1.663460  5        1.052207
## dmdeduc2          1.746420  6        1.047560
## indfmpir          1.298539  1        1.139535
## smq020            1.201694  1        1.096218
## alq111            1.104830  1        1.051109
## vigorous_minutes  1.050970  1        1.025168
## moderate_minutes  1.046275  1        1.022876
## sedentary_minutes 1.021901  1        1.010891
## bpxsy1            1.596263  1        1.263433
## bpxdi1            1.244341  1        1.115500
## log_lbxhscrp      1.389502  1        1.178771
## log_lbxtr         1.440165  1        1.200069
## log_dr1tsodi      2.716875  1        1.648295
## log_dr1ttfat      2.668648  1        1.633600
## log_lbdhdd        1.613914  1        1.270399
## log_lbxglu        1.214539  1        1.102061
## log_bmxbmi        1.505382  1        1.226940

# Model 10
set.seed(123)
robust_bisquare_model_full_transformed_target_v2 <- rlm(log_lbxtc ~ ., psi = psi.bisquare, data = train_data_transformed_target %>% dplyr::select(-c(log_dr1tkcal)))
summary(robust_bisquare_model_full_transformed_target_v2)

## 
## Call: rlm(formula = log_lbxtc ~ ., data = train_data_transformed_target %>% 
##     dplyr::select(-c(log_dr1tkcal)), psi = psi.bisquare)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.815502 -0.134634 -0.002864  0.134037  0.855650 
## 
## Coefficients:
##                   Value   Std. Error t value
## (Intercept)        3.8413  0.1221    31.4612
## ridageyr           0.0003  0.0002     1.4454
## riagendrFemale     0.0096  0.0073     1.3112
## ridreth32          0.0128  0.0135     0.9457
## ridreth33         -0.0137  0.0109    -1.2512
## ridreth34         -0.0299  0.0116    -2.5746
## ridreth36          0.0173  0.0132     1.3124
## ridreth37         -0.0047  0.0168    -0.2768
## dmdeduc22         -0.0119  0.0153    -0.7775
## dmdeduc23         -0.0096  0.0139    -0.6903
## dmdeduc24         -0.0178  0.0137    -1.2993
## dmdeduc25          0.0028  0.0149     0.1882
## dmdeduc27          0.2768  0.1459     1.8970
## dmdeduc29         -0.0292  0.0737    -0.3964
## indfmpir           0.0024  0.0024     0.9786
## smq0202           -0.0135  0.0071    -1.9000
## alq1112            0.0016  0.0112     0.1413
## vigorous_minutes   0.0000  0.0000     0.4303
## moderate_minutes   0.0000  0.0000    -0.4248
## sedentary_minutes  0.0000  0.0000     0.2559
## bpxsy1             0.0005  0.0002     2.4532
## bpxdi1             0.0019  0.0003     6.9362
## log_lbxhscrp       0.0159  0.0034     4.7017
## log_lbxtr          0.1023  0.0063    16.2614
## log_dr1tsodi      -0.0043  0.0095    -0.4467
## log_dr1ttfat      -0.0047  0.0086    -0.5497
## log_lbdhdd         0.2630  0.0145    18.1569
## log_lbxglu        -0.0938  0.0145    -6.4655
## log_bmxbmi         0.0433  0.0168     2.5844
## 
## Residual standard error: 0.1991 on 4114 degrees of freedom

car::vif(robust_bisquare_model_full_transformed_target_v2) # looks good

##                       GVIF Df GVIF^(1/(2*Df))
## ridageyr          1.528157  1        1.236186
## riagendr          1.322256  1        1.149894
## ridreth3          1.663460  5        1.052207
## dmdeduc2          1.746420  6        1.047560
## indfmpir          1.298539  1        1.139535
## smq020            1.201694  1        1.096218
## alq111            1.104830  1        1.051109
## vigorous_minutes  1.050970  1        1.025168
## moderate_minutes  1.046275  1        1.022876
## sedentary_minutes 1.021901  1        1.010891
## bpxsy1            1.596263  1        1.263433
## bpxdi1            1.244341  1        1.115500
## log_lbxhscrp      1.389502  1        1.178771
## log_lbxtr         1.440165  1        1.200069
## log_dr1tsodi      2.716875  1        1.648295
## log_dr1ttfat      2.668648  1        1.633600
## log_lbdhdd        1.613914  1        1.270399
## log_lbxglu        1.214539  1        1.102061
## log_bmxbmi        1.505382  1        1.226940

# Model 12
set.seed(123)
model_transformed_newfeatures_v2 <- glm(log_lbxtc ~ ., data = train_data_transformed_newfeatures%>% dplyr::select(-c(AGE_YOUNG, ridageyr)))

summary(model_transformed_newfeatures_v2)

## 
## Call:
## glm(formula = log_lbxtc ~ ., data = train_data_transformed_newfeatures %>% 
##     dplyr::select(-c(AGE_YOUNG, ridageyr)))
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.948e+00  9.430e-02  41.863  < 2e-16 ***
## riagendrFemale     9.152e-03  7.187e-03   1.274  0.20290    
## ridreth32          1.311e-02  1.324e-02   0.990  0.32204    
## ridreth33          3.808e-03  1.065e-02   0.358  0.72066    
## ridreth34         -2.550e-02  1.139e-02  -2.238  0.02525 *  
## ridreth36          2.115e-02  1.299e-02   1.628  0.10352    
## ridreth37         -5.674e-03  1.651e-02  -0.344  0.73112    
## dmdeduc22         -1.187e-02  1.497e-02  -0.793  0.42811    
## dmdeduc23         -8.620e-03  1.361e-02  -0.633  0.52662    
## dmdeduc24         -1.273e-02  1.324e-02  -0.962  0.33626    
## dmdeduc25         -9.041e-04  1.455e-02  -0.062  0.95047    
## dmdeduc27          2.293e-01  1.429e-01   1.605  0.10867    
## dmdeduc29         -1.271e-02  7.220e-02  -0.176  0.86027    
## indfmpir           5.189e-04  2.352e-03   0.221  0.82540    
## smq0202           -6.363e-03  6.913e-03  -0.920  0.35743    
## alq1112            8.854e-03  1.104e-02   0.802  0.42241    
## sedentary_minutes  1.535e-06  4.521e-06   0.339  0.73433    
## TOTAL_ACTIVITY    -1.259e-06  4.761e-06  -0.264  0.79145    
## ACTIVE_FLAG        8.184e-03  7.113e-03   1.151  0.24997    
## HIGH_FAT_DIET     -6.507e-03  6.587e-03  -0.988  0.32328    
## HIGH_SODIUM       -2.018e-02  7.391e-03  -2.731  0.00635 ** 
## AGE_MID            7.868e-02  6.869e-03  11.455  < 2e-16 ***
## BMI_UNDER         -7.340e-02  2.685e-02  -2.734  0.00629 ** 
## BMI_NORMAL        -2.704e-02  9.342e-03  -2.894  0.00382 ** 
## BMI_OVER           1.273e-02  7.730e-03   1.646  0.09975 .  
## bpxsy1             1.041e-03  1.938e-04   5.372 8.22e-08 ***
## bpxdi1             7.765e-04  2.819e-04   2.754  0.00591 ** 
## log_lbxhscrp       1.473e-02  3.219e-03   4.575 4.90e-06 ***
## log_lbxtr          9.098e-02  6.172e-03  14.741  < 2e-16 ***
## log_lbdhdd         2.590e-01  1.405e-02  18.431  < 2e-16 ***
## log_lbxglu        -8.303e-02  1.413e-02  -5.877 4.51e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.04019509)
## 
##     Null deviance: 200.61  on 4142  degrees of freedom
## Residual deviance: 165.28  on 4112  degrees of freedom
## AIC: -1525.4
## 
## Number of Fisher Scoring iterations: 2

car::vif(model_transformed_newfeatures_v2)

##                       GVIF Df GVIF^(1/(2*Df))
## riagendr          1.328951  1        1.152801
## ridreth3          1.661164  5        1.052062
## dmdeduc2          1.707273  6        1.045583
## indfmpir          1.295341  1        1.138130
## smq020            1.187965  1        1.089938
## alq111            1.109649  1        1.053399
## sedentary_minutes 1.030418  1        1.015095
## TOTAL_ACTIVITY    1.130412  1        1.063208
## ACTIVE_FLAG       1.223972  1        1.106333
## HIGH_FAT_DIET     1.076678  1        1.037631
## HIGH_SODIUM       1.103043  1        1.050258
## AGE_MID           1.185378  1        1.088751
## BMI_UNDER         1.077733  1        1.038139
## BMI_NORMAL        1.659556  1        1.288237
## BMI_OVER          1.373428  1        1.171934
## bpxsy1            1.324607  1        1.150916
## bpxdi1            1.338908  1        1.157112
## log_lbxhscrp      1.313531  1        1.146094
## log_lbxtr         1.444867  1        1.202026
## log_lbdhdd        1.584001  1        1.258571
## log_lbxglu        1.201714  1        1.096227

Interpretation of coefficients

Coefficients are presented as multiplicative effects (exponentiated from log-log model), meaning they represent proportional changes in cholesterol. * Biomarkers Dominating the prediction are HDL Cholesterol (log_lbdhdd), Triglycerides (log_lbxtr), Glucose (log_lbxglu),CRP/Inflammation (log_lbxhscrp). * HDL is a component of total cholesterol, so this positive relationship is biologically expected and is the strongest predictor in your model. * Triglycerides contribute to VLDL cholesterol, another component of total cholesterol. * Interestingly, Higher glucose associated with LOWER cholesterol. * Inflammation and cholesterol are linked . * Cholesterol follows an inverted-U trajectory across lifespan as it peaks in middge age(AGE_MID (40-64 years)) and then declines. * Overweight individuals(BMI_OVER (25-29.9)) have 1.28% HIGHER cholesterol than obese and hence Lower BMI leads to Lower cholesterol . * High sodium diet associated with 2% LOWER cholesterol(HIGH_SODIUM (>2300 mg/day): 0.9800)). * Lifestyle changes like Physical Activity, Diet Quality, behaviors like Smoking,Alcohol and Income had no significant effect.

# Logistic Regression Models

# Interpretation of coefficients (Model 1)
cat("\n=== Coefficient Interpretation (Model 1) ===\n")

## 
## === Coefficient Interpretation (Model 1) ===

coef_exp <- exp(coef(model_full_v2))
print(round(coef_exp, 4))

##       (Intercept)          ridageyr    riagendrFemale         ridreth32 
##      2.412504e+43      1.064500e+00      1.198091e+02      8.429400e+00 
##         ridreth33         ridreth34         ridreth36         ridreth37 
##      8.710000e-02      1.300000e-03      4.902800e+00      6.250000e-02 
##         dmdeduc22         dmdeduc23         dmdeduc24         dmdeduc25 
##      5.530000e-02      1.290000e-01      2.960000e-02      7.537000e-01 
##         dmdeduc27         dmdeduc29          indfmpir           smq0202 
##      7.500096e+22      2.400000e-03      1.509400e+00      4.680000e-02 
##           alq1112  vigorous_minutes  moderate_minutes sedentary_minutes 
##      2.760500e+00      1.000100e+00      9.997000e-01      1.000100e+00 
##          dr1tsodi            bmxbmi            bpxsy1            bpxdi1 
##      9.997000e-01      1.287100e+00      1.169400e+00      1.427500e+00 
##            lbxglu          lbxhscrp            lbdhdd             lbxtr 
##      9.430000e-01      9.833000e-01      1.915600e+00      1.061200e+00

# Interpretation of coefficients (Model 2)
cat("\n=== Coefficient Interpretation (Model 2) ===\n")

## 
## === Coefficient Interpretation (Model 2) ===

coef_exp <- exp(coef(model_full_transformed_v2))
print(round(coef_exp, 4))

##       (Intercept)          ridageyr    riagendrFemale         ridreth32 
##      0.000000e+00      1.032200e+00      1.700340e+01      1.152050e+01 
##         ridreth33         ridreth34         ridreth36         ridreth37 
##      1.943000e-01      8.000000e-03      2.418640e+01      1.410000e-01 
##         dmdeduc22         dmdeduc23         dmdeduc24         dmdeduc25 
##      1.800000e-02      8.620000e-02      1.710000e-02      3.991000e-01 
##         dmdeduc27         dmdeduc29          indfmpir           smq0202 
##      6.708678e+23      4.000000e-04      1.787900e+00      8.200000e-02 
##           alq1112  vigorous_minutes  moderate_minutes sedentary_minutes 
##      2.203300e+00      1.000600e+00      9.995000e-01      1.000000e+00 
##            bpxsy1            bpxdi1      log_lbxhscrp         log_lbxtr 
##      1.161300e+00      1.332600e+00      2.527480e+01      9.309804e+07 
##      log_dr1tsodi      log_dr1ttfat        log_lbdhdd        log_lbxglu 
##      4.627000e-01      6.648000e-01      4.685730e+19      0.000000e+00 
##        log_bmxbmi 
##      2.486760e+01

# Interpretation of coefficients (Model 3)
cat("\n=== Coefficient Interpretation (Model 3) ===\n")

## 
## === Coefficient Interpretation (Model 3) ===

coef_exp <- exp(coef(model_full_transformed_target_v2))
print(round(coef_exp, 4))

##       (Intercept)          ridageyr    riagendrFemale         ridreth32 
##           48.5508            1.0001            1.0138            1.0140 
##         ridreth33         ridreth34         ridreth36         ridreth37 
##            0.9899            0.9708            1.0165            0.9885 
##         dmdeduc22         dmdeduc23         dmdeduc24         dmdeduc25 
##            0.9812            0.9857            0.9776            0.9969 
##         dmdeduc27         dmdeduc29          indfmpir           smq0202 
##            1.2985            0.9630            1.0027            0.9886 
##           alq1112  vigorous_minutes  moderate_minutes sedentary_minutes 
##            1.0049            1.0000            1.0000            1.0000 
##            bpxsy1            bpxdi1      log_lbxhscrp         log_lbxtr 
##            1.0007            1.0018            1.0183            1.1021 
##      log_dr1tsodi      log_dr1ttfat        log_lbdhdd        log_lbxglu 
##            0.9976            0.9958            1.2917            0.9195 
##        log_bmxbmi 
##            1.0294

# Interpretation of coefficients (Model 10)
cat("\n=== Coefficient Interpretation (Model 10) ===\n")

## 
## === Coefficient Interpretation (Model 10) ===

coef_exp <- exp(coef(robust_bisquare_model_full_transformed_target_v2))
print(round(coef_exp, 4))

##       (Intercept)          ridageyr    riagendrFemale         ridreth32 
##           46.5847            1.0003            1.0096            1.0129 
##         ridreth33         ridreth34         ridreth36         ridreth37 
##            0.9864            0.9705            1.0175            0.9953 
##         dmdeduc22         dmdeduc23         dmdeduc24         dmdeduc25 
##            0.9882            0.9904            0.9824            1.0028 
##         dmdeduc27         dmdeduc29          indfmpir           smq0202 
##            1.3189            0.9712            1.0024            0.9866 
##           alq1112  vigorous_minutes  moderate_minutes sedentary_minutes 
##            1.0016            1.0000            1.0000            1.0000 
##            bpxsy1            bpxdi1      log_lbxhscrp         log_lbxtr 
##            1.0005            1.0019            1.0160            1.1077 
##      log_dr1tsodi      log_dr1ttfat        log_lbdhdd        log_lbxglu 
##            0.9957            0.9953            1.3008            0.9105 
##        log_bmxbmi 
##            1.0443

# Interpretation of coefficients (Model 12)
cat("\n=== Coefficient Interpretation (Model 12) ===\n")

## 
## === Coefficient Interpretation (Model 12) ===

coef_exp <- exp(coef(model_transformed_newfeatures_v2))
print(round(coef_exp, 4))

##       (Intercept)    riagendrFemale         ridreth32         ridreth33 
##           51.8199            1.0092            1.0132            1.0038 
##         ridreth34         ridreth36         ridreth37         dmdeduc22 
##            0.9748            1.0214            0.9943            0.9882 
##         dmdeduc23         dmdeduc24         dmdeduc25         dmdeduc27 
##            0.9914            0.9874            0.9991            1.2578 
##         dmdeduc29          indfmpir           smq0202           alq1112 
##            0.9874            1.0005            0.9937            1.0089 
## sedentary_minutes    TOTAL_ACTIVITY       ACTIVE_FLAG     HIGH_FAT_DIET 
##            1.0000            1.0000            1.0082            0.9935 
##       HIGH_SODIUM           AGE_MID         BMI_UNDER        BMI_NORMAL 
##            0.9800            1.0819            0.9292            0.9733 
##          BMI_OVER            bpxsy1            bpxdi1      log_lbxhscrp 
##            1.0128            1.0010            1.0008            1.0148 
##         log_lbxtr        log_lbdhdd        log_lbxglu 
##            1.0952            1.2957            0.9203

5 Select Models

Best performing model is LM12( Multiple Linear regression with transformed predictors and target+ new features). ** It has lowest RMSE (0.2041 on log scale) and highest R² = 0.1215 ** Has the best predictive performance. ** Provides interpretable coefficients on the log scale. ** Feature engineering is better than Regularization in this case. ** Ridge’s shrinkage couldn’t compensate for missing non-linear patterns
Second best performing Model is LM6 (Ridge with transformed predictors and target) because: ** It has RMSE (0.2054) and R² (0.1104). ** Has the better predictive performance. ** Handles multicollinearity among dietary/lipid variables. ** Is more stable than OLS for correlated predictors.
Other comparable models are LM10 (Robust bisquare) as its close second with RMSE 0.2057 and R² 0.1078; LM3 (Full transformed) with RMSE 0.2057, R² 0.1070.
The log transformation of the target (Total Cholesterol) dramatically improved model performance. ** Original scale models (LM1, LM2): RMSE ~39, R² ~0.09 ** Log-transformed models (LM3+): RMSE ~0.21, R² ~0.10-0.11 ** This makes sense given that cholesterol and many biomarkers follow log-normal distributions.
Ridge regression (LM6) slightly outperformed OLS, suggesting multicollinearity among predictors (especially the correlated dietary variables).Ridge’s shrinkage helps with the correlation among dr1tkcal, dr1ttfat, dr1tsodi, lbdhdd, lbxtr.
Stepwise models (LM4, LM5) performed worse than full models, indicating important predictors were likely removed. The modest R² suggests complex interactions that stepwise doesn’t capture well.
Robust regression (LM9, LM10) performed similarly to OLS (LM3), suggesting Outliers aren’t significantly affecting our results.
The low R² (~0.11) across all models indicates that our predictors explain only about 10-11% of cholesterol variation. ** Looks like the Cholesterol is influenced by many factors not in our dataset (maybe genetics, medications, detailed dietary patterns) ** The relationship may be non-linear or involve complex interactions. ** This is actually typical for epidemiological data.

# Model 1: Multiple Linear Regression - Full model (post mutilcollinearity updates)

# Check assumptions

# Model Evaluation on Test Set
print_evaluate_lm_model <- function(model, target_variable, test_data, model_name) {

  preds <- predict(model, newdata = test_data)
  if (target_variable == "lbxtc") {
    actual <- test_data$lbxtc
  } else { # log
    actual <- test_data$log_lbxtc
  }

  rmse <- sqrt(mean((preds - actual)^2))
  mae  <- mean(abs(preds - actual))
  r2   <- 1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
  aic  <- AIC(model)

  output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== ", model_name, " Evaluation ===\n",
    "RMSE:", round(rmse, 4), 
    "| MAE:", round(mae, 4),
    "| R²:", round(r2, 4), "\n",
    "AIC:", round(aic, 4), "\n\n",
    sep = " "
  )

  cat(output)
}

print_lm_model_diagnostics <- function(model, train_data, model_name) {

  cat("\n=== Diagnostics for", model_name, "===\n")
  
  # Histogram of residuals
  ggplot(data = as.data.frame(residuals(model)), aes(x = residuals(model))) +
  geom_histogram(bins = 25, fill = "skyblue", color = "black") +
  xlab("Residuals") +
  ggtitle(paste("Residual Histogram -", model_name))
  
  # Residuals vs Fitted
  plot(model, which = 1, main = paste("Residuals vs Fitted -", model_name))
  
  # Q-Q Plot
  plot(model, which = 2, main = paste("Normal Q-Q Plot -", model_name))
  
  # Scale-Location Plot
  plot(model, which = 3, main = paste("Scale-Location Plot -", model_name))
  
  # Residuals vs Leverage
  plot(model, which = 5, main = paste("Residuals vs Leverage -", model_name))
  
  # Cook's Distance
  try(plot(model, which = 4, main = paste("Cook's Distance -", model_name)), silent = TRUE)
  # plot(model, which = 4, main = paste("Cook's Distance -", model_name))
  
  # Heteroscedasticity test
  print("Heteroscedasticity test:")
  print(bptest(model))
  
  # High leverage cutoff
  p <- length(coef(model))
  n <- nrow(train_data)
  high_lev <- (2 * (p+1)) / n
  cat("High leverage threshold:", round(high_lev, 5), "\n")
  
  # Half-normal plot of leverage values
  try(halfnorm(hatvalues(model), main = paste("Half-Normal Plot of Leverage -", model_name)), silent = TRUE)
}

# === LM1 Evaluation ===
print_evaluate_lm_model(model_full_v2, "lbxtc", test_data_original, "LM1: Full model")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM1: Full model  Evaluation ===
##  RMSE: 39.1062 | MAE: 29.9116 | R²: 0.0893 
##  AIC: 42173.6259

print_lm_model_diagnostics(model_full_v2, train_data_original, "LM1: Full model")

## 
## === Diagnostics for LM1: Full model ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 181.84, df = 27, p-value < 2.2e-16
## 
## High leverage threshold: 0.014

# Model 2: Multiple Linear Regression: Full model with Log transformation of the skewed predictors

# Check assumptions

# === LM2 Evaluation ===
print_evaluate_lm_model(model_full_transformed_v2, "lbxtc", test_data_transformed, "LM2: Full model (Transformed Predictors)")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM2: Full model (Transformed Predictors)  Evaluation ===
##  RMSE: 39.0595 | MAE: 29.6559 | R²: 0.0914 
##  AIC: 42049.7957

print_lm_model_diagnostics(model_full_transformed_v2, train_data_transformed, "LM2: Full model (Transformed Predictors)")

## 
## === Diagnostics for LM2: Full model (Transformed Predictors) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 96.186, df = 28, p-value = 2.084e-09
## 
## High leverage threshold: 0.01448

# Model 3: Multiple Linear Regression: Full model with Log transformation of the skewed predictors + Log target

# Check assumptions

# === LM3 Evaluation ===
print_evaluate_lm_model(model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM3: Full model (Transformed Predictors + Target)  Evaluation ===
##  RMSE: 0.2057 | MAE: 0.1597 | R²: 0.107 
##  AIC: -1360.3999

print_lm_model_diagnostics(model_full_transformed_target_v2, train_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)")

## 
## === Diagnostics for LM3: Full model (Transformed Predictors + Target) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 141.07, df = 28, p-value < 2.2e-16
## 
## High leverage threshold: 0.01448

# Model 4: Stepwise on Model 1

# Check assumptions

# === LM4 Evaluation ===
print_evaluate_lm_model(step_model_full, "lbxtc", test_data_original, "LM4: Stepwise Model 1")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM4: Stepwise Model 1  Evaluation ===
##  RMSE: 39.1801 | MAE: 29.9975 | R²: 0.0858 
##  AIC: 42158.8052

print_lm_model_diagnostics(step_model_full, train_data_original, "LM4: Stepwise Model 1")

## 
## === Diagnostics for LM4: Stepwise Model 1 ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 170.4, df = 14, p-value < 2.2e-16
## 
## High leverage threshold: 0.00772

# Model 5: Stepwise on Model 5

# Check assumptions

# === LM5 Evaluation ===
print_evaluate_lm_model(step_model_full_transformed_target, "log_lbxtc", test_data_transformed_target, "LM5: Stepwise Model 5")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM5: Stepwise Model 5  Evaluation ===
##  RMSE: 0.2063 | MAE: 0.1605 | R²: 0.1026 
##  AIC: -1373.733

print_lm_model_diagnostics(step_model_full_transformed_target, train_data_transformed_target, "LM5: Stepwise Model 5")

## 
## === Diagnostics for LM5: Stepwise Model 5 ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 116, df = 17, p-value < 2.2e-16
## 
## High leverage threshold: 0.00917

# Model 6 - Ridge
testX <- test_data_transformed_target %>%
  dplyr::select(-log_lbxtc)
testY <- test_data_transformed_target$log_lbxtc
ridge_pred <- predict(ridge_model, testX)
ridge_resample <- postResample(ridge_pred, testY)
ridge_rmse <- ridge_resample["RMSE"]
ridge_Rsquared <- ridge_resample["Rsquared"]
ridge_MAE <- ridge_resample["MAE"]

ridge_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== LM6: Ridge Evaluation ===\n",
    "RMSE:", round(ridge_rmse, 4), 
    "| R²:", round(ridge_Rsquared, 4),
    "| MAE:", round(ridge_MAE, 4),
    "\n\n",
    sep = " "
  )
cat(ridge_output)

## 
## === Model Selection and Evaluation ===
## 
##  === LM6: Ridge Evaluation ===
##  RMSE: 0.2054 | R²: 0.1104 | MAE: 0.1596

# Model 7 - Elastic Net
testX <- test_data_transformed_target %>%
  dplyr::select(-log_lbxtc)
testY <- test_data_transformed_target$log_lbxtc
enet_pred <- predict(enet_model, testX)
enet_resample <- postResample(enet_pred, testY)
enet_rmse <- enet_resample["RMSE"]
enet_Rsquared <- enet_resample["Rsquared"]
enet_MAE <- enet_resample["MAE"]
# enet_aic  <- AIC(enet_model)

enet_output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== LM7: Elastic Net Evaluation ===\n",
    "RMSE:", round(enet_rmse, 4), 
    "| R²:", round(enet_Rsquared, 4),
    "| MAE:", round(enet_MAE, 4),
    "\n\n",
    sep = " "
  )
cat(enet_output)

## 
## === Model Selection and Evaluation ===
## 
##  === LM7: Elastic Net Evaluation ===
##  RMSE: 0.2062 | R²: 0.1031 | MAE: 0.1606

# Model 8: Robust regression - original

# Check assumptions

# === LM8 Evaluation ===
print_evaluate_lm_model(robust_model_full_v2, "lbxtc", test_data_original, "LM8: Robust model (Original)")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM8: Robust model (Original)  Evaluation ===
##  RMSE: 39.1133 | MAE: 29.683 | R²: 0.0889 
##  AIC: 42200.5285

# reason for try: not every graph is for robust regression, so just print out the graphs R can print
try(print_lm_model_diagnostics(robust_model_full_v2, train_data_original, "LM8: Robust model (Original)"), silent=TRUE)

## 
## === Diagnostics for LM8: Robust model (Original) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 182.47, df = 28, p-value < 2.2e-16
## 
## High leverage threshold: 0.01448

# Model 9: Robust regression - transformed

# Check assumptions

# === LM9 Evaluation ===
print_evaluate_lm_model(robust_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM9: Robust model (Transformed Predictors + Target)  Evaluation ===
##  RMSE: 0.2057 | MAE: 0.1595 | R²: 0.1076 
##  AIC: -1355.7682

try(print_lm_model_diagnostics(robust_model_full_transformed_target_v2, train_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)"), silent=TRUE)

## 
## === Diagnostics for LM9: Robust model (Transformed Predictors + Target) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 141.07, df = 28, p-value < 2.2e-16
## 
## High leverage threshold: 0.01448

# Model 10: Robust regression - transformed, bisquare

# Check assumptions

# === LM10 Evaluation ===
print_evaluate_lm_model(robust_bisquare_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM10: Robust model (Transformed Predictors + Target, bisquare)  Evaluation ===
##  RMSE: 0.2057 | MAE: 0.1594 | R²: 0.1078 
##  AIC: -1355.2245

try(print_lm_model_diagnostics(robust_bisquare_model_full_transformed_target_v2, train_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)"), silent=TRUE)

## 
## === Diagnostics for LM10: Robust model (Transformed Predictors + Target, bisquare) ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 141.07, df = 28, p-value < 2.2e-16
## 
## High leverage threshold: 0.01448

# === LM12 Evaluation ===
print_evaluate_lm_model(model_transformed_newfeatures_v2, "log_lbxtc",test_data_transformed_newfeatures, "LM12: Transformed predictors and target + New features")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM12: Transformed predictors and target + New features  Evaluation ===
##  RMSE: 0.2041 | MAE: 0.1585 | R²: 0.1215 
##  AIC: -1525.4366

print_lm_model_diagnostics(model_transformed_newfeatures_v2,train_data_transformed_newfeatures,"LM12: Transformed predictors and target + New features")

## 
## === Diagnostics for LM12: Transformed predictors and target + New features ===

## [1] "Heteroscedasticity test:"
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 131.61, df = 30, p-value = 1.092e-14
## 
## High leverage threshold: 0.01545

# Results table

evaluate_lm_model <- function(model, target_variable, test_data, model_name) {

  preds <- predict(model, newdata = test_data)
  if (target_variable == "lbxtc") {
    actual <- test_data$lbxtc
  } else { # log
    actual <- test_data$log_lbxtc
  }

  rmse <- sqrt(mean((preds - actual)^2))
  mae  <- mean(abs(preds - actual))
  r2   <- 1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
  aic  <- AIC(model)

  return(data.frame(
    Model = model_name,
    RMSE = round(rmse, 4),
    MAE = round(mae, 4),
    R_squared = round(r2, 4),
    AIC = round(aic, 4)
  ))
}

ridge_df <- data.frame(
    Model = "LM6: Ridge (Predictors + Target)",
    RMSE = round(ridge_rmse, 4),
    MAE = round(ridge_MAE, 4),
    R_squared = round(ridge_Rsquared, 4),
    AIC = NA
  )
enet_df <- data.frame(
    Model = "LM7: Elastic Net (Predictors + Target)",
    RMSE = round(enet_rmse, 4),
    MAE = round(enet_MAE, 4),
    R_squared = round(enet_Rsquared, 4),
    AIC = NA
  )

comparison <- rbind(
  evaluate_lm_model(model_full_v2, "lbxtc", test_data_original, "LM1: Full model"),
  evaluate_lm_model(model_full_transformed_v2, "lbxtc", test_data_transformed, "LM2: Full model (Transformed Predictors)"),
  evaluate_lm_model(model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM3: Full model (Transformed Predictors + Target)"),
  evaluate_lm_model(step_model_full, "lbxtc", test_data_original, "LM4: Stepwise Model 1"),
  evaluate_lm_model(step_model_full_transformed_target, "log_lbxtc", test_data_transformed_target, "LM5: Stepwise Model 3"),
  ridge_df,
  enet_df,
  evaluate_lm_model(robust_model_full_v2, "lbxtc", test_data_original, "LM8: Robust model (Original)"),
  evaluate_lm_model(robust_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM9: Robust model (Transformed Predictors + Target)"),
  evaluate_lm_model(robust_bisquare_model_full_transformed_target_v2, "log_lbxtc", test_data_transformed_target, "LM10: Robust model (Transformed Predictors + Target, bisquare)"),
 evaluate_lm_model(model_transformed_newfeatures_v2,"log_lbxtc", test_data_transformed_newfeatures, "LM12: Transformed predictors and target + New features")
)

print(comparison)

##                                                                Model    RMSE
## 1                                                    LM1: Full model 39.1062
## 2                           LM2: Full model (Transformed Predictors) 39.0595
## 3                  LM3: Full model (Transformed Predictors + Target)  0.2057
## 4                                              LM4: Stepwise Model 1 39.1801
## 5                                              LM5: Stepwise Model 3  0.2063
## RMSE                                LM6: Ridge (Predictors + Target)  0.2054
## RMSE1                         LM7: Elastic Net (Predictors + Target)  0.2062
## 11                                      LM8: Robust model (Original) 39.1133
## 12               LM9: Robust model (Transformed Predictors + Target)  0.2057
## 13    LM10: Robust model (Transformed Predictors + Target, bisquare)  0.2057
## 14            LM12: Transformed predictors and target + New features  0.2041
##           MAE R_squared       AIC
## 1     29.9116    0.0893 42173.626
## 2     29.6559    0.0914 42049.796
## 3      0.1597    0.1070 -1360.400
## 4     29.9975    0.0858 42158.805
## 5      0.1605    0.1026 -1373.733
## RMSE   0.1596    0.1104        NA
## RMSE1  0.1606    0.1031        NA
## 11    29.6830    0.0889 42200.529
## 12     0.1595    0.1076 -1355.768
## 13     0.1594    0.1078 -1355.225
## 14     0.1585    0.1215 -1525.437

kable(
  comparison,
  caption = "Multiple Linear Regression Models",
  digits = 4,
  align = "c"
)

Multiple Linear Regression Models
	Model	RMSE	MAE	R_squared	AIC
1	LM1: Full model	39.1062	29.9116	0.0893	42173.626
2	LM2: Full model (Transformed Predictors)	39.0595	29.6559	0.0914	42049.796
3	LM3: Full model (Transformed Predictors + Target)	0.2057	0.1597	0.1070	-1360.400
4	LM4: Stepwise Model 1	39.1801	29.9975	0.0858	42158.805
5	LM5: Stepwise Model 3	0.2063	0.1605	0.1026	-1373.733
RMSE	LM6: Ridge (Predictors + Target)	0.2054	0.1596	0.1104	NA
RMSE1	LM7: Elastic Net (Predictors + Target)	0.2062	0.1606	0.1031	NA
11	LM8: Robust model (Original)	39.1133	29.6830	0.0889	42200.529
12	LM9: Robust model (Transformed Predictors + Target)	0.2057	0.1595	0.1076	-1355.768
13	LM10: Robust model (Transformed Predictors + Target, bisquare)	0.2057	0.1594	0.1078	-1355.225
14	LM12: Transformed predictors and target + New features	0.2041	0.1585	0.1215	-1525.437

6 Summary and Conclusion

We used the 2017–2018 NHANES cycle set of key modules that capture demographic, behavioral, and clinical information relevant to predicting cholesterol and BMI outcomes. These modules include demographics, smoking status, alcohol use, physical activity, dietary intake, body measurements, blood pressure, glucose, inflammation (CRP), HDL cholesterol, triglycerides, and total cholesterol.Each dataset was imported individually and prepared for merging so that a comprehensive, unified analytic file could be created for building our regression models.

Several continuous variables like Total calories(dr1tkcal),Total fat(dr1ttfat) ,Sodium intake(dr1tsodi), HDL(lbdhdd) and Triglycerides(lbxtr) in the dataset exhibit skewed distributions.

We addressed variables with moderate missingness by replacing with their median values, while categorical variables were imputed using the most frequent category. For variables with higher levels of missingness, we applied MICE using predictive mean matching, allowing more accurate estimation based on relationships among variables.

The most prominent outliers were concentrated in biomarkers and nutrient totals, which could influence regression models if unaddressed. These findings suggested the need for careful handling, such as log transformation or robust modeling approaches, rather than outright removal, to ensure valid and generalizable analytical results.

To address skewness and high-end outliers in the NHANES dataset, we applied log transformations to several right-skewed variables. Specifically, CRP (lbxhscrp), triglycerides (lbxtr), total cholesterol (lbxtc), HDL cholesterol (lbdhdd), glucose (lbxglu), BMI (bmxbmi), and dietary intake variables such as total calories. (dr1tkcal), total fat (dr1ttfat), and sodium (dr1tsodi) were log-transformed. In addition to transforming skewed numeric variables, we converted key categorical variables into factor type for modeling.

Checking for linear relationship across the scatterplots, we observed that the most numeric predictors exhibit very weak or negligible linear relationships with total cholesterol.This suggested that total cholesterol is largely uncorrelated with most individual predictors, with only modest positive trends observed for a few clinical measures.

Strong Positive Correlations: - dr1tkcal/dr1ttfat 0.89 (Calories and Total Fat Intake - more fat indicates more calories) - dr1tkcal/dr1tsodi 0.80 (Calories and Total Sodium Intake - more sodium indicates more calories) - dr1ttfat/dr1tsodi 0.76 (Total Fat and Sodium Intake -more fat indicates more sodium) - ridageyr/bpxsy1 0.47 (Higher age indicates higher systolic bp)

Negative Correlations: - lbdhdd/lbxtr -0.31 (Higher HDL, lower Triglycerides) - bmxbmi/lbdhdd -0.27 (Higher BMI, lower HDL) - lbxglu/lbdhdd -0.18 (Higher Glucose, lower HDL)

Several linear regression models were built and evaluated . Best Performing Models is LM12( Multiple Linear regression with transformed predictors and target+ new features). ** It has lowest RMSE (0.2041 on log scale) and highest R² = 0.1215 ** Has the best predictive performance. ** Provides interpretable coefficients on the log scale. ** Feature engineering is better than Regularization in this case. ** Ridge’s shrinkage couldn’t compensate for missing non-linear patterns