DATA110 Final Project

Author

M Madinko

Multivariate Factor Analysis of Total Cholesterol

Introduction: About the Dataset

In this final project, “Multivariate Factor Analysis of Total Cholesterol,” my objective is to understand how lifestyle and biological factors influence blood cholesterol levels by analyzing the NHANES dataset. The study uses a multiple regression model to predict TotChol (total cholesterol), the dependent variable, from four independent variables. Two are numeric: BMI, to evaluate the relationship between body composition and cholesterol, and Age, to account for biological changes over time. Two are categorical: Gender, to observe potential differences between men and women, and PhysActive, to represent the impact of moderate to vigorous physical activity. By intentionally focusing on these non-clinical, accessible metrics, the analysis seeks to provide a practical perspective on cardiovascular health prevention without requiring complex medical examinations.

The original NHANES dataset contains 76 variables. First, I selected only the variables relevant to my study (cholesterol, BMI, age, gender, and physical activity status) and removed the other columns. I also renamed the selected variables because their original names were difficult to use in R, as some of them contained uppercase letters and long names. Next, I checked for missing values and filtered out incomplete observations, which reduced the dataset from about 10,000 to about 7,000 observations. I then applied exclusion criteria, keeping individuals between 40 and 70 years old with BMI values between 20 and 35, in order to reduce extreme observations and improve visualization readability. This left approximately 2,800 observations. Finally, to reduce overplotting, I created a random sample of 200 observations for the second interactive plot.

I chose to focus on cholesterol because cardiovascular diseases such as strokes and heart attacks are common health problems in my family. On both my maternal and paternal sides, several family members have experienced these conditions. My father and grandmother both suffered from cardiovascular problems, and only two weeks ago my brother also suffered a stroke. This family history is one of the main reasons why this topic is very meaningful to me. In addition, about one year ago, my own medical test results showed that my cholesterol level was high, which increased my interest in understanding this health issue better.

Load the Libraries and Upload the Dataset

library(tidyverse)
library(highcharter)
library(ggfortify)
library(RColorBrewer)
setwd("C:/Users/monik/OneDrive/Desktop/DATA 110")
nhanes <- read_csv('nhanes.csv')
head(nhanes) # show the first six lines of the dataset
# A tibble: 6 × 76
     ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education   
  <dbl> <chr>    <chr>  <dbl> <chr>         <dbl> <chr> <chr> <chr>       
1 51624 2009_10  male      34 30-39           409 White <NA>  High School 
2 51624 2009_10  male      34 30-39           409 White <NA>  High School 
3 51624 2009_10  male      34 30-39           409 White <NA>  High School 
4 51625 2009_10  male       4 0-9              49 Other <NA>  <NA>        
5 51630 2009_10  female    49 40-49           596 White <NA>  Some College
6 51638 2009_10  male       9 0-9             115 White <NA>  <NA>        
# ℹ 67 more variables: MaritalStatus <chr>, HHIncome <chr>, HHIncomeMid <dbl>,
#   Poverty <dbl>, HomeRooms <dbl>, HomeOwn <chr>, Work <chr>, Weight <dbl>,
#   Length <dbl>, HeadCirc <dbl>, Height <dbl>, BMI <dbl>,
#   BMICatUnder20yrs <chr>, BMI_WHO <chr>, Pulse <dbl>, BPSysAve <dbl>,
#   BPDiaAve <dbl>, BPSys1 <dbl>, BPDia1 <dbl>, BPSys2 <dbl>, BPDia2 <dbl>,
#   BPSys3 <dbl>, BPDia3 <dbl>, Testosterone <dbl>, DirectChol <dbl>,
#   TotChol <dbl>, UrineVol1 <dbl>, UrineFlow1 <dbl>, UrineVol2 <dbl>, …

Data Cleaning

# Rename the variables of interest
nhanes_clean <- nhanes |>
  rename(
    cholesterol = TotChol,
    bmi = BMI,
    age = Age,
    gender = Gender,
    phys_act = PhysActive
  ) |>
  # Keep only the variables used in the study
  select(cholesterol, bmi, age, gender, phys_act) |>
  # Remove rows with a missing value in any selected variable
  filter(
    !is.na(cholesterol),
    !is.na(bmi),
    !is.na(age),
    !is.na(gender),
    !is.na(phys_act)
  ) |>
  # Exclusion criteria: keep ages 40-70 and BMI 20-35
  filter(age >= 40 & age <= 70) |>
  filter(bmi >= 20 & bmi <= 35)
# cleaned data
head(nhanes_clean)
# A tibble: 6 × 5
  cholesterol   bmi   age gender phys_act
        <dbl> <dbl> <dbl> <chr>  <chr>   
1        6.7   30.6    49 female No      
2        5.82  27.2    45 female Yes     
3        5.82  27.2    45 female Yes     
4        5.82  27.2    45 female Yes     
5        4.99  23.7    66 male   Yes     
6        4.24  23.7    58 male   Yes     
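
The missing-value check described in the introduction can be sketched in base R as follows. This is a minimal illustration on a small made-up data frame (`toy` is hypothetical and does not contain NHANES values); it counts the NAs in each column and compares the number of rows before and after a complete-case filter.

```r
# Toy data standing in for the selected columns (values are made up)
toy <- data.frame(
  cholesterol = c(5.8, NA, 4.9, 6.7, 5.1),
  bmi         = c(27.2, 31.0, NA, 30.6, 24.3),
  age         = c(45, 52, 66, 49, 58),
  gender      = c("female", "male", "male", "female", "male"),
  phys_act    = c("Yes", "No", "Yes", NA, "Yes")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows with no missing value in any column
toy_complete <- toy[complete.cases(toy), ]

# Rows before and after the filter
rows_before <- nrow(toy)
rows_after  <- nrow(toy_complete)
cat("Rows before:", rows_before, " after:", rows_after, "\n")
```

Running the same two counts on the real data before and after the `filter()` step is one way to confirm the reduction in observations reported in the introduction.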

The cholesterol values are not in the standard U.S. clinical unit (mg/dL). Their range is consistent with mmol/L, the unit given for TotChol in the NHANES R package documentation.
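
If the values are in mmol/L (the unit the NHANES R package documents for TotChol), they can be compared with clinical thresholds by converting them to mg/dL. The sketch below assumes that unit and uses the usual cholesterol conversion factor of about 38.67; `mmol_to_mgdl` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: convert cholesterol from mmol/L to mg/dL.
# Assumes the input really is in mmol/L; 38.67 is the usual factor for cholesterol.
mmol_to_mgdl <- function(mmol) {
  mmol * 38.67
}

# Distinct cholesterol values taken from the head() output above
chol_mmol <- c(6.70, 5.82, 4.99, 4.24)
chol_mgdl <- mmol_to_mgdl(chol_mmol)
print(round(chol_mgdl, 1))

# The 200 mg/dL clinical threshold expressed in mmol/L
threshold_mmol <- 200 / 38.67
print(round(threshold_mmol, 2))
```

Under this assumption, the 200 mg/dL threshold corresponds to a little over 5 mmol/L, which puts the values in the cleaned data in a clinically plausible range.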

Making a Multiple Regression Model

model_fit <- lm(cholesterol ~ bmi + age + gender + phys_act, data = nhanes_clean)

# Statistical results (p-values, adjusted R-squared)
summary(model_fit)

Call:
lm(formula = cholesterol ~ bmi + age + gender + phys_act, data = nhanes_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8331 -0.7671 -0.0802  0.6499  8.4665 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.916507   0.201549  29.355  < 2e-16 ***
bmi         -0.010297   0.005504  -1.871   0.0615 .  
age         -0.003818   0.002434  -1.569   0.1168    
gendermale  -0.272114   0.041890  -6.496 9.72e-11 ***
phys_actYes  0.095249   0.041815   2.278   0.0228 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.101 on 2821 degrees of freedom
Multiple R-squared:  0.02064,   Adjusted R-squared:  0.01925 
F-statistic: 14.86 on 4 and 2821 DF,  p-value: 5.051e-12
# Diagnostic plots
autoplot(model_fit, which = 1:4, nrow = 2, ncol = 2) +
  theme_minimal()

Examining the plots to diagnose whether the linear model is appropriate

The blue line in the Residuals vs Fitted plot is relatively horizontal and close to zero, so a linear model is appropriate. The Normal Q-Q plot indicates that the residuals are approximately normally distributed, apart from a few outliers, which are labeled with their row numbers. The Scale-Location plot indicates homogeneous variance (homoscedasticity) because the line is almost flat, meaning the points are evenly spread. Cook’s distance values range between 0.001 and 0.005, which indicates that no individual observation has a strong influence on the regression model.
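
As a numeric complement to the Cook’s distance panel, the distances can also be inspected directly. The sketch below uses simulated data in place of the cleaned NHANES data (the data frame `sim`, the seed, and the 4/n rule of thumb are assumptions for illustration, not part of the original analysis).

```r
set.seed(110)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in for the cleaned data (not the NHANES values)
n <- 200
sim <- data.frame(
  bmi      = runif(n, 20, 35),
  age      = sample(40:70, n, replace = TRUE),
  gender   = sample(c("female", "male"), n, replace = TRUE),
  phys_act = sample(c("No", "Yes"), n, replace = TRUE)
)
sim$cholesterol <- 5.9 - 0.27 * (sim$gender == "male") + rnorm(n, sd = 1.1)

fit <- lm(cholesterol ~ bmi + age + gender + phys_act, data = sim)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# A common rule of thumb flags observations with a distance above 4 / n
cutoff  <- 4 / n
flagged <- which(cd > cutoff)

cat("Largest Cook's distance:", round(max(cd), 4), "\n")
cat("Observations above 4/n:", length(flagged), "\n")
```

With the real model, replacing `fit` with `model_fit` would give the exact range of distances shown in the plot.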

Backward Elimination

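I am trying to predict cholesterol using BMI, age, gender, and physical activity. In the model summary, only gender (p < 0.001) and physical activity (p = 0.0228) are statistically significant at the 0.05 level; BMI is marginal (p = 0.0615) and age is not significant (p = 0.1168). Because all four variables are the focus of this study, I kept them in the model, and backward elimination was not applied.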
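
For reference, p-value-based backward elimination can be sketched as below. This is a minimal illustration on simulated data, not the procedure used in this project; `backward_eliminate`, the simulated columns, and the 0.05 cutoff are all assumptions made for the example.

```r
set.seed(110)  # arbitrary seed for the simulated data

# Simulated data: y depends on x1 and x2, while noise is unrelated to y
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 1 * sim$x2 + rnorm(n)

# Hypothetical helper: repeatedly drop the least significant predictor
# until every remaining predictor has p <= alpha (or only one is left).
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # drop1() with an F test returns one p-value per term (factors included)
    tests <- drop1(fit, test = "F")
    pvals <- setNames(tests[["Pr(>F)"]][-1], rownames(tests)[-1])  # skip "<none>"
    worst <- names(which.max(pvals))
    if (pvals[[worst]] <= alpha || length(predictors) == 1) {
      return(fit)
    }
    predictors <- setdiff(predictors, worst)
  }
}

final_fit <- backward_eliminate(sim, "y", c("x1", "x2", "noise"))
kept <- attr(terms(final_fit), "term.labels")
print(kept)
```

A pure-noise column like `noise` is usually removed by this loop, while the predictors with a real effect stay in the model.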

Linear Equation

y = β0 + β1x1 + β2x2 + β3x3 + β4x4

cholesterol = β0 + β1(bmi) + β2(age) + β3(gendermale) + β4(phys_actYes)

cholesterol = 5.916507 − 0.010297(bmi) − 0.003818(age) − 0.272114(gendermale) + 0.095249(phys_actYes)

Here gendermale equals 1 for males and 0 for females, and phys_actYes equals 1 for physically active participants and 0 otherwise.

Interpretation of coefficients

Each interpretation holds the other predictors constant.

1- If BMI increases by 1 kg/m², cholesterol is expected to decrease by 0.010297 units.
2- If age increases by 1 year, cholesterol is expected to decrease by 0.003818 units.
3- Being male is associated with cholesterol that is 0.272114 units lower than for females.
4- Being physically active is associated with cholesterol that is 0.095249 units higher than for inactive participants.
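
To make the equation concrete, the sketch below applies the coefficients from the `summary(model_fit)` output by hand for one example profile. The helper `predict_chol` and the example person are hypothetical and only illustrate the arithmetic.

```r
# Coefficients copied from the summary(model_fit) output
coefs <- c(
  intercept   = 5.916507,
  bmi         = -0.010297,
  age         = -0.003818,
  gendermale  = -0.272114,
  phys_actYes = 0.095249
)

# Hypothetical helper: apply the fitted equation by hand
predict_chol <- function(bmi, age, male, active) {
  unname(
    coefs["intercept"] +
      coefs["bmi"] * bmi +
      coefs["age"] * age +
      coefs["gendermale"] * as.numeric(male) +
      coefs["phys_actYes"] * as.numeric(active)
  )
}

# Example: a physically active 50-year-old woman with a BMI of 25
pred_female <- predict_chol(bmi = 25, age = 50, male = FALSE, active = TRUE)
print(round(pred_female, 3))

# Same profile for a man: the two predictions differ by the gendermale coefficient
pred_male <- predict_chol(bmi = 25, age = 50, male = TRUE, active = TRUE)
print(round(pred_male - pred_female, 3))
```

With the fitted model available in the session, `predict(model_fit, newdata = ...)` should return the same values without copying the coefficients manually.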

Multivariate Scatter Plot of BMI and Cholesterol faceted by physical activity

Colored by gender, sized by age, and faceted by physical activity

ggplot(nhanes_clean, aes(x = bmi, y = cholesterol)) +
  geom_point(aes(color = gender, size = age), alpha = 0.4) +
  scale_color_manual(values = c("deeppink", "blue")) +
  facet_wrap(
    ~phys_act,
    labeller = labeller(phys_act = c(
      "Yes" = "Physically Active",
      "No" = "Not Physically Active"
    ))
  ) +
  theme_minimal() +
  labs(
    title = "Relationship between BMI and Cholesterol Faceted by Physical Activity",
    x = "BMI",
    y = "Cholesterol",
    color = "Gender",
    size = "Age",
    caption = "Source: National Health and Nutrition Examination Survey (Centers for Disease Control and Prevention, United States)"
  )

Interactive Multi-Factor Analysis of Cholesterol plot

hchart(
  nhanes_clean,
  "scatter",
  hcaes(
    x = bmi,
    y = cholesterol,
    group = gender,
    size = age
  )
) |>

  hc_plotOptions(
    scatter = list(
      marker = list(
        fillOpacity = 0.4
      )
    )
  ) |>

  hc_title(text = "<b>Interactive Multi-Factor Analysis of Cholesterol</b>") |>
  hc_subtitle(text = "Analyze the impact of BMI, Age, and Gender") |>
  hc_xAxis(title = list(text = "Body Mass Index (BMI)")) |>
  hc_yAxis(title = list(text = "Total Cholesterol (mmol/L)")) |>

  hc_colors(c("deeppink", "blue")) |>

  hc_tooltip(
    pointFormat = "
      <b>Patient Information</b><br>
      Gender: {point.gender}<br>
      Age: {point.size} years<br>
      Physical activity: {point.phys_act}<br>
      BMI: {point.x:.2f}<br>
      Cholesterol: {point.y:.2f} mmol/L
    "
  )

Sampling for Better Readability

# Sample the data for visualization (200 observations)
nhanes_sample <- nhanes_clean |> 
  slice_sample(n = 200)


# Interactive plot using sampled data
hchart(
  nhanes_sample,
  "scatter",
  hcaes(
    x = bmi,
    y = cholesterol,
    group = gender,
    size = age
  )
) |>

  hc_plotOptions(
    scatter = list(
      marker = list(
        fillOpacity = 0.5
      )
    )
  ) |>

  hc_title(text = "<b>Interactive Multivariate Analysis of BMI and Cholesterol</b>") |>
  hc_subtitle(text = "Sampled visualization (200 observations)") |>
  hc_xAxis(title = list(text = "Body Mass Index (BMI)")) |>
  hc_yAxis(title = list(text = "Total Cholesterol (mmol/L)")) |>

  hc_colors(c("deeppink", "blue")) |>

  hc_tooltip(
    pointFormat = "
      <b>Patient Information</b><br>
      Gender: {point.gender}<br>
      Age: {point.size} years<br>
      BMI: {point.x:.2f}<br>
      Cholesterol: {point.y:.2f} mmol/L
    "
  )
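
Because `slice_sample()` draws a different random sample on every run, the sampled plot changes each time the document is rendered. One way to make the sample repeatable is to set a seed immediately before sampling, as in the sketch below. The toy data frame and the seed value 110 are assumptions for illustration only.

```r
library(dplyr)

# Toy data frame standing in for nhanes_clean (values are simulated)
set.seed(1)
toy <- tibble(
  id  = 1:1000,
  bmi = runif(1000, 20, 35)
)

# Setting the seed immediately before sampling makes the draw repeatable.
# The seed value 110 is arbitrary.
set.seed(110)
sample_a <- toy |> slice_sample(n = 200)

set.seed(110)
sample_b <- toy |> slice_sample(n = 200)

# Both draws contain the same rows in the same order
print(identical(sample_a, sample_b))
```

In the project code, this would amount to one `set.seed()` call placed just before `nhanes_sample` is created.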

Essay

Because this health problem is recurrent in my family, I conducted additional background research on cholesterol. Although it is often perceived negatively, cholesterol is a vital lipid molecule produced by the liver. It plays an essential role in the proper functioning of the human body by participating in hormone synthesis and digestive processes. Cholesterol is generally measured through a lipid panel, with concentrations expressed in mg/dL. There are two main types of cholesterol carriers: low-density lipoproteins (LDL), commonly referred to as “bad” cholesterol, whose excessive accumulation may contribute to plaque buildup in the arteries and increase the risk of strokes and cardiovascular diseases, and high-density lipoproteins (HDL), known as “good” cholesterol, which help remove excess cholesterol from the bloodstream. According to public health research, including information provided by the Centers for Disease Control and Prevention (CDC, 2024), cholesterol levels are influenced by a complex interaction of lifestyle factors such as diet and physical activity, physiological factors including body mass index and aging, and genetic predispositions. Clinically, total cholesterol levels above 200 mg/dL are generally considered a warning sign and may increase the risk of severe cardiovascular complications over time (CDC, 2024).

The final visualization is a multivariate scatterplot showing the relationship between body mass index and cholesterol. The x-axis represents BMI, and the y-axis represents cholesterol. Point color represents gender and point size represents age, and the first scatterplot is also faceted by physical activity (phys_act). I also made interactive versions of the graph using highcharter.

From the visualization and the regression model, several interesting patterns can be observed. First, the model shows a weak negative relationship between BMI and cholesterol, meaning that cholesterol decreases slightly as BMI increases within the selected sample. Another surprising result is that cholesterol shows a very slight negative association with age, even though medical research often suggests that cholesterol tends to increase with aging. Neither of these two associations is statistically significant at the 0.05 level (p = 0.0615 for BMI and p = 0.1168 for age). In addition, physically active individuals showed slightly higher cholesterol values in the model. This unexpected result may be explained by confounding variables, limitations in the dataset, or differences in the type and intensity of physical activity measured. Overall, the model explains only about 2% of the variation in cholesterol (adjusted R-squared = 0.019), so these four factors alone are weak predictors.

I also encountered several challenges during the project. Because the dataset originally contained thousands of observations, the scatterplot suffered from overplotting, making the visualization difficult to interpret. To improve readability, I applied transparency and created a random sample for the interactive visualization. I also had difficulty integrating multiple aesthetic mappings, such as color, size, and interactivity, into the same graph while maintaining clarity. Another challenge was interpreting the cholesterol measurements because the values in the dataset are not in the standard clinical unit (mg/dL). For example, the minimum cholesterol value in the dataset was approximately 2.38 and the maximum was about 13.65, which is very different from clinical measurements where levels above 200 mg/dL are generally considered high. This range is consistent with values recorded in mmol/L, in which 200 mg/dL corresponds to roughly 5.2 mmol/L. In this project I did not convert the values or apply standard medical thresholds, and instead focused on relative differences and statistical relationships within the dataset.

References

Centers for Disease Control and Prevention. (2024). About cholesterol. https://www.cdc.gov/cholesterol/about/index.html

National Health and Nutrition Examination Survey (NHANES). (2009-2010). NHANES dataset.