Gallstone Data

Author

Rin Hwang

Introduction

For this project, I analyzed a gallstone dataset containing records for 319 individuals, covering demographic information, physical measurements, and biochemical markers collected from June 2022 to June 2023. The dataset has 39 columns in total, of which 8 are categorical and 31 are quantitative.

The main research question I am investigating is: To what extent can a patient’s Total Cholesterol be predicted by their body composition (BMI, TBFR) and metabolic markers (Glucose, Vitamin D, and Creatinine) within the data population?

This data comes from the University of California, Irvine Machine Learning Repository. The dataset is well suited to this research question because it pairs clinical outcomes with specific nutritional and physiological biomarkers. Unlike general health surveys, it targets the intersection of body composition and biochemical markers in the context of gallbladder health.

I used the following variables from the dataset to effectively examine my research question:

“Gallstone Status” - The dataset's target variable, used here as a grouping variable. A categorical indicator of whether the patient has gallstones (0 for presence, 1 for absence). It divides the study population into two groups: the “Gallstone” group and the “Healthy Control” group (not diagnosed with gallstone disease).
“Body Mass Index (BMI)” - A quantitative variable for the ratio of body weight to height squared.
“Total Body Fat Ratio (TBFR)” - A quantitative variable that measures total body fat as a percentage.
“Creatinine” - A quantitative variable that measures Creatinine levels.
“Glucose” - A quantitative variable that measures blood sugar levels.
“Total Cholesterol (TC)” - A quantitative variable that measures total cholesterol levels.
“Vitamin D” - A quantitative variable representing the serum concentration of Vitamin D of the patient.

To clean up the dataset for analysis, I first used gsub() with regular expressions to standardize the column names by removing the abbreviations and units in parentheses and replacing spaces with underscores, then converted the names to lowercase (e.g., “Total Cholesterol (TC)” becomes “total_cholesterol”). Next, I applied inclusion and exclusion criteria with dplyr to keep only valid records, including a threshold that excludes extreme glucose values of 200 mg/dL or higher. Finally, I used drop_na() to remove any observations with missing values so the regression could be fit on complete cases.

I chose this topic and dataset because I am curious about how biological factors interact and how they influence health outcomes. I have family and friends who have had gallstones, which makes me even more interested in studying them and how they relate to body composition and metabolic markers.

Loading Libraries and the Dataset

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.3

Warning: package 'readr' was built under R version 4.5.3

Warning: package 'dplyr' was built under R version 4.5.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(shiny)

Warning: package 'shiny' was built under R version 4.5.3

setwd("C:/Users/hwang/OneDrive/Documents/MC stuff/Spring 2026/DATA 110 Data Visualization and Communication/Projects/Project 2")
gallstonesdata <- read_csv("dataset-uci.csv")

Rows: 319 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (39): Gallstone Status, Age, Gender, Comorbidity, Coronary Artery Diseas...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Retrieving List of Variables

names(gallstonesdata)

 [1] "Gallstone Status"                              
 [2] "Age"                                           
 [3] "Gender"                                        
 [4] "Comorbidity"                                   
 [5] "Coronary Artery Disease (CAD)"                 
 [6] "Hypothyroidism"                                
 [7] "Hyperlipidemia"                                
 [8] "Diabetes Mellitus (DM)"                        
 [9] "Height"                                        
[10] "Weight"                                        
[11] "Body Mass Index (BMI)"                         
[12] "Total Body Water (TBW)"                        
[13] "Extracellular Water (ECW)"                     
[14] "Intracellular Water (ICW)"                     
[15] "Extracellular Fluid/Total Body Water (ECF/TBW)"
[16] "Total Body Fat Ratio (TBFR) (%)"               
[17] "Lean Mass (LM) (%)"                            
[18] "Body Protein Content (Protein) (%)"            
[19] "Visceral Fat Rating (VFR)"                     
[20] "Bone Mass (BM)"                                
[21] "Muscle Mass (MM)"                              
[22] "Obesity (%)"                                   
[23] "Total Fat Content (TFC)"                       
[24] "Visceral Fat Area (VFA)"                       
[25] "Visceral Muscle Area (VMA) (Kg)"               
[26] "Hepatic Fat Accumulation (HFA)"                
[27] "Glucose"                                       
[28] "Total Cholesterol (TC)"                        
[29] "Low Density Lipoprotein (LDL)"                 
[30] "High Density Lipoprotein (HDL)"                
[31] "Triglyceride"                                  
[32] "Aspartat Aminotransferaz (AST)"                
[33] "Alanin Aminotransferaz (ALT)"                  
[34] "Alkaline Phosphatase (ALP)"                    
[35] "Creatinine"                                    
[36] "Glomerular Filtration Rate (GFR)"              
[37] "C-Reactive Protein (CRP)"                      
[38] "Hemoglobin (HGB)"                              
[39] "Vitamin D"

Cleaning the Variables

# Remove parentheses and everything inside them
names(gallstonesdata) <- gsub("\\s*\\(.*?\\)", "", names(gallstonesdata))

# Replace spaces, slashes, and hyphens with underscores
names(gallstonesdata) <- gsub("[ /-]", "_", names(gallstonesdata))

# Make all variable characters lowercase
names(gallstonesdata) <- tolower(names(gallstonesdata))

names(gallstonesdata)

 [1] "gallstone_status"                    
 [2] "age"                                 
 [3] "gender"                              
 [4] "comorbidity"                         
 [5] "coronary_artery_disease"             
 [6] "hypothyroidism"                      
 [7] "hyperlipidemia"                      
 [8] "diabetes_mellitus"                   
 [9] "height"                              
[10] "weight"                              
[11] "body_mass_index"                     
[12] "total_body_water"                    
[13] "extracellular_water"                 
[14] "intracellular_water"                 
[15] "extracellular_fluid_total_body_water"
[16] "total_body_fat_ratio"                
[17] "lean_mass"                           
[18] "body_protein_content"                
[19] "visceral_fat_rating"                 
[20] "bone_mass"                           
[21] "muscle_mass"                         
[22] "obesity"                             
[23] "total_fat_content"                   
[24] "visceral_fat_area"                   
[25] "visceral_muscle_area"                
[26] "hepatic_fat_accumulation"            
[27] "glucose"                             
[28] "total_cholesterol"                   
[29] "low_density_lipoprotein"             
[30] "high_density_lipoprotein"            
[31] "triglyceride"                        
[32] "aspartat_aminotransferaz"            
[33] "alanin_aminotransferaz"              
[34] "alkaline_phosphatase"                
[35] "creatinine"                          
[36] "glomerular_filtration_rate"          
[37] "c_reactive_protein"                  
[38] "hemoglobin"                          
[39] "vitamin_d"
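
As a quick check of the three renaming steps, the sketch below applies them to four of the original column names. The clean_names() wrapper is a hypothetical helper introduced only for this illustration; the analysis itself modifies names(gallstonesdata) directly.

```r
# Four of the original column names from the dataset
raw_names <- c("Total Cholesterol (TC)",
               "Total Body Fat Ratio (TBFR) (%)",
               "Extracellular Fluid/Total Body Water (ECF/TBW)",
               "C-Reactive Protein (CRP)")

# Hypothetical helper (not part of the original analysis) that wraps
# the same three steps used above
clean_names <- function(x) {
  x <- gsub("\\s*\\(.*?\\)", "", x)  # drop parenthesized abbreviations and units
  x <- gsub("[ /-]", "_", x)          # spaces, slashes, hyphens -> underscores
  tolower(x)                          # lowercase everything
}

cleaned <- clean_names(raw_names)
print(cleaned)
```

The results match the corresponding entries in the cleaned name list above (for example, total_body_fat_ratio and c_reactive_protein).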

Preliminary Exploratory Visualizations

Metabolic Markers For Checking Normality and Outliers

Vitamin D

ggplot(gallstonesdata, aes(x = vitamin_d)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Vitamin D Levels",
       x = "Vitamin D (ng/mL)",
       y = "Frequency",
       caption = "Source: UCI Machine Learning Repository")

Creatinine

ggplot(gallstonesdata, aes(x = creatinine)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Creatinine Levels",
       x = "Creatinine (mg/dL)",
       y = "Frequency",
       caption = "Source: UCI Machine Learning Repository")

Glucose

ggplot(gallstonesdata, aes(x = glucose)) +
  geom_histogram(bins = 30, fill = "darkred", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Glucose Levels",
       x = "Glucose (mg/dL)",
       y = "Frequency",
       caption = "Source: UCI Machine Learning Repository")

In the glucose histogram above, the data are heavily right-skewed. Most patients cluster around a healthy range (roughly 70–110 mg/dL), but extreme outliers stretch to nearly 600 mg/dL. For this reason, I filter out the glucose outliers as part of the inclusion and exclusion criteria.

BMI and Cholesterol Levels

ggplot(gallstonesdata, aes(x = body_mass_index, y = total_cholesterol, color = factor(gallstone_status))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) + # Adds regression lines for each group
  scale_color_manual(values = c("0" = "darkred", "1" = "darkblue"), 
                     labels = c("0" = "Presence", "1" = "Absence")) +
  labs(title = "Body Mass Index (BMI) versus Cholesterol Levels",
       color = "Gallstone Status", 
       caption = "Source: UCI Machine Learning Repository") +
  theme_light()

`geom_smooth()` using formula = 'y ~ x'

Both the red (Presence) and blue (Absence) regression lines are nearly horizontal, which strongly indicates that, in this population, BMI alone is a weak predictor of Total Cholesterol. The red and blue points are also heavily intermixed, suggesting that the relationship between BMI and cholesterol does not differ much between patients with and without gallstones.

For now, this suggests that if BMI alone explains little of the variation in cholesterol, other factors such as Vitamin D, Glucose, or Total Body Fat Ratio may be more important drivers.

Cleaning the Dataset For Inclusion and Exclusion

gallstonesdata_final <- gallstonesdata %>%
  # Selecting variables
  select(total_cholesterol, body_mass_index, total_body_fat_ratio, 
         vitamin_d, glucose, creatinine, gallstone_status) %>%
  
  # Filter out invalid values and outliers
  filter(total_cholesterol > 0, 
         glucose > 0 & glucose < 200, 
         vitamin_d > 0,
         body_mass_index > 10) %>%
  
  # Mutate to create two BMI groups
  mutate(bmi_category = ifelse(body_mass_index >= 30, "Obese", "Non-Obese")) %>%

  # Keep complete cases only
  drop_na()

# Named bmi_summary so the object does not share a name with base R's summary()
bmi_summary <- gallstonesdata_final %>%
  group_by(bmi_category) %>%
  summarize(
    Avg_Cholesterol = mean(total_cholesterol),
    Sample_Size = n()
  )

print(bmi_summary)

# A tibble: 2 × 3
  bmi_category Avg_Cholesterol Sample_Size
  <chr>                  <dbl>       <int>
1 Non-Obese               206.         192
2 Obese                   197.         116

Multiple Linear Regression Analysis

Dependent Variable: total_cholesterol

Independent Variables: body_mass_index, total_body_fat_ratio, creatinine, glucose, vitamin_d

mlra_model <- lm(total_cholesterol ~ body_mass_index + total_body_fat_ratio + creatinine + glucose + vitamin_d, data = gallstonesdata_final)

summary(mlra_model)


Call:
lm(formula = total_cholesterol ~ body_mass_index + total_body_fat_ratio + 
    creatinine + glucose + vitamin_d, data = gallstonesdata_final)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.276  -29.098   -2.733   27.695  157.967 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          178.20622   23.94684   7.442 1.04e-12 ***
body_mass_index       -2.06374    0.80424  -2.566  0.01077 *  
total_body_fat_ratio   1.83021    0.55503   3.297  0.00109 ** 
creatinine            45.09817   17.75682   2.540  0.01159 *  
glucose               -0.05673    0.15707  -0.361  0.71821    
vitamin_d              0.07952    0.26178   0.304  0.76151    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 45.33 on 302 degrees of freedom
Multiple R-squared:  0.03751,   Adjusted R-squared:  0.02157 
F-statistic: 2.354 on 5 and 302 DF,  p-value: 0.04066

Equation
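
Written out from the coefficient estimates in the summary output above (rounded to two decimals, with each predictor in the units used by the dataset), the fitted model is:

```latex
\widehat{\text{Total Cholesterol}} = 178.21
  - 2.06\,(\text{BMI})
  + 1.83\,(\text{TBFR})
  + 45.10\,(\text{Creatinine})
  - 0.06\,(\text{Glucose})
  + 0.08\,(\text{Vitamin D})
```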

Adjusted R-squared (0.02157): This value is very low. It indicates that the model explains only about 2.16% of the variation in Total Cholesterol after adjusting for the number of predictors (the unadjusted R-squared is 3.75%). This strongly suggests that factors not in the model, such as genetics, diet, or medication, likely account for most of the remaining variance.

F-statistic p-value (0.04066): Even though the R-squared is low, the overall model is still statistically significant because the p-value is below the conventional threshold of 0.05.

Looking at the individual coefficients, “body_mass_index”, “total_body_fat_ratio”, and “creatinine” are statistically significant (p < 0.05), although the low R-squared means they explain little of the variation in practice. In contrast, “glucose” (p = 0.718) and “vitamin_d” (p = 0.762) are not significant. In a backward elimination process, these would be the first variables removed to simplify the model, as they do not currently help predict cholesterol in this specific group.
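
The elimination step described above could be sketched as follows. This is only an illustration under stated assumptions: the data frame sim is simulated stand-in data (not the gallstone dataset), the names full_model and reduced_model are hypothetical, and both candidate predictors are dropped at once for brevity, whereas strict backward elimination removes one variable at a time and refits.

```r
set.seed(110)  # arbitrary seed so the simulated example is reproducible
n <- 300

# Simulated stand-in data; these values are NOT from the gallstone dataset
sim <- data.frame(
  body_mass_index      = rnorm(n, 28, 5),
  total_body_fat_ratio = rnorm(n, 28, 8),
  creatinine           = rnorm(n, 0.8, 0.15),
  glucose              = rnorm(n, 100, 15),
  vitamin_d            = rnorm(n, 22, 9)
)
sim$total_cholesterol <- 180 - 2 * sim$body_mass_index +
  1.8 * sim$total_body_fat_ratio + 45 * sim$creatinine + rnorm(n, 0, 45)

# Full model with all five predictors
full_model <- lm(total_cholesterol ~ body_mass_index + total_body_fat_ratio +
                   creatinine + glucose + vitamin_d, data = sim)

# Refit without the two candidate predictors
reduced_model <- update(full_model, . ~ . - glucose - vitamin_d)

# Partial F-test comparing the nested models: a large p-value suggests
# the dropped terms added little explanatory power
comparison <- anova(reduced_model, full_model)
print(comparison)
```

With the real data, the same comparison would use mlra_model as the full model. If the partial F-test is not significant, the simpler three-predictor model is a reasonable choice.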

Using Shiny to Visualize Prediction Risks

ui <- fluidPage(
  titlePanel("Metabolic Predictor: Total Cholesterol & Gallstone Risk"),
  
  sidebarLayout(
    sidebarPanel(
      h4("Individual Metabolic Markers"),
      sliderInput("bmi", "BMI (kg/m^2):", min = 15, max = 50, value = 25),
      sliderInput("tbfr", "Body Fat (%):", min = 5, max = 60, value = 25),
      sliderInput("vitd", "Vitamin D (ng/mL):", min = 5, max = 60, value = 28),
      sliderInput("glu", "Glucose (mg/dL):", min = 60, max = 200, value = 95),
      sliderInput("creat", "Creatinine:", min = 0.1, max = 2.0, value = 0.8, step = 0.1),
      hr(),
      h4("Predicted Total Cholesterol:"),
      span(textOutput("cholPrediction"), 
           style="color:blue; font-size: 20px; font-weight: bold;")
    ),
    
    mainPanel(
      tabsetPanel(
        tabPanel("Multivariate Distribution", 
                 br(),
                 plotlyOutput("metabolicPlot", height = "550px"))))))

server <- function(input, output) {
  # Predicted total cholesterol for the current slider values.
  # Uses mlra_model, the regression fitted above.
  predicted_chol <- reactive({
    new_case <- data.frame(body_mass_index = input$bmi,
                           total_body_fat_ratio = input$tbfr,
                           vitamin_d = input$vitd,
                           glucose = input$glu,
                           creatinine = input$creat)
    unname(predict(mlra_model, new_case))
  })

  # Text shown in the sidebar
  output$cholPrediction <- renderText({
    paste(round(predicted_chol(), 2), "mg/dL")
  })

  output$metabolicPlot <- renderPlotly({
    # Hypothetical patient defined by the sliders (the red diamond)
    current_selection <- data.frame(
      body_mass_index = input$bmi,
      total_cholesterol = predicted_chol()
    )

    p <- ggplot(gallstonesdata_final, aes(x = body_mass_index, y = total_cholesterol)) +
      geom_jitter(color = "grey", alpha = 0.3, width = 0.4) +
      # current_selection has no gallstone_status column, so the diamond
      # is drawn in both facets
      geom_point(data = current_selection,
                 color = "red", size = 5, shape = 18) +
      facet_wrap(~gallstone_status,
                 labeller = as_labeller(c("0" = "Presence", "1" = "Absence"))) +
      theme_minimal() +
      labs(x = "Body Mass Index",
           y = "Total Cholesterol",
           caption = "Source: UCI Machine Learning Repository")
    ggplotly(p)
  })

  # Not displayed: the ui above defines no matching output for modelSummary
  output$modelSummary <- renderPrint({
    summary(mlra_model)
  })
}

shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents

Conclusion

The visualization is an interactive multivariate prediction tool built with Shiny. The interface is split in two: a sidebar on the left with sliders for BMI, Body Fat, Vitamin D, Glucose, and Creatinine, and a main panel on the right with a scatter plot. The grey dots in the background represent all the patients in the filtered dataset, divided into two panels: those with gallstones present and those with gallstones absent. Overlaid on each panel is a red diamond. It represents a hypothetical person: its position is not a fixed data point but a live calculation from the slider values, so as the user adjusts the sliders to simulate different metabolic profiles, the predicted cholesterol level updates accordingly.

An interesting pattern in the visualization is that BMI and body fat pull the prediction in opposite directions: increasing Body Fat (TBFR) raises the red diamond, whereas increasing BMI while holding body fat constant predicts a slight decrease in cholesterol. This suggests that body composition tells us more about cholesterol than total mass alone. The visualization also shows how much each predictor moves the prediction. The diamond jumps noticeably with each step of the Creatinine slider because of its large coefficient of 45.10, whereas it barely moves with the Vitamin D or Glucose sliders. This is consistent with the statistical findings, where Vitamin D and Glucose were non-significant predictors with p-values far above the 0.05 significance level.
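
The slider behavior can be reproduced by hand from the regression output. The sketch below copies the coefficient estimates from the model summary and the slider defaults and ranges from the ui definition; the object names (coefs, defaults, ranges, swing) are introduced here only for illustration.

```r
# Coefficient estimates copied from the summary(mlra_model) output
coefs <- c(intercept = 178.20622, body_mass_index = -2.06374,
           total_body_fat_ratio = 1.83021, creatinine = 45.09817,
           glucose = -0.05673, vitamin_d = 0.07952)

# Slider default values from the ui definition
defaults <- c(body_mass_index = 25, total_body_fat_ratio = 25,
              creatinine = 0.8, glucose = 95, vitamin_d = 28)

# Prediction = intercept + sum of (coefficient * value)
predicted <- coefs[["intercept"]] + sum(coefs[names(defaults)] * defaults)
print(round(predicted, 2))

# Slider ranges (min, max) from the ui definition
ranges <- list(body_mass_index = c(15, 50), total_body_fat_ratio = c(5, 60),
               creatinine = c(0.1, 2.0), glucose = c(60, 200),
               vitamin_d = c(5, 60))

# Change in predicted cholesterol when a slider moves from min to max
swing <- sapply(names(ranges), function(v) coefs[[v]] * diff(ranges[[v]]))
print(round(swing, 1))
```

At the default slider positions the prediction works out to about 205.28 mg/dL. Across the full slider range, Glucose and Vitamin D shift the prediction by less than 8 mg/dL each, while Creatinine shifts it by roughly 86 mg/dL. BMI and Body Fat also produce large swings over their wide ranges, even though their per-unit coefficients are small.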

One thing I would have liked to add to this project is interval estimates on the chart to show the margin of error, especially given that the adjusted R-squared is only 0.02157. Although the app reports a single predicted number, the true value could fall anywhere within a wide interval around it.
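
One way such intervals might be computed is with the interval argument of predict() for lm models. The sketch below is a minimal illustration on simulated stand-in data (sim and sim_model are hypothetical and use only two predictors); with the real analysis, the same calls would be made on mlra_model. For an individual patient, a prediction interval is the relevant one, and it is always wider than the confidence interval for the mean response. Given the residual standard error of 45.33 reported above, a 95% prediction interval from the actual model would likely span roughly 90 mg/dL on either side of the point estimate.

```r
set.seed(110)  # arbitrary seed so the simulated example is reproducible
n <- 300

# Simulated stand-in data; these values are NOT from the gallstone dataset
sim <- data.frame(body_mass_index = rnorm(n, 28, 5),
                  creatinine = rnorm(n, 0.8, 0.15))
sim$total_cholesterol <- 200 - 1 * sim$body_mass_index +
  40 * sim$creatinine + rnorm(n, 0, 45)

sim_model <- lm(total_cholesterol ~ body_mass_index + creatinine, data = sim)

# Hypothetical patient
new_case <- data.frame(body_mass_index = 25, creatinine = 0.8)

# Confidence interval: uncertainty about the MEAN cholesterol at these values
conf_int <- predict(sim_model, new_case, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about ONE new patient's cholesterol
pred_int <- predict(sim_model, new_case, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both calls return a one-row matrix with columns fit, lwr, and upr, so the lower and upper bounds could be shown next to the predicted value in the app or drawn as an error bar on the red diamond.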

References

Esen, I., Arslan, H., Aktürk, S., Gülşen, M., Kültekin, N., & Özdemir, O. (2024). Gallstone [Dataset]. UCI Machine Learning Repository. https://doi.org/10.1097/md.0000000000037258.

Gallbladder Diagram: https://www.mayoclinic.org/diseases-conditions/gallstones/symptoms-causes/syc-20354214