Sweet Talk: Diabetes - Glucose & Insulin Analysis

Author

Ariel Ganoe & Evelyn Morales

#| message: false
#| warning: false

library(qrcode)
qr_code("https://rpubs.com/eveyt5478/1428871")
                  
 ▗▄▄▄ ▖  ▄ ▄▗▄▄▄  
 ▐▗▄▐▝▙▛▗▗▗▝▐▗▄▐  
 ▐▐█▐▝▚▚▄▖▀█▐▐█▐  
 ▐▄▄▟▐▐▗▜▜▗▐▐▄▄▟  
 ▗▄▄▗▄▄▐▚▙▄▝▖▖▖▖  
 ▝▙ ▗▜▌▄ ▄ ▀▞▀ ▝  
 ▐▛▌▚▞▛▌▝▟▟▜▞▝▙▘  
 ▗▚▝▙ ▞▝█▖▖▜▜█▄▝  
 ▝▌▌▗█▜▐▚▚▄▝▀▛▄▘  
 ▐▐▘▗ ▛▄ ▖▚▜▛▜▐▝  
 ▐▗▀▙ ▀▌▝▚▟█▟▟▗▙  
 ▗▄▄▄▐▌▝█▖▛▟▗▐█▀  
 ▐▗▄▐▗▜▐▚▄▄▐▄▟▗▞  
 ▐▐█▐▐▚▄ ▖▗▙▄▙█▛  
 ▐▄▄▟▐▟▌▝█▄▞▐▗▙▘  
                  
                  

use plot() for a better quality image

Sweet Talk: Diabetes - Glucose & Insulin Analysis

Introduction

Motivation

Managing blood glucose is complex and influenced by multiple factors such as insulin use, meals, and daily activity. Patients often experience significant glucose fluctuations throughout the day, making it difficult to understand how behaviors impact glucose levels over time. This project is motivated by the need to identify meaningful patterns between insulin use, behavior, and glucose response.

Research Questions

  • Does the time elapsed since an insulin dose predict blood glucose level at the next reading?
  • Does the type of measurement event add explanatory power beyond timing alone?
  • Do patients naturally cluster into distinct groups based on glucose control?
  • Is there meaningful variation in average glucose control across individual patients?

Outcome Variable: Blood glucose level (mg/dL)

Method:

  • Time-based glucose and insulin data are analyzed to examine how insulin dosing and behavioral patterns affect blood glucose levels
  • Descriptive statistics and visualizations are used to explore trends and variability in glucose measurements
  • Regression-based methods are applied to evaluate how well insulin dosing and behavioral variables explain and predict changes in glucose over time

Setup & Libraries

library(tidyverse)
library(readxl)
library(dplyr)
library(cluster)
library(NbClust)
library(plotly)
library(DT)
library(leaflet)
library(patchwork)

Data

The dataset contains 29,556 records across 70 patients, covering both glucose readings and insulin doses recorded between 1991 and 1993. Each record includes a patient ID, date, time, measurement code, value, and event type. The code column identifies what type of reading was taken, such as pre-breakfast glucose or regular insulin dose.

R_Project <- read_excel("C:/Users/Evelyn Morales/OneDrive - Kennesaw State University/Programming in R project/R_Project/R-Project.xlsx")

glucose_data <- R_Project
colnames(glucose_data) <- c("patient", "date", "time", "code", "value", "type")

str(glucose_data)
tibble [29,556 × 6] (S3: tbl_df/tbl/data.frame)
 $ patient: chr [1:29556] "data-01" "data-01" "data-01" "data-01" ...
 $ date   : POSIXct[1:29556], format: "1991-04-21" "1991-04-21" ...
 $ time   : POSIXct[1:29556], format: "1899-12-31 09:09:00" "1899-12-31 09:09:00" ...
 $ code   : num [1:29556] 58 33 34 62 33 48 58 33 34 33 ...
 $ value  : num [1:29556] 100 9 13 119 7 123 216 10 13 2 ...
 $ type   : chr [1:29556] "Glucose (Pre-Breakfast)" "Insulin (Regular)" "Insulin (NPH)" "(Pre-Dinner)(or Pre-Supper)" ...
head(glucose_data)
# A tibble: 6 × 6
  patient date                time                 code value type              
  <chr>   <dttm>              <dttm>              <dbl> <dbl> <chr>             
1 data-01 1991-04-21 00:00:00 1899-12-31 09:09:00    58   100 Glucose (Pre-Brea…
2 data-01 1991-04-21 00:00:00 1899-12-31 09:09:00    33     9 Insulin (Regular) 
3 data-01 1991-04-21 00:00:00 1899-12-31 09:09:00    34    13 Insulin (NPH)     
4 data-01 1991-04-21 00:00:00 1899-12-31 17:08:00    62   119 (Pre-Dinner)(or P…
5 data-01 1991-04-21 00:00:00 1899-12-31 17:08:00    33     7 Insulin (Regular) 
6 data-01 1991-04-21 00:00:00 1899-12-31 22:51:00    48   123 Glucose (General) 

Data Wrangling / Transformation

To connect each glucose reading to the insulin dose that caused it, we separated glucose and insulin records, then used a left join followed by a filter to keep only insulin doses that occurred before each glucose reading. The slice_max function then selected the single most recent dose before each reading. Finally, mutate calculated the hours elapsed between that dose and the glucose reading. This produced 2,609 properly matched pairs.

# Separate glucose and insulin records
glucose_only <- glucose_data |>
  dplyr::filter(stringr::str_detect(tolower(type), "glucose")) |>
  dplyr::rename(glucose_time = time)

insulin_only <- glucose_data |>
  dplyr::filter(stringr::str_detect(tolower(type), "insulin")) |>
  dplyr::rename(insulin_time = time)

# Join exploration
left_joined  <- glucose_only |> left_join(insulin_only,  by = "patient", relationship = "many-to-many")
inner_joined <- glucose_only |> inner_join(insulin_only, by = "patient", relationship = "many-to-many")
full_joined  <- glucose_only |> full_join(insulin_only,  by = "patient", relationship = "many-to-many")
right_joined <- glucose_only |> right_join(insulin_only, by = "patient", relationship = "many-to-many")

nrow(left_joined)
[1] 2218343
nrow(inner_joined)
[1] 2217934
nrow(full_joined)
[1] 2218343
nrow(right_joined)
[1] 2217934
# Build analysis dataset: link each glucose reading to its
# most recent prior insulin dose, then calculate time elapsed
model_data <- glucose_only |>
  left_join(insulin_only, by = "patient",
            relationship = "many-to-many") |>
  filter(insulin_time <= glucose_time) |>
  group_by(patient, glucose_time) |>
  slice_max(order_by = insulin_time, n = 1, with_ties = FALSE) |>
  ungroup() |>
  mutate(
    hours_since_insulin = as.numeric(
      difftime(glucose_time, insulin_time, units = "hours")
    )
  )

# Remove missing values
model_data_clean <- model_data |>
  drop_na()

# Classify each glucose reading by clinical threshold using case_when
model_data_clean <- model_data_clean |>
  mutate(
    glucose_category = case_when(
      value.x < 70   ~ "Low (<70)",
      value.x <= 130 ~ "Normal (70-130)",
      value.x <= 180 ~ "High (130-180)",
      TRUE           ~ "Very High (>180)"
    )
  )

nrow(model_data_clean)
[1] 2609

Results

Visualization

Plot 1: Glucose Levels vs Hours Since Insulin

This scatter plot shows blood glucose level on the y-axis and hours since the last insulin dose on the x-axis. The trend line shows a clear downward pattern, confirming that glucose decreases as more time passes after a dose. This is the biological mechanism of insulin at work in the data.

model_data_clean |>
  ggplot(aes(x = hours_since_insulin, y = value.x)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  scale_x_continuous(breaks = seq(0, 24, by = 2)) +
  labs(
    title    = "Glucose Levels vs Hours Since Insulin Dose",
    subtitle = "Outcome: blood glucose (mg/dL) | Predictor: hours since last dose",
    x        = "Hours Since Insulin Dose",
    y        = "Glucose Level (mg/dL)"
  ) +
  theme_minimal()

Plot 2: Glucose Levels by Measurement Event

This boxplot compares glucose distributions across measurement event types. Snack-time readings show the highest median glucose around 220-240 mg/dL, approximately 30-50 mg/dL higher than pre-breakfast levels. General readings display the widest variability, with outliers reaching 350-400+ mg/dL, indicating that unscheduled readings capture the most extreme glucose fluctuations.

glucose_only |>
  ggplot(aes(x = type, y = value, fill = type)) +
  geom_boxplot() +
  labs(
    title    = "Glucose Levels by Measurement Event",
    subtitle = "Snack-time readings tend to be highest; general readings most variable",
    x        = "Measurement Event",
    y        = "Glucose Level (mg/dL)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x     = element_text(angle = 45, hjust = 1)
  )

Plot 3: Average Glucose Level by Patient

This bar chart displays mean glucose for each of the 70 patients, ordered from lowest to highest. The color gradient from blue to red highlights severity. The spread from 105 mg/dL (data-31) to 332 mg/dL (data-70) represents a 227 mg/dL difference across patients all on insulin, showing that individual factors beyond dosing drive glucose control substantially.

model_data_clean |>
  group_by(patient) |>
  summarize(mean_glucose = mean(value.x, na.rm = TRUE)) |>
  ggplot(aes(x = reorder(patient, mean_glucose),
             y = mean_glucose,
             fill = mean_glucose)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_gradient(low = "steelblue", high = "red") +
  labs(
    title    = "Average Glucose Level by Patient",
    subtitle = "Range: 105 to 332 mg/dL a 227 mg/dL spread across patients on insulin",
    x        = "Patient",
    y        = "Mean Glucose (mg/dL)",
    fill     = "Mean Glucose"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 5))

Plot 4: Distribution of Glucose Levels

This histogram shows the overall distribution of glucose readings. The distribution is right-skewed, with the highest concentration of readings falling between 50 and 150 mg/dL. The long right tail indicates a meaningful subset of readings above 300 mg/dL, representing patients with poor glucose control. Most patients are managing reasonably, but high outliers exist.

model_data_clean |>
  ggplot(aes(x = value.x)) +
  geom_histogram(bins = 30) +
  scale_x_continuous(breaks = seq(0, 400, by = 50)) +
  labs(
    title    = "Distribution of Glucose Levels",
    subtitle = "Right-skewed: most patients controlled, but high outliers exist",
    x        = "Glucose Level (mg/dL)",
    y        = "Frequency"
  ) +
  theme_minimal()

Plot 5: Glucose Level by Measurement Code

This scatter plot shows individual glucose readings grouped by measurement code. Code 64 (pre-snack) and codes 48 and 57 (unspecified readings) show the highest glucose clustering, often reaching 250-300 mg/dL. Code 58 (pre-breakfast) shows lower clustering around 120-170 mg/dL, suggesting that fasting overnight brings glucose down before morning dosing.

model_data_clean |>
  ggplot(aes(x = as.factor(code.x), y = value.x,
             color = as.factor(code.x))) +
  geom_point(alpha = 0.3) +
  labs(
    title    = "Glucose Level by Measurement Code",
    subtitle = "Code 64 (pre-snack) shows highest glucose clustering",
    x        = "Measurement Code",
    y        = "Glucose Level (mg/dL)",
    color    = "Code"
  ) +
  theme_minimal()

Plot 6: Glucose Distribution with Density Gradient

This histogram uses a color gradient from light blue to dark blue to show where readings are most dense. The darkest bars appear between 50 and 150 mg/dL, confirming that is where the majority of readings concentrate. The lighter bars on the right tail show that high glucose readings above 300 mg/dL are less frequent but present.

model_data_clean |>
  ggplot(aes(x = value.x, fill = after_stat(count))) +
  geom_histogram(bins = 30, color = "black") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  scale_x_continuous(breaks = seq(0, 400, by = 50)) +
  labs(
    title = "Distribution of Glucose Levels",
    x     = "Glucose Level (mg/dL)",
    y     = "Frequency"
  ) +
  theme_minimal()

Plot 7: Glucose Range by Measurement Code

This range plot shows the minimum to maximum glucose values for each measurement code as a horizontal line, with the mean shown as a dot. All codes span a wide range, but the dot positions reveal that mean glucose is relatively consistent across codes at around 160-180 mg/dL. The wide lines indicate high within-code variability regardless of measurement context.

model_data_clean |>
  group_by(code.x) |>
  summarize(
    min_glucose  = min(value.x,  na.rm = TRUE),
    mean_glucose = mean(value.x, na.rm = TRUE),
    max_glucose  = max(value.x,  na.rm = TRUE)
  ) |>
  ggplot() +
  geom_segment(aes(
    x    = min_glucose,
    xend = max_glucose,
    y    = as.factor(code.x),
    yend = as.factor(code.x)
  )) +
  geom_point(aes(x = mean_glucose, y = as.factor(code.x))) +
  labs(
    title    = "Glucose Range by Measurement Code",
    subtitle = "Dot = mean | Line = min to max",
    x        = "Glucose Level (mg/dL)",
    y        = "Measurement Code"
  ) +
  theme_minimal()

Plot 8: Average Glucose by Event Type

This bar chart shows mean glucose grouped by event type. Pre-breakfast readings show the highest average around 180 mg/dL, while snack-time readings average around 160 mg/dL. The relatively small differences across event types suggest that timing of the measurement matters less than the individual patient’s overall glucose control pattern.

model_data_clean |>
  group_by(type.x) |>
  summarize(mean_glucose = mean(value.x, na.rm = TRUE)) |>
  ggplot(aes(x = type.x, y = mean_glucose, fill = mean_glucose)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average Glucose by Event Type",
    x     = "Event Type",
    y     = "Average Glucose (mg/dL)"
  ) +
  theme_minimal()

Plot 9: Time Gap Between Insulin Dose and Glucose Reading

This segment plot shows the time gap between each insulin dose and the subsequent glucose reading for 50 events. Each line represents one event, colored from blue (low glucose) to red (high glucose). Events with shorter gaps tend to show more red, meaning glucose is still elevated close to the dose. Longer gaps trend bluer, consistent with insulin lowering glucose over time.

model_data_clean |>
  filter(hours_since_insulin > 0.5) |>
  slice_head(n = 50) |>
  mutate(event_id = row_number()) |>
  ggplot() +
  geom_segment(aes(
    x     = 0,
    xend  = hours_since_insulin,
    y     = event_id,
    yend  = event_id,
    color = value.x
  ), linewidth = 1) +
  scale_color_gradient(low = "steelblue", high = "red") +
  labs(
    title    = "Time Gap Between Insulin Dose and Glucose Reading",
    subtitle = "Color = glucose level: red = high, blue = low",
    x        = "Hours Since Insulin Dose",
    y        = "Event",
    color    = "Glucose (mg/dL)"
  ) +
  theme_minimal()

Plot 10: Glucose Ranges by Meal Type

This range plot shows the minimum to maximum glucose values for each meal type, with the red dot marking the mean. All meal types show wide ranges spanning from near-normal to very high readings. The mean dots are clustered around 160-180 mg/dL across all types, but the wide lines show that individual readings vary enormously regardless of meal context.

glucose_only |>
  group_by(type) |>
  summarize(
    mean_glucose = mean(value, na.rm = TRUE),
    min_glucose  = min(value,  na.rm = TRUE),
    max_glucose  = max(value,  na.rm = TRUE)
  ) |>
  ggplot() +
  geom_segment(aes(
    x    = min_glucose,
    xend = max_glucose,
    y    = type,
    yend = type
  ), color = "gray50", linewidth = 1) +
  geom_point(aes(x = mean_glucose, y = type),
             color = "red", size = 3) +
  labs(
    title    = "Glucose Ranges by Meal Type",
    subtitle = "Red dot = mean | Line = min to max range",
    x        = "Glucose Level (mg/dL)",
    y        = "Meal Type"
  ) +
  theme_minimal()

Plot 11: Glucose Readings by Clinical Category (case_when)

Using case_when, we classified each reading into four clinical categories: Low (below 70), Normal (70-130), High (130-180), and Very High (above 180). The count labels on each bar show the exact number of readings per category. The majority of readings fall in the Normal and Very High categories, revealing a bimodal pattern where patients are either well-controlled or significantly elevated.

model_data_clean |>
  count(glucose_category) |>
  ggplot(aes(x = reorder(glucose_category, n),
             y = n,
             fill = glucose_category)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), hjust = -0.2, size = 3.5) +
  coord_flip() +
  labs(
    title    = "Glucose Readings by Clinical Category",
    subtitle = "Classified using case_when with clinical thresholds",
    x        = "Glucose Category",
    y        = "Count of Readings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Plot 12: Combined Figure (patchwork)

These two histograms are combined using patchwork to allow direct comparison of the outcome variable (glucose) and the key predictor (hours since insulin dose). Glucose is right-skewed with readings concentrated below 150 mg/dL. Hours since insulin is heavily concentrated near zero, meaning most glucose readings are taken shortly after a dose, which is an important context for interpreting the regression results.

p_glucose <- model_data_clean |>
  ggplot(aes(x = value.x)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  scale_x_continuous(breaks = seq(0, 400, by = 100)) +
  labs(
    title = "Glucose Distribution",
    x     = "Glucose Level (mg/dL)",
    y     = "Count"
  ) +
  theme_minimal()

p_hours <- model_data_clean |>
  ggplot(aes(x = hours_since_insulin)) +
  geom_histogram(bins = 30, fill = "tomato", color = "white") +
  scale_x_continuous(breaks = seq(0, 24, by = 4)) +
  labs(
    title = "Hours Since Insulin",
    x     = "Hours",
    y     = "Count"
  ) +
  theme_minimal()

p_glucose + p_hours +
  plot_annotation(
    title    = "Distribution of Outcome and Key Predictor",
    subtitle = "Left: glucose is right-skewed | Right: most readings taken shortly after dosing"
  )

Regression Modeling

Our outcome variable is blood glucose level (value.x, mg/dL). We built two models to test whether insulin timing and measurement context predict glucose. Model 1 uses hours since insulin as the sole predictor. Model 2 adds measurement code to test whether event context explains additional variance. We then used ANOVA to formally compare them.

# Model 1: Does time since insulin predict glucose?
fit1 <- lm(value.x ~ hours_since_insulin, data = model_data_clean)
summary(fit1)

Call:
lm(formula = value.x ~ hours_since_insulin, data = model_data_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-156.93  -79.25   -9.37   63.73  327.70 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          177.269      1.975  89.763  < 2e-16 ***
hours_since_insulin   -6.812      1.608  -4.236 2.35e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92.47 on 2607 degrees of freedom
Multiple R-squared:  0.006837,  Adjusted R-squared:  0.006456 
F-statistic: 17.95 on 1 and 2607 DF,  p-value: 2.352e-05

Model 1 shows that each additional hour since the last insulin dose is associated with a 6.81 mg/dL decrease in glucose (p < 0.0001). Although the Adjusted R-squared is 0.65%, this is expected since glucose is influenced by many factors beyond timing alone. The relationship is statistically real even if modest in magnitude.

# Model 2: Add measurement code as a second predictor
fit2 <- lm(value.x ~ hours_since_insulin + code.x, data = model_data_clean)
summary(fit2)

Call:
lm(formula = value.x ~ hours_since_insulin + code.x, data = model_data_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-159.29  -78.96   -8.98   64.41  325.14 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         272.2980    26.1573  10.410  < 2e-16 ***
hours_since_insulin  -6.4236     1.6078  -3.995 6.64e-05 ***
code.x               -1.6261     0.4463  -3.643 0.000274 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92.25 on 2606 degrees of freedom
Multiple R-squared:  0.01187,   Adjusted R-squared:  0.01111 
F-statistic: 15.65 on 2 and 2606 DF,  p-value: 1.75e-07

Model 2 adds measurement code (code.x) and shows it is a significant predictor (beta = -1.63, p = 0.0003). This means the context of the reading, such as pre-breakfast versus pre-snack, explains additional variance in glucose beyond timing alone.

# ANOVA: Is Model 2 significantly better than Model 1?
anova(fit1, fit2)
Analysis of Variance Table

Model 1: value.x ~ hours_since_insulin
Model 2: value.x ~ hours_since_insulin + code.x
  Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
1   2607 22290363                                  
2   2606 22177401  1    112962 13.274 0.0002744 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA comparison confirms Model 2 is significantly better than Model 1 (F = 13.27, p = 0.0003). Adding measurement context meaningfully improves the model.

Day-Level Model: Does Morning Dosing Predict Dinner Glucose?

g_breakfast <- R_Project[R_Project$Type == "Glucose (Pre-Breakfast)",
                         c("Date", "Value")]
g_breakfast$Date <- as.Date(g_breakfast$Date)
names(g_breakfast)[2] <- "glucose_breakfast"

g_dinner <- R_Project[R_Project$Type == "(Pre-Dinner)(or Pre-Supper)",
                      c("Date", "Value")]
g_dinner$Date <- as.Date(g_dinner$Date)
names(g_dinner)[2] <- "glucose_dinner"

i_regular <- R_Project[R_Project$Type == "Insulin (Regular)",
                       c("Date", "Value")]
i_regular$Date <- as.Date(i_regular$Date)
names(i_regular)[2] <- "insulin_regular"

i_nph <- R_Project[R_Project$Type == "Insulin (NPH)",
                   c("Date", "Value")]
i_nph$Date <- as.Date(i_nph$Date)
names(i_nph)[2] <- "insulin_nph"

merged          <- merge(g_breakfast, g_dinner,   by = "Date", all = FALSE)
merged          <- merge(merged,      i_regular,  by = "Date", all = FALSE)
merged          <- merge(merged,      i_nph,      by = "Date", all = FALSE)
glucose_insulin <- na.omit(merged)

model <- lm(glucose_dinner ~ glucose_breakfast +
              insulin_regular +
              insulin_nph,
            data = glucose_insulin)

summary(model)

Call:
lm(formula = glucose_dinner ~ glucose_breakfast + insulin_regular + 
    insulin_nph, data = glucose_insulin)

Residuals:
    Min      1Q  Median      3Q     Max 
-220.74  -62.64  -11.32   50.88  300.58 

Coefficients:
                   Estimate Std. Error  t value Pr(>|t|)    
(Intercept)       1.487e+02  1.049e-01 1416.817   <2e-16 ***
glucose_breakfast 6.329e-04  4.392e-04    1.441     0.15    
insulin_regular   4.894e-01  9.598e-03   50.993   <2e-16 ***
insulin_nph       7.664e-02  3.596e-03   21.316   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 77.24 on 4622168 degrees of freedom
Multiple R-squared:  0.0006992, Adjusted R-squared:  0.0006985 
F-statistic:  1078 on 3 and 4622168 DF,  p-value: < 2.2e-16

This model predicts dinner glucose from breakfast glucose and morning insulin doses. Regular insulin (p < 0.0001) and NPH insulin (p < 0.0001) are both significant predictors of dinner glucose. Breakfast glucose alone is not significant (p = 0.15), suggesting that insulin type and dose matter more than morning glucose level in predicting evening outcomes.

Functional and Iterative Programming

Reusable Model Function

The run_model function takes a dataframe and a formula, fits a linear model, and returns the summary. This allows us to test multiple model specifications efficiently without repeating code.

run_model <- function(df, formula) {
  fit <- lm(formula, data = df)
  summary(fit)
}

run_model(model_data_clean, value.x ~ hours_since_insulin)

Call:
lm(formula = formula, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-156.93  -79.25   -9.37   63.73  327.70 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          177.269      1.975  89.763  < 2e-16 ***
hours_since_insulin   -6.812      1.608  -4.236 2.35e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92.47 on 2607 degrees of freedom
Multiple R-squared:  0.006837,  Adjusted R-squared:  0.006456 
F-statistic: 17.95 on 1 and 2607 DF,  p-value: 2.352e-05
run_model(model_data_clean, value.x ~ hours_since_insulin + code.x)

Call:
lm(formula = formula, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-159.29  -78.96   -8.98   64.41  325.14 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         272.2980    26.1573  10.410  < 2e-16 ***
hours_since_insulin  -6.4236     1.6078  -3.995 6.64e-05 ***
code.x               -1.6261     0.4463  -3.643 0.000274 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 92.25 on 2606 degrees of freedom
Multiple R-squared:  0.01187,   Adjusted R-squared:  0.01111 
F-statistic: 15.65 on 2 and 2606 DF,  p-value: 1.75e-07

Patient Summary Loop

The patient_summary function loops through each unique patient, calculates their mean and standard deviation of glucose, and prints a summary. The output reveals a 227 mg/dL spread between the lowest (data-31, mean 105) and highest (data-70, mean 332) patients, all receiving insulin treatment. This motivates the clustering analysis: dosing alone does not explain the variation.

patient_summary <- function(df) {
  patients <- unique(df$patient)
  for (p in patients) {
    patient_data <- df |> filter(patient == p)
    mean_glucose <- mean(patient_data$value.x, na.rm = TRUE)
    sd_glucose   <- sd(patient_data$value.x,   na.rm = TRUE)
    print(paste("Patient:", p,
                "| Mean:", round(mean_glucose, 1),
                "| SD:",   round(sd_glucose,   1)))
  }
}

patient_summary(model_data_clean)
[1] "Patient: data-01 | Mean: 162.3 | SD: 70"
[1] "Patient: data-02 | Mean: 106 | SD: 60.8"
[1] "Patient: data-03 | Mean: 118.8 | SD: 47.3"
[1] "Patient: data-04 | Mean: 141.6 | SD: 65.9"
[1] "Patient: data-05 | Mean: 150.3 | SD: 65.3"
[1] "Patient: data-06 | Mean: 185.5 | SD: 34.6"
[1] "Patient: data-07 | Mean: 197 | SD: 14.1"
[1] "Patient: data-08 | Mean: 189.1 | SD: 76.2"
[1] "Patient: data-09 | Mean: 169.2 | SD: 94.9"
[1] "Patient: data-10 | Mean: 199.5 | SD: 54.4"
[1] "Patient: data-11 | Mean: 167 | SD: 95.5"
[1] "Patient: data-12 | Mean: 186.2 | SD: 108.6"
[1] "Patient: data-13 | Mean: 190 | SD: 104.8"
[1] "Patient: data-14 | Mean: 183.9 | SD: 97.5"
[1] "Patient: data-15 | Mean: 161.1 | SD: 103.1"
[1] "Patient: data-16 | Mean: 177.3 | SD: 106.4"
[1] "Patient: data-17 | Mean: 140.9 | SD: 76"
[1] "Patient: data-18 | Mean: 139.9 | SD: 83.4"
[1] "Patient: data-19 | Mean: 155.9 | SD: 90"
[1] "Patient: data-20 | Mean: 199 | SD: 104.3"
[1] "Patient: data-21 | Mean: 135 | SD: 14.1"
[1] "Patient: data-22 | Mean: 186.5 | SD: 102.3"
[1] "Patient: data-23 | Mean: 172.9 | SD: 100.8"
[1] "Patient: data-24 | Mean: 162.5 | SD: 87.6"
[1] "Patient: data-25 | Mean: 121.5 | SD: 64.7"
[1] "Patient: data-26 | Mean: 138.5 | SD: 81.3"
[1] "Patient: data-27 | Mean: 120.9 | SD: 33.5"
[1] "Patient: data-28 | Mean: 152.5 | SD: 55.4"
[1] "Patient: data-29 | Mean: 162.7 | SD: 35.1"
[1] "Patient: data-30 | Mean: 189.4 | SD: 53.4"
[1] "Patient: data-31 | Mean: 105 | SD: 5.7"
[1] "Patient: data-32 | Mean: 164.1 | SD: 73.9"
[1] "Patient: data-33 | Mean: 150.3 | SD: 72.4"
[1] "Patient: data-34 | Mean: 167.4 | SD: 57"
[1] "Patient: data-35 | Mean: 164.8 | SD: 66.8"
[1] "Patient: data-36 | Mean: 182.9 | SD: 78"
[1] "Patient: data-37 | Mean: 193.9 | SD: 83.7"
[1] "Patient: data-38 | Mean: 172.4 | SD: 85.1"
[1] "Patient: data-39 | Mean: 180.9 | SD: 79.2"
[1] "Patient: data-40 | Mean: 111.6 | SD: 65.8"
[1] "Patient: data-41 | Mean: 182 | SD: 32.5"
[1] "Patient: data-42 | Mean: 109.5 | SD: 74.2"
[1] "Patient: data-43 | Mean: 151.4 | SD: 97"
[1] "Patient: data-44 | Mean: 159.7 | SD: 80"
[1] "Patient: data-45 | Mean: 156.8 | SD: 79"
[1] "Patient: data-46 | Mean: 194.7 | SD: 68.4"
[1] "Patient: data-47 | Mean: 170.9 | SD: 83.6"
[1] "Patient: data-48 | Mean: 148 | SD: 77"
[1] "Patient: data-51 | Mean: 108.8 | SD: 58.9"
[1] "Patient: data-52 | Mean: 138.8 | SD: 94.9"
[1] "Patient: data-53 | Mean: 222 | SD: 24"
[1] "Patient: data-54 | Mean: 146 | SD: 75.7"
[1] "Patient: data-55 | Mean: 301 | SD: 28.3"
[1] "Patient: data-56 | Mean: 124.5 | SD: 91.2"
[1] "Patient: data-57 | Mean: 225.5 | SD: 113.5"
[1] "Patient: data-58 | Mean: 263.9 | SD: 90.8"
[1] "Patient: data-59 | Mean: 233.1 | SD: 95.2"
[1] "Patient: data-60 | Mean: 143.6 | SD: 70.6"
[1] "Patient: data-61 | Mean: 192.2 | SD: 107.6"
[1] "Patient: data-62 | Mean: 211 | SD: 109"
[1] "Patient: data-63 | Mean: 215.5 | SD: 113"
[1] "Patient: data-65 | Mean: 235.9 | SD: 87.1"
[1] "Patient: data-66 | Mean: 198.6 | SD: 76"
[1] "Patient: data-67 | Mean: 237.8 | SD: 87.8"
[1] "Patient: data-68 | Mean: 127 | SD: 9.9"
[1] "Patient: data-70 | Mean: 332 | SD: 31.1"

Data Modeling: Clustering

We applied PAM (Partitioning Around Medoids) clustering to group patients by behavioral patterns. Per the project rubric, the regression outcome variable (value.x, blood glucose) was removed before clustering. Clustering is performed on hours since insulin and measurement code only, so the groups reflect behavioral and timing patterns rather than glucose level itself.

# Remove regression outcome (value.x) before clustering
# Rubric requirement: do not include the dependent variable in clustering
cluster_data <- model_data_clean |>
  select(hours_since_insulin, code.x) |>
  drop_na()

cluster_data_scale <- scale(cluster_data)

# Estimate optimal number of clusters using NbClust
# Evaluates 30 indices with majority voting rule
set.seed(123)
number_cluster_estimate <- NbClust(
  cluster_data_scale,
  distance = "euclidean",
  min.nc   = 2,
  max.nc   = 10,
  method   = "kmeans"
)

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 4 proposed 2 as the best number of clusters 
* 2 proposed 3 as the best number of clusters 
* 8 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 3 proposed 6 as the best number of clusters 
* 1 proposed 8 as the best number of clusters 
* 4 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  4 
 
 
******************************************************************* 
number_cluster_estimate$Best.nc
                     KL       CH Hartigan     CCC   Scott Marriot  TrCovW
Number_clusters  4.0000   10.000     4.00 10.0000    4.00       4       3
Value_Index     26.5697 7770.833  4113.67 58.7839 4417.65 3826201 3865379
                  TraceW Friedman   Rubin Cindex     DB Silhouette   Duda
Number_clusters    4.000  10.0000  4.0000 8.0000 4.0000     6.0000 2.0000
Value_Index     1371.513  52.4552 -2.1732 0.0139 0.5451     0.8006 0.9431
                PseudoT2  Beale Ratkowsky     Ball PtBiserial Frey McClain
Number_clusters   2.0000 2.0000    4.0000   3.0000      5.000    1 10.0000
Value_Index     117.2392 0.0602    0.4577 907.8623      0.779   NA  0.1148
                  Dunn Hubert SDindex Dindex  SDbw
Number_clusters 2.0000      0  6.0000      0 6.000
Value_Index     0.0102      0  3.0425      0 0.417
table(number_cluster_estimate$Best.nc["Number_clusters",])

 0  1  2  3  4  5  6  8 10 
 2  1  4  2  8  1  3  1  4 

NbClust evaluated 30 statistical indices and used a majority voting rule to recommend the optimal number of clusters. We selected k = 3 because three clusters capture clinically meaningful distinctions that two clusters would collapse together.

set.seed(123)
glucose_clusters <- pam(cluster_data_scale, k = 3)

glucose_clusters$medoids
     hours_since_insulin     code.x
[1,]          -0.4062574 -0.1373936
[2,]           2.7169170  1.3418081
[3,]          -0.4062574  1.3418081
model_data_clustered <- model_data_clean |>
  mutate(cluster = glucose_clusters$clustering)

model_data_clustered |>
  group_by(cluster) |>
  summarize(
    mean_glucose = mean(value.x, na.rm = TRUE),
    mean_hours   = mean(hours_since_insulin, na.rm = TRUE),
    mean_code    = mean(code.x,  na.rm = TRUE),
    n = n()
  )
# A tibble: 3 × 5
  cluster mean_glucose mean_hours mean_code     n
    <int>        <dbl>      <dbl>     <dbl> <int>
1       1         180.      0.218      56.7  1877
2       2         141.      3.76       61.6   211
3       3         165.      0.152      64     521
plot(glucose_clusters)

model_data_clustered |>
  ggplot(aes(x = hours_since_insulin,
             y = value.x,
             color = as.factor(cluster))) +
  geom_point(alpha = 0.4) +
  scale_x_continuous(breaks = seq(0, 24, by = 2)) +
  labs(
    title    = "Three Glucose Control Profiles by Insulin Timing",
    subtitle = "Clusters reflect behavioral timing patterns, not glucose level",
    x        = "Hours Since Insulin Dose",
    y        = "Glucose Level (mg/dL)",
    color    = "Cluster"
  ) +
  theme_minimal()

The cluster summary shows three distinct behavioral groups with different mean glucose levels, mean hours since insulin, and mean measurement codes. Each cluster represents a different pattern of insulin timing and measurement context, which relates to but is not determined by glucose level.

Average Summary of variables in each cluster

library(dplyr)
cluster_average <- model_data_clustered %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~mean(.x, na.rm = TRUE)))

cluster_average
# A tibble: 3 × 14
  cluster patient date.x              glucose_time        code.x value.x type.x
    <int>   <dbl> <dttm>              <dttm>               <dbl>   <dbl>  <dbl>
1       1      NA 1990-12-14 23:13:12 1899-12-31 11:04:17   56.7    180.     NA
2       2      NA 1990-09-19 23:32:42 1899-12-31 18:46:40   61.6    141.     NA
3       3      NA 1991-01-27 08:39:36 1899-12-31 21:24:34   64      165.     NA
# ℹ 7 more variables: date.y <dttm>, insulin_time <dttm>, code.y <dbl>,
#   value.y <dbl>, type.y <dbl>, hours_since_insulin <dbl>,
#   glucose_category <dbl>

Interactive Table

The table below shows a summary of mean glucose, standard deviation, minimum, maximum, and number of readings for each patient. Use the search box to find a specific patient, or click any column header to sort. This allows exploration of which patients have the highest average glucose and most variable readings.

library(DT)

patient_table <- model_data_clean |>
  group_by(patient) |>
  summarize(
    mean_glucose = round(mean(value.x, na.rm = TRUE), 1),
    sd_glucose   = round(sd(value.x,   na.rm = TRUE), 1),
    min_glucose  = min(value.x, na.rm = TRUE),
    max_glucose  = max(value.x, na.rm = TRUE),
    n_readings   = n()
  ) |>
  arrange(desc(mean_glucose))

datatable(
  patient_table,
  colnames = c("Patient", "Mean Glucose", "SD", "Min", "Max", "Readings"),
  caption  = "Patient Glucose Summary — search, sort, or filter any column",
  options  = list(pageLength = 10, scrollX = TRUE)
)

Interactive Figures

Interactive Figure 1: Glucose by Hour Group

This interactive box plot shows the distribution of glucose readings grouped by hour since insulin dose. Hover over any box to see the median, quartiles, and range for that hour group. The plot allows exploration of how glucose variability changes across different time windows after dosing.

library(plotly)

model_data_clean2 <- model_data_clean |>
  mutate(hours_group = cut(hours_since_insulin,
                           breaks = seq(0, 12, 1),
                           include.lowest = TRUE))

plot_ly(
  data     = model_data_clean2,
  x        = ~hours_group,
  y        = ~value.x,
  type     = "box",
  boxpoints = "all",
  jitter   = 0.3,
  pointpos = -1.8,
  color    = ~hours_group
) |>
  layout(
    title  = "Glucose Distribution by Hour Since Insulin",
    xaxis  = list(title = "Hour Group"),
    yaxis  = list(title = "Glucose Level (mg/dL)"),
    showlegend = FALSE
  )

Interactive Figure 2: Glucose Trend Over Time Since Insulin

This interactive scatter plot shows individual glucose readings plotted against hours since insulin dose. Hover over any point to see the patient, glucose level, and exact hours since dose. The blue trend line confirms the overall downward relationship between time and glucose.

model_data_clean3 <- model_data_clean |>
  arrange(hours_since_insulin)

plot_ly(
  data = model_data_clean3,
  x    = ~hours_since_insulin,
  y    = ~value.x,
  type = "scatter",
  mode = "markers",
  color = ~glucose_category,
  text = ~paste("Patient:", patient,
                "<br>Glucose:", value.x, "mg/dL",
                "<br>Hours since dose:", round(hours_since_insulin, 2),
                "<br>Category:", glucose_category),
  hoverinfo = "text",
  marker = list(opacity = 0.5)
) |>
  layout(
    title  = "Glucose Trend by Hours Since Insulin",
    xaxis  = list(title = "Hours Since Insulin Dose"),
    yaxis  = list(title = "Glucose Level (mg/dL)")
  )

KSU Interactive Map

library(leaflet)

leaflet() |>
  addTiles() |>
  setView(lng = -84.5813, lat = 34.0390, zoom = 15) |>
  addMarkers(
    lng   = -84.5813,
    lat   = 34.0390,
    popup = "Kennesaw State University<br>DS7310 - Programming in R<br>Sweet Talk Project"
  )

Conclusion

This project examined whether insulin dosing timing and measurement context predict blood glucose levels, and whether patients naturally fall into distinct glucose control profiles.

Key findings:

  • Insulin timing is a statistically significant predictor of glucose. Each additional hour since a dose is associated with a 6.8 mg/dL decrease in glucose (p < 0.0001), confirming that insulin does its biological work over time.

  • Measurement event context adds significant explanatory power beyond timing alone. Adding measurement code to the model improved fit significantly (ANOVA F = 13.27, p = 0.0003), suggesting that pre-snack versus pre-breakfast readings reflect different glucose dynamics.

  • Patient-level variation is the dominant story. The 227 mg/dL spread between the lowest and highest average patients, all on insulin, shows that individual physiological and behavioral factors drive glucose control more than any single dosing variable.

  • Clustering identified three distinct behavioral profiles based on insulin timing and measurement context, each associated with different average glucose outcomes.

Limitations: The low R-squared values (under 1%) for the primary models reflect that glucose is a complex outcome influenced by meals, stress, physical activity, and individual physiology. The models identify real relationships but do not explain most of the variance, which is expected in clinical glucose data.

Acknowledgement

The findings presented in this project are exclusive to this course and were not presented in this or previous semesters and will not be presented in any other courses during this semester.


Team Contact Information