Managing blood glucose is complex and influenced by multiple factors such as insulin use, meals, and daily activity. Patients often experience significant glucose fluctuations throughout the day, making it difficult to understand how behaviors impact glucose levels over time. This project is motivated by the need to identify meaningful patterns between insulin use, behavior, and glucose response.
Research Questions
Does the time elapsed since an insulin dose predict blood glucose level at the next reading?
Does the type of measurement event add explanatory power beyond timing alone?
Do patients naturally cluster into distinct groups based on glucose control?
Is there meaningful variation in average glucose control across individual patients?
Outcome Variable: Blood glucose level (mg/dL)
Method:
Time-based glucose and insulin data are analyzed to examine how insulin dosing and behavioral patterns affect blood glucose levels
Descriptive statistics and visualizations are used to explore trends and variability in glucose measurements
Regression-based methods are applied to evaluate how well insulin dosing and behavioral variables explain and predict changes in glucose over time
The dataset contains 29,556 records across 70 patients, covering both glucose readings and insulin doses recorded between 1991 and 1993. Each record includes a patient ID, date, time, measurement code, value, and event type. The code column identifies what type of reading was taken, such as pre-breakfast glucose or regular insulin dose.
R_Project <-read_excel("C:/Users/Evelyn Morales/OneDrive - Kennesaw State University/Programming in R project/R_Project/R-Project.xlsx")glucose_data <- R_Projectcolnames(glucose_data) <-c("patient", "date", "time", "code", "value", "type")str(glucose_data)
To connect each glucose reading to the insulin dose that caused it, we separated glucose and insulin records, then used a left join followed by a filter to keep only insulin doses that occurred before each glucose reading. The slice_max function then selected the single most recent dose before each reading. Finally, mutate calculated the hours elapsed between that dose and the glucose reading. This produced 2,609 properly matched pairs.
# Separate glucose and insulin recordsglucose_only <- glucose_data |> dplyr::filter(stringr::str_detect(tolower(type), "glucose")) |> dplyr::rename(glucose_time = time)insulin_only <- glucose_data |> dplyr::filter(stringr::str_detect(tolower(type), "insulin")) |> dplyr::rename(insulin_time = time)# Join explorationleft_joined <- glucose_only |>left_join(insulin_only, by ="patient", relationship ="many-to-many")inner_joined <- glucose_only |>inner_join(insulin_only, by ="patient", relationship ="many-to-many")full_joined <- glucose_only |>full_join(insulin_only, by ="patient", relationship ="many-to-many")right_joined <- glucose_only |>right_join(insulin_only, by ="patient", relationship ="many-to-many")nrow(left_joined)
[1] 2218343
nrow(inner_joined)
[1] 2217934
nrow(full_joined)
[1] 2218343
nrow(right_joined)
[1] 2217934
# Build analysis dataset: link each glucose reading to its# most recent prior insulin dose, then calculate time elapsedmodel_data <- glucose_only |>left_join(insulin_only, by ="patient",relationship ="many-to-many") |>filter(insulin_time <= glucose_time) |>group_by(patient, glucose_time) |>slice_max(order_by = insulin_time, n =1, with_ties =FALSE) |>ungroup() |>mutate(hours_since_insulin =as.numeric(difftime(glucose_time, insulin_time, units ="hours") ) )# Remove missing valuesmodel_data_clean <- model_data |>drop_na()# Classify each glucose reading by clinical threshold using case_whenmodel_data_clean <- model_data_clean |>mutate(glucose_category =case_when( value.x <70~"Low (<70)", value.x <=130~"Normal (70-130)", value.x <=180~"High (130-180)",TRUE~"Very High (>180)" ) )nrow(model_data_clean)
[1] 2609
Results
Visualization
Plot 1: Glucose Levels vs Hours Since Insulin
This scatter plot shows blood glucose level on the y-axis and hours since the last insulin dose on the x-axis. The trend line shows a clear downward pattern, confirming that glucose decreases as more time passes after a dose. This is the biological mechanism of insulin at work in the data.
model_data_clean |>ggplot(aes(x = hours_since_insulin, y = value.x)) +geom_point(alpha =0.3) +geom_smooth(se =FALSE) +scale_x_continuous(breaks =seq(0, 24, by =2)) +labs(title ="Glucose Levels vs Hours Since Insulin Dose",subtitle ="Outcome: blood glucose (mg/dL) | Predictor: hours since last dose",x ="Hours Since Insulin Dose",y ="Glucose Level (mg/dL)" ) +theme_minimal()
Plot 2: Glucose Levels by Measurement Event
This boxplot compares glucose distributions across measurement event types. Snack-time readings show the highest median glucose around 220-240 mg/dL, approximately 30-50 mg/dL higher than pre-breakfast levels. General readings display the widest variability, with outliers reaching 350-400+ mg/dL, indicating that unscheduled readings capture the most extreme glucose fluctuations.
glucose_only |>ggplot(aes(x = type, y = value, fill = type)) +geom_boxplot() +labs(title ="Glucose Levels by Measurement Event",subtitle ="Snack-time readings tend to be highest; general readings most variable",x ="Measurement Event",y ="Glucose Level (mg/dL)" ) +theme_minimal() +theme(legend.position ="none",axis.text.x =element_text(angle =45, hjust =1) )
Plot 3: Average Glucose Level by Patient
This bar chart displays mean glucose for each of the 70 patients, ordered from lowest to highest. The color gradient from blue to red highlights severity. The spread from 105 mg/dL (data-31) to 332 mg/dL (data-70) represents a 227 mg/dL difference across patients all on insulin, showing that individual factors beyond dosing drive glucose control substantially.
model_data_clean |>group_by(patient) |>summarize(mean_glucose =mean(value.x, na.rm =TRUE)) |>ggplot(aes(x =reorder(patient, mean_glucose),y = mean_glucose,fill = mean_glucose)) +geom_bar(stat ="identity") +coord_flip() +scale_fill_gradient(low ="steelblue", high ="red") +labs(title ="Average Glucose Level by Patient",subtitle ="Range: 105 to 332 mg/dL a 227 mg/dL spread across patients on insulin",x ="Patient",y ="Mean Glucose (mg/dL)",fill ="Mean Glucose" ) +theme_minimal() +theme(axis.text.y =element_text(size =5))
Plot 4: Distribution of Glucose Levels
This histogram shows the overall distribution of glucose readings. The distribution is right-skewed, with the highest concentration of readings falling between 50 and 150 mg/dL. The long right tail indicates a meaningful subset of readings above 300 mg/dL, representing patients with poor glucose control. Most patients are managing reasonably, but high outliers exist.
model_data_clean |>ggplot(aes(x = value.x)) +geom_histogram(bins =30) +scale_x_continuous(breaks =seq(0, 400, by =50)) +labs(title ="Distribution of Glucose Levels",subtitle ="Right-skewed: most patients controlled, but high outliers exist",x ="Glucose Level (mg/dL)",y ="Frequency" ) +theme_minimal()
Plot 5: Glucose Level by Measurement Code
This scatter plot shows individual glucose readings grouped by measurement code. Code 64 (pre-snack) and codes 48 and 57 (unspecified readings) show the highest glucose clustering, often reaching 250-300 mg/dL. Code 58 (pre-breakfast) shows lower clustering around 120-170 mg/dL, suggesting that fasting overnight brings glucose down before morning dosing.
Plot 6: Glucose Distribution with Density Gradient
This histogram uses a color gradient from light blue to dark blue to show where readings are most dense. The darkest bars appear between 50 and 150 mg/dL, confirming that is where the majority of readings concentrate. The lighter bars on the right tail show that high glucose readings above 300 mg/dL are less frequent but present.
model_data_clean |>ggplot(aes(x = value.x, fill =after_stat(count))) +geom_histogram(bins =30, color ="black") +scale_fill_gradient(low ="lightblue", high ="darkblue") +scale_x_continuous(breaks =seq(0, 400, by =50)) +labs(title ="Distribution of Glucose Levels",x ="Glucose Level (mg/dL)",y ="Frequency" ) +theme_minimal()
Plot 7: Glucose Range by Measurement Code
This range plot shows the minimum to maximum glucose values for each measurement code as a horizontal line, with the mean shown as a dot. All codes span a wide range, but the dot positions reveal that mean glucose is relatively consistent across codes at around 160-180 mg/dL. The wide lines indicate high within-code variability regardless of measurement context.
model_data_clean |>group_by(code.x) |>summarize(min_glucose =min(value.x, na.rm =TRUE),mean_glucose =mean(value.x, na.rm =TRUE),max_glucose =max(value.x, na.rm =TRUE) ) |>ggplot() +geom_segment(aes(x = min_glucose,xend = max_glucose,y =as.factor(code.x),yend =as.factor(code.x) )) +geom_point(aes(x = mean_glucose, y =as.factor(code.x))) +labs(title ="Glucose Range by Measurement Code",subtitle ="Dot = mean | Line = min to max",x ="Glucose Level (mg/dL)",y ="Measurement Code" ) +theme_minimal()
Plot 8: Average Glucose by Event Type
This bar chart shows mean glucose grouped by event type. Pre-breakfast readings show the highest average around 180 mg/dL, while snack-time readings average around 160 mg/dL. The relatively small differences across event types suggest that timing of the measurement matters less than the individual patient’s overall glucose control pattern.
model_data_clean |>group_by(type.x) |>summarize(mean_glucose =mean(value.x, na.rm =TRUE)) |>ggplot(aes(x = type.x, y = mean_glucose, fill = mean_glucose)) +geom_bar(stat ="identity") +labs(title ="Average Glucose by Event Type",x ="Event Type",y ="Average Glucose (mg/dL)" ) +theme_minimal()
Plot 9: Time Gap Between Insulin Dose and Glucose Reading
This segment plot shows the time gap between each insulin dose and the subsequent glucose reading for 50 events. Each line represents one event, colored from blue (low glucose) to red (high glucose). Events with shorter gaps tend to show more red, meaning glucose is still elevated close to the dose. Longer gaps trend bluer, consistent with insulin lowering glucose over time.
model_data_clean |>filter(hours_since_insulin >0.5) |>slice_head(n =50) |>mutate(event_id =row_number()) |>ggplot() +geom_segment(aes(x =0,xend = hours_since_insulin,y = event_id,yend = event_id,color = value.x ), linewidth =1) +scale_color_gradient(low ="steelblue", high ="red") +labs(title ="Time Gap Between Insulin Dose and Glucose Reading",subtitle ="Color = glucose level: red = high, blue = low",x ="Hours Since Insulin Dose",y ="Event",color ="Glucose (mg/dL)" ) +theme_minimal()
Plot 10: Glucose Ranges by Meal Type
This range plot shows the minimum to maximum glucose values for each meal type, with the red dot marking the mean. All meal types show wide ranges spanning from near-normal to very high readings. The mean dots are clustered around 160-180 mg/dL across all types, but the wide lines show that individual readings vary enormously regardless of meal context.
glucose_only |>group_by(type) |>summarize(mean_glucose =mean(value, na.rm =TRUE),min_glucose =min(value, na.rm =TRUE),max_glucose =max(value, na.rm =TRUE) ) |>ggplot() +geom_segment(aes(x = min_glucose,xend = max_glucose,y = type,yend = type ), color ="gray50", linewidth =1) +geom_point(aes(x = mean_glucose, y = type),color ="red", size =3) +labs(title ="Glucose Ranges by Meal Type",subtitle ="Red dot = mean | Line = min to max range",x ="Glucose Level (mg/dL)",y ="Meal Type" ) +theme_minimal()
Plot 11: Glucose Readings by Clinical Category (case_when)
Using case_when, we classified each reading into four clinical categories: Low (below 70), Normal (70-130), High (130-180), and Very High (above 180). The count labels on each bar show the exact number of readings per category. The majority of readings fall in the Normal and Very High categories, revealing a bimodal pattern where patients are either well-controlled or significantly elevated.
model_data_clean |>count(glucose_category) |>ggplot(aes(x =reorder(glucose_category, n),y = n,fill = glucose_category)) +geom_bar(stat ="identity") +geom_text(aes(label = n), hjust =-0.2, size =3.5) +coord_flip() +labs(title ="Glucose Readings by Clinical Category",subtitle ="Classified using case_when with clinical thresholds",x ="Glucose Category",y ="Count of Readings" ) +theme_minimal() +theme(legend.position ="none")
Plot 12: Combined Figure (patchwork)
These two histograms are combined using patchwork to allow direct comparison of the outcome variable (glucose) and the key predictor (hours since insulin dose). Glucose is right-skewed with readings concentrated below 150 mg/dL. Hours since insulin is heavily concentrated near zero, meaning most glucose readings are taken shortly after a dose, which is an important context for interpreting the regression results.
p_glucose <- model_data_clean |>ggplot(aes(x = value.x)) +geom_histogram(bins =30, fill ="steelblue", color ="white") +scale_x_continuous(breaks =seq(0, 400, by =100)) +labs(title ="Glucose Distribution",x ="Glucose Level (mg/dL)",y ="Count" ) +theme_minimal()p_hours <- model_data_clean |>ggplot(aes(x = hours_since_insulin)) +geom_histogram(bins =30, fill ="tomato", color ="white") +scale_x_continuous(breaks =seq(0, 24, by =4)) +labs(title ="Hours Since Insulin",x ="Hours",y ="Count" ) +theme_minimal()p_glucose + p_hours +plot_annotation(title ="Distribution of Outcome and Key Predictor",subtitle ="Left: glucose is right-skewed | Right: most readings taken shortly after dosing" )
Regression Modeling
Our outcome variable is blood glucose level (value.x, mg/dL). We built two models to test whether insulin timing and measurement context predict glucose. Model 1 uses hours since insulin as the sole predictor. Model 2 adds measurement code to test whether event context explains additional variance. We then used ANOVA to formally compare them.
# Model 1: Does time since insulin predict glucose?fit1 <-lm(value.x ~ hours_since_insulin, data = model_data_clean)summary(fit1)
Call:
lm(formula = value.x ~ hours_since_insulin, data = model_data_clean)
Residuals:
Min 1Q Median 3Q Max
-156.93 -79.25 -9.37 63.73 327.70
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.269 1.975 89.763 < 2e-16 ***
hours_since_insulin -6.812 1.608 -4.236 2.35e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 92.47 on 2607 degrees of freedom
Multiple R-squared: 0.006837, Adjusted R-squared: 0.006456
F-statistic: 17.95 on 1 and 2607 DF, p-value: 2.352e-05
Model 1 shows that each additional hour since the last insulin dose is associated with a 6.81 mg/dL decrease in glucose (p < 0.0001). Although the Adjusted R-squared is 0.65%, this is expected since glucose is influenced by many factors beyond timing alone. The relationship is statistically real even if modest in magnitude.
# Model 2: Add measurement code as a second predictorfit2 <-lm(value.x ~ hours_since_insulin + code.x, data = model_data_clean)summary(fit2)
Call:
lm(formula = value.x ~ hours_since_insulin + code.x, data = model_data_clean)
Residuals:
Min 1Q Median 3Q Max
-159.29 -78.96 -8.98 64.41 325.14
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 272.2980 26.1573 10.410 < 2e-16 ***
hours_since_insulin -6.4236 1.6078 -3.995 6.64e-05 ***
code.x -1.6261 0.4463 -3.643 0.000274 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 92.25 on 2606 degrees of freedom
Multiple R-squared: 0.01187, Adjusted R-squared: 0.01111
F-statistic: 15.65 on 2 and 2606 DF, p-value: 1.75e-07
Model 2 adds measurement code (code.x) and shows it is a significant predictor (beta = -1.63, p = 0.0003). This means the context of the reading, such as pre-breakfast versus pre-snack, explains additional variance in glucose beyond timing alone.
# ANOVA: Is Model 2 significantly better than Model 1?anova(fit1, fit2)
Analysis of Variance Table
Model 1: value.x ~ hours_since_insulin
Model 2: value.x ~ hours_since_insulin + code.x
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2607 22290363
2 2606 22177401 1 112962 13.274 0.0002744 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA comparison confirms Model 2 is significantly better than Model 1 (F = 13.27, p = 0.0003). Adding measurement context meaningfully improves the model.
Day-Level Model: Does Morning Dosing Predict Dinner Glucose?
g_breakfast <- R_Project[R_Project$Type =="Glucose (Pre-Breakfast)",c("Date", "Value")]g_breakfast$Date <-as.Date(g_breakfast$Date)names(g_breakfast)[2] <-"glucose_breakfast"g_dinner <- R_Project[R_Project$Type =="(Pre-Dinner)(or Pre-Supper)",c("Date", "Value")]g_dinner$Date <-as.Date(g_dinner$Date)names(g_dinner)[2] <-"glucose_dinner"i_regular <- R_Project[R_Project$Type =="Insulin (Regular)",c("Date", "Value")]i_regular$Date <-as.Date(i_regular$Date)names(i_regular)[2] <-"insulin_regular"i_nph <- R_Project[R_Project$Type =="Insulin (NPH)",c("Date", "Value")]i_nph$Date <-as.Date(i_nph$Date)names(i_nph)[2] <-"insulin_nph"merged <-merge(g_breakfast, g_dinner, by ="Date", all =FALSE)merged <-merge(merged, i_regular, by ="Date", all =FALSE)merged <-merge(merged, i_nph, by ="Date", all =FALSE)glucose_insulin <-na.omit(merged)model <-lm(glucose_dinner ~ glucose_breakfast + insulin_regular + insulin_nph,data = glucose_insulin)summary(model)
Call:
lm(formula = glucose_dinner ~ glucose_breakfast + insulin_regular +
insulin_nph, data = glucose_insulin)
Residuals:
Min 1Q Median 3Q Max
-220.74 -62.64 -11.32 50.88 300.58
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.487e+02 1.049e-01 1416.817 <2e-16 ***
glucose_breakfast 6.329e-04 4.392e-04 1.441 0.15
insulin_regular 4.894e-01 9.598e-03 50.993 <2e-16 ***
insulin_nph 7.664e-02 3.596e-03 21.316 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 77.24 on 4622168 degrees of freedom
Multiple R-squared: 0.0006992, Adjusted R-squared: 0.0006985
F-statistic: 1078 on 3 and 4622168 DF, p-value: < 2.2e-16
This model predicts dinner glucose from breakfast glucose and morning insulin doses. Regular insulin (p < 0.0001) and NPH insulin (p < 0.0001) are both significant predictors of dinner glucose. Breakfast glucose alone is not significant (p = 0.15), suggesting that insulin type and dose matter more than morning glucose level in predicting evening outcomes.
Functional and Iterative Programming
Reusable Model Function
The run_model function takes a dataframe and a formula, fits a linear model, and returns the summary. This allows us to test multiple model specifications efficiently without repeating code.
run_model <-function(df, formula) { fit <-lm(formula, data = df)summary(fit)}run_model(model_data_clean, value.x ~ hours_since_insulin)
Call:
lm(formula = formula, data = df)
Residuals:
Min 1Q Median 3Q Max
-156.93 -79.25 -9.37 63.73 327.70
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.269 1.975 89.763 < 2e-16 ***
hours_since_insulin -6.812 1.608 -4.236 2.35e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 92.47 on 2607 degrees of freedom
Multiple R-squared: 0.006837, Adjusted R-squared: 0.006456
F-statistic: 17.95 on 1 and 2607 DF, p-value: 2.352e-05
Call:
lm(formula = formula, data = df)
Residuals:
Min 1Q Median 3Q Max
-159.29 -78.96 -8.98 64.41 325.14
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 272.2980 26.1573 10.410 < 2e-16 ***
hours_since_insulin -6.4236 1.6078 -3.995 6.64e-05 ***
code.x -1.6261 0.4463 -3.643 0.000274 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 92.25 on 2606 degrees of freedom
Multiple R-squared: 0.01187, Adjusted R-squared: 0.01111
F-statistic: 15.65 on 2 and 2606 DF, p-value: 1.75e-07
Patient Summary Loop
The patient_summary function loops through each unique patient, calculates their mean and standard deviation of glucose, and prints a summary. The output reveals a 227 mg/dL spread between the lowest (data-31, mean 105) and highest (data-70, mean 332) patients, all receiving insulin treatment. This motivates the clustering analysis: dosing alone does not explain the variation.
We applied PAM (Partitioning Around Medoids) clustering to group patients by behavioral patterns. Per the project rubric, the regression outcome variable (value.x, blood glucose) was removed before clustering. Clustering is performed on hours since insulin and measurement code only, so the groups reflect behavioral and timing patterns rather than glucose level itself.
# Remove regression outcome (value.x) before clustering# Rubric requirement: do not include the dependent variable in clusteringcluster_data <- model_data_clean |>select(hours_since_insulin, code.x) |>drop_na()cluster_data_scale <-scale(cluster_data)# Estimate optimal number of clusters using NbClust# Evaluates 30 indices with majority voting ruleset.seed(123)number_cluster_estimate <-NbClust( cluster_data_scale,distance ="euclidean",min.nc =2,max.nc =10,method ="kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 4 proposed 2 as the best number of clusters
* 2 proposed 3 as the best number of clusters
* 8 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 3 proposed 6 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 4 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 4
*******************************************************************
NbClust evaluated 30 statistical indices and used a majority voting rule to recommend the optimal number of clusters. We selected k = 3 because three clusters capture clinically meaningful distinctions that two clusters would collapse together.
set.seed(123)glucose_clusters <-pam(cluster_data_scale, k =3)glucose_clusters$medoids
model_data_clustered |>ggplot(aes(x = hours_since_insulin,y = value.x,color =as.factor(cluster))) +geom_point(alpha =0.4) +scale_x_continuous(breaks =seq(0, 24, by =2)) +labs(title ="Three Glucose Control Profiles by Insulin Timing",subtitle ="Clusters reflect behavioral timing patterns, not glucose level",x ="Hours Since Insulin Dose",y ="Glucose Level (mg/dL)",color ="Cluster" ) +theme_minimal()
The cluster summary shows three distinct behavioral groups with different mean glucose levels, mean hours since insulin, and mean measurement codes. Each cluster represents a different pattern of insulin timing and measurement context, which relates to but is not determined by glucose level.
# A tibble: 3 × 14
cluster patient date.x glucose_time code.x value.x type.x
<int> <dbl> <dttm> <dttm> <dbl> <dbl> <dbl>
1 1 NA 1990-12-14 23:13:12 1899-12-31 11:04:17 56.7 180. NA
2 2 NA 1990-09-19 23:32:42 1899-12-31 18:46:40 61.6 141. NA
3 3 NA 1991-01-27 08:39:36 1899-12-31 21:24:34 64 165. NA
# ℹ 7 more variables: date.y <dttm>, insulin_time <dttm>, code.y <dbl>,
# value.y <dbl>, type.y <dbl>, hours_since_insulin <dbl>,
# glucose_category <dbl>
Interactive Table
The table below shows a summary of mean glucose, standard deviation, minimum, maximum, and number of readings for each patient. Use the search box to find a specific patient, or click any column header to sort. This allows exploration of which patients have the highest average glucose and most variable readings.
This interactive box plot shows the distribution of glucose readings grouped by hour since insulin dose. Hover over any box to see the median, quartiles, and range for that hour group. The plot allows exploration of how glucose variability changes across different time windows after dosing.
Interactive Figure 2: Glucose Trend Over Time Since Insulin
This interactive scatter plot shows individual glucose readings plotted against hours since insulin dose. Hover over any point to see the patient, glucose level, and exact hours since dose. The blue trend line confirms the overall downward relationship between time and glucose.
model_data_clean3 <- model_data_clean |>arrange(hours_since_insulin)plot_ly(data = model_data_clean3,x =~hours_since_insulin,y =~value.x,type ="scatter",mode ="markers",color =~glucose_category,text =~paste("Patient:", patient,"<br>Glucose:", value.x, "mg/dL","<br>Hours since dose:", round(hours_since_insulin, 2),"<br>Category:", glucose_category),hoverinfo ="text",marker =list(opacity =0.5)) |>layout(title ="Glucose Trend by Hours Since Insulin",xaxis =list(title ="Hours Since Insulin Dose"),yaxis =list(title ="Glucose Level (mg/dL)") )
KSU Interactive Map
library(leaflet)leaflet() |>addTiles() |>setView(lng =-84.5813, lat =34.0390, zoom =15) |>addMarkers(lng =-84.5813,lat =34.0390,popup ="Kennesaw State University<br>DS7310 - Programming in R<br>Sweet Talk Project" )
Conclusion
This project examined whether insulin dosing timing and measurement context predict blood glucose levels, and whether patients naturally fall into distinct glucose control profiles.
Key findings:
Insulin timing is a statistically significant predictor of glucose. Each additional hour since a dose is associated with a 6.8 mg/dL decrease in glucose (p < 0.0001), confirming that insulin does its biological work over time.
Measurement event context adds significant explanatory power beyond timing alone. Adding measurement code to the model improved fit significantly (ANOVA F = 13.27, p = 0.0003), suggesting that pre-snack versus pre-breakfast readings reflect different glucose dynamics.
Patient-level variation is the dominant story. The 227 mg/dL spread between the lowest and highest average patients, all on insulin, shows that individual physiological and behavioral factors drive glucose control more than any single dosing variable.
Clustering identified three distinct behavioral profiles based on insulin timing and measurement context, each associated with different average glucose outcomes.
Limitations: The low R-squared values (under 1%) for the primary models reflect that glucose is a complex outcome influenced by meals, stress, physical activity, and individual physiology. The models identify real relationships but do not explain most of the variance, which is expected in clinical glucose data.
Acknowledgement
The findings presented in this project are exclusive to this course and were not presented in this or previous semesters and will not be presented in any other courses during this semester.