Diabetes is “a condition that happens when your blood sugar (glucose) is too high. It develops when your pancreas doesn’t make enough insulin or any at all, or when your body isn’t responding to the effects of insulin properly” (Cleveland Clinic, 2023). This condition has always been a frequent topic of conversation in my family. My uncle has had Type 1 diabetes for his entire life, and watching the sharp rises and drops in his blood sugar has shown me how constant and demanding diabetes management can be. He has to know his own body incredibly well and make quick decisions throughout the day to keep himself stable. A few years ago, my dog was also diagnosed with diabetes, and surprisingly, our family went through a similar learning curve. We began noticing changes in his behavior when his blood sugar shifted, and over time we learned how to respond and support him. Seeing both of them navigate these highs and lows showed me just how complicated diabetes can be.
As a result, I’ve always been curious about the different stages of diabetes and why its effects can differ so dramatically from one individual to another. This project gave me an opportunity to explore those questions using data. The dataset I analyzed includes 100,000 synthetic patient profiles, each containing demographic characteristics, lifestyle habits, family history, and clinical measurements. These are variables that, according to my research, are well-established indicators of diabetes risk. Although the data is generated for privacy and does not represent real patients, it is built using statistical patterns inspired by real-world medical research, so it is intended to resemble realistic health patterns.
Working with this dataset allowed me to look beyond my personal experiences and understand diabetes from a broader, population-level perspective. I explored how different lifestyle choices relate to clinical risk, how demographic trends correlate with diabetes outcomes, and how family history connects to one’s diagnosis. This project gave me the chance to connect my personal curiosity about diabetes with an analytical exploration, helping me better understand a condition that has reached my family in multiple ways.
According to the CDC, “some diabetes risk factors can be managed through behavior change, such as being more physically active”; however, “[o]ther risk factors can’t be changed, such as family history and age.” Therefore, understanding who is represented in the dataset and exploring these “unchangeable” factors is essential before exploring clinical trends or manageable risk patterns. The demographics in this synthetic population include age, gender, ethnicity, education level, income level, and employment status. These variables form the foundation for many public health studies, as demographic differences can shape not only the likelihood of developing diabetes but also the way it is managed. In this project, these demographic features help contextualize the patterns seen in the clinical variables and allow me to explore how diabetes risk can differ across groups.
To better understand how diabetes prevalence differs across demographic groups, I began by examining the relationship between age, diabetes stage, and either gender or ethnicity. Because “the effect of age is an important and underappreciated risk factor” (National Library of Medicine), this visualization groups patients into five-year age bins and compares diabetes stages across demographic subgroups. I created an interactive Shiny app that allows the user to facet the chart by either gender or ethnicity, making it possible to explore population-level differences dynamically.
The stacked bar chart displays the number of patients within each age bin, separated by diabetes stage (Type 1, Type 2, Gestational, Pre-Diabetes, or No Diabetes). By allowing the user to facet the plot by either gender or ethnicity, the app highlights how diabetes patterns shift across different subpopulations. While many combinations of demographic groups can be explored, several notable patterns stand out. For example, among most ethnicities, the highest number of Type 1 diabetes diagnoses occurs in the 35–39 age group; however, this pattern differs for Asian patients, who show the highest concentration of Type 1 cases in the 20–24 age range and the lowest in the 35–39 group. Another consistent trend across all genders and ethnicities is a notable spike in Type 2 diabetes and Pre-Diabetes diagnoses in the 50–54 age range. This interactive view helps illustrate how demographic characteristics influence diabetes patterns and provides an accessible way to compare trends across groups.
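The counts behind a chart like this can be sketched in a few lines. The example below is a minimal Python illustration, not the code used in the app: the function names (`age_bin`, `count_by_bin_and_stage`) and the sample records are hypothetical, and it assumes bins aligned to multiples of five (20–24, 35–39, 50–54), which matches the age ranges discussed above.

```python
from collections import Counter

def age_bin(age: int, width: int = 5) -> str:
    """Label the bin containing `age`, e.g. 37 -> '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def count_by_bin_and_stage(patients):
    """Count patients per (age bin, diabetes stage) pair."""
    return Counter((age_bin(p["age"]), p["stage"]) for p in patients)

# Hypothetical records, invented for illustration only.
patients = [
    {"age": 37, "stage": "Type 1"},
    {"age": 35, "stage": "Type 1"},
    {"age": 52, "stage": "Type 2"},
    {"age": 54, "stage": "Pre-Diabetes"},
]
counts = count_by_bin_and_stage(patients)
```

Each key in `counts` corresponds to one colored segment of a stacked bar; faceting by gender or ethnicity would amount to adding that field to the key.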
Going beyond age and ethnicity, another important aspect of demographics is socioeconomic status and its influence on diabetes outcomes. To explore this, I created an alluvial plot illustrating how education level, employment status, and income level interact, and how these combined socioeconomic pathways relate to the presence (or absence) of diagnosed diabetes. Each flow in the diagram represents individuals moving from education to employment and then to income, with colors indicating diagnosis status. However, as shown in the plot below, the flows appeared fairly evenly distributed across all categories. Instead of revealing clear socioeconomic patterns, the visualization suggested that diabetes prevalence was relatively uniform across the variables. This evenness is likely a result of the synthetic nature of the dataset. As a result, the plot raised the question of whether these characteristics truly lacked influence on diabetes status in the data or whether the alluvial plot simply was not the best tool for detecting subtler differences.
| term | estimate | p.value |
|---|---|---|
| (Intercept) | 0.4465 | 0.0000 |
| education_levelHighschool | -0.0008 | 0.9564 |
| education_levelNo formal | 0.0205 | 0.5082 |
| education_levelPostgraduate | -0.0260 | 0.1972 |
| income_levelLow | -0.0289 | 0.3945 |
| income_levelLower-Middle | -0.0152 | 0.6361 |
| income_levelMiddle | -0.0443 | 0.1576 |
| income_levelUpper-Middle | -0.0387 | 0.2390 |
| employment_statusRetired | -0.0238 | 0.1461 |
| employment_statusStudent | -0.0195 | 0.4823 |
| employment_statusUnemployed | 0.0007 | 0.9726 |
To evaluate this, I ran a logistic regression model using education level, income level, and employment status as predictors of diagnosed diabetes. This allowed me to test whether any of these socioeconomic variables had a statistically significant association with diabetes diagnosis beyond what could be seen visually. The regression results were consistent with what the alluvial plot suggested: none of the socioeconomic variables had a statistically significant association with diabetes diagnosis. Every coefficient had a p-value above 0.05 (the smallest was about 0.15), so the model detected no reliable difference in diabetes prevalence across education, income, or employment groups. This supports the idea that the synthetic dataset was generated without embedding strong socioeconomic disparities.
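To give a sense of how small these estimates are, the coefficients can be converted into probabilities and odds ratios. The sketch below is illustrative only. It assumes the estimates in the table are on the log-odds scale, which is the usual default for logistic regression output, and the reference categories are not stated in the table, so "baseline" here simply means whichever group the intercept describes.

```python
import math

def inv_logit(log_odds: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Estimates copied from the coefficient table.
intercept = 0.4465
middle_income = -0.0443  # income_levelMiddle

# Predicted probability of a diagnosis for the baseline group.
baseline_prob = inv_logit(intercept)

# Same group, but with income level switched to "Middle".
middle_prob = inv_logit(intercept + middle_income)

# Multiplicative change in the odds associated with "Middle" income.
odds_ratio = math.exp(middle_income)
```

Under these assumptions the baseline probability is roughly 0.61 and the odds ratio for middle income is roughly 0.96, a shift of about one percentage point, which fits the picture of near-uniform prevalence across socioeconomic groups.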
After exploring demographic patterns such as age, gender, ethnicity, and socioeconomic status, I turned to another important dimension of diabetes risk: lifestyle behaviors. Habits such as smoking, alcohol consumption, physical activity, diet quality, sleep duration, and screen time can potentially influence diabetes diagnoses and management. Understanding how these behaviors vary across individuals with different diabetes stages can provide insight into potential risk factors or even protective behaviors.
In order to examine these lifestyle patterns interactively, I developed a Shiny app with two main features. The first is a radar plot, which allows users to visually compare multiple lifestyle variables across selected diabetes stages and smoking categories. Users can select which variables to include and choose whether to aggregate the data by mean or median. The second feature is a summary table, which provides numerical summaries for the same variables under the selected filters.
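The aggregation step behind both the radar plot and the summary table can be sketched as follows. This is a minimal Python illustration rather than the app's actual code; the function name `summarize`, the field names, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

AGGREGATORS = {"mean": mean, "median": median}

def summarize(records, group_key, variables, how="mean"):
    """Aggregate each variable within each group using the mean or the median."""
    agg = AGGREGATORS[how]
    grouped = defaultdict(list)
    for record in records:
        grouped[record[group_key]].append(record)
    return {
        group: {v: agg([row[v] for row in rows]) for v in variables}
        for group, rows in grouped.items()
    }

# Hypothetical records, invented for illustration only.
records = [
    {"stage": "No Diabetes", "activity": 150, "screen": 4.0},
    {"stage": "No Diabetes", "activity": 90, "screen": 6.0},
    {"stage": "No Diabetes", "activity": 300, "screen": 5.0},
    {"stage": "Type 2", "activity": 60, "screen": 8.0},
    {"stage": "Type 2", "activity": 80, "screen": 7.0},
]
by_mean = summarize(records, "stage", ["activity", "screen"], how="mean")
by_median = summarize(records, "stage", ["activity", "screen"], how="median")
```

Offering both options matters when a variable is skewed: in the sample above, one very active person pulls the No Diabetes mean for `activity` up to 180, while the median stays at 150.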
The patterns revealed by the app highlight several important trends. Individuals with No Diabetes generally exhibit healthier behaviors, including noticeably higher physical activity, better diet scores, and lower screen time compared to those with Type 2 Diabetes. Those with Prediabetes fall in between, with more activity than Type 2 patients but less than individuals without diabetes. According to the American Diabetes Association, “regular physical activity is an important part of managing diabetes or dealing with prediabetes,” and one possible reading of these patterns is that people with prediabetes are actively engaging in healthier behaviors to manage or slow disease progression. Alcohol consumption and sleep duration remain relatively consistent across groups, though current and former smokers tend to consume slightly more alcohol than those who have never smoked. Overall, the Shiny app provides a clear, interactive way to visualize these relationships and compare lifestyle behaviors across different subpopulations, reinforcing the importance of daily habits in diabetes risk and management.
Another significant aspect of understanding diabetes risk is an individual’s medical history, particularly the history of chronic conditions within their family. Family history plays a major role in shaping someone’s susceptibility, “for both cardiovascular disease (CVD) and diabetes” (National Library of Medicine). Variables such as familial history of diabetes, hypertension history, and cardiovascular history help capture these risks. By looking at these indicators, we can explore how common chronic conditions cluster together and whether people with diabetes are more likely to come from families with multiple overlapping health issues. This section shifts the focus from behavior-based factors to biological risk patterns.
In order to visualize the overlap among these three chronic conditions, I created two side-by-side Venn diagrams: one for individuals without diagnosed diabetes, and one for individuals with diagnosed diabetes. Each diagram represents the size of the groups reporting family histories of diabetes, hypertension, and cardiovascular disease, along with the overlapping intersections. By comparing the two diagrams, we can visually assess whether individuals with diabetes tend to come from families with more combinations of cardiometabolic conditions.
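The region counts in a three-set Venn diagram come from assigning each person to exactly one region based on which conditions they report. The sketch below is a hedged Python illustration of that counting step; the field names and the sample people are hypothetical, and it is not the code used to draw the diagrams.

```python
from collections import Counter

CONDITIONS = ("diabetes", "hypertension", "cardiovascular")

def venn_regions(people):
    """Count people in each exclusive region, keyed by the set of conditions reported."""
    regions = Counter()
    for person in people:
        reported = frozenset(c for c in CONDITIONS if person[c])
        if reported:  # people reporting no condition fall outside all circles
            regions[reported] += 1
    return regions

# Hypothetical family-history flags, invented for illustration only.
people = [
    {"diabetes": True, "hypertension": False, "cardiovascular": False},
    {"diabetes": True, "hypertension": True, "cardiovascular": False},
    {"diabetes": True, "hypertension": True, "cardiovascular": False},
    {"diabetes": True, "hypertension": True, "cardiovascular": True},
    {"diabetes": False, "hypertension": False, "cardiovascular": False},
]
regions = venn_regions(people)
```

Running this once on the diagnosed group and once on the undiagnosed group would produce the two sets of counts being compared. Because the groups may differ in size, dividing each count by the group total gives a fairer comparison than raw counts alone.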
The Venn diagrams reveal clear differences in family-history patterns between individuals with and without diagnosed diabetes. In the No Diabetes group, most reports occur in isolation: 3,259 for diabetes only, 7,414 for hypertension only, and 1,737 for cardiovascular disease only, with relatively small overlaps. In contrast, the Has Diabetes group shows higher counts across all categories, particularly in the overlaps. For example, 11,502 report only a family history of diabetes, but 3,856 have both diabetes and hypertension in their family, and 415 report all three conditions, which is over five times higher than in the non-diabetic group.
These patterns suggest that individuals with diabetes tend to come from families with greater clusters of chronic conditions, especially combinations involving diabetes and hypertension. Although single-condition histories appear in both groups, the rise in overlapping conditions suggests that genetics, lifestyle, and environment together shape diabetes risk, showing how family history can reveal patterns of related chronic diseases.
Lastly, it is essential to examine clinical measurements and associated risk patterns that directly reflect an individual’s current metabolic state. The dataset offers access to a wide range of indicators, including BMI and waist-to-hip ratio, systolic and diastolic blood pressure, heart rate, total cholesterol, HDL and LDL cholesterol, triglycerides, fasting and postprandial glucose, insulin levels, and HbA1c. Among these, HbA1c is particularly important for understanding long-term blood glucose levels. The HbA1c test measures “what percentage of your hemoglobin is coated with glucose,” providing an average view of blood sugar over the past two to three months (MedlinePlus). Elevated HbA1c levels are a strong indicator of prediabetes or diabetes, with common clinical thresholds being less than 5.7% for normal, 5.7–6.4% for prediabetes, and 6.5% or higher for diabetes. The plot below shows the distribution of HbA1c values across diabetes stages, with colored bands highlighting these clinical thresholds.
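These thresholds translate directly into a small classification rule. The Python sketch below is illustrative: the function name `hba1c_category` is hypothetical, and it assumes that values falling between 6.4% and 6.5% (which the stated ranges leave unassigned) count as prediabetes. It is a reading of the published cutoffs, not a diagnostic tool.

```python
def hba1c_category(hba1c_percent: float) -> str:
    """Map an HbA1c percentage to the threshold band it falls in."""
    if hba1c_percent < 5.7:
        return "normal"
    if hba1c_percent < 6.5:  # assumption: 6.4-6.5 is treated as prediabetes
        return "prediabetes"
    return "diabetes"

# Example values, invented for illustration only.
bands = [hba1c_category(v) for v in (5.2, 5.7, 6.4, 6.5, 8.1)]
```

The same cutoffs define the colored bands in the plot, so a point's band can be read off either visually or with a rule like this one.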
This visualization explores how HbA1c levels vary across diabetes stages. The violin plot shows that individuals with no diabetes largely fall within the normal range, below 5.7%. For those classified as prediabetic, the bulk of the distribution lies within the intermediate range of 5.7% to 6.4%. Individuals with type 2 diabetes show a concentration near the lower end of the diabetes range, just above 6.4%, with values extending higher. For gestational and type 1 diabetes, the distributions are broader, with the middle of the distribution roughly aligned around 6.4%, indicating a wider spread of HbA1c levels within these groups. Therefore, this plot appears to align well with the percentages commonly used to diagnose diabetes or prediabetes, according to MedlinePlus.
Building on the previous analysis of HbA1c distributions across diabetes stages, we can further explore how HbA1c interacts with other clinical measurements and overall diabetes risk. Using an animated Tableau scatter plot, we can explore BMI and its relation to HbA1c, using color-coding to indicate whether an individual has a diabetes diagnosis. The animation progresses across increasing risk scores, allowing us to observe how the distribution of individuals shifts as risk accumulates.
In this plot, the cluster of points moves diagonally upward as risk score increases, reflecting the pattern that higher BMI is associated with higher HbA1c values. This aligns with the earlier violin plot, where individuals with prediabetes and diabetes showed elevated HbA1c levels. However, there is a noticeable horizontal separation between those with a diabetes diagnosis and those without, suggesting that HbA1c is a key differentiator in the diagnostic process. Even as BMI and overall risk increase, individuals without a diagnosis tend to remain in the lower HbA1c range, emphasizing the importance of HbA1c as a marker in identifying diabetes.
Next, I visualized the relationship between fasting glucose and insulin levels across different diabetes stages. Due to the high density of points, hexagonal bins were used instead of a traditional scatterplot, allowing me to better visualize patterns within the data.
The hexbin plot highlights distinct insulin–glucose patterns across diabetes stages. Type 2 diabetes shows a large, dense circular pattern, reflecting substantial variability in both glucose and insulin levels, with a high concentration near moderate values. No diabetes and pre-diabetes exhibit sharp, linear borders, where insulin rises quickly at specific glucose thresholds (approximately 100 and 125, respectively), indicating tightly constrained responses. Type 1 and gestational diabetes display smaller, less dense circular patterns, suggesting moderate variability in insulin–glucose dynamics but lower overall density compared to Type 2. These patterns illustrate how metabolic responses differ across stages, with Type 2 showing the widest spread and highest density, while pre-diabetes and no diabetes have more sharply defined relationships. It is important to note, however, that the sharp borders may partly reflect the synthetic nature of the data.
Finally, I summarized all key clinical measurements across diabetes stages, providing a comprehensive view of how metabolic and cardiovascular variables differ among groups. Each cell represents the scaled mean value of a clinical variable for a specific diabetes stage, with darker colors indicating higher values.
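One common way to produce scaled means like these is to average each variable within each stage and then rescale those averages to the 0–1 range across stages. The sketch below shows that approach in Python. The report does not state which scaling method was used, so min-max scaling is an assumption here, and the function name, field names, and sample values are hypothetical.

```python
from statistics import mean

def scaled_stage_means(records, stage_key, variables):
    """Average each variable per stage, then min-max scale the averages across stages."""
    stages = sorted({r[stage_key] for r in records})
    result = {stage: {} for stage in stages}
    for v in variables:
        means = {s: mean([r[v] for r in records if r[stage_key] == s]) for s in stages}
        lo, hi = min(means.values()), max(means.values())
        span = hi - lo
        for s in stages:
            # If every stage has the same mean, report 0.0 rather than divide by zero.
            result[s][v] = (means[s] - lo) / span if span else 0.0
    return result

# Hypothetical HDL values, invented for illustration only.
records = [
    {"stage": "No Diabetes", "hdl": 60},
    {"stage": "No Diabetes", "hdl": 50},
    {"stage": "Pre-Diabetes", "hdl": 50},
    {"stage": "Type 2", "hdl": 40},
    {"stage": "Type 2", "hdl": 40},
]
scaled = scaled_stage_means(records, "stage", ["hdl"])
```

With this kind of scaling, the stage with the highest mean for a variable always gets 1 and the lowest gets 0, so the colors show relative ordering across stages rather than clinical magnitude.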
From the plot, Type 2 diabetes stands out with high scaled means across most clinical variables, reflecting elevated metabolic and cardiovascular measures, while HDL cholesterol is notably low. No diabetes and Type 1 diabetes show higher scaled means for HDL cholesterol, but otherwise generally moderate to low values across other measures. In the no diabetes group, variables such as systolic and diastolic blood pressure, total cholesterol, heart rate, and LDL cholesterol are moderately elevated (around 0.75 scaled), while other measures remain relatively low. The heatmap therefore illustrates well how a combination of clinical measurements distinguishes each diabetes stage.
Overall, this project allowed me to explore diabetes from both a personal and analytical perspective, using demographic, lifestyle, family history, and clinical data to understand how risk and outcomes vary across populations. The analyses show how these factors combine to influence diabetes risk and progression, emphasizing the complex interplay of “risk factors [that] can’t be changed” with those that can. This project identifies broader patterns while highlighting the importance of personalized diabetes management, and it gave me the chance to learn more about a condition that people I care about have learned to manage.