Group 11

Name Student ID
Tan Huei Jie 24238016
Khadija Islam 24222135
Vinod Sithamparam Pillai 17172847
Mohsin Amin 24076189

Introduction

This project focuses on building robust predictive models that help identify the key variables influencing glycemic control and diabetes risk, paving the way for more effective early-intervention strategies and better management of diabetes.

Objectives

  1. To perform exploratory data analysis to understand relationships among demographic, clinical, and lifestyle variables.

  2. To develop a predictive model using Multiple Linear Regression for forecasting HbA1c levels, as an indicator of glycemic control.

  3. To build a Random Forest Classification model to predict diabetes status and identify individuals at risk for developing diabetes.

  4. To evaluate the performance of both models using appropriate metrics (RMSE, R², Accuracy, ROC-AUC).

  5. To identify the most influential predictors affecting HbA1c levels and diabetes risk.

Research Questions

  1. What demographic, clinical, and lifestyle factors are most strongly associated with the risk of diabetes?

  2. How accurately can a Multiple Linear Regression model predict HbA1c levels as a measure of glycemic control?

  3. How effective is the Random Forest Classification model in predicting diabetes status, and which variables have the greatest influence on prediction accuracy?

  4. Which factors are the most significant predictors of glycemic control and early diabetes onset?

  5. Can machine learning models provide a reliable framework for early detection of diabetes and prognostic assessment of glycemic control in at-risk populations?

Project Methodology

This project follows the OSEMN model (Obtain, Scrub, Explore, Model, iNterpret).

OBTAIN: The dataset is obtained from Kaggle. The link is as follows: https://www.kaggle.com/datasets/priyamchoksi/100000-diabetes-clinical-dataset/data

SCRUB: The dataset contains 100,000 records on individuals, including demographic, clinical, and lifestyle data. The data were cleaned by handling missing values and outliers, normalizing numerical variables, and encoding categorical variables to make them usable.

EXPLORE:

  1. Summary statistics of the cleaned dataset.

  2. Distribution of numerical variables to detect skewness or outliers.

  3. Distribution of categorical variables to understand the class balance.

  4. Visualizing the strength and direction of relationships between key features.

MODEL: A Multiple Linear Regression model and a Random Forest classifier will be used as the regression and classification models, respectively. Feature importance analysis will identify the most influential factors affecting glycemic control and diabetes risk.

INTERPRET:

For classification modelling: accuracy, precision, recall, and F1-score will be used to evaluate the developed model's performance.

For regression modelling: Root Mean Square Error (RMSE) and the R-squared (R²) value will be used to evaluate the performance of the developed model.
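For reference, the standard definitions of the regression metrics, with $y_i$ the observed values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the observed values, are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

(caret's RMSE() matches the first definition directly; its R2() helper defaults to the squared correlation between predictions and observations, which is closely related but not identical on held-out data.)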


1. Data Cleaning

1.0 Import Required R Libraries

# Load the required libraries (ggplot2 and tidyr are attached as part of tidyverse)
library(tidyverse)
library(ggplot2)
library(tidyr)
library(moments)
library(caret)
library(corrplot)
library(pROC)
library(randomForest)

1.1 Load Dataset from Local File

# Load the dataset
df <- read.csv("diabetes_dataset with missing values.csv")
# Display the first few rows
head(df)
##   year gender age location race.AfricanAmerican race.Asian race.Caucasian
## 1 2020 Female  32  Alabama                    0          0              0
## 2 2015 Female  29  Alabama                    0          1              0
## 3 2015   Male  18  Alabama                    0          0              0
## 4 2015   Male  41  Alabama                    0          0              1
## 5 2016 Female  52  Alabama                    1          0              0
## 6 2016   Male  66  Alabama                    0          0              1
##   race.Hispanic race.Other hypertension heart_disease smoking_history   bmi
## 1             0          1            0             0           never 27.32
## 2             0          0            0             0           never 19.95
## 3             0          1            0             0           never 23.76
## 4             0          0            0             0           never 27.32
## 5             0          0            0             0           never 23.75
## 6             0          0            0             0     not current 27.32
##   hbA1c_level blood_glucose_level diabetes
## 1         5.0                 100        0
## 2         5.0                  90        0
## 3         4.8                 160        0
## 4         4.0                 159        0
## 5         6.5                  90        0
## 6         5.7                 159        0
# Display summary of data
summary(df)
##       year         gender               age          location        
##  Min.   :2015   Length:100000      Min.   : 0.08   Length:100000     
##  1st Qu.:2019   Class :character   1st Qu.:24.00   Class :character  
##  Median :2019   Mode  :character   Median :43.00   Mode  :character  
##  Mean   :2018                      Mean   :41.89                     
##  3rd Qu.:2019                      3rd Qu.:60.00                     
##  Max.   :2022                      Max.   :80.00                     
##                                    NA's   :23                        
##  race.AfricanAmerican   race.Asian     race.Caucasian   race.Hispanic   
##  Min.   :0.0000       Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000       1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000       Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.2022       Mean   :0.2001   Mean   :0.1988   Mean   :0.1989  
##  3rd Qu.:0.0000       3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000       Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                         
##    race.Other   hypertension     heart_disease     smoking_history   
##  Min.   :0.0   Min.   :0.00000   Min.   :0.00000   Length:100000     
##  1st Qu.:0.0   1st Qu.:0.00000   1st Qu.:0.00000   Class :character  
##  Median :0.0   Median :0.00000   Median :0.00000   Mode  :character  
##  Mean   :0.2   Mean   :0.07485   Mean   :0.03942                     
##  3rd Qu.:0.0   3rd Qu.:0.00000   3rd Qu.:0.00000                     
##  Max.   :1.0   Max.   :1.00000   Max.   :1.00000                     
##                NA's   :12                                            
##       bmi         hbA1c_level    blood_glucose_level    diabetes      
##  Min.   :10.01   Min.   :3.500   Min.   : 80.0       Min.   :0.00000  
##  1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0       1st Qu.:0.00000  
##  Median :27.32   Median :5.800   Median :140.0       Median :0.00000  
##  Mean   :27.32   Mean   :5.527   Mean   :138.1       Mean   :0.08499  
##  3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0       3rd Qu.:0.00000  
##  Max.   :95.69   Max.   :9.000   Max.   :300.0       Max.   :1.00000  
##  NA's   :17      NA's   :14      NA's   :18          NA's   :8

1.2 Remove Missing and Duplicate Data

# Check missing value
colSums(is.na(df))
##                 year               gender                  age 
##                    0                    0                   23 
##             location race.AfricanAmerican           race.Asian 
##                    0                    0                    0 
##       race.Caucasian        race.Hispanic           race.Other 
##                    0                    0                    0 
##         hypertension        heart_disease      smoking_history 
##                   12                    0                    0 
##                  bmi          hbA1c_level  blood_glucose_level 
##                   17                   14                   18 
##             diabetes 
##                    8

The numerical variables age, bmi, hbA1c_level, and blood_glucose_level have missing values. These are imputed according to each variable's skewness: variables with negligible skewness are imputed with the mean, while skewed variables are imputed with the median.

The missing values of the categorical variables (hypertension and diabetes) are imputed with the mode.

# Calculate skewness for the numerical variables which has missing values
skew_values <- c(
  age = skewness(df$age, na.rm = TRUE),
  bmi = skewness(df$bmi, na.rm = TRUE),
  hbA1c = skewness(df$hbA1c_level, na.rm = TRUE),
  blood_glucose = skewness(df$blood_glucose_level, na.rm = TRUE)
)

print(skew_values)
##           age           bmi         hbA1c blood_glucose 
##   -0.05201396    1.04383681   -0.06700248    0.82159164

As observed, age and hbA1c have very minor skewness, close to 0, indicating that these variables closely follow a normal distribution. Hence, they are imputed with the mean.

The skewness of bmi is 1.044, which is greater than 1, indicating that its distribution is positively skewed (right-skewed).

The skewness of blood_glucose is 0.822, indicating moderate positive skewness. Median imputation is used for these two variables to avoid bias caused by outliers or extreme values.
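For reference, the skewness() function from the moments package computes the standard sample skewness (the third standardized moment):

$$g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$

Values near 0 indicate symmetry, while larger positive values indicate right skew, motivating the mean/median split used below.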

# Mean imputation for age and hbA1c_level

df$age <- ifelse(is.na(df$age), mean(df$age, na.rm = TRUE), df$age)

df$hbA1c_level <- ifelse(is.na(df$hbA1c_level), mean(df$hbA1c_level, na.rm = TRUE), df$hbA1c_level)

# Median imputation for skewed variables(bmi and blood_glucose)

df$bmi <- ifelse(is.na(df$bmi), median(df$bmi, na.rm = TRUE), df$bmi) 

df$blood_glucose_level <- ifelse(is.na(df$blood_glucose_level), median(df$blood_glucose_level, na.rm = TRUE), df$blood_glucose_level)  
# Mode imputation for categorical variables (hypertension and diabetes)
# names() returns the mode as a character string, so these columns are
# coerced to character here; they are converted to factors in Section 1.3

mode_hypertension <- names(sort(table(df$hypertension), decreasing = TRUE))[1]
mode_diabetes <- names(sort(table(df$diabetes), decreasing = TRUE))[1]

df$hypertension[is.na(df$hypertension)] <- mode_hypertension
df$diabetes[is.na(df$diabetes)] <- mode_diabetes
# Check for duplicate rows in the entire dataset
duplicates <- df[duplicated(df), ]
cat("Number of duplicate rows:", nrow(duplicates), "\n")
## Number of duplicate rows: 14
# Remove duplicates
df <- df[!duplicated(df), ]
# saving the dataframe without missing values and duplicate values as df_clean
df_clean<-df
# Check after cleaning
str(df_clean)
## 'data.frame':    99986 obs. of  16 variables:
##  $ year                : int  2020 2015 2015 2015 2016 2016 2015 2016 2016 2015 ...
##  $ gender              : chr  "Female" "Female" "Male" "Male" ...
##  $ age                 : num  32 29 18 41 52 66 49 15 51 42 ...
##  $ location            : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ race.AfricanAmerican: int  0 0 0 0 1 0 0 0 1 0 ...
##  $ race.Asian          : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ race.Caucasian      : int  0 0 0 1 0 1 1 0 0 1 ...
##  $ race.Hispanic       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ race.Other          : int  1 0 1 0 0 0 0 1 0 0 ...
##  $ hypertension        : chr  "0" "0" "0" "0" ...
##  $ heart_disease       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ smoking_history     : chr  "never" "never" "never" "never" ...
##  $ bmi                 : num  27.3 19.9 23.8 27.3 23.8 ...
##  $ hbA1c_level         : num  5 5 4.8 4 6.5 5.7 5.7 5 6 5.7 ...
##  $ blood_glucose_level : num  100 90 160 159 90 159 80 155 100 160 ...
##  $ diabetes            : chr  "0" "0" "0" "0" ...
# Confirm that no column still contains missing values
colSums(is.na(df_clean))
##                 year               gender                  age 
##                    0                    0                    0 
##             location race.AfricanAmerican           race.Asian 
##                    0                    0                    0 
##       race.Caucasian        race.Hispanic           race.Other 
##                    0                    0                    0 
##         hypertension        heart_disease      smoking_history 
##                    0                    0                    0 
##                  bmi          hbA1c_level  blood_glucose_level 
##                    0                    0                    0 
##             diabetes 
##                    0

1.3 Data Pre-processing: Factorization and Encoding of Categorical Data

# Categorical - factorize 
df_clean$gender <- as.factor(df_clean$gender)
df_clean$location <- as.factor(df_clean$location)
df_clean$hypertension <- as.factor(df_clean$hypertension)
df_clean$race.AfricanAmerican<-as.factor(df_clean$race.AfricanAmerican)
df_clean$race.Asian <- as.factor(df_clean$race.Asian)
df_clean$race.Caucasian <- as.factor(df_clean$race.Caucasian)
df_clean$race.Hispanic <- as.factor(df_clean$race.Hispanic)
df_clean$race.Other <- as.factor(df_clean$race.Other)
df_clean$heart_disease <- as.factor(df_clean$heart_disease)
df_clean$smoking_history <- as.factor(df_clean$smoking_history)
df_clean$diabetes <- as.factor(df_clean$diabetes)

# Categorical - encode (as.numeric() on a factor yields its integer label codes)
df_clean$year <- as.integer(df_clean$year)
df_clean$gender <- as.numeric(df_clean$gender)
df_clean$location <- as.numeric(df_clean$location)
df_clean$smoking_history <- as.numeric(df_clean$smoking_history)
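Note that label encoding imposes an artificial numeric ordering on nominal variables such as location and smoking_history. Tree-based models are largely insensitive to this, but for the linear model a one-hot encoding is often preferable. A minimal sketch using caret's dummyVars (a hypothetical alternative, assuming it runs before the as.numeric() conversions above; not applied in this project):

# Hypothetical: one-hot encode a nominal factor instead of label encoding it
dv <- dummyVars(~ smoking_history, data = df_clean)  # requires smoking_history to still be a factor
smoking_dummies <- predict(dv, newdata = df_clean)   # one 0/1 column per smoking category
head(smoking_dummies)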

1.4 Normalize Data

# Min-max normalization, vector input: rescales values to the [0, 1] range
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Duplicate cleaned dataset for regression model
df_reg <- df_clean

# Check for numeric columns 
numeric_cols_reg <-sapply(df_reg,is.numeric)

# Ensure target variable is untouched 
numeric_cols_reg["hbA1c_level"] <- FALSE

# Normalize numerical data 
df_reg[numeric_cols_reg] <- lapply(df_reg[numeric_cols_reg], normalize)

str(df_reg)
## 'data.frame':    99986 obs. of  16 variables:
##  $ year                : num  0.714 0 0 0 0.143 ...
##  $ gender              : num  0 0 0.5 0.5 0 0.5 0 0 0.5 0.5 ...
##  $ age                 : num  0.399 0.362 0.224 0.512 0.65 ...
##  $ location            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ race.AfricanAmerican: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ race.Asian          : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
##  $ race.Caucasian      : Factor w/ 2 levels "0","1": 1 1 1 2 1 2 2 1 1 2 ...
##  $ race.Hispanic       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ race.Other          : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 2 1 1 ...
##  $ hypertension        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ heart_disease       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ smoking_history     : num  0.6 0.6 0.6 0.6 0.6 1 0 0.8 0.6 0.8 ...
##  $ bmi                 : num  0.202 0.116 0.16 0.202 0.16 ...
##  $ hbA1c_level         : num  5 5 4.8 4 6.5 5.7 5.7 5 6 5.7 ...
##  $ blood_glucose_level : num  0.0909 0.0455 0.3636 0.3591 0.0455 ...
##  $ diabetes            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
# Duplicate cleaned dataset for classification model
df_cls <- df_clean

# Check for numeric columns 
numeric_cols_cls <- sapply(df_cls,is.numeric)

# Ensure target variable is untouched 
numeric_cols_cls["diabetes"] <- FALSE

# Normalize numerical data 
df_cls[numeric_cols_cls] <- lapply(df_cls[numeric_cols_cls], normalize)

str(df_cls)
## 'data.frame':    99986 obs. of  16 variables:
##  $ year                : num  0.714 0 0 0 0.143 ...
##  $ gender              : num  0 0 0.5 0.5 0 0.5 0 0 0.5 0.5 ...
##  $ age                 : num  0.399 0.362 0.224 0.512 0.65 ...
##  $ location            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ race.AfricanAmerican: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ race.Asian          : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
##  $ race.Caucasian      : Factor w/ 2 levels "0","1": 1 1 1 2 1 2 2 1 1 2 ...
##  $ race.Hispanic       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ race.Other          : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 2 1 1 ...
##  $ hypertension        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ heart_disease       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ smoking_history     : num  0.6 0.6 0.6 0.6 0.6 1 0 0.8 0.6 0.8 ...
##  $ bmi                 : num  0.202 0.116 0.16 0.202 0.16 ...
##  $ hbA1c_level         : num  0.2727 0.2727 0.2364 0.0909 0.5455 ...
##  $ blood_glucose_level : num  0.0909 0.0455 0.3636 0.3591 0.0455 ...
##  $ diabetes            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
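One caveat: the min-max parameters above are computed on the full dataset before the train/test splits in Sections 3.1 and 4.1, so information from the test rows influences the scaling. A stricter alternative (a sketch only, not applied here; train_part and test_part are hypothetical pre-scaled splits) estimates the scaling on the training split and applies it to both:

# Sketch: fit 0-1 range scaling on the training split only, then apply to both splits
pp <- preProcess(train_part, method = "range")
train_scaled <- predict(pp, train_part)
test_scaled  <- predict(pp, test_part)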

2. Exploratory Data Analysis (EDA)

Data analysis begins with exploratory data analysis (EDA): learning about the data in order to draw conclusions and spot trends. It covers data visualization, correlation analysis, outlier detection, and inspection of the data. By illuminating the data's structure, EDA guides subsequent analysis and modeling choices.

2.1 Univariate EDA

# Histogram and density plot for average blood sugar levels

ggplot(df, aes(x = hbA1c_level)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of HbA1c Level", x = "HbA1c Level", y = "Density")

Comment:

The histogram of HbA1c levels shows that most values cluster between 5 and 6.5, with the density curve peaking in this range, indicating that a large portion of individuals fall within the normal to pre-diabetic categories.

Although the skewness of HbA1c computed in Section 1.2 is close to zero, the distribution shows a tail extending beyond 6.5, representing a smaller group of individuals with elevated HbA1c levels consistent with diabetes.

This pattern suggests that while many in the dataset maintain blood sugar control within healthy or borderline ranges, there is a notable subgroup with poor glycemic control, indicating a potential class imbalance that the models must handle.

# Histogram for Blood Glucose Level

ggplot(df, aes(x = blood_glucose_level)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Blood Glucose Level", x = "Blood Glucose Level", y = "Frequency")

Comment: Blood glucose levels often exhibit multi-modal behavior in clinical datasets, where distinct peaks represent fasting glucose levels versus post-prandial or diabetic levels.

Most blood glucose values cluster around 90–100 mg/dL, within the normal fasting range (<100). A sizeable group falls between 100–125 mg/dL, indicating prediabetes, while the peak near 150 and values above 126 mg/dL reflect diabetes. Fewer cases extend beyond 200, showing poorly controlled blood sugar.

# Boxplot of BMI
ggplot(df, aes(y = bmi)) + 
  geom_boxplot(fill = "green", color = "black") + 
  theme_minimal() + 
  labs(title = "Boxplot of BMI", y = "BMI")

Comment: The BMI boxplot indicates a right-skewed distribution, with the median BMI around 27, slightly above the conventional healthy range (18.5–24.9). Several outliers at the high end of the scale (BMI values well above 75) indicate extreme cases of obesity. These outliers fall outside the whiskers of the plot, which cover data within 1.5 times the interquartile range (IQR). Such extreme values may be valid but rare cases, or data-entry errors; as they are clinically plausible, they were retained to preserve real-world variability rather than being arbitrarily removed.

# Distribution of Categorical Variable Diabetes

# Calculate proportions
diabetes_counts <- df %>%
  count(diabetes) %>%
  mutate(percentage = n / sum(n) * 100,
         label = paste0(round(percentage, 1), "%"))  

# Create pie chart
ggplot(diabetes_counts, aes(x = "", y = percentage, fill = factor(diabetes))) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar(theta = "y") +
  labs(title = "Distribution of Diabetes", fill = "Diabetes (0 = No, 1 = Yes)") +
  theme_void() +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5), color = "white", size = 5) +  # Add labels
  scale_fill_manual(values = c("skyblue", "salmon"))

Comment:

The pie chart shows that 91.5% of individuals in the dataset do not have diabetes, while only 8.5% are diabetic. This indicates a clear class imbalance, with the non‑diabetic group heavily dominating the sample.

Because the diabetic class is relatively small, any predictive model trained on this dataset may become biased toward the majority class and underperform in identifying diabetes cases. To address this, we need to consider class imbalance techniques.

2.2 Bivariate EDA

# Scatter Plot of Blood Glucose vs HbA1c Level with Regression Line
ggplot(df, aes(x = blood_glucose_level, y = hbA1c_level)) +
  geom_point(color = "blue") +  # Scatter plot
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # Regression line
  labs(title = "Blood Glucose vs HbA1c Level", x = "Blood Glucose Level", y = "HbA1c Level")
## `geom_smooth()` using formula = 'y ~ x'

Comment: There is a clear, strong positive linear relationship between Blood Glucose and HbA1c. This is physiologically expected as HbA1c measures average blood sugar over a three-month period.

# Boxplot of BMI vs Diabetes
ggplot(df, aes(x = factor(diabetes), y = bmi)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "BMI vs Diabetes", x = "Diabetes (0 = No, 1 = Yes)", y = "BMI")

Comment: The boxplot comparing BMI between individuals with and without diabetes shows that those with diabetes tend to have a slightly higher median BMI than non‑diabetic individuals. Both groups display wide variability, with overlapping ranges and numerous outliers, but the upward shift in the diabetic group suggests that higher BMI is associated with increased diabetes risk. This reinforces BMI as a contributing factor, though not the sole determinant, in diabetes development.

# Grouped Bar Plot of Hypertension vs Diabetes
ggplot(df, aes(x = factor(hypertension), fill = factor(diabetes))) +
  geom_bar(position = "dodge", color = "black") +
  labs(title = "Hypertension vs Diabetes", x = "Hypertension (0 = No, 1 = Yes)", y = "Count", fill = "Diabetes Status") +
  scale_fill_manual(values = c("blue", "red"), labels = c("No", "Yes"))

Comment: The chart shows the distribution of diabetes among individuals with and without hypertension. In absolute counts, non-diabetic individuals dominate both groups, reflecting the overall class imbalance; however, the proportion of diabetic individuals is visibly higher in the hypertensive group, consistent with the positive hypertension–diabetes correlation seen in Section 2.3.

2.3 Multivariate EDA

Evaluate the correlation between variables

cor_data <- df_reg %>%
  select(-location) %>%
  mutate(across(everything(), as.numeric)) # Ensure every column is numeric (factors become integer codes)

# Calculate the correlation matrix
cor_matrix <- cor(cor_data, use = "complete.obs")

# Visualize correlation matrix
corrplot(cor_matrix, 
         method = "color",        
         type = "upper",          
         addCoef.col = "black",   
         number.cex = 0.5,        
         tl.cex = 0.6,            
         tl.col = "black",        
         diag = FALSE,
         title = "Correlation Matrix",
         mar = c(0,0,1,0))        

Comment: The correlation matrix highlights that HbA1c levels and diabetes are closely linked, with a moderate positive correlation of 0.40. HbA1c also shows a weaker association with blood glucose levels (0.17), while its relationships with age (0.10), BMI (0.08), and heart disease (0.07) are weak. Other demographic and lifestyle factors, such as gender and smoking history, display negligible correlations with HbA1c. When examining diabetes, blood glucose level emerges as the strongest correlate (0.42), followed closely by HbA1c (0.40). Age (0.26), BMI (0.21), and hypertension (0.20) also show moderate positive associations, suggesting that these variables contribute meaningfully to diabetes risk.

Overall, the analysis suggests that clinical markers should be prioritized in the modeling phase.

Findings and Recommendations Based on EDA for Modelling:

  • No severe data quality issues are identified.

  • Need to emphasize class imbalance correction of diabetes variable.

  • Clinical markers (e.g., blood glucose level, HbA1c) should be considered as primary features over demographic variables during model training.

3.0 Regression (Multiple Linear Regression Model)

3.1 Regression - Split Train/Test Dataset

# Reproducible 
set.seed(123)

# Create partition indices
train_index <- createDataPartition(df_reg$hbA1c_level, p = 0.8, list = FALSE)

# Split data
train_data <- df_reg[train_index, ]
test_data <- df_reg[-train_index, ]

cat("Training Set Dimensions:", dim(train_data), "\n")
## Training Set Dimensions: 79990 16
cat("Testing Set Dimensions:", dim(test_data), "\n")
## Testing Set Dimensions: 19996 16

3.2 Regression - Modelling

# Fit the model, predict hbA1c_level using all other columns
model <- lm(hbA1c_level ~ ., data = train_data)

# Print the summary of the model
summary(model)
## 
## Call:
## lm(formula = hbA1c_level ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9370 -0.8405  0.3142  0.7786  2.1131 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.399337   0.020211 267.144   <2e-16 ***
## year                   0.013996   0.018081   0.774   0.4389    
## gender                 0.012421   0.014152   0.878   0.3801    
## age                   -0.007867   0.013945  -0.564   0.5727    
## location               0.011450   0.012694   0.902   0.3671    
## race.AfricanAmerican1 -0.014101   0.010939  -1.289   0.1974    
## race.Asian1           -0.022433   0.010975  -2.044   0.0410 *  
## race.Caucasian1       -0.022421   0.010993  -2.039   0.0414 *  
## race.Hispanic1        -0.016792   0.011000  -1.527   0.1269    
## race.Other1                  NA         NA      NA       NA    
## hypertension1          0.014036   0.013811   1.016   0.3095    
## heart_disease1         0.002275   0.018454   0.123   0.9019    
## smoking_history        0.013098   0.013554   0.966   0.3339    
## bmi                   -0.027822   0.048473  -0.574   0.5660    
## blood_glucose_level   -0.010230   0.020646  -0.496   0.6202    
## diabetes1              1.544007   0.014409 107.156   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9811 on 79975 degrees of freedom
## Multiple R-squared:  0.1608, Adjusted R-squared:  0.1607 
## F-statistic:  1095 on 14 and 79975 DF,  p-value: < 2.2e-16
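Note the message "Coefficients: (1 not defined because of singularities)": the five race dummy columns sum to one for each individual, so race.Other1 is an exact linear combination of the intercept and the other race dummies, and lm() drops it. This can be verified with base R's alias():

# List exact linear dependencies among model terms (explains the NA coefficient)
alias(model)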

3.3 Regression - Predict

# Make predictions on the test set
predictions <- predict(model, newdata = test_data)

# Calculate performance metrics
rmse_val <- RMSE(predictions, test_data$hbA1c_level)
r2_val <- R2(predictions, test_data$hbA1c_level)
mae_val <- MAE(predictions, test_data$hbA1c_level)

cat("Model Performance on Test Set:\n", "RMSE:", rmse_val, "\n", "R-squared:", r2_val, "\n", "MAE:", mae_val, "\n")
## Model Performance on Test Set:
##  RMSE: 0.9803876 
##  R-squared: 0.1593185 
##  MAE: 0.8644876

3.4 Regression - Visualization

# Create a dataframe for plotting
plot_data <- data.frame(
  Actual = test_data$hbA1c_level,
  Predicted = predictions
)

# Plot
ggplot(plot_data, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.3, color = "blue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Actual vs Predicted HbA1c Levels",
    x = "Actual HbA1c Level",
    y = "Predicted HbA1c Level"
  ) +
  theme_minimal()

The Actual vs Predicted plot shows that the regression model produces clustered predictions, indicating limited sensitivity to continuous changes in HbA1c. This suggests that the model is largely driven by diabetes status rather than fine-grained numerical predictors.

3.5 Regression Evaluation

The predictive performance of the regression model was further evaluated using the test dataset to assess its generalisation ability. On the test set, the model achieved an RMSE of 0.98, an MAE of 0.86, and an R-squared value of 0.159.

The R-squared value of approximately 16% indicates that the model explains a relatively small proportion of the variance in HbA1c levels on unseen data, which is consistent with the training performance. This suggests that the model does not suffer from overfitting but has limited predictive strength overall.

The RMSE value close to 1 implies that the model’s predictions deviate from the true HbA1c values by about one unit on average, while the MAE of 0.86 shows a moderate average absolute prediction error. These error levels indicate that although the regression model provides reasonable baseline predictions, its accuracy is insufficient for precise clinical prediction.

Overall, the consistency between training and test performance demonstrates that the regression model is stable but underpowered, reinforcing its role as an interpretable baseline model rather than a high-performing predictive model. This further justifies the application of advanced machine learning techniques to better capture non-linear relationships and improve predictive accuracy in subsequent analyses.

4. Classification (Random Forest)

4.1 Classification - Split Train/Test Dataset

# Reproducible
set.seed(123)

# Create partition indices
train_index_class <- createDataPartition(df_cls$diabetes, p = 0.8, list = FALSE)

# Split data
train_data_class <- df_cls[train_index_class, ]
test_data_class  <- df_cls[-train_index_class, ]

cat("Classification Train Size:", nrow(train_data_class), "\n")
## Classification Train Size: 79990
cat("Classification Test Size:", nrow(test_data_class), "\n")
## Classification Test Size: 19996

4.2 Classification - Modelling

# Train the model
# importance = TRUE lets us see which variables are driving the diagnosis
rf_class_model <- randomForest(diabetes ~ ., 
                               data = train_data_class, 
                               ntree = 100, 
                               importance = TRUE)

# Print the summary of the model
summary(rf_class_model)
##                 Length Class  Mode     
## call                 5 -none- call     
## type                 1 -none- character
## predicted        79990 factor numeric  
## err.rate           300 -none- numeric  
## confusion            6 -none- numeric  
## votes           159980 matrix numeric  
## oob.times        79990 -none- numeric  
## classes              2 -none- character
## importance          60 -none- numeric  
## importanceSD        45 -none- numeric  
## localImportance      0 -none- NULL     
## proximity            0 -none- NULL     
## ntree                1 -none- numeric  
## mtry                 1 -none- numeric  
## forest              14 -none- list     
## y                79990 factor numeric  
## test                 0 -none- NULL     
## inbag                0 -none- NULL     
## terms                3 terms  call
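summary() on a randomForest object only lists its internal components. Printing the fitted object is usually more informative, since the print method reports ntree, mtry, the out-of-bag (OOB) error estimate, and an OOB confusion matrix:

# Print the fitted forest: ntree, mtry, OOB error rate, and OOB confusion matrix
print(rf_class_model)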

4.3 Classification - Predict

# Predict the specific class (0 or 1) for the test set
rf_pred_class <- predict(rf_class_model, newdata = test_data_class)

# Generate Confusion Matrix
confusionMatrix(rf_pred_class, test_data_class$diabetes)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 18290   528
##          1     7  1171
##                                           
##                Accuracy : 0.9732          
##                  95% CI : (0.9709, 0.9754)
##     No Information Rate : 0.915           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8001          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9996          
##             Specificity : 0.6892          
##          Pos Pred Value : 0.9719          
##          Neg Pred Value : 0.9941          
##              Prevalence : 0.9150          
##          Detection Rate : 0.9147          
##    Detection Prevalence : 0.9411          
##       Balanced Accuracy : 0.8444          
##                                           
##        'Positive' Class : 0               
## 
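Note that confusionMatrix() treats the first factor level ("0", non-diabetic) as the positive class by default, so the sensitivity of 0.9996 above refers to detecting non-diabetics; the recall for diabetics is the reported specificity, 0.6892. The metrics can be re-expressed from the diabetic perspective by setting the positive class explicitly, as Section 4.6 does:

# Same predictions, but with diabetes ("1") as the positive class
confusionMatrix(rf_pred_class, test_data_class$diabetes, positive = "1")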

4.4 Classification - Feature Importance

varImpPlot(rf_class_model, main = "Key Predictors of Diabetes")
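Beyond the plot, the underlying scores can be tabulated directly (a small sketch using randomForest's importance() accessor, available because the model was trained with importance = TRUE):

# Variable importance table, sorted by mean decrease in Gini impurity
imp <- importance(rf_class_model)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]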

4.5 Classification - Predict probabilities

# Predict probabilities (we want the probability of "1" / Diabetes)
rf_pred_prob <- predict(rf_class_model, newdata = test_data_class, type = "prob")

# Extract the probability for class "1" (the second column usually)
prob_diabetes <- rf_pred_prob[, 2]

# Calculate ROC and AUC
roc_rf <- roc(test_data_class$diabetes, prob_diabetes)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot
plot(roc_rf, col = "green", main = "ROC Curve: Random Forest")

cat("Random Forest AUC:", auc(roc_rf))
## Random Forest AUC: 0.9646918

4.6 Classification - Optimize (To increase specificity)

# Take column 2, which corresponds to the probability of "1" (Diabetes)
probs <- predict(rf_class_model, newdata = test_data_class, type = "prob")[, 2]

# coords(roc, "best") picks the threshold maximizing Youden's J
# (sensitivity + specificity - 1) by default
optimal_threshold <- coords(roc_rf, "best", ret = "threshold")

cat("The mathematically optimal threshold is:", optimal_threshold$threshold, "\n")
## The mathematically optimal threshold is: 0.095
# Apply this "optimal" threshold
optimal_preds <- ifelse(probs > optimal_threshold$threshold, 1, 0)
optimal_preds <- as.factor(optimal_preds)

# Check the final result
confusionMatrix(optimal_preds, test_data_class$diabetes, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16990   225
##          1  1307  1474
##                                          
##                Accuracy : 0.9234         
##                  95% CI : (0.9196, 0.927)
##     No Information Rate : 0.915          
##     P-Value [Acc > NIR] : 9.155e-06      
##                                          
##                   Kappa : 0.6177         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.86757        
##             Specificity : 0.92857        
##          Pos Pred Value : 0.53003        
##          Neg Pred Value : 0.98693        
##              Prevalence : 0.08497        
##          Detection Rate : 0.07371        
##    Detection Prevalence : 0.13908        
##       Balanced Accuracy : 0.89807        
##                                          
##        'Positive' Class : 1              
## 

4.7 Classification Evaluation

While accuracy is high, the Kappa statistic is more informative here due to the class imbalance, as it measures performance beyond simple chance. In healthcare, sensitivity (recall) for the diabetic class is critical to ensure that diabetic patients are not missed (minimizing false negatives); at the default threshold this recall is only 0.69 (reported as specificity in Section 4.3, where class 0 is treated as positive), and the threshold optimization in Section 4.6 raises it to 0.87 at the cost of more false alarms. The Area Under the Curve represents the model's ability to distinguish between classes; the value of roughly 0.965 obtained here indicates an excellent classifier.

5. Discussion

5.1 Summary of Findings

The analysis identifies blood glucose level, HbA1c, and BMI as the most influential predictors of diabetes status. The Multiple Linear Regression model, however, predicted HbA1c with only limited accuracy (test R² ≈ 0.16), with diabetes status dominating the fit; this reflects the biological link between long-term glycemic averages and diagnosed diabetes, but it also shows that fine-grained HbA1c prediction is difficult with the available features.

5.2 Clinical Implications

The Random Forest model’s feature importance plot highlights that while age and hypertension play roles, physiological markers are paramount. This suggests that machine learning can serve as an effective “early-warning” system in clinical settings, flagging individuals for intensive screening based on routine lab results.

5.3 Limitations and Recommendations

Class Imbalance: The majority of the dataset consists of non-diabetic records. Future work should explore oversampling techniques (e.g., SMOTE) to improve sensitivity for the minority class.
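As a concrete starting point, caret ships simple resampling helpers; below is a minimal sketch of random oversampling of the minority class on the training split (SMOTE itself would require an additional package such as themis or smotefamily):

# Sketch: randomly oversample the diabetic class in the training data
predictors <- setdiff(names(train_data_class), "diabetes")
train_bal <- upSample(x = train_data_class[, predictors],
                      y = train_data_class$diabetes,
                      yname = "diabetes")
table(train_bal$diabetes)  # classes are now balanced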

Feature Scope: The model lacks data on physical activity and diet, which are major drivers of diabetes. Including these could further improve predictive accuracy.

Threshold Optimization: By lowering the classification threshold (as seen in Section 4.6), we prioritize catching every potential case of diabetes, which is often preferred in a medical context even if it increases false alarms.