Introduction

Project Background & Objectives

Diabetes is a group of metabolic diseases characterized by hyperglycemia resulting from defects in insulin secretion, insulin action, or both.

Although medical science is advancing rapidly, diabetes remains incurable. It is, however, preventable: evidence from randomized controlled trials shows that intensive lifestyle interventions targeting the main risk factors can effectively prevent or delay the onset of diabetes in high-risk individuals.

As such, early diagnosis of diabetes using statistical and machine learning models can greatly benefit individuals: early treatment can prevent complications such as lower-extremity amputation and cardiovascular disease, and subsequently reduce both the individual’s financial burden and the healthcare costs borne by the government.

Objectives

  • To identify the significant risk factors or features associated with diabetes using statistical techniques.

  • To construct a generalized linear model (GLM) to estimate the probability of an individual being diabetic.

  • To construct an effective binary classifier to predict diabetes using attributes that are accessible via IoT devices.

Libraries & Dataset Details

Throughout our project, we will be using the libraries below.

library(rpart)
library(rpart.plot)
library(corrplot)
library(ggplot2)
library(party)
library(dplyr)
library(gridExtra)
library(caTools)
library(ROSE)
library(e1071)
library(caret)
library(class)
library(data.table)
library(patchwork)

The dataset is courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine. It consists of 19 variables on 403 of the 1046 subjects who were interviewed in a study of the prevalence of obesity, diabetes, and other cardiovascular risk factors among African Americans in central Virginia.

This data set was obtained from http://hbiostat.org/data courtesy of the Vanderbilt University Department of Biostatistics, where a detailed explanation of each variable can also be found. As such, we consider the data set reliable for our project.

Here is a brief summary of the dataset, including its structure, dimensions and missing values.

df <- read.csv("diabetes.csv", header = TRUE)
# Treat any frame entry outside the documented levels (e.g. empty strings) as missing
allowed_frame_values <- c('small', 'medium', 'large')
df$frame[!(df$frame %in% allowed_frame_values)] <- NA
dim(df)
## [1] 403  19
summary(df)
##        id             chol          stab.glu          hdl        
##  Min.   : 1000   Min.   : 78.0   Min.   : 48.0   Min.   : 12.00  
##  1st Qu.: 4792   1st Qu.:179.0   1st Qu.: 81.0   1st Qu.: 38.00  
##  Median :15766   Median :204.0   Median : 89.0   Median : 46.00  
##  Mean   :15978   Mean   :207.8   Mean   :106.7   Mean   : 50.45  
##  3rd Qu.:20336   3rd Qu.:230.0   3rd Qu.:106.0   3rd Qu.: 59.00  
##  Max.   :41756   Max.   :443.0   Max.   :385.0   Max.   :120.00  
##                  NA's   :1                       NA's   :1       
##      ratio            glyhb         location              age       
##  Min.   : 1.500   Min.   : 2.68   Length:403         Min.   :19.00  
##  1st Qu.: 3.200   1st Qu.: 4.38   Class :character   1st Qu.:34.00  
##  Median : 4.200   Median : 4.84   Mode  :character   Median :45.00  
##  Mean   : 4.522   Mean   : 5.59                      Mean   :46.85  
##  3rd Qu.: 5.400   3rd Qu.: 5.60                      3rd Qu.:60.00  
##  Max.   :19.300   Max.   :16.11                      Max.   :92.00  
##  NA's   :1        NA's   :13                                        
##     gender              height          weight         frame          
##  Length:403         Min.   :52.00   Min.   : 99.0   Length:403        
##  Class :character   1st Qu.:63.00   1st Qu.:151.0   Class :character  
##  Mode  :character   Median :66.00   Median :172.5   Mode  :character  
##                     Mean   :66.02   Mean   :177.6                     
##                     3rd Qu.:69.00   3rd Qu.:200.0                     
##                     Max.   :76.00   Max.   :325.0                     
##                     NA's   :5       NA's   :1                         
##      bp.1s           bp.1d            bp.2s           bp.2d       
##  Min.   : 90.0   Min.   : 48.00   Min.   :110.0   Min.   : 60.00  
##  1st Qu.:121.2   1st Qu.: 75.00   1st Qu.:138.0   1st Qu.: 84.00  
##  Median :136.0   Median : 82.00   Median :149.0   Median : 92.00  
##  Mean   :136.9   Mean   : 83.32   Mean   :152.4   Mean   : 92.52  
##  3rd Qu.:146.8   3rd Qu.: 90.00   3rd Qu.:161.0   3rd Qu.:100.00  
##  Max.   :250.0   Max.   :124.00   Max.   :238.0   Max.   :124.00  
##  NA's   :5       NA's   :5        NA's   :262     NA's   :262     
##      waist           hip           time.ppn     
##  Min.   :26.0   Min.   :30.00   Min.   :   5.0  
##  1st Qu.:33.0   1st Qu.:39.00   1st Qu.:  90.0  
##  Median :37.0   Median :42.00   Median : 240.0  
##  Mean   :37.9   Mean   :43.04   Mean   : 341.2  
##  3rd Qu.:41.0   3rd Qu.:46.00   3rd Qu.: 517.5  
##  Max.   :56.0   Max.   :64.00   Max.   :1560.0  
##  NA's   :2      NA's   :2       NA's   :3
sapply(df, typeof)
##          id        chol    stab.glu         hdl       ratio       glyhb 
##   "integer"   "integer"   "integer"   "integer"    "double"    "double" 
##    location         age      gender      height      weight       frame 
## "character"   "integer" "character"   "integer"   "integer" "character" 
##       bp.1s       bp.1d       bp.2s       bp.2d       waist         hip 
##   "integer"   "integer"   "integer"   "integer"   "integer"   "integer" 
##    time.ppn 
##   "integer"

Defining the Response Variable

There’s no response variable (diabetic status) in our dataset. Following the American Diabetes Association diagnostic criterion (https://doi.org/10.2337/dc18-S002), we label an individual as diabetic when glyhb (glycosylated haemoglobin) is at least 6.5. We then drop glyhb so that the source of the label does not leak into the predictors.

df <- mutate(df, diabetic = ifelse(glyhb >= 6.5, 1, 0))

count <- table(df$diabetic)
withDiabetes <- count[2]  
withoutDiabetes <- count[1]  

cat("Diabetic Count:", withDiabetes, "\nNon-Diabetic Count:", withoutDiabetes, "\n")
## Diabetic Count: 65 
## Non-Diabetic Count: 325
# Drop the 13 rows with missing glyhb (no label can be assigned), then remove glyhb itself
df <- df[complete.cases(df$glyhb), ]
df <- df[, !names(df) %in% c("glyhb")]

Initial EDA & Data Cleaning

Uniqueness of Observations

We first check that the observations are unique, then discard the id column.

length(unique(df$id)) == length(df$id)
## [1] TRUE
df <- df[, !(names(df) %in% c('id'))]

Numerical Summaries Stratified by Diabetic Status

Mean

means_by_diabetic <- df %>%
  select(-c(location, gender, frame)) %>%
  group_by(diabetic) %>%
  summarise(across(everything(), mean, na.rm = TRUE))

means_transposed <- t(means_by_diabetic[, -1])  

cat("Mean(non-diabetic, diabetic) \n");print(means_transposed)
## Mean(non-diabetic, diabetic)
##                [,1]       [,2]
## chol     202.953704 228.815385
## stab.glu  90.889231 189.584615
## hdl       51.225309  45.492308
## ratio      4.304012   5.635385
## age       44.443077  58.430769
## height    65.953271  66.140625
## weight   174.566154 191.484375
## bp.1s    134.903125 148.200000
## bp.1d     82.987500  84.753846
## bp.2s    152.057143 155.333333
## bp.2d     93.561905  90.090909
## waist     37.318885  40.769231
## hip       42.681115  44.784615
## time.ppn 330.838509 362.307692

Median

medians_by_diabetic <- df %>%
  select(-c(location, gender, frame)) %>%
  group_by(diabetic) %>%
  summarise(across(everything(), median, na.rm = TRUE))

medians_transposed <- t(medians_by_diabetic[, -1])  

cat("Median(non-diabetic, diabetic) \n"); print(medians_transposed)

Graphical Summaries

Boxplots

These boxplots help to identify plausible outliers. One candidate outlier has an unusually high cholesterol ratio; however, its chol to hdl quotient is consistent with the recorded ratio, so we do not remove it.

columns_to_exclude <- c('id', 'location', 'gender', 'frame','diabetic')

filtered_columns <- setdiff(names(df), columns_to_exclude)

par(mfrow=c(3, 5), mar=c(3, 3, 1, 1), oma=c(1, 1, 1, 1))


for (col in filtered_columns) {
  boxplot_data <- df[, col]
  
  boxplot(boxplot_data, main = paste("Boxplot of", col), col="skyblue", border="black", horizontal=TRUE)
}
max_ratio_row <- df[which.max(df$ratio), ]
print(max_ratio_row)
##    chol stab.glu hdl ratio   location age gender height weight  frame bp.1s
## 63  443      185  23  19.3 Buckingham  51 female     70    235 medium   158
##    bp.1d bp.2s bp.2d waist hip time.ppn diabetic
## 63    98   148    88    43  48      420        1

Pairwise Plots

These pairwise plots help to reveal plausible linear relationships. Some notable pairs include weight and waist, bp.1s and bp.1d, waist and hip, and weight and hip.

numeric_vars <- sapply(df, is.numeric)
pairs(df[, numeric_vars], pch = 19, cex = 0.1, main = "Pair Plot of Numerical Variables")

Data Cleaning

Cleaning by Deletion

We drop bp.2s and bp.2d, as the majority of their values are missing (262 of the original 403 observations).

df <- df %>% select(-"bp.2s", -"bp.2d")

Cleaning by Imputation

Median imputation for chol, ratio and hdl, stratified by diabetic status.

medians_by_diabetic <- df %>%
  group_by(diabetic) %>%
  summarise(across(c(chol, hdl, ratio), median, na.rm = TRUE))

df <- left_join(df, medians_by_diabetic, by = "diabetic", suffix = c("", "_median"))

df <- df %>%
  mutate(
    chol = ifelse(is.na(chol), chol_median, chol),
    hdl = ifelse(is.na(hdl), hdl_median, hdl),
    ratio = ifelse(is.na(ratio), chol_median / hdl_median, ratio)
  ) %>%
  select(-ends_with("_median")) 

Predict missing weight values from hip and waist with multiple linear regression, suggested by the plausible linear relationships in the pairwise plots.

features <- c('hip', 'waist')
target <- 'weight'

df_impute <- df

# Drop rows with missing values in 'hip' or 'waist'
df_impute <- df_impute[complete.cases(df_impute[, c('hip', 'waist')]), ]

# Split the data into missing and not-missing weight
df_missing_weight <- df_impute[is.na(df_impute$weight), ]
df_not_missing_weight <- df_impute[!is.na(df_impute$weight), ]

# Fit a linear regression model
model <- lm(weight ~ hip + waist, data = df_not_missing_weight)

# Predict weights for missing values
predicted_weights <- predict(model, newdata = df_missing_weight)

# Impute missing values in the original data frame
df$weight[is.na(df$weight)] <- predicted_weights
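
As a quick check on this imputation, we can inspect the fit of the regression: the share of variance in weight explained by hip and waist, and the residual standard error, which gives a rough error bar on each imputed value. This is a sketch of a diagnostic, not part of the imputation itself.

# Fit quality of the imputation model
summary(model)$r.squared  # variance in weight explained by hip and waist
sigma(model)              # residual standard error (rough error bar on imputed weights)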

Predict hip and waist from weight with simple linear regression, again suggested by the plausible linear relationships in the pairwise plots.

features <- c('weight')
targets <- c('hip', 'waist')

# Create a copy of the DataFrame
df_impute <- df

# Identify rows with missing values in any of the target columns
df_missing_values <- df_impute[rowSums(is.na(df_impute[targets])) > 0, ]
df_not_missing_values <- df_impute[rowSums(is.na(df_impute[targets])) == 0, ]

# Create a list to store linear regression models
models <- list()

# Train a linear regression model for each target column
for (target in targets) {
  model <- lm(reformulate('weight', response = target), data = df_not_missing_values)
  models[[target]] <- model
}

# Impute each target column, filling in only its own missing entries
# (predicting per column avoids overwriting observed values in rows where
# only the other target is missing)
for (target in targets) {
  missing_idx <- is.na(df[[target]])
  df[[target]][missing_idx] <- predict(models[[target]], newdata = df[missing_idx, ])
}

Impute missing values of bp.1s, bp.1d and time.ppn using the overall median.

df <- df %>%
  mutate(bp.1s = ifelse(is.na(bp.1s), median(bp.1s, na.rm = TRUE), bp.1s),
         bp.1d = ifelse(is.na(bp.1d), median(bp.1d, na.rm = TRUE), bp.1d),
         time.ppn = ifelse(is.na(time.ppn), median(time.ppn, na.rm = TRUE), time.ppn))

Impute missing values of height using the median height of each gender.

# Calculate median height by gender
median_height_by_gender <- df %>%
  group_by(gender) %>%
  summarise(median_height = median(height, na.rm = TRUE))

# Merge the median height values back into the original dataframe
df <- df %>%
  left_join(median_height_by_gender, by = "gender") %>%
  mutate(imputed_height = ifelse(!is.na(height), height, median_height))

# Replace missing values in the "height" column with imputed values
df$height <- ifelse(is.na(df$height), df$imputed_height, df$height)

# Drop the auxiliary columns used for imputation
df <- df %>%
  select(-median_height, -imputed_height)

Decision-tree-based imputation of frame, using gender and body measurements.

features <- c('gender', 'height', 'weight', 'hip', 'waist')
target <- 'frame'

# Create a copy of the DataFrame
df_impute <- df
df_impute$gender <- ifelse(df_impute$gender == 'male', 1, 0)

# Identify rows with missing values in the 'frame' column
df_missing_frame <- df_impute[is.na(df_impute$frame), ]
df_not_missing_frame <- df_impute[!is.na(df_impute$frame), ]

# Train Decision Tree Classifier
model <- rpart(frame ~ gender + height + weight + hip + waist, data = df_not_missing_frame, method = "class")

# Predict the response for the rows with missing 'frame'
predicted_frame <- predict(model, newdata = df_missing_frame, type = "class")

# Update the original DataFrame with the predicted values
df$frame[is.na(df$frame)] <- as.character(predicted_frame)
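
As a rough sanity check, we can compute the tree's resubstitution accuracy on the rows where frame is observed; this estimate is optimistic, but a low value would flag gross misfit.

# Resubstitution accuracy of the frame classifier (optimistic estimate)
in_sample_pred <- predict(model, newdata = df_not_missing_frame, type = "class")
mean(as.character(in_sample_pred) == df_not_missing_frame$frame)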

Final EDA

We investigate several graphical and numerical summaries according to the diabetic status of the individuals.

Histograms

These histograms show the distribution, spread and skewness of each continuous variable according to diabetic status.

df11 <- as.data.table(df[, !(names(df) %in% c('location', 'gender', 'frame'))])
dt <- melt(df11, id.vars = 'diabetic')
dt[,diabetic:=ifelse(diabetic,'Diabetes','No Diabetes')] 

tus <- lapply(dt[, unique(variable)], function(x) {
  ggplot(dt[variable == x], aes(value, group = diabetic, colour = diabetic)) +
    geom_histogram(aes(y = after_stat(density),
                       fill = after_scale(alpha(colour, 0.3)))) +
    geom_density(fill = NA) +
    theme_bw() +
    labs(fill = '', x = x, colour = '', y = 'Density',
         title = paste('Distribution of', x)) +
    theme(plot.title = element_text(hjust = 0.5),
          legend.position = c(.99, .99),
          legend.justification = c("right", "top"),
          legend.box.just = "right") +
    scale_fill_manual(values = c('red', 'green'))
})

tus[[1]]+tus[[2]]+tus[[3]]+tus[[4]]+plot_layout(nrow=1)

tus[[5]]+tus[[6]]+tus[[7]]+tus[[8]]+plot_layout(nrow=1)

tus[[9]]+tus[[10]]+tus[[11]]+tus[[12]]+plot_layout(nrow=1)

Boxplots

These boxplots help to inspect the distribution of each variable after cleaning.

columns_to_exclude <- c('id', 'location', 'gender', 'frame','diabetic')
filtered_columns <- setdiff(names(df), columns_to_exclude)

par(mfrow=c(3, 4), mar=c(3, 3, 1, 1), oma=c(1, 1, 1, 1))
for (col in filtered_columns) {
  boxplot_data <- df[, col]
  boxplot(boxplot_data, main = paste("Boxplot of", col), col="lightyellow", border="black", horizontal=FALSE,cex.main=1.2)
}

max_ratio_row <- df[which.max(df$ratio), ]
print(max_ratio_row)
##    chol stab.glu hdl ratio   location age gender height weight  frame bp.1s
## 61  443      185  23  19.3 Buckingham  51 female     70    235 medium   158
##    bp.1d waist hip time.ppn diabetic
## 61    98    43  48      420        1

Boxplots Stratified by Diabetic Status

columns_to_exclude <- c('id', 'location', 'gender', 'frame','diabetic')
filtered_columns <- setdiff(names(df), columns_to_exclude)

par(mfrow=c(3, 4), mar=c(3, 3, 1, 1), oma=c(1, 1, 1, 1))
for(i in seq_along(filtered_columns)){
col <- filtered_columns[i]
boxplot(df[,col] ~ df$diabetic,
data=df,
main = paste('Boxplot of', col),
xlab = 'Diabetic',
ylab = col,
col="Orange",
border="black",
horizontal=FALSE,
cex.main=1.2)
}

Heatmap

The heatmap displays the correlation matrix of the numerical features.

attributes_to_exclude <- c('location', 'gender', 'frame')
df_filtered <- df[, !(names(df) %in% attributes_to_exclude)]

corrMatrix <- cor(df_filtered[, names(df_filtered) != "diabetic"])
melted_corr <- reshape2::melt(corrMatrix)
ggplot(melted_corr, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), vjust = 1, size = 2.5, color = "black") +
  scale_fill_gradient2(low = "black", high = "lightyellow", mid = "red", midpoint = 0, limit = c(-1, 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 9, hjust = 1),
        axis.text.y = element_text(size = 9),
        axis.title = element_blank(), 
        plot.title = element_text(size = 13, face = "bold")) +
  coord_fixed() +
  ggtitle("Correlation Between Features")

Pairwise Plots

These pairwise plots reveal associations between the numerical variables.

attributes_to_exclude <- c('location', 'gender', 'frame')
df_filtered <- df[, !(names(df) %in% attributes_to_exclude)]

# Colour points by diabetic status (+ 1 so that non-diabetic points are drawn
# in black rather than in the invisible colour 0)
pairs(df_filtered, col = df_filtered$diabetic + 1, pch = 19, cex = 0.1,
      main = "Pairwise Plot of Numerical Variables")

Numerical Summaries

After data cleaning, we recompute the numerical summaries stratified by diabetic status.

mean_diabetic <- df_filtered %>%
  group_by(diabetic) %>%
  summarise(across(where(is.numeric), mean)) %>%
  t()

median_diabetic <- df_filtered %>%
  group_by(diabetic) %>%
  summarise(across(where(is.numeric), median)) %>%
  t()

var_diabetic <- df_filtered %>%
  group_by(diabetic) %>%
  summarise(across(where(is.numeric), var)) %>%
  t()

formatted_mean <- format(mean_diabetic, scientific = FALSE, digits = 4)
formatted_mean <- formatted_mean[-1, ]
colnames(formatted_mean) <- c("diabetic(0)", "diabetic(1)")

formatted_median <- format(median_diabetic, scientific = FALSE, digits = 4)
formatted_median <- formatted_median[-1, ]
colnames(formatted_median) <- c("diabetic(0)", "diabetic(1)")

formatted_var <- format(var_diabetic, scientific = FALSE, digits = 4)
formatted_var <- formatted_var[-1, ]
colnames(formatted_var) <- c("diabetic(0)", "diabetic(1)")

# Output the numerical Summaries
print('mean_by_diabetic');print(formatted_mean)
## [1] "mean_by_diabetic"
##          diabetic(0) diabetic(1)
## chol     "202.941"   "228.815"  
## stab.glu " 90.917"   "189.585"  
## hdl      " 51.198"   " 45.492"  
## ratio    "  4.306"   "  5.635"  
## age      " 44.438"   " 58.431"  
## height   " 65.929"   " 66.092"  
## weight   "174.731"   "190.559"  
## bp.1s    "134.972"   "148.200"  
## bp.1d    " 83.046"   " 84.754"  
## waist    " 37.323"   " 40.769"  
## hip      " 42.674"   " 44.785"  
## time.ppn "328.796"   "362.308"
print('median_by_diabetic');print(formatted_median)
## [1] "median_by_diabetic"
##          diabetic(0) diabetic(1)
## chol     "199.0"     "219.0"    
## stab.glu " 86.0"     "184.0"    
## hdl      " 47.0"     " 41.0"    
## ratio    "  4.1"     "  5.3"    
## age      " 41.0"     " 59.0"    
## height   " 66.0"     " 67.0"    
## weight   "170.0"     "183.0"    
## bp.1s    "132.0"     "146.0"    
## bp.1d    " 82.0"     " 86.0"    
## waist    " 37.0"     " 40.0"    
## hip      " 42.0"     " 44.0"    
## time.ppn "240.0"     "240.0"
print('variance_by_diabetic');print(formatted_var)
## [1] "variance_by_diabetic"
##          diabetic(0)  diabetic(1) 
## chol     "  1655.517" "  3200.809"
## stab.glu "   616.906" "  6232.747"
## hdl      "   287.849" "   334.098"
## ratio    "     2.032" "     6.560"
## age      "   259.844" "   164.905"
## height   "    15.639" "    13.960"
## weight   "  1603.565" "  1594.900"
## bp.1s    "   516.591" "   412.475"
## bp.1d    "   182.651" "   173.501"
## waist    "    31.533" "    30.899"
## hip      "    31.954" "    28.015"
## time.ppn " 91171.457" "112544.591"

Regression: Logistic Regression

We predict the risk of being diabetic using logistic regression. A parsimonious model will be developed using the Akaike Information Criterion (AIC) and Analysis of Deviance. Residual plots will then be used for model diagnostics and inspection.

Preliminary Fitting

We first fit a logistic model, using diabetic as the response variable and the rest as the explanatory variables.

logistic_model <- glm(diabetic ~ . , data = df, family = "binomial")

summary(logistic_model)
## 
## Call:
## glm(formula = diabetic ~ ., family = "binomial", data = df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -8.5040100  6.4809914  -1.312    0.189    
## chol            0.0123022  0.0089470   1.375    0.169    
## stab.glu        0.0385904  0.0056605   6.817 9.27e-12 ***
## hdl            -0.0236502  0.0288946  -0.818    0.413    
## ratio          -0.0781535  0.2917518  -0.268    0.789    
## locationLouisa  0.2157647  0.4613570   0.468    0.640    
## age             0.0273054  0.0174351   1.566    0.117    
## gendermale     -0.1049980  0.7042208  -0.149    0.881    
## height         -0.0548925  0.0870973  -0.630    0.529    
## weight          0.0002607  0.0134459   0.019    0.985    
## framemedium     0.0223288  0.5193458   0.043    0.966    
## framesmall     -0.1476738  0.7693414  -0.192    0.848    
## bp.1s           0.0110539  0.0118416   0.933    0.351    
## bp.1d           0.0046564  0.0207925   0.224    0.823    
## waist           0.0674118  0.0801001   0.842    0.400    
## hip            -0.0380336  0.0823937  -0.462    0.644    
## time.ppn        0.0008675  0.0006601   1.314    0.189    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 351.44  on 389  degrees of freedom
## Residual deviance: 168.89  on 373  degrees of freedom
## AIC: 202.89
## 
## Number of Fisher Scoring iterations: 6

It seems that, except for stab.glu, the explanatory variables are insignificant. However, their significance is affected by the presence of the other variables. To investigate further, we consider the reduction in residual deviance contributed by each variable.

anova(logistic_model)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: diabetic
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev
## NULL                       389     351.44
## chol      1   16.919       388     334.52
## stab.glu  1  149.413       387     185.11
## hdl       1    0.951       386     184.15
## ratio     1    0.039       385     184.12
## location  1    0.020       384     184.10
## age       1    8.299       383     175.80
## gender    1    0.594       382     175.20
## height    1    0.506       381     174.70
## weight    1    1.034       380     173.66
## frame     2    0.271       378     173.39
## bp.1s     1    1.871       377     171.52
## bp.1d     1    0.045       376     171.48
## waist     1    0.613       375     170.86
## hip       1    0.250       374     170.61
## time.ppn  1    1.719       373     168.89

We observe that stab.glu has the largest deviance reduction, followed by chol, age, bp.1s and so on. Therefore, we define five models in which the explanatory variables are added sequentially, and test their significance with a sequential Analysis of Deviance.

Model Comparisons

Defining the models.

model_1 <- glm(diabetic ~ 1, data = df, family = "binomial")
model_2 <- glm(diabetic ~ stab.glu, data = df, family = "binomial")
model_3 <- glm(diabetic ~ stab.glu + chol, data = df, family = "binomial")
model_4 <- glm(diabetic ~ stab.glu + chol + age, data = df, family = "binomial")
model_5 <- glm(diabetic ~ stab.glu + chol + age + bp.1s, data = df, family = "binomial")

Sequential Analysis of Deviance

Under the null model, the addition of stab.glu is significant. In the presence of stab.glu, the addition of chol is significant, and in the presence of stab.glu and chol, the addition of age is significant. However, the further addition of bp.1s is no longer significant (p = 0.119).

anova(model_1,model_2,model_3,model_4,model_5,test = 'Chisq')
## Analysis of Deviance Table
## 
## Model 1: diabetic ~ 1
## Model 2: diabetic ~ stab.glu
## Model 3: diabetic ~ stab.glu + chol
## Model 4: diabetic ~ stab.glu + chol + age
## Model 5: diabetic ~ stab.glu + chol + age + bp.1s
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       389     351.44                          
## 2       388     192.88  1  158.557 < 2.2e-16 ***
## 3       387     185.11  1    7.775  0.005296 ** 
## 4       386     177.68  1    7.428  0.006422 ** 
## 5       385     175.24  1    2.435  0.118680    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Therefore, in the sense of Analysis of Deviance, Model 4 is the best model.
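
To make the test explicit: each row of the table above is a likelihood-ratio (deviance) test between nested models. As a sketch, the Model 4 vs Model 5 comparison can be reproduced by hand.

# Likelihood-ratio test: the deviance drop is chi-squared with df = 1 under H0
dev_drop <- deviance(model_4) - deviance(model_5)
pchisq(dev_drop, df = 1, lower.tail = FALSE)  # matches Pr(>Chi) = 0.1187 above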

Akaike Information Criterion

The Akaike information criterion (AIC) is an estimator of prediction error and thereby of the relative quality of statistical models for a given data set. A common rule of thumb is that a model whose AIC is more than 2 units lower than another's is considered meaningfully better. We observe that each sequential reduction in AIC from Model 2 to Model 4 exceeds 2, whereas Model 5 improves on Model 4 by less than 2.
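
Concretely, for a model with \(k\) estimated parameters and maximised log-likelihood \(\hat{L}\), \(AIC = 2k - 2\ln\hat{L}\). As a quick sketch, Model 4's AIC can be reproduced from its log-likelihood.

# AIC = 2k - 2*logLik; model_4 has k = 4 coefficients (intercept + 3 slopes)
k <- length(coef(model_4))
as.numeric(2 * k - 2 * logLik(model_4))  # should match AIC(model_4) = 185.68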

aic_values <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
  AIC = c(AIC(model_1), AIC(model_2), AIC(model_3), AIC(model_4), AIC(model_5))
)

print(aic_values)
##     Model      AIC
## 1 Model 1 353.4377
## 2 Model 2 196.8811
## 3 Model 3 191.1057
## 4 Model 4 185.6779
## 5 Model 5 185.2433

Hence, Model 4 is the best model in the sense of AIC.

summary(model_4)
## 
## Call:
## glm(formula = diabetic ~ stab.glu + chol + age, family = "binomial", 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.234230   1.434304  -7.135 9.66e-13 ***
## stab.glu      0.038445   0.005208   7.382 1.56e-13 ***
## chol          0.010338   0.004849   2.132  0.03300 *  
## age           0.034875   0.012987   2.685  0.00725 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 351.44  on 389  degrees of freedom
## Residual deviance: 177.68  on 386  degrees of freedom
## AIC: 185.68
## 
## Number of Fisher Scoring iterations: 6

Hence, our fitted model is \(\text{logit}(p) = -10.234230 + 0.038445 \cdot \text{stab.glu} + 0.010338 \cdot \text{chol} + 0.034875 \cdot \text{age}\), where \(p\) denotes the probability of being diabetic.
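
To illustrate how the fitted model is applied, the sketch below scores a hypothetical individual; the covariate values are invented purely for illustration.

# Estimated probability of diabetes for a hypothetical individual
new_person <- data.frame(stab.glu = 120, chol = 210, age = 55)
predict(model_4, newdata = new_person, type = "response")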

Model Diagnostics

Aside from tests, it is important to interpret the model graphically. We consider the diagnostic plots of the selected models.

From the residuals vs fitted plot, we observe that the residuals are generally close to zero, except for some observations such as the \(188^{th}\), \(323^{rd}\) and \(134^{th}\). These observations are also highlighted in the scale-location and residuals vs leverage plots, which suggests they might be influential (Cook’s distance \(>\) 1).

The Q-Q plot of the residuals suggests that their distribution might be right-tailed.

par(mfrow = c(2, 2))
plot(model_4)
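
To complement the plots, the flagged observations can be listed numerically; a small sketch using base R's influence measures:

# Observations with the largest Cook's distances; values above 1 are
# conventionally treated as clearly influential
cd <- cooks.distance(model_4)
head(sort(cd, decreasing = TRUE), 5)
which(cd > 1)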

Overall, the fitted model can be used to predict the probability of a person being diabetic. However, further work could improve it, such as transforming variables, addressing the influential points and adding interaction terms.

Statistical Tests

We perform several statistical tests to identify significant features for the machine learning classification below. Non-significant variables (gender, location, height, bp.1d and time.ppn) are discarded.

Proportion Test

On gender

Two-sample proportion test on gender: the difference in diabetes prevalence between females and males is not significant.

# Calculate counts and diabetic counts for female and male
female_count <- sum(df$gender == 'female')
male_count <- sum(df$gender == 'male')

female_diab <- sum(df[df$gender == 'female', 'diabetic'])
male_diab <- sum(df[df$gender == 'male', 'diabetic'])

# Function for 2-sample z-test of proportions
z_test_2_sample_proportions <- function(x1, x2, n1, n2, two_tailed = TRUE) {
  avg_p <- (x1 + x2) / (n1 + n2)
  z_val <- (x1/n1 - x2/n2) / sqrt(avg_p * (1 - avg_p) * (1/n1 + 1/n2))
  z_prob <- pnorm(-abs(z_val))

  if (two_tailed) {
    return(c(z_val, 2 * z_prob))
  } else {
    return(c(z_val, z_prob))
  }
}

# Perform the 2-sample z-test
result <- z_test_2_sample_proportions(female_diab, male_diab, female_count, male_count, two_tailed = TRUE)

cat("P-value:", result[2], "\n")
## P-value: 0.5813274
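
As a cross-check, the same comparison can be done with base R's prop.test; without continuity correction its \(\chi^2\) statistic equals the square of our z statistic, so the p-values agree.

# Equivalent two-sample test of proportions in base R (no continuity correction)
prop.test(c(female_diab, male_diab), c(female_count, male_count),
          correct = FALSE)$p.value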

On Location

Proportion test on location: the difference between Buckingham and Louisa is likewise not significant.

Buckingham_count <- sum(df$location == 'Buckingham')
Louisa_count <- sum(df$location == 'Louisa')

Buck_diab <- sum(df[df$location == 'Buckingham', 'diabetic'])
Loui_diab <- sum(df[df$location == 'Louisa', 'diabetic'])
result <- z_test_2_sample_proportions(Buck_diab, Loui_diab, Buckingham_count, Louisa_count, two_tailed = TRUE)

cat("P-value:", result[2], "\n")
## P-value: 0.7170174

\(\chi^2\) Test

We test whether frame is associated with diabetes. The result (p \(\approx\) 0.0045) indicates a statistically significant association between frame and diabetes.

contingency_table <- table(df$frame, df$diabetic)

# Perform chi-squared test
result <- chisq.test(contingency_table)

print(contingency_table)
##         
##            0   1
##   large   75  26
##   medium 155  30
##   small   95   9
cat(paste("P-value:", result$p.value, "\n"))
## P-value: 0.00446088022610374
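
The \(\chi^2\) approximation is reliable when all expected cell counts are at least 5; this can be verified directly from the test object.

# Expected counts under the independence hypothesis; all cells should be >= 5
result$expected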

\(Z\)-Test

From the Z-test results, we see that bp.1d, height and time.ppn are not significant, whereas the other variables are significant.

df_testing <- df[, !(names(df) %in% c('gender', 'location', 'frame'))]


alpha <- 0.05

# Iterate through each variable for z-test
for (column in colnames(df_testing)[-which(colnames(df_testing) == 'diabetic')]) {
  # Separate data into diabetic and non-diabetic groups
  diabetic_data <- df_testing[df_testing$diabetic == 1, column]
  non_diabetic_data <- df_testing[df_testing$diabetic == 0, column]

  # Calculate means
  mean_diabetic <- mean(diabetic_data)
  mean_non_diabetic <- mean(non_diabetic_data)

  # Calculate standard deviations
  std_diabetic <- sd(diabetic_data)
  std_non_diabetic <- sd(non_diabetic_data)

  # Calculate standard error of the mean (SE)
  se_diabetic <- std_diabetic / sqrt(length(diabetic_data))
  se_non_diabetic <- std_non_diabetic / sqrt(length(non_diabetic_data))

  # Calculate z-score
  z_score <- (mean_diabetic - mean_non_diabetic) / sqrt(se_diabetic^2 + se_non_diabetic^2)

  # Two-tailed z-test
  p_value <- 2 * (1 - pnorm(abs(z_score)))

  # Decision Rule: Reject the null hypothesis if p-value is less than alpha
  if (p_value < alpha) {
    cat(paste0(column, ": Reject the null hypothesis. p-value: ", p_value, "\n"))
  } else {
    cat(paste0(column, ": Fail to reject the null hypothesis. p-value: ", p_value, "\n"))
  }

  # Print z-score and p-value
  cat(paste0("  Z-Score: ", z_score, "\n"))
  cat(paste0("  P-Value: ", p_value, "\n\n"))
}
## chol: Reject the null hypothesis. p-value: 0.000447175545025269
##   Z-Score: 3.51055306517855
##   P-Value: 0.000447175545025269
## 
## stab.glu: Reject the null hypothesis. p-value: 0
##   Z-Score: 9.98087339319042
##   P-Value: 0
## 
## hdl: Reject the null hypothesis. p-value: 0.0197696833979846
##   Z-Score: -2.33069053958415
##   P-Value: 0.0197696833979846
## 
## ratio: Reject the null hypothesis. p-value: 4.74826639946802e-05
##   Z-Score: 4.06768423604957
##   P-Value: 4.74826639946802e-05
## 
## age: Reject the null hypothesis. p-value: 1.84297022087776e-14
##   Z-Score: 7.66051514462494
##   P-Value: 1.84297022087776e-14
## 
## height: Fail to reject the null hypothesis. p-value: 0.732303502870469
##   Z-Score: 0.342062971512887
##   P-Value: 0.732303502870469
## 
## weight: Reject the null hypothesis. p-value: 0.00322605834030298
##   Z-Score: 2.9453343563559
##   P-Value: 0.00322605834030298
## 
## bp.1s: Reject the null hypothesis. p-value: 2.4177627828692e-06
##   Z-Score: 4.71494480673251
##   P-Value: 2.4177627828692e-06
## 
## bp.1d: Fail to reject the null hypothesis. p-value: 0.321924953541206
##   Z-Score: 0.99050989817087
##   P-Value: 0.321924953541206
## 
## waist: Reject the null hypothesis. p-value: 4.48922688955911e-06
##   Z-Score: 4.58734304799436
##   P-Value: 4.48922688955911e-06
## 
## hip: Reject the null hypothesis. p-value: 0.00348644799289599
##   Z-Score: 2.92123673078848
##   P-Value: 0.00348644799289599
## 
## time.ppn: Fail to reject the null hypothesis. p-value: 0.471424560575314
##   Z-Score: 0.720163132159828
##   P-Value: 0.471424560575314

all_objects <- ls()

# Objects to keep (add more if needed)
objects_to_keep <- c("df")

# Objects to remove
objects_to_remove <- setdiff(all_objects, objects_to_keep)

# Remove the objects
rm(list = objects_to_remove)

Machine Learning

Before training, we randomly split the dataset into training and testing sets in proportions of 0.8 and 0.2 respectively.

We consider three popular machine learning models: Gaussian Naive Bayes, Support Vector Machine (SVM) and k-Nearest Neighbours (kNN).

The best model will be selected using performance metrics such as accuracy, precision, recall and F1-score. We put more emphasis on recall, as a false-negative diabetes prediction could delay treatment and lead to serious complications.
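
For a chosen positive class, with \(TP\) true positives, \(FP\) false positives and \(FN\) false negatives, these metrics are \(\text{precision} = \frac{TP}{TP+FP}\), \(\text{recall} = \frac{TP}{TP+FN}\) and \(F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\).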

# Encode frame as ordered numeric (small = 1, medium = 2, large = 3) so that
# distance-based models such as kNN can use it
df$frame <- as.numeric(factor(df$frame, levels = c('small', 'medium', 'large')))
# Drop the variables found non-significant in the statistical tests
df <- df[, !(names(df) %in% c('location', 'height', 'bp.1d', 'time.ppn', 'gender'))]
# Split the data into training and testing sets
set.seed(333)
splitIndex <- createDataPartition(df$diabetic, p = 0.8, list = FALSE)
train_data <- df[splitIndex, ]
test_data <- df[-splitIndex, ]
test_data$diabetic <- as.factor(test_data$diabetic)

# Define the models
models <- list(
  'Gaussian Naive Bayes' = naiveBayes(diabetic ~ ., data = train_data),
  'Support Vector Machine' = svm(diabetic ~ . , data = train_data, type = 'C'),
  'kNN' = knn(train = train_data[, -ncol(train_data)], test = test_data[, -ncol(test_data)], cl = train_data$diabetic, k = 5)
)

# Create a data frame to store the results
results_df <- data.frame(Model = character(),
                         Accuracy = numeric(),
                         Precision = numeric(),
                         Recall = numeric(),
                         F1_Score = numeric(),
                         stringsAsFactors = FALSE)

# Train and evaluate models
for (model_name in names(models)) {
  model <- models[[model_name]]
  
  if(model_name == 'Gaussian Naive Bayes' | model_name == 'Support Vector Machine' ){
  # Make predictions on the test set
  y_pred <- as.factor(predict(model, newdata = test_data, type = 'class' ))
  #y_pred <- as.factor(predict(model, newdata = test_data, type = 'response' ) > 0.5)
  # Calculate evaluation metrics
  # NOTE: confusionMatrix treats the first factor level ("0") as the positive
  # class by default, so Sensitivity ("Recall") here refers to the non-diabetic class
  cm <- confusionMatrix(y_pred, test_data$diabetic)
  accuracy <- cm$overall['Accuracy'] * 100
  precision <- cm$byClass['Pos Pred Value'] * 100
  recall <- cm$byClass['Sensitivity'] * 100
  f1 <- cm$byClass['F1'] * 100
  
  # Print results
  cat(paste0(model_name, ":\n"))
  cat(paste0("  Accuracy: ", accuracy, "\n"))
  cat(paste0("  Precision: ", precision, "\n"))
  cat(paste0("  Recall: ", recall, "\n"))
  cat(paste0("  F1-Score: ", f1, "\n"))
  
  # Add results to the data frame
  results_df <- rbind(results_df, data.frame(Model = model_name,
                                             Accuracy = accuracy,
                                             Precision = precision,
                                             Recall = recall,
                                             F1_Score = f1))
  
  # Print Confusion Matrix
  cat("  Confusion Matrix:\n")
  print(table(y_pred, test_data$diabetic))
  cat("\n")
  } else if(model_name == 'kNN'){
    
    y_pred <- model
    cm <- confusionMatrix(y_pred, test_data$diabetic)
    accuracy <- cm$overall['Accuracy'] * 100
    precision <- cm$byClass['Pos Pred Value'] * 100
    recall <- cm$byClass['Sensitivity'] * 100
    f1 <- cm$byClass['F1'] * 100
    cat(paste0(model_name, ":\n"))
    cat(paste0("  Accuracy: ", accuracy, "\n"))
    cat(paste0("  Precision: ", precision, "\n"))
    cat(paste0("  Recall: ", recall, "\n"))
    cat(paste0("  F1-Score: ", f1, "\n"))
    
    # Add results to the data frame
    results_df <- rbind(results_df, data.frame(Model = model_name,
                                               Accuracy = accuracy,
                                               Precision = precision,
                                               Recall = recall,
                                               F1_Score = f1))
    
    # Print Confusion Matrix
    cat("  Confusion Matrix:\n")
    print(table(y_pred, test_data$diabetic))
    cat("\n")
  }
}
## Gaussian Naive Bayes:
##   Accuracy: 93.5897435897436
##   Precision: 94.2028985507246
##   Recall: 98.4848484848485
##   F1-Score: 96.2962962962963
##   Confusion Matrix:
##       
## y_pred  0  1
##      0 65  4
##      1  1  8
## 
## Support Vector Machine:
##   Accuracy: 92.3076923076923
##   Precision: 91.6666666666667
##   Recall: 100
##   F1-Score: 95.6521739130435
##   Confusion Matrix:
##       
## y_pred  0  1
##      0 66  6
##      1  0  6
## 
## kNN:
##   Accuracy: 93.5897435897436
##   Precision: 94.2028985507246
##   Recall: 98.4848484848485
##   F1-Score: 96.2962962962963
##   Confusion Matrix:
##       
## y_pred  0  1
##      0 65  4
##      1  1  8
# Print Results Summary 
print("Results Summary:");print(results_df)
## [1] "Results Summary:"
##                            Model Accuracy Precision    Recall F1_Score
## Accuracy    Gaussian Naive Bayes 93.58974  94.20290  98.48485 96.29630
## Accuracy1 Support Vector Machine 92.30769  91.66667 100.00000 95.65217
## Accuracy2                    kNN 93.58974  94.20290  98.48485 96.29630
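
One caveat before interpreting these figures: since confusionMatrix uses the first factor level ("0", non-diabetic) as the positive class by default, the Recall column above is the sensitivity for the non-diabetic class. Recall on the diabetic class can be read off the confusion matrices printed above; the sketch below recomputes it from those counts.

# Diabetic-class recall = TP / (TP + FN), taken from the confusion matrices above
recall_diabetic <- c('Gaussian Naive Bayes' = 8 / (8 + 4),
                     'Support Vector Machine' = 6 / (6 + 6),
                     'kNN' = 8 / (8 + 4))
round(100 * recall_diabetic, 1)  # NB/kNN: 66.7%, SVM: 50%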

Key Insights:

SVM had a perfect reported recall (100%) but lower precision. Bear in mind that this recall measures how completely non-diabetic individuals are identified (positive class "0"); SVM's confusion matrix shows it recovers only 6 of the 12 diabetic cases.

Naive Bayes and kNN had the highest precision, meaning their positive predictions tend to be accurate; they also recover more of the diabetic cases (8 of 12) than SVM.

All models achieved over 90% accuracy, indicating good overall performance on this dataset. The top performers were Naive Bayes and kNN with approximately 94% accuracy; their precision, recall and F1 scores were identical.

The high performance of the models suggests these models can effectively predict diabetes using the given risk factors. Additional validation on new datasets would be beneficial.

To conclude, we deduce that the Support Vector Machine (SVM) is the best model of the three by the reported recall. Note, however, that because the reported recall here refers to the non-diabetic class, the models should also be compared on diabetic-class recall, where Naive Bayes and kNN (8 of 12 cases) outperform SVM (6 of 12).

Conclusion and Future Work

We have identified significant risk factors of diabetes via statistical analyses, built a parsimonious logistic regression model for estimating the risk of diabetes, and developed an effective machine learning model for predicting diabetes from inputs obtainable via IoT devices.

However, there are several directions for future work.

Firstly, we could consider effect sizes to quantify the strength of the diabetes risk factors. For the regression model, we could address influential points or apply transformations to obtain a more robust fit. Lastly, we could improve the machine learning models by addressing the class imbalance (325 non-diabetic versus 65 diabetic patients), for example by oversampling the minority class, performing hyperparameter tuning, and using cross-validation for a more reliable empirical estimate of performance.
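
As an illustrative sketch of the oversampling idea, using the ROSE package loaded earlier (assuming train_data from the machine learning section, with diabetic converted to a factor):

# Oversample the minority (diabetic) class until the classes are balanced
train_data$diabetic <- as.factor(train_data$diabetic)
balanced_train <- ovun.sample(diabetic ~ ., data = train_data,
                              method = "over", seed = 333)$data
table(balanced_train$diabetic)  # class counts after oversampling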