Group 4

Student Name Student Id Occurrence
Yew Wei Hong 22087124 RL
Qu Han Lin 22119754 OCC3
Yu Zi Qun 22111261 OCC3
Loh Siew Chin 23054980 RL

Title

Developing a Diagnostic Tool for Diabetes based on Data Analysis

Introduction

Diabetes is a prevalent chronic disease affecting individuals across diverse demographics and age groups, characterized by elevated blood sugar levels. It poses significant health risks and can lead to severe complications if left unmanaged. Our project, 'Developing a Diagnostic Tool for Diabetes based on Data Analysis', harnesses the power of data analytics to proactively identify individuals at risk of developing diabetes.

This project aims to leverage data analytics to forecast and preemptively identify individuals at risk of diabetes onset, using the diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. Machine learning and statistical models are applied to this comprehensive dataset to analyze various health parameters, such as BMI, glucose level, and blood pressure. By scrutinizing these data points, the project seeks to establish predictive models capable of identifying potential diabetic conditions in individuals. The ultimate goal is to develop a robust predictive system that aids in early diagnosis, thereby facilitating proactive healthcare interventions and personalized treatment plans to mitigate the risk and impact of diabetes.


Objective

To explore, clean, and analyze the diabetes dataset, and to build regression and classification models that quantify how individual medical predictor factors influence diabetes diagnosis and support its early prediction.

Dataset

The Diabetes Dataset, sourced from the National Institute of Diabetes and Digestive and Kidney Diseases, is designed to facilitate the diagnostic prediction of diabetes in patients. The dataset specifically focuses on females aged at least 21 years with Pima Indian heritage. The instances included in the dataset were selected with certain constraints from a larger database.

The dataset comprises several independent variables representing medical predictor factors, such as BMI and glucose, together with a single dependent target variable known as “Outcome”, allowing for comprehensive analysis and exploration of the relationships between the predictor variables and the diabetes outcome.


Metadata

Columns Description Units of measurement/Calculation
Pregnancies Number of times pregnant None
Glucose Plasma glucose concentration at 2 hours in an oral glucose tolerance test None
BloodPressure Diastolic blood pressure mm Hg
SkinThickness Triceps skin fold thickness mm
Insulin 2-Hour serum insulin mu U/ml
BMI Body Mass Index weight in kg/(height in m)^2
DiabetesPedigreeFunction Diabetes pedigree function None
Age Age of the individual years
Outcome Outcome of Diabetes 0 = No Diabetes / 1 = Diabetes

Import Data

First, we need to install the packages required for this project's data exploration, cleaning, and analysis processes.

install.packages("tidyverse")
install.packages("caret")
install.packages("ggplot2")
install.packages("corrplot")
install.packages("reshape2")
install.packages("psych")
install.packages("skimr")
install.packages("dplyr")
install.packages("e1071")

We will then load the required packages into our R session.

Package Description
tidyverse A collection of R packages (including ggplot2, dplyr, tidyr, and others) designed for data manipulation and visualization. Tidyverse follows a consistent and efficient approach to data analysis.
caret Short for Classification And REgression Training, the caret package provides a unified interface for various machine learning models. It streamlines the process of model training, evaluation, and comparison.
ggplot2 Part of the tidyverse, ggplot2 is a powerful and flexible package for creating static and dynamic data visualizations. It follows the Grammar of Graphics principles for creating graphics layer by layer.
corrplot corrplot is used for visualizing correlation matrices. It provides a variety of options for plotting correlations, including color-coded cells and hierarchical clustering.
reshape2 reshape2 is designed for reshaping data frames, particularly for converting between wide and long formats. It provides functions like melt and cast for efficient data manipulation.
psych The psych package includes various functions for psychological and psychometric research. It provides tools for factor analysis, clustering, and other statistical analyses often used in psychology.
skimr skimr is a package for efficient and flexible summary statistics and exploratory data analysis. It provides functions like skim() to generate informative summaries of variables, making it easy to understand the characteristics of a dataset.
dplyr dplyr is a key component of the tidyverse and offers a set of functions for data manipulation. It includes verbs like filter, mutate, and summarise, providing a consistent and expressive grammar for handling data frames.
e1071 The e1071 package supports various statistical and machine learning methods, including support vector machines (SVM) for classification and regression tasks. It also provides the tune() helper used for hyperparameter search in this project.
library(tidyverse)
library(caret)
library(ggplot2)
library(corrplot)
library(reshape2)
library(psych)
library(skimr)
library(dplyr)
library(e1071)

We will also import the diabetes dataset (diabetes.csv) into the diabetes.data variable using the read.csv() function.

diabetes.data <- read.csv("/content/diabetes.csv")

Raw Data Exploration

Dimension

To begin exploring the raw diabetes data, we first get the dimensions of the dataset through the dim() function.

dim(diabetes.data)
  1. 768
  2. 9

The dataset contains 768 rows and 9 columns.

Content

We will also get the column names of our dataset using colnames().

colnames(diabetes.data)
  1. ‘Pregnancies’
  2. ‘Glucose’
  3. ‘BloodPressure’
  4. ‘SkinThickness’
  5. ‘Insulin’
  6. ‘BMI’
  7. ‘DiabetesPedigreeFunction’
  8. ‘Age’
  9. ‘Outcome’

The head() function allows us to peek through the first six rows of our dataset.

head(diabetes.data)
A data.frame: 6 × 9
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
<int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
1 6 148 72 35 0 33.6 0.627 50 1
2 1 85 66 29 0 26.6 0.351 31 0
3 8 183 64 0 0 23.3 0.672 32 1
4 1 89 66 23 94 28.1 0.167 21 0
5 0 137 40 35 168 43.1 2.288 33 1
6 5 116 74 0 0 25.6 0.201 30 0

Structure

Additionally, we can choose to thoroughly examine the entire dataset to extract information. This will provide some dataset key attributes such as the number of rows, columns, column names, data types, and the initial data entries for each column.

glimpse(diabetes.data)
Rows: 768
Columns: 9
$ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
$ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
$ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
$ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
$ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
$ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
$ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
$ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
$ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

From the glimpse result, we can observe that the dataset contains 768 rows and 9 columns.

The columns' data types are as follows:

Column Name Data Type
Pregnancies Integer
Glucose Integer
BloodPressure Integer
SkinThickness Integer
Insulin Integer
BMI Double
DiabetesPedigreeFunction Double
Age Integer
Outcome Integer

We use sum() together with is.na() to count the missing values in the dataset.

sum(is.na(diabetes.data))

0

There are no NA values in our dataset.

Class Imbalance

In our project, we also determine whether the classes of the target variable are unevenly distributed.

# get class distribution
class_distribution <- table(diabetes.data$Outcome)

# Create a data frame with column names
class_distribution_df <- data.frame(Class = names(class_distribution), Count = as.vector(class_distribution))

# Display class distribution
print(class_distribution_df)
  Class Count
1     0   500
2     1   268

The findings indicate an imbalance in our dataset, with the non-diabetic class being nearly twice as prevalent as the diabetic class. However, it is deemed acceptable to overlook this imbalance, as attempting to balance the dataset may introduce bias and result in information loss. Moreover, it aligns with the real-world scenario where the number of non-diabetic individuals typically exceeds that of diabetic individuals.
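
For reference, the imbalance can also be visualised as a bar chart (a quick sketch reusing the class_distribution_df created above):

# Bar chart of the Outcome class counts
ggplot(class_distribution_df, aes(x = Class, y = Count)) +
  geom_bar(stat = 'identity', fill = 'skyblue', color = 'black') +
  labs(title = 'Class Distribution of Diabetes Outcome', x = 'Outcome', y = 'Count')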

Measure of Dispersion / Measure of Central Tendency

Dispersion refers to the extent to which a dataset’s values deviate from the central tendency, providing insights into the spread or variability of the data. Measures of dispersion quantify this spread and help analyze the distribution of values within a dataset.

Measures of central tendency are statistical measures that describe the center or typical value of a dataset. These measures provide insights into where the data tends to cluster or concentrate. The three primary measures of central tendency are the mean, median, and mode.
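
As a quick illustration before the full summaries, the sketch below computes the three measures for a single column; base R has no built-in mode function, so we define a small helper (get_mode, our own addition) for it:

# Helper returning the most frequent value of a vector (the mode)
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mean(diabetes.data$Glucose)     # arithmetic mean
median(diabetes.data$Glucose)   # middle value
get_mode(diabetes.data$Glucose) # most frequent value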

We will be utilizing summary() and skim() to compute measures of dispersion and central tendency for this dataset.

summary(diabetes.data)
  Pregnancies        Glucose      BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  
skim(diabetes.data)
── Data Summary ────────────────────────
                           Values       
Name                       diabetes.data
Number of rows             768          
Number of columns          9            
_______________________                 
Column type frequency:                  
  numeric                  9            
________________________                
Group variables            None         

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable            n_missing complete_rate    mean      sd     p0    p25
1 Pregnancies                      0             1   3.85    3.37   0      1    
2 Glucose                          0             1 121.     32.0    0     99    
3 BloodPressure                    0             1  69.1    19.4    0     62    
4 SkinThickness                    0             1  20.5    16.0    0      0    
5 Insulin                          0             1  79.8   115.     0      0    
6 BMI                              0             1  32.0     7.88   0     27.3  
7 DiabetesPedigreeFunction         0             1   0.472   0.331  0.078  0.244
8 Age                              0             1  33.2    11.8   21     24    
9 Outcome                          0             1   0.349   0.477  0      0    
      p50     p75   p100 hist 
1   3       6      17    ▇▃▂▁▁
2 117     140.    199    ▁▁▇▆▂
3  72      80     122    ▁▁▇▇▁
4  23      32      99    ▇▇▂▁▁
5  30.5   127.    846    ▇▁▁▁▁
6  32      36.6    67.1  ▁▃▇▂▁
7   0.372   0.626   2.42 ▇▃▁▁▁
8  29      41      81    ▇▃▁▁▁
9   0       1       1    ▇▁▁▁▅




Data Cleaning

Zero values imputation

Upon reviewing our raw data, it becomes evident that our dataset does not contain any missing values. The subsequent task involves assessing the prevalence of zero values in the dataset.

# loop and get the sum of 0 values count
for(i in colnames(diabetes.data)){
  print(paste('The number of 0(s) in the column', i ,'is',sum(diabetes.data[i]==0)))
}
[1] "The number of 0(s) in the column Pregnancies is 111"
[1] "The number of 0(s) in the column Glucose is 0"
[1] "The number of 0(s) in the column BloodPressure is 0"
[1] "The number of 0(s) in the column SkinThickness is 0"
[1] "The number of 0(s) in the column Insulin is 0"
[1] "The number of 0(s) in the column BMI is 0"
[1] "The number of 0(s) in the column DiabetesPedigreeFunction is 0"
[1] "The number of 0(s) in the column Age is 0"
[1] "The number of 0(s) in the column Outcome is 500"

The outcome reveals that certain columns, namely glucose, blood pressure, skin thickness, insulin, and BMI, contain zero values in their measurements, which is not physiologically plausible.

We intend to move forward by replacing the 0 values in these columns through mean imputation.

First, we will be replacing the column values of 0 with NA (Not Available).

# Replace 0s with NA in specific columns
zeros_cols <- c('Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI')
diabetes.data[zeros_cols] <- lapply(diabetes.data[zeros_cols], function(x) ifelse(x == 0, NA, x))
# Check for missing values
missing_values <- colSums(is.na(diabetes.data))
print(missing_values)
             Pregnancies                  Glucose            BloodPressure 
                       0                        5                       35 
           SkinThickness                  Insulin                      BMI 
                     227                      374                       11 
DiabetesPedigreeFunction                      Age                  Outcome 
                       0                        0                        0 

Assessing the count of NA values per column, it is noteworthy that both SkinThickness and Insulin exhibit a relatively high number of missing values.

We will proceed by replacing the NA values in each column with that column's mean value.

# Replace missing values with the column mean
replace_with_mean <- function(x) {
  ifelse(is.na(x), mean(x, na.rm = TRUE), x)
}

# Apply the imputation to every column that previously contained zeros
diabetes.data[zeros_cols] <- lapply(diabetes.data[zeros_cols], replace_with_mean)
# Check for missing values after imputation
missing_values <- colSums(is.na(diabetes.data))
print(missing_values)
             Pregnancies                  Glucose            BloodPressure 
                       0                        0                        0 
           SkinThickness                  Insulin                      BMI 
                       0                        0                        0 
DiabetesPedigreeFunction                      Age                  Outcome 
                       0                        0                        0 

We will evaluate the range of values within the columns after the imputation process.

# Check summary statistics of the cleaned dataset
summary_stats <- summary(diabetes.data)
print(summary_stats)
  Pregnancies        Glucose       BloodPressure    SkinThickness  
 Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
 1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:25.00  
 Median : 3.000   Median :117.00   Median : 72.20   Median :29.15  
 Mean   : 3.845   Mean   :121.69   Mean   : 72.41   Mean   :29.15  
 3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
    Insulin           BMI        DiabetesPedigreeFunction      Age       
 Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
 1st Qu.:121.5   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
 Median :155.5   Median :32.40   Median :0.3725           Median :29.00  
 Mean   :155.5   Mean   :32.46   Mean   :0.4719           Mean   :33.24  
 3rd Qu.:155.5   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
    Outcome     
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.349  
 3rd Qu.:1.000  
 Max.   :1.000  

Based on the results, the range of values in each column now falls within a logical and appropriate range after imputation.
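
As a quick sanity check, the minimum and maximum of each imputed column can also be inspected directly (a sketch reusing the zeros_cols vector defined earlier):

# Minimum and maximum of each previously zero-inflated column
sapply(diabetes.data[zeros_cols], range)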

Data Transformation

Data Scaling

Given that all of the columns in our dataset consist of numerical values, we will examine the range of each column and determine whether any transformations are necessary. We will be using a boxplot to visualise the data ranges.

# Selecting the columns
selected_cols <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                      "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")

# Reshaping data for ggplot
data_long <- melt(diabetes.data[, selected_cols])

# Creating boxplots using ggplot2
ggplot(data_long, aes(x = variable, y = value)) +
  geom_boxplot() +
  labs(x = "Variables", y = "Values") +
  ggtitle("Boxplots of Selected Features")
No id variables; using all as measure variables

Observing the plot, it is evident that the Insulin column exhibits a significantly broader range than the other columns. These distinct ranges emphasize the necessity of scaling to ensure uniformity and improved model performance.

We will be using the preProcess() function with method = c(“center”, “scale”) to center and scale the numerical columns in the diabetes.data dataset. The predict() function will then be utilized to apply the scaling transformation via the resulting preprocess_params object.

# Select columns to scale (excluding the Outcome column if it's the label)
columns_to_scale <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                      "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")

# Create a preProcess object to scale the data
preprocess_params <- preProcess(diabetes.data[columns_to_scale], method = c("center", "scale"))

# Apply the scaling transformation to the data
scaled_data <- predict(preprocess_params, newdata = diabetes.data[columns_to_scale])

# Combine scaled columns with other non-scaled columns
df <- cbind(diabetes.data["Outcome"], scaled_data)
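
To confirm the transformation behaved as expected, each scaled column should now have a mean of approximately 0 and a standard deviation of approximately 1 (a quick sketch):

# Verify centering and scaling
round(colMeans(scaled_data), 3)   # means should be ~0
round(sapply(scaled_data, sd), 3) # standard deviations should be ~1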

Target Variable

As our project encompasses two distinct analysis tasks, regression and classification, we will maintain two separate dataframes, each with the target variable formatted to suit its specific use case: df for regression and classification_df for classification.

# Copy df for the classification task (df itself is used for regression)
classification_df <- data.frame(df)

Factor transformation

Factor transformation refers to the conversion of a variable into a factor. A factor is a categorical variable that can take on a limited, fixed set of values, often representing different categories or levels.

We will be using factor transformation on the target variable Outcome for the classification task.

# Convert Outcome to a factor as it's target variable
classification_df$Outcome <- factor(classification_df$Outcome, levels = c(0, 1), labels = c("non.diabetic", "diabetic"))

Currently, we are maintaining two distinct datasets, each tailored to a different task with its own target variable format.

Exploratory Data Analysis

Correlation Plot

During the Exploratory Data Analysis phase, we are examining the correlation among various numerical data within our dataset.

numeric_data <- df[sapply(df, is.numeric)]

# Compute the correlation matrix
correlation_matrix <- cor(numeric_data)

# Convert the correlation matrix to a long format for ggplot
correlation_long <- melt(correlation_matrix)

# Plot using ggplot with a larger size
ggplot(correlation_long, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_fixed() +
  labs(title = "Diabetes Correlation Heatmap") +
  theme(plot.title = element_text(size = 20)) +
  theme(legend.title = element_blank()) +
  theme(legend.text = element_text(size = 15))

The correlation plot reveals notable correlations between certain variable pairs: Glucose with Outcome, Age with Pregnancies, Insulin with Glucose, and BMI with SkinThickness.
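
For exact values, the strongest pairwise correlations can be listed directly from the melted correlation matrix (a sketch reusing the correlation_long object created above):

# List the five strongest pairwise correlations, excluding self-correlations
# and duplicate mirrored pairs
correlation_long %>%
  filter(as.character(Var1) < as.character(Var2)) %>%
  arrange(desc(abs(value))) %>%
  head(5)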

Pair Plot

We can also uncover the relationships between various variables by employing a pair plot. This visualization tool enables the exploration of correlations and patterns among different pairs of variables, offering a comprehensive view of their interactions.

# Creating the pairs plot
pairs(df)

BMI Density

Distinct patterns in BMI distribution can be discerned through density plots based on different diabetes outcomes.

ggplot(diabetes.data, aes(x = BMI, fill = as.factor(Outcome))) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of BMI by Diabetes Outcome")

The density plot illustrates that individuals with elevated BMI are more likely to develop diabetes.

Age

The histogram provides a visual representation of the age disparity among various diabetes outcomes.

ggplot(diabetes.data, aes(x = Age, fill = as.factor(Outcome))) +
  geom_histogram(binwidth = 5, position = "dodge", alpha = 0.7) +
  labs(title = "Histogram of Age by Diabetes Outcome")

The histogram indicates that diabetes frequently develops during middle age.

Data Analysis

Impact of individual medical predictor factors other than glucose on diabetes diagnosis

Questions

What factors other than glucose contribute to the prediction of diabetes, and how do these factors, individually and collectively, influence the accuracy of diabetes diagnosis?

Objective

To assess the factors other than glucose that influence diabetes prediction, and to gain insight into their individual and collective impact on diabetes diagnosis using a regression model.

The regression dataframe (df) from the Data Transformation stage is used to perform the regression task.

First, the dataset will be split into 80% training data and 20% testing data.

# Drop the Glucose column, then separate the predictors from the target variable
df <- df[, !(names(df) %in% 'Glucose')]
X <- df[, names(df) != 'Outcome']

# Extract the target variable
y <- df$Outcome

# Set seed for reproducibility
set.seed(42)

# Generate random indices for training set
train_indices <- sample(1:nrow(df), 0.8 * nrow(df))

# Create training and testing sets
train <- df[train_indices, ]
X_test <- X[-train_indices, ]
y_test <- y[-train_indices]

The dataset obtained from the previous train-test split will be utilized for training and making predictions using the linear regression model.

# Build a linear regression model
lm_model <- lm(Outcome ~ ., data = train)

# Make predictions on the test set
predictions <- predict(lm_model, newdata = X_test)

The accuracy of the model will be assessed by evaluating the predictions using RMSE and R-squared with the test target variable (y data).

# Evaluate the model
rmse <- sqrt(mean((predictions - y_test)^2))
r_squared <- cor(predictions, y_test)^2

cat("Root Mean Squared Error (RMSE):", rmse, "\n")
cat("R-squared:", r_squared, "\n")
Root Mean Squared Error (RMSE): 0.4316718 
R-squared: 0.1902614 

The RMSE of 0.4316718 suggests a moderate level of prediction error.

The R-squared value of 0.1902614 indicates that the model explains only about 19% of the variance in the target variable.

The linear regression coefficients associated with each predictor variable will be assessed to evaluate the impact of different factors on the diagnosis of diabetes.

coefficients <- coef(lm_model)
print(coefficients)

coef_df <- data.frame(
  Feature = names(coefficients)[-1],
  Coefficient = coefficients[-1]
)

ggplot(coef_df, aes(x = Feature, y = Coefficient)) +
  geom_bar(stat = 'identity', fill = 'skyblue', color = 'black') +
  labs(title = 'Coefficients of Linear Regression Model',
       x = 'Feature', y = 'Coefficient') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
             (Intercept)              Pregnancies            BloodPressure 
              0.35080034               0.08263249               0.01776218 
           SkinThickness                  Insulin                      BMI 
              0.02220537               0.06400394               0.10883709 
DiabetesPedigreeFunction                      Age 
              0.05232888               0.04592340 

The coefficient plot reveals that every variable positively influences diabetes prediction. BMI exhibits the most substantial impact on diabetes diagnosis, while blood pressure has the least influence.

Based on the results, the health predictor factors can be ranked by their influence on diabetes prediction as follows (the ranking is reproduced programmatically after the list):

  1. BMI
  2. Pregnancies
  3. Insulin
  4. DiabetesPedigreeFunction
  5. Age
  6. SkinThickness
  7. BloodPressure
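
This ordering can be reproduced from the fitted coefficients (a sketch reusing the coef_df data frame created above):

# Rank the features by their linear regression coefficient, largest first
coef_df$Feature[order(-coef_df$Coefficient)]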

Diabetes prediction classifier based on different medical predictor factors

Questions

What can be achieved in diabetes diagnosis using the diabetes dataset?

Objective

To develop a predictive model for diagnosing diabetes based on medical predictor factors.

The classification dataframe (classification_df) from the Data Transformation stage is used to perform the classification task.

First, the classification dataset will be split into 75% training data and 25% testing data.

# Separate the predictors from the target variable (Glucose is retained here)
X <- classification_df[, names(classification_df) != 'Outcome']

# Extract the target variable
y <- classification_df$Outcome

# Set seed for reproducibility
set.seed(42)

# Generate random indices for training set
# Create a stratified train-test split
indices <- createDataPartition(classification_df$Outcome, p = 0.75, list = FALSE)

# Create training and testing sets
train <- classification_df[indices,]

X_test <- X[-indices,]
y_test <- y[-indices]

The Support Vector Machine (SVM) is chosen as the classification model for this data analysis task. SVM is a supervised machine learning algorithm used for both classification and regression tasks, and it is particularly well-suited for binary classification problems. SVM aims to find a hyperplane that best separates the data into distinct classes.

During model training, the SVM hyperparameters will be tuned via a grid search with cross-validation to obtain the best model for prediction. The hyperparameters for the SVM are as follows:

Hyperparameter Description
C (cost in e1071) Controls the trade-off between a smooth decision boundary and accurate classification. Small C values result in a larger-margin hyperplane, allowing some misclassifications.
Gamma (the RBF width parameter, called sigma in some libraries) Determines the width of the radial basis function (RBF) kernel and influences the flexibility of the decision boundary. Larger gamma values yield a more flexible boundary that captures intricate patterns.

The best model found by tuning will be assigned to best_svm_model and utilized for prediction.

# Define the tuning parameter grid (e1071 names these parameters cost and gamma)
svm_tune_grid <- list(
  cost = c(0.1, 1, 10),
  gamma = c(0.01, 0.1, 1)
)

# Tune the SVM model over the grid using 5-fold cross-validation
svm_tune <- tune(
  svm,
  Outcome ~ .,
  data = train,
  ranges = svm_tune_grid,
  kernel = "radial",
  tunecontrol = tune.control(sampling = "cross", cross = 5)
)

# Obtain the best model from tuning
best_svm_model <- svm_tune$best.model

The SVM model will be utilized to make predictions on the test set. The predictive outcomes will be assessed using a confusion matrix.

# Make predictions on the test set
predictions <- predict(best_svm_model, X_test)

# Assuming 'diabetic' is the positive class
conf_matrix <- confusionMatrix(predictions, y_test, positive = 'diabetic')

# Confusion matrix plot (plotting a table produces a mosaic plot whose
# rows and columns are labelled automatically)
conf_matrix_table <- as.table(conf_matrix$table)
plot(conf_matrix_table, col = c("lightblue", "lightcoral"),
     main = "Confusion Matrix")

# Extract metrics
accuracy <- conf_matrix$overall["Accuracy"]
recall <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]

# Print the metrics
cat("Accuracy:", accuracy, "\n")
cat("Recall (Sensitivity):", recall, "\n")
cat("Specificity:", specificity, "\n")
Accuracy: 0.7604167 
Recall (Sensitivity): 0.5671642 
Specificity: 0.864 

According to the evaluation outcome, our classification model demonstrated satisfactory performance in predicting diabetes, attaining an overall accuracy of 76.04%. Additionally, the model exhibited an elevated specificity of 86.4%: out of all the actual non-diabetic cases, the model correctly identified 86.4% as non-diabetic.

A higher specificity indicates that the model has a lower rate of false positives, meaning it is good at correctly classifying instances that do not belong to the positive class (in this case, non-diabetic). It is an important metric, especially when the consequences of false positives are significant. In the medical context of diabetes prediction, a high specificity suggests that the model is effective at minimizing the chances of misclassifying non-diabetic individuals as diabetic.
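
As a sanity check, specificity can be recomputed by hand from the confusion matrix, since specificity = TN / (TN + FP) (a sketch; rows of conf_matrix$table are predictions, columns are the reference labels):

# Recompute specificity from the confusion matrix counts
tn <- conf_matrix$table["non.diabetic", "non.diabetic"]  # true negatives
fp <- conf_matrix$table["diabetic", "non.diabetic"]      # false positives
tn / (tn + fp)  # should match the reported specificity of 0.864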

Recall

Upon examining the SVM model performance, it became apparent that the recall achieved a suboptimal value, registering at merely 56.72%. This prompts us to conduct a more in-depth examination of the dataset to unearth the contributing factors leading to this relatively low recall. By delving deeper into the data, we aim to uncover insights that may elucidate and address the challenges encountered in achieving a higher recall rate. This investigative process is crucial for refining and optimizing the model’s performance, particularly in correctly identifying instances of diabetes, which holds significant importance in a medical context.

# 'diabetic' is the positive class
train_class_distribution <- table(train$Outcome)
test_class_distribution <- table(y_test)

# Display class distribution for the train set
print("Train Set Class Distribution:")
print(train_class_distribution)

# Display class distribution for the test set
print("Test Set Class Distribution:")
print(test_class_distribution)
[1] "Train Set Class Distribution:"

non.diabetic     diabetic 
         375          201 
[1] "Test Set Class Distribution:"
y_test
non.diabetic     diabetic 
         125           67 

The class distribution in the training set indicates that there are 375 instances of non-diabetic cases and 201 instances of diabetic cases. In the test set, there are 125 instances of non-diabetic cases and 67 instances of diabetic cases. Relating this distribution to the low recall observed in the SVM model evaluation, it suggests that the relatively fewer instances of diabetic cases in both the training and test sets could be a contributing factor. The model might not have sufficient data to effectively learn and generalize patterns associated with diabetic cases, leading to a lower recall rate. Addressing this issue may involve strategies such as obtaining more data, applying resampling techniques, or adjusting the model parameters to better accommodate the imbalanced class distribution.
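
As an illustration of the resampling option mentioned above, the minority class in the training set could be upsampled with caret's upSample() before refitting the SVM; this is a sketch only and was not run as part of this project:

# Duplicate minority-class rows until both classes are equal in size
balanced_train <- upSample(
  x = train[, names(train) != "Outcome"],
  y = train$Outcome,
  yname = "Outcome"
)
table(balanced_train$Outcome)  # both classes should now match the majority count (375)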

Conclusion

In conclusion, this project encompasses comprehensive data processing, cleaning, and analysis tasks, utilizing exploratory data analysis, regression, and classification approaches. In the regression analysis, the focus is on assessing the influence of the medical predictor factors, excluding glucose, on diabetes diagnosis. The impact of each predictor is quantified through coefficients, shedding light on their relative importance. In the classification analysis, an effective classifier is developed to diagnose diabetes based on the identified health predictor factors. The project aims to provide valuable insights into the factors influencing diabetes and to develop robust models for predictive purposes.

Moving forward, the project aims to expand its dataset to include more instances and features, ensuring a thorough examination to enhance the robustness of the model. The primary objective is to elevate the recall score, thereby reducing Type II errors (missed diabetic cases). Additionally, there are plans to conduct more extensive analyses, incorporating multiple models to further refine the understanding and prediction of diabetes. This iterative approach aims to continually improve the accuracy and reliability of the developed models.