Introduction

What is Diabetes?

Diabetes is a chronic metabolic disease in which an individual’s blood glucose level is too high and the body has difficulty converting glucose to energy.
The two main forms are Type 1 and Type 2.
Type 1 Diabetes is a condition in which the pancreas produces little or no insulin, whereas Type 2 Diabetes occurs when the body does not make enough insulin or is resistant to insulin (World Health Organization, 2021).
Diabetes is a major health concern in many countries and is one of the leading causes of death in the world.

Is there a relationship between Diabetes and Body Mass Index (BMI)?

Several research studies have found that BMI has a strong relationship to diabetes and insulin resistance and that having a higher amount of body fat increases the chance of developing Type 2 Diabetes (Al-Goblan et al., 2014).
Typically, the likelihood of being diabetic is higher among overweight and obese individuals than among non-overweight individuals (Gupta & Bansal, 2021).

Problem Statement

The aim of this study is to investigate whether mean BMI differs between diabetic and non-diabetic females in a sample.

As we are looking at two different groups which are independent of each other, an independent two-sample \(\small t\)-test will be used for hypothesis testing to compare the difference of mean BMI between the two groups.

The two-sample \(\small t\)-test assumes that the data for both groups have equal variance and are normally distributed.

These assumptions will be checked prior to implementing the two-sample \(\small t\)-test and interpreting the results of the test.

BMI is weight in kilograms divided by the square of height in meters. \[BMI = \frac{Weight\ (kg)}{Height\ (m)^2}\] The mean BMI of each group is calculated as the sum of the BMI values in that group divided by the number of individuals in that group.

Data Description

The dataset diabetes.csv was downloaded from GitHub; it originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK).

The data was collected from a sample of 768 women of Pima Indian heritage from a population near Phoenix, Arizona, USA.

There are 768 observations, 8 features and 1 target variable (Outcome) in the dataset:

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (Weight in kg/(Height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (Years)
Outcome: Class variable (0 for no diabetes, 1 for diabetes)

Data (Cont.) - Data Wrangling

For the purpose of the hypothesis testing, we are only interested in two variables which are BMI and Outcome.

The diabetes data set has been subsetted to include only these two variables.

In this section, the Outcome variable has been renamed to Diagnosis after reading in the data.

The structure of the subsetted dataset db shows that Diagnosis was read in as an integer, so it has been converted to an ordered factor and labelled.

“0” has been labelled as “Negative” and “1” as “Positive”:

Diagnosis \(\small ==\) “Positive” for diabetic group;

Diagnosis \(\small ==\) “Negative” for non-diabetic group.

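library(dplyr)      # %>%, select(), filter(), group_by(), summarise()
library(magrittr)   # %<>%
library(kableExtra) # kbl(), kable_styling()
library(car)        # leveneTest(), qqPlot()
diabetes <- read.csv("diabetes.csv")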
colnames(diabetes)[colnames(diabetes) == "Outcome"] <- "Diagnosis"
db <- diabetes %>% select(BMI, Diagnosis)
str(db)

## 'data.frame':    768 obs. of  2 variables:
##  $ BMI      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ Diagnosis: int  1 0 1 0 1 0 1 0 1 1 ...

db$Diagnosis %<>% factor(levels = c(0, 1), labels = c("Negative", "Positive"), ordered = TRUE)
str(db)

## 'data.frame':    768 obs. of  2 variables:
##  $ BMI      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ Diagnosis: Ord.factor w/ 2 levels "Negative"<"Positive": 2 1 2 1 2 1 2 1 2 2 ...

Data (Cont.) - Summary Statistics & Dealing with Missing Values

Based on the summary statistics, the minimum value for BMI is 0.

The zero values were filtered out and checked.

These are obvious inconsistencies in the data set because an individual's BMI cannot be zero.

After converting the zero values to NAs, there were 9 missing values for the non-diabetic group and 2 missing values for the diabetic group.

The non-diabetic group has 500 observations and the diabetic group has 268, so the missing values make up a small proportion of each group. Omitting them is therefore preferable to imputing the mean or median BMI, as imputation could affect the hypothesis test results.
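As a quick check of the claim that the missing values are a small share of each group, the proportions can be computed from the counts reported above. This is a minimal sketch; the variable names are illustrative and not part of the original analysis.

```r
# Counts taken from the text: zero BMI values and group sizes before removal
n_missing <- c(Negative = 9, Positive = 2)
n_total   <- c(Negative = 500, Positive = 268)

# Percentage of each group that is dropped when the zero values are omitted
pct_missing <- round(100 * n_missing / n_total, 2)
print(pct_missing)
```

Both groups lose less than 2% of their observations.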

summary(db)

##       BMI           Diagnosis  
##  Min.   : 0.00   Negative:500  
##  1st Qu.:27.30   Positive:268  
##  Median :32.00                 
##  Mean   :31.99                 
##  3rd Qu.:36.60                 
##  Max.   :67.10

db_zero <- db %>% filter(BMI == 0)
t(db_zero) %>% kbl()

BMI	0	0	0	0	0	0	0	0	0	0	0
Diagnosis	Positive	Negative	Negative	Negative	Negative	Negative	Negative	Negative	Negative	Negative	Positive

db$BMI[db$BMI == 0] <- NA
db %<>% na.omit()

Descriptive Statistics

There are 491 non-diabetic observations and 266 diabetic observations after removing the missing values.

The mean BMI values for the non-diabetic and diabetic group of individuals are 30.86 and 35.41 respectively.

The BMI values fall within the range of [18.2 - 57.3] for the non-diabetic group and [22.9 - 67.1] for the diabetic group.

The guideline categories for BMI values are:

Underweight: Below 18.5
Normal: 18.5 - 24.9
Overweight: 25.0 - 29.9
Obese: 30.0 and above

Relative to the BMI guideline, the maximum BMI values are very high. A BMI of 67.1 would correspond, for example, to a female who is approximately 165 cm tall and weighs about 183 kg.
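For reference, the weight quoted above follows from rearranging the BMI formula, with the height of 1.65 m taken as an illustrative assumption: \[Weight = BMI \times Height^2 = 67.1 \times 1.65^2 = 67.1 \times 2.7225 \approx 182.7\ \text{kg}\]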

db %>% group_by(Diagnosis) %>% summarise(Min = min(BMI), 
                                         `First Quartile` = quantile(BMI, probs = 0.25), 
                                         Median = median(BMI),
                                         `Third Quartile` = quantile(BMI, probs = 0.75), 
                                         Max = max(BMI), 
                                         `IQR` = IQR(BMI), 
                                         Mean = mean(BMI), 
                                         SD = sd(BMI),`Count` = n()) %>% 
  kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Diagnosis	Min	First Quartile	Median	Third Quartile	Max	IQR	Mean	SD	Count
Negative	18.2	25.6	30.1	35.300	57.3	9.700	30.85967	6.560737	491
Positive	22.9	30.9	34.3	38.925	67.1	8.025	35.40677	6.614982	266

Outlier Detection

BMI values above the upper fences of 49.85 (non-diabetic group) and 50.96 (diabetic group) are outliers.

These values are extreme but not impossible; they indicate extreme obesity.

The outliers for BMI shown in the table below were not removed from the data set. Note that the table filters on the lower of the two fences (49.85), so the diabetic observation with a BMI of 50.0 is listed even though it lies below that group's fence of 50.96.

upper_fence <- function (Q3, IQR){Q3 + (1.5*IQR)}
upper_fence(35.300,9.700) #non-diabetic

## [1] 49.85

upper_fence(38.925,8.025) #diabetic

## [1] 50.9625
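The same 1.5 × IQR rule can also be applied at the low end. The sketch below is not part of the original analysis: lower_fence is a hypothetical helper mirroring upper_fence, and the Q1, IQR and minimum values are taken from the descriptive statistics table.

```r
# Lower fence under the same 1.5 * IQR rule as upper_fence()
lower_fence <- function(Q1, IQR) {
  Q1 - 1.5 * IQR
}

# Q1 and IQR values come from the descriptive statistics table
lf_negative <- lower_fence(25.6, 9.700)  # non-diabetic
lf_positive <- lower_fence(30.9, 8.025)  # diabetic

# Compare each fence with the group's minimum BMI (18.2 and 22.9)
c(non_diabetic = lf_negative < 18.2, diabetic = lf_positive < 22.9)
```

Both comparisons are TRUE: the minima lie above their lower fences (11.05 and about 18.86), so this rule flags no low-end outliers.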

diabetes %>% filter(BMI > 49.85) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Diagnosis
0	162	76	56	100	53.2	0.759	25	1
1	88	30	42	99	55.0	0.496	26	1
7	152	88	44	0	50.0	0.337	36	1
0	129	110	46	130	67.1	0.319	26	1
11	135	0	0	0	52.3	0.578	40	1
0	165	90	33	680	52.3	0.427	23	0
5	115	98	0	0	52.9	0.209	28	1
0	180	78	63	14	59.4	2.420	25	1
3	123	100	35	240	57.3	0.880	22	0

Hypothesis Testing Assumptions

We want to investigate whether there is a statistically significant difference between the mean BMI of diabetic and non-diabetic individuals from the sample.

\[H_0: \mu_1 = \mu_2 \] \[H_A: \mu_1 \ne \mu_2\] \(\mu_1\) represents the population mean of the non-diabetic group and \(\mu_2\) represents the population mean of the diabetic group.

Assumptions which must be checked prior to performing the two-sample t-test:

The populations of both groups being compared are independent of each other.
The data for the populations of both groups are normally distributed.
The data for the populations of both groups have equal variances.

The first assumption has been met as the people in both groups are different individuals (i.e., not the same person), therefore both groups are independent of each other.

Assumptions (Cont.) - Data Visualisation & Normality Check (Boxplot)

The side-by-side boxplots below also show the outliers mentioned in the previous slide. Visually, the BMI values of the non-diabetic group appear approximately symmetric, which is consistent with normality, as the two parts of the box on either side of the median line are almost equal. The boxplot for the diabetic group appears slightly right-skewed, as the longer part of the box is above the median line.

bmi_boxplot <- boxplot(db$BMI ~ db$Diagnosis,
        main = "Boxplot of BMI by Diabetes Diagnosis",
        xlab = "Diabetes Diagnosis",
        ylab = expression(paste("BMI (", kg/m^2, ")")),
        col = c("#D55E00", "#009E73"))

Assumptions (Cont.) - Data Visualization & Normality Check (Q-Q Plot)

Normality is easier to assess visually with a separate Q-Q plot for each group. In both Q-Q plots, both ends deviate from the straight line and curve upwards, indicating right-skewness. Many points fall outside the 95% confidence band depicted by the blue dashed lines, so we should be careful about assuming normality for either group: for a normal distribution, we would expect most of the points to lie within the band.

Based on the summary statistics reported earlier, both groups have BMI distributions which are slightly right-skewed because the mean BMI (30.86) is larger than the median BMI (30.1) for the non-diabetic group, and similarly for the diabetic group, the mean BMI (35.41) is larger than the median BMI (34.3). The diabetic group has a larger right-skew compared to the non-diabetic group.

db_positive <- db %>% filter(Diagnosis == "Positive")
db_negative <- db %>% filter(Diagnosis == "Negative")
par(mfrow = c(1,2))
qqPlot(db_negative$BMI, dist = "norm", ylab = expression(paste("BMI (", kg/m^2, ")")), xlab = "Theoretical Quantile", main = "Normal Q-Q Plot for Non-diabetic Group")
qqPlot(db_positive$BMI, dist = "norm", ylab = expression(paste("BMI (", kg/m^2, ")")), xlab = "Theoretical Quantile", main = "Normal Q-Q Plot for Diabetic Group")

Assumptions (Cont.) - Normality Check, Central Limit Theorem

Despite the right-skewed distributions, the two-sample \(\small t\)-test can still be applied to both groups in this case on the basis of the Central Limit Theorem (CLT).

According to the CLT:

When the sample size is large (\(\small n > 30\)), the sampling distribution of the mean will be approximately normally distributed, regardless of the underlying population distribution.

In this case, the sample sizes of both groups are much larger than 30, so we can rely on the CLT and proceed with the two-sample \(\small t\)-test provided the remaining assumptions are met.

Non-Diabetic group: \(\small n = 491 > 30\)
Diabetic group: \(\small n = 266 > 30\)
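The CLT argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is a hypothetical right-skewed Gamma(shape = 2, scale = 1) distribution rather than the BMI data, and the seed and number of simulations are arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population: Gamma(shape = 2, scale = 1), mean 2
n      <- 266   # size of the smaller (diabetic) group
n_sims <- 5000  # number of simulated samples

# Mean of each simulated sample of size n
sample_means <- replicate(n_sims, mean(rgamma(n, shape = 2, scale = 1)))

# Simple moment-based skewness
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Skewness of the raw population versus skewness of the sample means
skew_population <- skewness(rgamma(1e5, shape = 2, scale = 1))
skew_means      <- skewness(sample_means)
c(population = skew_population, sample_means = skew_means)
```

The raw values are clearly skewed, while the simulated sample means are close to symmetric. This is the behaviour the two-sample \(\small t\)-test relies on here.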

Assumptions (Cont.) - Levene’s Test

Homogeneity of Variance (Equal Variances)

The first two assumptions listed earlier have been met. We now assess the third assumption, Homogeneity of Variance, using Levene’s test.

If we have unequal variances, Welch’s two-sample \(\small t\)-test will be used.

The statistical hypotheses for Levene’s test are: \[H_0: \sigma_1^2 = \sigma_2^2\] \[H_A: \sigma_1^2 \ne \sigma_2^2\] \(\small \sigma_1^2\) is the population variance of the non-diabetic group, and \(\small \sigma_2^2\) is the population variance of the diabetic group.

options(knitr.kable.NA = '')
leveneTest(`BMI` ~ `Diagnosis`, data = db) %>% kbl()

	Df	F value	Pr(>F)
group	1	1.620356	0.203434
	755

Based on Levene’s test, the \(\small p\)-value (denoted by \(\small\Pr(>F)\)) for BMI between diabetic and non-diabetic groups is \(\small p=0.203434\), which is larger than the significance level of \(\small \alpha=0.05\); therefore we fail to reject the null hypothesis \(\small H_0\) for Levene’s Test.

There is no evidence against equal variances, so we proceed on the assumption that the populations of both groups have equal variances.
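For comparison, had the variances been judged unequal, Welch's version of the test would apply. The sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand from the summary statistics in the descriptive statistics table. It is an illustration rather than part of the original analysis, and the variable names are assumptions.

```r
# Summary statistics from the descriptive statistics table
n1 <- 491; m1 <- 30.85967; s1 <- 6.560737  # non-diabetic
n2 <- 266; m2 <- 35.40677; s2 <- 6.614982  # diabetic

# Welch standard error: the two variances are not pooled
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se_welch <- sqrt(v1 + v2)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch  <- (m1 - m2) / se_welch
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

c(t = t_welch, df = df_welch)
```

Because the two sample standard deviations are nearly identical, the Welch statistic is very close to the pooled one, so the choice between the two tests makes little practical difference for these data.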

Hypothesis Testing (Two-sample t-test) - Equal Variance Assumption met

Assuming equal variances, we set \(\small var.equal=TRUE\) and perform a two-sided (two-tailed) two-sample \(\small t\)-test.

t_test <- t.test(BMI ~ Diagnosis, data = db, var.equal = TRUE, alternative = "two.sided") 
t_test

## 
##  Two Sample t-test
## 
## data:  BMI by Diagnosis
## t = -9.0772, df = 755, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Negative and group Positive is not equal to 0
## 95 percent confidence interval:
##  -5.530483 -3.563703
## sample estimates:
## mean in group Negative mean in group Positive 
##               30.85967               35.40677

Recall the null hypothesis for the two-sample \(\small t\)-test stated earlier, \(\small H_0: \mu_1 = \mu_2\) (equivalently \(\small \mu_1 - \mu_2 = 0\)): the difference between the population means of the two independent groups is 0.
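The reported test statistic can be cross-checked by hand from the summary statistics. This is a minimal sketch of the pooled-variance formula, assuming the means, standard deviations and counts from the descriptive statistics table; the variable names are illustrative.

```r
# Summary statistics from the descriptive statistics table
n1 <- 491; m1 <- 30.85967; s1 <- 6.560737  # non-diabetic
n2 <- 266; m2 <- 35.40677; s2 <- 6.614982  # diabetic

# Pooled variance with n1 + n2 - 2 degrees of freedom
df  <- n1 + n2 - 2
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df

# Standard error of the difference in means, then the t statistic
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
t_pooled  <- (m1 - m2) / se_pooled

c(df = df, se = se_pooled, t = t_pooled)
```

The result should agree with the t.test output above (df = 755, t close to -9.0772) up to rounding of the inputs.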

Hypothesis Testing (Cont.)

p-value Approach

t_test$p.value

## [1] 9.545638e-19

Based on the two-sample \(\small t\)-test, the observed difference in mean BMI is \(\small \bar{x}_1 - \bar{x}_2 = 30.85967 - 35.40677 = -4.5471\). The reported two-tailed \(\small p\)-value, which is the probability of observing a difference in sample means at least as extreme as \(\small -4.5471\) if the null hypothesis were true, is \(\small p = 9.545638\times 10^{-19}\). Since the \(\small p\)-value is very small, we report it as \(\small p < 0.001\).

\(\small p\)-value is less than the significance level \(\small \alpha = 0.05\). \(\rightarrow\) Reject the null hypothesis \(\small H_0\).

Confidence Interval Approach

The lower and upper bound of the 95% confidence interval of the mean difference are \(\small[-5.530483, -3.563703]\).

The 95% confidence interval did not capture \(\small H_0: \mu_1 = \mu_2\) \((\small \mu_1 - \mu_2 = 0)\), as 0 lies outside of the confidence interval range. \(\rightarrow\) Reject the null hypothesis \(\small H_0\).
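As a rough consistency check based only on the values printed by t.test (not an additional analysis), the standard error implied by the output is the absolute difference in means divided by the absolute test statistic. The interval can then be rebuilt using the critical value \(\small t_{0.975,\,755} \approx 1.963\): \[SE = \frac{4.5471}{9.0772} \approx 0.501\] \[(\bar{x}_1 - \bar{x}_2) \pm t_{0.975,\,755} \times SE \approx -4.5471 \pm 1.963 \times 0.501 \approx -4.5471 \pm 0.983 \approx [-5.530, -3.564]\] This matches the reported interval to three decimal places.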

Critical Value Approach

qt(p = 0.025, df = 755)

## [1] -1.963111

For a two-tailed test at \(\small \alpha = 0.05\), each tail has probability \(\small \frac{0.05}{2} = 0.025\). With degrees of freedom \(\small df=755\), the lower \(\small t\)-critical value is \(\small -1.963111\) and, by symmetry, the upper critical value is \(\small 1.963111\).

The test statistic assuming equal variance is \(\small t = -9.0772\), which is more extreme than \(\small -1.963111\). \(\rightarrow\) Reject the null hypothesis \(\small H_0\).
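The decision rule for the critical value approach can be sketched in a few lines. The test statistic and degrees of freedom are copied from the t.test output; the variable names are illustrative.

```r
t_stat <- -9.0772  # test statistic reported by t.test()
df     <- 755
alpha  <- 0.05

# Two-tailed test: critical values at the alpha/2 and 1 - alpha/2 quantiles
t_crit <- qt(c(alpha / 2, 1 - alpha / 2), df = df)

# Reject H0 when the statistic falls outside the critical values
reject_h0 <- abs(t_stat) > t_crit[2]

# Two-tailed p-value computed from the same statistic
p_value <- 2 * pt(-abs(t_stat), df = df)

list(t_crit = t_crit, reject_h0 = reject_h0, p_value = p_value)
```

Comparing the absolute value of the statistic with the upper critical value covers both tails at once, and the p-value from the same statistic leads to the same decision.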

Discussions

All three approaches lead to the same conclusion: \(\small t(df=755)=-9.0772\), \(\small p<0.001\), \(\small 95\%\) CI for the difference in means \(\small [-5.530483, -3.563703]\). We reject the null hypothesis \(\small H_0\), which states that there is no difference between the population mean BMI of the two groups.

Strengths

The dataset is publicly accessible via Kaggle, GitHub and other sites. The original dataset can be requested from the NIDDK Central Repository, so the data come from a trusted source.
The sample size of both groups is large enough to rely on the CLT for the normality assumption and to perform the independent two-sample \(\small t\)-test.

Limitations

The dataset contains many zero values for BMI and other numeric features.
The large outlier values, although not impossible, are questionable, as they could be due to data entry, experimental or measurement (calculation) errors. Individuals in the sample could have reported an incorrect height or weight, and because height and weight measurements were not included in the data set, we cannot check whether BMI was calculated correctly.

Directions for Future Investigations

Since the sampling method was not clearly stated, it is unclear how the data were collected and entered into the system. In general, to reduce the number of obvious inconsistencies, data validation should be in place during data collection. For example, entering 0 for a variable where 0 is an impossible value should return an error.
It would be useful if the variables used for calculations were included in the data set, so that calculation errors can be checked and mitigated.

Final Conclusion

The results of the study indicate that there is a significant difference between the mean BMI of the non-diabetic and diabetic groups of females of Pima Indian heritage near Phoenix, Arizona.
Although we cannot use these results to generalize to the entire world population, we saw that the mean BMI of the diabetic group is higher than that of the non-diabetic group. This is consistent with the research studies mentioned earlier, which found that the likelihood of being diabetic is higher among people who are overweight or obese.

Reference List

Al-Goblan, A. S., Al-Alfi, M. A., & Khan, M. Z. (2014). Mechanism linking diabetes mellitus and obesity. Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy, 7, 587-591. https://dx.doi.org/10.2147%2FDMSO.S67400

Baglin, J. (2020). Module 7 Testing the Null: Data on Trial. [Module Webpage]. Canvas @ RMIT University. https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html

Github. (2019). Pima-Indians-Diabetes-Dataset. https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/diabetes.csv

Gupta, S., & Bansal, S. (2021). Correction: Does a rise in BMI cause an increased risk of diabetes?: Evidence from India. PLoS One, 16(2), e0247537. https://dx.doi.org/10.1371%2Fjournal.pone.0247537

World Health Organization. (2021). Diabetes. https://www.who.int/health-topics/diabetes#tab=tab_1

Hypothesis Testing (Independent Two-Sample t-Test)

Statistical Differences in Mean Body Mass Index (BMI) between Diabetic and Non-diabetic Group of Females