Hypothesis Testing (Independent Two-Sample t-Test)

Statistical Differences in Mean Body Mass Index (BMI) between Diabetic and Non-diabetic Group of Females

Lai Teng Wong (S3714421)

https://rpubs.com/laitengw/A2

Last updated: 15 October, 2021

Introduction

What is Diabetes?

Is there a relationship between Diabetes and Body Mass Index (BMI)?

Problem Statement

The aim of this study is to investigate whether there is a difference in BMI means in a sample of diabetic and non-diabetic females.

As we are looking at two different groups which are independent of each other, an independent two-sample \(\small t\)-test will be used for hypothesis testing to compare the difference of mean BMI between the two groups.

The two-sample \(\small t\)-test assumes that the data for both groups have equal variance and are normally distributed.

These assumptions will be checked prior to implementing the two-sample \(\small t\)-test and interpreting the results of the test.

BMI is the weight in kilograms divided by the square of the height in meters. \[BMI = \frac{Weight\ (kg)}{Height\ (m)^2}\] The mean BMI of each group is calculated as the sum of the BMI values in that group divided by the number of individuals in that group.
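
As a minimal sketch of the two definitions above, the following computes BMI from weight and height and then the mean BMI within each group. The helper `bmi()` and the data frame `toy` are hypothetical and the values are made up for illustration; they are not part of the study data.

```r
# Minimal sketch: BMI from weight and height, then the mean BMI within each group.
# `bmi` and `toy` are hypothetical; the values are invented for illustration only.
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

toy <- data.frame(
  group     = c("Negative", "Negative", "Positive", "Positive"),
  weight_kg = c(60, 72, 85, 98),
  height_m  = c(1.60, 1.70, 1.65, 1.75)
)
toy$BMI <- bmi(toy$weight_kg, toy$height_m)

# Mean BMI per group: sum of BMI in the group / number of individuals in the group
group_means <- tapply(toy$BMI, toy$group, mean)
print(round(group_means, 2))
```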

Data Description

The dataset diabetes.csv was downloaded from GitHub; it originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK).

The data was collected from a sample of 768 women of Pima Indian heritage from a population near Phoenix, Arizona, USA.

There are 768 observations, 8 features and 1 target variable (Outcome) in the dataset:

Data (Cont.) - Data Wrangling

For the purpose of the hypothesis testing, we are only interested in two variables which are BMI and Outcome.

The diabetes data set has been subsetted to include only these two variables.

In this section, the Outcome variable has been renamed to Diagnosis after reading in the data.

The structure of the subsetted dataset db shows that Diagnosis was read in as an integer, so it has been converted to an ordered, labelled factor.

“0” has been labelled as “Negative” and “1” as “Positive”:

Diagnosis \(\small ==\) “Positive” for diabetic group;

Diagnosis \(\small ==\) “Negative” for non-diabetic group.

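library(dplyr)       # %>%, select(), filter(), group_by(), summarise()
library(magrittr)    # %<>% compound assignment pipe
library(kableExtra)  # kbl(), kable_styling()
library(car)         # leveneTest(), qqPlot()
diabetes <- read.csv("diabetes.csv")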
colnames(diabetes)[colnames(diabetes) == "Outcome"] <- "Diagnosis"
db <- diabetes %>% select(BMI, Diagnosis)
str(db)
## 'data.frame':    768 obs. of  2 variables:
##  $ BMI      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ Diagnosis: int  1 0 1 0 1 0 1 0 1 1 ...
db$Diagnosis %<>% factor(levels = c(0, 1), labels = c("Negative", "Positive"), ordered = TRUE)
str(db)
## 'data.frame':    768 obs. of  2 variables:
##  $ BMI      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ Diagnosis: Ord.factor w/ 2 levels "Negative"<"Positive": 2 1 2 1 2 1 2 1 2 2 ...

Data (Cont.) - Summary Statistics & Dealing with Missing Values

Based on the summary statistics, the minimum value for BMI is 0.

The zero values were filtered out and checked.

They are obvious inconsistencies in the data set because BMI values cannot be zero for an individual.

After converting the zero values to NAs, there were 9 missing values for the non-diabetic group and 2 missing values for the diabetic group.

The non-diabetic group has 500 observations and the diabetic group has 268, so the missing values make up a small proportion of each group (9 of 500 and 2 of 268). Omitting them is therefore preferable to imputing the mean or median BMI, as imputation could affect the hypothesis test results.

summary(db)
##       BMI           Diagnosis  
##  Min.   : 0.00   Negative:500  
##  1st Qu.:27.30   Positive:268  
##  Median :32.00                 
##  Mean   :31.99                 
##  3rd Qu.:36.60                 
##  Max.   :67.10
db_zero <- db %>% filter(BMI == 0)
t(db_zero) %>% kbl()
BMI 0 0 0 0 0 0 0 0 0 0 0
Diagnosis Positive Negative Negative Negative Negative Negative Negative Negative Negative Negative Positive
db$BMI[db$BMI == 0] <- NA
db %<>% na.omit()
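
The clean-up steps above can be sketched in base R on a small made-up data frame. The object `toy` and its values are assumptions for illustration only; the sketch counts the zero BMI values per group, recodes them as NA and keeps the complete rows.

```r
# Minimal sketch on made-up data: count zero BMI values per group, recode them
# as NA, and drop incomplete rows. `toy` is hypothetical, not the study data.
toy <- data.frame(
  BMI       = c(33.6, 0, 23.3, 28.1, 0, 25.6),
  Diagnosis = factor(c("Positive", "Negative", "Positive",
                       "Negative", "Negative", "Positive"))
)

# Zeros per group before cleaning
zeros_by_group <- table(toy$Diagnosis[toy$BMI == 0])
print(zeros_by_group)

# Recode zeros as missing, then keep complete cases only
toy$BMI[toy$BMI == 0] <- NA
toy_clean <- na.omit(toy)

# Share of rows lost to the clean-up
prop_dropped <- 1 - nrow(toy_clean) / nrow(toy)
print(prop_dropped)
```

In the study data the same steps drop 11 of 768 rows, which is about 1.4% of the observations.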

Descriptive Statistics

There are 491 non-diabetic observations and 266 diabetic observations after removing the missing values.

The mean BMI values for the non-diabetic and diabetic group of individuals are 30.86 and 35.41 respectively.

The BMI values fall within the range of [18.2 - 57.3] for the non-diabetic group and [22.9 - 67.1] for the diabetic group.

The guideline for BMI values are:

According to the BMI guideline, the maximum BMI values appear to be very high. For example, a BMI of 67.1 corresponds to a female individual who is approximately 165 cm tall and weighs about 183 kg.
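
The weight quoted for a BMI of 67.1 follows from rearranging the BMI formula, assuming a height of 1.65 m as in the example: \[Weight = BMI \times Height^2 = 67.1 \times 1.65^2 = 67.1 \times 2.7225 \approx 182.7\ kg\]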

db %>% group_by(Diagnosis) %>% summarise(Min = min(BMI), 
                                         `First Quartile` = quantile(BMI, probs = 0.25), 
                                         Median = median(BMI),
                                         `Third Quartile` = quantile(BMI, probs = 0.75), 
                                         Max = max(BMI), 
                                         `IQR` = IQR(BMI), 
                                         Mean = mean(BMI), 
                                         SD = sd(BMI),`Count` = n()) %>% 
  kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")
Diagnosis   Min  First Quartile  Median  Third Quartile   Max    IQR      Mean        SD  Count
Negative   18.2            25.6    30.1          35.300  57.3  9.700  30.85967  6.560737    491
Positive   22.9            30.9    34.3          38.925  67.1  8.025  35.40677  6.614982    266

Outlier Detection

BMI values above the upper fences of 49.85 (non-diabetic group) and 50.96 (diabetic group) are outliers for their respective groups.

These values are extreme but not impossible; they indicate extreme obesity.

The table below lists every record with a BMI above 49.85, the lower of the two fences, so the diabetic record with a BMI of 50.0 appears in it although it lies below the diabetic fence of 50.96. None of these records were removed from the data set.

upper_fence <- function (Q3, IQR){Q3 + (1.5*IQR)}
upper_fence(35.300,9.700) #non-diabetic
## [1] 49.85
upper_fence(38.925,8.025) #diabetic
## [1] 50.9625
diabetes %>% filter(BMI > 49.85) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")
Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Diagnosis
          0      162             76             56      100  53.2                     0.759   25          1
          1       88             30             42       99  55.0                     0.496   26          1
          7      152             88             44        0  50.0                     0.337   36          1
          0      129            110             46      130  67.1                     0.319   26          1
         11      135              0              0        0  52.3                     0.578   40          1
          0      165             90             33      680  52.3                     0.427   23          0
          5      115             98              0        0  52.9                     0.209   28          1
          0      180             78             63       14  59.4                     2.420   25          1
          3      123            100             35      240  57.3                     0.880   22          0
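
Rather than typing the quartiles in by hand, the fences can be computed directly from a numeric vector. The sketch below is one way to do this in base R; `tukey_fences()` is a hypothetical helper (not part of any package) and the vector `x` is made up for illustration.

```r
# Minimal sketch: Tukey fences computed straight from a numeric vector.
# `tukey_fences` is a hypothetical helper, not part of any package.
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Made-up example values
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
fences <- tukey_fences(x)
print(fences)

# Values outside the fences are flagged as outliers
outliers <- x[x < fences["lower"] | x > fences["upper"]]
print(outliers)
```

Assuming the data frame db from the earlier slides, the helper could be applied per group with `tapply(db$BMI, db$Diagnosis, tukey_fences)`. From the quartiles reported earlier, the lower fences are 25.6 - 1.5 x 9.7 = 11.05 for the non-diabetic group and 30.9 - 1.5 x 8.025 = 18.86 for the diabetic group, both below the group minimums of 18.2 and 22.9, so there are no outliers at the low end.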

Hypothesis Testing Assumptions

We want to investigate whether there is a statistically significant difference between the mean BMI of diabetic and non-diabetic individuals in the sample.

\[H_0: \mu_1 = \mu_2 \] \[H_A: \mu_1 \ne \mu_2\] \(\mu_1\) represents the population mean of the non-diabetic group and \(\mu_2\) represents the population mean of the diabetic group.

Assumptions which must be checked prior to performing the two-sample t-test:

  1. The populations of both groups being compared are independent of each other.
  2. The data for the populations of both groups are normally distributed.
  3. The data for the populations of both groups have equal variances.

The first assumption has been met as the people in both groups are different individuals (i.e., not the same person), therefore both groups are independent of each other.

Assumptions (Cont.) - Data Visualisation & Normality Check (Boxplot)

The side-by-side boxplots below also show the outliers as mentioned in the previous slide. Visually, the BMI values of the non-diabetic group appear to be approximately normal as the area of the boxplot on both sides of the median line looks almost symmetrical; whereas the boxplot for the diabetic group appears to be slightly right-skewed as the longer part of the box is above the median line.

bmi_boxplot <- boxplot(db$BMI ~ db$Diagnosis,
        main = "Boxplot of BMI by Diabetes Diagnosis",
        xlab = "Diabetes Diagnosis",
        ylab = expression(paste("BMI (", kg/m^2, ")")),
        col = c("#D55E00", "#009E73"))

Assumptions (Cont.) - Data Visualization & Normality Check (Q-Q Plot)

Normality is easier to assess with a separate Q-Q plot for each group. In both Q-Q plots, both ends deviate from the straight line and curve upwards, indicating right-skewness. Many points fall outside the 95% confidence envelope depicted by the blue dashed lines, so we should be careful about assuming normality for either group: for a normal distribution, we expect most of the Q-Q plot points to lie within the envelope.

Based on the summary statistics reported earlier, both groups have BMI distributions which are slightly right-skewed because the mean BMI (30.86) is larger than the median BMI (30.1) for the non-diabetic group, and similarly for the diabetic group, the mean BMI (35.41) is larger than the median BMI (34.3). The diabetic group has a larger right-skew compared to the non-diabetic group.

db_positive <- db %>% filter(Diagnosis == "Positive")
db_negative <- db %>% filter(Diagnosis == "Negative")
par(mfrow = c(1,2))
qqPlot(db_negative$BMI, dist = "norm", ylab = expression(paste("BMI (", kg/m^2, ")")), xlab = "Theoretical Quantile", main = "Normal Q-Q Plot for Non-diabetic Group")
qqPlot(db_positive$BMI, dist = "norm", ylab = expression(paste("BMI (", kg/m^2, ")")), xlab = "Theoretical Quantile", main = "Normal Q-Q Plot for Diabetic Group")
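
The comparison of mean and median is an informal indicator of skew. As a complementary numeric check, the moment-based sample skewness can be computed in base R. This is a minimal sketch: `sample_skewness()` is a hypothetical helper and the two vectors are made up, so it only illustrates that a right-skewed sample gives a positive value.

```r
# Minimal sketch: moment-based sample skewness in base R.
# `sample_skewness` is a hypothetical helper; positive values indicate right skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)  # second central moment
  m3 <- mean((x - mean(x))^3)  # third central moment
  m3 / m2^(3 / 2)
}

symmetric    <- c(1, 2, 3, 4, 5)        # made-up symmetric sample
right_skewed <- c(1, 1, 2, 2, 3, 10)    # made-up sample with a long right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

Assuming the objects from the code above, `sample_skewness(db_negative$BMI)` and `sample_skewness(db_positive$BMI)` would give the values for the two groups.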

Assumptions (Cont.) - Normality Check, Central Limit Theorem

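Although both distributions are right-skewed, the two-sample \(\small t\)-test can still be applied in this case based on the Central Limit Theorem (CLT).

According to the CLT, when the sample size is large (commonly taken as \(\small n > 30\)), the sampling distribution of the sample mean is approximately normal regardless of the shape of the underlying population distribution.

In this case, the sample sizes of both groups (491 and 266) are much larger than 30, therefore we can treat the sampling distributions of the group means as approximately normal and proceed with the two-sample \(\small t\)-test if the remaining assumptions are met.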
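
The CLT argument can be illustrated with a small simulation. The sketch below draws repeated samples from a right-skewed population and looks at the distribution of their means. The exponential population, the seed and the number of replications are assumptions chosen for illustration; only the sample size of 266 is taken from the study (the smaller group after cleaning).

```r
# Minimal sketch: sample means from a right-skewed population (exponential, rate 1).
# The population, seed, and number of replications are assumptions for illustration.
set.seed(1)
n    <- 266    # size of the smaller (diabetic) group after cleaning
reps <- 5000   # number of simulated samples

sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# CLT prediction: mean close to 1 and standard deviation close to 1 / sqrt(n)
print(mean(sample_means))
print(sd(sample_means))
print(1 / sqrt(n))
```

Plotting `hist(sample_means)` should show a roughly bell-shaped distribution even though the individual draws are strongly right-skewed.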

Assumptions (Cont.) - Levene’s Test

Homogeneity of Variance (Equal Variances)

Of the assumptions listed earlier, the first two have been met. We will assess the third assumption, homogeneity of variance, using Levene’s test.

If we have unequal variances, Welch’s two-sample \(\small t\)-test will be used.

The statistical hypotheses for Levene’s test are: \[H_0: \sigma_1^2 = \sigma_2^2\] \[H_A: \sigma_1^2 \ne \sigma_2^2\] \(\small \sigma_1^2\) is the population variance of the non-diabetic group, and \(\small \sigma_2^2\) is the population variance of the diabetic group.

options(knitr.kable.NA = '')
leveneTest(`BMI` ~ `Diagnosis`, data = db) %>% kbl()
        Df   F value    Pr(>F)
group     1  1.620356  0.203434
       755

Based on Levene’s test, the \(\small p\)-value (denoted by \(\small\Pr(>F)\)) for BMI between diabetic and non-diabetic groups is \(\small p=0.203434\), which is larger than the significance level of \(\small \alpha=0.05\), therefore we fail to reject the null hypothesis \(\small H_0\) for Levene’s test.

There is no evidence that the population variances differ, so we proceed under the assumption that the populations of both groups have equal variances.
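
To show what the test computes, the median-centred form of Levene’s test (the default centring in car::leveneTest) can be sketched in base R as a one-way ANOVA on the absolute deviations from each group median. `levene_median()` is a hypothetical helper and the data are simulated, so the printed numbers are not the study results.

```r
# Minimal sketch: median-centred Levene's test (Brown-Forsythe variant) in base R.
# `levene_median` is a hypothetical helper; the toy data are simulated.
levene_median <- function(y, g) {
  g <- factor(g)
  z <- abs(y - ave(y, g, FUN = median))  # absolute deviations from group medians
  anova(lm(z ~ g))                       # one-way ANOVA on the deviations
}

set.seed(42)
toy <- data.frame(
  y = c(rnorm(50, mean = 30, sd = 6), rnorm(40, mean = 35, sd = 6)),
  g = rep(c("Negative", "Positive"), times = c(50, 40))
)

res <- levene_median(toy$y, toy$g)
print(res)
```

The first row of the ANOVA table holds the F statistic and its p-value; the degrees of freedom are the number of groups minus 1 and the number of observations minus the number of groups, which matches the 1 and 755 reported for the study data.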

Hypothesis Testing (Two-sample t-test) - Equal Variance Assumption met

Assuming equal variances, we set \(\small var.equal=TRUE\) and perform the two-sample \(\small t\)-test with a two-sided alternative, because \(\small H_A\) does not specify the direction of the difference.

t_test <- t.test(BMI ~ Diagnosis, data = db, var.equal = TRUE, alternative = "two.sided") 
t_test
## 
##  Two Sample t-test
## 
## data:  BMI by Diagnosis
## t = -9.0772, df = 755, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Negative and group Positive is not equal to 0
## 95 percent confidence interval:
##  -5.530483 -3.563703
## sample estimates:
## mean in group Negative mean in group Positive 
##               30.85967               35.40677

To recall, the null hypothesis \(\small H_0\) for the two-sample \(\small t\)-test stated earlier is \(\small H_0: \mu_1 = \mu_2\), or \((\small \mu_1 - \mu_2 = 0)\): the difference between the population means for the two independent groups is 0.
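
The reported test statistic can be checked against the standard pooled-variance formulas, using the group sizes, means and standard deviations from the descriptive statistics (subscript 1 for the non-diabetic group, subscript 2 for the diabetic group): \[s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} = \frac{490 \times 6.560737^2 + 265 \times 6.614982^2}{755} \approx 43.294\] \[t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} = \frac{30.85967 - 35.40677}{6.580 \times \sqrt{\frac{1}{491}+\frac{1}{266}}} \approx \frac{-4.5471}{0.5009} \approx -9.077\] The degrees of freedom are \(\small n_1 + n_2 - 2 = 491 + 266 - 2 = 755\), in agreement with the output above.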

Hypothesis Testing (Cont.)

p-value Approach

t_test$p.value
## [1] 9.545638e-19

Based on the two-sample \(\small t\)-test, the difference in sample BMI means is \(\small \bar{x}_1 - \bar{x}_2 = 30.85967 - 35.40677 = -4.5471\). The reported two-tailed \(\small p\)-value above is the probability of observing a difference in sample means at least as extreme as \(\small -4.5471\) if the null hypothesis were true; it is \(\small p = 9.545638\times 10^{-19}\). Since the \(\small p\)-value is very small, we will write \(\small p < 0.001\) for short. As \(\small p < 0.05\), we reject \(\small H_0\).

Confidence Interval Approach

The 95% confidence interval for the difference in means is \(\small[-5.530483, -3.563703]\). The interval does not contain 0, the value of the difference under \(\small H_0\), so we reject \(\small H_0\) at the 0.05 significance level.
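
The interval can be reproduced from the summary statistics alone. The following is a minimal sketch under the equal-variance assumption; the variable names are illustrative and the inputs are the rounded values reported earlier, so the result agrees with the t.test() output only up to rounding.

```r
# Minimal sketch: 95% CI for the difference in means from the summary statistics
# reported earlier (pooled-variance version). Variable names are illustrative.
n1 <- 491; m1 <- 30.85967; s1 <- 6.560737   # non-diabetic group
n2 <- 266; m2 <- 35.40677; s2 <- 6.614982   # diabetic group

df <- n1 + n2 - 2
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)  # pooled standard deviation
se <- sp * sqrt(1 / n1 + 1 / n2)                      # standard error of the difference

diff_means <- m1 - m2
t_crit     <- qt(0.975, df)
ci         <- diff_means + c(-1, 1) * t_crit * se
print(ci)

# Reject H0 at the 5% level when 0 lies outside the interval
contains_zero <- ci[1] <= 0 && ci[2] >= 0
print(contains_zero)
```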

Critical Value Approach

qt(p = 0.025, df = 755)
## [1] -1.963111

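For a two-tailed test at the 0.05 significance level, each tail has an area of \(\small \frac{0.05}{2} = 0.025\). With degrees of freedom \(\small df=755\), the lower \(\small t\)-critical value is \(\small -1.963111\) and, by symmetry, the upper \(\small t\)-critical value is \(\small 1.963111\). The test statistic \(\small t=-9.0772\) is below \(\small -1.963111\), so it falls in the rejection region.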
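
The critical value and p-value decision rules can be written out in a few lines. This is a minimal sketch that takes the reported test statistic and degrees of freedom as inputs; the variable names are illustrative.

```r
# Minimal sketch: two-sided decision rules using the reported test statistic.
t_stat <- -9.0772   # test statistic reported by t.test()
df     <- 755
alpha  <- 0.05

t_crit  <- qt(1 - alpha / 2, df)      # upper critical value, about 1.963
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

reject_by_critical <- abs(t_stat) > t_crit
reject_by_p        <- p_value < alpha
print(c(t_crit = t_crit, p_value = p_value))
print(c(reject_by_critical, reject_by_p))
```

Both rules always lead to the same decision for a given significance level, since the p-value is below alpha exactly when the absolute test statistic exceeds the critical value.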

Discussions

All three approaches lead to the same decision. With \(\small t(df=755)=-9.0772\), \(\small p<0.001\), and a \(\small 95\%\) CI for the difference in means of \(\small [-5.530483, -3.563703]\), we reject the null hypothesis \(\small H_0\), which states that there is no difference between the population BMI means of the two groups.

Strengths

Limitations

Directions for Future Investigations

Final Conclusion

Reference List

Al-Goblan, A. S., Al-Alfi, M. A., & Khan, M. Z. (2014). Mechanism linking diabetes mellitus and obesity. Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy, 7, 587-591. https://dx.doi.org/10.2147%2FDMSO.S67400

Baglin, J. (2020). Module 7 Testing the Null: Data on Trial. [Module Webpage]. Canvas @ RMIT University. https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html

GitHub. (2019). Pima-Indians-Diabetes-Dataset. https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/diabetes.csv

Gupta, S., & Bansal, S. (2021). Correction: Does a rise in BMI cause an increased risk of diabetes?: Evidence from India. PLoS One, 16(2), e0247537. https://dx.doi.org/10.1371%2Fjournal.pone.0247537

World Health Organization. (2021). Diabetes. https://www.who.int/health-topics/diabetes#tab=tab_1