Lai Teng Wong (S3714421)
https://rpubs.com/laitengw/A2
Last updated: 15 October, 2021
The aim of this study is to investigate whether there is a difference in BMI means in a sample of diabetic and non-diabetic females.
As we are looking at two different groups which are independent of each other, an independent two-sample \(\small t\)-test will be used for hypothesis testing to compare the difference of mean BMI between the two groups.
The two-sample \(\small t\)-test assumes that the data for both groups have equal variance and are normally distributed.
These assumptions will be checked prior to implementing the two-sample \(\small t\)-test and interpreting the results of the test.
BMI is the weight in kilograms divided by the square of the height in meters. \[BMI = \frac{Weight\ (kg)}{Height\ (m)^2}\] The mean BMI of each group is the sum of the BMI values in that group divided by the number of individuals in that group.
The dataset diabetes.csv was downloaded from GitHub; it originates from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK).
The data was collected from a sample of 768 women of Pima Indian heritage from a population near Phoenix, Arizona, USA.
There are 768 observations, 8 features and 1 target variable (Outcome) in the dataset:
For the purpose of the hypothesis testing, we are only interested in two variables, BMI and Outcome.
The diabetes data set has been subset to include only these two variables.
In this section, the Outcome variable has been renamed to Diagnosis after reading in the data.
Upon checking the structure of the subsetted dataset db, Diagnosis was found to have been read in as an integer, so it was converted to an ordered, labelled factor.
“0” has been labelled as “Negative” and “1” as “Positive”:
Diagnosis \(\small ==\) “Positive” for diabetic group;
Diagnosis \(\small ==\) “Negative” for non-diabetic group.
library(dplyr)     # select(), filter(), group_by(), summarise()
library(magrittr)  # %<>% assignment pipe
diabetes <- read.csv("diabetes.csv")
# Rename the target variable from Outcome to Diagnosis
colnames(diabetes)[colnames(diabetes) == "Outcome"] <- "Diagnosis"
db <- diabetes %>% select(BMI, Diagnosis)
str(db)
## 'data.frame': 768 obs. of 2 variables:
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ Diagnosis: int 1 0 1 0 1 0 1 0 1 1 ...
# Convert Diagnosis to an ordered factor: 0 = Negative, 1 = Positive
db$Diagnosis %<>% factor(levels = c(0, 1), labels = c("Negative", "Positive"), ordered = TRUE)
str(db)
## 'data.frame': 768 obs. of 2 variables:
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ Diagnosis: Ord.factor w/ 2 levels "Negative"<"Positive": 2 1 2 1 2 1 2 1 2 2 ...
Based on the summary statistics, the minimum value for BMI is 0.
The zero values were filtered out and checked.
They are clearly inconsistent entries, because an individual's BMI cannot be zero.
After converting the zero values to NAs, there were 9 missing values for the non-diabetic group and 2 missing values for the diabetic group.
With 500 observations in the non-diabetic group and 268 observations in the diabetic group, the number of missing values is small in proportion. Omitting the missing values was therefore preferred to imputing them with the mean or median BMI, as mean/median imputation could affect the hypothesis test results.
## BMI Diagnosis
## Min. : 0.00 Negative:500
## 1st Qu.:27.30 Positive:268
## Median :32.00
## Mean :31.99
## 3rd Qu.:36.60
## Max. :67.10
| BMI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Diagnosis | Positive | Negative | Negative | Negative | Negative | Negative | Negative | Negative | Negative | Negative | Positive |
There are 491 non-diabetic observations and 266 diabetic observations after removing the missing values.
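The cleaning step described above is not shown as code, so the following is only a minimal sketch of how zeros could be converted to NA and then omitted. The data frame `toy` and its values are hypothetical stand-ins for `db`, and base R is used here as an assumption about the approach.

```r
# Hypothetical toy data standing in for db (values are illustrative only)
toy <- data.frame(
  BMI = c(33.6, 0, 23.3, 28.1, 0, 25.6),
  Diagnosis = factor(c("Positive", "Negative", "Positive",
                       "Negative", "Positive", "Negative"))
)

# A BMI of zero is impossible, so recode it as missing
toy$BMI[toy$BMI == 0] <- NA

# Count the missing values in each group
missing_by_group <- tapply(is.na(toy$BMI), toy$Diagnosis, sum)

# Drop the rows with missing BMI and recount the groups
toy_clean <- na.omit(toy)
table(toy_clean$Diagnosis)
```

Applied to the real data, the same steps would be expected to leave the group counts reported above.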
The mean BMI values for the non-diabetic and diabetic group of individuals are 30.86 and 35.41 respectively.
The BMI values fall within the range of [18.2 - 57.3] for the non-diabetic group and [22.9 - 67.1] for the diabetic group.
The guidelines for BMI values are:
According to the BMI guidelines, the maximum BMI values are very high. A BMI of 67.1 could correspond to a female individual who is approximately 165 cm tall and weighs 183 kg.
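The height and weight quoted for the maximum BMI can be checked with a short sketch. The helper `bmi` is a hypothetical name introduced here for illustration; it simply applies the definition of BMI as weight in kilograms divided by height in meters squared.

```r
# BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# 183 kg at 1.65 m, the illustrative values quoted in the text
example_bmi <- bmi(weight_kg = 183, height_m = 1.65)
round(example_bmi, 1)
```

The result is about 67.2, which is close to the observed maximum of 67.1.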
db %>% group_by(Diagnosis) %>% summarise(Min = min(BMI),
`First Quartile` = quantile(BMI, probs = 0.25),
Median = median(BMI),
`Third Quartile` = quantile(BMI, probs = 0.75),
Max = max(BMI),
`IQR` = IQR(BMI),
Mean = mean(BMI),
SD = sd(BMI),`Count` = n()) %>%
kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

| Diagnosis | Min | First Quartile | Median | Third Quartile | Max | IQR | Mean | SD | Count |
|---|---|---|---|---|---|---|---|---|---|
| Negative | 18.2 | 25.6 | 30.1 | 35.300 | 57.3 | 9.700 | 30.85967 | 6.560737 | 491 |
| Positive | 22.9 | 30.9 | 34.3 | 38.925 | 67.1 | 8.025 | 35.40677 | 6.614982 | 266 |
BMI values above the upper fences of 49.85 (non-diabetic group) and 50.96 (diabetic group) are outliers in their respective groups.
These values are extreme but not impossible; they indicate extreme obesity.
The BMI outliers shown in the table below were therefore not removed from the data set.
## [1] 49.85
## [1] 50.9625
diabetes %>% filter(BMI > 49.85) %>% kbl() %>% kable_styling(latex_options="scale_down", bootstrap_options="condensed")

| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Diagnosis |
|---|---|---|---|---|---|---|---|---|
| 0 | 162 | 76 | 56 | 100 | 53.2 | 0.759 | 25 | 1 |
| 1 | 88 | 30 | 42 | 99 | 55.0 | 0.496 | 26 | 1 |
| 7 | 152 | 88 | 44 | 0 | 50.0 | 0.337 | 36 | 1 |
| 0 | 129 | 110 | 46 | 130 | 67.1 | 0.319 | 26 | 1 |
| 11 | 135 | 0 | 0 | 0 | 52.3 | 0.578 | 40 | 1 |
| 0 | 165 | 90 | 33 | 680 | 52.3 | 0.427 | 23 | 0 |
| 5 | 115 | 98 | 0 | 0 | 52.9 | 0.209 | 28 | 1 |
| 0 | 180 | 78 | 63 | 14 | 59.4 | 2.420 | 25 | 1 |
| 3 | 123 | 100 | 35 | 240 | 57.3 | 0.880 | 22 | 0 |
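The code that produced the two upper fences is not shown. A minimal sketch, assuming the usual boxplot rule of the third quartile plus 1.5 times the IQR, reproduces them from the summary statistics reported earlier. The helper `upper_fence` is a hypothetical name.

```r
# Upper fence under the 1.5 x IQR rule (assumed to be the rule used)
upper_fence <- function(q3, iqr) {
  q3 + 1.5 * iqr
}

# Third quartile and IQR taken from the summary table for each group
fence_negative <- upper_fence(q3 = 35.300, iqr = 9.700)
fence_positive <- upper_fence(q3 = 38.925, iqr = 8.025)

c(Negative = fence_negative, Positive = fence_positive)
```

These evaluate to 49.85 and 50.9625, which match the printed fences.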
We want to investigate whether there is a statistically significant difference between the mean BMI of diabetic and non-diabetic individuals in the sample.
\[H_0: \mu_1 = \mu_2 \] \[H_A: \mu_1 \ne \mu_2\] \(\mu_1\) represents the population mean of the non-diabetic group and \(\mu_2\) represents the population mean of the diabetic group.
The first assumption has been met as the people in both groups are different individuals (i.e., not the same person), therefore both groups are independent of each other.
The side-by-side boxplots below also show the outliers as mentioned in the previous slide. Visually, the BMI values of the non-diabetic group appear to be approximately normal as the area of the boxplot on both sides of the median line looks almost symmetrical; whereas the boxplot for the diabetic group appears to be slightly right-skewed as the longer part of the box is above the median line.
boxplot <- boxplot(db$BMI ~ db$Diagnosis,
main = "Boxplot of BMI by Diabetes Diagnosis",
xlab = "Diabetes Diagnosis",
ylab = expression(paste("BMI (", kg/m^2, ")")),
col = c("#D55E00", "#009E73"))

Normality is easier to assess with a Q-Q plot for each group. In both Q-Q plots, both ends deviate from the straight line and curve upwards, indicating right-skewness. Many points fall outside the 95% confidence band depicted by the blue dashed lines, so we should be careful about assuming normality for either group: for a normal distribution, we expect most of the Q-Q plot points to lie within the confidence band.
Based on the summary statistics reported earlier, both groups have BMI distributions which are slightly right-skewed because the mean BMI (30.86) is larger than the median BMI (30.1) for the non-diabetic group, and similarly for the diabetic group, the mean BMI (35.41) is larger than the median BMI (34.3). The diabetic group has a larger right-skew compared to the non-diabetic group.
library(car)  # qqPlot()
db_positive <- db %>% filter(Diagnosis == "Positive")
db_negative <- db %>% filter(Diagnosis == "Negative")
par(mfrow = c(1,2))
qqPlot(db_negative$BMI, dist = "norm", ylab = expression(paste("BMI (", kg/m^2, ")")), xlab = "Theoretical Quantile", main = "Normal Q-Q Plot for Non-diabetic Group")
qqPlot(db_positive$BMI, dist = "norm", ylab = expression(paste("BMI (", kg/m^2, ")")), xlab = "Theoretical Quantile", main = "Normal Q-Q Plot for Diabetic Group")

Despite the right-skewed distributions, the Central Limit Theorem (CLT) allows us to treat the sampling distribution of each group's mean BMI as approximately normal.
According to the CLT:
In this case, the sample sizes of both groups are much larger than 30, so we can proceed with the two-sample \(\small t\)-test provided the remaining assumptions are met.
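As a rough illustration of why a large sample helps, the sketch below simulates a right-skewed population and compares the skewness of raw draws with the skewness of sample means of size 266. The gamma distribution, its parameters, the number of replications and the helper `skewness` are all assumptions chosen for illustration; none of them is fitted to the BMI data.

```r
set.seed(1)

# Simple moment-based sample skewness (illustrative helper)
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

# Raw draws from a right-skewed (gamma) population
population_draws <- rgamma(10000, shape = 2, rate = 1)

# Means of 2000 repeated samples, each of size 266
sample_means <- replicate(2000, mean(rgamma(266, shape = 2, rate = 1)))

skew_raw <- skewness(population_draws)
skew_means <- skewness(sample_means)
c(raw = skew_raw, means = skew_means)
```

The raw draws are clearly skewed, while the sample means are close to symmetric. This near-symmetry of the sample means is the property the \(\small t\)-test relies on.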
Of the assumptions listed earlier, the first two have been met. We will assess the third assumption, homogeneity of variance, using Levene’s test.
If we have unequal variances, Welch’s two-sample \(\small t\)-test will be used.
The statistical hypotheses for Levene’s test are: \[H_0: \sigma_1^2 = \sigma_2^2\] \[H_A: \sigma_1^2 \ne \sigma_2^2\] \(\small \sigma_1^2\) is the population variance of the non-diabetic group, and \(\small \sigma_2^2\) is the population variance of the diabetic group.
| | Df | F value | Pr(>F) |
|---|---|---|---|
| group | 1 | 1.620356 | 0.203434 |
| | 755 | | |
Based on Levene’s test, the \(\small p\)-value (denoted by \(\small\Pr(>F)\)) for BMI between the diabetic and non-diabetic groups is \(\small p=0.203434\), which is larger than the significance level of \(\small \alpha=0.05\); therefore we fail to reject the null hypothesis \(\small H_0\) of Levene’s test.
There is no evidence of unequal variances, so we proceed on the assumption that the populations of both groups have equal variances.
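The call that produced the Levene table is not shown, so the following is only a sketch of the idea behind the test, written in base R. It assumes the median-centred variant: take each value's absolute deviation from its group median, then run a one-way ANOVA on those deviations. The function `levene_median` and the simulated data are hypothetical; the simulated values only mimic the group sizes and will not reproduce the reported F value.

```r
# Median-centred Levene's test (assumed variant), base R only
levene_median <- function(y, g) {
  g <- factor(g)
  # Absolute deviation of each value from its own group's median
  z <- abs(y - ave(y, g, FUN = median))
  # One-way ANOVA on the deviations gives the Levene F statistic
  fit <- anova(lm(z ~ g))
  c(df1 = fit$Df[1], df2 = fit$Df[2],
    F = fit[["F value"]][1], p = fit[["Pr(>F)"]][1])
}

# Simulated stand-in data with the same group sizes as the cleaned sample
set.seed(42)
y <- c(rnorm(491, mean = 30.9, sd = 6.6), rnorm(266, mean = 35.4, sd = 6.6))
g <- rep(c("Negative", "Positive"), times = c(491, 266))

result <- levene_median(y, g)
result
```

With 757 observations in two groups, the degrees of freedom are 1 and 755, as in the table above.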
Assuming equal variances, we set \(\small var.equal=TRUE\) and perform a two-sided two-sample \(\small t\)-test.
##
## Two Sample t-test
##
## data: BMI by Diagnosis
## t = -9.0772, df = 755, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Negative and group Positive is not equal to 0
## 95 percent confidence interval:
## -5.530483 -3.563703
## sample estimates:
## mean in group Negative mean in group Positive
## 30.85967 35.40677
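The pooled-variance calculation behind this output can be sketched from the group summary statistics reported earlier (means, standard deviations and counts). This is a hand calculation for illustration, not the call that produced the output. The variable names are hypothetical.

```r
# Summary statistics from the group summary table
n1 <- 491; mean1 <- 30.85967; sd1 <- 6.560737   # non-diabetic (Negative)
n2 <- 266; mean2 <- 35.40677; sd2 <- 6.614982   # diabetic (Positive)

# Pooled variance and degrees of freedom under equal variances
df <- n1 + n2 - 2
pooled_var <- ((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / df

# Standard error of the difference in means and the t statistic
std_error <- sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat <- (mean1 - mean2) / std_error

# 95% confidence interval for the difference in means
ci <- (mean1 - mean2) + c(-1, 1) * qt(0.975, df) * std_error

round(c(t = t_stat, df = df, lower = ci[1], upper = ci[2]), 4)
```

Up to rounding, the values agree with the reported \(\small t\) statistic, degrees of freedom and confidence interval.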
Recall the null hypothesis of the two-sample \(\small t\)-test stated earlier, \(\small H_0: \mu_1 = \mu_2\) (equivalently \(\small \mu_1 - \mu_2 = 0\)): the difference between the population means of the two independent groups is 0.
## [1] 9.545638e-19
Based on the two-sample \(\small t\)-test, the difference in sample mean BMI is \(\small \bar{x}_1 - \bar{x}_2 = 30.85967 - 35.40677 = -4.5471\). The two-tailed \(\small p\)-value reported above, \(\small p = 9.545638\times 10^{-19}\), is the probability of observing a difference in sample means at least as extreme as \(\small -4.5471\) if the null hypothesis were true. Since the \(\small p\)-value is very small, we will write \(\small p < 0.001\) for short.
The lower and upper bound of the 95% confidence interval of the mean difference are \(\small[-5.530483, -3.563703]\).
## [1] -1.963111
Computing the two-tailed \(\small t\)-critical value with \(\small \frac{\alpha}{2} = \frac{0.05}{2} = 0.025\) in each tail and degrees of freedom \(\small df=755\), the lower \(\small t\)-critical value is \(\small -1.963111\); by symmetry, the upper critical value is \(\small 1.963111\).
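All three approaches lead to the same decision: the test statistic \(\small t(df=755)=-9.0772\) lies beyond the critical value of \(\small -1.963111\), \(\small p<0.001\) is below the \(\small 0.05\) significance level, and the \(\small 95\%\) CI for the difference in means \(\small [-5.530483, -3.563703]\) does not contain 0. We therefore reject the null hypothesis \(\small H_0\), which states that there is no difference between the population mean BMI of the two groups.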
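The three decision rules can be written out as a short sketch using the reported test statistic, degrees of freedom and confidence interval. The variable names and the choice of base R functions `pt` and `qt` are assumptions; the code that produced the printed values is not shown.

```r
# Reported values from the two-sample t-test
t_stat <- -9.0772
df <- 755
alpha <- 0.05
ci <- c(-5.530483, -3.563703)

# 1. p-value approach: two-tailed p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
reject_by_p <- p_value < alpha

# 2. Critical value approach: compare |t| with the upper critical value
t_crit <- qt(1 - alpha / 2, df)
reject_by_crit <- abs(t_stat) > t_crit

# 3. Confidence interval approach: reject if the interval excludes 0
reject_by_ci <- !(ci[1] <= 0 && 0 <= ci[2])

c(p = reject_by_p, critical = reject_by_crit, ci = reject_by_ci)
```

All three checks point to rejecting \(\small H_0\), consistent with the conclusion above.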
Al-Goblan, A. S., Al-Alfi, M. A., & Khan, M. Z. (2014). Mechanism linking diabetes mellitus and obesity. Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy, 7, 587-591. https://dx.doi.org/10.2147%2FDMSO.S67400
Baglin, J. (2020). Module 7 Testing the Null: Data on Trial. [Module Webpage]. Canvas @ RMIT University. https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html
Github. (2019). Pima-Indians-Diabetes-Dataset. https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/diabetes.csv
Gupta, S., & Bansal, S. (2021). Correction: Does a rise in BMI cause an increased risk of diabetes?: Evidence from India. PLoS One, 16(2), e0247537. https://dx.doi.org/10.1371%2Fjournal.pone.0247537
World Health Organization. (2021). Diabetes. https://www.who.int/health-topics/diabetes#tab=tab_1