Studies of the relationship between BMI and blood sugar levels have reached conflicting conclusions. A study of an African population concluded that “BMI…correlates with random blood glucose levels” (http://ijod.uaeu.ac.ae/iss_1403/e.pdf). However, the same report notes that other large studies “did not demonstrate a correlation between casual blood sugar and BMI”. Even so, a relationship between blood sugar levels and BMI seems plausible, and for that reason I expect a relationship between a person’s ability to process glucose and their BMI. In the Pima Indians Diabetes dataset, the ability to process glucose is measured by the glucose concentration two hours after an oral glucose tolerance test (OGTT). A higher glucose concentration indicates a reduced ability to process glucose, so I expect higher glucose concentrations to be associated with higher BMI values. I will investigate this relationship with a significance test of whether the slope of the least-squares regression line (LSRL) relating the two variables is significantly greater than zero.
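Stated formally, the test described above can be sketched as follows. The symbols are notation introduced here for illustration and do not come from the dataset's documentation: \beta is the true slope of the line relating BMI to glucose concentration, b is the slope estimated from the sample, SE_b is its standard error, and n is the number of people remaining after cleaning.

```latex
H_0 : \beta = 0
\qquad \text{versus} \qquad
H_a : \beta > 0,
\qquad
t = \frac{b}{\mathrm{SE}_b},
\quad
\mathrm{df} = n - 2
```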
Shown below is the code used to read in the Pima Indians Diabetes dataset from the UCI Machine Learning Repository, clean the data, calculate the residuals, and categorize all of the people based on their age. In this dataset, zeros appear to represent missing values: there are a number of zeros in columns such as BMI, which cannot possibly be zero.
#Enables https downloads on older Windows builds of R; try() lets the script continue where setInternet2() is unavailable
try(setInternet2(use = TRUE), silent = TRUE)
library(ggplot2)
#Reads in the Pima Indians Dataset from the UCI Repository
diabetesData <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", header = FALSE)
#Names the columns of the data based on what they represent
columnNames <- c("Pregnancies", "Glucose", "BP", "TST", "Insulin", "BMI", "Function", "Age", "ClassVariable")
names(diabetesData) <- columnNames
#Removes every row with a zero in the BMI or Glucose column
cleanedDiabetesData <- diabetesData[diabetesData$Glucose != 0 & diabetesData$BMI != 0, ]
#Creates a new column classifying each person by which age category they fall into
cleanedDiabetesData$AgeClass <- ""
cleanedDiabetesData$AgeClass[cleanedDiabetesData$Age <= 30] <- "21-30"
cleanedDiabetesData$AgeClass[cleanedDiabetesData$Age > 30 & cleanedDiabetesData$Age <= 50] <- "31-50"
cleanedDiabetesData$AgeClass[cleanedDiabetesData$Age > 50] <- "50+"
#Creates a new column for the residuals of the LSRL of BMI on Glucose (observed BMI minus predicted BMI)
cleanedDiabetesData$residuals <- residuals(lm(BMI ~ Glucose, data = cleanedDiabetesData))
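The zero-as-missing assumption mentioned above can be checked before any rows are dropped. The following is a minimal sketch rather than the code used for this report: `countZeros` and `zerosToNA` are hypothetical helper names, and `toyData` holds invented values so that the block runs on its own.

```r
# Hypothetical helper: count the zero entries in each named column.
countZeros <- function(data, columns) {
  sapply(data[columns], function(x) sum(x == 0, na.rm = TRUE))
}

# Hypothetical helper: recode zeros as NA in columns where zero is impossible.
zerosToNA <- function(data, columns) {
  for (column in columns) {
    data[[column]][which(data[[column]] == 0)] <- NA
  }
  data
}

# Toy values invented for illustration; these are NOT rows from the real dataset.
toyData <- data.frame(Glucose = c(148, 0, 183, 89),
                      BMI     = c(33.6, 26.6, 0, 28.1))

# How many zeros does each column contain?
zeroCounts <- countZeros(toyData, c("Glucose", "BMI"))
print(zeroCounts)

# Recode the zeros as NA, then drop every row with a missing value.
toyCleaned <- na.omit(zerosToNA(toyData, c("Glucose", "BMI")))
print(toyCleaned)
```

Recoding zeros as `NA` instead of filtering them directly keeps the missingness explicit, so later steps such as `lm()` can drop incomplete rows on their own.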
Shown below is the scatterplot relating BMI to glucose concentration levels 2 hours after an OGTT, with different colors representing different age groups:
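The plotting code is not included in this report, so the following is only a sketch of how such a scatterplot could be drawn with ggplot2. The helper name `plotBmiVsGlucose`, the axis labels, and the simulated `toyData` are all assumptions made for illustration; with the real data, the cleaned data frame would be passed in instead.

```r
library(ggplot2)

# Hypothetical helper: scatterplot of BMI against glucose, colored by age group,
# with a single least-squares line drawn through all of the points.
plotBmiVsGlucose <- function(data) {
  ggplot(data, aes(x = Glucose, y = BMI)) +
    geom_point(aes(color = AgeClass)) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
    labs(x = "Glucose concentration 2 hours after an OGTT",
         y = "BMI",
         color = "Age group")
}

# Simulated stand-in data so the sketch runs on its own; NOT the real dataset.
set.seed(1)
toyData <- data.frame(
  Glucose  = round(runif(60, min = 60, max = 200)),
  AgeClass = sample(c("21-30", "31-50", "50+"), 60, replace = TRUE)
)
toyData$BMI <- 26 + 0.05 * toyData$Glucose + rnorm(60, sd = 6)

scatterplot <- plotBmiVsGlucose(toyData)
print(scatterplot)
```

Mapping color only inside `geom_point()` keeps the regression line common to all age groups, which matches a single LSRL fitted to the whole sample.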
I will now check the conditions to see whether it is appropriate to use a significance test of whether the slope of the LSRL between BMI and glucose levels two hours after an OGTT is significantly greater than zero.
The dataset’s documentation does not indicate whether it is a random sample, so I will have to assume that it is. Under that assumption, I will also treat the independence condition as met. Although the points in the scatterplot form a loose cluster, they show no curved pattern and appear to follow a roughly linear trend. Shown below is a histogram of the residuals from the scatterplot above, which I will use to determine whether the BMI values are Normally distributed around the LSRL:
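A histogram of this kind could be produced along the following lines. This is a sketch using base R graphics, not necessarily the code behind the figure in this report; `plotResidualHistogram` is a hypothetical helper and `toyData` is simulated purely so that the block is runnable.

```r
# Hypothetical helper: fit the least-squares line of BMI on Glucose and
# draw a histogram of its residuals. Returns the fitted model invisibly.
plotResidualHistogram <- function(data) {
  fit <- lm(BMI ~ Glucose, data = data)
  hist(residuals(fit),
       main = "Residuals of BMI regressed on Glucose",
       xlab = "Residual (observed BMI minus predicted BMI)")
  invisible(fit)
}

# Simulated stand-in data so the sketch runs on its own; NOT the real dataset.
set.seed(1)
toyData <- data.frame(Glucose = round(runif(60, min = 60, max = 200)))
toyData$BMI <- 26 + 0.05 * toyData$Glucose + rnorm(60, sd = 6)

fit <- plotResidualHistogram(toyData)

# Least-squares residuals average to zero whenever the model has an intercept.
meanResidual <- mean(residuals(fit))
print(meanResidual)
```

Because least-squares residuals average to zero by construction, the shape of the histogram (symmetry, tails, outliers) is the informative part, more so than where it is centered.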
This histogram shows that the distribution of residuals is roughly symmetric, close to bell-shaped, and centered near zero, which suggests that the BMI values are approximately Normally distributed around the LSRL. Furthermore, the scatterplot above suggests that the spread of BMI values is roughly equal across all values of glucose concentration two hours after an OGTT. Thus, since the conditions for inference appear to be met, it is appropriate to use a significance test to see whether the slope of the LSRL relating BMI to glucose concentration levels two hours after an OGTT is significantly greater than zero.
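The equal-variance condition is judged by eye above. One rough way to supplement that judgment is to plot the residuals against glucose and compare their spread in the lower and upper halves of the glucose range. This is an illustrative sketch only: `checkResidualSpread` is a hypothetical helper, splitting at the median is an arbitrary choice, the data are assumed to contain no missing values, and `toyData` is simulated.

```r
# Hypothetical helper: plot residuals against Glucose and return the standard
# deviation of the residuals below and above the median Glucose value.
# Assumes the BMI and Glucose columns contain no missing values.
checkResidualSpread <- function(data) {
  fit <- lm(BMI ~ Glucose, data = data)
  plot(data$Glucose, residuals(fit),
       xlab = "Glucose concentration 2 hours after an OGTT",
       ylab = "Residual")
  abline(h = 0, lty = 2)  # reference line at a residual of zero
  half <- ifelse(data$Glucose <= median(data$Glucose), "lower", "upper")
  tapply(residuals(fit), half, sd)
}

# Simulated stand-in data so the sketch runs on its own; NOT the real dataset.
set.seed(1)
toyData <- data.frame(Glucose = round(runif(60, min = 60, max = 200)))
toyData$BMI <- 26 + 0.05 * toyData$Glucose + rnorm(60, sd = 6)

spreadByHalf <- checkResidualSpread(toyData)
print(spreadByHalf)
```

Similar standard deviations in the two halves, together with a residual plot that shows no fanning pattern, would be consistent with roughly constant variance.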
Shown below are the results of a significance test on the slope of the LSRL relating BMI to glucose concentration levels 2 hours after an OGTT, with an alpha level of 0.01:
#Conducts a significance test on the slope of the LSRL of BMI on Glucose
cleanedDiabetesData.lm <- lm(formula = BMI ~ Glucose, data = cleanedDiabetesData)
summary(cleanedDiabetesData.lm)
##
## Call:
## lm(formula = BMI ~ Glucose, data = cleanedDiabetesData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -13.227  -4.967  -0.520   4.210  34.273
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.027674   1.010873  25.748  < 2e-16 ***
## Glucose      0.052705   0.008041   6.555 1.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.743 on 750 degrees of freedom
## Multiple R-squared: 0.05418, Adjusted R-squared: 0.05292
## F-statistic: 42.96 on 1 and 750 DF, p-value: 1.037e-10
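The p-value reported for the Glucose slope is two-sided; because the estimated slope is positive, the one-sided p-value for the alternative that the slope is greater than zero is half of it, or about 5.2e-11. Since this is far below 0.01, there is strong evidence that the slope of the LSRL relating BMI to glucose concentration levels two hours after an OGTT is greater than zero.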
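The figures in the output above can be reproduced from the reported slope estimate, its standard error, and the residual degrees of freedom. The sketch below copies those three numbers from the output; the variable names are chosen here for illustration. Because the inputs are rounded, the results agree with the printed output only to the displayed precision.

```r
# Values copied from the summary() output above.
slopeEstimate <- 0.052705
slopeStdError <- 0.008041
residualDf    <- 750

# The t statistic is the estimate divided by its standard error.
tStatistic <- slopeEstimate / slopeStdError

# Two-sided p-value, as printed by summary().
twoSidedP <- 2 * pt(abs(tStatistic), df = residualDf, lower.tail = FALSE)

# One-sided p-value for the alternative that the slope is greater than zero.
oneSidedP <- pt(tStatistic, df = residualDf, lower.tail = FALSE)

# In simple linear regression, R-squared can be recovered from t and the df.
rSquared <- tStatistic^2 / (tStatistic^2 + residualDf)

print(c(t = tStatistic, twoSided = twoSidedP, oneSided = oneSidedP, r2 = rSquared))
```

The 750 residual degrees of freedom also imply that 752 people remained after cleaning, since a simple linear regression has n - 2 degrees of freedom.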
From the results of the significance test, there does appear to be a positive association between a person’s BMI and their ability to process sugars, as measured by glucose concentration two hours after an OGTT. Even so, there are a number of limitations to this conclusion. The association is weak: the R-squared value of 0.05418 means that glucose concentration accounts for only about 5% of the variation in BMI. The Pima Indians Diabetes dataset also includes only women of Pima Indian heritage who are at least 21 years old. This is a narrow section of the population, so even though these results are interesting, they cannot be generalized to most people. A much more meaningful follow-up might involve taking a random sample of people from around the United States in order to draw a more widely applicable conclusion about the relationship between BMI and a person’s ability to process sugars.
It is also interesting to note that, in the scatterplot above, younger people (those between the ages of 21 and 30) appear to have lower glucose concentrations, and therefore a greater ability to process glucose, than those older than 50. This would be another interesting observation to follow up on with a broader sample of people, perhaps from around the country.
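The age-group pattern above is read off the scatterplot by eye. A first numerical follow-up could compare the average glucose concentration in each age group, as in the sketch below. `meanGlucoseByAge` is a hypothetical helper and `toyData` contains invented values, so the printed numbers say nothing about the real dataset.

```r
# Hypothetical helper: mean glucose concentration within each age group.
meanGlucoseByAge <- function(data) {
  tapply(data$Glucose, data$AgeClass, mean)
}

# Toy values invented for illustration; these are NOT rows from the real dataset.
toyData <- data.frame(
  Glucose  = c(100, 110, 150, 170),
  AgeClass = c("21-30", "21-30", "50+", "50+")
)

groupMeans <- meanGlucoseByAge(toyData)
groupCounts <- table(toyData$AgeClass)  # group sizes, to judge how reliable each mean is

print(groupMeans)
print(groupCounts)
```

Reporting the group sizes alongside the means matters here, because an average based on only a few people in one age group would be a shaky basis for comparison.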