Introduction:
For many years, scientists have questioned why so many Pima Indian women suffer from diabetes compared with women of other ethnicities.
To test whether there is a relationship between the number of times a woman was pregnant and her BMI among Pima Indian women older than 21, I used a dataset that records these and other variables, such as whether each woman has diabetes and her diabetes pedigree function (a function that represents how likely she is to get the disease, extrapolated from her ancestors' history).
According to http://www.personal.kent.edu/~mshanker/personal/Zip_files/sar_2000.pdf, the diabetes pedigree function provides “a synthesis of the diabetes mellitus history in relatives and the genetic relationship of those relatives to the subject.” It uses information from a person’s family history to predict how diabetes will affect that individual. According to http://www.rci.rutgers.edu/~cabrera/587/pima.pdf, “many Pima Indians have diabetes”. I was intrigued by why this might be and which variables may influence it. To test the relationship between BMI and the number of times the women were pregnant, I will compare the two variables visually.
From an initial look at the data, it seems that BMI is likely to increase with a greater number of pregnancies, and that those with a higher BMI are more likely to have diabetes. According to http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1890993/, “an increase in body fat is generally associated with an increase in risk of metabolic diseases such as type 2 diabetes mellitus”.
I predict that BMI will generally be higher for women who have had more pregnancies, as well as for those who test positive for diabetes. I also predict that women with a higher pedigree function will tend to have tested positive and women with a lower pedigree function will tend to have tested negative.
Main Hypothesis:
Is there a difference in mean BMI and in mean number of pregnancies between those who tested positive for diabetes and those who tested negative? Also, is there a relationship between the diabetes test results and the pedigree function?
Null Hypothesis: There is no difference between the two group means.
Alternative Hypothesis: There is a difference between the two group means.
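One way to put these hypotheses to a formal test is a Welch two-sample t-test. The sketch below is not part of the original analysis. It runs on a small invented data frame (`toy`) that only borrows the column names `BMI` and `classvariable` from the Methods section, and it assumes that `classvariable` codes a negative test as 0 and a positive test as 1.

```r
# Toy values invented purely for illustration; these are NOT rows from the Pima data.
toy <- data.frame(
  BMI           = c(24.1, 27.3, 30.2, 26.8, 29.5, 33.4, 35.1, 31.7, 38.2, 34.6),
  classvariable = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)  # assumed coding: 0 = negative, 1 = positive
)

# Welch two-sample t-test. H0: mean BMI is the same in both groups.
result <- t.test(BMI ~ classvariable, data = toy)

print(result$estimate)  # the two group means
print(result$p.value)   # a small p-value is evidence against H0
```

On the real data the same call would take `pima` in place of `toy`, and swapping `BMI` for `numpreg` would test the pregnancy counts in the same way.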
Methods
In order to download the data, I embedded the URL for the data in the read.table command. The data contained missing values recorded as “0.0” in biologically impossible places. To get rid of those, I changed them to NAs and then used the complete.cases command to remove any rows containing NAs.
library(ggplot2)
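# setInternet2(use = TRUE) was a Windows-only workaround for older versions of R; current R no longer needs it, so the call is dropped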
# download the data, reading the "0.0" strings as NA
dataset <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data",
                      sep = ",", na.strings = "0.0", strip.white = TRUE, fill = TRUE)
# remove every row that contains an NA
pima <- dataset[complete.cases(dataset), ]
# give the columns more identifiable names
names(pima) <- c("numpreg", "plasmacon", "bloodpress", "skinfold", "seruminsulin", "BMI",
                 "pedigreefunction", "age", "classvariable")
Results for Graph 1
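The plotting code for this graph is not shown, so the following is only a sketch of how such a figure might be drawn with ggplot2: box plots of BMI for each pregnancy count, with one panel per test result. The box-plot layout is inferred from the discussion of medians, ranges, and outliers. The `toy` data frame is invented so that the block runs on its own; the real figure would use `pima`.

```r
library(ggplot2)

# Invented rows for illustration only; swap in `pima` for the real analysis.
toy <- data.frame(
  numpreg       = rep(c(0, 1, 2), times = 6),
  BMI           = c(22.5, 26.1, 24.8, 30.2, 27.9, 25.4, 28.3, 23.7, 29.6,
                    31.4, 35.2, 33.8, 29.9, 36.5, 32.1, 34.7, 30.8, 37.3),
  classvariable = rep(c(0, 1), each = 9)  # assumed coding: 0 = negative, 1 = positive
)

# One box per pregnancy count, one panel per test result.
graph1 <- ggplot(toy, aes(x = factor(numpreg), y = BMI)) +
  geom_boxplot() +
  facet_wrap(~ classvariable, labeller = label_both) +
  labs(x = "Number of pregnancies", y = "BMI")

print(graph1)
```

Treating `numpreg` as a factor gives each pregnancy count its own box, which is what makes per-count medians and ranges readable.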
A clear relationship between the number of times women were pregnant and their BMI is not obvious from the graphs alone. On closer inspection, however, those who tested negative for diabetes seem to have a lower body mass index. This lines up with research, because those who have diabetes generally have a higher BMI. As for the relationship between BMI and number of pregnancies, women who were pregnant fewer times seem to have wider BMI ranges, and the ranges narrow as the number of pregnancies increases. This could have several explanations. First, there may simply be more women who were pregnant 0 times than 17 times, so there would be more data points for 0 pregnancies. There could also be some association between pregnancy and BMI such that more pregnancies lead everyone toward the same BMI. However, within each test result group, the median BMI for each number of pregnancies does not seem to follow any trend. Both graphs also seem to have some outliers. Among those who tested positive, there were more outliers in which women with very few pregnancies had very high BMIs. Among those who tested negative, the women who had about 6-8 pregnancies seemed to have relatively high BMIs.
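The first explanation above, that some pregnancy counts simply have more observations, can be checked directly by counting rows and measuring the BMI spread within each count. This is a minimal sketch on invented values: the `toy` data frame is hypothetical, and the real check would use `pima`.

```r
# Invented values for illustration only; not taken from the Pima data.
toy <- data.frame(
  numpreg = c(0, 0, 0, 0, 0, 1, 1, 1, 6, 6, 17),
  BMI     = c(21.0, 45.3, 30.1, 27.4, 38.9, 25.2, 33.0, 29.8, 31.5, 34.2, 40.9)
)

# How many women fall under each pregnancy count?
counts <- table(toy$numpreg)

# Width of the BMI range (max minus min) within each pregnancy count.
bmi_spread <- tapply(toy$BMI, toy$numpreg, function(x) diff(range(x)))

print(counts)
print(bmi_spread)
```

A group with a single observation necessarily has zero spread, so narrowing ranges at high pregnancy counts may reflect sample size rather than biology.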
Results for Graph 2
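As with the first graph, the plotting code is not shown. A figure of this kind might be drawn as a pair of box plots of the pedigree function, one per test result. The sketch below makes that assumption, uses an invented `toy` data frame so that it runs without a download, and assumes `classvariable` is coded 0 for negative and 1 for positive.

```r
library(ggplot2)

# Invented rows for illustration only; swap in `pima` for the real analysis.
toy <- data.frame(
  pedigreefunction = c(0.21, 0.35, 0.18, 0.42, 0.27, 0.30, 0.25, 0.39, 0.22,
                       0.48, 0.61, 0.37, 0.72, 0.55, 0.44, 0.83, 0.51, 0.66),
  classvariable    = rep(c(0, 1), each = 9)
)

# One box per test result, pedigree function on the vertical axis.
graph2 <- ggplot(toy, aes(x = factor(classvariable), y = pedigreefunction)) +
  geom_boxplot() +
  labs(x = "Test result (assumed 0 = negative, 1 = positive)",
       y = "Diabetes pedigree function")

print(graph2)
```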
This graph shows more clearly the relationship between the pedigree function and the women's diabetes test results. Since those who tested positive have a higher median and more high outliers, the pedigree function does appear to help estimate the test results for diabetes. This is consistent with diabetes having a genetic component, so that those whose ancestors suffered from it have a higher risk of getting the disease themselves. Both test results show many outliers, yet the outliers for those who tested negative seem to have lower pedigree functions than the outliers for those who tested positive. In addition, the interquartile range for the women who tested positive reaches a higher BMI than the IQR for those who tested negative. Therefore, women could have higher BMIs and not be outliers if they tested positive as opposed to negative, showing that more women who tested positive did, in fact, have higher BMIs than those who tested negative.
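The medians and interquartile ranges read off a box plot can also be computed numerically, which makes the group comparison less dependent on eyeballing the figure. The sketch below uses invented pedigree values in a hypothetical `toy` data frame and assumes `classvariable` is coded 0 for a negative and 1 for a positive test. On the real data the same calls would take `pima`.

```r
# Invented values for illustration only; not taken from the Pima data.
toy <- data.frame(
  pedigreefunction = c(0.20, 0.30, 0.40, 0.50, 0.60, 0.40, 0.60, 0.80, 1.00, 1.20),
  classvariable    = rep(c(0, 1), each = 5)
)

# Median and interquartile range of the pedigree function within each group.
group_median <- tapply(toy$pedigreefunction, toy$classvariable, median)
group_iqr    <- tapply(toy$pedigreefunction, toy$classvariable, IQR)

print(group_median)
print(group_iqr)
```

Replacing `pedigreefunction` with `BMI` in both calls gives the matching summary for BMI by test result.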
Conclusion
Overall, it seems that there is some form of association between BMI, number of pregnancies, pedigree function, and the test results for diabetes. I was surprised that the median BMI did not change much as the number of pregnancies increased; I had expected a strong positive relationship between the two. I was not surprised that, overall, those who tested positive for diabetes had higher BMIs than those who did not, although I had predicted a larger difference between the medians.
For future reports, I would suggest taking a larger sample of women of all ethnic backgrounds. Why Pima Indian women have a higher chance of getting diabetes than others would be an interesting question to study, and it would be more valuable if more ethnicities were involved, because the plots could then be faceted by ethnicity and that relationship inspected more easily. In addition, to study the relationship between the pedigree function and the test results, it would be interesting to include males and people under 21 in the sample alongside women over 21. That way, possible confounding variables, such as a hormone found only in females that may cause diabetes, could be ruled out.
In addition, the data that I received contained some 0s in columns where a zero does not make biological sense. In order to get results as accurate and useful as possible, I removed the 0s from the BMI column. In future reports, it would be valuable to question those 0s further: whether they should be part of the data and why they were recorded.
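Before deciding how to handle the zeros, it helps to see which columns contain them. The sketch below counts zeros per column and recodes them as NA only in columns where a zero is assumed to be impossible. The `toy` rows are invented, and the choice of columns in `impossible_zero` is an assumption made for illustration, not something the dataset documentation states.

```r
# Invented rows for illustration only; not taken from the Pima data.
toy <- data.frame(
  numpreg       = c(0, 2, 5, 1),
  bloodpress    = c(72, 0, 64, 80),
  BMI           = c(33.6, 0, 28.1, 0),
  classvariable = c(1, 0, 0, 1)
)

# Zeros per column; legitimate zeros (no pregnancies, negative test) are counted too.
zero_counts <- colSums(toy == 0)

# Columns where a zero is ASSUMED to mean "not recorded" (hypothetical choice).
impossible_zero <- c("bloodpress", "BMI")

# Recode zeros as NA in those columns only, leaving legitimate zeros alone.
cleaned <- toy
for (col in impossible_zero) {
  cleaned[[col]][cleaned[[col]] == 0] <- NA
}

print(zero_counts)
print(sum(complete.cases(cleaned)))  # rows that remain after dropping NAs
```

Comparing values numerically also catches zeros written as `0` rather than `0.0`, which a string match on "0.0" at read time would miss.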
It would also be interesting to continue researching relationships between diabetes and other variables. For future research, collecting a larger sample and running different tests that block on different variables could give even more insight into factors associated with diabetes, for Pima Indian women and for others as well.