Pima Indians Diabetes Data Set

diabetes <- read.csv("/var/folders/l8/b0grpgd925dd7t6pq41f4mjc0000gn/T//RtmprNTNg3/data6ee662fa677", header=FALSE)
View(diabetes)
colnames(diabetes)<-c("Pregnancies","PlasmaGC","DiastolicBP","Triceps","SerumIns","BMI","DiabetesPed","Age","Class")
dim(diabetes)
## [1] 768   9

Data description

The Pima Indians Diabetes Data Set was compiled by researchers at the Johns Hopkins University School of Medicine, from a larger database owned by the National Institute of Diabetes and Digestive and Kidney Diseases. Included in the dataset is information on females of Pima Indian heritage who were at least 21 years old at the time of data collection. Pima Indians have one of the highest rates of diabetes in the world, and the researchers at Johns Hopkins collected this dataset with the intention of creating a model that would predict the onset of diabetes in the Pima Indian population.

This dataset includes 768 observations, each describing one individual. The key response variable is diabetes, which is defined by the World Health Organization as a plasma glucose concentration greater than 200 mg/dl 2 hours following ingestion of a 75 g carbohydrate solution. The explanatory variables are known risk factors for diabetes: number of pregnancies, 2-hour plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness (an indicator of body fat), 2-hour serum insulin, body mass index, age, and diabetes pedigree function (a synthesis of diabetes history in an individual’s relatives).

In this dataset, I am assuming that a value of 0 represents a missing value wherever a 0 is not biologically possible (e.g. plasma glucose concentration and BMI cannot be 0). Coding missing values as 0 complicates the analysis, because 0 is a reasonable value for the “Pregnancies” variable, and there is no information on whether a 0 there means the value is missing or the woman has had 0 pregnancies. Many of the observations that are missing a value for the variable SerumIns are also missing a value for the variable Triceps. I did not see any other pattern in the missing values.
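
One way to check the overlap between missing SerumIns and missing Triceps values is a cross-tabulation of the 0 codes. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not rows from the real dataset), so it only illustrates the approach:

```r
# Toy data frame (hypothetical values); 0 stands in for "not recorded"
toy <- data.frame(
  Triceps  = c(35, 0, 29, 0, 0, 23),
  SerumIns = c(0, 0, 94, 0, 0, 168)
)

# Cross-tabulate which rows have a 0 in each column
co_missing <- table(
  Triceps_zero  = toy$Triceps == 0,
  SerumIns_zero = toy$SerumIns == 0
)
print(co_missing)

# Share of rows with a 0 for Triceps among rows with a 0 for SerumIns
share <- mean(toy$Triceps[toy$SerumIns == 0] == 0)
print(share)
```

The same two calls applied to the real columns would show how much of the Triceps missingness sits inside the SerumIns missingness.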

For the variables where a value of 0 is not plausible, I replaced the 0 values with NA:

diabetes$PlasmaGC[diabetes$PlasmaGC==0]<-NA
diabetes$DiastolicBP[diabetes$DiastolicBP==0]<-NA
diabetes$Triceps[diabetes$Triceps==0]<-NA
diabetes$SerumIns[diabetes$SerumIns==0]<-NA
diabetes$BMI[diabetes$BMI==0]<-NA
diabetes$Age[diabetes$Age==0]<-NA
colSums(is.na(diabetes))
## Pregnancies    PlasmaGC DiastolicBP     Triceps    SerumIns         BMI 
##           0           5          35         227         374          11 
## DiabetesPed         Age       Class 
##           0           0           0
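
The column-by-column replacement can also be written once over a vector of column names. This is a minimal sketch on a hypothetical data frame (`toy` and `zero_means_missing` are names I made up for illustration); Pregnancies is deliberately left out because a 0 is a valid value there:

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  Pregnancies = c(0, 2, 5),
  PlasmaGC    = c(148, 0, 183),
  BMI         = c(33.6, 26.6, 0)
)

# Columns where a 0 is treated as a missing-value code
zero_means_missing <- c("PlasmaGC", "BMI")

# Replace 0 with NA in those columns only
toy[zero_means_missing] <- lapply(
  toy[zero_means_missing],
  function(x) replace(x, x == 0, NA)
)

# Count the missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```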

The following table describes the variables in the dataset:

a<-c("Pregnancies","PlasmaGC","DiastolicBP","Triceps","SerumIns","BMI","DiabetesPed","Age","Class")
b<-c("Number of pregnancies","Plasma glucose concentration","Diastolic blood pressure (mm Hg)", "Triceps skinfold thickness (mm)", "2 hr. serum insulin (mu U/ml)","Body mass index","Diabetes pedigree function","Participant age (years)","Diagnosed with diabetes")
c<-c("Continuous","Continuous","Continuous","Continuous","Continuous","Continuous","Continuous","Continuous","Binary")
d<-c(0,5,35,227,374,11,0,0,0)
DiabetesData<-data.frame(Variable=a,Definition=b,Type=c,Missing=d)
DiabetesData
##      Variable                       Definition       Type Missing
## 1 Pregnancies            Number of pregnancies Continuous       0
## 2    PlasmaGC     Plasma glucose concentration Continuous       5
## 3 DiastolicBP Diastolic blood pressure (mm Hg) Continuous      35
## 4     Triceps  Triceps skinfold thickness (mm) Continuous     227
## 5    SerumIns    2 hr. serum insulin (mu U/ml) Continuous     374
## 6         BMI                  Body mass index Continuous      11
## 7 DiabetesPed       Diabetes pedigree function Continuous       0
## 8         Age          Participant age (years) Continuous       0
## 9       Class          Diagnosed with diabetes     Binary       0
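
The Missing column above was typed in by hand from the colSums() output. As a hedged alternative, the counts could be taken directly from the data frame so the table cannot drift out of sync with the data. The sketch below uses a hypothetical three-column data frame (`toy`, `summary_tbl`):

```r
# Toy data frame with one missing BMI value (hypothetical values)
toy <- data.frame(
  Age   = c(50, 31, 32),
  BMI   = c(33.6, NA, 23.3),
  Class = c(1, 0, 1)
)

# Build the summary table from the data itself
summary_tbl <- data.frame(
  Variable = names(toy),
  Missing  = as.vector(colSums(is.na(toy)))
)
print(summary_tbl)
```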

Here is a univariate distribution plot for the variable “Age.” The distribution is skewed right:

hist(diabetes$Age,main="Age Distribution, Pima Indians Diabetes Dataset",xlab="Age (Years)",col="aquamarine3")
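
Right skew can also be checked numerically: the mean sits above the median and the sample skewness is positive. The helper below, `sample_skewness`, is my own small function (not part of base R), and the `ages` vector is made up for illustration:

```r
# Moment-based sample skewness; NA values are dropped first
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical ages with a long right tail
ages <- c(21, 22, 22, 23, 24, 25, 27, 29, 33, 41, 52, 66)

skew <- sample_skewness(ages)
print(c(mean = mean(ages), median = median(ages), skewness = skew))
```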

Here is a distribution plot for the variable “BMI,” which appears to be approximately normal, with a few high outliers:

hist(diabetes$BMI,main="BMI Distribution, Pima Indians Diabetes Dataset",xlab="Body Mass Index",col="darkslateblue")

Here is a distribution plot for the variable “PlasmaGC.” This distribution is also approximately normal:

hist(diabetes$PlasmaGC,main="Plasma Glucose Concentration, Pima Indians Diabetes Dataset",xlab="Plasma Glucose Concentration",col="indianred1")

Finally, here is a pairs plot for the variables Pregnancies, DiastolicBP, BMI, and DiabetesPed:

library(GGally)
ggpairs(diabetes[c("Pregnancies","DiastolicBP","BMI","DiabetesPed")])
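
The correlations shown in a pairs plot can also be computed as a matrix with base R. Because some columns contain NA values, the `use` argument of `cor()` matters; the sketch below uses pairwise complete observations on a hypothetical data frame (`toy`):

```r
# Toy data frame with a few NA values (hypothetical values)
toy <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5),
  DiastolicBP = c(72, 66, 64, 66, NA, 74),
  BMI         = c(33.6, 26.6, 23.3, 28.1, 43.1, NA)
)

# Each correlation uses the rows where both variables are present
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

With pairwise deletion each cell can be based on a different subset of rows, which is worth keeping in mind when comparing cells.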

Simple Linear Regression

PlasmaGC modeled by Age with fitted regression line:

library(ggplot2)
ggplot(diabetes,aes(Age,PlasmaGC))+geom_point()+geom_smooth(method="lm",se=FALSE)

m1<-lm(PlasmaGC~Age,data=diabetes)
summary(m1)
## 
## Call:
## lm(formula = PlasmaGC ~ Age, data = diabetes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.058 -21.310  -3.727  17.615  85.123 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 98.63245    3.19767  30.845  < 2e-16 ***
## Age          0.69292    0.09061   7.647 6.21e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.45 on 761 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.07136,    Adjusted R-squared:  0.07014 
## F-statistic: 58.48 on 1 and 761 DF,  p-value: 6.208e-14

The intercept of this linear model indicates that when age is equal to zero, the expected plasma glucose concentration is 98.63 (units for glucose concentration were not specified in the dataset). Because every participant was at least 21 years old, this value is an extrapolation and is not practically meaningful. The Age coefficient of 0.693 means that each 1-year increase in age is associated with an average increase of 0.693 units in plasma glucose concentration. Finally, the adjusted R squared is 0.07, meaning that age explains about 7% of the variation in plasma glucose concentration.
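
To make the slope interpretation concrete, the sketch below fits the same kind of model to simulated data (the data frame `sim` and the numbers used to generate it are assumptions, not the real dataset) and confirms that predictions one year apart differ by exactly the Age coefficient:

```r
# Simulated data, loosely shaped like the fitted model above
set.seed(1)
age <- runif(200, min = 21, max = 81)
glucose <- 98 + 0.7 * age + rnorm(200, sd = 30)
sim <- data.frame(Age = age, PlasmaGC = glucose)

fit <- lm(PlasmaGC ~ Age, data = sim)

# Predicted PlasmaGC at ages 30 and 31
pred <- predict(fit, newdata = data.frame(Age = c(30, 31)))

# The gap between the two predictions equals the slope
slope <- unname(coef(fit)["Age"])
gap <- unname(diff(pred))
print(c(slope = slope, gap = gap))
```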

PlasmaGC modeled by BMI with fitted regression line:

ggplot(diabetes,aes(BMI,PlasmaGC))+geom_point()+geom_smooth(method="lm",se=FALSE)

m2<-lm(PlasmaGC~BMI,data=diabetes)
summary(m2)
## 
## Call:
## lm(formula = PlasmaGC ~ BMI, data = diabetes)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70.28 -21.41  -4.11  18.32  81.80 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  88.5775     5.2046  17.019  < 2e-16 ***
## BMI           1.0280     0.1568   6.555 1.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.78 on 750 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.05418,    Adjusted R-squared:  0.05292 
## F-statistic: 42.96 on 1 and 750 DF,  p-value: 1.037e-10

The intercept of PlasmaGC~BMI is 88.578, which means that when BMI is 0, the expected PlasmaGC is 88.578. However, a BMI of 0 is not possible, so this intercept is not practically meaningful. The model’s BMI coefficient is 1.028, so for every one-unit increase in BMI, PlasmaGC is expected to increase by 1.028 units. The adjusted R squared for this model is 0.0529, so the model explains only 5.29% of the variation in PlasmaGC.
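
One common way to get an interpretable intercept is to center the predictor at its mean. This is a sketch on simulated data (`sim` and `BMI_c` are hypothetical names, and the generating numbers are assumptions); it is not something I did in the analysis above:

```r
# Simulated BMI and glucose values
set.seed(2)
bmi <- rnorm(150, mean = 32, sd = 7)
glucose <- 88 + 1.0 * bmi + rnorm(150, sd = 30)
sim <- data.frame(BMI = bmi, PlasmaGC = glucose)

# Center BMI at its sample mean
sim$BMI_c <- sim$BMI - mean(sim$BMI)

raw      <- lm(PlasmaGC ~ BMI,   data = sim)
centered <- lm(PlasmaGC ~ BMI_c, data = sim)

# Slope is unchanged; the intercept becomes the expected PlasmaGC at mean BMI
print(coef(raw))
print(coef(centered))
```

After centering, the intercept equals the mean of PlasmaGC, i.e. the expected value for a woman with an average BMI.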

PlasmaGC modeled by Pregnancies with fitted regression line:

ggplot(diabetes,aes(Pregnancies,PlasmaGC))+geom_point()+geom_smooth(method="lm",se=FALSE)

m3<-lm(PlasmaGC~Pregnancies,data=diabetes)
summary(m3)
## 
## Call:
## lm(formula = PlasmaGC ~ Pregnancies, data = diabetes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.018 -21.938  -5.177  19.779  80.779 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 117.2209     1.6654  70.385  < 2e-16 ***
## Pregnancies   1.1594     0.3253   3.564 0.000388 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.3 on 761 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.01642,    Adjusted R-squared:  0.01513 
## F-statistic:  12.7 on 1 and 761 DF,  p-value: 0.0003878

My final simple linear regression model, PlasmaGC~Pregnancies, has an intercept that suggests that a woman who has had 0 pregnancies is expected to have a PlasmaGC of 117.221. The Pregnancies coefficient of 1.159 indicates an average increase of 1.159 units in PlasmaGC for each additional pregnancy. The adjusted R squared for this model is 0.01513, so number of pregnancies accounts for only 1.513% of the variation in PlasmaGC.
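
For reference, in a simple linear regression the multiple R squared is the squared correlation between the two variables, and the adjusted value is 1 - (1 - R squared)(n - 1)/(n - p - 1), where p is the number of predictors. Plugging in the values reported above (R squared 0.01642, n = 763, p = 1) gives 0.01513. The sketch below checks both identities on simulated data (all names and generating numbers are assumptions):

```r
# Simulated count predictor and response
set.seed(3)
x <- rpois(300, lambda = 4)
y <- 117 + 1.2 * x + rnorm(300, sd = 30)

fit <- lm(y ~ x)
r2  <- summary(fit)$r.squared
adj <- summary(fit)$adj.r.squared

# Adjusted R squared computed by hand
n <- length(y)
p <- 1
manual_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, cor_squared = cor(x, y)^2, adj = adj, manual_adj = manual_adj))
```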

Multiple Regression

Multiple regression with 3 continuous variables:

mlr1<-lm(PlasmaGC~Age+Pregnancies+BMI,data=diabetes)
summary(mlr1)
## 
## Call:
## lm(formula = PlasmaGC ~ Age + Pregnancies + BMI, data = diabetes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.883 -19.942  -2.605  16.998  83.551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  66.7147     5.8172  11.469  < 2e-16 ***
## Age           0.7073     0.1069   6.615 7.06e-11 ***
## Pregnancies  -0.2359     0.3709  -0.636    0.525    
## BMI           1.0037     0.1515   6.625 6.63e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.76 on 748 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.1204, Adjusted R-squared:  0.1169 
## F-statistic: 34.13 on 3 and 748 DF,  p-value: < 2.2e-16

Here the intercept is 66.715: when Age, Pregnancies, and BMI are all equal to zero, PlasmaGC is expected to have a value of 66.715. The explanatory variable Pregnancies has a coefficient of -0.2359, so holding Age and BMI constant, each additional pregnancy is associated with a 0.2359 unit decrease in PlasmaGC. This coefficient is not statistically significant (p = 0.525), unlike the Pregnancies coefficient in the simple regression above.
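
One possible reason a coefficient shrinks or changes sign once other variables are added is that the predictors are correlated with each other. The simulation below is only an illustration of that mechanism under assumptions I chose (pregnancies rise with age, and glucose depends on age alone); it does not show that this is what happens in the Pima data:

```r
set.seed(4)
n <- 500
age <- runif(n, 21, 60)

# Assumption: number of pregnancies tends to rise with age
preg <- pmax(0, round((age - 21) / 5 + rnorm(n)))

# Assumption: glucose depends on age only, not on pregnancies
glucose <- 70 + 0.7 * age + rnorm(n, sd = 10)

# Slope for pregnancies alone vs. after adjusting for age
marginal <- unname(coef(lm(glucose ~ preg))["preg"])
adjusted <- unname(coef(lm(glucose ~ preg + age))["preg"])
print(c(marginal = marginal, adjusted = adjusted))
```

In this setup the unadjusted slope is clearly positive because pregnancies act as a stand-in for age, while the adjusted slope is close to zero.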

Multiple regression with Pregnancies as a categorical variable:

diabetes$Pregnancies.f<-factor(diabetes$Pregnancies)
mlr2<-lm(PlasmaGC~Age+Pregnancies.f+BMI,data=diabetes)
summary(mlr2)
## 
## Call:
## lm(formula = PlasmaGC ~ Age + Pregnancies.f + BMI, data = diabetes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.194 -18.392  -2.335  18.044  85.621 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      69.29706    6.84200  10.128  < 2e-16 ***
## Age               0.69154    0.10941   6.321 4.52e-10 ***
## Pregnancies.f1   -4.25316    3.75595  -1.132   0.2578    
## Pregnancies.f2   -7.74901    4.00722  -1.934   0.0535 .  
## Pregnancies.f3    4.09640    4.37058   0.937   0.3489    
## Pregnancies.f4    1.21234    4.48551   0.270   0.7870    
## Pregnancies.f5   -8.67499    4.88638  -1.775   0.0763 .  
## Pregnancies.f6   -3.75935    5.17265  -0.727   0.4676    
## Pregnancies.f7    5.99930    5.33736   1.124   0.2614    
## Pregnancies.f8   -0.80957    5.78471  -0.140   0.8887    
## Pregnancies.f9    0.03588    6.34701   0.006   0.9955    
## Pregnancies.f10  -9.78347    6.79847  -1.439   0.1506    
## Pregnancies.f11 -11.96971    9.24709  -1.294   0.1959    
## Pregnancies.f12 -20.69322   10.15953  -2.037   0.0420 *  
## Pregnancies.f13  -9.35150    9.62560  -0.972   0.3316    
## Pregnancies.f14   4.27798   20.46032   0.209   0.8344    
## Pregnancies.f15   0.09896   28.76958   0.003   0.9973    
## Pregnancies.f17  20.55659   28.81031   0.714   0.4758    
## BMI               0.99374    0.15536   6.396 2.84e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.59 on 733 degrees of freedom
##   (16 observations deleted due to missingness)
## Multiple R-squared:  0.1482, Adjusted R-squared:  0.1273 
## F-statistic: 7.084 on 18 and 733 DF,  p-value: < 2.2e-16

In this model, I treated Pregnancies as a categorical variable, so the single slope from my previous model (in which Pregnancies was continuous) is replaced by a separate coefficient for each pregnancy count. Here there are 17 categories for pregnancy, and the reference category is 0 pregnancies, which leaves 16 estimated coefficients. To interpret the coefficient for 2 pregnancies, we could say that holding all else constant, women who have had 2 pregnancies tend to have a PlasmaGC 7.749 units lower than women who have had 0 pregnancies, although this difference is only marginally significant (p = 0.0535).
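
The sketch below shows how R expands a factor into indicator columns, using a hypothetical data frame (`toy`) with four pregnancy counts. With the factor as the only predictor, each coefficient is the difference between that group's mean and the mean of the reference group:

```r
# Toy data frame (hypothetical values); note that 3 pregnancies never occurs
toy <- data.frame(
  Pregnancies = c(0, 0, 1, 1, 2, 2, 4, 4),
  PlasmaGC    = c(100, 110, 120, 130, 90, 100, 140, 150)
)
toy$Pregnancies.f <- factor(toy$Pregnancies)

# Design matrix: an intercept plus one indicator per non-reference level
X <- model.matrix(~ Pregnancies.f, data = toy)
print(colnames(X))

fit <- lm(PlasmaGC ~ Pregnancies.f, data = toy)

# Group means, for comparison with the coefficients
group_means <- tapply(toy$PlasmaGC, toy$Pregnancies.f, mean)
print(group_means)
print(coef(fit))
```

Only levels that occur in the data get a column, which is why the model above has no coefficient for 16 pregnancies.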