Fasting Glucose – Distribution Fitting

Distribution

Reading [1] made me realize that I didn’t really had to check for all the exotic distributions. Bimodality in distribution of blood glucose shows “that diabetes is a distinct entity rather than an arbitrarily defined extreme end of a continously distributed measurement” ^[1]. This implies that inside the glucose distribution of population a separate distribution exists for diabetic subjects. In other words, the population distribution is made up of two subsets of population those who are diagnosed with Diabetes and those who aren’t. This also makes the distribution a bimodal distribution.

To test for bimodality, I only need to try fit a unimodal normal and skewed distribution and check it against the fit of a mixture of two normal distributions.

Such an obvious thing!

Disrepancy in Dataset?

I am relying on information from ihw.info data frame which is extracted from the SPSS .sav file. While looking for a binary column for diagnosis of diabetics, I found 3 columns, all 3 of which seem to be logical vectors about diagnosis of diabetes in subject.

# list columns
ihw.info[c(19, 40, 119), ]

##        labels                     description
## 19       q4b3                        diabetes
## 40        q7c                        diabetes
## 119 Diabetics FBG>=126 or previously diagnose

I am not sure what do columns q4b3 and q7c really tell. They’re vectors of ‘yes’, ‘no’ and NA which makes it obvious they’re talking about diagnosis but the numbers don’t match up!

summary(ihw$q4b3)

##   no  yes NA's 
## 3945 1973  280

# which is different from
summary(ihw$q7c)

##   no  yes NA's 
## 4981 1109  108

# which is again different from
summary(ihw$Diabetics)

##     diabeic nondiabetic        NA's 
##        1406        3953         839

I guess I’ll stick with ihw$Diabetics for now.

summary(ihw$q10j >= 126)

##    Mode   FALSE    TRUE    NA's 
## logical    4515     929     754

# rows which have Diabetics logical set as NA when fasting glucose is available
length(filter(ihw, !is.na(q10j), is.na(Diabetics))$q10j)

## [1] 85

This seems to suggest that there are 1406 - 929 people who don’t have fasting glucose over 126 but have been previously diagnosed with diabetes. We also have 85 subjects where fasting glucose results were available but the diagnosis of diabetes wasn’t.

Spliting

[1] doesn’t make distinctions based on sex so I’ll just split the data into set of diabetics and non-diabetics using ihw$Diabetics. Please note that FG stands for Fasting Glucose. For now, I am not spiliting this into different age groups.

fg.diabetic <- filter(ihw, Diabetics == 'diabeic')$q10j
fg.nondiabetic <- filter(ihw, Diabetics == 'nondiabetic')$q10j

# data frame of the two vectors above
fg <- dplyr::select(ihw, glucose = q10j, diabetic = Diabetics)
fg <- fg[!is.na(fg$diabetic), ] # get rid of NA in column diabetic

I wonder how this looks,

ggplot(data = fg, aes(x = glucose)) +
  geom_histogram(aes(y = ..density..), binwidth = 5, color = 'blue', fill = NA) +
  geom_density(adjust = 2) +
  facet_wrap(~diabetic, ncol = 1, scales = 'free')

Hmm, this does look like lognormal and normal distribution curves.

Fitting Diabetics

Fitting a distribution on sample data is a three-step work when things are easy and there isn’t much detective work involved. These steps are,

Plot values
Plot Cullen-Fry graph which is plot of Kurtosis against square of Skewness. This gives you a rough idea of the kind of distributions which you can try to fit.
Fit distributions and compare for best fit

# plot
plotdist(fg.diabetic, histo = TRUE, demp = TRUE) # Diabetic Population

plotdist(log(fg.diabetic), histo = TRUE, demp = TRUE) # Log of Diabetic Population

Aha! Looks very normal!

# cullen-frey graph
descdist(fg.diabetic, discrete=FALSE, boot=500)

## summary statistics
## ------
## min:  50   max:  534 
## median:  140 
## mean:  152.0185 
## estimated sd:  60.94157 
## estimated skewness:  1.466083 
## estimated kurtosis:  6.146511

This plot suggests that Kurtosis and Skewness values of the data go along a gamma distribution. I’ll compare two fits against lognormal and gamma distributions to check which one fits better.

# fit
d.fit_ln <- fitdist(fg.diabetic, "lnorm")
d.fit_g <- fitdist(fg.diabetic, "gamma")
plot(d.fit_g) # Gamma Distribution fit

plot(d.fit_ln) # Lognormal Distribution fit

summary(d.fit_ln)

## Fitting of the distribution ' lnorm ' by maximum likelihood 
## Parameters : 
##          estimate  Std. Error
## meanlog 4.9538209 0.009821978
## sdlog   0.3682914 0.006944957
## Loglikelihood:  -7555.673   AIC:  15115.35   BIC:  15125.84 
## Correlation matrix:
##         meanlog sdlog
## meanlog       1     0
## sdlog         0     1

summary(d.fit_g)

## Fitting of the distribution ' gamma ' by maximum likelihood 
## Parameters : 
##         estimate  Std. Error
## shape 7.28477606 0.267612925
## rate  0.04791731 0.001821562
## Loglikelihood:  -7595.987   AIC:  15195.97   BIC:  15206.47 
## Correlation matrix:
##           shape      rate
## shape 1.0000000 0.9656593
## rate  0.9656593 1.0000000

Although a lognormal distribution fit has much lower overall standard error. I’ll compare again with a goodness of fit test.

gofstat(list(d.fit_ln, d.fit_g), fitnames = c("lnorm", "gamma"))

## Goodness-of-fit statistics
##                                   lnorm      gamma
## Kolmogorov-Smirnov statistic 0.05249851  0.0787956
## Cramer-von Mises statistic   0.65455882  1.7160982
## Anderson-Darling statistic   3.85268983 10.1001153
## 
## Goodness-of-fit criteria
##                                   lnorm    gamma
## Aikake's Information Criterion 15115.35 15195.97
## Bayesian Information Criterion 15125.84 15206.47

All 3 statistics are lower for the lognormal fit.

Thus, the distribution of fasting glucose in people diagnosed with diabetes shows a very clear fit with lognormal distribution from India Heart Watch data.

Fitting Non-Diabetics

Using the same three step process followed above, we can decide a fit for non-diabetics as well. This is healthy population and previous research suggests that it should follow normal distribution.

# plot
plotdist(fg.nondiabetic, histo = TRUE, demp = TRUE)

# cullen-fry graph
descdist(fg.nondiabetic, discrete = FALSE, boot = 500)

## summary statistics
## ------
## min:  44.4   max:  125 
## median:  89 
## mean:  90.20758 
## estimated sd:  13.74121 
## estimated skewness:  0.3198626 
## estimated kurtosis:  2.590937

# fit distribution
n.fit_n <- fitdist(fg.nondiabetic, "norm")
plot(n.fit_n)

summary(n.fit_n)

## Fitting of the distribution ' norm ' by maximum likelihood 
## Parameters : 
##      estimate Std. Error
## mean 90.20758  0.2185278
## sd   13.73947  0.1545225
## Loglikelihood:  -15967   AIC:  31938.01   BIC:  31950.57 
## Correlation matrix:
##      mean sd
## mean    1  0
## sd      0  1

Thus, the distribution of fasting glucose in healthy population shows a clear fit with normal distribution curve.

Mixture Distribution

So far, I have shown that the two exhaustive subsets of our population have a good fit with lognormal and normal distribution whose paramteres have been computed.

This can be better visualized if we overlay both these distribution on the population histogram.

Time to plot!

ggplot(data = ihw, aes(x = q10j)) +
  geom_histogram(aes(y = ..density..), fill = NA, color = 'black', bins = 75) +
  geom_density(data = mix, aes(x = diabetes), color = NA, fill = 'red', alpha = 0.3) +
  geom_density(data = mix, aes(x = nondiabetes), color = NA, fill = 'blue', alpha = 0.3) +
  xlab("Fasting Glucose") +
  ggtitle("Mixture Distribution overlay on Glucose Histogram") +
  coord_cartesian(xlim = c(50, 300))

## Warning: Removed 754 rows containing non-finite values (stat_bin).

Fasting Glucose – Distribution Fitting

Shailendra

12 August 2016

Setup