setwd("D:/Code/india-heart-watch")
ihw <- read.csv('ihw.csv')
ihw.info <- read.csv('ihw_info.csv')
Reading [1] made me realize that I didn’t really had to check for all the exotic distributions. Bimodality in distribution of blood glucose shows “that diabetes is a distinct entity rather than an arbitrarily defined extreme end of a continously distributed measurement” [1]. This implies that inside the glucose distribution of population a separate distribution exists for diabetic subjects. In other words, the population distribution is made up of two subsets of population those who are diagnosed with Diabetes and those who aren’t. This also makes the distribution a bimodal distribution.
To test for bimodality, I only need to try fit a unimodal normal and skewed distribution and check it against the fit of a mixture of two normal distributions.
Such an obvious thing!
I am relying on information from ihw.info
data frame which is extracted from the SPSS .sav
file. While looking for a binary column for diagnosis of diabetics, I found 3 columns, all 3 of which seem to be logical vectors about diagnosis of diabetes in subject.
# list columns
ihw.info[c(19, 40, 119), ]
## labels description
## 19 q4b3 diabetes
## 40 q7c diabetes
## 119 Diabetics FBG>=126 or previously diagnose
I am not sure what do columns q4b3
and q7c
really tell. They’re vectors of ‘yes’, ‘no’ and NA which makes it obvious they’re talking about diagnosis but the numbers don’t match up!
summary(ihw$q4b3)
## no yes NA's
## 3945 1973 280
# which is different from
summary(ihw$q7c)
## no yes NA's
## 4981 1109 108
# which is again different from
summary(ihw$Diabetics)
## diabeic nondiabetic NA's
## 1406 3953 839
I guess I’ll stick with ihw$Diabetics
for now.
summary(ihw$q10j >= 126)
## Mode FALSE TRUE NA's
## logical 4515 929 754
# rows which have Diabetics logical set as NA when fasting glucose is available
length(filter(ihw, !is.na(q10j), is.na(Diabetics))$q10j)
## [1] 85
This seems to suggest that there are 1406 - 929
people who don’t have fasting glucose over 126 but have been previously diagnosed with diabetes. We also have 85 subjects where fasting glucose results were available but the diagnosis of diabetes wasn’t.
[1] doesn’t make distinctions based on sex so I’ll just split the data into set of diabetics and non-diabetics using ihw$Diabetics
. Please note that FG stands for Fasting Glucose. For now, I am not spiliting this into different age groups.
fg.diabetic <- filter(ihw, Diabetics == 'diabeic')$q10j
fg.nondiabetic <- filter(ihw, Diabetics == 'nondiabetic')$q10j
# data frame of the two vectors above
fg <- dplyr::select(ihw, glucose = q10j, diabetic = Diabetics)
fg <- fg[!is.na(fg$diabetic), ] # get rid of NA in column diabetic
I wonder how this looks,
ggplot(data = fg, aes(x = glucose)) +
geom_histogram(aes(y = ..density..), binwidth = 5, color = 'blue', fill = NA) +
geom_density(adjust = 2) +
facet_wrap(~diabetic, ncol = 1, scales = 'free')
Hmm, this does look like lognormal and normal distribution curves.
Fitting a distribution on sample data is a three-step work when things are easy and there isn’t much detective work involved. These steps are,
# plot
plotdist(fg.diabetic, histo = TRUE, demp = TRUE) # Diabetic Population
plotdist(log(fg.diabetic), histo = TRUE, demp = TRUE) # Log of Diabetic Population
Aha! Looks very normal!
# cullen-frey graph
descdist(fg.diabetic, discrete=FALSE, boot=500)
## summary statistics
## ------
## min: 50 max: 534
## median: 140
## mean: 152.0185
## estimated sd: 60.94157
## estimated skewness: 1.466083
## estimated kurtosis: 6.146511
This plot suggests that Kurtosis and Skewness values of the data go along a gamma distribution. I’ll compare two fits against lognormal and gamma distributions to check which one fits better.
# fit
d.fit_ln <- fitdist(fg.diabetic, "lnorm")
d.fit_g <- fitdist(fg.diabetic, "gamma")
plot(d.fit_g) # Gamma Distribution fit
plot(d.fit_ln) # Lognormal Distribution fit
summary(d.fit_ln)
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## meanlog 4.9538209 0.009821978
## sdlog 0.3682914 0.006944957
## Loglikelihood: -7555.673 AIC: 15115.35 BIC: 15125.84
## Correlation matrix:
## meanlog sdlog
## meanlog 1 0
## sdlog 0 1
summary(d.fit_g)
## Fitting of the distribution ' gamma ' by maximum likelihood
## Parameters :
## estimate Std. Error
## shape 7.28477606 0.267612925
## rate 0.04791731 0.001821562
## Loglikelihood: -7595.987 AIC: 15195.97 BIC: 15206.47
## Correlation matrix:
## shape rate
## shape 1.0000000 0.9656593
## rate 0.9656593 1.0000000
Although a lognormal distribution fit has much lower overall standard error. I’ll compare again with a goodness of fit test.
gofstat(list(d.fit_ln, d.fit_g), fitnames = c("lnorm", "gamma"))
## Goodness-of-fit statistics
## lnorm gamma
## Kolmogorov-Smirnov statistic 0.05249851 0.0787956
## Cramer-von Mises statistic 0.65455882 1.7160982
## Anderson-Darling statistic 3.85268983 10.1001153
##
## Goodness-of-fit criteria
## lnorm gamma
## Aikake's Information Criterion 15115.35 15195.97
## Bayesian Information Criterion 15125.84 15206.47
All 3 statistics are lower for the lognormal fit.
Thus, the distribution of fasting glucose in people diagnosed with diabetes shows a very clear fit with lognormal distribution from India Heart Watch data.
Using the same three step process followed above, we can decide a fit for non-diabetics as well. This is healthy population and previous research suggests that it should follow normal distribution.
# plot
plotdist(fg.nondiabetic, histo = TRUE, demp = TRUE)
# cullen-fry graph
descdist(fg.nondiabetic, discrete = FALSE, boot = 500)
## summary statistics
## ------
## min: 44.4 max: 125
## median: 89
## mean: 90.20758
## estimated sd: 13.74121
## estimated skewness: 0.3198626
## estimated kurtosis: 2.590937
# fit distribution
n.fit_n <- fitdist(fg.nondiabetic, "norm")
plot(n.fit_n)
summary(n.fit_n)
## Fitting of the distribution ' norm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## mean 90.20758 0.2185278
## sd 13.73947 0.1545225
## Loglikelihood: -15967 AIC: 31938.01 BIC: 31950.57
## Correlation matrix:
## mean sd
## mean 1 0
## sd 0 1
Thus, the distribution of fasting glucose in healthy population shows a clear fit with normal distribution curve.
So far, I have shown that the two exhaustive subsets of our population have a good fit with lognormal and normal distribution whose paramteres have been computed.
This can be better visualized if we overlay both these distribution on the population histogram.
Time to plot!
ggplot(data = ihw, aes(x = q10j)) +
geom_histogram(aes(y = ..density..), fill = NA, color = 'black', bins = 75) +
geom_density(data = mix, aes(x = diabetes), color = NA, fill = 'red', alpha = 0.3) +
geom_density(data = mix, aes(x = nondiabetes), color = NA, fill = 'blue', alpha = 0.3) +
xlab("Fasting Glucose") +
ggtitle("Mixture Distribution overlay on Glucose Histogram") +
coord_cartesian(xlim = c(50, 300))
## Warning: Removed 754 rows containing non-finite values (stat_bin).