Introduction:
This analysis examines data from liver patients, concentrating on the relationships among a key set of liver enzymes, proteins, age, and gender, and uses them to try to predict the likelihood of liver disease.
Having worked, for two years now, in public health data science, understanding how data can be used to promote health and better predict illness, I find models which use existing data to estimate risk of serious diseases both fascinating and useful. Liver ailments are particularly serious and often fatal, so earlier detection could be a big win. Even if there is limited predictive power, models can help us know when it makes sense to look deeper despite the absence of disease symptoms.
In general, models such as this one, when derived from much larger data sets, should eventually provide better health outcomes with less invasive methods by predicting possible illness before symptoms appear instead of waiting for outward signs.
In the case of this particular type of study, the possibility of recognizing early signs of liver disease based on demographics and blood proteins could decrease recovery time and extend the length and quality of life for high-risk people.
Data:
The data for this analysis lives in the University of California Irvine Machine Learning Data Repository and was groomed specifically for classifications such as this.
liver_data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian%20Liver%20Patient%20Dataset%20(ILPD).csv",
header = FALSE)
colnames(liver_data) <- c("Age", "Sex", "Tot_Bil", "Dir_Bil", "Alkphos", "Alamine",
"Aspartate", "Tot_Prot", "Albumin", "A_G_Ratio", "Disease")
liver_data$Sex <- (ifelse(liver_data$Sex == "Male", "M", "F")) #made shorter
liver_data$Disease <- as.numeric(ifelse(liver_data$Disease == 2, 0, 1)) #converted to zeros and ones
Age | Sex | Tot_Bil | Dir_Bil | Alkphos | Alamine | Aspartate | Tot_Prot | Albumin | A_G_Ratio | Disease |
---|---|---|---|---|---|---|---|---|---|---|
65 | F | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.90 | 1 |
62 | M | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
62 | M | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
58 | M | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.00 | 1 |
72 | M | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.40 | 1 |
46 | M | 1.8 | 0.7 | 208 | 19 | 14 | 7.6 | 4.4 | 1.30 | 1 |
Type of Study
This is a cross-sectional observational study based on gathered, non-experimental data. We might be able to make meaningful associations based on it, but due to the limited ethnic diversity and the gender imbalance within the set, it will not provide results that generalize to humanity at large. However, it can provide insight for future controlled experiments or observational studies which take diversity across ages, genders, and races into consideration.
Data Source
The set was gathered from patient records in the north-east of Andhra Pradesh, India, and donated to the University of California Irvine by:
1. Bendi Venkata Ramana ramana.bendi ‘@’ gmail.com Associate Professor, Department of Information Technology, Aditya Institute of Technology and Management, Tekkali - 532201, Andhra Pradesh, India.
2. Prof. M. Surendra Prasad Babu drmsprasadbabu ‘@’ yahoo.co.in Department of Computer Science & Systems Engineering, Andhra University College of Engineering, Visakhapatnam-530 003, Andhra Pradesh, India.
3. Prof. N. B. Venkateswarlu venkat_ritch ‘@’ yahoo.com Department of Computer Science and Engineering, Aditya Institute of Technology and Management, Tekkali - 532201, Andhra Pradesh, India.
About the Data
There are 583 observations: 416 represent subjects with diseased livers and 167 represent subjects without diseased livers.
The data represent 441 male subjects (of whom 324 have liver disease) and 142 female subjects (of whom 92 have liver disease).
Data Dictionary
- Age = Age of the patient (all subjects older than 89 are labelled 90)
- Sex = Gender of the patient (Female, Male)
- Tot_Bil = Total Bilirubin
- Dir_Bil = Direct Bilirubin
- Alkphos = Alkaline Phosphatase
- Alamine = Alanine Aminotransferase
- Aspartate = Aspartate Aminotransferase
- Tot_Prot = Total Proteins
- Albumin = Albumin
- A_G_Ratio = Albumin and Globulin Ratio
- Disease = Disease state (labeled by medical experts): 0 = not diseased, 1 = diseased
Predictor Variables are columns 1 through 10 and the outcome variable is column 11, Disease.
Exploratory Data Analysis:
The graphs below were created to explore the distribution of data points within each variable, forming a solid foundation for the analysis to come. In the initial graphs, six of the variables were extremely left or right skewed, so they were transformed using the natural log prior to both graphing and analysis.
This helped to center the data, but the distributions are still not completely normal, so it is important to have realistic expectations about the linear predictive power of individual variables within the model as a whole.
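The effect of a log transform on a right-skewed variable can be illustrated with a minimal sketch. It uses simulated log-normal values rather than the liver data, and the `skewness()` helper is a simple moment-based formula written for this example (an assumption; it is not necessarily the estimator behind the summary tables in this report).

```r
# Simulated right-skewed data (illustrative only, not the liver data)
set.seed(42)
raw <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Hypothetical helper: moment-based sample skewness
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / (mean(centered^2))^1.5
}

skew_raw <- skewness(raw)        # strongly positive for log-normal data
skew_log <- skewness(log(raw))   # close to zero after the transform

print(round(c(raw = skew_raw, logged = skew_log), 3))
```

A skew near zero after transformation suggests a more symmetric distribution, although symmetry alone does not guarantee normality.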
Age
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 583 | 44.74614 | 16.18983 | 45 | 44.84368 | 17.7912 | 4 | 90 | 86 | -0.0292343 | -0.5738921 | 0.6705144 |
Gender
Sex | Disease | Frequency |
---|---|---|
F | 0 | 50 |
F | 1 | 92 |
M | 0 | 117 |
M | 1 | 324 |
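The totals quoted earlier can be recovered from this frequency table. The sketch below rebuilds the table by hand (the object names are illustrative) and derives the marginal counts and the disease rate within each sex.

```r
# Frequency table copied from the Gender summary above
gender_counts <- data.frame(
  Sex       = c("F", "F", "M", "M"),
  Disease   = c(0, 1, 0, 1),
  Frequency = c(50, 92, 117, 324)
)

# Cross-tabulate counts by sex and disease state
tab <- xtabs(Frequency ~ Sex + Disease, data = gender_counts)

totals_by_sex     <- rowSums(tab)  # subjects per sex
totals_by_disease <- colSums(tab)  # subjects per disease state

# Share of each sex labelled as diseased
disease_rate_by_sex <- tab[, "1"] / totals_by_sex

print(tab)
print(round(disease_rate_by_sex, 3))
```

The rates differ between the sexes, and male subjects outnumber female subjects roughly three to one, which is the gender imbalance noted under Type of Study.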
Total Bilirubin
Bilirubin is a byproduct of hemolytic catabolism and is one of many substances the liver filters from the body. A heightened level of total bilirubin, direct bilirubin, or both can be indicative of liver disease and is the cause of the skin yellowing associated with jaundice. (medscape)
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 583 | 0.4634209 | 1.018527 | 0 | 0.293858 | 0.5288063 | -0.9162907 | 4.317488 | 5.233779 | 1.311718 | 0.8898145 | 0.0421831 |
Direct Bilirubin
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 583 | -0.6503733 | 1.326394 | -1.203973 | -0.7869591 | 1.02766 | -2.302585 | 2.980619 | 5.283204 | 0.8269181 | -0.2990599 | 0.0549336 |
Alkaline Phosphatase
This is one of the enzymes included in a normal liver panel, frequently used to estimate overall liver health.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 583 | 5.493417 | 0.5281278 | 5.337538 | 5.42725 | 0.3757633 | 4.143135 | 7.654443 | 3.511309 | 1.317959 | 2.230417 | 0.0218728 |
Alanine Aminotransferase
Alanine Aminotransferase (the `Alamine` column) is a natural part of the liver ecosystem tested for in a liver panel.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 583 | 3.751829 | 0.9002358 | 3.555348 | 3.638797 | 0.6883795 | 2.302585 | 7.600903 | 5.298317 | 1.418303 | 2.578034 | 0.037284 |
Aspartate Aminotransferase
Aspartate Aminotransferase is also a natural part of the liver ecosystem tested for in a liver panel.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 583 | 3.956771 | 0.9973813 | 3.73767 | 3.837544 | 0.8596389 | 2.302585 | 8.502891 | 6.200306 | 1.188757 | 1.560033 | 0.0413073 |
Total Proteins
Total protein is a measure of both albumin and globulin combined.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 583 | 1.853967 | 0.1796109 | 1.88707 | 1.867098 | 0.149453 | 0.9932518 | 2.261763 | 1.268511 | -0.9689299 | 1.979856 | 0.0074387 |
Albumin
Albumin is a blood protein which adds structure to the vascular system, preventing blood from seeping out through the vessel walls.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 583 | 3.141853 | 0.7955188 | 3.1 | 3.14925 | 0.88956 | 0.9 | 5.5 | 4.6 | -0.0434602 | -0.4037893 | 0.032947 |
Albumin-Globulin Ratio
The Albumin-Globulin ratio is considered an index of systemic diseases in general.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 579 | 0.9470639 | 0.3195921 | 0.93 | 0.9297634 | 0.252042 | 0.3 | 2.8 | 2.5 | 0.9871639 | 3.221735 | 0.0132818 |
The above graphs suggest that the enzymes and proteins in question vary with the disease state, a relationship we may be able to understand more completely after the analysis.
Inference:
The goal of this analysis is to see how well a logistic regression can be tuned on these data to predict the presence of liver disease.
Preparing Data for Analysis
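library(caTools)  # sample.split()
library(dplyr)    # %>%, mutate(), across()
set.seed(455)  # for reproducibility
# NOTE: sample.split() is documented for a vector of labels. Given the whole data
# frame it appears to return one flag per column (11), which R then recycles down
# the 583 rows; this is what yields the 371 training rows summarized below.
liver_data$Splits <- sample.split(liver_data, SplitRatio = 0.7)  # set indexes
# natural-log transform the six skewed variables
liver_data <- liver_data %>%
    mutate(across(c(Tot_Bil, Dir_Bil, Alkphos, Alamine, Aspartate, Tot_Prot), log))
train <- liver_data[liver_data$Splits == TRUE, ]  # extract training using indexes
test <- liver_data[liver_data$Splits == FALSE, ]  # extract test using indexes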
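For comparison, a split that preserves the class balance can be sketched in base R. This is an illustrative alternative on a toy data frame, not the split used for the results in this report; `toy`, its columns, and the class sizes are all made up for the example.

```r
# Toy data: 70 diseased and 30 non-diseased rows (illustrative only)
set.seed(455)
toy <- data.frame(id = 1:100, Disease = rep(c(1, 0), times = c(70, 30)))

# Within each class, sample 70% of the row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$Disease)
train_idx <- unname(unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), round(0.7 * length(idx)))]
})))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both subsets keep the original 70/30 class balance
print(c(train = mean(toy_train$Disease), test = mean(toy_test$Disease)))
```

Sampling within each class keeps the share of diseased subjects the same in both subsets, which matters when one class is much larger than the other.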
Training Summary
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Age | 1 | 371 | 44.2938005 | 15.6390861 | 45.000000 | 44.3804714 | 17.7912000 | 4.0000000 | 90.000000 | 86.0000000 | -0.0054548 | -0.5021859 | 0.8119409 |
Sex* | 2 | 371 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
Tot_Bil | 3 | 371 | 0.4635428 | 1.0142436 | 0.000000 | 0.2883406 | 0.5288063 | -0.6931472 | 4.317488 | 5.0106353 | 1.4009765 | 1.2964306 | 0.0526569 |
Dir_Bil | 4 | 371 | -0.6463009 | 1.3057033 | -1.203973 | -0.7801100 | 1.0276600 | -2.3025851 | 2.980619 | 5.2832037 | 0.8440730 | -0.1631233 | 0.0677887 |
Alkphos | 5 | 371 | 5.5181453 | 0.5214282 | 5.370638 | 5.4494785 | 0.3657189 | 4.1431347 | 7.654443 | 3.5113085 | 1.2739498 | 1.7736041 | 0.0270712 |
Alamine | 6 | 371 | 3.7940291 | 0.9044910 | 3.610918 | 3.6875879 | 0.7167284 | 2.3025851 | 7.396335 | 5.0937502 | 1.3116266 | 2.2122098 | 0.0469588 |
Aspartate | 7 | 371 | 4.0096669 | 1.0123014 | 3.761200 | 3.8865681 | 0.8645727 | 2.4849066 | 8.502891 | 6.0179848 | 1.1841618 | 1.4928227 | 0.0525561 |
Tot_Prot | 8 | 371 | 1.8562181 | 0.1687600 | 1.871802 | 1.8665221 | 0.1516386 | 1.2809338 | 2.261763 | 0.9808293 | -0.6255046 | 0.4391950 | 0.0087616 |
Albumin | 9 | 371 | 3.1380054 | 0.7902345 | 3.100000 | 3.1424242 | 0.8895600 | 1.4000000 | 5.500000 | 4.1000000 | 0.0586121 | -0.4816292 | 0.0410269 |
A_G_Ratio | 10 | 370 | 0.9473514 | 0.3259117 | 0.900000 | 0.9275676 | 0.2965200 | 0.3000000 | 2.800000 | 2.5000000 | 1.0646070 | 3.4478161 | 0.0169433 |
Disease | 11 | 371 | 0.7196765 | 0.4497638 | 1.000000 | 0.7744108 | 0.0000000 | 0.0000000 | 1.000000 | 1.0000000 | -0.9742200 | -1.0537139 | 0.0233506 |
Splits* | 12 | 371 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
Initial Logistic Regression Model
fit <- glm(Disease ~ Age + Sex + Tot_Bil + Dir_Bil + Alkphos + Alamine + Aspartate +
Tot_Prot + Albumin + A_G_Ratio, data = train, family = binomial(link = "logit"))
Coefficients:
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | -15.8559002 | 4.1200217 | -3.8484992 | 0.0001188 |
Age | 0.0217058 | 0.0087573 | 2.4786079 | 0.0131896 |
SexM | -0.4007260 | 0.3266459 | -1.2267904 | 0.2199014 |
Tot_Bil | 0.4276794 | 0.7432995 | 0.5753797 | 0.5650346 |
Dir_Bil | 0.1718378 | 0.4851157 | 0.3542202 | 0.7231739 |
Alkphos | 0.9463127 | 0.3856115 | 2.4540572 | 0.0141254 |
Alamine | 0.9373542 | 0.3219296 | 2.9116743 | 0.0035950 |
Aspartate | 0.1980968 | 0.2889246 | 0.6856348 | 0.4929434 |
Tot_Prot | 5.3262285 | 2.3915094 | 2.2271409 | 0.0259379 |
Albumin | -1.4553451 | 0.7649599 | -1.9025117 | 0.0571043 |
A_G_Ratio | 1.8906917 | 1.2224601 | 1.5466286 | 0.1219528 |
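Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below works through two of the estimates from the table above; the variable names are illustrative. Because `Alamine` entered the model on the natural-log scale, doubling the raw measurement adds log(2) to the predictor, which multiplies the odds by 2 raised to the coefficient.

```r
# Estimates copied from the coefficient table above
coefs <- c(Age = 0.0217058, Alamine = 0.9373542)

# Odds ratio for one additional year of age, other predictors held constant
or_age_1yr <- exp(coefs[["Age"]])

# Odds ratio for ten additional years of age
or_age_10yr <- exp(10 * coefs[["Age"]])

# Alamine is log-transformed: doubling the raw value multiplies the odds by 2^beta
or_alamine_doubling <- 2^coefs[["Alamine"]]

print(round(c(age_1yr = or_age_1yr,
              age_10yr = or_age_10yr,
              alamine_x2 = or_alamine_doubling), 3))
```

These are point estimates only; the standard errors in the table indicate how uncertain each one is.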
Model Pseudo R-Square and Log-Likelihoods (from pscl package)
llh | llhNull | G2 | McFadden | r2ML | r2CU |
---|---|---|---|---|---|
-170.5744 | -220.0989 | 99.04894 | 0.2250101 | 0.2348626 | 0.3375943 |
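Based on the above table, using the McFadden pseudo \(R^2\) as a guide, the fitted model improves on the log-likelihood of the intercept-only model by approximately 22.5%. This is a measure of relative fit rather than a literal share of the disease classification explained.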
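The pseudo R-squared values can be reproduced by hand from the two log-likelihoods in the table. The sketch below uses the standard McFadden, maximum-likelihood (Cox and Snell), and Cragg-Uhler (Nagelkerke) formulas. The sample size of 370 is an assumption, taken from the training summary where `A_G_Ratio` has one missing value that `glm()` would drop by default.

```r
# Log-likelihoods copied from the table above
llh      <- -170.5744   # fitted model
llh_null <- -220.0989   # intercept-only model
n_obs    <- 370         # assumption: 371 training rows minus one with missing A_G_Ratio

# Likelihood-ratio statistic
g2 <- -2 * (llh_null - llh)

# McFadden: proportional improvement in log-likelihood over the null model
mcfadden <- 1 - llh / llh_null

# Maximum-likelihood (Cox and Snell) pseudo R-squared
r2_ml <- 1 - exp(-g2 / n_obs)

# Cragg-Uhler (Nagelkerke): r2_ml rescaled by its maximum attainable value
r2_cu <- r2_ml / (1 - exp(2 * llh_null / n_obs))

print(round(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu), 4))
```

The three measures differ because each rescales the same likelihood improvement in a different way, so they should not be compared directly with an ordinary least-squares \(R^2\).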
Another estimate of utility is the Coefficient of Discrimination, a metric which subtracts the mean predicted probability among the no-disease cases from the mean predicted probability among the disease cases, computed here on the test data.
# Start by making a data frame of predicted probabilities for the test set
Test_Predictions <- data.frame(Probability = predict(fit, test, type = "response"))
# then add the predicted class, using a 50% probability cut-off
Test_Predictions$Prediction <- ifelse(Test_Predictions$Probability > 0.5, 1, 0)
# add the actual diagnosed disease / no-disease classification to the set
Test_Predictions$Disease <- test$Disease
# accuracy is the share of cases where prediction = reality
accuracy <- mean(Test_Predictions$Disease == Test_Predictions$Prediction, na.rm = TRUE)
# For the Coefficient of Discrimination, collect the predicted probabilities
# where the true outcome is disease and where it is not; the difference
# between the two means gives a sense of the overall predictive power
disease <- Test_Predictions$Probability[which(Test_Predictions$Disease == 1)]
non <- Test_Predictions$Probability[which(Test_Predictions$Disease == 0)]
Coef_Desc <- mean(disease, na.rm = TRUE) - mean(non, na.rm = TRUE)
Coefficient of Discrimination : 0.1420897
Accuracy: 0.6889952
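Accuracy is simply the proportion of test cases the model classified correctly. At around 69%, it seems considerably higher than the model fit estimates would suggest. For context, about 71% of all subjects in the data (416 of 583) have liver disease, so the accuracy should be read against that base rate.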
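To make the two metrics concrete, here is a small worked sketch on eight made-up predictions. The `toy` values are invented for illustration and have no connection to the liver data.

```r
# Made-up predicted probabilities and true labels (illustrative only)
toy <- data.frame(
  Probability = c(0.92, 0.81, 0.65, 0.45, 0.70, 0.30, 0.55, 0.20),
  Disease     = c(1,    1,    1,    1,    0,    0,    0,    0)
)

# Classify with a 0.5 cut-off
toy$Prediction <- ifelse(toy$Probability > 0.5, 1, 0)

# Accuracy: share of rows where the predicted class matches the true class
accuracy <- mean(toy$Prediction == toy$Disease)

# Coefficient of Discrimination: gap between the mean predicted probability
# of the true-disease rows and that of the true-non-disease rows
coef_disc <- mean(toy$Probability[toy$Disease == 1]) -
  mean(toy$Probability[toy$Disease == 0])

# Confusion matrix shows which kind of error the cut-off produces
confusion <- table(Actual = toy$Disease, Predicted = toy$Prediction)

print(accuracy)
print(coef_disc)
print(confusion)
```

Accuracy depends on the chosen cut-off, while the Coefficient of Discrimination uses the probabilities directly, which is one reason the two can tell different stories about the same model.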
plot(Test_Predictions$Probability, Test_Predictions$Disease, xlim = c(0.1, 1.1),
xlab = "Probability", ylim = c(-0.1, 1.1), ylab = "Disease", col = "blue",
pch = 18)
Scope of Analysis
Improvements
Although several variables were not statistically significant on their own (meaning we cannot reliably estimate their individual contribution to the disease diagnosis), removing them in any order did nothing but lower the accuracy, the Coefficient of Discrimination, and the pseudo R-squared scores. This indicates that, for sheer ability to predict the likelihood of a patient having liver disease from the included variables, the full model is more accurate than an abridged logistic model.
Model Conditions
The two conditions which we need to meet are:
- That each predictor \(X_i\) is linearly related to logit(\(p_i\)) when all other predictors are held constant
Based on the following samples of x plotted against the predicted probabilities of the logistic regression, it seems as though there may be some concerns with using these particular regressors in a logistic model, even transformed.
train$predictions <- predict(fit, train, type = "response")
require(ggplot2)
ggplot(train, aes(x = Tot_Bil, y = predictions)) + geom_point() + geom_smooth(method = "lm",
se = FALSE)
ggplot(train, aes(x = Dir_Bil, y = predictions)) + geom_point() + geom_smooth(method = "lm",
se = FALSE)
ggplot(train, aes(x = Alkphos, y = predictions)) + geom_point() + geom_smooth(method = "lm",
se = FALSE)
ggplot(train, aes(x = Albumin, y = predictions)) + geom_point() + geom_smooth(method = "lm",
se = FALSE)
- Each \(Y_i\) is independent of other outcomes
No mention of relations between family members was included in the metadata for this set, so the assumption is that all subjects are independent of each other.
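The linearity condition concerns the logit rather than the probability itself. One common alternative check, sketched here on simulated data and not applied to the liver set in this report, bins a predictor into groups, computes the empirical logit of the outcome within each bin, and looks for a roughly straight line. All names and the simulated coefficients below are assumptions made for the example.

```r
# Simulated data where the logit is truly linear in x (illustrative only)
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

# Split x into ten equal-count bins
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)), include.lowest = TRUE)

bin_mean <- tapply(x, bins, mean)     # average predictor value per bin
events   <- tapply(y, bins, sum)      # outcomes equal to 1 per bin
size     <- tapply(y, bins, length)   # rows per bin

# Empirical logit, with 0.5 added to avoid taking the log of zero
emp_logit <- log((events + 0.5) / (size - events + 0.5))

# A roughly straight line supports the linearity condition for this predictor
plot(bin_mean, emp_logit, xlab = "Mean of x within bin", ylab = "Empirical logit")
abline(lm(emp_logit ~ bin_mean))
```

Clear curvature in such a plot would suggest trying a different transformation of the predictor or adding a nonlinear term. Note that this check looks at one predictor at a time and so ignores the other variables in the model.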
Conclusion:
Given the McFadden pseudo R-squared of around 22%, and given that the model condition of a linear relationship between the predictors and the logit is not met, I would not consider using this model as a diagnostic tool or automating it.
However, given that the accuracy is around 69% on the test set, further tuning of this model on more validation data might make it useful as a tool to suggest further definitive testing for liver disease in patients whom the model flags as possibly positive.
It might be that this type of model, while not diagnostic, is useful in early detection, much like a mammogram, which is indicative of breast cancer but not diagnostic.
If such a model is to be useful, then a similar method would need to be applied to a more racially diverse data set with more even gender distributions.
References:
UCI Machine Learning Repository: ILPD (Indian Liver Patient Dataset) Data Set. (n.d.). Retrieved December 8, 2017, from https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)
A/G RATIO (3293A). (n.d.). Retrieved December 5, 2017, from http://www.questdiagnostics.com/testcenter/BUOrderInfo.action?tc=3293A&labCode=QBA
Bilirubin: Reference Range, Interpretation, Collection and Panels. (2017). Retrieved from https://emedicine.medscape.com/article/2074068-overview