Introduction:

This analysis examines data from liver patients, concentrating on the relationships among a set of liver enzymes, proteins, age and gender, and uses them to try to predict the likelihood of liver disease.

Having worked in public health data science for two years, studying how data can be used to promote health and better predict illness, I find models that use existing data to estimate the risk of serious disease both fascinating and useful. Liver ailments are particularly serious and often fatal, so earlier detection could be a big win. Even with limited predictive power, such models can help us know when it makes sense to look deeper despite the absence of symptoms.

In general, models such as this one, derived from much larger data sets, will eventually evolve to provide better health outcomes with less invasive methods by predicting possible illness before symptoms appear instead of waiting for outward signs.

In the case of this particular type of study, the possibility of recognizing early signs of liver disease based on demographics and blood proteins could decrease recovery time and extend the length and quality of life for high-risk people.

Data:

The data for this analysis lives in the University of California Irvine Machine Learning Data Repository and was groomed specifically for classifications such as this.

liver_data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian%20Liver%20Patient%20Dataset%20(ILPD).csv", 
    header = FALSE)
colnames(liver_data) <- c("Age", "Sex", "Tot_Bil", "Dir_Bil", "Alkphos", "Alamine", 
    "Aspartate", "Tot_Prot", "Albumin", "A_G_Ratio", "Disease")
liver_data$Sex <- (ifelse(liver_data$Sex == "Male", "M", "F"))  #made shorter
liver_data$Disease <- as.numeric(ifelse(liver_data$Disease == 2, 0, 1))  #converted to zeros and ones
Age Sex Tot_Bil Dir_Bil Alkphos Alamine Aspartate Tot_Prot Albumin A_G_Ratio Disease
65 F 0.7 0.1 187 16 18 6.8 3.3 0.90 1
62 M 10.9 5.5 699 64 100 7.5 3.2 0.74 1
62 M 7.3 4.1 490 60 68 7.0 3.3 0.89 1
58 M 1.0 0.4 182 14 20 6.8 3.4 1.00 1
72 M 3.9 2.0 195 27 59 7.3 2.4 0.40 1
46 M 1.8 0.7 208 19 14 7.6 4.4 1.30 1

Type of Study

This is a cross-sectional observational study based on gathered, non-experimental data. We might be able to draw meaningful associations from it, but due to the limited ethnic diversity and the gender imbalance within the set, it will not provide results that generalize to humanity at large. However, it can provide insight for future controlled experiments or observational studies which take diversity across ages, genders and races into consideration.

Data Source

The set was gathered from patient records in the north-east of Andhra Pradesh, India, and donated to the University of California Irvine by:

  1. Bendi Venkata Ramana ramana.bendi ‘@’ gmail.com Associate Professor, Department of Information Technology, Aditya Institute of Technology and Management, Tekkali - 532201, Andhra Pradesh, India.

  2. Prof. M. Surendra Prasad Babu drmsprasadbabu ‘@’ yahoo.co.in Department of Computer Science & Systems Engineering, Andhra University College of Engineering, Visakhapatnam-530 003 Andhra Pradesh, India.

  3. Prof. N. B. Venkateswarlu venkat_ritch ‘@’ yahoo.com Department of Computer Science and Engineering, Aditya Institute of Technology and Management, Tekkali - 532201, Andhra Pradesh, India.

About the Data

There are 583 observations: 416 represent subjects with diseased livers and 167 represent subjects without diseased livers.

The data represent 441 male subjects (of whom 324 have liver disease) and 142 female subjects (of whom 92 have liver disease).
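As a quick consistency check, the totals quoted above can be recomputed from the Sex by Disease frequencies reported in the Gender table of the exploratory analysis. This is a minimal sketch: the `freq` data frame is typed in by hand here as an illustration rather than derived from `liver_data`.

```r
# Sex x Disease frequencies, copied from the Gender table in this report
freq <- data.frame(
  Sex       = c("F", "F", "M", "M"),
  Disease   = c(0, 1, 0, 1),
  Frequency = c(50, 92, 117, 324)
)

total      <- sum(freq$Frequency)                      # all observations
by_disease <- tapply(freq$Frequency, freq$Disease, sum)  # totals per disease state
by_sex     <- tapply(freq$Frequency, freq$Sex, sum)      # totals per sex

# Share of subjects labelled as diseased
prevalence <- by_disease[["1"]] / total

print(total)
print(by_disease)
print(by_sex)
print(round(prevalence, 3))
```

On the loaded data, `table(liver_data$Sex, liver_data$Disease)` should give the same four cell counts.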

Data Dictionary

  1. Age = Age of the patient (all subjects older than 89 are labelled 90)
  2. Sex = Gender of the patient (F = Female, M = Male)
  3. Tot_Bil = Total Bilirubin
  4. Dir_Bil = Direct Bilirubin
  5. Alkphos = Alkaline Phosphatase
  6. Alamine = Alanine Aminotransferase
  7. Aspartate = Aspartate Aminotransferase
  8. Tot_Prot = Total Proteins
  9. Albumin = Albumin
  10. A_G_Ratio = Albumin and Globulin Ratio
  11. Disease = Disease state (as labelled by the medical experts): 0 = not diseased, 1 = diseased

Predictor Variables are columns 1 through 10 and the outcome variable is column 11, Disease.

Exploratory Data Analysis:

The graphs below were created to explore the distribution of data points within each variable, forming a solid foundation for the analysis to come. In the initial graphs, six of the variables were extremely left or right skewed, so they were transformed using the natural log prior to both graphing and analysis.

This centered the data better, but the variables are still not completely normal, so it is important to have realistic expectations about the linear predictive power of individual variables within the model as a whole.
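To illustrate why a natural log helps with a long right tail, here is a small sketch on simulated data. The `lab_value` vector and the `skewness()` helper are assumptions made for the example; they are not the ILPD measurements, and the helper is a plain moment-based estimate that may differ slightly from the skew column in the summaries below.

```r
# Simulated stand-in for a right-skewed lab value (NOT the ILPD data)
set.seed(455)
lab_value <- rlnorm(583, meanlog = 0.5, sdlog = 1)

# Moment-based sample skewness (hypothetical helper for this sketch)
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(lab_value)       # long right tail: clearly positive
skew_log <- skewness(log(lab_value))  # after the log: much closer to zero

print(c(raw = skew_raw, logged = skew_log))
```

A value near zero after the transform indicates a more symmetric distribution, though not necessarily a normal one.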

Age

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 583 44.74614 16.18983 45 44.84368 17.7912 4 90 86 -0.0292343 -0.5738921 0.6705144

Gender

Sex Disease Frequency
F 0 50
F 1 92
M 0 117
M 1 324

Total Bilirubin

Bilirubin is a byproduct of hemolytic catabolism and one of many substances the liver filters from the body. A heightened level of total bilirubin, direct bilirubin or both can be indicative of liver disease and is the cause of the skin yellowing associated with jaundice. (medscape)

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 583 0.4634209 1.018527 0 0.293858 0.5288063 -0.9162907 4.317488 5.233779 1.311718 0.8898145 0.0421831

Direct Bilirubin

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 583 -0.6503733 1.326394 -1.203973 -0.7869591 1.02766 -2.302585 2.980619 5.283204 0.8269181 -0.2990599 0.0549336

Alkaline Phosphatase

This is one of the enzymes included in a normal liver panel, frequently used to estimate overall liver health.

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 583 5.493417 0.5281278 5.337538 5.42725 0.3757633 4.143135 7.654443 3.511309 1.317959 2.230417 0.0218728

Alanine Aminotransferase

Alanine Aminotransferase is a natural part of the liver ecosystem tested for in a liver panel.

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 583 3.751829 0.9002358 3.555348 3.638797 0.6883795 2.302585 7.600903 5.298317 1.418303 2.578034 0.037284

Aspartate Aminotransferase

Aspartate Aminotransferase is also a natural part of the liver ecosystem tested for in a liver panel.

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 583 3.956771 0.9973813 3.73767 3.837544 0.8596389 2.302585 8.502891 6.200306 1.188757 1.560033 0.0413073

Total Proteins

Total protein is a measure of both albumin and globulin combined.

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 583 1.853967 0.1796109 1.88707 1.867098 0.149453 0.9932518 2.261763 1.268511 -0.9689299 1.979856 0.0074387

Albumin

Albumin is a blood protein that helps maintain pressure within the vascular system, preventing fluid from seeping out through the vessel walls.

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 583 3.141853 0.7955188 3.1 3.14925 0.88956 0.9 5.5 4.6 -0.0434602 -0.4037893 0.032947

Albumin-Globulin Ratio

The Albumin-Globulin ratio is considered an index of systemic diseases in general.

vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 579 0.9470639 0.3195921 0.93 0.9297634 0.252042 0.3 2.8 2.5 0.9871639 3.221735 0.0132818

The above graphs suggest that the enzymes and proteins in question vary with the disease state, a relationship we may be able to understand more completely after the analysis.

Inference:

The goal of this analysis is to see how well a logistic regression can be tuned on these data to predict the presence of liver disease.

Preparing Data for Analysis

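library(caTools)  # provides sample.split()
library(dplyr)  # provides mutate(), across() and %>%
set.seed(455)  # for reproducibility
# NOTE: sample.split() expects a label vector (e.g. liver_data$Disease). Given the whole
# data frame it returns one flag per column (11), which R recycles down the 583 rows.
liver_data$Splits <- sample.split(liver_data, SplitRatio = 0.7)  # set indexes
liver_data <- liver_data %>% mutate(across(c(Tot_Bil, Dir_Bil, Alkphos, Alamine, 
    Aspartate, Tot_Prot), log))  # natural log of the six skewed variables
train <- liver_data[liver_data$Splits == TRUE, ]  # extract training rows using indexes
test <- liver_data[liver_data$Splits == FALSE, ]  # extract test rows using indexes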
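The `sample.split()` function in caTools is documented as taking a vector of labels and preserving the label proportions in both subsets. For comparison, here is a minimal base-R sketch of that kind of row-level, stratified 70/30 split. The `disease` vector is rebuilt from the class counts reported in this document (416 diseased, 167 not), and `is_train` is a hypothetical name; this is an illustration of the idea, not the split used for the results reported here.

```r
set.seed(455)  # for reproducibility

# Outcome labels rebuilt from the reported class counts (illustrative only)
disease <- rep(c(1, 0), times = c(416, 167))

# Draw 70% of the rows within each class so both subsets keep the class mix
is_train <- logical(length(disease))
for (cls in unique(disease)) {
  idx <- which(disease == cls)
  n_train <- round(0.7 * length(idx))
  is_train[sample(idx, n_train)] <- TRUE
}

# Share of rows in training, and disease rate in each subset
print(mean(is_train))
print(c(train = mean(disease[is_train]), test = mean(disease[!is_train])))
```

Because the sampling is done within each class, the disease rate in the training and test subsets stays close to the overall rate.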

Training Summary

vars n mean sd median trimmed mad min max range skew kurtosis se
Age 1 371 44.2938005 15.6390861 45.000000 44.3804714 17.7912000 4.0000000 90.000000 86.0000000 -0.0054548 -0.5021859 0.8119409
Sex* 2 371 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
Tot_Bil 3 371 0.4635428 1.0142436 0.000000 0.2883406 0.5288063 -0.6931472 4.317488 5.0106353 1.4009765 1.2964306 0.0526569
Dir_Bil 4 371 -0.6463009 1.3057033 -1.203973 -0.7801100 1.0276600 -2.3025851 2.980619 5.2832037 0.8440730 -0.1631233 0.0677887
Alkphos 5 371 5.5181453 0.5214282 5.370638 5.4494785 0.3657189 4.1431347 7.654443 3.5113085 1.2739498 1.7736041 0.0270712
Alamine 6 371 3.7940291 0.9044910 3.610918 3.6875879 0.7167284 2.3025851 7.396335 5.0937502 1.3116266 2.2122098 0.0469588
Aspartate 7 371 4.0096669 1.0123014 3.761200 3.8865681 0.8645727 2.4849066 8.502891 6.0179848 1.1841618 1.4928227 0.0525561
Tot_Prot 8 371 1.8562181 0.1687600 1.871802 1.8665221 0.1516386 1.2809338 2.261763 0.9808293 -0.6255046 0.4391950 0.0087616
Albumin 9 371 3.1380054 0.7902345 3.100000 3.1424242 0.8895600 1.4000000 5.500000 4.1000000 0.0586121 -0.4816292 0.0410269
A_G_Ratio 10 370 0.9473514 0.3259117 0.900000 0.9275676 0.2965200 0.3000000 2.800000 2.5000000 1.0646070 3.4478161 0.0169433
Disease 11 371 0.7196765 0.4497638 1.000000 0.7744108 0.0000000 0.0000000 1.000000 1.0000000 -0.9742200 -1.0537139 0.0233506
Splits* 12 371 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA

Initial Logistic Regression Model

fit <- glm(Disease ~ Age + Sex + Tot_Bil + Dir_Bil + Alkphos + Alamine + Aspartate + 
    Tot_Prot + Albumin + A_G_Ratio, data = train, family = binomial(link = "logit"))

Coefficients:

Estimate Std. Error z value Pr(>|z|)
(Intercept) -15.8559002 4.1200217 -3.8484992 0.0001188
Age 0.0217058 0.0087573 2.4786079 0.0131896
SexM -0.4007260 0.3266459 -1.2267904 0.2199014
Tot_Bil 0.4276794 0.7432995 0.5753797 0.5650346
Dir_Bil 0.1718378 0.4851157 0.3542202 0.7231739
Alkphos 0.9463127 0.3856115 2.4540572 0.0141254
Alamine 0.9373542 0.3219296 2.9116743 0.0035950
Aspartate 0.1980968 0.2889246 0.6856348 0.4929434
Tot_Prot 5.3262285 2.3915094 2.2271409 0.0259379
Albumin -1.4553451 0.7649599 -1.9025117 0.0571043
A_G_Ratio 1.8906917 1.2224601 1.5466286 0.1219528

Model Pseudo R-Square and Log-Likelihoods (from pscl package)

llh llhNull G2 McFadden r2ML r2CU
-170.5744 -220.0989 99.04894 0.2250101 0.2348626 0.3375943

Based on the above table, using the McFadden \(R^2\) as a guide, the model improves on the log-likelihood of the intercept-only (null) model by approximately 22.5%, a rough indication of how much of the disease classification it explains.
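The McFadden value and the G2 statistic in the table can be reproduced by hand from the two log-likelihoods. The sketch below uses only the rounded `llh` and `llhNull` figures printed above, so small differences in the last digits are expected; the variable names are chosen for this example.

```r
# Log-likelihoods copied from the pseudo R-square table above
llh      <- -170.5744   # fitted model
llh_null <- -220.0989   # intercept-only (null) model

# McFadden pseudo R-squared: 1 minus the ratio of the two log-likelihoods
mcfadden <- 1 - llh / llh_null

# G2: likelihood-ratio statistic comparing the fitted model to the null model
g2 <- -2 * (llh_null - llh)

print(round(c(McFadden = mcfadden, G2 = g2), 4))
```

Both results agree with the table to within rounding of the inputs.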

Another estimate of utility is the Coefficient of Discrimination, a test metric which subtracts the mean predicted probability for the no-disease cases from the mean predicted probability for the disease cases, based on the predicted outcomes of the test data.

# Start by making a data frame of predicted probabilities for the test set
Test_Predictions <- data.frame(Probability = predict(fit, test, type = "response"))
# then add the predicted class, using a 50% probability cutoff
Test_Predictions$Prediction <- ifelse(Test_Predictions$Probability > 0.5, 1, 0)
# add the actual diagnosed disease / non-disease classification to the set
Test_Predictions$Disease <- test$Disease

# accuracy is simply the mean of all trues (1) where prediction = reality
accuracy <- mean(Test_Predictions$Disease == Test_Predictions$Prediction, na.rm = TRUE)

# For the Coefficient of Discrimination, create one vector of predicted
# probabilities where the true outcome is disease and one where it is not;
# the difference between their means gives a sense of the overall predictive power
disease <- Test_Predictions$Probability[which(Test_Predictions$Disease == 1)]
non <- Test_Predictions$Probability[which(Test_Predictions$Disease == 0)]
Coef_Desc <- mean(disease, na.rm = TRUE) - mean(non, na.rm = TRUE)

Coefficient of Discrimination : 0.1420897

Accuracy: 0.6889952

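The accuracy is simply the proportion of test cases the model classified correctly. At around 69%, it seems considerably higher than the model fit estimates would suggest. For context, 416 of the 583 subjects (about 71%) have liver disease, so always predicting disease would score similarly on a comparable sample.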
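Accuracy on its own hides how the errors are distributed between the two classes. The sketch below shows one way to break it down with a confusion matrix; the `actual` and `probs` vectors are made-up toy values for illustration, not output from the model in this report.

```r
# Toy outcomes and predicted probabilities (hypothetical, NOT from the model above)
actual <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
probs  <- c(0.90, 0.80, 0.75, 0.70, 0.60, 0.55, 0.40, 0.65, 0.45, 0.30)

# Same 50% cutoff as in the analysis
predicted <- ifelse(probs > 0.5, 1, 0)

# Rows are the true class, columns the predicted class
confusion <- table(
  Actual    = factor(actual, levels = c(0, 1)),
  Predicted = factor(predicted, levels = c(0, 1))
)

accuracy    <- mean(predicted == actual)
baseline    <- max(mean(actual), 1 - mean(actual))           # always guess the majority class
sensitivity <- confusion["1", "1"] / sum(confusion["1", ])   # diseased cases caught
specificity <- confusion["0", "0"] / sum(confusion["0", ])   # healthy cases cleared

print(confusion)
print(c(accuracy = accuracy, baseline = baseline,
        sensitivity = sensitivity, specificity = specificity))
```

Comparing `accuracy` with `baseline` shows how much the model adds over always predicting the more common class.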

plot(Test_Predictions$Probability, Test_Predictions$Disease, xlim = c(0.1, 1.1), 
    xlab = "Probability", ylim = c(-0.1, 1.1), ylab = "Disease", col = "blue", 
    pch = 18)

Scope of Analysis

Improvements

Although some variables were not statistically significant in and of themselves (meaning we cannot reliably tell what their precise contribution to the disease diagnosis is), removing them in any order did nothing but lower the accuracy, the Coefficient of Discrimination and the pseudo R-squared scores. This indicates that, for sheer ability to predict the likelihood of a patient having liver disease based on the included variables, the whole model is more accurate than an abridged logistic model.

Model Conditions

The two conditions which we need to meet are:

  1. That each predictor \(x_i\) is linearly related to \(\text{logit}(p_i)\) when all other predictors are held constant

Based on the following plots of selected predictors against the predicted probabilities of the logistic regression, there may be some concerns with using these particular regressors in a logistic model, even transformed.

train$predictions <- predict(fit, train, type = "response")
library(ggplot2)
ggplot(train, aes(x = Tot_Bil, y = predictions)) + geom_point() + geom_smooth(method = "lm", 
    se = FALSE)

ggplot(train, aes(x = Dir_Bil, y = predictions)) + geom_point() + geom_smooth(method = "lm", 
    se = FALSE)

ggplot(train, aes(x = Alkphos, y = predictions)) + geom_point() + geom_smooth(method = "lm", 
    se = FALSE)

ggplot(train, aes(x = Albumin, y = predictions)) + geom_point() + geom_smooth(method = "lm", 
    se = FALSE)

  2. Each \(Y_i\) is independent of the other outcomes

No mention of relations between family members was included in the metadata for this set, so the assumption is that all subjects are independent of each other.
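The first condition concerns the logit rather than the probability, so one common way to examine it is a binned empirical logit: group a predictor into bins, compute the log-odds of the outcome in each bin, and see whether those values fall roughly on a straight line. The sketch below does this on simulated data in which the logit is linear by construction; `x`, `y` and all other names are assumptions for illustration and none of it uses the ILPD data.

```r
# Simulated data where the logit really is linear in x (NOT the ILPD data)
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

# Group x into ten equal-sized bins
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)
bin_mean <- as.numeric(tapply(x, bins, mean))    # average x in each bin
events   <- as.numeric(tapply(y, bins, sum))     # outcomes equal to 1 in each bin
trials   <- as.numeric(tapply(y, bins, length))  # rows in each bin

# Empirical log-odds per bin; the 0.5 offsets avoid log(0) in extreme bins
emp_logit <- log((events + 0.5) / (trials - events + 0.5))

# A roughly straight-line trend is consistent with the linearity condition
trend <- lm(emp_logit ~ bin_mean)
print(coef(trend))
```

Plotting `emp_logit` against `bin_mean` gives the visual version of the same check; applied to a real predictor such as `Tot_Bil`, clear curvature would point to a violation of the condition.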

Conclusion:

Given the modest fit indicated by the pseudo R-squared values, at around 22%, and given that the model condition of a linear relationship between the predictors and the logit is not being met, I would not consider using this model as a diagnostic tool or automating it.

However, given that the accuracy is about 69% on the test set, with further tuning of this model on more validation data it might be useful as a tool to suggest further, definitive testing for liver disease in patients whom the model flags as possibly positive.

It might be that this type of model, while not diagnostic, is useful in early detection, much like a mammogram, which is indicative of breast cancer but not diagnostic.

If such a model is to be useful, then a similar method would need to be applied to a more racially diverse data set with more even gender distributions.

References:

UCI Machine Learning Repository: ILPD (Indian Liver Patient Dataset) Data Set. (n.d.). Retrieved December 8, 2017, from https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

A/G RATIO (3293A). (n.d.). Retrieved December 5, 2017, from http://www.questdiagnostics.com/testcenter/BUOrderInfo.action?tc=3293A&labCode=QBA

Bilirubin: Reference Range, Interpretation, Collection and Panels. (2017). Retrieved from https://emedicine.medscape.com/article/2074068-overview

2017-12-08