The data set consists of selected information on 8,819 adults aged 20 or older, taken from the 1999-2000 and 2001-2002 surveys. The sample subjects were randomly divided into two pools: a 6,000-case training set and a 2,819-case hold-out sample for testing. A test for CKD was administered to everyone in the study population. The dependent variable of interest is CKD, a binary variable indicating whether or not the subject had CKD. The 32 independent variables include age, weight, income, cholesterol level, systolic/diastolic blood pressure, family history of diabetes, cardiovascular disease, and so on.
The goal of the project is to predict the CKD risk score and disease status for the hold-out sample, identify the best predictors of CKD, and build a predictive model that could be turned into a quick screening tool for early detection of the disease. Given that the total annual bill for treating kidney failure (the last stage of CKD) is over $35 billion, early detection and treatment of CKD can save insurance companies $250,000 per case in which the patient does not progress to kidney failure, and save the Federal government approximately $1 billion per year (US Dept. of Health and Human Services).
# load libraries and data
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(caret)
library(mice)
library(tidyr)
library(ROCR)
library(varhandle)
casedata = read.csv("E:/MSBA/5 Mutivariate Analysis/Project/casedata.csv")
Check the data:
casedata1=casedata[,-c(1, 34)] # remove ID and disease status
str(casedata1)
## 'data.frame': 8819 obs. of 32 variables:
## $ Age : int 65 36 66 54 63 26 66 59 53 78 ...
## $ Female : int 1 1 1 1 1 0 0 1 1 1 ...
## $ Racegrp : Factor w/ 4 levels "black","hispa",..: 4 2 4 4 1 4 1 4 4 1 ...
## $ Educ : int 0 0 0 1 0 1 1 1 1 0 ...
## $ Unmarried : int 0 NA 1 0 0 0 0 1 0 1 ...
## $ Income : int 1 1 0 0 NA 0 0 0 1 0 ...
## $ CareSource : Factor w/ 5 levels " ","clinic","DrHMO",..: 5 4 4 3 2 3 3 3 3 3 ...
## $ Insured : int 1 0 1 1 1 1 0 1 1 1 ...
## $ Weight : num 56 60.2 83.9 69.4 73.1 ...
## $ Height : num 162 162 162 160 159 ...
## $ BMI : num 21.3 22.9 31.8 26.9 28.8 ...
## $ Obese : int 0 0 1 0 0 1 0 0 0 0 ...
## $ Waist : num 83.6 76.6 113.2 77.9 89.3 ...
## $ SBP : int 135 96 115 110 132 129 137 124 110 170 ...
## $ DBP : int 71 52 57 57 73 70 92 73 74 78 ...
## $ HDL : int 48 31 44 74 67 43 41 43 62 105 ...
## $ LDL : int 249 135 211 156 154 159 143 140 110 90 ...
## $ Total.Chol : int 297 166 255 230 221 202 184 183 172 195 ...
## $ Dyslipidemia : int 0 0 1 0 0 0 0 0 0 0 ...
## $ PVD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Activity : int 3 3 1 2 1 2 3 2 1 1 ...
## $ PoorVision : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Smoker : int 1 0 1 1 0 0 0 0 1 0 ...
## $ Hypertension : int 0 0 0 0 1 0 1 0 1 1 ...
## $ Fam.Hypertension: int 0 0 0 0 0 1 0 1 1 0 ...
## $ Diabetes : int 0 0 1 0 0 0 0 1 0 1 ...
## $ Fam.Diabetes : int 1 0 0 0 0 0 0 1 0 1 ...
## $ Stroke : int 0 0 0 0 0 0 0 0 0 1 ...
## $ CVD : int 1 0 0 0 0 0 0 0 0 1 ...
## $ Fam.CVD : int 0 0 0 0 0 1 1 1 1 1 ...
## $ CHF : int 0 0 0 0 0 0 0 0 0 NA ...
## $ Anemia : int 0 0 0 0 0 0 0 0 0 0 ...
Check missing pattern:
# Unique values per column
lapply(casedata, function(x) length(unique(x)))
#Check for Missing values
missing_values = casedata %>% summarize_all(~ sum(is.na(.))/n()) # share of NAs per column (dplyr)
missing_values = gather(missing_values, key="feature", value="missing_pct") # reshape to long format (tidyr)
missing_values %>%
ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct))+
geom_bar(stat="identity",fill="tomato3")+
coord_flip()+theme_bw()
Most of the missing values come from CKD, which is expected because the observations with missing CKD status form the hold-out sample. Variables with a significant share of missing values are Income, PoorVision, Unmarried and Fam.CVD. In total, 24 variables have missing values.
I'll use multiple imputation (the mice package) to predict the missing values from the observed ones.
init = mice(casedata1, maxit=0)
meth = init$method
predM = init$predictorMatrix
Skip imputation for the variables that have no missing values; they will still be used as predictors for imputing the others:
meth[c("Smoker", "Racegrp", "PVD", "Female", "Fam.Hypertension", "Fam.Diabetes", "Dyslipidemia", "CareSource", "Age")]=""
Specify an imputation method for each variable: Bayesian linear regression ("norm") for continuous variables, logistic regression ("logreg") for binary variables and polytomous logistic regression ("polyreg") for the multi-level Activity variable.
meth[c("Educ", "Unmarried", "Insured", "Obese", "PoorVision", "Hypertension",
"Diabetes", "Stroke", "CVD", "Fam.CVD", "CHF", "Anemia", "Income")]="logreg"
meth[c("Weight","Height", "BMI", "Waist", "SBP", "DBP", "HDL", "LDL", "Total.Chol")]="norm"
meth[c("Activity")]="polyreg"
categorical = casedata1[,c("Income", "Educ", "Unmarried", "Insured", "Obese", "PoorVision", "Hypertension", "Diabetes", "Stroke", "CVD", "Fam.CVD", "CHF", "Anemia", "Smoker", "PVD", "Female", "Fam.Hypertension", "Fam.Diabetes", "Dyslipidemia", "Activity")]
categorical_var = lapply(categorical, as.factor)
categorical_var=as.data.frame(categorical_var)
casedata1 = cbind(casedata1[,c("Weight", "Height", "BMI", "SBP", "DBP", "Waist", "HDL", "LDL", "Total.Chol", "Racegrp", "CareSource", "Age")], categorical_var)
Impute missing values:
set.seed(1)
imputed = mice(casedata1, method=meth, predictorMatrix=predM, m = 5)
## Warning: Number of logged events: 410
Integrate imputed data into the main dataset:
complete = mice::complete(imputed)
fulldata=mutate(complete, CKD=casedata$CKD)
sapply(fulldata, function(x) sum(is.na(x))) # Check the number of missing values after imputation
## Weight Height BMI SBP
## 0 0 0 0
## DBP Waist HDL LDL
## 0 0 0 0
## Total.Chol Racegrp CareSource Age
## 0 0 0 0
## Income Educ Unmarried Insured
## 0 0 0 0
## Obese PoorVision Hypertension Diabetes
## 0 0 0 0
## Stroke CVD Fam.CVD CHF
## 0 0 0 0
## Anemia Smoker PVD Female
## 0 0 0 0
## Fam.Hypertension Fam.Diabetes Dyslipidemia Activity
## 0 0 0 0
## CKD
## 2819
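Before moving on, it's worth checking that the imputed values look plausible. The mice package provides a lattice-based densityplot() method that overlays the distributions of imputed values (red) against observed values (blue); a quick sketch for a few of the continuous variables:
# sanity check: compare observed vs. imputed distributions for selected continuous variables
library(lattice)
densityplot(imputed, ~ Weight + Height + SBP)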
One-hot encode categorical variables:
dv = dummyVars(~ Racegrp+CareSource, data = fulldata, sep = ".", fullRank = TRUE)
nv = as.data.frame(predict(dv, newdata = fulldata))
fulldata=fulldata[,-c(10, 11)] # drop the original Racegrp and CareSource factor columns
fulldata=cbind(fulldata, nv)
Separate 2819 cases without CKD status for later use as a test set:
fd=subset(fulldata, !is.na(CKD))
test=subset(fulldata, is.na(CKD))
dim(fd)
## [1] 6000 38
Correlation plot:
# Handle factor variables
cv = fd[,c("Educ", "Unmarried", "Insured", "Obese", "PoorVision", "Hypertension",
"Diabetes", "Stroke", "CVD", "Fam.CVD", "CHF", "Anemia", "Income", "Dyslipidemia",
"PVD", "Smoker","Fam.Hypertension", "Female", "Fam.Diabetes", "Activity")]
cv = lapply(cv, as.numeric) # as.numeric() on a factor returns 1-based level indices
cv = as.data.frame(cv)
# shift the 19 binary variables back from 1/2 to 0/1;
# Activity (column 20) already sits on its original 1-4 scale
for (i in 1:19){
cv[,i] = cv[,i] - 1
}
fd = fd[,-which(names(fd) %in% c("Educ", "Unmarried", "Insured", "Obese", "PoorVision", "Hypertension",
"Diabetes", "Stroke", "CVD", "Fam.CVD", "CHF", "Anemia", "Income", "Dyslipidemia",
"PVD", "Smoker","Fam.Hypertension", "Female", "Fam.Diabetes", "Activity"))]
fd = cbind(cv, fd)
cormat <- cor(fd) # avoid masking the base cor() function
ggcorrplot(cormat)
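To see which pairs drive the plot, we can also list the strongest correlations directly. A small sketch using the cormat computed above; the 0.6 cutoff is an arbitrary choice:
# list variable pairs whose absolute correlation exceeds 0.6
idx = which(abs(cormat) > 0.6 & upper.tri(cormat), arr.ind = TRUE)
data.frame(var1 = rownames(cormat)[idx[, 1]],
           var2 = colnames(cormat)[idx[, 2]],
           r = round(cormat[idx], 2))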
Comparing attributes of people with and without CKD (the red line indicates those who tested positive for the disease):
a=subset(fd, fd$CKD==1)
b=subset(fd, fd$CKD==0)
plot(density(a$Age), col="red", main='Comparing age distribution', xlab='Age')
lines(density(b$Age))
There's a staggering difference in the age distributions of people with and without CKD: most of those diagnosed with the disease are between 70 and 90 years old.
plot(density(a$Activity, bw = 0.05), col = 'red', main = "Comparing Activity patterns", xlab = 'Activity level')
lines(density(b$Activity))
Activity levels are coded as 1 - mostly sit, 2 - stand or walk a lot, 3 - lift light loads or climb stairs often, and 4 - heavy work and heavy loads. People with CKD tend to have a more sedentary lifestyle with little to no physical activity, while those without CKD are more likely to be physically active.
plot(density(a$SBP, bw = 0.6), col = 'red', main = 'Comparing systolic blood pressure patterns', xlab = 'Systolic blood pressure level')
lines(density(b$SBP, bw = 0.6))
Those who tested positive for CKD tend to have higher systolic blood pressure.
plot(density(a$Income, bw=0.2), col = 'red', main = "Comparing income levels", xlab = 'Income level')
lines(density(b$Income, bw=0.2))
Note: Income level is a binary variable indicating whether or not a person has an above-median income.
prop.table(table(fd$Income, fd$CKD), margin = 2) # specific ratios
##
## 0 1
## 0 0.5798410 0.7219828
## 1 0.4201590 0.2780172
42% of the people who tested negative for CKD have an above-median income, while the figure for those who tested positive is only 28%.
Based on the correlation matrix and some background research, we can safely remove the variables HDL and LDL (good/bad cholesterol, which overlap with Total Cholesterol), Obese and Waist (highly correlated with Weight), and BMI (computed from Weight and Height). The dataset will then be split 70/30 for training and validating the regression models.
fd=fd[,-which(names(fd) %in% c("BMI","HDL", "LDL", "Obese", "Waist"))] # remove unwanted variables
split = round(nrow(fd)*.7) # 70/30 split; rows are taken in order, so shuffle first if the data isn't already randomly ordered
train = fd[1:split,]
validate = fd[(split+1):nrow(fd),]
dim(train)
## [1] 4200 33
Generalized linear model with all variables:
model=glm(CKD~.,family="binomial", data=train)
summary(model)
##
## Call:
## glm(formula = CKD ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2588 -0.3097 -0.1174 -0.0521 3.3653
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.392894 324.749587 -0.057 0.954834
## Educ -0.168699 0.161691 -1.043 0.296790
## Unmarried 0.127792 0.155150 0.824 0.410128
## Insured -0.253782 0.309968 -0.819 0.412937
## PoorVision -0.089227 0.208966 -0.427 0.669386
## Hypertension 0.496459 0.186378 2.664 0.007728 **
## Diabetes 0.559290 0.175610 3.185 0.001448 **
## Stroke -0.138998 0.299381 -0.464 0.642444
## CVD 0.730434 0.239152 3.054 0.002256 **
## Fam.CVD 0.471182 0.240007 1.963 0.049623 *
## CHF 0.256365 0.248936 1.030 0.303083
## Anemia 1.262729 0.446219 2.830 0.004657 **
## Income 0.165317 0.170814 0.968 0.333136
## Dyslipidemia -0.010923 0.234063 -0.047 0.962778
## PVD 0.820053 0.210964 3.887 0.000101 ***
## Smoker 0.142031 0.146979 0.966 0.333877
## Fam.Hypertension -0.598654 0.282693 -2.118 0.034202 *
## Female 0.340186 0.205081 1.659 0.097158 .
## Fam.Diabetes -0.119842 0.154568 -0.775 0.438144
## Activity -0.277992 0.108226 -2.569 0.010210 *
## Weight 0.010793 0.004577 2.358 0.018364 *
## Height 0.020933 0.010512 1.991 0.046435 *
## SBP -0.002485 0.003866 -0.643 0.520390
## DBP -0.002029 0.005840 -0.347 0.728249
## Total.Chol 0.001940 0.001750 1.108 0.267725
## Age 0.097322 0.008044 12.098 < 2e-16 ***
## Racegrp.hispa -0.417113 0.250797 -1.663 0.096282 .
## Racegrp.other -0.151506 0.652315 -0.232 0.816337
## Racegrp.white 0.154886 0.203447 0.761 0.446474
## CareSource.clinic 5.461375 324.744709 0.017 0.986582
## CareSource.DrHMO 5.313196 324.744705 0.016 0.986946
## CareSource.noplace 4.628450 324.744846 0.014 0.988628
## CareSource.other 5.457358 324.744796 0.017 0.986592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2202.3 on 4199 degrees of freedom
## Residual deviance: 1435.1 on 4167 degrees of freedom
## AIC: 1501.1
##
## Number of Fisher Scoring iterations: 11
Model 1 - Logistic regression with variables selected by stepwise regression. Stepwise procedures pick the best predictors of the dependent variable using a model-fit criterion such as the t-test, F-test or adjusted R-squared; here we use the AIC (Akaike Information Criterion - https://en.m.wikipedia.org/wiki/Akaike_information_criterion) together with p-values. The smaller the AIC, the better the model.
Forward selection starts with no variables in the model, tests the addition of each variable using the chosen fit criterion, adds the variable (if any) whose inclusion gives the most statistically significant improvement in fit, and repeats until no addition improves the model to a statistically significant extent.
In the summary we will see two deviances, null and residual. Deviance is a measure of a model's goodness of fit; higher values indicate worse fit. The null deviance shows how well the dependent variable is predicted by a model that includes only the intercept (the grand mean), while the residual deviance is for the model that also includes the independent variables.
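One simple way to combine the two deviances into a single fit summary is McFadden's pseudo-R-squared, one minus the ratio of residual to null deviance; a quick sketch using the full model fitted above:
# McFadden's pseudo-R-squared: 1 - 1435.1/2202.3, roughly 0.35 here
1 - model$deviance / model$null.deviance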
model1=step(model,direction="forward")
summary(model1)
##
## Call:
## glm(formula = CKD ~ Educ + Unmarried + Insured + PoorVision +
## Hypertension + Diabetes + Stroke + CVD + Fam.CVD + CHF +
## Anemia + Income + Dyslipidemia + PVD + Smoker + Fam.Hypertension +
## Female + Fam.Diabetes + Activity + Weight + Height + SBP +
## DBP + Total.Chol + Age + Racegrp.hispa + Racegrp.other +
## Racegrp.white + CareSource.clinic + CareSource.DrHMO + CareSource.noplace +
## CareSource.other, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2588 -0.3097 -0.1174 -0.0521 3.3653
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.392894 324.749587 -0.057 0.954834
## Educ -0.168699 0.161691 -1.043 0.296790
## Unmarried 0.127792 0.155150 0.824 0.410128
## Insured -0.253782 0.309968 -0.819 0.412937
## PoorVision -0.089227 0.208966 -0.427 0.669386
## Hypertension 0.496459 0.186378 2.664 0.007728 **
## Diabetes 0.559290 0.175610 3.185 0.001448 **
## Stroke -0.138998 0.299381 -0.464 0.642444
## CVD 0.730434 0.239152 3.054 0.002256 **
## Fam.CVD 0.471182 0.240007 1.963 0.049623 *
## CHF 0.256365 0.248936 1.030 0.303083
## Anemia 1.262729 0.446219 2.830 0.004657 **
## Income 0.165317 0.170814 0.968 0.333136
## Dyslipidemia -0.010923 0.234063 -0.047 0.962778
## PVD 0.820053 0.210964 3.887 0.000101 ***
## Smoker 0.142031 0.146979 0.966 0.333877
## Fam.Hypertension -0.598654 0.282693 -2.118 0.034202 *
## Female 0.340186 0.205081 1.659 0.097158 .
## Fam.Diabetes -0.119842 0.154568 -0.775 0.438144
## Activity -0.277992 0.108226 -2.569 0.010210 *
## Weight 0.010793 0.004577 2.358 0.018364 *
## Height 0.020933 0.010512 1.991 0.046435 *
## SBP -0.002485 0.003866 -0.643 0.520390
## DBP -0.002029 0.005840 -0.347 0.728249
## Total.Chol 0.001940 0.001750 1.108 0.267725
## Age 0.097322 0.008044 12.098 < 2e-16 ***
## Racegrp.hispa -0.417113 0.250797 -1.663 0.096282 .
## Racegrp.other -0.151506 0.652315 -0.232 0.816337
## Racegrp.white 0.154886 0.203447 0.761 0.446474
## CareSource.clinic 5.461375 324.744709 0.017 0.986582
## CareSource.DrHMO 5.313196 324.744705 0.016 0.986946
## CareSource.noplace 4.628450 324.744846 0.014 0.988628
## CareSource.other 5.457358 324.744796 0.017 0.986592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2202.3 on 4199 degrees of freedom
## Residual deviance: 1435.1 on 4167 degrees of freedom
## AIC: 1501.1
##
## Number of Fisher Scoring iterations: 11
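Note that step() begins from whatever model it is given, and forward selection can only add terms, so starting from the already-full model leaves it unchanged. That is why the summary above is identical to the full model's. A forward search is normally started from the intercept-only model, with the full model as the upper scope; a sketch, assuming the objects above are still in the session:
# forward selection from the null model up to the full model's formula
null_model = glm(CKD ~ 1, family = "binomial", data = train)
model1_fwd = step(null_model, scope = formula(model), direction = "forward", trace = 0)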
Next, I try backward elimination, which starts with all candidate variables, tests the deletion of each variable using the chosen model-fit criterion, deletes the variable (if any) whose removal causes the least statistically significant deterioration in fit, and repeats this process until no further variable can be deleted without a statistically significant loss of fit.
model2=step(model,direction="backward")
summary(model2)
##
## Call:
## glm(formula = CKD ~ Hypertension + Diabetes + CVD + Fam.CVD +
## Anemia + PVD + Fam.Hypertension + Female + Activity + Weight +
## Height + Age + Racegrp.hispa + CareSource.clinic + CareSource.DrHMO +
## CareSource.other, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1452 -0.3133 -0.1196 -0.0524 3.3776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.588349 1.883742 -7.213 5.45e-13 ***
## Hypertension 0.394577 0.163673 2.411 0.015919 *
## Diabetes 0.542114 0.164674 3.292 0.000995 ***
## CVD 0.672112 0.170898 3.933 8.40e-05 ***
## Fam.CVD 0.482640 0.238005 2.028 0.042575 *
## Anemia 1.151353 0.437572 2.631 0.008508 **
## PVD 0.819813 0.205717 3.985 6.74e-05 ***
## Fam.Hypertension -0.618367 0.280648 -2.203 0.027569 *
## Female 0.353111 0.190929 1.849 0.064394 .
## Activity -0.292137 0.106485 -2.743 0.006080 **
## Weight 0.010880 0.004487 2.425 0.015310 *
## Height 0.019968 0.010291 1.940 0.052328 .
## Age 0.097324 0.006850 14.207 < 2e-16 ***
## Racegrp.hispa -0.534466 0.195780 -2.730 0.006335 **
## CareSource.clinic 0.782286 0.457468 1.710 0.087260 .
## CareSource.DrHMO 0.613976 0.443470 1.384 0.166211
## CareSource.other 0.761276 0.528535 1.440 0.149768
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2202.3 on 4199 degrees of freedom
## Residual deviance: 1443.9 on 4183 degrees of freedom
## AIC: 1477.9
##
## Number of Fisher Scoring iterations: 7
Interpretation: The final model above retains the most important variables, and its lower AIC (1477.9 vs. 1501.1) indicates that model 2 is better than model 1.
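The comparison can also be confirmed directly with R's built-in AIC():
AIC(model1, model2) # lower is better; df counts the estimated parameters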
Predict on the validation set:
prob = predict(model2, type = "response", newdata=validate)
Find the Area Under the Curve (AUC) of Model 2:
ROCRpred = prediction(prob, validate$CKD)
auc = round(as.numeric(performance(ROCRpred, "auc")@y.values),2)
cat("AUC of Model2 is",auc)
## AUC of Model2 is 0.88
The model has an AUC of 0.88, which is quite decent. Let's see the ROC curve with respect to the True Positive Rate (TPR) and False Positive Rate (FPR):
ROCRperf = performance(ROCRpred, "tpr", "fpr")
# Adding threshold labels
plot(ROCRperf, colorize=TRUE, print.cutoffs.at = seq(0,1,0.1), text.adj = c(-0.2, 1.7))
abline(a=0, b=1)
legend(.8, .2, auc, title = "AUC", cex = 1) # reuse the AUC computed earlier
# find the ROC point closest to the top-left corner (FPR = 0, TPR = 1)
opt.cut = function(ROCRperf, ROCRpred){
cut.ind = mapply(FUN=function(x, y, p){
d = x^2 + (y-1)^2 # squared distance to the perfect classifier at (0, 1)
ind = which(d == min(d))
c(sensitivity = y[[ind]], specificity = 1-x[[ind]],
cutoff = p[[ind]])
}, ROCRperf@x.values, ROCRperf@y.values, ROCRpred@cutoffs)
}
print(opt.cut(ROCRperf, ROCRpred))
## [,1]
## sensitivity 0.83333333
## specificity 0.77919708
## cutoff 0.06284708
opt.cut returns the threshold whose ROC point lies closest to the top-left corner, i.e. the best balance between sensitivity and specificity, not simply the highest possible TPR. If the project's goal is to catch as many positive cases as possible and avoid false negatives at all costs, which is usually preferred in healthcare, then we can settle at 0.063. But if the goal is to maximize the accuracy rate, which takes both TP and TN into account, we should raise the threshold a bit.
# evaluation function: confusion-matrix counts and the derived rates
c_accuracy=function(actuals,classifications){
df=data.frame(actuals,classifications)
tp=nrow(df[df$classifications==1 & df$actuals==1,])
fp=nrow(df[df$classifications==1 & df$actuals==0,])
fn=nrow(df[df$classifications==0 & df$actuals==1,])
tn=nrow(df[df$classifications==0 & df$actuals==0,])
recall=tp/(tp+fn)
precision=tp/(tp+fp)
accuracy=(tp+tn)/(tp+fn+fp+tn)
tpr=recall
fpr=fp/(fp+tn)
fmeasure=2*precision*recall/(precision+recall)
scores=c(recall,precision,accuracy,tpr,fpr,fmeasure,tp,tn,fp,fn)
names(scores)=c("recall","precision","accuracy","tpr","fpr","fmeasure","tp","tn","fp","fn")
return(scores)
}
Let’s check a few cutoff values:
class=ifelse(prob>0.063,1,0)
c_accuracy(validate$CKD, class)
## recall precision accuracy tpr fpr
## 0.8269231 0.2627291 0.7838889 0.8269231 0.2201946
## fmeasure tp tn fp fn
## 0.3987635 129.0000000 1282.0000000 362.0000000 27.0000000
class=ifelse(prob>0.08,1,0)
c_accuracy(validate$CKD, class)
## recall precision accuracy tpr fpr
## 0.7628205 0.2846890 0.8133333 0.7628205 0.1818735
## fmeasure tp tn fp fn
## 0.4146341 119.0000000 1345.0000000 299.0000000 37.0000000
class=ifelse(prob>0.15,1,0)
c_accuracy(validate$CKD, class)
## recall precision accuracy tpr fpr
## 0.6474359 0.3713235 0.8744444 0.6474359 0.1040146
## fmeasure tp tn fp fn
## 0.4719626 101.0000000 1473.0000000 171.0000000 55.0000000
Note how the TPR falls as the cutoff rises and the accuracy rate improves.
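A quick way to see this trade-off is to sweep a grid of cutoffs and tabulate accuracy against TPR; a sketch reusing the prob vector and c_accuracy() from above:
# sweep cutoffs from 0.05 to 0.50 and collect accuracy and TPR at each
cuts = seq(0.05, 0.5, by = 0.05)
metrics = sapply(cuts, function(k) c_accuracy(validate$CKD, ifelse(prob > k, 1, 0))[c("accuracy", "tpr")])
colnames(metrics) = cuts
round(metrics, 3)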
barplot(sort(abs(model2$coefficients[-1]), decreasing=TRUE), col="steelblue", cex.names = .6, las=2, main="Variable comparison")
Based on the final model, we can build a simple questionnaire or screening tool to check whether a person is at high risk for CKD and should undergo further medical tests.
Age? Sex? Weight? Height? Activity level? Hypertension? Diabetes? Cardiovascular disease? Family history of cardiovascular disease? Family history of hypertension? Anemia? Peripheral vascular disease? Race/ethnicity? Usual source of care?
After collecting responses to the questionnaire above, and supposing the new patients' data is arranged in the same format as the data we've worked with, an example screening tool could look like this:
screeningTool = function(db, cutoff = 0.063) { # default cutoff taken from the ROC analysis above
p = predict(model2, type="response", newdata=db) # predicted CKD risk score
db['CKD'] = ifelse(p>cutoff,1,0) # flag patients whose risk exceeds the threshold
return(db)
}
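For instance, assuming the hold-out test set has received the same factor-to-numeric recoding that was applied to fd, the tool could flag the untested patients like so:
# hypothetical usage: score the 2,819 hold-out patients and count flagged cases
screened = screeningTool(test)
table(screened$CKD)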