The data set consists of selected information on 8,819 adults aged 20 or older, taken from the 1999-2000 and 2001-2002 surveys. The sample subjects were randomly divided into two pools: a 6,000-case training set and a 2,819-case hold-out sample for testing. A test for CKD was administered to everyone in the study population. The dependent variable of interest is CKD, a binary variable indicating whether or not the subject had CKD. The 32 independent variables include age, weight, income, cholesterol level, systolic/diastolic blood pressure, family history of diabetes, cardiovascular disease, and so on.
The goal of the project is to predict the CKD risk score and disease status for the hold-out sample, identify the best predictors of CKD, and build a predictive model that could be turned into a quick screening tool for early detection of the disease. Given that the total annual bill for treating kidney failure (the last stage of CKD) is over $35 billion, early detection and treatment of CKD can save insurance companies $250,000 per case in which the patient does not progress to kidney failure, and save the Federal government approximately $1 billion per year (US Dept. of Health and Human Services).
# load libraries and data
library(dplyr)
library(ggcorrplot)
library(ggplot2)
library(caret)
library(mice)
library(tidyr)
library(ROCR)
library(varhandle)
casedata = read.csv("E:/MSBA/5 Mutivariate Analysis/Project/casedata.csv")
Check the data:
casedata1=casedata[,-c(1, 34)] # remove ID and disease status
str(casedata1)
## 'data.frame': 8819 obs. of 32 variables:
## $ Age : int 65 36 66 54 63 26 66 59 53 78 ...
## $ Female : int 1 1 1 1 1 0 0 1 1 1 ...
## $ Racegrp : Factor w/ 4 levels "black","hispa",..: 4 2 4 4 1 4 1 4 4 1 ...
## $ Educ : int 0 0 0 1 0 1 1 1 1 0 ...
## $ Unmarried : int 0 NA 1 0 0 0 0 1 0 1 ...
## $ Income : int 1 1 0 0 NA 0 0 0 1 0 ...
## $ CareSource : Factor w/ 5 levels " ","clinic","DrHMO",..: 5 4 4 3 2 3 3 3 3 3 ...
## $ Insured : int 1 0 1 1 1 1 0 1 1 1 ...
## $ Weight : num 56 60.2 83.9 69.4 73.1 ...
## $ Height : num 162 162 162 160 159 ...
## $ BMI : num 21.3 22.9 31.8 26.9 28.8 ...
## $ Obese : int 0 0 1 0 0 1 0 0 0 0 ...
## $ Waist : num 83.6 76.6 113.2 77.9 89.3 ...
## $ SBP : int 135 96 115 110 132 129 137 124 110 170 ...
## $ DBP : int 71 52 57 57 73 70 92 73 74 78 ...
## $ HDL : int 48 31 44 74 67 43 41 43 62 105 ...
## $ LDL : int 249 135 211 156 154 159 143 140 110 90 ...
## $ Total.Chol : int 297 166 255 230 221 202 184 183 172 195 ...
## $ Dyslipidemia : int 0 0 1 0 0 0 0 0 0 0 ...
## $ PVD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Activity : int 3 3 1 2 1 2 3 2 1 1 ...
## $ PoorVision : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Smoker : int 1 0 1 1 0 0 0 0 1 0 ...
## $ Hypertension : int 0 0 0 0 1 0 1 0 1 1 ...
## $ Fam.Hypertension: int 0 0 0 0 0 1 0 1 1 0 ...
## $ Diabetes : int 0 0 1 0 0 0 0 1 0 1 ...
## $ Fam.Diabetes : int 1 0 0 0 0 0 0 1 0 1 ...
## $ Stroke : int 0 0 0 0 0 0 0 0 0 1 ...
## $ CVD : int 1 0 0 0 0 0 0 0 0 1 ...
## $ Fam.CVD : int 0 0 0 0 0 1 1 1 1 1 ...
## $ CHF : int 0 0 0 0 0 0 0 0 0 NA ...
## $ Anemia : int 0 0 0 0 0 0 0 0 0 0 ...
Check missing pattern:
# Unique values per column
lapply(casedata, function(x) length(unique(x)))
#Check for Missing values
missing_values = casedata %>% summarize_all(~ sum(is.na(.))/n()) # share of NAs per column (dplyr)
missing_values = gather(missing_values, key="feature", value="missing_pct") # reshape to long format (tidyr)
missing_values %>%
ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct))+
geom_bar(stat="identity",fill="tomato3")+
coord_flip()+theme_bw()
Most of the missing values come from CKD, which is expected because the observations with missing CKD status form the hold-out sample. Variables with a significant share of missing values are Income, PoorVision, Unmarried and Fam.CVD. In total, 24 variables have missing values.
I'll use multiple imputation (the mice package) to predict the missing values from the observed ones.
init = mice(casedata1, maxit=0)
meth = init$method
predM = init$predictorMatrix
Skip imputation for the variables that have no missing values; they will still be used as predictors for imputing the others:
meth[c("Smoker", "Racegrp", "PVD", "Female", "Fam.Hypertension", "Fam.Diabetes", "Dyslipidemia", "CareSource", "Age")]=""
Specify an imputation method for each variable: Bayesian linear regression ("norm") for continuous variables, logistic regression ("logreg") for binary variables and polytomous logistic regression ("polyreg") for the multi-level Activity variable.
meth[c("Educ", "Unmarried", "Insured", "Obese", "PoorVision", "Hypertension",
"Diabetes", "Stroke", "CVD", "Fam.CVD", "CHF", "Anemia", "Income")]="logreg"
meth[c("Weight","Height", "BMI", "Waist", "SBP", "DBP", "HDL", "LDL", "Total.Chol")]="norm"
meth[c("Activity")]="polyreg"
categorical = casedata1[,c("Income", "Educ", "Unmarried", "Insured", "Obese", "PoorVision", "Hypertension", "Diabetes", "Stroke", "CVD", "Fam.CVD", "CHF", "Anemia", "Smoker", "PVD", "Female", "Fam.Hypertension", "Fam.Diabetes", "Dyslipidemia", "Activity")]
categorical_var = lapply(categorical, as.factor)
categorical_var=as.data.frame(categorical_var)
casedata1 = cbind(casedata1[,c("Weight", "Height", "BMI", "SBP", "DBP", "Waist", "HDL", "LDL", "Total.Chol", "Racegrp", "CareSource", "Age")], categorical_var)
Impute missing values:
set.seed(1)
imputed = mice(casedata1, method=meth, predictorMatrix=predM, m = 5)
## Warning: Number of logged events: 410
Integrate imputed data into the main dataset:
complete = mice::complete(imputed)
fulldata=mutate(complete, CKD=casedata$CKD)
sapply(fulldata, function(x) sum(is.na(x))) # Check the number of missing values after imputation
## Weight Height BMI SBP
## 0 0 0 0
## DBP Waist HDL LDL
## 0 0 0 0
## Total.Chol Racegrp CareSource Age
## 0 0 0 0
## Income Educ Unmarried Insured
## 0 0 0 0
## Obese PoorVision Hypertension Diabetes
## 0 0 0 0
## Stroke CVD Fam.CVD CHF
## 0 0 0 0
## Anemia Smoker PVD Female
## 0 0 0 0
## Fam.Hypertension Fam.Diabetes Dyslipidemia Activity
## 0 0 0 0
## CKD
## 2819
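Before moving on, it's worth checking that the imputed values look plausible. The mice package provides a lattice-based densityplot() method that overlays the distributions of imputed values (red) against observed values (blue); a quick sketch for a few of the continuous variables:
# sanity check: compare observed vs. imputed distributions for selected continuous variables
library(lattice)
densityplot(imputed, ~ Weight + Height + SBP)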
One-hot encode categorical variables:
dv = dummyVars(~ Racegrp+CareSource, data = fulldata, sep = ".", fullRank = TRUE)
nv = as.data.frame(predict(dv, newdata = fulldata))
fulldata=fulldata[,-c(10, 11)] # drop the original Racegrp and CareSource factor columns
fulldata=cbind(fulldata, nv)
Separate 2819 cases without CKD status for later use as a test set:
fd=subset(fulldata, !is.na(CKD))
test=subset(fulldata, is.na(CKD))
dim(fd)
## [1] 6000 38
Correlation plot:
# Handle factor variables
cv = fd[,c("Educ", "Unmarried", "Insured", "Obese", "PoorVision", "Hypertension",
"Diabetes", "Stroke", "CVD", "Fam.CVD", "CHF", "Anemia", "Income", "Dyslipidemia",
"PVD", "Smoker","Fam.Hypertension", "Female", "Fam.Diabetes", "Activity")]
cv = lapply(cv, as.numeric) # as.numeric() on a factor returns 1-based level indices
cv = as.data.frame(cv)
# shift the 19 binary variables back from 1/2 to 0/1;
# Activity (column 20) already sits on its original 1-4 scale
for (i in 1:19){
cv[,i] = cv[,i] - 1
}
fd = fd[,-which(names(fd) %in% c("Educ", "Unmarried", "Insured", "Obese", "PoorVision", "Hypertension",
"Diabetes", "Stroke", "CVD", "Fam.CVD", "CHF", "Anemia", "Income", "Dyslipidemia",
"PVD", "Smoker","Fam.Hypertension", "Female", "Fam.Diabetes", "Activity"))]
fd = cbind(cv, fd)
cormat <- cor(fd) # avoid masking the base cor() function
ggcorrplot(cormat)
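To see which pairs drive the plot, we can also list the strongest correlations directly. A small sketch using the cormat computed above; the 0.6 cutoff is an arbitrary choice:
# list variable pairs whose absolute correlation exceeds 0.6
idx = which(abs(cormat) > 0.6 & upper.tri(cormat), arr.ind = TRUE)
data.frame(var1 = rownames(cormat)[idx[, 1]],
           var2 = colnames(cormat)[idx[, 2]],
           r = round(cormat[idx], 2))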
Comparing attributes of people with and without CKD (the red line indicates those who tested positive for the disease):
a=subset(fd, fd$CKD==1)
b=subset(fd, fd$CKD==0)
plot(density(a$Age), col="red", main='Comparing age distribution', xlab='Age')
lines(density(b$Age))
There's a staggering difference in the age distributions of people with and without CKD: most of those diagnosed with the disease are between 70 and 90 years old.
plot(density(a$Activity, bw = 0.05), col = 'red', main = "Comparing Activity patterns", xlab = 'Activity level')
lines(density(b$Activity))
Activity levels are coded as 1 - mostly sit, 2 - stand or walk a lot, 3 - lift light loads or climb stairs often, and 4 - heavy work and heavy loads. People with CKD tend to have a more sedentary lifestyle with little to no physical activity, while those without CKD are more likely to be physically active.
plot(density(a$SBP, bw = 0.6), col = 'red', main = 'Comparing systolic blood pressure patterns', xlab = 'Systolic blood pressure level')
lines(density(b$SBP, bw = 0.6))
Those who tested positive for CKD tend to have higher systolic blood pressure.
plot(density(a$Income, bw=0.2), col = 'red', main = "Comparing income levels", xlab = 'Income level')
lines(density(b$Income, bw=0.2))
Note: Income level is a binary variable indicating whether or not a person has an above-median income.
prop.table(table(fd$Income, fd$CKD), margin = 2) # specific ratios
##
## 0 1
## 0 0.5798410 0.7219828
## 1 0.4201590 0.2780172
42% of the people who tested negative for CKD have an above-median income, while the figure for those who tested positive is only 28%.
Based on the correlation matrix and some background research, we can safely remove the variables HDL and LDL (good/bad cholesterol, which overlap with Total Cholesterol), Obese and Waist (highly correlated with Weight), and BMI (computed from Weight and Height). The dataset will then be split 70/30 for training and validating the regression models.
fd=fd[,-which(names(fd) %in% c("BMI","HDL", "LDL", "Obese", "Waist"))] # remove unwanted variables
split = round(nrow(fd)*.7) # 70/30 split; rows are taken in order, so shuffle first if the data isn't already randomly ordered
train = fd[1:split,]
validate = fd[(split+1):nrow(fd),]
dim(train)
## [1] 4200 33
Generalized linear model with all variables:
model=glm(CKD~.,family="binomial", data=train)
summary(model)
##
## Call:
## glm(formula = CKD ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2588 -0.3097 -0.1174 -0.0521 3.3653
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.392894 324.749587 -0.057 0.954834
## Educ -0.168699 0.161691 -1.043 0.296790
## Unmarried 0.127792 0.155150 0.824 0.410128
## Insured -0.253782 0.309968 -0.819 0.412937
## PoorVision -0.089227 0.208966 -0.427 0.669386
## Hypertension 0.496459 0.186378 2.664 0.007728 **
## Diabetes 0.559290 0.175610 3.185 0.001448 **
## Stroke -0.138998 0.299381 -0.464 0.642444
## CVD 0.730434 0.239152 3.054 0.002256 **
## Fam.CVD 0.471182 0.240007 1.963 0.049623 *
## CHF 0.256365 0.248936 1.030 0.303083
## Anemia 1.262729 0.446219 2.830 0.004657 **
## Income 0.165317 0.170814 0.968 0.333136
## Dyslipidemia -0.010923 0.234063 -0.047 0.962778
## PVD 0.820053 0.210964 3.887 0.000101 ***
## Smoker 0.142031 0.146979 0.966 0.333877
## Fam.Hypertension -0.598654 0.282693 -2.118 0.034202 *
## Female 0.340186 0.205081 1.659 0.097158 .
## Fam.Diabetes -0.119842 0.154568 -0.775 0.438144
## Activity -0.277992 0.108226 -2.569 0.010210 *
## Weight 0.010793 0.004577 2.358 0.018364 *
## Height 0.020933 0.010512 1.991 0.046435 *
## SBP -0.002485 0.003866 -0.643 0.520390
## DBP -0.002029 0.005840 -0.347 0.728249
## Total.Chol 0.001940 0.001750 1.108 0.267725
## Age 0.097322 0.008044 12.098 < 2e-16 ***
## Racegrp.hispa -0.417113 0.250797 -1.663 0.096282 .
## Racegrp.other -0.151506 0.652315 -0.232 0.816337
## Racegrp.white 0.154886 0.203447 0.761 0.446474
## CareSource.clinic 5.461375 324.744709 0.017 0.986582
## CareSource.DrHMO 5.313196 324.744705 0.016 0.986946
## CareSource.noplace 4.628450 324.744846 0.014 0.988628
## CareSource.other 5.457358 324.744796 0.017 0.986592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2202.3 on 4199 degrees of freedom
## Residual deviance: 1435.1 on 4167 degrees of freedom
## AIC: 1501.1
##
## Number of Fisher Scoring iterations: 11
Model 1 - Logistic regression with variables selected by stepwise regression. Stepwise procedures pick the best predictors of the dependent variable using a model-fit criterion such as the t-test, F-test or adjusted R-squared; here we use the AIC (Akaike Information Criterion - https://en.m.wikipedia.org/wiki/Akaike_information_criterion) together with p-values. The smaller the AIC, the better the model.
Forward selection starts with no variables in the model, tests the addition of each variable using the chosen fit criterion, adds the variable (if any) whose inclusion gives the most statistically significant improvement in fit, and repeats until no addition improves the model to a statistically significant extent.
In the summary we will see two deviances, null and residual. Deviance is a measure of a model's goodness of fit; higher values indicate worse fit. The null deviance shows how well the dependent variable is predicted by a model that includes only the intercept (the grand mean), while the residual deviance is for the model that also includes the independent variables.
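One simple way to combine the two deviances into a single fit summary is McFadden's pseudo-R-squared, one minus the ratio of residual to null deviance; a quick sketch using the full model fitted above:
# McFadden's pseudo-R-squared: 1 - 1435.1/2202.3, roughly 0.35 here
1 - model$deviance / model$null.deviance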
model1=step(model,direction="forward")
summary(model1)
##
## Call:
## glm(formula = CKD ~ Educ + Unmarried + Insured + PoorVision +
## Hypertension + Diabetes + Stroke + CVD + Fam.CVD + CHF +
## Anemia + Income + Dyslipidemia + PVD + Smoker + Fam.Hypertension +
## Female + Fam.Diabetes + Activity + Weight + Height + SBP +
## DBP + Total.Chol + Age + Racegrp.hispa + Racegrp.other +
## Racegrp.white + CareSource.clinic + CareSource.DrHMO + CareSource.noplace +
## CareSource.other, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2588 -0.3097 -0.1174 -0.0521 3.3653
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.392894 324.749587 -0.057 0.954834
## Educ -0.168699 0.161691 -1.043 0.296790
## Unmarried 0.127792 0.155150 0.824 0.410128
## Insured -0.253782 0.309968 -0.819 0.412937
## PoorVision -0.089227 0.208966 -0.427 0.669386
## Hypertension 0.496459 0.186378 2.664 0.007728 **
## Diabetes 0.559290 0.175610 3.185 0.001448 **
## Stroke -0.138998 0.299381 -0.464 0.642444
## CVD 0.730434 0.239152 3.054 0.002256 **
## Fam.CVD 0.471182 0.240007 1.963 0.049623 *
## CHF 0.256365 0.248936 1.030 0.303083
## Anemia 1.262729 0.446219 2.830 0.004657 **
## Income 0.165317 0.170814 0.968 0.333136
## Dyslipidemia -0.010923 0.234063 -0.047 0.962778
## PVD 0.820053 0.210964 3.887 0.000101 ***
## Smoker 0.142031 0.146979 0.966 0.333877
## Fam.Hypertension -0.598654 0.282693 -2.118 0.034202 *
## Female 0.340186 0.205081 1.659 0.097158 .
## Fam.Diabetes -0.119842 0.154568 -0.775 0.438144
## Activity -0.277992 0.108226 -2.569 0.010210 *
## Weight 0.010793 0.004577 2.358 0.018364 *
## Height 0.020933 0.010512 1.991 0.046435 *
## SBP -0.002485 0.003866 -0.643 0.520390
## DBP -0.002029 0.005840 -0.347 0.728249
## Total.Chol 0.001940 0.001750 1.108 0.267725
## Age 0.097322 0.008044 12.098 < 2e-16 ***
## Racegrp.hispa -0.417113 0.250797 -1.663 0.096282 .
## Racegrp.other -0.151506 0.652315 -0.232 0.816337
## Racegrp.white 0.154886 0.203447 0.761 0.446474
## CareSource.clinic 5.461375 324.744709 0.017 0.986582
## CareSource.DrHMO 5.313196 324.744705 0.016 0.986946
## CareSource.noplace 4.628450 324.744846 0.014 0.988628
## CareSource.other 5.457358 324.744796 0.017 0.986592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2202.3 on 4199 degrees of freedom
## Residual deviance: 1435.1 on 4167 degrees of freedom
## AIC: 1501.1
##
## Number of Fisher Scoring iterations: 11
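Note that step() begins from whatever model it is given, and forward selection can only add terms, so starting from the already-full model leaves it unchanged. That is why the summary above is identical to the full model's. A forward search is normally started from the intercept-only model, with the full model as the upper scope; a sketch, assuming the objects above are still in the session:
# forward selection from the null model up to the full model's formula
null_model = glm(CKD ~ 1, family = "binomial", data = train)
model1_fwd = step(null_model, scope = formula(model), direction = "forward", trace = 0)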
Next, I try backward elimination, which starts with all candidate variables, tests the deletion of each variable using the chosen model-fit criterion, deletes the variable (if any) whose removal causes the least statistically significant deterioration in fit, and repeats this process until no further variable can be deleted without a statistically significant loss of fit.
model2=step(model,direction="backward")
summary(model2)
##
## Call:
## glm(formula = CKD ~ Hypertension + Diabetes + CVD + Fam.CVD +
## Anemia + PVD + Fam.Hypertension + Female + Activity + Weight +
## Height + Age + Racegrp.hispa + CareSource.clinic + CareSource.DrHMO +
## CareSource.other, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1452 -0.3133 -0.1196 -0.0524 3.3776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.588349 1.883742 -7.213 5.45e-13 ***
## Hypertension 0.394577 0.163673 2.411 0.015919 *
## Diabetes 0.542114 0.164674 3.292 0.000995 ***
## CVD 0.672112 0.170898 3.933 8.40e-05 ***
## Fam.CVD 0.482640 0.238005 2.028 0.042575 *
## Anemia 1.151353 0.437572 2.631 0.008508 **
## PVD 0.819813 0.205717 3.985 6.74e-05 ***
## Fam.Hypertension -0.618367 0.280648 -2.203 0.027569 *
## Female 0.353111 0.190929 1.849 0.064394 .
## Activity -0.292137 0.106485 -2.743 0.006080 **
## Weight 0.010880 0.004487 2.425 0.015310 *
## Height 0.019968 0.010291 1.940 0.052328 .
## Age 0.097324 0.006850 14.207 < 2e-16 ***
## Racegrp.hispa -0.534466 0.195780 -2.730 0.006335 **
## CareSource.clinic 0.782286 0.457468 1.710 0.087260 .
## CareSource.DrHMO 0.613976 0.443470 1.384 0.166211
## CareSource.other 0.761276 0.528535 1.440 0.149768
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2202.3 on 4199 degrees of freedom
## Residual deviance: 1443.9 on 4183 degrees of freedom
## AIC: 1477.9
##
## Number of Fisher Scoring iterations: 7
Interpretation: The final model above retains the most important variables, and its lower AIC (1477.9 vs. 1501.1) indicates that model 2 is better than model 1.
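The comparison can also be confirmed directly with R's built-in AIC():
AIC(model1, model2) # lower is better; df counts the estimated parameters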
Predict on the validation set:
prob = predict(model2, type = "response", newdata=validate)
Find the Area Under the Curve (AUC) of Model 2:
ROCRpred = prediction(prob, validate$CKD)
auc = round(as.numeric(performance(ROCRpred, "auc")@y.values),2)
cat("AUC of Model2 is",auc)
## AUC of Model2 is 0.88
The model has an AUC of 0.88, which is quite decent. Let's see the ROC curve with respect to the True Positive Rate (TPR) and False Positive Rate (FPR):
ROCRperf = performance(ROCRpred, "tpr", "fpr")
# Adding threshold labels
plot(ROCRperf, colorize=TRUE, print.cutoffs.at = seq(0,1,0.1), text.adj = c(-0.2, 1.7))
abline(a=0, b=1)
legend(.8, .2, auc, title = "AUC", cex = 1) # reuse the AUC computed earlier
# find the ROC point closest to the top-left corner (FPR = 0, TPR = 1)
opt.cut = function(ROCRperf, ROCRpred){
cut.ind = mapply(FUN=function(x, y, p){
d = x^2 + (y-1)^2 # squared distance to the perfect classifier at (0, 1)
ind = which(d == min(d))
c(sensitivity = y[[ind]], specificity = 1-x[[ind]],
cutoff = p[[ind]])
}, ROCRperf@x.values, ROCRperf@y.values, ROCRpred@cutoffs)
}
print(opt.cut(ROCRperf, ROCRpred))
## [,1]
## sensitivity 0.83333333
## specificity 0.77919708
## cutoff 0.06284708
opt.cut returns the threshold whose ROC point lies closest to the top-left corner, i.e. the best balance between sensitivity and specificity, not simply the highest possible TPR. If the project's goal is to catch as many positive cases as possible and avoid false negatives at all costs, which is usually preferred in healthcare, then we can settle at 0.063. But if the goal is to maximize the accuracy rate, which takes both TP and TN into account, we should raise the threshold a bit.
# evaluation function: confusion-matrix counts and the derived rates
c_accuracy=function(actuals,classifications){
df=data.frame(actuals,classifications)
tp=nrow(df[df$classifications==1 & df$actuals==1,])
fp=nrow(df[df$classifications==1 & df$actuals==0,])
fn=nrow(df[df$classifications==0 & df$actuals==1,])
tn=nrow(df[df$classifications==0 & df$actuals==0,])
recall=tp/(tp+fn)
precision=tp/(tp+fp)
accuracy=(tp+tn)/(tp+fn+fp+tn)
tpr=recall
fpr=fp/(fp+tn)
fmeasure=2*precision*recall/(precision+recall)
scores=c(recall,precision,accuracy,tpr,fpr,fmeasure,tp,tn,fp,fn)
names(scores)=c("recall","precision","accuracy","tpr","fpr","fmeasure","tp","tn","fp","fn")
return(scores)
}
Let’s check a few cutoff values:
class=ifelse(prob>0.063,1,0)
c_accuracy(validate$CKD, class)
## recall precision accuracy tpr fpr
## 0.8269231 0.2627291 0.7838889 0.8269231 0.2201946
## fmeasure tp tn fp fn
## 0.3987635 129.0000000 1282.0000000 362.0000000 27.0000000
class=ifelse(prob>0.08,1,0)
c_accuracy(validate$CKD, class)
## recall precision accuracy tpr fpr
## 0.7628205 0.2846890 0.8133333 0.7628205 0.1818735
## fmeasure tp tn fp fn
## 0.4146341 119.0000000 1345.0000000 299.0000000 37.0000000
class=ifelse(prob>0.15,1,0)
c_accuracy(validate$CKD, class)
## recall precision accuracy tpr fpr
## 0.6474359 0.3713235 0.8744444 0.6474359 0.1040146
## fmeasure tp tn fp fn
## 0.4719626 101.0000000 1473.0000000 171.0000000 55.0000000
Note how the TPR falls as the cutoff rises and the accuracy rate improves.
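A quick way to see this trade-off is to sweep a grid of cutoffs and tabulate accuracy against TPR; a sketch reusing the prob vector and c_accuracy() from above:
# sweep cutoffs from 0.05 to 0.50 and collect accuracy and TPR at each
cuts = seq(0.05, 0.5, by = 0.05)
metrics = sapply(cuts, function(k) c_accuracy(validate$CKD, ifelse(prob > k, 1, 0))[c("accuracy", "tpr")])
colnames(metrics) = cuts
round(metrics, 3)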
barplot(sort(abs(model2$coefficients[-1]), decreasing=TRUE), col="steelblue", cex.names = .6, las=2, main="Variable comparison")
Based on the final model, we can build a simple questionnaire or screening tool to check whether a person is at high risk for CKD and should undergo further medical tests.
Age? Sex? Weight? Height? Activity level? Hypertension? Diabetes? Cardiovascular disease? Family history of cardiovascular disease? Family history of hypertension? Anemia? Peripheral vascular disease? Race/ethnicity? Usual source of care?
After collecting responses to the questionnaire above, and supposing the new patients' data is arranged in the same format as the data we've worked with, an example screening tool could look like this:
screeningTool = function(db, cutoff = 0.063) { # default cutoff taken from the ROC analysis above
p = predict(model2, type="response", newdata=db) # predicted CKD risk score
db['CKD'] = ifelse(p>cutoff,1,0) # flag patients whose risk exceeds the threshold
return(db)
}
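For instance, assuming the hold-out test set has received the same factor-to-numeric recoding that was applied to fd, the tool could flag the untested patients like so:
# hypothetical usage: score the 2,819 hold-out patients and count flagged cases
screened = screeningTool(test)
table(screened$CKD)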