ID: 24271677
Background
Diabetes is a chronic medical condition affecting millions of Americans, but if managed well, with good diet, exercise and medication, patients can lead relatively normal lives. However, if improperly managed, diabetes can lead to patients being continuously admitted and readmitted to hospitals. Readmissions are especially serious - they represent a failure of the health system to provide adequate support to the patient and are extremely costly to the system. As a result, the Centers for Medicare and Medicaid Services announced in 2012 that they would no longer reimburse hospitals for services rendered if a patient was readmitted with complications within 30 days of discharge.
Given these policy changes, being able to identify and predict those patients most at risk for costly readmissions has become a pressing priority for hospital administrators.
Data
The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria:
* It is an inpatient encounter (a hospital admission).
* It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
* The length of stay was at least 1 day and at most 14 days.
* Laboratory tests were performed during the encounter.
* Medications were administered during the encounter.
The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnoses, number of medications, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.
Methodology
In this project, I build a logistic regression model with elastic net regularization for feature selection. To pick the best model, I suggest using Bayes' rule for model evaluation and setting the risk ratio at a01/a10 = 0.1 to take the loss into account, because the hospital will not be reimbursed by the social insurance system for patients who are readmitted within 30 days; in other words, the cost of a false negative prediction is considerably higher. I find that the length of the hospital stay, the frequency of inpatient visits, and the discharge location are the most important predictors of readmission status.
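The following is a minimal sketch of this pipeline, not the exact run reported below; XX, y, and the value alpha = 0.5 are stand-ins for the model matrix, the binary readmission indicator, and the mixing parameter chosen later.
library(glmnet)
# Elastic net logistic regression: cross-validate lambda at a fixed alpha,
# then classify with the cost-sensitive threshold implied by a01/a10 = 0.1
cvfit <- cv.glmnet(XX, y, family = "binomial", alpha = 0.5, type.measure = "class")
prob <- predict(cvfit, newx = XX, s = "lambda.min", type = "response")
ratio <- 0.1                      # a01/a10: false positive loss relative to false negative loss
threshold <- ratio / (1 + ratio)  # Bayes rule cut-off, about 0.09
pred <- ifelse(prob > threshold, "1", "0")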
Limitations
There is no guarantee that the prediction will be the ‘truth’. The model can only summarize information from the past, and past trends will not necessarily continue in the future. Therefore, it would be better to update the model from time to time. Moreover, I suggest that we take physicians’ views seriously and use the model as a promising reference.
Data Source
The original data is from the Center for Clinical and Translational Research at Virginia Commonwealth University. It covers data on diabetes patients across 130 U.S. hospitals during a ten-year period from 1999 to 2008. There are over 100,000 unique hospital admissions in this dataset, from ~70,000 unique patients. The data includes demographic elements, such as age, gender, and race, as well as clinical attributes such as tests conducted, emergency/inpatient visits, etc.
Characteristics of the Dataset
All observations have five things in common:
* They are inpatient encounters (hospital admissions).
* They are diabetic encounters, i.e., some kind of diabetes was entered as a diagnosis.
* The length of stay was at least 1 day and at most 14 days.
* Laboratory tests were performed during the encounter.
* Medications were administered during the encounter.
Description of variables
The dataset used covers ~50 different variables to describe every hospital diabetes admission. In this section we give an overview and brief description of the variables in this dataset.
Patient identifiers
encounter_id is a unique identifier for each admission, and patient_nbr uniquely identifies each patient.
Patient Demographics
race, age, gender, and weight cover the basic demographic information associated with each patient. payer_code is an additional variable that identifies which health insurance (Medicare / Medicaid / Commercial) the patient holds.
Admission and discharge details
admission_source_id and admission_type_id identify who referred the patient to the hospital (e.g. physician vs. emergency dept.) and what type of admission this was (Emergency vs. Elective vs. Urgent). discharge_disposition_id indicates where the patient was discharged to after treatment.
Patient Medical History
num_outpatient: number of outpatient visits by the patient in the year prior to the current encounter
num_inpatient: number of inpatient visits by the patient in the year prior to the current encounter
num_emergency: number of emergency visits by the patient in the year prior to the current encounter
Patient admission details
medical_specialty: the specialty of the physician admitting the patient
diag_1, diag_2, diag_3: ICD9 codes for the primary, secondary and tertiary diagnoses of the patient. ICD9 codes are the universal codes that all physicians use to record diagnoses. There are various easy-to-use tools to look up what individual codes mean (Wikipedia is pretty decent on its own)
time_in_hospital: the patient's length of stay in the hospital (in days)
number_diagnoses: total no. of diagnoses entered for the patient
num_lab_procedures: no. of lab procedures performed in the current encounter
num_procedures: no. of non-lab procedures performed in the current encounter
num_medications: no. of distinct medications prescribed in the current encounter
Clinical Results
max_glu_serum: indicates results of the glucose serum test
A1Cresult: indicates results of the A1c test
Medication Details
diabetesMed: indicates if any diabetes medication was prescribed
change: indicates if there was a change in diabetes medication
24 medication variables: indicate whether the dosage of the medicines was changed in any manner during the encounter
Readmission indicator
readmitted: indicates whether the patient was readmitted within 30 days, after more than 30 days, or not at all.
Data Cleaning
\(\color{blue}{\text{diabetic.data.csv}}\) is the original data.
\(\color{blue}{\text{readmission.csv}}\) is a cleaned version, modified in the following ways:
This dataset contains 101,766 observations and 31 variables.
Payer code, weight and Medical Specialty are not included since they have a large number of missing values.
Variables with little variability are excluded. These include: acetohexamide, glimepiride.pioglitazone, metformin.rosiglitazone, metformin.pioglitazone, chlorpropamide, tolbutamide, acarbose, miglitor, troglitazone, tolazamide, examide, citoglipton, glyburide.metformin, and glipizide.metformin.
Some categorical variables are regrouped. For example, Diag1_mod keeps the original levels that cover a large number of patients and aggregates the remaining patients as ‘others’. This process is known as ‘binning’ (a sketch follows this list).
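A minimal sketch of this kind of regrouping; the 1,000-patient cut-off, the helper name bin_levels, and the diabetic data frame are assumptions for illustration, since the actual cleaning script is not shown here.
# Collapse rare ICD9 levels of diag_1 into an "others" category,
# keeping only levels observed for a large number of patients
bin_levels <- function(x, min_count = 1000) {
  counts <- table(x)
  keep <- names(counts)[counts >= min_count]
  factor(ifelse(x %in% keep, as.character(x), "others"))
}
# diabetic$Diag1_mod <- bin_levels(diabetic$diag_1)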
Missing Values
There is only a small proportion of missing values in this cleaned dataset (a sketch of how such counts can be computed follows the table).
| \(\;\) | gender | race | diag3_mod |
|---|---|---|---|
| Count | 3 | 2273 | 1423 |
| %Total | < .01% | 2.23% | 1.40% |
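A minimal sketch of how such a table can be produced, assuming missing entries are stored as NA in readmission.csv (in the raw diabetic.data.csv they appear as "?").
rawdata <- read.csv("readmission.csv", header = TRUE, as.is = FALSE)
miss_count <- colSums(is.na(rawdata))                   # missing cells per variable
miss_pct <- round(100 * miss_count / nrow(rawdata), 2)  # as a % of all observations
data.frame(Count = miss_count, Percent = miss_pct)[miss_count > 0, ]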
# originaldata <- read.csv("diabetic.data.csv", header = T, as.is = F)
rawdata <- read.csv("readmission.csv", header = T, as.is = F)
# raw data summary
dataset <- rawdata
dim <- dim(dataset)
str <- str(dataset)
# summary(dataset)
dataset <- subset(dataset, select = -c(encounter_id)) # delete ID variables
names(dataset)[length(dataset)] <- "y"
dataset$y <- as.character(dataset$y)
dataset$y[which(dataset$y == ">30" | dataset$y == "NO")] <- "0"
dataset$y[which(dataset$y == "<30")] <- "1"
dataset$y <- as.factor(dataset$y)
# rename variables
names(dataset) <- c("pers", "race", "gend", "hosp", "labp", "proc", "nmed", "nout", "emer", "ninp", "diag", "mglu", "A1Cr", "metf", "glim", "glip", "glyb", "piog", "rosi", "insu", "chge", "diab", "disc", "adms", "admt", "agem", "mod1", "mod2", "mod3", "y")
# Data partition
set.seed(43)
pos <- sample(1:nrow(dataset), round(nrow(dataset)*0.3))
testing <- dataset[pos, -1]
training <- dataset[-pos, -1]
# XX and y
data <- training
y <- data$y
response <- "readmitted"
XX <- model.matrix(y ~., data)[, -1]
Y <- y
(see Appendix for the recoding details)
Readmission indicator
About 11.1% of observations have a record of readmission within 30 days.
| \(\;\) | Never or > 30 days | < 30 days |
|---|---|---|
| %Total | 88.9% | 11.1% |
yprop <- prop.table(table(data$y))
Patient Demographics
Some patients have more than one record in our dataset, and a patient’s latest medical record can differ substantially from the previous ones. These duplicate records do not affect the demographic summaries much when they are shown as percentages.
From the table, the composition of race is similar when we compare the proportions between the two levels of readmission status. For gender, the composition is similar as well. As for the age distribution, the mean age is slightly higher for patients who have been readmitted within 30 days.
# subset data which excludes duplicate data points
# data2 <- data[!duplicated(data$pers), ]
# data3 <- data[!duplicated(c(paste0(data$pers, data$agem))), ]
# demographics
ytab1 <- prop.table(table(data$race))
crosstab <- with(data, table(race, y))
ytab2 <- prop.table(crosstab, margin = 2)
df.race <- data.frame(round(rbind(ytab1, t(ytab2)), 3))
names(df.race)[1] <- "Missing"
rownames(df.race) <- c("Total", "y = 0", "y = 1")
df.race
ytab1 <- prop.table(table(data$gend))
crosstab <- with(data, table(gend, y))
ytab2 <- prop.table(crosstab, margin = 2)
df.gender <- data.frame(round(rbind(ytab1, t(ytab2)), 3))
rownames(df.gender) <- c("Total","y = 0", "y = 1")
df.gender
ytab1 <- prop.table(table(data$agem))
crosstab <- with(data, table(agem, y))
ytab2 <- prop.table(crosstab, margin = 2)
df.age <- data.frame(round(rbind(ytab1, t(ytab2)), 3))
rownames(df.age) <- c("Total", "y = 0", "y = 1")
colnames(df.age) <- c("0~19", "20~59", "60~79", "80+")
df.age
Patient Medical History
As for labp, proc, and nmed, each shares a similar distribution in mean and spread across the two levels of readmission status. For hosp and diag, the spread of the length of stay in the hospital is larger and the average number of diagnoses is higher for patients readmitted within 30 days.
# names(data)
# df <- subset(data, select = c(race, gender, y))
# knitr::kable(df)
# continuous data
plot1 <- ggplot(data, aes(x = y, y = hosp, color = y)) + geom_boxplot() + theme(legend.position = "none") +
labs(y = "length of stay (days)", x = "readmitted within < 30 days")
plot2 <- ggplot(data, aes(x = y, y = labp, color = y)) + geom_boxplot() + theme(legend.position = "none") +
labs(y = "No. of lab procedures", x = "readmitted within < 30 days")
plot3 <- ggplot(data, aes(x = y, y = proc, color = y)) + geom_boxplot() + theme(legend.position = "none") +
labs(y = "No. of non-lab procedures", x = "readmitted within < 30 days")
plot4 <- ggplot(data, aes(x = y, y = nmed, color = y)) + geom_boxplot() + theme(legend.position = "none") +
labs(y = "No. of distinct medications", x = "readmitted within < 30 days")
plot8 <- ggplot(data, aes(x = y, y = diag, color = y)) + geom_boxplot() + theme(legend.position = "none") +
labs(y = "No. of diagnosis", x = "readmitted within < 30 days")
grid.arrange(plot1, plot2, plot3, plot4, plot8, nrow = 2)
Patient Medical History
For nout, ninp, and emer, the cells contain many zeros, so I compare the distributions at the tails. The figure shows that patients readmitted within 30 days tend to have lower numbers of outpatient and emergency visits in the previous year.
| \(\;\) | No. outpatient | No. emergency | No. inpatient |
|---|---|---|---|
| %count \(= 0\) | 83.4 | 88.8 | 66.5 |
| %count \(\ne 0\) | 16.6 | 11.2 | 33.5 |
plot5 <- ggplot(data, aes(x = y, y = nout, color = y)) + geom_boxplot() + theme(legend.position = "none") +
labs(y = "No. of outpatient", x = "readmitted within < 30 days")
plot6 <- ggplot(data, aes(x = y, y = emer, color = y)) + geom_boxplot() + theme(legend.position = "none") +
labs(y = "No. of emergency", x = "readmitted within < 30 days")
plot7 <- ggplot(data, aes(x = y, y = ninp, color = y)) + geom_boxplot() + theme(legend.position = "none") +
labs(y = "No. of inpatient", x = "readmitted within < 30 days")
grid.arrange(plot5, plot6, plot7, nrow = 1)
Clinical Results
The correlation between mglu and A1Cr is -0.046, which is negligible.
The diagnosis of diabetes is based on the patients’ blood glucose level, i.e. mglu. And the A1c test (A1Cr) is an important measure of how well a person with diabetes is controlling their blood glucose level. The American Diabetes Association recommends a goal of less than 7.0% A1c.
subdata <- subset(dataset, select = c(mglu, A1Cr))
subdata <- as.matrix(subdata)
subdata[which(subdata == "None")] <- 0
subdata[which(subdata == "Norm")] <- 1
subdata[which(subdata == ">7")] <- 2
subdata[which(subdata == ">8")] <- 3
subdata[which(subdata == ">200")] <- 2
subdata[which(subdata == ">300")] <- 3
subdata <- matrix(as.numeric(subdata), ncol = 2)
subdata <- data.frame(subdata)
names(subdata) <- c("mglu", "A1Cr")
cor <- round(cor(subdata), 3)
Medication Details
To briefly examine the relationships among the medicines, I recode the four levels of the dosage (“No” = 0, “Down” = 1, “Steady” = 2, “Up” = 3). I find that the correlations among metf, glim, glip, glyb, piog, rosi, and insu are negligible.
# factor
ytab1 <- prop.table(table(data$chge))
crosstab <- with(data, table(chge, y))
ytab2 <- prop.table(crosstab, margin = 2)
df <- data.frame(round(rbind(ytab1, t(ytab2)), 3))
rownames(df) <- c("Total","y = 0", "y = 1")
names(df) <- c("Change diabetes medication", "No change")
df
# relationship between medication details
nomed <- sum(with(dataset, metf == "No" & glim == "No" & glip == "No" & glyb == "No" & piog == "No" & rosi == "No" & insu == "No"))
subdata <- subset(dataset, select = c(metf, glim, glip, glyb, piog, rosi, insu))
subdata <- as.matrix(subdata)
subdata[which(subdata == "No")] <- 0
subdata[which(subdata == "Down")] <- 1
subdata[which(subdata == "Steady")] <- 2
subdata[which(subdata == "Up")] <- 3
subdata <- matrix(as.numeric(subdata), ncol = 7)
subdata <- data.frame(subdata)
names(subdata) <- c("metf", "glim", "glip", "glyb", "piog", "rosi", "insu")
data.frame(round(cor(subdata), 3))
Admission and discharge details
# factor
ytab1 <- prop.table(table(data$disc))
crosstab <- with(data, table(disc, y))
ytab2 <- prop.table(crosstab, margin = 2)
df.disc <- data.frame(round(rbind(ytab1, t(ytab2)), 3))
rownames(df.disc) <- c("Total","y = 0", "y = 1")
df.disc
ytab1 <- prop.table(table(data$adms))
crosstab <- with(data, table(adms, y))
ytab2 <- prop.table(crosstab, margin = 2)
df.adms <- data.frame(round(rbind(ytab1, t(ytab2)), 3))
rownames(df.adms) <- c("Total","y = 0", "y = 1")
df.adms
ytab1 <- prop.table(table(data$admt))
crosstab <- with(data, table(admt, y))
ytab2 <- prop.table(crosstab, margin = 2)
df.admt <- data.frame(round(rbind(ytab1, t(ytab2)), 3))
rownames(df.admt) <- c("Total","y = 0", "y = 1")
df.admt
Goal
My aim in this project is to help avoid patients being readmitted within 30 days of discharge, which would reduce costs for the hospital and improve outcomes for patients.
I use a classification model to find the important predictors of readmission within 30 days for a patient with diabetes. I split the data into training (70%) and testing (30%) sets. The procedure for building the model is as follows (a compact sketch is given after the steps):
Step 1) Obtain a preliminary model using stepwise and lasso regression techniques.
Step 2) Use a GLM and refine the model based on p-value criteria.
Step 3) Examine model performance based on the unweighted and weighted misclassification rates with the training and testing datasets.
Step 4) Build the final model.
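The sketch below compresses steps 1-3; the predictor names hosp, ninp, and disc and the value alpha = 0.5 follow the choices described later, but the snippet is illustrative rather than the exact run reported in this report.
library(glmnet)
# Step 1: preliminary predictor set from a penalized logistic fit
XX <- model.matrix(y ~ ., training)[, -1]
cvfit <- cv.glmnet(XX, training$y, family = "binomial", alpha = 0.5, type.measure = "class")
coef(cvfit, s = "lambda.min")            # inspect which coefficients are non-zero
# Step 2: refit the retained predictors as an ordinary GLM and check p-values
fit <- glm(y ~ hosp + ninp + disc, data = training, family = binomial)
summary(fit)
# Step 3: unweighted and weighted (a01/a10 = 0.1) misclassification rates on the test set
prob <- predict(fit, newdata = testing, type = "response")
pred <- ifelse(prob > 0.5, "1", "0")
mce <- mean(pred != testing$y)           # unweighted
predB <- ifelse(prob > 0.1/1.1, "1", "0")                 # Bayes rule cut-off
wmce <- mean(1.0 * (testing$y == "1" & predB == "0") +    # missed readmission
             0.1 * (testing$y == "0" & predB == "1"))     # false alarm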
Final Model
Predictors: hosp, ninp, and disc.
\[\log\frac{P(Y = 1\mid X)}{P(Y = 0\mid X)} = -2.241 + 0.0007\,\text{hosp} + 0.207\,\text{ninp} + 0.00831\,\text{disc [Discharged/Transferred to SNF]}\]
Secondary Model
Model 1: hosp, ninp, and disc.
Model 1.1: hosp, ninp, diag, and disc.
Preliminary Model
Model 2: hosp, emer, ninp, diag, insu, diab, disc, agem, mod1, and mod3
Model 3: hosp, proc, nmed, emer, ninp, diag, A1Cr, metf, diab, adms, disc, agem, mod1, mod2, and mod3
I conduct backward selection manually: starting from the full model, I remove the predictors that are not significant one by one, which leaves the 15 variables of the preliminary model (Model 3). Variables are dropped in the following order: chge, glyb, piog, labp, rosi, mglu, gend, nout, glim, race, admt, insu, and glip.
data <- select(data, -c(pers))
fitGLM <- glm(y ~. , finaldata, family = "binomial")
fitGLMtest <- glm(y ~. , data, family = binomial(logit))
fitGLM1 <- update(fitGLM, .~. - chge)
fitGLM2 <- update(fitGLM1, .~. - glyb)
fitGLM3 <- update(fitGLM2, .~. - piog)
fitGLM4 <- update(fitGLM3, .~. - labp)
fitGLM5 <- update(fitGLM4, .~. - rosi)
fitGLM6 <- update(fitGLM5, .~. - mglu)
fitGLM7 <- update(fitGLM6, .~. - gend)
fitGLM8 <- update(fitGLM7, .~. - nout)
fitGLM9 <- update(fitGLM8, .~. - glim)
fitGLM10 <- update(fitGLM9, .~. - race)
fitGLM11 <- update(fitGLM10, .~. - admt)
fitGLM12 <- update(fitGLM11, .~. - insu)
fitGLM13 <- update(fitGLM12, .~. - glip)
AnovaModel14 <- Anova(fitGLM13)
dataModel3 <- subset(data, select = -c(chge, glyb, piog, labp, rosi, mglu, gend, nout, glim, race, admt, insu, glip))
Model3 <- glm(y ~. , dataModel3, family = "binomial")
I test \(\alpha\) between (0, 1) and record the model for which \(\lambda\) achieves the smallest error: hosp, ninp, and disc (Model 1). The best model performance occurs as \(\alpha\) is close to 0.5 and \(\lambda\) = 0.02580. Because of computational limitations, I only run cv.glmnet() once for each level of \(\alpha\) with the whole dataset.
Moreover, I also use the Elastic Net to select another preliminary set of predictors. Based on the cross-validation error, I first choose 11 variables: hosp, nmed, emer, ninp, diag, insu, diab, disc, agem, mod1, and mod3. Some of these variables have p-values greater than 0.05, so I exclude the non-significant predictors one by one. This gives a preliminary model with 10 predictors: hosp, emer, ninp, diag, insu, diab, disc, agem, mod1, and mod3 (Model 2); 9 out of these 10 predictors are also included in Model 3. The Elastic Net and backward selection give similar results.
From the graph below, it is clear that the misclassification error cannot be further improved even if I add more predictors to the model. In this case, it is better to keep a small model than a larger one that only adds redundant information about the readmission status.
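One way to check this parsimony claim directly is to compare the small and slightly larger nested models; a minimal sketch assuming Model1 (hosp, ninp, disc) and Model1.1 (hosp, ninp, diag, disc) as fitted in the chunks below.
# Likelihood-ratio test and AIC for the nested logistic models:
# a negligible improvement from adding diag supports keeping the smaller model
anova(Model1, Model1.1, test = "Chisq")
AIC(Model1, Model1.1)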
setModel <- function(lambda, model){
coefSelect <- coef(model, s = lambda)
coefSelect <- coefSelect[which(coefSelect != 0), ]
list <- list(vars = names(coefSelect)[-1], coefficients = coefSelect[-1])
return(list)
}
lambdaBest <- lambdaCV$lambda.min #
bestModel <- glmnet(XX, Y, family = "binomial", alpha = 0.5, lambda = lambdaBest)
dataModel1 <- select(data, c(hosp, ninp, disc, y))
Model1 <- glm(y ~. , family = "binomial", dataModel1)
lambdaSelect <- lambdaCV$lambda[which(lambdaCV$nzero == 19)]
tempModel <- glmnet(XX, Y, family = "binomial", alpha = 0.5, lambda = lambdaSelect)
list_preModel <- setModel(lambdaSelect, tempModel)
tempdata <- select(data, c(hosp, nmed, emer, ninp, diag, insu, diab, disc, agem, mod1, mod3, y))
fitEN1 <- glm(y ~. , family = "binomial", tempdata)
fitEN2 <- update(fitEN1, .~. - nmed)
dataModel2 <- select(tempdata, -c(nmed))
Model2 <- glm(y ~. , family = "binomial", dataModel2)
# Use the set of predictors selected with the backward method to rerun the Elastic Net model
N <- 30
fold <- 10
level <- seq(0, 1, by = 0.1)
XX1.1 <- model.matrix(y ~., dataModel3)
cvm1.1 = array(dim = c(length(level), N))
lambda1.1 = array(dim = c(length(level), N))
coefnum1.1 = array(dim = c(length(level), N))
for (i in level){
for (j in 1:N){
lambdaCV1.1 <- cv.glmnet(XX1.1, Y, family = "binomial", alpha = i, nfolds = fold, type.measure = "class")
mincvm <- lambdaCV1.1$cvm[which(lambdaCV1.1$lambda == lambdaCV1.1$lambda.min)]
lambdaSelect <- lambdaCV1.1$lambda.min
cvm1.1[which(level == i), j] <- mincvm
lambda1.1[which(level == i), j] <- lambdaSelect
coefnum1.1[which(level == i), j] <- sum(coef(lambdaCV1.1, s = "lambda.min") != 0)
}
}
df_par1.1 <- data.frame(cvm = as.vector(cvm1.1) , lambda = as.vector(lambda1.1), alpha = rep(level, N), coef_num = as.vector(coefnum1.1))
Plot1 <- ggplot(df_par1.1, aes(as.factor(alpha), cvm)) + geom_boxplot() + labs(x = "alpha", y = "CV error")
Plot2 <- ggplot(df_par1.1, aes(as.factor(alpha), coef_num)) + geom_boxplot() + labs(x = "alpha", y = "# coefficient")
Plot3 <- ggplot(df_par1.1, aes(as.factor(alpha), lambda)) + geom_boxplot() + labs(x = "alpha")
grid.arrange(Plot1, Plot2, Plot3, ncol = 1)
XX1.1 <- model.matrix(y ~., dataModel3)
fitLasso <- cv.glmnet(XX1.1, Y, alpha = 1, family = "binomial", type.measure = "class")
lambdaSelect <- fitLasso$lambda[which(fitLasso$nzero == 5)][1]
tempModel <- glmnet(XX1.1, Y, family = "binomial", alpha = 1, lambda = lambdaSelect)
list_preModel <- setModel(lambdaSelect, tempModel)
dataModel1.1 <- select(data, c(hosp, ninp, diag, disc, y))
Model1.1 <- glm(y ~. , family = "binomial", dataModel1.1)
The coefficients for Model 1, Model EN, and Model 1.1:
selectPredictors <- names(Model1$coefficients)[-1]
selectcoef <- as.numeric(Model1$coefficients)[-1]
data.frame(Model1 = selectPredictors, Coefficient = round(selectcoef, 3))
selectPredictors <- names(Model1$coefficients)[-1]
selectcoef <- c(0.0006855987, 0.2072881252, 0, 0.0083088229, 0) # intercept = -2.2406392094
data.frame(Model.EN = selectPredictors, Coefficient = round(selectcoef, 3))
selectPredictors <- names(Model1.1$coefficients)[-1]
selectcoef <- as.numeric(Model1.1$coefficients)[-1]
data.frame(Model1.1 = selectPredictors, Coefficient = round(selectcoef, 3))
Misclassification Rate
Three rules to compare the misclassification rate on both the training and testing datasets among Model 1, Model EN, and Model 1.1:
Unweighted misclassification rate using ROC and AUC
Unweighted misclassification rate using the positive predictive value and negative predictive value
Bayes’ rule with the risk ratio (a01/a10) equal to 0.1 (a sketch of this weighted rate is given below)
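A minimal sketch of the weighted misclassification rate behind the third rule; the function name weightedMCE is just for illustration and mirrors the computation used later in this report.
# A false negative (missing a 30-day readmission) costs a10 = 1;
# a false positive costs a01 = 0.1; classification uses the Bayes cut-off a01/(1 + a01)
weightedMCE <- function(prob, y, a01 = 0.1, a10 = 1) {
  pred <- ifelse(prob > a01 / (1 + a01), "1", "0")
  mean(a10 * (y == "1" & pred == "0") + a01 * (y == "0" & pred == "1"))
}
# e.g. weightedMCE(testPred1, testing$y)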
Unweighted Misclassification Rate
The unweighted misclassification rate is not an ideal choice for model evaluation. Model EN achieves the lowest misclassification rate, and the performance is similar between Model 1 and Model 1.1. The misclassification rate stays small once the threshold is set larger than 0.4 for all models.
setThres <- function(thres, Model, y){
thresPred <- rep("0", length(Model))
thresPred[Model > thres] <- "1"
thresPred <- as.factor(thresPred)
confMat <- table(thresPred, y)
# MCE
MCE <- (confMat[1,2] + confMat[2,1])/sum(confMat)
# P(T+|D+) & P(T+|D-)
thresFNeg <- confMat[1,2]/(confMat[1, 2] + confMat[2, 2])
thresFPos <- confMat[2,1]/(confMat[1, 1] + confMat[2, 1])
# P(D+|T+) & P(D-|T-)
thresPPV <- confMat[2,2]/(confMat[2, 1] + confMat[2, 2])
thresNPV <- confMat[1,1]/(confMat[1, 1] + confMat[1, 2])
eva <- data.frame(MCE = MCE, "Positive Prediction" = thresPPV, "Negative Prediction" = thresNPV, "False Positive" = thresFPos, "False Negative" = thresFNeg)
return(eva)
}
# training
XX1 <- model.matrix(y ~. , dataModel1)[, -1]
XX1.1 <- model.matrix(y ~. , dataModel1.1)[, -1]
# testing
XXtest <- model.matrix(y ~. , testing)[, -1]
testdata1 = testing[, c(names(dataModel1))]
testdata1.1 = testing[, c(names(dataModel1.1))]
XXt1 <- model.matrix(y ~. , testdata1)[, -1]
XXt1.1 <- model.matrix(y ~. , testdata1.1)[, -1]
# mis-classification rate include cv result
# Model 1 EN
trainLasso <- predict(bestModel, XX, type = "class")
mceLasso <- mean(y != trainLasso) # 0.1111657
testLasso <- predict(bestModel, XXtest, type = "class")
mceLasso.test <- mean(testing$y != testLasso) # 0.1118572
# training
trainPred1 <- predict(Model1, dataModel1, type = "response")
trainPred1.1 <- predict(Model1.1, dataModel1.1, type = "response")
trainLasso.prob <- as.numeric(predict(bestModel, XX, s = lambdaBest, type = "response"))
# testing
testPred1 <- predict(Model1, testdata1, type = "response")
testPred1.1 <- predict(Model1.1, testdata1.1, type = "response")
testLasso.prob <- as.numeric(predict(bestModel, XXtest, s = lambdaBest, type = "response"))
# threshold
thres <- seq(0.2, 0.8, by = 0.05)
eva1 <- NULL
eva1.1 <- NULL
eva.en <- NULL
for (i in thres){
eva1 <- c(eva1, setThres(i, trainPred1, y)$MCE)
eva1.1 <- c(eva1.1, setThres(i, trainPred1.1, y)$MCE)
eva.en <- c(eva.en, setThres(i, trainLasso.prob, y)$MCE)
}
trainMCE <- data.frame(Model1 = eva1, Model1.1 = eva1.1, Model1.EN = eva.en, thres = thres)
# testing
thres <- seq(0.2, 0.8, by = 0.05)
eva1 <- NULL
eva1.1 <- NULL
eva.en <- NULL
for (i in thres){
eva1 <- c(eva1, setThres(i, testPred1, testing$y)$MCE)
eva1.1 <- c(eva1.1, setThres(i, testPred1.1, testing$y)$MCE)
eva.en <- c(eva.en, setThres(i, testLasso.prob, testing$y)$MCE)
}
testMCE <- data.frame(Model1 = eva1, Model1.1 = eva1.1, Model1.EN = eva.en, "thres(testing)" = thres)
ggplot(testMCE, aes(thres.testing.)) +
geom_line(aes(y = Model1, color = "red"), lwd = 1) +
geom_line(aes(y = Model1.1, color = "blue"), lwd = 1.5, lty = 2) +
geom_line(aes(y = Model1.EN, color = "green"), lwd = 1.5, lty = 3) +
labs(x = "Threshold", y = "Unweighted MCE") +
scale_color_brewer(palette = "Set2", name = "Model", labels = c("Model 1", "Model EN", "Model 1.1"))
Receiver Operating Characteristic Curve
The ROC curve reports the relationship between Sensitivity (true positive rate) and Specificity (true negative rate): as Sensitivity goes up, Specificity goes down.
The curves are nearly overlapping and the AUC is nearly the same. Although Model 1.1 performs slightly better, ROC and AUC are not good techniques for model evaluation in this case.
# training
trainROC1 <- roc(y, trainPred1, col = "blue")
trainROC1.1 <- roc(y, trainPred1.1, col = "blue")
trainROC1.lasso <- roc(y, trainLasso.prob, col = "blue")
# testing
testROC1 <- roc(testing$y, testPred1, col = "blue")
testROC1.1 <- roc(testing$y, testPred1.1, col = "blue")
testROC1.lasso <- roc(testing$y, testLasso.prob, col = "blue")
fitROC1 <- trainROC1
fitROC2 <- trainROC1.lasso
trainROC <- data.frame(Sensitivity = c(fitROC1$sensitivities, fitROC2$sensitivities), Specificity = c(fitROC1$specificities, fitROC2$specificities), index = c(rep(1, length(fitROC1$sensitivities)), rep(2, length(fitROC2$sensitivities))))
LeftPlot <- ggplot(trainROC, aes(x = Specificity, y = Sensitivity)) +
geom_line(aes(color = factor(index), lty = factor(index)), size = 2) +
labs(title = "ROC", x = "Specificity", y = "Sensitivity") +
scale_color_brewer(palette = "Set1", name = "Model", labels = c("Model1", "Model Lasso")) +
scale_linetype_discrete(guide = F) +
scale_x_reverse() +
geom_abline(intercept = 1, slope = 1, col = "white")
fitROC1 <- testROC1
fitROC2 <- testROC1.lasso
testROC <- data.frame(Sensitivity = c(fitROC1$sensitivities, fitROC2$sensitivities), Specificity = c(fitROC1$specificities, fitROC2$specificities), index = c(rep(1, length(fitROC1$sensitivities)), rep(2, length(fitROC2$sensitivities))))
RightPlot <- ggplot(testROC, aes(x = Specificity, y = Sensitivity)) +
geom_line(aes(color = factor(index), lty = factor(index)), size = 2) +
labs(title = "ROC: Testing", x = "Specificity", y = "Sensitivity") +
scale_color_brewer(palette = "Set2", name = "Model", labels = c("Model1", "Model Lasso")) +
scale_linetype_discrete(guide = F) +
scale_x_reverse() +
geom_abline(intercept = 1, slope = 1, col = "white")
grid.arrange(LeftPlot, RightPlot, nrow = 1)
data.frame(Model1 = auc(trainROC1), Model1.1 = auc(trainROC1.1), Model.Lasso = auc(trainROC1.lasso), Model1.test = auc(testROC1), Model1.1.test = auc(testROC1.1), Model.Lasso.test = auc(testROC1.lasso))
Positive Predictive Value and Negative Predictive Value
I estimate the Positive Predictive Value (PPV) and Negative Predictive Value (NPV) for Model 1, Model EN, and Model 1.1 at different thresholds.
In the graph below, we would like to choose a threshold that makes the PPV as large as possible without a severe trade-off in NPV. The NPV for all models remains relatively steady as the threshold changes. As for the PPV, Model EN performs better than the other two models. Based on the testing results, I choose the threshold to be 0.7; at that threshold, the PPV and NPV of Model EN on the testing set are 0.5 and 0.89, respectively.
Usually we care more about the PPV and would like to make the false discovery rate (1 - PPV) as small as possible. The reason is that if we get a positive prediction, we will act on it as prevention, whereas if we get a negative prediction, we probably won't double-check the case afterwards. However, this does not mean the cost of a false negative prediction is smaller, especially in our case.
# PPV & NPV
# data.frame(MCE = MCE, "Positive Prediction" = thresPPV, "Negative Prediction" = thresNPV, "False Positive" = thresFPos, "False Negative" = thresFNeg)
# training
thres <- seq(0.2, 0.8, by = 0.05)
eva1 <- NULL
eva1.1 <- NULL
eva.en <- NULL
for (i in thres){
df1 <- setThres(i, trainPred1, y)
df1.1 <- setThres(i, trainPred1.1, y)
dfen <- setThres(i, trainLasso.prob, y)
eva1 <- rbind(eva1, c(df1$"Positive.Prediction", df1$"Negative.Prediction"))
eva1.1 <- rbind(eva1.1, c(df1.1$"Positive.Prediction", df1.1$"Negative.Prediction"))
eva.en <- rbind(eva.en, c(dfen$"Positive.Prediction", dfen$"Negative.Prediction"))
}
trainPV <- data.frame("Positive Prediction" = c(eva1[, 1], eva1.1[, 1], eva.en[, 1]), "Negative Prediction" = c(eva1[, 2], eva1.1[, 2], eva.en[, 2]), thres = thres)
trainPV <- melt(trainPV, id.vars = "thres")
index <- c(rep("PPV1", length(eva1[, 1])), rep("PPV1.1", length(eva1[, 1])), rep("PPVen", length(eva.en[, 1])), rep("NPV1", length(eva1[, 1])), rep("NPV1.1", length(eva1[, 1])), rep("NPVen", length(eva.en[, 1])))
trainPV <- data.frame(trainPV, index)
# testing
thres <- seq(0.2, 0.8, by = 0.05)
eva1 <- NULL
eva1.1 <- NULL
eva.en <- NULL
for (i in thres){
df1 <- setThres(i, testPred1, testing$y)
df1.1 <- setThres(i, testPred1.1, testing$y)
dfen <- setThres(i, testLasso.prob, testing$y)
eva1 <- rbind(eva1, c(df1$"Positive.Prediction", df1$"Negative.Prediction"))
eva1.1 <- rbind(eva1.1, c(df1.1$"Positive.Prediction", df1.1$"Negative.Prediction"))
eva.en <- rbind(eva.en, c(dfen$"Positive.Prediction", dfen$"Negative.Prediction"))
}
testPV <- data.frame("Positive Prediction" = c(eva1[, 1], eva1.1[, 1], eva.en[, 1]), "Negative Prediction" = c(eva1[, 2], eva1.1[, 2], eva.en[, 2]), thres = thres)
testPV <- melt(testPV, id.vars = "thres")
index <- c(rep("PPV1", length(eva1[, 1])), rep("PPV1.1", length(eva1[, 1])), rep("PPVen", length(eva.en[, 1])), rep("NPV1", length(eva1[, 1])), rep("NPV1.1", length(eva1[, 1])), rep("NPVen", length(eva.en[, 1])))
testPV <- data.frame(testPV, index)
Bayes’ Rule
To take the loss into consideration, I suggest using Bayes’ rule as the main criterion for model evaluation.
The loss is likely to be higher when a readmission occurs within 30 days even though we expected it not to occur soon; in other words, the loss for a false negative prediction should be higher. I assume that a01/a10 \(\star\) is 0.1, because the follow-up treatment costs more than prevention beforehand.
As I set the risk ratio at a01/a10 = 0.1, the Bayes classifier predicts a readmission whenever
\[P(Y = 1|X) \ge \frac{a_{01}/a_{10}}{1 + a_{01}/a_{10}} = \frac{0.1}{1.1} \approx 0.091,\]
which, since the odds \(P(Y = 1|X)/P(Y = 0|X)\) equal 0.1 at this threshold, is equivalent to
\[ \log\frac{P(Y = 1|X)}{P(Y = 0|X)} \ge \log(0.1) \approx -2.303.\]
Thus, the linear boundary for the Bayes classifier with the risk ratio at a01/a10 = 0.1 is as follows.
# Model1
a01 <- 0.1
p <- a01/(a01 + 1)
oddsBayes <- log(p/(1 - p)) # for logit
coef1 <- as.numeric(Model1$coefficients)
coef1[1] <- coef1[1] - oddsBayes
coefBayes1 <- NULL
for (i in 2:length(coef1)){
temp <- -coef1/coef1[i]
coefBayes1 <- rbind(coefBayes1, temp)
}
coefBayes1 <- round(coefBayes1, 3)
colnames(coefBayes1) <- c("intercept", "hosp", "ninp", "Disc to home w Health Service", "Disc to SNF", "Other")
rownames(coefBayes1) <- c("hosp", "ninp", "Disc to home w Health Service", "Disc to SNF", "Other")
coefBayes1[coefBayes1 == -1] <- ""
# Model1.1
coef1.1 <- as.numeric(Model1.1$coefficients)
coef1.1[1] <- coef1.1[1] - oddsBayes
coefBayes1.1 <- NULL
for (i in 2:length(coef1.1)){
temp <- -coef1.1/coef1.1[i]
coefBayes1.1 <- rbind(coefBayes1.1, temp)
}
coefBayes1.1 <- round(coefBayes1.1, 3)
colnames(coefBayes1.1) <- c("intercept", "hosp", "ninp", "diag", "Disc to home w Health Service", "Disc to SNF", "Other")
rownames(coefBayes1.1) <- c("hosp", "ninp", "diag", "Disc to home w Health Service", "Disc to SNF", "Other")
coefBayes1.1[coefBayes1.1 == -1] <- ""
# Model EN
a01 <- 0.1
p <- a01/(a01 + 1)
oddsBayes <- log(p/(1 - p)) # for logit
coefen <- c(-2.2406392094, 0.0006855987, 0.2072881252, 0.0083088229)
coefen[1] <- coefen[1] - oddsBayes
coefBayes.en <- NULL
for (i in 2:length(coefen)){
temp <- -coefen/coefen[i]
coefBayes.en <- rbind(coefBayes.en, temp)
}
coefBayes.en <- round(coefBayes.en, 3)
colnames(coefBayes.en) <- c("intercept", "hosp", "ninp", "Disc to SNF")
rownames(coefBayes.en) <- c("hosp", "ninp", "Disc to SNF")
coefBayes.en[coefBayes.en == -1] <- ""
# output
data.frame(coefBayes1)
data.frame(coefBayes1.1)
data.frame(coefBayes.en)
The weighted misclassification rate becomes smaller for all models on the training set. As for the testing misclassification error, the performance is very good for Model EN and worse for Model 1 and Model 1.1. Thus, I suggest using Model EN as the final prediction model.
# training
a01 <- 0.1
a10 <- 1
predBayes1 <- rep("0", length(Model1$fitted))
predBayes1[Model1$fitted > a01/(1 + a01)] <- "1"
BayesMCE1 <- mean(a10*(y != predBayes1)*(y == "1") + a01*(y != predBayes1)*(y == "0"))
predBayes1.1 <- rep("0", length(Model1.1$fitted))
predBayes1.1[Model1.1$fitted > a01/(1 + a01)] <- "1"
BayesMCE1.1 <- mean(a10*(y != predBayes1.1)*(y == "1") + a01*(y != predBayes1.1)*(y == "0"))
predBayes.EN <- rep("0", length(trainLasso.prob))
predBayes.EN[trainLasso.prob > a01/(1 + a01)] <- "1"
BayesMCE.EN <- mean(a10*(y != predBayes.EN)*(y == "1") + a01*(y != predBayes.EN)*(y == "0"))
# testing
testBayes1 <- rep("0", length(testPred1))
testBayes1[testPred1 > a01/(1 + a01)] <- "1"
testBayesMCE1 <- mean(a10*(testing$y != testBayes1)*(testing$y == "1") + a01*(testing$y != testBayes1)*(testing$y == "0"))
testBayes1.1 <- rep("0", length(testPred1.1))
testBayes1.1[testPred1.1 > a01/(1 + a01)] <- "1"
testBayesMCE1.1 <- mean(a10*(testing$y != testBayes1.1)*(testing$y == "1") + a01*(testing$y != testBayes1.1)*(testing$y == "0"))
testBayes.EN <- rep("0", length(testLasso.prob))
testBayes.EN[testLasso.prob > a01/(1 + a01)] <- "1"
testBayesMCE.EN <- mean(a10*(testing$y != testBayes.EN)*(testing$y == "1") + a01*(testing$y != testBayes.EN)*(testing$y == "0"))
# table
data.frame(Model1 = BayesMCE1, Model1.1 = BayesMCE1.1, ModelEN = BayesMCE.EN, Model1.test = testBayesMCE1, Model1.1.test = testBayesMCE1.1, ModelEN.test = testBayesMCE.EN)
\(\star\) a10 stands for the loss of giving a negative prediction (y = 0) when the outcome is positive (y = 1).
To predict the readmission status of patients with diabetes, the length of stay in the hospital, the number of inpatient visits in the previous year, and where the patient was discharged to after treatment are the most important predictors. Clinical results and medication details may help physicians with diagnosis, but they provide largely redundant information for prediction once the health service records are included. Based on the information provided by these three variables, the misclassification rate of my model's predictions is about 9%.
data.frame(cbind(names(rawdata)[-1], names(dataset)))