# Section 1: Optimization Methods
The Adam algorithm (Adaptive Moment Estimation) combines momentum with adaptive learning rates to search for the minimum of an optimization problem. Momentum refers to the accumulated direction in which the algorithm has been moving, and the adaptive learning rate means that each “step” the algorithm takes can be a different size. Adam maintains exponentially decaying weighted averages of past gradients and past squared gradients, so the most recent steps have the greatest influence on the next step while information from earlier movements is still retained. This helps Adam avoid converging on local minima and makes it considerably more efficient than simple stochastic gradient descent: momentum (the running average of previous gradients) lets the algorithm push through local minima, while the adaptive step size lets it take larger steps in flat regions and smaller steps where there is significant curvature. This approach is well suited to non-convex problems because it can escape local minima and is more likely to find the global minimum. Furthermore, the researcher implementing the algorithm can set a maximum step size to keep closer control over the progression of the optimization. The method still has limitations, however; for example, Adam is gradient-based, so it cannot be applied where the first derivative of the objective does not exist.
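For reference, here is a sketch of the two update rules (following the standard Adam formulation, with learning rate $\alpha$, decay rates $\beta_1$ and $\beta_2$, and a small constant $\epsilon$ for numerical stability). Vanilla gradient descent updates

$$x_t = x_{t-1} - \alpha \nabla f(x_{t-1}),$$

while Adam keeps decaying averages of the gradient and squared gradient, corrects their initialization bias, and scales the step by the square root of the second moment:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(x_{t-1}), \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(x_{t-1})^2,$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad x_t = x_{t-1} - \frac{\alpha\,\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}.$$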
Below, I have taken the code presented in class that performs a simple, vanilla gradient descent. As you can see, the function easily gets stuck at a local minimum. Please note that for each of the following graphs, the chosen starting value was 4, and I have shown only the final frame, which represents the minimum that the algorithm converged to. If desired, there is also a ggplotly call in the code appendix that shows the animation, including the rate at which each minimum is found.
As Plot 1 shows, the vanilla gradient descent is unable to find the global minimum. It quickly finds the local minimum, but because it optimizes only on the gradient at the current position and carries no momentum, it stays in that minimum once the direction of the gradient changes.
The Adam function, by contrast, is usually able to overcome these local minima. Because it incorporates both the current gradient and a history of past gradients, the algorithm keeps a more global perspective and can use its momentum to move out of a local minimum. However, as the final case shows, it has its limitations.
I found an edge case below that seems to stump the Adam algorithm. Because of the significantly larger hump between the local and global minimum, the algorithm loses its momentum quickly and is unable to get over the hump. This was true for every learning rate I tried, from 0.5 all the way up to 5000. Furthermore, for the first non-convex function, when the selected starting point was 1, the Adam function still failed to find the global minimum and converged on the local minimum instead. This makes intuitive sense to me: the initial gradient points toward the local minimum, so the algorithm never builds the momentum needed to travel back past its starting point and reach the global minimum.
Initially, I expected the Adam algorithm to perform better. In fact, when both models (the vanilla gradient descent and Adam) were applied to a multiple linear regression problem, they gave different results (shown in the table columns labeled GD and ADAM). However, when the maximum number of iterations was increased from 1000 to 10000, the coefficients converged to the same values (the increased-iteration run is labeled ADAM2 in the table). This suggests that rather than actually finding different results, Adam was less efficient in this case and converged more slowly than the gradient descent. I was curious whether this would be true for all multiple linear regression problems, as the model I created was relatively simple. However, when I complicated the formula, GD and ADAM found different results (shown in Table 2): although it took a significant number of iterations, ADAM arrived at a different set of coefficients. My first thought was that the vanilla gradient descent had gotten stuck in a local minimum created by the polynomial terms added to the formula; however, the least-squares objective remains convex in the coefficients regardless of how the data were generated, so the discrepancy more plausibly reflects one of the algorithms stopping short of full convergence under the tolerance criterion.
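Both implementations minimize the mean squared error of the linear model, sharing the gradient computed by the grad() helper in the code appendix:

$$\nabla_\beta \left[ \frac{1}{2n} \lVert X\beta - y \rVert^2 \right] = \frac{1}{n} X^\top (X\beta - y).$$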
I also investigated how the algorithms respond to different learning rates. The learning rate scales the size of each gradient update, so a larger learning rate allows larger steps and faster convergence toward the minimum. The risk is that if the learning rate is too large, the algorithm may step over the minimum or bounce back and forth around it. On the other hand, a learning rate that is too small means the algorithm takes a long time to reach the minimum because of its smaller steps. Both effects are visible in Table 3: as the learning rate decreases, both algorithms require more iterations to reach the minimum.
# Section 2: Predicting Hospital Readmissions

The next part of the assignment was to develop two models (a logistic regression model and a classification tree) to predict whether heart failure patients would be readmitted to hospital within 90 days. The original paper that performed this analysis used a multivariable multilevel Poisson regression analysis and determined that the only significant predictor of readmission was enrollment in a heart-failure-specific care pathway with the patient's primary care physician.
Table 4 summarizes the pre-processed data. To summarize the preprocessing: we converted the 90-day readmission variable into a binary endpoint and removed cases where the patient died within 90 days of discharge; we excluded binary variables present in fewer than 3% of cases; and we removed variables from the original data set that were not relevant to our analysis, renaming the remainder for clarity. After this pre-processing, we identified two groups based on the outcome of interest (readmission within 90 days), with 393 readmitted and 1291 not readmitted. These groups form the study cohort for the decision tree and logistic regression models, which can potentially use up to 32 variables as covariates. The table displays the rates of the various variables in the two groups, with p-values < 0.05 bolded (though with this many tests we acknowledge the risk of coincidental findings). Next, we split the data into training and testing sets, ensuring similar proportions of readmitted and non-readmitted cases in both sets; a sketch of this stratified split follows Table 4.
| Variable | Not Readmitted (0) | Readmitted (1) | p | test |
|---|---|---|---|---|
| n | 1291 | 393 | | |
| gender_pt = Male (%) | 577 (44.7) | 172 (43.8) | 0.790 | |
| age_pt (mean (SD)) | 81.22 (8.24) | 82.42 (7.43) | 0.010 | |
| los (mean (SD)) | 9.01 (5.98) | 10.50 (7.05) | <0.001 | |
| card_discharge = Yes (%) | 162 (12.5) | 45 (11.5) | 0.622 | |
| malignant_tumors = Yes (%) | 102 ( 7.9) | 46 (11.7) | 0.026 | |
| diabetes = Yes (%) | 113 ( 8.8) | 60 (15.3) | <0.001 | |
| disorders_of_lipoid_metabolism = Yes (%) | 56 ( 4.3) | 19 ( 4.8) | 0.781 | |
| obesity = Yes (%) | 49 ( 3.8) | 10 ( 2.5) | 0.306 | |
| hematologic_diseases = Yes (%) | 79 ( 6.1) | 38 ( 9.7) | 0.021 | |
| hypertensive_disease = Yes (%) | 315 (24.4) | 120 (30.5) | 0.018 | |
| old_acute_myocardial_infarction = Yes (%) | 110 ( 8.5) | 36 ( 9.2) | 0.770 | |
| other_forms_of_ischemic_heart_disease = Yes (%) | 309 (23.9) | 109 (27.7) | 0.144 | |
| rheumatic_heart_disease = Yes (%) | 51 ( 4.0) | 20 ( 5.1) | 0.401 | |
| cardiomyopathies = Yes (%) | 82 ( 6.4) | 30 ( 7.6) | 0.437 | |
| other_cardiac_diseases = Yes (%) | 67 ( 5.2) | 27 ( 6.9) | 0.252 | |
| conduction_disorders_cardiac_dysrhythmias = Yes (%) | 215 (16.7) | 86 (21.9) | 0.022 | |
| cerebrovascular_diseases = Yes (%) | 151 (11.7) | 58 (14.8) | 0.127 | |
| vascular_diseases = Yes (%) | 98 ( 7.6) | 42 (10.7) | 0.065 | |
| chronic_obstructive_pulmonary_disease = Yes (%) | 122 ( 9.5) | 58 (14.8) | 0.004 | |
| chronic_nephropathies = Yes (%) | 207 (16.0) | 94 (23.9) | <0.001 | |
| old_percutaneous_transluminal_coronary_angioplasty = Yes (%) | 44 ( 3.4) | 11 ( 2.8) | 0.665 | |
| antidiabetic_drugs = Yes (%) | 235 (18.2) | 96 (24.4) | 0.008 | |
| drugs_for_cardiac_therapy = Yes (%) | 327 (25.3) | 131 (33.3) | 0.002 | |
| drugs_for_obstructive_airway_diseases = Yes (%) | 180 (13.9) | 68 (17.3) | 0.118 | |
| antihypertensive_drugs = Yes (%) | 1084 (84.0) | 360 (91.6) | <0.001 | |
| statins = Yes (%) | 315 (24.4) | 102 (26.0) | 0.577 | |
| antiplatelet_drugs = Yes (%) | 562 (43.5) | 181 (46.1) | 0.410 | |
| gender_gp = Male (%) | 398 (30.8) | 90 (22.9) | 0.003 | |
| age_gp (mean (SD)) | 55.93 (4.96) | 56.33 (4.73) | 0.154 | |
| org_arrangement (%) | | | 0.714 | |
| Group practice | 690 (53.4) | 211 (53.7) | ||
| Network | 307 (23.8) | 86 (21.9) | ||
| None | 281 (21.8) | 90 (22.9) | ||
| Simple association | 13 ( 1.0) | 6 ( 1.5) | ||
| n_patients (%) | | | 0.953 | |
| <1000 | 158 (12.2) | 47 (12.0) | ||
| >=1500 | 576 (44.6) | 173 (44.0) | ||
| 1000-1499 | 557 (43.1) | 173 (44.0) | ||
| ambulatory_area (%) | | | 0.105 | |
| Mountain | 89 ( 6.9) | 19 ( 4.8) | ||
| Rural | 633 (49.0) | 180 (45.8) | ||
| Urban | 569 (44.1) | 194 (49.4) | ||
| n_gps = >15 (%) | 655 (50.7) | 215 (54.7) | 0.186 | |
| nurses_hours = >350 (%) | 607 (47.0) | 184 (46.8) | 0.991 | |
| hf_pathway = Yes (%) | 620 (48.0) | 179 (45.5) | 0.422 | |
| readmitted_90 (mean (SD)) | 0.00 (0.00) | 1.00 (0.00) | <0.001 |
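As a minimal sketch of the stratified split mentioned above (a compact equivalent of the code in the appendix; `hosp` is the pre-processed data frame and 25% of each outcome group is withheld for testing):

```r
set.seed(1)
# Sample 25% of each outcome group separately so both sets keep similar class proportions
test_idx <- unlist(lapply(split(seq_len(nrow(hosp)), hosp$readmitted_90),
                          function(idx) sample(idx, floor(0.25 * length(idx)))))
combotest <- hosp[test_idx, ]
combotrain <- hosp[-test_idx, ]
```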
Afterwards, we establish class weights based on the proportions determined earlier. Cases from the class with the lower overall proportion (readmitted) receive a greater weight than cases from the more common class (not readmitted). The purpose of this is to correct the class imbalance in the training set.
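Concretely, the weights are inversely proportional to each class's frequency in the training set (this mirrors the calculation in the code appendix):

```r
# Inverse-frequency weights: the rarer readmitted class gets the larger weight
weight0 <- round(nrow(combotrain) / sum(combotrain$readmitted_90 == 0))
weight1 <- round(nrow(combotrain) / sum(combotrain$readmitted_90 == 1))
weights <- ifelse(combotrain$readmitted_90 == 0, weight0, weight1)
```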
Implementing the logistic regression model gave interesting results (Table 6). Considering the variables individually, the regression model found multiple significant predictors of readmission to hospital within 90 days. A comforting finding was that our model also identified the hf pathway as a protective factor: just as the original paper found, enrollment in a heart-failure-specific care pathway with the patient's primary care physician was associated with fewer readmissions. All other significant predictors had positive coefficients (i.e., patients had a higher likelihood of being readmitted if the indicated variable was present).
I was also interested in interactions between variables, and thought it would be interesting to look at patients who are taking multiple medications. Since both antihypertensive and antidiabetic drugs were significantly associated with a higher risk of readmission (coefficients of 0.85 and 1.43, respectively, with p-values below 0.05), I hypothesized that patients taking both might have an even higher risk of readmission, since these patients have multiple conditions with cardiovascular effects. Interestingly, the model shows that patients taking both have a significantly lower risk of being readmitted. I also looked at whether having COPD and taking medications for the disease might increase the risk of readmission, since these patients presumably have more severe disease; however, this interaction did not appear to be significant.
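As a brief sketch of how the interaction term enters the model (the full call, including the backward stepwise selection, is in the code appendix):

```r
# ':' adds the drug-drug interaction on top of all main effects
modelinteractions <- glm(readmitted_90 ~ . + antidiabetic_drugs:antihypertensive_drugs,
                         data = combotrain, family = "binomial", weights = weights)
```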
The classification tree found that similar variables were predictive of readmission within 90 days. In the resulting model, taking antihypertensive drugs was the most powerful predictor of readmission, followed by length of stay and whether the patient had COPD. The tree was developed by first growing it with a very low complexity parameter. I then used plotcp to see at which value the cross-validated relative error is minimized; interestingly, these values do not correspond with the values generated by printcp. Based on the plot, I pruned the tree at a cp of 0.019.
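In brief, the workflow looks like this (full version in the code appendix):

```r
# Grow a deliberately over-complex tree, inspect cross-validated error, then prune
treemodel <- rpart(readmitted_90 ~ ., combotrain, weights = weights,
                   cp = 0.000001, method = "class")
plotcp(treemodel)                                # cross-validated relative error vs. cp
prunetree <- rpart::prune(treemodel, cp = 0.019)
```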
To evaluate the two models, I used the withheld test data and calculated the MSE between the predicted and actual values. By this metric, the decision tree was the better model for this data set. I also reran both models without the class weighting, which, as one would expect, produced considerably worse fits.
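Sketched briefly (this matches the evaluation code in the appendix), the comparison takes each model's predicted probability of readmission and computes the squared error against the 0/1 outcome:

```r
# Probability of the "1" (readmitted) class from each model
p_tree <- predict(prunetree, combotest)[, "1"]
p_glm <- predict(model1, combotest, type = "response")
mse_tree <- mean((p_tree - combotest$readmitted_90)^2)
mse_glm <- mean((p_glm - combotest$readmitted_90)^2)
```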
##
## Classification tree:
## rpart(formula = readmitted_90 ~ ., data = combotrain, weights = weights,
## method = "class", cp = 0.001)
##
## Variables actually used in tree construction:
## [1] age_gp
## [2] age_pt
## [3] ambulatory_area
## [4] antidiabetic_drugs
## [5] antihypertensive_drugs
## [6] antiplatelet_drugs
## [7] cardiomyopathies
## [8] chronic_nephropathies
## [9] chronic_obstructive_pulmonary_disease
## [10] conduction_disorders_cardiac_dysrhythmias
## [11] drugs_for_cardiac_therapy
## [12] drugs_for_obstructive_airway_diseases
## [13] gender_gp
## [14] gender_pt
## [15] hf_pathway
## [16] hypertensive_disease
## [17] los
## [18] n_gps
## [19] n_patients
## [20] nurses_hours
## [21] org_arrangement
## [22] other_forms_of_ischemic_heart_disease
## [23] rheumatic_heart_disease
## [24] vascular_diseases
##
## Root node error: 969/1264 = 0.76661
##
## n= 1264
##
## CP nsplit rel error xerror xstd
## 1 0.0567595 0 1.00000 1.00000 0.023805
## 2 0.0196078 1 0.94324 0.98762 0.023777
## 3 0.0175439 6 0.83901 0.97420 0.023743
## 4 0.0144479 7 0.82147 1.00516 0.023815
## 5 0.0092879 9 0.79257 0.98658 0.023774
## 6 0.0072239 12 0.76471 1.00929 0.023824
## 7 0.0067079 17 0.72549 1.03096 0.023861
## 8 0.0061920 21 0.69866 1.03199 0.023863
## 9 0.0056760 25 0.67389 1.03302 0.023864
## 10 0.0051600 27 0.66254 1.03199 0.023863
## 11 0.0048160 35 0.61197 1.03818 0.023872
## 12 0.0043860 38 0.59752 1.05160 0.023888
## 13 0.0041280 43 0.57276 1.03818 0.023872
## 14 0.0036120 58 0.48607 1.03302 0.023864
## 15 0.0025800 62 0.47162 1.05470 0.023892
## 16 0.0024080 65 0.46336 1.05057 0.023887
## 17 0.0020640 68 0.45614 1.04850 0.023885
## 18 0.0010320 71 0.44995 1.06089 0.023898
## 19 0.0010000 72 0.44892 1.07018 0.023906
| Num Splits | CP value | MSE |
|---|---|---|
| 0 | 0.0568 | 0.2524 |
| 1 | 0.0196 | 0.2561 |
| 6 | 0.0175 | 0.2666 |
| 7 | 0.0144 | 0.2696 |
| 9 | 0.0093 | 0.2729 |
| 12 | 0.0072 | 0.2800 |
| | Decision Tree | Logistic Regression Model |
|---|---|---|
| Weighted | 0.2666128 | 0.4648358 |
| Not Weighted | 0.3377175 | 2.5541276 |
knitr::opts_chunk$set(echo = FALSE)
options(tinytex.verbose = TRUE)
library(Rfast)
library(MASS)
library(kernlab)
library(Rtsne)
library(tidyverse)
library(readr)
library(parallel)
library(grid)
library(gridExtra)
library(ggplot2)
library(knitr)
library(kableExtra)
library(tinytex)
library(Deriv)
library(plotly)
library(modelr)
library(tableone)
library(lubridate)
library(rpart)
library(car)
library(generics)
library(broom) # provides the tidy() methods for glm objects used below
#The code for a simple vanilla gradient descent
gradient_desc <- function(f, x_init){
# Parameters
max_iters <- 50
learning_rate <- 0.01
iters <- 0
# Starting conditions
x <- rep(0,max_iters)
x[1] <- x_init
f_der <- Deriv(f)
for (i in 2:max_iters){
# Calculate gradient
grad <- f_der(x[i-1])
x[i] <- x[i-1]-learning_rate*grad
}
return(x)
}
# non-convex function
f <- function(x) ((x^2-4*x+4)*(x^2+4*x+2))
x <- gradient_desc(f, 4)
y <- f(x)
df <- data.frame(x=x, y=y, time=1:length(x))
f_df <- data.frame(x=seq(-4,4,0.1))
f_df$y <- f(f_df$x)
p <- ggplot() +
geom_line(data=f_df, aes(x,y), color="blue") +
geom_point(data=df, aes(x,y,frame = time)) +
xlim(-4,4)
#ggplotly(p)
plot1 <- ggplot() +
geom_line(data=f_df, aes(x,y), color="blue") +
geom_point(data=df[nrow(df),], aes(x,y,frame = time)) +
xlim(-4,4) +
labs(title = "Plot 1: Vanilla Gradient Descent")
plot1
adam_func <- function(f, x_init){
# Parameters
max_iters <- 2000
learning_rate <- 0.75
beta1 <- 0.9
beta2 <- 0.99
eps <- 1e-8
# Starting conditions
x <- rep(0,max_iters)
x[1] <- x_init
m <- 0
v <- 0
f_der <- Deriv(f)
for (i in 2:max_iters){
grad <- f_der(x[i-1])
# Exponentially decaying averages of the gradient and the squared gradient
m <- beta1*m + (1-beta1)*grad
v <- beta2*v + (1-beta2)*grad^2
# Bias-corrected moment estimates; this is step t = i-1, since x[1] is the starting point
m_2 <- m/(1-beta1^(i-1))
v_2 <- v/(1-beta2^(i-1))
x[i] <- x[i-1] - learning_rate*m_2/(sqrt(v_2) + eps)
}
return(x)
}
# non-convex function with multiple local minima
f2 <- function(x) (x^6 - 15*x^4 + 65*x^2 + 20*x + 10)
x2 <- adam_func(f2, 4)
y2 <- f2(x2)
df2 <- data.frame(x=x2, y=y2, time=1:length(x2))
f_df2 <- data.frame(x=seq(-4,4,0.1))
f_df2$y <- f2(f_df2$x)
p2 <- ggplot() +
geom_line(data=f_df2, aes(x,y), color="blue") +
geom_point(data=df2, aes(x,y,frame = time)) +
xlim(-4,4)
plot3 <- ggplot() +
geom_line(data=f_df2, aes(x,y), color="blue") +
geom_point(data=df2[nrow(df2),], aes(x,y,frame = time)) +
xlim(-4,4) +
labs(title = "Plot 3: Adam Descent with a different function")
plot3
# non-convex function
f1 <- function(x) ((x^2-4*x+4)*(x^2+4*x+2))
x1 <- adam_func(f1, 4)
y1 <- f1(x1)
df1 <- data.frame(x=x1, y=y1, time=1:length(x1))
f_df1 <- data.frame(x=seq(-4,4,0.1))
f_df1$y <- f1(f_df1$x)
p1 <- ggplot() +
geom_line(data=f_df1, aes(x,y), color="blue") +
geom_point(data=df1, aes(x,y,frame = time)) +
xlim(-4,4)
plot2 <- ggplot() +
geom_line(data=f_df1, aes(x,y), color="blue") +
geom_point(data=df1[nrow(df1),], aes(x,y,frame = time)) +
xlim(-4,4) +
labs(title = "Plot 2: Adam Descent")
plot2
# non-convex function with huge hump between local and global minima
f3 <- function(x) (x^6 - 15*x^4 + x^2 + 20*x + 10)
x3 <- adam_func(f3, 4)
y3 <- f3(x3)
df3 <- data.frame(x=x3, y=y3, time=1:length(x3))
f_df3 <- data.frame(x=seq(-4,4,0.1))
f_df3$y <- f3(f_df3$x)
p3 <- ggplot() +
geom_line(data=f_df3, aes(x,y), color="blue") +
geom_point(data=df3, aes(x,y,frame = time)) +
xlim(-4,4)
plot4 <- ggplot() +
geom_line(data=f_df3, aes(x,y), color="blue") +
geom_point(data=df3[nrow(df3),], aes(x,y,frame = time)) +
xlim(-4,4) +
labs(title = "Plot 4: Adam Descent with a different function: The Edge Case")
plot4
# Generate data
set.seed(123)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
y <- 2 + 3*x1 - 4*x2 + 5*x3 + rnorm(100, 0, 1)
data <- data.frame(x1, x2, x3, y)
y2 <- 2 + 3*x1^2 - 4*x2 + 5*x3^3 + rnorm(100, 0, 1)
data_2 <- data.frame(x1, x2, x3, y2)
# Return gradient vector
grad <- function(beta, x, y) {
y_pred <- x %*% beta
error <- y_pred - y
grad <- t(x) %*% error / length(y)
return(grad)
}
# Re-define gradient descent, pulling out gradient directly
grad_desc <- function(X, y, alpha=0.01, max_iters=1000, tol=1e-6) {
n <- nrow(X)
p <- ncol(X)
theta <- rep(0, p)
iter <- 0
for (i in 1:max_iters) {
theta_new <- theta - alpha * grad(theta, X, y)
# Relative-change stopping rule; the small constant guards against dividing by zero at theta = 0
if (sqrt(sum((theta_new - theta)^2))/(sqrt(sum(theta^2)) + 1e-12) < tol) {
break
}
theta <- theta_new
iter <- i
}
return(list(theta=theta, iter=iter, alpha = alpha))
}
# Re-define ADAM function without Deriv function, extracting gradient directly
adam <- function(X, y, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, max_iters=1000, tol=1e-6) {
n <- nrow(X)
p <- ncol(X)
theta <- rep(0, p)
m <- 0
v <- 0
iter <- 0
for (i in 1:max_iters) {
grad_theta <- grad(theta, X, y)
m <- beta1 * m + (1 - beta1) * grad_theta
v <- beta2 * v + (1 - beta2) * grad_theta^2
m_hat <- m / (1 - beta1^i)
v_hat <- v / (1 - beta2^i)
theta_new <- theta - alpha * m_hat / (sqrt(v_hat) + eps)
# Relative-change stopping rule; the small constant guards against dividing by zero at theta = 0
if (sqrt(sum((theta_new - theta)^2))/(sqrt(sum(theta^2)) + 1e-12) < tol) {
break
}
theta <- theta_new
iter <- i
}
return(list(theta=theta, iter=iter, alpha = alpha))
}
# Fit models (note: the design matrix has no intercept column, so only the three slopes are estimated)
X <- as.matrix(data[,1:3])
Y <- data[,4]
X2 <- as.matrix(data_2[,1:3])
Y2 <- data_2[,4]
gd_theta <- grad_desc(X, Y)
adam_theta <- adam(X, Y)
adam_theta_2 <- adam(X, Y, max_iters=10000)
GD <- matrix(,4,1)
GD[1:3] <- round(as.matrix(gd_theta$theta), 2)
GD[4] <- as.character(gd_theta$iter)
ADAM <- matrix(,4,1)
ADAM[1:3] <- round(as.matrix(adam_theta$theta), 2)
ADAM[4] <- as.character(adam_theta$iter)
ADAM2 <- matrix(,4,1)
ADAM2[1:3] <- round(as.matrix(adam_theta_2$theta), 2)
ADAM2[4] <- as.character(adam_theta_2$iter)
# Compare coefficients and total number of iterations per model
coefficients_df <- data.frame(GD, ADAM, ADAM2)
rownames(coefficients_df) <- c("ß1", "ß2", "ß3", "Iterations")
coefficients_df
#kable(coefficients_df, align = "c", caption = "Table 1: Applying the two algorithms to a linear regression model")
#For the more complicated formula
gd_theta_new <- grad_desc(X2, Y2)
adam_theta_2_new <- adam(X2, Y2, max_iters=1000000)
GD_new <- matrix(,4,1)
GD_new[1:3] <- round(as.matrix(gd_theta_new$theta), 2)
GD_new[4] <- as.character(gd_theta_new$iter)
ADAM2_new <- matrix(,4,1)
ADAM2_new[1:3] <- round(as.matrix(adam_theta_2_new$theta), 2)
ADAM2_new[4] <- as.character(adam_theta_2_new$iter)
coefficients_df_new <- data.frame(GD_new, ADAM2_new)
rownames(coefficients_df_new) <- c("ß1", "ß2", "ß3", "Iterations")
colnames(coefficients_df_new) <- c("GD", "ADAM")
#kable(coefficients_df_new, align = "c", caption = "Table 2: Applying the two algorithms to more complicated problems")
coefficients_df_new
#Comparing different learning rates
gd_theta_lr1 <- grad_desc(X2, Y2)
gd_theta_lr2 <- grad_desc(X2, Y2, alpha = 0.001, max_iters=1000000)
adam_theta_lr1 <- adam(X2, Y2, max_iters=1000000)
adam_theta_lr2 <- adam(X2, Y2, max_iters=1000000, alpha = 0.1)
adam_theta_lr3 <- adam(X2, Y2, max_iters=1000000, alpha = 0.5)
GD_lr1 <- matrix(,5,1)
GD_lr1[1:3] <- round(as.matrix(gd_theta_lr1$theta), 2)
GD_lr1[4] <- as.character(gd_theta_lr1$iter)
GD_lr1[5] <- as.character(gd_theta_lr1$alpha)
GD_lr2 <- matrix(,5,1)
GD_lr2[1:3] <- round(as.matrix(gd_theta_lr2$theta), 2)
GD_lr2[4] <- as.character(gd_theta_lr2$iter)
GD_lr2[5] <- as.character(gd_theta_lr2$alpha)
ADAM2_lr1 <- matrix(,5,1)
ADAM2_lr1[1:3] <- round(as.matrix(adam_theta_lr1$theta), 2)
ADAM2_lr1[4] <- as.character(adam_theta_lr1$iter)
ADAM2_lr1[5] <- as.character(adam_theta_lr1$alpha)
ADAM2_lr2 <- matrix(,5,1)
ADAM2_lr2[1:3] <- round(as.matrix(adam_theta_lr2$theta), 2)
ADAM2_lr2[4] <- as.character(adam_theta_lr2$iter)
ADAM2_lr2[5] <- as.character(adam_theta_lr2$alpha)
ADAM2_lr3 <- matrix(,5,1)
ADAM2_lr3[1:3] <- round(as.matrix(adam_theta_lr3$theta), 2)
ADAM2_lr3[4] <- as.character(adam_theta_lr3$iter)
ADAM2_lr3[5] <- as.character(adam_theta_lr3$alpha)
coefficients_df_lr <- data.frame(GD_lr1, GD_lr2, ADAM2_lr1, ADAM2_lr2, ADAM2_lr3)
rownames(coefficients_df_lr) <- c("ß1", "ß2", "ß3", "Iterations", "Learning Rate")
colnames(coefficients_df_lr) <- c("GD (0.01)", "GD (0.001)", "ADAM (0.01)", "ADAM (0.1)", "ADAM (0.5)")
#kable(coefficients_df_lr, align = "c", caption = "Table 3: Applying the two algorithms using different learning rates")
coefficients_df_lr
#Loading in data
hosp <- read.csv("hospital_readmissions.csv")
#str(hosp)
#dim(hosp)
#Database cleaning and reorganizing
new_names <- c("malignant_tumors", "diabetes",
"disorders_of_lipoid_metabolism",
"obesity","hematologic_diseases","hypertensive_disease",
"old_acute_myocardial_infarction",
"other_forms_of_ischemic_heart_disease",
"ill_defined_descriptions_of_heart_disease",
"rheumatic_heart_disease", "cardiomyopathies",
"other_cardiac_diseases",
"conduction_disorders_cardiac_dysrhythmias",
"cerebrovascular_diseases", "vascular_diseases",
"chronic_obstructive_pulmonary_disease",
"chronic_nephropathies",
"chronic_diseases_other", "old_bypass",
"old_percutaneous_transluminal_coronary_angioplasty",
"other_surgery_of_the_heart",
"other_surgery_of_great_vessels",
"antidiabetic_drugs", "drugs_for_cardiac_therapy",
"drugs_for_obstructive_airway_diseases",
"antihypertensive_drugs",
"statins", "antiplatelet_drugs")
colnames(hosp)[7:34] <- new_names
hosp[hosp==""] <- NA
hosp <- hosp %>%
filter(is.na(death_dt) | (mdy(death_dt)-mdy(dt_discharge) > 90))
hosp$readmitted_90 <- ifelse(hosp$n_reric90 > 0, 1, 0)
hosp <- hosp %>% select(-c('dt_admission', 'dt_discharge',
'n_reric30','n_reric30_hf',
'dt_end_fup30', 'n_reric90_hf',
'dt_end_fup90', 'n_reric180',
'n_reric180_hf', 'dt_end_fup180',
'n_reric365', 'n_reric365_hf',
'dt_end_fup365', 'death_dt',
'n_reric90'))
# Drop unique identifiers
hosp <- hosp %>% select(-c('id_pt', 'id_ncp', 'id_gp'))
# Mutation of categorical variables into binary variables for logistic regression
#Removing any variables that have a very low frequency
Percent_Yes <- 0
for(i in c(4:33, 41)){
Yes <- hosp[,i] == "Yes"
Percent_Yes[i-3] <- sum(Yes)/nrow(hosp)
}
hosp <- hosp[, c(TRUE, TRUE, TRUE, Percent_Yes[1:30] > 0.03, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)]
tab1 <- CreateTableOne(data=hosp, strata=c("readmitted_90"))
kableone(tab1, caption = "Table 4: Summarized pre-processed data")
#Creating the test and train sets (while accounting for the two subgroups)
set.seed(1)
readmitted <- hosp %>% filter(readmitted_90 == 1)
noreadmiss <- hosp %>% filter(readmitted_90 == 0)
readmittedindex <- sample(1:nrow(readmitted), floor(0.25*nrow(readmitted)), replace = FALSE)
noreadmissindex <- sample(1:nrow(noreadmiss), floor(0.25*nrow(noreadmiss)), replace = FALSE)
readtest <- readmitted[readmittedindex, ]
readtrain <- readmitted[-readmittedindex, ]
noretest <- noreadmiss[noreadmissindex,]
noretrain <- noreadmiss[-noreadmissindex,]
combotest <- rbind(readtest, noretest)
combotrain <- rbind(readtrain, noretrain)
#Change yes/no to 1/0 respectively to allow for weighting of variables
#combotest[combotest=="Yes"] <- 1
#combotest[combotest=="No"] <- 0
#combotrain[combotrain=="Yes"] <- 1
#combotrain[combotrain=="No"] <- 0
#Makes all of the values numeric so they can be properly weighted
#combotest[4:37] <- apply(combotest[4:37], 2, as.numeric)
#combotrain[4:37] <- apply(combotrain[4:37], 2, as.numeric)
# Define weights for observations in Training set
weight0 <- round(nrow(combotrain)/sum(combotrain$readmitted_90==0))
weight1 <- round(nrow(combotrain)/sum(combotrain$readmitted_90==1))
weights <- ifelse(combotrain$readmitted_90==0, weight0, weight1)
#Check and display the proportions of the split test and train data sets
trainprop <- c(nrow(combotrain), nrow(combotrain)/(nrow(combotest)+nrow(combotrain)), sum(combotrain$readmitted_90==0)/nrow(combotrain), sum(combotrain$readmitted_90==1)/nrow(combotrain))
testprop <- c(nrow(combotest), nrow(combotest)/(nrow(combotest)+nrow(combotrain)), sum(combotest$readmitted_90==0)/nrow(combotest), sum(combotest$readmitted_90==1)/nrow(combotest))
testtrainprop <- data.frame(c(as.character(trainprop[1]), round(trainprop[2] * 100, 3), round(trainprop[3] * 100, 3), round(trainprop[4] * 100, 3)), c(as.character(testprop[1]), round(testprop[2] * 100, 3), round(testprop[3] * 100, 3), round(testprop[4] * 100, 3)))
rownames(testtrainprop) <- c("# of Cases", "% of Cases", "% Not Readmitted", "% Readmitted")
colnames(testtrainprop) <- c("Training Set", "Testing Set")
testtrainprop
#kable(testtrainprop, align = "c", caption = "Table 5: A summary of test and train data sets")
#Apply logistic regression to weighted test set
model1 <- step(
glm(
readmitted_90 ~ .,
data = combotrain,
family = "binomial",
weights = weights
),
direction = "backward", trace = FALSE
)
modelinteractions <- step(
glm(
readmitted_90 ~ .
+ antidiabetic_drugs:antihypertensive_drugs, # Interaction terms
data = combotrain,
family = "binomial",
weights = weights
),
direction = "backward", trace = FALSE
)
modelinteractions2 <- step(
glm(
readmitted_90 ~ .
+ chronic_obstructive_pulmonary_disease:drugs_for_obstructive_airway_diseases, # Interaction terms
data = combotrain,
family = "binomial",
weights = weights
),
direction = "backward", trace = FALSE
)
modelnoweight <- step(
glm(
readmitted_90 ~ ., # Interaction terms
data = combotrain,
family = "binomial"
),
direction = "backward", trace = FALSE
)
# Check the summary of the model & display as table
summarytable <- tidy(model1)
summarytable <- summarytable[-1,-c(3:4)]
summarytable$p.value <- ifelse(summarytable$p.value < 0.05,
cell_spec(round(summarytable$p.value, 3), "latex",
bold = TRUE),
round(summarytable$p.value, 3))
summarytable <- as.data.frame(summarytable)
summarytable
#summarytable %>% kbl(caption = "Results of a logistic regression model") %>%
# kable_styling(full_width = FALSE, latex_options = c('hold_position'))
summarytable2 <- tidy(modelinteractions)
summarytable2 <- summarytable2[-1,-c(3:4)]
summarytable2$p.value <- ifelse(summarytable2$p.value < 0.05,
cell_spec(round(summarytable2$p.value, 3), "latex",
bold = TRUE),
round(summarytable2$p.value, 3))
summarytable2 <- as.data.frame(summarytable2)
summarytable2
#summarytable2 %>% kbl(caption = "Results of a logistic regression model looking at the interaction of taking anti-diabetic and anti-hypertensive medications") %>% kable_styling(full_width = FALSE, latex_options = c('hold_position'))
summarytable3 <- tidy(modelinteractions2)
summarytable3 <- summarytable3[-1,-c(3:4)]
summarytable3$p.value <- ifelse(summarytable3$p.value < 0.05,
cell_spec(round(summarytable3$p.value, 3), "latex",
bold = TRUE),
round(summarytable3$p.value, 3))
summarytable3 <- as.data.frame(summarytable3)
summarytable3
#summarytable3 %>% kbl(caption = "Results of a logistic regression model accounting for severe COPD") %>% kable_styling(full_width = FALSE, latex_options = c('hold_position'))
set.seed(1)
#Generating the tree models with a very low cp
treemodel <- rpart(readmitted_90 ~ ., combotrain, weights = weights, cp = 0.000001, method = "class")
treenoweight <- rpart(readmitted_90 ~ ., combotrain, cp = 0.000001, method = "class")
#Plotcp will be used to decide which cp to use when pruning the tree
#plot(treemodel)
plotcp(treemodel)
#printcp(treemodel)
#Generating the final trees (one with weights, one without)
prunetree <- rpart::prune(treemodel, cp=0.019)
prunetreenoweight <- rpart::prune(treenoweight, cp=0.0063)
plot(prunetree)
text(prunetree)
#Looking at the effect of picking different CP values on MSE
set.seed(1)
treefunc <- rpart(readmitted_90 ~ ., combotrain, weights = weights, cp = 0.001, method = "class")
cp <- printcp(treefunc)
msetree <- 0
#This for loop will iterate through different cp values (I decided to use the first six values)
#It will then calculate the MSE for each model
for(i in 1:6){
prunetreefunc <- rpart::prune(treefunc, cp = cp[i]) # prune the same tree whose cp table was just printed
predfunctree <- predict(prunetreefunc, combotest)[, "1"] # predicted probability of readmission
msetree[i] <- round(mean((predfunctree - combotest$readmitted_90)^2), 4)
}
cpmsedf <- tibble(cp[1:6, "nsplit"], round(cp[1:6, "CP"], 4), msetree) # take the split counts and CP values directly from the printcp table
names(cpmsedf) <- c("Num Splits", "CP value", "MSE")
cpmsedf %>% kbl(caption = "Effect of the pruning CP value on test-set MSE") %>%
kable_styling(full_width = FALSE, latex_options = c('hold_position')) %>%
row_spec(2, bold = TRUE) %>%
column_spec(1, width = "33%") %>%
column_spec(2, width = "33%") %>%
column_spec(3, width = "33%")
#Looking at the effect that weighting has on the MSE
predreadmittree <- predict(prunetree, combotest)[, "1"] # probability of the readmitted class
predtreenoweight <- predict(prunetreenoweight, combotest)[, "1"]
testmsetree <- mean((predreadmittree - combotest$readmitted_90)^2)
testmsetreenoweight <- mean((predtreenoweight - combotest$readmitted_90)^2)
predreadmitglm <- predict(model1, combotest, type = "response") # fitted probabilities rather than log-odds
testmseglm <- mean((predreadmitglm - combotest$readmitted_90)^2)
prednoweightglm <- predict(modelnoweight, combotest, type = "response")
noweightmseglm <- mean((prednoweightglm - combotest$readmitted_90)^2)
msetable <- tibble(c("Weighted", "Not Weighted"), c(testmsetree, testmsetreenoweight), c(testmseglm, noweightmseglm))
names(msetable) <- c("", "Decision Tree", "Logistic Regression Model")
msetable %>% kbl(caption = "Test-set MSE for the weighted and unweighted models") %>%
kable_styling(full_width = FALSE, latex_options = c('hold_position')) %>%
column_spec(1, bold = TRUE) %>%
column_spec(1, width = "33%") %>%
column_spec(2, width = "33%") %>%
column_spec(3, width = "33%")