#Section 1: Optimization Methods

The Adam algorithm, otherwise known as Adaptive Moment Estimation, uses both the concept of momentum and adaptive learning in order to find the global minimum of an optimization problem. Momentum is refers to the direction that the optimization algorithm is moving in and the adaptive leaning rates refer to the fact that each “step” taken by the algorithm will be a different size. Adam uses a decaying weighted sum of the two, meaning that the latest steps taken impart the greatest influence on the size of the next step, although it still maintains information from prior movements. This way, Adam is able to avoid converging on local minima and is much more efficient than a simple Stochastic Gradient Descent. This is because momentum (keeping track of previous gradients) allows the algorithm to push though local minima while the adaptive step size allows the algorithm to take larger steps in areas of little curvature and smaller steps when there is significant curvature. This approach to optimization is well suited for non-convex problems because it is able to overcome local minima and find the global minimum. Furthermore, the researcher implementing the algorithm is able to set a maximum step size in order to maintain closer control over the progression of the problem. However, this optimization method is still constrained by certain limitations. For example, if the first derivative of the problem does not exist then the Adam algorithm cannot be implemented.

Below, I have taken the code that performs a simple, vanilla gradient descent that was presented in class. As you can see, the function easily gets stuck at the local minima. Please note that for each of the following graphs, the starting value chosen for the plot was 4. I have only showed the final frame, which represents that minima that the algorithms have calculated. If desired, there is also a ggplotly function in the code appendix that shows the animation, including the rate of discovering the minima.

As Plot 1 shows, the vanilla gradient descent is unable to find the global minimum. It quickly finds the local minimum but because it is only optimizing for the gradient at the current position and does not include any momentum , once the direction of the gradient changes it remains in that minimum.

When using the Adam function, it is usually able to overcome these local minima. Due to its incorporation of both gradient-based and history-based momentum, the algorithm is able to keep a more global perspective and use the momentum to move out of the minima. However, as is seen in the final case, it does have its limitations.

I found an edge case below that seems to stump the Adam algorithm. Because of the signifiacntly larger hump between the local and global minimum, the algorithm loses its momentum quickly and it unable to get over the hump. This was true for all learning rates that I tried, starting from 0.5 and going all the way up to 5000. Furthermore, for the first non-convex function, when the selected starting point was 1, the adam function still failed to determine the global minimum. It would converge on the local minimum. This makes intuative sense to me since the initial gradient that the function would calculate would be pointing towards the local minimum and it would never get the momentum to pass by its initial starting point and find the global minimum.

Initially when applying the Adam algorithm, I thought that it would perform better. In fact, when both models (the vanilla gradient descent and Adam) were implemented against a multiple linear regression model, they gave different results (shown in the table labeled GD and ADAM). However, when the max iterations where increased from 1000 to 10000, the coefficients converged onto the same values (increased max iterations are labeld ADAM2 in the table). This suggests that rather than actually getting different results, the ADAM algorithm was less efficient in this case and its learning rate was slower than the gradient descent. I was curious to see whether this would be true for all cases of a multiple linear regression model, as the model I created was relatively simple. However, when I complicated the formula, GD and ADAM found different results (shown in table 2). Although it took a significant amount of iterations, ADAM found a different set of coefficients. Although it is difficult to determine if this is true, the vanilla gradient descent may have gotten stuck in a local minima created by the polynomial function created by the exponents that were added to the formula.

I also investigated how different algorithms respond to different learning rates. The learning rate represents the size of the update to the gradients in these functions. Thus, a larger learning rate would allow for larger steps and faster convergance upon the minimum. However, a risk is that if the learning rate is too large, then the algorithm may pass over the minimum or result in bounding back and forth around the minimum. On the other hand, choosing a learning rate that is too small may result in the algorithm requiring a long time to calculate the minimum due to the smaller steps it takes. Both of these effects are seen in Table 3. As the learning rate decreases, both algorithms require more iterations to reach the minimum.

#Section 2 The next part of the assignment was to develop two models (a logistic regression model and a classification tree) to predict whether heart failure patients would be readmitted to hospitals within 90 days. The original paper that performed this analysis used a multivariable multilevel Poisson regression analyses and determined that the only significant predictor for readmission was the presence of implementing a heart failure specific care pathway with their primacy care physician.

Table 4 summarizes the pre-processed data. To summarize the preprocessing, we first converted the 90-day readmission variable into a binary endpoint and removed cases where the patient died within 90 days of discharge. We also excluded binary variables with proportions < 3%. We also removed certain variables from the original data set that were not relevant for our analysis, as well as renaming variables. After this pre-processing, we identified two groups based on the outcome of interest (readmission within 90 days), with 393 readmitted and 1291 not readmitted. These groups will form our study cohort for the decision tree and logistic regression models, which can potentially use up to 32 variables as covariates. The table displays the rates of various variables in the two groups, with bolded p-values <0.05 (though we acknowledge the risk of coincidental findings with multiple tests). Next, we split the data into training and testing sets, ensuring similar proportions of readmitted and non-readmitted cases in both sets.

Table 4: Summarized pre-processed data
0 1 p test
n 1291 393
gender_pt = Male (%) 577 (44.7) 172 (43.8) 0.790
age_pt (mean (SD)) 81.22 (8.24) 82.42 (7.43) 0.010
los (mean (SD)) 9.01 (5.98) 10.50 (7.05) <0.001
card_discharge = Yes (%) 162 (12.5) 45 (11.5) 0.622
malignant_tumors = Yes (%) 102 ( 7.9) 46 (11.7) 0.026
diabetes = Yes (%) 113 ( 8.8) 60 (15.3) <0.001
disorders_of_lipoid_metabolism = Yes (%) 56 ( 4.3) 19 ( 4.8) 0.781
obesity = Yes (%) 49 ( 3.8) 10 ( 2.5) 0.306
hematologic_diseases = Yes (%) 79 ( 6.1) 38 ( 9.7) 0.021
hypertensive_disease = Yes (%) 315 (24.4) 120 (30.5) 0.018
old_acute_myocardial_infarction = Yes (%) 110 ( 8.5) 36 ( 9.2) 0.770
other_forms_of_ischemic_heart_disease = Yes (%) 309 (23.9) 109 (27.7) 0.144
rheumatic_heart_disease = Yes (%) 51 ( 4.0) 20 ( 5.1) 0.401
cardiomyopathies = Yes (%) 82 ( 6.4) 30 ( 7.6) 0.437
other_cardiac_diseases = Yes (%) 67 ( 5.2) 27 ( 6.9) 0.252
conduction_disorders_cardiac_dysrhythmias = Yes (%) 215 (16.7) 86 (21.9) 0.022
cerebrovascular_diseases = Yes (%) 151 (11.7) 58 (14.8) 0.127
vascular_diseases = Yes (%) 98 ( 7.6) 42 (10.7) 0.065
chronic_obstructive_pulmonary_disease = Yes (%) 122 ( 9.5) 58 (14.8) 0.004
chronic_nephropathies = Yes (%) 207 (16.0) 94 (23.9) <0.001
old_percutaneous_transluminal_coronary_angioplasty = Yes (%) 44 ( 3.4) 11 ( 2.8) 0.665
antidiabetic_drugs = Yes (%) 235 (18.2) 96 (24.4) 0.008
drugs_for_cardiac_therapy = Yes (%) 327 (25.3) 131 (33.3) 0.002
drugs_for_obstructive_airway_diseases = Yes (%) 180 (13.9) 68 (17.3) 0.118
antihypertensive_drugs = Yes (%) 1084 (84.0) 360 (91.6) <0.001
statins = Yes (%) 315 (24.4) 102 (26.0) 0.577
antiplatelet_drugs = Yes (%) 562 (43.5) 181 (46.1) 0.410
gender_gp = Male (%) 398 (30.8) 90 (22.9) 0.003
age_gp (mean (SD)) 55.93 (4.96) 56.33 (4.73) 0.154
org_arrangement (%) 0.714
Group practice 690 (53.4) 211 (53.7)
Network 307 (23.8) 86 (21.9)
None 281 (21.8) 90 (22.9)
Simple association 13 ( 1.0) 6 ( 1.5)
n_patients (%) 0.953
<1000 158 (12.2) 47 (12.0)
>=1500 576 (44.6) 173 (44.0)
1000-1499 557 (43.1) 173 (44.0)
ambulatory_area (%) 0.105
Mountain 89 ( 6.9) 19 ( 4.8)
Rural 633 (49.0) 180 (45.8)
Urban 569 (44.1) 194 (49.4)
n_gps = >15 (%) 655 (50.7) 215 (54.7) 0.186
nurses_hours = >350 (%) 607 (47.0) 184 (46.8) 0.991
hf_pathway = Yes (%) 620 (48.0) 179 (45.5) 0.422
readmitted_90 (mean (SD)) 0.00 (0.00) 1.00 (0.00) <0.001

Afterwards, we establish weights for the classes based on the proportionality determined earlier. The cases that have a lower overall proportion, such as readmitted cases, receive a greater weight compared to cases that are present in higher proportion, such as non-readmitted cases. The purpose of this is to rectify class imbalance in the training set.

Implementing the logistic regression model gave interesting results (Table 6). When considering all the different variables by themselves, the regression model found multiple significant predictors of readmission to hospitals within 90 days. A comforting finding was that our model also found that the hf pathway was a protective factor for predicting readmission into hospitals. In other words, just as the original paper discovered, being enrolled in a heart failure specific care pathway with their primacy care physician resulted in fewer readmissions. All other significant predictors had a positive correlation (ie pateints had a higher likelyhood of being readmitted if they had the indicated variable).

I was also interested in looking at the interactions between varaibles. I thought it might be interesting to look a patients who are taking multiple medications. Since both antihypertensive and antidiabetic drugs were significantly associated with a higher risk of readmission (coefficients of 0.85 and 1.43, respectively with p values below 0.05), I thought that patients who are taking both might have an even higher risk of readmission due to these patients having multiple condisions that have cardiovascular effects. Interestingly, the model shows that patients taking both have significantly lower risk of being readmitted. I also looked at whether having COPD and taking medications for the disease might increase the risk of readmissions since these patients would presumably have more severe disease. However, this did not appear to be significant.

The classification tree found that similar variables were predictive of readmission to hospitals within 90 days. In the model that was created, taking antihypertensive drugs was the most powerful predictor of readmission followed by the length of stay and whether patients had COPD. The tree model was developed by first using a very low complexity parameter. I then used plotcp to look at which value the relative error is minimized. Interestingly, these values do not corespond with the values that were generated by printcp. Based off of the plot, I pruned the tree to a cp of 0.019.

In order to evaluate the two models, I used the withheld test data and calculated the MSE between the predicted and actual values. According to these metrics, the decision tree was a better modeling system for this data set. I also reran the models without using the class weighting and this resulted in much worse modeling systems, as one would expect.

## 
## Classification tree:
## rpart(formula = readmitted_90 ~ ., data = combotrain, weights = weights, 
##     method = "class", cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] age_gp                                   
##  [2] age_pt                                   
##  [3] ambulatory_area                          
##  [4] antidiabetic_drugs                       
##  [5] antihypertensive_drugs                   
##  [6] antiplatelet_drugs                       
##  [7] cardiomyopathies                         
##  [8] chronic_nephropathies                    
##  [9] chronic_obstructive_pulmonary_disease    
## [10] conduction_disorders_cardiac_dysrhythmias
## [11] drugs_for_cardiac_therapy                
## [12] drugs_for_obstructive_airway_diseases    
## [13] gender_gp                                
## [14] gender_pt                                
## [15] hf_pathway                               
## [16] hypertensive_disease                     
## [17] los                                      
## [18] n_gps                                    
## [19] n_patients                               
## [20] nurses_hours                             
## [21] org_arrangement                          
## [22] other_forms_of_ischemic_heart_disease    
## [23] rheumatic_heart_disease                  
## [24] vascular_diseases                        
## 
## Root node error: 969/1264 = 0.76661
## 
## n= 1264 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0567595      0   1.00000 1.00000 0.023805
## 2  0.0196078      1   0.94324 0.98762 0.023777
## 3  0.0175439      6   0.83901 0.97420 0.023743
## 4  0.0144479      7   0.82147 1.00516 0.023815
## 5  0.0092879      9   0.79257 0.98658 0.023774
## 6  0.0072239     12   0.76471 1.00929 0.023824
## 7  0.0067079     17   0.72549 1.03096 0.023861
## 8  0.0061920     21   0.69866 1.03199 0.023863
## 9  0.0056760     25   0.67389 1.03302 0.023864
## 10 0.0051600     27   0.66254 1.03199 0.023863
## 11 0.0048160     35   0.61197 1.03818 0.023872
## 12 0.0043860     38   0.59752 1.05160 0.023888
## 13 0.0041280     43   0.57276 1.03818 0.023872
## 14 0.0036120     58   0.48607 1.03302 0.023864
## 15 0.0025800     62   0.47162 1.05470 0.023892
## 16 0.0024080     65   0.46336 1.05057 0.023887
## 17 0.0020640     68   0.45614 1.04850 0.023885
## 18 0.0010320     71   0.44995 1.06089 0.023898
## 19 0.0010000     72   0.44892 1.07018 0.023906
Results of a logistic regression model accounting for severe COPD
Num Splits CP value MSE
1 0.0568 0.2524
6 0.0196 0.2561
7 0.0175 0.2666
9 0.0144 0.2696
11 0.0093 0.2729
12 0.0072 0.2800
Results of a logistic regression model accounting for severe COPD
Descision Tree Logistic Regression Model
Weighted 0.2666128 0.4648358
Not Weighted 0.3377175 2.5541276

Code Appendix:

knitr::opts_chunk$set(echo = FALSE)
options(tinytex.verbose = TRUE)
library(Rfast)
library(MASS)
library(kernlab)
library(Rtsne)
library(tidyverse)
library(readr)
library(parallel)
library(grid)
library(gridExtra)
library(ggplot2)
library(knitr)
library(kableExtra)
library(tinytex)
library(Deriv)
library(plotly)
library(modelr)
library(tableone)
library(lubridate)
library(rpart)
library(car)
library(generics)
#The code for a simple vanilla gradient descent
gradient_desc <- function(f, x_init){
    
    # Parameters
    max_iters <- 50
    learning_rate <- 0.01
    iters <- 0
    
    # Starting conditions
    x <- rep(0,max_iters)
    x[1] <- x_init
    f_der <- Deriv(f)
    
    for (i in 2:max_iters){
        
        # Calculate gradient
        grad <- f_der(x[i-1])
        x[i] <- x[i-1]-learning_rate*grad
    }
    return(x)
}

# non-convex function
f <- function(x) ((x^2-4*x+4)*(x^2+4*x+2))
x <- gradient_desc(f, 4)
y <- f(x)
df <- data.frame(x=x, y=y, time=1:length(x))

f_df <- data.frame(x=seq(-4,4,0.1))
f_df$y <- f(f_df$x)

p <- ggplot() + 
  geom_line(data=f_df, aes(x,y), color="blue") +
  geom_point(data=df, aes(x,y,frame = time)) +
  xlim(-4,4)
#ggplotly(p)

plot1 <- ggplot() + 
  geom_line(data=f_df, aes(x,y), color="blue") +
  geom_point(data=df[nrow(df),], aes(x,y,frame = time)) +
  xlim(-4,4) +
  labs(title = "Plot 1: Vanilla Gradient Descent")
plot1


adam_func <- function(f, x_init){
    
    # Parameters
    max_iters <- 2000
    learning_rate <- 0.75
    beta1 <- 0.9
    beta2 <- 0.99
    eps <- 1e-8
    
    # Starting conditions
    x <- rep(0,max_iters)
    x[1] <- x_init
    m <- 0
    v <- 0
    f_der <- Deriv(f)
    
    for (i in 2:max_iters){
        grad <- f_der(x[i-1])
        m <- beta1*m + (1-beta1)*grad
        v <- beta2*v + (1-beta2)*grad^2
        m_2 <- m/(1-beta1^i)
        v_2 <- v/(1-beta2^i)
        x[i] <- x[i-1] - learning_rate*m_2/(sqrt(v_2) + eps)
    }
    return(x)
}


# non-convex function with multiple local minima
f2 <- function(x) (x^6 - 15*x^4 + 65*x^2 + 20*x + 10)
x2 <- adam_func(f2, 4)
y2 <- f2(x2)
df2 <- data.frame(x=x2, y=y2, time=1:length(x2))

f_df2 <- data.frame(x=seq(-4,4,0.1))
f_df2$y <- f2(f_df2$x)

p2 <- ggplot() + 
  geom_line(data=f_df2, aes(x,y), color="blue") +
  geom_point(data=df2, aes(x,y,frame = time)) +
  xlim(-4,4)

plot3 <- ggplot() + 
  geom_line(data=f_df2, aes(x,y), color="blue") +
  geom_point(data=df2[nrow(df2),], aes(x,y,frame = time)) +
  xlim(-4,4) +
  labs(title = "Plot 3: Adam Descent with a different function")
plot3

# non-convex function
f1 <- function(x) ((x^2-4*x+4)*(x^2+4*x+2))
x1 <- adam_func(f1, 4)
y1 <- f1(x1)
df1 <- data.frame(x=x1, y=y1, time=1:length(x1))

f_df1 <- data.frame(x=seq(-4,4,0.1))
f_df1$y <- f1(f_df1$x)

p1 <- ggplot() + 
  geom_line(data=f_df1, aes(x,y), color="blue") +
  geom_point(data=df1, aes(x1,y1,frame = time)) +
  xlim(-4,4)

plot2 <- ggplot() + 
  geom_line(data=f_df1, aes(x,y), color="blue") +
  geom_point(data=df1[nrow(df1),], aes(x,y,frame = time)) +
  xlim(-4,4) +
  labs(title = "Plot 2: Adam Descent")
plot2
# non-convex function with huge hump between local and global minima
f3 <- function(x) (x^6 - 15*x^4 + x^2 + 20*x + 10)
x3 <- adam_func(f3, 4)
y3 <- f3(x3)
df3 <- data.frame(x=x3, y=y3, time=1:length(x3))

f_df3 <- data.frame(x=seq(-4,4,0.1))
f_df3$y <- f3(f_df3$x)

p3 <- ggplot() + 
  geom_line(data=f_df3, aes(x,y), color="blue") +
  geom_point(data=df3, aes(x3,y3,frame = time)) +
  xlim(-4,4)

plot4 <- ggplot() + 
  geom_line(data=f_df3, aes(x,y), color="blue") +
  geom_point(data=df3[nrow(df3),], aes(x,y,frame = time)) +
  xlim(-4,4) +
  labs(title = "Plot 4: Adam Descent with a different function: The Edge Case")

plot4
# Generate data
set.seed(123)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
y <- 2 + 3*x1 - 4*x2 + 5*x3 + rnorm(100, 0, 1)
data <- data.frame(x1, x2, x3, y)

y2 <- 2 + 3*x1^2 - 4*x2 + 5*x3^3 + rnorm(100, 0, 1)
data_2 <- data.frame(x1, x2, x3, y2)

# Return gradient vector
grad <- function(beta, x, y) {
          y_pred <- x %*% beta
          error <- y_pred - y
          grad <- t(x) %*% error / length(y)
          return(grad)
}

# Re-define gradient descent, pulling out gradient directly
grad_desc <- function(X, y, alpha=0.01, max_iters=1000, tol=1e-6) {
  n <- nrow(X)
  p <- ncol(X)
  theta <- rep(0, p)
  iter <- 0
  for (i in 1:max_iters) {
    theta_new <- theta - alpha * grad(theta, X, y)
    if (sqrt(sum((theta_new - theta)^2))/sqrt(sum(theta^2)) < tol) {
      break
    }
    theta <- theta_new
    iter <- i
  }
  return(list(theta=theta, iter=iter, alpha = alpha))
}

# Re-define ADAM function without Deriv function, extracting gradient directly
adam <- function(X, y, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, max_iters=1000, tol=1e-6) {
  n <- nrow(X)
  p <- ncol(X)
  theta <- rep(0, p)
  m <- 0
  v <- 0
  iter <- 0
  for (i in 1:max_iters) {
    grad_theta <- grad(theta, X, y)
    m <- beta1 * m + (1 - beta1) * grad_theta
    v <- beta2 * v + (1 - beta2) * grad_theta^2
    m_hat <- m / (1 - beta1^i)
    v_hat <- v / (1 - beta2^i)
    theta_new <- theta - alpha * m_hat / (sqrt(v_hat) + eps)
    if (sqrt(sum((theta_new - theta)^2))/sqrt(sum(theta^2)) < tol) {
      break
    }
    theta <- theta_new
    iter <- i
  }
  return(list(theta=theta, iter=iter, alpha = alpha))
}

# Fit models
X <- as.matrix(data[,1:3])
Y <- data[,4]

X2 <- as.matrix(data_2[,1:3])
Y2 <- data_2[,4]

gd_theta <- grad_desc(X, Y)
adam_theta <- adam(X, Y)
adam_theta_2 <- adam(X, Y, max_iters=10000)


GD <- matrix(,4,1)
  GD[1:3] <- round(as.matrix(gd_theta$theta), 2)
  GD[4] <- as.character(gd_theta$iter)
  
ADAM <- matrix(,4,1)
ADAM[1:3] <- round(as.matrix(adam_theta$theta), 2)
ADAM[4] <- as.character(adam_theta$iter)
  
ADAM2 <- matrix(,4,1)
ADAM2[1:3] <- round(as.matrix(adam_theta_2$theta), 2)
ADAM2[4] <- as.character(adam_theta_2$iter)

# Compare coefficients and total number of iterations per model
coefficients_df <- data.frame(GD, ADAM, ADAM2)
rownames(coefficients_df) <- c("ß0", "ß1", "ß2", "Iterations")
  
coefficients_df
#kable(coefficients_df, align = "c", caption = "Table 1: Applying the two algorithms to a linear regression model")

#For the more complicated formula
gd_theta_new <- grad_desc(X2, Y2)
adam_theta_2_new <- adam(X2, Y2, max_iters=1000000)

GD_new <- matrix(,4,1)
  GD_new[1:3] <- round(as.matrix(gd_theta_new$theta), 2)
  GD_new[4] <- as.character(gd_theta_new$iter)
  
  
ADAM2_new <- matrix(,4,1)
  ADAM2_new[1:3] <- round(as.matrix(adam_theta_2_new$theta), 2)
  ADAM2_new[4] <- as.character(adam_theta_2_new$iter)
  
coefficients_df_new <- data.frame(GD_new, ADAM2_new)
rownames(coefficients_df_new) <- c("ß0", "ß1", "ß2", "Iterations")
colnames(coefficients_df_new) <- c("GD", "ADAM")

#kable(coefficients_df_new, align = "c", caption = "Table 2: Applying the two algorithms to more complicated problems")
coefficients_df_new
#Comparing different learning rates
gd_theta_lr1 <- grad_desc(X2, Y2)
gd_theta_lr2 <- grad_desc(X2, Y2, alpha = 0.001, max_iters=1000000)
adam_theta_lr1 <- adam(X2, Y2, max_iters=1000000)
adam_theta_lr2 <- adam(X2, Y2, max_iters=1000000, alpha = 0.1)
adam_theta_lr3 <- adam(X2, Y2, max_iters=1000000, alpha = 0.5)

GD_lr1 <- matrix(,5,1)
  GD_lr1[1:3] <- round(as.matrix(gd_theta_lr1$theta), 2)
  GD_lr1[4] <- as.character(gd_theta_lr1$iter)
  GD_lr1[5] <- as.character(gd_theta_lr1$alpha)
  
GD_lr2 <- matrix(,5,1)
  GD_lr2[1:3] <- round(as.matrix(gd_theta_lr2$theta), 2)
  GD_lr2[4] <- as.character(gd_theta_lr2$iter)
  GD_lr2[5] <- as.character(gd_theta_lr2$alpha)
  
ADAM2_lr1 <- matrix(,5,1)
  ADAM2_lr1[1:3] <- round(as.matrix(adam_theta_lr1$theta), 2)
  ADAM2_lr1[4] <- as.character(adam_theta_lr1$iter)
  ADAM2_lr1[5] <- as.character(adam_theta_lr1$alpha)
  
ADAM2_lr2 <- matrix(,5,1)
  ADAM2_lr2[1:3] <- round(as.matrix(adam_theta_lr2$theta), 2)
  ADAM2_lr2[4] <- as.character(adam_theta_lr2$iter)
  ADAM2_lr2[5] <- as.character(adam_theta_lr2$alpha)
  
ADAM2_lr3 <- matrix(,5,1)
  ADAM2_lr3[1:3] <- round(as.matrix(adam_theta_lr3$theta), 2)
  ADAM2_lr3[4] <- as.character(adam_theta_lr3$iter)
  ADAM2_lr3[5] <- as.character(adam_theta_lr3$alpha)  
  
coefficients_df_lr <- data.frame(GD_lr1, GD_lr2, ADAM2_lr1, ADAM2_lr2, ADAM2_lr3)
rownames(coefficients_df_lr) <- c("ß0", "ß1", "ß2", "Iterations", "Learning Rate")
colnames(coefficients_df_lr) <- c("GD", "GD", "ADAM", "ADAM", "ADAM")

#kable(coefficients_df_lr, align = "c", caption = "Table 3: Applying the two algorithms using different learning rates")
coefficients_df_lr
#Loading in data
hosp <- read.csv("hospital_readmissions.csv")

#str(hosp)
#dim(hosp)

#Database cleaning and reorganizing 
new_names <- c("malignant_tumors", "diabetes", 
               "disorders_of_lipoid_metabolism", 
               "obesity","hematologic_diseases","hypertensive_disease", 
               "old_acute_myocardial_infarction", 
               "other_forms_of_ischemic_heart_disease", 
               "ill_defined_descriptions_of_heart_disease", 
               "rheumatic_heart_disease", "cardiomyopathies", 
               "other_cardiac_diseases", 
               "conduction_disorders_cardiac_dysrhythmias", 
               "cerebrovascular_diseases", "vascular_diseases", 
               "chronic_obstructive_pulmonary_disease", 
               "chronic_nephropathies", 
               "chronic_diseases_other", "old_bypass", 
               "old_percutaneous_transluminal_coronary_angioplasty", 
               "other_surgery_of_the_heart", 
               "other_surgery_of_great_vessels", 
               "antidiabetic_drugs", "drugs_for_cardiac_therapy", 
               "drugs_for_obstructive_airway_diseases", 
               "antihypertensive_drugs", 
               "statins", "antiplatelet_drugs")

colnames(hosp)[7:34] <- new_names

hosp[hosp==""] <- NA

hosp <- hosp %>% 
  filter(is.na(death_dt) | (mdy(death_dt)-mdy(dt_discharge) > 90))

hosp$readmitted_90 <- ifelse(hosp$n_reric90 > 0, 1, 0)

hosp <- hosp %>% select(-c('dt_admission', 'dt_discharge',
                                 'n_reric30','n_reric30_hf',
                                 'dt_end_fup30', 'n_reric90_hf',
                                 'dt_end_fup90', 'n_reric180',
                                 'n_reric180_hf', 'dt_end_fup180',
                                 'n_reric365', 'n_reric365_hf',
                                 'dt_end_fup365', 'death_dt', 
                                 'n_reric90'))
# Drop unique identifiers
hosp <- hosp %>% select(-c('id_pt', 'id_ncp', 'id_gp'))

# Mutation of categorical variables into binary variables for logistic regression


#Removing any variables that have a very low frequency
Percent_Yes <- 0
for(i in c(4:33, 41)){
  Yes <- hosp[,i] == "Yes"
  Percent_Yes[i-3] <- sum(Yes)/nrow(hosp)
}

hosp <- hosp[, c(TRUE, TRUE, TRUE, Percent_Yes[1:30] > 0.03, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)]

tab1 <- CreateTableOne(data=hosp, strata=c("readmitted_90"))

kableone(tab1, caption = "Table 4: Summarized pre-processed data")

#Creating the test and train sets (while accounting for the two subgroups)
set.seed(1)

readmitted <- hosp %>% filter(readmitted_90 == 1) 
noreadmiss <- hosp %>% filter(readmitted_90 == 0) 

readmittedindex <- sample(1:nrow(readmitted), floor(0.25*nrow(readmitted)), replace = FALSE)
noreadmissindex <- sample(1:nrow(noreadmiss), floor(0.25*nrow(noreadmiss)), replace = FALSE)

readtest <- readmitted[readmittedindex, ]
readtrain <- readmitted[-readmittedindex, ]

noretest <- noreadmiss[noreadmissindex,]
noretrain <- noreadmiss[-noreadmissindex,]

combotest <- rbind(readtest, noretest)
combotrain <- rbind(readtrain, noretrain)

#Change yes/no to 1/0 respectively to allow for weighting of variables 
#combotest[combotest=="Yes"] <- 1
#combotest[combotest=="No"] <- 0

#combotrain[combotrain=="Yes"] <- 1
#combotrain[combotrain=="No"] <- 0

#Makes all of the values numeric so they can be properly weighted
#combotest[4:37] <- apply(combotest[4:37], 2, as.numeric)
#combotrain[4:37] <- apply(combotrain[4:37], 2, as.numeric)


# Define weights for observations in Training set
weight0 <- round(nrow(combotrain)/sum(combotrain$readmitted_90==0))
weight1 <- round(nrow(combotrain)/sum(combotrain$readmitted_90==1))
weights <- ifelse(combotrain$readmitted_90==0, weight0, weight1)

#Check and display the proportions of the split test and train data sets
trainprop <- c(nrow(combotrain), nrow(combotrain)/(nrow(combotest)+nrow(combotrain)), sum(combotrain$readmitted_90==0)/nrow(combotrain), sum(combotrain$readmitted_90==1)/nrow(combotrain))
testprop <- c(nrow(combotest), nrow(combotest)/(nrow(combotest)+nrow(combotrain)), sum(combotest$readmitted_90==0)/nrow(combotest), sum(combotest$readmitted_90==1)/nrow(combotest))

testtrainprop <- data.frame(c(as.character(trainprop[1]), round(trainprop[2] * 100, 3), round(trainprop[3] * 100, 3), round(trainprop[4] * 100, 3)), c(as.character(testprop[1]), round(testprop[2] * 100, 3), round(testprop[3] * 100, 3), round(testprop[4] * 100, 3)))
rownames(testtrainprop) <- c("# of Cases", "% of Cases", "% Not Readmitted", "% Readmitted")
colnames(testtrainprop) <- c("Training Set", "Testing Set")
testtrainprop
#kable(testtrainprop, align = "c", caption = "Table 5: A summary of test and train data sets")
#Apply logistic regression to weighted test set
model1 <- step(
  glm(
    readmitted_90 ~ ., 
    data = combotrain,
    family = "binomial",
    weights = weights
  ),
  direction = "backward", trace = FALSE
)

modelinteractions <- step(
  glm(
    readmitted_90 ~ . 
    + antidiabetic_drugs:antihypertensive_drugs, # Interaction terms 
    data = combotrain,
    family = "binomial",
    weights = weights
  ),
  direction = "backward", trace = FALSE
)

modelinteractions2 <- step(
  glm(
    readmitted_90 ~ . 
    + chronic_obstructive_pulmonary_disease:drugs_for_obstructive_airway_diseases, # Interaction terms 
    data = combotrain,
    family = "binomial",
    weights = weights
  ),
  direction = "backward", trace = FALSE
)

modelnoweight <- step(
  glm(
    readmitted_90 ~ ., # Interaction terms
    data = combotrain,
    family = "binomial"
  ),
  direction = "backward", trace = FALSE
)

#model1 <- step(
#  glm(
#    readmitted_90 ~ . 
#      , # Interaction terms
#    data = combotrain,
#    family = "binomial",
#    weights = weights
#  ),
#  direction = "backward"
#)

# Check the summary of the model & display as table
summarytable <- tidy(model1)
summarytable <- summarytable[-1,-c(3:4)]
summarytable$p.value <- ifelse(summarytable$p.value < 0.05, 
                             cell_spec(round(summarytable$p.value, 3), "latex", 
                             bold = TRUE), 
                             round(summarytable$p.value, 3)) 
summarytable <- as.data.frame(summarytable)
summarytable
#summarytable %>% kbl(caption = "Results of a logistic regression model") %>% 
#  kable_styling(full_width = FALSE, latex_options = c('hold_position'))


summarytable2 <- tidy(modelinteractions)
summarytable2 <- summarytable2[-1,-c(3:4)]
summarytable2$p.value <- ifelse(summarytable2$p.value < 0.05, 
                             cell_spec(round(summarytable2$p.value, 3), "latex", 
                             bold = TRUE), 
                             round(summarytable2$p.value, 3))
summarytable2 <- as.data.frame(summarytable2)
summarytable2
#summarytable2 %>% kbl(caption = "Results of a logistic regression model looking at the interaction of taking anti-diabetic and anti-hypertensive medications") %>% kable_styling(full_width = FALSE, latex_options = c('hold_position'))

summarytable3 <- tidy(modelinteractions2)
summarytable3 <- summarytable3[-1,-c(3:4)]
summarytable3$p.value <- ifelse(summarytable3$p.value < 0.05, 
                             cell_spec(round(summarytable3$p.value, 3), "latex", 
                             bold = TRUE), 
                             round(summarytable3$p.value, 3))
summarytable3 <- as.data.frame(summarytable3)
summarytable3
#summarytable3 %>% kbl(caption = "Results of a logistic regression model accounting for severe COPD") %>% kable_styling(full_width = FALSE, latex_options = c('hold_position'))

set.seed(1)

#Generating the tree models with a very low cp
treemodel <- rpart(readmitted_90 ~ ., combotrain, weights = weights, cp = 0.000001, method = "class")
treenoweight <- rpart(readmitted_90 ~ ., combotrain, cp = 0.000001, method = "class")

#Plotcp will be used to decide which cp to use when pruning the tree
#plot(treemodel)
plotcp(treemodel)
#printcp(treemodel)

#Generating the final trees (one with weights, one without)
prunetree <- rpart::prune(treemodel, cp=0.019)
prunetreenoweight <- rpart::prune(treenoweight, cp=0.0063)

plot(prunetree)
text(prunetree)

#Looking at the effect of picking different CP values on MSE
set.seed(1)
treefunc <- rpart(readmitted_90 ~ ., combotrain, weights = weights, cp = 0.001, method = "class")
cp <- printcp(treefunc)
msetree <- 0

#This for loop will iterate through different cp values (I decided to use the first six values) 
#It will then calculate the MSE for each model 
for(i in 1:6){
  prunetreefunc <- rpart::prune(treemodel, cp= cp[i])
  predfunctree <- predict(prunetreefunc, combotest)
  msetree[i] <- round(mean((predfunctree - combotest$readmitted_90)^2), 4)
  }
cpmsedf <- data_frame(c(1,6,7,9,11,12), c(round(cp[1:6], 4)), msetree)
names(cpmsedf) <- c("Num Splits", "CP value", "MSE")


cpmsedf %>% kbl(caption = "Results of a logistic regression model accounting for severe COPD") %>%
  kable_styling(full_width = FALSE, latex_options = c('hold_position')) %>%
  row_spec(2, bold = TRUE) %>%
  column_spec(1, width = "33%") %>%
  column_spec(2, width = "33%") %>%
  column_spec(3, width = "33%")


#Looking at the effect that weighting has on the MSE
predreadmittree <- predict(prunetree, combotest)
predtreenoweight <- predict(prunetreenoweight, combotest)

testmsetree <- mean((predreadmittree - combotest$readmitted_90)^2)

testmsetreenoweight <- mean((predtreenoweight - combotest$readmitted_90)^2)

#test_mae <- mean(abs(bike_pred - bike_test$count))
#test_mae

predreadmitglm <- predict(model1, combotest)
testmseglm <- mean((predreadmitglm - combotest$readmitted_90)^2)


prednoweightglm <- predict(modelnoweight, combotest)
noweightmseglm <- mean((prednoweightglm - combotest$readmitted_90)^2)


msetable <- data_frame(c("Weighted", "Not Weighted"), c(testmsetree, testmsetreenoweight), c(testmseglm, noweightmseglm))
names(msetable) <- c("", "Descision Tree", "Logistic Regression Model")

msetable %>% kbl(caption = "Results of a logistic regression model accounting for severe COPD") %>%
  kable_styling(full_width = FALSE, latex_options = c('hold_position')) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(1, width = "33%") %>%
  column_spec(2, width = "33%") %>%
  column_spec(3, width = "33%")