Logistic regression is a simple model that can also be used for inference. The final model produced in this section had an overall accuracy of 0.817, a sensitivity of 0.753, a specificity of 0.854, and an AUC of 0.882 on the 2008 holdout set.

Setup

library(tidyverse)
library(caret)
library(mfstylr)
library(flextable)
load("./grant_01.RData")

Background

Logistic regression estimates the probability of a particular level of a categorical response variable given a set of predictors. The response levels can be binary, nominal (multiple categories), or ordinal (multiple levels).

The binary logistic regression model is

\[y = \mathrm{logit}(\pi) = \ln \left( \frac{\pi}{1 - \pi} \right) = X \beta\]

where \(\pi\) is the event probability. The model predicts the log odds of the response variable. The maximum likelihood estimator maximizes the likelihood function

\[L(\beta; y, X) = \prod_{i=1}^n \pi_i^{y_i}(1 - \pi_i)^{(1-y_i)} = \prod_{i=1}^n\frac{\exp(y_i X_i \beta)}{1 + \exp(X_i \beta)}.\]

There is no closed-form solution, so glm() estimates the coefficients with iteratively reweighted least squares (IRLS).
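
To make IRLS concrete, here is a bare-bones version of the update loop. This is an illustrative sketch only (irls_logit is a hypothetical helper, not anything the models below use); in practice glm(..., family = binomial) does this internally.

# Bare-bones IRLS for binomial logistic regression (illustration only).
# X: model matrix including an intercept column; y: 0/1 response.
irls_logit <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    eta  <- as.vector(X %*% beta)   # linear predictor
    pi_i <- 1 / (1 + exp(-eta))     # fitted probabilities
    w    <- pi_i * (1 - pi_i)       # IRLS weights
    z    <- eta + (y - pi_i) / w    # working response
    # weighted least squares step: solve (X'WX) beta = X'Wz
    beta_new <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}
# e.g., irls_logit(model.matrix(~ Day, training),
#                  as.numeric(training$Class == "successful"))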

Feature Engineering

Logistic regression is a linear model: it models the log odds as a linear function of the predictors. When the underlying relationship is not linear, you can often transform a predictor to create one. One method for doing this uses restricted cubic splines to create flexible, adaptive versions of the predictors.

The Grants data set has 24 continuous variables. (By “continuous”, I mean numeric with several different values, and I decided “several” should mean “at least 7”.)

continuous_vars <- training[pre2008, reducedSet] %>%
  map(n_distinct) %>%
  unlist() %>%
  subset(. >= 7) %>%
  names()
continuous_vars
##  [1] "A.CI"         "A.PS"         "Astar.CI"     "AstarTotal"   "ATotal"      
##  [6] "B.CI"         "BTotal"       "C.CI"         "C.PS"         "CI.Australia"
## [11] "CI.Faculty19" "CI.Faculty25" "CI.Faculty31" "CI.PhD"       "CTotal"      
## [16] "Day"          "Duration0to5" "DurationUnk"  "NumCI"        "NumECI"      
## [21] "Success.CI"   "Success.PS"   "Unsuccess.CI" "Unsuccess.PS"

The rms::lrm() function, with helper function rms::rcs(), fits a logistic regression on a restricted cubic spline basis. Fourteen predictors returned significant (p < .05) higher-degree terms.
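
For a single predictor the fit looks like this (a minimal sketch; the three-knot count is an arbitrary choice for illustration):

rms::lrm(Class ~ rms::rcs(Day, 3), data = training[pre2008, ])

The loop below applies the same idea to every continuous variable and keeps the terms with significant nonlinear coefficients.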

xforms <- data.frame(feature = continuous_vars) %>%
  mutate(
    lrm_formula = map(feature, ~ as.formula(paste0("Class ~ rms::rcs(", ., ")"))),
    lrm_obj = map(lrm_formula, 
                  possibly(~ rms::lrm(., data = training[pre2008, ]),
                           otherwise = NULL)),
    x = map(
      lrm_obj, 
      ~ tibble(est = .$coefficients[-1], 
               se = sqrt(diag(.$var)[-1])) %>% 
        rownames_to_column(var = "degree")
      )
    ) %>%
  unnest(x) %>%
  mutate(
    z = est / se,
    p = pnorm(abs(z), lower.tail = FALSE) * 2
  )  %>% 
  filter(p < .05 & degree > "1")
## singular information matrix in lrm.fit (rank= 4 ).  Offending variable(s):
## C.PS''' 
## Fewer than 3 unique knots.  Frequency table of variable:
## x
##    0    1    2    3    4    5    8   11 
## 6117  408   80   17    4    5    1    1

xforms %>% select(-lrm_formula, -lrm_obj)
## # A tibble: 22 x 6
##    feature      degree    est     se     z         p
##    <fct>        <chr>   <dbl>  <dbl> <dbl>     <dbl>
##  1 A.CI         2       1.95  0.616   3.17 0.00150  
##  2 A.CI         3      -2.78  0.889  -3.13 0.00177  
##  3 ATotal       2       1.37  0.396   3.47 0.000525 
##  4 ATotal       3      -2.46  0.728  -3.38 0.000738 
##  5 B.CI         2       1.68  0.462   3.64 0.000269 
##  6 B.CI         3      -2.63  0.730  -3.60 0.000317 
##  7 BTotal       2       2.12  0.498   4.25 0.0000216
##  8 BTotal       3      -3.28  0.783  -4.19 0.0000276
##  9 CI.Australia 2       0.145 0.0666  2.18 0.0293   
## 10 CI.Faculty25 2      -0.447 0.104  -4.29 0.0000177
## # ... with 12 more rows

Add these transforms to the training data set and call it training2.

# For each significant nonlinear term, build a new column equal to the
# predictor raised to the term's degree (e.g., Day and degree 2 -> Day2 = Day^2).
new_cols <- xforms %>%
  mutate(
    new_col_name = paste0(feature, degree),
    dat = map(feature, ~ training[, colnames(training) == .]),
    new_col_vals = map2(.x = dat, .y = degree, ~ .x^as.numeric(.y))
  )

# Reshape to one (list-)column per new predictor.
new_cols2 <- new_cols %>%
  pivot_wider(id_cols = new_col_name, names_from = new_col_name, values_from = new_col_vals)

# Bind each new column onto a copy of the training data.
training2 <- training
for (col_name in colnames(new_cols2)) {
  x <- data.frame(new_cols2[[col_name]][[1]])
  colnames(x) <- col_name
  training2 <- bind_cols(training2, x)
}
reducedSet2 <- c(reducedSet, colnames(new_cols2))
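
A quick sanity check that the loop did what it should; the new columns are simple powers of the originals (Day2 shows up again in the models below):

# Day2 should reproduce Day^2 exactly.
stopifnot("Day2" %in% colnames(training2))
all.equal(training2$Day2, training2$Day^2)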

Here is a log-odds plot of one of the predictors, Day, with the predicted values from the restricted cubic spline model overlaid.

preds <- rms::Predict(
  xforms[xforms$feature == "Day", ]$lrm_obj[[1]],
  Day = unique(training$Day),
  fun = function(x) -x  # flip the sign so yhat is on the same log-odds scale as the observed points
)
obs <- prop.table(table(training$Day, training$Class), margin = 1) %>%
  as.data.frame.matrix() %>% rownames_to_column(var = "Day") %>%
  mutate(
    Day = as.numeric(Day),
    odds = successful / unsuccessful,
    log_odds = log(odds)
  )
dat <- obs %>% inner_join(preds, by = "Day")
dat %>% 
  ggplot(aes(x = Day)) +
  geom_point(aes(y = log_odds), alpha = 0.6) +
  geom_line(aes(y = yhat)) +
  theme_mf() +
  labs(title = "Log-Odds of Class with fitted Restricted Cubic Spline")

Model

Train Control

train() will tune the model based on the AUC. To construct the ROC curve, set classProbs = TRUE to produce class probabilities rather than class predictions. Summary function twoClassSummary calculates the area under the ROC curve, the sensitivity, and the specificity.

The data splitting scheme defined in step 01 fits the model on the pre-2008 data and tunes with the 2008 holdout data. Both the pre-2008 data and the holdout are in training, indexed by the vector pre2008. This train-and-tune scheme is accomplished with method = "LGOCV" ("leave-group-out cross-validation"). There are no parameters to tune in this model, so the holdout set is just for comparing the model performance to other models. savePredictions = TRUE saves the 2008 holdout data predictions.

ctrl <- trainControl(
  method = "LGOCV",  
  summaryFunction = twoClassSummary,  
  classProbs = TRUE,
  index = list(TrainSet = pre2008),  
  savePredictions = TRUE
) 

Models

Fit the full-predictor-set model first. method = "glm" fits a generalized linear model. metric = "ROC" sets the summary metric used to select the optimal model. Again, there are no tuning parameters, so only one model is produced, but the AUC can be compared to other models, including the reduced-predictor-set model.

Model 1: Full Set

# Kuhn added a single transformed variable, but I've added 14 transformations
#training$Day2 <- training$Day^2
#testing$Day2 <- testing$Day^2
fullSet_Day2 <- c(fullSet, "Day2")
reducedSet_Day2 <- c(reducedSet, "Day2")

set.seed(476)
mdl_1 <- train(
  training2[, fullSet_Day2],
  y = training$Class,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
) 
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
mdl_1
## Generalized Linear Model 
## 
## 8190 samples
## 1076 predictors
##    2 classes: 'successful', 'unsuccessful' 
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (1 reps, 75%) 
## Summary of sample sizes: 6633 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.7851375  0.7614035  0.7720365

From the training set of 8,190 observations and 1,076 predictor variables, caret trained the model on the 6,633 pre-2008 observations indexed by pre2008. The resampling results are the performance estimates for the 2008 holdout set: an area under the ROC curve of 0.785, a sensitivity of 0.761, and a specificity of 0.772. The confusion matrix calculates other measures.

confusionMatrix(data = mdl_1$pred$pred, reference = mdl_1$pred$obs)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     successful unsuccessful
##   successful          434          225
##   unsuccessful        136          762
##                                           
##                Accuracy : 0.7681          
##                  95% CI : (0.7464, 0.7889)
##     No Information Rate : 0.6339          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5164          
##                                           
##  Mcnemar's Test P-Value : 3.629e-06       
##                                           
##             Sensitivity : 0.7614          
##             Specificity : 0.7720          
##          Pos Pred Value : 0.6586          
##          Neg Pred Value : 0.8486          
##              Prevalence : 0.3661          
##          Detection Rate : 0.2787          
##    Detection Prevalence : 0.4232          
##       Balanced Accuracy : 0.7667          
##                                           
##        'Positive' Class : successful      
## 

The prediction accuracy was 0.768.
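
The same number can be recovered straight from the saved holdout predictions:

# Accuracy is just the share of correct holdout predictions.
mean(mdl_1$pred$pred == mdl_1$pred$obs)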

Model 2: Reduced Set

Now fit the reduced-predictor-set model (the 253 predictors with frequency ratio >= 8190 / 50).

set.seed(476)
mdl_2 <- train(
  training2[, reducedSet_Day2],
  y = training$Class,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
) 
mdl_2
## Generalized Linear Model 
## 
## 8190 samples
##  253 predictor
##    2 classes: 'successful', 'unsuccessful' 
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (1 reps, 75%) 
## Summary of sample sizes: 6633 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8773467  0.7842105  0.8237082

Removing the low-frequency predictors improved the model performance. For the reduced set of 253 variables, the area under the ROC curve increased from 0.785 to 0.877, and the sensitivity and specificity also increased.

For logistic regression you can assess the slope coefficients with the Z statistic. Kuhn describes it as a "signal-to-noise ratio": the estimated slope divided by its standard error. The most important predictors, based on the Z statistic, were the number of unsuccessful and successful grants by chief investigators, contract value band F, and the squared numeric day of the year.

varImp(mdl_2, scale = FALSE)
## glm variable importance
## 
##   only 20 most important variables shown (out of 253)
## 
##                    Overall
## Unsuccess.CI        21.985
## Success.CI          20.896
## ContractValueBandF   6.521
## Day2                 6.467
## ContractValueBandD   6.172
## ContractValueBandG   6.155
## Sponsor2B            6.105
## ContractValueBandE   5.973
## ContractValueBandA   5.649
## Day                  5.348
## Sponsor4D            5.288
## Sponsor24D           5.179
## Sponsor6B            5.086
## ContractValueBandC   4.883
## ContractValueBandB   4.674
## Sponsor5A            4.402
## Sponsor62B           3.770
## Unsuccess.PS         3.546
## Tue                  3.503
## Mon                  3.443
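
These importance scores correspond to the absolute Z statistics from the underlying glm fit, so they can be recovered by hand from the model caret stores in finalModel (the values should match the table above):

# Z statistic = estimate / standard error, as reported by summary.glm().
# (The intercept appears here but is excluded from varImp's table.)
coefs <- summary(mdl_2$finalModel)$coefficients
head(sort(abs(coefs[, "z value"]), decreasing = TRUE))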

The predictions for the holdout set (the year 2008 grants) are in the sub-object pred.

head(mdl_2$pred) 
##         pred          obs successful unsuccessful rowIndex parameter Resample
## 1 successful   successful  0.9987831  0.001216919     6634      none TrainSet
## 2 successful   successful  0.8551938  0.144806179     6635      none TrainSet
## 3 successful   successful  0.9233975  0.076602460     6636      none TrainSet
## 4 successful   successful  0.9672439  0.032756136     6637      none TrainSet
## 5 successful unsuccessful  0.5684208  0.431579184     6638      none TrainSet
## 6 successful unsuccessful  0.9228940  0.077105974     6639      none TrainSet

train() saves predictions for every tuning parameter value. The column parameter identifies which model generated the predictions. This version of logistic regression has no tuning parameters, so parameter always equals "none". Use the confusion matrix to get the accuracy.

mdl_2_cm <- confusionMatrix(data = mdl_2$pred$pred, reference = mdl_2$pred$obs)  
mdl_2_cm
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     successful unsuccessful
##   successful          447          174
##   unsuccessful        123          813
##                                           
##                Accuracy : 0.8092          
##                  95% CI : (0.7888, 0.8285)
##     No Information Rate : 0.6339          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5966          
##                                           
##  Mcnemar's Test P-Value : 0.003716        
##                                           
##             Sensitivity : 0.7842          
##             Specificity : 0.8237          
##          Pos Pred Value : 0.7198          
##          Neg Pred Value : 0.8686          
##              Prevalence : 0.3661          
##          Detection Rate : 0.2871          
##    Detection Prevalence : 0.3988          
##       Balanced Accuracy : 0.8040          
##                                           
##        'Positive' Class : successful      
## 
plot(mdl_2_cm$table)

The accuracy was 0.8092, an improvement over the 0.768 from the full-set model. Generate the ROC curve with pROC::roc().

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
mdl_2_roc <- roc(
  response = mdl_2$pred$obs,
  predictor = mdl_2$pred$successful,
  levels = rev(levels(mdl_2$pred$obs))
)
## Setting direction: controls < cases
ggroc(mdl_2_roc) + 
  theme_minimal() + 
  ggtitle("Logistic Regression Model ROC Curve") + 
    geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "grey", linetype = "dashed")

auc(mdl_2_roc) 
## Area under the curve: 0.8773

Model 3: Reduced Set 2.0

Kuhn demonstrated the transformation principle with just the Day variable. I applied it across the entire reduced predictor set and found 14 variables to transform. What was the performance lift?

set.seed(476)
mdl_3 <- train(
  training2[, reducedSet2],
  y = training2$Class,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = ctrl
) 
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
mdl_3
## Generalized Linear Model 
## 
## 8190 samples
##  274 predictor
##    2 classes: 'successful', 'unsuccessful' 
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (1 reps, 75%) 
## Summary of sample sizes: 6633 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8822686  0.7526316  0.8541033

Adding the transformed variables improved the model performance. For the reduced and transformed set of 274 variables, the area under the ROC curve increased from 0.877 to 0.882. The sensitivity actually fell a little (from 0.784 to 0.753), but the specificity increased (from 0.824 to 0.854).
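
To see the lift visually, overlay the two ROC curves (a sketch reusing mdl_2_roc from above):

# Build the ROC for model 3 the same way as for model 2, then overlay both.
mdl_3_roc <- roc(
  response = mdl_3$pred$obs,
  predictor = mdl_3$pred$successful,
  levels = rev(levels(mdl_3$pred$obs))
)
ggroc(list(reduced = mdl_2_roc, transformed = mdl_3_roc)) +
  theme_minimal() +
  ggtitle("ROC: Reduced Set vs. Reduced Set 2.0")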

mdl_3_cm <- confusionMatrix(data = mdl_3$pred$pred, reference = mdl_3$pred$obs)  
mdl_3_cm
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     successful unsuccessful
##   successful          429          144
##   unsuccessful        141          843
##                                           
##                Accuracy : 0.817           
##                  95% CI : (0.7968, 0.8359)
##     No Information Rate : 0.6339          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6061          
##                                           
##  Mcnemar's Test P-Value : 0.9057          
##                                           
##             Sensitivity : 0.7526          
##             Specificity : 0.8541          
##          Pos Pred Value : 0.7487          
##          Neg Pred Value : 0.8567          
##              Prevalence : 0.3661          
##          Detection Rate : 0.2755          
##    Detection Prevalence : 0.3680          
##       Balanced Accuracy : 0.8034          
##                                           
##        'Positive' Class : successful      
## 

The accuracy improved from 0.8092 to 0.8170.
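
For reference, here are the holdout metrics of all three models side by side (values copied from the output above, not recomputed):

# 2008-holdout performance summary for the three logistic regression models.
tribble(
  ~model,                  ~ROC,  ~Sens, ~Spec, ~Accuracy,
  "Full set",              0.785, 0.761, 0.772, 0.768,
  "Reduced set",           0.877, 0.784, 0.824, 0.809,
  "Reduced + transformed", 0.882, 0.753, 0.854, 0.817
)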