Analysing the Efficacy of Miraculon-B: Phase III Clinical Trial

Author

Dr Owen Williams

Published

May 9, 2023

Description

GSK is developing a new anti cancer drug called ‘Miraculon-B’ which has just finished its Phase III clinical trial. The purpose of this clinical was to explore whether the efficacy of this new drug compared to standard level of care. A randomized case-control clinical trial consisted of around 768 was performed with around half prescribed the new drug while the control group maintained normal standard of care. This report will analyse these results to determine the efficacy of Miraculon-B and see whether this provides a better outcomes to patients than normal standard of care. This report will also examine whether protein concentration is an effective bio marker and also explore how other clinical features may effect patient response to the new drug.

1. Install packages

Load packages required to run. Install packages if required

library(tidyverse) # Data wrangling
library(readxl) # Read in excel files
library(stringr) # manipulate text
require(here) # manage files
require(rms) # Statistical regression
require(ggpubr) # Publishing figures
require(glmnet) # lasso regression
require(ROCit) # ROC analysis
require(grpreg) # Group lasso
require(yardstick) # assess model performance

2. Import Data Sets

Import and merge both excel data sets. perform initial tidying of data including converting non numeric predictors into factors and any missing data into ‘NA’s’

# set current directory
setwd(here::here())

# read data
train = readxl::read_excel('Data/clinical-study.xlsx') %>%
  mutate(across(everything(), ~replace(.x, .x == 'NA', NA))) %>%
  mutate(sex = factor(sex, levels = c('Male', 'Female')),
         trt_grp = factor(trt_grp, levels = c('CONTROL', 'DRUG')),
         RESPONSE = factor(RESPONSE, levels = c('N', 'Y'))) %>%
  mutate(weight = as.numeric(weight))

# protein data
protein = readxl::read_xlsx('Data/protein-levels.xlsx') %>%
  rename(subject_id = participant_id) %>%
  mutate(across(everything(), ~replace(.x, .x == 'NA', NA))) %>%
  mutate(protein_concentration = as.numeric(protein_concentration))

# merge datasets
train = left_join(train, protein, by = 'subject_id')

train
# A tibble: 772 × 8
   subject_id   age sex    weight height trt_grp RESPONSE protein_concentration
   <chr>      <dbl> <fct>   <dbl>  <dbl> <fct>   <fct>                    <dbl>
 1 SUBJ_001      46 Female   84.7   1.59 DRUG    N                          148
 2 SUBJ_001      46 Female   84.7   1.59 DRUG    N                          148
 3 SUBJ_002      47 Female   71.2   1.64 DRUG    Y                           85
 4 SUBJ_003      48 Female   69.8   1.73 CONTROL N                          183
 5 SUBJ_004      59 Female   62.9   1.5  DRUG    Y                           89
 6 SUBJ_005      59 Female  114.    1.63 CONTROL N                          137
 7 SUBJ_006      63 Male     79.3   1.77 CONTROL Y                          116
 8 SUBJ_007      77 Male     96.1   1.77 CONTROL N                           78
 9 SUBJ_008      57 Male     93.5   1.63 DRUG    N                          115
10 SUBJ_009      72 Male     85.6   1.68 DRUG    N                          197
# ℹ 762 more rows

3. Tidy Data

3.1 Check for duplicates

Are there any duplicates?

# Check for duplicates
message(sprintf('There is %s duplicated subject on row %s', sum(duplicated(train)), which(duplicated(train) == TRUE)))
There is 1 duplicated subject on row 2

View duplicated row

train[which(duplicated(train) == TRUE),]
# A tibble: 1 × 8
  subject_id   age sex    weight height trt_grp RESPONSE protein_concentration
  <chr>      <dbl> <fct>   <dbl>  <dbl> <fct>   <fct>                    <dbl>
1 SUBJ_001      46 Female   84.7   1.59 DRUG    N                          148

Remove duplicated row and confirm all duplicated subjects are removed

# remove duplicates
train = train %>%
  distinct()

message(sprintf('There is %s duplicated subjects', sum(duplicated(train))))
There is 0 duplicated subjects

3.2 Identify Missing data

columns with missing data

# Identify missing data
apply(is.na(train), 2, any)
           subject_id                   age                   sex 
                FALSE                 FALSE                 FALSE 
               weight                height               trt_grp 
                 TRUE                 FALSE                 FALSE 
             RESPONSE protein_concentration 
                FALSE                  TRUE 

We can see both weight and protein concentration predictors have missing data. Next we can examine the distribution of missing values by plotting missing data fraction.

# plot fraction of missing data per variable.
na.patterns = naclus(dplyr::select(train, colnames(train)[apply(train, 2, anyNA)]))
naplot(na.patterns, 'na per var')

This shows both predictors have small missing data fraction.

3.3 Impute Missing Values

In this section we shall look at whether it is possible to confidently infer missing values from other predictors.

To do this we will use a single imputation method and assess R2 values. It is also important to assess which predictors contribute most to its inference by examining ‘Coefficients of canonical variate’ as if important predictors also have missing this can impede inference

3.3.1 Assessing confidence in inference

# using single imputation
w = transcan(~ age + sex + weight + height + trt_grp + RESPONSE +
               protein_concentration,
             imputed = TRUE, trantab=TRUE, data = train, pl = FALSE, pr = FALSE)

# See R^2 value to determine how reliable imputed values are for each predictor 
w$rsq
                  age                   sex                weight 
           0.15354125            0.73039386            0.38403740 
               height               trt_grp              RESPONSE 
           0.76102330            0.09775763            0.36451313 
protein_concentration 
           0.35818590 
summary(w)
transcan(x = ~age + sex + weight + height + trt_grp + RESPONSE + 
    protein_concentration, imputed = TRUE, trantab = TRUE, pr = FALSE, 
    pl = FALSE, data = train)

Iterations: 4 

R-squared achieved in predicting each variable:

                  age                   sex                weight 
                0.154                 0.730                 0.384 
               height               trt_grp              RESPONSE 
                0.761                 0.098                 0.365 
protein_concentration 
                0.358 

Adjusted R-squared:

                  age                   sex                weight 
                0.142                 0.728                 0.375 
               height               trt_grp              RESPONSE 
                0.758                 0.088                 0.358 
protein_concentration 
                0.349 

Coefficients of canonical variates for predicting each (row) variable

                      age   sex   weight height trt_grp RESPONSE
age                          0.30  1.01   0.15  -0.01    0.04   
sex                    0.04        0.09  -1.04  -0.01   -0.01   
weight                 0.47  0.27         0.86  -0.17   -0.39   
height                 0.02 -0.90  0.24          0.02    0.03   
trt_grp               -0.02 -0.10 -0.49   0.19          -1.16   
RESPONSE               0.02 -0.03 -0.42   0.13  -0.42           
protein_concentration -0.07 -0.03  0.55  -0.25   0.28    0.95   
                      protein_concentration
age                   -0.03                
sex                   -0.01                
weight                 0.51                
height                -0.06                
trt_grp                0.77                
RESPONSE               0.94                
protein_concentration                      

Summary of imputed values

weight 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      11        0       11        1    84.01    23.41    63.66    64.10 
     .25      .50      .75      .90      .95 
   66.73    76.21    97.36   109.25   114.19 

lowest :  63.2107  64.1003  66.1857  67.2789  67.2957
highest:  96.7697  97.0529  97.6701 109.2510 119.1310
                                                                         
Value       63.2107  64.1003  66.1857  67.2789  67.2957  76.2131  96.7697
Frequency         1        1        1        1        1        1        1
Proportion    0.091    0.091    0.091    0.091    0.091    0.091    0.091
                                              
Value       97.0529  97.6701 109.2510 119.1310
Frequency         1        1        1        1
Proportion    0.091    0.091    0.091    0.091
protein_concentration 
       n  missing distinct     Info     Mean      Gmd 
       8        0        7    0.988    95.89    48.16 

lowest :  44.0000  67.1504  87.9069 116.1890 133.5520
highest:  87.9069 116.1890 133.5520 135.7930 138.5560
                                                                         
Value       44.0000  67.1504  87.9069 116.1890 133.5520 135.7930 138.5560
Frequency         2        1        1        1        1        1        1
Proportion    0.250    0.125    0.125    0.125    0.125    0.125    0.125

Starting estimates for imputed values:

                  age                   sex                weight 
               62.000                 2.000                88.875 
               height               trt_grp              RESPONSE 
                1.670                 1.000                 1.000 
protein_concentration 
              117.000 

As shown by the R2 value it is possible to impute missing weight and protein concentration values with an R2 confidence value of 0.384 and 0.358. We can see that age, height and protein concentration are the most informative predictors for inferring missing weight, while weight and response are the most informative predictors for protein concentration values. This makes biological sense so confidence can be taken in the imputation process.

Whether or not to impute these values is dependent on the confidence in the imputation (R2 value) and missing data fraction of informative predictors. In this case the R2 value for both predictors is not ‘fantastic’ but at the same time does provide some inference, while missing data of informative predictors is low.

Ideally we would consult an expert of the data set for advice but for the sake of this case we shall assume it was worth inputting missing data.

3.3.2 Impute missing data

Impute missing data and re-check confirm imputation by re-checking for any missing data

# Impute missing data
train = train %>%
  mutate(weight = with(train, impute(w, weight, data = train)),
         protein_concentration = with(train, impute(w, protein_concentration, data = train)))

# Re-check missing values
apply(is.na(train), 2, any)
           subject_id                   age                   sex 
                FALSE                 FALSE                 FALSE 
               weight                height               trt_grp 
                FALSE                 FALSE                 FALSE 
             RESPONSE protein_concentration 
                FALSE                 FALSE 

3.3.3 Removing Unwanted Subjects

By examining the distribution of values across each predictor we can see that outliers in age, height, and weight. These samples are from subjects which should have been removed from this study.

# visualise distribution of values per predictor including missing values
ggplot(w)

To remove these subjects we can filter out any samples below the age of 20 years old

train = train %>%
  filter(age > 20)

# check distribution
w2 = transcan(~ age + sex + weight + height + trt_grp + RESPONSE +
                protein_concentration,
              imputed = TRUE, trantab=TRUE, data = train, pl = FALSE, pr = FALSE)
ggplot(w2)

as we can see all outlier results have now been excluded.

3.3.4 Include BMI predictor

BMI can be included as an additional informative predictor by combining both height and weight. This is often used as a form of data reduction, by reducing degrees of freedom used by combing predictors.

train = train %>%
  mutate(BMI = weight/height^2)

3.3.5 Are predictors appropriately randomized

Randomization is critical in any clinical trial to ensure the effect you are seeing between control and treatment group (i.e administration of Miraculon-b drug) is a true effect rather effects from confounding predictors. As such we will therefore ensure that each both control and treatment group have a similar distribution of subjects.

ggplot(train, aes(x=weight, fill = trt_grp)) + 
  geom_density(alpha = 0.4) +
  scale_fill_brewer(palette="Dark2") +
  ggtitle("Density plot of weight across different treatment groups") 
Don't know how to automatically pick scale for object of type <impute>.
Defaulting to continuous.
ggplot(train, aes(x=height, fill = trt_grp)) + 
  geom_density(alpha = 0.4) +
  scale_fill_brewer(palette="Dark2") +
  ggtitle("Density plot of height across different treatment groups")
ggplot(train, aes(x=BMI, fill = trt_grp)) + 
  geom_density(alpha = 0.4) +
  scale_fill_brewer(palette="Dark2") +
  ggtitle("Density plot of BMI across different treatment groups")
Don't know how to automatically pick scale for object of type <impute>.
Defaulting to continuous.
ggplot(train, aes(x=protein_concentration, fill = trt_grp)) + 
  geom_density(alpha = 0.4) +
  scale_fill_brewer(palette="Dark2") +
  ggtitle("Density plot of protein concentration across different treatment groups")
Don't know how to automatically pick scale for object of type <impute>.
Defaulting to continuous.
train %>%
  ggplot(aes(x= sex,  group=trt_grp)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = 0.1),
                y= ..prop.. ), stat= "count", vjust = -.5, size = 2) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_discrete(labels = levels(train$sex)) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + 
  facet_wrap(~trt_grp) +
  ggtitle("Proportion Male/Female subjects per treatment group") 
Warning: The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(prop)` instead.

Based on the above plots we can see there is a relatively even distribution of values in each predictor for each treatment group. We can therefore be relatively confident there has been a adequate randomisation process in the clinical trial design.

4. Building Logistic Regression Model

We will use a logistic regression to model tumor response to new drug treatment as this is powerful model for assessing binary outcomes. Furthermore its inherent low variance means that we can be expect similar performance when applied to external data. This is advantageous over more complex models such as random forest and neural networks which can be more difficult to interpret results and optimise models.

4.1. Predictor Selection

One of the major causes of an unreliable model is over-fitting, which can reduce performance r when model is applied to external data. We can prevent over-fitting by being selective of which predictors to use in model development.

Determine how many predictors to choose can be challenging however we can estimate how many degrees of freedom we can expend before over-fitting based on the effective sample size. This being the min(n0,n1), where n0 is the number of class 0s, and n1 is the number of class 1s. So, the number of df=min(n0,n1)/15 (or /10, or /20 depending on how strict you want to be. In this case I shall use a stringent approach to ensure model is not over fitting

# find n0, n1
describe(train$RESPONSE)
train$RESPONSE 
       n  missing distinct 
     768        0        2 
                      
Value          N     Y
Frequency    434   334
Proportion 0.565 0.435
#* effective sample size is 334

# Determine available df freedom
sprintf('the available degrees of freedom before overfitting is estimated at: %s', round(334/20, digits = 0))
[1] "the available degrees of freedom before overfitting is estimated at: 17"

We can see that we have 17 degrees of freedom so it is unlikely that we need to exclude it from developing the model. Despite this it is still worth examining the effect of these predictors as a more parsimonious model is generally a more powerful model.

To do this we shall use Somers Rank, Redundancy analysis and Lasso regression.

4.1.1 Somers Rank

This looks at how each predictor individually relates to the response

sDxy = with(train,
            rcorrcens(factor(RESPONSE) ~ age + sex + weight + height + trt_grp + 
                        protein_concentration + BMI))

plot(sDxy)

we can see that protein concentration and treatment group are highly informative predictors, while height and sex are the least informative.

4.1.2 Redundancy analysis

This determines whether there is any redundant predictors, ie the other predictors are able to capture its information. We can specify redundancy cut-off threshold by setting R2 value. In this case we will set it to 0.6

redundancy = with(train,
                  redun(~ age + sex + weight + height + trt_grp + 
                          protein_concentration + BMI,
                        r2 = 0.6, type = 'adjusted'))

redundancy$Out
[1] "height" "weight"

Here we see height and weight are redundant. This is likely caused by the BMI predictor we included earlier.

4.1.3 Lasso regression

This is able to inform most informative predictors by adjusting lambda, which in turn moves coefficients closer 0. Least informative predictors will move closer to 0 quicker, therefore allowing us to be able to select most informative predictors.

Note this is not a guaranteed method for identifying most optimal predictors but is a better method for selecting predictors than backwards/forwards stepwise selection. (As this is a terrible method). Also note this only allows for numeric predictors, we can include non numeric predictors by group lasso regression but for the sake of this we havent.

# Predictors selected for lasso
lassoPred_comp = train %>%
  dplyr::select(where(is.numeric)) %>%
  colnames()

# Perform lasso regression
set.seed(1)
df = train %>% dplyr::select(dplyr::matches(lassoPred_comp))
lassoFit = cv.glmnet(as.matrix(df), as.matrix(as.numeric(train$RESPONSE)), type.measure = "mse", nfolds = 20)
plot(lassoFit)

# Print coefficient values
lassoFit

Call:  cv.glmnet(x = as.matrix(df), y = as.matrix(as.numeric(train$RESPONSE)),      type.measure = "mse", nfolds = 20) 

Measure: Mean-Squared Error 

     Lambda Index Measure       SE Nonzero
min 0.00719    39  0.1757 0.003994       3
1se 0.03837    21  0.1792 0.003729       2
# Get coefficents for min MSE
coef(lassoFit, s = "lambda.min")
6 x 1 sparse Matrix of class "dgCMatrix"
                                 s1
(Intercept)            2.0308237709
age                    .           
weight                 0.0004811477
height                 .           
protein_concentration -0.0086067400
BMI                    0.0126272160
# Get coefficients for 1 standard error of min MSE
coef(lassoFit, s = "lambda.1se")
6 x 1 sparse Matrix of class "dgCMatrix"
                                s1
(Intercept)            2.059033575
age                    .          
weight                 .          
height                 .          
protein_concentration -0.007267932
BMI                    0.008065782

Using a k-20 cross validation we can see that model is optimised when lambda is ~log(-5) which includes 3 predictors and is in 1 SE when lambda is ~log(-3). We can see which predictors are most informative by examining coefficents. If coefficient is 0 then they have been effectively removed from model.

Most optimal model identified by lasso

# Get coefficents for min MSE
coef(lassoFit, s = "lambda.min")
6 x 1 sparse Matrix of class "dgCMatrix"
                                 s1
(Intercept)            2.0308237709
age                    .           
weight                 0.0004811477
height                 .           
protein_concentration -0.0086067400
BMI                    0.0126272160

1 standard deviation of optimal

# Get coefficients for 1 standard error of min MSE
coef(lassoFit, s = "lambda.1se")
6 x 1 sparse Matrix of class "dgCMatrix"
                                s1
(Intercept)            2.059033575
age                    .          
weight                 .          
height                 .          
protein_concentration -0.007267932
BMI                    0.008065782

4.1.4 Interpretation of predictor selection

From Somers rank and redundancy we can see that height and weight can be excluded as these provide minimal information and are effectively captured by other predictors. There is an argument to exclude age based on Somers and Lasso regression, however this may provide some informative information which other predictors do not and as we have the degrees of freedom to spend we shall not remove it from the model

4.2 Develop Model

4.2.1 Modelling potential predictor interactions

Here we will create a model using all predictors except for weight and height which through data reduction was captured through BMI. We will also investigate whether there is any interaction between variables, where we will explore whether protein concentration interacts with age, sex or BMI

# Create datadist object
dd = datadist(train)
options(datadist= 'dd' )

#* Develop model
#* Use restricted cubic splines to accommodate non linearity
f = lrm(RESPONSE ~  age + sex + trt_grp + 
          protein_concentration + BMI +
          # check interactions
          protein_concentration*sex +
          protein_concentration*age + 
          protein_concentration*BMI,
        data=train)

We will use ANOVA to assess model

anova(f)
                Wald Statistics          Response: RESPONSE 

 Factor                                                     Chi-Square d.f.
 age  (Factor+Higher Order Factors)                           0.51     2   
  All Interactions                                            0.48     1   
 sex  (Factor+Higher Order Factors)                           0.29     2   
  All Interactions                                            0.26     1   
 trt_grp                                                     63.32     1   
 protein_concentration  (Factor+Higher Order Factors)       162.28     4   
  All Interactions                                            1.64     3   
 BMI  (Factor+Higher Order Factors)                          54.07     2   
  All Interactions                                            0.78     1   
 sex * protein_concentration  (Factor+Higher Order Factors)   0.26     1   
 age * protein_concentration  (Factor+Higher Order Factors)   0.48     1   
 protein_concentration * BMI  (Factor+Higher Order Factors)   0.78     1   
 TOTAL INTERACTION                                            1.64     3   
 TOTAL                                                      178.09     8   
 P     
 0.7731
 0.4888
 0.8636
 0.6093
 <.0001
 <.0001
 0.6495
 <.0001
 0.3772
 0.6093
 0.4888
 0.3772
 0.6495
 <.0001

ANOVA shows there is little evidence to suggest interactions between protein concentration and with age, sex and BMI. We shall therefore re-develop the model but this time excluding interactions.

4.2.2 Develop optimized model

In this model we capture any non linear relationships in age, protein concentration and BMI predictors which we observed in section 3.3.3. This can be achieved using restricted cubic splines

f = lrm(RESPONSE ~  rcs(age,3) + sex + trt_grp + 
          rcs(protein_concentration,3) + rcs(BMI,3),
        x=TRUE,y=TRUE, data=train)

assess model using ANOVA

anova(f)
                Wald Statistics          Response: RESPONSE 

 Factor                Chi-Square d.f. P     
 age                     0.05     2    0.9729
  Nonlinear              0.03     1    0.8575
 sex                     0.04     1    0.8415
 trt_grp                61.72     1    <.0001
 protein_concentration 154.82     2    <.0001
  Nonlinear              1.38     1    0.2398
 BMI                    52.69     2    <.0001
  Nonlinear              0.61     1    0.4333
 TOTAL NONLINEAR         1.74     3    0.6270
 TOTAL                 172.23     8    <.0001

The global significance provides strong evidence to suggest this is a valid model (P< 0.0001). We can see that the drug is having a significant effect on response (p<0.001) with a large effect size (Chi_sqr = 61.72) however whether this is postive or negative effect is not shown.

We can also see that protein concentration is significant predictor with large effect size (Chi_sqr = 154.82, P < 0.0001) therefore suggesting this is an effective biomarker.

A patients BMI also seems to have an effect to tumor response. Age and Sex were not significant predictors but this does not mean they are uninformative as they may be providing information on subset of data.

We can visualize how the model predicts the data used to train it

# plot predicted variables
ggplot(Predict(f), sepdiscrete="vertical", nlevels=4, vnames="names")

These plots give predicted log odds of each predictor. This demonstrates treatment group has increased log odds of having a response on tumour growth size. We also see that lower protein concentration coincides with increased odds of tumour response while increased BMI coincides with reduced odds of tumour response.

5. Model Performance

Whilst we may have developed a significant model (as shown using ANOVA) this does NOT indicate a good model. To do this we need to assess its performance. One possible method for validating the model could be through data hold out, where data is split into training and validation data. However by splitting the data this reduces the power of the model. Instead a more efficient method is using cross validation and bootstrapping. This is able to accurately assess the model performance without requiring loss of training data. Here we will use internal validation and calibration using bootstrapping (x1000 resampling)

5.1 Validate Model

# Validate model
v = validate (f, B=1000)

print(v, B=20, digits =3)
          index.orig training   test optimism index.corrected    n
Dxy            0.737    0.743  0.730    0.012           0.725 1000
R2             0.487    0.497  0.479    0.018           0.469 1000
Intercept      0.000    0.000 -0.003    0.003          -0.003 1000
Slope          1.000    1.000  0.954    0.046           0.954 1000
Emax           0.000    0.000  0.011    0.011           0.011 1000
D              0.451    0.463  0.441    0.022           0.428 1000
U             -0.003   -0.003  0.001   -0.004           0.001 1000
Q              0.453    0.466  0.439    0.026           0.427 1000
B              0.146    0.144  0.148   -0.004           0.150 1000
g              2.373    2.464  2.334    0.130           2.242 1000
gp             0.354    0.357  0.351    0.007           0.347 1000
# Calcualte concordance index
Dxy = v[1,5]
c.index = Dxy*0.5+0.5

sprintf('The model is able to correctly descriminate %s%% of the samples correctly', round(c.index*100, digits = 2))
[1] "The model is able to correctly descriminate 86.24% of the samples correctly"

5.2 Calibrate Model

#* Calibrate Model
cal = calibrate(f, B=1000)
plot(cal)


n=768   Mean absolute error=0.027   Mean squared error=0.00098
0.9 Quantile of absolute error=0.047

5.3 Model over fitting

As a final preliminary assessment we can also re-check for over fitting by calcualting the noise being fitted. This is important as an over-fitted model is a major cause for an unreliable model despite apparent good performance metrics.

df = f$stats[4] #degree of freedom
LR = f$stats[3] #liklihood ratio
shrinkage = (LR-df)/LR
noise = (1-shrinkage)*100 # estimated percentage of fitted noise
noise
Model L.R. 
  2.305482 

Here we see we are plotting around 2.3% noise which is way below the arbitrary 10% threshold.

5.4 Performance Interpretation

Performance metrics show model is very good at predicting response with an ability to correctly discriminate ~86.2% of the subjects correctly. The calibration curve depicts the reliability of the model which can be seen closely follows the expected line, demonstrating excellent validation on absolute probability scale. We also see little evidence for model over-fitting with only ~2.3% noise. This gives confidence that model will perform similar on external data.

Collectively this provides strong evidence to support that that Miraculon-B has a significant effect to tumor response, and protein concentration is a strong bio-marker.

6. Checking overly influential observations

As every observation influences the fit of the regression model, anomalous data points can dramatically skew the model fit and increase variance of predicted values. These are known as overly influential observations which can be commonly caused by data transcription errors. As a quality control check it is therefore worth checking for overly influential observations and if necessary confirming these are not spurious observations caused by data collection or transcription errors.

6.1 Plotting influence

We can determine overly influential observations by plotting residuals for each predictor

#* Update model
f = update (f, x=TRUE , y=TRUE)

# Get residuals to calculate predictor influence
rr = resid(f, type = "dfbeta")

# Remove any characters after "=" resulting from categorical data
colnames(rr) = str_remove(colnames(rr), pattern = "=.*$")

# Replace any "'" with a "." caused by RCS predictors
colnames(rr) = str_replace_all(colnames(rr), pattern = "'", replacement = ".")

# Convert Residuals to df
rr = data.frame(rr)

# Plot influence 
with(train, plot(age,rr[["age"]],xlab="age",ylab="influence"))
with(train, plot(sex,rr[["sex"]],xlab="sex",ylab="influence"))
with(train, plot(trt_grp,rr[["trt_grp"]],xlab="trt_grp",ylab="influence"))
with(train, plot(protein_concentration,rr[["protein_concentration"]],xlab="protein_concentration",ylab="influence"))
with(train, plot(BMI,rr[["BMI"]],xlab="BMI",ylab="influence"))

Whilst plotting influence can be useful when trying to identify potentially overly influential observations, it can be challenging to interpret how influential these observations are. It is also difficult to cross reference which patient had the overly influential observation from the plot alone.

To overcome this challenge we can specify an influential cut-off threshold and identify which subjects exceed this threshold and for which predictor. For this example we will set a highly stringent influence cut-off threshold of 0.4.

# Dataframe of overly Influential observations
Infl = which.influence(f, 0.4) # stringent observations 

df = with(train, data.frame(age, sex, trt_grp, protein_concentration, BMI))
#Show influence
tibble::rownames_to_column(show.influence(Infl, df), "Observation Id")
    Count protein_concentration       BMI
446     1                   180 *59.12934
547     1                  *187  43.75240
  Observation Id Count protein_concentration       BMI
1            446     1                   180 *59.12934
2            547     1                  *187  43.75240

From this we can see that subject 446 and 547 show overly influential values for BMI and protein concentration respectively given the 0.4 cut-off threhold. Whilst this has been set to a high stringent cut off threshold, it maybe worth checking the validity of these data points with person who collected the data.

We will however consider these observations as acceptable.

7. Summarizing model

7.1 Plotting odds ratio of each predictor to tumor response

Now that we have developed an optimized model we can now sumarise the results by presenting the log-odd of each predictor in regards of tumor response. This can be represented in the plot below.

# plot odds ratio
plot(summary(f), log = TRUE)

From these results we can see that:

  • Miraculon-B has better outcome to patients for tumor with an odds ratio (OR) of ~4.8 [3.27-7.17 95% CI]. Meaning that patients prescribed the drug had a ~380% increase in the odds of tumor response compared to control.

  • Protein concentration is a significant biomarker to tumor response with increased expression co-inciding with reduced likelihood of tumor response.

  • BMI effects odds of tumor response with increased BMI co-coinciding with better outcomes.

  • No evidence to suggest that Age and Sex significantly impacts tumor response.

7.2 Nomogram

We can use a nonogram as another way of visualise expected tumor response. This method is particularly powerful as we can also use this to determine the likelihood a patient will respond to drug treatment given their clinical features. For example if there are severe side-effects to the drug and the patients has a low probability of responding to the drug treatment then an alternative patient management strategies should be considered.

# Plot nomogram
plot(nomogram(f,
              fun=plogis , funlabel ="Probability of Tumour Response",
              fun.at =c(.01 ,.05 ,.1 ,.25 ,.5 ,.75 ,.9 ,.95 ,.99 )))

From this we can see that a patient with low BMI and high protein concentration will unlikely have a positive outcome to tumor response, regardless of drug treatment. While a patient with higher BMI and protein concentration will have a higher probability of a positive outcome.

8. ROC analysis

ROC analysis can be used as another way to measure model performance but also can be used to determine the optimum cut-off threshold for characterizing whether a response has occurred or not based on sensitivity and specificity requirements. This information is useful if we wanted to use the model to predict the probability a patient will have a positive tumor response given there clinical features and treatment type.

8.1 Assessing model performance using a ROC analysis

# Use model to predict training data response 
train$prediction = predict(f, type = "fitted")

# Create ROC objects 
rocit_bin = rocit(score = train$prediction,
                   class = train$RESPONSE, 
                   method = "emp")


# Summary of ROC analysis

summary(rocit_bin)
                           
 Method used: empirical    
 Number of positive(s): 334
 Number of negative(s): 434
 Area under curve: 0.8686  

ROC analysis shows an AUC of 0.867. This is comparable to the the concordance index calculated earlier of 0.862 which suggests this model can correctly discriminate tumor response ~86-87% of the patients.

8.2 Plot ROC analysis to identify optimum cut-of threshold

We will plot a ROC curve to show the relationship between sensitivity (ability to identify tumor response) and specificity (ability to identify non tumor response). This is important to considering what cut-off threshold to specify for characterising tumor response as this can have severe impact to patient management. In this case we will give equal weighting to sensitivity and specificity, however ideally we should consult project lead.

# Get Confidence intervals 95% to ROC analysis
set.seed(1)
ciROC_bin95 = ciROC(rocit_bin, 
                    level = 0.95, nboot = 1e4)
Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
collapsing to unique 'x' values
# Get cutoff values for ROC analysis
cutoff = data.frame(cbind(Cutoff=rocit_bin$Cutoff, 
                          TPR=rocit_bin$TPR, 
                          FPR=rocit_bin$FPR))


# Plot ROC ----------------------------------------------------------------
#Set figure format

plot(rocit_bin, col = c(2,4), 
     legend = FALSE)
lines(ciROC_bin95$LowerTPR~ciROC_bin95$FPR, 
      col = 2, lty = 2)
lines(ciROC_bin95$UpperTPR~ciROC_bin95$FPR, 
      col = 2, lty = 2)
legend("bottomright", col = c(2,2,4),
       lty = c(1,2,2),
       lwd = c(2,1,2),
       c( 'Binormal ROC', "95% CI", 'Chance line'))

ROC confirms the model can correctly predict tumor response with high performance. The Youden Index shows the optimal cut-off threshold where sensitivity and specificity have equal weighting. This is where sensitivity is ~0.7 and specificity ~0.9.

Another method for identifying optimum cut-off threshold is examining the cumulative density functions plot

# Plot cumulative density function 
kplot = ksplot(rocit_bin)

We can use this to get the optimum cut-off threshold which is comparable to the Youden Index

# optimal cut-off 
sprintf('Using the cumulative density function plot the optimum cut-off threshold for characterising tumor response given equal weighting for sensitivity and specificity is: %s', round(kplot$`KS Cutoff`, digits = 2))
[1] "Using the cumulative density function plot the optimum cut-off threshold for characterising tumor response given equal weighting for sensitivity and specificity is: 0.61"

Sensitivity and specificity at this cut-off threshold is:

# Get optimal cut off
optimal = cutoff[which.min(abs(0.6055646-cutoff$Cutoff)),]
round(optimal, digits = 3)
    Cutoff   TPR   FPR
276  0.606 0.698 0.097

8.3 Comparing performance between optimum and default cut-off threshold

8.3.1 Confusion Matrix

We can compare performance of model using default (0.5) and optimum cut-off threshold (0.61) for characterizing tumor response using a confusion matrix

# Confusion matrix for default cut-off threshold
table(Actualvalue=train$RESPONSE,Predictedvalue=train$prediction>0.5)
           Predictedvalue
Actualvalue FALSE TRUE
          N   351   83
          Y    78  256
# Confusion matrix for optimum cut-off threshold
table(Actualvalue=train$RESPONSE,Predictedvalue=train$prediction>0.6055646)
           Predictedvalue
Actualvalue FALSE TRUE
          N   392   42
          Y   102  232

This shows that default cut-off threshold has better specificity but lower sensitivity than optimum cut-off threshold.

8.3.2 Other performance metrics

We can also compare other performance metrics between the two different cut-off thresholds including:

  • accuracy

  • mathews coefficient correlation

  • F meaure

  • Kappa

# Predict trained model using default cut-off threshold
predict_data_default = as.data.frame(predict(f, type = 'fitted')) %>%
                rename(fitted = `predict(f, type = "fitted")`) %>%
                mutate(predicted = ifelse(fitted < 0.5, 'N', 'Y')) %>%
                mutate(predicted = factor(predicted, levels = c('N','Y'))) %>%
                mutate(actual = train$RESPONSE)

# Predict trained model using optimum cut-off threshold
predict_data_opt = as.data.frame(predict(f, type = 'fitted')) %>%
                rename(fitted = `predict(f, type = "fitted")`) %>%
                mutate(predicted = ifelse(fitted < 0.6055646, 'N', 'Y')) %>%
                mutate(predicted = factor(predicted, levels = c('N','Y'))) %>%
                mutate(actual = train$RESPONSE)
  
# Assess performance ------------------------------------------------------
# Combining these three classification metrics together

acc = yardstick::accuracy(predict_data_default,
                               truth = actual, estimate = predicted)

classification_metrics <- yardstick::metric_set(accuracy, mcc, f_meas, kap)
bind_rows(
  comparison = classification_metrics(predict_data_default,
                               truth = actual, estimate = predicted),
  comparison = classification_metrics(predict_data_opt,
                               truth = actual, estimate = predicted)
) %>%
  mutate(.metric = c('accuracy_default', 'mcc_default', 'f_meas_default', 'kap_default', 'accuracy_optimum', 'mcc_optimum', 'f_meas_optimum', 'kap_optimum'))
# A tibble: 8 × 3
  .metric          .estimator .estimate
  <chr>            <chr>          <dbl>
1 accuracy_default binary         0.790
2 mcc_default      binary         0.574
3 f_meas_default   binary         0.813
4 kap_default      binary         0.574
5 accuracy_optimum binary         0.812
6 mcc_optimum      binary         0.619
7 f_meas_optimum   binary         0.845
8 kap_optimum      binary         0.610

From this we can see that using the optimum cut-off threshold for determining tumor response has better performance than the default for all examined metrics.

9. Discussion

To conclude an optimized logistic regression model was developed to analyse the Phase III Clinical trial which examined the efficacy of the anti cancer drug ‘Miraculon-B’.

This was a randomized case-control clinical trial comprising of around 768 individuals with treatment outcome being tumor response. Half the subject were prescribed new drug while the control subjects had current level of care.

Data was collected for each subject included age sex, height, weight, protein concentration and tumor response.

An assessment of clinical trial design was performed by confirming adequate randomization of subjects between the two treatment groups. Results found little evidence to suggest the randomization requirement had be violated, therefore giving confidence in study design.

A small fraction of missing data was observed for weight and protein concentration (missing fraction <0.02) which was imputed using single imputation methodology.

To minimise over-fitting and promote a parsimonious model data reduction was implemented where height and weight was converting into a single predictor BMI.

Somers Rank, Redundancy analysis and Lasso regression were used to inform optimal predictor selection. Non linear effects of age, protein concentration and BMI was captured using restricted cubic spline with three knots. No interactions were observed between protein concentration and age, sex or BMI.

ANOVA assessment showed the model was significant with treatment group, BMI and protein concentration as significant predictors with large effect size, while age and sex were non-significant with small effect size.

The prescribed ‘Miraculon-B’ was shown to improve probability of tumor response compared to control treatment with an OR ~4.8. Furthermore protein concentration was found to be an important biomarker with reduced concentration coinciding with improved probability of tumor response. It is also interesting to note that BMI was significantly associated to tumor response with increased BMI linked to better patient outcome. No significant evidence was found to show age and sex impacted probability of tumor response.

The model was internally validated using bootstrapping re-sampling with 1000 repetitions and assessed for discrimination and calibration performance. It was found that model could correctly discriminate ~86% of the subjects correctly which according to calibration curve was consistent across probabilities. Little evidence was found to show over-fitting therefore indicating a reliable model capturing only ~2.6% noise.

Over influential observations were examined where subjects 446 and 547 showed overly influential values for BMI and protein concentration respectively, given a 0.4 cut-off threshold, which may need further examination.

ROC analysis was also performed to determine optimum cut-off threshold for characterising tumor response where sensitivity and specificity were given equal weighting. A cut-off probability threshold of 0.606 was identified which gave 69.8% true positive rate and 9.7% false positive rate.

10. Conclusions

In conclusion, from the given information it has been determined that the trialed drug Miraculon-B has a positive effect on tumor response compared to the standard treatment. However the probability of responding to treatment is effected by BMI, and protein concentration. Therefore it is important to confirm expected efficacy of the drug to individual patient given their known clinical features before administration. This can be achieved using the nomogram (see Section 7.2). This is particularly important if there are any potential adverse effects of the drug which has not been examined in this study. Finally further work is required to assess any potential adverse effects of this new treatment.