1. Setup

rm( list = ls() )
require( ggplot2 )
require( Matrix )
require( corrplot )
# load an tidy data
d <- read.csv( "Data_prediction_FEP.csv")
N <- nrow( d )
# remove the ID column
d <- d[ , -1 ]
str( d )
'data.frame':   77 obs. of  19 variables:
 $ age         : int  22 24 20 23 28 25 19 37 21 20 ...
 $ gender      : int  2 2 1 2 2 1 1 1 2 1 ...
 $ iq          : int  116 93 98 90 102 90 89 79 102 133 ...
 $ PANSS_pos   : int  17 20 14 11 13 18 23 15 13 9 ...
 $ PANSS_neg   : int  20 26 9 16 18 22 23 20 8 8 ...
 $ PANSS_gen   : int  36 35 24 29 30 40 41 37 29 22 ...
 $ PANSS_total : int  73 81 47 56 61 80 87 72 50 39 ...
 $ Glu.Cre     : num  1.31 1.13 1.07 1.44 1.18 ...
 $ MI.Cre      : num  0.743 0.635 0.698 0.708 0.713 0.62 0.731 0.752 0.627 0.711 ...
 $ NAA.Cre     : num  1.22 1.06 1.14 1.49 1.33 ...
 $ GPC.Pch_Cre : num  0.228 0.182 0.231 0.252 0.248 0.266 0.213 0.303 0.26 0.252 ...
 $ NAA.NAAG_Cre: num  1.28 1.1 1.25 1.49 1.33 ...
 $ BA8         : num  -6.07 -5.85 -66.47 19.48 33.93 ...
 $ Caudaate    : num  -42.15 39.86 40.09 5.29 -23.81 ...
 $ PAC         : num  -87.46 5.25 -77.49 -59.63 -62.94 ...
 $ lgi_parietal: num  3.21 3.33 3.62 3.74 3.29 ...
 $ lgi_occ     : num  4.35 3.92 4.5 4.76 4.08 ...
 $ lgi_dlpfc   : num  2.54 2.45 2.68 2.83 2.66 ...
 $ group       : int  1 1 0 0 0 0 1 0 0 0 ...

Gender is coded numerically, as 1 and 2. This means it needs dummy coding, but as only two levels, we can hack it to be 0/1 and keep it as a numerical predictor for a preliminary linear model / GLM

d$gender <- d$gender - 1

2. Inferential / Association Analysis

\(Y = \{ 0,1 \}\), so clearly a binomial, which suggests a logistic regression is the most basic model for this kind of data. Include all variables (so no regularisation / online feature selection) and let’s just see what it shows.

# independent (predictor) variables
x.vars <- names( d )[1:(length(names(d))-1)]
# dependent (outcome, response) variables
y.var  <- names( d )[length(names(d))]
# build formula object for the glm
(modelFormula <- as.formula( paste(y.var, paste(x.vars, collapse=" + "), sep=" ~ ") ) )
group ~ age + gender + iq + PANSS_pos + PANSS_neg + PANSS_gen + 
    PANSS_total + Glu.Cre + MI.Cre + NAA.Cre + GPC.Pch_Cre + 
    NAA.NAAG_Cre + BA8 + Caudaate + PAC + lgi_parietal + lgi_occ + 
    lgi_dlpfc

Now, fit the GLM with all the data available – just a classic inferential analysis

lr.fit <- glm( modelFormula, family = binomial, data = d )
glm.fit: algorithm did not convergeglm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lr.fit)

Call:
glm(formula = modelFormula, family = binomial, data = d)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-3.227e-05  -2.100e-08  -2.100e-08   2.100e-08   3.071e-05  

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.570e+03  1.672e+06  -0.001    0.999
age           9.241e+00  7.278e+03   0.001    0.999
gender        1.051e+02  9.432e+04   0.001    0.999
iq            3.106e+00  2.182e+03   0.001    0.999
PANSS_pos     2.986e+01  1.627e+04   0.002    0.999
PANSS_neg     6.640e+00  1.085e+04   0.001    1.000
PANSS_gen    -5.542e+00  1.142e+04   0.000    1.000
PANSS_total          NA         NA      NA       NA
Glu.Cre       1.193e+02  1.858e+05   0.001    0.999
MI.Cre       -2.623e+02  3.062e+05  -0.001    0.999
NAA.Cre       1.105e+02  6.476e+05   0.000    1.000
GPC.Pch_Cre   6.366e+01  1.118e+06   0.000    1.000
NAA.NAAG_Cre -2.870e+02  5.809e+05   0.000    1.000
BA8          -5.172e-01  4.407e+02  -0.001    0.999
Caudaate      2.436e-02  5.769e+02   0.000    1.000
PAC          -1.332e-01  6.447e+02   0.000    1.000
lgi_parietal  3.222e+02  6.336e+05   0.001    1.000
lgi_occ      -5.592e+00  8.162e+04   0.000    1.000
lgi_dlpfc    -9.581e+01  3.562e+05   0.000    1.000

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8.8209e+01  on 76  degrees of freedom
Residual deviance: 7.9491e-09  on 59  degrees of freedom
AIC: 36

Number of Fisher Scoring iterations: 25

Something suggesing rank deficiency in the design matrix – likely that the predictors are not linearly-independent or one of them is perfectly associated with the response variable group. summary( lr.fit ) suggests PANSS_total is problematic, but there might be others, so look at correlations:

# explore ?highly correlated variables
corr.d <- cor( d )
corr.d[ lower.tri( corr.d, diag = TRUE ) ] <- NA
corrplot( corr.d, type = "upper", diag = FALSE )

Prepare a table of pairs of variables, descending-ordered by correlation:

# sort pairs of correlated variables
m <- melt( corr.d )
m <- m[order(- abs(m$value)), ]
# get rid of NAs
m <- na.omit(m)
# view the highest correlated: rho > 0.5
(m[ which( m$value > 0.5 ), ])

3. Debugging the GLM

Observe:

  1. NAA.Cre is highly correlated with NAA.NAAG_Cre – unsurprising given the properties of MRS data.
  2. PANSS_pos is impressively correlated with group – suggesting that if group represents something meaningful, it may be confounded with positive symptoms.

There are two possibility:

  1. If group is defined as a function of PANSS_pos
  2. group is not defined as a function of PANSS_pos, but represents a measurement only at some baseline/timepoint.

The first problem is easily seen as follows: If the response (dependent) variable group represents a function of PANSS_pos across time points, then we have found the reason why our GLM is failing - they are correlated because one is a function of the other. Let PANSS_pos measured at some “baseline” be \(P_+(t=0)\). Let the same variable at time \(t=1\) (some follow up) be \(P_+(t=1)\). Then if group represents “treatment resistance” status I suspect that in fact group is defined:

\[ Group\left( P_+( t = 0), P_+( t = 1 ) \right) = \left\{ \begin{array} \\ 1 \text{ if } P_+( t = 1 ) - P_+( t = 0 ) > \tau_{TRS} \\ 0 \text{ otherwise} \end{array} \right. \] Where \(\tau_{TRS}\) is some threshold value such that: if the patient’s positive PANSS score does not change \(> \tau_{TRS}\) between two time points (\(t=0\) and \(t=1\)), the patient is assigned TRS status (group=1). So, if group is defined by the function above, there is a reason to expect group will be linearly dependent on PANSS_pos.

The second problem is more prosaically related to the value of biomarkers. Assume that PANSS_pos is measured only at baseline, so we hope it is one of many possible predictors. Then, given it’s correlation with group it is a pretty good predictor all on it’s own (irrespective of any of the biomarkers).

To proceed : I assume that PANSS_pos is the measurement taken at baseline, and in fact, group is not defined as above so that there is no way \(P_+(t=0)\) and \(P_+(t=1)\) both enter into the derived variable group.

So we’ll try this: removing “redundant” independent variables (predictors) which are highly correlated with each other (since being correlated with the response/outcome group is not the problem, its linear-dependence between predictors which causes the GLM to fail)

4. Revising the GLM

Start be defining (somewhat arbitrarily) the predictor variables which are highly linearly dependent based on their correlation scores:

m[ which( m$value > 0.5 ),  ]

And now try a re-fit of the original GLM, with variables that are highly correlated with each other removed, and apply a bit of clinical intuition also; PANSS_neg is correlated with PANSS_gen and likewise, PANSS_pos is correlated with PANSS_gen suggesting that PANSS_gen is redundant or might be seen as a confounder and is adding to the rank-deficiency in estimating the GLM. As discussed above, NAA.NAAG_Cre also highly correlated with NAA.Cre and since one is derived from the other (by the LCModel) we cna ditch one.

dropCols <- which( names( d ) %in% c("NAA.NAAG_Cre", "PANSS_gen") )
d <- d[ , -dropCols]
x.vars <- names( d )[1:(length(names(d))-1)]
y.var  <- names( d )[length(names(d))]
(modelFormula <- as.formula( paste(y.var, paste(x.vars, collapse=" + "), sep=" ~ ") ) )
group ~ age + gender + iq + PANSS_pos + PANSS_neg + PANSS_total + 
    Glu.Cre + MI.Cre + NAA.Cre + GPC.Pch_Cre + BA8 + Caudaate + 
    PAC + lgi_parietal + lgi_occ + lgi_dlpfc
lr.fit <- glm( modelFormula, family = binomial, data = d )
glm.fit: algorithm did not convergeglm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lr.fit)

Call:
glm(formula = modelFormula, family = binomial, data = d)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-4.379e-05  -2.100e-08  -2.100e-08   2.100e-08   4.229e-05  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.999e+03  1.991e+06  -0.001    0.999
age           1.035e+01  1.407e+04   0.001    0.999
gender        1.348e+02  2.267e+05   0.001    1.000
iq            2.937e+00  2.219e+03   0.001    0.999
PANSS_pos     4.753e+01  3.361e+04   0.001    0.999
PANSS_neg     2.177e+01  2.909e+04   0.001    0.999
PANSS_total  -1.010e+01  1.446e+04  -0.001    0.999
Glu.Cre       8.214e+01  5.903e+05   0.000    1.000
MI.Cre       -4.160e+02  4.935e+05  -0.001    0.999
NAA.Cre      -1.188e+02  1.585e+06   0.000    1.000
GPC.Pch_Cre   3.332e+02  1.580e+06   0.000    1.000
BA8          -4.787e-01  9.683e+02   0.000    1.000
Caudaate      3.585e-02  4.583e+02   0.000    1.000
PAC          -2.497e-02  1.045e+03   0.000    1.000
lgi_parietal  3.706e+02  6.331e+05   0.001    1.000
lgi_occ       4.432e+00  3.107e+05   0.000    1.000
lgi_dlpfc    -3.385e+01  4.452e+05   0.000    1.000

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8.8209e+01  on 76  degrees of freedom
Residual deviance: 1.0513e-08  on 60  degrees of freedom
AIC: 34

Number of Fisher Scoring iterations: 25

Still, we have a problem – applying a bit of “human intelligence” and knowledge of the data set to the problem doesn’t fix it.

5. Automated Feature Selection with Redundancy Reduced

It’s going to be hard to proceed by “manually” trying to debug the design matrix for the GLM, so we can resort to some automated methods. With complex covariance / linear-dependence structures, sometimes using regularisation can help in feature selecting the best predictors – especially in cases like this, where there is a bunch of dependencies we haven’t worked out.

We’ll try to see if using LASSO-type regularisation can help reduce the number of predictors in the model.

require( glmnet )
x.vars <- names( d )[1:(length(names(d))-1)]
y.var  <- names( d )[length(names(d))]
Y <- d[ , y.var ]
X <- as.matrix( d[ , x.vars ] )
# note: using largely default parameters (alpha = 1 is the LASSO regularisation), and we let glmnet
# standardise variables (mean = 0, sd = 1) as units / scales not equivalent across all predictors (IQ and Age for example - you 
# can have an IQ of 100, but unlikely that Age will be on the same scale)
net.fit <- glmnet( X, Y, family = "binomial", alpha = 1 )

The difficulty with glmnets is that they have a free parameter (\(\lambda\)) – the “strength” of penalising the fit – which alters the number of predictors included, so we use cross-validation to find the optimal \(\lambda\) for this data set – remember, this is not cross validation to test the predictive value of the model overall, but “within” the model building phase, a way of feature selecting the best predictors given the problems highlighted above (dependency, sparsity, correlated predictors etc.).

cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 1, type.measure = "class")
plot( cv.fit )

So, this graph shows the misclassification error (the lower the better) on the group variable as a function of \(\lambda\) – where varying \(\lambda\) includes (or excludes) more (or less) predictors. By looking for a global minimum in this plot, we can find the \(\lambda\) – or range of \(\lambda\) values – that gives us the best performing model and then investigate which predictors were included. The dotted lines show us the best \(\lambda\) values. And the axis label on the top of the graph tells us the number of predictors, which appears to be one predictor. We can go hunting for this predictor for values of \(\lambda\):

# lambda that gives best performance on misclassification error
bestlambda <- cv.fit$lambda.min
# the best lambda which is within one standard error of the minimum misclassification error
best1se    <- cv.fit$lambda.1se
# extract parameters at each of these lambda values
coefs.bestLambda <- as.matrix( coef(cv.fit, s = bestlambda ) )
coefs.best1se    <- as.matrix( coef(cv.fit, s = best1se ) )
# which coefficients are non-zero (e.g. contributing to the optimal performing model at lambda)
# 1. for bestLambda
( rbind( rownames( coefs.bestLambda )[ which( coefs.bestLambda > 0 ) ], coefs.bestLambda[ which( coefs.bestLambda > 0 ) ] ) )
     [,1]               
[1,] "PANSS_pos"        
[2,] "0.304969219156202"
# 2. for best1se
( rbind( rownames( coefs.best1se )[ which( coefs.best1se > 0 ) ], coefs.best1se[ which( coefs.best1se > 0 ) ] ) )
     [,1]               
[1,] "PANSS_pos"        
[2,] "0.215960414480748"

So it seems, PANSS_pos – among all the predictors – is the best predictor of group when feature selection is automated by a LASSO-penalty regularised logistic regression.

Just in case we think some ‘magic’ is happening, we can apply the same model to the original complete data set without the pre-processing (where we used context and understanding of the problem to remove redundancy in the predictors)

# re-load and tidy data
d <- read.csv( "Data_prediction_FEP.csv")
N <- nrow( d )
# remove the ID column
d <- d[ , -1 ]
# convert gender to a 0/1
d$gender <- d$gender - 1
str( d )
'data.frame':   77 obs. of  19 variables:
 $ age         : int  22 24 20 23 28 25 19 37 21 20 ...
 $ gender      : num  1 1 0 1 1 0 0 0 1 0 ...
 $ iq          : int  116 93 98 90 102 90 89 79 102 133 ...
 $ PANSS_pos   : int  17 20 14 11 13 18 23 15 13 9 ...
 $ PANSS_neg   : int  20 26 9 16 18 22 23 20 8 8 ...
 $ PANSS_gen   : int  36 35 24 29 30 40 41 37 29 22 ...
 $ PANSS_total : int  73 81 47 56 61 80 87 72 50 39 ...
 $ Glu.Cre     : num  1.31 1.13 1.07 1.44 1.18 ...
 $ MI.Cre      : num  0.743 0.635 0.698 0.708 0.713 0.62 0.731 0.752 0.627 0.711 ...
 $ NAA.Cre     : num  1.22 1.06 1.14 1.49 1.33 ...
 $ GPC.Pch_Cre : num  0.228 0.182 0.231 0.252 0.248 0.266 0.213 0.303 0.26 0.252 ...
 $ NAA.NAAG_Cre: num  1.28 1.1 1.25 1.49 1.33 ...
 $ BA8         : num  -6.07 -5.85 -66.47 19.48 33.93 ...
 $ Caudaate    : num  -42.15 39.86 40.09 5.29 -23.81 ...
 $ PAC         : num  -87.46 5.25 -77.49 -59.63 -62.94 ...
 $ lgi_parietal: num  3.21 3.33 3.62 3.74 3.29 ...
 $ lgi_occ     : num  4.35 3.92 4.5 4.76 4.08 ...
 $ lgi_dlpfc   : num  2.54 2.45 2.68 2.83 2.66 ...
 $ group       : int  1 1 0 0 0 0 1 0 0 0 ...

Now, re-run the exact same analyses using the glmnet:

x.vars <- names( d )[1:(length(names(d))-1)]
y.var  <- names( d )[length(names(d))]
# this time, keep **all** variables
Y <- d[ , y.var ]
X <- as.matrix( d[ , x.vars ] )
# CV the glmnet to find the best lambda's
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 1, type.measure = "class")
plot( cv.fit )

And now, retrieve the best predictors at the \(\lambda\) that minimised misclassification error:

bestlambda <- cv.fit$lambda.min
best1se    <- cv.fit$lambda.1se
coefs.bestLambda <- as.matrix( coef(cv.fit, s = bestlambda ) )
coefs.best1se    <- as.matrix( coef(cv.fit, s = best1se ) )
# which coefficients are non-zero (e.g. contributing to the optimal performing model at lambda)
# 1. for bestLambda
( rbind( rownames( coefs.bestLambda )[ which( coefs.bestLambda > 0 ) ], coefs.bestLambda[ which( coefs.bestLambda > 0 ) ] ) )
     [,1]               
[1,] "PANSS_pos"        
[2,] "0.377010551976357"
# 2. for best1se
( rbind( rownames( coefs.best1se )[ which( coefs.best1se > 0 ) ], coefs.best1se[ which( coefs.best1se > 0 ) ] ) )
     [,1]              
[1,] "PANSS_pos"       
[2,] "0.32290657768718"

And again : it’s PANSS_pos which is the strongest predictor of group

This tells us:

  1. using context knowledge to manually debug the GLM showed some redundancy, but removing it didn’t help with a classic GLM
  2. with the context knowledge (“human expertise”) applied, we then defer to some automatic feature selection (glmnet) which tells us it’s the PANSS_pos that does the work in predicting group
  3. Throwing away human expertise, and using the original unfetered data – deferring entirely to the automated feature selection – we arrive at the same answer.

6. Assume PANSS scores are not Needed / Relevant

Assume that we want to use the biomarker data, which we hope is a better or more informative predictor than clinical data. We can strip out the clinical data and try on just the biomarkers.

# identify the clinical data and remove (note, we leave demographic data in)
dropCols <- c("PANSS_pos", "PANSS_neg", "PANSS_gen", "PANSS_total")
d <- d[ , -which( names(d) %in% dropCols ) ]
(str( d ))
'data.frame':   77 obs. of  15 variables:
 $ age         : int  22 24 20 23 28 25 19 37 21 20 ...
 $ gender      : num  1 1 0 1 1 0 0 0 1 0 ...
 $ iq          : int  116 93 98 90 102 90 89 79 102 133 ...
 $ Glu.Cre     : num  1.31 1.13 1.07 1.44 1.18 ...
 $ MI.Cre      : num  0.743 0.635 0.698 0.708 0.713 0.62 0.731 0.752 0.627 0.711 ...
 $ NAA.Cre     : num  1.22 1.06 1.14 1.49 1.33 ...
 $ GPC.Pch_Cre : num  0.228 0.182 0.231 0.252 0.248 0.266 0.213 0.303 0.26 0.252 ...
 $ NAA.NAAG_Cre: num  1.28 1.1 1.25 1.49 1.33 ...
 $ BA8         : num  -6.07 -5.85 -66.47 19.48 33.93 ...
 $ Caudaate    : num  -42.15 39.86 40.09 5.29 -23.81 ...
 $ PAC         : num  -87.46 5.25 -77.49 -59.63 -62.94 ...
 $ lgi_parietal: num  3.21 3.33 3.62 3.74 3.29 ...
 $ lgi_occ     : num  4.35 3.92 4.5 4.76 4.08 ...
 $ lgi_dlpfc   : num  2.54 2.45 2.68 2.83 2.66 ...
 $ group       : int  1 1 0 0 0 0 1 0 0 0 ...
NULL
x.vars <- names( d )[1:(length(names(d))-1)]
y.var  <- names( d )[length(names(d))]
Y <- d[ , y.var ]
X <- as.matrix( d[ , x.vars ] )
# CV the glmnet to find the best lambda's
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 1, type.measure = "class")
plot( cv.fit )

Notice how the misclassification error is now much higher overall (0.25, as compared with 0.10 when clinical data was included) suggesting the biomarkers are not doing better than the clinical data. Problematically, the best values for \(\lambda\) and the one-standard error range all indicate a model with zero predictors, suggesting the model is degenerate.

This could be because we are being too “strict” with the elastic- versus LASSO-penalty. Assume that rather than their being an optimal set of linearly-independent predictors, there are in fact a small set of relatively weak predictors that “work together” to assist in prediction. This can be controlled by altering the value of the \(\alpha\) parameter in the GLMnet model. Here’s how: setting \(\alpha = 1\) means we want a “strict” LASSO regularisation and \(\alpha = 0\) is the strict ridge-regression. The LASSO tends to find the best predictors and throw out all others. The ridge model allows for groups of variables to work together. So if we set \(\alpha = 0.5\), (and then vary \(\lambda\) as before), groups of predictors which covary together are kept in the model by balancing LASSO against elastic-net behaviours in the model fitting (feature selection). This may or may not improve the model performance, but it will likely keep the number of predictors included > 0. The authors of GLMnet recommend that setting \(\alpha = 1 - \epsilon\) (where \(\epsilon\) is some small value \(> 0\)) means predictors are included using LASSO, but degenerate or extreme correlations in variables are thrown out.

# the model with alpha = 0.5
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 0.5, type.measure = "class")
plot( cv.fit )

Still the same problem. Let’s now try the \(\epsilon\) approach to remove degeneracy but keep the robustness of LASSO:

# the model with alpha = 0.05
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 0.05, type.measure = "class")
plot( cv.fit )

Same problems – of note, rerunning these above two analyses gives different results each time (they are stochastic algorithms) but the repeatability can be tested by re-running and keeping track of the predictors that are included. Note that a notion of “p values” on the coefficients in a GLMnet is somewhat moot – instead, we rely on resampling to find out if the predictors are consistently associated with misclassification error.

Let’s try that, assuming that some occasional stochastic factors are causing the solution to vary and we’ll record the predictors found to be useful on each run of the estimation to see if there’s a pattern, and to do this, we’ll use only the best \(\lambda\):

# Warning : this takes about 10 minutes to run on a single CPU
N.sims <- 10000
tabResults <- data.frame( Var = colnames(X), Included = rep(0,length(colnames(X))) )
tabError   <- rep(0, N.sims)

for ( i in 1:N.sims ) {
  cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 0.05, type.measure = "class")
  bestlambda <- cv.fit$lambda.min
  coefs.bestLambda <- round( as.matrix( coef(cv.fit, s = bestlambda ) ), 6 )
  
  # find non-zero coefficients (at 6 dec. places precision) -- excluding the intercept
  nonZero <- which( coefs.bestLambda[2:nrow(coefs.bestLambda)] != 0)
  
  if( length( nonZero ) > 0 ) {  ## if there are non-zero coefficients
    # increment counts
    tabResults$Included[ nonZero ] <- tabResults$Included[ nonZero ] + 1
  }
  
  # store the best CV misclassification error
  tabError[i] <- cv.fit$cvm[ which( cv.fit$lambda == bestlambda ) ]
}

The mean classification error works out to be with standard deviation:

(mean( tabError ))
[1] 0.2523286
(sd(tabError))
[1] 0.008595453

Examining the table of predictors which are inluded / excluded on each of 10,000 runs:

(tabResults)

We can see that on aprroximately half the runs all predictors are included in the model, and on the other 5000+ runs, none are included (that’s why they all have 4764 as their counts - they were all in, or out, together). Glu.cre creeps in for a very small proportion: 0.46% of the models used this predictor).

This tells us that the best error we’ll get is 0.25 (so 25% of the sample are incorrectly classified) – there are 57 for group = 0 and 20 for group = 1. So assuming symmetric errors (e.g. we get 25% wrong equally in both groups) this means 14 people are classified incorrectly in group = 0 and 5 are incorrectly classified in group = 1 – and these can be with either none, or all of the predictors included.

This suggests – even with access to all the data for training and testing (validating) the model, we can’t find very robust predictors beyond the PANSS_pos variable which outperforms all the biomarkers (recall, the minimum misclassification error was 0.1 for PANSS_pos).

---
title: "FEP Predictors: Exploratory Analysis and Regularised GLMs"
date: "`r format(Sys.time(), '%d %B, %Y')`"
author: Dan W Joyce
output: html_notebook
---

## 1. Setup
```{r}
rm( list = ls() )
require( ggplot2 )
require( Matrix )
require( corrplot )

# load an tidy data
d <- read.csv( "Data_prediction_FEP.csv")
N <- nrow( d )
# remove the ID column
d <- d[ , -1 ]

str( d )
```

Gender is coded numerically, as 1 and 2.  This means it needs dummy coding, but as only two levels, we can hack it to be 0/1 and keep it as a numerical predictor for a preliminary linear model / GLM

```{r}
d$gender <- d$gender - 1
```


## 2. Inferential / Association Analysis
$Y = \{ 0,1 \}$, so clearly a binomial, which suggests a logistic regression is the most basic model for this kind of data.  Include all variables (so no regularisation / online feature selection) and let's just see what it shows.
```{r}

# independent (predictor) variables
x.vars <- names( d )[1:(length(names(d))-1)]

# dependent (outcome, response) variables
y.var  <- names( d )[length(names(d))]

# build formula object for the glm
(modelFormula <- as.formula( paste(y.var, paste(x.vars, collapse=" + "), sep=" ~ ") ) )
```

Now, fit the GLM with *all* the data available -- just a classic inferential analysis

```{r}
lr.fit <- glm( modelFormula, family = binomial, data = d )
summary(lr.fit)
```

Something suggesing rank deficiency in the design matrix -- likely that the predictors are not linearly-independent or one of them is perfectly associated with the response variable `group`.  `summary( lr.fit )` suggests `PANSS_total` is problematic, but there might be others, so look at correlations:

```{r}
# explore ?highly correlated variables
corr.d <- cor( d )
corr.d[ lower.tri( corr.d, diag = TRUE ) ] <- NA
corrplot( corr.d, type = "upper", diag = FALSE )
```

Prepare a table of pairs of variables, descending-ordered by correlation:

```{r}
# sort pairs of correlated variables
m <- melt( corr.d )
m <- m[order(- abs(m$value)), ]
# get rid of NAs
m <- na.omit(m)

# view the highest correlated: rho > 0.5
(m[ which( m$value > 0.5 ), ])

```
## 3. Debugging the GLM 

Observe:

  1. `NAA.Cre` is highly correlated with `NAA.NAAG_Cre` -- unsurprising given the properties of MRS data.
  2. `PANSS_pos` is impressively correlated with `group` -- suggesting that if group represents something meaningful, it **may be** confounded with positive symptoms.

There are two possibility:

  1. If `group` is defined as a function of `PANSS_pos`
  2. `group` is not defined as a function of `PANSS_pos`, but represents a measurement only at some baseline/timepoint.

The first problem is easily seen as follows:  If the response (dependent) variable `group` represents a function of `PANSS_pos` **across** time points, then we have found the reason why our GLM is failing - they are correlated because one is a function of the other.  Let PANSS_pos measured at some "baseline" be $P_+(t=0)$.  Let the same variable at time $t=1$ (some follow up) be $P_+(t=1)$.  Then if `group` represents "treatment resistance" status I suspect that in fact group is defined: 

$$ 
Group\left( P_+( t = 0), P_+( t = 1 ) \right) = \left\{
                                  \begin{array}
                                  \\
                                      1 \text{ if } P_+( t = 1 ) - P_+( t = 0 ) > \tau_{TRS} \\
                                      0 \text{ otherwise}
                                  \end{array}
                                  \right.
$$
Where $\tau_{TRS}$ is some threshold value such that: if the patient's positive PANSS score *does not* change $> \tau_{TRS}$ between two time points ($t=0$ and $t=1$), the patient is assigned TRS status (`group=1`).  So, if `group` is defined by the function above, there is a reason to expect `group` will be linearly dependent on `PANSS_pos`.

The second problem is more prosaically related to the **value** of biomarkers.  Assume that `PANSS_pos` is measured **only** at baseline, so we hope it is *one of many* possible predictors.  Then, given it's correlation with `group` it is a pretty good predictor all on it's own (irrespective of any of the biomarkers).


To proceed : I assume that `PANSS_pos` is the measurement taken at **baseline**, and in fact, `group` is **not** defined as above so that there is no way $P_+(t=0)$ and $P_+(t=1)$ both enter into the derived variable `group`.

So we'll try this: removing "redundant" independent variables (predictors) which are highly correlated **with each other** (since being correlated with the response/outcome `group` is not the problem, its linear-*dependence* between predictors which causes the GLM to fail)

## 4. Revising the GLM

Start be defining (somewhat arbitrarily) the predictor variables which are highly linearly dependent based on their correlation scores:

```{r}
m[ which( m$value > 0.5 ),  ]
```

And now try a re-fit of the original GLM, with variables that are highly correlated with each other removed, and apply a bit of clinical intuition also; `PANSS_neg` is correlated with `PANSS_gen` and likewise, `PANSS_pos` is correlated with `PANSS_gen` suggesting that `PANSS_gen` is redundant or might be seen as a confounder and is adding to the rank-deficiency in estimating the GLM.  As discussed above, `NAA.NAAG_Cre` also highly correlated with `NAA.Cre` and since one is derived from the other (by the LCModel) we cna ditch one.

```{r}

dropCols <- which( names( d ) %in% c("NAA.NAAG_Cre", "PANSS_gen") )
d <- d[ , -dropCols]

x.vars <- names( d )[1:(length(names(d))-1)]
y.var  <- names( d )[length(names(d))]
(modelFormula <- as.formula( paste(y.var, paste(x.vars, collapse=" + "), sep=" ~ ") ) )
lr.fit <- glm( modelFormula, family = binomial, data = d )
summary(lr.fit)

```
Still, we have a problem -- applying a bit of "human intelligence" and knowledge of the data set to the problem doesn't fix it.



## 5. Automated Feature Selection **with** Redundancy Reduced
It's going to be hard to proceed by "manually" trying to debug the design matrix for the GLM, so we can resort to some automated methods.  With complex covariance / linear-dependence structures, sometimes using regularisation can help in feature selecting the best predictors -- especially in cases like this, where there is a bunch of dependencies we haven't worked out. 

We'll try to see if using LASSO-type regularisation can help reduce the number of predictors in the model.

```{r}
require( glmnet )

x.vars <- names( d )[1:(length(names(d))-1)]
y.var  <- names( d )[length(names(d))]

Y <- d[ , y.var ]
X <- as.matrix( d[ , x.vars ] )

# note: using largely default parameters (alpha = 1 is the LASSO regularisation), and we let glmnet
# standardise variables (mean = 0, sd = 1) as units / scales not equivalent across all predictors (IQ and Age for example - you 
# can have an IQ of 100, but unlikely that Age will be on the same scale)
net.fit <- glmnet( X, Y, family = "binomial", alpha = 1 )


```

The difficulty with `glmnet`s is that they have a free parameter ($\lambda$) -- the "strength" of penalising the fit --  which alters the number of predictors included, so we use cross-validation to find the optimal $\lambda$ for this data set -- remember, this is **not** cross validation to **test** the predictive value of the model overall, but "within" the model building phase, a way of feature selecting the **best** predictors given the problems highlighted above (dependency, sparsity, correlated predictors etc.).  


```{r}
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 1, type.measure = "class")
plot( cv.fit )
```
So, this graph shows the **misclassification error** (the lower the better) on the `group` variable as a function of $\lambda$ -- where varying $\lambda$ includes (or excludes) more (or less) predictors.  By looking for a global minimum in this plot, we can find the $\lambda$ -- or range of $\lambda$ values -- that gives us the best performing model and then investigate which predictors were included.  The dotted lines show us the best $\lambda$ values.  And the axis label on the top of the graph tells us the number of predictors, which appears to be **one** predictor.  We can go hunting for this predictor for values of $\lambda$:

```{r}
# lambda that gives best performance on misclassification error
bestlambda <- cv.fit$lambda.min

# the best lambda which is within one standard error of the minimum misclassification error
best1se    <- cv.fit$lambda.1se

# extract parameters at each of these lambda values
coefs.bestLambda <- as.matrix( coef(cv.fit, s = bestlambda ) )
coefs.best1se    <- as.matrix( coef(cv.fit, s = best1se ) )

# which coefficients are non-zero (e.g. contributing to the optimal performing model at lambda)
# 1. for bestLambda
( rbind( rownames( coefs.bestLambda )[ which( coefs.bestLambda > 0 ) ], coefs.bestLambda[ which( coefs.bestLambda > 0 ) ] ) )
# 2. for best1se
( rbind( rownames( coefs.best1se )[ which( coefs.best1se > 0 ) ], coefs.best1se[ which( coefs.best1se > 0 ) ] ) )


```

So it seems, `PANSS_pos` -- among all the predictors -- is the best predictor of `group` when feature selection is automated by a LASSO-penalty regularised logistic regression.

Just in case we think some 'magic' is happening, we can apply the same model to the original complete data set without the pre-processing (where we used context and understanding of the problem to remove redundancy in the predictors)

```{r}
# re-load and tidy data
d <- read.csv( "Data_prediction_FEP.csv")
N <- nrow( d )
# remove the ID column
d <- d[ , -1 ]
# convert gender to a 0/1
d$gender <- d$gender - 1
str( d )
```

Now, re-run the exact same analyses using the `glmnet`:

```{r}
x.vars <- names( d )[1:(length(names(d))-1)]
y.var  <- names( d )[length(names(d))]

# this time, keep **all** variables
Y <- d[ , y.var ]
X <- as.matrix( d[ , x.vars ] )

# CV the glmnet to find the best lambda's
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 1, type.measure = "class")
plot( cv.fit )



```
And now, retrieve the best predictors at the $\lambda$ that minimised misclassification error:

```{r}
bestlambda <- cv.fit$lambda.min
best1se    <- cv.fit$lambda.1se

coefs.bestLambda <- as.matrix( coef(cv.fit, s = bestlambda ) )
coefs.best1se    <- as.matrix( coef(cv.fit, s = best1se ) )

# which coefficients are non-zero (e.g. contributing to the optimal performing model at lambda)
# 1. for bestLambda
( rbind( rownames( coefs.bestLambda )[ which( coefs.bestLambda > 0 ) ], coefs.bestLambda[ which( coefs.bestLambda > 0 ) ] ) )
# 2. for best1se
( rbind( rownames( coefs.best1se )[ which( coefs.best1se > 0 ) ], coefs.best1se[ which( coefs.best1se > 0 ) ] ) )
```

And again : it's `PANSS_pos` which is the strongest predictor of `group`

This tells us:

  1. using context knowledge to manually debug the GLM showed some redundancy, but removing it didn't help with a classic GLM
  2. with the context knowledge ("human expertise") applied, we then defer to some automatic feature selection (`glmnet`) which tells us it's the `PANSS_pos` that does the work in predicting `group`
  3. Throwing away human expertise, and using the original unfetered data -- deferring entirely to the automated feature selection -- we arrive at the same answer.
  
  
## 6. Assume PANSS scores are **not** Needed / Relevant 
Assume that we want to use the biomarker data, which we hope is a better or more informative predictor than clinical data. We can strip out the clinical data and try on **just** the biomarkers.

```{r}

# identify the clinical data and remove (note, we leave demographic data in)
dropCols <- c("PANSS_pos", "PANSS_neg", "PANSS_gen", "PANSS_total")

d <- d[ , -which( names(d) %in% dropCols ) ]

(str( d ))

x.vars <- names( d )[1:(length(names(d))-1)]
y.var  <- names( d )[length(names(d))]

Y <- d[ , y.var ]
X <- as.matrix( d[ , x.vars ] )

# CV the glmnet to find the best lambda's
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 1, type.measure = "class")
plot( cv.fit )

```
Notice how the misclassification error is now much higher overall (0.25, as compared with 0.10 when clinical data **was** included) suggesting the biomarkers are not doing better than the clinical data.  Problematically, the **best** values for $\lambda$ and the one-standard error range all indicate a model with **zero** predictors, suggesting the model is degenerate.

This could be because we are being too "strict" with the elastic- versus LASSO-penalty.  Assume that rather than their being an optimal set of linearly-independent predictors, there are in fact a small set of relatively weak predictors that "work together" to assist in prediction.  This can be controlled by altering the value of the $\alpha$ parameter in the GLMnet model.  Here's how: setting $\alpha = 1$ means we want a "strict" LASSO regularisation and $\alpha = 0$ is the strict ridge-regression.  The LASSO tends to find the best predictors and throw out all others.  The ridge model allows for groups of variables to work together.  So if we set $\alpha = 0.5$, (and then vary $\lambda$ as before), groups of predictors which covary together are kept in the model by balancing LASSO against elastic-net behaviours in the model fitting (feature selection).  This may or may not improve the model performance, but it will likely keep the number of predictors included > 0.  The authors of GLMnet recommend that setting $\alpha = 1 - \epsilon$ (where $\epsilon$ is some small value  $> 0$) means predictors are included using LASSO, but degenerate or extreme correlations in variables are thrown out. 

```{r}
# the model with alpha = 0.5
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 0.5, type.measure = "class")
plot( cv.fit )
```

Still the same problem.  Let's now try the $\epsilon$ approach to remove degeneracy but keep the robustness of LASSO:

```{r}
# the model with alpha = 0.05
cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 0.05, type.measure = "class")
plot( cv.fit )
```
Same problems -- of note, rerunning these above two analyses gives different results each time (they are stochastic algorithms) but the  repeatability can be tested by re-running and keeping track of the predictors that are included.  Note that a notion of "p values" on the coefficients in a GLMnet is somewhat moot -- instead, we rely on resampling to find out if the predictors are consistently associated with misclassification error.  

Let's try that, assuming that some occasional stochastic factors are causing the solution to vary and we'll record the predictors found to be useful on each run of the estimation to see if there's a pattern, and to do this, we'll use only the best $\lambda$:

```{r}
# Warning : this takes about 10 minutes to run on a single CPU
N.sims <- 10000
tabResults <- data.frame( Var = colnames(X), Included = rep(0,length(colnames(X))) )
tabError   <- rep(0, N.sims)

for ( i in 1:N.sims ) {
  cv.fit <- cv.glmnet( X, Y, family = "binomial", alpha = 0.05, type.measure = "class")
  bestlambda <- cv.fit$lambda.min
  coefs.bestLambda <- round( as.matrix( coef(cv.fit, s = bestlambda ) ), 6 )
  
  # find non-zero coefficients (at 6 dec. places precision) -- excluding the intercept
  nonZero <- which( coefs.bestLambda[2:nrow(coefs.bestLambda)] != 0)
  
  if( length( nonZero ) > 0 ) {  ## if there are non-zero coefficients
    # increment counts
    tabResults$Included[ nonZero ] <- tabResults$Included[ nonZero ] + 1
  }
  
  # store the best CV misclassification error
  tabError[i] <- cv.fit$cvm[ which( cv.fit$lambda == bestlambda ) ]
}
```

The mean classification error works out to be with standard deviation:
```{r}
(mean( tabError ))
(sd(tabError))
```

Examining the table of predictors which are inluded / excluded on each of 10,000 runs:
```{r}
(tabResults)
```
We can see that on aprroximately half the runs **all** predictors are included in the model, and on the other 5000+ runs, none are included (that's why they all have 4764 as their counts - they were all in, or out, together).  `Glu.cre` creeps in for a very small proportion: 0.46% of the models used this predictor).

This tells us that the best error we'll get is 0.25 (so 25% of the sample are incorrectly classified)  -- there are 57 for `group = 0` and 20 for `group = 1`.  So assuming symmetric errors (e.g. we get 25% wrong equally in both groups) this means 14 people are classified incorrectly in `group = 0` and 5 are incorrectly classified in `group = 1` -- and these can be with either none, or all of the predictors included.  

This suggests -- even with access to *all* the data for training and testing (validating) the model, we can't find very robust predictors beyond the `PANSS_pos` variable which outperforms all the biomarkers (recall, the minimum misclassification error was 0.1 for `PANSS_pos`).

