Here I showcase of to use BSWiMS feature selection/modeling function coupled with Goal Driven Sparse Transformation Matrix (UPSTM) as a pre-processing step to decorrelate highly correlated features. The aim(s) are:
To improve model performance by uncovering the hidden information between correlated features.
To simplify the interpretation of the machine learning models.
This demo will use:
FRESA.CAD::IDeA(). For Decorrelation of Multidimensional data sets
FRESA.CAD::randomCV() For the cross-validation of the Machine Learning models
FRESA.CAD::BSWiMS.model(). For the generation of bootstrapped logistic models
FRESA.CAD::predictionStats_binary(). For describing the performance of the model
heatmap.2(). For displaying the correlation matrix
igraph::graph_from_adjacency_matrix(). For the display of the network of BSWiMS formulas
vioplot::vioplot(). For the display of the z-distribution of significant features.
library("FRESA.CAD")
library(readxl)
library(vioplot)
library(igraph)
op <- par(no.readonly = TRUE)
pander::panderOptions('digits', 3)
pander::panderOptions('table.split.table', 400)
pander::panderOptions('keep.trailing.zeros',TRUE)
The function will be used to transform all the continuous features of the data
signedlog <- function(x) { return (sign(x)*log10(abs(1.0e6*x)+1.0)-6)}
The data to process is described in:
Erdogdu Sakar, Betul, Gorkem Serbes, and C. Okan Sakar. “Analyzing the effectiveness of vocal features in early telediagnosis of Parkinson’s disease.” PloS one 12, no. 8 (2017): e0182428.
The data was obtained from the UCI ML repository:
https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification
I added a column to the data identifying the repeated experiments.
pd_speech_features <- as.data.frame(read_excel("~/GitHub/FCA/Data/pd_speech_features.xlsx",sheet = "pd_speech_features", range = "A2:ACB758"))
##The fraction of samples in the training set
trainFraction=0.65
## The file with the codes to create shorter names
namecode <- read.csv("~/GitHub/FCA/Data/Parkinson_names.csv")
Each subject had three repeated observations. Here I’ll use the average of the three experiments per subject.
rep1Parkison <- subset(pd_speech_features,RID==1)
rownames(rep1Parkison) <- rep1Parkison$id
rep1Parkison$id <- NULL
rep1Parkison$RID <- NULL
rep1Parkison[,1:ncol(rep1Parkison)] <- sapply(rep1Parkison,as.numeric)
rep2Parkison <- subset(pd_speech_features,RID==2)
rownames(rep2Parkison) <- rep2Parkison$id
rep2Parkison$id <- NULL
rep2Parkison$RID <- NULL
rep2Parkison[,1:ncol(rep2Parkison)] <- sapply(rep2Parkison,as.numeric)
rep3Parkison <- subset(pd_speech_features,RID==3)
rownames(rep3Parkison) <- rep3Parkison$id
rep3Parkison$id <- NULL
rep3Parkison$RID <- NULL
rep3Parkison[,1:ncol(rep3Parkison)] <- sapply(rep3Parkison,as.numeric)
whof <- !(colnames(rep1Parkison) %in% c("gender","class"));
avgParkison <- rep1Parkison;
avgParkison[,whof] <- (rep1Parkison[,whof] + rep2Parkison[,whof] + rep3Parkison[,whof])/3
## I apply the log transform to the data
avgParkison[,whof] <- signedlog(avgParkison[,whof])
pander::pander(table(avgParkison$class))
| 0 | 1 |
|---|---|
| 64 | 188 |
The heat-map of the correlation:
cormat <- cor(avgParkison[,colnames(avgParkison)!="class"],method="spearman")
cormat[is.na(cormat)] <- 0
gplots::heatmap.2(abs(cormat),
trace = "none",
scale = "none",
mar = c(10,10),
col=rev(heat.colors(5)),
main = "Raw Correlation",
cexRow = 0.35,
cexCol = 0.35,
key.title=NA,
key.xlab="Spearman Correlation",
xlab="Feature", ylab="Feature",
# srtRow = 45,
# srtCol = 45
)
We divided the data into training and testing sets.
set.seed(2)
caseSet <- subset(avgParkison, class == 1)
controlSet <- subset(avgParkison, class == 0)
caseTrainSize <- nrow(caseSet)*trainFraction;
controlTrainSize <- nrow(controlSet)*trainFraction;
sampleCaseTrain <- sample(nrow(caseSet),caseTrainSize)
sampleControlTrain <- sample(nrow(controlSet),controlTrainSize)
trainSet <- rbind(caseSet[sampleCaseTrain,], controlSet[sampleControlTrain,])
testSet <- rbind(caseSet[-sampleCaseTrain,],controlSet[-sampleControlTrain,])
pander::pander(table(trainSet$class))
| 0 | 1 |
|---|---|
| 41 | 122 |
pander::pander(table(testSet$class))
| 0 | 1 |
|---|---|
| 23 | 66 |
I compute a decorrelated version of the training and testing sets using the IDeA() function of FRESA.CAD. The first decorrelation will be driven by features associated with the outcome. The second decorrelation will find the UPSTM without the outcome restriction.
## The UPSTM transformation driven by the Outcome
deTrain <- IDeA(trainSet,Outcome="class",thr=0.8,verbose = TRUE,skipRelaxed=FALSE)
Included: 679 , Uni p: 0.0159386 , Uncorrelated Base: 167 , Outcome-Driven Size: 67 , Base Size: 196
1 <R=1.000,w= 1,N= 373>, Top: 95( 2 )1 : 95 : 0.975,<|>Tot Used: 317 , Added: 224 , Zero Std: 0 , Max Cor: 1.000
2 <R=1.000,w= 1,N= 373>, Top: 24( 7 )1 : 24 : 0.975,<|>Tot Used: 353 , Added: 55 , Zero Std: 0 , Max Cor: 0.998
3 <R=0.998,w= 1,N= 373>, Top: 14( 3 )1 : 14 : 0.974,<|>Tot Used: 368 , Added: 16 , Zero Std: 0 , Max Cor: 0.974
4 <R=0.974,w= 2,N= 208>, Top: 76( 2 )=( 1 )2 : 76 : 0.966,<|>Tot Used: 441 , Added: 102 , Zero Std: 0 , Max Cor: 0.984
5 <R=0.984,w= 2,N= 208>, Top: 16( 2 )1 : 16 : 0.942,<|>Tot Used: 443 , Added: 19 , Zero Std: 0 , Max Cor: 0.967
6 <R=0.967,w= 3,N= 162>, Top: 60( 1 )1 : 60 : 0.884,<|>Tot Used: 477 , Added: 71 , Zero Std: 0 , Max Cor: 0.983
7 <R=0.983,w= 3,N= 162>, Top: 16( 1 )1 : 16 : 0.891,<|>Tot Used: 478 , Added: 19 , Zero Std: 0 , Max Cor: 0.909
8 <R=0.909,w= 3,N= 162>, Top: 35( 1 )1 : 35 : 0.854,<|>Tot Used: 493 , Added: 39 , Zero Std: 0 , Max Cor: 0.985
9 <R=0.985,w= 3,N= 162>, Top: 6( 1 )1 : 6 : 0.893,<|>Tot Used: 493 , Added: 6 , Zero Std: 0 , Max Cor: 0.877
10 <R=0.877,w= 4,N= 158>, Top: 64( 1 )1 : 64 : 0.800,<|>Tot Used: 527 , Added: 69 , Zero Std: 0 , Max Cor: 0.962
11 <R=0.962,w= 4,N= 158>, Top: 8( 1 )1 : 8 : 0.831,<|>Tot Used: 527 , Added: 10 , Zero Std: 0 , Max Cor: 0.883
12 <R=0.883,w= 5,N= 19>, Top: 9( 1 )1 : 9 : 0.800,<|>Tot Used: 527 , Added: 8 , Zero Std: 0 , Max Cor: 0.800
13 <R=0.000,w= 5,N= 19>
[ 13 ], 0.7999961 Decor Dimension: 527 . Cor to Base: 168 , ABase: 21 , Outcome Base: 33
deTest <- predictDecorrelate(deTrain,testSet)
## The UPSTM transformation without outcome
deTrainU <- IDeA(trainSet,thr=0.8,verbose = TRUE,skipRelaxed=FALSE)
Included: 679 , Uni p: 0.0159386 , Uncorrelated Base: 186 , Outcome-Driven Size: 0 , Base Size: 186
1 <R=1.000,w= 1,N= 373>, Top: 93( 2 )1 : 93 : 0.975,<|>Tot Used: 323 , Added: 231 , Zero Std: 0 , Max Cor: 1.000
2 <R=1.000,w= 1,N= 373>, Top: 20( 7 )1 : 20 : 0.975,<|>Tot Used: 354 , Added: 53 , Zero Std: 0 , Max Cor: 0.998
3 <R=0.998,w= 1,N= 373>, Top: 11( 3 )1 : 11 : 0.974,<|>Tot Used: 369 , Added: 13 , Zero Std: 0 , Max Cor: 0.974
4 <R=0.974,w= 2,N= 208>, Top: 73( 2 )1 : 73 : 0.937,<|>Tot Used: 438 , Added: 103 , Zero Std: 0 , Max Cor: 0.984
5 <R=0.984,w= 2,N= 208>, Top: 14( 2 )1 : 14 : 0.942,<|>Tot Used: 442 , Added: 16 , Zero Std: 0 , Max Cor: 0.944
6 <R=0.944,w= 3,N= 176>, Top: 62( 2 )=2 : 62 : 0.916,<|>Tot Used: 477 , Added: 83 , Zero Std: 0 , Max Cor: 0.981
7 <R=0.981,w= 3,N= 176>, Top: 13( 1 )1 : 13 : 0.891,<|>Tot Used: 485 , Added: 16 , Zero Std: 0 , Max Cor: 0.897
8 <R=0.897,w= 4,N= 202>, Top: 71( 6 )1 : 71 : 0.800,<|>Tot Used: 520 , Added: 102 , Zero Std: 0 , Max Cor: 0.979
9 <R=0.979,w= 4,N= 202>, Top: 19( 1 )1 : 19 : 0.840,<|>Tot Used: 521 , Added: 19 , Zero Std: 0 , Max Cor: 0.919
10 <R=0.919,w= 5,N= 22>, Top: 10( 1 )1 : 10 : 0.800,<|>Tot Used: 522 , Added: 12 , Zero Std: 0 , Max Cor: 0.905
11 <R=0.905,w= 5,N= 22>, Top: 1( 1 )1 : 1 : 0.800,<|>Tot Used: 522 , Added: 1 , Zero Std: 0 , Max Cor: 0.799
12 <R=0.000,w= 6,N= 0>
[ 12 ], 0.7994925 Decor Dimension: 522 . Cor to Base: 217 , ABase: 39 , Outcome Base: 0
deTestU <- predictDecorrelate(deTrainU,testSet)
tranformed2 <- colnames(deTrain)[str_detect(colnames(deTrain),"La_")]
tranformed <- str_remove_all(tranformed2,"La_")
dsubs <- as.matrix(dist(as.matrix(trainSet[,tranformed]),"euclidean"))
gplots::heatmap.2(dsubs,
trace = "none",
scale = "none",
mar = c(10,10),
col=rev(heat.colors(5)),
main = "Original: Train Distances",
cexRow = 0.35,
cexCol = 0.35,
key.title=NA,
key.xlab="Distance",
xlab="Subject", ylab="Subject")
dsubsD <- as.matrix(dist(as.matrix(deTrain[,tranformed2]),"euclidean"))
gplots::heatmap.2(dsubsD,
trace = "none",
scale = "none",
mar = c(10,10),
col=rev(heat.colors(5)),
main = "Transformed: Train Distances",
cexRow = 0.35,
cexCol = 0.35,
key.title=NA,
key.xlab="Distance",
xlab="Subject", ylab="Subject")
diff <- dsubs - dsubsD
gplots::heatmap.2(diff,
trace = "none",
scale = "none",
mar = c(10,10),
col=rev(heat.colors(5)),
main = "Distances Diff",
cexRow = 0.35,
cexCol = 0.35,
key.title=NA,
key.xlab="Distance Diff",
xlab="Subject", ylab="Subject")
Before doing the feature analysis. I’ll explore BSWiMS modeling using the Holdout cross validation method of FRESA.CAD. The purpose of the cross-validation is to observe and estimate the performance gain of decorrelation.
par(op)
par(mfrow=c(1,3))
## The Raw validation
cvBSWiMSRaw <- randomCV(avgParkison,
"class",
fittingFunction= BSWiMS.model,
classSamplingType = "Pro",
trainFraction = trainFraction,
repetitions = 150
)
.[++++++++++–+++-]..[+++-+++++-+-].[++++++++++++++++++++]…[+++-].[++++-+++-++-+++-]..[++++–].[++++++++++++++++++++]…[+++++++++++++-+–]..[++++++++++++–]..[++++++-+–]10 Tested: 250 Avg. Selected: 29.4 Min Tests: 1 Max Tests: 7 Mean Tests: 3.56 . MAD: 0.3091545
.[++++-].[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++++++++++]…[+++++++++—+++++]..[++–].[++++++++++++++++-+-]..[++++++++++++++++++++]…[+++++++++++++-++-]..[++++++++++++++++++++]..20 Tested: 252 Avg. Selected: 34.45 Min Tests: 2 Max Tests: 14 Mean Tests: 7.063492 . MAD: 0.3012962
.[++++++++++++++++++++]…[+++++++++++++++++–]..[++++++++++++++++++++]…[+++++++-+++–]..[+++++++++++-]..[+++++++-+++++++++++]..[+++++-+-].[+++++++++++++++-+++]..[++++++++++++++++-]..[++++++++++++++++++++]..30 Tested: 252 Avg. Selected: 36.63333 Min Tests: 4 Max Tests: 19 Mean Tests: 10.59524 . MAD: 0.3090249
.[+++-].[++++++++++++++++++-]..[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++-++++++]..[++++++++++++++++-++]..[+++++++++++++++++++-]..[++++++++++++++++++++]…[++++++++++—+++-]..[++++++++-++-++++-].40 Tested: 252 Avg. Selected: 38 Min Tests: 5 Max Tests: 22 Mean Tests: 14.12698 . MAD: 0.3067733
.[++++++++-+++++-++-]..[+++++++++++-]..[+++++++++++++–]..[+++++++++++++-]..[+++++++++++++–+++]..[++++++-].[++++++++++++-]..[++++++-].[++++++++++++++++-++]..[+++++++++++++++-].50 Tested: 252 Avg. Selected: 37.26 Min Tests: 9 Max Tests: 26 Mean Tests: 17.65873 . MAD: 0.3060329
.[+++++-+++++++++++-]..[+++++++++++++++++++-]..[++++++++++-++-++-]..[++++++++–++++–]..[+++++++-++++—]..[++++++++++–]..[++++++-+++-].[++++++++++-++++++++]..[+++++++++–+-+–]..[+++++++++++++++–+].60 Tested: 252 Avg. Selected: 37.05 Min Tests: 10 Max Tests: 31 Mean Tests: 21.19048 . MAD: 0.305701
.[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++++–]..[++++++++++++++++++-]..[+++++-].[++++++++++++++++++++]…[+++++++++++++++++–]..[+++-+++++-].[++++++++++++++++++++]..70 Tested: 252 Avg. Selected: 37.45714 Min Tests: 11 Max Tests: 35 Mean Tests: 24.72222 . MAD: 0.3059183
.[++++++++++++++++++++]…[++++++++++++++++++-]..[+++++++-+++++-]..[+++++++++++++++-++-]..[+++++-].[++++++++++++++++-+-]..[++++++++++++++++++++]…[+++++++++-++-++–]..[+++++++++++++-++-]..[++++++++++++++++++++]..80 Tested: 252 Avg. Selected: 37.7375 Min Tests: 17 Max Tests: 40 Mean Tests: 28.25397 . MAD: 0.3073354
.[+++++++++++-+—]..[++++++++++++++++++-]..[++++++++++++++++++++]…[++++++++++++++++++++]…[+++++++++-].[++++++++++++++++++++]…[++++++++++++++++++++]…[+++++++++++++-++–]..[+++++++++++++++++++-]..[+-++-]90 Tested: 252 Avg. Selected: 38.17778 Min Tests: 20 Max Tests: 43 Mean Tests: 31.78571 . MAD: 0.3077018
.[++++++++-+++++++-+]..[++++++++++++++++-++]..[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++-+++++-]..[+++++++++++++++++–]..[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++-]..[+++++++++-+++++-++].100 Tested: 252 Avg. Selected: 38.56 Min Tests: 21 Max Tests: 46 Mean Tests: 35.31746 . MAD: 0.3085001
.[++++++++++++++++++++]…[+++++++++++++++-+-]..[+++++++-+-++-]..[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++-].[++++++-+++++++++-+]..[++++++++++++++++++++]…[++++++++++++++++++-]..[+++–]110 Tested: 252 Avg. Selected: 38.81818 Min Tests: 23 Max Tests: 51 Mean Tests: 38.84921 . MAD: 0.3073142
.[+++++++++-++++-+++]..[++++++++-++++++++++]..[+++++-+-].[+-+++++–].[+-+++-++++++-]..[++++++++++++++++++++]…[++++++++++++++++++-]..[++++++++-+-].[++++++++++++++++++++]…[++++++++++++++++-++].120 Tested: 252 Avg. Selected: 38.725 Min Tests: 29 Max Tests: 54 Mean Tests: 42.38095 . MAD: 0.306978
.[+++++++++++-++++–]..[++++++++++++++++++++]…[+++++++++++++++-+++]..[+++++++++++++–+++]..[++++++++++++++++++++]…[++++++++++++-+++–]..[++++++++++++++++++++]…[++++++++++++++++-]..[++++++++++++++++++++]…[+++++++++++++++–+].130 Tested: 252 Avg. Selected: 39.3 Min Tests: 33 Max Tests: 58 Mean Tests: 45.9127 . MAD: 0.307711
.[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++++++++-]..[+++++++++++++++–+]..[++++++++++++-++–]..[++++++++++++++++-++]..[++++++++++++++++++++]…[+++++++++++++++—]..[+++++++++++++-]..[+++++++++++++++-].140 Tested: 252 Avg. Selected: 39.57143 Min Tests: 36 Max Tests: 62 Mean Tests: 49.44444 . MAD: 0.3082386
.[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++++++++++]…[++++++++++++-]..[++++++++++++-++++++]..[+++++-].[++++++++++++++++++++]…[++++++++++–++-]..[++++++++++++++++++++]..150 Tested: 252 Avg. Selected: 39.68667 Min Tests: 39 Max Tests: 65 Mean Tests: 52.97619 . MAD: 0.309539
bpraw <- predictionStats_binary(cvBSWiMSRaw$medianTest,"BSWiMS RAW",cex=0.60)
BSWiMS RAW
pander::pander(bpraw$CM.analysis$tab)
| Outcome + | Outcome - | Total | |
|---|---|---|---|
| Test + | 144 | 20 | 164 |
| Test - | 44 | 44 | 88 |
| Total | 188 | 64 | 252 |
pander::pander(bpraw$accc)
| est | lower | upper |
|---|---|---|
| 0.746 | 0.688 | 0.799 |
pander::pander(bpraw$aucs)
| est | lower | upper |
|---|---|---|
| 0.84 | 0.785 | 0.894 |
pander::pander(bpraw$berror)
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.271 | 0.211 | 0.343 |
## The validation with Outcome-driven Decorrelation
cvBSWiMSDeCor <- randomCV(avgParkison,
"class",
trainSampleSets= cvBSWiMSRaw$trainSamplesSets,
fittingFunction= filteredFit,
fitmethod=BSWiMS.model,
filtermethod=NULL,
DECOR = TRUE,
DECOR.control=list(Outcome="class",thr=0.8,skipRelaxed=FALSE)
)
.[++++++++-].[++++++-+-].[++++++-+-].[++–].[++++-].[+++–].[+++++++-+++-]..[+++++++++++++-]..[+++++-].[+++-++-]10 Tested: 250 Avg. Selected: 22.4 Min Tests: 1 Max Tests: 7 Mean Tests: 3.56 . MAD: 0.2871137
.[++++-+-+-].[++++-].[++++-+++–].[++++++++++++++-+++-]..[+++++++-].[++-+++-].[+++++++++++++++-+++]..[++-].[+++-+-++-].[+++-]20 Tested: 252 Avg. Selected: 23.6 Min Tests: 2 Max Tests: 14 Mean Tests: 7.063492 . MAD: 0.2826612
.[+++++-+++-].[+++++++-+++-]..[+++++++-].[++++++++-++-]..[++++++-].[++++—].[++-].[++++++-].[+++–].[++++++–+++-]30 Tested: 252 Avg. Selected: 23.6 Min Tests: 4 Max Tests: 19 Mean Tests: 10.59524 . MAD: 0.2878354
.[+++++++++++—-]..[++++-+++++-++-+++]..[+++++-].[+++-].[+++++++++++-]..[+++++++++-].[+++++++++++-++-]..[+++++++++-].[+–+-].[+++++—]40 Tested: 252 Avg. Selected: 24.6 Min Tests: 5 Max Tests: 22 Mean Tests: 14.12698 . MAD: 0.2909058
.[+++++-].[+++-+++–].[++++-].[+++++++++++++++-]..[++++++++-].[+++—-].[++++++++-].[++++++-].[++++++-].[+++++—]50 Tested: 252 Avg. Selected: 24.38 Min Tests: 9 Max Tests: 26 Mean Tests: 17.65873 . MAD: 0.2902138
.[+++++++-].[++++++-+++++++—]..[+++++-].[+++++-].[++++++++-].[++-++-].[+++++++++++-+-+++-]..[+++++++++-].[+++++++-].[+++++++++++++++-].60 Tested: 252 Avg. Selected: 25.28333 Min Tests: 10 Max Tests: 31 Mean Tests: 21.19048 . MAD: 0.2936879
.[+++-+-].[+-].[+++++++-+-+-].[++++++-].[+++++-++-].[+++++++++-].[++++++++-].[+++-+-].[+++++–+-].[+++–]70 Tested: 252 Avg. Selected: 24.54286 Min Tests: 11 Max Tests: 35 Mean Tests: 24.72222 . MAD: 0.2904197
.[++-].[++++++-++++–+-]..[++++++++++++++-]..[+++-].[++—++-].[++++++++-].[++++++++++++++++–]..[+++++++-].[+++-].[+++++++++-]80 Tested: 252 Avg. Selected: 24.6875 Min Tests: 17 Max Tests: 40 Mean Tests: 28.25397 . MAD: 0.291087
.[+++-].[+++++-].[++++–+—].[+++++++++-+-+-+-]..[++++++–].[++++++++++-]..[+++++++-].[++-].[+++++–].[+++-]90 Tested: 252 Avg. Selected: 24.07778 Min Tests: 20 Max Tests: 43 Mean Tests: 31.78571 . MAD: 0.2897524
.[+++++-].[++++++++++++++-++++]..[++++-].[++++-].[+++-].[+++++++++++++++-+-]..[+++++-++++-].[++++++++-++-+++-]..[+-+—].[++++++-]100 Tested: 252 Avg. Selected: 24.22 Min Tests: 21 Max Tests: 46 Mean Tests: 35.31746 . MAD: 0.2903004
.[++++-].[+++++-].[+++-].[++++++++++-+-++++-]..[+++++++++++-+-+–]..[++++++++–].[+++++++-].[++++-++—].[++++++++++-+++-]..[+++-++++++-+-].110 Tested: 252 Avg. Selected: 24.72727 Min Tests: 23 Max Tests: 51 Mean Tests: 38.84921 . MAD: 0.2896679
.[++++-+++-++-].[+-].[+++-].[+++++–].[+++++++++++-]..[++++++-++++-]..[++++++++-+++++-]..[+++++-].[++-+-++-].[+++-]120 Tested: 252 Avg. Selected: 24.64167 Min Tests: 29 Max Tests: 54 Mean Tests: 42.38095 . MAD: 0.28616
.[++++++++-].[++++–+-].[+–].[+++++++++++++++-]..[+++—].[+++++++++-+++++++++]..[+++-+++++-++-]..[+++-+++-].[++++++-+-].[+++++++++–]130 Tested: 252 Avg. Selected: 24.86923 Min Tests: 33 Max Tests: 58 Mean Tests: 45.9127 . MAD: 0.2869853
.[++++++++++++-]..[+++++++-].[+-].[++-].[++++-].[+++++-].[+++++++++++-]..[++++++++–].[++++++-].[+++++-]140 Tested: 252 Avg. Selected: 24.62143 Min Tests: 36 Max Tests: 62 Mean Tests: 49.44444 . MAD: 0.2860115
.[+++++-].[+++-].[++++++-].[+++–].[+++-+–].[+++-++-++-].[++++++++++-+–]..[+++-++-].[++++-].[+++++++-+–]150 Tested: 252 Avg. Selected: 24.32667 Min Tests: 39 Max Tests: 65 Mean Tests: 52.97619 . MAD: 0.2861055
bpDecor <- predictionStats_binary(cvBSWiMSDeCor$medianTest,"BSWiMS Outcome-Driven UPSTM",cex=0.60)
BSWiMS Outcome-Driven UPSTM
pander::pander(bpDecor$CM.analysis$tab)
| Outcome + | Outcome - | Total | |
|---|---|---|---|
| Test + | 165 | 14 | 179 |
| Test - | 23 | 50 | 73 |
| Total | 188 | 64 | 252 |
pander::pander(bpDecor$accc)
| est | lower | upper |
|---|---|---|
| 0.853 | 0.803 | 0.894 |
pander::pander(bpDecor$aucs)
| est | lower | upper |
|---|---|---|
| 0.867 | 0.809 | 0.925 |
pander::pander(bpDecor$berror)
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.169 | 0.118 | 0.23 |
### Here we compute the probability that the outcome-driven decorrelation ROC is superior to the RAW ROC.
pander::pander(roc.test(bpDecor$ROC.analysis$roc.predictor,bpraw$ROC.analysis$roc.predictor,alternative = "greater"))
| Test statistic | P value | Alternative hypothesis | AUC of roc1 | AUC of roc2 |
|---|---|---|---|---|
| 1.78 | 0.0376 * | greater | 0.867 | 0.84 |
### Testing improving proability
iprob <- .Call("improveProbCpp",cvBSWiMSRaw$medianTest[,2],
cvBSWiMSDeCor$medianTest[,2],
cvBSWiMSRaw$medianTest[,1]);
pander::pander(iprob)
### Testing improving accuracy
testRaw <- (cvBSWiMSRaw$medianTest[,1]-cvBSWiMSRaw$medianTest[,2])<0.5
testDecor <- (cvBSWiMSDeCor$medianTest[,1]-cvBSWiMSDeCor$medianTest[,2])<0.5
pander::pander(mcnemar.test(testRaw,testDecor))
| Test statistic | df | P value |
|---|---|---|
| 17.4 | 1 | 3.04e-05 * * * |
## The validation of Decorrelation without the outcome restriction
cvBSWiMSDeCorU <- randomCV(avgParkison,
"class",
trainSampleSets= cvBSWiMSRaw$trainSamplesSets,
fittingFunction= filteredFit,
fitmethod=BSWiMS.model,
filtermethod=NULL,
DECOR = TRUE,
DECOR.control=list(thr=0.8,skipRelaxed=FALSE)
)
.[++-++++-].[++++++++-+–].[+++++-].[+++++++++-].[++++-].[++–].[++++–].[+++++++-++–].[++—-].[++-]10 Tested: 250 Avg. Selected: 18.9 Min Tests: 1 Max Tests: 7 Mean Tests: 3.56 . MAD: 0.2873746
.[++-].[+++++–].[+++–].[+++-+++++-].[+-].[+++++++-].[+++++++-].[++-++–].[+++-+-+-].[++++++-+-]20 Tested: 252 Avg. Selected: 17.6 Min Tests: 2 Max Tests: 14 Mean Tests: 7.063492 . MAD: 0.2700011
.[++++++++-+++-]..[++++++++++-++–]..[++++++-+-].[+++++++++-].[+++++-+–].[+++++++++++++++-]..[+++++++++++++-]..[+++-].[++-+++—].[++++-++++++–].30 Tested: 252 Avg. Selected: 22.3 Min Tests: 4 Max Tests: 19 Mean Tests: 10.59524 . MAD: 0.2791296
.[++++++–+++-].[++++++-].[++-+-].[+++-].[++++++-].[++++++-+++–].[++++-].[++-++-+++–].[++++–].[+–]40 Tested: 252 Avg. Selected: 21.725 Min Tests: 5 Max Tests: 22 Mean Tests: 14.12698 . MAD: 0.2726633
.[++++++—].[+++++++-++-].[+++++++-].[++++++-].[++++++++++-++-]..[++++++++++++-]..[++++++-].[++-++++-].[+++-].[++++—+-]50 Tested: 252 Avg. Selected: 22.28 Min Tests: 9 Max Tests: 26 Mean Tests: 17.65873 . MAD: 0.2762398
.[++++++++-].[+++++++++-++–]..[++++++-].[+++++-].[++++++++-].[+++++–].[++-+–].[++++++-+—].[+++++-].[+++++++-+++-+-].60 Tested: 252 Avg. Selected: 22.75 Min Tests: 10 Max Tests: 31 Mean Tests: 21.19048 . MAD: 0.2804432
.[+++++++-+++-+-]..[+++-++++-].[+++-].[+-+++-].[++++-].[++++++++-].[++++++++++-+–]..[++-++++-].[++++++++-+++-]..[++-+-]70 Tested: 252 Avg. Selected: 22.97143 Min Tests: 11 Max Tests: 35 Mean Tests: 24.72222 . MAD: 0.2820656
.[++++-].[+++++-+—+-].[+++++++++++++++++-]..[++—].[++–].[++++++++-+++–+-]..[++++++++++++++–+-]..[++++++–].[+++++++++-].[++++++-]80 Tested: 252 Avg. Selected: 23.65 Min Tests: 17 Max Tests: 40 Mean Tests: 28.25397 . MAD: 0.2829365
.[+++++++-+++-]..[+++-+++++-+-].[+++–].[+++++++–+++++-]..[+++++—].[++++++–+-].[++++++—-].[+++++++–].[+++++++-].[++++-]90 Tested: 252 Avg. Selected: 23.78889 Min Tests: 20 Max Tests: 43 Mean Tests: 31.78571 . MAD: 0.280273
.[+++-+++-+-].[+++++++++-].[++-].[+++++-].[++++++++—++-]..[++++-].[++++++++–+-].[+++++++-].[++++-].[+++++—]100 Tested: 252 Avg. Selected: 23.72 Min Tests: 21 Max Tests: 46 Mean Tests: 35.31746 . MAD: 0.2804125
.[+++—-].[++++++-].[++-].[+++++++++++++++–]..[++++++++++-+-]..[++-].[++++++++++-]..[+++++–+-].[+++++-+-+-].[++-]110 Tested: 252 Avg. Selected: 23.7 Min Tests: 23 Max Tests: 51 Mean Tests: 38.84921 . MAD: 0.2798014
.[++++++++-].[++-].[+++-].[++++++++++-+-++-]..[+++++-+++++-]..[+++++++++-].[++++++++++++-]..[++++-].[++–++-].[++-]120 Tested: 252 Avg. Selected: 23.71667 Min Tests: 29 Max Tests: 54 Mean Tests: 42.38095 . MAD: 0.2785596
.[++++-].[+-++-].[+—-].[+++++++++++—–]..[+++-+-].[++++++++-+—].[++-++-+-++–].[+++++-].[++++++++++++-]..[++++-+-]130 Tested: 252 Avg. Selected: 23.7 Min Tests: 33 Max Tests: 58 Mean Tests: 45.9127 . MAD: 0.2792973
.[++++++–].[+++-+++–].[++-].[++–].[++++-+–].[++++–].[+++–].[++++++-].[++++-+-].[+++++-]140 Tested: 252 Avg. Selected: 23.19286 Min Tests: 36 Max Tests: 62 Mean Tests: 49.44444 . MAD: 0.2762984
.[+-].[+++–].[++-++++-].[++++–].[+++-++-].[++-].[+++++-].[+++–].[+++++++-].[++++-]150 Tested: 252 Avg. Selected: 22.69333 Min Tests: 39 Max Tests: 65 Mean Tests: 52.97619 . MAD: 0.2738196
bpDecorU <- predictionStats_binary(cvBSWiMSDeCorU$medianTest,"BSWiMS Data Driven UPSTM",cex=0.60)
BSWiMS Data Driven UPSTM
pander::pander(bpDecorU$CM.analysis$tab)
| Outcome + | Outcome - | Total | |
|---|---|---|---|
| Test + | 167 | 14 | 181 |
| Test - | 21 | 50 | 71 |
| Total | 188 | 64 | 252 |
pander::pander(bpDecorU$accc)
| est | lower | upper |
|---|---|---|
| 0.861 | 0.812 | 0.901 |
pander::pander(bpDecorU$aucs)
| est | lower | upper |
|---|---|---|
| 0.877 | 0.819 | 0.935 |
pander::pander(bpDecorU$berror)
| 50% | 2.5% | 97.5% |
|---|---|---|
| 0.163 | 0.112 | 0.223 |
### Here we compute the probability that the blind decorrelation ROC is superior to the RAW ROC.
pander::pander(roc.test(bpDecorU$ROC.analysis$roc.predictor,bpraw$ROC.analysis$roc.predictor,alternative = "greater"))
| Test statistic | P value | Alternative hypothesis | AUC of roc1 | AUC of roc2 |
|---|---|---|---|---|
| 2.26 | 0.0118 * | greater | 0.877 | 0.84 |
par(op)
## Testing probability improvement
iprob <- .Call("improveProbCpp",cvBSWiMSRaw$medianTest[,2],cvBSWiMSDeCorU$medianTest[,2],cvBSWiMSRaw$medianTest[,1]);
pander::pander(iprob)
## Testing accuracy improvement
testDecorU <- (cvBSWiMSDeCorU$medianTest[,1]-cvBSWiMSDeCorU$medianTest[,2])<0.5
pander::pander(mcnemar.test(testRaw,testDecorU))
| Test statistic | df | P value |
|---|---|---|
| 15.6 | 1 | 7.77e-05 * * * |
I’ll print the graph showing the association between features. Each feature cluster represents a logistic regression formula (formula nugget) discovered by the BSWiMS method. The figure will plot:
Raw formula network
Outcome-driven network
Blind network
The plots will show only formula networks with more than 50% of occurrence and 25% of feature to feature association.
par(op)
par(mfrow=c(1,3))
### The raw model
pander::pander(nrow(bm$bagging$formulaNetwork))
84
cmax <- apply(bm$bagging$formulaNetwork,2,max)
cnames <- names(cmax[cmax>=0.5])
cmax <- cmax[cmax>=0.5]
adma <- bm$bagging$formulaNetwork[cnames,cnames]
for (cx in c(1:nrow(namecode)))
{
cnames <- str_replace_all(cnames,namecode[cx,1],namecode[cx,2])
}
cnames <- str_replace_all(cnames,"_","")
cnames <- str_replace_all(cnames,"th","")
rownames(adma) <- cnames
colnames(adma) <- cnames
names(cmax) <- cnames
adma[adma<0.25] <- 0;
gr <- graph_from_adjacency_matrix(adma,mode = "undirected",diag = FALSE,weighted=TRUE)
gr$layout <- layout_with_fr
fc <- cluster_optimal(gr)
plot(fc, gr,
vertex.size=20*cmax,
vertex.label.cex=0.5,
vertex.label.dist=0,
main="Original Feature Association")
### The Outcome Driven Model
pander::pander(nrow(bmd$bagging$formulaNetwork))
71
cmax <- apply(bmd$bagging$formulaNetwork,2,max)
cnames <- names(cmax[cmax>=0.5])
outcomeNames <- cnames
cmax <- cmax[cmax>=0.5]
adma <- bmd$bagging$formulaNetwork[cnames,cnames]
for (cx in c(1:nrow(namecode)))
{
cnames <- str_replace_all(cnames,namecode[cx,1],namecode[cx,2])
}
cnames <- str_replace_all(cnames,"_","")
cnames <- str_replace_all(cnames,"th","")
rownames(adma) <- cnames
colnames(adma) <- cnames
names(cmax) <- cnames
adma[adma<0.25] <- 0;
gr <- graph_from_adjacency_matrix(adma,mode = "undirected",diag = FALSE,weighted=TRUE)
gr$layout <- layout_with_fr
fc <- cluster_optimal(gr)
clusterOutcome <- fc
clusterOutcome$names <- outcomeNames
plot(fc, gr,
vertex.size=20*cmax,
vertex.label.cex=0.5,
vertex.label.dist=0,
main="Outcome-Driven Decorrelation")
### The Blind Decorrelation
pander::pander(nrow(bmdU$bagging$formulaNetwork))
49
cmax <- apply(bmdU$bagging$formulaNetwork,2,max)
cnames <- names(cmax[cmax>=0.5])
cmax <- cmax[cmax>=0.5]
adma <- bmdU$bagging$formulaNetwork[cnames,cnames]
for (cx in c(1:nrow(namecode)))
{
cnames <- str_replace_all(cnames,namecode[cx,1],namecode[cx,2])
}
cnames <- str_replace_all(cnames,"_","")
cnames <- str_replace_all(cnames,"th","")
rownames(adma) <- cnames
colnames(adma) <- cnames
names(cmax) <- cnames
adma[adma<0.25] <- 0;
gr <- graph_from_adjacency_matrix(adma,mode = "undirected",diag = FALSE,weighted=TRUE)
gr$layout <- layout_with_fr
fc <- cluster_optimal(gr)
plot(fc, gr,
vertex.size=20*cmax,
vertex.label.cex=0.5,
vertex.label.dist=0,
main="Blind Decorrelation")
The analysis of the features required to predict the outcome will use the following:
Analysis of the BSWiMS bagged model using the summary function.
Analysis of the sparse GDSMT
Analysis of the univariate association of the model features of both models
Report the new features not found by the Original data analysis
par(op)
par(mfrow=c(1,1))
## 1 Get the Model Features
smOriginal <- summary(bm)
rawnames <- rownames(smOriginal$coefficients)
### From Drived Decorrelation
smDecor <- summary(bmd)
decornames <- rownames(smDecor$coefficients)
### From Blind Decorrelation
smDecorU <- summary(bmdU)
decornamesU <- rownames(smDecorU$coefficients)
## 2 Get the decorrelation matrix formulas
dc <- getLatentCoefficients(deTrain)
### 2a Get only the ones that were decorrelated by the decorrelation-based model
deNames_in_dc <- decornames[decornames %in% names(dc)]
selectedlist <- dc[deNames_in_dc]
theDeFormulas <- selectedlist
pander::pander(selectedlist)
La_tqwt_entropy_log_dec_4:
| tqwt_entropy_log_dec_1 | tqwt_entropy_log_dec_4 |
|---|---|
| -1.09 | 1 |
La_tqwt_TKEO_std_dec_28:
| tqwt_TKEO_mean_dec_28 | tqwt_TKEO_std_dec_28 |
|---|---|
| -0.73 | 1 |
La_tqwt_entropy_log_dec_31:
| tqwt_entropy_log_dec_31 | tqwt_entropy_log_dec_35 |
|---|---|
| 1 | -0.731 |
La_tqwt_stdValue_dec_1:
| tqwt_entropy_shannon_dec_1 | tqwt_entropy_shannon_dec_2 | tqwt_stdValue_dec_1 | tqwt_stdValue_dec_2 |
|---|---|---|---|
| -0.524 | 0.448 | 1 | -0.849 |
La_std_12th_delta:
| std_MFCC_12th_coef | std_12th_delta |
|---|---|
| -0.96 | 1 |
La_tqwt_energy_dec_5:
| tqwt_energy_dec_4 | tqwt_energy_dec_5 |
|---|---|
| -0.933 | 1 |
La_tqwt_TKEO_mean_dec_17:
| tqwt_TKEO_mean_dec_17 | tqwt_minValue_dec_17 |
|---|---|
| 1 | 2.31 |
La_tqwt_kurtosisValue_dec_3:
| tqwt_kurtosisValue_dec_2 | tqwt_kurtosisValue_dec_3 |
|---|---|
| -0.939 | 1 |
La_tqwt_kurtosisValue_dec_31:
| tqwt_kurtosisValue_dec_31 | tqwt_kurtosisValue_dec_33 |
|---|---|
| 1 | -0.921 |
La_locShimmer:
| locShimmer | apq3Shimmer |
|---|---|
| 1 | -0.957 |
La_std_MFCC_5th_coef:
| std_MFCC_5th_coef | std_5th_delta |
|---|---|
| 1 | -0.865 |
La_tqwt_TKEO_std_dec_17:
| tqwt_TKEO_std_dec_17 | tqwt_minValue_dec_17 |
|---|---|
| 1 | 2.12 |
La_tqwt_TKEO_std_dec_32:
| tqwt_TKEO_mean_dec_33 | tqwt_TKEO_std_dec_32 |
|---|---|
| -0.882 | 1 |
La_det_LT_entropy_shannon_1_coef:
| det_TKEO_mean_1_coef | det_LT_entropy_shannon_1_coef |
|---|---|
| -0.781 | 1 |
La_tqwt_kurtosisValue_dec_2:
| tqwt_kurtosisValue_dec_2 | tqwt_kurtosisValue_dec_4 |
|---|---|
| 1 | -0.917 |
La_tqwt_maxValue_dec_1:
| tqwt_minValue_dec_1 | tqwt_maxValue_dec_1 | tqwt_skewnessValue_dec_1 |
|---|---|---|
| 0.949 | 1 | -0.0151 |
La_std_4th_delta:
| std_MFCC_4th_coef | std_4th_delta |
|---|---|
| -0.97 | 1 |
La_tqwt_minValue_dec_11:
| tqwt_minValue_dec_10 | tqwt_minValue_dec_11 |
|---|---|
| -0.989 | 1 |
La_tqwt_minValue_dec_20:
| tqwt_TKEO_mean_dec_20 | tqwt_minValue_dec_20 |
|---|---|
| 0.431 | 1 |
La_std_10th_delta:
| std_MFCC_10th_coef | std_10th_delta |
|---|---|
| -1.05 | 1 |
La_tqwt_minValue_dec_17:
| tqwt_entropy_shannon_dec_17 | tqwt_minValue_dec_17 |
|---|---|
| 0.586 | 1 |
La_app_LT_entropy_log_3_coef:
| app_LT_entropy_log_1_coef | app_LT_entropy_log_2_coef | app_LT_entropy_log_3_coef |
|---|---|---|
| 1.59 | -2.69 | 1 |
La_tqwt_energy_dec_33:
| tqwt_energy_dec_31 | tqwt_energy_dec_33 |
|---|---|
| -0.884 | 1 |
La_std_3rd_delta:
| std_MFCC_3rd_coef | std_3rd_delta |
|---|---|
| -0.95 | 1 |
La_std_MFCC_2nd_coef:
| std_MFCC_2nd_coef | std_2nd_delta |
|---|---|
| 1 | -0.828 |
La_tqwt_TKEO_std_dec_10:
| tqwt_entropy_shannon_dec_9 | tqwt_entropy_shannon_dec_10 | tqwt_TKEO_std_dec_9 | tqwt_TKEO_std_dec_10 |
|---|---|---|---|
| 0.722 | -1.02 | -0.716 | 1 |
La_tqwt_entropy_log_dec_29:
| tqwt_entropy_log_dec_28 | tqwt_entropy_log_dec_29 |
|---|---|
| -1.01 | 1 |
La_tqwt_kurtosisValue_dec_32:
| tqwt_kurtosisValue_dec_32 | tqwt_kurtosisValue_dec_33 |
|---|---|
| 1 | -1.02 |
names(selectedlist) <- NULL
### 2b Get the the names of the original features
allDevar <- unique(c(names(unlist(selectedlist)),decornames))
allDevar <- allDevar[!str_detect(allDevar,"La_")]
#allDevar <- str_remove(allDevar,"Ba_")
allDevar <- unique(allDevar)
# The analysis of the blind decorrelation
dcU <- getLatentCoefficients(deTrainU)
### 2a Get only the ones that were decorrelated by the decorrelation-based model
deNames_in_dcU <- decornamesU[decornamesU %in% names(dcU)]
selectedlistU <- dcU[deNames_in_dcU]
pander::pander(selectedlistU)
La_locShimmer:
| locShimmer | apq3Shimmer |
|---|---|
| 1 | -0.957 |
La_std_12th_delta:
| std_MFCC_12th_coef | std_12th_delta |
|---|---|
| -0.96 | 1 |
La_tqwt_TKEO_mean_dec_33:
| tqwt_TKEO_mean_dec_33 | tqwt_TKEO_std_dec_32 |
|---|---|
| 1 | -1.04 |
La_std_4th_delta:
| std_MFCC_4th_coef | std_4th_delta |
|---|---|
| -0.97 | 1 |
La_tqwt_TKEO_std_dec_33:
| tqwt_TKEO_std_dec_32 | tqwt_TKEO_std_dec_33 |
|---|---|
| -0.986 | 1 |
La_std_MFCC_2nd_coef:
| std_MFCC_2nd_coef | std_2nd_delta |
|---|---|
| 1 | -0.828 |
La_tqwt_entropy_shannon_dec_33:
| tqwt_entropy_shannon_dec_31 | tqwt_entropy_shannon_dec_33 |
|---|---|
| -1.07 | 1 |
La_tqwt_kurtosisValue_dec_33:
| tqwt_kurtosisValue_dec_32 | tqwt_kurtosisValue_dec_33 |
|---|---|
| -0.883 | 1 |
La_tqwt_TKEO_std_dec_36:
| tqwt_entropy_shannon_dec_35 | tqwt_TKEO_std_dec_36 |
|---|---|
| -1.07 | 1 |
La_std_3rd_delta:
| std_MFCC_3rd_coef | std_3rd_delta |
|---|---|
| -0.95 | 1 |
La_std_10th_delta:
| std_MFCC_10th_coef | std_10th_delta |
|---|---|
| -1.05 | 1 |
La_tqwt_entropy_log_dec_16:
| tqwt_entropy_log_dec_16 | tqwt_minValue_dec_17 |
|---|---|
| 1 | 0.39 |
La_app_LT_TKEO_mean_4_coef:
| app_entropy_shannon_8_coef | app_LT_entropy_shannon_8_coef | app_LT_TKEO_mean_4_coef | app_LT_TKEO_std_2_coef |
|---|---|---|---|
| -0.195 | 0.95 | 1 | -0.969 |
La_tqwt_TKEO_std_dec_28:
| tqwt_TKEO_mean_dec_28 | tqwt_TKEO_std_dec_28 |
|---|---|
| -0.73 | 1 |
La_tqwt_maxValue_dec_1:
| tqwt_minValue_dec_1 | tqwt_maxValue_dec_1 | tqwt_skewnessValue_dec_1 |
|---|---|---|
| 0.949 | 1 | -0.0151 |
La_tqwt_TKEO_mean_dec_17:
| tqwt_TKEO_mean_dec_17 | tqwt_minValue_dec_17 |
|---|---|
| 1 | 2.31 |
La_tqwt_entropy_shannon_dec_1:
| tqwt_entropy_shannon_dec_1 | tqwt_entropy_shannon_dec_4 |
|---|---|
| 1 | -0.669 |
La_tqwt_TKEO_std_dec_10:
| tqwt_entropy_shannon_dec_9 | tqwt_entropy_shannon_dec_10 | tqwt_TKEO_std_dec_9 | tqwt_TKEO_std_dec_10 |
|---|---|---|---|
| 0.722 | -1.02 | -0.716 | 1 |
La_tqwt_kurtosisValue_dec_4:
| tqwt_kurtosisValue_dec_2 | tqwt_kurtosisValue_dec_4 |
|---|---|
| -0.86 | 1 |
La_tqwt_TKEO_std_dec_17:
| tqwt_TKEO_std_dec_17 | tqwt_minValue_dec_17 |
|---|---|
| 1 | 2.12 |
La_minIntensity:
| minIntensity | maxIntensity |
|---|---|
| 1 | -1.34 |
names(selectedlistU) <- NULL
### 2b Get the the names of the original features
allDevarU <- unique(c(names(unlist(selectedlistU)),decornamesU))
allDevarU <- allDevarU[!str_detect(allDevarU,"La_")]
#allDevarU <- str_remove(allDevarU,"Ba_")
allDevarU <- unique(allDevarU)
pander::pander(c(length(rawnames),length(decornames),length(decornamesU)))
68, 61 and 41
pander::pander(c(length(rawnames),length(allDevar),length(allDevarU)))
68, 88 and 63
### 2c Get only the new feautres not found in the original analysis
dvar <- allDevar[!(allDevar %in% rawnames)]
### 2d Get the decorrelated variables that have new features
newvars <- character();
for (cvar in deNames_in_dc)
{
lvar <- dc[cvar]
names(lvar) <- NULL
lvar <- names(unlist(lvar))
if (length(lvar[lvar %in% dvar]) > 0)
{
newvars <- append(newvars,cvar)
}
}
## 3 Here is the univariate z values of the orignal set
#pander::pander(bm$univariate[dvar,])
## 4 Here is the univariate z values of the decorrelated set
#pander::pander(bmd$univariate[newvars,])
## 4a The scater plot of the decorrelated vs original Univariate values
zvalueNew <- bmd$univariate[newvars,]
rownames(zvalueNew) <- str_remove(rownames(zvalueNew),"La_")
#rownames(zvalueNew) <- str_remove(rownames(zvalueNew),"Ba_")
zvaluePrePost <- bm$univariate[rownames(zvalueNew),c(1,3)]
zvaluePrePost$Name <- NULL
zvaluePrePost$NewZ <- zvalueNew[rownames(zvaluePrePost),"ZUni"]
pander::pander(zvaluePrePost)
| ZUni | NewZ | |
|---|---|---|
| tqwt_entropy_log_dec_4 | 2.628 | 3.27 |
| tqwt_TKEO_std_dec_28 | 1.166 | 3.68 |
| tqwt_entropy_log_dec_31 | 0.922 | 4.29 |
| tqwt_stdValue_dec_1 | 1.354 | 3.00 |
| std_12th_delta | 4.431 | 4.27 |
| tqwt_energy_dec_5 | 1.933 | 3.26 |
| tqwt_TKEO_mean_dec_17 | 4.905 | 3.48 |
| tqwt_kurtosisValue_dec_3 | 0.652 | 2.94 |
| tqwt_kurtosisValue_dec_31 | 0.444 | 3.68 |
| locShimmer | 3.390 | 5.02 |
| std_MFCC_5th_coef | 2.208 | 2.36 |
| tqwt_TKEO_std_dec_17 | 4.631 | 3.15 |
| tqwt_TKEO_std_dec_32 | 1.647 | 3.77 |
| det_LT_entropy_shannon_1_coef | 1.842 | 3.72 |
| tqwt_kurtosisValue_dec_2 | 0.341 | 2.95 |
| tqwt_maxValue_dec_1 | 4.731 | 3.28 |
| std_4th_delta | 3.943 | 3.67 |
| tqwt_minValue_dec_11 | 5.493 | 3.34 |
| tqwt_minValue_dec_20 | 1.424 | 3.03 |
| std_10th_delta | 4.761 | 3.43 |
| tqwt_minValue_dec_17 | 4.068 | 1.86 |
| app_LT_entropy_log_3_coef | 1.902 | 3.70 |
| tqwt_energy_dec_33 | 1.027 | 5.13 |
| std_3rd_delta | 2.899 | 4.02 |
| std_MFCC_2nd_coef | 0.628 | 4.68 |
| tqwt_TKEO_std_dec_10 | 4.060 | 2.56 |
| tqwt_entropy_log_dec_29 | 1.425 | 4.17 |
| tqwt_kurtosisValue_dec_32 | 0.433 | 3.98 |
plot(zvaluePrePost,
xlim=c(-0.5,6.5),
ylim=c(0,7),
xlab="Original Z",
ylab="Decorrelated Z",
main="Unviariate IDI Z Values",
pch=3,cex=0.5,
col="red")
abline(v=1.96,col="blue")
abline(h=1.96,col="blue")
text(zvaluePrePost$ZUni,zvaluePrePost$NewZ,rownames(zvaluePrePost),srt=65,cex=0.75)
In this section I will show the differences in unaltered basis vectors between the Outcome driven Transformation vs. the blind decorrelated transformation
par(op)
par(mfrow=c(1,1))
smDecorU <- summary(bmdU)
decornamesU <- rownames(smDecorU$coefficients)
get_La_names <- decornames[!str_detect(decornames,"La_")]
get_La_namesU <- decornamesU[!str_detect(decornamesU,"La_")]
unn <- bmd$univariate[,3]
names(unn) <- rownames(bmd$univariate)
pander::pander(as.matrix(unn[get_La_names]))
| mean_MFCC_2nd_coef | 3.85 |
| tqwt_meanValue_dec_11 | 3.38 |
| tqwt_maxValue_dec_12 | 5.99 |
| mean_MFCC_3rd_coef | 3.42 |
| std_MFCC_6th_coef | 3.55 |
| std_delta_delta_log_energy | 6.15 |
| tqwt_energy_dec_6 | 2.77 |
| tqwt_kurtosisValue_dec_20 | 4.85 |
| tqwt_entropy_log_dec_11 | 4.83 |
| apq11Shimmer | 4.44 |
| tqwt_entropy_log_dec_35 | 3.56 |
| f2 | 3.34 |
| tqwt_kurtosisValue_dec_18 | 4.55 |
| tqwt_meanValue_dec_25 | 2.16 |
| tqwt_entropy_log_dec_16 | 5.02 |
| tqwt_energy_dec_11 | 3.44 |
| tqwt_kurtosisValue_dec_33 | 2.06 |
| tqwt_energy_dec_12 | 4.45 |
| VFER_SNR_TKEO | 3.59 |
| std_11th_delta | 4.42 |
| tqwt_entropy_shannon_dec_36 | 3.98 |
| tqwt_kurtosisValue_dec_28 | 4.51 |
| tqwt_meanValue_dec_18 | 1.77 |
| std_MFCC_8th_coef | 4.40 |
| std_5th_delta | 4.28 |
| locAbsJitter | 4.18 |
| IMF_SNR_entropy | 3.22 |
| tqwt_kurtosisValue_dec_35 | 4.43 |
| f1 | 4.02 |
| minIntensity | 6.02 |
| mean_delta_log_energy | 3.92 |
| IMF_NSR_TKEO | 2.92 |
| numPulses | 3.80 |
pander::pander(summary(unn[get_La_names]))
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 1.77 | 3.42 | 3.98 | 3.98 | 4.45 | 6.15 |
unnU <- bmdU$univariate[,3]
names(unnU) <- rownames(bmdU$univariate)
pander::pander(as.matrix(unnU[get_La_namesU]))
| tqwt_entropy_shannon_dec_11 | 4.99 |
| tqwt_TKEO_mean_dec_7 | 4.36 |
| std_5th_delta | 4.28 |
| tqwt_kurtosisValue_dec_28 | 4.51 |
| mean_MFCC_2nd_coef | 3.85 |
| tqwt_kurtosisValue_dec_36 | 5.83 |
| tqwt_energy_dec_12 | 4.45 |
| IMF_NSR_TKEO | 2.92 |
| std_MFCC_8th_coef | 4.40 |
| std_delta_log_energy | 5.87 |
| tqwt_meanValue_dec_18 | 1.77 |
| maxIntensity | 4.82 |
| tqwt_meanValue_dec_25 | 2.16 |
| mean_delta_log_energy | 3.92 |
| tqwt_meanValue_dec_11 | 3.38 |
| tqwt_kurtosisValue_dec_20 | 4.85 |
| std_11th_delta | 4.42 |
| VFER_SNR_TKEO | 3.59 |
| f1 | 4.02 |
| tqwt_kurtosisValue_dec_17 | 4.27 |
pander::pander(summary(unnU[get_La_namesU]))
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 1.77 | 3.79 | 4.32 | 4.13 | 4.59 | 5.87 |
#boxplot(unn[get_La_names],unnU[get_La_namesU],xlab=c("Method"),ylab="Z",main="Z Values of Basis Features")
x1 <- unn[get_La_names]
x2 <- unnU[get_La_namesU]
X3 <- x1[!(get_La_names %in% get_La_namesU)]
X4 <- x2[!(get_La_namesU %in% get_La_names)]
vioplot(x1, x2, X3,X4,
names = c("Outcome-Driven",
"Blind",
"Not in Blind",
"Not in Outcome-Driven"),
ylab="Z IDI",
col="gold")
title("Violin Plots of Unaltered-Basis")
sameFeatures <- get_La_names[get_La_names %in% get_La_namesU]
pander::pander(as.matrix(unn[sameFeatures]))
| mean_MFCC_2nd_coef | 3.85 |
| tqwt_meanValue_dec_11 | 3.38 |
| tqwt_kurtosisValue_dec_20 | 4.85 |
| tqwt_meanValue_dec_25 | 2.16 |
| tqwt_energy_dec_12 | 4.45 |
| VFER_SNR_TKEO | 3.59 |
| std_11th_delta | 4.42 |
| tqwt_kurtosisValue_dec_28 | 4.51 |
| tqwt_meanValue_dec_18 | 1.77 |
| std_MFCC_8th_coef | 4.40 |
| std_5th_delta | 4.28 |
| f1 | 4.02 |
| mean_delta_log_energy | 3.92 |
| IMF_NSR_TKEO | 2.92 |
## The features by Outcome Drive not in Blind
pander::pander(as.matrix(x1[!(get_La_names %in% get_La_namesU)]))
| tqwt_maxValue_dec_12 | 5.99 |
| mean_MFCC_3rd_coef | 3.42 |
| std_MFCC_6th_coef | 3.55 |
| std_delta_delta_log_energy | 6.15 |
| tqwt_energy_dec_6 | 2.77 |
| tqwt_entropy_log_dec_11 | 4.83 |
| apq11Shimmer | 4.44 |
| tqwt_entropy_log_dec_35 | 3.56 |
| f2 | 3.34 |
| tqwt_kurtosisValue_dec_18 | 4.55 |
| tqwt_entropy_log_dec_16 | 5.02 |
| tqwt_energy_dec_11 | 3.44 |
| tqwt_kurtosisValue_dec_33 | 2.06 |
| tqwt_entropy_shannon_dec_36 | 3.98 |
| locAbsJitter | 4.18 |
| IMF_SNR_entropy | 3.22 |
| tqwt_kurtosisValue_dec_35 | 4.43 |
| minIntensity | 6.02 |
| numPulses | 3.80 |
## The features not in outcome driven
pander::pander(as.matrix(x2[!(get_La_namesU %in% get_La_names)]))
| tqwt_entropy_shannon_dec_11 | 4.99 |
| tqwt_TKEO_mean_dec_7 | 4.36 |
| tqwt_kurtosisValue_dec_36 | 5.83 |
| std_delta_log_energy | 5.87 |
| maxIntensity | 4.82 |
| tqwt_kurtosisValue_dec_17 | 4.27 |
I’ll create a table subset of the logistic model from the Outcome-Driven decorrelated data.
The table will have:
The top associated features described by the feature network, as well as, and the new features.
Nugget labels
The feature coefficient
The feature Odd ratios and their corresponding 95%CI
## The features in top nugget
clusterFeatures <- clusterOutcome$names
## The new features
discoveredFeatures <- newvars[zvaluePrePost$ZUni<1.96]
tablefinal <- smDecor$coefficients[unique(c(clusterFeatures,discoveredFeatures)),
c("Estimate",
"lower",
"OR",
"upper",
"full.AUC",
"Delta.AUC",
"z.IDI",
"Frequency")]
nugget <- clusterOutcome$membership
names(nugget) <- clusterOutcome$names
tablefinal$Nugget <- nugget[rownames(tablefinal)]
tablefinal$Nugget[is.na(tablefinal$Nugget)] <- "D"
deFromula <- character(length(theDeFormulas))
names(deFromula) <- names(theDeFormulas)
for (dx in names(deFromula))
{
coef <- theDeFormulas[[dx]]
cname <- names(theDeFormulas[[dx]])
names(cname) <- cname
for (cf in names(coef))
{
if (cf != dx)
{
if (coef[cf]>0)
{
deFromula[dx] <- paste(deFromula[dx],
sprintf("+ %5.3f*%s",coef[cf],cname[cf]))
}
else
{
deFromula[dx] <- paste(deFromula[dx],
sprintf("%5.3f*%s",coef[cf],cname[cf]))
}
}
}
}
tablefinal$DecorFormula <- deFromula[rownames(tablefinal)]
pander::pander(tablefinal)
| Estimate | lower | OR | upper | full.AUC | Delta.AUC | z.IDI | Frequency | Nugget | DecorFormula | |
|---|---|---|---|---|---|---|---|---|---|---|
| std_delta_delta_log_energy | 0.5652 | 1.31e+00 | 1.75978 | 2.361 | 0.798 | 0.0530 | 3.87 | 1.00 | 1 | NA |
| La_tqwt_energy_dec_33 | -0.2641 | 6.26e-01 | 0.76791 | 0.942 | 0.798 | 0.0266 | 2.56 | 1.00 | 1 | -0.884tqwt_energy_dec_31 + 1.000tqwt_energy_dec_33 |
| La_std_MFCC_2nd_coef | -1.5069 | 6.06e-02 | 0.22159 | 0.810 | 0.798 | 0.0228 | 2.26 | 1.00 | 1 | + 1.000std_MFCC_2nd_coef -0.828std_2nd_delta |
| tqwt_maxValue_dec_12 | -0.3024 | 6.32e-01 | 0.73901 | 0.865 | 0.794 | 0.0749 | 4.00 | 1.00 | 2 | NA |
| La_locShimmer | 1.5022 | 1.64e+00 | 4.49134 | 12.337 | 0.797 | 0.0393 | 3.15 | 1.00 | 2 | + 1.000locShimmer -0.957apq3Shimmer |
| La_tqwt_kurtosisValue_dec_32 | -0.4351 | 4.41e-01 | 0.64723 | 0.949 | 0.796 | 0.0219 | 2.11 | 1.00 | 2 | + 1.000tqwt_kurtosisValue_dec_32 -1.018tqwt_kurtosisValue_dec_33 |
| minIntensity | -2.3412 | 1.70e-02 | 0.09621 | 0.544 | 0.780 | 0.0228 | 2.59 | 1.00 | 3 | NA |
| La_tqwt_kurtosisValue_dec_31 | -0.4262 | 4.92e-01 | 0.65301 | 0.866 | 0.787 | 0.0373 | 3.25 | 1.00 | 3 | + 1.000tqwt_kurtosisValue_dec_31 -0.921tqwt_kurtosisValue_dec_33 |
| La_tqwt_TKEO_std_dec_32 | 0.2590 | 1.08e+00 | 1.29565 | 1.547 | 0.776 | 0.0356 | 3.11 | 1.00 | 4 | -0.882tqwt_TKEO_mean_dec_33 + 1.000tqwt_TKEO_std_dec_32 |
| La_std_3rd_delta | 0.7956 | 1.20e+00 | 2.21574 | 4.085 | 0.783 | 0.0278 | 2.54 | 1.00 | 3 | -0.950std_MFCC_3rd_coef + 1.000std_3rd_delta |
| tqwt_kurtosisValue_dec_20 | 0.9820 | 1.47e+00 | 2.66987 | 4.833 | 0.783 | 0.0665 | 3.67 | 0.90 | 3 | NA |
| tqwt_meanValue_dec_18 | -0.3003 | 5.89e-01 | 0.74057 | 0.931 | 0.777 | 0.0000 | 2.99 | 0.60 | 3 | NA |
| tqwt_entropy_log_dec_11 | -0.7823 | 2.86e-01 | 0.45733 | 0.732 | 0.762 | 0.0739 | 3.62 | 1.00 | 4 | NA |
| tqwt_entropy_log_dec_16 | -0.3471 | 5.71e-01 | 0.70670 | 0.875 | 0.775 | 0.0695 | 3.37 | 0.95 | 5 | NA |
| La_std_12th_delta | 1.7791 | 1.87e+00 | 5.92477 | 18.723 | 0.774 | 0.0464 | 3.29 | 1.00 | 4 | -0.960std_MFCC_12th_coef + 1.000std_12th_delta |
| mean_delta_log_energy | -0.0201 | 9.65e-01 | 0.98012 | 0.996 | 0.783 | 0.0196 | 2.54 | 0.90 | 4 | NA |
| La_std_4th_delta | 0.9385 | 1.33e+00 | 2.55616 | 4.906 | 0.772 | 0.0492 | 2.92 | 0.80 | 5 | -0.970std_MFCC_4th_coef + 1.000std_4th_delta |
| std_5th_delta | 0.4696 | 1.16e+00 | 1.59940 | 2.200 | 0.750 | 0.0439 | 2.97 | 0.70 | 4 | NA |
| tqwt_kurtosisValue_dec_35 | 0.1116 | 1.03e+00 | 1.11804 | 1.214 | 0.783 | 0.0531 | 2.79 | 0.95 | 5 | NA |
| std_MFCC_8th_coef | 0.5357 | 1.16e+00 | 1.70867 | 2.516 | 0.756 | 0.0412 | 2.97 | 0.85 | 6 | NA |
| tqwt_energy_dec_12 | -0.0666 | 8.95e-01 | 0.93561 | 0.978 | 0.774 | 0.0655 | 3.28 | 0.65 | 7 | NA |
| std_11th_delta | 0.3676 | 1.13e+00 | 1.44423 | 1.851 | 0.758 | 0.0487 | 3.15 | 0.70 | 8 | NA |
| La_tqwt_TKEO_std_dec_17 | -0.4358 | 4.67e-01 | 0.64673 | 0.896 | 0.765 | 0.0507 | 3.12 | 0.50 | 9 | + 1.000tqwt_TKEO_std_dec_17 + 2.124tqwt_minValue_dec_17 |
| tqwt_kurtosisValue_dec_28 | -0.0420 | 9.31e-01 | 0.95889 | 0.988 | 0.740 | 0.0524 | 3.04 | 0.65 | 10 | NA |
| tqwt_kurtosisValue_dec_18 | 0.8663 | 1.37e+00 | 2.37798 | 4.124 | 0.769 | 0.0471 | 3.43 | 0.80 | 5 | NA |
| tqwt_meanValue_dec_25 | -0.3364 | 5.76e-01 | 0.71435 | 0.886 | 0.768 | 0.0000 | 3.41 | 0.50 | 4 | NA |
| tqwt_energy_dec_6 | -0.1151 | 8.34e-01 | 0.89131 | 0.952 | 0.774 | 0.0700 | 3.72 | 0.55 | 5 | NA |
| tqwt_entropy_log_dec_35 | -0.6965 | 3.18e-01 | 0.49834 | 0.781 | 0.776 | 0.0598 | 3.46 | 0.85 | 10 | NA |
| mean_MFCC_3rd_coef | 0.0255 | 1.01e+00 | 1.02587 | 1.040 | 0.786 | 0.0743 | 3.99 | 0.55 | 5 | NA |
| apq11Shimmer | 0.1937 | 1.09e+00 | 1.21368 | 1.356 | 0.746 | 0.0784 | 3.48 | 0.50 | 11 | NA |
| La_tqwt_entropy_log_dec_31 | 0.6173 | 1.27e+00 | 1.85390 | 2.707 | 0.749 | 0.0773 | 3.49 | 0.50 | 12 | + 1.000tqwt_entropy_log_dec_31 -0.731tqwt_entropy_log_dec_35 |
| La_tqwt_maxValue_dec_1 | -0.7541 | 2.85e-01 | 0.47042 | 0.777 | 0.757 | 0.0493 | 3.04 | 0.55 | 11 | + 0.949tqwt_minValue_dec_1 + 1.000tqwt_maxValue_dec_1 -0.015*tqwt_skewnessValue_dec_1 |
| La_std_10th_delta | 1.1297 | 1.34e+00 | 3.09474 | 7.143 | 0.764 | 0.0344 | 2.74 | 0.70 | 13 | -1.049std_MFCC_10th_coef + 1.000std_10th_delta |
| mean_MFCC_2nd_coef | 0.0168 | 1.01e+00 | 1.01696 | 1.026 | 0.773 | 0.0848 | 4.11 | 0.50 | 13 | NA |
| La_tqwt_entropy_log_dec_4 | -2.4537 | 2.09e-02 | 0.08598 | 0.354 | 0.784 | 0.0834 | 3.94 | 0.50 | 5 | -1.090tqwt_entropy_log_dec_1 + 1.000tqwt_entropy_log_dec_4 |
| std_MFCC_6th_coef | 1.0443 | 1.54e+00 | 2.84154 | 5.234 | 0.777 | 0.0574 | 3.91 | 0.60 | 13 | NA |
| La_tqwt_TKEO_std_dec_28 | -0.0827 | 8.76e-01 | 0.92066 | 0.967 | 0.736 | 0.0536 | 3.50 | 0.35 | D | -0.730tqwt_TKEO_mean_dec_28 + 1.000tqwt_TKEO_std_dec_28 |
| La_tqwt_stdValue_dec_1 | -0.5159 | 4.27e-01 | 0.59695 | 0.835 | 0.721 | 0.0773 | 3.29 | 0.15 | D | -0.524tqwt_entropy_shannon_dec_1 + 0.448tqwt_entropy_shannon_dec_2 + 1.000tqwt_stdValue_dec_1 -0.849tqwt_stdValue_dec_2 |
| La_tqwt_energy_dec_5 | -0.1450 | 7.89e-01 | 0.86505 | 0.949 | 0.777 | 0.0618 | 3.27 | 0.25 | D | -0.933tqwt_energy_dec_4 + 1.000tqwt_energy_dec_5 |
| La_tqwt_kurtosisValue_dec_3 | 0.1117 | 1.04e+00 | 1.11817 | 1.201 | 0.757 | 0.0698 | 3.25 | 0.15 | D | -0.939tqwt_kurtosisValue_dec_2 + 1.000tqwt_kurtosisValue_dec_3 |
| La_det_LT_entropy_shannon_1_coef | 0.0876 | 1.02e+00 | 1.09160 | 1.164 | 0.757 | 0.0443 | 3.06 | 0.40 | D | -0.781det_TKEO_mean_1_coef + 1.000det_LT_entropy_shannon_1_coef |
| La_tqwt_kurtosisValue_dec_2 | -0.1319 | 8.03e-01 | 0.87640 | 0.957 | 0.779 | 0.0550 | 3.04 | 0.35 | D | + 1.000tqwt_kurtosisValue_dec_2 -0.917tqwt_kurtosisValue_dec_4 |
| La_tqwt_minValue_dec_20 | -0.1514 | 7.77e-01 | 0.85950 | 0.951 | 0.741 | 0.0408 | 2.85 | 0.15 | D | + 0.431tqwt_TKEO_mean_dec_20 + 1.000tqwt_minValue_dec_20 |
| La_app_LT_entropy_log_3_coef | -6.8711 | 4.42e-06 | 0.00104 | 0.244 | 0.774 | 0.0207 | 2.62 | 0.10 | D | + 1.587app_LT_entropy_log_1_coef -2.692app_LT_entropy_log_2_coef + 1.000*app_LT_entropy_log_3_coef |
| La_tqwt_entropy_log_dec_29 | -0.0875 | 8.49e-01 | 0.91618 | 0.989 | 0.746 | 0.0412 | 2.23 | 0.25 | D | -1.015tqwt_entropy_log_dec_28 + 1.000tqwt_entropy_log_dec_29 |
save.image("~/GitHub/FCA/ParkinsonDemo.RData")