The 79 samples present in both the microarray and dCT data sets used in these analyses are listed below.
The samples were stratified into the groups shown below for the analyses. This stratification was chosen because of the small sample sizes in the “Diagnosis” column described below, where individual samples carry unique diagnostic labels.
Exploratory data analysis with principal component analysis was conducted on both data sets; no clear separation of the samples could be seen in either data set.
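The exploratory step described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code used for the report: the matrix `expr`, the vector `class_labels`, and the gene names are hypothetical stand-ins, and it assumes genes are centred and scaled before the decomposition.

```r
# Synthetic stand-in for an expression matrix: 79 samples x 10 genes (hypothetical data).
set.seed(1)
groups <- c("PRETx", "POSTTx", "AR", "cAMR", "BKV", "IFTA")
expr <- matrix(rnorm(79 * 10), nrow = 79,
               dimnames = list(paste0("S", 1:79), paste0("gene", 1:10)))
class_labels <- factor(sample(groups, 79, replace = TRUE), levels = groups)

# Centre and scale each gene, then compute the principal components.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample scores on the first two components, labelled by sample group.
scores <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], Class = class_labels)

# Group centroids in the PC1/PC2 plane: centroids that lie close together
# correspond to the "no clear separation" situation described in the text.
centroids <- aggregate(cbind(PC1, PC2) ~ Class, data = scores, FUN = mean)
print(round(var_explained[1:2], 3))
print(centroids)
```

The `scores` data frame can be passed to any plotting function to draw the usual PC1 versus PC2 scatter plot coloured by `Class`.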
In this section, correlation analysis was carried out on each data set individually and between the two data sets, to assess how well gene expression agrees across studies. Within the individual data sets, the dCT data show more correlation clusters with higher Pearson scores. The gene expression correlation analysis between studies resulted in low correlation scores overall, the majority of them negative, indicating that a comparison between the predictive capabilities of the kSORT gene subset in the two data sets is not ideal.
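The two kinds of correlation discussed above (gene against gene within one data set, and the same gene across the two data sets) can be sketched as follows. The matrices `microarray` and `dct` hold random numbers and are hypothetical, as is the choice of four genes; the sketch assumes both matrices are organised as samples in rows and genes in columns.

```r
# Hypothetical matched data: 79 samples x 4 genes on each platform.
set.seed(2)
genes <- c("RXRA", "MAPK9", "GZMK", "RARA")
samples <- paste0("S", 1:79)
microarray <- matrix(rnorm(79 * 4), 79, 4, dimnames = list(samples, genes))
dct <- matrix(rnorm(79 * 4), 79, 4, dimnames = list(samples, genes))

# Within one data set: gene-by-gene Pearson correlation matrix.
cor_microarray <- cor(microarray, method = "pearson")
cor_dct <- cor(dct, method = "pearson")

# Between data sets: restrict to shared samples and genes, in the same order,
# then correlate each gene with itself across the two platforms.
shared_samples <- intersect(rownames(microarray), rownames(dct))
shared_genes <- intersect(colnames(microarray), colnames(dct))
between <- vapply(
  shared_genes,
  function(g) cor(microarray[shared_samples, g], dct[shared_samples, g],
                  method = "pearson"),
  numeric(1)
)

print(round(between, 2))
# Fraction of genes whose cross-platform correlation is negative.
print(mean(between < 0))
```

Aligning on `shared_samples` before calling `cor` matters: correlating columns whose rows are in different sample order would silently produce meaningless scores.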
Boruta, a wrapper feature selection algorithm that uses Random Forest as its classifier, was used to perform feature importance analysis with the kSORT gene set as features and the 6 sample groups as classes (i.e. PRETx, POSTTx, AR, cAMR, BKV, IFTA). For the microarray data set, 7 genes were classified as important for classification (ITGAX, RARA, CFLAR, GZMK, MAPK9, RHEB, RXRA). For the dCT data set, only RXRA was confirmed as important, while CEACAM4, MAPK9, PSEN1 and RYBP were tentative, that is, not rejected as important for classification. Box plots of the expression of the non-rejected genes by class are shown after the Boruta results for both data sets.
print(getNonRejectedFormula(classification))
## as.factor(intersect_microarray$Class) ~ ITGAX + RARA + CFLAR +
## GZMK + MAPK9 + RHEB + RXRA
## <environment: 0x55bf224760c8>
print(getNonRejectedFormula(classification))
## as.factor(intersect_dCT$Class) ~ CEACAM4 + MAPK9 + PSEN1 + RXRA +
## RYBP
## <environment: 0x55bf279d15d0>
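For orientation, the kind of call that produces output like the above can be sketched as follows. This is an illustration on a synthetic two-class data frame, not the report's actual code: `toy`, `geneA`, `geneB`, `geneC` and the `maxRuns` value are hypothetical, and the block only runs when the Boruta package is installed.

```r
# Sketch only: executed when the Boruta package is available.
if (requireNamespace("Boruta", quietly = TRUE)) {
  set.seed(3)
  n <- 79
  # Hypothetical data: two classes and three candidate genes.
  toy <- data.frame(
    Class = factor(sample(c("POSTTx", "IFTA"), n, replace = TRUE)),
    geneA = rnorm(n), geneB = rnorm(n), geneC = rnorm(n)
  )
  # Make geneA informative by shifting it in one class.
  toy$geneA <- toy$geneA + ifelse(toy$Class == "IFTA", 2, 0)

  # Boruta compares each feature's Random Forest importance with that of
  # permuted "shadow" copies and labels it Confirmed, Tentative or Rejected.
  toy_fit <- Boruta::Boruta(formula = Class ~ ., data = toy,
                            maxRuns = 50, doTrace = 0)

  # Per-feature decision.
  print(toy_fit$finalDecision)
  # Formula containing every feature that was not rejected
  # (confirmed plus tentative), as printed in the report.
  print(Boruta::getNonRejectedFormula(toy_fit))
}
```

Because `getNonRejectedFormula` includes tentative features, `toy_fit$finalDecision` is the place to check which genes were confirmed and which were only tentative.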
Recursive Feature Elimination with Logistic Regression (RFELG) was carried out to recursively remove features in order to obtain a minimal model with the best weighted one-versus-rest Area Under the Receiver Operating Characteristic Curve (AUC ROC). The line plots show the AUC ROC value for each successively smaller subset of genes. Additionally, a confusion matrix for the selected subset is available for each comparison. Further description of the selected genes and their corresponding coefficients in the resulting model is available in the tables.
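To make the scoring metric concrete, the following sketch computes a weighted one-versus-rest AUC in base R. It is not the report's implementation: the helper names `binary_auc` and `weighted_ovr_auc` and the toy probabilities are hypothetical, and it assumes that "weighted" means each class's one-versus-rest AUC is weighted by that class's share of the samples.

```r
# AUC for one binary problem through the rank-sum (Mann-Whitney) identity:
# the probability that a random positive scores higher than a random negative.
binary_auc <- function(score, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  r <- rank(score)  # average ranks, so ties count as one half
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Weighted one-versus-rest AUC: one AUC per class (that class against all
# others), averaged with weights equal to the class prevalence.
weighted_ovr_auc <- function(prob, truth) {
  classes <- colnames(prob)
  aucs <- vapply(classes, function(k) binary_auc(prob[, k], truth == k), numeric(1))
  weights <- vapply(classes, function(k) mean(truth == k), numeric(1))
  sum(aucs * weights)
}

# Hypothetical predicted class probabilities for five samples.
truth <- c("AR", "AR", "IFTA", "IFTA", "BKV")
prob <- matrix(
  c(0.7, 0.2, 0.1,
    0.4, 0.5, 0.1,
    0.2, 0.6, 0.2,
    0.3, 0.3, 0.4,
    0.1, 0.2, 0.7),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("AR", "IFTA", "BKV"))
)

toy_auc <- weighted_ovr_auc(prob, truth)
print(toy_auc)
```

In a recursive elimination loop, a score of this kind would be recomputed after each feature is dropped, which is what the line plots summarise.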
The results indicate that the kSORT gene set classifies better in binary comparisons than in the multiclass setting, with better overall results from the microarray data set. Feature selection with Boruta yielded more confirmed genes for the microarray data, which is consistent with the better AUC ROC values obtained from RFELG in both the multiclass and binary settings. The difference is most pronounced in the binary comparison between POSTTx and IFTA, where the AUC values from the microarray data lie in the range [0.72, 0.84] while those from the dCT data lie in the range [0.26, 0.37]. The effect of small sample sizes on model training and prediction must also be considered: for both data sets, when comparing POSTTx and IFTA, the model predicted IFTA correctly 100% of the time and POSTTx incorrectly 100% of the time. An augmented data set is needed to further analyse the predictive capabilities of the kSORT genes.
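The POSTTx versus IFTA pattern described above, where one class is always predicted correctly and the other never, is what a confusion matrix looks like when a model assigns every sample to a single class. The sketch below illustrates this with made-up counts; the sample numbers are hypothetical and are not taken from the report.

```r
# Hypothetical labels: the counts are invented for illustration only.
truth <- factor(c(rep("POSTTx", 4), rep("IFTA", 6)),
                levels = c("POSTTx", "IFTA"))

# A degenerate classifier that assigns every sample to IFTA.
predicted <- factor(rep("IFTA", length(truth)), levels = levels(truth))

# Rows are true classes, columns are predicted classes.
confusion <- table(Truth = truth, Predicted = predicted)
print(confusion)

# Per-class recall: fraction of each true class that was predicted correctly.
recall <- diag(confusion) / rowSums(confusion)
# Overall accuracy simply mirrors the share of the majority class here.
accuracy <- sum(diag(confusion)) / sum(confusion)

print(recall)
print(accuracy)
```

Reading per-class recall alongside accuracy makes this failure mode visible, which is one reason a larger data set is needed before drawing conclusions from such comparisons.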