1 Sample Information

The 79 samples present in both the microarray and dCT data sets used in these analyses are listed below.

The sample group stratification used for the analyses is shown below. This grouping was chosen because the categories in the “Diagnosis” column described below contain very few samples each, with samples carrying unique diagnostic labels.

2 Exploratory Data Analysis

The boxplot below illustrates the behaviour of the three data sets. No clear difference in gene expression is apparent in any data set. Additionally, the values in the dCT #1 data set are far from those of both the dCT #2 and microarray data, which have similar overall values.

Exploratory data analysis with principal component analysis (PCA) was conducted on each data set. No clear separation could be seen in any of them.
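The projection behind such a PCA plot can be illustrated with a minimal sketch. It assumes a samples-by-genes table and uses power iteration on the covariance matrix to recover only the leading component. The function name `first_principal_component` and the tiny `expression` table are hypothetical and synthetic. The report does not state which software produced the PCA plots, so this is not the method actually used.

```python
import math

def first_principal_component(rows, n_iter=100):
    """Leading principal component of a samples x genes table.

    Returns (loadings, scores): a unit-length loading vector and one score
    per sample. Power iteration is used for brevity; it assumes the start
    vector is not orthogonal to the leading component.
    """
    n, p = len(rows), len(rows[0])
    # Centre each gene (column) on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix between genes (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration: repeatedly multiply by the covariance and renormalise.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Project the centred samples onto the component.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Synthetic table: 4 samples x 2 genes, second gene is twice the first.
expression = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores = first_principal_component(expression)
print(loadings)
print(scores)
```

Plotting the scores coloured by sample group is what would reveal a division between groups, if one existed.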

2.1 dCT #1

2.2 dCT #2

2.3 Microarray

3 Correlation Analysis

In this section, correlation analysis was carried out on each data set individually and between data sets, to analyse how well gene expression correlates across studies. Within the individual data sets, more correlation clusters with higher Pearson scores are apparent in the dCT data sets. Between studies, correlation scores were low overall when comparing microarray and PCR data, indicating that these data sets are not ideal for comparing the predictive capabilities of the kSORT gene subset.
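As an illustration of the between-study comparison, the sketch below computes a per-gene Pearson coefficient across matched samples on two platforms. The function names, the gene labels `gene_a` and `gene_b`, and all values are hypothetical and synthetic. It assumes each platform is stored as a mapping from gene to values listed in the same sample order, which may differ from how the report's data were organised.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one series is constant
    return sxy / math.sqrt(sxx * syy)

def cross_platform_correlation(platform_a, platform_b):
    """Per-gene Pearson r for the genes measured on both platforms.

    Each argument maps gene -> values over the same samples, same order.
    """
    shared = sorted(set(platform_a) & set(platform_b))
    return {gene: pearson(platform_a[gene], platform_b[gene]) for gene in shared}

# Synthetic values for four matched samples (not taken from the report).
microarray = {"gene_a": [1.0, 2.0, 3.0, 4.0], "gene_b": [2.0, 1.0, 4.0, 3.0]}
dct = {"gene_a": [2.0, 4.0, 6.0, 8.0], "gene_b": [3.0, 1.0, 2.0, 4.0]}

r_by_gene = cross_platform_correlation(microarray, dct)
print(r_by_gene)
```

Low coefficients for most genes, as reported here for microarray against PCR data, mean that the same samples rank differently on the two platforms.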

3.1 Microarray

3.2 dCT #1

3.3 dCT #2

3.4 Microarray and dCT #1

3.5 Microarray and dCT #2

3.6 dCT #1 and dCT #2

4 Predictive Capabilities of kSORT with Support Vector Machine

4.1 Microarray

4.2 dCT #1

4.3 dCT #2

5 Recursive Feature Elimination with SVM

Recursive Feature Elimination with SVM (RFESVM) was carried out to recursively remove features and obtain a minimal model with the best accuracy score. The line plots show the accuracy for each successively smaller gene subset, and a confusion matrix for the selected subset is provided for each comparison. The selected genes and their coefficients in the resulting model are described in the tables. The results indicate that removing features improves the mapping function learned by the SVM, especially for predicting rejection: the confusion matrices for the selected features show more correct predictions for this class.
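The elimination loop can be sketched from scratch as follows. This is an illustration under stated assumptions, not the implementation used for the report: the linear SVM is fitted by plain full-batch subgradient descent, features are ranked by squared weight, and accuracy is computed on the training data for brevity (the report does not state how accuracy was estimated). All function names, the gene labels, and the data are hypothetical and synthetic.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Linear SVM (hinge loss + L2 penalty) via full-batch subgradient descent.

    X: list of feature rows; y: labels in {-1, +1}. Returns (weights, bias).
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        # Samples on the wrong side of the margin drive the update.
        violators = [i for i in range(n)
                     if y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b) < 1]
        for j in range(p):
            grad = lam * w[j] - sum(y[i] * X[i][j] for i in violators) / n
            w[j] -= lr * grad
        b += lr * sum(y[i] for i in violators) / n
    return w, b

def accuracy(X, y, w, b):
    """Fraction of samples whose predicted sign matches the label."""
    preds = [1 if sum(wj * xj for wj, xj in zip(w, row)) + b >= 0 else -1
             for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def rfe_svm(X, y, names):
    """Drop the feature with the smallest squared weight until one remains.

    Returns a list of (feature_names, training_accuracy), largest subset first.
    """
    active = list(range(len(names)))
    history = []
    while active:
        X_sub = [[row[j] for j in active] for row in X]
        w, b = train_linear_svm(X_sub, y)
        history.append(([names[j] for j in active], accuracy(X_sub, y, w, b)))
        if len(active) == 1:
            break
        weakest = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[weakest]
    return history

# Synthetic data: gene_a is strongly informative, gene_b weakly, gene_c is noise.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
names = ["gene_a", "gene_b", "gene_c"]
X = [[2.0, 0.5, 0.25], [2.0, 0.5, -0.25], [2.0, 0.5, 0.25], [2.0, 0.5, -0.25],
     [-2.0, -0.5, -0.25], [-2.0, -0.5, 0.25], [-2.0, -0.5, -0.25], [-2.0, -0.5, 0.25]]

history = rfe_svm(X, labels, names)
for subset, acc in history:
    print(len(subset), subset, acc)
```

Each entry of `history` corresponds to one point on an accuracy-versus-subset-size line plot. In practice, accuracy would be estimated on held-out or cross-validated samples rather than on the training data.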

5.1 Microarray

5.2 dCT #1

5.3 dCT #2

6 Feature Importance with Boruta

Boruta, a wrapper feature selection algorithm with Random Forest as the classifier, was used to perform feature importance analysis using the kSORT gene set as features and the sample groups as classes (i.e. Control, Rejection). For the microarray data set, six genes were classified as important for classification (PSEN1, RARA, RYBP, GZMK, MAPK9, RHEB). None of the genes were confirmed as important or tentative with dCT data set #1, and two genes (NKTR, RHEB) were considered important for classifying rejection with dCT data set #2. For the applicable data sets, box plots of expression by class for the genes not rejected by Boruta follow the Boruta results.

6.1 Microarray

6.2 dCT #1

6.3 dCT #2

7 Conclusion

The results indicate improved predictive performance when using the dCT #2 data set. The RFESVM results demonstrate a clear improvement in predicting the rejection class in particular, which suggests that the SVM algorithm learns a better mapping function from a subset of the kSORT gene set. Feature selection with Boruta identified more important genes in the microarray data than in dCT data set #1, in line with the better RFESVM accuracy values for the microarray data, but dCT #2 performed better overall. The small sample sizes must be taken into account when training models and making predictions, and they point towards the need for an augmented data set to analyse the predictive capabilities of the kSORT genes.