Introduction
For this LBB, I am using a real gene expression dataset from breast cancer patients. I am going to use a breast cancer dataset, GSE96058 from the GEO database. The study was conducted by the Population-Based Multicenter Sweden Cancerome Analysis Network Breast Initiative. The dataset in this lbb has been sized down to only include Basal and HER2 cancer subtype from a tutorial page in GitHub.
Data Exploration
Let’s load the data and do some transformation first.
mat <- read.csv("GSE81538.gene_expr.csv", header=T, row.names = "Gene", sep=",")
meta <- read.csv("GSE81538.metadata.csv", header=T, row.names = "Sample_ID", sep=",")## GSM2155558 GSM2155573 GSM2155576 GSM2155577 GSM2155578 GSM2155579
## A1BG 1.590427 1.843249 2.40252682 0.7957699 3.0461528 1.9914732
## A1CF -3.321928 -2.987143 -2.99921024 -3.3219281 -3.3219281 -3.0794813
## A2M 10.532488 5.679714 7.36209637 6.9138968 6.6680964 6.8304391
## A2ML1 3.230358 2.533546 4.63239910 0.4043134 -0.5475978 -0.4682657
## A4GALT 3.687543 1.924038 4.13743870 2.6548153 4.8779754 2.8167662
## A4GNT -3.321928 -3.321928 -3.32192809 -3.3219281 -3.3219281 -3.3219281
## AAAS 3.817716 4.606686 4.11423291 3.6850586 4.4099688 3.8267972
## AACS 4.187752 2.904505 3.19106577 3.6700293 3.8023336 4.2283249
## AADAC -3.045970 -3.321928 0.04110698 -1.6388111 -3.3219281 -2.3282313
## AADACL2 -3.321928 -3.089915 -3.32192809 -3.3219281 -2.9963932 -3.3219281
## GSM2155580 GSM2155589 GSM2155592 GSM2155594
## A1BG 3.864399 3.6169857 1.812331 2.153686
## A1CF -3.321928 -3.3219281 -3.271774 -3.321928
## A2M 9.363927 7.1892622 6.433237 8.462191
## A2ML1 2.231530 -0.8605566 3.046839 1.137483
## A4GALT 3.158690 3.1324142 2.534428 2.786167
## A4GNT -2.922023 -3.0811705 -3.321928 -3.321928
## AAAS 3.157392 4.1123465 3.883359 4.498970
## AACS 2.520392 2.6387985 3.805710 3.864587
## AADAC -1.835116 -3.3219281 -3.321928 -2.799313
## AADACL2 -3.321928 -3.3219281 -3.321928 -3.321928
## Subtype
## GSM2155558 Basal
## GSM2155573 Basal
## GSM2155576 Basal
## GSM2155577 Basal
## GSM2155578 Basal
## GSM2155579 Basal
The mat dataframe is an expression dataset that has been normalized. The rows are the gene expressed with a total gene count of 18.213 genes. We cover almost every gene as the human genome contain more or less 21.000 genes. The column names are anonymous patient label. Here, we only have 122 patients divided into two categories, Basal and HER2 subtype.
Data Processing
Check Missing Value
## [1] FALSE
## [1] FALSE
This dataset does not have missing value.
Scaling
What we want to scale(normalize) are the gene expressions that are now currently placed as row. Let’s transpose it first, and then transpose it again as scale only works on columns.
Merging Expression Data with label
We have more than 18 thousands gene but most of these genes are not differentially expressed in the two sub-types. As such we are going to take the top 2000 genes that shows the greatest variation between sub-type.
This is not the best practice. The best practice is for us to perform differentially expressed genes (DEG) analysis to determine differentially expressed genes. But, this method should be good enough to reduce our dimension.
After filtering, we can merge our data with the label.
mat <- t(mat)
mat <- merge(meta,mat,by="row.names")
## remove the rownames column
rownames(mat) <- mat$Row.names
mat <- mat[,-c(1)]
mat[1:10,] %>%
select(1:10)## Subtype ASTN2 C8orf76 CIDEA DMPK GAGE12J
## GSM2155558 Basal -0.2216182 -0.76627329 -1.4517424 0.11913181 -0.1536797
## GSM2155559 Her2 0.9775220 0.08444846 -1.4517424 -0.41039873 -0.1536797
## GSM2155561 Her2 1.8363197 -0.50900181 1.9026721 -0.15997469 -0.1536797
## GSM2155563 Her2 -0.1995949 -0.63590983 0.8944964 0.83477202 -0.1536797
## GSM2155565 Her2 -0.5462489 1.99561604 -0.8264288 1.00457231 -0.1536797
## GSM2155566 Her2 1.1946041 -0.78154328 0.1297961 -0.27732079 -0.1536797
## GSM2155567 Her2 0.6575408 -0.14520721 1.2085468 0.38990840 -0.1536797
## GSM2155568 Her2 -0.2537171 0.58009829 -0.2663376 0.03524777 -0.1536797
## GSM2155571 Her2 1.4083397 -0.42814311 1.0901411 2.23500937 -0.1536797
## GSM2155572 Her2 -1.8457329 0.63719180 -0.3510496 -0.43898782 -0.1536797
## HAO2 KRT38 MRPL47 NDRG3
## GSM2155558 -0.6146181 1.0329036 -0.8904883 0.7063261
## GSM2155559 2.7563034 0.2097443 -0.1493350 1.7485516
## GSM2155561 -0.3506882 -0.2991592 -0.7295586 -1.2602521
## GSM2155563 -0.6146181 -0.2991592 0.8091464 0.9744298
## GSM2155565 -0.1591325 -0.2991592 0.5617525 -0.4074978
## GSM2155566 -0.6146181 -0.2991592 0.4483600 0.7022963
## GSM2155567 0.5858403 -0.2991592 -0.8354202 0.1953714
## GSM2155568 6.3062972 -0.2991592 -0.3538955 2.2257588
## GSM2155571 -0.6146181 -0.2991592 4.2963740 1.6189602
## GSM2155572 -0.6146181 -0.2991592 0.8104392 1.0954044
PCA and Dataset split
We now have 2000 predictors which is still quite a lot. What we can do to reduce computational burder during modeling later is to reduce the number of predictors with PCA. By converting our data into PCA loading scores instead of gene name, we can greatly reduce our dimension. I used rsample to split data into train-test set.
## <Analysis/Assess/Total>
## <99/23/122>
Using recipes, we can create a “recipe” that shows procedural guide on what to do on test and train set. Even though we already filter our genes with based on variance, we can apply step_nvz to fully make sure that the genes we have is free of low variance genes. We then do pca with step_pca to get the loading scores. By default, the first 6 PCs are recorded.
rec <- recipe(Subtype ~ ., training(splitted)) %>%
step_nzv(all_predictors()) %>%
step_pca(all_numeric()) %>%
prep()We can use juice to directly obtain the training set.
## # A tibble: 6 x 6
## Subtype PC1 PC2 PC3 PC4 PC5
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Basal -15.8 -9.23 -0.174 4.25 -0.725
## 2 Her2 13.6 -4.06 24.5 0.872 7.00
## 3 Her2 -1.16 5.73 1.91 4.19 -3.40
## 4 Her2 7.58 22.2 -6.34 6.47 0.672
## 5 Her2 16.8 4.86 -9.48 10.9 11.8
## 6 Her2 15.9 -12.3 12.2 -1.79 2.32
By plotting PC1 vs PC2 (PCs with highest variance) we can see the clustering of data (later on we are using all 6 PCs during modelling). Just from PC1, we see the divide between Basal and HER2 breast cancer subtype.
ggplot(data = train, mapping = aes(x = PC1, y = PC2, color = Subtype)) +
geom_hline(yintercept=0) +
geom_vline(xintercept=0) +
geom_point() To obtain the test set, we need to specify that we are going to use the test set specifed from splitted.
## # A tibble: 5 x 6
## Subtype PC1 PC2 PC3 PC4 PC5
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Her2 8.84 13.8 -13.6 3.23 1.17
## 2 Her2 15.5 5.84 -1.72 14.0 -0.778
## 3 Basal -5.42 -16.4 -1.56 11.3 -6.76
## 4 Basal -12.3 2.13 6.39 2.03 7.26
## 5 Her2 11.4 -0.175 0.894 -5.35 12.2
Modeling
Decision Tree
We are going to try and use decision tree to predict cancer subtype based on the patient’s transcriptome profile.
model_DT <- train(Subtype ~ .,
data=train,
method="rpart", #regression trees
trControl = trainControl(method = "cv", number = 5))plot(model_DT$finalModel, uniform=TRUE,
main="Classification Tree")
text(model_DT$finalModel, all=TRUE, cex=.5)Our classication tree is very simple. So simple in fact that it only uses PC1 as classifier. Let’s see how it performs.
##
## train_pred_DT Basal Her2
## Basal 46 2
## Her2 0 51
## Confusion Matrix and Statistics
##
## Reference
## Prediction Basal Her2
## Basal 46 2
## Her2 0 51
##
## Accuracy : 0.9798
## 95% CI : (0.9289, 0.9975)
## No Information Rate : 0.5354
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9595
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9623
## Pos Pred Value : 0.9583
## Neg Pred Value : 1.0000
## Prevalence : 0.4646
## Detection Rate : 0.4646
## Detection Prevalence : 0.4848
## Balanced Accuracy : 0.9811
##
## 'Positive' Class : Basal
##
The accuracy is very high at 97% when applied to test set. Let’s see if this model overfits by trying it on test set.
##
## test_pred_DT Basal Her2
## Basal 11 0
## Her2 0 12
## Confusion Matrix and Statistics
##
## Reference
## Prediction Basal Her2
## Basal 11 0
## Her2 0 12
##
## Accuracy : 1
## 95% CI : (0.8518, 1)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : 3.173e-07
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4783
## Detection Rate : 0.4783
## Detection Prevalence : 0.4783
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Basal
##
As it turns out, we are able to classify all patients correctly using our decision tree model.
Conclusion
Using decision tree we are able to correctly classify all patients into their correct cancer sub-type. We first reduce our number of genes from 18K to only 2K by selecting those with high variance between group. To reduce computational burden, we transform our genes column to PCs loading scores and use them for modelling input. The limitation of using PCs loadings instead of directly using our gene expressions is that our results are less interpretable.