Machine Learning Classification with Breast Cancer Expression Data from the GEO Database

Zainul

2021-05-07

Introduction

For this LBB, I am using a real gene expression dataset from breast cancer patients: GSE96058 from the GEO database. The study was conducted by the Population-Based Multicenter Sweden Cancerome Analysis Network Breast Initiative (SCAN-B). Following a tutorial page on GitHub, the dataset used in this LBB has been sized down to include only the Basal and HER2 cancer subtypes.

Data Exploration

Let’s load the data and do some transformations first.
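The loading code is hidden in the rendered output. A minimal sketch, assuming the expression matrix and the subtype labels were saved locally as RDS files (the file names below are hypothetical):

```r
# Hypothetical file names; mat is genes-in-rows, samples-in-columns
mat     <- readRDS("data/GSE96058_expression.rds")
subtype <- readRDS("data/GSE96058_subtype.rds")  # one Subtype label per sample

mat[1:10, 1:10]
head(subtype)
```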

##         GSM2155558 GSM2155573  GSM2155576 GSM2155577 GSM2155578 GSM2155579
## A1BG      1.590427   1.843249  2.40252682  0.7957699  3.0461528  1.9914732
## A1CF     -3.321928  -2.987143 -2.99921024 -3.3219281 -3.3219281 -3.0794813
## A2M      10.532488   5.679714  7.36209637  6.9138968  6.6680964  6.8304391
## A2ML1     3.230358   2.533546  4.63239910  0.4043134 -0.5475978 -0.4682657
## A4GALT    3.687543   1.924038  4.13743870  2.6548153  4.8779754  2.8167662
## A4GNT    -3.321928  -3.321928 -3.32192809 -3.3219281 -3.3219281 -3.3219281
## AAAS      3.817716   4.606686  4.11423291  3.6850586  4.4099688  3.8267972
## AACS      4.187752   2.904505  3.19106577  3.6700293  3.8023336  4.2283249
## AADAC    -3.045970  -3.321928  0.04110698 -1.6388111 -3.3219281 -2.3282313
## AADACL2  -3.321928  -3.089915 -3.32192809 -3.3219281 -2.9963932 -3.3219281
##         GSM2155580 GSM2155589 GSM2155592 GSM2155594
## A1BG      3.864399  3.6169857   1.812331   2.153686
## A1CF     -3.321928 -3.3219281  -3.271774  -3.321928
## A2M       9.363927  7.1892622   6.433237   8.462191
## A2ML1     2.231530 -0.8605566   3.046839   1.137483
## A4GALT    3.158690  3.1324142   2.534428   2.786167
## A4GNT    -2.922023 -3.0811705  -3.321928  -3.321928
## AAAS      3.157392  4.1123465   3.883359   4.498970
## AACS      2.520392  2.6387985   3.805710   3.864587
## AADAC    -1.835116 -3.3219281  -3.321928  -2.799313
## AADACL2  -3.321928 -3.3219281  -3.321928  -3.321928
##            Subtype
## GSM2155558   Basal
## GSM2155573   Basal
## GSM2155576   Basal
## GSM2155577   Basal
## GSM2155578   Basal
## GSM2155579   Basal

The mat data frame is a normalized expression dataset. The rows are the expressed genes, 18,213 in total; this covers almost every gene, as the human genome contains roughly 21,000 genes. The column names are anonymized patient labels. Here we have only 122 patients, divided into two categories: the Basal and HER2 subtypes.

Data Processing

Check Missing Values
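A quick check with anyNA() on both objects should be enough; a minimal sketch:

```r
# anyNA() returns TRUE if any value in the object is NA
anyNA(mat)
anyNA(subtype)
```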

## [1] FALSE
## [1] FALSE

This dataset does not have any missing values.

Scaling

What we want to scale (normalize) are the gene expressions, which are currently placed as rows. Since scale() only works column-wise, we transpose the matrix first, scale it, and then transpose it back.
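A minimal sketch of this double-transpose trick:

```r
# scale() standardizes columns, so put genes in columns first,
# scale, then transpose back to the genes-in-rows layout
mat_scaled <- t(scale(t(mat)))
```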

Merging Expression Data with Labels

We have more than 18 thousand genes, but most of them are not differentially expressed between the two subtypes. As such, we are going to take the top 2,000 genes that show the greatest variation.

This is not best practice; ideally, we would perform a differential gene expression (DEG) analysis to determine which genes are truly differentially expressed. Still, this variance-based filter should be good enough to reduce our dimensionality.

After filtering, we can merge the expression data with the labels, as sketched below.
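A sketch of both steps, assuming variance across all samples is used as the filter (the original code may rank genes differently):

```r
# Keep the 2,000 genes with the highest variance across samples
gene_var  <- apply(mat_scaled, 1, var)
top_genes <- names(sort(gene_var, decreasing = TRUE))[1:2000]

# Put samples in rows, then attach the Subtype label (rows match by order)
dat <- cbind(subtype, as.data.frame(t(mat_scaled[top_genes, ])))
dat[1:10, 1:10]
```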

##            Subtype      ASTN2     C8orf76      CIDEA        DMPK    GAGE12J
## GSM2155558   Basal -0.2216182 -0.76627329 -1.4517424  0.11913181 -0.1536797
## GSM2155559    Her2  0.9775220  0.08444846 -1.4517424 -0.41039873 -0.1536797
## GSM2155561    Her2  1.8363197 -0.50900181  1.9026721 -0.15997469 -0.1536797
## GSM2155563    Her2 -0.1995949 -0.63590983  0.8944964  0.83477202 -0.1536797
## GSM2155565    Her2 -0.5462489  1.99561604 -0.8264288  1.00457231 -0.1536797
## GSM2155566    Her2  1.1946041 -0.78154328  0.1297961 -0.27732079 -0.1536797
## GSM2155567    Her2  0.6575408 -0.14520721  1.2085468  0.38990840 -0.1536797
## GSM2155568    Her2 -0.2537171  0.58009829 -0.2663376  0.03524777 -0.1536797
## GSM2155571    Her2  1.4083397 -0.42814311  1.0901411  2.23500937 -0.1536797
## GSM2155572    Her2 -1.8457329  0.63719180 -0.3510496 -0.43898782 -0.1536797
##                  HAO2      KRT38     MRPL47      NDRG3
## GSM2155558 -0.6146181  1.0329036 -0.8904883  0.7063261
## GSM2155559  2.7563034  0.2097443 -0.1493350  1.7485516
## GSM2155561 -0.3506882 -0.2991592 -0.7295586 -1.2602521
## GSM2155563 -0.6146181 -0.2991592  0.8091464  0.9744298
## GSM2155565 -0.1591325 -0.2991592  0.5617525 -0.4074978
## GSM2155566 -0.6146181 -0.2991592  0.4483600  0.7022963
## GSM2155567  0.5858403 -0.2991592 -0.8354202  0.1953714
## GSM2155568  6.3062972 -0.2991592 -0.3538955  2.2257588
## GSM2155571 -0.6146181 -0.2991592  4.2963740  1.6189602
## GSM2155572 -0.6146181 -0.2991592  0.8104392  1.0954044

PCA and Dataset Split

We now have 2,000 predictors, which is still quite a lot. To reduce the computational burden during modeling later, we can shrink the number of predictors with PCA. By converting our data into principal component (PC) scores instead of individual gene expressions, we greatly reduce the dimensionality. I used rsample to split the data into train and test sets.
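A minimal sketch using rsample (the seed and the exact split proportion are assumptions):

```r
library(rsample)

set.seed(100)  # assumed seed, for reproducibility
splitted <- initial_split(dat, prop = 0.8, strata = "Subtype")
splitted
```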

## <Analysis/Assess/Total>
## <99/23/122>

Using recipes, we can create a “recipe” that serves as a procedural guide for what to do on the train and test sets. Even though we already filtered our genes based on variance, we can apply step_nzv to make sure the remaining genes are free of near-zero-variance predictors. We then run PCA with step_pca to get the PC scores. By default, the first 5 PCs are kept.

We can use juice() to directly obtain the transformed training set.
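A sketch of the recipe, assuming the defaults described above:

```r
library(recipes)

rec <- recipe(Subtype ~ ., data = training(splitted)) %>%
  step_nzv(all_predictors()) %>%   # drop near-zero-variance genes
  step_pca(all_predictors()) %>%   # keeps the first 5 PCs by default
  prep()

train_pca <- juice(rec)
head(train_pca)
```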

## # A tibble: 6 x 6
##   Subtype    PC1    PC2    PC3    PC4    PC5
##   <fct>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Basal   -15.8   -9.23 -0.174  4.25  -0.725
## 2 Her2     13.6   -4.06 24.5    0.872  7.00 
## 3 Her2     -1.16   5.73  1.91   4.19  -3.40 
## 4 Her2      7.58  22.2  -6.34   6.47   0.672
## 5 Her2     16.8    4.86 -9.48  10.9   11.8  
## 6 Her2     15.9  -12.3  12.2   -1.79   2.32

By plotting PC1 vs PC2 (the PCs with the highest variance), we can see the clustering of the data (later on, we use all 5 PCs during modeling). Just from PC1, we can already see the divide between the Basal and HER2 breast cancer subtypes.
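The plot itself is not reproduced here; a sketch with ggplot2:

```r
library(ggplot2)

# Color points by subtype to visualize the separation along PC1
ggplot(train_pca, aes(x = PC1, y = PC2, color = Subtype)) +
  geom_point() +
  labs(title = "Training set: PC1 vs PC2")
```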

To obtain the test set, we apply the same prepped recipe to the testing portion of splitted.
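A sketch using bake() on the prepped recipe:

```r
test_pca <- bake(rec, new_data = testing(splitted))
head(test_pca, 5)
```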

## # A tibble: 5 x 6
##   Subtype    PC1     PC2     PC3   PC4    PC5
##   <fct>    <dbl>   <dbl>   <dbl> <dbl>  <dbl>
## 1 Her2      8.84  13.8   -13.6    3.23  1.17 
## 2 Her2     15.5    5.84   -1.72  14.0  -0.778
## 3 Basal    -5.42 -16.4    -1.56  11.3  -6.76 
## 4 Basal   -12.3    2.13    6.39   2.03  7.26 
## 5 Her2     11.4   -0.175   0.894 -5.35 12.2

Modeling

Decision Tree

We are going to use a decision tree to predict the cancer subtype based on each patient’s transcriptome profile.
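The fitting code is hidden; a hedged sketch using rpart (the original may have used a different tree implementation):

```r
library(rpart)

# Fit a classification tree on the PC scores
model_DT <- rpart(Subtype ~ ., data = train_pca, method = "class")
model_DT
```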

Our classification tree is very simple; so simple, in fact, that it only splits on PC1. Let’s see how it performs.
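Evaluating on the training set, a sketch using caret’s confusionMatrix():

```r
library(caret)

train_pred_DT <- predict(model_DT, train_pca, type = "class")
table(train_pred_DT, train_pca$Subtype)
confusionMatrix(train_pred_DT, train_pca$Subtype, positive = "Basal")
```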

##              
## train_pred_DT Basal Her2
##         Basal    46    2
##         Her2      0   51
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Basal Her2
##      Basal    46    2
##      Her2      0   51
##                                           
##                Accuracy : 0.9798          
##                  95% CI : (0.9289, 0.9975)
##     No Information Rate : 0.5354          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9595          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9623          
##          Pos Pred Value : 0.9583          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.4646          
##          Detection Rate : 0.4646          
##    Detection Prevalence : 0.4848          
##       Balanced Accuracy : 0.9811          
##                                           
##        'Positive' Class : Basal           
## 

The accuracy on the training set is very high at around 98%. Let’s see if this model overfits by trying it on the test set.
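A matching sketch for the test set:

```r
test_pred_DT <- predict(model_DT, test_pca, type = "class")
table(test_pred_DT, test_pca$Subtype)
confusionMatrix(test_pred_DT, test_pca$Subtype, positive = "Basal")
```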

##             
## test_pred_DT Basal Her2
##        Basal    11    0
##        Her2      0   12
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Basal Her2
##      Basal    11    0
##      Her2      0   12
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8518, 1)
##     No Information Rate : 0.5217     
##     P-Value [Acc > NIR] : 3.173e-07  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4783     
##          Detection Rate : 0.4783     
##    Detection Prevalence : 0.4783     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Basal      
## 

As it turns out, we are able to classify all 23 test patients correctly using our decision tree model.

Conclusion

Using a decision tree, we are able to correctly classify all patients into their cancer subtypes. We first reduced the number of genes from roughly 18,000 to 2,000 by selecting those with the highest variance. To further reduce the computational burden, we transformed the gene columns into principal component scores and used them as modeling input. The limitation of using PC scores instead of the gene expressions directly is that the results are less interpretable.