intrahepatic cholangiocarcinoma ICC GSE316921 analysis and ML

Intrahepatic Cholangiocarcinoma or ICC in GSE316921 study about the knockout of a gene that provides angiogenesis to the tumor microenvironment, no published article as of this date, but study is from late January this year. In it, the description says the hypoxia in the ICC tumor microenvironment creates tumor-like cancer fibroblasts and then it also looks at the short hairpin, I take that to be a microRNA to inhibit this gene, SLC2A1 as shSLC2A1. We saw in the samples there seem to be two sets of control and hypoxia but in one set there is no shSLC2A1, and there is in the other set. Confused on what is the control for the hypoxia that led to the SLC2A1 gene as the target to create hypoxia if there is a control with hypoxia already. It did say something about or similar to copy number variant maybe we can discover something here to explain more in the data provided in 12 samples with 3 of each class in two sets. There was already a data table provided and a series text file to explore.

library(rmarkdown)
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

Data <- read.csv("GSE316921_processed_data.csv")

paged_table(Data[1:10,])

dim(Data)

## [1] 36242    13

colnames(Data)

##  [1] "X"              "CAFSHNCCTRL_1"  "CAFSHNCCTRL_2"  "CAFSHNCCTRL_3" 
##  [5] "CAFSHNCHYPO_1"  "CAFSHNCHYPO_2"  "CAFSHNCHYPO_3"  "CAFSHSLCCTRL_1"
##  [9] "CAFSHSLCCTRL_2" "CAFSHSLCCTRL_3" "CAFSHSLCHYPO_1" "CAFSHSLCHYPO_2"
## [13] "CAFSHSLCHYPO_3"

colnames(Data)[1] <- "Gene"

colnames(Data)

##  [1] "Gene"           "CAFSHNCCTRL_1"  "CAFSHNCCTRL_2"  "CAFSHNCCTRL_3" 
##  [5] "CAFSHNCHYPO_1"  "CAFSHNCHYPO_2"  "CAFSHNCHYPO_3"  "CAFSHSLCCTRL_1"
##  [9] "CAFSHSLCCTRL_2" "CAFSHSLCCTRL_3" "CAFSHSLCHYPO_1" "CAFSHSLCHYPO_2"
## [13] "CAFSHSLCHYPO_3"

It looks like the work is mostly done. There are the CAF for cancer fibroblasts, the controls, then the hypoxia, and another set of controls but for the short hairpin of the targeted gene, and the hypoxia of those controls. I think now, that the hypoxia is something they do in the lab to these samples to see what happens to the controls when they remove oxygen to see as the description says, what happens with the endothelial tissue.

Now, lets look at the series information on the details.

series_22lines <- read.table("GSE316921_series_matrix.txt", nrow=22)

paged_table(series_22lines)

From details above the data does seem to suggest this is sort of exactly what happened in the series summary and series overall design information.

Lets see what else is in the series data to help us understand what we are working with.

series_22skipped <- read.table("GSE316921_series_matrix.txt", skip=22, nrow=22)

paged_table(series_22skipped)

In the above the tissue type is biliary tract tumor tissue taken for all samples of control or shSLC2A1 in conditions hypoxia or not. They are all fibroblasts, the first 6 samples are WT for wild type assumed. The last 6 are the SLC2A1 knockdown or inhibition.

The first 3 samples are controls WT, then hypoxia WT for next 3 samples, and following 3 samples are the SLC2A1 knockdown, and last 3 following that are the SLC2A1 knockdown with hypoxia.

Now we can look at the data and explore it.

Data$WT_noHypoxia_Mean <- rowMeans(Data[,c(2:4)])
Data$WT_Hypoxia_Mean <- rowMeans(Data[,c(5:7)])
Data$shSLC2A1_noHypoxia_Mean <- rowMeans(Data[,c(8:10)])
Data$shSLC2A1_Hypoxia_Mean <- rowMeans(Data[,c(11:13)])

colnames(Data)

##  [1] "Gene"                    "CAFSHNCCTRL_1"          
##  [3] "CAFSHNCCTRL_2"           "CAFSHNCCTRL_3"          
##  [5] "CAFSHNCHYPO_1"           "CAFSHNCHYPO_2"          
##  [7] "CAFSHNCHYPO_3"           "CAFSHSLCCTRL_1"         
##  [9] "CAFSHSLCCTRL_2"          "CAFSHSLCCTRL_3"         
## [11] "CAFSHSLCHYPO_1"          "CAFSHSLCHYPO_2"         
## [13] "CAFSHSLCHYPO_3"          "WT_noHypoxia_Mean"      
## [15] "WT_Hypoxia_Mean"         "shSLC2A1_noHypoxia_Mean"
## [17] "shSLC2A1_Hypoxia_Mean"

Now we can add fold change values to these two sets of genes looking at hypoxia in both but one with the inhibition of SLC2A1.

Data$WT_Foldchange <- Data$WT_Hypoxia_Mean/Data$WT_noHypoxia_Mean
Data$shSLC2A1_Foldchange <- Data$shSLC2A1_Hypoxia_Mean/Data$shSLC2A1_noHypoxia_Mean

colnames(Data)

##  [1] "Gene"                    "CAFSHNCCTRL_1"          
##  [3] "CAFSHNCCTRL_2"           "CAFSHNCCTRL_3"          
##  [5] "CAFSHNCHYPO_1"           "CAFSHNCHYPO_2"          
##  [7] "CAFSHNCHYPO_3"           "CAFSHSLCCTRL_1"         
##  [9] "CAFSHSLCCTRL_2"          "CAFSHSLCCTRL_3"         
## [11] "CAFSHSLCHYPO_1"          "CAFSHSLCHYPO_2"         
## [13] "CAFSHSLCHYPO_3"          "WT_noHypoxia_Mean"      
## [15] "WT_Hypoxia_Mean"         "shSLC2A1_noHypoxia_Mean"
## [17] "shSLC2A1_Hypoxia_Mean"   "WT_Foldchange"          
## [19] "shSLC2A1_Foldchange"

paged_table(Data[1:10,]) # 36,242 X 19

write.csv(Data, "allData_FCs_36242X19.csv", row.names=F)

Lets order the genes by the WT foldchange and get top stimulated and top inhibited genes of 10 each.

Data_WT_FC_ordered <- Data[order(Data$WT_Foldchange, decreasing=T),]

paged_table(Data_WT_FC_ordered[c(1:10,36233:36242),])

Lets remove the NaNs and the Infs in this data and remove all 0 values for genes.

Data_removedNaN <- Data_WT_FC_ordered[!(is.na(Data_WT_FC_ordered$WT_Foldchange)),] # 21,468 X 19

Data_removedINF_NaNs <- Data_removedNaN[!(is.infinite(Data_removedNaN$WT_Foldchange)),] # 20,178 X 19

Data_clean_WT_FC <- Data_removedINF_NaNs[which(Data_removedINF_NaNs$WT_Foldchange > 0),]

write.csv(Data_clean_WT_FC, "DataCleaned_WT_FCs_18374X19.csv", row.names=F)

Lets take the top genes from this group of wild type controls and hypoxic environments.

topWT_20 <- Data_clean_WT_FC[c(1:10,18465:18474),]

paged_table(topWT_20)

Lets now get the cleaned up data of no Infinite or NaN values and those greater than 0 for the foldchange values of the shSLC2A1 group of controls and hypoxic condition.

Data_SLC_ordered <- Data[order(Data$shSLC2A1_Foldchange, decreasing=T),]

Data_SLC_noNaN <- Data_SLC_ordered[!(is.na(Data_SLC_ordered$shSLC2A1_Foldchange)),] #22,461 X 19

Data_SLC_noInfNaN <- Data_SLC_noNaN[!(is.infinite(Data_SLC_noNaN$shSLC2A1_Foldchange)),] # 20,749 X 19

Data_SLC_clean <- Data_SLC_noInfNaN[which(Data_SLC_noInfNaN$shSLC2A1_Foldchange > 0),] # 19,102 X 19

write.csv(Data_SLC_clean, "Data_shSLC_foldchanges_19102X19.csv", row.names=F)

Now we get our top genes of 10 stimulated and 10 inhibited.

topSLC_20 <- Data_SLC_clean[c(1:10, 19093:19102),]

paged_table(topSLC_20)

Now we have our top genes of both groups. Lets see how these genes play out in our random forest classifier to predict the class of the sample.

These are the top 20 WT genes in hypoxia compared to not of ICC.

topWT_20$Gene

##  [1] "INHBB"        "IGFBP3"       "LINC00520"    "ALDH1A1"      "S100P"       
##  [6] "CXCR4"        "CIDEA"        "CARMIL3"      "CD300A"       "MRGPRX3"     
## [11] "GBP6"         "SNORD93"      "FIGN"         "HPD"          "PTENP1-AS"   
## [16] "SUCNR1"       "ANO3"         "DIO2"         "NOTCH2NLA"    "LOC107986951"

And these are the top 20 genes in hypoxia versus non hypoxic conditions of the knockout inhibition of the gene SLC2A1.

topSLC_20$Gene

##  [1] "INHBB"     "LRRTM1"    "RPL7AP50"  "RPL10P6"   "KISS1R"    "IGFBP1"   
##  [7] "ASCL2"     "GRIN2A"    "ADAMTS18"  "INKA2-AS1" "RPS15P7"   "POM121L9P"
## [13] "IP6K3"     "JAKMIP3"   "LRRN3"     "MCCD1"     "GCNT3"     "ANO3"     
## [19] "SNHG21"    "FABP5P3"

Lets do some machine learning to see if this can predict the class relative to the data set the top genes were in. There are 3 samples of each. We will train on 4 samples and test on the remaining 2 in each group or dataset.

For the wild type lets create our matrix.

WT_df <- topWT_20[,c(1:7)]
class <- c("ICC","ICC","ICC",
           "ICC-hypoxia","ICC-hypoxia","ICC-hypoxia")
table(class)

## class
##         ICC ICC-hypoxia 
##           3           3

WT_mx <- data.frame(t(WT_df[,2:7]))
colnames(WT_mx) <- WT_df$Gene
WT_mx$class <- as.factor(class)

paged_table(WT_mx)

Group our training and testing data.

set.seed(123)

training <- WT_mx[c(1,2,5,6),]
testing <- WT_mx[c(3,4),]

table(training$class)

## 
##         ICC ICC-hypoxia 
##           2           2

table(testing$class)

## 
##         ICC ICC-hypoxia 
##           1           1

rf_WT <- randomForest(training[1:19], training$class, mtry=6, ntree=5000, confusion=T, importance=T)

rf_WT$confusion

##             ICC ICC-hypoxia class.error
## ICC           2           0           0
## ICC-hypoxia   0           2           0

That is good, these genes are good for this training set on class prediction. The 2 ICC samples were predicted correctly as well as the 2 hypoxic samples of ICC.

Lets see how well it did this model on the hold out validation testing set.

predicted_WT <- predict(rf_WT, testing)

results_WT <- data.frame(predicted=predicted_WT, actual=testing$class)
results_WT

##                 predicted      actual
## CAFSHNCCTRL_3         ICC         ICC
## CAFSHNCHYPO_1 ICC-hypoxia ICC-hypoxia

Great! This model was good with this set of genes in determining the class of the sample.

Lets see the importance of these features.

rf_WT$importance

##               ICC ICC-hypoxia MeanDecreaseAccuracy MeanDecreaseGini
## INHBB      0.0076      0.0076               0.0076           0.1008
## IGFBP3     0.0064      0.0064               0.0064           0.1057
## LINC00520  0.0064      0.0064               0.0064           0.0962
## ALDH1A1    0.0092      0.0092               0.0092           0.0976
## S100P     -0.0030     -0.0030              -0.0030           0.0092
## CXCR4      0.0064      0.0064               0.0064           0.0907
## CIDEA      0.0070      0.0070               0.0070           0.0950
## CARMIL3    0.0028      0.0028               0.0028           0.0947
## CD300A     0.0088      0.0088               0.0088           0.1038
## MRGPRX3    0.0078      0.0078               0.0078           0.1116
## GBP6       0.0070      0.0070               0.0070           0.1102
## SNORD93    0.0088      0.0088               0.0088           0.1015
## FIGN       0.0078      0.0078               0.0078           0.0979
## HPD        0.0000      0.0000               0.0000           0.0510
## PTENP1-AS  0.0048      0.0048               0.0048           0.0779
## SUCNR1     0.0050      0.0050               0.0050           0.0948
## ANO3       0.0060      0.0060               0.0060           0.0891
## DIO2       0.0064      0.0064               0.0064           0.0868
## NOTCH2NLA  0.0084      0.0084               0.0084           0.1008

So these WT genes of 20 of them are good at predicting the class of that subset of ICC and hypoxic conditions. The importance of the features was identical for each sample. Could be due to the small amount of samples or something else.

Lets now make a matrix of the knockout SLC2A1 top 20 genes and test those out in its subset.

SLC_df <- topSLC_20[,c(1,8:13)]
class1 <- c("ICC_shSLC2A1","ICC_shSLC2A1","ICC_shSLC2A1",
            "ICC_shSLC2A1_hypoxia","ICC_shSLC2A1_hypoxia","ICC_shSLC2A1_hypoxia")

SLC_mx <- data.frame(t(SLC_df[,c(2:7)]))
colnames(SLC_mx) <- SLC_df$Gene
SLC_mx$class <- as.factor(class1)

paged_table(SLC_mx)

Make our training and testing sets.

set.seed(123)

training <- SLC_mx[c(1,2,5,6),]
testing <- SLC_mx[c(3,4),]

table(training$class)

## 
##         ICC_shSLC2A1 ICC_shSLC2A1_hypoxia 
##                    2                    2

table(testing$class)

## 
##         ICC_shSLC2A1 ICC_shSLC2A1_hypoxia 
##                    1                    1

Now for our random forest model for the shSLC2A1 knockout with hypoxia.

rf_SLC <- randomForest(training[1:19], training$class, mtry=6, ntree=5000,confusion=T, importance=T)

rf_SLC$confusion

##                      ICC_shSLC2A1 ICC_shSLC2A1_hypoxia class.error
## ICC_shSLC2A1                    2                    0           0
## ICC_shSLC2A1_hypoxia            0                    2           0

These genes are also good in predicting the class of the sample as 100% accuracy for both samples in this group.

Lets see about this model predicting the testing set of hold out samples.

predicted_SLC <- predict(rf_SLC,testing)

results1 <- data.frame(predicted=predicted_SLC, actual=testing$class)
results1

##                           predicted               actual
## CAFSHSLCCTRL_3         ICC_shSLC2A1         ICC_shSLC2A1
## CAFSHSLCHYPO_1 ICC_shSLC2A1_hypoxia ICC_shSLC2A1_hypoxia

Great! Because the testing set also scored 100% accuracy in predicting the class sample.

So, for this study, it looks like we could see that these genes by fold change comparison in hypoxic states of both groups of wild type Intrahepatic cholangiocarcinoma and the inhibition of the SLC2A1 gene in a separate group of hypoxic state comparison showed the study was done and a change occured that was demonstrated by the change in these genes for the hypoxic state. Lets see if we can look at what these genes are by merging them with a separate file that has a brief description of their gene summary.

path <- "location to your ensemble ID dataset to merge"

setwd(path)

ensembl <- read.csv("GSE271486_ensembleIDs_NPC_LBMP_study.csv")

paged_table(ensembl[1:10,])

Lets merge this with the top genes of each group but only the first 3 columns.

top20WT_merged <- merge(ensembl[,c(1:4)], topWT_20, by.x="gene_name", by.y="Gene")

paged_table(top20WT_merged)

Lets now merge ensembl descriptors with the top 20 genes of knocking out SLC2A1 with shSLC2A1 in hypoxic state.

topSLC_merged <- merge(ensembl[,c(1:4)], topSLC_20, by.x='gene_name', by.y="Gene")

paged_table(topSLC_merged)

top20WT_merged$datasetSource <- "top 20 genes by foldchange Wild Type hypoxia in ICC"
topSLC_merged$datasetSource <- "top 20 genes by foldchange shSLC2A1 knockout in hypoxic state of ICC"

topGenes_ICC <- rbind(top20WT_merged,topSLC_merged)

paged_table(topGenes_ICC)

write.csv(topGenes_ICC,"top40genes_ICC_GSE316921.csv", row.names=F)

Next time we will have to add in some gene summaries to see what these genes were doing and add these genes to the pathology database because they are good predictors for the class of the sample in their respective subsets.

Lets quickly return to the larger cleaned data to see if a few EBV genes are present. Some are LMP1, LMP2, PDL1, and PDL2, but those have different names of CD274 and PDCD1LG2.

EBV <- c("LMP1","LMP2","CD274","PDCD1LG2")

EBV_WT <- Data_clean_WT_FC[which(Data_clean_WT_FC$Gene %in% EBV),]

paged_table(EBV_WT)

Looks like PDL1 or CD274 and PDL2 or PDCD1LG2 are in the dataset from above in Wild Type cleaned of infinites, NaNs, and greater than 0 for fold change. They are both inhibited in the hypoxic state of wild type ICC.

Now for the cleaned data of hypoxic state in ICC with the knockout of SLC2A1.

EBV_SLC <- Data_SLC_clean[which(Data_SLC_clean$Gene %in% EBV),]

paged_table(EBV_SLC)

Both of the same PDL1 and PDL2 genes are also in the ICC hypoxic group of knockout SLC2A1. Also, both inhibited. Those are chromosome 9 genes.

Next time we will explore the gene summaries for each of these top 40 genes in ICC hypoxic states between Wild Type and knockout of SLC2A1 with shSLC2A1.

Here is a YouTube link to this video.

intrahepatic cholangiocarcinoma ICC GSE316921 analysis and ML

Janis Harris

2026-05-12