Introduction to the research

The application dashboard was built to answer one main question posed by the client: using the app, can you determine which clusters are similar to or different from one another through PCA and UMAP analysis? The question is to be answered with interactive visualizations that the client can read quickly and operate easily.

The client provided the data to our team. It had already been pre-processed and is only a portion of the single-cell RNAseq samples from the original research paper, which is available here: https://www.sciencedirect.com/science/article/pii/S1534580722000338?via%3Dihub.

Some background explains why we are examining this data. The original goal of the article is to use UMAP to determine the function of cells at different life stages, and from that to work out when cells move from being grouped under one function to their final functional state. Arabidopsis thaliana root was used because it is a model organism: the structure of the root is very simple, which makes it easy to work with and well suited to clustering cells. In short, the data tracks Arabidopsis thaliana root development and shows how UMAP can be used to follow cells as they develop into their function.

We were given three files. The UMAP file contains the 3355 cell barcodes with their x and y UMAP coordinates. The PCA file contains the 3355 cell barcodes and the 50 PC values that were computed for us. The cluster file contains the 3355 cell barcodes with their cluster ID number, which ranges from 1 to 20. In total, we received only a select portion of the more than 110,000 RNAseq samples. Even with a portion of the data, we can still show how PCA and UMAP plots relate to one another in determining clusters. The application loads this data and lets the user interact with it.

UMAP<-read.table('AT_root_scRNAseq_UMAP.csv',sep=',', header=T)
dim(UMAP)
## [1] 3355    3
clusters <- read.table('AT_root_scRNAseq_clustermap.csv',sep=',', header=T)
dim(clusters)
## [1] 3355    2
PCA<-read.table('AT_root_scRNAseq_PCA.csv',sep=',', header=T)
dim(PCA)
## [1] 3355   51
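
Since the three files are later joined on CellBarcode, it can help to confirm that they describe the same cells before plotting. The following is a minimal sketch in base R; the helper name same_barcodes and the toy data frames are illustrative assumptions and not part of the project code. Only the CellBarcode column name comes from the files above.

```r
# Hypothetical helper: TRUE when every table has the same set of barcodes,
# with no duplicates inside any single table.
same_barcodes <- function(..., key = "CellBarcode") {
  tables <- list(...)
  reference <- sort(unique(as.character(tables[[1]][[key]])))
  all(vapply(tables, function(tbl) {
    ids <- as.character(tbl[[key]])
    !anyDuplicated(ids) && identical(sort(unique(ids)), reference)
  }, logical(1)))
}

# Toy stand-ins for the UMAP, PCA, and cluster tables (made-up values).
toy_umap     <- data.frame(CellBarcode = c("A", "B", "C"), UMAP_1 = 1:3, UMAP_2 = 3:1)
toy_pca      <- data.frame(CellBarcode = c("C", "A", "B"), PC_1 = c(0.1, 0.2, 0.3))
toy_clusters <- data.frame(CellBarcode = c("B", "C", "A"), seurat_clusters = c(1, 2, 1))

barcodes_ok <- same_barcodes(toy_umap, toy_pca, toy_clusters)
```

With the real tables, the equivalent call would be same_barcodes(UMAP, PCA, clusters).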

Programming and analytical tasks

Our three-person team had four tasks to complete for the client. First, the app had to show the UMAP and PCA scatter plots. Second, the PCA plot had to be interactive, letting the client select which PC axes to display. Third, both the UMAP and PCA plots had to support cluster selection and deselection. Finally, a second tab had to show the count of cells in each cluster and a heatmap of the correlation of cell clusters. All of this also had to be laid out in an organized way within the app.

Our team was able to create a fully functional app that met all of the client's specifications. The first step was to view the data and determine how to build the figures. We then built the app with two tabs that provide the two sets of visualizations described above. The first tab contains two plots of the dimension-reduced RNA transcription data from the Arabidopsis thaliana root tips: one shows the PCA comparison and the other the UMAP. Placing the plots next to each other makes it easier for the client to see the differences between the analyses. Creating the plots themselves was fairly easy because we had practice with ggplot. The difficulty came when adding both plots to the tab. We ran into many errors that prevented the plots from showing next to each other, and eventually worked out how to use the split layout function in Shiny so that both figures appear without distortion. Once the plots were in place, we colored the points by cluster using the cluster file. We expected this to be one of the easier tasks, but the cluster filtering initially did not show all clusters. This turned out to be a simple coding fix, although it took a while to understand what we had done wrong.

Once this was complete, we worked on the interactive portion of the app. Our team had no prior experience with this type of interaction, so there was a learning curve. After deciding what the interactions should be, we added select inputs to display two dropdown menus for the PCA plot, along with a checkbox menu for the cluster selection that applies to all plots. The two dropdown menus let the client choose the PCs for the x- and y-axes, which can be changed without reloading the application. Provided below is what the UMAP plot should look like in the app before any interaction to select or deselect clusters is applied.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggcorrplot)
UMAP<-read.table('AT_root_scRNAseq_UMAP.csv',sep=',', header=T)
UMAP_clusters<-left_join(UMAP, clusters, by='CellBarcode')
ggplot(data = UMAP_clusters, aes(x=UMAP_1, y=UMAP_2, color=as.factor(seurat_clusters))) + geom_point()
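
The cluster selection described earlier comes down to keeping the rows whose cluster ID is among the selected values. As a hedged illustration on toy data (and not necessarily the bug we hit), the sketch below shows why %in% is the safer choice: == recycles the selection vector along the column and can silently drop rows, which looks like clusters going missing. It also shows that character selections, such as those a checkbox group returns, still match integer cluster IDs.

```r
# Toy cluster table (made-up values); seurat_clusters mirrors the real column name.
toy <- data.frame(
  CellBarcode = paste0("cell", 1:6),
  seurat_clusters = c(1L, 2L, 3L, 1L, 2L, 3L)
)

# Selections arrive as character strings from a checkbox group.
selected <- c("2", "1")

# == recycles `selected` along the column, comparing element by element.
rows_with_equals <- toy[toy$seurat_clusters == selected, ]

# %in% asks, for each row, whether its cluster appears anywhere in `selected`.
rows_with_in <- toy[toy$seurat_clusters %in% selected, ]

n_equals <- nrow(rows_with_equals)
n_in <- nrow(rows_with_in)
```

In this toy example == keeps only 2 of the 4 cells that belong to clusters 1 and 2, while %in% keeps all 4.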

The PCA plot should likewise show all clusters before adding the interaction for choosing the PCs on the x- and y-axes and the clusters. To start, we worked only with PC_1 versus PC_2, which looks like this:

PCA_clusters <- left_join(PCA, clusters, by = 'CellBarcode')
ggplot(data = PCA_clusters, aes(x=PC_1, y=PC_2, color=as.factor(seurat_clusters))) + geom_point()
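
To let the axes change with the user's choice, the plot has to accept PC column names as strings rather than the hard-coded PC_1 and PC_2. One way to do this is the .data pronoun inside aes(). The sketch below uses randomly generated toy data, and the function name plot_pcs is an assumption made for illustration.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of two PC columns chosen by name,
# restricted to the selected clusters.
plot_pcs <- function(df, x_pc, y_pc, keep_clusters) {
  stopifnot(x_pc %in% names(df), y_pc %in% names(df))
  shown <- df[df$seurat_clusters %in% keep_clusters, ]
  ggplot(shown, aes(x = .data[[x_pc]], y = .data[[y_pc]],
                    color = as.factor(seurat_clusters))) +
    geom_point(alpha = 0.5) +
    labs(x = x_pc, y = y_pc, color = "Cluster")
}

# Toy data: 30 cells, 3 PCs, 3 clusters (random values, not the real data).
set.seed(1)
toy_pca <- data.frame(
  PC_1 = rnorm(30), PC_2 = rnorm(30), PC_3 = rnorm(30),
  seurat_clusters = rep(1:3, each = 10)
)

p <- plot_pcs(toy_pca, "PC_1", "PC_3", keep_clusters = c(1, 3))
```

With the real data, the same call would take PCA_clusters and the two column names picked in the dropdown menus.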

Having these reference plots made it easier to translate the R code into the Shiny syntax that runs the app. I think this portion took the longest to implement, since coming up with an interactive element at all is hard. What I found most useful was to build the interactive objects on their own first, without the data, just to see whether they would work, and then to rework them into the app already being built. A useful tip from one of our team members was to add print statements to see where the app stopped running, which I found to be an efficient way of locating potential problems. Once the interactions were added to the application, it was great to see how they worked on the plots.

We then added the second tab, which provides more information about the cell clusters. Most of this portion was implemented by Sophie. Because we spent so much time on the interactive portion of the app, we left little time for the second tab and the plots it contains. The bar plot shows the count of cells in each cluster; it is made with ggplot by counting the number of cells in each cluster and plotting the counts. For the heatmap, we took the clusters and mapped them against the PC values. The heatmap is shown below.

# Drop the CellBarcode column, then correlate the PC columns and the cluster ID column
new_PCA <- PCA_clusters[-1]
PCA_ma <- as.matrix(new_PCA)
PCA_cor <- cor(PCA_ma)

heatmap <- ggcorrplot(PCA_cor) + theme(axis.text.x=element_text(angle=70)) +
  ggtitle("Heatmap of PCA Values per Cluster")
heatmap

When we first created the heatmap we thought it was incorrect, but it makes sense: principal components are constructed to be uncorrelated with one another, so there is little color on the graph except along the diagonal and in the cluster row and column. This is just one of the figures on the second tab, and it does not offer the client much beyond reassurance that the clustering shown in the PCA plots is reasonable. We had initially wanted both figures on the second tab to be interactive as well, for example hovering over a bar in the bar plot to see what it represents, or hovering over a heatmap cell to see its value.

The one error we did not resolve during the project was red text on the app page stating ERROR:arg must be a symbol. Even after multiple attempts, the team could not determine its cause. A likely source is the call to ensym() on input$pca_plot_x_axis in the x_axis text output, since ensym() expects the bare name of a function argument rather than an expression. The app was still fully functional even with the error.

Our last obstacle was getting all the graphics to fit in each tab, specifically the second tab, where the figures overlapped each other. After discussing it, we think this is because of how R Shiny lays out apps so that they remain usable in the browser and on mobile. We had wanted the figures side by side but switched to stacking them, one on top and one on the bottom, to give the client the best viewing experience. Overall this project was doable, and each team member was able to contribute their strengths where needed.
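
The heatmap above correlates the PC columns with one another. A different reading of "correlation of cell clusters" would compare the clusters themselves, for example by averaging each PC within a cluster and correlating those average profiles. This is a sketch of that alternative on toy data, not what the app currently does; the function name cluster_correlation is an assumption.

```r
# Hypothetical helper: correlation between clusters based on their mean PC profiles.
cluster_correlation <- function(df, cluster_col = "seurat_clusters") {
  pc_cols <- setdiff(names(df), cluster_col)
  # One row per cluster, one column per PC, holding the within-cluster mean.
  means <- aggregate(df[pc_cols], by = list(cluster = df[[cluster_col]]), FUN = mean)
  profiles <- t(as.matrix(means[pc_cols]))   # rows = PCs, columns = clusters
  colnames(profiles) <- means$cluster
  cor(profiles)                              # clusters x clusters matrix
}

# Toy data: 3 clusters, 4 PCs (random values, not the real data).
set.seed(42)
toy <- data.frame(
  PC_1 = rnorm(60), PC_2 = rnorm(60), PC_3 = rnorm(60), PC_4 = rnorm(60),
  seurat_clusters = rep(1:3, times = 20)
)

cluster_cor <- cluster_correlation(toy)
```

With the real data this would be called on PCA_clusters[-1], which drops the CellBarcode column, and the resulting matrix could be passed to ggcorrplot() in the same way as PCA_cor.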

Concluding the research

Overall, the application our team built meets the requirements the client specified and allows the client to answer the question they posed. With only two weeks to build the application, we had to come together quickly to turn the concept into a functional app. For the most part, I think our dashboard is as good as it can be at showing, through user interaction, which clusters are similar to or different from one another via PCA and UMAP analysis. Just from looking at the application, you can judge which method, PCA or UMAP, does a better job of clustering the data. Knowing how UMAP and PCA work is important for understanding the visualizations and why the cells cluster the way they do.

I think the main improvements would come from a better understanding of the techniques and functions available in Shiny and of Shiny syntax. Understanding the order in which functions run in the app was challenging and required quick learning given the timeline for delivering the app to the client. I also think having all the single-cell RNAseq samples would have been a great way to work with the data, since the images in the article could then serve as a reference for checking whether the UMAP analysis was done correctly. That would, however, be a very demanding task for a personal computer running R.

My final thought is that assigning team roles was a great help in getting the project completed within the given timeline. It allowed each team member to play to their strengths, and when someone was stuck, a team member with more knowledge was able to help. It also allowed us to check in with each other, update our code, and keep all variable names uniform, so that when mistakes occurred they were due to syntax issues and the like rather than naming.

When researching and learning about PCA and UMAP plots, I found that these plots can be made in 3D. I have not figured out whether this is possible with R Shiny, but it could be a useful addition to the visualization, especially alongside the interaction the application already offers. These are just some of the ideas I have for improving the application, and some of them are personal improvements as well. If you would like to view the functional application, run the code below:

#Newest shiny app
#Load necessary libraries
library(shiny)
library(rlang)
library(ggplot2)
library(dplyr)
library(ggcorrplot)

#Server logic: load the data, then build the plots
server <- function(input, output, session) {
  clusters <- read.table('AT_root_scRNAseq_clustermap.csv', sep=',', header=T)
  PCA <- read.table('AT_root_scRNAseq_PCA.csv', sep=',', header=T)
  UMAP <- read.table('AT_root_scRNAseq_UMAP.csv', sep=',', header=T)
  UMAP_clusters <- left_join(UMAP, clusters, by='CellBarcode')
  PCA_clusters <- left_join(PCA, clusters, by='CellBarcode')

  #Define a function that takes two PC column names (e.g. "PC_1") and the clusters to keep
  pc_plot <- function(x_PC, y_PC, filter_var) {
    #.data[[...]] looks up a column by its string name; it replaces the deprecated aes_string()
    ggplot(data = filter(PCA_clusters, seurat_clusters %in% filter_var),
           aes(x = .data[[x_PC]], y = .data[[y_PC]])) +
      geom_point(aes(color=as.factor(seurat_clusters)), alpha = 0.5) +
      labs(x = x_PC, y = y_PC) +
      theme(legend.position="none")
  }

  #Render the plots; they are redrawn whenever an input changes
  output$UMAP_plot <- renderPlot({
    ggplot(filter(UMAP_clusters, seurat_clusters %in% input$cluster_choice),
           aes(x = UMAP_1, y = UMAP_2, color=as.factor(seurat_clusters))) +
      geom_point()
  }, res = 60, height = 300, width = 350)

  output$PCA_plot <- renderPlot({
    pc_plot(input$pca_plot_x_axis, input$pca_plot_y_axis, input$cluster_choice)
  }, res = 96, height = 300, width = 250)

  #The selected input is already a string, so it can be pasted directly.
  #(Calling ensym() on it here is a likely source of the "arg must be a symbol" message.)
  output$x_axis <- renderText({paste0("You chose this as your x-axis: ", input$pca_plot_x_axis)})
  
  ### PCA heatmap
  new_PCA <- PCA_clusters[-1]
  PCA_ma <- as.matrix(new_PCA)
  PCA_cor <- cor(PCA_ma)
  
  heatmap <- ggcorrplot(PCA_cor) +
    theme(axis.text.x=element_text(angle=70)) +
    ggtitle("Heatmap of PCA Values per Cluster") 
  
  output$heatmap <- renderPlot(heatmap, height=500, width = 600)
  
  ### Bar plot
  cluster_ID <- c(1:20)
  #Hard-coded number of cells in each cluster
  number_cells <- c(285, 280, 228, 224, 200, 191, 191, 181, 165, 144, 126, 123, 116, 111, 108, 72, 66, 53, 44, 34)
  cluster_cells <- data.frame(cluster_ID, number_cells)

  bar_plot <- ggplot(data=cluster_cells, aes(x=cluster_ID, y=number_cells)) +
    geom_bar(stat="identity", fill="skyblue1") +
    ggtitle("Bar Plot of Cluster ID vs. Number of Cells") +
    xlab("Cluster ID") +
    ylab("Number of Cells")

  output$barplot <- renderPlot(bar_plot, height=250, width=600)
  
 }
 PCA<-read.table('AT_root_scRNAseq_PCA.csv',sep=',', header=T)
cols<-colnames(PCA)[-1]
str(cols)
##  chr [1:50] "PC_1" "PC_2" "PC_3" "PC_4" "PC_5" "PC_6" "PC_7" "PC_8" "PC_9" ...
ui <- fluidPage(
  titlePanel("Project 1"),
  fluidRow(
    tabsetPanel(
      #First tab: axis choosers, the two scatter plots, and the cluster checkboxes
      tabPanel("Main Figures",
        column(2,
          selectInput("pca_plot_x_axis", "Select one PCA x-axis plot to display:", cols, selected = cols[1]),
          selectInput("pca_plot_y_axis", "Select one PCA y-axis plot to display:", cols, selected = cols[2]),
          textOutput("x_axis")
        ),
        fluidRow(
          column(4, plotOutput("UMAP_plot")),
          column(4, plotOutput("PCA_plot")),
          column(1,
            checkboxGroupInput("cluster_choice", "Pick what clusters you can see", 1:20, selected = 1:20)
          )
        )
      ),
      #Second tab: heatmap and bar plot
      tabPanel("Heatmap and Barplot",
        fluidRow(
          splitLayout(cellWidths = c("50%", "50%"), plotOutput("heatmap"), plotOutput("barplot"))
        )
      )
    )
  )
)
 shinyApp(ui, server)
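
The bar plot in the app uses hard-coded cell counts. Those counts sum to 2942, whereas each file holds 3355 barcodes, so it may be worth deriving the counts from the cluster table instead. The following is a minimal sketch on toy data; count_cells is a hypothetical helper name, and it assumes the cluster IDs are whole numbers as in the cluster file.

```r
# Hypothetical helper: number of cells in each cluster, as a data frame
# shaped like the cluster_cells table used for the bar plot.
count_cells <- function(cluster_table, cluster_col = "seurat_clusters") {
  counts <- table(cluster_table[[cluster_col]])
  data.frame(
    cluster_ID = as.integer(names(counts)),
    number_cells = as.integer(counts)
  )
}

# Toy cluster table (made-up values).
toy_clusters <- data.frame(
  CellBarcode = paste0("cell", 1:7),
  seurat_clusters = c(1, 1, 2, 3, 3, 3, 1)
)

cluster_cells <- count_cells(toy_clusters)
```

With the real data, count_cells(clusters) would produce a table that can be passed to the same ggplot call, and the counts would then always add up to the number of rows in the cluster file.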
