DWA_Celgene_data_label

Last updated: 2023-02-13

Checks: 6 passed, 1 warning

Knit directory: DWA_file_naming_fix/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20230207)

The command set.seed(20230207) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: fa16b40

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version fa16b40. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/DWA_file_name_fixing.Rmd
    Untracked:  data/aim1/
    Untracked:  data/aim2/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Aim of this markdown file

Overall: Come up with quick fixes for the couple of measurements found with partially incorrect labels.

Background and rationale:

Aim1:
At this level, I want to see if I can identify treatment-level files that are mislabelled in the folder ***_uns_2S_3S_total*** on the DWA server Saturn > Phenix.

This will be the quickest fix to the problem. The data files in this folder are the split product of the raw Harmony evaluation file. That file is generated by the Harmony software from the measurements of the experiment images, hence I call the evaluation file the lowest level. It is usually large, a ~1-2 GB .txt file, because it holds single-cell-level image features. We usually split this evaluation file by treatmentsum to generate the treatment-level files seen in ***_uns_2S_3S_total***. These treatment-level files then act as the input to the MATLAB-based Myclassifier for classifier building etc. If I can identify the mislabelled files here using the file names (patient ID, time and treatment), I might be able to correct the file names and content so that this database is clean.
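As a rough illustration of the split step described above, the sketch below splits a small made-up single-cell table by a `treatmentsum` column and writes one tab-separated file per treatment into a temporary folder. It assumes `treatmentsum` is a column of the evaluation table; the toy values, the other column names and the output file naming are hypothetical and are not the lab's actual pipeline.

```r
# Toy stand-in for a Harmony evaluation file (hypothetical columns and values)
evaluation <- data.frame(
  treatmentsum = c("uns_none", "uns_none", "3S_TRAIL", "3S_TRAIL", "3S_TRAIL"),
  cell_type    = c("S001", "S001", "S001", "S001", "S001"),
  area         = c(101.2, 98.7, 110.4, 95.1, 102.9),
  stringsAsFactors = FALSE
)

# Split the single-cell rows into one data frame per treatment
by_treatment <- split(evaluation, evaluation$treatmentsum)

# Write each treatment-level table to its own tab-separated file
# (output folder and file names are assumptions for this demo)
out_dir <- file.path(tempdir(), "uns_2S_3S_total_demo")
dir.create(out_dir, showWarnings = FALSE)
for (trt in names(by_treatment)) {
  write.table(by_treatment[[trt]],
              file = file.path(out_dir, paste0(trt, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
```

For a real 1-2 GB evaluation file, reading everything into memory at once may not be practical, so a chunked reader would likely be needed instead of a single `split()`.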

For the paper, the highest-level output is the Myclassifier output file, which holds treatment wells and cell distribution. I might need to look into that file for a quick fix for the paper we are writing.

Aim2:
Since I started at the lab, I have found multiple treatment-level files with “copies” appended to their names, so I suspect the problem might run further than the cases we have identified. What matters most is the root evaluation files, so I want to perform an automated scan on all evaluation files. This is the safest action, although rebuilding ***_uns_2S_3S_total*** will take a considerable amount of time.

The idea of the automated scan is to read in each evaluation file and check the patient ID column. It should contain only one ID. If it does not, the correct ID should be the majority, as the mislabelling only happened in a minority of bottom rows when copy-pasting from the Excel file (plate layout).
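The majority rule described above could be sketched as follows. This is a minimal illustration on a made-up vector of IDs; the helper name `check_patient_id` and the example IDs are hypothetical, and a real scan would pass in the patient ID column read from each evaluation file.

```r
# Return the most frequent patient ID and whether the column is consistent
check_patient_id <- function(ids) {
  counts <- table(ids)  # table() ignores NA values by default
  list(
    majority_id = names(counts)[which.max(counts)],
    n_ids       = length(counts),
    consistent  = length(counts) == 1
  )
}

# Made-up example: the last two rows carry a wrongly pasted ID
ids <- c(rep("S001", 8), rep("S002", 2))
result <- check_patient_id(ids)
result$majority_id
result$consistent
```

Note that `which.max()` returns the first maximum, so a tie between two IDs is resolved silently; a file with a tie or a narrow majority is probably better flagged for manual review than corrected automatically.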

# load the required packages
# if (!require('pacman')) install.packages('pacman')
# pacman::p_load(package1, package2, package_n)

library(tidyverse)
library(here)
library(stringr)

Aim 1

# Get the treatment file names and extract the patient ID. Some condition
# handling is needed, as not all files are labelled correctly: e.g. some are
# missing the patient ID in their file names, so the simple extraction can
# return nothing and later steps error out.

## The best way may be to read each file and extract the ID from its content,
## or to isolate those files first and handle them separately?

treatment.files <- read_csv(here::here("data/aim1/ML_uns_2S_3S_total_list.files.csv"),
    col_names = TRUE, skip = 1) %>%
    select(2) %>%
    rename(filename = 1) %>%
    mutate(Patient.ID = stringr::str_extract(filename, pattern = "S[0-9]+"),
        Stimulation = stringr::str_extract(filename, pattern = "uns|3S"))

treatment.files$correct.ID <- ""

treatment.files$correct.filename <- ""
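One way to isolate the files whose names lack a patient ID, so they can be handled separately, is sketched below in base R. The first file name is the example file used in this analysis; the second file name and the helper `extract_id` are hypothetical, and the pattern `S[0-9]+` is the same one used for the look-up table.

```r
# Example file names; the second one (made up) has no patient ID
filenames <- c("S999706_20190123_T6__uns_TRAIL_3ugml_none.txt",
               "20190123_T6__3S_TRAIL_3ugml_none.txt")

# Extract the first "S" followed by digits, or NA when there is no match
extract_id <- function(x) {
  m <- regexpr("S[0-9]+", x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

ids <- extract_id(filenames)

# Files that need their ID recovered from the file content instead
needs_review <- filenames[is.na(ids)]
```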

# Loop over every treatment-level file and check whether its patient ID is unique

for (i in seq_len(nrow(treatment.files))) {

  # As I only have one example file, most of the files listed in the look-up
  # table do not exist locally, so reading them fails.
  # tryCatch() keeps the loop going: a file that cannot be read is skipped.

  old.filename <- treatment.files$filename[i]
  old.path <- here::here("data/aim1", old.filename)

  # patient ID as labelled in the file name (NA if the file name has none)
  name.display <- treatment.files$Patient.ID[i]

  # read in the corresponding file; NULL if it cannot be read
  file.read <- tryCatch(
    read_tsv(old.path, col_names = TRUE),
    error = function(e) {
      cat("ERROR :", old.filename, "does not exist\n")
      NULL
    })
  if (is.null(file.read)) next

  # column 12, "Cell Type", holds the patient ID
  name_label <- as.character(file.read$`Cell Type`)
  if (length(name_label) == 0) next

  # a correctly labelled layout has only one patient ID; an incorrectly
  # labelled layout should have the correct patient ID in the majority
  correct.name <- names(which.max(table(name_label)))

  # update the look-up table treatment.files
  treatment.files$correct.ID[i] <- correct.name

  # try to create a correct file name
  new.filename <- if (is.na(name.display)) {
    # no ID in the file name: prefix the ID found in the file
    paste0(correct.name, "_", old.filename)
  } else {
    # ID already in the file name: swap it for the majority ID
    str_replace(old.filename, "S[0-9]+", correct.name)
  }
  treatment.files$correct.filename[i] <- new.filename
  new.path <- here::here("data/aim1", new.filename)

  if (n_distinct(name_label) == 1) {
    # content is already consistent, so only the file name may need fixing
    if (new.path != old.path) file.rename(old.path, new.path)
  } else {
    # update the data content, write the new file, then delete the original
    correct.file <- file.read %>%
      mutate(`Cell Type` = correct.name)
    write_tsv(correct.file, new.path)
    if (new.path != old.path) file.remove(old.path)
  }
}

ERROR : S999706_20190123_T6__uns_TRAIL_3ugml_none.txt does not exist
Error in tbl_vars_dispatch(x) : object 'name_label' not found

# Write out the updated look-up table so that the changes made can be checked
write.csv(treatment.files, file = here::here("data/aim1/treatmentfiles.csv"), row.names = FALSE)

sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] here_1.0.1      forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10   
 [5] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.8   
 [9] ggplot2_3.4.0   tidyverse_1.3.2 workflowr_1.7.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9          lubridate_1.8.0     getPass_0.2-2      
 [4] ps_1.7.2            assertthat_0.2.1    rprojroot_2.0.3    
 [7] digest_0.6.29       utf8_1.2.2          R6_2.5.1           
[10] cellranger_1.1.0    backports_1.4.1     reprex_2.0.2       
[13] evaluate_0.16       httr_1.4.4          pillar_1.8.1       
[16] rlang_1.0.6         readxl_1.4.1        googlesheets4_1.0.1
[19] rstudioapi_0.14     whisker_0.4         callr_3.7.3        
[22] jquerylib_0.1.4     rmarkdown_2.16      googledrive_2.0.0  
[25] bit_4.0.4           munsell_0.5.0       broom_1.0.1        
[28] compiler_4.2.0      httpuv_1.6.5        modelr_0.1.9       
[31] xfun_0.32           pkgconfig_2.0.3     htmltools_0.5.3    
[34] tidyselect_1.1.2    fansi_1.0.3         crayon_1.5.1       
[37] withr_2.5.0         tzdb_0.3.0          dbplyr_2.2.1       
[40] later_1.3.0         grid_4.2.0          jsonlite_1.8.0     
[43] gtable_0.3.1        lifecycle_1.0.3     DBI_1.1.3          
[46] git2r_0.30.1        magrittr_2.0.3      formatR_1.12       
[49] scales_1.2.1        vroom_1.5.7         cli_3.4.1          
[52] stringi_1.7.8       cachem_1.0.6        fs_1.5.2           
[55] promises_1.2.0.1    xml2_1.3.3          bslib_0.4.0        
[58] ellipsis_0.3.2      generics_0.1.3      vctrs_0.5.1        
[61] tools_4.2.0         bit64_4.0.5         glue_1.6.2         
[64] hms_1.1.2           parallel_4.2.0      processx_3.8.0     
[67] fastmap_1.1.0       yaml_2.3.5          colorspace_2.0-3   
[70] gargle_1.2.0        rvest_1.0.3         knitr_1.40         
[73] haven_2.5.1         sass_0.4.2

DWA_Celgene_data_label_fix

Mark Li

2023-02-07
