Following the methodology of Pagès, H., & Kakopo, A. (2024) for the BSgenomeForge package in Bioconductor, a BSgenome data package is used to store the genome data of a chosen organism in R. Here it is created from the 17 chromosome data sets (16 nuclear chromosomes plus the mitochondrial genome) of S. cerevisiae in FASTA format (Figure 1) for the strains C3751, C3253, and BY4743, obtained from the Department of Biotechnology, Faculty of Science, Mahidol University. Once the package is installed, specific nucleotide sequences can be identified and extracted in FASTA format for further analysis. BY4743, a laboratory strain derived from the wild-type strain S288c (Harsch et al., 2010), serves as a control strain because of its lower heat tolerance compared to industrial strains (Pizarro et al., 2008).

Figure 1: The 17 chromosome data sets of S. cerevisiae BY4743 in FASTA format

1. Install and load the BSgenome package in R

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## Bioconductor version '3.18' is out-of-date; the current release version '3.21'
##   is available with R version '4.5'; see https://bioconductor.org/install
BiocManager::install("BSgenome")
## Bioconductor version 3.18 (BiocManager 1.30.23), R 4.3.3 (2024-02-29 ucrt)
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'BSgenome'
## Installation paths not writeable, unable to update packages
##   path: C:/Program Files/R/R-4.3.3/library
##   packages:
##     boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme,
##     nnet, rpart, spatial, survival
## Old packages: 'abind', 'askpass', 'BH', 'BiocManager', 'bit', 'bit64',
##   'bitops', 'broom', 'bslib', 'cli', 'cpp11', 'curl', 'data.table', 'digest',
##   'evaluate', 'fontawesome', 'fs', 'generics', 'ggplot2', 'glue', 'gtable',
##   'haven', 'jsonlite', 'knitr', 'locfit', 'lubridate', 'matrixStats', 'mime',
##   'openssl', 'pheatmap', 'pillar', 'processx', 'ps', 'purrr', 'R6', 'ragg',
##   'Rcpp', 'RcppArmadillo', 'RCurl', 'readxl', 'rjson', 'rlang', 'rmarkdown',
##   'RSQLite', 'rstudioapi', 'sass', 'scales', 'stringi', 'sys', 'systemfonts',
##   'textshaping', 'tibble', 'tinytex', 'tzdb', 'utf8', 'withr', 'xfun', 'XML',
##   'xml2'
library(BSgenome)
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: BiocIO
## Loading required package: rtracklayer
## 
## Attaching package: 'rtracklayer'
## The following object is masked from 'package:BiocIO':
## 
##     FileForFormat

2. Prepare the seed file

A seed file is a plain-text file, written in the same field: value format as a package DESCRIPTION file, that describes the genome to be packaged. A simple way to create one is to copy the seed file of another organism that ships with BSgenome and overwrite its fields. The code in this section locates an example seed file (the ferret seed, stored in ‘musFur1_seed’), reads it, and saves a copy in the current working directory (Figure 2). The file name can be customized; in this example it is named ‘BY4743’.


Figure 2: Setting the working directory. Set the working directory before running any of the code.

Figure 2: Working directory
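The same step can be done from the R console instead of the menu. The lines below are a minimal sketch: the folder is the one that appears in the paths later in this document, and the dir.exists() guard is an added assumption so that the code does not fail on a machine where that folder is absent.

```r
# Folder used in this document; replace it with your own project folder.
work_dir <- "C:/Users/Lenovo/Desktop/BSgenome"

# Change directory only if the folder exists, so the sketch is safe to run anywhere.
if (dir.exists(work_dir)) {
  setwd(work_dir)
}

# Confirm where files such as the seed file will be written.
current_dir <- getwd()
print(current_dir)
```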

#Locate the folder of example seed files shipped with BSgenome
seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome")

#Pick one example seed file (ferret, musFur1) to use as a template
musFur1_seed <- list.files(seed_files, pattern="\\.musFur1-seed$", full.names=TRUE)

#Print the contents of the example seed file
cat(readLines(musFur1_seed), sep="\n")
## Package: BSgenome.Mfuro.UCSC.musFur1
## Title: Full genomic sequences for Mustela putorius furo (UCSC version musFur1)
## Description: Full genomic sequences for Mustela putorius furo (Ferret) as provided by UCSC (musFur1, Apr. 2011) and stored in Biostrings objects.
## Version: 1.4.2
## organism: Mustela putorius furo
## common_name: Ferret
## provider: UCSC
## provider_version: musFur1
## release_date: Apr. 2011
## release_name: Ferret Genome Sequencing Consortium MusPutFur1.0
## source_url: http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/
## organism_biocview: Mustela_furo
## BSgenomeObjname: Mfuro
## SrcDataFiles: musFur1.2bit from http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/
## PkgExamples: bsg$GL896898  # same as bsg[["GL896898"]]
## seqs_srcdir: /fh/fast/morgan_m/BioC/BSgenomeForge/srcdata/BSgenome.Mfuro.UCSC.musFur1/seqs
## seqfile_name: musFur1.2bit
#Read the example seed file into a data frame with read.dcf
BY4743 <- read.dcf(musFur1_seed, all = TRUE)
utils::str(BY4743)
## 'data.frame':    1 obs. of  17 variables:
##  $ Package          : chr "BSgenome.Mfuro.UCSC.musFur1"
##  $ Title            : chr "Full genomic sequences for Mustela putorius furo (UCSC version musFur1)"
##  $ Description      : chr "Full genomic sequences for Mustela putorius furo (Ferret) as provided by UCSC (musFur1, Apr. 2011) and stored i"| __truncated__
##  $ Version          : chr "1.4.2"
##  $ organism         : chr "Mustela putorius furo"
##  $ common_name      : chr "Ferret"
##  $ provider         : chr "UCSC"
##  $ provider_version : chr "musFur1"
##  $ release_date     : chr "Apr. 2011"
##  $ release_name     : chr "Ferret Genome Sequencing Consortium MusPutFur1.0"
##  $ source_url       : chr "http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/"
##  $ organism_biocview: chr "Mustela_furo"
##  $ BSgenomeObjname  : chr "Mfuro"
##  $ SrcDataFiles     : chr "musFur1.2bit from http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/"
##  $ PkgExamples      : chr "bsg$GL896898  # same as bsg[[\"GL896898\"]]"
##  $ seqs_srcdir      : chr "/fh/fast/morgan_m/BioC/BSgenomeForge/srcdata/BSgenome.Mfuro.UCSC.musFur1/seqs"
##  $ seqfile_name     : chr "musFur1.2bit"
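#Write a copy of the seed to the working directory as a file named "BY4743"
write.dcf(BY4743, file = "BY4743")

Open the new file in your working directory, replace its contents with the description of your own genome, and save it. Set seqs_srcdir to the folder on your computer that holds the FASTA files of all 17 chromosome data sets.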
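As an alternative to editing the file by hand, the fields can be overwritten in R before the seed file is written. The following is a minimal sketch using base R only. The six-field ferret record is a shortened stand-in for the real musFur1 seed so that the example runs on its own, and the temporary file locations and the /path/to/ferret/seqs value are placeholders.

```r
# Shortened stand-in for the example seed file (assumption: the real seed has more fields).
template_path <- tempfile("musFur1-seed")
writeLines(c(
  "Package: BSgenome.Mfuro.UCSC.musFur1",
  "organism: Mustela putorius furo",
  "common_name: Ferret",
  "provider: UCSC",
  "BSgenomeObjname: Mfuro",
  "seqs_srcdir: /path/to/ferret/seqs"
), template_path)

# With all = TRUE, read.dcf() returns a data frame with one column per field.
seed <- read.dcf(template_path, all = TRUE)

# Overwrite the fields with the values used for BY4743 in this document.
seed$Package         <- "BSgenome.Scerevisiae.MUSC.BY4743"
seed$organism        <- "Saccharomyces cerevisiae"
seed$common_name     <- "Yeast"
seed$provider        <- "MUSC"
seed$BSgenomeObjname <- "Scerevisiae"
seed$seqs_srcdir     <- "C:/Users/Lenovo/Desktop/BSgenome/seqs"

# Write the edited seed and print it to confirm the changes.
seed_path <- file.path(tempdir(), "BY4743")
write.dcf(seed, file = seed_path)
cat(readLines(seed_path), sep = "\n")
```

Fields that are not present in the template, such as seqnames and circ_seqs, can be added in the same way by assigning a new column before calling write.dcf().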


Example seed file (DESCRIPTION) for BY4743

Package: BSgenome.Scerevisiae.MUSC.BY4743

Title: Full genomic sequences for Saccharomyces cerevisiae (MUSC version BY4743)

Description: Full genomic sequences for Saccharomyces cerevisiae (Yeast) as provided by MUSC (BY4743, Nov. 2023)

Version: 1.0

organism: Saccharomyces cerevisiae

common_name: Yeast

provider: MUSC

provider_version: BY4743

release_date: Nov. 13

release_name: SCBY47431.0

source_url: https://www.ncbi.nlm.nih.gov/genome/?term=txid4932[orgn]&shouldredirect=false

organism_biocview: Saccharomyces_cerevisiae

BSgenomeObjname: Scerevisiae

seqnames: c("NC_001133.9","NC_001134.8","NC_001135.5","NC_001136.10","NC_001137.3","NC_001138.5","NC_001139.9","NC_001140.6","NC_001141.2","NC_001142.9","NC_001143.9","NC_001144.5","NC_001145.3","NC_001146.8","NC_001147.6","NC_001148.4","NC_001224.1")

circ_seqs: "NC_001224.1"

seqs_srcdir: C:/Users/Lenovo/Desktop/BSgenome/seqs
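Before forging, it can help to confirm that seqs_srcdir contains one FASTA file per entry in seqnames. The console output in Section 3 shows that each sequence is read from a file named after its sequence name with the suffix .fa. The sketch below checks this with base R only. The helper name missing_fasta() and the temporary demo folder are hypothetical and are not part of BSgenome.

```r
# Hypothetical helper: return the expected FASTA files that are missing from seqs_srcdir.
missing_fasta <- function(seqnames, seqs_srcdir, suffix = ".fa") {
  expected <- file.path(seqs_srcdir, paste0(seqnames, suffix))
  basename(expected[!file.exists(expected)])
}

# Demo on a temporary folder that holds only two of the three expected files.
demo_dir <- file.path(tempdir(), "seqs_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_names <- c("NC_001133.9", "NC_001134.8", "NC_001224.1")
for (nm in demo_names[1:2]) {
  writeLines(c(paste0(">", nm), "ACGT"), file.path(demo_dir, paste0(nm, ".fa")))
}

not_found <- missing_fasta(demo_names, demo_dir)
print(not_found)
```

With the real data, the same helper would be called with the 17 sequence names and the seqs_srcdir path from the seed file; an empty result means every expected file is present.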


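3. Run forgeBSgenomeDataPkg with the path to the seed file

This code reads the seed file saved in the previous step and creates a BSgenome data package. It loads the FASTA files of all 17 chromosome data sets from seqs_srcdir and writes the result to a folder named after the Package field (BSgenome.Scerevisiae.MUSC.BY4743) in the working directory (Figure 3). The warnings in the output report that the provider_version and release_name fields are deprecated, and that sequence letters other than A, C, G, T, and N were replaced with N because the 2bit format does not support them.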

forgeBSgenomeDataPkg("C:/Users/Lenovo/Desktop/BSgenome/BY4743", replace=TRUE)
## Warning in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir =
## destdir, : field 'provider_version' is deprecated in favor of 'genome'
## Warning in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir =
## destdir, : field 'release_name' is deprecated
## Creating package in ./BSgenome.Scerevisiae.MUSC.BY4743 
## Loading 'NC_001133.9' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001133.9.fa' ... DONE
## Loading 'NC_001134.8' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001134.8.fa' ... DONE
## Loading 'NC_001135.5' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001135.5.fa' ... DONE
## Loading 'NC_001136.10' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001136.10.fa' ... DONE
## Loading 'NC_001137.3' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001137.3.fa' ... DONE
## Loading 'NC_001138.5' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001138.5.fa' ... DONE
## Loading 'NC_001139.9' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001139.9.fa' ... DONE
## Loading 'NC_001140.6' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001140.6.fa' ... DONE
## Loading 'NC_001141.2' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001141.2.fa' ... DONE
## Loading 'NC_001142.9' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001142.9.fa' ... DONE
## Loading 'NC_001143.9' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001143.9.fa' ... DONE
## Loading 'NC_001144.5' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001144.5.fa' ... DONE
## Warning in .replace_non_ACGTN_with_N(seq[[1L]]): DNA sequence contains letters not supported by UCSC 2bit format (the
##   format only supports A, C, G, T, and N). Replacing them with Ns.
## Loading 'NC_001145.3' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001145.3.fa' ... DONE
## Loading 'NC_001146.8' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001146.8.fa' ... DONE
## Loading 'NC_001147.6' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001147.6.fa' ... DONE
## Loading 'NC_001148.4' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001148.4.fa' ... DONE
## Loading 'NC_001224.1' sequence from FASTA file 'C:/Users/Lenovo/Desktop/BSgenome/seqs/NC_001224.1.fa' ... DONE
## Writing all sequences to './BSgenome.Scerevisiae.MUSC.BY4743/inst/extdata/single_sequences.2bit' ... DONE

Figure 3: BSgenome.Scerevisiae.MUSC.BY4743

4. Build the source package (tarball)

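Build a source package (tarball) from the package folder and install it in R. In this work the tarball was built on Linux (with Aj. Todsapol Techo). Once installed, the package is loaded with library() and the genome object is assigned to a shorter name.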
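The build and install step can also be driven from within R. The sketch below is an assumption about how this could be done, not a record of the commands that were actually run on Linux. It calls the standard R CMD build through system2() and then installs the resulting tarball with install.packages(). It only does so when the package folder from Section 3 is present in the working directory.

```r
pkg_dir <- "BSgenome.Scerevisiae.MUSC.BY4743"   # folder created by forgeBSgenomeDataPkg()
r_bin <- file.path(R.home("bin"), "R")          # R executable of the current installation
build_args <- c("CMD", "build", shQuote(pkg_dir))

if (dir.exists(pkg_dir)) {
  # Equivalent to typing: R CMD build BSgenome.Scerevisiae.MUSC.BY4743
  system2(r_bin, build_args)

  # R CMD build names the tarball <Package>_<Version>.tar.gz
  tarball <- list.files(pattern = "\\.tar\\.gz$")
  tarball <- tarball[startsWith(tarball, paste0(pkg_dir, "_"))]

  # Install from the local source tarball rather than from a repository.
  if (length(tarball) == 1L) {
    install.packages(tarball, repos = NULL, type = "source")
  }
} else {
  message("Package folder not found; run the forge step first.")
}
```

After a successful install, the package can be loaded with library() in the usual way.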

library(BSgenome.Scerevisiae.MUSC.BY4743)
genome<-BSgenome.Scerevisiae.MUSC.BY4743
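With the package loaded, sequences can be inspected and exported in FASTA format. The following sketch assumes that BSgenome.Scerevisiae.MUSC.BY4743 has been built and installed as described above. The region (the first 60 bases of NC_001133.9) and the output file name are arbitrary choices for illustration. The block is skipped when the package is not installed.

```r
pkg <- "BSgenome.Scerevisiae.MUSC.BY4743"
region_start <- 1L
region_end <- 60L
out_file <- file.path(tempdir(), "NC_001133.9_1-60.fa")

if (requireNamespace(pkg, quietly = TRUE)) {
  # Attaching the data package also attaches BSgenome and the packages it depends on.
  library(pkg, character.only = TRUE)
  genome <- get(pkg)

  # Names and lengths of the sequences stored in the package.
  print(seqlengths(genome))

  # Extract a region as a DNAStringSet and write it to a FASTA file.
  region <- GRanges("NC_001133.9", IRanges(start = region_start, end = region_end))
  seqs <- getSeq(genome, region)
  names(seqs) <- "NC_001133.9:1-60"
  writeXStringSet(seqs, filepath = out_file)
}
```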

References

Pagès, H. (2024). BSgenomeForge: Forge your own BSgenome data package (Version 1.6.0) [R package]. Bioconductor. https://bioconductor.org/packages/BSgenomeForge.

Harsch, M. J., Lee, S. A., Goddard, M. R., & Gardner, R. C. (2010). Optimized fermentation of grape juice by laboratory strains of Saccharomyces cerevisiae. FEMS Yeast Research, 10(1), 72–82. https://doi.org/10.1111/j.1567-1364.2009.00580.x

Pizarro, F. J., Jewett, M. C., Nielsen, J., & Agosin, E. (2008). Growth temperature exerts differential physiological and transcriptional responses in laboratory and wine strains of Saccharomyces cerevisiae. Applied and Environmental Microbiology, 74(20), 6358–6368. https://doi.org/10.1128/AEM.00602-08