Load packages and set-up

Part I: FASTA Imports

myFA <- read.FASTA('MT103168.fasta')
head(myFA)

## 1 DNA sequence in binary format stored in a list.
## 
## Sequence length: 1560 
## 
## Label:
## MT103168.1 Bifidobacterium longum strain BB536 cell division...
## 
## Base composition:
##     a     c     g     t 
## 0.156 0.319 0.289 0.236 
## (Total: 1.56 kb)

str(myFA)

## List of 1
##  $ MT103168.1 Bifidobacterium longum strain BB536 cell division protein FtsW (rodA) gene, complete cds: raw [1:1560] 88 18 48 88 ...
##  - attr(*, "class")= chr "DNAbin"

# read.FASTA() (from the ape package) reads the FASTA file and stores it as a list of class "DNAbin".
# head() prints a summary of that object: it holds 1 DNA sequence stored in binary format, the sequence is 1560 nucleotides long, the label (truncated in this view) is "MT103168.1 Bifidobacterium longum strain BB536 cell division...", and the base composition is 15.6% a, 31.9% c, 28.9% g, and 23.6% t.
# str() shows the internal structure of the imported object rather than of the FASTA file itself. Here it is a list of 1 element whose name is the full header, "MT103168.1 Bifidobacterium longum strain BB536 cell division protein FtsW (rodA) gene, complete cds". The element is a raw vector of 1560 values, one per nucleotide, and the first few bytes are shown as a preview.
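
The values reported by head() and str() can also be pulled out of a "DNAbin" object directly. The following is a minimal sketch, assuming the ape package is installed; the file and the sequence "toy_seq" are made up for illustration and are not part of the project data.

```r
library(ape)

# Write a tiny, hypothetical FASTA file so the example is self-contained.
toy_path <- tempfile(fileext = ".fasta")
writeLines(c(">toy_seq hypothetical example", "ACGTACGTAA"), toy_path)

toy_fa <- read.FASTA(toy_path)

toy_label  <- names(toy_fa)                # full header line, without the ">"
toy_length <- length(toy_fa[[1]])          # number of nucleotides (raw bytes)
toy_freq   <- base.freq(toy_fa)            # proportions of a, c, g, t
toy_bases  <- as.character(toy_fa)[[1]]    # bytes decoded to lowercase letters

print(toy_label)
print(toy_length)
print(toy_freq)
print(toy_bases)
```

The same calls should work on myFA, since it is the same class of object.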

Part II: FASTQ Imports

myFQ <- read.fastq('ERR1072710.fastq')
head(myFQ)

## 3 DNA sequences in binary format stored in a list.
## 
## Mean sequence length: 183.667 
##    Shortest sequence: 146 
##     Longest sequence: 259 
## 
## Labels:
## ERR1072710.1 10317.000001315_0 length=151
## ERR1072710.2 10317.000001315_1 length=116
## ERR1072710.4 10317.000001315_3 length=151
## 
## Base composition:
##     a     c     g     t 
## 0.318 0.208 0.254 0.219 
## (Total: 551 bases)

str(myFQ)

## List of 3
##  $ ERR1072710.1 10317.000001315_0 length=151: raw [1:146] 18 18 88 88 ...
##  $ ERR1072710.2 10317.000001315_1 length=116: raw [1:259] 18 28 18 28 ...
##  $ ERR1072710.4 10317.000001315_3 length=151: raw [1:146] 28 28 88 28 ...
##  - attr(*, "class")= chr "DNAbin"
##  - attr(*, "QUAL")=List of 7
##   ..$ ERR1072710.1 10317.000001315_0 length=151: num [1:11] 32 38 51 34 32 34 32 34 32 38 ...
##   ..$ ERR1072710.2 10317.000001315_1 length=116: num [1:11] 30 30 30 30 30 30 30 30 30 30 ...
##   ..$ ERR1072710.4 10317.000001315_3 length=151: num [1:42] 10 36 49 49 16 15 22 17 22 16 ...
##   ..$ NA                                       : num [1:70] 51 32 34 38 38 32 38 38 38 51 ...
##   ..$ NA                                       : num [1:67] 30 30 30 30 30 30 30 30 30 30 ...
##   ..$ NA                                       : num [1:11] 32 51 51 32 38 32 38 34 34 51 ...
##   ..$ NA                                       : num [1:11] 30 30 30 30 30 30 30 30 30 30 ...

# read.fastq() (from the ape package) reads the FASTQ file and stores it as a list of class "DNAbin".
# head() prints a summary of that object: there are 3 DNA sequences in binary format. The mean sequence length is 183.667, the shortest sequence is 146 nucleotides, and the longest is 259 nucleotides. The label of each sequence is listed, and the overall base composition is 31.8% a, 20.8% c, 25.4% g, and 21.9% t.
# str() shows the internal structure of the imported object rather than of the FASTQ file itself. Here it is a list of 3 DNA sequences, each named by its full header, with the number of nucleotides in each and a preview of the binary data. It also shows a "QUAL" attribute that holds the quality scores. Note that the imported lengths (146, 259, 146) differ from the lengths stated in the labels (151, 116, 151), and that "QUAL" has 7 entries for 3 sequences, so the records may not have been parsed as intended and the import is worth checking.
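
A quick consistency check on a FASTQ import is to compare each sequence length with the number of quality scores stored for it. This is a minimal sketch, assuming the ape package is installed and that quality scores use the default offset of read.fastq(); the two reads are invented for illustration.

```r
library(ape)

# Write a tiny, hypothetical FASTQ file with exactly four lines per record.
toy_path <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT",  "+", "IIII",
             "@read2", "GGCCA", "+", "55555"), toy_path)

toy_fq   <- read.fastq(toy_path)
toy_qual <- attr(toy_fq, "QUAL")           # list of quality scores, one entry per read

seq_lengths  <- unname(sapply(toy_fq, length))
qual_lengths <- unname(sapply(toy_qual, length))

# A clean import has one quality vector per sequence, each of matching length.
same_count   <- length(toy_fq) == length(toy_qual)
same_lengths <- same_count && all(seq_lengths == qual_lengths)

print(same_count)
print(same_lengths)
```

Running the same comparison on myFQ would show whether the sequences and quality scores line up.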

Part III: VCF Imports

myVCF <- read.table('TwoVariants.vcf')
names(myVCF) <- c("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "__NONE__")
head(myVCF)

##               CHROM POS ID REF ALT QUAL FILTER                 INFO
## 1 NZ_BCYL01000006.1  29  .   A   G    .      .  AC=84;AF=1.0;SB=0.0
## 2 NZ_BCYL01000006.1 145  .   A   G    .      . AC=114;AF=1.0;SB=0.0
##           FORMAT                   __NONE__
## 1 GT:AC:AF:SB:NC  1:84:1.0:0.0:+G=37,-G=47,
## 2 GT:AC:AF:SB:NC 1:114:1.0:0.0:+G=42,-G=72,

str(myVCF)

## 'data.frame':    2 obs. of  10 variables:
##  $ CHROM   : chr  "NZ_BCYL01000006.1" "NZ_BCYL01000006.1"
##  $ POS     : int  29 145
##  $ ID      : chr  "." "."
##  $ REF     : chr  "A" "A"
##  $ ALT     : chr  "G" "G"
##  $ QUAL    : chr  "." "."
##  $ FILTER  : chr  "." "."
##  $ INFO    : chr  "AC=84;AF=1.0;SB=0.0" "AC=114;AF=1.0;SB=0.0"
##  $ FORMAT  : chr  "GT:AC:AF:SB:NC" "GT:AC:AF:SB:NC"
##  $ __NONE__: chr  "1:84:1.0:0.0:+G=37,-G=47," "1:114:1.0:0.0:+G=42,-G=72,"

# read.table() reads the file into a data frame. By default it treats "#" as a comment character, so the VCF header lines, which all begin with "#", are skipped.
# The names() call assigns the column names given in quotation marks to the data frame.
# head() prints the first rows of the data frame; here that is both variants.
# str() summarizes the data frame: 2 observations of 10 variables, with the type of each column and its values for the two observations.
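
Because read.table() drops the "#" lines, the INFO column arrives as a single string per variant. The sketch below, using base R only, shows the header lines being skipped and one possible way to split INFO into its key=value pairs. The helper parse_info() is hypothetical, and the input is a shortened, tab-separated copy of the two variants above with an invented header.

```r
# A small in-memory VCF: two header lines followed by two variant records.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "NZ_BCYL01000006.1\t29\t.\tA\tG\t.\t.\tAC=84;AF=1.0;SB=0.0",
  "NZ_BCYL01000006.1\t145\t.\tA\tG\t.\t.\tAC=114;AF=1.0;SB=0.0"
)

# Lines starting with "#" are treated as comments and skipped by default.
toy_vcf <- read.table(text = vcf_lines, sep = "\t", stringsAsFactors = FALSE)
names(toy_vcf) <- c("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

# Hypothetical helper: split "AC=84;AF=1.0;SB=0.0" into a named character vector.
parse_info <- function(info) {
  pairs <- strsplit(strsplit(info, ";", fixed = TRUE)[[1]], "=", fixed = TRUE)
  values <- vapply(pairs, `[`, character(1), 2)
  names(values) <- vapply(pairs, `[`, character(1), 1)
  values
}

# One row per variant, one column per INFO key.
info <- t(sapply(toy_vcf$INFO, parse_info, USE.NAMES = FALSE))
allele_counts <- as.numeric(info[, "AC"])

print(toy_vcf)
print(info)
print(allele_counts)
```

This assumes every record carries the same INFO keys in the same order, which holds for the two variants here but not for VCF files in general.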

BIN501_Project_Week_1

Jourdan Hourican

2024-09-02
