The goal of this R lab is to read in a file from the American Gut Project and clean it up into a more human-readable output.

Loading the libraries

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("knitr")

Reading in the .txt file from the American Gut Project and checking the format

biosample <- read.table("biosample_result (1).txt", sep="\t", header=FALSE, fill=TRUE, quote="")
head(biosample)
##                                                         V1        V2
## 1                 1: american gut project; 10317.X00185902          
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877          
## 3                           Organism: human gut metagenome          
## 4                                              Attributes:          
## 5                                           /ENA-CHECKLIST ERC000011
## 6                                        /ENA-FIRST-PUBLIC 4/28/2023
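The read step above relies on two read.table() arguments: fill=TRUE, which pads rows that have fewer fields than the widest row, and quote="", which turns off quote handling so apostrophes in free-text answers are read literally. A minimal sketch of that behaviour, using a hypothetical three-line sample (the `sample_lines` and `demo` names and the `host_comment` row are made up for illustration and do not come from the real file):

```r
# Hypothetical sample imitating the layout of the export:
# one header-style line with no tab, then two tab-separated attribute lines.
sample_lines <- c(
  "1: american gut project; 10317.X00185902",
  "    /ENA-CHECKLIST\tERC000011",
  "    /host_comment\tdoesn't apply"
)

# Same arguments as the real call, but reading from text instead of a file.
demo <- read.table(text = sample_lines, sep = "\t", header = FALSE,
                   fill = TRUE, quote = "", stringsAsFactors = FALSE)

# Two columns are expected: the first line has no second field,
# so fill = TRUE pads it instead of raising an error.
str(demo)
```

Without quote="", the apostrophe in "doesn't" would be treated as an opening quote and could swallow the text that follows it.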

Starting to clean the data by removing the slashes (/) used as separators in front of each attribute name. Note that sub() replaces only the first match in each value.

biosample <- biosample %>%
  mutate(V1=sub("/", "", V1))
head(biosample)
##                                                         V1        V2
## 1                 1: american gut project; 10317.X00185902          
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877          
## 3                           Organism: human gut metagenome          
## 4                                              Attributes:          
## 5                                            ENA-CHECKLIST ERC000011
## 6                                         ENA-FIRST-PUBLIC 4/28/2023

Cleaning up the data frame slightly more by removing the leading spaces in front of each attribute name

biosample <- biosample %>%
  mutate(V1=sub("    ", "", V1))
head(biosample)
##                                                         V1        V2
## 1                 1: american gut project; 10317.X00185902          
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877          
## 3                           Organism: human gut metagenome          
## 4                                              Attributes:          
## 5                                            ENA-CHECKLIST ERC000011
## 6                                         ENA-FIRST-PUBLIC 4/28/2023
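The two sub() calls above work on this file, but they match the first slash or the first run of four spaces anywhere in a value. A slightly more defensive variant anchors the pattern to the start of the string. This is a sketch on hypothetical values (the `raw` vector, including the "Note: a/b" entry, is invented for illustration), not the method used in the lab:

```r
# Hypothetical values imitating the first column before cleaning.
raw <- c("    /ENA-CHECKLIST",
         "    /acid_reflux",
         "Organism: human gut metagenome",
         "Note: a/b")

# "^\\s*/" matches optional leading whitespace followed by one slash,
# but only at the very start of the string.
cleaned <- sub("^\\s*/", "", raw)

# trimws() removes any remaining leading or trailing whitespace.
cleaned <- trimws(cleaned)

print(cleaned)

# For comparison, the unanchored pattern also removes a slash inside a value.
unanchored <- sub("/", "", raw)
print(unanchored[4])
```

The anchored version leaves "Note: a/b" untouched, while the unanchored one turns it into "Note: ab".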

Creating new headers that reflect questions and answers, and applying them to the biosample data set

header <- c("Survey Input/Question", "Participant Information/Answer")
colnames(biosample) <- header
head(biosample)
##                                      Survey Input/Question
## 1                 1: american gut project; 10317.X00185902
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877
## 3                           Organism: human gut metagenome
## 4                                              Attributes:
## 5                                            ENA-CHECKLIST
## 6                                         ENA-FIRST-PUBLIC
##   Participant Information/Answer
## 1                               
## 2                               
## 3                               
## 4                               
## 5                      ERC000011
## 6                      4/28/2023
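Because the new headers contain spaces and a slash, they are not syntactic R names, so later lookups need [[ ]] with a string or backticks after $. A small sketch on a hypothetical two-row data frame (`demo` and its values are stand-ins for illustration, not the real data):

```r
# Hypothetical two-row stand-in for the cleaned biosample data frame.
demo <- data.frame(V1 = c("ENA-CHECKLIST", "age_cat"),
                   V2 = c("ERC000011", "30s"),
                   stringsAsFactors = FALSE)

header <- c("Survey Input/Question", "Participant Information/Answer")
colnames(demo) <- header

# Columns with non-syntactic names are reached with [[ ]] or backticks.
questions <- demo[["Survey Input/Question"]]
answers <- demo$`Participant Information/Answer`

# Look up the answer recorded for one question.
age <- answers[questions == "age_cat"]
print(age)
```

Shorter names such as `question` and `answer` would avoid the backticks, at the cost of less descriptive headers in the final table.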

Looking at the first 25 lines of the data frame to see the format

kable(head(biosample, 25))
| Survey Input/Question | Participant Information/Answer |
|:---|:---|
| 1: american gut project; 10317.X00185902 | |
| Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877 | |
| Organism: human gut metagenome | |
| Attributes: | |
| ENA-CHECKLIST | ERC000011 |
| ENA-FIRST-PUBLIC | 4/28/2023 |
| ENA-LAST-UPDATE | 4/28/2023 |
| External Id | SAMEA112990854 |
| INSDC center alias | UCSDMI |
| INSDC center name | University of California San Diego Microbiome Initiative |
| INSDC first public | 2023-04-28T16:20:26Z |
| INSDC last update | 2023-04-28T16:20:26Z |
| INSDC status | public |
| Submitter Id | qiita_sid_10317:10317.X00185902 |
| acid_reflux | i do not have this condition |
| acne_medication | no |
| acne_medication_otc | no |
| add_adhd | i do not have this condition |
| age_cat | 30s |
| alcohol_consumption | yes |
| alcohol_frequency | occasionally (1-2 times/week) |
| alcohol_types_beercider | TRUE |
| alcohol_types_red_wine | TRUE |
| alcohol_types_sour_beers | TRUE |
| alcohol_types_spiritshard_alcohol | TRUE |

Writing the cleaned data frame to a CSV file so it is easier to view

write.csv(biosample, "Biosample.csv")
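By default, write.csv() also writes the row names as an unnamed first column. If that extra column is not wanted, row.names = FALSE leaves it out. The sketch below shows the option on a hypothetical two-row data frame written to a temporary file (`demo`, `out_path`, and `round_trip` are illustrative names, and the temporary path stands in for "Biosample.csv"):

```r
# Hypothetical two-row stand-in for the cleaned biosample data frame.
demo <- data.frame(a = c("ENA-CHECKLIST", "age_cat"),
                   b = c("ERC000011", "30s"),
                   stringsAsFactors = FALSE)
colnames(demo) <- c("Survey Input/Question", "Participant Information/Answer")

# Write to a temporary file so the example does not touch the working directory.
out_path <- tempfile(fileext = ".csv")
write.csv(demo, out_path, row.names = FALSE)

# Read it back; check.names = FALSE keeps the headers with spaces and slashes
# instead of converting them to syntactic names.
round_trip <- read.csv(out_path, check.names = FALSE, stringsAsFactors = FALSE)
print(colnames(round_trip))
```

Reading the file back is a quick way to confirm that the headers and the number of columns survived the export.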