The goal of this R lab is to read in a file from the American Gut Project and clean it up into a more human-readable output.
Loading the libraries
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("knitr")
Reading in the .txt file from the American Gut Project and checking the format
biosample <- read.table("biosample_result (1).txt", sep="\t", header=FALSE, fill=TRUE, quote="")
head(biosample)
## V1 V2
## 1 1: american gut project; 10317.X00185902
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877
## 3 Organism: human gut metagenome
## 4 Attributes:
## 5 /ENA-CHECKLIST ERC000011
## 6 /ENA-FIRST-PUBLIC 4/28/2023
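To see why the import produces the two columns V1 and V2, here is a minimal sketch that runs the same read.table() arguments on a small temporary file. The three sample lines are hypothetical and only modeled on the head() output above; the exact whitespace of the real file is an assumption.

```r
# Hypothetical sample modeled on the head() output above;
# the real file's exact whitespace is an assumption.
sample_lines <- c(
  "Organism: human gut metagenome",
  "Attributes:",
  "    /ENA-CHECKLIST\tERC000011"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# sep = "\t" splits each line on tabs, fill = TRUE pads rows that have
# fewer fields, and quote = "" treats quote characters as literal text.
demo <- read.table(tmp, sep = "\t", header = FALSE, fill = TRUE, quote = "",
                   stringsAsFactors = FALSE)
str(demo)
```

Lines without a tab end up entirely in V1 with an empty V2, which matches the first four rows shown above.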
Starting to clean the data by removing the slashes (/) used as separators in front of the attribute names (sub() removes only the first slash in each value)
biosample <- biosample %>%
  mutate(V1 = sub("/", "", V1))
head(biosample)
## V1 V2
## 1 1: american gut project; 10317.X00185902
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877
## 3 Organism: human gut metagenome
## 4 Attributes:
## 5 ENA-CHECKLIST ERC000011
## 6 ENA-FIRST-PUBLIC 4/28/2023
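The call sub("/", "", V1) removes the first slash wherever it occurs in a value. That is fine when the only slash is the leading one, but it would also change a value that contains a slash elsewhere. A hedged alternative is to anchor the pattern to the start of the string. The vector below is made up for illustration and is not taken from the real file.

```r
# Hypothetical values: an attribute name with a leading slash and a
# date-like string that contains slashes of its own.
x <- c("    /ENA-CHECKLIST", "4/28/2023")

# Unanchored: removes the first slash anywhere in each string.
first_only <- sub("/", "", x)

# Anchored: removes a slash only when it follows optional leading
# whitespace at the start; the whitespace itself is kept via \\1.
anchored <- sub("^(\\s*)/", "\\1", x)

print(first_only)
print(anchored)
```

In this lab the pattern is applied to V1 only, so the anchored form matters only if an attribute name itself contains a slash.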
Cleaning up the data frame a little more by removing the leading space in front of each attribute name
biosample <- biosample %>%
  mutate(V1 = sub(" ", "", V1))
head(biosample)
## V1 V2
## 1 1: american gut project; 10317.X00185902
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877
## 3 Organism: human gut metagenome
## 4 Attributes:
## 5 ENA-CHECKLIST ERC000011
## 6 ENA-FIRST-PUBLIC 4/28/2023
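Note that sub(" ", "", V1) removes exactly one space per value, and that space does not have to be at the start. If the attribute names are indented by more than one space, or if other rows contain spaces in the middle, base R's trimws() may be a safer choice. The sketch below uses made-up strings; the four-space indent is an assumption about the file.

```r
# Hypothetical values: an indented attribute name and a header-style line.
x <- c("    ENA-CHECKLIST", "Organism: human gut metagenome")

# Removes only the first space found in each string.
one_space <- sub(" ", "", x)

# Removes all leading whitespace and leaves inner spaces untouched.
trimmed <- trimws(x, which = "left")

print(one_space)
print(trimmed)
```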
Creating new headers that reflect the questions and answers, and applying them to the biosample data frame
header <- c("Survey Input/Question", "Participant Information/Answer")
colnames(biosample) <- header
head(biosample)
## Survey Input/Question
## 1 1: american gut project; 10317.X00185902
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877
## 3 Organism: human gut metagenome
## 4 Attributes:
## 5 ENA-CHECKLIST
## 6 ENA-FIRST-PUBLIC
## Participant Information/Answer
## 1
## 2
## 3
## 4
## 5 ERC000011
## 6 4/28/2023
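The new column names contain spaces and a slash, so they are not syntactic R names. The sketch below shows two ways to reach such columns afterwards. The small data frame is a stand-in built from two rows of this sample, not the full biosample object.

```r
# Stand-in data frame with two attribute rows from this sample.
demo <- data.frame(
  V1 = c("age_cat", "alcohol_consumption"),
  V2 = c("30s", "yes"),
  stringsAsFactors = FALSE
)
colnames(demo) <- c("Survey Input/Question", "Participant Information/Answer")

# Non-syntactic names need [[ ]] with a string, or backticks with $.
answers <- demo[["Participant Information/Answer"]]
questions <- demo$`Survey Input/Question`

print(answers)
print(questions)
```

Descriptive headers like these read well in a table, at the cost of needing quotes or backticks in later code.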
Looking at the first 25 lines of the data frame to check the format
kable(head(biosample, 25))
| Survey Input/Question | Participant Information/Answer |
|---|---|
| 1: american gut project; 10317.X00185902 | |
| Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877 | |
| Organism: human gut metagenome | |
| Attributes: | |
| ENA-CHECKLIST | ERC000011 |
| ENA-FIRST-PUBLIC | 4/28/2023 |
| ENA-LAST-UPDATE | 4/28/2023 |
| External Id | SAMEA112990854 |
| INSDC center alias | UCSDMI |
| INSDC center name | University of California San Diego Microbiome Initiative |
| INSDC first public | 2023-04-28T16:20:26Z |
| INSDC last update | 2023-04-28T16:20:26Z |
| INSDC status | public |
| Submitter Id | qiita_sid_10317:10317.X00185902 |
| acid_reflux | i do not have this condition |
| acne_medication | no |
| acne_medication_otc | no |
| add_adhd | i do not have this condition |
| age_cat | 30s |
| alcohol_consumption | yes |
| alcohol_frequency | occasionally (1-2 times/week) |
| alcohol_types_beercider | TRUE |
| alcohol_types_red_wine | TRUE |
| alcohol_types_sour_beers | TRUE |
| alcohol_types_spiritshard_alcohol | TRUE |
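
The first four rows of the table have an empty answer because they describe the record itself rather than a survey item. One possible follow-up, sketched here on a made-up four-row stand-in, is to split the rows on whether the answer column is empty. The names has_answer, record_info and survey_rows are hypothetical.

```r
# Stand-in with two record-level lines and two survey rows.
demo <- data.frame(
  question = c("Organism: human gut metagenome", "Attributes:",
               "age_cat", "alcohol_consumption"),
  answer = c("", "", "30s", "yes"),
  stringsAsFactors = FALSE
)
colnames(demo) <- c("Survey Input/Question", "Participant Information/Answer")

# TRUE where the answer column holds a value.
has_answer <- demo[["Participant Information/Answer"]] != ""

record_info <- demo[!has_answer, ]  # record-level lines without an answer
survey_rows <- demo[has_answer, ]   # attribute rows with an answer

print(survey_rows)
```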
Writing the cleaned data frame to a CSV file so that it is easier to view
write.csv(biosample, "Biosample.csv")
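By default write.csv() also writes the row names as an unnamed first column. If that extra column is not wanted, row.names = FALSE leaves it out. The sketch below writes a small stand-in data frame to a temporary file and reads it back to check the result. The file path and the stand-in data are assumptions for illustration.

```r
# Stand-in data frame with the same headers as the cleaned biosample.
demo <- data.frame(
  V1 = c("age_cat", "alcohol_consumption"),
  V2 = c("30s", "yes"),
  stringsAsFactors = FALSE
)
colnames(demo) <- c("Survey Input/Question", "Participant Information/Answer")

# Write without the row-name column, then read the file back.
out <- tempfile(fileext = ".csv")
write.csv(demo, out, row.names = FALSE)

# check.names = FALSE keeps the headers with spaces and slashes as written.
back <- read.csv(out, check.names = FALSE, stringsAsFactors = FALSE)
print(colnames(back))
```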