Cleaning Textual Data

The goal of this R lab is to read in a file from the American Gut Project and clean it up into a more human-readable output.

Loading the libraries

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("knitr")

Reading in the .txt file from the American Gut Project and checking the format

biosample <- read.table("biosample_result (1).txt", sep="\t", header=FALSE, fill=TRUE, quote="")
head(biosample)

##                                                         V1        V2
## 1                 1: american gut project; 10317.X00185902          
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877          
## 3                           Organism: human gut metagenome          
## 4                                              Attributes:          
## 5                                           /ENA-CHECKLIST ERC000011
## 6                                        /ENA-FIRST-PUBLIC 4/28/2023
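
The arguments passed to read.table matter for a file like this one, where most lines hold a single field and only the attribute lines hold two. The sketch below is a minimal illustration on an in-memory sample, not the lab's actual file: the three lines are made up to resemble the output above, and the four-space indent before the slash is an assumption about the raw file. It shows fill=TRUE padding the short rows with blank fields and quote="" leaving the text untouched.

```r
# Hypothetical three-line sample modeled on the output above.
# The "    /" prefix on the attribute line is an assumption about the raw file.
sample_text <- "Organism: human gut metagenome\nAttributes:\n    /ENA-CHECKLIST\tERC000011"

demo <- read.table(text = sample_text, sep = "\t", header = FALSE,
                   fill = TRUE,   # pad rows that have fewer fields than the widest row
                   quote = "",    # treat quote characters as ordinary text
                   stringsAsFactors = FALSE)

# Two columns are created; rows with one field get an empty second field
str(demo)
```

Because sep is a tab and strip.white is left at its default, any leading spaces in a field are kept, which is why the later cleaning steps are needed.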

Starting to clean the data by removing the slashes (/) used as separators in front of the attribute names

biosample <- biosample %>%
  mutate(V1=sub("/", "", V1))
head(biosample)

##                                                         V1        V2
## 1                 1: american gut project; 10317.X00185902          
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877          
## 3                           Organism: human gut metagenome          
## 4                                              Attributes:          
## 5                                            ENA-CHECKLIST ERC000011
## 6                                         ENA-FIRST-PUBLIC 4/28/2023
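
One detail worth keeping in mind is that sub() replaces only the first match in each string, wherever it occurs. The sketch below uses made-up values (the "collected:" line is hypothetical and does not come from the file) to compare the unanchored pattern with an anchored alternative that only touches a slash at the start of the string, after any indentation.

```r
# Made-up values; the second one is hypothetical and only shows a slash mid-string
x <- c("    /ENA-CHECKLIST", "collected: 4/28/2023", "Organism: human gut metagenome")

# Unanchored: removes the first slash found anywhere in each string
first_only <- sub("/", "", x)

# Anchored: removes a slash only when it follows optional leading whitespace,
# and puts the captured whitespace back via \\1
anchored <- sub("^(\\s*)/", "\\1", x)

print(first_only)
print(anchored)
```

For the rows shown in this lab both patterns give the same result; the anchored form is simply a safer choice if a later row happens to contain a slash inside its text.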

Cleaning up the data frame slightly more by removing the spaces in front of each attribute name

biosample <- biosample %>%
  mutate(V1=sub("    ", "", V1))
head(biosample)

##                                                         V1        V2
## 1                 1: american gut project; 10317.X00185902          
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877          
## 3                           Organism: human gut metagenome          
## 4                                              Attributes:          
## 5                                            ENA-CHECKLIST ERC000011
## 6                                         ENA-FIRST-PUBLIC 4/28/2023
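
The pattern of four literal spaces works when every indented row uses exactly that indent. As a hedged alternative, base R's trimws() removes leading whitespace of any length. The sketch below compares the two on made-up values; the two-space indent is hypothetical and is included only to show the difference.

```r
# Made-up values; the two-space indent is hypothetical
x <- c("    ENA-CHECKLIST", "  two_space_indent", "Organism: human gut metagenome")

# Removes the first run of exactly four spaces, if one exists
four_spaces <- sub("    ", "", x)

# Removes all leading whitespace, whatever its length
left_trimmed <- trimws(x, which = "left")

print(four_spaces)
print(left_trimmed)
```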

Creating new headers to reflect questions and answers, then applying them to the biosample data set

header <- c("Survey Input/Question", "Participant Information/Answer")
colnames(biosample) <- header
head(biosample)

##                                      Survey Input/Question
## 1                 1: american gut project; 10317.X00185902
## 2 Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877
## 3                           Organism: human gut metagenome
## 4                                              Attributes:
## 5                                            ENA-CHECKLIST
## 6                                         ENA-FIRST-PUBLIC
##   Participant Information/Answer
## 1                               
## 2                               
## 3                               
## 4                               
## 5                      ERC000011
## 6                      4/28/2023
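
Headers that contain spaces and a slash are easy to read but are not syntactic R names, so they need backticks or [[ ]] when a column is referenced later. A minimal sketch on a small made-up data frame (the two rows are copied from values shown in this lab, the object name demo is hypothetical):

```r
# Small stand-in for the biosample data frame
demo <- data.frame(V1 = c("ENA-CHECKLIST", "age_cat"),
                   V2 = c("ERC000011", "30s"),
                   stringsAsFactors = FALSE)
colnames(demo) <- c("Survey Input/Question", "Participant Information/Answer")

# Backticks are required with $ because the name contains spaces and a slash
questions <- demo$`Survey Input/Question`

# [[ ]] takes the name as an ordinary string
answers <- demo[["Participant Information/Answer"]]

# Look up the answer for one question
age_answer <- answers[questions == "age_cat"]
print(age_answer)
```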

Looking at the first 25 lines of the data frame to check the format

kable(head(biosample, 25))

|Survey Input/Question |Participant Information/Answer |
|:---|:---|
|1: american gut project; 10317.X00185902 | |
|Identifiers: BioSample: SAMEA112990854; SRA: ERS14985877 | |
|Organism: human gut metagenome | |
|Attributes: | |
|ENA-CHECKLIST |ERC000011 |
|ENA-FIRST-PUBLIC |4/28/2023 |
|ENA-LAST-UPDATE |4/28/2023 |
|External Id |SAMEA112990854 |
|INSDC center alias |UCSDMI |
|INSDC center name |University of California San Diego Microbiome Initiative |
|INSDC first public |2023-04-28T16:20:26Z |
|INSDC last update |2023-04-28T16:20:26Z |
|INSDC status |public |
|Submitter Id |qiita_sid_10317:10317.X00185902 |
|acid_reflux |i do not have this condition |
|acne_medication |no |
|acne_medication_otc |no |
|add_adhd |i do not have this condition |
|age_cat |30s |
|alcohol_consumption |yes |
|alcohol_frequency |occasionally (1-2 times/week) |
|alcohol_types_beercider |TRUE |
|alcohol_types_red_wine |TRUE |
|alcohol_types_sour_beers |TRUE |
|alcohol_types_spiritshard_alcohol |TRUE |

Writing the cleaned data frame to a CSV file so it is easier to view

write.csv(biosample, "Biosample.csv")
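
By default write.csv() also writes the row names as an unnamed first column. If that extra column is not wanted, row.names = FALSE leaves it out. The sketch below is a hedged illustration on a small made-up data frame written to a temporary file (the names demo, out_path and round_trip are hypothetical), followed by a read-back to check the result.

```r
# Small stand-in data frame using values shown in this lab
demo <- data.frame(question = c("age_cat", "alcohol_consumption"),
                   answer = c("30s", "yes"),
                   stringsAsFactors = FALSE)

# Write to a temporary file so nothing in the working directory is overwritten
out_path <- tempfile(fileext = ".csv")
write.csv(demo, out_path, row.names = FALSE)  # omit the row-number column

# Read the file back to confirm only the two data columns were written
round_trip <- read.csv(out_path, stringsAsFactors = FALSE)
print(names(round_trip))
```

When the headers contain spaces or slashes, as they do in this lab, read.csv() rewrites them into syntactic names unless check.names = FALSE is passed.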

Lenna Wolffe

2024-09-30