Data Import Considerations
Prologue
- Before you can manipulate a data set in R, you first need to import it as a data frame.
- This is unlike working in a Mac or Linux terminal, where command-line tools can manipulate the data file directly.
- In this session, I will share some quirks I encountered during data import. The list is, of course, not exhaustive. At the same time, I hope it will not grow significantly longer over time!
- The examples here apply primarily to importing data from tab-delimited or comma-separated files.
Zero prefix
- Some values may have leading zeros.
- This is especially common for sample/patient/participant IDs, e.g. 00001, 00002 etc.
- With the default arguments, these leading zeros are lost on import.
df <- read.table(file="example_dataset_1.txt", header=TRUE, sep="\t")
df
## PersonID Height..m. Weight..kg.
## 1 1 1.75 67
## 2 2 1.55 88
## 3 3 1.42 100
## 4 4 0.97 40
## 5 5 1.96 66
class(df$PersonID)
## [1] "integer"
- R automatically converts this ‘number stored as text’ (in Excel lingo) to integer.
- We need to explicitly specify the PersonID column as character using the colClasses argument.
- An unnamed colClasses vector is applied by position (and recycled if it is too short), so it effectively sets the class of every column. A named vector, e.g. c(PersonID="character"), sets the class for just that column and leaves the others to be detected automatically.
df <- read.table(file="example_dataset_1.txt", header=TRUE, sep="\t", colClasses=c(PersonID="character"))
df
## PersonID Height..m. Weight..kg.
## 1 00001 1.75 67
## 2 00002 1.55 88
## 3 00003 1.42 100
## 4 00004 0.97 40
## 5 00005 1.96 66
class(df$PersonID)
## [1] "character"
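As a self-contained sketch of the same idea, the snippet below writes a small tab-delimited file to a temporary location and reads it twice: once with the default arguments and once with a named colClasses vector. The file contents and the names tmp, df_default and df_named are made up for illustration and are not part of the example data set above.

```r
# Write a small tab-delimited file to a temporary path (illustrative data only).
tmp <- tempfile(fileext = ".txt")
writeLines(c("PersonID\tHeight\tWeight",
             "00001\t1.75\t67",
             "00002\t1.55\t88"),
           con = tmp)

# Default import: PersonID is guessed to be integer, so the leading zeros are dropped.
df_default <- read.table(file = tmp, header = TRUE, sep = "\t")
class(df_default$PersonID)

# Named colClasses: only PersonID is forced to character;
# the remaining columns are still detected automatically.
df_named <- read.table(file = tmp, header = TRUE, sep = "\t",
                       colClasses = c(PersonID = "character"))
df_named$PersonID
sapply(df_named, class)

unlink(tmp)  # remove the temporary file
```

With the named vector, Height and Weight stay numeric, so they can still be used in calculations without a further conversion step.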
Character strings
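- In R versions before 4.0.0, read.table converts character columns into factors by default; from R 4.0.0 onwards the default is stringsAsFactors=FALSE. The output below is from a pre-4.0.0 version.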
df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="", comment.char="")
df2
## Name
## 1 Smile Dental Centre
## 2 Obsessive Fitness Gym
## 3 Hoarders Club
## 4 Lee Restaurant
## 5 Pickle Farm Produce
## Address Ratings
## 1 "Grim Reaper Shopping Mall, Lot # 09-08, 2nd Floor, Florida" 9.0
## 2 "Lot # 1000, Grand Kingston Road, United Kingdom" 8.5
## 3 "Level 6, Marching Square, California" 7.0
## 4 "Grand Station Road, # G01-55" 6.5
## 5 "# L1-22, Fresh Market, Malaysia" 5.5
class(df2$Name)
## [1] "factor"
- In my experience, the character class is more convenient for data manipulation such as subsetting or merging data frames.
- Using the stringsAsFactors=FALSE argument prevents R from converting character columns into factors.
- Nevertheless, the factor class is particularly useful for statistical analysis and potentially in other situations, such as ordering data frames.
df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="", comment.char="", stringsAsFactors=FALSE)
df2
## Name
## 1 Smile Dental Centre
## 2 Obsessive Fitness Gym
## 3 Hoarders Club
## 4 Lee Restaurant
## 5 Pickle Farm Produce
## Address Ratings
## 1 "Grim Reaper Shopping Mall, Lot # 09-08, 2nd Floor, Florida" 9.0
## 2 "Lot # 1000, Grand Kingston Road, United Kingdom" 8.5
## 3 "Level 6, Marching Square, California" 7.0
## 4 "Grand Station Road, # G01-55" 6.5
## 5 "# L1-22, Fresh Market, Malaysia" 5.5
class(df2$Name)
## [1] "character"
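The sketch below illustrates both points on a made-up, self-contained file: text columns are imported as character for easy subsetting, and one column is then converted to a factor with explicit levels when a custom ordering is needed. The data, the Size column and the names tmp and shops are hypothetical and only serve the illustration.

```r
# Write a small tab-delimited file to a temporary path (illustrative data only).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Name\tSize",
             "Hoarders Club\tlarge",
             "Lee Restaurant\tsmall",
             "Pickle Farm Produce\tmedium"),
           con = tmp)

# Keep text columns as character, regardless of the R version's default.
shops <- read.table(file = tmp, header = TRUE, sep = "\t",
                    quote = "", comment.char = "", stringsAsFactors = FALSE)
class(shops$Name)

# Character columns can be compared and subset directly.
shops[shops$Name == "Lee Restaurant", ]

# Convert to a factor only where it helps, e.g. to sort by a custom level order
# rather than alphabetically.
shops$Size <- factor(shops$Size, levels = c("small", "medium", "large"))
shops[order(shops$Size), "Name"]

unlink(tmp)  # remove the temporary file
```

Setting stringsAsFactors explicitly, rather than relying on the default, keeps the import behaving the same way across R versions.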