Data Import Considerations

Prologue

  • Importing data frames into R is a requirement before you can further manipulate your data frame in R.
  • This is unlike Mac or Linux operating system where you can manipulate your data frame directly using the command-line interface, i.e. terminal.
  • In this session, I will share some quirks I encountered during data import. The list is, of course, not exhaustive. At the same time, I hope it will not grow significantly longer over time!
  • The examples here are primarily applicable for importing data from tab-delimited or comma-separated files.

Zero prefix

  • Some elements may have 0’s as their prefix.
  • This is especially true for sample/patient/participant ID’s, e.g. 00001, 00002 etc.
  • By using the default arguments, you will lose these 0’s.
df <- read.table(file="example_dataset_1.txt", header=TRUE, sep="\t")
df
##   PersonID Height..m. Weight..kg.
## 1        1       1.75          67
## 2        2       1.55          88
## 3        3       1.42         100
## 4        4       0.97          40
## 5        5       1.96          66
class(df$PersonID)
## [1] "integer"
  • R automatically converts these ‘number text’ (in Excel lingo) to integer.
  • We need to explicitly specify the PersonID column as characters using the colClasses argument.
  • The catch here is that we need to specify the classes for all columns even if we only originally intend to specify the class for just one column.
df <- read.table(file="example_dataset_1.txt", header=TRUE, sep="\t", colClasses=c("character", "character", "character"))
df
##   PersonID Height..m. Weight..kg.
## 1    00001       1.75          67
## 2    00002       1.55          88
## 3    00003       1.42         100
## 4    00004       0.97          40
## 5    00005       1.96          66
class(df$PersonID)
## [1] "character"

Hashtags

  • I encountered a peculiar problem when I imported a data frame which has hashtags (#) in some of the rows. That particular data frame has # in the Address column. For example, “# 05-01 Livingston Road”.
  • This caused all data after the # to not appear in R.
  • This is because, by default, # symbol and everything thereafter are considered as comment and will be ignored during the data import process. This is especially true when the arguments fill=TRUE and quote=“” are used.
  • The arguments fill=TRUE and quote=“” are especially useful for files that have many blanks across multiple columns. Just imagine an excel file in which cells with missing values are left completely blank.
df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="")
df2
##                    Name                                Address Ratings
## 1   Smile Dental Centre       "Grim Reaper Shopping Mall, Lot       NA
## 2 Obsessive Fitness Gym                                  "Lot       NA
## 3         Hoarders Club "Level 6, Marching Square, California"       7
## 4        Lee Restaurant                  "Grand Station Road,       NA
## 5   Pickle Farm Produce                                      "      NA
  • By default the comment.char argument specifies # to be a comment indicator.
  • To indicate that the # in our case is not a comment indicator, rather it is part our data, simply set the comment.char to something other than the #. In this case, we replace the default with blank.
df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="", comment.char="")
df2
##                    Name
## 1   Smile Dental Centre
## 2 Obsessive Fitness Gym
## 3         Hoarders Club
## 4        Lee Restaurant
## 5   Pickle Farm Produce
##                                                        Address Ratings
## 1 "Grim Reaper Shopping Mall, Lot # 09-08, 2nd Floor, Florida"     9.0
## 2            "Lot # 1000, Grand Kingston Road, United Kingdom"     8.5
## 3                       "Level 6, Marching Square, California"     7.0
## 4                               "Grand Station Road, # G01-55"     6.5
## 5                            "# L1-22, Fresh Market, Malaysia"     5.5

Character strings

  • By default, R will convert vectors of character strings into factors.
df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="", comment.char="")
df2
##                    Name
## 1   Smile Dental Centre
## 2 Obsessive Fitness Gym
## 3         Hoarders Club
## 4        Lee Restaurant
## 5   Pickle Farm Produce
##                                                        Address Ratings
## 1 "Grim Reaper Shopping Mall, Lot # 09-08, 2nd Floor, Florida"     9.0
## 2            "Lot # 1000, Grand Kingston Road, United Kingdom"     8.5
## 3                       "Level 6, Marching Square, California"     7.0
## 4                               "Grand Station Road, # G01-55"     6.5
## 5                            "# L1-22, Fresh Market, Malaysia"     5.5
class(df2$Name)
## [1] "factor"
  • From my experience, character class is more feasible for data manipulation such as subsetting data frames, merging data frames etc.
  • Use the stringsAsFactors=FALSE argument prevents R from converting vectors of character strings into factors.
  • Nevertheless, factor class is particular useful during statistical analysis and potentially other instances such as ordering data frames.
df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="", comment.char="", stringsAsFactors=FALSE)
df2
##                    Name
## 1   Smile Dental Centre
## 2 Obsessive Fitness Gym
## 3         Hoarders Club
## 4        Lee Restaurant
## 5   Pickle Farm Produce
##                                                        Address Ratings
## 1 "Grim Reaper Shopping Mall, Lot # 09-08, 2nd Floor, Florida"     9.0
## 2            "Lot # 1000, Grand Kingston Road, United Kingdom"     8.5
## 3                       "Level 6, Marching Square, California"     7.0
## 4                               "Grand Station Road, # G01-55"     6.5
## 5                            "# L1-22, Fresh Market, Malaysia"     5.5
class(df2$Name)
## [1] "character"