Data Import Considerations

Prologue

Importing data frames into R is a requirement before you can further manipulate your data frame in R.
This is unlike Mac or Linux operating system where you can manipulate your data frame directly using the command-line interface, i.e. terminal.
In this session, I will share some quirks I encountered during data import. The list is, of course, not exhaustive. At the same time, I hope it will not grow significantly longer over time!
The examples here are primarily applicable for importing data from tab-delimited or comma-separated files.

Zero prefix

Some elements may have 0’s as their prefix.
This is especially true for sample/patient/participant ID’s, e.g. 00001, 00002 etc.
By using the default arguments, you will lose these 0’s.

df <- read.table(file="example_dataset_1.txt", header=TRUE, sep="\t")
df

##   PersonID Height..m. Weight..kg.
## 1        1       1.75          67
## 2        2       1.55          88
## 3        3       1.42         100
## 4        4       0.97          40
## 5        5       1.96          66

class(df$PersonID)

## [1] "integer"

R automatically converts these ‘number text’ (in Excel lingo) to integer.
We need to explicitly specify the PersonID column as characters using the colClasses argument.
The catch here is that we need to specify the classes for all columns even if we only originally intend to specify the class for just one column.

df <- read.table(file="example_dataset_1.txt", header=TRUE, sep="\t", colClasses=c("character", "character", "character"))
df

##   PersonID Height..m. Weight..kg.
## 1    00001       1.75          67
## 2    00002       1.55          88
## 3    00003       1.42         100
## 4    00004       0.97          40
## 5    00005       1.96          66

class(df$PersonID)

## [1] "character"

Hashtags

I encountered a peculiar problem when I imported a data frame which has hashtags (#) in some of the rows. That particular data frame has # in the Address column. For example, “# 05-01 Livingston Road”.
This caused all data after the # to not appear in R.
This is because, by default, # symbol and everything thereafter are considered as comment and will be ignored during the data import process. This is especially true when the arguments fill=TRUE and quote=“” are used.
The arguments fill=TRUE and quote=“” are especially useful for files that have many blanks across multiple columns. Just imagine an excel file in which cells with missing values are left completely blank.

df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="")
df2

##                    Name                                Address Ratings
## 1   Smile Dental Centre       "Grim Reaper Shopping Mall, Lot       NA
## 2 Obsessive Fitness Gym                                  "Lot       NA
## 3         Hoarders Club "Level 6, Marching Square, California"       7
## 4        Lee Restaurant                  "Grand Station Road,       NA
## 5   Pickle Farm Produce                                      "      NA

By default the comment.char argument specifies # to be a comment indicator.
To indicate that the # in our case is not a comment indicator, rather it is part our data, simply set the comment.char to something other than the #. In this case, we replace the default with blank.

df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="", comment.char="")
df2

##                    Name
## 1   Smile Dental Centre
## 2 Obsessive Fitness Gym
## 3         Hoarders Club
## 4        Lee Restaurant
## 5   Pickle Farm Produce
##                                                        Address Ratings
## 1 "Grim Reaper Shopping Mall, Lot # 09-08, 2nd Floor, Florida"     9.0
## 2            "Lot # 1000, Grand Kingston Road, United Kingdom"     8.5
## 3                       "Level 6, Marching Square, California"     7.0
## 4                               "Grand Station Road, # G01-55"     6.5
## 5                            "# L1-22, Fresh Market, Malaysia"     5.5

Character strings

By default, R will convert vectors of character strings into factors.

df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="", comment.char="")
df2

##                    Name
## 1   Smile Dental Centre
## 2 Obsessive Fitness Gym
## 3         Hoarders Club
## 4        Lee Restaurant
## 5   Pickle Farm Produce
##                                                        Address Ratings
## 1 "Grim Reaper Shopping Mall, Lot # 09-08, 2nd Floor, Florida"     9.0
## 2            "Lot # 1000, Grand Kingston Road, United Kingdom"     8.5
## 3                       "Level 6, Marching Square, California"     7.0
## 4                               "Grand Station Road, # G01-55"     6.5
## 5                            "# L1-22, Fresh Market, Malaysia"     5.5

class(df2$Name)

## [1] "factor"

From my experience, character class is more feasible for data manipulation such as subsetting data frames, merging data frames etc.
Use the stringsAsFactors=FALSE argument prevents R from converting vectors of character strings into factors.
Nevertheless, factor class is particular useful during statistical analysis and potentially other instances such as ordering data frames.

df2 <- read.table(file="example_dataset_2.txt", header=TRUE, sep="\t", fill=TRUE, quote="", comment.char="", stringsAsFactors=FALSE)
df2

##                    Name
## 1   Smile Dental Centre
## 2 Obsessive Fitness Gym
## 3         Hoarders Club
## 4        Lee Restaurant
## 5   Pickle Farm Produce
##                                                        Address Ratings
## 1 "Grim Reaper Shopping Mall, Lot # 09-08, 2nd Floor, Florida"     9.0
## 2            "Lot # 1000, Grand Kingston Road, United Kingdom"     8.5
## 3                       "Level 6, Marching Square, California"     7.0
## 4                               "Grand Station Road, # G01-55"     6.5
## 5                            "# L1-22, Fresh Market, Malaysia"     5.5

class(df2$Name)

## [1] "character"