Loading data into R

Loading data into R can be a pain. Here are some hints.

1) Before you try to load a .csv file for the first time, open it up and go to the last column. Then highlight several dozen clumns to the right and tell Excel to “clear all.” This will get rid any stray key strokes and formatting that you don't intend to load into R, or which R can't interpret. 2) Then, go to the last row and do the same. 3) Make sure all of your column names start with a character, not a number. R doesn't allow numbers to be the first thing in a column heading. So, change “2001 mass” to “mass 2001.” Better yet, change it to “mass_2001” or “mass.2001”
4) R can handle spaces in column headings but will add a period to them. If you happen to have two spaces you'll get 2 periods 5) Make sure all of your column names are unique. 6) Fill in blank spaces with NA values. R's deafult NA value is “NA” but you can tell it to use some other character such as “.” if the data was previously set up for SAS. See below. 7) Highlight all of your data and use “control-F” to weed out typos and blank spaces. Set the Find routine to “Find entire cells only” and search for things like spaces, periods, and other punctuation marks. 8) You can highlight individual columns of numeric data and search for alphabetical characters to try to weed them out.

In general I like to load something into R and use R's summary command to identify columns that aren't behaving properly. Then I can re-open the file in Excel and look at a particular column to fix. Below are some hints for getting recalcitrent files into R by modifying the read.csv command.

The basic code for loading data is

data.object <- read.csv("filename.csv")

R is pretty smart and can often figure out most of the details, like whether you have columns names, what columns represent numeric data and factors, etc. However, sometimes it doesn't get right, especially when the data has not been scrubbed completely clean. R's deafult is to turn anything column that has ANY entry with a non-numeric character (which includes spaces) to a factor. So, every typo and accidental key stroke will turn the entire column it occurs in into a factor variable.

You can ease the data import process by giving R some more instruction within the call to read.csv. When I typically load data in I use the following command.

data.object <- read.csv("filename.csv", 
                        header = TRUE, 
                        sep = ",")

This is a bit redundant and I mostly do it out of habitat. However, I think its better to feed read.csv more commands just in case it helps the loading process. In part this is because it can often be somewhat mysterious as to why a file isn't loading into R. I've tried to re-create problems and intentionally create files that should have problems load, but I can't. So I guess I'm saying there is a bit of Juju to loading csv files.

When I actually have a problem loading data I usually start with adding

data.object <- read.csv("filename.csv",
                         header = TRUE, 
                        sep = ",",
                        stringsAsFactors = FALSE)

“stringsAsFactors = FALSE” turns off the automatic conversion of character strings to factors. you can then examine columns in R to see what they look like.

When something really doesn't seem to want to load, you can add every more commands to



read.csv("filename.csv",            #name of file
         header = TRUE,             #are there column names in 1st row?
         sep = ",",                 #what separates rows?
         as.is = !stringsAsFactors, #turn off automatic conversion of character data to factors
         colClasses = NA            # to convert everything to character data set to "character"
         na.string = "NA"           # could be "." for SAS files
         skip = 0,                  #number of rows to skip at the top of the file.  Can be set >0
                                    #  if there are comments contained at the top of the file
         strip.white = TRUE,        #strip out extra white space in strings.  For example, " 0.1" = "0.1"
         fill = TRUE,               #fill in rows that have unequal numbers of columns
         comment.char = "#",        #character used for comments that should not be read in
         stringsAsFactors = FALSE   #Another control for deciding whether characters should be converted to factor
         )

The default behavior of R is to convert columns that contains characters (eg letters) to factor variables. This works fine as long as there are no typos. However, if there are typos, factor levels AND every typo gets converted to a factor. So, if you have “R” and “NR” for “reproductive” and “non-reproductive” and some typos like “N R” and “ NR” then instead of two factor you'll end up with four.

Also, if there are any typos in numeric data such as extra spaces are stray letters (eg “ 101.00” or “101.00 ”) then R will, first, interprete this as character data and 2nd, convert it to a factor. So you'll get a different factor variable for each and every number in that column - which is real pain for continous variables!

In general, its useful when 1st importing data to set ' “stringsAsFactors = FALSE”' and converting things to factors by hand. If you have problems with numeric columns you can set '“colClasses = character” ' and import EVERYTHING as character data.

To convert bad factor variables back to character data use

data$column.name <- as.character(data$column.name)

You can convert it to factor variables wtih

data$column.name <- as.factor(data$column.name)

To convert character data to numeric data

data$column.name <- as.numeric(data$column.name)

Many (basic) R functions are smart are will automatically convert character data to factors, then continue running. Sometimes its easier to leave things as character data and let R sort it out for you.

Be careful with accidentally converting factors to numeric data. Say you have a numeric column that accientally loaded as a factor variable. If you convert it directly to numeric data without fixing it, R will convert the FACTOR LEVELS to numeric data. So, instead of having numeric data like 1, 1.1,,1.1, 1.76, 2, might end up 1, 2,2,3,4 because R had categorized that data into 4 different factor levels.

Missing Data

Missing data can be a real pain to manage in a dataframe that has just been created using read.csv. R for some reason has mulitple ways of indicating missing data, depending on the context. This includes

NA
NaN
<NA>

For some reaason when you're cleaning data - especially if you clean it within R or reshaping your data in R - you can get a mix of these NA designators within the same column, which can create problems. A general purpose solution is to conver the whole column to character data, then convert it to what ever its suppossed to be, either factor or numeric. For example

data$column.name <- as.numeric(as.character(data$column.name))

In general it seems that converting a column to character data turns all the NAs to the same format.