3.1 Overview

Most of you are looking to use R for data analysis, which of course requires learning how to import your data into R. This is maybe the big leap for people who are not used to programming.
To import data into R, you will create an object that points to a dataset somewhere on your computer. R will then import that data as a dataframe object (see Module 2).

3.1.1 What is a dataframe?

The most important thing to keep in mind is that your data should contain each observation as a separate row. For example, each subject should be a separate row. If you have repeated measurements from the same subject, each observation should be a separate row, and you should have a column that specifies which observation/trial number.

For example, this is the WRONG way to structure a dataframe:

Subject Time 1 Time 2 Time 3
A 1.2 2.5 4.1
B 1.6 2.6 3.8
C 0.9 2.4 3.6

The correct structure for this data set would be:

Subject Time Value
A 1 1.2
A 2 2.5
A 3 4.1
B 1 1.6
B 2 2.6
B 3 3.8
C 1 0.9
C 2 2.4
C 3 3.6

The order of the rows is not relevant. The key is that each observation for a given individual shows up as separate rows.

3.1.2 File Formats

R can read several file formats, but the popular excel format (.xlsor .xlsx) is NOT one of them. Instead, you will most often convert your excel sheet into one of two formats: tab-delimited files or comma-separated-values files (.csv). These are file formats that can be read by any text editor.
If you use Excel or any other spreadsheet for data storage, you can easily convert those into one of these formats using [Save as...]

In this class, I will be using .csv files for almost all data importing tasks.

R can also handle databases (e.g., Microsoft Access) using a special package called RODBC. You can get a comprehensive tutorial here

3.2 Import .csv files

Step 1: Download the practice data folder:

Go to the course google drive folder for week 2 and download the entire folder called ‘data’ by right-clicking (or control-click) and selecting ‘download’. It will compress the files into a .zip format and you can save it somewhere on your computer. Uncompress the .zip file, and you should end up with a folder named ‘data’.

Step 2: Find out the path to your data folder and files

Quick refresher: finding the path to a file
Importing data is easier if you know how to get the path to the .csv file on your computer. We covered this in module 1 (Intro).

  • For Windows, you can get the path name of the file or folder by right-clicking it and click “Copy as Path”
  • For Mac (or Windows), you can look for the file/folder in Finder, and then drag and drop the icon into the “Go to file/function” bar at the top of the RStudio window.

As an example, say I made folder on my desktop called Rcourse and inside that folder, I made another folder called Week_2. If my data folder is inside that, this would be the path: /Users/dshizuka/Desktop/Rcourse/Week_2/data
In turn, the path to a file inside that folder might be: /Users/dshizuka/Desktop/Rcourse/Week_2/data/SampleData.csv

Step 3: Import the data

There are several ways to import your data in to R. Two of the methods require the use of file paths. There are two other ways you can import data without the path by manually browsing and choosing your file. However, I strongly recommend learning how to use the paths to import data. One reason is that it actually takes you longer to browse for the file. Another reason is that the script is not reproducible if you don’t know which file is supposed to be imported. But at the end of the day, it’s your choice how to do this.

No matter how you import the data, the final goal is the same: assign the imported data as an object. Here, we will assign the data into an object called dat. (tip: I would not use “data” as an object name because data() is a function name). Try out each of the following methods for importing a sample data set called “SampleData_1.csv”.

Four ways to import data:

1. Directly enter full path to file

Using the above example location for the file, you can use either of the below versions of the pathname to import the data. ***Remeber: You will have to change the path inside the lines of code here to be where the file resides in your computer.

#option 1
dat=read.csv("/Users/dshizuka/Desktop/Rcourse/Week_2/data/SampleData.csv") 

#option 2
dat=read.csv("~/Desktop/Rcourse/Week_2/data/SampleData.csv") 

2. Set working directory to folder where your file resides.

Here, you are going to first set the working directory to the folder that contains the file. You will then run a second line of code to import the data by simply choosing the file name. This method can be particularly useful if you want to batch-process a bunch of files… something that we might learn later in the course.

setwd("~/Desktop/Rcourse/Week_2/data/") 
dat=read.csv(file="SampleData.csv")

3. Call ‘choose file’ prompt.

This will call a prompt that will let you choose the file. You can then find and choose the file you want to import. This is a convenient and quick way to import data. However, it is limiting because it takes time to click around to find the file, and more importantly, this step is not reproducible.

dat=read.csv(file.choose())

4. Using RStudio GUI

In RStudio, select [File][Import Dataset][From CSV...] The first time you do this, you may be asked to install the readr package. Note: I actually do NOT recommend using this method for a variety of reasons (irreproducibility, takes time, saves the object as a special format).


So let’s now use method #2 to import the sample data and look at the structure of the data:

setwd("~/Desktop/Rcourse/Week_2/data/")
dat=read.csv(file="SampleData.csv")
str(dat)
setwd("~/Dropbox/Dai_Teaching/Intro_To_R_2017/3-Import")
dat=read.csv(file="SampleData.csv")
str(dat)
## 'data.frame':    13 obs. of  5 variables:
##  $ Indiv.ID: Factor w/ 13 levels "20-01","20-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 2 1 1 1 2 1 2 2 1 ...
##  $ age     : Factor w/ 2 levels "adult","juvenile": 1 2 1 1 2 2 1 1 2 2 ...
##  $ size    : num  29.5 28 26 25 25 28 26 28.5 27.5 26 ...
##  $ weight  : num  66.5 58.5 57 55.5 62 61 58 65 60 59 ...

A good thing to check is whether the columns are all of the correct type (e.g., continuous variables come out as numeric, categorical variables come out as factor, etc.)

Some important arguments inside read.csv

You can specify what data values should be imported as unknown data (NA). You do this by using the argument na.string=. For example, if you want R to ignore all the cell values that contain “#N/A” in your .csv file:

dat=read.csv(file="SampleData.csv", na.string="#N/A")

Importing from .txt files:

A more generic function is read.table(). This function allows you to read any kind of text file (including .csv files) while specifying the data is delimited. So, for example, you can use this function to import comma-separated files dat=read.table(file.choose(), sep=","), or tab-delimited files dat=read.table(file.choose(), sep="\t"). So the read.table() function is more flexible in many ways.

3.3 Some common errors when importing data

You will often have some kind of problem reading in files. One of the big roadblocks to using R is dealing with these bugs.

Some common problems:

  • It won’t import
    • Are you sure you saved the file as a .csv file?
    • Is the path and file name correct?
    • Is the path and file name inside quotes?
  • I get a weird column that has a bunch of NA
    • You probably have a column in your spreadsheet that has a value in one of the cells that you forgot about.
  • I have a weird column at the bottom of my data that is just a bunch of NA
    • You probably have a cell at the bottom of your spreadsheet with “hidden spaces”.
    • Fix: In excel, do a “Find and Replace” to replace blank space with some character (e.g., a *) and then go and find and erase those characters.
  • Column is supposed to be numeric but it is showing up as factors.
    • You probably have a cell value in the column that is non-numeric, e.g., “#N/A” or ‘hidden space’
    • Fix: You can specify what kind of characters should be interpreted as NA using the na.string= argument (see above).

Exercise: Troubleshoot data import

Step 1: Import the SampleData_w_errors.csv file and look at its data structure:

errdat=read.csv("SampleData_w_errors.csv") #assuming your working directory is set to the sample data folder
str(errdat)

Step 2: Fix the spreadsheet so that you get the following:

  • The column called “sex” only has “female” and “male”.
  • The column called “size” is numeric
  • The column called “weight” is numeric
  • The column called “X” is removed.

3.4 Subsets and Subtotals

Many times, you will want to just look at subsets of the data: i.e., certain columns or rows that satisfy some condition.
We already covered how to look at specific columns using either the [] or $ operators. For example, to look at the sex of individuals in the sample data, we can do either:

#These commands will do the same thing
dat$sex
dat[,2]
dat[,"sex"]

In contrast, there is no equivalent of the $ for selecting rows. But we can specify the rows we want to look at using the first element within []:

dat[2,] #get the second row of the dataset

If we want to look at all rows that fulfill some criteria, we can use the == operator. For example, let’s look at all of the data for males:

dat[dat$sex=="male", ]
##    Indiv.ID  sex      age size weight
## 1     20-01 male    adult 29.5   66.5
## 2     20-02 male juvenile 28.0   58.5
## 6     20-06 male juvenile 28.0   61.0
## 8     20-08 male    adult 28.5   65.0
## 9     20-09 male juvenile 27.5   60.0
## 11    20-11 male    adult 25.5   62.0

In this case, R interprets this command to pick out rows in the data where the answer to dat$sex=="male" is TRUE.

You can use multiple criteria to subset the data. Use the | to mean “or” and the & operator to mean “and”. For example:

dat[dat$sex=="male"|dat$weight<60,] #give me the data for rows where sex is male OR weight is less than 60. This gives you all but two records. 
##    Indiv.ID    sex      age size weight
## 1     20-01   male    adult 29.5   66.5
## 2     20-02   male juvenile 28.0   58.5
## 3     20-03 female    adult 26.0   57.0
## 4     20-04 female    adult 25.0   55.5
## 6     20-06   male juvenile 28.0   61.0
## 7     20-07 female    adult 26.0   58.0
## 8     20-08   male    adult 28.5   65.0
## 9     20-09   male juvenile 27.5   60.0
## 10    20-10 female juvenile 26.0   59.0
## 11    20-11   male    adult 25.5   62.0
## 12    20-12 female    adult 27.0   59.0
dat[dat$sex=="male"&dat$weight<60,] #give me the data for rows where sex is male AND weight is less than 60. This gives you just one record of the male that is less than 60g. 
##   Indiv.ID  sex      age size weight
## 2    20-02 male juvenile   28   58.5

If you want to EXCLUDE data for certain rows, you can use the != operator:

dat[dat$sex!="male",] #this will give you the records for individuals that are NOT male (i.e., females)
##    Indiv.ID    sex      age size weight
## 3     20-03 female    adult 26.0   57.0
## 4     20-04 female    adult 25.0   55.5
## 5     20-05 female juvenile 25.0   62.0
## 7     20-07 female    adult 26.0   58.0
## 10    20-10 female juvenile 26.0   59.0
## 12    20-12 female    adult 27.0   59.0
## 13    20-13 female    adult 26.5   60.0

Using the which() function:

Another way to subset the data is to use the which() function, which is extremely useful. We will be using this function a lot as we go forward.
Basically, which() allows you to get the element number of a vector that fits certain criteria. For example, these are the row numbers for males in the dataset:

which(dat$sex=="male")
## [1]  1  2  6  8  9 11

So you can use this command to subset the data in a similar way as above:

dat[which(dat$sex=="male"), ] #don't forget the comma!
##    Indiv.ID  sex      age size weight
## 1     20-01 male    adult 29.5   66.5
## 2     20-02 male juvenile 28.0   58.5
## 6     20-06 male juvenile 28.0   61.0
## 8     20-08 male    adult 28.5   65.0
## 9     20-09 male juvenile 27.5   60.0
## 11    20-11 male    adult 25.5   62.0

You can also look at a particular variable within a particular subset. Here are two ways to do the same thing–get the body size of males.

dat[dat$sex=="male","size"]
## [1] 29.5 28.0 28.0 28.5 27.5 25.5
dat$size[dat$sex=="male"]
## [1] 29.5 28.0 28.0 28.5 27.5 25.5

Using this, you can look at the mean size of males vs. females

mean(dat$size[dat$sex=="male"])
## [1] 27.83333
mean(dat$size[dat$sex=="female"])
## [1] 25.92857

A more elegant way to do this is to use either the tapply() or aggregate() functions. For both of these functions, the first argument is the column to apply the calculation to, second argument is the categorical variable (in list format), and the third argument is the function name for the calculation you want to apply. Try these out for calculating the mean body size by sex:

tapply(dat$size,dat$sex,mean)
##   female     male 
## 25.92857 27.83333
aggregate(dat$size,by=list(Sex=dat$sex),mean)
##      Sex        x
## 1 female 25.92857
## 2   male 27.83333

Note that the tapply() and aggregate() uses different formats as outputs—-tapply() returns an array, while aggregate() returns a dataframe.
You can see this better when you calculate the means while splitting up the data by two different factors, like sex and age:

tap=tapply(dat$size,list(dat$sex,dat$age),mean)
agg=aggregate(dat$size,by=list(Sex=dat$sex, Age=dat$age),mean)
class(tap)
## [1] "matrix"
class(agg)
## [1] "data.frame"
tap
##           adult juvenile
## female 26.10000 25.50000
## male   27.83333 27.83333
agg
##      Sex      Age        x
## 1 female    adult 26.10000
## 2   male    adult 27.83333
## 3 female juvenile 25.50000
## 4   male juvenile 27.83333

You can see that the output of tapply is a matrix, and the output of aggregate() is a dataframe. So you can use one or other function depending on what you want to do with the output of the calculation you want to do.

For example, you can use the same set of functions to calculate the sums of sizes for each sex x age combination:

tapply(dat$size,list(dat$sex,dat$age),sum)
##        adult juvenile
## female 130.5     51.0
## male    83.5     83.5
aggregate(dat$size,by=list(Sex=dat$sex, Age=dat$age),sum)
##      Sex      Age     x
## 1 female    adult 130.5
## 2   male    adult  83.5
## 3 female juvenile  51.0
## 4   male juvenile  83.5

3.5 Linking two data sets

There are a many instances in which you have two different datasets for the same subjects. For example, I might have one dataset from capturing and measuring a bunch of individual birds. At that point, I take a blood or tissue sample from each individual and I might later use loci linked to sex chromosomes to confirm sex the individual is (many birds are monomorphic and it can be difficult to tell the sexes apart). I now have two different spreadsheets containing different pieces of information about the same subjects, and I might want to combine this information. You may have done something like this in the past using “lookup functions” in excel or database software (e.g., Microsoft Access, FileMaker, etc.). Now we will learn how to do this in R. Let’s start by inputting two different sets of data (one of them is the same as what we have been using so far). Remember that your directory name will be different for you:

#assuming the working directory is already set to your sample data folder:
dat=read.csv(file="SampleData.csv",header=TRUE)
chromo=read.csv(file="sex_chromo.csv", header=TRUE)

We now have two dataframes—dat and sex. This second dataframe has information of sex chromosomes that we have identified for each individual using molecular methods. We now use a match() function to look up which rows correspond to the same individual:

match(dat$Indiv.ID,chromo$Indiv.ID)
##  [1]  8  9  1  2  3 10  4 11 12  5 13  6  7

Here, the first value indicates which row in the dataframe chromo corresponds to the first row in dat? Remember that the order of the elements within the parentheses matter. You can confirm this by looking at both dataframes. You can use this to lookup the sex chromosomes for each individual in the dataframe dat.

sex.chromo=chromo[match(dat$Indiv.ID,chromo$Indiv.ID),"chromosomes"]
sex.chromo
##  [1] XY XY XX XX XX XY XX XY XY XX XY XX XX
## Levels: XX XY

You can append this new vector, sex.chromo, to the dataframe dat using the function cbind():

merged.dat=cbind(dat,sex.chromo)
merged.dat
##    Indiv.ID    sex      age size weight sex.chromo
## 1     20-01   male    adult 29.5   66.5         XY
## 2     20-02   male juvenile 28.0   58.5         XY
## 3     20-03 female    adult 26.0   57.0         XX
## 4     20-04 female    adult 25.0   55.5         XX
## 5     20-05 female juvenile 25.0   62.0         XX
## 6     20-06   male juvenile 28.0   61.0         XY
## 7     20-07 female    adult 26.0   58.0         XX
## 8     20-08   male    adult 28.5   65.0         XY
## 9     20-09   male juvenile 27.5   60.0         XY
## 10    20-10 female juvenile 26.0   59.0         XX
## 11    20-11   male    adult 25.5   62.0         XY
## 12    20-12 female    adult 27.0   59.0         XX
## 13    20-13 female    adult 26.5   60.0         XX

Ok, now that I have shown you the hard way to do this, here is the easy way, using the function merge():

merged.dat2=merge(dat,chromo,by="Indiv.ID")
merged.dat2
##    Indiv.ID    sex      age size weight chromosomes
## 1     20-01   male    adult 29.5   66.5          XY
## 2     20-02   male juvenile 28.0   58.5          XY
## 3     20-03 female    adult 26.0   57.0          XX
## 4     20-04 female    adult 25.0   55.5          XX
## 5     20-05 female juvenile 25.0   62.0          XX
## 6     20-06   male juvenile 28.0   61.0          XY
## 7     20-07 female    adult 26.0   58.0          XX
## 8     20-08   male    adult 28.5   65.0          XY
## 9     20-09   male juvenile 27.5   60.0          XY
## 10    20-10 female juvenile 26.0   59.0          XX
## 11    20-11   male    adult 25.5   62.0          XY
## 12    20-12 female    adult 27.0   59.0          XX
## 13    20-13 female    adult 26.5   60.0          XX

Note that you have to make sure that the variable names are identical in both dataframes for this to work!

Lastly, try this code to see that (in this hypothetical dataset), the sex chromosomes match up with the gender scored in the field.

table(merged.dat2$sex,merged.dat2$chromosomes)
##         
##          XX XY
##   female  7  0
##   male    0  6

Exercise: Import your own data

After you have run through the module, try importing your own dataset and check to make sure that all of your columns show up correctly. Then trying subsetting and calculating means of values according to different categorical variables.