Most of you are looking to use R for data analysis, which of course requires learning how to import your data into R. This is maybe the big leap for people who are not used to programming.
To import data into R, you will create an object that points to a dataset somewhere on your computer. R will then import that data as a dataframe object (see Module 2).
The most important thing to keep in mind is that your data should contain each observation as a separate row. For example, each subject should be a separate row. If you have repeated measurements from the same subject, each observation should be a separate row, and you should have a column that specifies which observation/trial number.
For example, this is the WRONG way to structure a dataframe:
| Subject | Time 1 | Time 2 | Time 3 |
|---|---|---|---|
| A | 1.2 | 2.5 | 4.1 |
| B | 1.6 | 2.6 | 3.8 |
| C | 0.9 | 2.4 | 3.6 |
The correct structure for this data set would be:
| Subject | Time | Value |
|---|---|---|
| A | 1 | 1.2 |
| A | 2 | 2.5 |
| A | 3 | 4.1 |
| B | 1 | 1.6 |
| B | 2 | 2.6 |
| B | 3 | 3.8 |
| C | 1 | 0.9 |
| C | 2 | 2.4 |
| C | 3 | 3.6 |
The order of the rows is not relevant. The key is that each observation for a given individual shows up as separate rows.
R can read several file formats, but the popular excel format (.xlsor .xlsx) is NOT one of them. Instead, you will most often convert your excel sheet into one of two formats: tab-delimited files or comma-separated-values files (.csv). These are file formats that can be read by any text editor.
If you use Excel or any other spreadsheet for data storage, you can easily convert those into one of these formats using [Save as...]
In this class, I will be using .csv files for almost all data importing tasks.
R can also handle databases (e.g., Microsoft Access) using a special package called RODBC. You can get a comprehensive tutorial here
Go to the course google drive folder for week 2 and download the entire folder called ‘data’ by right-clicking (or control-click) and selecting ‘download’. It will compress the files into a .zip format and you can save it somewhere on your computer. Uncompress the .zip file, and you should end up with a folder named ‘data’.
Quick refresher: finding the path to a file
Importing data is easier if you know how to get the path to the .csv file on your computer. We covered this in module 1 (Intro).
As an example, say I made folder on my desktop called Rcourse and inside that folder, I made another folder called Week_2. If my data folder is inside that, this would be the path: /Users/dshizuka/Desktop/Rcourse/Week_2/data
In turn, the path to a file inside that folder might be: /Users/dshizuka/Desktop/Rcourse/Week_2/data/SampleData.csv
There are several ways to import your data in to R. Two of the methods require the use of file paths. There are two other ways you can import data without the path by manually browsing and choosing your file. However, I strongly recommend learning how to use the paths to import data. One reason is that it actually takes you longer to browse for the file. Another reason is that the script is not reproducible if you don’t know which file is supposed to be imported. But at the end of the day, it’s your choice how to do this.
No matter how you import the data, the final goal is the same: assign the imported data as an object. Here, we will assign the data into an object called dat. (tip: I would not use “data” as an object name because data() is a function name). Try out each of the following methods for importing a sample data set called “SampleData_1.csv”.
Using the above example location for the file, you can use either of the below versions of the pathname to import the data. ***Remeber: You will have to change the path inside the lines of code here to be where the file resides in your computer.
#option 1
dat=read.csv("/Users/dshizuka/Desktop/Rcourse/Week_2/data/SampleData.csv")
#option 2
dat=read.csv("~/Desktop/Rcourse/Week_2/data/SampleData.csv")
Here, you are going to first set the working directory to the folder that contains the file. You will then run a second line of code to import the data by simply choosing the file name. This method can be particularly useful if you want to batch-process a bunch of files… something that we might learn later in the course.
setwd("~/Desktop/Rcourse/Week_2/data/")
dat=read.csv(file="SampleData.csv")
This will call a prompt that will let you choose the file. You can then find and choose the file you want to import. This is a convenient and quick way to import data. However, it is limiting because it takes time to click around to find the file, and more importantly, this step is not reproducible.
dat=read.csv(file.choose())
In RStudio, select [File]–[Import Dataset]–[From CSV...] The first time you do this, you may be asked to install the readr package. Note: I actually do NOT recommend using this method for a variety of reasons (irreproducibility, takes time, saves the object as a special format).
So let’s now use method #2 to import the sample data and look at the structure of the data:
setwd("~/Desktop/Rcourse/Week_2/data/")
dat=read.csv(file="SampleData.csv")
str(dat)
setwd("~/Dropbox/Dai_Teaching/Intro_To_R_2017/3-Import")
dat=read.csv(file="SampleData.csv")
str(dat)
## 'data.frame': 13 obs. of 5 variables:
## $ Indiv.ID: Factor w/ 13 levels "20-01","20-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 1 2 1 2 2 1 ...
## $ age : Factor w/ 2 levels "adult","juvenile": 1 2 1 1 2 2 1 1 2 2 ...
## $ size : num 29.5 28 26 25 25 28 26 28.5 27.5 26 ...
## $ weight : num 66.5 58.5 57 55.5 62 61 58 65 60 59 ...
A good thing to check is whether the columns are all of the correct type (e.g., continuous variables come out as numeric, categorical variables come out as factor, etc.)
read.csvYou can specify what data values should be imported as unknown data (NA). You do this by using the argument na.string=. For example, if you want R to ignore all the cell values that contain “#N/A” in your .csv file:
dat=read.csv(file="SampleData.csv", na.string="#N/A")
Importing from .txt files:
A more generic function is
read.table(). This function allows you to read any kind of text file (including .csv files) while specifying the data is delimited. So, for example, you can use this function to import comma-separated filesdat=read.table(file.choose(), sep=","), or tab-delimited filesdat=read.table(file.choose(), sep="\t"). So theread.table()function is more flexible in many ways.
You will often have some kind of problem reading in files. One of the big roadblocks to using R is dealing with these bugs.
Some common problems:
NA
NA
NA using the na.string= argument (see above).Step 1: Import the SampleData_w_errors.csv file and look at its data structure:
errdat=read.csv("SampleData_w_errors.csv") #assuming your working directory is set to the sample data folder
str(errdat)
Step 2: Fix the spreadsheet so that you get the following:
Many times, you will want to just look at subsets of the data: i.e., certain columns or rows that satisfy some condition.
We already covered how to look at specific columns using either the [] or $ operators. For example, to look at the sex of individuals in the sample data, we can do either:
#These commands will do the same thing
dat$sex
dat[,2]
dat[,"sex"]
In contrast, there is no equivalent of the $ for selecting rows. But we can specify the rows we want to look at using the first element within []:
dat[2,] #get the second row of the dataset
If we want to look at all rows that fulfill some criteria, we can use the == operator. For example, let’s look at all of the data for males:
dat[dat$sex=="male", ]
## Indiv.ID sex age size weight
## 1 20-01 male adult 29.5 66.5
## 2 20-02 male juvenile 28.0 58.5
## 6 20-06 male juvenile 28.0 61.0
## 8 20-08 male adult 28.5 65.0
## 9 20-09 male juvenile 27.5 60.0
## 11 20-11 male adult 25.5 62.0
In this case, R interprets this command to pick out rows in the data where the answer to dat$sex=="male" is TRUE.
You can use multiple criteria to subset the data. Use the | to mean “or” and the & operator to mean “and”. For example:
dat[dat$sex=="male"|dat$weight<60,] #give me the data for rows where sex is male OR weight is less than 60. This gives you all but two records.
## Indiv.ID sex age size weight
## 1 20-01 male adult 29.5 66.5
## 2 20-02 male juvenile 28.0 58.5
## 3 20-03 female adult 26.0 57.0
## 4 20-04 female adult 25.0 55.5
## 6 20-06 male juvenile 28.0 61.0
## 7 20-07 female adult 26.0 58.0
## 8 20-08 male adult 28.5 65.0
## 9 20-09 male juvenile 27.5 60.0
## 10 20-10 female juvenile 26.0 59.0
## 11 20-11 male adult 25.5 62.0
## 12 20-12 female adult 27.0 59.0
dat[dat$sex=="male"&dat$weight<60,] #give me the data for rows where sex is male AND weight is less than 60. This gives you just one record of the male that is less than 60g.
## Indiv.ID sex age size weight
## 2 20-02 male juvenile 28 58.5
If you want to EXCLUDE data for certain rows, you can use the != operator:
dat[dat$sex!="male",] #this will give you the records for individuals that are NOT male (i.e., females)
## Indiv.ID sex age size weight
## 3 20-03 female adult 26.0 57.0
## 4 20-04 female adult 25.0 55.5
## 5 20-05 female juvenile 25.0 62.0
## 7 20-07 female adult 26.0 58.0
## 10 20-10 female juvenile 26.0 59.0
## 12 20-12 female adult 27.0 59.0
## 13 20-13 female adult 26.5 60.0
which() function:Another way to subset the data is to use the which() function, which is extremely useful. We will be using this function a lot as we go forward.
Basically, which() allows you to get the element number of a vector that fits certain criteria. For example, these are the row numbers for males in the dataset:
which(dat$sex=="male")
## [1] 1 2 6 8 9 11
So you can use this command to subset the data in a similar way as above:
dat[which(dat$sex=="male"), ] #don't forget the comma!
## Indiv.ID sex age size weight
## 1 20-01 male adult 29.5 66.5
## 2 20-02 male juvenile 28.0 58.5
## 6 20-06 male juvenile 28.0 61.0
## 8 20-08 male adult 28.5 65.0
## 9 20-09 male juvenile 27.5 60.0
## 11 20-11 male adult 25.5 62.0
You can also look at a particular variable within a particular subset. Here are two ways to do the same thing–get the body size of males.
dat[dat$sex=="male","size"]
## [1] 29.5 28.0 28.0 28.5 27.5 25.5
dat$size[dat$sex=="male"]
## [1] 29.5 28.0 28.0 28.5 27.5 25.5
Using this, you can look at the mean size of males vs. females
mean(dat$size[dat$sex=="male"])
## [1] 27.83333
mean(dat$size[dat$sex=="female"])
## [1] 25.92857
A more elegant way to do this is to use either the tapply() or aggregate() functions. For both of these functions, the first argument is the column to apply the calculation to, second argument is the categorical variable (in list format), and the third argument is the function name for the calculation you want to apply. Try these out for calculating the mean body size by sex:
tapply(dat$size,dat$sex,mean)
## female male
## 25.92857 27.83333
aggregate(dat$size,by=list(Sex=dat$sex),mean)
## Sex x
## 1 female 25.92857
## 2 male 27.83333
Note that the tapply() and aggregate() uses different formats as outputs—-tapply() returns an array, while aggregate() returns a dataframe.
You can see this better when you calculate the means while splitting up the data by two different factors, like sex and age:
tap=tapply(dat$size,list(dat$sex,dat$age),mean)
agg=aggregate(dat$size,by=list(Sex=dat$sex, Age=dat$age),mean)
class(tap)
## [1] "matrix"
class(agg)
## [1] "data.frame"
tap
## adult juvenile
## female 26.10000 25.50000
## male 27.83333 27.83333
agg
## Sex Age x
## 1 female adult 26.10000
## 2 male adult 27.83333
## 3 female juvenile 25.50000
## 4 male juvenile 27.83333
You can see that the output of tapply is a matrix, and the output of aggregate() is a dataframe. So you can use one or other function depending on what you want to do with the output of the calculation you want to do.
For example, you can use the same set of functions to calculate the sums of sizes for each sex x age combination:
tapply(dat$size,list(dat$sex,dat$age),sum)
## adult juvenile
## female 130.5 51.0
## male 83.5 83.5
aggregate(dat$size,by=list(Sex=dat$sex, Age=dat$age),sum)
## Sex Age x
## 1 female adult 130.5
## 2 male adult 83.5
## 3 female juvenile 51.0
## 4 male juvenile 83.5
There are a many instances in which you have two different datasets for the same subjects. For example, I might have one dataset from capturing and measuring a bunch of individual birds. At that point, I take a blood or tissue sample from each individual and I might later use loci linked to sex chromosomes to confirm sex the individual is (many birds are monomorphic and it can be difficult to tell the sexes apart). I now have two different spreadsheets containing different pieces of information about the same subjects, and I might want to combine this information. You may have done something like this in the past using “lookup functions” in excel or database software (e.g., Microsoft Access, FileMaker, etc.). Now we will learn how to do this in R. Let’s start by inputting two different sets of data (one of them is the same as what we have been using so far). Remember that your directory name will be different for you:
#assuming the working directory is already set to your sample data folder:
dat=read.csv(file="SampleData.csv",header=TRUE)
chromo=read.csv(file="sex_chromo.csv", header=TRUE)
We now have two dataframes—dat and sex. This second dataframe has information of sex chromosomes that we have identified for each individual using molecular methods. We now use a match() function to look up which rows correspond to the same individual:
match(dat$Indiv.ID,chromo$Indiv.ID)
## [1] 8 9 1 2 3 10 4 11 12 5 13 6 7
Here, the first value indicates which row in the dataframe chromo corresponds to the first row in dat? Remember that the order of the elements within the parentheses matter. You can confirm this by looking at both dataframes. You can use this to lookup the sex chromosomes for each individual in the dataframe dat.
sex.chromo=chromo[match(dat$Indiv.ID,chromo$Indiv.ID),"chromosomes"]
sex.chromo
## [1] XY XY XX XX XX XY XX XY XY XX XY XX XX
## Levels: XX XY
You can append this new vector, sex.chromo, to the dataframe dat using the function cbind():
merged.dat=cbind(dat,sex.chromo)
merged.dat
## Indiv.ID sex age size weight sex.chromo
## 1 20-01 male adult 29.5 66.5 XY
## 2 20-02 male juvenile 28.0 58.5 XY
## 3 20-03 female adult 26.0 57.0 XX
## 4 20-04 female adult 25.0 55.5 XX
## 5 20-05 female juvenile 25.0 62.0 XX
## 6 20-06 male juvenile 28.0 61.0 XY
## 7 20-07 female adult 26.0 58.0 XX
## 8 20-08 male adult 28.5 65.0 XY
## 9 20-09 male juvenile 27.5 60.0 XY
## 10 20-10 female juvenile 26.0 59.0 XX
## 11 20-11 male adult 25.5 62.0 XY
## 12 20-12 female adult 27.0 59.0 XX
## 13 20-13 female adult 26.5 60.0 XX
Ok, now that I have shown you the hard way to do this, here is the easy way, using the function merge():
merged.dat2=merge(dat,chromo,by="Indiv.ID")
merged.dat2
## Indiv.ID sex age size weight chromosomes
## 1 20-01 male adult 29.5 66.5 XY
## 2 20-02 male juvenile 28.0 58.5 XY
## 3 20-03 female adult 26.0 57.0 XX
## 4 20-04 female adult 25.0 55.5 XX
## 5 20-05 female juvenile 25.0 62.0 XX
## 6 20-06 male juvenile 28.0 61.0 XY
## 7 20-07 female adult 26.0 58.0 XX
## 8 20-08 male adult 28.5 65.0 XY
## 9 20-09 male juvenile 27.5 60.0 XY
## 10 20-10 female juvenile 26.0 59.0 XX
## 11 20-11 male adult 25.5 62.0 XY
## 12 20-12 female adult 27.0 59.0 XX
## 13 20-13 female adult 26.5 60.0 XX
Note that you have to make sure that the variable names are identical in both dataframes for this to work!
Lastly, try this code to see that (in this hypothetical dataset), the sex chromosomes match up with the gender scored in the field.
table(merged.dat2$sex,merged.dat2$chromosomes)
##
## XX XY
## female 7 0
## male 0 6
After you have run through the module, try importing your own dataset and check to make sure that all of your columns show up correctly. Then trying subsetting and calculating means of values according to different categorical variables.