17 June 2016

Today's agenda

Today we'll look at how to carry out repetitive calculations quickly and easily in R. This procedure will work regardless of the programming language you're using (MATLAB, Python…) and can save you a lot of time and trouble.

Practically all serious data analysis uses this technique in one way or another.

Our plot from Meeting 2

Looking at the data

Have a look at http://cdiac.ornl.gov/trends/co2/sio-mlo.html

  • In Chrome, right-click on the "Data" link near the top of the page
  • Choose "Save Link As…"
  • Save the file with the name mauna_loa_co2.txt and with the format "All files"
  • Navigate to where you've stored the file and right-click on the file's icon
  • Choose "Open with…"
  • Open the file with Excel

Calculating annual mean CO2 concentrations

You probably all know how to calculate the annual means in Excel. Just in case, we'll practice using Excel to do the averaging and then we'll see how to do the same thing in R.

Why bother? This is a toy example, but using R will save you a lot of time and trouble when you're working with larger datasets.

When you're done, save your file with a new filename.

Reading the data into R

Copy-paste the R code below into a new script in RStudio, then press the Source button.

# Download the data file and save it as mauna_loa_co2.txt.  
download.file("http://cdiac.ornl.gov/ftp/trends/co2/maunaloa.co2", 
              "mauna_loa_co2.txt")

# Read the data into R.  
co2.data <- read.table("mauna_loa_co2.txt", header = TRUE, 
                       row.names = NULL, skip = 14, nrows = 51, 
                       na.strings = "-99.99")
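If you're curious what skip, nrows and na.strings do, here is a small sketch you can run without downloading anything. The "file" below is made up for illustration (it is not the Mauna Loa data), and the names toy.lines and toy.data are hypothetical.

```r
# A made-up "file": two lines of notes, a header line, then three data rows.
toy.lines <- c("Made-up file for illustration only",
               "Missing values are coded as -99.99",
               "Year Jan Feb Mar",
               "2001 1.0 2.0 3.0",
               "2002 4.0 -99.99 6.0",
               "2003 7.0 8.0 9.0")

# skip = 2 jumps over the two lines of notes, header = TRUE uses the
# next line for the column names, nrows = 2 reads only two data rows,
# and na.strings turns -99.99 into NA (R's marker for a missing value).
toy.data <- read.table(text = toy.lines, header = TRUE,
                       skip = 2, nrows = 2,
                       na.strings = "-99.99")
print(toy.data)
```

The 2003 row is left out because nrows = 2, and the February value for 2002 shows up as NA.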

Looking at the data

Type the following in the Console sub-window and press Enter.

head(co2.data)

How big is the data frame?

dim(co2.data)

How many rows does it have?

nrow(co2.data)

Working with part of co2.data

Putting square brackets [ ] after a variable's name lets you tell R that you want to work with just part of the variable. Try the following commands.

co2.data[1, 1]
co2.data[1: 3, 1]
co2.data[1, ]
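Inside the square brackets, the row comes first and the column second, separated by a comma. Leaving one side of the comma empty means "all of them". Here is a short sketch using a made-up data frame (toy.data is a hypothetical name and the numbers are invented) so that you can check each result by eye.

```r
# A small made-up data frame: 3 rows, 4 columns.
toy.data <- data.frame(Year = c(2001, 2002, 2003),
                       Jan = c(1, 4, 7),
                       Feb = c(2, 5, 8),
                       Mar = c(3, 6, 9))

toy.data[2, 3]      # row 2, column 3: the single number 5
toy.data[1:2, 1]    # rows 1 to 2 of column 1: 2001 and 2002
toy.data[3, ]       # every column of row 3: a one-row data frame
toy.data[, 2]       # every row of column 2: 1, 4 and 7
```

Type the four bracket commands into the Console one at a time to see what each returns.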

A coordinate system for a grid of numbers

Calculating the average of just the first line of co2.data

Try typing this command into the Console and pressing Enter.

mean(unlist(co2.data[1, 2: 13]))
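Why the unlist()? A single row taken from a data frame is still a data frame, and mean() needs a plain vector of numbers. The sketch below shows this with a made-up row (toy.row and row.values are hypothetical names). It also shows what happens when a month is missing, which matters because the missing values in our file were read in as NA.

```r
# A made-up one-row data frame with a missing February value.
toy.row <- data.frame(Year = 2001, Jan = 1, Feb = NA_real_, Mar = 3)

# The selected part of the row is still a data frame ...
is.data.frame(toy.row[1, 2:4])

# ... so we flatten it into a plain numeric vector first.
row.values <- unlist(toy.row[1, 2:4])

# One missing value makes the whole mean NA.
mean(row.values)

# na.rm = TRUE tells mean() to ignore the missing value,
# leaving the mean of 1 and 3.
mean(row.values, na.rm = TRUE)
```

Whether ignoring missing months is the right choice depends on your analysis; an annual mean based on fewer than 12 months may not be comparable with the others.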

How do we do this repeatedly?

We need to do this operation repeatedly, working with each row of co2.data in turn. To do that, we need to

  1. find out how many rows co2.data has
  2. make a container to store the averages in; this container should have a place for every row average we will calculate
  3. calculate the average CO2 concentration from the first row and store the result in our container
  4. repeat step 3 with each row of co2.data

Add the following lines to your R code.

# How many rows are there in co2.data?
n.rows <- length(co2.data[, 1])

# Make a container to store the mean values.  
mean.co2 <- rep(NA, length.out = n.rows)

# Calculate the mean of the first row and store it
# in the first position of mean.co2.  
mean.co2[1] <- mean(unlist(co2.data[1, 2: 13]))

Source this code and look at the contents of mean.co2. What do you notice about the result?

for loops

for (i in 1: 5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Changing the last piece of the code from an earlier slide

Modify the end of your script so that it looks like this instead.

# Step through the rows and calculate the
# mean of each one.  Note the two lowercase letter i's
# in brackets after mean.co2 and co2.data.  
for (i in 1: n.rows) {
  mean.co2[i] <- mean(unlist(co2.data[i, 2: 13]))
}

Source your code and look at mean.co2 now. How are the results different?
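One way to convince yourself that a loop is doing what you think is to compare it with a built-in function that does the same job. The sketch below repeats the loop on a small made-up data frame and checks it against R's rowMeans() function. The names toy.data, toy.mean and builtin.mean are hypothetical and the numbers are invented.

```r
# A small made-up data frame: a Year column and three "months".
toy.data <- data.frame(Year = c(2001, 2002, 2003),
                       Jan = c(1, 4, 7),
                       Feb = c(2, 5, 8),
                       Mar = c(3, 6, 9))

# Same recipe as before: count rows, make a container, loop.
n.rows <- nrow(toy.data)
toy.mean <- rep(NA, length.out = n.rows)
for (i in 1:n.rows) {
  toy.mean[i] <- mean(unlist(toy.data[i, 2:4]))
}

# rowMeans() calculates the mean of every row in one step.
builtin.mean <- rowMeans(toy.data[, 2:4])

print(toy.mean)
# unname() drops the row labels so only the numbers are compared.
print(all.equal(toy.mean, unname(builtin.mean)))
```

For this particular task rowMeans() is shorter, but the loop works for any calculation, including ones that have no ready-made function.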

What did we learn?

for loops provide a way of carrying out repetitive tasks. Practically all programming languages have them.

To use for loops effectively, we need to

  1. figure out what the output of the loop should look like
  2. create a container to store the results
  3. write a command that will carry out the desired operation for just part of the overall calculation
  4. write a loop around the command from step 3 so that it carries out the whole calculation
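The four steps above can be applied to other repetitive calculations too. As a sketch, here they are used to work out the change from one year to the next in a vector of annual means. The values in annual.mean are made up for illustration, and the names annual.mean, n.changes and change are hypothetical.

```r
# Made-up annual mean values, for illustration only.
annual.mean <- c(370, 372, 375, 377)

# Step 1: the output has one value for each pair of neighbouring years.
n.changes <- length(annual.mean) - 1

# Step 2: create a container to store the results.
change <- rep(NA, length.out = n.changes)

# Step 3: carry out the operation for just the first pair of years.
change[1] <- annual.mean[2] - annual.mean[1]

# Step 4: wrap a loop around the command from step 3.
for (i in 1:n.changes) {
  change[i] <- annual.mean[i + 1] - annual.mean[i]
}

print(change)
```

R's built-in diff() function gives the same result for this vector, which makes a handy check on the loop.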