Ch4 Programming with R: Part 2

Ken Jinkun Xiao
12 Feb 2015

Some General Programming Guidelines

In order to program the program effectively, the following general guidelines would helpful here:

  • Understand the program: Argument and Output
  • Form a general idea how to solve it
  • Translate the idea into a detailed implementation: Steps to solve the problem
  • Debug if it does not work:
    • Does it work?
    • Is it good enough?

Basic Programming Example: Sort

We wish to write a program which will sort a vector of integers into increasing order.

  • Argument: Vectors
  • Output: Vectors with number from smallest to largest

Understand the Problem

Start with a specific case, usually simple, but not too simple.

Sometimes, you might try to solve the problem on your own, without the computer.

Consider sorting the vector consisting of the elements 3, 5, 24, 6, 2, 4, 13, 1.

Understand the Problem

Our goal is to write a function called bubblesort() for which we could do the following:

x <- c(3, 5, 24, ..., 1)
bubblesort(x)

The output should be the numbers in increasing order: 1, 2, 3, 4, 5, 6, 13, 24.

Work out a General Idea

A first idea might be to find where the smallest value is, and record it.

Repeat, with the remaining values, recording the smallest value each time.

Repeat …

This might be time-consuming.

Work out a General Idea

An alternative idea: compare successive pairs of values, starting at the beginning of the vector, and running through to the end.

Swap pairs if they are out of order.

Try using this idea on 2; 1; 4; 3; 0, for example.

After running through it, you should end up with 1; 2; 3; 0; 4.

This method doesn't give the solution, directly.

Work out a General Idea

In checking the alternate idea, notice that the largest value always lands at the end of the new vector. (Can you prove to yourself that this should always happen?)

This means that we can sort the vector by starting at the beginning of the vector, go through all adjacent pairs.

Then repeat this procedure for all but the last value, and so on.

Detailed Implementation

At this point, we need to address specific coding questions.

e.g. How do we swap x[i] and x[i+1]?

Here is a way to swap the value of x[3] with that of x[4]:

save <- x[3]
x[3] <- x[4]
x[4] <- save

Detailed Implementation

Note that you should not over-write the value of x[3] with the value of x[4] before its old value has been saved in another place; otherwise, you will not be able to assign that value to x[4].

Detailed Implementation

bubblesort <- function(x) {
# x is initially the input vector and will be
# modified to form the output
# last is the last element to compare with
for(last in length(x):2) {
  for(first in 1:(last - 1)) {
    if(x[first] > x[first + 1]) { # swap the pair
      save <- x[first]
      x[first] <- x[first + 1]
      x[first + 1] <- save
     }
   }
}
return (x)
}

Check

Always begin testing your code on simple examples to identify obvious bugs.

bubblesort(c(2, 1))
[1] 1 2
bubblesort(c(2, 24, 3, 4, 5, 13, 6, 1))
[1]  1  2  3  4  5  6 13 24

Check

Try the code on several other numeric vectors. What is the output when the input vector has length 1?

bubblesort(1)

Check

The problem is that when length(x) == 1, the value of last will take on the values 1:2, rather than no values at all.

This doesn't require a redesign of the function; we can fix it by handling this as a special case at the beginning of our function:

Check

bubblesort <- function(x) {
# x is initially the input vector and will be
# modified to form the output
if (length(x) < 2) return (x)
# last is the last element to compare with
for(last in length(x):2) {
   for(first in 1:(last - 1)) {
     if(x[first] > x[first + 1]) { # swap the pair
     save <- x[first]
     x[first] <- x[first + 1]
     x[first + 1] <- save
     }
  }
}
return (x)
}

Check

Test the new version:

bubblesort(1)
[1] 1

Top-down design

Working out the detailed implementation of a program can appear to be a daunting task. The key to making it manageable is to break it down into smaller pieces which you know how to solve.

One strategy for doing that is known as “top-down design”. Top-down design is similar to outlining an essay before filling in the details:

  • Write out the whole program in a small number (1-5) of steps.
  • Expand each step into a small number of steps.
  • Keep going until you have a program.

Example - Merge Sort

The sort algorithm just described is known as a “bubble sort”. The bubble sort is easy to program and is efficient when the vector x is short, but when x is longer, more efficient methods are available.

One of these is known as a “merge sort”.

The general idea of a merge sort is to split the vector into two halves, sort each half, and then merge the two halves.

Example - Merge Sort

During the merge, we only need to compare the first elements of each sorted half to decide which is the smallest value over all.

Remove that value from its half; then the second value becomes the smallest remaining value in this half, and we can proceed to put the two parts together into one sorted vector.

Example - Merge Sort

So how do we do the initial sorting of each half?

We could use a bubble sort, but a more elegant procedure is to use a merge sort on each of them.

This is an idea called recursion.

The mergesort() function which we will write below can make calls to itself.

Because of variable scoping, new copies of all of the local variables will be created each time it is called, and the different calls will not interfere with each other.

Example - Merge Sort

It is often worthwhile to consider small numerical examples in order to ensure that we understand the basic idea of the algorithm, before we proceed to designing it in detail.

For example, suppose x is [8, 6, 7, 4], and we want to construct a sorted result r.

Then our merge sort would proceed as follows:

Understanding the idea

  • Split x into two parts: y [8; 6], z [7; 4]
  • Sort y and z: y [6; 8], z [4; 7]
  • Merge y and z:
    • Compare y1 = 6 and z1 = 4: r1 4; Remove z1; z is now [7].
    • Compare y1 = 6 and z1 = 7: r2 6; Remove y1; y is now [8].
    • Compare y1 = 8 and z1 = 7: r3 7; Remove z1; z is now empty.
    • Append remaining values of y onto r: r4 8
  • Return r = [4; 6; 7; 8]

Translating into code

It is helpful to think of the translation process as a stepwise process of refining a program until it works.

We begin with a general statement, and gradually expand each part.

We will use a double comment marker ## to mark descriptive lines that still need expansion. We will number these comments so that we can refer to them in the slides; in practice, you would probably not find this necessary.

After expanding, we will change to the usual comment marker to leave our description in place.

Initial Steps

We start with just one aim, which we can use as our first descriptive line:

## 1. Use a merge sort to sort a vector

We will gradually expand upon previous steps, adding in detail as we go.

An expansion of step 1 follows from recognizing that we need an input vector x which will be processed by a function that we are naming mergesort.

Somehow, we will sort this vector.

Initial Steps

In the end, we want the output to be returned:

# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
## 2: sort x into result
return (result)
}

Breaking Down one of the Steps

We now expand step 2, noting how the merge sort algorithm proceeds:

# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
# 2: sort x into result
## 2.1: split x in half
## 2.2: sort each half
## 2.3: merge the 2 sorted parts into a
## sorted result
return (result)
}

Breaking Down one of the Steps

Each substep of the above needs to be expanded. First, we expand step 2.1.

# 2.1: split x in half
len <- length(x)
x1 <- x[1:(len / 2)]
x2 <- x[(len / 2 + 1):len]

Caution: check your code

x <- c(4,1,3,2)
len <- length(x)
x1 <- x[1:(len / 2)]
x2 <- x[(len / 2 + 1):len]
x1
[1] 4 1
x2
[1] 3 2
# ok, for this example

Caution: check your code

x <- c(4,1,3,2,0)
len <- length(x)
x1 <- x[1:(len / 2)]
x2 <- x[(len / 2 + 1):len]
x1
[1] 4 1
x2
[1] 3 2
# not ok, for this example -- code needs to be fixed

Caution: Boundary Cases can be Different

Be careful with “edge” cases; usually, we expect to sort a vector containing more than one element, but our sort function should be able to handle the simple problem of sorting a single element.

The code above does not handle len < 2 properly.

We must try again, fixing step 2.1. The solution is simple: if the length of x is 0 or 1, our function should simply return x.

Otherwise, we proceed to split x and sort as above. This affects code outside of step 2.1, so we need to correct our outline.

Revised Program

Here is the new outline, including the new step 2.1:

# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
   # Check for a vector that doesn't need sorting
   len <- length(x)
   if (len < 2) result <- x
   else {
     # 2: sort x into result
         # 2.1: split x in half
         y <- x[1:(len / 2)]
         z <- x[-(1:(len / 2))]
           ## 2.2: sort y and z
           ## 2.3: merge y and z into a sorted result
        }
return(result)
}

Further Expansion

Step 2.2 is very easy to expand, because we can make use of our mergesort() function, even though we haven't written it yet!

The key idea is to remember that we are not executing the code at this point, we are designing it.

We should assume our design will eventually be successful, and we will be able to make use of the fruits of our labour.

Further Expansion

So step 2.2 becomes

# 2.2: sort y and z
y <- mergesort(y)
z <- mergesort(z)

Further Expansion

Step 2.3 is more complicated, so let's take it slowly.

We know that we will need a result vector, but let's describe the rest of the process before we code it.

We repeat the whole function here, including this expansion and the expansion of step 2.2:

Further Expansion

# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
# Check for a vector that doesn't need sorting
len <- length(x)
if (len < 2) result <- x
else {
  # 2: sort x into result
    # 2.1: split x in half
    y <- x[1:(len / 2)]
    z <- x[-(1:(len / 2))]
# 2.2: sort y and z
   y <- mergesort(y)
   z <- mergesort(z)
# 2.3: merge y and z into a sorted result
   result <- c()
## 2.3.1: while (some are left in both piles)
## 2.3.2: put the smallest first element on the end
## 2.3.3: remove it from y or z
## 2.3.4: put the leftovers onto the end of result
}
return(result)
}

Further Expansion

Steps 2.3.2 and 2.3.3 both depend on the test of which of y[1] and z[1] is smallest.

 # 1. Use a merge sort to sort a vector
 mergesort <- function (x) {
 # Check for a vector that doesn't need sorting
  len <- length(x)
  if (len < 2) result <- x
  else {
  # 2: sort x into result
  # 2.1: split x in half
  y <- x[1:(len / 2)]
  z <- x[-(1:(len / 2))]
  # 2.2: sort y and z
  y <- mergesort(y)
  z <- mergesort(z)
  # 2.3: merge y and z into a sorted result
  result <- c()
  # 2.3.1: while (some are left in both piles)

Further Expansion

while (min(length(y), length(z)) > 0) {
  # 2.3.2: put the smallest first element on the + # 2.3.3: remove it from y or z
  if (y[1] < z[1]) {
  result <- c(result, y[1])
  y <- y[-1]
  } else {
  result <- c(result, z[1])
  z <- z[-1]
  }
  }
  # 2.3.4: put the leftovers onto the end of result
  if (length(y) > 0)
  result <- c(result, y)
  else
  result <- c(result, z)
  }
  return(result)
  }

Debugging and Maintenance

Computer errors are called “bugs”.

Removing these errors from a program is called “debugging”.

Debugging is difficult, and one of our goals is to write programs that don't have bugs in them: but sometimes we make mistakes.

Debugging and Maintenance

We have found that the following five steps help us to find and fix bugs in our own programs:

  • Recognize that a bug exists.
  • Make the bug reproducible.
  • Identify the cause of the bug.
  • Fix the error and test.
  • Look for similar errors. We will consider each of these in turn.

Recognizing that a bug exists

Sometimes this is easy; if the program doesn't work, there is a bug. However, in other cases the program seems to work, but the output is incorrect, or the program works for some inputs, but not for others.

A bug causing this kind of error is much more difficult to recognize.

There are several strategies to make it easier.

Recognizing that a bug exists

First, follow the advice given earlier, and break up your program into simple, self-contained functions.

Document their inputs and outputs.

Within the function, test that the inputs obey your assumptions about them, and think of test inputs where you can see at a glance whether the outputs match your expectations.

Recognizing that a bug exists

In some situations, it may be worthwhile writing two versions of a function: one that may be too slow to use in practice, but which you are sure is right, and another that is faster but harder to be sure about.

Test that both versions produce the same output in all situations.

Recognizing that a bug exists

When errors only occur for certain inputs, our experience shows that those are often what are called “edge cases”: situations which are right on the boundary between legal and illegal inputs.

Test those! For example, test what happens when you try a vector of length zero, test very large or very small values, etc.

Make the bug reproducible

Before you can fix a bug, you need to know where things are going wrong. This is much easier if you know how to trigger the bug.

Bugs that only appear unpredictably are extremely difficult to fix. The good news is that for the most part computers are predictable: if you give them the same inputs, they give you the same outputs.

The difficulty is in working out what the necessary inputs are

Make the bug reproducible

For example, a common mistake in programming is to misspell the name of a variable.

Normally this results in an immediate error message, but sometimes you accidentally choose a variable that actually does exist.

Then you'll probably get the wrong answer, and the answer you get may appear to be random, because it depends on the value in some unrelated variable.

Make the bug reproducible

The key to tracking down this sort of problem is to work hard to make the error reproducible.

Simplify things as much as possible: start a new empty R session, and see if you can reproduce it.

Once you can reproduce the error, you will eventually be able to track it down.

Some programs do random simulations.

For those, you can make the simulations reproducible by setting the value of the random number seed at the start.

Identify the cause of the bug

When you have confirmed that a bug exists, the next step is to identify its cause.

If your program has stopped with an error, read the error messages.

Try to understand them as well as you can.

Trouble-shooting

The simplest way to do this is to edit your functions to add statements like this:

cat("In cv, x=", x, "\n")

This will print the value of x, identifying where the message is coming from. The “\n” at the end tells R to go to a new line after printing.

Trouble-shooting

You may want to use print() rather than cat() to take advantage of its formatting, but remember that it can only print one thing at a time, so you would likely use it as

cat("In cv, x=\n")
print(x)

Trouble-shooting

Another way to understand what is going wrong in a small function is to simulate it by hand.

Act as you think R would act, and write down the values of all variables as the function progresses.

Fixing errors and testing

Once you have identified the bug in your program, you need to fix it.

Try to fix it in such a way that you don't cause a different problem.

Then test what you've done.

You should put together tests that include the way you know that would reproduce the error, as well as edge cases, and anything else you can think of.

The debug() Function

Rather than using cat() or print() for debugging, R allows you to call the function debug(). This will pause execution of your function, and allow you to examine (or change!) local variables, or execute any other R command, inside the evaluation environment of the function.

The debug() Function

Commands to use with debug() are

  • n - “next”; execute the next line of code, single-stepping through the function
  • c - “continue”; let the function continue running
  • Q - quit the debugger

You mark function f for debugging using debug(f), and then the browser will be called when you enter the function. Turn off debugging using undebug(f).

Example - Constructing and Debugging a Function

We will write and debug a function which will compute a confidence interval for the true mean of a population, based on a random sample of size n using the formula

\[ \bar{x}\pm t_{\alpha/2,n-1}s/\sqrt{n} \] where \( \bar{x} \) is the sample mean and s is the sample standard deviation, and the t value is the \( 1 -\alpha/2 \) percentile of the t distribution on n - 1 degrees of freedom.

Writing a Confidence Interval Function

Our goal is to write a function which will take input like x such as some male heights:

x <- c(170, 185, 177, 160)

and print to the screen an interval, i.e. a vector of length 2. We will also want to be able to print out confidence intervals having different \( \alpha \) levels, so alpha should also be an argument:

Writing a Confidence Interval Function

ci <- function (x, alpha=.05) {
.... commands .....
interval
}
ci(x) # this should print out a 95%
# confidence interval for the true mean

Solving the Problem

The confidence interval formula requires:

  • the sample mean which we can compute with mean(x)
  • the sample standard deviation (sd(x))
  • the t percentiles (qt(c(alpha/2, 1-alpha/2), df))
  • the square root of n (sqrt(n))

Implementing the Solution

Here is a first attempt at implementing the solution to the problem:

ci <- function (x, alpha=.05) {
xbar <- mean(x)
s <- sd(x)
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df)*
s/sqrt(n)
interval
}

Testing the Function

Here is a first test for our ci function. Use the data vector of heights:


x <- c(170, 185, 177, 160)
ci(x)

Error in qt(p, df, lower.tail, log.p) :
Non-numeric argument to mathematical function

Something is wrong. One of the arguments to qt is incorrect

Looking for the Error

We can add a print statement immediately before the call to qt:

ci <- function (x, alpha=.05) {
xbar <- mean(x)
s <- sd(x)
print(list(alpha=alpha, df=df))
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df)*
s/sqrt(n)
interval
}

By using a list in the print() command we are not restricting ourselves to numeric output.

Looking for the Error

  ci(x)

$alpha
[1] 0.05
$df
function (x, df1, df2, ncp, log = FALSE)
{
if (missing(ncp))
.Internal(df(x, df1, df2, log))
else .Internal(dnf(x, df1, df2, ncp, log))
}
<environment: namespace:stats>
Error in qt(p, df, lower.tail, log.p) :
Non-numeric argument to mathematical function

The df argument to qt should be set to n-1.

Another Attempt

ci <- function (x, alpha=.05) {
xbar <- mean(x)
s <- sd(x)
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df=n-1)*
s/sqrt(n)
interval
}

Re-test

 ci(x)
Error in qt(c(alpha/2, 1 - alpha/2), df = n - 1) :
object 'n' not found

Now, we realize that we forgot to assign length(x) to n.

Another Attempt

ci <- function (x, alpha=.05) {
xbar <- mean(x)
s <- sd(x)
n <- length(x)
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df=n-1)*s/sqrt(n)
interval
}

Test again:

ci(x)
[1] 156.11 189.89

Checking the Boundary Case

Although we should not compute confidence intervals for sample sizes less than 2, it might happen by accident:

ci(3)
[1] NaN NaN

Checking the Boundary Case

Again, we can handle this boundary case with an if statement.

ci <- function (x, alpha=.05) {
if (length(x) < 2) {
stop("Sample size must be larger than 1.")
} else {
xbar <- mean(x)
s <- sd(x)
n <- length(x)
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df=n-1)*s/sqrt(n)
interval
}
}

Our Function Can Now be Used Elsewhere

Now that ci() is a function that is known to work on numeric vectors of any length, we can call it from other functions. For example, the following function uses the ci() function to compute confidence intervals for all vectors in a matrix or list, as well as for a single vector.

CI <- function(xy, alpha=.05) {
   if(is.matrix(xy)) xy <- data.frame(xy)
   if(is.list(xy)) {
     sapply(xy, ci, alpha=alpha)
    } else {
    ci(xy, alpha=alpha)
   }
}

Testing the CI Function

This function needs to be tested on vectors, lists (including data frames) and matrices:

x
[1] 170 185 177 160
CI(x)
[1] 156.11 189.89

Testing with a Matrix

(xy <- matrix(rnorm(1:20),nrow = 4))
           [,1]        [,2]        [,3]       [,4]       [,5]
[1,] -1.3621684  0.77849257 -0.05460643  1.6231356 -0.5329873
[2,] -0.7544070  0.33646639  0.03606347 -0.1738255  1.1019822
[3,] -0.1496062 -1.99359755 -0.33244989 -1.3524755 -1.1108980
[4,]  1.3960598 -0.08034031 -1.24620332 -1.4993081 -0.9684080
CI(xy)
            X1        X2         X3        X4        X5
[1,] -2.101797 -2.182131 -1.3317020 -2.647382 -1.995127
[2,]  1.666736  1.702642  0.5331039  1.946145  1.239972

Testing with a Matrix

CI(xy)
            X1        X2         X3        X4        X5
[1,] -2.101797 -2.182131 -1.3317020 -2.647382 -1.995127
[2,]  1.666736  1.702642  0.5331039  1.946145  1.239972