Ken Jinkun Xiao
12 Feb 2015
In order to program the program effectively, the following general guidelines would helpful here:
We wish to write a program which will sort a vector of integers into increasing order.
Start with a specific case, usually simple, but not too simple.
Sometimes, you might try to solve the problem on your own, without the computer.
Consider sorting the vector consisting of the elements 3, 5, 24, 6, 2, 4, 13, 1.
Our goal is to write a function called bubblesort() for which we could do the following:
x <- c(3, 5, 24, ..., 1)
bubblesort(x)
The output should be the numbers in increasing order: 1, 2, 3, 4, 5, 6, 13, 24.
A first idea might be to find where the smallest value is, and record it.
Repeat, with the remaining values, recording the smallest value each time.
Repeat …
This might be time-consuming.
An alternative idea: compare successive pairs of values, starting at the beginning of the vector, and running through to the end.
Swap pairs if they are out of order.
Try using this idea on 2; 1; 4; 3; 0, for example.
After running through it, you should end up with 1; 2; 3; 0; 4.
This method doesn't give the solution, directly.
In checking the alternate idea, notice that the largest value always lands at the end of the new vector. (Can you prove to yourself that this should always happen?)
This means that we can sort the vector by starting at the beginning of the vector, go through all adjacent pairs.
Then repeat this procedure for all but the last value, and so on.
At this point, we need to address specific coding questions.
e.g. How do we swap x[i] and x[i+1]?
Here is a way to swap the value of x[3] with that of x[4]:
save <- x[3]
x[3] <- x[4]
x[4] <- save
Note that you should not over-write the value of x[3] with the value of x[4] before its old value has been saved in another place; otherwise, you will not be able to assign that value to x[4].
bubblesort <- function(x) {
# x is initially the input vector and will be
# modified to form the output
# last is the last element to compare with
for(last in length(x):2) {
for(first in 1:(last - 1)) {
if(x[first] > x[first + 1]) { # swap the pair
save <- x[first]
x[first] <- x[first + 1]
x[first + 1] <- save
}
}
}
return (x)
}
Always begin testing your code on simple examples to identify obvious bugs.
bubblesort(c(2, 1))
[1] 1 2
bubblesort(c(2, 24, 3, 4, 5, 13, 6, 1))
[1] 1 2 3 4 5 6 13 24
Try the code on several other numeric vectors. What is the output when the input vector has length 1?
bubblesort(1)
The problem is that when length(x) == 1, the value of last will take on the values 1:2, rather than no values at all.
This doesn't require a redesign of the function; we can fix it by handling this as a special case at the beginning of our function:
bubblesort <- function(x) {
# x is initially the input vector and will be
# modified to form the output
if (length(x) < 2) return (x)
# last is the last element to compare with
for(last in length(x):2) {
for(first in 1:(last - 1)) {
if(x[first] > x[first + 1]) { # swap the pair
save <- x[first]
x[first] <- x[first + 1]
x[first + 1] <- save
}
}
}
return (x)
}
Test the new version:
bubblesort(1)
[1] 1
Working out the detailed implementation of a program can appear to be a daunting task. The key to making it manageable is to break it down into smaller pieces which you know how to solve.
One strategy for doing that is known as “top-down design”. Top-down design is similar to outlining an essay before filling in the details:
The sort algorithm just described is known as a “bubble sort”. The bubble sort is easy to program and is efficient when the vector x is short, but when x is longer, more efficient methods are available.
One of these is known as a “merge sort”.
The general idea of a merge sort is to split the vector into two halves, sort each half, and then merge the two halves.
During the merge, we only need to compare the first elements of each sorted half to decide which is the smallest value over all.
Remove that value from its half; then the second value becomes the smallest remaining value in this half, and we can proceed to put the two parts together into one sorted vector.
So how do we do the initial sorting of each half?
We could use a bubble sort, but a more elegant procedure is to use a merge sort on each of them.
This is an idea called recursion.
The mergesort() function which we will write below can make calls to itself.
Because of variable scoping, new copies of all of the local variables will be created each time it is called, and the different calls will not interfere with each other.
It is often worthwhile to consider small numerical examples in order to ensure that we understand the basic idea of the algorithm, before we proceed to designing it in detail.
For example, suppose x is [8, 6, 7, 4], and we want to construct a sorted result r.
Then our merge sort would proceed as follows:
It is helpful to think of the translation process as a stepwise process of refining a program until it works.
We begin with a general statement, and gradually expand each part.
We will use a double comment marker ## to mark descriptive lines that still need expansion. We will number these comments so that we can refer to them in the slides; in practice, you would probably not find this necessary.
After expanding, we will change to the usual comment marker to leave our description in place.
We start with just one aim, which we can use as our first descriptive line:
## 1. Use a merge sort to sort a vector
We will gradually expand upon previous steps, adding in detail as we go.
An expansion of step 1 follows from recognizing that we need an input vector x which will be processed by a function that we are naming mergesort.
Somehow, we will sort this vector.
In the end, we want the output to be returned:
# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
## 2: sort x into result
return (result)
}
We now expand step 2, noting how the merge sort algorithm proceeds:
# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
# 2: sort x into result
## 2.1: split x in half
## 2.2: sort each half
## 2.3: merge the 2 sorted parts into a
## sorted result
return (result)
}
Each substep of the above needs to be expanded. First, we expand step 2.1.
# 2.1: split x in half
len <- length(x)
x1 <- x[1:(len / 2)]
x2 <- x[(len / 2 + 1):len]
x <- c(4,1,3,2)
len <- length(x)
x1 <- x[1:(len / 2)]
x2 <- x[(len / 2 + 1):len]
x1
[1] 4 1
x2
[1] 3 2
# ok, for this example
x <- c(4,1,3,2,0)
len <- length(x)
x1 <- x[1:(len / 2)]
x2 <- x[(len / 2 + 1):len]
x1
[1] 4 1
x2
[1] 3 2
# not ok, for this example -- code needs to be fixed
Be careful with “edge” cases; usually, we expect to sort a vector containing more than one element, but our sort function should be able to handle the simple problem of sorting a single element.
The code above does not handle len < 2 properly.
We must try again, fixing step 2.1. The solution is simple: if the length of x is 0 or 1, our function should simply return x.
Otherwise, we proceed to split x and sort as above. This affects code outside of step 2.1, so we need to correct our outline.
Here is the new outline, including the new step 2.1:
# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
# Check for a vector that doesn't need sorting
len <- length(x)
if (len < 2) result <- x
else {
# 2: sort x into result
# 2.1: split x in half
y <- x[1:(len / 2)]
z <- x[-(1:(len / 2))]
## 2.2: sort y and z
## 2.3: merge y and z into a sorted result
}
return(result)
}
Step 2.2 is very easy to expand, because we can make use of our mergesort() function, even though we haven't written it yet!
The key idea is to remember that we are not executing the code at this point, we are designing it.
We should assume our design will eventually be successful, and we will be able to make use of the fruits of our labour.
So step 2.2 becomes
# 2.2: sort y and z
y <- mergesort(y)
z <- mergesort(z)
Step 2.3 is more complicated, so let's take it slowly.
We know that we will need a result vector, but let's describe the rest of the process before we code it.
We repeat the whole function here, including this expansion and the expansion of step 2.2:
# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
# Check for a vector that doesn't need sorting
len <- length(x)
if (len < 2) result <- x
else {
# 2: sort x into result
# 2.1: split x in half
y <- x[1:(len / 2)]
z <- x[-(1:(len / 2))]
# 2.2: sort y and z
y <- mergesort(y)
z <- mergesort(z)
# 2.3: merge y and z into a sorted result
result <- c()
## 2.3.1: while (some are left in both piles)
## 2.3.2: put the smallest first element on the end
## 2.3.3: remove it from y or z
## 2.3.4: put the leftovers onto the end of result
}
return(result)
}
Steps 2.3.2 and 2.3.3 both depend on the test of which of y[1] and z[1] is smallest.
# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
# Check for a vector that doesn't need sorting
len <- length(x)
if (len < 2) result <- x
else {
# 2: sort x into result
# 2.1: split x in half
y <- x[1:(len / 2)]
z <- x[-(1:(len / 2))]
# 2.2: sort y and z
y <- mergesort(y)
z <- mergesort(z)
# 2.3: merge y and z into a sorted result
result <- c()
# 2.3.1: while (some are left in both piles)
while (min(length(y), length(z)) > 0) {
# 2.3.2: put the smallest first element on the + # 2.3.3: remove it from y or z
if (y[1] < z[1]) {
result <- c(result, y[1])
y <- y[-1]
} else {
result <- c(result, z[1])
z <- z[-1]
}
}
# 2.3.4: put the leftovers onto the end of result
if (length(y) > 0)
result <- c(result, y)
else
result <- c(result, z)
}
return(result)
}
Computer errors are called “bugs”.
Removing these errors from a program is called “debugging”.
Debugging is difficult, and one of our goals is to write programs that don't have bugs in them: but sometimes we make mistakes.
We have found that the following five steps help us to find and fix bugs in our own programs:
Sometimes this is easy; if the program doesn't work, there is a bug. However, in other cases the program seems to work, but the output is incorrect, or the program works for some inputs, but not for others.
A bug causing this kind of error is much more difficult to recognize.
There are several strategies to make it easier.
First, follow the advice given earlier, and break up your program into simple, self-contained functions.
Document their inputs and outputs.
Within the function, test that the inputs obey your assumptions about them, and think of test inputs where you can see at a glance whether the outputs match your expectations.
In some situations, it may be worthwhile writing two versions of a function: one that may be too slow to use in practice, but which you are sure is right, and another that is faster but harder to be sure about.
Test that both versions produce the same output in all situations.
When errors only occur for certain inputs, our experience shows that those are often what are called “edge cases”: situations which are right on the boundary between legal and illegal inputs.
Test those! For example, test what happens when you try a vector of length zero, test very large or very small values, etc.
Before you can fix a bug, you need to know where things are going wrong. This is much easier if you know how to trigger the bug.
Bugs that only appear unpredictably are extremely difficult to fix. The good news is that for the most part computers are predictable: if you give them the same inputs, they give you the same outputs.
The difficulty is in working out what the necessary inputs are
For example, a common mistake in programming is to misspell the name of a variable.
Normally this results in an immediate error message, but sometimes you accidentally choose a variable that actually does exist.
Then you'll probably get the wrong answer, and the answer you get may appear to be random, because it depends on the value in some unrelated variable.
The key to tracking down this sort of problem is to work hard to make the error reproducible.
Simplify things as much as possible: start a new empty R session, and see if you can reproduce it.
Once you can reproduce the error, you will eventually be able to track it down.
Some programs do random simulations.
For those, you can make the simulations reproducible by setting the value of the random number seed at the start.
When you have confirmed that a bug exists, the next step is to identify its cause.
If your program has stopped with an error, read the error messages.
Try to understand them as well as you can.
The simplest way to do this is to edit your functions to add statements like this:
cat("In cv, x=", x, "\n")
This will print the value of x, identifying where the message is coming from. The “\n” at the end tells R to go to a new line after printing.
You may want to use print() rather than cat() to take advantage of its formatting, but remember that it can only print one thing at a time, so you would likely use it as
cat("In cv, x=\n")
print(x)
Another way to understand what is going wrong in a small function is to simulate it by hand.
Act as you think R would act, and write down the values of all variables as the function progresses.
Once you have identified the bug in your program, you need to fix it.
Try to fix it in such a way that you don't cause a different problem.
Then test what you've done.
You should put together tests that include the way you know that would reproduce the error, as well as edge cases, and anything else you can think of.
Rather than using cat() or print() for debugging, R allows you to call the function debug(). This will pause execution of your function, and allow you to examine (or change!) local variables, or execute any other R command, inside the evaluation environment of the function.
Commands to use with debug() are
You mark function f for debugging using debug(f), and then the browser will be called when you enter the function. Turn off debugging using undebug(f).
We will write and debug a function which will compute a confidence interval for the true mean of a population, based on a random sample of size n using the formula
\[ \bar{x}\pm t_{\alpha/2,n-1}s/\sqrt{n} \] where \( \bar{x} \) is the sample mean and s is the sample standard deviation, and the t value is the \( 1 -\alpha/2 \) percentile of the t distribution on n - 1 degrees of freedom.
Our goal is to write a function which will take input like x such as some male heights:
x <- c(170, 185, 177, 160)
and print to the screen an interval, i.e. a vector of length 2. We will also want to be able to print out confidence intervals having different \( \alpha \) levels, so alpha should also be an argument:
ci <- function (x, alpha=.05) {
.... commands .....
interval
}
ci(x) # this should print out a 95%
# confidence interval for the true mean
The confidence interval formula requires:
Here is a first attempt at implementing the solution to the problem:
ci <- function (x, alpha=.05) {
xbar <- mean(x)
s <- sd(x)
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df)*
s/sqrt(n)
interval
}
Here is a first test for our ci function. Use the data vector of heights:
x <- c(170, 185, 177, 160)
ci(x)
Error in qt(p, df, lower.tail, log.p) :
Non-numeric argument to mathematical function
Something is wrong. One of the arguments to qt is incorrect
We can add a print statement immediately before the call to qt:
ci <- function (x, alpha=.05) {
xbar <- mean(x)
s <- sd(x)
print(list(alpha=alpha, df=df))
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df)*
s/sqrt(n)
interval
}
By using a list in the print() command we are not restricting ourselves to numeric output.
ci(x)
$alpha
[1] 0.05
$df
function (x, df1, df2, ncp, log = FALSE)
{
if (missing(ncp))
.Internal(df(x, df1, df2, log))
else .Internal(dnf(x, df1, df2, ncp, log))
}
<environment: namespace:stats>
Error in qt(p, df, lower.tail, log.p) :
Non-numeric argument to mathematical function
The df argument to qt should be set to n-1.
ci <- function (x, alpha=.05) {
xbar <- mean(x)
s <- sd(x)
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df=n-1)*
s/sqrt(n)
interval
}
Re-test
ci(x)
Error in qt(c(alpha/2, 1 - alpha/2), df = n - 1) :
object 'n' not found
Now, we realize that we forgot to assign length(x) to n.
ci <- function (x, alpha=.05) {
xbar <- mean(x)
s <- sd(x)
n <- length(x)
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df=n-1)*s/sqrt(n)
interval
}
Test again:
ci(x)
[1] 156.11 189.89
Although we should not compute confidence intervals for sample sizes less than 2, it might happen by accident:
ci(3)
[1] NaN NaN
Again, we can handle this boundary case with an if statement.
ci <- function (x, alpha=.05) {
if (length(x) < 2) {
stop("Sample size must be larger than 1.")
} else {
xbar <- mean(x)
s <- sd(x)
n <- length(x)
interval <- xbar + qt(c(alpha/2, 1-alpha/2), df=n-1)*s/sqrt(n)
interval
}
}
Now that ci() is a function that is known to work on numeric vectors of any length, we can call it from other functions. For example, the following function uses the ci() function to compute confidence intervals for all vectors in a matrix or list, as well as for a single vector.
CI <- function(xy, alpha=.05) {
if(is.matrix(xy)) xy <- data.frame(xy)
if(is.list(xy)) {
sapply(xy, ci, alpha=alpha)
} else {
ci(xy, alpha=alpha)
}
}
This function needs to be tested on vectors, lists (including data frames) and matrices:
x
[1] 170 185 177 160
CI(x)
[1] 156.11 189.89
(xy <- matrix(rnorm(1:20),nrow = 4))
[,1] [,2] [,3] [,4] [,5]
[1,] -1.3621684 0.77849257 -0.05460643 1.6231356 -0.5329873
[2,] -0.7544070 0.33646639 0.03606347 -0.1738255 1.1019822
[3,] -0.1496062 -1.99359755 -0.33244989 -1.3524755 -1.1108980
[4,] 1.3960598 -0.08034031 -1.24620332 -1.4993081 -0.9684080
CI(xy)
X1 X2 X3 X4 X5
[1,] -2.101797 -2.182131 -1.3317020 -2.647382 -1.995127
[2,] 1.666736 1.702642 0.5331039 1.946145 1.239972
CI(xy)
X1 X2 X3 X4 X5
[1,] -2.101797 -2.182131 -1.3317020 -2.647382 -1.995127
[2,] 1.666736 1.702642 0.5331039 1.946145 1.239972