4 Programming basics

4.1 Functions

In addition to functions already defined in R, such as mean() or read.table(), you can write your own functions to perform repetitive tasks. Using functions has multiple advantages:

  • you avoid the mistakes that creep in when copy-pasting multiple lines of code
  • when new output is needed, you only need to change the function (not the multiple copies of code)
  • your code becomes better structured, more readable, and easier to follow and understand

4.1.1 Motivating example

Imagine you want to calculate the difference between “males” and “females” in the mean of “BILL_AMT6” for each of the files “cd#”.


Approach using copy-paste

cd <- read.table("cd1.csv", header=TRUE, sep=",")
diff_mean <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
print(diff_mean)
## [1] 2895.431
cd <- read.table("cd2.csv", header=TRUE, sep=",")
diff_mean <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
print(diff_mean)
## [1] 950.9994
cd <- read.table("cd3.csv", header=TRUE, sep=",")
diff_mean <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
print(diff_mean)
## [1] 880.2368
cd <- read.table("cd4.csv", header=TRUE, sep=",")
diff_mean <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
print(diff_mean)
## [1] 2395.319
cd <- read.table("cd5.csv", header=TRUE, sep=",")
diff_mean <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
print(diff_mean)
## [1] 3580.756
cd <- read.table("cd6.csv", header=TRUE, sep=",")
diff_mean <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
print(diff_mean)
## [1] 2913.215

Approach using functions and loops

# approach using functions
process_file <- function(file_name){
  cd <- read.table(file_name, header=TRUE, sep=",")
  diff_mean <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
  print(diff_mean)
}

for (fname in c("cd1.csv", "cd2.csv", "cd3.csv", "cd4.csv", "cd5.csv", "cd6.csv")) {
  process_file(fname)
}
## [1] 2895.431
## [1] 950.9994
## [1] 880.2368
## [1] 2395.319
## [1] 3580.756
## [1] 2913.215

Now imagine I ask you to calculate the same difference for the variable “BILL_AMT5” as well. In the copy-paste approach, you need to add the lines for calculating and printing the new result to all 6 copies of the code.

cd <- read.table("cd1.csv", header=TRUE, sep=",")
diff_mean6 <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
print(diff_mean6)
## [1] 2895.431
diff_mean5 <- mean(cd$BILL_AMT5[cd$SEX=="male"])-mean(cd$BILL_AMT5[cd$SEX=="female"])
print(diff_mean5)
## [1] 2304.153
# copy paste this for cd2, cd3 etc.

In the function approach, you only need to change the function once:

process_file <- function(file_name){
  cd <- read.table(file_name, header=TRUE, sep=",")
  diff_mean6 <- mean(cd$BILL_AMT6[cd$SEX=="male"])-mean(cd$BILL_AMT6[cd$SEX=="female"])
  print(diff_mean6)
  diff_mean5 <- mean(cd$BILL_AMT5[cd$SEX=="male"])-mean(cd$BILL_AMT5[cd$SEX=="female"])
  print(diff_mean5)
}

for (fname in c("cd1.csv", "cd2.csv", "cd3.csv", "cd4.csv", "cd5.csv", "cd6.csv")) {
  process_file(fname)
}
## [1] 2895.431
## [1] 2304.153
## [1] 950.9994
## [1] 650.6265
## [1] 880.2368
## [1] 1672.845
## [1] 2395.319
## [1] 3090.049
## [1] 3580.756
## [1] 3815.046
## [1] 2913.215
## [1] 2573.274
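
The function above still hard-codes the column names. As a minimal sketch (not part of the original example), the column name could be passed as an argument, so that a new variable requires no change to the function at all. The name diff_by_sex, its arguments, and the small demo data frame are hypothetical and chosen only to keep the snippet self-contained; the real cd#.csv files are not used.

```r
# Hypothetical helper: difference in the mean of one column
# between males and females
diff_by_sex <- function(data, col_name){
  male_mean <- mean(data[data$SEX == "male", col_name])
  female_mean <- mean(data[data$SEX == "female", col_name])
  return(male_mean - female_mean)
}

# Made-up demo data standing in for one of the cd#.csv files
demo <- data.frame(
  SEX = c("male", "female", "male", "female"),
  BILL_AMT5 = c(100, 50, 300, 150),
  BILL_AMT6 = c(200, 100, 400, 100)
)

for (col in c("BILL_AMT5", "BILL_AMT6")) {
  cat("Difference in mean of", col, "is", diff_by_sex(demo, col), "\n")
}
## Difference in mean of BILL_AMT5 is 100 
## Difference in mean of BILL_AMT6 is 200 
```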

4.1.2 Structure of a function

function_name <- function(arg1, arg2, ...){
  function_body
  return(return_value)
}

A function is a special R object.

  1. First pick a name for the function; follow standard R naming conventions and avoid names that already exist in R.
  2. Then use the keyword function and, within the parentheses (), list the arguments the function operates on. In the motivating example, we used only one argument, file_name.
  3. Immediately after, put curly braces {} and define the function body containing the commands the function shall execute.
  4. Use the return() command to tell the function what it shall return as its result (other than printing).

For example:

addition <- function(x,y){
  return(x+y)
}

add_result <- addition(3,5)
add_result
## [1] 8

You can ask the function to print some explanatory text about the calculation. Use print() for simple printing and cat() for printing concatenated text. The special symbol \n indicates the end of a line and tells R to start the next printed output on a new line.

print_addition <- function(x,y){
  print("This function adds two numbers.")
  cat("The result of adding", x, "and", y, "is", x+y, "\n")
}

add_result <- print_addition(3,5)
## [1] "This function adds two numbers."
## The result of adding 3 and 5 is 8
add_result
## NULL

Note that add_result now contains NULL instead of the result of the calculation. This is because we didn’t use the return() command: without return(), a function returns the value of the last evaluated expression, which here is the cat() call, and cat() returns NULL.
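
To both print the explanation and keep the result, the function can print first and return the sum afterwards. The following is a small sketch; the name print_and_add is made up for illustration.

```r
# Hypothetical variant: prints a message and also returns the sum
print_and_add <- function(x, y){
  result <- x + y
  cat("The result of adding", x, "and", y, "is", result, "\n")
  return(result)
}

add_result <- print_and_add(3, 5)
add_result  # holds the number 8 rather than NULL
```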

4.1.2.1 Function arguments

You can use as many arguments as you want for a function. When calling the function, the user will then need to supply values for the arguments.

A function can have no arguments such as this one:

print_today <- function(){
  today = Sys.Date()
  print("Hello!")
  cat("Today is", format(today, format="%d %B %Y"), "\n")
}

print_today()
## [1] "Hello!"
## Today is 20 March 2019

Note that when calling the function print_today() we do not specify any argument values.

You can also set default values for some arguments. If the user does not provide a specific value, the function will use the default value.

For example:

increase_number <- function(num_to_increase, increase_by=1){
  result = num_to_increase + increase_by
  cat("When I increase", num_to_increase, "by", increase_by, "I get", result, "\n")
}

# provide only one argument value and use the default
increase_number(5)
## When I increase 5 by 1 I get 6
# provide two argument values, replace the default by requested value
increase_number(5, 3)
## When I increase 5 by 3 I get 8
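
Arguments can also be supplied by name, in which case their order no longer matters. A minimal sketch, assuming a variant of the function that returns the result instead of printing it (the name increase_value is hypothetical):

```r
# Hypothetical variant of increase_number() that returns its result
increase_value <- function(num_to_increase, increase_by=1){
  return(num_to_increase + increase_by)
}

increase_value(5)                                 # uses the default, gives 6
increase_value(5, increase_by=3)                  # overrides the default by name, gives 8
increase_value(increase_by=3, num_to_increase=5)  # named arguments in any order, gives 8
```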

4.2 For loops

When we need to repeat the same operation multiple times, instead of copy-pasting the commands multiple times, we can use for loops.

for (val in list_of_values){
  loop_body
}

To request R to iterate over some commands we use the keyword for. Within parentheses () we then specify the list_of_values over which R shall iterate. Finally, within the loop_body we specify the commands to be executed at every iteration.

For example, this simple loop will iterate over a list of years and will print out the year at every iteration.

list_of_years = c("2015", "2016", "2017", "2018")

for (yr in list_of_years){
  cat("Happy new year", yr, "\n")
}
## Happy new year 2015 
## Happy new year 2016 
## Happy new year 2017 
## Happy new year 2018

We can use loops to repeat a calculation over multiple variables in a dataset and store the results in a single vector.

First we create the list of columns to operate over:

cd <- read.table("cd1.csv", header=TRUE, sep=",")
col_list <- names(cd)[13:18]
col_list
## [1] "BILL_AMT1" "BILL_AMT2" "BILL_AMT3" "BILL_AMT4" "BILL_AMT5" "BILL_AMT6"

Next we create a numeric vector, initially filled with zeros, to store the final results in.

out_vector <- numeric(length(col_list))
out_vector
## [1] 0 0 0 0 0 0

Next we calculate the mean of each column in the list and store the result in the vector.

vec_idx = 1
for (col in col_list){
  col_mean <- mean(cd[,col])
  cat("Mean of", col, "is", col_mean, "\n")
  cat("This value will be stored in out_vector element with index", vec_idx, "\n")
  out_vector[vec_idx] <- col_mean
  vec_idx <- vec_idx + 1 # increase vector index for next iteration
}
## Mean of BILL_AMT1 is 49546.82 
## This value will be stored in out_vector element with index 1 
## Mean of BILL_AMT2 is 47455.59 
## This value will be stored in out_vector element with index 2 
## Mean of BILL_AMT3 is 44592.59 
## This value will be stored in out_vector element with index 3 
## Mean of BILL_AMT4 is 40108.57 
## This value will be stored in out_vector element with index 4 
## Mean of BILL_AMT5 is 38855.96 
## This value will be stored in out_vector element with index 5 
## Mean of BILL_AMT6 is 37439.45 
## This value will be stored in out_vector element with index 6
# check the content of the output vector
out_vector
## [1] 49546.82 47455.59 44592.59 40108.57 38855.96 37439.45

Observe how in the for loop we increase the value of vec_idx at the end of each iteration.
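
An alternative that avoids updating a counter by hand is to loop over the positions in the list of columns rather than over the names. The sketch below uses the base R function seq_along() on a small made-up data frame; demo, demo_cols and demo_means are assumptions for illustration, not the cd1.csv data.

```r
# Made-up data with two columns
demo <- data.frame(
  BILL_AMT1 = c(10, 20, 30),
  BILL_AMT2 = c(5, 15, 25)
)
demo_cols <- names(demo)
demo_means <- numeric(length(demo_cols))

# i takes the values 1, 2, ..., length(demo_cols)
for (i in seq_along(demo_cols)){
  demo_means[i] <- mean(demo[, demo_cols[i]])
}
demo_means
## [1] 20 15
```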


We can use for loops to iterate over commands and functions, but we can also use for loops within a function body to make the function repeat some calculation several times.

For example:

# define the function
print_years <- function(list_of_years){
  for (yr in list_of_years){
    cat("Happy new year", yr, "\n")
  }
}

# create a list to iterate over
list_years = c("2015", "2016", "2017", "2018")

# call the function with the list as argument
print_years(list_years)
## Happy new year 2015 
## Happy new year 2016 
## Happy new year 2017 
## Happy new year 2018

4.3 Conditions

Just as we selected rows or elements of a data.frame using logical expressions, we can use logical expressions to decide which commands R executes.

if (condition) {
  TRUE_body
} else {
  FALSE_body
}

We begin the conditional execution with the keyword if. The condition must be an expression that evaluates to TRUE or FALSE. If it evaluates to TRUE, the commands in the TRUE_body are executed; otherwise the commands in the FALSE_body are executed.

For the condition you can use the standard logical comparison operators:

Operator  Meaning
==        equal to
>         greater than
>=        greater than or equal to
<         less than
<=        less than or equal to

A trivial example:

if (3 > 1) {
  print("Yes, this is true.")
} else {
  print("No, this is not true.")
}
## [1] "Yes, this is true."
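
Each of the comparison operators returns TRUE or FALSE, which can be checked directly at the console before using the expression inside an if statement. A short illustration with arbitrary numbers (a and b are placeholders):

```r
a <- 3
b <- 5

a == b   # FALSE
a > b    # FALSE
a >= 3   # TRUE
a < b    # TRUE
a <= b   # TRUE
```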

Obviously, you may want to use this for something more useful such as:

if (sum(cd$BILL_AMT6) < sum(cd$BILL_AMT5)) {
  print("The total bill decreased.")
} else {
  print("The total bill didn't decrease.")
}
## [1] "The total bill decreased."

You can create more complicated logical conditions by using || to indicate or, and && to indicate and.

if (sum(cd$BILL_AMT6) < sum(cd$BILL_AMT5) && sum(cd$BILL_AMT5) < sum(cd$BILL_AMT4)) {
  print("The total bill decreased twice in a row.")
} else {
  print("The total bill didn't decrease twice in a row.")
}
## [1] "The total bill decreased twice in a row."
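
The || operator works the same way, except that the TRUE_body runs when at least one of the two comparisons holds. A sketch with made-up totals (the three variables are hypothetical stand-ins for the column sums above):

```r
# Made-up totals for three consecutive bills
total4 <- 1200
total5 <- 1000
total6 <- 1100

# The first comparison is FALSE, the second is TRUE, so the whole condition is TRUE
if (total6 < total5 || total5 < total4) {
  msg <- "The total bill decreased at least once."
} else {
  msg <- "The total bill never decreased."
}
print(msg)
## [1] "The total bill decreased at least once."
```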

You do not have to include the else statement. In that case, if the condition does not evaluate to TRUE, nothing is executed.

if (3 < 1) {
  print("Yes, this is true.")
}

Note that running the piece of code above results in no output.


You can use if conditions within functions. For example:

# define comparison function
compare_numbers <- function(a,b){
  if (a<b){
    cat(a, "is smaller than", b, "\n")
  } else {
    cat(a, "is NOT smaller than", b, "\n")
  }
}

# use comparison function to compare two numbers
compare_numbers(3,5)
## 3 is smaller than 5
compare_numbers(3,1)
## 3 is NOT smaller than 1

You can use more than one condition to decide which part of the code to execute by using an else if statement.

# define comparison function
compare_numbers <- function(a,b){
  if (a<b){
    cat(a, "is smaller than", b, "\n")
  } else if (a==b){
    cat(a, "is equal to", b, "\n")
  } else {
    cat(a, "is bigger than", b, "\n")
  }
}

# use comparison function to compare two numbers
compare_numbers(3,1)
## 3 is bigger than 1
compare_numbers(3,3)
## 3 is equal to 3
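
Functions, loops and conditions can be combined. As a closing sketch, the hypothetical function count_bigger() below loops over a vector and uses an if statement to count how many elements exceed a threshold; the function name and the example numbers are made up for illustration.

```r
# Hypothetical helper: count the elements of values larger than threshold
count_bigger <- function(values, threshold){
  n_bigger <- 0
  for (val in values){
    if (val > threshold){
      n_bigger <- n_bigger + 1
    }
  }
  return(n_bigger)
}

count_bigger(c(1, 5, 10, 20), 4)
## [1] 3
```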