In this script, you will learn the basic elements that you need for programming in R, that is, elements that go beyond mere arithmetic or queries on objects. These four elements are IF-ELSE-statements, FOR-loops, WHILE-loops and functions.
In an IF-statement, you base the next step of the code on a condition: a certain piece of code is only carried out depending on whether a prior expression is TRUE or FALSE. The condition is written in brackets () after the word “if”, and it has to be a logical expression that yields either TRUE or FALSE. Let us create two variables and compare them against each other:
a<-10
b<-20
a==b
## [1] FALSE
10==20
## [1] FALSE
a!=b
## [1] TRUE
The expression “a==b” is a logical expression that yields FALSE, so you can use it in an IF-statement. The code that follows the IF-statement is only executed if the expression is TRUE. That is why nothing is printed in the first case below, whereas in the second case the condition following the word “if” is TRUE and the print function is executed (print simply shows a value as output of the code, without changing any object):
if (a==b) print("a is equal to b")
if (a!=b) print("a is NOT equal to b")
## [1] "a is NOT equal to b"
You can define more than just one line to be executed in case that the IF-condition is TRUE by using curly brackets {}:
if (a!=b) {
print("a is NOT equal to b")
c <- a*b
print(c) }
## [1] "a is NOT equal to b"
## [1] 200
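You can combine the two separate IF-statements from above (one for a==b, one for a!=b) into one by including an ELSE-statement, which specifies what is to be done if the condition after the IF-statement is FALSE. To indicate that both statements belong together, you should put curly brackets {} around them. Otherwise, at the top level of a script, R considers the IF-statement complete at the end of its line and fails on a following line that starts with “else”: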
{ if (a==b) print("a is equal to b")
else print("a is NOT equal to b") }
## [1] "a is NOT equal to b"
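A common alternative layout, shown here as a suggestion rather than a rule, gives each branch its own curly brackets and writes “else” on the same line as the closing bracket of the IF-branch, so that no outer brackets are needed. The variable name message_text is made up for this illustration:

```r
a <- 10
b <- 20

# each branch has its own curly brackets; "else" follows the closing bracket on the same line
if (a == b) {
  message_text <- "a is equal to b"
} else {
  message_text <- "a is NOT equal to b"
}
print(message_text)
```

Because the text is stored in an object first, you can reuse it later in the code instead of only printing it.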
You can define multilayered conditions with ELSE-IF-statements:
a <- 10000
a
## [1] 10000
{ if (a==8) print("a is equal to 8")
else if (a==9) print("a is equal to 9")
else if (a==10) print("a is equal to 10")
else if (a==11) print("a is equal to 11")
else print("a is something different") }
## [1] "a is something different"
And you can nest IF- and ELSE-statements inside each other:
{ if (a==10)
{ if (b==20) print("a is equal to 10 and b is equal to 20")
else print("a is equal to 10 and b is NOT equal to 20")
}
else if (b==20) print("a is NOT equal to 10 and b is equal to 20")
else print("a is NOT equal to 10 and b is NOT equal to 20")
}
## [1] "a is NOT equal to 10 and b is equal to 20"
An IF-condition can be any kind of logical expression that yields TRUE or FALSE. That means you can also define more complex conditions:
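if (a==10 & b==20) # both conditions must be TRUE; a is 10000 here, so nothing is printed
print("a is equal to 10 AND b is equal to 20")
if (a==10 | b==10) # at least one condition must be TRUE; neither is, so nothing is printed
print("a is equal to 10 OR b is equal to 10")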
d <- seq(from=10, to=100, by=10)
{ if (a%in%d)
print("a is included in d")
else print("a is NOT included in d") }
## [1] "a is NOT included in d"
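An IF-condition needs exactly one TRUE or FALSE. The following sketch shows two things that can help with that: the operators && and ||, which are meant for single values, and the functions any() and all(), which reduce a whole logical vector to one value. The object names both, either and elementwise are assumptions made for this example:

```r
a <- 10000
b <- 20
d <- seq(from = 10, to = 100, by = 10)

# && and || compare single values and stop evaluating as soon as the result is known
both   <- (a == 10) && (b == 20)   # FALSE, because a is 10000
either <- (a == 10) || (b == 20)   # TRUE, because b is 20

# & and | work element by element, so a comparison on a vector yields a vector
elementwise <- d > 50 & d < 90     # TRUE for 60, 70 and 80

# any() reduces the vector to a single TRUE or FALSE that an IF-condition can use
if (any(elementwise)) print("at least one element of d lies between 50 and 90")
if (all(elementwise)) print("all elements of d lie between 50 and 90")
```

Only the first of the two print calls is executed, since some but not all elements of d fall into the range.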
Often, you may want to perform the same kind of operation several times, but change a certain parameter each time. Imagine you have 10 values, and you want to calculate a certain formula 10 times, each time using a different one of the 10 values. Then, instead of writing the respective code ten times, you can write a FOR-loop and only write the formula once. That means that you iterate over a variable. The name of the variable and the values it will take are specified in brackets () after the word “for”. All code that is to be executed iteratively has to be inside curly brackets {}.
Let’s have a look at an example: Here, we first define a vector e, and then we iterate over the variable i (you can use any valid variable name here, but i is often used as default). The expression “print(e[i])” is executed ten times, and each time, i changes: In the first round, i takes the value 1, in the second round it takes the value 2, etc. In each round, the i’th element of the vector e is printed, i.e. in the first round, you print the first element of e, that is e[1], which is 11; in the second round, you print e[2], that is 12, etc.
e <- c(11:20)
for (i in 1:10)
{ print(e[i]) }
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
This example is of course very trivial and you could just print the whole vector by calling e. But there will be other instances when you will not be able to avoid a FOR-loop.
e
## [1] 11 12 13 14 15 16 17 18 19 20
Let’s not just print a value but also run a calculation in the FOR-loop:
for (i in 1:10)
{ f <- e[i]^2
print(f) }
## [1] 121
## [1] 144
## [1] 169
## [1] 196
## [1] 225
## [1] 256
## [1] 289
## [1] 324
## [1] 361
## [1] 400
Again, this is something you could have also achieved by just typing:
e^2
## [1] 121 144 169 196 225 256 289 324 361 400
But consider the following case: in each of the ten steps, the whole vector e is multiplied by the squared value of the i’th element of e, so each iteration yields a vector of ten values as output. This is not as easy to write without a FOR-loop.
for (i in 1:10)
{ f <- e[i]^2
print(e*f) }
## [1] 1331 1452 1573 1694 1815 1936 2057 2178 2299 2420
## [1] 1584 1728 1872 2016 2160 2304 2448 2592 2736 2880
## [1] 1859 2028 2197 2366 2535 2704 2873 3042 3211 3380
## [1] 2156 2352 2548 2744 2940 3136 3332 3528 3724 3920
## [1] 2475 2700 2925 3150 3375 3600 3825 4050 4275 4500
## [1] 2816 3072 3328 3584 3840 4096 4352 4608 4864 5120
## [1] 3179 3468 3757 4046 4335 4624 4913 5202 5491 5780
## [1] 3564 3888 4212 4536 4860 5184 5508 5832 6156 6480
## [1] 3971 4332 4693 5054 5415 5776 6137 6498 6859 7220
## [1] 4400 4800 5200 5600 6000 6400 6800 7200 7600 8000
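If you want to keep these results instead of only printing them, one possible approach is to create an empty matrix before the loop and fill one row per iteration. This is only a sketch: the name result is an assumption, and seq_along(e) is used in place of 1:10 so that the loop adapts to the length of e:

```r
e <- 11:20

# prepare a matrix with one row per iteration and one column per element of e
result <- matrix(NA_real_, nrow = length(e), ncol = length(e))

for (i in seq_along(e)) {
  f <- e[i]^2
  result[i, ] <- e * f   # store the i'th output vector in the i'th row
}

result[1, ]   # the first row holds the vector printed in the first iteration above
```

Creating the matrix once before the loop avoids growing an object step by step, which tends to be slow for large data.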
You should only use FOR-loops when you cannot avoid them, as they are usually slower than R's vectorized operations such as e^2, i.e. when you have large datasets, a FOR-loop might take very long to finish. Nested FOR-loops, i.e. a FOR-loop inside another FOR-loop, are even more costly, but they might also be unavoidable from time to time. In a nested loop, the value of the variable of the “outer” loop stays the same until the “inner” loop has taken all possible values, and only then does the outer loop switch to the next value. Note that you should not use the same name for the iterator variables of both FOR-loops, because the inner loop would overwrite the value set by the outer loop. In the following example, the iterator variable of the inner loop is called j, whereas the iterator of the outer loop is called i:
g <- c(100, 200, 300)
for (i in 1:5)
{ for (j in 1:3)
{ print(paste(e[i], g[j], sep="/")) }}
## [1] "11/100"
## [1] "11/200"
## [1] "11/300"
## [1] "12/100"
## [1] "12/200"
## [1] "12/300"
## [1] "13/100"
## [1] "13/200"
## [1] "13/300"
## [1] "14/100"
## [1] "14/200"
## [1] "14/300"
## [1] "15/100"
## [1] "15/200"
## [1] "15/300"
for (i in 1:5)
{ for (j in 1:3)
{ print(g[j]) }
print(e[i]) }
## [1] 100
## [1] 200
## [1] 300
## [1] 11
## [1] 100
## [1] 200
## [1] 300
## [1] 12
## [1] 100
## [1] 200
## [1] 300
## [1] 13
## [1] 100
## [1] 200
## [1] 300
## [1] 14
## [1] 100
## [1] 200
## [1] 300
## [1] 15
The difference between a FOR- and a WHILE-loop is that the code inside the WHILE-loop is executed for as long as a certain condition is TRUE. So at the beginning of the WHILE-loop, instead of defining the values that a variable can take, you put a logical expression, similar to an IF-statement. The value of the variable then has to be changed inside the loop. If you do not do that, the condition never becomes FALSE and the loop runs forever! Once the value of the variable has been changed, the condition at the beginning is checked again, and only if it is still TRUE is the code inside the loop executed once more:
i <- 1 # initially set i to 1
while (i < 5) # check if i is smaller than 5
{ print(i) # print i
i <- i+1 } # increment i by 1
## [1] 1
## [1] 2
## [1] 3
## [1] 4
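A WHILE-loop is most useful when you do not know in advance how many iterations you need. The following sketch doubles a value until it reaches at least 1000. The names value, steps and max_steps are assumptions for this example, and max_steps is a hypothetical safety limit that ends the loop even if the main condition never becomes FALSE:

```r
value <- 1
steps <- 0
max_steps <- 100   # hypothetical safety limit against an infinite loop

while (value < 1000 && steps < max_steps) {
  value <- value * 2    # change the variable that the condition depends on
  steps <- steps + 1    # count how many times the loop has run
}

print(value)   # 1024, the first power of 2 that is not smaller than 1000
print(steps)   # 10
```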
R has a lot of built-in functions, so you often just have to write the name of a variable or object after the name of a function inside brackets to have R do something - that can be a calculation, a transformation, etc. Some examples of built-in functions are:
log(8)
## [1] 2.079442
print(e)
## [1] 11 12 13 14 15 16 17 18 19 20
sqrt(144)
## [1] 12
paste("Hello", "World", sep="/")
## [1] "Hello/World"
However, you might have your own, very customized algorithm that you need more than once in your code. To avoid writing the code of that algorithm again and again, it is convenient to define a function and give it a name once at the beginning of your code. Every time you need the algorithm, you just call the function by its name and apply it to a specific object. Let’s have a look at a very simple example. You might want to calculate the square root of a value x and multiply it by a value y. For that, you define a function with two arguments (x and y). Let’s call this function square10:
square10 <- function(x, y)
{ value <- sqrt(x)*y # multiply the square root of x by y and assign the result to an object called "value"
return(value) # this is what the function will return when it is executed
}
square10(50, 10) # run your newly defined function with the value x=50 and y=10
## [1] 70.71068
square10(10, 50) # run your newly defined function with the value x=10 and y=50
## [1] 158.1139
square10(x=50, y=10) # you can also call the name of the arguments when running the function
## [1] 70.71068
square10(y=10, x=50)
## [1] 70.71068
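Arguments can also be given default values, so that you only have to specify them when you want something other than the default. The function scaled_root below is a made-up variant of square10 for illustration. It also relies on the fact that an R function returns the value of its last expression when there is no explicit return:

```r
# y has the default value 1, so it can be left out when calling the function
scaled_root <- function(x, y = 1) {
  sqrt(x) * y   # the value of the last expression is returned
}

scaled_root(144)          # uses the default y = 1, which gives 12
scaled_root(144, y = 2)   # overrides the default, which gives 24
```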
Let’s define another function that takes several parameters:
calculation <- function(x, y, z)
{ value <- x*y-z
print(value) # you can also use print instead of return. There is a difference between these two, but for now, it does not matter so much.
}
calculation <- function(x, y, z)
{ print(x*y-z) # and of course, you can save one line of code by leaving out the assignment of "value"
}
# if you do not name the parameters (x, y, z) when running the function on some numbers, the values have to be in the order used in the function definition
# (i.e. the first value corresponds to x, the second to y and the third to z). However, if you name the parameters, you can change the order:
calculation(10, 20, 30)
## [1] 170
calculation(x=10, y=20, z=30)
## [1] 170
calculation(y=20, z=30, x=10)
## [1] 170
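One visible difference between print and return shows up when you assign the result of a function to an object. The two functions below, with_print and with_return, are hypothetical names used only for this comparison:

```r
with_print <- function(x, y, z) {
  print(x * y - z)    # shows the value every time the function runs
}

with_return <- function(x, y, z) {
  return(x * y - z)   # hands the value back without showing it
}

r1 <- with_print(10, 20, 30)    # 170 appears in the output although the result is being assigned
r2 <- with_return(10, 20, 30)   # nothing appears; the result is only stored in r2
```

Both r1 and r2 contain 170 afterwards, because print also hands back the value it has printed. Inside a loop or a longer script, however, the version with print fills the output with values you may not want to see.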
Again, the above formula of "x*y-z" is a trivial example for which you do not necessarily need to define a function, but think of more complicated cases in which you can really save a lot of lines of code when you just define the function once in the beginning.
calculation2 <- function(x, y, z)
{ a <- x*y-z
b <- x*y+z
c <- x^2+y^2+z^2
d <- z+1
return((a/b)-(c/d))
}
calculation2(1, 2, 3)
## [1] -3.7
calculation2(2, 2, 2)
## [1] -3.666667
calculation2(1, 1, 1)
## [1] -1.5
Of course, your function arguments can also stand for vectors (or even matrices or arrays) instead of mere values:
calculation3 <- function(x, y)
{ return(x+y) }
vector1 <- seq(from=3, to=27, by=3)
vector2 <- c(1:9)
calculation3(vector1, vector2)
## [1] 4 8 12 16 20 24 28 32 36
Let’s combine IF-ELSE-statements, loops and functions working with the file “data_male2.csv”:
data_male <- read.table("F:/IAMO_B/2022_DSIK/eLearning/Module_0_R-Basics/data/data_male2.csv", header=TRUE, sep=";")
head(data_male)
| X | male_15_64 | male_0_14 |
|---|---|---|
| 1960 | 2193941 | 1815435 |
| 1961 | 2224062 | 1928130 |
| 1962 | 2255871 | 2047913 |
| 1963 | 2298695 | 2166533 |
| 1964 | 2357347 | 2279349 |
| 1965 | 2424259 | 2394552 |
First, we want to know which decade each entry in our table belongs to. For that, we create a new column. We then run a FOR-loop over each row of the table, and check if the value in the first column of the respective row is less than 1970, less than 1980, less than 1990, etc.
data_male$decade <- NA_character_ # create a new column called "decade", filled with missing values as a placeholder
# you can check how the data looks now with head(data_male)
for (i in 1:NROW(data_male)) # here, we iterate over the number of rows in data_male!
{ if (data_male[i,1] < 1970) # check the value in the first column of the i'th row and if it is less than 1970
data_male[i,4] <- "sixties" # change the value in the fourth column of the i'th row to "sixties"
else if (data_male[i,1] < 1980)
data_male[i,4] <- "seventies"
else if (data_male[i,1] < 1990)
data_male[i,4] <- "eighties"
else if (data_male[i,1] < 2000)
data_male[i,4] <- "nineties"
else if (data_male[i,1] < 2010)
data_male[i,4] <- "2000s"
else data_male[i,4] <- "2010s" }
head(data_male)
| X | male_15_64 | male_0_14 | decade |
|---|---|---|---|
| 1960 | 2193941 | 1815435 | sixties |
| 1961 | 2224062 | 1928130 | sixties |
| 1962 | 2255871 | 2047913 | sixties |
| 1963 | 2298695 | 2166533 | sixties |
| 1964 | 2357347 | 2279349 | sixties |
| 1965 | 2424259 | 2394552 | sixties |
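For comparison, the same kind of labelling can be sketched without a loop using the base R function cut(), which assigns each value to an interval. The vector years below is made-up example data, not the content of data_male2.csv, and the break points are chosen to mirror the conditions of the loop above:

```r
years <- c(1960, 1969, 1970, 1985, 1999, 2000, 2015)   # made-up example values

decade <- cut(years,
              breaks = c(-Inf, 1970, 1980, 1990, 2000, 2010, Inf),
              labels = c("sixties", "seventies", "eighties",
                         "nineties", "2000s", "2010s"),
              right = FALSE)   # intervals include the lower limit and exclude the upper one

as.character(decade)   # cut() returns a factor; convert it to plain text
```

With right = FALSE, the year 1970 falls into the interval from 1970 up to (but not including) 1980, which matches the condition “less than 1980” in the loop.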
Second, for each decade, we want to know the minimum, mean and maximum value for each age class. One way of doing that is to (1.) define a function that takes the respective values from the data table and returns minimum, mean and maximum value, and to then (2.) iterate over that function, i.e. to gradually change the parameter that the function takes:
min_mean_max <- function(a)
{ values_15_64 <- c(min(a[,2]), mean(a[,2]), max(a[,2]))
values_0_14 <- c(min(a[,3]), mean(a[,3]), max(a[,3]))
output <- as.data.frame(rbind(values_15_64, values_0_14)) # combine the two vectors to one dataframe
names(output) <- c("min", "mean", "max") # name the columns of the dataframe
print(a[1,4]) # print the name of the decade
print(output) # print the dataframe
}
min_mean_max(data_male) # applied to the whole table; the printed label "sixties" is simply the decade of the first row
## [1] "sixties"
## min mean max
## values_15_64 2193941 5757639 10857091
## values_0_14 1815435 3734299 4754844
# calculate minimum, mean and maximum for each decade separately:
for (i in 1:6)
{ min_mean_max(data_male[which(data_male[,4]==unique(data_male[,4])[i]),]) }
## [1] "sixties"
## min mean max
## values_15_64 2193941 2438236 2810626
## values_0_14 1815435 2320439 2776234
## [1] "seventies"
## min mean max
## values_15_64 2928678 3539962 4227267
## values_0_14 2835444 3059434 3286465
## [1] "eighties"
## min mean max
## values_15_64 4378242 4996253 5613177
## values_0_14 3354377 3771778 4232072
## [1] "nineties"
## min mean max
## values_15_64 5725935 6379015 7101392
## values_0_14 4332720 4622156 4754844
## [1] "2000s"
## min mean max
## values_15_64 7294513 8252895 9391832
## values_0_14 4217922 4380944 4636537
## [1] "2010s"
## min mean max
## values_15_64 9703133 10303113 10857091
## values_0_14 4338764 4472508 4667807
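The loop above relies on knowing that there are exactly six decades. As a sketch of an alternative in base R, you can let split() cut the table into one piece per decade and apply a function to each piece with lapply(). The data frame toy and the name summary_by_decade are assumptions for this example; the numbers are made up and much smaller than the real data:

```r
# made-up example data with the same column names as data_male
toy <- data.frame(X          = c(1968, 1969, 1970, 1971),
                  male_15_64 = c(10, 20, 30, 50),
                  male_0_14  = c(1, 3, 5, 9),
                  decade     = c("sixties", "sixties", "seventies", "seventies"))

# split() creates one data frame per decade, lapply() runs the function on each of them
summary_by_decade <- lapply(split(toy, toy$decade), function(part) {
  sapply(part[, c("male_15_64", "male_0_14")],
         function(col) c(min = min(col), mean = mean(col), max = max(col)))
})

summary_by_decade$sixties   # a small matrix with the rows min, mean and max
```

The result is a named list, so the number of decades does not have to be written into the code.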
This example is only supposed to help you understand how powerful IF-conditions, FOR-loops and functions are, and what they allow you to do with your data. There are a lot of cases where these elements are unavoidable, so please make sure you understand them well!
However, for the particular problem we just looked at, there is a way of solving it with less code, using the package dplyr. Please check the cheat sheet of this package. It helps you to rearrange and summarise tables very efficiently, with very few lines of code!
With the dplyr package, the function definition and the FOR-loop above can be shortened to a single line. However, there are enough cases where you need to customize your analysis in a way that cannot be simplified with the help of any package.
# install.packages("dplyr") # run this line once if dplyr is not installed yet
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summarise_each(group_by(data_male[,2:4], decade), funs(min, mean, max))
## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
| decade | male_15_64_min | male_0_14_min | male_15_64_mean | male_0_14_mean | male_15_64_max | male_0_14_max |
|---|---|---|---|---|---|---|
| 2000s | 7294513 | 4217922 | 8252895 | 4380944 | 9391832 | 4636537 |
| 2010s | 9703133 | 4338764 | 10303113 | 4472508 | 10857091 | 4667807 |
| eighties | 4378242 | 3354377 | 4996253 | 3771778 | 5613177 | 4232072 |
| nineties | 5725935 | 4332720 | 6379015 | 4622156 | 7101392 | 4754844 |
| seventies | 2928678 | 2835444 | 3539962 | 3059434 | 4227267 | 3286465 |
| sixties | 2193941 | 1815435 | 2438236 | 2320439 | 2810626 | 2776234 |
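The warnings above suggest using across() instead of summarise_each() and funs(). A minimal sketch of that newer form, assuming dplyr version 1.0.0 or later, could look as follows. The data frame toy is made-up example data with the same column names as data_male, and the name result is an assumption:

```r
library(dplyr)

# made-up example data with the same column names as data_male
toy <- data.frame(X          = c(1968, 1969, 1970, 1971),
                  male_15_64 = c(10, 20, 30, 50),
                  male_0_14  = c(1, 3, 5, 9),
                  decade     = c("sixties", "sixties", "seventies", "seventies"))

result <- toy %>%
  group_by(decade) %>%
  summarise(across(c(male_15_64, male_0_14),
                   list(min = min, mean = mean, max = max)))

result
```

The new columns are named after the original column and the function, for example male_15_64_min. Note that this form groups the columns by variable rather than by function, so their order differs from the table above.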
Congratulations, you are done with this script and now know the basic concepts of programming in R!