Operators, Loops, and Functions

R excels at letting users interactively work with data in a fluid manner. This is achieved through an interactive (line by line) interpreter, data structures, and expressive syntax. Using the basics of data types, vectors, and subsetting, we can use the functions and methods presented here to mold data into the form needed for further analysis and modeling. This is by no means an exhaustive or even representative overview of data processing functions/methods, but it covers some of the fundamentals.

This overview is focused on functions in the base installation of R. Another overview will briefly introduce data processing within the ‘Tidyverse’, a set of packages and functions designed explicitly for processing, tidying, and visualizing data in R.

Introduced here are:

Operators
Control structures
Base functions
Constructing User Defined Functions
Apply functions

Operators

R contains a series of mathematical, logical, relational, and other operators that are the foundation of working with data. An operator is typically a one- or two-character function that performs a comparison or an arithmetic operation on data objects. The most basic examples are + for addition, > and < to test relative magnitude (greater than or less than), == for testing equality, and <- for assigning a value to an object. Below are examples of a few of these.
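
Because operators are ordinary functions in R, they can also be called by name using backticks. The short sketch below is only an illustration of that idea; the object names (a, b, infix_sum, prefix_sum) are hypothetical and not part of the seminar data.

```r
# Operators are functions: the infix form and the backtick form are equivalent
a <- 2                     # `<-` assigns a value to an object
b <- 3
infix_sum  <- a + b        # usual infix notation
prefix_sum <- `+`(a, b)    # same operation written as a function call
print(infix_sum == prefix_sum)  # `==` tests equality and returns a logical
print(a < b)                    # relational operators also return a logical
```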

Arithmetic operators

R has all the functionality to work as a massively expanded calculator. A wide range of arithmetic operators and mathematical functions can be used on numeric objects.

2 + 3 # addition

[1] 5

3 - 5 # subtraction

[1] -2

3 * 3 # multiplication

[1] 9

3^2 # raise to power of

[1] 9

34 %% 15 # modulo

[1] 4

sqrt(532) # square root

[1] 23.06513

log(1003) # natural log

[1] 6.910751

exp(34) # exponentiate

[1] 5.834617e+14

sum(1:10) # sum

[1] 55

mean(1:10) # mean

[1] 5.5

median(1:10) # median

[1] 5.5

var(1:10) # variance

[1] 9.166667

Arithmetic operators on vectors and data frames

These operators are vectorized, meaning that they are automatically applied to each element of a vector. As such, they return a vector of the same length as the input.

# build a data.frame to work on
set.seed(717)
df2 <- data.frame(col1 = rnorm(10,0,1),
                  col2 = rnorm(10,4,0.5),
                  col3 = sample(c(TRUE, FALSE),10,replace=TRUE),
                  col4 = letters[1:10],
                  stringsAsFactors = FALSE)
print(df2)

         col1     col2  col3 col4
1   1.3982972 3.817427  TRUE    a
2   0.6425140 4.025979  TRUE    b
3  -0.1128888 3.601069  TRUE    c
4  -0.2124540 3.997407 FALSE    d
5   0.2063796 4.041947  TRUE    e
6   0.2751510 4.793777  TRUE    f
7   0.6880224 3.529871  TRUE    g
8   1.2245582 3.920952 FALSE    h
9   0.7984930 4.040930  TRUE    i
10 -1.2060176 3.946755  TRUE    j

df2$col5 <- df2$col1 + df2$col2 # addition
print(df2$col5)

 [1] 5.215724 4.668493 3.488180 3.784953 4.248327 5.068928 4.217893
 [8] 5.145510 4.839423 2.740737

df2$col5 <- df2$col1 - df2$col2 # subtraction
print(df2$col5)

 [1] -2.419129 -3.383465 -3.713958 -4.209861 -3.835568 -4.518626 -2.841848
 [8] -2.696394 -3.242437 -5.152773

df2$col5 <- df2$col1 / df2$col2 # division
print(df2$col5)

 [1]  0.36629314  0.15959198 -0.03134870 -0.05314795  0.05105944
 [6]  0.05739754  0.19491436  0.31231143  0.19760132 -0.30557195

df2$col5 <- df2$col1 * df2$col2 # multiplication
print(df2$col5)

 [1]  5.3378967  2.5867482 -0.4065205 -0.8492651  0.8341753  1.3190126
 [7]  2.4286302  4.8014338  3.2266541 -4.7598562

df2$col5 <- df2$col1 %% 0.2 # modulo
print(df2$col5)

 [1] 0.198297165 0.042514022 0.087111162 0.187546021 0.006379555
 [6] 0.075151006 0.088022448 0.024558150 0.198493031 0.193982374
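
To make the vectorized behaviour above more concrete, the following minimal sketch compares element-wise arithmetic with the explicit loop it replaces. The vectors x and y are made-up values chosen for easy checking, not seminar data.

```r
# Vectorized arithmetic works element by element
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)
elementwise <- x + y   # each element of x is added to the matching element of y
scaled <- x * 2        # the single value 2 is recycled across every element of x

# The equivalent explicit loop, shown only for comparison
looped <- numeric(length(x))
for (i in seq_along(x)) {
  looped[i] <- x[i] + y[i]
}
print(elementwise)
print(scaled)
print(identical(elementwise, looped))
```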

Relational and Logical operators

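Relational and logical operators return TRUE or FALSE for each element they evaluate, so the result is a logical vector of the same length as the input. That logical vector can be used to subset a vector or data.frame, either directly or through which(), which returns the positions of the TRUE elements. The subset itself is usually shorter than the input.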

df2$col4 == "b" # equals (returns logical)

 [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

df2[which(df2$col4 == "b"),] # using which() to return matching rows

      col1     col2 col3 col4       col5
2 0.642514 4.025979 TRUE    b 0.04251402

df2[which(df2$col1 < -0.5), "col1"] # less than

[1] -1.206018

df2[which(df2$col1 >= -0.5), "col1"] # greater than or equal to

[1]  1.3982972  0.6425140 -0.1128888 -0.2124540  0.2063796  0.2751510
[7]  0.6880224  1.2245582  0.7984930

isTRUE(df2[1,"col3"]) # is a value TRUE

[1] TRUE

df2$col3 != TRUE # does not equal (returns logical T/F)

 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

df2$col4 %in% c("b","h","j") # within a set, returns logical

 [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

df2[which(df2$col4 %in% c("b","h","j")), "col4"] # returning values using which

[1] "b" "h" "j"

df2[which(df2$col1 < -0.5 | df2$col1 > 0.5), "col1"] # element wise or

[1]  1.3982972  0.6425140  0.6880224  1.2245582  0.7984930 -1.2060176

df2[which(df2$col1 < -0.5 & df2$col3 == TRUE), ] # element wise and

        col1     col2 col3 col4      col5
10 -1.206018 3.946755 TRUE    j 0.1939824
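
The examples above wrap each test in which(). A logical vector can also be used to subset directly, and the two approaches differ when the data contain missing values. The sketch below uses a small made-up vector (vals) to show the difference.

```r
vals <- c(1.5, NA, -0.7, 2.2)
is_big <- vals > 1               # logical vector with one entry per element of vals
direct <- vals[is_big]           # an NA in the test carries through as an NA element
by_index <- vals[which(is_big)]  # which() keeps only the positions that are TRUE
print(is_big)
print(direct)
print(by_index)
```

Using which() is therefore a reasonable default when missing values should simply be dropped from the subset.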

Control structures

Control structures are a main tool in building a program. While operators allow you to manipulate data one function at a time (or across a vector), control structures allow you to add a logical flow to the operations. The basic structures demonstrated here are common across almost any computer language, although syntax will vary. These structures can be nested to add increasing levels of control (and complexity). Grasping the utility of control structures is really important in making the conceptual leap from R as a high-powered calculator to R as an environment for building an analysis. Once understood, even more powerful structures exist in R.

Format of basic control structures

# if statement
if (condition) { 
  # do something
}
# if else statement
if (condition) {
  # do something
} else {
  # or do something else
}
# for loop
for (variable in vector) {
  # do something for each variable
}
# while loop
while (condition) {
  # do something while condition is TRUE
}
# nested control structures
if (condition) {
  for (variable in vector) {
    # do something for each variable
  }
} else {
  if (condition) {
    # do something
  } else {
    # do something else
  }
}

Examples of control structures

# if statement
some_data <- rnorm(10,0,1)
if (mean(some_data) > 0) {
  some_new_data <- mean(some_data)
  print(mean(some_data))
}

[1] 0.4159586

# if else statement
some_data <- rnorm(10,0,1)
if (mean(some_data) > 0) {
  some_new_data <- mean(some_data)
  print(mean(some_data))
} else {
  some_new_data <- NULL
  print("Mean of data is less than zero")
}

[1] "Mean of data is less than zero"

# for loop
some_data <- rnorm(10,0,1)
for (i in seq_along(some_data)) {
  some_new_data <- some_data[i]^2
  print(some_new_data)
}

[1] 3.986782
[1] 2.973403
[1] 2.627902
[1] 0.5789869
[1] 0.1446968
[1] 2.11807
[1] 0.05899888
[1] 4.886508
[1] 0.7530598
[1] 0.1821432

# nested if/else in for loop
some_data <- rnorm(10,0,1)
some_new_data <- NULL
for (i in seq_along(some_data)) {
  iter_data <- some_data[i]
  if (iter_data > 0) {
    some_new_data[i] <- iter_data^2
    print("Data positive")
  } else {
    some_new_data[i] <- abs(iter_data)^2
    print("Data < 0, applied abs()")
  }
}

[1] "Data positive"
[1] "Data positive"
[1] "Data < 0, applied abs()"
[1] "Data positive"
[1] "Data positive"
[1] "Data positive"
[1] "Data positive"
[1] "Data < 0, applied abs()"
[1] "Data positive"
[1] "Data positive"

print(some_new_data)

 [1] 0.03677341 0.52403013 0.07776627 0.97219775 1.89653583 1.52338704
 [7] 0.68885827 0.18850034 0.11198929 0.75448516

# while loop # be careful: it runs until the condition becomes FALSE
i <- 0
while(i < 4){
  new_value <- rbinom(1,1,0.5)
  i <- i + new_value
  print(paste0("Value of 'i' is: ", i))
}

[1] "Value of 'i' is: 0"
[1] "Value of 'i' is: 0"
[1] "Value of 'i' is: 0"
[1] "Value of 'i' is: 1"
[1] "Value of 'i' is: 2"
[1] "Value of 'i' is: 2"
[1] "Value of 'i' is: 2"
[1] "Value of 'i' is: 3"
[1] "Value of 'i' is: 4"
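
One way to act on the "be careful" warning is to add an upper limit on the number of iterations, so the loop ends even if the condition never becomes FALSE. This is a minimal sketch; max_iter and n_iter are hypothetical names introduced here, and the seed is arbitrary.

```r
set.seed(1)       # arbitrary seed, only to make the sketch repeatable
i <- 0
n_iter <- 0
max_iter <- 100   # hypothetical safety limit, not part of the original example
while (i < 4 && n_iter < max_iter) {
  i <- i + rbinom(1, 1, 0.5)  # add 0 or 1 with equal probability
  n_iter <- n_iter + 1
}
if (i < 4) {
  warning("Stopped at max_iter before 'i' reached 4")
}
print(paste0("Finished after ", n_iter, " iterations with i = ", i))
```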

Vectorized version of nested loop

# vectorized version of above
some_new_data2 <- ifelse(some_data > 0, some_data^2, abs(some_data)^2)
print(some_new_data2)

 [1] 0.03677341 0.52403013 0.07776627 0.97219775 1.89653583 1.52338704
 [7] 0.68885827 0.18850034 0.11198929 0.75448516

identical(some_new_data, some_new_data2)

[1] TRUE
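
The comparison above can be taken one step further. Since abs(x)^2 equals x^2 for any real number, both branches compute the same thing, and the whole loop reduces to a single vectorized expression. The sketch below is self-contained, with an arbitrary seed and hypothetical object names. It also preallocates the result vector, which is generally preferable to growing a vector from NULL inside a loop.

```r
set.seed(42)                 # arbitrary seed for a repeatable sketch
x <- rnorm(10, 0, 1)

# Loop version with a preallocated result vector
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  if (x[i] > 0) {
    loop_result[i] <- x[i]^2
  } else {
    loop_result[i] <- abs(x[i])^2
  }
}

# Vectorized versions
ifelse_result <- ifelse(x > 0, x^2, abs(x)^2)
direct_result <- x^2         # abs(x)^2 equals x^2, so no branching is needed

print(all.equal(loop_result, ifelse_result))
print(all.equal(loop_result, direct_result))
```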

R functions to work with data

Both operators and control structures open the door to functional programming, but there are thousands of pre-built functions available in R and the package ecosystem, and an infinite number of User Defined Functions that you can create for your own needs. The real work of “learning R” is to learn what functions exist, how they work, how to write your own, and how to do so efficiently and defensively. By creating functions, you unlock the power and potential of R.

Base functions for manipulating data

It is well beyond the scope of this class to cover the thousands of base functions, but a very small sample is presented. The live demo and example rmarkdown files will demonstrate many more.

df2$col5 <- df2$col1 - mean(df2$col1)
print(df2$col5)

 [1]  1.02809167  0.27230853 -0.48309433 -0.58265947 -0.16382594
 [6] -0.09505449  0.31781695  0.85435266  0.42828754 -1.57622312

df2$col5 <- scale(df2$col1, center = TRUE, scale = FALSE)
print(df2$col5)

             [,1]
 [1,]  1.02809167
 [2,]  0.27230853
 [3,] -0.48309433
 [4,] -0.58265947
 [5,] -0.16382594
 [6,] -0.09505449
 [7,]  0.31781695
 [8,]  0.85435266
 [9,]  0.42828754
[10,] -1.57622312
attr(,"scaled:center")
[1] 0.3702055

df2$col5 <- sign(df2$col1)
print(df2$col5)

 [1]  1  1 -1 -1  1  1  1  1  1 -1

df2$col5 <- ifelse(sign(df2$col1) == 1, TRUE, FALSE)
print(df2$col5)

 [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

df_sample <- sample(df2$col1, 20, replace = TRUE)
print(df_sample)

 [1] -0.1128888 -0.1128888  1.3982972 -1.2060176  0.2063796  1.3982972
 [7] -1.2060176  1.3982972  0.6425140  0.2063796  0.2751510  0.6880224
[13]  1.3982972  1.2245582  0.6425140  1.3982972  1.2245582 -0.1128888
[19]  1.3982972  0.6880224

sort(df_sample)

 [1] -1.2060176 -1.2060176 -0.1128888 -0.1128888 -0.1128888  0.2063796
 [7]  0.2063796  0.2751510  0.6425140  0.6425140  0.6880224  0.6880224
[13]  1.2245582  1.2245582  1.3982972  1.3982972  1.3982972  1.3982972
[19]  1.3982972  1.3982972

rev(sort(df_sample))

 [1]  1.3982972  1.3982972  1.3982972  1.3982972  1.3982972  1.3982972
 [7]  1.2245582  1.2245582  0.6880224  0.6880224  0.6425140  0.6425140
[13]  0.2751510  0.2063796  0.2063796 -0.1128888 -0.1128888 -0.1128888
[19] -1.2060176 -1.2060176
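
As a small extension of the sorting examples above, sort() accepts a decreasing argument, and order() returns the positions that would sort a vector, which is the usual way to sort the rows of a data.frame. The toy data below are made up for illustration.

```r
scores <- c(0.6, -1.2, 1.4, 0.2)
desc_a <- rev(sort(scores))                # approach used above
desc_b <- sort(scores, decreasing = TRUE)  # same result in one call
print(identical(desc_a, desc_b))

# order() gives row positions, so it can reorder a whole data.frame
toy <- data.frame(id = c("a", "b", "c", "d"), score = scores,
                  stringsAsFactors = FALSE)
toy_sorted <- toy[order(toy$score), ]
print(toy_sorted)
```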

User Defined Function

Often in R, there is no available function to do what you want to do. This may be because you are creating a new statistical method, because your need is very specific, or because you are extending the limits of an existing function, for example. In this case, you may create your own function. Creating a function is very easy in R. The structure of a new function is:

name_of_new_function <- function(arguments){
  # do something here
  return(results)
}

Functions can have any name that follows convention, or no name at all (an anonymous function), can take any object type as input arguments, and can return any object type. In the example below:

the function is assigned to an object called my_function
the function takes three arguments: x, y, and constant
ifelse() evaluates whether each $x_i$ is greater than or equal to zero
if TRUE, $x_i$ is squared; if FALSE, $x_i$ is divided by 2
the mean of y is multiplied by x (after x has been altered by ifelse())
the constant argument is added, and the result is assigned to the object new_value
the value of new_value is returned to where the function was called; in this case, it is assigned to df2$col5

Note: if any of the arguments are not of type numeric, this will typically result in an error. A more defensive function would include a test of the arguments that returns a helpful error message if they are anything but numeric. Further, this function works with either numeric vectors or single numeric values. However, assigning the result back to df2$col5 will result in an error if the length of the returned value cannot be matched or recycled to the number of rows in df2.

my_function <- function(x, y, constant){
  x <- ifelse(x >= 0, x^2, x/2)
  new_value <- mean(y) * x + constant
  return(new_value)
}

df2$col5 <- my_function(x = df2$col1, y = df2$col2, constant = 0.5)
print(df2$col5)

 [1]  8.26543337  2.13957755  0.27582470  0.07810768  0.66916094
 [6]  0.80068305  2.38006108  6.45560066  3.03226413 -1.89491665
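
The note above mentions that a more defensive function would test its arguments. A minimal sketch of that idea follows; my_function_safe is a hypothetical name, the error message wording is an assumption, and the input values are made up so the result is easy to check by hand.

```r
my_function_safe <- function(x, y, constant) {
  # stop early with a readable message if any argument is not numeric
  if (!is.numeric(x) || !is.numeric(y) || !is.numeric(constant)) {
    stop("x, y, and constant must all be numeric")
  }
  x <- ifelse(x >= 0, x^2, x / 2)
  new_value <- mean(y) * x + constant
  return(new_value)
}

print(my_function_safe(x = c(2, -2), y = c(1, 3), constant = 0.5))

# a non-numeric argument now produces the custom message
result <- tryCatch(my_function_safe(x = "a", y = c(1, 3), constant = 0.5),
                   error = function(e) conditionMessage(e))
print(result)
```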

One important thing to learn about functions is what is referred to as scoping, or where R looks for the variables you use inside your function. Oversimplifying a bit, the function first looks at the arguments passed to it and the variables assigned within it. If it does not find a variable there, it looks outside the function (out into your program). That means if you define the object myVar <- 5 outside of the function, but do not pass it into the function, the function will silently use that outside value, which can produce results that are hard to trace. Bottom line, while learning, do not use global variables. If you want to use an object in a function, create it there or pass it in as an argument. A quick example:

# function environment - look internally for myVar
inside_scoping <- function(x){
  myVar <- 1 # myVar defined inside function
  new_value <- x + myVar
  return(new_value)
}
myVar <- 5 # myVar also defined outside function
myValue1 <- inside_scoping(1)
print(myValue1) # function uses internal myVar

[1] 2

## or ##

# script environment - look globally for myVar
outside_scoping <- function(x){
  # myVar no longer defined in function, but used in here
  new_value <- x + myVar
  return(new_value)
}
myVar <- 5 # myVar only defined outside (globally)
myValue2 <- outside_scoping(1)
print(myValue2) # function finds myVar in the global environment and uses it

[1] 6
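
Following the advice to pass objects in as arguments, the sketch below shows the explicit alternative to the second example. It also shows that assigning to a name inside a function does not change an object of the same name outside it. The names explicit_scoping, change_inside, and my_var are hypothetical.

```r
# pass the value in explicitly instead of relying on a global variable
explicit_scoping <- function(x, my_var) {
  new_value <- x + my_var
  return(new_value)
}
my_var <- 5
print(explicit_scoping(1, my_var = my_var))

# assigning inside a function creates a local object; the global one is unchanged
change_inside <- function() {
  my_var <- 100
  return(my_var)
}
inside_value <- change_inside()
print(inside_value)
print(my_var)
```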

`apply` functions

Putting the ideas of control structures, vectorizing, and functions together, we end up with the family of apply functions in R. To paraphrase a common sentiment among R users, “If your first thought is ‘how can I loop this’, your second thought should be ‘how can I vectorize this instead’”. This sentiment is born from the fact that loops (for, while, repeat) can be slow and cumbersome compared to vectorized forms, which take advantage of faster code and are more presentable (if you are comfortable reading them). The apply functions are a first step toward this more efficient style of R programming. They take some getting used to, but can save you major headaches down the road.

The apply functions perform more complex operations to slice and aggregate data in vectors, matrices, and lists than the basic functions used earlier, and often do so in fewer lines of code than writing loops for the job. Here are some examples:

# make a sample data.frame
df3 <- data.frame(col1 = rnorm(100,0,1),
                  col2 = rnorm(100,4,0.5),
                  col3 = rbinom(100,1,0.5))
# apply() # returns vector by applying function over margins of a matrix
column_means <- apply(df3,2,mean) # compute the mean of each column
print(column_means)

     col1      col2      col3 
0.0435805 3.9454374 0.5500000

row_means <- apply(df3,1,mean) # compute the mean of each row
head(row_means)

[1] 1.096285 1.473665 1.726870 1.308236 1.597097 1.308172

sqrd_matrix <- apply(df3,1:2, function(x) x^2) # square each value in a matrix
head(sqrd_matrix)

          col1     col2 col3
[1,] 0.1545383 13.55689    0
[2,] 0.8851056 12.11175    0
[3,] 0.9736380 17.58863    0
[4,] 3.3043945 22.49139    1
[5,] 0.7357716 15.47258    0
[6,] 0.4727123 13.04696    1

# colMeans(), rowMeans(), colSums(), rowSums() stand in for the above
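
The comment above notes that colMeans(), rowMeans(), colSums(), and rowSums() can stand in for the corresponding apply() calls. A brief sketch on a made-up data.frame (toy, with an arbitrary seed) checks that the two routes agree.

```r
set.seed(10)   # arbitrary seed for a repeatable sketch
toy <- data.frame(col1 = rnorm(20, 0, 1),
                  col2 = rnorm(20, 4, 0.5))

means_apply <- apply(toy, 2, mean)  # apply() first converts the data.frame to a matrix
means_fast  <- colMeans(toy)        # dedicated function for the same task
print(all.equal(means_apply, means_fast))

row_sums_apply <- apply(toy, 1, sum)
row_sums_fast  <- rowSums(toy)
# names are dropped before comparing so only the values are checked
print(all.equal(unname(row_sums_apply), unname(row_sums_fast)))
```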

The `by()` function performs operations by groups

# by # operations by a group
group_col_means <- by(df3[,1:2], df3$col3, colMeans) # column means by group
print(group_col_means)

df3$col3: 0
     col1      col2 
0.2129485 3.9322471 
-------------------------------------------------------- 
df3$col3: 1
       col1        col2 
-0.09499329  3.95622939
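
The same grouped summary can be built from pieces already introduced, which may make it clearer what by() does. This is a sketch under assumed toy data (toy and its grp column are hypothetical), not the method used above: split() breaks the data into one piece per group, and sapply() applies colMeans() to each piece.

```r
set.seed(3)   # arbitrary seed for a repeatable sketch
toy <- data.frame(col1 = rnorm(30, 0, 1),
                  col2 = rnorm(30, 4, 0.5),
                  grp  = rep(c(0, 1), times = 15))

# one data.frame per group, then column means of each piece
pieces <- split(toy[, c("col1", "col2")], toy$grp)
group_means <- sapply(pieces, colMeans)
print(group_means)  # one column per group, one row per variable
```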

`lapply()` applies a function to each element of a list; returns a list

l2 <- list(part1 = rnorm(5,0,1), part2 = rnorm(12,3,1))
print(l2)

$part1
[1] -0.3919431 -0.8969979 -1.1335531  0.4434971  0.3799945

$part2
 [1] 3.204727 2.566988 2.726823 1.999813 2.968780 1.980304 4.366408
 [8] 2.571043 2.964162 3.243171 1.545260 3.280595

list_means <- lapply(l2, mean)
print(list_means)

$part1
[1] -0.3198005

$part2
[1] 2.78484

list_sums <- lapply(l2, sum)
print(list_sums)

$part1
[1] -1.599002

$part2
[1] 33.41807

# sapply - similar to lapply, but returns a vector or matrix
list_means2 <- sapply(l2, mean)
print(list_means2)

     part1      part2 
-0.3198005  2.7848395

### Other apply methods should you need them
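
As a brief illustration of a few other members of the apply family in base R, the sketch below uses vapply(), mapply(), and tapply() on small made-up inputs. Which of these is appropriate depends on the task; this is not a complete survey.

```r
parts <- list(part1 = c(1, 2, 3), part2 = c(10, 20))

# vapply(): like sapply(), but the type and length of each result are declared
part_means <- vapply(parts, mean, numeric(1))
print(part_means)

# mapply(): apply a function over several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)
print(pair_sums)

# tapply(): apply a function to a vector split by a grouping variable
values <- c(1, 2, 3, 4)
groups <- c("a", "b", "a", "b")
group_totals <- tapply(values, groups, sum)
print(group_totals)
```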

SAA Seminar - R working with data

MDH

September 22, 2016

Basics of working with data in R:

Operators, Loops, and Functions

