R training: Additional functions

Pretext

This training builds on the basic R and data wrangling with dplyr trainings, and presents a collection of additional functions that you might find useful.

The topics discussed in this training are the following:

as.Date(): to deal with date formats
write.csv(): to export objects (e.g. save data frame as .csv)
gather() & spread(): to reshape data from “wide” to “long” format (and vice versa)
paste(): to join multiple strings into a single string
grep() & gsub(): to match and replace character patterns
if() statements: to conditionally execute code
Loops (advanced): to repeat tasks
Self-made functions (advanced): to take your R game to the next level

1. Date classification: `as.Date()`

Working with dates in R can be straightforward or frustrating, depending on the complexity of the task at hand. Dates can be handled in different ways, for example with the built-in as.Date() function, with the POSIXct and POSIXlt date-time classes, or with packages such as lubridate.

Format date variable

When you first import your dataset, your date variables (e.g. “today”) are most likely stored in character class:

class(data$today)

## [1] "character"

In order for R to interpret your date variable as a date, you need to specify it as such using the as.Date() function:

# define class, specify date format
data$today <- as.Date(data$today, format = "%d/%m/%Y")

# check class
class(data$today)

## [1] "Date"

The format argument is key as it tells R exactly how to understand the dates in your dataset. More info on how to specify different date formats can be found here.
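To illustrate how the format argument works, here is a minimal sketch using made-up date strings (they are not taken from the training dataset). The codes %d, %m, %Y and %y stand for day, month, four-digit year and two-digit year:

```r
# hypothetical example values, not from the training dataset
d1 <- as.Date("15/09/2020", format = "%d/%m/%Y")  # day/month/4-digit year
d2 <- as.Date("09-15-20", format = "%m-%d-%y")    # month-day-2-digit year
d3 <- as.Date("2020-09-15")                       # year-month-day needs no format argument

# all three strings describe the same day
d1 == d2 & d2 == d3
## [1] TRUE

# a format that does not match the input returns NA instead of an error
as.Date("15/09/2020", format = "%Y-%m-%d")
## [1] NA
```

If a date column is full of NAs after conversion, a mismatch between the format argument and the actual strings is a likely cause.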

Data manipulation with dates

Once your date column is classified as date, you can manipulate your dataset using date properties. Consider the following example, where we filter our dataset and only include observations that were collected after a certain threshold date (15th September 2020):

# a little bit of dplyr magic
data %>%
  select(today, age_respondent, calc_total_income, sources_food) %>%
  filter(today > as.Date("2020-09-15")) %>%
  head()

##        today age_respondent calc_total_income sources_food
## 1 2020-09-16             35            510000       credit
## 2 2020-09-28             32            200000       credit
## 3 2020-09-17             24            300000     cash_own
## 4 2020-09-18             26            375000       credit
## 5 2020-09-24             31            150000       credit
## 6 2020-09-29             43            600000     cash_own

Note that once converted with as.Date(), dates are always displayed in the format “year-month-day” (e.g. 2020-09-15), regardless of how they were written in the original data.
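Beyond filtering, Date objects also support simple arithmetic. The following sketch uses two made-up dates (the object names start and end are chosen for illustration only):

```r
# hypothetical start and end dates
start <- as.Date("2020-09-15")
end <- as.Date("2020-09-29")

# subtracting two dates gives the difference in days
days_between <- as.numeric(end - start)
days_between
## [1] 14

# adding a number shifts the date by that many days
start + 7
## [1] "2020-09-22"

# format() turns a date back into a character string in any layout
format(start, "%d/%m/%Y")
## [1] "15/09/2020"
```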

2. Data exports: `write.csv()`

After manipulating your dataset, you may wish to save (export) it as a .csv-file.

Here is how you would do that:

write.csv(data, "new_export.csv", row.names = FALSE, na = "")

In the example above, row names are excluded, and NAs are specified as blanks (""). More documentation on the write.csv() function can be found here.
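To check what the na = "" argument does, one option is to write a small made-up data frame to a temporary file and read it back. The data frame and file path below are illustrative assumptions, not part of the training dataset:

```r
# small made-up data frame with one missing value
df <- data.frame(id = 1:3, income = c(3500, NA, 2320))

# write to a temporary file so nothing is left in the working directory
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE, na = "")

# read the file back in: the blank cell is read as NA again
back <- read.csv(path)
is.na(back$income)
## [1] FALSE  TRUE FALSE
```

Replacing tempfile() with a file name such as "new_export.csv" saves the file in your working directory instead.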

3. Wide and long format: `gather()` & `spread()`

Datasets can be presented in two different formats: long and wide. Wide data, the “normal” case, has each variable in a separate column, while long data has one column holding all the variable names and another holding all the values. gather() from the tidyr package converts your dataset from wide to long format, while spread() converts it from long to wide.

In most cases, you will be using wide data. However, in some cases, long data is required (for instance to make plots with ggplot2).

# first we select a few columns from the dataset
data_wide <- data %>%
  select(index, age_respondent, calc_total_income, sources_food) %>%
  head()

# let's have a look at it
data_wide

##   index age_respondent calc_total_income sources_food
## 1     1             60            350000       credit
## 2     2             35            510000       credit
## 3     3             33            232000       credit
## 4     4             32            200000       credit
## 5     5             51            420000  gift_family
## 6     6             24            300000     cash_own

As you can see, each variable currently has its own column, which means the data is in “wide” format.

From wide to long

We could convert it to “long” format like this:

# first, call the tidyr package (install it if you haven't already)
library(tidyr)

# then use the gather() function
data_long <- gather(data_wide, variable, value, age_respondent:sources_food)

data_long

##    index          variable       value
## 1      1    age_respondent          60
## 2      2    age_respondent          35
## 3      3    age_respondent          33
## 4      4    age_respondent          32
## 5      5    age_respondent          51
## 6      6    age_respondent          24
## 7      1 calc_total_income      350000
## 8      2 calc_total_income      510000
## 9      3 calc_total_income      232000
## 10     4 calc_total_income      200000
## 11     5 calc_total_income      420000
## 12     6 calc_total_income      300000
## 13     1      sources_food      credit
## 14     2      sources_food      credit
## 15     3      sources_food      credit
## 16     4      sources_food      credit
## 17     5      sources_food gift_family
## 18     6      sources_food    cash_own

We converted the columns “age_respondent” to “sources_food” from wide to long format, while keeping “index” as its own column. As specified in the gather() statement, the column with the variable names is called “variable”, and the column with the corresponding values is called “value”. Because the “value” column now mixes numbers and text, R stores all of its entries as character.

From long to wide

We can reshape the dataset back to wide by using the spread() function. In the function, we simply pass the long dataset and specify the names of the columns holding the variable names and the corresponding values.

spread(data_long, variable, value)

##   index age_respondent calc_total_income sources_food
## 1     1             60            350000       credit
## 2     2             35            510000       credit
## 3     3             33            232000       credit
## 4     4             32            200000       credit
## 5     5             51            420000  gift_family
## 6     6             24            300000     cash_own

More info on gather() and spread() can be found here, here (gather) and here (spread).
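Newer versions of tidyr (1.0.0 and later) also provide pivot_longer() and pivot_wider(), which do the same job as gather() and spread(); the older functions still work but are no longer developed. A minimal sketch with a made-up, numeric-only data frame (the column names are illustrative):

```r
library(tidyr)

# made-up numeric data, so all values share one type
wide <- data.frame(index = 1:3,
                   age = c(60, 35, 33),
                   income = c(350000, 510000, 232000))

# wide to long: the equivalent of gather()
long <- pivot_longer(wide, cols = c(age, income),
                     names_to = "variable", values_to = "value")
nrow(long)
## [1] 6

# long to wide: the equivalent of spread()
wide_again <- pivot_wider(long, names_from = variable, values_from = value)
```

The names_to and values_to arguments play the same role as the “variable” and “value” names passed to gather() above.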

4. Join strings together: `paste()`

With the paste() function you can join (concatenate) multiple strings together.

# define a string
string1 <- "text1"

# and another one
string2 <- "text2"

# now paste them together and add comma and space (", ") between them
paste(string1, string2, sep = ", ")

## [1] "text1, text2"

The sep argument defines the separator; if you leave it out, paste() uses a single space. Use paste0() if no separator is needed.

You can paste together different elements from a vector using the collapse argument:

# define a vector
v1 <- c("1st", "2nd", "3rd")

# now paste elements together and add comma and space (", ") between them
paste(v1, collapse = ", ")

## [1] "1st, 2nd, 3rd"

Here is an example of how you would integrate paste() in dplyr’s mutate() to create a new variable indicating the strata:

# load the dplyr package
library(dplyr)

# define a new variable pasting together different variables separated by a "."
data <- data %>%
  mutate(strata = paste(location, idp_ref, camp_no_camp, sep = "."))
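
To see what such a strata variable looks like, here is a small base R sketch. The values are made up; only the column names mirror the example above:

```r
# hypothetical values; the column names mirror the example above
df <- data.frame(location = c("erbil", "duhok"),
                 idp_ref = c("idp", "ref"),
                 camp_no_camp = c("camp", "no_camp"))

# paste() is vectorised: it builds one string per row
df$strata <- paste(df$location, df$idp_ref, df$camp_no_camp, sep = ".")
df$strata
## [1] "erbil.idp.camp"    "duhok.ref.no_camp"
```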

5. Find and replace string patterns: `grep()` & `gsub()`

The two functions allow you to find patterns in strings, and to replace them as needed. grep() returns a vector indicating elements matching the specified pattern, while gsub() allows you to replace a pattern in a character vector.

`grep()`

Suppose we want to check which elements in a vector include the string “ab”:

# define a vector
v1 <- c("abcd", "bcde", "abba", "ab", "ba")

# using grep() with value=TRUE returns a vector with only the elements that include "ab"
grep("ab", v1, perl=TRUE, value=TRUE)

## [1] "abcd" "abba" "ab"

# setting value=FALSE returns a vector with the positions of the matching elements within the vector
grep("ab", v1, perl=TRUE, value=FALSE)

## [1] 1 3 4

# grepl() returns a logical vector indicating whether the condition is met for each element
grepl("ab", v1, perl=TRUE)

## [1]  TRUE FALSE  TRUE  TRUE FALSE

Let’s assume we have a variable in our dataset indicating the sources of food (“sources_food”). If we wanted to replace all answer options starting with “assistance_” (e.g. assistance_govt, assistance_local and assistance_un) with “assistance”, we could specify something like this, using grepl() in combination with dplyr’s mutate() and case_when():

data <- data %>%
  mutate(sources_food = case_when(grepl("^assistance_", sources_food, perl=TRUE) ~ "assistance",
                                  TRUE ~ sources_food))

The ^ in the above expression ensures that the pattern is anchored to the start of the string. A compilation of regular expressions in R can be found here. These can be used together with grepl() & gsub() to match complex patterns.
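
The effect of anchoring is easiest to see side by side. In this sketch the answer options are invented for illustration (“no_assistance_needed” is not a real option in the dataset):

```r
# hypothetical answer options
x <- c("assistance_un", "no_assistance_needed", "cash_own", "credit")

# without an anchor, the pattern matches anywhere in the string
grepl("assistance_", x)
## [1]  TRUE  TRUE FALSE FALSE

# with ^, only strings that start with the pattern match
grepl("^assistance_", x)
## [1]  TRUE FALSE FALSE FALSE

# $ anchors the pattern to the end of the string instead
grepl("own$", x)
## [1] FALSE FALSE  TRUE FALSE
```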

`gsub()`

gsub() works in a similar fashion, but allows you to replace a character sub-string.

# define a vector
v1 <- c("abcd", "bcde", "abba", "ab", "ba")

# let's replace "ab" with "XX"
gsub("ab", "XX", v1)

## [1] "XXcd" "bcde" "XXba" "XX"   "ba"

Consider the following example, where we want to replace the underscore ("_") in the governorate names with a space (" "). We would achieve that by specifying the following expression using gsub().

data$governorate <- gsub('_', ' ', data$governorate)

More on grep() and gsub() can be found here.
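
Two related points may be useful here: sub() replaces only the first match, while gsub() replaces all of them, and the pattern can be a regular expression rather than a fixed string. A short sketch with invented strings:

```r
# hypothetical place name with two underscores
x <- "al_sulaymaniyah_city"

# sub() replaces only the first match
sub("_", " ", x)
## [1] "al sulaymaniyah_city"

# gsub() replaces every match
gsub("_", " ", x)
## [1] "al sulaymaniyah city"

# the pattern can be a regular expression, here "one or more digits"
gsub("[0-9]+", "", "camp12")
## [1] "camp"
```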

6. Conditional execution: `if()` statements

In some cases you may want to execute a chunk of code conditionally. Conditional expressions are only executed when a specified condition is met.

The syntax requires the following structure.

if(condition) { true.expression } else { false.expression }

If the condition is met, the true.expression is executed; otherwise the false.expression is run.

Simple `if()` statements

Let us try a simple example to illustrate the point:

#define an object
a <- 1

if( a > 0 ) { "a is above 0" }

## [1] "a is above 0"

The expression returns the string “a is above 0” because a (1) is bigger than 0, so the condition is met (TRUE).

If the condition is not met, as in the following example, then the code chunk is not executed and nothing is returned:

if( a < 0 ) { "a is below 0" }

Complex `if()` statements

if() statements can be combined with else{} clauses and additional if() clauses as needed:

# define another object
b <- -1

if( b > 0 ) { "b is above 0" } else { "b is NOT above 0" }

## [1] "b is NOT above 0"

c <- 0

if( c > 0 ) {
  "c is above 0"
} else if( c < 0 ) {
  "c is below 0"
} else {
  "c is 0"
}

## [1] "c is 0"
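
Note that if() evaluates a single condition. To apply a condition to every element of a vector (for instance a whole column), base R offers the vectorised ifelse() function. A minimal sketch, assuming made-up income values and an arbitrary threshold of 300000:

```r
# made-up income values; the threshold of 300000 is arbitrary
income <- c(350000, 150000, 600000)

# ifelse(test, yes, no) checks the condition for each element
income_group <- ifelse(income > 300000, "high", "low")
income_group
## [1] "high" "low"  "high"
```

This is the base R counterpart of the case_when() approach shown in the grep() section.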

7. Loops (advanced): `for()` loop

A particular advantage of R over applications like Excel and SPSS is that it allows you to run iterations, or “loops”. A loop automates a multi-step or batch process by grouping the parts that need to be repeated.

Suppose you want to do several printouts of the following form: The year is [i], where [i] is equal to 2018 up to 2021. You can do this as follows:

paste("The year is", 2018)

## [1] "The year is 2018"

paste("The year is", 2019)

## [1] "The year is 2019"

paste("The year is", 2020)

## [1] "The year is 2020"

paste("The year is", 2021)

## [1] "The year is 2021"

This is quite tedious to code, especially if you included even more years. Using a for loop, you could simplify the expression like this:

for (i in 2018:2021){
  print(paste("The year is", i))
}

## [1] "The year is 2018"
## [1] "The year is 2019"
## [1] "The year is 2020"
## [1] "The year is 2021"

Loops take us into more advanced R territory, so we will not go into more detail in this training. More on loops can be found here and here.
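
Often you will want to keep the results of a loop rather than just print them. One common pattern, sketched here with illustrative object names (years, labels), is to create an empty vector first and fill it step by step:

```r
years <- 2018:2021

# create an empty character vector with one slot per year
labels <- character(length(years))

# seq_along(years) gives the positions 1, 2, 3, 4
for (i in seq_along(years)) {
  labels[i] <- paste("The year is", years[i])
}

length(labels)
## [1] 4

labels[4]
## [1] "The year is 2021"
```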

8. Self-made functions (advanced): `function()`

Functions always follow the same structure:

func_name <- function(arguments) {
  # do something with the arguments
}

First you assign the function a name, then you specify that it is in fact a function (using the function keyword). Within the () you specify the arguments, and in the {} you specify what is done with the arguments.

Here is an example of a very basic function:

# we define a function called "add_3", which simply adds 3 to any number you specify
add_3 <- function(x) {
  x + 3
}

# use the function with 5 as input
add_3(5)

## [1] 8

add_3(10)

## [1] 13

As with loops, self-made functions go beyond what we are trying to do in this training. The point is to know that functions exist, that you can define them yourself, and that you may want to learn about them at a later stage when you get more advanced in R.

More on user-defined functions can be found here and here.
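
As a small extension of the add_3 example, a function can take more than one argument, and arguments can have default values. The function name add_n below is made up for illustration:

```r
# "add_n" adds n to x; if n is not specified, it defaults to 3
add_n <- function(x, n = 3) {
  x + n
}

# uses the default n = 3
add_n(5)
## [1] 8

# overrides the default
add_n(5, n = 10)
## [1] 15

# sapply() applies the function to each element of a vector
sapply(c(1, 2, 3), add_n)
## [1] 4 5 6
```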


IMPACT Initiatives - Iraq (Apr 2021)
