This training builds on the basic R and data wrangling with dplyr trainings, and presents a collection of additional functions that you might find useful.
The topics discussed in this training are the following:
- as.Date(): to deal with date formats
- write.csv(): to export objects (e.g. save a data frame as .csv)
- gather() & spread(): to reshape data from “wide” to “long” format (and vice versa)
- paste(): to paste multiple strings together and combine them into a single string
- grep() & gsub(): to match and replace character patterns
- if() statements: to conditionally execute code
as.Date()
Working with dates in R is at times quite straightforward and at times quite frustrating, depending on the complexity of the task at hand. Dates can be handled in different ways, such as with the built-in as.Date() function, the POSIXct and POSIXlt classes, or lubridate and other packages.
When you first import your dataset, your date variables (e.g. “today”) are most likely stored in character class:
# check class
class(data$today)
## [1] "character"
In order for R to interpret your date variable as a date, you need to specify it as such using the as.Date() function:
# define class, specify date format
data$today <- as.Date(data$today, format = "%d/%m/%Y")
# check class
class(data$today)
## [1] "Date"
The format argument is key as it tells R exactly how to understand the dates in your dataset. More info on how to specify different date formats can be found here.
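As a rough illustration of how the format argument maps onto different date strings, here is a minimal sketch. The example strings and object names are made up for illustration; %d, %m, %Y and %y are standard format codes documented under ?strptime.

```r
# day/month/four-digit year, as in the example above
d1 <- as.Date("16/09/2020", format = "%d/%m/%Y")

# month-day-two-digit year
d2 <- as.Date("09-16-20", format = "%m-%d-%y")

# dates already written as year-month-day need no format argument
d3 <- as.Date("2020-09-16")

# all three refer to the same day
d1 == d2
d2 == d3
```

If the format does not match the strings, as.Date() typically returns NA, so it is worth checking the result for unexpected missing values.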
Once your date column is classified as date, you can manipulate your dataset using date properties. Consider the following example, where we filter our dataset and only include observations that were collected after a certain threshold date (15th September 2020):
# a little bit of dplyr magic
data %>%
  select(today, age_respondent, calc_total_income, sources_food) %>%
  filter(today > "2020-09-15") %>%
  head()
##        today age_respondent calc_total_income sources_food
## 1 2020-09-16             35            510000       credit
## 2 2020-09-28             32            200000       credit
## 3 2020-09-17             24            300000     cash_own
## 4 2020-09-18             26            375000       credit
## 5 2020-09-24             31            150000       credit
## 6 2020-09-29             43            600000     cash_own
Note that once formatted with as.Date(), dates always follow the format “year-month-day”.
write.csv()
After manipulating your dataset, you may wish to save (export) it as a .csv file.
Here is how you would do that:
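A minimal sketch of such an export is given here. The data frame, object names and file name are made up for illustration, and a temporary folder is used so that the example runs anywhere; in practice you would pass your own data frame and file path.

```r
# a small made-up data frame standing in for your dataset
data_export <- data.frame(
  index = 1:3,
  sources_food = c("credit", NA, "cash_own")
)

# hypothetical output path (temporary folder, so the example runs anywhere)
output_file <- file.path(tempdir(), "data_export.csv")

# export: leave out row names and write NAs as blanks
write.csv(data_export, file = output_file, row.names = FALSE, na = "")
```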
In the example above, row names are excluded, and NAs are specified as blanks (""). More documentation on the write.csv() function can be found here.
gather() & spread()
Datasets can be presented in two different formats: long and wide. Wide data, the “normal” case, has each variable in a separate column, while “long” data has one column with all the variable names and one with all the values. gather() from the tidyr package lets you convert your dataset from wide to long format, while spread() converts from long to wide.
In most cases, you will be using wide data. However, in some cases, long data is required (for instance to make plots with ggplot2).
# first we select a few columns from the dataset
data_wide <- data %>%
  select(index, age_respondent, calc_total_income, sources_food) %>%
  head()
# let's have a look at it
data_wide
##   index age_respondent calc_total_income sources_food
## 1     1             60            350000       credit
## 2     2             35            510000       credit
## 3     3             33            232000       credit
## 4     4             32            200000       credit
## 5     5             51            420000  gift_family
## 6     6             24            300000     cash_own
As you can see, each variable currently has its own column, which means the data is in “wide” format.
We could convert it to “long” format like this:
# first, call the tidyr package (install it if you haven't already)
library(tidyr)
# then use the gather() function
data_long <- gather(data_wide, variable, value, age_respondent:sources_food)
data_long
##    index          variable       value
## 1      1    age_respondent          60
## 2      2    age_respondent          35
## 3      3    age_respondent          33
## 4      4    age_respondent          32
## 5      5    age_respondent          51
## 6      6    age_respondent          24
## 7      1 calc_total_income      350000
## 8      2 calc_total_income      510000
## 9      3 calc_total_income      232000
## 10     4 calc_total_income      200000
## 11     5 calc_total_income      420000
## 12     6 calc_total_income      300000
## 13     1      sources_food      credit
## 14     2      sources_food      credit
## 15     3      sources_food      credit
## 16     4      sources_food      credit
## 17     5      sources_food gift_family
## 18     6      sources_food    cash_own
We converted the columns “age_respondent” to “sources_food” from wide to long format, while keeping “index” as it was. As specified in the gather() statement, the column with the variable names is called “variable”, and the column with the corresponding values is called “value”.
We can reshape the dataset back to wide by using the spread() function. In the function, we simply call the long dataset and specify the names of the columns with the variables names and corresponding values.
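A minimal sketch of that call, assuming the column names “variable” and “value” used in the gather() example. To keep it self-contained, a small made-up data frame stands in for data_wide, and data_wide_again is a hypothetical object name.

```r
library(tidyr)

# a small made-up wide dataset standing in for data_wide
data_wide <- data.frame(
  index = 1:3,
  age_respondent = c(60, 35, 33),
  calc_total_income = c(350000, 510000, 232000),
  sources_food = c("credit", "credit", "cash_own")
)

# wide to long, as in the gather() example
data_long <- gather(data_wide, variable, value, age_respondent:sources_food)

# long back to wide: pass the long dataset, then the column holding the
# variable names and the column holding the values
data_wide_again <- spread(data_long, variable, value)
data_wide_again
```

Because gather() stores numbers and text together in one column, the reshaped columns come back as character; spread() has a convert = TRUE argument that asks R to re-guess the column types.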
##   index age_respondent calc_total_income sources_food
## 1     1             60            350000       credit
## 2     2             35            510000       credit
## 3     3             33            232000       credit
## 4     4             32            200000       credit
## 5     5             51            420000  gift_family
## 6     6             24            300000     cash_own
More info on gather() and spread() can be found here, here (gather) and here (spread).
paste()
With the paste() function you can join (concatenate) multiple strings together.
# define a string
string1 <- "text1"
# and another one
string2 <- "text2"
# now paste them together and add comma and space (", ") between them
paste(string1, string2, sep = ", ")
## [1] "text1, text2"
The sep argument defines the separator (by default a single space). Alternatively, use paste0() if no separator is needed.
You can paste together different elements from a vector using the collapse argument:
# define a vector
v1 <- c("1st", "2nd", "3rd")
# now paste elements together and add comma and space (", ") between them
paste(v1, collapse = ", ")
## [1] "1st, 2nd, 3rd"
Here is an example of how you would integrate paste() in dplyr’s mutate() to create a new variable indicating the strata:
# load the dplyr package
library(dplyr)
# define a new variable pasting together different variables separated by a "."
data <- data %>%
  mutate(strata = paste(location, idp_ref, camp_no_camp, sep = "."))
More on paste() can be found here.
grep() & gsub()
The two functions allow you to find patterns in strings, and to replace them as needed. grep() returns a vector indicating elements matching the specified pattern, while gsub() allows you to replace a pattern in a character vector.
grep()
Suppose we want to check which elements in a vector include the string “ab”:
# define a vector
v1 <- c("abcd", "bcde", "abba", "ab", "ba")
# using grep() with value=TRUE returns a vector with only the elements that include "ab"
grep("ab", v1, perl=TRUE, value=TRUE)
## [1] "abcd" "abba" "ab"
# setting value=FALSE returns a vector with the positions of the matching elements within the vector
grep("ab", v1, perl=TRUE, value=FALSE)
## [1] 1 3 4
# grepl() returns a logical vector indicating whether the condition is met for each element
grepl("ab", v1, perl=TRUE)
## [1]  TRUE FALSE  TRUE  TRUE FALSE
Let’s assume we have a variable in our dataset indicating the sources of food (“sources_food”). If we wanted to replace all answer options starting with “assistance_” (e.g. assistance_govt, assistance_local and assistance_un) with “assistance”, we could specify something like this, using grepl() in combination with dplyr’s mutate() and case_when():
data <- data %>%
  mutate(sources_food = case_when(grepl("^assistance_", sources_food, perl=TRUE) ~ "assistance",
                                  TRUE ~ sources_food))
The ^ in the above expression ensures that the pattern is anchored to the start of the string. A compilation of regular expressions in R can be found here. These can be used together with grepl() & gsub() to match complex patterns.
gsub()
gsub() works in a similar fashion, but replaces the matched sub-string with another string.
# define a vector
v1 <- c("abcd", "bcde", "abba", "ab", "ba")
# let's replace "ab" with "XX"
gsub("ab", "XX", v1)
## [1] "XXcd" "bcde" "XXba" "XX"   "ba"
Consider the following example, where we want to replace the underscore ("_") in the governorate names with a space (" "). We would achieve that by specifying the following expression using gsub().
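A minimal sketch of such an expression follows. The data frame data_example, the column name “governorate” and the governorate names themselves are assumptions made for illustration; in practice you would apply gsub() to the relevant column of your own dataset.

```r
# made-up governorate names standing in for a column of your dataset
data_example <- data.frame(governorate = c("al_anbar", "salah_al_din", "ninewa"))

# replace every underscore with a space
data_example$governorate <- gsub("_", " ", data_example$governorate)

# the values are now "al anbar", "salah al din" and "ninewa"
data_example$governorate
```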
More on grep() and gsub() is found here.
if() statements
In some cases you may want to execute a chunk of code conditionally. Conditional expressions are only executed when a specified condition is met.
The syntax requires the following structure.
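The general structure can be sketched as follows, using condition, true.expression and false.expression as placeholders, followed by a concrete instance that actually runs (the object name result is made up for illustration):

```r
# general structure (placeholders, shown as comments):
# if (condition) {
#   true.expression
# } else {
#   false.expression
# }

# a concrete, runnable instance of the same structure
condition <- TRUE
if (condition) {
  result <- "true.expression was run"
} else {
  result <- "false.expression was run"
}
result
```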
If the condition is met, the true.expression is executed; otherwise the false.expression is run.
Let us try a simple example to illustrate the point:
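A minimal sketch consistent with the output and explanation that follow, where a is equal to 1:

```r
# define a
a <- 1

# the print() call only runs if a is greater than 0
if (a > 0) {
  print("a is above 0")
}
```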
## [1] "a is above 0"
The expression returns the string “a is above 0” because a (1) is greater than 0 (the condition is therefore met, i.e. TRUE).
If the condition is not met, as in the following example, then the code chunk is not executed:
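A minimal sketch of such a case. The value of b (-1) is an assumption chosen for illustration; any value that is not above 0 behaves the same way.

```r
# define b (assumed value, not above 0)
b <- -1

# the condition is FALSE, so the print() call is skipped and nothing is returned
if (b > 0) {
  print("b is above 0")
}
```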
if() statements can be combined with else{} clauses and additional if() clauses as needed:
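A minimal sketch that would produce the two outputs shown next. The values of b (-1) and c (0) and the wording of the branches that are not reached are assumptions made for illustration.

```r
# if ... else: exactly one of the two branches is run
b <- -1  # assumed value
if (b > 0) {
  print("b is above 0")
} else {
  print("b is NOT above 0")
}

# if ... else if ... else: conditions are checked in order until one is met
c <- 0  # assumed value; the function c() remains usable despite this variable
if (c > 0) {
  print("c is above 0")
} else if (c < 0) {
  print("c is below 0")
} else {
  print("c is 0")
}
```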
## [1] "b is NOT above 0"
## [1] "c is 0"
for() loop
A particular advantage that R has over applications like Excel and SPSS is that it makes it easy to run iterations, or “loops”. Loops automate multi-step and batch processes by grouping the parts that need to be repeated.
Suppose you want to do several printouts of the following form: The year is [i], where [i] is equal to 2018 up to 2021. You can do this as follows:
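Written out by hand, this could look like the following sketch, which combines print() and paste() with one line per year:

```r
# one print statement per year
print(paste("The year is", 2018))
print(paste("The year is", 2019))
print(paste("The year is", 2020))
print(paste("The year is", 2021))
```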
## [1] "The year is 2018"
## [1] "The year is 2019"
## [1] "The year is 2020"
## [1] "The year is 2021"
This is quite tedious to code, especially if you included even more years. Using a for loop, you could simplify the expression like this:
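A minimal sketch of the same printouts with a for loop. The loop variable name year is chosen for illustration.

```r
# year takes the values 2018, 2019, 2020 and 2021 in turn,
# and the code inside the braces is run once for each value
for (year in 2018:2021) {
  print(paste("The year is", year))
}
```

Adding more years only requires changing the range (e.g. 2010:2021), not adding more lines.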
## [1] "The year is 2018"
## [1] "The year is 2019"
## [1] "The year is 2020"
## [1] "The year is 2021"
With loops, we are getting into advanced R territory, however, so we will not get into more detail in this training. More on loops can be found here and here.
function()
Functions always follow the same structure:
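That structure can be sketched as follows, first with placeholders and then with a concrete instance that runs. The names function_name, argument1, argument2 and add_numbers are made up for illustration.

```r
# general structure (placeholders, shown as comments):
# function_name <- function(argument1, argument2) {
#   what is done with the arguments
# }

# a concrete instance: a function that adds two numbers
add_numbers <- function(x, y) {
  x + y
}

# call the function with 2 and 3 as inputs
add_numbers(2, 3)
```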
First you assign it a name, then you specify that it is in fact a function (using the function() function). Within the () you specify the arguments, and in the {} you specify what is done with the arguments.
Here is an example of a very basic function:
# we define a function called "add_3", which simply adds 3 to any number you specify
add_3 <- function(x) {
  x + 3
}
# use the function with 5 as input
add_3(5)
## [1] 8
# use the function with 10 as input
add_3(10)
## [1] 13
As with loops, self-made functions go beyond what we are trying to do in this training. The point is to know that functions exist, that you can define them yourself, and that you may want to learn more about them at a later stage, as you become more advanced in R.