Session 2 - manipulating data in R with dplyr

Packages

Install and load packages:

install.packages(“dplyr”)

library(dplyr) library(purrr)

Workshop Scope

dplyr is a grammar package for data manipulation, providing a consistent set of functions that help solve the most common manipulation challenges

Again, some important dplyr Functions

  • select() = Select columns
  • filter() = Filter rows
  • mutate() = Create new columns
  • group_by() = Allows for group operations
  • summarize() = Summarize values

dplyr - select() review

To select a set of columns we can use:

# select(data, column_name,...)

To select all the columns except a specific column/s

# select (data, - column_name,...)

To select a range of columns by name, use the “:” operator

# select(data, column_name_1:column_name_n))

dplyr - filter() review

To filter the rows based on a condition, you can use:

filter(data, condition)

dplyr - Pipe operator: %>%

(ctrl-shift-M)

dplyr imports this operator from another package (magrittr). This operator allows you to “pipe” the output from one function to the input of another function.

Many dplyr functions can be used in conjunction with each other in pipes like in the following example. Note that it is common practice to first specify the data frame to be manipulated. This

# data <- data %>%  
# select(column_name_1:column_name_n)) %>%
# filter( column_name_2 > 10)

Import dataset, select and filter columns

Exercise 1 Import a dataset of your choice in R using read.csv() function and call it data .

Select a range of columns of interest and filter for a condition. One of these filters should be for a categorical vriable (ie. all “male”) and at least one should be of a numeric variable (ie. more or less than a number)

dplyr - mutate()

Mutate allows us to create new columns/variables and is commonly used within a dplyr pipe. The syntax is:

# mutate(data, new_column_name = condition)

or, within a pipe:

# data <- data %>%  
# mutate(new_column_name = condition)

Exercise 2 use mutate() to create a new numeric column that is the sum of other numeric columns in your dataset

recoding data with dplyr

dplyr provides the following ways to recode data conveniently

  • recode() = replace values based on their name. best for factors, one to one recoding.
  • case_when() = used for more complicated conditional criteria

recode() / recode_factor()

the recode function takes the following syntax for character values in a specified column(s):

# recode(data$variable, a = "Apple", b = "Banana")

and with numeric class variables, here with all other variables set to NA (.default):

# recode_factor(data$variable, `2` = 1, `4` = 2, .default = NA_real_)

Recode can be applied to one or all values as needed

Exercise 3 use recode() to recode values of a character and numeric factor variable

case_when()

Functions similar to ifelse statements, but with simplified code.

the case_when function takes the following syntax:

# case_when(x < 5 ~ "low", 
#           x == 5 ~ "mid",
#           x > 5 ~ "high"
#           TRUE ~ NA_character_)

where the left side is the old values and the right the new values. the TRUE ~ is equivalent to the ‘else’ in an ifelse statement

case_when used with mutate

It is common that you will need to create a new column with mutate based on conditions set in a case_when (ie. when we want to class ranges in a numeric column as below)

# data <- data %>%  
#         mutate(new_column_name = case_when(numeric_var < 5 ~ "low", 
#                                            numeric_var == 5 ~ "mid",
#                                            numeric_var > 5 ~ "high"
#                                            TRUE ~ NA_character_)

Exercise 4 use mutate and case_when to create 2 new columns based on conditions

group_by() and summarize()

These two functions are generally used together to produce a type of R equivalent to pivot tables.

group_by() groups data according to factors of a variable

# new_group_name <- data %>% 
#                   group_by(variable_to_group)

it is recommended to assign a grouped dataset to a group name, as above

group_by() and summarize() ctnd

summarize() functions similar to mutate in creating new columns, however following a group_by it takes a function that condenses numeric variables into one cell, such as mean, sum etc

# new_group_name <- data %>% 
#                   group_by(variable_to_group) %>% 
#                   summarize(mean = mean(numeric_variable),
#                             sum = sum(numeric_variable))

multiple summarized columns can be added, separated with a ’ , ’

Exercise 5 create a group using your dataset and summarize 2 new variables

mutate_at()

we may want to apply the same mutate transformation to multiple columns in our dataset. The dplyr function mutate_at() allows us to do this. mutate_at() applies the function of our choice to multiple columns in our dataset, and takes the following form:

# data <- data %>% 
#         mutate_at(c("colname", "colname2"), funs(function))

the first argument is a vector of column names, while the second is a function that we add. The function can be a simple mean or something more complex. We can also input a recoding function.

batch recoding using mutate_at()

Exercise 6 First lets create a new csv and paste the column names from our dataset. Next, transpose the data in excel so that this becomes rows instead of columns. find a set of columns that we would like to perform a recoding on and remove the rest. Name the column ‘variables’.

Lets read this csv of names into R and call it something ie. ‘satisfaction_vars’. Then we will create a vector of our column names:

# satisfaction_vars <- satisfaction_vars %>% 
#  column_to_rownames(var = "variables") %>% 
#  row.names()

This will allow us to apply a function to the variables of our choice with very little code.

batch recoding using mutate_at() ctnd

Now we will create our function that will recode our variables, where ‘x’ will be the vector of column names

# recode_binary <- function (x) { recode_factor(x, `1`="negative", 
# `2`="negative", `3`="neutral", `4`="positive", `5`="positive", 
# .default = NA_character_)}

We can now use mutate_at() as follows:

# data <- data %>% 
#         mutate_at(satisfaction_vars, function)