Install and load packages:
install.packages(“dplyr”)
library(dplyr) library(purrr)
dplyr is a grammar package for data manipulation, providing a consistent set of functions that help solve the most common manipulation challenges
Again, some important dplyr Functions
To select a set of columns we can use:
# select(data, column_name,...)
To select all the columns except a specific column/s
# select (data, - column_name,...)
To select a range of columns by name, use the “:” operator
# select(data, column_name_1:column_name_n))
To filter the rows based on a condition, you can use:
filter(data, condition)
(ctrl-shift-M)
dplyr imports this operator from another package (magrittr). This operator allows you to “pipe” the output from one function to the input of another function.
Many dplyr functions can be used in conjunction with each other in pipes like in the following example. Note that it is common practice to first specify the data frame to be manipulated. This
# data <- data %>% # select(column_name_1:column_name_n)) %>% # filter( column_name_2 > 10)
Exercise 1 Import a dataset of your choice in R using read.csv() function and call it data .
Select a range of columns of interest and filter for a condition. One of these filters should be for a categorical vriable (ie. all “male”) and at least one should be of a numeric variable (ie. more or less than a number)
Mutate allows us to create new columns/variables and is commonly used within a dplyr pipe. The syntax is:
# mutate(data, new_column_name = condition)
or, within a pipe:
# data <- data %>% # mutate(new_column_name = condition)
Exercise 2 use mutate() to create a new numeric column that is the sum of other numeric columns in your dataset
dplyr provides the following ways to recode data conveniently
the recode function takes the following syntax for character values in a specified column(s):
# recode(data$variable, a = "Apple", b = "Banana")
and with numeric class variables, here with all other variables set to NA (.default):
# recode_factor(data$variable, `2` = 1, `4` = 2, .default = NA_real_)
Recode can be applied to one or all values as needed
Exercise 3 use recode() to recode values of a character and numeric factor variable
Functions similar to ifelse statements, but with simplified code.
the case_when function takes the following syntax:
# case_when(x < 5 ~ "low", # x == 5 ~ "mid", # x > 5 ~ "high" # TRUE ~ NA_character_)
where the left side is the old values and the right the new values. the TRUE ~ is equivalent to the ‘else’ in an ifelse statement
It is common that you will need to create a new column with mutate based on conditions set in a case_when (ie. when we want to class ranges in a numeric column as below)
# data <- data %>% # mutate(new_column_name = case_when(numeric_var < 5 ~ "low", # numeric_var == 5 ~ "mid", # numeric_var > 5 ~ "high" # TRUE ~ NA_character_)
Exercise 4 use mutate and case_when to create 2 new columns based on conditions
These two functions are generally used together to produce a type of R equivalent to pivot tables.
group_by() groups data according to factors of a variable
# new_group_name <- data %>% # group_by(variable_to_group)
it is recommended to assign a grouped dataset to a group name, as above
summarize() functions similar to mutate in creating new columns, however following a group_by it takes a function that condenses numeric variables into one cell, such as mean, sum etc
# new_group_name <- data %>% # group_by(variable_to_group) %>% # summarize(mean = mean(numeric_variable), # sum = sum(numeric_variable))
multiple summarized columns can be added, separated with a ’ , ’
Exercise 5 create a group using your dataset and summarize 2 new variables
we may want to apply the same mutate transformation to multiple columns in our dataset. The dplyr function mutate_at() allows us to do this. mutate_at() applies the function of our choice to multiple columns in our dataset, and takes the following form:
# data <- data %>% # mutate_at(c("colname", "colname2"), funs(function))
the first argument is a vector of column names, while the second is a function that we add. The function can be a simple mean or something more complex. We can also input a recoding function.
Exercise 6 First lets create a new csv and paste the column names from our dataset. Next, transpose the data in excel so that this becomes rows instead of columns. find a set of columns that we would like to perform a recoding on and remove the rest. Name the column ‘variables’.
Lets read this csv of names into R and call it something ie. ‘satisfaction_vars’. Then we will create a vector of our column names:
# satisfaction_vars <- satisfaction_vars %>% # column_to_rownames(var = "variables") %>% # row.names()
This will allow us to apply a function to the variables of our choice with very little code.
Now we will create our function that will recode our variables, where ‘x’ will be the vector of column names
# recode_binary <- function (x) { recode_factor(x, `1`="negative", # `2`="negative", `3`="neutral", `4`="positive", `5`="positive", # .default = NA_character_)}
We can now use mutate_at() as follows:
# data <- data %>% # mutate_at(satisfaction_vars, function)