How to manipulate variables with dplyr in R (part 1)

Martin Frigaard 2017-08-30

In the last tutorial we introduced the concept of tidy data. Tidy data has one observation per row and one variable per column. We also went over how to change the shape of our data set with tidyr using data sets from the fivethirtyeight package.

For newcomers to R, check out my introductory tutorial for Storybench here.

Motivation

In this tutorial, we will dive a little deeper into data manipulation to focus on processing and creating variables. Whether you’re building models, creating visualizations, or just passing a dataset onto another analyst, you’ll spend most of your time manipulating the data into a structure or arrangement that suits your needs.

One example of this is the survey or data collection form. Web-based data collection forms and tools like Survey Monkey and Qualtrics have made the survey distribution process easier. However, data arrangements for collecting and storing survey responses are rarely identical to data arrangements for visualizing or modeling.

Occasionally data management is structured in a way that allows for a seamless transition between data collection and analysis, but I think these cases are rare.

Data rectangling

The preparation work for a dataset before analysis or modeling has many names: data munging, wrangling, cleansing, and preparation. I’ve grown to like the term “data rectangling” from Jenny Bryan because it suggests the shape for data in the tidyverse we’re usually working towards.

I suggest not thinking of any data as “dirty” and in need of “cleaning.” David Mimno from Cornell explains why this isn’t a helpful analogy:

“To me, these imply that there is some kind of pure or clean data buried in a thin layer of non-clean data, and that one need only hose the dataset off to reveal the hard porcelain underneath the muck. In reality, the process is more like deciding how to cut into a piece of material, or how much to plane down a surface. It’s not that there’s any real distinction between good and bad, it’s more that some parts are softer or knottier than others. Judgment is critical.”

I like to consider data manipulation as a set of fundamental skills you’ll rely on to understand the structure, format, size, and limitations of any data set. The famous basketball coach John Wooden once wrote about how basic ball handling skills, dribbling, and passing abilities contributed to each player’s overall performance.

“These seemingly trivial matters, taken together and added to many, many other so-called trivial matters build into something very big: namely, your success.”

Thinking about data rectangling skills in this way can transform these repetitive, burdensome, sometimes monotonous tasks into a set of bedrock competencies.

Thinking in verbs

The tidyverse package for manipulating data is dplyr (pronounced “d-plier” where “plier” is pronounced just like the hand tool). The dplyr package comes with a collection of verbs for data manipulation. The more you use these verbs, the more you will start thinking about data rectangling as a series of steps, each with a specific function.

When you combine dplyr with magrittr, you’ll be able to create data manipulation pipelines that are logical and easy to read.
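As a rough preview of what such a pipeline looks like, here is a minimal sketch on a small made-up tibble. The toy_survey object and its columns are invented for illustration and are not part of the masculinity survey; the verbs themselves are covered in the sections that follow.

```r
library(dplyr) # loading dplyr also makes the %>% pipe available

# toy data: names and values are invented for illustration
toy_survey <- tibble::tibble(
    id = 1:4,
    age = c(25, 41, 67, 33),
    answer = c("Yes", "No", "Yes", "Yes"))

# read the pipeline top to bottom: take the data, THEN keep two
# columns, THEN add a variable, THEN count the new variable
toy_counts <- toy_survey %>% 
    dplyr::select(id, age) %>% 
    dplyr::mutate(over_40 = age > 40) %>% 
    dplyr::count(over_40)
toy_counts
```

Each line does one job, and the pipe hands the result to the next line, which is what makes the steps easy to read back later.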

Load the packages

library(tidyverse)
library(magrittr)

Actual survey data

The data set we will be using is from the FiveThirtyEight article titled, “What Do Men Think It Means To Be A Man?”. I won’t be loading this data set from the fivethirtyeight package. There is a wealth of great material in the fivethirtyeight package, but it’s better to learn how to manipulate data with an actual raw data file, and it just so happens there is one for this article in their GitHub repository.

Downloading files

Below is a code chunk that contains the URLs for the data and documentation files for the masculinity survey. I can use utils::download.file() to download these within RStudio. This code chunk will also check to see if the docs/ and data/ folders exist, and it will create them if they don’t.

I put the data and README.md files in the data/ folder and the masculinity-survey.pdf in the docs/ folder.

Quick tip #1: to use a particular function within a package you can use the syntax package::function

# assign urls ----
raw_responses_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/raw-responses.csv"
data_readme_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/README.md"
masculinity_survey_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/masculinity-survey.csv"
masculinity_doc_url <- "https://github.com/fivethirtyeight/data/raw/master/masculinity-survey/masculinity-survey.pdf"
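# create data folder ----
# dir.exists() tests for a directory specifically, and dropping the
# trailing slash avoids a false negative on Windows
if (!dir.exists("data")) {
    dir.create("data")
}
# create docs folder -----
if (!dir.exists("docs")) {
    dir.create("docs")
}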
# download files -----
download.file(url = raw_responses_url, 
              destfile = "data/raw-responses.csv")
download.file(url = masculinity_survey_url, 
              destfile = "data/masculinity-survey.csv")
download.file(url = data_readme_url, 
              destfile = "data/README.md")
# download .pdf into docs folder -----
download.file(url = masculinity_doc_url,
              destfile = "docs/masculinity-survey.pdf")

Importing data

I’ll import the data below using the file path from above. Before I do that I am going to read through the README.md file and check out the masculinity-survey.pdf. These files inform me of the following:

  1. masculinity-survey.csv contains cross-tabulations of various survey questions
  2. the .pdf tells me:

I’ll use readr::read_csv() to import the .csv file.

RawSurvey <- readr::read_csv("data/raw-responses.csv")
## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   X1 = col_double(),
##   weight = col_double()
## )

## See spec(...) for full column specifications.
RawSurvey %>% utils::head(10)
## # A tibble: 10 x 98
##       X1 StartDate EndDate q0001 q0002 q0004_0001 q0004_0002 q0004_0003
##    <dbl> <chr>     <chr>   <chr> <chr> <chr>      <chr>      <chr>     
##  1     1 5/10/18 … 5/10/1… Some… Some… Not selec… Not selec… Not selec…
##  2     2 5/10/18 … 5/10/1… Some… Some… Father or… Not selec… Not selec…
##  3     3 5/10/18 … 5/10/1… Very… Not … Father or… Not selec… Not selec…
##  4     4 5/10/18 … 5/10/1… Very… Not … Father or… Mother or… Other fam…
##  5     5 5/10/18 … 5/10/1… Very… Very… Not selec… Not selec… Other fam…
##  6     6 5/10/18 … 5/10/1… Very… Some… Father or… Not selec… Not selec…
##  7     7 5/10/18 … 5/10/1… Some… Not … Father or… Mother or… Other fam…
##  8     8 5/10/18 … 5/10/1… Some… Some… Father or… Not selec… Not selec…
##  9     9 5/10/18 … 5/10/1… Very… Not … Father or… Not selec… Not selec…
## 10    10 5/11/18 … 5/11/1… Some… Some… Father or… Not selec… Not selec…
## # … with 90 more variables: q0004_0004 <chr>, q0004_0005 <chr>,
## #   q0004_0006 <chr>, q0005 <chr>, q0007_0001 <chr>, q0007_0002 <chr>,
## #   q0007_0003 <chr>, q0007_0004 <chr>, q0007_0005 <chr>,
## #   q0007_0006 <chr>, q0007_0007 <chr>, q0007_0008 <chr>,
## #   q0007_0009 <chr>, q0007_0010 <chr>, q0007_0011 <chr>,
## #   q0008_0001 <chr>, q0008_0002 <chr>, q0008_0003 <chr>,
## #   q0008_0004 <chr>, q0008_0005 <chr>, q0008_0006 <chr>,
## #   q0008_0007 <chr>, q0008_0008 <chr>, q0008_0009 <chr>,
## #   q0008_0010 <chr>, q0008_0011 <chr>, q0008_0012 <chr>, q0009 <chr>,
## #   q0010_0001 <chr>, q0010_0002 <chr>, q0010_0003 <chr>,
## #   q0010_0004 <chr>, q0010_0005 <chr>, q0010_0006 <chr>,
## #   q0010_0007 <chr>, q0010_0008 <chr>, q0011_0001 <chr>,
## #   q0011_0002 <chr>, q0011_0003 <chr>, q0011_0004 <chr>,
## #   q0011_0005 <chr>, q0012_0001 <chr>, q0012_0002 <chr>,
## #   q0012_0003 <chr>, q0012_0004 <chr>, q0012_0005 <chr>,
## #   q0012_0006 <chr>, q0012_0007 <chr>, q0013 <chr>, q0014 <chr>,
## #   q0015 <chr>, q0017 <chr>, q0018 <chr>, q0019_0001 <chr>,
## #   q0019_0002 <chr>, q0019_0003 <chr>, q0019_0004 <chr>,
## #   q0019_0005 <chr>, q0019_0006 <chr>, q0019_0007 <chr>,
## #   q0020_0001 <chr>, q0020_0002 <chr>, q0020_0003 <chr>,
## #   q0020_0004 <chr>, q0020_0005 <chr>, q0020_0006 <chr>,
## #   q0021_0001 <chr>, q0021_0002 <chr>, q0021_0003 <chr>,
## #   q0021_0004 <chr>, q0022 <chr>, q0024 <chr>, q0025_0001 <chr>,
## #   q0025_0002 <chr>, q0025_0003 <chr>, q0026 <chr>, q0028 <chr>,
## #   q0029 <chr>, q0030 <chr>, q0034 <chr>, q0035 <chr>, q0036 <chr>,
## #   race2 <chr>, racethn4 <chr>, educ3 <chr>, educ4 <chr>, age3 <chr>,
## #   kids <chr>, orientation <chr>, weight <dbl>

The message tells me 1) there was an unnamed column in the raw-responses.csv file, which was named X1 and formatted as a number (col_double()), 2) readr formatted the weight variable as a number (col_double()), and 3) it formatted all the other imported data as character/strings (.default = col_character()).

I will use dplyr::glimpse(78) to view the RawSurvey data frame.

RawSurvey %>% dplyr::glimpse(78)

The dimensions for this data set are 1,615 observations and 98 variables, which matches the description in the README.md file:

raw-responses.csv contains all 1,615 responses to the survey including the weights for each response. Responses to open-ended questions have been omitted, including those where a respondent explained what they meant by selecting the “other” option in response to a question.

But after opening the masculinity-survey.pdf file, I notice this survey lists only 30 questions. What is going on here? If I take a closer look at the dplyr::glimpse() output above, I start to see what’s going on.

First, there are a few additional variables in this dataset that aren’t in the masculinity-survey.pdf. For example, X1 is a variable that was assigned when we read these data into R (that’s what the Missing column names filled in: 'X1' [1] warning was telling us). The StartDate and EndDate variables are also missing from the masculinity-survey.pdf.

Second, I also notice the variable names have two sets of numbers: a prefix (q0000) and a suffix (0000) separated by an underscore (_). See an example of this with question four below.
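One way to see the pattern is to list only the column names that share a prefix. The sketch below uses a one-row stand-in called toy_raw (a hypothetical object with made-up values) that copies the naming pattern from RawSurvey, so it runs without the downloaded file.

```r
# stand-in for RawSurvey: same naming pattern, made-up single row
toy_raw <- tibble::tibble(
    q0002 = "Somewhat important",
    q0004_0001 = "Not selected",
    q0004_0002 = "Not selected",
    q0004_0003 = "Not selected",
    q0005 = "Yes")

# names() returns the column names; grepl() flags the ones that
# start with the question-four prefix
q4_names <- names(toy_raw)[grepl("^q0004_", names(toy_raw))]
q4_names
```

Questions with a single answer (like q0002 and q0005) get one column, while a check-all-that-apply question gets one column per response option, each with its own suffix.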

THIS IS NORMAL. Many times the data dictionary or documentation files don’t match up exactly with the accompanying data set. But with a little detective work, you can usually figure out what the discrepancies are (and why they exist).

Naming stuff

Names are important. The tidyverse has an excellent style guide on how to name things, but you should also check out Jenny Bryan’s slides on this topic. I stick to three basic rules for naming objects in R:

  1. Data frames and tibbles are Pascal case (like DataFrame). If the name gets too long, I start removing vowels
  2. Functions I create are Dromedary case (like myFunction or iPhone)
  3. All other vectors, variables, lists, formulas, etc. are lowercase with underscores (my_vector, my_list, my_model)

You’ll see how a good naming convention can save you a ton of typing.

Select a single variable

The verbs for extracting or moving variables around are dplyr::select() and dplyr::pull(). For example, I can use either one to pick out a single variable from a data frame (StartDate).

RawSurvey %>% dplyr::select(StartDate)
# # A tibble: 1,615 x 1
#    StartDate
#    <chr>
#  1 5/10/18 4:01
#  2 5/10/18 6:30
#  3 5/10/18 7:02
#  4 5/10/18 7:27
#  5 5/10/18 7:35
#  6 5/10/18 8:25
#  7 5/10/18 8:29
#  8 5/10/18 10:04
#  9 5/10/18 11:00
# 10 5/11/18 12:36
# # … with 1,605 more rows
RawSurvey %>% head(12) %>% dplyr::pull(StartDate) 
# [1] "5/10/18 4:01"  "5/10/18 6:30"  "5/10/18 7:02"  "5/10/18 7:27"
# [5] "5/10/18 7:35"  "5/10/18 8:25"  "5/10/18 8:29"  "5/10/18 10:04"
# [9] "5/10/18 11:00" "5/11/18 12:36" "5/11/18 3:07"  "5/11/18 5:18"

These both work on a single variable in a data frame, but the result they display is different. The dplyr::select() function returns a tibble, and dplyr::pull() returns a vector.
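The difference is easy to check directly. This is a small sketch on invented data (toy_dates is hypothetical), testing what kind of object each verb hands back.

```r
library(dplyr)

# toy data, invented for illustration
toy_dates <- tibble::tibble(
    id = 1:3,
    StartDate = c("5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02"))

selected <- toy_dates %>% dplyr::select(StartDate)
pulled <- toy_dates %>% dplyr::pull(StartDate)

# select() keeps the data frame structure (here, one column)
is.data.frame(selected)
# pull() drops that structure and returns the bare character vector
is.character(pulled)
```

In practice this means select() is the one to use in the middle of a pipeline, and pull() is handy at the end when another function expects a vector.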

Quick tip #2: The dplyr::glimpse() function is also from the dplyr package and is very handy for viewing data. Adding this to the end of a manipulation pipeline displays the result with the variables transposed into rows, and shows as much of the data as will fit on the screen. dplyr::glimpse() can also be applied to a single variable using the $ operator:

DataSet$variable %>% dplyr::glimpse() or

DataSet %$% dplyr::glimpse(variable)

I usually need to select more than one variable from a data frame, and dplyr has some helper functions that make this easier.

Select helpers

These functions are placed inside dplyr::select() to add more specific criteria for the variables I want to extract from a data frame or tibble.

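These functions will match on a pattern/location: starts_with(), ends_with(), contains(), and matches().

These will return variables based on position or range: num_range(), and the : operator for a run of adjacent columns (like X1:EndDate).

And this is the catch-all: everything().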

We will use these helper functions to reorganize and rename variables below.
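Before applying them to the survey, here is a minimal sketch of a few helpers on a one-row stand-in for the raw data. The toy_raw object and its values are made up; only the column naming pattern is borrowed from RawSurvey.

```r
library(dplyr)

# stand-in with the same naming pattern as RawSurvey (values invented)
toy_raw <- tibble::tibble(
    X1 = 1, StartDate = "5/10/18 4:01", EndDate = "5/10/18 4:06",
    q0004_0001 = "Not selected", q0004_0002 = "Not selected",
    q0008_0001 = "Not selected", weight = 1.7)

# match on a pattern at the start or end of the name
starts <- toy_raw %>% 
    dplyr::select(dplyr::starts_with("q0004")) %>% names()
ends <- toy_raw %>% 
    dplyr::select(dplyr::ends_with("Date")) %>% names()
# a range of adjacent columns with the : operator
rng <- toy_raw %>% 
    dplyr::select(X1:EndDate) %>% names()
# move one column to the front, then keep everything else
front <- toy_raw %>% 
    dplyr::select(weight, dplyr::everything()) %>% names()
```

Piping each result to names() is just a compact way to confirm which columns a helper picked up.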

Select multiple variables

The variables in the RawSurvey data frame follow a consistent naming convention (as noted above). Consistent names mean I can easily select() variables if I want to reorganize the data frame. For example, if I wanted to select the first three variables (X1:EndDate) and all the variables for question eight, I’d use dplyr::contains() with the appropriate prefix (q0008).

RawSurvey %>% 
    dplyr::select(X1:EndDate,
        dplyr::contains("q0008")) %>% 
    dplyr::glimpse(78)
## Observations: 1,615
## Variables: 15
## $ X1         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ StartDate  <chr> "5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02", "5/10/18…
## $ EndDate    <chr> "5/10/18 4:06", "5/10/18 6:53", "5/10/18 7:09", "5/10/18…
## $ q0008_0001 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0002 <chr> "Not selected", "Your weight", "Not selected", "Not sele…
## $ q0008_0003 <chr> "Your hair or hairline", "Not selected", "Not selected",…
## $ q0008_0004 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0005 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0006 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0007 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0008 <chr> "Not selected", "Your mental health", "Not selected", "N…
## $ q0008_0009 <chr> "Your physical health", "Your physical health", "Your ph…
## $ q0008_0010 <chr> "Your finances, including your current or future income,…
## $ q0008_0011 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0012 <chr> "Not selected", "Not selected", "Not selected", "None of…

These helper functions also work by negation. If I wanted to create a data frame with only the demographic variables, I can place a - sign in front of a helper function to drop every variable it matches.

RawSurvey %>% 
    dplyr::select(
        -dplyr::starts_with("q00")) %>% 
    dplyr::glimpse(78)
## Observations: 1,615
## Variables: 11
## $ X1          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ StartDate   <chr> "5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02", "5/10/1…
## $ EndDate     <chr> "5/10/18 4:06", "5/10/18 6:53", "5/10/18 7:09", "5/10/1…
## $ race2       <chr> "Non-white", "White", "White", "White", "White", "White…
## $ racethn4    <chr> "Hispanic", "White", "White", "White", "White", "White"…
## $ educ3       <chr> "College or more", "Some college", "College or more", "…
## $ educ4       <chr> "College or more", "Some college", "College or more", "…
## $ age3        <chr> "35 - 64", "65 and up", "35 - 64", "65 and up", "35 - 6…
## $ kids        <chr> "No children", "Has children", "Has children", "Has chi…
## $ orientation <chr> "Gay/Bisexual", "Straight", "Straight", "No answer", "S…
## $ weight      <dbl> 1.71402597, 1.24712012, 0.51574606, 0.60064008, 1.03340…

Be sure to check out the other select helpers here.

Renaming a single variable

I’ll go over a few ways to rename variables. The first is dplyr::rename(), and its syntax is new_name = old_name. I’ll use it below to rename the X1 variable and create a new object called MascSurveyData.

MascSurveyData <- RawSurvey %>% dplyr::rename(id = X1)
MascSurveyData %$% dplyr::glimpse(id)
##  num [1:1615] 1 2 3 4 5 6 7 8 9 10 ...

Renaming multiple variables

I can also rename multiple variables at one time with the dplyr::rename() function if I separate them with a comma. See an example of this below:

MascSurveyData %>% 
    dplyr::rename(
        start_date = StartDate, # better naming conventions
        end_date = EndDate) %>% dplyr::glimpse(78)
# Observations: 1,615
# Variables: 98
# $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
# $ start_date  <chr> "5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02", "5/10/1…
# $ end_date    <chr> "5/10/18 4:06", "5/10/18 6:53", "5/10/18 7:09", "5/10/1…

I’ll make these changes permanent by assigning them to MascSurveyData.

MascSurveyData <- MascSurveyData %>% 
    dplyr::rename(
        start_date = StartDate, # better naming conventions
        end_date = EndDate)

I can also rename variables with select() by following the same syntax (new_name = old_name). I’ll rename question 4, “Where have you gotten your ideas about what it means to be a good man?” as a ‘good man ideas’ scale, by adding the prefix gmis_.

My first option is to rename these items using dplyr::select() and a range (q0004_0001:q0004_0006).

MascSurveyData %>%
  dplyr::select(id,
      gmis_ = q0004_0001:q0004_0006) %>% 
    dplyr::glimpse(78)
## Observations: 1,615
## Variables: 7
## $ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ gmis_1 <chr> "Not selected", "Father or father figure(s)", "Father or fat…
## $ gmis_2 <chr> "Not selected", "Not selected", "Not selected", "Mother or m…
## $ gmis_3 <chr> "Not selected", "Not selected", "Not selected", "Other famil…
## $ gmis_4 <chr> "Pop culture", "Not selected", "Not selected", "Not selected…
## $ gmis_5 <chr> "Not selected", "Not selected", "Not selected", "Not selecte…
## $ gmis_6 <chr> "Not selected", "Not selected", "Other (please specify)", "N…

This method requires that I know 1) the number and 2) the names of the variables in my make-believe scale. Combining dplyr::select()’s renaming ability with the helper functions removes that requirement.

Which brings me to option #2: I can match on a specific pattern (like "q0004_"), and I can preserve the original variable order (by adding dplyr::everything()).

MascSurveyData %>%
  dplyr::select(
      dplyr::everything(), 
      gmis_ = starts_with("q0004_")) %>% 
    dplyr::glimpse(78)
# Observations: 1,615
# Variables: 98
# $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
# $ start_date  <chr> "5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02", "5/10/1…
# $ end_date    <chr> "5/10/18 4:06", "5/10/18 6:53", "5/10/18 7:09", "5/10/1…
# $ q0001       <chr> "Somewhat masculine", "Somewhat masculine", "Very mascu…
# $ q0002       <chr> "Somewhat important", "Somewhat important", "Not too im…
# $ gmis_1      <chr> "Not selected", "Father or father figure(s)", "Father o…
# $ gmis_2      <chr> "Not selected", "Not selected", "Not selected", "Mother…
# $ gmis_3      <chr> "Not selected", "Not selected", "Not selected", "Other …
# $ gmis_4      <chr> "Pop culture", "Not selected", "Not selected", "Not sel…
# $ gmis_5      <chr> "Not selected", "Not selected", "Not selected", "Not se…
# $ gmis_6      <chr> "Not selected", "Not selected", "Other (please specify)…
# ...omitted output

Neat huh?


Counting responses

dplyr::count() is an essential tool from the tidyverse because data science is mostly counting things. It is also very versatile. By passing the entire data frame (MascSurveyData) to count() I get the number of rows.

MascSurveyData %>% dplyr::count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1615

The individual responses to each variable tell me a lot about the original question. For example, I can pass q0001 to dplyr::count() and see what it contains.

MascSurveyData %>% dplyr::count(q0001)
## # A tibble: 5 x 2
##   q0001                    n
##   <chr>                <int>
## 1 No answer               14
## 2 Not at all masculine    32
## 3 Not very masculine     131
## 4 Somewhat masculine     826
## 5 Very masculine         612

These are the responses to “In general, how masculine or ‘manly’ do you feel?”. I’ll rename q0001 as how_masc.

# rename q0001 ----
MascSurveyData <- MascSurveyData %>% dplyr::rename(how_masc = q0001) 

The next question is “How important is it to you that others see you as masculine?” and the responses are below:

# q0002 ----
MascSurveyData %>% dplyr::count(q0002) # this is how important question
## # A tibble: 5 x 2
##   q0002                    n
##   <chr>                <int>
## 1 No answer                9
## 2 Not at all important   240
## 3 Not too important      541
## 4 Somewhat important     628
## 5 Very important         197
MascSurveyData <- MascSurveyData %>% 
    dplyr::rename(how_important = q0002) # rename q0002 ----

Quick Tip #3: You can add four dashes ---- inside a code chunk and it will show up on your document outline tool.

The next six variables are all from question four, “Where have you gotten your ideas about what it means to be a good man?”.

# q0004 ----
MascSurveyData %>% dplyr::count(q0004_0001)
MascSurveyData %>% dplyr::count(q0004_0002)
MascSurveyData %>% dplyr::count(q0004_0003)
MascSurveyData %>% dplyr::count(q0004_0004)
MascSurveyData %>% dplyr::count(q0004_0005)
MascSurveyData %>% dplyr::count(q0004_0006)

The output for each dplyr::count() contains two numbers: the total answers to a particular response (like Father or father figure(s) or Pop culture), and the total of Not selected for that response.
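Six separate calls work, but the same tallies can be collected in one pass by stacking the item columns first. The sketch below is one possible approach, not the method used in this tutorial: it assumes tidyr 1.0.0 or later (for pivot_longer()) and runs on a made-up three-row stand-in called toy_q4.

```r
library(dplyr)
library(tidyr)

# toy stand-in for two of the question-four columns (values invented)
toy_q4 <- tibble::tibble(
    id = 1:3,
    q0004_0001 = c("Not selected", 
                   "Father or father figure(s)", 
                   "Father or father figure(s)"),
    q0004_0004 = c("Pop culture", "Not selected", "Not selected"))

q4_counts <- toy_q4 %>% 
    # stack the item columns into item/response pairs
    tidyr::pivot_longer(
        cols = dplyr::starts_with("q0004_"),
        names_to = "item",
        values_to = "response") %>% 
    # one row per item/response combination, with its count
    dplyr::count(item, response)
q4_counts
```

The result has one row per item and response, so the "selected" and Not selected totals for every item sit in a single tibble.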

Change an existing variable

Often I need to change the format of an existing variable in a data frame. This can be done using dplyr::mutate() and the equals sign =. For example, I notice the id variable is formatted as a double, but I want it to be an integer. I can do this with dplyr::mutate() and as.integer().

MascSurveyData <- MascSurveyData %>% 
    dplyr::mutate(id = as.integer(id))
MascSurveyData$id %>% dplyr::glimpse(78)
##  int [1:1615] 1 2 3 4 5 6 7 8 9 10 ...

Creating new variables (1 condition)

Now that I’m getting a better understanding of how the survey data are structured in the raw data set, I can begin to create new variables to suit my needs. For example, the article mentions collapsing two response categories into a single statistic.

“When asked how masculine or ‘manly’ they generally feel, 83 percent of men said they felt ‘very’ or ‘somewhat’ masculine.”

The first new variable I’ll create will identify if the respondent indicated they were “Very masculine” or “Somewhat masculine”. I will name this masc_ind.

Quick Tip #4: the _ind suffix is added because this is an indicator variable. A TRUE response to an indicator variable means that this measure is present, and FALSE means that it’s absent. As we saw above, adding a suffix or prefix to variables of a certain type makes it easier to identify them in a large dataset.

The function for creating a brand new variable is dplyr::mutate(). The equal sign (=) separates the name of the new variable from the conditions for creating it.

DataSet %>% 
    dplyr::mutate(
        new_variable = conditions, for, new, variable
    )
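To make the template concrete, here is a minimal runnable sketch on invented data (toy_weights and the two new variable names are hypothetical).

```r
library(dplyr)

# toy data, invented for illustration
toy_weights <- tibble::tibble(id = 1:3, weight = c(1.7, 1.2, 0.5))

toy_weights <- toy_weights %>% 
    dplyr::mutate(
        # new name on the left of =, expression on the right
        weight_x2 = weight * 2,
        # a variable created earlier in the same call can be reused
        above_two = weight_x2 > 2)
toy_weights
```

The original columns are kept, and the new ones are added to the right in the order they were defined.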

If I want to create a new variable that has only two possible responses (TRUE or FALSE) I can use the dplyr::if_else() function inside dplyr::mutate().

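dplyr::if_else() takes three arguments: condition (a logical test), true (the value to use where the test is TRUE), and false (the value to use where the test is FALSE). An optional fourth argument, missing, supplies the value to use where the test is NA.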

I also have a tool at my disposal to verify this variable has been created correctly (i.e. dplyr::count()). By passing the new variable (masc_ind) and old variable (how_masc) I can check the count to see if the totals make sense. I can also use the tidyr::spread() function to see a cross-tabulation of each response.

MascSurveyData %>% 
    dplyr::mutate(masc_ind = 
        dplyr::if_else(
            condition = how_masc %in% c("Very masculine", 
                                             "Somewhat masculine"),
                 true = TRUE,
                 false = FALSE,
                 missing = NA)) %>% 
    dplyr::count(masc_ind, how_masc) %>% 
    tidyr::spread(masc_ind, n)
## # A tibble: 5 x 3
##   how_masc             `FALSE` `TRUE`
##   <chr>                  <int>  <int>
## 1 No answer                 14     NA
## 2 Not at all masculine      32     NA
## 3 Not very masculine       131     NA
## 4 Somewhat masculine        NA    826
## 5 Very masculine            NA    612

This is what I expected to see! This is the beauty of working in the tidyverse: tibbles (rectangular data) are the common data objects returned by most functions, so we can look at a function’s output as an object that can be manipulated with another tidyverse function.
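One detail worth knowing about the missing argument: it only comes into play when the condition itself evaluates to NA. The sketch below, on a made-up three-row tibble (toy_answers is hypothetical), compares a condition built with == against one built with %in%.

```r
library(dplyr)

# toy data with one genuinely missing response (values invented)
toy_answers <- tibble::tibble(
    how_masc = c("Very masculine", "Not very masculine", NA))

toy_answers <- toy_answers %>% 
    dplyr::mutate(
        # == returns NA for a missing input, so `missing` sets the result
        very_eq = dplyr::if_else(
            condition = how_masc == "Very masculine",
            true = TRUE, false = FALSE, missing = NA),
        # %in% returns FALSE for a missing input, so `missing` is never used
        very_in = dplyr::if_else(
            condition = how_masc %in% "Very masculine",
            true = TRUE, false = FALSE, missing = NA))
toy_answers
```

In the survey data the non-responses are stored as the string "No answer" rather than NA, so this distinction does not change the cross-tabulation above, but it matters for data with real missing values.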

Replacing variables

The dplyr::mutate() function keeps all the existing variables and adds the new one. What if I wanted to keep only the new indicator (masc_ind) in place of the how_masc variable? This can be done using dplyr::transmute(), which drops every variable that isn’t named inside it.

MascSurveyData %>% 
    dplyr::transmute(masc_ind = 
        dplyr::if_else(
            condition = how_masc %in% c("Very masculine", 
                                             "Somewhat masculine"),
                 true = TRUE,
                 false = FALSE,
                 missing = NA)) %>% dplyr::glimpse(78)
## Observations: 1,615
## Variables: 1
## $ masc_ind <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

The important thing to note is that this returns a dataset with a single variable (masc_ind).
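A side-by-side comparison makes the difference clearer. This is a minimal sketch on invented data (toy_survey is hypothetical); it also shows that a variable can be carried through transmute() by naming it.

```r
library(dplyr)

# toy data, invented for illustration
toy_survey <- tibble::tibble(
    id = 1:3,
    how_masc = c("Very masculine", "Somewhat masculine", "No answer"))

# mutate() keeps id and how_masc and adds masc_ind
with_mutate <- toy_survey %>% 
    dplyr::mutate(
        masc_ind = how_masc %in% c("Very masculine", "Somewhat masculine"))

# transmute() keeps only what is named inside it, so id has to be
# listed explicitly to survive
with_transmute <- toy_survey %>% 
    dplyr::transmute(
        id,
        masc_ind = how_masc %in% c("Very masculine", "Somewhat masculine"))

names(with_mutate)
names(with_transmute)
```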

Creating new variables (2+ conditions)

So far I’ve created a new variable based on a single condition in one variable, but what if I need to create a new variable based on multiple conditions in several different variables? This is where the dplyr::case_when() function comes in handy. The syntax for dplyr::case_when() is similar to dplyr::if_else(). The first argument should be a condition for existing variable values to match, but instead of true and false arguments, I’ll provide a formula operator ~ and the value that belongs in the new variable when the condition is met.

See an example below:

new_variable = dplyr::case_when(
    variable_1 == "condition 1" ~ "new value 1",
    variable_2 == "condition 2" ~ "new value 2",
    TRUE ~ NA_character_ # all else are NA
)

This syntax assumes that the existing variables are character/string (variable_1 and variable_2), and that the new variable (new_variable) will also be a character/string variable.

I’m going to create a fictional scale called masc_scale, and it has three levels: high, moderate, and low. The levels in masc_scale are based on the responses to two variables:

  1. how_masc, the “In general, how masculine or ‘manly’ do you feel?” question, and
  2. q0018, which is “How often do you try to be the one who pays when on a date?”

The new variable will have four conditions:

  1. I’m going to assume anyone who indicated they were Very masculine and Always tried to pay on a date rates high on the masc_scale.
  2. Anyone who responded Somewhat masculine and Often or Sometimes tried to pay on a date will be rated moderate on the masc_scale.
  3. And the low respondents on the masc_scale indicated they were Not very masculine or Not at all masculine and Rarely or Never tried to pay on a date.
  4. All the No answer responses to how_masc and q0018 will get an NA in the masc_scale.

The logic for this new variable is in the comments below. I use the select() helpers to check the new variable and variables used to create it.

MascSurveyData %>% 
    dplyr::mutate(masc_scale = dplyr::case_when(
        # high masc_scale ----
        # feel very masculine and always pay for dates
        how_masc == "Very masculine" & q0018 == "Always" ~ "high",

        # moderate masc_scale ----
        # feel somewhat masculine and often/sometimes pay for dates
        how_masc == "Somewhat masculine" & 
            q0018 %in% c("Often", "Sometimes") ~ "moderate",

        # low masc_scale ----
        # feel not very/not at all masculine and rarely/never pay for dates
        how_masc %in% c("Not very masculine", "Not at all masculine") & 
            q0018 %in% c("Rarely", "Never") ~ "low",

        # no answer to both questions as NA ----
        # (rows that match none of the conditions also get NA)
        how_masc == "No answer" & q0018 == "No answer" ~ NA_character_)) %>%
    # check this new variable with select helpers
    dplyr::select(q0018,
        dplyr::contains("masc"))
# # A tibble: 1,615 x 4
#    q0018     how_masc           masc_ind masc_scale
#    <chr>     <chr>              <lgl>    <chr>
#  1 Sometimes Somewhat masculine TRUE     moderate
#  2 Rarely    Somewhat masculine TRUE     NA
#  3 Sometimes Very masculine     TRUE     NA
#  4 Always    Very masculine     TRUE     high
#  5 Always    Very masculine     TRUE     high
#  6 Always    Very masculine     TRUE     high
#  7 Sometimes Somewhat masculine TRUE     moderate
#  8 Often     Somewhat masculine TRUE     moderate
#  9 Always    Very masculine     TRUE     high
# 10 Always    Somewhat masculine TRUE     NA
# # … with 1,605 more rows

It’s helpful to think of each level of dplyr::case_when() as satisfying a logical condition (TRUE or FALSE), and then what the resulting value should be when each condition is satisfied.
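Two behaviors follow from this: the conditions are checked in order and the first one that is TRUE wins, and a row that satisfies none of them gets NA. Here is a minimal sketch on a made-up numeric variable (toy_scores and its cut-offs are invented for illustration).

```r
library(dplyr)

# toy data, invented for illustration
toy_scores <- tibble::tibble(score = c(95, 72, 40, NA))

toy_scores <- toy_scores %>% 
    dplyr::mutate(level = dplyr::case_when(
        score >= 90 ~ "high",
        # 95 also satisfies this condition, but "high" already matched
        score >= 50 ~ "moderate",
        score < 50 ~ "low"))
        # the NA score satisfies no condition, so level is NA
toy_scores
```

Because of the ordering, it usually makes sense to list the most specific conditions first and the most general ones last.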

Changing multiple variables at once

Each of the functions covered above works on a single variable. These are dropped inside a single dplyr::mutate() call to create a new variable in the data frame. It’s important to note that I could combine both new variables (masc_scale and masc_ind) into a single dplyr::mutate() function call.

MascSurveyData %>% 
    dplyr::mutate(
        # create integer id
        id = as.integer(id), # <- separate with a comma!
        # create masc_ind ----
        masc_ind = 
            dplyr::if_else( ...), # <- separate with a comma!

        # create masc_scale ----
        masc_scale = dplyr::case_when( ...)
    ) # <- close mutate()

There are three additional variants of mutate() I will briefly cover below.

Change all columns with a function

dplyr::mutate_all() is handy if you want to mutate all variables in a data frame with a particular function. For example, I can select the date variables using the select() helpers and mutate them to date-times with lubridate::mdy_hm().

MascSurveyData %>% 
    dplyr::select(
        dplyr::contains("date")) %>% 
    dplyr::mutate_all(lubridate::mdy_hm) %>% 
    dplyr::glimpse(78)
## Observations: 1,615
## Variables: 2
## $ start_date <dttm> 2018-05-10 04:01:00, 2018-05-10 06:30:00, 2018-05-10 07…
## $ end_date   <dttm> 2018-05-10 04:06:00, 2018-05-10 06:53:00, 2018-05-10 07…

Quick Tip #5: pass functions to mutate_all() and its variants by name, without the parentheses.

Change variables with a function and a selection of columns

I can also change only a few variables in a data frame and leave the others unchanged.

For example, if I decided all the elements in question four (q0004_0001 through q0004_0006) needed to be factors (read more about factors here), I could pass the data frame to dplyr::mutate_at() and include a string in the vars(matches()) helpers to identify question four variables.

MascSurveyData %>% 
    dplyr::mutate_at(vars(matches("q0004_")), as.factor) %>% 
    dplyr::glimpse(78)
# Observations: 1,615
# Variables: 100
# $ id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
# $ start_date    <dttm> 2018-05-10 04:01:00, 2018-05-10 06:30:00, 2018-05-10…
# $ end_date      <dttm> 2018-05-10 04:06:00, 2018-05-10 06:53:00, 2018-05-10…
# $ how_masc      <chr> "Somewhat masculine", "Somewhat masculine", "Very mas…
# $ how_important <chr> "Somewhat important", "Somewhat important", "Not too …
# $ q0004_0001    <fct> Not selected, Father or father figure(s), Father or f…
# $ q0004_0002    <fct> Not selected, Not selected, Not selected, Mother or m…
# $ q0004_0003    <fct> Not selected, Not selected, Not selected, Other famil…
# $ q0004_0004    <fct> Pop culture, Not selected, Not selected, Not selected…
# $ q0004_0005    <fct> Not selected, Not selected, Not selected, Not selecte…
# $ q0004_0006    <fct> Not selected, Not selected, Other (please specify), N…
# $ q0005         <chr> "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", …

Change variables selected based on a condition (function)

The dplyr::mutate_if() function tests a condition (in the form of a function) and changes only the variables where the condition is TRUE. For example, if I wanted to perform a log10() transformation on the weight variable (which also happens to be the only variable with decimal points in its measurement), I could set the first portion of dplyr::mutate_if() to is.double, and then apply a function (like log10).

# first look at weight untransformed
MascSurveyData %>% 
    dplyr::pull(weight) %>% utils::head()
## [1] 1.71402597 1.24712012 0.51574606 0.60064008 1.03340045 0.05908664
# now transform
MascSurveyData %>% 
    dplyr::mutate_if(is.double, log10) %>% 
    dplyr::pull(weight) %>% utils::head()
## [1]  0.23401740  0.09590828 -0.28756408 -0.22138569  0.01426865 -1.22851070
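If you are working with dplyr 1.0.0 or later, the scoped variants (mutate_all(), mutate_at(), and mutate_if()) are marked as superseded in favor of dplyr::across() used inside mutate(). The sketch below shows roughly equivalent calls on a made-up tibble (toy_survey is hypothetical); it assumes a dplyr version that includes across().

```r
library(dplyr)

# toy data, invented for illustration
toy_survey <- tibble::tibble(
    id = 1:3,
    q0004_0001 = c("Not selected", "Pop culture", "Not selected"),
    weight = c(1, 10, 100))

toy_across <- toy_survey %>% 
    dplyr::mutate(
        # roughly mutate_at(vars(matches("q0004_")), as.factor)
        dplyr::across(dplyr::matches("q0004_"), as.factor),
        # roughly mutate_if(is.double, log10)
        dplyr::across(where(is.double), log10))
toy_across
```

The id column is an integer here, so where(is.double) leaves it alone and only weight is transformed.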

Now I will export this as a .csv file and time-stamp it.

# fs::dir_ls("data")
readr::write_csv(x = MascSurveyData,
                 path = paste0(
                     "data/",
                     base::noquote(lubridate::today()),
                     "-MascSurveyData.csv"))
# verify
fs::dir_ls("data")
## data/2019-01-30-MascSurveyData.csv
## data/2019-04-09-BikeData.rds
## data/2019-07-12-LomaDatesWide.rds
## data/2019-07-12-tidyr-pivot-post-data.RData
## data/2019-08-03-LomaDatesWide.rds
## data/2019-08-03-MascSurveyData.csv
## data/2019-08-03-tidyr-pivot-post-data.RData
## data/FARS.csv
## data/LomaDatesWide.csv
## data/LomaWideSmall.csv
## data/README copy.md
## data/README.md
## data/README_v02.Rmd
## data/README_v02.md
## data/Readme.txt
## data/aggdp_worldbank.csv
## data/babynames.csv
## data/celeb_heights.csv
## data/csv-data
## data/day.csv
## data/excel-data
## data/ggg-canelo.xlsx
## data/hour.csv
## data/indgdp_worldbank.csv
## data/masculinity-readme.md
## data/masculinity-survey
## data/masculinity-survey.csv
## data/raw-responses.csv
## data/servgdp_worldbank.csv
## data/tidyr-data.RData
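The time-stamping step can also be sketched with base R alone. In the example below, toy_export is a made-up tibble and tempdir() stands in for the data/ folder so the snippet does not write into a project directory.

```r
library(readr)

# toy data, invented for illustration
toy_export <- tibble::tibble(id = 1:2, weight = c(1.7, 1.2))

# build "<folder>/YYYY-MM-DD-ToyExport.csv"; Sys.Date() supplies
# today's date and paste0() converts it to text
out_file <- file.path(tempdir(), 
                      paste0(Sys.Date(), "-ToyExport.csv"))
readr::write_csv(toy_export, out_file)

# verify the file was written
file.exists(out_file)
```

Putting the date first in ISO 8601 order (year-month-day) means the files sort chronologically in a directory listing like the one above.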

In the next tutorial, I will cover functions to alter cases within the MascSurveyData data set.