Martin Frigaard 2017-08-30
In the last tutorial we introduced the concept of tidy data. Tidy data has one observation per row, and one variable per column. We also went over how to change the shape of our data set with tidyr using data sets from the fivethirtyeight package.
For newcomers to R, check out my introductory tutorial for Storybench here.
In this tutorial, we will dive a little deeper into data manipulation to focus on processing and creating variables. Whether you’re building models, creating visualizations, or just passing a dataset onto another analyst, you’ll spend most of your time manipulating the data into a structure or arrangement that suits your needs.
One example of this is the survey or data collection form. Web-based data collection forms or tools like Survey Monkey and Qualtrics have made the survey distribution process easier. However, data arrangements for collecting and storing survey responses are rarely identical to data arrangements for visualizing or modeling.
Occasionally data management is structured in a way that allows for a seamless transition between data collection and analysis, but I think these cases are rare.
The preparation work for a dataset before analysis or modeling has many names: data munging/wrangling, cleansing, and preparation etc. I’ve grown to like the term “data rectangling” from Jenny Bryan because it suggests the shape for data in the tidyverse we’re usually working towards.
I suggest not thinking of any data as “dirty” and in need of “cleaning.” David Mimno from Cornell explains why this isn’t a helpful analogy:
“To me, these imply that there is some kind of pure or clean data buried in a thin layer of non-clean data, and that one need only hose the dataset off to reveal the hard porcelain underneath the muck. In reality, the process is more like deciding how to cut into a piece of material, or how much to plane down a surface. It’s not that there’s any real distinction between good and bad, it’s more that some parts are softer or knottier than others. Judgment is critical.”
I like to consider data manipulation as a set of fundamental skills you’ll rely on to understand the structure, format, size, and limitations of any data set. The famous basketball coach John Wooden once wrote about how basic ball handling skills, dribbling, and passing abilities contributed to each player’s overall performance.
“These seemingly trivial matters, taken together and added to many, many other so-called trivial matters build into something very big: namely, your success.”
Thinking about data rectangling skills in this way can transform these repetitive, burdensome, sometimes monotonous tasks into a set of bedrock competencies.
The tidyverse package for manipulating data is dplyr (pronounced “d-plier” where “plier” is pronounced just like the hand tool). The dplyr package comes with a collection of verbs for data manipulation. The more you use these verbs, the more you will start thinking about data rectangling as a series of steps, each with a specific function.
When you combine dplyr with magrittr, you’ll be able to create data manipulation pipelines that are logical and easy to read.
The data set we will be using is from the FiveThirtyEight article titled “What Do Men Think It Means To Be A Man?”. I won’t be loading this data set from the fivethirtyeight package. There is a wealth of great materials in the fivethirtyeight package, but it’s better to learn how to manipulate data with an actual raw data file, and it just so happens there is one for this article in their GitHub repository.
Below is a code chunk that contains the URLs for the data and documentation files for the masculinity survey. I can use utils::download.file() to download these within RStudio. This code chunk will also check to see if the docs/ and data/ folders exist, and it will create them if they don’t.
I put the data and README.md files in the data/ folder and the masculinity-survey.pdf in the docs folder.
Quick tip #1: to use a particular function within a package you can use the syntax package::function
# assign urls ----
raw_responses_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/raw-responses.csv"
data_readme_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/README.md"
masculinity_survey_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/masculinity-survey.csv"
masculinity_doc_url <- "https://github.com/fivethirtyeight/data/raw/master/masculinity-survey/masculinity-survey.pdf"
# create data folder ----
if (!dir.exists("data")) {
  dir.create("data")
}
# create docs folder -----
if (!dir.exists("docs")) {
  dir.create("docs")
}
# download files -----
download.file(url = raw_responses_url,
              destfile = "data/raw-responses.csv")
download.file(url = masculinity_survey_url,
              destfile = "data/masculinity-survey.csv")
download.file(url = data_readme_url,
              destfile = "data/README.md")
# download .pdf into docs folder -----
# mode = "wb" keeps the binary .pdf intact on Windows
download.file(url = masculinity_doc_url,
              destfile = "docs/masculinity-survey.pdf",
              mode = "wb")
I’ll import the data below using the file path from above. Before I do that, I am going to read through the README.md file and check out the masculinity-survey.pdf. These files inform me of the following:
masculinity-survey.csv contains cross-tabulations of various survey questions
I’ll use readr::read_csv() to import the .csv file.
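The import call itself is not shown in this section, so here is a minimal sketch of what it might look like. The object name RawSurvey comes from later in the tutorial; the small file written below is a made-up stand-in for raw-responses.csv (it shares the same quirk, a first column with no header) so the example runs without the download step.

```r
library(readr)

# In the tutorial the call would presumably be (an assumption, not shown in the source):
# RawSurvey <- readr::read_csv("data/raw-responses.csv")

# Hypothetical three-row stand-in for raw-responses.csv; note the empty first header
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c(",StartDate,q0001,weight",
    "1,5/10/18 4:01,Somewhat masculine,1.714",
    "2,5/10/18 6:30,Somewhat masculine,1.247",
    "3,5/10/18 7:02,Very masculine,0.516"),
  toy_path
)

ToySurvey <- readr::read_csv(toy_path)

# readr fills in the missing header and guesses a type for every column
names(ToySurvey)
```

Depending on the installed readr version, the unnamed column may be called X1 (as in the output below) or ...1.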
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_character(),
## X1 = col_double(),
## weight = col_double()
## )
## See spec(...) for full column specifications.
## # A tibble: 10 x 98
## X1 StartDate EndDate q0001 q0002 q0004_0001 q0004_0002 q0004_0003
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 5/10/18 … 5/10/1… Some… Some… Not selec… Not selec… Not selec…
## 2 2 5/10/18 … 5/10/1… Some… Some… Father or… Not selec… Not selec…
## 3 3 5/10/18 … 5/10/1… Very… Not … Father or… Not selec… Not selec…
## 4 4 5/10/18 … 5/10/1… Very… Not … Father or… Mother or… Other fam…
## 5 5 5/10/18 … 5/10/1… Very… Very… Not selec… Not selec… Other fam…
## 6 6 5/10/18 … 5/10/1… Very… Some… Father or… Not selec… Not selec…
## 7 7 5/10/18 … 5/10/1… Some… Not … Father or… Mother or… Other fam…
## 8 8 5/10/18 … 5/10/1… Some… Some… Father or… Not selec… Not selec…
## 9 9 5/10/18 … 5/10/1… Very… Not … Father or… Not selec… Not selec…
## 10 10 5/11/18 … 5/11/1… Some… Some… Father or… Not selec… Not selec…
## # … with 90 more variables: q0004_0004 <chr>, q0004_0005 <chr>,
## # q0004_0006 <chr>, q0005 <chr>, q0007_0001 <chr>, q0007_0002 <chr>,
## # q0007_0003 <chr>, q0007_0004 <chr>, q0007_0005 <chr>,
## # q0007_0006 <chr>, q0007_0007 <chr>, q0007_0008 <chr>,
## # q0007_0009 <chr>, q0007_0010 <chr>, q0007_0011 <chr>,
## # q0008_0001 <chr>, q0008_0002 <chr>, q0008_0003 <chr>,
## # q0008_0004 <chr>, q0008_0005 <chr>, q0008_0006 <chr>,
## # q0008_0007 <chr>, q0008_0008 <chr>, q0008_0009 <chr>,
## # q0008_0010 <chr>, q0008_0011 <chr>, q0008_0012 <chr>, q0009 <chr>,
## # q0010_0001 <chr>, q0010_0002 <chr>, q0010_0003 <chr>,
## # q0010_0004 <chr>, q0010_0005 <chr>, q0010_0006 <chr>,
## # q0010_0007 <chr>, q0010_0008 <chr>, q0011_0001 <chr>,
## # q0011_0002 <chr>, q0011_0003 <chr>, q0011_0004 <chr>,
## # q0011_0005 <chr>, q0012_0001 <chr>, q0012_0002 <chr>,
## # q0012_0003 <chr>, q0012_0004 <chr>, q0012_0005 <chr>,
## # q0012_0006 <chr>, q0012_0007 <chr>, q0013 <chr>, q0014 <chr>,
## # q0015 <chr>, q0017 <chr>, q0018 <chr>, q0019_0001 <chr>,
## # q0019_0002 <chr>, q0019_0003 <chr>, q0019_0004 <chr>,
## # q0019_0005 <chr>, q0019_0006 <chr>, q0019_0007 <chr>,
## # q0020_0001 <chr>, q0020_0002 <chr>, q0020_0003 <chr>,
## # q0020_0004 <chr>, q0020_0005 <chr>, q0020_0006 <chr>,
## # q0021_0001 <chr>, q0021_0002 <chr>, q0021_0003 <chr>,
## # q0021_0004 <chr>, q0022 <chr>, q0024 <chr>, q0025_0001 <chr>,
## # q0025_0002 <chr>, q0025_0003 <chr>, q0026 <chr>, q0028 <chr>,
## # q0029 <chr>, q0030 <chr>, q0034 <chr>, q0035 <chr>, q0036 <chr>,
## # race2 <chr>, racethn4 <chr>, educ3 <chr>, educ4 <chr>, age3 <chr>,
## # kids <chr>, orientation <chr>, weight <dbl>
The message tells me 1) there was an unnamed column in the raw-responses.csv file, which was named X1 and parsed as a number (col_double()), 2) readr parsed the weight variable as a number (col_double()), and 3) readr parsed all the other imported columns as character strings (.default = col_character()).
I will use dplyr::glimpse() with a display width of 78 characters to view the RawSurvey data frame.
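The call that produced the glimpse is not reproduced here. A minimal sketch, using a made-up three-row tibble (ToySurvey) in place of RawSurvey, shows that the number passed to glimpse() is simply the output width in characters:

```r
library(dplyr)

# Hypothetical stand-in for RawSurvey
ToySurvey <- dplyr::tibble(
  X1 = c(1, 2, 3),
  StartDate = c("5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02"),
  q0001 = c("Somewhat masculine", "Somewhat masculine", "Very masculine")
)

# 78 is the width argument: one row per variable, truncated to 78 characters
glimpsed <- ToySurvey %>% dplyr::glimpse(78)
```

glimpse() prints its summary and returns the data unchanged, which is why it can sit at the end of a pipeline without altering the result.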
The dimensions for this data set are 1,615 observations and 98 variables, which matches the description in the README.md file:
raw-responses.csv contains all 1,615 responses to the survey including the weights for each response. Responses to open-ended questions have been omitted, including those where a respondent explained what they meant by selecting the “other” option in response to a question.
But after opening the masculinity-survey.pdf file, I notice this survey only lists 30 questions. What is going on here? If I take a closer look at the dplyr::glimpse() output above, I start to see what’s going on.
First, there are a few additional variables in this dataset that aren’t in the masculinity-survey.pdf. For example, X1 is a variable that was assigned when we read these data into R (that’s what the Missing column names filled in: 'X1' [1] warning was telling us). The StartDate and EndDate variables are also missing from the masculinity-survey.pdf.
Second, I also notice the variable names have two sets of numbers: a prefix (q0000) and a suffix (0000) separated by an underscore (_). See an example of this with question four below.
THIS IS NORMAL. Many times the data dictionary or documentation files don’t match up exactly with the accompanying data set. But with a little detective work, you can usually figure out what the discrepancies are (and why they exist).
Names are important. The tidyverse has an excellent style guide on how to name things, but you should also check out Jenny Bryan’s slides on this topic. I stick to three basic rules for naming objects in R:
DataFrame). If the name gets too long, I start removing vowels
myFunction or iPhone)
my_vector, my_list, my_model)
You’ll see how a good naming convention can save you a ton of typing.
The verbs for extracting or moving variables around are dplyr::select() or dplyr::pull(). For example, I can use both to pick out a single variable from a data frame (StartDate).
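A minimal sketch of the two calls, run on a made-up tibble (ToySurvey) rather than the full survey data:

```r
library(dplyr)

# Hypothetical stand-in for RawSurvey
ToySurvey <- dplyr::tibble(
  X1 = c(1, 2, 3),
  StartDate = c("5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02")
)

# select() keeps the data frame structure: one column, three rows
start_tbl <- ToySurvey %>% dplyr::select(StartDate)

# pull() drops the data frame and hands back the column's values
start_vec <- ToySurvey %>% dplyr::pull(StartDate)

class(start_tbl)
class(start_vec)
```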
# # A tibble: 1,615 x 1
# StartDate
# <chr>
# 1 5/10/18 4:01
# 2 5/10/18 6:30
# 3 5/10/18 7:02
# 4 5/10/18 7:27
# 5 5/10/18 7:35
# 6 5/10/18 8:25
# 7 5/10/18 8:29
# 8 5/10/18 10:04
# 9 5/10/18 11:00
# 10 5/11/18 12:36
# # … with 1,605 more rows
# [1] "5/10/18 4:01" "5/10/18 6:30" "5/10/18 7:02" "5/10/18 7:27"
# [5] "5/10/18 7:35" "5/10/18 8:25" "5/10/18 8:29" "5/10/18 10:04"
# [9] "5/10/18 11:00" "5/11/18 12:36" "5/11/18 3:07" "5/11/18 5:18"
These both work on a single variable in a data frame, but the result they return is different. The dplyr::select() function returns a tibble, and dplyr::pull() returns a vector.
Quick tip #2: The dplyr::glimpse() function is also from the dplyr package and is very handy for viewing data. Adding this to the end of a manipulation pipeline displays the result with the variables transposed into rows, and shows as much of the data as will fit on the screen. dplyr::glimpse() can also be applied to a single variable using the $ operator:
DataSet$variable %>% dplyr::glimpse() or
DataSet %$% dplyr::glimpse(variable)
I usually need to select more than one variable from a data frame, and dplyr has some helper functions that make this easier.
These functions are placed inside dplyr::select() to add more specific criteria for the variables I want to extract from a data frame or tibble.
These functions will match on a pattern/location:
contains()
ends_with()
starts_with()
matches()
These will return variables based on a numeric range or a vector of names:
num_range()
one_of()
And this is the catch-all:
everything()
We will use these helper functions to reorganize and rename variables below.
The variables in the RawSurvey data frame follow a consistent naming convention (as noted above). Consistent names mean I can easily select() variables if I want to reorganize the data frame. For example, if I wanted to select the first three variables (X1:EndDate) and all the variables for question eight, I’d use dplyr::contains() with the appropriate prefix (q0008).
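As a sketch of that selection, here is the same idea on a made-up tibble (ToySurvey) with only a handful of columns; the column names follow the survey's naming pattern, but the values are invented:

```r
library(dplyr)

# Hypothetical stand-in for RawSurvey
ToySurvey <- dplyr::tibble(
  X1 = c(1, 2),
  StartDate = c("5/10/18 4:01", "5/10/18 6:30"),
  EndDate = c("5/10/18 4:06", "5/10/18 6:53"),
  q0007_0001 = c("Often", "Rarely"),
  q0008_0001 = c("Not selected", "Not selected"),
  q0008_0002 = c("Not selected", "Your weight"),
  weight = c(1.714, 1.247)
)

# a range for the first three columns, plus every column containing "q0008"
QuestionEight <- ToySurvey %>%
  dplyr::select(X1:EndDate,
                dplyr::contains("q0008"))

names(QuestionEight)
```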
## Observations: 1,615
## Variables: 15
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ StartDate <chr> "5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02", "5/10/18…
## $ EndDate <chr> "5/10/18 4:06", "5/10/18 6:53", "5/10/18 7:09", "5/10/18…
## $ q0008_0001 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0002 <chr> "Not selected", "Your weight", "Not selected", "Not sele…
## $ q0008_0003 <chr> "Your hair or hairline", "Not selected", "Not selected",…
## $ q0008_0004 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0005 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0006 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0007 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0008 <chr> "Not selected", "Your mental health", "Not selected", "N…
## $ q0008_0009 <chr> "Your physical health", "Your physical health", "Your ph…
## $ q0008_0010 <chr> "Your finances, including your current or future income,…
## $ q0008_0011 <chr> "Not selected", "Not selected", "Not selected", "Not sel…
## $ q0008_0012 <chr> "Not selected", "Not selected", "Not selected", "None of…
These helper functions also work by negation. If I wanted to create a data frame with only the demographic variables, I place a - sign in front of a helper function.
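One way to get this result, sketched on a made-up tibble, is to drop every column whose name starts with q00. The pattern is my assumption; the original call is not shown here.

```r
library(dplyr)

# Hypothetical stand-in for RawSurvey
ToySurvey <- dplyr::tibble(
  X1 = c(1, 2),
  StartDate = c("5/10/18 4:01", "5/10/18 6:30"),
  q0001 = c("Somewhat masculine", "Very masculine"),
  q0008_0001 = c("Not selected", "Not selected"),
  race2 = c("Non-white", "White"),
  weight = c(1.714, 1.247)
)

# the minus sign keeps everything EXCEPT the columns the helper matches
Demographics <- ToySurvey %>%
  dplyr::select(-dplyr::starts_with("q00"))

names(Demographics)
```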
## Observations: 1,615
## Variables: 11
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ StartDate <chr> "5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02", "5/10/1…
## $ EndDate <chr> "5/10/18 4:06", "5/10/18 6:53", "5/10/18 7:09", "5/10/1…
## $ race2 <chr> "Non-white", "White", "White", "White", "White", "White…
## $ racethn4 <chr> "Hispanic", "White", "White", "White", "White", "White"…
## $ educ3 <chr> "College or more", "Some college", "College or more", "…
## $ educ4 <chr> "College or more", "Some college", "College or more", "…
## $ age3 <chr> "35 - 64", "65 and up", "35 - 64", "65 and up", "35 - 6…
## $ kids <chr> "No children", "Has children", "Has children", "Has chi…
## $ orientation <chr> "Gay/Bisexual", "Straight", "Straight", "No answer", "S…
## $ weight <dbl> 1.71402597, 1.24712012, 0.51574606, 0.60064008, 1.03340…
Be sure to check out the other select helpers here.
I’ll go over a few ways to rename variables. The first is dplyr::rename(), and its syntax is new_name = old_name. I’ll use it below to rename the X1 variable as id and create a new object called MascSurveyData.
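A sketch of this step on a made-up tibble follows. The new name id is taken from the output later in this tutorial; MascToy is a hypothetical stand-in for MascSurveyData.

```r
library(dplyr)

# Hypothetical stand-in for RawSurvey
ToySurvey <- dplyr::tibble(
  X1 = c(1, 2, 3),
  StartDate = c("5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02")
)

# new_name = old_name
MascToy <- ToySurvey %>%
  dplyr::rename(id = X1)

# check the renamed column on its own
MascToy$id %>% dplyr::glimpse(78)
```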
## num [1:1615] 1 2 3 4 5 6 7 8 9 10 ...
I can also rename multiple variables at one time with the dplyr::rename() function if I separate them with a comma. See an example of this below:
MascSurveyData %>%
  dplyr::rename(
    start_date = StartDate, # better naming conventions
    end_date = EndDate) %>%
  dplyr::glimpse(78)
# Observations: 1,615
# Variables: 98
# $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
# $ start_date <chr> "5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02", "5/10/1…
# $ end_date <chr> "5/10/18 4:06", "5/10/18 6:53", "5/10/18 7:09", "5/10/1…
I’ll make these changes permanent by assigning them to MascSurveyData.
MascSurveyData <- MascSurveyData %>%
  dplyr::rename(
    start_date = StartDate, # better naming conventions
    end_date = EndDate)
I can also rename variables with select() by following the same syntax (new_name = old_name). I’ll rename question 4, “Where have you gotten your ideas about what it means to be a good man?” as a ‘good man ideas’ scale, by adding the prefix gmis_.
My first option is to rename these items using dplyr::select() and a range (q0004_0001:q0004_0006).
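A sketch of the range approach on a made-up tibble with three question-four items instead of six. Giving a single name to a multi-column selection is what produces the numbered columns; treat the exact naming as something to verify on your installed dplyr version.

```r
library(dplyr)

# Hypothetical stand-in for MascSurveyData
ToySurvey <- dplyr::tibble(
  id = 1:2,
  q0004_0001 = c("Not selected", "Father or father figure(s)"),
  q0004_0002 = c("Not selected", "Not selected"),
  q0004_0003 = c("Not selected", "Other family"),
  q0005 = c("Yes", "No")
)

# name the whole range once; select() numbers the renamed columns in order
GoodManIdeas <- ToySurvey %>%
  dplyr::select(id,
                gmis_ = q0004_0001:q0004_0003)

names(GoodManIdeas)
```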
## Observations: 1,615
## Variables: 7
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ gmis_1 <chr> "Not selected", "Father or father figure(s)", "Father or fat…
## $ gmis_2 <chr> "Not selected", "Not selected", "Not selected", "Mother or m…
## $ gmis_3 <chr> "Not selected", "Not selected", "Not selected", "Other famil…
## $ gmis_4 <chr> "Pop culture", "Not selected", "Not selected", "Not selected…
## $ gmis_5 <chr> "Not selected", "Not selected", "Not selected", "Not selecte…
## $ gmis_6 <chr> "Not selected", "Not selected", "Other (please specify)", "N…
This method requires that I know 1) the number and 2) the names of the variables in my make-believe scale. Combining dplyr::select()’s renaming ability with the helper functions removes that requirement.
Which brings me to option #2: I can match on a specific pattern (like "q0004_"), and I can preserve the original variable order (by adding dplyr::everything()).
MascSurveyData %>%
  dplyr::select(
    dplyr::everything(),
    gmis_ = dplyr::starts_with("q0004_")) %>%
  dplyr::glimpse(78)
# Observations: 1,615
# Variables: 98
# $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
# $ start_date <chr> "5/10/18 4:01", "5/10/18 6:30", "5/10/18 7:02", "5/10/1…
# $ end_date <chr> "5/10/18 4:06", "5/10/18 6:53", "5/10/18 7:09", "5/10/1…
# $ q0001 <chr> "Somewhat masculine", "Somewhat masculine", "Very mascu…
# $ q0002 <chr> "Somewhat important", "Somewhat important", "Not too im…
# $ gmis_1 <chr> "Not selected", "Father or father figure(s)", "Father o…
# $ gmis_2 <chr> "Not selected", "Not selected", "Not selected", "Mother…
# $ gmis_3 <chr> "Not selected", "Not selected", "Not selected", "Other …
# $ gmis_4 <chr> "Pop culture", "Not selected", "Not selected", "Not sel…
# $ gmis_5 <chr> "Not selected", "Not selected", "Not selected", "Not se…
# $ gmis_6 <chr> "Not selected", "Not selected", "Other (please specify)…
# ...omitted output
Neat, huh?
dplyr::count() is an essential tool from the tidyverse because data science is mostly counting things. It is also very versatile. By passing the entire data frame (MascSurveyData) to count() I get the number of rows.
## # A tibble: 1 x 1
## n
## <int>
## 1 1615
The individual responses to each variable tell me a lot about the original question. For example, I can pass q0001 to dplyr::count() and see what it contains.
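Both uses of count() described above can be sketched on a made-up tibble: with no variable it returns the number of rows, and with a variable it returns one row per distinct response.

```r
library(dplyr)

# Hypothetical stand-in for MascSurveyData
ToySurvey <- dplyr::tibble(
  q0001 = c("Very masculine", "Somewhat masculine",
            "Somewhat masculine", "No answer")
)

# no variable: a single row holding the number of rows
n_rows <- ToySurvey %>% dplyr::count()

# one variable: a row per distinct response, with its frequency in n
by_response <- ToySurvey %>% dplyr::count(q0001)

n_rows
by_response
```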
## # A tibble: 5 x 2
## q0001 n
## <chr> <int>
## 1 No answer 14
## 2 Not at all masculine 32
## 3 Not very masculine 131
## 4 Somewhat masculine 826
## 5 Very masculine 612
These are the responses to “In general, how masculine or ‘manly’ do you feel?”. I’ll rename q0001 as how_masc.
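The rename itself is not shown here; a minimal sketch on a made-up tibble follows. Later output in this tutorial also lists q0002 under the name how_important, so I assume it was renamed in the same call.

```r
library(dplyr)

# Hypothetical stand-in for MascSurveyData
ToySurvey <- dplyr::tibble(
  id = 1:4,
  q0001 = c("Very masculine", "Somewhat masculine",
            "Somewhat masculine", "No answer"),
  q0002 = c("Not too important", "Somewhat important",
            "Very important", "No answer")
)

Renamed <- ToySurvey %>%
  dplyr::rename(
    how_masc = q0001,       # "how masculine or 'manly' do you feel?"
    how_important = q0002)  # assumed name, taken from later output

# the counts are unchanged; only the column label differs
Renamed %>% dplyr::count(how_masc)
```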
The next question is “How important is it to you that others see you as masculine?” and the responses are below:
## # A tibble: 5 x 2
## q0002 n
## <chr> <int>
## 1 No answer 9
## 2 Not at all important 240
## 3 Not too important 541
## 4 Somewhat important 628
## 5 Very important 197
Quick Tip #3: End a comment with four dashes (----) inside a code chunk and it will show up in your document outline tool.
The next six variables are all from question four, “Where have you gotten your ideas about what it means to be a good man?”.
# q0004 ----
MascSurveyData %>% dplyr::count(q0004_0001)
MascSurveyData %>% dplyr::count(q0004_0002)
MascSurveyData %>% dplyr::count(q0004_0003)
MascSurveyData %>% dplyr::count(q0004_0004)
MascSurveyData %>% dplyr::count(q0004_0005)
MascSurveyData %>% dplyr::count(q0004_0006)
The output for each dplyr::count() contains two numbers: the total answers to a particular response (like Father or father figure(s) or Pop culture), and the total of Not selected for that response.
Often I need to change the format of an existing variable in a data frame. This can be done using dplyr::mutate() and the equals sign =. For example, I notice the id variable is formatted as a double, but I want it to be an integer. I can do this with dplyr::mutate() and as.integer().
MascSurveyData <- MascSurveyData %>%
  dplyr::mutate(id = as.integer(id))
MascSurveyData$id %>% dplyr::glimpse(78)
## int [1:1615] 1 2 3 4 5 6 7 8 9 10 ...
Now that I’m getting a better understanding of how the survey data are structured in the raw data set, I can begin to create new variables to suit my needs. For example, the article mentions collapsing two response categories into a single statistic.
“When asked how masculine or ‘manly’ they generally feel, 83 percent of men said they felt ‘very’ or ‘somewhat’ masculine.”
The first new variable I’ll create will identify if the respondent indicated they were Very masculine or Somewhat masculine. I will name this masc_ind.
Quick Tip #4: the _ind suffix is added because this is an indicator variable. A TRUE response to an indicator variable means that this measure is present, and FALSE means that it’s absent. As we saw above, adding a suffix or prefix to variables of a certain type makes it easier to identify them in a large dataset.
The function for creating a brand new variable is dplyr::mutate(). The equal sign (=) separates the name of the new variable from the conditions for creating it.
If I want to create a new variable that has only two possible responses (TRUE or FALSE) I can use the dplyr::if_else() function inside dplyr::mutate().
dplyr::if_else() takes three arguments:
condition (this is how_masc %in% c("Very masculine", "Somewhat masculine") in my case)
true is what happens if the condition is satisfied
false is what will happen if the condition is not satisfied
I also have a tool at my disposal to verify this variable has been created correctly (i.e. dplyr::count()). By passing the new variable (masc_ind) and old variable (how_masc) I can check the counts to see if the totals make sense. I can also use the tidyr::spread() function to see a cross-tabulation of each response.
MascSurveyData %>%
  dplyr::mutate(masc_ind =
    dplyr::if_else(
      condition = how_masc %in% c("Very masculine",
                                  "Somewhat masculine"),
      true = TRUE,
      false = FALSE,
      missing = NA)) %>%
  dplyr::count(masc_ind, how_masc) %>%
  tidyr::spread(masc_ind, n)
## # A tibble: 5 x 3
## how_masc `FALSE` `TRUE`
## <chr> <int> <int>
## 1 No answer 14 NA
## 2 Not at all masculine 32 NA
## 3 Not very masculine 131 NA
## 4 Somewhat masculine NA 826
## 5 Very masculine NA 612
This is what I expected to see! This is the beauty of working in the tidyverse–tibbles (rectangular data) are the common data objects returned by most functions, so we can look at a function’s output as an object that can be manipulated with another tidyverse function.
The dplyr::mutate() function keeps every existing variable and adds the new one alongside them. What if I wanted to keep only the new indicator (masc_ind) and drop how_masc along with all the other variables? This can be done using dplyr::transmute().
MascSurveyData %>%
  dplyr::transmute(masc_ind =
    dplyr::if_else(
      condition = how_masc %in% c("Very masculine",
                                  "Somewhat masculine"),
      true = TRUE,
      false = FALSE,
      missing = NA)) %>%
  dplyr::glimpse(78)
## Observations: 1,615
## Variables: 1
## $ masc_ind <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
The important thing to note is that this returns a dataset with a single variable (masc_ind).
So far I’ve created a new variable based on a single condition in one variable, but what if I need to create a new variable based on multiple conditions in several different variables? This is where the dplyr::case_when() function comes in handy. The syntax for dplyr::case_when() is similar to dplyr::if_else(). The first argument should be a condition for existing variable values to match, but instead of true and false arguments, I’ll provide a formula operator ~ and the value that belongs in the new variable if the match is met.
See an example below:
new_variable = dplyr::case_when(
  variable_1 == "condition 1" ~ "new value 1",
  variable_2 == "condition 2" ~ "new value 2",
  TRUE ~ NA_character_ # all else are NA
)
This syntax assumes that the existing variables are character/string (variable_1 and variable_2), and that the new variable (new_variable) will also be a character/string variable.
I’m going to create a fictional scale called masc_scale and it has three levels: high, moderate, and low. The levels in masc_scale are based on the responses to two variables:
how_masc, the “In general, how masculine or ‘manly’ do you feel?” question, and
q0018, which is “How often do you try to be the one who pays when on a date?”
The new variable will have four conditions:
high respondents on the masc_scale indicated they were Very masculine and Always tried to pay on a date.
moderate respondents on the masc_scale indicated they were Somewhat masculine and Often or Sometimes tried to pay on a date.
low respondents on the masc_scale indicated they were Not very masculine or Not at all masculine and Rarely or Never tried to pay on a date.
Respondents with No answer for both how_masc and q0018 will get an NA in the masc_scale.
The logic for this new variable is in the comments below. I use the select() helpers to check the new variable and variables used to create it.
MascSurveyData %>%
  dplyr::mutate(masc_scale = dplyr::case_when(
    # high masc_scale ----
    # feel very masculine and always pay for dates
    how_masc == "Very masculine" & q0018 == "Always" ~ "high",
    # moderate masc_scale ----
    # feel somewhat masculine and often/sometimes pay for dates
    how_masc == "Somewhat masculine" &
      q0018 %in% c("Often", "Sometimes") ~ "moderate",
    # low masc_scale ----
    # feel not very/not at all masculine and rarely/never pay for dates
    how_masc %in% c("Not very masculine", "Not at all masculine") &
      q0018 %in% c("Rarely", "Never") ~ "low",
    # no answer on both items as NA ----
    # (rows matching none of the conditions are also NA)
    how_masc == "No answer" & q0018 == "No answer" ~ NA_character_)) %>%
  # check this new variable with select helpers
  dplyr::select(q0018,
                dplyr::contains("masc"))
# # A tibble: 1,615 x 4
# q0018 how_masc masc_ind masc_scale
# <chr> <chr> <lgl> <chr>
# 1 Sometimes Somewhat masculine TRUE moderate
# 2 Rarely Somewhat masculine TRUE NA
# 3 Sometimes Very masculine TRUE NA
# 4 Always Very masculine TRUE high
# 5 Always Very masculine TRUE high
# 6 Always Very masculine TRUE high
# 7 Sometimes Somewhat masculine TRUE moderate
# 8 Often Somewhat masculine TRUE moderate
# 9 Always Very masculine TRUE high
# 10 Always Somewhat masculine TRUE NA
# # … with 1,605 more rows
It’s helpful to think of each level of dplyr::case_when() as satisfying a logical condition (TRUE or FALSE), and then what the resulting value should be when each condition is satisfied.
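One detail worth seeing in isolation: a row that satisfies none of the conditions gets NA. That is why rows 2, 3, and 10 in the output above are NA even though neither answer is No answer. A small sketch on made-up data:

```r
library(dplyr)

# Hypothetical responses; the third row matches no condition below
ToySurvey <- dplyr::tibble(
  how_masc = c("Very masculine", "Somewhat masculine",
               "Somewhat masculine", "No answer"),
  q0018 = c("Always", "Often", "Rarely", "No answer")
)

Scaled <- ToySurvey %>%
  dplyr::mutate(masc_scale = dplyr::case_when(
    how_masc == "Very masculine" & q0018 == "Always" ~ "high",
    how_masc == "Somewhat masculine" &
      q0018 %in% c("Often", "Sometimes") ~ "moderate"
    # no catch-all condition: unmatched rows fall through to NA
  ))

# NA shows up as its own row in the counts
Scaled %>% dplyr::count(masc_scale)
```

Counting the new variable right away is a quick check on how many respondents fell through every condition.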
Each of the functions covered above works on a single variable. These are dropped inside a single dplyr::mutate() function to create a new variable in the data frame. It’s important to note that I could combine both new variables (masc_scale and masc_ind) into a single dplyr::mutate() function call.
MascSurveyData %>%
  dplyr::mutate(
    # create integer id
    id = as.integer(id), # <- separate with a comma!
    # create masc_ind ----
    masc_ind =
      dplyr::if_else( ...), # <- separate with a comma!
    # create masc_scale ----
    masc_scale = dplyr::case_when( ...))
There are three additional variants of mutate() I will briefly cover below.
The dplyr::mutate_all() is handy if you want to mutate all variables in a data frame with a particular function. For example, I can select the date variables using the select() helpers and mutate them to dates with lubridate::mdy_hm().
MascSurveyData %>%
  dplyr::select(
    dplyr::contains("date")) %>%
  dplyr::mutate_all(lubridate::mdy_hm) %>%
  dplyr::glimpse(78)
## Observations: 1,615
## Variables: 2
## $ start_date <dttm> 2018-05-10 04:01:00, 2018-05-10 06:30:00, 2018-05-10 07…
## $ end_date <dttm> 2018-05-10 04:06:00, 2018-05-10 06:53:00, 2018-05-10 07…
Quick Tip #5: pass all the functions inside the mutate_all() variants without the parentheses.
I can also change only a few variables in a data frame and leave the others unchanged.
For example, if I decided all the elements in question four (q0004_0001 through q0004_0006) needed to be factors (read more about factors here), I could pass the data frame to dplyr::mutate_at() and include a string in the vars(matches()) helpers to identify question four variables.
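The call is not shown here, so this is a sketch of how it might look on a made-up tibble. The pattern "q0004_" and the use of as.factor are my assumptions about the original.

```r
library(dplyr)

# Hypothetical stand-in for MascSurveyData
ToySurvey <- dplyr::tibble(
  id = 1:3,
  q0004_0001 = c("Not selected", "Father or father figure(s)",
                 "Father or father figure(s)"),
  q0004_0002 = c("Not selected", "Not selected", "Not selected"),
  q0005 = c("Yes", "Yes", "No")
)

# only columns matched inside vars() are converted; the rest pass through
Factored <- ToySurvey %>%
  dplyr::mutate_at(dplyr::vars(dplyr::matches("q0004_")), as.factor)

Factored %>% dplyr::glimpse(78)
```

In dplyr 1.0 and later the same step can be written as dplyr::mutate(dplyr::across(dplyr::matches("q0004_"), as.factor)).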
# Observations: 1,615
# Variables: 100
# $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
# $ start_date <dttm> 2018-05-10 04:01:00, 2018-05-10 06:30:00, 2018-05-10…
# $ end_date <dttm> 2018-05-10 04:06:00, 2018-05-10 06:53:00, 2018-05-10…
# $ how_masc <chr> "Somewhat masculine", "Somewhat masculine", "Very mas…
# $ how_important <chr> "Somewhat important", "Somewhat important", "Not too …
# $ q0004_0001 <fct> Not selected, Father or father figure(s), Father or f…
# $ q0004_0002 <fct> Not selected, Not selected, Not selected, Mother or m…
# $ q0004_0003 <fct> Not selected, Not selected, Not selected, Other famil…
# $ q0004_0004 <fct> Pop culture, Not selected, Not selected, Not selected…
# $ q0004_0005 <fct> Not selected, Not selected, Not selected, Not selecte…
# $ q0004_0006 <fct> Not selected, Not selected, Other (please specify), N…
# $ q0005 <chr> "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", …
The dplyr::mutate_if() function tests a condition (in the form of a function) and changes only the variables where the condition is TRUE. For example, if I wanted to perform a log10() transformation on the weight variable (which also happens to be the only variable with decimal points in its measurement), I could set the first argument of dplyr::mutate_if() to is.double, and then apply a function (like log10).
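Before transforming, it helps to see which columns the predicate actually picks up. A sketch on a made-up tibble (the first three weight values are taken from the output in this tutorial; everything else is invented):

```r
library(dplyr)

# Hypothetical stand-in for MascSurveyData; id is already an integer
ToySurvey <- dplyr::tibble(
  id = 1:3,
  q0001 = c("Somewhat masculine", "Somewhat masculine", "Very masculine"),
  weight = c(1.71402597, 1.24712012, 0.51574606)
)

# which columns would mutate_if(is.double, ...) touch?
is_dbl <- sapply(ToySurvey, is.double)
is_dbl

# the untransformed values, for comparison
ToySurvey %>% dplyr::pull(weight) %>% utils::head()

# only the double column is changed
Logged <- ToySurvey %>% dplyr::mutate_if(is.double, log10)
```

Because the predicate is applied to every column, any other double in the data frame would be transformed too, so this check is worth running first.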
## [1] 1.71402597 1.24712012 0.51574606 0.60064008 1.03340045 0.05908664
# now transform
MascSurveyData %>%
  dplyr::mutate_if(is.double, log10) %>%
  dplyr::pull(weight) %>% utils::head()
## [1] 0.23401740 0.09590828 -0.28756408 -0.22138569 0.01426865 -1.22851070
Now I will export this as a .csv file and time-stamp it.
# fs::dir_ls("data")
# `file` replaces the deprecated `path` argument (readr 1.4.0 and later)
readr::write_csv(x = MascSurveyData,
                 file = paste0(
                   "data/",
                   lubridate::today(),
                   "-MascSurveyData.csv"))
# verify
fs::dir_ls("data")
## data/2019-01-30-MascSurveyData.csv
## data/2019-04-09-BikeData.rds
## data/2019-07-12-LomaDatesWide.rds
## data/2019-07-12-tidyr-pivot-post-data.RData
## data/2019-08-03-LomaDatesWide.rds
## data/2019-08-03-MascSurveyData.csv
## data/2019-08-03-tidyr-pivot-post-data.RData
## data/FARS.csv
## data/LomaDatesWide.csv
## data/LomaWideSmall.csv
## data/README copy.md
## data/README.md
## data/README_v02.Rmd
## data/README_v02.md
## data/Readme.txt
## data/aggdp_worldbank.csv
## data/babynames.csv
## data/celeb_heights.csv
## data/csv-data
## data/day.csv
## data/excel-data
## data/ggg-canelo.xlsx
## data/hour.csv
## data/indgdp_worldbank.csv
## data/masculinity-readme.md
## data/masculinity-survey
## data/masculinity-survey.csv
## data/raw-responses.csv
## data/servgdp_worldbank.csv
## data/tidyr-data.RData
Next tutorial I will cover functions to alter cases within the MascSurveyData data set.