Martin Frigaard 2017-08-30
This tutorial is part 2 of manipulating data with dplyr and friends in the tidyverse.
For newcomers to R, check out the Data Journalism in R on Storybench.
Often you’ll be interested in data that meet specific criteria, and not an entire data set. dplyr comes with a set of functions (verbs) for reducing a data set to only those cases that meet conditions you define. I will demonstrate these functions with the data set from last tutorial.These data are from a survey created by FiveThirtyEight for the article titled,
“What Do Men Think It Means To Be A Man?”.
Find these data in their raw form on the GitHub repository. We will load data from the last tutorial in the code chunk below:
## Parsed with column specification:
## cols(
## .default = col_character(),
## id = col_double(),
## start_date = col_datetime(format = ""),
## end_date = col_datetime(format = ""),
## weight = col_double(),
## masc_ind = col_logical()
## )
## See spec(...) for full column specifications.
Important things to note about the import:
col_character(), so most variables will be character stringsid and weight are numeric (col_double())col_datetime(format = ""))masc_ind) is a binary (TRUE / FALSE) variableThe dplyr::slice() function uses numerical indices to retrive observations from a data frame or tibble. This is similar to head().
This will give me the 5th through the last observation in the MascSurveyData data frame
## # A tibble: 1,611 x 100
## id start_date end_date how_masc how_important
## <dbl> <dttm> <dttm> <chr> <chr>
## 1 5 2018-05-10 07:35:00 2018-05-10 07:42:00 Very ma… Very importa…
## 2 6 2018-05-10 08:25:00 2018-05-10 08:34:00 Very ma… Somewhat imp…
## 3 7 2018-05-10 08:29:00 2018-05-10 08:41:00 Somewha… Not too impo…
## 4 8 2018-05-10 10:04:00 2018-05-10 10:11:00 Somewha… Somewhat imp…
## 5 9 2018-05-10 11:00:00 2018-05-10 11:07:00 Very ma… Not at all i…
## 6 10 2018-05-11 12:36:00 2018-05-11 12:43:00 Somewha… Somewhat imp…
## 7 11 2018-05-11 03:07:00 2018-05-11 03:16:00 Very ma… Somewhat imp…
## 8 12 2018-05-11 05:18:00 2018-05-11 05:33:00 Somewha… Not too impo…
## 9 13 2018-05-11 05:50:00 2018-05-11 05:53:00 Very ma… Somewhat imp…
## 10 14 2018-05-11 05:51:00 2018-05-11 06:21:00 Very ma… Very importa…
## # … with 1,601 more rows, and 95 more variables: q0004_0001 <chr>,
## # q0004_0002 <chr>, q0004_0003 <chr>, q0004_0004 <chr>,
## # q0004_0005 <chr>, q0004_0006 <chr>, q0005 <chr>, q0007_0001 <chr>,
## # q0007_0002 <chr>, q0007_0003 <chr>, q0007_0004 <chr>,
## # q0007_0005 <chr>, q0007_0006 <chr>, q0007_0007 <chr>,
## # q0007_0008 <chr>, q0007_0009 <chr>, q0007_0010 <chr>,
## # q0007_0011 <chr>, q0008_0001 <chr>, q0008_0002 <chr>,
## # q0008_0003 <chr>, q0008_0004 <chr>, q0008_0005 <chr>,
## # q0008_0006 <chr>, q0008_0007 <chr>, q0008_0008 <chr>,
## # q0008_0009 <chr>, q0008_0010 <chr>, q0008_0011 <chr>,
## # q0008_0012 <chr>, q0009 <chr>, q0010_0001 <chr>, q0010_0002 <chr>,
## # q0010_0003 <chr>, q0010_0004 <chr>, q0010_0005 <chr>,
## # q0010_0006 <chr>, q0010_0007 <chr>, q0010_0008 <chr>,
## # q0011_0001 <chr>, q0011_0002 <chr>, q0011_0003 <chr>,
## # q0011_0004 <chr>, q0011_0005 <chr>, q0012_0001 <chr>,
## # q0012_0002 <chr>, q0012_0003 <chr>, q0012_0004 <chr>,
## # q0012_0005 <chr>, q0012_0006 <chr>, q0012_0007 <chr>, q0013 <chr>,
## # q0014 <chr>, q0015 <chr>, q0017 <chr>, q0018 <chr>, q0019_0001 <chr>,
## # q0019_0002 <chr>, q0019_0003 <chr>, q0019_0004 <chr>,
## # q0019_0005 <chr>, q0019_0006 <chr>, q0019_0007 <chr>,
## # q0020_0001 <chr>, q0020_0002 <chr>, q0020_0003 <chr>,
## # q0020_0004 <chr>, q0020_0005 <chr>, q0020_0006 <chr>,
## # q0021_0001 <chr>, q0021_0002 <chr>, q0021_0003 <chr>,
## # q0021_0004 <chr>, q0022 <chr>, q0024 <chr>, q0025_0001 <chr>,
## # q0025_0002 <chr>, q0025_0003 <chr>, q0026 <chr>, q0028 <chr>,
## # q0029 <chr>, q0030 <chr>, q0034 <chr>, q0035 <chr>, q0036 <chr>,
## # race2 <chr>, racethn4 <chr>, educ3 <chr>, educ4 <chr>, age3 <chr>,
## # kids <chr>, orientation <chr>, weight <dbl>, masc_ind <lgl>,
## # masc_scale <chr>
The dplyr::filter() function works by including only rows that meet a condition in a logical statement (i.e. the condition is TRUE for that case). For example, if I wanted to only include respondents who describe their sexual orientation as "Straight" (question 26), I would use the following:
# # A tibble: 1,408 x 100
# id start_date end_date how_masc how_important
# <dbl> <dttm> <dttm> <chr> <chr>
# 1 2 2018-05-10 06:30:00 2018-05-10 06:53:00 Somewha… Somewhat imp…
# 2 3 2018-05-10 07:02:00 2018-05-10 07:09:00 Very ma… Not too impo…
# 3 5 2018-05-10 07:35:00 2018-05-10 07:42:00 Very ma… Very importa…
# 4 6 2018-05-10 08:25:00 2018-05-10 08:34:00 Very ma… Somewhat imp…
# 5 8 2018-05-10 10:04:00 2018-05-10 10:11:00 Somewha… Somewhat imp…
# 6 9 2018-05-10 11:00:00 2018-05-10 11:07:00 Very ma… Not at all i…
# 7 10 2018-05-11 12:36:00 2018-05-11 12:43:00 Somewha… Somewhat imp…
# 8 11 2018-05-11 03:07:00 2018-05-11 03:16:00 Very ma… Somewhat imp…
# 9 12 2018-05-11 05:18:00 2018-05-11 05:33:00 Somewha… Not too impo…
# 10 13 2018-05-11 05:50:00 2018-05-11 05:53:00 Very ma… Somewhat imp…
# # … with 1,398 more rows, and 95 more variables:This returns a data frame with fewer rows than the original (1,408 vs. 1,615) because this only includes the "Straight" respondents.
Quick Tip #1: Checking the dimensions of your data from time to time serves as a nice little sanity check :)
If I want to add another condition to filter() the response on, I can use the logical (& or |) operators. For example, what if I wanted to add a condition that included only respondents who were "Divorced" or "Separated" (question q0024)?
I can rely on my old friend dplyr::count() from the last tutorial to verify I’m getting the correct counts here.
# add the or operator
MascSurveyData %>%
dplyr::filter(q0024 == "Divorced" | q0024 == "Separated") %>%
dplyr::count(q0024)## # A tibble: 2 x 2
## q0024 n
## <chr> <int>
## 1 Divorced 218
## 2 Separated 25
# this also works
MascSurveyData %>%
dplyr::filter(q0024 %in% c("Divorced", "Separated")) %>%
dplyr::count(q0024)## # A tibble: 2 x 2
## q0024 n
## <chr> <int>
## 1 Divorced 218
## 2 Separated 25
Note there are two ways to filter these data and get the same response. The %in% operator is better for specifying text responses (as opposed to the equal sign =). For example, if I wanted to build on the previous condition to include repondents who are between the ages of 18 and 34, I could add the following:
MascSurveyData %>%
dplyr::filter(q0024 %in% c("Divorced", "Separated") & age3 == "18 - 34") %>%
dplyr::count(age3, q0024) %>%
tidyr::spread(q0024, n)## # A tibble: 1 x 3
## age3 Divorced Separated
## <chr> <int> <int>
## 1 18 - 34 6 1
Quick Tip #2: Be careful with the logical operators–their order counts!
distinct) responsesThe dplyr::distinct() function returns the unique values in a column, data frame or tibble. It’s similar to dplyr::count(), but it does not return the sum column. If I wanted to check the distinct responses to q0014 (or "How much have you heard about the #MeToo movement?").
## # A tibble: 6 x 1
## q0014
## <chr>
## 1 <NA>
## 2 A lot
## 3 Only a little
## 4 Nothing at all
## 5 Some
## 6 No answer
Depending on the data set, I’ll occaisionally also need to check if I am dealing with the distinct/unique observations (or if the rows have been duplicated). When we downloaded the raw-responses.csv data, we saw that there was an unnamed column (X1) with respondent ids (id). One of the ways I use dplyr::distinct() is in combination with base::identical() and base::nrow().
In the MascSurveyData, I want the number of rows to be identical to the number of unique respondents (we know this from the documentation, but in other data arrangements it isn’t as clear). I can test this by comparing the distinct number of rows in MascSurveyData to the number of rows in MascSurveyData.
base::identical(x = MascSurveyData %>% dplyr::distinct() %>% base::nrow(),
y = MascSurveyData %>% base::nrow())## [1] TRUE