How to manipulate cases with dplyr in R (part 2)

Martin Frigaard 2017-08-30

Motivation

This tutorial is part 2 of manipulating data with dplyr and friends in the tidyverse.

For newcomers to R, check out the Data Journalism in R on Storybench.

Motivation

Often you’ll be interested in data that meet specific criteria, and not an entire data set. dplyr comes with a set of functions (verbs) for reducing a data set to only those cases that meet conditions you define. I will demonstrate these functions with the data set from last tutorial.These data are from a survey created by FiveThirtyEight for the article titled,

“What Do Men Think It Means To Be A Man?”.

Find these data in their raw form on the GitHub repository. We will load data from the last tutorial in the code chunk below:

Importing data

MascSurveyData <- readr::read_csv("data/2019-08-03-MascSurveyData.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   id = col_double(),
##   start_date = col_datetime(format = ""),
##   end_date = col_datetime(format = ""),
##   weight = col_double(),
##   masc_ind = col_logical()
## )

## See spec(...) for full column specifications.

Important things to note about the import:

the default column is col_character(), so most variables will be character strings
id and weight are numeric (col_double())
both dates are formatted as date-time (col_datetime(format = ""))
the masculine indicator (masc_ind) is a binary (TRUE / FALSE) variable

Extract rows by position with (slice)

The dplyr::slice() function uses numerical indices to retrive observations from a data frame or tibble. This is similar to head().

This will give me the 5th through the last observation in the MascSurveyData data frame

MascSurveyData %>% dplyr::slice( 5:n())

## # A tibble: 1,611 x 100
##       id start_date          end_date            how_masc how_important
##    <dbl> <dttm>              <dttm>              <chr>    <chr>        
##  1     5 2018-05-10 07:35:00 2018-05-10 07:42:00 Very ma… Very importa…
##  2     6 2018-05-10 08:25:00 2018-05-10 08:34:00 Very ma… Somewhat imp…
##  3     7 2018-05-10 08:29:00 2018-05-10 08:41:00 Somewha… Not too impo…
##  4     8 2018-05-10 10:04:00 2018-05-10 10:11:00 Somewha… Somewhat imp…
##  5     9 2018-05-10 11:00:00 2018-05-10 11:07:00 Very ma… Not at all i…
##  6    10 2018-05-11 12:36:00 2018-05-11 12:43:00 Somewha… Somewhat imp…
##  7    11 2018-05-11 03:07:00 2018-05-11 03:16:00 Very ma… Somewhat imp…
##  8    12 2018-05-11 05:18:00 2018-05-11 05:33:00 Somewha… Not too impo…
##  9    13 2018-05-11 05:50:00 2018-05-11 05:53:00 Very ma… Somewhat imp…
## 10    14 2018-05-11 05:51:00 2018-05-11 06:21:00 Very ma… Very importa…
## # … with 1,601 more rows, and 95 more variables: q0004_0001 <chr>,
## #   q0004_0002 <chr>, q0004_0003 <chr>, q0004_0004 <chr>,
## #   q0004_0005 <chr>, q0004_0006 <chr>, q0005 <chr>, q0007_0001 <chr>,
## #   q0007_0002 <chr>, q0007_0003 <chr>, q0007_0004 <chr>,
## #   q0007_0005 <chr>, q0007_0006 <chr>, q0007_0007 <chr>,
## #   q0007_0008 <chr>, q0007_0009 <chr>, q0007_0010 <chr>,
## #   q0007_0011 <chr>, q0008_0001 <chr>, q0008_0002 <chr>,
## #   q0008_0003 <chr>, q0008_0004 <chr>, q0008_0005 <chr>,
## #   q0008_0006 <chr>, q0008_0007 <chr>, q0008_0008 <chr>,
## #   q0008_0009 <chr>, q0008_0010 <chr>, q0008_0011 <chr>,
## #   q0008_0012 <chr>, q0009 <chr>, q0010_0001 <chr>, q0010_0002 <chr>,
## #   q0010_0003 <chr>, q0010_0004 <chr>, q0010_0005 <chr>,
## #   q0010_0006 <chr>, q0010_0007 <chr>, q0010_0008 <chr>,
## #   q0011_0001 <chr>, q0011_0002 <chr>, q0011_0003 <chr>,
## #   q0011_0004 <chr>, q0011_0005 <chr>, q0012_0001 <chr>,
## #   q0012_0002 <chr>, q0012_0003 <chr>, q0012_0004 <chr>,
## #   q0012_0005 <chr>, q0012_0006 <chr>, q0012_0007 <chr>, q0013 <chr>,
## #   q0014 <chr>, q0015 <chr>, q0017 <chr>, q0018 <chr>, q0019_0001 <chr>,
## #   q0019_0002 <chr>, q0019_0003 <chr>, q0019_0004 <chr>,
## #   q0019_0005 <chr>, q0019_0006 <chr>, q0019_0007 <chr>,
## #   q0020_0001 <chr>, q0020_0002 <chr>, q0020_0003 <chr>,
## #   q0020_0004 <chr>, q0020_0005 <chr>, q0020_0006 <chr>,
## #   q0021_0001 <chr>, q0021_0002 <chr>, q0021_0003 <chr>,
## #   q0021_0004 <chr>, q0022 <chr>, q0024 <chr>, q0025_0001 <chr>,
## #   q0025_0002 <chr>, q0025_0003 <chr>, q0026 <chr>, q0028 <chr>,
## #   q0029 <chr>, q0030 <chr>, q0034 <chr>, q0035 <chr>, q0036 <chr>,
## #   race2 <chr>, racethn4 <chr>, educ3 <chr>, educ4 <chr>, age3 <chr>,
## #   kids <chr>, orientation <chr>, weight <dbl>, masc_ind <lgl>,
## #   masc_scale <chr>

Filtering responses

The dplyr::filter() function works by including only rows that meet a condition in a logical statement (i.e. the condition is TRUE for that case). For example, if I wanted to only include respondents who describe their sexual orientation as "Straight" (question 26), I would use the following:

MascSurveyData %>% 
    dplyr::filter(q0026 == "Straight")

# # A tibble: 1,408 x 100
#       id start_date          end_date            how_masc how_important
#    <dbl> <dttm>              <dttm>              <chr>    <chr>        
#  1     2 2018-05-10 06:30:00 2018-05-10 06:53:00 Somewha… Somewhat imp…
#  2     3 2018-05-10 07:02:00 2018-05-10 07:09:00 Very ma… Not too impo…
#  3     5 2018-05-10 07:35:00 2018-05-10 07:42:00 Very ma… Very importa…
#  4     6 2018-05-10 08:25:00 2018-05-10 08:34:00 Very ma… Somewhat imp…
#  5     8 2018-05-10 10:04:00 2018-05-10 10:11:00 Somewha… Somewhat imp…
#  6     9 2018-05-10 11:00:00 2018-05-10 11:07:00 Very ma… Not at all i…
#  7    10 2018-05-11 12:36:00 2018-05-11 12:43:00 Somewha… Somewhat imp…
#  8    11 2018-05-11 03:07:00 2018-05-11 03:16:00 Very ma… Somewhat imp…
#  9    12 2018-05-11 05:18:00 2018-05-11 05:33:00 Somewha… Not too impo…
# 10    13 2018-05-11 05:50:00 2018-05-11 05:53:00 Very ma… Somewhat imp…
# # … with 1,398 more rows, and 95 more variables:

This returns a data frame with fewer rows than the original (1,408 vs. 1,615) because this only includes the "Straight" respondents.

Quick Tip #1: Checking the dimensions of your data from time to time serves as a nice little sanity check :)

Filtering responses on multiple conditions

If I want to add another condition to filter() the response on, I can use the logical (& or |) operators. For example, what if I wanted to add a condition that included only respondents who were "Divorced" or "Separated" (question q0024)?

I can rely on my old friend dplyr::count() from the last tutorial to verify I’m getting the correct counts here.

# add the or operator
MascSurveyData %>% 
    dplyr::filter(q0024 == "Divorced" | q0024 == "Separated") %>% 
    dplyr::count(q0024)

## # A tibble: 2 x 2
##   q0024         n
##   <chr>     <int>
## 1 Divorced    218
## 2 Separated    25

# this also works
MascSurveyData %>% 
    dplyr::filter(q0024 %in% c("Divorced", "Separated")) %>% 
    dplyr::count(q0024)

## # A tibble: 2 x 2
##   q0024         n
##   <chr>     <int>
## 1 Divorced    218
## 2 Separated    25

Note there are two ways to filter these data and get the same response. The %in% operator is better for specifying text responses (as opposed to the equal sign =). For example, if I wanted to build on the previous condition to include repondents who are between the ages of 18 and 34, I could add the following:

MascSurveyData %>% 
    dplyr::filter(q0024 %in% c("Divorced", "Separated") & age3 == "18 - 34") %>% 
    dplyr::count(age3, q0024) %>% 
    tidyr::spread(q0024, n)

## # A tibble: 1 x 3
##   age3    Divorced Separated
##   <chr>      <int>     <int>
## 1 18 - 34        6         1

Quick Tip #2: Be careful with the logical operators–their order counts!

Return unique (i.e. `distinct`) responses

The dplyr::distinct() function returns the unique values in a column, data frame or tibble. It’s similar to dplyr::count(), but it does not return the sum column. If I wanted to check the distinct responses to q0014 (or "How much have you heard about the #MeToo movement?").

MascSurveyData %>% dplyr::distinct(q0014)

## # A tibble: 6 x 1
##   q0014         
##   <chr>         
## 1 <NA>          
## 2 A lot         
## 3 Only a little 
## 4 Nothing at all
## 5 Some          
## 6 No answer

Depending on the data set, I’ll occaisionally also need to check if I am dealing with the distinct/unique observations (or if the rows have been duplicated). When we downloaded the raw-responses.csv data, we saw that there was an unnamed column (X1) with respondent ids (id). One of the ways I use dplyr::distinct() is in combination with base::identical() and base::nrow().

In the MascSurveyData, I want the number of rows to be identical to the number of unique respondents (we know this from the documentation, but in other data arrangements it isn’t as clear). I can test this by comparing the distinct number of rows in MascSurveyData to the number of rows in MascSurveyData.

base::identical(x = MascSurveyData %>% dplyr::distinct() %>% base::nrow(), 
                y = MascSurveyData %>% base::nrow())

## [1] TRUE