Tidyverse CREATE Assignment

Stefano Biguzzi

2020-10-25

Assignment

In this assignment we were asked to create a vignette that discusses one or more packages within the tidyverse. I chose to look at the dplyr package, specifically the group_by(), tally(), and summarise() functions.

tidyverse

Intro

We’ll start by loading the tidyverse package. The core tidyverse attaches 8 packages, including ggplot2, dplyr, tidyr, and readr.

Loading

The startup message below lists all 8 packages that library(tidyverse) attaches, along with their versions:

library(tidyverse)
#> -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
#> v ggplot2 3.3.2     v purrr   0.3.4
#> v tibble  3.0.4     v dplyr   1.0.2
#> v tidyr   1.1.2     v stringr 1.4.0
#> v readr   1.4.0     v forcats 0.5.0
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

Next, load the example data, which was originally downloaded from Kaggle:

heart_failure <- read.csv(url("https://raw.githubusercontent.com/sbiguzzi/FALL2020TIDYVERSE/master/heart_failure_clinical_records_dataset.csv"))

The dplyr library

dplyr is the tidyverse’s data manipulation package. It provides many functions, or verbs, that solve common data manipulation tasks. The verbs I want to focus on today are group_by(), tally(), and summarise().

group_by()

One of the most powerful functions in dplyr is group_by(). It takes a data frame and groups it by one or more variables, so that subsequent verbs operate on each group separately.[1]

by_sex <- heart_failure %>% group_by(sex)
by_sex_diabetes <- heart_failure %>% group_by(sex,diabetes)

When you print a grouped data frame, the header shows the grouping variables and the number of groups:

by_sex
#> # A tibble: 299 x 13
#> # Groups:   sex [2]
#>      age anaemia creatinine_phos~ diabetes ejection_fracti~ high_blood_pres~
#>    <dbl> <chr>              <int> <chr>               <int> <chr>           
#>  1    75 no_ana~              582 no_diab~               20 high_blood_pres~
#>  2    55 no_ana~             7861 no_diab~               38 no_high_blood_p~
#>  3    65 no_ana~              146 no_diab~               20 no_high_blood_p~
#>  4    50 anaemia              111 no_diab~               20 no_high_blood_p~
#>  5    65 anaemia              160 diabetes               20 no_high_blood_p~
#>  6    90 anaemia               47 no_diab~               40 high_blood_pres~
#>  7    75 anaemia              246 no_diab~               15 no_high_blood_p~
#>  8    60 anaemia              315 diabetes               60 no_high_blood_p~
#>  9    65 no_ana~              157 no_diab~               65 no_high_blood_p~
#> 10    80 anaemia              123 no_diab~               35 high_blood_pres~
#> # ... with 289 more rows, and 7 more variables: platelets <dbl>,
#> #   serum_creatinine <dbl>, serum_sodium <int>, sex <chr>, smoking <chr>,
#> #   time <int>, DEATH_EVENT <chr>

Notice the line Groups: sex [2], which appears because there are 2 groups, Female and Male. When you print the grouping by sex and diabetes you get 4 groups:

by_sex_diabetes
#> # A tibble: 299 x 13
#> # Groups:   sex, diabetes [4]
#>      age anaemia creatinine_phos~ diabetes ejection_fracti~ high_blood_pres~
#>    <dbl> <chr>              <int> <chr>               <int> <chr>           
#>  1    75 no_ana~              582 no_diab~               20 high_blood_pres~
#>  2    55 no_ana~             7861 no_diab~               38 no_high_blood_p~
#>  3    65 no_ana~              146 no_diab~               20 no_high_blood_p~
#>  4    50 anaemia              111 no_diab~               20 no_high_blood_p~
#>  5    65 anaemia              160 diabetes               20 no_high_blood_p~
#>  6    90 anaemia               47 no_diab~               40 high_blood_pres~
#>  7    75 anaemia              246 no_diab~               15 no_high_blood_p~
#>  8    60 anaemia              315 diabetes               60 no_high_blood_p~
#>  9    65 no_ana~              157 no_diab~               65 no_high_blood_p~
#> 10    80 anaemia              123 no_diab~               35 high_blood_pres~
#> # ... with 289 more rows, and 7 more variables: platelets <dbl>,
#> #   serum_creatinine <dbl>, serum_sodium <int>, sex <chr>, smoking <chr>,
#> #   time <int>, DEATH_EVENT <chr>

Notice how it shows Groups: sex, diabetes [4], since there is one group for each combination: Female-diabetes, Male-diabetes, Female-no_diabetes, and Male-no_diabetes.
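The grouping information shown in the printed header can also be inspected programmatically. The following is a minimal sketch on a small made-up data frame (the name toy and its values are hypothetical; only the column names mirror the heart failure data). It assumes dplyr is installed and uses group_vars(), n_groups(), group_keys(), and ungroup():

```r
library(dplyr)

# Hypothetical toy data: column names mirror the heart failure data,
# but the values are invented for illustration only.
toy <- data.frame(
  sex      = c("Female", "Male", "Male", "Female", "Male"),
  diabetes = c("diabetes", "no_diabetes", "diabetes", "diabetes", "no_diabetes"),
  age      = c(60, 72, 55, 48, 81),
  stringsAsFactors = FALSE
)

toy_grouped <- toy %>% group_by(sex, diabetes)

# Names of the grouping variables
grouping_columns <- group_vars(toy_grouped)

# Number of groups: only combinations that occur in the data are counted
group_count <- n_groups(toy_grouped)

# One row per observed combination of the grouping variables
group_table <- group_keys(toy_grouped)

# ungroup() removes the grouping again
toy_ungrouped <- ungroup(toy_grouped)

print(grouping_columns)
print(group_count)
print(group_table)
```

In this toy data there is no Female-no_diabetes row, so only 3 groups are reported rather than 4. Groups are formed from the combinations actually present in the data.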

tally()

tally() works well with the group_by() verb, as it counts the number of rows within each group. If we go back and use the same groupings as before, by sex and by sex and diabetes, we see the following counts:

heart_failure %>% group_by(sex) %>% tally()
#> # A tibble: 2 x 2
#>   sex        n
#>   <chr>  <int>
#> 1 Female   105
#> 2 Male     194

We can see the 2 groups that group_by() created, Female and Male, and the number of rows in each group.

heart_failure %>% group_by(sex,diabetes) %>% tally()
#> # A tibble: 4 x 3
#> # Groups:   sex [2]
#>   sex    diabetes        n
#>   <chr>  <chr>       <int>
#> 1 Female diabetes       55
#> 2 Female no_diabetes    50
#> 3 Male   diabetes       70
#> 4 Male   no_diabetes   124

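Here we see the 4 groups that group_by() created and the number of rows in each of them. Note that the result is still grouped by sex (Groups: sex [2]), because tally() drops only the last grouping level.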
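As a side note, dplyr also provides count(), which combines the group_by() and tally() steps in a single call. The sketch below uses a small made-up data frame (toy and its values are hypothetical) to illustrate that both approaches give the same counts:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy <- data.frame(
  sex      = c("Female", "Male", "Male", "Female", "Male"),
  diabetes = c("diabetes", "no_diabetes", "diabetes", "diabetes", "no_diabetes"),
  stringsAsFactors = FALSE
)

# Two-step version: group first, then count the rows in each group
tally_counts <- toy %>%
  group_by(sex, diabetes) %>%
  tally() %>%
  ungroup()

# One-step version: count() groups, tallies, and returns an ungrouped result
count_counts <- toy %>% count(sex, diabetes)

# sort = TRUE orders the output from the largest group to the smallest
count_by_sex <- toy %>% count(sex, sort = TRUE)

print(tally_counts)
print(count_counts)
print(count_by_sex)
```

Which form to use is mostly a matter of taste. The explicit group_by() version is convenient when the grouping is reused by later verbs, while count() is shorter for a one-off frequency table.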

summarise()

The summarise() function allows you to compute summary statistics within each group, returning one row per group. For example, if you want to get the mean platelets by sex you would use summarise() like this:

heart_failure %>%
  group_by(sex) %>%
  summarise(mean_platelets = mean(platelets))
#> # A tibble: 2 x 2
#>   sex    mean_platelets
#>   <chr>           <dbl>
#> 1 Female        279964.
#> 2 Male          254370.

It’s also possible to compute multiple statistics in a single summarise() call. For example, below I calculate the min, max, mean, median, and standard deviation of platelets, grouped by the sex of the patient:

heart_failure %>%
  group_by(sex) %>%
  summarise(
    min_platelets = min(platelets),
    max_platelets = max(platelets),
    mean_platelets = mean(platelets),
    median_platelets = median(platelets),
    sd_platelets = sd(platelets)
  )
#> # A tibble: 2 x 6
#>   sex   min_platelets max_platelets mean_platelets median_platelets sd_platelets
#>   <chr>         <dbl>         <dbl>          <dbl>            <dbl>        <dbl>
#> 1 Fema~         62000        742000        279964.          263358.      102109.
#> 2 Male          25100        850000        254370.          253000        94447.
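When the same statistics are needed for several columns, writing each one out by hand gets repetitive. One possible alternative, sketched here on made-up data (toy and its values are hypothetical), is across(), available from dplyr 1.0.0 onward. The sketch also uses n() to count rows per group and .groups = "drop" to return an ungrouped result:

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy <- data.frame(
  sex       = c("Female", "Male", "Male", "Female", "Male"),
  platelets = c(250000, 300000, 200000, 350000, 250000),
  age       = c(60, 72, 55, 48, 81),
  stringsAsFactors = FALSE
)

toy_summary <- toy %>%
  group_by(sex) %>%
  summarise(
    n_patients = n(),  # number of rows in each group
    # Apply mean() and sd() to both columns; the resulting columns are
    # named <column>_<function>, e.g. platelets_mean
    across(c(platelets, age), list(mean = mean, sd = sd)),
    .groups = "drop"   # return an ungrouped tibble
  )

print(toy_summary)
```

The explicit form used earlier in this section remains easier to read for a single column. across() mainly pays off once several columns share the same set of statistics.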

  1. group_by() is usually used to group by a character variable.