`tidyverse`

Intro

We’ll start the assignment by loading the tidyverse package. The core tidyverse (version 1.3.0 here) attaches 8 packages. Some of them are:

ggplot2 - for visualization
stringr - for string manipulation
tidyr - for data cleaning and tidying
dplyr - for data manipulation
readr - for reading in rectangular data such as CSV files

Loading

Loading the library prints the full list of the 8 packages it attaches, along with any function name conflicts:

library(tidyverse)
#> -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
#> v ggplot2 3.3.2     v purrr   0.3.4
#> v tibble  3.0.4     v dplyr   1.0.2
#> v tidyr   1.1.2     v stringr 1.4.0
#> v readr   1.4.0     v forcats 0.5.0
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
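The Conflicts lines mean that a bare filter() or lag() now refers to the dplyr version. Here is a minimal sketch of how both versions stay reachable, assuming a made-up data frame df that is not part of the assignment data:

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
df <- data.frame(x = 1:5)

# With dplyr attached, a bare filter() call dispatches to dplyr::filter()
kept <- filter(df, x > 3)

# The masked base function is still available through its namespace
smoothed <- stats::filter(1:5, rep(1 / 3, 3))  # 3-point moving average

kept
smoothed
```

Writing the package prefix (dplyr:: or stats::) makes the intended function explicit whenever the masking could cause confusion.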

Then load the example data, which was downloaded from Kaggle:

heart_failure <- read.csv(url("https://raw.githubusercontent.com/sbiguzzi/FALL2020TIDYVERSE/master/heart_failure_clinical_records_dataset.csv"))

The `dplyr` library

dplyr is the tidyverse’s data manipulation package. It provides many functions, called verbs, that solve common data manipulation tasks. The verbs I want to focus on today are:

group_by()
tally()
summarise()

`group_by()`

One of the most powerful functions in the dplyr library is group_by(). It takes a data frame and lets you group it by one or more of its variables.¹

by_sex <- heart_failure %>% group_by(sex)
by_sex_diabetes <- heart_failure %>% group_by(sex,diabetes)

When you print the groupings you can see the number of groups:

by_sex
#> # A tibble: 299 x 13
#> # Groups:   sex [2]
#>      age anaemia creatinine_phos~ diabetes ejection_fracti~ high_blood_pres~
#>    <dbl> <chr>              <int> <chr>               <int> <chr>           
#>  1    75 no_ana~              582 no_diab~               20 high_blood_pres~
#>  2    55 no_ana~             7861 no_diab~               38 no_high_blood_p~
#>  3    65 no_ana~              146 no_diab~               20 no_high_blood_p~
#>  4    50 anaemia              111 no_diab~               20 no_high_blood_p~
#>  5    65 anaemia              160 diabetes               20 no_high_blood_p~
#>  6    90 anaemia               47 no_diab~               40 high_blood_pres~
#>  7    75 anaemia              246 no_diab~               15 no_high_blood_p~
#>  8    60 anaemia              315 diabetes               60 no_high_blood_p~
#>  9    65 no_ana~              157 no_diab~               65 no_high_blood_p~
#> 10    80 anaemia              123 no_diab~               35 high_blood_pres~
#> # ... with 289 more rows, and 7 more variables: platelets <dbl>,
#> #   serum_creatinine <dbl>, serum_sodium <int>, sex <chr>, smoking <chr>,
#> #   time <int>, DEATH_EVENT <chr>

Notice that the header says Groups: sex [2], because there are 2 groups, Female and Male. When you print the grouping by sex and diabetes you get 4 groups:

by_sex_diabetes
#> # A tibble: 299 x 13
#> # Groups:   sex, diabetes [4]
#>      age anaemia creatinine_phos~ diabetes ejection_fracti~ high_blood_pres~
#>    <dbl> <chr>              <int> <chr>               <int> <chr>           
#>  1    75 no_ana~              582 no_diab~               20 high_blood_pres~
#>  2    55 no_ana~             7861 no_diab~               38 no_high_blood_p~
#>  3    65 no_ana~              146 no_diab~               20 no_high_blood_p~
#>  4    50 anaemia              111 no_diab~               20 no_high_blood_p~
#>  5    65 anaemia              160 diabetes               20 no_high_blood_p~
#>  6    90 anaemia               47 no_diab~               40 high_blood_pres~
#>  7    75 anaemia              246 no_diab~               15 no_high_blood_p~
#>  8    60 anaemia              315 diabetes               60 no_high_blood_p~
#>  9    65 no_ana~              157 no_diab~               65 no_high_blood_p~
#> 10    80 anaemia              123 no_diab~               35 high_blood_pres~
#> # ... with 289 more rows, and 7 more variables: platelets <dbl>,
#> #   serum_creatinine <dbl>, serum_sodium <int>, sex <chr>, smoking <chr>,
#> #   time <int>, DEATH_EVENT <chr>

Notice that the header now shows Groups: sex, diabetes [4], one group for each combination: Female-diabetes, Female-no_diabetes, Male-diabetes, and Male-no_diabetes.
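The grouping can also be inspected without printing the whole table. The sketch below uses a small made-up data frame called toy (the values are invented and only stand in for heart_failure) together with dplyr's group_vars(), n_groups(), group_keys(), and ungroup():

```r
library(dplyr)

# Hypothetical toy data standing in for heart_failure (values are made up)
toy <- data.frame(
  sex       = c("Female", "Female", "Male", "Male", "Male"),
  diabetes  = c("diabetes", "no_diabetes", "diabetes", "no_diabetes", "no_diabetes"),
  platelets = c(250000, 300000, 200000, 260000, 240000),
  stringsAsFactors = FALSE
)

by_both <- toy %>% group_by(sex, diabetes)

group_vars(by_both)   # names of the grouping columns
n_groups(by_both)     # number of distinct groups
group_keys(by_both)   # one row per sex/diabetes combination

# ungroup() removes the grouping so later verbs act on the whole table
no_groups <- ungroup(by_both)
group_vars(no_groups)
```

Calling ungroup() once the grouped work is done is a common habit, since a leftover grouping silently changes how later verbs behave.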

`tally()`

tally() works well with the group_by() verb because it counts the number of rows within each group. If we go back and use the same two groupings, by sex and by sex and diabetes, we see the following counts:

heart_failure %>% group_by(sex) %>% tally()
#> # A tibble: 2 x 2
#>   sex        n
#>   <chr>  <int>
#> 1 Female   105
#> 2 Male     194

We can see the 2 groups that group_by() created, Female and Male, and the number of rows in each group.

heart_failure %>% group_by(sex,diabetes) %>% tally()
#> # A tibble: 4 x 3
#> # Groups:   sex [2]
#>   sex    diabetes        n
#>   <chr>  <chr>       <int>
#> 1 Female diabetes       55
#> 2 Female no_diabetes    50
#> 3 Male   diabetes       70
#> 4 Male   no_diabetes   124

Here we see the 4 groups that group_by() created and the number of rows in each of them. The result is still grouped by sex because tally() removes only the last grouping variable.
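dplyr also has count(), which combines the group_by() and tally() steps. Here is a minimal sketch comparing the two on a made-up data frame called toy (invented values, not the heart failure data):

```r
library(dplyr)

# Hypothetical toy data standing in for heart_failure (values are made up)
toy <- data.frame(
  sex      = c("Female", "Female", "Male", "Male", "Male"),
  diabetes = c("diabetes", "no_diabetes", "diabetes", "no_diabetes", "no_diabetes"),
  stringsAsFactors = FALSE
)

# group_by() followed by tally() and a single count() call give the same counts
via_tally <- toy %>% group_by(sex, diabetes) %>% tally()
via_count <- toy %>% count(sex, diabetes)

# tally() drops only the last grouping level, so this result is grouped by sex
group_vars(via_tally)

# count() on an ungrouped input returns an ungrouped result
group_vars(via_count)
```

Which one to use is mostly a matter of taste: count() is shorter, while group_by() plus tally() is handy when the data is already grouped for other steps.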

`summarise()`

The summarise() function collapses each group into a single row of summary values. For example, if you want to get the mean platelets by sex you would use summarise() like this:

heart_failure %>%
  group_by(sex) %>%
  summarise(mean_platelets = mean(platelets))
#> # A tibble: 2 x 2
#>   sex    mean_platelets
#>   <chr>           <dbl>
#> 1 Female        279964.
#> 2 Male          254370.
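One thing to watch for is missing data: mean() returns NA if any value in the group is missing. The sketch below assumes a made-up data frame toy with one missing platelets value (the real dataset may not have any) and also uses n() to report the number of rows per group:

```r
library(dplyr)

# Hypothetical toy data with one missing platelets value (values are made up)
toy <- data.frame(
  sex       = c("Female", "Female", "Male", "Male"),
  platelets = c(250000, NA, 200000, 260000),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(sex) %>%
  summarise(
    n_rows         = n(),                            # rows per group, like tally()
    mean_platelets = mean(platelets, na.rm = TRUE)   # ignore missing values
  )

res
```

Note that n_rows still counts the row with the missing value, so the mean for that group is based on fewer observations than n_rows suggests.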

It’s also possible to compute multiple statistics in a single summarise() call. For example, below I compute the min, max, mean, median, and standard deviation of platelets, grouped by the sex of the patient:

heart_failure %>%
  group_by(sex) %>%
  summarise(
    min_platelets = min(platelets),
    max_platelets = max(platelets),
    mean_platelets = mean(platelets),
    median_platelets = median(platelets),
    sd_platelets = sd(platelets)
  )
#> # A tibble: 2 x 6
#>   sex   min_platelets max_platelets mean_platelets median_platelets sd_platelets
#>   <chr>         <dbl>         <dbl>          <dbl>            <dbl>        <dbl>
#> 1 Fema~         62000        742000        279964.          263358.      102109.
#> 2 Male          25100        850000        254370.          253000        94447.
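When the same statistics are needed for several columns, writing each one out gets repetitive. Here is a minimal sketch using across(), which is available in dplyr 1.0.0 and later, on a made-up data frame toy (invented values; the age and platelets column names mirror the heart failure data):

```r
library(dplyr)

# Hypothetical toy data standing in for heart_failure (values are made up)
toy <- data.frame(
  sex       = c("Female", "Female", "Male", "Male"),
  age       = c(60, 70, 55, 65),
  platelets = c(250000, 300000, 200000, 260000),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(sex) %>%
  summarise(
    # apply every function in the list to every selected column
    across(c(age, platelets), list(mean = mean, sd = sd)),
    .groups = "drop"   # return an ungrouped result
  )

names(res)
res
```

The output columns are named by joining the column name and the function name, for example age_mean and platelets_sd.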

¹ group_by() is usually used to group by a categorical variable, such as a character or factor column.

Tidyverse CREATE Assignment

Stefano Biguzzi

2020-10-25
