In this assignment we were asked to create a vignette discussing one or more packages within the tidyverse. I chose to look at dplyr, specifically the group_by(), tally(), and summarise() functions.
tidyverse
We’ll start the assignment by loading the tidyverse package. Loading it attaches 8 core packages. Some of them are:
ggplot2 - for visualization
stringr - for string manipulation
tidyr - for data cleaning and tidying
dplyr - for data manipulation
readr - for reading in rectangular data like csv
Below is the full list of the 8 packages that library(tidyverse) attaches:
library(tidyverse)
#> -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
#> v ggplot2 3.3.2     v purrr   0.3.4
#> v tibble  3.0.4     v dplyr   1.0.2
#> v tidyr   1.1.2     v stringr 1.4.0
#> v readr   1.4.0     v forcats 0.5.0
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
Then load in the example data that was downloaded from Kaggle.
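The loading step could look roughly like the sketch below. It assumes the Kaggle download is a CSV file. Because the real file name is not given here, the sketch writes a tiny stand-in CSV to a temporary file and reads it back with readr's read_csv(). The rows are made up, the column names are taken from the printed output in this vignette, and heart_failure_demo and csv_path are hypothetical names.

```r
library(readr)

# Stand-in for the Kaggle file: a temporary CSV with made-up rows.
# Point `csv_path` at the real download instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "age,sex,diabetes,platelets",
    "70,Male,no_diabetes,250000",
    "52,Male,diabetes,180000",
    "61,Female,diabetes,300000",
    "48,Female,no_diabetes,220000"
  ),
  csv_path
)

# Declaring the column types up front keeps the import predictable
# and avoids relying on type guessing.
heart_failure_demo <- read_csv(
  csv_path,
  col_types = cols(
    age = col_double(),
    sex = col_character(),
    diabetes = col_character(),
    platelets = col_double()
  )
)

heart_failure_demo
```

With the real file, only csv_path and the column specification would need to change.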
dplyr library
dplyr is the tidyverse’s data manipulation package. It provides many functions, or verbs, that solve common data manipulation tasks. The verbs I want to focus on today are:
group_by()
tally()
summarise()

group_by()
One of the most powerful functions in the dplyr library is group_by(). It takes a data frame and groups it by one or more of its variables, so that later verbs operate on each group separately. [1]
by_sex <- heart_failure %>% group_by(sex)
by_sex_diabetes <- heart_failure %>% group_by(sex, diabetes)
When you print the grouped data you can see the number of groups:
by_sex
#> # A tibble: 299 x 13
#> # Groups:   sex [2]
#>      age anaemia creatinine_phos~ diabetes ejection_fracti~ high_blood_pres~
#>    <dbl> <chr>              <int> <chr>               <int> <chr>
#>  1    75 no_ana~              582 no_diab~               20 high_blood_pres~
#>  2    55 no_ana~             7861 no_diab~               38 no_high_blood_p~
#>  3    65 no_ana~              146 no_diab~               20 no_high_blood_p~
#>  4    50 anaemia              111 no_diab~               20 no_high_blood_p~
#>  5    65 anaemia              160 diabetes               20 no_high_blood_p~
#>  6    90 anaemia               47 no_diab~               40 high_blood_pres~
#>  7    75 anaemia              246 no_diab~               15 no_high_blood_p~
#>  8    60 anaemia              315 diabetes               60 no_high_blood_p~
#>  9    65 no_ana~              157 no_diab~               65 no_high_blood_p~
#> 10    80 anaemia              123 no_diab~               35 high_blood_pres~
#> # ... with 289 more rows, and 7 more variables: platelets <dbl>,
#> #   serum_creatinine <dbl>, serum_sodium <int>, sex <chr>, smoking <chr>,
#> #   time <int>, DEATH_EVENT <chr>
Notice that it says Groups: sex [2], because there are 2 groups, Female and Male. When you print the grouping by sex and diabetes you get 4 groups:
by_sex_diabetes
#> # A tibble: 299 x 13
#> # Groups:   sex, diabetes [4]
#>      age anaemia creatinine_phos~ diabetes ejection_fracti~ high_blood_pres~
#>    <dbl> <chr>              <int> <chr>               <int> <chr>
#>  1    75 no_ana~              582 no_diab~               20 high_blood_pres~
#>  2    55 no_ana~             7861 no_diab~               38 no_high_blood_p~
#>  3    65 no_ana~              146 no_diab~               20 no_high_blood_p~
#>  4    50 anaemia              111 no_diab~               20 no_high_blood_p~
#>  5    65 anaemia              160 diabetes               20 no_high_blood_p~
#>  6    90 anaemia               47 no_diab~               40 high_blood_pres~
#>  7    75 anaemia              246 no_diab~               15 no_high_blood_p~
#>  8    60 anaemia              315 diabetes               60 no_high_blood_p~
#>  9    65 no_ana~              157 no_diab~               65 no_high_blood_p~
#> 10    80 anaemia              123 no_diab~               35 high_blood_pres~
#> # ... with 289 more rows, and 7 more variables: platelets <dbl>,
#> #   serum_creatinine <dbl>, serum_sodium <int>, sex <chr>, smoking <chr>,
#> #   time <int>, DEATH_EVENT <chr>
Notice that it shows Groups: sex, diabetes [4], one group for each combination: female-diabetes, male-diabetes, female-no_diabetes, and male-no_diabetes.
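The grouping metadata that the printout reports can also be queried directly. The following is a minimal sketch, assuming a small made-up tibble (toy, not the Kaggle data) with the same sex and diabetes columns. It uses dplyr's group_vars(), n_groups(), group_keys(), and ungroup().

```r
library(dplyr)

# Made-up stand-in rows; NOT the heart failure data.
toy <- tibble(
  sex       = c("Female", "Female", "Male", "Male", "Male"),
  diabetes  = c("diabetes", "no_diabetes", "diabetes", "diabetes", "no_diabetes"),
  platelets = c(265000, 310000, 162000, 210000, 327000)
)

by_both <- toy %>% group_by(sex, diabetes)

group_vars(by_both)   # names of the grouping columns
n_groups(by_both)     # number of distinct sex/diabetes combinations present
group_keys(by_both)   # one row per group

# group_by() does not change the rows; ungroup() removes the grouping again.
ungrouped <- ungroup(by_both)
```

Note that group_by() only creates groups for combinations that actually occur in the data, so a missing combination would lower the group count.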
tally()
tally() works well with the group_by() verb because it counts the number of rows within each group. Using the same two groupings as before, by sex and by sex and diabetes, we see the following counts:
heart_failure %>% group_by(sex) %>% tally()
#> # A tibble: 2 x 2
#>   sex        n
#>   <chr>  <int>
#> 1 Female   105
#> 2 Male     194
We can see the 2 groups that group_by() created, Female and Male, and the number of rows in each group.
heart_failure %>% group_by(sex, diabetes) %>% tally()
#> # A tibble: 4 x 3
#> # Groups:   sex [2]
#>   sex    diabetes        n
#>   <chr>  <chr>       <int>
#> 1 Female diabetes       55
#> 2 Female no_diabetes    50
#> 3 Male   diabetes       70
#> 4 Male   no_diabetes   124
Here we see the 4 groups that group_by() created and the number of rows for each of those.
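As a small illustration of how tally() relates to grouping, the sketch below uses a made-up tibble (toy, not the Kaggle data). It compares group_by() followed by tally() with dplyr's count(), which does the grouping and counting in one call, and checks which grouping remains on the result.

```r
library(dplyr)

# Made-up stand-in rows; NOT the heart failure data.
toy <- tibble(
  sex      = c("Female", "Female", "Male", "Male", "Male"),
  diabetes = c("diabetes", "no_diabetes", "diabetes", "diabetes", "no_diabetes")
)

# Explicit grouping followed by tally()
tallied <- toy %>% group_by(sex, diabetes) %>% tally()

# count() groups, tallies, and returns an ungrouped result
counted <- toy %>% count(sex, diabetes)

# tally() removes only the last grouping level, so the result is
# still grouped by sex (this matches the "Groups: sex" line above).
remaining_groups <- group_vars(tallied)

# Largest groups first
sorted <- toy %>% group_by(sex, diabetes) %>% tally(sort = TRUE)
```

The counts always add up to the number of rows in the input, which is a quick sanity check after any tally.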
summarise()
The summarise() function collapses each group into a single row of summary values. For example, if you want to get the mean platelets by sex you would use summarise() like this:
heart_failure %>%
  group_by(sex) %>%
  summarise(mean_platelets = mean(platelets))
#> # A tibble: 2 x 2
#>   sex    mean_platelets
#>   <chr>           <dbl>
#> 1 Female        279964.
#> 2 Male          254370.
It’s also possible to compute several statistics in one summarise() call. For example, below I compute the min, max, mean, median, and standard deviation of platelets, grouped by the sex of the patient:
heart_failure %>%
  group_by(sex) %>%
  summarise(
    min_platelets = min(platelets),
    max_platelets = max(platelets),
    mean_platelets = mean(platelets),
    median_platelets = median(platelets),
    sd_platelets = sd(platelets)
  )
#> # A tibble: 2 x 6
#>   sex   min_platelets max_platelets mean_platelets median_platelets sd_platelets
#>   <chr>         <dbl>         <dbl>          <dbl>            <dbl>        <dbl>
#> 1 Fema~         62000        742000        279964.          263358.      102109.
#> 2 Male          25100        850000        254370.          253000        94447.

[1] group_by() is usually used to group by a categorical variable, such as a character column.
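Returning to summarise(): the examples in that section group by a single variable. The sketch below is a minimal illustration on a made-up tibble (toy, not the Kaggle data) of summarising over two grouping variables, using n() to add a row count next to the mean and the .groups argument (available since dplyr 1.0.0) to control whether the result stays grouped.

```r
library(dplyr)

# Made-up stand-in rows; NOT the heart failure data.
toy <- tibble(
  sex       = c("Female", "Female", "Male", "Male", "Male"),
  diabetes  = c("diabetes", "no_diabetes", "diabetes", "diabetes", "no_diabetes"),
  platelets = c(265000, 310000, 162000, 210000, 327000)
)

summary_tbl <- toy %>%
  group_by(sex, diabetes) %>%
  summarise(
    n_patients     = n(),              # rows in each group
    mean_platelets = mean(platelets),  # per-group mean
    .groups = "drop"                   # return a plain, ungrouped tibble
  )

summary_tbl
```

Setting .groups explicitly is a design choice: it makes the grouping of the result obvious to the reader instead of relying on the default behaviour.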