In this report, we investigate the age distribution of patients with heart failure using a dataset of people who had different types of chest pain when they were checked into the hospital. More information about this dataset can be found in this Kaggle hyperlink. Several tidying transformations from the tidyverse package are showcased in this notebook to show how such transformations benefit this dataset and others like it.
The data is imported and stored as a dataframe in the heart_data variable.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
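# Store the address under a name that does not shadow base R's url() function
data_url <- 'https://raw.githubusercontent.com/peterphung2043/DATA-607---Tidyverse-Create-Assignment/main/heart.csv'
# read.csv accepts a URL string directly
heart_data <- read.csv(data_url, stringsAsFactors = FALSE)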
The code block below selects only the Age, Sex, and ChestPainType columns from the heart_data dataframe and stores them in a new dataframe called heart_data_parsed.
heart_data_parsed <- heart_data %>%
  select(Age, Sex, ChestPainType)
knitr::kable(head(heart_data_parsed))
| Age | Sex | ChestPainType |
|---|---|---|
| 40 | M | ATA |
| 49 | F | NAP |
| 37 | M | ATA |
| 48 | F | ASY |
| 54 | M | NAP |
| 39 | M | NAP |
In the Sex column:
- M: Male
- F: Female
In the ChestPainType column:
- TA: Typical angina
- ATA: Atypical angina
- NAP: Non-anginal pain
- ASY: Asymptomatic
Sex and ChestPainType
The code block below uses the pivot_wider function from the tidyr package to spread the ages into one column per combination of sex and chest pain type, using the Age, Sex, and ChestPainType columns. The resulting dataframe is then stored in the heart_data_pivoted variable. The pipeline consists of 4 steps:
1. The group_by function is used on the heart_data_parsed dataframe to group the data by ChestPainType first, then Sex.
2. The mutate function adds a column called row using the row_number function. Every unique combination of ChestPainType and Sex is given a separate count, starting from 1.
3. The pivot_wider function takes the Sex and ChestPainType variables as arguments to the names_from parameter. The values_from parameter takes the Age variable. The function takes every Age for every unique combination of Sex and ChestPainType and lumps those Age observations into a column of their own. For example, the M_ATA column contains the ages of all the males who had atypical angina.
4. The select function drops the row column, which is no longer needed after the pivot.
heart_data_pivoted <- heart_data_parsed %>%
  group_by(ChestPainType, Sex) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = c(Sex, ChestPainType), values_from = c(Age)) %>%
  select(-row)
knitr::kable(head(heart_data_pivoted))
| M_ATA | F_NAP | F_ASY | M_NAP | F_ATA | M_ASY | F_TA | M_TA |
|---|---|---|---|---|---|---|---|
| 40 | 49 | 48 | 54 | 45 | 37 | 43 | 43 |
| 37 | 37 | 48 | 39 | 48 | 49 | 35 | 34 |
| 54 | 42 | 47 | 40 | 54 | 38 | 62 | 46 |
| 58 | 54 | 52 | 36 | 43 | 60 | 57 | 47 |
| 39 | 43 | 45 | 53 | 49 | 53 | 30 | 55 |
| 36 | 39 | 44 | 56 | 53 | 54 | 62 | 54 |
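
To see why the row index matters, the pipeline above can be replayed on a small table. This is only a minimal sketch: the `toy` tibble and its values are made up for illustration, and the `ungroup()` call is an addition that is not part of the original pipeline (it simply makes the pivot independent of the grouping).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for heart_data_parsed
toy <- tibble(
  Age = c(40, 49, 37, 48, 54),
  Sex = c("M", "F", "M", "F", "M"),
  ChestPainType = c("ATA", "NAP", "ATA", "NAP", "ATA")
)

toy_wide <- toy %>%
  group_by(ChestPainType, Sex) %>%
  mutate(row = row_number()) %>%  # numbers rows 1, 2, 3, ... within each group
  ungroup() %>%                   # assumption: added here, not in the original pipeline
  pivot_wider(names_from = c(Sex, ChestPainType), values_from = Age) %>%
  select(-row)                    # the index has done its job

toy_wide
```

The M_ATA group has three ages and the F_NAP group only two, so the shorter column is padded with NA to match the longest one. The same padding happens in heart_data_pivoted. Without the row column, pivot_wider would have nothing left to tell the rows apart and would be expected to return list-columns along with a warning that the values are not uniquely identified.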
nest to Create a List-column of Dataframes
We can tidy up our data even further by creating a list-column of dataframes, where each dataframe holds the columns for one gender. To do this, we can use the nest function found in the tidyr package.
The nest function is used in the pipeline below. The male dataframe takes in all of the columns that start with M_, and the female dataframe takes in all of the columns that start with F_. In both cases the columns are chosen with the starts_with function.
nest_gender <- heart_data_pivoted %>%
  nest(male = starts_with("M_"), female = starts_with("F_"))
In the RStudio environment pane, the nest_gender dataframe appears with contents similar to the table below. Clicking on a cell opens the nested dataframe stored in that cell.
nest_df_visual <- data.frame(
male = c('4 variables'),
female = c('4 variables')
)
knitr::kable(nest_df_visual)
| male | female |
|---|---|
| 4 variables | 4 variables |
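
Outside the RStudio viewer, a nested dataframe can also be reached in code by indexing the list-column. The sketch below assumes a made-up `toy_wide` tibble with two columns per gender in place of heart_data_pivoted; the names `toy_wide`, `toy_nested`, and `male_df` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table standing in for heart_data_pivoted
toy_wide <- tibble(
  M_ATA = c(40, 37), M_NAP = c(54, 39),
  F_ATA = c(45, 48), F_NAP = c(49, 37)
)

toy_nested <- toy_wide %>%
  nest(male = starts_with("M_"), female = starts_with("F_"))

# Every column was nested, so the result has a single row;
# [[1]] pulls the dataframe out of that row's list-column cell
male_df <- toy_nested$male[[1]]
male_df
```

Because each list-column cell is an ordinary dataframe, `male_df` can be passed to any function that expects one.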
pivot_longer with ggplot
In order to display multiple boxplots on the same screen for Age by Gender and ChestPainType, all of the data must be in one column. Therefore, the pivot_longer function is applied to the heart_data_pivoted dataframe. The pivot_longer function essentially increases the number of rows and decreases the number of columns.
heart_data_pivoted_longer <- pivot_longer(heart_data_pivoted,
                                          cols = starts_with(c("M_", "F_")),
                                          names_to = 'chest_pain_type_by_gender',
                                          values_to = 'age')
knitr::kable(head(heart_data_pivoted_longer))
| chest_pain_type_by_gender | age |
|---|---|
| M_ATA | 40 |
| M_NAP | 54 |
| M_ASY | 37 |
| M_TA | 43 |
| F_NAP | 49 |
| F_ASY | 48 |
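
As an aside, pivot_longer can also split combined column names such as M_ATA back into separate variables through its names_sep argument. The sketch below is not part of the original analysis; the `toy_wide` tibble and the output column names `sex` and `chest_pain_type` are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table standing in for heart_data_pivoted
toy_wide <- tibble(
  M_ATA = c(40, 37),
  F_NAP = c(49, NA)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = everything(),
    names_to = c("sex", "chest_pain_type"),  # one output column per name piece
    names_sep = "_",                          # split "M_ATA" into "M" and "ATA"
    values_to = "age"
  )

toy_long
```

Keeping sex and chest pain type in separate columns makes it possible to map them to different aesthetics or facets later on.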
heart_data_pivoted_longer can now be used with ggplot and geom_boxplot to construct boxplots of Age by Sex and ChestPainType on the same screen.
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) +
  geom_boxplot()
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
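
The warning above refers to the NA values that pivot_wider used to pad the shorter columns; they are carried into the long dataframe and dropped by geom_boxplot. One way to avoid them, sketched here on a made-up `toy_wide` tibble rather than the real data, is the values_drop_na argument of pivot_longer.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table with NA padding, as pivot_wider would produce
toy_wide <- tibble(
  M_ATA = c(40, 37, 54),
  F_NAP = c(49, 48, NA)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = starts_with(c("M_", "F_")),
  names_to = "chest_pain_type_by_gender",
  values_to = "age",
  values_drop_na = TRUE  # drop rows whose age is NA
)

toy_long
```

With the padding removed, geom_boxplot should have no non-finite values left to discard, so the warning would not be expected to appear.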
This report shows a use case for select from the dplyr package. The report also showcases the pivot_wider, pivot_longer, and nest functions from the tidyr package, and ggplot which belongs to the ggplot2 package. All of these packages are part of the tidyverse suite of packages which are essential in R for robust data tidying, transforming, and analysis.
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) +
geom_boxplot()
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
This extension shows the customization functions and techniques available in the ggplot2 package.
The ggtitle function adds the given title to the graph.
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) +
  geom_boxplot() + ggtitle("Chest Pain Type By Gender")
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
The coord_flip function flips the Cartesian coordinates, so the horizontal axis becomes vertical and the vertical axis becomes horizontal.
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) +
  geom_boxplot() + ggtitle("Chest Pain Type By Gender") + coord_flip()
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
Faceting allows a graph to be divided into subgraphs based on one or more discrete variables. By using the facet_grid function, the parsed dataframe can be divided into boxplots for each unique combination of chest pain type and gender. This allows us to recreate the initial boxplot graph without manipulating the dataframe.
t <- ggplot(heart_data_parsed, aes(Age)) + geom_boxplot() + coord_flip()
t + facet_grid(cols = vars(ChestPainType, Sex))
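
facet_grid also accepts a rows argument, which lays the panels out as a grid instead of a single strip. The sketch below uses a made-up `toy` dataframe in place of heart_data_parsed, and the choice of Sex for the rows and ChestPainType for the columns is only one possible arrangement.

```r
library(ggplot2)

# Hypothetical toy data standing in for heart_data_parsed
toy <- data.frame(
  Age = c(40, 49, 37, 48, 54, 39, 45, 52),
  Sex = c("M", "F", "M", "F", "M", "M", "F", "F"),
  ChestPainType = c("ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ASY")
)

# One row of panels per sex, one column of panels per chest pain type
p <- ggplot(toy, aes(y = Age)) +
  geom_boxplot() +
  facet_grid(rows = vars(Sex), cols = vars(ChestPainType))

p
```

Mapping Age to y draws vertical boxplots directly, so coord_flip is not needed in this variant.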
ggplot2 offers multiple themes for plots; a few are shown below.
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) +
  geom_boxplot() + theme_bw() + ggtitle("theme_bw")
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) +
  geom_boxplot() + theme_dark() + ggtitle("theme_dark")
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) +
  geom_boxplot() + theme_light() + ggtitle("theme_light")
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
Color by variable: Mapping the color aesthetic to the x variable gives each boxplot a unique outline color. Mapping the fill aesthetic to the x variable gives each boxplot a unique fill color.
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age, color = chest_pain_type_by_gender)) +
  geom_boxplot()
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age, fill = chest_pain_type_by_gender)) +
  geom_boxplot()
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
Same color: Pass the desired color to either the color or the fill argument of geom_boxplot, outside of aes, to produce boxplots that all share the same outline color or the same fill color, respectively.
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) + geom_boxplot(color = "green")
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).
ggplot(heart_data_pivoted_longer, mapping = aes(x = chest_pain_type_by_gender, y = age)) + geom_boxplot(fill = "green")
## Warning: Removed 2490 rows containing non-finite values (stat_boxplot).