Workshop 6: Grouping, Summarizing and Plotting

JOHN OYAN NAIVEST

2023-08-15

1 Load packages and data

To get started, load in the needed packages: {tidyverse}, {here}, {janitor}, and {esquisse}.

Now, read the dataset into R. The data frame you import should have 880 rows and 22 columns. Remember to use the here() function to allow your Rmd to use project-relative paths.

## Rows: 880 Columns: 21
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): Sex, Education, Employment, Alcohol, Smoking, Form of TB, Chext Xr...
## dbl (11): id, Age, WtinKgs, HtinCms, bmi, Diabetes, first visit cost, second...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

The column names in this CSV have spaces in them, which is not R-friendly. I would recommend using janitor::clean_names() to give your variable names a clean and consistent format.

Checkpoint: The dataframe should contain 880 rows and 21 columns (see Environment tab). Column/variable names should be all lowercase with no spaces. You can run names(tb_renamed) to print the variable names.

2 Investigating healthcare costs

This dataset is from a research paper titled “Diagnostic pathways and direct medical costs incurred by new adult pulmonary tuberculosis patients prior to anti-tuberculosis treatment – Tamil Nadu, India”. The study collected data on out of pocket (OOP) expenditures incurred by TB patients.

2.1 Step 1: Calculate a total_cost variable

There were three visits for the participants and each had a cost (first_visit_cost, second_visit_cost, third_visit_cost). Add together these costs to create a total_cost variable.

2.2 Step 2: Summarize costs by group

Let’s compare the cost of treatment at different health facilities using a summary table. Use dplyr verbs to group by first_visit_location and summarize the mean fist visit cost.

## # A tibble: 9 x 2
##   first_visit_location mean_first_visit_cost
##   <chr>                                <dbl>
## 1 GH                                    36.0
## 2 Other                               1138. 
## 3 PHC                                    0  
## 4 Pvt. clini                           948. 
## 5 Pvt. docto                          1517. 
## 6 Pvt. hospi                          3213. 
## 7 T.Govt                                71.4
## 8 T.Pvt                               2500  
## 9 Tambram sanatorium                    16.1

Next, reorder the rows of the summary table to go from highest to lowest mean cost.

## # A tibble: 9 x 2
##   first_visit_location mean_first_visit_cost
##   <chr>                                <dbl>
## 1 Pvt. hospi                          3213. 
## 2 T.Pvt                               2500  
## 3 Pvt. docto                          1517. 
## 4 Other                               1138. 
## 5 Pvt. clini                           948. 
## 6 T.Govt                                71.4
## 7 GH                                    36.0
## 8 Tambram sanatorium                    16.1
## 9 PHC                                    0

Save this summary table as an object which you can use for plotting later on.

3 Encoding age groups

3.1 Step 1: Create an age_group variable

The age variable records the age of each patient in years. For further manipulations, we want to classify the patients into 4 equally-sized age groups (i.e., the number of patients in each age group should be approximately the same).

In order to determine what the age range for each age group should be, we can use the quantile() function.

##   0%  25%  50%  75% 100% 
##   18   37   48   58   88

We will now choose cutoffs for each age group based on these values.

Create a new age_group variable with the following categories:

  • 18-36

  • 37-47

  • 48-57

  • 58+

Now we can create a table of the age_group variable to see if we met our goal of having a similar number of patients in each age group:

##  age_group   n   percent
##      18-36 229 0.2602273
##      37-47 228 0.2590909
##      48-57 204 0.2318182
##        58+ 219 0.2488636

Checkpoint: The if you classified the age groups correctly, you will see that each age group has 24-36% of the patients.

3.2 Step 2: Summarize costs by age group and smoking status

Use “nested” grouping to group the data by two variables: age_group and smoking. Then filter to get the most expensive total_cost for each nested group.

## `summarise()` has grouped output by 'age_group'. You can override using the
## `.groups` argument.
## # A tibble: 8 x 3
##   age_group smoking max_total_cost
##   <fct>     <chr>            <dbl>
## 1 18-36     No               38000
## 2 18-36     Yes               8800
## 3 37-47     No               30000
## 4 37-47     Yes              11200
## 5 48-57     No               21000
## 6 48-57     Yes              30000
## 7 58+       No               35000
## 8 58+       Yes              35000

4 Visualize data with {esquisse}

Using esquisser() and the costs summary table you created earlier, create a bar plot of mean costs, by treatment location.

5 Wrap up

That’s it for this assignment! We will choose 2-3 people to present your work during the workshop. If you would like to share your results with the class, please let an instructor know.

If you finish early and have extra time, you can explore the dataset further with esquisse, and create new plots to share with the class. Try customizing the colors and labels in your plots.