1 Load packages and data
To get started, load in the needed packages: {tidyverse}, {here}, {janitor}, and {esquisse}.
Now, read the dataset into R. The data frame you
import should have 880 rows and 21 columns. Remember to use the
here() function so that your Rmd uses project-relative paths.
## Rows: 880 Columns: 21
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (10): Sex, Education, Employment, Alcohol, Smoking, Form of TB, Chext Xr...
## dbl (11): id, Age, WtinKgs, HtinCms, bmi, Diabetes, first visit cost, second...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
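As a rough sketch of a project-relative import: the folder name data, the file name tb_costs.csv, and the object name tb_raw are all placeholders (they are not given in this assignment), so swap in the real location of your CSV.

```r
library(readr)
library(here)

# Hypothetical folder and file name: replace with the real location of the CSV.
csv_path <- here("data", "tb_costs.csv")

# Read only if the file is where we expect it, so this sketch runs anywhere.
if (file.exists(csv_path)) {
  tb_raw <- read_csv(csv_path)
  dim(tb_raw)  # number of rows, then number of columns
} else {
  message("No file found at: ", csv_path)
}
```

here() builds the path from the project root, so the same code should work no matter which folder the Rmd is knitted from.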
The column names in this CSV have spaces in them, which is not
R-friendly. I would recommend using janitor::clean_names()
to give your variable names a clean and consistent format.
Checkpoint: The dataframe should contain 880 rows
and 21 columns (see Environment tab). Column/variable names should be
all lowercase with no spaces. You can run names(tb_renamed)
to print the variable names.
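To see what clean_names() does, here is a small sketch on a made-up data frame. The object names toy_raw and toy_renamed and the values are invented for illustration; only the style of the column names mimics the CSV.

```r
library(janitor)

# Made-up two-row data frame; check.names = FALSE keeps the spaces in the names.
toy_raw <- data.frame(
  `Age` = c(30, 45),
  `first visit cost` = c(100, 0),
  `Form of TB` = c("x", "y"),
  check.names = FALSE
)

# Convert every column name to lowercase snake_case.
toy_renamed <- clean_names(toy_raw)
names(toy_renamed)
```

The same call on your imported data frame should give names that are all lowercase, with underscores in place of spaces.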
2 Investigating healthcare costs
This dataset is from a research paper titled “Diagnostic pathways and direct medical costs incurred by new adult pulmonary tuberculosis patients prior to anti-tuberculosis treatment – Tamil Nadu, India”. The study collected data on out-of-pocket (OOP) expenditures incurred by TB patients.
2.1 Step 1: Calculate a total_cost variable
Participants had three visits, and each visit had its own cost
(first_visit_cost, second_visit_cost, third_visit_cost).
Add these costs together to create a total_cost variable.
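One possible approach uses mutate(), sketched here on invented numbers (toy_costs is a hypothetical object, not the assignment data). The third row shows that a missing cost makes the sum missing too; whether the real data contain missing costs is not stated here, so check before relying on the total.

```r
library(dplyr)

# Invented costs for three patients; the NA is there to show how `+` treats it.
toy_costs <- tibble(
  first_visit_cost  = c(100, 0, 250),
  second_visit_cost = c(50, 20, NA),
  third_visit_cost  = c(0, 10, 30)
)

# Add the three visit costs row by row.
toy_costs <- toy_costs %>%
  mutate(total_cost = first_visit_cost + second_visit_cost + third_visit_cost)

toy_costs$total_cost
```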
2.2 Step 2: Summarize costs by group
Let’s compare the cost of treatment at different health facilities
using a summary table. Use dplyr verbs to group by
first_visit_location and summarize the mean first visit cost.
## # A tibble: 9 x 2
##   first_visit_location mean_first_visit_cost
##   <chr>                                <dbl>
## 1 GH                                    36.0
## 2 Other                               1138.
## 3 PHC                                    0
## 4 Pvt. clini                           948.
## 5 Pvt. docto                          1517.
## 6 Pvt. hospi                          3213.
## 7 T.Govt                                71.4
## 8 T.Pvt                               2500
## 9 Tambram sanatorium                    16.1
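The group-then-summarize pattern can be sketched on a few invented rows. The objects toy_visits and toy_summary and the costs are made up, and only some of the location labels are borrowed from the table above.

```r
library(dplyr)

# Invented first-visit costs at three locations.
toy_visits <- tibble(
  first_visit_location = c("GH", "GH", "PHC", "Other"),
  first_visit_cost     = c(20, 40, 0, 500)
)

# One row per location, holding the mean cost of the rows in that group.
toy_summary <- toy_visits %>%
  group_by(first_visit_location) %>%
  summarise(mean_first_visit_cost = mean(first_visit_cost))

toy_summary
```

If a cost column contains missing values, mean() returns NA for that group unless you pass na.rm = TRUE.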
Next, reorder the rows of the summary table to go from highest to lowest mean cost.
## # A tibble: 9 x 2
##   first_visit_location mean_first_visit_cost
##   <chr>                                <dbl>
## 1 Pvt. hospi                          3213.
## 2 T.Pvt                               2500
## 3 Pvt. docto                          1517.
## 4 Other                               1138.
## 5 Pvt. clini                           948.
## 6 T.Govt                                71.4
## 7 GH                                    36.0
## 8 Tambram sanatorium                    16.1
## 9 PHC                                    0
Save this summary table as an object which you can use for plotting later on.
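A minimal sketch of sorting and saving, again on invented values (toy_summary and toy_summary_sorted are hypothetical names):

```r
library(dplyr)

# Invented summary table with one row per location.
toy_summary <- tibble(
  first_visit_location  = c("GH", "Other", "PHC"),
  mean_first_visit_cost = c(30, 500, 0)
)

# desc() sorts from highest to lowest; assigning the result keeps it for later.
toy_summary_sorted <- toy_summary %>%
  arrange(desc(mean_first_visit_cost))

toy_summary_sorted
```

Without the assignment (<-), the sorted table would only be printed and could not be reused for plotting.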
3 Encoding age groups
3.1 Step 1: Create an age_group variable
The age variable records the age of each patient in
years. For further manipulations, we want to classify the patients into
4 equally-sized age groups (i.e., the number of patients in each age
group should be approximately the same).
In order to determine what the age range for each age group should
be, we can use the quantile() function.
##   0%  25%  50%  75% 100%
##   18   37   48   58   88
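As an illustration of how quantile() behaves, here is a sketch on nine invented ages (toy_ages is hypothetical, with values picked so the quartiles land on the same numbers as above). On the assignment data you would pass the age column instead, for example tb_renamed$age.

```r
# Nine invented ages, already sorted for readability.
toy_ages <- c(18, 25, 37, 41, 48, 52, 58, 63, 88)

# By default quantile() returns the minimum, the three quartiles, and the maximum.
age_quartiles <- quantile(toy_ages)
age_quartiles
```

If the vector contains missing values, quantile() stops with an error unless you add na.rm = TRUE.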
We will now choose cutoffs for each age group based on these values.
Create a new age_group variable with the following categories:
18-36
37-47
48-57
58+
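One way to build these categories is base R's cut() inside mutate(); dplyr's case_when() would work as well, and the assignment does not say which is intended. The sketch below uses invented ages at the edges of each group (toy_patients is a hypothetical object).

```r
library(dplyr)

# Invented ages sitting on the boundaries of each group.
toy_patients <- tibble(age = c(18, 36, 37, 47, 48, 57, 58, 88))

toy_patients <- toy_patients %>%
  mutate(
    age_group = cut(
      age,
      breaks = c(18, 37, 48, 58, Inf),                 # lower edge of each group, then no upper limit
      labels = c("18-36", "37-47", "48-57", "58+"),
      right  = FALSE                                   # intervals are [lower, upper), so 37 starts a new group
    )
  )

toy_patients
```

cut() returns a factor, so the groups keep their age order in tables and plots. Any age below 18 would become NA with these breaks.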
Now we can create a table of the age_group variable to
see if we met our goal of having a similar number of patients in each
age group:
##  age_group   n   percent
##      18-36 229 0.2602273
##      37-47 228 0.2590909
##      48-57 204 0.2318182
##        58+ 219 0.2488636
Checkpoint: If you classified the age groups correctly, you will see that each age group has roughly 23-26% of the patients.
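A table with n and percent columns like the one above can be produced with janitor::tabyl(); this is an assumption about how the output was made, sketched here on four invented patients (toy_patients and toy_table are hypothetical names).

```r
library(janitor)

# Four invented patients, already classified into age groups.
toy_patients <- data.frame(
  age_group = c("18-36", "18-36", "37-47", "58+")
)

# Count each category and compute its share of all rows.
toy_table <- tabyl(toy_patients, age_group)
toy_table
```

The percent column is a proportion between 0 and 1, so 0.25 means 25% of the rows.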
3.2 Step 2: Summarize costs by age group and smoking status
Use “nested” grouping to group the data by two variables:
age_group and smoking. Then summarize to find the
highest total_cost within each nested group.
## `summarise()` has grouped output by 'age_group'. You can override using the
## `.groups` argument.
## # A tibble: 8 x 3
##   age_group smoking max_total_cost
##   <fct>     <chr>            <dbl>
## 1 18-36     No               38000
## 2 18-36     Yes               8800
## 3 37-47     No               30000
## 4 37-47     Yes              11200
## 5 48-57     No               21000
## 6 48-57     Yes              30000
## 7 58+       No               35000
## 8 58+       Yes              35000
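Grouping by two variables can be sketched as follows on invented rows (toy_patients and toy_max are hypothetical names). Setting .groups = "drop" is optional; it returns an ungrouped table and silences the message about grouped output.

```r
library(dplyr)

# Invented patients with an age group, smoking status, and total cost.
toy_patients <- tibble(
  age_group  = c("18-36", "18-36", "18-36", "58+", "58+"),
  smoking    = c("No", "No", "Yes", "No", "Yes"),
  total_cost = c(100, 900, 50, 300, 700)
)

# One row per age_group/smoking combination, holding the largest total_cost.
toy_max <- toy_patients %>%
  group_by(age_group, smoking) %>%
  summarise(max_total_cost = max(total_cost), .groups = "drop")

toy_max
```

As with mean(), max() returns NA for any group that contains a missing value unless you add na.rm = TRUE.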
4 Visualize data with {esquisse}
Using esquisser() and the costs summary table you
created earlier, create a bar plot of mean costs, by treatment
location.
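esquisser() opens a point-and-click app, so it cannot be shown as static output. As a sketch, the code below launches it only in an interactive session, then builds a comparable bar plot directly with ggplot2. The objects toy_summary and cost_plot and the axis labels are invented for illustration.

```r
library(ggplot2)

# Invented summary table standing in for the one saved earlier.
toy_summary <- data.frame(
  first_visit_location  = c("Other", "GH", "PHC"),
  mean_first_visit_cost = c(500, 30, 0)
)

# Launch the app only when R is running interactively and esquisse is installed.
if (interactive() && requireNamespace("esquisse", quietly = TRUE)) {
  esquisse::esquisser(toy_summary)
}

# Hand-written ggplot2 code for a similar bar plot of mean cost by location.
cost_plot <- ggplot(toy_summary) +
  aes(x = first_visit_location, y = mean_first_visit_cost) +
  geom_col() +
  labs(x = "First visit location", y = "Mean first visit cost")

cost_plot
```

geom_col() draws bar heights from values already in the data, which suits a pre-computed summary table.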
5 Wrap up
That’s it for this assignment! We will choose 2-3 people to present their work during the workshop. If you would like to share your results with the class, please let an instructor know.
If you finish early and have extra time, you can explore the dataset further with esquisse, and create new plots to share with the class. Try customizing the colors and labels in your plots.