1 Intro
Welcome!
For this workshop, we will be cleaning a dataset. It is a hands-on
approach to using the select(), filter(), mutate(), case_when(), and
summarize() functions.
The assignment should be submitted individually, but you are encouraged to brainstorm with partners.
The final due date for the assignment is Tuesday, November 22nd at 23:59 UTC+2.
2 Get the assignment repo
To get started, you should download and look through the assignment folder.
First download the repo to your local computer here.
You should ideally work on your local computer, but if you would rather work on RStudio Cloud, you can upload the zip file to RStudio Cloud through the Files pane. Consult one of the instructors for guidance on this.
Unzip/Extract the downloaded folder.
If you are on macOS, you can simply double-click on a file to unzip it.
If you are on Windows and are not sure how to “unzip” a file, see this image. You need to right-click on the file and then select “extract all”.
Once done, click on the RStudio Project file in the unzipped folder to open the project in RStudio.
In RStudio, navigate to the Files tab and open the “rmd” folder. The instructions for your exercise are outlined there (these are the same instructions you see here).
Open the “data” folder and observe its components. You will work with the “india_tuberculosis.csv” file. (You can also open the “00_info_about_the_dataset” file to learn more about this dataset.)
3 Load packages and data
Now that you understand the structure of the repo, you can load in and clean your dataset.
In the code section below, load in the needed packages.
## Loading required package: pacman
## Warning: package 'dpylr' is not available for this version of R
##
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
## Warning: 'BiocManager' not available. Could not check Bioconductor.
##
## Please use `install.packages('BiocManager')` and then retry.
## Warning in p_install(package, character.only = TRUE, ...):
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'dpylr'
## Warning in pacman::p_load(tidyverse, here, patchwork, janitor, esquisse, : Failed to install/load:
## dpylr
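The warnings above come from a package name spelled 'dpylr', presumably a misspelling of dplyr. A minimal sketch of a loading chunk that avoids the warning is shown below. It assumes you use pacman, as in the output above, and it lists only the five package names visible in that output (the original call is cut off after esquisse, so any further packages are not reproduced here).

```r
# Install pacman once if it is missing, then let it install/load the rest.
if (!require("pacman")) install.packages("pacman")

# 'dpylr' is dropped: dplyr is already attached as part of the tidyverse.
pacman::p_load(tidyverse, here, patchwork, janitor, esquisse)
```

If you need other packages later in the assignment, you can add their names to the same p_load() call.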
Now, read the dataset into R. The data frame you import should have 880 rows and 21 columns. Remember to use the here() function to allow your Rmd to use project-relative paths.
Pro-tip: column names aren’t standardized this time. I would recommend using janitor::clean_names() (the function clean_names() from the janitor package). Not sure how to write it? Go back to the mutate example from week 5.
## Rows: 880 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Sex, Education, Employment, Alcohol, Smoking, Form of TB, Chext Xr...
## dbl (11): id, Age, WtinKgs, HtinCms, bmi, Diabetes, first visit cost, second...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
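The import step can be sketched as follows. The folder and file name come from section 2; the object name tuberculosis_data and the toy data frame are assumptions made for illustration. The read is guarded with file.exists() so that the chunk also runs outside the assignment project.

```r
library(readr)
library(dplyr)
library(here)
library(janitor)

# Project-relative path built from the folder and file named in section 2
tb_path <- here("data", "india_tuberculosis.csv")

# Guarded so the chunk also runs when the file is not present
if (file.exists(tb_path)) {
  tuberculosis_data <- read_csv(tb_path) %>%
    clean_names()
  dim(tuberculosis_data)  # number of rows and columns
}

# What clean_names() does, shown on a toy data frame (hypothetical values)
toy <- tibble(Age = 30, `first visit cost` = 500)
names(clean_names(toy))
# "age" "first_visit_cost"
```

After cleaning, run names() on your own data frame to see the exact column names you will need in the next steps.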
4 Create new variables
4.1 Step 1: Encode an age_group variable
The age variable represents the age of participants in years. For
further manipulations, we want to create an age_group
variable with the following categories:
<10
10-17
18-29
30-49
50-79
80+
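One way to encode these categories is with case_when() inside mutate(). The sketch below runs on a toy data frame; it assumes the cleaned column is called age and that ages are whole or decimal numbers of years. Conditions are checked in order, so each row gets the first category it matches.

```r
library(dplyr)

# Toy ages, including the boundary values and a missing value
toy_ages <- tibble(age = c(4, 10, 17, 18, 29, 30, 49, 50, 79, 80, NA))

toy_grouped <- toy_ages %>%
  mutate(age_group = case_when(
    age < 10  ~ "<10",
    age < 18  ~ "10-17",
    age < 30  ~ "18-29",
    age < 50  ~ "30-49",
    age < 80  ~ "50-79",
    age >= 80 ~ "80+"
    # rows with a missing age match no condition and stay NA
  ))

toy_grouped
```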
4.2 Step 2: Create bmi variables
4.2.1 Part A: Calculate the BMI
You have at your disposal the weight (in kg) and height (in cm) of your participants. Calculate the BMI of your participants.
(Careful! Check your units for weight and height: are your variables
in the right units, or do you need to convert them to other units using
mutate()?)
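As a reminder, BMI is weight in kilograms divided by the square of height in metres, so a height recorded in centimetres has to be divided by 100 first. A minimal sketch on toy data follows; the column names weight_kg and height_cm are hypothetical, so replace them with the names in your cleaned data frame.

```r
library(dplyr)

# Toy participants (hypothetical column names and values)
toy_body <- tibble(weight_kg = c(70, 45), height_cm = c(175, 160))

toy_bmi <- toy_body %>%
  mutate(
    height_m = height_cm / 100,      # convert centimetres to metres
    bmi      = weight_kg / height_m^2
  )

toy_bmi
```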
4.2.2 Part B: Classify the BMI into bmi_categories
A healthy BMI is defined as between 18.5 and 25: the person is categorized as healthy.
If the BMI is below 18.5: the person is categorized as underweight.
If the BMI is between 25 and 30: the person is categorized as overweight.
If the BMI is above 30: the person is categorized as obese.
Using case_when(), create a variable, bmi_categories, that classifies each respondent into a category.
Hint: Take inspiration from the code in the conditional mutate lesson.
## bmi_categories n percent
## obese 5 0.005681818
## overweight 17 0.019318182
## healthy 244 0.277272727
## underweight 597 0.678409091
## missing BMI 17 0.019318182
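The classification can be sketched with case_when() as below. The rules above do not say which category a BMI of exactly 25 or exactly 30 belongs to; this sketch assumes each lower bound is inclusive (25 counts as overweight, 30 as obese), which is a choice you may need to adjust. The "missing BMI" label follows the table above.

```r
library(dplyr)

# Toy BMI values, including boundaries and a missing value
toy_bmi_values <- tibble(bmi = c(16, 18.5, 24.9, 25, 29.9, 30, NA))

toy_bmi_cat <- toy_bmi_values %>%
  mutate(bmi_categories = case_when(
    is.na(bmi) ~ "missing BMI",   # handle missing values first
    bmi < 18.5 ~ "underweight",
    bmi < 25   ~ "healthy",
    bmi < 30   ~ "overweight",
    bmi >= 30  ~ "obese"
  ))

# Count how many rows fall in each category
toy_bmi_cat %>% count(bmi_categories)
```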
4.3 Step 3: Investigating total costs
4.3.1 Part A: Create a total_cost variable
There were three visits for the participants and each had a cost. Add together these costs to create a total_cost variable.
IMPORTANT HINT: You should group_by() the id (of participants) before using mutate() to sum together the different cost columns of interest.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 500 1767 1500 38000
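Following the hint, a sketch of the grouped sum on toy data is shown below. The three cost column names are hypothetical; use the names from your cleaned data frame. Treating a missing visit cost as zero (na.rm = TRUE) is also an assumption, so check whether that is appropriate for your analysis.

```r
library(dplyr)

# Toy data: one row per participant (hypothetical column names)
toy_costs <- tibble(
  id           = 1:3,
  visit_1_cost = c(0, 500, 1000),
  visit_2_cost = c(0, 0, 250),
  visit_3_cost = c(0, NA, 250)
)

toy_total <- toy_costs %>%
  group_by(id) %>%   # sum() then works within each participant
  mutate(total_cost = sum(visit_1_cost, visit_2_cost, visit_3_cost,
                          na.rm = TRUE)) %>%
  ungroup()          # drop the grouping before later steps

summary(toy_total$total_cost)
```

Without group_by(), sum() would add up the cost columns across all participants and give every row the same total.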
4.3.2 Part B: Categorize the cost
Using summarize(), I will calculate for you the quantiles (0.25, 0.5 and 0.75) of the total_cost variable.
For your general knowledge (although this is not a statistics course): we are using quantiles because they split the data into subsets. For example, the 0.25 quantile is the value (let’s call it x) such that the probability that a random observation of the variable is less than x is 0.25 (a 25% chance).
We will use these to define cost categories. Run the code below.
You can observe that the 0.25 quantile is at 0, the 0.5 quantile is at 500, the 0.75 quantile is at 1500.
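For reference, a quantile summary of this kind can be written with summarize() and quantile(). This is only a sketch on toy values, not the course's own chunk; the output column names q25, q50 and q75 are made up for the example.

```r
library(dplyr)

# Toy total costs (hypothetical values)
toy_quant <- tibble(total_cost = c(0, 100, 200, 300, 400))

cost_quantiles <- toy_quant %>%
  summarize(
    q25 = quantile(total_cost, 0.25, na.rm = TRUE),
    q50 = quantile(total_cost, 0.50, na.rm = TRUE),
    q75 = quantile(total_cost, 0.75, na.rm = TRUE)
  )

cost_quantiles
```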
If the total cost is less than 500: the total cost is low.
If the total cost is between 500 and 1500: the total cost is average.
If the total cost is more than 1500: the total cost is high.
(These are arbitrary definitions based on a quick overview, that we will use for this exercise.)
Create a total_cost_categories variable that reflects the above classification.
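The three rules can be sketched with case_when() as follows, on toy values. Both 500 and 1500 are treated as "average" here, in line with "between 500 and 1500"; what to do with a missing total cost is not specified, so this sketch leaves it as NA.

```r
library(dplyr)

# Toy total costs, including the two boundary values
toy_cost_values <- tibble(total_cost = c(0, 499, 500, 1500, 1501, NA))

toy_cost_cat <- toy_cost_values %>%
  mutate(total_cost_categories = case_when(
    total_cost < 500   ~ "low",
    total_cost <= 1500 ~ "average",
    total_cost > 1500  ~ "high"
    # a missing total_cost matches nothing and stays NA
  ))

toy_cost_cat
```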
You will use this tuberculosis_data_cleaned data frame for all the subsetting and plotting that follows.
5 Tables & Plotting
5.1 Present a demographic table
Using the sex variable and the age group variable you created above, print (in an aesthetic way, think reactable()) the average cost by age group and gender.
Hint: use group_by() and summarize().
## `summarise()` has grouped output by 'age_group'. You can override using the
## `.groups` argument.
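A sketch of such a table on toy data is given below. The column name mean_total_cost is made up for the example. Setting .groups = "drop" returns an ungrouped result and silences the message shown above. The reactable() call is guarded in case the package is not installed.

```r
library(dplyr)

# Toy participants (hypothetical values)
toy_people <- tibble(
  age_group  = c("18-29", "18-29", "18-29", "30-49"),
  sex        = c("Female", "Female", "Male", "Male"),
  total_cost = c(0, 1000, 500, 1500)
)

cost_by_age_sex <- toy_people %>%
  group_by(age_group, sex) %>%
  summarize(mean_total_cost = mean(total_cost, na.rm = TRUE),
            .groups = "drop")

# Interactive table, only if reactable is available
if (requireNamespace("reactable", quietly = TRUE)) {
  reactable::reactable(cost_by_age_sex)
}
```

The same pattern, with bmi_categories in place of age_group, applies to the BMI table in the next section.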
5.2 Cost and BMI
5.2.1 Part A: Create a table
Using the gender variable, as well as the BMI category and total cost variable you created above, print (in an aesthetic way, think reactable()) the average cost by BMI category and gender.
## `summarise()` has grouped output by 'bmi_categories'. You can override using
## the `.groups` argument.
5.2.2 Part B: Plot a histogram for the obese participants, by gender
Using the above table and esquisse, keep only obese respondents and plot their average total cost by gender, as a histogram.
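esquisse generates ggplot2 code, so it can help to know roughly what that code looks like. The sketch below uses a toy summary table with hypothetical values and column names. Because the table already holds one average per gender, the sketch draws bars with geom_col() rather than geom_histogram(), which would count observations instead of plotting the averages.

```r
library(dplyr)
library(ggplot2)

# Toy version of the BMI-by-gender table (hypothetical values)
toy_bmi_table <- tibble(
  bmi_categories  = c("obese", "obese", "healthy", "healthy"),
  sex             = c("Female", "Male", "Female", "Male"),
  mean_total_cost = c(1200, 800, 1500, 1700)
)

# Keep only the obese rows
obese_only <- toy_bmi_table %>%
  filter(bmi_categories == "obese")

# One bar per gender, bar height = average total cost
obese_plot <- ggplot(obese_only,
                     aes(x = sex, y = mean_total_cost, fill = sex)) +
  geom_col() +
  labs(x = "Gender", y = "Average total cost")

obese_plot
```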
5.3 Total costs and Treatment initiation delay
Using esquisse, plot a scatter plot of the treatment initiation delay vs. the total costs, for all participants.
## Warning: Removed 3 rows containing missing values (`geom_point()`).
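The ggplot2 code behind such a scatter plot might look like the sketch below. The column name treatment_initiation_delay is hypothetical, as are the toy values; check names() on your cleaned data frame for the real one. Rows with a missing value in either variable are dropped by geom_point() with a warning like the one above.

```r
library(dplyr)
library(ggplot2)

# Toy data with one missing delay (hypothetical values)
toy_delay <- tibble(
  treatment_initiation_delay = c(2, 10, NA, 30),
  total_cost                 = c(0, 500, 1500, 3000)
)

delay_plot <- ggplot(toy_delay,
                     aes(x = treatment_initiation_delay, y = total_cost)) +
  geom_point() +
  labs(x = "Treatment initiation delay", y = "Total cost")

delay_plot
```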
5.4 Employment status and Total cost categories
5.4.1 Part A: Create a table
Group by employment status and the total cost categories you defined previously, to count how many participants fall in each grouping.
## `summarise()` has grouped output by 'Employment'. You can override using the
## `.groups` argument.
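A counting table of this kind can be sketched with group_by() and summarize(), using n() to count rows. The toy data and the employment labels below are hypothetical, as is the lower-case column name employment (what clean_names() would produce from Employment).

```r
library(dplyr)

# Toy participants (hypothetical values)
toy_employment <- tibble(
  employment            = c("Working", "Working", "Non-working",
                            "Non-working", "Working"),
  total_cost_categories = c("high", "low", "high", "high", "high")
)

employment_counts <- toy_employment %>%
  group_by(employment, total_cost_categories) %>%
  summarize(n = n(), .groups = "drop")

employment_counts
```

count(employment, total_cost_categories) is a shorter way to get the same result.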
5.4.2 Part B: Plot it!
Select the high total cost category and compare the number of working vs non-working participants. Do so using esquisse and a histogram.
Set the total cost categories on the x axis and color the histograms based on the employment status.
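A sketch of the equivalent ggplot2 code on toy data is shown below, with hypothetical employment labels and column names. It filters to the high category, puts the cost category on the x axis and fills by employment status. geom_bar() counts the rows in each group, and position = "dodge" places the bars side by side so the two counts are easy to compare.

```r
library(dplyr)
library(ggplot2)

# Toy participants (hypothetical values)
toy_status <- tibble(
  employment            = c("Working", "Working", "Non-working",
                            "Non-working", "Working"),
  total_cost_categories = c("high", "low", "high", "high", "high")
)

# Keep only participants in the high cost category
high_only <- toy_status %>%
  filter(total_cost_categories == "high")

high_plot <- ggplot(high_only,
                    aes(x = total_cost_categories, fill = employment)) +
  geom_bar(position = "dodge") +
  labs(x = "Total cost category", y = "Number of participants",
       fill = "Employment status")

high_plot
```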
6 Submission: Upload HTML
Once you have finished the tasks above, you should knit this Rmd into an HTML file and upload it to the assignment page.