Task Overview
The brief Rmd-based tutorial presented below introduces novice R users to R factors. Please make two adjustments to improve its pedagogical effectiveness:
New R learners might not be comfortable with the concept of vectors. Therefore, replace the vector example with a real (small) dataset that you import as part of the tutorial. The dataset should be relevant to public health. For this, you can assume the students are familiar with the {dplyr} package, including the mutate() function.
We suspect that the current demonstration of custom ordering with table() and tabyl() might not captivate students effectively. A more visually engaging alternative could be to showcase the relevance of custom ordering for ggplot2 plots. Please implement this modification. For this, you can assume the students are familiar with basic ggplot syntax.
As you implement these changes, try to avoid making the tutorial much longer than it already is. Keep it short and to the point.
Now, below is the original tutorial. Upon completing the changes, please return your Rmd, knitted HTML and the dataset used to us.
Introduction to Factors
Let’s delve into factors, another data class in R. A factor is a categorical variable with a fixed set of possible values, known as levels.
So, why use a factor class? One common reason is to make character values sort in a custom order.
To understand what this means, consider the following example.
Assume you have a variable recording individuals’ birth month:
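The chunk that created this variable is not shown. A minimal sketch of it follows; the values are an assumption, reconstructed from the factor printout later in this tutorial:

```r
# Assumed values, taken from the printed factor further down
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")

class(birth_month) # a plain character vector at this point
```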
Suppose you want to count the number of births for each month. You
could use the base table() function:
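The call behind this output was probably along these lines (a sketch, with the birth_month values assumed from the outputs in this tutorial):

```r
# Assumed values, reconstructed from the tutorial's outputs
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")

# Count how many times each month appears
table(birth_month)
```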
## birth_month
## Apr Dec Jan Mar Nov Oct
## 2 1 2 1 1 1
This shows two people were born in April, one in December, and so forth.
Alternatively, you can use the tabyl() function from
{janitor}:
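A sketch of the corresponding call, assuming {janitor} is installed and the same reconstructed birth_month values:

```r
library(janitor)

# Assumed values, reconstructed from the tutorial's outputs
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")

# Counts plus proportions, returned as a data frame
tabyl(birth_month)
```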
## birth_month n percent
## Apr 2 0.250
## Dec 1 0.125
## Jan 2 0.250
## Mar 1 0.125
## Nov 1 0.125
## Oct 1 0.125
But, there’s an issue with these outputs: the months are sorted
alphabetically. Even if you try to sort() the
birth_month vector, you get the same result:
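For instance, sorting the character vector directly might look like this (values assumed as before):

```r
# Assumed values, reconstructed from the tutorial's outputs
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")

# Character vectors sort alphabetically, not chronologically
sort(birth_month)
```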
## [1] "Apr" "Apr" "Dec" "Jan" "Jan" "Mar" "Nov" "Oct"
However, for this variable, a chronological order, with January first, makes more sense.
You can rectify this by creating a factor with the
factor() function. Here, you provide your original
character vector and a vector of valid levels, arranged
in the right sequence:
birth_month_factor <- factor(x = birth_month,
                             levels = c("Jan", "Feb", "Mar", "Apr",
                                        "May", "Jun", "Jul", "Aug",
                                        "Sep", "Oct", "Nov", "Dec"))
class(birth_month_factor) # check its class
## [1] "factor"
birth_month_factor # print the factor
## [1] Dec Apr Jan Mar Oct Nov Jan Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Notice that the levels are displayed in the output.
[1] Dec Apr Jan Mar Oct Nov Jan Apr
👉Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec👈
Now, you can sort the vector correctly:
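A sketch of the sorting step. The birth_month values are assumed, and the built-in constant month.abb is used here only as a shortcut for the twelve month abbreviations spelled out in the factor() call of this tutorial:

```r
# Assumed values, reconstructed from the tutorial's outputs
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")

# month.abb is base R's vector "Jan", "Feb", ..., "Dec"
birth_month_factor <- factor(x = birth_month, levels = month.abb)

# Factors sort by level order rather than alphabetically
sort(birth_month_factor)
```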
## [1] Jan Jan Mar Apr Apr Oct Nov Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
And when we create the frequency count tables, we get them in the right order:
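The two calls behind these tables were presumably the same as before, applied to the factor. A sketch, assuming {janitor} is installed and using the reconstructed values (month.abb stands in for the levels written out earlier):

```r
library(janitor)

# Assumed values, reconstructed from the tutorial's outputs
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")
birth_month_factor <- factor(x = birth_month, levels = month.abb)

table(birth_month_factor) # base R counts, in level order
tabyl(birth_month_factor) # counts and proportions, in level order
```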
## birth_month_factor
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2 0 1 2 0 0 0 0 0 1 1 1
## birth_month_factor n percent
## Jan 2 0.250
## Feb 0 0.000
## Mar 1 0.125
## Apr 2 0.250
## May 0 0.000
## Jun 0 0.000
## Jul 0 0.000
## Aug 0 0.000
## Sep 0 0.000
## Oct 1 0.125
## Nov 1 0.125
## Dec 1 0.125
Notice that even months with zero counts (February, May, June, July, August and September) are included in the table outputs, which can be very useful!
Lastly, if you prefer not to include months with zero counts, the
tabyl() function lets you drop them by setting the
show_missing_levels argument to FALSE:
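A sketch of that call, assuming {janitor} is installed and the same reconstructed values:

```r
library(janitor)

# Assumed values, reconstructed from the tutorial's outputs
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")
birth_month_factor <- factor(x = birth_month, levels = month.abb)

# Drop levels that never occur in the data
tabyl(birth_month_factor, show_missing_levels = FALSE)
```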
## birth_month_factor n percent
## Jan 2 0.250
## Mar 1 0.125
## Apr 2 0.250
## Oct 1 0.125
## Nov 1 0.125
## Dec 1 0.125