Task Overview

The brief Rmd-based tutorial presented below introduces novice R users to R factors. Please make two adjustments to improve its pedagogical effectiveness:

As you implement these changes, try to avoid making the tutorial much longer than it already is. Keep it short and to the point.

Below is the original tutorial. Once you have completed the changes, please send us your Rmd, the knitted HTML, and the dataset used.

Introduction to Factors

Let’s look at factors, another data class in R. A factor is a categorical variable with a defined set of possible values known as levels.

So, why use a factor class? One common reason is to make character values sort in a custom order.

To understand what this means, consider the following example.

Assume you have a variable recording individuals’ birth month:

birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")

Suppose you want to count the number of births for each month. You could use the base table() function:

table(birth_month)
## birth_month
## Apr Dec Jan Mar Nov Oct 
##   2   1   2   1   1   1

This shows two people were born in April, one in December, and so forth.

Alternatively, you can use the tabyl() function from {janitor}:

library(janitor) # provides tabyl()
tabyl(birth_month)
##  birth_month n percent
##          Apr 2   0.250
##          Dec 1   0.125
##          Jan 2   0.250
##          Mar 1   0.125
##          Nov 1   0.125
##          Oct 1   0.125

But there’s an issue with these outputs: the months are sorted alphabetically. Sorting the birth_month vector with sort() gives the same alphabetical order:

sort(birth_month)
## [1] "Apr" "Apr" "Dec" "Jan" "Jan" "Mar" "Nov" "Oct"

However, for this variable, a chronological order, with January first, makes more sense.

You can fix this by creating a factor with the factor() function. Here, you provide your original character vector and a character vector of valid levels, arranged in the desired order:

birth_month_factor <- factor(x = birth_month, 
                             levels = c("Jan", "Feb", "Mar", "Apr", 
                                        "May", "Jun", "Jul", "Aug", 
                                        "Sep", "Oct", "Nov", "Dec"))

class(birth_month_factor) # check its class
## [1] "factor"
birth_month_factor # print it
## [1] Dec Apr Jan Mar Oct Nov Jan Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Notice that the levels are displayed in the output.

[1] Dec Apr Jan Mar Oct Nov Jan Apr
👉Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec👈
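Because the levels define the full set of valid values, it helps to know what factor() does with a value that is not among them. The following is a minimal sketch using a made-up vector (months_with_typo is hypothetical and not part of the tutorial data) that contains the misspelling "Sept":

```r
# Hypothetical input: "Sept" is a deliberate typo, not one of the levels
months_with_typo <- c("Jan", "Sept", "Mar")

month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

typo_factor <- factor(x = months_with_typo, levels = month_levels)

typo_factor        # the unmatched value "Sept" is printed as <NA>
is.na(typo_factor) # TRUE marks the position of the unmatched value
```

No warning is raised when this happens, so it is worth checking for NA values after converting a character vector to a factor.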

Now, you can sort the vector correctly:

sort(birth_month_factor)
## [1] Jan Jan Mar Apr Apr Oct Nov Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
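To see why sorting now follows the calendar, it can help to look at how a factor is stored. As a rough sketch: each value is kept as an integer giving its position in the levels, and sort() orders by those integers. The example below recreates the factor using month.abb, a constant built into R that holds the same twelve abbreviations written out earlier:

```r
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")

# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec"
birth_month_factor <- factor(x = birth_month, levels = month.abb)

levels(birth_month_factor)     # the levels, in the order you defined them
as.integer(birth_month_factor) # position of each value within the levels
## [1] 12  4  1  3 10 11  1  4
```

"Dec" is stored as 12 and "Jan" as 1, which is why January comes first when the factor is sorted.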

And when you create the frequency tables, the months appear in the right order:

table(birth_month_factor)
## birth_month_factor
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##   2   0   1   2   0   0   0   0   0   1   1   1
tabyl(birth_month_factor)
##  birth_month_factor n percent
##                 Jan 2   0.250
##                 Feb 0   0.000
##                 Mar 1   0.125
##                 Apr 2   0.250
##                 May 0   0.000
##                 Jun 0   0.000
##                 Jul 0   0.000
##                 Aug 0   0.000
##                 Sep 0   0.000
##                 Oct 1   0.125
##                 Nov 1   0.125
##                 Dec 1   0.125

Notice that even months with zero counts (February, May, June, July, August and September) are included in the table outputs, which can be very useful!

Lastly, if you prefer not to include months with zero counts, the tabyl() function offers an option to remove them using the show_missing_levels argument:

tabyl(birth_month_factor, show_missing_levels = FALSE)
##  birth_month_factor n percent
##                 Jan 2   0.250
##                 Mar 1   0.125
##                 Apr 2   0.250
##                 Oct 1   0.125
##                 Nov 1   0.125
##                 Dec 1   0.125
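The base table() function has no show_missing_levels argument. One possible base R alternative, shown here as a sketch rather than the only approach, is to remove unused levels with droplevels() before tabulating:

```r
birth_month <- c("Dec", "Apr", "Jan", "Mar", "Oct", "Nov", "Jan", "Apr")

# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec"
birth_month_factor <- factor(x = birth_month, levels = month.abb)

# droplevels() returns a copy of the factor without the unused levels
month_counts <- table(droplevels(birth_month_factor))
month_counts
```

The remaining months stay in chronological order, because droplevels() keeps the relative order of the levels that are still in use.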