Creating factors
In R, factors are used to work with categorical variables: variables that have a fixed and known set of possible values.
To work with factors, we use the forcats package.
To create a factor with a fixed set of levels, we start by creating a character vector of the valid levels.
Then we pass the data and the levels to factor():
tweet_types <- c("News", "Fun", "Quotation", "Commentary", "Document")
type <- c("Fun", "Commentary", "Quotation", "Quotation", "News", "Quotation", "Document", "Commentary")
tweet_type <- factor(type, levels = tweet_types)
tweet_type
## [1] Fun Commentary Quotation Quotation News Quotation Document
## [8] Commentary
## Levels: News Fun Quotation Commentary Document
Levels play an important role here. Some points to note:
- Any values not in the set of levels are silently converted to NA. If you want a warning instead, you can use readr::parse_factor().
- If you omit the levels, they are taken from the data in alphabetical order. For example, factor(type) gives:
## [1] Fun Commentary Quotation Quotation News Quotation Document
## [8] Commentary
## Levels: Commentary Document Fun News Quotation
- Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x), or after the fact, with fct_inorder().
## [1] Fun Commentary Quotation Quotation News Quotation Document
## [8] Commentary
## Levels: Fun Commentary Quotation News Document
- If you ever need to access the set of valid levels directly, you can do so with levels().
## [1] "Fun" "Commentary" "Quotation" "News" "Document"
In short, a factor can be created either with predefined levels or with levels taken from the data itself.
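The points above can be tried out with a short sketch. It reuses the type vector from the example; the value "Rumour" and the variable names are made up for illustration, and forcats is assumed to be installed.

```r
library(forcats)

# Values taken from the example above
type <- c("Fun", "Commentary", "Quotation", "Quotation",
          "News", "Quotation", "Document", "Commentary")

# A value outside the level set becomes NA without any message.
# "Rumour" is a hypothetical value used only to show this.
with_unknown <- factor(c("News", "Rumour"), levels = c("News", "Fun"))
is.na(with_unknown)

# Omitting levels: they are taken from the data in sorted order
by_alpha <- factor(type)
levels(by_alpha)

# Levels in order of first appearance, by two routes
by_first_1 <- factor(type, levels = unique(type))
by_first_2 <- fct_inorder(factor(type))
levels(by_first_2)
```

Both by_first_1 and by_first_2 should end up with the same levels, so which route to use is mostly a matter of whether the factor already exists.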
Exploring levels
When factors are stored in a tibble, you can’t see their levels so easily. One way to see them is with count():
## # A tibble: 3 x 2
##   race      n
##   <fct> <int>
## 1 Other  1959
## 2 Black  3129
## 3 White 16395
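The call that produces a table like this is not shown in the text. A minimal sketch, assuming the data is the gss_cat survey sample that ships with forcats (the counts above match it) and that dplyr is installed:

```r
library(forcats)  # provides the gss_cat example data (assumed source of the table)
library(dplyr)    # provides count()

# One row per level that actually occurs in the data
race_counts <- count(gss_cat, race)
race_counts

# levels() on the column lists every level, including any with no rows
levels(gss_cat$race)
```

Comparing the two results shows whether the factor has levels that never occur in the data.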
Another way is a bar chart.
By default, ggplot2 will drop levels that don’t have any values. You can force them to display by adding scale_x_discrete(drop = FALSE) to the plot.
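A sketch of both versions of the bar chart, again assuming the gss_cat data from forcats and the race column used in the count() output; the plot object names are illustrative.

```r
library(forcats)  # gss_cat example data (assumed)
library(ggplot2)

# Default: only levels that occur in the data get a bar
p_default <- ggplot(gss_cat, aes(race)) +
  geom_bar()

# Keep unused levels on the x axis as well
p_all_levels <- p_default +
  scale_x_discrete(drop = FALSE)

# print(p_default); print(p_all_levels)  # uncomment to draw the plots
```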
Reordering and releveling factors
When we plot a factor, the levels often appear in an arbitrary order that makes the pattern hard to read. To fix this, we can reorder the levels by another variable with fct_reorder().
fct_reorder() takes three arguments:
- .f, the factor whose levels you want to modify.
- .x, a numeric vector that you want to use to reorder the levels.
- Optionally, .fun, a function that’s used if there are multiple values of .x for each value of .f. The default value is median.
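The three arguments can be seen at work in a small sketch. The group and value vectors are hypothetical data, and the .desc argument is an extra option of fct_reorder() that the text does not cover.

```r
library(forcats)

# Hypothetical data: three groups with two measurements each
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(10, 12, 1, 3, 5, 7)

# Default summary is the median per group: a = 11, b = 2, c = 6
by_median <- fct_reorder(group, value)
levels(by_median)    # "b" "c" "a"

# Another summary function can be supplied through .fun;
# .desc = TRUE sorts from largest to smallest (min: a = 10, b = 1, c = 5)
by_min_desc <- fct_reorder(group, value, .fun = min, .desc = TRUE)
levels(by_min_desc)  # "a" "c" "b"
```

Only the order of the levels changes; the values stored in the factor stay the same.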
In other cases we want to force some levels to come first, or to change the order of the levels by hand. Here we can use fct_relevel(). It takes a factor, .f, and then any number of levels that you want to move to the front of the line.
fct_relevel() also accepts a function that is applied to the current levels, e.g. sort (other options: rev and sample).
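A brief sketch of both uses of fct_relevel(), with a made-up size factor for illustration:

```r
library(forcats)

# Hypothetical factor; default levels are alphabetical
size <- factor(c("small", "large", "medium", "large"))
levels(size)            # "large" "medium" "small"

# Move one level to the front; the rest keep their relative order
small_first <- fct_relevel(size, "small")
levels(small_first)     # "small" "large" "medium"

# Pass a function to transform the whole set of levels
reversed <- fct_relevel(size, rev)
levels(reversed)        # "small" "medium" "large"
```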
Recoding the factor’s labels
We can also change the labels of the levels with fct_recode(). It allows you to recode, or change, the value of each level. The new label goes on the left and the old label on the right:
fct_recode(f, "new label" = "old label")
fct_recode() will leave levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.
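As an illustration, a sketch with a hypothetical party factor whose short codes are replaced by longer labels:

```r
library(forcats)

# Hypothetical factor with abbreviated labels
party <- factor(c("Rep", "Dem", "Ind", "Dem"))

# New label on the left, old label on the right
renamed <- fct_recode(party,
  "Republican" = "Rep",
  "Democrat"   = "Dem"
)

# "Ind" was not mentioned, so it is left as is
levels(renamed)  # "Democrat" "Ind" "Republican"
```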
To combine groups, you can assign multiple old levels to the same new level.
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels.
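Both ways of combining levels can be compared in a short sketch. The party factor and the new labels "Major" and "Other" are invented for the example.

```r
library(forcats)

# Hypothetical factor
party <- factor(c("Rep", "Dem", "Ind", "Dem"))

# fct_recode(): assign several old levels to the same new level
combined <- fct_recode(party, "Major" = "Rep", "Major" = "Dem")
as.character(combined)

# fct_collapse(): one vector of old levels per new level
collapsed <- fct_collapse(party,
  Major = c("Rep", "Dem"),
  Other = "Ind"
)
as.character(collapsed)
```

With only a few levels the two calls are about equally readable; fct_collapse() tends to pay off as the vectors of old levels grow.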