Factors

A factor in R is a data type used to represent categorical variables, i.e., variables that take on a fixed number of possible values (levels), such as gender, education, region, or treatment group.

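Key properties

Creating a factor from a character vector collects the distinct values as levels; table() then counts the observations per level: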

x = c("low", "high", "medium", "medium", "high", "medium")
f = factor(x)
table(f)
## f
##   high    low medium 
##      2      1      3
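
The following is a minimal sketch of how the properties of such a factor can be inspected, using only base R functions. It rebuilds the same vector as above so that it runs on its own; the expected values in the comments follow from the alphabetical level order.

```r
x = c("low", "high", "medium", "medium", "high", "medium")
f = factor(x)

levels(f)       # distinct categories, sorted: "high" "low" "medium"
nlevels(f)      # number of levels: 3
as.integer(f)   # internal integer codes: 2 1 3 3 1 3
typeof(f)       # storage type: "integer"
class(f)        # "factor"
```

Each observation is stored as an integer code that points into the level vector, which is why "low" (the second level alphabetically) is stored as 2.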

Reorder Factors

By default, factor levels are sorted, which means alphabetically for character data. To change the order, specify the desired level order when creating the factor:

x = c("low", "high", "medium", "medium", "high", "medium")
o = factor(x, levels = c("low", "medium", "high"))
table(o)
## o
##    low medium   high 
##      1      3      2

An existing factor can be reordered in the same way:

o = factor(f, levels = c("high", "medium", "low"))
table(o)
## o
##   high medium    low 
##      2      3      1

To save typing, the new level order can also be built from values that already exist. The distinct values of a character vector are returned by unique():

x = c("low", "high", "medium", "medium", "high", "medium")
unique(x)
## [1] "low"    "high"   "medium"

We can take this character vector and simply reorder its elements by position:

o = factor(x, levels = unique(x)[c(1, 3, 2)])
table(o)
## o
##    low medium   high 
##      1      3      2

The same approach can be applied to factors by using the factor levels:

levels(f)
## [1] "high"   "low"    "medium"

Note that levels(f) is sorted alphabetically, unlike unique(x), so the index vector differs:

o = factor(f, levels = levels(f)[c(2, 3, 1)])
table(o)
## o
##    low medium   high 
##      1      3      2
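
As a quick check of what reordering does, the sketch below compares a factor before and after its levels are reordered. It assumes the same example vector as above: the labels attached to each observation stay the same and only the internal integer codes change.

```r
x = c("low", "high", "medium", "medium", "high", "medium")
f = factor(x)                                       # levels: high, low, medium
o = factor(f, levels = c("low", "medium", "high"))  # same data, new level order

all(as.character(f) == as.character(o))   # TRUE: the labels are unchanged
as.integer(f)                             # 2 1 3 3 1 3
as.integer(o)                             # 1 3 2 2 3 2
```

Reordering therefore affects how the categories are sorted in tables and plots, not which category an observation belongs to.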

Numeric Factors

Be careful when working with factors whose levels look like numbers, for example:

x = c(80, 30, 40, 30, 20)
f = factor(x)
table(f)
## f
## 20 30 40 80 
##  1  2  1  1

Assume we want to calculate the mean, which is (80+30+40+30+20) / 5 = 40. Numerical summaries cannot be computed directly on a factor:

mean(f)
## Warning in mean.default(f): argument is not numeric or logical: returning NA
## [1] NA

The call returns NA with a warning. It might then seem reasonable to convert the factor to numeric values using

mean(as.numeric(f))
## [1] 2.4

However, this produces 2.4 instead of the correct value 40. This happens because converting a factor to numeric returns the internal integer codes, i.e., the positions of the levels in the level list, not the levels themselves.

For example, although the factor is

f
## [1] 80 30 40 30 20
## Levels: 20 30 40 80

the numeric conversion returns

as.numeric(f)
## [1] 4 2 3 2 1

More concretely, the factor value “80” corresponds to position 4 in the level list, “30” to position 2, “40” to position 3, and “20” to position 1.
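
The relationship between codes and levels can be made explicit with a short sketch. It recreates the numeric-looking factor from above; the variable name codes is only chosen for illustration. Indexing the level vector with the integer codes gives back the original labels as character strings.

```r
x = c(80, 30, 40, 30, 20)
f = factor(x)

codes = as.integer(f)   # positions in the level list: 4 2 3 2 1
levels(f)               # "20" "30" "40" "80"
levels(f)[codes]        # "80" "30" "40" "30" "20"
mean(codes)             # 2.4, the mean of the codes, not of the data
```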

To correctly convert factor levels that represent numeric values, we first convert the factor to its character representation:

txt = as.character(f)
txt
## [1] "80" "30" "40" "30" "20"

and then convert these character strings to numbers:

as.numeric(txt)
## [1] 80 30 40 30 20

In short:

mean(as.numeric(as.character(f)))
## [1] 40
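
An alternative that the R documentation for factor() mentions is to convert the levels once and then index them by the factor. The sketch below assumes the same example data and that every level is a valid number; the variable name v is only illustrative.

```r
x = c(80, 30, 40, 30, 20)
f = factor(x)

# Convert each level to a number once, then look the values up
# through the factor's integer codes.
v = as.numeric(levels(f))[f]
v          # 80 30 40 30 20
mean(v)    # 40
```

Both variants give the same result here. Indexing by levels converts only the distinct levels rather than every element, which can matter for long vectors with few levels.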