A factor in R is a data type used to represent categorical variables, i.e., variables that take on a fixed set of possible values (levels), such as gender, education, region, or treatment group.
Key properties
x = c("low", "high", "medium", "medium", "high", "medium")
f = factor(x)
table(f)
## f
## high low medium
## 2 1 3
By default, factor levels are sorted, which for character data means alphabetical order. To change the order, specify the desired level order when creating the factor:
x = c("low", "high", "medium", "medium", "high", "medium")
o = factor(x, levels = c("low", "medium", "high"))
table(o)
## o
## low medium high
## 1 3 2
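One caveat with an explicit `levels` argument, sketched here with a deliberately incomplete level list (the names `partial` and `wanted` are only illustrative): any value that is not listed in `levels` is converted to `NA` without a warning, so it can be worth checking that the intended levels cover all observed values.

```r
x = c("low", "high", "medium", "medium", "high", "medium")

# "high" is deliberately left out of the level list (illustrative mistake)
partial = factor(x, levels = c("low", "medium"))

# Values without a matching level become NA, and no warning is issued
sum(is.na(partial))
## [1] 2

# A quick check that the intended levels cover every observed value
wanted = c("low", "medium", "high")
all(unique(x) %in% wanted)
## [1] TRUE
```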
Or reorder the levels of an already existing factor:
o = factor(f, levels = c("high", "medium", "low"))
table(o)
## o
## high medium low
## 2 3 1
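When only one level needs to move to the front, or the whole order should be flipped, the full level list does not have to be typed out. The following is a minimal sketch using base R's `relevel()` and `rev()`; the variable name `r` is only illustrative.

```r
x = c("low", "high", "medium", "medium", "high", "medium")
f = factor(x)

# relevel() moves the level given in `ref` to the first position;
# the remaining levels keep their previous relative order
r = relevel(f, ref = "medium")
levels(r)
## [1] "medium" "high"   "low"

# Reversing the current level order
levels(factor(f, levels = rev(levels(f))))
## [1] "medium" "low"    "high"
```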
Reordering can also be done by indexing the existing values, which saves typing. For a character variable, the distinct values are available via unique():
x = c("low", "high", "medium", "medium", "high", "medium")
unique(x)
## [1] "low" "high" "medium"
We can take this character vector and simply rearrange its elements by position:
o = factor(x, levels = unique(x)[c(1, 3, 2)])
table(o)
## o
## low medium high
## 1 3 2
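Note that `unique()` returns values in order of first appearance, so a positional index such as `c(1, 3, 2)` is tied to this particular data. The sketch below uses a hypothetical second vector `x2` with the same values in a different order to show that the same index then yields a different level order.

```r
x  = c("low", "high", "medium", "medium", "high", "medium")
x2 = c("high", "low", "medium")   # hypothetical data in a different order

unique(x)[c(1, 3, 2)]
## [1] "low"    "medium" "high"

# Same index, different data: the resulting order is reversed
unique(x2)[c(1, 3, 2)]
## [1] "high"   "medium" "low"
```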
The same approach can be applied to factors by using the factor levels:
levels(f)
## [1] "high" "low" "medium"
Note that the levels are stored in alphabetical order here, so the index vector differs from the one used with unique(x):
o = factor(f, levels = levels(f)[c(2, 3, 1)])
table(o)
## o
## low medium high
## 1 3 2
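A related pitfall, sketched below under the same example data: assigning a new vector to `levels()` does not reorder the levels, it relabels them position by position, which changes the data. The variable name `g` is only illustrative.

```r
x = c("low", "high", "medium", "medium", "high", "medium")
f = factor(x)

# Assigning to levels() renames the existing levels in place:
# "high" -> "low", "low" -> "medium", "medium" -> "high"
g = f
levels(g) = c("low", "medium", "high")
as.character(g)
## [1] "medium" "low"    "high"   "high"   "low"    "high"

# Reordering via factor(), by contrast, leaves the values untouched
o = factor(f, levels = c("low", "medium", "high"))
identical(as.character(o), x)
## [1] TRUE
```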
Be careful when working with factors whose levels look like numbers, for example:
x = c(80, 30, 40, 30, 20)
f = factor(x)
table(f)
## f
## 20 30 40 80
## 1 2 1 1
Assume we want to calculate the mean, which is (80+30+40+30+20) / 5 = 40. We cannot directly compute numerical summaries on categorical values, because
mean(f)
## Warning in mean.default(f): argument is not numeric or logical: returning NA
## [1] NA
does not work. It might therefore seem reasonable to convert the factor to numeric values using
mean(as.numeric(f))
## [1] 2.4
However, this produces 2.4 instead of the correct value 40. This happens because converting a factor to numeric returns the internal integer codes, i.e., the positions of the levels in the level list, not the levels themselves.
For example, although the factor is
f
## [1] 80 30 40 30 20
## Levels: 20 30 40 80
the numeric conversion returns
as.numeric(f)
## [1] 4 2 3 2 1
More concretely, the factor value “80” corresponds to position 4 in the level list, “30” to position 2, “40” to position 3, and “20” to position 1.
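The mapping between codes and levels can be made explicit with a short sketch: indexing the level list with the integer codes recovers the original labels as strings. The variable name `codes` is only illustrative.

```r
x = c(80, 30, 40, 30, 20)
f = factor(x)

# Internal integer codes: positions in the level list
codes = as.integer(f)
codes
## [1] 4 2 3 2 1

# Looking up each code in the level list gives back the labels
levels(f)[codes]
## [1] "80" "30" "40" "30" "20"
```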
To correctly convert a factor whose levels represent numeric values, first convert the factor to its character representation:
txt = as.character(f)
txt
## [1] "80" "30" "40" "30" "20"
and then convert these character strings to numbers:
as.numeric(txt)
## [1] 80 30 40 30 20
In short:
mean(as.numeric(as.character(f)))
## [1] 40
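An equivalent idiom, mentioned in R's own help page for `factor`, converts each distinct level only once and then indexes the result with the factor. The sketch below wraps it in a small helper; the function name `factor_to_numeric` is hypothetical and not part of base R.

```r
x = c(80, 30, 40, 30, 20)
f = factor(x)

# Convert the level list to numbers, then index it by the factor's codes
as.numeric(levels(f))[f]
## [1] 80 30 40 30 20

# Hypothetical helper (illustrative name, not a base R function)
factor_to_numeric = function(fac) {
  stopifnot(is.factor(fac))
  as.numeric(levels(fac))[fac]
}

mean(factor_to_numeric(f))
## [1] 40
```

For a long factor with few levels this touches far fewer strings than `as.numeric(as.character(f))`, although both give the same result.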