A factor is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length. R provides both ordered and unordered factors.
Suppose, for example, we have a sample of 30 tax accountants from all the states and territories of Australia, and their individual state of origin is specified by a character vector of state mnemonics:
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
state
## [1] "tas" "sa" "qld" "nsw" "nsw" "nt" "wa" "wa" "qld" "vic" "nsw" "vic"
## [13] "qld" "qld" "sa" "tas" "sa" "nt" "wa" "vic" "qld" "nsw" "nsw" "wa"
## [25] "sa" "act" "nsw" "vic" "vic" "act"
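Note that a character vector keeps its elements in the order in which they were entered. It is the levels of a factor built from a character vector that are sorted in alphabetical order.
A factor is created with the factor() function: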
statef <- factor(state)
statef
## [1] tas sa qld nsw nsw nt wa wa qld vic nsw vic qld qld sa tas sa nt wa
## [20] vic qld nsw nsw wa sa act nsw vic vic act
## Levels: act nsw nt qld sa tas vic wa
The print() function handles factors slightly differently from other objects: the values are shown without quotes, and the levels are listed on a separate line.
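To see why printing differs, it can help to look at what a factor stores. The sketch below uses a short hypothetical vector, colour, which is not part of the example above. It suggests that a factor holds integer codes plus a set of levels, whereas a character vector holds the strings themselves.

```r
# Hypothetical data, for illustration only.
colour  <- c("red", "blue", "red", "green")
colourf <- factor(colour)

is.character(colour)   # plain strings, printed with quotes
is.factor(colourf)     # printed without quotes, followed by a Levels line

# Internally a factor is a vector of integer codes indexing into its levels.
codes <- as.integer(colourf)          # positions within levels(colourf)
restored <- levels(colourf)[codes]    # maps the codes back to the strings
```

Here restored should match the original colour vector, which shows that no information is lost when a character vector is converted to a factor.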
To find the levels of a factor, use the levels() function.
levels(statef)
## [1] "act" "nsw" "nt" "qld" "sa" "tas" "vic" "wa"
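The ordered factors mentioned at the start work the same way, except that the levels carry a ranking. A minimal sketch follows, using a hypothetical size vector rather than the state data. It assumes we want a level order other than the alphabetical default: the levels argument fixes the order explicitly, and ordered = TRUE makes comparisons between values meaningful.

```r
# Hypothetical data, for illustration only.
size <- c("medium", "small", "large", "small")

# Explicit level order instead of the default alphabetical one.
sizef <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)

size_levels <- levels(sizef)     # levels in the order given above
is_ranked   <- is.ordered(sizef) # TRUE for an ordered factor

# Element-wise comparison against one of the levels.
below_large <- sizef < "large"
```

For an unordered factor such as statef, a comparison like this is not meaningful, which is one reason to choose an ordered factor only when the categories have a natural ranking.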
To continue the previous example, suppose we have the incomes of the same tax accountants in another vector (in suitably large units of money):
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)
incomes
## [1] 60 49 40 61 64 60 59 54 62 69 70 42 56 61 61 61 58 51 48 65 49 49 41 48 52
## [26] 46 59 46 58 43
We can use the tapply() function to find the mean income for each state. It takes three arguments: the vector of values (incomes), the factor that defines the groups (statef), and the function to apply to each group (mean).
income_means <- tapply(incomes, statef, mean)
income_means
## act nsw nt qld sa tas vic wa
## 44.50000 57.33333 55.50000 53.60000 55.00000 60.50000 56.00000 52.25000
The result is a structure with the same length as the levels attribute of the factor, containing one result per level and labelled by the level names.
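The third argument to tapply() does not have to be mean. The sketch below repeats the data so that it runs on its own, then applies length and a small helper to each group. The helper std_error is a hypothetical name introduced here for illustration; it is not a base R function.

```r
# Data repeated from the example above so the block runs on its own.
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)
statef <- factor(state)

# Number of accountants sampled in each state.
group_sizes <- tapply(incomes, statef, length)

# std_error is a hypothetical helper: standard error of the mean.
std_error <- function(x) sqrt(var(x) / length(x))
income_se <- tapply(incomes, statef, std_error)

# Both results are labelled by the factor levels.
result_names <- names(group_sizes)
```

Each result has one entry per level of statef, in the same order as levels(statef).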
The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same, the indexing may be done implicitly and much more efficiently.
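A rough sketch of the distinction, using small hypothetical vectors (scores and group are not part of the example above): split() makes the ragged structure visible as a list of vectors with differing lengths. As one possible reading of the balanced case, equal-sized groups can be stored as the columns of a matrix, so that no labelling factor is needed.

```r
# Hypothetical data, for illustration only.
scores <- c(10, 12, 9, 14, 11, 13, 8)
group  <- factor(c("a", "b", "a", "c", "b", "a", "c"))

# Ragged case: one vector per level, with differing lengths.
by_group    <- split(scores, group)
group_sizes <- lengths(by_group)

# Balanced case (assumed layout): two values per group, one column per group.
balanced <- matrix(c(10, 9, 12, 11, 14, 8), nrow = 2,
                   dimnames = list(NULL, c("a", "b", "c")))
balanced_means <- colMeans(balanced)   # group means without a factor
```

In the balanced layout the group membership is implied by the column position, which is the sense in which the indexing is implicit.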