A factor is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length. R provides both ordered and unordered factors.

Example

Suppose we have a sample of 30 tax accountants from all the states and territories of Australia, and their individual state of origin is specified by a character vector of state mnemonics:

state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
state
##  [1] "tas" "sa"  "qld" "nsw" "nsw" "nt"  "wa"  "wa"  "qld" "vic" "nsw" "vic"
## [13] "qld" "qld" "sa"  "tas" "sa"  "nt"  "wa"  "vic" "qld" "nsw" "nsw" "wa" 
## [25] "sa"  "act" "nsw" "vic" "vic" "act"

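Note that a character vector keeps its elements in the order they were entered. It is the levels of a factor built from it that are sorted, which for character data means alphabetical order by default.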

Creation of Factors

A factor is created with the factor() function:

statef <- factor(state)
statef
##  [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa  tas sa  nt  wa 
## [20] vic qld nsw nsw wa  sa  act nsw vic vic act
## Levels: act nsw nt qld sa tas vic wa

The print() function handles factors slightly differently from other objects: the values are shown without quotes, and the levels are listed on a final line.

Levels of Factors

To find the levels of a factor, use the levels() function.

levels(statef)
## [1] "act" "nsw" "nt"  "qld" "sa"  "tas" "vic" "wa"
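As a small sketch of how the levels relate to the stored values, the following uses a shortened, hypothetical vector (not the full sample above) to show the integer codes behind a factor and the count per level. The name st is an assumption made only for this illustration; nlevels(), as.integer() and table() are base R functions.

```r
# Shortened illustrative vector; 'st' is a hypothetical name, not from the text above
st <- factor(c("tas", "sa", "qld", "nsw", "nsw", "qld"))

levels(st)      # distinct values in sorted order
nlevels(st)     # number of levels
as.integer(st)  # internal integer codes, each an index into levels(st)
table(st)       # number of observations in each level
```

Internally a factor stores these integer codes plus the levels attribute, which is why the levels can be inspected separately from the data.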

The function tapply() and ragged arrays

To continue the previous example, suppose we have the incomes of the same tax accountants in another vector (in suitably large units of money):

incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)
incomes
##  [1] 60 49 40 61 64 60 59 54 62 69 70 42 56 61 61 61 58 51 48 65 49 49 41 48 52
## [26] 46 59 46 58 43

We can use the tapply() function to find the mean income for each state. Here it takes three arguments: the data vector (incomes), the factor that defines the groups (statef), and the function to apply to each group (mean).

income_means <- tapply(incomes, statef, mean)
income_means
##      act      nsw       nt      qld       sa      tas      vic       wa 
## 44.50000 57.33333 55.50000 53.60000 55.00000 60.50000 56.00000 52.25000

The result is a structure with one entry per level of the factor, labelled by the level names, so it has the same length as the levels attribute.
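tapply() is not limited to built-in functions such as mean. The sketch below applies a user-defined function to each group in the same way. The helper std_error and the result name income_se are assumptions introduced for illustration, and the data are repeated from the text so that the block runs on its own.

```r
# Data repeated from the text so this block is self-contained
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)
statef <- factor(state)

# Hypothetical helper: standard error of the mean for one group
std_error <- function(x) sqrt(var(x) / length(x))

# One standard error per level, in the same order as levels(statef)
income_se <- tapply(incomes, statef, std_error)
round(income_se, 3)
```

Any function that reduces a vector to a single value can be passed as the third argument in this way.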

Ragged Array

The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same, the indexing may be done implicitly and much more efficiently, for instance by arranging the data as a matrix or array.
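One way to see the raggedness is to split the income vector by the factor: each level gets its own group, and the groups have different lengths. This is a minimal sketch that repeats the data from the text so it runs standalone; the name groups is an assumption, while split() and lengths() are base R functions.

```r
# Data repeated from the text so this block is self-contained
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
           "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
           "sa", "act", "nsw", "vic", "vic", "act")
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
             61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
             59, 46, 58, 43)
statef <- factor(state)

# List with one numeric vector per level of the factor
groups <- split(incomes, statef)

groups$nsw       # incomes of the accountants from nsw
lengths(groups)  # group sizes differ, which is what makes the array "ragged"
```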

Ordered Factors

  • The levels of a factor are stored in alphabetical order, or in the order given in the levels argument of factor() if they were specified explicitly.
  • Sometimes the levels have a natural ordering that we want to record and want our statistical analysis to make use of.
  • The ordered() function creates such ordered factors but is otherwise identical to factor().
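The points above can be sketched with a small example. The income bands below are hypothetical data introduced only for illustration, as are the names band and bandf; they are not part of the accountant sample.

```r
# Hypothetical income bands, used only for illustration
band <- c("low", "high", "medium", "low", "high")

# Default factor: levels are sorted alphabetically (high, low, medium)
levels(factor(band))

# Explicit levels record the natural order low < medium < high
bandf <- ordered(band, levels = c("low", "medium", "high"))
bandf

# Comparisons and max() respect the recorded order of an ordered factor
bandf < "high"
max(bandf)
```

Specifying the levels explicitly matters here: without the levels argument, ordered() would fall back to the alphabetical order, which does not match the intended ranking.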