Introduction

  • Factors form the basis for many of R’s powerful operations, including many of those performed on tabular data.

  • The motivation for factors comes from the notion of nominal, or categorical, variables in statistics. These values are nonnumerical in nature, corresponding to categories such as Democrat, Republican, and Unaffiliated, although they may be coded using numbers.

Factors and Levels

  • An R factor might be viewed simply as a vector with a bit more information added (though, as seen below, it’s different from this internally). That extra information consists of a record of the distinct values in that vector, called levels. Here’s an example:
x <- c(5,12,13,12)
xf <- factor(x)
xf
## [1] 5  12 13 12
## Levels: 5 12 13
  • The distinct values in xf—5, 12, and 13—are the levels here. Let’s take a look inside:
str(xf)
##  Factor w/ 3 levels "5","12","13": 1 2 3 2
unclass(xf)
## [1] 1 2 3 2
## attr(,"levels")
## [1] "5"  "12" "13"
attr(xf,"levels")
## [1] "5"  "12" "13"
  • This is revealing. The core of xf here is not (5,12,13,12) but rather (1,2,3,2). The latter means that our data consists first of a level-1 value, then level-2 and level-3 values, and finally another level-2 value. So the data has been recoded by level. The levels themselves are recorded too, of course, though as characters such as “5” rather than 5.
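  • The recoding can be checked directly. The following is a minimal sketch that reuses x and xf from above; the names codes, labels, and recovered are introduced here only for illustration. It extracts the integer codes and the character levels, then combines them to get the original numbers back.

```r
x <- c(5, 12, 13, 12)
xf <- factor(x)

codes <- as.integer(xf)   # position of each data value among the levels
labels <- levels(xf)      # the levels, stored as character strings

# Index the levels by the codes, then convert the strings back to numbers
recovered <- as.numeric(labels[codes])
recovered
```

  • Note that calling as.numeric(xf) by itself would return the codes, not the original values.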
  • The length of a factor is still defined in terms of the length of the data rather than, say, being a count of the number of levels:
length(xf)
## [1] 4
  • We can anticipate future new levels, as seen here:
x <- c(5,12,13,12)
xff <- factor(x,levels=c(5,12,13,88))
xff[2] <- 88
xff
## [1] 5  88 13 12
## Levels: 5 12 13 88
  • Originally, xff did not contain the value 88, but in defining it, we allowed for that future possibility. Later, we did indeed add the value. By the same token, you cannot sneak in an “illegal” level. If you try, R issues an “invalid factor level” warning and stores NA in that position:
xff[2] <- 28
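  • As a hedged sketch of one way around this, the level can be declared before the value is assigned. The variable wasNA is a name introduced here for illustration, and suppressWarnings() is used only to keep the example quiet.

```r
xff <- factor(c(5, 12, 13, 12), levels = c(5, 12, 13, 88))

# Assigning an undeclared value produces a warning and stores NA
suppressWarnings(xff[2] <- 28)
wasNA <- is.na(xff[2])
wasNA

# Declare the new level first; the same assignment then succeeds
levels(xff) <- c(levels(xff), "28")
xff[2] <- 28
as.character(xff)
```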

Common Functions Used with Factors

  • With factors, we have yet another member of the family of apply functions, tapply(). We’ll look at that function, as well as two other functions commonly used with factors: split() and by().

The tapply() Function

  • As motivation, suppose we have a vector x of ages of voters and a factor f showing some nonnumeric trait of those voters, such as party affiliation (Democrat, Republican, Unaffiliated). We might wish to find the mean ages in x within each of the party groups.

  • In typical usage, the call tapply(x,f,g) has x as a vector, f as a factor or list of factors, and g as a function. The function g() in our little example above would be R’s built-in mean() function. If we wanted to group by both party and another factor, say gender, we would need f to consist of the two factors, party and gender.

  • Each factor in f must have the same length as x. This makes sense in light of the voter example above; we should have as many party affiliations as ages. If a component of f is a vector, it will be coerced into a factor by applying as.factor() to it.

  • The operation performed by tapply() is to (temporarily) split x into groups, each group corresponding to a level of the factor (or a combination of levels of the factors in the case of multiple factors), and then apply g() to the resulting subvectors of x. Here’s a little example:

ages <- c(25,26,55,37,21,42)
affils <- c("R","D","D","R","U","D")
tapply(ages,affils,mean)
##  D  R  U 
## 41 31 21
  • The function tapply() treated the vector (“R”,“D”,“D”,“R”,“U”,“D”) as a factor with levels “D”, “R”, and “U”. It noted that “D” occurred in indices 2, 3, and 6; “R” occurred in indices 1 and 4; and “U” occurred in index 5. For convenience, let’s refer to the three index vectors (2,3,6), (1,4), and (5) as x, y, and z, respectively. Then tapply() computed mean(ages[x]), mean(ages[y]), and mean(ages[z]) and returned those means in a three-element vector. That vector’s element names are “D”, “R”, and “U”, reflecting the factor levels that were used by tapply().
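  • The steps just described can be reproduced by hand. This is only a sketch of the idea, not how tapply() is actually implemented; the names f, manual, and idx are introduced here for illustration.

```r
ages <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")
f <- as.factor(affils)

# For each level, find the matching indices and average those ages
manual <- sapply(levels(f), function(lev) {
  idx <- which(f == lev)
  mean(ages[idx])
})
manual

# The built-in call gives the same group means
tapply(ages, affils, mean)
```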

  • What if we have two or more factors? Then each factor yields a set of groups, as in the preceding example, and tapply() forms one group for each combination of levels across the factors.

  • Suppose that we have an economic data set that includes variables for gender, age, and income. Here, the call tapply(x,f,g) might have x as income and f as a pair of factors: one for gender and the other coding whether the person is older or younger than 25. We may be interested in finding mean income, broken down by gender and age. If we set g() to be mean(), tapply() will return the mean incomes in each of four subgroups:

    • Male and under 25 years old
    • Female and under 25 years old
    • Male and over 25 years old
    • Female and over 25 years old
  • Here’s a toy example of that setting:

d <- data.frame(list(
  gender=c("M","M","F","M","F","F"),
  age=c(47,59,21,32,33,24),
  income=c(55000,88000,32450,76500,123000,45650)))
d
d$over25 <- ifelse(d$age > 25,1,0)
d
tapply(d$income,list(d$gender,d$over25),mean)
##       0         1
## F 39050 123000.00
## M    NA  73166.67
  • We specified two factors, gender and an indicator variable for age over or under 25. Since each of these factors has two levels, tapply() partitioned the income data into four groups, one for each combination of gender and age, and then applied the mean() function to each group. The NA in the output marks the one empty group: this data contains no males aged 25 or under.
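  • The 0 and 1 column headings can be made easier to read by grouping on a labeled factor instead of a numeric indicator. This is a sketch under assumptions: the column name ageGroup and the labels “upTo25” and “over25” are hypothetical and do not appear in the original example.

```r
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650))

# Hypothetical labels; the level order is fixed explicitly
d$ageGroup <- factor(ifelse(d$age > 25, "over25", "upTo25"),
                     levels = c("upTo25", "over25"))

# Same grouping as before, but with descriptive column names
means <- tapply(d$income, list(d$gender, d$ageGroup), mean)
means
```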

The split() Function

  • In contrast to tapply(), which splits a vector into groups and then applies a specified function on each group, split() stops at that first stage, just forming the groups.
  • The basic form, without bells and whistles, is split(x,f), with x and f playing roles similar to those in the call tapply(x,f,g); that is, x being a vector or data frame and f being a factor or a list of factors. The action is to split x into groups, which are returned in a list. (Note that x is allowed to be a data frame with split() but not with tapply().)
  • Let’s try it out with our earlier example.
d
split(d$income,list(d$gender,d$over25))
## $F.0
## [1] 32450 45650
## 
## $M.0
## numeric(0)
## 
## $F.1
## [1] 123000
## 
## $M.1
## [1] 55000 88000 76500
  • The output of split() is a list, and recall that list components are denoted by dollar signs. So the last vector, for example, was named “M.1” to indicate that it was the result of combining “M” in the first factor and 1 in the second.
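  • Since split() also accepts a data frame, whole rows can be grouped rather than a single column. The following sketch rebuilds the toy data frame d so that it runs on its own; byGender and meanIncome are names chosen here for illustration.

```r
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650))

# Each list component is a data frame holding the rows for one gender
byGender <- split(d, d$gender)
names(byGender)
byGender$F

# Applying a function to each component mimics what tapply() does
meanIncome <- sapply(byGender, function(g) mean(g$income))
meanIncome
```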

The by() Function

  • Suppose in the abalone example we wish to do regression analyses of length against diameter separately for each gender code: males, females, and infants. At first, this seems like something tailor-made for tapply(), but the first argument of that function must be a vector, not a matrix or a data frame. The function to be applied can be multivariate (for example, range()), but the input must be a vector. Yet the input for regression is a matrix (or data frame) with at least two columns: one for the predicted variable and one or more for predictor variables. In our abalone data application, the matrix would consist of a column for the diameter data and a column for length.

  • The by() function can be used here. It works like tapply() (which it calls internally, in fact), but it is applied to objects such as data frames and matrices rather than only to vectors. Here’s how to use it for the desired regression analyses:

abaloneDataURL<- 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
aba <- read.csv(abaloneDataURL,header=FALSE,as.is=TRUE)
names(aba)[1:9] <- c("Gender","Length","Diameter","Height","WholeWt","ShuckedWt","ViscWt","ShellWt","Rings")
aba
by(aba,aba$Gender,function(m) lm(m[,2]~m[,3]))
## aba$Gender: F
## 
## Call:
## lm(formula = m[, 2] ~ m[, 3])
## 
## Coefficients:
## (Intercept)       m[, 3]  
##     0.04288      1.17918  
## 
## ------------------------------------------------------------ 
## aba$Gender: I
## 
## Call:
## lm(formula = m[, 2] ~ m[, 3])
## 
## Coefficients:
## (Intercept)       m[, 3]  
##     0.02997      1.21833  
## 
## ------------------------------------------------------------ 
## aba$Gender: M
## 
## Call:
## lm(formula = m[, 2] ~ m[, 3])
## 
## Coefficients:
## (Intercept)       m[, 3]  
##     0.03653      1.19480
  • Calls to by() look very similar to calls to tapply(), with the first argument specifying our data, the second the grouping factor, and the third the function to be applied to each group.

  • Just as tapply() forms groups of indices of a vector according to levels of a factor, this by() call finds groups of row numbers of the data frame aba. That creates three sub-data frames: one for each gender level of M, F, and I.

  • The anonymous function we defined regresses the second column of its matrix argument m against the third column. This function will be called three times, once for each of the three sub-data frames created earlier, thus producing the three regression analyses.
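  • The same pattern can be tried without downloading anything. The sketch below uses a tiny made-up data frame, toy, whose values are invented for illustration and are not abalone measurements. It also refers to columns by name in a formula, an alternative to the positional m[,2] and m[,3] that keeps the coefficient labels readable.

```r
# Made-up data: within each group, Length is an exact linear function of Diameter
toy <- data.frame(
  Gender = c("F", "F", "F", "M", "M", "M"),
  Length = c(1, 3, 5, 2, 3, 4),
  Diameter = c(0, 1, 2, 1, 2, 3))

# One regression per gender level, using column names in the formula
fits <- by(toy, toy$Gender, function(m) lm(Length ~ Diameter, data = m))

# Collect the coefficients: one column per group
coefs <- sapply(fits, coef)
coefs
```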

Working with Tables

  • To begin exploring R tables, consider this example:
u <- c(22,8,33,6,8,29,-2)
fl <- list(c(5,12,13,12,13,5,13),c("a","bc","a","a","bc","a","a"))
tapply(u,fl,length)
##    a bc
## 5  2 NA
## 12 1  1
## 13 2  1
  • Here, tapply() again temporarily breaks u into subvectors, as you saw earlier, and then applies the length() function to each subvector. (Note that this is independent of what’s in u. Our focus now is purely on the factors.) Those subvector lengths are the counts of the occurrences of each of the 3 × 2 = 6 combinations of the two factors. For instance, 5 occurred twice with “a” and not at all with “bc”; hence the entries 2 and NA in the first row of the output. In statistics, this is called a contingency table.
  • There is one problem in this example: the NA value. It really should be 0, meaning that in no cases did the first factor have level 5 and the second have level “bc”. The table() function creates contingency tables correctly.
table(fl)
##     fl.2
## fl.1 a bc
##   5  2  0
##   12 1  1
##   13 2  1
  • The first argument in a call to table() is either a factor or a list of factors. The two factors here were (5,12,13,12,13,5,13) and (“a”,“bc”,“a”,“a”,“bc”, “a”,“a”). In this case, an object that is interpretable as a factor is counted as one.
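  • To see that the two approaches differ only in the empty cell, the NA from tapply() can be replaced by 0 and the result compared with table(). This is a sketch; counts, tab, and same are names introduced here.

```r
u <- c(22, 8, 33, 6, 8, 29, -2)
fl <- list(c(5, 12, 13, 12, 13, 5, 13),
           c("a", "bc", "a", "a", "bc", "a", "a"))

counts <- tapply(u, fl, length)
counts[is.na(counts)] <- 0   # empty cells become zero counts

tab <- table(fl)

# Compare the cell counts, ignoring the differing dimension labels
same <- all(as.vector(counts) == as.vector(tab))
same
```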

  • Typically a data frame serves as the table() data argument. Suppose, for instance, that the file ct.txt consists of election-polling data, in which candidate X is running for reelection. We read the file into a data frame and display it:

ct <- read.table("ct.txt",header=T)
ct
  • We can use the table() function to compute the contingency table for this data:
cttab <- table(ct)
cttab
##           Voted.For.X.Last.Time
## Vote.for.X No Yes
##   No        2   0
##   Not Sure  0   1
##   Yes       1   1
  • The 2 in the upper-left corner of the table shows that we had, for example, two people who said “no” to the first and second questions. The 1 in the middle-right indicates that one person answered “not sure” to the first question and “yes” to the second question.

  • We can also get one-dimensional counts, which are counts on a single factor, as follows:

table(c(5,12,13,12,8,5))
## 
##  5  8 12 13 
##  2  1  2  1
  • Here’s an example of a three-dimensional table, involving voters’ genders, race (white, black, Asian, and other), and political views (liberal or conservative):
v<-data.frame(
  gender=c("M","M","F","M","F","F"),
  race=c("W","W","A","O","B","B"),
  pol=c("L","L","C","L","L","C"))
vt<-table(v)
vt
## , , pol = C
## 
##       race
## gender A B O W
##      F 1 1 0 0
##      M 0 0 0 0
## 
## , , pol = L
## 
##       race
## gender A B O W
##      F 0 1 0 0
##      M 0 0 1 2
  • R prints out a three-dimensional table as a series of two-dimensional tables. In this case, it generates a table of gender and race for conservatives and then a corresponding table for liberals. For example, the second two-dimensional table says that there were two white male liberals.
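  • Individual cells of a three-dimensional table can be read with three subscripts, and a dimension can be summed out with apply(). The sketch below rebuilds v and vt so that it runs on its own; genderByPol is a name introduced here.

```r
v <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  race = c("W", "W", "A", "O", "B", "B"),
  pol = c("L", "L", "C", "L", "L", "C"))
vt <- table(v)

# Number of white male liberals
vt["M", "W", "L"]

# Sum over race (dimension 2), keeping gender and political view
genderByPol <- apply(vt, c(1, 3), sum)
genderByPol
```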

Matrix/Array-Like Operations on Tables

  • Just as most (nonmathematical) matrix/array operations can be used on data frames, they can be applied to tables, too. (This is not surprising, given that the cell counts portion of a table object is an array.)

  • For example, we can access the table cell counts using matrix notation. Let’s apply this to our voting example from the previous section.

class(cttab)
## [1] "table"
cttab[1,1]
## [1] 2
cttab[1,]
##  No Yes 
##   2   0
  • In the second command, even though the first command had shown that cttab had class “table”, we treated it as a matrix and printed out its “[1,1] element.” Continuing this idea, the third command printed the first row of this “matrix.”

  • We can multiply the matrix by a scalar. For instance, here’s how to change cell counts to proportions:

ctt<-cttab
ctt/5
##           Voted.For.X.Last.Time
## Vote.for.X  No Yes
##   No       0.4 0.0
##   Not Sure 0.0 0.2
##   Yes      0.2 0.2
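  • Dividing by a hard-coded 5 works only because the table has five observations. A sketch that avoids the constant follows. Since the contents of the polling file are not shown, the data frame ct below is a hypothetical reconstruction chosen to match the printed counts; the row order is an assumption.

```r
# Hypothetical reconstruction of the polling data (row order assumed)
ct <- data.frame(
  Vote.for.X = c("No", "No", "Not Sure", "Yes", "Yes"),
  Voted.For.X.Last.Time = c("No", "No", "Yes", "No", "Yes"))
cttab <- table(ct)

# Divide by the total count instead of a literal 5
props <- cttab / sum(cttab)
props

# prop.table() performs the same computation
prop.table(cttab)
```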
  • In statistics, the marginal values of a variable are those obtained when this variable is held constant while others are summed. In the voting example, the marginal values of the Vote.for.X variable are 2 + 0 = 2, 0 + 1 = 1, and 1 + 1 = 2. We can of course obtain these via the matrix apply() function:
apply(ctt,1,sum)
##       No Not Sure      Yes 
##        2        1        2
  • Note that the labels here, such as No, came from the row names of the matrix, which table() produced. R also supplies a function, addmargins(), for exactly this purpose: finding marginal totals. Here’s an example:
addmargins(cttab)
##           Voted.For.X.Last.Time
## Vote.for.X No Yes Sum
##   No        2   0   2
##   Not Sure  0   1   1
##   Yes       1   1   2
##   Sum       3   2   5
  • Here, we got the marginal data for both dimensions at once, conveniently superimposed onto the original table. We can get the names of the dimensions and levels through dimnames(), as follows:
dimnames(cttab)
## $Vote.for.X
## [1] "No"       "Not Sure" "Yes"     
## 
## $Voted.For.X.Last.Time
## [1] "No"  "Yes"
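  • The names returned by dimnames() can also serve as subscripts, and rowSums() and colSums() give the marginal totals separately. As a sketch, the data frame ct below is a hypothetical reconstruction consistent with the counts shown earlier, since the original file contents are not listed.

```r
# Hypothetical reconstruction of the polling data (row order assumed)
ct <- data.frame(
  Vote.for.X = c("No", "No", "Not Sure", "Yes", "Yes"),
  Voted.For.X.Last.Time = c("No", "No", "Yes", "No", "Yes"))
cttab <- table(ct)

# Subscript by level name instead of by position
cttab["Not Sure", "Yes"]

# Marginal totals for each dimension
rowTotals <- rowSums(cttab)
colTotals <- colSums(cttab)
rowTotals
colTotals
```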