Factors and Levels
x <- c(7,22,13,22)
xf <- factor(x)
xf
## [1] 7 22 13 22
## Levels: 7 13 22
str(xf)
## Factor w/ 3 levels "7","13","22": 1 3 2 3
unclass(xf)
## [1] 1 3 2 3
## attr(,"levels")
## [1] "7" "13" "22"
length(xf)
## [1] 4
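The str() and unclass() output above suggests that a factor stores integer codes that index into its vector of levels. A minimal sketch of that relationship, using only base R (the names codes and lev are illustrative, not part of the original notes):

```r
x <- c(7, 22, 13, 22)
xf <- factor(x)

codes <- as.integer(xf)   # the integer codes stored in the factor
lev <- levels(xf)         # the levels, stored as character strings

# Indexing the levels by the codes recovers the original values as strings
lev[codes]

# To get the numbers back, convert the labels rather than the codes
as.numeric(as.character(xf))
```

Calling as.numeric(xf) directly would return the codes, not the original values, which is why the conversion goes through as.character() first.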
x <- c(7,22,13,22)
xff <- factor(x,levels=c(7,13,22,999))
xff
## [1] 7 22 13 22
## Levels: 7 13 22 999
xff[2] <- 999 # assign to the second element of xff; allowed because 999 is a declared level
xff
## [1] 7 999 13 22
## Levels: 7 13 22 999
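Assignment into a factor is restricted to its declared levels. The sketch below contrasts a declared level (999) with an undeclared one (88, a value chosen here purely for illustration); in base R the second assignment is expected to raise a warning and store NA, so the warning is suppressed to keep the example quiet:

```r
x <- c(7, 22, 13, 22)
xff <- factor(x, levels = c(7, 13, 22, 999))

xff[2] <- 999                     # fine: 999 is one of the declared levels
suppressWarnings(xff[3] <- 88)    # 88 is not a level, so NA is stored instead

xff
is.na(xff[3])
levels(xff)                       # the set of levels itself is unchanged
```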
Common Functions Used with Factors
ages <- c(35,23,65,27,22,41)
affils <- c("D","R","R","D","D","U")
tapply(ages,affils,mean)
## D R U
## 28 44 41
Note:
- The function tapply() treated the vector (“D”,“R”,“R”,“D”,“D”,“U”) as a factor with levels “D”, “R”, and “U”.
- It noted that “D” occurred at indices 1, 4, and 5; “R” occurred at indices 2 and 3; and “U” occurred at index 6. For convenience, let’s refer to the three index vectors (1,4,5), (2,3), and (6) as x, y, and z, respectively.
- Then tapply() computed mean(ages[x]), mean(ages[y]), and mean(ages[z]) and returned those means in a three-element vector.
- That vector’s element names are “D”, “R”, and “U”, reflecting the factor levels that tapply() used.
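The grouping that tapply() performs can be reproduced by hand. This is only a sketch of the idea, not how tapply() is implemented internally; idx and manual are illustrative names:

```r
ages <- c(35, 23, 65, 27, 22, 41)
affils <- c("D", "R", "R", "D", "D", "U")

# One index vector per level: the positions at which each level occurs
idx <- split(seq_along(ages), affils)
idx

# Apply mean() to the ages picked out by each index vector
manual <- sapply(idx, function(i) mean(ages[i]))
manual

# For comparison, the one-step version
tapply(ages, affils, mean)
```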
NOTE:
- When two or more factors are available, tapply() can break a statistic down by every combination of their levels, for example mean income by gender and age group.
- In the call tapply(x,f,g), x is the vector, f is a factor or a list of factors, and g is the function to apply. If we set g to mean() and pass gender and an over-30 indicator as the factors, tapply() returns the mean income in each of four subgroups:
  • Female, 30 or younger
  • Male, 30 or younger
  • Female, over 30
  • Male, over 30
d <- data.frame(list(gender=c("F","F","M","F","M","F"),age=c(72,29,27,33,31,29),income=c(165000,188000,132450,276500,193000,145650)))
d
d$over30 <- ifelse(d$age > 30,1,0)
d
tapply(d$income,list(d$gender,d$over30),mean)
## 0 1
## F 166825 220750
## M 132450 193000
The split() Function
NOTE:
- In contrast to tapply(), which splits a vector into groups and then applies a specified function to each group, split() stops at the first stage and only forms the groups.
- Basic form: split(x,f), with x and f playing roles similar to those in the call tapply(x,f,g): x is a vector or data frame, and f is a factor or a list of factors.
- The action is to split x into groups, which are returned in a list.
- Note that x is allowed to be a data frame with split() but not with tapply().
split(d$income,list(d$gender,d$over30))
## $F.0
## [1] 188000 145650
##
## $M.0
## [1] 132450
##
## $F.1
## [1] 165000 276500
##
## $M.1
## [1] 193000
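Because split() also accepts a data frame, whole rows can be grouped rather than a single column. The sketch below rebuilds the same data frame so that it runs on its own; by_gender and group_means are illustrative names:

```r
d <- data.frame(
  gender = c("F", "F", "M", "F", "M", "F"),
  age = c(72, 29, 27, 33, 31, 29),
  income = c(165000, 188000, 132450, 276500, 193000, 145650)
)
d$over30 <- ifelse(d$age > 30, 1, 0)

# Split whole rows by gender; each list element is itself a data frame
by_gender <- split(d, d$gender)
names(by_gender)
by_gender$M

# Applying mean() to each group afterwards gives the same values
# that tapply() computes in a single step
group_means <- sapply(split(d$income, list(d$gender, d$over30)), mean)
group_means
```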
g <- c("I","I","M","F","M","M","F")
split(1:7,g)
## $F
## [1] 4 7
##
## $I
## [1] 1 2
##
## $M
## [1] 3 5 6
u <- c(222,88,333,66,88,929,-20)
fl <- list(c(50,120,130,120,130,50,130),c("a","bc","a","a","bc","a","a"))
tapply(u,fl,length)
## a bc
## 50 2 NA
## 120 1 1
## 130 2 1
NOTE:
- tapply() temporarily breaks ‘u’ into subvectors, then applies the ‘length()’ function to each subvector. (Note that this is independent of what’s in u. Our focus now is purely on the factors.)
- Those subvector lengths are the counts of the occurrences of each of the 3 × 2 = 6 combinations of the two factors.
- For instance, 50 occurred twice with “a” and not at all with “bc”; hence the entries 2 and NA in the first row of the output.
- In statistics, this is called a contingency table.
- There is one problem in this example: the NA value. It really should be 0, meaning that in no cases did the first factor have level 50 and the second have level “bc”. The table() function creates contingency tables correctly.
table(fl)
## fl.2
## fl.1 a bc
## 50 2 0
## 120 1 1
## 130 2 1
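To check that the two approaches agree apart from the empty cell, the NA in the tapply() result can be replaced with 0 and compared against table(). A minimal sketch, with counts and tab as illustrative names:

```r
u <- c(222, 88, 333, 66, 88, 929, -20)
fl <- list(c(50, 120, 130, 120, 130, 50, 130),
           c("a", "bc", "a", "a", "bc", "a", "a"))

counts <- tapply(u, fl, length)
counts[is.na(counts)] <- 0    # an empty cell becomes 0 instead of NA
counts

# table() produces the same counts without the extra step
tab <- table(fl)
all(as.vector(counts) == as.vector(tab))
```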