Run a chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter. Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.

Factors and Levels

  1. ‘Factors’ explained, with their distinguishing characteristic demonstrated: a factor is a vector that also keeps a record of the distinct values (levels) in that vector;
x <- c(7,22,13,22)
xf <- factor(x)
xf
## [1] 7  22 13 22
## Levels: 7 13 22
str(xf)
##  Factor w/ 3 levels "7","13","22": 1 3 2 3
unclass(xf)
## [1] 1 3 2 3
## attr(,"levels")
## [1] "7"  "13" "22"
length(xf)
## [1] 4
  1. New levels can be added later, but they must be declared first:
x <- c(7,22,13,22)
xff <- factor(x,levels=c(7,13,22,999))
xff
## [1] 7  22 13 22
## Levels: 7 13 22 999
xff[2] <- 999 # [2] selects the second element of xff
xff
## [1] 7   999 13  22
## Levels: 7 13 22 999
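
To see why the level has to be declared first, the sketch below assigns 999 to a factor that does not list it as a level, then repeats the assignment after extending the levels. The names xf_bad and xf_ok are made up for this illustration, and suppressWarnings() is used only to keep the output quiet.

```r
x <- c(7, 22, 13, 22)
xf <- factor(x)                       # levels are "7", "13", "22"

# Assigning a value that is not a declared level is not an error:
# R warns about an invalid factor level and stores NA instead.
xf_bad <- xf
suppressWarnings(xf_bad[2] <- 999)
is.na(xf_bad[2])                      # TRUE

# Declaring the level first makes the same assignment valid.
xf_ok <- xf
levels(xf_ok) <- c(levels(xf_ok), "999")
xf_ok[2] <- 999
as.character(xf_ok)                   # "7" "999" "13" "22"
```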

Common Functions Used with Factors

  1. ‘tapply(x, f, g)’ => (temporarily) splits the vector x into groups, each group corresponding to a level of the factor f (or to a combination of levels when f is a list of several factors), and then applies the function g() to the resulting subvectors of x.
ages <- c(35,23,65,27,22,41)
affils <- c("D","R","R","D","D","U")
tapply(ages,affils,mean)
##  D  R  U 
## 28 44 41

Note:
- The function tapply() treated the vector (“D”,“R”,“R”,“D”,“D”,“U”) as a factor with levels “D”, “R”, and “U”.
- It noted that “D” occurred in indices 1, 4, and 5; “R” occurred in indices 2 and 3; and “U” occurred in index 6. For convenience, let’s refer to the three index vectors (1,4,5), (2,3), and (6) as x, y, and z, respectively.
- Then tapply() computed mean(ages[x]), mean(ages[y]), and mean(ages[z]) and returned those means in a three-element vector.
- That vector’s element names are “D”, “R”, and “U”, reflecting the factor levels that were used by tapply().
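
The two stages described in this note can be reproduced by hand. The following is only a sketch of the idea, not how tapply() is implemented internally; idx and by_hand are names chosen for this illustration.

```r
ages <- c(35, 23, 65, 27, 22, 41)
affils <- c("D", "R", "R", "D", "D", "U")

# Stage 1: collect the index vector for each level of the factor.
idx <- split(seq_along(ages), affils)   # $D: 1 4 5, $R: 2 3, $U: 6

# Stage 2: apply mean() to the subvector picked out by each index vector.
by_hand <- sapply(idx, function(i) mean(ages[i]))
by_hand                                 # D = 28, R = 44, U = 41

# The values agree with what tapply() returns.
all.equal(as.vector(tapply(ages, affils, mean)), unname(by_hand))
```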


NOTE:
- If two or more factors are available, tapply() can group by their combinations. For example, we may be interested in finding mean income, broken down by gender and age group.
- If we set g() to be mean(), tapply() will return the mean income in each of four subgroups:
  • Male and 30 years old or younger
  • Female and 30 years old or younger
  • Male and over 30 years old
  • Female and over 30 years old

  1. Create a data frame from three vectors: gender, age, and income
d <- data.frame(list(gender=c("F","F","M","F","M","F"),age=c(72,29,27,33,31,29),income=c(165000,188000,132450,276500,193000,145650)))
d
  1. Add a column over30 that flags ages above 30 (1 = TRUE, 0 = FALSE)
d$over30 <- ifelse(d$age > 30,1,0)
d
  1. The two factors are gender and the over-30 indicator. tapply() partitions the income data into four groups, one for each combination of gender and age group, and then applies the mean() function to each group.
tapply(d$income,list(d$gender,d$over30),mean)
##        0      1
## F 166825 220750
## M 132450 193000
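
The 0/1 column headers are easy to misread. One possible variation, sketched here, stores the age indicator as a factor with descriptive labels so the output columns are self-explanatory. The column name age_group and its labels are made up for this illustration.

```r
d <- data.frame(
  gender = c("F", "F", "M", "F", "M", "F"),
  age    = c(72, 29, 27, 33, 31, 29),
  income = c(165000, 188000, 132450, 276500, 193000, 145650)
)

# Hypothetical labelled version of the over-30 indicator.
d$age_group <- factor(d$age > 30,
                      levels = c(FALSE, TRUE),
                      labels = c("30 or under", "over 30"))

# Same grouping as before, but the columns now carry readable names.
means <- tapply(d$income, list(d$gender, d$age_group), mean)
means["F", "over 30"]      # 220750
means["M", "30 or under"]  # 132450
```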

The split() Function

NOTE:
- In contrast to tapply(), which splits a vector into groups and then applies a specified function to each group, split() stops at that first stage, just forming the groups.
- Basic form: split(x,f), with x and f playing roles similar to those in the call tapply(x,f,g); x is a vector or data frame, and f is a factor or a list of factors.
- The action is to split x into groups, which are returned in a list.
- Note! x is allowed to be a data frame with split() but not with tapply().
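
As a small illustration of the last point, the sketch below splits a whole data frame by one factor. It rebuilds the same gender/age/income data used in this section; by_gender is a name chosen for this example.

```r
d <- data.frame(
  gender = c("F", "F", "M", "F", "M", "F"),
  age    = c(72, 29, 27, 33, 31, 29),
  income = c(165000, 188000, 132450, 276500, 193000, 145650)
)

# split() accepts the data frame itself and returns a list of data frames,
# one per level of the grouping factor.
by_gender <- split(d, d$gender)
names(by_gender)       # "F" "M"
nrow(by_gender$F)      # 4
nrow(by_gender$M)      # 2

# Each component keeps all columns, so later steps can use any of them.
sapply(by_gender, function(g) mean(g$income))
```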

  1. The output of split() is a list, and recall that list components are denoted by dollar signs. The last vector, for example, is named “M.1” to indicate that it results from combining “M” in the first factor with 1 in the second.
split(d$income,list(d$gender,d$over30))
## $F.0
## [1] 188000 145650
## 
## $M.0
## [1] 132450
## 
## $F.1
## [1] 165000 276500
## 
## $M.1
## [1] 193000
  1. Determine the indices of the vector elements corresponding to male (“M”), female (“F”), and infant (“I”)
g <- c("I","I","M","F","M","M","F")
split(1:7,g)
## $F
## [1] 4 7
## 
## $I
## [1] 1 2
## 
## $M
## [1] 3 5 6
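
As a cross-check, the index vector for a single level can also be found with which(). This sketch confirms that both routes agree; idx is a name chosen for this example.

```r
g <- c("I", "I", "M", "F", "M", "M", "F")

# All index vectors at once, one list component per level.
idx <- split(seq_along(g), g)

# One level at a time with which().
which(g == "M")                      # 3 5 6
identical(idx$M, which(g == "M"))    # TRUE

# The indices recover exactly the elements of that level.
g[idx$F]                             # "F" "F"
```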
  1. Working with tables: count how often each combination of two factors occurs, using tapply() with length()
u <- c(222,88,333,66,88,929,-20)
fl <- list(c(50,120,130,120,130,50,130),c("a","bc","a","a","bc","a","a"))
tapply(u,fl,length)
##     a bc
## 50  2 NA
## 120 1  1
## 130 2  1

NOTE:
- tapply() temporarily breaks ‘u’ into subvectors, then applies the ‘length()’ function to each subvector.
- (Note that this is independent of what’s in u. Our focus now is purely on the factors.)
- Those subvector lengths are the counts of the occurrences of each of the 3 × 2 = 6 combinations of the two factors.
- For instance, 50 occurred twice with “a” and not at all with “bc”; hence the entries 2 and NA in the first row of the output.
- In statistics, this is called a contingency table.
- There is one problem in this example: the NA value. It really should be 0, meaning that in no cases did the first factor have level 50 and the second have level “bc”. The table() function creates contingency tables correctly.

  1. Use table() to get 0 instead of NA for combinations that never occur
table(fl)
##      fl.2
## fl.1  a bc
##   50  2  0
##   120 1  1
##   130 2  1
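
The sketch below shows how individual cells of the table can be read by level name, and checks that patching the NA in the tapply() result by hand gives the same counts as table(). The names counts and tl are chosen for this illustration.

```r
u  <- c(222, 88, 333, 66, 88, 929, -20)
fl <- list(c(50, 120, 130, 120, 130, 50, 130),
           c("a", "bc", "a", "a", "bc", "a", "a"))

# table() counts every combination, including the ones that never occur.
counts <- table(fl)
counts["50", "bc"]     # 0
counts["130", "a"]     # 2

# The tapply() version needs its NA replaced by 0 to match.
tl <- tapply(u, fl, length)
tl[is.na(tl)] <- 0L
all(as.vector(tl) == as.vector(counts))   # TRUE
```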