An R factor can be viewed as a vector that also stores a record of the distinct values it contains, called levels.
x<-c(1,2,2,4,5)
xf<-factor(x)
xf
## [1] 1 2 2 4 5
## Levels: 1 2 4 5
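To see what "a record of the distinct values" means in practice, the short sketch below inspects the same factor with base R's levels() and as.integer(). It assumes nothing beyond the x and xf defined above, which are recreated so the snippet runs on its own.

```r
x <- c(1, 2, 2, 4, 5)
xf <- factor(x)

lv <- levels(xf)        # the distinct values, stored as character strings
codes <- as.integer(xf) # integer codes: the position of each element in the levels

lv
codes
lv[codes]               # indexing the levels by the codes recovers the data (as strings)
```

Note that the codes are positions in the levels, not the original numbers: the value 4 is stored as code 3 because "4" is the third level.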
The length of a factor is the length of the data, not the number of levels.
length(xf)
## [1] 5
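If the number of levels is what you want, base R provides nlevels(). A minimal comparison, again recreating xf so the snippet is self-contained:

```r
x <- c(1, 2, 2, 4, 5)
xf <- factor(x)

n_data <- length(xf)     # number of data elements
n_levels <- nlevels(xf)  # number of distinct levels

n_data
n_levels
```

Here length() counts the repeated 2 twice, while nlevels() counts it once.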
library(dslabs) # Let's use a dataset from dslabs library
head(murders,15) # The dataset is named murders; show its first 15 rows
##                   state abb        region population total
## 1               Alabama  AL         South    4779736   135
## 2                Alaska  AK          West     710231    19
## 3               Arizona  AZ          West    6392017   232
## 4              Arkansas  AR         South    2915918    93
## 5            California  CA          West   37253956  1257
## 6              Colorado  CO          West    5029196    65
## 7           Connecticut  CT     Northeast    3574097    97
## 8              Delaware  DE         South     897934    38
## 9  District of Columbia  DC         South     601723    99
## 10              Florida  FL         South   19687653   669
## 11              Georgia  GA         South    9920000   376
## 12               Hawaii  HI          West    1360301     7
## 13                Idaho  ID          West    1567582    12
## 14             Illinois  IL North Central   12830632   364
## 15              Indiana  IN North Central    6483802   142
class(murders$region) # The class of region is factor
## [1] "factor"
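Factors are commonly used with tapply(), which applies a function to groups of a vector. Its general form is
tapply(x,f,g)
where x is a vector, f is a factor (or a list of factors) of the same length, and g is the function applied to each group of x defined by the levels of f.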
tapply(murders$total,murders$region,mean)
##     Northeast         South North Central          West
##      163.2222      246.7647      152.3333      147.0000
ages<-c(18,23,23,56,56,24,26,26)
party<-c("BJP","BJP","TDP","BJP","YSR","TDP","YSR","TDP")
tapply(ages,party,mean)
## BJP TDP YSR
## 32.33333 24.33333 41.00000
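Note that party above is a plain character vector; tapply() converts it to a factor internally. As a sanity check, one group mean can be computed by hand with logical indexing and compared with the tapply() result. The variable names by_hand and by_tapply are only illustrative.

```r
ages <- c(18, 23, 23, 56, 56, 24, 26, 26)
party <- c("BJP", "BJP", "TDP", "BJP", "YSR", "TDP", "YSR", "TDP")

# Mean age of the BJP group, selected manually
by_hand <- mean(ages[party == "BJP"])

# The same group mean taken from the tapply() result
by_tapply <- tapply(ages, party, mean)

by_hand
by_tapply[["BJP"]]
```

Both expressions average the ages 18, 23 and 56, so they should agree.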
tapply() can also group by two factors at once, supplied as a list.
d<-data.frame(list(gender=c("M","M","F","M","F","F"),age=c(47,59,21,32,33,24),income=c(55000,88000,32450,76500,123000,45650)))
d
## gender age income
## 1 M 47 55000
## 2 M 59 88000
## 3 F 21 32450
## 4 M 32 76500
## 5 F 33 123000
## 6 F 24 45650
Add another column, over25, set to 1 for people older than 25 and 0 otherwise.
d$over25<-ifelse(d$age>25,1,0)
d
## gender age income over25
## 1 M 47 55000 1
## 2 M 59 88000 1
## 3 F 21 32450 0
## 4 M 32 76500 1
## 5 F 33 123000 1
## 6 F 24 45650 0
tapply(d$income,list(d$gender,d$over25),mean) # tapply is applied using 2 factors-gender and over25.
## 0 1
## F 39050 123000.00
## M NA 73166.67
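The NA in the M/0 cell appears because no row in d is male with over25 equal to 0, so there is nothing to average. One way to confirm this, sketched here with base R's table(), is to count the rows in each combination; the data frame is rebuilt so the snippet runs on its own.

```r
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)
d$over25 <- ifelse(d$age > 25, 1, 0)

# Number of rows in each gender/over25 combination
counts <- table(d$gender, d$over25)
counts
```

A zero count in a cell of this table corresponds to an NA in the matching cell of the tapply() result.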
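split(x,f)
where x is a vector and f is a factor (or a list of factors). Unlike tapply(), split() only forms the groups and returns them as a list; it does not apply a function to them. Let's consider the same dataset again.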
d # consider a dataset
## gender age income over25
## 1 M 47 55000 1
## 2 M 59 88000 1
## 3 F 21 32450 0
## 4 M 32 76500 1
## 5 F 33 123000 1
## 6 F 24 45650 0
split(d$income,list(d$gender,d$over25))
## $F.0
## [1] 32450 45650
##
## $M.0
## numeric(0)
##
## $F.1
## [1] 123000
##
## $M.1
## [1] 55000 88000 76500
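Because split() only forms the groups, a function can be applied afterwards, for example with sapply(). The sketch below also uses the drop argument of split() to leave out combinations that have no data. The names groups, means and kept are illustrative, and d is rebuilt so the snippet is self-contained.

```r
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)
d$over25 <- ifelse(d$age > 25, 1, 0)

# Group incomes by gender and over25, then take the mean of each group
groups <- split(d$income, list(d$gender, d$over25))
means <- sapply(groups, mean)
means  # the empty M.0 group yields NaN, since mean(numeric(0)) is NaN

# drop = TRUE omits combinations that do not occur in the data
kept <- split(d$income, list(d$gender, d$over25), drop = TRUE)
names(kept)
```

The non-empty group means match the tapply() result shown earlier; the difference is that the empty group shows up as NaN here rather than NA.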