Factors

An R factor can be viewed as a vector that also stores a record of its distinct values, called levels.

1. Creating a factor from a vector

x<-c(1,2,2,4,5)
xf<-factor(x)
xf

## [1] 1 2 2 4 5
## Levels: 1 2 4 5

The length of a factor is the length of the data, not the number of levels.

length(xf)

## [1] 5
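
To make the distinction between data and levels concrete, the sketch below inspects the same factor with the base R functions levels(), nlevels() and as.integer(). It also illustrates a common pitfall: converting a factor directly to numbers returns the internal level codes, not the original values. The names codes and original are only illustrative.

```r
x <- c(1, 2, 2, 4, 5)
xf <- factor(x)

levels(xf)    # distinct values, stored as character strings: "1" "2" "4" "5"
nlevels(xf)   # number of levels: 4
length(xf)    # number of data elements: 5

# Internally, each element is an integer code that indexes into levels(xf)
codes <- as.integer(xf)                    # 1 2 2 3 4

# To recover the original numbers, convert via the level labels
original <- as.numeric(as.character(xf))   # 1 2 2 4 5
```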

Let’s look at another example.

library(dslabs) # Let's use a dataset from dslabs library

head(murders,15) # The dataset is named murders; show its first 15 rows

##                   state abb        region population total
## 1               Alabama  AL         South    4779736   135
## 2                Alaska  AK          West     710231    19
## 3               Arizona  AZ          West    6392017   232
## 4              Arkansas  AR         South    2915918    93
## 5            California  CA          West   37253956  1257
## 6              Colorado  CO          West    5029196    65
## 7           Connecticut  CT     Northeast    3574097    97
## 8              Delaware  DE         South     897934    38
## 9  District of Columbia  DC         South     601723    99
## 10              Florida  FL         South   19687653   669
## 11              Georgia  GA         South    9920000   376
## 12               Hawaii  HI          West    1360301     7
## 13                Idaho  ID          West    1567582    12
## 14             Illinois  IL North Central   12830632   364
## 15              Indiana  IN North Central    6483802   142

class(murders$region) # The class of region is factor

## [1] "factor"

2. Syntax of tapply - Applying a function to groups defined by a factor

tapply(x,f,g)

where x is a vector, f is a factor (or a list of factors) of the same length as x, and g is the function applied to each group of x defined by f.

Que: Find the average number of murders in each region.

tapply(murders$total,murders$region,mean)

##     Northeast         South North Central          West 
##      163.2222      246.7647      152.3333      147.0000

Que: Find the average age of the people voting for each party.

ages<-c(18,23,23,56,56,24,26,26)
party<-c("BJP","BJP","TDP","BJP","YSR","TDP","YSR","TDP")
tapply(ages,party,mean)

##      BJP      TDP      YSR 
## 32.33333 24.33333 41.00000
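
Note that party here is a plain character vector; tapply converts its grouping argument to a factor internally. As a rough sketch of what tapply does in this example, the same group means can be computed by hand by subsetting ages for each level. The names party_f and manual are illustrative and not part of the original example.

```r
ages  <- c(18, 23, 23, 56, 56, 24, 26, 26)
party <- c("BJP", "BJP", "TDP", "BJP", "YSR", "TDP", "YSR", "TDP")

# tapply treats the character vector as a factor; levels are sorted alphabetically
party_f <- as.factor(party)
levels(party_f)

# Manual equivalent: for each level, average the ages that belong to it
manual <- sapply(levels(party_f), function(lv) mean(ages[party_f == lv]))

# Compare with tapply (as.vector drops names and dimensions before comparing)
all.equal(as.vector(tapply(ages, party, mean)), as.vector(manual))
```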

3. Dealing with 2 factors at the same time

Que: Create a dataframe with gender, age and income of 6 people, then find the average income of:

- Males over 25
- Females over 25
- Males aged 25 or under
- Females aged 25 or under

This shows how to handle data grouped by 2 factors.

d<-data.frame(gender=c("M","M","F","M","F","F"),
              age=c(47,59,21,32,33,24),
              income=c(55000,88000,32450,76500,123000,45650))
d

##   gender age income
## 1      M  47  55000
## 2      M  59  88000
## 3      F  21  32450
## 4      M  32  76500
## 5      F  33 123000
## 6      F  24  45650

Add another column, over25, that is 1 for people older than 25 and 0 otherwise.

d$over25<-ifelse(d$age>25,1,0) 
d

##   gender age income over25
## 1      M  47  55000      1
## 2      M  59  88000      1
## 3      F  21  32450      0
## 4      M  32  76500      1
## 5      F  33 123000      1
## 6      F  24  45650      0

tapply(d$income,list(d$gender,d$over25),mean) # tapply is applied using 2 factors-gender and over25.

##       0         1
## F 39050 123000.00
## M    NA  73166.67
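
The NA in the M/0 cell appears because the data contains no males aged 25 or under, so there is nothing to average. One way to see this, sketched below with base R's table() and with tapply() using length, is to count the rows in each gender/over25 combination. The name counts is illustrative.

```r
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)
d$over25 <- ifelse(d$age > 25, 1, 0)

# Cross-tabulate the two grouping variables: rows are gender, columns are over25
counts <- table(d$gender, d$over25)
counts

# The same counts via tapply; combinations with no rows come back as NA
tapply(d$income, list(d$gender, d$over25), length)
```

The table reports a count of 0 for the empty combination, whereas tapply fills that cell with NA by default.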

4. Split Function: Split is used to group the data.

Syntax of split function

split(x,f)

where x is a vector (or data frame) and f is a factor (or a list of factors). Let’s consider a dataset.

d # consider a dataset

##   gender age income over25
## 1      M  47  55000      1
## 2      M  59  88000      1
## 3      F  21  32450      0
## 4      M  32  76500      1
## 5      F  33 123000      1
## 6      F  24  45650      0

Que: Using the split function, list the incomes in the above dataset for each of the following groups:
- Male under 25 years old
- Female under 25 years old
- Male over 25 years old
- Female over 25 years old

split(d$income,list(d$gender,d$over25))

## $F.0
## [1] 32450 45650
## 
## $M.0
## numeric(0)
## 
## $F.1
## [1] 123000
## 
## $M.1
## [1] 55000 88000 76500
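
Once the data is split, a function can be applied to each piece. The sketch below combines split() with sapply() to reproduce the group means computed earlier with tapply, and uses lengths() to count the members of each group. The names groups and group_means are illustrative. Note that the empty M.0 group yields NaN for the mean, where tapply reported NA.

```r
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)
d$over25 <- ifelse(d$age > 25, 1, 0)

# Split incomes by every gender/over25 combination
groups <- split(d$income, list(d$gender, d$over25))

# Number of people in each group
lengths(groups)

# Mean income per group; mean(numeric(0)) is NaN for the empty group
group_means <- sapply(groups, mean)
group_means

# drop = TRUE omits combinations that do not occur in the data
names(split(d$income, list(d$gender, d$over25), drop = TRUE))
```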