1 Introduction

Data wrangling often requires converting several variables from one data type to another.

Suppose we are building a binary classification model on a dataset with a few categorical variables, but the raw data stores these variables as character vectors. To pass them to the model as factors, we have to convert them from the character data type to the factor data type.

Let’s look at a few methods for converting one or more variables, using a sample dataset.

2 Dataset

library(dplyr)

2.1 Structure of Dataset

load("testdf.rda")
str(testdf)
## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

This dataset has 5 numeric variables and 6 character variables, and our model needs all of the character variables as factors, so we have to convert them to the factor data type. We first collect the names of the numeric and character columns.

con.names = testdf %>% select_if(is.numeric) %>% colnames()
cat.names = testdf %>% select_if(is.character) %>% colnames()
print(con.names)
## [1] "displ" "year"  "cyl"   "cty"   "hwy"
print(cat.names)
## [1] "manufacturer" "model"        "trans"        "drv"         
## [5] "fl"           "class"
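The same two name vectors can also be built without dplyr. The following is a minimal base R sketch; the small data frame df is a made-up stand-in for testdf (an assumption for illustration, not the original data).

```r
# Hypothetical stand-in for testdf (made-up values, same mix of column types)
df <- data.frame(
  manufacturer = c("audi", "audi", "ford"),
  displ        = c(1.8, 2.0, 4.6),
  cyl          = c(4L, 4L, 8L),
  drv          = c("f", "f", "r"),
  stringsAsFactors = FALSE
)

# vapply() returns one TRUE/FALSE per column; use it to index the column names
con.names <- names(df)[vapply(df, is.numeric, logical(1))]
cat.names <- names(df)[vapply(df, is.character, logical(1))]

print(con.names)
print(cat.names)
```

vapply is used instead of sapply here because it guarantees a logical vector even for a data frame with zero columns.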

3 Different methods to convert

Let’s look at different ways to convert these variables from character to factor.

3.1 Basic one-by-one conversion

This is the simplest approach: convert each variable in the dataset to a factor, one at a time.

testdf1 = testdf
testdf1$manufacturer = as.factor(testdf1$manufacturer)
testdf1$model = as.factor(testdf1$model)
testdf1$trans = as.factor(testdf1$trans)
testdf1$drv = as.factor(testdf1$drv)
testdf1$fl = as.factor(testdf1$fl)
testdf1$class = as.factor(testdf1$class)
str(testdf1[,cat.names])
## 'data.frame':    234 obs. of  6 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

This method works well for a few variables but becomes tedious for a large number of variables.

3.2 Converting all variables at once using apply functions

3.2.1 Using apply

testdf2 = testdf
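# apply() returns a character matrix; since R 4.0.0, data.frame() only turns
# strings into factors when stringsAsFactors = TRUE is set explicitly
testdf2[,cat.names] = data.frame(apply(testdf2[cat.names], 2, as.factor), stringsAsFactors = TRUE)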
str(testdf2[,cat.names])
## 'data.frame':    234 obs. of  6 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
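To see why the result of apply is wrapped in data.frame, it helps to inspect what apply returns on its own. The following is a minimal sketch on a made-up data frame df (an assumption, not testdf): apply works on a matrix, so the factors are flattened back into character strings, and it is data.frame that actually creates the factor columns.

```r
# Hypothetical two-column data frame with made-up values
df <- data.frame(
  drv = c("f", "r", "4", "f"),
  fl  = c("p", "r", "p", "d"),
  stringsAsFactors = FALSE
)

# apply() coerces the data frame to a matrix and returns a matrix
m <- apply(df, 2, as.factor)
print(class(m))    # a matrix, not a data frame
print(typeof(m))   # the cells are character strings again

# data.frame() does the conversion; stringsAsFactors = TRUE makes this
# explicit, which matters on R 4.0.0 and later where the default is FALSE
out <- data.frame(m, stringsAsFactors = TRUE)
str(out)
```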

3.2.2 Using sapply

testdf3 = testdf
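# sapply() also simplifies to a character matrix here, so stringsAsFactors = TRUE
# is needed on R 4.0.0 and later
testdf3[,cat.names] = data.frame(sapply(testdf3[,cat.names], as.factor), stringsAsFactors = TRUE)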
str(testdf3[,cat.names])
## 'data.frame':    234 obs. of  6 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

3.2.3 Using lapply

testdf4 = testdf
testdf4[,cat.names] = lapply(testdf4[,cat.names], as.factor)
str(testdf4[,cat.names])
## 'data.frame':    234 obs. of  6 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
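lapply returns a list of columns, which is the same shape as a data frame, so the factors can be assigned back directly. As a sketch of the same idea without a name vector, the conversion can be driven by column type instead. The data frame df below is made up for illustration and is not testdf.

```r
# Hypothetical stand-in data with made-up values
df <- data.frame(
  manufacturer = c("audi", "audi", "ford"),
  displ        = c(1.8, 2.0, 4.6),
  drv          = c("f", "f", "r"),
  stringsAsFactors = FALSE
)

# df[] keeps the data frame shell and replaces every column in place;
# character columns become factors, everything else is returned unchanged
df[] <- lapply(df, function(x) if (is.character(x)) as.factor(x) else x)

str(df)
```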

3.3 What if cat.names has only one variable?

Let’s say the vector cat.names holds only a single variable name, and observe how these methods behave.

cat.names = "manufacturer"

3.3.1 Using apply

testdf5 = testdf
testdf5[,cat.names] = data.frame(apply(testdf5[cat.names], 2, as.factor), stringsAsFactors = TRUE)
str(testdf5[,cat.names])
##  Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...

3.3.2 Using sapply

testdf6 = testdf
testdf6[,cat.names] = data.frame(sapply(testdf6[,cat.names], as.factor))
str(testdf6[,cat.names])
##  Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...

3.3.3 Using lapply

testdf6 = testdf
testdf6[,cat.names] = lapply(testdf6[,cat.names], as.factor)
## Warning in `[<-.data.frame`(`*tmp*`, , cat.names, value =
## list(structure(1L, .Label = "audi", class = "factor"), : provided 234
## variables to replace 1 variables
str(testdf6[,cat.names])
##  Factor w/ 1 level "audi": 1 1 1 1 1 1 1 1 1 1 ...

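Observe the output when we used lapply. The variable “manufacturer” is converted to a factor, but with only one level.
This is because testdf6[,cat.names] with a single column name returns a plain character vector rather than a data frame. lapply therefore applies as.factor to each of the 234 values separately and returns a list of 234 one-element factors. Assigning that list to a single column triggers the warning, and only the first element (“audi”) is kept and recycled down the column.

Let’s look at the structure of the lapply output.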

testdf6 = testdf
str(lapply(testdf6[,cat.names], as.factor)[1:10])
## List of 10
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1
##  $ : Factor w/ 1 level "audi": 1

Let’s see if converting the output to a data frame works.

testdf6 = testdf
testdf6[,cat.names] = data.frame(lapply(testdf6[,cat.names], as.factor))
## Warning in `[<-.data.frame`(`*tmp*`, , cat.names, value = structure(list(:
## provided 234 variables to replace 1 variables
str(testdf6[,cat.names])
##  Factor w/ 1 level "audi": 1 1 1 1 1 1 1 1 1 1 ...

The output is still the same, so lapply applied to testdf6[,cat.names] does not work for a single variable.
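The single-column case can be handled by keeping the selection as a data frame, so that lapply iterates over columns rather than over individual values. The following is a minimal sketch on made-up data; df and one.name are hypothetical names used only for this illustration.

```r
# Hypothetical stand-in data with made-up values
df <- data.frame(
  manufacturer = c("audi", "audi", "ford", "honda"),
  displ        = c(1.8, 2.0, 4.6, 1.6),
  stringsAsFactors = FALSE
)
one.name <- "manufacturer"   # hypothetical single-name vector

# With a comma and one column, the result is dropped to a plain vector
print(class(df[, one.name]))
# Without the comma (or with drop = FALSE) it stays a data frame
print(class(df[one.name]))
print(class(df[, one.name, drop = FALSE]))

# lapply() now sees one column instead of one value per row
df[one.name] <- lapply(df[one.name], as.factor)
str(df[, one.name])
```

Because df[one.name] behaves the same for one name or many, this form should also work unchanged when the name vector holds several columns.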

The same methods can be used for any valid data type conversion, for example int to numeric or numeric to character.
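As a rough illustration of those other conversions, the sketch below swaps as.factor for as.numeric and as.character. The data frame df and the vector int.names are made up for this example and are not part of the original dataset.

```r
# Hypothetical stand-in data with made-up values
df <- data.frame(
  year = c(1999L, 2008L, 2008L),
  cyl  = c(4L, 6L, 8L),
  hwy  = c(29, 26, 17)
)

# int to numeric: pick the integer columns by type, then convert them together
int.names <- names(df)[vapply(df, is.integer, logical(1))]
df[int.names] <- lapply(df[int.names], as.numeric)

# numeric to character: a single column, selected without a comma so it
# stays a one-column data frame
df["hwy"] <- lapply(df["hwy"], as.character)

str(df)
```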