As part of data wrangling, we often need to convert several variables from one data type to another.
Suppose we are building a binary classification model on a dataset with a few categorical variables, but the raw data stores these variables as character data. To pass them to the model as factor variables, we have to convert them from character to factor.
Let’s look at a few methods for converting one or more variables, with the help of a sample dataset.
library(dplyr)
load("testdf.rda")
str(testdf)
## 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
This dataset has 5 numeric variables and 6 character variables, and our model is to be built with all the character variables as factors, so we have to convert them to the factor data type. First, we collect the names of the numeric and character columns.
con.names = testdf %>% select_if(is.numeric) %>% colnames()
cat.names = testdf %>% select_if(is.character) %>% colnames()
print(con.names)
## [1] "displ" "year" "cyl" "cty" "hwy"
print(cat.names)
## [1] "manufacturer" "model" "trans" "drv"
## [5] "fl" "class"
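The same column names can also be collected without dplyr. The following is a minimal base R sketch; the data frame `toy` and the names `toy.cat.names` and `toy.con.names` are hypothetical stand-ins for testdf, cat.names and con.names.

```r
# Hypothetical toy data frame standing in for testdf
toy <- data.frame(
  manufacturer = c("audi", "ford", "audi"),
  displ = c(1.8, 4.6, 2.0),
  cyl = c(4L, 8L, 4L),
  drv = c("f", "r", "f"),
  stringsAsFactors = FALSE
)

# vapply() returns one TRUE/FALSE per column
is.chr <- vapply(toy, is.character, logical(1))
is.num <- vapply(toy, is.numeric, logical(1))

# Keep the names of the columns that passed each check
toy.cat.names <- names(toy)[is.chr]
toy.con.names <- names(toy)[is.num]
print(toy.cat.names)
print(toy.con.names)
```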
Let’s see different ways to convert these multiple variables from character to factor data type.
The most basic approach is to convert each variable in the dataset to a factor one by one.
testdf1 = testdf
testdf1$manufacturer = as.factor(testdf1$manufacturer)
testdf1$model = as.factor(testdf1$model)
testdf1$trans = as.factor(testdf1$trans)
testdf1$drv = as.factor(testdf1$drv)
testdf1$fl = as.factor(testdf1$fl)
testdf1$class = as.factor(testdf1$class)
str(testdf1[,cat.names])
## 'data.frame': 234 obs. of 6 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
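This method works well for a few variables but becomes tedious for a large number of them. The apply family of functions (apply, sapply, lapply) can convert all the listed columns in one statement.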
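testdf2 = testdf
# apply() returns a character matrix; data.frame() turns it into factors.
# stringsAsFactors = TRUE is needed from R 4.0.0 on, where the default is FALSE.
testdf2[,cat.names] = data.frame(apply(testdf2[cat.names], 2, as.factor), stringsAsFactors = TRUE)
str(testdf2[,cat.names])
## 'data.frame': 234 obs. of 6 variables: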
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
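testdf3 = testdf
# sapply() also simplifies its result to a character matrix, so the factor
# conversion again relies on data.frame(..., stringsAsFactors = TRUE).
testdf3[,cat.names] = data.frame(sapply(testdf3[,cat.names], as.factor), stringsAsFactors = TRUE)
str(testdf3[,cat.names])
## 'data.frame': 234 obs. of 6 variables: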
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
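A note on why the apply and sapply results are wrapped in data.frame(): both functions simplify their output to a matrix, and a matrix cannot hold factors. The sketch below, on a hypothetical toy data frame, suggests that the factor conversion really happens inside data.frame(), which is why it depends on the stringsAsFactors setting.

```r
# Hypothetical toy data frame with two character columns
toy <- data.frame(
  manufacturer = c("audi", "ford", "audi"),
  drv = c("f", "r", "f"),
  stringsAsFactors = FALSE
)

# sapply() simplifies the per-column factors into a single matrix
m <- sapply(toy, as.factor)
print(is.matrix(m))
print(is.character(m))

# The matrix columns become factors only when data.frame() is asked to do so
converted <- data.frame(m, stringsAsFactors = TRUE)
str(converted)
```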
testdf4 = testdf
testdf4[,cat.names] = lapply(testdf4[,cat.names], as.factor)
str(testdf4[,cat.names])
## 'data.frame': 234 obs. of 6 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
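Since dplyr is already loaded, the conversion can also be written as a single mutate() call. This is an alternative to the methods in this post rather than one of them; it is a sketch on a hypothetical toy data frame and assumes dplyr 1.0.0 or later, where across() is available.

```r
library(dplyr)

# Hypothetical toy data frame standing in for testdf
toy <- data.frame(
  manufacturer = c("audi", "ford", "audi"),
  displ = c(1.8, 4.6, 2.0),
  drv = c("f", "r", "f"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; other columns are left untouched
toy.fct <- toy %>% mutate(across(where(is.character), as.factor))
str(toy.fct)
```

With older dplyr versions, `mutate_if(is.character, as.factor)` expresses the same idea.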
Now suppose the vector cat.names holds only a single variable name, and let’s observe how the apply, sapply and lapply methods behave.
cat.names = "manufacturer"
testdf5 = testdf
# Use testdf5 (not the already converted testdf2) as the input
testdf5[,cat.names] = data.frame(apply(testdf5[cat.names], 2, as.factor), stringsAsFactors = TRUE)
str(testdf5[,cat.names])
## Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
testdf6 = testdf
testdf6[,cat.names] = data.frame(sapply(testdf6[,cat.names], as.factor))
str(testdf6[,cat.names])
## Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
testdf6 = testdf
testdf6[,cat.names] = lapply(testdf6[,cat.names], as.factor)
## Warning in `[<-.data.frame`(`*tmp*`, , cat.names, value =
## list(structure(1L, .Label = "audi", class = "factor"), : provided 234
## variables to replace 1 variables
str(testdf6[,cat.names])
## Factor w/ 1 level "audi": 1 1 1 1 1 1 1 1 1 1 ...
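Observe the output when we used lapply: the variable “manufacturer” is converted to a factor, but with only one level.
This happens because testdf6[,cat.names] with a single column name returns a plain character vector rather than a data frame, so lapply applies as.factor to each of the 234 values separately and returns a list of 234 one-level factors. The assignment then replaces the single column using only the first element of that list, recycled across all rows.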
Let’s see the structure of the output.
testdf6 = testdf
str(lapply(testdf6[,cat.names], as.factor)[1:10])
## List of 10
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
## $ : Factor w/ 1 level "audi": 1
Let’s see if converting the output to a data frame works.
testdf6 = testdf
testdf6[,cat.names] = data.frame(lapply(testdf6[,cat.names], as.factor))
## Warning in `[<-.data.frame`(`*tmp*`, , cat.names, value = structure(list(:
## provided 234 variables to replace 1 variables
str(testdf6[,cat.names])
## Factor w/ 1 level "audi": 1 1 1 1 1 1 1 1 1 1 ...
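The output is still the same. So lapply combined with testdf6[,cat.names] indexing doesn’t work for converting a single variable. The cause is the indexing rather than lapply itself: selecting the column as testdf6[cat.names], or with drop = FALSE, keeps a one-column data frame, so lapply iterates over columns again.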
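One way around the single-variable problem is to index so that the result stays a data frame. The sketch below uses a hypothetical toy data frame and a hypothetical name vector `one.name`; it shows two indexing forms that should behave the same whether the vector holds one name or many.

```r
# Hypothetical toy data frame and a single column name
toy <- data.frame(
  manufacturer = c("audi", "ford", "audi", "honda"),
  displ = c(1.8, 4.6, 2.0, 1.6),
  stringsAsFactors = FALSE
)
one.name <- "manufacturer"

# toy[, one.name] drops to a vector; toy[one.name] stays a data frame
print(class(toy[, one.name]))
print(class(toy[one.name]))

# Option 1: list-style indexing on both sides
toy1 <- toy
toy1[one.name] <- lapply(toy1[one.name], as.factor)

# Option 2: matrix-style indexing with drop = FALSE on the right-hand side
toy2 <- toy
toy2[, one.name] <- lapply(toy2[, one.name, drop = FALSE], as.factor)

str(toy1)
str(toy2)
```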
The same methods can be used for any valid data type conversion, for example integer to numeric or numeric to character.
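As an illustration of other conversions, here is a short sketch of the lapply approach applied to integer and numeric columns. The data frame `toy` and the vector `int.names` are hypothetical and not part of the original dataset.

```r
# Hypothetical toy data frame with integer and numeric columns
toy <- data.frame(
  year = c(1999L, 2008L),
  cyl = c(4L, 6L),
  displ = c(1.8, 2.8)
)

# Integer to numeric (double) for every integer column
int.names <- names(toy)[vapply(toy, is.integer, logical(1))]
toy[int.names] <- lapply(toy[int.names], as.numeric)

# Numeric to character for a single column
toy["displ"] <- lapply(toy["displ"], as.character)

str(toy)
```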