Some algorithms, such as XGBoost, expect numeric vectors as input. So what do you do if your features are categorical?

Here are a couple of techniques for recoding those categorical features.

Say we have the following dataframe:-

##        dates items
## 1  1/24/2014     A
## 2 10/28/2014     b
## 3 10/29/2014     c
## 4 12/12/2014     d
## 5   1/4/2015     A
## 6   1/5/2015     e
## 7   1/9/2015     f
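
The construction of df is not shown, so here is a minimal sketch that reproduces it. It assumes both columns are stored as plain character strings; the stringsAsFactors = FALSE setting is an assumption, not something stated above.

```r
# Rebuild the example dataframe printed above.
# Assumption: both columns are character vectors, not factors.
df <- data.frame(
  dates = c("1/24/2014", "10/28/2014", "10/29/2014", "12/12/2014",
            "1/4/2015", "1/5/2015", "1/9/2015"),
  items = c("A", "b", "c", "d", "A", "e", "f"),
  stringsAsFactors = FALSE
)

# Keep an untouched copy, df2, for the manual recoding approach.
df2 <- df
```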

We can use the recode function in the car package to convert those categorical items to numeric:-

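library(car)
# Recode from character: recode() returns a factor when given a factor
required.labels <- as.character(df$items)
recoded.labels <- recode(required.labels, "'A'=1; 'b'=2; 'c'=3; 'd'=4; 'e'=5; 'f'=6")
# Replace the original column with the numeric codes
df$items <- recoded.labels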

Now the items column of the dataframe has been recoded into a numeric type.

##        dates items
## 1  1/24/2014     1
## 2 10/28/2014     2
## 3 10/29/2014     3
## 4 12/12/2014     4
## 5   1/4/2015     1
## 6   1/5/2015     5
## 7   1/9/2015     6

If for some reason you don’t want to use the car package, you can do the recoding manually like this:-

df2
##        dates items
## 1  1/24/2014     A
## 2 10/28/2014     b
## 3 10/29/2014     c
## 4 12/12/2014     d
## 5   1/4/2015     A
## 6   1/5/2015     e
## 7   1/9/2015     f
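# Convert the column to a factor so each distinct item becomes a level
df2$items <- as.factor(df2$items)
items.1 <- df2[,"items"]
# Relabel the levels as 1..n, following the existing (sorted) level order
num.items <- length(levels(items.1))
levels(items.1) <- 1:num.items
df2$items <- items.1
# The internal level codes now match the labels, so this yields 1..n as numbers
df2$items <- as.numeric(df2$items)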
df2
##        dates items
## 1  1/24/2014     1
## 2 10/28/2014     2
## 3 10/29/2014     3
## 4 12/12/2014     4
## 5   1/4/2015     1
## 6   1/5/2015     5
## 7   1/9/2015     6
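
The manual steps can also be condensed. The sketch below is one possible shorthand in base R, run on a stand-alone vector named items (a hypothetical name introduced for illustration) rather than on df2 itself.

```r
# Hypothetical example vector with the same values as the items column.
items <- c("A", "b", "c", "d", "A", "e", "f")

# factor() gives each distinct value a level (sorted order by default);
# as.integer() then returns the position of each element's level.
codes <- as.integer(factor(items))

# To fix the mapping explicitly instead of relying on the sort order,
# supply the levels yourself.
codes.explicit <- as.integer(
  factor(items, levels = c("A", "b", "c", "d", "e", "f"))
)
codes.explicit
# 1 2 3 4 1 5 6
```

Passing the levels explicitly is worth considering when the same mapping has to be reused on new data, since the default ordering depends on which values happen to be present.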

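Note: We didn’t call as.numeric(df2$items) directly on the original column because of the way factors work. On a factor, as.numeric() returns the internal integer level codes rather than the level labels, and the two only coincide here because we relabelled the levels as 1:num.items first. You can see more about this in the R FAQ.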
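
To see the factor behaviour the note refers to, here is a small sketch using made-up values (the factor f is hypothetical and not part of the example dataframe).

```r
# A factor whose labels look like numbers (hypothetical values).
f <- factor(c("10", "20", "10", "30"))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(f)
codes
# 1 2 1 3

# Converting through character first recovers the labels as numbers.
values <- as.numeric(as.character(f))
values
# 10 20 10 30

# An equivalent form: index the numeric levels by the integer codes.
values.by.level <- as.numeric(levels(f))[as.integer(f)]
```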