Today I want to blog about a step that almost always comes up before modeling can start: one-hot-encoding.
Getting started in applied machine learning can be difficult, especially when working with real-world data.
Often, machine learning problems will recommend or require that you prepare your data in specific ways before fitting a machine learning model.
What is Categorical Data?
Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set. Categorical variables without an inherent order are often called nominal. An example could be a “color” variable with the values “red”, “green” and “blue”.
Each value represents a different category. Some categories may also have a natural relationship to each other, such as a natural ordering; such variables are called ordinal.
What is the Problem with Categorical Data?
Some algorithms can work with categorical data directly. However, many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.
In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.
This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.
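As a minimal sketch of that round trip in base R (the vector `labels` is made up for illustration), a factor can supply integer codes, and its levels can map codes, such as model predictions, back to labels:

```r
# Hypothetical label vector, used only for illustration.
labels <- c("red", "green", "blue", "green")

# factor() sorts the levels alphabetically: blue, green, red.
f <- factor(labels)

# Integer codes are positions in levels(f).
codes <- as.integer(f)

# Map integer codes (e.g. predictions from a model) back to labels.
decoded <- levels(f)[codes]

print(codes)
print(decoded)
```

The decoded vector matches the original labels, so nothing is lost by moving to a numeric form as long as the level order is kept.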
To solve this problem, we use one-hot-encoding on categorical data.
In their purest form, regression models treat all independent variables as numeric. If we have non-numeric data that we think may be important, we want to be able to use it in the model.
If our data contains a categorical variable, the regression algorithm won’t be able to process it directly.
Often, each category is translated into an integer code: for example, the first category is assigned 1, the second category 2, the third category 3, and so on. The algorithm will then try to predict the response variable using these numerical values.
The problem we face here is that it’s not fair to say that ‘third category’ > ‘first category’: they are just different categories, without any order to them. We need an approach that counters this and lets us fairly understand the relationships between the variables in the dataset.
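The effect of that artificial ordering can be sketched with a small simulation. The data below are invented for illustration (only the "Green" category shifts the response), and `lm` is used as a stand-in for any regression algorithm:

```r
set.seed(42)

# Hypothetical data: only "Green" shifts the response.
colour <- rep(c("Red", "Green", "Blue"), each = 20)
y <- ifelse(colour == "Green", 10, 0) + rnorm(60)

# Integer coding follows alphabetical level order: Blue = 1, Green = 2, Red = 3.
codes <- as.integer(factor(colour))

# Treating the codes as numeric forces a single straight-line effect.
fit_codes <- lm(y ~ codes)

# Treating the variable as a factor lets each category have its own effect.
fit_factor <- lm(y ~ factor(colour))

r2_codes <- summary(fit_codes)$r.squared
r2_factor <- summary(fit_factor)$r.squared
print(c(codes = r2_codes, factor = r2_factor))
```

Because Green sits between Blue and Red in the integer coding, a single slope cannot pick up its effect, while the factor version, which R expands into dummy variables internally, can.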
One way to solve the problem is to use what is called one-hot encoding.
One-hot-encoding is a representation of categorical variables as binary vectors. What this means is that we want to transform a categorical variable or variables to a format that works better with classification and regression algorithms.
One-hot-encoding converts an unordered categorical vector (i.e. a factor) to multiple binarized vectors, where each binary vector of 1s and 0s indicates the presence of a class (i.e. level) of the original vector.
In this process we create as many dummy variables as we have categories. Each dummy variable corresponds to one category and contains 1 if the observation belongs to that category and 0 otherwise. The following example shows how the variable Type can be encoded using 4 dummy variables.
Type | Type.A | Type.B | Type.F | Type.T |
---|---|---|---|---|
A | 1 | 0 | 0 | 0 |
A | 1 | 0 | 0 | 0 |
B | 0 | 1 | 0 | 0 |
T | 0 | 0 | 0 | 1 |
F | 0 | 0 | 1 | 0 |
A | 1 | 0 | 0 | 0 |
B | 0 | 1 | 0 | 0 |
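The table above can be reproduced with a short base R sketch. Using `model.matrix` here is my own choice for illustration rather than something the post relies on, and the data frame `df` is built by hand from the Type column:

```r
# The Type column from the table above.
df <- data.frame(Type = factor(c("A", "A", "B", "T", "F", "A", "B")))

# "- 1" drops the intercept so that every level gets its own 0/1 column.
onehot <- model.matrix(~ Type - 1, data = df)

# Rename columns from "TypeA" to "Type.A" to match the table.
colnames(onehot) <- paste0("Type.", levels(df$Type))

print(onehot)
```

Each row contains exactly one 1, in the column of the category that row belongs to.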
In more general terms, one-hot refers to any encoding consisting of a combination of 0s and 1s in which only one high bit (1) is allowed in any valid value. It is most commonly used to indicate the state of a state machine (a machine that can be in exactly one of a finite number of states at any given time). This encoding does not require the use of a decoder, since the position of the high bit indicates the state of the machine.
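A rough sketch of that definition in R follows. The helper names `is_one_hot` and `decode_state` and the three states are hypothetical, chosen only to illustrate the idea that the position of the single high bit identifies the state:

```r
# Returns TRUE when `bits` has only 0s and 1s and exactly one 1.
is_one_hot <- function(bits) {
  all(bits %in% c(0, 1)) && sum(bits) == 1
}

# Hypothetical states of a small state machine.
states <- c("idle", "running", "done")

# The position of the high bit identifies the state directly.
decode_state <- function(bits) {
  stopifnot(length(bits) == length(states), is_one_hot(bits))
  states[which(bits == 1)]
}

print(is_one_hot(c(0, 1, 0)))
print(is_one_hot(c(1, 1, 0)))
print(decode_state(c(0, 0, 1)))
```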
Consider the following example, where we have one numeric variable and one categorical variable.
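# Fix the random seed so the sampled colours are reproducible.
set.seed(555)
# Outcome is numeric; Variable is the categorical column to encode.
data <- data.frame(
  Outcome = seq(1, 100, by = 1),
  Variable = sample(c("Red", "Green", "Blue"), 100, replace = TRUE)
)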
There are various ways to create dummy variables. The easiest way is to use one of the available packages; here we use the caret package. We can accomplish it with a few lines of code.
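library(caret)
# Build the encoder: "~ ." means use every column in data.
dummy <- dummyVars(" ~ .", data = data)
# Apply the encoder; the categorical column becomes 0/1 dummy columns.
newdata <- data.frame(predict(dummy, newdata = data))
# Show the first 10 rows as a table (requires the knitr package).
knitr::kable(head(newdata, 10))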
Outcome | Variable.Blue | Variable.Green | Variable.Red |
---|---|---|---|
1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
5 | 0 | 0 | 1 |
6 | 0 | 0 | 1 |
7 | 0 | 1 | 0 |
8 | 0 | 1 | 0 |
9 | 0 | 0 | 1 |
10 | 0 | 1 | 0 |
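Going the other way, from dummy columns back to labels, can be sketched in base R as well. The small matrix below is typed in by hand with the same column names as the output above (it is not produced by caret), and using `max.col` is just one possible approach:

```r
# Hypothetical one-hot rows with the same column names as the output above.
onehot <- rbind(c(0, 1, 0),
                c(0, 0, 1),
                c(1, 0, 0))
colnames(onehot) <- c("Variable.Blue", "Variable.Green", "Variable.Red")

# max.col() returns, for each row, the index of the column holding the maximum.
idx <- max.col(onehot, ties.method = "first")

# Strip the "Variable." prefix to recover the original labels.
decoded <- sub("^Variable\\.", "", colnames(onehot)[idx])

print(decoded)
```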
In this post, we discovered why categorical data often must be encoded when working with machine learning algorithms.
Specifically:
That categorical data is defined as variables with a finite set of label values.
That most machine learning algorithms require numerical input and output variables.
That integer encoding and one-hot encoding are used to convert categorical data to numeric data.
What is One Hot Encoding? Why And When do you have to use it?, Rakshith Vasudev, August 2, 2017, https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
One-hot, Wikipedia, https://en.wikipedia.org/wiki/One-hot
Package caret, https://cran.r-project.org/web/packages/caret/caret.pdf