Blog 5 - One-Hot Encoding

Introduction

Categorical data refers to variables whose values are labels drawn from a finite set of categories. Some categorical variables have a natural ordering (ordinal), while others do not (nominal). Some machine learning algorithms, such as decision trees, can work directly with categorical data depending on the implementation, but most require numeric input. This means that categorical data must be converted to numbers. The conversion takes two steps:

Integer encoding
One-hot encoding

1. Integer encoding

In this step each category in the categorical variable is assigned an integer value.
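As a minimal sketch of this step in base R, a character vector can be turned into a factor and then into integer codes. The `colours` vector and the variable names below are made up for illustration and are not part of the example later in this post.

```r
# Hypothetical example data: three colour labels (assumed for illustration).
colours <- c("R", "G", "B", "G", "R")

# factor() collects the distinct labels as levels, sorted alphabetically by default.
colour_factor <- factor(colours)
levels(colour_factor)

# as.integer() returns each value's level index: B = 1, G = 2, R = 3.
colour_codes <- as.integer(colour_factor)
colour_codes
```

Note that the integers follow the alphabetical order of the labels, which is arbitrary with respect to the meaning of the categories.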

2. One-hot encoding

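Integer encoding implies an order among the categories (for example, 1 < 2 < 3). For categorical variables with no natural ordering, that implied order is misleading because a model may treat the integers as ranked quantities, so integer encoding alone is not enough. This is where one-hot encoding is useful.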

One-hot encoding converts a categorical variable into a form that ML algorithms can use without assuming an order among the categories, which can improve prediction. In other words, it replaces a categorical variable that has multiple categories with multiple binary variables, one per category, each taking the value 1 or 0.
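The idea can be sketched in base R with `model.matrix()`, without any extra packages. This is only one possible way to do it, and the `colours` factor and the name `one_hot` are assumptions made for illustration.

```r
# Hypothetical example data (assumed for illustration only).
colours <- factor(c("R", "G", "B", "G", "R"))

# "- 1" drops the intercept so that every level gets its own 0/1 column.
one_hot <- model.matrix(~ colours - 1)
colnames(one_hot)

# Each row contains exactly one 1, marking that observation's category.
rowSums(one_hot)
```

Dropping the intercept keeps one column per category. With the intercept left in, `model.matrix()` would omit the first level and treat it as the baseline.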

Example Code

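library(caret)  # provides dummyVars()

# x is numeric; y is a categorical variable sampled at random from three labels,
# so the rows of the output differ from run to run
df <- data.frame(x = seq(1,10,by=1),
  y = sample(c("R","G","B"), 10, replace = TRUE))

# build the encoder from all columns, then apply it to the data
df_dummy <- dummyVars(" ~ .", data=df)
df_new <- data.frame(predict(df_dummy, newdata = df))

df_new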

##     x yB yG yR
## 1   1  0  0  1
## 2   2  1  0  0
## 3   3  0  1  0
## 4   4  0  0  1
## 5   5  1  0  0
## 6   6  1  0  0
## 7   7  0  1  0
## 8   8  0  1  0
## 9   9  0  0  1
## 10 10  0  0  1

Conclusion

This post explained why one-hot encoding is needed and illustrated it with an example.

The key points are as follows:

Categorical data refers to variables that are made up of a finite set of `label values`.

Most machine learning algorithms require numerical variables.

The methods used to convert categorical data to numeric data are integer encoding and one-hot encoding.

DATA 621 Blog 5