Categorical data refers to variables that are made up of label values, such as distinct categories, which sometimes have a natural ordering to them. Some machine learning algorithms, such as decision trees, can work directly with categorical data depending on the implementation, but most require the variables to be numeric. This means that any categorical data must be mapped to integers. Converting a categorical variable involves two steps, as follows:
In the first step, integer encoding, each category in the categorical variable is assigned an integer value.
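As a minimal sketch in base R (the colour values below are illustrative and not part of the original example), a factor's levels can be mapped directly to integers:

# Integer encoding: each factor level is mapped to an integer;
# levels default to alphabetical order, so B = 1, G = 2, R = 3
colours <- factor(c("R", "G", "B", "G", "R"))
as.integer(colours)
## [1] 3 2 1 2 3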
For categorical variables where no such ordinal relationship exists, integer encoding is not enough. This is where one-hot encoding is useful.
One-hot encoding converts categorical variables into a form that can be provided to machine learning algorithms so they can do a better job of prediction. In other words, one-hot encoding is the process of converting a categorical variable with multiple categories into multiple variables, each with a value of 1 or 0.
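The example below creates a small data frame with a numeric column x and a categorical column y, then one-hot encodes it with the dummyVars() function from the caret package: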
library(caret)
# Sample data: a numeric column x and a categorical column y with levels R, G and B
df <- data.frame(x = seq(1, 10, by = 1),
                 y = sample(c("R", "G", "B"), 10, replace = TRUE))
# Build the one-hot (dummy variable) encoder and apply it to df
df_dummy <- dummyVars(" ~ .", data = df)
df_new <- data.frame(predict(df_dummy, newdata = df))
df_new
## x yB yG yR
## 1 1 0 0 1
## 2 2 1 0 0
## 3 3 0 1 0
## 4 4 0 0 1
## 5 5 1 0 0
## 6 6 1 0 0
## 7 7 0 1 0
## 8 8 0 1 0
## 9 9 0 0 1
## 10 10 0 0 1
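Each row of df_new now has a 1 in exactly one of the columns yB, yG and yR, indicating which colour was sampled for that observation, while the numeric column x is carried through unchanged.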
The reason for using one-hot encoding was discussed and illustrated with an example.
The key points discussed are as follows:
- Categorical data refers to variables that are made up of a finite set of `label values`.
- Most machine learning algorithms require numerical variables.
- The methods used to convert categorical data to numeric form are integer encoding and one-hot encoding.