1 Goal


The goal of this tutorial is to avoid a common mistake when working with factors. When a factor whose levels are numbers is converted directly to numeric, the result is the position of each level (its internal integer code) rather than the value it represents. We will see how to detect this problem and how to check that a conversion worked as intended.


2 Data preparation


# In this exercise we will use a character vector containing numbers
# We will use the iris dataset to perform this exercise
data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

3 Turning a character vector into numerical


# We create a character vector from the Sepal.Length variable
char_vector <- as.character(iris$Sepal.Length)
str(char_vector)
##  chr [1:150] "5.1" "4.9" "4.7" "4.6" "5" "5.4" "4.6" ...
# We create a numerical vector from the character vector
num_vector <- as.numeric(char_vector)
str(num_vector)
##  num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# We plot the difference, which should be zero everywhere if the values were converted correctly
plot(num_vector - iris$Sepal.Length)

# A plot consisting only of zeroes confirms that the transformation was made correctly

4 Turning a factor into a numerical vector


# We create a factor type variable 
my_factor <- factor(iris$Sepal.Length)
str(my_factor)
##  Factor w/ 35 levels "4.3","4.4","4.5",..: 9 7 5 4 8 12 4 8 2 7 ...
# Now we try to store the numerical values of the factor in a new variable
num_vector <- as.numeric(my_factor)
str(num_vector)
##  num [1:150] 9 7 5 4 8 12 4 8 2 7 ...
# We plot the difference, which should be zero everywhere if the values were converted correctly
plot(num_vector - iris$Sepal.Length)

# In this case num_vector holds the position of each value in the levels vector of the factor,
# not the value itself (9 is the position of "5.1", 7 the position of "4.9", and so on)
# The non-zero differences in the plot show that something went wrong
# This problem is called the factor variable trap (FVT)
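
To see where these positions come from, it can help to look at a much smaller factor. The following is a minimal sketch; `small_factor` and its values are illustrative and not part of the iris data.

```r
# A small illustrative factor (hypothetical values, not taken from iris)
small_factor <- factor(c(10, 2.5, 10, 7))

# The levels are the sorted unique values, stored as character strings
levels(small_factor)
## [1] "2.5" "7"   "10"

# Internally each element is an integer code pointing into levels()
as.integer(small_factor)
## [1] 3 1 3 2

# as.numeric() on the factor returns those codes, not the original numbers
as.numeric(small_factor)
## [1] 3 1 3 2
```

The value 10 becomes 3 only because "10" is the third level, which is the same effect seen with the iris data.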

5 Solving the factor variable trap


# Again we use the factor of the Sepal Length variable
my_factor <- factor(iris$Sepal.Length)
str(my_factor)
##  Factor w/ 35 levels "4.3","4.4","4.5",..: 9 7 5 4 8 12 4 8 2 7 ...
# To avoid the factor variable trap we must first transform the factor into character
# At this point the levels disappear and the only value left is the content of the variable
num_vector <- as.numeric(as.character(my_factor))
str(num_vector)
##  num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# We plot the difference, which should be zero everywhere if the values were converted correctly
plot(num_vector - iris$Sepal.Length)

# Now the correct values are stored in our numerical vector
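
An alternative idiom, mentioned in the R help page for `factor`, converts the levels once and then indexes them with the factor. The sketch below assumes the same `my_factor` as in this section; the names `num_from_levels` and `num_from_chars` are illustrative.

```r
data("iris")
my_factor <- factor(iris$Sepal.Length)

# Convert the 35 levels to numbers once, then index by the factor's integer codes
num_from_levels <- as.numeric(levels(my_factor))[my_factor]

# The two-step conversion through character, for comparison
num_from_chars <- as.numeric(as.character(my_factor))

# Both approaches should give the same numerical vector
identical(num_from_levels, num_from_chars)
## [1] TRUE
```

Both forms avoid the trap. The first converts only the levels rather than every element, which may matter for long vectors with few distinct values.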

6 Conclusion


In this tutorial we have described a common mistake that is easy to make when working with factors. When a factor is converted directly to numeric, the position of each level is stored instead of the number held at that position. To solve the problem, we first convert the factor to character, which removes the levels, and then convert the result to numeric so that the original values are stored.

TIP: When converting factors, always check that the transformation kept the original values. A simple plot like the one used in this tutorial is both quick and descriptive.
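
The same check can also be done numerically instead of visually. This is a minimal sketch using base R's `all.equal()`; the variable names `wrong` and `right` are illustrative.

```r
data("iris")
my_factor <- factor(iris$Sepal.Length)

wrong <- as.numeric(my_factor)                # positions of the levels
right <- as.numeric(as.character(my_factor))  # original values

# all.equal() compares numbers with a small tolerance;
# isTRUE() turns its result into a single TRUE or FALSE
isTRUE(all.equal(wrong, iris$Sepal.Length))
## [1] FALSE
isTRUE(all.equal(right, iris$Sepal.Length))
## [1] TRUE
```

A check of this kind could be wrapped in `stopifnot()` so that a script stops as soon as a conversion falls into the trap.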