1 Goal


The goal of this tutorial is to keep only the numerical columns of a data frame. This is useful before applying a function such as sum, either directly or inside aggregate or group_by, because sum returns an error when non-numerical columns are present.


2 Keep numerical columns


# In this example we will use iris, the classic plant classification dataset that ships with R.
data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
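# If we try to sum all the columns, we get an error.
# apply() first converts the data frame to a matrix; because of the factor column, that matrix is of type character.

# apply(iris, 2, sum)
# Error in FUN(newX[, i], ...) : invalid 'type' (character) of argument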

# So we need to keep only the numerical columns
# To do so, we ask each column whether it is numerical or not
sapply(iris, is.numeric)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##         TRUE         TRUE         TRUE         TRUE        FALSE
# We see that the last column, Species, is not numerical, so it is the source of the previous error
# Let's keep only the numerical columns
# (drop = FALSE keeps the result a data frame even if only one column is numerical)
iris <- iris[, sapply(iris, is.numeric), drop = FALSE]
str(iris)
## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# And now we can sum all the columns
apply(iris, 2, sum)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        876.5        458.6        563.7        179.9
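
When a logical mask selects exactly one column, `df[, mask]` returns a plain vector rather than a data frame unless `drop = FALSE` is given, and a later call such as `apply(x, 2, sum)` then fails because a vector has no dimensions. The sketch below illustrates this on a small made-up data frame; the names `df`, `is_num`, `dropped` and `kept` are hypothetical and only serve the illustration.

```r
# Hypothetical data frame with a single numerical column
df <- data.frame(id = c("a", "b", "c"),
                 value = c(1.5, 2.5, 3.0),
                 stringsAsFactors = FALSE)

# Logical mask: TRUE for numerical columns
is_num <- sapply(df, is.numeric)

dropped <- df[, is_num]                # simplified to a numeric vector
kept    <- df[, is_num, drop = FALSE]  # still a data frame with one column

is.data.frame(dropped)
is.data.frame(kept)

# colSums() works on the data frame version
colSums(kept)
```

For column sums specifically, `colSums()` is a base R alternative to `apply(x, 2, sum)` that works directly on a numerical data frame.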

3 Removing non-numerical columns using a function


# Returns a dataset without its non-numerical variables
# It takes a data frame as an argument and returns the same data frame without the non-numerical columns
numeric_dataset <- function(Dataset){
  nums <- sapply(Dataset, is.numeric) # Logical vector: TRUE for each numerical column
  return(Dataset[ , nums, drop = FALSE]) # drop = FALSE keeps a data frame even with a single numerical column
}

# Let's try the function on the iris dataset
data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
str(numeric_dataset(iris))
## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
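
As a possible follow-up, the numerical columns can be summed per group with `aggregate`, one of the use cases mentioned in the goal. This is only a sketch: it works on a fresh copy of the built-in data so that earlier changes to `iris` do not matter, and the names `flowers`, `measurements` and `sums_by_species` are hypothetical.

```r
# Fresh copy of the built-in dataset
flowers <- datasets::iris

# Keep the numerical columns only (same test as in numeric_dataset)
measurements <- flowers[, sapply(flowers, is.numeric), drop = FALSE]

# Sum each numerical column within each species;
# the grouping variable is passed separately through `by`
sums_by_species <- aggregate(measurements,
                             by = list(Species = flowers$Species),
                             FUN = sum)
print(sums_by_species)
```

Passing the grouping factor through `by` keeps it out of the columns handed to `sum`, so the factor never reaches a function that expects numbers.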

4 Conclusion


In this tutorial we have learnt how to remove non-numerical columns in order to prepare our data for operations that only work on numbers and return an error otherwise.