1 Goal


The goal of this tutorial is to learn how to find the columns of a dataset that have zero variance and remove them, since constant columns carry no information and can break certain analyses.


2 Data preparation


# First we load the libraries
library(ggplot2)
library(ggthemes)

# In this example we will use the built-in iris dataset, which records flower measurements for three iris species.
data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# We are going to add a column with zero variance
iris$NoVariance <- 1

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species     NoVariance
##  setosa    :50   Min.   :1   
##  versicolor:50   1st Qu.:1   
##  virginica :50   Median :1   
##                  Mean   :1   
##                  3rd Qu.:1   
##                  Max.   :1

3 Columns with zero variance

3.1 How to find columns with zero variance


# We are going to use the var function to check the variance of the different columns
apply(iris, 2, var)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    0.6856935    0.1899794    3.1162779    0.5810063           NA 
##   NoVariance 
##    0.0000000
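# apply() first converts the data frame to a matrix; because Species is a factor, this is a character matrix
# The species names cannot be converted back to numbers, so their variance is NA, while the new column has variance 0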

# Let's find which variables have zero variance
which(apply(iris, 2, var) == 0)
## NoVariance 
##          6
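
The call above relies on `apply()` turning every column into text and back into numbers. As a minimal sketch of an alternative, the variance can be computed column by column and only for numeric columns. The names `df`, `col_var` and `zero_var_cols` are illustrative and not part of the original example.

```r
# Work on a copy of the built-in data so the example is self-contained
df <- datasets::iris
df$NoVariance <- 1

# Variance of each numeric column; non-numeric columns are marked NA
col_var <- vapply(
  df,
  function(x) if (is.numeric(x)) var(x) else NA_real_,
  numeric(1)
)

# Names of the columns whose variance is exactly zero
zero_var_cols <- names(col_var)[!is.na(col_var) & col_var == 0]
zero_var_cols
## [1] "NoVariance"
```

Because the data frame is never converted to a matrix, the numeric columns keep their type and the factor column is skipped explicitly instead of being coerced.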

3.2 Removing columns with zero variance


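# Now that we have found our columns without variance let's get rid of them
# which() returns a named integer vector of column positions; as.numeric() only strips the names
# Note: this assumes at least one column matched; with no matches the negative index would select no columns at all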

iris <- iris[ - as.numeric(which(apply(iris, 2, var) == 0))]
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
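
Negative indexing with `which()` has one pitfall: when no column has zero variance, the index is empty and every column is dropped. A possible way around this, sketched here under the assumption that only numeric columns should be checked, is to subset with a logical mask. `drop_zero_variance()` is a hypothetical helper written for this example, not a base R function.

```r
# Hypothetical helper: drop numeric columns whose variance is exactly zero
drop_zero_variance <- function(data) {
  is_constant <- vapply(
    data,
    # isTRUE() treats an NA variance (e.g. an all-NA column) as "not constant"
    function(x) is.numeric(x) && isTRUE(var(x, na.rm = TRUE) == 0),
    logical(1)
  )
  data[, !is_constant, drop = FALSE]
}

df <- datasets::iris
df$NoVariance <- 1

# The constant column is removed
names(drop_zero_variance(df))
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

# With no constant columns, all five columns are kept
ncol(drop_zero_variance(datasets::iris))
## [1] 5

# The pitfall: an empty negative index selects zero columns
ncol(datasets::iris[-integer(0)])
## [1] 0
```

A logical mask always has one entry per column, so the "nothing to remove" case keeps the data frame as it is.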

4 Conclusion


In this tutorial we have learnt how to find and remove columns with zero variance from our dataset.