The goal of this tutorial is to learn how to find which columns of our dataset have zero variance and remove them before running certain analyses. A constant column carries no information, and it can break methods that scale each variable by its standard deviation.
# First we load the libraries (the steps below only use base R, so these are optional)
library(ggplot2)
library(ggthemes)
# In this example we will use iris, the classic open dataset for classifying iris plants by species.
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# We are going to add a column with zero variance
iris$NoVariance <- 1
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species NoVariance
## setosa :50 Min. :1
## versicolor:50 1st Qu.:1
## virginica :50 Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
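# We are going to use the var function to check the variance of each column
# Note: apply() converts the data frame to a matrix first. Because Species is a factor,
# every value becomes a string, and var() has to coerce the strings back to numbers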
apply(iris, 2, var)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0.6856935 0.1899794 3.1162779 0.5810063 NA
## NoVariance
## 0.0000000
# The variance is NA for non-numeric variables such as Species, and 0 for the new column
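The `apply()` call above runs over the whole data frame, factor included. As a minimal sketch of an alternative, assuming we only care about the numeric columns, we can select them with `is.numeric` and compute each variance with `vapply`. The names `df`, `is_num` and `variances` are illustrative and not part of the original example.

```r
# Rebuild the example data on a copy so the tutorial's iris object is untouched.
df <- iris
df$NoVariance <- 1  # constant column, as in the tutorial

# Flag the numeric columns, then compute the variance of each one.
# No character coercion happens, and Species is simply skipped.
is_num <- vapply(df, is.numeric, logical(1))
variances <- vapply(df[is_num], var, numeric(1))
print(variances)
```

Because `vapply` declares the expected return type (`numeric(1)`), it fails loudly if a column ever produces something other than a single number.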
# Let's find which variables have zero variance (which() ignores the NA produced for Species)
which(apply(iris, 2, var) == 0)
## NoVariance
## 6
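# Now that we have found our columns without variance, let's get rid of them
# which() returns a named integer vector of column positions; as.numeric() only drops the names
# Caution: if no column had zero variance, negative indexing with an empty vector would drop every column
iris <- iris[-as.numeric(which(apply(iris, 2, var) == 0))]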
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
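The removal step can also be wrapped in a small reusable function. The sketch below is one possible approach, not the only one: `drop_zero_variance` is a hypothetical helper name, and it assumes that only numeric columns should ever be dropped. It uses a logical mask instead of negative positions, so it also behaves sensibly when no column is constant.

```r
# Hypothetical helper: drop numeric columns whose variance is exactly zero.
drop_zero_variance <- function(df) {
  # TRUE for numeric columns with zero variance.
  # Non-numeric columns (factors, characters) are never flagged,
  # and an NA variance (e.g. a single observation) is treated as "keep".
  is_constant <- vapply(
    df,
    function(col) is.numeric(col) && isTRUE(var(col, na.rm = TRUE) == 0),
    logical(1)
  )
  # Logical indexing is safe even when nothing is flagged;
  # df[-which(is_constant)] would return zero columns in that case.
  df[!is_constant]
}

df <- iris
df$NoVariance <- 1
cleaned <- drop_zero_variance(df)
names(cleaned)

# A second call leaves the columns unchanged, since no constant column remains.
identical(names(drop_zero_variance(cleaned)), names(cleaned))
```

Keeping the check inside `vapply` means the factor column is never passed to `var()`, so no coercion warning is raised.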
In this tutorial we have learnt how to find and remove columns with zero variance from our dataset.