The goal of this tutorial is to remove all whitespace from the strings in a data frame. This is useful when we want to compare strings that may contain whitespace in different positions.
# We create a dataframe deliberately containing repeated levels with different white spaces
name <- c(" John", "John ", "Peter", "Mark", "Joseph", "Alex", "Alex ")
town <- c(" Barcelona", "Barcelona", "Valencia", "Paris ", "London", "Barcelona", "London ")
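# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer converted to factors by default
people <- data.frame(name, town, stringsAsFactors = TRUE)
# Spaces count as characters, so " John" and "John " are treated as different levels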
levels(people$name)
## [1] " John" "Alex" "Alex " "John " "Joseph" "Mark" "Peter"
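# We can use apply() with gsub() to remove all whitespace from every column
# (the pattern '\\s+' also removes whitespace inside a string, not only at its ends)
people_new <- apply(people, 2, function(x) gsub('\\s+', '', x))
# However, apply() returns a character matrix, not a data frame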
str(people_new)
## chr [1:7, 1:2] "John" "John" "Peter" "Mark" "Joseph" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:2] "name" "town"
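# So we have to convert the result back to a data frame
# (stringsAsFactors = TRUE is needed in R >= 4.0 to get factor columns)
people_new <- as.data.frame(apply(people, 2, function(x) gsub('\\s+', '', x)), stringsAsFactors = TRUE)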
str(people_new)
## 'data.frame': 7 obs. of 2 variables:
## $ name: Factor w/ 5 levels "Alex","John",..: 2 2 5 4 3 1 1
## $ town: Factor w/ 4 levels "Barcelona","London",..: 1 1 4 3 2 1 2
# Now the levels are built correctly, with no duplicates caused by whitespace
levels(people_new$name)
## [1] "Alex" "John" "Joseph" "Mark" "Peter"
levels(people_new$town)
## [1] "Barcelona" "London" "Paris" "Valencia"
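As an alternative to `apply()` followed by a conversion, the same cleanup can be sketched with `lapply()`, which keeps the data frame structure throughout. This is a minimal sketch rather than the method used above: the names `people_alt` and `strip_ws` are hypothetical, and it assumes every column holds character or factor data.

```r
# Rebuild the example data so this snippet runs on its own
name <- c(" John", "John ", "Peter", "Mark", "Joseph", "Alex", "Alex ")
town <- c(" Barcelona", "Barcelona", "Valencia", "Paris ", "London", "Barcelona", "London ")
people_alt <- data.frame(name, town, stringsAsFactors = FALSE)

# Hypothetical helper: remove every whitespace character from a vector
strip_ws <- function(x) gsub("\\s+", "", as.character(x))

# Assigning into people_alt[] replaces the columns but keeps the data frame
people_alt[] <- lapply(people_alt, strip_ws)

# Convert the cleaned columns to factors only if levels are needed
people_alt[] <- lapply(people_alt, factor)
levels(people_alt$name)
levels(people_alt$town)
```

Because `lapply()` works column by column, the result never passes through a matrix, so no conversion back to a data frame is needed.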
In this tutorial we have learnt how to remove all whitespace from an entire data frame. This process can be useful in text analysis tasks such as sentiment analysis, natural language processing or name identification.
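One caveat worth illustrating: the pattern `\\s+` removes whitespace everywhere in a string, including between words. The sketch below uses a hypothetical vector `city` (not part of the data above) to contrast this with base R's `trimws()`, which strips only leading and trailing whitespace.

```r
# Hypothetical values: two of them contain a space inside the string
city <- c(" New York", "New York ", "NewYork")

# gsub() with "\\s+" deletes every whitespace character, including internal ones
all_removed <- gsub("\\s+", "", city)

# trimws() removes only leading and trailing whitespace
ends_removed <- trimws(city)

print(all_removed)
print(ends_removed)
```

Which behaviour is preferable depends on the data: removing everything makes "New York" and "NewYork" compare as equal, while `trimws()` keeps them distinct.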