1 Goal

The goal of this tutorial is to remove all whitespace from a data frame. This is useful when we want to compare strings that may contain stray spaces in different positions, such as " John" and "John ".
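As a minimal sketch of the problem, the snippet below compares two strings that differ only in where the space sits. The helper name `strip_ws` is an assumption made for illustration; it simply wraps the same `gsub()` call used later in this tutorial.

```r
# Two strings that look alike but differ by surrounding whitespace
a <- " John"
b <- "John "

# A direct comparison fails because spaces count as characters
a == b
## [1] FALSE

# strip_ws is a hypothetical helper: it removes every whitespace character
strip_ws <- function(x) gsub("\\s+", "", x)

# After stripping, the two values compare as equal
strip_ws(a) == strip_ws(b)
## [1] TRUE
```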

2 Create the dataset

# We create a data frame that deliberately contains repeated values differing only in whitespace

name <- c(" John", "John ", "Peter", "Mark", "Joseph", "Alex", "Alex ")
town <- c(" Barcelona", "Barcelona", "Valencia", "Paris ", "London", "Barcelona", "London ")

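# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer converted to factors by default
people <- data.frame(name, town, stringsAsFactors = TRUE)

# In strings, spaces count as characters, so values that differ only in whitespace become separate levels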
levels(people$name)

## [1] " John"  "Alex"   "Alex "  "John "  "Joseph" "Mark"   "Peter"

3 Remove white spaces

# We can apply gsub() to every column to remove all whitespace
people_new <- apply(people, 2, function(x) gsub("\\s+", "", x))

# However, apply() returns a character matrix, not a data frame
str(people_new)

##  chr [1:7, 1:2] "John" "John" "Peter" "Mark" "Joseph" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "name" "town"
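The matrix appears because `apply()` coerces the data frame with `as.matrix()` before calling the function. As an aside, one possible way to avoid that round trip is to use `lapply()` and assign back into the existing columns. This is a sketch rather than the tutorial's method; `people_clean` is a hypothetical name, and the data is rebuilt here only so the block runs on its own.

```r
# Same example data as above, rebuilt so this block is self-contained
people <- data.frame(
  name = c(" John", "John ", "Peter", "Mark", "Joseph", "Alex", "Alex "),
  town = c(" Barcelona", "Barcelona", "Valencia", "Paris ", "London",
           "Barcelona", "London "),
  stringsAsFactors = TRUE
)

# Assigning into people_clean[] replaces the columns but keeps the data frame
people_clean <- people
people_clean[] <- lapply(people_clean, function(x) {
  # gsub() coerces the factor to character; factor() rebuilds the levels
  factor(gsub("\\s+", "", x))
})

class(people_clean)
levels(people_clean$name)
```

Because the result never leaves the data frame, no later conversion step is needed with this variant.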

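# To get a data frame back, we convert the matrix with as.data.frame()
# (stringsAsFactors = TRUE is needed in R >= 4.0 to obtain factor columns)

people_new <- as.data.frame(apply(people, 2, function(x) gsub("\\s+", "", x)), stringsAsFactors = TRUE)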
str(people_new)

## 'data.frame':    7 obs. of  2 variables:
##  $ name: Factor w/ 5 levels "Alex","John",..: 2 2 5 4 3 1 1
##  $ town: Factor w/ 4 levels "Barcelona","London",..: 1 1 4 3 2 1 2

# Now the duplicated names and towns collapse into the correct levels
levels(people_new$name)

## [1] "Alex"   "John"   "Joseph" "Mark"   "Peter"

levels(people_new$town)

## [1] "Barcelona" "London"    "Paris"     "Valencia"
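Note that the pattern `\\s+` removes every whitespace character, including spaces between words. If a column may hold multi-word values, a gentler option is base R's `trimws()`, which strips only leading and trailing whitespace. The sketch below uses made-up town names that are not part of the dataset above.

```r
# Hypothetical multi-word values, not part of the tutorial's dataset
town <- c(" New York", "New York ", "San  Sebastian")

# Removing all whitespace also deletes the spaces between words
squashed <- gsub("\\s+", "", town)
squashed
## [1] "NewYork"      "NewYork"      "SanSebastian"

# trimws() strips only leading and trailing whitespace
trimmed <- trimws(town)

# Optionally collapse internal runs of whitespace into a single space
normalized <- gsub("\\s+", " ", trimmed)
normalized
## [1] "New York"      "New York"      "San Sebastian"
```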

4 Conclusion

In this tutorial we have learnt how to remove all whitespace from an entire data frame. This process can be useful in text analysis tasks such as sentiment analysis, natural language processing, or name identification.

Remove all white spaces from text dataset

Ubiqum Code Academy
