1 Goal


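The goal of this tutorial is to remove non-printable and special characters from text, including characters that show up as escaped Unicode code points such as <U+0096>. This cleaning step is often needed when reading text, and it can be wrapped in a function to automate the procedure.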


2 Removing special characters from text


# Imagine we have a text with non-printable or special characters that we want to read and remove

phrase <- c("<U+0096>Hi", "<U+0096>my", "<U+0006>friend")
phrase
## [1] "<U+0096>Hi"     "<U+0096>my"     "<U+0006>friend"
# First, we remove every character that is not in a set of allowed characters

phrase_clean <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", phrase)
phrase_clean
## [1] "U0096Hi"     "U0096my"     "U0006friend"
# "[^[:alnum:][:blank:]?&/\\-]"
# This pattern means: remove everything except:
# [:alnum:] Alphanumeric characters: 0-9, a-z, A-Z
# [:blank:] Spaces and tabs
# ?&/\\- Specific characters you want to keep. Any other punctuation marks you want to keep can be added here
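
To see which characters the pattern keeps, it can help to run it on a sentence that contains both allowed and disallowed punctuation. The snippet below is a small sketch; the sample sentence and the variable names are made up for illustration and are not part of the original data.

```r
# Same pattern as above: keep alphanumerics, blanks and ? & / -
allowed_only <- "[^[:alnum:][:blank:]?&/\\-]"

# Hypothetical sample sentence with kept (? & / -) and removed (!) punctuation
sample_text <- "Who? Me & you/us - ok!"

kept <- gsub(allowed_only, "", sample_text)
kept
```

The exclamation mark is removed, while the question mark, ampersand, slash and hyphen survive because they are listed inside the brackets.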

# The first pass strips <, + and > but leaves the code point text (U0096, U0006) behind, so a second pass removes it
phrase_clean <- gsub("U00..", "", phrase_clean)
phrase_clean
## [1] "Hi"     "my"     "friend"
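# The . in the gsub pattern matches any single character
# "U00.." therefore removes any 5-character sequence that starts with U00
# Note that this also removes such a sequence from ordinary text that happens to contain U00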
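
As an alternative to the two passes, the escaped code points can be matched directly before anything else is stripped. This is a sketch that assumes every unwanted token has the exact form <U+XXXX> with four hexadecimal digits; the variable name phrase_direct is made up for illustration.

```r
phrase <- c("<U+0096>Hi", "<U+0096>my", "<U+0006>friend")

# Match a literal "<U+", exactly four hexadecimal digits, then ">"
# The + is escaped because it is otherwise a quantifier
phrase_direct <- gsub("<U\\+[0-9A-Fa-f]{4}>", "", phrase)
phrase_direct
```

Because the pattern includes the angle brackets and the plus sign, it does not touch ordinary text that merely contains U00, which the "U00.." pattern would.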

3 Conclusion


In this tutorial we have learnt how to remove special characters from text while keeping the ones we want. In addition, we have used gsub with the . wildcard to match sequences that contain arbitrary characters. These steps can be incorporated into a function to automate the process.
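
A minimal sketch of such a function is shown here. It simply wraps the two gsub calls from section 2; the name clean_phrase is hypothetical and chosen only for this example.

```r
# Hypothetical helper; the name clean_phrase is not part of the tutorial
clean_phrase <- function(x) {
  # Pass 1: keep only alphanumerics, blanks and ? & / -
  x <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", x)
  # Pass 2: drop the leftover code point text, e.g. U0096
  gsub("U00..", "", x)
}

cleaned <- clean_phrase(c("<U+0096>Hi", "<U+0096>my", "<U+0006>friend"))
cleaned
```

Since gsub is vectorised, the function accepts a character vector of any length and returns a cleaned vector of the same length.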