The goal of this tutorial is to remove non-UTF-8 characters from text. This cleaning step comes up often, and it can be wrapped in a function to automate it.
# Imagine we have text with non-printable or special characters that we want to detect and remove
phrase <- c("<U+0096>Hi", "<U+0096>my", "<U+0006>friend")
phrase
## [1] "<U+0096>Hi" "<U+0096>my" "<U+0006>friend"
# First, we remove every character that is not explicitly allowed
phrase_clean <- gsub("[^[:alnum:][:blank:]?&/\\-]", "", phrase)
phrase_clean
## [1] "U0096Hi" "U0096my" "U0006friend"
# "[^[:alnum:][:blank:]?&/\\-]"
# This pattern means: remove everything except:
# [:alnum:] Alphanumeric characters: 0-9, a-z, A-Z
# [:blank:] Spaces and tabs
# ?&/\\- Specific characters you want to keep. Punctuation marks can be added here
# Once the symbols are gone, we can remove the leftover unwanted strings
phrase_clean <- gsub("U00..", "", phrase_clean)
phrase_clean
## [1] "Hi" "my" "friend"
# The . in the gsub pattern matches any single character
# "U00.." therefore removes any five-character sequence that starts with U00
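Because the `"U00.."` pattern relies on wildcards, it would also delete ordinary text that happens to start with `U00`. A hedged alternative, assuming the unwanted markers always appear as literal text of the form `<U+XXXX>` with four hexadecimal digits (as in the example above), is to match the whole marker in a single pass. The names `marker_pattern` and `phrase_single_pass` are illustrative only.

```r
phrase <- c("<U+0096>Hi", "<U+0096>my", "<U+0006>friend")

# Assumed marker format: "<U+", exactly four hexadecimal digits, then ">"
# The + is escaped because it is a regex metacharacter
marker_pattern <- "<U\\+[0-9A-Fa-f]{4}>"

# Remove the complete marker in one step, leaving the other characters untouched
phrase_single_pass <- gsub(marker_pattern, "", phrase)
phrase_single_pass
## [1] "Hi" "my" "friend"
```

This variant does not strip other punctuation, so it can be combined with the character-class filter when both kinds of cleaning are needed.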
In this tutorial we have learnt how to remove special characters from text while keeping the ones we want. In addition, we have used gsub with the . wildcard to match sequences containing any character. These steps can be incorporated into a function to automate the process.
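As a minimal sketch of such a function, the two gsub calls can be wrapped together. The function name `clean_text` and its `keep` argument are hypothetical and not part of any package. The dash is placed last in the default `keep` set so that it is read as a literal character without needing a backslash.

```r
# Hypothetical helper: clean_text() and its `keep` argument are illustrative names
clean_text <- function(x, keep = "?&/-") {
  # Build the negated character class; keep "-" last in `keep`
  # so it is treated as a literal dash rather than a range
  pattern <- paste0("[^[:alnum:][:blank:]", keep, "]")
  x <- gsub(pattern, "", x)
  # Remove the leftover five-character fragments that start with U00
  gsub("U00..", "", x)
}

clean_text(c("<U+0096>Hi", "<U+0096>my", "<U+0006>friend"))
## [1] "Hi" "my" "friend"
```

Characters listed in `keep` survive the cleaning, so `clean_text("<U+0096>Is it 50/50?")` returns `"Is it 50/50?"`.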