We will learn to work with strings. For this we will analyse one of my favorite books: George Orwell’s 1984.
Objectives: Learn string handling, e.g. functions grep(), gsub(), nchar(), strsplit(), and many more
Requirements: None
install.packages("tidyr", repos = "http://cran.us.r-project.org" )
install.packages("wordcloud", repos = "http://cran.us.r-project.org")
library(stringr)
library(ggplot2)
library(wordcloud)
Luckily, this book is available for download. The link is stored in the variable “url”, and the text is downloaded with readLines() and saved in “text_1984”.
url <- "http://gutenberg.net.au/ebooks01/0100021.txt"
text_1984 <- readLines(url)
The book has some overhead: introductory text at the beginning and an appendix at the end. We want to analyse only the book itself, so we filter the text down to its core.
text_1984_filt <- text_1984[46: length(text_1984)]
text_1984_filt <- text_1984_filt[1:9859]
head(text_1984_filt)
## [1] "It was a bright cold day in April, and the clocks were striking thirteen."
## [2] "Winston Smith, his chin nuzzled into his breast in an effort to escape the"
## [3] "vile wind, slipped quickly through the glass doors of Victory Mansions,"
## [4] "though not quickly enough to prevent a swirl of gritty dust from entering"
## [5] "along with him."
## [6] ""
tail(text_1984_filt)
## [1] "" "" "THE END" "" "" ""
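The fixed positions 46 and 9859 only fit this exact file. As a minimal sketch, the boundaries could instead be located from marker lines with grep(). The toy vector and the start marker "Chapter 1" are assumptions for illustration; only "THE END" is taken from the output above.

```r
# Toy stand-in for the downloaded lines (hypothetical content, not the real file)
toy_lines <- c("Title: Example", "Author: Nobody", "Chapter 1",
               "First line of the story.", "Second line.", "THE END",
               "Appendix", "License text")

# Assumed marker lines; check the actual file before relying on them
start_marker <- "^Chapter 1$"
end_marker   <- "^THE END$"

# grep() returns the positions of the matching elements
start_pos <- grep(start_marker, toy_lines)[1] + 1  # first line after the start marker
end_pos   <- grep(end_marker, toy_lines)[1]        # keep the end marker itself

toy_core <- toy_lines[start_pos:end_pos]
print(toy_core)
```

This keeps the filter working if a few header lines are added or removed, as long as the marker lines themselves stay unique.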
What structure does this object have?
str(text_1984_filt) #This structure is not ideal, as there are missing strings and elements through the text due to spaces.
## chr [1:9859] "It was a bright cold day in April, and the clocks were striking thirteen." ...
It is a character vector with 9859 elements. There is just one problem: some elements contain several words, and some don’t contain a single word. Our aim is a vector with a single word as each element.
First, we collapse this vector into one single string. This can be done with the paste() function, using a space as the separator between the elements. In a second step we convert all letters to lowercase with str_to_lower().
# collapse to a single string
text_1984_one_single_string <- paste(text_1984_filt, collapse = " ")
text_1984_one_single_string <- str_to_lower(text_1984_one_single_string)
If you want upper case instead, use str_to_upper(). For title case, use str_to_title().
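To see what collapsing does on a small scale, here is a sketch with three made-up lines and the base R functions tolower() and toupper(), which behave like the stringr functions for plain English text. Note how the empty line turns into two consecutive spaces.

```r
# Toy lines standing in for a few lines of the book (made up for illustration)
toy_lines <- c("It was a bright cold day", "", "Winston Smith")

# paste() with collapse joins all elements into one single string
one_string <- paste(toy_lines, collapse = " ")

# Base R case conversion, analogous to str_to_lower() / str_to_upper()
lower <- tolower(one_string)
upper <- toupper(one_string)

print(lower)
print(upper)
```

The double space left by the empty line is the reason empty elements show up once the string is split at every space.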
Now we create the character vector with single words. We can use str_split() and split at each space (" "). The result is a list with one element. We access this element with “[[1]]”.
# separate each word
text_1984_separate_words <- str_split(string = text_1984_one_single_string, pattern = " ")[[1]]
head(text_1984_separate_words, n=10)
## [1] "it" "was" "a" "bright" "cold" "day" "in" "april,"
## [9] "and" "the"
This looks as desired.
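One detail is worth a quick look: splitting at a single space produces an empty element wherever two spaces meet. The following sketch uses base R strsplit() on a made-up string to compare this with splitting on runs of whitespace. The pattern "[[:space:]]+" is an alternative, not what the article's code uses.

```r
toy_string <- "it was a bright  cold day"   # note the double space

# Splitting on a single space keeps an empty element where two spaces meet
split_single <- strsplit(toy_string, " ")[[1]]

# Splitting on one or more whitespace characters avoids that
split_runs <- strsplit(toy_string, "[[:space:]]+")[[1]]

print(split_single)
print(split_runs)
```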
Are there numbers in the text? We will find out with grep() or str_subset(). These commands search for matches within the text.
But how can we match all numbers? The naive way is to run grep() once per digit with the patterns “0”, “1”, “2”, … But this takes quite some effort and contradicts the DRY principle (don’t repeat yourself).
There is a better way. You can use “[0-9]” to match any digit from 0 to 9. This is a regular expression. Regular expressions are extremely powerful. Some links are shown at the end of this article.
head(grep("[0-9]", text_1984_separate_words, value = TRUE))
## [1] "300" "4th," "1984." "1984." "1944" "1945;"
head(str_subset(string = text_1984_separate_words, pattern = "[0-9]"))
## [1] "300" "4th," "1984." "1984." "1944" "1945;"
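grep() has a few variants that are easy to mix up. A small sketch on a made-up word vector shows the difference between positions, values, and a logical mask, plus an anchored pattern that matches only elements made entirely of digits.

```r
toy_words <- c("in", "1984.", "the", "4th,", "clocks", "300")

# Positions of the elements that contain at least one digit
digit_pos <- grep("[0-9]", toy_words)

# The matching elements themselves
digit_values <- grep("[0-9]", toy_words, value = TRUE)

# One TRUE/FALSE per element, handy for filtering
has_digit <- grepl("[0-9]", toy_words)

# ^ and $ anchor the pattern, so only pure numbers match
digits_only <- grep("^[0-9]+$", toy_words, value = TRUE)

print(digit_pos)
print(digit_values)
print(digits_only)
```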
We can use this regular expression to remove the digits. Hyphens are handled with a second pattern, "-", and are replaced by a space rather than deleted. We will have a separate lecture on regular expressions.
# delete numbers
text_1984_separate_words <- str_replace_all(pattern = "[0-9]",
                                            replacement = "",
                                            string = text_1984_separate_words)
# replace hyphens with a space
text_1984_separate_words <- str_replace_all(pattern = "-",
                                            replacement = " ",
                                            string = text_1984_separate_words)
head(text_1984_separate_words, n = 100)
## [1] "it" "was" "a" "bright" "cold" "day"
## [7] "in" "april," "and" "the" "clocks" "were"
## [13] "striking" "thirteen." "winston" "smith," "his" "chin"
## [19] "nuzzled" "into" "his" "breast" "in" "an"
## [25] "effort" "to" "escape" "the" "vile" "wind,"
## [31] "slipped" "quickly" "through" "the" "glass" "doors"
## [37] "of" "victory" "mansions," "though" "not" "quickly"
## [43] "enough" "to" "prevent" "a" "swirl" "of"
## [49] "gritty" "dust" "from" "entering" "along" "with"
## [55] "him." "" "the" "hallway" "smelt" "of"
## [61] "boiled" "cabbage" "and" "old" "rag" "mats."
## [67] "at" "one" "end" "of" "it" "a"
## [73] "coloured" "poster," "too" "large" "for" "indoor"
## [79] "display," "had" "been" "tacked" "to" "the"
## [85] "wall." "it" "depicted" "simply" "an" "enormous"
## [91] "face," "more" "than" "a" "metre" "wide:"
## [97] "the" "face" "of" "a"
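The words above still carry punctuation ("april,", "him."), so any exact comparison treats "winston" and "winston," as different strings. The article does not strip punctuation, but as a sketch under that caveat, base R gsub() can do it in the same replace-by-pattern style. The toy vector is made up for illustration.

```r
toy_words <- c("april,", "thirteen.", "o'brien", "wide:", "face")

# Remove every punctuation character, including the apostrophe
no_punct <- gsub("[[:punct:]]", "", toy_words)

# Remove only a hand-picked set of trailing marks, keeping apostrophes inside words
trimmed <- gsub("[.,;:!?]+$", "", toy_words)

print(no_punct)
print(trimmed)
```

Which variant is appropriate depends on the question: the first one would turn "o'brien" into "obrien", which matters if names are looked up later.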
There are still empty elements, which we will delete in the next step. Empty elements have zero characters, which we can find out with str_length(). So we filter for str_length() > 0.
# delete empty words
text_1984_separate_words <- text_1984_separate_words[str_length(text_1984_separate_words) > 0]
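The same filter can be written in base R with nchar(), or with nzchar(), which directly tests for non-empty strings. A short sketch on a made-up vector:

```r
toy_words <- c("him.", "", "the", "", "hallway")

# Keep elements with at least one character
kept <- toy_words[nchar(toy_words) > 0]

# nzchar() returns TRUE for every non-empty string
kept_alt <- toy_words[nzchar(toy_words)]

print(kept)
```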
The main characters are “Winston”, “Julia”, “O’Brien” and, in a way, “Big Brother”. With table() the number of occurrences of each word is counted. We concentrate on the main characters and select their counts with “[ ]”.
table(text_1984_separate_words)[c("winston", "julia", "o\'brien", "brother")]
## text_1984_separate_words
## winston julia o'brien brother
## 315 44 120 40
Not surprisingly, “Winston” as the main character has the most appearances. This table is not sorted. We can order it with sort(). The default is ascending order, but with the parameter “decreasing = TRUE” it is changed to descending.
sort(table(text_1984_separate_words)[c("winston", "julia", "o\'brien", "brother")], decreasing = TRUE)
## text_1984_separate_words
## winston o'brien julia brother
## 315 120 44 40
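The counting pattern is easier to follow on a handful of words. This sketch uses a made-up vector and shows how table() counts, how names select entries, and how sort() orders the result.

```r
toy_words <- c("winston", "julia", "winston", "brother", "winston", "julia")

# table() counts how often each distinct string occurs
counts <- table(toy_words)

# Select entries by name, in the order given
selected <- counts[c("winston", "julia")]

# Sort the whole table from most to least frequent
sorted <- sort(counts, decreasing = TRUE)
top_name <- names(sorted)[1]

print(selected)
print(sorted)
```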
I am curious to find out what the shortest word is. We already know that the length of a word can be found with str_length(). Now we need the position of the minimum, which we get with which.min().
pos_min <- which.min(str_length(text_1984_separate_words))
pos_max <- which.max(str_length(text_1984_separate_words))
text_1984_separate_words[pos_min]
## [1] "a"
text_1984_separate_words[pos_max] # This has clearly not worked, and therefore if we were analysing this dataset, we would need to go back and edit the cleaning steps to figure it out.
## [1] "dirty mindedness everything"
Surprise, surprise: the shortest word is “a”. The counterpart, which.max(), returns the position of the longest element. The result above is not a single word, so find out for yourself what the longest word is.
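The longest "word" contains spaces because the hyphens were replaced by spaces after the text had already been split, so a hyphenated token stays a single element. One possible repair, sketched on made-up data, is to split once more on whitespace. The token "dirty-mindedness--everything" is an assumed original form, not checked against the book.

```r
toy_words <- c("a", "dirty-mindedness--everything", "face")

# Replace hyphens with spaces, as in the cleaning step
replaced <- gsub("-", " ", toy_words)

# Split again on runs of whitespace so each piece becomes its own element
resplit <- unlist(strsplit(replaced, "[[:space:]]+"))
resplit <- resplit[nzchar(resplit)]

# which.max() / which.min() return the first position in case of ties
longest  <- resplit[which.max(nchar(resplit))]
shortest <- resplit[which.min(nchar(resplit))]

print(resplit)
print(longest)
```

Because which.max() reports only the first maximum, comparing against max(nchar(resplit)) is a way to list all words that share the top length.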
What is the distribution of word lengths? Let’s plot a histogram with hist().
hist(str_length(text_1984_separate_words), breaks = seq(1, 30, 1))
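For exact numbers instead of a picture, the word lengths can also be tabulated. A minimal sketch with base R nchar() on a made-up vector:

```r
toy_words <- c("it", "was", "a", "bright", "cold", "day", "in", "april")

# Number of characters per word
word_lengths <- nchar(toy_words)

# How many words have each length
length_counts <- table(word_lengths)

# Average word length
mean_length <- mean(word_lengths)

print(length_counts)
print(mean_length)
```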
How often does each letter appear in the text? To find out, we split our single text string between every character by using "" as the separator. With table() we can count the occurrences of each letter.
single_chars <- str_to_lower(string = strsplit(text_1984_one_single_string, "")[[1]])
char_freq <- table(single_chars)[letters]
The relative frequencies are shown in this graph.
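The code for the graph is not shown, so here is one way it could be produced, sketched on a short made-up string. Converting to a factor with levels = letters is a variant of the indexing used above: it drops non-letters and keeps letters that never occur as zero counts.

```r
toy_string <- "it was a bright cold day in april"
single_chars <- strsplit(tolower(toy_string), "")[[1]]

# Count only a-z; characters outside `letters` (such as spaces) are dropped
char_counts <- table(factor(single_chars, levels = letters))

# Relative frequencies sum to 1
rel_freq <- char_counts / sum(char_counts)

# Simple bar chart of the relative letter frequencies
barplot(rel_freq, xlab = "letter", ylab = "relative frequency")
```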
What is this good for? Well, each language has its own characteristic distribution of letters. Without being able to speak English at all, we can find out that this is an English text: just compare this distribution with the Wikipedia article on letter frequencies (link at the end of the article).
If some monoalphabetic encryption is used, this is the key to decrypt it and find the plain text. If you are interested in a simple encryption system read this article.
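To illustrate why letter frequencies help here, the sketch below applies a Caesar shift by three letters with chartr(). The Caesar shift is chosen as an assumed example of a monoalphabetic cipher, and the text is made up. The substitution only relabels the letters, so the sorted counts stay identical.

```r
plain <- "it was a bright cold day in april"

# Caesar shift by 3: a -> d, b -> e, ..., x -> a, y -> b, z -> c
shifted <- c(letters[4:26], letters[1:3])
old_alphabet <- paste(letters, collapse = "")
new_alphabet <- paste(shifted, collapse = "")

cipher  <- chartr(old_alphabet, new_alphabet, plain)
decoded <- chartr(new_alphabet, old_alphabet, cipher)

# Helper (hypothetical name): letter counts over a-z for one string
count_letters <- function(s) {
  chars <- strsplit(s, "")[[1]]
  table(factor(chars, levels = letters))
}

plain_counts  <- sort(as.integer(count_letters(plain)), decreasing = TRUE)
cipher_counts <- sort(as.integer(count_letters(cipher)), decreasing = TRUE)

print(cipher)
print(identical(plain_counts, cipher_counts))
```

Matching the most frequent cipher letters to the most frequent letters of the assumed language is the usual first step in breaking such a cipher, although a text this short would not give reliable frequencies.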
Finally, we will use a nice visualisation technique: word clouds. A word cloud shows the most common words of a text, and the size of each word represents its number of occurrences.
wordcloud(words = text_1984_separate_words[1:2000])
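A variant that makes the counting explicit is to compute the frequencies with table() first and pass them through the freq argument of wordcloud(). The sketch below uses a made-up word vector and only draws the cloud if the wordcloud package is installed.

```r
toy_words <- c("the", "party", "the", "winston", "the", "party", "war")

# Count each distinct word and sort from most to least frequent
word_counts <- sort(table(toy_words), decreasing = TRUE)
cloud_words <- names(word_counts)
cloud_freq  <- as.integer(word_counts)

# Draw the cloud only when the package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = cloud_words, freq = cloud_freq, min.freq = 1)
}
```

min.freq = 1 is set because the toy data has words that occur only once; for the full book a higher threshold would keep the plot readable.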