
PrepaRing JM Barrie’s PeteR Pan

Upload

scan() is a {base} function. It reads data into a vector or list from the console or file.

text.v <- scan("http://www.gutenberg.org/files/16/16-0.txt", what="character", encoding="UTF-8", sep="\n")
# when reading from Project Gutenberg, be sure to use the exact URL that displays the file you wish to read in.

Using scan(), we build a character vector called text.v; each element holds one line of the file.

Note we are using the arbitrary extension .v, a convention suggested by Matt Jockers to keep track of the data type: here, vector
sep is an argument of scan() that tells it where to separate the input
\n is the escape sequence for the newline (line feed) character
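
To see what sep="\n" does without downloading anything, here is a minimal sketch that feeds scan() a short string through its text argument. The sample text and the names sample.text and demo.v are made up for illustration.

```r
# Hypothetical three-line sample; "\n" marks each line break
sample.text <- "first line\nsecond line\nthird line"

# scan() splits the input at every newline and returns a character vector
demo.v <- scan(text = sample.text, what = "character", sep = "\n", quiet = TRUE)

length(demo.v)  # one element per line
demo.v[2]       # the second line
```

The same call with a file name or URL instead of text= behaves the same way, one element per line.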

Write to disc

Note that we can also write files from R to disc. Here, we take our vector text.v and write it to file as a .txt file:

write(text.v, file="peterpan_utf8.txt")

Above we’ve read the entire novel Peter Pan from the web. Now that we’ve made our own .txt file, we can also read it into our environment from disc.

text.v <- scan("peterpan_utf8.txt", what="char", sep="\n")
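
The write-then-read round trip can be tried out on a throwaway file first. This is only a sketch: lines.v, tmp.file and back.v are illustrative names, and tempfile() is used so that nothing in the project folder is touched.

```r
lines.v <- c("first line", "second line", "third line")  # toy data, not the novel
tmp.file <- tempfile(fileext = ".txt")                    # throwaway path

write(lines.v, file = tmp.file)   # a character vector is written one element per line

# read the file back in, again splitting at newlines
back.v <- scan(tmp.file, what = "character", sep = "\n", quiet = TRUE)

identical(lines.v, back.v)        # did the round trip preserve the vector?
unlink(tmp.file)                  # delete the throwaway file
```

Keep in mind that scan() skips blank lines by default, so a file with empty lines comes back shorter than the vector that was written.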

Scrutinize your R objects

Get structural information

str() displays the structure of an arbitrary R object. head() prints the first 6 elements (here: lines). tail() prints the last 6.

str(text.v)
head(text.v)
tail(text.v)

Print and view your R objects

I print text.v to see the whole vector.

text.v

But R gives me the first 1,000 lines only.

If I wish to view the whole vector, I can use View() (note the case sensitive spelling!)

View(text.v)

Else, when printing directly, like I did above, I can change the max.print default.

options(max.print=1000000)
text.v
# R's own default for max.print is 99999; RStudio typically lowers it to 1000, so I need to extend that window (here, I try with 1,000,000)
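
If you would rather not change the setting for the whole session, one possible pattern is to save the old value and restore it afterwards. A small sketch, with old.opts as an illustrative name and a deliberately tiny limit:

```r
old.opts <- options(max.print = 20)  # options() hands back the previous setting
getOption("max.print")               # the limit currently in force

print(seq_len(100))                  # printing stops once the limit is reached

options(old.opts)                    # restore whatever was set before
```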

Clean the text file

Interact with your Data: Using those square brackets

Basics: Using the index [ ] for data manipulation.

The character vector text.v is indexed: each line has a number. Using square brackets, text.v[6] returns element number 6, i.e. the sixth line.

text.v[6] 
text.v[600]
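
Square brackets accept more than a single number. The toy vector below (toy.v, made up for this sketch) shows a few common forms:

```r
toy.v <- c("alpha", "beta", "gamma", "delta", "epsilon")

toy.v[2]        # a single element
toy.v[2:4]      # a range of elements
toy.v[c(1, 5)]  # hand-picked positions
toy.v[-1]       # everything except the first element
```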

Remove unwanted material

Our text files will often contain superfluous lines (“boilerplate”) that we wish to remove.

Use which() for this:
it finds the indices of particular character strings
these indices refer to entire lines: e.g., text.v[600] returns one whole line of the file
two equal signs == serve as the comparison operator

which(text.v == "Chapter 1 PETER BREAKS THROUGH") 
View(text.v)
start.v <- 39

Within our vector text.v, this code cannot search for single words (yet): == only matches entire lines.
A chapter heading can occur more than once (for instance in the table of contents and again in the body), in which case which() returns several indices.
Looking at the file with View(text.v) shows that line 39 is what we want to set as start.v.
text.v[39] returns the same line!
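
How which() behaves with several hits, and with none, is easy to check on a toy vector. Everything here is made up for illustration, including the choice to take the last hit as the start of the body text; in a real file you would confirm the right index with View().

```r
toy.v <- c("Contents", "Chapter 1", "Chapter 2",
           "Some preface", "Chapter 1", "All children grow up.")

hits.v <- which(toy.v == "Chapter 1")  # the heading appears twice
hits.v

# assumption for this sketch: the last hit is the heading in the body
start.toy <- hits.v[length(hits.v)]
toy.v[start.toy]

which(toy.v == "children")  # integer(0): no whole line equals this single word
```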

Now we search for the last line of the novel within text.v. We inspect text.v (it is still open in RStudio, or we use View() again).

end.v <- which(text.v == "THE END") 
end.v

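text.v[4480] returns the same line that end.v points to.
end.v <- 4480 may be used as an alternative to end.v <- which(text.v == "THE END").
It is important to understand when which(text.v == "THE END") does not work: == matches entire lines exactly, so if the line in the file differs in any way (for example, it ends with a full stop), which() finds nothing. In that case, look up the line number in View(text.v) and assign it by hand.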
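
The exact-match pitfall can be reproduced in miniature. In this sketch the toy vector, the name end.toy, and the use of startsWith() as a more forgiving fallback are all assumptions for illustration; they are not part of the original workflow.

```r
toy.v <- c("last sentence of the story.", "THE END.", "licence text")

end.toy <- which(toy.v == "THE END")  # exact comparison, without the full stop
length(end.toy)                        # 0 hits: the line in the vector is "THE END."

# fallback: match every line that begins with the marker
end.toy <- which(startsWith(toy.v, "THE END"))
toy.v[end.toy]
```

Checking length(end.toy) before going on is a cheap safeguard, because indexing with an empty result fails only later and less obviously.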

You’re familiar with class() and length(), right?

class(end.v) # what type of data is end.v?
length(text.v) # how many lines does text.v have?

We see that end.v is an integer. And we see that text.v contains 4808 lines.

To clean the text, we build new vectors

“start.boilerplate.v”, “end.boilerplate.v”, “novel.lines.v”, and “boilerplate.v”

start.boilerplate.v <- text.v[1:(start.v-1)] # parentheses make the range explicit: ':' binds more tightly than '-'

This creates a vector from the first line up to one line before the text proper starts.

An alternative is

start.boilerplate.v <- text.v[1:(39-1)]
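
A note on operator precedence: in R the colon binds more tightly than minus, so 1:n-1 and 1:(n-1) are different sequences. The sketch below uses made-up names (n, toy.v) to show the difference, and why indexing still returns the same elements.

```r
n <- 5

1:n - 1     # evaluated as (1:n) - 1, so the sequence starts at 0
1:(n - 1)   # the intended sequence from 1 to n - 1

toy.v <- letters[1:10]

# Both forms select the same elements, because an index of 0 is silently dropped
identical(toy.v[1:n - 1], toy.v[1:(n - 1)])
```

Writing the parentheses is still the safer habit, since the two sequences are not interchangeable outside of indexing.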

We also need to single out the boilerplate at the end of the file.

end.boilerplate.v <- text.v[(end.v+1):length(text.v)]
# This is the same:
#end.boilerplate.v <- text.v[(4480+1):length(text.v)]

Remember that end.v is the last line of my target text. Here I select everything from the first line after end.v up to the last line of the vector, length(text.v).

Create clean text!

And now: we create clean text. This means, for now, that we exclude the content we are not interested in.

novel.lines.v <- text.v[start.v:end.v]

Just for fun, we can also create a boilerplate vector. We use the combine function c()

boilerplate.v <- c(start.boilerplate.v, end.boilerplate.v)

We can validate our progress so far:

length(novel.lines.v) + length(boilerplate.v)
length(text.v)
#nothing lost?
length(text.v) - length(boilerplate.v) 
#how much shorter is the new, clean vector?

The “boilerplate” vector can be built more elegantly in one step:

boilerplate.v <- c(text.v[1:(start.v-1)], text.v[(end.v+1):length(text.v)])
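
The whole slicing step, including the sanity check, can be rehearsed on a seven-line toy file. All names ending in .toy and the content of toy.v are invented for this sketch.

```r
toy.v <- c("header 1", "header 2", "Chapter 1", "story line",
           "THE END", "licence 1", "licence 2")

start.toy <- which(toy.v == "Chapter 1")  # first line of the text proper
end.toy   <- which(toy.v == "THE END")    # last line of the text proper

novel.toy  <- toy.v[start.toy:end.toy]
boiler.toy <- c(toy.v[1:(start.toy - 1)],
                toy.v[(end.toy + 1):length(toy.v)])

# nothing lost: the two parts add up to the original length
stopifnot(length(novel.toy) + length(boiler.toy) == length(toy.v))
```

Using stopifnot() turns the visual comparison of two numbers into a check that halts the script if the lengths ever disagree.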

Prepare the text

We use the paste() function to join all the lines and collapse them into one long string.

novel.v <- paste(novel.lines.v, collapse=" ")

The single space " " is the glue character placed between the lines.
Note: if you write "" instead, the last word of each line is fused with the first word of the next line in novel.v!
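
The effect of the glue character is easiest to see on three short lines. The vector lines.toy is a made-up example:

```r
lines.toy <- c("the boy who", "would not", "grow up")

with.space <- paste(lines.toy, collapse = " ")
with.space   # words at the line ends stay separated

no.space <- paste(lines.toy, collapse = "")
no.space     # "who" and "would" are fused, as are "not" and "grow"

length(with.space)  # a single string either way
```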

length(novel.v) # now, we have a SINGLE character string!
View(novel.v) # the whole novel is stashed into the one line!

… but we want to be able to index the individual words…

Tokenize!

In order to split this enormous string into single strings (tokens), we use tolower() and strsplit() and unlist().

novel.lower.v <- tolower(novel.v)

Make a list of tokens with strsplit() and then transform it into an atomic vector with unlist():

peter.words.l <- strsplit(novel.lower.v, "\\W")
peter.words.v <- unlist(peter.words.l) # we need an ATOMIC VECTOR, therefore unlist()

Compare the output of str(peter.words.v) and str(peter.words.l):

str(peter.words.v)
str(peter.words.l)

You see the difference!
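
For a version small enough to read at a glance, the same three steps can be run on one short sentence. The sentence and the toy.* names are invented for this sketch.

```r
toy.s <- tolower("The cat, the hat")

toy.l <- strsplit(toy.s, "\\W")  # split at every non-word character
class(toy.l)                     # a list
length(toy.l)                    # with one element: the input was one string

toy.words.v <- unlist(toy.l)     # flatten into an atomic character vector
class(toy.words.v)
toy.words.v                      # note the empty string where "," and " " meet
```

The empty string appears because the comma and the space that follows it are two separate split points with nothing between them.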

More clean up

Remove the extra empty spaces

Applying the regex \\W leaves behind extra empty strings ("").
We now identify them by defining their opposite.
This gives us an index of all positions in peter.words.v that do not hold an empty entry.

not.blanks.v <- which(peter.words.v !="") # != means "unequal"
not.blanks.v[1:10] # missing are 3, 5, 9, 13: these are "blanks!"

Overwrite vector with exactly those elements that are not empty

peter.words.v <- peter.words.v[not.blanks.v]

This is possible because not.blanks.v is an integer vector of positions, which can be used to index peter.words.v.

# same, but more elegant
# peter.words.v <- peter.words.v[which(peter.words.v != "")]
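
On a toy vector the filtering step looks like this. The names toy.words.v, keep.v, by.position.v and by.logical.v are illustrative; the logical-index variant is an equivalent shortcut and not something the steps above require.

```r
toy.words.v <- c("the", "cat", "", "the", "hat")

keep.v <- which(toy.words.v != "")  # positions of the non-empty entries
keep.v                               # position 3 is missing: that is the blank

by.position.v <- toy.words.v[keep.v]             # index by position
by.logical.v  <- toy.words.v[toy.words.v != ""]  # index by TRUE/FALSE directly

identical(by.position.v, by.logical.v)
```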

How about trying the same with the list peter.words.l?

peter.words.l <- peter.words.l[which(peter.words.l != "")]

This doesn’t work, because, remember: peter.words.l contains just one list element (whose sub-elements we cannot index like this).
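
If you do want to clean the list itself, one option is to reach inside its single element with double square brackets. This is a sketch on invented data (toy.l, inner.v), not a step the workflow above depends on.

```r
toy.l <- list(c("the", "cat", "", "hat"))

length(toy.l)          # 1: the list holds a single element
inner.v <- toy.l[[1]]  # [[ ]] extracts the character vector inside
length(inner.v)        # 4: the tokens live one level down

toy.l[[1]] <- inner.v[inner.v != ""]  # filter the inner vector, store it back
toy.l
```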

Voilà the tokenized text!

Now we have completed our first mission.
the vector peter.words.v is now ready for computation
it is an atomic vector (not a list, i.e. a generic vector)

View(peter.words.v)
rm(peter.words.v) # removes the vector from the workspace; skip this line if you want to keep working with it

R For Textual Data. Preparing Textual Data from Scratch

JTL

2020-06-29