scan() is a {base} function. It reads data into a vector or list from the console or file.
text.v <- scan("http://www.gutenberg.org/files/16/16-0.txt", what="character", encoding="UTF-8", sep="\n")
# When reading from Project Gutenberg, be sure to use the exact URL that displays the file you wish to read in.
Using scan(), we build a character vector called text.v, with one element per line of the file.
- text.v ends in .v, a convention suggested by Matt Jockers to keep track of the data type: here, a vector
- sep is an argument of scan() that tells it where to separate the input
- "\n" is the escape sequence for the newline character (it is not a regular expression), so each line of the file becomes one element
Note that we can also write files from R to disk. Here, we take our vector text.v and write it to a .txt file:
write(text.v, file="peterpan_utf8.txt")
Above we’ve read the entire novel Peter Pan from the web. Now that we’ve made our own .txt file, we can also read it into our environment from disc.
text.v <- scan("peterpan_utf8.txt", what="char", sep="\n")
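The write-then-read round trip can be tried on a toy vector first. The following is only a sketch: toy.lines.v, toy.back.v and the temporary file are made-up stand-ins and not part of the Peter Pan data.

```r
# Toy stand-in for text.v (hypothetical lines, not the real file)
toy.lines.v <- c("Chapter 1", "All children, except one, grow up.", "THE END")

# Write to a temporary file so nothing in the working directory is touched
tmp.file <- tempfile(fileext = ".txt")
write(toy.lines.v, file = tmp.file)

# Read it back, one vector element per line
toy.back.v <- scan(tmp.file, what = "character", sep = "\n", quiet = TRUE)

identical(toy.lines.v, toy.back.v)  # compares the original and the re-read vector
unlink(tmp.file)                    # clean up the temporary file
```

Keep in mind that scan() skips blank lines by default (blank.lines.skip = TRUE), so empty lines in a file do not come back as elements.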
str() displays the structure of an arbitrary R object. head() prints the first 6 elements (here: lines). tail() prints the last 6.
str(text.v)
head(text.v)
tail(text.v)
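Both head() and tail() accept an n argument if 6 elements are too many or too few. A small sketch on a made-up vector (toy.v is hypothetical and only serves as illustration):

```r
toy.v <- paste("line", 1:20)   # hypothetical 20-element character vector

head(toy.v)          # first 6 elements (the default)
head(toy.v, n = 3)   # first 3 elements
tail(toy.v, n = 2)   # last 2 elements
str(toy.v)           # type, length, and the first few values
```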
I print text.v to see the whole vector.
text.v
But R gives me the first 1,000 lines only.
If I wish to view the whole vector, I can use View() (note the case sensitive spelling!)
View(text.v)
Alternatively, when printing directly, as I did above, I can raise the max.print limit.
options(max.print=1000000)
text.v
# max.print limits how many entries are printed (base R default: 99999; RStudio typically lowers it to 1000); here, I raise it to 1,000,000
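Since options() changes a global setting, it can be useful to store the old value and restore it afterwards. A minimal sketch, assuming a made-up vector toy.v and an arbitrarily small limit of 20:

```r
limit.before <- getOption("max.print")   # the limit currently in effect

# options() returns the previous settings, so they can be restored later
old.opts <- options(max.print = 20)

toy.v <- as.character(1:100)   # hypothetical stand-in for a long vector
print(toy.v)                   # printing stops after 20 entries, with a note

options(old.opts)              # restore the previous limit
getOption("max.print")
```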
Basics: Using the index [ ] for data manipulation.
The character vector “text.v” is indexed: each line has a number. Using square brackets, the code text.v[6] returns line number 6.
text.v[6]
text.v[600]
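Square brackets accept more than a single number. The sketch below shows a few common index forms on a made-up vector (toy.v is hypothetical):

```r
toy.v <- c("first line", "second line", "third line", "fourth line")

toy.v[2]         # a single element
toy.v[2:3]       # a range of elements
toy.v[c(1, 4)]   # any selection, built with c()
toy.v[-1]        # a negative index drops that element
length(toy.v)    # the largest valid index
```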
Our text files will often contain superfluous lines (“boilerplate”) that we wish to remove.
We use which() together with the comparison operator ==:
- which() finds the indices of particular character strings
- the indices refer to entire lines, e.g., text.v[600] is one complete line of the file
- two equal signs == serve as a comparison operator
which(text.v == "Chapter 1 PETER BREAKS THROUGH")
View(text.v)
start.v <- 39
Note that == compares entire lines of text.v; using this code, you cannot search for single words (yet).
View(text.v) shows that line 39 (the value assigned to start.v) is where the text proper begins. text.v[39] renders the same line!
Now, we search for the last line of the novel. We inspect text.v (it is still open in RStudio, or we use View() again).
end.v <- which(text.v == "THE END")
end.v
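text.v[4480] returns the same line as text.v[end.v].
The code end.v <- 4480 may be used as an alternative to end.v <- which(text.v == "THE END").
It is important to understand when which(text.v == "THE END") fails: == compares each entire line with the search string, so the match must be exact. If the line in the file differs by even one character, for example a trailing full stop, which() finds nothing and returns integer(0).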
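The difference between an exact and a partial match can be seen on a toy vector. This is only a sketch: toy.v is made up, and grep() with fixed = TRUE is shown as one possible base R alternative for partial matches, not as the method used in this tutorial.

```r
toy.v <- c("Chapter 1 PETER BREAKS THROUGH", "some text", "THE END.")

which(toy.v == "THE END")    # integer(0): no element is exactly "THE END"
which(toy.v == "THE END.")   # 3: the whole element matches

# grep() looks for the string anywhere inside an element
grep("THE END", toy.v, fixed = TRUE)   # 3
```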
You’re familiar with class() and length(), right?
class(end.v) # what type of data is end.v?
length(text.v) # how many lines does text.v have?
We see that end.v is an integer. And we see that text.v contains 4808 lines.
Next, we separate the boilerplate (“boilerplate.v”) from the full text (“text.v”).
start.boilerplate.v <- text.v[1:(start.v-1)]
This creates a vector from the first line up to one line before the text proper starts.
An alternative is
start.boilerplate.v <- text.v[1:(39-1)]
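A note on operator precedence in index expressions of this kind: in R the colon binds more tightly than minus, so 1:n-1 is read as (1:n)-1 and starts at 0. The sketch below uses made-up names (start.toy, toy.v) to show the difference and why the selection still comes out the same.

```r
start.toy <- 5                 # hypothetical start index
toy.v <- letters[1:10]         # hypothetical 10-element vector

1:start.toy - 1                # 0 1 2 3 4, i.e. (1:start.toy) - 1
1:(start.toy - 1)              # 1 2 3 4

# Both select the same elements, because an index of 0 is silently dropped
identical(toy.v[1:start.toy - 1], toy.v[1:(start.toy - 1)])
```

Writing the parentheses explicitly keeps the intention visible and avoids relying on the zero index being ignored.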
We also need to single out the boilerplate at the end of the file.
end.boilerplate.v <- text.v[(end.v+1):length(text.v)]
# This is the same:
#end.boilerplate.v <- text.v[(4480+1):length(text.v)]
Remember end.v is the last line of my target text. Here I define: from the first line after end.v up to the last line of the vector length(text.v).
And now: we create clean text. This means, for now, that we exclude the content we are not interested in.
novel.lines.v <- text.v[start.v:end.v]
Just for fun, we can also create a boilerplate vector. We use the combine function c()
boilerplate.v <- c(start.boilerplate.v, end.boilerplate.v)
We can validate our progress so far:
length(novel.lines.v) + length(boilerplate.v)
length(text.v)
#nothing lost?
length(text.v) - length(boilerplate.v)
#how much shorter is the new, clean vector?
The “boilerplate” vector can be built more elegantly in one step:
boilerplate.v <- c(text.v[1:(start.v-1)], text.v[(end.v+1):length(text.v)])
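The whole trimming step, including the length check, can be rehearsed on a toy vector. This is a sketch under stated assumptions: toy.v and all toy.* names are invented for illustration, and the marker lines are chosen to resemble the ones used above.

```r
toy.v <- c("header 1", "header 2", "Chapter 1", "story line", "THE END", "license")

start.toy <- which(toy.v == "Chapter 1")   # first line of the text proper
end.toy   <- which(toy.v == "THE END")     # last line of the text proper

toy.novel.v       <- toy.v[start.toy:end.toy]
toy.boilerplate.v <- c(toy.v[1:(start.toy - 1)],
                       toy.v[(end.toy + 1):length(toy.v)])

# Every line must end up in exactly one of the two vectors
stopifnot(length(toy.novel.v) + length(toy.boilerplate.v) == length(toy.v))
```

One caveat: if the end marker happened to be the very last line, (end.toy + 1):length(toy.v) would count downwards and select the wrong elements, so this pattern assumes there is at least one line of boilerplate after the text.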
We use the paste() function to join and collapse all the lines into one long string:
novel.v <- paste(novel.lines.v, collapse=" ")
- " " is the glue character
- with "" as glue, the spaces between lines would be missing in novel.v!
length(novel.v) # now, we have a SINGLE character string!
View(novel.v) # the whole novel is stashed into the one line!
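The effect of the glue character is easiest to see on two short lines. A small sketch with a made-up vector (toy.lines.v is hypothetical):

```r
toy.lines.v <- c("All children, except one,", "grow up.")

toy.spaced  <- paste(toy.lines.v, collapse = " ")   # words stay separated
toy.squeezed <- paste(toy.lines.v, collapse = "")   # "one,grow" runs together

toy.spaced
toy.squeezed
length(toy.spaced)   # 1: a single string, however many lines went in
```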
… but we want to be able to index the individual words…
In order to split this enormous string into single strings (tokens), we use tolower() and strsplit() and unlist().
novel.lower.v <- tolower(novel.v)
Make a list of tokens with strsplit() and then transform it into an atomic vector with unlist():
peter.words.l <- strsplit(novel.lower.v, "\\W")
peter.words.v <- unlist(peter.words.l) # we need an ATOMIC VECTOR, therefore unlist()
Compare the output of str(peter.words.v) and str(peter.words.l):
str(peter.words.v)
str(peter.words.l)
You see the difference!
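The same comparison on a short made-up string makes the difference easy to read. In this sketch, toy.l and toy.words.v are hypothetical names:

```r
toy.l <- strsplit("peter pan, wendy", "\\W")   # strsplit() always returns a list
toy.words.v <- unlist(toy.l)                   # flatten to a character vector

class(toy.l)          # "list"
length(toy.l)         # 1: one list element per input string
class(toy.words.v)    # "character"
length(toy.words.v)   # 4: "peter" "pan" "" "wendy"
```

The empty string appears because the comma and the space after it are two separate split points with nothing between them.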
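\\W is a regular expression that matches any non-word character. Splitting on it leaves empty strings (“blanks”) wherever two such characters follow each other, so we remove them:
not.blanks.v <- which(peter.words.v != "") # != means "not equal"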
not.blanks.v[1:10] # missing are 3, 5, 9, 13: these are "blanks!"
peter.words.v <- peter.words.v[not.blanks.v]
This is possible because not.blanks.v is an integer vector that can be used to index peter.words.v.
# same, but more elegant
# peter.words.v <- peter.words.v[which(peter.words.v != "")]
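The blank-removal step can be traced on a short made-up string. This is a sketch only: the toy.* names are hypothetical, and nzchar() is shown as one possible base R shortcut rather than as the approach used in this tutorial.

```r
toy.words.v <- unlist(strsplit("peter pan, wendy", "\\W"))
toy.words.v                          # "peter" "pan" "" "wendy"

toy.not.blanks.v <- which(toy.words.v != "")
toy.not.blanks.v                     # 1 2 4: position 3 is the blank

toy.clean.v <- toy.words.v[toy.not.blanks.v]
toy.clean.v                          # "peter" "pan" "wendy"

toy.words.v[nzchar(toy.words.v)]     # same result: nzchar() is TRUE for non-empty strings
```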
How about a test with peter.words.l?
peter.words.l <- peter.words.l[which(peter.words.l != "")]
This doesn’t work because, remember, peter.words.l contains just one list element (whose sub-elements we cannot index like this).
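If we did want to stay with the list, the inner vector has to be reached with double square brackets first. A minimal sketch on a made-up string (toy.l is hypothetical):

```r
toy.l <- strsplit("peter pan, wendy", "\\W")

length(toy.l)        # 1: the whole word vector sits inside one list element
toy.l[[1]]           # double brackets reach the character vector inside

# Filter the inner vector and store it back in the list
toy.l[[1]] <- toy.l[[1]][toy.l[[1]] != ""]
toy.l[[1]]           # "peter" "pan" "wendy"
```

For word counting and similar tasks, the unlist() route shown earlier is usually simpler, since everything afterwards works on a plain vector.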
Now we have completed our first mission.
- the vector peter.words.v is now ready for computation
- it is an atomic vector (not a list, i.e. not a generic vector)
View(peter.words.v)
rm(peter.words.v)