Name:Sunil Dhaka
Time: Note:
This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Sunday 11:59pm, this week. Make sure to complete your weekly check-in (which can be done by coming to lecture, recitation, lab, or any office hour), as this will count a small number of points towards your lab score.
This week’s agenda: basic string manipulations; practice reading in and summarizing real text data (Shakespeare); practice with iteration; just a little bit of regular expressions.
str1="Scientific Computing"
str2='Scientific Computing'
str1==str2
## [1] TRUE
Exa="Sunil's mother is Kamala" # double quotes are needed here because the string itself contains an apostrophe
tolower() and toupper() do as you’d expect: they convert strings to all lower case characters, and all upper case characters, respectively. Apply them to the strings below, as directed by the comments, to observe their behavior.
tolower("I'M NOT ANGRY I SWEAR") # Convert to lower case
## [1] "i'm not angry i swear"
toupper("Mom, I don't want my veggies") # Convert to upper case
## [1] "MOM, I DON'T WANT MY VEGGIES"
toupper("Hulk, sMasH" ) # Convert to upper case
## [1] "HULK, SMASH"
tolower("R2-D2 is in prime condition, a real bargain!") # Convert to lower case
## [1] "r2-d2 is in prime condition, a real bargain!"
presidents of length 5 below, containing the last names of past US presidents. Define a string vector first.letters to contain the first letters of each of these 5 last names. Hint: use substr(), and take advantage of vectorization; this should only require one line of code. Define first.letters.scrambled to be the output of sample(first.letters) (the sample() function can be used to perform random permutations; we’ll learn more about it later in the course). Lastly, reset the first letter of each last name stored in presidents according to the scrambled letters in first.letters.scrambled. Hint: use substr() again, and take advantage of vectorization; this should only take one line of code. Display these new last names.
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
first.letters=substr(presidents,1,1)
first.letters.scrambled=sample(first.letters)
substr(presidents,1,1)=first.letters.scrambled
presidents # display the new last names
phrase defined below. Using substr(), replace the first four characters in phrase by “Provide”. Print phrase to the console, and describe the behavior you are observing. Using substr() again, replace the last five characters in phrase by “kit” (don’t use the length of phrase as magic constant in the call to substr(), instead, compute the length using nchar()). Print phrase to the console, and describe the behavior you are observing.
phrase = "Give me a break"
substr(phrase,1,4)="Provide"
phrase # "Prov me a break": only the first 4 characters of "Provide" fit into the 4-character slot
substr(phrase,nchar(phrase)-4,nchar(phrase))="kit"
phrase # "Prov me a kitak": "kit" overwrites only the first 3 of the last 5 characters
# The string keeps its original length throughout: substr() replacement overwrites characters in place and never grows or shrinks the string
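The in-place behaviour of substr() replacement can be checked on a small string. This is a minimal sketch; `s` is a throwaway variable introduced only for illustration and is not part of the lab data.

```r
# `s` is a hypothetical example string, not part of the lab data
s <- "abcdef"

substr(s, 1, 2) <- "WXYZ"   # replacement is longer than the 2-character slot
s                           # only "WX" is used, giving "WXcdef"

substr(s, 4, 6) <- "k"      # replacement is shorter than the 3-character slot
s                           # only position 4 changes, giving "WXckef"

nchar(s)                    # the length is still 6
```

In both cases the string length is unchanged, which matches what happens to phrase above.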
ingredients defined below. Using strsplit(), split this string up into a string vector of length 5, with elements “chickpeas”, “tahini”, “olive oil”, “garlic”, and “salt.” Using paste(), combine this string vector into a single string “chickpeas + tahini + olive oil + garlic + salt”. Then produce a final string of the same format, but where the ingredients are sorted in alphabetical (increasing) order.
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.words=strsplit(ingredients,split = ", ")[[1]] # split on comma plus space so that no piece keeps a leading space
paste(split.words,collapse = " + ")
## [1] "chickpeas + tahini + olive oil + garlic + salt"
paste(sort(split.words),collapse = " + ")
## [1] "chickpeas + garlic + olive oil + salt + tahini"
## strsplit() returns a list, so [[1]] is needed to extract the character vector; [1] would return a list of length 1
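strsplit() always returns a list with one element per input string, which is why double brackets are needed to reach the vector of pieces. A small sketch of the difference, using a hypothetical input `x` that is not part of the lab:

```r
# `x` is a hypothetical two-element input used only for illustration
x <- c("a, b", "c, d, e")
parts <- strsplit(x, split = ", ")

class(parts)     # "list": one element per input string
parts[1]         # single brackets: still a list, of length 1
parts[[1]]       # double brackets: the character vector c("a", "b")
lengths(parts)   # number of pieces per input string: 2 3
```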
Project Gutenberg offers over 50,000 free online books, especially old books (classic literature), for which copyright has expired. We’re going to look at the complete works of William Shakespeare, taken from the Project Gutenberg website.
To avoid hitting the Project Gutenberg server over and over again, we’ve grabbed a text file from them that contains the complete works of William Shakespeare and put it on our course website. Visit http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt in your web browser and just skim through this text file a little bit to get a sense of what it contains (a whole lot!).
readLines(). Make sure you are reading the data file directly from the web (rather than locally, from a downloaded file on your computer). Call the result shakespeare.lines. This should be a vector of strings, each element representing a “line” of text. Print the first 5 lines. How many lines are there? How many characters in the longest line? What is the average number of characters per line? How many lines are there with zero characters (empty lines)? Hint: each of these queries should only require one line of code; for the last one, use an on-the-fly Boolean comparison and sum().
shakespeare.lines=readLines("http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt")
shakespeare.lines[1:5]
## [1] ""
## [2] "Project Gutenberg’s The Complete Works of William Shakespeare, by"
## [3] "William Shakespeare"
## [4] ""
## [5] "This eBook is for the use of anyone anywhere in the United States and"
sum(shakespeare.lines=="")
## [1] 17744
lines.count=sapply(shakespeare.lines,FUN=nchar)
lines.count[which.max(lines.count)]
## Section 3. Information about the Project Gutenberg Literary Archive Foundation
## 78
mean(lines.count)
## [1] 37.50825
shakespeare.lines (i.e., lines with zero characters). Check that the new length of shakespeare.lines makes sense to you.
shakespeare.lines=shakespeare.lines[lines.count!=0]
length(shakespeare.lines)
## [1] 130094
shakespeare.lines into one big string, separating each line by a space in doing so, using paste(). Call the resulting string shakespeare.all. How many characters does this string have? How does this compare to the sum of characters in shakespeare.lines, and does this make sense to you?
shakespeare.all=paste(shakespeare.lines,collapse=" ")
nchar(shakespeare.all)-sum(lines.count)
## [1] 130093
## This makes sense: collapsing the 130094 lines inserts one space between each pair of consecutive lines, i.e., 130093 extra characters
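The same bookkeeping can be verified on a tiny example. In this sketch `lines.demo` is a hypothetical stand-in for shakespeare.lines.

```r
# `lines.demo` is a hypothetical stand-in for shakespeare.lines
lines.demo <- c("to be", "or not", "to be")
all.demo <- paste(lines.demo, collapse = " ")

nchar(all.demo)                                   # 18
sum(nchar(lines.demo)) + length(lines.demo) - 1   # 16 characters + 2 separators = 18
```

With n lines, collapse inserts n - 1 separators, so the difference between the two counts is always one less than the number of lines.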
shakespeare.all into words, using strsplit() with split=" ". Call the resulting string vector (note: here we are asking you for a vector, not a list) shakespeare.words. How long is this vector, i.e., how many words are there? Using the unique() function, compute and store the unique words as shakespeare.words.unique. How many unique words are there?
shakespeare.words=strsplit(shakespeare.all,split = " ")[[1]]
shakespeare.words=shakespeare.words[shakespeare.words!=""] # drop empty strings
length(shakespeare.words)
## [1] 959301
shakespeare.words.unique=unique(shakespeare.words)
length(shakespeare.words.unique)
## [1] 76170
# Q) Why does "" come up so many times in the split result? Splitting on a single space yields an empty string wherever two spaces are adjacent; unique() keeps at most one copy of it
# Q) Why are the unique length and the previous length the same?
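The empty strings come from runs of consecutive spaces. The sketch below uses a hypothetical string `s`, and also shows why dropping them with a logical comparison is safer than negative indexing with which().

```r
# `s` is a hypothetical string with a double and a triple space
s <- "a  b   c"
w <- strsplit(s, split = " ")[[1]]
w                  # "a" "" "b" "" "" "c": one "" per extra space
sum(w == "")       # 3
unique(w)          # "a" "" "b" "c": unique() keeps a single ""
w[w != ""]         # "a" "b" "c"

# Pitfall: if nothing matches, which() returns integer(0),
# and negative indexing with it drops every element
v <- c("a", "b")
v[-which(v == "")] # character(0), not c("a", "b")
v[v != ""]         # "a" "b", as intended
```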
shakespeare.words.unique. You will have to set a large value of the breaks argument (say, breaks=50) in order to see in more detail what is going on. What does the bulk of this distribution look like to you? Why is the x-axis on the histogram extended so far to the right (what does this tell you about the right tail of the distribution)?
hist(nchar(shakespeare.words.unique),breaks = 50)
# The bulk of the distribution sits at short word lengths and falls off quickly, roughly like an exponential distribution
# Words with more than about 10 characters are very rare in Shakespeare's work
# The x-axis extends far to the right because a handful of very long strings form a long, thin right tail that covers very little area
2f. Reminder: the sort() function sorts a given vector into increasing order; its close friend, the order() function, returns the indices that put the vector into increasing order. Both functions can take decreasing=TRUE as an argument, to sort/find indices according to decreasing order. See the code below for an example.
set.seed(0)
(x = round(runif(5, -1, 1), 2))
## [1] 0.79 -0.47 -0.26 0.15 0.82
sort(x, decreasing=TRUE)
## [1] 0.82 0.79 0.15 -0.26 -0.47
order(x, decreasing=TRUE)
## [1] 5 1 4 3 2
Using the `order()` function, find the indices that correspond to the top 5 longest words in `shakespeare.words.unique`. Then, print the top 5 longest words themselves. Do you recognize any of these as actual words? **Challenge**: try to pronounce the fourth longest word! What does it mean?
shakespeare.words.unique[head(order(nchar(shakespeare.words.unique),decreasing = TRUE),5)]
## [1] "_______________________________________________________________"
## [2] "tragical-comical-historical-pastoral,"
## [3] "http://www.gutenberg.org/1/0/100/"
## [4] "honorificabilitudinitatibus;"
## [5] "enemies?—Capulet,—Montague,—"
table(), compute counts for the words in shakespeare.words, and save the result as shakespeare.wordtab. How long is shakespeare.wordtab, and is this equal to the number of unique words (as computed above)? Using named indexing, answer: how many times does the word “thou” appear? The word “rumour”? The word “gloomy”? The word “assassination”?
shakespeare.wordtab=table(shakespeare.words)
length(shakespeare.wordtab)
## [1] 76170
shakespeare.wordtab["thou"]
## thou
## 4522
shakespeare.wordtab["rumor"] # note: the prompt asks about "rumour"; this looks up the spelling "rumor"
## rumor
## 3
shakespeare.wordtab["gloomy"]
## gloomy
## 3
shakespeare.wordtab["assassination"]
## assassination
## 1
length(shakespeare.wordtab[(shakespeare.wordtab)==1])
## [1] 41842
length(shakespeare.wordtab[(shakespeare.wordtab)==2])
## [1] 10756
length(shakespeare.wordtab[(shakespeare.wordtab)>10])
## [1] 7544
length(shakespeare.wordtab[(shakespeare.wordtab)>100])
## [1] 974
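Named indexing and the threshold counts above can be tried on a toy table. In this sketch `words.demo` is a hypothetical word vector; counting with sum() on a logical comparison gives the same answer as taking the length of a subset.

```r
# `words.demo` is a hypothetical word vector, not the Shakespeare data
words.demo <- c("thou", "art", "thou", "and", "thou", "art")
tab.demo <- table(words.demo)

tab.demo["thou"]                 # count for a word that is present: 3
tab.demo["rumour"]               # a word that is absent gives NA

sum(tab.demo == 1)               # number of words appearing exactly once: 1
length(tab.demo[tab.demo == 1])  # same count via subsetting
```

An NA from named indexing therefore means the exact spelling does not occur as an entry of the table.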
shakespeare.wordtab so that its entries (counts) are in decreasing order, and save the result as shakespeare.wordtab.sorted. Print the 25 most commonly used words, along with their counts. What is the most common word? Second and third most common words?
shakespeare.wordtab.sorted=sort(shakespeare.wordtab,decreasing = TRUE)
head(shakespeare.wordtab.sorted,25)
## shakespeare.words
##   the     I   and    to    of     a    my    in   you    is  that   And   not
## 25378 20629 19806 16966 16718 13657 11443 10519  9591  8335  8150  7769  7415
##  with   his    be  your   for  have    it  this    me    he    as  thou
##  7380  6851  6411  6386  6014  5584  5242  5190  5107  5009  4584  4522
# The most common words are "the", "I" and "and"; before the empty strings were removed, "" came up 411073 times
shakespeare.all by spaces, using strsplit(). Redefine shakespeare.words so that all empty strings are deleted from this vector. Then recompute shakespeare.wordtab and shakespeare.wordtab.sorted. Check that you have done this right by printing out the new 25 most commonly used words, and verifying (just visually) that it overlaps with your solution to the last question.
## The empty strings were already removed when shakespeare.words was defined above,
## so only the word tables need to be recomputed here
shakespeare.wordtab=table(shakespeare.words)
shakespeare.wordtab.sorted=sort(shakespeare.wordtab,decreasing = TRUE)
head(shakespeare.wordtab.sorted,25)
## shakespeare.words
##   the     I   and    to    of     a    my    in   you    is  that   And   not
## 25378 20629 19806 16966 16718 13657 11443 10519  9591  8335  8150  7769  7415
##  with   his    be  your   for  have    it  this    me    he    as  thou
##  7380  6851  6411  6386  6014  5584  5242  5190  5107  5009  4584  4522
shakespeare.wordtab.sorted. Set xlim=c(1,1000) as an argument to plot(); this restricts the plotting window to just the first 1000 ranks, which is helpful here to see the trend more clearly. Do you see Zipf’s law in action, i.e., does it appear that \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) (for some \(C,a\))? Challenge: either programmatically, or manually, determine reasonably-well-fitting values of \(C,a\) for the Shakespeare data set; then draw the curve \(y=C(1/x)^a\) on top of your plot as a red line to show how well it fits.
C=1234
a=0.9
plot(1:(length(shakespeare.wordtab.sorted)),as.numeric(shakespeare.wordtab.sorted),xlim=c(1,1000),xlab = "Rank",ylab = "Frequency",type = "l")
curve(C*(1/x)^a,from=1,to=length(shakespeare.wordtab.sorted),add=TRUE,col="red")
## C and a above were picked by hand, not fitted. Taking logs gives log(Frequency) ≈ log(C) - a*log(Rank),
## so they could be estimated from a straight-line fit of log(Frequency) against log(Rank)
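One way to obtain \(C\) and \(a\) programmatically is a least-squares line on the log-log scale. The sketch below runs on synthetic frequencies (`rank.demo` and `freq.demo` are hypothetical and follow an exact power law), so it only illustrates the mechanics; on the real data the fit would be approximate and could be restricted to the first 1000 ranks.

```r
# Synthetic frequencies following an exact power law; these hypothetical values
# stand in for as.numeric(shakespeare.wordtab.sorted)
rank.demo <- 1:1000
freq.demo <- 25000 * (1 / rank.demo)^0.9

# log(Frequency) = log(C) - a * log(Rank): the intercept gives log(C), the slope gives -a
fit <- lm(log(freq.demo) ~ log(rank.demo))
C.hat <- exp(coef(fit)[[1]])
a.hat <- -coef(fit)[[2]]
c(C = C.hat, a = a.hat)   # recovers the generating values 25000 and 0.9

# Overlay the fitted curve in red on the rank-frequency plot
plot(rank.demo, freq.demo, type = "l", xlab = "Rank", ylab = "Frequency")
curve(C.hat * (1 / x)^a.hat, from = 1, to = 1000, add = TRUE, col = "red")
```

Least squares on the log scale weights all ranks equally, so on real data the low-frequency tail can pull the line; fitting only the ranks shown in the plot is one simple way to limit that.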
shakespeare.words. The first is that capitalization matters; from Q3c, you should have seen that “and” and “And” are counted as separate words. The second is that many words contain punctuation marks (and so, aren’t really words in the first place); to see this, retrieve the count corresponding to “and,” in your word table shakespeare.wordtab.
shakespeare.wordtab["and"]
## and
## 19806
shakespeare.wordtab["And"]
## And
## 7769
The fix for the first issue is to convert `shakespeare.all` to all lower case characters. Hint: recall `tolower()` from Q1b. The fix for the second issue is to use the argument `split="[[:space:]]|[[:punct:]]"` in the call to `strsplit()`, when defining the words. In words, this means: *split on spaces or on punctuation marks* (more precisely, it uses what we call a **regular expression** for the `split` argument). Carry out both of these fixes to define new words `shakespeare.words.new`. Then, delete all empty strings from this vector, and compute a word table from it, called `shakespeare.wordtab.new`.
shakespeare.words.new=strsplit(tolower(shakespeare.all),split="[[:space:]]|[[:punct:]]")[[1]]
shakespeare.words.new=shakespeare.words.new[shakespeare.words.new!=""]
shakespeare.wordtab.new=table(shakespeare.words.new)
shakespeare.wordtab.new["and"]
## and
## 28402
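shakespeare.words.new to that of shakespeare.words; also compare the length of shakespeare.wordtab.new to that of shakespeare.wordtab. Explain what you are observing.
length(shakespeare.words)-length(shakespeare.words.new)
## [1] -30618
length(shakespeare.wordtab)-length(shakespeare.wordtab.new)
## [1] 49915
# shakespeare.words.new has 30618 more words: splitting on punctuation breaks hyphenated compounds and contractions into several pieces
# shakespeare.wordtab.new has 49915 fewer distinct entries: lower-casing and dropping punctuation merge variants such as "and", "And" and "and," into one word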
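Both effects show up on a single short phrase. In this sketch `s` is a hypothetical phrase written with a plain ASCII apostrophe; it is not taken from the data.

```r
# `s` is a hypothetical phrase (plain ASCII apostrophe), lower-cased first
s <- tolower("Honour'd And, and AND!")
w <- strsplit(s, split = "[[:space:]]|[[:punct:]]")[[1]]
w <- w[w != ""]   # a comma followed by a space leaves an empty string behind

w                 # "honour" "d" "and" "and" "and"
table(w)          # and: 3, d: 1, honour: 1
```

One contraction becomes two tokens (more words in total), while three spellings of "and" collapse into one entry (fewer distinct words).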
shakespeare.words.new, calling the result shakespeare.words.new.unique. Then repeat the queries in Q2e and Q2f on shakespeare.words.new.unique. Comment on the histogram—is it different in any way than before? How about the top 5 longest words?
shakespeare.words.new.unique=unique(shakespeare.words.new)
hist(nchar(shakespeare.words.new.unique),breaks = 50)
# For comparison, the longest entries before cleaning:
shakespeare.words.unique[head(order(nchar(shakespeare.words.unique),decreasing = TRUE))]
## [1] "_______________________________________________________________"
## [2] "tragical-comical-historical-pastoral,"
## [3] "http://www.gutenberg.org/1/0/100/"
## [4] "honorificabilitudinitatibus;"
## [5] "enemies?—Capulet,—Montague,—"
## [6] "six-or-seven-times-honour’d"
# and the same query on the cleaned words:
shakespeare.words.new.unique[head(order(nchar(shakespeare.words.new.unique),decreasing = TRUE),5)]
# Yes, the histogram is different: punctuation and spaces no longer add to the character counts, and strings joined by punctuation (the underscore rule, the URL, hyphenated compounds) are split into shorter pieces, so the right tail should be much shorter
shakespeare.wordtab.new so that its entries (counts) are in decreasing order, and save the result as shakespeare.wordtab.sorted.new. Print out the 25 most common words and their counts, and compare them (informally) to what you saw in Q3d. Also, produce a plot of the new word counts, as you did in Q3e. Does Zipf’s law look like it still holds?
shakespeare.wordtab.sorted.new=sort(shakespeare.wordtab.new,decreasing = TRUE)
head(shakespeare.wordtab.sorted.new,25)
## shakespeare.words.new
##   the   and     i    to    of     a   you    my  that    in    is     d   not
## 30027 28402 23840 21437 18836 16139 14683 13208 12257 12191  9911  9486  9087
##  with     s   for    me    it   his    be    he  this  your   but  have
##  8542  8337  8303  8284  8228  7578  7413  7283  7193  7078  6793  6297
C=1234
a=0.9
## C and a are again picked by hand rather than fitted
plot(1:(length(shakespeare.wordtab.sorted.new)),as.numeric(shakespeare.wordtab.sorted.new),xlim=c(1,1000),xlab = "Rank",ylab = "Frequency",type = "l")
curve(C*(1/x)^a,from=1,to=length(shakespeare.wordtab.sorted.new),add=TRUE,col="red")
shakespeare.lines. Take a look at lines 19 through 23 of this vector: you should see a bunch of spaces preceding the text in lines 21, 22, and 23. Redefine shakespeare.lines by setting it equal to the output of calling the function trimws() on shakespeare.lines. Print out lines 19 through 23 again, and describe what’s happened.
shakespeare.lines[19:23]
## [1] "The Complete Works of William Shakespeare"
## [2] "by William Shakespeare"
## [3] " Contents"
## [4] " THE SONNETS"
## [5] " ALL’S WELL THAT ENDS WELL"
shakespeare.lines=trimws(shakespeare.lines) # trim leading and trailing whitespace
shakespeare.lines[19:23]
## [1] "The Complete Works of William Shakespeare"
## [2] "by William Shakespeare"
## [3] "Contents"
## [4] "THE SONNETS"
## [5] "ALL’S WELL THAT ENDS WELL"
which(), find the indices of the lines in shakespeare.lines that equal “THE SONNETS”, report the index of the first such occurrence, and store it as toc.start. Similarly, find the indices of the lines in shakespeare.lines that equal “VENUS AND ADONIS”, report the index of the first such occurrence, and store it as toc.end.
toc.start=which(shakespeare.lines=="THE SONNETS")[1] # which() already returns integer indices, so no as.numeric() conversion is needed
toc.end=which(shakespeare.lines=="VENUS AND ADONIS")[1]
n = toc.end - toc.start + 1, and create an empty string vector of length n called titles. Using a for() loop, populate titles with the titles of Shakespeare’s plays as ordered in the table of contents list, with the first being “THE SONNETS”, and the last being “VENUS AND ADONIS”. Print out the resulting titles vector to the console. Hint: if you define the counter variable i in your for() loop to run between 1 and n, then you will have to index shakespeare.lines carefully to extract the correct titles. Think about the following. When i=1, you want to extract the title of the first play in shakespeare.lines, which is located at index toc.start. When i=2, you want to extract the title of the second play, which is located at index toc.start + 1. And so on.
n=toc.end-toc.start+1
titles=character(n) # empty string vector of length n
for (i in 1:n) {
titles[i]=shakespeare.lines[toc.start+i-1]
}
titles
## [1] "THE SONNETS"
## [2] "ALL’S WELL THAT ENDS WELL"
## [3] "THE TRAGEDY OF ANTONY AND CLEOPATRA"
## [4] "AS YOU LIKE IT"
## [5] "THE COMEDY OF ERRORS"
## [6] "THE TRAGEDY OF CORIOLANUS"
## [7] "CYMBELINE"
## [8] "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK"
## [9] "THE FIRST PART OF KING HENRY THE FOURTH"
## [10] "THE SECOND PART OF KING HENRY THE FOURTH"
## [11] "THE LIFE OF KING HENRY THE FIFTH"
## [12] "THE FIRST PART OF HENRY THE SIXTH"
## [13] "THE SECOND PART OF KING HENRY THE SIXTH"
## [14] "THE THIRD PART OF KING HENRY THE SIXTH"
## [15] "KING HENRY THE EIGHTH"
## [16] "KING JOHN"
## [17] "THE TRAGEDY OF JULIUS CAESAR"
## [18] "THE TRAGEDY OF KING LEAR"
## [19] "LOVE’S LABOUR’S LOST"
## [20] "THE TRAGEDY OF MACBETH"
## [21] "MEASURE FOR MEASURE"
## [22] "THE MERCHANT OF VENICE"
## [23] "THE MERRY WIVES OF WINDSOR"
## [24] "A MIDSUMMER NIGHT’S DREAM"
## [25] "MUCH ADO ABOUT NOTHING"
## [26] "THE TRAGEDY OF OTHELLO, MOOR OF VENICE"
## [27] "PERICLES, PRINCE OF TYRE"
## [28] "KING RICHARD THE SECOND"
## [29] "KING RICHARD THE THIRD"
## [30] "THE TRAGEDY OF ROMEO AND JULIET"
## [31] "THE TAMING OF THE SHREW"
## [32] "THE TEMPEST"
## [33] "THE LIFE OF TIMON OF ATHENS"
## [34] "THE TRAGEDY OF TITUS ANDRONICUS"
## [35] "THE HISTORY OF TROILUS AND CRESSIDA"
## [36] "TWELFTH NIGHT; OR, WHAT YOU WILL"
## [37] "THE TWO GENTLEMEN OF VERONA"
## [38] "THE TWO NOBLE KINSMEN"
## [39] "THE WINTER’S TALE"
## [40] "A LOVER’S COMPLAINT"
## [41] "THE PASSIONATE PILGRIM"
## [42] "THE PHOENIX AND THE TURTLE"
## [43] "THE RAPE OF LUCRECE"
## [44] "VENUS AND ADONIS"
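The index arithmetic in the loop (position i maps to line toc.start + i - 1) can be checked against plain range indexing. This is a sketch on a hypothetical vector `lines.demo`; the names ending in `.demo`, `.loop` and `.vec` are introduced only for illustration.

```r
# `lines.demo` is a hypothetical miniature of shakespeare.lines:
# a table of contents followed by the start of the first play
lines.demo <- c("Contents", "PLAY A", "PLAY B", "PLAY C", "", "PLAY A", "some text")

start.demo <- which(lines.demo == "PLAY A")[1]   # first occurrence: 2
end.demo <- which(lines.demo == "PLAY C")[1]     # first occurrence: 4
n.demo <- end.demo - start.demo + 1              # 3 titles

# Loop version, mirroring the lab solution
titles.loop <- character(n.demo)                 # empty string vector of length n
for (i in seq_len(n.demo)) {
  titles.loop[i] <- lines.demo[start.demo + i - 1]
}

# Vectorized equivalent: index the whole range at once
titles.vec <- lines.demo[start.demo:end.demo]

identical(titles.loop, titles.vec)               # TRUE
```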
for() loop to find out, for each play, the index of the line in shakespeare.lines at which this play begins. It turns out that the second occurrence of “THE SONNETS” in shakespeare.lines is where this play actually begins (the first occurrence is in the table of contents), and so on, for each play title. Use your for() loop to fill out an integer vector called titles.start, containing the indices at which each of Shakespeare’s plays begins in shakespeare.lines. Print the resulting vector titles.start to the console.
titles.start=integer(n)
for (i in 1:n) {
  titles.start[i]=which(shakespeare.lines==titles[i])[2] # second occurrence; the first is in the table of contents
}
titles.start
## [1] 66 2377 5310 9141 11772 13702 17590 21385 26644 30389
## [11] 33614 36902 39957 43248 46412 49895 52680 55427 60107 62923
## [21] 65462 68319 71020 73766 75996 79469 83083 86327 89286 93442
## [31] 97535 101205 103640 106198 108938 113682 116175 NA 122682 126020
## [41] 126351 126556 126626 128534
titles.end to be an integer vector of the same length as titles.start, whose first element is the second element in titles.start minus 1, whose second element is the third element in titles.start minus 1, and so on. What this means: we are considering the line before the second play begins to be the last line of the first play, and so on. Define the last element in titles.end to be the length of shakespeare.lines. You can solve this question either with a for() loop, or with proper indexing and vectorization. Challenge: it’s not really correct to set the last element in titles.end to be length of shakespeare.lines, because there is a footer at the end of the Shakespeare data file. By looking at the data file visually in your web browser, come up with a way to programmatically determine the index of the last line of the last play, and implement it.
titles.end=integer(n)
for (i in 1:(n-1)) {
  titles.end[i]=titles.start[i+1]-1
}
titles.end[n]=length(shakespeare.lines)
## Challenge: use the second line equal to "FINIS" as the last line of the last play, instead of the end of the file
titles.end[n]=which(shakespeare.lines=="FINIS")[2]
titles.end
## [1] 2376 5309 9140 11771 13701 17589 21384 26643 30388 33613
## [11] 36901 39956 43247 46411 49894 52679 55426 60106 62922 65461
## [21] 68318 71019 73765 75995 79468 83082 86326 89285 93441 97534
## [31] 101204 103639 106197 108937 113681 116174 NA 122681 126019 126350
## [41] 126555 126625 128533 129749
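The prompt notes that titles.end can also be built with indexing and vectorization instead of a loop. A minimal sketch on made-up numbers; `starts.demo` and `n.lines.demo` are hypothetical and do not come from the data.

```r
# Hypothetical start indices and total line count, for illustration only
starts.demo <- c(3L, 10L, 25L)
n.lines.demo <- 40L
k <- length(starts.demo)

# Loop version, mirroring the lab solution
ends.loop <- integer(k)
for (i in seq_len(k - 1)) {
  ends.loop[i] <- starts.demo[i + 1] - 1L
}
ends.loop[k] <- n.lines.demo

# Vectorized version: drop the first start, subtract 1, append the final index
ends.vec <- c(starts.demo[-1] - 1L, n.lines.demo)

ends.vec                         # 9 24 40
identical(ends.loop, ends.vec)   # TRUE
```

The same one-liner applies to the challenge variant by appending the index of the last line of the last play in place of the total line count.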
NA, in the vector titles.start. Why? If you run which(shakespeare.lines == "THE TWO NOBLE KINSMEN") in your console, you will see that there is only one occurrence of “THE TWO NOBLE KINSMEN” in shakespeare.lines, and this occurs in the table of contents. So there was no second occurrence, hence the resulting NA value.
which(shakespeare.lines == "THE TWO NOBLE KINSMEN")
## [1] 59
shakespeare.lines[118463]
## [1] "THE TWO NOBLE KINSMEN:"
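for (i in 1:n) {
  # grep() already returns integer indices; fixed=TRUE treats each title as a literal string rather than a regular expression
  titles.start[i]=grep(pattern = titles[i],shakespeare.lines,fixed = TRUE)[2]
}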
titles.start
## [1] 66 2377 5310 9141 11772 13702 17590 21385 26644 30389
## [11] 33614 36902 39957 43248 46412 49895 52680 55427 60107 62923
## [21] 65462 68319 71020 73766 75996 79469 83083 86327 89286 93442
## [31] 97535 101205 103640 106198 108938 113682 116175 118463 122682 126020
## [41] 126351 126556 126626 128534
for (i in 1:(n-1)) {
titles.end[i]=titles.start[i+1]-1
}
titles.end[n]=which(shakespeare.lines=="FINIS")[2]
titles.end
## [1] 2376 5309 9140 11771 13701 17589 21384 26643 30388 33613
## [11] 36901 39956 43247 46411 49894 52679 55426 60106 62922 65461
## [21] 68318 71019 73765 75995 79468 83082 86326 89285 93441 97534
## [31] 101204 103639 106197 108937 113681 116174 118462 122681 126019 126350
## [41] 126555 126625 128533 129749
But now take a look at line 118,463 in `shakespeare.lines`: you will see that it is "THE TWO NOBLE KINSMEN:", so this is really where the second play starts, but because of colon ":" at the end of the string, this doesn't exactly match the title "THE TWO NOBLE KINSMEN", as we were looking for. The advantage of using the `grep()` function, versus checking for exact equality of strings, is that `grep()` allows us to match substrings. Specifically, `grep()` returns the indices of the strings in a vector for which a substring match occurs, e.g.,
```r
grep(pattern="cat",
x=c("cat", "canned goods", "batman", "catastrophe", "tomcat"))
```
```
## [1] 1 4 5
```
so we can see that in this example, `grep()` was able to find substring matches to "cat" in the first, fourth, and fifth strings in the argument `x`. Redefine `titles.start` by repeating the logic in your solution to Q5d, but replacing the `which()` command in the body of your `for()` loop with an appropriate call to `grep()`. Also, redefine `titles.end` by repeating the logic in your solution to Q5e. Print out the new vectors `titles.start` and `titles.end` to the console---they should be free of `NA` values.
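The difference between exact equality and substring matching can be seen on a miniature of the problem. In this sketch `lines.demo` is hypothetical; the `fixed = TRUE` argument is an optional choice, not something the lab requires, and makes `grep()` treat the pattern as a literal string instead of a regular expression.

```r
# `lines.demo` is a hypothetical miniature: a contents entry and a heading with a colon
lines.demo <- c("THE TWO NOBLE KINSMEN", "other text", "THE TWO NOBLE KINSMEN:")

which(lines.demo == "THE TWO NOBLE KINSMEN")              # 1: exact equality misses the colon version
grep("THE TWO NOBLE KINSMEN", lines.demo, fixed = TRUE)   # 1 3: substring match finds both

# Why fixed = TRUE can matter: in a regular expression "." matches any character
grep("A.B", c("A.B", "AxB"))                 # 1 2
grep("A.B", c("A.B", "AxB"), fixed = TRUE)   # 1
```

Because grep() matches substrings, a title that is contained in another title or in a line of dialogue would also be matched, so the second match is only the play's start when no such line comes first.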
titles vector. Use this to find the indices at which this play starts and ends, in the titles.start and titles.end vectors, respectively. Call the lines of text corresponding to this play shakespeare.lines.hamlet. How many such lines are there? Do the same, but now for the play “THE TRAGEDY OF ROMEO AND JULIET”, and call the lines of text corresponding to this play shakespeare.lines.romeo. How many such lines are there?
index.hamlet=which(titles=="THE TRAGEDY OF HAMLET, PRINCE OF DENMARK")
start.hamlet=titles.start[index.hamlet]
end.hamlet=titles.end[index.hamlet]
shakespeare.lines.hamlet=shakespeare.lines[start.hamlet:end.hamlet]
length(shakespeare.lines.hamlet)
## [1] 5259
index.romeo=which(titles=="THE TRAGEDY OF ROMEO AND JULIET")
start.romeo=titles.start[index.romeo]
end.romeo=titles.end[index.romeo]
shakespeare.lines.romeo=shakespeare.lines[start.romeo:end.romeo]
length(shakespeare.lines.romeo)
## [1] 4093
shakespeare.lines.hamlet. (This should mostly just involve copying and pasting code as needed.) That is, to be clear:
* collapse shakespeare.lines.hamlet into one big string, separated by spaces;
* convert this string into all lower case characters;
* divide this string into words, by splitting on spaces or on punctuation marks, using split="[[:space:]]|[[:punct:]]" in the call to strsplit();
* remove all empty words (equal to the empty string ""), and report how many words remain;
* compute the unique words, report the number of unique words, and plot a histogram of their numbers of characters;
* report the 5 longest words;
* compute a word table, and report the 25 most common words and their counts;
* finally, produce a plot of the word counts versus rank.
shakespeare.lines.hamlet.all=paste(shakespeare.lines.hamlet,collapse = " ")
shakespeare.lines.hamlet.all=tolower(shakespeare.lines.hamlet.all)
shakespeare.lines.hamlet.words=
strsplit(shakespeare.lines.hamlet.all,split = "[[:space:]]|[[:punct:]]")[[1]]
shakespeare.lines.hamlet.words=
shakespeare.lines.hamlet.words[shakespeare.lines.hamlet.words!=""]
length(shakespeare.lines.hamlet.words)
## [1] 32977
shakespeare.lines.hamlet.unique=unique(shakespeare.lines.hamlet.words)
hist(nchar(shakespeare.lines.hamlet.unique),breaks=20)
shakespeare.lines.hamlet.unique[head(order(nchar(shakespeare.lines.hamlet.unique),decreasing = TRUE))]
## [1] "transformation" "understanding" "entertainment" "imperfections"
## [5] "encompassment" "circumstances"
shakespeare.lines.hamlet.wordtab=table(shakespeare.lines.hamlet.words)
shakespeare.lines.hamlet.sorted.wordtab=sort(shakespeare.lines.hamlet.wordtab,decreasing = TRUE)
C=234
a=0.56
plot(1:length(shakespeare.lines.hamlet.sorted.wordtab),as.numeric(shakespeare.lines.hamlet.sorted.wordtab),xlab = 'Rank',ylab = "Frequency",type = "l")
curve(C*(1/x)^a,from=1,to=length(shakespeare.lines.hamlet.sorted.wordtab),add=TRUE,col="green") # add=TRUE overlays the curve on the plot instead of starting a new one
- 6c. Repeat the same task as in the last part, but on shakespeare.lines.romeo. (Again, this should just involve copying and pasting code as needed. P.S. Isn’t this getting tiresome? You’ll be happy when we learn functions, next week!) Comment on any similarities/differences you see in the answers.
shakespeare.lines.romeo.all=paste(shakespeare.lines.romeo,collapse = " ")
shakespeare.lines.romeo.all=tolower(shakespeare.lines.romeo.all)
shakespeare.lines.romeo.words=strsplit(shakespeare.lines.romeo.all,split = "[[:space:]]|[[:punct:]]")[[1]]
shakespeare.lines.romeo.words=
shakespeare.lines.romeo.words[shakespeare.lines.romeo.words!=""]
length(shakespeare.lines.romeo.words)
## [1] 26689
shakespeare.lines.romeo.unique=unique(shakespeare.lines.romeo.words)
hist(nchar(shakespeare.lines.romeo.unique),breaks=20)
shakespeare.lines.romeo.unique[head(order(nchar(shakespeare.lines.romeo.unique),decreasing = TRUE))]
## [1] "distemperature" "unthankfulness" "interchanging" "transgression"
## [5] "disparagement" "gentlemanlike"
shakespeare.lines.romeo.wordtab=table(shakespeare.lines.romeo.words)
shakespeare.lines.romeo.sorted.wordtab=sort(shakespeare.lines.romeo.wordtab,decreasing = TRUE)
C=234
a=0.56
plot(1:length(shakespeare.lines.romeo.sorted.wordtab),as.numeric(shakespeare.lines.romeo.sorted.wordtab),xlab = 'Rank',ylab = "Frequency",type = "l")
curve(C*(1/x)^a,from=1,to=length(shakespeare.lines.romeo.sorted.wordtab),add=TRUE,col="green") # add=TRUE overlays the curve on the plot instead of starting a new one
for() loop and the titles.start, titles.end vectors constructed above, answer the following questions. What is Shakespeare’s longest play (in terms of the number of words)? What is Shakespeare’s shortest play? In which play did Shakespeare use his longest word (in terms of the number of characters)? Are there any plays in which “the” is not the most common word?
shakespeare.plays.length.words=numeric(n) # number of words in each play
shakespeare.plays.length.lines=numeric(n) # number of lines in each play
shakespeare.plays.length.max=numeric(n) # number of characters in the longest word of each play
shakespeare.plays.length.the=numeric(n) # 1 if "the" is not the most common word, 0 otherwise
for (i in 1:n) {
  s=shakespeare.lines[titles.start[i]:titles.end[i]]
  shakespeare.plays.length.lines[i]=length(s)
  s=tolower(paste(s,collapse = " "))
  s=strsplit(s,split = "[[:space:]]|[[:punct:]]")[[1]]
  s=s[s!=""] # drop empty strings
  shakespeare.plays.length.words[i]=length(s)
  shakespeare.plays.length.max[i]=max(nchar(unique(s)))
  s2=sort(table(s),decreasing = TRUE)
  shakespeare.plays.length.the[i]=as.numeric(names(s2)[1]!="the")
}
shakespeare.plays.length.lines
## [1] 2311 2933 3831 2631 1930 3888 3795 5259 3745 3225 3288 3055 3291 3164 3483
## [16] 2785 2747 4680 2816 2539 2857 2701 2746 2230 3473 3614 3244 2959 4156 4093
## [31] 3670 2435 2558 2740 4744 2493 2288 4219 3338 331 205 70 1908 1216
shakespeare.plays.length.words
## [1] 18126 25166 27468 23390 16684 30211 29964 32977 26267 28470 28150 23533
## [13] 27621 26698 26720 22398 21189 28686 23504 18788 23564 22811 24148 17721
## [25] 23105 28545 20366 24024 32149 26689 22935 18160 20366 22269 28608 21829
## [37] 18846 26086 27012 2608 1616 377 15660 10546
shakespeare.plays.length.max
## [1] 14 15 15 14 15 14 15 14 16 15 15 14 15 15 15 16 15 15 27 14 14 15 17 17 15
## [26] 15 15 15 15 14 13 15 15 15 17 15 15 13 15 15 11 11 14 13
shakespeare.plays.length.the
## [1] 1 1 0 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1
## [39] 0 1 0 0 0 0
titles[which.max(shakespeare.plays.length.lines)]
## [1] "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK"
titles[which.min(shakespeare.plays.length.lines)]
## [1] "THE PHOENIX AND THE TURTLE"
titles[which.max(shakespeare.plays.length.words)]
## [1] "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK"
titles[which.min(shakespeare.plays.length.words)]
## [1] "THE PHOENIX AND THE TURTLE"
titles[which.max(shakespeare.plays.length.max)]
## [1] "LOVE’S LABOUR’S LOST"
titles[which.min(shakespeare.plays.length.max)]
## [1] "THE PASSIONATE PILGRIM"
cat("\n")
cat("Plays in which 'the' is not the most common word:\n")
## Plays in which 'the' is not the most common word:
cat("\n")
titles[(shakespeare.plays.length.the==1)]
## [1] "THE SONNETS"
## [2] "ALL’S WELL THAT ENDS WELL"
## [3] "AS YOU LIKE IT"
## [4] "THE COMEDY OF ERRORS"
## [5] "THE FIRST PART OF KING HENRY THE FOURTH"
## [6] "THE FIRST PART OF HENRY THE SIXTH"
## [7] "THE THIRD PART OF KING HENRY THE SIXTH"
## [8] "THE TRAGEDY OF JULIUS CAESAR"
## [9] "THE MERRY WIVES OF WINDSOR"
## [10] "A MIDSUMMER NIGHT’S DREAM"
## [11] "MUCH ADO ABOUT NOTHING"
## [12] "THE TRAGEDY OF OTHELLO, MOOR OF VENICE"
## [13] "THE TRAGEDY OF ROMEO AND JULIET"
## [14] "THE TAMING OF THE SHREW"
## [15] "THE TEMPEST"
## [16] "THE LIFE OF TIMON OF ATHENS"
## [17] "THE TRAGEDY OF TITUS ANDRONICUS"
## [18] "TWELFTH NIGHT; OR, WHAT YOU WILL"
## [19] "THE TWO GENTLEMEN OF VERONA"
## [20] "THE TWO NOBLE KINSMEN"
## [21] "A LOVER’S COMPLAINT"
# total number of plays in which "the" is not the most common word
sum(shakespeare.plays.length.the)
## [1] 21