Name: Sunil Dhaka
Time: Note:

This lab is to be done in class (completed outside of class if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit your own lab as a knitted HTML file on Canvas, by Sunday 11:59pm, this week. Make sure to complete your weekly check-in (which can be done by coming to lecture, recitation, lab, or any office hour), as this will count a small number of points towards your lab score.

This week’s agenda: basic string manipulations; practice reading in and summarizing real text data (Shakespeare); practice with iteration; just a little bit of regular expressions.

Some string basics

str1="Scientific Computing"
str2='Scientific Computing'
str1==str2
## [1] TRUE
Exa="Sunil's mother is Kamala" # double quotes are needed here because the string itself contains an apostrophe
tolower("I'M NOT ANGRY I SWEAR")         # Convert to lower case
## [1] "i'm not angry i swear"
toupper("Mom, I don't want my veggies")  # Convert to upper case
## [1] "MOM, I DON'T WANT MY VEGGIES"
toupper("Hulk, sMasH" )                  # Convert to upper case
## [1] "HULK, SMASH"
tolower("R2-D2 is in prime condition, a real bargain!") # Convert to lower case
## [1] "r2-d2 is in prime condition, a real bargain!"
presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
first.letters=substr(presidents,1,1)
first.letters.scrambled=sample(first.letters,5)
substr(presidents,1,1)=first.letters.scrambled
phrase = "Give me a break"
substr(phrase,1,4)="Provide"
substr(phrase,nchar(phrase)-4,nchar(phrase))="kit"
# the strings keep their original length: substr()<- only overwrites characters inside the given span and never inserts or deletes any
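The fixed-length behaviour of `substr()<-` noted above can be checked on a small example. This is only an illustrative sketch: the names `x` and `y` are not part of the lab, and `sub()` is shown as just one possible way to change the length of a string.

```r
# Illustrative sketch; x and y are hypothetical names, not part of the lab.
x <- "Give me a break"
substr(x, 1, 4) <- "Provide"   # only the first 4 characters are overwritten
x                              # "Prov me a break"

# To swap in a longer word, replace by pattern instead of by position.
y <- sub("^Give", "Provide", "Give me a break")
y                              # "Provide me a break"
```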
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.words=strsplit(ingredients,split = ",")
paste(split.words[[1]],"+",sep = " ",collapse = "")
## [1] "chickpeas + tahini + olive oil + garlic + salt +"
sort(split.words[[1]])
## [1] " garlic"    " olive oil" " salt"      " tahini"    "chickpeas"
## note the difference between [[ ]] and [ ] on a list: strsplit() returns a list, so [[1]] extracts the character vector itself, while [1] would return a list of length one
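The output above still carries a trailing "+" and leading spaces on most ingredients. One possible way to avoid both, sketched here under the assumption that the separator is always a comma followed by one space, is to split on `", "` and let `collapse` supply the `" + "`. The name `items` is hypothetical.

```r
# Sketch only: assumes every separator is exactly a comma plus one space.
ingredients <- "chickpeas, tahini, olive oil, garlic, salt"
items <- strsplit(ingredients, split = ", ")[[1]]   # no leading spaces left over
joined <- paste(items, collapse = " + ")            # separator goes between items only
joined        # "chickpeas + tahini + olive oil + garlic + salt"
sort(items)   # "chickpeas" "garlic" "olive oil" "salt" "tahini"
```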

Shakespeare’s complete works

Project Gutenberg offers over 50,000 free online books, especially old books (classic literature), for which copyright has expired. We’re going to look at the complete works of William Shakespeare, taken from the Project Gutenberg website.

To avoid hitting the Project Gutenberg server over and over again, we’ve grabbed a text file from them that contains the complete works of William Shakespeare and put it on our course website. Visit http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt in your web browser and just skim through this text file a little bit to get a sense of what it contains (a whole lot!).

Reading in text, basic exploratory tasks

shakespeare.lines=readLines("http://www.stat.cmu.edu/~ryantibs/statcomp/data/shakespeare.txt")
shakespeare.lines[1:5]
## [1] ""                                                                     
## [2] "Project Gutenberg’s The Complete Works of William Shakespeare, by"    
## [3] "William Shakespeare"                                                  
## [4] ""                                                                     
## [5] "This eBook is for the use of anyone anywhere in the United States and"
sum(shakespeare.lines=="")
## [1] 17744
lines.count=sapply(shakespeare.lines,FUN=nchar)
lines.count[which.max(lines.count)]
## Section 3. Information about the Project Gutenberg Literary Archive Foundation 
##                                                                             78
mean(lines.count)
## [1] 37.50825
shakespeare.lines=shakespeare.lines[-c(which(lines.count==0))]
length(shakespeare.lines)
## [1] 130094
shakespeare.all=paste(shakespeare.lines,collapse=" ")
nchar(shakespeare.all)-sum(lines.count)
## [1] 130093
## this is expected: collapse=" " inserts one space between each pair of consecutive lines, i.e. 130094 - 1 = 130093 extra characters
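The same count can be verified on a toy example. The vector `toy.lines` below is made up for illustration and is not taken from the Shakespeare file.

```r
# Toy check: collapsing n strings with " " adds exactly n - 1 characters.
toy.lines <- c("to be", "or not", "to be")
toy.all <- paste(toy.lines, collapse = " ")
extra <- nchar(toy.all) - sum(nchar(toy.lines))
extra                    # 2, which equals length(toy.lines) - 1
```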
shakespeare.words=strsplit((shakespeare.all),split = " ")[[1]]
shakespeare.words=shakespeare.words[shakespeare.words!=""]
length(shakespeare.words)
## [1] 959301
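A word of caution on dropping empty strings with a negative `which()` index: it works here only because empty strings are present. The sketch below, using a made-up vector `x`, shows what happens when there is nothing to remove, and why logical indexing is the safer choice.

```r
# Hypothetical vector that contains no empty strings at all.
x <- c("to", "be")
bad <- x[-which(x == "")]   # which() gives integer(0), so everything is dropped
good <- x[x != ""]          # logical indexing keeps all non-empty elements
length(bad)                 # 0
good                        # "to" "be"
```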
shakespeare.words.unique=unique(shakespeare.words)
length(shakespeare.words.unique)
## [1] 76170
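# Q: why does "" occur so often? A: indented lines contain runs of spaces, and splitting on a single space yields "" between consecutive spaces
# Q: why are the unique length and the previous length the same?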
hist(nchar(shakespeare.words.unique),breaks = 50)

# the histogram is strongly right-skewed and looks roughly like an exponential distribution
# words with more than 10 characters are very rare in Shakespeare's work,
# so the right tail covers only a small area of this distribution
shakespeare.words.unique[head(order(nchar(shakespeare.words.unique),decreasing = TRUE))]
## [1] "_______________________________________________________________"
## [2] "tragical-comical-historical-pastoral,"                          
## [3] "http://www.gutenberg.org/1/0/100/"                              
## [4] "honorificabilitudinitatibus;"                                   
## [5] "enemies?—Capulet,—Montague,—"                                   
## [6] "six-or-seven-times-honour’d"
Using the `order()` function, find the indices that correspond to the top 5 longest words in `shakespeare.words.unique`. Then, print the top 5 longest words themselves. Do you recognize any of these as actual words? **Challenge**: try to pronounce the fourth longest word! What does it mean?

Computing word counts

shakespeare.wordtab=table(shakespeare.words)
length(shakespeare.wordtab)
## [1] 76170
shakespeare.wordtab["thou"]
## thou 
## 4522
shakespeare.wordtab["rumor"]
## rumor 
##     3
shakespeare.wordtab["gloomy"]
## gloomy 
##      3
shakespeare.wordtab["assassination"]
## assassination 
##             1
length(shakespeare.wordtab[(shakespeare.wordtab)==1])
## [1] 41842
length(shakespeare.wordtab[(shakespeare.wordtab)==2])
## [1] 10756
length(shakespeare.wordtab[(shakespeare.wordtab)>10])
## [1] 7544
length(shakespeare.wordtab[(shakespeare.wordtab)>100])
## [1] 974
shakespeare.wordtab.sorted=sort(shakespeare.wordtab,decreasing = TRUE)
head(shakespeare.wordtab.sorted,25)
## shakespeare.words
##   the     I   and    to    of     a    my    in   you    is  that   And   not 
## 25378 20629 19806 16966 16718 13657 11443 10519  9591  8335  8150  7769  7415 
##  with   his    be  your   for  have    it  this    me    he    as  thou 
##  7380  6851  6411  6386  6014  5584  5242  5190  5107  5009  4584  4522
# note that "" occurred 411073 times before the empty strings were removed
# open question: how to make strsplit() handle runs of blanks
# the plots below use the table with the empty strings removed manually
C=1234
a=0.9
plot(1:(length(shakespeare.wordtab.sorted)),as.numeric(shakespeare.wordtab.sorted),xlim=c(1,1000),xlab = "Rank",ylab = "Frequency",type = "l")
curve(C*(1/x)^a,from=1,to=length(shakespeare.wordtab.sorted),add=TRUE,col="yellow")

## the values of C and a above were chosen by hand; I don't know how to compute them programmatically
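One simple way to estimate C and a, offered here as a sketch rather than as the lab's intended solution: taking logs of Frequency = C(1/Rank)^a gives log(Frequency) = log(C) - a log(Rank), which is a straight line that `lm()` can fit. The data below are synthetic and the names `word.rank`, `freq`, `C.hat` and `a.hat` are hypothetical; on the real data one would presumably use `as.numeric(shakespeare.wordtab.sorted)` in place of `freq`, and least squares on the log scale is only one of several possible estimators.

```r
# Synthetic frequencies that follow Zipf's law exactly (hypothetical data).
word.rank <- 1:50
freq <- 1000 * (1 / word.rank)^0.8

# log(freq) = log(C) - a * log(rank), so fit a straight line on the log scale.
fit <- lm(log(freq) ~ log(word.rank))
C.hat <- unname(exp(coef(fit)[1]))   # the intercept estimates log(C)
a.hat <- unname(-coef(fit)[2])       # the slope estimates -a
c(C = C.hat, a = a.hat)
```

The fitted values could then be passed to `curve()` in place of the hand-picked constants.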

A tiny bit of regular expressions

shakespeare.wordtab["and"]
##   and 
## 19806
shakespeare.wordtab["And"]
##  And 
## 7769
shakespeare.words.new=(strsplit(tolower(shakespeare.all),split="[[:space:]]|[[:punct:]]"))[[1]]
shakespeare.words.new=shakespeare.words.new[shakespeare.words.new!=""]
shakespeare.wordtab.new=table(shakespeare.words.new)
shakespeare.wordtab.new["and"]
##   and 
## 28402
The fix for the first issue is to convert `shakespeare.all` to all lower case characters. Hint: recall `tolower()` from Q1b. The fix for the second issue is to use the argument `split="[[:space:]]|[[:punct:]]"` in the call to `strsplit()`, when defining the words. In words, this means: *split on spaces or on punctuation marks* (more precisely, it uses what we call a **regular expression** for the `split` argument). Carry out both of these fixes to define new words `shakespeare.words.new`. Then, delete all empty strings from this vector, and compute word table from it, called `shakespeare.wordtab.new`. 
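To see what this regular expression does, here is a small sketch on a made-up sentence (the names `toy` and `toy.words` are hypothetical). Each single space or punctuation mark is a split point, so a punctuation mark followed by a space leaves an empty string behind, and an apostrophe cuts a contraction in two. That is presumably where the stand-alone "d" and "s" entries in the new word table come from.

```r
# Toy sentence, lower-cased and split on any single space or punctuation mark.
toy <- tolower("Hello, world! It's me.")
raw <- strsplit(toy, split = "[[:space:]]|[[:punct:]]")[[1]]
raw                              # contains "" wherever two split points are adjacent
toy.words <- raw[raw != ""]      # drop the empty strings
toy.words                        # "hello" "world" "it" "s" "me"
```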
length(shakespeare.words)-length(shakespeare.words.new)
## [1] -30618
length(shakespeare.wordtab)-length(shakespeare.wordtab.new)
## [1] 49915
shakespeare.words.new.unique=unique(shakespeare.words.new)
hist(nchar(shakespeare.words.new.unique),breaks = 50)

shakespeare.words.new.unique[head(order(nchar(shakespeare.words.new.unique),decreasing = TRUE))]
# yes, the histogram is different: punctuation and whitespace are now removed, so words in shakespeare.words.new.unique are no longer lengthened by attached punctuation marks
shakespeare.wordtab.sorted.new=sort(shakespeare.wordtab.new,decreasing = TRUE)
head(shakespeare.wordtab.sorted.new,25)
## shakespeare.words.new
##   the   and     i    to    of     a   you    my  that    in    is     d   not 
## 30027 28402 23840 21437 18836 16139 14683 13208 12257 12191  9911  9486  9087 
##  with     s   for    me    it   his    be    he  this  your   but  have 
##  8542  8337  8303  8284  8228  7578  7413  7283  7193  7078  6793  6297
C=1234
a=0.9
## C and a are still chosen by hand, as I don't know how to compute them
plot(1:(length(shakespeare.wordtab.sorted.new)),as.numeric(shakespeare.wordtab.sorted.new),xlim=c(1,1000),xlab = "Rank",ylab = "Frequency",type = "l")
curve(C*(1/x)^a,from=1,to=length(shakespeare.wordtab.sorted.new),add=TRUE,col="yellow")

Where are Shakespeare’s plays, in this massive text?

shakespeare.lines[19:23]
## [1] "The Complete Works of William Shakespeare"
## [2] "by William Shakespeare"                   
## [3] "      Contents"                           
## [4] "               THE SONNETS"               
## [5] "               ALL’S WELL THAT ENDS WELL"
shakespeare.lines=trimws(shakespeare.lines) # trim leading and trailing whitespace
shakespeare.lines[19:23]
## [1] "The Complete Works of William Shakespeare"
## [2] "by William Shakespeare"                   
## [3] "Contents"                                 
## [4] "THE SONNETS"                              
## [5] "ALL’S WELL THAT ENDS WELL"
toc.start=which(shakespeare.lines=="THE SONNETS")[1] # which() returns integer indices, so no as.numeric() conversion is needed
toc.end=which(shakespeare.lines=="VENUS AND ADONIS")[1]
n=toc.end-toc.start+1
titles=c(1:n)
for (i in 1:n) {
  titles[i]=shakespeare.lines[toc.start+i-1]
}
titles
##  [1] "THE SONNETS"                             
##  [2] "ALL’S WELL THAT ENDS WELL"               
##  [3] "THE TRAGEDY OF ANTONY AND CLEOPATRA"     
##  [4] "AS YOU LIKE IT"                          
##  [5] "THE COMEDY OF ERRORS"                    
##  [6] "THE TRAGEDY OF CORIOLANUS"               
##  [7] "CYMBELINE"                               
##  [8] "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK"
##  [9] "THE FIRST PART OF KING HENRY THE FOURTH" 
## [10] "THE SECOND PART OF KING HENRY THE FOURTH"
## [11] "THE LIFE OF KING HENRY THE FIFTH"        
## [12] "THE FIRST PART OF HENRY THE SIXTH"       
## [13] "THE SECOND PART OF KING HENRY THE SIXTH" 
## [14] "THE THIRD PART OF KING HENRY THE SIXTH"  
## [15] "KING HENRY THE EIGHTH"                   
## [16] "KING JOHN"                               
## [17] "THE TRAGEDY OF JULIUS CAESAR"            
## [18] "THE TRAGEDY OF KING LEAR"                
## [19] "LOVE’S LABOUR’S LOST"                    
## [20] "THE TRAGEDY OF MACBETH"                  
## [21] "MEASURE FOR MEASURE"                     
## [22] "THE MERCHANT OF VENICE"                  
## [23] "THE MERRY WIVES OF WINDSOR"              
## [24] "A MIDSUMMER NIGHT’S DREAM"               
## [25] "MUCH ADO ABOUT NOTHING"                  
## [26] "THE TRAGEDY OF OTHELLO, MOOR OF VENICE"  
## [27] "PERICLES, PRINCE OF TYRE"                
## [28] "KING RICHARD THE SECOND"                 
## [29] "KING RICHARD THE THIRD"                  
## [30] "THE TRAGEDY OF ROMEO AND JULIET"         
## [31] "THE TAMING OF THE SHREW"                 
## [32] "THE TEMPEST"                             
## [33] "THE LIFE OF TIMON OF ATHENS"             
## [34] "THE TRAGEDY OF TITUS ANDRONICUS"         
## [35] "THE HISTORY OF TROILUS AND CRESSIDA"     
## [36] "TWELFTH NIGHT; OR, WHAT YOU WILL"        
## [37] "THE TWO GENTLEMEN OF VERONA"             
## [38] "THE TWO NOBLE KINSMEN"                   
## [39] "THE WINTER’S TALE"                       
## [40] "A LOVER’S COMPLAINT"                     
## [41] "THE PASSIONATE PILGRIM"                  
## [42] "THE PHOENIX AND THE TURTLE"              
## [43] "THE RAPE OF LUCRECE"                     
## [44] "VENUS AND ADONIS"
titles.start=c(1:n)
for (i in 1:n) {
  titles.start[i]=which(shakespeare.lines==titles[i])[2]
}
titles.start
##  [1]     66   2377   5310   9141  11772  13702  17590  21385  26644  30389
## [11]  33614  36902  39957  43248  46412  49895  52680  55427  60107  62923
## [21]  65462  68319  71020  73766  75996  79469  83083  86327  89286  93442
## [31]  97535 101205 103640 106198 108938 113682 116175     NA 122682 126020
## [41] 126351 126556 126626 128534
titles.end=c(1:n)
for (i in 1:(n-1)) {
  titles.end[i]=titles.start[i+1]-1
}
titles.end[n]=length(shakespeare.lines)
## challenge solved: end the last work at the second "FINIS" line instead of at the end of the file
titles.end[n]=which(shakespeare.lines=="FINIS")[2]
titles.end
##  [1]   2376   5309   9140  11771  13701  17589  21384  26643  30388  33613
## [11]  36901  39956  43247  46411  49894  52679  55426  60106  62922  65461
## [21]  68318  71019  73765  75995  79468  83082  86326  89285  93441  97534
## [31] 101204 103639 106197 108937 113681 116174     NA 122681 126019 126350
## [41] 126555 126625 128533 129749
which(shakespeare.lines == "THE TWO NOBLE KINSMEN")
## [1] 59
shakespeare.lines[118463]
## [1] "THE TWO NOBLE KINSMEN:"
for (i in 1:n) {
  titles.start[i]=grep(pattern = titles[i],shakespeare.lines,fixed = TRUE)[2] # fixed=TRUE treats the title as literal text, not as a regular expression
}
titles.start
##  [1]     66   2377   5310   9141  11772  13702  17590  21385  26644  30389
## [11]  33614  36902  39957  43248  46412  49895  52680  55427  60107  62923
## [21]  65462  68319  71020  73766  75996  79469  83083  86327  89286  93442
## [31]  97535 101205 103640 106198 108938 113682 116175 118463 122682 126020
## [41] 126351 126556 126626 128534
for (i in 1:(n-1)) {
  titles.end[i]=titles.start[i+1]-1
}
titles.end[n]=which(shakespeare.lines=="FINIS")[2]
titles.end
##  [1]   2376   5309   9140  11771  13701  17589  21384  26643  30388  33613
## [11]  36901  39956  43247  46411  49894  52679  55426  60106  62922  65461
## [21]  68318  71019  73765  75995  79468  83082  86326  89285  93441  97534
## [31] 101204 103639 106197 108937 113681 116174 118462 122681 126019 126350
## [41] 126555 126625 128533 129749
But now take a look at line 118,463 in `shakespeare.lines`: you will see that it is "THE TWO NOBLE KINSMEN:", so this is really where the second play starts, but because of the colon ":" at the end of the string, this doesn't exactly match the title "THE TWO NOBLE KINSMEN", as we were looking for. The advantage of using the `grep()` function, versus checking for exact equality of strings, is that `grep()` allows us to match substrings. Specifically, `grep()` returns the indices of the strings in a vector for which a substring match occurs, e.g.,

```r
grep(pattern="cat",
     x=c("cat", "canned goods", "batman", "catastrophe", "tomcat"))
```

```
## [1] 1 4 5
```
so we can see that in this example, `grep()` was able to find substring matches to "cat" in the first, fourth, and fifth strings in the argument `x`. Redefine `titles.start` by repeating the logic in your solution to Q5d, but replacing the `which()` command in the body of your `for()` loop with an appropriate call to `grep()`. Also, redefine `titles.end` by repeating the logic in your solution to Q5e. Print out the new vectors `titles.start` and `titles.end` to the console---they should be free of `NA` values.
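The difference between exact equality and substring matching can be sketched on a toy version of the problem. The vector `toy.lines` is made up for illustration, and `fixed = TRUE` is an optional choice made here so that the title is treated as literal text rather than as a regular expression.

```r
# Toy lines: a table-of-contents entry, some text, and a heading with a colon.
toy.lines <- c("THE TWO NOBLE KINSMEN", "some other line", "THE TWO NOBLE KINSMEN:")
title <- "THE TWO NOBLE KINSMEN"

exact <- which(toy.lines == title)                # only the exact match
partial <- grep(title, toy.lines, fixed = TRUE)   # any line containing the title
exact[2]     # NA: there is no second exact match
partial[2]   # 3: the heading with the trailing colon is found
```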

Extracting and analysing a couple of plays

index.hamlet=which(titles=="THE TRAGEDY OF HAMLET, PRINCE OF DENMARK")
start.hamlet=titles.start[index.hamlet]
end.hamlet=titles.end[index.hamlet]
shakespeare.lines.hamlet=shakespeare.lines[start.hamlet:end.hamlet]
length(shakespeare.lines.hamlet)
## [1] 5259
index.romeo=which(titles=="THE TRAGEDY OF ROMEO AND JULIET")
start.romeo=titles.start[index.romeo]
end.romeo=titles.end[index.romeo]
shakespeare.lines.romeo=shakespeare.lines[start.romeo:end.romeo]
length(shakespeare.lines.romeo)
## [1] 4093
shakespeare.lines.hamlet.all=paste(shakespeare.lines.hamlet,collapse = " ")
shakespeare.lines.hamlet.all=tolower(shakespeare.lines.hamlet.all)
shakespeare.lines.hamlet.words=
strsplit(shakespeare.lines.hamlet.all,split = "[[:space:]]|[[:punct:]]")[[1]]
shakespeare.lines.hamlet.words=
shakespeare.lines.hamlet.words[shakespeare.lines.hamlet.words!=""]
shakespeare.lines.hamlet.unique=unique(shakespeare.lines.hamlet.words)
hist(nchar(shakespeare.lines.hamlet.unique),breaks=20)

shakespeare.lines.hamlet.unique[head(order(nchar(shakespeare.lines.hamlet.unique),decreasing = TRUE))]
## [1] "transformation" "understanding"  "entertainment"  "imperfections" 
## [5] "encompassment"  "circumstances"
shakespeare.lines.hamlet.wordtab=table(shakespeare.lines.hamlet.words)
shakespeare.lines.hamlet.sorted.wordtab=sort(shakespeare.lines.hamlet.wordtab,decreasing = TRUE)
C=234
a=0.56
plot(1:length(shakespeare.lines.hamlet.sorted.wordtab),as.numeric(shakespeare.lines.hamlet.sorted.wordtab),xlab = 'Rank',ylab = "Frequency",type = "l")

curve(C*(1/x)^a,from=1,to=length(shakespeare.lines.hamlet.sorted.wordtab),add=TRUE,col="green")

- 6c. Repeat the same task as in the last part, but on shakespeare.lines.romeo. (Again, this should just involve copying and pasting code as needed. P.S. Isn’t this getting tiresome? You’ll be happy when we learn functions, next week!) Comment on any similarities/differences you see in the answers.

shakespeare.lines.romeo.all=paste(shakespeare.lines.romeo,collapse = " ")
shakespeare.lines.romeo.all=tolower(shakespeare.lines.romeo.all)
shakespeare.lines.romeo.words=strsplit(shakespeare.lines.romeo.all,split = "[[:space:]]|[[:punct:]]")[[1]]
shakespeare.lines.romeo.words=
shakespeare.lines.romeo.words[shakespeare.lines.romeo.words!=""]
length(shakespeare.lines.hamlet.words)
## [1] 32977
shakespeare.lines.romeo.unique=unique(shakespeare.lines.romeo.words)
hist(nchar(shakespeare.lines.romeo.unique),breaks=20)

shakespeare.lines.romeo.unique[head(order(nchar(shakespeare.lines.romeo.unique),decreasing = TRUE))]
## [1] "distemperature" "unthankfulness" "interchanging"  "transgression" 
## [5] "disparagement"  "gentlemanlike"
shakespeare.lines.romeo.wordtab=table(shakespeare.lines.romeo.words)
shakespeare.lines.romeo.sorted.wordtab=sort(shakespeare.lines.romeo.wordtab,decreasing = TRUE)
C=234
a=0.56
plot(1:length(shakespeare.lines.romeo.sorted.wordtab),as.numeric(shakespeare.lines.romeo.sorted.wordtab),xlab = 'Rank',ylab = "Frequency",type = "l")

curve(C*(1/x)^a,from=1,to=length(shakespeare.lines.romeo.sorted.wordtab),add=TRUE,col="green")
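The copy-and-paste steps used for Hamlet and for Romeo and Juliet are identical apart from the input lines, so they could be wrapped in a function. The sketch below is one possible way to do this; the helper name `get.words()` and the toy input are hypothetical and not part of the lab.

```r
# Hypothetical helper: collapse lines, lower-case, split, drop empty strings.
get.words <- function(lines) {
  all.text <- tolower(paste(lines, collapse = " "))
  words <- strsplit(all.text, split = "[[:space:]]|[[:punct:]]")[[1]]
  words[words != ""]
}

# Toy usage on two made-up lines instead of a full play.
toy.lines <- c("To be, or not to be,", "that is the question:")
toy.words <- get.words(toy.lines)
toy.tab <- sort(table(toy.words), decreasing = TRUE)
length(toy.words)   # 10 words in total
toy.tab[["to"]]     # "to" occurs twice
```

With such a helper, each play would need only one call, for example `get.words(shakespeare.lines.romeo)`.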

shakespeare.plays.length.words=numeric(n)
shakespeare.plays.length.lines=numeric(n)
shakespeare.plays.length.max=numeric(n)
shakespeare.plays.length.the=numeric(n)

for (i in 1:n) {
  shakespeare.plays.length.lines[i]=length(shakespeare.lines[titles.start[i]:titles.end[i]])
  s=shakespeare.lines[titles.start[i]:titles.end[i]]
  s=paste(s,collapse = " ")
  s=tolower(s)
  s=strsplit(s,split = "[[:space:]]|[[:punct:]]")[[1]]
  s=s[s!=""]
  s1=unique(s)
  
  shakespeare.plays.length.max[i]=max(nchar(s1))
  s2=table(s)
  s2=sort(s2,decreasing = TRUE)
  if(((names(s2))[1]=="the")){
    shakespeare.plays.length.the[i]=0
  }
  else{
    shakespeare.plays.length.the[i]=1
  }
  shakespeare.plays.length.words[i]=length(s)
}
shakespeare.plays.length.lines
##  [1] 2311 2933 3831 2631 1930 3888 3795 5259 3745 3225 3288 3055 3291 3164 3483
## [16] 2785 2747 4680 2816 2539 2857 2701 2746 2230 3473 3614 3244 2959 4156 4093
## [31] 3670 2435 2558 2740 4744 2493 2288 4219 3338  331  205   70 1908 1216
shakespeare.plays.length.words
##  [1] 18126 25166 27468 23390 16684 30211 29964 32977 26267 28470 28150 23533
## [13] 27621 26698 26720 22398 21189 28686 23504 18788 23564 22811 24148 17721
## [25] 23105 28545 20366 24024 32149 26689 22935 18160 20366 22269 28608 21829
## [37] 18846 26086 27012  2608  1616   377 15660 10546
shakespeare.plays.length.max
##  [1] 14 15 15 14 15 14 15 14 16 15 15 14 15 15 15 16 15 15 27 14 14 15 17 17 15
## [26] 15 15 15 15 14 13 15 15 15 17 15 15 13 15 15 11 11 14 13
shakespeare.plays.length.the
##  [1] 1 1 0 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1
## [39] 0 1 0 0 0 0
titles[which.max(shakespeare.plays.length.lines)]
## [1] "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK"
titles[which.min(shakespeare.plays.length.lines)]
## [1] "THE PHOENIX AND THE TURTLE"
titles[which.max(shakespeare.plays.length.words)]
## [1] "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK"
titles[which.min(shakespeare.plays.length.words)]
## [1] "THE PHOENIX AND THE TURTLE"
titles[which.max(shakespeare.plays.length.max)]
## [1] "LOVE’S LABOUR’S LOST"
titles[which.min(shakespeare.plays.length.max)]
## [1] "THE PASSIONATE PILGRIM"
cat("\n")
cat("Works in which 'the' is not the most frequent word:\n")
## Works in which 'the' is not the most frequent word:
cat("\n")
titles[(shakespeare.plays.length.the==1)]
##  [1] "THE SONNETS"                            
##  [2] "ALL’S WELL THAT ENDS WELL"              
##  [3] "AS YOU LIKE IT"                         
##  [4] "THE COMEDY OF ERRORS"                   
##  [5] "THE FIRST PART OF KING HENRY THE FOURTH"
##  [6] "THE FIRST PART OF HENRY THE SIXTH"      
##  [7] "THE THIRD PART OF KING HENRY THE SIXTH" 
##  [8] "THE TRAGEDY OF JULIUS CAESAR"           
##  [9] "THE MERRY WIVES OF WINDSOR"             
## [10] "A MIDSUMMER NIGHT’S DREAM"              
## [11] "MUCH ADO ABOUT NOTHING"                 
## [12] "THE TRAGEDY OF OTHELLO, MOOR OF VENICE" 
## [13] "THE TRAGEDY OF ROMEO AND JULIET"        
## [14] "THE TAMING OF THE SHREW"                
## [15] "THE TEMPEST"                            
## [16] "THE LIFE OF TIMON OF ATHENS"            
## [17] "THE TRAGEDY OF TITUS ANDRONICUS"        
## [18] "TWELFTH NIGHT; OR, WHAT YOU WILL"       
## [19] "THE TWO GENTLEMEN OF VERONA"            
## [20] "THE TWO NOBLE KINSMEN"                  
## [21] "A LOVER’S COMPLAINT"
# total number of works in which "the" is not the most frequent word
sum(shakespeare.plays.length.the)
## [1] 21