Brown corpus exercise

1. How many sentences are in the corpus?

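# Read the corpus one line per element (the count below assumes one sentence per line)
brown.ori <- scan(file="e:/R_Project/browncorpus.txt",sep="\n",quote="",what="char",comment.char = "")
# Keep the raw lines in brown.ori and work on a lowercased copy
brown <- tolower(brown.ori)
length(brown)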
## [1] 51763

2. How many words are in the corpus?

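# Replace punctuation with a space (this also splits contractions: "don't" becomes "don t")
brown <- gsub("[[:punct:]]", " ", brown)
# Drop digits
brown <- gsub("[0-9]", "", brown)
# Collapse runs of spaces, then trim both ends
brown <- gsub(" +", " ", brown)
brown <- gsub("^ +", "", brown)
brown <- gsub(" +$", "", brown)
# Split each line on spaces and flatten to a single vector of tokens
word.list.brown <- strsplit(brown, " ")
unlist.brown <- unlist(word.list.brown)
unlist.brown <- unlist.brown[unlist.brown != ""]
length(unlist.brown)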
## [1] 1023243
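
The cleaning steps above can be collected into one helper and checked on a couple of made-up lines. This is only a minimal sketch: `tokenize_lines` and the sample sentences are hypothetical and not part of the original exercise.

```r
# Hypothetical helper mirroring the cleaning steps used for the corpus
tokenize_lines <- function(lines) {
  x <- tolower(lines)
  x <- gsub("[[:punct:]]", " ", x)  # punctuation becomes a space
  x <- gsub("[0-9]", "", x)         # digits are dropped
  x <- gsub(" +", " ", x)           # collapse runs of spaces
  x <- gsub("^ +| +$", "", x)       # trim leading and trailing spaces
  tokens <- unlist(strsplit(x, " "))
  tokens[tokens != ""]
}

# Made-up sample lines, not taken from the corpus
sample.lines <- c("The cat sat, quietly.", "It's 3 o'clock!")
tokens <- tokenize_lines(sample.lines)
tokens
length(tokens)  # 8: "it's" and "o'clock" each split into two tokens
```

Because apostrophes count as punctuation, contractions contribute two tokens each, so the word total depends on this tokenization choice.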

3. How many instances of the word “the” occur in the corpus?

length(grep("\\bthe\\b",unlist.brown))
## [1] 69968
# Cross-check with a frequency table
a <- table(unlist.brown)
a["the"]
##   the 
## 69968
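
Both counts agree because the pattern is bounded on each side. As a small illustration on a made-up token vector (the names below are hypothetical), an exact comparison or an anchored pattern counts only the word itself, while an unanchored pattern also picks up longer words that contain it:

```r
# Made-up tokens: three are exactly "the", three merely contain it
tokens <- c("the", "other", "then", "the", "bathe", "the")

n.exact    <- sum(tokens == "the")           # exact string comparison
n.anchored <- length(grep("^the$", tokens))  # regex anchored at both ends
n.loose    <- length(grep("the", tokens))    # unanchored substring match

c(exact = n.exact, anchored = n.anchored, loose = n.loose)  # 3, 3, 6
```

On a vector that is already split into tokens, `sum(tokens == "the")` is probably the simplest option, since no regular expression is needed.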

4. How many words in the corpus begin with an (orthographic) vowel?

english.stop.words<-c("a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, am, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, aren't, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, became, because, become, becomes, been, before, began, behind, being, beings, below, best, better, between, big, both, but, by, c, came, can, cannot, can't, case, cases, certain, certainly, clear, clearly, come, could, couldn't, d, did, didn't, differ, different, differently, do, does, doesn't, doing, done, don't, down, downed, downing, downs, during, e, each, early, either, end, ended, ending, ends, enough, even, evenly, ever, every, everybody, everyone, everything, everywhere, f, face, faces, fact, facts, far, felt, few, find, finds, first, for, four, from, full, fully, further, furthered, furthering, furthers, g, gave, general, generally, get, gets, give, given, gives, go, going, good, goods, got, great, greater, greatest, group, grouped, grouping, groups, h, had, hadn't, has, hasn't, have, haven't, having, he, he'd, he'll, her, here, here's, hers, herself, he's, high, higher, highest, him, himself, his, how, however, how's, i, i'd, if, i'll, i'm, important, in, interest, interested, interesting, interests, into, is, isn't, it, its, it's, itself, i've, j, just, k, keep, keeps, kind, knew, know, known, knows, l, large, largely, last, later, latest, least, less, let, lets, let's, like, likely, long, longer, longest, m, made, make, making, man, many, may, me, member, members, men, might, more, most, mostly, mr, mrs, much, must, mustn't, my, myself, n, necessary, need, needed, needing, needs, never, new, newer, newest, next, no, nobody, non, noone, nor, not, nothing, now, nowhere, number, numbers, o, of, off, often, old, older, oldest, on, once, one, only, open, opened, opening, opens, or, order, ordered, ordering, orders, other, others, ought, our, ours, 
ourselves, out, over, own, p, part, parted, parting, parts, per, perhaps, place, places, point, pointed, pointing, points, possible, present, presented, presenting, presents, problem, problems, put, puts, q, quite, r, rather, really, right, room, rooms, s, said, same, saw, say, says, second, seconds, see, seem, seemed, seeming, seems, sees, several, shall, shan't, she, she'd, she'll, she's, should, shouldn't, show, showed, showing, shows, side, sides, since, small, smaller, smallest, so, some, somebody, someone, something, somewhere, state, states, still, such, sure, t, take, taken, than, that, that's, the, their, theirs, them, themselves, then, there, therefore, there's, these, they, they'd, they'll, they're, they've, thing, things, think, thinks, this, those, though, thought, thoughts, three, through, thus, to, today, together, too, took, toward, turn, turned, turning, turns, two, u, under, until, up, upon, us, use, used, uses, v, very, w, want, wanted, wanting, wants, was, wasn't, way, ways, we, we'd, well, we'll, wells, went, were, we're, weren't, we've, what, what's, when, when's, where, where's, whether, which, while, who, whole, whom, who's, whose, why, why's, will, with, within, without, won't, work, worked, working, works, would, wouldn't, x, y, year, years, yes, yet, you, you'd, you'll, young, younger, youngest, your, you're, yours, yourself, yourselves, you've, z")
english.stop.words <- strsplit(english.stop.words,", ")
english.stop.words <- unlist(english.stop.words)
head(english.stop.words)
## [1] "a"      "about"  "above"  "across" "after"  "again"
# First count: stop words excluded
exwords <- unlist.brown[!unlist.brown %in% english.stop.words]
length(grep("^[aeiou]",exwords))
## [1] 75845
# Second count: all tokens, stop words included
length(grep("\\b[aeiou]\\w*\\b",unlist.brown))
## [1] 294097
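
One caveat about the stop list: if the long string really contains a line break where it wraps (after "ours, "), splitting on ", " leaves that line break attached to the next word, which then never matches a token. The sketch below shows the effect on a made-up miniature list and one possible way around it; all names and data here are hypothetical.

```r
# Made-up miniature stop list with a line break inside the string
stop.raw <- "a, ours, \nourselves, the"

naive <- strsplit(stop.raw, ", ")[[1]]
naive                                    # third element is "\nourselves"

# Splitting on a comma followed by any whitespace avoids the problem
stops <- strsplit(stop.raw, ",\\s*")[[1]]
stops

# Vowel-initial counts with and without stop words, on made-up tokens
tokens  <- c("apple", "ourselves", "the", "under", "tree", "a")
content <- tokens[!tokens %in% stops]
n.all     <- length(grep("^[aeiou]", tokens))   # 4
n.content <- length(grep("^[aeiou]", content))  # 2
```

Note also that entries such as "aren't" cannot match any token once punctuation has been replaced by spaces during cleaning.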

5. How many words end in the string “ness”?

ness <- grep("\\w*ness\\b",exwords,value=TRUE)
length(ness)
## [1] 1468
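
The figure above counts tokens, so a word that occurs several times is counted each time. A minimal sketch on made-up tokens (hypothetical data) shows a simpler end-anchored pattern and the difference between token and type counts:

```r
# Made-up tokens; "happiness" occurs twice
tokens <- c("happiness", "nest", "lioness", "nessie", "kindness", "happiness")

ness.tokens <- grep("ness$", tokens, value = TRUE)  # "$" anchors at the end of the token
ness.tokens
length(ness.tokens)          # 4 tokens
length(unique(ness.tokens))  # 3 distinct word types
```

The match is purely orthographic, so a word like "lioness" is included even though its suffix is not "-ness".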

6. Write a regular expression for each of the following:

6-1. All of the words with two z’s in a row.

two_z <- grep("\\b\\w*zz\\w*\\b",exwords,value=TRUE)
length(two_z)
## [1] 351
head(two_z)
## [1] "grizzlies" "fizzled"   "rizzuto"   "huzzahs"   "dazzler"   "jazz"

6-2. All of the words with two vowels in a row.

length(grep("\\b\\w*[aeiou][aeiou]\\w*\\b",unlist.brown))
## [1] 177613

6-3. All of the words with more than two vowels in them.

length(grep("\\b(\\w*[aeiou]\\w*){3,}\\b",unlist.brown,value=TRUE))
## [1] 214016
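
"More than two vowels" means at least three vowel letters anywhere in the word. As a cross-check that avoids nested quantifiers, the vowels can also be counted directly. This is a sketch on made-up tokens, not the method used above:

```r
# Made-up tokens with 3, 2, 0, 4, 1 and 3 vowel letters
tokens <- c("banana", "tree", "rhythm", "queue", "cat", "idea")

# Delete every non-vowel character, then count what is left
n.vowels <- nchar(gsub("[^aeiou]", "", tokens))
n.vowels

many.vowels <- tokens[n.vowels > 2]
many.vowels  # "banana" "queue" "idea"

# A simpler regular expression that selects the same tokens
same.by.regex <- grep("([aeiou].*){3}", tokens, value = TRUE)
```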

6-4. All of the words with two x’s in them.

two_x <- grep("\\b(\\w*x\\w*){2}\\b",exwords,value=TRUE)
length(two_x)
## [1] 2
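
The pattern above matches words with at least two x's. If "two x's" is read as exactly two, a stricter pattern is needed. A minimal sketch on made-up tokens (hypothetical data) contrasts the two readings:

```r
# Made-up tokens with 2, 2, 1, 3 and 1 occurrences of "x"
tokens <- c("xerox", "exxon", "box", "xxx", "maximum")

# At least two x's: an x, anything, another x
at.least.two <- grep("x.*x", tokens, value = TRUE)
at.least.two  # "xerox" "exxon" "xxx"

# Exactly two x's: everything around the two x's must be non-x
exactly.two <- grep("^[^x]*x[^x]*x[^x]*$", tokens, value = TRUE)
exactly.two   # "xerox" "exxon"
```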

6-5. An expression to match any legitimate email address
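
No single regular expression covers every address the mail standards allow (quoted local parts, for example), so the following is only a pragmatic approximation and not the exercise's official answer. The pattern, the variable names and the test strings are all assumptions made for illustration. It would have to be applied to the raw lines (`brown.ori`), since the cleaning step removes "@" and ".".

```r
# Pragmatic approximation: local part, "@", dot-separated domain labels,
# and a final alphabetic label of at least two letters
email.pattern <- "^[[:alnum:]._%+-]+@[[:alnum:]-]+(\\.[[:alnum:]-]+)*\\.[[:alpha:]]{2,}$"

# Made-up test strings
candidates <- c("jane.doe@example.com",
                "user+tag@mail.example.org",
                "no-at-sign.example.com",
                "two@@example.com",
                "trailing@example.",
                "a@b.c")

is.email <- grepl(email.pattern, candidates)
data.frame(candidates, is.email)
# Only the first two strings are accepted by this pattern
```

The anchors `^` and `$` test whole strings. To pull addresses out of running text, the anchors would be dropped and the pattern used with `gregexpr()` and `regmatches()` instead.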