For this homework, we’re going to work a bit more closely with bigrams and trigrams. You’ll need to triangulate between a table of bigrams and a dispersion plot of the bigram you choose. If you’d like, you can of course also turn to the text itself.

You have a few choices of text to work with: Dracula, Jane Eyre, Wuthering Heights, The Great Gatsby or A Study in Scarlet (the first Sherlock Holmes novel). Currently, the code is set up to work with Dracula, but you can change this very easily by uncommenting the line with the relevant text (remove the #), and commenting out the line that sets it to Dracula (add a # to the beginning of the line).

## [1] "The text is currently set to DRACULA"
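As a rough illustration of the comment/uncomment switch described above, the selection might look something like the sketch below. The variable name `text_name` and the text labels are hypothetical; the notebook's own code may use different names, so treat this only as a picture of the idea.

```r
# Hypothetical variable name and labels; the notebook's actual code may differ.
# Exactly one of these lines should be uncommented at a time.
text_name <- "dracula"
# text_name <- "jane_eyre"
# text_name <- "wuthering_heights"
# text_name <- "great_gatsby"
# text_name <- "study_in_scarlet"

# Confirm which text is active
status_message <- paste("The text is currently set to", toupper(text_name))
print(status_message)
```

To switch to Jane Eyre in this sketch, you would add a # in front of the first assignment and remove the # from the second.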

Once you’ve chosen your text and read it in, we can generate bigrams and trigrams for it. You can just run the following (long) block of code, and bigram and trigram tables will be written to the relevant directory in the homework (this will take a second or two).

If you want to look at a different text, you’ll need to re-run this block of code to generate the n-grams again (even if you’ve previously made n-grams for that novel).
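If you are curious what n-gram generation involves, here is a minimal base-R sketch on a toy string. It is not the notebook's actual code: the function names `tokenise` and `make_ngrams` are made up for illustration, and the sentence splitting is deliberately crude. It builds overlapping n-grams within each sentence and never across a sentence boundary.

```r
# Toy text; the real notebook reads in a whole novel.
toy_text <- "I am Dracula. Welcome to my house. Come freely."

# Crude sentence split on ., ! or ? (an assumption; real splitting is messier)
sentences <- unlist(strsplit(toy_text, "[.!?]+\\s*"))

# Lowercase a sentence and split it into words (hypothetical helper)
tokenise <- function(sentence) {
  words <- unlist(strsplit(tolower(sentence), "[^a-z']+"))
  words[words != ""]
}

# Build overlapping n-grams from one sentence's words (hypothetical helper)
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- lapply(sentences, tokenise)
bigrams <- unlist(lapply(tokens, make_ngrams, n = 2))
trigrams <- unlist(lapply(tokens, make_ngrams, n = 3))

# Count how often each bigram occurs
bigram_table <- as.data.frame(table(ngram = bigrams),
                              responseName = "count",
                              stringsAsFactors = FALSE)
print(bigram_table)
```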

Question 1 (25 points):

Head to the bigram and trigram tables and look for something interesting. Filter and sort the tables to see if you can find patterns. What kinds of bigrams and trigrams do different character names appear in? Do words from the title of the novel appear in the text? If you haven’t read the novel, cross-reference against Wikipedia. See if you can find a bigram or trigram that you think might correlate with one of the novel’s plot points (e.g. “marry me”).
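If you would rather filter and sort in R than in a spreadsheet, something like the following sketch would work. The table here is a stand-in with invented counts (not taken from any novel), and the column names `ngram` and `count` are assumptions; check the header row of your csv and adjust.

```r
# Stand-in table; counts are invented for illustration only.
# In practice you would load your csv with read.csv().
bigram_table <- data.frame(
  ngram = c("of the", "marry me", "the count", "count dracula", "in the"),
  count = c(50, 3, 20, 12, 40),
  stringsAsFactors = FALSE
)

# Keep only bigrams that contain the whole word "count"
has_word <- grepl("\\bcount\\b", bigram_table$ngram, perl = TRUE)
filtered <- bigram_table[has_word, ]

# Sort from most to least frequent
sorted <- filtered[order(-filtered$count), ]
print(sorted)
```

Swapping "count" for a character name gives you every bigram that name appears in, ordered by frequency.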

Choose five n-grams, give their counts from the novel, and write a sentence or two for each explaining why you think it might be interesting. You must choose at least one trigram and at least one bigram. If you’d like, you can group (some of) these together. For example, you could write something like this:

"darcy said" (count) and "elizabeth said" (count): I chose these two bigrams because I was interested in the fact that the count for (X) is much higher than for (Y); I don’t think this is reflected in the dialogue, so it must be that one character’s dialogue is introduced in a way that is very different from the other character’s.

ngram 1:

ngram 2:

ngram 3:

ngram 4:

ngram 5:

Question 2 (15 points):

For this question, we’re going to look at the dispersion of your ngrams through the novel. You’ll need to copy your ngrams into the code below for this to work. Make sure you match their formatting exactly; it may be easiest to directly copy-paste them from the csv. Currently the code is set up to work with the most common ngrams; yours should be somewhat more interesting.

For each of your ngrams, write a few sentences about its distribution through the text. If it’s pretty even, explain why that makes sense (or is surprising). For something more concentrated or rarer, can you connect the dispersion to (what you know of) the plot, or use the dispersion to infer anything about the novel’s plot?
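To make the idea of dispersion concrete: a dispersion plot marks every position in the text where an n-gram starts. The base-R sketch below does this for a toy word sequence. The helper `find_ngram` is hypothetical and is not the notebook's plotting code; for simplicity it also ignores sentence boundaries.

```r
# Toy word sequence standing in for a whole novel
words <- c("the", "count", "said", "come", "in", "and", "the", "count",
           "smiled", "as", "the", "door", "closed")

# Return the word positions at which an n-gram starts (hypothetical helper)
find_ngram <- function(words, ngram) {
  parts <- strsplit(ngram, " ", fixed = TRUE)[[1]]
  n <- length(parts)
  if (length(words) < n) return(integer(0))
  starts <- seq_len(length(words) - n + 1)
  hits <- vapply(starts,
                 function(i) all(words[i:(i + n - 1)] == parts),
                 logical(1))
  starts[hits]
}

positions <- find_ngram(words, "the count")

# One tick per occurrence along the length of the text
plot(positions, rep(1, length(positions)),
     pch = "|", xlim = c(1, length(words)), yaxt = "n",
     xlab = "Position in text (word index)", ylab = "",
     main = "Dispersion of \"the count\"")
```

Ticks bunched together point to a concentrated n-gram; ticks spread along the whole axis point to an even one.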

Question 3 (10 points):

The n-grams for this assignment were produced by shingling (so each word is a part of multiple bi/trigrams) and by respecting sentence boundaries (so there are no n-grams that include words from more than one sentence). As a result of this process, there are significantly more bigrams than trigrams (for Dracula, 152 000 bigrams versus 144 000 trigrams). Write a few sentences explaining why. (If you’re stuck on this, write a short paragraph with multiple short sentences and work out the bigrams and trigrams for it.)
